Concedo
7590a0ea39
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .github/workflows/build.yml
# ggml/CMakeLists.txt
# ggml/cmake/ggml-config.cmake.in
# ggml/src/CMakeLists.txt
# models/templates/README.md
# tools/imatrix/imatrix.cpp
2025-08-05 19:24:29 +08:00
compilade
19f68fa5a4
imatrix : warn when GGUF imatrix is saved without .gguf suffix ( #15076 )
...
* imatrix : add warning when suffix is not .gguf for GGUF imatrix
* imatrix : only warn about suffix when output format is unspecified
2025-08-04 23:26:52 +02:00
Sigbjørn Skjæret
2721257e3e
quantize : fix confusing error message if ftype is invalid ( #15071 )
2025-08-04 18:11:02 +02:00
Concedo
8bd0a560f0
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# ggml/src/ggml-opencl/ggml-opencl.cpp
# ggml/src/ggml-sycl/ggml-sycl.cpp
# requirements/requirements-convert_hf_to_gguf_update.txt
# scripts/compare-llama-bench.py
# tests/test-backend-ops.cpp
# tests/test-chat.cpp
# tools/imatrix/README.md
# tools/imatrix/imatrix.cpp
# tools/llama-bench/llama-bench.cpp
2025-08-04 22:42:02 +08:00
compilade
d31192b4ee
imatrix : use GGUF by default ( #14842 )
...
* imatrix : use GGUF by default
* imatrix : use GGUF regardless of the output filename
The legacy format can only be produced with --output-format dat
2025-08-03 22:00:05 +02:00
compilade
0a2f5496be
imatrix : fix 3d activation handling for hybrid and recurrent models ( #14994 )
...
* imatrix : use a single count for dense 3d tensors
* imatrix : fix 3d activations when model tensor is 2d
* imatrix : fix 3d tensor counts
2025-08-03 21:49:13 +02:00
R0CKSTAR
3025b621d1
llama-bench: rename DB table name from test to llama_bench ( #15003 )
...
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2025-08-02 17:20:40 +08:00
Johannes Gäßler
f906275537
server: enable token array inputs for OAI API ( #15001 )
2025-08-02 10:12:41 +02:00
Concedo
f430916a71
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# docs/backend/CANN.md
# docs/multimodal/minicpmo2.6.md
# docs/multimodal/minicpmv2.5.md
# docs/multimodal/minicpmv2.6.md
# examples/speculative-simple/speculative-simple.cpp
# ggml/cmake/ggml-config.cmake.in
# ggml/src/ggml-cann/aclnn_ops.cpp
# ggml/src/ggml-cann/ggml-cann.cpp
# ggml/src/ggml-cpu/repack.cpp
# ggml/src/ggml-opencl/CMakeLists.txt
# ggml/src/ggml-opencl/ggml-opencl.cpp
# ggml/src/ggml-opencl/kernels/add.cl
# ggml/src/ggml-opencl/kernels/mul.cl
# scripts/compare-commits.sh
# scripts/compare-llama-bench.py
# scripts/sync-ggml.last
# tools/server/README.md
2025-08-02 10:25:10 +08:00
tc-mb
952a47f455
mtmd : support MiniCPM-V 4.0 ( #14983 )
...
* support minicpm-v 4
* add md
* support MiniCPM-o 4.0
* add default location
* temp rm MiniCPM-o 4.0
* fix code
* fix "minicpmv_projector" default path
2025-07-31 17:22:17 +02:00
g2mt
94933c8c2e
server : implement universal assisted decoding ( #12635 )
...
* llama-server : implement universal assisted decoding
* Erase prompt tail for kv-cache
* set vocab_dft_compatible in common_speculative
* rename ctx_main to ctx_tgt
* move vocab_dft_compatible to spec struct
* clear mem_dft, remove mem
* detokenize id_last for incompatible models
* update comment
* add --spec-replace flag
* accept special tokens when translating between draft/main models
* Escape spec-replace
* clamp draft result to size to params.n_draft
* fix comment
* clean up code
* restore old example
* log common_speculative_are_compatible in speculative example
* fix
* Update common/speculative.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update common/speculative.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update common/speculative.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-07-31 14:25:23 +02:00
Lukas Straub
a9f77a8be3
server : add openai-style logit_bias support ( #14946 )
...
Signed-off-by: Lukas Straub <lukasstraub2@web.de>
2025-07-31 14:08:23 +02:00
Ed Addario
e9192bec56
quantize : fix using combined imatrix GGUFs (multiple datasets) ( #14973 )
2025-07-30 21:11:56 +02:00
Daniel Bevenius
41e78c567e
server : add support for embd_normalize parameter ( #14964 )
...
This commit adds support for the `embd_normalize` parameter in the
server code.
The motivation for this is that currently if the server is started with
a pooling type that is not `none`, then Euclidean/L2 normalization will
be the normalization method used for embeddings. However, this is not
always the desired behavior, and users may want to use other
normalization (or none) and this commit allows that.
Example usage:
```console
curl --request POST \
--url http://localhost:8080/embedding \
--header "Content-Type: application/json" \
--data '{"input": "Hello world today", "embd_normalize": -1}
```
2025-07-30 18:07:11 +02:00
Concedo
3284757b56
voxstral mini is really bad
2025-07-29 21:22:17 +08:00
Radoslav Gerganov
c556418b60
llama-bench : use local GPUs along with RPC servers ( #14917 )
...
Currently if RPC servers are specified with '--rpc' and there is a local
GPU available (e.g. CUDA), the benchmark will be performed only on the
RPC device(s) but the backend result column will say "CUDA,RPC" which is
incorrect. This patch is adding all local GPU devices and makes
llama-bench consistent with llama-cli.
2025-07-28 18:59:04 +03:00
Concedo
b8425f5a9c
merge but voxtral not working
2025-07-28 22:08:05 +08:00
Xuan-Son Nguyen
00fa15fedc
mtmd : add support for Voxtral ( #14862 )
...
* mtmd : add support for Voxtral
* clean up
* fix python requirements
* add [BEGIN_AUDIO] token
* also support Devstral conversion
* add docs and tests
* fix regression for ultravox
* minor coding style improvement
* correct project activation fn
* Apply suggestions from code review
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2025-07-28 15:01:48 +02:00
Ed Addario
7f97599581
quantize : update README.md ( #14905 )
...
* Update README.md
* Fix trailing whitespace
* Update README.md
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2025-07-27 23:31:11 +02:00
Concedo
21b7d0a899
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .devops/rocm.Dockerfile
# docs/build-s390x.md
# docs/development/HOWTO-add-model.md
# docs/ops.md
# docs/ops/CPU.csv
# docs/ops/CUDA.csv
# ggml/CMakeLists.txt
# ggml/src/ggml-cann/acl_tensor.cpp
# ggml/src/ggml-cann/aclnn_ops.cpp
# ggml/src/ggml-cann/aclnn_ops.h
# ggml/src/ggml-cann/ggml-cann.cpp
# ggml/src/ggml-cpu/CMakeLists.txt
# ggml/src/ggml-opencl/ggml-opencl.cpp
# ggml/src/ggml-opencl/kernels/rms_norm.cl
# scripts/create_ops_docs.py
# tests/test-backend-ops.cpp
# tools/export-lora/export-lora.cpp
2025-07-27 17:10:53 +08:00
Concedo
0d72c794fa
Merge commit ' c8ade30036' into concedo_experimental
...
# Conflicts:
# ggml/src/ggml-cuda/CMakeLists.txt
# ggml/src/ggml-opencl/CMakeLists.txt
# ggml/src/ggml-opencl/ggml-opencl.cpp
# ggml/src/ggml-opencl/kernels/im2col_f16.cl
# ggml/src/ggml-opencl/kernels/im2col_f32.cl
# ggml/src/ggml-sycl/im2col.cpp
# tools/mtmd/clip.cpp
2025-07-25 19:42:45 +08:00
kiwi
749e0d27f0
mtmd : fix 32-bit narrowing issue in export-lora and mtmd clip ( #14503 )
...
* [fix] Fix 32-bit narrowing issue in export-lora and mtmd clip
* Update export-lora.cpp
* Update clip.cpp
* Update export-lora.cpp
* format: use space to replace tab
2025-07-25 13:08:04 +02:00
Ed Addario
d1aa0cc5d1
imatrix: add option to display importance score statistics for a given imatrix file ( #12718 )
...
* Add --show-statistics option
* Add --show-statistics logic
* Add tensor name parsing
* Tidy output format
* Fix typo in title
* Improve tensor influence ranking
* Add better statistics
* Change statistics' sort order
* Add Cosine Similarity
* Add header search path
* Change header search path to private
* Add weighted statistics per layer
* Update report title
* Refactor compute_statistics out of main
* Refactor compute_cossim out of load_imatrix
* Refactor compute_statistics out of load_imatrix
* Move imatrix statistics calculation into its own functions
* Add checks and validations
* Remove unnecessary include directory
* Rename labels
* Add m_stats getter and refactor compute_statistics out of load_imatrix
* Refactor variable names
* Minor cosmetic change
* Retrigger checks (empty commit)
* Rerun checks (empty commit)
* Fix unnecessary type promotion
Co-authored-by: compilade <git@compilade.net>
* Reverting change to improve code readability
* Rerun checks (empty commit)
* Rerun checks (empty commit)
* Rerun checks - third time's the Charm 🤞 (empty commit)
* Minor cosmetic change
* Update README
* Fix typo
* Update README
* Rerun checks (empty commit)
* Re-implement changes on top of #9400
* Update README.md
* Update README
* Update README.md
Co-authored-by: compilade <git@compilade.net>
* Update README.md
Co-authored-by: compilade <git@compilade.net>
* Update README.md
* Remove duplicate option in print_usage()
* Update README.md
* Update README.md
Co-authored-by: compilade <git@compilade.net>
* Update README.md
Co-authored-by: compilade <git@compilade.net>
* Remove input check
* Remove commented out code
---------
Co-authored-by: compilade <git@compilade.net>
2025-07-22 14:33:37 +02:00
stduhpf
c8ade30036
Mtmd: add a way to select device for vision encoder ( #14236 )
...
* Mtmd: add a way to select device for vision encoder
* simplify
* format
* Warn user if manual device selection failed
* initialize backend to nullptr
2025-07-22 12:51:03 +02:00
Molly Sophia
adef81781a
server : allow setting --reverse-prompt arg ( #14799 )
...
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
2025-07-22 09:24:22 +08:00
Concedo
4abea4b5c9
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# README.md
# docs/build.md
# ggml/src/ggml-cpu/CMakeLists.txt
# ggml/src/ggml-cpu/kleidiai/kernels.cpp
# ggml/src/ggml-cpu/kleidiai/kernels.h
# ggml/src/ggml-cpu/kleidiai/kleidiai.cpp
# tests/test-backend-ops.cpp
# tools/server/README.md
2025-07-21 23:37:42 +08:00
Molly Sophia
c82d48ec23
llama : fix --reverse-prompt crashing issue ( #14794 )
...
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
2025-07-21 17:38:36 +08:00
IsaacDynamo
b4efd77f8a
server : add parse_special option to /tokenize endpoint ( #14783 )
2025-07-21 10:24:51 +03:00
Concedo
30675b0798
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# CODEOWNERS
# docs/build.md
# scripts/sync-ggml.last
# tests/test-backend-ops.cpp
# tools/imatrix/README.md
# tools/imatrix/imatrix.cpp
2025-07-20 22:47:31 +08:00
compilade
90083283ec
imatrix : use GGUF to store importance matrices ( #9400 )
...
* imatrix : allow processing multiple chunks per batch
* perplexity : simplify filling the batch
* imatrix : fix segfault when using a single chunk per batch
* imatrix : use GGUF to store imatrix data
* imatrix : fix conversion problems
* imatrix : use FMA and sort tensor names
* py : add requirements for legacy imatrix convert script
* perplexity : revert changes
* py : include imatrix converter requirements in toplevel requirements
* imatrix : avoid using designated initializers in C++
* imatrix : remove unused n_entries
* imatrix : allow loading mis-ordered tensors
Sums and counts tensors no longer need to be consecutive.
* imatrix : more sanity checks when loading multiple imatrix files
* imatrix : use ggml_format_name instead of std::string concatenation
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
* quantize : use unused imatrix chunk_size with LLAMA_TRACE
* common : use GGUF for imatrix output by default
* imatrix : two-way conversion between old format and GGUF
* convert : remove imatrix to gguf python script
* imatrix : use the function name in more error messages
* imatrix : don't use FMA explicitly
This should make comparisons between the formats easier
because this matches the behavior of the previous version.
* imatrix : avoid returning from void function save_imatrix
* imatrix : support 3d tensors with MUL_MAT
* quantize : fix dataset name loading from gguf imatrix
* common : move string_remove_suffix from quantize and imatrix
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* imatrix : add warning when legacy format is written
* imatrix : warn when writing partial data, to help guess dataset coverage
Also make the legacy format store partial data
by using neutral values for missing data.
This matches what is done at read-time for the new format,
and so should get the same quality in case the old format is still used.
* imatrix : avoid loading model to convert or combine imatrix
* imatrix : avoid using imatrix.dat in README
---------
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2025-07-19 12:51:22 -04:00
Concedo
bdff33e0de
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .github/workflows/build.yml
# README.md
# ci/run.sh
# docs/build.md
# examples/CMakeLists.txt
# examples/parallel/parallel.cpp
# ggml/CMakeLists.txt
# ggml/src/CMakeLists.txt
# scripts/server-bench.py
# src/llama-kv-cache-unified.cpp
# tests/test-backend-ops.cpp
# tools/batched-bench/batched-bench.cpp
# tools/server/README.md
2025-07-17 00:28:37 +08:00
Georgi Gerganov
225e7a1438
llama : add high-throughput mode ( #14363 )
...
* kv-cache : prepare K/V buffers for separation
ggml-ci
* batched-bench : fix oob write
ggml-ci
* llama : add "virtual sequences"
ggml-ci
* llama : use "stream" vs "virtual sequence"
ggml-ci
* graph : fix stream splitting when KV cache is not used
ggml-ci
* kv-cache : add multi-stream save/load support
ggml-ci
* llama : add "--attn-streams" flag
ggml-ci
* kv-cache : fix handling when find_slot fails
ggml-ci
* kv-cache : restore find_slot impl
ggml-ci
* kv-cache : add comments
* kv-cache : add bounds checks for sequence id
ggml-ci
* cont : add n_seq_max to batch allocr
ggml-ci
* kv-cache : perform stream copies lazily after llama_synchronize
ggml-ci
* kv-cache : avoid throwing exceptions across the C boundary
ggml-ci
* CUDA: 4D FlashAttention support (#14628 )
* CUDA: 4D FlashAttention support
* CUDA: fix WMMA FA kernel
* llama : rename attn_streams -> kv_unified
ggml-ci
* common : rename kv_split -> kv_unified
ggml-ci
---------
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2025-07-16 16:35:42 +03:00
Georgi Gerganov
6ffd4e9c44
server : pre-calculate EOG logit biases ( #14721 )
...
ggml-ci
2025-07-16 14:04:12 +03:00
Georgi Gerganov
538cc77f7f
server : fix handling of the ignore_eos flag ( #14710 )
...
ggml-ci
2025-07-16 12:13:57 +03:00
Johannes Gäßler
5cae766541
scripts: synthetic prompt mode for server-bench.py ( #14695 )
2025-07-16 09:33:28 +02:00
Concedo
ce7aa0d5c0
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# ggml/src/ggml-sycl/ggml-sycl.cpp
# requirements/requirements-all.txt
2025-07-15 23:59:53 +08:00
Johannes Gäßler
494c5899cb
scripts: benchmark for HTTP server throughput ( #14668 )
...
* scripts: benchmark for HTTP server throughput
* fix server connection reset
2025-07-14 13:14:30 +02:00
Concedo
8cebec5128
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# CMakePresets.json
# README.md
# common/CMakeLists.txt
# ggml/src/ggml-cann/ggml-cann.cpp
# ggml/src/ggml-opencl/CMakeLists.txt
# ggml/src/ggml-opencl/ggml-opencl.cpp
# ggml/src/ggml-sycl/ggml-sycl.cpp
# scripts/sync-ggml.last
# tests/test-backend-ops.cpp
# tools/run/CMakeLists.txt
2025-07-13 23:39:41 +08:00
Concedo
dca49de059
fixed qwen2 audio issues, works fine now (+3 squashed commit)
...
Squashed commit:
[b3053a1ba] updated lite
[5071630d6] fixed mtmd issues, audio works
[06efa5af4] fix mtmd compile
2025-07-12 18:54:41 +08:00
Concedo
e9473305d0
wip2 (+1 squashed commits)
...
Squashed commits:
[4628777b6] wip
2025-07-12 18:54:40 +08:00
Douglas Hanley
0c1df14b5f
server : fix pooled embedding output ( #14645 )
2025-07-12 13:21:02 +03:00
Concedo
8cd72ea924
fix for clip first so that it loads qwen omni (expects dual backends)
2025-07-10 22:53:38 +08:00
Eric Zhang
a457551332
cmake : do not search for curl libraries by ourselves ( #14613 )
...
* cmake : do not search for curl libraries by ourselves
* run : do not search for curl libraries by ourselves
2025-07-10 15:29:05 +03:00
Concedo
b8c1fc7c9e
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .github/workflows/build.yml
# docs/development/HOWTO-add-model.md
# ggml/src/ggml-sycl/rope.cpp
# tests/test-backend-ops.cpp
2025-07-09 19:25:28 +08:00
Alawode Oluwandabira
17a1f0d2d4
server: Add ability to mount server at prefix ( #14544 )
...
* Add server_prefix
* Correct server path env
* Rename cli flag to --api-prefix
* Change all to api_prefix
2025-07-08 11:47:33 +03:00
Concedo
a17c79b1a9
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# examples/eval-callback/eval-callback.cpp
# ggml/src/ggml-opencl/ggml-opencl.cpp
# ggml/src/ggml-opencl/kernels/gelu.cl
# tests/test-backend-ops.cpp
2025-07-07 17:46:58 +08:00
Sigbjørn Skjæret
ddef99522d
server : fix assistant prefilling when content is an array ( #14360 )
2025-07-05 09:17:14 +02:00
Concedo
57ce374240
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .github/ISSUE_TEMPLATE/010-bug-compilation.yml
# .github/ISSUE_TEMPLATE/011-bug-results.yml
# .github/labeler.yml
# .github/workflows/build.yml
# .github/workflows/release.yml
# .gitmodules
# CMakeLists.txt
# ggml/CMakeLists.txt
# ggml/src/CMakeLists.txt
# ggml/src/ggml-cann/aclnn_ops.cpp
# ggml/src/ggml-cann/ggml-cann.cpp
# ggml/src/ggml-opencl/ggml-opencl.cpp
# ggml/src/ggml-opencl/kernels/softmax_4_f16.cl
# ggml/src/ggml-opencl/kernels/softmax_4_f32.cl
# ggml/src/ggml-opencl/kernels/softmax_f16.cl
# ggml/src/ggml-opencl/kernels/softmax_f32.cl
# ggml/src/ggml-sycl/element_wise.cpp
# ggml/src/ggml-sycl/element_wise.hpp
# ggml/src/ggml-sycl/ggml-sycl.cpp
# scripts/sync-ggml-am.sh
# scripts/sync-ggml.last
# scripts/sync-ggml.sh
# tests/test-backend-ops.cpp
# tests/test-c.c
2025-07-05 12:16:28 +08:00
Sigbjørn Skjæret
28657a8229
ggml : implement GEGLU_ERF and GEGLU_QUICK ops ( #14445 )
2025-07-03 23:07:22 +02:00
Concedo
cdda9d16e0
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .devops/tools.sh
# build-xcframework.sh
# ci/run.sh
# examples/Miku.sh
# examples/chat-13B.sh
# examples/chat-persistent.sh
# examples/chat-vicuna.sh
# examples/chat.sh
# examples/jeopardy/jeopardy.sh
# examples/reason-act.sh
# examples/server-llama2-13B.sh
# examples/sycl/build.sh
# examples/sycl/run-llama2.sh
# examples/sycl/run-llama3.sh
# examples/ts-type-to-grammar.sh
# ggml/src/ggml-cpu/CMakeLists.txt
# ggml/src/ggml-sycl/element_wise.cpp
# ggml/src/ggml-sycl/element_wise.hpp
# ggml/src/ggml-sycl/ggml-sycl.cpp
# scripts/apple/validate-apps.sh
# scripts/apple/validate-ios.sh
# scripts/apple/validate-macos.sh
# scripts/apple/validate-tvos.sh
# scripts/apple/validate-visionos.sh
# scripts/check-requirements.sh
# scripts/ci-run.sh
# scripts/compare-commits.sh
# scripts/debug-test.sh
# scripts/gen-authors.sh
# scripts/get-hellaswag.sh
# scripts/get-pg.sh
# scripts/get-wikitext-103.sh
# scripts/get-wikitext-2.sh
# scripts/get-winogrande.sh
# scripts/hf.sh
# scripts/qnt-all.sh
# scripts/run-all-perf.sh
# scripts/run-all-ppl.sh
# scripts/sync-ggml-am.sh
# scripts/sync-ggml.sh
# scripts/tool_bench.sh
# tests/test-backend-ops.cpp
# tests/test-lora-conversion-inference.sh
# tests/test-tokenizer-0.sh
# tools/server/README.md
2025-06-30 20:38:44 +08:00