Commit graph

11505 commits

Author SHA1 Message Date
Concedo
7b393fa487 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.github/workflows/build.yml
#	AUTHORS
#	ci/run.sh
#	docs/backend/SYCL.md
#	docs/build.md
#	docs/multimodal/minicpmo2.6.md
#	docs/multimodal/minicpmo4.0.md
#	docs/multimodal/minicpmv2.5.md
#	docs/multimodal/minicpmv2.6.md
#	docs/multimodal/minicpmv4.0.md
#	docs/multimodal/minicpmv4.5.md
#	docs/ops.md
#	docs/ops/SYCL.csv
#	docs/speculative.md
#	examples/deprecation-warning/README.md
#	examples/deprecation-warning/deprecation-warning.cpp
#	examples/model-conversion/Makefile
#	examples/model-conversion/scripts/causal/convert-model.sh
#	ggml/include/ggml-cann.h
#	ggml/src/ggml-cann/acl_tensor.cpp
#	ggml/src/ggml-cann/acl_tensor.h
#	ggml/src/ggml-cann/aclnn_ops.cpp
#	ggml/src/ggml-cann/aclnn_ops.h
#	ggml/src/ggml-cann/common.h
#	ggml/src/ggml-cann/ggml-cann.cpp
#	ggml/src/ggml-metal/CMakeLists.txt
#	ggml/src/ggml-opencl/ggml-opencl.cpp
#	ggml/src/ggml-opencl/kernels/concat.cl
#	ggml/src/ggml-opencl/kernels/repeat.cl
#	ggml/src/ggml-opencl/kernels/scale.cl
#	ggml/src/ggml-opencl/kernels/tanh.cl
#	ggml/src/ggml-sycl/CMakeLists.txt
#	ggml/src/ggml-sycl/dpct/helper.hpp
#	ggml/src/ggml-sycl/ggml-sycl.cpp
#	ggml/src/ggml-sycl/outprod.cpp
#	ggml/src/ggml-sycl/rope.cpp
#	ggml/src/ggml-sycl/wkv.cpp
#	src/llama-vocab.cpp
#	tests/test-autorelease.cpp
#	tests/test-backend-ops.cpp
#	tools/cvector-generator/pca.hpp
#	tools/export-lora/export-lora.cpp
#	tools/perplexity/README.md
2026-02-03 19:00:42 +08:00
George
e9a859db3c
ggml: added cleanups in ggml_quantize_free (#19278)
Add missing cleanup calls for IQ2_S, IQ1_M quantization types and IQ3XS with 512 blocks during quantization cleanup.
2026-02-03 08:43:39 +02:00
Gaurav Garg
41e3f02647
cuda : revert CUDA_SCALE_LAUNCH_QUEUES override until investigated (#19227)
Hangs were reported on Jetson Orin AGX when CUDA_SCALE_LAUNCH_QUEUES=4x is set. This reverts the previous PR (#19042) and updates the documentation to suggest setting CUDA_SCALE_LAUNCH_QUEUES=4x for faster throughput on multi-GPU systems.
2026-02-03 08:41:02 +02:00
Alexey Dubrov
1efb5f7ae1
vocab: add Falcon-H1-Tiny-Coder FIM tokens (#19249) 2026-02-03 08:31:01 +02:00
Georgi Gerganov
aeb827a3cc
spec : simplify time measurement using common_time_meas (#19262) 2026-02-03 08:20:15 +02:00
lhez
91ea44e89b
opencl: refactor some ops, concat, repeat, tanh and scale (#19226)
* opencl: refactor concat

* opencl: refactor repeat

* opencl: refactor tanh

* opencl: enable fp16 for tanh

* opencl: refactor scale

* opencl: fix unused variables
2026-02-02 15:54:43 -08:00
Sid Mohan
0dfcd3b607
jinja : add missing 'in' test to template engine (#19004) (#19239)
* jinja : add missing 'in' test to template engine (#19004)

The jinja template parser was missing the 'in' test from
global_builtins(), causing templates using reject("in", ...),
select("in", ...), or 'x is in(y)' to fail with
"selectattr: unknown test 'in'".

This broke tool-calling for Qwen3-Coder and any other model
whose chat template uses the 'in' test.

Added test_is_in supporting array, string, and object containment
checks, mirroring the existing 'in' operator logic in runtime.cpp.

Includes test cases for all three containment types plus
reject/select filter usage.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* reuse test_is_in in binary op

---------

Co-authored-by: Sid Mohan <sidmohan0@users.noreply.github.com>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2026-02-02 21:00:55 +01:00
Xuan-Son Nguyen
07a7412a3b
mtmd: add min/max pixels gguf metadata (#19273) 2026-02-02 20:59:06 +01:00
Aman Gupta
9f682fb640
ggml-cpu: FA split across kv for faster TG (#19209)
* ggml-cpu: split across kv for faster TG

* simplify sinks application

* add ref impl
2026-02-03 01:19:55 +08:00
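The commit text does not spell out the algorithm, so the following is only a sketch of one common way to split attention across the KV dimension: each chunk computes partial softmax statistics, and the partials are merged afterwards. All names are hypothetical. The usual 1/sqrt(d) scaling and the "sinks" handling mentioned in the bullets are left out, and Python stands in for the C++ CPU kernel.

```python
import math


def attend_chunk(q, keys, values):
    """Partial attention over one KV chunk.

    Returns (max_score, sum_of_exp, unnormalized_weighted_values).
    """
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    acc = [0.0] * len(values[0])
    for w, v in zip(weights, values):
        for i, vi in enumerate(v):
            acc[i] += w * vi
    return m, sum(weights), acc


def merge_partials(parts):
    """Combine per-chunk partials into the final softmax-weighted output."""
    m_all = max(m for m, _, _ in parts)
    total = 0.0
    out = [0.0] * len(parts[0][2])
    for m, s, acc in parts:
        scale = math.exp(m - m_all)  # rescale each chunk to the global max
        total += s * scale
        for i, a in enumerate(acc):
            out[i] += a * scale
    return [o / total for o in out]


def attention_split_kv(q, keys, values, n_chunks):
    """Single-query attention computed chunk by chunk over the KV cache."""
    size = math.ceil(len(keys) / n_chunks)
    parts = [
        attend_chunk(q, keys[i:i + size], values[i:i + size])
        for i in range(0, len(keys), size)
    ]
    return merge_partials(parts)
```

Because the chunks are independent until the merge, each one could be handed to a separate thread, which is where a token-generation speedup would come from when there is only a single query row to process.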
Matthieu Coudron
a3fa035822
server: print actual model name in 'model not found' error (#19117)
Experimenting with AI, my environment gets messy fast and it's not
always easy to know what model my software is trying to load. This helps
with troubleshooting.

Before:

Error: {
  code = 400,
  message = "model not found",
  type = "invalid_request_error"
}

After:

Error: {
  code = 400,
  message = "model 'toto' not found",
  type = "invalid_request_error"
}
2026-02-02 16:55:27 +01:00
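The shape of the improved error can be illustrated with a small sketch. The helper name `model_not_found_error` is hypothetical and Python is used only for illustration. The fields mirror the "After" output shown in the commit message.

```python
def model_not_found_error(model_name):
    """Build an error payload that names the model that could not be found."""
    return {
        "code": 400,
        "message": f"model '{model_name}' not found",
        "type": "invalid_request_error",
    }
```

Including the requested name in the message lets a client tell at a glance which model it actually asked for.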
Aman Gupta
15818ac44c
ci: add test-backend-ops test for CPU (#19268) 2026-02-02 22:40:28 +08:00
Neo Zhang
bf38346d13
Remove support for Nvidia & AMD GPUs, because the oneAPI plugins for Nvidia & AMD GPUs are unavailable: the download/installation channels no longer work. (#19246)
Users can no longer build the software for Nvidia & AMD GPUs.
Remove oneMath, since it is only used in the Nvidia and AMD code paths.
2026-02-02 21:06:21 +08:00
Tamar
4d5e972673
sycl: implement GGML_OP_TOP_K (#19242) 2026-02-02 21:05:51 +08:00
Georgi Gerganov
6fdddb4987
metal : support virtual devices (#18919)
* metal : support virtual devices

* cont : manage buffer type context memory

* metal : add events

* cont : implement cpy_tensor_async
2026-02-02 14:29:44 +02:00
Daniel Bevenius
6156ae5111
model-conversion : add debug option to conversion script (#19265)
This commit adds a debug option to the model conversion script to enable
using the Python debugger (pdb) during model conversion.

The motivation for this is that I've found myself adding this a few
times now and it would be quicker to have this flag as an option and a
makefile target/recipe for it.
2026-02-02 11:29:57 +01:00
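One way such a debug option could work is to run the conversion script under `python -m pdb` when the flag is set. The sketch below assumes exactly that. The helper name `build_convert_command` and its arguments are hypothetical and are not taken from the actual script or Makefile.

```python
import sys


def build_convert_command(script, script_args, debug=False):
    """Return the argv list used to launch a conversion script.

    With debug=True the script is started under the Python debugger (pdb).
    """
    cmd = [sys.executable]
    if debug:
        cmd += ["-m", "pdb"]  # standard way to run a script under pdb
    cmd.append(script)
    cmd.extend(script_args)
    return cmd
```

The resulting list can be passed to `subprocess.run`, so the debug path differs from the normal one only by the two extra arguments.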
Johannes Gäßler
59377a6c87
ggml-backend: fix async set/get fallback sync (#19179) 2026-02-02 10:00:05 +01:00
Georgi Gerganov
1239267cc4
authors : update (#19263)
[no ci]
2026-02-02 08:51:25 +02:00
Christian Kastner
7a4ca3cbd9
docs : Minor cleanups (#19252)
* Update old URLs to github.com/ggml-org/

* Bump copyrights
2026-02-02 08:38:55 +02:00
Sascha Rogmann
b4d05a3d2f
spec : various improvements to ngram-map + docs (#19253)
* spec: ngram-map and reasoning chats

* spec: add t_begin and t_accept

* ngram-map : add internal hash map

* docs : update ngram-map, add ngram-mod

* docs : fix ngram-map-k

* docs : differences between implementations
2026-02-02 08:26:58 +02:00
Concedo
77f4afe72b Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.devops/nix/nixpkgs-instances.nix
#	docs/backend/snapdragon/CMakeUserPresets.json
#	ggml/CMakeLists.txt
#	ggml/src/ggml-webgpu/ggml-webgpu.cpp
2026-02-02 11:43:58 +08:00
Concedo
68f9c6df91 fix cuda graph spams 2026-02-02 11:28:50 +08:00
Nikhil Jain
2dc3ce2166
Remove pipeline cache mutexes (#19195)
* Remove mutex for pipeline caches, since they are now per-thread.

* Add comment

* Run clang-format

* Cleanup

* Run CI again

* Run CI once more

* Run clang-format
2026-02-01 18:47:29 -08:00
Max Krasnyansky
3bc8d2cf23
Bump cmake max version (needed for Windows on Snapdragon builds) (#19188)
* Bump max cmake version (needed for Windows on Snapdragon builds)

* cmake: move max version setting into ggml/CMakeLists
2026-02-01 14:13:38 -08:00
Alexis Williams
8a98ba4582
nix: fix allowUnfreePredicate for packages with multiple licenses (#19237)
The allowUnfreePredicate in pkgsCuda was wrapping p.meta.license in a
list unconditionally. This fails when meta.license is already a list
of licenses, as it creates a nested list and then tries to access
.free and .shortName on the inner list.

Use lib.toList instead, which correctly handles both cases:
- Single license attrset -> wraps in list
- List of licenses -> returns unchanged
2026-02-01 22:10:48 +02:00
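The difference between unconditional wrapping and `lib.toList` can be modelled in a few lines. This is a Python sketch of the behaviour described above, not Nix code. The predicate logic and the sample license records are assumptions made for illustration.

```python
def to_list(x):
    """Model of lib.toList: wrap a single value, leave a list unchanged."""
    return x if isinstance(x, list) else [x]


def allow_unfree(license_meta, allowed_short_names):
    """Hypothetical predicate: every license is free or explicitly allowed."""
    return all(
        lic["free"] or lic["shortName"] in allowed_short_names
        for lic in to_list(license_meta)
    )


def allow_unfree_buggy(license_meta, allowed_short_names):
    """The old pattern: always wrapping creates a nested list for multi-license packages."""
    return all(
        lic["free"] or lic["shortName"] in allowed_short_names
        for lic in [license_meta]  # nested list when license_meta is already a list
    )
```

With a list of licenses, the buggy variant iterates over a single element that is itself a list, so the attribute access fails, which matches the failure mode the commit describes.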
Concedo
ddce19db72 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.devops/nix/package-gguf-py.nix
#	.devops/nix/scope.nix
#	common/CMakeLists.txt
#	docs/backend/SYCL.md
#	examples/lookahead/lookahead.cpp
#	examples/lookup/lookup.cpp
#	examples/sycl/run-llama2.sh
#	examples/sycl/win-run-llama2.bat
#	examples/sycl/win-test.bat
#	ggml/src/ggml-hexagon/CMakeLists.txt
#	ggml/src/ggml-hexagon/htp/flash-attn-ops.c
#	ggml/src/ggml-hexagon/htp/hvx-dump.h
#	ggml/src/ggml-hexagon/htp/hvx-reduce.h
#	ggml/src/ggml-hexagon/htp/matmul-ops.c
#	ggml/src/ggml-hexagon/htp/softmax-ops.c
#	ggml/src/ggml-hexagon/htp/unary-ops.c
#	ggml/src/ggml-opencl/CMakeLists.txt
#	ggml/src/ggml-opencl/ggml-opencl.cpp
#	ggml/src/ggml-opencl/kernels/cvt.cl
#	scripts/sync-ggml.last
2026-02-01 22:35:25 +08:00
Concedo
76b22a7b23 updated lite 2026-02-01 22:16:13 +08:00
Concedo
a5ae116033 increase z-image default clamp to 4.0, to tolerate z-image base requirement for higher cfg 2026-02-01 22:02:20 +08:00
Concedo
b13bf44285 kde fractional scaling fix, tooltip fix (+1 squashed commits)
Squashed commits:

[1cf02dcce] kde fractional scaling fix
2026-02-01 21:55:44 +08:00
Neo Zhang
2634ed207a
create test.sh to enhance the parameters for testing, update the guide, rm useless script (#19243) 2026-02-01 18:24:00 +08:00
Concedo
9ef5d34740 fix mcp cert issues 2026-02-01 16:48:37 +08:00
Matthieu Coudron
41ea26144e
nix: fix nix develop .#python-scripts (#19218)
Without this I get:

> * Getting build dependencies for wheel...
> * Building wheel...
> Successfully built gguf-0.17.1-py3-none-any.whl
> Finished creating a wheel...
> Finished executing pypaBuildPhase
> Running phase: pythonRuntimeDepsCheckHook
> Executing pythonRuntimeDepsCheck
> Checking runtime dependencies for gguf-0.17.1-py3-none-any.whl
>   - requests not installed
For full logs, run:
    nix log /nix/store/x0c4a251l68bvdgang9d8v2fsmqay8a4-python3.12-gguf-0.0.0.drv

I also changed the style a bit to make it more terse, which is more elegant
in my opinion.
2026-01-31 18:01:46 +02:00
Concedo
ffdc1b0f9f flux2 image editing 2026-01-31 16:36:45 +08:00
nullname
89f10baad5
ggml-hexagon: flash-attention and reduce-sum optimizations (#19141)
* wip

* ggml-hexagon: add vectorized dot product function for FP32 and FP16 accumulation

* ggml-hexagon: optimize dot product functions for FP16 and FP32 with new vectorized implementations

* wip

* ggml-hexagon: optimize hvx_vec_dump_f32_n and hvx_vec_reduce_sum_qf32x2 functions for improved performance

* ggml-hexagon: refactor dot product functions to use a common loading function for improved readability

* optimize vector dot product functions to use unified reduction for improved performance

* hexagon: optimize reduce-sum for v75+

* hexagon: always keep row_sums in sf/fp32

* ggml-hexagon: enhance directory checks for HEXAGON_SDK_ROOT and HEXAGON_TOOLS_ROOT

* fix compiling error after rebase

---------

Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>
2026-01-30 21:14:20 -08:00
Concedo
71069253b7 update sdui 2026-01-31 12:48:57 +08:00
EugeoSynthesisThirtyTwo
3dd95914d0
quantize: add option --tensor-type-file to llama-quantize (#18572)
* add option --tensor-type-file to llama-quantize, but it raises an error.

* add error message when file not found

* quantize: update help menu, fix CI

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

---------

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
Co-authored-by: Your Name <you@example.com>
Co-authored-by: Aaron Teo <aaron.teo1@ibm.com>
2026-01-31 11:39:21 +08:00
tc-mb
ec6c7421e4
mtmd: support MiniCPM-o 4.5 (vision only) (#19211)
Signed-off-by: tc-mb <caitianchi@modelbest.cn>
2026-01-30 23:19:30 +01:00
Daniele Pinna
1488339138
lookup, lookahead: fix crash when n_ctx not specified (#18729)
* lookup, lookahead: fix crash when n_ctx not specified

Since PR #16653 (Dec 15, 2025), the default n_ctx is 0 to enable automatic
GPU memory fitting. This causes llama-lookup and llama-lookahead to crash
when run without explicit -c flag:

    GGML_ASSERT(batch.seq_id[batch.n_tokens] && "llama_batch size exceeded")

Root cause: Both examples use params.n_ctx directly for batch initialization,
but params.n_ctx remains 0 even after the context is properly initialized
to n_ctx_train internally.

Bug history:
- Nov 2023: lookahead.cpp created (PR #4207) with params.n_ctx pattern
- Dec 2023: lookup.cpp created (PR #4484) with same pattern
- Nov 2024: default n_ctx changed to 4096 (PR #10136) - bug dormant
- Dec 2025: default n_ctx changed to 0 (PR #16653) - bug activated

The bug was dormant for 2+ years because params.n_ctx defaulted to 512,
then 4096. PR #16653 changed it to 0 for GPU auto-fitting, triggering
the crash.

Fix: Use llama_n_ctx(ctx) to get the actual runtime context size, matching
the pattern already used elsewhere in lookup.cpp (line 72) and in
speculative.cpp/speculative-simple.cpp.

Tested: llama-lookup now works without -c flag (12.5% acceptance on
Gemma-3-1B).

Note: llama-lookahead has a separate pre-existing issue with sequence
initialization (n_seq_max=1 vs W+G+1 needed) that is unrelated to this fix.

* lookahead: fix n_seq_max and kv_unified configuration

Lookahead decoding requires:
- W + G + 1 = 31 sequences for parallel Jacobi decoding
- Unified KV cache for coupled sequences in batch splitting

These requirements were broken after PR #14482 changed validation logic.

Consolidates fix from PR #18730 per maintainer request.

Commit message drafted with Claude.
2026-01-30 22:10:24 +02:00
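The failure mode and the fix can be modelled with a toy batch type. This is a Python sketch under stated assumptions: `Batch`, `effective_n_ctx`, and `lookahead_n_seq` are hypothetical stand-ins for the C++ code, the fallback to `n_ctx_train` follows the commit's own description, and W = 15, G = 15 are assumed values chosen only to match the stated total of 31.

```python
class Batch:
    """Toy stand-in for a token batch with a fixed capacity."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.tokens = []

    def add(self, token):
        if len(self.tokens) >= self.capacity:
            # Mirrors the assertion message quoted in the commit.
            raise OverflowError("llama_batch size exceeded")
        self.tokens.append(token)


def effective_n_ctx(params_n_ctx, n_ctx_train):
    """Runtime context size: 0 means 'not specified', so fall back to n_ctx_train."""
    return params_n_ctx if params_n_ctx > 0 else n_ctx_train


def lookahead_n_seq(w, g):
    """Number of sequences needed for lookahead decoding: W + G + 1."""
    return w + g + 1
```

Sizing the batch from the unset parameter gives a capacity of 0, so the first `add` fails. Sizing it from the runtime value, which is the role `llama_n_ctx(ctx)` plays in the fix, avoids the problem.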
Georgi Gerganov
4927795810
ngram-mod : fix build [no ci] (#19216) 2026-01-30 21:27:27 +02:00
shaofeiqi
971facc38e
opencl: add optimized q8_0 mm kernel for adreno (#18871)
* Add Q8_0 OpenCL kernel

Co-authored-by: yunjie <yunjie@qti.qualcomm.com>

* opencl: fix build for non-adreno

* opencl: refactor q8_0

* opencl: enforce subgroup size of 64 for adreno for q8_0

* For A750 and older generations, subgroup size can be 64 or 128.
  This kernel assumes subgroup size 64.

* opencl: suppress warning when adreno kernels are disabled

---------

Co-authored-by: yunjie <yunjie@qti.qualcomm.com>
Co-authored-by: Li He <lih@qti.qualcomm.com>
2026-01-30 10:19:27 -08:00
Georgi Gerganov
d9a2a4bcaa sync : ggml 2026-01-30 20:09:21 +02:00
Georgi Gerganov
dfd6106c84 cuda : fix compile warnings (whisper/0) 2026-01-30 20:09:21 +02:00
Georgi Gerganov
bbada8bfb9
server : wrap around the "id_slot" parameter (#19207)
* server : wrap around the "id_slot" parameter

* cont : minor
2026-01-30 19:46:10 +02:00
Simon Redman
13f3ebfae1
Correctly fetch q8_1 quantize pipeline in test as needed by 8a3519b (#19194) 2026-01-30 17:27:16 +01:00
Georgi Gerganov
dabaa2e77a
spec : add ngram-mod (#19164)
* spec : add ngram-mod

* cont : simplify + keep track of occupancy

* cont : cleanup

* cont : move initialization to common/speculative

* cont : cleanup

* cont : cleanup

* cont : fix
2026-01-30 18:21:48 +02:00
Marcello Seri
2e916f996a
jinja : add unordered_map include to value.h [no ci] (#19205)
On macOS Sequoia 15.7.3, x86_64, the build has recently started failing with
```
In file included from .../code/cpp/llama.cpp/common/jinja/string.cpp:2:
.../code/cpp/llama.cpp/common/./jinja/value.h:478:10: error: no template named 'unordered_map' in namespace 'std'
  478 |     std::unordered_map<value, value, value_hasher, value_equivalence> unordered;
      |     ~~~~~^
In file included from .../code/cpp/llama.cpp/common/jinja/caps.cpp:1:
.../code/cpp/llama.cpp/common/jinja/value.h:478:10: error: no template named 'unordered_map' in namespace 'std'
  478 |     std::unordered_map<value, value, value_hasher, value_equivalence> unordered;
      |     ~~~~~^
In file included from .../code/cpp/llama.cpp/common/jinja/value.cpp:1:
In file included from .../code/cpp/llama.cpp/common/jinja/runtime.h:4:
.../code/cpp/llama.cpp/common/jinja/value.h:478:10: error: no template named 'unordered_map' in namespace 'std'
  478 |     std::unordered_map<value, value, value_hasher, value_equivalence> unordered;
[...]
```

After a bit of digging to make sure all the appropriate flags were used, I noticed that the necessary header was not included. This fixes the build for me and should not negatively affect other builds that, for some reason, were already succeeding.
2026-01-30 16:09:44 +01:00
Daniel Bevenius
f3bc98890c
memory : clarify comments for r_l and s_l tensors [no ci] (#19203)
This commit updates the comments in state_write_data to clarify that it
is handling the R and S tensors and not Key and Value tensors.
2026-01-30 15:18:41 +01:00
Concedo
885fec37c1 update sdui 2026-01-30 21:05:10 +08:00
Concedo
a6efa9d182 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	README.md
#	tests/test-backend-ops.cpp
2026-01-30 20:37:37 +08:00
Georgi Gerganov
c3b87cebff
tests : add GQA=20 FA test (#19095)
2026-01-30 13:52:57 +02:00
Daniel Bevenius
0562503154
convert : add missing return statement for GraniteMoeModel (#19202)
This commit adds a missing return statement to the GraniteMoeModel class
to fix an issue in the model conversion process.

Resolves: https://github.com/ggml-org/llama.cpp/issues/19201
2026-01-30 11:12:53 +01:00