Concedo
30c74d5cce
fixed mcp bug
2026-02-04 20:46:55 +08:00
Concedo
4b073f3aa0
fix sse parsing in mcp
2026-02-04 20:38:33 +08:00
Concedo
349c461453
add stop reason for error
2026-02-04 20:23:18 +08:00
Concedo
a2251a154f
Merge remote-tracking branch 'jeff/rope_noncontig' into concedo_experimental
2026-02-04 16:21:31 +08:00
Concedo
1f803ae27b
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .github/workflows/server.yml
# CMakeLists.txt
# cmake/common.cmake
# ggml/src/ggml-virtgpu/apir_cs_ggml-rpc-front.cpp
# ggml/src/ggml-virtgpu/backend/backend-dispatched-backend.cpp
# ggml/src/ggml-virtgpu/backend/backend-dispatched-buffer-type.cpp
# ggml/src/ggml-virtgpu/backend/backend-dispatched-buffer.cpp
# ggml/src/ggml-virtgpu/backend/backend-dispatched-device.cpp
# ggml/src/ggml-virtgpu/backend/backend-dispatched.cpp
# ggml/src/ggml-virtgpu/backend/backend-dispatched.gen.h
# ggml/src/ggml-virtgpu/backend/backend-dispatched.h
# ggml/src/ggml-virtgpu/backend/backend.cpp
# ggml/src/ggml-virtgpu/backend/shared/apir_cs.h
# ggml/src/ggml-virtgpu/backend/shared/apir_cs_ggml.h
# ggml/src/ggml-virtgpu/ggml-backend-buffer-type.cpp
# ggml/src/ggml-virtgpu/ggml-backend-device.cpp
# ggml/src/ggml-virtgpu/ggml-backend-reg.cpp
# ggml/src/ggml-virtgpu/ggml-remoting.h
# ggml/src/ggml-virtgpu/ggmlremoting_functions.yaml
# ggml/src/ggml-virtgpu/regenerate_remoting.py
# ggml/src/ggml-virtgpu/virtgpu-forward-backend.cpp
# ggml/src/ggml-virtgpu/virtgpu-forward-buffer-type.cpp
# ggml/src/ggml-virtgpu/virtgpu-forward-buffer.cpp
# ggml/src/ggml-virtgpu/virtgpu-forward-device.cpp
# ggml/src/ggml-virtgpu/virtgpu-forward-impl.h
# ggml/src/ggml-virtgpu/virtgpu-forward.gen.h
# ggml/src/ggml-virtgpu/virtgpu-shm.cpp
# ggml/src/ggml-virtgpu/virtgpu.cpp
# ggml/src/ggml-virtgpu/virtgpu.h
2026-02-04 16:21:06 +08:00
Wagner Bruna
d9ac52a01a
sd: sync to master-492-f957fa3 ( #1957 )
...
* sd: sync to master-492-f957fa3
* add Res Multistep and Res 2s samplers
* make sdflashattention control flash_attn too
2026-02-04 16:12:39 +08:00
Daniel Bevenius
25f40ca65f
completion : simplify batch (embd) processing ( #19286 )
...
* completion : simplify batch (embd) processing
This commit simplifies the processing of embd by removing the existing
for loop that uses params.n_batch as its increment. It also removes the
clamping of n_eval, since the size of embd is always at most
params.n_batch.
The motivation is to clarify the code: read in isolation, the loop is a
little confusing, as it suggests that multiple batches can be processed.
* add an assert to verify n_eval is not greater than n_batch
2026-02-04 05:43:28 +01:00
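A minimal sketch of the simplified shape described in the commit above, to make the invariant concrete. This is illustrative only: eval_embd is a hypothetical helper, and the names ctx, embd, n_past, and n_batch follow llama.cpp conventions rather than the actual diff.
```cpp
#include <cassert>
#include <vector>
#include "llama.h"

// Evaluate embd in a single call instead of a params.n_batch-strided loop,
// with an assert encoding the invariant that made the loop unnecessary.
static void eval_embd(llama_context * ctx, std::vector<llama_token> & embd,
                      int & n_past, int n_batch) {
    const int n_eval = (int) embd.size();
    // The caller fills embd with at most n_batch tokens, so the old clamp
    // and per-batch loop were dead generality:
    assert(n_eval <= n_batch);
    llama_decode(ctx, llama_batch_get_one(embd.data(), n_eval));
    n_past += n_eval;
}
```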
Kevin Pouget
015deb9048
ggml-virtgpu: make the code thread safe ( #19204 )
...
* ggml-virtgpu: regenerate_remoting.py: add the ability to deprecate a function
* ggml-virtgpu: deprecate buffer_type is_host remoting
not necessary
* ggml-virtgpu: stop using static vars as cache
The static init isn't thread safe.
* ggml-virtgpu: protect the use of the shared memory to transfer data
* ggml-virtgpu: make the remote calls thread-safe
* ggml-virtgpu: backend: don't continue if couldn't allocate the tensor memory
* ggml-virtgpu: add a cleanup function for consistency
* ggml-virtgpu: backend: don't crash if buft->iface.get_max_size is missing
* fix style and ordering
* Remove the static variable in apir_device_get_count
* ggml-virtgpu: improve the logging
* apply minor formatting changes from review
2026-02-04 10:46:18 +08:00
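To make the thread-safety fixes above concrete, here is a hedged sketch of the locking pattern they describe; virtgpu_conn, shmem_mutex, and remote_set_tensor are hypothetical names, not the actual ggml-virtgpu symbols.
```cpp
#include <algorithm>
#include <cstdint>
#include <cstring>
#include <mutex>

struct virtgpu_conn {
    std::mutex shmem_mutex;   // guards the shared host<->guest transfer buffer
    uint8_t *  shmem;
    size_t     shmem_size;
};

static void remote_set_tensor(virtgpu_conn & gpu, const void * data, size_t size) {
    // Every remote call that stages data through the shared memory window
    // must hold the lock; otherwise two threads can interleave payloads.
    std::lock_guard<std::mutex> lock(gpu.shmem_mutex);
    std::memcpy(gpu.shmem, data, std::min(size, gpu.shmem_size));
    // ... issue the remoting call and wait for completion before unlocking ...
}
```
The same reasoning covers the static-cache bullets: a lazily written static variable is a hidden shared write, so it is either computed once up front or guarded like any other shared state.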
Aman Gupta
2ceda3f662
ggml-cpu: use LUT for converting e8->f32 scales on x86 ( #19288 )
...
* ggml-cpu: use LUT for converting e8->f32 scales on x86
* add dispatch based on macro
2026-02-04 09:43:29 +08:00
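Assuming "e8" here means E8M0 exponent scales (as used by MXFP4), a minimal sketch of the LUT idea: each scale byte encodes 2^(e-127), so a one-time 256-entry table replaces per-element exponent math on the hot path. The real kernel is vectorized and handles the edge codes; this shows only the scalar shape.
```cpp
#include <array>
#include <cstdint>
#include <cstring>

static std::array<float, 256> make_e8_lut() {
    std::array<float, 256> lut{};
    for (uint32_t e = 1; e < 255; ++e) { // codes 0 and 255 need special-casing
        // Biased exponent with zero mantissa: the float's bits for 2^(e-127).
        uint32_t bits = e << 23;
        std::memcpy(&lut[e], &bits, sizeof(float));
    }
    return lut;
}

static const std::array<float, 256> e8_lut = make_e8_lut();

static inline float e8_to_f32(uint8_t e) { return e8_lut[e]; }
```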
Georgi Gerganov
44008ce8f9
metal : add solve_tri ( #19302 )
2026-02-03 23:43:14 +02:00
Georgi Gerganov
6a9bf2f788
ci : add sanitizer runs for server ( #19291 )
2026-02-03 22:41:20 +02:00
Georgi Gerganov
faa1bc26ee
sampling : delegate input allocation to the scheduler ( #19266 )
...
* sampling : delegate input allocation to the scheduler
* graph : compute backend samplers only if needed
2026-02-03 22:16:16 +02:00
Jeff Bolz
5de50e9d86
vulkan: fix non-contig rope
2026-02-03 12:20:08 -06:00
Ruben Ortlam
32b17abdb0
vulkan: disable coopmat1 fa on Nvidia Turing ( #19290 )
2026-02-03 17:37:32 +01:00
Aman Gupta
8bece2eb20
CUDA: use mmvq for mul-mat-id for small batch sizes ( #18958 )
...
* CUDA: use mmvq for mul-mat-id for small batch sizes
* add mmvq too
* Fix perf issue on Ampere. Use mmvf mm-id only for non-Nvidia GPUs
* templatize multi_token_path
2026-02-03 23:31:23 +08:00
Concedo
e7d980cf4a
updated sdui
2026-02-03 21:33:51 +08:00
Sigbjørn Skjæret
a6fd8ca1fe
models : remove unnecessary cont in openelm ( #19289 )
2026-02-03 14:20:57 +01:00
Concedo
dfa725c58d
make the dpi fix more universal. not a perfect solution
2026-02-03 19:49:38 +08:00
Georgi Gerganov
c55bce4159
metal : minor cleanup ( #19251 )
2026-02-03 13:43:29 +02:00
Concedo
316530e9cf
fix cuda graph spam
2026-02-03 19:00:50 +08:00
Concedo
7b393fa487
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .github/workflows/build.yml
# AUTHORS
# ci/run.sh
# docs/backend/SYCL.md
# docs/build.md
# docs/multimodal/minicpmo2.6.md
# docs/multimodal/minicpmo4.0.md
# docs/multimodal/minicpmv2.5.md
# docs/multimodal/minicpmv2.6.md
# docs/multimodal/minicpmv4.0.md
# docs/multimodal/minicpmv4.5.md
# docs/ops.md
# docs/ops/SYCL.csv
# docs/speculative.md
# examples/deprecation-warning/README.md
# examples/deprecation-warning/deprecation-warning.cpp
# examples/model-conversion/Makefile
# examples/model-conversion/scripts/causal/convert-model.sh
# ggml/include/ggml-cann.h
# ggml/src/ggml-cann/acl_tensor.cpp
# ggml/src/ggml-cann/acl_tensor.h
# ggml/src/ggml-cann/aclnn_ops.cpp
# ggml/src/ggml-cann/aclnn_ops.h
# ggml/src/ggml-cann/common.h
# ggml/src/ggml-cann/ggml-cann.cpp
# ggml/src/ggml-metal/CMakeLists.txt
# ggml/src/ggml-opencl/ggml-opencl.cpp
# ggml/src/ggml-opencl/kernels/concat.cl
# ggml/src/ggml-opencl/kernels/repeat.cl
# ggml/src/ggml-opencl/kernels/scale.cl
# ggml/src/ggml-opencl/kernels/tanh.cl
# ggml/src/ggml-sycl/CMakeLists.txt
# ggml/src/ggml-sycl/dpct/helper.hpp
# ggml/src/ggml-sycl/ggml-sycl.cpp
# ggml/src/ggml-sycl/outprod.cpp
# ggml/src/ggml-sycl/rope.cpp
# ggml/src/ggml-sycl/wkv.cpp
# src/llama-vocab.cpp
# tests/test-autorelease.cpp
# tests/test-backend-ops.cpp
# tools/cvector-generator/pca.hpp
# tools/export-lora/export-lora.cpp
# tools/perplexity/README.md
2026-02-03 19:00:42 +08:00
Oliver Simons
1f1e57f2bf
CUDA: Fix loop unrolling for BW in mul_mat_q_stream_k_fixup ( #19053 )
...
By providing the stride_* variables as size_t (i.e., 64-bit), the compiler can
correctly unroll the [two for-loops](557515be1e/ggml/src/ggml-cuda/mmq.cuh (L3789-L3816))
on Blackwell (BW). This gives some performance in the prefill/pp phase on BW,
while not affecting other SMs:
| GPU | Model | Test | t/s master | t/s osimons/fix_bw_mmq_fixup_kernel | Speedup |
|:--------------------------------------------------------|:----------------------|:-------|-------------:|--------------------------------------:|----------:|
| NVIDIA RTX 6000 Ada Generation | gpt-oss 20B MXFP4 MoE | pp8096 | 8404.05 | 8375.79 | 1.00 |
| NVIDIA RTX 6000 Ada Generation | llama 3B Q4_K_M | pp8096 | 16148.93 | 16019.60 | 0.99 |
| NVIDIA RTX 6000 Ada Generation | llama 8B Q4_0 | pp8096 | 8008.29 | 7978.80 | 1.00 |
| NVIDIA RTX 6000 Ada Generation | nemotron_h 9B BF16 | pp8096 | 4263.16 | 4248.53 | 1.00 |
| NVIDIA RTX 6000 Ada Generation | nemotron_h 9B Q4_K_M | pp8096 | 5165.11 | 5157.43 | 1.00 |
| NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition | gpt-oss 20B MXFP4 MoE | pp8096 | 12582.80 | 12758.37 | 1.01 |
| NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition | llama 3B Q4_K_M | pp8096 | 16879.10 | 17619.47 | 1.04 |
| NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition | llama 8B Q4_0 | pp8096 | 10649.90 | 10982.65 | 1.03 |
| NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition | nemotron_h 9B BF16 | pp8096 | 7717.73 | 7716.22 | 1.00 |
| NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition | nemotron_h 9B Q4_K_M | pp8096 | 7301.90 | 7370.38 | 1.01 |
2026-02-03 11:33:14 +01:00
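A hedged plain-C++ illustration of why the 64-bit strides matter (standing in for the CUDA kernel): with a narrower stride type, each 64-bit address computation needs a widening conversion inside the loop, which can keep the compiler from fully unrolling it.
```cpp
#include <cstddef>

// 32-bit stride: the index math must be widened on every iteration before
// it can feed 64-bit addressing, which can defeat full unrolling.
static void fixup_rows_32(float * dst, const float * partial, int stride, int n) {
    for (int i = 0; i < n; ++i) {
        dst[(long long) i * stride] += partial[i];
    }
}

// size_t stride: the induction arithmetic is already 64-bit and uniform,
// so the loop body is trivially unrollable.
static void fixup_rows_64(float * dst, const float * partial, size_t stride, int n) {
    for (int i = 0; i < n; ++i) {
        dst[i * stride] += partial[i];
    }
}
```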
George
e9a859db3c
ggml: added cleanups in ggml_quantize_free ( #19278 )
...
Add the missing cleanup calls for the IQ2_S and IQ1_M quantization types, and for IQ3XS with 512 blocks, to the quantization cleanup.
2026-02-03 08:43:39 +02:00
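A plausible reconstruction of the shape of the fix. The iq2xs_free_impl/iq3xs_free_impl helpers do exist in ggml's quantization code, but treat the exact added lines as approximate rather than the verbatim diff.
```cpp
void ggml_quantize_free(void) {
    ggml_critical_section_start();

    iq2xs_free_impl(GGML_TYPE_IQ2_XXS);
    iq2xs_free_impl(GGML_TYPE_IQ2_XS);
    iq2xs_free_impl(GGML_TYPE_IQ2_S);   // added: IQ2_S tables were never freed
    iq2xs_free_impl(GGML_TYPE_IQ1_S);
    iq2xs_free_impl(GGML_TYPE_IQ1_M);   // added: same for IQ1_M
    iq3xs_free_impl(256);
    iq3xs_free_impl(512);               // added: IQ3XS grid with 512 blocks

    ggml_critical_section_end();
}
```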
Gaurav Garg
41e3f02647
cuda : revert CUDA_SCALE_LAUNCH_QUEUES override until investigated ( #19227 )
...
Hangs were reported on Jetson Orin AGX when CUDA_SCALE_LAUNCH_QUEUES=4x was set. This reverts the previous PR (#19042) and updates the documentation to suggest setting CUDA_SCALE_LAUNCH_QUEUES=4x for faster throughput on multi-GPU systems.
2026-02-03 08:41:02 +02:00
Alexey Dubrov
1efb5f7ae1
vocab: add Falcon-H1-Tiny-Coder FIM tokens ( #19249 )
2026-02-03 08:31:01 +02:00
Georgi Gerganov
aeb827a3cc
spec : simplify time measurement using common_time_meas ( #19262 )
2026-02-03 08:20:15 +02:00
lhez
91ea44e89b
opencl: refactor some ops: concat, repeat, tanh and scale ( #19226 )
...
* opencl: refactor concat
* opencl: refactor repeat
* opencl: refactor tanh
* opencl: enable fp16 for tanh
* opencl: refactor scale
* opencl: fix unused variables
2026-02-02 15:54:43 -08:00
Sid Mohan
0dfcd3b607
jinja : add missing 'in' test to template engine ( #19004 ) ( #19239 )
...
* jinja : add missing 'in' test to template engine (#19004 )
The jinja template parser was missing the 'in' test from
global_builtins(), causing templates using reject("in", ...),
select("in", ...), or 'x is in(y)' to fail with
"selectattr: unknown test 'in'".
This broke tool-calling for Qwen3-Coder and any other model
whose chat template uses the 'in' test.
Added test_is_in supporting array, string, and object containment
checks, mirroring the existing 'in' operator logic in runtime.cpp.
Includes test cases for all three containment types plus
reject/select filter usage.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
* reuse test_is_in in binary op
---------
Co-authored-by: Sid Mohan <sidmohan0@users.noreply.github.com>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2026-02-02 21:00:55 +01:00
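A hedged sketch of what the containment test looks like, using nlohmann::json as a stand-in for the template engine's own value type (the real test_is_in mirrors the 'in' operator logic in runtime.cpp):
```cpp
#include <string>
#include <nlohmann/json.hpp>

using json = nlohmann::json;

// 'x is in(y)' / reject("in", ...) / select("in", ...): true when item is
// an element of an array, a substring of a string, or a key of an object.
static bool test_is_in(const json & item, const json & container) {
    if (container.is_array()) {
        for (const auto & v : container) {
            if (v == item) return true;
        }
        return false;
    }
    if (container.is_string() && item.is_string()) {
        return container.get<std::string>().find(item.get<std::string>())
               != std::string::npos;
    }
    if (container.is_object() && item.is_string()) {
        return container.contains(item.get<std::string>());
    }
    return false;
}
```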
Xuan-Son Nguyen
07a7412a3b
mtmd: add min/max pixels gguf metadata ( #19273 )
2026-02-02 20:59:06 +01:00
Aman Gupta
9f682fb640
ggml-cpu: FA split across kv for faster TG ( #19209 )
...
* ggml-cpu: split across kv for faster TG
* simplify sinks application
* add ref impl
2026-02-03 01:19:55 +08:00
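The title compresses a known technique: during token generation the query is a single token, so the remaining parallelism in flash attention comes from splitting the KV cache across threads and merging partial online-softmax states. A hedged sketch of that merge step (fa_partial is a hypothetical struct, not ggml-cpu's actual layout):
```cpp
#include <algorithm>
#include <cmath>
#include <vector>

struct fa_partial {
    float m;                 // running max of the logits seen in this KV chunk
    float s;                 // sum of exp(logit - m)
    std::vector<float> o;    // sum of exp(logit - m) * V rows
};

static fa_partial merge(const fa_partial & a, const fa_partial & b) {
    fa_partial r;
    r.m = std::max(a.m, b.m);
    const float ca = std::exp(a.m - r.m); // rescale each partial to the new max
    const float cb = std::exp(b.m - r.m);
    r.s = a.s * ca + b.s * cb;
    r.o.resize(a.o.size());
    for (size_t i = 0; i < a.o.size(); ++i) {
        r.o[i] = a.o[i] * ca + b.o[i] * cb;
    }
    return r; // after all chunks are merged, the attention output is r.o / r.s
}
```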
Matthieu Coudron
a3fa035822
server: print actual model name in 'model not found' error ( #19117 )
...
When experimenting with AI, my environment gets messy fast, and it's not
always easy to know which model my software is trying to load. This helps
with troubleshooting.
Before:
Error: {
code = 400,
message = "model not found",
type = "invalid_request_error"
}
After:
Error: {
code = 400,
message = "model 'toto' not found",
type = "invalid_request_error"
}
2026-02-02 16:55:27 +01:00
Aman Gupta
15818ac44c
ci: add test-backend-ops test for CPU ( #19268 )
2026-02-02 22:40:28 +08:00
Neo Zhang
bf38346d13
Remove support for Nvidia & AMD GPUs, because the oneAPI plugin for Nvidia & AMD GPUs is unavailable: its download/installation channels no longer work. ( #19246 )
...
Users can't build the software for Nvidia & AMD GPUs.
Also remove oneMath, since it is only used in the NV and AMD code paths.
2026-02-02 21:06:21 +08:00
Tamar
4d5e972673
sycl: implement GGML_OP_TOP_K ( #19242 )
2026-02-02 21:05:51 +08:00
Georgi Gerganov
6fdddb4987
metal : support virtual devices ( #18919 )
...
* metal : support virtual devices
* cont : manage buffer type context memory
* metal : add events
* cont : implement cpy_tensor_async
2026-02-02 14:29:44 +02:00
Daniel Bevenius
6156ae5111
model-conversion : add debug option to conversion script ( #19265 )
...
This commit adds a debug option to the model conversion script to enable
using the Python debugger (pdb) during model conversion.
The motivation is that I've found myself adding this manually a few
times now, and it would be quicker to have it as a flag, along with a
Makefile target/recipe for it.
2026-02-02 11:29:57 +01:00
Johannes Gäßler
59377a6c87
ggml-backend: fix async set/get fallback sync ( #19179 )
2026-02-02 10:00:05 +01:00
Georgi Gerganov
1239267cc4
authors : update ( #19263 )
...
[no ci]
2026-02-02 08:51:25 +02:00
Christian Kastner
7a4ca3cbd9
docs : Minor cleanups ( #19252 )
...
* Update old URLs to github.com/ggml-org/
* Bump copyrights
2026-02-02 08:38:55 +02:00
Sascha Rogmann
b4d05a3d2f
spec : various improvements to ngram-map + docs ( #19253 )
...
* spec: ngram-map and reasoning chats
* spec: add t_begin and t_accept
* ngram-map : add internal hash map
* docs : update ngram-map, add ngram-mod
* docs : fix ngram-map-k
* docs : differences between implementations
2026-02-02 08:26:58 +02:00
Concedo
77f4afe72b
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .devops/nix/nixpkgs-instances.nix
# docs/backend/snapdragon/CMakeUserPresets.json
# ggml/CMakeLists.txt
# ggml/src/ggml-webgpu/ggml-webgpu.cpp
2026-02-02 11:43:58 +08:00
Concedo
68f9c6df91
fix cuda graph spam
2026-02-02 11:28:50 +08:00
Nikhil Jain
2dc3ce2166
Remove pipeline cache mutexes ( #19195 )
...
* Remove mutex for pipeline caches, since they are now per-thread.
* Add comment
* Run clang-format
* Cleanup
* Run CI again
* Run CI once more
* Run clang-format
2026-02-01 18:47:29 -08:00
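A hedged sketch of the design this relies on (pipeline and compile_pipeline are hypothetical stand-ins): once each thread owns its own cache, lookups are race-free by construction and the mutex can simply be deleted. The trade-off is that each thread may compile and hold its own copy of a pipeline.
```cpp
#include <string>
#include <unordered_map>

struct pipeline { /* opaque stand-in for the backend's pipeline object */ };

static pipeline * compile_pipeline(const std::string & /*key*/) {
    return new pipeline(); // stand-in for the expensive pipeline build
}

static pipeline * get_pipeline(const std::string & key) {
    // thread_local: each thread sees an independent map, so concurrent
    // lookups never race and no lock is needed around the cache.
    static thread_local std::unordered_map<std::string, pipeline *> cache;
    auto it = cache.find(key);
    if (it != cache.end()) {
        return it->second;
    }
    pipeline * p = compile_pipeline(key);
    cache.emplace(key, p);
    return p;
}
```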
Max Krasnyansky
3bc8d2cf23
Bump cmake max version (needed for Windows on Snapdragon builds) ( #19188 )
...
* Bump max cmake version (needed for Windows on Snapdragon builds)
* cmake: move max version setting into ggml/CMakeLists
2026-02-01 14:13:38 -08:00
Alexis Williams
8a98ba4582
nix: fix allowUnfreePredicate for packages with multiple licenses ( #19237 )
...
The allowUnfreePredicate in pkgsCuda was wrapping p.meta.license in a
list unconditionally. This fails when meta.license is already a list
of licenses, as it creates a nested list and then tries to access
.free and .shortName on the inner list.
Use lib.toList instead, which correctly handles both cases:
- Single license attrset -> wraps in list
- List of licenses -> returns unchanged
2026-02-01 22:10:48 +02:00
Concedo
ddce19db72
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .devops/nix/package-gguf-py.nix
# .devops/nix/scope.nix
# common/CMakeLists.txt
# docs/backend/SYCL.md
# examples/lookahead/lookahead.cpp
# examples/lookup/lookup.cpp
# examples/sycl/run-llama2.sh
# examples/sycl/win-run-llama2.bat
# examples/sycl/win-test.bat
# ggml/src/ggml-hexagon/CMakeLists.txt
# ggml/src/ggml-hexagon/htp/flash-attn-ops.c
# ggml/src/ggml-hexagon/htp/hvx-dump.h
# ggml/src/ggml-hexagon/htp/hvx-reduce.h
# ggml/src/ggml-hexagon/htp/matmul-ops.c
# ggml/src/ggml-hexagon/htp/softmax-ops.c
# ggml/src/ggml-hexagon/htp/unary-ops.c
# ggml/src/ggml-opencl/CMakeLists.txt
# ggml/src/ggml-opencl/ggml-opencl.cpp
# ggml/src/ggml-opencl/kernels/cvt.cl
# scripts/sync-ggml.last
2026-02-01 22:35:25 +08:00
Concedo
76b22a7b23
updated lite
2026-02-01 22:16:13 +08:00
Concedo
a5ae116033
increase z-image default clamp to 4.0, to tolerate z-image base's requirement for higher CFG
2026-02-01 22:02:20 +08:00
Concedo
b13bf44285
kde fractional scaling fix, tooltip fix (+1 squashed commit)
...
Squashed commits:
[1cf02dcce] kde fractional scaling fix
2026-02-01 21:55:44 +08:00
Neo Zhang
2634ed207a
create test.sh to enhance the parameters for testing, update the guide, remove unused script ( #19243 )
2026-02-01 18:24:00 +08:00