Commit graph

13027 commits

Author SHA1 Message Date
Tai An
dfd87c4fb6
feat(router): add --routermodetimeout to make reverse-proxy timeout configurable (#2169)
Closes the hardcoded 600s timeout in the router-mode reverse proxy: long
generations through --routermode would be cut off at the upstream
HTTPConnection timeout regardless of how long the model actually takes,
because http.client.HTTPConnection('localhost', upstream_port, timeout=600)
was wired with a literal 600.

Adds a new --routermodetimeout (default 600) under the admin group, and
threads it through the three HTTPConnection sites in the router handler:
the model-swap reload, the autoswap reload, and the main upstream proxy
forward. Behavior is unchanged at the default; users with long generations
can now pass e.g. --routermodetimeout 3600.

Reported in https://github.com/LostRuins/koboldcpp/issues/2168
2026-04-29 20:20:42 +08:00
Concedo
70be589894 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	CODEOWNERS
#	examples/debug/debug.cpp
#	examples/eval-callback/eval-callback.cpp
#	ggml/src/ggml-cpu/amx/mmq.cpp
#	ggml/src/ggml-webgpu/ggml-webgpu-shader-lib.hpp
#	scripts/pr2wt.sh
2026-04-28 21:13:40 +08:00
Concedo
755037dacc add docker reference source files 2026-04-27 20:38:26 +08:00
rankaiyx
42401c72b8
Fix type casting for unaccounted memory calculation (#22424)
Some checks failed
Check Pre-Tokenizer Hashes / pre-tokenizer-hashes (push) Has been cancelled
Python check requirements.txt / check-requirements (push) Has been cancelled
Python Type-Check / python type-check (push) Has been cancelled
2026-04-27 14:31:13 +02:00
Georgi Gerganov
e940b3d468
download : prefer q8_0 when q4_k not available (#22428) 2026-04-27 14:30:29 +02:00
Concedo
9eaed2ec32 make musicui accessible to screen readers 2026-04-27 19:43:51 +08:00
Concedo
9b38d83377 updated the readme for more docker information to make it clearer what to expect. please don't use the docker on a M series macOS 2026-04-27 18:21:22 +08:00
ynankani
0f1bb602dd
model : remove duplicate wo_s scale after build_attn (Qwen3, LLaMA) (#22421)
Signed-off-by: Yash Nankani <ynankani@nvidia.com>
2026-04-27 09:58:48 +02:00
Sigbjørn Skjæret
d13540becd
convert : remove input_scale for dequantized fp8 modelopt (#22356) 2026-04-27 08:45:01 +02:00
Adrien Gallouët
f84270ea10
ggml : use 64 bytes aligned tile buffers (#21058)
| Model                            | Test   |   t/s OLD |   t/s NEW |   Speedup |
|:---------------------------------|:-------|----------:|----------:|----------:|
| qwen35 0.8B BF16                 | pp512  |    584.59 |    595.41 |      1.02 |
| qwen35 0.8B BF16                 | tg128  |     52.23 |     52.82 |      1.01 |
| qwen35 0.8B IQ2_M - 2.7 bpw      | pp512  |    260.64 |    261.70 |      1.00 |
| qwen35 0.8B IQ2_M - 2.7 bpw      | tg128  |     81.17 |     80.89 |      1.00 |
| qwen35 0.8B IQ2_XXS - 2.0625 bpw | pp512  |    302.36 |    302.56 |      1.00 |
| qwen35 0.8B IQ2_XXS - 2.0625 bpw | tg128  |     84.93 |     85.12 |      1.00 |
| qwen35 0.8B IQ3_XXS - 3.0625 bpw | pp512  |    263.22 |    260.01 |      0.99 |
| qwen35 0.8B IQ3_XXS - 3.0625 bpw | tg128  |     80.29 |     78.94 |      0.98 |
| qwen35 0.8B IQ4_NL - 4.5 bpw     | pp512  |    728.65 |    742.09 |      1.02 |
| qwen35 0.8B IQ4_NL - 4.5 bpw     | tg128  |     82.39 |     84.46 |      1.03 |
| qwen35 0.8B IQ4_XS - 4.25 bpw    | pp512  |    681.33 |    677.06 |      0.99 |
| qwen35 0.8B IQ4_XS - 4.25 bpw    | tg128  |     80.18 |     79.28 |      0.99 |
| qwen35 0.8B Q2_K_M               | pp512  |    413.28 |    415.94 |      1.01 |
| qwen35 0.8B Q2_K_M               | tg128  |     81.90 |     82.78 |      1.01 |
| qwen35 0.8B Q3_K_M               | pp512  |    493.17 |    495.08 |      1.00 |
| qwen35 0.8B Q3_K_M               | tg128  |     82.75 |     83.23 |      1.01 |
| qwen35 0.8B Q3_K_S               | pp512  |    429.35 |    427.64 |      1.00 |
| qwen35 0.8B Q3_K_S               | tg128  |     86.69 |     87.02 |      1.00 |
| qwen35 0.8B Q4_0                 | pp512  |    783.46 |    782.32 |      1.00 |
| qwen35 0.8B Q4_0                 | tg128  |     88.23 |     87.90 |      1.00 |
| qwen35 0.8B Q4_1                 | pp512  |    741.71 |    729.76 |      0.98 |
| qwen35 0.8B Q4_1                 | tg128  |     85.44 |     86.01 |      1.01 |
| qwen35 0.8B Q4_K_M               | pp512  |    676.24 |    681.31 |      1.01 |
| qwen35 0.8B Q4_K_M               | tg128  |     76.59 |     77.06 |      1.01 |
| qwen35 0.8B Q4_K_S               | pp512  |    683.12 |    688.81 |      1.01 |
| qwen35 0.8B Q4_K_S               | tg128  |     80.50 |     81.19 |      1.01 |
| qwen35 0.8B Q5_K_M               | pp512  |    635.33 |    642.11 |      1.01 |
| qwen35 0.8B Q5_K_M               | tg128  |     72.07 |     72.49 |      1.01 |
| qwen35 0.8B Q5_K_S               | pp512  |    660.95 |    658.18 |      1.00 |
| qwen35 0.8B Q5_K_S               | tg128  |     72.19 |     72.95 |      1.01 |
| qwen35 0.8B Q6_K                 | pp512  |    647.97 |    638.84 |      0.99 |
| qwen35 0.8B Q6_K                 | tg128  |     72.83 |     72.49 |      1.00 |
| qwen35 0.8B Q8_0                 | pp512  |    805.01 |    785.49 |      0.98 |
| qwen35 0.8B Q8_0                 | tg128  |     70.10 |     70.13 |      1.00 |

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2026-04-27 09:30:55 +03:00
Max Krasnyansky
5594d13224
common: fix missing exports in llama-common (#22340)
* common: refactor common/debug to move abort_on_nan into base_callback_data

Passing bool abort_on_nan as template parameter for common_debug_cb_eval is unnecessary and creates an issue with LTO.
It should just be a member of the base_callback_data instead.

* cont : cleanup

* common : use pimpl in debug.h to reduce header dependencies

Move common_debug_cb_user_data's data members (std::regex,
std::vector<uint8_t>) into a private impl struct in debug.cpp.

This removes the includes of common.h and <regex> from debug.h,
reducing transitive dependencies for any translation unit that
includes the header.

Assisted-by: llama.cpp:local pi

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-04-27 08:06:39 +03:00
Georgi Gerganov
f535774325
pr2wt : symlink .pi (#22386) 2026-04-26 19:49:26 +03:00
Rithik Sharma
06a811d085
add performance-portable tuning for register-tile and subgroup matmul (#22241) 2026-04-26 09:26:28 -07:00
Gaurav Garg
78433f606f
Fix recurrent state serialization for partial reads and writes (#22362)
The previous code worked only for full tensor reads and writes and was hitting `GGML_ASSERT(size == ggml_nbytes(tensor)); ` assert when tested with llama-server.
2026-04-26 13:34:40 +02:00
Johannes Gäßler
7ec36aa861
Github: set meta backend code owner (#22388) 2026-04-26 13:34:13 +02:00
Concedo
095cfd6354 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	ggml/src/ggml-hexagon/htp/main.c
#	ggml/src/ggml-opencl/CMakeLists.txt
#	ggml/src/ggml-opencl/ggml-opencl.cpp
#	ggml/src/ggml-opencl/kernels/cvt.cl
#	tests/test-chat-auto-parser.cpp
#	tests/test-chat.cpp
2026-04-26 15:57:35 +08:00
Oliver Simons
b1a5bd4e0c
CUDA: better coalesce data-access for contiguous concat (#22330)
Also, distribute all elements across CTAs evenly instead of launching
one CTA per dim
2026-04-26 09:21:45 +02:00
Concedo
f679e3fec5 fix missing ipv4 support 2026-04-26 14:44:26 +08:00
Sigbjørn Skjæret
0c6ee1cade
ggml-cpu : re-enable fast gelu_quick_f16 (#22339) 2026-04-26 09:28:14 +03:00
Eve
2dd84169d1
ggml-cpu: optimize avx2 q6_k (#22345) 2026-04-26 09:27:50 +03:00
lhez
f454bd7eb8
opencl: add iq4_nl support (#22272)
* opencl: add general support for iq4_nl

* opencl: add iq4_nl gemm/gemv for adreno

* opencl: pack 2 lut entries into a uint
2026-04-25 21:21:58 -07:00
Trivikram Reddy
b760272f1a
hexagon: guard HMX clock request for v75+ platforms (#22377) 2026-04-25 17:58:26 -07:00
Piotr Wilkin (ilintar)
dcad77cc3b
chat: fix handling of space in reasoning markers (#22353)
* chat: fix handling of space in reasoning markers

* fix tests

* whitespace
2026-04-25 21:24:13 +02:00
Georgi Gerganov
98dc1418ea
spec : fix vocab compat checks (#22358) 2026-04-25 20:11:35 +03:00
Concedo
929f214bf6 updated docs, handle seed oss thinking 2026-04-25 22:44:40 +08:00
Johannes Gäßler
9725a313be
CUDA: reduce MMQ stream-k overhead (#22298)
Some checks failed
Update Operations Documentation / update-ops-docs (push) Has been cancelled
* CUDA: reduce MMQ stream-k overhead

* use 32 bit integers for kbc
2026-04-25 14:15:03 +02:00
Developer-Ecosystem-Engineering
d1649047a3
metal : optimize Metal Tensor API usage for GGML_OP_MUL_MAT (#20962)
* Optimize Metal Tensor API usage for matmul2d

Separates the Metal Tensor API (matmul2d) path in kernel_mul_mm into its own standalone kernel, gated by GGML_METAL_HAS_TENSOR.

The legacy simdgroup_matrix kernel is preserved under #else.

Previously both paths were interleaved via #ifdef blocks within a single kernel, forcing the tensor path to share the legacy kernel's data layout and threadgroup memory scheme. Splitting the kernel enabled memory and dispatch optimizations that weren't possible when the two paths shared code structure.

* cont : cleanup

* cont : cleanup

* cont : cleanup

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-04-25 15:14:28 +03:00
Concedo
b31877e8ec Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.github/pull_request_template.md
#	.gitignore
#	docs/backend/SYCL.md
#	docs/ops.md
#	docs/ops/WebGPU.csv
#	examples/sycl/test.sh
#	examples/sycl/win-test.bat
#	ggml/src/ggml-sycl/common.hpp
#	ggml/src/ggml-sycl/ggml-sycl.cpp
#	ggml/src/ggml-sycl/sycl_hw.cpp
#	ggml/src/ggml-sycl/sycl_hw.hpp
#	ggml/src/ggml-webgpu/ggml-webgpu-shader-lib.hpp
#	ggml/src/ggml-webgpu/ggml-webgpu.cpp
2026-04-25 19:06:32 +08:00
Wagner Bruna
c04832bb2b
sd: add eta support (#2164) 2026-04-25 19:04:13 +08:00
Concedo
18a3bedf63 fixed a deadlock 2026-04-25 19:03:03 +08:00
ddh0
9d34231bb8
llama-quant : default ftype param Q5_1 --> Q8_0 (#20828)
Change the default `ftype` in `llama_model_quantize_params` from
`LLAMA_FTYPE_MOSTLY_Q5_1` to `LLAMA_FTYPE_MOSTLY_Q8_0`.

In case some external program naively uses the default quantization
params, we should probably default to a known-good type like Q8_0 rather
than Q5_1, which is rather old.
2026-04-25 09:25:35 +03:00
Georgi Gerganov
8ea8fee966
gitignore : add .pi + personal SYSTEM.md (#22316)
* gitignore : add .pi + personal SYSTEM.md

* cont : fix requirements heading in PR template

* cont : shorten line
2026-04-25 09:20:45 +03:00
Neo Zhang
eddd7a13a5
[SYCL] Optimize Q4_0 mul_mat for Arc770, add scripts (#22291)
* opt arc770 for Q4_0

* add for Q4_0

* update the script

* add help script for windows

* update guide

* fix format issue

* convert from dos to unix for format issue

* fix missed -sm parameter
2026-04-25 09:20:14 +03:00
Reese Levine
dd2914dc81
ggml-webgpu: support for SSM_SCAN and disable set_rows error checking (#22327)
* Implement ssm_scan

* Remove blocking in graph_compute and check for set rows

* Fix bindings

* Update op support
2026-04-25 09:18:15 +03:00
Concedo
ee2ecfbf81 updated sdui 2026-04-25 12:18:20 +08:00
Concedo
340b22283e Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.devops/intel.Dockerfile
#	.github/workflows/build-android.yml
#	.github/workflows/build.yml
#	.github/workflows/release.yml
#	.gitignore
#	docs/backend/SYCL.md
#	docs/backend/snapdragon/README.md
#	examples/model-conversion/scripts/causal/convert-model.sh
#	ggml/CMakeLists.txt
#	ggml/src/CMakeLists.txt
#	ggml/src/ggml-hexagon/ggml-hexagon.cpp
#	ggml/src/ggml-hexagon/htp/CMakeLists.txt
#	ggml/src/ggml-hexagon/htp/hex-utils.h
#	ggml/src/ggml-hexagon/htp/hmx-matmul-ops.c
#	ggml/src/ggml-hexagon/htp/htp-ctx.h
#	ggml/src/ggml-hexagon/htp/htp-ops.h
#	ggml/src/ggml-hexagon/htp/htp_iface.idl
#	ggml/src/ggml-hexagon/htp/hvx-base.h
#	ggml/src/ggml-hexagon/htp/main.c
#	ggml/src/ggml-hexagon/htp/matmul-ops.c
#	ggml/src/ggml-hexagon/libggml-htp.inf
#	ggml/src/ggml-sycl/ggml-sycl.cpp
#	ggml/src/ggml-sycl/mmvq.cpp
#	ggml/src/ggml-sycl/mmvq.hpp
#	ggml/src/ggml-webgpu/ggml-webgpu-shader-lib.hpp
#	ggml/src/ggml-webgpu/ggml-webgpu.cpp
#	ggml/src/ggml-webgpu/wgsl-shaders/flash_attn.wgsl
#	ggml/src/ggml-webgpu/wgsl-shaders/flash_attn_vec_blk.wgsl
#	ggml/src/ggml-webgpu/wgsl-shaders/flash_attn_vec_split.wgsl
#	scripts/server-test-structured.py
#	scripts/snapdragon/adb/run-bench.sh
#	scripts/snapdragon/adb/run-cli.sh
#	scripts/snapdragon/adb/run-completion.sh
#	scripts/snapdragon/adb/run-mtmd.sh
#	scripts/snapdragon/adb/run-tool.sh
#	scripts/snapdragon/qdc/requirements.txt
#	scripts/snapdragon/windows/run-bench.ps1
#	scripts/snapdragon/windows/run-cli.ps1
#	scripts/snapdragon/windows/run-completion.ps1
#	scripts/snapdragon/windows/run-mtmd.ps1
#	scripts/snapdragon/windows/run-tool.ps1
#	tests/test-backend-ops.cpp
#	tools/cli/cli.cpp
#	ty.toml
2026-04-25 12:13:14 +08:00
Concedo
4090400dff improved gemma toolcall handling 2026-04-25 09:51:29 +08:00
Piotr Wilkin (ilintar)
0adede866d
parser: fix structured output bug (#22302)
Some checks failed
Python check requirements.txt / check-requirements (push) Has been cancelled
Python Type-Check / python type-check (push) Has been cancelled
Check Pre-Tokenizer Hashes / pre-tokenizer-hashes (push) Has been cancelled
* fix very stupid structured output bug

* Things just cannot be too easy.
2026-04-24 23:19:55 +02:00
Trivikram Reddy
361fe72acb
Hexagon: Bump HMX Frequency to Max Corner (#22334)
* hexagon: bump HMX freq to max corner

* hex-mm: fix error in log msg
2026-04-24 13:55:17 -07:00
Shreya Jain
a702f39597
CI Snapdragon: Switch ubuntu-latest to ubuntu-slim runner (#22303)
* switch ubuntu-latest to ubuntu-slim

* Fix the path for upload so CI doesn't fail

* Update .github/workflows/build-and-test-snapdragon.yml

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Use -slim image for key check and consistent naming for artifact dir

Signed-off-by: Max Krasnyansky <maxk@qti.qualcomm.com>

* Remove check-secret extra job

* move QDC key check for Run QDC jobs step specifically

* add a step before to check the secret for qdc jobs

---------

Signed-off-by: Max Krasnyansky <maxk@qti.qualcomm.com>
Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-04-24 21:21:36 +02:00
Zheyuan Chen
13d36cf891
ggml-webgpu: enable FLASH_ATTN_EXT on browser without subgroup matrix (#22199)
* ggml-webgpu: add tile flash attention fallback

* ggml-webgpu: add new fields and discard usage of mnk for tile version

* ggml-webgpu: modify the vec path to discard the mnk parameter

* ggml-webgpu: enable flash attention vec and tile version for broswer

* ggml-webgpu: stagging KV for flash attention tile version

* formatting

* turn on subgroup uniformity check

* remove Q_TILE as it is always 1 for vec path

* make row_max and exp_sum to local register

* make different bindings with same underlying buffer to have the same usage flags

* move path selection into the shader library and have the host consume a single flash-attn decision object.

* turn off skip_validation and address buffer overlapping when nwg==1

* formatting

* merge binding when kv overlap
2026-04-24 10:39:09 -07:00
Mengsheng Wu
f65bc34c68
hexagon: use DIRID 13 in libggml-htp.inf for modern InfVerif (#22306) 2026-04-24 09:21:33 -07:00
Georgi Gerganov
15fa3c493b
metal : print GPU description (#22318) 2026-04-24 13:56:03 +03:00
Adrien Gallouët
dc80c5252a
common : fix jinja warnings with clang 21 (#22313)
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2026-04-24 12:36:02 +02:00
Georgi Gerganov
e583f3b4f5
ggml : minor coding style (#22308) 2026-04-24 11:02:00 +03:00
Georgi Gerganov
017f090442
jinja : remove unused header (#22310) 2026-04-24 11:01:46 +03:00
Georgi Gerganov
ffdd983fb8
server : fix swa-full logic (#22288) 2026-04-24 10:17:37 +03:00
Yes You Can Have Your Own
793d0a7931
server: rename debug tags to match --cache-idle-slots naming (#22292) 2026-04-24 09:28:44 +03:00
Mengsheng Wu
8bc492ebb4
hexagon: add SOLVE_TRI op (#21974)
* hexagon: add SOLVE_TRI op

* ggml: fix TODO description for solve_tri

* hexagon: rm unused variable/function warnings

* hexagon: chunk vs batch processingfor better thread utilization

* hexagon: vectorize partial f32 loads

* hexagon: move HVX f32 add/sub/mul wrappers to hvx-base.h

---------

Co-authored-by: Todor Boinovski <todorb@qti.qualcomm.com>
2026-04-23 18:39:13 -07:00
Chen Yuan
e5f070a1dc
fix(shader): handle the buffer aliasing for rms fuse (#22266) 2026-04-23 16:32:59 -07:00