Commit graph

1340 commits

Author SHA1 Message Date
Concedo
2905c6254f Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.pi/gg/SYSTEM.md
#	docs/speculative.md
#	ggml/src/ggml-virtgpu/virtgpu-shm.cpp
#	ggml/src/ggml-virtgpu/virtgpu.cpp
#	ggml/src/ggml-virtgpu/virtgpu.h
#	ggml/src/ggml-webgpu/ggml-webgpu-shader-lib.hpp
#	ggml/src/ggml-webgpu/ggml-webgpu.cpp
#	ggml/src/ggml-webgpu/wgsl-shaders/row_norm.wgsl
#	tools/cli/README.md
#	tools/completion/README.md
#	tools/server/README.md
2026-05-04 15:36:13 +08:00
Julien Denize
048a490f76
convert : Mistral format yarn apply_scale support (#22612)
* [BUGFIX] Mistral format apply_scale support.

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* fix misunderstood boolean parameters

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-05-03 21:51:21 +02:00
Georgi Gerganov
0754b7b6fe
server : avoid checkpoint data host copies (#22558)
* server : avoid checkpoint data host copies

* llama : refactor llama_io_read_i
2026-05-02 18:03:25 +03:00
Concedo
7c70187e26 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.github/ISSUE_TEMPLATE/010-bug-compilation.yml
#	.github/ISSUE_TEMPLATE/011-bug-results.yml
#	.github/ISSUE_TEMPLATE/019-bug-misc.yml
#	.github/ISSUE_TEMPLATE/020-enhancement.yml
#	.github/ISSUE_TEMPLATE/030-research.yml
#	.github/ISSUE_TEMPLATE/040-refactor.yml
#	ggml/CMakeLists.txt
#	ggml/src/ggml-cann/ggml-cann.cpp
#	ggml/src/ggml-hexagon/CMakeLists.txt
#	ggml/src/ggml-hexagon/ggml-hexagon.cpp
#	ggml/src/ggml-hexagon/htp/CMakeLists.txt
#	ggml/src/ggml-hexagon/htp/cmake-toolchain.cmake
#	ggml/src/ggml-hexagon/htp/flash-attn-ops.c
#	ggml/src/ggml-hexagon/htp/hex-utils.h
#	ggml/src/ggml-hexagon/htp/hmx-matmul-ops.c
#	ggml/src/ggml-hexagon/htp/hmx-ops.h
#	ggml/src/ggml-hexagon/htp/hmx-utils.h
#	ggml/src/ggml-hexagon/htp/hvx-base.h
#	ggml/src/ggml-hexagon/htp/hvx-copy.h
#	ggml/src/ggml-hexagon/htp/hvx-exp.h
#	ggml/src/ggml-hexagon/htp/unary-ops.c
#	ggml/src/ggml-opencl/CMakeLists.txt
#	ggml/src/ggml-opencl/ggml-opencl.cpp
#	ggml/src/ggml-opencl/kernels/cvt.cl
#	ggml/src/ggml-rpc/ggml-rpc.cpp
#	ggml/src/ggml-sycl/ggml-sycl.cpp
#	ggml/src/ggml-virtgpu/ggml-backend.cpp
#	ggml/src/ggml-webgpu/ggml-webgpu-shader-lib.hpp
#	ggml/src/ggml-webgpu/ggml-webgpu.cpp
#	ggml/src/ggml-webgpu/wgsl-shaders/mul_mat_vec.wgsl
#	ggml/src/ggml-zdnn/ggml-zdnn.cpp
#	ggml/src/ggml-zendnn/ggml-zendnn.cpp
#	scripts/sync-ggml.last
#	tests/test-backend-ops.cpp
2026-05-02 18:07:50 +08:00
ddh0
b97ebdc98f
llama-quant : fix --tensor-type when default qtype is overridden (#22572)
fix #22544 (my fault!)

Credit to @Anai-Guo, ref #22559 - since that one was closed due to the
new contributor policy I am taking the liberty of re-submitting that PR
here.
2026-05-01 19:55:55 +02:00
Reese Levine
5cbfb18075
Update llama-mmap to use ftello/fseeko (#22497)
* Update llama-mmap to work with 32-bit wasm and >2GB models

* Update to gguf.cpp style
2026-04-30 14:17:52 -07:00
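The entry above (5cbfb18075) concerns models larger than 2 GB on 32-bit wasm. As a rough illustration of why the offset type matters, and not the code from the PR: `ftell`/`fseek` work with `long`, which is 32 bits wide on wasm32, while `ftello`/`fseeko` work with `off_t`, assumed here to be 64 bits wide. The helper `fits_in_signed` is hypothetical and only does the range arithmetic.

```python
def fits_in_signed(offset: int, bits: int) -> bool:
    """Return True if `offset` is representable as a signed integer of `bits` bits."""
    lo = -(2 ** (bits - 1))
    hi = 2 ** (bits - 1) - 1
    return lo <= offset <= hi


GIB = 1024 ** 3

# A tensor that starts 3 GiB into the file.
offset = 3 * GIB

# A 32-bit `long` tops out just below 2 GiB, so this offset cannot be stored.
print(fits_in_signed(offset, 32))
# A 64-bit offset type (assumed for `off_t`) has room to spare.
print(fits_in_signed(offset, 64))
```

Any offset at or above 2 GiB fails the 32-bit check, which matches the ">2GB models" wording in the commit message.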
Concedo
70be589894 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	CODEOWNERS
#	examples/debug/debug.cpp
#	examples/eval-callback/eval-callback.cpp
#	ggml/src/ggml-cpu/amx/mmq.cpp
#	ggml/src/ggml-webgpu/ggml-webgpu-shader-lib.hpp
#	scripts/pr2wt.sh
2026-04-28 21:13:40 +08:00
ynankani
0f1bb602dd
model : remove duplicate wo_s scale after build_attn (Qwen3, LLaMA) (#22421)
Signed-off-by: Yash Nankani <ynankani@nvidia.com>
2026-04-27 09:58:48 +02:00
Concedo
b31877e8ec Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.github/pull_request_template.md
#	.gitignore
#	docs/backend/SYCL.md
#	docs/ops.md
#	docs/ops/WebGPU.csv
#	examples/sycl/test.sh
#	examples/sycl/win-test.bat
#	ggml/src/ggml-sycl/common.hpp
#	ggml/src/ggml-sycl/ggml-sycl.cpp
#	ggml/src/ggml-sycl/sycl_hw.cpp
#	ggml/src/ggml-sycl/sycl_hw.hpp
#	ggml/src/ggml-webgpu/ggml-webgpu-shader-lib.hpp
#	ggml/src/ggml-webgpu/ggml-webgpu.cpp
2026-04-25 19:06:32 +08:00
ddh0
9d34231bb8
llama-quant : default ftype param Q5_1 --> Q8_0 (#20828)
Change the default `ftype` in `llama_model_quantize_params` from
`LLAMA_FTYPE_MOSTLY_Q5_1` to `LLAMA_FTYPE_MOSTLY_Q8_0`.

In case some external program naively uses the default quantization
params, we should probably default to a known-good type like Q8_0 rather
than Q5_1, which is rather old.
2026-04-25 09:25:35 +03:00
Concedo
0755f27372 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.devops/openvino.Dockerfile
#	.github/workflows/build-self-hosted.yml
#	.github/workflows/build.yml
#	common/chat.cpp
#	docs/backend/OPENVINO.md
#	examples/speculative-simple/speculative-simple.cpp
#	ggml/src/ggml-hexagon/ggml-hexagon.cpp
#	ggml/src/ggml-hexagon/htp/CMakeLists.txt
#	ggml/src/ggml-hexagon/htp/htp-ctx.h
#	ggml/src/ggml-hexagon/htp/htp-ops.h
#	ggml/src/ggml-hexagon/htp/main.c
#	ggml/src/ggml-hexagon/libggml-htp.inf
#	ggml/src/ggml-openvino/ggml-decoder.cpp
#	ggml/src/ggml-openvino/ggml-openvino-extra.cpp
#	ggml/src/ggml-openvino/ggml-openvino.cpp
#	ggml/src/ggml-openvino/ggml-quants.cpp
#	ggml/src/ggml-openvino/openvino/op/rope.cpp
#	ggml/src/ggml-openvino/openvino/op_table.cpp
#	ggml/src/ggml-openvino/openvino/op_table.h
#	ggml/src/ggml-openvino/openvino/translate_session.cpp
#	ggml/src/ggml-openvino/openvino/utils.cpp
#	ggml/src/ggml-openvino/openvino/utils.h
#	ggml/src/ggml-openvino/utils.cpp
#	ggml/src/ggml-openvino/utils.h
#	ggml/src/ggml-sycl/common.hpp
#	ggml/src/ggml-sycl/convert.cpp
#	ggml/src/ggml-sycl/convert.hpp
#	ggml/src/ggml-sycl/gemm.hpp
#	ggml/src/ggml-sycl/ggml-sycl.cpp
#	ggml/src/ggml-sycl/set_rows.cpp
#	ggml/src/ggml-webgpu/ggml-webgpu-shader-lib.hpp
#	ggml/src/ggml-webgpu/ggml-webgpu.cpp
#	scripts/sync_vendor.py
#	tests/CMakeLists.txt
#	tests/test-chat.cpp
#	tools/cli/cli.cpp
#	tools/mtmd/CMakeLists.txt
#	tools/server/CMakeLists.txt
2026-04-23 00:55:05 +08:00
manayang
7bfe60fdf9
mtmd, llama : Update HunyuanVL vision-language model support (#22037)
* mtmd, llama : add HunyuanVL vision-language model support

- add LLM_ARCH_HUNYUAN_VL with M-RoPE (XD-RoPE) support
- add PROJECTOR_TYPE_HUNYUANVL with PatchMerger vision encoder
- add HunyuanVL-specific M-RoPE position encoding for image tokens
- add GGUF conversion for HunyuanVL vision and text models
- add smoke test in tools/mtmd/tests.sh

* fix: fix HunyuanVL XD-RoPE h/w section order

* fix: Remove redundant code

* convert : fix HunyuanOCR / HunyuanVL conversion
 - Tested locally: both HunyuanOCR and HunyuanVL-4B convert to GGUF
   successfully and produce correct inference output on Metal (F16 / Q8_0).

* clip : fix -Werror=misleading-indentation in bilinear resize

* fix CI: convert_hf_to_gguf type check error
 - convert_hf_to_gguf.py: give HunyuanVLTextModel.__init__ an explicit `dir_model: Path` parameter so ty can infer the type for load_hparams instead of reporting `Unknown | None`.

---------

Co-authored-by: wendadawen <wendadawen@tencent.com>
2026-04-22 11:58:43 +02:00
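The CI fix in the entry above (7bfe60fdf9) gives `HunyuanVLTextModel.__init__` an explicit `dir_model: Path` parameter so the type checker can infer the argument type for `load_hparams`. A minimal sketch of that pattern follows. The class names `TextModel` and `VLTextModel` and the body of `load_hparams` are hypothetical stand-ins, not the code in convert_hf_to_gguf.py.

```python
from pathlib import Path
from typing import Any


def load_hparams(dir_model: Path) -> dict[str, Any]:
    # Hypothetical stand-in: a real loader would read the model's config files.
    return {"name": dir_model.name}


class TextModel:
    def __init__(self, dir_model: Path, **kwargs: Any) -> None:
        self.dir_model = dir_model
        self.hparams: dict[str, Any] = kwargs.get("hparams") or load_hparams(dir_model)


class VLTextModel(TextModel):
    # Naming and annotating dir_model here, instead of pulling it out of
    # *args / **kwargs, lets a checker see a Path (not "Unknown | None")
    # at the load_hparams call below.
    def __init__(self, dir_model: Path, **kwargs: Any) -> None:
        kwargs.setdefault("hparams", load_hparams(dir_model))
        super().__init__(dir_model, **kwargs)


model = VLTextModel(Path("models/example"))
print(model.hparams)
```

The runtime behaviour is the same with or without the annotation; the change only affects what static analysis can prove.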
Concedo
19a12bb080 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	CODEOWNERS
#	common/CMakeLists.txt
#	ggml/CMakeLists.txt
#	ggml/src/ggml-webgpu/ggml-webgpu-shader-lib.hpp
#	ggml/src/ggml-webgpu/ggml-webgpu.cpp
#	ggml/src/ggml-webgpu/wgsl-shaders/common_decls.tmpl
#	ggml/src/ggml-webgpu/wgsl-shaders/mul_mat_vec.wgsl
#	scripts/sync-ggml.last
#	tools/cli/cli.cpp
#	tools/llama-bench/llama-bench.cpp
#	tools/perplexity/perplexity.cpp
2026-04-21 18:53:03 +08:00
Georgi Gerganov
cd03ec7642
llama-ext : fix exports (#22202)
2026-04-21 11:04:46 +03:00
Georgi Gerganov
cfe9838d26
fit-params : refactor + add option to output estimated memory per device (#22171)
* fit-params : add option to output estimated memory per device

* cont : minor

* cont : refactor

* cont : move fit params implementation to libcommon

* cont : header

* cont : headers

* cont : codeowners
2026-04-21 09:54:36 +03:00
Johannes Gäßler
fb19f94c71
TP: fix 0-sized tensor slices, AllReduce fallback (#21808)
* TP: fix 0-sized tensor slices, AllReduce fallback

* fix layer structure <-> GPU count aliasing

* add missing std::fill

* fix CUDA device set, max ggml ctx size
2026-04-20 18:09:39 +02:00
Concedo
cd6788007e Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.github/workflows/build-cross.yml
#	.github/workflows/build-self-hosted.yml
#	.github/workflows/release.yml
#	examples/llama.android/lib/src/main/cpp/CMakeLists.txt
#	ggml/CMakeLists.txt
#	ggml/src/ggml-rpc/CMakeLists.txt
#	ggml/src/ggml-rpc/ggml-rpc.cpp
#	ggml/src/ggml-sycl/mmvq.cpp
#	ggml/src/ggml-webgpu/ggml-webgpu-shader-lib.hpp
#	ggml/src/ggml-webgpu/ggml-webgpu.cpp
#	scripts/sync_vendor.py
#	tests/test-chat.cpp
#	tests/test-mtmd-c-api.c
#	tools/server/README.md
2026-04-20 20:19:11 +08:00
SamareshSingh
81df3f7cfa
fix: GLM-DSA crash in llama-tokenize when using vocab_only (#22102)
* llama: fix crash in print_info for GLM-DSA when vocab_only is set

* addressed code review comments

* cont : simplify

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-04-20 10:32:46 +03:00
Sigbjørn Skjæret
4f02d47339
model : refactor bias tensor variable names (#22079)
* refactor bias tensor variable names

* use create_tensor_qkv for jina-bert-v2
2026-04-18 20:12:00 +02:00
Johannes Gäßler
fd1c0ec3f0
llama: fit ctx size for CPU only (#21568)
2026-04-18 08:16:04 +02:00
Concedo
79882d669a Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.github/workflows/build-android.yml
#	.github/workflows/build.yml
#	.github/workflows/release.yml
#	CMakeLists.txt
#	CODEOWNERS
#	common/CMakeLists.txt
#	common/common.h
#	docs/ops.md
#	docs/ops/Metal.csv
#	examples/batched/CMakeLists.txt
#	examples/convert-llama2c-to-ggml/CMakeLists.txt
#	examples/debug/CMakeLists.txt
#	examples/diffusion/CMakeLists.txt
#	examples/embedding/CMakeLists.txt
#	examples/eval-callback/CMakeLists.txt
#	examples/gen-docs/CMakeLists.txt
#	examples/idle/CMakeLists.txt
#	examples/lookahead/CMakeLists.txt
#	examples/lookup/CMakeLists.txt
#	examples/parallel/CMakeLists.txt
#	examples/passkey/CMakeLists.txt
#	examples/retrieval/CMakeLists.txt
#	examples/save-load-state/CMakeLists.txt
#	examples/speculative-simple/CMakeLists.txt
#	examples/speculative/CMakeLists.txt
#	examples/sycl/CMakeLists.txt
#	examples/training/CMakeLists.txt
#	ggml/src/ggml-hexagon/htp/hmx-matmul-ops.c
#	ggml/src/ggml-hexagon/htp/htp-ops.h
#	ggml/src/ggml-hexagon/htp/main.c
#	ggml/src/ggml-opencl/CMakeLists.txt
#	ggml/src/ggml-opencl/ggml-opencl.cpp
#	ggml/src/ggml-opencl/kernels/cvt.cl
#	pocs/vdot/CMakeLists.txt
#	src/CMakeLists.txt
#	tests/CMakeLists.txt
#	tests/test-quantize-stats.cpp
#	tools/batched-bench/CMakeLists.txt
#	tools/cli/CMakeLists.txt
#	tools/cli/cli.cpp
#	tools/completion/CMakeLists.txt
#	tools/cvector-generator/CMakeLists.txt
#	tools/cvector-generator/cvector-generator.cpp
#	tools/export-lora/CMakeLists.txt
#	tools/gguf-split/CMakeLists.txt
#	tools/gguf-split/gguf-split.cpp
#	tools/imatrix/CMakeLists.txt
#	tools/llama-bench/CMakeLists.txt
#	tools/llama-bench/llama-bench.cpp
#	tools/mtmd/CMakeLists.txt
#	tools/perplexity/CMakeLists.txt
#	tools/quantize/CMakeLists.txt
#	tools/quantize/quantize.cpp
#	tools/results/CMakeLists.txt
#	tools/server/CMakeLists.txt
#	tools/tokenize/CMakeLists.txt
#	tools/tts/CMakeLists.txt
2026-04-17 22:37:37 +08:00
Concedo
9a38091207 support q5_1 kv
2026-04-17 17:06:15 +08:00
Eric Zhang
fcc7508759
model : Gemma4 model type detection (#22027)
* model : Gemma4 model type detection

* model : Gemma4 model type detection
2026-04-17 10:07:11 +02:00
Xuan-Son Nguyen
089dd41fe3
cmake: use glob to collect src/models sources (#22005)
2026-04-16 23:25:16 +02:00
Xuan-Son Nguyen
4fbdabdc61
model: using single llm_build per arch (#21970)
* model: using single llm_build per arch

* fix merge

* nits
2026-04-16 21:10:22 +02:00
PikaPikachu
9db77a020c
model : refactor QKV into common build_qkv and create_tensor_qkv helpers (#21245)
* model : refactor QKV into common build_qkv and create_tensor_qkv helpers

* model : extend build_qkv to bert/mpt/dbrx/olmo/lfm2/nemotron-h/granite-hybrid/gemma3n-iswa/t5-dec and fix wqkv_s
2026-04-16 17:41:34 +02:00
Sigbjørn Skjæret
f772f6e434
model : support NVFP4 tensors for Gemma4 (#21971)
* support nvfp4 tensors for Gemma4

* add wo_s to build_attn

* add wo_s to build_attn

* fix glm4
2026-04-16 16:51:47 +02:00
Concedo
ae292c496e handle SWA conflicting with rewind, increased default SWA padding.
2026-04-16 17:00:26 +08:00
Concedo
0251c6dbde added swa padding controls
2026-04-16 16:21:48 +08:00
Concedo
ac29e6f0c0 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.devops/vulkan.Dockerfile
#	.github/workflows/build-self-hosted.yml
#	.github/workflows/build.yml
#	.github/workflows/release.yml
#	.github/workflows/server-self-hosted.yml
#	docs/build.md
#	ggml/src/ggml-hexagon/htp/CMakeLists.txt
#	ggml/src/ggml-hexagon/htp/hex-utils.h
#	ggml/src/ggml-hexagon/htp/hmx-matmul-ops.c
#	ggml/src/ggml-hexagon/htp/hmx-utils.h
#	ggml/src/ggml-hexagon/htp/htp-ctx.h
#	ggml/src/ggml-hexagon/htp/htp-ops.h
#	ggml/src/ggml-hexagon/htp/hvx-base.h
#	ggml/src/ggml-hexagon/htp/main.c
#	ggml/src/ggml-webgpu/ggml-webgpu.cpp
#	tests/test-backend-ops.cpp
#	tests/test-mtmd-c-api.c
2026-04-15 15:15:19 +08:00
Xuan-Son Nguyen
fae3a28070
ggml : remove ggml-ext.h (#21869)
* ggml: correct placement of ggml-ext.h

* ggml : remove ggml-ext.h

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-04-14 17:32:58 +03:00
Concedo
5361b45fba Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	ggml/src/ggml-opencl/CMakeLists.txt
#	ggml/src/ggml-opencl/ggml-opencl.cpp
#	ggml/src/ggml-opencl/kernels/cvt.cl
#	requirements/requirements-tool_bench.txt
2026-04-12 16:22:26 +08:00
Johannes Gäßler
865ff06b2f
TP: fix Qwen 3 Next data split (#21732)
2026-04-11 09:23:42 +02:00
Concedo
4c860ae4ae Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	common/download.cpp
#	docs/backend/OPENVINO.md
#	docs/backend/snapdragon/CMakeUserPresets.json
#	docs/backend/snapdragon/README.md
#	ggml/src/ggml-hexagon/ggml-hexagon.cpp
#	ggml/src/ggml-hexagon/htp/act-ops.c
#	ggml/src/ggml-hexagon/htp/argsort-ops.c
#	ggml/src/ggml-hexagon/htp/binary-ops.c
#	ggml/src/ggml-hexagon/htp/cpy-ops.c
#	ggml/src/ggml-hexagon/htp/cumsum-ops.c
#	ggml/src/ggml-hexagon/htp/flash-attn-ops.c
#	ggml/src/ggml-hexagon/htp/get-rows-ops.c
#	ggml/src/ggml-hexagon/htp/hex-utils.h
#	ggml/src/ggml-hexagon/htp/hmx-matmul-ops.c
#	ggml/src/ggml-hexagon/htp/hmx-ops.h
#	ggml/src/ggml-hexagon/htp/htp-ctx.h
#	ggml/src/ggml-hexagon/htp/htp-ops.h
#	ggml/src/ggml-hexagon/htp/htp_iface.idl
#	ggml/src/ggml-hexagon/htp/main.c
#	ggml/src/ggml-hexagon/htp/matmul-ops.c
#	ggml/src/ggml-hexagon/htp/repeat-ops.c
#	ggml/src/ggml-hexagon/htp/rope-ops.c
#	ggml/src/ggml-hexagon/htp/set-rows-ops.c
#	ggml/src/ggml-hexagon/htp/softmax-ops.c
#	ggml/src/ggml-hexagon/htp/ssm-conv.c
#	ggml/src/ggml-hexagon/htp/sum-rows-ops.c
#	ggml/src/ggml-hexagon/htp/unary-ops.c
#	ggml/src/ggml-webgpu/ggml-webgpu-shader-lib.hpp
#	ggml/src/ggml-webgpu/ggml-webgpu.cpp
#	ggml/src/ggml-webgpu/wgsl-shaders/common_decls.tmpl
#	ggml/src/ggml-webgpu/wgsl-shaders/flash_attn.wgsl
#	ggml/src/ggml-webgpu/wgsl-shaders/get_rows.wgsl
#	ggml/src/ggml-webgpu/wgsl-shaders/mul_mat.wgsl
#	ggml/src/ggml-webgpu/wgsl-shaders/mul_mat_decls.tmpl
#	ggml/src/ggml-webgpu/wgsl-shaders/mul_mat_vec.wgsl
#	ggml/src/ggml-webgpu/wgsl-shaders/unary.wgsl
#	models/templates/google-gemma-4-31B-it-interleaved.jinja
#	models/templates/google-gemma-4-31B-it.jinja
#	scripts/snapdragon/adb/run-bench.sh
#	scripts/snapdragon/adb/run-cli.sh
#	scripts/snapdragon/adb/run-completion.sh
#	scripts/snapdragon/adb/run-tool.sh
#	scripts/snapdragon/windows/run-bench.ps1
#	scripts/snapdragon/windows/run-cli.ps1
#	scripts/snapdragon/windows/run-mtmd.ps1
#	scripts/snapdragon/windows/run-tool.ps1
#	tests/test-backend-ops.cpp
#	tests/test-chat.cpp
#	tools/llama-bench/llama-bench.cpp
2026-04-11 11:19:32 +08:00
Concedo
a165a73120 Merge commit 'd6f3030047' into concedo_experimental
# Conflicts:
#	examples/model-conversion/scripts/causal/run-casual-gen-embeddings-org.py
#	examples/model-conversion/scripts/utils/semantic_check.py
#	ggml/CMakeLists.txt
#	ggml/src/CMakeLists.txt
#	ggml/src/ggml-cann/ggml-cann.cpp
#	ggml/src/ggml-cpu/amx/amx.cpp
#	ggml/src/ggml-cuda/CMakeLists.txt
#	ggml/src/ggml-hexagon/ggml-hexagon.cpp
#	ggml/src/ggml-hip/CMakeLists.txt
#	ggml/src/ggml-opencl/ggml-opencl.cpp
#	ggml/src/ggml-openvino/ggml-openvino.cpp
#	ggml/src/ggml-rpc/ggml-rpc.cpp
#	ggml/src/ggml-sycl/ggml-sycl.cpp
#	ggml/src/ggml-virtgpu/ggml-backend-buffer.cpp
#	ggml/src/ggml-virtgpu/ggml-backend.cpp
#	ggml/src/ggml-webgpu/ggml-webgpu.cpp
#	ggml/src/ggml-zdnn/ggml-zdnn.cpp
#	ggml/src/ggml-zendnn/ggml-zendnn.cpp
#	pyproject.toml
#	requirements/requirements-convert_legacy_llama.txt
#	requirements/requirements-tool_bench.txt
#	src/llama-model.cpp
#	src/llama.cpp
#	tests/test-llama-archs.cpp
#	tests/test-tokenizer-0.py
#	tests/test-tokenizer-random.py
#	tools/llama-bench/llama-bench.cpp
#	tools/perplexity/perplexity.cpp
2026-04-11 11:10:55 +08:00
Concedo
8b90bfe094 Merge commit '4ef9301e4d' into concedo_experimental
# Conflicts:
#	.github/labeler.yml
#	docs/multimodal.md
#	embd_res/ggml-vocab-gemma-4.gguf
#	embd_res/ggml-vocab-gemma-4.gguf.inp
#	embd_res/ggml-vocab-gemma-4.gguf.out
#	ggml/src/ggml-sycl/fattn-tile.cpp
#	ggml/src/ggml-sycl/fattn-tile.hpp
#	ggml/src/ggml-sycl/fattn-vec.hpp
#	ggml/src/ggml-sycl/fattn.cpp
#	ggml/src/ggml-sycl/ggml-sycl.cpp
#	ggml/src/ggml-sycl/template-instances/fattn-vec-instance-f16-f16.cpp
#	ggml/src/ggml-sycl/template-instances/fattn-vec-instance-f16-q4_0.cpp
#	ggml/src/ggml-sycl/template-instances/fattn-vec-instance-f16-q4_1.cpp
#	ggml/src/ggml-sycl/template-instances/fattn-vec-instance-f16-q5_0.cpp
#	ggml/src/ggml-sycl/template-instances/fattn-vec-instance-f16-q5_1.cpp
#	ggml/src/ggml-sycl/template-instances/fattn-vec-instance-f16-q8_0.cpp
#	ggml/src/ggml-sycl/template-instances/fattn-vec-instance-q4_0-f16.cpp
#	ggml/src/ggml-sycl/template-instances/fattn-vec-instance-q4_0-q4_0.cpp
#	ggml/src/ggml-sycl/template-instances/fattn-vec-instance-q4_0-q4_1.cpp
#	ggml/src/ggml-sycl/template-instances/fattn-vec-instance-q4_0-q5_0.cpp
#	ggml/src/ggml-sycl/template-instances/fattn-vec-instance-q4_0-q5_1.cpp
#	ggml/src/ggml-sycl/template-instances/fattn-vec-instance-q4_0-q8_0.cpp
#	ggml/src/ggml-sycl/template-instances/fattn-vec-instance-q4_1-f16.cpp
#	ggml/src/ggml-sycl/template-instances/fattn-vec-instance-q4_1-q4_0.cpp
#	ggml/src/ggml-sycl/template-instances/fattn-vec-instance-q4_1-q4_1.cpp
#	ggml/src/ggml-sycl/template-instances/fattn-vec-instance-q4_1-q5_0.cpp
#	ggml/src/ggml-sycl/template-instances/fattn-vec-instance-q4_1-q5_1.cpp
#	ggml/src/ggml-sycl/template-instances/fattn-vec-instance-q4_1-q8_0.cpp
#	ggml/src/ggml-sycl/template-instances/fattn-vec-instance-q5_0-f16.cpp
#	ggml/src/ggml-sycl/template-instances/fattn-vec-instance-q5_0-q4_0.cpp
#	ggml/src/ggml-sycl/template-instances/fattn-vec-instance-q5_0-q4_1.cpp
#	ggml/src/ggml-sycl/template-instances/fattn-vec-instance-q5_0-q5_0.cpp
#	ggml/src/ggml-sycl/template-instances/fattn-vec-instance-q5_0-q5_1.cpp
#	ggml/src/ggml-sycl/template-instances/fattn-vec-instance-q5_0-q8_0.cpp
#	ggml/src/ggml-sycl/template-instances/fattn-vec-instance-q5_1-f16.cpp
#	ggml/src/ggml-sycl/template-instances/fattn-vec-instance-q5_1-q4_0.cpp
#	ggml/src/ggml-sycl/template-instances/fattn-vec-instance-q5_1-q4_1.cpp
#	ggml/src/ggml-sycl/template-instances/fattn-vec-instance-q5_1-q5_0.cpp
#	ggml/src/ggml-sycl/template-instances/fattn-vec-instance-q5_1-q5_1.cpp
#	ggml/src/ggml-sycl/template-instances/fattn-vec-instance-q5_1-q8_0.cpp
#	ggml/src/ggml-sycl/template-instances/fattn-vec-instance-q8_0-f16.cpp
#	ggml/src/ggml-sycl/template-instances/fattn-vec-instance-q8_0-q4_0.cpp
#	ggml/src/ggml-sycl/template-instances/fattn-vec-instance-q8_0-q4_1.cpp
#	ggml/src/ggml-sycl/template-instances/fattn-vec-instance-q8_0-q5_0.cpp
#	ggml/src/ggml-sycl/template-instances/fattn-vec-instance-q8_0-q5_1.cpp
#	ggml/src/ggml-sycl/template-instances/fattn-vec-instance-q8_0-q8_0.cpp
#	tests/CMakeLists.txt
#	tests/test-jinja.cpp
#	tools/mtmd/CMakeLists.txt
2026-04-11 09:38:50 +08:00
MoonRide303
e62fa13c24
model : make Gemma 4 shared-KV tail attn_k tensors optional on load (#21739)
2026-04-10 21:45:50 +02:00
Johannes Gäßler
d6f3030047
ggml: backend-agnostic tensor parallelism (experimental) (#19378)
* ggml: backend-agnostic tensor parallelism

* support for GPT-OSS, Qwen 3 MoE

* partial Vulkan fix

* add support for 4/8 GPUs

* unconditional peer access

* re-use buffers + ggml contexts

* fix output pattern

* NCCL support

* GGML: HIP: add RCCL support

* Remove shfl and AllReduce from backend interface

* move allocation workaround out of ggml-alloc.c

* 2d tensor set/get support

* Fix the seg fault without NCCL

* Apply suggestion from JohannesGaessler

* support for tensor dims % n_devs != 0

* fix view_offs scaling

* arbitrary num. of GPUs/tensor split

* fix compilation

* better granularity estimate

* Support device-specific host buffer types if all underlying backends expose the same type. This allows using pinned memory instead of pageable memory for CUDA.

Fix compilation errors.

* partial Qwen 3 Next support

* Fix qwen3 30b (#8)

* Fix crash with Qwen-30B-A3B Q4_0

Qwen-30B-A3B Q4_0 has an intermediate dimension of 768. Using a granularity of 256 forces an uneven split between GPUs, which is not supported by the current implementation.

* Decide block size based on tensor quantization type

* Fix crashes due to KV cache serialization (#9)

KV cache serialization requires non-zero offsets on the tensor. Add support in the meta backend to set/get a tensor with a non-zero offset.

* metal : fix build (#7)

* static memory allocations, fix usage count

* fix tensor granularity

* more even memory distribution

* use BF16 for allreduce

* rebase fixup

* better error message for unsupported architectures

* Fix device mismatch during scatter of allReduce. (#11)

There is a mismatch between the dst buffer device and the backend device, causing the use of sync copies

* Enable the previous allreduce implementation. It is better in both perf and stability (#12)

* delay AllReduce for Moe for less I/O

* build : clean-up compile warnings

* backend : move most of the meta backend API to ggml-backend-impl.h

* cont : hide unused public API in the implementation

* llama : use llama_device + remove ggml_backend_dev_is_meta()

* ggml-backend : remove unused alloc include

* minor : remove regex include

* ggml : introduce ggml-ext.h for staging new APIs

* rebase fixup

* fix tests

* llama : more robust logic for determining Meta devices (#16)

* llama : more robust logic for determining Meta devices

* cont : fix devs size check

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* cont : fix log type

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* disable roundtrip for meta backend

* fix arch selection

* Qwen 3.5 support

* fix Gemma 4 MoE

* fix OpenVino, SYCL

* fix test-llama-archs for CPU-only builds

* Fix Qwen 3.5 MoE

* disable meta backend tests for WebGPU

* tests : filter CPU-based devices from the Meta backend tests (#17)

* meta : formatting, naming, indentation (#18)

* formatting : llama-model.cpp

* formatting : ggml-ext.h

* formatting : ggml-backend-meta.cpp

* meta : add TODO

* add documentation

* better error messages

* fix GPT-OSS

---------

Co-authored-by: Carl Philipp Klemm <carl@uvos.xyz>
Co-authored-by: Gaurav Garg <gaugarg@nvidia.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-04-09 16:42:19 +02:00
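The entry above (d6f3030047) notes that Qwen-30B-A3B Q4_0 has an intermediate dimension of 768 and that a granularity of 256 forces an uneven split between GPUs. The arithmetic can be sketched as below. `split_dim` is a hypothetical helper and not the meta backend's actual code; the value 32 is the Q4_0 block size in ggml, used here as one example of choosing the granularity from the quantization type.

```python
def split_dim(dim: int, n_devices: int, granularity: int) -> list[int]:
    """Split `dim` across devices in whole blocks of `granularity` elements."""
    if dim % granularity != 0:
        raise ValueError("dim must be a multiple of granularity")
    n_blocks = dim // granularity
    base, extra = divmod(n_blocks, n_devices)
    # The first `extra` devices take one additional block each.
    return [(base + (1 if i < extra else 0)) * granularity for i in range(n_devices)]


# 768 / 256 = 3 blocks, which cannot be shared evenly by 2 GPUs.
print(split_dim(768, 2, 256))
# 768 / 32 = 24 blocks, so each of the 2 GPUs gets 12 blocks.
print(split_dim(768, 2, 32))
```

With a granularity of 256 the two devices receive 512 and 256 elements; with a granularity of 32 they receive 384 each.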
Xuan-Son Nguyen
057dba336e
model: fix multimodal padding token for gemma3n/gemma4 (#21625)
* model: fix multimodal padding token for gemma3n/gemma4

* nits
2026-04-09 12:18:23 +02:00
Concedo
c82c0b463a Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.github/labeler.yml
#	.github/workflows/release.yml
#	examples/debug/debug.cpp
#	ggml/src/ggml-cuda/common.cuh
#	ggml/src/ggml-cuda/mmq.cuh
#	ggml/src/ggml-webgpu/ggml-webgpu.cpp
#	src/llama-vocab.cpp
#	tests/test-backend-ops.cpp
#	tests/test-chat.cpp
#	tests/test-json-schema-to-grammar.cpp
#	tools/mtmd/CMakeLists.txt
2026-04-09 17:45:04 +08:00
Piotr Wilkin (ilintar)
0ec191e1d7
vocab: add gemma4 tokenizer tests, fix edge case (#21534)
* YATF (Yet Another Tokenizer Fix) for Gemma 4. With tests!
* Remove unnecessary hash from update script.
* minor: move constant
2026-04-09 11:41:14 +02:00
Concedo
5529748a01 Merge commit 'de1aa6fa73' into concedo_experimental
# Conflicts:
#	docs/build.md
#	docs/ops.md
#	docs/ops/WebGPU.csv
#	ggml/src/ggml-sycl/dequantize.hpp
#	ggml/src/ggml-sycl/dmmv.cpp
#	ggml/src/ggml-sycl/ggml-sycl.cpp
#	ggml/src/ggml-sycl/mmvq.cpp
#	ggml/src/ggml-sycl/quants.hpp
#	ggml/src/ggml-sycl/vecdotq.hpp
#	ggml/src/ggml-webgpu/ggml-webgpu-shader-lib.hpp
#	ggml/src/ggml-webgpu/ggml-webgpu.cpp
#	ggml/src/ggml-webgpu/wgsl-shaders/mul_mat_decls.tmpl
#	tests/test-backend-ops.cpp
#	tests/test-quantize-fns.cpp
2026-04-09 17:16:33 +08:00
Aldehir Rojas
d9a12c82f0
vocab : remove </s> eog token if gemma4 (#21492)
2026-04-08 09:53:06 -05:00
Erik Scholz
3ba12fed0a
kv-cache : extend cache quantization checks (#21586)
to also check for enabled flash attention, instead of just auto.
2026-04-08 16:08:57 +03:00
Georgi Gerganov
5764d7c6a6
gemma : perform per-layer projections in the first layer (#21612)
* gemma : reduce graph splits by keeping per-layer ops in the input layer

* gemma : put the per-layer proj in the first layer

* cont : move the projection before the layer loop
2026-04-08 16:06:30 +03:00
Georgi Gerganov
4eb19514dd
kv-cache : support attention rotation for heterogeneous iSWA (#21513)
* kv-cache : support attention rotation for heterogeneous iSWA

* cont : remove assert
2026-04-07 20:31:28 +03:00
Son H. Nguyen
0d049d6a92
unicode : add custom Qwen2 regex handler to fix segfault on long input (#21257)
* unicode : add custom Qwen2 regex handler to fix segfault on long input

std::regex uses recursive backtracking internally, which causes a stack
overflow (segfault) when tokenizing long sequences of repeated characters
(e.g. 43K 'A's). The Qwen2 tokenizer regex differs from Llama3 only in
the digit pattern (\p{N} vs \p{N}{1,3}), so it was falling through to
the std::regex fallback path instead of using a custom handler.

Add unicode_regex_split_custom_qwen2() following the established pattern
used by gpt2, llama3, kimi_k2, and afmoe custom handlers.

Closes: https://github.com/ggml-org/llama.cpp/issues/21113

* cont : remove TODO comment

* cont : update comment to reflect original regex

* use the correct regex in the comment this time... [no ci]

---------

Co-authored-by: Aldehir Rojas <hello@alde.dev>
2026-04-07 16:13:38 +03:00
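The entry above (0d049d6a92) replaces a backtracking regex with a hand-written handler and says the Qwen2 and Llama3 patterns differ only in the digit rule (`\p{N}` versus `\p{N}{1,3}`). The sketch below illustrates just that one rule with an iterative scanner. It is not a port of `unicode_regex_split_custom_qwen2()`, it ignores every other part of the pre-tokenizer pattern, and `split_digit_runs` is a hypothetical name.

```python
import unicodedata


def is_number(ch: str) -> bool:
    # \p{N} covers the Unicode "Number" general categories (Nd, Nl, No).
    return unicodedata.category(ch).startswith("N")


def split_digit_runs(text: str, max_digits: int) -> list[str]:
    """Split text so that number characters come in groups of at most max_digits.

    max_digits=1 mimics the \\p{N} rule, max_digits=3 mimics \\p{N}{1,3}.
    The loop is iterative, so stack depth does not grow with input length.
    """
    pieces = []
    i, n = 0, len(text)
    while i < n:
        j = i
        if is_number(text[i]):
            while j < n and j - i < max_digits and is_number(text[j]):
                j += 1
        else:
            while j < n and not is_number(text[j]):
                j += 1
        pieces.append(text[i:j])
        i = j
    return pieces


print(split_digit_runs("ab12345", 1))   # Qwen2-style digit rule
print(split_digit_runs("ab12345", 3))   # Llama3-style digit rule
print(len(split_digit_runs("A" * 43000, 1)))
```

The last call uses the same kind of input the commit message mentions (43K repeated characters) and completes without recursion.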
Johannes Gäßler
a8ec0df461
llama: remove per-arch tensor name lists (#21531)
2026-04-07 15:02:03 +02:00
Concedo
15d269197e Merge commit '506200cf8b' into concedo_experimental
# Conflicts:
#	docs/multimodal.md
#	scripts/compare-llama-bench.py
#	src/llama-vocab.cpp
#	tools/llama-bench/README.md
#	tools/llama-bench/llama-bench.cpp
2026-04-07 14:58:36 +08:00
Pasha Khosravi
2e1f0a889e
ggml: add Q1_0 1-bit quantization support (CPU) (#21273)
* ggml: add Q1_0 and Q1_0_g128 1-bit quantization support (CPU)

* add generic fallback for x86

* remove Q1_0 (group size 32)

* rename Q1_0_g128 => Q1_0

* fix Q1_0 LlamaFileType Enum

* Fix trailing spaces; add generic fallback for other backends

* Apply suggestions from code review

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* fix /r/n spacing + arch-fallback

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-04-06 20:55:21 +02:00