Concedo
70be589894
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# CODEOWNERS
# examples/debug/debug.cpp
# examples/eval-callback/eval-callback.cpp
# ggml/src/ggml-cpu/amx/mmq.cpp
# ggml/src/ggml-webgpu/ggml-webgpu-shader-lib.hpp
# scripts/pr2wt.sh
2026-04-28 21:13:40 +08:00
ynankani
0f1bb602dd
model : remove duplicate wo_s scale after build_attn (Qwen3, LLaMA) ( #22421 )
...
Signed-off-by: Yash Nankani <ynankani@nvidia.com>
2026-04-27 09:58:48 +02:00
Concedo
b31877e8ec
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .github/pull_request_template.md
# .gitignore
# docs/backend/SYCL.md
# docs/ops.md
# docs/ops/WebGPU.csv
# examples/sycl/test.sh
# examples/sycl/win-test.bat
# ggml/src/ggml-sycl/common.hpp
# ggml/src/ggml-sycl/ggml-sycl.cpp
# ggml/src/ggml-sycl/sycl_hw.cpp
# ggml/src/ggml-sycl/sycl_hw.hpp
# ggml/src/ggml-webgpu/ggml-webgpu-shader-lib.hpp
# ggml/src/ggml-webgpu/ggml-webgpu.cpp
2026-04-25 19:06:32 +08:00
ddh0
9d34231bb8
llama-quant : default ftype param Q5_1 --> Q8_0 ( #20828 )
...
Change the default `ftype` in `llama_model_quantize_params` from
`LLAMA_FTYPE_MOSTLY_Q5_1` to `LLAMA_FTYPE_MOSTLY_Q8_0`.
In case some external program naively uses the default quantization
params, we should probably default to a known-good type like Q8_0 rather
than Q5_1, which is rather old.
2026-04-25 09:25:35 +03:00
Concedo
0755f27372
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .devops/openvino.Dockerfile
# .github/workflows/build-self-hosted.yml
# .github/workflows/build.yml
# common/chat.cpp
# docs/backend/OPENVINO.md
# examples/speculative-simple/speculative-simple.cpp
# ggml/src/ggml-hexagon/ggml-hexagon.cpp
# ggml/src/ggml-hexagon/htp/CMakeLists.txt
# ggml/src/ggml-hexagon/htp/htp-ctx.h
# ggml/src/ggml-hexagon/htp/htp-ops.h
# ggml/src/ggml-hexagon/htp/main.c
# ggml/src/ggml-hexagon/libggml-htp.inf
# ggml/src/ggml-openvino/ggml-decoder.cpp
# ggml/src/ggml-openvino/ggml-openvino-extra.cpp
# ggml/src/ggml-openvino/ggml-openvino.cpp
# ggml/src/ggml-openvino/ggml-quants.cpp
# ggml/src/ggml-openvino/openvino/op/rope.cpp
# ggml/src/ggml-openvino/openvino/op_table.cpp
# ggml/src/ggml-openvino/openvino/op_table.h
# ggml/src/ggml-openvino/openvino/translate_session.cpp
# ggml/src/ggml-openvino/openvino/utils.cpp
# ggml/src/ggml-openvino/openvino/utils.h
# ggml/src/ggml-openvino/utils.cpp
# ggml/src/ggml-openvino/utils.h
# ggml/src/ggml-sycl/common.hpp
# ggml/src/ggml-sycl/convert.cpp
# ggml/src/ggml-sycl/convert.hpp
# ggml/src/ggml-sycl/gemm.hpp
# ggml/src/ggml-sycl/ggml-sycl.cpp
# ggml/src/ggml-sycl/set_rows.cpp
# ggml/src/ggml-webgpu/ggml-webgpu-shader-lib.hpp
# ggml/src/ggml-webgpu/ggml-webgpu.cpp
# scripts/sync_vendor.py
# tests/CMakeLists.txt
# tests/test-chat.cpp
# tools/cli/cli.cpp
# tools/mtmd/CMakeLists.txt
# tools/server/CMakeLists.txt
2026-04-23 00:55:05 +08:00
manayang
7bfe60fdf9
mtmd, llama : Update HunyuanVL vision-language model support ( #22037 )
...
* mtmd, llama : add HunyuanVL vision-language model support
- add LLM_ARCH_HUNYUAN_VL with M-RoPE (XD-RoPE) support
- add PROJECTOR_TYPE_HUNYUANVL with PatchMerger vision encoder
- add HunyuanVL-specific M-RoPE position encoding for image tokens
- add GGUF conversion for HunyuanVL vision and text models
- add smoke test in tools/mtmd/tests.sh
* fix: fix HunyuanVL XD-RoPE h/w section order
* fix: Remove redundant code
* convert : fix HunyuanOCR / HunyuanVL conversion
- Tested locally: both HunyuanOCR and HunyuanVL-4B convert to GGUF successfully and produce correct inference output on Metal (F16 / Q8_0).
* clip : fix -Werror=misleading-indentation in bilinear resize
* fix CI: convert_hf_to_gguf type check error
- convert_hf_to_gguf.py: give HunyuanVLTextModel.__init__ an explicit `dir_model: Path` parameter so ty can infer the type for load_hparams instead of reporting `Unknown | None`.
---------
Co-authored-by: wendadawen <wendadawen@tencent.com>
2026-04-22 11:58:43 +02:00
Concedo
19a12bb080
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# CODEOWNERS
# common/CMakeLists.txt
# ggml/CMakeLists.txt
# ggml/src/ggml-webgpu/ggml-webgpu-shader-lib.hpp
# ggml/src/ggml-webgpu/ggml-webgpu.cpp
# ggml/src/ggml-webgpu/wgsl-shaders/common_decls.tmpl
# ggml/src/ggml-webgpu/wgsl-shaders/mul_mat_vec.wgsl
# scripts/sync-ggml.last
# tools/cli/cli.cpp
# tools/llama-bench/llama-bench.cpp
# tools/perplexity/perplexity.cpp
2026-04-21 18:53:03 +08:00
Georgi Gerganov
cd03ec7642
llama-ext : fix exports ( #22202 )
2026-04-21 11:04:46 +03:00
Georgi Gerganov
cfe9838d26
fit-params : refactor + add option to output estimated memory per device ( #22171 )
...
* fit-params : add option to output estimated memory per device
* cont : minor
* cont : refactor
* cont : move fit params implementation to libcommon
* cont : header
* cont : headers
* cont : codeowners
2026-04-21 09:54:36 +03:00
Johannes Gäßler
fb19f94c71
TP: fix 0-sized tensor slices, AllReduce fallback ( #21808 )
...
* TP: fix 0-sized tensor slices, AllReduce fallback
* fix layer structure <-> GPU count aliasing
* add missing std::fill
* fix CUDA device set, max ggml ctx size
2026-04-20 18:09:39 +02:00
Concedo
cd6788007e
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .github/workflows/build-cross.yml
# .github/workflows/build-self-hosted.yml
# .github/workflows/release.yml
# examples/llama.android/lib/src/main/cpp/CMakeLists.txt
# ggml/CMakeLists.txt
# ggml/src/ggml-rpc/CMakeLists.txt
# ggml/src/ggml-rpc/ggml-rpc.cpp
# ggml/src/ggml-sycl/mmvq.cpp
# ggml/src/ggml-webgpu/ggml-webgpu-shader-lib.hpp
# ggml/src/ggml-webgpu/ggml-webgpu.cpp
# scripts/sync_vendor.py
# tests/test-chat.cpp
# tests/test-mtmd-c-api.c
# tools/server/README.md
2026-04-20 20:19:11 +08:00
SamareshSingh
81df3f7cfa
fix: GLM-DSA crash in llama-tokenize when using vocab_only ( #22102 )
...
* llama: fix crash in print_info for GLM-DSA when vocab_only is set
* addressed code review comments
* cont : simplify
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-04-20 10:32:46 +03:00
Sigbjørn Skjæret
4f02d47339
model : refactor bias tensor variable names ( #22079 )
...
* refactor bias tensor variable names
* use create_tensor_qkv for jina-bert-v2
2026-04-18 20:12:00 +02:00
Johannes Gäßler
fd1c0ec3f0
llama: fit ctx size for CPU only ( #21568 )
2026-04-18 08:16:04 +02:00
Concedo
79882d669a
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .github/workflows/build-android.yml
# .github/workflows/build.yml
# .github/workflows/release.yml
# CMakeLists.txt
# CODEOWNERS
# common/CMakeLists.txt
# common/common.h
# docs/ops.md
# docs/ops/Metal.csv
# examples/batched/CMakeLists.txt
# examples/convert-llama2c-to-ggml/CMakeLists.txt
# examples/debug/CMakeLists.txt
# examples/diffusion/CMakeLists.txt
# examples/embedding/CMakeLists.txt
# examples/eval-callback/CMakeLists.txt
# examples/gen-docs/CMakeLists.txt
# examples/idle/CMakeLists.txt
# examples/lookahead/CMakeLists.txt
# examples/lookup/CMakeLists.txt
# examples/parallel/CMakeLists.txt
# examples/passkey/CMakeLists.txt
# examples/retrieval/CMakeLists.txt
# examples/save-load-state/CMakeLists.txt
# examples/speculative-simple/CMakeLists.txt
# examples/speculative/CMakeLists.txt
# examples/sycl/CMakeLists.txt
# examples/training/CMakeLists.txt
# ggml/src/ggml-hexagon/htp/hmx-matmul-ops.c
# ggml/src/ggml-hexagon/htp/htp-ops.h
# ggml/src/ggml-hexagon/htp/main.c
# ggml/src/ggml-opencl/CMakeLists.txt
# ggml/src/ggml-opencl/ggml-opencl.cpp
# ggml/src/ggml-opencl/kernels/cvt.cl
# pocs/vdot/CMakeLists.txt
# src/CMakeLists.txt
# tests/CMakeLists.txt
# tests/test-quantize-stats.cpp
# tools/batched-bench/CMakeLists.txt
# tools/cli/CMakeLists.txt
# tools/cli/cli.cpp
# tools/completion/CMakeLists.txt
# tools/cvector-generator/CMakeLists.txt
# tools/cvector-generator/cvector-generator.cpp
# tools/export-lora/CMakeLists.txt
# tools/gguf-split/CMakeLists.txt
# tools/gguf-split/gguf-split.cpp
# tools/imatrix/CMakeLists.txt
# tools/llama-bench/CMakeLists.txt
# tools/llama-bench/llama-bench.cpp
# tools/mtmd/CMakeLists.txt
# tools/perplexity/CMakeLists.txt
# tools/quantize/CMakeLists.txt
# tools/quantize/quantize.cpp
# tools/results/CMakeLists.txt
# tools/server/CMakeLists.txt
# tools/tokenize/CMakeLists.txt
# tools/tts/CMakeLists.txt
2026-04-17 22:37:37 +08:00
Concedo
9a38091207
support q5_1 kv
2026-04-17 17:06:15 +08:00
Eric Zhang
fcc7508759
model : Gemma4 model type detection ( #22027 )
...
* model : Gemma4 model type detection
* model : Gemma4 model type detection
2026-04-17 10:07:11 +02:00
Xuan-Son Nguyen
089dd41fe3
cmake: use glob to collect src/models sources ( #22005 )
2026-04-16 23:25:16 +02:00
Xuan-Son Nguyen
4fbdabdc61
model: using single llm_build per arch ( #21970 )
...
* model: using single llm_build per arch
* fix merge
* nits
2026-04-16 21:10:22 +02:00
PikaPikachu
9db77a020c
model : refactor QKV into common build_qkv and create_tensor_qkv helpers ( #21245 )
...
* model : refactor QKV into common build_qkv and create_tensor_qkv helpers
* model : extend build_qkv to bert/mpt/dbrx/olmo/lfm2/nemotron-h/granite-hybrid/gemma3n-iswa/t5-dec and fix wqkv_s
2026-04-16 17:41:34 +02:00
Sigbjørn Skjæret
f772f6e434
model : support NVFP4 tensors for Gemma4 ( #21971 )
...
* support nvfp4 tensors for Gemma4
* add wo_s to build_attn
* add wo_s to build_attn
* fix glm4
2026-04-16 16:51:47 +02:00
Concedo
ae292c496e
handle SWA conflicting with rewind, increased default SWA padding.
2026-04-16 17:00:26 +08:00
Concedo
0251c6dbde
added swa padding controls
2026-04-16 16:21:48 +08:00
Concedo
ac29e6f0c0
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .devops/vulkan.Dockerfile
# .github/workflows/build-self-hosted.yml
# .github/workflows/build.yml
# .github/workflows/release.yml
# .github/workflows/server-self-hosted.yml
# docs/build.md
# ggml/src/ggml-hexagon/htp/CMakeLists.txt
# ggml/src/ggml-hexagon/htp/hex-utils.h
# ggml/src/ggml-hexagon/htp/hmx-matmul-ops.c
# ggml/src/ggml-hexagon/htp/hmx-utils.h
# ggml/src/ggml-hexagon/htp/htp-ctx.h
# ggml/src/ggml-hexagon/htp/htp-ops.h
# ggml/src/ggml-hexagon/htp/hvx-base.h
# ggml/src/ggml-hexagon/htp/main.c
# ggml/src/ggml-webgpu/ggml-webgpu.cpp
# tests/test-backend-ops.cpp
# tests/test-mtmd-c-api.c
2026-04-15 15:15:19 +08:00
Xuan-Son Nguyen
fae3a28070
ggml : remove ggml-ext.h ( #21869 )
...
* ggml: correct placement of ggml-ext.h
* ggml : remove ggml-ext.h
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-04-14 17:32:58 +03:00
Concedo
5361b45fba
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# ggml/src/ggml-opencl/CMakeLists.txt
# ggml/src/ggml-opencl/ggml-opencl.cpp
# ggml/src/ggml-opencl/kernels/cvt.cl
# requirements/requirements-tool_bench.txt
2026-04-12 16:22:26 +08:00
Johannes Gäßler
865ff06b2f
TP: fix Qwen 3 Next data split ( #21732 )
2026-04-11 09:23:42 +02:00
Concedo
4c860ae4ae
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# common/download.cpp
# docs/backend/OPENVINO.md
# docs/backend/snapdragon/CMakeUserPresets.json
# docs/backend/snapdragon/README.md
# ggml/src/ggml-hexagon/ggml-hexagon.cpp
# ggml/src/ggml-hexagon/htp/act-ops.c
# ggml/src/ggml-hexagon/htp/argsort-ops.c
# ggml/src/ggml-hexagon/htp/binary-ops.c
# ggml/src/ggml-hexagon/htp/cpy-ops.c
# ggml/src/ggml-hexagon/htp/cumsum-ops.c
# ggml/src/ggml-hexagon/htp/flash-attn-ops.c
# ggml/src/ggml-hexagon/htp/get-rows-ops.c
# ggml/src/ggml-hexagon/htp/hex-utils.h
# ggml/src/ggml-hexagon/htp/hmx-matmul-ops.c
# ggml/src/ggml-hexagon/htp/hmx-ops.h
# ggml/src/ggml-hexagon/htp/htp-ctx.h
# ggml/src/ggml-hexagon/htp/htp-ops.h
# ggml/src/ggml-hexagon/htp/htp_iface.idl
# ggml/src/ggml-hexagon/htp/main.c
# ggml/src/ggml-hexagon/htp/matmul-ops.c
# ggml/src/ggml-hexagon/htp/repeat-ops.c
# ggml/src/ggml-hexagon/htp/rope-ops.c
# ggml/src/ggml-hexagon/htp/set-rows-ops.c
# ggml/src/ggml-hexagon/htp/softmax-ops.c
# ggml/src/ggml-hexagon/htp/ssm-conv.c
# ggml/src/ggml-hexagon/htp/sum-rows-ops.c
# ggml/src/ggml-hexagon/htp/unary-ops.c
# ggml/src/ggml-webgpu/ggml-webgpu-shader-lib.hpp
# ggml/src/ggml-webgpu/ggml-webgpu.cpp
# ggml/src/ggml-webgpu/wgsl-shaders/common_decls.tmpl
# ggml/src/ggml-webgpu/wgsl-shaders/flash_attn.wgsl
# ggml/src/ggml-webgpu/wgsl-shaders/get_rows.wgsl
# ggml/src/ggml-webgpu/wgsl-shaders/mul_mat.wgsl
# ggml/src/ggml-webgpu/wgsl-shaders/mul_mat_decls.tmpl
# ggml/src/ggml-webgpu/wgsl-shaders/mul_mat_vec.wgsl
# ggml/src/ggml-webgpu/wgsl-shaders/unary.wgsl
# models/templates/google-gemma-4-31B-it-interleaved.jinja
# models/templates/google-gemma-4-31B-it.jinja
# scripts/snapdragon/adb/run-bench.sh
# scripts/snapdragon/adb/run-cli.sh
# scripts/snapdragon/adb/run-completion.sh
# scripts/snapdragon/adb/run-tool.sh
# scripts/snapdragon/windows/run-bench.ps1
# scripts/snapdragon/windows/run-cli.ps1
# scripts/snapdragon/windows/run-mtmd.ps1
# scripts/snapdragon/windows/run-tool.ps1
# tests/test-backend-ops.cpp
# tests/test-chat.cpp
# tools/llama-bench/llama-bench.cpp
2026-04-11 11:19:32 +08:00
Concedo
a165a73120
Merge commit 'd6f3030047' into concedo_experimental
...
# Conflicts:
# examples/model-conversion/scripts/causal/run-casual-gen-embeddings-org.py
# examples/model-conversion/scripts/utils/semantic_check.py
# ggml/CMakeLists.txt
# ggml/src/CMakeLists.txt
# ggml/src/ggml-cann/ggml-cann.cpp
# ggml/src/ggml-cpu/amx/amx.cpp
# ggml/src/ggml-cuda/CMakeLists.txt
# ggml/src/ggml-hexagon/ggml-hexagon.cpp
# ggml/src/ggml-hip/CMakeLists.txt
# ggml/src/ggml-opencl/ggml-opencl.cpp
# ggml/src/ggml-openvino/ggml-openvino.cpp
# ggml/src/ggml-rpc/ggml-rpc.cpp
# ggml/src/ggml-sycl/ggml-sycl.cpp
# ggml/src/ggml-virtgpu/ggml-backend-buffer.cpp
# ggml/src/ggml-virtgpu/ggml-backend.cpp
# ggml/src/ggml-webgpu/ggml-webgpu.cpp
# ggml/src/ggml-zdnn/ggml-zdnn.cpp
# ggml/src/ggml-zendnn/ggml-zendnn.cpp
# pyproject.toml
# requirements/requirements-convert_legacy_llama.txt
# requirements/requirements-tool_bench.txt
# src/llama-model.cpp
# src/llama.cpp
# tests/test-llama-archs.cpp
# tests/test-tokenizer-0.py
# tests/test-tokenizer-random.py
# tools/llama-bench/llama-bench.cpp
# tools/perplexity/perplexity.cpp
2026-04-11 11:10:55 +08:00
Concedo
8b90bfe094
Merge commit '4ef9301e4d' into concedo_experimental
...
# Conflicts:
# .github/labeler.yml
# docs/multimodal.md
# embd_res/ggml-vocab-gemma-4.gguf
# embd_res/ggml-vocab-gemma-4.gguf.inp
# embd_res/ggml-vocab-gemma-4.gguf.out
# ggml/src/ggml-sycl/fattn-tile.cpp
# ggml/src/ggml-sycl/fattn-tile.hpp
# ggml/src/ggml-sycl/fattn-vec.hpp
# ggml/src/ggml-sycl/fattn.cpp
# ggml/src/ggml-sycl/ggml-sycl.cpp
# ggml/src/ggml-sycl/template-instances/fattn-vec-instance-f16-f16.cpp
# ggml/src/ggml-sycl/template-instances/fattn-vec-instance-f16-q4_0.cpp
# ggml/src/ggml-sycl/template-instances/fattn-vec-instance-f16-q4_1.cpp
# ggml/src/ggml-sycl/template-instances/fattn-vec-instance-f16-q5_0.cpp
# ggml/src/ggml-sycl/template-instances/fattn-vec-instance-f16-q5_1.cpp
# ggml/src/ggml-sycl/template-instances/fattn-vec-instance-f16-q8_0.cpp
# ggml/src/ggml-sycl/template-instances/fattn-vec-instance-q4_0-f16.cpp
# ggml/src/ggml-sycl/template-instances/fattn-vec-instance-q4_0-q4_0.cpp
# ggml/src/ggml-sycl/template-instances/fattn-vec-instance-q4_0-q4_1.cpp
# ggml/src/ggml-sycl/template-instances/fattn-vec-instance-q4_0-q5_0.cpp
# ggml/src/ggml-sycl/template-instances/fattn-vec-instance-q4_0-q5_1.cpp
# ggml/src/ggml-sycl/template-instances/fattn-vec-instance-q4_0-q8_0.cpp
# ggml/src/ggml-sycl/template-instances/fattn-vec-instance-q4_1-f16.cpp
# ggml/src/ggml-sycl/template-instances/fattn-vec-instance-q4_1-q4_0.cpp
# ggml/src/ggml-sycl/template-instances/fattn-vec-instance-q4_1-q4_1.cpp
# ggml/src/ggml-sycl/template-instances/fattn-vec-instance-q4_1-q5_0.cpp
# ggml/src/ggml-sycl/template-instances/fattn-vec-instance-q4_1-q5_1.cpp
# ggml/src/ggml-sycl/template-instances/fattn-vec-instance-q4_1-q8_0.cpp
# ggml/src/ggml-sycl/template-instances/fattn-vec-instance-q5_0-f16.cpp
# ggml/src/ggml-sycl/template-instances/fattn-vec-instance-q5_0-q4_0.cpp
# ggml/src/ggml-sycl/template-instances/fattn-vec-instance-q5_0-q4_1.cpp
# ggml/src/ggml-sycl/template-instances/fattn-vec-instance-q5_0-q5_0.cpp
# ggml/src/ggml-sycl/template-instances/fattn-vec-instance-q5_0-q5_1.cpp
# ggml/src/ggml-sycl/template-instances/fattn-vec-instance-q5_0-q8_0.cpp
# ggml/src/ggml-sycl/template-instances/fattn-vec-instance-q5_1-f16.cpp
# ggml/src/ggml-sycl/template-instances/fattn-vec-instance-q5_1-q4_0.cpp
# ggml/src/ggml-sycl/template-instances/fattn-vec-instance-q5_1-q4_1.cpp
# ggml/src/ggml-sycl/template-instances/fattn-vec-instance-q5_1-q5_0.cpp
# ggml/src/ggml-sycl/template-instances/fattn-vec-instance-q5_1-q5_1.cpp
# ggml/src/ggml-sycl/template-instances/fattn-vec-instance-q5_1-q8_0.cpp
# ggml/src/ggml-sycl/template-instances/fattn-vec-instance-q8_0-f16.cpp
# ggml/src/ggml-sycl/template-instances/fattn-vec-instance-q8_0-q4_0.cpp
# ggml/src/ggml-sycl/template-instances/fattn-vec-instance-q8_0-q4_1.cpp
# ggml/src/ggml-sycl/template-instances/fattn-vec-instance-q8_0-q5_0.cpp
# ggml/src/ggml-sycl/template-instances/fattn-vec-instance-q8_0-q5_1.cpp
# ggml/src/ggml-sycl/template-instances/fattn-vec-instance-q8_0-q8_0.cpp
# tests/CMakeLists.txt
# tests/test-jinja.cpp
# tools/mtmd/CMakeLists.txt
2026-04-11 09:38:50 +08:00
MoonRide303
e62fa13c24
model : make Gemma 4 shared-KV tail attn_k tensors optional on load ( #21739 )
2026-04-10 21:45:50 +02:00
Johannes Gäßler
d6f3030047
ggml: backend-agnostic tensor parallelism (experimental) ( #19378 )
...
* ggml: backend-agnostic tensor parallelism
* support for GPT-OSS, Qwen 3 MoE
* partial Vulkan fix
* add support for 4/8 GPUs
* unconditional peer access
* re-use buffers + ggml contexts
* fix output pattern
* NCCL support
* GGML: HIP: add RCCL support
* Remove shfl and AllReduce from backend interface
* move allocation workaround out of ggml-alloc.c
* 2d tensor set/get support
* Fix the seg fault without NCCL
* Apply suggestion from JohannesGaessler
* support for tensor dims % n_devs != 0
* fix view_offs scaling
* arbitrary num. of GPUs/tensor split
* fix compilation
* better granularity estimate
* Support device-specific host buffer types if all underlying backends expose the same type. This allows using pinned memory instead of pageable memory for CUDA.
* Fix compilation errors.
* partial Qwen 3 Next support
* Fix qwen3 30b (#8 )
* Fix crash with Qwen-30B-A3B Q4_0
Qwen-30B-A3B Q4_0 has an intermediate dimension of 768. Using a granularity of 256 forces an uneven split between GPUs, which is not supported by the current implementation.
* Decide block size based on tensor quantization type
* Fix crashes due to KV cache serialization (#9 )
KV cache serialization requires non-zero offsets on the tensor. Add support in the meta backend to set/get a tensor with a non-zero offset.
* metal : fix build (#7 )
* static memory allocations, fix usage count
* fix tensor granularity
* more even memory distribution
* use BF16 for allreduce
* rebase fixup
* better error message for unsupported architectures
* Fix device mismatch during scatter of allReduce. (#11 )
There is a mismatch between the dst buffer device and the backend device, causing the use of sync copies.
* Enable the previous allreduce implementation. It is better in both perf and stability (#12 )
* delay AllReduce for Moe for less I/O
* build : clean-up compile warnings
* backend : move most of the meta backend API to ggml-backend-impl.h
* cont : hide unused public API in the implementation
* llama : use llama_device + remove ggml_backend_dev_is_meta()
* ggml-backend : remove unused alloc include
* minor : remove regex include
* ggml : introduce ggml-ext.h for staging new APIs
* rebase fixup
* fix tests
* llama : more robust logic for determining Meta devices (#16 )
* llama : more robust logic for determining Meta devices
* cont : fix devs size check
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* cont : fix log type
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
---------
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* disable roundtrip for meta backend
* fix arch selection
* Qwen 3.5 support
* fix Gemma 4 MoE
* fix OpenVino, SYCL
* fix test-llama-archs for CPU-only builds
* Fix Qwen 3.5 MoE
* disable meta backend tests for WebGPU
* tests : filter CPU-based devices from the Meta backend tests (#17 )
* meta : formatting, naming, indentation (#18 )
* formatting : llama-model.cpp
* formatting : ggml-ext.h
* formatting : ggml-backend-meta.cpp
* meta : add TODO
* add documentation
* better error messages
* fix GPT-OSS
---------
Co-authored-by: Carl Philipp Klemm <carl@uvos.xyz>
Co-authored-by: Gaurav Garg <gaugarg@nvidia.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-04-09 16:42:19 +02:00
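The Qwen-30B-A3B granularity fix described in the tensor-parallelism commit above reduces to simple arithmetic; a hypothetical sketch (variable names are illustrative, not from the codebase):

```python
# Qwen-30B-A3B Q4_0: intermediate dimension 768, per the commit message above
dim = 768
granularity = 256  # fixed split granularity in elements
n_gpus = 2

blocks = dim // granularity  # 768 / 256 = 3 blocks
# 3 blocks cannot be divided evenly between 2 GPUs, hence the uneven split
assert blocks % n_gpus != 0

# deciding block size from the quantization type instead:
q4_0_block = 32  # Q4_0 block size in elements
assert (dim // q4_0_block) % n_gpus == 0  # 24 blocks split evenly
```

This is why the fix chooses the split block size based on the tensor's quantization type rather than a fixed granularity of 256.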
Xuan-Son Nguyen
057dba336e
model: fix multimodal padding token for gemma3n/gemma4 ( #21625 )
...
* model: fix multimodal padding token for gemma3n/gemma4
* nits
2026-04-09 12:18:23 +02:00
Concedo
c82c0b463a
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .github/labeler.yml
# .github/workflows/release.yml
# examples/debug/debug.cpp
# ggml/src/ggml-cuda/common.cuh
# ggml/src/ggml-cuda/mmq.cuh
# ggml/src/ggml-webgpu/ggml-webgpu.cpp
# src/llama-vocab.cpp
# tests/test-backend-ops.cpp
# tests/test-chat.cpp
# tests/test-json-schema-to-grammar.cpp
# tools/mtmd/CMakeLists.txt
2026-04-09 17:45:04 +08:00
Piotr Wilkin (ilintar)
0ec191e1d7
vocab: add gemma4 tokenizer tests, fix edge case ( #21534 )
...
* YATF (Yet Another Tokenizer Fix) for Gemma 4. With tests!
* Remove unnecessary hash from update script.
* minor: move constant
2026-04-09 11:41:14 +02:00
Concedo
5529748a01
Merge commit 'de1aa6fa73' into concedo_experimental
...
# Conflicts:
# docs/build.md
# docs/ops.md
# docs/ops/WebGPU.csv
# ggml/src/ggml-sycl/dequantize.hpp
# ggml/src/ggml-sycl/dmmv.cpp
# ggml/src/ggml-sycl/ggml-sycl.cpp
# ggml/src/ggml-sycl/mmvq.cpp
# ggml/src/ggml-sycl/quants.hpp
# ggml/src/ggml-sycl/vecdotq.hpp
# ggml/src/ggml-webgpu/ggml-webgpu-shader-lib.hpp
# ggml/src/ggml-webgpu/ggml-webgpu.cpp
# ggml/src/ggml-webgpu/wgsl-shaders/mul_mat_decls.tmpl
# tests/test-backend-ops.cpp
# tests/test-quantize-fns.cpp
2026-04-09 17:16:33 +08:00
Aldehir Rojas
d9a12c82f0
vocab : remove </s> eog token if gemma4 ( #21492 )
2026-04-08 09:53:06 -05:00
Erik Scholz
3ba12fed0a
kv-cache : extend cache quantization checks ( #21586 )
...
to also check for enabled flash attention, instead of just auto.
2026-04-08 16:08:57 +03:00
Georgi Gerganov
5764d7c6a6
gemma : perform per-layer projections in the first layer ( #21612 )
...
* gemma : reduce graph splits by keeping per-layer ops in the input layer
* gemma : put the per-layer proj in the first layer
* cont : move the projection before the layer loop
2026-04-08 16:06:30 +03:00
Georgi Gerganov
4eb19514dd
kv-cache : support attention rotation for heterogeneous iSWA ( #21513 )
...
* kv-cache : support attention rotation for heterogeneous iSWA
* cont : remove assert
2026-04-07 20:31:28 +03:00
Son H. Nguyen
0d049d6a92
unicode : add custom Qwen2 regex handler to fix segfault on long input ( #21257 )
...
* unicode : add custom Qwen2 regex handler to fix segfault on long input
std::regex uses recursive backtracking internally, which causes a stack
overflow (segfault) when tokenizing long sequences of repeated characters
(e.g. 43K 'A's). The Qwen2 tokenizer regex differs from Llama3 only in
the digit pattern (\p{N} vs \p{N}{1,3}), so it was falling through to
the std::regex fallback path instead of using a custom handler.
Add unicode_regex_split_custom_qwen2() following the established pattern
used by gpt2, llama3, kimi_k2, and afmoe custom handlers.
Closes: https://github.com/ggml-org/llama.cpp/issues/21113
* cont : remove TODO comment
* cont : update comment to reflect original regex
* use the correct regex in the comment this time... [no ci]
---------
Co-authored-by: Aldehir Rojas <hello@alde.dev>
2026-04-07 16:13:38 +03:00
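The one-character regex difference noted above (`\p{N}{1,3}` vs `\p{N}`) can be illustrated with Python's `re` module, using `\d` as a stand-in for `\p{N}` (a simplified sketch, not the tokenizer's actual code):

```python
import re

digits = "1234567"

# llama3-style digit pattern: runs of at most 3 digits per match
assert re.findall(r"\d{1,3}", digits) == ["123", "456", "7"]

# qwen2-style digit pattern: one digit per match
assert re.findall(r"\d", digits) == ["1", "2", "3", "4", "5", "6", "7"]
```

Because of this small difference, the Qwen2 pattern did not match any existing custom handler and fell through to the recursive `std::regex` fallback, which overflows the stack on very long repeated input.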
Johannes Gäßler
a8ec0df461
llama: remove per-arch tensor name lists ( #21531 )
2026-04-07 15:02:03 +02:00
Concedo
15d269197e
Merge commit '506200cf8b' into concedo_experimental
...
# Conflicts:
# docs/multimodal.md
# scripts/compare-llama-bench.py
# src/llama-vocab.cpp
# tools/llama-bench/README.md
# tools/llama-bench/llama-bench.cpp
2026-04-07 14:58:36 +08:00
Pasha Khosravi
2e1f0a889e
ggml: add Q1_0 1-bit quantization support (CPU) ( #21273 )
...
* ggml: add Q1_0 and Q1_0_g128 1-bit quantization support (CPU)
* add generic fallback for x86
* remove Q1_0 (group size 32)
* rename Q1_0_g128 => Q1_0
* fix Q1_0 LlamaFileType Enum
* Fix trailing spaces; add generic fallback for other backends
* Apply suggestions from code review
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* fix \r\n spacing + arch-fallback
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-04-06 20:55:21 +02:00
Aldehir Rojas
4aa962e2b0
vocab : add byte token handling to BPE detokenizer for Gemma4 ( #21488 )
2026-04-06 09:08:37 -05:00
Concedo
a395af65db
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .github/workflows/build-riscv.yml
# .github/workflows/build.yml
# ggml/src/ggml-hexagon/htp/argsort-ops.c
# ggml/src/ggml-sycl/fattn-tile.hpp
# tools/mtmd/CMakeLists.txt
2026-04-06 20:56:02 +08:00
Georgi Gerganov
400ac8e194
convert : set "add bos" == True for Gemma 4 ( #21500 )
...
* convert : set "add bos" == True for Gemma 4
* cont : handle old GGUFs
2026-04-06 13:52:07 +03:00
anchortense
58190cc84d
llama : correct platform-independent loading of BOOL metadata ( #21428 )
...
* model-loader : fix GGUF bool array conversion
* model-loader : fix remaining GGUF bool pointer uses
2026-04-06 01:40:38 +02:00
Richard Davison
af76639f72
model : add HunyuanOCR support ( #21395 )
...
* HunyuanOCR: add support for text and vision models
- Add HunyuanOCR vision projector (perceiver-based) with Conv2d merge
- Add separate HUNYUAN_OCR chat template (content-before-role format)
- Handle HunyuanOCR's invalid pad_token_id=-1 in converter
- Fix EOS/EOT token IDs from generation_config.json
- Support xdrope RoPE scaling type
- Add tensor mappings for perceiver projector (mm.before_rms, mm.after_rms, etc.)
- Register HunYuanVLForConditionalGeneration for both text and mmproj conversion
* fix proper mapping
* Update gguf-py/gguf/tensor_mapping.py
Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
* Update tools/mtmd/clip.cpp
Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
* address comments
* update
* Fix typecheck
* Update convert_hf_to_gguf.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update convert_hf_to_gguf.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update convert_hf_to_gguf.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update convert_hf_to_gguf.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
---------
Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-04-05 23:32:14 +02:00
Concedo
9b1f1bbf35
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .github/workflows/build-vulkan.yml
# .github/workflows/docker.yml
# embd_res/templates/google-gemma-4-31B-it-interleaved.jinja
# embd_res/templates/google-gemma-4-31B-it.jinja
# tests/test-chat.cpp
2026-04-05 18:46:23 +08:00