Commit graph

1362 commits

Author SHA1 Message Date
Concedo
cc82c3164e Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.devops/intel.Dockerfile
#	.github/workflows/build-cross.yml
#	.github/workflows/build-sycl.yml
#	.github/workflows/build.yml
#	.github/workflows/editorconfig.yml
#	.github/workflows/release.yml
#	cmake/riscv64-spacemit-linux-gnu-gcc.cmake
#	docs/backend/OPENVINO.md
#	docs/backend/SYCL.md
#	docs/build-riscv64-spacemit.md
#	docs/ops.md
#	docs/ops/WebGPU.csv
#	embd_res/ggml-vocab-qwen35.gguf
#	embd_res/ggml-vocab-qwen35.gguf.inp
#	embd_res/ggml-vocab-qwen35.gguf.out
#	examples/model-conversion/Makefile
#	ggml/CMakeLists.txt
#	ggml/src/ggml-cpu/CMakeLists.txt
#	ggml/src/ggml-hexagon/ggml-hexagon.cpp
#	ggml/src/ggml-hexagon/htp/hmx-flash-attn-ops.c
#	ggml/src/ggml-hexagon/htp/hmx-matmul-ops.c
#	ggml/src/ggml-hexagon/htp/hmx-utils.h
#	ggml/src/ggml-hexagon/htp/htp-ops.h
#	ggml/src/ggml-hexagon/htp/hvx-utils.h
#	ggml/src/ggml-hexagon/htp/main.c
#	ggml/src/ggml-hexagon/htp/unary-ops.c
#	ggml/src/ggml-opencl/CMakeLists.txt
#	ggml/src/ggml-opencl/ggml-opencl.cpp
#	ggml/src/ggml-opencl/kernels/cvt.cl
#	ggml/src/ggml-sycl/CMakeLists.txt
#	ggml/src/ggml-sycl/common.cpp
#	ggml/src/ggml-sycl/common.hpp
#	ggml/src/ggml-sycl/ggml-sycl.cpp
#	ggml/src/ggml-webgpu/ggml-webgpu-shader-lib.hpp
#	ggml/src/ggml-webgpu/ggml-webgpu.cpp
#	ggml/src/ggml-webgpu/wgsl-shaders/common_decls.tmpl
#	ggml/src/ggml-webgpu/wgsl-shaders/flash_attn_tile.wgsl
#	ggml/src/ggml-webgpu/wgsl-shaders/flash_attn_vec_reduce.wgsl
#	ggml/src/ggml-webgpu/wgsl-shaders/flash_attn_vec_split.wgsl
#	ggml/src/ggml-webgpu/wgsl-shaders/get_rows.wgsl
#	ggml/src/ggml-webgpu/wgsl-shaders/mul_mat_decls.tmpl
#	ggml/src/ggml-webgpu/wgsl-shaders/mul_mat_vec_acc.tmpl
#	ggml/src/ggml-webgpu/wgsl-shaders/unary.wgsl
#	ggml/src/ggml-zendnn/CMakeLists.txt
#	ggml/src/ggml-zendnn/ggml-zendnn.cpp
#	scripts/snapdragon/adb/run-completion.sh
#	tests/CMakeLists.txt
#	tools/cli/README.md
#	tools/completion/README.md
#	tools/mtmd/clip-impl.h
#	tools/mtmd/clip.cpp
#	tools/mtmd/clip.h
#	tools/server/README.md
2026-05-14 19:04:04 +08:00
Kabir Potdar
42532afff4
unicode,test: add Qwen3.5 non-backtracking tokenizer handler and regr… (#22110)
* unicode,test: add Qwen3.5 non-backtracking tokenizer handler and regression tests

- Add unicode_regex_split_custom_qwen35() to [src/unicode.cpp](src/unicode.cpp), a non-backtracking handler for Qwen3.5's [\p{L}\p{M}]+ regex (letters + combining marks).
- Register the handler in the custom tokenizer dispatch table to prevent stack overflows on long inputs (fixes #21919).
- Add [models/ggml-vocab-qwen35.gguf](models/ggml-vocab-qwen35.gguf) (test vocab), [models/ggml-vocab-qwen35.gguf.inp](models/ggml-vocab-qwen35.gguf.inp) (test cases), and [models/ggml-vocab-qwen35.gguf.out](models/ggml-vocab-qwen35.gguf.out) (expected output) for regression testing.
- Update [tests/CMakeLists.txt](tests/CMakeLists.txt) to include the new test entry.

This mirrors the Qwen2 fix (commit 0d049d6), but adapts for Qwen3.5's regex. Ensures robust Unicode tokenization and prevents std::regex stack overflows.

Closes #21919.

* fix: enhance regex handling for Qwen3.5 tokenizer to include accent marks

* cont : remove trailing whitespace

---------

Co-authored-by: Kabir <kabir@example.com>
Co-authored-by: Alde Rojas <hello@alde.dev>
2026-05-14 11:03:40 +02:00
Concedo
f7923b261f need to fix cuda compile. Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.github/workflows/python-type-check.yml
#	examples/speculative-simple/README.md
#	examples/speculative-simple/speculative-simple.cpp
#	ggml/src/ggml-cuda/im2col.cu
#	ggml/src/ggml-opencl/CMakeLists.txt
#	ggml/src/ggml-opencl/ggml-opencl.cpp
#	ggml/src/ggml-opencl/kernels/cvt.cl
#	tests/test-backend-ops.cpp
#	tools/cli/README.md
#	tools/mtmd/CMakeLists.txt
#	tools/server/README.md
2026-05-12 20:47:07 +08:00
Georgi Gerganov
68e7ea3eab
spec : parallel drafting support (#22838)
* spec : refactor

* spec : drop support for incompatible vocabs

* spec : update common_speculative_init()

* cont : pass seq_id

* cont : dedup ctx_seq_rm_type

* server : sketch the ctx_dft decode loop

* server : draft prompt cache and checkpoints

* server : improve ctx names

* server, spec : transition to unified spec context

* cont : sync main and drft contexts

* cont : async drft eval when possible

* cont : handle non-ckpt models

* cont : pass correct n_past for drafting

* cont : process images throught the draft context

* spec : handle draft running out of context

* server : fix mtmd draft processing

* server : fix URL for draft model

* server : add comment

* server : clean-up + dry

* speculative-simple : update

* spec : fix n_past type

* server : fix slot ctx_drft ptr

* tools : update readme

* naming : improve consistency

* spec : refactor for multi-sequence speculative context

* cont : prepare params

* cont : prepare params

* spec : support parallel drafts

* server : support parallel drafting

* llama : reuse device buffers when possible

* server, spec : clean-up

* cont : clean-up

* cont : minor

* spec : reset `drafting` flag at the end

* spec : introduce `common_speculative_process()`

* spec : allow for multiple spec types (chain of speculators)

* replace old type field of type common_speculative_type in the
  common_params_speculative struct with a vector to allow multiple
  types to be specified

* introduce common_get_enabled_speculative_impls(const std::vector<enum common_speculative_type>)
  to figure out which implementations the user has enabled

* introduce common_speculative_type_from_names(const std::vector<std::string> & names)
  to parse the already user provided spec types

* all speculators run sequentially, best one wins (we verify its drafted tokens)

* maximize expected accepted tokens for current round by calculating the
  product between the probability of accepting current token (n_acc_tokens / n_gen_drafts)
  and the draft's length

---------

Co-authored-by: Petros Sideris <petros.sideris@nokia.com>
2026-05-11 19:09:43 +03:00
Concedo
2771e16fbc Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.devops/intel.Dockerfile
#	.devops/nix/package.nix
#	.gitignore
#	docs/backend/SYCL.md
#	docs/ops.md
#	docs/ops/SYCL.csv
#	ggml/CMakeLists.txt
#	ggml/src/ggml-cuda/fattn.cu
#	ggml/src/ggml-cuda/ggml-cuda.cu
#	ggml/src/ggml-sycl/CMakeLists.txt
#	ggml/src/ggml-sycl/common.hpp
#	ggml/src/ggml-sycl/convert.cpp
#	ggml/src/ggml-sycl/dequantize.hpp
#	ggml/src/ggml-sycl/fattn-common.hpp
#	ggml/src/ggml-sycl/getrows.cpp
#	ggml/src/ggml-sycl/ggml-sycl.cpp
#	ggml/src/ggml-sycl/im2col.cpp
#	ggml/src/ggml-sycl/im2col.hpp
#	ggml/src/ggml-sycl/mmvq.cpp
#	ggml/src/ggml-sycl/quants.hpp
#	ggml/src/ggml-sycl/vecdotq.hpp
#	ggml/src/ggml-virtgpu/ggml-backend-device.cpp
#	scripts/sync-ggml.last
#	scripts/sync_vendor.py
#	tests/test-backend-ops.cpp
2026-05-11 16:18:28 +08:00
Concedo
9b0b36b5ef Merge commit '66001722aa' into concedo_experimental
# Conflicts:
#	README.md
#	docs/ops.md
#	docs/ops/SYCL.csv
#	examples/sycl/start-svr.sh
#	ggml/src/ggml-hexagon/ggml-hexagon.cpp
#	ggml/src/ggml-hexagon/htp/CMakeLists.txt
#	ggml/src/ggml-hexagon/htp/htp-ctx.h
#	ggml/src/ggml-hexagon/htp/htp-ops.h
#	ggml/src/ggml-hexagon/htp/main.c
#	ggml/src/ggml-hexagon/htp/unary-ops.c
#	ggml/src/ggml-opencl/CMakeLists.txt
#	ggml/src/ggml-opencl/ggml-opencl.cpp
#	ggml/src/ggml-opencl/kernels/cvt.cl
#	ggml/src/ggml-sycl/gated_delta_net.hpp
#	ggml/src/ggml-sycl/ggml-sycl.cpp
#	ggml/src/ggml-sycl/pad.cpp
#	ggml/src/ggml-sycl/ssm_conv.cpp
#	tests/test-backend-ops.cpp
#	tests/test-reasoning-budget.cpp
#	tools/server/README.md
#	tools/server/webui/src/lib/constants/settings-config.ts
2026-05-11 15:40:10 +08:00
Sigbjørn Skjæret
5755a100cd
model : fix model type check for granite/llama3 and deepseek2/glm4.7 lite (#22870) 2026-05-10 08:44:29 +02:00
Sumit Chatterjee
1e5ad35d56
model : add sarvam_moe architecture support (#20275) 2026-05-09 16:31:50 +02:00
ynankani
9f5f0e689c
model : support Gemma4_26B_A4B_NVFP4 (#22804)
* Gemma4_26B_A4B_NvFp4 hf checkpoint convert to gguf format fixes

Signed-off-by: ynankani <ynankani@nvidia.com>

* Apply suggestions from code review

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Address review comments

Signed-off-by: ynankani <ynankani@nvidia.com>

* fix CRLF

Signed-off-by: ynankani <ynankani@nvidia.com>

* Lint error fix

Signed-off-by: ynankani <ynankani@nvidia.com>

---------

Signed-off-by: ynankani <ynankani@nvidia.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-05-08 20:42:09 +02:00
Concedo
eb30b29d69 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.github/workflows/gguf-publish.yml
#	CODEOWNERS
#	examples/sycl/test.sh
#	pyproject.toml
#	tools/mtmd/CMakeLists.txt
#	tools/mtmd/README.md
2026-05-08 14:48:57 +08:00
Georgi Gerganov
e43431b381
llama : fix device state save/load (#22805) 2026-05-07 21:43:40 +03:00
Georgi Gerganov
803627f121
llama : remove unnecessary seq_id check during state restore (#22797)
Some checks failed
Check Pre-Tokenizer Hashes / pre-tokenizer-hashes (push) Has been cancelled
Python check requirements.txt / check-requirements (push) Has been cancelled
Python Type-Check / python type-check (push) Has been cancelled
2026-05-07 16:37:26 +03:00
AesSedai
8e52631d55
model: Add Mimo v2.5 model support (#22493)
* add mimo-v2.5 support

* mimo-v2.5: fix modify_tensors row split

* mimi-v2.5: forgot `add_attn_value_scale` plumbing

* mimi-v2.5: fix tp dequant to detect tp rows

* mimo-v2.5: fix TP iteration to be descending

* mimo-v2.5: fix comment

* mimo-v2.5: retain fused qkv

* mimo-v2.5: missed the attn_value scale during merge

* mimo-v2.5: fused QKV needs contiguous for scaling attention value

* mimo-v2.5: move `speech_embeddings.` to TextModel filter_tensors

* Update src/llama-hparams.h

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update src/models/mimo2.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update src/models/mimo2.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update src/models/mimo2.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* mimo-v2.5: include MTP weights in gguf

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-05-07 13:21:58 +02:00
Adrien Gallouët
3980e04d5a
llama : add missing call to ggml_backend_load_all() (#22752)
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2026-05-07 08:24:47 +03:00
Gilad S.
5207d120ea
model : don't crash on unsupported architecture (#22742)
* model: don't crash on unsupported architecture

* Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-05-06 18:51:21 +02:00
Concedo
9e9497f0cc Merge remote-tracking branch 'origin/upstream' into concedo_experimental
# Conflicts:
#	examples/save-load-state/save-load-state.cpp
#	ggml/CMakeLists.txt
#	ggml/src/ggml-cpu/CMakeLists.txt
#	ggml/src/ggml-hexagon/htp/hmx-matmul-ops.c
#	ggml/src/ggml-hexagon/htp/matmul-ops.c
#	ggml/src/ggml-opencl/CMakeLists.txt
#	ggml/src/ggml-opencl/ggml-opencl.cpp
#	ggml/src/ggml-opencl/kernels/gemm_noshuffle_q4_0_f32.cl
#	ggml/src/ggml-opencl/kernels/gemm_noshuffle_q8_0_f32.cl
#	ggml/src/ggml-opencl/kernels/gemv_noshuffle_q4_0_f32.cl
#	ggml/src/ggml-opencl/kernels/gemv_noshuffle_q4_0_f32_spec.cl
#	ggml/src/ggml-opencl/kernels/gemv_noshuffle_q8_0_f32.cl
#	ggml/src/ggml-rpc/ggml-rpc.cpp
#	scripts/sync-ggml.last
#	scripts/sync_vendor.py
#	src/llama-graph.cpp
#	tests/test-backend-ops.cpp
#	tests/test-state-restore-fragmented.cpp
2026-05-06 21:20:06 +08:00
Concedo
7240da764a Merge commit '935a340292' into concedo_experimental
# Conflicts:
#	examples/diffusion/CMakeLists.txt
#	scripts/server-test-function-call.py
#	src/llama-model.cpp
#	src/models/gemma4.cpp
#	tests/test-chat.cpp
#	tests/test-reasoning-budget.cpp
#	tools/server/README.md
2026-05-06 21:02:25 +08:00
Adrien Gallouët
bf76ac77be
common : only load backends when required (#22290)
* common : only load backends when required

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* llama : call ggml_backend_load_all() directly from llama_backend_init()

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Add ggml_backend_load_all() where llama_backend_init() is not used

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

---------

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2026-05-05 09:23:50 +02:00
Georgi Gerganov
d6e7b033a4
llama : add option to save memory in device buffers (#22679)
* llama : add option to save memory in device buffers

* tests : extend llama-save-load-state
2026-05-05 06:35:07 +03:00
Sigbjørn Skjæret
fa595462ca
graph : handle non-contiguous Q/K/V in mul_mat_aux (#22630)
* qkv may not always be contiguous

* cont : make the cont conditional

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-05-05 06:34:44 +03:00
Ismail
a817a22bc6
ggml : implement fast walsh-hadamard transform for kv rotation (#21352) (#22631) 2026-05-05 10:05:05 +08:00
Xuan-Son Nguyen
994118a183
model: move load_hparams and load_tensors to per-model definition (#22004)
* git-friendly migration

* add build_graph

* nits

* exclude old code from build

* wip

* add llm_arch_model_i

* prepare downstream functions

* nits

* nits

* wip

* wip

* add back create_tensor_qkv

* fix files missing include

* enforce one llm_build per arch

* cmake: use glob

* missing model params

* nits

* wip

* wip (2)

* wip (3)

* test-llama-archs is happy

* improve switch case

* move more stuff into llm_arch_model_i

* fix downstream code

* nits

* nits (2)

* fix order

* llama_model_base

* LLAMA_LOAD_LOCALS

* small fix

* fix build errors

* auto

* rm migration script and ifdef
2026-05-04 12:36:59 +02:00
Concedo
2905c6254f Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.pi/gg/SYSTEM.md
#	docs/speculative.md
#	ggml/src/ggml-virtgpu/virtgpu-shm.cpp
#	ggml/src/ggml-virtgpu/virtgpu.cpp
#	ggml/src/ggml-virtgpu/virtgpu.h
#	ggml/src/ggml-webgpu/ggml-webgpu-shader-lib.hpp
#	ggml/src/ggml-webgpu/ggml-webgpu.cpp
#	ggml/src/ggml-webgpu/wgsl-shaders/row_norm.wgsl
#	tools/cli/README.md
#	tools/completion/README.md
#	tools/server/README.md
2026-05-04 15:36:13 +08:00
Julien Denize
048a490f76
convert : Mistral format yarn apply_scale support (#22612)
* [BUGFIX] Mistral format apply_scale support.

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* fix misunderstood boolean parameters

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-05-03 21:51:21 +02:00
Georgi Gerganov
0754b7b6fe
server : avoid checkpoint data host copies (#22558)
* server : avoid checkpoint data host copies

* llama : refactor llama_io_read_i
2026-05-02 18:03:25 +03:00
Concedo
7c70187e26 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.github/ISSUE_TEMPLATE/010-bug-compilation.yml
#	.github/ISSUE_TEMPLATE/011-bug-results.yml
#	.github/ISSUE_TEMPLATE/019-bug-misc.yml
#	.github/ISSUE_TEMPLATE/020-enhancement.yml
#	.github/ISSUE_TEMPLATE/030-research.yml
#	.github/ISSUE_TEMPLATE/040-refactor.yml
#	ggml/CMakeLists.txt
#	ggml/src/ggml-cann/ggml-cann.cpp
#	ggml/src/ggml-hexagon/CMakeLists.txt
#	ggml/src/ggml-hexagon/ggml-hexagon.cpp
#	ggml/src/ggml-hexagon/htp/CMakeLists.txt
#	ggml/src/ggml-hexagon/htp/cmake-toolchain.cmake
#	ggml/src/ggml-hexagon/htp/flash-attn-ops.c
#	ggml/src/ggml-hexagon/htp/hex-utils.h
#	ggml/src/ggml-hexagon/htp/hmx-matmul-ops.c
#	ggml/src/ggml-hexagon/htp/hmx-ops.h
#	ggml/src/ggml-hexagon/htp/hmx-utils.h
#	ggml/src/ggml-hexagon/htp/hvx-base.h
#	ggml/src/ggml-hexagon/htp/hvx-copy.h
#	ggml/src/ggml-hexagon/htp/hvx-exp.h
#	ggml/src/ggml-hexagon/htp/unary-ops.c
#	ggml/src/ggml-opencl/CMakeLists.txt
#	ggml/src/ggml-opencl/ggml-opencl.cpp
#	ggml/src/ggml-opencl/kernels/cvt.cl
#	ggml/src/ggml-rpc/ggml-rpc.cpp
#	ggml/src/ggml-sycl/ggml-sycl.cpp
#	ggml/src/ggml-virtgpu/ggml-backend.cpp
#	ggml/src/ggml-webgpu/ggml-webgpu-shader-lib.hpp
#	ggml/src/ggml-webgpu/ggml-webgpu.cpp
#	ggml/src/ggml-webgpu/wgsl-shaders/mul_mat_vec.wgsl
#	ggml/src/ggml-zdnn/ggml-zdnn.cpp
#	ggml/src/ggml-zendnn/ggml-zendnn.cpp
#	scripts/sync-ggml.last
#	tests/test-backend-ops.cpp
2026-05-02 18:07:50 +08:00
ddh0
b97ebdc98f
llama-quant : fix --tensor-type when default qtype is overriden (#22572)
fix #22544 (my fault!)

Credit to @Anai-Guo, ref #22559 - since that one was closed due to the
new contributor policy I am taking the liberty of re-submitting that PR
here.
2026-05-01 19:55:55 +02:00
Reese Levine
5cbfb18075
Update llama-mmap to use ftello/fseeko (#22497)
* Update llama-mmap to work with 32-bit wasm and >2GB models

* Update to gguf.cpp style
2026-04-30 14:17:52 -07:00
Concedo
70be589894 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	CODEOWNERS
#	examples/debug/debug.cpp
#	examples/eval-callback/eval-callback.cpp
#	ggml/src/ggml-cpu/amx/mmq.cpp
#	ggml/src/ggml-webgpu/ggml-webgpu-shader-lib.hpp
#	scripts/pr2wt.sh
2026-04-28 21:13:40 +08:00
ynankani
0f1bb602dd
model : remove duplicate wo_s scale after build_attn (Qwen3, LLaMA) (#22421)
Signed-off-by: Yash Nankani <ynankani@nvidia.com>
2026-04-27 09:58:48 +02:00
Concedo
b31877e8ec Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.github/pull_request_template.md
#	.gitignore
#	docs/backend/SYCL.md
#	docs/ops.md
#	docs/ops/WebGPU.csv
#	examples/sycl/test.sh
#	examples/sycl/win-test.bat
#	ggml/src/ggml-sycl/common.hpp
#	ggml/src/ggml-sycl/ggml-sycl.cpp
#	ggml/src/ggml-sycl/sycl_hw.cpp
#	ggml/src/ggml-sycl/sycl_hw.hpp
#	ggml/src/ggml-webgpu/ggml-webgpu-shader-lib.hpp
#	ggml/src/ggml-webgpu/ggml-webgpu.cpp
2026-04-25 19:06:32 +08:00
ddh0
9d34231bb8
llama-quant : default ftype param Q5_1 --> Q8_0 (#20828)
Change the default `ftype` in `llama_model_quantize_params` from
`LLAMA_FTYPE_MOSTLY_Q5_1` to `LLAMA_FTYPE_MOSTLY_Q8_0`.

In case some external program naively uses the default quantization
params, we should probably default to a known-good type like Q8_0 rather
than Q5_1, which is rather old.
2026-04-25 09:25:35 +03:00
Concedo
0755f27372 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.devops/openvino.Dockerfile
#	.github/workflows/build-self-hosted.yml
#	.github/workflows/build.yml
#	common/chat.cpp
#	docs/backend/OPENVINO.md
#	examples/speculative-simple/speculative-simple.cpp
#	ggml/src/ggml-hexagon/ggml-hexagon.cpp
#	ggml/src/ggml-hexagon/htp/CMakeLists.txt
#	ggml/src/ggml-hexagon/htp/htp-ctx.h
#	ggml/src/ggml-hexagon/htp/htp-ops.h
#	ggml/src/ggml-hexagon/htp/main.c
#	ggml/src/ggml-hexagon/libggml-htp.inf
#	ggml/src/ggml-openvino/ggml-decoder.cpp
#	ggml/src/ggml-openvino/ggml-openvino-extra.cpp
#	ggml/src/ggml-openvino/ggml-openvino.cpp
#	ggml/src/ggml-openvino/ggml-quants.cpp
#	ggml/src/ggml-openvino/openvino/op/rope.cpp
#	ggml/src/ggml-openvino/openvino/op_table.cpp
#	ggml/src/ggml-openvino/openvino/op_table.h
#	ggml/src/ggml-openvino/openvino/translate_session.cpp
#	ggml/src/ggml-openvino/openvino/utils.cpp
#	ggml/src/ggml-openvino/openvino/utils.h
#	ggml/src/ggml-openvino/utils.cpp
#	ggml/src/ggml-openvino/utils.h
#	ggml/src/ggml-sycl/common.hpp
#	ggml/src/ggml-sycl/convert.cpp
#	ggml/src/ggml-sycl/convert.hpp
#	ggml/src/ggml-sycl/gemm.hpp
#	ggml/src/ggml-sycl/ggml-sycl.cpp
#	ggml/src/ggml-sycl/set_rows.cpp
#	ggml/src/ggml-webgpu/ggml-webgpu-shader-lib.hpp
#	ggml/src/ggml-webgpu/ggml-webgpu.cpp
#	scripts/sync_vendor.py
#	tests/CMakeLists.txt
#	tests/test-chat.cpp
#	tools/cli/cli.cpp
#	tools/mtmd/CMakeLists.txt
#	tools/server/CMakeLists.txt
2026-04-23 00:55:05 +08:00
manayang
7bfe60fdf9
mtmd, llama : Update HunyuanVL vision-language model support (#22037)
* mtmd, llama : add HunyuanVL vision-language model support

- add LLM_ARCH_HUNYUAN_VL with M-RoPE (XD-RoPE) support
- add PROJECTOR_TYPE_HUNYUANVL with PatchMerger vision encoder
- add HunyuanVL-specific M-RoPE position encoding for image tokens
- add GGUF conversion for HunyuanVL vision and text models
- add smoke test in tools/mtmd/tests.sh

* fix: fix HunyuanVL XD-RoPE h/w section order

* fix: Remove redundant code

* convert : fix HunyuanOCR / HunyuanVL conversion
 - Tested locally: both HunyuanOCR and HunyuanVL-4B convert to GGUF
 - successfully and produce correct inference output on Metal (F16 / Q8_0).

* clip : fix -Werror=misleading-indentation in bilinear resize

* fix CI: convert_hf_to_gguf type check error
 - convert_hf_to_gguf.py: give HunyuanVLTextModel.__init__ an explicit `dir_model: Path` parameter so ty can infer the type for load_hparams instead of reporting `Unknown | None`.

---------

Co-authored-by: wendadawen <wendadawen@tencent.com>
2026-04-22 11:58:43 +02:00
Concedo
19a12bb080 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	CODEOWNERS
#	common/CMakeLists.txt
#	ggml/CMakeLists.txt
#	ggml/src/ggml-webgpu/ggml-webgpu-shader-lib.hpp
#	ggml/src/ggml-webgpu/ggml-webgpu.cpp
#	ggml/src/ggml-webgpu/wgsl-shaders/common_decls.tmpl
#	ggml/src/ggml-webgpu/wgsl-shaders/mul_mat_vec.wgsl
#	scripts/sync-ggml.last
#	tools/cli/cli.cpp
#	tools/llama-bench/llama-bench.cpp
#	tools/perplexity/perplexity.cpp
2026-04-21 18:53:03 +08:00
Georgi Gerganov
cd03ec7642
llama-ext : fix exports (#22202) 2026-04-21 11:04:46 +03:00
Georgi Gerganov
cfe9838d26
fit-params : refactor + add option to output estimated memory per device (#22171)
* fit-params : add option to output estimated memory per device

* cont : minor

* cont : refactor

* cont : move fit params implementation to libcommon

* cont : header

* cont : headers

* cont : codeowners
2026-04-21 09:54:36 +03:00
Johannes Gäßler
fb19f94c71
TP: fix 0-sized tensor slices, AllReduce fallback (#21808)
* TP: fix 0-sized tensor slices, AllReduce fallback

* fix layer structure <-> GPU count aliasing

* add missing std::fill

* fix CUDA device set, max ggml ctx size
2026-04-20 18:09:39 +02:00
Concedo
cd6788007e Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.github/workflows/build-cross.yml
#	.github/workflows/build-self-hosted.yml
#	.github/workflows/release.yml
#	examples/llama.android/lib/src/main/cpp/CMakeLists.txt
#	ggml/CMakeLists.txt
#	ggml/src/ggml-rpc/CMakeLists.txt
#	ggml/src/ggml-rpc/ggml-rpc.cpp
#	ggml/src/ggml-sycl/mmvq.cpp
#	ggml/src/ggml-webgpu/ggml-webgpu-shader-lib.hpp
#	ggml/src/ggml-webgpu/ggml-webgpu.cpp
#	scripts/sync_vendor.py
#	tests/test-chat.cpp
#	tests/test-mtmd-c-api.c
#	tools/server/README.md
2026-04-20 20:19:11 +08:00
SamareshSingh
81df3f7cfa
fix: GLM-DSA crash in llama-tokenize when using vocab_only (#22102)
Some checks failed
Check Pre-Tokenizer Hashes / pre-tokenizer-hashes (push) Has been cancelled
Python check requirements.txt / check-requirements (push) Has been cancelled
Python Type-Check / python type-check (push) Has been cancelled
* llama: fix crash in print_info for GLM-DSA when vocab_only is set

* addressed code review comments

* cont : simplify

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-04-20 10:32:46 +03:00
Sigbjørn Skjæret
4f02d47339
model : refactor bias tensor variable names (#22079)
* refactor bias tensor variable names

* use create_tensor_qkv for jina-bert-v2
2026-04-18 20:12:00 +02:00
Johannes Gäßler
fd1c0ec3f0
llama: fit ctx size for CPU only (#21568) 2026-04-18 08:16:04 +02:00
Concedo
79882d669a Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.github/workflows/build-android.yml
#	.github/workflows/build.yml
#	.github/workflows/release.yml
#	CMakeLists.txt
#	CODEOWNERS
#	common/CMakeLists.txt
#	common/common.h
#	docs/ops.md
#	docs/ops/Metal.csv
#	examples/batched/CMakeLists.txt
#	examples/convert-llama2c-to-ggml/CMakeLists.txt
#	examples/debug/CMakeLists.txt
#	examples/diffusion/CMakeLists.txt
#	examples/embedding/CMakeLists.txt
#	examples/eval-callback/CMakeLists.txt
#	examples/gen-docs/CMakeLists.txt
#	examples/idle/CMakeLists.txt
#	examples/lookahead/CMakeLists.txt
#	examples/lookup/CMakeLists.txt
#	examples/parallel/CMakeLists.txt
#	examples/passkey/CMakeLists.txt
#	examples/retrieval/CMakeLists.txt
#	examples/save-load-state/CMakeLists.txt
#	examples/speculative-simple/CMakeLists.txt
#	examples/speculative/CMakeLists.txt
#	examples/sycl/CMakeLists.txt
#	examples/training/CMakeLists.txt
#	ggml/src/ggml-hexagon/htp/hmx-matmul-ops.c
#	ggml/src/ggml-hexagon/htp/htp-ops.h
#	ggml/src/ggml-hexagon/htp/main.c
#	ggml/src/ggml-opencl/CMakeLists.txt
#	ggml/src/ggml-opencl/ggml-opencl.cpp
#	ggml/src/ggml-opencl/kernels/cvt.cl
#	pocs/vdot/CMakeLists.txt
#	src/CMakeLists.txt
#	tests/CMakeLists.txt
#	tests/test-quantize-stats.cpp
#	tools/batched-bench/CMakeLists.txt
#	tools/cli/CMakeLists.txt
#	tools/cli/cli.cpp
#	tools/completion/CMakeLists.txt
#	tools/cvector-generator/CMakeLists.txt
#	tools/cvector-generator/cvector-generator.cpp
#	tools/export-lora/CMakeLists.txt
#	tools/gguf-split/CMakeLists.txt
#	tools/gguf-split/gguf-split.cpp
#	tools/imatrix/CMakeLists.txt
#	tools/llama-bench/CMakeLists.txt
#	tools/llama-bench/llama-bench.cpp
#	tools/mtmd/CMakeLists.txt
#	tools/perplexity/CMakeLists.txt
#	tools/quantize/CMakeLists.txt
#	tools/quantize/quantize.cpp
#	tools/results/CMakeLists.txt
#	tools/server/CMakeLists.txt
#	tools/tokenize/CMakeLists.txt
#	tools/tts/CMakeLists.txt
2026-04-17 22:37:37 +08:00
Concedo
9a38091207 support q5_1 kv 2026-04-17 17:06:15 +08:00
Eric Zhang
fcc7508759
model : Gemma4 model type detection (#22027)
* model : Gemma4 model type detection

* model : Gemma4 model type detection
2026-04-17 10:07:11 +02:00
Xuan-Son Nguyen
089dd41fe3
cmake: use glob to collect src/models sources (#22005) 2026-04-16 23:25:16 +02:00
Xuan-Son Nguyen
4fbdabdc61
model: using single llm_build per arch (#21970)
* model: using single llm_build per arch

* fix merge

* nits
2026-04-16 21:10:22 +02:00
PikaPikachu
9db77a020c
model : refactor QKV into common build_qkv and create_tensor_qkv helpers (#21245)
* model : refactor QKV into common build_qkv and create_tensor_qkv helpers

* model : extend build_qkv to bert/mpt/dbrx/olmo/lfm2/nemotron-h/granite-hybrid/gemma3n-iswa/t5-dec and fix wqkv_s
2026-04-16 17:41:34 +02:00
Sigbjørn Skjæret
f772f6e434
model : support NVFP4 tensors for Gemma4 (#21971)
* support nvfp4 tensors for Gemma4

* add wo_s to build_attn

* add wo_s to build_attn

* fix glm4
2026-04-16 16:51:47 +02:00
Concedo
ae292c496e handle SWA conflicting with rewind, increased default SWA padding. 2026-04-16 17:00:26 +08:00