Concedo
cc82c3164e
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .devops/intel.Dockerfile
# .github/workflows/build-cross.yml
# .github/workflows/build-sycl.yml
# .github/workflows/build.yml
# .github/workflows/editorconfig.yml
# .github/workflows/release.yml
# cmake/riscv64-spacemit-linux-gnu-gcc.cmake
# docs/backend/OPENVINO.md
# docs/backend/SYCL.md
# docs/build-riscv64-spacemit.md
# docs/ops.md
# docs/ops/WebGPU.csv
# embd_res/ggml-vocab-qwen35.gguf
# embd_res/ggml-vocab-qwen35.gguf.inp
# embd_res/ggml-vocab-qwen35.gguf.out
# examples/model-conversion/Makefile
# ggml/CMakeLists.txt
# ggml/src/ggml-cpu/CMakeLists.txt
# ggml/src/ggml-hexagon/ggml-hexagon.cpp
# ggml/src/ggml-hexagon/htp/hmx-flash-attn-ops.c
# ggml/src/ggml-hexagon/htp/hmx-matmul-ops.c
# ggml/src/ggml-hexagon/htp/hmx-utils.h
# ggml/src/ggml-hexagon/htp/htp-ops.h
# ggml/src/ggml-hexagon/htp/hvx-utils.h
# ggml/src/ggml-hexagon/htp/main.c
# ggml/src/ggml-hexagon/htp/unary-ops.c
# ggml/src/ggml-opencl/CMakeLists.txt
# ggml/src/ggml-opencl/ggml-opencl.cpp
# ggml/src/ggml-opencl/kernels/cvt.cl
# ggml/src/ggml-sycl/CMakeLists.txt
# ggml/src/ggml-sycl/common.cpp
# ggml/src/ggml-sycl/common.hpp
# ggml/src/ggml-sycl/ggml-sycl.cpp
# ggml/src/ggml-webgpu/ggml-webgpu-shader-lib.hpp
# ggml/src/ggml-webgpu/ggml-webgpu.cpp
# ggml/src/ggml-webgpu/wgsl-shaders/common_decls.tmpl
# ggml/src/ggml-webgpu/wgsl-shaders/flash_attn_tile.wgsl
# ggml/src/ggml-webgpu/wgsl-shaders/flash_attn_vec_reduce.wgsl
# ggml/src/ggml-webgpu/wgsl-shaders/flash_attn_vec_split.wgsl
# ggml/src/ggml-webgpu/wgsl-shaders/get_rows.wgsl
# ggml/src/ggml-webgpu/wgsl-shaders/mul_mat_decls.tmpl
# ggml/src/ggml-webgpu/wgsl-shaders/mul_mat_vec_acc.tmpl
# ggml/src/ggml-webgpu/wgsl-shaders/unary.wgsl
# ggml/src/ggml-zendnn/CMakeLists.txt
# ggml/src/ggml-zendnn/ggml-zendnn.cpp
# scripts/snapdragon/adb/run-completion.sh
# tests/CMakeLists.txt
# tools/cli/README.md
# tools/completion/README.md
# tools/mtmd/clip-impl.h
# tools/mtmd/clip.cpp
# tools/mtmd/clip.h
# tools/server/README.md
2026-05-14 19:04:04 +08:00
Kabir Potdar
42532afff4
unicode,test: add Qwen3.5 non-backtracking tokenizer handler and regr… ( #22110 )
...
* unicode,test: add Qwen3.5 non-backtracking tokenizer handler and regression tests
- Add unicode_regex_split_custom_qwen35() to [src/unicode.cpp](src/unicode.cpp), a non-backtracking handler for Qwen3.5's [\p{L}\p{M}]+ regex (letters + combining marks).
- Register the handler in the custom tokenizer dispatch table to prevent stack overflows on long inputs (fixes #21919 ).
- Add [models/ggml-vocab-qwen35.gguf](models/ggml-vocab-qwen35.gguf) (test vocab), [models/ggml-vocab-qwen35.gguf.inp](models/ggml-vocab-qwen35.gguf.inp) (test cases), and [models/ggml-vocab-qwen35.gguf.out](models/ggml-vocab-qwen35.gguf.out) (expected output) for regression testing.
- Update [tests/CMakeLists.txt](tests/CMakeLists.txt) to include the new test entry.
This mirrors the Qwen2 fix (commit 0d049d6 ), but adapts for Qwen3.5's regex. Ensures robust Unicode tokenization and prevents std::regex stack overflows.
Closes #21919 .
* fix: enhance regex handling for Qwen3.5 tokenizer to include accent marks
* cont : remove trailing whitespace
---------
Co-authored-by: Kabir <kabir@example.com>
Co-authored-by: Alde Rojas <hello@alde.dev>
2026-05-14 11:03:40 +02:00
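Note on 42532afff4: a non-backtracking handler for a `[\p{L}\p{M}]+` style pattern can be written as a single forward scan, so stack depth does not grow with input length. The sketch below is illustrative only and is not the code in src/unicode.cpp. The names `is_letter_or_mark` and `split_letter_runs` are hypothetical, and the category check is a deliberately simplified stand-in for real Unicode tables.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Simplified stand-in for the \p{L} / \p{M} categories (assumption: only
// ASCII letters, Latin-1 letters and the U+0300..U+036F combining marks).
static bool is_letter_or_mark(uint32_t cp) {
    if ((cp >= 'A' && cp <= 'Z') || (cp >= 'a' && cp <= 'z')) return true;
    if (cp >= 0x00C0 && cp <= 0x00FF && cp != 0x00D7 && cp != 0x00F7) return true;
    return cp >= 0x0300 && cp <= 0x036F;
}

// Returns chunk lengths: each maximal run of letters/marks is one chunk,
// every other code point becomes a chunk of its own. Purely iterative.
static std::vector<size_t> split_letter_runs(const std::vector<uint32_t> & cpts) {
    std::vector<size_t> lens;
    size_t i = 0;
    while (i < cpts.size()) {
        size_t j = i;
        while (j < cpts.size() && is_letter_or_mark(cpts[j])) {
            ++j;
        }
        if (j == i) {
            ++j; // not a letter or mark: emit a single-code-point chunk
        }
        assert(j > i); // every iteration consumes at least one code point
        lens.push_back(j - i);
        i = j;
    }
    return lens;
}
```

Because each code point is visited once, the cost is linear in the input length, which is the property that avoids the std::regex stack overflows mentioned in the commit body.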
Concedo
f7923b261f
need to fix cuda compile. Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .github/workflows/python-type-check.yml
# examples/speculative-simple/README.md
# examples/speculative-simple/speculative-simple.cpp
# ggml/src/ggml-cuda/im2col.cu
# ggml/src/ggml-opencl/CMakeLists.txt
# ggml/src/ggml-opencl/ggml-opencl.cpp
# ggml/src/ggml-opencl/kernels/cvt.cl
# tests/test-backend-ops.cpp
# tools/cli/README.md
# tools/mtmd/CMakeLists.txt
# tools/server/README.md
2026-05-12 20:47:07 +08:00
Georgi Gerganov
68e7ea3eab
spec : parallel drafting support ( #22838 )
...
* spec : refactor
* spec : drop support for incompatible vocabs
* spec : update common_speculative_init()
* cont : pass seq_id
* cont : dedup ctx_seq_rm_type
* server : sketch the ctx_dft decode loop
* server : draft prompt cache and checkpoints
* server : improve ctx names
* server, spec : transition to unified spec context
* cont : sync main and drft contexts
* cont : async drft eval when possible
* cont : handle non-ckpt models
* cont : pass correct n_past for drafting
* cont : process images through the draft context
* spec : handle draft running out of context
* server : fix mtmd draft processing
* server : fix URL for draft model
* server : add comment
* server : clean-up + dry
* speculative-simple : update
* spec : fix n_past type
* server : fix slot ctx_drft ptr
* tools : update readme
* naming : improve consistency
* spec : refactor for multi-sequence speculative context
* cont : prepare params
* cont : prepare params
* spec : support parallel drafts
* server : support parallel drafting
* llama : reuse device buffers when possible
* server, spec : clean-up
* cont : clean-up
* cont : minor
* spec : reset `drafting` flag at the end
* spec : introduce `common_speculative_process()`
* spec : allow for multiple spec types (chain of speculators)
  * replace the old type field of type common_speculative_type in the common_params_speculative struct with a vector to allow multiple types to be specified
  * introduce common_get_enabled_speculative_impls(const std::vector<enum common_speculative_type>) to figure out which implementations the user has enabled
  * introduce common_speculative_type_from_names(const std::vector<std::string> & names) to parse the spec types already provided by the user
  * all speculators run sequentially, best one wins (we verify its drafted tokens)
  * maximize expected accepted tokens for the current round by calculating the product of the probability of accepting the current token (n_acc_tokens / n_gen_drafts) and the draft's length
---------
Co-authored-by: Petros Sideris <petros.sideris@nokia.com>
2026-05-11 19:09:43 +03:00
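Note on 68e7ea3eab: the selection rule described in the last bullets (acceptance probability times draft length, best speculator wins) can be sketched as follows. This is a minimal illustration under stated assumptions, not the code from the commit. The struct `draft_candidate`, its `n_draft` field, and the functions `expected_accepted` and `pick_best` are hypothetical names, and scoring a speculator with no history as 0 is an assumption made here.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical per-speculator statistics for one round.
struct draft_candidate {
    size_t n_acc_tokens; // tokens accepted so far
    size_t n_gen_drafts; // drafts generated so far
    size_t n_draft;      // length of the draft proposed this round
};

// Expected number of accepted tokens: acceptance probability * draft length.
static double expected_accepted(const draft_candidate & c) {
    if (c.n_gen_drafts == 0) {
        return 0.0; // assumption: no history yet means no expected gain
    }
    const double p_accept = double(c.n_acc_tokens) / double(c.n_gen_drafts);
    return p_accept * double(c.n_draft);
}

// Index of the candidate with the highest score, or -1 if the list is empty.
static int pick_best(const std::vector<draft_candidate> & cands) {
    int    best       = -1;
    double best_score = -1.0;
    for (size_t i = 0; i < cands.size(); ++i) {
        const double score = expected_accepted(cands[i]);
        if (score > best_score) {
            best_score = score;
            best       = int(i);
        }
    }
    return best;
}
```

With this rule a long draft from a speculator with a modest acceptance rate can outrank a short draft from a more accurate one, which matches the "maximize expected accepted tokens" wording in the commit.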
Concedo
2771e16fbc
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .devops/intel.Dockerfile
# .devops/nix/package.nix
# .gitignore
# docs/backend/SYCL.md
# docs/ops.md
# docs/ops/SYCL.csv
# ggml/CMakeLists.txt
# ggml/src/ggml-cuda/fattn.cu
# ggml/src/ggml-cuda/ggml-cuda.cu
# ggml/src/ggml-sycl/CMakeLists.txt
# ggml/src/ggml-sycl/common.hpp
# ggml/src/ggml-sycl/convert.cpp
# ggml/src/ggml-sycl/dequantize.hpp
# ggml/src/ggml-sycl/fattn-common.hpp
# ggml/src/ggml-sycl/getrows.cpp
# ggml/src/ggml-sycl/ggml-sycl.cpp
# ggml/src/ggml-sycl/im2col.cpp
# ggml/src/ggml-sycl/im2col.hpp
# ggml/src/ggml-sycl/mmvq.cpp
# ggml/src/ggml-sycl/quants.hpp
# ggml/src/ggml-sycl/vecdotq.hpp
# ggml/src/ggml-virtgpu/ggml-backend-device.cpp
# scripts/sync-ggml.last
# scripts/sync_vendor.py
# tests/test-backend-ops.cpp
2026-05-11 16:18:28 +08:00
Concedo
9b0b36b5ef
Merge commit '66001722aa' into concedo_experimental
...
# Conflicts:
# README.md
# docs/ops.md
# docs/ops/SYCL.csv
# examples/sycl/start-svr.sh
# ggml/src/ggml-hexagon/ggml-hexagon.cpp
# ggml/src/ggml-hexagon/htp/CMakeLists.txt
# ggml/src/ggml-hexagon/htp/htp-ctx.h
# ggml/src/ggml-hexagon/htp/htp-ops.h
# ggml/src/ggml-hexagon/htp/main.c
# ggml/src/ggml-hexagon/htp/unary-ops.c
# ggml/src/ggml-opencl/CMakeLists.txt
# ggml/src/ggml-opencl/ggml-opencl.cpp
# ggml/src/ggml-opencl/kernels/cvt.cl
# ggml/src/ggml-sycl/gated_delta_net.hpp
# ggml/src/ggml-sycl/ggml-sycl.cpp
# ggml/src/ggml-sycl/pad.cpp
# ggml/src/ggml-sycl/ssm_conv.cpp
# tests/test-backend-ops.cpp
# tests/test-reasoning-budget.cpp
# tools/server/README.md
# tools/server/webui/src/lib/constants/settings-config.ts
2026-05-11 15:40:10 +08:00
Sigbjørn Skjæret
5755a100cd
model : fix model type check for granite/llama3 and deepseek2/glm4.7 lite ( #22870 )
2026-05-10 08:44:29 +02:00
Sumit Chatterjee
1e5ad35d56
model : add sarvam_moe architecture support ( #20275 )
2026-05-09 16:31:50 +02:00
ynankani
9f5f0e689c
model : support Gemma4_26B_A4B_NVFP4 ( #22804 )
...
* Gemma4_26B_A4B_NvFp4 hf checkpoint convert to gguf format fixes
Signed-off-by: ynankani <ynankani@nvidia.com>
* Apply suggestions from code review
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Address review comments
Signed-off-by: ynankani <ynankani@nvidia.com>
* fix CRLF
Signed-off-by: ynankani <ynankani@nvidia.com>
* Lint error fix
Signed-off-by: ynankani <ynankani@nvidia.com>
---------
Signed-off-by: ynankani <ynankani@nvidia.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-05-08 20:42:09 +02:00
Concedo
eb30b29d69
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .github/workflows/gguf-publish.yml
# CODEOWNERS
# examples/sycl/test.sh
# pyproject.toml
# tools/mtmd/CMakeLists.txt
# tools/mtmd/README.md
2026-05-08 14:48:57 +08:00
Georgi Gerganov
e43431b381
llama : fix device state save/load ( #22805 )
2026-05-07 21:43:40 +03:00
Georgi Gerganov
803627f121
llama : remove unnecessary seq_id check during state restore ( #22797 )
2026-05-07 16:37:26 +03:00
AesSedai
8e52631d55
model: Add Mimo v2.5 model support ( #22493 )
...
* add mimo-v2.5 support
* mimo-v2.5: fix modify_tensors row split
* mimo-v2.5: forgot `add_attn_value_scale` plumbing
* mimo-v2.5: fix tp dequant to detect tp rows
* mimo-v2.5: fix TP iteration to be descending
* mimo-v2.5: fix comment
* mimo-v2.5: retain fused qkv
* mimo-v2.5: missed the attn_value scale during merge
* mimo-v2.5: fused QKV needs contiguous for scaling attention value
* mimo-v2.5: move `speech_embeddings.` to TextModel filter_tensors
* Update src/llama-hparams.h
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update src/models/mimo2.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update src/models/mimo2.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update convert_hf_to_gguf.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update convert_hf_to_gguf.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update src/models/mimo2.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* mimo-v2.5: include MTP weights in gguf
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-05-07 13:21:58 +02:00
Adrien Gallouët
3980e04d5a
llama : add missing call to ggml_backend_load_all() ( #22752 )
...
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2026-05-07 08:24:47 +03:00
Gilad S.
5207d120ea
model : don't crash on unsupported architecture ( #22742 )
...
* model: don't crash on unsupported architecture
* Update src/llama-model.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-05-06 18:51:21 +02:00
Concedo
9e9497f0cc
Merge remote-tracking branch 'origin/upstream' into concedo_experimental
...
# Conflicts:
# examples/save-load-state/save-load-state.cpp
# ggml/CMakeLists.txt
# ggml/src/ggml-cpu/CMakeLists.txt
# ggml/src/ggml-hexagon/htp/hmx-matmul-ops.c
# ggml/src/ggml-hexagon/htp/matmul-ops.c
# ggml/src/ggml-opencl/CMakeLists.txt
# ggml/src/ggml-opencl/ggml-opencl.cpp
# ggml/src/ggml-opencl/kernels/gemm_noshuffle_q4_0_f32.cl
# ggml/src/ggml-opencl/kernels/gemm_noshuffle_q8_0_f32.cl
# ggml/src/ggml-opencl/kernels/gemv_noshuffle_q4_0_f32.cl
# ggml/src/ggml-opencl/kernels/gemv_noshuffle_q4_0_f32_spec.cl
# ggml/src/ggml-opencl/kernels/gemv_noshuffle_q8_0_f32.cl
# ggml/src/ggml-rpc/ggml-rpc.cpp
# scripts/sync-ggml.last
# scripts/sync_vendor.py
# src/llama-graph.cpp
# tests/test-backend-ops.cpp
# tests/test-state-restore-fragmented.cpp
2026-05-06 21:20:06 +08:00
Concedo
7240da764a
Merge commit '935a340292' into concedo_experimental
...
# Conflicts:
# examples/diffusion/CMakeLists.txt
# scripts/server-test-function-call.py
# src/llama-model.cpp
# src/models/gemma4.cpp
# tests/test-chat.cpp
# tests/test-reasoning-budget.cpp
# tools/server/README.md
2026-05-06 21:02:25 +08:00
Adrien Gallouët
bf76ac77be
common : only load backends when required ( #22290 )
...
* common : only load backends when required
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* llama : call ggml_backend_load_all() directly from llama_backend_init()
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Add ggml_backend_load_all() where llama_backend_init() is not used
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
---------
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2026-05-05 09:23:50 +02:00
Georgi Gerganov
d6e7b033a4
llama : add option to save memory in device buffers ( #22679 )
...
* llama : add option to save memory in device buffers
* tests : extend llama-save-load-state
2026-05-05 06:35:07 +03:00
Sigbjørn Skjæret
fa595462ca
graph : handle non-contiguous Q/K/V in mul_mat_aux ( #22630 )
...
* qkv may not always be contiguous
* cont : make the cont conditional
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-05-05 06:34:44 +03:00
Ismail
a817a22bc6
ggml : implement fast walsh-hadamard transform for kv rotation ( #21352 ) ( #22631 )
2026-05-05 10:05:05 +08:00
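Note on a817a22bc6: for readers unfamiliar with the technique named in the title, a textbook in-place fast Walsh-Hadamard transform looks like the sketch below. It is not the ggml implementation from the commit. The function name `fwht` is hypothetical, the transform is left unnormalized, and the input length is assumed to be a power of two.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// In-place, unnormalized fast Walsh-Hadamard transform in O(n log n).
static void fwht(std::vector<float> & x) {
    const size_t n = x.size();
    assert(n != 0 && (n & (n - 1)) == 0); // length must be a power of two
    for (size_t h = 1; h < n; h *= 2) {
        for (size_t i = 0; i < n; i += 2 * h) {
            for (size_t j = i; j < i + h; ++j) {
                // butterfly: combine the pair that is h apart
                const float a = x[j];
                const float b = x[j + h];
                x[j]     = a + b;
                x[j + h] = a - b;
            }
        }
    }
}
```

Applying the transform twice returns the input scaled by n, so dividing by n (or by sqrt(n) after each pass) gives an orthogonal rotation that is its own inverse.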
Xuan-Son Nguyen
994118a183
model: move load_hparams and load_tensors to per-model definition ( #22004 )
...
* git-friendly migration
* add build_graph
* nits
* exclude old code from build
* wip
* add llm_arch_model_i
* prepare downstream functions
* nits
* nits
* wip
* wip
* add back create_tensor_qkv
* fix files missing include
* enforce one llm_build per arch
* cmake: use glob
* missing model params
* nits
* wip
* wip (2)
* wip (3)
* test-llama-archs is happy
* improve switch case
* move more stuff into llm_arch_model_i
* fix downstream code
* nits
* nits (2)
* fix order
* llama_model_base
* LLAMA_LOAD_LOCALS
* small fix
* fix build errors
* auto
* rm migration script and ifdef
2026-05-04 12:36:59 +02:00
Concedo
2905c6254f
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .pi/gg/SYSTEM.md
# docs/speculative.md
# ggml/src/ggml-virtgpu/virtgpu-shm.cpp
# ggml/src/ggml-virtgpu/virtgpu.cpp
# ggml/src/ggml-virtgpu/virtgpu.h
# ggml/src/ggml-webgpu/ggml-webgpu-shader-lib.hpp
# ggml/src/ggml-webgpu/ggml-webgpu.cpp
# ggml/src/ggml-webgpu/wgsl-shaders/row_norm.wgsl
# tools/cli/README.md
# tools/completion/README.md
# tools/server/README.md
2026-05-04 15:36:13 +08:00
Julien Denize
048a490f76
convert : Mistral format yarn apply_scale support ( #22612 )
...
* [BUGFIX] Mistral format apply_scale support.
* Update convert_hf_to_gguf.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* fix misunderstood boolean parameters
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-05-03 21:51:21 +02:00
Georgi Gerganov
0754b7b6fe
server : avoid checkpoint data host copies ( #22558 )
...
* server : avoid checkpoint data host copies
* llama : refactor llama_io_read_i
2026-05-02 18:03:25 +03:00
Concedo
7c70187e26
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .github/ISSUE_TEMPLATE/010-bug-compilation.yml
# .github/ISSUE_TEMPLATE/011-bug-results.yml
# .github/ISSUE_TEMPLATE/019-bug-misc.yml
# .github/ISSUE_TEMPLATE/020-enhancement.yml
# .github/ISSUE_TEMPLATE/030-research.yml
# .github/ISSUE_TEMPLATE/040-refactor.yml
# ggml/CMakeLists.txt
# ggml/src/ggml-cann/ggml-cann.cpp
# ggml/src/ggml-hexagon/CMakeLists.txt
# ggml/src/ggml-hexagon/ggml-hexagon.cpp
# ggml/src/ggml-hexagon/htp/CMakeLists.txt
# ggml/src/ggml-hexagon/htp/cmake-toolchain.cmake
# ggml/src/ggml-hexagon/htp/flash-attn-ops.c
# ggml/src/ggml-hexagon/htp/hex-utils.h
# ggml/src/ggml-hexagon/htp/hmx-matmul-ops.c
# ggml/src/ggml-hexagon/htp/hmx-ops.h
# ggml/src/ggml-hexagon/htp/hmx-utils.h
# ggml/src/ggml-hexagon/htp/hvx-base.h
# ggml/src/ggml-hexagon/htp/hvx-copy.h
# ggml/src/ggml-hexagon/htp/hvx-exp.h
# ggml/src/ggml-hexagon/htp/unary-ops.c
# ggml/src/ggml-opencl/CMakeLists.txt
# ggml/src/ggml-opencl/ggml-opencl.cpp
# ggml/src/ggml-opencl/kernels/cvt.cl
# ggml/src/ggml-rpc/ggml-rpc.cpp
# ggml/src/ggml-sycl/ggml-sycl.cpp
# ggml/src/ggml-virtgpu/ggml-backend.cpp
# ggml/src/ggml-webgpu/ggml-webgpu-shader-lib.hpp
# ggml/src/ggml-webgpu/ggml-webgpu.cpp
# ggml/src/ggml-webgpu/wgsl-shaders/mul_mat_vec.wgsl
# ggml/src/ggml-zdnn/ggml-zdnn.cpp
# ggml/src/ggml-zendnn/ggml-zendnn.cpp
# scripts/sync-ggml.last
# tests/test-backend-ops.cpp
2026-05-02 18:07:50 +08:00
ddh0
b97ebdc98f
llama-quant : fix --tensor-type when default qtype is overridden ( #22572 )
...
fix #22544 (my fault!)
Credit to @Anai-Guo, ref #22559 - since that one was closed due to the
new contributor policy I am taking the liberty of re-submitting that PR
here.
2026-05-01 19:55:55 +02:00
Reese Levine
5cbfb18075
Update llama-mmap to use ftello/fseeko ( #22497 )
...
* Update llama-mmap to work with 32-bit wasm and >2GB models
* Update to gguf.cpp style
2026-04-30 14:17:52 -07:00
Concedo
70be589894
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# CODEOWNERS
# examples/debug/debug.cpp
# examples/eval-callback/eval-callback.cpp
# ggml/src/ggml-cpu/amx/mmq.cpp
# ggml/src/ggml-webgpu/ggml-webgpu-shader-lib.hpp
# scripts/pr2wt.sh
2026-04-28 21:13:40 +08:00
ynankani
0f1bb602dd
model : remove duplicate wo_s scale after build_attn (Qwen3, LLaMA) ( #22421 )
...
Signed-off-by: Yash Nankani <ynankani@nvidia.com>
2026-04-27 09:58:48 +02:00
Concedo
b31877e8ec
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .github/pull_request_template.md
# .gitignore
# docs/backend/SYCL.md
# docs/ops.md
# docs/ops/WebGPU.csv
# examples/sycl/test.sh
# examples/sycl/win-test.bat
# ggml/src/ggml-sycl/common.hpp
# ggml/src/ggml-sycl/ggml-sycl.cpp
# ggml/src/ggml-sycl/sycl_hw.cpp
# ggml/src/ggml-sycl/sycl_hw.hpp
# ggml/src/ggml-webgpu/ggml-webgpu-shader-lib.hpp
# ggml/src/ggml-webgpu/ggml-webgpu.cpp
2026-04-25 19:06:32 +08:00
ddh0
9d34231bb8
llama-quant : default ftype param Q5_1 --> Q8_0 ( #20828 )
...
Change the default `ftype` in `llama_model_quantize_params` from
`LLAMA_FTYPE_MOSTLY_Q5_1` to `LLAMA_FTYPE_MOSTLY_Q8_0`.
In case some external program naively uses the default quantization
params, we should probably default to a known-good type like Q8_0 rather
than Q5_1, which is rather old.
2026-04-25 09:25:35 +03:00
Concedo
0755f27372
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .devops/openvino.Dockerfile
# .github/workflows/build-self-hosted.yml
# .github/workflows/build.yml
# common/chat.cpp
# docs/backend/OPENVINO.md
# examples/speculative-simple/speculative-simple.cpp
# ggml/src/ggml-hexagon/ggml-hexagon.cpp
# ggml/src/ggml-hexagon/htp/CMakeLists.txt
# ggml/src/ggml-hexagon/htp/htp-ctx.h
# ggml/src/ggml-hexagon/htp/htp-ops.h
# ggml/src/ggml-hexagon/htp/main.c
# ggml/src/ggml-hexagon/libggml-htp.inf
# ggml/src/ggml-openvino/ggml-decoder.cpp
# ggml/src/ggml-openvino/ggml-openvino-extra.cpp
# ggml/src/ggml-openvino/ggml-openvino.cpp
# ggml/src/ggml-openvino/ggml-quants.cpp
# ggml/src/ggml-openvino/openvino/op/rope.cpp
# ggml/src/ggml-openvino/openvino/op_table.cpp
# ggml/src/ggml-openvino/openvino/op_table.h
# ggml/src/ggml-openvino/openvino/translate_session.cpp
# ggml/src/ggml-openvino/openvino/utils.cpp
# ggml/src/ggml-openvino/openvino/utils.h
# ggml/src/ggml-openvino/utils.cpp
# ggml/src/ggml-openvino/utils.h
# ggml/src/ggml-sycl/common.hpp
# ggml/src/ggml-sycl/convert.cpp
# ggml/src/ggml-sycl/convert.hpp
# ggml/src/ggml-sycl/gemm.hpp
# ggml/src/ggml-sycl/ggml-sycl.cpp
# ggml/src/ggml-sycl/set_rows.cpp
# ggml/src/ggml-webgpu/ggml-webgpu-shader-lib.hpp
# ggml/src/ggml-webgpu/ggml-webgpu.cpp
# scripts/sync_vendor.py
# tests/CMakeLists.txt
# tests/test-chat.cpp
# tools/cli/cli.cpp
# tools/mtmd/CMakeLists.txt
# tools/server/CMakeLists.txt
2026-04-23 00:55:05 +08:00
manayang
7bfe60fdf9
mtmd, llama : Update HunyuanVL vision-language model support ( #22037 )
...
* mtmd, llama : add HunyuanVL vision-language model support
- add LLM_ARCH_HUNYUAN_VL with M-RoPE (XD-RoPE) support
- add PROJECTOR_TYPE_HUNYUANVL with PatchMerger vision encoder
- add HunyuanVL-specific M-RoPE position encoding for image tokens
- add GGUF conversion for HunyuanVL vision and text models
- add smoke test in tools/mtmd/tests.sh
* fix: fix HunyuanVL XD-RoPE h/w section order
* fix: Remove redundant code
* convert : fix HunyuanOCR / HunyuanVL conversion
- Tested locally: both HunyuanOCR and HunyuanVL-4B convert to GGUF successfully and produce correct inference output on Metal (F16 / Q8_0).
* clip : fix -Werror=misleading-indentation in bilinear resize
* fix CI: convert_hf_to_gguf type check error
- convert_hf_to_gguf.py: give HunyuanVLTextModel.__init__ an explicit `dir_model: Path` parameter so ty can infer the type for load_hparams instead of reporting `Unknown | None`.
---------
Co-authored-by: wendadawen <wendadawen@tencent.com>
2026-04-22 11:58:43 +02:00
Concedo
19a12bb080
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# CODEOWNERS
# common/CMakeLists.txt
# ggml/CMakeLists.txt
# ggml/src/ggml-webgpu/ggml-webgpu-shader-lib.hpp
# ggml/src/ggml-webgpu/ggml-webgpu.cpp
# ggml/src/ggml-webgpu/wgsl-shaders/common_decls.tmpl
# ggml/src/ggml-webgpu/wgsl-shaders/mul_mat_vec.wgsl
# scripts/sync-ggml.last
# tools/cli/cli.cpp
# tools/llama-bench/llama-bench.cpp
# tools/perplexity/perplexity.cpp
2026-04-21 18:53:03 +08:00
Georgi Gerganov
cd03ec7642
llama-ext : fix exports ( #22202 )
2026-04-21 11:04:46 +03:00
Georgi Gerganov
cfe9838d26
fit-params : refactor + add option to output estimated memory per device ( #22171 )
...
* fit-params : add option to output estimated memory per device
* cont : minor
* cont : refactor
* cont : move fit params implementation to libcommon
* cont : header
* cont : headers
* cont : codeowners
2026-04-21 09:54:36 +03:00
Johannes Gäßler
fb19f94c71
TP: fix 0-sized tensor slices, AllReduce fallback ( #21808 )
...
* TP: fix 0-sized tensor slices, AllReduce fallback
* fix layer structure <-> GPU count aliasing
* add missing std::fill
* fix CUDA device set, max ggml ctx size
2026-04-20 18:09:39 +02:00
Concedo
cd6788007e
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .github/workflows/build-cross.yml
# .github/workflows/build-self-hosted.yml
# .github/workflows/release.yml
# examples/llama.android/lib/src/main/cpp/CMakeLists.txt
# ggml/CMakeLists.txt
# ggml/src/ggml-rpc/CMakeLists.txt
# ggml/src/ggml-rpc/ggml-rpc.cpp
# ggml/src/ggml-sycl/mmvq.cpp
# ggml/src/ggml-webgpu/ggml-webgpu-shader-lib.hpp
# ggml/src/ggml-webgpu/ggml-webgpu.cpp
# scripts/sync_vendor.py
# tests/test-chat.cpp
# tests/test-mtmd-c-api.c
# tools/server/README.md
2026-04-20 20:19:11 +08:00
SamareshSingh
81df3f7cfa
fix: GLM-DSA crash in llama-tokenize when using vocab_only ( #22102 )
...
* llama: fix crash in print_info for GLM-DSA when vocab_only is set
* addressed code review comments
* cont : simplify
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-04-20 10:32:46 +03:00
Sigbjørn Skjæret
4f02d47339
model : refactor bias tensor variable names ( #22079 )
...
* refactor bias tensor variable names
* use create_tensor_qkv for jina-bert-v2
2026-04-18 20:12:00 +02:00
Johannes Gäßler
fd1c0ec3f0
llama: fit ctx size for CPU only ( #21568 )
2026-04-18 08:16:04 +02:00
Concedo
79882d669a
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .github/workflows/build-android.yml
# .github/workflows/build.yml
# .github/workflows/release.yml
# CMakeLists.txt
# CODEOWNERS
# common/CMakeLists.txt
# common/common.h
# docs/ops.md
# docs/ops/Metal.csv
# examples/batched/CMakeLists.txt
# examples/convert-llama2c-to-ggml/CMakeLists.txt
# examples/debug/CMakeLists.txt
# examples/diffusion/CMakeLists.txt
# examples/embedding/CMakeLists.txt
# examples/eval-callback/CMakeLists.txt
# examples/gen-docs/CMakeLists.txt
# examples/idle/CMakeLists.txt
# examples/lookahead/CMakeLists.txt
# examples/lookup/CMakeLists.txt
# examples/parallel/CMakeLists.txt
# examples/passkey/CMakeLists.txt
# examples/retrieval/CMakeLists.txt
# examples/save-load-state/CMakeLists.txt
# examples/speculative-simple/CMakeLists.txt
# examples/speculative/CMakeLists.txt
# examples/sycl/CMakeLists.txt
# examples/training/CMakeLists.txt
# ggml/src/ggml-hexagon/htp/hmx-matmul-ops.c
# ggml/src/ggml-hexagon/htp/htp-ops.h
# ggml/src/ggml-hexagon/htp/main.c
# ggml/src/ggml-opencl/CMakeLists.txt
# ggml/src/ggml-opencl/ggml-opencl.cpp
# ggml/src/ggml-opencl/kernels/cvt.cl
# pocs/vdot/CMakeLists.txt
# src/CMakeLists.txt
# tests/CMakeLists.txt
# tests/test-quantize-stats.cpp
# tools/batched-bench/CMakeLists.txt
# tools/cli/CMakeLists.txt
# tools/cli/cli.cpp
# tools/completion/CMakeLists.txt
# tools/cvector-generator/CMakeLists.txt
# tools/cvector-generator/cvector-generator.cpp
# tools/export-lora/CMakeLists.txt
# tools/gguf-split/CMakeLists.txt
# tools/gguf-split/gguf-split.cpp
# tools/imatrix/CMakeLists.txt
# tools/llama-bench/CMakeLists.txt
# tools/llama-bench/llama-bench.cpp
# tools/mtmd/CMakeLists.txt
# tools/perplexity/CMakeLists.txt
# tools/quantize/CMakeLists.txt
# tools/quantize/quantize.cpp
# tools/results/CMakeLists.txt
# tools/server/CMakeLists.txt
# tools/tokenize/CMakeLists.txt
# tools/tts/CMakeLists.txt
2026-04-17 22:37:37 +08:00
Concedo
9a38091207
support q5_1 kv
2026-04-17 17:06:15 +08:00
Eric Zhang
fcc7508759
model : Gemma4 model type detection ( #22027 )
...
* model : Gemma4 model type detection
* model : Gemma4 model type detection
2026-04-17 10:07:11 +02:00
Xuan-Son Nguyen
089dd41fe3
cmake: use glob to collect src/models sources ( #22005 )
2026-04-16 23:25:16 +02:00
Xuan-Son Nguyen
4fbdabdc61
model: using single llm_build per arch ( #21970 )
...
* model: using single llm_build per arch
* fix merge
* nits
2026-04-16 21:10:22 +02:00
PikaPikachu
9db77a020c
model : refactor QKV into common build_qkv and create_tensor_qkv helpers ( #21245 )
...
* model : refactor QKV into common build_qkv and create_tensor_qkv helpers
* model : extend build_qkv to bert/mpt/dbrx/olmo/lfm2/nemotron-h/granite-hybrid/gemma3n-iswa/t5-dec and fix wqkv_s
2026-04-16 17:41:34 +02:00
Sigbjørn Skjæret
f772f6e434
model : support NVFP4 tensors for Gemma4 ( #21971 )
...
* support nvfp4 tensors for Gemma4
* add wo_s to build_attn
* add wo_s to build_attn
* fix glm4
2026-04-16 16:51:47 +02:00
Concedo
ae292c496e
handle SWA conflicting with rewind, increased default SWA padding.
2026-04-16 17:00:26 +08:00