* tests : move save-load-state from examples to tests
- Move examples/save-load-state/ to tests/test-save-load-state.cpp
- Remove subdirectory reference from examples/CMakeLists.txt
- Add test to tests/CMakeLists.txt as a model test
- Remove CODEOWNERS entry for removed example directory
Assisted-by: llama.cpp:local pi
* cont : update ci
* metal : fix GGML_OP_SET kernel threads
* tests : extend test_cpy to support different src/dst shapes
Extend test_cpy to support different source and destination tensor shapes
for CPY operations (reshaping), where the total number of elements must match.
- Renamed ne -> ne_src, added ne_dst parameter (default: use src shape)
- Added 50 new reshaping test cases covering 1D<->2D<->3D<->4D conversions
- Tests exercise 1024 boundary, small shapes, and large dimensionality changes
- Fixed dangling reference bug (storing & to temporary std::array)
- Updated all existing test calls with permute/transpose args for compatibility
Assisted-by: llama.cpp:local pi
* metal : optimize concat kernel with row batching for small widths
When ne0 < 256, batch multiple rows into a single threadgroup to improve
occupancy. This avoids underutilizing the GPU when processing narrow tensors.
- Dispatch nth = min(256, ne0) threads per group
- Calculate nrptg (rows per threadgroup) to fill up to 256 threads
- Update kernel index calculation to handle the row batching
- Add boundary check for i1 >= ne1
Assisted-by: llama.cpp:local pi
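The row-batching arithmetic above can be sketched on the host side as follows. This is a minimal illustration, not the Metal kernel: the function names are hypothetical, and the `nrptg` formula (integer division of 256 by `nth`) is an assumption about how "fill up to 256 threads" is computed.

```python
def concat_dispatch(ne0, ne1, max_threads=256):
    """Return (nth, nrptg, n_groups) for ne1 rows of width ne0 (ne0 >= 1)."""
    nth = min(max_threads, ne0)            # threads per row, as in the commit
    nrptg = max(1, max_threads // nth)     # rows per threadgroup (assumed formula)
    n_groups = (ne1 + nrptg - 1) // nrptg  # ceil(ne1 / nrptg)
    return nth, nrptg, n_groups


def rows_in_group(group, nrptg, ne1):
    """Rows handled by one threadgroup, with the i1 >= ne1 boundary check."""
    rows = []
    for r in range(nrptg):
        i1 = group * nrptg + r
        if i1 >= ne1:  # the last group may be only partially filled
            break
        rows.append(i1)
    return rows
```

With a narrow tensor such as `ne0 = 64`, four rows share one threadgroup, so the last group needs the boundary check whenever `ne1` is not a multiple of four.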
* tests : clean-up
* tests : refactor CPY shape tests to use dimension permutations
Replace 75 hardcoded test cases with a loop over permutations of
{3, 5, 7, 32} (total elements: 3360). Each src permutation is tested
against canonical sorted and reverse dst, skipping identical shapes.
Covers F32, F16, and Q4_0 (when both src and dst ne0 == 32).
Assisted-by: llama.cpp:local pi
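The case generation described above might look roughly like this. It is a sketch in Python rather than the actual C++ in test-backend-ops; the function name and the tuple layout of a case are assumptions made for illustration.

```python
from itertools import permutations
from math import prod

DIMS = (3, 5, 7, 32)  # every permutation has prod(DIMS) == 3360 elements


def cpy_shape_cases():
    """Enumerate (type, src_shape, dst_shape) triples for the CPY shape tests."""
    sorted_dst = tuple(sorted(DIMS))
    reverse_dst = tuple(reversed(sorted_dst))
    cases = []
    for src in permutations(DIMS):
        for dst in (sorted_dst, reverse_dst):
            if src == dst:
                continue  # identical shapes are not a reshape, skip them
            types = ["f32", "f16"]
            # Q4_0 is only included when both row sizes (ne0) are 32
            if src[0] == 32 and dst[0] == 32:
                types.append("q4_0")
            for t in types:
                cases.append((t, src, dst))
    return cases
```

Because every shape is a permutation of the same four dimensions, the element counts of source and destination always match, which is the precondition for CPY with reshaping.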
* hexagon: remove gathers and better handling of vtcm in ssm-conv
* hexagon: relax ssm-conv gating requirements
* hexagon: add new prefill ssm-conv backend test
* hexagon: remove trailing white space
* hex-rope: uninline rope_cache_init, otherwise it breaks after rebasing with SSM_CONV changes
---------
Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>
* common : delegate assistant continuation to template handler
* server : implement echo parameter to exclude assistant prefill in the response
* server : fix tests for prefill
* server : use existing llama template
* cont : clean up
* spec: support MTP
* fix batch size
* rename files
* cont : simplify (#7)
* MTP: clean-up (#9)
* MTP: clean-up
* review: use llama_context_type instead of llama_graph_type
* review: remove llama_model_has_mtp
* review: fix convert issues
* convert: fix pycheck
* review: formatting
* use `mtp-` for identifying mtp models
* convert: fix mtp conversion
* mtp -> draft-mtp
* remove unused llama_arch
* add need_embd in speculative
* llama: allow partial seq_rm for GDN models for speculative decoding
Currently, speculative decoding has to restart from a checkpoint whenever
some draft tokens are rejected, which wastes work re-running the target
model. This PR adds the ability to roll back up to `draft_max` tokens by
storing the GDN intermediates.
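The idea of rolling back by keeping intermediate states can be illustrated with a toy recurrent state. This is only a sketch under stated assumptions: the class, its integer "state", and the update rule are hypothetical stand-ins, not the GDN kernel or the llama.cpp memory API.

```python
class ToyRecurrentState:
    """Toy recurrent state that snapshots itself after every draft token."""

    def __init__(self, draft_max):
        self.draft_max = draft_max
        self.state = 0
        self.snapshots = [self.state]

    def step(self, token):
        # Stand-in for the real recurrent update; not invertible in general,
        # which is why snapshots are needed for rollback.
        self.state = (self.state * 31 + token) % 1_000_003

    def decode_draft(self, tokens):
        assert len(tokens) <= self.draft_max
        self.snapshots = [self.state]          # slot 0: state before the draft
        for tok in tokens:
            self.step(tok)
            self.snapshots.append(self.state)  # slot i: state after i tokens

    def rollback(self, n_accepted):
        # Restore the state as if only the first n_accepted tokens were decoded.
        self.state = self.snapshots[n_accepted]
```

After `decode_draft([1, 2, 3])`, calling `rollback(1)` leaves the same state as decoding only the first token, without re-running anything from an earlier checkpoint.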
* fix pending state
* vulkan: add GDN partial rollback
* meta: extend check to axis 1
* metal: add GDN partial rollback
Extend the gated delta net kernel to store intermediate states for
partial rollback support on the Metal backend.
- Add K (snapshot slot count) as a function constant
- Read input state from slot 0 of the 3D state tensor
- Write intermediate states to different slots during token loop
- For K=1, maintain backward-compatible single-slot behavior
Ref: 8c05923630
Assisted-by: llama.cpp:local pi
* delta_net_base: use ggml_pad instead of new_tensor
* review: add need_rs_seq
* review: rename part_bounded to n_rs
* review: deslop comments
* review: rename, add asserts
* server : adjust checkpoint logic (#11)
* server : adjust checkpoint logic
* cont : rm asserts
* server-context: fix early exit
* spec : fix compatibility with n-gram and add TODOs (#13)
* metal : cleanup
* llama : fix faulty bitwise check in recurrent memory
* server : disable RS-based MTP in combination with other spec types
* spec : add TODOs
* cont : fix comment
* cont : update comment
* common : fix logic for ngram + mtp compat
* llama-memory: enable checkpointing with partial rollback
* cont: add test-case for loading into a dirty ctx
* llama-memory-recurrent: clear rs_idx in clear
* download: fix mtp path
* llama-arch: fix enorm op
* docs: update docs
* conversion: fix type annotations
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
The MUL_MAT test loop iterates over base_types[] to generate non-contig
permutation cases (3 standard permutations across n in {1, 8, 16}).
BF16 is absent from base_types[], so these 9 cases were never generated
for BF16 even though every other type covered by base_types[] tests them.
Add the missing 9 cases explicitly: BF16 x F32, m=16, k=256, bs=[2,3],
permutations {0,2,1,3}, {0,1,3,2}, {0,3,2,1}, with n in {1, 8, 16}.
Suggested-by: @jeffbolznv
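For reference, the nine added cases are the cross product of three permutations and three values of n. The following enumeration is a sketch only; the dictionary keys are illustrative names, not the actual test_mul_mat constructor arguments.

```python
from itertools import product

PERMUTATIONS = [(0, 2, 1, 3), (0, 1, 3, 2), (0, 3, 2, 1)]
N_VALUES = [1, 8, 16]


def bf16_non_contig_cases():
    """List the BF16 x F32 non-contiguous MUL_MAT cases named in the commit."""
    return [
        {"type_a": "bf16", "type_b": "f32", "m": 16, "n": n, "k": 256,
         "bs": (2, 3), "per": per}
        for n, per in product(N_VALUES, PERMUTATIONS)
    ]
```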
* Support for Codex CLI by skipping unsupported Responses tools
* Warn on skipped Responses tools and preserve gpt-oss apply_patch rejection
* Revert gpt-oss apply_patch special handling
* unicode,test: add Qwen3.5 non-backtracking tokenizer handler and regression tests
- Add unicode_regex_split_custom_qwen35() to [src/unicode.cpp](src/unicode.cpp), a non-backtracking handler for Qwen3.5's [\p{L}\p{M}]+ regex (letters + combining marks).
- Register the handler in the custom tokenizer dispatch table to prevent stack overflows on long inputs (fixes #21919).
- Add [models/ggml-vocab-qwen35.gguf](models/ggml-vocab-qwen35.gguf) (test vocab), [models/ggml-vocab-qwen35.gguf.inp](models/ggml-vocab-qwen35.gguf.inp) (test cases), and [models/ggml-vocab-qwen35.gguf.out](models/ggml-vocab-qwen35.gguf.out) (expected output) for regression testing.
- Update [tests/CMakeLists.txt](tests/CMakeLists.txt) to include the new test entry.
This mirrors the Qwen2 fix (commit 0d049d6), but adapts for Qwen3.5's regex. Ensures robust Unicode tokenization and prevents std::regex stack overflows.
Closes #21919.
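To show why a hand-written scanner avoids the stack overflow, here is a minimal sketch of a non-backtracking split for the `[\p{L}\p{M}]+` sub-pattern alone. It is written in Python for illustration and does not reproduce unicode_regex_split_custom_qwen35() or the rest of the Qwen3.5 pre-tokenizer regex; the helper names are hypothetical.

```python
import unicodedata
from itertools import groupby


def is_letter_or_mark(ch):
    """True for Unicode general categories L* (letters) and M* (marks)."""
    return unicodedata.category(ch)[0] in ("L", "M")


def split_letter_mark_runs(text):
    """Split text into maximal runs of letters/marks and runs of everything else.

    A single left-to-right pass: no recursion and no backtracking, so the
    cost is linear in len(text) regardless of input length.
    """
    return ["".join(group) for _, group in groupby(text, key=is_letter_or_mark)]
```

A combining accent such as U+0301 stays attached to its base letter, and an input of several hundred thousand letters is processed in one iteration over the string.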
* fix: enhance regex handling for Qwen3.5 tokenizer to include accent marks
* cont : remove trailing whitespace
---------
Co-authored-by: Kabir <kabir@example.com>
Co-authored-by: Alde Rojas <hello@alde.dev>
* cuda: tighten snake fusion type checks for all operands (defensive, sync vulkan)
* cuda: reject snake fusion when ne[2] or ne[3] > 1 (mirror vulkan PR review)
* cuda: merge type_ok and types_ok into a single types_ok (address am17an review)
* cuda: filter ADD/SUB/MUL/DIV in supports_op to F32/F16
bin_bcast only dispatches F32/F16 type triplets. Mirror the
Vulkan filter so unsupported types fall back through cpy
instead of aborting.
* test-backend-ops: extend snake_fuse to rank-4 with ne[2]/ne[3] > 1 cases
* mimo-v2.5: add flash attention mma/tiles for d_kq=192 d_v=128
* mimo-v2.5: follow (256, 256) fattn templates
* mimo-v2.5: cleanup comments
* mimo-v2.5: further comment cleanup
* mimo-v2.5: address PR feedback
- Fix GQA handling
- Check for other dangling 320/576 carveouts and mirror them for 192
- Add to backend ops test so new paths are covered
* cuda: fuse snake activation (mul, sin, sqr, mul, add)
Add ggml_cuda_op_snake_fused with F32 / F16 / BF16 templates. The
matcher recognizes the naive 5 op decomposition emitted by audio
decoders (BigVGAN, Vocos) for snake activation
y = x + sin(a*x)^2 * inv_b and rewrites it to a single elementwise
kernel.
Add test_snake_fuse comparing CPU naive vs CUDA fused across
F32 / F16 / BF16.
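The equivalence that the fusion relies on can be checked with a small scalar reference. This is a pure-Python sketch, assuming `a` and `inv_b` are plain scalars; it stands in for neither the CUDA kernel nor test_snake_fuse.

```python
import math


def snake_naive(x, a, inv_b):
    """The 5-op decomposition: mul, sin, sqr, mul, add."""
    ax = [a * v for v in x]               # mul
    s = [math.sin(v) for v in ax]         # sin
    s2 = [v * v for v in s]               # sqr
    scaled = [v * inv_b for v in s2]      # mul
    return [xv + sv for xv, sv in zip(x, scaled)]  # add


def snake_fused(x, a, inv_b):
    """Single elementwise pass computing y = x + sin(a*x)^2 * inv_b."""
    return [v + math.sin(a * v) ** 2 * inv_b for v in x]
```

Both functions produce the same values up to floating-point rounding; the fused form simply avoids materializing four intermediate tensors.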
* cuda: address review feedback from @am17an
Use ggml_cuda_cast for F32/F16/BF16 conversions and rename
kernel_snake to snake_kernel to match upstream conventions.
* cuda: snake fusion fastdiv on T_len, Suggested-by: @am17an
* Update tests/test-backend-ops.cpp
Co-authored-by: Aman Gupta <amangupta052@gmail.com>
* cuda: snake fusion check add->type matches x->type
Address review feedback from @am17an
* cuda: snake fusion check add->type matches x->type
Moved for readability (equivalent)
Address review feedback from @am17an
---------
Co-authored-by: Aman Gupta <amangupta052@gmail.com>
* CUDA: batch out_prod inner loop with cublasSgemmStridedBatched
* CUDA: add cublasSgemmStridedBatched mapping for HIP and MUSA backends
* chat/autoparser: the fixes
* Move optspace() to chat-peg-parser, comment out server tests invalidated due to content now allowed with forced tool calls.
* Trim whitespace on apply instead
* vulkan: Support asymmetric FA in coopmat2 path
There has been some recent interest/experimentation with mixed quantization
types for FA. I had originally designed the cm2 FA shader with this in mind
(because I didn't realize it wasn't supported at the time!); this change
adds the missing pieces and enables it.
Also support Q1_0 since people have been trying that out (seems crazy, but
who knows).
We should be able to do similar things in the coopmat1/scalar path, but
there's another change open against the scalar path and I don't want to
conflict.
* reorder cases
* Changed to leak logger singleton to prevent hanging on Windows
* Fix comment
* Stopped using static vector
Using std::vector causes g_col to be released before the logger thread exits, so the logger thread touches freed memory and crashes.
* Change so all logs are output before exit
* Added debug logging
* added more logging
* Added logging
* Explicitly free logger to avoid hanging on Win
* Reverted to leak logger instance again
* Removed debug log and fixed comment
* Fixed comment
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
DONE state absorbs all tokens including a new start tag, causing any think blocks after the first to run unbudgeted. Observed on unsloth/Qwen3.6-27B-GGUF which interleaves multiple <think> blocks per response.
Fixed by advancing start_matcher in DONE branch and re-arming to COUNTING with a fresh budget on match. Adds regression test (test-reasoning-budget: test 6).
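The re-arming behaviour can be illustrated with a toy state machine over a token list. This is a sketch under simplifying assumptions: tags are single tokens, and the function filters an already generated sequence instead of forcing the end tag during sampling as the real reasoning-budget sampler does.

```python
def apply_budget(tokens, budget, start="<think>", end="</think>"):
    """Keep at most `budget` tokens inside every think block."""
    out = []
    state = "IDLE"
    remaining = 0
    for tok in tokens:
        if state in ("IDLE", "DONE"):
            out.append(tok)
            # The fix: DONE also watches for a start tag and re-arms
            # COUNTING with a fresh budget instead of absorbing it.
            if tok == start:
                state = "COUNTING"
                remaining = budget
        else:  # COUNTING
            if tok == end:
                out.append(tok)
                state = "DONE"
            elif remaining > 0:
                out.append(tok)
                remaining -= 1
            # otherwise the budget is exhausted and the token is dropped
    return out
```

Without the check in the DONE branch, the second think block in an interleaved response would pass through untouched, which is the bug described above.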
* feat: (vocab) fix stray text appended in llama_decode_text
Remove accidental concatenation of the full `text` string when
formatting UNK_BYTE hex escapes. Only the closing "]" should be appended.
* feat(mtmd): add Yasa2 vision encoder support
Add a Yasa2 (ConvNeXtV2-based) vision encoder for reka-edge:
- Register PROJECTOR_TYPE_YASA2 and tensor name definitions
- Add yasa2_block/yasa2_stage model structs
- Implement graph builder with ConvNeXt stages, GRN, adaptive pooling
- Wire into clip.cpp switch statements and mtmd.cpp init_vision
- Use mtmd_image_preprocessor_fixed_size for image preprocessing
* feat(chat): add reka-edge template handler (tools, thinking)
- Add chat-reka.cpp/h implementing PEG-based parser for reka-edge format
- Add Reka-Edge.jinja chat template
- Detect reka-edge template in try_specialized_template()
- Add LLAMA_EXAMPLE_MTMD to chat-template-file arg
* feat: add reka vlm to gguf conversion script
Converts Reka Yasa2 hf checkpoints to GGUF format:
- Text decoder: Llama-arch with tiktoken/BPE vocab
- Mmproj (--mmproj): ConvNeXt vision backbone + language_projection
- Generates 2D sincos positional embeddings for vision encoder
* test: add Reka Edge chat template and parser tests
- test-chat-template: oracle tests comparing Jinja engine output vs
common_chat_templates_apply for text, tools, thinking, images, video
- test-chat: PEG parser tests for Reka Edge format, round-trip tests
for image/video content parts, common path integration tests
* scripts: add Reka Edge mixed quantization helper
Q4_0 base quantization with Q8_0 override for the last 8 transformer
blocks (layers 24-31) via --tensor-type regex.
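A regex selecting layers 24-31 could look like the one below. This is a sketch only: the `blk.<n>.` tensor-name prefix and the helper function are assumptions for illustration, and the exact pattern used by the helper script is not shown in this log.

```python
import re

# Matches block indices 24..29 and 30..31, followed by a literal dot so that
# indices such as 2 or 240 are not picked up by accident.
LAST_BLOCKS = re.compile(r"blk\.(2[4-9]|3[01])\.")


def pick_type(tensor_name, base="q4_0", override="q8_0"):
    """Return the override type for tensors in the last 8 blocks, else the base."""
    return override if LAST_BLOCKS.match(tensor_name) else base
```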
* fix: adapt chat-reka and tests to upstream API
- Use autoparser::generation_params (not templates_params)
- Add p.prefix(generation_prompt) to PEG parser
- Simplify reasoning parser to match LFM2 pattern
- Remove image/video oracle tests (unsupported by oaicompat parser;
no other multimodal models test this path)
* fix: avoid duplicate tensor loading in yasa2 vision encoder
TN_YASA_PATCH_W and TN_PATCH_EMBD both resolve to "v.patch_embd.weight",
causing the same tensor to be loaded twice into ctx_data and overflowing
the memory pool. Reuse the tensors already loaded by the common section.
* chore: update image pre-processing settings
The reka-edge model depends on the following settings in an older
fork of llama.cpp:
1. Fixed square resize
2. BICUBIC
3. add_padding=false
In current llama.cpp, this means setting:
- image_resize_algo = RESIZE_ALGO_BICUBIC
- image_resize_pad = false
* chore: remove reka gguf conversion script
* chore: remove reka quantization script
* chore: remove unnecessary changes from PR scope
This commit removes changes that are unnecessary for the PR scope:
1. BPE decoder bug fix - this affects reka edge because there's a bug
in our tokenization that doesn't represent <think> tokens as special
tokens. However this isn't meant to be a thinking model so when run
with --reasoning off the edge case does not affect us
2. --chat-template-file support from llama-mtmd-cli - the focus is on
llama-server and the reka edge gguf contains the necessary metadata
to detect the chat template
3. reka edge oracle test cases - no other model has similar test cases,
so I removed them for standardization
* chore: remove unnecessary ggml_cast
This commit removes unnecessary ggml_cast after updating the
reka vlm -> gguf conversion script on hugging face.
* chore: remove redundant code
* chore: remove unnecessary ggml_cont calls
This commit removes all ggml_cont calls except the four that
precede ggml_reshape_3d/ggml_reshape_4d. Those are necessary
because ggml_reshape recomputes strides assuming contiguous
layout and asserts ggml_is_contiguous.
Other operations (ggml_mean, ggml_add, ggml_mul etc.) use
stride-based indexing and handle non-contiguous inputs
correctly and so we are ok to remove ggml_cont for those.
* chore: remove unnecessary ggml_repeat calls
This commit removes unnecessary ggml_repeat calls because the underlying
ops already broadcast automatically.
Every ggml_repeat in yasa2.cpp was expanding a smaller tensor to match
a larger one's shape before passing both to an elementwise op (ggml_add,
ggml_sub, ggml_mul, or ggml_div). This is unnecessary because all four
of these ops already support broadcasting internally.
* chore: restore ggml_cont needed for cpu operations
* refactor: locate reka chat template handler in chat.cpp
* chore: remove unnecessary warmup tokens
* chore: add code comments on image_resize_pad
* chore: remove custom reka parsing code
* chore: revert common/chat.cpp
* Uncomment debug logging for PEG input parsing
---------
Co-authored-by: Piotr Wilkin (ilintar) <piotr.wilkin@syndatis.com>
* fix: enable reasoning budget sampler for gemma4
Add thinking_start_tag and thinking_end_tag to
common_chat_params_init_gemma4(). Without these, the reasoning
budget sampler never activates for gemma4.
Make the newline after "thought" optional in the PEG parser to
handle budget=0 (sampler forces end tag before the newline).
Add test case for empty thinking block.
Fixes #21487
* use p.space() instead of p.optional(p.literal("\n")) in gemma4 thought parser
* ggml: backend-agnostic tensor parallelism
* support for GPT-OSS, Qwen 3 MoE
* partial Vulkan fix
* add support for 4/8 GPUs
* unconditional peer access
* re-use buffers + ggml contexts
* fix output pattern
* NCCL support
* GGML: HIP: add RCCL support
* Remove shfl and AllReduce from backend interface
* move allocation workaround out of ggml-alloc.c
* 2d tensor set/get support
* Fix the seg fault without NCCL
* Apply suggestion from JohannesGaessler
* support for tensor dims % n_devs != 0
* fix view_offs scaling
* arbitrary num. of GPUs/tensor split
* fix compilation
* better granularity estimate
* Support device-specific host buffer types if all underlying backends expose the same type. This allows using pinned memory instead of pageable memory for CUDA.
Fix compilation errors.
* partial Qwen 3 Next support
* Fix qwen3 30b (#8)
* Fix crash with Qwen-30B-A3B Q4_0
Qwen-30B-A3B Q4_0 has an intermediate dimension of 768. Using a granularity of 256 forces an uneven split between GPUs, which is not supported by the current implementation.
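The arithmetic behind the crash can be worked through with a small helper. This is an illustrative sketch, not the meta backend's split logic: the function is hypothetical, and it assumes two GPUs and that a granularity of 32 corresponds to the Q4_0 block size.

```python
def split_by_granularity(dim, granularity, n_devs):
    """Distribute dim across n_devs in whole multiples of granularity."""
    assert dim % granularity == 0
    n_blocks = dim // granularity
    base, extra = divmod(n_blocks, n_devs)
    # The first `extra` devices receive one additional block.
    return [(base + (1 if i < extra else 0)) * granularity for i in range(n_devs)]


print(split_by_granularity(768, 256, 2))  # 3 blocks over 2 devices: uneven
print(split_by_granularity(768, 32, 2))   # 24 blocks over 2 devices: even
```

With a granularity of 256 the dimension of 768 yields three blocks, which cannot be divided evenly between two devices; a granularity tied to the quantization block size gives 24 blocks and an even split.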
* Decide block size based on tensor quantization type
* Fix crashes due to KV cache serialization (#9)
KV cache serialization requires non-zero offsets on the tensor. Add support in the meta backend to set/get a tensor with a non-zero offset.
* metal : fix build (#7)
* static memory allocations, fix usage count
* fix tensor granularity
* more even memory distribution
* use BF16 for allreduce
* rebase fixup
* better error message for unsupported architectures
* Fix device mismatch during scatter of allReduce. (#11)
There is a mismatch between the dst buffer device and the backend device, which causes sync copies to be used.
* Enable the previous allreduce implementation. It is better in both perf and stability (#12)
* delay AllReduce for MoE for less I/O
* build : clean-up compile warnings
* backend : move most of the meta backend API to ggml-backend-impl.h
* cont : hide unused public API in the implementation
* llama : use llama_device + remove ggml_backend_dev_is_meta()
* ggml-backend : remove unused alloc include
* minor : remove regex include
* ggml : introduce ggml-ext.h for staging new APIs
* rebase fixup
* fix tests
* llama : more robust logic for determining Meta devices (#16)
* llama : more robust logic for determining Meta devices
* cont : fix devs size check
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* cont : fix log type
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
---------
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* disable roundtrip for meta backend
* fix arch selection
* Qwen 3.5 support
* fix Gemma 4 MoE
* fix OpenVINO, SYCL
* fix test-llama-archs for CPU-only builds
* Fix Qwen 3.5 MoE
* disable meta backend tests for WebGPU
* tests : filter CPU-based devices from the Meta backend tests (#17)
* meta : formatting, naming, indentation (#18)
* formatting : llama-model.cpp
* formatting : ggml-ext.h
* formatting : ggml-backend-meta.cpp
* meta : add TODO
* add documentation
* better error messages
* fix GPT-OSS
---------
Co-authored-by: Carl Philipp Klemm <carl@uvos.xyz>
Co-authored-by: Gaurav Garg <gaugarg@nvidia.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* requirements : update transformers to 5.5.0
This commit updates the transformers dependency to version 5.5.0.
The motivation for this is that transformers 5.5.0 includes support for
Gemma4 and is required to be able to convert Gemma4 models. This is also
causing issues for users of gguf-my-repo.
Refs: https://huggingface.co/spaces/ggml-org/gguf-my-repo/discussions/202
* fix huggingface_hub version
* set version of transformers to 5.5.0
* convert : add ty ignore directives to convert_hf_to_gguf.py
This commit adds `ty: ignore` directives to transformers tokenizer
fields/methods to avoid type check errors. There might be better ways to
handle this and perhaps this can be done in a follow up commit.
The motivation for this is that it looks like in transformers 5.5.0
AutoTokenizer.from_pretrained can return generic tokenizer types or None
and the type checker now produces an error when the conversion script
accesses fields like tokenizer.vocab.
* convert : add ty ignore to suppress type check errors
* convert : remove incorrect type ignores
* convert : fix remaining python checks
I was running a newer version of ty locally but I've switched to
version 0.0.26 which is what CI uses and I was then able to reproduce
the errors. Sorry about the noise.
* update transformers version to 5.5.1
* feat: jinja engine improvements for reka-edge
Port three Jinja engine improvements needed for the reka-edge model:
1. Python-style string repetition ("ab" * 3 → "ababab")
2. ensure_ascii=true support for tojson filter (escapes non-ASCII to \uXXXX)
3. int() builtin on value_int_t (identity, needed for Reka Edge template)
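Since the three features mirror Python semantics, the reference behaviour can be shown directly in Python. This sketch demonstrates what the Jinja engine is expected to match; it does not exercise the C++ engine itself, and `tojson_ascii` is a hypothetical helper name.

```python
import json


def tojson_ascii(value):
    """Python counterpart of `value | tojson(ensure_ascii=true)`."""
    return json.dumps(value, ensure_ascii=True)


repeated = "ab" * 3            # 1. string repetition
escaped = tojson_ascii("é")    # 2. non-ASCII escaped as \uXXXX
same_int = int(42)             # 3. int() on an integer is the identity

print(repeated, escaped, same_int)
```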
* fix: escape invalid utf8 bytes when ensure_ascii=true
The json_ensure_ascii_preserving_format function does not correctly
handle an edge case: if UTF-8 parsing fails, it adds the non-ASCII
byte back to the output as a raw byte.
This commit fixes that by adding the Unicode replacement character
\ufffd to the output instead. This is the standard behavior in
programming languages such as Python, Rust, and Go.
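The Python behaviour referred to above can be reproduced as follows. This is a sketch of the reference semantics only, with a hypothetical helper name; it is not a port of json_ensure_ascii_preserving_format.

```python
import json


def bytes_to_ascii_json(raw):
    """Decode bytes leniently, then emit an ASCII-only JSON string."""
    # errors="replace" substitutes U+FFFD for bytes that are not valid UTF-8
    text = raw.decode("utf-8", errors="replace")
    return json.dumps(text, ensure_ascii=True)


print(bytes_to_ascii_json(b"a\xffb"))
```

The byte 0xFF can never appear in valid UTF-8, so it is replaced by U+FFFD and then escaped, leaving no raw non-ASCII byte in the output.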
* chore: address PR comments
1. Add TODO comment for supporting string repetition for arrays/tuples
2. Add support for float identity operation
3. Move invalid ascii test case to test_fuzzing
* chore: accept suggestion for common/jinja/value.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>