The MEAN/CLS/LAST pooling paths in encode() and decode() used
n_embd_inp() (16384 for qwen3vl with deepstack) to read from the
pooled embedding tensor, which only has n_embd_out() (4096) floats
per sequence. This caused a tensor read out of bounds assertion.
Fixes embedding mode for Qwen3-VL-Embedding models.
* context: zero output buffer on allocation
Address GHSA-wqq9-25mr-rw76.
The logits output buffer allocated in output_reserve() uses
posix_memalign(), which does not zero memory. The buffer is only
written during decode when needs_raw_logits() returns true. When
backend samplers cover all output sequences, needs_raw_logits()
returns false and the buffer is never written, but
llama_get_logits() still returns a pointer to it, exposing stale
heap content.
Zero the buffer after allocation to prevent information disclosure
through the public logits API.
Found-by: Pwno
* Update src/llama-context.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* llama : fix pooling assertion crash in chunked GDN detection path
The chunked fused Gated Delta Net detection in sched_reserve() calls
graph_reserve(16*n_seqs, n_seqs, n_outputs, ...) where n_outputs = n_seqs.
This creates a dimension mismatch in build_pooling() for embedding models
with mean/rank pooling: build_inp_mean() creates a tensor with shape
[n_tokens=16*n_seqs, ...] while t_embd is reduced to [n_outputs=n_seqs, ...]
via out_ids, causing ggml_mul_mat to assert on ggml_can_mul_mat(a, b).
Fix: pass n_tokens as n_outputs in the chunked GDN graph reservation,
matching the pattern used by the pp/tg worst-case reservations.
Regression introduced by #20340 (d28961d).
Same class of bug as #12517, fixed by #12545.
* server : add mean pooling tests to embedding test suite
Add test_embedding_pooling_mean and test_embedding_pooling_mean_multiple
to cover the --pooling mean codepath, which was previously untested.
These tests would have caught the regression introduced by #20340 where
build_pooling() crashes with a ggml_mul_mat assertion due to mismatched
dimensions in the chunked GDN detection path.
---------
Co-authored-by: Domenico Crupi <domenico@zerovolt.it>
* tests: allow loading test-backend-ops tests from json
* add error threshold based on op
* add error when file cannot be read
* add graph operator json extraction tool
* add nb parameter for non-contiguous input tensors
* fix view check
* only use view if non-contiguous/permuted, use C++ random instead of rand()
* replace internal API calls with public llama_graph_reserve call
* reduce test description length
* fix nb[0] not getting set for view
* add name to tests
* fix inplace error
* use text file instead of json
* move llama_graph_reserve function to new llama-ext header, move export-graph-ops to tests/
* fix missing declaration
* use pragma once
* fix indent
* fix Windows build
* llama : enable chunked fused GDN path
* models : avoid Q and K repeats when using fused GDA
* cont : fix comment
Co-authored-by: Aman Gupta <amangupta052@gmail.com>
* cont : fix the fix
Co-authored-by: Aman Gupta <amangupta052@gmail.com>
* cont : fix
* metal : add GDN kernel (#20361)
* metal : add Metal backend for GGML_OP_GATED_DELTA_NET
Add a fused Metal kernel for the gated delta net recurrence op
(#19504), enabling GPU-accelerated inference for DeltaNet-based
models (Qwen3.5, etc.) on Apple Silicon.
Supports both GDA (scalar gate) and KDA (per-row gate) modes
with head_size 64 and 128. Unsupported configurations (head_size
32, non-contiguous tensors) gracefully fall back to CPU.
Performance: Qwen3.5-0.8B Q4_K_M on M4 Max
tg128: 170 -> 213 t/s (+25%)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* metal : validate contiguity of all input tensors in supports_op
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* metal : add algorithm equivalence comment for GDA decay path
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* cont : unslop + optimize
* cont : clean-up
---------
Co-authored-by: Paul Flynn <paul@arkavo.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
* CUDA: AR gated delta net improvements (#20391)
* Add FastDiv to gated_delta_net_cuda
* Shard columns across warps
This reduces register pressure (avoids spill for S_v = 128) and gives
the warp-scheduler more CTAs to schedule (thus hiding data-access
latencies).
* Remove unneded include in gated_delta_net.cu
* Improve comments
* Apply code-formating
* Make sharding HIP-compatible
1. Use ggml_cuda_get_physical_warp_size() to determine warp size flexibly
2. Add test with partial warp to test sum reduction on CUDA
* Remove fastdiv_s64, as we can treat neqk1 and rq3 as uint32_t
* Rename variables
* Enable GDN also for prefill, move TODO for chunked_GDN
* Actually remove the TODO from 206890897546bd16602c3b79394fd5ea09ef199f
* Get warp size at runtime
warp_size is not known at compile time in hip host code.
* Don't expose ggml_cuda_get_physical_warp_size on host
---------
Co-authored-by: uvos <devnull@uvos.xyz>
* llama : refactor llm_build_delta_net_base API
---------
Co-authored-by: Aman Gupta <amangupta052@gmail.com>
Co-authored-by: Paul Flynn <paul@arkavo.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Oliver Simons <osimons@nvidia.com>
Co-authored-by: uvos <devnull@uvos.xyz>
* tests: add end-to-end tests per model architecture
* fixup for rebase
* fix use-after-free in llama-model-loader.cpp
* fix CI
* fix WebGPU
* fix CI
* disable CI for macOS-latest-cmake-arm64
* use expert_weights_scale only if != 0.0f
* comments
* llama : remove write/read of output ids/logits/embeddings
This commit removes the write/read of output ids, logits and
embeddings from the llama context state.
Refs: https://github.com/ggml-org/llama.cpp/pull/18862#issuecomment-3756330941
* completion : add replying of session state
This commit updates the session handing in the completion tool to handle
the that logits are no longer stored in the session file. Instead, we
need to replay the last token to get the logits for sampling.
* common : add common_prompt_batch_decode function
This commit adds a new function which is responsible for decoding prompt
and optionally handle the saving for session data.
* update save-state.cpp to use llama_state_load_file
This commit updates the save-load-state example to utilize the new
llama_state_load_file function for loading the model state from a file.
And it also replays the last token after loading since this state is now
stored before the last token is processed.
* examples : set n_seq_max = 2 for ctx3
This commit updates the save-load-state example to set the n_seq_max
parameter to 2 when initializing the ctx3 context.
The motivation for this change is that using 1 as n_parallel/n_seq_max
the context only supports one sequence, but the test laster tries to
use a second sequence which results in the following error:
```console
main : loaded state with 4 tokens
main : seq 0 copied, 225760 bytes
main : kv cache cleared
find_slot: seq_id=1 >= n_seq_max=1 Try using a bigger --parallel value
state_read_meta: failed to find available cells in kv cache
```
This seems to only happen for recurrent/hybrid models.
This commit updates get_logits_ith(), and get_embeddings_ith() to use
output_resolve_row() to resolve the batch index to output row index.
The motivation for this is to remove some code duplication between these
functions.
* full modern bert support
* added gelu op in rank pooling for modern bert
* still working on stuff, added mean calculation before classifier head
* Update convert_hf_to_gguf.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* first layer is dense, as per modern bert research paper
* Update src/llama-graph.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* fixed set input for mean pooling to check if pooling type is ranking since modern bert does mean & rank
* Update src/llama-graph.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update convert_hf_to_gguf.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Refactoring to use new llama_put_adapter_loras
* cont : alternative lora API
---------
Co-authored-by: Jake Chavis <jakechavis6@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* llama : refactor sampling_info to use buffer_view template
This commit updates the sampling_info struct in llama-context to use a
buffer_view template for the logits, probs, sampled tokens, and
candidates buffers.
The motivation for this is to simplify the code, improve type safety
and readability.
* support qwen3.5 series
* remove deepstack for now, and some code clean
* code clean
* add FULL_ATTENTION_INTERVAL metadata
* code clean
* reorder v heads for linear attention to avoid expensive interleaved repeat
* Unified delta net handling
* Remove old methods.
* Refactor and optimize
* Adapt autoregressive version from @ymcki
* Change to decay mask approach
* Fix bad permute
* Qwen 3.5 support
* Apply suggestions from code review
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Further fixes
* Use inheritance, remove unneeded conts
* Not like this!
* Remove ggml.h explicit import
* Remove transformers, fix the views
* ACTUALLY fix views, make super calls explicit in conversion.
* Fix conversion again
* Remove extra ggml.h imports
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* kimi linear model implementation
* kimi linear convert_hf_to_gguf
* kimi linear constants.py tensor_mapping.py
* Kimi Linear ggml.h
* kimi linear ggml-cpu
* Kimi Linear ggml-cuda
* Kimi Linear ggml.c
* kimi linear src/llama
* remove "const int64_t n_seq_tokens = q->ne[2];" to get rid of unused variable warning
* remove type mismatch warning
* read MoE params
* removed some hard coded code
* removed all hard code
* use DeepseekV2 tokenizer
* removed unnecessary internal methods called by the old set_vocab of KimiLinear
* rewrite get_vocab for KimiLinear. Removed all kda_scan code
* removed all traces of kda_scan
* reduce OP count by 1 due to removal of kda_scan
* Move KIMI_LINEAR to llm_arch_is_hybrid to enable KV cache
* set n_embd_head_k/v to ensure kv cache works
* don't quantize conv1d of Kimi Linear
* Kimi Linear backend agnostic
* removed LOG_INFO
* naive chunking form implemented
* fixed some comments
* add Kimi-K2 specific tokens to be recognized as EOG
* build_kda_autoregressive is implemented to replace build_kda_recurrent for faster inference. sync'd to b7682
* replaced Akk and Aqk with mul_mat and clamp
* no clamp version
* Moved Aqk computation out of the loop
* fixed typo and split wkv_b into wk_b and wv_b
* MLA KV cache support
* fix trailing spaces
* moved const llama_model & model; around to follow qwen3next format and see if it cna pass the -Wunused-private-field error
* fix trailing whitespace
* removed traling whitespaces in empty line + make sure indentation is multiple of 4
* try to make lint happy
* remove blank lines to make lint happy
* removed at least blank line containing white space
* fixed flake8 complaints locally
* return ggml_tensor * pair in kda_autoregressive and kda_chunking as in ngxson's Qwen3Next improvement
* removed Kimi-Linear specific change that causes failure at server-windows
* removed private: from kimi_linear to make build checks happy
* removed unnecessary ggml_cont before ggml_reshape
* created static function causal_conv1d to abtract similar code for q/k/v
* merged dt_bias to SSM_DT. Do -exp(log_A) in convert_hf_to_gguf.py.
* reverted to original
* fixed find_hparam calls. Fixed e_score_correction_bias to use bias instead of weight. Removed all ssm_conv bias terms.
* remove DT_B from constants.py. remove one comment line in llama-model.cpp
* new class llm_graph_input_mem_hybrid_k to get around the new MLA change. switch the concat order of ggml_concat calls in kimi-linear.cpp to accommodate MLA changes. Removed support for exp_probs_b.weight
* remove ssm_o_norm_b
* remove ssm_o_norm_b
* changed hparams.kda_head_dim to hparams.n_embd_head_kda. added TODO comment for class llama_graph_mem_hybrid_k
* removed all ggml_cont b4 ggml_reshape_4d
* Whitespace
* replaced all hparams.get with find_hparams
* added new names for n_experts, n_experts_used and score_func in TextModel and removed their code in KimiLinear in convert_hf_to_gguf.py. Removed unnecessary ggml_cont and GGML_ASSERT in kimi-linear.cpp
* use is_mla to switch between different mem_hybrid types
* fixed logical errors in convert_hf_to_gguf.py pointed out by CISC
* removed if else for required parameters kv_lora_rank and qk_rope_head_dim
* add back ggml_cont for Vcur
* minor changes
* removed extra line in llama-vocab.cpp. Added back the comment in llama-graph.cpp
* f16 gguf cannot run without context length
* made a mistake of adding back n_ctx parsing
---------
Co-authored-by: Piotr Wilkin (ilintar) <piotr.wilkin@syndatis.com>
* sampling : remove sampling branching in output_reserve
This commit updates output_reserve in llama-context.cpp to always
allocate sampling buffers regardless of whether sampling is needed for
the current batch.
The motivation for this is to avoid reallocations and branching based on
the sampling requirements of the batch.