Add n_prompt_tokens, n_prompt_tokens_processed, and n_prompt_tokens_cache
to the /slots JSON response. These fields are already tracked internally
but were not exposed, making it impossible for clients to monitor prompt
evaluation progress during processing.
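A client could combine the new fields to estimate prompt-evaluation progress. The sketch below is a minimal illustration, not the server's own logic. It assumes that n_prompt_tokens is the total prompt length and that n_prompt_tokens_processed counts the tokens evaluated so far. The helper name prompt_progress and the sample slot dict are hypothetical.

```python
def prompt_progress(slot: dict) -> float:
    """Estimate the fraction (0.0 to 1.0) of the prompt processed so far.

    Assumption: n_prompt_tokens is the total and n_prompt_tokens_processed
    is the running count; the exact semantics are not stated in the commit.
    """
    total = slot.get("n_prompt_tokens", 0)
    done = slot.get("n_prompt_tokens_processed", 0)
    if total <= 0:
        return 0.0
    return min(1.0, done / total)


# Hypothetical entry shaped like one element of the /slots response
sample_slot = {
    "n_prompt_tokens": 2048,
    "n_prompt_tokens_processed": 512,
    "n_prompt_tokens_cache": 256,
}
print(f"prompt progress: {prompt_progress(sample_slot):.0%}")
```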
* metal : fix GGML_OP_SET kernel threads
* tests : extend test_cpy to support different src/dst shapes
Extend test_cpy to support different source and destination tensor shapes
for CPY operations (reshaping), where the total number of elements must match.
- Renamed ne -> ne_src, added ne_dst parameter (default: use src shape)
- Added 50 new reshaping test cases covering 1D<->2D<->3D<->4D conversions
- Tests exercise 1024 boundary, small shapes, and large dimensionality changes
- Fixed dangling reference bug (storing & to temporary std::array)
- Updated all existing test calls with permute/transpose args for compatibility
Assisted-by: llama.cpp:local pi
* metal : optimize concat kernel with row batching for small widths
When ne0 < 256, batch multiple rows into a single threadgroup to improve
occupancy. This avoids underutilizing the GPU when processing narrow tensors.
- Dispatch nth = min(256, ne0) threads per group
- Calculate nrptg (rows per threadgroup) to fill up to 256 threads
- Update kernel index calculation to handle the row batching
- Add boundary check for i1 >= ne1
Assisted-by: llama.cpp:local pi
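The row-batching arithmetic described above can be sketched on the host side as follows. This is an illustration only: the function names are hypothetical, and the index mapping in row_for_thread is an assumed layout, since the actual Metal kernel code is not shown here.

```python
def concat_dispatch(ne0: int, ne1: int, max_threads: int = 256):
    """Return (nth, nrptg, n_groups) for a tensor with ne0 columns and ne1 rows."""
    nth = min(max_threads, ne0)            # threads assigned to a single row
    nrptg = max(1, max_threads // nth)     # rows batched into one threadgroup
    n_groups = (ne1 + nrptg - 1) // nrptg  # threadgroups needed to cover all rows
    return nth, nrptg, n_groups


def row_for_thread(group: int, tid: int, nth: int, nrptg: int, ne1: int):
    """Map a thread to (row, lane), or None if it falls past the last row.

    Assumed layout: consecutive blocks of nth threads handle consecutive rows.
    """
    i1 = group * nrptg + tid // nth
    if i1 >= ne1:  # boundary check: the last threadgroup may be partially filled
        return None
    return i1, tid % nth
```

With ne0 = 32 and ne1 = 10, each threadgroup covers 8 rows, so the second threadgroup has 6 idle row slots that the boundary check skips.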
* tests : clean-up
* tests : refactor CPY shape tests to use dimension permutations
Replace 75 hardcoded test cases with a loop over permutations of
{3, 5, 7, 32} (total elements: 3360). Each src permutation is tested
against canonical sorted and reverse dst, skipping identical shapes.
Covers F32, F16, and Q4_0 (when both src and dst ne0 == 32).
Assisted-by: llama.cpp:local pi
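The case generation can be sketched as below. This is a rough model of the loop, not the test code itself. It assumes that "canonical sorted" means ascending order and that ne0 is the first entry of each shape; the function names are hypothetical.

```python
from itertools import permutations
from math import prod

DIMS = (3, 5, 7, 32)
CANONICAL = tuple(sorted(DIMS))  # assumed ascending: (3, 5, 7, 32)
REVERSE = CANONICAL[::-1]        # (32, 7, 5, 3)


def cpy_shape_cases():
    """Pair every src permutation with the canonical and reverse dst shapes."""
    cases = []
    for src in permutations(DIMS):
        for dst in (CANONICAL, REVERSE):
            if src == dst:
                continue  # identical shapes are not reshapes, so skip them
            cases.append((src, dst))
    return cases


def q4_0_eligible(src, dst) -> bool:
    """Q4_0 is only exercised when both shapes have ne0 == 32."""
    return src[0] == 32 and dst[0] == 32


# Every pair keeps the element count fixed, which CPY reshaping requires
assert all(prod(s) == prod(d) == 3360 for s, d in cpy_shape_cases())
```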
The destroy() function in server_context_impl only cleaned up the main
model and context (via llama_init.reset()) but did not free the speculative
decoder (spec), draft context (ctx_dft), or draft model (model_dft).
For MTP (Multi-Token Prediction) models, ctx_dft holds GPU-allocated
resources (KV cache, compute buffers) that are not freed when entering
the sleeping state. On each sleep/resume cycle, new resources are
allocated without the old ones being freed, leading to a VRAM leak
that eventually crashes the server with out-of-memory errors.
Fix by explicitly resetting spec, ctx_dft, and model_dft in destroy()
before resetting llama_init, ensuring proper cleanup order to avoid
use-after-free.
ref: https://github.com/ggml-org/llama.cpp/issues/23395
Assisted-by: llama.cpp:local pi
* vocab : add Carbon-3B (HybridDNATokenizer) support
Adds a new BPE pre-type LLAMA_VOCAB_PRE_TYPE_CARBON for the
HybridDNATokenizer used by HuggingFaceBio/Carbon-{500M,3B,8B}.
The base BPE is Qwen3-4B-Base's; what differs is that text inside
<dna>...</dna> regions is chunked into fixed 6-mers (right-padded
with 'A' on the trailing partial), and any base outside ACGT maps
to <oov>.
* src/llama-vocab.{h,cpp}: new pre-type, dispatched from
llm_tokenizer_bpe_session::tokenize.
* src/llama-vocab-carbon.h: pure helpers (tokenize_carbon,
emit_dna_kmers) factored out for unit testing — no llama_vocab
dependency, vocab access goes through a std::function.
* conversion/base.py: detect HybridDNATokenizer by class name in
get_vocab_base_pre (chktxt collides with Qwen3 base since it
has no <dna>), and pass trust_remote_code=True in get_vocab_base
so the custom tokenizer class can load.
* tests/test-tokenizer-carbon.cpp: 12 cases covering single 6-mer,
multi 6-mer, lowercase, invalid base -> <oov>, partial k-mer
right-pad, mixed text+DNA, empty <dna></dna>, unterminated <dna>,
two regions, vocab miss.
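The 6-mer chunking rule for text inside a <dna> region can be sketched as follows. This is not the tokenizer's code. Two details are assumptions: lowercase input is uppercased before chunking, and a k-mer containing any base outside ACGT is emitted as a single <oov> token. The function name dna_kmers is hypothetical.

```python
from typing import List

K = 6  # fixed k-mer length used inside <dna>...</dna> regions


def dna_kmers(seq: str, k: int = K) -> List[str]:
    """Split a DNA string into fixed k-mers, right-padding the last one with 'A'."""
    seq = seq.upper()  # assumption: lowercase bases are normalized first
    out = []
    for i in range(0, len(seq), k):
        kmer = seq[i:i + k].ljust(k, "A")  # pad the trailing partial k-mer
        # assumption: one invalid base turns the whole k-mer into <oov>
        out.append(kmer if set(kmer) <= set("ACGT") else "<oov>")
    return out


print(dna_kmers("ACGTACGT"))
```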
* vocab : align Carbon-3B changes with llama.cpp conventions
* Fold tokenize_carbon + emit_dna_kmers inline into
llm_tokenizer_bpe_session (drop src/llama-vocab-carbon.h),
matching how every other tokenizer keeps its helpers inside
llama-vocab.cpp.
* Replace the standalone unit test with the conventional
test-tokenizer-0 row backed by models/ggml-vocab-carbon.gguf
(vocab-only conversion) + .inp/.out fixtures covering single
6-mer, multi 6-mer, lowercase, invalid base -> <oov>, partial
right-pad, mixed text+DNA, empty <dna></dna>, unterminated <dna>,
two regions.
* Register "carbon" in convert_hf_to_gguf_update.py's model list
(pointing at HuggingFaceBio/Carbon-3B) and teach both
AutoTokenizer call sites in the updater to pass
trust_remote_code=True for it, matching how t5 is special-cased.
* vocab : move Carbon dispatch to _set_vocab_carbon + LlamaModel branch
Refactor the conversion-side changes to follow the per-tokenizer-family
convention used by _set_vocab_qwen, _set_vocab_interns1, _set_vocab_glm,
etc. instead of conditionalising the shared get_vocab_base /
get_vocab_base_pre paths.
* conversion/base.py: add _set_vocab_carbon — self-contained, loads
with trust_remote_code=True so HybridDNATokenizer's merged Qwen3 + DNA
vocab is visible, writes tokenizer.ggml.pre = "carbon" directly.
* conversion/llama.py: branch in LlamaModel.set_vocab on
tokenizer_config.json["tokenizer_class"] == "HybridDNATokenizer" and
dispatch to _set_vocab_carbon. Same precedent as conversion/bert.py
(tokenizer_class branch between BertTokenizer / RobertaTokenizer) and
conversion/phi.py.
* conversion/base.py: revert the conditional in get_vocab_base and the
class-name short-circuit in the auto-generated get_vocab_base_pre.
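The class-name dispatch can be illustrated with a small stand-alone sketch. It takes the already-parsed tokenizer_config.json as a dict and returns the name of the setter to call. The function select_vocab_setter and the fallback name "_set_vocab_default" are hypothetical and are not part of the conversion scripts.

```python
def select_vocab_setter(tokenizer_config: dict) -> str:
    """Pick a vocab setter name from the parsed tokenizer_config.json contents."""
    tokenizer_class = tokenizer_config.get("tokenizer_class")
    if tokenizer_class == "HybridDNATokenizer":
        # Carbon models take the dedicated, self-contained path
        return "_set_vocab_carbon"
    # hypothetical placeholder for whatever the model class did before
    return "_set_vocab_default"


print(select_vocab_setter({"tokenizer_class": "HybridDNATokenizer"}))
```

Keying on the class name keeps the shared get_vocab_base / get_vocab_base_pre paths free of model-specific conditionals.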
* tests : expand ggml-vocab-carbon.gguf fixtures with model-card examples
Add 6 cases from the Carbon-3B model card on top of the existing edge
coverage: the unterminated basic-completion prompt, the closed 33-bp
example, the metadata-conditioned prompt (with <vertebrate_mammalian>
and <protein_coding_region> which BPE-decompose since they are not in
the vocab), the documented anti-pattern of raw DNA without <dna> tags,
and the two likelihood-scoring examples. Brings the suite to 19 cases.
* vocab : promote HybridDNATokenizer to its own LLAMA_VOCAB_TYPE
Refactor per upstream review:
> This should be its own tokenizer model, ie. carbonhybriddna instead
> of gpt2 and not carbon pre-tokenizer. That way you can keep the
> correct pre-tokenizer, in case that ever changes.
Previously the tokenizer was modelled as LLAMA_VOCAB_TYPE_BPE plus a
new LLAMA_VOCAB_PRE_TYPE_CARBON, which (a) put a CARBON-specific
branch inside llm_tokenizer_bpe_session::tokenize (existing pre-types
differ only in regex, not in dispatch logic), and (b) conflated
"hybrid DNA tokenization" with "Qwen3 BPE pre-tokenizer".
This change moves it to its own vocab type, peer to PLAMO2, with the
GGUF model name matching the HF tokenizer class (HybridDNATokenizer):
* include/llama.h: new LLAMA_VOCAB_TYPE_HYBRIDDNA = 7.
* src/llama-vocab.cpp: new llm_tokenizer_hybriddna + session that
owns std::unique_ptr<llm_tokenizer_bpe> for non-<dna> text and
routes raw text through a DNA-aware splitter; wired into
init_tokenizer, tokenize, type_name, byte_to_token, and the
BPE-style token_to_piece case (DNA k-mers + <dna>/</dna>/<oov>
are pure ASCII, so byte-level BPE decoding handles them).
LLAMA_VOCAB_TYPE_HYBRIDDNA gets its own branch in the vocab-type
config block alongside SPM/WPM/UGM/RWKV, where pre_type is set
to QWEN2 and the matching add_space_prefix / escape_whitespaces /
clean_spaces flags are applied — mirroring qwen2's BPE path so
byte-level BPE merging stays bit-identical to the Python
reference for non-DNA text.
* src/llama-vocab.h: drop the short-lived LLAMA_VOCAB_PRE_TYPE_CARBON.
* conversion/base.py: _set_vocab_hybriddna writes
tokenizer.ggml.model = "hybriddna" (no separate pre).
* conversion/llama.py: dispatch on tokenizer_class ==
"HybridDNATokenizer" same as bert.py / phi.py do.
* models/ggml-vocab-hybriddna.gguf{,.inp,.out}: renamed fixture +
regenerated metadata.
* convert_hf_to_gguf_update.py: drop the stale chkhsh entry and
trust_remote_code special-case (no longer needed since dispatch
is now class-name driven, not chkhsh).
Verified end-to-end against HuggingFaceBio/Carbon-{500M,3B,8B}:
tokenization is bit-identical to the Python HybridDNATokenizer for
all 19 test fixtures plus the model-card metadata-conditioned
prompt; greedy completion produces the same DNA continuation as
the Python reference; spec-dec with 500M as draft for 8B still
works.
* vocab : relax llm_tokenizer_bpe assert to allow HYBRIDDNA
* vocab : drop llm_tokenizer_bpe vocab-type assert
* vocab : write tokenizer.ggml.pre for HYBRIDDNA, share BPE dispatch
* vocab : assert BPE or HYBRIDDNA in llm_tokenizer_bpe
* vocab : annotate #endif with PRETOKENIZERDEBUG
* vocab : drop local hybriddna fixture (moves to ggml-org/vocabs)
* deduplicate
* simplify
* simplify
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
When a model has zero non-SWA attention layers (e.g. a SWA-only slice of Gemma 4),
the base KV cache has no layer tensors. The input tensors (self_k_idxs, self_v_idxs,
self_kq_mask) are created as graph input nodes but never consumed by any compute node,
so the backend scheduler never allocates a buffer for them. Calling
mctx->get_base()->set_input_k_idxs() on an unallocated tensor then hits
GGML_ASSERT(buffer) at ggml-backend.cpp:194.
The same scenario applies symmetrically: if a model had zero SWA layers, the SWA
tensors would be unallocated.
Fix: guard both the base and SWA set_input calls with null/buffer checks, matching
the pattern already used by llm_graph_input_mem_hybrid_iswa::set_input (line ~674)
which has the comment: 'base tensors may not be allocated if there are no non-SWA
attention layers'.
Also fix can_reuse() in the same class to skip the ne[0] and kq_mask checks for
unallocated tensors, preventing a null-dereference on the reuse path.
* hexagon: remove gathers and better handling of vtcm in ssm-conv
* hexagon: relax ssm-conv gating requirements
* hexagon: add new prefill ssm-conv backend test
* hexagon: remove trailing white space
* hex-rope: uninline rope_cache_init, otherwise it breaks after rebasing with SSM_CONV changes
---------
Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>
- HunyuanOCR shares the same HF arch and vision layout as HunyuanVL but was split into a separate path that skipped the +0.1 bilinear sampler used by the HF reference.
- Collapse OCR into the HUNYUANVL projector + HUNYUAN_VL text arch
* webui: Add max image size option
* remove magic numbers
* support all image formats
* use const
* Move regex to match b64 images to constants
* use SETTINGS_KEYS to get max image resolution setting
* Do not touch the image if already under the size threshold
* Move to backend sampling for MTP draft path
Run top_k(10) on the draft backend. D2H transfers happen only for the top 10 logits.
Make backend sampling more robust and fall back to CPU on failure cases, such as with "-sm tensor" or when a backend doesn't support TOP_K.
* Allow sampler chains to be partially offloaded to backend
* Add --spec-draft-backend-sampling argument. Enabled by default.
* opencl: refactor initialization
* opencl: refactor GPU identification
* opencl: rename for consistency
* opencl: cache global mem size in dev_ctx
* opencl: adjust log level
* opencl: load argsort and flash_attn kernels in supports_op
* argsort kernel must be built for supports_op for querying the max
workgroups
* flash_attn kernel has many variants, only load them when needed
ggml_backend_dev_by_name always appends a nullptr sentinel to the devices
vector. Skipping nullptr entries prevents assertion failure in
ggml_backend_dev_name.
Assisted-by: llama.cpp:local pi
* mtmd : deepseek-ocr fixes, improvements and refactoring
- image processing changes to achieve full parity with Pillow (reference impl)
- SAM mask casting only when flash-attn is on
- SAM refactor (build_sam() extracted so deepseek-ocr-2 can reuse it)
- llama-chat changes to fix server/WebUI issue (new media_markers_first())
- adapted test-chat-template and added test cases for deepseek-ocr
- changed regression test for deepseek-ocr to use CER+chrF scores for ground-truth comparison; removed embedding-model
- ty.toml ignore unresolved-import for tools/mtmd/tests/**
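For reference, character error rate (CER) is commonly defined as the Levenshtein distance between reference and hypothesis divided by the reference length. The sketch below shows that standard definition only. It is not the regression test's implementation, whose thresholds and chrF computation are not shown here, and the function names are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b using a rolling one-row DP table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1,          # delete ca
                           cur[j - 1] + 1,       # insert cb
                           prev[j - 1] + cost))  # substitute or match
        prev = cur
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance normalized by reference length."""
    if not reference:
        return 0.0 if not hypothesis else 1.0
    return levenshtein(reference, hypothesis) / len(reference)
```

A score-based comparison tolerates small OCR differences that an exact string match would reject.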
* image-text reordering fix removed
* refactor bool add_padding + pad_rounding enum into a single pad_style enum
* hmx-mm: update debug logging in hmx-mm
* hmx-mm: update dequant logic to use HVX_vector_x2/4
* hmx-mm: remove non-pipelined version of the quantize matmul
It seems that we don't really need the non-pipelined version
* hmx-mm: use activation depth mode and update naming
Co-authored-by: Kim-Chyan Gan <kgan@qti.qualcomm.com>
* hex-mm: minor hmx matmul naming updates
* hmx-mm: remove unused vars
* snapdragon: scripts bump default ubatch-size to 1K
* hexagon: combine HMX and power and clock settings into a single set_power call
* hmx-mm: remove leftover of the scale repl helper
* hexagon: fix editconf error
---------
Co-authored-by: Kim-Chyan Gan <kgan@qti.qualcomm.com>
* Adds initial PDL setup.
* Adds PDL barriers based on simple heuristic: place "sync" before first input pointer access, and "launch" after last write, e.g. to tensors like dst.
* Further optimization pass of the first half of kernels
* Optimized PDL barriers for the second batch of kernels
* Further refinements after rebase.
* Moves pdl logic to separate function, removes some whitespace
* Strips post-hoc PDL logic
* Adds stream capture PDL setup. Enrolls quantize_q8_1 to leverage pdl to
overlap execution with previous kernels
* Enrolls mul_mat_vec_q, rms_norm_f32 and k_bin_bcast (partly) into PDL
* Enrolls mmvf, rope, set-rows and topk kernels for gpt-oss into PDL
* Introduce ggml_cuda_kernel_launch, to abstract away cudaLaunchKernelEx,
to enable hip/musa compatibility
* Enrolls cpy_scalar_contiguous, k_get_rows_float and rms_norm_f32
* Enrolls flash_attn_combine_results
* Fix: Drops needless and broken check of CUDA arch for PDL. PDL either
works or is without effect.
* Enrolls flash-attention kernels to pdl
* Fix: inlines ggml_cuda_kernel_launch, and uses perfect forwarding for
kernels args. This fixes PDL.
* Perf: Enrolls k_bin_bcast variadic template invocation into PDL, via
a template alias and template expansion
* Enrolls all remaining kernels for qwen3-coder-next into PDL
* Remove all PDL LC calls to create a baseline
* Added LC according to internal guidance and tested kernel performance.
* Enrolls missing qwen3.5 kernels passively into PDL.
* Kernel optimizations (LC signals) for qwen3.5
* Enrolls ssm-scan kernels into PDL
* Adds GGML_CUDA_PDL command line option to toggle PDL.
* Fix: Ada and lower compilation by guarding PDL calls correctly
* Cleanup: Removes commented out GGML_CUDA_PDL_LC
* Cleanup: Removes experimental comments
* Adds 90-virtual to build script so that Hopper GPUs can leverage PDL.
* Adds stricter checks to enable PDL, adds env-check to disable it, and removes now superfluous compile option to enable PDL.
* Fix: Correct PDL en/disablement based on device-side arch check. Host
side check is UB. Required moving from macros to inlined functions
* Fix: default-disable PDL. Enable by setting GGML_CUDA_ENABLE_PDL=1
* Enable PDL by default for Hopper+ devices
* Enrolls softcap_f32 and two flash_attn kernels into PDL.
* Improves flash attn PDL barrier placement
* Fix: Perf regression on ada; excludes ada and below from PDL launches
* Improves some sync barrier placements
* Drops superfluous constructor
* Adds #endif guard comments
* Reverts experimental change to top-k-moe.cu, which moved expensive allocations
in front of the PDL barrier. It did not have a meaningful impact.
* Replaces GGML_CUDA_DISABLE_PDL with GGML_CUDA_PDL. PDL is disabled
if and only if GGML_CUDA_PDL=0
* Revert "Drops superfluous constructor". Adds const to remaining
arguments
This reverts commit 12b1d250da0089ae02a9bb71bbb3fd6d70f6f2f1.
* Cleanup: Removes and fixes some comments and whitespace
* Clarifies comment of sync-barrier position
* Relocates and refactors PDL launch functions and accessories
* Adds error checking to the regular kernel launch path
* Drops "auto" in favor of "ggml_cuda_kernel_params"
* Adds "const" to ggml_cuda_kernel_launch_params
* [Whitespace] Adds final newline to common.cuh to make editorconfig CI job happy
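The end state of the PDL toggling described in this series can be summarized with a small decision sketch. It is a host-side approximation for illustration only, since the real check is done device-side per the notes above. The function name and the encoding of compute capability as major * 10 + minor (Hopper = 90, Ada = 89) are assumptions of this sketch.

```python
def pdl_enabled(compute_capability: int, env: dict) -> bool:
    """Approximate whether PDL launches would be used.

    compute_capability: major * 10 + minor (assumed encoding), e.g. 90 for Hopper.
    env: mapping of environment variables, e.g. dict(os.environ).
    """
    if env.get("GGML_CUDA_PDL") == "0":
        return False  # explicit opt-out
    # enabled by default on Hopper and newer; Ada and below are excluded
    return compute_capability >= 90
```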
* mtmd: fit_params now takes mmproj into account
* rename alloc_compute_meta to reserve_compute_meta
* rm unused functions
* add ggml_backend_dev_t support
* add debug log
* snapdragon: update compiler flags to enable all CPU features
* snapdragon: update readme to point to toolchain v0.6
* snapdragon: bump toolchain docker to v0.6
* opencl: add q4_k moe support
* opencl: add q5_k moe support
* opencl: add q6_k moe support
* opencl: adjust format
---------
Co-authored-by: Li He <lih@qti.qualcomm.com>
This commit attempts to clarify a code comment in graph_mtp regarding
where the MTP layer is stored.
The motivation for this is that it was not obvious to me what the
original comment meant and hopefully this makes it clearer.
Add graphs reused counter to the per-slot timing output, printed via
llama_perf_context().
Assisted-by: llama.cpp:local pi
Co-authored-by: ggerganov <ggerganov@users.noreply.github.com>
* save-load-state : refactor into separate phase functions
- Split monolithic main() into 4 self-contained phase functions, each
managing its own context/sampler/batch lifecycle
- Each function tokenizes internally using its local ctx instance
- main() is now a clean orchestrator: init -> run phases -> assert results
- Proper resource cleanup on every exit path (return {} on error)
Assisted-by: llama.cpp:local pi
* save-load-state : use params.out_file instead of separate state_file
- Remove state_file parameter from all phase functions
- Each function accesses params.out_file directly
- Initialize params.out_file in main alongside params.prompt
Assisted-by: llama.cpp:local pi
* save-load-state : use smart pointers for ctx and smpl
- Replace raw llama_context* with llama_context_ptr
- Replace raw llama_sampler* with llama_sampler_ptr
- Remove all manual llama_free() and llama_sampler_free() calls
- Keep llama_batch as raw (managed manually with llama_batch_free)
Assisted-by: llama.cpp:local pi
* save-load-state : add local llama_batch_ptr RAII wrapper
- Add llama_batch_ptr struct holding llama_batch by value
- Calls llama_batch_free() in destructor
- Eliminates all manual llama_batch_free() calls
Assisted-by: llama.cpp:local pi
* save-load-state : replace printf/fprintf with logging macros
- Add log.h include
- Replace fprintf(stderr, ...) errors with LOG_ERR
- Replace fprintf(stderr, ...) info with LOG_TRC
- Replace printf output with LOG
Assisted-by: llama.cpp:local pi
* save-load-state : refactor tests to check results inline
Each follow-up phase now accepts an expected result and performs
the comparison internally instead of collecting results in main().
Assisted-by: llama.cpp:local pi
* save-load-state : improve test output readability
Add phase labels, remove redundant run prefixes, and show
PASS after each test.
Assisted-by: llama.cpp:local pi
* pi : add rule about git signing
* save-load-state : simplify llama_batch_ptr
Change get() to return a reference and remove operator*().
Use batch.get() throughout for consistency.
Assisted-by: llama.cpp:local pi
* save-load-state : extract generate_tokens helper
Factor out the repeated token generation loop into a shared
helper function used by all phases.
Assisted-by: llama.cpp:local pi
* save-load-state : update comments to use test terminology
Replace "Phase" with "Test" and list each test's steps
as bullet points.
Assisted-by: llama.cpp:local pi
* save-load-state : rename test functions
Rename to test_baseline, test_state_load, test_seq_cp_host,
test_seq_cp_device. Update comments and logs accordingly.
Assisted-by: llama.cpp:local pi
* pi : add rule to never git push without confirmation
Assisted-by: llama.cpp:local pi
* common : add model_only option to common_init_from_params
Add bool model_only parameter to skip context creation,
sampler init, and context-dependent setup.
Use in save-load-state to initialize only the model,
with each test creating its own context.
Assisted-by: llama.cpp:local pi
---------
Co-authored-by: ggerganov <ggerganov@users.noreply.github.com>
* llama-eval : add per-problem summary table to HTML reports
- Add chunk_idx and problem_idx to TaskState and saved case dicts
- Group completed cases by problem_idx in dump_html()
- Render per-problem summary table before individual task table
- Columns: Problem (zero-padded), Runs, Correct (n/r),
Tokens (min/avg/max), T/s (min/avg/max), Gen s (min/avg/max)
- Sorted by problem index, monospace font, right-aligned numbers
- Colspan headers for grouped stats, auto width
- Simulator: add /v1/models endpoint, timings in response,
template-aware question matching, --dataset arg (aime/aime2025)
Assisted-by: llama.cpp:local pi
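The per-problem grouping can be sketched as below. Only problem_idx comes from the change description. The "correct" and "tokens" keys, the three-digit zero padding, and the function name are assumptions made for illustration; the T/s and Gen s columns would follow the same min/avg/max pattern.

```python
from collections import defaultdict
from statistics import mean


def summarize_by_problem(cases):
    """Group completed cases by problem_idx and build one summary row each."""
    groups = defaultdict(list)
    for case in cases:
        groups[case["problem_idx"]].append(case)

    rows = []
    for idx in sorted(groups):  # rows are sorted by problem index
        runs = groups[idx]
        tokens = [c["tokens"] for c in runs]
        n_correct = sum(1 for c in runs if c["correct"])
        rows.append({
            "problem": f"{idx:03d}",  # zero-padded; width 3 is an assumption
            "runs": len(runs),
            "correct": f"{n_correct}/{len(runs)}",
            "tokens": (min(tokens), mean(tokens), max(tokens)),
        })
    return rows


# Hypothetical saved cases
sample_cases = [
    {"problem_idx": 1, "correct": True, "tokens": 100},
    {"problem_idx": 1, "correct": False, "tokens": 300},
    {"problem_idx": 0, "correct": True, "tokens": 50},
]
```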
* llama-eval : add tabs for Detailed and Summary tables, apply monospace font globally
- Wrap Detailed and Summary tables in switchable tabs (Detailed active by default)
- Remove summary-section wrapper, use tab labels instead
- Apply monospace font to all tables and the top bar
Assisted-by: llama.cpp:local pi
* llama-eval : redesign top bar as CSS grid label/value pairs
- Replace flat span list with 4-column grid layout (2 pairs per row)
- Labels in muted color (#888), values in dark (#222)
- Bold dataset name and model name
- Removed media query, always uses 4 columns
Assisted-by: llama.cpp:local pi
* llama-eval : use realistic token counts and throughput in simulator
- comp_tokens: [30, 80] → [10000, 60000]
- tps_gen: derived → uniform [90.0, 110.0]
- t_gen_ms: now computed from tokens/tps
Assisted-by: llama.cpp:local pi
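The relationship between the three simulator values can be sketched as follows. The ranges are the ones listed above. Drawing comp_tokens as a uniform integer, the dict layout, and the function name are assumptions of this sketch.

```python
import random


def simulate_timings(rng: random.Random) -> dict:
    """Draw a token count and throughput, then derive the generation time."""
    comp_tokens = rng.randint(10000, 60000)     # assumed uniform integer draw
    tps_gen = rng.uniform(90.0, 110.0)          # tokens per second
    t_gen_ms = comp_tokens / tps_gen * 1000.0   # time = tokens / rate, in ms
    return {"comp_tokens": comp_tokens, "tps_gen": tps_gen, "t_gen_ms": t_gen_ms}


print(simulate_timings(random.Random(0)))
```

Deriving t_gen_ms from the other two values keeps the reported timings self-consistent.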
* llama-eval : color Answer column green/red based on correctness
Use the same .correct/.incorrect CSS classes on the Answer column
to make correct answers green and incorrect answers red.
Assisted-by: llama.cpp:local pi
* llama-eval : fix pyright errors from max(..., key=len) type inference
Use key=lambda x: len(x) instead of key=len so the type checker
infers the return type as str instead of Sized, fixing:
- unresolved-attribute: Object of type Sized has no attribute lower
- not-subscriptable: Cannot subscript object of type Sized
Assisted-by: llama.cpp:local pi
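A minimal illustration of the pattern follows. The function name and inputs are hypothetical. Runtime behavior is identical to key=len; per the description above, the difference is only in what the type checker infers for the result.

```python
from typing import List


def longest_lowercased(candidates: List[str]) -> str:
    """Return the longest candidate, lowercased."""
    # key=lambda x: len(x) instead of key=len, so the result is inferred as str
    longest = max(candidates, key=lambda x: len(x))
    return longest.lower()


print(longest_lowercased(["A", "BCD", "ef"]))
```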