koboldcpp/conversion
Aman Gupta 255582687b
llama + spec: MTP Support (#22673)
* spec: support MTP

* fix batch size

* rename files

* cont : simplify (#7)

* MTP: clean-up (#9)

* MTP: clean-up

* review: use llama_context_type instead of llama_graph_type

* review: remove llama_model_has_mtp

* review: fix convert issues

* convert: fix pycheck

* review: formatting

* use `mtp-` for identifying mtp models

* convert: fix mtp conversion

* mtp -> draft-mtp

* remove unused llama_arch

* add need_embd in speculative

* llama: allow partial seq_rm for GDN models for speculative decoding

Currently, speculative decoding with checkpoints has to restart from a
checkpoint whenever some draft tokens are rejected, which wastes work
re-running the target model. This PR adds the ability to roll back up to
`draft_max` tokens by storing the GDN intermediate states.
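
A minimal sketch of the idea, not the actual llama.cpp code: a recurrent
state cannot be "un-stepped", so the state after each token of the batch
is kept and a rejected suffix is undone by picking the matching snapshot.
The scalar state, the update rule, and the names `step`,
`decode_with_snapshots` and `rollback` are all hypothetical and chosen
only for illustration.

```python
def step(state: float, token: int) -> float:
    # Toy recurrent update (an assumption; a real GDN state is a tensor).
    return 0.5 * state + token


def decode_with_snapshots(state: float, tokens: list[int]):
    # snapshots[i] is the state after the first i tokens of the batch.
    snapshots = [state]
    for t in tokens:
        state = step(state, t)
        snapshots.append(state)
    return state, snapshots


def rollback(snapshots: list[float], n_accepted: int) -> float:
    # Keep only the first n_accepted tokens of the batch.
    return snapshots[n_accepted]


state0 = 0.0
batch = [1, 2, 3, 4]
final, snaps = decode_with_snapshots(state0, batch)

# Suppose only the first 2 tokens of the batch are accepted.
restored = rollback(snaps, 2)

# Same result as re-running the accepted prefix from the checkpoint.
expected, _ = decode_with_snapshots(state0, batch[:2])
assert restored == expected
```

The trade-off in this sketch is memory for compute: one extra state per
token that may be rolled back, in exchange for not re-running the
accepted prefix.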

* fix pending state

* vulkan: add GDN partial rollback

* meta: extend check to axis 1

* metal: add GDN partial rollback

Extend the gated delta net kernel to store intermediate states for
partial rollback support on the Metal backend.

- Add K (snapshot slot count) as a function constant
- Read input state from slot 0 of the 3D state tensor
- Write intermediate states to different slots during token loop
- For K=1, maintain backward-compatible single-slot behavior

Ref: 8c05923630

Assisted-by: llama.cpp:local pi

* delta_net_base: use ggml_pad instead of new_tensor

* review: add need_rs_seq

* review: rename part_bounded to n_rs

* review: deslop comments

* review: rename, add asserts

* server : adjust checkpoint logic (#11)

* server : adjust checkpoint logic

* cont : rm asserts

* server-context: fix early exit

* spec : fix compatibility with n-gram and add TODOs (#13)

* metal : cleanup

* llama : fix faulty bitwise check in recurrent memory

* server : disable RS-based MTP in combination with other spec types

* spec : add TODOs

* cont : fix comment

* cont : update comment

* common : fix logic for ngram + mtp compat

* llama-memory: enable checkpointing with partial rollback

* cont: add test-case for loading into a dirty ctx

* llama-memory-recurrent: clear rs_idx in clear

* download: fix mtp path

* llama-arch: fix enorm op

* docs: update docs

* conversion: fix type annotations

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-05-16 20:06:23 +08:00
__init__.py Refactor: convert_hf_to_gguf.py (#17114) 2026-05-15 15:18:12 +02:00
afmoe.py Refactor: convert_hf_to_gguf.py (#17114) 2026-05-15 15:18:12 +02:00
arctic.py Refactor: convert_hf_to_gguf.py (#17114) 2026-05-15 15:18:12 +02:00
baichuan.py Refactor: convert_hf_to_gguf.py (#17114) 2026-05-15 15:18:12 +02:00
bailingmoe.py Refactor: convert_hf_to_gguf.py (#17114) 2026-05-15 15:18:12 +02:00
base.py llama + spec: MTP Support (#22673) 2026-05-16 20:06:23 +08:00
bert.py Refactor: convert_hf_to_gguf.py (#17114) 2026-05-15 15:18:12 +02:00
bitnet.py Refactor: convert_hf_to_gguf.py (#17114) 2026-05-15 15:18:12 +02:00
bloom.py Refactor: convert_hf_to_gguf.py (#17114) 2026-05-15 15:18:12 +02:00
chameleon.py Refactor: convert_hf_to_gguf.py (#17114) 2026-05-15 15:18:12 +02:00
chatglm.py Refactor: convert_hf_to_gguf.py (#17114) 2026-05-15 15:18:12 +02:00
codeshell.py Refactor: convert_hf_to_gguf.py (#17114) 2026-05-15 15:18:12 +02:00
cogvlm.py Refactor: convert_hf_to_gguf.py (#17114) 2026-05-15 15:18:12 +02:00
command_r.py Refactor: convert_hf_to_gguf.py (#17114) 2026-05-15 15:18:12 +02:00
dbrx.py Refactor: convert_hf_to_gguf.py (#17114) 2026-05-15 15:18:12 +02:00
deci.py Refactor: convert_hf_to_gguf.py (#17114) 2026-05-15 15:18:12 +02:00
deepseek.py Refactor: convert_hf_to_gguf.py (#17114) 2026-05-15 15:18:12 +02:00
dots1.py Refactor: convert_hf_to_gguf.py (#17114) 2026-05-15 15:18:12 +02:00
dotsocr.py Refactor: convert_hf_to_gguf.py (#17114) 2026-05-15 15:18:12 +02:00
dream.py Refactor: convert_hf_to_gguf.py (#17114) 2026-05-15 15:18:12 +02:00
ernie.py Refactor: convert_hf_to_gguf.py (#17114) 2026-05-15 15:18:12 +02:00
exaone.py Refactor: convert_hf_to_gguf.py (#17114) 2026-05-15 15:18:12 +02:00
falcon.py Refactor: convert_hf_to_gguf.py (#17114) 2026-05-15 15:18:12 +02:00
falcon_h1.py Refactor: convert_hf_to_gguf.py (#17114) 2026-05-15 15:18:12 +02:00
gemma.py Refactor: convert_hf_to_gguf.py (#17114) 2026-05-15 15:18:12 +02:00
glm.py Refactor: convert_hf_to_gguf.py (#17114) 2026-05-15 15:18:12 +02:00
gpt2.py Refactor: convert_hf_to_gguf.py (#17114) 2026-05-15 15:18:12 +02:00
gpt_oss.py Refactor: convert_hf_to_gguf.py (#17114) 2026-05-15 15:18:12 +02:00
gptneox.py Refactor: convert_hf_to_gguf.py (#17114) 2026-05-15 15:18:12 +02:00
granite.py Refactor: convert_hf_to_gguf.py (#17114) 2026-05-15 15:18:12 +02:00
grok.py Refactor: convert_hf_to_gguf.py (#17114) 2026-05-15 15:18:12 +02:00
grovemoe.py Refactor: convert_hf_to_gguf.py (#17114) 2026-05-15 15:18:12 +02:00
hunyuan.py Refactor: convert_hf_to_gguf.py (#17114) 2026-05-15 15:18:12 +02:00
internlm.py Refactor: convert_hf_to_gguf.py (#17114) 2026-05-15 15:18:12 +02:00
internvl.py Refactor: convert_hf_to_gguf.py (#17114) 2026-05-15 15:18:12 +02:00
jais.py Refactor: convert_hf_to_gguf.py (#17114) 2026-05-15 15:18:12 +02:00
jamba.py Refactor: convert_hf_to_gguf.py (#17114) 2026-05-15 15:18:12 +02:00
januspro.py Refactor: convert_hf_to_gguf.py (#17114) 2026-05-15 15:18:12 +02:00
kimi_linear.py Refactor: convert_hf_to_gguf.py (#17114) 2026-05-15 15:18:12 +02:00
kimivl.py Refactor: convert_hf_to_gguf.py (#17114) 2026-05-15 15:18:12 +02:00
lfm2.py Refactor: convert_hf_to_gguf.py (#17114) 2026-05-15 15:18:12 +02:00
lighton_ocr.py Refactor: convert_hf_to_gguf.py (#17114) 2026-05-15 15:18:12 +02:00
llada.py Refactor: convert_hf_to_gguf.py (#17114) 2026-05-15 15:18:12 +02:00
llama.py Refactor: convert_hf_to_gguf.py (#17114) 2026-05-15 15:18:12 +02:00
llama4.py Refactor: convert_hf_to_gguf.py (#17114) 2026-05-15 15:18:12 +02:00
llava.py Refactor: convert_hf_to_gguf.py (#17114) 2026-05-15 15:18:12 +02:00
maincoder.py Refactor: convert_hf_to_gguf.py (#17114) 2026-05-15 15:18:12 +02:00
mamba.py Refactor: convert_hf_to_gguf.py (#17114) 2026-05-15 15:18:12 +02:00
mimo.py Refactor: convert_hf_to_gguf.py (#17114) 2026-05-15 15:18:12 +02:00
minicpm.py Refactor: convert_hf_to_gguf.py (#17114) 2026-05-15 15:18:12 +02:00
minimax.py Refactor: convert_hf_to_gguf.py (#17114) 2026-05-15 15:18:12 +02:00
mistral.py Refactor: convert_hf_to_gguf.py (#17114) 2026-05-15 15:18:12 +02:00
mistral3.py Refactor: convert_hf_to_gguf.py (#17114) 2026-05-15 15:18:12 +02:00
mpt.py Refactor: convert_hf_to_gguf.py (#17114) 2026-05-15 15:18:12 +02:00
nemotron.py Refactor: convert_hf_to_gguf.py (#17114) 2026-05-15 15:18:12 +02:00
olmo.py Refactor: convert_hf_to_gguf.py (#17114) 2026-05-15 15:18:12 +02:00
openelm.py Refactor: convert_hf_to_gguf.py (#17114) 2026-05-15 15:18:12 +02:00
orion.py Refactor: convert_hf_to_gguf.py (#17114) 2026-05-15 15:18:12 +02:00
pangu.py Refactor: convert_hf_to_gguf.py (#17114) 2026-05-15 15:18:12 +02:00
phi.py Refactor: convert_hf_to_gguf.py (#17114) 2026-05-15 15:18:12 +02:00
pixtral.py Refactor: convert_hf_to_gguf.py (#17114) 2026-05-15 15:18:12 +02:00
plamo.py Refactor: convert_hf_to_gguf.py (#17114) 2026-05-15 15:18:12 +02:00
plm.py Refactor: convert_hf_to_gguf.py (#17114) 2026-05-15 15:18:12 +02:00
qwen.py llama + spec: MTP Support (#22673) 2026-05-16 20:06:23 +08:00
qwen3vl.py convert : fix Qwen3 ASR conversion (#23081) 2026-05-15 18:38:39 +02:00
qwenvl.py Refactor: convert_hf_to_gguf.py (#17114) 2026-05-15 15:18:12 +02:00
refact.py Refactor: convert_hf_to_gguf.py (#17114) 2026-05-15 15:18:12 +02:00
rwkv.py Refactor: convert_hf_to_gguf.py (#17114) 2026-05-15 15:18:12 +02:00
sarashina2.py Refactor: convert_hf_to_gguf.py (#17114) 2026-05-15 15:18:12 +02:00
smallthinker.py Refactor: convert_hf_to_gguf.py (#17114) 2026-05-15 15:18:12 +02:00
smolvlm.py Refactor: convert_hf_to_gguf.py (#17114) 2026-05-15 15:18:12 +02:00
stablelm.py Refactor: convert_hf_to_gguf.py (#17114) 2026-05-15 15:18:12 +02:00
starcoder.py Refactor: convert_hf_to_gguf.py (#17114) 2026-05-15 15:18:12 +02:00
step3.py Refactor: convert_hf_to_gguf.py (#17114) 2026-05-15 15:18:12 +02:00
t5.py Refactor: convert_hf_to_gguf.py (#17114) 2026-05-15 15:18:12 +02:00
ultravox.py Refactor: convert_hf_to_gguf.py (#17114) 2026-05-15 15:18:12 +02:00
wavtokenizer.py Refactor: convert_hf_to_gguf.py (#17114) 2026-05-15 15:18:12 +02:00
xverse.py Refactor: convert_hf_to_gguf.py (#17114) 2026-05-15 15:18:12 +02:00
youtuvl.py Refactor: convert_hf_to_gguf.py (#17114) 2026-05-15 15:18:12 +02:00