koboldcpp/examples
Georgi Gerganov 68e7ea3eab
spec : parallel drafting support (#22838)
* spec : refactor

* spec : drop support for incompatible vocabs

* spec : update common_speculative_init()

* cont : pass seq_id

* cont : dedup ctx_seq_rm_type

* server : sketch the ctx_dft decode loop

* server : draft prompt cache and checkpoints

* server : improve ctx names

* server, spec : transition to unified spec context

* cont : sync main and drft contexts

* cont : async drft eval when possible

* cont : handle non-ckpt models

* cont : pass correct n_past for drafting

* cont : process images through the draft context

* spec : handle draft running out of context

* server : fix mtmd draft processing

* server : fix URL for draft model

* server : add comment

* server : clean-up + dry

* speculative-simple : update

* spec : fix n_past type

* server : fix slot ctx_drft ptr

* tools : update readme

* naming : improve consistency

* spec : refactor for multi-sequence speculative context

* cont : prepare params

* cont : prepare params

* spec : support parallel drafts

* server : support parallel drafting

* llama : reuse device buffers when possible

* server, spec : clean-up

* cont : clean-up

* cont : minor

* spec : reset `drafting` flag at the end

* spec : introduce `common_speculative_process()`

* spec : allow for multiple spec types (chain of speculators)

* replace old type field of type common_speculative_type in the
  common_params_speculative struct with a vector to allow multiple
  types to be specified

* introduce common_get_enabled_speculative_impls(const std::vector<enum common_speculative_type>)
  to figure out which implementations the user has enabled

* introduce common_speculative_type_from_names(const std::vector<std::string> & names)
  to parse the spec types already provided by the user

* all speculators run sequentially, best one wins (we verify its drafted tokens)

* maximize expected accepted tokens for current round by calculating the
  product between the probability of accepting current token (n_acc_tokens / n_gen_drafts)
  and the draft's length
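
The selection rule described above can be sketched as follows. This is a minimal illustration, not the code from the commit: the `spec_candidate` struct, the `expected_accepted()` and `pick_best()` helpers, and the fallback rate of 1.0 for a speculator with no history are all assumptions made for the example. Only the counter names `n_acc_tokens` and `n_gen_drafts` and the "acceptance rate times draft length" score come from the message.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical per-speculator bookkeeping (illustrative, not the real struct).
struct spec_candidate {
    int n_acc_tokens; // numerator of the acceptance ratio, as named in the commit message
    int n_gen_drafts; // denominator of the acceptance ratio, as named in the commit message
    int n_draft;      // length of the draft this speculator proposed for the current round
};

// Score = estimated acceptance probability * draft length.
static double expected_accepted(const spec_candidate & c) {
    // Assumption: with no history yet, fall back to an optimistic rate of 1.0
    // so that a new speculator still gets a chance to be selected.
    const double p = c.n_gen_drafts > 0
        ? (double) c.n_acc_tokens / (double) c.n_gen_drafts
        : 1.0;
    return p * (double) c.n_draft;
}

// Returns the index of the highest-scoring candidate, or -1 if there are none.
// Ties keep the earliest candidate.
static int pick_best(const std::vector<spec_candidate> & cands) {
    int    best       = -1;
    double best_score = -1.0;
    for (std::size_t i = 0; i < cands.size(); ++i) {
        const double score = expected_accepted(cands[i]);
        if (score > best_score) {
            best_score = score;
            best       = (int) i;
        }
    }
    return best;
}
```

Under this scoring, a long draft from a less reliable speculator can still win over a short draft from a reliable one, which is the trade-off the product is meant to capture.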

---------

Co-authored-by: Petros Sideris <petros.sideris@nokia.com>
2026-05-11 19:09:43 +03:00
..
batched libs : rename libcommon -> libllama-common (#21936) 2026-04-17 11:11:46 +03:00
batched.swift examples : remove references to make in examples [no ci] (#15457) 2025-08-21 06:12:28 +02:00
convert-llama2c-to-ggml libs : rename libcommon -> libllama-common (#21936) 2026-04-17 11:11:46 +03:00
debug common: fix missing exports in llama-common (#22340) 2026-04-27 08:06:39 +03:00
deprecation-warning Fix locale-dependent float printing in GGUF metadata (#17331) 2026-03-04 09:30:40 +01:00
diffusion examples: refactor diffusion generation (#22590) 2026-05-04 20:19:30 +08:00
embedding libs : rename libcommon -> libllama-common (#21936) 2026-04-17 11:11:46 +03:00
eval-callback common: fix missing exports in llama-common (#22340) 2026-04-27 08:06:39 +03:00
gen-docs spec : refactor params (#22397) 2026-04-28 09:07:33 +03:00
gguf Fix locale-dependent float printing in GGUF metadata (#17331) 2026-03-04 09:30:40 +01:00
gguf-hash Fix locale-dependent float printing in GGUF metadata (#17331) 2026-03-04 09:30:40 +01:00
idle libs : rename libcommon -> libllama-common (#21936) 2026-04-17 11:11:46 +03:00
llama.android android : libcommon -> libllama-common (#22076) 2026-04-18 11:19:40 +02:00
llama.swiftui llama : deprecate llama_kv_self_ API (#14030) 2025-06-06 14:11:15 +03:00
lookahead libs : rename libcommon -> libllama-common (#21936) 2026-04-17 11:11:46 +03:00
lookup spec : refactor params (#22397) 2026-04-28 09:07:33 +03:00
model-conversion model-conversion : fix mmproj output file name [no ci] (#22274) 2026-04-23 15:07:38 +02:00
parallel libs : rename libcommon -> libllama-common (#21936) 2026-04-17 11:11:46 +03:00
passkey libs : rename libcommon -> libllama-common (#21936) 2026-04-17 11:11:46 +03:00
retrieval libs : rename libcommon -> libllama-common (#21936) 2026-04-17 11:11:46 +03:00
save-load-state common : only load backends when required (#22290) 2026-05-05 09:23:50 +02:00
simple Fix locale-dependent float printing in GGUF metadata (#17331) 2026-03-04 09:30:40 +01:00
simple-chat Fix locale-dependent float printing in GGUF metadata (#17331) 2026-03-04 09:30:40 +01:00
simple-cmake-pkg examples : add missing code block end marker [no ci] (#17756) 2025-12-04 14:17:30 +01:00
speculative spec : fix vocab compat checks in spec example (#22426) 2026-04-30 08:18:25 +03:00
speculative-simple spec : parallel drafting support (#22838) 2026-05-11 19:09:43 +03:00
sycl sycl : fix script error (#22795) 2026-05-08 06:54:57 +03:00
training libs : rename libcommon -> libllama-common (#21936) 2026-04-17 11:11:46 +03:00
CMakeLists.txt examples : add debug utility/example (#18464) 2026-01-07 10:42:19 +01:00
convert_legacy_llama.py metadata: Detailed Dataset Authorship Metadata (#8875) 2024-11-13 21:10:38 +11:00
json_schema_pydantic_example.py py : type-check all Python scripts with Pyright (#8341) 2024-07-07 15:04:39 -04:00
json_schema_to_grammar.py ci : switch from pyright to ty (#20826) 2026-03-21 08:54:34 +01:00
llama.vim chore : correct typos [no ci] (#20041) 2026-03-05 08:50:21 +01:00
pydantic_models_to_grammar.py ci : switch from pyright to ty (#20826) 2026-03-21 08:54:34 +01:00
pydantic_models_to_grammar_examples.py llama : move end-user examples to tools directory (#13249) 2025-05-02 20:27:13 +02:00
reason-act.sh scripts : make the shell scripts cross-platform (#14341) 2025-06-30 10:17:18 +02:00
regex_to_grammar.py py : switch to snake_case (#8305) 2024-07-05 07:53:33 +03:00
server-llama2-13B.sh scripts : make the shell scripts cross-platform (#14341) 2025-06-30 10:17:18 +02:00
server_embd.py llama : fix FA when KV cache is not used (i.e. embeddings) (#12825) 2025-04-08 19:54:51 +03:00
ts-type-to-grammar.sh scripts : make the shell scripts cross-platform (#14341) 2025-06-30 10:17:18 +02:00