Ruixiang Wang
2fc8d1851e
doc: fix spec mtp typo ( #23435 )
2026-05-21 09:30:55 +03:00
Georgi Gerganov
d14ce3dab4
llama : MTP clean-up ( #23269 )
...
* llama : disable equal splits for recurrent memory with partial rollback
* spec : re-enable p-min with MTP drafts
* spec : re-enable ngram spec in combination with RS rollback
* spec : fix ngram-map-* params
* spec : fix acceptance logic in combined ngram + draft configs
* graph : fix reuse for combined `token` + `embd` batches
* spec : log parameters for each speculative implementation
- add LOG_INF in each constructor with implementation type and parameters
- extract device string logic into common_speculative_get_devices_str()
- move 'adding speculative implementation' log from init into constructors
Assisted-by: llama.cpp:local pi
* spec : extend --spec-default with ngram-map-k4v
Assisted-by: llama.cpp:local pi
* minor : fix n_embd log
* args : update draft.n_max == 3 + regen docs
* spec : relax ngram-mod rejection thold to 0.25 @ 5 low
* logs : improve
* docs : update speculative decoding CLI argument documentation
- Add missing draft model CPU scheduling and tensor override parameters
- Update --spec-type to include all available types (excluding draft-eagle3 WIP)
- Fix default values to match implementation (n_max=3, n_min=0, p_min=0.0)
- Remove deprecated options (spec-draft-ctx-size, spec-draft-replace)
- Add environment variables for new parameters
Assisted-by: llama.cpp:local pi
* arg : step-back on adding k4v to the default spec config
* cont : fix name
2026-05-19 15:32:58 +03:00
Georgi Gerganov
846262d787
docs : update speculative decoding parameters after refactor ( #22397 ) ( #22539 )
...
* docs : update speculative decoding parameters after refactor (#22397 )
Update docs/speculative.md to reflect the new parameter naming scheme
introduced in PR #22397 :
- Replace --draft-max/--draft-min with --spec-draft-n-max/--spec-draft-n-min
- Replace --spec-ngram-size-n/m with per-implementation variants
- Add documentation for all new --spec-ngram-*- parameters
- Update all example commands
Assisted-by: llama.cpp:local pi
* pi : add rule to use gh CLI for GitHub resources
Assisted-by: llama.cpp:local pi
* docs : run llama-gen-docs
* arg : fix typo
2026-05-04 08:52:07 +03:00
Sascha Rogmann
292f6908cd
spec : remove check rate ( #19377 )
...
* spec: remove parameter spec-ngram-check-rate
* spec : renamed statistics vars
* spec : add n_call_begin, n_call_accept
* spec : don't enable key-map-stats
2026-02-09 15:30:50 +02:00
Sascha Rogmann
b4d05a3d2f
spec : various improvements ton ngram-map + docs ( #19253 )
...
* spec: ngram-map and reasoning chats
* spec: add t_begin and t_accept
* ngram-map : add internal hash map
* docs : update ngram-map, add ngram-mod
* docs : fix ngram-map-k
* docs : differences between implementations
2026-02-02 08:26:58 +02:00
Sascha Rogmann
72d3b1898a
spec : add self‑speculative decoding (no draft model required) + refactor ( #18471 )
...
* server: introduce self-speculative decoding
* server: moved self-call into speculative.cpp
* can_speculate() includes self-speculation
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* server: can_speculate() tests self-spec
* server: replace can_speculate() with slot.can_speculate()
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* common: use %zu format specifier for size_t in logging
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* server: can_speculate() requires a task instance
* common: ngram map, config self-speculative decoding
* common: add enum common_speculative_type
* common: add vector of speculative states
* common: add option --spec-draftless
* server: cleanup (remove slot.batch_spec, rename)
* common: moved self-spec impl to ngram-map
* common: cleanup (use common_speculative_state_draft)
* spec : refactor
* cont : naming
* spec: remove --spec-config
* doc: (draftless) speculative decoding
* common: print performance in spec decoding
* minor : cleanup
* common : better names
* minor : cleanup + fix build
* minor: comments
* CODEOWNERS: add common/ngram-map.* (#18471 )
* common : rename speculative.draftless_type -> speculative.type
* ngram-map : fix uninitialized values
* ngram-map : take into account the input can become shorter
* ngram-map : revert len check for now
* arg : change `--spec-draftless` -> `--spec-type`
* spec : add common_speculative_state::accept()
* spec : refactor + add common_speculative_begin()
* spec : fix begin() call with mtmd
* spec : additional refactor + remove common_speculative_params
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-01-28 19:42:42 +02:00