Georgi Gerganov
cf8b0dbda9
server : remove /api endpoints ( #22165 )
...
* server : remove /api endpoints
* cont : remove /api/tags
2026-04-20 20:41:19 +03:00
Georgi Gerganov
de71b5f81c
server : refactor "use checkpoint" logic ( #22114 )
2026-04-20 08:42:37 +03:00
Yes You Can Have Your Own
9d49acb2a7
server: rename --clear-idle to --cache-idle-slots ( #21741 )
2026-04-20 08:30:24 +03:00
Sascha Rogmann
455d8e4be8
server : speculative checkpointing ( #19493 )
...
* server : speculative decoding using checkpoints
* server : fix draft check with checkpoints
* server : rename spec vars
* server : log levels
* server : refactored spec logic to speculative.cpp
* server : renamed spec checkpoints option
* server : fix spec checkpoints, logging
* speculative : checkpoints with draft model, logging
* server : n_tokens_cur and create_checkpoint in draft
* server : fix server_speculative_callback (slot.id)
* spec : fix ngram-map/begin idx_last_check
* spec : init ckpt (begin() wasn't called)
* chore: update webui build output
* server : restore sampler in spec checkpoint and clear mem
* cont : avoid --spec-use-checkpoints argument
* cont : remove server_prompt_checkpoint_with_size
* spec : rename (leave_draft_state)
* cont : clean-up
* cont : do not ignore partial drafts even if they are short
* cont : spec callback owned by session
* cont : simplify
* cont : avoid empty speculative session
* cont : simplify
* cont : simplify
* cont : enable mtmd speculative decoding
* cont : keep the spec sampler alive
* cont : simplify
* cont : fix nullptr deref + draft checkpoints
* cont : remove common_speculative_accept_response
* cont : remove callback
* cont : simplify
* cont : minor
* cont : simplify
* cont : fix accepted number
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-04-19 10:24:06 +03:00
Cetarthoriphros
9e5647affa
server: Expose media_tag on /props endpoint. ( #22028 )
2026-04-19 00:27:17 +02:00
Georgi Gerganov
6990e2f1f7
libs : rename libcommon -> libllama-common ( #21936 )
...
* cmake : allow libcommon to be shared
* cmake : rename libcommon to libllama-common
* cont : set -fPIC for httplib
* cont : export all symbols
* cont : fix build_info exports
* libs : add libllama-common-base
* log : add common_log_get_verbosity_thold()
2026-04-17 11:11:46 +03:00
Xuan-Son Nguyen
408225bb1a
server: use random media marker ( #21962 )
...
* server: use random media marker
* nits
* remove legacy <__image__> token
* revert special char in random
2026-04-15 23:52:22 +02:00
Xuan-Son Nguyen
e489a5ca0e
server: support OAI /v1/audio/transcriptions API ( #21863 )
...
* server: support OAI /v1/audio/transcriptions API
* address autoreview comments
* correct default response_format value
2026-04-14 11:09:52 +02:00
Yuri Khrustalev
660600081f
server: respect the ignore eos flag ( #21203 )
2026-04-08 17:12:15 +02:00
Aaron Teo
69c28f1547
llama-server: fix model params not propagated ( #21509 )
...
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2026-04-07 21:39:41 +08:00
Georgi Gerganov
e8f5082697
server : fix restore for checkpoints with pos_min == 0 ( #21510 )
2026-04-07 15:29:17 +03:00
Dan Hoffman
9c699074c9
server: Fix undefined timing measurement errors in server context ( #21201 )
...
Co-authored-by: Dan Hoffman <dhoffman@cyket.net>
2026-04-04 22:11:19 +08:00
Yes You Can Have Your Own
50e0ad08fb
server: save and clear idle slots on new task (--clear-idle) ( #20993 )
...
* server: clear idle slots KV from VRAM (LLAMA_KV_KEEP_ONLY_ACTIVE)
* server: move idle slot KV clearing to slot release
The save "cost" is now paid by the finishing request.
* server: add --kv-clear-idle flag, enable by default
* server: skip clearing last idle slot, clear on launch
* server: test --no-kv-clear-idle flag
* server: simplify on-release clearing loop
* server: remove on-release KV clearing, keep launch-only
* cont : clean-up
* tests: update log strings after --clear-idle rename
* tests: use debug tags instead of log message matching
* test: fix Windows CI by dropping temp log file unlink
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-04-03 19:02:27 +02:00
Georgi Gerganov
edfb440a2f
server : fix processing of multiple back-to-back mtmd chunks ( #21107 )
2026-03-28 16:27:36 +02:00
Xuan-Son Nguyen
49bfddeca1
server: allow router to report child instances sleep status ( #20849 )
...
* server: allow router to report child instances sleep status
* refactor
* move sleeping to state
* nits
2026-03-22 18:33:52 +01:00
Georgi Gerganov
ab9d4c3678
server : improve mtmd ctx checkpoints ( #20726 )
...
* server : improve mtmd ctx checkpoints
* server : fix off-by-one in pos_min_thold
2026-03-20 11:13:12 +02:00
Ryan Goulden
26c9ce1288
server: Add cached_tokens info to oaicompat responses ( #19361 )
...
* tests : fix fetch_server_test_models.py
* server: to_json_oaicompat cached_tokens
Adds OpenAI and Anthropic compatible information about the
number of cached prompt tokens used in a response.
2026-03-19 19:09:33 +01:00
Piotr Wilkin (ilintar)
5e54d51b19
common/parser: add proper reasoning tag prefill reading ( #20424 )
...
* Implement proper prefill extraction
* Refactor cli parameters, update docs, move reasoning budget sampler part to common/reasoning-budget.cpp
* Update tools/server/server-task.cpp
* refactor: move grammars to variant, remove grammar_external, handle exception internally
* Make code less C++y
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-03-19 16:58:21 +01:00
Piotr Wilkin (ilintar)
d2ecd2d1cf
common/parser: add --skip-chat-parsing to force a pure content parser. ( #20289 )
...
* Add `--force-pure-content` to force a pure content parser.
* Update common/arg.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Change parameter name [no ci]
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-03-17 16:16:43 +01:00
Georgi Gerganov
8cc2d81264
server : fix ctx checkpoint invalidation ( #20671 )
2026-03-17 15:21:14 +02:00
SoftwareRenderer
d7ba99c485
server: reset counter related to kill-switch on client error ( #20513 )
...
* server: reset kill-switch on client error
This avoids triggering a server kill switch.
If the client sends a request that exceeds the configured context size, an appropriate HTTP 400 response is provided and no tokens are generated.
However since no tokens are generated, update_slots() increments n_empty_consecutive. If the client sends 3 such messages in a row, the server terminates.
* moved counter reset as per recommendation
* cont : minor
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-03-13 19:58:09 +02:00
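The kill-switch behaviour described in #20513 above can be illustrated with a toy model. This is only a sketch under stated assumptions, not the server's code: `ToyServer`, `handle_oversized_request`, `simulate` and the `reset_on_client_error` switch are hypothetical names, and only `update_slots()`, `n_empty_consecutive`, the HTTP 400 response and the threshold of 3 come from the commit message.

```python
MAX_EMPTY_CONSECUTIVE = 3  # "3 such messages in a row", per the commit message


class ToyServer:
    """Toy model of the kill switch; hypothetical, not the llama-server code."""

    def __init__(self, reset_on_client_error):
        self.reset_on_client_error = reset_on_client_error
        self.n_empty_consecutive = 0
        self.terminated = False

    def update_slots(self, n_generated):
        # Assumption: an iteration that generates tokens clears the counter.
        if n_generated > 0:
            self.n_empty_consecutive = 0
            return
        self.n_empty_consecutive += 1
        if self.n_empty_consecutive >= MAX_EMPTY_CONSECUTIVE:
            self.terminated = True  # kill switch fires

    def handle_oversized_request(self):
        # The request exceeds the context size: it is answered with HTTP 400
        # and no tokens are generated, so the iteration counts as empty.
        self.update_slots(0)
        if self.reset_on_client_error and not self.terminated:
            # The fix: a client error is not a stuck server, so reset.
            self.n_empty_consecutive = 0
        return 400


def simulate(n_bad_requests, reset_on_client_error):
    """Send n oversized requests in a row and report whether the server died."""
    server = ToyServer(reset_on_client_error)
    for _ in range(n_bad_requests):
        server.handle_oversized_request()
    return server.terminated
```

In this model, three consecutive oversized requests terminate the server when the counter is not reset, and never terminate it when the counter is reset on each client error.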
Piotr Wilkin (ilintar)
acb7c79069
common/parser: handle reasoning budget ( #20297 )
...
* v1
* Finished!
* Handle cli
* Reasoning sampler
* Apply suggestions from code review
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Less explosive terminology :)
* Add utf-8 case and tests
* common : migrate reasoning budget sampler to common
* cont : clean up
* cont : expose state and allow passing as initial state
* cont : remove unused imports
* cont : update state machine doc string
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: Alde Rojas <hello@alde.dev>
2026-03-11 10:26:12 +01:00
Georgi Gerganov
a7b3dee7a5
server : make 2 checkpoints near the end of the prompt ( #20288 )
...
* server : make 2 checkpoints near the end of the prompt
* cont : adjust checkpoints
2026-03-10 14:28:23 +02:00
Georgi Gerganov
96cfc4992c
server : fix checkpoints n_tokens calculation ( #20287 )
2026-03-09 16:47:06 +02:00
Georgi Gerganov
344ee2a38a
server : warn swa-full is not supported for non-SWA models ( #20291 )
2026-03-09 16:44:25 +02:00
Georgi Gerganov
d6e1556499
server : fix off-by-1 in server_tokens::size_up_to_pos() ( #20279 )
...
* server : fix off-by-1 in server_tokens::size_up_to_pos()
* cont : fix typo [no ci]
2026-03-09 16:43:38 +02:00
Georgi Gerganov
107d599952
server : add kill switch when server is stuck ( #20277 )
2026-03-09 10:33:12 +02:00
Georgi Gerganov
d417bc43dd
server : do not create checkpoints right after mtmd chunks ( #20232 )
2026-03-08 22:16:46 +02:00
Piotr Wilkin (ilintar)
f5ddcd1696
Checkpoint every n tokens: squash ( #20087 )
2026-03-06 11:39:26 +01:00
Pascal
2e7e638523
server : support multiple model aliases via comma-separated --alias ( #19926 )
...
* server : support multiple model aliases via comma-separated --alias
* server : update --alias description and regenerate docs
* server : multiple model aliases and tags
- address review feedback from ngxson
- --alias accepts comma-separated values (std::set, no duplicates)
- --tags for informational metadata (not used for routing)
- aliases resolve transparently in router via get_meta/has_model
- /v1/models exposes aliases and tags fields
* regenerate docs
* nits
* server : use first alias as model_name for backward compat
address review feedback from ngxson
* server : add single-model test for aliases and tags
2026-02-27 07:05:23 +01:00
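The alias handling listed in #19926 above (comma-separated values, no duplicates, first alias kept as the model name) can be sketched as follows. This is a rough illustration only: `parse_aliases` is a hypothetical helper, the whitespace trimming and the error on empty input are assumptions, and the real implementation is C++ using std::set.

```python
def parse_aliases(arg):
    """Split a comma-separated --alias value.

    Returns (model_name, aliases): the first alias, kept as the model name
    for backward compatibility, and the de-duplicated set of all aliases.
    """
    # Assumption: surrounding whitespace and empty entries are ignored.
    names = [part.strip() for part in arg.split(",") if part.strip()]
    if not names:
        raise ValueError("--alias needs at least one non-empty name")
    return names[0], set(names)


model_name, aliases = parse_aliases("gpt-4,my-model,gpt-4")
```

The first alias is taken before the set is built because a set does not preserve insertion order.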
Georgi Gerganov
01cd448b8c
server : fix ctx checkpoint restore logic ( #19924 )
2026-02-26 18:20:16 +02:00
Georgi Gerganov
f20469d919
server : enable multi-modal prompt caching ( #19877 )
2026-02-25 15:15:42 +02:00
Georgi Gerganov
d7d826b3c1
server : support multi-modal context checkpoints ( #19849 )
...
* Modify llama-memory-hybrid-iswa.cpp
* Modify llama-memory-recurrent.cpp
* Modify server-common.cpp
* Modify server-common.h
* Modify server-context.cpp
* Modify server-task.h
* Added comment to llama-memory-hybrid-iswa.cpp
* Remove comment from server-context.cpp
* Stylistic fix server-context.cpp
* Fix an issue when seqrm isn't called in server-context.cpp
* cont : alternative impl
* cont : cleanup
* cont : n_tokens -> int64_t
---------
Co-authored-by: timkhronos <timkhronos@gmail.com>
2026-02-25 15:14:27 +02:00
Sigbjørn Skjæret
e8e261699a
cli : provide model with text filename ( #19783 )
2026-02-22 22:33:49 +01:00
matteo
b55dcdef5d
server: save generated text for the /slots endpoint (for LLAMA_SERVER_SLOTS_DEBUG=1) ( #19622 )
...
* save generated text for the /slots endpoint
* update debug_generated_text only when LLAMA_SERVER_SLOTS_DEBUG > 0
* Apply suggestions from code review
---------
Co-authored-by: Matteo <matteo@matteo>
Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
2026-02-18 18:53:37 +01:00
손희준
820ebfa6f4
Server: log when converting requests to chat completions format ( #19457 )
...
* Log converting requests
* Print as debug instead of info [no ci]
---------
Co-authored-by: openingnow <>
2026-02-09 16:22:57 +01:00
Georgi Gerganov
eb449cdfa4
server : improve context checkpoint logic ( #19408 )
2026-02-08 09:40:04 +02:00
Georgi Gerganov
dfde5993ea
common : add common_speculative_is_compat() ( #19270 )
...
* llama : add llama_memory_can_rm_suffix()
* Revert "llama : add llama_memory_can_rm_suffix()"
This reverts commit d30e59b62a15ef4266a6503e3f4eba770aec001b.
* spec : check if the target context is compatible for spec decoding
2026-02-06 16:47:22 +02:00
Georgi Gerganov
bbada8bfb9
server : wrap around the "id_slot" parameter ( #19207 )
...
* server : wrap around the "id_slot" parameter
* cont : minor
2026-01-30 19:46:10 +02:00
Georgi Gerganov
dabaa2e77a
spec : add ngram-mod ( #19164 )
...
* spec : add ngram-mod
* cont : simplify + keep track of occupancy
* cont : cleanup
* cont : move initialization to common/speculative
* cont : cleanup
* cont : cleanup
* cont : fix
2026-01-30 18:21:48 +02:00
Sascha Rogmann
72d3b1898a
spec : add self‑speculative decoding (no draft model required) + refactor ( #18471 )
...
* server: introduce self-speculative decoding
* server: moved self-call into speculative.cpp
* can_speculate() includes self-speculation
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* server: can_speculate() tests self-spec
* server: replace can_speculate() with slot.can_speculate()
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* common: use %zu format specifier for size_t in logging
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* server: can_speculate() requires a task instance
* common: ngram map, config self-speculative decoding
* common: add enum common_speculative_type
* common: add vector of speculative states
* common: add option --spec-draftless
* server: cleanup (remove slot.batch_spec, rename)
* common: moved self-spec impl to ngram-map
* common: cleanup (use common_speculative_state_draft)
* spec : refactor
* cont : naming
* spec: remove --spec-config
* doc: (draftless) speculative decoding
* common: print performance in spec decoding
* minor : cleanup
* common : better names
* minor : cleanup + fix build
* minor: comments
* CODEOWNERS: add common/ngram-map.* (#18471 )
* common : rename speculative.draftless_type -> speculative.type
* ngram-map : fix uninitialized values
* ngram-map : take into account the input can become shorter
* ngram-map : revert len check for now
* arg : change `--spec-draftless` -> `--spec-type`
* spec : add common_speculative_state::accept()
* spec : refactor + add common_speculative_begin()
* spec : fix begin() call with mtmd
* spec : additional refactor + remove common_speculative_params
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-01-28 19:42:42 +02:00
Xuan-Son Nguyen
51fa458a92
server : support preserving reasoning_content in assistant message ( #18994 )
...
* support reasoning_content input
* report template caps to webui
* add docs
* rm commented code
2026-01-22 21:30:06 +01:00
손희준
fbbf3ad190
server: /v1/responses (partial) ( #18486 )
...
* from previous PR
* Make instruction(system) as first message
* Convert [input_message] (text/image/file)
* Rename convert_responses_to_chatcmpl(body) -> response_body
* Initial tool call support
* Erase instructions field from chatcmpl body
* Feed reasoning texts to chat template
* Use std::vector instead of opaque json array
* Make output_item.added events consistent
* Move `server_task_result_cmpl_partial::update` from header to source
* Match ID of output_item.added and .done events
* Add function_call only if there is no "fc_" prefix
* Add function call output at non-streaming API
* Test if ID is persistent
* Add doc
* Fix style - use trailing comma
* Rewrite state management
* catch up with upstream/master
* Fix style - "type" is the first item of SSE data
* Explicitly check "instructions" from response_body
* Make lambdas static
* Check if reasoning content exists
* Add `oai_resp_id` to task_result_state(also initialized at ctor), server_task_result_cmpl_partial, and server_task_result_cmpl_final
* Reject `input_file` since it is not supported by chatcmpl
* Add "fc_" prefix to non-streaming function call id as coderabbit pointed out
---------
Co-authored-by: openingnow <>
2026-01-21 17:47:23 +01:00
Xuan-Son Nguyen
6df686bee6
server : refactor oai_parser_opt, move it to server_chat_params ( #18937 )
...
* server_chat_params
* move chat format into CLI
* use meta whenever possible
* clean up, no more chatml fallback
2026-01-19 23:28:01 +01:00
Lennart Austenfeld
18361c579c
server: fix memory reservations in populate_token_probs ( #18787 )
2026-01-19 19:13:31 +01:00
Xuan-Son Nguyen
c15395f73c
common : implement new jinja template engine ( #18462 )
...
* jinja vm
* lexer
* add vm types
* demo
* clean up
* parser ok
* binary_expression::execute
* shadow naming
* bin ops works!
* fix map object
* add string builtins
* add more builtins
* wip
* use mk_val
* eval with is_user_input
* render gemma tmpl ok
* track input string even after transformations
* support binded functions
* keyword arguments and slicing array
* use shared_ptr for values
* add mk_stmt
* allow print source on exception
* fix negate test
* testing more templates
* mostly works
* add filter_statement
* allow func to access ctx
* add jinja-value.cpp
* impl global_from_json
* a lot of fixes
* more tests
* more fix, more tests
* more fixes
* rm workarounds
* demo: type inference
* add placeholder for tojson
* improve function args handling
* rm type inference
* no more std::regex
* trailing spaces
* make testing more flexible
* make output a bit cleaner
* (wip) redirect minja calls
* test: add --output
* fix crash on macro kwargs
* add minimal caps system
* add some workarounds
* rm caps_apply_workarounds
* get rid of preprocessing
* more fixes
* fix test-chat-template
* move test-chat-jinja into test-chat-template
* rm test-chat-jinja from cmake
* test-chat-template: use common
* fix build
* fix build (2)
* rename vm --> interpreter
* improve error reporting
* correct lstrip behavior
* add tojson
* more fixes
* disable tests for COMMON_CHAT_FORMAT_GENERIC
* make sure tojson output correct order
* add object.length
* fully functional selectattr / rejectattr
* improve error reporting
* more builtins added, more fixes
* create jinja rendering tests
* fix testing.h path
* adjust whitespace rules
* more fixes
* temporarily disable test for ibm-granite
* r/lstrip behavior matched with hf.js
* minimax, glm4.5 ok
* add append and pop
* kimi-k2 ok
* test-chat passed
* fix lstrip_block
* add more jinja tests
* cast to unsigned char
* allow dict key to be numeric
* nemotron: rm windows newline
* tests ok
* fix test
* rename interpreter --> runtime
* fix build
* add more checks
* bring back generic format support
* fix Apertus
* [json.exception.out_of_range.403] key 'content' not found
* rm generic test
* refactor input marking
* add docs
* fix windows build
* clarify error message
* improved tests
* split/rsplit with maxsplit
* non-inverse maxsplit
forgot to change after simplifying
* implement separators for tojson and fix indent
* i like to move it move it
* rename null --> none
* token::eof
* some nits + comments
* add exception classes for lexer and parser
* null -> none
* rename global -> env
* rm minja
* update docs
* docs: add input marking caveats
* implement missing jinja-tests functions
* oops
* support trim filter with args, remove bogus to_json reference
* numerous argument fixes
* updated tests
* implement optional strip chars parameter
* use new chars parameter
* float filter also has default
* always leave at least one decimal in float string
* jinja : static analysis + header cleanup + minor fixes
* add fuzz test
* add string.cpp
* fix chat_template_kwargs
* nits
* fix build
* revert
* unrevert
sorry :)
* add fuzz func_args, refactor to be safer
* fix array.map()
* loosen ensure_vals max count condition, add not impl for map(int)
* hopefully fix windows
* check if empty first
* normalize newlines
---------
Co-authored-by: Alde Rojas <hello@alde.dev>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-01-16 11:22:06 +01:00
Xuan-Son Nguyen
a04c2b06a3
server: improve slots scheduling for n_cmpl ( #18789 )
...
* server : make sure children tasks are scheduled to launch with parent
* fix
* add comment pointing to this PR
* fix
* clean up
* more debug messages
* add pop_deferred_task with specific ID version
* improve the logic
* simple approach
* no double move
* correct return type of launch_slots_with_parent_task
2026-01-15 17:10:28 +01:00
Georgi Gerganov
39173bcacb
context : reserve new scheduler when graph topology changes ( #18547 )
...
* context : reserve new scheduler when graph topology changes
* cont : fix
* cont : fix reserve
* cont : reserve only when changes occur + timing
* context : add comments
* llama : reserve on sampler changes
* common : allow null common_sampler
* server : task declares needs (embd, logits, sampling)
* server : do not init sampler if not needed
* llama : fix need_reserve when unsetting a sampler
* server : consolidate slot reset/clear logic
2026-01-15 16:39:17 +02:00
Xuan-Son Nguyen
9ac2693a30
server: fix n_cmpl not skipping processing prompt ( #18663 )
...
* server: fix n_cmpl not skipping processing
* fix infinite loop on empty batch
* cont : init child samplers + modify child logic
* cont : cleanup
* cont : improve n_cmpl logic
- launch the parent task first so it finds the slot with best cache
- parent task waits for child tasks to be launched
- when a child task finishes - remove its cache
* cont : remove redundant function
* cont : reduce parent checks
* fix : nullptr task dereference
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-01-10 00:00:41 +01:00
Georgi Gerganov
53eb9435da
server : fix timing of prompt/generation ( #18713 )
2026-01-09 12:59:50 +02:00