Commit graph

73 commits

Author SHA1 Message Date
Georgi Gerganov
cf8b0dbda9
server : remove /api endpoints (#22165)
* server : remove /api endpoints

* cont : remove /api/tags
2026-04-20 20:41:19 +03:00
Georgi Gerganov
de71b5f81c
server : refactor "use checkpoint" logic (#22114) 2026-04-20 08:42:37 +03:00
Yes You Can Have Your Own
9d49acb2a7
server: rename --clear-idle to --cache-idle-slots (#21741) 2026-04-20 08:30:24 +03:00
Sascha Rogmann
455d8e4be8
server : speculative checkpointing (#19493)
* server : speculative decoding using checkpoints

* server : fix draft check with checkpoints

* server : rename spec vars

* server : log levels

* server : refactored spec logic to speculative.cpp

* server : renamed spec checkpoints option

* server : fix spec checkpoints, logging

* speculative : checkpoints with draft model, logging

* server : n_tokens_cur and create_checkpoint in draft

* server : fix server_speculative_callback (slot.id)

* spec : fix ngram-map/begin idx_last_check

* spec : init ckpt (begin() wasn't called)

* chore: update webui build output

* server : restore sampler in spec checkpoint and clear mem

* cont : avoid --spec-use-checkpoints argument

* cont : remove server_prompt_checkpoint_with_size

* spec : rename (leave_draft_state)

* cont : clean-up

* cont : do not ignore partial drafts even if the are short

* cont : spec callback owned by session

* cont : simplify

* cont : avoid empty speculative session

* cont : simplify

* cont : simplify

* cont : enable mtmd speculative decoding

* cont : keep the spec sampler alive

* cont : simplify

* cont : fix nullptr deref + draft checkpoints

* cont : remove common_speculative_accept_response

* cont : remove callback

* cont : simplify

* cont : minor

* cont : simplify

* cont : fix accepted number

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-04-19 10:24:06 +03:00
Cetarthoriphros
9e5647affa
server: Expose media_tag on /props endpoint. (#22028) 2026-04-19 00:27:17 +02:00
Georgi Gerganov
6990e2f1f7
libs : rename libcommon -> libllama-common (#21936)
* cmake : allow libcommon to be shared

* cmake : rename libcommon to libllama-common

* cont : set -fPIC for httplib

* cont : export all symbols

* cont : fix build_info exports

* libs : add libllama-common-base

* log : add common_log_get_verbosity_thold()
2026-04-17 11:11:46 +03:00
Xuan-Son Nguyen
408225bb1a
server: use random media marker (#21962)
* server: use random media marker

* nits

* remove legacy <__image__> token

* revert special char in random
2026-04-15 23:52:22 +02:00
Xuan-Son Nguyen
e489a5ca0e
server: support OAI /v1/audio/transcriptions API (#21863)
* server: support OAI /v1/audio/transcriptions API

* address autoreview comments

* correct default response_format value
2026-04-14 11:09:52 +02:00
Yuri Khrustalev
660600081f
server: respect the ignore eos flag (#21203) 2026-04-08 17:12:15 +02:00
Aaron Teo
69c28f1547
llama-server: fix model params not propagated (#21509)
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2026-04-07 21:39:41 +08:00
Georgi Gerganov
e8f5082697
server : fix restore for checkpoints with pos_min == 0 (#21510) 2026-04-07 15:29:17 +03:00
Dan Hoffman
9c699074c9
server: Fix undefined timing measurement errors in server context (#21201)
Co-authored-by: Dan Hoffman <dhoffman@cyket.net>
2026-04-04 22:11:19 +08:00
Yes You Can Have Your Own
50e0ad08fb
server: save and clear idle slots on new task (--clear-idle) (#20993)
* server: clear idle slots KV from VRAM (LLAMA_KV_KEEP_ONLY_ACTIVE)

* server: move idle slot KV clearing to slot release

The save "cost" is now paid by the finishing request.

* server: add --kv-clear-idle flag, enable by default

* server: skip clearing last idle slot, clear on launch

* server: test --no-kv-clear-idle flag

* server: simplify on-release clearing loop

* server: remove on-release KV clearing, keep launch-only

* cont : clean-up

* tests: update log strings after --clear-idle rename

* tests: use debug tags instead of log message matching

* test: fix Windows CI by dropping temp log file unlink

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-04-03 19:02:27 +02:00
Georgi Gerganov
edfb440a2f
server : fix processing of multiple back-to-back mtmd chunks (#21107) 2026-03-28 16:27:36 +02:00
Xuan-Son Nguyen
49bfddeca1
server: allow router to report child instances sleep status (#20849)
* server: allow router to report child instances sleep status

* refactor

* move sleeping to state

* nits
2026-03-22 18:33:52 +01:00
Georgi Gerganov
ab9d4c3678
server : improve mtmd ctx checkpoints (#20726)
* server : improve mtmd ctx checkpoints

* server : fix off-by-one in pos_min_thold
2026-03-20 11:13:12 +02:00
Ryan Goulden
26c9ce1288
server: Add cached_tokens info to oaicompat responses (#19361)
* tests : fix fetch_server_test_models.py

* server: to_json_oaicompat cached_tokens

Adds OpenAI and Anthropic compatible information about the
number of cached prompt tokens used in a response.
2026-03-19 19:09:33 +01:00
Piotr Wilkin (ilintar)
5e54d51b19
common/parser: add proper reasoning tag prefill reading (#20424)
* Implement proper prefill extraction

* Refactor cli parameters, update docs, move reasoning budget sampler part to common/reasoning-budget.cpp

* Update tools/server/server-task.cpp

* refactor: move grammars to variant, remove grammar_external, handle exception internally

* Make code less C++y

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-03-19 16:58:21 +01:00
Piotr Wilkin (ilintar)
d2ecd2d1cf
common/parser: add --skip-chat-parsing to force a pure content parser. (#20289)
* Add `--force-pure-content` to force a pure content parser.

* Update common/arg.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Change parameter name [no ci]

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-03-17 16:16:43 +01:00
Georgi Gerganov
8cc2d81264
server : fix ctx checkpoint invalidation (#20671) 2026-03-17 15:21:14 +02:00
SoftwareRenderer
d7ba99c485
server: reset counter related to kill-switch on client error (#20513)
* server: reset kill-switch on client error

This avoids triggering a server kill switch.

If the client sends a request that exceeds the configured context size, an appropriate HTTP 400 response is provided and no tokens are generated.

However since no tokens are generated, update_slots() increments n_empty_consecutive. If the client sends 3 such messages in a row, the server terminates.

* moved counter reset as per recommendation

* cont : minor

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-03-13 19:58:09 +02:00
Piotr Wilkin (ilintar)
acb7c79069
common/parser: handle reasoning budget (#20297)
* v1

* Finished!

* Handlie cli

* Reasoning sampler

* Apply suggestions from code review

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Less explosive terminology :)

* Add utf-8 case and tests

* common : migrate reasoning budget sampler to common

* cont : clean up

* cont : expose state and allow passing as initial state

* cont : remove unused imports

* cont : update state machine doc string

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: Alde Rojas <hello@alde.dev>
2026-03-11 10:26:12 +01:00
Georgi Gerganov
a7b3dee7a5
server : make 2 checkpoints near the end of the prompt (#20288)
* server : make 2 checkpoints near the end of the prompt

* cont : adjust checkpoints
2026-03-10 14:28:23 +02:00
Georgi Gerganov
96cfc4992c
server : fix checkpoints n_tokens calculation (#20287) 2026-03-09 16:47:06 +02:00
Georgi Gerganov
344ee2a38a
server : warn swa-full is not supported for non-SWA models (#20291) 2026-03-09 16:44:25 +02:00
Georgi Gerganov
d6e1556499
server : fix off-by-1 in server_tokens::size_up_to_pos() (#20279)
* server : fix off-by-1 in server_tokens::size_up_to_pos()

* cont : fix typo [no ci]
2026-03-09 16:43:38 +02:00
Georgi Gerganov
107d599952
server : add kill switch when server is stuck (#20277) 2026-03-09 10:33:12 +02:00
Georgi Gerganov
d417bc43dd
server : do not create checkpoints right after mtmd chunks (#20232) 2026-03-08 22:16:46 +02:00
Piotr Wilkin (ilintar)
f5ddcd1696
Checkpoint every n tokens: squash (#20087) 2026-03-06 11:39:26 +01:00
Pascal
2e7e638523
server : support multiple model aliases via comma-separated --alias (#19926)
* server : support multiple model aliases via comma-separated --alias

* server : update --alias description and regenerate docs

* server : multiple model aliases and tags

- address review feedback from ngxson
- --alias accepts comma-separated values (std::set, no duplicates)
- --tags for informational metadata (not used for routing)
- aliases resolve transparently in router via get_meta/has_model
- /v1/models exposes aliases and tags fields

* regenerate docs

* nits

* server : use first alias as model_name for backward compat

address review feedback from ngxson

* server : add single-model test for aliases and tags
2026-02-27 07:05:23 +01:00
Georgi Gerganov
01cd448b8c
server : fix ctx checkpoint restore logic (#19924) 2026-02-26 18:20:16 +02:00
Georgi Gerganov
f20469d919
server : enable multi-modal prompt caching (#19877) 2026-02-25 15:15:42 +02:00
Georgi Gerganov
d7d826b3c1
server : support multi-modal context checkpoints (#19849)
* Modify llama-memory-hybrid-iswa.cpp

* Modify llama-memory-recurrent.cpp

* Modify server-common.cpp

* Modify server-common.h

* Modify server-context.cpp

* Modify server-task.h

* Added comment to llama-memory-hybrid-iswa.cpp

* Remove comment from server-context.cpp

* Stylistic fix server-context.cpp

* Fix an issue when seqrm isn't called in server-context.cpp

* cont : alternative impl

* cont : cleanup

* cont : n_tokens -> int64_t

---------

Co-authored-by: timkhronos <timkhronos@gmail.com>
2026-02-25 15:14:27 +02:00
Sigbjørn Skjæret
e8e261699a
cli : provide model with text filename (#19783) 2026-02-22 22:33:49 +01:00
matteo
b55dcdef5d
server: save generated text for the /slots endpoint (for LLAMA_SERVER_SLOTS_DEBUG=1) (#19622)
* save generated text for the /slots endpoint

* update debug_generated_text only when LLAMA_SERVER_SLOTS_DEBUG > 0

* Apply suggestions from code review

---------

Co-authored-by: Matteo <matteo@matteo>
Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
2026-02-18 18:53:37 +01:00
손희준
820ebfa6f4
Server: log when converting requests to chat completions format (#19457)
* Log converting requests

* Print as debug instead of info [no ci]

---------

Co-authored-by: openingnow <>
2026-02-09 16:22:57 +01:00
Georgi Gerganov
eb449cdfa4
server : improve context checkpoint logic (#19408) 2026-02-08 09:40:04 +02:00
Georgi Gerganov
dfde5993ea
common : add common_speculative_is_compat() (#19270)
* llama : add llama_memory_can_rm_suffix()

* Revert "llama : add llama_memory_can_rm_suffix()"

This reverts commit d30e59b62a15ef4266a6503e3f4eba770aec001b.

* spec : check if the target context is compatible for spec decoding
2026-02-06 16:47:22 +02:00
Georgi Gerganov
bbada8bfb9
server : wrap around the "id_slot" parameter (#19207)
* server : wrap around the "id_slot" parameter

* cont : minor
2026-01-30 19:46:10 +02:00
Georgi Gerganov
dabaa2e77a
spec : add ngram-mod (#19164)
* spec : add ngram-mod

* cont : simplify + keep track of occupancy

* cont : cleanup

* cont : move initialization to common/speculative

* cont : cleanup

* cont : cleanup

* cont : fix
2026-01-30 18:21:48 +02:00
Sascha Rogmann
72d3b1898a
spec : add self‑speculative decoding (no draft model required) + refactor (#18471)
* server: introduce self-speculative decoding

* server: moved self-call into speculative.cpp

* can_speculate() includes self-speculation

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* server: can_speculate() tests self-spec

* server: replace can_speculate() with slot.can_speculate()

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* common: use %zu format specifier for size_t in logging

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* server: can_speculate() requires a task instance

* common: ngram map, config self-speculative decoding

* common: add enum common_speculative_type

* common: add vector of speculative states

* common: add option --spec-draftless

* server: cleanup (remove slot.batch_spec, rename)

* common: moved self-spec impl to ngram-map

* common: cleanup (use common_speculative_state_draft)

* spec : refactor

* cont : naming

* spec: remove --spec-config

* doc: (draftless) speculative decoding

* common: print performance in spec decoding

* minor : cleanup

* common : better names

* minor : cleanup + fix build

* minor: comments

* CODEOWNERS: add common/ngram-map.* (#18471)

* common : rename speculative.draftless_type -> speculative.type

* ngram-map : fix uninitialized values

* ngram-map : take into account the input can become shorter

* ngram-map : revert len check for now

* arg : change `--spec-draftless` -> `--spec-type`

* spec : add common_speculative_state::accept()

* spec : refactor + add common_speculative_begin()

* spec : fix begin() call with mtmd

* spec : additional refactor + remove common_speculative_params

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-01-28 19:42:42 +02:00
Xuan-Son Nguyen
51fa458a92
server : support preserving reasoning_content in assistant message (#18994)
* support reasoning_content input

* report template caps to webui

* add docs

* rm commented code
2026-01-22 21:30:06 +01:00
손희준
fbbf3ad190
server: /v1/responses (partial) (#18486)
* from previous PR

* Make instruction(system) as first message

* Convert [input_message] (text/image/file)

* Rename convert_responses_to_chatcmpl(body) -> response_body

* Initial tool call support

* Erase instructions field from chatcmpl body

* Feed reasoning texts to chat template

* Use std::vector instead of opaque json array

* Make output_item.added events consistent

* Move `server_task_result_cmpl_partial::update` from header to source

* Match ID of output_item.added and .done events

* Add function_call only if there is no "fc_" prefix

* Add function call output at non-streaming API

* Test if ID is persistent

* Add doc

* Fix style - use trailing comma

* Rewrite state management

* catch up with upstream/master

* Fix style - "type" is the first item of SSE data

* Explicitly check "instructions" from response_body

* Make lambdas static

* Check if reasoning content exists

* Add `oai_resp_id` to task_result_state(also initialized at ctor), server_task_result_cmpl_partial, and server_task_result_cmpl_final

* Reject `input_file` since it is not supported by chatcmpl

* Add "fc_" prefix to non-straming function call id as coderabbit pointed out

---------

Co-authored-by: openingnow <>
2026-01-21 17:47:23 +01:00
Xuan-Son Nguyen
6df686bee6
server : refactor oai_parser_opt, move it to server_chat_params (#18937)
* server_chat_params

* move chat format into CLI

* use meta whenever possible

* clean up, no more chatml fallback
2026-01-19 23:28:01 +01:00
Lennart Austenfeld
18361c579c
server: fix memory reservations in populate_token_probs (#18787) 2026-01-19 19:13:31 +01:00
Xuan-Son Nguyen
c15395f73c
common : implement new jinja template engine (#18462)
* jinja vm

* lexer

* add vm types

* demo

* clean up

* parser ok

* binary_expression::execute

* shadow naming

* bin ops works!

* fix map object

* add string builtins

* add more builtins

* wip

* use mk_val

* eval with is_user_input

* render gemma tmpl ok

* track input string even after transformations

* support binded functions

* keyword arguments and slicing array

* use shared_ptr for values

* add mk_stmt

* allow print source on exception

* fix negate test

* testing more templates

* mostly works

* add filter_statement

* allow func to access ctx

* add jinja-value.cpp

* impl global_from_json

* a lot of fixes

* more tests

* more fix, more tests

* more fixes

* rm workarounds

* demo: type inferrence

* add placeholder for tojson

* improve function args handling

* rm type inference

* no more std::regex

* trailing spaces

* make testing more flexible

* make output a bit cleaner

* (wip) redirect minja calls

* test: add --output

* fix crash on macro kwargs

* add minimal caps system

* add some workarounds

* rm caps_apply_workarounds

* get rid of preprocessing

* more fixes

* fix test-chat-template

* move test-chat-jinja into test-chat-template

* rm test-chat-jinja from cmake

* test-chat-template: use common

* fix build

* fix build (2)

* rename vm --> interpreter

* improve error reporting

* correct lstrip behavior

* add tojson

* more fixes

* disable tests for COMMON_CHAT_FORMAT_GENERIC

* make sure tojson output correct order

* add object.length

* fully functional selectattr / rejectattr

* improve error reporting

* more builtins added, more fixes

* create jinja rendering tests

* fix testing.h path

* adjust whitespace rules

* more fixes

* temporary disable test for ibm-granite

* r/lstrip behavior matched with hf.js

* minimax, glm4.5 ok

* add append and pop

* kimi-k2 ok

* test-chat passed

* fix lstrip_block

* add more jinja tests

* cast to unsigned char

* allow dict key to be numeric

* nemotron: rm windows newline

* tests ok

* fix test

* rename interpreter --> runtime

* fix build

* add more checks

* bring back generic format support

* fix Apertus

* [json.exception.out_of_range.403] key 'content' not found

* rm generic test

* refactor input marking

* add docs

* fix windows build

* clarify error message

* improved tests

* split/rsplit with maxsplit

* non-inverse maxsplit

forgot to change after simplifying

* implement separators for tojson and fix indent

* i like to move it move it

* rename null -- > none

* token::eof

* some nits + comments

* add exception classes for lexer and parser

* null -> none

* rename global -> env

* rm minja

* update docs

* docs: add input marking caveats

* imlement missing jinja-tests functions

* oops

* support trim filter with args, remove bogus to_json reference

* numerous argument fixes

* updated tests

* implement optional strip chars parameter

* use new chars parameter

* float filter also has default

* always leave at least one decimal in float string

* jinja : static analysis + header cleanup + minor fixes

* add fuzz test

* add string.cpp

* fix chat_template_kwargs

* nits

* fix build

* revert

* unrevert

sorry :)

* add fuzz func_args, refactor to be safer

* fix array.map()

* loosen ensure_vals max count condition, add not impl for map(int)

* hopefully fix windows

* check if empty first

* normalize newlines

---------

Co-authored-by: Alde Rojas <hello@alde.dev>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-01-16 11:22:06 +01:00
Xuan-Son Nguyen
a04c2b06a3
server: improve slots scheduling for n_cmpl (#18789)
* server : make sure children tasks are scheduled to launch with parent

* fix

* add comment pointing to this PR

* fix

* clean up

* more debug messages

* add pop_deferred_task with specific ID version

* improve the logic

* simple approach

* no double move

* correct return type of launch_slots_with_parent_task
2026-01-15 17:10:28 +01:00
Georgi Gerganov
39173bcacb
context : reserve new scheduler when graph topology changes (#18547)
Some checks failed
Python Type-Check / pyright type-check (push) Waiting to run
Copilot Setup Steps / copilot-setup-steps (push) Has been cancelled
Check Pre-Tokenizer Hashes / pre-tokenizer-hashes (push) Has been cancelled
Python check requirements.txt / check-requirements (push) Has been cancelled
* context : reserve new scheduler when graph topology changes

* cont : fix

* cont : fix reserve

* cont : reserve only when changes occur + timing

* context : add comments

* llama : reserve on sampler changes

* common : allow null common_sampler

* server : task declares needs (embd, logits, sampling)

* server : do not init sampler if not needed

* llama : fix need_reserve when unsetting a sampler

* server : consolidate slot reset/clear logic
2026-01-15 16:39:17 +02:00
Xuan-Son Nguyen
9ac2693a30
server: fix n_cmpl not skipping processing prompt (#18663)
* server: fix n_cmpl not skipping processing

* fix infinite loop on empty batch

* cont : init child samplers + modify child logic

* cont : cleanup

* cont : improve n_cmpl logic

- launch the parent task first so it finds the slot with best cache
- parent task waits for child tasks to be launched
- when a child task finishes - remove its cache

* cont : remove redundant function

* cont : reduce parent checks

* fix : nullptr task dereference

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-01-10 00:00:41 +01:00
Georgi Gerganov
53eb9435da
server : fix timing of prompt/generation (#18713) 2026-01-09 12:59:50 +02:00