Concedo
261d78eaaa
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# CMakeLists.txt
# README.md
# docs/speculative.md
# ggml/src/ggml-cann/aclnn_ops.cpp
# ggml/src/ggml-cann/ggml-cann.cpp
# tests/CMakeLists.txt
# tests/test-backend-ops.cpp
# tools/mtmd/clip.cpp
2026-02-12 18:05:20 +08:00
손희준
820ebfa6f4
Server: log when converting requests to chat completions format (#19457)
...
* Log converting requests
* Print as debug instead of info [no ci]
---------
Co-authored-by: openingnow <>
2026-02-09 16:22:57 +01:00
Sascha Rogmann
292f6908cd
spec : remove check rate (#19377)
...
* spec: remove parameter spec-ngram-check-rate
* spec : renamed statistics vars
* spec : add n_call_begin, n_call_accept
* spec : don't enable key-map-stats
2026-02-09 15:30:50 +02:00
Concedo
757b293ac9
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .github/workflows/build.yml
# .github/workflows/server-webui.yml
# .github/workflows/server.yml
# tools/rpc/rpc-server.cpp
2026-02-09 00:33:11 +08:00
Georgi Gerganov
eb449cdfa4
server : improve context checkpoint logic (#19408)
2026-02-08 09:40:04 +02:00
Concedo
a0a78dacc4
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .github/workflows/build.yml
# docs/ops.md
# docs/ops/SYCL.csv
# ggml/src/ggml-sycl/element_wise.cpp
# ggml/src/ggml-sycl/ggml-sycl.cpp
# ggml/src/ggml-webgpu/ggml-webgpu-shader-lib.hpp
# ggml/src/ggml-webgpu/ggml-webgpu.cpp
# pyproject.toml
# requirements/requirements-convert_legacy_llama.txt
# src/CMakeLists.txt
# src/llama-vocab.cpp
# tests/test-backend-ops.cpp
2026-02-07 15:54:02 +08:00
Georgi Gerganov
dfde5993ea
common : add common_speculative_is_compat() (#19270)
...
* llama : add llama_memory_can_rm_suffix()
* Revert "llama : add llama_memory_can_rm_suffix()"
This reverts commit d30e59b62a15ef4266a6503e3f4eba770aec001b.
* spec : check if the target context is compatible for spec decoding
2026-02-06 16:47:22 +02:00
Concedo
7b393fa487
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .github/workflows/build.yml
# AUTHORS
# ci/run.sh
# docs/backend/SYCL.md
# docs/build.md
# docs/multimodal/minicpmo2.6.md
# docs/multimodal/minicpmo4.0.md
# docs/multimodal/minicpmv2.5.md
# docs/multimodal/minicpmv2.6.md
# docs/multimodal/minicpmv4.0.md
# docs/multimodal/minicpmv4.5.md
# docs/ops.md
# docs/ops/SYCL.csv
# docs/speculative.md
# examples/deprecation-warning/README.md
# examples/deprecation-warning/deprecation-warning.cpp
# examples/model-conversion/Makefile
# examples/model-conversion/scripts/causal/convert-model.sh
# ggml/include/ggml-cann.h
# ggml/src/ggml-cann/acl_tensor.cpp
# ggml/src/ggml-cann/acl_tensor.h
# ggml/src/ggml-cann/aclnn_ops.cpp
# ggml/src/ggml-cann/aclnn_ops.h
# ggml/src/ggml-cann/common.h
# ggml/src/ggml-cann/ggml-cann.cpp
# ggml/src/ggml-metal/CMakeLists.txt
# ggml/src/ggml-opencl/ggml-opencl.cpp
# ggml/src/ggml-opencl/kernels/concat.cl
# ggml/src/ggml-opencl/kernels/repeat.cl
# ggml/src/ggml-opencl/kernels/scale.cl
# ggml/src/ggml-opencl/kernels/tanh.cl
# ggml/src/ggml-sycl/CMakeLists.txt
# ggml/src/ggml-sycl/dpct/helper.hpp
# ggml/src/ggml-sycl/ggml-sycl.cpp
# ggml/src/ggml-sycl/outprod.cpp
# ggml/src/ggml-sycl/rope.cpp
# ggml/src/ggml-sycl/wkv.cpp
# src/llama-vocab.cpp
# tests/test-autorelease.cpp
# tests/test-backend-ops.cpp
# tools/cvector-generator/pca.hpp
# tools/export-lora/export-lora.cpp
# tools/perplexity/README.md
2026-02-03 19:00:42 +08:00
Matthieu Coudron
a3fa035822
server: print actual model name in 'model not found' error (#19117)
...
When experimenting with AI, my environment gets messy fast, and it's not
always easy to know which model my software is trying to load. This helps
with troubleshooting.
Before:
Error: {
    code = 400,
    message = "model not found",
    type = "invalid_request_error"
}
After:
Error: {
    code = 400,
    message = "model 'toto' not found",
    type = "invalid_request_error"
}
2026-02-02 16:55:27 +01:00
Christian Kastner
7a4ca3cbd9
docs : Minor cleanups (#19252)
...
* Update old URLs to github.com/ggml-org/
* Bump copyrights
2026-02-02 08:38:55 +02:00
Concedo
ddce19db72
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .devops/nix/package-gguf-py.nix
# .devops/nix/scope.nix
# common/CMakeLists.txt
# docs/backend/SYCL.md
# examples/lookahead/lookahead.cpp
# examples/lookup/lookup.cpp
# examples/sycl/run-llama2.sh
# examples/sycl/win-run-llama2.bat
# examples/sycl/win-test.bat
# ggml/src/ggml-hexagon/CMakeLists.txt
# ggml/src/ggml-hexagon/htp/flash-attn-ops.c
# ggml/src/ggml-hexagon/htp/hvx-dump.h
# ggml/src/ggml-hexagon/htp/hvx-reduce.h
# ggml/src/ggml-hexagon/htp/matmul-ops.c
# ggml/src/ggml-hexagon/htp/softmax-ops.c
# ggml/src/ggml-hexagon/htp/unary-ops.c
# ggml/src/ggml-opencl/CMakeLists.txt
# ggml/src/ggml-opencl/ggml-opencl.cpp
# ggml/src/ggml-opencl/kernels/cvt.cl
# scripts/sync-ggml.last
2026-02-01 22:35:25 +08:00
Georgi Gerganov
bbada8bfb9
server : wrap around the "id_slot" parameter (#19207)
...
* server : wrap around the "id_slot" parameter
* cont : minor
2026-01-30 19:46:10 +02:00
Georgi Gerganov
dabaa2e77a
spec : add ngram-mod (#19164)
...
* spec : add ngram-mod
* cont : simplify + keep track of occupancy
* cont : cleanup
* cont : move initialization to common/speculative
* cont : cleanup
* cont : cleanup
* cont : fix
2026-01-30 18:21:48 +02:00
Concedo
8d173f50c2
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .github/workflows/build.yml
# docs/backend/SYCL.md
# docs/backend/snapdragon/CMakeUserPresets.json
# docs/backend/snapdragon/README.md
# docs/backend/snapdragon/developer.md
# docs/ops.md
# docs/ops/SYCL.csv
# embd_res/templates/upstage-Solar-Open-100B.jinja
# ggml/src/CMakeLists.txt
# ggml/src/ggml-hexagon/CMakeLists.txt
# ggml/src/ggml-hexagon/ggml-hexagon.cpp
# ggml/src/ggml-sycl/element_wise.cpp
# ggml/src/ggml-sycl/element_wise.hpp
# ggml/src/ggml-sycl/ggml-sycl.cpp
# ggml/src/ggml-webgpu/wgsl-shaders/flash_attn.wgsl
# tests/test-chat.cpp
2026-01-30 15:32:59 +08:00
Concedo
7e755014b2
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .github/workflows/winget.yml
# CODEOWNERS
# common/CMakeLists.txt
# common/arg.cpp
# docs/ops/SYCL.csv
# examples/lookup/lookup-create.cpp
# examples/lookup/lookup-stats.cpp
# examples/lookup/lookup.cpp
# examples/speculative-simple/speculative-simple.cpp
# examples/speculative/speculative.cpp
# ggml/src/ggml-hip/CMakeLists.txt
# ggml/src/ggml-sycl/dpct/helper.hpp
# ggml/src/ggml-sycl/ggml-sycl.cpp
# ggml/src/ggml-sycl/norm.cpp
# ggml/src/ggml-zendnn/ggml-zendnn.cpp
# tests/test-chat-template.cpp
2026-01-29 23:05:05 +08:00
Andrew Marshall
84b0a98319
webui: Update Svelte to fix effect_update_depth_exceeded errors (#19144)
...
The upstream fix is first available in 5.38.2, so constrain to at least
that version.
Rebuild pre-compiled webui index.html.gz based on these changes.
See also:
https://github.com/ggml-org/llama.cpp/issues/16347
https://github.com/huntabyte/bits-ui/issues/1687
https://github.com/sveltejs/svelte/issues/16548
2026-01-29 15:56:39 +01:00
Concedo
46cd17c17e
Merge commit '88d23ad515' into concedo_experimental
...
# Conflicts:
# CODEOWNERS
# docs/build.md
# ggml/CMakeLists.txt
# ggml/src/CMakeLists.txt
# ggml/src/ggml-webgpu/ggml-webgpu.cpp
# ggml/src/ggml-zendnn/CMakeLists.txt
# tests/test-chat-template.cpp
2026-01-29 22:25:56 +08:00
Sascha Rogmann
72d3b1898a
spec : add self-speculative decoding (no draft model required) + refactor (#18471)
...
* server: introduce self-speculative decoding
* server: moved self-call into speculative.cpp
* can_speculate() includes self-speculation
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* server: can_speculate() tests self-spec
* server: replace can_speculate() with slot.can_speculate()
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* common: use %zu format specifier for size_t in logging
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* server: can_speculate() requires a task instance
* common: ngram map, config self-speculative decoding
* common: add enum common_speculative_type
* common: add vector of speculative states
* common: add option --spec-draftless
* server: cleanup (remove slot.batch_spec, rename)
* common: moved self-spec impl to ngram-map
* common: cleanup (use common_speculative_state_draft)
* spec : refactor
* cont : naming
* spec: remove --spec-config
* doc: (draftless) speculative decoding
* common: print performance in spec decoding
* minor : cleanup
* common : better names
* minor : cleanup + fix build
* minor: comments
* CODEOWNERS: add common/ngram-map.* (#18471)
* common : rename speculative.draftless_type -> speculative.type
* ngram-map : fix uninitialized values
* ngram-map : take into account the input can become shorter
* ngram-map : revert len check for now
* arg : change `--spec-draftless` -> `--spec-type`
* spec : add common_speculative_state::accept()
* spec : refactor + add common_speculative_begin()
* spec : fix begin() call with mtmd
* spec : additional refactor + remove common_speculative_params
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-01-28 19:42:42 +02:00
Georgi Gerganov
b931f81b5a
server : adjust spec tests to generate up to 16 tokens (#19093)
2026-01-28 09:11:40 +02:00
Daniel Bevenius
16639ba217
common : use two decimal places for float arg help messages (#19048)
...
* common : use two decimal places for float arg help messages
This commit updates the help messages for various command-line arguments
in arg.cpp to display floating-point default values with two decimal
places instead of one.
The motivation for this change is that with only one decimal place, the
values generated via --help or llama-gen-docs do not reflect the actual
defaults.
For example, the value of top-p in tools/server/README.md is `0.9`, but
the default is actually `0.95`. Running llama-gen-docs does not correct
this either, since it uses the help-message output, which shows only one
decimal place, so the values appear unchanged (see the snippet after
this list).
* docs : run llama-gen-docs to update docs
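A minimal standalone illustration of the rounding issue described above (not the actual arg.cpp code):
```cpp
#include <cstdio>

int main() {
    const float top_p = 0.95f;   // the actual default mentioned above
    printf("%.1f\n", top_p);     // prints "0.9"  -- what the old help text showed
    printf("%.2f\n", top_p);     // prints "0.95" -- what it shows after this change
    return 0;
}
```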
2026-01-25 07:31:42 +01:00
Concedo
e8e7c357c9
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .github/workflows/build-cache.yml
# .github/workflows/build-cmake-pkg.yml
# .github/workflows/build-linux-cross.yml
# .github/workflows/build.yml
# .github/workflows/check-vendor.yml
# .github/workflows/close-issue.yml
# .github/workflows/copilot-setup-steps.yml
# .github/workflows/docker.yml
# .github/workflows/editorconfig.yml
# .github/workflows/gguf-publish.yml
# .github/workflows/labeler.yml
# .github/workflows/pre-tokenizer-hashes.yml
# .github/workflows/python-check-requirements.yml
# .github/workflows/python-lint.yml
# .github/workflows/python-type-check.yml
# .github/workflows/release.yml
# .github/workflows/server-webui.yml
# .github/workflows/server.yml
# .github/workflows/update-ops-docs.yml
# .github/workflows/winget.yml
# ggml/src/ggml-opencl/CMakeLists.txt
# ggml/src/ggml-opencl/ggml-opencl.cpp
# ggml/src/ggml-zdnn/ggml-zdnn.cpp
# requirements/requirements-tool_bench.txt
# src/CMakeLists.txt
# src/llama-quant.cpp
# tests/test-backend-ops.cpp
# tests/test-chat.cpp
# tools/cli/cli.cpp
# tools/server/README.md
2026-01-23 14:27:04 +08:00
Xuan-Son Nguyen
51fa458a92
server : support preserving reasoning_content in assistant message (#18994)
...
* support reasoning_content input
* report template caps to webui
* add docs
* rm commented code
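As a rough sketch of the preserved input described above, an assistant message that carries its earlier reasoning back to the server might be shaped like this (illustrative only; built with nlohmann::json, which the server uses):
```cpp
#include <nlohmann/json.hpp>
#include <iostream>

int main() {
    // Assistant turn resubmitted with its reasoning preserved, so the chat
    // template can see it on the next request.
    nlohmann::json msg = {
        {"role", "assistant"},
        {"reasoning_content", "The user asked for the capital of France..."},
        {"content", "The capital of France is Paris."},
    };
    std::cout << msg.dump(2) << std::endl;
    return 0;
}
```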
2026-01-22 21:30:06 +01:00
Xuan-Son Nguyen
4e595b250a
server: do not log certain endpoints (avoid log spam) (#19028)
2026-01-22 19:24:37 +01:00
손희준
c6926d1d95
server: Reorder methods in server-task.cpp (#19016)
...
* Move `task_result_state::update_chat_msg` to match with header
* Move `server_task_result_cmpl_partial::to_json_anthropic()` to match with header
---------
Co-authored-by: openingnow <>
2026-01-22 14:36:04 +01:00
Hendrik Erz
3802d3c78f
fix: Use tabular-nums for chat message statistics (#18915)
...
* fix: Use `tabular-nums` for chat message statistics
* fix: Rebuild WebUI
2026-01-21 18:46:01 +01:00
손희준
fbbf3ad190
server: /v1/responses (partial) (#18486)
...
* from previous PR
* Make instruction (system) the first message
* Convert [input_message] (text/image/file)
* Rename convert_responses_to_chatcmpl(body) -> response_body
* Initial tool call support
* Erase instructions field from chatcmpl body
* Feed reasoning texts to chat template
* Use std::vector instead of opaque json array
* Make output_item.added events consistent
* Move `server_task_result_cmpl_partial::update` from header to source
* Match ID of output_item.added and .done events
* Add function_call only if there is no "fc_" prefix
* Add function call output at non-streaming API
* Test if ID is persistent
* Add doc
* Fix style - use trailing comma
* Rewrite state management
* catch up with upstream/master
* Fix style - "type" is the first item of SSE data
* Explicitly check "instructions" from response_body
* Make lambdas static
* Check if reasoning content exists
* Add `oai_resp_id` to task_result_state (also initialized at ctor), server_task_result_cmpl_partial, and server_task_result_cmpl_final
* Reject `input_file` since it is not supported by chatcmpl
* Add "fc_" prefix to non-straming function call id as coderabbit pointed out
---------
Co-authored-by: openingnow <>
2026-01-21 17:47:23 +01:00
Concedo
4984c9bc16
Merge commit '12a4a47e6a' into concedo_experimental
...
# Conflicts:
# ci/run.sh
# examples/model-conversion/scripts/causal/run-converted-model-embeddings-logits.sh
# examples/model-conversion/scripts/causal/run-converted-model.sh
# examples/model-conversion/scripts/embedding/run-converted-model.sh
# ggml/src/ggml-cann/ggml-cann.cpp
# ggml/src/ggml-hexagon/ggml-hexagon.cpp
# ggml/src/ggml-opencl/ggml-opencl.cpp
# ggml/src/ggml-sycl/ggml-sycl.cpp
# ggml/src/ggml-webgpu/ggml-webgpu.cpp
# ggml/src/ggml-zdnn/ggml-zdnn.cpp
# ggml/src/ggml-zendnn/ggml-zendnn.cpp
# tests/CMakeLists.txt
# tests/test-chat-parser.cpp
# tests/test-chat-peg-parser.cpp
# tests/test-chat.cpp
# tools/cli/cli.cpp
2026-01-21 21:00:44 +08:00
Adrien Gallouët
1c7cf94b22
common, server : use the same User-Agent by default (#18957)
...
This commit also ensures that if a custom User-Agent is used, it will be
the only one sent.
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
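As a hedged illustration of the behavior above (a custom User-Agent being the only one sent), here is a minimal request using the vendored cpp-httplib's public API; this is not the repo's actual download code:
```cpp
#include "httplib.h"   // vendored cpp-httplib
#include <iostream>

int main() {
    httplib::Client cli("http://example.com");
    // When set explicitly, this value should be the only User-Agent sent.
    httplib::Headers headers = { {"User-Agent", "my-downloader/1.0"} };
    if (auto res = cli.Get("/", headers)) {
        std::cout << res->status << std::endl;
    }
    return 0;
}
```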
2026-01-20 18:28:43 +01:00
Xuan-Son Nguyen
2c1f199653
cli : fix reasoning responses in CLI (#18961)
...
* cli : fix reasoning responses in CLI
* fix build
* fix build (2)
2026-01-20 18:23:25 +01:00
Xuan-Son Nguyen
6df686bee6
server : refactor oai_parser_opt, move it to server_chat_params (#18937)
...
* server_chat_params
* move chat format into CLI
* use meta whenever possible
* clean up, no more chatml fallback
2026-01-19 23:28:01 +01:00
Lennart Austenfeld
18361c579c
server: fix memory reservations in populate_token_probs (#18787)
2026-01-19 19:13:31 +01:00
Concedo
8855a7f52b
Merge commit 'c945aaaef2' into concedo_experimental
...
# Conflicts:
# .devops/cann.Dockerfile
# .github/workflows/build.yml
# .github/workflows/release.yml
# README.md
# common/CMakeLists.txt
# common/chat.cpp
# docs/function-calling.md
# ggml/src/ggml-cann/aclnn_ops.cpp
# ggml/src/ggml-cann/aclnn_ops.h
# ggml/src/ggml-cann/common.h
# ggml/src/ggml-cann/ggml-cann.cpp
# models/templates/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16.jinja
# scripts/sync_vendor.py
# tests/CMakeLists.txt
# tests/peg-parser/tests.h
# tests/test-chat-peg-parser.cpp
# tests/test-chat-template.cpp
# tests/test-chat.cpp
# tests/testing.h
# tools/llama-bench/llama-bench.cpp
2026-01-17 10:24:03 +08:00
Concedo
0d43bdc46d
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# examples/batched/batched.cpp
# ggml/src/ggml-opencl/CMakeLists.txt
# ggml/src/ggml-opencl/ggml-opencl.cpp
# src/llama-context.cpp
# tools/cli/README.md
# tools/completion/README.md
# tools/server/README.md
2026-01-17 00:41:28 +08:00
Xuan-Son Nguyen
c15395f73c
common : implement new jinja template engine (#18462)
...
* jinja vm
* lexer
* add vm types
* demo
* clean up
* parser ok
* binary_expression::execute
* shadow naming
* bin ops works!
* fix map object
* add string builtins
* add more builtins
* wip
* use mk_val
* eval with is_user_input
* render gemma tmpl ok
* track input string even after transformations
* support bound functions
* keyword arguments and slicing array
* use shared_ptr for values
* add mk_stmt
* allow print source on exception
* fix negate test
* testing more templates
* mostly works
* add filter_statement
* allow func to access ctx
* add jinja-value.cpp
* impl global_from_json
* a lot of fixes
* more tests
* more fix, more tests
* more fixes
* rm workarounds
* demo: type inference
* add placeholder for tojson
* improve function args handling
* rm type inference
* no more std::regex
* trailing spaces
* make testing more flexible
* make output a bit cleaner
* (wip) redirect minja calls
* test: add --output
* fix crash on macro kwargs
* add minimal caps system
* add some workarounds
* rm caps_apply_workarounds
* get rid of preprocessing
* more fixes
* fix test-chat-template
* move test-chat-jinja into test-chat-template
* rm test-chat-jinja from cmake
* test-chat-template: use common
* fix build
* fix build (2)
* rename vm --> interpreter
* improve error reporting
* correct lstrip behavior
* add tojson
* more fixes
* disable tests for COMMON_CHAT_FORMAT_GENERIC
* make sure tojson output correct order
* add object.length
* fully functional selectattr / rejectattr
* improve error reporting
* more builtins added, more fixes
* create jinja rendering tests
* fix testing.h path
* adjust whitespace rules
* more fixes
* temporarily disable test for ibm-granite
* r/lstrip behavior matched with hf.js
* minimax, glm4.5 ok
* add append and pop
* kimi-k2 ok
* test-chat passed
* fix lstrip_block
* add more jinja tests
* cast to unsigned char
* allow dict key to be numeric
* nemotron: rm windows newline
* tests ok
* fix test
* rename interpreter --> runtime
* fix build
* add more checks
* bring back generic format support
* fix Apertus
* [json.exception.out_of_range.403] key 'content' not found
* rm generic test
* refactor input marking
* add docs
* fix windows build
* clarify error message
* improved tests
* split/rsplit with maxsplit
* non-inverse maxsplit
forgot to change after simplifying
* implement separators for tojson and fix indent
* i like to move it move it
* rename null --> none
* token::eof
* some nits + comments
* add exception classes for lexer and parser
* null -> none
* rename global -> env
* rm minja
* update docs
* docs: add input marking caveats
* implement missing jinja-tests functions
* oops
* support trim filter with args, remove bogus to_json reference
* numerous argument fixes
* updated tests
* implement optional strip chars parameter
* use new chars parameter
* float filter also has default
* always leave at least one decimal in float string
* jinja : static analysis + header cleanup + minor fixes
* add fuzz test
* add string.cpp
* fix chat_template_kwargs
* nits
* fix build
* revert
* unrevert
sorry :)
* add fuzz func_args, refactor to be safer
* fix array.map()
* loosen ensure_vals max count condition, add not impl for map(int)
* hopefully fix windows
* check if empty first
* normalize newlines
---------
Co-authored-by: Alde Rojas <hello@alde.dev>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-01-16 11:22:06 +01:00
ddh0
13f1e4a9ca
llama : add adaptive-p sampler (#17927)
...
* initial commit for branch
* simplify constants
* add params to `struct common_params_sampling`, add reference to PR
* explicitly clamp `min_target` and `max_target` to `[0.0, 1.0]`
* add args, rename `queue_size` -> `window_size`
* improved comments
* minor
* remove old unused code from algorithm
* minor
* add power law case to `common_sampler_init`, add sampler name mappings
* clarify behaviour when `window_size = 0`
* add missing enums
* remove `target_range` param, make `target == 1` no-op, cleanup code
* oops, straggler
* add missing parameters in `server-task.cpp`
* copy from author
ref:
https://gist.github.com/MrJackSpade/9be99c7efbba7b95a41377e123b7b069
* remove old debug log, style nit
* fix compiler warning, add commented-out logging per token
* re-write + change parameters + simplify
* oops forgot args.cpp
* fix leftover `window_size`
* add missing values to `common_params_sampling::print()`
* with logging
* does this fix it?
* no, but does this?
* update default decay
* optimize
* fix bad merge
my git skills are lacking
* silence `missing initializer for member`
* update default decay to 0.9
* fix logging
* format (double)
* add power law to the new `samplers` vector
* log sampler init values
* improve logging messages in llama_sampler_power_law
* remove extraneous logging
* simplify target computation
last commit with debug logging!
* remove debug logging, explicitly clamp params at init
* add `use_power_law` flag + logic, minor cleanup
* update `power-law` -> `adaptive-p`
* fix cold start EMA
- `ctx->weighted_sum` is now initialized and reset to `target / (1.0f - clamped_decay)`
- `ctx->total_weight` is now initialized and reset to `1.0f / (1.0f - clamped_decay)`
This fixes a "cold start" problem with the moving average (see the sketch after this list).
* update `SHARPNESS` constant to `10.0f`
* minor style fixes
no functional changes
* minor style fixes cont.
* update `llama_sampler_adaptive_p_i` for backend sampling (ref: #17004)
* separate into `apply` + `accept` functions
* `pending_token_idx`: switch from `llama_token` to `int32`
functionally identical (`llama.h` has `typedef int32_t llama_token;`),
but it's more correct now
* don't transform logits <= -1e9f
* fix masking in backend top-p, min-p
* address review comments
* typo in comments `RND` -> `RNG`
* add docs
* add recommended values in completion docs
* address PR feedback
* remove trailing whitespace (for CI `editorconfig`)
* add to adaptive-p to `common_sampler_types_from_chars`
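For illustration, a minimal sketch of the cold-start fix from the list above, using the formulas in the commit messages; the struct and function names here are hypothetical stand-ins, not the actual llama.cpp sampler code:
```cpp
#include <cstdio>

// Hypothetical stand-in for the sampler's EMA state (`ctx->weighted_sum`
// and `ctx->total_weight` in the commit messages).
struct ema_state {
    float weighted_sum;
    float total_weight;
};

// Initializing with target / (1 - decay) and 1 / (1 - decay) makes the very
// first average equal `target` instead of being biased toward zero.
static ema_state ema_init(float target, float decay) {
    return { target / (1.0f - decay), 1.0f / (1.0f - decay) };
}

// Standard exponentially-decayed moving-average update.
static void ema_accept(ema_state & s, float x, float decay) {
    s.weighted_sum = s.weighted_sum * decay + x;
    s.total_weight = s.total_weight * decay + 1.0f;
}

int main() {
    const float target = 0.5f, decay = 0.9f;
    ema_state s = ema_init(target, decay);
    printf("avg = %.3f\n", s.weighted_sum / s.total_weight); // 0.500, no cold start
    ema_accept(s, 1.0f, decay);
    printf("avg = %.3f\n", s.weighted_sum / s.total_weight); // 0.550, drifting toward 1.0
    return 0;
}
```
Note that `total_weight` starts at its fixed point (`total_weight * decay + 1 == total_weight`), so the average is well-defined from the first accepted value onward.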
2026-01-15 19:16:29 +02:00
Xuan-Son Nguyen
a04c2b06a3
server: improve slots scheduling for n_cmpl (#18789)
...
* server : make sure children tasks are scheduled to launch with parent
* fix
* add comment pointing to this PR
* fix
* clean up
* more debug messages
* add pop_deferred_task with specific ID version
* improve the logic
* simple approach
* no double move
* correct return type of launch_slots_with_parent_task
2026-01-15 17:10:28 +01:00
Georgi Gerganov
39173bcacb
context : reserve new scheduler when graph topology changes (#18547)
...
* context : reserve new scheduler when graph topology changes
* cont : fix
* cont : fix reserve
* cont : reserve only when changes occur + timing
* context : add comments
* llama : reserve on sampler changes
* common : allow null common_sampler
* server : task declares needs (embd, logits, sampling)
* server : do not init sampler if not needed
* llama : fix need_reserve when unsetting a sampler
* server : consolidate slot reset/clear logic
2026-01-15 16:39:17 +02:00
Concedo
7d2c1c4f46
note: clip_is_mrope was moved to mtmd_decode_use_mrope upstream and is no longer synced since https://github.com/ggml-org/llama.cpp/pull/18793
...
Merge commit 'c1e79e610f' into concedo_experimental
# Conflicts:
# .github/workflows/build.yml
# .github/workflows/release.yml
# CMakeLists.txt
# CONTRIBUTING.md
# MIT_LICENSE_GGML_SDCPP_LLAMACPP_ONLY.md
# README.md
# SECURITY.md
# ci/run.sh
# common/CMakeLists.txt
# common/arg.cpp
# docs/ops.md
# docs/ops/BLAS.csv
# docs/ops/zDNN.csv
# docs/preset.md
# examples/batched/batched.cpp
# examples/debug/debug.cpp
# ggml/src/ggml-blas/CMakeLists.txt
# ggml/src/ggml-opencl/CMakeLists.txt
# ggml/src/ggml-opencl/ggml-opencl.cpp
# licenses/LICENSE-curl
# licenses/LICENSE-httplib
# scripts/pr2wt.sh
# scripts/sync_vendor.py
# tests/CMakeLists.txt
# tests/test-backend-ops.cpp
# tools/cli/README.md
# tools/completion/README.md
# tools/llama-bench/llama-bench.cpp
# tools/server/README.md
# vendor/cpp-httplib/LICENSE
2026-01-13 23:31:14 +08:00
Concedo
0dc18c668c
Merge commit 'a61c8bc3bf' into concedo_experimental
...
# Conflicts:
# ggml/src/ggml-opencl/CMakeLists.txt
# ggml/src/ggml-opencl/ggml-opencl.cpp
# ggml/src/ggml-webgpu/ggml-webgpu.cpp
# scripts/pr2wt.sh
# src/llama-model.cpp
# tools/CMakeLists.txt
# tools/mtmd/CMakeLists.txt
# tools/mtmd/clip.cpp
# tools/mtmd/clip.h
2026-01-13 23:06:50 +08:00
Radoslav Gerganov
bcf7546160
server : add arg for disabling prompt caching (#18776)
...
* server : add arg for disabling prompt caching
Disabling prompt caching is useful for clients who are restricted to
sending only OpenAI-compat requests and want deterministic
responses.
* address review comments
* address review comments
2026-01-12 19:21:34 +02:00
Xuan-Son Nguyen
ce3bf9b1a4
server: update docs for sleeping [no ci] (#18777)
2026-01-12 13:01:24 +01:00
Georgi Gerganov
f307926482
server : adjust unified KV cache tests (#18716)
2026-01-10 17:51:56 +02:00
Xuan-Son Nguyen
9ac2693a30
server: fix n_cmpl not skipping processing prompt (#18663)
...
* server: fix n_cmpl not skipping processing
* fix infinite loop on empty batch
* cont : init child samplers + modify child logic
* cont : cleanup
* cont : improve n_cmpl logic
- launch the parent task first so it finds the slot with best cache
- parent task waits for child tasks to be launched
- when a child task finishes - remove its cache
* cont : remove redundant function
* cont : reduce parent checks
* fix : nullptr task dereference
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-01-10 00:00:41 +01:00
Pascal
ec8fd7876b
Webui/file upload (#18694)
...
* webui: fix restrictive file type validation
* webui: simplify file processing logic
* chore: update webui build output
* webui: remove file picker extension whitelist (1/2)
* webui: remove file picker extension whitelist (2/2)
* chore: update webui build output
* refactor: Cleanup
* chore: update webui build output
* fix: update ChatForm storybook test after removing accept attribute
* chore: update webui build output
* refactor: more cleanup
* chore: update webui build output
2026-01-09 16:45:32 +01:00
Georgi Gerganov
53eb9435da
server : fix timing of prompt/generation (#18713)
2026-01-09 12:59:50 +02:00
Georgi Gerganov
f5f8812f7c
server : use different seeds for child completions (#18700)
...
* server : use different seeds for child completions
* cont : handle default seed
* cont : note
2026-01-09 09:33:50 +02:00
Concedo
983baac46b
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .devops/vulkan.Dockerfile
# .github/workflows/build.yml
# ci/run.sh
# examples/model-conversion/Makefile
# examples/model-conversion/README.md
# examples/model-conversion/scripts/causal/compare-logits.py
# examples/model-conversion/scripts/embedding/run-converted-model.sh
# examples/model-conversion/scripts/utils/common.py
# examples/model-conversion/scripts/utils/semantic_check.py
# ggml/src/ggml-cann/ggml-cann.cpp
# ggml/src/ggml-cuda/CMakeLists.txt
# ggml/src/ggml-opencl/CMakeLists.txt
# ggml/src/ggml-opencl/ggml-opencl.cpp
# ggml/src/ggml-sycl/ggml-sycl.cpp
# ggml/src/ggml-webgpu/ggml-webgpu.cpp
# scripts/pr2wt.sh
# scripts/sync_vendor.py
# tests/test-arg-parser.cpp
2026-01-09 01:23:10 +08:00
Concedo
956ab99934
Merge commit '56d2fed2b3' into concedo_experimental
...
# Conflicts:
# .github/workflows/build.yml
# .gitignore
# README.md
# examples/CMakeLists.txt
# examples/debug/CMakeLists.txt
# examples/model-conversion/scripts/causal/compare-logits.py
# examples/model-conversion/scripts/causal/run-casual-gen-embeddings-org.py
# examples/model-conversion/scripts/causal/run-converted-model-embeddings-logits.sh
# examples/model-conversion/scripts/causal/run-converted-model.sh
# examples/model-conversion/scripts/causal/run-org-model.py
# examples/model-conversion/scripts/embedding/run-converted-model.sh
# examples/model-conversion/scripts/embedding/run-original-model.py
# examples/model-conversion/scripts/utils/common.py
# examples/model-conversion/scripts/utils/semantic_check.py
# ggml/src/ggml-cann/aclnn_ops.cpp
# ggml/src/ggml-cann/common.h
# ggml/src/ggml-cann/ggml-cann.cpp
# ggml/src/ggml-hexagon/ggml-hexagon.cpp
# ggml/src/ggml-hexagon/htp/CMakeLists.txt
# ggml/src/ggml-hexagon/htp/htp-ctx.h
# ggml/src/ggml-hexagon/htp/htp-msg.h
# ggml/src/ggml-hexagon/htp/htp-ops.h
# ggml/src/ggml-hexagon/htp/hvx-utils.c
# ggml/src/ggml-hexagon/htp/hvx-utils.h
# ggml/src/ggml-hexagon/htp/main.c
# ggml/src/ggml-hexagon/htp/matmul-ops.c
# ggml/src/ggml-hexagon/htp/softmax-ops.c
# ggml/src/ggml-hexagon/htp/unary-ops.c
# scripts/snapdragon/adb/run-bench.sh
# tests/test-arg-parser.cpp
# tools/CMakeLists.txt
2026-01-09 00:30:53 +08:00
Adrien Gallouët
55abc39355
vendor : update cpp-httplib to 0.30.0 (#18660)
...
* vendor : update cpp-httplib to 0.30.0
* common : allow custom headers when downloading
2026-01-08 13:53:54 +01:00
R
3d26a09dc7
server : add thinking content blocks to Anthropic Messages API (#18551)
...
* server : add thinking content blocks to Anthropic Messages API
Add support for returning reasoning/thinking content in Anthropic API
responses when using models with --reasoning-format deepseek and the
thinking parameter enabled.
- Non-streaming: adds thinking block before text in content array
- Streaming: emits thinking_delta events with correct block indices
- Partial streaming: tracks reasoning state across chunks via
anthropic_has_reasoning member variable
Tested with bartowski/DeepSeek-R1-Distill-Qwen-7B-GGUF model.
* server : fix Anthropic API streaming for thinking content blocks
Add signature field and fix duplicate content_block_start events in
Anthropic Messages API streaming responses for reasoning models.
* server: refactor Anthropic streaming state to avoid raw pointer
Replace raw pointer to task_result_state with direct field copies:
- Copy state fields in update() before processing chunk
- Use local copies in to_json_anthropic() instead of dereferencing
- Pre-compute state updates for next chunk in update()
This makes the data flow clearer and avoids unsafe pointer patterns.
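For reference, the rough shape of the non-streaming content array described above, with the thinking block ordered before the text block. Field names follow the public Anthropic Messages API; this sketch uses nlohmann::json (which the server already depends on) and is illustrative, not the server's actual serialization code:
```cpp
#include <nlohmann/json.hpp>
#include <iostream>

int main() {
    // Thinking block first, then the visible text block.
    nlohmann::json content = nlohmann::json::array({
        { {"type", "thinking"}, {"thinking", "Let me work through this..."} },
        { {"type", "text"},     {"text",     "The answer is 42."} },
    });
    std::cout << content.dump(2) << std::endl;
    return 0;
}
```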
2026-01-06 16:17:13 +01:00