Commit graph

711 commits

Author SHA1 Message Date
Concedo
bff3fd3e34 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	common/common.cpp
#	docs/backend/snapdragon/README.md
#	ggml/src/ggml-hexagon/htp/htp-ops.h
#	ggml/src/ggml-hexagon/htp/matmul-ops.c
#	ggml/src/ggml-opencl/CMakeLists.txt
#	ggml/src/ggml-opencl/ggml-opencl.cpp
#	scripts/pr2wt.sh
#	tests/test-backend-ops.cpp
#	tools/server/README.md
2026-02-13 14:00:45 +08:00
Concedo
55524e160b temp merge, not working 2026-02-13 12:11:26 +08:00
Aleksander Grygier
4d688f9ebb
(webui) FEATURE: Enable adding or injecting System Message into chat (#19556)
* feat: Enable adding System Prompt per-chat

* fix: Save draft message in Chat Form when adding System Prompt from new chat view

* fix: Proper system message deletion logic

* chore: Formatting

* chore: update webui build output
2026-02-12 13:56:08 +01:00
Aleksander Grygier
f486ce9f30
(webui) REFACTOR: UI primitives and polish (#19551)
* webui: UI primitives and polish (non-MCP)

* chore: update webui build output
2026-02-12 12:21:00 +01:00
Aleksander Grygier
38adc7d469
WebUI Architecture Cleanup (#19541)
* webui: architecture foundation (non-MCP core refactors)

* chore: update webui build output
2026-02-12 11:22:27 +01:00
Concedo
261d78eaaa Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	CMakeLists.txt
#	README.md
#	docs/speculative.md
#	ggml/src/ggml-cann/aclnn_ops.cpp
#	ggml/src/ggml-cann/ggml-cann.cpp
#	tests/CMakeLists.txt
#	tests/test-backend-ops.cpp
#	tools/mtmd/clip.cpp
2026-02-12 18:05:20 +08:00
RichardScottOZ
fa16e517a3
server : fix typo in README.md for features list (#19510)
extra l for full
2026-02-12 08:56:25 +01:00
AesSedai
e463bbdf65
model: Add Kimi-K2.5 support (#19170)
* Move dequant_model to after the text_config merge
Add new kimi-k2.5 keys to mtmd convert
Update V_MMPROJ tensor mapping for new mm_projector.proj keys
Update V_M_IMP_NORM for new mm_projector.pre_norm key

* Fix a couple of oversights

* Add image support for Kimi-K2.5

* Revert changes to KimiVLForConditionalGeneration

* Fix an assert crash

* Fix permute swapping w / h on accident

* Kimi-K2.5: Use merged QKV for vision

* Kimi-K2.5: pre-convert vision QK to use build_rope_2d

* Kimi-K2.5: support non-interleaved rope for vision

* Kimi-K2.5: fix min / max pixel

* Kimi-K2.5: remove v/o permutes, unnecessary

* Kimi-K2.5: update permute name to match

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Kimi-K2.5: replace build_rope_2d ggml_cont with ggml_view_3d pointers

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-02-11 16:47:30 +01:00
Georgi Gerganov
6d95707827
model : fix wavtokenizer embedding notions (#19479) 2026-02-11 07:52:20 +02:00
JJJYmmm
fc0fe40049
models : support qwen3.5 series (#19468)
* support qwen3.5 series

* remove deepstack for now, and some code clean

* code clean

* add FULL_ATTENTION_INTERVAL metadata

* code clean

* reorder v heads for linear attention to avoid expensive interleaved repeat
2026-02-10 18:00:26 +02:00
Daniel Bevenius
66d403c480
tts : fix typos in README.md [no ci] (#19463) 2026-02-10 07:30:41 +01:00
Tarek Dakhran
262364e31d
mtmd: Implement tiling for LFM2-VL (#19454) 2026-02-09 17:30:32 +01:00
손희준
820ebfa6f4
Server: log when converting requests to chat completions format (#19457)
* Log converting requests

* Print as debug instead of info [no ci]

---------

Co-authored-by: openingnow <>
2026-02-09 16:22:57 +01:00
Sascha Rogmann
292f6908cd
spec : remove check rate (#19377)
* spec: remove parameter spec-ngram-check-rate

* spec : renamed statistics vars

* spec : add n_call_begin, n_call_accept

* spec : don't enable key-map-stats
2026-02-09 15:30:50 +02:00
Concedo
757b293ac9 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.github/workflows/build.yml
#	.github/workflows/server-webui.yml
#	.github/workflows/server.yml
#	tools/rpc/rpc-server.cpp
2026-02-09 00:33:11 +08:00
Adrien Gallouët
5fa1c190d9
rpc : update from common.cpp (#19400)
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2026-02-08 09:06:45 +01:00
Georgi Gerganov
eb449cdfa4
server : improve context checkpoint logic (#19408) 2026-02-08 09:40:04 +02:00
ddh0
5999b50eb0
llama-quantize : cleanup --help output (#19317)
* cleanup `llama-quantize --help` output

some much-needed TLC

* remove future argument

oops, spoiler

* cleanup of cleanup
2026-02-08 09:22:38 +02:00
Concedo
a0a78dacc4 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.github/workflows/build.yml
#	docs/ops.md
#	docs/ops/SYCL.csv
#	ggml/src/ggml-sycl/element_wise.cpp
#	ggml/src/ggml-sycl/ggml-sycl.cpp
#	ggml/src/ggml-webgpu/ggml-webgpu-shader-lib.hpp
#	ggml/src/ggml-webgpu/ggml-webgpu.cpp
#	pyproject.toml
#	requirements/requirements-convert_legacy_llama.txt
#	src/CMakeLists.txt
#	src/llama-vocab.cpp
#	tests/test-backend-ops.cpp
2026-02-07 15:54:02 +08:00
Georgi Gerganov
dfde5993ea
common : add common_speculative_is_compat() (#19270)
* llama : add llama_memory_can_rm_suffix()

* Revert "llama : add llama_memory_can_rm_suffix()"

This reverts commit d30e59b62a15ef4266a6503e3f4eba770aec001b.

* spec : check if the target context is compatible for spec decoding
2026-02-06 16:47:22 +02:00
Concedo
1f803ae27b Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.github/workflows/server.yml
#	CMakeLists.txt
#	cmake/common.cmake
#	ggml/src/ggml-virtgpu/apir_cs_ggml-rpc-front.cpp
#	ggml/src/ggml-virtgpu/backend/backend-dispatched-backend.cpp
#	ggml/src/ggml-virtgpu/backend/backend-dispatched-buffer-type.cpp
#	ggml/src/ggml-virtgpu/backend/backend-dispatched-buffer.cpp
#	ggml/src/ggml-virtgpu/backend/backend-dispatched-device.cpp
#	ggml/src/ggml-virtgpu/backend/backend-dispatched.cpp
#	ggml/src/ggml-virtgpu/backend/backend-dispatched.gen.h
#	ggml/src/ggml-virtgpu/backend/backend-dispatched.h
#	ggml/src/ggml-virtgpu/backend/backend.cpp
#	ggml/src/ggml-virtgpu/backend/shared/apir_cs.h
#	ggml/src/ggml-virtgpu/backend/shared/apir_cs_ggml.h
#	ggml/src/ggml-virtgpu/ggml-backend-buffer-type.cpp
#	ggml/src/ggml-virtgpu/ggml-backend-device.cpp
#	ggml/src/ggml-virtgpu/ggml-backend-reg.cpp
#	ggml/src/ggml-virtgpu/ggml-remoting.h
#	ggml/src/ggml-virtgpu/ggmlremoting_functions.yaml
#	ggml/src/ggml-virtgpu/regenerate_remoting.py
#	ggml/src/ggml-virtgpu/virtgpu-forward-backend.cpp
#	ggml/src/ggml-virtgpu/virtgpu-forward-buffer-type.cpp
#	ggml/src/ggml-virtgpu/virtgpu-forward-buffer.cpp
#	ggml/src/ggml-virtgpu/virtgpu-forward-device.cpp
#	ggml/src/ggml-virtgpu/virtgpu-forward-impl.h
#	ggml/src/ggml-virtgpu/virtgpu-forward.gen.h
#	ggml/src/ggml-virtgpu/virtgpu-shm.cpp
#	ggml/src/ggml-virtgpu/virtgpu.cpp
#	ggml/src/ggml-virtgpu/virtgpu.h
2026-02-04 16:21:06 +08:00
Daniel Bevenius
25f40ca65f
completion : simplify batch (embd) processing (#19286)
* completion : simplify batch (embd) processing

This commit simplifies the processing of embd by removing the for loop
that currently exists which uses params.n_batch as its increment. This
commit also removes the clamping of n_eval as the size of embd is always
at most the size of params.n_batch.

The motivation is to clarify the code as it is currently a little
confusing when looking at this for loop in isolation and thinking that
it can process multiple batches.

* add an assert to verify n_eval is not greater than n_batch
2026-02-04 05:43:28 +01:00
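The commit above replaces a loop over `params.n_batch`-sized chunks with a single evaluation, relying on `embd` never holding more than `n_batch` tokens, and adds an assert for that invariant. The sketch below shows that pattern in isolation. It is not the actual `completion` code: `FakeContext`, `decode()` and `eval_pending()` are hypothetical stand-ins for the real evaluation call.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for the model context; it only records the size of
// each batch it is asked to evaluate.
struct FakeContext {
    std::vector<int> evaluated_sizes;
    int decode(const int * tokens, int n_tokens) {
        (void) tokens;
        evaluated_sizes.push_back(n_tokens);
        return 0;
    }
};

// Evaluate all pending tokens in one call. Because the caller guarantees that
// embd holds at most n_batch tokens, no chunking loop or clamping of n_eval is
// needed; the assert documents and checks that invariant.
int eval_pending(FakeContext & ctx, const std::vector<int> & embd, int n_batch) {
    if (embd.empty()) {
        return 0;
    }
    const int n_eval = static_cast<int>(embd.size());
    assert(n_eval <= n_batch && "embd must not exceed n_batch");
    return ctx.decode(embd.data(), n_eval);
}
```

With three pending tokens and `n_batch = 4`, `decode()` is called exactly once with `n_tokens = 3`.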
Concedo
7b393fa487 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.github/workflows/build.yml
#	AUTHORS
#	ci/run.sh
#	docs/backend/SYCL.md
#	docs/build.md
#	docs/multimodal/minicpmo2.6.md
#	docs/multimodal/minicpmo4.0.md
#	docs/multimodal/minicpmv2.5.md
#	docs/multimodal/minicpmv2.6.md
#	docs/multimodal/minicpmv4.0.md
#	docs/multimodal/minicpmv4.5.md
#	docs/ops.md
#	docs/ops/SYCL.csv
#	docs/speculative.md
#	examples/deprecation-warning/README.md
#	examples/deprecation-warning/deprecation-warning.cpp
#	examples/model-conversion/Makefile
#	examples/model-conversion/scripts/causal/convert-model.sh
#	ggml/include/ggml-cann.h
#	ggml/src/ggml-cann/acl_tensor.cpp
#	ggml/src/ggml-cann/acl_tensor.h
#	ggml/src/ggml-cann/aclnn_ops.cpp
#	ggml/src/ggml-cann/aclnn_ops.h
#	ggml/src/ggml-cann/common.h
#	ggml/src/ggml-cann/ggml-cann.cpp
#	ggml/src/ggml-metal/CMakeLists.txt
#	ggml/src/ggml-opencl/ggml-opencl.cpp
#	ggml/src/ggml-opencl/kernels/concat.cl
#	ggml/src/ggml-opencl/kernels/repeat.cl
#	ggml/src/ggml-opencl/kernels/scale.cl
#	ggml/src/ggml-opencl/kernels/tanh.cl
#	ggml/src/ggml-sycl/CMakeLists.txt
#	ggml/src/ggml-sycl/dpct/helper.hpp
#	ggml/src/ggml-sycl/ggml-sycl.cpp
#	ggml/src/ggml-sycl/outprod.cpp
#	ggml/src/ggml-sycl/rope.cpp
#	ggml/src/ggml-sycl/wkv.cpp
#	src/llama-vocab.cpp
#	tests/test-autorelease.cpp
#	tests/test-backend-ops.cpp
#	tools/cvector-generator/pca.hpp
#	tools/export-lora/export-lora.cpp
#	tools/perplexity/README.md
2026-02-03 19:00:42 +08:00
Xuan-Son Nguyen
07a7412a3b
mtmd: add min/max pixels gguf metadata (#19273) 2026-02-02 20:59:06 +01:00
Matthieu Coudron
a3fa035822
server: print actual model name in 'model not found' error (#19117)
When experimenting with AI, my environment gets messy fast, and it's not
always easy to know which model my software is trying to load. This helps
with troubleshooting.

Before:

Error: {
  code = 400,
  message = "model not found",
  type = "invalid_request_error"
}

After:

Error: {
  code = 400,
  message = "model 'toto' not found",
  type = "invalid_request_error"
}
2026-02-02 16:55:27 +01:00
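The before/after output above amounts to interpolating the requested model name into the error message. A minimal sketch, assuming a hypothetical helper `model_not_found_message()` (the server's real code path and names may differ); it falls back to the old generic text when no name is available.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: build the "model not found" error text, including the
// requested model name so the user can see what was actually asked for.
std::string model_not_found_message(const std::string & model_name) {
    if (model_name.empty()) {
        return "model not found";  // no name available: keep the generic message
    }
    return "model '" + model_name + "' not found";
}
```

For the example above, `model_not_found_message("toto")` yields `model 'toto' not found`.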
Christian Kastner
7a4ca3cbd9
docs : Minor cleanups (#19252)
* Update old URLs to github.com/ggml-org/

* Bump copyrights
2026-02-02 08:38:55 +02:00
Concedo
ddce19db72 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.devops/nix/package-gguf-py.nix
#	.devops/nix/scope.nix
#	common/CMakeLists.txt
#	docs/backend/SYCL.md
#	examples/lookahead/lookahead.cpp
#	examples/lookup/lookup.cpp
#	examples/sycl/run-llama2.sh
#	examples/sycl/win-run-llama2.bat
#	examples/sycl/win-test.bat
#	ggml/src/ggml-hexagon/CMakeLists.txt
#	ggml/src/ggml-hexagon/htp/flash-attn-ops.c
#	ggml/src/ggml-hexagon/htp/hvx-dump.h
#	ggml/src/ggml-hexagon/htp/hvx-reduce.h
#	ggml/src/ggml-hexagon/htp/matmul-ops.c
#	ggml/src/ggml-hexagon/htp/softmax-ops.c
#	ggml/src/ggml-hexagon/htp/unary-ops.c
#	ggml/src/ggml-opencl/CMakeLists.txt
#	ggml/src/ggml-opencl/ggml-opencl.cpp
#	ggml/src/ggml-opencl/kernels/cvt.cl
#	scripts/sync-ggml.last
2026-02-01 22:35:25 +08:00
EugeoSynthesisThirtyTwo
3dd95914d0
quantize: add option --tensor-type-file to llama-quantize (#18572)
* add option --tensor-type-file to llama-quantize, but it raises an error.

* add error message when file not found

* quantize: update help menu, fix CI

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

---------

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
Co-authored-by: Your Name <you@example.com>
Co-authored-by: Aaron Teo <aaron.teo1@ibm.com>
2026-01-31 11:39:21 +08:00
tc-mb
ec6c7421e4
mtmd: support MiniCPM-o 4.5(vision only) (#19211)
Signed-off-by: tc-mb <caitianchi@modelbest.cn>
2026-01-30 23:19:30 +01:00
Georgi Gerganov
bbada8bfb9
server : wrap around the "id_slot" parameter (#19207)
* server : wrap around the "id_slot" parameter

* cont : minor
2026-01-30 19:46:10 +02:00
Georgi Gerganov
dabaa2e77a
spec : add ngram-mod (#19164)
* spec : add ngram-mod

* cont : simplify + keep track of occupancy

* cont : cleanup

* cont : move initialization to common/speculative

* cont : cleanup

* cont : cleanup

* cont : fix
2026-01-30 18:21:48 +02:00
Concedo
8d173f50c2 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.github/workflows/build.yml
#	docs/backend/SYCL.md
#	docs/backend/snapdragon/CMakeUserPresets.json
#	docs/backend/snapdragon/README.md
#	docs/backend/snapdragon/developer.md
#	docs/ops.md
#	docs/ops/SYCL.csv
#	embd_res/templates/upstage-Solar-Open-100B.jinja
#	ggml/src/CMakeLists.txt
#	ggml/src/ggml-hexagon/CMakeLists.txt
#	ggml/src/ggml-hexagon/ggml-hexagon.cpp
#	ggml/src/ggml-sycl/element_wise.cpp
#	ggml/src/ggml-sycl/element_wise.hpp
#	ggml/src/ggml-sycl/ggml-sycl.cpp
#	ggml/src/ggml-webgpu/wgsl-shaders/flash_attn.wgsl
#	tests/test-chat.cpp
2026-01-30 15:32:59 +08:00
Concedo
7e755014b2 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.github/workflows/winget.yml
#	CODEOWNERS
#	common/CMakeLists.txt
#	common/arg.cpp
#	docs/ops/SYCL.csv
#	examples/lookup/lookup-create.cpp
#	examples/lookup/lookup-stats.cpp
#	examples/lookup/lookup.cpp
#	examples/speculative-simple/speculative-simple.cpp
#	examples/speculative/speculative.cpp
#	ggml/src/ggml-hip/CMakeLists.txt
#	ggml/src/ggml-sycl/dpct/helper.hpp
#	ggml/src/ggml-sycl/ggml-sycl.cpp
#	ggml/src/ggml-sycl/norm.cpp
#	ggml/src/ggml-zendnn/ggml-zendnn.cpp
#	tests/test-chat-template.cpp
2026-01-29 23:05:05 +08:00
Andrew Marshall
84b0a98319
webui: Update Svelte to fix effect_update_depth_exceeded errors (#19144)
The upstream fix is first available in 5.38.2, so constrain to at least
that version.

Rebuild pre-compiled webui index.html.gz based on these changes.

See also:
https://github.com/ggml-org/llama.cpp/issues/16347
https://github.com/huntabyte/bits-ui/issues/1687
https://github.com/sveltejs/svelte/issues/16548
2026-01-29 15:56:39 +01:00
Concedo
46cd17c17e Merge commit '88d23ad515' into concedo_experimental
# Conflicts:
#	CODEOWNERS
#	docs/build.md
#	ggml/CMakeLists.txt
#	ggml/src/CMakeLists.txt
#	ggml/src/ggml-webgpu/ggml-webgpu.cpp
#	ggml/src/ggml-zendnn/CMakeLists.txt
#	tests/test-chat-template.cpp
2026-01-29 22:25:56 +08:00
Sascha Rogmann
72d3b1898a
spec : add self‑speculative decoding (no draft model required) + refactor (#18471)
* server: introduce self-speculative decoding

* server: moved self-call into speculative.cpp

* can_speculate() includes self-speculation

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* server: can_speculate() tests self-spec

* server: replace can_speculate() with slot.can_speculate()

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* common: use %zu format specifier for size_t in logging

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* server: can_speculate() requires a task instance

* common: ngram map, config self-speculative decoding

* common: add enum common_speculative_type

* common: add vector of speculative states

* common: add option --spec-draftless

* server: cleanup (remove slot.batch_spec, rename)

* common: moved self-spec impl to ngram-map

* common: cleanup (use common_speculative_state_draft)

* spec : refactor

* cont : naming

* spec: remove --spec-config

* doc: (draftless) speculative decoding

* common: print performance in spec decoding

* minor : cleanup

* common : better names

* minor : cleanup + fix build

* minor: comments

* CODEOWNERS: add common/ngram-map.* (#18471)

* common : rename speculative.draftless_type -> speculative.type

* ngram-map : fix uninitialized values

* ngram-map : take into account the input can become shorter

* ngram-map : revert len check for now

* arg : change `--spec-draftless` -> `--spec-type`

* spec : add common_speculative_state::accept()

* spec : refactor + add common_speculative_begin()

* spec : fix begin() call with mtmd

* spec : additional refactor + remove common_speculative_params

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-01-28 19:42:42 +02:00
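Self-speculative ("draftless") decoding proposes draft tokens without a separate draft model. One common way to do this is n-gram lookup over the existing token history: take the last `n` tokens as a key, find an earlier occurrence of that key, and propose the tokens that followed it. The sketch below illustrates that general technique only. It is not the `common/ngram-map` implementation from the commit, and `ngram_draft()` is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative n-gram lookup drafting (not llama.cpp's ngram-map code).
// Uses the last n tokens of `history` as a key, searches backwards for the
// most recent earlier occurrence, and returns up to max_draft tokens that
// followed it. Returns an empty draft if no earlier match exists.
std::vector<int> ngram_draft(const std::vector<int> & history, std::size_t n, std::size_t max_draft) {
    std::vector<int> draft;
    if (n == 0 || history.size() <= n) {
        return draft;
    }
    const std::size_t key_start = history.size() - n;
    // Candidate start positions strictly before the key itself, newest first.
    for (std::size_t pos = key_start; pos-- > 0;) {
        bool match = true;
        for (std::size_t i = 0; i < n; ++i) {
            if (history[pos + i] != history[key_start + i]) {
                match = false;
                break;
            }
        }
        if (!match) {
            continue;
        }
        // Copy the continuation that followed this earlier occurrence.
        for (std::size_t j = pos + n; j < history.size() && draft.size() < max_draft; ++j) {
            draft.push_back(history[j]);
        }
        break;
    }
    return draft;
}
```

For the history `{1, 2, 3, 4, 1, 2}` with `n = 2`, the key `{1, 2}` matches at position 0, so the draft is `{3, 4, 1}` when `max_draft = 3`. The target model then verifies these tokens in a single batch, as in ordinary speculative decoding.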
Georgi Gerganov
b931f81b5a
server : adjust spec tests to generate up to 16 tokens (#19093) 2026-01-28 09:11:40 +02:00
Concedo
f6ece6fd37 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.github/workflows/check-vendor.yml
#	.github/workflows/close-issue.yml
#	.github/workflows/editorconfig.yml
#	.github/workflows/gguf-publish.yml
#	.github/workflows/labeler.yml
#	.github/workflows/pre-tokenizer-hashes.yml
#	.github/workflows/python-check-requirements.yml
#	.github/workflows/python-lint.yml
#	.github/workflows/python-type-check.yml
#	.github/workflows/server.yml
#	.github/workflows/update-ops-docs.yml
#	README.md
#	docs/build.md
#	examples/model-conversion/scripts/utils/perplexity-gen.sh
#	examples/model-conversion/scripts/utils/perplexity-run-simple.sh
#	examples/model-conversion/scripts/utils/perplexity-run.sh
#	examples/model-conversion/scripts/utils/quantize.sh
#	examples/model-conversion/scripts/utils/run-embedding-server.sh
#	ggml/src/ggml-cpu/ggml-cpu.c
#	ggml/src/ggml-hexagon/htp/flash-attn-ops.c
#	ggml/src/ggml-opencl/CMakeLists.txt
#	ggml/src/ggml-opencl/ggml-opencl.cpp
#	ggml/src/ggml-opencl/kernels/cvt.cl
#	ggml/src/ggml-opencl/kernels/mul_mv_q6_k_f32.cl
#	ggml/src/ggml-sycl/ggml-sycl.cpp
#	scripts/compare-llama-bench.py
#	tests/test-backend-ops.cpp
#	tests/test-gguf.cpp
#	tools/cli/README.md
#	tools/completion/README.md
#	tools/server/README.md
2026-01-27 23:06:13 +08:00
Georgi Gerganov
080b161995
completion : fix prompt cache for recurrent models (#19045) 2026-01-25 09:12:50 +02:00
Daniel Bevenius
16639ba217
common : use two decimal places for float arg help messages (#19048)
* common : use two decimal places for float arg help messages

This commit updates the help messages for various command-line arguments
in arg.cpp to display floating-point default values with two decimal
places instead of one.

The motivation for this change is that, with only one decimal place,
values generated using --help or llama-gen-docs will not display the
correct values.

For example, currently the value of top-p in tools/server/README.md is
`0.9`, but the default value is actually `0.95`. And running
llama-gen-docs does not update this value as it uses the output from the
help message, which shows only one decimal place, so the values look
like they are unchanged.

* docs : run llama-gen-docs to update docs
2026-01-25 07:31:42 +01:00
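The top-p example can be reproduced with `printf`-style formatting. Because 0.95 is not exactly representable in binary floating point (the stored value is slightly below 0.95), one decimal place rounds it down to `0.9`, while two decimal places print `0.95`. The sketch uses a hypothetical `format_default()` helper; the actual formatting code in `arg.cpp` may differ.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Hypothetical helper: format a floating-point default for help text with a
// given number of decimal places.
std::string format_default(double value, int decimals) {
    char buf[32];
    std::snprintf(buf, sizeof(buf), "%.*f", decimals, value);
    return buf;
}
```

`format_default(0.95, 1)` gives `0.9`, which matches the stale README value. `format_default(0.95, 2)` gives `0.95`.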
Johannes Gäßler
e9fd8dcab4
llama-fit-params: keep explicit --ctx-size 0 (#19070) 2026-01-24 22:13:08 +01:00
Concedo
e8e7c357c9 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.github/workflows/build-cache.yml
#	.github/workflows/build-cmake-pkg.yml
#	.github/workflows/build-linux-cross.yml
#	.github/workflows/build.yml
#	.github/workflows/check-vendor.yml
#	.github/workflows/close-issue.yml
#	.github/workflows/copilot-setup-steps.yml
#	.github/workflows/docker.yml
#	.github/workflows/editorconfig.yml
#	.github/workflows/gguf-publish.yml
#	.github/workflows/labeler.yml
#	.github/workflows/pre-tokenizer-hashes.yml
#	.github/workflows/python-check-requirements.yml
#	.github/workflows/python-lint.yml
#	.github/workflows/python-type-check.yml
#	.github/workflows/release.yml
#	.github/workflows/server-webui.yml
#	.github/workflows/server.yml
#	.github/workflows/update-ops-docs.yml
#	.github/workflows/winget.yml
#	ggml/src/ggml-opencl/CMakeLists.txt
#	ggml/src/ggml-opencl/ggml-opencl.cpp
#	ggml/src/ggml-zdnn/ggml-zdnn.cpp
#	requirements/requirements-tool_bench.txt
#	src/CMakeLists.txt
#	src/llama-quant.cpp
#	tests/test-backend-ops.cpp
#	tests/test-chat.cpp
#	tools/cli/cli.cpp
#	tools/server/README.md
2026-01-23 14:27:04 +08:00
Aldehir Rojas
a3e812811d
cli : load parser definition (#19031)
* cli : load parser definition

* cont : only unload if a parser is defined
2026-01-22 20:31:22 -06:00
Xuan-Son Nguyen
51fa458a92
server : support preserving reasoning_content in assistant message (#18994)
* support reasoning_content input

* report template caps to webui

* add docs

* rm commented code
2026-01-22 21:30:06 +01:00
Xuan-Son Nguyen
4e595b250a
server: do not log certain endpoints (avoid log spam) (#19028) 2026-01-22 19:24:37 +01:00
Xuan-Son Nguyen
9eb5bfec1a
mtmd : update docs to use llama_model_n_embd_inp (#18999) 2026-01-22 14:36:32 +01:00
손희준
c6926d1d95
server: Reorder methods in server-task.cpp (#19016)
* Move `task_result_state::update_chat_msg` to match with header

* Move `server_task_result_cmpl_partial::to_json_anthropic()` to match with header

---------

Co-authored-by: openingnow <>
2026-01-22 14:36:04 +01:00
Hendrik Erz
3802d3c78f
fix: Use tabular-nums for chat message statistics (#18915)
* fix: Use `tabular-nums` for chat message statistics

* fix: Rebuild WebUI
2026-01-21 18:46:01 +01:00
손희준
fbbf3ad190
server: /v1/responses (partial) (#18486)
* from previous PR

* Make instruction(system) as first message

* Convert [input_message] (text/image/file)

* Rename convert_responses_to_chatcmpl(body) -> response_body

* Initial tool call support

* Erase instructions field from chatcmpl body

* Feed reasoning texts to chat template

* Use std::vector instead of opaque json array

* Make output_item.added events consistent

* Move `server_task_result_cmpl_partial::update` from header to source

* Match ID of output_item.added and .done events

* Add function_call only if there is no "fc_" prefix

* Add function call output at non-streaming API

* Test if ID is persistent

* Add doc

* Fix style - use trailing comma

* Rewrite state management

* catch up with upstream/master

* Fix style - "type" is the first item of SSE data

* Explicitly check "instructions" from response_body

* Make lambdas static

* Check if reasoning content exists

* Add `oai_resp_id` to task_result_state (also initialized in ctor), server_task_result_cmpl_partial, and server_task_result_cmpl_final

* Reject `input_file` since it is not supported by chatcmpl

* Add "fc_" prefix to non-streaming function call id as coderabbit pointed out

---------

Co-authored-by: openingnow <>
2026-01-21 17:47:23 +01:00
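One item in this commit adds an `"fc_"` prefix to non-streaming function-call IDs. A small sketch of prefixing an ID without doubling an existing prefix follows. `ensure_fc_prefix()` is a hypothetical helper, not code taken from `server-task.cpp`.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: make sure a function-call item ID starts with "fc_",
// leaving IDs that already carry the prefix unchanged.
std::string ensure_fc_prefix(const std::string & id) {
    const std::string prefix = "fc_";
    if (id.compare(0, prefix.size(), prefix) == 0) {
        return id;
    }
    return prefix + id;
}
```

Checking the prefix first makes the operation idempotent. Applying it twice to the same ID gives the same result, so the IDs reported for `output_item.added` and `.done` events stay consistent.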
Concedo
4984c9bc16 Merge commit '12a4a47e6a' into concedo_experimental
# Conflicts:
#	ci/run.sh
#	examples/model-conversion/scripts/causal/run-converted-model-embeddings-logits.sh
#	examples/model-conversion/scripts/causal/run-converted-model.sh
#	examples/model-conversion/scripts/embedding/run-converted-model.sh
#	ggml/src/ggml-cann/ggml-cann.cpp
#	ggml/src/ggml-hexagon/ggml-hexagon.cpp
#	ggml/src/ggml-opencl/ggml-opencl.cpp
#	ggml/src/ggml-sycl/ggml-sycl.cpp
#	ggml/src/ggml-webgpu/ggml-webgpu.cpp
#	ggml/src/ggml-zdnn/ggml-zdnn.cpp
#	ggml/src/ggml-zendnn/ggml-zendnn.cpp
#	tests/CMakeLists.txt
#	tests/test-chat-parser.cpp
#	tests/test-chat-peg-parser.cpp
#	tests/test-chat.cpp
#	tools/cli/cli.cpp
2026-01-21 21:00:44 +08:00