Concedo
749a606374
whisper broke
2026-02-26 16:45:04 +08:00
Concedo
44182ebefe
Merge commit '8c2c0108dd' into concedo_experimental
...
# Conflicts:
# examples/model-conversion/Makefile
# examples/model-conversion/scripts/utils/inspect-org-model.py
# ggml/src/ggml-hexagon/ggml-hexagon.cpp
# ggml/src/ggml-hexagon/htp/act-ops.c
# ggml/src/ggml-hexagon/htp/get-rows-ops.c
# ggml/src/ggml-hexagon/htp/hex-dma.h
# ggml/src/ggml-hexagon/htp/htp-ops.h
# ggml/src/ggml-hexagon/htp/matmul-ops.c
# ggml/src/ggml-hexagon/htp/rope-ops.c
# ggml/src/ggml-hexagon/htp/set-rows-ops.c
# ggml/src/ggml-hexagon/htp/softmax-ops.c
# ggml/src/ggml-hexagon/htp/unary-ops.c
# scripts/snapdragon/adb/run-cli.sh
# scripts/snapdragon/adb/run-completion.sh
# scripts/snapdragon/adb/run-mtmd.sh
# scripts/snapdragon/windows/run-cli.ps1
# scripts/sync_vendor.py
# tests/test-backend-sampler.cpp
2026-02-26 16:30:37 +08:00
Concedo
7e53bfd28d
Merge commit '2b6dfe824d' into concedo_experimental
...
# Conflicts:
# .github/workflows/release.yml
# examples/save-load-state/save-load-state.cpp
# src/llama-context.cpp
# tools/cli/cli.cpp
2026-02-26 15:07:23 +08:00
Georgi Gerganov
f20469d919
server : enable multi-modal prompt caching ( #19877 )
2026-02-25 15:15:42 +02:00
Georgi Gerganov
d7d826b3c1
server : support multi-modal context checkpoints ( #19849 )
...
* Modify llama-memory-hybrid-iswa.cpp
* Modify llama-memory-recurrent.cpp
* Modify server-common.cpp
* Modify server-common.h
* Modify server-context.cpp
* Modify server-task.h
* Added comment to llama-memory-hybrid-iswa.cpp
* Remove comment from server-context.cpp
* Stylistic fix server-context.cpp
* Fix an issue when seqrm isn't called in server-context.cpp
* cont : alternative impl
* cont : cleanup
* cont : n_tokens -> int64_t
---------
Co-authored-by: timkhronos <timkhronos@gmail.com>
2026-02-25 15:14:27 +02:00
Pascal
47eb12b953
server: fix query params lost when proxying requests in multi-model router mode ( #19854 )
...
* server: fix query params lost when proxying requests in multi-model router mode
* server: re-encode query params using httplib::encode_query_component in proxy
2026-02-24 21:46:06 +01:00
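The query-parameter fix in 47eb12b953 can be illustrated with a minimal Python sketch. It is not the server's code: the commit uses `httplib::encode_query_component` in C++, whereas this sketch swaps in `urllib.parse`. The function name `build_proxy_target`, the upstream address, and the example paths are hypothetical.

```python
from urllib.parse import urlsplit, parse_qsl, quote


def build_proxy_target(upstream_base: str, incoming_target: str) -> str:
    """Build the upstream URL for a proxied request, keeping the query string.

    Hypothetical helper: each query component is decoded and then re-encoded,
    so reserved characters survive the hop to the upstream server.
    """
    parts = urlsplit(incoming_target)
    # parse_qsl decodes percent-escapes; keep_blank_values keeps "?flag=".
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    query = "&".join(
        f"{quote(key, safe='')}={quote(value, safe='')}" for key, value in pairs
    )
    target = upstream_base.rstrip("/") + parts.path
    return f"{target}?{query}" if query else target
```

Dropping `parts.query` in a helper like this would reproduce the kind of bug the commit title describes: the path is forwarded but the parameters are lost.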
Radoslav Gerganov
c830f99cfa
server : support max_completion_tokens request property ( #19831 )
...
"max_tokens" is deprecated in favor of "max_completion_tokens", which
sets the upper bound for reasoning+output tokens.
Closes : #13700
2026-02-24 10:30:00 +02:00
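As a rough illustration of c830f99cfa, the sketch below resolves a token limit from a request body that may carry either property. The helper name `resolve_token_limit` is hypothetical. The precedence (the newer property wins when both are present) is an assumption and is not stated in the commit message.

```python
from typing import Optional


def resolve_token_limit(body: dict) -> Optional[int]:
    """Return the requested upper bound on generated tokens, if any.

    Assumption: "max_completion_tokens" takes precedence over the
    deprecated "max_tokens" when a request supplies both.
    """
    for key in ("max_completion_tokens", "max_tokens"):
        value = body.get(key)
        if value is not None:
            return int(value)
    return None  # no limit requested
```

A client that still sends only `max_tokens` keeps working under this sketch, which is the usual reason to accept a deprecated name alongside its replacement.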
Aleksander Grygier
5eb0ea32f0
feat: Add code blocks full height setting to parameter sync service ( #19835 )
2026-02-23 22:30:13 +01:00
Aleksander Grygier
9051663d5d
webui: Add setting to have full height Code Blocks in Chat Messages ( #19829 )
2026-02-23 14:16:50 +01:00
Sigbjørn Skjæret
e8e261699a
cli : provide model with text filename ( #19783 )
2026-02-22 22:33:49 +01:00
Kilian Krampf
cacc371f99
Fix wrong cli-argument in documentation ( #19804 )
2026-02-22 16:26:33 +01:00
Aldehir Rojas
34ec1c3f18
server : merge contiguous Responses input items into a single assistant message ( #19773 )
...
* server : merge contiguous input items into a single assistant message
* cont : simplify tool call msg
* cont : reduce and combine content
* cont : fix merging content items
2026-02-22 14:11:31 +01:00
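The merging described in 34ec1c3f18 might look roughly like the sketch below. The item shape (a `role` string plus a `content` list) and the function name are hypothetical simplifications, not the server's actual data structures, and tool calls are ignored here.

```python
from typing import Dict, List


def merge_contiguous_assistant(items: List[Dict]) -> List[Dict]:
    """Fold runs of adjacent assistant items into one assistant message.

    Hypothetical item shape: {"role": str, "content": list}.
    Items with other roles are copied through unchanged.
    """
    merged: List[Dict] = []
    for item in items:
        prev = merged[-1] if merged else None
        if prev is not None and prev["role"] == "assistant" and item["role"] == "assistant":
            # Extend the previous assistant message instead of starting a new one.
            prev["content"].extend(item["content"])
        else:
            # Copy the content list so the caller's input is not mutated.
            merged.append({"role": item["role"], "content": list(item["content"])})
    return merged
```

Only contiguous assistant items are combined, so an assistant item that follows a user item starts a new message.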
Concedo
d06700687f
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .devops/rocm.Dockerfile
# .github/workflows/release.yml
# CMakeLists.txt
# ggml/src/ggml-cuda/common.cuh
# scripts/sync_vendor.py
# tests/test-chat.cpp
2026-02-22 09:33:13 +08:00
crsawyer
07968d53e4
fix: UI single model selection in router mode ( #19767 )
2026-02-21 09:28:39 +01:00
Concedo
e626de2430
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# docs/ops.md
# docs/ops/WebGPU.csv
# embd_res/templates/stepfun-ai-Step-3.5-Flash.jinja
# ggml/src/ggml-webgpu/ggml-webgpu.cpp
# ggml/src/ggml-webgpu/wgsl-shaders/unary.wgsl
# src/CMakeLists.txt
# tests/test-backend-ops.cpp
# tests/test-chat.cpp
# tools/mtmd/CMakeLists.txt
2026-02-20 15:16:26 +08:00
Concedo
07c45ced56
Merge commit 'c78e682245' into concedo_experimental
...
# Conflicts:
# src/models/qwen35.cpp
# src/models/qwen35moe.cpp
2026-02-20 14:41:32 +08:00
Concedo
9eb9e4eb83
Merge commit '8a70973557' into concedo_experimental
...
# Conflicts:
# docs/backend/CANN.md
# docs/backend/SYCL.md
# examples/model-conversion/scripts/utils/tensor-info.py
# ggml/src/ggml-opencl/ggml-opencl.cpp
# ggml/src/ggml-opencl/kernels/expm1.cl
# ggml/src/ggml-opencl/kernels/mean.cl
# ggml/src/ggml-opencl/kernels/softplus.cl
# ggml/src/ggml-opencl/kernels/sum_rows.cl
# ggml/src/ggml-webgpu/ggml-webgpu-shader-lib.hpp
# ggml/src/ggml-webgpu/ggml-webgpu.cpp
# ggml/src/ggml-webgpu/wgsl-shaders/common_decls.tmpl
# ggml/src/ggml-webgpu/wgsl-shaders/embed_wgsl.py
# ggml/src/ggml-webgpu/wgsl-shaders/get_rows.wgsl
# ggml/src/ggml-webgpu/wgsl-shaders/mul_mat.wgsl
# ggml/src/ggml-webgpu/wgsl-shaders/mul_mat_decls.tmpl
# ggml/src/ggml-webgpu/wgsl-shaders/mul_mat_reg_tile.wgsl
# ggml/src/ggml-webgpu/wgsl-shaders/mul_mat_subgroup_matrix.wgsl
# ggml/src/ggml-webgpu/wgsl-shaders/mul_mat_vec.wgsl
# ggml/src/ggml-webgpu/wgsl-shaders/scale.wgsl
# tools/server/webui/src/lib/components/app/chat/ChatScreen/ChatScreen.svelte
2026-02-20 14:36:49 +08:00
crsawyer
10b26ee23a
WebUI hide models in router mode ( #19374 )
2026-02-19 22:53:42 +01:00
Tarek Dakhran
c5897995a7
mtmd : chat : Fix extra \n between text and media marker ( #19595 )
...
* mtmd : chat : Fix extra \n between text and media marker
Thanks to @tugot17 for detecting and reporting the issue.
For vision models (e.g. LFM2.5-VL-1.6B and Qwen/Qwen3-VL-4B-Instruct), `llama-mtmd-cli` produces output identical to the HF implementation.
However, `llama-server` doesn't. I traced it down to an extra newline
inserted after `<__media__>`.
This happens in `to_json_oaicompat`, which treats media markers as text
and joins all parts with a `\n` separator.
This PR introduces a new type, `media_marker`, and uses it for media markers.
Extra logic is added to prevent insertion of newlines before and after
media markers.
With this change, the number of input tokens is identical to the HF
implementation, and as a result the output is also identical.
I explored other ways to address the issue:
* completely remove `\n` between text parts in `to_json_oaicompat`
* merge text messages in server-common.cpp before sending them to `to_json_oaicompat`
Please propose alternative ways of fixing this issue.
* Refactor to use explicit per-type ifs
* Update common/chat.cpp
Co-authored-by: Piotr Wilkin (ilintar) <piotr.wilkin@syndatis.com>
* Update common_chat_templates_apply_legacy
---------
Co-authored-by: Piotr Wilkin (ilintar) <piotr.wilkin@syndatis.com>
2026-02-19 12:18:57 +01:00
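The joining rule described in c5897995a7 can be sketched in a few lines of Python. This is an illustration of the behavior described in the commit message, not the C++ code in `common/chat.cpp`. The `(kind, text)` tuple representation and the function name `join_parts` are hypothetical.

```python
from typing import List, Tuple


def join_parts(parts: List[Tuple[str, str]]) -> str:
    """Join content parts, inserting "\\n" only between two text parts.

    Each part is a hypothetical (kind, text) pair where kind is either
    "text" or "media_marker". No newline is added before or after a
    media marker.
    """
    out: List[str] = []
    prev_kind = None
    for kind, text in parts:
        if prev_kind == "text" and kind == "text":
            out.append("\n")
        out.append(text)
        prev_kind = kind
    return "".join(out)
```

Treating every part as text and joining with `"\n".join(...)` would insert the newlines next to the marker that the commit message reports as the source of the token mismatch.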
Aleksander Grygier
03fd9d3bb4
webui: Fix Attachments not being included in completion request ( #19731 )
...
* fix: Add missing argument
* chore: update webui build output
2026-02-19 10:27:38 +01:00
matteo
b55dcdef5d
server: save generated text for the /slots endpoint (for LLAMA_SERVER_SLOTS_DEBUG=1) ( #19622 )
...
* save generated text for the /slots endpoint
* update debug_generated_text only when LLAMA_SERVER_SLOTS_DEBUG > 0
* Apply suggestions from code review
---------
Co-authored-by: Matteo <matteo@matteo>
Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
2026-02-18 18:53:37 +01:00
Aleksander Grygier
ea003229d3
Pre-MCP UI and architecture cleanup ( #19689 )
2026-02-18 12:02:02 +01:00
Aleksander Grygier
afa6bfe4f7
Pre-MCP UI and architecture cleanup ( #19685 )
...
* webui: extract non-MCP changes from mcp-mvp review split
* webui: extract additional pre-MCP UI and architecture cleanup
* chore: update webui build output
2026-02-17 13:47:45 +01:00
Adrien Gallouët
ae46a61e41
build : link ws2_32 as PUBLIC on Windows ( #19666 )
...
Signed-off-by: Adrien Gallouët <adrien@gallouet.fr>
2026-02-17 08:37:07 +01:00
Concedo
72f7e01b27
Merge commit '01d8eaa28d' into concedo_experimental
...
# Conflicts:
# build-xcframework.sh
# scripts/sync_vendor.py
# tests/test-backend-ops.cpp
# tools/mtmd/CMakeLists.txt
# tools/rpc/rpc-server.cpp
2026-02-16 15:36:59 +08:00
Adrien Gallouët
9e118b97c4
build : remove LLAMA_HTTPLIB option ( #19623 )
...
This option was introduced as a workaround because cpp-httplib could not
build on visionOS. Since it has been fixed and now compiles on all platforms,
we can remove it and simplify many things.
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2026-02-15 15:38:50 +01:00
Aleksander Grygier
baa12f3831
webui: Architecture and UI improvements ( #19596 )
2026-02-14 09:06:41 +01:00
Concedo
45dc155530
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .github/ISSUE_TEMPLATE/010-bug-compilation.yml
# .github/ISSUE_TEMPLATE/011-bug-results.yml
# AGENTS.md
# SECURITY.md
# ggml/src/ggml-hexagon/htp/flash-attn-ops.c
# ggml/src/ggml-hexagon/htp/main.c
# ggml/src/ggml-opencl/CMakeLists.txt
# ggml/src/ggml-opencl/ggml-opencl.cpp
# ggml/src/ggml-opencl/kernels/cvt.cl
# scripts/sync_vendor.py
# src/unicode.cpp
# tests/test-backend-ops.cpp
# tools/cli/cli.cpp
2026-02-14 12:44:16 +08:00
Aleksander Grygier
5174d7206f
webui: UI and routing fixes ( #19586 )
...
* chore: update webui build output
* chore: update webui build output
* fix: Scroll issues in DropdownMenuSearchable
* webui: fix redirect to root ignoring base path
* fix: Word wrapping
* fix: remove obsolete modality UI tests causing CI failures
- Remove VisionModality/AudioModality test stories
- Remove mockServerProps usage and imports
- Simplify Default test (remove dropdown interaction checks)
- Simplify FileAttachments test (remove mocks)
* feat: Improve formatting performance time
---------
Co-authored-by: Pascal <admin@serveurperso.com>
2026-02-13 12:31:00 +01:00
Concedo
bff3fd3e34
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# common/common.cpp
# docs/backend/snapdragon/README.md
# ggml/src/ggml-hexagon/htp/htp-ops.h
# ggml/src/ggml-hexagon/htp/matmul-ops.c
# ggml/src/ggml-opencl/CMakeLists.txt
# ggml/src/ggml-opencl/ggml-opencl.cpp
# scripts/pr2wt.sh
# tests/test-backend-ops.cpp
# tools/server/README.md
2026-02-13 14:00:45 +08:00
Aleksander Grygier
4c61875bf8
webui: Add switcher to Chat Message UI to show raw LLM output ( #19571 )
2026-02-12 19:55:51 +01:00
Aleksander Grygier
4d688f9ebb
(webui) FEATURE: Enable adding or injecting System Message into chat ( #19556 )
...
* feat: Enable adding System Prompt per-chat
* fix: Save draft message in Chat Form when adding System Prompt from new chat view
* fix: Proper system message deletion logic
* chore: Formatting
* chore: update webui build output
2026-02-12 13:56:08 +01:00
Aleksander Grygier
f486ce9f30
(webui) REFACTOR: UI primitives and polish ( #19551 )
...
* webui: UI primitives and polish (non-MCP)
* chore: update webui build output
2026-02-12 12:21:00 +01:00
Aleksander Grygier
38adc7d469
WebUI Architecture Cleanup ( #19541 )
...
* webui: architecture foundation (non-MCP core refactors)
* chore: update webui build output
2026-02-12 11:22:27 +01:00
Concedo
261d78eaaa
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# CMakeLists.txt
# README.md
# docs/speculative.md
# ggml/src/ggml-cann/aclnn_ops.cpp
# ggml/src/ggml-cann/ggml-cann.cpp
# tests/CMakeLists.txt
# tests/test-backend-ops.cpp
# tools/mtmd/clip.cpp
2026-02-12 18:05:20 +08:00
RichardScottOZ
fa16e517a3
server : fix typo in README.md for features list ( #19510 )
...
extra l for full
2026-02-12 08:56:25 +01:00
손희준
820ebfa6f4
Server: log when converting requests to chat completions format ( #19457 )
...
* Log converting requests
* Print as debug instead of info [no ci]
---------
Co-authored-by: openingnow <>
2026-02-09 16:22:57 +01:00
Sascha Rogmann
292f6908cd
spec : remove check rate ( #19377 )
...
* spec: remove parameter spec-ngram-check-rate
* spec : renamed statistics vars
* spec : add n_call_begin, n_call_accept
* spec : don't enable key-map-stats
2026-02-09 15:30:50 +02:00
Concedo
757b293ac9
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .github/workflows/build.yml
# .github/workflows/server-webui.yml
# .github/workflows/server.yml
# tools/rpc/rpc-server.cpp
2026-02-09 00:33:11 +08:00
Georgi Gerganov
eb449cdfa4
server : improve context checkpoint logic ( #19408 )
2026-02-08 09:40:04 +02:00
Concedo
a0a78dacc4
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .github/workflows/build.yml
# docs/ops.md
# docs/ops/SYCL.csv
# ggml/src/ggml-sycl/element_wise.cpp
# ggml/src/ggml-sycl/ggml-sycl.cpp
# ggml/src/ggml-webgpu/ggml-webgpu-shader-lib.hpp
# ggml/src/ggml-webgpu/ggml-webgpu.cpp
# pyproject.toml
# requirements/requirements-convert_legacy_llama.txt
# src/CMakeLists.txt
# src/llama-vocab.cpp
# tests/test-backend-ops.cpp
2026-02-07 15:54:02 +08:00
Georgi Gerganov
dfde5993ea
common : add common_speculative_is_compat() ( #19270 )
...
* llama : add llama_memory_can_rm_suffix()
* Revert "llama : add llama_memory_can_rm_suffix()"
This reverts commit d30e59b62a15ef4266a6503e3f4eba770aec001b.
* spec : check if the target context is compatible for spec decoding
2026-02-06 16:47:22 +02:00
Concedo
7b393fa487
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .github/workflows/build.yml
# AUTHORS
# ci/run.sh
# docs/backend/SYCL.md
# docs/build.md
# docs/multimodal/minicpmo2.6.md
# docs/multimodal/minicpmo4.0.md
# docs/multimodal/minicpmv2.5.md
# docs/multimodal/minicpmv2.6.md
# docs/multimodal/minicpmv4.0.md
# docs/multimodal/minicpmv4.5.md
# docs/ops.md
# docs/ops/SYCL.csv
# docs/speculative.md
# examples/deprecation-warning/README.md
# examples/deprecation-warning/deprecation-warning.cpp
# examples/model-conversion/Makefile
# examples/model-conversion/scripts/causal/convert-model.sh
# ggml/include/ggml-cann.h
# ggml/src/ggml-cann/acl_tensor.cpp
# ggml/src/ggml-cann/acl_tensor.h
# ggml/src/ggml-cann/aclnn_ops.cpp
# ggml/src/ggml-cann/aclnn_ops.h
# ggml/src/ggml-cann/common.h
# ggml/src/ggml-cann/ggml-cann.cpp
# ggml/src/ggml-metal/CMakeLists.txt
# ggml/src/ggml-opencl/ggml-opencl.cpp
# ggml/src/ggml-opencl/kernels/concat.cl
# ggml/src/ggml-opencl/kernels/repeat.cl
# ggml/src/ggml-opencl/kernels/scale.cl
# ggml/src/ggml-opencl/kernels/tanh.cl
# ggml/src/ggml-sycl/CMakeLists.txt
# ggml/src/ggml-sycl/dpct/helper.hpp
# ggml/src/ggml-sycl/ggml-sycl.cpp
# ggml/src/ggml-sycl/outprod.cpp
# ggml/src/ggml-sycl/rope.cpp
# ggml/src/ggml-sycl/wkv.cpp
# src/llama-vocab.cpp
# tests/test-autorelease.cpp
# tests/test-backend-ops.cpp
# tools/cvector-generator/pca.hpp
# tools/export-lora/export-lora.cpp
# tools/perplexity/README.md
2026-02-03 19:00:42 +08:00
Matthieu Coudron
a3fa035822
server: print actual model name in "model not found" error ( #19117 )
...
Experimenting with AI, my environment gets messy fast and it's not
always easy to know what model my software is trying to load. This helps
with troubleshooting.
Before:
Error: {
  code = 400,
  message = "model not found",
  type = "invalid_request_error"
}
After:
Error: {
  code = 400,
  message = "model 'toto' not found",
  type = "invalid_request_error"
}
2026-02-02 16:55:27 +01:00
Christian Kastner
7a4ca3cbd9
docs : Minor cleanups ( #19252 )
...
* Update old URLs to github.com/ggml-org/
* Bump copyrights
2026-02-02 08:38:55 +02:00
Concedo
ddce19db72
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .devops/nix/package-gguf-py.nix
# .devops/nix/scope.nix
# common/CMakeLists.txt
# docs/backend/SYCL.md
# examples/lookahead/lookahead.cpp
# examples/lookup/lookup.cpp
# examples/sycl/run-llama2.sh
# examples/sycl/win-run-llama2.bat
# examples/sycl/win-test.bat
# ggml/src/ggml-hexagon/CMakeLists.txt
# ggml/src/ggml-hexagon/htp/flash-attn-ops.c
# ggml/src/ggml-hexagon/htp/hvx-dump.h
# ggml/src/ggml-hexagon/htp/hvx-reduce.h
# ggml/src/ggml-hexagon/htp/matmul-ops.c
# ggml/src/ggml-hexagon/htp/softmax-ops.c
# ggml/src/ggml-hexagon/htp/unary-ops.c
# ggml/src/ggml-opencl/CMakeLists.txt
# ggml/src/ggml-opencl/ggml-opencl.cpp
# ggml/src/ggml-opencl/kernels/cvt.cl
# scripts/sync-ggml.last
2026-02-01 22:35:25 +08:00
Georgi Gerganov
bbada8bfb9
server : wrap around the "id_slot" parameter ( #19207 )
...
* server : wrap around the "id_slot" parameter
* cont : minor
2026-01-30 19:46:10 +02:00
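One plausible reading of "wrap around" in bbada8bfb9 is a modulo over the number of slots, sketched below. The function name `select_slot`, the use of modulo, and the treatment of negative values as "no specific slot requested" are all assumptions. The commit message does not spell out the exact rule.

```python
def select_slot(id_slot: int, n_slots: int) -> int:
    """Map a requested slot id onto the range of available slots.

    Assumptions: a negative id means "no specific slot" and is passed
    through unchanged; any other id wraps around with modulo.
    """
    if n_slots <= 0:
        raise ValueError("n_slots must be positive")
    if id_slot < 0:
        return id_slot
    return id_slot % n_slots
```

Under this reading, a request for slot 5 on a server with 4 slots is routed to slot 1 rather than being rejected.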
Georgi Gerganov
dabaa2e77a
spec : add ngram-mod ( #19164 )
...
* spec : add ngram-mod
* cont : simplify + keep track of occupancy
* cont : cleanup
* cont : move initialization to common/speculative
* cont : cleanup
* cont : cleanup
* cont : fix
2026-01-30 18:21:48 +02:00
Concedo
8d173f50c2
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .github/workflows/build.yml
# docs/backend/SYCL.md
# docs/backend/snapdragon/CMakeUserPresets.json
# docs/backend/snapdragon/README.md
# docs/backend/snapdragon/developer.md
# docs/ops.md
# docs/ops/SYCL.csv
# embd_res/templates/upstage-Solar-Open-100B.jinja
# ggml/src/CMakeLists.txt
# ggml/src/ggml-hexagon/CMakeLists.txt
# ggml/src/ggml-hexagon/ggml-hexagon.cpp
# ggml/src/ggml-sycl/element_wise.cpp
# ggml/src/ggml-sycl/element_wise.hpp
# ggml/src/ggml-sycl/ggml-sycl.cpp
# ggml/src/ggml-webgpu/wgsl-shaders/flash_attn.wgsl
# tests/test-chat.cpp
2026-01-30 15:32:59 +08:00
Concedo
7e755014b2
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .github/workflows/winget.yml
# CODEOWNERS
# common/CMakeLists.txt
# common/arg.cpp
# docs/ops/SYCL.csv
# examples/lookup/lookup-create.cpp
# examples/lookup/lookup-stats.cpp
# examples/lookup/lookup.cpp
# examples/speculative-simple/speculative-simple.cpp
# examples/speculative/speculative.cpp
# ggml/src/ggml-hip/CMakeLists.txt
# ggml/src/ggml-sycl/dpct/helper.hpp
# ggml/src/ggml-sycl/ggml-sycl.cpp
# ggml/src/ggml-sycl/norm.cpp
# ggml/src/ggml-zendnn/ggml-zendnn.cpp
# tests/test-chat-template.cpp
2026-01-29 23:05:05 +08:00