Concedo
7c70187e26
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .github/ISSUE_TEMPLATE/010-bug-compilation.yml
# .github/ISSUE_TEMPLATE/011-bug-results.yml
# .github/ISSUE_TEMPLATE/019-bug-misc.yml
# .github/ISSUE_TEMPLATE/020-enhancement.yml
# .github/ISSUE_TEMPLATE/030-research.yml
# .github/ISSUE_TEMPLATE/040-refactor.yml
# ggml/CMakeLists.txt
# ggml/src/ggml-cann/ggml-cann.cpp
# ggml/src/ggml-hexagon/CMakeLists.txt
# ggml/src/ggml-hexagon/ggml-hexagon.cpp
# ggml/src/ggml-hexagon/htp/CMakeLists.txt
# ggml/src/ggml-hexagon/htp/cmake-toolchain.cmake
# ggml/src/ggml-hexagon/htp/flash-attn-ops.c
# ggml/src/ggml-hexagon/htp/hex-utils.h
# ggml/src/ggml-hexagon/htp/hmx-matmul-ops.c
# ggml/src/ggml-hexagon/htp/hmx-ops.h
# ggml/src/ggml-hexagon/htp/hmx-utils.h
# ggml/src/ggml-hexagon/htp/hvx-base.h
# ggml/src/ggml-hexagon/htp/hvx-copy.h
# ggml/src/ggml-hexagon/htp/hvx-exp.h
# ggml/src/ggml-hexagon/htp/unary-ops.c
# ggml/src/ggml-opencl/CMakeLists.txt
# ggml/src/ggml-opencl/ggml-opencl.cpp
# ggml/src/ggml-opencl/kernels/cvt.cl
# ggml/src/ggml-rpc/ggml-rpc.cpp
# ggml/src/ggml-sycl/ggml-sycl.cpp
# ggml/src/ggml-virtgpu/ggml-backend.cpp
# ggml/src/ggml-webgpu/ggml-webgpu-shader-lib.hpp
# ggml/src/ggml-webgpu/ggml-webgpu.cpp
# ggml/src/ggml-webgpu/wgsl-shaders/mul_mat_vec.wgsl
# ggml/src/ggml-zdnn/ggml-zdnn.cpp
# ggml/src/ggml-zendnn/ggml-zendnn.cpp
# scripts/sync-ggml.last
# tests/test-backend-ops.cpp
2026-05-02 18:07:50 +08:00
Aleksander Grygier
ab6120cde5
webui: Spring Cleaning Refactor v1 ( #22505 )
...
* wip: server_tools
* feat: Integrate with `/tools` endpoint
* feat: Builtin + MCP + JSON Schema Tools WIP
* refactor
* displayName -> display_name
* snake_case everywhere
* rm redundant field
* feat: Improvements
* chore: update webui build output
* refactor: Updates after server updates
* chore: update webui build output
* change arg to --tools all
* feat: UI improvements
* chore: update webui build output
* add readme mention
* llama-gen-docs
* chore: update webui build output
* chore: update webui build output
* chore: update webui build output
* feat: Reorganize settings sections
* feat: Separate dialogs for MCP Servers Settings and Import/Export
* feat: WIP
* feat: WIP
* feat: WIP
* feat: WIP
* feat: WIP
* feat: WIP
* WIP on allozaur/20677-webui-server-tools
* feat: UI improvements
* chore: Update package lock
* chore: Run `npm audit fix`
* feat: UI WIP
* feat: UI
* refactor: Desktop Icon Strip DRY
* feat: Cleaner rendering and transition for ChatScreen
* feat: UI improvements
* feat: UI improvement
* feat: Remove MCP Server "enable" switch from Tools submenu
* chore: Run `npm audit fix`
* feat: WIP
* feat: Logic improvements
* refactor: Cleanup
* refactor: DRY
* test: Fix Chat Sidebar UI Tests
* chore: Update package lock
* refactor: Cleanup
* feat: Chat Message Action Card with Continue and Permission flow implementations
* feat: Add agentic steering messages, draft messages and improve chat UX
* fix: Search results UI
* test: Fix unit test
* feat: UI/UX improvements
* refactor: Simplify `useToolsPanel` access in components
* feat: Implement Processing Info Context API
* feat: Implement 'Go back to chat' functionality for settings
* feat: Enhance MCP Server management in Chat Form Attachments
* style: Minor UI and branding adjustments
* chore: Update webui static build output
* chore: Formatting, linting & type checks
* feat: Draft messages logic
* feat: UI improvements
* feat: Steering Messages improvements
* refactor: Cleanup
* refactor: Cleanup
* feat: Improve UI
* refactor: Settings navigation hook
* refactor: DRY code
* refactor: DRY ChatMessageUser UI components
* refactor: Desktop Icon Strip DRY
* refactor: Tools & permissions
* fix: Navigation condition
* refactor: Cleanup
* refactor: Cleanup
* refactor: Cleanup
* fix: preserve reasoning_content in agentic flow
* refactor: Storybook cleanup
* refactor: isInViewport util function
* refactor: Rename globally `onClick` to `onclick`
* chore: `npm audit fix`
* refactor: Action Icon usage
* refactor: Naming
* refactor: JS in `class` directive
* refactor: Chat components cleanup WIP
* refactor: Components structure
* refactor: Cleanup WIP
* feat: New ChatAttachmentsPreview component
* feat: UI improvements
* feat: UI improvements
* refactor: Cleanup
* refactor: ChatAttachmentsPreview UI/UX
* refactor: Remove dead code
* refactor: Cleanup
* fix: Model Name aliases displaying
* feat: Shortcut improvements
* refactor: Chat Message
* feat: Move Import/Export to settings
* refactor: Cleanup
* refactor: Cleanup
* refactor: Cleanup
* refactor: Cleanup
---------
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2026-05-01 18:36:29 +02:00
Concedo
61478cbf4a
Merge commit 'c20c44514a' into concedo_experimental
...
# Conflicts:
# .github/workflows/python-type-check.yml
# examples/speculative/speculative.cpp
# ggml/src/ggml-hexagon/ggml-hexagon.cpp
# ggml/src/ggml-hexagon/htp/htp-ctx.h
# ggml/src/ggml-hexagon/htp/htp-ops.h
# ggml/src/ggml-hexagon/htp/htp_iface.idl
# ggml/src/ggml-hexagon/htp/main.c
# ggml/src/ggml-webgpu/ggml-webgpu-shader-lib.hpp
# ggml/src/ggml-webgpu/ggml-webgpu.cpp
# ggml/src/ggml-webgpu/wgsl-shaders/mul_mat_decls.tmpl
# scripts/jinja/jinja-tester.py
# scripts/snapdragon/adb/run-cli.sh
# scripts/snapdragon/adb/run-completion.sh
# scripts/sync_vendor.py
# tests/test-backend-ops.cpp
2026-05-01 00:07:46 +08:00
Concedo
37073bc13d
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# ggml/CMakeLists.txt
# ggml/src/ggml-cpu/CMakeLists.txt
# ggml/src/ggml-cuda/mmq.cuh
# ggml/src/ggml-webgpu/ggml-webgpu-shader-lib.hpp
# ggml/src/ggml-webgpu/ggml-webgpu.cpp
# scripts/sync-ggml.last
# tests/test-backend-ops.cpp
# tests/test-log.cpp
2026-04-30 17:37:52 +08:00
Georgi Gerganov
80afa33aad
spec : fix draft model checkpoints ( #22521 )
...
* spec : fix draft model checkpoints
* cont : clean-up
* cont : gate the ngram-mod reset warning behind verbose flag
2026-04-30 08:32:18 +03:00
Concedo
45f8ff49bb
Merge commit '52e5f0a5c1' into concedo_experimental
...
# Conflicts:
# examples/gen-docs/gen-docs.cpp
# examples/lookup/lookup-create.cpp
# examples/lookup/lookup-stats.cpp
# examples/lookup/lookup.cpp
# examples/speculative-simple/speculative-simple.cpp
# examples/speculative/speculative.cpp
# ggml/src/CMakeLists.txt
# ggml/src/ggml-cann/aclnn_ops.cpp
# ggml/src/ggml-cann/aclnn_ops.h
# ggml/src/ggml-cann/ggml-cann.cpp
# ggml/src/ggml-rpc/ggml-rpc.cpp
# ggml/src/ggml-vulkan/ggml-vulkan.cpp
# ggml/src/ggml-webgpu/ggml-webgpu-shader-lib.hpp
# ggml/src/ggml-webgpu/ggml-webgpu.cpp
# ggml/src/ggml-webgpu/wgsl-shaders/binary.wgsl
# ggml/src/ggml-webgpu/wgsl-shaders/get_rows.wgsl
# ggml/src/ggml-webgpu/wgsl-shaders/mul_mat_decls.tmpl
# ggml/src/ggml-webgpu/wgsl-shaders/mul_mat_vec.wgsl
# ggml/src/ggml-webgpu/wgsl-shaders/rms_norm_mul.wgsl
# ggml/src/ggml-webgpu/wgsl-shaders/ssm_scan.wgsl
# tests/test-arg-parser.cpp
# tests/test-backend-ops.cpp
# tests/test-chat.cpp
# tests/test-reasoning-budget.cpp
# tools/llama-bench/llama-bench.cpp
# tools/rpc/rpc-server.cpp
# tools/server/webui/src/lib/components/app/chat/ChatScreen/ChatScreen.svelte
# tools/server/webui/src/lib/components/app/chat/ChatSidebar/ChatSidebar.svelte
# tools/server/webui/src/routes/(chat)/+page.svelte
2026-04-29 22:27:36 +08:00
Georgi Gerganov
683c5acb90
spec : discard last drafted token with low prob ( #22506 )
2026-04-29 17:00:00 +03:00
Pascal
59237bfbbc
webui: fix slow mic stop and WAV encode ( #22480 )
...
* webui: instant mic stop, race-free recorder restart
* webui: faster WAV PCM encode via hoisted channels and Int16Array
* chore: update webui build output
* webui: drop setTimeout(0) hack and harden cancelRecording
* chore: update webui build output
2026-04-29 12:58:35 +02:00
Aleksander Grygier
f42e29fdf1
webui: Server tools ( #21237 )
...
* wip: server_tools
* feat: Integrate with `/tools` endpoint
* feat: Builtin + MCP + JSON Schema Tools WIP
* refactor
* displayName -> display_name
* snake_case everywhere
* rm redundant field
* feat: Improvements
* chore: update webui build output
* refactor: Updates after server updates
* chore: update webui build output
* change arg to --tools all
* feat: UI improvements
* chore: update webui build output
* add readme mention
* llama-gen-docs
* chore: update webui build output
* chore: update webui build output
* chore: update webui build output
* feat: Reorganize settings sections
* feat: Separate dialogs for MCP Servers Settings and Import/Export
* feat: WIP
* feat: WIP
* feat: WIP
* feat: WIP
* feat: WIP
* feat: WIP
* WIP on allozaur/20677-webui-server-tools
* feat: UI improvements
* chore: Update package lock
* chore: Run `npm audit fix`
* feat: UI WIP
* feat: UI
* refactor: Desktop Icon Strip DRY
* feat: Cleaner rendering and transition for ChatScreen
* feat: UI improvements
* feat: UI improvement
* feat: Remove MCP Server "enable" switch from Tools submenu
* chore: Run `npm audit fix`
* feat: WIP
* feat: Logic improvements
* refactor: Cleanup
* refactor: DRY
* test: Fix Chat Sidebar UI Tests
* chore: Update package lock
* refactor: Cleanup
* feat: Chat Message Action Card with Continue and Permission flow implementations
* feat: Add agentic steering messages, draft messages and improve chat UX
* fix: Search results UI
* test: Fix unit test
* feat: UI/UX improvements
* refactor: Simplify `useToolsPanel` access in components
* feat: Implement Processing Info Context API
* feat: Implement 'Go back to chat' functionality for settings
* feat: Enhance MCP Server management in Chat Form Attachments
* style: Minor UI and branding adjustments
* chore: Update webui static build output
* chore: Formatting, linting & type checks
* feat: Draft messages logic
* feat: UI improvements
* feat: Steering Messages improvements
* refactor: Cleanup
* refactor: Cleanup
* feat: Improve UI
* refactor: Settings navigation hook
* refactor: DRY code
* refactor: DRY ChatMessageUser UI components
* refactor: Desktop Icon Strip DRY
* refactor: Tools & permissions
* fix: Navigation condition
* refactor: Cleanup
* refactor: Cleanup
* refactor: Cleanup
* fix: preserve reasoning_content in agentic flow
---------
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2026-04-28 14:35:49 +03:00
Georgi Gerganov
14e733e36f
spec : refactor params ( #22397 )
...
* spec : refactor params
* cont : fix
* cont : rename "sparam" to "sampling"
* cont : add spec params category
* cont : add info about removed arguments
* cont : skip param length check for spec params
* cont : adapt server tests
2026-04-28 09:07:33 +03:00
Aman Gupta
516e8d7a8a
server: use pos_next instead of n_tokens for m-rope ( #22439 )
2026-04-28 08:41:00 +03:00
tha80
983ca8992e
server: (router) Forward form-data to model server ( Fixes #22044 ) ( #22118 )
...
* This commit enables the router to forward form-data to the model server.
Fixes #22044 (enabling use of /v1/audio/transcriptions in router mode)
* Applied the suggestion from Copilot's first comment: using the non-throwing json::parse overload.
* Addressed Copilot's third comment by extending the files representation to also include filename and content-type
* Addressed Copilot's fourth comment by making the RNG thread_local
* Changed variable body from std::string to std::ostringstream in build_multipart_body
as suggested by ngxson in https://github.com/ggml-org/llama.cpp/pull/22118#discussion_r3127099053
* Added sanitize_field lambda in build_multipart_body for key, filename and content_type
as suggested by ngxson in https://github.com/ggml-org/llama.cpp/pull/22118#discussion_r3127104647
* explicitly checking if value/item is string before calling value/item.get<std::string>()
as requested by ngxson in https://github.com/ggml-org/llama.cpp/pull/22118#discussion_r3127111279
* Added double quote to the sanitize lambda and throw on json parse failure
---------
Co-authored-by: Ralph Paßgang <ralph@trust-it.de>
2026-04-27 23:55:00 +02:00
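The approach discussed in that PR thread can be sketched roughly as follows. This is a minimal illustration, not the router's actual code: `sanitize_field` and `build_multipart_part` are illustrative names, and the real implementation handles multiple parts and the closing boundary.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Sketch: build one part of a multipart/form-data body in a
// std::ostringstream, stripping characters (quotes, CR/LF) that could
// break out of the part headers. Names here are illustrative.
static std::string sanitize_field(const std::string & in) {
    std::string out;
    for (char c : in) {
        if (c != '"' && c != '\r' && c != '\n') {
            out += c;
        }
    }
    return out;
}

static std::string build_multipart_part(const std::string & boundary,
                                        const std::string & key,
                                        const std::string & filename,
                                        const std::string & content_type,
                                        const std::string & data) {
    std::ostringstream body;
    body << "--" << boundary << "\r\n"
         << "Content-Disposition: form-data; name=\"" << sanitize_field(key)
         << "\"; filename=\"" << sanitize_field(filename) << "\"\r\n"
         << "Content-Type: " << sanitize_field(content_type) << "\r\n\r\n"
         << data << "\r\n";
    return body.str();
}
```

Using an `ostringstream` instead of repeated `std::string` concatenation was the reviewer's suggestion; the sanitize step guards all header-embedded fields, not just the key.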
Concedo
340b22283e
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .devops/intel.Dockerfile
# .github/workflows/build-android.yml
# .github/workflows/build.yml
# .github/workflows/release.yml
# .gitignore
# docs/backend/SYCL.md
# docs/backend/snapdragon/README.md
# examples/model-conversion/scripts/causal/convert-model.sh
# ggml/CMakeLists.txt
# ggml/src/CMakeLists.txt
# ggml/src/ggml-hexagon/ggml-hexagon.cpp
# ggml/src/ggml-hexagon/htp/CMakeLists.txt
# ggml/src/ggml-hexagon/htp/hex-utils.h
# ggml/src/ggml-hexagon/htp/hmx-matmul-ops.c
# ggml/src/ggml-hexagon/htp/htp-ctx.h
# ggml/src/ggml-hexagon/htp/htp-ops.h
# ggml/src/ggml-hexagon/htp/htp_iface.idl
# ggml/src/ggml-hexagon/htp/hvx-base.h
# ggml/src/ggml-hexagon/htp/main.c
# ggml/src/ggml-hexagon/htp/matmul-ops.c
# ggml/src/ggml-hexagon/libggml-htp.inf
# ggml/src/ggml-sycl/ggml-sycl.cpp
# ggml/src/ggml-sycl/mmvq.cpp
# ggml/src/ggml-sycl/mmvq.hpp
# ggml/src/ggml-webgpu/ggml-webgpu-shader-lib.hpp
# ggml/src/ggml-webgpu/ggml-webgpu.cpp
# ggml/src/ggml-webgpu/wgsl-shaders/flash_attn.wgsl
# ggml/src/ggml-webgpu/wgsl-shaders/flash_attn_vec_blk.wgsl
# ggml/src/ggml-webgpu/wgsl-shaders/flash_attn_vec_split.wgsl
# scripts/server-test-structured.py
# scripts/snapdragon/adb/run-bench.sh
# scripts/snapdragon/adb/run-cli.sh
# scripts/snapdragon/adb/run-completion.sh
# scripts/snapdragon/adb/run-mtmd.sh
# scripts/snapdragon/adb/run-tool.sh
# scripts/snapdragon/qdc/requirements.txt
# scripts/snapdragon/windows/run-bench.ps1
# scripts/snapdragon/windows/run-cli.ps1
# scripts/snapdragon/windows/run-completion.ps1
# scripts/snapdragon/windows/run-mtmd.ps1
# scripts/snapdragon/windows/run-tool.ps1
# tests/test-backend-ops.cpp
# tools/cli/cli.cpp
# ty.toml
2026-04-25 12:13:14 +08:00
Piotr Wilkin (ilintar)
0adede866d
parser: fix structured output bug ( #22302 )
...
* fix very stupid structured output bug
* Things just cannot be too easy.
2026-04-24 23:19:55 +02:00
Georgi Gerganov
ffdd983fb8
server : fix swa-full logic ( #22288 )
2026-04-24 10:17:37 +03:00
Yes You Can Have Your Own
793d0a7931
server: rename debug tags to match --cache-idle-slots naming ( #22292 )
2026-04-24 09:28:44 +03:00
srkizer
185cbff6f1
server : convert_anthropic_to_oai: also copy chat_template_kwargs ( #22154 )
2026-04-23 13:32:46 -05:00
Song Li
c78fb909b2
server: fix heap-buffer-overflow from negative n_discard (CVE-2026-21869) ( #22267 )
...
* server: clamp n_discard to non-negative at JSON parse boundary (CVE-2026-21869)
A negative n_discard from client JSON causes heap-buffer-overflow in
update_slots() context-shift loop (CWE-787, CVSS 8.8). Clamp to 0 at
ingress; n_discard=0 already triggers auto-discard (n_left/2).
Ref: GHSA-8947-pfff-2f3c
* cont : cleaner
* cont : cleanerer
* cont : cleanest
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-04-23 18:39:07 +02:00
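The fix described above amounts to an ingress clamp at the JSON parse boundary. A minimal sketch, with an illustrative function name rather than the server's actual code:

```cpp
#include <cassert>
#include <cstdint>

// Sketch of the CVE-2026-21869 fix: clamp n_discard where client JSON is
// parsed, so a negative value can never reach the update_slots()
// context-shift loop (CWE-787 out-of-bounds write). Name is illustrative.
static int32_t clamp_n_discard(int32_t requested) {
    // n_discard == 0 already means "auto-discard n_left/2", so clamping
    // negatives to 0 preserves the existing default behavior.
    return requested < 0 ? 0 : requested;
}
```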
kvc0
c807c6e3b0
server: (anthropic API) fix prefix caching ( #21793 )
...
When testing claude code against llama.cpp, I noticed that only
n_past 18577 was used even when context was 60k or more. The log
in llama-server says:
```
slot update_slots: id 3 | task 10342 | old: ... ; cch= | defa0;You are
slot update_slots: id 3 | task 10342 | new: ... ; cch= | 1c8b4;
```
I observed that the cch value changed every time. Reading about that,
the x-anthropic-billing-header system message seems to be specially
handled inside of the anthropic api. I could remove it, but there
is a meaningful string sometimes included at the end. So instead,
I just replace the changing cch checksum with fffff.
I'm treating this as an anthropic message body API detail - I think this
is the right way to do this, but by all means please correct me!
It's always 5 hexadecimal characters, but I've written the replacement
defensively in case they change the protocol.
2026-04-23 17:45:02 +02:00
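The normalization described in that commit message might look roughly like this. The exact in-message format of the checksum is an assumption here; like the commit, the pattern is written defensively (one or more hex digits rather than exactly five):

```cpp
#include <cassert>
#include <regex>
#include <string>

// Sketch: the "cch=" checksum in the Anthropic billing system message
// changes on every request, defeating prefix caching, so it is replaced
// with a constant. Format of the field is assumed for illustration.
static std::string normalize_cch(const std::string & msg) {
    static const std::regex re("cch=[0-9a-f]+");
    return std::regex_replace(msg, re, "cch=fffff");
}
```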
Tarek Dakhran
550d684bd1
server: Enable transcriptions API for LFM2-Audio ( #22000 )
2026-04-23 10:47:26 +02:00
Concedo
0755f27372
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .devops/openvino.Dockerfile
# .github/workflows/build-self-hosted.yml
# .github/workflows/build.yml
# common/chat.cpp
# docs/backend/OPENVINO.md
# examples/speculative-simple/speculative-simple.cpp
# ggml/src/ggml-hexagon/ggml-hexagon.cpp
# ggml/src/ggml-hexagon/htp/CMakeLists.txt
# ggml/src/ggml-hexagon/htp/htp-ctx.h
# ggml/src/ggml-hexagon/htp/htp-ops.h
# ggml/src/ggml-hexagon/htp/main.c
# ggml/src/ggml-hexagon/libggml-htp.inf
# ggml/src/ggml-openvino/ggml-decoder.cpp
# ggml/src/ggml-openvino/ggml-openvino-extra.cpp
# ggml/src/ggml-openvino/ggml-openvino.cpp
# ggml/src/ggml-openvino/ggml-quants.cpp
# ggml/src/ggml-openvino/openvino/op/rope.cpp
# ggml/src/ggml-openvino/openvino/op_table.cpp
# ggml/src/ggml-openvino/openvino/op_table.h
# ggml/src/ggml-openvino/openvino/translate_session.cpp
# ggml/src/ggml-openvino/openvino/utils.cpp
# ggml/src/ggml-openvino/openvino/utils.h
# ggml/src/ggml-openvino/utils.cpp
# ggml/src/ggml-openvino/utils.h
# ggml/src/ggml-sycl/common.hpp
# ggml/src/ggml-sycl/convert.cpp
# ggml/src/ggml-sycl/convert.hpp
# ggml/src/ggml-sycl/gemm.hpp
# ggml/src/ggml-sycl/ggml-sycl.cpp
# ggml/src/ggml-sycl/set_rows.cpp
# ggml/src/ggml-webgpu/ggml-webgpu-shader-lib.hpp
# ggml/src/ggml-webgpu/ggml-webgpu.cpp
# scripts/sync_vendor.py
# tests/CMakeLists.txt
# tests/test-chat.cpp
# tools/cli/cli.cpp
# tools/mtmd/CMakeLists.txt
# tools/server/CMakeLists.txt
2026-04-23 00:55:05 +08:00
Piotr Wilkin (ilintar)
8bccdbbff9
chat: fix parallel_tool_calls default setting based on model capabilities, add tests for parallel tool calls and structured outputs ( #22217 )
...
* chat: fix parallel_tool_calls default setting based on model capabilities, add tests for parallel tool calls and structured outputs
* Fix ty errors.
* Fix flake8 err
2026-04-22 18:10:56 +02:00
Georgi Gerganov
bcb5eeb645
speculative-simple : add checkpoint support ( #22227 )
...
* speculative-simple : add checkpoint support
* cont : fix build
2026-04-22 15:44:45 +03:00
Xuan-Son Nguyen
17f6245168
server: ignore reasoning content from transcription api ( #21905 )
2026-04-22 12:10:50 +02:00
Ethan Turner
750579ff14
common: Refactoring sampler parameters ( #20429 ) ( #22233 )
...
This change refactors the reasoning_budget_message parameter from the
common params into the sampling parameters specifically. It also removes
the reasoning_budget common parameter and standardizes on the existing
reasoning_budget_tokens parameter in the sampling configuration.
Issue: https://github.com/ggml-org/llama.cpp/issues/20429
Original PR: https://github.com/ggml-org/llama.cpp/pull/20297
2026-04-22 10:40:19 +02:00
Piotr Wilkin (ilintar)
134d6e54d4
common/chat, server: refactor, move all conversion functions to common, add tests ( #20690 )
...
* Refactor conversion functions
2026-04-22 10:28:45 +02:00
Xuan-Son Nguyen
04fe84b69d
server: allow cancel loading model ( #21814 )
2026-04-22 00:26:09 +02:00
Concedo
19a12bb080
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# CODEOWNERS
# common/CMakeLists.txt
# ggml/CMakeLists.txt
# ggml/src/ggml-webgpu/ggml-webgpu-shader-lib.hpp
# ggml/src/ggml-webgpu/ggml-webgpu.cpp
# ggml/src/ggml-webgpu/wgsl-shaders/common_decls.tmpl
# ggml/src/ggml-webgpu/wgsl-shaders/mul_mat_vec.wgsl
# scripts/sync-ggml.last
# tools/cli/cli.cpp
# tools/llama-bench/llama-bench.cpp
# tools/perplexity/perplexity.cpp
2026-04-21 18:53:03 +08:00
Georgi Gerganov
cfe9838d26
fit-params : refactor + add option to output estimated memory per device ( #22171 )
...
* fit-params : add option to output estimated memory per device
* cont : minor
* cont : refactor
* cont : move fit params implementation to libcommon
* cont : header
* cont : headers
* cont : codeowners
2026-04-21 09:54:36 +03:00
xris99
ff6b1062af
server : fix hardcoded proxy connection timeout in router mode ( #18760 ) ( #22003 )
...
Fixes: https://github.com/ggml-org/llama.cpp/issues/18760
Co-authored-by: Christian <christian@example.com>
2026-04-21 06:41:14 +02:00
Georgi Gerganov
cf8b0dbda9
server : remove /api endpoints ( #22165 )
...
* server : remove /api endpoints
* cont : remove /api/tags
2026-04-20 20:41:19 +03:00
Concedo
cd6788007e
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .github/workflows/build-cross.yml
# .github/workflows/build-self-hosted.yml
# .github/workflows/release.yml
# examples/llama.android/lib/src/main/cpp/CMakeLists.txt
# ggml/CMakeLists.txt
# ggml/src/ggml-rpc/CMakeLists.txt
# ggml/src/ggml-rpc/ggml-rpc.cpp
# ggml/src/ggml-sycl/mmvq.cpp
# ggml/src/ggml-webgpu/ggml-webgpu-shader-lib.hpp
# ggml/src/ggml-webgpu/ggml-webgpu.cpp
# scripts/sync_vendor.py
# tests/test-chat.cpp
# tests/test-mtmd-c-api.c
# tools/server/README.md
2026-04-20 20:19:11 +08:00
Georgi Gerganov
de71b5f81c
server : refactor "use checkpoint" logic ( #22114 )
2026-04-20 08:42:37 +03:00
Yes You Can Have Your Own
9d49acb2a7
server: rename --clear-idle to --cache-idle-slots ( #21741 )
2026-04-20 08:30:24 +03:00
Sascha Rogmann
455d8e4be8
server : speculative checkpointing ( #19493 )
...
* server : speculative decoding using checkpoints
* server : fix draft check with checkpoints
* server : rename spec vars
* server : log levels
* server : refactored spec logic to speculative.cpp
* server : renamed spec checkpoints option
* server : fix spec checkpoints, logging
* speculative : checkpoints with draft model, logging
* server : n_tokens_cur and create_checkpoint in draft
* server : fix server_speculative_callback (slot.id)
* spec : fix ngram-map/begin idx_last_check
* spec : init ckpt (begin() wasn't called)
* chore: update webui build output
* server : restore sampler in spec checkpoint and clear mem
* cont : avoid --spec-use-checkpoints argument
* cont : remove server_prompt_checkpoint_with_size
* spec : rename (leave_draft_state)
* cont : clean-up
* cont : do not ignore partial drafts even if they are short
* cont : spec callback owned by session
* cont : simplify
* cont : avoid empty speculative session
* cont : simplify
* cont : simplify
* cont : enable mtmd speculative decoding
* cont : keep the spec sampler alive
* cont : simplify
* cont : fix nullptr deref + draft checkpoints
* cont : remove common_speculative_accept_response
* cont : remove callback
* cont : simplify
* cont : minor
* cont : simplify
* cont : fix accepted number
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-04-19 10:24:06 +03:00
Cetarthoriphros
9e5647affa
server: Expose media_tag on /props endpoint. ( #22028 )
2026-04-19 00:27:17 +02:00
Concedo
79882d669a
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .github/workflows/build-android.yml
# .github/workflows/build.yml
# .github/workflows/release.yml
# CMakeLists.txt
# CODEOWNERS
# common/CMakeLists.txt
# common/common.h
# docs/ops.md
# docs/ops/Metal.csv
# examples/batched/CMakeLists.txt
# examples/convert-llama2c-to-ggml/CMakeLists.txt
# examples/debug/CMakeLists.txt
# examples/diffusion/CMakeLists.txt
# examples/embedding/CMakeLists.txt
# examples/eval-callback/CMakeLists.txt
# examples/gen-docs/CMakeLists.txt
# examples/idle/CMakeLists.txt
# examples/lookahead/CMakeLists.txt
# examples/lookup/CMakeLists.txt
# examples/parallel/CMakeLists.txt
# examples/passkey/CMakeLists.txt
# examples/retrieval/CMakeLists.txt
# examples/save-load-state/CMakeLists.txt
# examples/speculative-simple/CMakeLists.txt
# examples/speculative/CMakeLists.txt
# examples/sycl/CMakeLists.txt
# examples/training/CMakeLists.txt
# ggml/src/ggml-hexagon/htp/hmx-matmul-ops.c
# ggml/src/ggml-hexagon/htp/htp-ops.h
# ggml/src/ggml-hexagon/htp/main.c
# ggml/src/ggml-opencl/CMakeLists.txt
# ggml/src/ggml-opencl/ggml-opencl.cpp
# ggml/src/ggml-opencl/kernels/cvt.cl
# pocs/vdot/CMakeLists.txt
# src/CMakeLists.txt
# tests/CMakeLists.txt
# tests/test-quantize-stats.cpp
# tools/batched-bench/CMakeLists.txt
# tools/cli/CMakeLists.txt
# tools/cli/cli.cpp
# tools/completion/CMakeLists.txt
# tools/cvector-generator/CMakeLists.txt
# tools/cvector-generator/cvector-generator.cpp
# tools/export-lora/CMakeLists.txt
# tools/gguf-split/CMakeLists.txt
# tools/gguf-split/gguf-split.cpp
# tools/imatrix/CMakeLists.txt
# tools/llama-bench/CMakeLists.txt
# tools/llama-bench/llama-bench.cpp
# tools/mtmd/CMakeLists.txt
# tools/perplexity/CMakeLists.txt
# tools/quantize/CMakeLists.txt
# tools/quantize/quantize.cpp
# tools/results/CMakeLists.txt
# tools/server/CMakeLists.txt
# tools/tokenize/CMakeLists.txt
# tools/tts/CMakeLists.txt
2026-04-17 22:37:37 +08:00
Concedo
768527b031
Merge commit '1e796eb41f' into concedo_experimental
...
# Conflicts:
# .devops/nix/package.nix
# .github/workflows/build-riscv.yml
# .github/workflows/build-vulkan.yml
# .github/workflows/build.yml
# docs/backend/SYCL.md
# docs/build.md
# docs/development/HOWTO-add-model.md
# embd_res/templates/Reka-Edge.jinja
# ggml/CMakeLists.txt
# ggml/src/ggml-rpc/CMakeLists.txt
# ggml/src/ggml-rpc/ggml-rpc.cpp
# ggml/src/ggml-sycl/CMakeLists.txt
# ggml/src/ggml-sycl/convert.cpp
# ggml/src/ggml-sycl/dequantize.hpp
# ggml/src/ggml-sycl/dmmv.cpp
# ggml/src/ggml-sycl/ggml-sycl.cpp
# ggml/src/ggml-webgpu/ggml-webgpu.cpp
# ggml/src/ggml-webgpu/wgsl-shaders/common_decls.tmpl
# ggml/src/ggml-webgpu/wgsl-shaders/get_rows.wgsl
# ggml/src/ggml-webgpu/wgsl-shaders/mul_mat.wgsl
# ggml/src/ggml-webgpu/wgsl-shaders/mul_mat_decls.tmpl
# ggml/src/ggml-webgpu/wgsl-shaders/mul_mat_id.wgsl
# ggml/src/ggml-webgpu/wgsl-shaders/mul_mat_reg_tile.wgsl
# ggml/src/ggml-webgpu/wgsl-shaders/mul_mat_subgroup_matrix.wgsl
# ggml/src/ggml-webgpu/wgsl-shaders/mul_mat_vec.wgsl
# ggml/src/ggml-webgpu/wgsl-shaders/unary.wgsl
# tests/test-chat.cpp
# tools/rpc/README.md
2026-04-17 21:47:29 +08:00
Georgi Gerganov
6990e2f1f7
libs : rename libcommon -> libllama-common ( #21936 )
...
* cmake : allow libcommon to be shared
* cmake : rename libcommon to libllama-common
* cont : set -fPIC for httplib
* cont : export all symbols
* cont : fix build_info exports
* libs : add libllama-common-base
* log : add common_log_get_verbosity_thold()
2026-04-17 11:11:46 +03:00
Pascal
4adac43f6f
server: tests: fetch random media marker via /apply-template ( #21962 ) ( #21980 )
...
* server: tests: fetch random media marker via /apply-template (#21962 fix)
* server: allow pinning media marker via LLAMA_MEDIA_MARKER env var
get_media_marker() checks LLAMA_MEDIA_MARKER at first call and uses it
as-is if set, falling back to the random marker otherwise.
Tests no longer need to fetch the marker dynamically via /apply-template:
the fixture sets LLAMA_MEDIA_MARKER=<__media__> so the hardcoded prompts
work as before.
Address review feedback from ngxson
* server: make get_media_marker() thread-safe via magic statics
Use a C++11 static local with a lambda initializer instead of a global
static with an empty-check. The runtime guarantees initialization exactly
once without explicit locking.
Address review feedback from ggerganov
* nits
* nits
2026-04-16 20:46:21 +03:00
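The "magic statics" pattern described above relies on C++11's guarantee that a function-local static is initialized exactly once, thread-safely, by the runtime. A minimal sketch (the env var name comes from the PR; the fallback value here is a placeholder for the random marker the server actually generates):

```cpp
#include <cassert>
#include <cstdlib>
#include <string>

// Sketch: a static local with a lambda initializer, so no explicit lock
// is needed. Checks LLAMA_MEDIA_MARKER once, at the first call, and falls
// back otherwise. The fallback here is a placeholder, not the real logic.
static const std::string & get_media_marker() {
    static const std::string marker = [] {
        const char * env = std::getenv("LLAMA_MEDIA_MARKER");
        return env != nullptr ? std::string(env) : std::string("<__media__>");
    }();
    return marker;
}
```

Note the value is captured on the first call only; changing the environment variable afterwards has no effect, which is exactly the "as-is if set" semantics the commit describes.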
Xuan-Son Nguyen
408225bb1a
server: use random media marker ( #21962 )
...
* server: use random media marker
* nits
* remove legacy <__image__> token
* revert special char in random
2026-04-15 23:52:22 +02:00
Concedo
236ae27329
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .github/workflows/close-issue.yml
# docs/multimodal.md
# embd_res/templates/deepseek-ai-DeepSeek-V3.2.jinja
# ggml/CMakeLists.txt
# ggml/src/ggml-webgpu/ggml-webgpu.cpp
# ggml/src/ggml-webgpu/wgsl-shaders/mul_mat_decls.tmpl
# ggml/src/ggml-webgpu/wgsl-shaders/mul_mat_reg_tile.wgsl
# ggml/src/ggml-webgpu/wgsl-shaders/mul_mat_subgroup_matrix.wgsl
# tests/peg-parser/test-gbnf-generation.cpp
# tests/test-chat.cpp
2026-04-14 21:01:41 +08:00
Concedo
9c0b9b0bb1
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# docs/development/HOWTO-add-model.md
# docs/multimodal.md
# ggml/src/ggml-sycl/convert.cpp
# ggml/src/ggml-sycl/dequantize.hpp
# ggml/src/ggml-sycl/element_wise.cpp
# ggml/src/ggml-sycl/gated_delta_net.cpp
# ggml/src/ggml-sycl/ggml-sycl.cpp
# ggml/src/ggml-sycl/upscale.cpp
# ggml/src/ggml-webgpu/ggml-webgpu.cpp
# tests/test-backend-ops.cpp
# tests/test-llama-archs.cpp
# tools/mtmd/CMakeLists.txt
2026-04-14 20:06:04 +08:00
Xuan-Son Nguyen
e489a5ca0e
server: support OAI /v1/audio/transcriptions API ( #21863 )
...
* server: support OAI /v1/audio/transcriptions API
* address autoreview comments
* correct default response_format value
2026-04-14 11:09:52 +02:00
Gaspard Petit
ce8fd4b1a6
server: Expose build_info in router mode ( #21835 )
2026-04-13 11:14:42 +02:00
Rohan Jain
974c8c94cc
webui: add setting for first-line chat titles ( #21797 )
...
* webui: add setting for first-line chat titles
Add an opt-in setting (`titleGenerationUseFirstLine`) to use the first
non-empty line of a prompt as the generated conversation title.
Previously, the complete multi-line prompt was being used, which created
long titles for complex queries. Coupled with
"Ask for confirmation before changing conversation title", the dialog
would overflow.
* Update tools/server/webui/src/lib/utils/text.ts
Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
* Update tools/server/webui/src/lib/utils/text.ts
Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
* webui: Run build to update the bundle
As requested in:
https://github.com/ggml-org/llama.cpp/pull/21797#pullrequestreview-4094935065
* webui: Fix missing import for NEWLINE_SEPARATOR
---------
Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
2026-04-13 09:30:46 +02:00
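The opt-in behavior above (the real implementation lives in the webui's TypeScript text utils) boils down to taking the first non-empty line of a multi-line prompt. A hedged C++ sketch of that logic:

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Sketch: return the first line of the prompt that contains something
// other than whitespace; fall back to the full prompt if none is found.
// Name and fallback are illustrative, not the webui's actual code.
static std::string first_nonempty_line(const std::string & prompt) {
    std::istringstream in(prompt);
    std::string line;
    while (std::getline(in, line)) {
        if (line.find_first_not_of(" \t\r") != std::string::npos) {
            return line;
        }
    }
    return prompt;
}
```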
Aleksander Grygier
227ed28e12
webui: MCP Diagnostics improvements ( #21803 )
...
* Add MCP Connection diagnostics and CORS hint to web-ui
* tidy up test
* webui: Refactor and improve MCP diagnostic logging
---------
Co-authored-by: evalstate <1936278+evalstate@users.noreply.github.com>
2026-04-13 07:58:38 +02:00
Aleksander Grygier
9e209c5aee
fix: Proper messages rendering for "Show raw output" ( #21672 )
2026-04-12 13:08:11 +02:00
Concedo
4c860ae4ae
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# common/download.cpp
# docs/backend/OPENVINO.md
# docs/backend/snapdragon/CMakeUserPresets.json
# docs/backend/snapdragon/README.md
# ggml/src/ggml-hexagon/ggml-hexagon.cpp
# ggml/src/ggml-hexagon/htp/act-ops.c
# ggml/src/ggml-hexagon/htp/argsort-ops.c
# ggml/src/ggml-hexagon/htp/binary-ops.c
# ggml/src/ggml-hexagon/htp/cpy-ops.c
# ggml/src/ggml-hexagon/htp/cumsum-ops.c
# ggml/src/ggml-hexagon/htp/flash-attn-ops.c
# ggml/src/ggml-hexagon/htp/get-rows-ops.c
# ggml/src/ggml-hexagon/htp/hex-utils.h
# ggml/src/ggml-hexagon/htp/hmx-matmul-ops.c
# ggml/src/ggml-hexagon/htp/hmx-ops.h
# ggml/src/ggml-hexagon/htp/htp-ctx.h
# ggml/src/ggml-hexagon/htp/htp-ops.h
# ggml/src/ggml-hexagon/htp/htp_iface.idl
# ggml/src/ggml-hexagon/htp/main.c
# ggml/src/ggml-hexagon/htp/matmul-ops.c
# ggml/src/ggml-hexagon/htp/repeat-ops.c
# ggml/src/ggml-hexagon/htp/rope-ops.c
# ggml/src/ggml-hexagon/htp/set-rows-ops.c
# ggml/src/ggml-hexagon/htp/softmax-ops.c
# ggml/src/ggml-hexagon/htp/ssm-conv.c
# ggml/src/ggml-hexagon/htp/sum-rows-ops.c
# ggml/src/ggml-hexagon/htp/unary-ops.c
# ggml/src/ggml-webgpu/ggml-webgpu-shader-lib.hpp
# ggml/src/ggml-webgpu/ggml-webgpu.cpp
# ggml/src/ggml-webgpu/wgsl-shaders/common_decls.tmpl
# ggml/src/ggml-webgpu/wgsl-shaders/flash_attn.wgsl
# ggml/src/ggml-webgpu/wgsl-shaders/get_rows.wgsl
# ggml/src/ggml-webgpu/wgsl-shaders/mul_mat.wgsl
# ggml/src/ggml-webgpu/wgsl-shaders/mul_mat_decls.tmpl
# ggml/src/ggml-webgpu/wgsl-shaders/mul_mat_vec.wgsl
# ggml/src/ggml-webgpu/wgsl-shaders/unary.wgsl
# models/templates/google-gemma-4-31B-it-interleaved.jinja
# models/templates/google-gemma-4-31B-it.jinja
# scripts/snapdragon/adb/run-bench.sh
# scripts/snapdragon/adb/run-cli.sh
# scripts/snapdragon/adb/run-completion.sh
# scripts/snapdragon/adb/run-tool.sh
# scripts/snapdragon/windows/run-bench.ps1
# scripts/snapdragon/windows/run-cli.ps1
# scripts/snapdragon/windows/run-mtmd.ps1
# scripts/snapdragon/windows/run-tool.ps1
# tests/test-backend-ops.cpp
# tests/test-chat.cpp
# tools/llama-bench/llama-bench.cpp
2026-04-11 11:19:32 +08:00
Concedo
a165a73120
Merge commit 'd6f3030047' into concedo_experimental
...
# Conflicts:
# examples/model-conversion/scripts/causal/run-casual-gen-embeddings-org.py
# examples/model-conversion/scripts/utils/semantic_check.py
# ggml/CMakeLists.txt
# ggml/src/CMakeLists.txt
# ggml/src/ggml-cann/ggml-cann.cpp
# ggml/src/ggml-cpu/amx/amx.cpp
# ggml/src/ggml-cuda/CMakeLists.txt
# ggml/src/ggml-hexagon/ggml-hexagon.cpp
# ggml/src/ggml-hip/CMakeLists.txt
# ggml/src/ggml-opencl/ggml-opencl.cpp
# ggml/src/ggml-openvino/ggml-openvino.cpp
# ggml/src/ggml-rpc/ggml-rpc.cpp
# ggml/src/ggml-sycl/ggml-sycl.cpp
# ggml/src/ggml-virtgpu/ggml-backend-buffer.cpp
# ggml/src/ggml-virtgpu/ggml-backend.cpp
# ggml/src/ggml-webgpu/ggml-webgpu.cpp
# ggml/src/ggml-zdnn/ggml-zdnn.cpp
# ggml/src/ggml-zendnn/ggml-zendnn.cpp
# pyproject.toml
# requirements/requirements-convert_legacy_llama.txt
# requirements/requirements-tool_bench.txt
# src/llama-model.cpp
# src/llama.cpp
# tests/test-llama-archs.cpp
# tests/test-tokenizer-0.py
# tests/test-tokenizer-random.py
# tools/llama-bench/llama-bench.cpp
# tools/perplexity/perplexity.cpp
2026-04-11 11:10:55 +08:00