Commit graph

12488 commits

Author SHA1 Message Date
Concedo
5440ca4794 rolling builds 2026-03-31 13:12:11 +08:00
Concedo
0afcf4bc6d up jimver cuda toolkit version 2026-03-30 21:43:29 +08:00
Concedo
894591da7c increase ctx size slider 2026-03-30 21:41:31 +08:00
Concedo
a3a5897d93 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.devops/intel.Dockerfile
#	.github/workflows/python-type-check.yml
#	embd_res/templates/Qwen3.5-4B.jinja
#	examples/model-conversion/scripts/causal/compare-logits.py
#	examples/model-conversion/scripts/utils/check-nmse.py
#	examples/model-conversion/scripts/utils/compare_tokens.py
#	examples/model-conversion/scripts/utils/semantic_check.py
#	examples/sycl/build.sh
#	examples/sycl/run-llama2.sh
#	ggml/src/ggml-hexagon/htp/flash-attn-ops.c
#	ggml/src/ggml-hexagon/htp/hex-dma.h
#	ggml/src/ggml-hexagon/htp/rope-ops.c
#	scripts/gen-unicode-data.py
#	tests/test-chat.cpp
2026-03-30 21:41:19 +08:00
Concedo
9864d46389 add password for musicui 2026-03-30 21:03:12 +08:00
Concedo
42ad89cd86 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.devops/cann.Dockerfile
#	.devops/cpu.Dockerfile
#	.devops/llama-cli-cann.Dockerfile
#	.devops/nix/package.nix
#	.github/workflows/build-android.yml
#	.github/workflows/build-cann.yml
#	.github/workflows/build-msys.yml
#	.github/workflows/docker.yml
#	.github/workflows/editorconfig.yml
#	.github/workflows/gguf-publish.yml
#	.github/workflows/python-lint.yml
#	.github/workflows/release.yml
#	CMakeLists.txt
#	docs/backend/CANN.md
#	ggml/src/ggml-hexagon/ggml-hexagon.cpp
#	ggml/src/ggml-hexagon/htp/hmx-matmul-ops.c
#	ggml/src/ggml-hexagon/htp/htp-ctx.h
#	ggml/src/ggml-hexagon/htp/main.c
#	ggml/src/ggml-hexagon/htp/matmul-ops.c
#	ggml/src/ggml-rpc/ggml-rpc.cpp
#	scripts/sync_vendor.py
#	tests/test-chat-auto-parser.cpp
#	tests/test-chat.cpp
#	tests/test-json-schema-to-grammar.cpp
#	tests/test-reasoning-budget.cpp
#	tools/cli/cli.cpp
#	tools/server/CMakeLists.txt
#	tools/server/README.md
2026-03-30 20:45:38 +08:00
Aleksander Grygier
389c7d4955
webui: Fix branching logic on edit message (#21175)
Some checks failed
Check Pre-Tokenizer Hashes / pre-tokenizer-hashes (push) Has been cancelled
Python check requirements.txt / check-requirements (push) Has been cancelled
Python Type-Check / python type-check (push) Has been cancelled
* fix: Branching logic + small refactor

* chore: update webui build output
2026-03-30 14:40:50 +02:00
Concedo
923d5fc5d0 warning: clip_image_preprocess has been moved, now you must manually copy init_vision from mtmd into clip.cpp's setup_init_vision_shim_kcpp 2026-03-30 20:39:55 +08:00
Aman Gupta
278521c33a
llama-model-loader: print warning when using overrides with mmap (#20978)
* llama-model-loader: use pinned memory for tensor overrides

* change to warning
2026-03-30 17:40:17 +08:00
Sigbjørn Skjæret
e2eb39e81c
ci : bump ty to 0.0.26 (#21156)
* fix incorrect type ignore comments

* bump ty to 0.0.26
2026-03-30 09:29:15 +02:00
Xuan-Son Nguyen
abf9a62161
server: wrap headers for mcp proxy (#21072)
* server: wrap headers for mcp proxy

* Update tools/server/server-cors-proxy.h

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* fix build

* chore: update webui build output

* chore: update webui build output

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
2026-03-30 08:59:16 +02:00
Sigbjørn Skjæret
7c203670f8
add missing ROPE_FACTORS_LONG/SHORT for MiniCPM (#21150) 2026-03-29 19:45:40 +02:00
Gaurav Garg
ec16a072f0
Optimize MOE GEMV kernel for BS > 1. (#20905)
* Optimize MOE GEMV kernel for BS > 1.

The previous MOE kernel for BS > 1 had too many thread blocks (nrows_x, nchannels_dst, ncols_dst), with very little work per block. block of (32, 4) was doing inner dot product for a single row.

New mul_mat_vec_q_moe kernel is dedicated for MoE multi-token kernel with grid (ceil(nrows_x/rpb), nchannels_dst), block (warp_size, ncols_dst). Each warp handles two rows independently with warp-level reduction only (no shared memory sync).

This change doesn't increase any compilation time as a single template instance is needed per type. This also simplifies the original GEMV kernel and gets rid of `is_multi_token_id` specialization.

* Remove em-dashes

* Cherry-pick changes from @am17an PR https://github.com/ggml-org/llama.cpp/pull/20885 to enable small_k optimization only for cases where it benefits

Increase max batch size for MMVQ kernels for MUL_MAT_ID to 8

* Make the max batch size for MOE GEMV kernel configurable based on GPU arch and datatype

---------

Co-authored-by: Aman Gupta <amangupta052@gmail.com>
2026-03-29 18:35:18 +02:00
Concedo
4fc3c28f1a reasoning output parsing improvements 2026-03-29 23:35:24 +08:00
Max Krasnyansky
f5d1c4179f
hexagon: dma optimizations (mostly fixing regressions) (#21137)
* hex-fa: add simple dma cache for Mask

I noticed that we were refetch the mask rows over and over.
This simple cache avoids that.

* hex-dma: unset in-order desc bit which caused signficant perf regression

We don't rely on true in order processing of the DMA descriptors anywhere.
Turns out this mode caused significant regression of around 3-4 TPS during token gen.

* hex-rope: update comment to clarify that we don't need in-order DMA completions
2026-03-29 06:40:13 -07:00
Concedo
4a09f3805b prepare for breaking merge 2026-03-29 14:09:29 +08:00
Davi Henrique Linhares
2405d59cb6
devops: including compute-runtime for intel.Dockerfile (#21076) 2026-03-29 13:34:03 +08:00
Neo Zhang
afe65aa282
[SYCL] Enhance build script to use half cores to build, avoid OS hang (#21093)
* use half cores to build, avoid OS hang

* reduce the output text num to short test time

* avoid to return 0
2026-03-29 09:02:45 +08:00
Sigbjørn Skjæret
65097181e4
fix **/x glob matching (#21129) 2026-03-28 22:27:38 +01:00
Piotr Wilkin (ilintar)
98ae0a0d36
common/parser: fix handling of tool definition with missing properties key (#21128) 2026-03-28 20:41:32 +01:00
Sigbjørn Skjæret
3a14a542f5
common : add character class support to glob_match (#21111)
* add character class support to glob_match

* remove pointless reference
2026-03-28 19:57:37 +01:00
Concedo
df6b7b5fdb Merge branch 'concedo_experimental' of https://github.com/LostRuins/koboldcpp into concedo_experimental 2026-03-29 01:25:07 +08:00
Concedo
3eedde8ab5 Merge commit 'ded446b34c' into concedo_experimental
# Conflicts:
#	ggml/src/ggml-cpu/CMakeLists.txt
#	ggml/src/ggml-opencl/ggml-opencl.cpp
#	tests/test-backend-ops.cpp
2026-03-29 01:24:31 +08:00
Wagner Bruna
9223f41320
sd: call SetCircularAxesAll directly (#2078) 2026-03-29 01:17:48 +08:00
Concedo
8760d22a84 switch back to newly updated jimver github cuda toolkit 2026-03-29 01:17:11 +08:00
Concedo
aac220f7e3 Merge commit '0fac87b157' into concedo_experimental
# Conflicts:
#	.github/workflows/build-android.yml
#	.github/workflows/hip-quality-check.yml
#	docs/multimodal.md
#	scripts/hip/gcn-cdna-vgpr-check.py
#	scripts/snapdragon/windows/run-bench.ps1
#	scripts/snapdragon/windows/run-cli.ps1
#	scripts/snapdragon/windows/run-tool.ps1
#	tests/test-backend-ops.cpp
#	tests/test-llama-archs.cpp
#	tools/imatrix/imatrix.cpp
#	tools/mtmd/CMakeLists.txt
2026-03-29 01:14:33 +08:00
BlueMöhre
968189729f
WebUI: Replace illegal nested button elements (#21026)
* remove/replace nested button elements

* map rest props to outer element

* solve TODO

* chore: update webui build output
2026-03-28 17:57:59 +01:00
Concedo
674b7f5eee indicate support for claude messages api 2026-03-29 00:57:58 +08:00
Adrien
e397d3885c
common/json-schema: fix: handle non-capturing groups (?:...) in JSON schema pattern converter (#21124)
Some checks failed
Check Pre-Tokenizer Hashes / pre-tokenizer-hashes (push) Has been cancelled
Python check requirements.txt / check-requirements (push) Has been cancelled
Python Type-Check / python type-check (push) Has been cancelled
The regex-to-grammar converter in _visit_pattern() crashes with SIGSEGV
when a JSON schema "pattern" field contains a non-capturing group (?:...).

Root cause: when the parser sees '(' followed by '?', it pushes a warning
but does not advance past '?:'. The recursive transform() call then
interprets '?' as a quantifier and calls seq.back() on an empty vector,
causing undefined behavior.

This commonly occurs when serving OpenAI-compatible tool calls from
clients that include complex regex patterns in their JSON schemas (e.g.,
date validation patterns like ^(?:(?:\d\d[2468][048]|...)-02-29|...)$).

The fix:
- Skip '?:' after '(' to treat non-capturing groups as regular groups
- For unsupported syntax (?=, ?!, etc.), skip to matching ')' safely,
  handling escaped characters to avoid miscounting parenthesis depth
- Adjust the ')' unbalanced-parentheses check using direct char
  comparisons instead of substr
- Add test cases for non-capturing groups (C++ only, as the JS/Python
  implementations do not yet support this syntax)
2026-03-28 17:55:38 +01:00
Concedo
e3b7905e1c added anthropic messages api support 2026-03-29 00:55:32 +08:00
Concedo
5ad9e3ee31 crude openai responses streaming 2026-03-29 00:16:30 +08:00
Aldehir Rojas
e6f2ec01ff
common : add reasoning_format = none support to gpt-oss (#21094) 2026-03-28 09:33:39 -05:00
Georgi Gerganov
edfb440a2f
server : fix processing of multiple back-to-back mtmd chunks (#21107) 2026-03-28 16:27:36 +02:00
Adrien Gallouët
3d66da1809
ci : gracefully shut down the server (#21110)
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2026-03-28 14:49:57 +01:00
Woof Dog
82b703f8bc
Document custom default webui preferences in server README (#19771) 2026-03-28 14:19:16 +01:00
Concedo
94b266a6b0 musicui fix reset defaults 2026-03-28 21:09:40 +08:00
Aleksander Grygier
51a84efc53
webui: Conversation forking + branching improvements (#21021)
* refactor: Make `DialogConfirmation` extensible with children slot

* feat: Add conversation forking logic

* feat: Conversation forking UI

* feat: Update delete/edit dialogs and logic for forks

* refactor: Improve Chat Sidebar UX and add MCP Servers entry

* refactor: Cleanup

* feat: Update message in place when editing leaf nodes

* chore: Cleanup

* chore: Cleanup

* chore: Cleanup

* chore: Cleanup

* chore: Cleanup

* chore: Cleanup

* refactor: Post-review improvements

* chore: update webui build output

* test: Update Storybook test

* chore: update webui build output

* chore: update webui build output
2026-03-28 13:38:15 +01:00
Concedo
1e787cd03a improve responses api 2026-03-28 18:42:15 +08:00
Concedo
f768b2a4bd whatever, i tried 2026-03-28 17:32:07 +08:00
Adrien Gallouët
b0f0dd3e51
vendor : update cpp-httplib to 0.40.0 (#21100)
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2026-03-28 08:59:44 +01:00
Ruben Ortlam
0eb4764182
vulkan: add noncontiguous GLU support (#21081)
* vulkan: add noncontiguous GLU support

* fix compile issue
2026-03-28 08:44:56 +01:00
Piotr Wilkin (ilintar)
1f5d15e665
common/parser: fix reasoning whitespace bugs + extra parser tests (#21085)
* fix whitespace reasoning issues + add reconstruction tests

* Proper fix

* fix Nemotron autoparser test expectations to include newline in marker
2026-03-28 07:29:26 +01:00
Concedo
f80fdd4314 updated sdui 2026-03-28 11:24:03 +08:00
Concedo
547659fdbf allow planning music with llm (+1 squashed commits)
Squashed commits:

[9a3bbf072] allow planning music with llm
2026-03-28 11:19:39 +08:00
Sigbjørn Skjæret
c46758d28f
cli : add /glob command (#21084)
* add /glob command

* output error when max files reached

* support globbing outside curdir
2026-03-28 02:33:04 +01:00
Ts-sound
bf934f28db
docker : fix and enable ARM64 image build (#20929)
* CI: fix ARM64 image build error & enable compilation

* Update .github/workflows/docker.yml

Co-authored-by: Aaron Teo <taronaeo@gmail.com>

* CI: revert ggml/src/ggml-cpu/CMakeLists.txt

* Update .github/workflows/docker.yml

Co-authored-by: Aaron Teo <taronaeo@gmail.com>

* CI: update runs-on to ubuntu24.04, and update ARM64 build image ( ubuntu_version: "24.04")

* CI: change cpu.Dockerfile gcc to 14;

* CI : cpu.Dockerfile , update pip install .

* Update .github/workflows/docker.yml

Co-authored-by: Aaron Teo <taronaeo@gmail.com>

---------

Co-authored-by: Aaron Teo <taronaeo@gmail.com>
2026-03-28 01:45:09 +01:00
Adrien Gallouët
5c1a7b8355
server : add custom socket options to disable SO_REUSEPORT (#21056)
* server : add custom socket options to disable SO_REUSEPORT

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Add --reuse-port

    $ strace -e trace=setsockopt,bind build/bin/llama-server -lv 2 --reuse-port
    setsockopt(3, SOL_TCP, TCP_NODELAY, [1], 4) = 0
    setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
    setsockopt(3, SOL_SOCKET, SO_REUSEPORT, [1], 4) = 0
    bind(3, {sa_family=AF_INET, sin_port=htons(8080), sin_addr=inet_addr("127.0.0.1")}, 16) = 0

    $ strace -e trace=setsockopt,bind build/bin/llama-server -lv 2
    setsockopt(3, SOL_TCP, TCP_NODELAY, [1], 4) = 0
    setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
    bind(3, {sa_family=AF_INET, sin_port=htons(8080), sin_addr=inet_addr("127.0.0.1")}, 16) = 0

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Update tools/server/README.md (llama-gen-docs)

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Fix windows

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

---------

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2026-03-28 01:12:43 +01:00
Aldehir Rojas
59d840209a
common : inhibit lazy grammar sampler while reasoning is active (#20970)
* common : inhibit grammar while reasoning budget is active

* cont : update force_pos in accept

* cont : fix tests

* cont : tweak should apply logic

* cont : return early not using grammar sampler

* Add tests

* cont : prevent backend sampling when reasoning budget enabled

* cont : fix typo

---------

Co-authored-by: Piotr Wilkin <piotr.wilkin@syndatis.com>
2026-03-27 18:30:40 +01:00
Concedo
3ec6381123 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.github/workflows/build-self-hosted.yml
#	.github/workflows/build.yml
#	.github/workflows/copilot-setup-steps.yml
#	.github/workflows/gguf-publish.yml
#	ci/run.sh
#	docs/backend/OPENVINO.md
#	examples/llama.android/lib/src/main/cpp/ai_chat.cpp
#	ggml/src/ggml-sycl/add-id.cpp
#	requirements/requirements-pydantic.txt
#	tests/test-gguf.cpp
#	tests/test-jinja.cpp
#	tests/test-llama-archs.cpp
#	tools/gguf-split/README.md
#	tools/llama-bench/llama-bench.cpp
2026-03-28 01:18:20 +08:00
Concedo
2cdf02102e preserve previous filename 2026-03-28 01:13:03 +08:00