Commit graph

12334 commits

Author SHA1 Message Date
Concedo
c6213e9be6 Revert "Revert "llama : disable graph reuse with pipeline parallelism (#20463)""
This reverts commit 8043f35b22.
2026-03-25 22:25:20 +08:00
Concedo
b81103d6ba clean up colab a bit 2026-03-25 22:14:38 +08:00
Concedo
24ab1c1451 upgrade musicui to do tts, show musicui for tts models (+1 squashed commits)
Squashed commits:

[975630b15] upgrade musicui to do tts
2026-03-25 00:24:44 +08:00
Concedo
efdc52fe8b q3tts custom voice support 2026-03-24 23:38:18 +08:00
Concedo
8437c346a7 fixed tts instruction regex, encapsulate thinking by default 2026-03-24 13:53:46 +08:00
Concedo
9e9028b1a9 fixed cpu mis-selection 2026-03-23 21:30:57 +08:00
Concedo
e7ffe718f0 updated lite 2026-03-23 19:01:02 +08:00
Concedo
0d50cafd8b added CustomVoice support 2026-03-23 18:50:08 +08:00
Wagner Bruna
abe55fa424
sd: fix metadata for generated images (#2061)
* sd: fix metadata for generated images

* sd: refactor output image conversion
2026-03-23 17:04:32 +08:00
Alistair Stewart
5ff6cefce0
Fix music generation token stopping (#2057)
* Fix music generation token stopping for quantized models

In Phase 1 lyrics mode, the FSM transitions to CODES state after
TOKEN_THINK_END and disables itself. The quantized Q4_K_M model was
not reliably generating TOKEN_IM_END to stop the generation,
causing it to continue until hitting the 8192 token limit.

This fix forces TOKEN_IM_END to be generated immediately after
TOKEN_THINK_END in lyrics mode, ensuring clean completion of the
planning phase without excessive token generation.

Testing shows generation now completes in ~500ms instead of 80+
seconds with timeout errors.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

* Clarify comment - fix applies to all models, not just quantized

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

* Improve fix: only force TOKEN_IM_END at token limit

Instead of forcing TOKEN_IM_END immediately after TOKEN_THINK_END,
only force it when we've reached the token limit. This allows the model
to generate lyrics after the thinking block while still preventing KV
cache exhaustion.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

---------

Co-authored-by: Claude <noreply@anthropic.com>
2026-03-23 17:02:14 +08:00
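A minimal sketch of the token-limit fallback described in the commit above; the function and parameter names are hypothetical, not the actual KoboldCpp sampler code:

    #include <cstdint>

    // Once the lyrics-mode planning phase reaches its token budget, return the
    // end-of-message token instead of the sampler's choice, so generation
    // terminates cleanly rather than running until the hard 8192-token limit.
    static int32_t pick_next_token(bool lyrics_mode, int n_generated, int token_limit,
                                   int32_t sampled, int32_t token_im_end) {
        if (lyrics_mode && n_generated >= token_limit) {
            return token_im_end; // force clean completion at the limit
        }
        return sampled;          // otherwise keep the sampler's choice
    }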
Concedo
993925ba96 gracefully handle bad grammar instead of crashing 2026-03-23 17:00:53 +08:00
Concedo
ef854f002e Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.github/workflows/python-type-check.yml
#	AGENTS.md
#	CONTRIBUTING.md
#	examples/model-conversion/scripts/embedding/run-original-model.py
#	examples/model-conversion/scripts/utils/compare_tokens.py
#	examples/pydantic_models_to_grammar.py
#	ggml/src/ggml-rpc/ggml-rpc.cpp
#	pyrightconfig.json
#	scripts/compare-llama-bench.py
#	scripts/jinja/jinja-tester.py
#	scripts/server-bench.py
#	tests/test-grammar-integration.cpp
#	tests/test-grammar-parser.cpp
#	tests/test-llama-grammar.cpp
#	tests/test-tokenizer-random.py
#	tools/cli/README.md
#	tools/completion/README.md
#	tools/llama-bench/llama-bench.cpp
#	tools/server/README.md
2026-03-22 23:39:13 +08:00
Wagner Bruna
592dedee28
sd: ensure previous generation results are cleaned up on all code paths (#2060) 2026-03-22 23:18:09 +08:00
Concedo
3bda0bf102 passthrough mode without any gens 2026-03-22 23:09:08 +08:00
Concedo
1259ac495f remove excess logs 2026-03-22 22:52:25 +08:00
Concedo
efc1db9ec8 add mirror for colab 2026-03-22 17:43:41 +08:00
Concedo
0aa6f21c88 jinja prefill fixed 2026-03-22 14:55:44 +08:00
Concedo
f846c83a7a pre-seed the tts so it can be shown 2026-03-22 10:36:42 +08:00
ddh0
3306dbaef7
misc : prefer ggml-org models in docs and examples (#20827)
* misc : prefer ggml-org models in docs and examples

Prefer referring to known-good quantizations under ggml-org rather than
3rd-party uploaders.

* remove accidentally committed file
2026-03-21 22:00:26 +01:00
Andrea Arcangeli
990e4d9698
common/grammar: fix grammar parsing issues to prevent stack overflow and hangs (#18604)
* grammar: add test case for nullable symbol loop

Reproduce stack overflow (or OOM) with ( [x]* )* found while adding
GBNF support to ripgrep-edit.

llama-server reproducer:

curl \
  -X POST \
  -d '{
    "messages": [{ "role": "user", "content": "write yes" }],
    "grammar": "root ::= ( [x]* )*"
  }' \
  -H "Content-Type: application/json" \
  http://localhost:8811/v1/chat/completions

* grammar: prevent stack overflow with nullable symbol loop

Fix a potential stack overflow in llama_grammar_advance_stack that
could occur when processing grammars with nullable symbols that lead
to infinite derivations of empty strings. The fix introduces cycle
detection by tracking visited stacks to prevent infinite recursion.

rg-edit regexp: llama_grammar_advance_stack
rg-edit extra-args: -A20
rg-edit directive: """Rewrite: fix the following segfault:

[..]
 Testing segfault. Grammar:
            root ::= ( [x]* )*

            root ::= ( [x]* )*

Segmentation fault         build/bin/test-grammar-integration"""

gptel-context:
(("~/llama.cpp/src/llama-grammar.cpp")
 ("~/llama.cpp/tests/test-grammar-integration.cpp")
 ("~/llama.cpp/grammars/./list.gbnf")
 ("~/llama.cpp/grammars/./json_arr.gbnf")
 ("~/llama.cpp/grammars/./json.gbnf")
 ("~/llama.cpp/grammars/./japanese.gbnf")
 ("~/llama.cpp/grammars/./english.gbnf")
 ("~/llama.cpp/grammars/./chess.gbnf")
 ("~/llama.cpp/grammars/./c.gbnf")
 ("~/llama.cpp/grammars/./arithmetic.gbnf")
 ("~/llama.cpp/grammars/./README.md"))

* grammar: convert recursive llama_grammar_advance_stack to iterative

This change converts the function to an iterative approach using
explicit stacks, which prevents deep recursion and eliminates the risk
of stack overflow.

rg-edit regexp: llama_grammar_advance_stack
rg-edit extra-args: -A30
rg-edit directive: """Rewrite: fix the following segfault:

[..]
 Testing segfault. Grammar:
            root ::= ( [x]* )*

            root ::= ( [x]* )*

Segmentation fault         build/bin/test-grammar-integration

convert from recursive to iterative"""

gptel-context:
(("~/llama.cpp/src/llama-grammar.cpp")
 ("~/llama.cpp/tests/test-grammar-integration.cpp")
 ("~/llama.cpp/grammars/./list.gbnf")
 ("~/llama.cpp/grammars/./json_arr.gbnf")
 ("~/llama.cpp/grammars/./json.gbnf")
 ("~/llama.cpp/grammars/./japanese.gbnf")
 ("~/llama.cpp/grammars/./english.gbnf")
 ("~/llama.cpp/grammars/./chess.gbnf")
 ("~/llama.cpp/grammars/./c.gbnf")
 ("~/llama.cpp/grammars/./arithmetic.gbnf")
 ("~/llama.cpp/grammars/./README.md"))

v2: Added a `std::set` to perform tree-based lookups with O(N log N)
complexity. Testing with a parallel run of `test-grammar-integration`
shows a double-digit percentage increase in runtime. An
`unordered_set` with O(1) hashing was also evaluated, but the overhead
of constructing hash keys from pointers made it significantly slower
than the rbtree implementation that only requires an ordering
operator. The performance regression in the test suite appears
justified by the overall reduction in algorithmic complexity.

Co-developed-by: Piotr Wilkin (ilintar) <piotr.wilkin@syndatis.com>

* grammar: add test case for hang in repetition grammar processing

This commit adds a new test case to the grammar integration tests that
specifically targets a hang scenario in the repetition grammar parser
found while adding GBNF support to ripgrep-edit.

llama-server reproducer:

curl \
  -X POST \
  -d '{
    "messages": [{ "role": "user", "content": "write yes" }],
    "grammar": "root ::= (([^x]*){0,99}){0,99}"
  }' \
  -H "Content-Type: application/json" \
  http://localhost:8811/v1/chat/completions

* grammar: add repetition threshold check

The change introduces a maximum repetition threshold to avoid
excessive rule expansion during grammar parsing. When parsing
repetition patterns like {m,n}, the parser now calculates the
potential number of rules that would be generated and throws an error
if the product of previous rules and new rules exceeds the threshold.

A test case was added to verify the threshold is properly enforced for
deeply nested repetition patterns that would otherwise cause hangs.
2026-03-21 18:43:35 +01:00
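A rough sketch of the two ideas in this change (iterative expansion with cycle detection, plus a repetition budget); the types and names are stand-ins, not the real llama-grammar.cpp structures:

    #include <cstddef>
    #include <set>
    #include <stdexcept>
    #include <vector>

    using rule_stack = std::vector<int>;          // stand-in for a grammar stack

    // Expand stacks iteratively; a visited set breaks the infinite loop that a
    // nullable repetition such as "root ::= ( [x]* )*" would otherwise cause.
    static void advance_stack_iterative(const rule_stack & start,
                                        std::vector<rule_stack> & out) {
        std::set<rule_stack> visited;             // rbtree set: only operator< required
        std::vector<rule_stack> pending = { start };
        while (!pending.empty()) {
            rule_stack cur = pending.back();
            pending.pop_back();
            if (!visited.insert(cur).second) {
                continue;                         // already expanded: skip the cycle
            }
            // ... expand `cur`, pushing nullable alternatives onto `pending` ...
            out.push_back(cur);
        }
    }

    // Before expanding a {m,n} repetition, refuse grammars whose nested
    // repetitions would multiply into an absurd number of rules.
    static void check_repetition_budget(std::size_t rules_so_far,
                                        std::size_t new_rules, std::size_t cap) {
        if (new_rules != 0 && rules_so_far > cap / new_rules) {
            throw std::runtime_error("repetition pattern exceeds rule budget");
        }
    }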
Tom Hillbrunner
212f4521b0
context : use n_embd_out for pooled embedding extraction (#20840)
The MEAN/CLS/LAST pooling paths in encode() and decode() used
n_embd_inp() (16384 for qwen3vl with deepstack) to read from the
pooled embedding tensor, which only has n_embd_out() (4096) floats
per sequence. This caused an out-of-bounds tensor read assertion.

Fixes embedding mode for Qwen3-VL-Embedding models.
2026-03-21 19:35:00 +02:00
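A tiny sketch of the indexing change (hypothetical helper; the real code lives in the encode()/decode() pooling paths):

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // The pooled embedding tensor holds n_embd_out floats per sequence, so reads
    // must stride by the output width (e.g. 4096), not the larger input width
    // (16384 for qwen3vl with deepstack), to stay inside the tensor.
    static const float * pooled_embd_for_seq(const std::vector<float> & pooled,
                                             int32_t seq_id, int32_t n_embd_out) {
        return pooled.data() + (std::size_t) seq_id * n_embd_out;
    }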
Concedo
9d4653bcb9 colab: clip and vae to gpu (+1 squashed commits)
Squashed commits:

[d5de2f86d] colab: clip and vae to gpu
2026-03-22 01:10:55 +08:00
Concedo
79e39e1989 fixed a help menu bug, updated colab (+1 squashed commits)
Squashed commits:

[618478e00] fixed a help menu bug, updated colab
2026-03-22 01:00:30 +08:00
Xuan-Son Nguyen
568aec82d2
docs : explicit about banning accounts that violate policy (#19593) 2026-03-21 15:50:16 +01:00
y198
2bcdddd5e3
fix(rpc): prevent division by zero in deserialize_tensor (#20712)
rpc : prevent division by zero in deserialize_tensor

When receiving an RPC message with a deprecated tensor type (e.g., type 4 or 5 where `blck_size == 0`), `ggml_row_size()` will trigger a division by zero (SIGFPE) and crash the rpc-server. 

This patch adds a simple validation check in `deserialize_tensor` to return `nullptr` if the requested tensor type has a block size of 0.

(Note: This was originally reported via Security Advisory and maintainer suggested dropping a patch here).

* style: remove trailing whitespace
2026-03-21 15:59:43 +02:00
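A small sketch of the guard described above (illustrative only, not the actual deserialize_tensor patch in ggml-rpc.cpp):

    #include <cstdint>

    // Deprecated/removed tensor types report a block size of 0; reject them
    // before any size arithmetic so ggml_row_size() can never divide by zero
    // on attacker-controlled RPC input.
    static bool tensor_type_is_valid(int64_t blck_size) {
        return blck_size > 0;
    }
    // usage (names illustrative):
    //     if (!tensor_type_is_valid(ggml_blck_size(type))) {
    //         return nullptr;   // drop the request instead of crashing rpc-server
    //     }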
Michael Wand
eac9c6ea83
Convert: Make NVFP4 and MXFP4 HF conversions say NVFP4/MXFP4 instead of BF16 (#20730)
* Corrected convert script for NVFP4 naming and updated gguf constants

* Add mostly_MXFP4 to FileType

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* simplify

* set initial value [no ci]

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-03-21 13:35:21 +02:00
Concedo
89e2397014 updated lite, up ver (+1 squashed commits)
Squashed commits:

[f1f899070] up version
2026-03-21 17:42:58 +08:00
Concedo
fdfb713d91 added --sdmaingpu allowing image models to be independently placed on any gpu 2026-03-21 17:34:12 +08:00
Concedo
a3d3800f3e added passthrough mode for esrgan upscale, triggered by img2img denoise 0.0 with 1 step 2026-03-21 16:19:10 +08:00
Sigbjørn Skjæret
29b28a9824
ci : switch from pyright to ty (#20826)
* type fixes

* switch to ty

* tweak rules

* tweak more rules

* more tweaks

* final tweak

* use common import-not-found rule
2026-03-21 08:54:34 +01:00
Concedo
58a585d0e7 popular templates section in help menu 2026-03-21 15:37:07 +08:00
Matt Corallo
cea560f483
Add shader count for Intel Arc Pro B60 (#20818) 2026-03-21 05:22:51 +01:00
Concedo
6054bacadd Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.github/workflows/ai-issues.yml
#	CONTRIBUTING.md
#	docs/autoparser.md
#	docs/ops.md
#	docs/ops/Metal.csv
#	ggml/src/ggml-cann/aclnn_ops.cpp
#	ggml/src/ggml-cann/ggml-cann.cpp
#	ggml/src/ggml-cpu/CMakeLists.txt
#	ggml/src/ggml-hexagon/ggml-hexagon.cpp
#	ggml/src/ggml-hexagon/htp/CMakeLists.txt
#	ggml/src/ggml-hexagon/htp/hex-dma.h
#	ggml/src/ggml-hexagon/htp/hex-utils.h
#	ggml/src/ggml-hexagon/htp/htp-ctx.h
#	ggml/src/ggml-hexagon/htp/htp-msg.h
#	ggml/src/ggml-hexagon/htp/htp_iface.idl
#	ggml/src/ggml-hexagon/htp/hvx-base.h
#	ggml/src/ggml-hexagon/htp/main.c
#	ggml/src/ggml-hip/CMakeLists.txt
#	models/templates/Apriel-1.6-15b-Thinker-fixed.jinja
#	models/templates/deepseek-ai-DeepSeek-R1-Distill-Qwen-32B.jinja
#	models/templates/deepseek-ai-DeepSeek-V3.1.jinja
#	models/templates/llama-cpp-deepseek-r1.jinja
#	models/templates/meetkai-functionary-medium-v3.1.jinja
#	scripts/fetch_server_test_models.py
#	scripts/snapdragon/adb/run-cli.sh
#	scripts/snapdragon/adb/run-completion.sh
#	scripts/snapdragon/adb/run-mtmd.sh
#	scripts/snapdragon/adb/run-tool.sh
#	tests/test-chat-auto-parser.cpp
#	tests/test-chat-peg-parser.cpp
#	tests/test-chat.cpp
#	tools/cli/cli.cpp
#	tools/server/README.md
2026-03-21 12:06:01 +08:00
Concedo
98f099aecc Merge commit 'c1258830b2' into concedo_experimental
# Conflicts:
#	docs/docker.md
#	docs/ops.md
#	docs/ops/WebGPU.csv
#	ggml/src/ggml-cann/aclnn_ops.cpp
#	ggml/src/ggml-cann/ggml-cann.cpp
#	ggml/src/ggml-cpu/CMakeLists.txt
#	ggml/src/ggml-webgpu/ggml-webgpu-shader-lib.hpp
#	ggml/src/ggml-webgpu/ggml-webgpu.cpp
#	ggml/src/ggml-webgpu/wgsl-shaders/get_rows.wgsl
#	ggml/src/ggml-webgpu/wgsl-shaders/row_norm.wgsl
#	ggml/src/ggml-webgpu/wgsl-shaders/unary.wgsl
2026-03-21 12:00:52 +08:00
Concedo
07327b6c10 double n_batch size when pipeline parallel is enabled, keep u_batch the same 2026-03-21 11:22:10 +08:00
Concedo
3113e3a643 move main device print 2026-03-21 10:47:21 +08:00
Concedo
9ba8c7a661 fixed colab 2026-03-21 10:21:18 +08:00
Piotr Wilkin (ilintar)
b1c70e2e54
common/parser: fix nasty bug causing subtle corruption of generation prompt (#20825)
2026-03-21 00:19:04 +01:00
shalinib-ibm
e6ec21e62f
ggml-cpu: add always_inline to tinyBLAS_PPC accumulator saves (#20791)
Explicitly mark save_acc and add_save_Acc with always_inline
in tinyBLAS_PPC. This ensures the compiler keeps the MMA accumulator
disassembly within the kernel's register context, preventing unnecessary
stack spills.

Signed-off-by: Shalini Salomi Bodapati <Shalini.Salomi.Bodapati@ibm.com>
2026-03-21 07:11:45 +08:00
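A bare-bones illustration of the attribute involved (the real save_acc/add_save_Acc bodies are elided; only the forced-inline shape is shown, using the GCC/Clang attribute):

    // Forcing the accumulator save helpers to inline keeps the MMA accumulator
    // disassembly inside the caller's register context instead of spilling the
    // accumulator to the stack across a call boundary.
    __attribute__((always_inline))
    static inline void save_acc(float * dst) {
        (void) dst;   // store the accumulator tile to dst; body elided
    }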
Georgi Gerganov
4cb7e0bd61
ai : limit runtime of the agent (#20816) 2026-03-20 20:31:25 +02:00
James O'Leary
149b2493c0
common : fix typo in debug log ('extracft' -> 'extract') (#20807) 2026-03-20 18:23:18 +01:00
Georgi Gerganov
b31b30f31d
ai : do not run bash commands in the prompt (#20810) 2026-03-20 19:06:33 +02:00
Victor Villar
58c81f7e81
model : fix Granite Hybrid type check for 7B.A1B (#20795)
* Check granite hybrid expert count to set type as LLM_TYPE_7B_A1B or LLM_TYPE_1B

* Use feed fwd dim instead of num of experts

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-03-20 15:16:09 +01:00
Concedo
1225b1b155 add nothink autoguess 2026-03-20 21:21:06 +08:00
Xuan-Son Nguyen
fb78ad29bb
server: (doc) clarify in-scope and out-scope features (#20794)
* server: (doc) clarify in-scope and out-scope features

* Apply suggestions from code review

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-03-20 14:03:50 +01:00
Jeff Bolz
e06c3ab2bc
vulkan: change gated_delta_net to shard a column across a subgroup (#20662)
* vulkan: change gated_delta_net to shard a column across a subgroup

This is based on https://github.com/ggml-org/llama.cpp/pull/20391, I used an
LLM to port the CUDA code to Vulkan, and guided to it to make various fixes to
work with Vulkan (e.g. handling different subgroup sizes, unknown mapping of
subgroup to invocation id, using subgroupAdd optionally, etc.).

This fixes a perf regression from the transposing of the values in memory
(!20443).

* vulkan: Spread columns across fewer lanes to reduce the number of workgroups
2026-03-20 12:17:15 +01:00
Concedo
2d349723d3 fixed colab 2026-03-20 18:19:59 +08:00
Ruikai Peng
dc6592431b
context: zero output buffer on allocation (#20781)
* context: zero output buffer on allocation

Address GHSA-wqq9-25mr-rw76.

The logits output buffer allocated in output_reserve() uses
posix_memalign(), which does not zero memory. The buffer is only
written during decode when needs_raw_logits() returns true. When
backend samplers cover all output sequences, needs_raw_logits()
returns false and the buffer is never written, but
llama_get_logits() still returns a pointer to it, exposing stale
heap content.

Zero the buffer after allocation to prevent information disclosure
through the public logits API.

Found-by: Pwno

* Update src/llama-context.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-03-20 11:31:34 +02:00
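A minimal sketch of the mitigation described above (illustrative only, not the actual output_reserve() code; the helper name and alignment are assumptions):

    #include <cstddef>
    #include <cstdlib>
    #include <cstring>

    // posix_memalign() returns uninitialized memory, so zero the buffer right
    // after allocation; otherwise a caller reading logits before any decode
    // write would see stale heap contents.
    static float * alloc_output_buffer(std::size_t n_floats) {
        void * buf = nullptr;
        if (posix_memalign(&buf, 64, n_floats * sizeof(float)) != 0) {
            return nullptr;
        }
        std::memset(buf, 0, n_floats * sizeof(float)); // never expose stale heap data
        return static_cast<float *>(buf);
    }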
Ruikai Peng
3adbef7776
model: assert nextn_predict_layers to prevent underflow (#20783)
Address GHSA-645x-v54x-34w8.

When nextn_predict_layers >= n_layer, n_layer - nextn_predict_layers
can underflow (unsigned wrap), which corrupts n_layer_kv_from_start.

Assert nextn_predict_layers immediately after parsing the GGUF key.

Found-by: Pwno
2026-03-20 10:17:58 +01:00
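A minimal sketch of the check (hypothetical helper; per the commit, the real assert sits right after the GGUF key is parsed):

    #include <cstdint>
    #include <stdexcept>

    // With unsigned layer counts, n_layer - nextn_predict_layers silently wraps
    // when nextn_predict_layers >= n_layer, corrupting n_layer_kv_from_start,
    // so validate the value as soon as it is read.
    static void validate_nextn(uint32_t nextn_predict_layers, uint32_t n_layer) {
        if (nextn_predict_layers >= n_layer) {
            throw std::runtime_error("nextn_predict_layers must be smaller than n_layer");
        }
    }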
Georgi Gerganov
ab9d4c3678
server : improve mtmd ctx checkpoints (#20726)
* server : improve mtmd ctx checkpoints

* server : fix off-by-one in pos_min_thold
2026-03-20 11:13:12 +02:00