Commit graph

12334 commits

Author SHA1 Message Date
Concedo
c6213e9be6 Revert "Revert "llama : disable graph reuse with pipeline parallelism (#20463)""
This reverts commit 8043f35b22.
2026-03-25 22:25:20 +08:00
Concedo
b81103d6ba clean up colab a bit 2026-03-25 22:14:38 +08:00
Concedo
24ab1c1451 upgrade musicui to do tts, show musicui for tts models (+1 squashed commits)
Squashed commits:

[975630b15] upgrade musicui to do tts
2026-03-25 00:24:44 +08:00
Concedo
efdc52fe8b q3tts custom voice support 2026-03-24 23:38:18 +08:00
Concedo
8437c346a7 fixed tts instruction regex, encapsulate thinking by default 2026-03-24 13:53:46 +08:00
Concedo
9e9028b1a9 fixed cpu mis-selection 2026-03-23 21:30:57 +08:00
Concedo
e7ffe718f0 updated lite 2026-03-23 19:01:02 +08:00
Concedo
0d50cafd8b added CustomVoice support 2026-03-23 18:50:08 +08:00
Wagner Bruna
abe55fa424
sd: fix metadata for generated images (#2061)
* sd: fix metadata for generated images

* sd: refactor output image conversion
2026-03-23 17:04:32 +08:00
Alistair Stewart
5ff6cefce0
Fix music generation token stopping (#2057)
* Fix music generation token stopping for quantized models

In Phase 1 lyrics mode, the FSM transitions to CODES state after
TOKEN_THINK_END and disables itself. The quantized Q4_K_M model was
not reliably generating TOKEN_IM_END to stop the generation,
causing it to continue until hitting the 8192 token limit.

This fix forces TOKEN_IM_END to be generated immediately after
TOKEN_THINK_END in lyrics mode, ensuring clean completion of the
planning phase without excessive token generation.

Testing shows generation now completes in ~500ms instead of 80+
seconds with timeout errors.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

* Clarify comment - fix applies to all models, not just quantized

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

* Improve fix: only force TOKEN_IM_END at token limit

Instead of forcing TOKEN_IM_END immediately after TOKEN_THINK_END,
only force it when we've reached the token limit. This allows the model
to generate lyrics after the thinking block while still preventing KV
cache exhaustion.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

---------

Co-authored-by: Claude <noreply@anthropic.com>
2026-03-23 17:02:14 +08:00
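A minimal sketch of the token-limit fallback described in the commit above; the function and parameter names are hypothetical, not the actual KoboldCpp sampler code:

    #include <cstdint>

    // Once the lyrics-mode planning phase reaches its token budget, return the
    // end-of-message token instead of the sampler's choice, so generation
    // terminates cleanly rather than running until the hard 8192-token limit.
    static int32_t pick_next_token(bool lyrics_mode, int n_generated, int token_limit,
                                   int32_t sampled, int32_t token_im_end) {
        if (lyrics_mode && n_generated >= token_limit) {
            return token_im_end; // force clean completion at the limit
        }
        return sampled;          // otherwise keep the sampler's choice
    }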
Concedo
993925ba96 gracefully handle bad grammar instead of crashing 2026-03-23 17:00:53 +08:00
Concedo
ef854f002e Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.github/workflows/python-type-check.yml
#	AGENTS.md
#	CONTRIBUTING.md
#	examples/model-conversion/scripts/embedding/run-original-model.py
#	examples/model-conversion/scripts/utils/compare_tokens.py
#	examples/pydantic_models_to_grammar.py
#	ggml/src/ggml-rpc/ggml-rpc.cpp
#	pyrightconfig.json
#	scripts/compare-llama-bench.py
#	scripts/jinja/jinja-tester.py
#	scripts/server-bench.py
#	tests/test-grammar-integration.cpp
#	tests/test-grammar-parser.cpp
#	tests/test-llama-grammar.cpp
#	tests/test-tokenizer-random.py
#	tools/cli/README.md
#	tools/completion/README.md
#	tools/llama-bench/llama-bench.cpp
#	tools/server/README.md
2026-03-22 23:39:13 +08:00
Wagner Bruna
592dedee28
sd: ensure previous generation results are cleaned up on all code paths (#2060) 2026-03-22 23:18:09 +08:00
Concedo
3bda0bf102 passthrough mode without any gens 2026-03-22 23:09:08 +08:00
Concedo
1259ac495f remove excess logs 2026-03-22 22:52:25 +08:00
Concedo
efc1db9ec8 add mirror for colab 2026-03-22 17:43:41 +08:00
Concedo
0aa6f21c88 jinja prefill fixed 2026-03-22 14:55:44 +08:00
Concedo
f846c83a7a pre-seed the tts so it can be shown 2026-03-22 10:36:42 +08:00
ddh0
3306dbaef7
misc : prefer ggml-org models in docs and examples (#20827)
* misc : prefer ggml-org models in docs and examples

Prefer referring to known-good quantizations under ggml-org rather than
3rd-party uploaders.

* remove accidentally committed file
2026-03-21 22:00:26 +01:00
Andrea Arcangeli
990e4d9698
common/grammar: fix grammar parsing issues to prevent stack overflow and hangs (#18604)
* grammar: add test case for nullable symbol loop

Reproduce stack overflow (or OOM) with ( [x]* )* found while adding
GBNF support to ripgrep-edit.

llama-server reproducer:

curl \
  -X POST \
  -d '{
    "messages": [{ "role": "user", "content": "write yes" }],
    "grammar": "root ::= ( [x]* )*"
  }' \
  -H "Content-Type: application/json" \
  http://localhost:8811/v1/chat/completions

* grammar: prevent stack overflow with nullable symbol loop

Fix a potential stack overflow in llama_grammar_advance_stack that
could occur when processing grammars with nullable symbols that lead
to infinite derivations of empty strings. The fix introduces cycle
detection by tracking visited stacks to prevent infinite recursion.

rg-edit regexp: llama_grammar_advance_stack
rg-edit extra-args: -A20
rg-edit directive: """Rewrite: fix the following segfault:

[..]
 Testing segfault. Grammar:
            root ::= ( [x]* )*

            root ::= ( [x]* )*

Segmentation fault         build/bin/test-grammar-integration"""

gptel-context:
(("~/llama.cpp/src/llama-grammar.cpp")
 ("~/llama.cpp/tests/test-grammar-integration.cpp")
 ("~/llama.cpp/grammars/./list.gbnf")
 ("~/llama.cpp/grammars/./json_arr.gbnf")
 ("~/llama.cpp/grammars/./json.gbnf")
 ("~/llama.cpp/grammars/./japanese.gbnf")
 ("~/llama.cpp/grammars/./english.gbnf")
 ("~/llama.cpp/grammars/./chess.gbnf")
 ("~/llama.cpp/grammars/./c.gbnf")
 ("~/llama.cpp/grammars/./arithmetic.gbnf")
 ("~/llama.cpp/grammars/./README.md"))

* grammar: convert recursive llama_grammar_advance_stack to iterative

This change converts the function to an iterative approach using
explicit stacks, which prevents deep recursion and eliminates the risk
of stack overflow.

rg-edit regexp: llama_grammar_advance_stack
rg-edit extra-args: -A30
rg-edit directive: """Rewrite: fix the following segfault:

[..]
 Testing segfault. Grammar:
            root ::= ( [x]* )*

            root ::= ( [x]* )*

Segmentation fault         build/bin/test-grammar-integration

convert from recursive to iterative"""

gptel-context:
(("~/llama.cpp/src/llama-grammar.cpp")
 ("~/llama.cpp/tests/test-grammar-integration.cpp")
 ("~/llama.cpp/grammars/./list.gbnf")
 ("~/llama.cpp/grammars/./json_arr.gbnf")
 ("~/llama.cpp/grammars/./json.gbnf")
 ("~/llama.cpp/grammars/./japanese.gbnf")
 ("~/llama.cpp/grammars/./english.gbnf")
 ("~/llama.cpp/grammars/./chess.gbnf")
 ("~/llama.cpp/grammars/./c.gbnf")
 ("~/llama.cpp/grammars/./arithmetic.gbnf")
 ("~/llama.cpp/grammars/./README.md"))

v2: Added a `std::set` to perform tree-based lookups with O(N log N)
complexity. Testing with a parallel run of `test-grammar-integration`
shows a double-digit percentage increase in runtime. An
`unordered_set` with O(1) hashing was also evaluated, but the overhead
of constructing hash keys from pointers made it significantly slower
than the rbtree implementation that only requires an ordering
operator. The performance regression in the test suite appears
justified by the overall reduction in algorithmic complexity.

Co-developed-by: Piotr Wilkin (ilintar) <piotr.wilkin@syndatis.com>

* grammar: add test case for hang in repetition grammar processing

This commit adds a new test case to the grammar integration tests that
specifically targets a hang scenario in the repetition grammar parser
found while adding GBNF support to ripgrep-edit.

llama-server reproducer:

curl \
  -X POST \
  -d '{
    "messages": [{ "role": "user", "content": "write yes" }],
    "grammar": "root ::= (([^x]*){0,99}){0,99}"
  }' \
  -H "Content-Type: application/json" \
  http://localhost:8811/v1/chat/completions

* grammar: add repetition threshold check

The change introduces a maximum repetition threshold to avoid
excessive rule expansion during grammar parsing. When parsing
repetition patterns like {m,n}, the parser now calculates the
potential number of rules that would be generated and throws an error
if the product of previous rules and new rules exceeds the threshold.

A test case was added to verify the threshold is properly enforced for
deeply nested repetition patterns that would otherwise cause hangs.
2026-03-21 18:43:35 +01:00
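A rough sketch of the two ideas in this change (iterative expansion with cycle detection, plus a repetition budget); the types and names are stand-ins, not the real llama-grammar.cpp structures:

    #include <cstddef>
    #include <set>
    #include <stdexcept>
    #include <vector>

    using rule_stack = std::vector<int>;          // stand-in for a grammar stack

    // Expand stacks iteratively; a visited set breaks the infinite loop that a
    // nullable repetition such as "root ::= ( [x]* )*" would otherwise cause.
    static void advance_stack_iterative(const rule_stack & start,
                                        std::vector<rule_stack> & out) {
        std::set<rule_stack> visited;             // rbtree set: only operator< required
        std::vector<rule_stack> pending = { start };
        while (!pending.empty()) {
            rule_stack cur = pending.back();
            pending.pop_back();
            if (!visited.insert(cur).second) {
                continue;                         // already expanded: skip the cycle
            }
            // ... expand `cur`, pushing nullable alternatives onto `pending` ...
            out.push_back(cur);
        }
    }

    // Before expanding a {m,n} repetition, refuse grammars whose nested
    // repetitions would multiply into an absurd number of rules.
    static void check_repetition_budget(std::size_t rules_so_far,
                                        std::size_t new_rules, std::size_t cap) {
        if (new_rules != 0 && rules_so_far > cap / new_rules) {
            throw std::runtime_error("repetition pattern exceeds rule budget");
        }
    }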
Tom Hillbrunner
212f4521b0
context : use n_embd_out for pooled embedding extraction (#20840)
The MEAN/CLS/LAST pooling paths in encode() and decode() used
n_embd_inp() (16384 for qwen3vl with deepstack) to read from the
pooled embedding tensor, which only has n_embd_out() (4096) floats
per sequence. This caused an out-of-bounds tensor read assertion.

Fixes embedding mode for Qwen3-VL-Embedding models.
2026-03-21 19:35:00 +02:00
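A tiny sketch of the indexing change (hypothetical helper; the real code lives in the encode()/decode() pooling paths):

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // The pooled embedding tensor holds n_embd_out floats per sequence, so reads
    // must stride by the output width (e.g. 4096), not the larger input width
    // (16384 for qwen3vl with deepstack), to stay inside the tensor.
    static const float * pooled_embd_for_seq(const std::vector<float> & pooled,
                                             int32_t seq_id, int32_t n_embd_out) {
        return pooled.data() + (std::size_t) seq_id * n_embd_out;
    }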
Concedo
9d4653bcb9 colab: clip and vae to gpu (+1 squashed commits)
Squashed commits:

[d5de2f86d] colab: clip and vae to gpu
2026-03-22 01:10:55 +08:00
Concedo
79e39e1989 fixed a help menu bug, updated colab (+1 squashed commits)
Squashed commits:

[618478e00] fixed a help menu bug, updated colab
2026-03-22 01:00:30 +08:00
Xuan-Son Nguyen
568aec82d2
docs : explicit about banning accounts that violate policy (#19593) 2026-03-21 15:50:16 +01:00
y198
2bcdddd5e3
fix(rpc): prevent division by zero in deserialize_tensor (#20712)
rpc : prevent division by zero in deserialize_tensor

When receiving an RPC message with a deprecated tensor type (e.g., type 4 or 5 where `blck_size == 0`), `ggml_row_size()` will trigger a division by zero (SIGFPE) and crash the rpc-server. 

This patch adds a simple validation check in `deserialize_tensor` to return `nullptr` if the requested tensor type has a block size of 0.

(Note: This was originally reported via Security Advisory and maintainer suggested dropping a patch here).

* style: remove trailing whitespace
2026-03-21 15:59:43 +02:00
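A small sketch of the guard described above (illustrative only, not the actual deserialize_tensor patch in ggml-rpc.cpp):

    #include <cstdint>

    // Deprecated/removed tensor types report a block size of 0; reject them
    // before any size arithmetic so ggml_row_size() can never divide by zero
    // on attacker-controlled RPC input.
    static bool tensor_type_is_valid(int64_t blck_size) {
        return blck_size > 0;
    }
    // usage (names illustrative):
    //     if (!tensor_type_is_valid(ggml_blck_size(type))) {
    //         return nullptr;   // drop the request instead of crashing rpc-server
    //     }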
Michael Wand
eac9c6ea83
Convert: Make NVFP4 and MXFP4 HF conversions say NVFP4/MXFP4 instead of BF16 (#20730)
* Corrected convert script for NVFP4 naming and updated gguf constants

* Add mostly_MXFP4 to FileType

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* simplify

* set initial value [no ci]

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-03-21 13:35:21 +02:00
Concedo
89e2397014 updated lite, up ver (+1 squashed commits)
Squashed commits:

[f1f899070] up version
2026-03-21 17:42:58 +08:00
Concedo
fdfb713d91 added --sdmaingpu allowing image models to be independently placed on any gpu 2026-03-21 17:34:12 +08:00
Concedo
a3d3800f3e added passthrough mode for esrgan upscale, triggered by img2img denoise 0.0 with 1 step 2026-03-21 16:19:10 +08:00
Sigbjørn Skjæret
29b28a9824
ci : switch from pyright to ty (#20826)
* type fixes

* switch to ty

* tweak rules

* tweak more rules

* more tweaks

* final tweak

* use common import-not-found rule
2026-03-21 08:54:34 +01:00
Concedo
58a585d0e7 popular templates section in help menu 2026-03-21 15:37:07 +08:00
Matt Corallo
cea560f483
Add shader count for Intel Arc Pro B60 (#20818) 2026-03-21 05:22:51 +01:00
Concedo
6054bacadd Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.github/workflows/ai-issues.yml
#	CONTRIBUTING.md
#	docs/autoparser.md
#	docs/ops.md
#	docs/ops/Metal.csv
#	ggml/src/ggml-cann/aclnn_ops.cpp
#	ggml/src/ggml-cann/ggml-cann.cpp
#	ggml/src/ggml-cpu/CMakeLists.txt
#	ggml/src/ggml-hexagon/ggml-hexagon.cpp
#	ggml/src/ggml-hexagon/htp/CMakeLists.txt
#	ggml/src/ggml-hexagon/htp/hex-dma.h
#	ggml/src/ggml-hexagon/htp/hex-utils.h
#	ggml/src/ggml-hexagon/htp/htp-ctx.h
#	ggml/src/ggml-hexagon/htp/htp-msg.h
#	ggml/src/ggml-hexagon/htp/htp_iface.idl
#	ggml/src/ggml-hexagon/htp/hvx-base.h
#	ggml/src/ggml-hexagon/htp/main.c
#	ggml/src/ggml-hip/CMakeLists.txt
#	models/templates/Apriel-1.6-15b-Thinker-fixed.jinja
#	models/templates/deepseek-ai-DeepSeek-R1-Distill-Qwen-32B.jinja
#	models/templates/deepseek-ai-DeepSeek-V3.1.jinja
#	models/templates/llama-cpp-deepseek-r1.jinja
#	models/templates/meetkai-functionary-medium-v3.1.jinja
#	scripts/fetch_server_test_models.py
#	scripts/snapdragon/adb/run-cli.sh
#	scripts/snapdragon/adb/run-completion.sh
#	scripts/snapdragon/adb/run-mtmd.sh
#	scripts/snapdragon/adb/run-tool.sh
#	tests/test-chat-auto-parser.cpp
#	tests/test-chat-peg-parser.cpp
#	tests/test-chat.cpp
#	tools/cli/cli.cpp
#	tools/server/README.md
2026-03-21 12:06:01 +08:00
Concedo
98f099aecc Merge commit 'c1258830b2' into concedo_experimental
# Conflicts:
#	docs/docker.md
#	docs/ops.md
#	docs/ops/WebGPU.csv
#	ggml/src/ggml-cann/aclnn_ops.cpp
#	ggml/src/ggml-cann/ggml-cann.cpp
#	ggml/src/ggml-cpu/CMakeLists.txt
#	ggml/src/ggml-webgpu/ggml-webgpu-shader-lib.hpp
#	ggml/src/ggml-webgpu/ggml-webgpu.cpp
#	ggml/src/ggml-webgpu/wgsl-shaders/get_rows.wgsl
#	ggml/src/ggml-webgpu/wgsl-shaders/row_norm.wgsl
#	ggml/src/ggml-webgpu/wgsl-shaders/unary.wgsl
2026-03-21 12:00:52 +08:00
Concedo
07327b6c10 double n_batch size when pipeline parallel is enabled, keep u_batch the same 2026-03-21 11:22:10 +08:00
Concedo
3113e3a643 move main device print 2026-03-21 10:47:21 +08:00
Concedo
9ba8c7a661 fixed colab 2026-03-21 10:21:18 +08:00
Piotr Wilkin (ilintar)
b1c70e2e54
common/parser: fix nasty bug causing subtle corruption of generation prompt (#20825)
2026-03-21 00:19:04 +01:00
shalinib-ibm
e6ec21e62f
ggml-cpu: add always_inline to tinyBLAS_PPC accumulator saves (#20791)
Explicitly mark save_acc and add_save_Acc with always_inline
in tinyBLAS_PPC. This ensures the compiler keeps the MMA accumulator
disassembly within the kernel's register context, preventing unnecessary
stack spills.

Signed-off-by: Shalini Salomi Bodapati <Shalini.Salomi.Bodapati@ibm.com>
2026-03-21 07:11:45 +08:00
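A bare-bones illustration of the attribute involved (the real save_acc/add_save_Acc bodies are elided; only the forced-inline shape is shown, using the GCC/Clang attribute):

    // Forcing the accumulator save helpers to inline keeps the MMA accumulator
    // disassembly inside the caller's register context instead of spilling the
    // accumulator to the stack across a call boundary.
    __attribute__((always_inline))
    static inline void save_acc(float * dst) {
        (void) dst;   // store the accumulator tile to dst; body elided
    }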
Georgi Gerganov
4cb7e0bd61
ai : limit runtime of the agent (#20816) 2026-03-20 20:31:25 +02:00
James O'Leary
149b2493c0
common : fix typo in debug log ('extracft' -> 'extract') (#20807) 2026-03-20 18:23:18 +01:00
Georgi Gerganov
b31b30f31d
ai : do not run bash commands in the prompt (#20810) 2026-03-20 19:06:33 +02:00
Victor Villar
58c81f7e81
model : fix Granite Hybrid type check for 7B.A1B (#20795)
* Check granite hybrid expert count to set type as LLM_TYPE_7B_A1B or LLM_TYPE_1B

* Use feed fwd dim instead of num of experts

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-03-20 15:16:09 +01:00
Concedo
1225b1b155 add nothink autoguess 2026-03-20 21:21:06 +08:00
Xuan-Son Nguyen
fb78ad29bb
server: (doc) clarify in-scope and out-scope features (#20794)
* server: (doc) clarify in-scope and out-scope features

* Apply suggestions from code review

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-03-20 14:03:50 +01:00
Jeff Bolz
e06c3ab2bc
vulkan: change gated_delta_net to shard a column across a subgroup (#20662)
* vulkan: change gated_delta_net to shard a column across a subgroup

This is based on https://github.com/ggml-org/llama.cpp/pull/20391, I used an
LLM to port the CUDA code to Vulkan, and guided to it to make various fixes to
work with Vulkan (e.g. handling different subgroup sizes, unknown mapping of
subgroup to invocation id, using subgroupAdd optionally, etc.).

This fixes a perf regression from the transposing of the values in memory
(!20443).

* vulkan: Spread columns across fewer lanes to reduce the number of workgroups
2026-03-20 12:17:15 +01:00
Concedo
2d349723d3 fixed colab 2026-03-20 18:19:59 +08:00
Ruikai Peng
dc6592431b
context: zero output buffer on allocation (#20781)
* context: zero output buffer on allocation

Address GHSA-wqq9-25mr-rw76.

The logits output buffer allocated in output_reserve() uses
posix_memalign(), which does not zero memory. The buffer is only
written during decode when needs_raw_logits() returns true. When
backend samplers cover all output sequences, needs_raw_logits()
returns false and the buffer is never written, but
llama_get_logits() still returns a pointer to it, exposing stale
heap content.

Zero the buffer after allocation to prevent information disclosure
through the public logits API.

Found-by: Pwno

* Update src/llama-context.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-03-20 11:31:34 +02:00
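A minimal sketch of the mitigation described above (illustrative only, not the actual output_reserve() code; the helper name and alignment are assumptions):

    #include <cstddef>
    #include <cstdlib>
    #include <cstring>

    // posix_memalign() returns uninitialized memory, so zero the buffer right
    // after allocation; otherwise a caller reading logits before any decode
    // write would see stale heap contents.
    static float * alloc_output_buffer(std::size_t n_floats) {
        void * buf = nullptr;
        if (posix_memalign(&buf, 64, n_floats * sizeof(float)) != 0) {
            return nullptr;
        }
        std::memset(buf, 0, n_floats * sizeof(float)); // never expose stale heap data
        return static_cast<float *>(buf);
    }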
Ruikai Peng
3adbef7776
model: assert nextn_predict_layers to prevent underflow (#20783)
Address GHSA-645x-v54x-34w8.

When nextn_predict_layers >= n_layer, n_layer - nextn_predict_layers
can underflow (unsigned wrap), which corrupts n_layer_kv_from_start.

Assert nextn_predict_layers immediately after parsing the GGUF key.

Found-by: Pwno
2026-03-20 10:17:58 +01:00
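A minimal sketch of the check (hypothetical helper; per the commit, the real assert sits right after the GGUF key is parsed):

    #include <cstdint>
    #include <stdexcept>

    // With unsigned layer counts, n_layer - nextn_predict_layers silently wraps
    // when nextn_predict_layers >= n_layer, corrupting n_layer_kv_from_start,
    // so validate the value as soon as it is read.
    static void validate_nextn(uint32_t nextn_predict_layers, uint32_t n_layer) {
        if (nextn_predict_layers >= n_layer) {
            throw std::runtime_error("nextn_predict_layers must be smaller than n_layer");
        }
    }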
Georgi Gerganov
ab9d4c3678
server : improve mtmd ctx checkpoints (#20726)
* server : improve mtmd ctx checkpoints

* server : fix off-by-one in pos_min_thold
2026-03-20 11:13:12 +02:00