koboldcpp

mirror of https://github.com/LostRuins/koboldcpp.git synced 2026-05-07 17:22:04 +00:00

Author	SHA1	Message	Date
Concedo	d3d50a7b3c	fixed reasoning content response in fakestreaming tools	2026-04-04 23:03:33 +08:00
Concedo	ac92ac22d7	tool call fix	2026-04-04 22:35:03 +08:00
Concedo	eb3422996a	BOS fix for gemma4	2026-04-04 22:15:01 +08:00
Concedo	2e4f94822e	Merge branch 'upstream' into concedo_experimental # Conflicts: # .github/workflows/build-self-hosted.yml # .github/workflows/docker.yml # ci/run.sh # docs/build.md # ggml/src/ggml-webgpu/ggml-webgpu-shader-lib.hpp # ggml/src/ggml-webgpu/ggml-webgpu.cpp # src/llama-vocab.cpp # tests/test-chat.cpp # tests/test-jinja.cpp # tools/cli/README.md # tools/completion/README.md # tools/server/README.md	2026-04-04 14:27:23 +08:00
Concedo	235ec9a1b9	updated lite	2026-04-04 14:24:05 +08:00
Concedo	a33eda3842	more template fixes for the gemma4 31b	2026-04-04 14:23:16 +08:00
Concedo	1c834fcbd3	try to match template more closely (+2 squashed commit) Squashed commit: [466808010] try to match template more closely [9f805e753] try to match template more closely	2026-04-04 13:50:04 +08:00
Reese Levine	d006858316	ggml-webgpu: move from parameter buffer pool to single buffer with offsets (#21278 ) * Work towards removing bitcast * Move rest of existing types over * Add timeout back to wait and remove synchronous set_tensor/memset_tensor * move to unpackf16 for wider compatibility * cleanup * Remove deadlock condition in free_bufs * Start work on removing parameter buffer pools * Simplify and optimize further * simplify profile futures * Fix stride * Try using a single command buffer per batch * formatting	2026-04-03 11:40:14 -07:00
Masato Nakasaka	e439700992	ci: Add Windows Vulkan backend testing on Intel (#21292 ) * experimenting CI * Experimenting CI fix for MinGW * experimenting CI on Windows * modified script for integration with VisualStudio * added proxy handling * adding python version for Windows execution * fix iterator::end() dereference * fixed proxy handling * Fix errors occurring on Windows * fixed ci script * Reverted to master * Stripping test items to simplify Windows test * adjusting script for windows testing * Changed shell * Fixed shell * Fixed shell * Fix CI setting * Fix CI setting * Fix CI setting * Experimenting ci fix * Experimenting ci fix * Experimenting ci fix * Experimenting ci fix * experimenting fix for unit test error * Changed to use BUILD_LOW_PERF to skip python tests * Fix CI * Added option to specify Ninja generator * Reverted proxy related changes	2026-04-03 20:16:44 +03:00
Yes You Can Have Your Own	50e0ad08fb	server: save and clear idle slots on new task (`--clear-idle`) (#20993 ) * server: clear idle slots KV from VRAM (LLAMA_KV_KEEP_ONLY_ACTIVE) * server: move idle slot KV clearing to slot release The save "cost" is now paid by the finishing request. * server: add --kv-clear-idle flag, enable by default * server: skip clearing last idle slot, clear on launch * server: test --no-kv-clear-idle flag * server: simplify on-release clearing loop * server: remove on-release KV clearing, keep launch-only * cont : clean-up * tests: update log strings after --clear-idle rename * tests: use debug tags instead of log message matching * test: fix Windows CI by dropping temp log file unlink --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2026-04-03 19:02:27 +02:00
Piotr Wilkin (ilintar)	f1f793ad06	common/parser: fix call ID detection (Mistral parser mostly) + atomicity for tag-json parsers (#21230 ) * Fix call ID detection (Mistral parser mostly) + atomicity for tag-json parsers * Rename * Update common/chat-auto-parser-generator.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2026-04-03 17:51:52 +02:00
Samanvya Tripathi	af5c13841f	common : fix tool call type detection for nullable and enum schemas (#21327 ) * common : fix tool call type detection for nullable and enum schemas * common, tests : fix grammar delegation for nullable/enum schemas and add tests Fix enum type inference to scan all enum values (not just index 0) so schemas like {"enum": [0, "celsius"]} correctly detect string type. Fix schema_delegates in peg-parser to handle nullable type arrays (["string", "null"]) and typeless enum schemas in raw mode, allowing the tagged parser to use raw text instead of JSON-formatted strings. Add test cases for Qwen3-Coder (TAG_WITH_TAGGED format): - nullable string ["string", "null"] - nullable string with null first ["null", "string"] - nullable integer ["integer", "null"] - enum without explicit type key	2026-04-03 17:51:23 +02:00
M1DNYT3	277ff5fff7	docker : bump cuda12 to 12.9.1 (#20920 ) Co-authored-by: M1DNYT3 <m1dnyt3@MacBookPro.lan> Co-authored-by: CISC <CISC@users.noreply.github.com>	2026-04-03 15:06:45 +02:00
jeromew	384c0076bc	docs: Update build.md: HSA_OVERRIDE_GFX_VERSION clarification (#21331 ) The `HSA_OVERRIDE_GFX_VERSION` variable can be used in ROCm to override an unsupported target architecture with a similar but supported target architecture. This does not and has never worked on Windows. I think the clarification could avoid driving Windows people towards this solution that does not work.	2026-04-03 21:05:14 +08:00
Sigbjørn Skjæret	1f34806c44	jinja: coerce input for string-specific filters (#21370 )	2026-04-03 15:03:33 +02:00
Aaron Teo	887535c33f	ci: add more binary checks (#21349 )	2026-04-03 20:50:00 +08:00
Piotr Wilkin (ilintar)	d3416a4aa9	fix: remove stale assert (#21369 )	2026-04-03 13:40:41 +02:00
Concedo	784e193fbb	Merge branch 'upstream' into concedo_experimental # Conflicts: # .devops/nix/package.nix # .github/workflows/build.yml # .github/workflows/hip-quality-check.yml # docs/backend/ZenDNN.md # docs/ops.md # docs/ops/ZenDNN.csv # ggml/src/ggml-zendnn/CMakeLists.txt # ggml/src/ggml-zendnn/ggml-zendnn.cpp	2026-04-03 19:04:57 +08:00
Concedo	975e48b985	updated lite	2026-04-03 18:51:43 +08:00
Concedo	f39bfc39e9	gemma4 jinja tool calls fixed, added gemma4 <eos> token handling	2026-04-03 17:39:21 +08:00
uvos	43a4ee4a2c	HIP: build each ci build test for a different architecture (#21337 ) This helps improve our chances of finding build failures before the release workflow builds for all architectures.	2026-04-03 11:38:22 +02:00
Tillerino	f851fa5ab0	fix: add openssl to nix dependencies (#21353 ) (#21355 )	2026-04-03 12:21:07 +03:00
Vishal Singh	f1ac84119c	ggml-zendnn : add MUL_MAT_ID op support for MoE models (#21315 ) * ggml-zendnn : add MUL_MAT_ID op support for MoE models - Add MUL_MAT_ID op acceleration for Mixture-of-Experts models - MUL_MAT_ID op fallback to CPU backend if total experts > 32 - Point ZenDNN lib to latest bits ZenDNN-2026-WW13 * ggml-zendnn : add braces to sgemm failure condition for consistency Co-authored-by: Aaron Teo <taronaeo@gmail.com> --------- Co-authored-by: Aaron Teo <taronaeo@gmail.com>	2026-04-03 12:19:08 +03:00
Concedo	8fa87621d1	Merge branch 'upstream' into concedo_experimental # Conflicts: # .github/labeler.yml # common/chat.cpp # ggml/src/ggml-rpc/ggml-rpc.cpp	2026-04-03 16:36:41 +08:00
Piotr Wilkin (ilintar)	b069b10ab4	vocab: fix Gemma4 tokenizer (#21343 ) * seems to work * fix case with new line Co-authored-by: sayap <sokann@gmail.com> * gemma 4: fix pre tok regex --------- Co-authored-by: Xuan Son Nguyen <son@huggingface.co> Co-authored-by: sayap <sokann@gmail.com>	2026-04-03 10:33:03 +02:00
Concedo	97f785efce	ensure BOS on vision prefix	2026-04-03 16:20:36 +08:00
Concedo	e8cffa37c8	fixed gemma4v image crashing on encode, however images are not yet working correctly	2026-04-03 15:56:35 +08:00
Radoslav Gerganov	0c58ba3365	rpc : reuse compute graph buffers (#21299 ) Reuse the buffer for the ggml context which is used for creating the compute graph on the server side. This partially addresses a memory leak created by the CUDA backend due to using buffer addresses as cache keys. ref: #21265 ref: #20315	2026-04-03 10:28:09 +03:00
Concedo	17678748ac	fixed mtmd build	2026-04-03 14:41:24 +08:00
Concedo	11b384aa6a	tweak tooltip	2026-04-03 14:30:02 +08:00
Shoaib	f05f680203	[admin]password defaults to env var (#2086 )	2026-04-03 14:28:03 +08:00
Georgi Gerganov	57ace0d612	chat : avoid including json in chat.h (#21306 )	2026-04-03 09:07:59 +03:00
Georgi Gerganov	39b27f0da0	(revert) kv-cache : do not quantize SWA KV cache (#21332 ) This reverts commit `17193cce34`.	2026-04-03 09:07:01 +03:00
Concedo	e6a49c404f	fixed gemma4 templates, basic non thinking is working fine on all models now	2026-04-03 12:20:01 +08:00
Concedo	34ad53e950	merged support for gemma4. the e2b, e4b and 26b work, the 31b does not	2026-04-03 11:07:46 +08:00
Vishal Singh	f49e917876	ci : add AMD ZenDNN label to PR labeler (#21345 ) * ci : add AMD CPU label to PR labeler Add automatic labeling for PRs that modify AMD CPU (ZenDNN) backend files * ci : rename label AMD CPU to AMD ZenDNN in labeler config Co-authored-by: Aaron Teo <taronaeo@gmail.com> --------- Co-authored-by: Aaron Teo <taronaeo@gmail.com>	2026-04-03 10:35:15 +08:00
Slobodan Josic	7c7d6ce5c7	[HIP] Bump ROCm version to 7.2.1 (#21066 ) Bump ROCm version on Linux from 7.2 to 7.2.1 Add gfx1102 target Delete LLVM workaround since ROCm 7.2.1 has fix for ROCm 7.2 perf regression https://github.com/ROCm/rocm-systems/issues/2865 --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2026-04-03 00:59:20 +02:00
Piotr Wilkin (ilintar)	5208e2d5ba	fix: gemma 4 template (#21326 )	2026-04-02 23:31:02 +02:00
Bartowski	7992aa7c8e	tests : add unit test coverage for llama_tensor_get_type (#20112 ) * Add unit test coverage for llama_tensor_get_type * Fix merge conflicts, add more schemas * clang formatter changes * Trailing whitespace * Update name * Start rebase * Updating files with upstream changes prior to rebase * Changes needed from rebase * Update attn_qkv schema, change throw behaviour * Fix merge conflicts * White space * Update with latest changes to state counters * Revert accidental personal CLAUDE.md changes * Change quotation mark * Reuse metadata.name since we have it * Move test-only stuff out of llama-quant.cpp * Hide the regex functionality back in llama-quant.cpp, use a unique pointer to a new struct 'compiled_tensor_type_patterns' which contains the patterns * cont : inital deslop guidelines * Cleanup based on review comments * Continue cleanup * Small cleanup * Manually set proper ordering of tensors, mostly applies to gemma * Formatting * Update tests/test-quant-type-selection.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Fix merge conflicts --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2026-04-02 22:53:58 +02:00
Zheyuan Chen	a1cfb64530	ggml-webgpu: add vectorized flash attention (#20709 ) * naive vectorized version * add vectorized flash attention * update vec version * remove unused path and shader * remove unused helper functions * add comments * remove pad path * ggml-webgpu: fix flash-attn vec nwg=1 path and tighten vec specialization * change back to vec4 * enable multi split * enable vec path when: - Q->ne[1] < 20 - Q->ne[0] % 32 == 0 - V->ne[0] % 4 == 0 - K->type == f16 * update flash_attn_vec_split.wgsl to reduce redundant workgroup barrier usage and use select * enable vec path for q4 and q8 * flash-attn vec nwg=1 fast path (skip tmp/reduce staging) * use packed f16 K loads in flash-attn vec split * use packed f16 K loads in flash-attn vec split on host side * tune flash-attn vec f16 VEC_NE by head dim * cleanup * cleanup * keep host side clean * cleanup host side * change back to original host wait/submit behavior * formatting * reverted param-buffer pool refactor * add helper functions * ggml-webgpu: move flash-attn vec pipeline caching back into shader lib * ggml-webgpu: remove duplicate functions * ggml-webgpu: reserve flash-attn vec scratch in dst buffer allocation * ggml-webgpu: revert unrelated change * ggml-webgpu: revert deleted comment * disable uniformity check * remove unnecessary change * Update ggml/src/ggml-webgpu/wgsl-shaders/flash_attn_vec_split.wgsl * Update ggml/src/ggml-webgpu/ggml-webgpu.cpp --------- Co-authored-by: Reese Levine <reeselevine1@gmail.com>	2026-04-02 10:40:42 -07:00
Concedo	5dee1a1cbb	Merge commit '`fbd441c379`' into concedo_experimental # Conflicts: # .github/workflows/build.yml # AGENTS.md # ci/run.sh # docs/build.md # embd_res/templates/LFM2.5-Instruct.jinja # ggml/CMakeLists.txt # ggml/src/ggml-cuda/fattn.cu # ggml/src/ggml-hexagon/ggml-hexagon.cpp # ggml/src/ggml-hexagon/htp/CMakeLists.txt # ggml/src/ggml-hexagon/htp/htp-msg.h # ggml/src/ggml-hexagon/htp/htp-ops.h # ggml/src/ggml-hexagon/htp/hvx-div.h # ggml/src/ggml-hexagon/htp/main.c # ggml/src/ggml-hexagon/htp/unary-ops.c # ggml/src/ggml-opencl/ggml-opencl.cpp # ggml/src/ggml-sycl/common.hpp # ggml/src/ggml-sycl/convert.cpp # ggml/src/ggml-sycl/dequantize.hpp # ggml/src/ggml-sycl/mmvq.cpp # ggml/src/ggml-sycl/vecdotq.hpp # ggml/src/ggml-webgpu/ggml-webgpu-shader-lib.hpp # ggml/src/ggml-webgpu/ggml-webgpu.cpp # ggml/src/ggml-webgpu/wgsl-shaders/common_decls.tmpl # ggml/src/ggml-webgpu/wgsl-shaders/flash_attn.wgsl # ggml/src/ggml-webgpu/wgsl-shaders/mul_mat_decls.tmpl # ggml/src/ggml-webgpu/wgsl-shaders/mul_mat_vec.wgsl # scripts/hip/gcn-cdna-vgpr-check.py # scripts/sync-ggml.last # tests/test-chat.cpp	2026-04-03 01:06:02 +08:00
Concedo	b286d76873	updated ci to publish and build rolling for all main targets	2026-04-03 00:28:01 +08:00
Concedo	be6ff2602d	added config overwriting	2026-04-03 00:21:14 +08:00
Ruben Ortlam	5803c8d115	tests: allow exporting graph ops from HF file without downloading weights (#21182 ) * tests: allow exporting graph ops from HF file without downloading weights * use unique_ptr for llama_context in HF metadata case * fix missing non-required tensors falling back to type f32 * use unique pointers where possible * use no_alloc instead of fixing f32 fallback * fix missing space	2026-04-02 18:19:20 +02:00
Xuan-Son Nguyen	63f8fe0ef4	model, mtmd: fix gguf conversion for audio/vision mmproj (#21309 ) * fix gguf conversion for audio/vision mmproj * fix test	2026-04-02 17:10:32 +02:00
Aldehir Rojas	223373742b	common : add commentary rules for gpt-oss-20b (#21286 )	2026-04-02 08:59:59 -05:00
Concedo	5f0038cdc9	this section wasnt needed for qwen3.5 jinja and actually breaks.	2026-04-02 21:39:24 +08:00
Piotr Wilkin (ilintar)	e15efe007d	Relax prefill parser to allow space. (#21240 ) * Relax prefill parser to allow space. * Move changes from prefix() to parser generation * Only allow spaces if we're not having a pure content parser next	2026-04-02 11:29:11 +02:00
Jesus Talavera	6137c325a1	chat : add Granite 4.0 chat template with correct tool_call role mapping (#20804 ) * chat : add Granite 4.0 chat template with correct tool_call role mapping Introduce `LLM_CHAT_TEMPLATE_GRANITE_4_0` alongside the existing Granite 3.x template (renamed `LLM_CHAT_TEMPLATE_GRANITE_3_X`). The Granite 4.0 Jinja template uses `<tool_call>` XML tags and maps the `assistant_tool_call` role to `<\|start_of_role\|>assistant<\|end_of_role\|><\|tool_call\|>`. Without a matching C++ handler, the fallback path emits the literal role `assistant_tool_call` which the model does not recognize, breaking tool calling when `--jinja` is not used. Changes: - Rename `LLM_CHAT_TEMPLATE_GRANITE` to `LLM_CHAT_TEMPLATE_GRANITE_3_X` (preserves existing 3.x behavior unchanged) - Add `LLM_CHAT_TEMPLATE_GRANITE_4_0` enum, map entry, and handler - Detection: `<\|start_of_role\|>` + (`<tool_call>` or `<tools>`) → 4.0, otherwise → 3.x - Add production Granite 4.0 Jinja template - Add tests for both 3.x and 4.0 template paths (C++ and Jinja) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * Code review: follow standard format and use common logic in test-chat-template.cpp * Rename custom_conversation variable for extra_conversation to give it a more meaningful name --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-02 11:28:56 +02:00
Georgi Gerganov	17193cce34	kv-cache : do not quantize SWA KV cache (#21277 )	2026-04-02 11:54:05 +03:00

12598 commits