koboldcpp

mirror of https://github.com/LostRuins/koboldcpp.git synced 2026-06-01 06:00:36 +00:00

Author	SHA1	Message	Date
Concedo	5b22858dbd	updated docs	2026-03-12 00:20:20 +08:00
Concedo	3cc6e2ea17	make stereo default	2026-03-12 00:10:25 +08:00
Concedo	211d4fe632	lots of tweaks for ace step	2026-03-11 23:57:52 +08:00
Concedo	ecc4865244	improves code output quality	2026-03-10 23:07:52 +08:00
Concedo	8095bf9807	include overhead from music models	2026-03-10 22:52:20 +08:00
Concedo	6adcd0b5db	Merge commit '`34df42f7be`' into concedo_experimental # Conflicts: # README.md # ggml/src/ggml-hexagon/ggml-hexagon.cpp # ggml/src/ggml-hexagon/htp/CMakeLists.txt # ggml/src/ggml-hexagon/htp/act-ops.c # ggml/src/ggml-hexagon/htp/binary-ops.c # ggml/src/ggml-hexagon/htp/cpy-ops.c # ggml/src/ggml-hexagon/htp/get-rows-ops.c # ggml/src/ggml-hexagon/htp/htp-msg.h # ggml/src/ggml-hexagon/htp/htp-ops.h # ggml/src/ggml-hexagon/htp/hvx-arith.h # ggml/src/ggml-hexagon/htp/hvx-base.h # ggml/src/ggml-hexagon/htp/hvx-inverse.h # ggml/src/ggml-hexagon/htp/hvx-utils.h # ggml/src/ggml-hexagon/htp/main.c # ggml/src/ggml-hexagon/htp/rope-ops.c # ggml/src/ggml-hexagon/htp/set-rows-ops.c # ggml/src/ggml-hexagon/htp/softmax-ops.c # ggml/src/ggml-hexagon/htp/unary-ops.c # ggml/src/ggml-opencl/CMakeLists.txt # ggml/src/ggml-opencl/ggml-opencl.cpp # tests/test-backend-ops.cpp # tools/cli/cli.cpp # tools/server/webui/src/lib/components/app/chat/ChatScreen/ChatScreen.svelte	2026-03-10 22:20:04 +08:00
Concedo	746664fde6	Merge commit '`2cd20b72ed`' into concedo_experimental # Conflicts: # CONTRIBUTING.md # docs/backend/CANN.md # docs/backend/SYCL.md # docs/backend/snapdragon/README.md # docs/backend/snapdragon/windows.md # docs/build.md # docs/multimodal/MobileVLM.md # docs/ops.md # docs/ops/WebGPU.csv # examples/debug/README.md # examples/llama.vim # examples/model-conversion/README.md # examples/sycl/README.md # ggml/src/ggml-cpu/amx/mmq.cpp # ggml/src/ggml-cpu/arch/x86/repack.cpp # ggml/src/ggml-hexagon/ggml-hexagon.cpp # ggml/src/ggml-hexagon/htp-drv.cpp # ggml/src/ggml-hexagon/htp/flash-attn-ops.c # ggml/src/ggml-hexagon/htp/hvx-base.h # ggml/src/ggml-hexagon/htp/hvx-copy.h # ggml/src/ggml-hexagon/htp/hvx-inverse.h # ggml/src/ggml-hexagon/htp/hvx-reduce.h # ggml/src/ggml-hexagon/htp/matmul-ops.c # ggml/src/ggml-hexagon/htp/rope-ops.c # ggml/src/ggml-hexagon/htp/worker-pool.c # ggml/src/ggml-opencl/ggml-opencl.cpp # ggml/src/ggml-opencl/kernels/cpy.cl # ggml/src/ggml-sycl/common.hpp # ggml/src/ggml-sycl/quants.hpp # ggml/src/ggml-sycl/softmax.cpp # ggml/src/ggml-vulkan/CMakeLists.txt # ggml/src/ggml-webgpu/ggml-webgpu-shader-lib.hpp # ggml/src/ggml-webgpu/ggml-webgpu.cpp # scripts/pr2wt.sh # scripts/server-bench.py # scripts/snapdragon/windows/run-cli.ps1 # tests/test-alloc.cpp # tests/test-backend-ops.cpp # tests/test-chat.cpp # tools/cli/cli.cpp # tools/completion/README.md # tools/cvector-generator/cvector-generator.cpp # tools/imatrix/README.md # tools/perplexity/README.md # tools/server/public_simplechat/readme.md # tools/server/tests/README.md	2026-03-10 22:11:08 +08:00
Concedo	c8800ed16c	gcc path fix	2026-03-10 21:40:32 +08:00
Concedo	b06dd2606e	ruff: linting	2026-03-10 21:32:36 +08:00
Wagner Bruna	3f42ed1af7	support for customizing LoRA multipliers through the sdapi (#1982 ) * fix corner case in sd_oai_transform_params Also fix typo in the function name. * support for customizing loaded LoRA multipliers The `sdloramult` flag now accepts a list of multipliers, one for each LoRA. If all multipliers are non-zero, LoRAs load as before, with no extra VRAM usage or performance impact. If any LoRA has a multiplier of 0, we switch to `at_runtime` mode, and these LoRAs will be available to multiplier changes via the `lora` sdapi field and show up in the `sdapi/v1/loras` endpoint. All LoRAs are still preloaded on startup, and cached to avoid file reloads. If the list of multipliers is shorter than the list of LoRAs, the multiplier list is extended with the first multiplier (1.0 by default), to keep it compatible with the previous behavior. * support for `<lora:name:multiplier>` prompt syntax and metadata * add a few tests for sanitize_lora_multipliers	2026-03-10 21:29:39 +08:00
Concedo	eafb5ff4c5	autofit improvement e.g. for strix (+1 squashed commits) Squashed commits: [`6f6fd59c3`] autofit improvement e.g. for strix	2026-03-10 21:20:02 +08:00
Concedo	500a1ab466	disable smartcache if slots is zero	2026-03-10 08:57:31 +08:00
Concedo	2bd6b87d5b	remove a file	2026-03-09 23:08:53 +08:00
Concedo	ee96e71bae	don't resample audio	2026-03-09 22:53:55 +08:00
Concedo	45c74da08b	adjust ace step, still wip on caption rework	2026-03-09 00:11:48 +08:00
JustCommitRandomness	9ddd74111f	OpenBSD changes for vulkan backend (#2026 ) * OpenBSD also needs alloca.h * Changes to compile vulkan backend with OpenBSD * Update README.md tweak details for OpenBSD vulkan backend * Update README.md	2026-03-08 20:41:36 +08:00
Concedo	270d4ad2c1	fixed a typo	2026-03-08 12:56:08 +08:00
Concedo	73fc5c4767	handle jinja exceptions	2026-03-08 12:12:02 +08:00
Concedo	41df8b09e5	jinjatools now works mostly well	2026-03-08 11:55:22 +08:00
Concedo	a981d1ece9	updated lite	2026-03-08 02:33:18 +08:00
Wagner Bruna	9158bd8b4d	sd: sync to master-520-d950627 (#2006 ) * sd: sync to master-509-4cdfff5 * sd: Anima support * sd: sync to master-514-5792c66 * sd: additional workaround for Anima .safetensors model * sd: sync to master-517-ba35dd7 * sd: sync to master-520-d950627	2026-03-08 01:23:03 +08:00
Concedo	ebe44e7819	modify q3tts loader	2026-03-08 00:53:33 +08:00
Concedo	0df18d2ae2	fixed single token bans	2026-03-07 22:50:53 +08:00
Concedo	a40038d8e6	further reverse the mxfp4 changes	2026-03-07 22:42:22 +08:00
Todor Boinovski	34df42f7be	hexagon: add f32 ssm_conv op (#20122 ) * hexagon: add ssm_conv op * hexagon: hvx kernel is functional * hexagon: improvements to ssm-conv hvx kernel * hexagon: added dma to ssm-conv hvx kernel * hexagon: ssm-conv dynamically compute gather scratchpad * hex-ssm-conv: add local context and fix various issues (spad indexing, etc) --------- Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>	2026-03-06 09:59:26 -08:00
Tom Vaucourt	e68f2fb894	server : preserve anthropic thinking blocks in conversion (#20120 ) * server : preserve anthropic thinking blocks in conversion (#20090) * server : add tests for anthropic thinking block conversion --------- Co-authored-by: root <root@llamacpp.home>	2026-03-06 17:41:12 +01:00
Max Krasnyansky	ba2fd11cdf	cpu: skip redundant ROPE cache updates (#20149 )	2026-03-06 08:32:40 -08:00
Aman Gupta	d48e876467	ggml-cuda: add mem check for fusion (#19916 ) * ggml-cuda: add mem check for fusion * Replace NaNs with -FLT_MAX * fix typo Co-authored-by: Johannes Gäßler <johannesg@5d6.de> --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2026-03-07 00:05:43 +08:00
Aaron Teo	ba2ff79e43	ggml: update comments for backends which have no memory to report (#20157 ) Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>	2026-03-06 23:24:38 +08:00
shalinib-ibm	c6980ff29d	ggml-cpu: Fix gcc 15 ICE on ppc64le (#20083 ) (#20130 ) This patch addresses an Internal Compiler Error (Segmentation fault) observed with gcc 15 by replacing the intrinsic + cast by doing a cat on the data first and then calling the intrinsic. This bypasses the buggy compiler path while maintaining identical instruction selection. Performance Verification: Assembly analysis on RHEL 9 (GCC 15.1.1) confirms that both the original code and this fix generate the identical Power10 prefixed load instruction: `plxv 40, 2(14)` This ensures zero performance regression while unblocking builds on newer toolchains. Reproduced on: - Alpine Linux + GCC 15.2.0-r2 - RHEL 9 + GCC 15.1.1 (gcc-toolset-15) Signed-off-by: Shalini Salomi Bodapati <Shalini.Salomi.Bodapati@ibm.com>	2026-03-06 23:22:39 +08:00
Aman Gupta	1e38a7a6fa	CUDA: use shared mem for ssm_conv (#20128 ) * CUDA: use shared mem for ssm_conv * fuse silu + ssm_conv * fuse unary + mul * enable for fp16 * formatting Co-authored-by: Johannes Gäßler <johannesg@5d6.de> --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2026-03-06 23:09:59 +08:00
Concedo	d20e60ddd5	Merge branch 'upstream' into concedo_experimental # Conflicts: # docs/build.md # examples/batched/batched.cpp # examples/convert-llama2c-to-ggml/convert-llama2c-to-ggml.cpp # examples/deprecation-warning/deprecation-warning.cpp # examples/eval-callback/eval-callback.cpp # examples/gen-docs/gen-docs.cpp # examples/gguf-hash/gguf-hash.cpp # examples/gguf/gguf.cpp # examples/lookahead/lookahead.cpp # examples/lookup/lookup-create.cpp # examples/lookup/lookup-merge.cpp # examples/lookup/lookup-stats.cpp # examples/lookup/lookup.cpp # examples/parallel/parallel.cpp # examples/passkey/passkey.cpp # examples/retrieval/retrieval.cpp # examples/save-load-state/save-load-state.cpp # examples/simple-chat/simple-chat.cpp # examples/simple/simple.cpp # examples/speculative-simple/speculative-simple.cpp # examples/speculative/speculative.cpp # examples/sycl/ls-sycl-device.cpp # examples/training/finetune.cpp # ggml/src/ggml-cpu/CMakeLists.txt # ggml/src/ggml-cpu/amx/common.h # ggml/src/ggml-cpu/kleidiai/kernels.cpp # ggml/src/ggml-opencl/CMakeLists.txt # ggml/src/ggml-opencl/ggml-opencl.cpp # ggml/src/ggml-opencl/kernels/cvt.cl # ggml/src/ggml-opencl/kernels/gemv_noshuffle_general_q8_0_f32.cl # ggml/src/ggml-opencl/kernels/transpose.cl # ggml/src/ggml-webgpu/ggml-webgpu-shader-lib.hpp # ggml/src/ggml-webgpu/ggml-webgpu.cpp # ggml/src/ggml-webgpu/wgsl-shaders/mul_mat.wgsl # ggml/src/ggml-webgpu/wgsl-shaders/mul_mat_reg_tile.wgsl # ggml/src/ggml-webgpu/wgsl-shaders/mul_mat_subgroup_matrix.wgsl # scripts/get-wikitext-2.sh # tests/test-backend-ops.cpp # tools/batched-bench/batched-bench.cpp # tools/cvector-generator/cvector-generator.cpp # tools/export-lora/export-lora.cpp # tools/imatrix/imatrix.cpp # tools/llama-bench/llama-bench.cpp # tools/perplexity/perplexity.cpp # tools/rpc/rpc-server.cpp # tools/tokenize/tokenize.cpp	2026-03-06 21:19:49 +08:00
Concedo	2c38638b3d	Merge commit '`2afcdb9777`' into concedo_experimental # Conflicts: # scripts/sync_vendor.py # tests/CMakeLists.txt	2026-03-06 21:13:15 +08:00
Concedo	abcca8c0f9	do not use the mxfp4 repack - repack must be synced again from before this commit if it's ever to be used in future. this will break compilation with older w64devkit	2026-03-06 21:07:41 +08:00
Tim Neumann	388baabc06	context: ignore zero scale LoRAs when checking sameness (#20166 )	2026-03-06 15:05:52 +02:00
Gustavo Rocha Dias	cbecc34667	Fix OAI-compatible token usage and unique request IDs (#2015 ) * fix: token usage fix for mistral-vibe * fix: generate unique request IDs for OAI-compatible responses * fix: prompt_tokens reporting KV cache size instead of actual count during streaming * fixes for PR #2015 For (1), this is not a good idea. If it returned 0 (e.g. during an error), this value may not be updated and will return the value of a previous or different request. It's better to return 0 in those cases. For (2), this is a good idea but we don't need that level of randomness. I'll probably swap it with a 6 digit random number instead. For (3), the official openai spec gates it behind stream_options.include_usage = true so i'll do that too * missed 1 item --------- Co-authored-by: Concedo <39025047+LostRuins@users.noreply.github.com>	2026-03-06 20:57:22 +08:00
JustCommitRandomness	2fbc3b2ae5	Adjust int types in format strings (#2009 ) * tweak format string types This may not be all of them, but it's the ones which warn on OpenBSD * complete the changes needed to fix the format string specifiers * avoid using inttypes, directly cast to size_t (u64 usually) instead --------- Co-authored-by: Concedo <39025047+LostRuins@users.noreply.github.com>	2026-03-06 19:06:18 +08:00
Piotr Wilkin (ilintar)	f5ddcd1696	Checkpoint every n tokens: squash (#20087 )	2026-03-06 11:39:26 +01:00
Aleksander Grygier	f6235a41ef	webui: Agentic Loop + MCP Client with support for Tools, Resources and Prompts (#18655 )	2026-03-06 10:00:39 +01:00
Johannes Gäßler	2850bc6a13	ggml-cpu: fix data race for debug asserts (#20148 )	2026-03-06 09:12:49 +01:00
Georgi Gerganov	17a4258946	kv-cache : fix M-RoPE checkpoints (#20132 )	2026-03-06 08:46:51 +02:00
Concedo	e36d7b6464	warn about RNN models not supporting antislop	2026-03-06 14:02:51 +08:00
Roj234	f7db3f3789	cli : Don't clear system prompt when using '/clear' (#20067 ) * Enhance /clear command to include system prompt Add system prompt to messages when clearing chat history. * Use lambda	2026-03-06 06:41:11 +01:00
lhez	6c97bffd65	opencl: add neg, exp and diag (#20127 ) * opencl: add `neg` * opencl: add `exp` * opencl: add `diag`	2026-03-05 21:16:39 -08:00
YardenTal44	2b10b62677	hexagon: add fp16 support for binary ops: add,sub,mul,div (#20139 ) * hexagon: add fp16 support for binary ops: add,sub,mul,div * hexagon: fix test-backend-ops failures for fp16 binary ops on older arches (<v79) * hexagon: decide on n_threads (aka n_jobs) early to avoid overallocating scratchpad * snapdragon: fix readme link --------- Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>	2026-03-05 18:29:13 -08:00
ymcki	a0ed91a442	models : kda chunk size = 16 (#19827 ) * models : add llm_build_delta_net_base * cont : keep qwen35 and qwen35moe graphs intact * cont : add comments [no ci] * add kimi linear to delta-net-base * removed unnecessary ggml_cont from g_exp_t * removed ggml_cont from g_diff_exp_t. moved ggml_cont for o to kimi-linear.cpp * removed unnecessary diag mask * cont : simplify * cont : avoid graph splits * scale q after mul instead of beginning * scale q after mul instead of beginning * identical ppl * cont : fix scale and decay mask * minor : remove TODO * block implementation for kda * remove space at the end of line 101 * concat+pad * pad+binary row concat * chunk size 16 for kda * removed minor differences to master --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2026-03-05 17:01:23 +02:00
Andreas Kieslinger	2cd20b72ed	CUDA: Improve performance via less synchronizations between token (#17795 ) * Adds CPU-to-CUDA copy capability to ggml_backend_cuda_cpy_tensor_async() * Adds function to relax sync requirements between input copies on supported backends (CUDA for now) * Exchanges synchronous copy with async copy function. * Adds macro guards to allow compilation in non-CUDA builds * Reworked backend detection in ggml-backend.cpp to avoid linking conflicts * Relax requirement of checks in async CUDA copies from backend and buffer type to just buffer type, to avoid linking issues * Minor cleanup * Makes opt-in to relax use of explicit syncs more general. Backends like vulkan which require a synchronization between HtoD copies and graph execution could also adopt this change now. * Reintroduces stricter check for CPU->CUDA backend async copy via GGML_DEVICE_TYPE_CPU. * Corrects initialization of ggml_backend_sync_mode in ggml_backend_sched_split initialization * Simplifies synchronizations to adhere to `saaasg` pattern. * Apply suggestion from @ggerganov (src->buffer to buf_src) Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Apply suggestion from @ggerganov (src->buffer to buf_src) v2 Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2026-03-05 13:53:21 +02:00
Eric Zhang	872646b30c	model : update Qwen3.5 model type detection (#20126 ) * model : fix Qwen3.5 model type detection * Update src/llama-model.cpp whoops, my bad Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2026-03-05 12:47:14 +01:00
Sigbjørn Skjæret	b5ed0e058c	cli : add command and file auto-completion (#19985 )	2026-03-05 10:47:28 +01:00
Sigbjørn Skjæret	cf232515c9	convert : register Qwen 3.5 ForCausalLM for text only (#20119 )	2026-03-05 10:30:02 +01:00

11992 commits