* fix corner case in sd_oai_transform_params
Also fix typo in the function name.
* support for customizing loaded LoRA multipliers
The `sdloramult` flag now accepts a list of multipliers, one for each
LoRA. If all multipliers are non-zero, LoRAs load as before, with no extra
VRAM usage or performance impact.
If any LoRA has a multiplier of 0, we switch to `at_runtime` mode, and these
LoRAs become available for multiplier changes via the `lora` sdapi field and
show up in the `sdapi/v1/loras` endpoint. All LoRAs are still preloaded on
startup, and cached to avoid file reloads.
If the list of multipliers is shorter than the list of LoRAs, it is padded
with its first multiplier (1.0 by default) to stay compatible with the
previous behavior (sketched below).
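A minimal sketch of the padding rule described above, assuming a C++-style helper; the real `sanitize_lora_multipliers` lives elsewhere and its signature and defaults may differ:

```cpp
#include <cstddef>
#include <vector>

// Sketch only: pad the multiplier list so there is one entry per loaded LoRA.
// A short list is padded with its first value (1.0f if the list is empty),
// which reproduces the old single-multiplier behavior.
static std::vector<float> sanitize_lora_multipliers(std::vector<float> mults, size_t n_loras) {
    const float fill = mults.empty() ? 1.0f : mults.front();
    while (mults.size() < n_loras) {
        mults.push_back(fill);
    }
    return mults;
}

// Sketch only: any zero multiplier means the LoRAs must stay switchable,
// i.e. the `at_runtime` mode described above.
static bool any_zero_multiplier(const std::vector<float> & mults) {
    for (float m : mults) {
        if (m == 0.0f) {
            return true;
        }
    }
    return false;
}
```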
* support for `<lora:name:multiplier>` prompt syntax and metadata
* add a few tests for sanitize_lora_multipliers
* sd: sync to master-509-4cdfff5
* sd: Anima support
* sd: sync to master-514-5792c66
* sd: additional workaround for Anima .safetensors model
* sd: sync to master-517-ba35dd7
* sd: sync to master-520-d950627
* fix: token usage reporting for mistral-vibe
* fix: generate unique request IDs for OAI-compatible responses
* fix: prompt_tokens reporting KV cache size instead of actual count during streaming
* fixes for PR #2015
For (1), this is not a good idea: if the underlying count comes back as 0 (e.g. during an error), the cached value may not have been updated and would report the value of a previous or different request. It's better to just return 0 in those cases.
For (2), this is a good idea, but we don't need that level of randomness. I'll probably replace it with a 6-digit random number instead.
For (3), the official OpenAI spec gates it behind stream_options.include_usage = true, so I'll do that too.
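A hedged sketch of both points, with illustrative names (not the actual implementation):

```cpp
#include <random>
#include <string>

// Sketch only: a short random suffix is enough to keep response IDs unique per
// request without a full UUID (the "6-digit random number" mentioned above).
static std::string make_response_id() {
    static std::mt19937 rng{std::random_device{}()};
    std::uniform_int_distribution<int> dist(100000, 999999);
    return "chatcmpl-" + std::to_string(dist(rng));
}

// Sketch only: usage is attached to streamed responses only when the client
// opted in via stream_options.include_usage = true, matching the OpenAI spec.
static bool should_emit_usage(bool streaming, bool include_usage) {
    return !streaming || include_usage;
}
```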
* missed 1 item
---------
Co-authored-by: Concedo <39025047+LostRuins@users.noreply.github.com>
* tweak format string types
This may not be all of them, but these are the ones that warn on OpenBSD.
* complete the changes needed to fix the format string specifiers
* avoid using inttypes; directly cast to size_t (usually u64) instead
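For context, the pattern being applied looks like the generic example below (not one of the actual call sites):

```cpp
#include <cstdint>
#include <cstdio>

int main() {
    uint64_t n_bytes = 123456789ULL;
    // Instead of pulling in <inttypes.h> for PRIu64, cast to size_t and print
    // with %zu, which avoids the format/argument mismatch warnings seen on OpenBSD.
    std::printf("buffer size: %zu bytes\n", (size_t) n_bytes);
    return 0;
}
```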
---------
Co-authored-by: Concedo <39025047+LostRuins@users.noreply.github.com>
Many models have vocabulary sizes, and thus tensor shapes, with more
than 5 digits (e.g. Gemma 3's vocab size is 262,208).
I already fixed this for one of the tensor-shape formatting paths but missed
`llama_format_tensor_shape` until now. Oops.
* Set C locale for consistent float formatting across all binaries.
* Add C locale setting to all tools binaries
Add std::setlocale(LC_NUMERIC, "C") to all 16 binaries in the tools/
directory to ensure consistent floating-point formatting.
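For reference, the call being added to each tool's main() looks like this (a simplified example, not one of the actual binaries):

```cpp
#include <clocale>
#include <cstdio>

int main() {
    // Force the "C" locale for numeric formatting so floats always print with a
    // '.' decimal separator, independent of the user's system locale.
    std::setlocale(LC_NUMERIC, "C");

    std::printf("%.3f\n", 3.141593); // prints "3.142", never "3,142"
    return 0;
}
```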
* Apply suggestion from @JohannesGaessler
---------
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* ggml-webgpu: fix workgroup dispatch limit for large batch sizes
WebGPU caps the number of workgroups at 65535 per dispatch dimension. Large MUL_MAT
operations with batch sizes exceeding this limit would fail (the split is sketched
after the list below).
* add compute_2d_workgroups() helper to split total workgroup ID across
X/Y dimensions
* update mul_mat_reg_tile.wgsl to reconstruct linear workgroup ID from 2D
dispatch
* update mul_mat_subgroup_matrix.wgsl to reconstruct linear workgroup ID
from 2D dispatch
* update mul_mat.wgsl to compute global index from 2D workgroup
coordinates
* refactor all three mul_mat dispatch paths to use the shared helper
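A minimal sketch of the split, assuming a helper shaped roughly like `compute_2d_workgroups`; the actual implementation (and the later optimization noted below) may differ:

```cpp
#include <cstdint>
#include <utility>

// WebGPU caps each dispatch dimension at 65535 workgroups.
static constexpr uint32_t WEBGPU_MAX_WG_PER_DIM = 65535;

// Sketch only: split a linear workgroup count across the X and Y dispatch
// dimensions. The shader reconstructs the linear ID as wg_y * x_dim + wg_x,
// so x_dim also has to reach the shader (e.g. through the params buffer).
static std::pair<uint32_t, uint32_t> compute_2d_workgroups(uint32_t total) {
    if (total <= WEBGPU_MAX_WG_PER_DIM) {
        return { total, 1 };
    }
    const uint32_t x = WEBGPU_MAX_WG_PER_DIM;
    const uint32_t y = (total + x - 1) / x; // ceil division: may over-dispatch
    return { x, y };
}
```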
* ggml-webgpu: add bounds checking for over-dispatched workgroups
2D workgroup dispatch can over-dispatch when total workgroups don't
divide evenly into the 65535 per-dimension limit. Extra workgroups
would compute invalid batch indices, causing memory corruption.
* add batch_idx bound check to mul_mat_reg_tile.wgsl and
mul_mat_subgroup_matrix.wgsl to prevent over-dispatched workgroups
from accessing invalid memory
* fixes test failures with large batch sizes (e.g., bs=[128, 1024])
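The guard itself lives in WGSL; here is a C++-flavoured sketch of the shader logic with illustrative names, just to show the idea:

```cpp
#include <cstdint>

// Sketch only: each workgroup rebuilds its linear ID from the 2D dispatch and
// bails out when it falls past the real work count, so the extra workgroups
// produced by the ceil division above never reach an out-of-range batch_idx.
static void mul_mat_workgroup(uint32_t wg_x, uint32_t wg_y, uint32_t x_dim, uint32_t n_valid) {
    const uint32_t linear_id = wg_y * x_dim + wg_x;
    if (linear_id >= n_valid) {
        return; // over-dispatched workgroup: nothing to do
    }
    // ... derive batch_idx from linear_id and run the actual tile computation ...
}
```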
* ggml-webgpu: add back TODO for splitting large sizes into batches
* Optimize 2d workgroup provisioning
* Set some parameters that increase speed
---------
Co-authored-by: Reese Levine <reeselevine1@gmail.com>
* Allow webgpu_buf_pool to resize if needed, and replace inflight_threads with num_kernels for submission
* Run clang-format
* Keep track of num batched kernels that have not been submitted yet
* Run clang-format
* Increase buf pool max size
* Increase param buf pool init size
* Remove webgpu buf pool resizing
* Merge with master
* Add buffer pool growth
* Move buffer pool growth outside of lock
* Reduce max pool size to 32
* Run clang-format
* Only resize param buf pool
* ggml-webgpu: Add binary op support for overlapping and non-contiguous sources.
* Add newline to binary.wgsl
* Add a binary op test for overlapping src to test_bin_bcast.
* Remove unnecessary newline.