koboldcpp

mirror of https://github.com/LostRuins/koboldcpp.git synced 2026-05-08 01:41:37 +00:00

Author	SHA1	Message	Date
Concedo	0df18d2ae2	fixed single token bans	2026-03-07 22:50:53 +08:00
Concedo	a40038d8e6	further reverse the mxfp4 changes	2026-03-07 22:42:22 +08:00
Concedo	d20e60ddd5	Merge branch 'upstream' into concedo_experimental # Conflicts: # docs/build.md # examples/batched/batched.cpp # examples/convert-llama2c-to-ggml/convert-llama2c-to-ggml.cpp # examples/deprecation-warning/deprecation-warning.cpp # examples/eval-callback/eval-callback.cpp # examples/gen-docs/gen-docs.cpp # examples/gguf-hash/gguf-hash.cpp # examples/gguf/gguf.cpp # examples/lookahead/lookahead.cpp # examples/lookup/lookup-create.cpp # examples/lookup/lookup-merge.cpp # examples/lookup/lookup-stats.cpp # examples/lookup/lookup.cpp # examples/parallel/parallel.cpp # examples/passkey/passkey.cpp # examples/retrieval/retrieval.cpp # examples/save-load-state/save-load-state.cpp # examples/simple-chat/simple-chat.cpp # examples/simple/simple.cpp # examples/speculative-simple/speculative-simple.cpp # examples/speculative/speculative.cpp # examples/sycl/ls-sycl-device.cpp # examples/training/finetune.cpp # ggml/src/ggml-cpu/CMakeLists.txt # ggml/src/ggml-cpu/amx/common.h # ggml/src/ggml-cpu/kleidiai/kernels.cpp # ggml/src/ggml-opencl/CMakeLists.txt # ggml/src/ggml-opencl/ggml-opencl.cpp # ggml/src/ggml-opencl/kernels/cvt.cl # ggml/src/ggml-opencl/kernels/gemv_noshuffle_general_q8_0_f32.cl # ggml/src/ggml-opencl/kernels/transpose.cl # ggml/src/ggml-webgpu/ggml-webgpu-shader-lib.hpp # ggml/src/ggml-webgpu/ggml-webgpu.cpp # ggml/src/ggml-webgpu/wgsl-shaders/mul_mat.wgsl # ggml/src/ggml-webgpu/wgsl-shaders/mul_mat_reg_tile.wgsl # ggml/src/ggml-webgpu/wgsl-shaders/mul_mat_subgroup_matrix.wgsl # scripts/get-wikitext-2.sh # tests/test-backend-ops.cpp # tools/batched-bench/batched-bench.cpp # tools/cvector-generator/cvector-generator.cpp # tools/export-lora/export-lora.cpp # tools/imatrix/imatrix.cpp # tools/llama-bench/llama-bench.cpp # tools/perplexity/perplexity.cpp # tools/rpc/rpc-server.cpp # tools/tokenize/tokenize.cpp	2026-03-06 21:19:49 +08:00
Concedo	2c38638b3d	Merge commit '`2afcdb9777`' into concedo_experimental # Conflicts: # scripts/sync_vendor.py # tests/CMakeLists.txt	2026-03-06 21:13:15 +08:00
Concedo	abcca8c0f9	do not use the mxfp4 repack - repack must be synced again from before this commit if it's ever to be used in future. this will break compilation with older w64devkit	2026-03-06 21:07:41 +08:00
Gustavo Rocha Dias	cbecc34667	Fix OAI-compatible token usage and unique request IDs (#2015 ) * fix: token usage fix for mistral-vibe * fix: generate unique request IDs for OAI-compatible responses * fix: prompt_tokens reporting KV cache size instead of actual count during streaming * fixes for PR #2015 For (1), this is not a good idea. If it returned 0 (e.g. during an error), this value may not be updated and will return the value of a previous or different request. It's better to return 0 in those cases. For (2), this is a good idea but we don't need that level of randomness. I'll probably swap it with a 6 digit random number instead. For (3), the official openai spec gates it behind stream_options.include_usage = true so i'll do that too * missed 1 item --------- Co-authored-by: Concedo <39025047+LostRuins@users.noreply.github.com>	2026-03-06 20:57:22 +08:00
JustCommitRandomness	2fbc3b2ae5	Adjust int types in format strings (#2009 ) * tweak format string types This may not be all of them, but it's the ones which warn on OpenBSD * complete the changes needed to fix the format string specifiers * avoid using inttypes, directly cast to size_t (u64 usually) instead --------- Co-authored-by: Concedo <39025047+LostRuins@users.noreply.github.com>	2026-03-06 19:06:18 +08:00
Concedo	e36d7b6464	warn about RNN models not supporting antislop	2026-03-06 14:02:51 +08:00
JustCommitRandomness	389773070f	OpenBSD also needs alloca.h (#2012 )	2026-03-05 12:32:31 +08:00
Concedo	8658af1018	qwen3tts default to cpu unless gpu selected	2026-03-05 11:11:46 +08:00
Concedo	da2bde4767	updated readme	2026-03-05 01:35:45 +08:00
Concedo	4f1b22c415	kv snapshots save and load last logits for correctness. added some text for musicui, updated docs	2026-03-04 21:57:28 +08:00
Sigbjørn Skjæret	d969e933e1	tools : add missing clocale include in mtmd-cli [no ci] (#20107 )	2026-03-04 14:18:04 +01:00
Johannes Gäßler	7f5ee54968	ggml: fix ggml_is_contiguous_n for ne == 1 (#20092 )	2026-03-04 12:04:31 +01:00
Adrien Gallouët	66199c9f03	ggml : use a simple std::thread in AMX without OpenMP (#20074 ) Disabling OpenMP generally provides better inference performance (at least in my testing) but the loading becomes slightly slower. Benchmark results for `convert_B_packed_format()`: Before this commit: N K \| No OpenMP OpenMP \| Diff \| Speedup ------------------------------------------------------------ 512 2880 \| 640.9us 263.5us \| -58.9% \| 0.41x 2880 4096 \| 2.55ms 261.7us \| -89.8% \| 0.10x 201088 2880 \| 256.44ms 21.61ms \| -91.6% \| 0.08x ------------------------------------------------------------ Total: 325.43ms vs 31.05ms After: N K \| No OpenMP OpenMP \| Diff \| Speedup ------------------------------------------------------------ 512 2880 \| 1.49ms 263.5us \| -82.3% \| 0.18x 2880 4096 \| 1.55ms 261.7us \| -83.1% \| 0.17x 201088 2880 \| 24.03ms 21.61ms \| -10.1% \| 0.90x ------------------------------------------------------------ Total: 78.97ms vs 31.05ms Tested with unsloth/gpt-oss-20b-GGUF:Q4_K_M. Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2026-03-04 11:57:09 +01:00
ddh0	c99909dd0b	impl : use 6 digits for tensor dims (#20094 ) Many models have vocabulary sizes, and thus tensor shapes, with more than 5 digits (ex: Gemma 3's vocab size is 262,208). I already fixed this for `llama_format_tensor_shape` but missed it for `llama_format_tensor_shape` until now. Oops.	2026-03-04 09:53:38 +01:00
SamareshSingh	cb8f4fa3f8	Fix locale-dependent float printing in GGUF metadata (#17331 ) * Set C locale for consistent float formatting across all binaries. * Add C locale setting to all tools binaries Add std::setlocale(LC_NUMERIC, "C") to all 16 binaries in the tools/ directory to ensure consistent floating-point formatting. * Apply suggestion from @JohannesGaessler --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2026-03-04 09:30:40 +01:00
standby24x7	54910bd4f3	completion : Fix a typo in warning message (#20082 ) resuse -> reuse	2026-03-04 06:44:49 +01:00
Concedo	54cf43ae64	rnn fix adjust	2026-03-04 10:59:51 +08:00
Mickael Desgranges	ecd99d6a9a	docs: Fix intel documentation link (#20040 )	2026-03-03 21:50:00 +08:00
Concedo	5d35193749	fixed a sse stream issue	2026-03-03 21:30:28 +08:00
Concedo	7df210833e	missed one case for autofit	2026-03-03 21:05:59 +08:00
Concedo	707f7b37bf	optimize pp	2026-03-03 21:02:51 +08:00
Charles Xu	137435ff15	kleidiai : add sme fp16 compute path for q4_0 gemm on aarch64 (#20043 )	2026-03-03 11:40:26 +02:00
shaofeiqi	24350fdf9b	opencl: add optimized q4_1 mm kernel for adreno (#19840 ) * Add Q4_1 OpenCL Kernels * opencl: refactor transpose * opencl: format * opencl: refactor q4_1 unpack * opencl: move `ggml_cl_mul_mat_q4_1_f32_adreno` * opencl: refactor `ggml_cl_mul_mat_q4_1_f32_adreno` and kernels * opencl: rename kernel files and kernels * opencl: fix build for non adreno * opencl: move code around and format --------- Co-authored-by: Li He <lih@qti.qualcomm.com>	2026-03-02 19:49:41 -08:00
Abhijit Ramesh	49a7564ac1	ggml webgpu: fix workgroup dispatch limit for large batch sizes (#19965 ) * ggml-webgpu: fix workgroup dispatch limit for large batch sizes WebGPU limits workgroup sizes to 65535 per dimension. Large MUL_MAT operations with batch sizes exceeding this limit would fail. * add compute_2d_workgroups() helper to split total workgroup ID across X/Y dimensions * update mul_mat_reg_tile.wgsl to reconstruct linear workgroup ID from 2D dispatch * update mul_mat_subgroup_matrix.wgsl to reconstruct linear workgroup ID from 2D dispatch * update mul_mat.wgsl to compute global index from 2D workgroup coordinates * refactor all three mul_mat dispatch paths to use the shared helper * ggml-webgpu: add bounds checking for over-dispatched workgroups 2D workgroup dispatch can over-dispatch when total workgroups don't divide evenly into the 65535 per-dimension limit. Extra workgroups would compute invalid batch indices, causing memory corruption. * add batch_idx bound check to mul_mat_reg_tile.wgsl and mul_mat_subgroup_matrix.wgsl to prevent over-dispatched workgroups from accessing invalid memory * fixes test failures with large batch sizes (e.g., bs=[128, 1024]) * ggml-webgpu: add back TODO for splitting large sizes into batches * Optimize 2d workgroup provisioning * Set some parameters that increase speed --------- Co-authored-by: Reese Levine <reeselevine1@gmail.com>	2026-03-02 19:35:11 -08:00
Nikhil Jain	4d828bd1ab	ggml webgpu: Clean up per-thread parameter buffer pool and job submission logic (#19772 ) * Allow webgpu_buf_pool to resize if needed, remove inflight_threads, and replace inflight_threads with num_kernels for submission * Run clang-format * Keep track of num batched kernels that have not been submitted yet * Run clang-format * Increase buf pool max size * Increase param buf pool init size * Remove webgpu buf pool resizing * Merge with master * Add buffer pool growth * Move buffer pool growth outside of lock * Reduce max pool size to 32 * Run clang-format * Only resize param buf pool	2026-03-02 10:23:34 -08:00
Masashi Yoshimura	36a7a6589c	ggml-webgpu: Support non-contiguous `src0` and overlapping `src0/src1` in binary ops (#19850 ) * ggml-webgpu: Add binary op support for overlapping and non-contiguous. * Add newline to binary.wgsl * Append the test of binary op for src overlapping to test_bin_bcast. * Remove unnecessary newline.	2026-03-02 07:59:53 -08:00
Ruben Ortlam	feefb92836	vulkan: tune MMVQ for Intel Windows (#19988 )	2026-03-02 15:58:25 +01:00
Adrien Gallouët	ec88c3ceea	scripts : improve get-wikitext-2.sh (#19952 ) * scripts : improve get-wikitext-2.sh Switch to sh, add curl fallback, and avoid redundant downloads Signed-off-by: Adrien Gallouët <adrien@gallouet.fr> * fix indent Signed-off-by: Adrien Gallouët <angt@huggingface.co> --------- Signed-off-by: Adrien Gallouët <adrien@gallouet.fr> Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2026-03-02 15:40:49 +01:00
Concedo	ae67caa2f7	ace qwen rep pen for codes	2026-03-02 21:18:06 +08:00
Concedo	de9840afac	qwen image max ref image size fix from 512x512 to 1024x1024	2026-03-02 21:08:52 +08:00
Concedo	b632d2ce1c	print timestamp when image generated	2026-03-02 18:38:21 +08:00
Concedo	cf158f1b6e	updated lite	2026-03-02 16:59:16 +08:00
Aaron Teo	2afcdb9777	ggml-cpu: optimise s390x multiply extend instructions (#20032 )	2026-03-02 16:23:56 +08:00
Concedo	d7fb3df10a	support 1 level deep admindir	2026-03-02 16:23:34 +08:00
Concedo	d904b51b0f	adjust slot counts	2026-03-02 15:56:15 +08:00
Concedo	42134db6b4	finally fixed smartcache for qwen	2026-03-02 00:47:38 +08:00
Ruben Ortlam	319146247e	vulkan: improve partial offloading performance on AMD (#19976 ) * vulkan: fix and enable cpy_tensor_async function * use transfer_queue for async transfers on AMD, synchronize with timeline semaphore * update offload_op logic * fix missing transfer submission * disable async transfer queue on AMD GCN * revert op batch size change * fix cpy_tensor_async checks	2026-03-01 17:32:14 +01:00
oobabooga	66d65ec29b	cuda: cap grid.y at 65535 in non-contiguous dequantize/convert kernels (#19999 )	2026-03-01 13:40:22 +08:00
Concedo	6c5a7a27af	clamp music duration	2026-03-01 01:15:26 +08:00
Concedo	c9e651f7e5	updated lite, fix some cuda spams, fix qwen3tts voice loading	2026-03-01 00:41:56 +08:00
Dmitry Atamanov	05728db18e	vendors : update miniaudio library to 0.11.24 (#19914 )	2026-02-28 16:10:01 +01:00
Adrien Gallouët	4720819d45	vendor : update cpp-httplib to 0.35.0 (#19969 ) Signed-off-by: Adrien Gallouët <adrien@gallouet.fr>	2026-02-28 13:53:56 +01:00
Concedo	0b76f73fc2	smartcache bug seems to be fixed	2026-02-28 18:08:54 +08:00
Bartowski	d979f2b176	tests : model metadata loading from huggingface (#19796 ) * Add model metadata loading from huggingface for use with other tests * Add incremental chunking instead of full redownload, fix caching issue and add warning when it fails * Add support for split models, load metadata from each individual split file, also avoid mmproj * Code cleanup, revert incremental downloading * Only compile when cpp-httplib has SSL support * Fix formatting	2026-02-28 10:44:38 +01:00
Concedo	4e358265a3	Merge commit '`8387ffb28d`' into concedo_experimental # Conflicts: # docs/backend/VirtGPU.md # docs/backend/ZenDNN.md # ggml/src/ggml-cpu/amx/amx.cpp # ggml/src/ggml-cpu/amx/mmq.cpp # ggml/src/ggml-sycl/add-id.cpp # ggml/src/ggml-virtgpu/backend/backend-dispatched-backend.cpp # ggml/src/ggml-virtgpu/backend/backend-dispatched-buffer-type.cpp # ggml/src/ggml-virtgpu/backend/backend-dispatched-buffer.cpp # ggml/src/ggml-virtgpu/backend/backend-dispatched.cpp # ggml/src/ggml-virtgpu/backend/backend-dispatched.gen.h # ggml/src/ggml-virtgpu/backend/backend-dispatched.h # ggml/src/ggml-virtgpu/backend/backend-virgl-apir.h # ggml/src/ggml-virtgpu/backend/backend.cpp # ggml/src/ggml-virtgpu/backend/shared/api_remoting.h # ggml/src/ggml-virtgpu/backend/shared/apir_backend.gen.h # ggml/src/ggml-virtgpu/backend/shared/apir_backend.h # ggml/src/ggml-virtgpu/backend/shared/apir_cs.h # ggml/src/ggml-virtgpu/backend/shared/apir_cs_ggml.h # ggml/src/ggml-virtgpu/backend/shared/apir_cs_rpc.h # ggml/src/ggml-virtgpu/ggml-backend-buffer-type.cpp # ggml/src/ggml-virtgpu/ggml-backend-device.cpp # ggml/src/ggml-virtgpu/ggml-backend-reg.cpp # ggml/src/ggml-virtgpu/ggml-backend.cpp # ggml/src/ggml-virtgpu/ggml-remoting.h # ggml/src/ggml-virtgpu/include/apir_hw.h # ggml/src/ggml-virtgpu/regenerate_remoting.py # ggml/src/ggml-virtgpu/virtgpu-forward-backend.cpp # ggml/src/ggml-virtgpu/virtgpu-forward-buffer-type.cpp # ggml/src/ggml-virtgpu/virtgpu-forward-buffer.cpp # ggml/src/ggml-virtgpu/virtgpu-forward-device.cpp # ggml/src/ggml-virtgpu/virtgpu-forward-impl.h # ggml/src/ggml-virtgpu/virtgpu-forward.gen.h # ggml/src/ggml-virtgpu/virtgpu.cpp # ggml/src/ggml-virtgpu/virtgpu.h # ggml/src/ggml-zendnn/CMakeLists.txt # ggml/src/ggml-zendnn/ggml-zendnn.cpp # src/CMakeLists.txt # tests/CMakeLists.txt # tests/test-tokenizer-0.sh # tools/cli/README.md # tools/completion/README.md # tools/imatrix/imatrix.cpp # tools/server/README.md	2026-02-28 12:45:16 +08:00
Wagner Bruna	5c40f07d4a	sd: sync to 0752cc9 (master-507-b314d80 +1) (#1999 ) * sd: sync to 0752cc9 (master-507-b314d80 +1) * sd: add flow-shift support to gendefaults	2026-02-28 12:22:32 +08:00
Concedo	d643d945f5	clamp music inference steps to 100 max	2026-02-28 12:12:50 +08:00
Concedo	dd08d675f2	incomplete fix for rnn models, load state works but logits slightly different	2026-02-28 11:52:24 +08:00

11943 commits