Commit graph

11893 commits

Author SHA1 Message Date
Mario Limonciello
8fdf269dad
ci : update Windows ROCm build to 26.Q1 [no ci] (#19810)
* Update build command to build llama-* tools not just ggml-hip
* Update rocWMMA headers to 7.2
* Add GFX1150 target
* Correct library paths for AMD libraries in 26.Q1
2026-02-25 12:30:19 +01:00
Aldehir Rojas
a96a1120b4
gguf : fix ftell/fseek for Windows (#19870) 2026-02-25 06:58:11 +02:00
Georgi Gerganov
244641955f
models : fix graph splits (#19866) 2026-02-25 00:01:13 +02:00
Pascal
47eb12b953
server: fix query params lost when proxying requests in multi-model router mode (#19854)
* server: fix query params lost when proxying requests in multi-model router mode

* server: re-encode query params using httplib::encode_query_component in proxy
2026-02-24 21:46:06 +01:00
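A minimal sketch of the idea behind the fix above, assuming a cpp-httplib based router that proxies requests to a backend server; the forward() helper and backend client below are illustrative assumptions, not the actual llama-server router code:

```cpp
// Sketch only: keep query parameters when proxying a request with cpp-httplib.
// The backend client and forward() helper are illustrative, not the actual
// llama-server router code.
#include <httplib.h>

static void forward(const httplib::Request & req, httplib::Response & res,
                    httplib::Client & backend) {
    // req.path ("/v1/models") drops "?key=value"; req.target keeps the original
    // request target including its already-encoded query string.
    auto upstream = backend.Get(req.target, req.headers);
    if (upstream) {
        res.status = upstream->status;
        res.set_content(upstream->body, upstream->get_header_value("Content-Type"));
    } else {
        res.status = 502;
    }
}
```

Forwarding req.target (path plus query) is only one way to keep the parameters intact; per the commit message, the actual patch re-encodes the parsed query parameters instead.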
Georgi Gerganov
418dea39ce
ggml/gguf : prevent integer overflows (#19856)
* gguf : prevent integer overflow for ggml_context mem size

* ggml : fix int overflows in ggml_new_object()

* gguf : prevent string exhaustion

* gguf : prevent array elements exhaustion

* ggml : fix negative tensor type oob

* py : assert that alignment is non-zero power of 2

* ggml : check int overflow in ggml_new_tensor_impl and ggml_new_object

* gguf-py : error on duplicate keys when reading

* py : restore tensor_fields

* enforce proper alignment in add_custom_alignment

* gguf : better name

* gguf : fix ctx size for no_alloc == true

* gguf : minor print fix

* ggml : print values when overflow

* ggml : remove deprecated ggml_type_sizef()

* ggml : relax ggml_type asserts to debug-only

* gguf : add mem_size overflow test

* gguf : add file size check for arrays

* ggml : relax asserts for ggml_get_type_traits()

* flake8 fix

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-02-24 20:17:11 +02:00
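The bullet list above mixes several related hardening changes; as a generic illustration (not the actual ggml code), the two recurring techniques are overflow-checked size arithmetic and validating that an alignment is a non-zero power of two:

```cpp
// Sketch of overflow-checked size math and a power-of-two alignment check.
// Generic illustration only; these function names are not from ggml.
#include <cstddef>
#include <cstdint>
#include <cstdio>

// returns false instead of silently wrapping around on overflow
static bool checked_mul(size_t a, size_t b, size_t * out) {
    if (a != 0 && b > SIZE_MAX / a) {
        return false;
    }
    *out = a * b;
    return true;
}

// an alignment must be a non-zero power of two
static bool is_valid_alignment(size_t align) {
    return align != 0 && (align & (align - 1)) == 0;
}

int main() {
    size_t total;
    if (!checked_mul(SIZE_MAX / 2, 4, &total)) {
        std::printf("size overflow detected\n");
    }
    std::printf("align 32 ok: %d, align 12 ok: %d\n",
                is_valid_alignment(32), is_valid_alignment(12));
    return 0;
}
```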
Concedo
0eafc3cf2d ace step lowvram mode done, improved 2026-02-24 23:12:26 +08:00
Concedo
11a85d62fc lowvram for music lm 2026-02-24 22:21:17 +08:00
Concedo
aa58d1ed3b all working, but needs to optimize vram 2026-02-24 21:55:57 +08:00
Tarek Dakhran
da426cb250
model : update label for LFM2-24B-A2B (#19848)
* model : Update label for LFM2-24B-A2B

```
❯ build/bin/llama-bench -m /data/playground/checkpoints/LFM2-24B-A2B-Preview-Q4_0.gguf,/data/playground/checkpoints/LFM2-8B-A1B-Q4_0.gguf -p 1 -n 0
| model                          |       size |     params | backend    | threads |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
| lfm2moe 24B.A2B Q4_0           |  12.54 GiB |    23.84 B | CPU        |      10 |             pp1 |         30.35 ± 2.49 |
| lfm2moe 8B.A1B Q4_0            |   4.41 GiB |     8.34 B | CPU        |      10 |             pp1 |         49.24 ± 1.93 |
```

* Remove extra line
2026-02-24 14:27:42 +01:00
Concedo
488c431331 not yet working 2026-02-24 17:47:50 +08:00
Radoslav Gerganov
c830f99cfa
server : support max_completion_tokens request property (#19831)
"max_tokens" is deprectated in favor of "max_completion_tokens" which
sets the upper bound for reasoning+output token.

Closes: #13700
2026-02-24 10:30:00 +02:00
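A hedged usage sketch for the new property, posting a chat completion request with the vendored cpp-httplib client; the host, port, and prompt are placeholder assumptions, and the endpoint is the server's OpenAI-compatible route:

```cpp
// Sketch: send "max_completion_tokens" instead of the deprecated "max_tokens"
// to llama-server's OpenAI-compatible endpoint. Host, port, and prompt are
// placeholder assumptions.
#include <httplib.h>
#include <cstdio>
#include <string>

int main() {
    httplib::Client cli("localhost", 8080);
    const std::string body = R"({
        "messages": [{"role": "user", "content": "Hello"}],
        "max_completion_tokens": 256
    })";
    auto res = cli.Post("/v1/chat/completions", body, "application/json");
    if (res) {
        std::printf("%d\n%s\n", res->status, res->body.c_str());
    }
    return 0;
}
```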
Ruben Ortlam
aa6f918c1c
Vulkan Scalar Flash Attention Refactor (#19625)
* vulkan: allow using fp16 in scalar flash attention shader

* split rows inside of subgroups for faster synchronization

* use row_split when Br >= 4, change reductions to use shared memory if row_split == 1

* use f32 scalar FA if f16 is not supported by device

* fix amd workgroup size issue

* optimize masksh use

* add medium rows FA shader Br size

* fixes

* add padding to mask shmem buffer

* cache q values into registers for KQ

* fuse lf accumulation, pf and v accumulation into a loop

* stage K loads through shmem

* stage V loads through shmem

* only stage through shmem on Nvidia

* default to Bc 32

* also stage V through shmem when this is done for K

* dynamic subgroups for intel

* use vectorized stores

* use float_type for dequantize4 functions

* use smaller scalar rows size for smaller rows count

* relax flash attention split_k condition to allow non-gqa use

* use minimal subgroup size on Intel

* fix shmem support function

* fix rebase issues

* fixes

* Bc 4 for scalar FA is not a valid configuration

* Use wave32 on AMD RDNA for scalar FA

* add Intel shader core count lookup-table

* fix regressions

* device tuning

* tmpsh size fix

* fix editorconfig

* refactor fa tuning logic into a single place

* fix gqa opt logic

* fix block_rows with small n_rows

* amd tuning

* fix hsk=72/80 issue

* tuning

* allow condition skipping for column check

* use float16 for Of if available

* address feedback

* fix bad RDNA performance on head size <= 128 by limiting occupancy

* allow printing pipeline stats

* cleanup and fixes

* limit occupancy for GCN for small batch FA with large HSK

* disable f16 FA for GCN AMD GPUs on the proprietary driver
2026-02-24 08:35:48 +01:00
Concedo
0fd7d2c0e5 ace step diffusion loading 2026-02-24 15:24:15 +08:00
Jeff Bolz
8c2c0108dd
vulkan: fix coopmat1 without bf16 support (#19793) 2026-02-24 07:48:32 +01:00
Jeff Bolz
3ea5360c00
vulkan: fix data race in mul_mat_id shader (#19790) 2026-02-24 07:43:12 +01:00
Max Krasnyansky
39fb81f875
hexagon refactor all Ops to use local context struct (#19819)
* hexagon: refactor set/get/sum-rows ops to use local context

* hexagon: refactor ROPE and Softmax Ops to use local context

Improves performance a bit by precomputing values and saving them in the context.

* hexagon: refactor activation ops to use local context struct

* hexagon: refactor unary ops to use local context struct and DMA/VTCM

* hexagon: use aligned hvx_scale function

* hexagon: remove unused fields from op_context

* hexagon: rewrite ROPE to use DMA and VTCM scratchpad

* hex-rope: keep N rows in scratchpad (instead of just two)

* hex-rope: introduce rowidx cache

* hex-rope: remove unused fields

* hex-rope: rewrite dma prefetch logic to allow for multi-row fetch/compute

also removes the need for fastdiv.

* hex-rope: minor formatting

* hex-rope: use indices and unroll the loops

* hex-rope: more updates to cleanup rope-block handling

* hexagon: cleanup supported type/dims checks

* hexagon: all reduce funcs replicated across lanes

There is no need to explicitly replicate the first value.

* snapdragon: update adb and windows scripts to use ubatch-size 256

The updated Ops support handles larger ubatches.
2026-02-23 16:32:14 -08:00
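The refactor above centers on precomputing per-op values once in a local context struct and reusing them for every row; a generic sketch of that pattern with a RoPE-like rotation (all names hypothetical, not the hexagon backend code):

```cpp
// Generic "precompute once in an op context, reuse per row" pattern.
// All names here are hypothetical; this is not the hexagon backend code.
#include <cmath>
#include <vector>

struct rope_op_ctx {
    int   n_dims;
    float theta_base;
    std::vector<float> inv_freq; // computed once per op, reused for every row
};

static rope_op_ctx make_ctx(int n_dims, float theta_base) {
    rope_op_ctx ctx{n_dims, theta_base, {}};
    ctx.inv_freq.resize(n_dims / 2);
    for (int i = 0; i < n_dims / 2; ++i) {
        ctx.inv_freq[i] = 1.0f / std::pow(theta_base, (2.0f * i) / n_dims);
    }
    return ctx;
}

static void rotate_row(const rope_op_ctx & ctx, float * row, int pos) {
    for (int i = 0; i < ctx.n_dims / 2; ++i) {
        const float theta = pos * ctx.inv_freq[i];
        const float c = std::cos(theta), s = std::sin(theta);
        const float x0 = row[2*i], x1 = row[2*i + 1];
        row[2*i]     = x0 * c - x1 * s;
        row[2*i + 1] = x0 * s + x1 * c;
    }
}
```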
Aleksander Grygier
5eb0ea32f0
feat: Add code blocks full height setting to parameter sync service (#19835) 2026-02-23 22:30:13 +01:00
Adrien Gallouët
b68a83e641
vendor : update cpp-httplib to 0.34.0 (#19830)
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2026-02-23 21:05:48 +01:00
Concedo
749536f464 fixed wav header wrong size 2026-02-24 01:13:44 +08:00
Daniel Bevenius
d8aeb65cee
tests : fix typos in comments in test-backend-sampler [no ci] (#19824)
* tests : fix typos in comments in test-backend-sampler [no ci]
2026-02-23 17:12:02 +01:00
askmyteapot
062e361968
Update ace-qwen3.cpp to build on MSVC (#1992)
Need to include <sstream>, otherwise the build fails with many errors like the ones below:

```
C:\koboldcpp\otherarch\acestep\ace-qwen3.cpp(1278,9): error C2297: '<<': not valid as right operand has type 'const char [26]' [C:\koboldcpp\build\music_adapter.vcxproj]
  (compiling source file '../otherarch/acestep/music_adapter.cpp')

C:\koboldcpp\otherarch\acestep\ace-qwen3.cpp(1278,9): error C2679: binary '<<': no operator found which takes a right-hand operand of type 'std::string' (or there is no acceptable conversion) [C:\koboldcpp\build\music_adapter.vcxproj]
  (compiling source file '../otherarch/acestep/music_adapter.cpp')
      C:\Program Files (x86)\Microsoft Visual Studio\18\BuildTools\VC\Tools\MSVC\14.50.35717\include\__msvc_int128.hpp(753,46):
      could be 'std::_Unsigned128 std::operator <<(const std::_Unsigned128 &,const std::_Base128 &) noexcept' [found using argument-dependent lookup]
          C:\koboldcpp\otherarch\acestep\ace-qwen3.cpp(1278,9):
          'std::_Unsigned128 std::operator <<(const std::_Unsigned128 &,const std::_Base128 &) noexcept': cannot convert argument 2 from 'std::string' to 'const std::_Base128 &'
              C:\koboldcpp\otherarch\acestep\ace-qwen3.cpp(1278,57):
              Reason: cannot convert from 'std::string' to 'const std::_Base128'
              C:\koboldcpp\otherarch\acestep\ace-qwen3.cpp(1278,57):
              No user-defined-conversion operator available that can perform this conversion, or the operator cannot be called
```
2026-02-23 23:03:07 +08:00
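A small self-contained reproduction of the pattern the include fixes: streaming literals and std::string into an std::ostringstream, which on MSVC needs <sstream> included directly rather than picked up transitively:

```cpp
// Minimal illustration of the MSVC failure mode fixed above: code that streams
// into std::ostringstream must include <sstream> itself instead of relying on
// another header pulling it in transitively.
#include <sstream>  // without this, MSVC cannot find suitable operator<< overloads
#include <string>
#include <cstdio>

int main() {
    std::string name = "ace-qwen3";
    std::ostringstream oss;
    oss << "loading model: " << name;  // the pattern that failed to compile
    std::printf("%s\n", oss.str().c_str());
    return 0;
}
```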
Concedo
5311997581 updated ace step cpp 2026-02-23 23:01:10 +08:00
Concedo
2e713cfff5 fixed compile issue, trying out 8bit pcm 2026-02-23 21:19:03 +08:00
Aleksander Grygier
9051663d5d
webui: Add setting to have full height Code Blocks in Chat Messages (#19829) 2026-02-23 14:16:50 +01:00
Daniel Bevenius
72b44c0d21
model-conversion : merge inspect-org-model.py with tensor-info.py (#19823)
This commit replaces/merges the inspect-org-model.py script with the
contents of the tensor-info.py script. The merged script has also been
updated to print tensor sizes, which was the only thing that
tensor-info.py did not do before.

The motivation for this is that tensor-info.py does not load the tensor
weights, which can be time consuming for larger models. And now that
both scripts do almost the same thing, it makes sense to maintain one
script rather than two.
2026-02-23 14:15:16 +01:00
Alberto Cabrera Pérez
bc160d3582
ggml-cpu: arm64: q5_K repack gemm and gemv (and generic) implementations (dotprod) (#19356)
* Generic GEMV and boilerplate for q5_K dotprod
* Generic GEMM and boilerplate for q5_K dotprod
* ARM64 q5_K dotprod GEMM
* ARM64 q5_K dotprod GEMV
2026-02-23 12:42:52 +00:00
Wagner Bruna
a6c0a224b2
sd: sync to master-506-c9cd497 (#1991) 2026-02-23 17:35:59 +08:00
Concedo
06c0ffaead with am17an fix for henk to test 2026-02-23 17:30:19 +08:00
Concedo
c2b0cb26a8 ace step codes api 2026-02-23 14:04:45 +08:00
Daniel Bevenius
2b6dfe824d
llama : remove write/read of output ids/logits/embeddings (#18862)
* llama : remove write/read of output ids/logits/embeddings

This commit removes the write/read of output ids, logits and
embeddings from the llama context state.

Refs: https://github.com/ggml-org/llama.cpp/pull/18862#issuecomment-3756330941

* completion : add replaying of session state

This commit updates the session handling in the completion tool to handle
the fact that logits are no longer stored in the session file. Instead, we
need to replay the last token to get the logits for sampling.

* common : add common_prompt_batch_decode function

This commit adds a new function which is responsible for decoding the
prompt and optionally handling the saving of session data.

* update save-state.cpp to use llama_state_load_file

This commit updates the save-load-state example to use the new
llama_state_load_file function for loading the model state from a file.
It also replays the last token after loading, since the state is now
stored before the last token is processed.

* examples : set n_seq_max = 2 for ctx3

This commit updates the save-load-state example to set the n_seq_max
parameter to 2 when initializing the ctx3 context.

The motivation for this change is that with n_parallel/n_seq_max set to 1
the context only supports one sequence, but the test later tries to
use a second sequence, which results in the following error:
```console
main : loaded state with 4 tokens
main : seq 0 copied, 225760 bytes
main : kv cache cleared
find_slot: seq_id=1 >= n_seq_max=1 Try using a bigger --parallel value
state_read_meta: failed to find available cells in kv cache
```
This seems to only happen for recurrent/hybrid models.
2026-02-23 07:04:30 +01:00
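A minimal sketch of the replay-after-load flow the commit describes, built from existing llama.cpp C API names (llama_state_load_file, llama_batch_get_one, llama_decode, llama_get_logits); buffer sizes and error handling are simplified assumptions:

```cpp
// Sketch: after loading a session with llama_state_load_file, the logits are
// no longer part of the saved state, so the last token is decoded again to
// repopulate them before sampling. Model/context setup omitted; treat as a
// hedged sketch assembled from real llama.cpp API names.
#include "llama.h"
#include <vector>

static bool restore_and_replay(llama_context * ctx, const char * path) {
    std::vector<llama_token> tokens(1024); // assumed capacity for the saved prompt
    size_t n_tokens = 0;

    if (!llama_state_load_file(ctx, path, tokens.data(), tokens.size(), &n_tokens)) {
        return false;
    }
    if (n_tokens == 0) {
        return false;
    }

    // replay only the last token so llama_get_logits() is valid again
    llama_batch batch = llama_batch_get_one(&tokens[n_tokens - 1], 1);
    if (llama_decode(ctx, batch) != 0) {
        return false;
    }

    const float * logits = llama_get_logits(ctx);
    (void) logits; // ready for sampling
    return true;
}
```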
Concedo
d100c8660e added Tlacuilo 2026-02-23 10:48:56 +08:00
Sigbjørn Skjæret
e8e261699a
cli : provide model with text filename (#19783) 2026-02-22 22:33:49 +01:00
Xuan-Son Nguyen
5452d736f8
jinja: correct stats for tojson and string filters (#19785) 2026-02-22 21:08:23 +01:00
Aldehir Rojas
ed4837891d
common : fix improper trimming in XML parser on complete message (#19805)
Co-authored-by: Jules LEIDELINGER <11395311+julio75012@users.noreply.github.com>
2026-02-22 17:34:54 +01:00
Concedo
4be93db21c ace step codes generation now working 2026-02-23 00:27:26 +08:00
Kilian Krampf
cacc371f99
Fix wrong cli-argument in documentation (#19804) 2026-02-22 16:26:33 +01:00
Concedo
71d42fae85 Revert "Revert "Revert "cuda : enable CUDA graphs for MMID 1 <= BS <= 4 (#19645)"""
This reverts commit edc04f3f7d.
2026-02-22 23:18:53 +08:00
Concedo
13db5aee9e stub files for loading ace step 2026-02-22 23:15:08 +08:00
HelloKS
ae2368e74e
model : add Kanana-2 model support (#19803)
* model: Add Kanana-2 model support

* lint: adjust spacing
2026-02-22 16:15:02 +01:00
Sigbjørn Skjæret
9f0684f003
ci : fix rocm archive name [no ci] (#19808) 2026-02-22 16:14:37 +01:00
Aldehir Rojas
34ec1c3f18
server : merge contiguous Responses input items into a single assistant message (#19773)
* server : merge contiguous input items into a single assistant message

* cont : simplify tool call msg

* cont : reduce and combine content

* cont : fix merging content items
2026-02-22 14:11:31 +01:00
Concedo
37ae068dee set default to GPU test 2026-02-22 17:03:43 +08:00
Sigbjørn Skjæret
e877ad8bd9
ci : fix rocm release path [no ci] (#19784) 2026-02-22 08:07:46 +01:00
Concedo
fdf868f397 add ace step cpp license info 2026-02-22 13:24:28 +08:00
Concedo
5cd6e50eab initial files for ace step 2026-02-22 13:22:24 +08:00
Concedo
ac70ca35dd preliminary patches for acestep.cpp 2026-02-22 12:50:08 +08:00
Wagner Bruna
19588f18ea
sd: relax size restrictions for DiT models (#1986)
Round image dimensions to the specific multiple required by each
DiT model, which ranges from 32 (certain Wan models) down to 1 (Chroma
Radiance), with most requiring multiples of 8 or 16. UNet models
keep being rounded to multiples of 64.

Current sd.cpp rounds the sizes internally; but it always rounds
up, so we still need to round on our side to apply image size
restrictions, and to trigger VAE tiling correctly.

Also, remove a legacy test that could abort a generation with
unsupported image sizes: it would never run, because it was applied
after the image size adjustments.
2026-02-22 11:00:10 +08:00
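A tiny sketch of the per-model rounding described above; the helper name, the round-down direction, and the minimum clamp are illustrative assumptions rather than the actual sd.cpp/koboldcpp code:

```cpp
// Sketch: round an image dimension to the multiple a given model requires
// (e.g. 64 for UNet, 16/8 for most DiT models, 1 for Chroma Radiance).
// Helper name, round-down choice, and the minimum clamp are assumptions.
#include <algorithm>
#include <cstdio>

static int round_dim_to_multiple(int dim, int multiple) {
    if (multiple <= 1) {
        return dim; // no restriction
    }
    int rounded = (dim / multiple) * multiple; // round down, unlike sd.cpp's round-up
    return std::max(rounded, multiple);        // keep at least one block
}

int main() {
    std::printf("%d %d %d\n",
                round_dim_to_multiple(1000, 64),  // 960  (UNet-style)
                round_dim_to_multiple(1000, 16),  // 992
                round_dim_to_multiple(1000, 8));  // 1000
    return 0;
}
```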
Concedo
0a87f5501e updated sdui, fix img imports 2026-02-22 10:49:55 +08:00
Concedo
73f3ffaeb7 fix followup tool call check with assistant prefills 2026-02-22 10:33:00 +08:00
Concedo
edc04f3f7d Revert "Revert "cuda : enable CUDA graphs for MMID 1 <= BS <= 4 (#19645)""
This reverts commit 131e3cb17a.
2026-02-22 09:33:25 +08:00