koboldcpp

mirror of https://github.com/LostRuins/koboldcpp.git synced 2026-05-08 09:59:50 +00:00

Author	SHA1	Message	Date
Concedo	d643d945f5	clamp music inference steps to 100 max	2026-02-28 12:12:50 +08:00
Concedo	dd08d675f2	incomplete fix for rnn models, load state works but logits slightly different	2026-02-28 11:52:24 +08:00
Concedo	14d82bb38e	allow music llm and diffusion gen models to be loaded independently	2026-02-27 21:56:48 +08:00
Concedo	19eb78844c	audio codes working	2026-02-27 21:23:00 +08:00
Concedo	ba42f22fc8	stereo is working	2026-02-27 20:36:44 +08:00
Concedo	5a57ed8ca4	revert to 8 step	2026-02-26 22:07:01 +08:00
Concedo	173702d1a4	music lowvram indicator	2026-02-26 21:30:47 +08:00
Concedo	05834eecb3	Merge commit '`1ca3d1de15`' into concedo_experimental # Conflicts: # tools/server/README.md	2026-02-26 19:55:06 +08:00
Concedo	adebf63877	ace converter	2026-02-26 19:53:02 +08:00
Georgi Gerganov	1ca3d1de15	gguf : avoid too many file size calls (#19919 )	2026-02-26 12:46:32 +02:00
yggdrasil75	bd72300591	server : fix typo in server README.md (#19900 ) fix typo	2026-02-26 11:26:16 +01:00
Concedo	ac8f12f259	still a bit wonky	2026-02-26 17:50:49 +08:00
Concedo	81fb4d773c	swap resampling function	2026-02-26 17:37:53 +08:00
Concedo	749a606374	whisper broke	2026-02-26 16:45:04 +08:00
Concedo	44182ebefe	Merge commit '`8c2c0108dd`' into concedo_experimental # Conflicts: # examples/model-conversion/Makefile # examples/model-conversion/scripts/utils/inspect-org-model.py # ggml/src/ggml-hexagon/ggml-hexagon.cpp # ggml/src/ggml-hexagon/htp/act-ops.c # ggml/src/ggml-hexagon/htp/get-rows-ops.c # ggml/src/ggml-hexagon/htp/hex-dma.h # ggml/src/ggml-hexagon/htp/htp-ops.h # ggml/src/ggml-hexagon/htp/matmul-ops.c # ggml/src/ggml-hexagon/htp/rope-ops.c # ggml/src/ggml-hexagon/htp/set-rows-ops.c # ggml/src/ggml-hexagon/htp/softmax-ops.c # ggml/src/ggml-hexagon/htp/unary-ops.c # scripts/snapdragon/adb/run-cli.sh # scripts/snapdragon/adb/run-completion.sh # scripts/snapdragon/adb/run-mtmd.sh # scripts/snapdragon/windows/run-cli.ps1 # scripts/sync_vendor.py # tests/test-backend-sampler.cpp	2026-02-26 16:30:37 +08:00
Concedo	7e53bfd28d	Merge commit '`2b6dfe824d`' into concedo_experimental # Conflicts: # .github/workflows/release.yml # examples/save-load-state/save-load-state.cpp # src/llama-context.cpp # tools/cli/cli.cpp	2026-02-26 15:07:23 +08:00
Wagner Bruna	d400b37215	config file saving enhancements (#1994 ) * process --exportconfig and --exporttemplate after --config This allows using `--config oldfile.kcpps --exportconfig newfile.kcpps` to update old config items, copy a config file with changed parameters, download and save a remote config, etc. * filter out command flags from the saved config files Also indent files saved by command-line.	2026-02-26 14:55:01 +08:00
Concedo	fb3f7d92bc	reenable cfg	2026-02-26 14:51:15 +08:00
Concedo	b7d2fe68e7	adjust	2026-02-26 14:46:41 +08:00
Concedo	edbc4fe592	music lm finally working	2026-02-26 14:00:58 +08:00
Concedo	cf042af701	Revert "still not working" This reverts commit `a1305ffff9`.	2026-02-26 10:55:55 +08:00
Concedo	a1305ffff9	still not working	2026-02-26 10:48:21 +08:00
Neo Zhang	2943210c1e	support permuted, remove check s0/s10 (#19889 ) Co-authored-by: Neo Zhang Jianyu <jianyu.zhang@intel.com>	2026-02-26 10:27:20 +08:00
Jeff Bolz	3769fe6eb7	vulkan: check for memory overlap before doing fusion (#19768 ) * vulkan: check for memory overlap before doing fusion * Update ggml/src/ggml-vulkan/ggml-vulkan.cpp * address feedback	2026-02-25 18:25:38 +01:00
Concedo	5c5fe55f7d	bump kv overrides max (+1 squashed commits) Squashed commits: [9bc8212a0] bump kv overrides max	2026-02-26 00:24:53 +08:00
Concedo	d8746a851f	still bugged	2026-02-26 00:07:04 +08:00
Concedo	8a3ccfcba5	some fixes but some issues	2026-02-25 23:41:32 +08:00
ddh0	832aa94762	common : add more aliases for sampler CLI params (#19797 ) * common : add more aliases for sampler CLI params	2026-02-25 16:34:25 +01:00
Slobodan Josic	3af34b9ff5	ci : update the ROCm/HIP toolchain versions [no ci] (#19891 ) * [HIP] Update ROCm build container to rocm/dev-ubuntu-22.04:7.2 and HIP_SDK to 26.Q1 * revert container version --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2026-02-25 15:54:49 +01:00
Georgi Gerganov	f20469d919	server : enable multi-modal prompt caching (#19877 )	2026-02-25 15:15:42 +02:00
Georgi Gerganov	d7d826b3c1	server : support multi-modal context checkpoints (#19849 ) * Modify llama-memory-hybrid-iswa.cpp * Modify llama-memory-recurrent.cpp * Modify server-common.cpp * Modify server-common.h * Modify server-context.cpp * Modify server-task.h * Added comment to llama-memory-hybrid-iswa.cpp * Remove comment from server-context.cpp * Stylistic fix server-context.cpp * Fix an issue when seqrm isn't called in server-context.cpp * cont : alternative impl * cont : cleanup * cont : n_tokens -> int64_t --------- Co-authored-by: timkhronos <timkhronos@gmail.com>	2026-02-25 15:14:27 +02:00
Xuan-Son Nguyen	c747294b2d	scripts: update corpus of compare-logprobs (#19326 ) * scripts: update corpus of compare-logprobs * fix	2026-02-25 12:57:34 +01:00
Mario Limonciello	8fdf269dad	ci : update Windows ROCm build to 26.Q1 [no ci] (#19810 ) * Update build command to build llama-* tools not just ggml-hip * Update rocWMMA headers to 7.2 * Add GFX1150 target * Correct library paths for AMD libraries in 26.Q1	2026-02-25 12:30:19 +01:00
Aldehir Rojas	a96a1120b4	gguf : fix ftell/fseek for Windows (#19870 )	2026-02-25 06:58:11 +02:00
Georgi Gerganov	244641955f	models : fix graph splits (#19866 )	2026-02-25 00:01:13 +02:00
Pascal	47eb12b953	server: fix query params lost when proxying requests in multi-model router mode (#19854 ) * server: fix query params lost when proxying requests in multi-model router mode * server: re-encode query params using httplib::encode_query_component in proxy	2026-02-24 21:46:06 +01:00
Georgi Gerganov	418dea39ce	ggml/gguf : prevent integer overflows (#19856 ) * gguf : prevent integer overflow for ggml_context mem size * ggml : fix int overflows in ggml_new_object() * gguf : prevent string exhaustion * gguf : prevent array elements exhaustion * ggml : fix negative tensor type oob * py : assert that alignment is non-zero power of 2 * ggml : check int overflow in ggml_new_tensor_impl and ggml_new_object * gguf-py : error on duplicate keys when reading * py : restore tensor_fields * enforce proper alignment in add_custom_alignment * gguf : better name * gguf : fix ctx size for no_alloc == true * gguf : minor print fix * ggml : print values when overflow * ggml : remove deprecated ggml_type_sizef() * ggml : relax ggml_type asserts to debug-only * gguf : add mem_size overflow test * gguf : add file size check for arrays * ggml : relax asserts for ggml_get_type_traits() * flake8 fix --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2026-02-24 20:17:11 +02:00
Concedo	0eafc3cf2d	ace step lowvram mode done, improved	2026-02-24 23:12:26 +08:00
Concedo	11a85d62fc	lowvram for music lm	2026-02-24 22:21:17 +08:00
Concedo	aa58d1ed3b	all working, but needs to optimize vram	2026-02-24 21:55:57 +08:00
Tarek Dakhran	da426cb250	model : update label for LFM2-24B-A2B (#19848 ) * model : Update label for LFM2-24B-A2B ``` ❯ build/bin/llama-bench -m /data/playground/checkpoints/LFM2-24B-A2B-Preview-Q4_0.gguf,/data/playground/checkpoints/LFM2-8B-A1B-Q4_0.gguf -p 1 -n 0 \| model \| size \| params \| backend \| threads \| test \| t/s \| \| ------------------------------ \| ---------: \| ---------: \| ---------- \| ------: \| --------------: \| -------------------: \| \| lfm2moe 24B.A2B Q4_0 \| 12.54 GiB \| 23.84 B \| CPU \| 10 \| pp1 \| 30.35 ± 2.49 \| \| lfm2moe 8B.A1B Q4_0 \| 4.41 GiB \| 8.34 B \| CPU \| 10 \| pp1 \| 49.24 ± 1.93 \| ``` * Remove extra line	2026-02-24 14:27:42 +01:00
Concedo	488c431331	not yet working	2026-02-24 17:47:50 +08:00
Radoslav Gerganov	c830f99cfa	server : support max_completion_tokens request property (#19831 ) "max_tokens" is deprecated in favor of "max_completion_tokens" which sets the upper bound for reasoning+output token. Closes: #13700	2026-02-24 10:30:00 +02:00
Ruben Ortlam	aa6f918c1c	Vulkan Scalar Flash Attention Refactor (#19625 ) * vulkan: allow using fp16 in scalar flash attention shader * split rows inside of subgroups for faster synchronization * use row_split when Br >= 4, change reductions to use shared memory if row_split == 1 * use f32 scalar FA if f16 is not supported by device * fix amd workgroup size issue * optimize masksh use * add medium rows FA shader Br size * fixes * add padding to mask shmem buffer * cache q values into registers for KQ * fuse lf accumulation, pf and v accumulation into a loop * stage K loads through shmem * stage V loads through shmem * only stage through shmem on Nvidia * default to Bc 32 * also stage V through shmem when this is done for K * dynamic subgroups for intel * use vectorized stores * use float_type for dequantize4 functions * use smaller scalar rows size for smaller rows count * relax flash attention split_k condition to allow non-gqa use * use minimal subgroup size on Intel * fix shmem support function * fix rebase issues * fixes * Bc 4 for scalar FA is not a valid configuration * Use wave32 on AMD RDNA for scalar FA * add Intel shader core count lookup-table * fix regressions * device tuning * tmpsh size fix * fix editorconfig * refactor fa tuning logic into a single place * fix gqa opt logic * fix block_rows with small n_rows * amd tuning * fix hsk=72/80 issue * tuning * allow condition skipping for column check * use float16 for Of if available * address feedback * fix bad RDNA performance on head size <= 128 by limiting occupancy * allow printing pipeline stats * cleanup and fixes * limit occupancy for GCN for small batch FA with large HSK * disable f16 FA for GCN AMD GPUs on the proprietary driver	2026-02-24 08:35:48 +01:00
Concedo	0fd7d2c0e5	ace step diffusion loading	2026-02-24 15:24:15 +08:00
Jeff Bolz	8c2c0108dd	vulkan: fix coopmat1 without bf16 support (#19793 )	2026-02-24 07:48:32 +01:00
Jeff Bolz	3ea5360c00	vulkan: fix data race in mul_mat_id shader (#19790 )	2026-02-24 07:43:12 +01:00
Max Krasnyansky	39fb81f875	hexagon refactor all Ops to use local context struct (#19819 ) * hexagon: refactor set/get/sum-rows ops to use local context * hexagon: refactor ROPE and Softmax Ops to use local context Improves performance a bit by precomputing things and saving in the context. * hexagon: refactor activation ops to use local context struct * hexagon: refactor unary ops to use local context struct and DMA/VTCM * hexagon: use aligned hvx_scale function * hexagon: remove unused fields from op_context * hexagon: rewrite ROPE to use DMA and VTCM scratchpad * hex-rope: keep N rows in scratchpad (instead of just two) * hex-rope: introduce rowidx cache * hex-rope: remove unused fields * hex-rope: rewrite dma prefetch logic to allow for multi-row fetch/compute also removes the need for fastdiv. * hex-rope: minor formatting * hex-rope: use indices and unroll the loops * hex-rope: more updates to cleanup rope-block handling * hexagon: cleanup supported type/dims checks * hexagon: all reduce funcs replicated across lanes There is no need to explicitly replicate the first value. * snapdragon: update adb and windows scripts to use ubatch-size 256 Updated Ops support handles larger ubatches.	2026-02-23 16:32:14 -08:00
Aleksander Grygier	5eb0ea32f0	feat: Add code blocks full height setting to parameter sync service (#19835 )	2026-02-23 22:30:13 +01:00
Adrien Gallouët	b68a83e641	vendor : update cpp-httplib to 0.34.0 (#19830 ) Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2026-02-23 21:05:48 +01:00


11875 commits