koboldcpp

mirror of https://github.com/LostRuins/koboldcpp.git synced 2026-05-08 18:30:50 +00:00

Author	SHA1	Message	Date
Concedo	3fb0f337fe	remove z-image clamping for now	2025-12-11 23:05:00 +08:00
Concedo	278e45becf	Merge commit '`2fa51c19b0`' into concedo_experimental # Conflicts: # .github/actions/windows-setup-cuda/action.yml # .github/workflows/build-linux-cross.yml # .github/workflows/release.yml # README.md # docs/build-riscv64-spacemit.md # examples/model-conversion/logits.cpp # ggml/CMakeLists.txt # ggml/src/ggml-cpu/CMakeLists.txt # models/templates/Kimi-K2-Instruct.jinja # models/templates/Kimi-K2-Thinking.jinja # tests/test-chat.cpp # tools/server/README.md	2025-12-11 23:04:48 +08:00
Concedo	d07d2c1b39	stub loras endpoint for comfy	2025-12-11 22:48:38 +08:00
Concedo	fd0d0cab03	move pipeline parallelism to a --pipelineparallel launch flag	2025-12-11 21:03:41 +08:00
Concedo	b7428048fc	try reduce pipeline parallelism in order to reduce compute buffer sizes	2025-12-11 14:30:38 +08:00
Concedo	798473d867	updated sdui, fixed image import	2025-12-11 11:43:40 +08:00
Concedo	34634aef1b	tweak to smartcache for contextshifting	2025-12-10 20:08:11 +08:00
Concedo	8a18e094f5	added smartcaching implementation inspired from Pento95 (+2 squashed commit) Squashed commit: [fcc498688] wip basic smart caching test [b6e8b2577] wip basic smart caching test	2025-12-10 18:00:03 +08:00
Concedo	1aab32fe03	fixed safetensors loading for zimage	2025-12-09 18:09:47 +08:00
Daniel Bevenius	2fa51c19b0	model-conversion : add token ids to prompt token output [no ci] (#17863 ) This commit adds the token ids to the printed prompt outputs. The motivation for this is that is can be useful to see the actual token ids alongside the token strings for debugging.	2025-12-08 17:13:08 +01:00
Xuan-Son Nguyen	951520ddb0	server: delegate result_state creation to server_task (#17835 ) * server: delegate result_state creation to server_task * remove unued states * add more docs	2025-12-08 17:04:38 +01:00
Neo Zhang	68522c678d	ci : support bfloat16 SYCL release package (#17855 ) * support bfloat16 release package * add fallback file	2025-12-08 15:09:39 +01:00
Xuan-Son Nguyen	f896d2c34f	server: improve speed of speculative decoding (#17808 ) * server: improve speed of speculative decoding * fix small draft case * add link to the PR * server : fix generation time measurement * server : fix draft acceptance logs (add SRV_CNT, SLT_CNT macros) * server : add comment * add PR to docs --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-12-08 14:35:28 +01:00
Piotr Wilkin (ilintar)	e4e9c4329c	Make graph_max_nodes vary by ubatch size (#17794 ) * Make graph_max_nodes vary by ubatch size for models where chunking might explode the graph * Update src/llama-context.h Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Add missing const --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-12-08 14:32:41 +01:00
hksdpc255	636fc17a37	Fix Kimi-K2 tool-call parsing issues (#17376 ) * Fix kimi-k2 parsing * fix template & add more tests for kimi-k2 * Another fix for Kimi-K2 chat template. * enable allow_toolcall_in_think for Kimi-K2 * Refine key-value separator and value end format * Enable tool call in think for kimi-k2 * allow_toolcall_in_think is now tested with Kimi-K2 * Remove outdated TODO comment in XML tool call parser Removed TODO comment about untested tool call feature. * Rename function from "utf8_truncate_safe" to "utf8_truncate_safe_len"	2025-12-08 14:32:04 +01:00
Jay Zenith	51e0c2d917	cuda : add FILL op support (#17851 ) * cuda : add FILL op support * cuda : add missing FILL op files	2025-12-08 21:10:12 +08:00
Xuan-Son Nguyen	37a4f63244	server : add development documentation (#17760 ) * first draft * rewrite * update & remove duplicated sections	2025-12-08 13:54:58 +01:00
Wagner Bruna	801840d3bd	sd: sync to master-391-5865b5e (#1878 )	2025-12-08 19:53:52 +08:00
Concedo	242ae8b8f3	http get cleanup	2025-12-08 19:51:55 +08:00
Concedo	cd73613136	moved volta onto tile kernels, so building for cc7.0 can be avoided this shouldn't do anything (+2 squashed commit) Squashed commit: [1cdcb302a] another attempt to tip the scales, part 2 [8f647b709] another attempt to tip the scales (volta)	2025-12-08 19:51:54 +08:00
Georgi Gerganov	2bc96931d2	server : make cache_reuse configurable per request (#17858 )	2025-12-08 12:43:12 +02:00
wsbagnsv1	5814b4dce1	cuda: optimize SOLVE_TRI using registers and FMAF (#17703 ) * ggml-cuda: optimize solve_tri_f32_fast and fix stride handling - Switch from using shared memory for the RHS/solution matrix to a register-based approach (x_low, x_high), reducing shared memory pressure and bank conflicts. - Implement explicit `fmaf` instructions for the reduction loop. - Update kernel arguments to pass strides in bytes rather than elements to align with standard ggml tensor arithmetic (casting to `char ` before addition). - Remove unused `MAX_K_FAST` definition. Small cleanup * Remove comments in solve_tri.cu * Update ggml/src/ggml-cuda/solve_tri.cu Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * Update ggml/src/ggml-cuda/solve_tri.cu Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * Update ggml/src/ggml-cuda/solve_tri.cu Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * Use const for variables in solve_tri.cu * Replace fmaf with more readable code * remove last fmaf --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2025-12-08 10:41:08 +01:00
ixgbe	79d61896d3	ggml-cpu: add ggml_thread_cpu_relax with Zihintpause support (#17784 ) * ggml-cpu: add ggml_thread_cpu_relax with Zihintpause support Signed-off-by: Wang Yang <yangwang@iscas.ac.cn> * cmake: enable RISC-V zihintpause extension for Spacemit builds * readme : add ZIHINTPAUSE support for RISC-V --------- Signed-off-by: Wang Yang <yangwang@iscas.ac.cn>	2025-12-08 10:41:34 +02:00
Xuan-Son Nguyen	4d3726278b	model: add llama 4 scaling for mistral-large (deepseek arch) (#17744 )	2025-12-07 22:29:54 +01:00
lovedheart	08f9d3cc1d	Vulkan: improve mul_mat_vec_iq1_m (#16907 ) * Optimize Vulkan shader for matrix-vector multiplication * Revert changes on compute_outputs and main Refactor compute_outputs to handle remaining rows correctly. * Fix trailing whitespace	2025-12-07 18:40:42 +01:00
Sigbjørn Skjæret	0a540f9abd	ci : add windows-cuda 13.1 release (#17839 )	2025-12-07 14:02:04 +01:00
Concedo	40d3d830a1	updated lite	2025-12-07 17:13:23 +08:00
Concedo	17c0c8d55d	Merge branch 'upstream' into concedo_experimental # Conflicts: # README.md # docs/backend/zDNN.md # docs/build.md # docs/ops.md # ggml/CMakeLists.txt # ggml/src/CMakeLists.txt # ggml/src/ggml-cann/ggml-cann.cpp # ggml/src/ggml-opencl/ggml-opencl.cpp # ggml/src/ggml-rpc/ggml-rpc.cpp # ggml/src/ggml-sycl/convert.cpp # ggml/src/ggml-sycl/ggml-sycl.cpp # src/llama-quant.cpp # tests/test-backend-ops.cpp # tools/llama-bench/llama-bench.cpp # tools/server/README.md	2025-12-07 16:48:38 +08:00
Concedo	7c5d271d6c	Merge branch 'upstream' into concedo_experimental # Conflicts: # .github/workflows/build.yml # .github/workflows/release.yml # .github/workflows/winget.yml # CMakeLists.txt # CODEOWNERS # CONTRIBUTING.md # cmake/build-info.cmake # docs/ops.md # docs/ops/BLAS.csv # docs/ops/Metal.csv # examples/CMakeLists.txt # examples/save-load-state/save-load-state.cpp # examples/simple-cmake-pkg/README.md # ggml/CMakeLists.txt # ggml/src/CMakeLists.txt # ggml/src/ggml-rpc/ggml-rpc.cpp # ggml/src/ggml-webgpu/ggml-webgpu.cpp # ggml/src/ggml-webgpu/wgsl-shaders/embed_wgsl.py # src/llama-quant.cpp # tests/test-backend-ops.cpp # tools/server/CMakeLists.txt	2025-12-07 16:37:32 +08:00
Concedo	20363dc6e7	z image limit cfg scale to 3.0 max	2025-12-07 16:24:26 +08:00
Concedo	8577628874	freeze lcpp ui forever, modify branding	2025-12-07 13:11:01 +08:00
Concedo	8c17541cc0	modify llama.cpp branding on lcpp ui (+1 squashed commits) Squashed commits: [067343edf] modify llama.cpp branding on lcpp ui	2025-12-07 12:53:33 +08:00
Sigbjørn Skjæret	22577583a3	common : change --color to accept on/off/auto, default to auto (#17827 )	2025-12-07 03:43:50 +01:00
Law Po Ying	d9e03db1e7	sycl: add missing BF16 conversion support for Intel oneAPI (#17780 ) * sycl: add missing BF16 conversion support for Intel oneAPI * Fix Line 645: Trailing whitespace	2025-12-07 09:18:18 +08:00
Jeff Bolz	db97837385	vulkan: perf_logger improvements (#17672 ) * vulkan: perf_logger improvements - Move perf_logger from device to ctx. - Add an env var to control the frequency we dump the stats. If you set a very large value, it just dumps when the ctx is destroyed. - Add a fusion info string to the tracking, only log one item per fused op. - Fix MUL_MAT_ID flops calculation. * fix vector sizes	2025-12-06 18:46:46 +01:00
Vishal Singh	017761daf5	ggml-zendnn : add ZenDNN backend for AMD CPUs (#17690 ) * ggml-zennn: add ZenDNN backend support * ggml-zendnn : address ZenDNN backend review fixes and suggestions * docs : apply blockquote syntax to ZenDNN docs --------- Co-authored-by: Manoj Kumar <mkumar@zettabolt.com>	2025-12-07 00:13:33 +08:00
Xuan-Son Nguyen	c42712b056	server: support multiple generations from one prompt (OAI "n" option) (#17775 ) * backend support * server: support multiple generations from one prompt (OAI "n" option) * fix invalid batch * format oai * clean up * disable ctx shift * add test * update comments * fix style * add n_cmpl to docs [no ci] * allowing using both n_cmpl and n	2025-12-06 15:54:38 +01:00
Phylliida Dev	09c7c50e64	ggml : add circular tiling support to pad, for Vulkan, CUDA, and CPU (used for making seamless textures) (#16985 ) * Feat: Added vulkan circular tiling support * Feat: Added cpu circular * Feat: Added cuda kernels * Added tests * Added tests * Removed non-pad operations * Removed unneded changes * removed backend non pad tests * Update test-backend-ops.cpp * Fixed comment on pad test * removed trailing whitespace * Removed unneded test in test-backend-ops * Removed removed test from calls * Update ggml/src/ggml-vulkan/vulkan-shaders/pad.comp Co-authored-by: Ruben Ortlam <picard12@live.de> * Fixed alignment * Formatting Co-authored-by: Aman Gupta <amangupta052@gmail.com> * Format pad * Format * Clang format * format * format * don't change so much stuff * clang format and update to bool * fix duplicates * don't need to fix the padding * make circular bool * duplicate again * rename vulkan to wrap around * Don't need indent * moved to const expr * removed unneded extra line break * More readable method calls * Minor wording changes * Added final newline * Update ggml/include/ggml.h Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update ggml/include/ggml.h Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Added circular pad ext tests * Gate non circular pad devices * Cleaned gating of non-circular pad devices --------- Co-authored-by: Phylliida <phylliidadev@gmail.com> Co-authored-by: Ruben Ortlam <picard12@live.de> Co-authored-by: Aman Gupta <amangupta052@gmail.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-12-06 15:07:02 +01:00
Concedo	d27949f22a	Revert "try remove volta as a dedicated target b (+1 squashed commits)" This reverts commit `ddba580f00`.	2025-12-06 21:31:44 +08:00
Concedo	ddba580f00	try remove volta as a dedicated target b (+1 squashed commits) Squashed commits: [2df689a03] try remove volta as a dedicated target	2025-12-06 21:31:06 +08:00
Johannes Gäßler	f334b79494	HIP: fix RDNA3 FP16/BF16 matrix multiplication (#17817 )	2025-12-06 13:45:36 +01:00
Aleksander Grygier	a28e3c7567	webui: Stop generation from chat sidebar (#17806 ) * feat: Add stop generation button for Conversation Item * chore: update webui build output	2025-12-06 13:29:15 +01:00
Aleksander Grygier	e31b5c55c3	webui: Fix context available value in Multi-model Router mode (#17804 ) * fix: Use context size from `/props?model=...` in ROUTER mode * chore: update webui build output	2025-12-06 13:23:29 +01:00
Aleksander Grygier	21f24f27a9	webui: Per-conversation system message with UI displaying, edition & branching (#17275 ) * feat: Per-conversation system message with optional display in UI, edition and branching (WIP) * chore: update webui build output	2025-12-06 13:19:05 +01:00
Sky	7b43f55753	ggml : improve error handling for search path existence checks (#17653 ) * Improve error handling for search path existence checks Refactor existence checks for search paths using std::error_code to handle potential errors. * Improve cache file existence check with error code Update fs::exists to use std::error_code for error handling. * Simplify existence check for search paths Simplify existence check for search paths * Fix logging path in error message for posix_stat * Update ggml/src/ggml-backend-reg.cpp Co-authored-by: Aman Gupta <amangupta052@gmail.com> * Adapt to the coding standard --------- Co-authored-by: Aman Gupta <amangupta052@gmail.com>	2025-12-06 12:28:16 +01:00
Daniel Bevenius	444f00b0ec	llama : remove quantization sanity check (#17788 ) * llama : remove quantization sanity check This commit removes the quantization sanity check for attention layers. The motivation for this is that there are model that are hybrid models that have recurrent layers, experts layers, and attention layers. For these models the current check fails as the experts layers are not taking into account. After consideration, it was decided that this check is not strictly necessary, and can be removed to allow for more flexible model architectures. * llama : remove unused pruned_attention_w and is_clip_model vars	2025-12-06 12:26:20 +01:00
Jeff Bolz	2960eb2975	vulkan: Use one row per workgroup for f32 mmv (#17711 ) The MoE models have a mul_mat_vec with very small m (32, 64, 128) right before the topk_moe selection. Running multiple rows per wg doesn't utilize the SMs well. I think even for larger m, f32 is so bandwidth-limited that running multiple rows doesn't help.	2025-12-06 11:12:26 +01:00
Xuan-Son Nguyen	dbc15a7967	convert: support Mistral 3 Large MoE (#17730 ) * convert: support Mistral 3 Large MoE * filter out vision tensors, add missing keys * handle vocab * add temperature_length * fix mscale_all_dim * clean up * Apply suggestions from code review Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * fix * Update gguf-py/gguf/tensor_mapping.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2025-12-06 10:49:33 +01:00
Jeff Bolz	c6c5e85979	vulkan: support solve_tri with larger N/K values (#17781 ) Split N into chunks to fit into shared memory. If K > 128, use a larger workgroup with enough invocations. Add perf tests matching qwen3next.	2025-12-06 08:56:45 +01:00
Concedo	1a14ae1183	lets try without volta specific kernels, fattn should fall back to tile	2025-12-06 15:56:07 +08:00

10709 commits