koboldcpp

mirror of https://github.com/LostRuins/koboldcpp.git synced 2026-05-09 19:46:11 +00:00

Author	SHA1	Message	Date
Concedo	68f9c6df91	fix cuda graph spams	2026-02-02 11:28:50 +08:00
Concedo	ddce19db72	Merge branch 'upstream' into concedo_experimental # Conflicts: # .devops/nix/package-gguf-py.nix # .devops/nix/scope.nix # common/CMakeLists.txt # docs/backend/SYCL.md # examples/lookahead/lookahead.cpp # examples/lookup/lookup.cpp # examples/sycl/run-llama2.sh # examples/sycl/win-run-llama2.bat # examples/sycl/win-test.bat # ggml/src/ggml-hexagon/CMakeLists.txt # ggml/src/ggml-hexagon/htp/flash-attn-ops.c # ggml/src/ggml-hexagon/htp/hvx-dump.h # ggml/src/ggml-hexagon/htp/hvx-reduce.h # ggml/src/ggml-hexagon/htp/matmul-ops.c # ggml/src/ggml-hexagon/htp/softmax-ops.c # ggml/src/ggml-hexagon/htp/unary-ops.c # ggml/src/ggml-opencl/CMakeLists.txt # ggml/src/ggml-opencl/ggml-opencl.cpp # ggml/src/ggml-opencl/kernels/cvt.cl # scripts/sync-ggml.last	2026-02-01 22:35:25 +08:00
Concedo	76b22a7b23	updated lite	2026-02-01 22:16:13 +08:00
Concedo	a5ae116033	increase z-image default clamp to 4.0, to tolerate z-image base requirement for higher cfg	2026-02-01 22:02:20 +08:00
Concedo	b13bf44285	kde fractional scaling fix, tooltip fix (+1 squashed commits) Squashed commits: [1cf02dcce] kde fractional scaling fix	2026-02-01 21:55:44 +08:00
Neo Zhang	2634ed207a	create test.sh to enhance the parameters for testing, update the guide, rm useless script (#19243 )	2026-02-01 18:24:00 +08:00
Concedo	9ef5d34740	fix mcp cert issues	2026-02-01 16:48:37 +08:00
Matthieu Coudron	41ea26144e	nix: fix nix develop .#python-scripts (#19218 ) Without this I get: > * Getting build dependencies for wheel... > * Building wheel... > Successfully built gguf-0.17.1-py3-none-any.whl > Finished creating a wheel... > Finished executing pypaBuildPhase > Running phase: pythonRuntimeDepsCheckHook > Executing pythonRuntimeDepsCheck > Checking runtime dependencies for gguf-0.17.1-py3-none-any.whl > - requests not installed For full logs, run: nix log /nix/store/x0c4a251l68bvdgang9d8v2fsmqay8a4-python3.12-gguf-0.0.0.drv I changed a bit the style to make it more terse ~> more elegant in my opinion.	2026-01-31 18:01:46 +02:00
Concedo	ffdc1b0f9f	flux2 image editing	2026-01-31 16:36:45 +08:00
nullname	89f10baad5	ggml-hexagon: flash-attention and reduce-sum optimizations (#19141 ) * wip * ggml-hexagon: add vectorized dot product function for FP32 and FP16 accumulation * ggml-hexagon: optimize dot product functions for FP16 and FP32 with new vectorized implementations * wip * ggml-hexagon: optimize hvx_vec_dump_f32_n and hvx_vec_reduce_sum_qf32x2 functions for improved performance * ggml-hexagon: refactor dot product functions to use a common loading function for improved readability * optimize vector dot product functions to use unified reduction for improved performance * wip * ggml-hexagon: add vectorized dot product function for FP32 and FP16 accumulation * ggml-hexagon: optimize dot product functions for FP16 and FP32 with new vectorized implementations * wip * ggml-hexagon: optimize hvx_vec_dump_f32_n and hvx_vec_reduce_sum_qf32x2 functions for improved performance * ggml-hexagon: refactor dot product functions to use a common loading function for improved readability * optimize vector dot product functions to use unified reduction for improved performance * hexagon: optimize reduce-sum for v75+ * hexagon: always keep row_sums in sf/fp32 * ggml-hexagon: enhance directory checks for HEXAGON_SDK_ROOT and HEXAGON_TOOLS_ROOT * fix compiling error after rebase --------- Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>	2026-01-30 21:14:20 -08:00
Concedo	71069253b7	update sdui	2026-01-31 12:48:57 +08:00
EugeoSynthesisThirtyTwo	3dd95914d0	quantize: add option --tensor-type-file to llama-quantize (#18572 ) * add option --tensor-type-file to llama-quantize, but it raises an error. * add error message when file not found * quantize: update help menu, fix CI Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> --------- Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> Co-authored-by: Your Name <you@example.com> Co-authored-by: Aaron Teo <aaron.teo1@ibm.com>	2026-01-31 11:39:21 +08:00
tc-mb	ec6c7421e4	mtmd: support MiniCPM-o 4.5(vision only) (#19211 ) Signed-off-by: tc-mb <caitianchi@modelbest.cn>	2026-01-30 23:19:30 +01:00
Daniele Pinna	1488339138	lookup, lookahead: fix crash when n_ctx not specified (#18729 ) * lookup, lookahead: fix crash when n_ctx not specified Since PR #16653 (Dec 15, 2025), the default n_ctx is 0 to enable automatic GPU memory fitting. This causes llama-lookup and llama-lookahead to crash when run without explicit -c flag: GGML_ASSERT(batch.seq_id[batch.n_tokens] && "llama_batch size exceeded") Root cause: Both examples use params.n_ctx directly for batch initialization, but params.n_ctx remains 0 even after the context is properly initialized to n_ctx_train internally. Bug history: - Nov 2023: lookahead.cpp created (PR #4207) with params.n_ctx pattern - Dec 2023: lookup.cpp created (PR #4484) with same pattern - Nov 2024: default n_ctx changed to 4096 (PR #10136) - bug dormant - Dec 2025: default n_ctx changed to 0 (PR #16653) - bug activated The bug was dormant for 2+ years because params.n_ctx defaulted to 512, then 4096. PR #16653 changed it to 0 for GPU auto-fitting, triggering the crash. Fix: Use llama_n_ctx(ctx) to get the actual runtime context size, matching the pattern already used elsewhere in lookup.cpp (line 72) and in speculative.cpp/speculative-simple.cpp. Tested: llama-lookup now works without -c flag (12.5% acceptance on Gemma-3-1B). Note: llama-lookahead has a separate pre-existing issue with sequence initialization (n_seq_max=1 vs W+G+1 needed) that is unrelated to this fix. * lookahead: fix n_seq_max and kv_unified configuration Lookahead decoding requires: - W + G + 1 = 31 sequences for parallel Jacobi decoding - Unified KV cache for coupled sequences in batch splitting These requirements were broken after PR #14482 changed validation logic. Consolidates fix from PR #18730 per maintainer request. Commit message drafted with Claude.	2026-01-30 22:10:24 +02:00
Georgi Gerganov	4927795810	ngram-mod : fix build [no ci] (#19216 )	2026-01-30 21:27:27 +02:00
shaofeiqi	971facc38e	opencl: add optimized q8_0 mm kernel for adreno (#18871 ) * Add Q8_0 OpenCL kernel Co-authored-by: yunjie <yunjie@qti.qualcomm.com> * opencl: fix build for non-adreno * opencl: refactor q8_0 * opencl: enforce subgroup size of 64 for adreno for q8_0 * For A750 and older generations, subgroup size can be 64 or 128. This kernel assumes subgroup size 64. * opencl: suppress warning when adreno kernels are disabled --------- Co-authored-by: yunjie <yunjie@qti.qualcomm.com> Co-authored-by: Li He <lih@qti.qualcomm.com>	2026-01-30 10:19:27 -08:00
Georgi Gerganov	d9a2a4bcaa	sync : ggml	2026-01-30 20:09:21 +02:00
Georgi Gerganov	dfd6106c84	cuda : fix compile warnings (whisper/0)	2026-01-30 20:09:21 +02:00
Georgi Gerganov	bbada8bfb9	server : wrap around the "id_slot" parameter (#19207 ) * server : wrap around the "id_slot" parameter * cont : minor	2026-01-30 19:46:10 +02:00
Simon Redman	13f3ebfae1	Correctly fetch q8_1 quantize pipeline in test as needed by `8a3519b` (#19194 )	2026-01-30 17:27:16 +01:00
Georgi Gerganov	dabaa2e77a	spec : add ngram-mod (#19164 ) * spec : add ngram-mod * cont : simplify + keep track of occupancy * cont : cleanup * cont : move initialization to common/speculative * cont : cleanup * cont : cleanup * cont : fix	2026-01-30 18:21:48 +02:00
Marcello Seri	2e916f996a	jinja : add unordered_map include to value.h [no ci] (#19205 ) On macos Sequoia 15.7.3, x86_64, the build has recently started failing with ``` In file included from .../code/cpp/llama.cpp/common/jinja/string.cpp:2: .../code/cpp/llama.cpp/common/./jinja/value.h:478:10: error: no template named 'unordered_map' in namespace 'std' 478 \| std::unordered_map<value, value, value_hasher, value_equivalence> unordered; \| ~~~~~^ In file included from .../code/cpp/llama.cpp/common/jinja/caps.cpp:1: .../code/cpp/llama.cpp/common/jinja/value.h:478:10: error: no template named 'unordered_map' in namespace 'std' 478 \| std::unordered_map<value, value, value_hasher, value_equivalence> unordered; \| ~~~~~^ In file included from .../code/cpp/llama.cpp/common/jinja/value.cpp:1: In file included from .../code/cpp/llama.cpp/common/jinja/runtime.h:4: .../code/cpp/llama.cpp/common/jinja/value.h:478:10: error: no template named 'unordered_map' in namespace 'std' 478 \| std::unordered_map<value, value, value_hasher, value_equivalence> unordered; [...] ``` After a bit of digging to make sure all the appropriate flags were used, I notifced that the necessary header was not included. This fixes the build for me and should not affect negatively other builds that for some reasons were already succeeding	2026-01-30 16:09:44 +01:00
Daniel Bevenius	f3bc98890c	memory : clarify comments for r_l and s_l tensors [no ci] (#19203 ) This commit updates the comments in state_write_data to clarify that it is handling the R and S tensors and not Key and Value tensors.	2026-01-30 15:18:41 +01:00
Concedo	885fec37c1	update sdui	2026-01-30 21:05:10 +08:00
Concedo	a6efa9d182	Merge branch 'upstream' into concedo_experimental # Conflicts: # README.md # tests/test-backend-ops.cpp	2026-01-30 20:37:37 +08:00
Georgi Gerganov	c3b87cebff	tests : add GQA=20 FA test (#19095 )	2026-01-30 13:52:57 +02:00
Daniel Bevenius	0562503154	convert : add missing return statement for GraniteMoeModel (#19202 ) This commit adds a missing return statement to the GraniteMoeModel class to fix an issue in the model conversion process. Resolves: https://github.com/ggml-org/llama.cpp/issues/19201	2026-01-30 11:12:53 +01:00
Daniel Bevenius	83bcdf7217	memory : remove unused tmp_buf (#19199 ) This commit removes the unused tmp_buf variable from llama-kv-cache.cpp and llama-memory-recurrent.cpp. The tmp_buf variable was declared but never used but since it has a non-trivial constructor/desctuctor we don't get an unused variable warning about it.	2026-01-30 10:37:06 +01:00
Concedo	8d173f50c2	Merge branch 'upstream' into concedo_experimental # Conflicts: # .github/workflows/build.yml # docs/backend/SYCL.md # docs/backend/snapdragon/CMakeUserPresets.json # docs/backend/snapdragon/README.md # docs/backend/snapdragon/developer.md # docs/ops.md # docs/ops/SYCL.csv # embd_res/templates/upstage-Solar-Open-100B.jinja # ggml/src/CMakeLists.txt # ggml/src/ggml-hexagon/CMakeLists.txt # ggml/src/ggml-hexagon/ggml-hexagon.cpp # ggml/src/ggml-sycl/element_wise.cpp # ggml/src/ggml-sycl/element_wise.hpp # ggml/src/ggml-sycl/ggml-sycl.cpp # ggml/src/ggml-webgpu/wgsl-shaders/flash_attn.wgsl # tests/test-chat.cpp	2026-01-30 15:32:59 +08:00
Antonis Makropoulos	b316895ff9	docs: Add LlamaLib to UI projects (#19181 )	2026-01-30 14:54:28 +08:00
bssrdf	ecbf01d441	add tensor type checking as part of cuda graph properties (#19186 )	2026-01-30 12:57:52 +08:00
s8322	1025fd2c09	sycl: implement GGML_UNARY_OP_SOFTPLUS (#19114 ) * sycl: add softplus unary op implementation * sycl: add softplus unary op implementation * docs(ops): mark SYCL SOFTPLUS as supported * docs: update SYCL status for SOFTPLUS	2026-01-30 12:01:38 +08:00
RachelMantel	c7358ddf64	sycl: implement GGML_OP_TRI (#19089 ) * sycl: implement GGML_OP_TRI * docs: update ops.md for SYCL TRI * docs: regenerate ops.md * docs: update SYCL support for GGML_OP_TRI	2026-01-30 12:00:49 +08:00
DDXDB	d284baf1b5	Fix typos in SYCL documentation (#19162 ) * Fix typos in SYCL documentation * Update SYCL.md * Update SYCL.md * Update SYCL.md * Update docs/backend/SYCL.md Co-authored-by: Neo Zhang Jianyu <jianyu.zhang@intel.com> * Update SYCL.md --------- Co-authored-by: Neo Zhang Jianyu <jianyu.zhang@intel.com>	2026-01-30 09:46:57 +08:00
Zheyuan Chen	bd90fc74c3	ggml-webgpu: improve flastAttention performance by software pipelining (#19151 ) * webgpu : pipeline flash_attn Q/K loads in WGSL * ggml-webgpu: unroll QK accumlation inner loop ggml-webgpu: vectorization * ggml-webgpu: unrolling * ggml-webgpu: remove redundant unrolling * ggml-webgpu: restore the config * ggml-webgpu: remove redundant comments * ggml-webgpu: formatting * ggml-webgpu: formatting and remove vectorization * ggml-webgpu: remove unnecessary constants * ggml-webgpu: change QKV buffer to read_write to pass validation * ggml-webgpu: add explanation for the additional bracket around Q K accumulate * Indentation and for -> if for tail * Kick off CI on wgsl only commits --------- Co-authored-by: Reese Levine <reeselevine1@gmail.com>	2026-01-29 14:05:30 -08:00
Todor Boinovski	ce38a4db47	hexagon: enable offloading to Hexagon on Windows on Snapdragon (#19150 ) * hexagon: updates to enable offloading to HTP on WoS * Update windows.md * Update windows.md * hexagon: enable -O3 optimizations * hexagon: move all _WINDOWS conditional compilation to _WIN32 * hexagon: updates to enable offloading to HTP on WoS * hexagon: use run-time vs load-time dynamic linking for cdsp driver interface * refactor htp-drv * hexagon: add run-bench.ps1 script * hexagon: htdrv refactor * hexagon: unify Android and Windows build readmes * hexagon: update README.md * hexagon: refactor htpdrv * hexagon: drv refactor * hexagon: more drv refactor * hexagon: fixes for android builds * hexagon: factor out dl into ggml-backend-dl * hexagon: add run-tool.ps1 script * hexagon: merge htp-utils in htp-drv and remove unused code * wos: no need for getopt_custom.h * wos: add missing CR in htpdrv * hexagon: ndev enforecement applies only to the Android devices * hexagon: add support for generating and signing .cat file * hexagon: add .inf file * hexagon: working auto-signing and improved windows builds * hexagon: futher improve skel build * hexagon: add rough WoS guide * hexagon: updated windows guide * hexagon: improve cmake handling of certs and logging * hexagon: improve windows setup/build doc * hexagon: more windows readme updates * hexagon: windows readme updates * hexagon: windows readme updates * hexagon: windows readme updates * hexagon: windows readme updates * Update windows.md * Update windows.md * snapdragon: rename docs/backend/hexagon to docs/backends/snapdragon Also added a power shell script to simplify build env setup. * hexagon: remove trailing whitespace and move cmake requirement to user-presets * hexagon: fix CMakeUserPresets path in workflow yaml * hexagon: introduce local version of libdl.h * hexagon: fix src1 reuse logic gpt-oss needs a bigger lookahead window. The check for src[1] itself being quantized was wrong. --------- Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>	2026-01-29 12:33:21 -08:00
Georgi Gerganov	4fdbc1e4db	cuda : fix nkvo, offload and cuda graph node properties matching (#19165 ) * cuda : fix nkvo * cont : more robust cuda graph node property matching * cont : restore pre-leafs implementation * cont : comments + static_assert	2026-01-29 18:45:30 +02:00
Concedo	66e1913da6	fix blocked UA for mcp	2026-01-29 23:53:50 +08:00
Concedo	5c29510330	mcp try handle vscode	2026-01-29 23:39:36 +08:00
Aldehir Rojas	7b7ae857f6	chat : add parsing for solar-open-100b (#18540 ) * chat : add parsing for solar-open-100b * add comments to rules * cont : make assistant start optional * cont : remove assistant start prefix altogether --------- Co-authored-by: Piotr Wilkin (ilintar) <piotr.wilkin@syndatis.com>	2026-01-29 16:06:15 +01:00
Concedo	7e755014b2	Merge branch 'upstream' into concedo_experimental # Conflicts: # .github/workflows/winget.yml # CODEOWNERS # common/CMakeLists.txt # common/arg.cpp # docs/ops/SYCL.csv # examples/lookup/lookup-create.cpp # examples/lookup/lookup-stats.cpp # examples/lookup/lookup.cpp # examples/speculative-simple/speculative-simple.cpp # examples/speculative/speculative.cpp # ggml/src/ggml-hip/CMakeLists.txt # ggml/src/ggml-sycl/dpct/helper.hpp # ggml/src/ggml-sycl/ggml-sycl.cpp # ggml/src/ggml-sycl/norm.cpp # ggml/src/ggml-zendnn/ggml-zendnn.cpp # tests/test-chat-template.cpp	2026-01-29 23:05:05 +08:00
Andrew Marshall	84b0a98319	webui: Update Svelte to fix effect_update_depth_exceeded errors (#19144 ) The upstream fix is first available in 5.38.2, so constrain to at least that version. Rebuild pre-compiled webui index.html.gz based on these changes. See also: https://github.com/ggml-org/llama.cpp/issues/16347 https://github.com/huntabyte/bits-ui/issues/1687 https://github.com/sveltejs/svelte/issues/16548	2026-01-29 15:56:39 +01:00
Concedo	46cd17c17e	Merge commit '88d23ad515' into concedo_experimental # Conflicts: # CODEOWNERS # docs/build.md # ggml/CMakeLists.txt # ggml/src/CMakeLists.txt # ggml/src/ggml-webgpu/ggml-webgpu.cpp # ggml/src/ggml-zendnn/CMakeLists.txt # tests/test-chat-template.cpp	2026-01-29 22:25:56 +08:00
Concedo	cd6e087eeb	include kde in the fractional scaling fix	2026-01-29 22:06:21 +08:00
Wagner Bruna	1f01d54848	sd: sync to master-487-43e829f (#1947 )	2026-01-29 21:37:30 +08:00
Rose	deee1c2cfc	fixed mcp stdio server tool listing (#1950 )	2026-01-29 21:35:35 +08:00
Sigbjørn Skjæret	b45ef2702c	jinja : do not pass empty tools and add some none filters (#19176 )	2026-01-29 14:06:54 +01:00
yulo	f3dd7b8e68	HIP: add mmf for CDNA (#18896 ) * refactor mmf rows_per_block * speed up compile * pass cdna compile * fix cuda error * clean up mmf * f32 mmf * clean float mma * fix mmf error * faster mmf * extend tile k * fix compile error * Revert "extend tile k" This reverts commit 4d2ef3d483932659801a59a5af0b6b48f6ffd5c7. * fix smem overflow * speed up compiling mmf * speed up compile for hip * 512 block for cdna * config pad size * fix as comment * update select logic * move some code to cuh * fix as comment * correct cdna3 config --------- Co-authored-by: zhang hui <you@example.com>	2026-01-29 11:10:53 +01:00
Georgi Gerganov	eed25bc6b0	arg : add -kvu to llama-batched-bench (#19172 )	2026-01-29 08:50:47 +02:00
Vishal Singh	b33df266d0	ggml-zendnn : resolve ZenDNN backend cross-module symbol dependency (#19159 )	2026-01-29 12:28:57 +08:00

11482 commits