koboldcpp

mirror of https://github.com/LostRuins/koboldcpp.git synced 2026-06-02 15:39:26 +00:00

Author	SHA1	Message	Date
Concedo	76b22a7b23	updated lite	2026-02-01 22:16:13 +08:00
Concedo	a5ae116033	increase z-image default clamp to 4.0, to tolerate z-image base requirement for higher cfg	2026-02-01 22:02:20 +08:00
Concedo	b13bf44285	kde fractional scaling fix, tooltip fix (+1 squashed commits) Squashed commits: [1cf02dcce] kde fractional scaling fix	2026-02-01 21:55:44 +08:00
Concedo	9ef5d34740	fix mcp cert issues	2026-02-01 16:48:37 +08:00
Concedo	ffdc1b0f9f	flux2 image editing	2026-01-31 16:36:45 +08:00
Concedo	71069253b7	update sdui	2026-01-31 12:48:57 +08:00
Concedo	885fec37c1	update sdui	2026-01-30 21:05:10 +08:00
Concedo	a6efa9d182	Merge branch 'upstream' into concedo_experimental # Conflicts: # README.md # tests/test-backend-ops.cpp	2026-01-30 20:37:37 +08:00
Georgi Gerganov	c3b87cebff	tests : add GQA=20 FA test (#19095 ) Some checks failed Python Type-Check / pyright type-check (push) Waiting to run Details Check Pre-Tokenizer Hashes / pre-tokenizer-hashes (push) Has been cancelled Details Python check requirements.txt / check-requirements (push) Has been cancelled Details	2026-01-30 13:52:57 +02:00
Daniel Bevenius	0562503154	convert : add missing return statement for GraniteMoeModel (#19202 ) This commit adds a missing return statement to the GraniteMoeModel class to fix an issue in the model conversion process. Resolves: https://github.com/ggml-org/llama.cpp/issues/19201	2026-01-30 11:12:53 +01:00
Daniel Bevenius	83bcdf7217	memory : remove unused tmp_buf (#19199 ) This commit removes the unused tmp_buf variable from llama-kv-cache.cpp and llama-memory-recurrent.cpp. The tmp_buf variable was declared but never used but since it has a non-trivial constructor/desctuctor we don't get an unused variable warning about it.	2026-01-30 10:37:06 +01:00
Concedo	8d173f50c2	Merge branch 'upstream' into concedo_experimental # Conflicts: # .github/workflows/build.yml # docs/backend/SYCL.md # docs/backend/snapdragon/CMakeUserPresets.json # docs/backend/snapdragon/README.md # docs/backend/snapdragon/developer.md # docs/ops.md # docs/ops/SYCL.csv # embd_res/templates/upstage-Solar-Open-100B.jinja # ggml/src/CMakeLists.txt # ggml/src/ggml-hexagon/CMakeLists.txt # ggml/src/ggml-hexagon/ggml-hexagon.cpp # ggml/src/ggml-sycl/element_wise.cpp # ggml/src/ggml-sycl/element_wise.hpp # ggml/src/ggml-sycl/ggml-sycl.cpp # ggml/src/ggml-webgpu/wgsl-shaders/flash_attn.wgsl # tests/test-chat.cpp	2026-01-30 15:32:59 +08:00
Antonis Makropoulos	b316895ff9	docs: Add LlamaLib to UI projects (#19181 )	2026-01-30 14:54:28 +08:00
bssrdf	ecbf01d441	add tensor type checking as part of cuda graph properties (#19186 ) Some checks failed Update Operations Documentation / update-ops-docs (push) Has been cancelled Details	2026-01-30 12:57:52 +08:00
s8322	1025fd2c09	sycl: implement GGML_UNARY_OP_SOFTPLUS (#19114 ) * sycl: add softplus unary op implementation * sycl: add softplus unary op implementation * docs(ops): mark SYCL SOFTPLUS as supported * docs: update SYCL status for SOFTPLUS	2026-01-30 12:01:38 +08:00
RachelMantel	c7358ddf64	sycl: implement GGML_OP_TRI (#19089 ) * sycl: implement GGML_OP_TRI * docs: update ops.md for SYCL TRI * docs: regenerate ops.md * docs: update SYCL support for GGML_OP_TRI	2026-01-30 12:00:49 +08:00
DDXDB	d284baf1b5	Fix typos in SYCL documentation (#19162 ) * Fix typos in SYCL documentation * Update SYCL.md * Update SYCL.md * Update SYCL.md * Update docs/backend/SYCL.md Co-authored-by: Neo Zhang Jianyu <jianyu.zhang@intel.com> * Update SYCL.md --------- Co-authored-by: Neo Zhang Jianyu <jianyu.zhang@intel.com>	2026-01-30 09:46:57 +08:00
Zheyuan Chen	bd90fc74c3	ggml-webgpu: improve flastAttention performance by software pipelining (#19151 ) * webgpu : pipeline flash_attn Q/K loads in WGSL * ggml-webgpu: unroll QK accumlation inner loop ggml-webgpu: vectorization * ggml-webgpu: unrolling * ggml-webgpu: remove redundant unrolling * ggml-webgpu: restore the config * ggml-webgpu: remove redundant comments * ggml-webgpu: formatting * ggml-webgpu: formatting and remove vectorization * ggml-webgpu: remove unnecessary constants * ggml-webgpu: change QKV buffer to read_write to pass validation * ggml-webgpu: add explanation for the additional bracket around Q K accumulate * Indentation and for -> if for tail * Kick off CI on wgsl only commits --------- Co-authored-by: Reese Levine <reeselevine1@gmail.com>	2026-01-29 14:05:30 -08:00
Todor Boinovski	ce38a4db47	hexagon: enable offloading to Hexagon on Windows on Snapdragon (#19150 ) * hexagon: updates to enable offloading to HTP on WoS * Update windows.md * Update windows.md * hexagon: enable -O3 optimizations * hexagon: move all _WINDOWS conditional compilation to _WIN32 * hexagon: updates to enable offloading to HTP on WoS * hexagon: use run-time vs load-time dynamic linking for cdsp driver interface * refactor htp-drv * hexagon: add run-bench.ps1 script * hexagon: htdrv refactor * hexagon: unify Android and Windows build readmes * hexagon: update README.md * hexagon: refactor htpdrv * hexagon: drv refactor * hexagon: more drv refactor * hexagon: fixes for android builds * hexagon: factor out dl into ggml-backend-dl * hexagon: add run-tool.ps1 script * hexagon: merge htp-utils in htp-drv and remove unused code * wos: no need for getopt_custom.h * wos: add missing CR in htpdrv * hexagon: ndev enforecement applies only to the Android devices * hexagon: add support for generating and signing .cat file * hexagon: add .inf file * hexagon: working auto-signing and improved windows builds * hexagon: futher improve skel build * hexagon: add rough WoS guide * hexagon: updated windows guide * hexagon: improve cmake handling of certs and logging * hexagon: improve windows setup/build doc * hexagon: more windows readme updates * hexagon: windows readme updates * hexagon: windows readme updates * hexagon: windows readme updates * hexagon: windows readme updates * Update windows.md * Update windows.md * snapdragon: rename docs/backend/hexagon to docs/backends/snapdragon Also added a power shell script to simplify build env setup. * hexagon: remove trailing whitespace and move cmake requirement to user-presets * hexagon: fix CMakeUserPresets path in workflow yaml * hexagon: introduce local version of libdl.h * hexagon: fix src1 reuse logic gpt-oss needs a bigger lookahead window. The check for src[1] itself being quantized was wrong. --------- Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>	2026-01-29 12:33:21 -08:00
Georgi Gerganov	4fdbc1e4db	cuda : fix nkvo, offload and cuda graph node properties matching (#19165 ) * cuda : fix nkvo * cont : more robust cuda graph node property matching * cont : restore pre-leafs implementation * cont : comments + static_assert	2026-01-29 18:45:30 +02:00
Concedo	66e1913da6	fix blocked UA for mcp	2026-01-29 23:53:50 +08:00
Concedo	5c29510330	mcp try handle vscode	2026-01-29 23:39:36 +08:00
Aldehir Rojas	7b7ae857f6	chat : add parsing for solar-open-100b (#18540 ) * chat : add parsing for solar-open-100b * add comments to rules * cont : make assistant start optional * cont : remove assistant start prefix altogether --------- Co-authored-by: Piotr Wilkin (ilintar) <piotr.wilkin@syndatis.com>	2026-01-29 16:06:15 +01:00
Concedo	7e755014b2	Merge branch 'upstream' into concedo_experimental # Conflicts: # .github/workflows/winget.yml # CODEOWNERS # common/CMakeLists.txt # common/arg.cpp # docs/ops/SYCL.csv # examples/lookup/lookup-create.cpp # examples/lookup/lookup-stats.cpp # examples/lookup/lookup.cpp # examples/speculative-simple/speculative-simple.cpp # examples/speculative/speculative.cpp # ggml/src/ggml-hip/CMakeLists.txt # ggml/src/ggml-sycl/dpct/helper.hpp # ggml/src/ggml-sycl/ggml-sycl.cpp # ggml/src/ggml-sycl/norm.cpp # ggml/src/ggml-zendnn/ggml-zendnn.cpp # tests/test-chat-template.cpp	2026-01-29 23:05:05 +08:00
Andrew Marshall	84b0a98319	webui: Update Svelte to fix effect_update_depth_exceeded errors (#19144 ) The upstream fix is first available in 5.38.2, so constrain to at least that version. Rebuild pre-compiled webui index.html.gz based on these changes. See also: https://github.com/ggml-org/llama.cpp/issues/16347 https://github.com/huntabyte/bits-ui/issues/1687 https://github.com/sveltejs/svelte/issues/16548	2026-01-29 15:56:39 +01:00
Concedo	46cd17c17e	Merge commit '`88d23ad515`' into concedo_experimental # Conflicts: # CODEOWNERS # docs/build.md # ggml/CMakeLists.txt # ggml/src/CMakeLists.txt # ggml/src/ggml-webgpu/ggml-webgpu.cpp # ggml/src/ggml-zendnn/CMakeLists.txt # tests/test-chat-template.cpp	2026-01-29 22:25:56 +08:00
Concedo	cd6e087eeb	include kde in the fractional scaling fix	2026-01-29 22:06:21 +08:00
Wagner Bruna	1f01d54848	sd: sync to master-487-43e829f (#1947 )	2026-01-29 21:37:30 +08:00
Rose	deee1c2cfc	fixed mcp stdio server tool listing (#1950 )	2026-01-29 21:35:35 +08:00
Sigbjørn Skjæret	b45ef2702c	jinja : do not pass empty tools and add some none filters (#19176 ) Some checks are pending Check Pre-Tokenizer Hashes / pre-tokenizer-hashes (push) Waiting to run Details Python check requirements.txt / check-requirements (push) Waiting to run Details Python Type-Check / pyright type-check (push) Waiting to run Details Update Operations Documentation / update-ops-docs (push) Waiting to run Details	2026-01-29 14:06:54 +01:00
yulo	f3dd7b8e68	HIP: add mmf for CDNA (#18896 ) * refactor mmf rows_per_block * speed up compile * pass cdna compile * fix cuda error * clean up mmf * f32 mmf * clean float mma * fix mmf error * faster mmf * extend tile k * fix compile error * Revert "extend tile k" This reverts commit 4d2ef3d483932659801a59a5af0b6b48f6ffd5c7. * fix smem overflow * speed up compiling mmf * speed up compile for hip * 512 block for cdna * config pad size * fix as comment * update select logic * move some code to cuh * fix as comment * correct cdna3 config --------- Co-authored-by: zhang hui <you@example.com>	2026-01-29 11:10:53 +01:00
Georgi Gerganov	eed25bc6b0	arg : add -kvu to llama-batched-bench (#19172 )	2026-01-29 08:50:47 +02:00
Vishal Singh	b33df266d0	ggml-zendnn : resolve ZenDNN backend cross-module symbol dependency (#19159 )	2026-01-29 12:28:57 +08:00
Aman Gupta	3bcc990997	CUDA: refactor topk-moe to enable more models (GLM 4.7, Nemotron etc.) (#19126 )	2026-01-29 10:31:28 +08:00
Neo Zhang	d4964a7c66	sycl: fix norm kernels: l2_norm, group_norm, rms_norm by remove assert to support more cases (#19154 ) Co-authored-by: Neo Zhang Jianyu <jianyu.zhang@intel.com>	2026-01-29 09:20:22 +08:00
Sigbjørn Skjæret	50e8962f79	ci : find latest release with asset for winget (#19161 )	2026-01-28 22:05:39 +01:00
Ruben Ortlam	f6b533d898	Vulkan Flash Attention Coopmat1 Refactor (#19075 ) * vulkan: use coopmat for flash attention pv matrix multiplication fix P loading issue * fix barrier position * remove reduction that is no longer needed * move max thread reduction into loop * remove osh padding * add bounds checks and padding * remove unused code * fix shmem sizes, loop duration and accesses * don't overwrite Qf, add new shared psh buffer instead * add missing bounds checks * use subgroup reductions * optimize * move bounds check, reduce barriers * support other Bc values and other subgroup sizes * remove D_split * replace Of register array with shared memory Ofsh array * parallelize HSV across the rowgroups * go back to Of in registers, not shmem * vectorize sfsh * don't store entire K tile in shmem * fixes * load large k tiles to shmem on Nvidia * adapt shared memory host check function to shader changes * remove Bc 32 case * remove unused variable * fix missing mask reduction tmspsh barrier * fix mask bounds check * fix rowmax f16 under/overflow to inf * fix flash_attn_cm2 BLOCK_SIZE preprocessor directives	2026-01-28 18:52:45 +01:00
Sascha Rogmann	72d3b1898a	spec : add self‑speculative decoding (no draft model required) + refactor (#18471 ) * server: introduce self-speculative decoding * server: moved self-call into speculative.cpp * can_speculate() includes self-speculation Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * server: can_speculate() tests self-spec * server: replace can_speculate() with slot.can_speculate() Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * common: use %zu format specifier for size_t in logging Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * server: can_speculate() requires a task instance * common: ngram map, config self-speculative decoding * common: add enum common_speculative_type * common: add vector of speculative states * common: add option --spec-draftless * server: cleanup (remove slot.batch_spec, rename) * common: moved self-spec impl to ngram-map * common: cleanup (use common_speculative_state_draft) * spec : refactor * cont : naming * spec: remove --spec-config * doc: (draftless) speculative decoding * common: print performance in spec decoding * minor : cleanup * common : better names * minor : cleanup + fix build * minor: comments * CODEOWNERS: add common/ngram-map.* (#18471) * common : rename speculative.draftless_type -> speculative.type * ngram-map : fix uninitialized values * ngram-map : take into account the input can become shorter * ngram-map : revert len check for now * arg : change `--spec-draftless` -> `--spec-type` * spec : add common_speculative_state::accept() * spec : refactor + add common_speculative_begin() * spec : fix begin() call with mtmd * spec : additional refactor + remove common_speculative_params --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2026-01-28 19:42:42 +02:00
Daniel Bevenius	ebf5725870	convert : yield Mamba2Model/GraniteMoeModel modify_tensors (#19157 ) * convert : yield Mamba2Model/GraniteMoeModel modify_tensors This commit updates the `GraniteHybridModel` class' modify_tensors function to properly delegate to `Mamba2Model.modify_tensors` and `GraniteMoeModel.modify_tensors` using 'yield from' instead of 'return'. The motivation for this is that modify_tensors is a generator function (it uses 'yield from'), but the two calls above use return statements but don't yield anything which means that the the caller of this function will not receive any yielded values from it. And this causes layer tensors to be silently dropped during conversion.	2026-01-28 16:49:36 +01:00
Patryk Kaminski	0cd7032ca4	ggml-sycl: remove unused syclcompat header (#19140 ) The syclcompat/math.hpp is not used anymore. The change that intrduced it was successfuly reverted (https://github.com/ggml-org/llama.cpp/pull/17826). This include path will become obsolete and dropped in oneAPI 2026.0 effectively breaking ggml-sycl builds.	2026-01-28 23:33:54 +08:00
Concedo	226c79338f	handle glm4.7 flash template	2026-01-28 23:29:08 +08:00
Concedo	ef7fe1b5d4	make flash attention default in cli. added `--noflashattention`	2026-01-28 23:28:48 +08:00
Sigbjørn Skjæret	60368e1d73	jinja : undefined should be treated as sequence/iterable (return string/array) by filters/tests (#19147 ) * undefined is treated as iterable (string/array) by filters `tojson` is not a supported `undefined` filter * add tests * add sequence and iterable tests keep it DRY and fix some types	2026-01-28 14:40:29 +01:00
Oleksandr Kuvshynov	88d23ad515	vulkan: handle device dedup on MacOS + Vega II Duo cards (#19058 ) Deduplication here relied on the fact that vulkan would return unique UUID for different physical GPUs. It is at the moment not always the case. On Mac Pro 2019 running Mac OS, with 2 Vega II Duo cards (so, 4 GPU total), MotlenVK would assign same UUID to pairs of GPUs, unless they are connected with Infinity Fabric. See more details here: KhronosGroup/MoltenVK#2683. The right way is to fix that in MoltenVK, but until it is fixed, llama.cpp would only recognize 2 of 4 GPUs in such configuration. The deduplication logic here is changed to only filter GPUs if UUID is same but driver is different.	2026-01-28 12:35:54 +01:00
Ben Chen	0a95026da9	doc: add build instruction to use Vulkan backend on macos (#19029 )	2026-01-28 12:30:16 +01:00
Concedo	d877ec7aec	updated lite	2026-01-28 19:11:49 +08:00
Kevin Pouget	b7feacf7f3	ggml: new backend for Virglrenderer API Remoting acceleration (v2) (#18718 )	2026-01-28 17:49:40 +08:00
Alberto Cabrera Pérez	6ad70c5a77	ggml-cpu: arm64: Q4_K scale unroll and vectorization (#19108 )	2026-01-28 09:15:56 +02:00
Georgi Gerganov	631cbfcc7a	cuda : fix "V is K view" check for non-unified KV cache (#19145 )	2026-01-28 09:15:27 +02:00
Georgi Gerganov	2eee6c866c	CUDA: tune GLM 4.7 Flash FA kernel selection logic (DGX Spark) (#19142 )	2026-01-28 09:15:11 +02:00

1 2 3 4 5 ...

11465 commits