koboldcpp

mirror of https://github.com/LostRuins/koboldcpp.git synced 2026-05-31 05:03:44 +00:00

Author	SHA1	Message	Date
Adrien Gallouët	98e480a32e	app : move licences to llama-app (#23824 ) Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2026-05-29 07:46:11 +02:00
Andreas Kieslinger	241cbd41d2	cuda : disables launch_fattn PDL enrollment due to compiler bug (#23825 )	2026-05-29 07:46:10 +03:00
Matt Corallo	33c718db1f	meta : Add missing `buffer` set in allreduce fallback !COMPUTE clear (#23480 ) Without this at least the vulkan backend will skip the `* 0` for !COMPUTE tensors, causing corrupt output.	2026-05-29 06:30:24 +03:00
Max Krasnyansky	19e92c33ef	hexagon: basic/generic op fusion support and RMS_NORM+MUL fusion (#23835 ) Updating infra to enable op fusion and using RMS_NORM+MUL as the use-case.	2026-05-28 14:05:54 -07:00
Xuan-Son Nguyen	751ebd17a5	mtmd-debug: add color and rainbow mode (#23829 ) * mtmd-debug: add color and rainbow mode * fix M_PI * max_dist	2026-05-28 20:59:14 +02:00
Xuan-Son Nguyen	c8914ad4f4	mtmd: fix gemma 4 projector pre_norm (#23822 )	2026-05-28 20:58:55 +02:00
lhez	408ae2b9e5	opencl: move backend info printing into its own function (#23702 ) * opencl: move backend info print into its own function * opencl: move new log line * opencl: fix for non adreno path	2026-05-28 11:05:42 -07:00
Sigbjørn Skjæret	3ef2369551	ci : run ui publish on ubuntu-slim (#23818 ) * run ui publish on self-hosted fast * run on ubuntu-slim	2026-05-28 20:58:32 +03:00
ValdikSS	2f6c815dc4	ui: fix audio and video modality detection (#23756 ) When model props are fetched asynchronously from the server, modelPropsVersion is incremented to trigger reactivity, but only the vision effect was listening to it.	2026-05-28 17:36:10 +02:00
Georgi Gerganov	445b7cef62	ci : releases use Github-hosted builds for the UI (#23823 ) * ci : releases use Github-hosted builds for the UI * cont : fix name	2026-05-28 17:50:32 +03:00
Adrien Gallouët	479a9a1b03	app : improve help output (#23805 ) Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2026-05-28 16:45:06 +02:00
Saba Fallah	0b56d283bf	mtmd: n_head_kv defaults to n_head (#23782 ) removed AI-generated comment	2026-05-28 16:44:36 +02:00
Xuan-Son Nguyen	d6be3158e1	mtmd: fix gemma 4 audio rms norm eps (#23815 ) * mtmd: fix gemma 4 audio rms norm eps * Update tools/mtmd/clip.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2026-05-28 16:31:37 +02:00
Georgi Gerganov	dd1557907a	ci : change Vulkan builds to Release to reduce ccache (#23820 ) * ci : disable all CPU variant builds for Vulkan workflow * cont : change cache key * cont : change build type	2026-05-28 17:29:11 +03:00
Mikolaj Kucharski	7fb1e70b59	arg: Add LLAMA_ARG_API_KEY_FILE environment variable for --api-key-file (#23167 )	2026-05-28 16:25:40 +02:00
Johannes Gäßler	d374e71e55	test-llama-archs: fix table format [no release] (#23810 )	2026-05-28 15:53:54 +02:00
fl0rianr	30af6e2b98	ggml: auto apply iGPU flag CUDA/HIP if integrated device (#23007 )	2026-05-28 15:01:14 +02:00
redfox	d7be46189f	mmvq Optim: add MMVQ_PARAMETERS_TURING(mmvq_parameter_table_id) for … (#23729 ) * mmvq Optim: add MMVQ_PARAMETERS_TURING(mmvq_parameter_table_id) for SM75 TURING * avoid a mismatch for JIT compilation of Turing device code for Ampere or newer Co-authored-by: Johannes Gäßler <johannesg@5d6.de> --------- Co-authored-by: Copilot <copilot@github.com> Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2026-05-28 14:51:14 +02:00
Jaden_Mach	bc81d47aba	CUDA: route batch>=4 quantized matmul to MMQ on AMD MFMA hardware (#23227 ) * CUDA: per-quant MMVQ/MMQ batch threshold on AMD MFMA hardware The dispatcher uses a single global threshold (MMVQ_MAX_BATCH_SIZE = 8) to choose between mul_mat_vec_q (per-row GEMV) and mul_mat_q (MFMA-tiled GEMM) for quantized matmul. On AMD CDNA, the optimal crossover differs substantially by quant family because the per-row GEMV cost is dominated by dequantisation, not the dot-product itself: K-quants pay a heavier super-block decode and so MMQ wins sooner; legacy and IQ quants have lean decode and stay ahead until the batch fully populates an MFMA tile. This patch introduces ggml_cuda_should_use_mmvq(type, cc, ne11) -> bool, mirroring the existing ggml_cuda_should_use_mmq, and gates per-quant thresholds on amd_mfma_available(cc): Q3_K, Q4_K, Q5_K : MMVQ <= 3 (MMQ wins from batch=4: +5% .. +76%) Q2_K, Q6_K : MMVQ <= 5 (MMQ wins from batch=6: +8% .. +35%) others : MMVQ <= 8 (legacy & IQ regress under MMQ; unchanged) Non-AMD-MFMA paths (NVIDIA, RDNA, CDNA1 without MFMA) are byte-identical to master. GGML_CUDA_FORCE_MMVQ=1 restores the original global threshold for A/B testing. Measured on MI250X (gfx90a, ROCm 7.2.1) with Llama-3.2-3B-Instruct, llama-bench pp512 across all 20 supported quants, ubatch 1..8, 10 reps. Full table in PR description. Selected pp512 throughput (tok/s, ub=8): Q4_K_S: 559 -> 940 (+68%) Q5_K_S: 503 -> 884 (+76%) Q3_K_S: 629 -> 879 (+40%) Q2_K : 615 -> 809 (+32%) Q6_K : 582 -> 776 (+33%) Selected pp512 throughput (tok/s, ub=4): Q4_K_S: 444 -> 480 (+ 8%) Q4_0 : 682 -> 685 (+ 0%) (no regression - retains MMVQ) IQ4_XS: 706 -> 698 (- 1%) (no regression - retains MMVQ) * CUDA: address review — inline MMVQ batch table, drop env hatch & doc block * tune kernel selection logic for CDNA1 --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2026-05-28 14:50:25 +02:00
Funtowicz Morgan	0b246862b9	server: minor tweaks to use more cpp features (#23785 ) * misc(server): add default port to impl RAII * misc(server): register_gcp_compat() can be const * misc(server): use proper cpp const/auto methods * misc(server): do not reset a unique_ptr, use make_unique instead to be exception safe	2026-05-28 14:00:25 +02:00
Max Krasnyansky	a919001134	hexagon: minor refresh for HMX FA and MM (#23796 ) * hex-fa: clean up qf32/fp32 handling and stride handling * hex-fa: fix corner case fp NAN issues that were cause bad output from gemma4 on v79 * hex-fa: vectorize leftover handling * hex-fa: avoid HVX fallback during token gen HMX has more FP16 compute capacity * hmx-mm: remove dead code * hmx-mm: use fastdiv in x4x2 dequant * hmx-mm: sandwich dequant and scatter to improve perf * hmx-mm: fixed rebase conflicts * hmx-mm: further improve weight dequant by doing early type dispatch and precomputing fastdiv * hmx-mm: an even earlier dispatch for per-type dequant * hmx-mm: dequant linear types like q4_0 and q4_1 without the LUTs This is a bit faster than LUT. * hex-cmake: one more tweak for lto --------- Co-authored-by: Trivikram Reddy <tamarnat@qti.qualcomm.com>	2026-05-28 04:49:11 -07:00
Jeff Bolz	48e7078ee0	vulkan: fast path for walsh-hadamard transform (#23687 ) * vulkan: fast path for walsh-hadamard transform * disable for intel due to segfault	2026-05-28 13:18:43 +02:00
Jesus Talavera	bb771cbd2b	chat : add Granite 4.1 chat template (#23518 )	2026-05-28 13:13:33 +02:00
Winston Ma	7c48fb81ce	vulkan: fix wrong index variable in inner loop (#23665 )	2026-05-28 12:48:34 +02:00
Winston Ma	91eb8f4fa0	vulkan: Fix memory logger unsafe iterator access (#23667 )	2026-05-28 12:46:07 +02:00
Markus Tavenrath	d205df6812	server, ui : Add support for HTTP ETags in llama-server (#23701 ) * allow caching of ui elements in llama-server * use fnv_hash * Update tools/server/server-http.cpp etag has to be set always Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com> --------- Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>	2026-05-28 12:21:24 +02:00
Sachin Sharma	e8d2567429	docker : add ZenDNN Dockerfile (#23716 )	2026-05-28 11:40:49 +02:00
fairydreaming	09e7b76c93	cuda : fix KQ mask offset integer overflow in fattn MMA kernel (#23610 ) Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>	2026-05-28 10:55:42 +02:00
Adrien Gallouët	48e7eae41c	perplexity : fix format specifier in LOG_ERR (#23788 ) Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2026-05-28 10:34:58 +03:00
ynankani	c5229087a5	convert : add FP8 to Q8 conversion (#23250 ) Signed-off-by: ynankani <ynankani@nvidia.com>	2026-05-28 10:16:17 +03:00
Martin Klacer	e31cdaa0eb	ggml: fixed Arm SVE usage bug in vec.h, vec.cpp (#22841 ) * Updated vec.h/vec.cpp code to accumulate to F32 rather than F16 Change-Id: I0cb789347f2bf60ffaf9047319f727e788c825f8 Signed-off-by: Martin Klacer <martin.klacer@arm.com> Co-authored-by: Milos Puzovic <Milos.Puzovic@arm.com>	2026-05-28 10:04:21 +03:00
Georgi Gerganov	491c4d7d2e	ci : refactor (#23789 ) * ci : separate CUDA windows workflow + fix names * ci : rename workflow * ci : prefix cache names with workflow name * ci : rename build.yml -> build-cpu.yml * ci : cache keys * ci : fix windows cuda/hip concurrency of release workflow * ci : fix apple cache names * ci : add TODOs * cont : keep just the last cache * ci : update release concurrency to queue * ci : move the release trigger to ubuntu-slim * ci : hip add TODO * cont : improve words Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2026-05-28 09:44:25 +03:00
ymcki	939a7dd648	Hexagon: OP_GATED_DELTA_NET K>1 support (#23531 ) * K>1 state snapshot support * removed picky indent multiple of 4 fixes	2026-05-27 23:05:25 -07:00
ymcki	8ad8aef447	opencl: OP_GATED_DELTA_NET (#23312 ) * OP_GATED_DELTA_NET impl * add back lanes_per_column declaration * removed has_subgroup_arithmetic and has_subgroup_clustered_reduce * removed trailing spaces and fixes indentation. Hard coded subgroup size for Adreno and Intel. Return not supported when K>1 state snapshot * support for K>1 state snapshot * removed picky indent multiple of 4 fixes * removed return that won't be executed	2026-05-27 21:23:21 -07:00
Reese Levine	f12cc6d0fa	ggml-webgpu: remove legacy constants (#23672 )	2026-05-27 14:22:33 -07:00
Max Krasnyansky	aa50b2c2ae	hexagon: add support for Q4_1 in MUL_MAT and MUL_MAT_ID (#23647 ) * hex-mm: add support for Q4_1 matmul/matvec, hvx-only for now * hmx-mm: add support for Q4_1 * hex-mm: use Q8_1 dynamic quantization to avoid having to compute sums in the vec_dot * hexagon: fix repack scratch buffer overflow * hex-mm: fix Q4_1 repack buffer sizing * hexagon: flip the build order for mm and fa (seems to help LTO) * hex-mm: add vec_dot 4x1s and minor HMX cleanup after adding Q4_1 * hex-mm: fix fp16 vec_dot fallback to 2x1 and another issue that could cause incorrect output * hexagon: resurrect early-wake and add support for polling for op-batch completions With Q4_1 ggml-hexagon now claims pretty much the entire graphs which gives the CPU more time to chilax. This is a good thing! But it does add extra latency for the pure benchmark runs. Early wakeup helps recover the latency a bit in the normals runs and op-batch polling is just for benchmarking. --------- Co-authored-by: Todor Boinovski <todorb@qti.qualcomm.com>	2026-05-27 10:46:11 -07:00
Masashi Yoshimura	c40006a62e	ggml-webgpu: Fix how to dispatch WG to some ops (#23750 )	2026-05-27 09:48:12 -07:00
Matt Corallo	c6e4088376	vulkan: Switch MUL_MAT_VEC to 4 K per iteration for F16/32 (#22887 ) * vulkan: Switch MUL_MAT_VEC to 4 K per iteration for F16/32 Against mesa git, this shows a 4.8% performance improvement for tg128 on Qwen3.5-9B:BF16 on Intel BMG. Note that this breaks some tests until the last commit which fixes OOB A reads. * vulkan: Use aligned loads in mul_mat_vec when available Against mesa git, this shows a 3.3% performance improvement for tg128 on Qwen3.5-9B:BF16 on Intel BMG. * Make explicit that `num_rows` is <= `NUM_ROWS` in mul_mat_vec Mesa's UUB logic can't see through conditionals, limiting its ability to understand the bounds on the `num_rows` field in the cleanup run. Making it explicit that `num_rows` is, indeed, always <= `NUM_ROWS` helps mesa make slightly better codegen. Against mesa git, this currently shows a 1% performance improvement in tg128 on Qwen3.5-9B:BF16 on Intel BMG. * vulkan: Fix OOB A reads in MUL_MAT_VEC for odd sizes There was a TODO to fix the OOB reads from the A matrix which we do here. It is within performance noise (+<0.1%) in tg128 for Qwen3.5-9B:BF16 on Intel BMG.	2026-05-27 17:19:23 +02:00
Jeff Bolz	b36eefc1b3	vulkan: use GL_NV_cooperative_matrix_decode_vector for faster matmul (#23541 )	2026-05-27 17:18:28 +02:00
l8bloom	837bb6b447	vulkan: add REPEAT op support for f16 to f16. (#23298 ) * feat: extend repeat op for vulkan * feat: add repeat_f16 vulkan pipeline * fix: ensure same dst and src types * fix: use type_size instead of data types * fix: use int16 and int32 for repeat shader op * chore: rename repeat_f* to repeat_i* * chore: rename repeat vulkan pipelines	2026-05-27 16:59:08 +02:00
Georgi Gerganov	ba4dd0bc67	ci : move ARM jobs to self-hosted + disable kleidiai mac release (#23780 ) * ci : move ARM jobs to 3rd-party runners + disable kleidiai release * cont : fix deps + fix names * ocd : fix names * cont : fix PR links	2026-05-27 17:22:20 +03:00
Alessandro de Oliveira Faria (A.K.A.CABELO)	617255d437	vendor : update cpp-httplib to 0.46.0 (#23650 )	2026-05-27 21:36:24 +08:00
Sigbjørn Skjæret	87b0a60cdd	pyproject : add conversion folder and update dependencies (#23746 ) * add conversion folder and update dependencies * limit python version for triton * update dev-dependencies section	2026-05-27 15:06:18 +02:00
Oliver Simons	fda8528aa8	CUDA: restrict PDL to CTK >= 12.3 due to MSVC issues (#23742 )	2026-05-27 15:21:04 +03:00
Sigbjørn Skjæret	2d0656fbdd	ci : bump cuda release to 13.3 (#23749 )	2026-05-27 15:06:08 +03:00
Georgi Gerganov	6b4e4bd582	common : fix env names to all have LLAMA_ARG_ prefix (#23778 )	2026-05-27 14:52:47 +03:00
Georgi Gerganov	9f0e4b14d2	ci : fix windows ccaches (#23777 ) * ci : server windows set build type explicitly * cont : try windows-2025 * ci : use llvm * cont : use ninja * cont : fix shell * ci : set number of jobs correctly * ci : fix windows with vulkan ccache by using llvm * ci : server ccache only on master * ocd : fix job names [no release]	2026-05-27 13:54:21 +03:00
Sigbjørn Skjæret	b3a739c9b6	ci : remove wasm test (#23733 ) * run tests in correct build folder * remove wasm test	2026-05-27 13:11:37 +03:00
Winston Ma	4d8cc0c56f	vulkan: avoid preferring transfer queue on AMD UMA devices (#22455 )	2026-05-27 11:48:40 +02:00
Georgi Gerganov	0d227ec358	ci : add ccache to server builds + fix undefined sanitizer build (#23763 ) * ci : fix undefined sanitizer build to use Debug build type only * ci : ccache the server builds * cont : remove ui dependency + reuse ccache for both ubuntu jobs * tmp : force ccache save * Revert "tmp : force ccache save" This reverts commit a857b03a10b1304d456129a017e0e46b185618ee. * cont : no need for node.js	2026-05-27 11:45:12 +03:00
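
Note on bc81d47aba (CUDA: route batch>=4 quantized matmul to MMQ on AMD MFMA hardware): the per-quant crossover described in that commit message can be sketched roughly as below. This is an illustrative Python model, not the CUDA code from the PR. The thresholds are copied from the commit message; the function signature, the string quant names, and the table name are assumptions made for the example. The later bullets in the same commit (inlining the table, dropping the env hatch, CDNA1 tuning) suggest the merged logic may differ in detail.

```python
# Hypothetical model of the per-quant MMVQ/MMQ selection described in the
# commit message of bc81d47aba. Only the numeric thresholds come from that
# message; all names here are illustrative.

# Largest batch size for which the per-row GEMV kernel (MMVQ) is still
# preferred on AMD MFMA hardware, keyed by quant type.
MMVQ_MAX_BATCH_AMD_MFMA = {
    "Q3_K": 3, "Q4_K": 3, "Q5_K": 3,  # MMQ wins from batch=4
    "Q2_K": 5, "Q6_K": 5,             # MMQ wins from batch=6
}

# Global threshold used for every other quant type and every other device.
MMVQ_MAX_BATCH_DEFAULT = 8


def should_use_mmvq(quant_type: str, amd_mfma: bool, batch: int) -> bool:
    """Return True if the GEMV kernel (MMVQ) should handle this matmul."""
    if amd_mfma:
        limit = MMVQ_MAX_BATCH_AMD_MFMA.get(quant_type, MMVQ_MAX_BATCH_DEFAULT)
    else:
        # Paths without AMD MFMA keep the single global threshold.
        limit = MMVQ_MAX_BATCH_DEFAULT
    return batch <= limit
```

With this model, a Q4_K matmul at batch 4 goes to MMQ on AMD MFMA hardware but stays on MMVQ elsewhere, while Q4_0 keeps MMVQ up to batch 8 on both.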

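Note on d205df6812 (server, ui : Add support for HTTP ETags in llama-server): the commit mentions hashing with FNV and always setting the ETag. A minimal sketch of that idea follows, assuming the 64-bit FNV-1a variant and an exact-match comparison against `If-None-Match`. The commit message does not say which FNV variant or width is used, and `make_etag` and `respond` are hypothetical helpers, not functions from `tools/server/server-http.cpp`.

```python
from typing import Dict, Optional, Tuple

# Standard 64-bit FNV-1a constants (the variant is an assumption here).
FNV_OFFSET_BASIS_64 = 0xCBF29CE484222325
FNV_PRIME_64 = 0x100000001B3


def fnv1a_64(data: bytes) -> int:
    """Compute the 64-bit FNV-1a hash of a byte string."""
    h = FNV_OFFSET_BASIS_64
    for byte in data:
        h ^= byte
        h = (h * FNV_PRIME_64) & 0xFFFFFFFFFFFFFFFF  # wrap to 64 bits
    return h


def make_etag(body: bytes) -> str:
    """Build a quoted strong ETag from the content hash."""
    return f'"{fnv1a_64(body):016x}"'


def respond(body: bytes,
            if_none_match: Optional[str]) -> Tuple[int, Dict[str, str], bytes]:
    """Return (status, headers, payload) for a cacheable static asset."""
    etag = make_etag(body)
    headers = {"ETag": etag}  # set on every response, including 304
    if if_none_match == etag:
        # The client's cached copy is current, so no body is sent.
        return 304, headers, b""
    return 200, headers, body
```

A client that echoes the `ETag` value back in `If-None-Match` receives a 304 with an empty body until the asset bytes change.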

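Note on e31cdaa0eb (ggml: fixed Arm SVE usage bug in vec.h, vec.cpp): the commit switches accumulation from F16 to F32. The message does not state the motivation, but one general reason to accumulate in a wider type is that a half-precision running sum stops growing once the addend falls below half the spacing between representable values. The sketch below emulates that with Python's `struct` half-float format; it is not the SVE code, and the helper names are made up for the example.

```python
import struct


def round_to_f16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 binary16 value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def round_to_f32(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 binary32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def accumulate(values, round_fn) -> float:
    """Sum values, rounding the running total after every addition."""
    total = 0.0
    for v in values:
        total = round_fn(total + v)
    return total


ones = [1.0] * 4096
f16_sum = accumulate(ones, round_to_f16)  # stalls once the spacing reaches 2
f32_sum = accumulate(ones, round_to_f32)  # exact for this input
print(f16_sum, f32_sum)
```

Here the F16 accumulator stops at 2048 because 2049 is not representable in binary16 and rounds back down, while the F32 accumulator reaches 4096.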
9405 commits