koboldcpp

mirror of https://github.com/LostRuins/koboldcpp.git synced 2026-05-30 20:33:39 +00:00

Author	SHA1	Message	Date
Xuan-Son Nguyen	06d26dfdff	download: add option to skip_download (#23059 ) * download: add option to skip_download * fix * fix 2 * if file doesn't exist, respect skip_download flag	2026-05-29 16:30:55 +02:00
Saba Fallah	da3f990a47	mtmd: Add DeepSeekOCR 2 Support (#20975 ) * mtmd: DeepSeek-OCR 2 support, with multi-tile dynamic resolution * introduced clip_image_f32::add_viewsep * address PR review - drop redundant ggml_cpy ops in both deepseekocr versions build - drop no-op ggml_cont in build_sam - assert num_image_tokens deepseekocr2 - view_seperator as (1, n_embd) at conversion (for both versions) - drop redundant ggml_reshape_2d * Update tools/mtmd/models/deepseekocr2.cpp Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com> --------- Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>	2026-05-29 16:13:51 +02:00
Oliver Simons	6ed481eea4	CUDA: Check PTX version on host side to guard PDL dispatch (#23530 ) * CUDA: Check PTX version on host side to guard PDL dispatch Checking on `__CUDA_ARCH_LIST__` alone is insufficient for JIT, as this variable doesn't differentiate between compiling for say sm_90, sm_90a or sm_90f (so forward-jittable PTX vs. arch/family-specific PTX). Thus, one can have a bug when compiling with `DCMAKE_CUDA_ARCHITECTURES="89;90a"`, where current code would wrongly dispatch to PDL on sm_90/sm_120 in forward-JIT mode. This PR fixes this issue by checking `cudaFuncAttributes::ptxVersion` of the incoming kernel at runtime. A check on ptxVersion alone is sufficient, as device-codes will always be >= ptxVersion (and any violation of this would be a severe bug in CUDA/nvcc), see: https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/#gpu-code-code-code * Implement MurmurHash3 mixer for better hash distribution Magic constants were taken from boost: `2698b43803/include/boost/container_hash/detail/hash_mix.hpp (L19-L65)` * Update ggml/src/ggml-cuda/common.cuh Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * Address review comments, make seed non-zero * Apply code-formatting * Replace std::size_t -> size_t for consistency --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2026-05-29 12:28:18 +02:00
Xuan-Son Nguyen	cb47092b00	server: bump timeout to 3600s (#23842 ) * server: bump timeout to 3600s * nits: change wording	2026-05-29 10:23:17 +02:00
fairydreaming	1f0aa2a696	model : support for DeepseekV32ForCausalLM with generic DeepSeek Sparse Attention (DSA) implementation (#23346 ) * llama : support DeepSeek V3.2 model family (with DSA lightning indexer) * convert : handle DeepseekV32ForCausalLM architecture * ggml : support for f16 GGML_OP_FILL * memory : separate hparams argument in llama_kv_cache constructor * memory : add llama_kv_cache_dsa memory (KV cache + lightning indexer cache) * llama : support for LLM_ARCH_DEEPSEEK32 * model : llama_model_deepseek32 implementation * model : merge two scale operations into one in DSA lightning indexer implementation * chore : remove unused code * model : support NVFP4 in DeepSeek V3.2 Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * memory : refactoring TODO Co-authored-by: ggerganov <ggerganov@users.noreply.github.com> --------- Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com> Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> Co-authored-by: ggerganov <ggerganov@users.noreply.github.com>	2026-05-29 10:15:17 +02:00
Aman Gupta	031ddb2e08	llama: use f16 mask for FA to save VRAM (#23764 ) * llama: use f16 mask for FA * review: add llama_cast + formatting * simplify	2026-05-29 15:44:43 +08:00
Georgi Gerganov	fe12e422ad	sync : ggml	2026-05-29 09:56:08 +03:00
Georgi Gerganov	ea02bc37f5	ggml : bump version to 0.13.1 (ggml/1523)	2026-05-29 09:56:08 +03:00
Omid Azizi	b000431a0b	ngram-mod : Add missing include (#23857 ) [no release] Signed-off-by: Omid Azizi <oazizi@gimletlabs.ai>	2026-05-29 09:21:37 +03:00
Aman Gupta	eef59a7642	llama: add llm_graph_input_mtp (#23643 ) * llama: add llm_graph_input_mtp * rename input_mtp -> input_token_embd * add TODO about mtmd embedding * cont : clean-up --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2026-05-29 09:17:32 +03:00
Adrien Gallouët	98e480a32e	app : move licences to llama-app (#23824 ) Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2026-05-29 07:46:11 +02:00
Andreas Kieslinger	241cbd41d2	cuda : disables launch_fattn PDL enrollment due to compiler bug (#23825 )	2026-05-29 07:46:10 +03:00
Matt Corallo	33c718db1f	meta : Add missing `buffer` set in allreduce fallback !COMPUTE clear (#23480 ) Without this at least the vulkan backend will skip the `* 0` for !COMPUTE tensors, causing corrupt output.	2026-05-29 06:30:24 +03:00
Max Krasnyansky	19e92c33ef	hexagon: basic/generic op fusion support and RMS_NORM+MUL fusion (#23835 ) Updating infra to enable op fusion and using RMS_NORM+MUL as the use-case.	2026-05-28 14:05:54 -07:00
Xuan-Son Nguyen	751ebd17a5	mtmd-debug: add color and rainbow mode (#23829 ) * mtmd-debug: add color and rainbow mode * fix M_PI * max_dist	2026-05-28 20:59:14 +02:00
Xuan-Son Nguyen	c8914ad4f4	mtmd: fix gemma 4 projector pre_norm (#23822 )	2026-05-28 20:58:55 +02:00
lhez	408ae2b9e5	opencl: move backend info printing into its own function (#23702 ) * opencl: move backend info print into its own function * opencl: move new log line * opencl: fix for non adreno path	2026-05-28 11:05:42 -07:00
Sigbjørn Skjæret	3ef2369551	ci : run ui publish on ubuntu-slim (#23818 ) * run ui publish on self-hosted fast * run on ubuntu-slim	2026-05-28 20:58:32 +03:00
ValdikSS	2f6c815dc4	ui: fix audio and video modality detection (#23756 ) When model props are fetched asynchronously from the server, modelPropsVersion is incremented to trigger reactivity, but only the vision effect was listening to it.	2026-05-28 17:36:10 +02:00
Georgi Gerganov	445b7cef62	ci : releases use Github-hosted builds for the UI (#23823 ) * ci : releases use Github-hosted builds for the UI * cont : fix name	2026-05-28 17:50:32 +03:00
Adrien Gallouët	479a9a1b03	app : improve help output (#23805 ) Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2026-05-28 16:45:06 +02:00
Saba Fallah	0b56d283bf	mtmd: n_head_kv defaults to n_head (#23782 ) removed AI-generated comment	2026-05-28 16:44:36 +02:00
Xuan-Son Nguyen	d6be3158e1	mtmd: fix gemma 4 audio rms norm eps (#23815 ) * mtmd: fix gemma 4 audio rms norm eps * Update tools/mtmd/clip.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2026-05-28 16:31:37 +02:00
Georgi Gerganov	dd1557907a	ci : change Vulkan builds to Release to reduce ccache (#23820 ) * ci : disable all CPU variant builds for Vulkan workflow * cont : change cache key * cont : change build type	2026-05-28 17:29:11 +03:00
Mikolaj Kucharski	7fb1e70b59	arg: Add LLAMA_ARG_API_KEY_FILE environment variable for --api-key-file (#23167 )	2026-05-28 16:25:40 +02:00
Johannes Gäßler	d374e71e55	test-llama-archs: fix table format [no release] (#23810 )	2026-05-28 15:53:54 +02:00
fl0rianr	30af6e2b98	ggml: auto apply iGPU flag CUDA/HIP if integrated device (#23007 )	2026-05-28 15:01:14 +02:00
redfox	d7be46189f	mmvq Optim: add MMVQ_PARAMETERS_TURING(mmvq_parameter_table_id) for … (#23729 ) * mmvq Optim: add MMVQ_PARAMETERS_TURING(mmvq_parameter_table_id) for SM75 TURING * avoid a mismatch for JIT compilation of Turing device code for Ampere or newer Co-authored-by: Johannes Gäßler <johannesg@5d6.de> --------- Co-authored-by: Copilot <copilot@github.com> Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2026-05-28 14:51:14 +02:00
Jaden_Mach	bc81d47aba	CUDA: route batch>=4 quantized matmul to MMQ on AMD MFMA hardware (#23227 ) * CUDA: per-quant MMVQ/MMQ batch threshold on AMD MFMA hardware The dispatcher uses a single global threshold (MMVQ_MAX_BATCH_SIZE = 8) to choose between mul_mat_vec_q (per-row GEMV) and mul_mat_q (MFMA-tiled GEMM) for quantized matmul. On AMD CDNA, the optimal crossover differs substantially by quant family because the per-row GEMV cost is dominated by dequantisation, not the dot-product itself: K-quants pay a heavier super-block decode and so MMQ wins sooner; legacy and IQ quants have lean decode and stay ahead until the batch fully populates an MFMA tile. This patch introduces ggml_cuda_should_use_mmvq(type, cc, ne11) -> bool, mirroring the existing ggml_cuda_should_use_mmq, and gates per-quant thresholds on amd_mfma_available(cc): Q3_K, Q4_K, Q5_K : MMVQ <= 3 (MMQ wins from batch=4: +5% .. +76%) Q2_K, Q6_K : MMVQ <= 5 (MMQ wins from batch=6: +8% .. +35%) others : MMVQ <= 8 (legacy & IQ regress under MMQ; unchanged) Non-AMD-MFMA paths (NVIDIA, RDNA, CDNA1 without MFMA) are byte-identical to master. GGML_CUDA_FORCE_MMVQ=1 restores the original global threshold for A/B testing. Measured on MI250X (gfx90a, ROCm 7.2.1) with Llama-3.2-3B-Instruct, llama-bench pp512 across all 20 supported quants, ubatch 1..8, 10 reps. Full table in PR description. Selected pp512 throughput (tok/s, ub=8): Q4_K_S: 559 -> 940 (+68%) Q5_K_S: 503 -> 884 (+76%) Q3_K_S: 629 -> 879 (+40%) Q2_K : 615 -> 809 (+32%) Q6_K : 582 -> 776 (+33%) Selected pp512 throughput (tok/s, ub=4): Q4_K_S: 444 -> 480 (+ 8%) Q4_0 : 682 -> 685 (+ 0%) (no regression - retains MMVQ) IQ4_XS: 706 -> 698 (- 1%) (no regression - retains MMVQ) * CUDA: address review — inline MMVQ batch table, drop env hatch & doc block * tune kernel selection logic for CDNA1 --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2026-05-28 14:50:25 +02:00
Funtowicz Morgan	0b246862b9	server: minor tweaks to use more cpp features (#23785 ) * misc(server): add default port to impl RAII * misc(server): register_gcp_compat() can be const * misc(server): use proper cpp const/auto methods * misc(server): do not reset a unique_ptr, use make_unique instead to be exception safe	2026-05-28 14:00:25 +02:00
Max Krasnyansky	a919001134	hexagon: minor refresh for HMX FA and MM (#23796 ) * hex-fa: clean up qf32/fp32 handling and stride handling * hex-fa: fix corner case fp NAN issues that were cause bad output from gemma4 on v79 * hex-fa: vectorize leftover handling * hex-fa: avoid HVX fallback during token gen HMX has more FP16 compute capacity * hmx-mm: remove dead code * hmx-mm: use fastdiv in x4x2 dequant * hmx-mm: sandwich dequant and scatter to improve perf * hmx-mm: fixed rebase conflicts * hmx-mm: further improve weight dequant by doing early type dispatch and precomputing fastdiv * hmx-mm: an even earlier dispatch for per-type dequant * hmx-mm: dequant linear types like q4_0 and q4_1 without the LUTs This is a bit faster than LUT. * hex-cmake: one more tweak for lto --------- Co-authored-by: Trivikram Reddy <tamarnat@qti.qualcomm.com>	2026-05-28 04:49:11 -07:00
Jeff Bolz	48e7078ee0	vulkan: fast path for walsh-hadamard transform (#23687 ) * vulkan: fast path for walsh-hadamard transform * disable for intel due to segfault	2026-05-28 13:18:43 +02:00
Jesus Talavera	bb771cbd2b	chat : add Granite 4.1 chat template (#23518 )	2026-05-28 13:13:33 +02:00
Winston Ma	7c48fb81ce	vulkan: fix wrong index variable in inner loop (#23665 )	2026-05-28 12:48:34 +02:00
Winston Ma	91eb8f4fa0	vulkan: Fix memory logger unsafe iterator access (#23667 )	2026-05-28 12:46:07 +02:00
Markus Tavenrath	d205df6812	server, ui : Add support for HTTP ETags in llama-server (#23701 ) * allow caching of ui elements in llama-server * use fnv_hash * Update tools/server/server-http.cpp etag has to be set always Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com> --------- Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>	2026-05-28 12:21:24 +02:00
Sachin Sharma	e8d2567429	docker : add ZenDNN Dockerfile (#23716 )	2026-05-28 11:40:49 +02:00
fairydreaming	09e7b76c93	cuda : fix KQ mask offset integer overflow in fattn MMA kernel (#23610 ) Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>	2026-05-28 10:55:42 +02:00
Adrien Gallouët	48e7eae41c	perplexity : fix format specifier in LOG_ERR (#23788 ) Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2026-05-28 10:34:58 +03:00
ynankani	c5229087a5	convert : add FP8 to Q8 conversion (#23250 ) Signed-off-by: ynankani <ynankani@nvidia.com>	2026-05-28 10:16:17 +03:00
Martin Klacer	e31cdaa0eb	ggml: fixed Arm SVE usage bug in vec.h, vec.cpp (#22841 ) * Updated vec.h/vec.cpp code to accumulate to F32 rather than F16 Change-Id: I0cb789347f2bf60ffaf9047319f727e788c825f8 Signed-off-by: Martin Klacer <martin.klacer@arm.com> Co-authored-by: Milos Puzovic <Milos.Puzovic@arm.com>	2026-05-28 10:04:21 +03:00
Georgi Gerganov	491c4d7d2e	ci : refactor (#23789 ) * ci : separate CUDA windows workflow + fix names * ci : rename workflow * ci : prefix cache names with workflow name * ci : rename build.yml -> build-cpu.yml * ci : cache keys * ci : fix windows cuda/hip concurrency of release workflow * ci : fix apple cache names * ci : add TODOs * cont : keep just the last cache * ci : update release concurrency to queue * ci : move the release trigger to ubuntu-slim * ci : hip add TODO * cont : improve words Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2026-05-28 09:44:25 +03:00
ymcki	939a7dd648	Hexagon: OP_GATED_DELTA_NET K>1 support (#23531 ) * K>1 state snapshot support * removed picky indent multiple of 4 fixes	2026-05-27 23:05:25 -07:00
ymcki	8ad8aef447	opencl: OP_GATED_DELTA_NET (#23312 ) * OP_GATED_DELTA_NET impl * add back lanes_per_column declaration * removed has_subgroup_arithmetic and has_subgroup_clustered_reduce * removed trailing spaces and fixes indentation. Hard coded subgroup size for Adreno and Intel. Return not supported when K>1 state snapshot * support for K>1 state snapshot * removed picky indent multiple of 4 fixes * removed return that won't be executed	2026-05-27 21:23:21 -07:00
Reese Levine	f12cc6d0fa	ggml-webgpu: remove legacy constants (#23672 )	2026-05-27 14:22:33 -07:00
Max Krasnyansky	aa50b2c2ae	hexagon: add support for Q4_1 in MUL_MAT and MUL_MAT_ID (#23647 ) * hex-mm: add support for Q4_1 matmul/matvec, hvx-only for now * hmx-mm: add support for Q4_1 * hex-mm: use Q8_1 dynamic quantization to avoid having to compute sums in the vec_dot * hexagon: fix repack scratch buffer overflow * hex-mm: fix Q4_1 repack buffer sizing * hexagon: flip the build order for mm and fa (seems to help LTO) * hex-mm: add vec_dot 4x1s and minor HMX cleanup after adding Q4_1 * hex-mm: fix fp16 vec_dot fallback to 2x1 and another issue that could cause incorrect output * hexagon: resurrect early-wake and add support for polling for op-batch completions With Q4_1 ggml-hexagon now claims pretty much the entire graphs which gives the CPU more time to chilax. This is a good thing! But it does add extra latency for the pure benchmark runs. Early wakeup helps recover the latency a bit in the normals runs and op-batch polling is just for benchmarking. --------- Co-authored-by: Todor Boinovski <todorb@qti.qualcomm.com>	2026-05-27 10:46:11 -07:00
Masashi Yoshimura	c40006a62e	ggml-webgpu: Fix how to dispatch WG to some ops (#23750 )	2026-05-27 09:48:12 -07:00
Matt Corallo	c6e4088376	vulkan: Switch MUL_MAT_VEC to 4 K per iteration for F16/32 (#22887 ) * vulkan: Switch MUL_MAT_VEC to 4 K per iteration for F16/32 Against mesa git, this shows a 4.8% performance improvement for tg128 on Qwen3.5-9B:BF16 on Intel BMG. Note that this breaks some tests until the last commit which fixes OOB A reads. * vulkan: Use aligned loads in mul_mat_vec when available Against mesa git, this shows a 3.3% performance improvement for tg128 on Qwen3.5-9B:BF16 on Intel BMG. * Make explicit that `num_rows` is <= `NUM_ROWS` in mul_mat_vec Mesa's UUB logic can't see through conditionals, limiting its ability to understand the bounds on the `num_rows` field in the cleanup run. Making it explicit that `num_rows` is, indeed, always <= `NUM_ROWS` helps mesa make slightly better codegen. Against mesa git, this currently shows a 1% performance improvement in tg128 on Qwen3.5-9B:BF16 on Intel BMG. * vulkan: Fix OOB A reads in MUL_MAT_VEC for odd sizes There was a TODO to fix the OOB reads from the A matrix which we do here. It is within performance noise (+<0.1%) in tg128 for Qwen3.5-9B:BF16 on Intel BMG.	2026-05-27 17:19:23 +02:00
Jeff Bolz	b36eefc1b3	vulkan: use GL_NV_cooperative_matrix_decode_vector for faster matmul (#23541 )	2026-05-27 17:18:28 +02:00
l8bloom	837bb6b447	vulkan: add REPEAT op support for f16 to f16. (#23298 ) * feat: extend repeat op for vulkan * feat: add repeat_f16 vulkan pipeline * fix: ensure same dst and src types * fix: use type_size instead of data types * fix: use int16 and int32 for repeat shader op * chore: rename repeat_f* to repeat_i* * chore: rename repeat vulkan pipelines	2026-05-27 16:59:08 +02:00
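
Note on commit bc81d47aba (CUDA: route batch>=4 quantized matmul to MMQ on AMD MFMA hardware): the per-quant batch thresholds listed in that commit message can be read as a small lookup table. The sketch below is only an illustration of that table in Python. The function and dictionary names are hypothetical, the quant types are plain strings rather than ggml enums, and the merged code may differ, since later bullets in the same commit mention inlining the table and further tuning for CDNA1.

```python
# Illustrative sketch only: names here are hypothetical and are not the
# ggml-cuda API. Thresholds are copied from the commit message of bc81d47aba.

MMVQ_MAX_BATCH_SIZE = 8  # global threshold named in the commit message

# Largest batch size at which MMVQ is still preferred on AMD MFMA hardware.
# Quant types not listed here keep the global threshold.
AMD_MFMA_MMVQ_MAX_BATCH = {
    "Q3_K": 3,
    "Q4_K": 3,
    "Q5_K": 3,
    "Q2_K": 5,
    "Q6_K": 5,
}


def should_use_mmvq(quant_type: str, amd_mfma: bool, batch: int) -> bool:
    """Return True if the per-row GEMV path (MMVQ) would be chosen.

    quant_type: quant family as a string, e.g. "Q4_K" (assumed encoding).
    amd_mfma:   whether the device has AMD MFMA instructions available.
    batch:      batch size of the quantized matmul.
    """
    if amd_mfma:
        limit = AMD_MFMA_MMVQ_MAX_BATCH.get(quant_type, MMVQ_MAX_BATCH_SIZE)
    else:
        # Paths without AMD MFMA keep the single global threshold.
        limit = MMVQ_MAX_BATCH_SIZE
    return batch <= limit
```

Under these assumptions, `should_use_mmvq("Q4_K", True, 4)` is false, which matches the commit title (batch>=4 goes to MMQ for that quant family), while legacy quants such as Q4_0 stay on MMVQ up to a batch size of 8.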

9415 commits