koboldcpp

mirror of https://github.com/LostRuins/koboldcpp.git synced 2026-06-01 06:00:36 +00:00

Author	SHA1	Message	Date
Oliver Simons	6ed481eea4	CUDA: Check PTX version on host side to guard PDL dispatch (#23530 ) * CUDA: Check PTX version on host side to guard PDL dispatch Checking on `__CUDA_ARCH_LIST__` alone is insufficient for JIT, as this variable doesn't differentiate between compiling for say sm_90, sm_90a or sm_90f (so forward-jittable PTX vs. arch/family-specific PTX). Thus, one can have a bug when compiling with `DCMAKE_CUDA_ARCHITECTURES="89;90a"`, where current code would wrongly dispatch to PDL on sm_90/sm_120 in forward-JIT mode. This PR fixes this issue by checking `cudaFuncAttributes::ptxVersion` of the incoming kernel at runtime. A check on ptxVersion alone is sufficient, as device-codes will always be >= ptxVersion (and any violation of this would be a severe bug in CUDA/nvcc), see: https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/#gpu-code-code-code * Implement MurmurHash3 mixer for better hash distribution Magic constants were taken from boost: `2698b43803/include/boost/container_hash/detail/hash_mix.hpp (L19-L65)` * Update ggml/src/ggml-cuda/common.cuh Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * Address review comments, make seed non-zero * Apply code-formatting * Replace std::size_t -> size_t for consistency --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2026-05-29 12:28:18 +02:00
fairydreaming	1f0aa2a696	model : support for DeepseekV32ForCausalLM with generic DeepSeek Sparse Attention (DSA) implementation (#23346 ) * llama : support DeepSeek V3.2 model family (with DSA lightning indexer) * convert : handle DeepseekV32ForCausalLM architecture * ggml : support for f16 GGML_OP_FILL * memory : separate hparams argument in llama_kv_cache constructor * memory : add llama_kv_cache_dsa memory (KV cache + lightning indexer cache) * llama : support for LLM_ARCH_DEEPSEEK32 * model : llama_model_deepseek32 implementation * model : merge two scale operations into one in DSA lightning indexer implementation * chore : remove unused code * model : support NVFP4 in DeepSeek V3.2 Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * memory : refactoring TODO Co-authored-by: ggerganov <ggerganov@users.noreply.github.com> --------- Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com> Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> Co-authored-by: ggerganov <ggerganov@users.noreply.github.com>	2026-05-29 10:15:17 +02:00
Georgi Gerganov	ea02bc37f5	ggml : bump version to 0.13.1 (ggml/1523)	2026-05-29 09:56:08 +03:00
Andreas Kieslinger	241cbd41d2	cuda : disables launch_fattn PDL enrollment due to compiler bug (#23825 )	2026-05-29 07:46:10 +03:00
Matt Corallo	33c718db1f	meta : Add missing `buffer` set in allreduce fallback !COMPUTE clear (#23480 ) Without this at least the vulkan backend will skip the `* 0` for !COMPUTE tensors, causing corrupt output.	2026-05-29 06:30:24 +03:00
Max Krasnyansky	19e92c33ef	hexagon: basic/generic op fusion support and RMS_NORM+MUL fusion (#23835 ) Updating infra to enable op fusion and using RMS_NORM+MUL as the use-case.	2026-05-28 14:05:54 -07:00
lhez	408ae2b9e5	opencl: move backend info printing into its own function (#23702 ) * opencl: move backend info print into its own function * opencl: move new log line * opencl: fix for non adreno path	2026-05-28 11:05:42 -07:00
fl0rianr	30af6e2b98	ggml: auto apply iGPU flag CUDA/HIP if integrated device (#23007 )	2026-05-28 15:01:14 +02:00
redfox	d7be46189f	mmvq Optim: add MMVQ_PARAMETERS_TURING(mmvq_parameter_table_id) for … (#23729 ) * mmvq Optim: add MMVQ_PARAMETERS_TURING(mmvq_parameter_table_id) for SM75 TURING * avoid a mismatch for JIT compilation of Turing device code for Ampere or newer Co-authored-by: Johannes Gäßler <johannesg@5d6.de> --------- Co-authored-by: Copilot <copilot@github.com> Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2026-05-28 14:51:14 +02:00
Jaden_Mach	bc81d47aba	CUDA: route batch>=4 quantized matmul to MMQ on AMD MFMA hardware (#23227 ) * CUDA: per-quant MMVQ/MMQ batch threshold on AMD MFMA hardware The dispatcher uses a single global threshold (MMVQ_MAX_BATCH_SIZE = 8) to choose between mul_mat_vec_q (per-row GEMV) and mul_mat_q (MFMA-tiled GEMM) for quantized matmul. On AMD CDNA, the optimal crossover differs substantially by quant family because the per-row GEMV cost is dominated by dequantisation, not the dot-product itself: K-quants pay a heavier super-block decode and so MMQ wins sooner; legacy and IQ quants have lean decode and stay ahead until the batch fully populates an MFMA tile. This patch introduces ggml_cuda_should_use_mmvq(type, cc, ne11) -> bool, mirroring the existing ggml_cuda_should_use_mmq, and gates per-quant thresholds on amd_mfma_available(cc): Q3_K, Q4_K, Q5_K : MMVQ <= 3 (MMQ wins from batch=4: +5% .. +76%) Q2_K, Q6_K : MMVQ <= 5 (MMQ wins from batch=6: +8% .. +35%) others : MMVQ <= 8 (legacy & IQ regress under MMQ; unchanged) Non-AMD-MFMA paths (NVIDIA, RDNA, CDNA1 without MFMA) are byte-identical to master. GGML_CUDA_FORCE_MMVQ=1 restores the original global threshold for A/B testing. Measured on MI250X (gfx90a, ROCm 7.2.1) with Llama-3.2-3B-Instruct, llama-bench pp512 across all 20 supported quants, ubatch 1..8, 10 reps. Full table in PR description. Selected pp512 throughput (tok/s, ub=8): Q4_K_S: 559 -> 940 (+68%) Q5_K_S: 503 -> 884 (+76%) Q3_K_S: 629 -> 879 (+40%) Q2_K : 615 -> 809 (+32%) Q6_K : 582 -> 776 (+33%) Selected pp512 throughput (tok/s, ub=4): Q4_K_S: 444 -> 480 (+ 8%) Q4_0 : 682 -> 685 (+ 0%) (no regression - retains MMVQ) IQ4_XS: 706 -> 698 (- 1%) (no regression - retains MMVQ) * CUDA: address review — inline MMVQ batch table, drop env hatch & doc block * tune kernel selection logic for CDNA1 --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2026-05-28 14:50:25 +02:00
Max Krasnyansky	a919001134	hexagon: minor refresh for HMX FA and MM (#23796 ) * hex-fa: clean up qf32/fp32 handling and stride handling * hex-fa: fix corner case fp NAN issues that were causing bad output from gemma4 on v79 * hex-fa: vectorize leftover handling * hex-fa: avoid HVX fallback during token gen; HMX has more FP16 compute capacity * hmx-mm: remove dead code * hmx-mm: use fastdiv in x4x2 dequant * hmx-mm: sandwich dequant and scatter to improve perf * hmx-mm: fixed rebase conflicts * hmx-mm: further improve weight dequant by doing early type dispatch and precomputing fastdiv * hmx-mm: an even earlier dispatch for per-type dequant * hmx-mm: dequant linear types like q4_0 and q4_1 without the LUTs. This is a bit faster than LUT. * hex-cmake: one more tweak for lto --------- Co-authored-by: Trivikram Reddy <tamarnat@qti.qualcomm.com>	2026-05-28 04:49:11 -07:00
Jeff Bolz	48e7078ee0	vulkan: fast path for walsh-hadamard transform (#23687 ) * vulkan: fast path for walsh-hadamard transform * disable for intel due to segfault	2026-05-28 13:18:43 +02:00
Winston Ma	7c48fb81ce	vulkan: fix wrong index variable in inner loop (#23665 )	2026-05-28 12:48:34 +02:00
Winston Ma	91eb8f4fa0	vulkan: Fix memory logger unsafe iterator access (#23667 )	2026-05-28 12:46:07 +02:00
fairydreaming	09e7b76c93	cuda : fix KQ mask offset integer overflow in fattn MMA kernel (#23610 ) Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>	2026-05-28 10:55:42 +02:00
Martin Klacer	e31cdaa0eb	ggml: fixed Arm SVE usage bug in vec.h, vec.cpp (#22841 ) * Updated vec.h/vec.cpp code to accumulate to F32 rather than F16 Change-Id: I0cb789347f2bf60ffaf9047319f727e788c825f8 Signed-off-by: Martin Klacer <martin.klacer@arm.com> Co-authored-by: Milos Puzovic <Milos.Puzovic@arm.com>	2026-05-28 10:04:21 +03:00
ymcki	939a7dd648	Hexagon: OP_GATED_DELTA_NET K>1 support (#23531 ) * K>1 state snapshot support * removed picky indent multiple of 4 fixes	2026-05-27 23:05:25 -07:00
ymcki	8ad8aef447	opencl: OP_GATED_DELTA_NET (#23312 ) * OP_GATED_DELTA_NET impl * add back lanes_per_column declaration * removed has_subgroup_arithmetic and has_subgroup_clustered_reduce * removed trailing spaces and fixes indentation. Hard coded subgroup size for Adreno and Intel. Return not supported when K>1 state snapshot * support for K>1 state snapshot * removed picky indent multiple of 4 fixes * removed return that won't be executed	2026-05-27 21:23:21 -07:00
Reese Levine	f12cc6d0fa	ggml-webgpu: remove legacy constants (#23672 )	2026-05-27 14:22:33 -07:00
Max Krasnyansky	aa50b2c2ae	hexagon: add support for Q4_1 in MUL_MAT and MUL_MAT_ID (#23647 ) * hex-mm: add support for Q4_1 matmul/matvec, hvx-only for now * hmx-mm: add support for Q4_1 * hex-mm: use Q8_1 dynamic quantization to avoid having to compute sums in the vec_dot * hexagon: fix repack scratch buffer overflow * hex-mm: fix Q4_1 repack buffer sizing * hexagon: flip the build order for mm and fa (seems to help LTO) * hex-mm: add vec_dot 4x1s and minor HMX cleanup after adding Q4_1 * hex-mm: fix fp16 vec_dot fallback to 2x1 and another issue that could cause incorrect output * hexagon: resurrect early-wake and add support for polling for op-batch completions. With Q4_1 ggml-hexagon now claims pretty much the entire graph, which gives the CPU more time to chilax. This is a good thing! But it does add extra latency for the pure benchmark runs. Early wakeup helps recover the latency a bit in the normal runs and op-batch polling is just for benchmarking. --------- Co-authored-by: Todor Boinovski <todorb@qti.qualcomm.com>	2026-05-27 10:46:11 -07:00
Masashi Yoshimura	c40006a62e	ggml-webgpu: Fix how to dispatch WG to some ops (#23750 )	2026-05-27 09:48:12 -07:00
Matt Corallo	c6e4088376	vulkan: Switch MUL_MAT_VEC to 4 K per iteration for F16/32 (#22887 ) * vulkan: Switch MUL_MAT_VEC to 4 K per iteration for F16/32 Against mesa git, this shows a 4.8% performance improvement for tg128 on Qwen3.5-9B:BF16 on Intel BMG. Note that this breaks some tests until the last commit which fixes OOB A reads. * vulkan: Use aligned loads in mul_mat_vec when available Against mesa git, this shows a 3.3% performance improvement for tg128 on Qwen3.5-9B:BF16 on Intel BMG. * Make explicit that `num_rows` is <= `NUM_ROWS` in mul_mat_vec Mesa's UUB logic can't see through conditionals, limiting its ability to understand the bounds on the `num_rows` field in the cleanup run. Making it explicit that `num_rows` is, indeed, always <= `NUM_ROWS` helps mesa make slightly better codegen. Against mesa git, this currently shows a 1% performance improvement in tg128 on Qwen3.5-9B:BF16 on Intel BMG. * vulkan: Fix OOB A reads in MUL_MAT_VEC for odd sizes There was a TODO to fix the OOB reads from the A matrix which we do here. It is within performance noise (+<0.1%) in tg128 for Qwen3.5-9B:BF16 on Intel BMG.	2026-05-27 17:19:23 +02:00
Jeff Bolz	b36eefc1b3	vulkan: use GL_NV_cooperative_matrix_decode_vector for faster matmul (#23541 )	2026-05-27 17:18:28 +02:00
l8bloom	837bb6b447	vulkan: add REPEAT op support for f16 to f16. (#23298 ) * feat: extend repeat op for vulkan * feat: add repeat_f16 vulkan pipeline * fix: ensure same dst and src types * fix: use type_size instead of data types * fix: use int16 and int32 for repeat shader op * chore: rename repeat_f* to repeat_i* * chore: rename repeat vulkan pipelines	2026-05-27 16:59:08 +02:00
Oliver Simons	fda8528aa8	CUDA: restrict PDL to CTK >= 12.3 due to MSVC issues (#23742 )	2026-05-27 15:21:04 +03:00
Winston Ma	4d8cc0c56f	vulkan: avoid preferring transfer queue on AMD UMA devices (#22455 )	2026-05-27 11:48:40 +02:00
Vladislav	b4c0549a49	ggml-zendnn : fixed naming of matmul function (#20964 ) * ggml-zendnn: fixed naming of matmul function * ggml-zendnn: fixed naming of mul_mat_id function * ggml-zendnn: fixed print in mul_mat_id --------- Co-authored-by: plotnikov.v10 <plotnikov.v10@wb.ru>	2026-05-27 00:59:35 +02:00
Jeff Bolz	7799d31e68	vulkan: optimize conv2d and implement coopmat1 support (#22620 ) * vulkan: add CONV_SHAPE_64x128 for medium-K conv2d * vulkan: skip conv2d bounds checks when shapes align with tile sizes * vulkan: use WG_SIZE=128 for CONV_SHAPE_64x32 conv2d * vulkan: stage cm2 conv2d accumulator through shmem before global store * vulkan: add coopmat1 conv2d path * fallback when using too much shared memory. clean up comments * Require 16x16x16 and subgroup size 32 or 64 * check whether shared memory is sufficient before overwriting conv2d params with coopmat1 values	2026-05-26 15:48:05 +02:00
Max Krasnyansky	ef66bfab68	hexagon: add support for CONCAT op (#23648 ) * hexagon: add support for CONCAT with optimized concat_2d_transposed. qwen3.5 models are quite heavy on the CONCAT with large and transposed src1. * hex-concat: use fastdiv in generic version * hex-concat: make checks for transposed a bit more readable * hex-concat: reorder dma ops for better pipelining * hex-cont/cpy: optimize CPY and CONT ops. The primary change is to avoid scalar divs in the inner loops. We were calling hvx_copy_uu(... type_size) where type_size is not a constexpr. This causes runtime divs by that value, which is normally just 4 or 2 (f32/f16). * hex-get-rows: optimize GET_ROWS for large rows. We now use DMA for larger rows and also split them into chunks to improve perf for Qwen3.5 and other models that do lots of GET_ROWS with huge (2MB+) rows. Also bump the DMA queue depth now that we can take advantage of it. * hex-concat: unroll the inner loops of concat_2d * hex-concat: more updates to concat_2d to improve perf a bit further * hex-cpy: fixed n_rows per thread checks in the copy ops * hmx-fa: fix alignment issues while computing dma sizes * hex-set-rows: add early returns for idle threads * hvx-rope: minor optimization to replace loops with fastdiv logic * hex-rope: replace scalar tail processing with HVX * hex-rope: optimize rope cache init with HVX. Add hvx-utils sin/cos helpers that use an approx method (similar to rsqrt, inverse, etc). Use the helpers to optimize ROPE.	2026-05-26 06:20:05 -07:00
Alexey Kopytko	581d020b12	SYCL: implement ggml_sycl_pool_vmm (#22862 ) * SYCL: implement ggml_sycl_pool_vmm * Add an option to bypass VMM with GGML_SYCL_DISABLE_VMM * Clean up debugging logging * document GGML_SYCL_DISABLE_VMM * Multi-stream MoE optimization * Revert "Multi-stream MoE optimization" This reverts commit 938929c3f13a562ec67c59e87cc5d38595444cce. * Update common.hpp Co-authored-by: Neo Zhang <zhang.jianyu@outlook.com> * Flip GGML_SYCL_DISABLE_VMM to GGML_SYCL_ENABLE_VMM * add logging for GGML_SYCL_ENABLE_VMM when extension is not available (SYCL_EXT_ONEAPI_VIRTUAL_MEM macro) * Apply suggestions from code review Co-authored-by: Alexey Kopytko <alexey@kopytko.com> * Apply suggestion from @sanmai * Apply suggestion from @sanmai --------- Co-authored-by: Neo Zhang <zhang.jianyu@outlook.com>	2026-05-26 07:59:00 +03:00
Masashi Yoshimura	1506d39e76	ggml-webgpu: Add MMVQ path for Q4/Q8/Q2_K/Q4_K and clean up legacy MUL_MAT pipeline (#23594 ) * ggml-webgpu: Add MMVQ path for Q4/Q8/Q2_K/Q4_K * Fix to editorconfig checking pass * Remove mul-mat-legacy pipeline * Fix to use vendor name as is and add dot_product/vendor to shader_lib_ctx	2026-05-25 20:42:49 -07:00
Nikhil Jain	54121f7325	[WebGPU] Check batch_compute_passes before sending passes when not doing GPU profiling (#23457 ) * Only run webgpu CI on my fork * Add webgpu only workflow * refactor batch_compute_passes to a per-thread variable, and submit individual passes when it is set to false and no GPU profiling is enabled * restore build.yml	2026-05-25 20:32:49 -07:00
Johannes Gäßler	192d8ae8b8	CUDA: missing PDL sync for FWHT, better fallback (#23690 )	2026-05-26 11:05:51 +08:00
forforever73	35c9b1f39e	metal : add apple device id (#23566 ) Co-authored-by: lvyichen <lvyichen@stepfun.com>	2026-05-25 21:05:16 +03:00
Aman Gupta	c1f1e28d29	CUDA: add fast walsh-hadamard transform (#23615 ) * CUDA: add fast walsh-hadamard transform * review: add unrolls + change size_t -> int * warp size 64 --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2026-05-25 21:12:10 +08:00
Georgi Gerganov	45158f460e	ggml : bump version to 0.13.0 (ggml/1510)	2026-05-25 12:43:27 +03:00
Georgi Gerganov	ce5890b5f7	ggml : bump version to 0.12.1 (ggml/1508)	2026-05-25 12:38:01 +03:00
Ori Pekelman	b251f74f49	ggml.h: correct ggml_silu_back arg docstring (a=dy, b=x) (ggml/1500)	2026-05-25 12:38:01 +03:00
Dev-X25874	fa97041524	ggml-alloc: fix out-of-bounds read in ggml_dyn_tallocr_remove_block (ggml/1492)	2026-05-25 12:38:01 +03:00
Johannes Gäßler	ae251b5ff2	TP: fix ggml context size calculation (#22616 ) * TP: fix ggml context size calculation, memory leak * move split state cache back into the context * revert to constant ggml context size for cgraphs * increase headroom for statically allocated tensors * remove obsolete include	2026-05-25 12:37:25 +03:00
Gilad S.	66efd13375	ggml: `gguf_init_from_callback` and `gguf_init_from_buffer` (#22341 ) * ggml: implement `gguf_init_from_buffer` * test: `gguf_init_from_buffer` * fix: memory breakdown for a model loaded with `no_alloc` from a file is consistent with being loaded from a buffer * fix: use `GGML_UNUSED` Co-authored-by: Copilot <copilot@github.com> * fix: remove `total_size` from `gguf_reader` * fix: file offset calculation, rename `offset` to `data_offset` Co-authored-by: Copilot <copilot@github.com> * refactor: extract model loader bug fixes to another PR * feat: add `gguf_init_from_callback` * fix: always require a max expected size * fix: change `gguf_reader_callback_t`'s `output` type to `void `, change `max_expected_size` and offsets to `uint64_t` fix: harden against offset overflow in buffer read * fix: remove seek behavior from the callback * feat: `max_chunk_read == 0` means `SIZE_MAX` * fix: seeking in a gguf file with no tensors --------- Co-authored-by: Copilot <copilot@github.com>	2026-05-25 11:33:29 +02:00
Jeff Bolz	826539ce59	ggml : Parallelize quant LUT init (#23595 ) - Use OpenMP to parallelize iq2xs_init_impl and iq3xs_init_impl. - Move the OpenMP detection from ggml-cpu to ggml-base. - Update OpenMP dependencies in ggml-config.cmake.in.	2026-05-25 10:15:46 +03:00
Johannes Gäßler	fff63b5108	TP: fix entirely zero-sized slices per device (#23525 )	2026-05-24 08:19:33 +02:00
shaofeiqi	f3061116ff	opencl: batch profiling to improve speed and prevent memory leaks (#23495 )	2026-05-23 23:11:43 -07:00
Yiwei Shao	1c0f6db545	hexagon: apply repl optimization in flash attn softmax as #22993 (#23455 )	2026-05-23 19:56:59 -07:00
dskwe	a497476330	ggml : Check the right iface method before using the fallback 2d get (#23514 )	2026-05-23 12:49:24 +02:00
Jeff Bolz	95405ac65f	vulkan: fix windows find_package of SPIRV-Headers (#23215 ) * vulkan: fix windows find_package of SPIRV-Headers * not windows-only	2026-05-23 09:44:46 +02:00
Shawn Gu	0f3cb3fc8b	opencl: generalize Adreno MoE kernels on M (#23449 )	2026-05-22 17:08:41 -07:00
Alexey Kopytko	cc9e331213	SYCL: improve MoE prefill throughput (#23142 ) - change `k_copy_src1_to_contiguous` so that it uses a precomputed contiguous mapping where all rows "owned" by an expert are in one slice with a known start and end - switch the `O(n_as * n_routed_rows)` contraption to a counting sort-based procedure with `O(n_as + n_routed_rows)` complexity	2026-05-22 15:50:17 +03:00
Alexey Kopytko	bcfd1989e9	sycl : Level Zero detection in ggml_sycl_init (#23097 ) * [SYCL] Centralize Level Zero detection in ggml_sycl_init * use the same wording * get back the warning	2026-05-22 15:49:45 +03:00
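
The counting-sort idea described in commit cc9e331213 (SYCL MoE prefill) can be illustrated with a small host-side sketch. This is not the SYCL kernel code from that PR: the function name `group_rows_by_expert` and its arguments are hypothetical, and the sketch only shows how routed rows could be grouped into one contiguous slice per expert in `O(n_experts + n_rows)` time, assuming each row is routed to exactly one expert.

```python
from typing import List, Tuple


def group_rows_by_expert(expert_ids: List[int], n_experts: int) -> Tuple[List[int], List[int]]:
    """Hypothetical helper: group row indices by expert with a counting sort.

    Returns (order, starts), where order[starts[e]:starts[e + 1]] lists the
    rows routed to expert e, in their original relative order.
    """
    # Pass 1: histogram of rows per expert.
    counts = [0] * n_experts
    for e in expert_ids:
        counts[e] += 1

    # Exclusive prefix sum gives the start of each expert's slice.
    starts = [0] * (n_experts + 1)
    for e in range(n_experts):
        starts[e + 1] = starts[e] + counts[e]

    # Pass 2: scatter each row index into its expert's slice.
    cursor = starts[:n_experts]  # copy, so `starts` stays intact
    order = [0] * len(expert_ids)
    for row, e in enumerate(expert_ids):
        order[cursor[e]] = row
        cursor[e] += 1

    return order, starts


order, starts = group_rows_by_expert([2, 0, 2, 1, 0], n_experts=3)
print(order)   # [1, 4, 3, 0, 2]
print(starts)  # [0, 2, 3, 5]
```

Each expert's rows end up in a single slice with a known start and end, which is the property the commit message relies on. A naive alternative would rescan all rows once per expert, giving the `O(n_as * n_routed_rows)` cost the commit replaces.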


2534 commits
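
Commit 6ed481eea4 mentions a "MurmurHash3 mixer for better hash distribution". As a rough illustration of what such a mixer does, the sketch below implements the standard 64-bit MurmurHash3 finalizer (`fmix64`) with its published constants. It is an assumption that this resembles the code in `ggml/src/ggml-cuda/common.cuh`: the commit states that its magic constants were taken from Boost's `hash_mix.hpp`, so the actual values and shift amounts there may differ.

```python
MASK64 = (1 << 64) - 1  # emulate 64-bit unsigned wrap-around


def fmix64(h: int) -> int:
    """Standard MurmurHash3 64-bit finalizer (illustrative, not the ggml code)."""
    h &= MASK64
    h ^= h >> 33
    h = (h * 0xFF51AFD7ED558CCD) & MASK64
    h ^= h >> 33
    h = (h * 0xC4CEB9FE1A85EC53) & MASK64
    h ^= h >> 33
    return h


# Nearby inputs are spread across the 64-bit range.
for x in (1, 2, 3):
    print(hex(fmix64(x)))
```

Each step (xor-shift, multiplication by an odd constant) is invertible, so the mixer maps distinct 64-bit inputs to distinct outputs. It also maps 0 to 0, which may be one reason a mixer-based hash would want a non-zero seed, as the commit's "make seed non-zero" note suggests.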