koboldcpp

mirror of https://github.com/LostRuins/koboldcpp.git synced 2026-05-22 03:10:03 +00:00

Author	SHA1	Message	Date
Concedo	487d509b44	try fix oldpc cuda broken without flash attn since upstream pr14361 between 1.94 and 1.95 (+1 squashed commits) Squashed commits: [940f0c639] try fix oldpc cuda broken without flash attn since upstream pr14361 between 1.94 and 1.95	2025-08-10 00:10:37 +08:00
Concedo	0fb25bb165	Merge branch 'upstream' into concedo_experimental	2025-08-09 20:31:36 +08:00
Aman Gupta	34c9d765bf	CUDA: add attention sinks for tile and wmma (#15178 ) * CUDA: add attention sinks for tile and wmma * Review: formatting changes + remove syncthreads from tile + remove warp_reduce_max from wmma	2025-08-09 20:00:24 +08:00
Concedo	4c7b82e982	Merge branch 'upstream' into concedo_experimental # Conflicts: # scripts/server-bench.py	2025-08-09 10:34:24 +08:00
compilade	e54d41befc	gguf-py : add Numpy MXFP4 de/quantization support (#15111 ) * gguf-py : add MXFP4 de/quantization support * ggml-quants : handle zero amax for MXFP4	2025-08-08 17:48:26 -04:00
Concedo	9e7a940ce4	Merge branch 'upstream' into concedo_experimental # Conflicts: # ggml/src/ggml-opencl/ggml-opencl.cpp # ggml/src/ggml-opencl/kernels/softmax_4_f16.cl # ggml/src/ggml-opencl/kernels/softmax_4_f32.cl # ggml/src/ggml-opencl/kernels/softmax_f16.cl # ggml/src/ggml-opencl/kernels/softmax_f32.cl # ggml/src/ggml-rpc/ggml-rpc.cpp # ggml/src/ggml-sycl/ggml-sycl.cpp	2025-08-09 01:24:52 +08:00
Concedo	7087aeb4bc	anti bsod only for nvidia	2025-08-09 01:23:38 +08:00
Concedo	67e0072245	fixed clblast repacking	2025-08-09 01:08:02 +08:00
AN Long	cd6983d56d	ggml : fix field name when new ggml_backend (#14944 )	2025-08-08 14:37:22 +02:00
Johannes Gäßler	1425f587a8	CUDA: attention sinks for mma FlashAttention (#15157 )	2025-08-08 08:19:58 +02:00
lhez	aaa3d07ae7	opencl: support sink in `soft_max` (attn sinks) (#15152 )	2025-08-07 21:47:03 -07:00
Concedo	d5b5e79035	should fix vulkan bsod	2025-08-08 10:57:50 +08:00
Jeff Bolz	c4f53563df	vulkan: support fattn sinks (#15126 )	2025-08-07 22:44:20 +02:00
Jeff Bolz	a0552c8bee	vulkan: Add env var to disable host visible vidmem (#15109 )	2025-08-07 22:07:11 +02:00
uvos	7ad67ba9fe	HIP: add cmake option to enable compiler output of kernel resource usage metrics (#15103 )	2025-08-07 16:44:14 +02:00
Concedo	8a71eb03c0	Merge branch 'upstream' into concedo_experimental # Conflicts: # .github/workflows/build.yml # ggml/cmake/ggml-config.cmake.in # ggml/src/ggml-cann/CMakeLists.txt # ggml/src/ggml-cann/common.h # ggml/src/ggml-cann/ggml-cann.cpp # ggml/src/ggml-cuda/fattn.cu # ggml/src/ggml-opencl/CMakeLists.txt # ggml/src/ggml-opencl/ggml-opencl.cpp # requirements/requirements-convert_hf_to_gguf.txt # scripts/compare-llama-bench.py # tests/test-chat-template.cpp # tests/test-chat.cpp # tools/llama-bench/llama-bench.cpp	2025-08-07 21:23:09 +08:00
Christian Kastner	9a96389544	ggml: Skip backend library linking code when GGML_BACKEND_DL=ON (#15094 ) Any available libraries are found and loaded dynamically at runtime.	2025-08-07 13:45:41 +02:00
Johannes Gäßler	1d72c84188	CUDA: GEMM for FP32/FP16/BF16 and ne11 <= 16 (#15131 ) * CUDA: GEMM for FP32/FP16/BF16 and ne11 <= 16	2025-08-07 10:53:21 +02:00
Reese Levine	5fd160bbd9	ggml: Add basic SET_ROWS support in WebGPU (#15137 ) * Begin work on set_rows * Work on set rows * Add error buffers for reporting unsupported SET_ROWS indices * Remove extra comments	2025-08-06 15:14:40 -07:00
rmatif	756cfea826	fix profiling crash (#15072 )	2025-08-06 14:17:51 -07:00
lhez	e725a1a982	opencl: add `swiglu_oai` and `add_id` (#15121 ) * opencl: add `swiglu-oai` * opencl: add `add_id` * opencl: add missing `add_id.cl`	2025-08-06 12:12:17 -07:00
Diego Devesa	0d8831543c	ggml : fix fallback to CPU for ununsupported ops (#15118 )	2025-08-06 14:37:35 +02:00
Chenguang Li	2241453252	CANN: add support for ACL Graph (#15065 ) * feat(cann): add optional support for ACL Graph execution This commit adds support for executing ggml computational graphs using Huawei's ACL graph mode via the USE_CANN_GRAPH flag. The support can be enabled at compile time using the CMake option: -DUSE_CANN_GRAPH=ON By default, ACL graph execution is disabled, and the fallback path uses node-by-node execution. Key additions: - CMake option to toggle graph mode - Graph capture and execution logic using - Tensor property matching to determine whether graph update is required - Safe fallback and logging if the environment variable LLAMA_SET_ROWS is unset or invalid This prepares the backend for performance improvements in repetitive graph execution scenarios on Ascend devices. Signed-off-by: noemotiovon <757486878@qq.com> * Fix review comments Signed-off-by: noemotiovon <757486878@qq.com> * remane USE_CANN_GRAPH to USE_ACL_GRAPH Signed-off-by: noemotiovon <757486878@qq.com> * fix typo Signed-off-by: noemotiovon <757486878@qq.com> --------- Signed-off-by: noemotiovon <757486878@qq.com>	2025-08-06 14:12:42 +08:00
Concedo	6eea7b88d2	Merge branch 'upstream' into concedo_experimental # Conflicts: # .github/workflows/build.yml # README.md # ggml/src/ggml-cann/ggml-cann.cpp # ggml/src/ggml-opencl/ggml-opencl.cpp # ggml/src/ggml-sycl/ggml-sycl.cpp # tests/test-backend-ops.cpp # tests/test-chat-template.cpp	2025-08-06 10:51:29 +08:00
Reese Levine	9515c6131a	ggml: WebGPU disable SET_ROWS for now (#15078 ) * Add paramater buffer pool, batching of submissions, refactor command building/submission * Add header for linux builds * Free staged parameter buffers at once * Format with clang-format * Fix thread-safe implementation * Use device implicit synchronization * Update workflow to use custom release * Remove testing branch workflow * Disable set_rows until it's implemented * Fix potential issue around empty queue submission * Try synchronous submission * Try waiting on all futures explicitly * Add debug * Add more debug messages * Work on getting ssh access for debugging * Debug on failure * Disable other tests * Remove extra if * Try more locking * maybe passes? * test * Some cleanups * Restore build file * Remove extra testing branch ci	2025-08-05 16:26:38 -07:00
Georgi Gerganov	fd1234cb46	llama : add gpt-oss (#15091 ) * oai moe * compat with new checkpoint * add attn sink impl * add rope scaling yarn * logits match with latest transformers code * wip chat template * rm trailing space * use ggml_scale_bias * rm redundant is_swa_all * convert interleaved gate_up * graph : fix activation function to match reference (#7) * vocab : handle o200k_harmony special tokens * ggml : add attention sinks support (#1) * llama : add attn sinks * ggml : add attn sinks * cuda : add attn sinks * vulkan : add support for sinks in softmax remove unnecessary return * ggml : add fused swiglu_oai op (#11) * ggml : add fused swiglu_oai op * Update ggml/src/ggml-cpu/ops.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * update CUDA impl * cont : metal impl * add vulkan impl * test-backend-ops : more test cases, clean up * llama : remove unfused impl * remove extra lines --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Co-authored-by: slaren <slarengh@gmail.com> * repack mxfp4 upon conversion * clean up a bit * enable thinking * add quick hack to render only some special tokens * fix bf16 conversion * remove vocab hack * webui ok * support chat parsing for gpt-oss * fix webui * direct mapping mxfp4, FINALLY * force using mxfp4 * properly use lazy tensor * ggml : add mxfp4 ggml : use e8m0 conversion instead of powf Co-authored-by: Diego Devesa <slarengh@gmail.com> change kvalues_mxfp4 table to match e2m1 (#6) metal : remove quantization for now (not used) cuda : fix disabled CUDA graphs due to ffn moe bias vulkan : add support for mxfp4 cont : add cm2 dequant * ggml : add ggml_add_id (#13) * ggml : add ggml_add_id * add cuda impl * llama : add weight support check for add_id * perf opt * add vulkan impl * rename cuda files * add metal impl * allow in-place ggml_add_id * llama : keep biases on CPU with --cpu-moe * llama : fix compile error ggml-ci * cuda : add fallback for __nv_cvt_e8m0_to_bf16raw ggml-ci * cleanup 
ggml-ci * sycl : fix supports_op for MXFP4 ggml-ci * fix Unknown reasoning format * ggml-cpu : fix AVX build ggml-ci * fix hip build ggml-ci * cuda : add mxfp4 dequantization support for cuBLAS ggml-ci * ggml-cpu : fix mxfp4 fallback definitions for some architectures ggml-ci * cuda : fix version required for __nv_cvt_e8m0_to_bf16raw --------- Co-authored-by: Xuan Son Nguyen <son@huggingface.co> Co-authored-by: slaren <slarengh@gmail.com>	2025-08-05 22:10:36 +03:00
Romain Biessy	3306ceabf0	sycl: fix mul_mat selection (#15092 )	2025-08-05 18:39:55 +02:00
Concedo	7590a0ea39	Merge branch 'upstream' into concedo_experimental # Conflicts: # .github/workflows/build.yml # ggml/CMakeLists.txt # ggml/cmake/ggml-config.cmake.in # ggml/src/CMakeLists.txt # models/templates/README.md # tools/imatrix/imatrix.cpp	2025-08-05 19:24:29 +08:00
Christian Kastner	41613437ff	cmake: Add GGML_BACKEND_DIR option (#15074 ) * cmake: Add GGML_BACKEND_DIR option This can be used by distributions to specify where to look for backends when ggml is built with GGML_BACKEND_DL=ON. * Fix phrasing	2025-08-04 21:29:14 +02:00
Reese Levine	587d0118f5	ggml: WebGPU backend host improvements and style fixing (#14978 ) * Add parameter buffer pool, batching of submissions, refactor command building/submission * Add header for linux builds * Free staged parameter buffers at once * Format with clang-format * Fix thread-safe implementation * Use device implicit synchronization * Update workflow to use custom release * Remove testing branch workflow	2025-08-04 08:52:43 -07:00
Concedo	8bd0a560f0	Merge branch 'upstream' into concedo_experimental # Conflicts: # ggml/src/ggml-opencl/ggml-opencl.cpp # ggml/src/ggml-sycl/ggml-sycl.cpp # requirements/requirements-convert_hf_to_gguf_update.txt # scripts/compare-llama-bench.py # tests/test-backend-ops.cpp # tests/test-chat.cpp # tools/imatrix/README.md # tools/imatrix/imatrix.cpp # tools/llama-bench/llama-bench.cpp	2025-08-04 22:42:02 +08:00
Jeff Bolz	5aa1105da2	vulkan: fix build when using glslang that does not support coopmat2 (#15062 )	2025-08-04 07:09:19 +02:00
Jeff Bolz	6c7a441161	vulkan: Use coopmat2 for conv2d (#14982 )	2025-08-03 14:23:57 +02:00
lhez	5c0eb5ef54	opencl: fix adreno compiler detection logic (#15029 )	2025-08-02 19:51:18 +02:00
Johannes Gäßler	03d4698218	CUDA: use mma FA kernel for gqa > 4 on RTX 4000 (#15035 )	2025-08-02 16:37:08 +02:00
leejet	3303c19b16	cuda: make im2col a little faster (#15025 )	2025-08-02 17:15:36 +03:00
Georgi Gerganov	15e92fd337	cuda, sycl : fix batched gemm when ne02 == 1 && ne03 > 1 (#15038 ) * cuda, sycl : fix batched gemm when ne02 == 1 && ne03 > 1 ggml-ci * cont : fix cont types ggml-ci * cont : adopt variable names and comment from the other branch	2025-08-02 17:13:05 +03:00
Jeff Bolz	4cb208c93c	vulkan: coopmat2 mul_mat optimizations (#14934 ) - Increase tile size for k-quants, to match non-k-quants - Choose more carefully between large and medium tiles, considering how it interacts with split_k - Allow larger/non-power of two split_k, and make the splits a multiple of 256 - Use split_k==3 to when >1/2 and <=2/3 of the SMs would hae been used	2025-08-02 11:21:37 +02:00
Jeff Bolz	ec0b18802c	vulkan: Support ne[3]>1 in noncontig matrix-vector multiply (#15015 )	2025-08-02 10:48:30 +02:00
Jeff Bolz	a9f7541ec2	vulkan: optimizations for direct convolution (#14933 ) * vulkan: optimizations for direct convolution - Empirically choose a better tile size. Reducing BS_K/BS_NPQ helps fill the GPU. The new size should be amenable to using coopmat, too. - Fix shmem bank conflicts. 16B padding should work with coopmat. - Some explicit loop unrolling. - Skip math/stores work for parts of the tile that are OOB. - Apply fastdiv opt. - Disable shuffles for NV. * Three tiles sizes for CONV_2D, and a heuristic to choose * reallow collectives for pre-Turing * make SHMEM_PAD a spec constant * fixes for intel perf - no shmem padding, placeholder shader core count * shader variants with/without unrolling * 0cc4m's fixes for AMD perf Co-authored-by: 0cc4m <picard12@live.de> --------- Co-authored-by: 0cc4m <picard12@live.de>	2025-08-02 09:57:04 +02:00
Concedo	f430916a71	Merge branch 'upstream' into concedo_experimental # Conflicts: # docs/backend/CANN.md # docs/multimodal/minicpmo2.6.md # docs/multimodal/minicpmv2.5.md # docs/multimodal/minicpmv2.6.md # examples/speculative-simple/speculative-simple.cpp # ggml/cmake/ggml-config.cmake.in # ggml/src/ggml-cann/aclnn_ops.cpp # ggml/src/ggml-cann/ggml-cann.cpp # ggml/src/ggml-cpu/repack.cpp # ggml/src/ggml-opencl/CMakeLists.txt # ggml/src/ggml-opencl/ggml-opencl.cpp # ggml/src/ggml-opencl/kernels/add.cl # ggml/src/ggml-opencl/kernels/mul.cl # scripts/compare-commits.sh # scripts/compare-llama-bench.py # scripts/sync-ggml.last # tools/server/README.md	2025-08-02 10:25:10 +08:00
Concedo	b04362f831	Merge commit '00131d6eaf' into concedo_experimental # Conflicts: # docs/ops.md # examples/save-load-state/save-load-state.cpp # ggml/CMakeLists.txt # ggml/src/ggml-cann/aclnn_ops.cpp # ggml/src/ggml-cann/aclnn_ops.h # ggml/src/ggml-cann/ggml-cann.cpp # ggml/src/ggml-hip/CMakeLists.txt # ggml/src/ggml-sycl/cpy.cpp # ggml/src/ggml-sycl/cpy.hpp # ggml/src/ggml-sycl/ggml-sycl.cpp # ggml/src/ggml-sycl/set_rows.cpp # scripts/server-bench.py # tests/CMakeLists.txt # tests/test-backend-ops.cpp # tests/test-thread-safety.cpp # tools/llama-bench/llama-bench.cpp	2025-08-02 10:15:39 +08:00
Johannes Gäßler	9c35706b98	CUDA: fix MMQ nwarps for AMD with warp_size==32 (#15014 )	2025-08-01 20:47:32 +02:00
lhez	1c872f71fb	opencl: add f16 for `add`, `sub`, `mul`, `div` (#14984 )	2025-08-01 13:15:44 +02:00
Srihari-mcw	baad94885d	ggml : Q2k interleaving implementation - x86/x64 SIMD (#14373 ) * Initial Q2_K Block Interleaving Implementation * Addressed review comments and clean up of the code * Post rebase fixes * Initial CI/CD fixes * Update declarations in arch-fallback.h * Changes for GEMV Q2_K in arch-fallback.h * Enable repacking only on AVX-512 machines * Update comments in repack.cpp * Address q2k comments --------- Co-authored-by: Manogna-Sree <elisetti.manognasree@multicorewareinc.com>	2025-08-01 09:20:33 +03:00
diannao	2860d479b4	docker : add cann build pipline (#14591 ) * docker: add cann build pipline * docker: add cann build pipline * docker: fix cann devops * cann : fix multi card hccl * Update ggml/src/ggml-cann/ggml-cann.cpp Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com> * Update ggml-cann.cpp --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>	2025-08-01 10:02:34 +08:00
Ruben Ortlam	e08a98826b	Vulkan: Fix minor debug mode issues (#14899 ) * vulkan: fix debug mode issues * vulkan: remove broken check_results GGML_OP_SET_ROWS support	2025-07-31 17:46:54 +02:00
hipudding	11490b3672	CANN: Improve loading efficiency after converting weights to NZ format. (#14985 ) * CANN: Improve loading efficiency after converting weights to NZ format. * CANN: fix typo	2025-07-31 19:47:20 +08:00
lhez	6e6725459a	opencl: add `mul_mat_f32_f32_l4_lm` and `mul_mat_f16_f32_l4_lm` (#14809 )	2025-07-30 14:56:55 -07:00
uvos	ad4a700117	HIP: enable mfma mmq on gfx908 and gfx90a for select datatypes and shapes (#14949 )	2025-07-30 17:38:06 +02:00

1426 commits