Concedo
487d509b44
try fix oldpc cuda broken without flash attn since upstream pr14361 between 1.94 and 1.95 (+1 squashed commits)
...
Squashed commits:
[940f0c639] try fix oldpc cuda broken without flash attn since upstream pr14361 between 1.94 and 1.95
2025-08-10 00:10:37 +08:00
Concedo
0fb25bb165
Merge branch 'upstream' into concedo_experimental
2025-08-09 20:31:36 +08:00
Aman Gupta
34c9d765bf
CUDA: add attention sinks for tile and wmma ( #15178 )
...
* CUDA: add attention sinks for tile and wmma
* Review: formatting changes + remove syncthreads from tile + remove warp_reduce_max from wmma
2025-08-09 20:00:24 +08:00
Concedo
4c7b82e982
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# scripts/server-bench.py
2025-08-09 10:34:24 +08:00
compilade
e54d41befc
gguf-py : add Numpy MXFP4 de/quantization support ( #15111 )
...
* gguf-py : add MXFP4 de/quantization support
* ggml-quants : handle zero amax for MXFP4
2025-08-08 17:48:26 -04:00
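Note on the entry above: the zero-amax case matters because MXFP4 derives one power-of-two scale per 32-element block from the block's largest magnitude, and log2(0) is undefined. The sketch below shows the idea in plain Python. It is not the gguf-py or ggml code: the helper names, the unbiased integer exponent, the choice of exponent 0 for an all-zero block, and the round-to-nearest rule are all assumptions made for illustration. Only the E2M1 magnitude set (0, 0.5, 1, 1.5, 2, 3, 4, 6) comes from the MX format definition.

```python
import math

# Magnitudes representable by a 4-bit E2M1 element (sign is stored separately).
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def quantize_block(block):
    """Return (exponent, codes) for one block. Hypothetical helper, not a library API."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        # Zero amax: log2 is undefined, so use a fixed exponent (assumed 0 here)
        # and emit all-zero codes instead of failing.
        return 0, [0] * len(block)
    # Pick the scale so that amax falls in the top E2M1 binade [4, 8).
    exponent = math.floor(math.log2(amax)) - 2
    scale = 2.0 ** exponent
    codes = []
    for x in block:
        mag = abs(x) / scale
        # Index of the nearest representable magnitude; anything above 6 clamps to 6.
        idx = min(range(8), key=lambda i: abs(E2M1_VALUES[i] - mag))
        codes.append(idx | (8 if x < 0 else 0))  # bit 3 carries the sign
    return exponent, codes


def dequantize_block(exponent, codes):
    """Invert quantize_block: sign * magnitude * 2**exponent for each code."""
    scale = 2.0 ** exponent
    return [(-1.0 if c & 8 else 1.0) * E2M1_VALUES[c & 7] * scale for c in codes]
```

Values that are already representable at the chosen scale round-trip exactly; everything else is rounded to the nearest of the eight magnitudes.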
Concedo
9e7a940ce4
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# ggml/src/ggml-opencl/ggml-opencl.cpp
# ggml/src/ggml-opencl/kernels/softmax_4_f16.cl
# ggml/src/ggml-opencl/kernels/softmax_4_f32.cl
# ggml/src/ggml-opencl/kernels/softmax_f16.cl
# ggml/src/ggml-opencl/kernels/softmax_f32.cl
# ggml/src/ggml-rpc/ggml-rpc.cpp
# ggml/src/ggml-sycl/ggml-sycl.cpp
2025-08-09 01:24:52 +08:00
Concedo
7087aeb4bc
anti bsod only for nvidia
2025-08-09 01:23:38 +08:00
Concedo
67e0072245
fixed clblast repacking
2025-08-09 01:08:02 +08:00
AN Long
cd6983d56d
ggml : fix field name when new ggml_backend ( #14944 )
2025-08-08 14:37:22 +02:00
Johannes Gäßler
1425f587a8
CUDA: attention sinks for mma FlashAttention ( #15157 )
2025-08-08 08:19:58 +02:00
lhez
aaa3d07ae7
opencl: support sink in soft_max (attn sinks) ( #15152 )
2025-08-07 21:47:03 -07:00
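Note on the attention-sink entries in this log: as commonly described for gpt-oss, a sink is an extra per-head logit that takes part in the softmax normalization but has no value vector, so it only absorbs probability mass. A minimal plain-Python sketch under that assumption follows. The function name and signature are hypothetical and do not reflect the ggml or OpenCL kernel interfaces.

```python
import math


def softmax_with_sink(scores, sink):
    """Softmax over `scores` with one extra sink logit that appears only in the denominator."""
    # Include the sink in the running max so the exponentials stay numerically stable.
    m = max(max(scores), sink)
    exps = [math.exp(s - m) for s in scores]
    # The sink contributes to the normalizer but produces no output probability.
    denom = sum(exps) + math.exp(sink - m)
    return [e / denom for e in exps]
```

With a finite sink the returned probabilities sum to less than 1; passing `float("-inf")` as the sink reduces this to an ordinary softmax.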
Concedo
d5b5e79035
should fix vulkan bsod
2025-08-08 10:57:50 +08:00
Jeff Bolz
c4f53563df
vulkan: support fattn sinks ( #15126 )
2025-08-07 22:44:20 +02:00
Jeff Bolz
a0552c8bee
vulkan: Add env var to disable host visible vidmem ( #15109 )
2025-08-07 22:07:11 +02:00
uvos
7ad67ba9fe
HIP: add cmake option to enable compiler output of kernel resource usage metrics ( #15103 )
2025-08-07 16:44:14 +02:00
Concedo
8a71eb03c0
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .github/workflows/build.yml
# ggml/cmake/ggml-config.cmake.in
# ggml/src/ggml-cann/CMakeLists.txt
# ggml/src/ggml-cann/common.h
# ggml/src/ggml-cann/ggml-cann.cpp
# ggml/src/ggml-cuda/fattn.cu
# ggml/src/ggml-opencl/CMakeLists.txt
# ggml/src/ggml-opencl/ggml-opencl.cpp
# requirements/requirements-convert_hf_to_gguf.txt
# scripts/compare-llama-bench.py
# tests/test-chat-template.cpp
# tests/test-chat.cpp
# tools/llama-bench/llama-bench.cpp
2025-08-07 21:23:09 +08:00
Christian Kastner
9a96389544
ggml: Skip backend library linking code when GGML_BACKEND_DL=ON ( #15094 )
...
Any available libraries are found and loaded dynamically at runtime.
2025-08-07 13:45:41 +02:00
Johannes Gäßler
1d72c84188
CUDA: GEMM for FP32/FP16/BF16 and ne11 <= 16 ( #15131 )
...
* CUDA: GEMM for FP32/FP16/BF16 and ne11 <= 16
2025-08-07 10:53:21 +02:00
Reese Levine
5fd160bbd9
ggml: Add basic SET_ROWS support in WebGPU ( #15137 )
...
* Begin work on set_rows
* Work on set rows
* Add error buffers for reporting unsupported SET_ROWS indices
* Remove extra comments
2025-08-06 15:14:40 -07:00
rmatif
756cfea826
fix profiling crash ( #15072 )
2025-08-06 14:17:51 -07:00
lhez
e725a1a982
opencl: add swiglu_oai and add_id ( #15121 )
...
* opencl: add `swiglu-oai`
* opencl: add `add_id`
* opencl: add missing `add_id.cl`
2025-08-06 12:12:17 -07:00
Diego Devesa
0d8831543c
ggml : fix fallback to CPU for unsupported ops ( #15118 )
2025-08-06 14:37:35 +02:00
Chenguang Li
2241453252
CANN: add support for ACL Graph ( #15065 )
...
* feat(cann): add optional support for ACL Graph execution
This commit adds support for executing ggml computational graphs using
Huawei's ACL graph mode via the USE_CANN_GRAPH flag. The support can be
enabled at compile time using the CMake option:
-DUSE_CANN_GRAPH=ON
By default, ACL graph execution is **disabled**, and the fallback path
uses node-by-node execution.
Key additions:
- CMake option to toggle graph mode
- Graph capture and execution logic using
- Tensor property matching to determine whether graph update is required
- Safe fallback and logging if the environment variable LLAMA_SET_ROWS
is unset or invalid
This prepares the backend for performance improvements in repetitive graph
execution scenarios on Ascend devices.
Signed-off-by: noemotiovon <757486878@qq.com>
* Fix review comments
Signed-off-by: noemotiovon <757486878@qq.com>
* rename USE_CANN_GRAPH to USE_ACL_GRAPH
Signed-off-by: noemotiovon <757486878@qq.com>
* fix typo
Signed-off-by: noemotiovon <757486878@qq.com>
---------
Signed-off-by: noemotiovon <757486878@qq.com>
2025-08-06 14:12:42 +08:00
Concedo
6eea7b88d2
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .github/workflows/build.yml
# README.md
# ggml/src/ggml-cann/ggml-cann.cpp
# ggml/src/ggml-opencl/ggml-opencl.cpp
# ggml/src/ggml-sycl/ggml-sycl.cpp
# tests/test-backend-ops.cpp
# tests/test-chat-template.cpp
2025-08-06 10:51:29 +08:00
Reese Levine
9515c6131a
ggml: WebGPU disable SET_ROWS for now ( #15078 )
...
* Add parameter buffer pool, batching of submissions, refactor command building/submission
* Add header for linux builds
* Free staged parameter buffers at once
* Format with clang-format
* Fix thread-safe implementation
* Use device implicit synchronization
* Update workflow to use custom release
* Remove testing branch workflow
* Disable set_rows until it's implemented
* Fix potential issue around empty queue submission
* Try synchronous submission
* Try waiting on all futures explicitly
* Add debug
* Add more debug messages
* Work on getting ssh access for debugging
* Debug on failure
* Disable other tests
* Remove extra if
* Try more locking
* maybe passes?
* test
* Some cleanups
* Restore build file
* Remove extra testing branch ci
2025-08-05 16:26:38 -07:00
Georgi Gerganov
fd1234cb46
llama : add gpt-oss ( #15091 )
...
* oai moe
* compat with new checkpoint
* add attn sink impl
* add rope scaling yarn
* logits match with latest transformers code
* wip chat template
* rm trailing space
* use ggml_scale_bias
* rm redundant is_swa_all
* convert interleaved gate_up
* graph : fix activation function to match reference (#7 )
* vocab : handle o200k_harmony special tokens
* ggml : add attention sinks support (#1 )
* llama : add attn sinks
* ggml : add attn sinks
* cuda : add attn sinks
* vulkan : add support for sinks in softmax
remove unnecessary return
* ggml : add fused swiglu_oai op (#11 )
* ggml : add fused swiglu_oai op
* Update ggml/src/ggml-cpu/ops.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* update CUDA impl
* cont : metal impl
* add vulkan impl
* test-backend-ops : more test cases, clean up
* llama : remove unfused impl
* remove extra lines
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
---------
Co-authored-by: slaren <slarengh@gmail.com>
* repack mxfp4 upon conversion
* clean up a bit
* enable thinking
* add quick hack to render only some special tokens
* fix bf16 conversion
* remove vocab hack
* webui ok
* support chat parsing for gpt-oss
* fix webui
* direct mapping mxfp4, FINALLY
* force using mxfp4
* properly use lazy tensor
* ggml : add mxfp4
ggml : use e8m0 conversion instead of powf
Co-authored-by: Diego Devesa <slarengh@gmail.com>
change kvalues_mxfp4 table to match e2m1 (#6 )
metal : remove quantization for now (not used)
cuda : fix disabled CUDA graphs due to ffn moe bias
vulkan : add support for mxfp4
cont : add cm2 dequant
* ggml : add ggml_add_id (#13 )
* ggml : add ggml_add_id
* add cuda impl
* llama : add weight support check for add_id
* perf opt
* add vulkan impl
* rename cuda files
* add metal impl
* allow in-place ggml_add_id
* llama : keep biases on CPU with --cpu-moe
* llama : fix compile error
ggml-ci
* cuda : add fallback for __nv_cvt_e8m0_to_bf16raw
ggml-ci
* cleanup
ggml-ci
* sycl : fix supports_op for MXFP4
ggml-ci
* fix Unknown reasoning format
* ggml-cpu : fix AVX build
ggml-ci
* fix hip build
ggml-ci
* cuda : add mxfp4 dequantization support for cuBLAS
ggml-ci
* ggml-cpu : fix mxfp4 fallback definitions for some architectures
ggml-ci
* cuda : fix version required for __nv_cvt_e8m0_to_bf16raw
---------
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
Co-authored-by: slaren <slarengh@gmail.com>
2025-08-05 22:10:36 +03:00
Romain Biessy
3306ceabf0
sycl: fix mul_mat selection ( #15092 )
2025-08-05 18:39:55 +02:00
Concedo
7590a0ea39
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .github/workflows/build.yml
# ggml/CMakeLists.txt
# ggml/cmake/ggml-config.cmake.in
# ggml/src/CMakeLists.txt
# models/templates/README.md
# tools/imatrix/imatrix.cpp
2025-08-05 19:24:29 +08:00
Christian Kastner
41613437ff
cmake: Add GGML_BACKEND_DIR option ( #15074 )
...
* cmake: Add GGML_BACKEND_DIR option
This can be used by distributions to specify where to look for backends
when ggml is built with GGML_BACKEND_DL=ON.
* Fix phrasing
2025-08-04 21:29:14 +02:00
Reese Levine
587d0118f5
ggml: WebGPU backend host improvements and style fixing ( #14978 )
...
* Add parameter buffer pool, batching of submissions, refactor command building/submission
* Add header for linux builds
* Free staged parameter buffers at once
* Format with clang-format
* Fix thread-safe implementation
* Use device implicit synchronization
* Update workflow to use custom release
* Remove testing branch workflow
2025-08-04 08:52:43 -07:00
Concedo
8bd0a560f0
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# ggml/src/ggml-opencl/ggml-opencl.cpp
# ggml/src/ggml-sycl/ggml-sycl.cpp
# requirements/requirements-convert_hf_to_gguf_update.txt
# scripts/compare-llama-bench.py
# tests/test-backend-ops.cpp
# tests/test-chat.cpp
# tools/imatrix/README.md
# tools/imatrix/imatrix.cpp
# tools/llama-bench/llama-bench.cpp
2025-08-04 22:42:02 +08:00
Jeff Bolz
5aa1105da2
vulkan: fix build when using glslang that does not support coopmat2 ( #15062 )
2025-08-04 07:09:19 +02:00
Jeff Bolz
6c7a441161
vulkan: Use coopmat2 for conv2d ( #14982 )
2025-08-03 14:23:57 +02:00
lhez
5c0eb5ef54
opencl: fix adreno compiler detection logic ( #15029 )
2025-08-02 19:51:18 +02:00
Johannes Gäßler
03d4698218
CUDA: use mma FA kernel for gqa > 4 on RTX 4000 ( #15035 )
2025-08-02 16:37:08 +02:00
leejet
3303c19b16
cuda: make im2col a little faster ( #15025 )
2025-08-02 17:15:36 +03:00
Georgi Gerganov
15e92fd337
cuda, sycl : fix batched gemm when ne02 == 1 && ne03 > 1 ( #15038 )
...
* cuda, sycl : fix batched gemm when ne02 == 1 && ne03 > 1
ggml-ci
* cont : fix cont types
ggml-ci
* cont : adopt variable names and comment from the other branch
2025-08-02 17:13:05 +03:00
Jeff Bolz
4cb208c93c
vulkan: coopmat2 mul_mat optimizations ( #14934 )
...
- Increase tile size for k-quants, to match non-k-quants
- Choose more carefully between large and medium tiles, considering how it
interacts with split_k
- Allow larger/non-power of two split_k, and make the splits a multiple of 256
- Use split_k==3 when >1/2 and <=2/3 of the SMs would have been used
2025-08-02 11:21:37 +02:00
Jeff Bolz
ec0b18802c
vulkan: Support ne[3]>1 in noncontig matrix-vector multiply ( #15015 )
2025-08-02 10:48:30 +02:00
Jeff Bolz
a9f7541ec2
vulkan: optimizations for direct convolution ( #14933 )
...
* vulkan: optimizations for direct convolution
- Empirically choose a better tile size. Reducing BS_K/BS_NPQ helps fill
the GPU. The new size should be amenable to using coopmat, too.
- Fix shmem bank conflicts. 16B padding should work with coopmat.
- Some explicit loop unrolling.
- Skip math/stores work for parts of the tile that are OOB.
- Apply fastdiv opt.
- Disable shuffles for NV.
* Three tile sizes for CONV_2D, and a heuristic to choose
* reallow collectives for pre-Turing
* make SHMEM_PAD a spec constant
* fixes for intel perf - no shmem padding, placeholder shader core count
* shader variants with/without unrolling
* 0cc4m's fixes for AMD perf
Co-authored-by: 0cc4m <picard12@live.de>
---------
Co-authored-by: 0cc4m <picard12@live.de>
2025-08-02 09:57:04 +02:00
Concedo
f430916a71
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# docs/backend/CANN.md
# docs/multimodal/minicpmo2.6.md
# docs/multimodal/minicpmv2.5.md
# docs/multimodal/minicpmv2.6.md
# examples/speculative-simple/speculative-simple.cpp
# ggml/cmake/ggml-config.cmake.in
# ggml/src/ggml-cann/aclnn_ops.cpp
# ggml/src/ggml-cann/ggml-cann.cpp
# ggml/src/ggml-cpu/repack.cpp
# ggml/src/ggml-opencl/CMakeLists.txt
# ggml/src/ggml-opencl/ggml-opencl.cpp
# ggml/src/ggml-opencl/kernels/add.cl
# ggml/src/ggml-opencl/kernels/mul.cl
# scripts/compare-commits.sh
# scripts/compare-llama-bench.py
# scripts/sync-ggml.last
# tools/server/README.md
2025-08-02 10:25:10 +08:00
Concedo
b04362f831
Merge commit '00131d6eaf' into concedo_experimental
...
# Conflicts:
# docs/ops.md
# examples/save-load-state/save-load-state.cpp
# ggml/CMakeLists.txt
# ggml/src/ggml-cann/aclnn_ops.cpp
# ggml/src/ggml-cann/aclnn_ops.h
# ggml/src/ggml-cann/ggml-cann.cpp
# ggml/src/ggml-hip/CMakeLists.txt
# ggml/src/ggml-sycl/cpy.cpp
# ggml/src/ggml-sycl/cpy.hpp
# ggml/src/ggml-sycl/ggml-sycl.cpp
# ggml/src/ggml-sycl/set_rows.cpp
# scripts/server-bench.py
# tests/CMakeLists.txt
# tests/test-backend-ops.cpp
# tests/test-thread-safety.cpp
# tools/llama-bench/llama-bench.cpp
2025-08-02 10:15:39 +08:00
Johannes Gäßler
9c35706b98
CUDA: fix MMQ nwarps for AMD with warp_size==32 ( #15014 )
2025-08-01 20:47:32 +02:00
lhez
1c872f71fb
opencl: add f16 for add, sub, mul, div ( #14984 )
2025-08-01 13:15:44 +02:00
Srihari-mcw
baad94885d
ggml : Q2k interleaving implementation - x86/x64 SIMD ( #14373 )
...
* Initial Q2_K Block Interleaving Implementation
* Addressed review comments and clean up of the code
* Post rebase fixes
* Initial CI/CD fixes
* Update declarations in arch-fallback.h
* Changes for GEMV Q2_K in arch-fallback.h
* Enable repacking only on AVX-512 machines
* Update comments in repack.cpp
* Address q2k comments
---------
Co-authored-by: Manogna-Sree <elisetti.manognasree@multicorewareinc.com>
2025-08-01 09:20:33 +03:00
diannao
2860d479b4
docker : add cann build pipeline ( #14591 )
...
* docker: add cann build pipeline
* docker: add cann build pipeline
* docker: fix cann devops
* cann : fix multi card hccl
* Update ggml/src/ggml-cann/ggml-cann.cpp
Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
* Update ggml-cann.cpp
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
2025-08-01 10:02:34 +08:00
Ruben Ortlam
e08a98826b
Vulkan: Fix minor debug mode issues ( #14899 )
...
* vulkan: fix debug mode issues
* vulkan: remove broken check_results GGML_OP_SET_ROWS support
2025-07-31 17:46:54 +02:00
hipudding
11490b3672
CANN: Improve loading efficiency after converting weights to NZ format. ( #14985 )
...
* CANN: Improve loading efficiency after converting weights to NZ format.
* CANN: fix typo
2025-07-31 19:47:20 +08:00
lhez
6e6725459a
opencl: add mul_mat_f32_f32_l4_lm and mul_mat_f16_f32_l4_lm ( #14809 )
2025-07-30 14:56:55 -07:00
uvos
ad4a700117
HIP: enable mfma mmq on gfx908 and gfx90a for select datatypes and shapes ( #14949 )
2025-07-30 17:38:06 +02:00