Concedo
ada982b7c1
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .devops/vulkan.Dockerfile
# benches/dgx-spark/dgx-spark.md
# scripts/bench-models.sh
2026-02-05 22:24:12 +08:00
Concedo
157fac7bd0
Merge commit 'c342c3b93d' into concedo_experimental
...
# Conflicts:
# .github/workflows/build.yml
# CODEOWNERS
# scripts/sync_vendor.py
2026-02-05 22:23:05 +08:00
Georgi Gerganov
7a4f97d196
metal : add diag ( #19330 )
2026-02-05 10:08:45 +02:00
Oleksandr Kuvshynov
a498c75ad1
vulkan: fix GPU deduplication logic. ( #19222 )
...
* vulkan: fix GPU deduplication logic.
As reported in https://github.com/ggml-org/llama.cpp/issues/19221 , the
(same uuid, same driver) logic is problematic for windows+intel igpu.
Let's just avoid filtering for MoltenVK which is apple-specific, and
keep the logic the same as before 88d23ad5 - just dedup based on UUID.
Verified that MacOS + 4xVega still reports 4 GPUs with this version.
* vulkan: only skip dedup when both drivers are moltenVk
2026-02-05 09:06:59 +01:00
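A minimal sketch of the deduplication rule described in the commit above (#19222): drop a device whose UUID was already seen, unless both devices are MoltenVK devices. This is not the ggml-vulkan code. The `Device` record, the driver labels, and the function name are hypothetical, and the sketch assumes MoltenVK can report the same UUID for distinct GPUs, which is what the 4xVega check in the commit message suggests.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Device:
    name: str
    uuid: str
    driver: str  # hypothetical label, e.g. "moltenvk", "intel", "amd"


def dedup_devices(devices):
    """Keep the first device per UUID; never dedup two MoltenVK devices."""
    kept = []
    for dev in devices:
        duplicate = False
        for prev in kept:
            if prev.uuid != dev.uuid:
                continue
            # Same UUID: treat as a duplicate unless BOTH drivers are MoltenVK.
            both_moltenvk = prev.driver == "moltenvk" and dev.driver == "moltenvk"
            if not both_moltenvk:
                duplicate = True
                break
        if not duplicate:
            kept.append(dev)
    return kept
```

With this rule, four MoltenVK devices that share a UUID are all kept, while the same GPU listed twice on another platform collapses to one entry.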
Jeff Bolz
3409ab842d
vulkan: Set k_load_shmem to false when K is too large ( #19301 )
2026-02-05 08:48:33 +01:00
Jeff Bolz
c342c3b93d
vulkan: fix non-contig rope ( #19299 )
2026-02-05 08:38:59 +01:00
will-lms
af252d0758
metal : add missing includes ( #19348 )
2026-02-05 08:05:09 +02:00
Concedo
a2251a154f
Merge remote-tracking branch 'jeff/rope_noncontig' into concedo_experimental
2026-02-04 16:21:31 +08:00
Concedo
1f803ae27b
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .github/workflows/server.yml
# CMakeLists.txt
# cmake/common.cmake
# ggml/src/ggml-virtgpu/apir_cs_ggml-rpc-front.cpp
# ggml/src/ggml-virtgpu/backend/backend-dispatched-backend.cpp
# ggml/src/ggml-virtgpu/backend/backend-dispatched-buffer-type.cpp
# ggml/src/ggml-virtgpu/backend/backend-dispatched-buffer.cpp
# ggml/src/ggml-virtgpu/backend/backend-dispatched-device.cpp
# ggml/src/ggml-virtgpu/backend/backend-dispatched.cpp
# ggml/src/ggml-virtgpu/backend/backend-dispatched.gen.h
# ggml/src/ggml-virtgpu/backend/backend-dispatched.h
# ggml/src/ggml-virtgpu/backend/backend.cpp
# ggml/src/ggml-virtgpu/backend/shared/apir_cs.h
# ggml/src/ggml-virtgpu/backend/shared/apir_cs_ggml.h
# ggml/src/ggml-virtgpu/ggml-backend-buffer-type.cpp
# ggml/src/ggml-virtgpu/ggml-backend-device.cpp
# ggml/src/ggml-virtgpu/ggml-backend-reg.cpp
# ggml/src/ggml-virtgpu/ggml-remoting.h
# ggml/src/ggml-virtgpu/ggmlremoting_functions.yaml
# ggml/src/ggml-virtgpu/regenerate_remoting.py
# ggml/src/ggml-virtgpu/virtgpu-forward-backend.cpp
# ggml/src/ggml-virtgpu/virtgpu-forward-buffer-type.cpp
# ggml/src/ggml-virtgpu/virtgpu-forward-buffer.cpp
# ggml/src/ggml-virtgpu/virtgpu-forward-device.cpp
# ggml/src/ggml-virtgpu/virtgpu-forward-impl.h
# ggml/src/ggml-virtgpu/virtgpu-forward.gen.h
# ggml/src/ggml-virtgpu/virtgpu-shm.cpp
# ggml/src/ggml-virtgpu/virtgpu.cpp
# ggml/src/ggml-virtgpu/virtgpu.h
2026-02-04 16:21:06 +08:00
Kevin Pouget
015deb9048
ggml-virtgpu: make the code thread safe ( #19204 )
...
* ggml-virtgpu: regenerate_remoting.py: add the ability to deprecate a function
* ggml-virtgpu: deprecate buffer_type is_host remoting
not necessary
* ggml-virtgpu: stop using static vars as cache
The static init isn't thread safe.
* ggml-virtgpu: protect the use of the shared memory to transfer data
* ggml-virtgpu: make the remote calls thread-safe
* ggml-virtgpu: backend: don't continue if couldn't allocate the tensor memory
* ggml-virtgpu: add a cleanup function for consistency
* ggml-virtgpu: backend: don't crash if buft->iface.get_max_size is missing
* fix style and ordering
* Remove the static variable in apir_device_get_count
* ggml-virtgpu: improve the logging
* fix review minor formatting changes
2026-02-04 10:46:18 +08:00
Aman Gupta
2ceda3f662
ggml-cpu: use LUT for converting e8->f32 scales on x86 ( #19288 )
...
* ggml-cpu: use LUT for converting e8->f32 scales on x86
* add dispatch based on macro
2026-02-04 09:43:29 +08:00
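A rough illustration of the lookup-table idea in the commit above (#19288), assuming "e8" refers to an E8M0-style scale: an 8-bit exponent with bias 127 and no mantissa, so each code maps to a power of two. All 256 values are precomputed once and each conversion becomes a single table read. The names below are hypothetical, and treating code 0xFF as NaN is an assumption that may not match how ggml handles it.

```python
import math

E8M0_BIAS = 127  # assumption: exponent bias of the 8-bit scale format


def build_e8m0_lut():
    """Precompute the value of every possible 8-bit exponent code."""
    lut = []
    for code in range(256):
        if code == 255:
            lut.append(math.nan)  # assumption: 0xFF is reserved for NaN
        else:
            lut.append(math.ldexp(1.0, code - E8M0_BIAS))  # 2 ** (code - 127)
    return lut


E8M0_TO_F32 = build_e8m0_lut()


def e8m0_to_f32(code):
    """Convert one scale code with a table read instead of bit manipulation."""
    return E8M0_TO_F32[code & 0xFF]
```

The table costs 256 entries of memory and removes the per-scale arithmetic from the inner loop, which is the usual trade-off behind this kind of change.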
Georgi Gerganov
44008ce8f9
metal : add solve_tri ( #19302 )
2026-02-03 23:43:14 +02:00
Jeff Bolz
5de50e9d86
vulkan: fix non-contig rope
2026-02-03 12:20:08 -06:00
Ruben Ortlam
32b17abdb0
vulkan: disable coopmat1 fa on Nvidia Turing ( #19290 )
2026-02-03 17:37:32 +01:00
Aman Gupta
8bece2eb20
CUDA: use mmvq for mul-mat-id for small batch sizes ( #18958 )
...
* CUDA: use mmvq for mul-mat-id for small batch sizes
* add mmvq too
* Fix perf issue on ampere. Use mmvf mm-id only for non-nvidia GPUs
* templatize multi_token_path
2026-02-03 23:31:23 +08:00
Georgi Gerganov
c55bce4159
metal : minor cleanup ( #19251 )
2026-02-03 13:43:29 +02:00
Concedo
316530e9cf
fix cuda graph spams
2026-02-03 19:00:50 +08:00
Concedo
7b393fa487
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .github/workflows/build.yml
# AUTHORS
# ci/run.sh
# docs/backend/SYCL.md
# docs/build.md
# docs/multimodal/minicpmo2.6.md
# docs/multimodal/minicpmo4.0.md
# docs/multimodal/minicpmv2.5.md
# docs/multimodal/minicpmv2.6.md
# docs/multimodal/minicpmv4.0.md
# docs/multimodal/minicpmv4.5.md
# docs/ops.md
# docs/ops/SYCL.csv
# docs/speculative.md
# examples/deprecation-warning/README.md
# examples/deprecation-warning/deprecation-warning.cpp
# examples/model-conversion/Makefile
# examples/model-conversion/scripts/causal/convert-model.sh
# ggml/include/ggml-cann.h
# ggml/src/ggml-cann/acl_tensor.cpp
# ggml/src/ggml-cann/acl_tensor.h
# ggml/src/ggml-cann/aclnn_ops.cpp
# ggml/src/ggml-cann/aclnn_ops.h
# ggml/src/ggml-cann/common.h
# ggml/src/ggml-cann/ggml-cann.cpp
# ggml/src/ggml-metal/CMakeLists.txt
# ggml/src/ggml-opencl/ggml-opencl.cpp
# ggml/src/ggml-opencl/kernels/concat.cl
# ggml/src/ggml-opencl/kernels/repeat.cl
# ggml/src/ggml-opencl/kernels/scale.cl
# ggml/src/ggml-opencl/kernels/tanh.cl
# ggml/src/ggml-sycl/CMakeLists.txt
# ggml/src/ggml-sycl/dpct/helper.hpp
# ggml/src/ggml-sycl/ggml-sycl.cpp
# ggml/src/ggml-sycl/outprod.cpp
# ggml/src/ggml-sycl/rope.cpp
# ggml/src/ggml-sycl/wkv.cpp
# src/llama-vocab.cpp
# tests/test-autorelease.cpp
# tests/test-backend-ops.cpp
# tools/cvector-generator/pca.hpp
# tools/export-lora/export-lora.cpp
# tools/perplexity/README.md
2026-02-03 19:00:42 +08:00
Oliver Simons
1f1e57f2bf
CUDA: Fix loop unrolling for BW in mul_mat_q_stream_k_fixup ( #19053 )
...
By providing stride_* variables as size_t (i.e., 64-bit) the compiler can
correctly unroll the [two for-loops](557515be1e/ggml/src/ggml-cuda/mmq.cuh (L3789-L3816) )
on BW. This gives some perf for prefill/pp phase on BW, while not affecting
other SMs:
| GPU | Model | Test | t/s master | t/s osimons/fix_bw_mmq_fixup_kernel | Speedup |
|:--------------------------------------------------------|:----------------------|:-------|-------------:|--------------------------------------:|----------:|
| NVIDIA RTX 6000 Ada Generation | gpt-oss 20B MXFP4 MoE | pp8096 | 8404.05 | 8375.79 | 1.00 |
| NVIDIA RTX 6000 Ada Generation | llama 3B Q4_K_M | pp8096 | 16148.93 | 16019.60 | 0.99 |
| NVIDIA RTX 6000 Ada Generation | llama 8B Q4_0 | pp8096 | 8008.29 | 7978.80 | 1.00 |
| NVIDIA RTX 6000 Ada Generation | nemotron_h 9B BF16 | pp8096 | 4263.16 | 4248.53 | 1.00 |
| NVIDIA RTX 6000 Ada Generation | nemotron_h 9B Q4_K_M | pp8096 | 5165.11 | 5157.43 | 1.00 |
| NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition | gpt-oss 20B MXFP4 MoE | pp8096 | 12582.80 | 12758.37 | 1.01 |
| NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition | llama 3B Q4_K_M | pp8096 | 16879.10 | 17619.47 | 1.04 |
| NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition | llama 8B Q4_0 | pp8096 | 10649.90 | 10982.65 | 1.03 |
| NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition | nemotron_h 9B BF16 | pp8096 | 7717.73 | 7716.22 | 1.00 |
| NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition | nemotron_h 9B Q4_K_M | pp8096 | 7301.90 | 7370.38 | 1.01 |
2026-02-03 11:33:14 +01:00
George
e9a859db3c
ggml: added cleanups in ggml_quantize_free ( #19278 )
...
Add missing cleanup calls for IQ2_S, IQ1_M quantization types and IQ3XS with 512 blocks during quantization cleanup.
2026-02-03 08:43:39 +02:00
Gaurav Garg
41e3f02647
cuda : revert CUDA_SCALE_LAUNCH_QUEUES override until investigated ( #19227 )
...
Hangs were reported on Jetson Orin AGX when CUDA_SCALE_LAUNCH_QUEUES=4x is set. This reverts the previous PR (#19042) and updates the documentation to suggest setting CUDA_SCALE_LAUNCH_QUEUES=4x for faster throughput on multi-GPU systems.
2026-02-03 08:41:02 +02:00
lhez
91ea44e89b
opencl: refactor some ops, concat, repeat, tanh and scale ( #19226 )
...
* opencl: refactor concat
* opencl: refactor repeat
* opencl: refactor tanh
* opencl: enable fp16 for tanh
* opencl: refactor scale
* opencl: fix unused variables
2026-02-02 15:54:43 -08:00
Aman Gupta
9f682fb640
ggml-cpu: FA split across kv for faster TG ( #19209 )
...
* ggml-cpu: split across kv for faster TG
* simplify sinks application
* add ref impl
2026-02-03 01:19:55 +08:00
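The commit above (#19209) splits flash attention across the KV dimension. A sketch of the general technique, not of the ggml-cpu code: each KV chunk yields a running maximum, a sum of exponentials, and an unnormalized output, and the partial results are merged with a rescaling step. For token generation there is a single query row, so splitting over KV is presumably what gives the threads independent work. Function names are hypothetical, and the usual 1/sqrt(d) scaling, masking, and sinks are omitted.

```python
import math


def attend_chunk(q, keys, values):
    """Attention over one KV chunk: (max score, sum of exp, unnormalized output)."""
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    out = [0.0] * len(values[0])
    for w, v in zip(weights, values):
        for j, vj in enumerate(v):
            out[j] += w * vj
    return m, sum(weights), out


def merge_partials(partials):
    """Combine per-chunk results by rescaling each to the global maximum."""
    m_all = max(m for m, _, _ in partials)
    total = 0.0
    out = [0.0] * len(partials[0][2])
    for m, s, o in partials:
        scale = math.exp(m - m_all)
        total += s * scale
        for j, oj in enumerate(o):
            out[j] += oj * scale
    return [x / total for x in out]


def split_kv_attention(q, keys, values, n_chunks):
    """Single-query attention computed chunk by chunk over the KV entries."""
    size = math.ceil(len(keys) / n_chunks)
    partials = [
        attend_chunk(q, keys[i:i + size], values[i:i + size])
        for i in range(0, len(keys), size)
    ]
    return merge_partials(partials)
```

Each `attend_chunk` call is independent of the others, so in a threaded implementation the chunks could be processed in parallel and merged at the end.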
Neo Zhang
bf38346d13
Remove support for Nvidia & AMD GPUs, because the oneAPI plugin for Nvidia & AMD GPUs is unavailable: its download/installation channels no longer work. ( #19246 )
...
Users can't build the software for Nvidia & AMD GPUs.
Remove oneMath, since it is only used in the Nvidia and AMD code paths.
2026-02-02 21:06:21 +08:00
Tamar
4d5e972673
sycl: implement GGML_OP_TOP_K ( #19242 )
2026-02-02 21:05:51 +08:00
Georgi Gerganov
6fdddb4987
metal : support virtual devices ( #18919 )
...
* metal : support virtual devices
* cont : manage buffer type context memory
* metal : add events
* cont : implement cpy_tensor_async
2026-02-02 14:29:44 +02:00
Johannes Gäßler
59377a6c87
ggml-backend: fix async set/get fallback sync ( #19179 )
2026-02-02 10:00:05 +01:00
Christian Kastner
7a4ca3cbd9
docs : Minor cleanups ( #19252 )
...
* Update old URLs to github.com/ggml-org/
* Bump copyrights
2026-02-02 08:38:55 +02:00
Concedo
68f9c6df91
fix cuda graph spams
2026-02-02 11:28:50 +08:00
Nikhil Jain
2dc3ce2166
Remove pipeline cache mutexes ( #19195 )
...
* Remove mutex for pipeline caches, since they are now per-thread.
* Add comment
* Run clang-format
* Cleanup
* Run CI again
* Run CI once more
* Run clang-format
2026-02-01 18:47:29 -08:00
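A small sketch of why per-thread caches need no mutex, as the commit above (#19195) states. It uses Python's `threading.local` and is only an analogy for the backend change: the cache layout, the `get_pipeline` name, and the `build` callback are all hypothetical.

```python
import threading

_tls = threading.local()  # each thread sees its own attributes on this object


def get_pipeline(key, build):
    """Return this thread's cached pipeline for `key`, building it on first use."""
    cache = getattr(_tls, "pipelines", None)
    if cache is None:
        cache = _tls.pipelines = {}
    if key not in cache:
        # No lock needed: no other thread can reach this dictionary.
        cache[key] = build(key)
    return cache[key]
```

The cost is that each thread builds its own copy of a pipeline, which trades some duplicated work and memory for lock-free lookups.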
Max Krasnyansky
3bc8d2cf23
Bump cmake max version (needed for Windows on Snapdragon builds) ( #19188 )
...
* Bump max cmake version (needed for Windows on Snapdragon builds)
* cmake: move max version setting into ggml/CMakeLists
2026-02-01 14:13:38 -08:00
Concedo
ddce19db72
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .devops/nix/package-gguf-py.nix
# .devops/nix/scope.nix
# common/CMakeLists.txt
# docs/backend/SYCL.md
# examples/lookahead/lookahead.cpp
# examples/lookup/lookup.cpp
# examples/sycl/run-llama2.sh
# examples/sycl/win-run-llama2.bat
# examples/sycl/win-test.bat
# ggml/src/ggml-hexagon/CMakeLists.txt
# ggml/src/ggml-hexagon/htp/flash-attn-ops.c
# ggml/src/ggml-hexagon/htp/hvx-dump.h
# ggml/src/ggml-hexagon/htp/hvx-reduce.h
# ggml/src/ggml-hexagon/htp/matmul-ops.c
# ggml/src/ggml-hexagon/htp/softmax-ops.c
# ggml/src/ggml-hexagon/htp/unary-ops.c
# ggml/src/ggml-opencl/CMakeLists.txt
# ggml/src/ggml-opencl/ggml-opencl.cpp
# ggml/src/ggml-opencl/kernels/cvt.cl
# scripts/sync-ggml.last
2026-02-01 22:35:25 +08:00
nullname
89f10baad5
ggml-hexagon: flash-attention and reduce-sum optimizations ( #19141 )
...
* wip
* ggml-hexagon: add vectorized dot product function for FP32 and FP16 accumulation
* ggml-hexagon: optimize dot product functions for FP16 and FP32 with new vectorized implementations
* wip
* ggml-hexagon: optimize hvx_vec_dump_f32_n and hvx_vec_reduce_sum_qf32x2 functions for improved performance
* ggml-hexagon: refactor dot product functions to use a common loading function for improved readability
* optimize vector dot product functions to use unified reduction for improved performance
* wip
* ggml-hexagon: add vectorized dot product function for FP32 and FP16 accumulation
* ggml-hexagon: optimize dot product functions for FP16 and FP32 with new vectorized implementations
* wip
* ggml-hexagon: optimize hvx_vec_dump_f32_n and hvx_vec_reduce_sum_qf32x2 functions for improved performance
* ggml-hexagon: refactor dot product functions to use a common loading function for improved readability
* optimize vector dot product functions to use unified reduction for improved performance
* hexagon: optimize reduce-sum for v75+
* hexagon: always keep row_sums in sf/fp32
* ggml-hexagon: enhance directory checks for HEXAGON_SDK_ROOT and HEXAGON_TOOLS_ROOT
* fix compiling error after rebase
---------
Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>
2026-01-30 21:14:20 -08:00
shaofeiqi
971facc38e
opencl: add optimized q8_0 mm kernel for adreno ( #18871 )
...
* Add Q8_0 OpenCL kernel
Co-authored-by: yunjie <yunjie@qti.qualcomm.com>
* opencl: fix build for non-adreno
* opencl: refactor q8_0
* opencl: enforce subgroup size of 64 for adreno for q8_0
* For A750 and older generations, subgroup size can be 64 or 128.
This kernel assumes subgroup size 64.
* opencl: suppress warning when adreno kernels are disabled
---------
Co-authored-by: yunjie <yunjie@qti.qualcomm.com>
Co-authored-by: Li He <lih@qti.qualcomm.com>
2026-01-30 10:19:27 -08:00
Georgi Gerganov
dfd6106c84
cuda : fix compile warnings (whisper/0)
2026-01-30 20:09:21 +02:00
Simon Redman
13f3ebfae1
Correctly fetch q8_1 quantize pipeline in test as needed by 8a3519b ( #19194 )
2026-01-30 17:27:16 +01:00
Concedo
8d173f50c2
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .github/workflows/build.yml
# docs/backend/SYCL.md
# docs/backend/snapdragon/CMakeUserPresets.json
# docs/backend/snapdragon/README.md
# docs/backend/snapdragon/developer.md
# docs/ops.md
# docs/ops/SYCL.csv
# embd_res/templates/upstage-Solar-Open-100B.jinja
# ggml/src/CMakeLists.txt
# ggml/src/ggml-hexagon/CMakeLists.txt
# ggml/src/ggml-hexagon/ggml-hexagon.cpp
# ggml/src/ggml-sycl/element_wise.cpp
# ggml/src/ggml-sycl/element_wise.hpp
# ggml/src/ggml-sycl/ggml-sycl.cpp
# ggml/src/ggml-webgpu/wgsl-shaders/flash_attn.wgsl
# tests/test-chat.cpp
2026-01-30 15:32:59 +08:00
bssrdf
ecbf01d441
add tensor type checking as part of cuda graph properties ( #19186 )
2026-01-30 12:57:52 +08:00
s8322
1025fd2c09
sycl: implement GGML_UNARY_OP_SOFTPLUS ( #19114 )
...
* sycl: add softplus unary op implementation
* sycl: add softplus unary op implementation
* docs(ops): mark SYCL SOFTPLUS as supported
* docs: update SYCL status for SOFTPLUS
2026-01-30 12:01:38 +08:00
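For reference, softplus is softplus(x) = log(1 + exp(x)). The sketch below shows one common numerically stable way to evaluate it. It is not the SYCL kernel from the commit above (#19114), which may use a different formulation or a threshold.

```python
import math


def softplus(x):
    """Stable softplus: max(x, 0) + log1p(exp(-|x|)) equals log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))
```

Rewriting the formula this way keeps the argument of `exp` non-positive, so large positive inputs do not overflow.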
RachelMantel
c7358ddf64
sycl: implement GGML_OP_TRI ( #19089 )
...
* sycl: implement GGML_OP_TRI
* docs: update ops.md for SYCL TRI
* docs: regenerate ops.md
* docs: update SYCL support for GGML_OP_TRI
2026-01-30 12:00:49 +08:00
Zheyuan Chen
bd90fc74c3
ggml-webgpu: improve flashAttention performance by software pipelining ( #19151 )
...
* webgpu : pipeline flash_attn Q/K loads in WGSL
* ggml-webgpu: unroll Q*K accumulation inner loop
* ggml-webgpu: vectorization
* ggml-webgpu: unrolling
* ggml-webgpu: remove redundant unrolling
* ggml-webgpu: restore the config
* ggml-webgpu: remove redundant comments
* ggml-webgpu: formatting
* ggml-webgpu: formatting and remove vectorization
* ggml-webgpu: remove unnecessary constants
* ggml-webgpu: change QKV buffer to read_write to pass validation
* ggml-webgpu: add explanation for the additional bracket around Q K accumulate
* Indentation and for -> if for tail
* Kick off CI on wgsl only commits
---------
Co-authored-by: Reese Levine <reeselevine1@gmail.com>
2026-01-29 14:05:30 -08:00
Todor Boinovski
ce38a4db47
hexagon: enable offloading to Hexagon on Windows on Snapdragon ( #19150 )
...
* hexagon: updates to enable offloading to HTP on WoS
* Update windows.md
* Update windows.md
* hexagon: enable -O3 optimizations
* hexagon: move all _WINDOWS conditional compilation to _WIN32
* hexagon: updates to enable offloading to HTP on WoS
* hexagon: use run-time vs load-time dynamic linking for cdsp driver interface
* refactor htp-drv
* hexagon: add run-bench.ps1 script
* hexagon: htdrv refactor
* hexagon: unify Android and Windows build readmes
* hexagon: update README.md
* hexagon: refactor htpdrv
* hexagon: drv refactor
* hexagon: more drv refactor
* hexagon: fixes for android builds
* hexagon: factor out dl into ggml-backend-dl
* hexagon: add run-tool.ps1 script
* hexagon: merge htp-utils in htp-drv and remove unused code
* wos: no need for getopt_custom.h
* wos: add missing CR in htpdrv
* hexagon: ndev enforcement applies only to the Android devices
* hexagon: add support for generating and signing .cat file
* hexagon: add .inf file
* hexagon: working auto-signing and improved windows builds
* hexagon: further improve skel build
* hexagon: add rough WoS guide
* hexagon: updated windows guide
* hexagon: improve cmake handling of certs and logging
* hexagon: improve windows setup/build doc
* hexagon: more windows readme updates
* hexagon: windows readme updates
* hexagon: windows readme updates
* hexagon: windows readme updates
* hexagon: windows readme updates
* Update windows.md
* Update windows.md
* snapdragon: rename docs/backend/hexagon to docs/backends/snapdragon
Also added a PowerShell script to simplify build env setup.
* hexagon: remove trailing whitespace and move cmake requirement to user-presets
* hexagon: fix CMakeUserPresets path in workflow yaml
* hexagon: introduce local version of libdl.h
* hexagon: fix src1 reuse logic
gpt-oss needs a bigger lookahead window.
The check for src[1] itself being quantized was wrong.
---------
Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>
2026-01-29 12:33:21 -08:00
Georgi Gerganov
4fdbc1e4db
cuda : fix nkvo, offload and cuda graph node properties matching ( #19165 )
...
* cuda : fix nkvo
* cont : more robust cuda graph node property matching
* cont : restore pre-leafs implementation
* cont : comments + static_assert
2026-01-29 18:45:30 +02:00
Concedo
7e755014b2
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .github/workflows/winget.yml
# CODEOWNERS
# common/CMakeLists.txt
# common/arg.cpp
# docs/ops/SYCL.csv
# examples/lookup/lookup-create.cpp
# examples/lookup/lookup-stats.cpp
# examples/lookup/lookup.cpp
# examples/speculative-simple/speculative-simple.cpp
# examples/speculative/speculative.cpp
# ggml/src/ggml-hip/CMakeLists.txt
# ggml/src/ggml-sycl/dpct/helper.hpp
# ggml/src/ggml-sycl/ggml-sycl.cpp
# ggml/src/ggml-sycl/norm.cpp
# ggml/src/ggml-zendnn/ggml-zendnn.cpp
# tests/test-chat-template.cpp
2026-01-29 23:05:05 +08:00
Concedo
46cd17c17e
Merge commit '88d23ad515' into concedo_experimental
...
# Conflicts:
# CODEOWNERS
# docs/build.md
# ggml/CMakeLists.txt
# ggml/src/CMakeLists.txt
# ggml/src/ggml-webgpu/ggml-webgpu.cpp
# ggml/src/ggml-zendnn/CMakeLists.txt
# tests/test-chat-template.cpp
2026-01-29 22:25:56 +08:00
yulo
f3dd7b8e68
HIP: add mmf for CDNA ( #18896 )
...
* refactor mmf rows_per_block
* speed up compile
* pass cdna compile
* fix cuda error
* clean up mmf
* f32 mmf
* clean float mma
* fix mmf error
* faster mmf
* extend tile k
* fix compile error
* Revert "extend tile k"
This reverts commit 4d2ef3d483932659801a59a5af0b6b48f6ffd5c7.
* fix smem overflow
* speed up compiling mmf
* speed up compile for hip
* 512 block for cdna
* config pad size
* fix as comment
* update select logic
* move some code to cuh
* fix as comment
* correct cdna3 config
---------
Co-authored-by: zhang hui <you@example.com>
2026-01-29 11:10:53 +01:00
Vishal Singh
b33df266d0
ggml-zendnn : resolve ZenDNN backend cross-module symbol dependency ( #19159 )
2026-01-29 12:28:57 +08:00
Aman Gupta
3bcc990997
CUDA: refactor topk-moe to enable more models (GLM 4.7, Nemotron etc.) ( #19126 )
2026-01-29 10:31:28 +08:00
Neo Zhang
d4964a7c66
sycl: fix norm kernels: l2_norm, group_norm, rms_norm by removing assert to support more cases ( #19154 )
...
Co-authored-by: Neo Zhang Jianyu <jianyu.zhang@intel.com>
2026-01-29 09:20:22 +08:00
Ruben Ortlam
f6b533d898
Vulkan Flash Attention Coopmat1 Refactor ( #19075 )
...
* vulkan: use coopmat for flash attention p*v matrix multiplication
* fix P loading issue
* fix barrier position
* remove reduction that is no longer needed
* move max thread reduction into loop
* remove osh padding
* add bounds checks and padding
* remove unused code
* fix shmem sizes, loop duration and accesses
* don't overwrite Qf, add new shared psh buffer instead
* add missing bounds checks
* use subgroup reductions
* optimize
* move bounds check, reduce barriers
* support other Bc values and other subgroup sizes
* remove D_split
* replace Of register array with shared memory Ofsh array
* parallelize HSV across the rowgroups
* go back to Of in registers, not shmem
* vectorize sfsh
* don't store entire K tile in shmem
* fixes
* load large k tiles to shmem on Nvidia
* adapt shared memory host check function to shader changes
* remove Bc 32 case
* remove unused variable
* fix missing mask reduction tmspsh barrier
* fix mask bounds check
* fix rowmax f16 under/overflow to inf
* fix flash_attn_cm2 BLOCK_SIZE preprocessor directives
2026-01-28 18:52:45 +01:00