Concedo
80a9082166
q5_1 kv in cuda
2026-05-03 13:40:24 +08:00
Concedo
7c70187e26
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .github/ISSUE_TEMPLATE/010-bug-compilation.yml
# .github/ISSUE_TEMPLATE/011-bug-results.yml
# .github/ISSUE_TEMPLATE/019-bug-misc.yml
# .github/ISSUE_TEMPLATE/020-enhancement.yml
# .github/ISSUE_TEMPLATE/030-research.yml
# .github/ISSUE_TEMPLATE/040-refactor.yml
# ggml/CMakeLists.txt
# ggml/src/ggml-cann/ggml-cann.cpp
# ggml/src/ggml-hexagon/CMakeLists.txt
# ggml/src/ggml-hexagon/ggml-hexagon.cpp
# ggml/src/ggml-hexagon/htp/CMakeLists.txt
# ggml/src/ggml-hexagon/htp/cmake-toolchain.cmake
# ggml/src/ggml-hexagon/htp/flash-attn-ops.c
# ggml/src/ggml-hexagon/htp/hex-utils.h
# ggml/src/ggml-hexagon/htp/hmx-matmul-ops.c
# ggml/src/ggml-hexagon/htp/hmx-ops.h
# ggml/src/ggml-hexagon/htp/hmx-utils.h
# ggml/src/ggml-hexagon/htp/hvx-base.h
# ggml/src/ggml-hexagon/htp/hvx-copy.h
# ggml/src/ggml-hexagon/htp/hvx-exp.h
# ggml/src/ggml-hexagon/htp/unary-ops.c
# ggml/src/ggml-opencl/CMakeLists.txt
# ggml/src/ggml-opencl/ggml-opencl.cpp
# ggml/src/ggml-opencl/kernels/cvt.cl
# ggml/src/ggml-rpc/ggml-rpc.cpp
# ggml/src/ggml-sycl/ggml-sycl.cpp
# ggml/src/ggml-virtgpu/ggml-backend.cpp
# ggml/src/ggml-webgpu/ggml-webgpu-shader-lib.hpp
# ggml/src/ggml-webgpu/ggml-webgpu.cpp
# ggml/src/ggml-webgpu/wgsl-shaders/mul_mat_vec.wgsl
# ggml/src/ggml-zdnn/ggml-zdnn.cpp
# ggml/src/ggml-zendnn/ggml-zendnn.cpp
# scripts/sync-ggml.last
# tests/test-backend-ops.cpp
2026-05-02 18:07:50 +08:00
Concedo
61478cbf4a
Merge commit ' c20c44514a' into concedo_experimental
...
# Conflicts:
# .github/workflows/python-type-check.yml
# examples/speculative/speculative.cpp
# ggml/src/ggml-hexagon/ggml-hexagon.cpp
# ggml/src/ggml-hexagon/htp/htp-ctx.h
# ggml/src/ggml-hexagon/htp/htp-ops.h
# ggml/src/ggml-hexagon/htp/htp_iface.idl
# ggml/src/ggml-hexagon/htp/main.c
# ggml/src/ggml-webgpu/ggml-webgpu-shader-lib.hpp
# ggml/src/ggml-webgpu/ggml-webgpu.cpp
# ggml/src/ggml-webgpu/wgsl-shaders/mul_mat_decls.tmpl
# scripts/jinja/jinja-tester.py
# scripts/snapdragon/adb/run-cli.sh
# scripts/snapdragon/adb/run-completion.sh
# scripts/sync_vendor.py
# tests/test-backend-ops.cpp
2026-05-01 00:07:46 +08:00
Ruben Ortlam
660b1b4bdc
vulkan: add get/set tensor 2d functions ( #22514 )
...
Check Pre-Tokenizer Hashes / pre-tokenizer-hashes (push) Has been cancelled
Python check requirements.txt / check-requirements (push) Has been cancelled
Python Type-Check / python type-check (push) Has been cancelled
* vulkan: add get/set_tensor_2d functions
* fix backend interface comments
* Update ggml/src/ggml-metal/ggml-metal.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-04-30 17:37:13 +02:00
Johannes Gäßler
e82aaf2587
CUDA: fix tile FA kernel on Pascal ( #22541 )
2026-04-30 13:04:50 +02:00
Concedo
37073bc13d
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# ggml/CMakeLists.txt
# ggml/src/ggml-cpu/CMakeLists.txt
# ggml/src/ggml-cuda/mmq.cuh
# ggml/src/ggml-webgpu/ggml-webgpu-shader-lib.hpp
# ggml/src/ggml-webgpu/ggml-webgpu.cpp
# scripts/sync-ggml.last
# tests/test-backend-ops.cpp
# tests/test-log.cpp
2026-04-30 17:37:52 +08:00
Anav Prasad
098705a29e
CUDA: fuse SSM_CONV + ADD(bias) + SILU ( #22478 )
2026-04-30 02:39:56 +08:00
Aman Gupta
3142f1dbb9
ggml-cuda: refactor fusion code ( #22468 )
...
* ggml-cuda: refactor fusion code
* apply formatting + make env variable truthy
2026-04-29 16:19:33 +08:00
Michael Wand
fc2b0053ff
ggml-cuda: Repost of 21896: Blackwell native NVFP4 support ( #22196 )
2026-04-29 06:47:42 +08:00
lnigam
7b8443ac78
ggml-cuda: add flash-attn support for DKQ=320/DV=256 with ncols2=32 (… ( #22286 )
...
* ggml-cuda: add flash-attn support for DKQ=320/DV=256 with ncols2=32 (GQA=32)
Adds MMA-f16 and tile kernel configs, dispatch logic, template instances,
and tile .cu file for Mistral Small 4 (head sizes 320/256), restricting to
ncols2=32 to support GQA ratio 32 only.
* Adding check to return BEST_FATTN_KERNEL_NONE in case GQA!=32
* Apply suggestions from code review
Address review comments
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* Address review comments and making kernel config default to DQK=512, DV=512 instead of DQK=256,DV=256
* Fixed bug with sinks=1, with ncols=32, there are two warp-groups created but sinks index is same(0,...,15) for both the groups hence with sinks=1, output is not matching with CPU output. Added sink_base which will be base index for each warp_group (threadIdx.y / np)
* Apply suggestions from code review
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* Update ggml/src/ggml-cuda/template-instances/generate_cu_files.py
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
---------
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2026-04-28 21:37:35 +02:00
Concedo
095cfd6354
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# ggml/src/ggml-hexagon/htp/main.c
# ggml/src/ggml-opencl/CMakeLists.txt
# ggml/src/ggml-opencl/ggml-opencl.cpp
# ggml/src/ggml-opencl/kernels/cvt.cl
# tests/test-chat-auto-parser.cpp
# tests/test-chat.cpp
2026-04-26 15:57:35 +08:00
Oliver Simons
b1a5bd4e0c
CUDA: better coalesce data-access for contiguous concat ( #22330 )
...
Also, distribute all elements across CTAs evenly instead of launching
one CTA per dim
2026-04-26 09:21:45 +02:00
Johannes Gäßler
9725a313be
CUDA: reduce MMQ stream-k overhead ( #22298 )
...
Update Operations Documentation / update-ops-docs (push) Has been cancelled
* CUDA: reduce MMQ stream-k overhead
* use 32 bit integers for kbc
2026-04-25 14:15:03 +02:00
Concedo
340b22283e
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .devops/intel.Dockerfile
# .github/workflows/build-android.yml
# .github/workflows/build.yml
# .github/workflows/release.yml
# .gitignore
# docs/backend/SYCL.md
# docs/backend/snapdragon/README.md
# examples/model-conversion/scripts/causal/convert-model.sh
# ggml/CMakeLists.txt
# ggml/src/CMakeLists.txt
# ggml/src/ggml-hexagon/ggml-hexagon.cpp
# ggml/src/ggml-hexagon/htp/CMakeLists.txt
# ggml/src/ggml-hexagon/htp/hex-utils.h
# ggml/src/ggml-hexagon/htp/hmx-matmul-ops.c
# ggml/src/ggml-hexagon/htp/htp-ctx.h
# ggml/src/ggml-hexagon/htp/htp-ops.h
# ggml/src/ggml-hexagon/htp/htp_iface.idl
# ggml/src/ggml-hexagon/htp/hvx-base.h
# ggml/src/ggml-hexagon/htp/main.c
# ggml/src/ggml-hexagon/htp/matmul-ops.c
# ggml/src/ggml-hexagon/libggml-htp.inf
# ggml/src/ggml-sycl/ggml-sycl.cpp
# ggml/src/ggml-sycl/mmvq.cpp
# ggml/src/ggml-sycl/mmvq.hpp
# ggml/src/ggml-webgpu/ggml-webgpu-shader-lib.hpp
# ggml/src/ggml-webgpu/ggml-webgpu.cpp
# ggml/src/ggml-webgpu/wgsl-shaders/flash_attn.wgsl
# ggml/src/ggml-webgpu/wgsl-shaders/flash_attn_vec_blk.wgsl
# ggml/src/ggml-webgpu/wgsl-shaders/flash_attn_vec_split.wgsl
# scripts/server-test-structured.py
# scripts/snapdragon/adb/run-bench.sh
# scripts/snapdragon/adb/run-cli.sh
# scripts/snapdragon/adb/run-completion.sh
# scripts/snapdragon/adb/run-mtmd.sh
# scripts/snapdragon/adb/run-tool.sh
# scripts/snapdragon/qdc/requirements.txt
# scripts/snapdragon/windows/run-bench.ps1
# scripts/snapdragon/windows/run-cli.ps1
# scripts/snapdragon/windows/run-completion.ps1
# scripts/snapdragon/windows/run-mtmd.ps1
# scripts/snapdragon/windows/run-tool.ps1
# tests/test-backend-ops.cpp
# tools/cli/cli.cpp
# ty.toml
2026-04-25 12:13:14 +08:00
Anav Prasad
86db42e97f
CUDA: fuse relu + sqr ( #22249 )
2026-04-23 10:28:56 +08:00
Concedo
19a12bb080
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# CODEOWNERS
# common/CMakeLists.txt
# ggml/CMakeLists.txt
# ggml/src/ggml-webgpu/ggml-webgpu-shader-lib.hpp
# ggml/src/ggml-webgpu/ggml-webgpu.cpp
# ggml/src/ggml-webgpu/wgsl-shaders/common_decls.tmpl
# ggml/src/ggml-webgpu/wgsl-shaders/mul_mat_vec.wgsl
# scripts/sync-ggml.last
# tools/cli/cli.cpp
# tools/llama-bench/llama-bench.cpp
# tools/perplexity/perplexity.cpp
2026-04-21 18:53:03 +08:00
leonardHONG
97895129e5
ggml-cuda: flush legacy pool on OOM and retry ( #22155 )
...
* ggml-cuda: flush legacy pool on OOM and retry
Signed-off-by: 梁厚宏 <2695316095@qq.com>
* Address review comments: add explicit sync, update destructor, clean up MUSA macros
Signed-off-by: 梁厚宏 <2695316095@qq.com>
---------
Signed-off-by: 梁厚宏 <2695316095@qq.com>
2026-04-20 23:30:38 +02:00
Johannes Gäßler
fb19f94c71
TP: fix 0-sized tensor slices, AllReduce fallback ( #21808 )
...
* TP: fix 0-sized tensor slices, AllReduce fallback
* fix layer structure <-> GPU count aliasing
* add missing std::fill
* fix CUDA device set, max ggml ctx size
2026-04-20 18:09:39 +02:00
Concedo
cd6788007e
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .github/workflows/build-cross.yml
# .github/workflows/build-self-hosted.yml
# .github/workflows/release.yml
# examples/llama.android/lib/src/main/cpp/CMakeLists.txt
# ggml/CMakeLists.txt
# ggml/src/ggml-rpc/CMakeLists.txt
# ggml/src/ggml-rpc/ggml-rpc.cpp
# ggml/src/ggml-sycl/mmvq.cpp
# ggml/src/ggml-webgpu/ggml-webgpu-shader-lib.hpp
# ggml/src/ggml-webgpu/ggml-webgpu.cpp
# scripts/sync_vendor.py
# tests/test-chat.cpp
# tests/test-mtmd-c-api.c
# tools/server/README.md
2026-04-20 20:19:11 +08:00
Johannes Gäßler
4eac5b4509
CUDA: refactor mma data loading for AMD ( #22051 )
...
* CUDA: refactor mma data loading for AMD
* fix CDNA MMQ occupancy
* fix CDNA3 mma
* fix RDNA3 compile
2026-04-19 18:26:59 +02:00
uvos
471540ae8a
HIP: Remove unesscary NCCL_CHECK ( #21914 )
2026-04-19 12:59:44 +02:00
Concedo
e5eab545f3
handle override jinja template
2026-04-19 00:30:28 +08:00
Aman Gupta
b94050e896
CUDA: use LRU based eviction for cuda graphs ( #21611 )
...
* CUDA: use a ring-buffer for cuda graphs
* bump limit to 128
* use LRU eviction
* better naming
* do periodic clean-up
2026-04-17 23:24:21 +08:00
Concedo
79882d669a
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .github/workflows/build-android.yml
# .github/workflows/build.yml
# .github/workflows/release.yml
# CMakeLists.txt
# CODEOWNERS
# common/CMakeLists.txt
# common/common.h
# docs/ops.md
# docs/ops/Metal.csv
# examples/batched/CMakeLists.txt
# examples/convert-llama2c-to-ggml/CMakeLists.txt
# examples/debug/CMakeLists.txt
# examples/diffusion/CMakeLists.txt
# examples/embedding/CMakeLists.txt
# examples/eval-callback/CMakeLists.txt
# examples/gen-docs/CMakeLists.txt
# examples/idle/CMakeLists.txt
# examples/lookahead/CMakeLists.txt
# examples/lookup/CMakeLists.txt
# examples/parallel/CMakeLists.txt
# examples/passkey/CMakeLists.txt
# examples/retrieval/CMakeLists.txt
# examples/save-load-state/CMakeLists.txt
# examples/speculative-simple/CMakeLists.txt
# examples/speculative/CMakeLists.txt
# examples/sycl/CMakeLists.txt
# examples/training/CMakeLists.txt
# ggml/src/ggml-hexagon/htp/hmx-matmul-ops.c
# ggml/src/ggml-hexagon/htp/htp-ops.h
# ggml/src/ggml-hexagon/htp/main.c
# ggml/src/ggml-opencl/CMakeLists.txt
# ggml/src/ggml-opencl/ggml-opencl.cpp
# ggml/src/ggml-opencl/kernels/cvt.cl
# pocs/vdot/CMakeLists.txt
# src/CMakeLists.txt
# tests/CMakeLists.txt
# tests/test-quantize-stats.cpp
# tools/batched-bench/CMakeLists.txt
# tools/cli/CMakeLists.txt
# tools/cli/cli.cpp
# tools/completion/CMakeLists.txt
# tools/cvector-generator/CMakeLists.txt
# tools/cvector-generator/cvector-generator.cpp
# tools/export-lora/CMakeLists.txt
# tools/gguf-split/CMakeLists.txt
# tools/gguf-split/gguf-split.cpp
# tools/imatrix/CMakeLists.txt
# tools/llama-bench/CMakeLists.txt
# tools/llama-bench/llama-bench.cpp
# tools/mtmd/CMakeLists.txt
# tools/perplexity/CMakeLists.txt
# tools/quantize/CMakeLists.txt
# tools/quantize/quantize.cpp
# tools/results/CMakeLists.txt
# tools/server/CMakeLists.txt
# tools/tokenize/CMakeLists.txt
# tools/tts/CMakeLists.txt
2026-04-17 22:37:37 +08:00
Concedo
768527b031
Merge commit ' 1e796eb41f' into concedo_experimental
...
# Conflicts:
# .devops/nix/package.nix
# .github/workflows/build-riscv.yml
# .github/workflows/build-vulkan.yml
# .github/workflows/build.yml
# docs/backend/SYCL.md
# docs/build.md
# docs/development/HOWTO-add-model.md
# embd_res/templates/Reka-Edge.jinja
# ggml/CMakeLists.txt
# ggml/src/ggml-rpc/CMakeLists.txt
# ggml/src/ggml-rpc/ggml-rpc.cpp
# ggml/src/ggml-sycl/CMakeLists.txt
# ggml/src/ggml-sycl/convert.cpp
# ggml/src/ggml-sycl/dequantize.hpp
# ggml/src/ggml-sycl/dmmv.cpp
# ggml/src/ggml-sycl/ggml-sycl.cpp
# ggml/src/ggml-webgpu/ggml-webgpu.cpp
# ggml/src/ggml-webgpu/wgsl-shaders/common_decls.tmpl
# ggml/src/ggml-webgpu/wgsl-shaders/get_rows.wgsl
# ggml/src/ggml-webgpu/wgsl-shaders/mul_mat.wgsl
# ggml/src/ggml-webgpu/wgsl-shaders/mul_mat_decls.tmpl
# ggml/src/ggml-webgpu/wgsl-shaders/mul_mat_id.wgsl
# ggml/src/ggml-webgpu/wgsl-shaders/mul_mat_reg_tile.wgsl
# ggml/src/ggml-webgpu/wgsl-shaders/mul_mat_subgroup_matrix.wgsl
# ggml/src/ggml-webgpu/wgsl-shaders/mul_mat_vec.wgsl
# ggml/src/ggml-webgpu/wgsl-shaders/unary.wgsl
# tests/test-chat.cpp
# tools/rpc/README.md
2026-04-17 21:47:29 +08:00
Concedo
9a38091207
support q5_1 kv
2026-04-17 17:06:15 +08:00
Aman Gupta
3f7c29d318
ggml: add graph_reused ( #21764 )
...
* ggml: add graph_reused
* use versioning instead of reuse flag
* increment version with atomic
* use top bits for split numbering
* add assert
* move counter to ggml.c
* set uid in split_graph only
* fix windows
* address further review comments
* get next_uid rather than doing bit manipulation
* rename + add comment about uid
2026-04-16 17:21:28 +08:00
Pasha Khosravi
7e72b38bc1
cuda: Q1_0 initial backend ( #21629 )
...
* [cuda] initial Q1_0 backend
* remove unused code, fix AMD MMA guard
* attempt to support dp4a
* Apply suggestions from code review
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
---------
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2026-04-15 18:38:38 +02:00
Johannes Gäßler
a6206958d2
CUDA: require explicit opt-in for P2P access ( #21910 )
2026-04-15 16:01:46 +02:00
Johannes Gäßler
014dca49d6
CUDA: manage NCCL communicators in context ( #21891 )
...
* CUDA: manage NCCL communicators in context
* add check that all backends are CUDA
* remove unused vector, limit init to > 1 GPUs
* fix warnings
* fix cuda device, cache allreduce
2026-04-15 15:58:40 +02:00
Concedo
9c0b9b0bb1
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# docs/development/HOWTO-add-model.md
# docs/multimodal.md
# ggml/src/ggml-sycl/convert.cpp
# ggml/src/ggml-sycl/dequantize.hpp
# ggml/src/ggml-sycl/element_wise.cpp
# ggml/src/ggml-sycl/gated_delta_net.cpp
# ggml/src/ggml-sycl/ggml-sycl.cpp
# ggml/src/ggml-sycl/upscale.cpp
# ggml/src/ggml-webgpu/ggml-webgpu.cpp
# tests/test-backend-ops.cpp
# tests/test-llama-archs.cpp
# tools/mtmd/CMakeLists.txt
2026-04-14 20:06:04 +08:00
Oliver Simons
9f5e1edb10
CUDA: Limit DeviceSegmentedSort to immediate mode ( #21718 )
...
* CUDA: Limit DeviceSegmentedSort to immediate mode
DeviceSegmentedSort is currently not capturable in a cuda graph. Hence,
we have to go for the slower DeviceSegmentedRadixSort in that case.
Perf numbers on RTX Pro 6000 Blackwell Max-Q:
DeviceSegmentedRadixSort in graph mode (i.e. CUDA Graphs)
ARGSORT(type=f32,ne=[2048,512,1,1],order=1): 12291 runs - 105.94 us/run - 8192 kB/run - 73.75 GB/s
ARGSORT(type=f32,ne=[4096,512,1,1],order=1): 10245 runs - 115.08 us/run - 16384 kB/run - 135.77 GB/s
ARGSORT(type=f32,ne=[8192,512,1,1],order=1): 5125 runs - 221.22 us/run - 32768 kB/run - 141.26 GB/s
ARGSORT(type=f32,ne=[16384,512,1,1],order=1): 2565 runs - 430.98 us/run - 65536 kB/run - 145.02 GB/s
ARGSORT(type=f32,ne=[32768,512,1,1],order=1): 1028 runs - 1185.83 us/run - 131072 kB/run - 105.41 GB/s
ARGSORT(type=f32,ne=[65536,512,1,1],order=1): 387 runs - 2748.62 us/run - 262144 kB/run - 90.95 GB/s
DeviceSegmentedSort in immediate mode
ARGSORT(type=f32,ne=[2048,512,1,1],order=1): 16388 runs - 71.17 us/run - 8192 kB/run - 109.78 GB/s
ARGSORT(type=f32,ne=[4096,512,1,1],order=1): 12294 runs - 81.38 us/run - 16384 kB/run - 192.00 GB/s
ARGSORT(type=f32,ne=[8192,512,1,1],order=1): 5125 runs - 240.81 us/run - 32768 kB/run - 129.77 GB/s
ARGSORT(type=f32,ne=[16384,512,1,1],order=1): 2565 runs - 406.60 us/run - 65536 kB/run - 153.71 GB/s
ARGSORT(type=f32,ne=[32768,512,1,1],order=1): 1285 runs - 873.23 us/run - 131072 kB/run - 143.15 GB/s
ARGSORT(type=f32,ne=[65536,512,1,1],order=1): 516 runs - 2288.46 us/run - 262144 kB/run - 109.24 GB/s
* Add test case for dispatch to DeviceSegmentedRadixSort
We currently lack a way to force graph mode in CUDA, patch callback to
invoke ggml_backend_compare_graph_backend twice to enforce each test to
run in graph mode
2026-04-13 11:14:06 +02:00
Stephen Cox
547765a93e
mtmd: add Gemma 4 audio conformer encoder support ( #21421 )
...
* mtmd: add Gemma 4 audio conformer encoder support
Add audio processing for Gemma 4 E2B/E4B via a USM-style Conformer.
Architecture:
- 12-layer Conformer: FFN → Self-Attention → Causal Conv1D → FFN → Norm
- Subsampling Conv Projection: 2x Conv2D(stride=2) with LayerNorm
- Full self-attention with sinusoidal RPE and sliding window mask (24)
- Logit softcapping at 50.0, ClippableLinear clamping
- Output: 1024 → 1536 → RMSNorm → multimodal embedder
Mel preprocessing (dedicated mtmd_audio_preprocessor_gemma4a):
- HTK mel scale, 128 bins, magnitude STFT, mel_floor=1e-3
- Standard periodic Hann window (320 samples), zero-padded to FFT size
- Semicausal left-padding (frame_length/2 samples)
- Frame count matched to PyTorch (unfold formula)
- No pre-emphasis, no Whisper-style normalization
- Mel cosine similarity vs PyTorch: 0.9998
Key fixes:
- Tensor loading dedup: prevent get_tensor() from creating duplicate
entries in ctx_data. Fixed with std::set guard.
- ClippableLinear clamp_info loading moved after per-layer tensors.
- Sliding window mask (24 positions) matching PyTorch context_size.
- Skip Whisper normalization for Gemma4 mel output.
Tested on E2B and E4B with CPU and Vulkan backends.
Transcribes: "Glad to see things are going well and business is starting
to pick up" (matching ground truth).
Ref: #21325
2026-04-12 14:15:26 +02:00
Concedo
5361b45fba
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# ggml/src/ggml-opencl/CMakeLists.txt
# ggml/src/ggml-opencl/ggml-opencl.cpp
# ggml/src/ggml-opencl/kernels/cvt.cl
# requirements/requirements-tool_bench.txt
2026-04-12 16:22:26 +08:00
Johannes Gäßler
ff5ef82786
CUDA: skip compilation of superfluous FA kernels ( #21768 )
Check Pre-Tokenizer Hashes / pre-tokenizer-hashes (push) Has been cancelled
Python check requirements.txt / check-requirements (push) Has been cancelled
Python Type-Check / python type-check (push) Has been cancelled
2026-04-11 18:52:11 +02:00
Concedo
4c860ae4ae
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# common/download.cpp
# docs/backend/OPENVINO.md
# docs/backend/snapdragon/CMakeUserPresets.json
# docs/backend/snapdragon/README.md
# ggml/src/ggml-hexagon/ggml-hexagon.cpp
# ggml/src/ggml-hexagon/htp/act-ops.c
# ggml/src/ggml-hexagon/htp/argsort-ops.c
# ggml/src/ggml-hexagon/htp/binary-ops.c
# ggml/src/ggml-hexagon/htp/cpy-ops.c
# ggml/src/ggml-hexagon/htp/cumsum-ops.c
# ggml/src/ggml-hexagon/htp/flash-attn-ops.c
# ggml/src/ggml-hexagon/htp/get-rows-ops.c
# ggml/src/ggml-hexagon/htp/hex-utils.h
# ggml/src/ggml-hexagon/htp/hmx-matmul-ops.c
# ggml/src/ggml-hexagon/htp/hmx-ops.h
# ggml/src/ggml-hexagon/htp/htp-ctx.h
# ggml/src/ggml-hexagon/htp/htp-ops.h
# ggml/src/ggml-hexagon/htp/htp_iface.idl
# ggml/src/ggml-hexagon/htp/main.c
# ggml/src/ggml-hexagon/htp/matmul-ops.c
# ggml/src/ggml-hexagon/htp/repeat-ops.c
# ggml/src/ggml-hexagon/htp/rope-ops.c
# ggml/src/ggml-hexagon/htp/set-rows-ops.c
# ggml/src/ggml-hexagon/htp/softmax-ops.c
# ggml/src/ggml-hexagon/htp/ssm-conv.c
# ggml/src/ggml-hexagon/htp/sum-rows-ops.c
# ggml/src/ggml-hexagon/htp/unary-ops.c
# ggml/src/ggml-webgpu/ggml-webgpu-shader-lib.hpp
# ggml/src/ggml-webgpu/ggml-webgpu.cpp
# ggml/src/ggml-webgpu/wgsl-shaders/common_decls.tmpl
# ggml/src/ggml-webgpu/wgsl-shaders/flash_attn.wgsl
# ggml/src/ggml-webgpu/wgsl-shaders/get_rows.wgsl
# ggml/src/ggml-webgpu/wgsl-shaders/mul_mat.wgsl
# ggml/src/ggml-webgpu/wgsl-shaders/mul_mat_decls.tmpl
# ggml/src/ggml-webgpu/wgsl-shaders/mul_mat_vec.wgsl
# ggml/src/ggml-webgpu/wgsl-shaders/unary.wgsl
# models/templates/google-gemma-4-31B-it-interleaved.jinja
# models/templates/google-gemma-4-31B-it.jinja
# scripts/snapdragon/adb/run-bench.sh
# scripts/snapdragon/adb/run-cli.sh
# scripts/snapdragon/adb/run-completion.sh
# scripts/snapdragon/adb/run-tool.sh
# scripts/snapdragon/windows/run-bench.ps1
# scripts/snapdragon/windows/run-cli.ps1
# scripts/snapdragon/windows/run-mtmd.ps1
# scripts/snapdragon/windows/run-tool.ps1
# tests/test-backend-ops.cpp
# tests/test-chat.cpp
# tools/llama-bench/llama-bench.cpp
2026-04-11 11:19:32 +08:00
Concedo
a165a73120
Merge commit ' d6f3030047' into concedo_experimental
...
# Conflicts:
# examples/model-conversion/scripts/causal/run-casual-gen-embeddings-org.py
# examples/model-conversion/scripts/utils/semantic_check.py
# ggml/CMakeLists.txt
# ggml/src/CMakeLists.txt
# ggml/src/ggml-cann/ggml-cann.cpp
# ggml/src/ggml-cpu/amx/amx.cpp
# ggml/src/ggml-cuda/CMakeLists.txt
# ggml/src/ggml-hexagon/ggml-hexagon.cpp
# ggml/src/ggml-hip/CMakeLists.txt
# ggml/src/ggml-opencl/ggml-opencl.cpp
# ggml/src/ggml-openvino/ggml-openvino.cpp
# ggml/src/ggml-rpc/ggml-rpc.cpp
# ggml/src/ggml-sycl/ggml-sycl.cpp
# ggml/src/ggml-virtgpu/ggml-backend-buffer.cpp
# ggml/src/ggml-virtgpu/ggml-backend.cpp
# ggml/src/ggml-webgpu/ggml-webgpu.cpp
# ggml/src/ggml-zdnn/ggml-zdnn.cpp
# ggml/src/ggml-zendnn/ggml-zendnn.cpp
# pyproject.toml
# requirements/requirements-convert_legacy_llama.txt
# requirements/requirements-tool_bench.txt
# src/llama-model.cpp
# src/llama.cpp
# tests/test-llama-archs.cpp
# tests/test-tokenizer-0.py
# tests/test-tokenizer-random.py
# tools/llama-bench/llama-bench.cpp
# tools/perplexity/perplexity.cpp
2026-04-11 11:10:55 +08:00
Aman Gupta
a29e4c0b7b
CUDA: also store node->src ne/nb for graph equality ( #21736 )
2026-04-11 10:30:30 +08:00
Aman Gupta
e34f042154
CUDA: fuse muls ( #21665 )
2026-04-10 10:24:09 +08:00
andyluo7
d132f22fc9
HIP: add CDNA4 (gfx950) architecture support for MI350X/MI355X ( #21570 )
...
Add AMD Instinct MI350X/MI355X (gfx950, CDNA4) support:
- vendors/hip.h: Add CDNA4 preprocessor define for __gfx950__
- common.cuh: Add GGML_CUDA_CC_CDNA4 and GGML_CUDA_CC_IS_CDNA4 macros
- mma.cuh: Route CDNA4 to compatible MFMA instructions:
* f32 matmul: mfma_f32_16x16x4f32 (xf32 variant unavailable on gfx950)
* bf16 matmul: mfma_f32_16x16x16bf16_1k (same as CDNA3)
* int8 matmul: mfma_i32_16x16x32_i8/32x32x16 (same as CDNA3)
- mmq.cuh: Include CDNA4 in stream-k kernel dispatch
CDNA4 is largely compatible with CDNA3 except:
- No xf32 MFMA (mfma_f32_16x16x8_xf32) — routes to f32 path
- Different FP8 format (e4m3fn vs e4m3_fnuz) — not changed here
Tested on AMD Instinct MI355X (gfx950), ROCm 7.0.1:
- Build: compiles cleanly with -DAMDGPU_TARGETS=gfx950
- llama-bench (Qwen2.5-1.5B Q4_K_M, single GPU):
* f16+FA: 40,013 tok/s prefill, 254 tok/s decode
* q8_0+FA: functional
- Flash attention: works correctly
- MMQ: works correctly with stream-k dispatch
Co-authored-by: Andy Luo <andyluo7@users.noreply.github.com>
2026-04-09 21:13:32 +02:00
Johannes Gäßler
d6f3030047
ggml: backend-agnostic tensor parallelism (experimental) ( #19378 )
...
* ggml: backend-agnostic tensor parallelism
* support for GPT-OSS, Qwen 3 MoE
* partial Vulkan fix
* add support for 4/8 GPUs
* unconditional peer access
* re-use buffers + ggml contexts
* fix output pattern
* NCCL support
* GGML: HIP: add RCCL support
* Remove shfl and AllReduce from backend interface
* move allocation workaround out of ggml-alloc.c
* 2d tensor set/get support
* Fix the seg fault without NCCL
* Apply suggestion from JohannesGaessler
* support for tensor dims % n_devs != 0
* fix view_offs scaling
* arbitrary num. of GPUs/tensor split
* fix compilation
* better granularity estimate
* Support device-specific host buffer types if all underlying backends expose the same type. This allows using pinned memory instead of pageable memory for CUDA.
Fix compilation errors.
* partial Qwen 3 Next support
* Fix qwen3 30b (#8 )
* Fix crash with Qwen-30B-A3B Q4_0
Qwen-30B-A3B Q4_0 has an intermediate dimension of 768. Using a granularity of 256 forces an uneven split between GPUs, which is not supported by the current implementation.
* Decide block size based on tensor quantization type
* Fix crashes due to KV cache serialization (#9 )
KV cache serialization requires non-zero offsets on the tensor. Add support in the meta backend to set/get a tensor with a non-zero offset.
* metal : fix build (#7 )
* static memory allocations, fix usage count
* fix tensor granularity
* more even memory distribution
* use BF16 for allreduce
* rebase fixup
* better error message for unsupported architectures
* Fix device mismatch during scatter of allReduce. (#11 )
There is a mismatch between the dst buffer device and the backend device, causing the use of sync copies
* Enable the previous allreduce implementation. It is better in both perf and stability (#12 )
* delay AllReduce for Moe for less I/O
* build : clean-up compile warnings
* backend : move most of the meta backend API to ggml-backend-impl.h
* cont : hide unused public API in the implementation
* llama : use llama_device + remove ggml_backend_dev_is_meta()
* ggml-backend : remove unused alloc include
* minor : remove regex include
* ggml : introduce ggml-ext.h for staging new APIs
* rebase fixup
* fix tests
* llama : more robust logic for determining Meta devices (#16 )
* llama : more robust logic for determining Meta devices
* cont : fix devs size check
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* cont : fix log type
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
---------
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* disable roundtrip for meta backend
* fix arch selection
* Qwen 3.5 support
* fix Gemma 4 MoE
* fix OpenVino, SYCL
* fix test-llama-archs for CPU-only builds
* Fix Qwen 3.5 MoE
* disable meta backend tests for WebGPU
* tests : filter CPU-based devices from the Meta backend tests (#17 )
* meta : formatting, naming, indentation (#18 )
* formatting : llama-model.cpp
* formatting : ggml-ext.h
* formatting : ggml-backend-meta.cpp
* meta : add TODO
* add documentation
* better error messages
* fix GPT-OSS
---------
Co-authored-by: Carl Philipp Klemm <carl@uvos.xyz>
Co-authored-by: Gaurav Garg <gaugarg@nvidia.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-04-09 16:42:19 +02:00
fairydreaming
009a113326
ggml : check return value of CUB calls used in argsort and top-k (they all return cudaError_t) ( #21676 )
...
Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
2026-04-09 21:17:11 +08:00
Concedo
c82c0b463a
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .github/labeler.yml
# .github/workflows/release.yml
# examples/debug/debug.cpp
# ggml/src/ggml-cuda/common.cuh
# ggml/src/ggml-cuda/mmq.cuh
# ggml/src/ggml-webgpu/ggml-webgpu.cpp
# src/llama-vocab.cpp
# tests/test-backend-ops.cpp
# tests/test-chat.cpp
# tests/test-json-schema-to-grammar.cpp
# tools/mtmd/CMakeLists.txt
2026-04-09 17:45:04 +08:00
Concedo
5529748a01
Merge commit ' de1aa6fa73' into concedo_experimental
...
# Conflicts:
# docs/build.md
# docs/ops.md
# docs/ops/WebGPU.csv
# ggml/src/ggml-sycl/dequantize.hpp
# ggml/src/ggml-sycl/dmmv.cpp
# ggml/src/ggml-sycl/ggml-sycl.cpp
# ggml/src/ggml-sycl/mmvq.cpp
# ggml/src/ggml-sycl/quants.hpp
# ggml/src/ggml-sycl/vecdotq.hpp
# ggml/src/ggml-webgpu/ggml-webgpu-shader-lib.hpp
# ggml/src/ggml-webgpu/ggml-webgpu.cpp
# ggml/src/ggml-webgpu/wgsl-shaders/mul_mat_decls.tmpl
# tests/test-backend-ops.cpp
# tests/test-quantize-fns.cpp
2026-04-09 17:16:33 +08:00
Aman Gupta
d12cc3d1ca
CUDA: also store node->src->data ptrs for equality check ( #21635 )
...
Check Pre-Tokenizer Hashes / pre-tokenizer-hashes (push) Has been cancelled
Python check requirements.txt / check-requirements (push) Has been cancelled
Python Type-Check / python type-check (push) Has been cancelled
* CUDA: also store node->src->data ptrs for equality check
* address review comments
2026-04-09 01:01:56 +08:00
Aman Gupta
c5ce4bc227
CUDA: make cuda graphs props check faster ( #21472 )
...
* CUDA: compute fast hash instead of expensive props check
* use seen node
* use memcp
2026-04-08 09:05:51 +08:00
iacopPBK
66c4f9ded0
ggml-cuda: ds_read_b128 for q4_0 and q4_1 mmq kernels ( #21168 )
...
* ds_read_b128 for q4_0 and q4_1 mmq kernels
Current for loop generates ds_read_b32 instructions with hip compiler, the new solution generates ds_read_b128 instructions for the same operation, saving some LDS bandwidth. Tested on MI50 and RX6800XT, its faster on both.
* Vectorized lds load update: used ggml_cuda_get_max_cpy_bytes and ggml_cuda_memcpy_1 functions for generic implementation
* Explicit for loop in mmq, renamed vec into tmp
* Fixed max_cpy usage in the loading loop
* Fixed typo in q4_1 kernel
* Update ggml/src/ggml-cuda/mmq.cuh
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* Update ggml/src/ggml-cuda/mmq.cuh
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* Update ggml/src/ggml-cuda/mmq.cuh
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* Renoved trailing white line 500
* Update mmq.cuh removed other whitelines
* Remove trailing whitespaces
---------
Co-authored-by: iacopPBK <iacopPBK@users.noreply.github.com>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: iacopPBK <iacop@deneb.com>
2026-04-07 21:47:42 +02:00
Aman Gupta
de1aa6fa73
CUDA: check for buffer overlap before fusing ( #21566 )
...
* CUDA: check for buffer overlap before fusing
* use ggml_cuda_check_fusion_memory_ranges
2026-04-08 00:57:04 +08:00
Antoine Viallon
71a81f6fcc
ggml-cuda : fix CDNA2 compute capability constant for gfx90a (MI210) ( #21519 )
...
GGML_CUDA_CC_CDNA2 was set to 0x910
Fix by setting the constant to 0x90a to match the actual gfx90a ISA.
2026-04-07 12:18:55 +02:00
Concedo
15d269197e
Merge commit ' 506200cf8b' into concedo_experimental
...
# Conflicts:
# docs/multimodal.md
# scripts/compare-llama-bench.py
# src/llama-vocab.cpp
# tools/llama-bench/README.md
# tools/llama-bench/llama-bench.cpp
2026-04-07 14:58:36 +08:00