* CUDA: Check PTX version on host side to guard PDL dispatch
Checking `__CUDA_ARCH_LIST__` alone is insufficient for JIT, as this
variable doesn't differentiate between compiling for, say, sm_90, sm_90a
or sm_90f (i.e. forward-jittable PTX vs. arch/family-specific PTX).
Thus, a build with
`-DCMAKE_CUDA_ARCHITECTURES="89;90a"` hits a bug where the current code
wrongly dispatches to PDL on sm_90/sm_120 in forward-JIT mode.
This PR fixes this issue by checking `cudaFuncAttributes::ptxVersion` of
the incoming kernel at runtime. A check on ptxVersion alone is
sufficient, as the device code version will always be >= ptxVersion (and
any violation of this would be a severe bug in CUDA/nvcc), see:
https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/#gpu-code-code-code
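
A minimal sketch of the host-side guard described above, written as plain C++ so it runs without a GPU. The struct `kernel_attrs`, the constant `PDL_MIN_PTX_VERSION` and the function `can_use_pdl` are hypothetical names, not the ones used in the PR. In real code the `ptx_version` value would come from `cudaFuncGetAttributes` (field `cudaFuncAttributes::ptxVersion`), and the threshold of 90 is an assumption based on PDL requiring sm_90 or newer.

```cpp
// Hypothetical mirror of the two cudaFuncAttributes fields relevant here.
struct kernel_attrs {
    int ptx_version;    // e.g. 89 when the PTX was generated for compute_89
    int binary_version; // e.g. 90 when that PTX was JIT-compiled for an sm_90 device
};

// Assumption: the device-side PDL code is only compiled in for PTX >= 9.0.
constexpr int PDL_MIN_PTX_VERSION = 90;

// Decide on the host whether this kernel may be launched with PDL.
// Only the PTX version is consulted, since the binary version is never lower.
inline bool can_use_pdl(const kernel_attrs & attrs) {
    return attrs.ptx_version >= PDL_MIN_PTX_VERSION;
}
```

With this check, a kernel whose PTX targets compute_89 but was forward-JITed for a newer device (`{89, 90}` or `{89, 120}`) is not dispatched with PDL, while a kernel built from sm_90-level PTX is.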
* Implement MurmurHash3 mixer for better hash distribution
Magic constants were taken from boost:
2698b43803/include/boost/container_hash/detail/hash_mix.hpp (L19-L65)
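
For reference, a sketch of the 64-bit finalizer ("fmix64") from the public MurmurHash3 reference implementation. The constants below are the published MurmurHash3 ones and are not necessarily the ones this commit took from boost; the function name `murmur3_fmix64` is illustrative.

```cpp
#include <cstdint>

// MurmurHash3 64-bit finalizer: alternating xor-shifts and multiplications
// by odd constants. Every step is invertible, so the mixer is a bijection.
inline uint64_t murmur3_fmix64(uint64_t k) {
    k ^= k >> 33;
    k *= 0xff51afd7ed558ccdULL;
    k ^= k >> 33;
    k *= 0xc4ceb9fe1a85ec53ULL;
    k ^= k >> 33;
    return k;
}
```

A mixer of this shape maps 0 to 0, which is one plausible reason to prefer a non-zero seed when combining hashes.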
* Update ggml/src/ggml-cuda/common.cuh
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* Address review comments, make seed non-zero
* Apply code-formatting
* Replace std::size_t -> size_t for consistency
---------
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* mmvq Optim: add MMVQ_PARAMETERS_TURING(mmvq_parameter_table_id) for SM75 TURING
* avoid a mismatch for JIT compilation of Turing device code for Ampere or newer
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
---------
Co-authored-by: Copilot <copilot@github.com>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* hex-fa: clean up qf32/fp32 handling and stride handling
* hex-fa: fix corner case fp NAN issues that were causing bad output from gemma4 on v79
* hex-fa: vectorize leftover handling
* hex-fa: avoid HVX fallback during token gen; HMX has more FP16 compute capacity
* hmx-mm: remove dead code
* hmx-mm: use fastdiv in x4x2 dequant
* hmx-mm: sandwich dequant and scatter to improve perf
* hmx-mm: fixed rebase conflicts
* hmx-mm: further improve weight dequant by doing early type dispatch and precomputing fastdiv
* hmx-mm: an even earlier dispatch for per-type dequant
* hmx-mm: dequant linear types like q4_0 and q4_1 without the LUTs
This is a bit faster than LUT.
* hex-cmake: one more tweak for lto
---------
Co-authored-by: Trivikram Reddy <tamarnat@qti.qualcomm.com>
* Updated vec.h/vec.cpp code to accumulate to F32 rather than F16
Change-Id: I0cb789347f2bf60ffaf9047319f727e788c825f8
Signed-off-by: Martin Klacer <martin.klacer@arm.com>
Co-authored-by: Milos Puzovic <Milos.Puzovic@arm.com>
* OP_GATED_DELTA_NET impl
* add back lanes_per_column declaration
* removed has_subgroup_arithmetic and has_subgroup_clustered_reduce
* removed trailing spaces and fixed indentation. Hard-coded subgroup size for Adreno and Intel. Return not supported for K>1 state snapshot
* support for K>1 state snapshot
* removed picky indent multiple of 4 fixes
* removed return that won't be executed
* hex-mm: add support for Q4_1 matmul/matvec, hvx-only for now
* hmx-mm: add support for Q4_1
* hex-mm: use Q8_1 dynamic quantization to avoid having to compute sums in the vec_dot
* hexagon: fix repack scratch buffer overflow
* hex-mm: fix Q4_1 repack buffer sizing
* hexagon: flip the build order for mm and fa (seems to help LTO)
* hex-mm: add vec_dot 4x1s and minor HMX cleanup after adding Q4_1
* hex-mm: fix fp16 vec_dot fallback to 2x1 and another issue that could cause incorrect output
* hexagon: resurrect early-wake and add support for polling for op-batch completions
With Q4_1, ggml-hexagon now claims pretty much entire graphs, which gives the CPU more time to chillax.
This is a good thing! But it does add extra latency for the pure benchmark runs.
Early wakeup helps recover the latency a bit in normal runs, and op-batch polling is just for benchmarking.
---------
Co-authored-by: Todor Boinovski <todorb@qti.qualcomm.com>
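
A sketch of why Q8_1 activations let the Q4_1 `vec_dot` skip the sum computation. With `x[i] = d4*q4[i] + m4` and `y[i] = d8*q8[i]`, the block dot product is `d4*d8*sum(q4[i]*q8[i]) + m4*(d8*sum(q8[i]))`, and the second factor can be stored once at quantization time. The structs below are simplified stand-ins (float scales, unpacked nibbles, block size 8), not the actual ggml or Hexagon layouts.

```cpp
#include <cstddef>
#include <cstdint>

constexpr size_t QK = 8; // illustrative block size only

struct blk_q4_1 { float d; float m; uint8_t q[QK]; }; // x[i] = d*q[i] + m, q[i] in [0, 15]
struct blk_q8_1 { float d; float s; int8_t  q[QK]; }; // y[i] = d*q[i],     s = d * sum(q)

// Quantization-time step: store the scaled sum of the quants alongside them.
inline blk_q8_1 make_q8_1(float d, const int8_t (&q)[QK]) {
    blk_q8_1 b{};
    b.d = d;
    int32_t sum = 0;
    for (size_t i = 0; i < QK; ++i) { b.q[i] = q[i]; sum += q[i]; }
    b.s = d * float(sum);
    return b;
}

// Dot product that relies on the precomputed s; no per-call sum over y.q.
inline float dot_with_sum(const blk_q4_1 & x, const blk_q8_1 & y) {
    int32_t sumi = 0;
    for (size_t i = 0; i < QK; ++i) sumi += int32_t(x.q[i]) * int32_t(y.q[i]);
    return x.d * y.d * float(sumi) + x.m * y.s;
}

// Straightforward dequantize-then-multiply reference for comparison.
inline float dot_reference(const blk_q4_1 & x, const blk_q8_1 & y) {
    float acc = 0.0f;
    for (size_t i = 0; i < QK; ++i) acc += (x.d * float(x.q[i]) + x.m) * (y.d * float(y.q[i]));
    return acc;
}
```

Both functions compute the same value up to floating-point rounding; the first one only needs the integer dot product in its inner loop.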
* vulkan: Switch MUL_MAT_VEC to 4 K per iteration for F16/32
Against mesa git, this shows a 4.8% performance improvement for
tg128 on Qwen3.5-9B:BF16 on Intel BMG.
Note that this breaks some tests until the last commit which fixes
OOB A reads.
* vulkan: Use aligned loads in mul_mat_vec when available
Against mesa git, this shows a 3.3% performance improvement for
tg128 on Qwen3.5-9B:BF16 on Intel BMG.
* Make explicit that `num_rows` is <= `NUM_ROWS` in mul_mat_vec
Mesa's UUB logic can't see through conditionals, limiting its
ability to understand the bounds on the `num_rows` field in the
cleanup run. Making it explicit that `num_rows` is, indeed, always
<= `NUM_ROWS` helps mesa make slightly better codegen.
Against mesa git, this currently shows a 1% performance improvement
in tg128 on Qwen3.5-9B:BF16 on Intel BMG.
* vulkan: Fix OOB A reads in MUL_MAT_VEC for odd sizes
There was a TODO to fix the OOB reads from the A matrix which we do
here.
It is within performance noise (+<0.1%) in tg128 for
Qwen3.5-9B:BF16 on Intel BMG.
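
A C++ stand-in for the idea behind these shader changes: consume 4 K elements per iteration, then finish any remainder with a scalar tail so that odd sizes never read past the end of A. The function name `dot_k4` is hypothetical, and the actual shader may handle the tail differently.

```cpp
#include <cstddef>

// Dot product over k elements, 4 per iteration, with an in-bounds scalar tail.
inline float dot_k4(const float * a, const float * b, size_t k) {
    float acc = 0.0f;
    const size_t k4 = k - (k % 4); // largest multiple of 4 that is <= k
    size_t i = 0;
    for (; i < k4; i += 4) {
        acc += a[i] * b[i] + a[i + 1] * b[i + 1] + a[i + 2] * b[i + 2] + a[i + 3] * b[i + 3];
    }
    for (; i < k; ++i) { // at most 3 leftover elements
        acc += a[i] * b[i];
    }
    return acc;
}
```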
* feat: extend repeat op for vulkan
* feat: add repeat_f16 vulkan pipeline
* fix: ensure same dst and src types
* fix: use type_size instead of data types
* fix: use int16 and int32 for repeat shader op
* chore: rename repeat_f* to repeat_i*
* chore: rename repeat vulkan pipelines
* ggml-zendnn: fixed naming of matmul function
* ggml-zendnn: fixed naming of mul_mat_id function
* ggml-zendnn: fixed print in mul_mat_id
---------
Co-authored-by: plotnikov.v10 <plotnikov.v10@wb.ru>
* vulkan: add CONV_SHAPE_64x128 for medium-K conv2d
* vulkan: skip conv2d bounds checks when shapes align with tile sizes
* vulkan: use WG_SIZE=128 for CONV_SHAPE_64x32 conv2d
* vulkan: stage cm2 conv2d accumulator through shmem before global store
* vulkan: add coopmat1 conv2d path
* fallback when using too much shared memory. clean up comments
* Require 16x16x16 and subgroup size 32 or 64
* check whether shared memory is sufficient before overwriting conv2d params with coopmat1 values
* hexagon: add support for CONCAT with optimized concat_2d_transposed
qwen3.5 models are quite heavy on the CONCAT with large and transposed src1.
* hex-concat: use fastdiv in generic version
* hex-concat: make checks for transposed a bit more readable
* hex-concat: reorder dma ops for better pipelining
* hex-cont/cpy: optimize CPY and CONT ops
The primary change is to avoid scalar divs in the inner loops.
We were calling hvx_copy_uu(... type_size) where type_size is not a constexpr.
This causes runtime divs by that value, which is normally just 4 or 2 (f32/f16).
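
For context, one common way to avoid a runtime division by a loop-invariant divisor is the multiply-and-shift scheme (Granlund-Montgomery style) sketched below. The names `fastdiv_t`, `fastdiv_init` and `fastdiv` are illustrative and this is not necessarily how the Hexagon helpers are implemented.

```cpp
#include <cstdint>

struct fastdiv_t {
    uint32_t mp;    // precomputed "magic" multiplier
    uint32_t shift; // ceil(log2(d))
};

// Precompute once per divisor, outside the hot loop. d must be non-zero.
inline fastdiv_t fastdiv_init(uint32_t d) {
    uint32_t shift = 0;
    while (shift < 32 && (uint64_t(1) << shift) < d) ++shift;
    const uint64_t mp = ((uint64_t(1) << 32) * ((uint64_t(1) << shift) - d)) / d + 1;
    return { uint32_t(mp), shift };
}

// n / d using one multiply, one add and one shift.
inline uint32_t fastdiv(uint32_t n, fastdiv_t f) {
    const uint64_t hi = (uint64_t(n) * f.mp) >> 32; // high 32 bits of n * mp
    return uint32_t((hi + n) >> f.shift);
}
```

The division in `fastdiv_init` still runs once, but the inner loop only pays for a multiply and a shift per element.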
* hex-get-rows: optimize GET_ROWS for large rows
We now use DMA for larger rows and also split them into chunks to improve perf for Qwen3.5 and other models
that do lots of GET_ROWS with huge (2MB+) rows.
Also bump the DMA queue depth now that we can take advantage of it.
* hex-concat: unroll the inner loops of concat_2d
* hex-concat: more updates to concat_2d to improve perf a bit further
* hex-cpy: fixed n_rows per thread checks in the copy ops
* hmx-fa: fix alignment issues while computing dma sizes
* hex-set-rows: add early returns for idle threads
* hvx-rope: minor optimization to replace loops with fastdiv logic
* hex-rope: replace scalar tail processing with HVX
* hex-rope: optimize rope cache init with HVX
Add hvx-utils sin/cos helpers that use an approx method (similar to rsqrt, inverse, etc.).
Use the helpers to optimize ROPE.
* ggml-webgpu: Add MMVQ path for Q4/Q8/Q2_K/Q4_K
* Fix to make the editorconfig check pass
* Remove mul-mat-legacy pipeline
* Fix to use vendor name as is and add dot_product/vendor to shader_lib_ctx
* Only run webgpu CI on my fork
* Add webgpu only workflow
* refactor batch_compute_passes to a per-thread variable, and submit individual passes when it is set to false and no GPU profiling is enabled
* restore build.yml
* TP: fix ggml context size calculation, memory leak
* move split state cache back into the context
* revert to constant ggml context size for cgraphs
* increase headroom for statically allocated tensors
* remove obsolete include
* ggml: implement `gguf_init_from_buffer`
* test: `gguf_init_from_buffer`
* fix: memory breakdown for a model loaded with `no_alloc` from a file is consistent with being loaded from a buffer
* fix: use `GGML_UNUSED`
Co-authored-by: Copilot <copilot@github.com>
* fix: remove `total_size` from `gguf_reader`
* fix: file offset calculation, rename `offset` to `data_offset`
Co-authored-by: Copilot <copilot@github.com>
* refactor: extract model loader bug fixes to another PR
* feat: add `gguf_init_from_callback`
* fix: always require a max expected size
* fix: change `gguf_reader_callback_t`'s `output` type to `void *`, change `max_expected_size` and offsets to `uint64_t`
* fix: harden against offset overflow in buffer read
* fix: remove seek behavior from the callback
* feat: `max_chunk_read == 0` means `SIZE_MAX`
* fix: seeking in a gguf file with no tensors
---------
Co-authored-by: Copilot <copilot@github.com>
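
A sketch of the kind of bounds check that "harden against offset overflow in buffer read" refers to. The type `buffer_view` and the function `read_bytes` are hypothetical and do not reflect the actual `gguf_reader` interface; the point is only to compare against the remaining size instead of computing `offset + len`, which could wrap around for `uint64_t` inputs.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical read-only view of an in-memory GGUF image.
struct buffer_view {
    const uint8_t * data;
    uint64_t        size;
};

// Copy len bytes starting at offset into dst; returns false if out of range.
inline bool read_bytes(const buffer_view & buf, uint64_t offset, void * dst, uint64_t len) {
    // offset + len is never formed, so a huge offset or len cannot wrap past size.
    if (offset > buf.size || len > buf.size - offset) {
        return false;
    }
    std::memcpy(dst, buf.data + offset, static_cast<std::size_t>(len));
    return true;
}
```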
- Use OpenMP to parallelize iq2xs_init_impl and iq3xs_init_impl.
- Move the OpenMP detection from ggml-cpu to ggml-base.
- Update OpenMP dependencies in ggml-config.cmake.in.
- change `k_copy_src1_to_contiguous` so that it uses a precomputed contiguous mapping where all rows "owned" by an expert are in one slice with known start and end
- switch the `O(n_as * n_routed_rows)` contraption to a counting sort-based procedure with `O(n_as + n_routed_rows)` complexity
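
A host-side sketch of a counting-sort grouping with `O(n_as + n_routed_rows)` complexity, as described in the two items above. The names `expert_slices` and `group_rows_by_expert` are hypothetical, and the input is assumed to be one expert id in `[0, n_as)` per routed row; the actual implementation may differ.

```cpp
#include <cstdint>
#include <vector>

struct expert_slices {
    std::vector<int32_t> start; // size n_as + 1; expert e owns rows[start[e] .. start[e + 1])
    std::vector<int32_t> rows;  // routed-row indices, grouped contiguously by expert
};

inline expert_slices group_rows_by_expert(const std::vector<int32_t> & expert_of_row, int32_t n_as) {
    expert_slices out;
    out.start.assign(n_as + 1, 0);

    // 1) Histogram: count the rows routed to each expert.
    for (int32_t e : expert_of_row) out.start[e + 1]++;

    // 2) Prefix sum: turn counts into slice start offsets.
    for (int32_t e = 0; e < n_as; ++e) out.start[e + 1] += out.start[e];

    // 3) Scatter: place each row at the next free slot of its expert's slice.
    std::vector<int32_t> cursor(out.start.begin(), out.start.end() - 1);
    out.rows.resize(expert_of_row.size());
    for (int32_t r = 0; r < int32_t(expert_of_row.size()); ++r) {
        out.rows[cursor[expert_of_row[r]]++] = r;
    }
    return out;
}
```

Each pass touches either every expert or every routed row once, and rows keep their original relative order within an expert's slice.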