* Updated vec.h/vec.cpp code to accumulate to F32 rather than F16
Change-Id: I0cb789347f2bf60ffaf9047319f727e788c825f8
Signed-off-by: Martin Klacer <martin.klacer@arm.com>
Co-authored-by: Milos Puzovic <Milos.Puzovic@arm.com>
* ci : separate CUDA windows workflow + fix names
* ci : rename workflow
* ci : prefix cache names with workflow name
* ci : rename build.yml -> build-cpu.yml
* ci : cache keys
* ci : fix windows cuda/hip concurrency of release workflow
* ci : fix apple cache names
* ci : add TODOs
* cont : keep just the last cache
* ci : update release concurrency to queue
* ci : move the release trigger to ubuntu-slim
* ci : hip add TODO
* cont : improve words
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* OP_GATED_DELTA_NET impl
* add back lanes_per_column declaration
* removed has_subgroup_arithmetic and has_subgroup_clustered_reduce
* removed trailing spaces and fixed indentation. Hard-coded subgroup size for Adreno and Intel. Return not supported when K>1 state snapshot
* support for K>1 state snapshot
* removed picky indent multiple of 4 fixes
* removed return that won't be executed
* hex-mm: add support for Q4_1 matmul/matvec, hvx-only for now
* hmx-mm: add support for Q4_1
* hex-mm: use Q8_1 dynamic quantization to avoid having to compute sums in the vec_dot
* hexagon: fix repack scratch buffer overflow
* hex-mm: fix Q4_1 repack buffer sizing
* hexagon: flip the build order for mm and fa (seems to help LTO)
* hex-mm: add vec_dot 4x1s and minor HMX cleanup after adding Q4_1
* hex-mm: fix fp16 vec_dot fallback to 2x1 and another issue that could cause incorrect output
* hexagon: resurrect early-wake and add support for polling for op-batch completions
With Q4_1, ggml-hexagon now claims pretty much the entire graph, which gives the CPU more time to chillax.
This is a good thing! But it does add extra latency for the pure benchmark runs.
Early wakeup helps recover the latency a bit in normal runs, and op-batch polling is just for benchmarking.
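Why Q8_1 activations avoid computing sums in the vec_dot can be sketched as follows. This is a simplified illustration, not the Hexagon kernel: the structs below are stand-ins that use float scales and unpacked 4-bit values, and all function names are hypothetical. The point is only that a Q4_1 value is `d*q + m`, so the `m` term needs the sum of the activation block, which Q8_1 caches as `s = d * sum(q)`.

```cpp
#include <cassert>
#include <cstdint>

constexpr int QK = 32;  // values per block, matching ggml's Q4_1/Q8_1 block size

// Simplified stand-ins for the ggml layouts (assumption: float scales, unpacked nibbles).
struct BlockQ4_1 { float d; float m; uint8_t q[QK]; };  // x_i = d*q_i + m, q_i in [0, 15]
struct BlockQ8_1 { float d; float s; int8_t  q[QK]; };  // y_i = d*q_i,     s = d * sum(q_i)

// Compute the cached sum once, when the activations are quantized.
inline void finalize_q8_1(BlockQ8_1 & y) {
    int sum = 0;
    for (int i = 0; i < QK; ++i) sum += y.q[i];
    y.s = y.d * (float) sum;
}

// sum_i (dx*qx_i + mx) * (dy*qy_i) = dx*dy*sum_i(qx_i*qy_i) + mx*s
// The inner loop is a pure integer dot product; no per-call sum of y is needed.
inline float dot_q4_1_q8_1(const BlockQ4_1 & x, const BlockQ8_1 & y) {
    int32_t isum = 0;
    for (int i = 0; i < QK; ++i) isum += (int32_t) x.q[i] * (int32_t) y.q[i];
    return x.d * y.d * (float) isum + x.m * y.s;
}

// Straightforward reference: dequantize both sides and multiply.
inline float dot_reference(const BlockQ4_1 & x, const BlockQ8_1 & y) {
    float acc = 0.0f;
    for (int i = 0; i < QK; ++i) {
        acc += (x.d * (float) x.q[i] + x.m) * (y.d * (float) y.q[i]);
    }
    return acc;
}
```

With Q8_0 activations the `mx * sum(y)` term would have to be recomputed inside every dot product, which is the cost this change avoids.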
---------
Co-authored-by: Todor Boinovski <todorb@qti.qualcomm.com>
* vulkan: Switch MUL_MAT_VEC to 4 K per iteration for F16/32
Against mesa git, this shows a 4.8% performance improvement for
tg128 on Qwen3.5-9B:BF16 on Intel BMG.
Note that this breaks some tests until the last commit which fixes
OOB A reads.
* vulkan: Use aligned loads in mul_mat_vec when available
Against mesa git, this shows a 3.3% performance improvement for
tg128 on Qwen3.5-9B:BF16 on Intel BMG.
* Make explicit that `num_rows` is <= `NUM_ROWS` in mul_mat_vec
Mesa's UUB logic can't see through conditionals, limiting its
ability to understand the bounds on the `num_rows` field in the
cleanup run. Making it explicit that `num_rows` is, indeed, always
<= `NUM_ROWS` helps mesa make slightly better codegen.
Against mesa git, this currently shows a 1% performance improvement
in tg128 on Qwen3.5-9B:BF16 on Intel BMG.
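A rough C++ analogue of the explicit bound, for illustration only. The real change is in the Vulkan shader; the function name, the layout, and the value of `NUM_ROWS` here are assumptions. The clamp is a no-op for valid callers, but it puts the `num_rows <= NUM_ROWS` fact where range analysis can see it.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

constexpr uint32_t NUM_ROWS = 2;  // hypothetical rows-per-invocation constant

// Computes dst[first_row + r] = dot(A[first_row + r, :], b) for r < num_rows.
// `a` is row-major with `cols` columns. The last (cleanup) call may pass num_rows < NUM_ROWS.
void mul_mat_vec_rows(const float * a, const float * b, float * dst,
                      uint32_t cols, uint32_t first_row, uint32_t num_rows) {
    assert(num_rows <= NUM_ROWS);
    // State the bound explicitly so indexing into temp[] is provably in range.
    num_rows = std::min(num_rows, NUM_ROWS);

    float temp[NUM_ROWS] = {};
    for (uint32_t r = 0; r < num_rows; ++r) {
        for (uint32_t c = 0; c < cols; ++c) {
            temp[r] += a[(first_row + r) * cols + c] * b[c];
        }
    }
    for (uint32_t r = 0; r < num_rows; ++r) {
        dst[first_row + r] = temp[r];
    }
}
```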
* vulkan: Fix OOB A reads in MUL_MAT_VEC for odd sizes
There was a TODO to fix the OOB reads from the A matrix which we do
here.
It is within performance noise (+<0.1%) in tg128 for
Qwen3.5-9B:BF16 on Intel BMG.
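The combination of "4 K per iteration" and "no OOB reads for odd sizes" can be sketched on the CPU like this. It is a minimal illustration with a hypothetical function name, not the shader code: the main loop consumes four K elements at a time, and a scalar tail handles the remainder so nothing past `k` is ever read.

```cpp
#include <cassert>
#include <cstddef>

// Dot product of one row `a` with vector `b`, both of length k.
float dot_k4(const float * a, const float * b, size_t k) {
    assert(a != nullptr && b != nullptr);
    float acc0 = 0.0f, acc1 = 0.0f, acc2 = 0.0f, acc3 = 0.0f;

    // Main loop: only runs over the largest multiple of 4 that fits in k.
    const size_t k4 = k - (k % 4);
    for (size_t i = 0; i < k4; i += 4) {
        acc0 += a[i + 0] * b[i + 0];
        acc1 += a[i + 1] * b[i + 1];
        acc2 += a[i + 2] * b[i + 2];
        acc3 += a[i + 3] * b[i + 3];
    }

    // Tail: up to 3 leftover elements, read one at a time to stay in bounds.
    float acc = (acc0 + acc1) + (acc2 + acc3);
    for (size_t i = k4; i < k; ++i) {
        acc += a[i] * b[i];
    }
    return acc;
}
```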
* feat: extend repeat op for vulkan
* feat: add repeat_f16 vulkan pipeline
* fix: ensure same dst and src types
* fix: use type_size instead of data types
* fix: use int16 and int32 for repeat shader op
* chore: rename repeat_f* to repeat_i*
* chore: rename repeat vulkan pipelines
* ci : server windows set build type explicitly
* cont : try windows-2025
* ci : use llvm
* cont : use ninja
* cont : fix shell
* ci : set number of jobs correctly
* ci : fix windows with vulkan ccache by using llvm
* ci : server ccache only on master
* ocd : fix job names
[no release]
* ci : fix undefined sanitizer build to use Debug build type only
* ci : ccache the server builds
* cont : remove ui dependency + reuse ccache for both ubuntu jobs
* tmp : force ccache save
* Revert "tmp : force ccache save"
This reverts commit a857b03a10b1304d456129a017e0e46b185618ee.
* cont : no need for node.js
Add minicpm5 pre-tokenizer hash via convert_hf_to_gguf_update.py and
implement hardcoded regex handling in llama-vocab.cpp, consistent with
other BPE pre-tokenizers.
Co-authored-by: zhangtao <zhangtao2@modelbest.cn>
* ggml-zendnn: fixed naming of matmul function
* ggml-zendnn: fixed naming of mul_mat_id function
* ggml-zendnn: fixed print in mul_mat_id
---------
Co-authored-by: plotnikov.v10 <plotnikov.v10@wb.ru>
* ci : move [no release] check to dedicated check_release job
Move the workflow-level `if` condition that skips builds when the commit
message contains `[no release]` into a lightweight `check_release` job.
All build jobs now depend on it via `needs` and check its output.
This ensures the skip logic is evaluated at the job level rather than at
the workflow level, which is the recommended approach for conditional jobs.
Assisted-by: llama.cpp:local pi
* cont : use `fast` runner
* ci : skip release workflow on master when commit message contains [no release]
Assisted-by: llama.cpp:local pi
* ci : restrict sanitizer builds to x86_64 + fix build type
the spark is apparently too slow for some reason
* tests : fix undefined warning
[no ci]
* vulkan: add CONV_SHAPE_64x128 for medium-K conv2d
* vulkan: skip conv2d bounds checks when shapes align with tile sizes
* vulkan: use WG_SIZE=128 for CONV_SHAPE_64x32 conv2d
* vulkan: stage cm2 conv2d accumulator through shmem before global store
* vulkan: add coopmat1 conv2d path
* fallback when using too much shared memory. clean up comments
* Require 16x16x16 and subgroup size 32 or 64
* check whether shared memory is sufficient before overwriting conv2d params with coopmat1 values
* hexagon: add support for CONCAT with optimized concat_2d_transposed
Qwen3.5 models rely heavily on CONCAT with a large, transposed src1.
* hex-concat: use fastdiv in generic version
* hex-concat: make checks for transposed a bit more readable
* hex-concat: reorder dma ops for better pipelining
* hex-cont/cpy: optimize CPY and CONT ops
The primary change is to avoid scalar divs in the inner loops.
We were calling hvx_copy_uu(... type_size) where type_size is not a constexpr.
This causes runtime divs by that value, which is normally just 4 or 2 (f32/f16).
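One common way to get rid of such runtime divisions is to dispatch once on the element size and make it a template parameter. The sketch below only illustrates that idea: the function names are hypothetical and this is not the signature of hvx_copy_uu.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// ELEM_SIZE is a compile-time constant, so n_bytes / ELEM_SIZE becomes a shift
// for the power-of-two sizes instead of a runtime divide.
template <size_t ELEM_SIZE>
static size_t copy_elems_fixed(uint8_t * dst, const uint8_t * src, size_t n_bytes) {
    const size_t n_elems = n_bytes / ELEM_SIZE;
    std::memcpy(dst, src, n_elems * ELEM_SIZE);
    return n_elems;
}

// Generic fallback: the divisor is only known at run time.
static size_t copy_elems_runtime(uint8_t * dst, const uint8_t * src, size_t n_bytes, size_t elem_size) {
    const size_t n_elems = n_bytes / elem_size;
    std::memcpy(dst, src, n_elems * elem_size);
    return n_elems;
}

// Copies as many whole elements as fit in n_bytes; returns the element count.
// The switch runs once per call, outside any inner loop.
size_t copy_elems(uint8_t * dst, const uint8_t * src, size_t n_bytes, size_t elem_size) {
    assert(elem_size != 0);
    switch (elem_size) {
        case 4:  return copy_elems_fixed<4>(dst, src, n_bytes);  // f32
        case 2:  return copy_elems_fixed<2>(dst, src, n_bytes);  // f16
        default: return copy_elems_runtime(dst, src, n_bytes, elem_size);
    }
}
```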
* hex-get-rows: optimize GET_ROWS for large rows
We now use DMA for larger rows and also split them into chunks to improve perf for Qwen3.5 and other models
that do lots of GET_ROWS with huge (2MB+) rows.
Also bump the DMA queue depth now that we can take advantage of it.
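The chunking idea can be sketched as below. This is an illustration under assumptions: `memcpy` stands in for queuing a DMA transfer, and the function name and chunk size are hypothetical. Splitting a large row into bounded pieces is what lets several transfers sit in a deeper queue at once.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Copies one row in pieces of at most chunk_bytes and returns the number of
// transfers issued. Each memcpy stands in for one queued DMA descriptor.
size_t copy_row_chunked(uint8_t * dst, const uint8_t * src, size_t row_bytes, size_t chunk_bytes) {
    assert(chunk_bytes > 0);
    size_t n_transfers = 0;
    for (size_t off = 0; off < row_bytes; off += chunk_bytes) {
        // The final chunk may be shorter than chunk_bytes.
        const size_t len = std::min(chunk_bytes, row_bytes - off);
        std::memcpy(dst + off, src + off, len);
        ++n_transfers;
    }
    return n_transfers;
}
```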
* hex-concat: unroll the inner loops of concat_2d
* hex-concat: more updates to concat_2d to improve perf a bit further
* hex-cpy: fixed n_rows per thread checks in the copy ops
* hmx-fa: fix alignment issues while computing dma sizes
* hex-set-rows: add early returns for idle threads
* hvx-rope: minor optimization to replace loops with fastdiv logic
* hex-rope: replace scalar tail processing with HVX
* hex-rope: optimize rope cache init with HVX
Add hvx-utils sin/cos helpers that use an approximate method (similar to rsqrt, inverse, etc)
Use the helpers to optimize ROPE.
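For the fastdiv replacements mentioned in the hex-concat and hvx-rope entries above, here is a minimal sketch of the usual multiply-and-shift technique for dividing by a value that is fixed across a loop. The struct and function names are assumptions, and the Hexagon helpers may be implemented differently.

```cpp
#include <cassert>
#include <cstdint>

// Precomputed constants for dividing 32-bit values by a fixed divisor d.
struct FastDiv {
    uint32_t mp;     // magic multiplier
    uint32_t shift;  // ceil(log2(d))
};

inline FastDiv fastdiv_init(uint32_t d) {
    assert(d != 0);
    uint32_t L = 0;
    while (L < 32 && (uint64_t{1} << L) < d) ++L;
    const uint32_t mp =
        (uint32_t) (((uint64_t{1} << 32) * ((uint64_t{1} << L) - d)) / d + 1);
    return {mp, L};
}

// n / d using one multiply, one add and one shift (no hardware divide).
inline uint32_t fastdiv(uint32_t n, FastDiv fd) {
    const uint64_t hi = ((uint64_t) n * fd.mp) >> 32;  // high half of n * mp
    return (uint32_t) ((hi + n) >> fd.shift);          // 64-bit sum avoids overflow
}

// n % d derived from the quotient.
inline uint32_t fastmod(uint32_t n, uint32_t d, FastDiv fd) {
    return n - fastdiv(n, fd) * d;
}
```

The constants are computed once per op, so the one real division in `fastdiv_init` is amortized over every index calculation in the inner loops.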
Create a pool of N threads that grab a chunk of up to 100 tests at a time to
iterate through. The number of tests at a time decreases as fewer remain.
Each thread uses its own dev and cpu backend, and set_n_threads_fn is not
called on the cpu backend.
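The scheduling described above can be sketched as follows. The exact chunk formula is not given here, so the policy in `next()` (remaining tests divided by the thread count, capped at 100, at least 1) is an assumption, as are all the names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Hands out half-open ranges [begin, end) of test indices to worker threads.
class ChunkQueue {
public:
    ChunkQueue(size_t n_tests, size_t n_threads, size_t max_chunk = 100)
        : n_tests_(n_tests), n_threads_(std::max<size_t>(n_threads, 1)), max_chunk_(max_chunk) {
        assert(max_chunk > 0);
    }

    // Returns false once every test has been handed out.
    bool next(size_t & begin, size_t & end) {
        std::lock_guard<std::mutex> lock(mutex_);
        if (next_ >= n_tests_) return false;
        const size_t remaining = n_tests_ - next_;
        // Assumed policy: chunks shrink as the remaining work shrinks.
        const size_t chunk = std::max<size_t>(1, std::min(max_chunk_, remaining / n_threads_));
        begin = next_;
        end   = next_ + chunk;
        next_ = end;
        return true;
    }

private:
    std::mutex mutex_;
    size_t next_ = 0;
    size_t n_tests_;
    size_t n_threads_;
    size_t max_chunk_;
};

// Runs a placeholder "test" (a counter increment) over all indices with a pool of threads.
std::vector<int> run_all(size_t n_tests, size_t n_threads) {
    std::vector<int> hits(n_tests, 0);
    ChunkQueue queue(n_tests, n_threads);
    std::vector<std::thread> pool;
    for (size_t t = 0; t < n_threads; ++t) {
        pool.emplace_back([&hits, &queue] {
            size_t b = 0, e = 0;
            while (queue.next(b, e)) {
                for (size_t i = b; i < e; ++i) hits[i] += 1;  // ranges are disjoint
            }
        });
    }
    for (auto & th : pool) th.join();
    return hits;
}
```

Shrinking chunks toward the end keeps the threads from finishing far apart when only a few tests are left.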
Fix some TSAN issues that arose:
- In init_tensor_uniform, don't use static vector of generators.
- Replace gmtime with versions that don't use a global variable.
- Mutex calls to print_test_result.
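The three fixes above follow familiar patterns, sketched below. This is not the test-backend-ops code: the function names are hypothetical, and using a `thread_local` generator is an assumption about what replaced the static vector.

```cpp
#include <cassert>
#include <cstdio>
#include <ctime>
#include <mutex>
#include <random>
#include <string>

// (1) One generator per thread instead of shared static state.
inline float uniform_sample(float lo, float hi) {
    assert(lo < hi);
    thread_local std::mt19937 gen(std::random_device{}());
    std::uniform_real_distribution<float> dist(lo, hi);
    return dist(gen);
}

// (2) gmtime() returns a pointer to shared storage; the reentrant variants
//     write into a caller-provided struct instead.
inline std::tm utc_time(std::time_t t) {
    std::tm out{};
#if defined(_WIN32)
    gmtime_s(&out, &t);
#else
    gmtime_r(&t, &out);
#endif
    return out;
}

// (3) Serialize output so lines from different threads do not interleave.
inline void print_line_locked(const std::string & line) {
    static std::mutex print_mutex;
    std::lock_guard<std::mutex> lock(print_mutex);
    std::fputs(line.c_str(), stdout);
    std::fputc('\n', stdout);
}
```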
* initial talkie support, coherent
* reorder to follow convention
* absorb inverse rope
* stop folding scalars to improve quantization
* use broadcasting instead of duplication
* style cleanup
* add scaling support to LoraTorchTensor; use that path in conversion
* use layer_out_scale instead of embd_skip_scale
* ggml-webgpu: Add MMVQ path for Q4/Q8/Q2_K/Q4_K
* Fix to make editorconfig check pass
* Remove mul-mat-legacy pipeline
* Fix to use vendor name as is and add dot_product/vendor to shader_lib_ctx
* Only run webgpu CI on my fork
* Add webgpu only workflow
* refactor batch_compute_passes to a per-thread variable, and submit individual passes when it is set to false and no GPU profiling is enabled
* restore build.yml
* ci : disable SYCL f16 builds
* ci : extract android and hip into separate workflows
* ci : move webgpu to separate workflow
* ci : move the rpc to a separate workflow
* ci : extract s390x and ppcl jobs
* ci : extract opencl job into a separate workflow
ffn_latent_down/up are declared GGML_OP_MUL in LLM_TENSOR_INFOS but
nemotron-h feeds them through ggml_mul_mat. The loader buft probe asks
the backend about the declared op, so it tested an elementwise MUL on a
q8_0 weight. That used to return true unconditionally and the weight
stayed on GPU by luck. Once supports_op told the truth, the probe got a
no and the loader pushed the weight and its matmul to CPU, splitting the
graph. Tagging it MUL_MAT asks the real question, the math is unchanged.
Verified on Nemotron 3 Super 120B Q5_K_M: from 64.9 back to 103.22 t/s.
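A toy model of the mismatch, for illustration only. The enums and functions below are hypothetical and much simpler than the real loader and backend interfaces. They only show why asking about the declared op instead of the op actually used changes where the weight is placed.

```cpp
#include <cassert>

enum class Op { MUL, MUL_MAT };
enum class Type { F32, Q8_0 };
enum class Placement { GPU, CPU };

// Toy backend: element-wise MUL needs F32 data, MUL_MAT also accepts quantized weights.
inline bool supports_op(Op op, Type weight_type) {
    return op == Op::MUL ? weight_type == Type::F32 : true;
}

// Toy loader probe: the weight goes to the GPU only if the backend supports
// the op that the tensor is *declared* to be used with.
inline Placement place_weight(Op declared_op, Type weight_type) {
    return supports_op(declared_op, weight_type) ? Placement::GPU : Placement::CPU;
}
```

In this model a q8_0 weight declared as `MUL` lands on the CPU, while the same weight declared as `MUL_MAT` stays on the GPU, which mirrors the behaviour described above.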