koboldcpp/scripts/snapdragon/adb
Max Krasnyansky aa50b2c2ae
hexagon: add support for Q4_1 in MUL_MAT and MUL_MAT_ID (#23647)
* hex-mm: add support for Q4_1 matmul/matvec, hvx-only for now

* hmx-mm: add support for Q4_1

* hex-mm: use Q8_1 dynamic quantization to avoid having to compute sums in the vec_dot

* hexagon: fix repack scratch buffer overflow

* hex-mm: fix Q4_1 repack buffer sizing

* hexagon: flip the build order for mm and fa (seems to help LTO)

* hex-mm: add vec_dot 4x1s and minor HMX cleanup after adding Q4_1

* hex-mm: fix fp16 vec_dot fallback to 2x1 and another issue that could cause incorrect output

* hexagon: resurrect early-wake and add support for polling for op-batch completions

With Q4_1, ggml-hexagon now claims pretty much the entire graph, which gives the CPU more time to chillax.
This is a good thing! But it does add extra latency in pure benchmark runs.
Early wakeup helps recover some of that latency in normal runs; op-batch polling is intended for benchmarking only.
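
The note above about using Q8_1 to avoid computing sums in the vec_dot can be illustrated with a small sketch. It is not the HVX/HMX kernel from this change. It is a plain-Python model that assumes the usual ggml block layouts: Q4_1 stores a scale d and a minimum m per block (x[i] ~= d * q[i] + m), and Q8_1 stores a scale d plus a cached s = d * sum(q). The function names, the unpacked lists, and the use of Python floats in place of fp16 are all simplifications made for illustration.

```python
QK = 32  # block size, assumed to match ggml's 32-value Q4_1 / Q8_1 blocks

def quantize_q4_1(x):
    """Q4_1-style block: x[i] ~= d * q[i] + m, with q[i] in 0..15."""
    lo, hi = min(x), max(x)
    d = (hi - lo) / 15 or 1.0  # fall back to 1.0 for a constant block
    q = [min(15, max(0, round((v - lo) / d))) for v in x]
    return d, lo, q

def quantize_q8_1(y):
    """Q8_1-style block: y[i] ~= d * q[i], plus the cached s = d * sum(q)."""
    amax = max(abs(v) for v in y)
    d = amax / 127 or 1.0  # fall back to 1.0 for an all-zero block
    q = [round(v / d) for v in y]
    s = d * sum(q)  # computed once, when the activations are quantized
    return d, s, q

def vec_dot_q4_1_q8_1(a, b):
    """Dot product of one Q4_1 block with one Q8_1 block.

    sum((d4*q4 + m4) * (d8*q8)) = d4*d8*sum(q4*q8) + m4*(d8*sum(q8))
                                = d4*d8*sum(q4*q8) + m4*s8
    so the inner loop needs only the integer products; the sum of the
    activation quants is already folded into s8.
    """
    d4, m4, q4 = a
    d8, s8, q8 = b
    isum = sum(u * v for u, v in zip(q4, q8))
    return d4 * d8 * isum + m4 * s8
```

With a Q8_0-style activation block (no cached sum), the m4 term would require summing the activation quants inside the dot product for every weight row. Caching s at quantization time moves that work out of the hot loop.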

---------

Co-authored-by: Todor Boinovski <todorb@qti.qualcomm.com>
2026-05-27 10:46:11 -07:00
llama-cli.farf Add experimental ggml-hexagon backend for the Hexagon NPU (#16547) 2025-10-22 13:47:09 -07:00
run-bench.sh hexagon: HMX quantized matmul rework (#23368) 2026-05-20 07:39:01 -07:00
run-cli.sh hexagon: HMX quantized matmul rework (#23368) 2026-05-20 07:39:01 -07:00
run-completion.sh hexagon: add support for Q4_1 in MUL_MAT and MUL_MAT_ID (#23647) 2026-05-27 10:46:11 -07:00
run-mtmd.sh hexagon: HMX quantized matmul rework (#23368) 2026-05-20 07:39:01 -07:00
run-tool.sh hexagon: add support for Q4_1 in MUL_MAT and MUL_MAT_ID (#23647) 2026-05-27 10:46:11 -07:00