* CUDA: Check PTX version on host side to guard PDL dispatch
Checking on `__CUDA_ARCH_LIST__` alone is insufficient for JIT, as this
variable doesn't differentiate between compiling for say sm_90, sm_90a
or sm_90f (so forward-jittable PTX vs. arch/family-specific PTX).
Thus, one can have a bug when compiling with
`DCMAKE_CUDA_ARCHITECTURES="89;90a"`, where current code would wrongly
dispatch to PDL on sm_90/sm_120 in forward-JIT mode.
This PR fixes this issue by checking `cudaFuncAttributes::ptxVersion` of
the incoming kernel at runtime. A check on ptxVersion alone is
sufficient, as device-codes will always be >= ptxVersion (and any
violation of this would be a severe bug in CUDA/nvcc), see:
https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/#gpu-code-code-code
* Implement MurmurHash3 mixer for better hash distribution
Magic constants were taken from boost:
2698b43803/include/boost/container_hash/detail/hash_mix.hpp (L19-L65)
* Update ggml/src/ggml-cuda/common.cuh
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* Address review comments, make seed non-zero
* Apply code-formatting
* Replace std::size_t -> size_t for consistency
---------
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
When model props are fetched asynchronously from the server,
modelPropsVersion is incremented to trigger reactivity, but
only the vision effect was listening to it.
* mmvq Optim: add MMVQ_PARAMETERS_TURING(mmvq_parameter_table_id) for SM75 TURING
* avoid a mismatch for JIT compilation of Turing device code for Ampere or newer
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
---------
Co-authored-by: Copilot <copilot@github.com>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* misc(server): add default port to impl RAII
* misc(server): register_gcp_compat() can be const
* misc(server): use proper cpp const/auto methods
* misc(server): do not reset a unique_ptr, use make_unique instead to be exception safe
* hex-fa: clean up qf32/fp32 handling and stride handling
* hex-fa: fix corner case fp NAN issues that were cause bad output from gemma4 on v79
* hex-fa: vectorize leftover handling
* hex-fa: avoid HVX fallback during token gen HMX has more FP16 compute capacity
* hmx-mm: remove dead code
* hmx-mm: use fastdiv in x4x2 dequant
* hmx-mm: sandwich dequant and scatter to improve perf
* hmx-mm: fixed rebase conflicts
* hmx-mm: further improve weight dequant by doing early type dispatch and precomputing fastdiv
* hmx-mm: an even earlier dispatch for per-type dequant
* hmx-mm: dequant linear types like q4_0 and q4_1 without the LUTs
This is a bit faster than LUT.
* hex-cmake: one more tweak for lto
---------
Co-authored-by: Trivikram Reddy <tamarnat@qti.qualcomm.com>
* allow caching of ui elements in llama-server
* use fnv_hash
* Update tools/server/server-http.cpp
etag has to be set always
Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
---------
Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
* Updated vec.h/vec.cpp code to accumulate to F32 rather than F16
Change-Id: I0cb789347f2bf60ffaf9047319f727e788c825f8
Signed-off-by: Martin Klacer <martin.klacer@arm.com>
Co-authored-by: Milos Puzovic <Milos.Puzovic@arm.com>
* ci : separate CUDA windows workflow + fix names
* ci : rename workflow
* ci : prefix cache names with workflow name
* ci : rename build.yml -> build-cpu.yml
* ci : cache keys
* ci : fix windows cuda/hip concurrency of release workflow
* ci : fix apple cache names
* ci : add TODOs
* cont : keep just the last cache
* ci : update release concurrency to queue
* ci : move the release trigger to ubuntu-slim
* ci : hip add TODO
* cont : improve words
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* OP_GATED_DELTA_NET impl
* add back lanes_per_column declaration
* removed has_subgroup_arithmetic and has_subgroup_clustered_reduce
* removed trailing spaces and fixes indentation. Hard coded subgroup size for Adreno and Intel. Return not supported when K>1 state snapshot
* support for K>1 state snapshot
* removed picky indent multiple of 4 fixes
* removed return that won\'t be executed
* hex-mm: add support for Q4_1 matmul/matvec, hvx-only for now
* hmx-mm: add support for Q4_1
* hex-mm: use Q8_1 dynamic quantization to avoid having to compute sums in the vec_dot
* hexagon: fix repack scratch buffer overflow
* hex-mm: fix Q4_1 repack buffer sizing
* hexagon: flip the build order for mm and fa (seems to help LTO)
* hex-mm: add vec_dot 4x1s and minor HMX cleanup after adding Q4_1
* hex-mm: fix fp16 vec_dot fallback to 2x1 and another issue that could cause incorrect output
* hexagon: resurrect early-wake and add support for polling for op-batch completions
With Q4_1 ggml-hexagon now claims pretty much the entire graphs which gives the CPU more time to chilax.
This is a good thing! But it does add extra latency for the pure benchmark runs.
Early wakeup helps recover the latency a bit in the normals runs and op-batch polling is just for benchmarking.
---------
Co-authored-by: Todor Boinovski <todorb@qti.qualcomm.com>
* vulkan: Switch MUL_MAT_VEC to 4 K per iteration for F16/32
Against mesa git, this shows a 4.8% performance improvement for
tg128 on Qwen3.5-9B:BF16 on Intel BMG.
Note that this breaks some tests until the last commit which fixes
OOB A reads.
* vulkan: Use aligned loads in mul_mat_vec when available
Against mesa git, this shows a 3.3% performance improvement for
tg128 on Qwen3.5-9B:BF16 on Intel BMG.
* Make explicit that `num_rows` is <= `NUM_ROWS` in mul_mat_vec
Mesa's UUB logic can't see through conditionals, limiting its
ability to understand the bounds on the `num_rows` field in the
cleanup run. Making it explicit that `num_rows` is, indeed, always
<= `NUM_ROWS` helps mesa make slightly better codegen.
Against mesa git, this currently shows a 1% performance improvement
in tg128 on Qwen3.5-9B:BF16 on Intel BMG.
* vulkan: Fix OOB A reads in MUL_MAT_VEC for odd sizes
There was a TODO to fix the OOB reads from the A matrix which we do
here.
It is within performance noise (+<0.1%) in tg128 for
Qwen3.5-9B:BF16 on Intel BMG.
* feat: extend repeat op for vulkan
* feat: add repeat_f16 vulkan pipeline
* fix: ensure same dst and src types
* fix: use type_size instead of data types
* fix: use int16 and int32 for repeat shader op
* chore: rename repeat_f* to repeat_i*
* chore: rename repeat vulkan pipelines