When model props are fetched asynchronously from the server,
modelPropsVersion is incremented to trigger reactivity, but
only the vision effect was listening to it.
* mmvq Optim: add MMVQ_PARAMETERS_TURING(mmvq_parameter_table_id) for SM75 TURING
* avoid a mismatch for JIT compilation of Turing device code for Ampere or newer
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
---------
Co-authored-by: Copilot <copilot@github.com>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* misc(server): add default port to impl RAII
* misc(server): register_gcp_compat() can be const
* misc(server): use proper cpp const/auto methods
* misc(server): do not reset a unique_ptr, use make_unique instead to be exception safe
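
A minimal sketch of the `make_unique` pattern named in the last item. The types `http_impl` and `http_context` and the port value 8080 are hypothetical stand-ins and are not taken from the server code. With `make_unique`, allocation and ownership stay in one expression with no raw `new`, and the member is only overwritten once construction has succeeded.

```cpp
#include <cassert>
#include <memory>
#include <stdexcept>

// Hypothetical stand-in for the server's implementation object.
struct http_impl {
    int port;
    explicit http_impl(int p) : port(p) {
        if (p <= 0) {
            throw std::invalid_argument("invalid port");
        }
    }
};

// Hypothetical owner that holds the implementation through a unique_ptr.
struct http_context {
    std::unique_ptr<http_impl> impl;

    // Construct the new object first, then move-assign it into the member.
    // If the constructor throws, `impl` keeps whatever it held before.
    void init(int port) {
        impl = std::make_unique<http_impl>(port);
    }
};

// Returns true when a failed re-initialisation leaves the old object intact.
bool failed_init_keeps_old_impl() {
    http_context ctx;
    ctx.init(8080); // assumed example port
    assert(ctx.impl && ctx.impl->port == 8080);
    try {
        ctx.init(-1); // constructor throws
    } catch (const std::invalid_argument &) {
        // swallowed on purpose: we only care about the state afterwards
    }
    return ctx.impl != nullptr && ctx.impl->port == 8080;
}
```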
* hex-fa: clean up qf32/fp32 handling and stride handling
* hex-fa: fix corner-case fp NaN issues that were causing bad output from gemma4 on v79
* hex-fa: vectorize leftover handling
* hex-fa: avoid HVX fallback during token gen (HMX has more FP16 compute capacity)
* hmx-mm: remove dead code
* hmx-mm: use fastdiv in x4x2 dequant
* hmx-mm: sandwich dequant and scatter to improve perf
* hmx-mm: fixed rebase conflicts
* hmx-mm: further improve weight dequant by doing early type dispatch and precomputing fastdiv
* hmx-mm: an even earlier dispatch for per-type dequant
* hmx-mm: dequant linear types like q4_0 and q4_1 without the LUTs
This is a bit faster than LUT.
* hex-cmake: one more tweak for lto
---------
Co-authored-by: Trivikram Reddy <tamarnat@qti.qualcomm.com>
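
The hmx-mm items above mention precomputing fastdiv values. The actual Hexagon code is not shown here, so the following is only a generic sketch of the idea (division by an invariant divisor replaced by a multiply-high and a shift, in the Granlund-Montgomery style). The names `FastDiv`, `make_fastdiv` and `fastdiv` are assumptions for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Precomputed constants for dividing by a fixed 32-bit divisor.
struct FastDiv {
    uint32_t mp;    // magic multiplier
    uint32_t shift; // ceil(log2(d))
};

// Done once per divisor, outside the hot loop.
FastDiv make_fastdiv(uint32_t d) {
    assert(d != 0);
    uint32_t shift = 0;
    while (shift < 32 && (uint64_t{1} << shift) < d) {
        ++shift;
    }
    const uint64_t mp = ((uint64_t{1} << 32) * ((uint64_t{1} << shift) - d)) / d + 1;
    return FastDiv{static_cast<uint32_t>(mp), shift};
}

// n / d using only a multiply, an add and a shift.
// The sum is kept in 64 bits so that hi + n cannot overflow.
uint32_t fastdiv(uint32_t n, FastDiv fd) {
    const uint64_t hi = (static_cast<uint64_t>(n) * fd.mp) >> 32;
    return static_cast<uint32_t>((hi + n) >> fd.shift);
}

// n % d derived from the fast quotient.
uint32_t fastmod(uint32_t n, uint32_t d, FastDiv fd) {
    return n - fastdiv(n, fd) * d;
}
```

The constants depend only on the divisor, which is why computing them early (per type or per tensor shape) moves the cost of the division out of the per-element dequant loop.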
* allow caching of ui elements in llama-server
* use fnv_hash
* Update tools/server/server-http.cpp
etag has to be set always
Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
---------
Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
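
As an illustration of the caching items above, here is a minimal sketch of an ETag derived from a 64-bit FNV-1a hash of the response body. The function names (`fnv1a_64`, `make_etag`, `is_not_modified`) are hypothetical and are not the ones in `tools/server/server-http.cpp`; only the FNV-1a constants are standard.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <string>
#include <string_view>

// 64-bit FNV-1a: XOR each byte into the hash, then multiply by the FNV prime.
uint64_t fnv1a_64(std::string_view data) {
    uint64_t hash = 0xcbf29ce484222325ULL; // FNV-1a 64-bit offset basis
    for (unsigned char c : data) {
        hash ^= c;
        hash *= 0x100000001b3ULL;          // FNV 64-bit prime
    }
    return hash;
}

// Format the hash as a quoted, fixed-width hex string suitable for an ETag header.
std::string make_etag(std::string_view body) {
    char buf[19]; // quote + 16 hex digits + quote + NUL
    const int n = std::snprintf(buf, sizeof(buf), "\"%016llx\"",
                                static_cast<unsigned long long>(fnv1a_64(body)));
    assert(n == 18);
    (void) n;
    return std::string(buf);
}

// True when the client's If-None-Match value matches the current body,
// i.e. the server could answer 304 Not Modified.
bool is_not_modified(std::string_view if_none_match, std::string_view body) {
    return !if_none_match.empty() && if_none_match == make_etag(body);
}
```

In line with the "etag has to be set always" note, a handler built on this would attach the ETag header to the response whether or not the body is sent.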
* Updated vec.h/vec.cpp code to accumulate to F32 rather than F16
Change-Id: I0cb789347f2bf60ffaf9047319f727e788c825f8
Signed-off-by: Martin Klacer <martin.klacer@arm.com>
Co-authored-by: Milos Puzovic <Milos.Puzovic@arm.com>
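
To show why the accumulator type matters, the sketch below emulates F16 precision in portable C++ by rounding to 11 significant bits. This is an assumption-laden model (it ignores F16 range limits and subnormals) and is not the `vec.h`/`vec.cpp` code; the helper names are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Round x to 11 significant bits (the precision of IEEE half), ties to even.
// Emulation only: exponent range and subnormals are not modelled.
float round_to_f16_precision(float x) {
    if (x == 0.0f || !std::isfinite(x)) {
        return x;
    }
    int exp = 0;
    const float m    = std::frexp(x, &exp);               // x = m * 2^exp, 0.5 <= |m| < 1
    const float kept = std::nearbyint(std::ldexp(m, 11)); // keep 11 bits of mantissa
    return std::ldexp(kept, exp - 11);
}

// Accumulator is rounded to F16 precision after every addition.
float sum_f16_accum(const float * x, std::size_t n) {
    assert(n == 0 || x != nullptr);
    float acc = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        acc = round_to_f16_precision(acc + round_to_f16_precision(x[i]));
    }
    return acc;
}

// Inputs are still F16-precision values, but the running sum stays in F32.
float sum_f32_accum(const float * x, std::size_t n) {
    assert(n == 0 || x != nullptr);
    float acc = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        acc += round_to_f16_precision(x[i]);
    }
    return acc;
}
```

Summing 4096 ones with the emulated F16 accumulator stalls at 2048, because 2049 is not representable with 11 significant bits and rounds back to 2048. The F32 accumulator returns 4096.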
* ci : separate CUDA windows workflow + fix names
* ci : rename workflow
* ci : prefix cache names with workflow name
* ci : rename build.yml -> build-cpu.yml
* ci : cache keys
* ci : fix windows cuda/hip concurrency of release workflow
* ci : fix apple cache names
* ci : add TODOs
* cont : keep just the last cache
* ci : update release concurrency to queue
* ci : move the release trigger to ubuntu-slim
* ci : hip add TODO
* cont : improve words
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* OP_GATED_DELTA_NET impl
* add back lanes_per_column declaration
* removed has_subgroup_arithmetic and has_subgroup_clustered_reduce
* removed trailing spaces and fixed indentation. Hard-coded subgroup size for Adreno and Intel. Return not supported when K>1 state snapshot
* support for K>1 state snapshot
* removed picky indent multiple of 4 fixes
* removed return that won't be executed
* hex-mm: add support for Q4_1 matmul/matvec, hvx-only for now
* hmx-mm: add support for Q4_1
* hex-mm: use Q8_1 dynamic quantization to avoid having to compute sums in the vec_dot
* hexagon: fix repack scratch buffer overflow
* hex-mm: fix Q4_1 repack buffer sizing
* hexagon: flip the build order for mm and fa (seems to help LTO)
* hex-mm: add vec_dot 4x1s and minor HMX cleanup after adding Q4_1
* hex-mm: fix fp16 vec_dot fallback to 2x1 and another issue that could cause incorrect output
* hexagon: resurrect early-wake and add support for polling for op-batch completions
With Q4_1, ggml-hexagon now claims pretty much the entire graph, which gives the CPU more time to chillax.
This is a good thing! But it does add extra latency for the pure benchmark runs.
Early wakeup helps recover the latency a bit in normal runs; op-batch polling is just for benchmarking.
---------
Co-authored-by: Todor Boinovski <todorb@qti.qualcomm.com>
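
The Q8_1 item above relies on an algebraic identity: if a Q4_1 value is `d4 * q + m4` and a Q8_1 value is `d8 * p`, then the block dot product is `d4 * d8 * sum(q * p) + m4 * (d8 * sum(p))`, and the last factor can be stored once at quantization time. The sketch below is a simplified scalar model (float scales, unpacked nibbles, hypothetical `_ref` struct names), not the HVX/HMX kernels.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

constexpr int QK = 32; // elements per block

// Simplified Q4_1 block: value[i] = d * q[i] + m, with q[i] in [0, 15].
struct block_q4_1_ref { float d; float m; uint8_t q[QK]; };

// Simplified Q8_1 block: value[i] = d * q[i]; s caches d * sum(q).
struct block_q8_1_ref { float d; float s; int8_t q[QK]; };

// Quantize 32 floats to Q8_1 and precompute the scaled sum.
block_q8_1_ref quantize_q8_1_ref(const float * x) {
    assert(x != nullptr);
    float amax = 0.0f;
    for (int i = 0; i < QK; ++i) {
        amax = std::fmax(amax, std::fabs(x[i]));
    }
    block_q8_1_ref out{};
    out.d = amax / 127.0f;
    const float id = out.d > 0.0f ? 1.0f / out.d : 0.0f;
    int sum = 0;
    for (int i = 0; i < QK; ++i) {
        out.q[i] = static_cast<int8_t>(std::lround(x[i] * id));
        sum += out.q[i];
    }
    out.s = out.d * static_cast<float>(sum);
    return out;
}

// Dot product using the cached sum: one integer loop, no per-call sum of b.q.
float vec_dot_q4_1_q8_1_ref(const block_q4_1_ref & a, const block_q8_1_ref & b) {
    int isum = 0;
    for (int i = 0; i < QK; ++i) {
        isum += static_cast<int>(a.q[i]) * static_cast<int>(b.q[i]);
    }
    return a.d * b.d * static_cast<float>(isum) + a.m * b.s;
}

// Straightforward reference: dequantize both sides and multiply.
float vec_dot_dequant_ref(const block_q4_1_ref & a, const block_q8_1_ref & b) {
    float sum = 0.0f;
    for (int i = 0; i < QK; ++i) {
        sum += (a.d * a.q[i] + a.m) * (b.d * b.q[i]);
    }
    return sum;
}
```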
* vulkan: Switch MUL_MAT_VEC to 4 K per iteration for F16/32
Against mesa git, this shows a 4.8% performance improvement for
tg128 on Qwen3.5-9B:BF16 on Intel BMG.
Note that this breaks some tests until the last commit which fixes
OOB A reads.
* vulkan: Use aligned loads in mul_mat_vec when available
Against mesa git, this shows a 3.3% performance improvement for
tg128 on Qwen3.5-9B:BF16 on Intel BMG.
* Make explicit that `num_rows` is <= `NUM_ROWS` in mul_mat_vec
Mesa's UUB logic can't see through conditionals, limiting its
ability to understand the bounds on the `num_rows` field in the
cleanup run. Making it explicit that `num_rows` is, indeed, always
<= `NUM_ROWS` helps Mesa generate slightly better code.
Against mesa git, this currently shows a 1% performance improvement
in tg128 on Qwen3.5-9B:BF16 on Intel BMG.
* vulkan: Fix OOB A reads in MUL_MAT_VEC for odd sizes
There was a TODO to fix the OOB reads from the A matrix which we do
here.
It is within performance noise (+<0.1%) in tg128 for
Qwen3.5-9B:BF16 on Intel BMG.
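
The interaction between "4 K per iteration" and odd sizes can be sketched on the CPU as follows. This is not the Vulkan shader and does not claim to be how the fix was done; it only shows one common shape for it, where the unrolled loop is bounded to a multiple of 4 and a scalar tail covers the remainder so nothing past `k` is read. The function name `dot_k4` is hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Dot product that consumes 4 elements of K per iteration, with a scalar tail.
float dot_k4(const float * a, const float * b, std::size_t k) {
    assert(k == 0 || (a != nullptr && b != nullptr));
    float acc[4] = {0.0f, 0.0f, 0.0f, 0.0f};

    // Largest multiple of 4 that does not exceed k: the unrolled loop stops here.
    const std::size_t k4 = k - (k % 4);

    std::size_t i = 0;
    for (; i < k4; i += 4) {
        for (std::size_t j = 0; j < 4; ++j) {
            acc[j] += a[i + j] * b[i + j];
        }
    }

    float sum = acc[0] + acc[1] + acc[2] + acc[3];

    // Tail: at most 3 leftover elements, read one at a time.
    for (; i < k; ++i) {
        sum += a[i] * b[i];
    }
    return sum;
}
```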
* feat: extend repeat op for vulkan
* feat: add repeat_f16 vulkan pipeline
* fix: ensure same dst and src types
* fix: use type_size instead of data types
* fix: use int16 and int32 for repeat shader op
* chore: rename repeat_f* to repeat_i*
* chore: rename repeat vulkan pipelines
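
The repeat items above move from dispatching on data type to dispatching on element size, so that any 2-byte or 4-byte type shares one path. A CPU-side sketch of that idea, with hypothetical function names and a 1-D repeat only, might look like this:

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Copy n_dst elements of N bytes each, cycling through the n_src source elements.
// Elements are treated as opaque bytes, so the same code serves f16/i16 or f32/i32.
template <std::size_t N>
void repeat_elems(const unsigned char * src, std::size_t n_src,
                  unsigned char * dst, std::size_t n_dst) {
    for (std::size_t i = 0; i < n_dst; ++i) {
        std::memcpy(dst + i * N, src + (i % n_src) * N, N);
    }
}

// Dispatch on element size rather than on the tensor's data type.
// Returns false for sizes this sketch does not handle.
bool repeat_by_type_size(const void * src, std::size_t n_src,
                         void * dst, std::size_t n_dst, std::size_t type_size) {
    assert(n_src > 0);
    const unsigned char * s = static_cast<const unsigned char *>(src);
    unsigned char *       d = static_cast<unsigned char *>(dst);
    switch (type_size) {
        case 2: repeat_elems<2>(s, n_src, d, n_dst); return true; // 16-bit path
        case 4: repeat_elems<4>(s, n_src, d, n_dst); return true; // 32-bit path
        default: return false;
    }
}
```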
* ci : server windows set build type explicitly
* cont : try windows-2025
* ci : use llvm
* cont : use ninja
* cont : fix shell
* ci : set number of jobs correctly
* ci : fix windows with vulkan ccache by using llvm
* ci : server ccache only on master
* ocd : fix job names
[no release]
* ci : fix undefined sanitizer build to use Debug build type only
* ci : ccache the server builds
* cont : remove ui dependency + reuse ccache for both ubuntu jobs
* tmp : force ccache save
* Revert "tmp : force ccache save"
This reverts commit a857b03a10b1304d456129a017e0e46b185618ee.
* cont : no need for node.js