Commit graph

12768 commits

Author SHA1 Message Date
Concedo
ae60ea0009 handle updated gemma templates 2026-04-14 00:29:58 +08:00
Concedo
3f810dc8c7 fixed preload story import for large stories 2026-04-13 23:27:55 +08:00
Concedo
c984147c84 fix quotes 2026-04-13 22:50:08 +08:00
Concedo
d9724a4caa kcpp musicgen - disable flash attention as it's not stable on Vulkan; due to optimizations it should still fit in 6GB in lowvram mode. 2026-04-12 18:28:30 +08:00
Concedo
7bf7b0aefc optimize lowvram for music 2026-04-12 18:17:08 +08:00
Concedo
5361b45fba Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	ggml/src/ggml-opencl/CMakeLists.txt
#	ggml/src/ggml-opencl/ggml-opencl.cpp
#	ggml/src/ggml-opencl/kernels/cvt.cl
#	requirements/requirements-tool_bench.txt
2026-04-12 16:22:26 +08:00
Concedo
5a3369fd2a support for gpt oss jinja 2026-04-12 16:13:51 +08:00
Concedo
4084917cab fixed token counting limit (+1 squashed commits)
Squashed commits:

[314528eb2] fixed token counting limit, set to max supported ctx of 256k
2026-04-12 15:36:03 +08:00
Concedo
f07dcbf7af allow tokencount to handle messages 2026-04-12 11:46:37 +08:00
Concedo
63ce8ae64e fix tool builds 2026-04-12 02:36:30 +08:00
Concedo
6556161804 jinja tool streaming is now finally working 2026-04-12 02:05:39 +08:00
Concedo
c4abba8868 almost working 2026-04-12 01:44:41 +08:00
Johannes Gäßler
ff5ef82786
CUDA: skip compilation of superfluous FA kernels (#21768)
2026-04-11 18:52:11 +02:00
Sirui He
073bb2c20b
mtmd : add MERaLiON-2 multimodal audio support (#21756)
* mtmd : add MERaLiON-2 multimodal audio support

Adds support for A*STAR's MERaLiON-2 audio-language model (3B and 10B)
to the multimodal framework.

Architecture:
- Whisper large-v2 encoder for audio feature extraction
- Gated MLP adaptor: ln_speech -> frame stack (x15) -> Linear+SiLU -> GLU -> out_proj
- Gemma2 3B / 27B decoder

The mmproj GGUF is generated via convert_hf_to_gguf.py --mmproj on the full
MERaLiON-2 model directory (architecture: MERaLiON2ForConditionalGeneration).
The decoder is converted separately as a standard Gemma2 model after stripping
the `text_decoder.` weight prefix.

New projector type: PROJECTOR_TYPE_MERALION

Supports tasks: speech transcription (EN/ZH/MS/TA), translation, spoken QA.

Model: https://huggingface.co/MERaLiON/MERaLiON-2-3B
       https://huggingface.co/MERaLiON/MERaLiON-2-10B

* simplify comments in meralion adaptor

* meralion: use format_tensor_name, ascii arrows in comments
2026-04-11 14:15:48 +02:00
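The adaptor pipeline in the entry above (ln_speech -> frame stack (x15) -> Linear+SiLU -> GLU -> out_proj) can be sketched roughly as follows. This is a hedged illustration only: the weight names, shapes, and exact gating split are assumptions for clarity, not the actual mtmd implementation.

    import numpy as np

    def layer_norm(x, eps=1e-6):
        # normalize over the feature dimension (the "ln_speech" step)
        mu = x.mean(axis=-1, keepdims=True)
        var = x.var(axis=-1, keepdims=True)
        return (x - mu) / np.sqrt(var + eps)

    def silu(x):
        return x / (1.0 + np.exp(-x))

    def meralion_adaptor(audio_feats, w_in, w_gate, w_out, stack=15):
        # audio_feats: (n_frames, d_enc) Whisper encoder output (assumed layout)
        x = layer_norm(audio_feats)
        n = (x.shape[0] // stack) * stack
        x = x[:n].reshape(-1, stack * x.shape[1])   # stack 15 frames into fewer, wider tokens
        h = silu(x @ w_in)                          # Linear + SiLU branch
        g = x @ w_gate                              # gating branch of the GLU (assumed split)
        return (h * g) @ w_out                      # GLU, then out_proj into the decoder width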
shaofeiqi
af1127d3c4
opencl: add basic support for q5_k (#21593)
* opencl: add general q5_k mv

* opencl: add flattened Q5_K mv and general Q5_K mm

* opencl: fix Q5_K unit tests
2026-04-11 01:46:19 -07:00
Concedo
d216aabfdc make sdui buttons more compact 2026-04-11 16:04:51 +08:00
Johannes Gäßler
865ff06b2f
TP: fix Qwen 3 Next data split (#21732) 2026-04-11 09:23:42 +02:00
Sigbjørn Skjæret
2b2cd57de6
ggml : fix a few instances of missing GGML_TYPE_Q1_0 cases (#21716) 2026-04-11 09:45:00 +03:00
Bartowski
660386f6f8
py : Bump typer to latest to fix huggingface_hub issue (#21701) 2026-04-11 09:44:15 +03:00
Concedo
3175da0873 cleanup - do not use tool calls from kai api, only 2026-04-11 12:19:48 +08:00
Concedo
4c860ae4ae Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	common/download.cpp
#	docs/backend/OPENVINO.md
#	docs/backend/snapdragon/CMakeUserPresets.json
#	docs/backend/snapdragon/README.md
#	ggml/src/ggml-hexagon/ggml-hexagon.cpp
#	ggml/src/ggml-hexagon/htp/act-ops.c
#	ggml/src/ggml-hexagon/htp/argsort-ops.c
#	ggml/src/ggml-hexagon/htp/binary-ops.c
#	ggml/src/ggml-hexagon/htp/cpy-ops.c
#	ggml/src/ggml-hexagon/htp/cumsum-ops.c
#	ggml/src/ggml-hexagon/htp/flash-attn-ops.c
#	ggml/src/ggml-hexagon/htp/get-rows-ops.c
#	ggml/src/ggml-hexagon/htp/hex-utils.h
#	ggml/src/ggml-hexagon/htp/hmx-matmul-ops.c
#	ggml/src/ggml-hexagon/htp/hmx-ops.h
#	ggml/src/ggml-hexagon/htp/htp-ctx.h
#	ggml/src/ggml-hexagon/htp/htp-ops.h
#	ggml/src/ggml-hexagon/htp/htp_iface.idl
#	ggml/src/ggml-hexagon/htp/main.c
#	ggml/src/ggml-hexagon/htp/matmul-ops.c
#	ggml/src/ggml-hexagon/htp/repeat-ops.c
#	ggml/src/ggml-hexagon/htp/rope-ops.c
#	ggml/src/ggml-hexagon/htp/set-rows-ops.c
#	ggml/src/ggml-hexagon/htp/softmax-ops.c
#	ggml/src/ggml-hexagon/htp/ssm-conv.c
#	ggml/src/ggml-hexagon/htp/sum-rows-ops.c
#	ggml/src/ggml-hexagon/htp/unary-ops.c
#	ggml/src/ggml-webgpu/ggml-webgpu-shader-lib.hpp
#	ggml/src/ggml-webgpu/ggml-webgpu.cpp
#	ggml/src/ggml-webgpu/wgsl-shaders/common_decls.tmpl
#	ggml/src/ggml-webgpu/wgsl-shaders/flash_attn.wgsl
#	ggml/src/ggml-webgpu/wgsl-shaders/get_rows.wgsl
#	ggml/src/ggml-webgpu/wgsl-shaders/mul_mat.wgsl
#	ggml/src/ggml-webgpu/wgsl-shaders/mul_mat_decls.tmpl
#	ggml/src/ggml-webgpu/wgsl-shaders/mul_mat_vec.wgsl
#	ggml/src/ggml-webgpu/wgsl-shaders/unary.wgsl
#	models/templates/google-gemma-4-31B-it-interleaved.jinja
#	models/templates/google-gemma-4-31B-it.jinja
#	scripts/snapdragon/adb/run-bench.sh
#	scripts/snapdragon/adb/run-cli.sh
#	scripts/snapdragon/adb/run-completion.sh
#	scripts/snapdragon/adb/run-tool.sh
#	scripts/snapdragon/windows/run-bench.ps1
#	scripts/snapdragon/windows/run-cli.ps1
#	scripts/snapdragon/windows/run-mtmd.ps1
#	scripts/snapdragon/windows/run-tool.ps1
#	tests/test-backend-ops.cpp
#	tests/test-chat.cpp
#	tools/llama-bench/llama-bench.cpp
2026-04-11 11:19:32 +08:00
Concedo
a165a73120 Merge commit 'd6f3030047' into concedo_experimental
# Conflicts:
#	examples/model-conversion/scripts/causal/run-casual-gen-embeddings-org.py
#	examples/model-conversion/scripts/utils/semantic_check.py
#	ggml/CMakeLists.txt
#	ggml/src/CMakeLists.txt
#	ggml/src/ggml-cann/ggml-cann.cpp
#	ggml/src/ggml-cpu/amx/amx.cpp
#	ggml/src/ggml-cuda/CMakeLists.txt
#	ggml/src/ggml-hexagon/ggml-hexagon.cpp
#	ggml/src/ggml-hip/CMakeLists.txt
#	ggml/src/ggml-opencl/ggml-opencl.cpp
#	ggml/src/ggml-openvino/ggml-openvino.cpp
#	ggml/src/ggml-rpc/ggml-rpc.cpp
#	ggml/src/ggml-sycl/ggml-sycl.cpp
#	ggml/src/ggml-virtgpu/ggml-backend-buffer.cpp
#	ggml/src/ggml-virtgpu/ggml-backend.cpp
#	ggml/src/ggml-webgpu/ggml-webgpu.cpp
#	ggml/src/ggml-zdnn/ggml-zdnn.cpp
#	ggml/src/ggml-zendnn/ggml-zendnn.cpp
#	pyproject.toml
#	requirements/requirements-convert_legacy_llama.txt
#	requirements/requirements-tool_bench.txt
#	src/llama-model.cpp
#	src/llama.cpp
#	tests/test-llama-archs.cpp
#	tests/test-tokenizer-0.py
#	tests/test-tokenizer-random.py
#	tools/llama-bench/llama-bench.cpp
#	tools/perplexity/perplexity.cpp
2026-04-11 11:10:55 +08:00
Aman Gupta
a29e4c0b7b
CUDA: also store node->src ne/nb for graph equality (#21736) 2026-04-11 10:30:30 +08:00
Concedo
8b90bfe094 Merge commit '4ef9301e4d' into concedo_experimental
# Conflicts:
#	.github/labeler.yml
#	docs/multimodal.md
#	embd_res/ggml-vocab-gemma-4.gguf
#	embd_res/ggml-vocab-gemma-4.gguf.inp
#	embd_res/ggml-vocab-gemma-4.gguf.out
#	ggml/src/ggml-sycl/fattn-tile.cpp
#	ggml/src/ggml-sycl/fattn-tile.hpp
#	ggml/src/ggml-sycl/fattn-vec.hpp
#	ggml/src/ggml-sycl/fattn.cpp
#	ggml/src/ggml-sycl/ggml-sycl.cpp
#	ggml/src/ggml-sycl/template-instances/fattn-vec-instance-f16-f16.cpp
#	ggml/src/ggml-sycl/template-instances/fattn-vec-instance-f16-q4_0.cpp
#	ggml/src/ggml-sycl/template-instances/fattn-vec-instance-f16-q4_1.cpp
#	ggml/src/ggml-sycl/template-instances/fattn-vec-instance-f16-q5_0.cpp
#	ggml/src/ggml-sycl/template-instances/fattn-vec-instance-f16-q5_1.cpp
#	ggml/src/ggml-sycl/template-instances/fattn-vec-instance-f16-q8_0.cpp
#	ggml/src/ggml-sycl/template-instances/fattn-vec-instance-q4_0-f16.cpp
#	ggml/src/ggml-sycl/template-instances/fattn-vec-instance-q4_0-q4_0.cpp
#	ggml/src/ggml-sycl/template-instances/fattn-vec-instance-q4_0-q4_1.cpp
#	ggml/src/ggml-sycl/template-instances/fattn-vec-instance-q4_0-q5_0.cpp
#	ggml/src/ggml-sycl/template-instances/fattn-vec-instance-q4_0-q5_1.cpp
#	ggml/src/ggml-sycl/template-instances/fattn-vec-instance-q4_0-q8_0.cpp
#	ggml/src/ggml-sycl/template-instances/fattn-vec-instance-q4_1-f16.cpp
#	ggml/src/ggml-sycl/template-instances/fattn-vec-instance-q4_1-q4_0.cpp
#	ggml/src/ggml-sycl/template-instances/fattn-vec-instance-q4_1-q4_1.cpp
#	ggml/src/ggml-sycl/template-instances/fattn-vec-instance-q4_1-q5_0.cpp
#	ggml/src/ggml-sycl/template-instances/fattn-vec-instance-q4_1-q5_1.cpp
#	ggml/src/ggml-sycl/template-instances/fattn-vec-instance-q4_1-q8_0.cpp
#	ggml/src/ggml-sycl/template-instances/fattn-vec-instance-q5_0-f16.cpp
#	ggml/src/ggml-sycl/template-instances/fattn-vec-instance-q5_0-q4_0.cpp
#	ggml/src/ggml-sycl/template-instances/fattn-vec-instance-q5_0-q4_1.cpp
#	ggml/src/ggml-sycl/template-instances/fattn-vec-instance-q5_0-q5_0.cpp
#	ggml/src/ggml-sycl/template-instances/fattn-vec-instance-q5_0-q5_1.cpp
#	ggml/src/ggml-sycl/template-instances/fattn-vec-instance-q5_0-q8_0.cpp
#	ggml/src/ggml-sycl/template-instances/fattn-vec-instance-q5_1-f16.cpp
#	ggml/src/ggml-sycl/template-instances/fattn-vec-instance-q5_1-q4_0.cpp
#	ggml/src/ggml-sycl/template-instances/fattn-vec-instance-q5_1-q4_1.cpp
#	ggml/src/ggml-sycl/template-instances/fattn-vec-instance-q5_1-q5_0.cpp
#	ggml/src/ggml-sycl/template-instances/fattn-vec-instance-q5_1-q5_1.cpp
#	ggml/src/ggml-sycl/template-instances/fattn-vec-instance-q5_1-q8_0.cpp
#	ggml/src/ggml-sycl/template-instances/fattn-vec-instance-q8_0-f16.cpp
#	ggml/src/ggml-sycl/template-instances/fattn-vec-instance-q8_0-q4_0.cpp
#	ggml/src/ggml-sycl/template-instances/fattn-vec-instance-q8_0-q4_1.cpp
#	ggml/src/ggml-sycl/template-instances/fattn-vec-instance-q8_0-q5_0.cpp
#	ggml/src/ggml-sycl/template-instances/fattn-vec-instance-q8_0-q5_1.cpp
#	ggml/src/ggml-sycl/template-instances/fattn-vec-instance-q8_0-q8_0.cpp
#	tests/CMakeLists.txt
#	tests/test-jinja.cpp
#	tools/mtmd/CMakeLists.txt
2026-04-11 09:38:50 +08:00
Wagner Bruna
f4fbd94129
sdapi: add job_timestamp field to info result (#2110) 2026-04-11 09:28:53 +08:00
Concedo
5a3c92365a slightly increase max splits 2026-04-11 09:27:58 +08:00
Galunid
b136b62cf9
fix: Fix broken structured output when using $refs in json_schema (#21699)
2026-04-10 18:26:36 -05:00
Todor Boinovski
81069a808a
hexagon: add support for linux on snapdragon (#21707)
* hexagon: add support for debian on ex2

* hexagon: add -fvectorize to c/c++ cmake flags

* hexagon: remove trailing white space

* update onboarding steps

* hexagon: update linux setup documentation

* hexagon: update installation scripts

* Hexagon: update docs

* hexagon: update onboarding scripts

---------

Co-authored-by: Zack Li <zackli@qti.qualcomm.com>
2026-04-10 15:57:23 -07:00
Max Krasnyansky
9aa2807769
hexagon: improved Op queuing, buffer and cache management (#21705)
* hexagon: introduce op request batching and rewrite buffer management

The host now prepares batches of requests and dispatches them via a single dspqueue message.

Buffers are mapped explicitly by the NPU while processing batches.

* hex-dma: disable l2 bypass to work around a new issue caused by the lack of flushes between Ops

* hex-utils: add explicit l2flush and l2clear helpers

* hex-opreq: use fine-grain per tensor l2 management

* hex-opreq: avoid redundant invalidates for tensors we already flushed

* hex-opreq: update debug messages

* htp-opreq: reuse ops_context

* hex-opreq: do not flush or invalidate cache lines beyond buffer boundary

* hex-opreq: fix errors in log message

* Revert "hex-opreq: do not flush or invalidate cache lines beyond buffer boundry"

This reverts commit 8b7f0a55a750a6430ce4eb1874c7feb3d720056d.

* hexagon: limit l2 flushes to 1MB which covers l2 cache

* hex-opreq: limit cache flush to 4MB

Looks like 4MB of contiguous virtual space should cover the 1MB cache.

* hexagon: drop cache flush size to 2MB

* hex-opreq: start reworking opreq packing

* hex-opreq: introduce new way of packing opbatch where tensors are stored separately

* hex-opreq: add a simple fastrpc call to force unmap all buffers

* hex-l2flush: somehow 2MB does not seem robust; also clean up step size to use line-size

* hex-opreq: bump opreq batch size to 256

* hex-mm: place src1 spad at the top of vtcm for easy reuse

* hex-ops: introduce internal types and disable src1 reuse for now

Nothing new, just formalizing the repack / qyn.quant types we've been using.

* htp-opreq: use tensor pointers instead of copies

* hex-opreq: introduce a more robust way of tracking vtcm/spad reuse

This removes the SKIP_QUANTIZE flag that became fragile with the addition of HMX and other ops.

* hex-cumsum: fix error post opreq merge

* hex-opreq: move request batch handling into the session

Prepping everything for using dspqueue buffers; doing that inside the session is much cleaner.

* hex-mm: yet another fix for src1 reuse when we're mixing hmx/hvx

* hex-bufs: introduce pinned mmappings and use non-pinned ones for model buffers

* hex-buf: add support for allocating shared/pinned buffer for opreqs

* hex-opbatch: make opbatches configurable

* hex-naming: better name for ggml_hexagon_shared_buffer

* hex-naming: add session->c_name() helper

* hex-opbatch: start using shm but still copy for now

* hex-opbatch: use shared buffer for packing opbatch

* hex-opbatch: better naming for opbatch-related classes and code

* hex-opbatch: reuse batched tensors with same data/dims/strides

* hex-opbatch: update logging

* hex-opbatch: add support for vmem limit for op batching

* hex-opbatch: update htp side to properly support dynamic mmap/unmap

* hex-opbatch: add OB and OQ params for run-completion script and fix the asserts in batch processing

* hex-opbatch: fixed src1 handling in act ops

* hex-act: fix empty src1 handling in swiglu and friends

Simplify preamble macro while at it

* hex-mm: minor fix for vtcm and dma handling in matmul

cleaning up some left-overs from merges

* hex-opbatch: allocate extra 1KB for dspqueue overhead

* hexagon: fix softmax for non-aligned tensors and cleanup vtcm alloc

* hex-mm: properly handle hmx_disabled flag

* hex-ops: update comments

* hex-ops: add debug output for get/set-rows

* hex-mmap: optimize un/mapping of buffers

* hex-opreq: global cache flush and invalidate beyond 128KB threshold

* hex-ops: add super simple opfilter regex for debugging

If an Op matches the regex, the hex backend will reject it.

* hex-opbatch: wire up newer ops missed in merge and update main switch to detect this in the future

* hexagon: improved vtcm acquisition to remove inter-op overhead

Fully compatible with QNN-HTP coex

* hex-mm: fixed hvx fallback path

* hex-mm: lower the vmem threshold a bit further to ~3GB

* hexagon: update debug & error logs

This also fixes an issue with newer llvm merging repack and non-repack
functions. We use those pointers to distinguish between buffer types.

* hexagon: move ops context into main context

Just a cleanup. We don't need separate contexts at this point.

* hex-opbatch: cleanup naming and headers for opbatch and related descriptors

* hex-fa: it's now better to enable FA during TG to reduce graph splits

* hexagon: remove GGML_HEXAGON_EXPERIMENTAL env var

It's no longer useful. Please use the more flexible GGML_HEXAGON_OPFILTER to disable Ops
if needed for debugging or validation.

* hexagon: fixed editorconfig check

* Update ggml/src/ggml-hexagon/ggml-hexagon.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Trivikram Reddy <tamarnat@qti.qualcomm.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-04-10 15:47:43 -07:00
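The batching scheme this entry describes (the host packs many op requests and dispatches the whole batch in a single dspqueue message, with the batch size bumped to 256) could look roughly like the sketch below. All names here (OpDesc, OpBatch, send_message) are hypothetical; the real code lives in ggml-hexagon and differs in detail.

    from dataclasses import dataclass, field

    @dataclass
    class OpDesc:
        op: str          # e.g. "mul_mat"
        tensors: tuple   # references to buffers already mapped on the NPU

    @dataclass
    class OpBatch:
        max_ops: int = 256                        # batch size bumped to 256 in this change
        ops: list = field(default_factory=list)

        def add(self, desc, send_message):
            self.ops.append(desc)
            if len(self.ops) >= self.max_ops:
                self.flush(send_message)

        def flush(self, send_message):
            # one queue message carries the whole batch instead of one message per op
            if self.ops:
                send_message(self.ops)
                self.ops = []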
Aldehir Rojas
3fc65063d9
common : better align to the updated official gemma4 template (#21704) 2026-04-10 16:12:53 -05:00
Adrien Gallouët
05b3caaa48
common : add callback interface for download progress (#21735)
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2026-04-10 22:17:00 +02:00
MoonRide303
e62fa13c24
model : make Gemma 4 shared-KV tail attn_k tensors optional on load (#21739) 2026-04-10 21:45:50 +02:00
Rithik Sharma
bfd1f453cb
ggml-webgpu: support non-square subgroup matrix configs for Intel GPUs (#21669) 2026-04-10 10:52:38 -07:00
Chen Yuan
e4fed9d08d
ggml-webgpu: address quantization precision and backend lifecycle management (#21521)
* ggml(webgpu): fix the busy-polls in Emscripten in waitAny after #20618, and remove the busy webgpu log

* Merge with upstream

* Fix GET_ROWS packed integer NaN when using f16 as memory buffer in shader quants

* Update Unary wgsl EXP and EXPM1 for f16 stability

* Fix GET_ROWS IQ4_XS struct for NaN f16 canonicalization

* Fix numerical precision for unary sqrt when working with f16

* Fix NaN canonicalization for packed integers using f16

* Update err threshold for binary div ops when using f16

* backend: Keep one Dawn/WebGPU instance alive for the lifetime of the static backend

* clean: uncomment existing code logs

* clean: clean up the unnecessary debug info

* Refactor and generalize dequant helpers

* Remove deprecated quant structs

* Refactor shader defines to reduce repetition

* Remove error override for F16 type

* fix: fix the accidental removal of the proper initialization of ctx

* clean: clean legacy and format code

* fix: did not modify test ops

---------

Co-authored-by: Jeremy J. Hartmann <jeremy@mtion.tv>
2026-04-10 10:52:01 -07:00
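Several of the fixes above deal with storing packed integer (quantized) data through an f16 path: some integer bit patterns decode as f16 NaNs, and a NaN can later be rewritten to a canonical form, corrupting the packed payload. A small numpy illustration of the hazard (not of the WGSL shaders themselves):

    import numpy as np

    # arbitrary packed integer payloads, reinterpreted as f16 storage
    raw = np.array([0x1234, 0x7C0F, 0xFE00], dtype=np.uint16)
    as_f16 = raw.view(np.float16)

    # payloads whose exponent bits are all ones decode as NaN and risk being
    # canonicalized (payload bits lost) by later f16 conversions or arithmetic
    print(np.isnan(as_f16))   # [False  True  True]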
Adrien Gallouët
5dd102539b
server : ignore --alias when using --models-preset (#21380)
I'm not sure what the purpose of keeping `--alias` was when using
`--models-preset`, but the result is really weird, as shown in the
following logs:

    $ build/bin/llama-server --models-preset preset.ini --alias "Gemma 4 E4B UD Q8_K_XL"
    ...
    init: using 31 threads for HTTP server
    srv   load_models: Loaded 2 cached model presets
    srv   load_models: Loaded 1 custom model presets from preset.ini
    main: failed to initialize router models: alias 'Gemma 4 E4B UD Q8_K_XL' for model 'angt/test-split-model-stories260K:F32' conflicts with existing model name

So I propose to simply ignore `--alias` too in this case. With this
commit, the server starts in routing mode correctly.

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2026-04-10 17:42:56 +02:00
Concedo
919a010ebc support outro but don't actually use it yet 2026-04-10 23:18:30 +08:00
Adrien Gallouët
fb38d6f278
common : fix loading cached HF models when the API is unavailable (#21670)
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2026-04-10 16:37:46 +02:00
Concedo
9b95ade6cc updated lite 2026-04-10 22:21:05 +08:00
Concedo
0f278d93b3 better image handling in jinja 2026-04-10 22:19:01 +08:00
Concedo
b962335c99 fix rosie toolcall issue 2026-04-10 21:23:33 +08:00
Concedo
bcf499e5bf fix gemma tool calling 2026-04-10 20:51:40 +08:00
Johannes Gäßler
0893f50f2d
common: mark --split-mode tensor as experimental (#21684) 2026-04-10 12:27:27 +02:00
Concedo
ffdc3ba49e tool coercion fixes 2026-04-10 18:25:07 +08:00
Aleksander Grygier
f989a6e39e
webui: Static build output improvements (#21667)
* refactor: Build improvements

* chore: Formatting + package lock update
2026-04-10 11:49:47 +02:00
Berk Idem
d7ff074c87
common : enable reasoning budget sampler for gemma4 (#21697)
* fix: enable reasoning budget sampler for gemma4

Add thinking_start_tag and thinking_end_tag to
common_chat_params_init_gemma4(). Without these, the reasoning
budget sampler never activates for gemma4.

Make the newline after "thought" optional in the PEG parser to
handle budget=0 (sampler forces end tag before the newline).

Add test case for empty thinking block.

Fixes #21487

* use p.space() instead of p.optional(p.literal("\n")) in gemma4 thought parser
2026-04-10 11:49:14 +02:00
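The parser tweak described above (making the newline after the opening thought tag optional, so a budget of 0, where the sampler forces the end tag immediately, still parses) can be approximated with a regex as below. This is only a hedged sketch: the real code uses a PEG parser and the tag strings here are placeholders.

    import re

    # "\n?" makes the newline after the opening tag optional, so the empty
    # thinking block produced at budget=0 still matches
    THINK = re.compile(r"<thought>\n?(.*?)</thought>", re.S)

    for msg in ["<thought>\nsome reasoning\n</thought>answer",
                "<thought></thought>answer"]:
        m = THINK.search(msg)
        print(repr(m.group(1)) if m else "no match")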
Belem Zhang
3f8752b559
docs : fix broken link to ggml-openvino in OPENVINO.md (#21709) 2026-04-10 09:50:08 +02:00
Jeff Bolz
7b69125331
vulkan: Support Q1_0 (#21539)
* vulkan: Support Q1_0

* use get_dm
2026-04-10 08:35:27 +02:00
Adrien Gallouët
e095a482a0
common : add fluidity to the progress bar (#21671)
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2026-04-10 08:24:53 +02:00
Aman Gupta
e34f042154
CUDA: fuse muls (#21665) 2026-04-10 10:24:09 +08:00
andyluo7
d132f22fc9
HIP: add CDNA4 (gfx950) architecture support for MI350X/MI355X (#21570)
Add AMD Instinct MI350X/MI355X (gfx950, CDNA4) support:

- vendors/hip.h: Add CDNA4 preprocessor define for __gfx950__
- common.cuh: Add GGML_CUDA_CC_CDNA4 and GGML_CUDA_CC_IS_CDNA4 macros
- mma.cuh: Route CDNA4 to compatible MFMA instructions:
  * f32 matmul: mfma_f32_16x16x4f32 (xf32 variant unavailable on gfx950)
  * bf16 matmul: mfma_f32_16x16x16bf16_1k (same as CDNA3)
  * int8 matmul: mfma_i32_16x16x32_i8/32x32x16 (same as CDNA3)
- mmq.cuh: Include CDNA4 in stream-k kernel dispatch

CDNA4 is largely compatible with CDNA3 except:
- No xf32 MFMA (mfma_f32_16x16x8_xf32) — routes to f32 path
- Different FP8 format (e4m3fn vs e4m3_fnuz) — not changed here

Tested on AMD Instinct MI355X (gfx950), ROCm 7.0.1:
- Build: compiles cleanly with -DAMDGPU_TARGETS=gfx950
- llama-bench (Qwen2.5-1.5B Q4_K_M, single GPU):
  * f16+FA: 40,013 tok/s prefill, 254 tok/s decode
  * q8_0+FA: functional
- Flash attention: works correctly
- MMQ: works correctly with stream-k dispatch

Co-authored-by: Andy Luo <andyluo7@users.noreply.github.com>
2026-04-09 21:13:32 +02:00