Concedo
46cd17c17e
Merge commit ' 88d23ad515' into concedo_experimental
...
# Conflicts:
# CODEOWNERS
# docs/build.md
# ggml/CMakeLists.txt
# ggml/src/CMakeLists.txt
# ggml/src/ggml-webgpu/ggml-webgpu.cpp
# ggml/src/ggml-zendnn/CMakeLists.txt
# tests/test-chat-template.cpp
2026-01-29 22:25:56 +08:00
Concedo
cd6e087eeb
include kde in the fractional scaling fix
2026-01-29 22:06:21 +08:00
Wagner Bruna
1f01d54848
sd: sync to master-487-43e829f ( #1947 )
2026-01-29 21:37:30 +08:00
Rose
deee1c2cfc
fixed mcp stdio server tool listing ( #1950 )
2026-01-29 21:35:35 +08:00
Concedo
226c79338f
handle glm4.7 flash template
2026-01-28 23:29:08 +08:00
Concedo
ef7fe1b5d4
make flash attention default in cli. added --noflashattention
2026-01-28 23:28:48 +08:00
Oleksandr Kuvshynov
88d23ad515
vulkan: handle device dedup on MacOS + Vega II Duo cards ( #19058 )
...
Deduplication here relied on Vulkan returning a unique
UUID for each physical GPU, which is currently not always the case.
On a Mac Pro 2019 running macOS, with 2 Vega II Duo cards (so 4 GPUs total),
MoltenVK assigns the same UUID to pairs of GPUs, unless they
are connected with Infinity Fabric.
See more details here: KhronosGroup/MoltenVK#2683 .
The right fix belongs in MoltenVK, but until it lands,
llama.cpp would only recognize 2 of the 4 GPUs in such a configuration.
The deduplication logic here is changed to only filter GPUs whose UUID is
the same but whose driver is different.
2026-01-28 12:35:54 +01:00
Ben Chen
0a95026da9
doc: add build instruction to use Vulkan backend on macos ( #19029 )
2026-01-28 12:30:16 +01:00
Concedo
d877ec7aec
updated lite
2026-01-28 19:11:49 +08:00
Kevin Pouget
b7feacf7f3
ggml: new backend for Virglrenderer API Remoting acceleration (v2) ( #18718 )
2026-01-28 17:49:40 +08:00
Alberto Cabrera Pérez
6ad70c5a77
ggml-cpu: arm64: Q4_K scale unroll and vectorization ( #19108 )
2026-01-28 09:15:56 +02:00
Georgi Gerganov
631cbfcc7a
cuda : fix "V is K view" check for non-unified KV cache ( #19145 )
2026-01-28 09:15:27 +02:00
Georgi Gerganov
2eee6c866c
CUDA: tune GLM 4.7 Flash FA kernel selection logic (DGX Spark) ( #19142 )
2026-01-28 09:15:11 +02:00
Georgi Gerganov
b931f81b5a
server : adjust spec tests to generate up to 16 tokens ( #19093 )
2026-01-28 09:11:40 +02:00
Georgi Gerganov
c5c64f72ac
llama : disable Direct IO by default ( #19109 )
...
* llama : disable Direct IO by default
* cont : override mmap if supported
2026-01-28 09:11:13 +02:00
Daniel Bevenius
eef375ce16
sampling : remove sampling branching in output_reserve ( #18811 )
...
* sampling : remove sampling branching in output_reserve
This commit updates output_reserve in llama-context.cpp to always
allocate sampling buffers regardless of whether sampling is needed for
the current batch.
The motivation for this is to avoid reallocations and branching based on
the sampling requirements of the batch.
2026-01-28 05:59:30 +01:00
Nikhil Jain
06961e2876
ggml webgpu: Split shared state (webgpu_context) into global state and per-thread state ( #18976 )
...
* Squashed commit of the following:
commit b3c6bf4b0450d8d452b934df27a0fb7cb53cd755
Author: Abhijit Ramesh <abhijitramesh2k@gmail.com>
Date: Mon Dec 1 18:29:00 2025 -0800
ggml webgpu: fix xielu parameter passing (#11 )
The XIELU operation was incorrectly using static_cast to convert
float parameters to uint32_t, which converted numeric values instead
of preserving IEEE 754 bit patterns. This caused incorrect values
to be interpreted by the GPU shader.
* Use reinterpret_cast to preserve float bit patterns when passing
through uint32_t params buffer
* Update WGSL shader parameter types from u32 to f32
* Re-enable XIELU support (was disabled due to numerical issues)
Fixes NMSE test failures for XIELU operation on WebGPU backend.
commit 5ca9b5e49ea7cddc9ab7c8b43a11a9c76a4dff4a
Author: neha-ha <137219201+neha-ha@users.noreply.github.com>
Date: Tue Nov 18 12:17:00 2025 -0800
Refactored pipelines and workgroup calculations (#10 )
* refactored pipelines
* refactored workgroup calculation
* removed commented out block of prior maps
* Clean up ceiling division pattern
---------
Co-authored-by: Neha Abbas <nehaabbas@eduroam-169-233-141-223.ucsc.edu>
Co-authored-by: Reese Levine <reeselevine1@gmail.com>
Author: James Contini <jamescontini@gmail.com>
Date: Wed Oct 29 23:13:06 2025 -0700
formatted embed wgsl and ggml-webgpu.cpp
commit e1f6baea31645e5d96ad53664acae856f74b96f4
Author: James Contini <jamescontini@gmail.com>
Date: Wed Oct 29 23:08:37 2025 -0700
implemented REPL_Template support and removed bug in unary operators kernel
commit 8c70b8fece445cdc9a8c660dbddbf201e52da2bb
Author: James Contini <jamescontini@gmail.com>
Date: Wed Oct 15 16:14:20 2025 -0700
responded and dealt with PR comments
commit f9282c660c10dec4487d434549bdb707a9cd9f37
Author: James Contini <jamescontini@gmail.com>
Date: Sun Oct 12 13:41:41 2025 -0700
removed unnecessary checking if node->src[1] exists for unary operators
commit 4cf28d7dec41c29186d66152735b244c5699f9dc
Author: James Contini <jamescontini@gmail.com>
Date: Sun Oct 12 13:32:45 2025 -0700
All operators (including xielu) working
commit 74c6add1761a59d2c2ff60b60e8ad3c8300f6d3e
Author: James Contini <jamescontini@gmail.com>
Date: Fri Oct 10 13:16:48 2025 -0700
fixed autoconfig
commit 362749910be4f0120c8ffb21ceddeb7d2c088e51
Author: James Contini <jamescontini@gmail.com>
Date: Fri Oct 10 13:10:46 2025 -0700
removed vestigial files
commit cb0858333785757804c5104e59c4981843207c16
Author: James Contini <jamescontini@gmail.com>
Date: Fri Oct 10 12:59:32 2025 -0700
abides by editor-config
commit 5360e2852a4b51197d7d67d0a5d42e908b02d7ed
Author: James Contini <jamescontini@gmail.com>
Date: Fri Oct 10 12:45:57 2025 -0700
rms_norm double declaration bug atoned
commit 7b09baa4aa53711be5a126043670cc182c78bfcd
Merge: 8a6ec843 74b8fc17
Author: James Contini <jamescontini@gmail.com>
Date: Fri Oct 10 11:50:03 2025 -0700
resolving merge conflicts
commit 8a6ec843a50ab82f8cef59b4558eb63f318ba02d
Author: James Contini <jamescontini@gmail.com>
Date: Wed Oct 8 18:06:47 2025 -0700
unary operators pass ggml tests
commit c3ae38278a2db236adc5912c9140e4f0d63f2c19
Author: James Contini <jamescontini@gmail.com>
Date: Wed Oct 1 16:22:40 2025 -0700
neg passes backend test
commit aa1c9b2f8877a405470ca56709c42a1fd43713de
Author: James Contini <jamescontini@gmail.com>
Date: Tue Sep 30 23:55:27 2025 -0700
neg f16xf32xip builds and runs; haven't actually run a model that uses the neg kernel yet though
Co-authored-by: James Contini <jamescontini@gmail.com>
Co-authored-by: Neha Abbas <neabbas@ucsc.edu>
Co-authored-by: Abhijit Ramesh <abhijitramesh2k@gmail.com>
* Remove extra code and format
* Add ops documentation (finally)
* ggml webgpu: add SOFTPLUS unary operator
Implements SOFTPLUS (log(1 + exp(x))) with f16/f32 support. Uses f32
precision for intermediate calculations to prevent f16 overflow.
* Add shader implementation and 4 variants (f32/f16, inplace/non-inplace)
* Register pipelines and device support
* Follow Vulkan backend numerical stability pattern
* ggml webgpu: add EXPM1 unary operator
Implements EXPM1 (exp(x) - 1) with f16/f32 support.
* Add shader implementation and 4 variants (f32/f16, inplace/non-inplace)
* Register pipelines and device support
* ggml webgpu: add FLOOR unary operator
Implements FLOOR (rounds down to nearest integer) with f16/f32 support.
* Add shader implementation and 4 variants (f32/f16, inplace/non-inplace)
* Register pipelines and device support
* ggml webgpu: add CEIL unary operator
Implements CEIL (rounds up to nearest integer) with f16/f32 support.
* Add shader implementation and 4 variants (f32/f16, inplace/non-inplace)
* Register pipelines and device support
* ggml webgpu: add ROUND unary operator
Implements ROUND (rounds to nearest integer) with f16/f32 support.
* Add shader implementation and 4 variants (f32/f16, inplace/non-inplace)
* Register pipelines and device support
* ggml webgpu: add TRUNC unary operator
Implements TRUNC (truncates towards zero) with f16/f32 support.
* Add shader implementation and 4 variants (f32/f16, inplace/non-inplace)
* Register pipelines and device support
* docs : update WebGPU support for unary operators (FLOOR, CEIL, ROUND, TRUNC, EXPM1, SOFTPLUS)
* Updates to webgpu get_memory
* Move shared state (webgpu_context) and device creation out of registration context, device context, and buffer context, and move into backend context
* Small cleanup
* Move Instance, Device, Adapter, Device creation, and capabilities to global state while moving Queue, pipelines, and buffers to per-thread state.
* Cleanups
* More cleanup
* Move staging_buf mutex to global context
* Resolve merge
* Resolve merge
* Resolve merge
* Clean up merge errors, delete forward declaration, and run clang-format
* Rename device_init to backend_init
* Move webgpu_context to backend_context
* Move buffer context members into global context and refactor function calls
* Run clang-format
* Remove comments
* Move parameter buffers to per-thread, add single memset_tensor param buf
* Fix CI compilation issue
* Fix builds for emscripten not supporting subgroups
* cleanup
* cleanup
---------
Co-authored-by: Reese Levine <reeselevine1@gmail.com>
2026-01-27 20:53:36 -08:00
Vishal Singh
f2571df8b7
ggml-zendnn : update ZenDNN git tag to main branch ( #19133 )
2026-01-28 06:21:36 +08:00
Sigbjørn Skjæret
2b4cbd2834
jinja : implement mixed type object keys ( #18955 )
...
* implement mixed type object keys
* add tests
* refactor
* minor fixes
* massive refactor
* add more tests
* forgotten tuples
* fix array/object is_hashable
* correct (albeit broken) jinja responses
verified with transformers
* improved hashing and equality
* refactor hash function
* more exhaustive test case
* clean up
* cont
* cont (2)
* missing cstring
---------
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2026-01-27 19:50:42 +01:00
Concedo
f6ece6fd37
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .github/workflows/check-vendor.yml
# .github/workflows/close-issue.yml
# .github/workflows/editorconfig.yml
# .github/workflows/gguf-publish.yml
# .github/workflows/labeler.yml
# .github/workflows/pre-tokenizer-hashes.yml
# .github/workflows/python-check-requirements.yml
# .github/workflows/python-lint.yml
# .github/workflows/python-type-check.yml
# .github/workflows/server.yml
# .github/workflows/update-ops-docs.yml
# README.md
# docs/build.md
# examples/model-conversion/scripts/utils/perplexity-gen.sh
# examples/model-conversion/scripts/utils/perplexity-run-simple.sh
# examples/model-conversion/scripts/utils/perplexity-run.sh
# examples/model-conversion/scripts/utils/quantize.sh
# examples/model-conversion/scripts/utils/run-embedding-server.sh
# ggml/src/ggml-cpu/ggml-cpu.c
# ggml/src/ggml-hexagon/htp/flash-attn-ops.c
# ggml/src/ggml-opencl/CMakeLists.txt
# ggml/src/ggml-opencl/ggml-opencl.cpp
# ggml/src/ggml-opencl/kernels/cvt.cl
# ggml/src/ggml-opencl/kernels/mul_mv_q6_k_f32.cl
# ggml/src/ggml-sycl/ggml-sycl.cpp
# scripts/compare-llama-bench.py
# tests/test-backend-ops.cpp
# tests/test-gguf.cpp
# tools/cli/README.md
# tools/completion/README.md
# tools/server/README.md
2026-01-27 23:06:13 +08:00
David Lima
68ac3acb43
docs: Remove duplicated word on CUDA build section ( #19136 )
2026-01-27 14:48:51 +01:00
Johannes Gäßler
a5bb8ba4c5
CUDA: tune GLM 4.7 Flash FA kernel selection logic ( #19097 )
2026-01-27 14:28:56 +01:00
Sigbjørn Skjæret
c0204a0893
ci : revert slim runner for winget ( #19129 )
2026-01-27 11:54:25 +01:00
Alberto Cabrera Pérez
be8890e721
ggml-cpu: aarm64: q6_K repack gemm and gemv (and generic) implementations (i8mm) #18860 ( #18888 )
...
* Boilerplate for q6_K repack
* q6_K repack to q6_Kx8 implementation
Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai>
* q6_K generic gemv and gemm
* wip, gemm_q6_K 8x8
* Still WIP: loading of q8s, q6h and q6l
* first working version of q6_K gemm
* Moved q6 loads outside of sb block, Unrolled inner loop
* Replaced modulo with mask
* First implementation of GEMV
* ggml_vdotq_s32 -> vdotq_s32
* Reduce width of accumulators in q6_K gemv
* Bsums instead of calc bias. Preload scales to use vget_lane. Unroll.
* Reuse scales in GEMM (same GEMV opt)
* Added todos for bsum and different qh repack
* Arch fallback
* VSLIQ for merging qh and ql
* Removed TODO, already tested
* Apply suggestions
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Removed unused import
---------
Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-01-27 11:08:10 +02:00
Gaurav Garg
a83c73a18a
[CUDA] Reduce CPU-side stalls due to the CUDA command buffer being full ( #19042 )
...
* [CUDA] Reduce CPU-side stalls due to the CUDA command buffer being full
With pipeline parallelism, during prompt processing, the CPU-side CUDA command buffer gets full, stalling the CPU. Due to this, enough work doesn't get submitted to the GPU, causing bubbles in the GPU timeline.
Fix this by setting the CUDA environment variable CUDA_SCALE_LAUNCH_QUEUES to 4x to increase the command buffer size.
* Set the env variable in the CUDA backend registry allocation
* Add link to PR in code comment
* Remove warning logs and update documentation
2026-01-27 08:52:44 +02:00
Daniel Bevenius
fc3cdf32ce
common : clarify HTTPS build options in error message ( #19103 )
...
* common : clarify HTTPS build options in error message
This commit updates the https error message to provide clearer
instructions for users who encounter the "HTTPS is not supported" error.
The motivation for this is that it might not be clear to users that only
one of these options is needed to enable HTTPS support.
The LLAMA_OPENSSL option is also added to the message to cover all
possible build configurations.
* clarify that OpenSSL is the default for HTTPS support
2026-01-27 06:16:00 +01:00
shalinib-ibm
7afdfc9b84
ggml-cpu: Enable FP16 MMA kernels on PPC ( #19060 )
2026-01-27 11:52:34 +08:00
lhez
94eeb5967c
opencl: add flattened q6_K mv ( #19054 )
...
* opencl: flatten `q6_K` and add `kernel_mul_mv_q6_K_f32_flat`
* opencl: clean up
* opencl: refactor q6_K mv - put loop body in `block_q_6_K_dot_y_flat`
* opencl: tweak the workgroup size a bit
* opencl: output 4 values per subgroup for `kernel_mul_mv_q6_K_f32_flat`
* opencl: proper alignment for q6_K
* opencl: boundary handling for flattened q6_K mv
* opencl: rename q6_K mv kernel file
* opencl: put flattened q6_K mv in its own file
* opencl: use lower k in file name
* opencl: use K in variable names
2026-01-26 19:36:24 -08:00
Johannes Gäßler
b0311c16d2
CUDA: fix padding of GQA to power of 2 in FA ( #19115 )
2026-01-26 23:24:58 +01:00
Georgi Gerganov
8f80d1b254
graph : fix nkvo offload with FA ( #19105 )
2026-01-26 20:18:34 +02:00
Sigbjørn Skjæret
142cbe2ac6
ci : use new 1vCPU runner for lightweight jobs ( #19107 )
...
* use new 1vCPU runner for lightweight jobs
* pyright is too heavy, look into ty some day
use new pip-install input
2026-01-26 15:22:49 +01:00
Georgi Gerganov
56f3ebf38e
model : add correct type for GLM 4.7 Flash ( #19106 )
2026-01-26 11:24:30 +02:00
Concedo
3a84623ea5
Merge branch 'concedo' into concedo_experimental
2026-01-26 13:06:47 +08:00
Concedo
03e995309e
updated readme
2026-01-26 13:06:36 +08:00
Johannes Gäßler
0c21677e43
CUDA: faster FA for GQA > 1 but not power of 2 ( #19092 )
2026-01-25 21:19:47 +01:00
ccbinn
0440bfd160
metal : fix recommendedMaxWorkingSetSize availability on legacy iOS/macOS ( #19088 )
...
Co-authored-by: chenbin11 <chenbin11@kuaishou.com>
2026-01-25 20:07:19 +02:00
Sigbjørn Skjæret
0bf5636938
convert : yield Gemma3N custom_map tensors directly ( #19091 )
2026-01-25 18:03:34 +01:00
Aman Gupta
bcb43163ae
ggml-cpu: Use tiled FA for prompt-processing ( #19012 )
...
* ggml-cpu: Use tiled FA for prompt-processing
FA performance is poor on CPU at long contexts because it essentially uses a vector kernel. This PR adds a tiled FA for PP. Perf tuning for tile sizes was done on an AMD EPYC single-socket 64-core machine.
* fix out of bounds for mask
* skip rows where there are all masks
* skip tile if mask is inf
* store mask in worksize
* check inf tile earlier
2026-01-25 23:25:58 +08:00
Georgi Gerganov
d9c6ce46f7
kv-cache : support V-less cache ( #19067 )
...
* kv-cache : support V-less cache
* cuda : better check for V_is_K_view
* cuda : improve V_is_K_view check
* graph : add comments
* hparams : refactor
2026-01-25 15:48:56 +02:00
Sigbjørn Skjæret
70d860824a
convert : fix Gemma3N, GraniteMoe and Ernie4.5Moe ( #19084 )
...
* fix Gemma3N and Ernie4.5Moe
* fix GraniteMoe
2026-01-25 13:05:05 +01:00
Concedo
fec0d2bb4a
pipeline parallel is default in cli now
2026-01-25 18:16:58 +08:00
Georgi Gerganov
080b161995
completion : fix prompt cache for recurrent models ( #19045 )
2026-01-25 09:12:50 +02:00
Molly Sophia
1243f93a2d
readme: update RWKV7 model links ( #19061 )
...
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
2026-01-25 09:11:19 +02:00
Jakkala Mahesh
24bc238303
llama: fix integer type consistency in split helpers ( #18894 )
...
* llama: fix integer type consistency in split helpers
* llama: apply minor style fixes
* llama: remove trailing whitespace
2026-01-25 09:10:52 +02:00
Daniel Bevenius
16639ba217
common : use two decimal places for float arg help messages ( #19048 )
...
* common : use two decimal places for float arg help messages
This commit updates the help messages for various command-line arguments
in arg.cpp to display floating-point default values with two decimal
places instead of one.
The motivation for this change is that having only one decimal place
means that values generated using --help or llama-gen-docs will not
display the correct defaults.
For example, the value of top-p in tools/server/README.md is currently
`0.9`, but the default value is actually `0.95`. Running
llama-gen-docs does not update this value because it uses the output of the
help message, which shows only one decimal place, so the values look
unchanged.
* docs : run llama-gen-docs to update docs
2026-01-25 07:31:42 +01:00
Bartowski
9981c30130
convert : fix conversion for inheriting models that were bypassing modify_tensors ( #19064 )
...
* Add undo_permute = False where needed
* Replace super().modify_tensors with ModelBase
* Add one more ModelBase.modify_tensors
* Update convert_hf_to_gguf.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update convert_hf_to_gguf.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update convert_hf_to_gguf.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-01-25 02:36:47 +01:00
Johannes Gäßler
e9fd8dcab4
llama-fit-params: keep explicit --ctx-size 0 ( #19070 )
2026-01-24 22:13:08 +01:00
Johannes Gäßler
4e5b83b226
GGUF: check that tensor size is representable ( #19072 )
2026-01-24 21:57:51 +01:00
Xuan-Son Nguyen
bb02f74c61
chat: fix language input for translategemma ( #19052 )
...
* chat: fix language input for translategemma
* Update common/chat.cpp
Co-authored-by: Aldehir Rojas <hello@alde.dev>
---------
Co-authored-by: Aldehir Rojas <hello@alde.dev>
2026-01-24 17:58:45 +01:00
Johannes Gäßler
8f91ca54ec
CUDA: re-use MLA K data for V in MMA FA ( #19057 )
2026-01-24 10:09:36 +01:00