Commit graph

2463 commits

Author SHA1 Message Date
Concedo
757b293ac9 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.github/workflows/build.yml
#	.github/workflows/server-webui.yml
#	.github/workflows/server.yml
#	tools/rpc/rpc-server.cpp
2026-02-09 00:33:11 +08:00
Concedo
099af98288 Revert "CUDA: Fix non-contig rope (early merge)"
This reverts commit c32b4305e2.
2026-02-09 00:15:40 +08:00
Oliver Simons
e06088da0f
CUDA: Fix non-contig rope (#19338)
* Rename variables + fix rope_neox

Seems memory layout is shared with Vulkan so we can port fix from
https://github.com/ggml-org/llama.cpp/pull/19299

* Fix rope_multi

* Fix rope_vision

* Fix rope_norm

* Rename ne* to ne0* for consistent variable naming

* cont : consistent stride names

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-02-08 15:12:51 +02:00
Georgi Gerganov
8872ad2125
metal : consolidate bin kernels (#19390)
* metal : refactor bin kernels

* cont

* cont : fix cv
2026-02-07 10:35:56 +02:00
Concedo
a0a78dacc4 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.github/workflows/build.yml
#	docs/ops.md
#	docs/ops/SYCL.csv
#	ggml/src/ggml-sycl/element_wise.cpp
#	ggml/src/ggml-sycl/ggml-sycl.cpp
#	ggml/src/ggml-webgpu/ggml-webgpu-shader-lib.hpp
#	ggml/src/ggml-webgpu/ggml-webgpu.cpp
#	pyproject.toml
#	requirements/requirements-convert_legacy_llama.txt
#	src/CMakeLists.txt
#	src/llama-vocab.cpp
#	tests/test-backend-ops.cpp
2026-02-07 15:54:02 +08:00
Georgi Gerganov
34ba7b5a2f
metal : fix event synchronization in cpy_tensor_async (#19402)
2026-02-07 07:37:15 +02:00
Abhijit Ramesh
7fbd36c50c
ggml-webgpu: JIT compile binary operators and handle binding overlaps (#19310)
* ggml webgpu: port binary operators to use pre-wgsl

* Add binary.wgsl: unified shader with conditionals for all 4 ops

* Add gen_binary_shaders.cpp: build tool for using pre_wgsl preprocessor

* Remove bin_op.tmpl.wgsl and binary.wgsl (Python template)

* Update CMake to generate binary operator shaders at build time

* ggml-webgpu: migrate binary ops to JIT compilation with overlap handling

* port binary operators from AOT to pre-wgsl JIT compilation

* add src1=dst overlap handling for binary ops

* use compile-time workgroup size defines instead of runtime overrides

* ggml-webgpu: complete overlap handling for binary ops

* add support for inplace & overlap case in binding setup

* restructure conditional logic to handle all overlap cases

* ensure all buffer bindings are correctly assigned for edge cases

* ggml-webgpu: remove unused binary overlap cases

Remove src0==src1 binary overlap case that never occurs in practice.

* keep INPLACE (src0==dst), OVERLAP (src1==dst), DEFAULT

* remove unused src0==src1 and all-same variant

* refactor wgsl to eliminate duplication
2026-02-06 10:33:30 -08:00
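The three binding cases kept by commit 7fbd36c50c above (INPLACE, OVERLAP, DEFAULT) can be illustrated with a minimal Python sketch. The function name and the use of object identity to model "same buffer" are hypothetical and are not the ggml-webgpu code. The commit message does not say which case applies when both sources alias dst; checking INPLACE first is an arbitrary choice made here.

```python
def select_binary_variant(src0, src1, dst):
    """Pick a shader variant name from how the operand buffers alias dst.

    Buffers are modeled as plain Python objects; `is` stands in for
    "bound to the same buffer range" (an assumption of this sketch).
    """
    if src0 is dst:
        return "INPLACE"   # src0 == dst
    if src1 is dst:
        return "OVERLAP"   # src1 == dst
    return "DEFAULT"       # no operand aliases dst


a, b, c = object(), object(), object()
print(select_binary_variant(a, b, a))  # src0 aliases dst
print(select_binary_variant(a, b, b))  # src1 aliases dst
print(select_binary_variant(a, b, c))  # no aliasing
```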
Nechama Krashinski
537eadb1b9
sycl: add F16 support for GGML_OP_CEIL (#19306)
* Fix SYCL CEIL operator

* sycl: implement GGML_OP_CEIL
2026-02-06 23:13:44 +08:00
Jeff Bolz
1946e46f4c
vulkan: For coopmat2 FA, use fp16 accumulators for the final result (#19376)
The cpu and cuda backends use fp16 for the VKQ accumulator type; this change
does the same for vulkan. This helps particularly with large head sizes which
are very register-limited.

I tried this for the coopmat1 path and it slowed down a bit. I didn't try for
scalar.

I applied the softmax bias that the cuda backend uses to avoid overflow,
although I was not able to reproduce the original bug without it.
2026-02-06 09:15:13 +01:00
Jeff Bolz
f9bd518a6b
vulkan: make FA mask/softcap enables spec constants (#19309)
* vulkan: make FA mask/softcap enables spec constants

* don't specialize for sinks

* bump timeout a little bit
2026-02-06 08:49:58 +01:00
Georgi Gerganov
7fcf1ef45d
metal : skip loading all-zero mask (#19337)
* metal : skip loading all-zero mask

* cont : minor
2026-02-06 09:25:11 +02:00
Concedo
c32b4305e2 CUDA: Fix non-contig rope (early merge) 2026-02-06 14:45:37 +08:00
Concedo
423a4bd3c0 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	src/CMakeLists.txt
#	tests/test-backend-ops.cpp
2026-02-06 14:43:02 +08:00
Georgi Gerganov
3e21647666
cuda : cuda graphs now compare all node params (#19383) 2026-02-06 07:55:06 +02:00
Georgi Gerganov
22cae83218
metal : adaptive CPU/GPU interleave based on number of nodes (#19369) 2026-02-05 19:07:22 +02:00
Jeff Bolz
449ec2ab07
vulkan: Preprocess FA mask to detect all-neg-inf and all-zero. (#19281)
Write out a 2-bit code per block and avoid loading the mask when it
matches these two common cases.

Apply this optimization when the mask is relatively large (i.e. prompt
processing).
2026-02-05 09:26:38 -06:00
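The idea in commit 449ec2ab07 above (one 2-bit code per mask block, so that all-zero and all-neg-inf blocks need not be loaded later) can be sketched in Python. This is only an illustration: the code values, the flat-list mask layout, and the function names are assumptions, and the real implementation is a Vulkan shader.

```python
import math

# Hypothetical 2-bit codes; the actual encoding is not given in the commit message.
MASK_MIXED = 0        # block must be loaded and applied element by element
MASK_ALL_ZERO = 1     # block adds nothing to the scores, so the load can be skipped
MASK_ALL_NEG_INF = 2  # block is fully masked out, so the load can be skipped


def classify_block(block):
    """Return the code for one non-empty block of mask values."""
    if all(v == 0.0 for v in block):
        return MASK_ALL_ZERO
    if all(v == -math.inf for v in block):
        return MASK_ALL_NEG_INF
    return MASK_MIXED


def preprocess_mask(mask, block_size):
    """Classify consecutive blocks of a flat mask (the last block may be shorter)."""
    return [
        classify_block(mask[i:i + block_size])
        for i in range(0, len(mask), block_size)
    ]


ninf = -math.inf
codes = preprocess_mask([0.0, 0.0, ninf, ninf, 0.0, ninf], block_size=2)
print(codes)
```

The preprocessing pass costs one extra read of the mask, which is consistent with the commit applying it only when the mask is relatively large.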
Concedo
ada982b7c1 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.devops/vulkan.Dockerfile
#	benches/dgx-spark/dgx-spark.md
#	scripts/bench-models.sh
2026-02-05 22:24:12 +08:00
Concedo
157fac7bd0 Merge commit 'c342c3b93d' into concedo_experimental
# Conflicts:
#	.github/workflows/build.yml
#	CODEOWNERS
#	scripts/sync_vendor.py
2026-02-05 22:23:05 +08:00
Georgi Gerganov
7a4f97d196
metal : add diag (#19330) 2026-02-05 10:08:45 +02:00
Oleksandr Kuvshynov
a498c75ad1
vulkan: fix GPU deduplication logic. (#19222)
* vulkan: fix GPU deduplication logic.

As reported in https://github.com/ggml-org/llama.cpp/issues/19221, the
(same uuid, same driver) logic is problematic for windows+intel igpu.

Let's just avoid filtering for MoltenVK, which is Apple-specific, and
keep the logic the same as before 88d23ad5: just dedup based on UUID.

Verified that macOS + 4xVega still reports 4 GPUs with this version.

* vulkan: only skip dedup when both drivers are MoltenVK
2026-02-05 09:06:59 +01:00
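The deduplication rule described in commit a498c75ad1 above (dedup by UUID, but do not dedup when both drivers are MoltenVK) can be sketched as follows. The dictionary fields `uuid` and `driver` and the function name are hypothetical. The sketch also assumes that the macOS multi-GPU case reports the same UUID for several devices, which the commit message implies but does not state.

```python
def dedup_devices(devices):
    """Keep the first device for each UUID, except that two MoltenVK
    devices are never treated as duplicates of each other."""
    kept = []
    for dev in devices:
        is_duplicate = any(
            k["uuid"] == dev["uuid"]
            and not (k["driver"] == "MoltenVK" and dev["driver"] == "MoltenVK")
            for k in kept
        )
        if not is_duplicate:
            kept.append(dev)
    return kept


# Same UUID exposed through two different drivers: only one device is kept.
mixed = [
    {"uuid": "gpu-0", "driver": "driver-a"},
    {"uuid": "gpu-0", "driver": "driver-b"},
]
print(len(dedup_devices(mixed)))

# Four MoltenVK devices sharing a UUID: all four are kept.
vegas = [{"uuid": "same", "driver": "MoltenVK"} for _ in range(4)]
print(len(dedup_devices(vegas)))
```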
Jeff Bolz
3409ab842d
vulkan: Set k_load_shmem to false when K is too large (#19301) 2026-02-05 08:48:33 +01:00
Jeff Bolz
c342c3b93d
vulkan: fix non-contig rope (#19299) 2026-02-05 08:38:59 +01:00
will-lms
af252d0758
metal : add missing includes (#19348) 2026-02-05 08:05:09 +02:00
Concedo
a2251a154f Merge remote-tracking branch 'jeff/rope_noncontig' into concedo_experimental 2026-02-04 16:21:31 +08:00
Concedo
1f803ae27b Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.github/workflows/server.yml
#	CMakeLists.txt
#	cmake/common.cmake
#	ggml/src/ggml-virtgpu/apir_cs_ggml-rpc-front.cpp
#	ggml/src/ggml-virtgpu/backend/backend-dispatched-backend.cpp
#	ggml/src/ggml-virtgpu/backend/backend-dispatched-buffer-type.cpp
#	ggml/src/ggml-virtgpu/backend/backend-dispatched-buffer.cpp
#	ggml/src/ggml-virtgpu/backend/backend-dispatched-device.cpp
#	ggml/src/ggml-virtgpu/backend/backend-dispatched.cpp
#	ggml/src/ggml-virtgpu/backend/backend-dispatched.gen.h
#	ggml/src/ggml-virtgpu/backend/backend-dispatched.h
#	ggml/src/ggml-virtgpu/backend/backend.cpp
#	ggml/src/ggml-virtgpu/backend/shared/apir_cs.h
#	ggml/src/ggml-virtgpu/backend/shared/apir_cs_ggml.h
#	ggml/src/ggml-virtgpu/ggml-backend-buffer-type.cpp
#	ggml/src/ggml-virtgpu/ggml-backend-device.cpp
#	ggml/src/ggml-virtgpu/ggml-backend-reg.cpp
#	ggml/src/ggml-virtgpu/ggml-remoting.h
#	ggml/src/ggml-virtgpu/ggmlremoting_functions.yaml
#	ggml/src/ggml-virtgpu/regenerate_remoting.py
#	ggml/src/ggml-virtgpu/virtgpu-forward-backend.cpp
#	ggml/src/ggml-virtgpu/virtgpu-forward-buffer-type.cpp
#	ggml/src/ggml-virtgpu/virtgpu-forward-buffer.cpp
#	ggml/src/ggml-virtgpu/virtgpu-forward-device.cpp
#	ggml/src/ggml-virtgpu/virtgpu-forward-impl.h
#	ggml/src/ggml-virtgpu/virtgpu-forward.gen.h
#	ggml/src/ggml-virtgpu/virtgpu-shm.cpp
#	ggml/src/ggml-virtgpu/virtgpu.cpp
#	ggml/src/ggml-virtgpu/virtgpu.h
2026-02-04 16:21:06 +08:00
Kevin Pouget
015deb9048
ggml-virtgpu: make the code thread safe (#19204)
* ggml-virtgpu: regenerate_remoting.py: add the ability to deprecate a function

* ggml-virtgpu: deprecate buffer_type is_host remoting

not necessary

* ggml-virtgpu: stop using static vars as cache

The static init isn't thread safe.

* ggml-virtgpu: protect the use of the shared memory to transfer data

* ggml-virtgpu: make the remote calls thread-safe

* ggml-virtgpu: backend: don't continue if couldn't allocate the tensor memory

* ggml-virtgpu: add a cleanup function for consistency

* ggml-virtgpu: backend: don't crash if buft->iface.get_max_size is missing

* fix style and ordering

* Remove the static variable in apir_device_get_count

* ggml-virtgpu: improve the logging

* fix review minor formatting changes
2026-02-04 10:46:18 +08:00
Aman Gupta
2ceda3f662
ggml-cpu: use LUT for converting e8->f32 scales on x86 (#19288)
* ggml-cpu: use LUT for converting e8->f32 scales on x86

* add dispatch based on macro
2026-02-04 09:43:29 +08:00
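As an illustration of the lookup-table approach named in commit 2ceda3f662 above, the following Python sketch precomputes all 256 possible 8-bit exponent scales once and then converts by indexing. It assumes the OCP MX E8M0 definition (value 2^(e-127), with 255 reserved for NaN); the function and table names are hypothetical, and ggml's own conversion may treat the edge cases differently.

```python
import struct


def e8m0_to_f32(e):
    """Convert one E8M0 exponent byte to a float via float32 bit patterns."""
    if e == 255:
        return float("nan")  # reserved encoding in the assumed MX definition
    # e == 0 is 2^-127, a float32 subnormal; otherwise e is the biased exponent.
    bits = 0x00400000 if e == 0 else (e << 23)
    return struct.unpack("<f", struct.pack("<I", bits))[0]


# Built once; afterwards each conversion is a single table lookup.
E8M0_LUT = [e8m0_to_f32(e) for e in range(256)]

print(E8M0_LUT[127], E8M0_LUT[128], E8M0_LUT[126])
```

With only 256 entries the table is small enough to stay in cache, which is the usual reason to prefer a lookup over recomputing the bit manipulation per block.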
Georgi Gerganov
44008ce8f9
metal : add solve_tri (#19302) 2026-02-03 23:43:14 +02:00
Jeff Bolz
5de50e9d86 vulkan: fix non-contig rope 2026-02-03 12:20:08 -06:00
Ruben Ortlam
32b17abdb0
vulkan: disable coopmat1 fa on Nvidia Turing (#19290) 2026-02-03 17:37:32 +01:00
Aman Gupta
8bece2eb20
CUDA: use mmvq for mul-mat-id for small batch sizes (#18958)
* CUDA: use mmvq for mul-mat-id for small batch sizes

* add mmvq too

* Fix perf issue on ampere. Use mmvf mm-id only for non-nvidia GPUs

* templatize multi_token_path
2026-02-03 23:31:23 +08:00
Georgi Gerganov
c55bce4159
metal : minor cleanup (#19251) 2026-02-03 13:43:29 +02:00
Concedo
316530e9cf fix cuda graph spams 2026-02-03 19:00:50 +08:00
Concedo
7b393fa487 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.github/workflows/build.yml
#	AUTHORS
#	ci/run.sh
#	docs/backend/SYCL.md
#	docs/build.md
#	docs/multimodal/minicpmo2.6.md
#	docs/multimodal/minicpmo4.0.md
#	docs/multimodal/minicpmv2.5.md
#	docs/multimodal/minicpmv2.6.md
#	docs/multimodal/minicpmv4.0.md
#	docs/multimodal/minicpmv4.5.md
#	docs/ops.md
#	docs/ops/SYCL.csv
#	docs/speculative.md
#	examples/deprecation-warning/README.md
#	examples/deprecation-warning/deprecation-warning.cpp
#	examples/model-conversion/Makefile
#	examples/model-conversion/scripts/causal/convert-model.sh
#	ggml/include/ggml-cann.h
#	ggml/src/ggml-cann/acl_tensor.cpp
#	ggml/src/ggml-cann/acl_tensor.h
#	ggml/src/ggml-cann/aclnn_ops.cpp
#	ggml/src/ggml-cann/aclnn_ops.h
#	ggml/src/ggml-cann/common.h
#	ggml/src/ggml-cann/ggml-cann.cpp
#	ggml/src/ggml-metal/CMakeLists.txt
#	ggml/src/ggml-opencl/ggml-opencl.cpp
#	ggml/src/ggml-opencl/kernels/concat.cl
#	ggml/src/ggml-opencl/kernels/repeat.cl
#	ggml/src/ggml-opencl/kernels/scale.cl
#	ggml/src/ggml-opencl/kernels/tanh.cl
#	ggml/src/ggml-sycl/CMakeLists.txt
#	ggml/src/ggml-sycl/dpct/helper.hpp
#	ggml/src/ggml-sycl/ggml-sycl.cpp
#	ggml/src/ggml-sycl/outprod.cpp
#	ggml/src/ggml-sycl/rope.cpp
#	ggml/src/ggml-sycl/wkv.cpp
#	src/llama-vocab.cpp
#	tests/test-autorelease.cpp
#	tests/test-backend-ops.cpp
#	tools/cvector-generator/pca.hpp
#	tools/export-lora/export-lora.cpp
#	tools/perplexity/README.md
2026-02-03 19:00:42 +08:00
Oliver Simons
1f1e57f2bf
CUDA: Fix loop unrolling for BW in mul_mat_q_stream_k_fixup (#19053)
By providing stride_* variables as size_t (i.e., 64-bit) the compiler can
correctly unroll the [two for-loops](557515be1e/ggml/src/ggml-cuda/mmq.cuh (L3789-L3816))
on BW. This gives some perf for prefill/pp phase on BW, while not affecting
other SMs:

| GPU                                                     | Model                 | Test   |   t/s master |   t/s osimons/fix_bw_mmq_fixup_kernel |   Speedup |
|:--------------------------------------------------------|:----------------------|:-------|-------------:|--------------------------------------:|----------:|
| NVIDIA RTX 6000 Ada Generation                          | gpt-oss 20B MXFP4 MoE | pp8096 |      8404.05 |                               8375.79 |      1.00 |
| NVIDIA RTX 6000 Ada Generation                          | llama 3B Q4_K_M       | pp8096 |     16148.93 |                              16019.60 |      0.99 |
| NVIDIA RTX 6000 Ada Generation                          | llama 8B Q4_0         | pp8096 |      8008.29 |                               7978.80 |      1.00 |
| NVIDIA RTX 6000 Ada Generation                          | nemotron_h 9B BF16    | pp8096 |      4263.16 |                               4248.53 |      1.00 |
| NVIDIA RTX 6000 Ada Generation                          | nemotron_h 9B Q4_K_M  | pp8096 |      5165.11 |                               5157.43 |      1.00 |
| NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition | gpt-oss 20B MXFP4 MoE | pp8096 |     12582.80 |                              12758.37 |      1.01 |
| NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition | llama 3B Q4_K_M       | pp8096 |     16879.10 |                              17619.47 |      1.04 |
| NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition | llama 8B Q4_0         | pp8096 |     10649.90 |                              10982.65 |      1.03 |
| NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition | nemotron_h 9B BF16    | pp8096 |      7717.73 |                               7716.22 |      1.00 |
| NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition | nemotron_h 9B Q4_K_M  | pp8096 |      7301.90 |                               7370.38 |      1.01 |
2026-02-03 11:33:14 +01:00
George
e9a859db3c
ggml: added cleanups in ggml_quantize_free (#19278)
Add missing cleanup calls for IQ2_S, IQ1_M quantization types and IQ3XS with 512 blocks during quantization cleanup.
2026-02-03 08:43:39 +02:00
Gaurav Garg
41e3f02647
cuda : revert CUDA_SCALE_LAUNCH_QUEUES override until investigated (#19227)
Hangs were reported on Jetson Orin AGX if we set CUDA_SCALE_LAUNCH_QUEUES=4x. Reverting the previous PR (#19042) and updating the document to consider setting CUDA_SCALE_LAUNCH_QUEUES=4x for faster throughput on multi-GPU systems.
2026-02-03 08:41:02 +02:00
lhez
91ea44e89b
opencl: refactor some ops, concat, repeat, tanh and scale (#19226)
* opencl: refactor concat

* opencl: refactor repeat

* opencl: refactor tanh

* opencl: enable fp16 for tanh

* opencl: refactor scale

* opencl: fix unused variables
2026-02-02 15:54:43 -08:00
Aman Gupta
9f682fb640
ggml-cpu: FA split across kv for faster TG (#19209)
* ggml-cpu: split across kv for faster TG

* simplify sinks application

* add ref impl
2026-02-03 01:19:55 +08:00
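Splitting attention across the KV dimension, as in commit 9f682fb640 above, relies on partial softmax results being mergeable. The Python sketch below shows the generic technique (per-chunk running max and sum, rescaled on merge); it is not taken from the ggml-cpu code, the function names are hypothetical, and it ignores sinks and masking.

```python
import math


def partial_attention(scores, values):
    """Process one KV chunk: return (max score, sum of exp weights,
    unnormalized weighted sum of the value vectors)."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    acc = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return m, total, acc


def merge_partials(parts):
    """Combine per-chunk results into the final softmax-weighted output."""
    m = max(p[0] for p in parts)
    total = 0.0
    acc = [0.0] * len(parts[0][2])
    for pm, ptotal, pacc in parts:
        scale = math.exp(pm - m)  # rescale each chunk to the global max
        total += ptotal * scale
        acc = [a + scale * x for a, x in zip(acc, pacc)]
    return [a / total for a in acc]


scores = [0.5, 2.0, -1.0, 3.0]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 0.5]]
# Each chunk could be handled by a different thread before the merge.
parts = [
    partial_attention(scores[:2], values[:2]),
    partial_attention(scores[2:], values[2:]),
]
print(merge_partials(parts))
```

During token generation there is a single query row, so splitting over KV chunks is one way to give every thread work; the merge step above is what keeps the result equal to an unsplit softmax.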
Neo Zhang
bf38346d13
Remove support for Nvidia & AMD GPUs because the oneAPI plugin for them is unavailable: the download/installation channels no longer work. (#19246)
Users can no longer build the software for Nvidia & AMD GPUs.
Remove oneMath, since it is only used in the NV and AMD code paths.
2026-02-02 21:06:21 +08:00
Tamar
4d5e972673
sycl: implement GGML_OP_TOP_K (#19242) 2026-02-02 21:05:51 +08:00
Georgi Gerganov
6fdddb4987
metal : support virtual devices (#18919)
* metal : support virtual devices

* cont : manage buffer type context memory

* metal : add events

* cont : implement cpy_tensor_async
2026-02-02 14:29:44 +02:00
Johannes Gäßler
59377a6c87
ggml-backend: fix async set/get fallback sync (#19179) 2026-02-02 10:00:05 +01:00
Christian Kastner
7a4ca3cbd9
docs : Minor cleanups (#19252)
* Update old URLs to github.com/ggml-org/

* Bump copyrights
2026-02-02 08:38:55 +02:00
Concedo
68f9c6df91 fix cuda graph spams 2026-02-02 11:28:50 +08:00
Nikhil Jain
2dc3ce2166
Remove pipeline cache mutexes (#19195)
* Remove mutex for pipeline caches, since they are now per-thread.

* Add comment

* Run clang-format

* Cleanup

* Run CI again

* Run CI once more

* Run clang-format
2026-02-01 18:47:29 -08:00
Max Krasnyansky
3bc8d2cf23
Bump cmake max version (needed for Windows on Snapdragon builds) (#19188)
* Bump max cmake version (needed for Windows on Snapdragon builds)

* cmake: move max version setting into ggml/CMakeLists
2026-02-01 14:13:38 -08:00
Concedo
ddce19db72 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.devops/nix/package-gguf-py.nix
#	.devops/nix/scope.nix
#	common/CMakeLists.txt
#	docs/backend/SYCL.md
#	examples/lookahead/lookahead.cpp
#	examples/lookup/lookup.cpp
#	examples/sycl/run-llama2.sh
#	examples/sycl/win-run-llama2.bat
#	examples/sycl/win-test.bat
#	ggml/src/ggml-hexagon/CMakeLists.txt
#	ggml/src/ggml-hexagon/htp/flash-attn-ops.c
#	ggml/src/ggml-hexagon/htp/hvx-dump.h
#	ggml/src/ggml-hexagon/htp/hvx-reduce.h
#	ggml/src/ggml-hexagon/htp/matmul-ops.c
#	ggml/src/ggml-hexagon/htp/softmax-ops.c
#	ggml/src/ggml-hexagon/htp/unary-ops.c
#	ggml/src/ggml-opencl/CMakeLists.txt
#	ggml/src/ggml-opencl/ggml-opencl.cpp
#	ggml/src/ggml-opencl/kernels/cvt.cl
#	scripts/sync-ggml.last
2026-02-01 22:35:25 +08:00
nullname
89f10baad5
ggml-hexagon: flash-attention and reduce-sum optimizations (#19141)
* wip

* ggml-hexagon: add vectorized dot product function for FP32 and FP16 accumulation

* ggml-hexagon: optimize dot product functions for FP16 and FP32 with new vectorized implementations

* wip

* ggml-hexagon: optimize hvx_vec_dump_f32_n and hvx_vec_reduce_sum_qf32x2 functions for improved performance

* ggml-hexagon: refactor dot product functions to use a common loading function for improved readability

* optimize vector dot product functions to use unified reduction for improved performance

* wip

* ggml-hexagon: add vectorized dot product function for FP32 and FP16 accumulation

* ggml-hexagon: optimize dot product functions for FP16 and FP32 with new vectorized implementations

* wip

* ggml-hexagon: optimize hvx_vec_dump_f32_n and hvx_vec_reduce_sum_qf32x2 functions for improved performance

* ggml-hexagon: refactor dot product functions to use a common loading function for improved readability

* optimize vector dot product functions to use unified reduction for improved performance

* hexagon: optimize reduce-sum for v75+

* hexagon: always keep row_sums in sf/fp32

* ggml-hexagon: enhance directory checks for HEXAGON_SDK_ROOT and HEXAGON_TOOLS_ROOT

* fix compiling error after rebase

---------

Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>
2026-01-30 21:14:20 -08:00
shaofeiqi
971facc38e
opencl: add optimized q8_0 mm kernel for adreno (#18871)
* Add Q8_0 OpenCL kernel

Co-authored-by: yunjie <yunjie@qti.qualcomm.com>

* opencl: fix build for non-adreno

* opencl: refactor q8_0

* opencl: enforce subgroup size of 64 for adreno for q8_0

* For A750 and older generations, subgroup size can be 64 or 128.
  This kernel assumes subgroup size 64.

* opencl: suppress warning when adreno kernels are disabled

---------

Co-authored-by: yunjie <yunjie@qti.qualcomm.com>
Co-authored-by: Li He <lih@qti.qualcomm.com>
2026-01-30 10:19:27 -08:00