Commit graph

1026 commits

Author SHA1 Message Date
Concedo
48f86bbbc7 tweaked text 2025-05-13 15:54:59 +08:00
Concedo
21e31e255b Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.github/workflows/build.yml
#	.github/workflows/docker.yml
#	README.md
#	build-xcframework.sh
#	common/CMakeLists.txt
#	examples/CMakeLists.txt
#	ggml/src/ggml-cpu/CMakeLists.txt
#	ggml/src/ggml-cuda/CMakeLists.txt
#	ggml/src/ggml-metal/ggml-metal.m
#	ggml/src/ggml-metal/ggml-metal.metal
#	ggml/src/ggml-sycl/CMakeLists.txt
#	ggml/src/ggml-sycl/backend.hpp
#	ggml/src/ggml-sycl/common.hpp
#	ggml/src/ggml-sycl/ggml-sycl.cpp
#	ggml/src/ggml-sycl/mmvq.cpp
#	ggml/src/ggml-sycl/vecdotq.hpp
#	scripts/compare-llama-bench.py
#	src/CMakeLists.txt
#	src/llama-model.cpp
#	src/llama.cpp
#	tests/test-backend-ops.cpp
#	tests/test-opt.cpp
#	tools/llama-bench/README.md
#	tools/llama-bench/llama-bench.cpp
#	tools/mtmd/CMakeLists.txt
#	tools/mtmd/README.md
#	tools/mtmd/clip.cpp
#	tools/rpc/rpc-server.cpp
#	tools/server/CMakeLists.txt
#	tools/server/README.md
2025-05-13 00:28:35 +08:00
Johannes Gäßler
10d2af0eaa
llama/ggml: add LLM training support (#10544)
* llama/ggml: add LLM training support

more compact progress bar

llama_save_model_to_file

llama_opt_param_filter

ggml_graph_dup force_grads

refactor ggml_opt, fix test-opt

* remove logits_all

* refactor CUDA implementation for ACC

* reset graph at beginning of opt period
2025-05-12 14:44:49 +02:00
Dan Johansson
a71a4075cd
ggml-cpu: Integrate fp32=bf16xbf16 SME KleidiAI kernel (#13053)
* ggml-cpu: Integrate fp32=bf16xbf16 SME KleidiAI kernel

Signed-off-by: Dan Johansson <dan.johansson@arm.com>

* * code review fixes

Signed-off-by: Dan Johansson <dan.johansson@arm.com>

* * adds a comment that clarifies barrier usage

Signed-off-by: Dan Johansson <dan.johansson@arm.com>

---------

Signed-off-by: Dan Johansson <dan.johansson@arm.com>
Co-authored-by: Charles Xu <charles.xu@arm.com>
2025-05-12 13:06:19 +02:00
Johannes Gäßler
95e18884fc
CUDA: fix misaligned synchronization in FA (#13469) 2025-05-12 10:51:21 +02:00
Xuan-Son Nguyen
df8491922f
ggml : add mrope kernel for metal (#13457) 2025-05-12 10:29:13 +02:00
Atharva Dubey
14492144c2
enable dpcpp nightly builds with libraries (#13406) 2025-05-12 13:15:32 +08:00
Johannes Gäßler
7474e00b34
CUDA: fix crash with partial offloading of MoE (#13439) 2025-05-11 16:09:33 +02:00
David Huang
7f323a589f
Add --no-op-offload to improve -ot pp perf in MoE models like llama4 400B (#13386) 2025-05-11 14:18:39 +02:00
Johannes Gäßler
0208355f42
CUDA: fix race conditions FlashAttention kernels (#13438) 2025-05-10 22:22:48 +02:00
Johannes Gäßler
d8919424f1
CUDA: fix FlashAttention on Turing (#13415) 2025-05-10 09:16:52 +02:00
Jeff Bolz
dc1d2adfc0
vulkan: scalar flash attention implementation (#13324)
* vulkan: scalar flash attention implementation

* vulkan: always use fp32 for scalar flash attention

* vulkan: use vector loads in scalar flash attention shader

* vulkan: remove PV matrix, helps with register usage

* vulkan: reduce register usage in scalar FA, but perf may be slightly worse

* vulkan: load each Q value once. optimize O reduction. more tuning

* vulkan: support q4_0/q8_0 KV in scalar FA

* CI: increase timeout to accommodate newly-supported tests

* vulkan: for scalar FA, select between 1 and 8 rows

* vulkan: avoid using Float16 capability in scalar FA
2025-05-10 08:07:07 +02:00
Georgi Gerganov
a62e1dfea1 metal : optimize MoE for large batches (#13388)
ggml-ci

(cherry picked from commit 611aa914ef)
2025-05-10 00:33:51 +08:00
Concedo
6bb44391bd Merge commit '5c86c9ed3e' into concedo_experimental
# Conflicts:
#	tools/imatrix/imatrix.cpp
#	tools/mtmd/README.md
#	tools/run/README.md
#	tools/run/run.cpp
2025-05-10 00:30:18 +08:00
Alberto Cabrera Pérez
17512a94d6
sycl : implementation of reordered Q4_0 MMVQ for Intel GPUs (#12858)
* sycl : Implemented reorder Q4_0 mmvq

Signed-off-by: Alberto Cabrera <alberto.cabrera@codeplay.com>

* sycl : Fixed mmvq being called when reorder is disabled

* sycl : Improved comments in the quants header

Signed-off-by: Alberto Cabrera <alberto.cabrera@codeplay.com>

* Use static_assert

* safe_div -> ceil_div

* Clarify qi comment

* change the reorder tensor from init to execute OP

* dbg

* Undo changes to test-backend-ops

* Refactor changes on top of q4_0 reorder fix

* Missing Reverts

* Refactored opt_for_reorder logic to simplify code path

* Explicit inlining and unroll

* Renamed mul_mat_algo enum for consistency

---------

Signed-off-by: Alberto Cabrera <alberto.cabrera@codeplay.com>
Co-authored-by: romain.biessy <romain.biessy@codeplay.com>
2025-05-09 16:34:08 +01:00
Georgi Gerganov
611aa914ef
metal : optimize MoE for large batches (#13388)
ggml-ci
2025-05-09 15:14:56 +03:00
Johannes Gäßler
0cf6725e9f
CUDA: FA support for Deepseek (Ampere or newer) (#13306)
* CUDA: FA support for Deepseek (Ampere or newer)

* do loop unrolling via C++ template
2025-05-09 13:34:58 +02:00
Johannes Gäßler
5c86c9ed3e
CUDA: fix crash on large batch size for MoE models (#13384) 2025-05-09 12:14:04 +02:00
Concedo
0874cd231a Merge remote-tracking branch 'jeffbolz/scalar_fa_3' into concedo_experimental
# Conflicts:
#	.github/workflows/build.yml
2025-05-09 17:19:33 +08:00
Concedo
42f6930e13 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	ggml/src/ggml-rpc/ggml-rpc.cpp
2025-05-09 17:18:14 +08:00
Radoslav Gerganov
b486ba05bf
rpc : add rpc_msg_set_tensor_hash_req (#13353)
* rpc : add rpc_msg_set_tensor_hash_req

Use a dedicated struct for the request of RPC_CMD_SET_TENSOR_HASH which
makes the code cleaner.

* fix
2025-05-09 10:31:07 +03:00
Jeff Bolz
02115dcd9a
vulkan: Allow up to 4096 elements for mul_mat_id row_ids (#13326)
This assert fired running Qwen_Qwen3-30B-A3B-Q2_K.gguf:

GGML_ASSERT(nei0 * nei1 <= 3072);

The tensor is 8 x 512. Increase this array size to accommodate.
2025-05-09 09:23:41 +02:00
Jeff Bolz
20a6246f29 vulkan: avoid using Float16 capability in scalar FA 2025-05-08 14:55:52 -05:00
Jeff Bolz
615958f42c vulkan: for scalar FA, select between 1 and 8 rows 2025-05-08 14:34:59 -05:00
Concedo
b6220669f4 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.github/workflows/docker.yml
#	Makefile
#	examples/CMakeLists.txt
#	ggml/CMakeLists.txt
#	ggml/src/CMakeLists.txt
#	ggml/src/ggml-sycl/common.hpp
#	ggml/src/ggml-sycl/convert.cpp
#	ggml/src/ggml-sycl/convert.hpp
#	ggml/src/ggml-sycl/ggml-sycl.cpp
#	scripts/sync-ggml.last
2025-05-08 23:07:33 +08:00
Alberto Cabrera Pérez
8733e0cf6e
sycl: addressing non-contiguous src1 mul_mats (nc and batched) (#13343)
* sycl: fixed non-contiguous src1 mul_mats (nc and batched)

* Fixed wrong static_cast inside kernel
2025-05-08 10:08:01 +01:00
Jeff Bolz
e66094276b vulkan: support q4_0/q8_0 KV in scalar FA 2025-05-07 23:53:38 -05:00
Jeff Bolz
989bfb18fc vulkan: load each Q value once. optimize O reduction. more tuning 2025-05-07 15:57:38 -05:00
Jeff Bolz
c747227a57 vulkan: reduce register usage in scalar FA, but perf may be slightly worse 2025-05-07 15:02:11 -05:00
Jeff Bolz
a6c940bb79 vulkan: remove PV matrix, helps with register usage 2025-05-07 13:46:35 -05:00
Jeff Bolz
876e6617a7 vulkan: use vector loads in scalar flash attention shader 2025-05-07 13:35:13 -05:00
Daniel Bevenius
13b0a04597 whisper: remove MSVC warnings pragmas (whisper/3090)
* ggml : remove MSVC warnings pragmas

This commit removes the MSVC-specific pragmas as these are now handled
in ggml/CMakeLists.txt.

* whisper : remove MSVC warning pragmas

This commit removes the MSVC-specific pragmas. These are now handled in
the ggml/CMakeLists.txt file.
2025-05-07 17:28:36 +03:00
Jared Tweed
bba9d945c1 cmake : removed stdc++fs (whisper/3097)
* removed stdc++fs

* kept line, but removed stdc++fs
2025-05-07 17:28:36 +03:00
Concedo
fa22c1a5a4 fixed cfg scale, but turns out it sucks. embedded aria2c into pyinstaller 2025-05-07 18:30:36 +08:00
R0CKSTAR
1f73301b63
cuda : remove nrows_x in mul_mat_q_process_tile (#13325)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2025-05-07 09:48:23 +02:00
Jeff Bolz
3a8d954e0c vulkan: always use fp32 for scalar flash attention 2025-05-06 23:08:39 -05:00
Johannes Gäßler
141a908a59
CUDA: mix virt/real CUDA archs for GGML_NATIVE=OFF (#13135) 2025-05-06 23:35:51 +02:00
Concedo
ffe23f0e93 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	ggml/src/ggml-sycl/ggml-sycl.cpp
#	pyproject.toml
2025-05-06 23:39:45 +08:00
Concedo
1377a93a73 Merge commit '5215b91e93' into concedo_experimental
# Conflicts:
#	.github/workflows/build.yml
#	cmake/x64-windows-llvm.cmake
#	ggml/src/ggml-rpc/ggml-rpc.cpp
#	ggml/src/ggml-sycl/ggml-sycl.cpp
#	tests/CMakeLists.txt
#	tools/imatrix/imatrix.cpp
#	tools/llava/clip.cpp
#	tools/rpc/rpc-server.cpp
2025-05-06 23:15:04 +08:00
Akarshan Biswas
1e333d5bba
SYCL: Disable reorder optimize by default and stop setting tensor extras when optimize is disabled (#13254)
* SYCL: Do not set tensor extras when reorder optimize is disabled

* SYCL: Disable reorder optimize by default
2025-05-06 20:27:06 +05:30
Johannes Gäßler
2356fb1d53
CUDA: fix bad asserts for partial offload (#13337) 2025-05-06 13:58:51 +02:00
Johannes Gäßler
15a28ec8c7
CUDA: fix --split-mode row for MMQ (#13323) 2025-05-06 08:36:46 +02:00
Jeff Bolz
005756a2a9 vulkan: scalar flash attention implementation 2025-05-05 19:40:45 -05:00
Johannes Gäßler
9070365020
CUDA: fix logic for clearing padding with -ngl 0 (#13320) 2025-05-05 22:32:13 +02:00
Akarshan Biswas
66645a5285
SYCL: Disable mul_mat kernels for noncontiguous tensor b (#13308)
ggml-ci
2025-05-05 13:39:10 +05:30
Diego Devesa
9fdfcdaedd
rpc : use backend registry, support dl backends (#13304) 2025-05-04 21:25:43 +02:00
Aaron Teo
6eb7d25c70
ggml : activate s390x simd for Q3_K (#13301)
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-05-04 19:49:12 +02:00
Johannes Gäßler
93c4e23905
CUDA: fix race condition in MMQ stream-k fixup (#13299) 2025-05-04 14:16:39 +02:00
Johannes Gäßler
8afbd96818
CUDA: fix race condition in MMQ ids_dst (#13294) 2025-05-04 13:58:38 +02:00
Jeff Bolz
8ae5ebcf85
vulkan: Additional type support for unary, binary, and copy (#13266)
Support f16->f32 copy.
Support f16->f16 and f32->f32 unary ops.
Support all combinations of f16/f32 for src0/src1/dst for add/sub/mul/div.
2025-05-04 07:17:16 +02:00