Concedo
48f86bbbc7
tweaked text
2025-05-13 15:54:59 +08:00
Concedo
21e31e255b
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .github/workflows/build.yml
# .github/workflows/docker.yml
# README.md
# build-xcframework.sh
# common/CMakeLists.txt
# examples/CMakeLists.txt
# ggml/src/ggml-cpu/CMakeLists.txt
# ggml/src/ggml-cuda/CMakeLists.txt
# ggml/src/ggml-metal/ggml-metal.m
# ggml/src/ggml-metal/ggml-metal.metal
# ggml/src/ggml-sycl/CMakeLists.txt
# ggml/src/ggml-sycl/backend.hpp
# ggml/src/ggml-sycl/common.hpp
# ggml/src/ggml-sycl/ggml-sycl.cpp
# ggml/src/ggml-sycl/mmvq.cpp
# ggml/src/ggml-sycl/vecdotq.hpp
# scripts/compare-llama-bench.py
# src/CMakeLists.txt
# src/llama-model.cpp
# src/llama.cpp
# tests/test-backend-ops.cpp
# tests/test-opt.cpp
# tools/llama-bench/README.md
# tools/llama-bench/llama-bench.cpp
# tools/mtmd/CMakeLists.txt
# tools/mtmd/README.md
# tools/mtmd/clip.cpp
# tools/rpc/rpc-server.cpp
# tools/server/CMakeLists.txt
# tools/server/README.md
2025-05-13 00:28:35 +08:00
Johannes Gäßler
10d2af0eaa
llama/ggml: add LLM training support (#10544)
...
* llama/ggml: add LLM training support
more compact progress bar
llama_save_model_to_file
llama_opt_param_filter
ggml_graph_dup force_grads
refactor ggml_opt, fix test-opt
* remove logits_all
* refactor CUDA implementation for ACC
* reset graph at beginning of opt period
2025-05-12 14:44:49 +02:00
Dan Johansson
a71a4075cd
ggml-cpu: Integrate fp32=bf16xbf16 SME KleidiAI kernel (#13053)
...
* ggml-cpu: Integrate fp32=bf16xbf16 SME KleidiAI kernel
Signed-off-by: Dan Johansson <dan.johansson@arm.com>
* code review fixes
Signed-off-by: Dan Johansson <dan.johansson@arm.com>
* adds a comment that clarifies barrier usage
Signed-off-by: Dan Johansson <dan.johansson@arm.com>
---------
Signed-off-by: Dan Johansson <dan.johansson@arm.com>
Co-authored-by: Charles Xu <charles.xu@arm.com>
2025-05-12 13:06:19 +02:00
Johannes Gäßler
95e18884fc
CUDA: fix misaligned synchronization in FA (#13469)
2025-05-12 10:51:21 +02:00
Xuan-Son Nguyen
df8491922f
ggml : add mrope kernel for metal (#13457)
2025-05-12 10:29:13 +02:00
Atharva Dubey
14492144c2
enable dpcpp nightly builds with libraries (#13406)
2025-05-12 13:15:32 +08:00
Johannes Gäßler
7474e00b34
CUDA: fix crash with partial offloading of MoE (#13439)
2025-05-11 16:09:33 +02:00
David Huang
7f323a589f
Add --no-op-offload to improve -ot pp perf in MoE models like llama4 400B (#13386)
2025-05-11 14:18:39 +02:00
Johannes Gäßler
0208355f42
CUDA: fix race conditions in FlashAttention kernels (#13438)
2025-05-10 22:22:48 +02:00
Johannes Gäßler
d8919424f1
CUDA: fix FlashAttention on Turing (#13415)
2025-05-10 09:16:52 +02:00
Jeff Bolz
dc1d2adfc0
vulkan: scalar flash attention implementation (#13324)
...
* vulkan: scalar flash attention implementation
* vulkan: always use fp32 for scalar flash attention
* vulkan: use vector loads in scalar flash attention shader
* vulkan: remove PV matrix, helps with register usage
* vulkan: reduce register usage in scalar FA, but perf may be slightly worse
* vulkan: load each Q value once. optimize O reduction. more tuning
* vulkan: support q4_0/q8_0 KV in scalar FA
* CI: increase timeout to accommodate newly-supported tests
* vulkan: for scalar FA, select between 1 and 8 rows
* vulkan: avoid using Float16 capability in scalar FA
2025-05-10 08:07:07 +02:00
Georgi Gerganov
a62e1dfea1
metal : optimize MoE for large batches (#13388)
...
ggml-ci
(cherry picked from commit 611aa914ef)
2025-05-10 00:33:51 +08:00
Concedo
6bb44391bd
Merge commit '5c86c9ed3e' into concedo_experimental
...
# Conflicts:
# tools/imatrix/imatrix.cpp
# tools/mtmd/README.md
# tools/run/README.md
# tools/run/run.cpp
2025-05-10 00:30:18 +08:00
Alberto Cabrera Pérez
17512a94d6
sycl : implementation of reordered Q4_0 MMVQ for Intel GPUs (#12858)
...
* sycl : Implemented reorder Q4_0 mmvq
Signed-off-by: Alberto Cabrera <alberto.cabrera@codeplay.com>
* sycl : Fixed mmvq being called when reorder is disabled
* sycl : Improved comments in the quants header
Signed-off-by: Alberto Cabrera <alberto.cabrera@codeplay.com>
* Use static_assert
* safe_div -> ceil_div
* Clarify qi comment
* change the reorder tensor from init to execute OP
* dbg
* Undo changes to test-backend-ops
* Refactor changes on top of q4_0 reorder fix
* Missing Reverts
* Refactored opt_for_reorder logic to simplify code path
* Explicit inlining and unroll
* Renamed mul_mat_algo enum for consistency
---------
Signed-off-by: Alberto Cabrera <alberto.cabrera@codeplay.com>
Co-authored-by: romain.biessy <romain.biessy@codeplay.com>
2025-05-09 16:34:08 +01:00
Georgi Gerganov
611aa914ef
metal : optimize MoE for large batches (#13388)
...
ggml-ci
2025-05-09 15:14:56 +03:00
Johannes Gäßler
0cf6725e9f
CUDA: FA support for Deepseek (Ampere or newer) (#13306)
...
* CUDA: FA support for Deepseek (Ampere or newer)
* do loop unrolling via C++ template
2025-05-09 13:34:58 +02:00
Johannes Gäßler
5c86c9ed3e
CUDA: fix crash on large batch size for MoE models (#13384)
2025-05-09 12:14:04 +02:00
Concedo
0874cd231a
Merge remote-tracking branch 'jeffbolz/scalar_fa_3' into concedo_experimental
...
# Conflicts:
# .github/workflows/build.yml
2025-05-09 17:19:33 +08:00
Concedo
42f6930e13
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# ggml/src/ggml-rpc/ggml-rpc.cpp
2025-05-09 17:18:14 +08:00
Radoslav Gerganov
b486ba05bf
rpc : add rpc_msg_set_tensor_hash_req (#13353)
...
* rpc : add rpc_msg_set_tensor_hash_req
Use a dedicated struct for the request of RPC_CMD_SET_TENSOR_HASH, which
makes the code cleaner.
* fix
2025-05-09 10:31:07 +03:00
Jeff Bolz
02115dcd9a
vulkan: Allow up to 4096 elements for mul_mat_id row_ids (#13326)
...
This assert fired running Qwen_Qwen3-30B-A3B-Q2_K.gguf:
GGML_ASSERT(nei0 * nei1 <= 3072);
The tensor is 8 x 512. Increase this array size to accommodate.
2025-05-09 09:23:41 +02:00
Jeff Bolz
20a6246f29
vulkan: avoid using Float16 capability in scalar FA
2025-05-08 14:55:52 -05:00
Jeff Bolz
615958f42c
vulkan: for scalar FA, select between 1 and 8 rows
2025-05-08 14:34:59 -05:00
Concedo
b6220669f4
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .github/workflows/docker.yml
# Makefile
# examples/CMakeLists.txt
# ggml/CMakeLists.txt
# ggml/src/CMakeLists.txt
# ggml/src/ggml-sycl/common.hpp
# ggml/src/ggml-sycl/convert.cpp
# ggml/src/ggml-sycl/convert.hpp
# ggml/src/ggml-sycl/ggml-sycl.cpp
# scripts/sync-ggml.last
2025-05-08 23:07:33 +08:00
Alberto Cabrera Pérez
8733e0cf6e
sycl: addressing non-contiguous src1 mul_mats (nc and batched) (#13343)
...
* sycl: fixed non-contiguous src1 mul_mats (nc and batched)
* Fixed wrong static_cast inside kernel
2025-05-08 10:08:01 +01:00
Jeff Bolz
e66094276b
vulkan: support q4_0/q8_0 KV in scalar FA
2025-05-07 23:53:38 -05:00
Jeff Bolz
989bfb18fc
vulkan: load each Q value once. optimize O reduction. more tuning
2025-05-07 15:57:38 -05:00
Jeff Bolz
c747227a57
vulkan: reduce register usage in scalar FA, but perf may be slightly worse
2025-05-07 15:02:11 -05:00
Jeff Bolz
a6c940bb79
vulkan: remove PV matrix, helps with register usage
2025-05-07 13:46:35 -05:00
Jeff Bolz
876e6617a7
vulkan: use vector loads in scalar flash attention shader
2025-05-07 13:35:13 -05:00
Daniel Bevenius
13b0a04597
whisper: remove MSVC warning pragmas (whisper/3090)
...
* ggml : remove MSVC warning pragmas
This commit removes the MSVC-specific pragmas as these are now handled
in ggml/CMakeLists.txt.
* whisper : remove MSVC warning pragmas
This commit removes the MSVC-specific pragmas. These are now handled in
the ggml/CMakeLists.txt file.
2025-05-07 17:28:36 +03:00
Jared Tweed
bba9d945c1
cmake : removed stdc++fs (whisper/3097)
...
* removed stdc++fs
* kept line, but removed stdc++fs
2025-05-07 17:28:36 +03:00
Concedo
fa22c1a5a4
fixed cfg scale, but turns out it sucks. embedded aria2c into pyinstaller
2025-05-07 18:30:36 +08:00
R0CKSTAR
1f73301b63
cuda : remove nrows_x in mul_mat_q_process_tile (#13325)
...
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2025-05-07 09:48:23 +02:00
Jeff Bolz
3a8d954e0c
vulkan: always use fp32 for scalar flash attention
2025-05-06 23:08:39 -05:00
Johannes Gäßler
141a908a59
CUDA: mix virt/real CUDA archs for GGML_NATIVE=OFF (#13135)
2025-05-06 23:35:51 +02:00
Concedo
ffe23f0e93
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# ggml/src/ggml-sycl/ggml-sycl.cpp
# pyproject.toml
2025-05-06 23:39:45 +08:00
Concedo
1377a93a73
Merge commit '5215b91e93' into concedo_experimental
...
# Conflicts:
# .github/workflows/build.yml
# cmake/x64-windows-llvm.cmake
# ggml/src/ggml-rpc/ggml-rpc.cpp
# ggml/src/ggml-sycl/ggml-sycl.cpp
# tests/CMakeLists.txt
# tools/imatrix/imatrix.cpp
# tools/llava/clip.cpp
# tools/rpc/rpc-server.cpp
2025-05-06 23:15:04 +08:00
Akarshan Biswas
1e333d5bba
SYCL: Disable reorder optimize by default and stop setting tensor extras when optimize is disabled (#13254)
...
* SYCL: Do not set tensor extras when reorder optimize is disabled
* SYCL: Disable reorder optimize by default
2025-05-06 20:27:06 +05:30
Johannes Gäßler
2356fb1d53
CUDA: fix bad asserts for partial offload (#13337)
2025-05-06 13:58:51 +02:00
Johannes Gäßler
15a28ec8c7
CUDA: fix --split-mode row for MMQ (#13323)
2025-05-06 08:36:46 +02:00
Jeff Bolz
005756a2a9
vulkan: scalar flash attention implementation
2025-05-05 19:40:45 -05:00
Johannes Gäßler
9070365020
CUDA: fix logic for clearing padding with -ngl 0 (#13320)
2025-05-05 22:32:13 +02:00
Akarshan Biswas
66645a5285
SYCL: Disable mul_mat kernels for noncontiguous tensor b (#13308)
...
ggml-ci
2025-05-05 13:39:10 +05:30
Diego Devesa
9fdfcdaedd
rpc : use backend registry, support dl backends (#13304)
2025-05-04 21:25:43 +02:00
Aaron Teo
6eb7d25c70
ggml : activate s390x simd for Q3_K (#13301)
...
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-05-04 19:49:12 +02:00
Johannes Gäßler
93c4e23905
CUDA: fix race condition in MMQ stream-k fixup (#13299)
2025-05-04 14:16:39 +02:00
Johannes Gäßler
8afbd96818
CUDA: fix race condition in MMQ ids_dst (#13294)
2025-05-04 13:58:38 +02:00
Jeff Bolz
8ae5ebcf85
vulkan: Additional type support for unary, binary, and copy (#13266)
...
Support f16->f32 copy.
Support f16->f16 and f32->f32 unary ops.
Support all combinations of f16/f32 for src0/src1/dst for add/sub/mul/div.
2025-05-04 07:17:16 +02:00