Commit graph

7439 commits

Author SHA1 Message Date
Concedo
103d60ed2c Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	common/common.cpp
#	examples/batched-bench/batched-bench.cpp
#	examples/batched/batched.cpp
#	examples/export-lora/export-lora.cpp
#	examples/gritlm/gritlm.cpp
#	examples/parallel/parallel.cpp
#	examples/passkey/passkey.cpp
#	examples/speculative-simple/speculative-simple.cpp
#	examples/speculative/speculative.cpp
#	ggml/src/ggml-cann/CMakeLists.txt
#	ggml/src/ggml-cann/acl_tensor.cpp
#	ggml/src/ggml-cann/acl_tensor.h
#	ggml/src/ggml-cann/aclnn_ops.cpp
#	ggml/src/ggml-cann/aclnn_ops.h
#	ggml/src/ggml-vulkan/CMakeLists.txt
#	tests/test-arg-parser.cpp
#	tests/test-backend-ops.cpp
2025-04-03 18:57:49 +08:00
a3sh
193c3e03a6
fix MUSA compiler warning (#12704)
* fix MUSA compiler warning

* replace (void) with GGML_UNUSED
2025-04-03 09:32:55 +02:00
Chenguang Li
65cfe136a0
CANN: Support operator SIN COS ARGMAX (#12709)
* [CANN]support sin cos argmax

Signed-off-by: noemotiovon <noemotiovon@gmail.com>

* [CANN]codestyle adjustment

Signed-off-by: noemotiovon <noemotiovon@gmail.com>

* [CANN]Remove redundant code

Signed-off-by: noemotiovon <noemotiovon@gmail.com>

---------

Signed-off-by: noemotiovon <noemotiovon@gmail.com>
Co-authored-by: noemotiovon <noemotiovon@gmail.com>
2025-04-03 15:18:08 +08:00
Alan Gray
3f9da22c2b
Simplify and improve CUDA graphs through use of indirect copy pointers (#9017)
* CUDA: Simplify and improve CUDA graphs through use of indirect copy pointers

Previously there was complexity in the CUDA graphs implementation due
frequently changing parameters to copy kernels associated with K and V
cache pointers. This patch simplifies by using indirection to avoid
such parameters frequently changing, avoiding the need for frequent
graph updates.

Fixes #12152

* Addressed comments

* fix HIP builds

* properly sync to stream

* removed ggml_cuda_cpy_fn_ptrs

* move stream sync before free

* guard to only use indirection with graphs

* style fixes

* check for errors

---------

Co-authored-by: slaren <slarengh@gmail.com>
2025-04-03 03:31:15 +02:00
hipudding
2a0dc97e56
CANN: Fix failed test cases (#12708)
* CANN: Fix memory waste in aclnn_tensor

* CANN: fix backend ops fail

* CANN: fix acl_tensor memory alloc.

* CANN: format

* CANN: remove trailing whitespace
2025-04-03 08:49:51 +08:00
lhez
97a20c012b
opencl: use max_alloc_size in backend ctx instead of querying again (#12705) 2025-04-02 17:01:42 -07:00
Jeff Bolz
f01bd02376
vulkan: Implement split_k for coopmat2 flash attention. (#12627)
When using group query attention, we have one workgroup per KV batch and this
can be very few workgroups (e.g. just 8 in some models). Enable split_k to
spread the work across SMs. This helps a lot when the KV cache is large.
2025-04-02 14:25:08 -05:00
bandoti
6f3bd38640
cmake: remove caching from vulkan coopmat checks (#12719) 2025-04-02 14:56:26 -03:00
Jeff Bolz
be0a0f8cae
vulkan: Implement grouped query attention in the coopmat2 FA shader (#12559)
When adjacent batches of Q share the same batches of K/V, batch them into
the same workgroup. For example, when:

dst(128,32,1,1) = FA(q(128,1,32,1), k(128,16640,8,1), v(128,16640,8,1))

previously we would run 32 workgroups computing 1 result each, now we will
run 8 workgroups computing 4 results each.

This doesn't directly translate to better performance (at least when you have
>=32 SMs), but in a subsequent change I'll enable split_k which will scale much
better with 4x fewer workgroups.
2025-04-02 19:40:32 +02:00
0cc4m
92e3006bb6
Vulkan: Fix mmq int dot float cache size (#12722) 2025-04-02 19:12:30 +02:00
Georgi Gerganov
833e2b7409
model : print tensor size during load (#12711)
* model : print tensor size during load

* cont : fix units MB -> MiB

Co-authored-by: Diego Devesa <slarengh@gmail.com>

---------

Co-authored-by: Diego Devesa <slarengh@gmail.com>
2025-04-02 16:38:54 +03:00
Diego Devesa
e0e912f49b
llama : add option to override model tensor buffers (#11397)
* llama : add option to override tensor buffers

* ggml : fix possible underflow in ggml_nbytes
2025-04-02 14:52:01 +02:00
Georgi Gerganov
a10b36c91a
llama : refactor kv cache guard (#12695)
* llama : refactor kv cache guard

ggml-ci

* cont : fix comment [no ci]

* llama : fix kv_cache restore logic

ggml-ci

* context : simplify kv cache updates

ggml-ci

* cont : better name [no ci]

* llama : fix llama_decode return code when could not find KV slot

ggml-ci

* context : change log err -> warn [no ci]

* kv-cache : add comment + warning
2025-04-02 14:32:59 +03:00
Concedo
7f1003be44 warning for max tokens being too high 2025-04-02 18:58:38 +08:00
Sigbjørn Skjæret
83a88bd6af
vocab : BailingMoE : change possessive quantifiers to greedy (#12677) 2025-04-02 11:21:48 +02:00
Xuan-Son Nguyen
42eb248f46
common : remove json.hpp from common.cpp (#12697)
* common : remove json.hpp from common.cpp

* fix comment
2025-04-02 09:58:34 +02:00
Chenguang Li
9bacd6b374
[CANN] get_rows and dup optimization (#12671)
* [CANN]get_rows and dup optimization.

Co-authored-by: hipudding <huafengchun@gmail.com>
Signed-off-by: noemotiovon <noemotiovon@gmail.com>

* [CANN]GET_ROWS and CPY/DUP optimization

Co-authored-by: hipudding <huafengchun@gmail.com>
Signed-off-by: noemotiovon <noemotiovon@gmail.com>

* [CANN]code style adjustment

Signed-off-by: noemotiovon <noemotiovon@gmail.com>

* [CANN]code style adjustment

Signed-off-by: noemotiovon <noemotiovon@gmail.com>

* [CANN]code style adjustment

Signed-off-by: noemotiovon <noemotiovon@gmail.com>

* [CANN]code style adjustment

Signed-off-by: noemotiovon <noemotiovon@gmail.com>

---------

Signed-off-by: noemotiovon <noemotiovon@gmail.com>
Co-authored-by: noemotiovon <noemotiovon@gmail.com>
Co-authored-by: hipudding <huafengchun@gmail.com>
2025-04-02 15:22:13 +08:00
Concedo
669311365c fixed gemma system prompt 2025-04-02 13:58:51 +08:00
Xuan-Son Nguyen
267c1399f1
common : refactor downloading system, handle mmproj with -hf option (#12694)
* (wip) refactor downloading system [no ci]

* fix all examples

* fix mmproj with -hf

* gemma3: update readme

* only handle mmproj in llava example

* fix multi-shard download

* windows: fix problem with std::min and std::max

* fix 2
2025-04-01 23:44:05 +02:00
Junil Kim
f423981ac8
opencl : fix memory allocation size (#12649)
issue:
https://github.com/CodeLinaro/llama.cpp/pull/17#issuecomment-2760611283

This patch fixes the memory allocation size
not exceeding the maximum size of the OpenCL device.
2025-04-01 09:54:34 -07:00
Concedo
fbf5c04c3c silly me 2025-04-02 00:51:05 +08:00
Concedo
30e3d24ead embd include name 2025-04-02 00:40:38 +08:00
Concedo
e37f27632f clear cpu flag manually for templates, added truncation for embeddings 2025-04-02 00:18:30 +08:00
jklincn
e39e727e9a
llama : use LLM_KV_GENERAL_FILE_TYPE instead of gguf_find_key (#12672) 2025-04-01 14:54:28 +02:00
Sigbjørn Skjæret
5936a616e4
convert : BailingMoE : fix qkv split when head_dim is 0 (#12687)
NOTE: Ling-lite-base is broken, see https://huggingface.co/inclusionAI/Ling-lite-base/discussions/2
2025-04-01 14:37:13 +02:00
Concedo
8a4a9b8c19 Merge branch 'upstream' into concedo_experimental 2025-04-01 20:16:16 +08:00
Concedo
9e182b3e78 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.github/workflows/build.yml
#	README.md
#	docs/backend/SYCL.md
#	ggml/src/ggml-sycl/CMakeLists.txt
#	ggml/src/ggml-vulkan/CMakeLists.txt
#	ggml/src/ggml-vulkan/ggml-vulkan.cpp
#	scripts/sync-ggml.last
#	tests/test-chat-template.cpp
2025-04-01 20:16:07 +08:00
Georgi Gerganov
3fd072a540
metal : use F32 prec in FA kernels (#12688)
* metal : use F32 prec in FA kernels

ggml-ci

* cont : fix FA vec kernel

ggml-ci
2025-04-01 14:57:19 +03:00
Concedo
0fd94e19f3 made tool calls more robust and allowed tool call template customization 2025-04-01 19:16:45 +08:00
R0CKSTAR
a6f32f0b34
Fix clang warning in gguf_check_reserved_keys (#12686)
* Fix clang warning in gguf_check_reserved_keys

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* Fix typo

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

---------

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2025-04-01 13:12:53 +02:00
Wagner Bruna
2bb3597e42
vulkan: fix build when glslc doesn't support coopmat (#12683) 2025-04-01 11:38:07 +02:00
Romain Biessy
8293970542
SYCL: Rename oneMKL to oneMath (#12192)
* Rename oneMKL Interface to oneMath

* Use oneMath for Intel vendor

* Rename occurences to mkl

* clang-format

* Silence verbose warnings

* Set oneMath HIP_TARGETS

* Fix silence warnings

* Remove step to build oneMath from build instructions

* Use fixed oneMath version

* Remove INTEL_CPU

* Fold CMake oneDNN conditions

* Use Intel oneMKL for Intel devices

* Improve CMake message

* Link against MKL::MKL_SYCL::BLAS only

* Move oneMath documentation to Nvidia and AMD sections
2025-04-01 16:24:29 +08:00
Akarshan Biswas
8bbf26083d
SYCL: switch to SYCL namespace (#12674) 2025-04-01 10:11:39 +02:00
henk717
4291e1575b
Fix tool spec, this spec is kinda.... (#1458) 2025-04-01 10:39:02 +08:00
Sigbjørn Skjæret
35782aeedb
convert : BailingMoE : avoid setting rope_dim to 0 (#12678) 2025-03-31 23:09:48 +02:00
Daniel Bevenius
c80a7759da
vocab : add special infill tokens for CodeLlama (#11850)
* vocab : add special infill tokens for CodeLlama

The commit adds the following special tokens for CodeLlama infill:
- `▁<PRE>`
- `▁<SUF>`
- `▁<MID>`

The motivation for this is that currently the infill example uses
CodeLlama as a suggested model. But when using this model the following
error is generated:
```console
/llama.cpp-debug/examples/infill/infill.cpp:165: GGML_ASSERT(llama_vocab_fim_pre(vocab) >= 0) failed

Could not attach to process.  If your uid matches the uid of the target
process, check the setting of /proc/sys/kernel/yama/ptrace_scope, or try
again as the root user.  For more details, see /etc/sysctl.d/10-ptrace.conf
ptrace: Operation not permitted.
No stack.
The program is not being run.
305251 Aborted                 (core dumped)
./build/bin/llama-infill -t 10 -ngl 0 -m models/codellama-13b.Q5_K_S.gguf \
  -c 4096 --temp 0.7 --repeat_penalty 1.1 -n 20 \
  --in-prefix "def helloworld():\n    print(\"hell" \
  --in-suffix "\n   print(\"goodbye world\")\n    "
```

* squash! vocab : add special infill tokens for CodeLlama

Add _<EOT> as well.
2025-03-31 18:40:56 +02:00
Concedo
c0adaabfa4 Revert "try fix owui"
This reverts commit 12e5b8abdb.
2025-04-01 00:27:31 +08:00
Concedo
12e5b8abdb try fix owui 2025-04-01 00:23:45 +08:00
a3sh
250d7953e8
ggml : faster ssm scan (#10558)
* faster ssm_scan

* delete unused commnet

* clang format

* add space

* modify unnecessary calculations

* faster ssm conv implementatioin

* modify file name with dash
2025-03-31 18:05:13 +02:00
Concedo
0ed95fcccc fixed l3 template, add index 2025-03-31 23:59:06 +08:00
Sigbjørn Skjæret
403fbacbbc
convert : Qwerky : use lora_rank_tokenshift and lora_rank_decay if present (#12667) 2025-03-31 16:36:25 +02:00
0cc4m
a8a1f33567
Vulkan: Add DP4A MMQ and Q8_1 quantization shader (#12135)
* Vulkan: Add DP4A MMQ and Q8_1 quantization shader

* Add q4_0 x q8_1 matrix matrix multiplication support

* Vulkan: Add int8 coopmat MMQ support

* Vulkan: Add q4_1, q5_0 and q5_1 quants, improve integer dot code

* Add GL_EXT_integer_dot_product check

* Remove ggml changes, fix mmq pipeline picker

* Remove ggml changes, restore Intel coopmat behaviour

* Fix glsl compile attempt when integer vec dot is not supported

* Remove redundant code, use non-saturating integer dot, enable all matmul sizes for mmq

* Remove redundant comment

* Fix integer dot check

* Fix compile issue with unsupported int dot glslc

* Update Windows build Vulkan SDK version
2025-03-31 14:37:01 +02:00
Georgi Gerganov
1790e73157 cmake : fix whitespace (#0) 2025-03-31 15:07:32 +03:00
Georgi Gerganov
0114a32da0 sync : ggml
ggml-ci
2025-03-31 15:07:32 +03:00
Sandro Hanea
a7724480fd cmake: improve Vulkan cooperative matrix support checks (whisper/2966)
Co-authored-by: Sandro Hanea <me@sandro.rocks>
2025-03-31 15:07:32 +03:00
Sigbjørn Skjæret
1a85949067
llava : proper description fix (#12668) 2025-03-31 11:28:30 +02:00
Akarshan Biswas
6c02a032fa
SYCL: Remove misleading ggml_sycl_op_flatten function (#12387)
* SYCL: Remove misleading ggml_sycl_op_flatten function

* remove trailing whitespace

* Fix L2 norm from rebase

* remove try catch block from element_wise.cpp

* remove comment from common.hp

* ggml-sycl.cpp: Add try catch sycl::exception block in compute_forward

* norm.cpp: remove try catch exception block
2025-03-31 11:25:24 +02:00
Sigbjørn Skjæret
f52d59d771
llava : fix clip loading GGUFs with missing description (#12660) 2025-03-31 11:07:07 +02:00
Concedo
1ebadc515e add streaming support for oai tools (+2 squashed commit)
Squashed commit:

[4d080b37] qwen2.5vl surgery script

[4bebe7e5] add streaming support for oai tools
2025-03-31 16:49:15 +08:00
marcoStocchi
52de2e5949
tts : remove printfs (#12640)
* tts.cpp : llama tokens console output is done using LOG_INF instead of printf(). Therefore the options '--log-disable' and '--log-file' have now uniform impact on all output.
2025-03-31 11:20:30 +03:00