Sigbjørn Skjæret
a686171ea7
convert : Support chat_template.json ( #12460 )
2025-03-19 08:58:13 +01:00
Jeff Bolz
c446b2edd2
vulkan: Submit once enough matmul work has been recorded ( #12406 )
I've been seeing significantly worse performance for token generation (tg) with
flash attention enabled vs. disabled, and it seems to be related to the submit
heuristic. Change the heuristic to track how many bytes' worth of weight
matrices have been used and flush every 100 MB, ramping up after the first few
submits. This seems to resolve the issue, and also improves non-FA performance a bit.
2025-03-19 08:26:26 +01:00
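A minimal sketch of a byte-count submit heuristic like the one described in the commit above; the struct, names, and ramp thresholds are illustrative assumptions, not the actual ggml-vulkan code:
```cpp
#include <cstddef>
#include <cstdint>

// Illustrative sketch: instead of counting recorded ops, accumulate how many
// bytes of weight matrices each matmul touches and flush the command buffer
// once a threshold is crossed. Names and numbers are assumptions.
struct submit_tracker {
    uint64_t bytes_since_submit = 0;
    int      submit_count       = 0;

    // Ramp up: use smaller thresholds for the first few submits so work
    // reaches the GPU early, then settle at ~100 MB per submit.
    uint64_t threshold() const {
        const uint64_t full = 100ull * 1024 * 1024; // 100 MB
        if (submit_count < 3) {
            return full >> (3 - submit_count);      // 12.5 / 25 / 50 MB ramp
        }
        return full;
    }

    // Called after recording a matmul; returns true when the caller
    // should submit the command buffer.
    bool record_matmul(size_t weight_bytes) {
        bytes_since_submit += weight_bytes;
        if (bytes_since_submit >= threshold()) {
            bytes_since_submit = 0;
            submit_count++;
            return true;
        }
        return false;
    }
};
```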
lhez
d84635b1b0
opencl: improve profiling ( #12442 )
* opencl: more profiling timing
* opencl: generate trace for profiling
* opencl: reduce profiling overhead
* Populate profiling timing info at the end rather than after each
kernel run
* opencl: fix for chrome tracing
2025-03-18 12:54:55 -07:00
Georgi Gerganov
75422e8bc4
graph : normalize Q, K, V shapes + sync cross attention ( #12449 )
* graph : normalize Q, K, V shapes and add comments
ggml-ci
* context : synchronize before getting cross attention data
* model : fix command-r attention norm check
2025-03-18 21:35:19 +02:00
R0CKSTAR
bb115d2bf7
musa: override warp_size of musa device to 32 ( #12445 )
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2025-03-18 19:28:26 +01:00
Xuan-Son Nguyen
29fff308c7
llama : support converting Mistral Small text-only ( #12450 )
2025-03-18 19:16:19 +01:00
Georgi Gerganov
c6af2161b2
speculative : fix seg fault in certain cases ( #12454 )
2025-03-18 19:35:11 +02:00
Xuan-Son Nguyen
99aa304fb9
llama : add support for EXAONE tied word embeddings ( #12451 )
2025-03-18 17:24:33 +01:00
Georgi Gerganov
8551c44d84
context : always use non-causal attention for encoder graphs ( #12447 )
* context : always use non-causal attention for encoder graphs
ggml-ci
* context : move the change to llama_context::encode()
ggml-ci
2025-03-18 13:05:49 +02:00
Łukasz Ślusarczyk
35cae5ba05
SYCL: using graphs is configurable by environment variable and compile option ( #12371 )
* alberto changes
* enable sycl graphs by env variable
* fixed compilation warnings in ggml-sycl.cpp
* renamed graph variables
* fix markdown in docs/backend/SYCL.md
Co-authored-by: Romain Biessy <romain.biessy@codeplay.com>
* fix markdown in docs/backend/SYCL.md again
* compiling graphs by default, renamed graph_enable to graph_disable
---------
Co-authored-by: Romain Biessy <romain.biessy@codeplay.com>
2025-03-18 11:16:31 +01:00
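A rough sketch of gating a feature on both a compile option and a runtime environment variable, as this commit does for SYCL graphs; the macro and env-var names here are assumptions for illustration, not necessarily the exact ones used in ggml-sycl:
```cpp
#include <cstdlib>

// Illustrative gating of an optional backend feature on a compile option
// plus a runtime environment variable. Graphs are compiled in by default
// and the runtime switch is an opt-out, mirroring the
// graph_enable -> graph_disable rename in this commit.
static bool sycl_graphs_enabled() {
#ifdef GGML_SYCL_GRAPH                          // assumed compile option name
    const char * disable = std::getenv("GGML_SYCL_DISABLE_GRAPH"); // assumed env var
    return disable == nullptr || disable[0] == '0';
#else
    return false;                               // feature compiled out entirely
#endif
}
```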
Georgi Gerganov
810e0af3f5
server : fix warmup draft cache type ( #12446 )
ggml-ci
2025-03-18 12:05:42 +02:00
Prajwal B Mehendarkar
eba92d64c3
cmake : fix PowerPC build ( #12241 )
Closes #12240
2025-03-18 11:37:33 +02:00
fj-y-saito
d9a14523bb
ggml : add SVE support for q6_K_q8_K ( #12361 )
2025-03-18 10:14:39 +02:00
0cc4m
fd123cfead
Vulkan: Default to 1GB allocations instead of 4GB to avoid fragmentation and driver issues ( #12434 )
2025-03-18 07:21:40 +01:00
Łukasz Ślusarczyk
a53f7f7b88
fixed compilation warnings in ggml-sycl ( #12424 )
2025-03-18 08:51:25 +08:00
Molly Sophia
7dfad387e3
llama: Add support for RWKV v7 architecture ( #12412 )
* ggml: Add op l2_norm
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
* ggml: Add op rwkv_wkv7
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
* llama: Add support for RWKV7 and ARWKV7 models
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
* llama: fix inference with RWKV6Qwen2
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
* llama: add more (a)rwkv7 variants in size
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
* Apply code-format changes
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
* fix MUSA build
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
* llama: fix shape error with rwkv using llama-parallel
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
---------
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
2025-03-18 07:27:50 +08:00
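For context, the new l2_norm op normalizes a row by its Euclidean length. A scalar C++ sketch of one common formulation (not the actual ggml kernel; the eps handling is an assumption):
```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>

// Scalar sketch of L2 normalization: y[i] = x[i] / max(||x||_2, eps).
// A simplification for illustration, not the ggml implementation.
static void l2_norm_row(const float * x, float * y, size_t n, float eps) {
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++) {
        sum += x[i] * x[i];
    }
    const float scale = 1.0f / std::max(std::sqrt(sum), eps);
    for (size_t i = 0; i < n; i++) {
        y[i] = x[i] * scale;
    }
}
```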
Sigbjørn Skjæret
60c902926c
docs : bring llama-cli conversation/template docs up-to-date ( #12426 )
2025-03-17 21:14:32 +01:00
Gaurav Garg
b1b132efcb
cuda : enable CUDA Graph on CUDA Toolkit < 12.x ( #12394 )
* Enable CUDA Graph on CTK < 12.x
The `cudaGraphExecUpdate` API changed in CTK 12.x; for this reason, CUDA graph support was disabled on older CUDA toolkits. This change enables CUDA graph support on CTK < 12.x by using the older API when building against those versions.
* Fix compilation errors with MUSA
* Disable CUDA Graph for MUSA
2025-03-17 20:25:13 +02:00
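The gist of the change is compile-time version gating. A hedged sketch: the helper name is made up, while the two `cudaGraphExecUpdate` call forms follow the documented pre- and post-12.x signatures:
```cpp
#include <cuda_runtime.h>

// Sketch of gating on the CUDA toolkit version at compile time:
// cudaGraphExecUpdate changed its signature in CTK 12.x, so code that
// supports both must pick the right call form per version.
static bool try_update_graph_exec(cudaGraphExec_t exec, cudaGraph_t graph) {
#if CUDART_VERSION >= 12000
    // 12.x form: a single result-info out-parameter.
    cudaGraphExecUpdateResultInfo info;
    return cudaGraphExecUpdate(exec, graph, &info) == cudaSuccess;
#else
    // Pre-12.x form: separate error-node and result out-parameters.
    cudaGraphNode_t           err_node = nullptr;
    cudaGraphExecUpdateResult result;
    return cudaGraphExecUpdate(exec, graph, &err_node, &result) == cudaSuccess;
#endif
}
```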
Guus Waals
01e8f2138b
ggml-vulkan: remove unused find_program(glslc) ( #12416 )
It's already found by FindVulkan.cmake in the parent CMakeLists
2025-03-17 13:35:43 -03:00
Jeff Bolz
484a8ab513
vulkan: Add N/2 and N/4 optimized paths in coopmat2 shader ( #12312 )
2025-03-17 09:26:18 -05:00
Concedo
ddaa8d5a38
fixed saving path for savedata
2025-03-17 22:19:52 +08:00
Concedo
0cfd8d23cb
handle symlinks (+1 squashed commit)
Squashed commits:
[fb8477b9] fixed makefile (+4 squashed commits)
Squashed commit:
[4a245bba] fixed a makefile issue
[d68eba69] alias usehipblas to usecublas
[a9ab0a7c] dynamic rocwmma selection
[fefe17c7] revert rocwmma
2025-03-17 21:03:30 +08:00
Daniele
cf2270e4d3
vulkan: subgroup size tuning ( #12087 )
* vulkan: subgroup size test
* Vulkan: Add device architecture enum and logic to recognize AMD generations
* vulkan: use new architecture logic to specify subgroup size
* Initial vulkan subgroup size tuning for RDNA3
* vulkan: commonize RDNA subgroup tuning
* vulkan: override subgroup size if required_subgroup_size = 0
* vulkan: disable warp 32 for RDNA3
* vulkan: fine tuned RDNA1 subgroup sizes
* vulkan: adjusted subgroup size map
* vulkan: fixed RDNA2 subgroup map
---------
Co-authored-by: 0cc4m <picard12@live.de>
2025-03-17 12:42:33 +01:00
Jeff Bolz
f07690c930
vulkan: use fp32 in coopmat2 q4_k dequant function ( #12309 )
2025-03-17 10:43:35 +01:00
Jeff Bolz
891c63956d
vulkan: Pad N dimension of B matrix for coopmat2 perf, to avoid bounds checking ( #12273 )
* vulkan: Pad N dimension of B matrix for coopmat2 perf, to avoid bounds checking
2025-03-17 10:41:59 +01:00
Jeff Bolz
2f21123c1d
vulkan: Adjust coopmat2 tile sizes and selection heuristic ( #12258 )
2025-03-17 10:35:00 +01:00
Christian Kastner
374101fd74
cmake : enable building llama.cpp using system libggml ( #12321 )
* cmake: Factor out compiler flag function from ggml
llama.cpp's build requires it, too, and we may want to make use of it
without add_subdirectory(ggml).
* cmake: Enable building against system ggml
This facilitates package maintenance for Linux distributions, where the
libggml library most likely will be shipped as an individual package
upon which a llama.cpp package depends.
2025-03-17 11:05:23 +02:00
Akarshan Biswas
b3c9a65673
SYCL: set extras only on GGML_TYPE_Q4_0 ( #12366 )
* SYCL: set extras only on GGML_TYPE_Q4_0
* release tensor_extras in reset buffer interface
2025-03-17 09:45:12 +08:00
Sigbjørn Skjæret
8ba95dca20
llama : fix OLMo-2-0325-32B-Instruct K-norm size ( #12400 )
2025-03-16 19:46:36 +02:00
Georgi Gerganov
dc079cfdff
context : fix init of n_outputs ( #12397 )
ggml-ci
2025-03-16 19:29:36 +02:00
Daniel Bevenius
7b61bcc87c
ci : add --symlinks to xcframework zip command ( #12409 )
This commit adds the --symlinks option to the zip command used to create
the xcframework zip file. This is necessary to preserve symlinks in the
zip file. Without this option, the Versions symlink is stored as a
regular directory entry in the zip file rather than as a symlink, which
causes the following error in Xcode:
```console
Couldn't resolve framework symlink for '/Users/danbev/work/ai/llama.cpp/tmp_1/build-apple/llama.xcframework/macos-arm64_x86_64/llama.framework/Versions/Current': readlink(/Users/danbev/work/ai/llama.cpp/tmp_1/build-apple/llama.xcframework/macos-arm64_x86_64/llama.framework/Versions/Current): Invalid argument (22)
```
Refs: https://github.com/ggml-org/llama.cpp/pull/11996#issuecomment-2727026377
2025-03-16 18:22:05 +01:00
Concedo
131107dc91
lite: fix admin button display issue when preloading a story
2025-03-17 00:17:31 +08:00
Concedo
6888f5495d
allow quantkv with contextshift
2025-03-16 21:48:42 +08:00
Concedo
e466ce65e2
updated sd metadata
2025-03-16 20:12:43 +08:00
Concedo
8708403ee9
revert clean
2025-03-16 17:53:35 +08:00
Concedo
5ef1722d5f
fix for sd
2025-03-16 17:02:42 +08:00
Concedo
0954e9e476
improve model estimation
2025-03-16 16:14:13 +08:00
Concedo
5d7c5e9e33
Merge branch 'upstream' into concedo_experimental
# Conflicts:
# examples/tts/tts.cpp
2025-03-16 15:42:39 +08:00
Concedo
2401502cbd
improvement to tool calling, allowing specific tools to be used
2025-03-16 15:20:08 +08:00
Concedo
9f7fd63160
revert unwanted change to tool calling
2025-03-16 01:35:48 +08:00
marcoStocchi
f4c3dd5daa
llama-tts : add '-o' option ( #12398 )
* added -o option to specify an output file name
* llama-tts returns ENOENT in case of file write error
note : PR #12042 is closed as superseded by this one.
2025-03-15 17:23:11 +01:00
Concedo
98eade358a
more rocm include dir
2025-03-15 23:29:00 +08:00
aubreyli
3d35d87b41
SYCL: Delete redundant plus sign and space ( #12391 )
2025-03-15 15:49:03 +01:00
fairydreaming
b19bd064c0
SYCL : support non-contiguous tensors in binary ops (add, sub, etc) ( #12399 )
* sycl : support non-contiguous tensors in binary ops
* sycl : silence unused variable warning
---------
Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
2025-03-15 22:19:30 +08:00
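As an illustration of what non-contiguous support in a binary op entails, here is a plain-C++ sketch that indexes through ggml-style byte strides instead of assuming a flat layout; field names follow ggml's ne/nb convention, but the kernel itself is a simplification, not the SYCL code from this PR:
```cpp
#include <cstddef>
#include <cstdint>

// Simplified 2-D tensor view in the spirit of ggml: element counts (ne)
// plus per-dimension byte strides (nb), so views need not be packed.
struct tensor2d {
    void *  data;
    int64_t ne[2]; // elements per dimension
    size_t  nb[2]; // byte stride per dimension
};

// Element-wise add over possibly non-contiguous f32 tensors: each address
// is computed from the byte strides rather than a flat linear index.
static void add_f32_noncontig(const tensor2d & a, const tensor2d & b, tensor2d & dst) {
    for (int64_t i1 = 0; i1 < dst.ne[1]; i1++) {
        for (int64_t i0 = 0; i0 < dst.ne[0]; i0++) {
            const float * pa = (const float *)((const char *)a.data   + i1*a.nb[1]   + i0*a.nb[0]);
            const float * pb = (const float *)((const char *)b.data   + i1*b.nb[1]   + i0*b.nb[0]);
            float       * pd = (float       *)((char       *)dst.data + i1*dst.nb[1] + i0*dst.nb[0]);
            *pd = *pa + *pb;
        }
    }
}
```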
Concedo
67851e5415
Merge branch 'upstream' into concedo_experimental
# Conflicts:
# examples/run/run.cpp
# ggml/src/ggml-cann/aclnn_ops.cpp
2025-03-15 19:54:19 +08:00
Concedo
e84596ec1a
add config for default gen tokens and bos toggle
2025-03-15 19:53:06 +08:00
Concedo
bfc30066c9
fixed a clip processing bug
2025-03-15 17:49:49 +08:00
Concedo
7272165e0e
verbosity
2025-03-15 12:13:04 +08:00
Concedo
4212f0b8e8
wip on multiple fixes
2025-03-15 10:50:36 +08:00
Chenguang Li
92a391327e
[CANN] MUL_MAT optimization ( #12382 )
2025-03-15 09:31:08 +08:00