Georgi Gerganov
228f724d9c
kv-cache : fix seq_rm with seq_id == -1 ( #15226 )
...
* kv-cache : fix seq_rm with seq_id == -1
ggml-ci
* cont : iterate over streams
ggml-ci
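In the llama.cpp KV-cache API, `seq_id == -1` conventionally means "match every sequence". A minimal sketch of the combined fix under that assumption, using simplified stand-in types rather than the real cache internals:

```cpp
#include <cstdint>
#include <set>
#include <vector>

// Simplified stand-in for one KV cell: its position and the set of
// sequence ids that reference it (hypothetical, not the real struct).
struct kv_cell {
    int32_t           pos = -1;
    std::set<int32_t> seq;
};

// Sketch of seq_rm: remove positions [p0, p1) for one sequence, or for
// all sequences when seq_id == -1, iterating over every stream of cells.
static void seq_rm(std::vector<std::vector<kv_cell>> & streams,
                   int32_t seq_id, int32_t p0, int32_t p1) {
    for (auto & cells : streams) {                  // cont : iterate over streams
        for (auto & cell : cells) {
            if (cell.pos < p0 || cell.pos >= p1) {
                continue;
            }
            if (seq_id < 0) {
                cell.seq.clear();                   // seq_id == -1: match all sequences
            } else {
                cell.seq.erase(seq_id);
            }
            if (cell.seq.empty()) {
                cell.pos = -1;                      // cell becomes free
            }
        }
    }
}
```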
2025-08-11 13:58:24 +03:00
Daniel Bevenius
cd3069dfcb
kv-cache : log (debug) all streams in find_slot ( #15176 )
...
This commit updates `llama_kv_cache_unified::find_slot` to log
information for all streams when debug is enabled.
The motivation for this change is that if a non-unified
kv-cache is used, only one stream is logged, because the
code currently uses `seq_to_stream[1]`.
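A rough sketch of the logging change, with a hypothetical `kv_stream` type and field names standing in for the real cache members: the point is simply to loop over every stream instead of the one stream mapped from a hard-coded sequence id.

```cpp
#include <cstdio>
#include <vector>

struct kv_stream { int used = 0; int size = 0; };   // simplified stand-in

// Hypothetical sketch: find_slot's debug path reports every KV stream,
// not just the stream of one hard-coded sequence id.
static void log_all_streams(const std::vector<kv_stream> & streams) {
    for (size_t s = 0; s < streams.size(); ++s) {
        std::fprintf(stderr, "find_slot: stream %zu: used = %d / %d\n",
                     s, streams[s].used, streams[s].size);
    }
}
```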
2025-08-11 11:21:19 +02:00
Sigbjørn Skjæret
50e81bdf5d
convert : fix merge conflicts ( #15229 )
2025-08-11 11:15:44 +02:00
Daniel Bevenius
1ebbaddff2
perplexity : update comments/error msg to use decode [no ci] ( #15227 )
...
This commit updates comments and error messages to use "decode" instead
of "eval" in perplexity.cpp.
The motivation for this is that `llama_eval` was renamed to
`llama_decode` a while ago, but the comments and error messages
still referred to "eval". This change ensures consistency and clarity.
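For reference, a sketch of the current call shape (a hypothetical helper, not the perplexity.cpp code itself; batch construction details vary across llama.cpp versions):

```cpp
// Sketch only: the old llama_eval call is now a llama_decode over a
// batch; helper names and signatures vary across versions.
#include "llama.h"
#include <cstdio>

static bool decode_chunk(llama_context * ctx, llama_token * tokens, int32_t n_tokens) {
    llama_batch batch = llama_batch_get_one(tokens, n_tokens);
    if (llama_decode(ctx, batch) != 0) {
        std::fprintf(stderr, "%s : failed to decode\n", __func__);  // say "decode", not "eval"
        return false;
    }
    return true;
}
```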
2025-08-11 11:21:24 +03:00
Julien Denize
a3a7874272
convert : improve Mistral models integration ( #14737 )
...
* Improve Mistral models integration with llama.cpp
* Revert changes and fix gguf
* Revert change
* refactor convert_mistral_to_gguf.py into convert_hf_to_gguf.py
* Revert collateral
* Rename model name
* refactor
* revert
* remove duplicate
* Remove duplicated code
* Fixes
* Fix flake issues
* Apply comments
* Apply comments
* Apply comments
* Fix remote
* add default chat template
* Revert
* nit
2025-08-11 10:07:49 +02:00
Charles Xu
002cb1bb33
kleidiai: fix unsigned overflow bug ( #15150 )
...
* kleidiai: fix unsigned overflow bug
* address review comments
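The commit body does not show the bug itself, but the class of bug is the usual unsigned wraparound: a subtraction that would go negative instead wraps to a huge value. A generic illustration, not the kleidiai code:

```cpp
#include <cstddef>
#include <cstdio>

int main() {
    size_t n = 3, block = 8;
    // BUG pattern: if block > n, the unsigned subtraction wraps around
    // to a huge value instead of going negative.
    size_t wrapped = n - block;                    // 18446744073709551611 on 64-bit
    // Fix pattern: guard the subtraction (or compute in a signed type).
    size_t safe = n > block ? n - block : 0;
    std::printf("wrapped = %zu, safe = %zu\n", wrapped, safe);
    return 0;
}
```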
2025-08-11 09:59:26 +02:00
Concedo
30e2f25c05
alias tensorsplit, fixed python error
2025-08-10 22:38:14 +08:00
Concedo
300e20be6c
allow termux to launch existing downloaded models
2025-08-10 21:29:51 +08:00
Concedo
8e6d27f629
handle case where assistant_message_gen is set and assistant_message_gen != assistant_message_start; replace final output tag with the unspaced (gen) version if it exists
2025-08-10 16:51:34 +08:00
kallewoof
204739e7f1
Adapter fixes ( #1659 )
...
* test adapters
* add assistant_gen adapter key
* add support for chat templates stored as .jinja files
* removed mistakenly committed gated-tokenizers link
* autoguess: Harmony: add missing newline prefixes to system_end
2025-08-10 16:19:50 +08:00
Concedo
57db0ce9cd
allow uploading tagged pinned versions for rocm
2025-08-10 11:04:49 +08:00
Concedo
1515d67c2c
oldpc build is now fixed (+2 squashed commits)
...
Squashed commit:
[d11ac6cef] temp test
[cfbc008b1] test no f16 as well
2025-08-10 10:52:45 +08:00
David Zhao
79c1160b07
cuda: refactored ssm_scan and use CUB ( #13291 )
...
* cuda: refactored ssm_scan to use CUB
* fixed compilation error when not using CUB
* assign L to constant and use size_t instead of int
* deduplicated functions
* change min blocks per mp to 1
* Use cub load and store warp transpose
* suppress clang warning
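The last bullet refers to CUB's block-level load/store with the warp-transpose policy for coalesced global memory access. A generic sketch of that pattern (not the actual ssm_scan kernel):

```cuda
#include <cub/block/block_load.cuh>
#include <cub/block/block_store.cuh>

// Generic CUB sketch: block-wide guarded loads/stores using the
// warp-transpose policy, with shared temp storage reused between phases.
template <int BLOCK_THREADS, int ITEMS_PER_THREAD>
__global__ void scale_kernel(const float * in, float * out, int n, float scale) {
    using BlockLoad  = cub::BlockLoad <float, BLOCK_THREADS, ITEMS_PER_THREAD, cub::BLOCK_LOAD_WARP_TRANSPOSE>;
    using BlockStore = cub::BlockStore<float, BLOCK_THREADS, ITEMS_PER_THREAD, cub::BLOCK_STORE_WARP_TRANSPOSE>;

    __shared__ union {
        typename BlockLoad::TempStorage  load;
        typename BlockStore::TempStorage store;
    } smem;

    const int tile   = BLOCK_THREADS * ITEMS_PER_THREAD;
    const int offset = blockIdx.x * tile;
    const int valid  = n - offset;

    float items[ITEMS_PER_THREAD];
    BlockLoad(smem.load).Load(in + offset, items, valid);   // guarded load
    __syncthreads();                                        // smem is reused

    for (int i = 0; i < ITEMS_PER_THREAD; ++i) {
        items[i] *= scale;
    }
    BlockStore(smem.store).Store(out + offset, items, valid);
}
```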
2025-08-09 20:29:43 +02:00
Concedo
89266ac6b8
make autoguess adapter case-insensitive
2025-08-10 00:58:47 +08:00
Concedo
487d509b44
try fix oldpc cuda broken without flash attn since upstream pr14361 between 1.94 and 1.95 (+1 squashed commits)
...
Squashed commits:
[940f0c639] try fix oldpc cuda broken without flash attn since upstream pr14361 between 1.94 and 1.95
2025-08-10 00:10:37 +08:00
Concedo
4c1faf61b2
increment version (+1 squashed commits)
...
Squashed commits:
[6e5080ad2] increment version
2025-08-09 20:53:26 +08:00
Concedo
0fb25bb165
Merge branch 'upstream' into concedo_experimental
2025-08-09 20:31:36 +08:00
Concedo
5f95fc1122
update lite
2025-08-09 20:31:15 +08:00
Aman Gupta
34c9d765bf
CUDA: add attention sinks for tile and wmma ( #15178 )
...
* CUDA: add attention sinks for tile and wmma
* Review: formatting changes + remove syncthreads from tile + remove warp_reduce_max from wmma
2025-08-09 20:00:24 +08:00
Concedo
ced98823a1
kai api tool calling
2025-08-09 10:51:10 +08:00
Concedo
4c7b82e982
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# scripts/server-bench.py
2025-08-09 10:34:24 +08:00
Concedo
fc551470d4
updated lite
2025-08-09 10:33:59 +08:00
compilade
e54d41befc
gguf-py : add Numpy MXFP4 de/quantization support ( #15111 )
...
* gguf-py : add MXFP4 de/quantization support
* ggml-quants : handle zero amax for MXFP4
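The zero-amax guard is the usual one in ggml-style block quantization: when a block is all zeros, the inverse scale must be forced to zero rather than dividing by zero. A generic sketch of the pattern, not the actual MXFP4 code:

```cpp
#include <cmath>
#include <cstdint>

// Generic sketch of ggml-style block quantization with the zero-amax
// guard (hypothetical helper, not the MXFP4 implementation).
static void quantize_block(const float * x, int n, float qmax,
                           float * d_out, int8_t * q_out) {
    float amax = 0.0f;
    for (int i = 0; i < n; ++i) {
        amax = std::fmax(amax, std::fabs(x[i]));
    }
    const float d  = amax / qmax;
    const float id = d != 0.0f ? 1.0f / d : 0.0f;   // the zero-amax guard
    *d_out = d;
    for (int i = 0; i < n; ++i) {
        q_out[i] = (int8_t) std::lround(x[i] * id);
    }
}
```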
2025-08-08 17:48:26 -04:00
Johannes Gäßler
4850b52aed
server-bench: external OAI servers, sqlite ( #15179 )
...
* server-bench: external OAI servers, sqlite
* Update scripts/server-bench.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update scripts/server-bench.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update scripts/server-bench.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* raise_for_status
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2025-08-08 23:04:36 +02:00
Concedo
9e7a940ce4
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# ggml/src/ggml-opencl/ggml-opencl.cpp
# ggml/src/ggml-opencl/kernels/softmax_4_f16.cl
# ggml/src/ggml-opencl/kernels/softmax_4_f32.cl
# ggml/src/ggml-opencl/kernels/softmax_f16.cl
# ggml/src/ggml-opencl/kernels/softmax_f32.cl
# ggml/src/ggml-rpc/ggml-rpc.cpp
# ggml/src/ggml-sycl/ggml-sycl.cpp
2025-08-09 01:24:52 +08:00
Concedo
7087aeb4bc
anti bsod only for nvidia
2025-08-09 01:23:38 +08:00
Concedo
67e0072245
fixed clblast repacking
2025-08-09 01:08:02 +08:00
Concedo
3468c2834d
fixed adv mode
2025-08-08 22:26:36 +08:00
kallewoof
866cc346ab
tweak OpenAI Harmony autoguess developer prefix and assistant end token ( #1673 )
...
* tweak OpenAI Harmony autoguess developer prefix
* use <|end|> for adapter end
2025-08-08 21:15:11 +08:00
AN Long
cd6983d56d
ggml : fix field name when creating a new ggml_backend ( #14944 )
2025-08-08 14:37:22 +02:00
Olivier Chafik
6c7e9a5440
vendor: sync minja ( #15161 )
...
* vendor: sync minja
* Update minja.hpp
* Apply suggestions from code review
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2025-08-08 10:45:18 +01:00
Johannes Gäßler
1425f587a8
CUDA: attention sinks for mma FlashAttention ( #15157 )
2025-08-08 08:19:58 +02:00
lhez
aaa3d07ae7
opencl: support sink in soft_max (attn sinks) ( #15152 )
2025-08-07 21:47:03 -07:00
Concedo
d5b5e79035
should fix vulkan bsod
2025-08-08 10:57:50 +08:00
Wagner Bruna
eed5577aaa
fix unintended sd model quantization ( #1672 )
...
The recent ggml update added another quant type, GGML_TYPE_MXFP4,
which got the same value as SD_TYPE_COUNT. That made the embedded
sd.cpp quantize to GGML_TYPE_MXFP4 by default.
Photomaker in particular ends up crashing due to
"Missing CPY op for types: f32 mxfp4".
2025-08-08 10:19:58 +08:00
Xuan-Son Nguyen
50aa938901
convert : support non-mxfp4 HF model ( #15153 )
...
* convert : support non-mxfp4 HF model
* rm redundant check
* disable debug check
2025-08-07 23:26:03 +02:00
Jeff Bolz
c4f53563df
vulkan: support fattn sinks ( #15126 )
2025-08-07 22:44:20 +02:00
Jeff Bolz
a0552c8bee
vulkan: Add env var to disable host visible vidmem ( #15109 )
2025-08-07 22:07:11 +02:00
RunningLeon
99acbc9921
llama : Support intern-s1 ( #14875 )
...
* support internvl
* support interns1
* resolve comments
* put interns1 in tensor mapping
* resolve comment
* move tokenizer changes to sub class
2025-08-07 18:20:40 +02:00
Concedo
8f15461bea
updated lite
2025-08-08 00:04:47 +08:00
uvos
7ad67ba9fe
HIP: add cmake option to enable compiler output of kernel resource usage metrics ( #15103 )
2025-08-07 16:44:14 +02:00
Concedo
8a71eb03c0
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .github/workflows/build.yml
# ggml/cmake/ggml-config.cmake.in
# ggml/src/ggml-cann/CMakeLists.txt
# ggml/src/ggml-cann/common.h
# ggml/src/ggml-cann/ggml-cann.cpp
# ggml/src/ggml-cuda/fattn.cu
# ggml/src/ggml-opencl/CMakeLists.txt
# ggml/src/ggml-opencl/ggml-opencl.cpp
# requirements/requirements-convert_hf_to_gguf.txt
# scripts/compare-llama-bench.py
# tests/test-chat-template.cpp
# tests/test-chat.cpp
# tools/llama-bench/llama-bench.cpp
2025-08-07 21:23:09 +08:00
Concedo
338b1fe97e
readjusted mistral and oai template, fixed compile issue on termux, updated lite, show generated token ids in debug mode
2025-08-07 21:14:48 +08:00
Christian Kastner
9a96389544
ggml: Skip backend library linking code when GGML_BACKEND_DL=ON ( #15094 )
...
Any available libraries are found and loaded dynamically at runtime.
2025-08-07 13:45:41 +02:00
Johannes Gäßler
1d72c84188
CUDA: GEMM for FP32/FP16/BF16 and ne11 <= 16 ( #15131 )
...
* CUDA: GEMM for FP32/FP16/BF16 and ne11 <= 16
2025-08-07 10:53:21 +02:00
Johannes Gäßler
20638e4f16
scripts: fix crash when --tool is not set ( #15133 )
2025-08-07 08:50:30 +02:00
Daniel Bevenius
36d3f00e14
requirements : fix PyTorch uint64 compatibility ( #15134 )
...
This commit addresses an issue with the convert_hf_to_gguf script
which is currently failing with:
```console
AttributeError: module 'torch' has no attribute 'uint64'
```
This occurred because safetensors expects torch.uint64 to be available
in the public API, but PyTorch 2.2.x seems to provide only limited
support for unsigned types beyond uint8. The torch.uint64 dtype exists
but is not exposed in the standard torch namespace
(see pytorch/pytorch#58734 ).
PyTorch 2.4.0 properly exposes torch.uint64 in the public API, resolving
the compatibility issue with safetensors. This also required torchvision
to be updated to =0.19.0 for compatibility.
Refs: https://huggingface.co/spaces/ggml-org/gguf-my-repo/discussions/186#68938de803e47d990aa087fb
Refs: https://github.com/pytorch/pytorch/issues/58734
2025-08-07 05:31:48 +02:00
Reese Levine
5fd160bbd9
ggml: Add basic SET_ROWS support in WebGPU ( #15137 )
...
* Begin work on set_rows
* Work on set rows
* Add error buffers for reporting unsupported SET_ROWS indices
* Remove extra comments
2025-08-06 15:14:40 -07:00
rmatif
756cfea826
fix profiling crash ( #15072 )
2025-08-06 14:17:51 -07:00
lhez
e725a1a982
opencl: add swiglu_oai and add_id ( #15121 )
...
* opencl: add `swiglu-oai`
* opencl: add `add_id`
* opencl: add missing `add_id.cl`
2025-08-06 12:12:17 -07:00