Commit graph

11893 commits

Author SHA1 Message Date
Mario Limonciello
8fdf269dad
ci : update Windows ROCm build to 26.Q1 [no ci] (#19810)
* Update build command to build llama-* tools not just ggml-hip
* Update rocWMMA headers to 7.2
* Add GFX1150 target
* Correct library paths for AMD libraries in 26.Q1
2026-02-25 12:30:19 +01:00
Aldehir Rojas
a96a1120b4
gguf : fix ftell/fseek for Windows (#19870) 2026-02-25 06:58:11 +02:00
Georgi Gerganov
244641955f
models : fix graph splits (#19866) 2026-02-25 00:01:13 +02:00
Pascal
47eb12b953
server: fix query params lost when proxying requests in multi-model router mode (#19854)
* server: fix query params lost when proxying requests in multi-model router mode

* server: re-encode query params using httplib::encode_query_component in proxy
2026-02-24 21:46:06 +01:00
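A minimal sketch of the idea behind the fix above, assuming a cpp-httplib based router that proxies requests to a backend server; the forward() helper and backend client below are illustrative assumptions, not the actual llama-server router code:

```cpp
// Sketch only: keep query parameters when proxying a request with cpp-httplib.
// The backend client and forward() helper are illustrative, not the actual
// llama-server router code.
#include <httplib.h>

static void forward(const httplib::Request & req, httplib::Response & res,
                    httplib::Client & backend) {
    // req.path ("/v1/models") drops "?key=value"; req.target keeps the original
    // request target including its already-encoded query string.
    auto upstream = backend.Get(req.target, req.headers);
    if (upstream) {
        res.status = upstream->status;
        res.set_content(upstream->body, upstream->get_header_value("Content-Type"));
    } else {
        res.status = 502;
    }
}
```

Forwarding req.target (path plus query) is only one way to keep the parameters intact; per the commit message, the actual patch re-encodes the parsed query parameters instead.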
Georgi Gerganov
418dea39ce
ggml/gguf : prevent integer overflows (#19856)
* gguf : prevent integer overflow for ggml_context mem size

* ggml : fix int overflows in ggml_new_object()

* gguf : prevent string exhaustion

* gguf : prevent array elements exhaustion

* ggml : fix negative tensor type oob

* py : assert that alignment is non-zero power of 2

* ggml : check int overflow in ggml_new_tensor_impl and ggml_new_object

* gguf-py : error on duplicate keys when reading

* py : restore tensor_fields

* enforce proper alignment in add_custom_alignment

* gguf : better name

* gguf : fix ctx size for no_alloc == true

* gguf : minor print fix

* ggml : print values when overflow

* ggml : remove deprecated ggml_type_sizef()

* ggml : relax ggml_type asserts to debug-only

* gguf : add mem_size overflow test

* gguf : add file size check for arrays

* ggml : relax asserts for ggml_get_type_traits()

* flake8 fix

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-02-24 20:17:11 +02:00
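The bullet list above mixes several related hardening changes; as a generic illustration (not the actual ggml code), the two recurring techniques are overflow-checked size arithmetic and validating that an alignment is a non-zero power of two:

```cpp
// Sketch of overflow-checked size math and a power-of-two alignment check.
// Generic illustration only; these function names are not from ggml.
#include <cstddef>
#include <cstdint>
#include <cstdio>

// returns false instead of silently wrapping around on overflow
static bool checked_mul(size_t a, size_t b, size_t * out) {
    if (a != 0 && b > SIZE_MAX / a) {
        return false;
    }
    *out = a * b;
    return true;
}

// an alignment must be a non-zero power of two
static bool is_valid_alignment(size_t align) {
    return align != 0 && (align & (align - 1)) == 0;
}

int main() {
    size_t total;
    if (!checked_mul(SIZE_MAX / 2, 4, &total)) {
        std::printf("size overflow detected\n");
    }
    std::printf("align 32 ok: %d, align 12 ok: %d\n",
                is_valid_alignment(32), is_valid_alignment(12));
    return 0;
}
```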
Concedo
0eafc3cf2d ace step lowvram mode done, improved 2026-02-24 23:12:26 +08:00
Concedo
11a85d62fc lowvram for music lm 2026-02-24 22:21:17 +08:00
Concedo
aa58d1ed3b all working, but needs to optimize vram 2026-02-24 21:55:57 +08:00
Tarek Dakhran
da426cb250
model : update label for LFM2-24B-A2B (#19848)
* model : Update label for LFM2-24B-A2B

```
❯ build/bin/llama-bench -m /data/playground/checkpoints/LFM2-24B-A2B-Preview-Q4_0.gguf,/data/playground/checkpoints/LFM2-8B-A1B-Q4_0.gguf -p 1 -n 0
| model                          |       size |     params | backend    | threads |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
| lfm2moe 24B.A2B Q4_0           |  12.54 GiB |    23.84 B | CPU        |      10 |             pp1 |         30.35 ± 2.49 |
| lfm2moe 8B.A1B Q4_0            |   4.41 GiB |     8.34 B | CPU        |      10 |             pp1 |         49.24 ± 1.93 |
```

* Remove extra line
2026-02-24 14:27:42 +01:00
Concedo
488c431331 not yet working 2026-02-24 17:47:50 +08:00
Radoslav Gerganov
c830f99cfa
server : support max_completion_tokens request property (#19831)
"max_tokens" is deprectated in favor of "max_completion_tokens" which
sets the upper bound for reasoning+output token.

Closes: #13700
2026-02-24 10:30:00 +02:00
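A hedged usage sketch for the new property, posting a chat completion request with the vendored cpp-httplib client; the host, port, and prompt are placeholder assumptions, and the endpoint is the server's OpenAI-compatible route:

```cpp
// Sketch: send "max_completion_tokens" instead of the deprecated "max_tokens"
// to llama-server's OpenAI-compatible endpoint. Host, port, and prompt are
// placeholder assumptions.
#include <httplib.h>
#include <cstdio>
#include <string>

int main() {
    httplib::Client cli("localhost", 8080);
    const std::string body = R"({
        "messages": [{"role": "user", "content": "Hello"}],
        "max_completion_tokens": 256
    })";
    auto res = cli.Post("/v1/chat/completions", body, "application/json");
    if (res) {
        std::printf("%d\n%s\n", res->status, res->body.c_str());
    }
    return 0;
}
```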
Ruben Ortlam
aa6f918c1c
Vulkan Scalar Flash Attention Refactor (#19625)
* vulkan: allow using fp16 in scalar flash attention shader

* split rows inside of subgroups for faster synchronization

* use row_split when Br >= 4, change reductions to use shared memory if row_split == 1

* use f32 scalar FA if f16 is not supported by device

* fix amd workgroup size issue

* optimize masksh use

* add medium rows FA shader Br size

* fixes

* add padding to mask shmem buffer

* cache q values into registers for KQ

* fuse lf accumulation, pf and v accumulation into a loop

* stage K loads through shmem

* stage V loads through shmem

* only stage through shmem on Nvidia

* default to Bc 32

* also stage V through shmem when this is done for K

* dynamic subgroups for intel

* use vectorized stores

* use float_type for dequantize4 functions

* use smaller scalar rows size for smaller rows count

* relax flash attention split_k condition to allow non-gqa use

* use minimal subgroup size on Intel

* fix shmem support function

* fix rebase issues

* fixes

* Bc 4 for scalar FA is not a valid configuration

* Use wave32 on AMD RDNA for scalar FA

* add Intel shader core count lookup-table

* fix regressions

* device tuning

* tmpsh size fix

* fix editorconfig

* refactor fa tuning logic into a single place

* fix gqa opt logic

* fix block_rows with small n_rows

* amd tuning

* fix hsk=72/80 issue

* tuning

* allow condition skipping for column check

* use float16 for Of if available

* address feedback

* fix bad RDNA performance on head size <= 128 by limiting occupancy

* allow printing pipeline stats

* cleanup and fixes

* limit occupancy for GCN for small batch FA with large HSK

* disable f16 FA for GCN AMD GPUs on the proprietary driver
2026-02-24 08:35:48 +01:00
Concedo
0fd7d2c0e5 ace step diffusion loading 2026-02-24 15:24:15 +08:00
Jeff Bolz
8c2c0108dd
vulkan: fix coopmat1 without bf16 support (#19793) 2026-02-24 07:48:32 +01:00
Jeff Bolz
3ea5360c00
vulkan: fix data race in mul_mat_id shader (#19790) 2026-02-24 07:43:12 +01:00
Max Krasnyansky
39fb81f875
hexagon refactor all Ops to use local context struct (#19819)
* hexagon: refactor set/get/sum-rows ops to use local context

* hexagon: refactor ROPE and Softmax Ops to use local context

Improves performance a bit by precomputing values and saving them in the context.

* hexagon: refactor activation ops to use local context struct

* hexagon: refactor unary ops to use local context struct and DMA/VTCM

* hexagon: use aligned hvx_scale function

* hexagon: remove unused fields from op_context

* hexagon: rewrite ROPE to use DMA and VTCM scratchpad

* hex-rope: keep N rows in scratchpad (instead of just two)

* hex-rope: introduce rowidx cache

* hex-rope: remove unused fields

* hex-rope: rewrite dma prefetch logic to allow for multi-row fetch/compute

also removes the need for fastdiv.

* hex-rope: minor formatting

* hex-rope: use indices and unroll the loops

* hex-rope: more updates to cleanup rope-block handling

* hexagon: cleanup supported type/dims checks

* hexagon: all reduce funcs replicated across lanes

There is no need to explicitly replicate the first value.

* snapdragon: update adb and windows scripts to use ubatch-size 256

The updated Ops support handles larger ubatches.
2026-02-23 16:32:14 -08:00
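The refactor above centers on precomputing per-op values once in a local context struct and reusing them for every row; a generic sketch of that pattern with a RoPE-like rotation (all names hypothetical, not the hexagon backend code):

```cpp
// Generic "precompute once in an op context, reuse per row" pattern.
// All names here are hypothetical; this is not the hexagon backend code.
#include <cmath>
#include <vector>

struct rope_op_ctx {
    int   n_dims;
    float theta_base;
    std::vector<float> inv_freq; // computed once per op, reused for every row
};

static rope_op_ctx make_ctx(int n_dims, float theta_base) {
    rope_op_ctx ctx{n_dims, theta_base, {}};
    ctx.inv_freq.resize(n_dims / 2);
    for (int i = 0; i < n_dims / 2; ++i) {
        ctx.inv_freq[i] = 1.0f / std::pow(theta_base, (2.0f * i) / n_dims);
    }
    return ctx;
}

static void rotate_row(const rope_op_ctx & ctx, float * row, int pos) {
    for (int i = 0; i < ctx.n_dims / 2; ++i) {
        const float theta = pos * ctx.inv_freq[i];
        const float c = std::cos(theta), s = std::sin(theta);
        const float x0 = row[2*i], x1 = row[2*i + 1];
        row[2*i]     = x0 * c - x1 * s;
        row[2*i + 1] = x0 * s + x1 * c;
    }
}
```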
Aleksander Grygier
5eb0ea32f0
feat: Add code blocks full height setting to parameter sync service (#19835) 2026-02-23 22:30:13 +01:00
Adrien Gallouët
b68a83e641
vendor : update cpp-httplib to 0.34.0 (#19830)
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2026-02-23 21:05:48 +01:00
Concedo
749536f464 fixed wav header wrong size 2026-02-24 01:13:44 +08:00
Daniel Bevenius
d8aeb65cee
tests : fix typos in comments in test-backend-sampler [no ci] (#19824)
* tests : fix typos in comments in test-backend-sampler [no ci]
2026-02-23 17:12:02 +01:00
askmyteapot
062e361968
Update ace-qwen3.cpp to build on MSVC (#1992)
Need to include <sstream>, otherwise the build fails with many errors like the ones below:

```
C:\koboldcpp\otherarch\acestep\ace-qwen3.cpp(1278,9): error C2297: '<<': not valid as right operand has type 'const char [26]' [C:\koboldcpp\build\music_adapter.vcxproj]
  (compiling source file '../otherarch/acestep/music_adapter.cpp')

C:\koboldcpp\otherarch\acestep\ace-qwen3.cpp(1278,9): error C2679: binary '<<': no operator found which takes a right-hand operand of type 'std::string' (or there is no acceptable conversion) [C:\koboldcpp\build\music_adapter.vcxproj]
  (compiling source file '../otherarch/acestep/music_adapter.cpp')
      C:\Program Files (x86)\Microsoft Visual Studio\18\BuildTools\VC\Tools\MSVC\14.50.35717\include\__msvc_int128.hpp(753,46):
      could be 'std::_Unsigned128 std::operator <<(const std::_Unsigned128 &,const std::_Base128 &) noexcept' [found using argument-dependent lookup]
          C:\koboldcpp\otherarch\acestep\ace-qwen3.cpp(1278,9):
          'std::_Unsigned128 std::operator <<(const std::_Unsigned128 &,const std::_Base128 &) noexcept': cannot convert argument 2 from 'std::string' to 'const std::_Base128 &'
              C:\koboldcpp\otherarch\acestep\ace-qwen3.cpp(1278,57):
              Reason: cannot convert from 'std::string' to 'const std::_Base128'
              C:\koboldcpp\otherarch\acestep\ace-qwen3.cpp(1278,57):
              No user-defined-conversion operator available that can perform this conversion, or the operator cannot be called
```
2026-02-23 23:03:07 +08:00
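A small self-contained reproduction of the pattern the include fixes: streaming literals and std::string into an std::ostringstream, which on MSVC needs <sstream> included directly rather than picked up transitively:

```cpp
// Minimal illustration of the MSVC failure mode fixed above: code that streams
// into std::ostringstream must include <sstream> itself instead of relying on
// another header pulling it in transitively.
#include <sstream>  // without this, MSVC cannot find suitable operator<< overloads
#include <string>
#include <cstdio>

int main() {
    std::string name = "ace-qwen3";
    std::ostringstream oss;
    oss << "loading model: " << name;  // the pattern that failed to compile
    std::printf("%s\n", oss.str().c_str());
    return 0;
}
```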
Concedo
5311997581 updated ace step cpp 2026-02-23 23:01:10 +08:00
Concedo
2e713cfff5 fixed compile issue, trying out 8bit pcm 2026-02-23 21:19:03 +08:00
Aleksander Grygier
9051663d5d
webui: Add setting to have full height Code Blocks in Chat Messages (#19829) 2026-02-23 14:16:50 +01:00
Daniel Bevenius
72b44c0d21
model-conversion : merge inspect-org-model.py with tensor-info.py (#19823)
This commit replaces/merges the inspect-org-model.py script with the
contents of the tensor-info.py script. The merged script has also been
updated to print tensor sizes, which was the only thing that
tensor-info.py did not do before.

The motivation for this is that tensor-info.py does not load the tensor
weights, which can be time consuming for larger models. And now that
both scripts do almost the same thing, it makes sense to maintain one
script rather than two.
2026-02-23 14:15:16 +01:00
Alberto Cabrera Pérez
bc160d3582
ggml-cpu: arm64: q5_K repack gemm and gemv (and generic) implementations (dotprod) (#19356)
* Generic GEMV and boilerplate for q5_K dotprod
* Generic GEMM and boilerplate for q5_K dotprod
* ARM64 q5_K dotprod GEMM
* ARM64 q5_K dotprod GEMV
2026-02-23 12:42:52 +00:00
Wagner Bruna
a6c0a224b2
sd: sync to master-506-c9cd497 (#1991) 2026-02-23 17:35:59 +08:00
Concedo
06c0ffaead with am17an fix for henk to test 2026-02-23 17:30:19 +08:00
Concedo
c2b0cb26a8 ace step codes api 2026-02-23 14:04:45 +08:00
Daniel Bevenius
2b6dfe824d
llama : remove write/read of output ids/logits/embeddings (#18862)
* llama : remove write/read of output ids/logits/embeddings

This commit removes the write/read of output ids, logits and
embeddings from the llama context state.

Refs: https://github.com/ggml-org/llama.cpp/pull/18862#issuecomment-3756330941

* completion : add replaying of session state

This commit updates the session handling in the completion tool to handle
the fact that logits are no longer stored in the session file. Instead, we
need to replay the last token to get the logits for sampling.

* common : add common_prompt_batch_decode function

This commit adds a new function which is responsible for decoding the
prompt and optionally handling the saving of session data.

* update save-state.cpp to use llama_state_load_file

This commit updates the save-load-state example to use the new
llama_state_load_file function for loading the model state from a file.
It also replays the last token after loading, since the state is now
stored before the last token is processed.

* examples : set n_seq_max = 2 for ctx3

This commit updates the save-load-state example to set the n_seq_max
parameter to 2 when initializing the ctx3 context.

The motivation for this change is that with n_parallel/n_seq_max set to 1
the context only supports one sequence, but the test later tries to
use a second sequence, which results in the following error:
```console
main : loaded state with 4 tokens
main : seq 0 copied, 225760 bytes
main : kv cache cleared
find_slot: seq_id=1 >= n_seq_max=1 Try using a bigger --parallel value
state_read_meta: failed to find available cells in kv cache
```
This seems to only happen for recurrent/hybrid models.
2026-02-23 07:04:30 +01:00
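A minimal sketch of the replay-after-load flow the commit describes, built from existing llama.cpp C API names (llama_state_load_file, llama_batch_get_one, llama_decode, llama_get_logits); buffer sizes and error handling are simplified assumptions:

```cpp
// Sketch: after loading a session with llama_state_load_file, the logits are
// no longer part of the saved state, so the last token is decoded again to
// repopulate them before sampling. Model/context setup omitted; treat as a
// hedged sketch assembled from real llama.cpp API names.
#include "llama.h"
#include <vector>

static bool restore_and_replay(llama_context * ctx, const char * path) {
    std::vector<llama_token> tokens(1024); // assumed capacity for the saved prompt
    size_t n_tokens = 0;

    if (!llama_state_load_file(ctx, path, tokens.data(), tokens.size(), &n_tokens)) {
        return false;
    }
    if (n_tokens == 0) {
        return false;
    }

    // replay only the last token so llama_get_logits() is valid again
    llama_batch batch = llama_batch_get_one(&tokens[n_tokens - 1], 1);
    if (llama_decode(ctx, batch) != 0) {
        return false;
    }

    const float * logits = llama_get_logits(ctx);
    (void) logits; // ready for sampling
    return true;
}
```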
Concedo
d100c8660e added Tlacuilo 2026-02-23 10:48:56 +08:00
Sigbjørn Skjæret
e8e261699a
cli : provide model with text filename (#19783) 2026-02-22 22:33:49 +01:00
Xuan-Son Nguyen
5452d736f8
jinja: correct stats for tojson and string filters (#19785) 2026-02-22 21:08:23 +01:00
Aldehir Rojas
ed4837891d
common : fix improper trimming in XML parser on complete message (#19805)
Co-authored-by: Jules LEIDELINGER <11395311+julio75012@users.noreply.github.com>
2026-02-22 17:34:54 +01:00
Concedo
4be93db21c ace step codes generation now working 2026-02-23 00:27:26 +08:00
Kilian Krampf
cacc371f99
Fix wrong cli-argument in documentation (#19804) 2026-02-22 16:26:33 +01:00
Concedo
71d42fae85 Revert "Revert "Revert "cuda : enable CUDA graphs for MMID 1 <= BS <= 4 (#19645)"""
This reverts commit edc04f3f7d.
2026-02-22 23:18:53 +08:00
Concedo
13db5aee9e stub files for loading ace step 2026-02-22 23:15:08 +08:00
HelloKS
ae2368e74e
model : add Kanana-2 model support (#19803)
* model: Add Kanana-2 model support

* lint: adjust spacing
2026-02-22 16:15:02 +01:00
Sigbjørn Skjæret
9f0684f003
ci : fix rocm archive name [no ci] (#19808) 2026-02-22 16:14:37 +01:00
Aldehir Rojas
34ec1c3f18
server : merge contiguous Responses input items into a single assistant message (#19773)
* server : merge contiguous input items into a single assistant message

* cont : simplify tool call msg

* cont : reduce and combine content

* cont : fix merging content items
2026-02-22 14:11:31 +01:00
Concedo
37ae068dee set default to GPU test 2026-02-22 17:03:43 +08:00
Sigbjørn Skjæret
e877ad8bd9
ci : fix rocm release path [no ci] (#19784) 2026-02-22 08:07:46 +01:00
Concedo
fdf868f397 add ace step cpp license info 2026-02-22 13:24:28 +08:00
Concedo
5cd6e50eab initial files for ace step 2026-02-22 13:22:24 +08:00
Concedo
ac70ca35dd preliminary patches for acestep.cpp 2026-02-22 12:50:08 +08:00
Wagner Bruna
19588f18ea
sd: relax size restrictions for DiT models (#1986)
Round image dimensions to the specific multiple required by each
DiT model, which ranges from 32 (certain Wan models) down to 1 (Chroma
Radiance), with most requiring multiples of 8 or 16. UNet models
keep being rounded to multiples of 64.

Current sd.cpp rounds the sizes internally; but it always rounds
up, so we still need to round on our side to apply image size
restrictions, and to trigger VAE tiling correctly.

Also, remove a legacy test that could abort a generation with
unsupported image sizes: it would never run, because it was applied
after the image size adjustments.
2026-02-22 11:00:10 +08:00
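A tiny sketch of the per-model rounding described above; the helper name, the round-down direction, and the minimum clamp are illustrative assumptions rather than the actual sd.cpp/koboldcpp code:

```cpp
// Sketch: round an image dimension to the multiple a given model requires
// (e.g. 64 for UNet, 16/8 for most DiT models, 1 for Chroma Radiance).
// Helper name, round-down choice, and the minimum clamp are assumptions.
#include <algorithm>
#include <cstdio>

static int round_dim_to_multiple(int dim, int multiple) {
    if (multiple <= 1) {
        return dim; // no restriction
    }
    int rounded = (dim / multiple) * multiple; // round down, unlike sd.cpp's round-up
    return std::max(rounded, multiple);        // keep at least one block
}

int main() {
    std::printf("%d %d %d\n",
                round_dim_to_multiple(1000, 64),  // 960  (UNet-style)
                round_dim_to_multiple(1000, 16),  // 992
                round_dim_to_multiple(1000, 8));  // 1000
    return 0;
}
```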
Concedo
0a87f5501e updated sdui, fix img imports 2026-02-22 10:49:55 +08:00
Concedo
73f3ffaeb7 fix followup tool call check with assistant prefills 2026-02-22 10:33:00 +08:00
Concedo
edc04f3f7d Revert "Revert "cuda : enable CUDA graphs for MMID 1 <= BS <= 4 (#19645)""
This reverts commit 131e3cb17a.
2026-02-22 09:33:25 +08:00