Commit graph

11868 commits

Concedo
05834eecb3 Merge commit '1ca3d1de15' into concedo_experimental
# Conflicts:
#	tools/server/README.md
2026-02-26 19:55:06 +08:00
Concedo
adebf63877 ace converter 2026-02-26 19:53:02 +08:00
Georgi Gerganov
1ca3d1de15
gguf : avoid too many file size calls (#19919) 2026-02-26 12:46:32 +02:00
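The idea behind this fix can be sketched as: query the file size once at open time and reuse the cached value for every subsequent bounds check, instead of re-stat-ing the file per field. This is an illustrative Python sketch (the class name and `check_bounds` helper are hypothetical, not the actual gguf code):

```python
import os

class GGUFReaderSketch:
    """Hypothetical reader that caches the file size instead of re-querying it."""

    def __init__(self, path: str):
        self.f = open(path, "rb")
        # fetched exactly once; reused by every bounds check below
        self.size = os.fstat(self.f.fileno()).st_size

    def check_bounds(self, offset: int, n: int) -> bool:
        # a read of n bytes at offset must fit inside the cached file size
        return 0 <= offset and offset + n <= self.size
```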
yggdrasil75
bd72300591
server : fix typo in server README.md (#19900)
fix typo
2026-02-26 11:26:16 +01:00
Concedo
ac8f12f259 still a bit wonky 2026-02-26 17:50:49 +08:00
Concedo
81fb4d773c swap resampling function 2026-02-26 17:37:53 +08:00
Concedo
749a606374 whisper broke 2026-02-26 16:45:04 +08:00
Concedo
44182ebefe Merge commit '8c2c0108dd' into concedo_experimental
# Conflicts:
#	examples/model-conversion/Makefile
#	examples/model-conversion/scripts/utils/inspect-org-model.py
#	ggml/src/ggml-hexagon/ggml-hexagon.cpp
#	ggml/src/ggml-hexagon/htp/act-ops.c
#	ggml/src/ggml-hexagon/htp/get-rows-ops.c
#	ggml/src/ggml-hexagon/htp/hex-dma.h
#	ggml/src/ggml-hexagon/htp/htp-ops.h
#	ggml/src/ggml-hexagon/htp/matmul-ops.c
#	ggml/src/ggml-hexagon/htp/rope-ops.c
#	ggml/src/ggml-hexagon/htp/set-rows-ops.c
#	ggml/src/ggml-hexagon/htp/softmax-ops.c
#	ggml/src/ggml-hexagon/htp/unary-ops.c
#	scripts/snapdragon/adb/run-cli.sh
#	scripts/snapdragon/adb/run-completion.sh
#	scripts/snapdragon/adb/run-mtmd.sh
#	scripts/snapdragon/windows/run-cli.ps1
#	scripts/sync_vendor.py
#	tests/test-backend-sampler.cpp
2026-02-26 16:30:37 +08:00
Concedo
7e53bfd28d Merge commit '2b6dfe824d' into concedo_experimental
# Conflicts:
#	.github/workflows/release.yml
#	examples/save-load-state/save-load-state.cpp
#	src/llama-context.cpp
#	tools/cli/cli.cpp
2026-02-26 15:07:23 +08:00
Wagner Bruna
d400b37215
config file saving enhancements (#1994)
* process --exportconfig and --exporttemplate after --config

This allows using `--config oldfile.kcpps --exportconfig newfile.kcpps`
to update old config items, copy a config file with changed parameters,
download and save a remote config, etc.

* filter out command flags from the saved config files

Also indent files saved from the command line.
2026-02-26 14:55:01 +08:00
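The export ordering and flag filtering described above can be sketched as follows. This is a hypothetical illustration, assuming kcpps config files are JSON; the `COMMAND_FLAGS` set and function names are made up for the example, not the actual koboldcpp code:

```python
import json

# illustrative set of one-shot command flags that should not persist in a saved config
COMMAND_FLAGS = {"config", "exportconfig", "exporttemplate"}

def export_config(settings: dict, path: str) -> None:
    # filter out command flags, and write indented JSON
    filtered = {k: v for k, v in settings.items() if k not in COMMAND_FLAGS}
    with open(path, "w") as f:
        json.dump(filtered, f, indent=2)

def load_and_export(old_path: str, new_path: str, overrides: dict) -> dict:
    with open(old_path) as f:
        settings = json.load(f)
    settings.update(overrides)           # CLI parameters override the old file
    export_config(settings, new_path)    # export runs AFTER --config is applied
    return settings
```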
Concedo
fb3f7d92bc reenable cfg 2026-02-26 14:51:15 +08:00
Concedo
b7d2fe68e7 adjust 2026-02-26 14:46:41 +08:00
Concedo
edbc4fe592 music lm finally working 2026-02-26 14:00:58 +08:00
Concedo
cf042af701 Revert "still not working"
This reverts commit a1305ffff9.
2026-02-26 10:55:55 +08:00
Concedo
a1305ffff9 still not working 2026-02-26 10:48:21 +08:00
Neo Zhang
2943210c1e
support permuted, remove check s0/s10 (#19889)
Co-authored-by: Neo Zhang Jianyu <jianyu.zhang@intel.com>
2026-02-26 10:27:20 +08:00
Jeff Bolz
3769fe6eb7
vulkan: check for memory overlap before doing fusion (#19768)
* vulkan: check for memory overlap before doing fusion

* Update ggml/src/ggml-vulkan/ggml-vulkan.cpp

* address feedback
2026-02-25 18:25:38 +01:00
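The core of an overlap check like the one above is a half-open interval test over buffer ranges. A minimal sketch, assuming buffers are described by (offset, size) pairs — the `can_fuse` rule shown (identical or fully disjoint ranges) is illustrative, not the actual Vulkan backend logic:

```python
def ranges_overlap(a_off: int, a_size: int, b_off: int, b_size: int) -> bool:
    # half-open intervals [off, off+size) overlap iff each starts before the other ends
    return a_off < b_off + b_size and b_off < a_off + a_size

def can_fuse(src: tuple, dst: tuple) -> bool:
    # illustrative rule: fusing is safe when the read and write ranges are
    # either the same buffer range or do not touch at all
    if src == dst:
        return True
    return not ranges_overlap(src[0], src[1], dst[0], dst[1])
```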
Concedo
5c5fe55f7d bump kv overrides max (+1 squashed commits)
Squashed commits:

[9bc8212a0] bump kv overrides max
2026-02-26 00:24:53 +08:00
Concedo
d8746a851f still bugged 2026-02-26 00:07:04 +08:00
Concedo
8a3ccfcba5 some fixes but some issues 2026-02-25 23:41:32 +08:00
ddh0
832aa94762
common : add more aliases for sampler CLI params (#19797)
* common : add more aliases for sampler CLI params
2026-02-25 16:34:25 +01:00
Slobodan Josic
3af34b9ff5
ci : update the ROCm/HIP toolchain versions [no ci] (#19891)
* [HIP] Update ROCm build container to rocm/dev-ubuntu-22.04:7.2 and HIP_SDK to 26.Q1

* revert container version

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-02-25 15:54:49 +01:00
Georgi Gerganov
f20469d919
server : enable multi-modal prompt caching (#19877) 2026-02-25 15:15:42 +02:00
Georgi Gerganov
d7d826b3c1
server : support multi-modal context checkpoints (#19849)
* Modify llama-memory-hybrid-iswa.cpp

* Modify llama-memory-recurrent.cpp

* Modify server-common.cpp

* Modify server-common.h

* Modify server-context.cpp

* Modify server-task.h

* Added comment to llama-memory-hybrid-iswa.cpp

* Remove comment from server-context.cpp

* Stylistic fix server-context.cpp

* Fix an issue when seqrm isn't called in server-context.cpp

* cont : alternative impl

* cont : cleanup

* cont : n_tokens -> int64_t

---------

Co-authored-by: timkhronos <timkhronos@gmail.com>
2026-02-25 15:14:27 +02:00
Xuan-Son Nguyen
c747294b2d
scripts: update corpus of compare-logprobs (#19326)
* scripts: update corpus of compare-logprobs

* fix
2026-02-25 12:57:34 +01:00
Mario Limonciello
8fdf269dad
ci : update Windows ROCm build to 26.Q1 [no ci] (#19810)
* Update build command to build llama-* tools not just ggml-hip
* Update rocWMMA headers to 7.2
* Add GFX1150 target
* Correct library paths for AMD libraries in 26.Q1
2026-02-25 12:30:19 +01:00
Aldehir Rojas
a96a1120b4
gguf : fix ftell/fseek for Windows (#19870) 2026-02-25 06:58:11 +02:00
Georgi Gerganov
244641955f
models : fix graph splits (#19866) 2026-02-25 00:01:13 +02:00
Pascal
47eb12b953
server: fix query params lost when proxying requests in multi-model router mode (#19854)
* server: fix query params lost when proxying requests in multi-model router mode

* server: re-encode query params using httplib::encode_query_component in proxy
2026-02-24 21:46:06 +01:00
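The fix above — forwarding the query string and re-encoding each component instead of proxying only the path — can be sketched in Python with `urllib.parse` (the server itself uses cpp-httplib; this helper is a hypothetical illustration):

```python
from urllib.parse import parse_qsl, urlencode, urlsplit

def build_proxied_url(upstream_base: str, incoming_url: str) -> str:
    """Forward the request path AND its query string to the upstream server.

    The bug was forwarding only the path, silently dropping '?key=value'
    parameters; round-tripping them through parse/encode keeps escapes valid.
    """
    parts = urlsplit(incoming_url)
    url = upstream_base.rstrip("/") + parts.path
    if parts.query:
        # re-encode each component so percent-escapes survive the proxy hop
        url += "?" + urlencode(parse_qsl(parts.query, keep_blank_values=True))
    return url
```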
Georgi Gerganov
418dea39ce
ggml/gguf : prevent integer overflows (#19856)
* gguf : prevent integer overflow for ggml_context mem size

* ggml : fix int overflows in ggml_new_object()

* gguf : prevent string exhaustion

* gguf : prevent array elements exhaustion

* ggml : fix negative tensor type oob

* py : assert that alignment is non-zero power of 2

* ggml : check int overflow in ggml_new_tensor_impl and ggml_new_object

* gguf-py : error on duplicate keys when reading

* py : restore tensor_fields

* enforce proper alignment in add_custom_alignment

* gguf : better name

* gguf : fix ctx size for no_alloc == true

* gguf : minor print fix

* ggml : print values when overflow

* ggml : remove deprecated ggml_type_sizef()

* ggml : relax ggml_type asserts to debug-only

* gguf : add mem_size overflow test

* gguf : add file size check for arrays

* ggml : relax asserts for ggml_get_type_traits()

* flake8 fix

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-02-24 20:17:11 +02:00
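The pattern behind several of the fixes above is a checked size multiplication: reject a product that would wrap a 64-bit `size_t` instead of letting it silently overflow during allocation. Python integers never wrap, so this sketch makes the C-side check explicit; the function names are illustrative, not the actual ggml API:

```python
SIZE_MAX = 2**64 - 1  # size_t limit on a 64-bit build

def checked_mul(a: int, b: int) -> int:
    # a * b > SIZE_MAX  <=>  b > SIZE_MAX // a  (for a > 0)
    if a != 0 and b > SIZE_MAX // a:
        raise OverflowError(f"size overflow: {a} * {b}")
    return a * b

def tensor_bytes(n_elements: int, type_size: int, block_size: int = 1) -> int:
    # bytes = n_elements / block_size * type_size, with the multiply checked
    if n_elements % block_size != 0:
        raise ValueError("n_elements is not a multiple of the block size")
    return checked_mul(n_elements // block_size, type_size)
```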
Concedo
0eafc3cf2d ace step lowvram mode done, improved 2026-02-24 23:12:26 +08:00
Concedo
11a85d62fc lowvram for music lm 2026-02-24 22:21:17 +08:00
Concedo
aa58d1ed3b all working, but needs to optimize vram 2026-02-24 21:55:57 +08:00
Tarek Dakhran
da426cb250
model : update label for LFM2-24B-A2B (#19848)
* model : Update label for LFM2-24B-A2B

```
❯ build/bin/llama-bench -m /data/playground/checkpoints/LFM2-24B-A2B-Preview-Q4_0.gguf,/data/playground/checkpoints/LFM2-8B-A1B-Q4_0.gguf -p 1 -n 0
| model                          |       size |     params | backend    | threads |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
| lfm2moe 24B.A2B Q4_0           |  12.54 GiB |    23.84 B | CPU        |      10 |             pp1 |         30.35 ± 2.49 |
| lfm2moe 8B.A1B Q4_0            |   4.41 GiB |     8.34 B | CPU        |      10 |             pp1 |         49.24 ± 1.93 |
```

* Remove extra line
2026-02-24 14:27:42 +01:00
Concedo
488c431331 not yet working 2026-02-24 17:47:50 +08:00
Radoslav Gerganov
c830f99cfa
server : support max_completion_tokens request property (#19831)
"max_tokens" is deprectated in favor of "max_completion_tokens" which
sets the upper bound for reasoning+output token.

Closes: #13700
2026-02-24 10:30:00 +02:00
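The precedence between the old and new field can be sketched as below. This is an illustrative resolver, assuming the newer `max_completion_tokens` wins when both appear in a request body; it is not the exact server code:

```python
def resolve_max_tokens(request: dict, default: int = -1) -> int:
    """Pick the completion-token cap from an OpenAI-compatible request body.

    'max_completion_tokens' supersedes the deprecated 'max_tokens';
    fall back to the old field, then to the server default.
    """
    if "max_completion_tokens" in request:
        return request["max_completion_tokens"]
    return request.get("max_tokens", default)
```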
Ruben Ortlam
aa6f918c1c
Vulkan Scalar Flash Attention Refactor (#19625)
* vulkan: allow using fp16 in scalar flash attention shader

* split rows inside of subgroups for faster synchronization

* use row_split when Br >= 4, change reductions to use shared memory if row_split == 1

* use f32 scalar FA if f16 is not supported by device

* fix amd workgroup size issue

* optimize masksh use

* add medium rows FA shader Br size

* fixes

* add padding to mask shmem buffer

* cache q values into registers for KQ

* fuse lf accumulation, pf and v accumulation into a loop

* stage K loads through shmem

* stage V loads through shmem

* only stage through shmem on Nvidia

* default to Bc 32

* also stage V through shmem when this is done for K

* dynamic subgroups for intel

* use vectorized stores

* use float_type for dequantize4 functions

* use smaller scalar rows size for smaller rows count

* relax flash attention split_k condition to allow non-gqa use

* use minimal subgroup size on Intel

* fix shmem support function

* fix rebase issues

* fixes

* Bc 4 for scalar FA is not a valid configuration

* Use wave32 on AMD RDNA for scalar FA

* add Intel shader core count lookup-table

* fix regressions

* device tuning

* tmpsh size fix

* fix editorconfig

* refactor fa tuning logic into a single place

* fix gqa opt logic

* fix block_rows with small n_rows

* amd tuning

* fix hsk=72/80 issue

* tuning

* allow condition skipping for column check

* use float16 for Of if available

* address feedback

* fix bad RDNA performance on head size <= 128 by limiting occupancy

* allow printing pipeline stats

* cleanup and fixes

* limit occupancy for GCN for small batch FA with large HSK

* disable f16 FA for GCN AMD GPUs on the proprietary driver
2026-02-24 08:35:48 +01:00
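The row-split and shared-memory staging details above are shader-specific, but the streaming computation flash attention builds on — online softmax with a running max, running sum, and rescaled output accumulator — can be sketched for a single query row. This is the textbook algorithm the shader tiles over workgroups, not the shader itself:

```python
import math

def flash_attn_row(q, K, V, scale=1.0):
    """One query row of attention in a single streaming pass over K/V.

    Track running max m and running sum l of exp(score - m); whenever a new
    key raises the max, rescale the previous partial sums by exp(m_old - m_new).
    """
    dim_v = len(V[0])
    m = -math.inf          # running max of scores
    l = 0.0                # running sum of exp(score - m)
    o = [0.0] * dim_v      # unnormalized output accumulator
    for k, v in zip(K, V):
        s = scale * sum(qi * ki for qi, ki in zip(q, k))
        m_new = max(m, s)
        corr = math.exp(m - m_new)      # rescale factor for previous partials
        p = math.exp(s - m_new)
        l = l * corr + p
        o = [oi * corr + p * vi for oi, vi in zip(o, v)]
        m = m_new
    return [oi / l for oi in o]
```

The result matches a naive two-pass softmax-weighted sum, while only ever streaming each K/V row once.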
Concedo
0fd7d2c0e5 ace step diffusion loading 2026-02-24 15:24:15 +08:00
Jeff Bolz
8c2c0108dd
vulkan: fix coopmat1 without bf16 support (#19793) 2026-02-24 07:48:32 +01:00
Jeff Bolz
3ea5360c00
vulkan: fix data race in mul_mat_id shader (#19790) 2026-02-24 07:43:12 +01:00
Max Krasnyansky
39fb81f875
hexagon refactor all Ops to use local context struct (#19819)
* hexagon: refactor set/get/sum-rows ops to use local context

* hexagon: refactor ROPE and Softmax Ops to use local context

Improves performance a bit by precomputing values and caching them in the context.

* hexagon: refactor activation ops to use local context struct

* hexagon: refactor unary ops to use local context struct and DMA/VTCM

* hexagon: use aligned hvx_scale function

* hexagon: remove unused fields from op_context

* hexagon: rewrite ROPE to use DMA and VTCM scratchpad

* hex-rope: keep N rows in scratchpad (instead of just two)

* hex-rope: introduce rowidx cache

* hex-rope: remove unused fields

* hex-rope: rewrite dma prefetch logic to allow for multi-row fetch/compute

also removes the need for fastdiv.

* hex-rope: minor formatting

* hex-rope: use indices and unroll the loops

* hex-rope: more updates to cleanup rope-block handling

* hexagon: cleanup supported type/dims checks

* hexagon: all reduce funcs replicated across lanes

There is no need to explicitly replicate the first value.

* snapdragon: update adb and windows scripts to use ubatch-size 256

Updated Ops support handles larger ubatches.
2026-02-23 16:32:14 -08:00
Aleksander Grygier
5eb0ea32f0
feat: Add code blocks full height setting to parameter sync service (#19835) 2026-02-23 22:30:13 +01:00
Adrien Gallouët
b68a83e641
vendor : update cpp-httplib to 0.34.0 (#19830)
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2026-02-23 21:05:48 +01:00
Concedo
749536f464 fixed wav header wrong size 2026-02-24 01:13:44 +08:00
Daniel Bevenius
d8aeb65cee
tests : fix typos in comments in test-backend-sampler [no ci] (#19824)
* tests : fix typos in comments in test-backend-sampler [no ci]
2026-02-23 17:12:02 +01:00
askmyteapot
062e361968
Update ace-qwen3.cpp to build on MSVC (#1992)
Need to include <sstream>, otherwise the build fails with many errors like the following:

```
C:\koboldcpp\otherarch\acestep\ace-qwen3.cpp(1278,9): error C2297: '<<': not valid as right operand has type 'const char [26]' [C:\koboldcpp\build\music_adapter.vcxproj]
  (compiling source file '../otherarch/acestep/music_adapter.cpp')

C:\koboldcpp\otherarch\acestep\ace-qwen3.cpp(1278,9): error C2679: binary '<<': no operator found which takes a right-hand operand of type 'std::string' (or there is no acceptable conversion) [C:\koboldcpp\build\music_adapter.vcxproj]
  (compiling source file '../otherarch/acestep/music_adapter.cpp')
      C:\Program Files (x86)\Microsoft Visual Studio\18\BuildTools\VC\Tools\MSVC\14.50.35717\include\__msvc_int128.hpp(753,46):
      could be 'std::_Unsigned128 std::operator <<(const std::_Unsigned128 &,const std::_Base128 &) noexcept' [found using argument-dependent lookup]
          C:\koboldcpp\otherarch\acestep\ace-qwen3.cpp(1278,9):
          'std::_Unsigned128 std::operator <<(const std::_Unsigned128 &,const std::_Base128 &) noexcept': cannot convert argument 2 from 'std::string' to 'const std::_Base128 &'
              C:\koboldcpp\otherarch\acestep\ace-qwen3.cpp(1278,57):
              Reason: cannot convert from 'std::string' to 'const std::_Base128'
              C:\koboldcpp\otherarch\acestep\ace-qwen3.cpp(1278,57):
              No user-defined-conversion operator available that can perform this conversion, or the operator cannot be called
```
2026-02-23 23:03:07 +08:00
Concedo
5311997581 updated ace step cpp 2026-02-23 23:01:10 +08:00
Concedo
2e713cfff5 fixed compile issue, trying out 8bit pcm 2026-02-23 21:19:03 +08:00
Aleksander Grygier
9051663d5d
webui: Add setting to have full height Code Blocks in Chat Messages (#19829) 2026-02-23 14:16:50 +01:00
Daniel Bevenius
72b44c0d21
model-conversion : merge inspect-org-model.py with tensor-info.py (#19823)
This commit replaces the inspect-org-model.py script, merging in the
contents of the tensor-info.py script. The merged script has also been
updated to print tensor sizes, which was the only thing tensor-info.py did
not do before.

The motivation for this is that tensor-info.py does not load the tensor
weights, which can be time-consuming for larger models. And now that both
scripts do almost the same thing, it makes sense to maintain one script
instead of two.
2026-02-23 14:15:16 +01:00