Concedo
7df210833e
missed one case for autofit
2026-03-03 21:05:59 +08:00
Concedo
707f7b37bf
optimize pp
2026-03-03 21:02:51 +08:00
Concedo
ae67caa2f7
ace qwen rep pen for codes
2026-03-02 21:18:06 +08:00
Concedo
de9840afac
qwen image max ref image size fix from 512x512 to 1024x1024
2026-03-02 21:08:52 +08:00
Concedo
b632d2ce1c
print timestamp when image generated
2026-03-02 18:38:21 +08:00
Concedo
cf158f1b6e
updated lite
2026-03-02 16:59:16 +08:00
Concedo
d7fb3df10a
support 1 level deep admindir
2026-03-02 16:23:34 +08:00
Concedo
d904b51b0f
adjust slot counts
2026-03-02 15:56:15 +08:00
Concedo
42134db6b4
finally fixed smartcache for qwen
2026-03-02 00:47:38 +08:00
Concedo
6c5a7a27af
clamp music duration
2026-03-01 01:15:26 +08:00
Concedo
c9e651f7e5
updated lite, fix some cuda spams, fix qwen3tts voice loading
2026-03-01 00:41:56 +08:00
Concedo
0b76f73fc2
smartcache bug seems to be fixed
2026-02-28 18:08:54 +08:00
Concedo
4e358265a3
Merge commit '8387ffb28d' into concedo_experimental
...
# Conflicts:
# docs/backend/VirtGPU.md
# docs/backend/ZenDNN.md
# ggml/src/ggml-cpu/amx/amx.cpp
# ggml/src/ggml-cpu/amx/mmq.cpp
# ggml/src/ggml-sycl/add-id.cpp
# ggml/src/ggml-virtgpu/backend/backend-dispatched-backend.cpp
# ggml/src/ggml-virtgpu/backend/backend-dispatched-buffer-type.cpp
# ggml/src/ggml-virtgpu/backend/backend-dispatched-buffer.cpp
# ggml/src/ggml-virtgpu/backend/backend-dispatched.cpp
# ggml/src/ggml-virtgpu/backend/backend-dispatched.gen.h
# ggml/src/ggml-virtgpu/backend/backend-dispatched.h
# ggml/src/ggml-virtgpu/backend/backend-virgl-apir.h
# ggml/src/ggml-virtgpu/backend/backend.cpp
# ggml/src/ggml-virtgpu/backend/shared/api_remoting.h
# ggml/src/ggml-virtgpu/backend/shared/apir_backend.gen.h
# ggml/src/ggml-virtgpu/backend/shared/apir_backend.h
# ggml/src/ggml-virtgpu/backend/shared/apir_cs.h
# ggml/src/ggml-virtgpu/backend/shared/apir_cs_ggml.h
# ggml/src/ggml-virtgpu/backend/shared/apir_cs_rpc.h
# ggml/src/ggml-virtgpu/ggml-backend-buffer-type.cpp
# ggml/src/ggml-virtgpu/ggml-backend-device.cpp
# ggml/src/ggml-virtgpu/ggml-backend-reg.cpp
# ggml/src/ggml-virtgpu/ggml-backend.cpp
# ggml/src/ggml-virtgpu/ggml-remoting.h
# ggml/src/ggml-virtgpu/include/apir_hw.h
# ggml/src/ggml-virtgpu/regenerate_remoting.py
# ggml/src/ggml-virtgpu/virtgpu-forward-backend.cpp
# ggml/src/ggml-virtgpu/virtgpu-forward-buffer-type.cpp
# ggml/src/ggml-virtgpu/virtgpu-forward-buffer.cpp
# ggml/src/ggml-virtgpu/virtgpu-forward-device.cpp
# ggml/src/ggml-virtgpu/virtgpu-forward-impl.h
# ggml/src/ggml-virtgpu/virtgpu-forward.gen.h
# ggml/src/ggml-virtgpu/virtgpu.cpp
# ggml/src/ggml-virtgpu/virtgpu.h
# ggml/src/ggml-zendnn/CMakeLists.txt
# ggml/src/ggml-zendnn/ggml-zendnn.cpp
# src/CMakeLists.txt
# tests/CMakeLists.txt
# tests/test-tokenizer-0.sh
# tools/cli/README.md
# tools/completion/README.md
# tools/imatrix/imatrix.cpp
# tools/server/README.md
2026-02-28 12:45:16 +08:00
Wagner Bruna
5c40f07d4a
sd: sync to 0752cc9 (master-507-b314d80 +1) ( #1999 )
...
* sd: sync to 0752cc9 (master-507-b314d80 +1)
* sd: add flow-shift support to gendefaults
2026-02-28 12:22:32 +08:00
Concedo
d643d945f5
clamp music inference steps to 100 max
2026-02-28 12:12:50 +08:00
Concedo
dd08d675f2
incomplete fix for rnn models, load state works but logits slightly different
2026-02-28 11:52:24 +08:00
Concedo
14d82bb38e
allow music llm and diffusion gen models to be loaded independently
2026-02-27 21:56:48 +08:00
Concedo
19eb78844c
audio codes working
2026-02-27 21:23:00 +08:00
Concedo
ba42f22fc8
stereo is working
2026-02-27 20:36:44 +08:00
Daniel Bevenius
8387ffb28d
gguf-py : bump version to 0.18.0 ( #19950 )
...
This commit updates the gguf-py package version to 0.18.0 in preparation
of a new release to PyPI.
Refs: https://github.com/ggml-org/llama.cpp/discussions/19948
2026-02-27 11:02:53 +01:00
Pascal
2e7e638523
server : support multiple model aliases via comma-separated --alias ( #19926 )
...
* server : support multiple model aliases via comma-separated --alias
* server : update --alias description and regenerate docs
* server : multiple model aliases and tags
- address review feedback from ngxson
- --alias accepts comma-separated values (std::set, no duplicates)
- --tags for informational metadata (not used for routing)
- aliases resolve transparently in router via get_meta/has_model
- /v1/models exposes aliases and tags fields
* regenerate docs
* nits
* server : use first alias as model_name for backward compat
address review feedback from ngxson
* server : add single-model test for aliases and tags
2026-02-27 07:05:23 +01:00
Jan Patrick Lehr
a8b192b6ec
tests : enable test-chat out of tree build ( #19558 )
...
The binary relies on model files that it tries to find. However, when
configuring the build directory to be parallel to the source tree those
heuristics fail.
This sets the working directory for the test executable to the
source tree, which resolves the issue.
2026-02-27 05:37:54 +01:00
Neo Zhang
c17dce4f5c
replace the magic number 768 with the max work group size to support iGPU ( #19920 )
...
Co-authored-by: Neo Zhang Jianyu <jianyu.zhang@intel.com>
2026-02-27 09:26:07 +08:00
Vishal Singh
88cf781f51
ggml-zendnn: update code for latest ZenDNN API ( #19923 )
...
- adapt ggml-zendnn.cpp to the new lowoha::matmul interface
- update the ZenDNN git tag in CMake to the latest release (ZenDNN-2026-WW08)
- add static lib support in CMake
2026-02-27 08:43:41 +08:00
Adrien Gallouët
4e76d24f28
ggml : fix AMX and add batched support ( #19925 )
...
llama-perplexity -hf ggml-org/Qwen3-0.6B-GGUF:Q4_0 -f wikitext-2-raw/wiki.test.raw -c 2048 -b 2048 --chunks 2
before this commit:
```
perplexity: calculating perplexity over 2 chunks, n_ctx=2048, batch_size=2048, n_seq=1
perplexity: 2.31 seconds per pass - ETA 0.07 minutes
[1]17.3868,[2]22.2199,
Final estimate: PPL = 22.2199 +/- 1.59692
llama_perf_context_print: load time = 878.56 ms
llama_perf_context_print: prompt eval time = 2037.82 ms / 4096 tokens ( 0.50 ms per token, 2009.99 tokens per second)
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: total time = 6403.17 ms / 4097 tokens
llama_perf_context_print: graphs reused = 0
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
llama_memory_breakdown_print: | - Host | 845 = 318 + 224 + 302 |
llama_memory_breakdown_print: | - CPU_REPACK | 288 = 288 + 0 + 0 |
llama_memory_breakdown_print: | - AMX | 31 = 31 + 0 + 0 |
```
after this commit:
```
perplexity: calculating perplexity over 2 chunks, n_ctx=2048, batch_size=2048, n_seq=1
perplexity: 1.98 seconds per pass - ETA 0.05 minutes
[1]17.2005,[2]21.8220,
Final estimate: PPL = 21.8220 +/- 1.56485
llama_perf_context_print: load time = 719.23 ms
llama_perf_context_print: prompt eval time = 1676.23 ms / 4096 tokens ( 0.41 ms per token, 2443.58 tokens per second)
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: total time = 4258.74 ms / 4097 tokens
llama_perf_context_print: graphs reused = 0
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
llama_memory_breakdown_print: | - Host | 845 = 318 + 224 + 302 |
llama_memory_breakdown_print: | - AMX | 319 = 319 + 0 + 0 |
```
(no more CPU_REPACK)
after this commit, disabling amx:
```
perplexity: calculating perplexity over 2 chunks, n_ctx=2048, batch_size=2048, n_seq=1
perplexity: 2.34 seconds per pass - ETA 0.07 minutes
[1]17.2005,[2]21.8220,
Final estimate: PPL = 21.8220 +/- 1.56485
llama_perf_context_print: load time = 841.91 ms
llama_perf_context_print: prompt eval time = 2057.28 ms / 4096 tokens ( 0.50 ms per token, 1990.98 tokens per second)
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: total time = 6454.51 ms / 4097 tokens
llama_perf_context_print: graphs reused = 0
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
llama_memory_breakdown_print: | - Host | 845 = 318 + 224 + 302 |
llama_memory_breakdown_print: | - CPU_REPACK | 319 = 319 + 0 + 0 |
```
=> same perplexity.
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2026-02-26 21:39:11 +01:00
Ruben Ortlam
723c71064d
vulkan: fix fp16 Flash Attention on Windows AMD RDNA2 and below ( #19921 )
2026-02-26 19:11:04 +01:00
Georgi Gerganov
37964f44f9
mtmd : fix padding of n_tokens ( #19930 )
2026-02-26 18:39:49 +02:00
Georgi Gerganov
01cd448b8c
server : fix ctx checkpoint restore logic ( #19924 )
2026-02-26 18:20:16 +02:00
Georgi Gerganov
99bd67c9b2
kv-cache : fix can_shift() check to take into account M-RoPE ( #19928 )
2026-02-26 18:08:54 +02:00
Concedo
5a57ed8ca4
revert to 8 step
2026-02-26 22:07:01 +08:00
Concedo
173702d1a4
music lowvram indicator
2026-02-26 21:30:47 +08:00
Aman Gupta
b68d75165a
llama: Add option to merge gate and exp weights ( #19139 )
...
* llama: Add option to merge gate and exp weights
* Update convert_hf_to_gguf.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update convert_hf_to_gguf.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* update constants.py
* add gate_up for the all MoE models
* convert: simplify merge tensor condition
* update constants.py
* reduce number of models, add create_tensor_gate_up helper
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-02-26 21:01:08 +08:00
Kevin Pouget
ffaafde16f
ggml-virtgpu: improve the reliability of the code ( #19846 )
...
* ggml-virtgpu-backend: validate the consistency of the received objects
This patch adds consistency checks in the
ggml-virtgpu-backend (running on the host side) to ensure that the
data received from the guest is consistent (valid pointers, valid
sizes and offsets).
* ggml-virtgpu-backend: add fallback/skips for optional ggml backend methods
```
1. bck->iface.synchronize(bck)
2. buft->iface.get_alloc_size(buft, op)
3. buft->iface.get_max_size(buft)
```
these three methods are optional in the GGML interface. `get_max_size`
was already properly defaulted, but `backend synchronize` and `buft
get_alloc_size` would have segfaulted the backend if not implemented.
* ggml-virtgpu-backend: fix log format missing argument
* ggml-virtgpu-backend: improve the abort message
* ggml-virtgpu-backend: more safety checks
* ggml-virtgpu-backend: new error code
* ggml-virtgpu-backend: initialize all the error codes
* ggml-virtgpu: add a missing comment generated by the code generator
* ggml-virtgpu: add the '[virtgpu]' prefix to the device/buffer names
* ggml-virtgpu: apir_device_buffer_from_ptr: improve the error message
* ggml-virtgpu: shared: make it match the latest api_remoting.h of Virglrenderer APIR
(still unmerged)
* ggml-virtgpu: update the code generator to have dispatch_command_name in a host/guest shared file
* ggml-virtgpu: REMOTE_CALL: fail if the backend returns an error
* docs/backend/VirtGPU.md: indicate that the RAM+VRAM size is limited to 64 GB with libkrun
* ggml-virtgpu: turn off clang-format header ordering for some of the files
Compilation breaks when ordered alphabetically.
* ggml-virtgpu: clang-format
* ggml-virtgpu/backend/shared/api_remoting: better comments for the APIR return codes
2026-02-26 20:00:57 +08:00
Concedo
05834eecb3
Merge commit '1ca3d1de15' into concedo_experimental
...
# Conflicts:
# tools/server/README.md
2026-02-26 19:55:06 +08:00
Concedo
adebf63877
ace converter
2026-02-26 19:53:02 +08:00
drrros
efba35a860
server: fix load-on-startup not respected in ini file ( #19897 )
...
Co-authored-by: Roman Marchenko <r.marchenko@ideco.ru>
2026-02-26 12:32:31 +01:00
Eric Zhang
9b62913b40
jinja : correct default size for string slices ( #19913 )
2026-02-26 12:28:09 +01:00
Maximilian Werk
66287bdaac
model : add Jina Embeddings v5 Nano (partial EuroBERT) support ( #19826 )
...
* WIP: Add EuroBERT support with autoformatting changes
This commit includes:
- EuroBERT model implementation for GGUF conversion
- C++ backend support for EuroBERT architecture
- Unintended autoformatting changes to Python files
Saving before reverting formatting-only changes.
* feat: add back eos assert when not last token pooling
* feat: removed duplicated code and cleanup
* feat: removed not working architectures and unnecessary check
* fix: typo
* fix: dynamic pooling config
* feat: added an example model for eurobert
* feat: proper llama-vocab implementation for jina-v5
* fix: removed unnecessary comments
2026-02-26 12:14:09 +01:00
Georgi Gerganov
1ca3d1de15
gguf : avoid too many file size calls ( #19919 )
2026-02-26 12:46:32 +02:00
yggdrasil75
bd72300591
server : fix typo in server README.md ( #19900 )
...
fix typo
2026-02-26 11:26:16 +01:00
Concedo
ac8f12f259
still a bit wonky
2026-02-26 17:50:49 +08:00
Concedo
81fb4d773c
swap resampling function
2026-02-26 17:37:53 +08:00
Concedo
749a606374
whisper broke
2026-02-26 16:45:04 +08:00
Concedo
44182ebefe
Merge commit '8c2c0108dd' into concedo_experimental
...
# Conflicts:
# examples/model-conversion/Makefile
# examples/model-conversion/scripts/utils/inspect-org-model.py
# ggml/src/ggml-hexagon/ggml-hexagon.cpp
# ggml/src/ggml-hexagon/htp/act-ops.c
# ggml/src/ggml-hexagon/htp/get-rows-ops.c
# ggml/src/ggml-hexagon/htp/hex-dma.h
# ggml/src/ggml-hexagon/htp/htp-ops.h
# ggml/src/ggml-hexagon/htp/matmul-ops.c
# ggml/src/ggml-hexagon/htp/rope-ops.c
# ggml/src/ggml-hexagon/htp/set-rows-ops.c
# ggml/src/ggml-hexagon/htp/softmax-ops.c
# ggml/src/ggml-hexagon/htp/unary-ops.c
# scripts/snapdragon/adb/run-cli.sh
# scripts/snapdragon/adb/run-completion.sh
# scripts/snapdragon/adb/run-mtmd.sh
# scripts/snapdragon/windows/run-cli.ps1
# scripts/sync_vendor.py
# tests/test-backend-sampler.cpp
2026-02-26 16:30:37 +08:00
Concedo
7e53bfd28d
Merge commit '2b6dfe824d' into concedo_experimental
...
# Conflicts:
# .github/workflows/release.yml
# examples/save-load-state/save-load-state.cpp
# src/llama-context.cpp
# tools/cli/cli.cpp
2026-02-26 15:07:23 +08:00
Wagner Bruna
d400b37215
config file saving enhancements ( #1994 )
...
* process --exportconfig and --exporttemplate after --config
This allows using `--config oldfile.kcpps --exportconfig newfile.kcpps`
to update old config items, copy a config file with changed parameters,
download and save a remote config, etc.
* filter out command flags from the saved config files
Also indent files saved by command-line.
2026-02-26 14:55:01 +08:00
Concedo
fb3f7d92bc
reenable cfg
2026-02-26 14:51:15 +08:00
Concedo
b7d2fe68e7
adjust
2026-02-26 14:46:41 +08:00
Concedo
edbc4fe592
music lm finally working
2026-02-26 14:00:58 +08:00
Concedo
cf042af701
Revert "still not working"
...
This reverts commit a1305ffff9.
2026-02-26 10:55:55 +08:00