Commit graph

12060 commits

Author SHA1 Message Date
Concedo
22c78f6c82 fix q3tts compile, update docs and lite 2026-03-14 23:33:18 +08:00
Concedo
1802b09e6f Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	docs/build.md
#	docs/ops.md
#	docs/ops/CPU.csv
#	ggml/src/ggml-cpu/kleidiai/kernels.cpp
#	ggml/src/ggml-cpu/kleidiai/kleidiai.cpp
#	ggml/src/ggml-cpu/repack.cpp
#	ggml/src/ggml-cpu/repack.h
#	src/llama-quant.cpp
#	tests/test-json-schema-to-grammar.cpp
2026-03-14 17:56:16 +08:00
Concedo
ff3f8533d3 Merge commit 'c96f608d98' into concedo_experimental
# Conflicts:
#	CONTRIBUTING.md
#	docs/ops.md
#	docs/ops/Vulkan.csv
#	models/templates/LFM2-8B-A1B.jinja
#	tests/peg-parser/test-python-dict-parser.cpp
#	tests/peg-parser/test-unicode.cpp
#	tests/test-chat-peg-parser.cpp
#	tests/test-chat.cpp
#	tools/llama-bench/llama-bench.cpp
2026-03-14 17:14:34 +08:00
Concedo
8b9594b6ea wip router mode 2026-03-14 17:07:05 +08:00
Concedo
1d067933f0 claude fixes for ace step, idk man who am i to argue with an agi 2026-03-14 12:27:26 +08:00
Concedo
349fc744e9 cleanup, fixed a regression in music gen with codes due to instruct prompt change 2026-03-14 11:32:47 +08:00
Concedo
6143a75426 improve autofit padding heuristics 2026-03-14 00:36:52 +08:00
Concedo
04915d99ee Merge commit '451ef08432' into concedo_experimental
# Conflicts:
#	.github/workflows/build.yml
#	README.md
#	docs/ops.md
#	docs/ops/Vulkan.csv
#	src/llama-model-loader.cpp
#	src/llama-model.cpp
#	src/llama.cpp
#	tests/CMakeLists.txt
#	tests/peg-parser/test-basic.cpp
#	tests/peg-parser/test-json-parser.cpp
#	tests/peg-parser/test-python-dict-parser.cpp
#	tests/peg-parser/test-unicode.cpp
#	tests/test-chat-auto-parser.cpp
#	tests/test-chat-peg-parser.cpp
#	tests/test-chat.cpp
#	tools/CMakeLists.txt
2026-03-13 23:33:37 +08:00
Concedo
d2c911884d Merge commit '213c4a0b81' into concedo_experimental
# Conflicts:
#	CODEOWNERS
#	common/CMakeLists.txt
#	common/chat-peg-parser.cpp
#	common/chat.cpp
#	docs/backend/SYCL.md
#	docs/development/parsing.md
#	docs/ops.md
#	docs/ops/SYCL.csv
#	embd_res/templates/Apriel-1.6-15b-Thinker-fixed.jinja
#	embd_res/templates/Bielik-11B-v3.0-Instruct.jinja
#	embd_res/templates/GLM-4.7-Flash.jinja
#	embd_res/templates/LFM2-8B-A1B.jinja
#	embd_res/templates/StepFun3.5-Flash.jinja
#	ggml/src/ggml-opencl/CMakeLists.txt
#	ggml/src/ggml-opencl/ggml-opencl.cpp
#	ggml/src/ggml-sycl/CMakeLists.txt
#	ggml/src/ggml-sycl/backend.hpp
#	ggml/src/ggml-sycl/common.hpp
#	ggml/src/ggml-sycl/convert.cpp
#	ggml/src/ggml-sycl/convert.hpp
#	ggml/src/ggml-sycl/count-equal.cpp
#	ggml/src/ggml-sycl/dpct/helper.hpp
#	ggml/src/ggml-sycl/ggml-sycl.cpp
#	ggml/src/ggml-sycl/presets.hpp
#	ggml/src/ggml-sycl/softmax.cpp
#	ggml/src/ggml-sycl/vecdotq.hpp
#	models/templates/Apertus-8B-Instruct.jinja
#	models/templates/CohereForAI-c4ai-command-r7b-12-2024-tool_use.jinja
#	models/templates/Qwen-QwQ-32B.jinja
#	models/templates/Qwen3-Coder.jinja
#	models/templates/deepseek-ai-DeepSeek-R1-Distill-Llama-8B.jinja
#	models/templates/deepseek-ai-DeepSeek-R1-Distill-Qwen-32B.jinja
#	models/templates/deepseek-ai-DeepSeek-V3.1.jinja
#	models/templates/fireworks-ai-llama-3-firefunction-v2.jinja
#	models/templates/moonshotai-Kimi-K2.jinja
#	models/templates/unsloth-Apriel-1.5.jinja
#	tests/CMakeLists.txt
#	tests/peg-parser/test-basic.cpp
#	tests/peg-parser/tests.h
#	tests/test-backend-ops.cpp
#	tests/test-chat-peg-parser.cpp
#	tests/test-chat-template.cpp
#	tests/test-chat.cpp
#	tests/test-json-schema-to-grammar.cpp
#	tests/test-peg-parser.cpp
#	tools/CMakeLists.txt
#	tools/cli/cli.cpp
2026-03-13 21:35:56 +08:00
Concedo
4189508ef3 qwen3tts support 1.7b model 2026-03-13 21:15:24 +08:00
Concedo
a13641c00c tts loader fixes 2026-03-13 18:33:10 +08:00
Concedo
0a38237ff5 original qwen3tts files 2026-03-13 15:24:18 +08:00
Concedo
4427bab37e cover mode is now working 2026-03-13 14:55:39 +08:00
Concedo
84734eb409 better audio runtime reload 2026-03-13 14:02:56 +08:00
Concedo
8f23b8d81e wip on ref audio, but it compiles 2026-03-12 23:46:10 +08:00
Concedo
d5a4c17e14 mp3 not default 2026-03-12 21:42:59 +08:00
Concedo
3fd9648726 added mp3 support 2026-03-12 21:00:50 +08:00
Concedo
3092694d2e better resampler 2026-03-12 16:49:53 +08:00
Wagner Bruna
796f7bdeff
sd: fix LoRA multiplier logic to switch to at_runtime mode (#2029)
`0. in inputs.lora_multipliers` didn't work because the C array has
variable length.

Also fixed a few corner cases related to the default multipliers
(mainly to ensure robustness against future changes, since in most
cases the multiplier list is already sanitized by a previous
function).
2026-03-12 15:36:51 +08:00
Concedo
318a5486ce duration 2026-03-12 15:33:51 +08:00
Concedo
5b22858dbd updated docs 2026-03-12 00:20:20 +08:00
Concedo
3cc6e2ea17 make stereo default 2026-03-12 00:10:25 +08:00
Concedo
211d4fe632 lots of tweaks for ace step 2026-03-11 23:57:52 +08:00
Concedo
ecc4865244 improves code output quality 2026-03-10 23:07:52 +08:00
Concedo
8095bf9807 include overhead fromn music models 2026-03-10 22:52:20 +08:00
Concedo
6adcd0b5db Merge commit '34df42f7be' into concedo_experimental
# Conflicts:
#	README.md
#	ggml/src/ggml-hexagon/ggml-hexagon.cpp
#	ggml/src/ggml-hexagon/htp/CMakeLists.txt
#	ggml/src/ggml-hexagon/htp/act-ops.c
#	ggml/src/ggml-hexagon/htp/binary-ops.c
#	ggml/src/ggml-hexagon/htp/cpy-ops.c
#	ggml/src/ggml-hexagon/htp/get-rows-ops.c
#	ggml/src/ggml-hexagon/htp/htp-msg.h
#	ggml/src/ggml-hexagon/htp/htp-ops.h
#	ggml/src/ggml-hexagon/htp/hvx-arith.h
#	ggml/src/ggml-hexagon/htp/hvx-base.h
#	ggml/src/ggml-hexagon/htp/hvx-inverse.h
#	ggml/src/ggml-hexagon/htp/hvx-utils.h
#	ggml/src/ggml-hexagon/htp/main.c
#	ggml/src/ggml-hexagon/htp/rope-ops.c
#	ggml/src/ggml-hexagon/htp/set-rows-ops.c
#	ggml/src/ggml-hexagon/htp/softmax-ops.c
#	ggml/src/ggml-hexagon/htp/unary-ops.c
#	ggml/src/ggml-opencl/CMakeLists.txt
#	ggml/src/ggml-opencl/ggml-opencl.cpp
#	tests/test-backend-ops.cpp
#	tools/cli/cli.cpp
#	tools/server/webui/src/lib/components/app/chat/ChatScreen/ChatScreen.svelte
2026-03-10 22:20:04 +08:00
Concedo
746664fde6 Merge commit '2cd20b72ed' into concedo_experimental
# Conflicts:
#	CONTRIBUTING.md
#	docs/backend/CANN.md
#	docs/backend/SYCL.md
#	docs/backend/snapdragon/README.md
#	docs/backend/snapdragon/windows.md
#	docs/build.md
#	docs/multimodal/MobileVLM.md
#	docs/ops.md
#	docs/ops/WebGPU.csv
#	examples/debug/README.md
#	examples/llama.vim
#	examples/model-conversion/README.md
#	examples/sycl/README.md
#	ggml/src/ggml-cpu/amx/mmq.cpp
#	ggml/src/ggml-cpu/arch/x86/repack.cpp
#	ggml/src/ggml-hexagon/ggml-hexagon.cpp
#	ggml/src/ggml-hexagon/htp-drv.cpp
#	ggml/src/ggml-hexagon/htp/flash-attn-ops.c
#	ggml/src/ggml-hexagon/htp/hvx-base.h
#	ggml/src/ggml-hexagon/htp/hvx-copy.h
#	ggml/src/ggml-hexagon/htp/hvx-inverse.h
#	ggml/src/ggml-hexagon/htp/hvx-reduce.h
#	ggml/src/ggml-hexagon/htp/matmul-ops.c
#	ggml/src/ggml-hexagon/htp/rope-ops.c
#	ggml/src/ggml-hexagon/htp/worker-pool.c
#	ggml/src/ggml-opencl/ggml-opencl.cpp
#	ggml/src/ggml-opencl/kernels/cpy.cl
#	ggml/src/ggml-sycl/common.hpp
#	ggml/src/ggml-sycl/quants.hpp
#	ggml/src/ggml-sycl/softmax.cpp
#	ggml/src/ggml-vulkan/CMakeLists.txt
#	ggml/src/ggml-webgpu/ggml-webgpu-shader-lib.hpp
#	ggml/src/ggml-webgpu/ggml-webgpu.cpp
#	scripts/pr2wt.sh
#	scripts/server-bench.py
#	scripts/snapdragon/windows/run-cli.ps1
#	tests/test-alloc.cpp
#	tests/test-backend-ops.cpp
#	tests/test-chat.cpp
#	tools/cli/cli.cpp
#	tools/completion/README.md
#	tools/cvector-generator/cvector-generator.cpp
#	tools/imatrix/README.md
#	tools/perplexity/README.md
#	tools/server/public_simplechat/readme.md
#	tools/server/tests/README.md
2026-03-10 22:11:08 +08:00
Concedo
c8800ed16c gcc path fix 2026-03-10 21:40:32 +08:00
Ray Xu
8d880ac012
examples : fix empty items in json_schema_to_grammar.py [no ci] (#19968)
* Fix logic for retrieving schema items in `json_schema_to_grammar.py`

If `schema['items']` is `{}` and `prefixItems not in schema', as `{}` is Falsy, the original code here will raise an error.

I think if `schema['items']` is `{}`, them items should just be `{}`

* Apply suggestion from @CISC

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Add tests for arrays with empty items

Add two unit tests to `tests/test-json-schema-to-grammar.cpp` that validate handling of arrays when 'items' is an empty schema and when 'prefixItems' is present alongside an empty 'items'. Both tests expect the same generated grammar, ensuring the JSON Schema->grammar conversion treats an empty 'items' schema (and the presence of 'prefixItems') correctly and covering this edge case.

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-03-10 14:38:18 +01:00
Concedo
b06dd2606e ruff: linting 2026-03-10 21:32:36 +08:00
a3894281
0f1e9d14cc
docs: update CPU backend ops to mark POOL_1D as supported (#20304) 2026-03-10 21:31:24 +08:00
Wagner Bruna
3f42ed1af7
support for customizing LoRA multipliers through the sdapi (#1982)
* fix corner case in sd_oai_transform_params

Also fix typo in the function name.

* support for customizing loaded LoRA multipliers

The `sdloramult` flag now accepts a list of multipliers, one for each
LoRA. If all multipliers are non-zero, LoRAs load as before, with no extra
VRAM usage or performance impact.

If any LoRA has a multiplier of 0, we switch to `at_runtime` mode, and these
LoRAs will be available to multiplier changes via the `lora` sdapi field and
show up in the `sdapi/v1/loras` endpoint. All LoRAs are still preloaded on
startup, and cached to avoid file reloads.

If the list of multipliers is shorter than the list of LoRAs, the multiplier
list is extended with the first multiplier (1.0 by default), to keep it
compatible with the previous behavior.

* support for `<lora:name:multiplier>` prompt syntax and metadata

* add a few tests for sanitize_lora_multipliers
2026-03-10 21:29:39 +08:00
Concedo
eafb5ff4c5 autofit improvement e.g. for strix (+1 squashed commits)
Squashed commits:

[6f6fd59c3] autofit improvement e.g. for strix
2026-03-10 21:20:02 +08:00
Georgi Gerganov
1274fbee9e
models : fix assert in mamba2 (cont) (#20335)
* models : fix assert in mamba2 (cont)

* cont : add n_group mod

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-03-10 15:00:08 +02:00
Georgi Gerganov
a7b3dee7a5
server : make 2 checkpoints near the end of the prompt (#20288)
* server : make 2 checkpoints near the end of the prompt

* cont : adjust checkpoints
2026-03-10 14:28:23 +02:00
Sigbjørn Skjæret
ec947d2b16
common : fix incorrect uses of stoul (#20313) 2026-03-10 11:40:26 +01:00
Charles Xu
0cd4f4720b
kleidiai : support for concurrent sme and neon kernel execution (#20070) 2026-03-10 09:25:25 +02:00
Taimur Ahmad
af237f3026
ggml-cpu: add RVV repack GEMM and GEMV for quantization types (#19121)
* ggml-cpu: add rvv ggml_quantize_mat_4x8 for q8_0

Co-authored-by: Rehan Qasim <rehan.qasim@10xengineers.ai>

* ggml-cpu: add rvv repacking for iq4_nl

* ggml-cpu: add generic impl for iq4_nl gemm/gemv

* ggml-cpu: add rvv repacking for q8_0

* ggml-cpu: refactor; add rvv repacking for q4_0, q4_K

* ggml-cpu: refactor; add rvv repacking for q2_K

Co-authored-by: Rehan Qasim <rehan.qasim@10xengineers.ai>

* ggml-cpu: refactor rvv repack

---------

Co-authored-by: Rehan Qasim <rehan.qasim@10xengineers.ai>
2026-03-10 08:49:52 +02:00
Julian Pscheid
1a5631beaa
metal: handle command buffer failures gracefully in synchronize (#20306)
Replace GGML_ABORT("fatal error") in ggml_metal_synchronize() with
error flag + return. This aligns synchronize error handling with
graph_compute, which already returns GGML_STATUS_FAILED for the same
condition.

When a command buffer fails (e.g., iOS GPU access revocation during
backgrounding, macOS eGPU disconnect, OOM), the backend enters an
error state instead of killing the host process. Subsequent
graph_compute calls return GGML_STATUS_FAILED immediately. Recovery
requires recreating the backend.

Failed extra command buffers are properly released on the error path
to avoid Metal object leaks.
2026-03-10 08:32:24 +02:00
ddh0
1dab5f5a44
llama-quant : fail early on missing imatrix, refactor type selection, code cleanup (#19770)
* quantize : imatrix-fail early + code cleanup

* fix manual override printing

it's in the preliminary loop now, so needs to be on its own line

* revert header changes per ggerganov

* remove old #includes

* clarify naming

rename `tensor_quantization` to `tensor_typo_option` to descirbe its
functionality

* fix per barto
2026-03-10 08:16:05 +02:00
Concedo
500a1ab466 disable smartcache if slots is zero 2026-03-10 08:57:31 +08:00
Aldehir Rojas
c96f608d98
common: consolidate PEG string parsers (#20263)
* common : consolidate PEG string parsers
* cont : fix json_string_content()
2026-03-10 00:29:21 +01:00
Xuan-Son Nguyen
0842b9b465
model: fix step3.5 n_rot (#20318) 2026-03-09 23:42:24 +01:00
Xuan-Son Nguyen
59db9a357d
llama: dynamic head_dim and n_rot for SWA (#20301)
* llama: dynamic head_dim and n_rot for SWA

* also add gguf_writer wrappers

* fix build

* build_rope_shift arg reorder
2026-03-09 22:22:39 +01:00
Evan Huus
23fbfcb1ad
server: Parse port numbers from MCP server URLs in CORS proxy (#20208)
* Parse port numbers from MCP server URLs

* Pass scheme to http proxy for determining whether to use SSL

* Fix download on non-standard port and re-add port to logging

* add test

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2026-03-09 17:47:54 +01:00
Concedo
2bd6b87d5b remove a file 2026-03-09 23:08:53 +08:00
Concedo
ee96e71bae don't resample audio 2026-03-09 22:53:55 +08:00
Paul Flynn
e22cd0aa15
metal : extend mul_mv_ext to BF16, Q2_K, Q3_K (#20250)
Enable mul_mv_ext small-batch kernels (BS 2-8) for BF16, Q2_K,
and Q3_K quantization types. These types previously fell through
to the slower single-row mul_mv path.

BF16 uses the float4 dequantize path (like F16). Q2_K and Q3_K
use the float4x4 K-quant path (like Q4_K/Q5_K/Q6_K).

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-09 16:48:12 +02:00
Georgi Gerganov
96cfc4992c
server : fix checkpoints n_tokens calculation (#20287) 2026-03-09 16:47:06 +02:00
Georgi Gerganov
ed0007aa32
metal : add upscale (#20284) 2026-03-09 16:45:11 +02:00