Commit graph

13158 commits

Author SHA1 Message Date
Wagner Bruna
243b03586b
sd: build each source file separately (#2188)
* sd: build source files separately

* sd: decouple stable-diffusion.cpp and sdtype_adapter.cpp

* sd: remove include util.h from sdtype_adapter.cpp

* sd: update source file lists and review dependencies
2026-05-07 22:50:10 +08:00
Concedo
81f2b5c448 prepare for sdcpp build refactor 2026-05-07 22:49:14 +08:00
Concedo
9e9497f0cc Merge remote-tracking branch 'origin/upstream' into concedo_experimental
# Conflicts:
#	examples/save-load-state/save-load-state.cpp
#	ggml/CMakeLists.txt
#	ggml/src/ggml-cpu/CMakeLists.txt
#	ggml/src/ggml-hexagon/htp/hmx-matmul-ops.c
#	ggml/src/ggml-hexagon/htp/matmul-ops.c
#	ggml/src/ggml-opencl/CMakeLists.txt
#	ggml/src/ggml-opencl/ggml-opencl.cpp
#	ggml/src/ggml-opencl/kernels/gemm_noshuffle_q4_0_f32.cl
#	ggml/src/ggml-opencl/kernels/gemm_noshuffle_q8_0_f32.cl
#	ggml/src/ggml-opencl/kernels/gemv_noshuffle_q4_0_f32.cl
#	ggml/src/ggml-opencl/kernels/gemv_noshuffle_q4_0_f32_spec.cl
#	ggml/src/ggml-opencl/kernels/gemv_noshuffle_q8_0_f32.cl
#	ggml/src/ggml-rpc/ggml-rpc.cpp
#	scripts/sync-ggml.last
#	scripts/sync_vendor.py
#	src/llama-graph.cpp
#	tests/test-backend-ops.cpp
#	tests/test-state-restore-fragmented.cpp
2026-05-06 21:20:06 +08:00
Concedo
7240da764a Merge commit '935a340292' into concedo_experimental
# Conflicts:
#	examples/diffusion/CMakeLists.txt
#	scripts/server-test-function-call.py
#	src/llama-model.cpp
#	src/models/gemma4.cpp
#	tests/test-chat.cpp
#	tests/test-reasoning-budget.cpp
#	tools/server/README.md
2026-05-06 21:02:25 +08:00
henk717
bcf9c81e0d
Linux CUDA13 Action (#2186)
* Linux CU13 CI

* Bump max CUDA arch

* CUDA13 Linux

* Upload the correct build to rolling (CUDA13)

* Downgrade cuda to get better compatibility

Runpod can't handle 13.1, and if they can't handle it neither can the people with a secondary GPU of an older generation.

* Add support for compute capability 89 in NVCCFLAGS
2026-05-06 18:06:39 +08:00
Concedo
15e86c4f9b hard coded reasoning_effort field from the api payload and force it into the jinja kwargs (request by @henk717). field name also hardcoded. 2026-05-06 17:35:26 +08:00
Aleksander Grygier
e3e3f8e46a
webui: Remove Google Favicons & Improve MCP Information logic & UI (#22719)
* refactor: Remove Google favicon utility

* fix: MCP Server favicon

* refactor: Cleanup

* refactor: MCP Server Information

* fix: Fix MCP Settings UI

* refactor: Cleanup
2026-05-06 11:12:27 +02:00
zzzzwc
f08f20a0e3
ggml-cpu: fuse RMS_NORM + MUL on CPU backend (#22423) 2026-05-06 15:41:14 +08:00
viggy
07eaf919ed
add tabindex and aria-hidden (#22699) 2026-05-06 09:21:58 +02:00
Sigbjørn Skjæret
74d6248f71
convert : add filter_tensors method to pre-filter tensors (#22597)
* add filter_tensors classmethod

* remove language_model

* fix parts validation
2026-05-06 08:06:05 +02:00
fl0rianr
2ca1161bd7
ggml : use CL_DEVICE_GLOBAL_MEM_SIZE as memory estimate for OpenCL --fit (#22688)
* ggml : report estimated OpenCL memory for --fit

Signed-off-by: Florian Reinle <f.reinle@otec.de>

* ggml : estimated OpenCL memory backend integrated

Signed-off-by: Florian Reinle <f.reinle@otec.de>

---------

Signed-off-by: Florian Reinle <f.reinle@otec.de>
2026-05-05 22:12:48 -07:00
Trivikram Reddy
bbeb89d76c
Hexagon: Process M-tail rows on HMX instead of HVX (#22724)
* hex-mm: process m-tail rows on HMX instead of HVX

* hmx-mm: unroll and optimize padded activation loop

---------

Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>
2026-05-05 09:43:03 -07:00
lhez
ff806a110d
opencl: refactor Adreno q4_0 (#22335)
* opencl: refactor adreno q4_0 gemm/gemv dispatch

* opencl: refactor q4_0 gemm/gemv loading, use consistent names

* opencl: use consistent name for adreno q8_0 gemm/gemv

* opencl: use consistent names for adreno q4_0 gemm/gemv

* opencl: simplify adreno q4_0 set_tensor

* opencl: refactor q4_0 get_tensor
2026-05-05 09:38:57 -07:00
Radoslav Gerganov
d5003b6e4d
rpc : use graph uid instead of graph cache (#22701)
Store the last graph uid and compare against it to determine if the same
graph is being computed.
2026-05-05 13:47:13 +03:00
Adrien Gallouët
2635ac76e8
common : fix missing-noreturn warnings when compiling with clang 21 (#22702)
common/arg.cpp:3719:9: error: function 'operator()' could be declared with attribute 'noreturn' [-Werror,-Wmissing-noreturn]
     3719 |         [](common_params & /*params*/, int /*value*/) {
          |         ^
    common/arg.cpp:3726:9: error: function 'operator()' could be declared with attribute 'noreturn' [-Werror,-Wmissing-noreturn]
     3726 |         [](common_params & /*params*/, int /*value*/) {
          |         ^
    common/arg.cpp:3733:9: error: function 'operator()' could be declared with attribute 'noreturn' [-Werror,-Wmissing-noreturn]
     3733 |         [](common_params & /*params*/, int /*value*/) {
          |         ^
    common/arg.cpp:3740:9: error: function 'operator()' could be declared with attribute 'noreturn' [-Werror,-Wmissing-noreturn]
     3740 |         [](common_params & /*params*/, int /*value*/) {
          |         ^
    common/arg.cpp:3747:9: error: function 'operator()' could be declared with attribute 'noreturn' [-Werror,-Wmissing-noreturn]
     3747 |         [](common_params & /*params*/, int /*value*/) {
          |         ^

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2026-05-05 13:16:25 +03:00
Georgi Gerganov
70a8309114 sync : ggml 2026-05-05 13:15:59 +03:00
Georgi Gerganov
c91faf997f ggml : bump version to 0.11.0 (ggml/1478) 2026-05-05 13:15:59 +03:00
Adrien Gallouët
bf76ac77be
common : only load backends when required (#22290)
* common : only load backends when required

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* llama : call ggml_backend_load_all() directly from llama_backend_init()

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Add ggml_backend_load_all() where llama_backend_init() is not used

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

---------

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2026-05-05 09:23:50 +02:00
Alessandro de Oliveira Faria (A.K.A.CABELO)
a09a00e502
vendor : update cpp-httplib to 0.43.3 (#22686) 2026-05-05 09:04:57 +02:00
Georgi Gerganov
2bacb1eb77
server : validate --tools CLI argument against known tool names (#22538)
Previously, unknown tool names passed via --tools were silently ignored.
Now the server validates each tool name at startup and exits with an
error if an unrecognized tool is specified, listing the available tools.

Assisted-by: llama.cpp:local pi
2026-05-05 06:35:27 +03:00
Georgi Gerganov
d6e7b033a4
llama : add option to save memory in device buffers (#22679)
* llama : add option to save memory in device buffers

* tests : extend llama-save-load-state
2026-05-05 06:35:07 +03:00
Sigbjørn Skjæret
fa595462ca
graph : handle non-contiguous Q/K/V in mul_mat_aux (#22630)
* qkv may not always be contiguous

* cont : make the cont conditional

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-05-05 06:34:44 +03:00
Ismail
a817a22bc6
ggml : implement fast walsh-hadamard transform for kv rotation (#21352) (#22631) 2026-05-05 10:05:05 +08:00
Charles Xu
eff06702b2
kleidiai : update to v1.24.0 and use release archive (#22549) 2026-05-04 22:13:31 +03:00
leonardHONG
e77056f9b2
CUDA: use fastdiv for batch index split in get_rows (#22650) 2026-05-04 16:24:05 +02:00
Xuan-Son Nguyen
935a340292
server: implement /models?reload=1 (#21848) 2026-05-04 16:23:26 +02:00
Shakhnazar Sailaukan
d8794eecd5
examples: refactor diffusion generation (#22590)
* examples: refactor diffusion generation

* renamed enum values
2026-05-04 20:19:30 +08:00
JusteLeo
36a694c965
webui : fix circular dependency between chat.service.ts and models.svelte.ts (#22625) 2026-05-04 13:38:10 +02:00
Piotr Wilkin (ilintar)
a4701c98f7
common/autoparser: fixes for newline handling / forced tool calls (#22654)
* chat/autoparser: the fixes

* Move optspace() to chat-peg-parser, comment out server tests invalidated due to content now allowed with forced tool calls.

* Trim whitespace on apply instead
2026-05-04 13:18:11 +02:00
Xuan-Son Nguyen
994118a183
model: move load_hparams and load_tensors to per-model definition (#22004)
* git-friendly migration

* add build_graph

* nits

* exclude old code from build

* wip

* add llm_arch_model_i

* prepare downstream functions

* nits

* nits

* wip

* wip

* add back create_tensor_qkv

* fix files missing include

* enforce one llm_build per arch

* cmake: use glob

* missing model params

* nits

* wip

* wip (2)

* wip (3)

* test-llama-archs is happy

* improve switch case

* move more stuff into llm_arch_model_i

* fix downstream code

* nits

* nits (2)

* fix order

* llama_model_base

* LLAMA_LOAD_LOCALS

* small fix

* fix build errors

* auto

* rm migration script and ifdef
2026-05-04 12:36:59 +02:00
Evan Huus
c84e6d6db5
server: Add a simple get_datetime server tool (#22649) 2026-05-04 12:19:41 +02:00
Concedo
2905c6254f Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.pi/gg/SYSTEM.md
#	docs/speculative.md
#	ggml/src/ggml-virtgpu/virtgpu-shm.cpp
#	ggml/src/ggml-virtgpu/virtgpu.cpp
#	ggml/src/ggml-virtgpu/virtgpu.h
#	ggml/src/ggml-webgpu/ggml-webgpu-shader-lib.hpp
#	ggml/src/ggml-webgpu/ggml-webgpu.cpp
#	ggml/src/ggml-webgpu/wgsl-shaders/row_norm.wgsl
#	tools/cli/README.md
#	tools/completion/README.md
#	tools/server/README.md
2026-05-04 15:36:13 +08:00
Concedo
4a8a51a3a7 updated sdui, increase ace step music vae chunk size 2026-05-04 15:30:45 +08:00
Concedo
950676fdb7 split utils.cpp into 2 files to support sd.cpp 2026-05-04 15:04:12 +08:00
Nick Towle
fa8feaed34
webui: restore missing settings (#22666)
Some checks failed
Python Type-Check / python type-check (push) Has been cancelled
Check Pre-Tokenizer Hashes / pre-tokenizer-hashes (push) Has been cancelled
Python check requirements.txt / check-requirements (push) Has been cancelled
2026-05-04 09:04:07 +02:00
Wagner Bruna
276c651a12
sd: sync to master-593-3d6064b (#2175)
* sd: sync to master-593-3d6064b

* sd: use the same sdtype_adapter object for all builds

Since master-592-b8079e2, no sd.cpp source depends on the ggml
backend build anymore.

* sd: fix main_gpu selection

* sd: report backend devices to the Python layer
2026-05-04 14:05:34 +08:00
Georgi Gerganov
846262d787
docs : update speculative decoding parameters after refactor (#22397) (#22539)
* docs : update speculative decoding parameters after refactor (#22397)

Update docs/speculative.md to reflect the new parameter naming scheme
introduced in PR #22397:

- Replace --draft-max/--draft-min with --spec-draft-n-max/--spec-draft-n-min
- Replace --spec-ngram-size-n/m with per-implementation variants
- Add documentation for all new --spec-ngram-*- parameters
- Update all example commands

Assisted-by: llama.cpp:local pi

* pi : add rule to use gh CLI for GitHub resources

Assisted-by: llama.cpp:local pi

* docs : run llama-gen-docs

* arg : fix typo
2026-05-04 08:52:07 +03:00
Atomic-Germ
6dcd824fce
vulkan: delete dead GGML_VK_MAX_NODES def (#22621) 2026-05-04 07:49:29 +02:00
Chen Yuan
d4b0c22f9e
ggml-webgpu: add layer norm ops (#22406)
* shader(norm): add layer norm ops

* shader(norm): stablize floating point computation with Kahan summation and handle mixed types

* shader(norm): remove the non-contiguous strides

* shader(norm): use the original implementation rather than the kahan summation
2026-05-03 20:52:53 -07:00
Aldehir Rojas
e48034dfc9
common : determine generation prompt using longest common prefix (#22657) 2026-05-04 00:18:23 +02:00
Julien Denize
048a490f76
convert : Mistral format yarn apply_scale support (#22612)
* [BUGFIX] Mistral format apply_scale support.

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* fix misunderstood boolean parameters

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-05-03 21:51:21 +02:00
JM Robles
db44417b02
convert : apply Q/K RoPE permutation in NVFP4 repack path (#22611)
Llama-architecture q_proj/k_proj weights need an axis-0 row permutation
to match GGML's RoPE convention. The BF16 path applies this in
LlamaModel.modify_tensors via LlamaModel.permute, but the NVFP4 path
bypasses modify_tensors and writes weights directly through
ModelBase._repack_nvfp4. Without the permutation, attention heads end
up scrambled at inference and the model produces gibberish.

This change overrides _repack_nvfp4 on LlamaModel and applies the same
permutation to both the nibble-packed weight and the per-block scale
before delegating to ModelBase._repack_nvfp4 via super(). Reuses the
existing LlamaModel.permute static helper and respects the existing
undo_permute flag, so subclasses (Mistral, Granite, Llama4, etc.)
inherit the fix automatically.

Verified on TinyLlama-1.1B reproducer: perplexity drops from 4419
(gibberish) to 43.9, matching the BF16-dequantized baseline (44.0).
Also verified end-to-end on ALIA-40b-instruct-2601 (BSC, Llama
architecture) with multilingual generation in Spanish/Catalan/Basque/
Galician all coherent with the fix applied.

Co-authored-by: Chema <chema@montevive.ai>
2026-05-03 18:22:00 +03:00
Tai An
24495f6c48
docs(args): clarify --debugmode level semantics in help text (#2181)
Closes #2178

The --debugmode help string previously read "Shows additional debug
info in the terminal" with no indication of what numeric values it
accepts or what each does — making the recommended troubleshooting
flag opaque (per #2178).

Document the three values actually checked in the source:
  -1: Horde-quiet (suppresses non-essential prints; auto-applied
       when --horde* args are set, see configure_horde_settings)
   0: default
   1: verbose (extra slot/cache info; larger utfprint buffer;
       retains 'debug-' horde model prefix; etc.)

Also note that bare --debugmode (no value) implies 1, which is the
existing argparse behavior (nargs='?', const=1) but easy to miss.
2026-05-03 16:06:13 +08:00
Concedo
676e716ce3 try to handle duplicate think tags by swallowing them 2026-05-03 16:02:38 +08:00
Concedo
2d1c1eb54e updated lite 2026-05-03 14:54:09 +08:00
Concedo
80a9082166 q5_1 kv in cuda 2026-05-03 13:40:24 +08:00
Concedo
9be810628e setenv return int 2026-05-03 13:32:05 +08:00
Concedo
2fb97d9c2c explicitly set env var internally. 2026-05-03 13:18:50 +08:00
lucy
d05fe1d7da
fix: CUDA device PCI bus ID de-dupe OOMing (ignoring other 3 gpus entirely) (#22533)
* fix: CUDA device PCI bus ID detection for multi-GPU de-dupe

* HIP, MUSA macros

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2026-05-02 22:19:25 +02:00
Wagner Bruna
25fab4113e
refactor: handle GGML_VK_VISIBLE_DEVICES at the Python level (#2179)
All C++ handling code currently:
- build a comma-separated list from the info_vulkan array
- if GGML_VK_VISIBLE_DEVICES isn't set
  - set GGML_VK_VISIBLE_DEVICES to the list

Once set, GGML_VK_VISIBLE_DEVICES affects the whole process. So this
can be done in the same way at the Python level, before all loading
functions.

Caveat: load_model had the default `inputs.vulkan_info = "0"`,
so the default GPU would be "0" only when loading a text model.
2026-05-02 23:10:29 +08:00