Commit graph

13500 commits

Author SHA1 Message Date
Concedo
29fd00a7a4 allow accessing tensor split and gpu layers in cpu mode if rpc is in connect mode 2026-05-24 22:23:51 +08:00
Concedo
86f6bff620 rpc port to 5551 2026-05-24 18:56:23 +08:00
Concedo
2937fdd823 make send as reference image default for img2img, but with smart fallback for older models. 2026-05-24 18:34:24 +08:00
Concedo
d774184e9d hacky patch for hidream on kobold to fix tensor type issues 2026-05-24 16:37:49 +08:00
Concedo
c62d921f0a fix sdcpp build cmake 2026-05-24 16:14:37 +08:00
Concedo
298da8a4c2 Merge remote-tracking branch 'wbruna/kcpp_sd_update_202605_5' into concedo_experimental 2026-05-24 15:38:37 +08:00
Concedo
8ca4283f55 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.github/workflows/release.yml
#	.github/workflows/server.yml
#	.github/workflows/ui-build.yml
#	.github/workflows/ui-publish.yml
#	CMakeLists.txt
#	docs/autoparser.md
#	docs/backend/snapdragon/CMakeUserPresets.json
#	docs/backend/snapdragon/README.md
#	docs/backend/snapdragon/windows.md
#	docs/function-calling.md
#	examples/model-conversion/scripts/embedding/run-original-model.py
#	ggml/src/ggml-hexagon/htp/hmx-flash-attn-ops.c
#	ggml/src/ggml-opencl/ggml-opencl.cpp
#	ggml/src/ggml-opencl/kernels/cvt.cl
#	ggml/src/ggml-opencl/kernels/gemm_moe_mxfp4_f32_ns.cl
#	ggml/src/ggml-opencl/kernels/gemm_moe_q4_0_f32_ns.cl
#	ggml/src/ggml-opencl/kernels/gemm_moe_q4_1_f32_ns.cl
#	ggml/src/ggml-opencl/kernels/gemm_moe_q4_k_f32_ns.cl
#	ggml/src/ggml-opencl/kernels/gemm_moe_q5_0_f32_ns.cl
#	ggml/src/ggml-opencl/kernels/gemm_moe_q5_1_f32_ns.cl
#	ggml/src/ggml-opencl/kernels/gemm_moe_q5_k_f32_ns.cl
#	ggml/src/ggml-opencl/kernels/gemm_moe_q6_k_f32_ns.cl
#	ggml/src/ggml-opencl/kernels/gemv_moe_mxfp4_f32_ns.cl
#	ggml/src/ggml-opencl/kernels/gemv_moe_q4_0_f32_ns.cl
#	ggml/src/ggml-opencl/kernels/gemv_moe_q4_1_f32_ns.cl
#	ggml/src/ggml-opencl/kernels/gemv_moe_q4_k_f32_ns.cl
#	ggml/src/ggml-opencl/kernels/gemv_moe_q5_0_f32_ns.cl
#	ggml/src/ggml-opencl/kernels/gemv_moe_q5_1_f32_ns.cl
#	ggml/src/ggml-opencl/kernels/gemv_moe_q5_k_f32_ns.cl
#	ggml/src/ggml-opencl/kernels/gemv_moe_q6_k_f32_ns.cl
#	ggml/src/ggml-sycl/common.hpp
#	ggml/src/ggml-sycl/dmmv.cpp
#	ggml/src/ggml-sycl/gated_delta_net.cpp
#	ggml/src/ggml-sycl/ggml-sycl.cpp
#	ggml/src/ggml-vulkan/CMakeLists.txt
#	ggml/src/ggml-zendnn/CMakeLists.txt
#	ggml/src/ggml-zendnn/ggml-zendnn.cpp
#	requirements/requirements-convert_hf_to_gguf.txt
#	scripts/snapdragon/windows/setup-build.ps1
#	tools/perplexity/perplexity.cpp
2026-05-24 13:55:44 +08:00
Concedo
ae335c4338 fix tools build 2026-05-24 13:46:36 +08:00
Yiwei Shao
1c0f6db545
hexagon: apply repl optimization in flash attn softmax as #22993 (#23455)
2026-05-23 19:56:59 -07:00
Aparna M P
cec51c7a7d
snapdragon: update windows toolchain to use hsdk v6.6.0.0 (#23552) 2026-05-23 19:56:41 -07:00
Aldehir Rojas
b22ff4b7b4
cmake/ui : refactor the build (#23352) 2026-05-23 17:08:22 -04:00
Aditya Singh
c0c7e147e7
requirements : bump torch to 2.11.0 (#23503)
* requirements: relax torch~=2.6.0 to torch>=2.6.0 for convert_hf_to_gguf

The ~=2.6.0 operator resolves to >=2.6.0, <2.7.0, which fails on
PyPI for platform/CPython combinations where 2.6.x is not present.
The accompanying comment already says 'PyTorch 2.6.0 or later', so
the looser >=2.6.0 matches the documented intent and unblocks
pip install -r requirements/requirements-convert_hf_to_gguf.txt.

Fixes #23408

* requirements: bump torch floor to 2.11.0 per maintainer

* requirements: pin torch to ==2.11.0 per project policy

* requirements: pin mtmd torch and torchvision to 2.11.0/0.26.0 per project policy

* requirements: suppress check_requirements pin warning on mtmd

The check_requirements script flags '==' on lines in files matched by
*/**/requirements*.txt. Append the documented suppression comment to the
pinned torch and torchvision lines (and to the s390x platform marker lines)
so the check passes while keeping the pins required by project policy.

* ty: silence Tensor/Module union check on model[0].auto_model

With torch 2.11.0 stubs, nn.Sequential.__getitem__ now returns
Tensor | Module rather than Module, so model[0].auto_model fails ty
on the SentenceTransformer code path. The runtime behavior is
unchanged because SentenceTransformer always wraps a Module at
index 0. Adding a targeted unresolved-attribute ignore keeps the
type-check green without altering behavior. A follow-up issue
tracks typing the variable explicitly.
2026-05-23 18:24:39 +02:00
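The first bullet of commit c0c7e147e7 rests on how the `~=` (compatible release) operator expands. The following is a minimal sketch of that expansion, assuming plain numeric versions only (no pre-releases, epochs, or local tags); the helper names `release`, `pad`, and `satisfies_compatible` are hypothetical, and the real resolution is done by pip, not by this code.

```python
def release(version: str) -> list[int]:
    """Split a plain numeric version such as '2.6.0' into integer segments."""
    return [int(part) for part in version.split(".")]


def pad(parts: list[int], width: int) -> tuple[int, ...]:
    """Right-pad with zeros so that '2.7' and '2.7.0' compare as equal."""
    return tuple(parts + [0] * (width - len(parts)))


def satisfies_compatible(candidate: str, spec: str) -> bool:
    """Return True if `candidate` matches '~=spec' under the simplified rules above."""
    spec_parts = release(spec)
    if len(spec_parts) < 2:
        raise ValueError("~= needs at least two release segments")
    # Drop the last segment and bump the one before it: 2.6.0 -> 2.7
    upper_parts = spec_parts[:-2] + [spec_parts[-2] + 1]
    cand_parts = release(candidate)
    width = max(len(spec_parts), len(cand_parts))
    return pad(spec_parts, width) <= pad(cand_parts, width) < pad(upper_parts, width)


if __name__ == "__main__":
    for candidate in ("2.6.0", "2.6.3", "2.7.0", "2.11.0"):
        print(candidate, satisfies_compatible(candidate, "2.6.0"))
```

Under these assumptions `~=2.6.0` accepts 2.6.x only, which is why it excludes 2.11.0 while `>=2.6.0` or `==2.11.0` does not.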
Concedo
38298dd4e8 try to fix cuda builds 2026-05-23 21:58:01 +08:00
Concedo
3aea5a795e Revert "fixed incorrect cfg scale returned"
This reverts commit cae0375157.
2026-05-23 21:37:47 +08:00
Wagner Bruna
3ec404b2cb sd: sync to master-646-0baf721 2026-05-23 09:05:36 -03:00
Wagner Bruna
8427efb4c6 sd: sync to master-642-3a8788c 2026-05-23 09:05:36 -03:00
Wagner Bruna
a0413bdf55 sd: sync to master-637-ef92a00 2026-05-23 09:05:36 -03:00
Wagner Bruna
6c2093e422 sd: sync to master-633-5b0267e 2026-05-23 09:05:36 -03:00
Wagner Bruna
c28c50e441 sd: sync to master-621-baf7eda 2026-05-23 09:05:35 -03:00
Michael Wand
b0df4c0cfd
model : add NVFP4 MTP scale tensors (#23563)
* Add NVFP4 MTP scale tensors

* Link Qwen3.5 MTP tensors

* Aligned nullptr
2026-05-23 13:30:31 +02:00
dskwe
a497476330
ggml : Check the right iface method before using the fallback 2d get (#23514) 2026-05-23 12:49:24 +02:00
Wagner Bruna
9450834335
sd: adjust VAE tile size according to sdtiledvae (#2208) 2026-05-23 17:50:44 +08:00
Concedo
ce3aa09b99 cache dir is null 2026-05-23 17:39:09 +08:00
Concedo
cae0375157 fixed incorrect cfg scale returned 2026-05-23 17:30:07 +08:00
Concedo
4bbbd55be6 rpc implementation is complete 2026-05-23 17:11:30 +08:00
Jeff Bolz
95405ac65f
vulkan: fix windows find_package of SPIRV-Headers (#23215)
* vulkan: fix windows find_package of SPIRV-Headers

* not windows-only
2026-05-23 09:44:46 +02:00
Concedo
3520b915f9 try revert vae chunk size change 2026-05-23 09:46:11 +08:00
Shawn Gu
0f3cb3fc8b
opencl: generalize Adreno MoE kernels on M (#23449) 2026-05-22 17:08:41 -07:00
Concedo
81553e6524 mmproj overhead estimate calculated but only used on python side 2026-05-23 00:04:12 +08:00
Aldehir Rojas
1acee6bf89
server: only parse empty msg if continuing an assistant msg (#23506) 2026-05-22 11:58:15 -04:00
Concedo
f85cc79526 make swa default on models that support it. removed --useswa, added --noswa 2026-05-22 23:38:33 +08:00
fairydreaming
ef570f6308
perplexity : fix integer overflow (#23496)
Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
2026-05-22 15:50:44 +03:00
Alexey Kopytko
cc9e331213
SYCL: improve MoE prefill throughput (#23142)
- change `k_copy_src1_to_contiguous` so that it uses a precomputed contiguous mapping where all rows "owned" by an expert are in one slice with a known start and end
- switch the `O(n_as * n_routed_rows)` contraption to a counting sort-based procedure with `O(n_as + n_routed_rows)` complexity
2026-05-22 15:50:17 +03:00
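The counting-sort idea described in commit cc9e331213 can be sketched on the host side as follows. This is an illustration only, not the SYCL kernel from the PR: the function name `group_rows_by_expert` and its arguments are hypothetical, and the sketch assumes each routed row carries a single expert index.

```python
def group_rows_by_expert(expert_of_row: list[int], n_experts: int) -> tuple[list[int], list[int]]:
    """Group row indices by expert with a counting sort.

    Returns (row_order, starts); the rows owned by expert e are
    row_order[starts[e]:starts[e + 1]], i.e. one contiguous slice.
    """
    # Pass 1: count rows per expert, O(n_routed_rows).
    counts = [0] * n_experts
    for expert in expert_of_row:
        counts[expert] += 1

    # Prefix sum gives each expert's slice boundaries, O(n_experts).
    starts = [0] * (n_experts + 1)
    for expert in range(n_experts):
        starts[expert + 1] = starts[expert] + counts[expert]

    # Pass 2: scatter each row into its expert's slice, O(n_routed_rows).
    cursor = starts[:-1].copy()
    row_order = [0] * len(expert_of_row)
    for row, expert in enumerate(expert_of_row):
        row_order[cursor[expert]] = row
        cursor[expert] += 1
    return row_order, starts


if __name__ == "__main__":
    order, starts = group_rows_by_expert([2, 0, 2, 1, 0], n_experts=3)
    print(order, starts)
```

The two linear passes plus one prefix sum give the `O(n_as + n_routed_rows)` bound quoted in the message, in contrast to scanning every row once per expert.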
Alexey Kopytko
bcfd1989e9
sycl : Level Zero detection in ggml_sycl_init (#23097)
* [SYCL] Centralize Level Zero detection in ggml_sycl_init

* use the same wording

* get back the warning
2026-05-22 15:49:45 +03:00
karavayev
56f16f235c
SYCL : gated_delta_net K>1 (#23174)
* sycl_gated_delta_net K>1

* editor_config
2026-05-22 15:48:56 +03:00
Katostrofik
8cc67efcd4
SYCL: add BF16 to DMMV kernel path (~4x tg speedup on Intel Arc) (#21580)
* SYCL: add BF16 to DMMV kernel path for ~4x token generation speedup

BF16 models had no dedicated token generation kernel — they fell through
to the generic full-GEMM path, resulting in ~14% memory bandwidth
utilization on Intel Arc GPUs. This adds BF16 support to the DMMV
(dequantize mul-mat-vec) path, matching the existing F16 implementation.

Fixes #20478

* SYCL: fix BF16 DMMV out-of-bounds when ncols % 64 != 0

The qk=1 kernel (used for F16 and BF16) iterates with stride
2*GGML_SYCL_DMMV_X (= 64 on Intel targets where WARP_SIZE=16). When
ncols is a multiple of DMMV_X (32) but not of 2*DMMV_X (64), the last
warp iteration accesses elements at col >= ncols, producing NaN for the
final row and wrong values for interior rows.

Fix: tighten can_use_dequantize_mul_mat_vec to require ne[0] %
(2*DMMV_X) == 0 for F16/BF16 types, and update the ASSERT in the BF16
launcher to match. Quantized types use block-structured kernels with
different access patterns and keep the existing DMMV_X check.

Verified: test-backend-ops MUL_MAT passes 913/913 on Intel Arc Pro B70.
Previously failing: m=128/129 n=1 k=1056 cases (NaN and ERR > 0.0005).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-22 15:48:24 +03:00
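The out-of-bounds condition fixed in commit 8cc67efcd4 can be checked with simple arithmetic. The sketch below is a simplified model, assuming the kernel consumes columns in whole strides of `2 * DMMV_X`; the function names are hypothetical and the real check lives in `can_use_dequantize_mul_mat_vec`.

```python
DMMV_X = 32  # value of GGML_SYCL_DMMV_X as quoted in the commit message


def old_check(ncols: int) -> bool:
    """Eligibility rule before the fix: a multiple of DMMV_X was enough."""
    return ncols % DMMV_X == 0


def new_check(ncols: int, is_f16_or_bf16: bool) -> bool:
    """Eligibility rule after the fix: F16/BF16 need a multiple of 2 * DMMV_X."""
    required = 2 * DMMV_X if is_f16_or_bf16 else DMMV_X
    return ncols % required == 0


def max_col_touched(ncols: int, stride: int = 2 * DMMV_X) -> int:
    """Highest column index read if iterations advance in whole strides (assumed model)."""
    iterations = -(-ncols // stride)  # ceiling division
    return iterations * stride - 1


if __name__ == "__main__":
    for k in (1024, 1056):
        print(k, old_check(k), new_check(k, True), max_col_touched(k) >= k)
```

With k=1056, the failing case named in the message, the old rule accepts the shape even though the last stride reaches past `ncols`; the tightened rule rejects it, so such shapes no longer take the DMMV path.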
Concedo
632c41a72f Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.github/workflows/build-apple.yml
#	.github/workflows/build-cmake-pkg.yml
#	.github/workflows/release.yml
#	.pi/gg/SYSTEM.md
#	CMakeLists.txt
#	CODEOWNERS
#	README.md
#	build-xcframework.sh
#	ci/run.sh
#	docs/build.md
#	examples/CMakeLists.txt
#	examples/llama.android/lib/build.gradle.kts
#	ggml/src/ggml-webgpu/wgsl-shaders/flash_attn_tile.wgsl
#	tests/CMakeLists.txt
#	tests/test-backend-ops.cpp
#	tests/test-save-load-state.cpp
#	tools/batched-bench/CMakeLists.txt
#	tools/cli/CMakeLists.txt
#	tools/completion/CMakeLists.txt
#	tools/llama-bench/CMakeLists.txt
#	tools/perplexity/CMakeLists.txt
#	tools/quantize/CMakeLists.txt
#	tools/server/CMakeLists.txt
2026-05-22 20:42:51 +08:00
Concedo
694e8824c5 mmproj autofit reworked 2026-05-22 20:36:16 +08:00
Jesus Talavera
95feeab52e
docs: Update documentation with Granite 4.0/4.1 (#23404) 2026-05-22 20:35:46 +08:00
Sachin Sharma
99d4026b11
ggml-zendnn : add Q8_0 quantization support (#23414)
* ggml-zendnn : add Q8_0 quantization support

* ggml-zendnn : sync with latest ZenDNN

* ggml-zendnn : address review comments for Q8_0
2026-05-22 13:16:55 +02:00
fairydreaming
9c92e96a64
cmake : build router app only during standalone builds (#23521)
Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
2026-05-22 12:55:29 +03:00
Kashif Rasul
afcda09d15
vocab : fix HybridDNA tokenizer (#23466)
* vocab : mark hybriddna k-mers to avoid BPE token collisions

* improved loop

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-05-22 11:17:31 +02:00
Georgi Gerganov
bbce619adb
cmake : add install() for impl libraries + fix apple builds (#23511)
* pi : update

* ci : fix ios build

* ci : fix android

* ci : fix apple builds

* cmake : add install() for impl libraries

Add install(TARGETS <target> LIBRARY) for all -impl libraries that were
changed from STATIC to shared (controlled by BUILD_SHARED_LIBS) in
commit bb28c1fe2. Without this, cmake --install fails to copy the shared
libraries, causing runtime errors like:

  llama-server: error while loading shared libraries: libllama-server-impl.so

Ref: https://github.com/ggml-org/llama.cpp/issues/23494#issuecomment-4512912515

Assisted-by: llama.cpp:local pi

* ci : fix xcframework build
2026-05-22 11:46:26 +03:00
Concedo
de6b8f9369 increase ctx slider granularity 2026-05-22 16:17:54 +08:00
Johannes Gäßler
4f0e43da6f
CUDA: fix PDL CC check for JIT compilation (#23471) 2026-05-21 23:35:29 +02:00
Georgi Gerganov
bb28c1fe24
cmake : remove STATIC from impl libraries, enable LLAMA_BUILD_APP by default (#23462)
* cmake : remove STATIC from impl libraries, allow BUILD_SHARED_LIBS control

Remove explicit STATIC from all -impl libraries (server, cli, completion, bench,
batched-bench, fit-params, quantize, perplexity) so BUILD_SHARED_LIBS controls
shared vs static linkage.

Add WINDOWS_EXPORT_ALL_SYMBOLS ON for proper DLL export on Windows.

Assisted-by: llama.cpp:local pi

* cmake : enable LLAMA_BUILD_APP by default

Assisted-by: llama.cpp:local pi

* ci : disable app in build-cmake-pkg.yml
2026-05-21 21:13:59 +03:00
Reese Levine
ee7c30578a
Update WebGPU support and add link to blog/demo (#23483) 2026-05-21 11:00:27 -07:00
Pascal
47c0eda9d4
vulkan: fuse snake activation (mul, sin, sqr, mul, add) (#22855)
* vulkan: fuse snake activation (mul, sin, sqr, mul, add)

Add snake.comp shader with F32 / F16 / BF16 pipelines and
ggml_vk_snake_dispatch_fused. The matcher recognizes the naive 5 op
decomposition emitted by audio decoders (BigVGAN, Vocos) for snake
activation y = x + sin(a*x)^2 * inv_b and rewrites it to a single
elementwise kernel.

test_snake_fuse from the CUDA PR now also compares CPU naive vs
Vulkan fused across F32 / F16 / BF16.

* vulkan: address jeffbolznv review for fused snake activation

Rename T / C to ne0 / ne1 in the shader and push constants to match
the standard naming convention used across the Vulkan backend.

Tighten ggml_vk_can_fuse_snake: require x and dst to be contiguous
(the shader uses idx = i0 + i1 * ne0) and require a / inv_b to be
tightly packed on the broadcast dim (the shader reads data_a[i1]).

* vulkan: tighten snake fusion type checks for all operands (address jeffbolznv review)

* vulkan: reject snake fusion when ne[2] or ne[3] > 1 (address jeffbolznv review)

* vulkan: address 0cc4m review for fused snake activation

snake.comp is renamed to follow the ggml DATA_A_* / A_TYPE convention.
A_TYPE now applies to the activation tensor data_a instead of the
broadcast multiplier, and the bindings become data_a (A_TYPE), data_b
(float), data_c (float) and data_d (D_TYPE). A header at the top of
the shader maps each buffer to its role in y = x + sin(b * x)^2 * c.

On the C++ side, ggml_vk_can_fuse_snake reuses the existing snake_pattern
constant instead of duplicating the op list, sin_node is extracted as a
named local alongside the other chain nodes, and the broadcast operands
a and inv_b are now required to be GGML_TYPE_F32 to match the hardcoded
float bindings on data_b and data_c (the previous a->type == x->type
would silently reject any future BF16 or F16 chain once the supports_op
gate for SIN / SQR is lifted). ggml_vk_snake_dispatch_fused gets an
explicit GGML_TYPE_F32 case and GGML_ABORT on default in place of the
silent f32 fallback, and a stale comment about data_a[i1] / data_inv_b[i1]
is refreshed to match the new binding names.
2026-05-21 19:39:42 +02:00
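The fusion described in commit 47c0eda9d4 replaces five elementwise ops with one. A minimal sketch of the equivalence in plain Python follows, assuming scalar `a` and `inv_b` for a single row (in the shader they are broadcast per row); the function names are hypothetical and this is not the Vulkan implementation.

```python
import math


def snake_naive(x: list[float], a: float, inv_b: float) -> list[float]:
    """Five-op decomposition: mul, sin, sqr, mul, add."""
    t = [a * v for v in x]            # mul
    t = [math.sin(v) for v in t]      # sin
    t = [v * v for v in t]            # sqr
    t = [v * inv_b for v in t]        # mul
    return [xv + tv for xv, tv in zip(x, t)]  # add


def snake_fused(x: list[float], a: float, inv_b: float) -> list[float]:
    """Single pass computing y = x + sin(a*x)^2 * inv_b per element."""
    out = []
    for v in x:
        s = math.sin(a * v)
        out.append(v + s * s * inv_b)
    return out


if __name__ == "__main__":
    xs = [-1.5, 0.0, 0.25, 3.0]
    naive = snake_naive(xs, a=2.0, inv_b=0.5)
    fused = snake_fused(xs, a=2.0, inv_b=0.5)
    print(all(math.isclose(n, f) for n, f in zip(naive, fused)))
```

The fused form avoids four intermediate buffers, which is the motivation for matching the pattern in the graph and dispatching one kernel.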
Concedo
718dc159b6 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	CMakeLists.txt
#	docs/speculative.md
#	ggml/src/ggml-cuda/CMakeLists.txt
#	ggml/src/ggml-hexagon/ggml-hexagon.cpp
#	ggml/src/ggml-hexagon/htp/hmx-matmul-ops.c
#	ggml/src/ggml-hexagon/htp/hmx-ops.h
#	ggml/src/ggml-hexagon/htp/main.c
#	ggml/src/ggml-hexagon/htp/matmul-ops.c
#	ggml/src/ggml-hexagon/htp/rope-ops.c
#	ggml/src/ggml-hexagon/htp/ssm-conv.c
#	ggml/src/ggml-opencl/ggml-opencl.cpp
#	scripts/snapdragon/adb/run-bench.sh
#	scripts/snapdragon/adb/run-cli.sh
#	scripts/snapdragon/adb/run-completion.sh
#	scripts/snapdragon/adb/run-mtmd.sh
#	scripts/snapdragon/windows/run-bench.ps1
#	scripts/snapdragon/windows/run-cli.ps1
#	scripts/snapdragon/windows/run-completion.ps1
#	scripts/snapdragon/windows/run-mtmd.ps1
#	src/llama-vocab.cpp
#	tests/test-backend-ops.cpp
#	tools/batched-bench/CMakeLists.txt
#	tools/batched-bench/batched-bench.cpp
#	tools/cli/CMakeLists.txt
#	tools/cli/README.md
#	tools/cli/cli.cpp
#	tools/completion/CMakeLists.txt
#	tools/completion/README.md
#	tools/llama-bench/CMakeLists.txt
#	tools/llama-bench/llama-bench.cpp
#	tools/mtmd/CMakeLists.txt
#	tools/mtmd/tests/test-deepseek-ocr.py
#	tools/mtmd/tests/tests-requirements.txt
#	tools/perplexity/CMakeLists.txt
#	tools/perplexity/perplexity.cpp
#	tools/quantize/CMakeLists.txt
#	tools/server/CMakeLists.txt
#	tools/server/README.md
#	ty.toml
2026-05-21 23:47:21 +08:00
Concedo
54af9aada9 Merge commit 'e6b4acfe86' into concedo_experimental
# Conflicts:
#	.devops/cann.Dockerfile
#	.devops/cpu.Dockerfile
#	.devops/cuda.Dockerfile
#	.devops/intel.Dockerfile
#	.devops/musa.Dockerfile
#	.devops/openvino.Dockerfile
#	.devops/rocm.Dockerfile
#	.devops/s390x.Dockerfile
#	.devops/vulkan.Dockerfile
#	tools/mtmd/clip.cpp
#	tools/mtmd/clip.h
2026-05-21 23:31:32 +08:00