Commit graph

12869 commits

Author SHA1 Message Date
Concedo
e5eab545f3 handle override jinja template 2026-04-19 00:30:28 +08:00
Concedo
ff37b336a7 updated lite 2026-04-18 18:38:32 +08:00
Concedo
2962e5bac4 updated colab image models 2026-04-18 18:02:17 +08:00
Concedo
40827ab5b5 updated lite, improved reasoning budget 2026-04-18 17:37:47 +08:00
Concedo
17c754a5fc improved reasoning budget 2026-04-18 17:19:09 +08:00
Concedo
78589974de updated colab 2026-04-18 16:41:27 +08:00
Concedo
0b37cb9a57 added preliminary support for reasoning budget 2026-04-18 11:56:33 +08:00
Concedo
79882d669a Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.github/workflows/build-android.yml
#	.github/workflows/build.yml
#	.github/workflows/release.yml
#	CMakeLists.txt
#	CODEOWNERS
#	common/CMakeLists.txt
#	common/common.h
#	docs/ops.md
#	docs/ops/Metal.csv
#	examples/batched/CMakeLists.txt
#	examples/convert-llama2c-to-ggml/CMakeLists.txt
#	examples/debug/CMakeLists.txt
#	examples/diffusion/CMakeLists.txt
#	examples/embedding/CMakeLists.txt
#	examples/eval-callback/CMakeLists.txt
#	examples/gen-docs/CMakeLists.txt
#	examples/idle/CMakeLists.txt
#	examples/lookahead/CMakeLists.txt
#	examples/lookup/CMakeLists.txt
#	examples/parallel/CMakeLists.txt
#	examples/passkey/CMakeLists.txt
#	examples/retrieval/CMakeLists.txt
#	examples/save-load-state/CMakeLists.txt
#	examples/speculative-simple/CMakeLists.txt
#	examples/speculative/CMakeLists.txt
#	examples/sycl/CMakeLists.txt
#	examples/training/CMakeLists.txt
#	ggml/src/ggml-hexagon/htp/hmx-matmul-ops.c
#	ggml/src/ggml-hexagon/htp/htp-ops.h
#	ggml/src/ggml-hexagon/htp/main.c
#	ggml/src/ggml-opencl/CMakeLists.txt
#	ggml/src/ggml-opencl/ggml-opencl.cpp
#	ggml/src/ggml-opencl/kernels/cvt.cl
#	pocs/vdot/CMakeLists.txt
#	src/CMakeLists.txt
#	tests/CMakeLists.txt
#	tests/test-quantize-stats.cpp
#	tools/batched-bench/CMakeLists.txt
#	tools/cli/CMakeLists.txt
#	tools/cli/cli.cpp
#	tools/completion/CMakeLists.txt
#	tools/cvector-generator/CMakeLists.txt
#	tools/cvector-generator/cvector-generator.cpp
#	tools/export-lora/CMakeLists.txt
#	tools/gguf-split/CMakeLists.txt
#	tools/gguf-split/gguf-split.cpp
#	tools/imatrix/CMakeLists.txt
#	tools/llama-bench/CMakeLists.txt
#	tools/llama-bench/llama-bench.cpp
#	tools/mtmd/CMakeLists.txt
#	tools/perplexity/CMakeLists.txt
#	tools/quantize/CMakeLists.txt
#	tools/quantize/quantize.cpp
#	tools/results/CMakeLists.txt
#	tools/server/CMakeLists.txt
#	tools/tokenize/CMakeLists.txt
#	tools/tts/CMakeLists.txt
2026-04-17 22:37:37 +08:00
Concedo
768527b031 Merge commit '1e796eb41f' into concedo_experimental
# Conflicts:
#	.devops/nix/package.nix
#	.github/workflows/build-riscv.yml
#	.github/workflows/build-vulkan.yml
#	.github/workflows/build.yml
#	docs/backend/SYCL.md
#	docs/build.md
#	docs/development/HOWTO-add-model.md
#	embd_res/templates/Reka-Edge.jinja
#	ggml/CMakeLists.txt
#	ggml/src/ggml-rpc/CMakeLists.txt
#	ggml/src/ggml-rpc/ggml-rpc.cpp
#	ggml/src/ggml-sycl/CMakeLists.txt
#	ggml/src/ggml-sycl/convert.cpp
#	ggml/src/ggml-sycl/dequantize.hpp
#	ggml/src/ggml-sycl/dmmv.cpp
#	ggml/src/ggml-sycl/ggml-sycl.cpp
#	ggml/src/ggml-webgpu/ggml-webgpu.cpp
#	ggml/src/ggml-webgpu/wgsl-shaders/common_decls.tmpl
#	ggml/src/ggml-webgpu/wgsl-shaders/get_rows.wgsl
#	ggml/src/ggml-webgpu/wgsl-shaders/mul_mat.wgsl
#	ggml/src/ggml-webgpu/wgsl-shaders/mul_mat_decls.tmpl
#	ggml/src/ggml-webgpu/wgsl-shaders/mul_mat_id.wgsl
#	ggml/src/ggml-webgpu/wgsl-shaders/mul_mat_reg_tile.wgsl
#	ggml/src/ggml-webgpu/wgsl-shaders/mul_mat_subgroup_matrix.wgsl
#	ggml/src/ggml-webgpu/wgsl-shaders/mul_mat_vec.wgsl
#	ggml/src/ggml-webgpu/wgsl-shaders/unary.wgsl
#	tests/test-chat.cpp
#	tools/rpc/README.md
2026-04-17 21:47:29 +08:00
Concedo
a089d6c59b updated lite 2026-04-17 21:12:25 +08:00
Yuri Khrustalev
a279d0f0f4
ci : add android arm64 build and release (#21647)
Some checks failed
Check Pre-Tokenizer Hashes / pre-tokenizer-hashes (push) Has been cancelled
Python check requirements.txt / check-requirements (push) Has been cancelled
Python Type-Check / python type-check (push) Has been cancelled
Update Operations Documentation / update-ops-docs (push) Has been cancelled
* server: respect the ignore eos flag

* ci: add android arm64 build and release

* patch

* pin android-setup actions to v4

* Apply suggestions from code review

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* lf in the suggestion

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-04-17 11:32:24 +02:00
Concedo
9a38091207 support q5_1 kv 2026-04-17 17:06:15 +08:00
65a
268d61e178
mtmd: add missing struct tag (#22023) 2026-04-17 10:48:33 +02:00
Georgi Gerganov
6990e2f1f7
libs : rename libcommon -> libllama-common (#21936)
* cmake : allow libcommon to be shared

* cmake : rename libcommon to libllama-common

* cont : set -fPIC for httplib

* cont : export all symbols

* cont : fix build_info exports

* libs : add libllama-common-base

* log : add common_log_get_verbosity_thold()
2026-04-17 11:11:46 +03:00
Eric Zhang
fcc7508759
model : Gemma4 model type detection (#22027)
* model : Gemma4 model type detection

* model : Gemma4 model type detection
2026-04-17 10:07:11 +02:00
Concedo
e074939c17 compact context GUI page (+1 squashed commits)
Squashed commits:

[136f073ce] compact context GUI page
2026-04-17 14:40:53 +08:00
Concedo
cccb45a00a summary outputs include processed amt 2026-04-17 14:22:51 +08:00
lhez
5e6c0e18b6
opencl: refactor q8_0 set_tensor and mul_mat host side dispatch for Adreno (#21938)
* opencl: refactor q8_0 gemm/gemv Adreno dispatch

* opencl: refactor q8_0 set_tensor

* opencl: fix whitespace
2026-04-16 22:28:33 -07:00
Concedo
64ce5fca15 better approach when SWA window exceeded, simply refill the window. this is not 100% correct but good enough for fastforward users. Disable FF or increase window if not good enough 2026-04-17 11:44:13 +08:00
Concedo
fa3f86ee70 added simplepod cloud template to readme 2026-04-17 10:58:44 +08:00
Concedo
aed18cc901 swa padding default to 0 2026-04-17 10:54:14 +08:00
Sigbjørn Skjæret
30dce2cf29
cli : use get_media_marker (#22017) 2026-04-17 00:12:31 +02:00
Xuan-Son Nguyen
089dd41fe3
cmake: use glob to collect src/models sources (#22005) 2026-04-16 23:25:16 +02:00
nullname
85dde8dc4a
hexagon: optimize HMX matmul operations (#21071)
* optimize hmx_mat_mul functions by calculating row and column tiles upfront

* refactor core_dot_chunk_fp16 to use size_t for tile counts and improve readability

* wip

* set scale outside of loop

* wip

* refactor core_mma_chunk_fp16 and mat_mul_qk_0_d16a32 to use size_t for tile counts

* wip

* wip

* refactor transfer_output_chunk_fp16_to_fp32 to use size_t for dimensions

* refactor core_dot_chunk_fp16 to use size_t for tile row stride calculation

* wip

* refactor hmx_mat_mul functions to use hvx_vec_splat_f16 for column scales initialization

* refactor hmx_mat_mul_permuted_w16a32_batched to streamline scale setting and locking

* refactor core_dot_chunk_fp16 to improve tile stride calculations for output

* refactor hmx_mat_mul functions to use Q6_V_vsplat_R for column scales initialization

* fix compiling error

* wip

* optimize row and column tile indexing in core_mma_chunk_fp16 function

* wip

* Revert "wip"

This reverts commit cde679eff79c4a28dd2d89d32f710015e09592b6.

* Add size limit check for HAP_mmap in htp_iface_mmap and drop_mmap functions

* wip
2026-04-16 13:48:34 -07:00
Xuan-Son Nguyen
4fbdabdc61
model: using single llm_build per arch (#21970)
* model: using single llm_build per arch

* fix merge

* nits
2026-04-16 21:10:22 +02:00
shaofeiqi
e45dbdece8
opencl: add q5_K gemm and gemv kernels for Adreno (#21595) 2026-04-16 12:08:33 -07:00
Pascal
4adac43f6f
server: tests: fetch random media marker via /apply-template (#21962) (#21980)
* server: tests: fetch random media marker via /apply-template (#21962 fix)

* server: allow pinning media marker via LLAMA_MEDIA_MARKER env var

get_media_marker() checks LLAMA_MEDIA_MARKER at first call and uses it
as-is if set, falling back to the random marker otherwise.

Tests no longer need to fetch the marker dynamically via /apply-template:
the fixture sets LLAMA_MEDIA_MARKER=<__media__> so the hardcoded prompts
work as before.

Address review feedback from ngxson

* server: make get_media_marker() thread-safe via magic statics

Use a C++11 static local with a lambda initializer instead of a global
static with an empty-check. The runtime guarantees initialization exactly
once without explicit locking.

Address review feedback from ggerganov

* nits

* nits
2026-04-16 20:46:21 +03:00
Concedo
b5e317e015 SWA fix attempt 2 2026-04-17 00:33:45 +08:00
PikaPikachu
9db77a020c
model : refactor QKV into common build_qkv and create_tensor_qkv helpers (#21245)
* model : refactor QKV into common build_qkv and create_tensor_qkv helpers

* model : extend build_qkv to bert/mpt/dbrx/olmo/lfm2/nemotron-h/granite-hybrid/gemma3n-iswa/t5-dec and fix wqkv_s
2026-04-16 17:41:34 +02:00
Concedo
ab2c596718 updated lite 2026-04-16 23:21:57 +08:00
Sigbjørn Skjæret
f772f6e434
model : support NVFP4 tensors for Gemma4 (#21971)
* support nvfp4 tensors for Gemma4

* add wo_s to build_attn

* add wo_s to build_attn

* fix glm4
2026-04-16 16:51:47 +02:00
Ruben Ortlam
b572d1ecd6
codeowners: add team member comments (#21714) 2026-04-16 13:13:11 +03:00
Anav Prasad
03b3d07798
Convert: Fix NemotronH Config Parsing (#21664)
* fix NemotronH vocab loading by using trust_remote_code for unsupported config patterns

* fix NemotronH tokenizer loading by overriding set_vocab with trust_remote_code
2026-04-16 13:11:45 +03:00
Aman Gupta
3f7c29d318
ggml: add graph_reused (#21764)
* ggml: add graph_reused

* use versioning instead of reuse flag

* increment version with atomic

* use top bits for split numbering

* add assert

* move counter to ggml.c

* set uid in split_graph only

* fix windows

* address further review comments

* get next_uid rather than doing bit manipulation

* rename + add comment about uid
2026-04-16 17:21:28 +08:00
Concedo
ae292c496e handle SWA conflicting with rewind, increased default SWA padding. 2026-04-16 17:00:26 +08:00
Kusha Gharahi
ae2d34899e
metal: Implement ROLL op (#21946)
* nix: support unified apple-sdk

* Impl roll op for Metal

* Revert "nix: support unified apple-sdk"

This reverts commit abfa473360471532c547de8b202c780507924d4b.

* update ops.md

* update op docs
2026-04-16 11:54:37 +03:00
Concedo
0251c6dbde added swa padding controls 2026-04-16 16:21:48 +08:00
rehan-10xengineer
1e796eb41f
ggml-cpu: add 128-bit RVV implementation for Quantization Vector Dot (#20633)
* ggml-cpu: add 128-bit impls for i-quants, ternary quants

* ggml-cpu: add 128-bit impls for iq2_xs, iq3_s, iq3_xxs, tq2_0

Co-authored-by: Rehan Qasim <rehan.qasim@10xengineers.ai>

* ggml-cpu: refactor; add rvv checks

---------

Co-authored-by: taimur-10x <taimur.ahmad@10xengineers.ai>
Co-authored-by: Rehan Qasim <rehan.qasim@10xengineers.ai>
2026-04-16 11:15:15 +03:00
rehan-10xengineer
5637536517
ggml : implemented simd_gemm kernel for riscv vector extension (#20627)
Co-authored-by: Rehan Qasim <rehan.qasim@10xengineers.ai>
2026-04-16 11:14:26 +03:00
Yuannan
90fb96a7b3
devops : added spirv-headers to nix (#21965) 2026-04-16 11:12:52 +03:00
Reese Levine
82677a6ede
ggml-webgpu: compute pass batching and removing profiling overhead (#21873)
* Update register tiling matmul to use f32 accumulation

* fix profiling code

* Fix register tiling matmul for chrome, i'm blaming dawn

* Update batch tuning value for iOS

* compile fix

* Fix use of new load function

* Move to a single query set for GPU profiling

* Move to batching compute passes when not profiling

* Refactor build_multi

* remove iOS throttling now that we're batching compute passes
2026-04-16 11:12:19 +03:00
Ludovic Henry
8612ed18b7
ci : Use ggml-org/ccache-action on RISC-V as well (#21632) 2026-04-16 11:11:25 +03:00
Concedo
a9e817fb4c smartcache off when fastforward off 2026-04-16 15:29:23 +08:00
Concedo
535df844dd touchup for min/max tokens ui 2026-04-16 14:56:22 +08:00
Katostrofik
b1be68e8ca
[SYCL] Fix Q8_0 reorder: garbage on 2nd prompt + crash on full VRAM (#21638)
* [SYCL] Fix Q8_0 reorder: add missing dequantize path for GEMM

The Q8_0 reorder optimization (#21527) was missing a reorder-aware
dequantizer for the GEMM code path used during prompt processing.
After token generation reordered Q8_0 weights (via DMMV/MMVQ), the
next prompt processing pass would read them with the standard
dequantizer, producing garbage output.

Add dequantize_block_q8_0_reorder() and wire it into both
ggml_get_to_fp16_sycl() and ggml_get_to_fp32_sycl(), matching the
pattern already used by Q4_0, Q4_K, and Q6_K.

Fixes #21589

AI (Claude) was used to assist with root cause investigation and
writing the kernel code. All code was human-reviewed and tested
on real hardware.

* SYCL: fix reorder crash when device memory is full

The reorder optimization allocates a temporary buffer the full size of
the weight tensor on the device. When VRAM is nearly full (large models
on a single GPU), this allocation fails and the subsequent memcpy crashes
on a NULL pointer.

Fix: try device allocation first, fall back to host memory if device
memory is full. The reorder kernel still works correctly reading from
host memory over PCIe. This is slower for the one-time reorder (~21 t/s
vs ~38 t/s on Intel Arc Pro B70), but the optimization is preserved for
all subsequent inference. If both device and host allocation fail, skip
the reorder and fall back to the unoptimized kernel path.

Also fixes a bug where opt_for_reorder() marked tensors as reordered
even when the reorder was skipped due to allocation failure. This caused
DMMV/MMVQ kernels to read the original AoS data as if it were SoA,
producing garbage output or NaN results.

Tested on Intel Arc Pro B70 (32GB) with Q8_0, Q4_K_M models. Coding was
AI-assisted (Claude), reviewed and tested on hardware by a human.

Fixes #20478

* SYCL: add RAII temp buffer class + macro guard for host fallback

Replace sycl_ext_malloc_with_fallback/sycl_ext_free_fallback free
functions with sycl_reorder_temp_buffer RAII class. The host_fallback
bool is now a private member, and cleanup happens automatically at
scope exit.

Add GGML_SYCL_HOST_MEM_FALLBACK cmake option (default ON) to guard
the host memory fallback code path. Device access to host memory
requires Linux kernel 6.8+ (Ubuntu 26.04+); users on older kernels
can set -DGGML_SYCL_HOST_MEM_FALLBACK=OFF to disable it.

Addresses arthw's review on PR #21638.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* SYCL: document GGML_SYCL_HOST_MEM_FALLBACK build option in SYCL.md

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* SYCL: add reorder-aware DMMV dequantizers for Q4_K and Q6_K

Q4_K and Q6_K had reorder support for MMVQ and GEMM paths but not
DMMV. When the DMMV path encountered reordered data it would abort.

Add DMMV kernels that read from the SOA reorder layout for both
types. Same math as the non-reorder versions, different memory
access pattern.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 08:34:05 +03:00
Llama
c592bd01da
Pass img_min_params and img_max_params to ctx_clip_params (#2133)
* Pass img_min_params and img_max_params to ctx_clip_params

These values determine the minimum and maximum size (in
tokens) of vision embeddings. The default value of -1
uses a model-dependent default size, for example for
Gemma 4 the default is a 280 token embedding. For higher
quality results (at the cost of using more memory and
slower speed) you can increase the size of the embedding
to 1120 tokens.

* Change dict to mydict to match change to method
2026-04-16 12:27:06 +08:00
Concedo
a9f9e9a38b rename the filepaths for clarity (+1 squashed commits)
Squashed commits:

[fa8fc6914] rename the filepaths for clarity
2026-04-16 12:17:23 +08:00
Concedo
45737effd3 refactor for clarity 2026-04-16 10:53:35 +08:00
Xuan-Son Nguyen
408225bb1a
server: use random media marker (#21962)
* server: use random media marker

* nits

* remove legacy <__image__> token

* revert special char in random
2026-04-15 23:52:22 +02:00
Ruben Ortlam
b3d758750a
vulkan: optimize im2col (#21713)
* vulkan: improve im2col memory write layout

* cap workgroups

* minimal device tuning

* use vendor_id instead of subgroup size
2026-04-15 19:04:51 +02:00