Commit graph

9871 commits

Author SHA1 Message Date
Concedo
b6f6338bba Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.github/workflows/build-linux-cross.yml
#	.github/workflows/build.yml
#	CODEOWNERS
#	ggml/CMakeLists.txt
#	ggml/src/ggml-cuda/fattn.cu
#	ggml/src/ggml-webgpu/CMakeLists.txt
#	ggml/src/ggml-webgpu/ggml-webgpu.cpp
#	ggml/src/ggml-webgpu/wgsl-shaders/mul_mat.tmpl.wgsl
#	tests/test-backend-ops.cpp
#	tests/test-chat-template.cpp
#	tools/llama-bench/llama-bench.cpp
#	tools/rpc/README.md
#	tools/server/README.md
2025-10-09 01:33:27 +08:00
Concedo
224800b33b revert https://github.com/ggml-org/llama.cpp/pull/14904 , segfault on repacked q4_0 on avx2 cpu. 2025-10-09 00:37:05 +08:00
issixx
d2ee056e1d
server : fix cancel pending task (#16467)
Co-authored-by: DevAI <DevAI@gmail.com>
2025-10-08 11:20:18 +03:00
Georgi Gerganov
b2c08c9ec4
metal : mark FA blocks (#16372)
* metal : better unroll in the FA kernels

* metal : index FA blocks

* tests : restore [no ci]

* metal : prevent division by zero in FA kernels

* metal : fix -INF detection logic
2025-10-08 10:57:53 +03:00
Georgi Gerganov
7fdd16b432
server : improve context checkpoint logic (#16440) 2025-10-08 10:57:29 +03:00
Reese Levine
74b8fc17f9
ggml webgpu: profiling, CI updates, reworking of command submission (#16452)
* Add profiling

* More detailed profiling

* Rework command submission to avoid global locks

* Update wait handling

* try new method of waiting on futures

* Add serializing of command submission in some cases

* Add new pool for timestamp queries and clean up logging

* Serialize command submission in CI and leave a TODO note

* Update webgpu CI

* Add myself as WebGPU codeowner

* Deadlock avoidance

* Leave WebGPU/Vulkan CI serialized

* Fix divide by 0

* Fix logic in division by inflight_threads

* Update CODEOWNERS and remove serialize submit option
2025-10-07 13:48:56 -07:00
Tarek Dakhran
aeaf8a36f0
llama : support LiquidAI LFM2-MoE hybrid model (#16464)
* llama : support LiquidAI LFM2-MoE hybrid model

Add support for [LiquidAI/LFM2-8B-A1B](https://huggingface.co/LiquidAI/LFM2-8B-A1B) model.
For more information about models, please read [the blog post](https://www.liquid.ai/company/news).

[HF PR](https://github.com/huggingface/transformers/pull/41401)
[GGUFs](https://huggingface.co/LiquidAI/LFM2-8B-A1B-GGUF)

* Do not use defaultdict

* Address PR feedback
2025-10-07 20:03:35 +02:00
Concedo
9919fdc83a granite 4 template 2025-10-08 00:09:46 +08:00
Concedo
c1a246c1de fixed typo 2025-10-07 21:51:15 +08:00
Georgi Gerganov
df1b612e29
server : add /v1/health endpoint (#16461)
* server : add /v1/health endpoint

* cont : update readme
2025-10-07 15:57:14 +03:00
Concedo
3b30f12ca7 future proof handling of rnn models 2025-10-07 19:12:47 +08:00
Sascha Rogmann
4e0388aa8a
webui : added download action (#13552) (#16282)
* webui : added download action (#13552)

* webui : import and export (for all conversations)

* webui : fixed download-format, import of one conversation

* webui : add ExportedConversations type for chat import/export

* feat: Update naming & order

* chore: Linting

* webui : Updated static build output

---------

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
2025-10-07 11:11:08 +02:00
Georgi Gerganov
ef4c5b87ea
presets : fix pooling param for embedding models (#16455) 2025-10-07 10:32:32 +03:00
Radoslav Gerganov
c61ae20d05
rpc : update documentation (#16441)
Update the README file to match the newly added functionality of
exposing multiple devices from a single server.

Co-authored-by: Diego Devesa <slarengh@gmail.com>
2025-10-07 06:59:13 +00:00
Concedo
7857578f45 handle more rnn models 2025-10-07 13:47:15 +08:00
Georgi Gerganov
0123ff38f5
memory : use sequential equal splits for recurrent modules (#16442) 2025-10-07 08:24:17 +03:00
Georgi Gerganov
0a319bb75e
metal : add support for non-padded FA KV (#16148)
* metal : pad K, V and Mask when needed

* cont : simplify

* cuda : add TODO about KV padding requirement

* metal : add comments

* metal : remove mask padding requirement
2025-10-07 08:23:30 +03:00
Georgi Gerganov
1d6092fc72
tests : add -INF blocks to the KQ mask in the FA tests (#16380)
* tests : add -INF blocks to the KQ mask in the FA tests

* cont : bump -INF block size to 64

Co-authored-by: Jeff Bolz <jbolz@nvidia.com>

* ggml : prevent division by zero in FA CPU op

---------

Co-authored-by: Jeff Bolz <jbolz@nvidia.com>
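
The division-by-zero case that the -INF mask blocks exercise can be illustrated with a plain masked softmax over one row. This is a minimal sketch, not the ggml FA code: the function name `masked_softmax_row` and the choice to return all zeros for a fully masked row are assumptions made for illustration. When every mask entry is -INF, the row maximum is -INF, and a naive implementation computes `-INF - (-INF)` (NaN) or divides by a zero sum.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical helper: softmax over (scores + mask) for a single row.
// A fully masked row (all entries -INF) yields zeros instead of NaN.
static std::vector<float> masked_softmax_row(const std::vector<float> & scores,
                                             const std::vector<float> & mask) {
    const float neg_inf = -std::numeric_limits<float>::infinity();

    // Row maximum, used for the usual numerically stable softmax.
    float max_val = neg_inf;
    for (std::size_t i = 0; i < scores.size(); ++i) {
        max_val = std::max(max_val, scores[i] + mask[i]);
    }

    std::vector<float> out(scores.size(), 0.0f);
    if (max_val == neg_inf) {
        return out; // every position is masked: skip the division entirely
    }

    float sum = 0.0f;
    for (std::size_t i = 0; i < scores.size(); ++i) {
        out[i] = std::exp(scores[i] + mask[i] - max_val);
        sum += out[i];
    }
    // sum >= 1 here, because the maximum element contributes exp(0) == 1.
    for (float & x : out) {
        x /= sum;
    }
    return out;
}
```

The early return is one possible guard; an implementation could equally well check the sum before taking its reciprocal.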
2025-10-07 08:22:35 +03:00
Georgi Gerganov
8ae32dc9ec
metal : various optimizations + refactoring (#16446)
* metal : ssm_scan minor opts

* metal : get_rows optimize

* metal : cpy optimize

* metal : ssm_conv opt

* metal : ssm_scan simplify

* metal : ssm_scan opt
2025-10-07 08:21:40 +03:00
Gadflyii
3df2244df4
llama : add --no-host to disable host buffers (#16310)
* implement --no-host to disable host buffer

* fix equal_mparams

* move no-host enumeration order together with other model params

---------

Co-authored-by: slaren <slarengh@gmail.com>
2025-10-06 19:55:53 +02:00
Gabe Goodhart
c08002a198
chat : Granite Docling stopping (#16438)
* fix: Fix duplicate fake image before token on first slice

Branch: GraniteDoclingStopping

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Use double-newline before overview image

Branch: GraniteDoclingStopping

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Remove incorrect newline at the end of granite chat template gen prompt

There should not be one, even for the language models.

Branch: GraniteDoclingStopping

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* tests: Remove bad newline from granite chat template test (legacy)

Branch: GraniteDoclingStopping

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

---------

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
2025-10-06 18:59:40 +02:00
Sigbjørn Skjæret
3a002afafa
ci : refactor sdk caching to minimize storage (#16414)
* refactor sdk caching to minimize storage

* use correct action

* add myself as owner to /.github/actions/ [no ci]
2025-10-06 17:40:21 +02:00
Concedo
bb5cef1756 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.devops/nix/package.nix
#	ci/run.sh
#	ggml/src/ggml-cpu/amx/amx.cpp
#	ggml/src/ggml-webgpu/ggml-webgpu.cpp
#	ggml/src/ggml-webgpu/wgsl-shaders/rms_norm.wgsl
#	tools/server/README.md
2025-10-06 22:41:46 +08:00
Concedo
f2b9b93838 updated lite 2025-10-06 21:59:51 +08:00
Georgi Gerganov
a23b9bdbd3
ggml : fix unaligned access in AMX code (#16315)
2025-10-06 16:05:27 +03:00
Daniel Bevenius
04e632a4aa
ci : remove missing reranker model files (#16444)
This commit removes jina-reranker-v1-tiny-en model files that are no
longer present on Hugging Face.

The motivation for this is that it clears the 404 errors out of the CI logs,
which can be a little confusing when looking at the logs for the first time.

Refs: https://github.com/ggml-org/llama.cpp/actions/runs/18070620247/job/51419855630#step:5:2649
2025-10-06 14:56:59 +02:00
Daniel Bevenius
a80ff183ab
ggml-cpu : fix leftover handling in ggml_vec_scale_f32 for SVE (#16443)
This commit updates the leftover handling in ggml_vec_scale_f32.

The motivation for this is that the code currently incorrectly assumes
there would be fewer than ggml_f32_epr leftover elements. However,
since the main loop processes 2*ggml_f32_epr elements per iteration,
there can be up to (2*ggml_f32_epr - 1) leftover elements.

The original single-pass leftover code could only process ggml_f32_epr
elements, leaving some elements unscaled.

Example scenario with 256-bit SVE:
```
ggml_f32_epr  = 8 (elements per register)
ggml_f32_step = 16 (two registers per iteration)
n             = 25
np            = 16
leftovers     = 9 elements (16-24)

Original    : processes only elements 16-23, misses element 24
This commit : loop processes elements 16-23, then element 24
```
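
The scenario above can be reproduced without SVE hardware. The following is a portable sketch, not the ggml code: `scale_chunk` and `scale_f32_sim` are hypothetical names, and `scale_chunk` only imitates a predicated vector operation that touches at most `epr` elements. The `loop_leftovers` flag switches between the single-pass leftover handling and the looped version described in this commit.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for one predicated vector op: scales at most `epr`
// elements starting at index `i` and never touches indices >= n.
static void scale_chunk(std::vector<float> & y, std::size_t i, std::size_t n,
                        std::size_t epr, float v) {
    const std::size_t end = std::min(i + epr, n);
    for (std::size_t k = i; k < end; ++k) {
        y[k] *= v;
    }
}

// The main loop consumes 2*epr elements per iteration. Leftovers are handled
// either by a single pass (old behaviour) or by a loop (fixed behaviour).
static void scale_f32_sim(std::vector<float> & y, float v, std::size_t epr,
                          bool loop_leftovers) {
    const std::size_t n    = y.size();
    const std::size_t step = 2 * epr;
    const std::size_t np   = n - (n % step); // elements covered by the main loop

    for (std::size_t i = 0; i < np; i += step) {
        scale_chunk(y, i,       n, epr, v);
        scale_chunk(y, i + epr, n, epr, v);
    }

    if (loop_leftovers) {
        // Up to 2*epr - 1 leftovers, so more than one pass may be needed.
        for (std::size_t i = np; i < n; i += epr) {
            scale_chunk(y, i, n, epr, v);
        }
    } else if (np < n) {
        // A single pass covers at most epr leftovers.
        scale_chunk(y, np, n, epr, v);
    }
}
```

With `epr = 8` and `n = 25`, the single-pass variant leaves index 24 unscaled, while the looped variant scales all 25 elements.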

Refs: https://github.com/ggml-org/llama.cpp/actions/runs/18070620247/job/51419855630
2025-10-06 14:17:12 +02:00
Yuannan
1d49ca3759
nix : removed metal for nix (#16118) 2025-10-06 12:29:56 +03:00
Oleksandr Kuvshynov
c5fef0fcea
server: update readme to mention n_past_max metric (#16436)
https://github.com/ggml-org/llama.cpp/pull/15361 added a new exported
metric, but I missed updating this doc.
2025-10-06 10:53:31 +03:00
Concedo
2fa28fdcf8 wrap sd_parse_meta_field in trycatch 2025-10-06 00:05:19 +08:00
Wagner Bruna
c48999f7c0
additional options for image generation (#1765)
* sd: add backend support for choosing the default sampler

* use the default sampler on the API

* sd: add backend support for the scheduler

* sd: add backend support for distilled guidance

* sd: add backend support for timestep-shift

* sd: add a config field to set default image gen options
2025-10-05 23:36:20 +08:00
Gabe Goodhart
ca71fb9b36
model : Granite docling + Idefics3 preprocessing (SmolVLM) (#16206)
* feat: Add granite-docling conversion using trillion pretokenizer

Branch: gabe-l-hart/GraniteDocling

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Add granite-docling vocab pre enum

Branch: gabe-l-hart/GraniteDocling

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Use granite-docling pre

Branch: gabe-l-hart/GraniteDocling

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Add clip_is_idefics3

Branch: gabe-l-hart/GraniteDocling

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Allow multi-token boundary sequences for image templating

Branch: gabe-l-hart/GraniteDocling

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Add tiling support for idefics3 in clip.cpp

This should likely be moved into llava_uhd::get_slice_instructions, but for
now this avoids disrupting the logic there.

Branch: gabe-l-hart/GraniteDocling

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Partial support for full templating for idefics3 in mtmd

There are still errors encoding some of the image chunks, but the token
sequence now matches transformers _almost_ perfectly, except for the double
newline before the global image which shows up as two consecutive newline
tokens instead of a single double-newline token. I think this is happening
because the blocks are tokenized separately then concatenated.

Branch: gabe-l-hart/GraniteDocling

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Fully working image preprocessing for idefics3 w/ resize and slicing

Branch: gabe-l-hart/GraniteDocling

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Parse the preprocessor config's longest side and add it to the mmproj hparams

Branch: GraniteDocling

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Use the longest side instead of size * scale_factor

For Granite Docling, these come out to the same value, but that was just a
coincidence.

Branch: GraniteDocling

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Allow batch encoding and remove clip_is_idefics3

Branch: GraniteDocling

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* refactor: Remove unnecessary conditionals for empty token vectors

Branch: GraniteDocling

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* refactor: Use image_manipulation util

Branch: GraniteDocling

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* add test model

---------

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2025-10-05 14:57:47 +02:00
Concedo
75272f62af remove gif-h 2025-10-05 17:49:29 +08:00
Concedo
a09d8333b5 allow lowvram (nkvo) to be used with vulkan. 2025-10-05 16:18:58 +08:00
Concedo
b5bec86231 simple quick triage for vulkan compilation 2025-10-05 14:25:35 +08:00
Reese Levine
35266573b9
ggml webgpu: actually add softmax, fix rms_norm offset (#16400)
* implement soft_max

* Fix soft_max data race

* Temporary fix, wait on each submit
2025-10-04 20:59:31 -07:00
Concedo
c83dde8a34 not working commit, need to fix vulkan shaders gen 2025-10-05 11:32:50 +08:00
Concedo
76818cb67a update readme 2025-10-05 10:37:43 +08:00
Eve
86df2c9ae4
vulkan: use a more appropriate amount of threads when generating shaders (#16418)
* use a more flexible amount of threads

* fix windows compile and 0 thread case

* nominmax
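
The "0 thread case" can be illustrated with a small clamp. This is a sketch under stated assumptions, not the code from vulkan-shaders-gen: `pick_worker_count` and its `cap` parameter are hypothetical. What is certain is that `std::thread::hardware_concurrency()` is allowed to return 0 when the value cannot be determined, so a caller has to fall back to at least one worker. On Windows, `NOMINMAX` is commonly defined so that the `min`/`max` macros from `windows.h` do not collide with `std::min`/`std::max`, which is presumably what the last bullet refers to.

```cpp
#include <algorithm>
#include <thread>

// Hypothetical helper: turn a reported core count into a usable worker count.
// `reported` may be 0 (unknown); `cap` is an illustrative upper bound.
static unsigned pick_worker_count(unsigned reported, unsigned cap) {
    const unsigned at_least_one = std::max(1u, reported);
    return std::min(at_least_one, std::max(1u, cap));
}

// Convenience wrapper that queries the standard library for the core count.
static unsigned default_worker_count(unsigned cap) {
    return pick_worker_count(std::thread::hardware_concurrency(), cap);
}
```

Keeping the clamp separate from the query makes the zero case easy to test without depending on the machine running the build.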
2025-10-04 22:04:27 +02:00
Concedo
1d728bbc89 Merge commit '128d522c04' into concedo_experimental
# Conflicts:
#	.github/workflows/build.yml
#	.github/workflows/release.yml
#	ggml/src/ggml-vulkan/ggml-vulkan.cpp
#	tests/test-alloc.cpp
#	tests/test-chat.cpp
2025-10-04 23:51:22 +08:00
Concedo
b8680cd0c5 Revert "preserve rocm6 ci"
This reverts commit 9df2a02c4c. (+1 squashed commits)

Squashed commits:

[2e96da6f0] Revert "ROCm 7 CI (#1752)"

This reverts commit 118e589743.
2025-10-04 23:33:59 +08:00
Concedo
ef773cd8cc updated lite 2025-10-04 23:29:47 +08:00
Radoslav Gerganov
f39283960b
rpc : check src buffer when copying tensor (#16421)
Only dst buffer is guaranteed to be an RPC buffer. Add check for the src
one.
2025-10-04 16:22:45 +03:00
Wagner Bruna
a27d71f95f
fix VAE tiling for Qwen Image (#1774)
leejet/stable-diffusion.cpp#873
2025-10-04 20:44:43 +08:00
Concedo
a98b63013e allow tiling on qwen image 2025-10-04 20:43:36 +08:00
Radoslav Gerganov
898acba681
rpc : add support for multiple devices (#16276)
* rpc : add support for multiple devices

Allow rpc-server to expose multiple devices from a single endpoint.
Change RPC protocol to include device identifier where needed.

closes: #15210

* fixes

* use ggml_backend_reg_t

* address review comments

* fix llama-bench backend report

* address review comments, change device naming

* fix cmd order
2025-10-04 12:49:16 +03:00
Acly
e29acf74fe
vulkan : incremental shader builds (#16341)
* vulkan (DRAFT): split shader generation by GLSL source file, to improve incremental build times

* support dep-files so shaders are recompiled if their included files change

* rename shader files which are used as "headers" to use .glsl extension
* move glslc extension detection shaders to separate folders
* the above is to prevent them from getting glob'd with the actual compute shaders that need to be compiled

* vulkan : only write embedded shader .hpp/.cpp when they change

* avoid recompiling ggml-vulkan.cpp when editing shaders
* pass single --source argument instead of --input-dir & --filter to shader gen
* check for source file match earlier

* fix hang in vulkan-shaders-gen when there are compilation errors

* early out did not decrement compile_count

* clean up

* fix glslc integer dot product test

* unconditionally write the embedded shader cpp output

* replace output filepath in generated dep-files to match output in CMakeLists

---------

Co-authored-by: Jeff Bolz <jbolz@nvidia.com>
2025-10-04 11:42:56 +02:00
Concedo
bb06956b2d allow wan to use img2img via init image 2025-10-04 11:25:46 +08:00
Concedo
db37688b47 qwen image disable VAE tiling as it's broken 2025-10-04 11:19:19 +08:00
henk717
118e589743
ROCm 7 CI (#1752)
* Bump ROCm

* Container experiment

* Can 7.0 compile it on its own?

* Clean the env before pulling docker

* Cleanup attempt 2

* Fix cleanup test 2

* Bing attempts to save ROCm users

* CI binary location fix attempt

* Attempt to fix Docker env vars (make it compile rocm again)

* Update kcpp-build-release-linux-rocm.yaml

* Less fancy ROCm spelling
2025-10-04 09:11:12 +08:00