Commit graph

10172 commits

Author SHA1 Message Date
LostRuins Concedo
af94884971 update props 2025-11-08 10:15:13 +08:00
LostRuins Concedo
92b5afc019 flag to show if jinja is enabled 2025-11-08 00:49:50 +08:00
LostRuins Concedo
b02fc29030 jinja2 as dependency 2025-11-07 23:47:39 +08:00
LostRuins Concedo
462a34ed5b jinja is now working 2025-11-07 23:46:22 +08:00
LostRuins Concedo
cfb22b5c9d rename a missed BLAS -> batch 2025-11-06 16:11:26 +08:00
LostRuins Concedo
978d755ddc escape clause for tool calling 2025-11-05 22:02:24 +08:00
LostRuins Concedo
3e4a33499f updated lite 2025-11-05 20:52:47 +08:00
LostRuins Concedo
6ddacb62a0 serve gzipped versions of files; added a modded lcpp gui with modified path handling and proper stream termination, see https://github.com/ggml-org/llama.cpp/pull/14839#issuecomment-3490987929 2025-11-05 20:40:30 +08:00
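(Editor's note on the gzip-serving commit: a minimal sketch of the general pre-compressed-asset technique, not the actual kcpp handler; the helper name pick_gzip_variant and its signature are hypothetical.)

```cpp
// Hypothetical helper, not the commit's code: if the client advertises gzip
// support and a pre-compressed "<path>.gz" sibling exists on disk, serve that
// file with "Content-Encoding: gzip"; otherwise fall back to the original.
#include <filesystem>
#include <optional>
#include <string>

std::optional<std::filesystem::path> pick_gzip_variant(
        const std::filesystem::path & path, const std::string & accept_encoding) {
    if (accept_encoding.find("gzip") == std::string::npos) {
        return std::nullopt; // client did not ask for gzip
    }
    std::filesystem::path gz = path;
    gz += ".gz"; // "index.html" -> "index.html.gz"
    if (std::filesystem::exists(gz)) {
        return gz;
    }
    return std::nullopt;
}
```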
LostRuins Concedo
fc80cdccc2 Merge commit 'bea04522ff' into concedo_experimental
# Conflicts:
#	scripts/sync-ggml.last
#	src/CMakeLists.txt
#	tests/test-backend-ops.cpp
2025-11-05 12:41:01 +08:00
Concedo
9720aa6224 change an assert to an optional check https://github.com/LostRuins/koboldcpp/issues/1821 2025-11-02 10:30:04 +08:00
Concedo
7946203d5b add test build target for linux olderpc 2025-11-02 10:25:00 +08:00
Concedo
3aec5ed0fd Kcpp triage for rowsplit: revert https://github.com/ggml-org/llama.cpp/pull/16715 until https://github.com/ggml-org/llama.cpp/issues/16799 is resolved
revert https://github.com/ggml-org/llama.cpp/pull/16715 (+2 squashed commits)

Squashed commit:

[289af2ee2] Revert "Hide latency of bias and gate-loading (#16847)"

This reverts commit 8b11deea46.

[a3e5c1e95] Revert "CUDA: add unused vars to mmvf and mmvq (#16807)"

This reverts commit 463bbf20bf.
2025-11-02 09:58:41 +08:00
henk717
2649618042
ROCm 7.1 CI (#1823) 2025-11-02 08:03:27 +08:00
Concedo
af327857ec handle loading very old mmproj that broke after https://github.com/ggml-org/llama.cpp/pull/14928 2025-11-02 02:11:17 +08:00
Concedo
333e2bb30b fix for qwen image crashing due to ref images being too big; trial and error shows it happens above 512x512 2025-11-02 01:31:01 +08:00
Concedo
7179e49aef fix from https://github.com/leejet/stable-diffusion.cpp/pull/926 2025-11-01 23:38:37 +08:00
Concedo
60d3cc713c updated lite 2025-11-01 12:21:35 +08:00
xzuyn
988baa544e
add JobRate and JobCost to worker log (#1820)
- adds average jobs per hour
- adds average kudos earned per job
- change EarnRate to show 2 decimal places
2025-11-01 10:01:13 +08:00
Piotr Wilkin (ilintar)
bea04522ff
refactor : llama-model.cpp (#16252)
* Squashed: llama-model.cpp refactoring

* Fix formatting of attn / ffn / ffn_moe calls

* Fix import regression / unify spacing in models.h

* totally DID NOT miss those!

* Add missing qwen3vl(moe) models

* Add missing new .cpp files to build

* Remove extra semicolons

* Editor checker

* Update src/models/models.h

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2025-10-31 23:40:23 +01:00
Piotr Wilkin (ilintar)
0de0a01576
model : Minimax M2 (#16831)
* Model: Minimax M2

* Cleanup

* Cleanup pt. 2

* Cleanup pt. 3

* Update convert_hf_to_gguf_update.py - merge catch blocks

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Remove vocab models and test

* Remove all redundant hparam settings covered by TextModel

* Move super to start, don't set block_count

* Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update gguf-py/gguf/constants.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2025-10-31 21:20:47 +01:00
Giuseppe Scrivano
e58d585604
model : add Granite Hybrid nano types (#16896)
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
2025-10-31 21:20:07 +01:00
Johannes Gäßler
31c511a968
CUDA: Volta tensor core support for MMF (#16843)
* CUDA: Volta tensor core support for MMF

* more generic checks for hardware support

* Update ggml/src/ggml-cuda/mmf.cuh

Co-authored-by: Aman Gupta <amangupta052@gmail.com>

---------

Co-authored-by: Aman Gupta <amangupta052@gmail.com>
2025-10-31 15:57:19 +01:00
Georgi Gerganov
6d39015a74 sync : ggml 2025-10-31 16:26:28 +02:00
Concedo
75375157fd Merge commit '8da3c0e200' into concedo_experimental
# Conflicts:
#	tests/test-backend-ops.cpp
2025-10-31 21:35:58 +08:00
Concedo
800b5c3dfa updated lite 2025-10-31 21:34:21 +08:00
Aman Gupta
4146d6a1a6
CUDA: add expert reduce kernel (#16857)
* CUDA: add expert reduce kernel

* contiguous checks, better formatting, use std::vector instead of array

* use vector::empty() instead of size()

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2025-10-31 20:05:07 +08:00
Georgi Gerganov
8da3c0e200
batch : fix consistency checks for the input positions (#16890) 2025-10-31 13:50:33 +02:00
Concedo
0891b0752d qwen3vl fixed (+2 squashed commits)
Squashed commit:

[89f65ed0c] wip fixing q3vl

[6fa34cff2] wip fixing q3vl
2025-10-31 17:52:33 +08:00
Georgi Gerganov
c22473b580
server : don't print user inputs to console (#16871) 2025-10-31 10:54:19 +02:00
Daniel Bevenius
0f715b4e75
server : fix typos in server.cpp comments [no ci] (#16883) 2025-10-31 09:51:26 +01:00
Jeff Bolz
d2d931f173
vulkan: disable spirv-opt for rope shaders (#16872) 2025-10-31 08:34:47 +01:00
Masato Nakasaka
2976b0374d
vulkan: Fix crash when FP16 mul_mat accumulation is not supported (#16796)
* Experimenting with a crash fix

* added assert for aborting and fixed comment

* changed to check if a pipeline is empty or not

* Moved function in class definition

* replaced with is_empty

* Modified is_empty to check only unaligned pipelines
2025-10-31 08:18:59 +01:00
Ruben Ortlam
d2a2673dd1
vulkan: fix shmem overrun in mmq id shader (#16873)
* vulkan: fix shmem overrun in mmq id shader

* metal : fix mul_mm_id

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-10-31 08:14:49 +01:00
l3utterfly
13002a0896
ggml-hexagon: respect input size when getting/setting tensor data (#16836)
* respect input size when getting/setting tensor data

allows partial repacking/copying when the requested get/set size is smaller than the actual tensor

* Removed duplicate repack_mxfp4_mxfp4x4x2 function
2025-10-30 21:46:31 -07:00
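(Editor's note on the size-respecting get/set: the guard boils down to clamping the copy to the tensor's real byte size. A minimal sketch with hypothetical names; the real hexagon backend operates on ggml tensors, not raw buffers.)

```cpp
// Illustrative guard only (names are hypothetical): clamp a get/set request
// to the tensor's actual byte size so a partial copy never overruns the buffer.
#include <algorithm>
#include <cstddef>
#include <cstring>

void tensor_get_data(const void * tensor_buf, size_t tensor_nbytes,
                     void * dst, size_t offset, size_t size) {
    if (offset >= tensor_nbytes) {
        return; // request starts past the end: nothing to copy
    }
    const size_t n = std::min(size, tensor_nbytes - offset);
    std::memcpy(dst, (const char *) tensor_buf + offset, n);
}
```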
Concedo
adec6eb5d5 occam patch for vulkan: fix shmem overrun in mmq id shader https://github.com/ggml-org/llama.cpp/pull/16873 2025-10-31 10:58:29 +08:00
Concedo
2b00e55356 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.github/workflows/docker.yml
#	ggml/src/ggml-opencl/kernels/mul_mm_f16_f32_l4_lm.cl
#	ggml/src/ggml-opencl/kernels/mul_mm_f32_f32_l4_lm.cl
#	ggml/src/ggml-sycl/rope.cpp
#	ggml/src/ggml-webgpu/wgsl-shaders/rope.tmpl.wgsl
#	requirements/requirements-convert_legacy_llama.txt
#	tests/test-backend-ops.cpp
#	tests/test-rope.cpp
#	tools/server/README.md
2025-10-31 10:52:57 +08:00
Sigbjørn Skjæret
6eb208d17e
ci : enable free-disk-space on cuda docker build (#16877)
2025-10-31 00:34:27 +01:00
lhez
9984cbb61d
opencl: fix boundary handling for mul_mm (#16875) 2025-10-30 16:00:20 -07:00
RodriMora
ce18efeaf1
convert : update transformers requirements (#16866)
* Update requirements-convert_legacy_llama.txt

Updated requirements to support Qwen3-VL in transformers version 4.57.1

* Update requirements/requirements-convert_legacy_llama.txt

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2025-10-30 23:15:03 +01:00
chansikpark
16724b5b68
server : bump request URI max length to 32768 (#16862) 2025-10-30 20:22:23 +02:00
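(Editor's note: llama.cpp's server is built on cpp-httplib, where the request-URI cap is a compile-time macro. A hedged sketch of raising it; the macro name is taken from cpp-httplib's defaults, and the actual patch may do this differently.)

```cpp
// Hedged sketch: cpp-httplib caps the request URI length at compile time via
// CPPHTTPLIB_REQUEST_URI_MAX_LENGTH; defining it before the include raises
// the cap. The 32768 value comes from the commit subject above.
#define CPPHTTPLIB_REQUEST_URI_MAX_LENGTH 32768
#include "httplib.h"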
Georgi Gerganov
b52edd2558
server : remove n_past (#16818)
* server : remove n_past

* server : replace slot.n_prompt_tokens() with slot.task->n_tokens()

* server : fixes + clean-up

* cont : fix context shift

* server : add server_tokens::pos_next()

Co-authored-by: Xuan-Son Nguyen <son@huggingface.co>

* server : fix pos_next() usage

Co-authored-by: Xuan-Son Nguyen <son@huggingface.co>

---------

Co-authored-by: Xuan-Son Nguyen <son@huggingface.co>
2025-10-30 18:42:57 +02:00
Max Krasnyansky
517b7170e1
cpu: introduce chunking for repack matmuls and enable matmul-id chunking on ARM64 (#16833)
Very similar implementation to the flash-attention chunking, with similar benefits.
2025-10-30 09:06:13 -07:00
Shagun Bera
835e918d84
common: fix typo in cli help text (#16864) 2025-10-30 17:47:31 +02:00
JJJYmmm
d261223d24
model: add support for qwen3vl series (#16780)
* support qwen3vl series.

Co-authored-by: Thireus ☠ <Thireus@users.noreply.github.com>
Co-authored-by: yairpatch <yairpatch@users.noreply.github.com>
Co-authored-by: LETS-BEE <LETS-BEE@users.noreply.github.com>

* bugfix: fix the arch check for qwen3vl-moe.

* use build_ffn

* optimize deepstack structure

* optimize deepstack feature saving

* Revert "optimize deepstack feature saving" for temporal fix

This reverts commit f321b9fdf13e59527408152e73b1071e19a87e71.

* code clean

* use fused qkv in clip

* clean up / rm is_deepstack_layers for simplification

* add test model

* move test model to "big" section

* fix imrope check

* remove trailing whitespace

* fix rope fail

* metal : add imrope support

* add imrope support for sycl

* vulkan: add imrope w/o check

* fix vulkan

* webgpu: add imrope w/o check

* Update gguf-py/gguf/tensor_mapping.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* fix tensor mapping

---------

Co-authored-by: Thireus ☠ <Thireus@users.noreply.github.com>
Co-authored-by: yairpatch <yairpatch@users.noreply.github.com>
Co-authored-by: LETS-BEE <LETS-BEE@users.noreply.github.com>
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2025-10-30 16:19:14 +01:00
Concedo
c2316353a1 allow usage of flux without some components 2025-10-30 22:32:20 +08:00
Max Krasnyansky
dcca0d3ab8
cpu: introduce chunking for flash attention (#16829)
Factor out the core FA loop into flash_atten_f16_one_chunk and add an outer loop on top that handles the chunks.
2025-10-30 14:26:05 +02:00
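(Editor's note: this commit and the repack-matmul chunking commit above share the same pattern: factor the kernel into a one-chunk worker and drive it from an outer loop. A generic sketch; for_each_chunk and its parameters are illustrative, not the ggml code.)

```cpp
// Generic sketch of the chunking pattern: the outer loop hands fixed-size
// row ranges [begin, end) to a single-chunk worker; in a threaded backend the
// chunks would typically be handed out to workers for better load balance.
#include <algorithm>
#include <cstddef>
#include <functional>

void for_each_chunk(std::size_t n_rows, std::size_t chunk_size,
                    const std::function<void(std::size_t, std::size_t)> & one_chunk) {
    for (std::size_t begin = 0; begin < n_rows; begin += chunk_size) {
        one_chunk(begin, std::min(begin + chunk_size, n_rows));
    }
}
```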
Tianyue-Zhao
bacddc049a
model: Add support for CogVLM model (#15002)
* Added GGUF mappings for CogVLM model

* Add tensor mapping for CogVLM visual encoder

* Add CogVLM to conversion script, no vision part yet

* Added CogVLM vision model to conversion script

* Add graph for CogVLM CLIP model

* Add graph for CogVLM

* Fixes for CogVLM. Now compiles.

* Model now runs

* Fixes for cogvlm graph

* Account for graph context change after rebase

* Changes for whitespace

* Changes in convert script according to comments

* Switch CogVLM LLM graph to merged QKV tensor

* Use rope_type variable instead of direct definition

* Change CogVLM CLIP encoder to use SWIGLU

* Switch CogVLM CLIP to use merged QKV

* Apply rebase edits and remove ggml_cont call that is now unnecessary

* clean up

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2025-10-30 12:18:50 +01:00
Sigbjørn Skjæret
229bf68628
cuda : fix argsort with 64k+ rows (#16849) 2025-10-30 08:56:28 +01:00
Jan Boon
d7395115ba
llama : use std::abs instead of abs (#16853) 2025-10-30 08:30:58 +02:00
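(Editor's note: the motivation for this one-liner is the classic C pitfall; a small illustration, not code from the patch.)

```cpp
// Why "std::abs instead of abs": C's abs() from <cstdlib> takes int, so a
// floating-point argument gets truncated; std::abs from <cmath> has proper
// float/double overloads and keeps the value intact.
#include <cmath>
#include <cstdio>
#include <cstdlib>

int main() {
    double x = -2.5;
    std::printf("%d\n", abs(static_cast<int>(x))); // prints 2: C abs is int-only
    std::printf("%f\n", std::abs(x));              // prints 2.500000
    return 0;
}
```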
Jeff Bolz
052df28b0e
vulkan: Handle argsort with a large number of rows (#16851) 2025-10-30 07:27:41 +01:00