LostRuins Concedo
92b5afc019
flag to show if jinja is enabled
2025-11-08 00:49:50 +08:00
LostRuins Concedo
b02fc29030
jinja2 as dependency
2025-11-07 23:47:39 +08:00
LostRuins Concedo
462a34ed5b
jinja is now working
2025-11-07 23:46:22 +08:00
LostRuins Concedo
cfb22b5c9d
rename a missed BLAS -> batch
2025-11-06 16:11:26 +08:00
LostRuins Concedo
978d755ddc
escape clause for tool calling
2025-11-05 22:02:24 +08:00
LostRuins Concedo
3e4a33499f
updated lite
2025-11-05 20:52:47 +08:00
LostRuins Concedo
6ddacb62a0
serve gzipped versions of files. added a modded lcpp gui with modified path handling and proper stream termination, see https://github.com/ggml-org/llama.cpp/pull/14839#issuecomment-3490987929
2025-11-05 20:40:30 +08:00
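The gzip-serving idea in the commit above can be sketched as a small helper (a minimal illustration only, not the actual koboldcpp implementation; the function name `pick_variant` is hypothetical):

```python
import os

def pick_variant(path: str, accept_encoding: str):
    """Return (file_to_serve, content_encoding) for a static file request.

    Prefers a pre-compressed "<path>.gz" sibling when the client's
    Accept-Encoding header includes gzip; otherwise serves the plain file.
    """
    gz_path = path + ".gz"
    if "gzip" in accept_encoding and os.path.exists(gz_path):
        # Serve the compressed bytes as-is; the Content-Encoding header
        # tells the client to decompress on its end.
        return gz_path, "gzip"
    return path, None
```

The server would then send `Content-Encoding: gzip` whenever the second return value is set, so browsers transparently decompress the file.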
LostRuins Concedo
fc80cdccc2
Merge commit 'bea04522ff' into concedo_experimental
...
# Conflicts:
# scripts/sync-ggml.last
# src/CMakeLists.txt
# tests/test-backend-ops.cpp
2025-11-05 12:41:01 +08:00
Concedo
9720aa6224
change an assert to optional testing https://github.com/LostRuins/koboldcpp/issues/1821
2025-11-02 10:30:04 +08:00
Concedo
7946203d5b
add test build target for linux olderpc
2025-11-02 10:25:00 +08:00
Concedo
3aec5ed0fd
Kcpp triage for rowsplit: revert https://github.com/ggml-org/llama.cpp/pull/16715 until https://github.com/ggml-org/llama.cpp/issues/16799 is resolved
...
revert https://github.com/ggml-org/llama.cpp/pull/16715 (+2 squashed commit)
Squashed commit:
[289af2ee2] Revert "Hide latency of bias and gate-loading (#16847 )"
This reverts commit 8b11deea46.
[a3e5c1e95] Revert "CUDA: add unused vars to mmvf and mmvq (#16807 )"
This reverts commit 463bbf20bf.
2025-11-02 09:58:41 +08:00
henk717
2649618042
ROCm 7.1 CI ( #1823 )
2025-11-02 08:03:27 +08:00
Concedo
af327857ec
handle loading very old mmproj that broke after https://github.com/ggml-org/llama.cpp/pull/14928
2025-11-02 02:11:17 +08:00
Concedo
333e2bb30b
fix for qwen image crashing due to ref images being too big, trial and error shows it happens after 512x512
2025-11-02 01:31:01 +08:00
Concedo
7179e49aef
fix from https://github.com/leejet/stable-diffusion.cpp/pull/926
2025-11-01 23:38:37 +08:00
Concedo
60d3cc713c
updated lite
2025-11-01 12:21:35 +08:00
xzuyn
988baa544e
add JobRate and JobCost to worker log ( #1820 )
...
- adds average jobs per hour
- adds average kudos earned per job
- change EarnRate to show 2 decimal places
2025-11-01 10:01:13 +08:00
Piotr Wilkin (ilintar)
bea04522ff
refactor : llama-model.cpp ( #16252 )
...
* Squashed: llama-model.cpp refactoring
* Fix formatting of attn / ffn / ffn_moe calls
* Fix import regression / unify spacing in models.h
* totally DID NOT miss those!
* Add missing qwen3vl(moe) models
* Add missing new .cpp files to build
* Remove extra semicolons
* Editor checker
* Update src/models/models.h
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2025-10-31 23:40:23 +01:00
Piotr Wilkin (ilintar)
0de0a01576
model : Minimax M2 ( #16831 )
...
* Model: Minimax M2
* Cleanup
* Cleanup pt. 2
* Cleanup pt. 3
* Update convert_hf_to_gguf_update.py - merge catch blocks
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Remove vocab models and test
* Remove all redundant hparam settings covered by TextModel
* Move super to start, don't set block_count
* Update src/llama-model.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update gguf-py/gguf/constants.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2025-10-31 21:20:47 +01:00
Giuseppe Scrivano
e58d585604
model : add Granite Hybrid nano types ( #16896 )
...
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
2025-10-31 21:20:07 +01:00
Johannes Gäßler
31c511a968
CUDA: Volta tensor core support for MMF ( #16843 )
...
* CUDA: Volta tensor core support for MMF
* more generic checks for hardware support
* Update ggml/src/ggml-cuda/mmf.cuh
Co-authored-by: Aman Gupta <amangupta052@gmail.com>
---------
Co-authored-by: Aman Gupta <amangupta052@gmail.com>
2025-10-31 15:57:19 +01:00
Georgi Gerganov
6d39015a74
sync : ggml
2025-10-31 16:26:28 +02:00
Concedo
75375157fd
Merge commit '8da3c0e200' into concedo_experimental
...
# Conflicts:
# tests/test-backend-ops.cpp
2025-10-31 21:35:58 +08:00
Concedo
800b5c3dfa
updated lite
2025-10-31 21:34:21 +08:00
Aman Gupta
4146d6a1a6
CUDA: add expert reduce kernel ( #16857 )
...
* CUDA: add expert reduce kernel
* contiguous checks, better formatting, use std::vector instead of array
* use vector empty instead of size
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
---------
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2025-10-31 20:05:07 +08:00
Georgi Gerganov
8da3c0e200
batch : fix consistency checks for the input positions ( #16890 )
2025-10-31 13:50:33 +02:00
Concedo
0891b0752d
qwen3vl fixed (+2 squashed commit)
...
Squashed commit:
[89f65ed0c] wip fixing q3vl
[6fa34cff2] wip fixing q3vl
2025-10-31 17:52:33 +08:00
Georgi Gerganov
c22473b580
server : don't print user inputs to console ( #16871 )
2025-10-31 10:54:19 +02:00
Daniel Bevenius
0f715b4e75
server : fix typos in server.cpp comments [no ci] ( #16883 )
2025-10-31 09:51:26 +01:00
Jeff Bolz
d2d931f173
vulkan: disable spirv-opt for rope shaders ( #16872 )
2025-10-31 08:34:47 +01:00
Masato Nakasaka
2976b0374d
vulkan: Fix crash when FP16 mul_mat accumulation is not supported ( #16796 )
...
* Experimenting crash fix
* added assert for aborting and fixed comment
* changed to check if a pipeline is empty or not
* Moved function in class definition
* replaced with is_empty
* Modified is_empty to check only unaligned pipelines
2025-10-31 08:18:59 +01:00
Ruben Ortlam
d2a2673dd1
vulkan: fix shmem overrun in mmq id shader ( #16873 )
...
* vulkan: fix shmem overrun in mmq id shader
* metal : fix mul_mm_id
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-10-31 08:14:49 +01:00
l3utterfly
13002a0896
ggml-hexagon: respect input size when getting/setting tensor data ( #16836 )
...
* respect input size when getting/setting tensor data
allows partial repacking/copying when the requested tensor size is smaller than the actual tensor
* Removed duplicate repack_mxfp4_mxfp4x4x2 function
2025-10-30 21:46:31 -07:00
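The clamping behavior this commit describes can be illustrated with a short sketch (the function name is hypothetical; the real change is C code in the ggml-hexagon backend):

```python
def copy_tensor_data(dst: bytearray, src: bytes, requested_size: int) -> int:
    """Copy at most `requested_size` bytes, clamped to what both buffers hold.

    Mirrors the idea of respecting the caller's size when getting/setting
    tensor data, so a partial get/set never reads or writes past the end.
    Returns the number of bytes actually copied.
    """
    n = min(requested_size, len(src), len(dst))
    dst[:n] = src[:n]
    return n
```

The key point is taking the minimum of the requested and actual sizes rather than asserting they match, which is what permits partial repacking/copying.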
Concedo
adec6eb5d5
occam patch for vulkan: fix shmem overrun in mmq id shader https://github.com/ggml-org/llama.cpp/pull/16873
2025-10-31 10:58:29 +08:00
Concedo
2b00e55356
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .github/workflows/docker.yml
# ggml/src/ggml-opencl/kernels/mul_mm_f16_f32_l4_lm.cl
# ggml/src/ggml-opencl/kernels/mul_mm_f32_f32_l4_lm.cl
# ggml/src/ggml-sycl/rope.cpp
# ggml/src/ggml-webgpu/wgsl-shaders/rope.tmpl.wgsl
# requirements/requirements-convert_legacy_llama.txt
# tests/test-backend-ops.cpp
# tests/test-rope.cpp
# tools/server/README.md
2025-10-31 10:52:57 +08:00
Sigbjørn Skjæret
6eb208d17e
ci : enable free-disk-space on cuda docker build ( #16877 )
2025-10-31 00:34:27 +01:00
lhez
9984cbb61d
opencl: fix boundary handling for mul_mm ( #16875 )
2025-10-30 16:00:20 -07:00
RodriMora
ce18efeaf1
convert : update transformers requirements ( #16866 )
...
* Update requirements-convert_legacy_llama.txt
Updated requirements to support Qwen3-VL in transformers 4.57.1 version
* Update requirements/requirements-convert_legacy_llama.txt
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2025-10-30 23:15:03 +01:00
chansikpark
16724b5b68
server : bump request URI max length to 32768 ( #16862 )
2025-10-30 20:22:23 +02:00
Georgi Gerganov
b52edd2558
server : remove n_past ( #16818 )
...
* server : remove n_past
* server : replace slot.n_prompt_tokens() with slot.task->n_tokens()
* server : fixes + clean-up
* cont : fix context shift
* server : add server_tokens::pos_next()
Co-authored-by: Xuan-Son Nguyen <son@huggingface.co>
* server : fix pos_next() usage
Co-authored-by: Xuan-Son Nguyen <son@huggingface.co>
---------
Co-authored-by: Xuan-Son Nguyen <son@huggingface.co>
2025-10-30 18:42:57 +02:00
Max Krasnyansky
517b7170e1
cpu: introduce chunking for repack matmuls and enable matmul-id chunking on ARM64 ( #16833 )
...
Very similar implementation to the flash-attention chunking, with similar benefits.
2025-10-30 09:06:13 -07:00
Shagun Bera
835e918d84
common: fix typo in cli help text ( #16864 )
2025-10-30 17:47:31 +02:00
JJJYmmm
d261223d24
model: add support for qwen3vl series ( #16780 )
...
* support qwen3vl series.
Co-authored-by: Thireus ☠ <Thireus@users.noreply.github.com>
Co-authored-by: yairpatch <yairpatch@users.noreply.github.com>
Co-authored-by: LETS-BEE <LETS-BEE@users.noreply.github.com>
* bugfix: fix the arch check for qwen3vl-moe.
* use build_ffn
* optimize deepstack structure
* optimize deepstack feature saving
* Revert "optimize deepstack feature saving" for temporal fix
This reverts commit f321b9fdf13e59527408152e73b1071e19a87e71.
* code clean
* use fused qkv in clip
* clean up / rm is_deepstack_layers for simplification
* add test model
* move test model to "big" section
* fix imrope check
* remove trailing whitespace
* fix rope fail
* metal : add imrope support
* add imrope support for sycl
* vulkan: add imrope w/o check
* fix vulkan
* webgpu: add imrope w/o check
* Update gguf-py/gguf/tensor_mapping.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* fix tensor mapping
---------
Co-authored-by: Thireus ☠ <Thireus@users.noreply.github.com>
Co-authored-by: yairpatch <yairpatch@users.noreply.github.com>
Co-authored-by: LETS-BEE <LETS-BEE@users.noreply.github.com>
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2025-10-30 16:19:14 +01:00
Concedo
c2316353a1
allow usage of flux without some components
2025-10-30 22:32:20 +08:00
Max Krasnyansky
dcca0d3ab8
cpu: introduce chunking for flash attention ( #16829 )
...
Factor out the core FA loop into flash_atten_f16_one_chunk and add an outer loop
on top that handles the chunks.
2025-10-30 14:26:05 +02:00
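The factoring this commit describes — a per-chunk core function plus an outer loop that dispatches chunks — follows a common pattern, sketched here in Python (illustrative only; the real code is C++ in ggml's CPU backend, and both function names below are stand-ins):

```python
def process_one_chunk(data, start, end):
    # Stand-in for the core per-chunk work (flash_atten_f16_one_chunk
    # in the commit); here it just sums a slice of the input.
    return sum(data[start:end])

def process_chunked(data, chunk_size):
    """Outer loop: split the row range into fixed-size chunks and hand
    each one to the core function. In the real backend the chunks can
    be distributed across worker threads for better load balancing."""
    total = 0
    for start in range(0, len(data), chunk_size):
        end = min(start + chunk_size, len(data))
        total += process_one_chunk(data, start, end)
    return total
```

Because each chunk is independent, this structure is what makes the matmul-id chunking in the related commit "very similar": only the per-chunk body differs.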
Tianyue-Zhao
bacddc049a
model: Add support for CogVLM model ( #15002 )
...
* Added GGUF mappings for CogVLM model
* Add tensor mapping for CogVLM visual encoder
* Add CogVLM to conversion script, no vision part yet
* Added CogVLM vision model to conversion script
* Add graph for CogVLM CLIP model
* Add graph for CogVLM
* Fixes for CogVLM. Now compiles.
* Model now runs
* Fixes for cogvlm graph
* Account for graph context change after rebase
* Changes for whitespace
* Changes in convert script according to comments
* Switch CogVLM LLM graph to merged QKV tensor
* Use rope_type variable instead of direct definition
* Change CogVLM CLIP encoder to use SWIGLU
* Switch CogVLM CLIP to use merged QKV
* Apply rebase edits and remove ggml_cont call that is now unnecessary
* clean up
---------
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2025-10-30 12:18:50 +01:00
Sigbjørn Skjæret
229bf68628
cuda : fix argsort with 64k+ rows ( #16849 )
2025-10-30 08:56:28 +01:00
Jan Boon
d7395115ba
llama : use std::abs instead of abs ( #16853 )
2025-10-30 08:30:58 +02:00
Jeff Bolz
052df28b0e
vulkan: Handle argsort with a large number of rows ( #16851 )
2025-10-30 07:27:41 +01:00
Wagner Bruna
96a70033ba
sd: sync to master-343-dd75fc0 ( #1818 )
2025-10-30 13:44:59 +08:00