Commit graph

11129 commits

Author SHA1 Message Date
Concedo
c9308570b2 added mcp to list of capabilities, allow it to run standalone 2026-01-05 20:32:25 +08:00
Concedo
b762036388 indicate unofficial builds 2026-01-05 16:12:54 +08:00
Concedo
301a04adfc Merge branch 'concedo' into concedo_experimental 2026-01-05 15:24:43 +08:00
Concedo
9a4eeafbfc hotfix 1.105.3 2026-01-05 15:24:21 +08:00
Concedo
ad6c53aeff Merge commit '908a9e5a1e' into concedo 2026-01-05 15:01:49 +08:00
Concedo
4d3866a016 mcp proxy is done 2026-01-05 12:24:43 +08:00
Aman Gupta
908a9e5a1e
CUDA: disable cuda graph when using n-cpu-moe (#18593)
* CUDA: disable cuda graph when using n-cpu-moe

* call ggml_cuda_set_device
2026-01-05 01:37:48 +08:00
Aman Gupta
5126c41c1c
ggml-cuda: remove unused params in ggml_cuda_graph (#18579) 2026-01-05 01:37:09 +08:00
Concedo
91089ad1bd wip on mcp 2026-01-04 22:52:47 +08:00
Concedo
a82c89b065 minimax template 2026-01-04 20:51:16 +08:00
Concedo
acfc1e56d2 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	tests/test-regex-partial.cpp
2026-01-04 11:14:33 +08:00
Concedo
01c70a7d3d allow transcribe to be used with the LLM instead if no whisper model exists 2026-01-04 11:06:05 +08:00
Aldehir Rojas
cef1d23c5a
common/grammar : replace problematic backtracking regex [\s\S]* (#18342)
* grammar : add support for std::regex_search() with trigger patterns

* common : update hermes2 pro trigger to search instead of match

* common : use regex_search with anchoring for partial matching

* common : adjust regex partial tests to use new pattern

* grammar : check pattern directly instead of adding a type

* common : adjust existing patterns to match new semantics
2026-01-03 16:02:43 -06:00
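The switch above from matching to searching changes where a trigger pattern may sit in the input: `std::regex_match` must cover the string from its start, while `std::regex_search` finds the pattern anywhere. A minimal sketch using Python's `re` module as a stand-in for `std::regex` (the `<tool_call>` trigger text is illustrative, not the exact hermes2-pro pattern):

```python
import re

trigger = re.compile(r"<tool_call>")
text = 'Sure, let me look that up. <tool_call>{"name": "search"}'

# match() anchors at the start of the string: the mid-stream trigger is missed.
assert trigger.match(text) is None

# search() scans the whole string: the trigger fires wherever it appears.
m = trigger.search(text)
assert m is not None and m.start() == 27
```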
Georgi Gerganov
c69c7ebc90
graph : fix graph reuse logic when n_pos_per_embd > 1 (#18566) 2026-01-03 23:59:06 +02:00
Concedo
04f5445bef fix for macos asserting on exit 2026-01-03 23:26:04 +08:00
Aman Gupta
e57f52334b
ggml-cuda: fixes for concurrent streams (#18496) 2026-01-03 23:15:01 +08:00
Concedo
5a505cbc62 disable blackwell mma for now 2026-01-03 22:45:06 +08:00
Georgi Gerganov
a554a1ecc7
context : fix reserve token padding to n_seqs (#18536) 2026-01-03 15:45:34 +02:00
Johannes Gäßler
0f2e42ca1d
CUDA: only allocate FA tmp buffer if needed (#18564) 2026-01-03 13:55:53 +01:00
pl752
9dba9f5352
(Bugfix, ggml-cuda) Pool alloc count fix + small size computation type adjustment (#18559)
* CUDA: Fixed obj byte size instead of obj count being passed to pool alloc (fattn-common, dst_tmp_meta)

* CUDA: Explicitly casted some of the int alloc counts before multiplication in argsort

---------

Co-authored-by: pl752 <maximpl752@gmail.com>
2026-01-03 11:13:40 +01:00
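The first fix above is a unit mix-up: a byte size was passed to a pool allocator that expects an element count and computes the byte size itself. A toy illustration of that bug class (the `pool_alloc` helper here is hypothetical, not the actual CUDA pool API):

```python
ELEM_SIZE = 8  # bytes per element, stand-in for sizeof(obj)

def pool_alloc(count, elem_size=ELEM_SIZE):
    # Hypothetical pool API: takes an *element count* and sizes
    # the buffer itself by multiplying with the element size.
    return bytearray(count * elem_size)

n = 16
over = pool_alloc(n * ELEM_SIZE)  # bug: byte size passed where a count belongs
good = pool_alloc(n)              # fix: pass the element count

assert len(good) == n * ELEM_SIZE
assert len(over) == n * ELEM_SIZE * ELEM_SIZE  # silently over-allocates 8x
```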
Concedo
e4abf643fa Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	ggml/src/ggml-hexagon/htp/act-ops.c
#	ggml/src/ggml-rpc/ggml-rpc.cpp
#	src/CMakeLists.txt
#	src/llama-vocab.cpp
2026-01-03 15:37:30 +08:00
Wagner Bruna
0ef55844d3
sd: sync to master-453-4ff2c8c (#1907) 2026-01-03 15:28:27 +08:00
Shouyu
bcfc8c3cec
ggml-hexagon: optimize activation function (#18393)
* refactor: refactor silu

* refactor: optimize swiglu

* refactor: remove unnecessary if in swiglu

* refactor: refactor swiglu_oai

* chore: fix formatting issue
2026-01-02 21:24:24 -08:00
Jeff Bolz
18ddaea2ae
vulkan: Optimize GGML_OP_CUMSUM (#18417)
* vulkan: Optimize GGML_OP_CUMSUM

There are two paths: The preexisting one that does a whole row per workgroup
in a single shader, and one that splits each row into multiple blocks and does
two passes. The first pass computes partials within a block, the second adds
the block partials to compute the final result. The multipass shader is used
when there are a small number of large rows.

In the whole-row shader, handle multiple elements per invocation.

* use 2 ELEM_PER_THREAD for AMD/Intel

* address feedback
2026-01-02 15:32:30 -06:00
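The two-pass strategy described above can be sketched sequentially: pass one computes an inclusive prefix sum inside each block plus the block total, pass two adds the running sum of preceding block totals to every element. A plain-Python sketch (block size and data are illustrative; the real shader does pass one in parallel, one workgroup per block):

```python
def cumsum_two_pass(row, block_size):
    # Pass 1: inclusive prefix sum within each block + per-block totals.
    partials, totals = [], []
    for start in range(0, len(row), block_size):
        acc, out = 0, []
        for x in row[start:start + block_size]:
            acc += x
            out.append(acc)
        partials.append(out)
        totals.append(acc)
    # Pass 2: add the sum of all preceding block totals to each element.
    result, offset = [], 0
    for out, total in zip(partials, totals):
        result.extend(v + offset for v in out)
        offset += total
    return result

row = [3, 1, 4, 1, 5, 9, 2, 6]
assert cumsum_two_pass(row, 3) == [3, 4, 8, 9, 14, 23, 25, 31]
```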
Jeff Bolz
706e3f93a6
vulkan: Implement mmvq for iq1_s/iq1_m (#18450) 2026-01-02 20:19:04 +01:00
Prabod
5755e52d15
model : Maincoder-1B support (#18534)
* Add Maincoder model support

* Removed SPM model vocabulary setting and MOE related GGUF parameters
Removed trailing spaces from maincoder.cpp

* removed set_vocab

* added new line

* Fix formatting

* Add a new line for PEP8
2026-01-02 20:11:59 +01:00
Georgi Gerganov
f38de16341
metal : adjust extra size for FA buffer to avoid reallocations (#18545) 2026-01-02 19:02:18 +02:00
Georgi Gerganov
af1e8e1a6c
graph : reduce topology branching (#18548) 2026-01-02 19:01:56 +02:00
Concedo
77082dddfb mcp image handling 2026-01-03 00:03:05 +08:00
Georgi Gerganov
d84a6a98be
vocab : reduce debug logs about non-EOG control tokens (#18541)
* vocab : reduce debug logs about non-EOG control tokens

* cont : add comment
2026-01-02 16:17:33 +02:00
Concedo
107def07c8 updated lite and sdui (+1 squashed commits)
Squashed commits:

[3172b5d19] updated lite (+1 squashed commits)

Squashed commits:

[45081b0e2] updated glm nothink template
2026-01-02 18:11:32 +08:00
Chris Rohlf
c6f0e832da
rpc : use unordered_map::reserve and emplace (#18513) 2026-01-02 12:09:36 +02:00
Concedo
d8942cde14 smartcache allow custom number of slots 2026-01-02 17:19:40 +08:00
Concedo
7e1ae49e7d Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	ggml/src/ggml-cuda/ggml-cuda.cu
#	tests/test-backend-ops.cpp
#	tools/mtmd/CMakeLists.txt
2026-01-02 11:05:20 +08:00
Concedo
0a23388e7d added images in tool call queries 2026-01-02 10:48:34 +08:00
MeeMin
e86f3c2221
cuda : fix copy of large tensors (ggml_nbytes <= INT_MAX assertion) (#18433)
* ggml-cuda: fixed assertion in ggml_cuda_cpy (#18140)

* ggml-cuda: changes in data types to int64_t

* ggml-cuda: added asserts for CUDA block numbers

* ggml-cuda: changed the condition for y and z dimension
2026-01-02 00:24:20 +01:00
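The `ggml_nbytes <= INT_MAX` assertion above trips when index arithmetic is done in 32-bit before widening, which is why the fix moves to `int64_t`. The bug class, emulated in Python (Python integers do not overflow, so 32-bit signed wraparound is simulated; the dimensions are illustrative):

```python
INT32_MAX = 2**31 - 1

def mul_i32(a, b):
    # Emulate C's 32-bit signed multiplication (wraps around on overflow).
    r = (a * b) & 0xFFFFFFFF
    return r - 2**32 if r >= 2**31 else r

ne0, ne1 = 70_000, 40_000      # large-tensor dimensions
nbytes = ne0 * ne1             # correct 64-bit result: 2_800_000_000
assert nbytes > INT32_MAX      # exceeds INT_MAX, so 32-bit math is unsafe
assert mul_i32(ne0, ne1) == nbytes - 2**32  # wraps to a negative value
```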
Sigbjørn Skjæret
169ee68ffb
model : remove modern-bert iswa template (#18529)
* remove modern-bert iswa template

* forgotten
2026-01-02 00:06:42 +01:00
tt
ced765be44
model: support youtu-vl model (#18479)
* Support Youtu-VL Model

* merge code

* fix bug

* revert qwen2 code & support rsplit in minja.hpp

* update warm info

* fix annotation

* u

* revert minja.hpp

* fix

* Do not write routed_scaling_factor to gguf when routed_scaling_factor is None

* fix expert_weights_scale

* LGTM after whitespace fixes

* fix

* fix

* fix

* layers to layer_index

* enum fix

---------

Co-authored-by: Xuan-Son Nguyen <son@huggingface.co>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-01-01 19:25:54 +01:00
Piotr Wilkin (ilintar)
3ccccc83f7
Add conversion support for IQuestCoderForCausalLM (#18524) 2026-01-01 18:45:55 +01:00
o7si
d0a6a31470
model : add support for JinaBertModel with non-gated ffn (#18475)
* WIP: Initial commit for fixing JinaBert original FF type support

* convert: add jina-v2-de tokenizer variant for German_Semantic_V3

* convert: fix token collision in BERT phantom vocab conversion

* convert: add feed_forward_type metadata

* model: add feed_forward_type metadata for jina-bert-v2

* model: jina-bert-v2 support standard GELU FFN variant

* model: remove ffn_type, detect FFN variant from tensor dimensions

* Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update src/models/bert.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update src/models/bert.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* revert collision fix to be handled in separate PR

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-01-01 18:38:51 +01:00
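The gated/non-gated distinction these commits handle: a standard GELU FFN computes `down(gelu(up(x)))`, while a gated (GEGLU-style) FFN multiplies in a separate gate projection, `down(gelu(gate(x)) * up(x))`. A scalar toy sketch of the two variants (single numbers stand in for the projection matrices; this is not the llama.cpp graph code):

```python
import math

def gelu(x):
    # Exact GELU via the error function.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def ffn_plain(x, w_up, w_down):
    # Non-gated variant: down(gelu(up(x)))
    return gelu(x * w_up) * w_down

def ffn_gated(x, w_gate, w_up, w_down):
    # Gated (GEGLU-style) variant: down(gelu(gate(x)) * up(x))
    return gelu(x * w_gate) * (x * w_up) * w_down

assert abs(ffn_plain(1.0, 1.0, 2.0) - 2.0 * gelu(1.0)) < 1e-12
```

Detecting which variant a checkpoint uses from tensor shapes (as the last refactor above does) works because the gated form carries an extra gate projection of the same shape as the up projection.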
o7si
2b2afade9f
convert : fix encoding of WPM vocab for BERT models (#18500)
* convert: avoid token collision when stripping ## prefix

* convert: use token types for BERT special tokens check

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-01-01 18:27:07 +01:00
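The collision this fix avoids: WordPiece vocabularies mark continuation pieces with a `##` prefix, so naively stripping it can map two distinct tokens onto the same string. A toy illustration of the failure mode (not the converter's actual code):

```python
def strip_wpm_prefix(vocab):
    # Buggy conversion: dropping the WordPiece "##" continuation marker
    # can collide with an existing standalone token.
    out = {}
    for tok, idx in vocab.items():
        out[tok.removeprefix("##")] = idx  # the later entry silently wins
    return out

vocab = {"ing": 7, "##ing": 42}
stripped = strip_wpm_prefix(vocab)
assert len(vocab) == 2 and len(stripped) == 1  # one token id was lost
```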
HelloKS
f4f5019254
model: add Solar Open model (#18511)
* model: add Solar-Open model

* vocab: add solar-open to end eog blacklist

* model: add proper llm type

* chat: basic template for solar open

* typo: fix comment about vocab

* convert: suggested changes

* convert: suggested changes

* chat: change reasoning end tag for solar-open

* llama-chat: add solar-open template
2026-01-01 18:01:43 +01:00
Concedo
bfa2ae7744 fixed smartcache bug when used with images 2026-01-02 00:35:05 +08:00
Concedo
774841ffd6 clear the images array from kcpp chat completions 2026-01-01 22:51:00 +08:00
Concedo
51edb6ae61 allow clip fa for anything besides cuda on gpu 2026-01-01 21:09:51 +08:00
Anri Lombard
d5574c919c
webui: fix code copy stripping XML/HTML tags (#18518)
* webui: fix code copy stripping XML/HTML tags

* webui: update static build
2026-01-01 13:44:11 +01:00
Aman Gupta
26831bded9
ggml-cuda: remove unnecessary prints on ggml_cuda_init (#18502) 2026-01-01 19:18:43 +08:00
Concedo
442fa7cd7c support for circular textures in sdcpp 2026-01-01 16:34:09 +08:00
Jeff Bolz
be47fb9285
vulkan: extend topk_moe to handle sigmoid w/exp_probs_b for nemotron (#18295)
* vulkan: extend topk_moe to handle sigmoid w/exp_probs_b for nemotron

Also handle GGML_OP_SCALE at the end (nemotron, deepseek2).

Fewer pipeline variants and spec constants, just use push constants.

In test_topk_moe, change exp_probs_b to be 1D, matching real networks.

Update test-backend-ops and ggml-backend to allow verifying multiple outputs
in a fusion test (topk_moe has two outputs). Previously only the final node
was verified.

* change test_topk_moe to allow results in arbitrary order

* disable sigmoid fusion for moltenvk
2026-01-01 08:58:27 +01:00
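The sigmoid-with-bias routing that the fused `topk_moe` path above covers works, in DeepSeek-V3-style routers, by letting the bias (`exp_probs_b`) influence only which experts are *selected*, while the mixing weights come from the unbiased sigmoid scores, normalized and then scaled (the trailing `GGML_OP_SCALE`). A reference sketch of that routing, not the Vulkan shader:

```python
import math

def topk_moe_sigmoid(logits, bias, k, scale=1.0):
    gates = [1.0 / (1.0 + math.exp(-x)) for x in logits]
    # Bias is added for expert *selection* only.
    order = sorted(range(len(gates)),
                   key=lambda i: gates[i] + bias[i], reverse=True)
    top = order[:k]
    # Weights use the unbiased gates, normalized, then scaled.
    total = sum(gates[i] for i in top)
    return {i: scale * gates[i] / total for i in top}

w = topk_moe_sigmoid([0.0, 1.0, -1.0, 2.0], [0.0] * 4, k=2)
assert set(w) == {1, 3}                   # two highest-scoring experts
assert abs(sum(w.values()) - 1.0) < 1e-9  # weights normalized
```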
Concedo
54e419f587 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.github/workflows/docker.yml
#	docs/ops.md
#	docs/ops/Metal.csv
#	ggml/CMakeLists.txt
#	ggml/src/ggml-sycl/CMakeLists.txt
#	grammars/README.md
#	models/templates/llama-cpp-deepseek-r1.jinja
#	scripts/sync-ggml.last
#	tests/test-chat.cpp
2026-01-01 15:34:10 +08:00