Concedo
c9308570b2
added mcp to list of capabilities, allow it to run standalone
2026-01-05 20:32:25 +08:00
Concedo
b762036388
indicate unofficial builds
2026-01-05 16:12:54 +08:00
Concedo
301a04adfc
Merge branch 'concedo' into concedo_experimental
2026-01-05 15:24:43 +08:00
Concedo
9a4eeafbfc
hotfix 1.105.3
2026-01-05 15:24:21 +08:00
Concedo
ad6c53aeff
Merge commit '908a9e5a1e' into concedo
2026-01-05 15:01:49 +08:00
Concedo
4d3866a016
mcp proxy is done
2026-01-05 12:24:43 +08:00
Aman Gupta
908a9e5a1e
CUDA: disable cuda graph when using n-cpu-moe (#18593)
...
* CUDA: disable cuda graph when using n-cpu-moe
* call ggml_cuda_set_device
2026-01-05 01:37:48 +08:00
Aman Gupta
5126c41c1c
ggml-cuda: remove unused params in ggml_cuda_graph (#18579)
2026-01-05 01:37:09 +08:00
Concedo
91089ad1bd
wip on mcp
2026-01-04 22:52:47 +08:00
Concedo
a82c89b065
minimax template
2026-01-04 20:51:16 +08:00
Concedo
acfc1e56d2
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# tests/test-regex-partial.cpp
2026-01-04 11:14:33 +08:00
Concedo
01c70a7d3d
allow transcribe to be used with the LLM instead if no whisper model exists
2026-01-04 11:06:05 +08:00
Aldehir Rojas
cef1d23c5a
common/grammar : replace problematic backtracking regex [\s\S]* (#18342)
...
* grammar : add support for std::regex_search() with trigger patterns
* common : update hermes2 pro trigger to search instead of match
* common : use regex_search with anchoring for partial matching
* common : adjust regex partial tests to use new pattern
* grammar : check pattern directly instead of adding a type
* common : adjust existing patterns to match new semantics
2026-01-03 16:02:43 -06:00
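The match-to-search change described in the bullets above can be illustrated with a minimal Python sketch (illustrative only: the actual change is in the project's C++ grammar code, and the `<tool_call>` trigger string here is just a stand-in example):

```python
import re

# A long model output with the trigger somewhere in the middle.
text = "some model output ... " * 50 + "<tool_call>{}</tool_call>"

# Old approach: anchor at the start and swallow the prefix with a
# [\s\S]* wildcard, which the engine must backtrack through to find
# the trigger.
old = re.match(r"[\s\S]*(<tool_call>)", text)

# New approach: search for the trigger pattern directly, with no
# wildcard prefix to backtrack over.
new = re.search(r"<tool_call>", text)

assert old is not None and new is not None
assert old.group(1) == new.group(0) == "<tool_call>"
```

Both forms locate the same trigger; searching simply avoids paying for the wildcard prefix on every attempt, which is what made the original pattern problematic in `std::regex`.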
Georgi Gerganov
c69c7ebc90
graph : fix graph reuse logic when n_pos_per_embd > 1 (#18566)
2026-01-03 23:59:06 +02:00
Concedo
04f5445bef
fix for macos asserting on exit
2026-01-03 23:26:04 +08:00
Aman Gupta
e57f52334b
ggml-cuda: fixes for concurrent streams (#18496)
2026-01-03 23:15:01 +08:00
Concedo
5a505cbc62
disable blackwell mma for now
2026-01-03 22:45:06 +08:00
Georgi Gerganov
a554a1ecc7
context : fix reserve token padding to n_seqs (#18536)
2026-01-03 15:45:34 +02:00
Johannes Gäßler
0f2e42ca1d
CUDA: only allocate FA tmp buffer if needed (#18564)
2026-01-03 13:55:53 +01:00
pl752
9dba9f5352
(Bugfix, ggml-cuda) Pool alloc count fix + small size computation type adjustment (#18559)
...
* CUDA: Fixed obj byte size instead of obj count being passed to pool alloc (fattn-common, dst_tmp_meta)
* CUDA: Explicitly cast some of the int alloc counts before multiplication in argsort
---------
Co-authored-by: pl752 <maximpl752@gmail.com>
2026-01-03 11:13:40 +01:00
Concedo
e4abf643fa
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# ggml/src/ggml-hexagon/htp/act-ops.c
# ggml/src/ggml-rpc/ggml-rpc.cpp
# src/CMakeLists.txt
# src/llama-vocab.cpp
2026-01-03 15:37:30 +08:00
Wagner Bruna
0ef55844d3
sd: sync to master-453-4ff2c8c (#1907)
2026-01-03 15:28:27 +08:00
Shouyu
bcfc8c3cec
ggml-hexagon: optimize activation function (#18393)
...
* refactor: refactor silu
* refactor: optimize swiglu
* refactor: remove unnecessary if in swiglu
* refactor: refactor swiglu_oai
* chore: fix formatting issue
2026-01-02 21:24:24 -08:00
Jeff Bolz
18ddaea2ae
vulkan: Optimize GGML_OP_CUMSUM (#18417)
...
* vulkan: Optimize GGML_OP_CUMSUM
There are two paths: The preexisting one that does a whole row per workgroup
in a single shader, and one that splits each row into multiple blocks and does
two passes. The first pass computes partials within a block, the second adds
the block partials to compute the final result. The multipass shader is used
when there are a small number of large rows.
In the whole-row shader, handle multiple elements per invocation.
* use 2 ELEM_PER_THREAD for AMD/Intel
* address feedback
2026-01-02 15:32:30 -06:00
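The two-pass blocked scheme described in this commit body can be sketched in plain Python (illustrative only: the real implementation is a Vulkan compute shader, and `block_size` here is an arbitrary stand-in for the per-workgroup block size):

```python
def cumsum_two_pass(row, block_size):
    """Inclusive prefix sum of one row via the two-pass block scheme."""
    # Pass 1: compute inclusive prefix sums within each block,
    # and record each block's total (its "block partial").
    blocks = [row[i:i + block_size] for i in range(0, len(row), block_size)]
    partials, block_totals = [], []
    for b in blocks:
        acc, out = 0, []
        for x in b:
            acc += x
            out.append(acc)
        partials.append(out)
        block_totals.append(acc)

    # Pass 2: add the running sum of the preceding block totals
    # to every element of each block to get the final result.
    result, offset = [], 0
    for out, total in zip(partials, block_totals):
        result.extend(v + offset for v in out)
        offset += total
    return result
```

In the shader both passes run in parallel across blocks, which is why the multipass path pays off when there are a few very large rows rather than many small ones.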
Jeff Bolz
706e3f93a6
vulkan: Implement mmvq for iq1_s/iq1_m (#18450)
2026-01-02 20:19:04 +01:00
Prabod
5755e52d15
model : Maincoder-1B support (#18534)
...
* Add Maincoder model support
* Removed SPM model vocabulary setting and MOE related GGUF parameters
Removed trailing spaces from maincoder.cpp
* removed set_vocab
* added new line
* Fix formatting
* Add a new line for PEP8
2026-01-02 20:11:59 +01:00
Georgi Gerganov
f38de16341
metal : adjust extra size for FA buffer to avoid reallocations (#18545)
2026-01-02 19:02:18 +02:00
Georgi Gerganov
af1e8e1a6c
graph : reduce topology branching (#18548)
2026-01-02 19:01:56 +02:00
Concedo
77082dddfb
mcp image handling
2026-01-03 00:03:05 +08:00
Georgi Gerganov
d84a6a98be
vocab : reduce debug logs about non-EOG control tokens (#18541)
...
* vocab : reduce debug logs about non-EOG control tokens
* cont : add comment
2026-01-02 16:17:33 +02:00
Concedo
107def07c8
updated lite and sdui (+1 squashed commits)
...
Squashed commits:
[3172b5d19] updated lite (+1 squashed commits)
Squashed commits:
[45081b0e2] updated glm nothink template
2026-01-02 18:11:32 +08:00
Chris Rohlf
c6f0e832da
rpc : use unordered_map::reserve and emplace (#18513)
2026-01-02 12:09:36 +02:00
Concedo
d8942cde14
smartcache allow custom number of slots
2026-01-02 17:19:40 +08:00
Concedo
7e1ae49e7d
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# ggml/src/ggml-cuda/ggml-cuda.cu
# tests/test-backend-ops.cpp
# tools/mtmd/CMakeLists.txt
2026-01-02 11:05:20 +08:00
Concedo
0a23388e7d
added images in tool call queries
2026-01-02 10:48:34 +08:00
MeeMin
e86f3c2221
cuda : fix copy of large tensors (ggml_nbytes <= INT_MAX assertion) (#18433)
...
* ggml-cuda: fixed assertion in ggml_cuda_cpy (#18140)
* ggml-cuda: changes in data types to int64_t
* ggml-cuda: added asserts for CUDA block numbers
* ggml-cuda: changed the condition for y and z dimension
2026-01-02 00:24:20 +01:00
Sigbjørn Skjæret
169ee68ffb
model : remove modern-bert iswa template (#18529)
...
* remove modern-bert iswa template
* forgotten
2026-01-02 00:06:42 +01:00
tt
ced765be44
model: support youtu-vl model (#18479)
...
* Support Youtu-VL Model
* merge code
* fix bug
* revert qwen2 code & support rsplit in minja.hpp
* update warm info
* fix annotation
* u
* revert minja.hpp
* fix
* Do not write routed_scaling_factor to gguf when routed_scaling_factor is None
* fix expert_weights_scale
* LGTM after whitespace fixes
* fix
* fix
* fix
* layers to layer_index
* enum fix
---------
Co-authored-by: Xuan-Son Nguyen <son@huggingface.co>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-01-01 19:25:54 +01:00
Piotr Wilkin (ilintar)
3ccccc83f7
Add conversion support for IQuestCoderForCausalLM (#18524)
2026-01-01 18:45:55 +01:00
o7si
d0a6a31470
model : add support for JinaBertModel with non-gated ffn (#18475)
...
* WIP: Initial commit for fixing JinaBert original FF type support
* convert: add jina-v2-de tokenizer variant for German_Semantic_V3
* convert: fix token collision in BERT phantom vocab conversion
* convert: add feed_forward_type metadata
* model: add feed_forward_type metadata for jina-bert-v2
* model: jina-bert-v2 support standard GELU FFN variant
* model: remove ffn_type, detect FFN variant from tensor dimensions
* Update src/llama-model.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update src/llama-model.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update src/models/bert.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update src/models/bert.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* revert collision fix to be handled in separate PR
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-01-01 18:38:51 +01:00
o7si
2b2afade9f
convert : fix encoding of WPM vocab for BERT models (#18500)
...
* convert: avoid token collision when stripping ## prefix
* convert: use token types for BERT special tokens check
* Update convert_hf_to_gguf.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-01-01 18:27:07 +01:00
HelloKS
f4f5019254
model: add Solar Open model (#18511)
...
* model: add Solar-Open model
* vocab: add solar-open to end eog blacklist
* model: add proper llm type
* chat: basic template for solar open
* typo: fix comment about vocab
* convert: suggested changes
* convert: suggested changes
* chat: change reasoning end tag for solar-open
* llama-chat: add solar-open template
2026-01-01 18:01:43 +01:00
Concedo
bfa2ae7744
fixed smartcache bug when used with images
2026-01-02 00:35:05 +08:00
Concedo
774841ffd6
clear the images array from kcpp chat completions
2026-01-01 22:51:00 +08:00
Concedo
51edb6ae61
allow clip fa for anything besides cuda on gpu
2026-01-01 21:09:51 +08:00
Anri Lombard
d5574c919c
webui: fix code copy stripping XML/HTML tags (#18518)
...
* webui: fix code copy stripping XML/HTML tags
* webui: update static build
2026-01-01 13:44:11 +01:00
Aman Gupta
26831bded9
ggml-cuda: remove unnecessary prints on ggml_cuda_init (#18502)
2026-01-01 19:18:43 +08:00
Concedo
442fa7cd7c
support for circular textures in sdcpp
2026-01-01 16:34:09 +08:00
Jeff Bolz
be47fb9285
vulkan: extend topk_moe to handle sigmoid w/exp_probs_b for nemotron (#18295)
...
* vulkan: extend topk_moe to handle sigmoid w/exp_probs_b for nemotron
Also handle GGML_OP_SCALE at the end (nemotron, deepseek2).
Fewer pipeline variants and spec constants, just use push constants.
In test_topk_moe, change exp_probs_b to be 1D, matching real networks.
Update test-backend-ops and ggml-backend to allow verifying multiple outputs
in a fusion test (topk_moe has two outputs). Previously only the final node
was verified.
* change test_topk_moe to allow results in arbitrary order
* disable sigmoid fusion for moltenvk
2026-01-01 08:58:27 +01:00
Concedo
54e419f587
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .github/workflows/docker.yml
# docs/ops.md
# docs/ops/Metal.csv
# ggml/CMakeLists.txt
# ggml/src/ggml-sycl/CMakeLists.txt
# grammars/README.md
# models/templates/llama-cpp-deepseek-r1.jinja
# scripts/sync-ggml.last
# tests/test-chat.cpp
2026-01-01 15:34:10 +08:00