Commit graph

11123 commits

Author SHA1 Message Date
Concedo
9a4eeafbfc hotfix 1.105.3 2026-01-05 15:24:21 +08:00
Concedo
ad6c53aeff Merge commit '908a9e5a1e' into concedo 2026-01-05 15:01:49 +08:00
Aman Gupta
908a9e5a1e
CUDA: disable cuda graph when using n-cpu-moe (#18593)
* CUDA: disable cuda graph when using n-cpu-moe

* call ggml_cuda_set_device
2026-01-05 01:37:48 +08:00
Aman Gupta
5126c41c1c
ggml-cuda: remove unused params in ggml_cuda_graph (#18579) 2026-01-05 01:37:09 +08:00
Concedo
acfc1e56d2 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	tests/test-regex-partial.cpp
2026-01-04 11:14:33 +08:00
Concedo
01c70a7d3d allow transcribe to be used with the LLM instead if no whisper model exists 2026-01-04 11:06:05 +08:00
Aldehir Rojas
cef1d23c5a
common/grammar : replace problematic backtracking regex [\s\S]* (#18342)
* grammar : add support for std::regex_search() with trigger patterns

* common : update hermes2 pro trigger to search instead of match

* common : use regex_search with anchoring for partial matching

* common : adjust regex partial tests to use new pattern

* grammar : check pattern directly instead of adding a type

* common : adjust existing patterns to match new semantics
2026-01-03 16:02:43 -06:00
Georgi Gerganov
c69c7ebc90
graph : fix graph reuse logic when n_pos_per_embd > 1 (#18566) 2026-01-03 23:59:06 +02:00
Concedo
04f5445bef fix for macos asserting on exit 2026-01-03 23:26:04 +08:00
Aman Gupta
e57f52334b
ggml-cuda: fixes for concurrent streams (#18496) 2026-01-03 23:15:01 +08:00
Concedo
5a505cbc62 disable blackwell mma for now 2026-01-03 22:45:06 +08:00
Georgi Gerganov
a554a1ecc7
context : fix reserve token padding to n_seqs (#18536) 2026-01-03 15:45:34 +02:00
Johannes Gäßler
0f2e42ca1d
CUDA: only allocate FA tmp buffer if needed (#18564) 2026-01-03 13:55:53 +01:00
pl752
9dba9f5352
(Bugfix, ggml-cuda) Pool alloc count fix + small size computation type adjustment (#18559)
* CUDA: Fixed obj byte size instead of obj count being passed to pool alloc (fattn-common, dst_tmp_meta)

* CUDA: Explicitly casted some of the int alloc counts before multiplication in argsort

---------

Co-authored-by: pl752 <maximpl752@gmail.com>
2026-01-03 11:13:40 +01:00
Concedo
e4abf643fa Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	ggml/src/ggml-hexagon/htp/act-ops.c
#	ggml/src/ggml-rpc/ggml-rpc.cpp
#	src/CMakeLists.txt
#	src/llama-vocab.cpp
2026-01-03 15:37:30 +08:00
Wagner Bruna
0ef55844d3
sd: sync to master-453-4ff2c8c (#1907) 2026-01-03 15:28:27 +08:00
Shouyu
bcfc8c3cec
ggml-hexagon: optimize activation function (#18393)
* refactor: refactor silu

* refactor: optimize swiglu

* refactor: remove unnecessary if in swiglu

* refactor: refactor swiglu_oai

* chore: fix formatting issue
2026-01-02 21:24:24 -08:00
Jeff Bolz
18ddaea2ae
vulkan: Optimize GGML_OP_CUMSUM (#18417)
* vulkan: Optimize GGML_OP_CUMSUM

There are two paths: The preexisting one that does a whole row per workgroup
in a single shader, and one that splits each row into multiple blocks and does
two passes. The first pass computes partials within a block, the second adds
the block partials to compute the final result. The multipass shader is used
when there are a small number of large rows.

In the whole-row shader, handle multiple elements per invocation.

* use 2 ELEM_PER_THREAD for AMD/Intel

* address feedback
2026-01-02 15:32:30 -06:00
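The two-pass path described in the commit message above (#18417) can be illustrated with a small CPU-side sketch. This is not the Vulkan shader; the function name `cumsum_two_pass` and the `block_size` parameter are hypothetical and chosen only for illustration. It assumes an inclusive prefix sum over a single row: the first pass scans each block independently, and the second pass adds the running total of all preceding blocks to each partial.

```python
from itertools import accumulate


def cumsum_two_pass(row, block_size):
    """Inclusive prefix sum of `row`, computed block by block in two passes.

    Hypothetical illustration only; names and block size are assumptions,
    not taken from the Vulkan implementation.
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")

    # Pass 1: each block computes its own inclusive scan, independent of the others.
    partials = [
        list(accumulate(row[start:start + block_size]))
        for start in range(0, len(row), block_size)
    ]

    # Pass 2: add the sum of all preceding blocks to every element of a block.
    result = []
    offset = 0
    for block in partials:
        result.extend(offset + value for value in block)
        offset += block[-1]  # the last partial of a block is that block's total
    return result


if __name__ == "__main__":
    print(cumsum_two_pass([1, 2, 3, 4, 5], block_size=2))
```

On a GPU the first pass can run for all blocks in parallel, which is presumably why the commit uses this path when there are few, large rows; in the sketch both passes are sequential for clarity.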
Jeff Bolz
706e3f93a6
vulkan: Implement mmvq for iq1_s/iq1_m (#18450) 2026-01-02 20:19:04 +01:00
Prabod
5755e52d15
model : Maincoder-1B support (#18534)
* Add Maincoder model support

* Removed SPM model vocabulary setting and MOE related GGUF parameters
Removed trailing spaces from maincoder.cpp

* removed set_vocab

* added new line

* Fix formatting

* Add a new line for PEP8
2026-01-02 20:11:59 +01:00
Georgi Gerganov
f38de16341
metal : adjust extra size for FA buffer to avoid reallocations (#18545) 2026-01-02 19:02:18 +02:00
Georgi Gerganov
af1e8e1a6c
graph : reduce topology branching (#18548) 2026-01-02 19:01:56 +02:00
Concedo
77082dddfb mcp image handling 2026-01-03 00:03:05 +08:00
Georgi Gerganov
d84a6a98be
vocab : reduce debug logs about non-EOG control tokens (#18541)
* vocab : reduce debug logs about non-EOG control tokens

* cont : add comment
2026-01-02 16:17:33 +02:00
Concedo
107def07c8 updated lite and sdui (+1 squashed commits)
Squashed commits:

[3172b5d19] updated lite (+1 squashed commits)

Squashed commits:

[45081b0e2] updated glm nothink template
2026-01-02 18:11:32 +08:00
Chris Rohlf
c6f0e832da
rpc : use unordered_map::reserve and emplace (#18513) 2026-01-02 12:09:36 +02:00
Concedo
d8942cde14 smartcache allow custom number of slots 2026-01-02 17:19:40 +08:00
Concedo
7e1ae49e7d Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	ggml/src/ggml-cuda/ggml-cuda.cu
#	tests/test-backend-ops.cpp
#	tools/mtmd/CMakeLists.txt
2026-01-02 11:05:20 +08:00
Concedo
0a23388e7d added images in tool call queries 2026-01-02 10:48:34 +08:00
MeeMin
e86f3c2221
cuda : fix copy of large tensors (ggml_nbytes <= INT_MAX assertion) (#18433)
* ggml-cuda: fixed assertion in ggml_cuda_cpy (#18140)

* ggml-cuda: changes in data types to int64_t

* ggml-cuda: added asserts for CUDA block numbers

* ggml-cuda: changed the condition for y and z dimension
2026-01-02 00:24:20 +01:00
Sigbjørn Skjæret
169ee68ffb
model : remove modern-bert iswa template (#18529)
* remove modern-bert iswa template

* forgotten
2026-01-02 00:06:42 +01:00
tt
ced765be44
model: support youtu-vl model (#18479)
* Support Youtu-VL Model

* merge code

* fix bug

* revert qwen2 code & support rsplit in minja.hpp

* update warm info

* fix annotation

* u

* revert minja.hpp

* fix

* Do not write routed_scaling_factor to gguf when routed_scaling_factor is None

* fix expert_weights_scale

* LGTM after whitespace fixes

* fix

* fix

* fix

* layers to layer_index

* enum fix

---------

Co-authored-by: Xuan-Son Nguyen <son@huggingface.co>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-01-01 19:25:54 +01:00
Piotr Wilkin (ilintar)
3ccccc83f7
Add conversion support for IQuestCoderForCausalLM (#18524) 2026-01-01 18:45:55 +01:00
o7si
d0a6a31470
model : add support for JinaBertModel with non-gated ffn (#18475)
* WIP: Initial commit for fixing JinaBert original FF type support

* convert: add jina-v2-de tokenizer variant for German_Semantic_V3

* convert: fix token collision in BERT phantom vocab conversion

* convert: add feed_forward_type metadata

* model: add feed_forward_type metadata for jina-bert-v2

* model: jina-bert-v2 support standard GELU FFN variant

* model: remove ffn_type, detect FFN variant from tensor dimensions

* Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update src/models/bert.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update src/models/bert.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* revert collision fix to be handled in separate PR

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-01-01 18:38:51 +01:00
o7si
2b2afade9f
convert : fix encoding of WPM vocab for BERT models (#18500)
* convert: avoid token collision when stripping ## prefix

* convert: use token types for BERT special tokens check

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-01-01 18:27:07 +01:00
HelloKS
f4f5019254
model: add Solar Open model (#18511)
* model: add Solar-Open model

* vocab: add solar-open to end eog blacklist

* model: add proper llm type

* chat: basic template for solar open

* typo: fix comment about vocab

* convert: sugested changes

* convert: suggested changes

* chat: change reasoning end tag for solar-open

* llama-chat: add solar-open template
2026-01-01 18:01:43 +01:00
Concedo
bfa2ae7744 fixed smartcache bug when used with images 2026-01-02 00:35:05 +08:00
Concedo
774841ffd6 clear the images array from kcpp chat completions 2026-01-01 22:51:00 +08:00
Concedo
51edb6ae61 allow clip fa for anything besides cuda on gpu 2026-01-01 21:09:51 +08:00
Anri Lombard
d5574c919c
webui: fix code copy stripping XML/HTML tags (#18518)
* webui: fix code copy stripping XML/HTML tags

* webui: update static build
2026-01-01 13:44:11 +01:00
Aman Gupta
26831bded9
ggml-cuda: remove unnecessary prints on ggml_cuda_init (#18502) 2026-01-01 19:18:43 +08:00
Concedo
442fa7cd7c support for circular textures in sdcpp 2026-01-01 16:34:09 +08:00
Jeff Bolz
be47fb9285
vulkan: extend topk_moe to handle sigmoid w/exp_probs_b for nemotron (#18295)
* vulkan: extend topk_moe to handle sigmoid w/exp_probs_b for nemotron

Also handle GGML_OP_SCALE at the end (nemotron, deepseek2).

Fewer pipeline variants and spec constants, just use push constants.

In test_topk_moe, change exp_probs_b to be 1D, matching real networks.

Update test-backend-ops and ggml-backend to allow verifying multiple outputs
in a fusion test (topk_moe has two outputs). Previously only the final node
was verified.

* change test_topk_moe to allow results in arbitrary order

* disable sigmoid fusion for moltenvk
2026-01-01 08:58:27 +01:00
Concedo
54e419f587 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.github/workflows/docker.yml
#	docs/ops.md
#	docs/ops/Metal.csv
#	ggml/CMakeLists.txt
#	ggml/src/ggml-sycl/CMakeLists.txt
#	grammars/README.md
#	models/templates/llama-cpp-deepseek-r1.jinja
#	scripts/sync-ggml.last
#	tests/test-chat.cpp
2026-01-01 15:34:10 +08:00
Concedo
66ccf8f6b8 Merge commit 'f14f4e421b' into concedo_experimental
# Conflicts:
#	.github/workflows/docker.yml
#	AGENTS.md
#	CONTRIBUTING.md
#	docs/build.md
#	examples/llama.android/app/build.gradle.kts
#	examples/llama.android/app/src/main/java/com/example/llama/MainActivity.kt
#	examples/llama.android/app/src/main/res/layout/activity_main.xml
#	examples/llama.android/gradle/libs.versions.toml
#	examples/llama.android/lib/src/main/cpp/ai_chat.cpp
#	examples/llama.android/lib/src/main/java/com/arm/aichat/InferenceEngine.kt
#	examples/llama.android/lib/src/main/java/com/arm/aichat/internal/InferenceEngineImpl.kt
#	examples/model-conversion/scripts/causal/compare-embeddings-logits.sh
#	examples/model-conversion/scripts/embedding/run-original-model.py
#	examples/retrieval/retrieval.cpp
#	ggml/src/CMakeLists.txt
#	ggml/src/ggml-cpu/CMakeLists.txt
#	ggml/src/ggml-cpu/kleidiai/kernels.cpp
#	ggml/src/ggml-cpu/kleidiai/kleidiai.cpp
#	ggml/src/ggml-cuda/CMakeLists.txt
#	ggml/src/ggml-cuda/mmq.cu
#	ggml/src/ggml-cuda/mmq.cuh
#	src/CMakeLists.txt
#	tools/llama-bench/llama-bench.cpp
#	tools/server/CMakeLists.txt
2026-01-01 15:20:56 +08:00
triplenom
9e10bd2eaf
llama: handle short reads in direct I/O path (#18504) 2026-01-01 10:24:43 +08:00
Anri Lombard
4cd162a123
chat: make tool description and parameters optional per OpenAI spec (#18478)
* chat: make tool description and parameters optional per OpenAI spec

Per the OpenAI API specification, both 'description' and 'parameters'
fields in tool function definitions are optional. Previously, the parser
would throw an exception if these fields were missing.

Attempts to fix #17667

* refactor: use value() for cleaner optional field access
2025-12-31 17:21:37 -06:00
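The behaviour described in the commit message above (#18478) can be sketched as follows. This is a Python analogue, not the project's C++ parser: the function name `parse_tool` and the chosen defaults (an empty string and an empty object) are assumptions made for illustration. The point is only that `description` and `parameters` are read with a fallback instead of raising when absent.

```python
def parse_tool(tool: dict) -> dict:
    """Extract a tool's function definition, treating optional fields leniently.

    Hypothetical sketch: the defaults below are assumptions, not the
    values used by the actual parser.
    """
    function = tool["function"]  # assumed to be present in this sketch
    return {
        "name": function["name"],  # assumed to be present in this sketch
        # Optional fields: fall back to a default rather than raising.
        "description": function.get("description", ""),
        "parameters": function.get("parameters", {}),
    }


if __name__ == "__main__":
    minimal = {"type": "function", "function": {"name": "get_time"}}
    print(parse_tool(minimal))
```

A lookup with a default mirrors the "use value() for cleaner optional field access" refactor in spirit: a single call expresses both the lookup and the fallback.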
Concedo
03df0c40f3 if gendefaults is set, horde has debug flag 2026-01-01 00:54:57 +08:00
Georgi Gerganov
13814eb370 sync : ggml 2025-12-31 18:54:43 +02:00
Georgi Gerganov
54f67b9b66 ggml : bump version to 0.9.5 (ggml/1410) 2025-12-31 18:54:43 +02:00