Commit graph

10692 commits

Author SHA1 Message Date
Sachin Desai
3db4da56a5
chat : support Granite model reasoning and tool call (#14864) 2025-08-06 20:27:30 +02:00
Juk Armstrong
476aa3fd57
Fixed name -override-tensors to -override-tensor (#15129) 2025-08-06 17:28:48 +01:00
Diego Devesa
0d8831543c
ggml : fix fallback to CPU for unsupported ops (#15118) 2025-08-06 14:37:35 +02:00
Sigbjørn Skjæret
65c797c4fa
chat : fix yandex chat template (#15116) 2025-08-06 13:26:49 +02:00
stevenkuang
25726898e8
chat : fix hunyuan auto-detection (#15114)
Signed-off-by: stevenkuang <stevenkuang@tencent.com>
2025-08-06 11:48:30 +02:00
Concedo
61c19fea56 fixed glm4 sop, lower regex max stacks (+2 squashed commit)
Squashed commit:

[47e39ae5d] lower regex max stack again

[0a32ca232] lower regex max stack again
2025-08-06 17:10:57 +08:00
Chenguang Li
2241453252
CANN: add support for ACL Graph (#15065)
* feat(cann): add optional support for ACL Graph execution

This commit adds support for executing ggml computational graphs using
Huawei's ACL graph mode via the USE_CANN_GRAPH flag. The support can be
enabled at compile time using the CMake option:

    -DUSE_CANN_GRAPH=ON

By default, ACL graph execution is **disabled**, and the fallback path
uses node-by-node execution.

Key additions:
- CMake option USE_CANN_GRAPH to toggle graph mode
- Graph capture and execution logic using the ACL graph API
- Tensor property matching to determine whether graph update is required
- Safe fallback and logging if the environment variable LLAMA_SET_ROWS
  is unset or invalid

This prepares the backend for performance improvements in repetitive graph
execution scenarios on Ascend devices.

Signed-off-by: noemotiovon <757486878@qq.com>

* Fix review comments

Signed-off-by: noemotiovon <757486878@qq.com>

* rename USE_CANN_GRAPH to USE_ACL_GRAPH

Signed-off-by: noemotiovon <757486878@qq.com>

* fix typo

Signed-off-by: noemotiovon <757486878@qq.com>

---------

Signed-off-by: noemotiovon <757486878@qq.com>
2025-08-06 14:12:42 +08:00
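A hedged sketch of the build step this commit describes, assuming the post-rename USE_ACL_GRAPH option and the standard GGML_CANN backend flag:

```console
# enable the CANN backend and the optional ACL graph execution path
# (introduced as USE_CANN_GRAPH, renamed to USE_ACL_GRAPH within this PR)
cmake -B build -DGGML_CANN=ON -DUSE_ACL_GRAPH=ON
cmake --build build --config Release
```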
Concedo
34487d3c02 gpt oss harmony template 2025-08-06 11:39:40 +08:00
Concedo
6eea7b88d2 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.github/workflows/build.yml
#	README.md
#	ggml/src/ggml-cann/ggml-cann.cpp
#	ggml/src/ggml-opencl/ggml-opencl.cpp
#	ggml/src/ggml-sycl/ggml-sycl.cpp
#	tests/test-backend-ops.cpp
#	tests/test-chat-template.cpp
2025-08-06 10:51:29 +08:00
Reese Levine
9515c6131a
ggml: WebGPU disable SET_ROWS for now (#15078)
* Add parameter buffer pool, batching of submissions, refactor command building/submission

* Add header for linux builds

* Free staged parameter buffers at once

* Format with clang-format

* Fix thread-safe implementation

* Use device implicit synchronization

* Update workflow to use custom release

* Remove testing branch workflow

* Disable set_rows until it's implemented

* Fix potential issue around empty queue submission

* Try synchronous submission

* Try waiting on all futures explicitly

* Add debug

* Add more debug messages

* Work on getting ssh access for debugging

* Debug on failure

* Disable other tests

* Remove extra if

* Try more locking

* maybe passes?

* test

* Some cleanups

* Restore build file

* Remove extra testing branch ci
2025-08-05 16:26:38 -07:00
Georgi Gerganov
fd1234cb46
llama : add gpt-oss (#15091)
* oai moe

* compat with new checkpoint

* add attn sink impl

* add rope scaling yarn

* logits match with latest transformers code

* wip chat template

* rm trailing space

* use ggml_scale_bias

* rm redundant is_swa_all

* convert interleaved gate_up

* graph : fix activation function to match reference (#7)

* vocab : handle o200k_harmony special tokens

* ggml : add attention sinks support (#1)

* llama : add attn sinks

* ggml : add attn sinks

* cuda : add attn sinks

* vulkan : add support for sinks in softmax

remove unnecessary return

* ggml : add fused swiglu_oai op (#11)

* ggml : add fused swiglu_oai op

* Update ggml/src/ggml-cpu/ops.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* update CUDA impl

* cont : metal impl

* add vulkan impl

* test-backend-ops : more test cases, clean up

* llama : remove unfused impl

* remove extra lines

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: slaren <slarengh@gmail.com>

* repack mxfp4 upon conversion

* clean up a bit

* enable thinking

* add quick hack to render only some special tokens

* fix bf16 conversion

* remove vocab hack

* webui ok

* support chat parsing for gpt-oss

* fix webui

* direct mapping mxfp4, FINALLY

* force using mxfp4

* properly use lazy tensor

* ggml : add mxfp4

ggml : use e8m0 conversion instead of powf

Co-authored-by: Diego Devesa <slarengh@gmail.com>

change kvalues_mxfp4 table to match e2m1 (#6)

metal : remove quantization for now (not used)

cuda : fix disabled CUDA graphs due to ffn moe bias

vulkan : add support for mxfp4

cont : add cm2 dequant

* ggml : add ggml_add_id (#13)

* ggml : add ggml_add_id

* add cuda impl

* llama : add weight support check for add_id

* perf opt

* add vulkan impl

* rename cuda files

* add metal impl

* allow in-place ggml_add_id

* llama : keep biases on CPU with --cpu-moe

* llama : fix compile error

ggml-ci

* cuda : add fallback for __nv_cvt_e8m0_to_bf16raw

ggml-ci

* cleanup

ggml-ci

* sycl : fix supports_op for MXFP4

ggml-ci

* fix Unknown reasoning format

* ggml-cpu : fix AVX build

ggml-ci

* fix hip build

ggml-ci

* cuda : add mxfp4 dequantization support for cuBLAS

ggml-ci

* ggml-cpu : fix mxfp4 fallback definitions for some architectures

ggml-ci

* cuda : fix version required for __nv_cvt_e8m0_to_bf16raw

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
Co-authored-by: slaren <slarengh@gmail.com>
2025-08-05 22:10:36 +03:00
Sigbjørn Skjæret
f324a3b715
chat : only remove double bos/eos if added (#15086)
* only remove double bos/eos if added

* fix tests
2025-08-05 20:43:36 +02:00
Georgi Gerganov
be42642581
readme : update hot topics (#15097) 2025-08-05 20:19:33 +03:00
Concedo
8bffbd9ce5 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	ggml/src/ggml-sycl/ggml-sycl.cpp
2025-08-06 00:44:19 +08:00
Concedo
8323d8a2d7 try fix regex stack size issues 2025-08-06 00:44:05 +08:00
Romain Biessy
3306ceabf0
sycl: fix mul_mat selection (#15092) 2025-08-05 18:39:55 +02:00
Concedo
e40d26b9e7 allow offloading moe to cpu with --moecpu 2025-08-05 23:42:42 +08:00
Juk Armstrong
c81de6e107
Fix glm4moe bug (#15088) 2025-08-05 13:56:44 +01:00
Alex Wu
22f060c9c4
webui: fix markdown table (#15081)
* webui: fix markdown table

* webui: fix table display with themes
2025-08-05 13:56:44 +02:00
Concedo
7590a0ea39 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.github/workflows/build.yml
#	ggml/CMakeLists.txt
#	ggml/cmake/ggml-config.cmake.in
#	ggml/src/CMakeLists.txt
#	models/templates/README.md
#	tools/imatrix/imatrix.cpp
2025-08-05 19:24:29 +08:00
compilade
ee3a9fcf88
context : fix index overflow on huge outputs (#15080)
* context : fix overflow when re-ordering huge outputs

* context : fix logits size overflow for huge batches
2025-08-05 11:27:45 +02:00
Diego Devesa
ec428b02c3
llama : add --n-cpu-moe option (#15077)
* llama : add --n-cpu-moe option

Keeps the MoE weights of the first N layers in the CPU
2025-08-05 01:05:36 +02:00
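A possible invocation (model path illustrative): offload all layers to the GPU but keep the MoE expert weights of the first 8 layers on the CPU:

```console
# --n-cpu-moe N keeps the MoE weights of the first N layers on the CPU
./llama-cli -m model.gguf -ngl 99 --n-cpu-moe 8 -p "Hello"
```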
compilade
19f68fa5a4
imatrix : warn when GGUF imatrix is saved without .gguf suffix (#15076)
* imatrix : add warning when suffix is not .gguf for GGUF imatrix

* imatrix : only warn about suffix when output format is unspecified
2025-08-04 23:26:52 +02:00
Christian Kastner
41613437ff
cmake: Add GGML_BACKEND_DIR option (#15074)
* cmake: Add GGML_BACKEND_DIR option

This can be used by distributions to specify where to look for backends
when ggml is built with GGML_BACKEND_DL=ON.

* Fix phrasing
2025-08-04 21:29:14 +02:00
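A sketch of how a distribution might combine the two options (install path illustrative):

```console
# build backends as dynamically loadable modules and set the directory
# ggml will search for them at runtime
cmake -B build -DGGML_BACKEND_DL=ON -DGGML_BACKEND_DIR=/usr/lib/ggml
```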
Sigbjørn Skjæret
e5bebe5251
gguf-py : add --chat-template-file to gguf_new_metadata (#15075) 2025-08-04 21:01:48 +02:00
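A possible use of the new flag, assuming the gguf-new-metadata script installed from gguf-py (file names illustrative):

```console
# swap the embedded chat template without re-converting the model
gguf-new-metadata input.gguf output.gguf --chat-template-file chat_template.jinja
```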
Sam
ef0144c087
model: support GLM 4.5 family of models (#14939)
* model: Add GLM 4.5 (#14921)

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Merge in PR suggestions

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* model: Add GLM 4.5 family of models (#14921)

1. Updated tensor_mapping.py with NextN tensor mappings

- Added proper tensor mappings for all NextN/MTP tensors in gguf-py/gguf/tensor_mapping.py
- Added mappings for: eh_proj, embed_tokens, enorm, hnorm, shared_head.head, shared_head.norm

2. Added num_nextn_predict_layers configuration

- Added LLM_KV_NUM_NEXTN_PREDICT_LAYERS constant to llama-arch.h and llama-arch.cpp
- Added num_nextn_predict_layers field to llama_hparams struct
- Updated GLM4_MOE parameter loading in llama-model.cpp to read this parameter
- Modified tensor loading logic to conditionally load NextN tensors based on num_nextn_predict_layers
- Added GGUF writer support in gguf_writer.py with add_num_nextn_predict_layers() method
- Updated conversion script to extract and write this parameter from HuggingFace config

3. Added FIM tokens for GLM4_MOE

- Added GLM-4.5's FIM tokens to llama-vocab.cpp:
  - <|code_prefix|> for FIM_PRE
  - <|code_suffix|> for FIM_SUF
  - <|code_middle|> for FIM_MID

4. Removed manual NextN tensor handling

- Removed the special-case handling in convert_hf_to_gguf.py that manually mapped NextN tensors
- NextN tensors are now handled automatically through the proper tensor mapping system

* glm 4.5 update tensors names

* model: glm 4.5 apply suggestions from code review

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* model: glm 4.5 apply suggestions from code review

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* model: glm 4.5 apply suggestions from code review

* Apply suggestions from code review

* patch broken chat template

* typings fix

* add TENSOR_SKIP flag


Co-authored-by: Diego Devesa <slarengh@gmail.com>

* Update src/llama-model-loader.h

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: Diego Devesa <slarengh@gmail.com>
2025-08-04 20:29:25 +02:00
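For reference, a conversion sketch using the updated mappings (checkpoint directory and output name illustrative):

```console
# convert a GLM 4.5 HuggingFace checkpoint to GGUF
python convert_hf_to_gguf.py ./GLM-4.5 --outfile glm-4.5-bf16.gguf --outtype bf16
```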
Sigbjørn Skjæret
2721257e3e
quantize : fix confusing error message if ftype is invalid (#15071) 2025-08-04 18:11:02 +02:00
Concedo
428a07416a cleanup some debug 2025-08-05 00:07:22 +08:00
Reese Levine
587d0118f5
ggml: WebGPU backend host improvements and style fixing (#14978)
* Add parameter buffer pool, batching of submissions, refactor command building/submission

* Add header for linux builds

* Free staged parameter buffers at once

* Format with clang-format

* Fix thread-safe implementation

* Use device implicit synchronization

* Update workflow to use custom release

* Remove testing branch workflow
2025-08-04 08:52:43 -07:00
Concedo
9fbbd9e127 half measure for mistral spaced formats 2025-08-04 23:48:11 +08:00
Concedo
8bd0a560f0 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	ggml/src/ggml-opencl/ggml-opencl.cpp
#	ggml/src/ggml-sycl/ggml-sycl.cpp
#	requirements/requirements-convert_hf_to_gguf_update.txt
#	scripts/compare-llama-bench.py
#	tests/test-backend-ops.cpp
#	tests/test-chat.cpp
#	tools/imatrix/README.md
#	tools/imatrix/imatrix.cpp
#	tools/llama-bench/llama-bench.cpp
2025-08-04 22:42:02 +08:00
Concedo
d37529c0cd add sanitize flag 2025-08-04 22:19:23 +08:00
Jeff Bolz
5aa1105da2
vulkan: fix build when using glslang that does not support coopmat2 (#15062) 2025-08-04 07:09:19 +02:00
compilade
d31192b4ee
imatrix : use GGUF by default (#14842)
* imatrix : use GGUF by default

* imatrix : use GGUF regardless of the output filename

The legacy format can only be produced with --output-format dat
2025-08-03 22:00:05 +02:00
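Usage sketch (file names illustrative):

```console
# GGUF is now the default imatrix output format
./llama-imatrix -m model.gguf -f calibration.txt -o imatrix.gguf

# the legacy binary format must be requested explicitly
./llama-imatrix -m model.gguf -f calibration.txt -o imatrix.dat --output-format dat
```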
compilade
0a2f5496be
imatrix : fix 3d activation handling for hybrid and recurrent models (#14994)
* imatrix : use a single count for dense 3d tensors

* imatrix : fix 3d activations when model tensor is 2d

* imatrix : fix 3d tensor counts
2025-08-03 21:49:13 +02:00
compilade
11a3811164
memory : handle kv_unified for hybrid models (#15050) 2025-08-03 21:43:07 +02:00
Csaba Kecskemeti
97366dc6ab
vocab : JetBrains Mellum pre-tokenizer (#15045) 2025-08-03 21:38:18 +02:00
Gabriel Larson
83bc2f288c
model : add text-only support for Kimi-VL (and find special tokens in text_config) (#15051)
* basic kimi-vl textmodel conversion

* check config["text_config"] for special tokens
2025-08-03 16:56:25 +02:00
Jeff Bolz
6c7a441161
vulkan: Use coopmat2 for conv2d (#14982) 2025-08-03 14:23:57 +02:00
Concedo
6cb8f95b5b tool calling params have been ported over to KAI api and can be used, same syntax as OAI endpoint 2025-08-03 16:21:57 +08:00
lhez
5c0eb5ef54
opencl: fix adreno compiler detection logic (#15029) 2025-08-02 19:51:18 +02:00
Johannes Gäßler
03d4698218
CUDA: use mma FA kernel for gqa > 4 on RTX 4000 (#15035) 2025-08-02 16:37:08 +02:00
Concedo
3c2a9f6dcf updated lite 2025-08-02 22:27:16 +08:00
leejet
3303c19b16
cuda: make im2col a little faster (#15025) 2025-08-02 17:15:36 +03:00
Daniel Bevenius
4fdea540bd
kv-cache : skip alignment of n_stream in kv-cache log msg [no ci] (#15040)
This commit removes the right alignment of the `n_stream` value in the
log message in the `llama_kv_cache_unified` constructor.

The motivation for this change is to enhance the readability of the log
message. Currently the output looks like this:
```console
llama_kv_cache_unified: size = 2048.00 MiB (  4096 cells,  32 layers,  1/ 1 seqs), K (f16): 1024.00 MiB, V (f16): 1024.00 MiB
```
Notice that the `n_stream` value is right aligned, which makes it a
little harder to read.

With the change in this commit the output will look like this:
```console
llama_kv_cache_unified: size = 2048.00 MiB (  4096 cells,  32 layers, 1/1 seqs), K (f16): 1024.00 MiB, V (f16): 1024.00 MiB
```
2025-08-02 17:14:57 +03:00
Georgi Gerganov
a4569c41fd
llama : enable LLAMA_SET_ROWS=1 by default (#14959)
ggml-ci
2025-08-02 17:14:21 +03:00
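Assuming the variable can still be set explicitly to opt out (semantics inferred from the commit title):

```console
# restore the previous behavior for a single run
LLAMA_SET_ROWS=0 ./llama-cli -m model.gguf -p "Hello"
```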
Georgi Gerganov
15e92fd337
cuda, sycl : fix batched gemm when ne02 == 1 && ne03 > 1 (#15038)
* cuda, sycl : fix batched gemm when ne02 == 1 && ne03 > 1

ggml-ci

* cont : fix cont types

ggml-ci

* cont : adopt variable names and comment from the other branch
2025-08-02 17:13:05 +03:00
Sigbjørn Skjæret
2bf3fbf0b5
ci : check that pre-tokenizer hashes are up-to-date (#15032)
* torch is not required for convert_hf_to_gguf_update

* add --check-missing parameter

* check that pre-tokenizer hashes are up-to-date
2025-08-02 14:39:01 +02:00
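A hedged invocation of the new check (exact usage may differ):

```console
# verify that the pre-tokenizer hashes in convert_hf_to_gguf.py are current;
# a HF token argument may additionally be required
python convert_hf_to_gguf_update.py --check-missing
```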
Douglas Hanley
711d5e6fe6
convert : fix Qwen3-Embedding pre-tokenizer hash (#15030) 2025-08-02 12:51:02 +02:00
Jhen-Jie Hong
f738989dcb
chat : fix multiple tool_calls on hermes-2-pro (#14962) 2025-08-02 18:04:48 +08:00