Commit graph

10709 commits

Author SHA1 Message Date
Concedo
3fb0f337fe remove z-image clamping for now 2025-12-11 23:05:00 +08:00
Concedo
278e45becf Merge commit '2fa51c19b0' into concedo_experimental
# Conflicts:
#	.github/actions/windows-setup-cuda/action.yml
#	.github/workflows/build-linux-cross.yml
#	.github/workflows/release.yml
#	README.md
#	docs/build-riscv64-spacemit.md
#	examples/model-conversion/logits.cpp
#	ggml/CMakeLists.txt
#	ggml/src/ggml-cpu/CMakeLists.txt
#	models/templates/Kimi-K2-Instruct.jinja
#	models/templates/Kimi-K2-Thinking.jinja
#	tests/test-chat.cpp
#	tools/server/README.md
2025-12-11 23:04:48 +08:00
Concedo
d07d2c1b39 stub loras endpoint for comfy 2025-12-11 22:48:38 +08:00
Concedo
fd0d0cab03 move pipeline parallelism to a --pipelineparallel launch flag 2025-12-11 21:03:41 +08:00
Concedo
b7428048fc try reduce pipeline parallelism in order to reduce compute buffer sizes 2025-12-11 14:30:38 +08:00
Concedo
798473d867 updated sdui, fixed image import 2025-12-11 11:43:40 +08:00
Concedo
34634aef1b tweak to smartcache for contextshifting 2025-12-10 20:08:11 +08:00
Concedo
8a18e094f5 added smartcaching implementation inspired from Pento95 (+2 squashed commit)
Squashed commit:

[fcc498688] wip basic smart caching test

[b6e8b2577] wip basic smart caching test
2025-12-10 18:00:03 +08:00
Concedo
1aab32fe03 fixed safetensors loading for zimage 2025-12-09 18:09:47 +08:00
Daniel Bevenius
2fa51c19b0
model-conversion : add token ids to prompt token output [no ci] (#17863)
This commit adds the token ids to the printed prompt outputs.

The motivation for this is that it can be useful to see the actual token
ids alongside the token strings for debugging.
2025-12-08 17:13:08 +01:00
Xuan-Son Nguyen
951520ddb0
server: delegate result_state creation to server_task (#17835)
* server: delegate result_state creation to server_task

* remove unused states

* add more docs
2025-12-08 17:04:38 +01:00
Neo Zhang
68522c678d
ci : support bfloat16 SYCL release package (#17855)
* support bfloat16 release package

* add fallback file
2025-12-08 15:09:39 +01:00
Xuan-Son Nguyen
f896d2c34f
server: improve speed of speculative decoding (#17808)
* server: improve speed of speculative decoding

* fix small draft case

* add link to the PR

* server : fix generation time measurement

* server : fix draft acceptance logs (add SRV_CNT, SLT_CNT macros)

* server : add comment

* add PR to docs

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-12-08 14:35:28 +01:00
Piotr Wilkin (ilintar)
e4e9c4329c
Make graph_max_nodes vary by ubatch size (#17794)
* Make graph_max_nodes vary by ubatch size for models where chunking might explode the graph

* Update src/llama-context.h

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Add missing const

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-12-08 14:32:41 +01:00
hksdpc255
636fc17a37
Fix Kimi-K2 tool-call parsing issues (#17376)
* Fix kimi-k2 parsing

* fix template & add more tests for kimi-k2

* Another fix for Kimi-K2 chat template.

* enable allow_toolcall_in_think for Kimi-K2

* Refine key-value separator and value end format

* Enable tool call in think for kimi-k2

* allow_toolcall_in_think is now tested with Kimi-K2

* Remove outdated TODO comment in XML tool call parser

Removed TODO comment about untested tool call feature.

* Rename function from "utf8_truncate_safe" to "utf8_truncate_safe_len"
2025-12-08 14:32:04 +01:00
Jay Zenith
51e0c2d917
cuda : add FILL op support (#17851)
* cuda : add FILL op support

* cuda : add missing FILL op files
2025-12-08 21:10:12 +08:00
Xuan-Son Nguyen
37a4f63244
server : add development documentation (#17760)
* first draft

* rewrite

* update & remove duplicated sections
2025-12-08 13:54:58 +01:00
Wagner Bruna
801840d3bd
sd: sync to master-391-5865b5e (#1878) 2025-12-08 19:53:52 +08:00
Concedo
242ae8b8f3 http get cleanup 2025-12-08 19:51:55 +08:00
Concedo
cd73613136 moved volta onto tile kernels, so building for cc7.0 can be avoided
this shouldn't do anything (+2 squashed commit)

Squashed commit:

[1cdcb302a] another attempt to tip the scales, part 2

[8f647b709] another attempt to tip the scales (volta)
2025-12-08 19:51:54 +08:00
Georgi Gerganov
2bc96931d2
server : make cache_reuse configurable per request (#17858) 2025-12-08 12:43:12 +02:00
wsbagnsv1
5814b4dce1
cuda: optimize SOLVE_TRI using registers and FMAF (#17703)
* ggml-cuda: optimize solve_tri_f32_fast and fix stride handling

- Switch from using shared memory for the RHS/solution matrix to a register-based approach (x_low, x_high), reducing shared memory pressure and bank conflicts.
- Implement explicit `fmaf` instructions for the reduction loop.
- Update kernel arguments to pass strides in bytes rather than elements to align with standard ggml tensor arithmetic (casting to `char *` before addition).
- Remove unused `MAX_K_FAST` definition.

* Small cleanup

* Remove comments in solve_tri.cu

* Update ggml/src/ggml-cuda/solve_tri.cu

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* Update ggml/src/ggml-cuda/solve_tri.cu

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* Update ggml/src/ggml-cuda/solve_tri.cu

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* Use const for variables in solve_tri.cu

* Replace fmaf with more readable code

* remove last fmaf

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2025-12-08 10:41:08 +01:00
ixgbe
79d61896d3
ggml-cpu: add ggml_thread_cpu_relax with Zihintpause support (#17784)
* ggml-cpu: add ggml_thread_cpu_relax with Zihintpause support

Signed-off-by: Wang Yang <yangwang@iscas.ac.cn>

* cmake: enable RISC-V zihintpause extension for Spacemit builds

* readme : add ZIHINTPAUSE support for RISC-V

---------

Signed-off-by: Wang Yang <yangwang@iscas.ac.cn>
2025-12-08 10:41:34 +02:00
Xuan-Son Nguyen
4d3726278b
model: add llama 4 scaling for mistral-large (deepseek arch) (#17744) 2025-12-07 22:29:54 +01:00
lovedheart
08f9d3cc1d
Vulkan: improve mul_mat_vec_iq1_m (#16907)
* Optimize Vulkan shader for matrix-vector multiplication

* Revert changes on compute_outputs and main

Refactor compute_outputs to handle remaining rows correctly.

* Fix trailing whitespace
2025-12-07 18:40:42 +01:00
Sigbjørn Skjæret
0a540f9abd
ci : add windows-cuda 13.1 release (#17839) 2025-12-07 14:02:04 +01:00
Concedo
40d3d830a1 updated lite 2025-12-07 17:13:23 +08:00
Concedo
17c0c8d55d Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	README.md
#	docs/backend/zDNN.md
#	docs/build.md
#	docs/ops.md
#	ggml/CMakeLists.txt
#	ggml/src/CMakeLists.txt
#	ggml/src/ggml-cann/ggml-cann.cpp
#	ggml/src/ggml-opencl/ggml-opencl.cpp
#	ggml/src/ggml-rpc/ggml-rpc.cpp
#	ggml/src/ggml-sycl/convert.cpp
#	ggml/src/ggml-sycl/ggml-sycl.cpp
#	src/llama-quant.cpp
#	tests/test-backend-ops.cpp
#	tools/llama-bench/llama-bench.cpp
#	tools/server/README.md
2025-12-07 16:48:38 +08:00
Concedo
7c5d271d6c Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.github/workflows/build.yml
#	.github/workflows/release.yml
#	.github/workflows/winget.yml
#	CMakeLists.txt
#	CODEOWNERS
#	CONTRIBUTING.md
#	cmake/build-info.cmake
#	docs/ops.md
#	docs/ops/BLAS.csv
#	docs/ops/Metal.csv
#	examples/CMakeLists.txt
#	examples/save-load-state/save-load-state.cpp
#	examples/simple-cmake-pkg/README.md
#	ggml/CMakeLists.txt
#	ggml/src/CMakeLists.txt
#	ggml/src/ggml-rpc/ggml-rpc.cpp
#	ggml/src/ggml-webgpu/ggml-webgpu.cpp
#	ggml/src/ggml-webgpu/wgsl-shaders/embed_wgsl.py
#	src/llama-quant.cpp
#	tests/test-backend-ops.cpp
#	tools/server/CMakeLists.txt
2025-12-07 16:37:32 +08:00
Concedo
20363dc6e7 z image limit cfg scale to 3.0 max 2025-12-07 16:24:26 +08:00
Concedo
8577628874 freeze lcpp ui forever, modify branding 2025-12-07 13:11:01 +08:00
Concedo
8c17541cc0 modify llama.cpp branding on lcpp ui (+1 squashed commits)
Squashed commits:

[067343edf] modify llama.cpp branding on lcpp ui
2025-12-07 12:53:33 +08:00
Sigbjørn Skjæret
22577583a3
common : change --color to accept on/off/auto, default to auto (#17827)
2025-12-07 03:43:50 +01:00
Law Po Ying
d9e03db1e7
sycl: add missing BF16 conversion support for Intel oneAPI (#17780)
* sycl: add missing BF16 conversion support for Intel oneAPI

* Fix Line 645: Trailing whitespace
2025-12-07 09:18:18 +08:00
Jeff Bolz
db97837385
vulkan: perf_logger improvements (#17672)
* vulkan: perf_logger improvements

- Move perf_logger from device to ctx.
- Add an env var to control the frequency we dump the stats. If you set a very
large value, it just dumps when the ctx is destroyed.
- Add a fusion info string to the tracking, only log one item per fused op.
- Fix MUL_MAT_ID flops calculation.

* fix vector sizes
2025-12-06 18:46:46 +01:00
Vishal Singh
017761daf5
ggml-zendnn : add ZenDNN backend for AMD CPUs (#17690)
* ggml-zendnn: add ZenDNN backend support

* ggml-zendnn : address ZenDNN backend review fixes and suggestions

* docs : apply blockquote syntax to ZenDNN docs

---------

Co-authored-by: Manoj Kumar <mkumar@zettabolt.com>
2025-12-07 00:13:33 +08:00
Xuan-Son Nguyen
c42712b056
server: support multiple generations from one prompt (OAI "n" option) (#17775)
* backend support

* server: support multiple generations from one prompt (OAI "n" option)

* fix invalid batch

* format oai

* clean up

* disable ctx shift

* add test

* update comments

* fix style

* add n_cmpl to docs [no ci]

* allowing using both n_cmpl and n
2025-12-06 15:54:38 +01:00
Phylliida Dev
09c7c50e64
ggml : add circular tiling support to pad, for Vulkan, CUDA, and CPU (used for making seamless textures) (#16985)
* Feat: Added vulkan circular tiling support

* Feat: Added cpu circular

* Feat: Added cuda kernels

* Added tests

* Added tests

* Removed non-pad operations

* Removed unneeded changes

* removed backend non pad tests

* Update test-backend-ops.cpp

* Fixed comment on pad test

* removed trailing whitespace

* Removed unneeded test in test-backend-ops

* Removed removed test from calls

* Update ggml/src/ggml-vulkan/vulkan-shaders/pad.comp

Co-authored-by: Ruben Ortlam <picard12@live.de>

* Fixed alignment

* Formatting

Co-authored-by: Aman Gupta <amangupta052@gmail.com>

* Format pad

* Format

* Clang format

* format

* format

* don't change so much stuff

* clang format and update to bool

* fix duplicates

* don't need to fix the padding

* make circular bool

* duplicate again

* rename vulkan to wrap around

* Don't need indent

* moved to const expr

* removed unneeded extra line break

* More readable method calls

* Minor wording changes

* Added final newline

* Update ggml/include/ggml.h

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update ggml/include/ggml.h

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Added circular pad ext tests

* Gate non circular pad devices

* Cleaned gating of non-circular pad devices

---------

Co-authored-by: Phylliida <phylliidadev@gmail.com>
Co-authored-by: Ruben Ortlam <picard12@live.de>
Co-authored-by: Aman Gupta <amangupta052@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-12-06 15:07:02 +01:00
Concedo
d27949f22a Revert "try remove volta as a dedicated target b (+1 squashed commits)"
This reverts commit ddba580f00.
2025-12-06 21:31:44 +08:00
Concedo
ddba580f00 try remove volta as a dedicated target b (+1 squashed commits)
Squashed commits:

[2df689a03] try remove volta as a dedicated target
2025-12-06 21:31:06 +08:00
Johannes Gäßler
f334b79494
HIP: fix RDNA3 FP16/BF16 matrix multiplication (#17817) 2025-12-06 13:45:36 +01:00
Aleksander Grygier
a28e3c7567
webui: Stop generation from chat sidebar (#17806)
* feat: Add stop generation button for Conversation Item

* chore: update webui build output
2025-12-06 13:29:15 +01:00
Aleksander Grygier
e31b5c55c3
webui: Fix context available value in Multi-model Router mode (#17804)
* fix: Use context size from `/props?model=...` in ROUTER mode

* chore: update webui build output
2025-12-06 13:23:29 +01:00
Aleksander Grygier
21f24f27a9
webui: Per-conversation system message with UI displaying, edition & branching (#17275)
* feat: Per-conversation system message with optional display in UI, edition and branching (WIP)

* chore: update webui build output
2025-12-06 13:19:05 +01:00
Sky
7b43f55753
ggml : improve error handling for search path existence checks (#17653)
* Improve error handling for search path existence checks

Refactor existence checks for search paths using std::error_code to handle potential errors.

* Improve cache file existence check with error code 

Update fs::exists to use std::error_code for error handling.

* Simplify existence check for search paths

Simplify existence check for search paths

* Fix logging path in error message for posix_stat

* Update ggml/src/ggml-backend-reg.cpp

Co-authored-by: Aman Gupta <amangupta052@gmail.com>

* Adapt to the coding standard

---------

Co-authored-by: Aman Gupta <amangupta052@gmail.com>
2025-12-06 12:28:16 +01:00
Daniel Bevenius
444f00b0ec
llama : remove quantization sanity check (#17788)
* llama : remove quantization sanity check

This commit removes the quantization sanity check for attention layers.

The motivation for this is that there are hybrid models
that have recurrent layers, expert layers, and attention layers. For
these models the current check fails as the expert layers are not
taken into account. After consideration, it was decided that this check
is not strictly necessary, and can be removed to allow for more flexible
model architectures.

* llama : remove unused pruned_attention_w and is_clip_model vars
2025-12-06 12:26:20 +01:00
Jeff Bolz
2960eb2975
vulkan: Use one row per workgroup for f32 mmv (#17711)
The MoE models have a mul_mat_vec with very small m (32, 64, 128) right before
the topk_moe selection. Running multiple rows per wg doesn't utilize the SMs
well. I think even for larger m, f32 is so bandwidth-limited that running
multiple rows doesn't help.
2025-12-06 11:12:26 +01:00
Xuan-Son Nguyen
dbc15a7967
convert: support Mistral 3 Large MoE (#17730)
* convert: support Mistral 3 Large MoE

* filter out vision tensors, add missing keys

* handle vocab

* add temperature_length

* fix mscale_all_dim

* clean up

* Apply suggestions from code review

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* fix

* Update gguf-py/gguf/tensor_mapping.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2025-12-06 10:49:33 +01:00
Jeff Bolz
c6c5e85979
vulkan: support solve_tri with larger N/K values (#17781)
Split N into chunks to fit into shared memory.
If K > 128, use a larger workgroup with enough invocations.
Add perf tests matching qwen3next.
2025-12-06 08:56:45 +01:00
Concedo
1a14ae1183 lets try without volta specific kernels, fattn should fall back to tile 2025-12-06 15:56:07 +08:00