Commit graph

8853 commits

Author SHA1 Message Date
Concedo
75154a3d91 add ping endpoint 2025-07-22 18:55:35 +08:00
Concedo
4abea4b5c9 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	README.md
#	docs/build.md
#	ggml/src/ggml-cpu/CMakeLists.txt
#	ggml/src/ggml-cpu/kleidiai/kernels.cpp
#	ggml/src/ggml-cpu/kleidiai/kernels.h
#	ggml/src/ggml-cpu/kleidiai/kleidiai.cpp
#	tests/test-backend-ops.cpp
#	tools/server/README.md
2025-07-21 23:37:42 +08:00
Concedo
f5aa7c2265 updated lite 2025-07-21 23:35:37 +08:00
Concedo
9f4d0f6ccf fixed swa pp bug by retrying smaller batches 2025-07-21 23:34:22 +08:00
Charles Xu
922042601b
kleidiai: add support for get_rows (#14676)
* kleidiai: add support for get_rows

* apply fixes based on code review

* apply more fixes based on code review
2025-07-21 16:49:52 +03:00
Radoslav Gerganov
2ba1333b35
docs : fix backends table in README.md (#14796) 2025-07-21 14:03:49 +02:00
Jeff Bolz
c2e058f1b4
vulkan/cuda: Fix im2col when KW!=KH (#14789)
The tid is decomposed into "ow + ky*OW + kx*OW*KH". Change "ksize" to match.
2025-07-21 13:35:40 +02:00
Molly Sophia
c82d48ec23
llama : fix --reverse-prompt crashing issue (#14794)
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
2025-07-21 17:38:36 +08:00
Concedo
6d50def409 default kv_unified to true, handle LLAMA_SET_ROWS. 2025-07-21 16:13:20 +08:00
IsaacDynamo
b4efd77f8a
server : add parse_special option to /tokenize endpoint (#14783) 2025-07-21 10:24:51 +03:00
Aman Gupta
2be60cbc27
docs : fix link for tools/perplexity in README.md (#14780) 2025-07-20 20:13:47 +02:00
rspOverflow
b526ad2668
Documentation: Further revisions to the Vulkan section in build.md (#14785)
* Documentation: Revised and further improved the Vulkan instructions for Linux users in build.md.

* Minor: Revise step 2 of the Vulkan instructions for Linux users in build.md
2025-07-20 18:55:32 +02:00
Concedo
30675b0798 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	CODEOWNERS
#	docs/build.md
#	scripts/sync-ggml.last
#	tests/test-backend-ops.cpp
#	tools/imatrix/README.md
#	tools/imatrix/imatrix.cpp
2025-07-20 22:47:31 +08:00
Aman Gupta
938b785764
Clang-format: local files first + fix BinPacking (#14779) 2025-07-20 19:42:34 +08:00
Concedo
15b1034dd6 allow larger img2img 2025-07-20 11:10:03 +08:00
Concedo
e1849f9a0b updated lite (+4 squashed commit)
Squashed commit:

[ecaa653df] updated lite

[1154c44a7] updated lite

[db2e5e43d] allow whisper interrogate mode for audio files

[490b13af8] whitespace
2025-07-20 11:10:03 +08:00
0cc4m
36c153248f
Contrib: add 0cc4m as codeowner for Vulkan backend (#14775) 2025-07-19 23:47:21 +03:00
Ervin Áron Tasnádi
a979ca22db
ggml: adds CONV_2D op and direct GEMM Vulkan implementation (#14316)
* ggml/ggml-vulkan/test-backend-ops: adds CONV_2D for Vulkan

* ggml-vulkan: adds f32 scalar shader to compute 2D convolution directly
with gemm (no need for im2col)

* test-backend-ops: adds test_case_ref to check the validity/performance of ops
against reference implementations having different graphs, adds tests

* Performance fixes: minimized branch divergence, uses collectives to
  eliminate redundant calculation, macros removed.

* Kernel shared memory size check

* Updates test-backend-ops to support graphs for performance
  measurement.

* Apple/Win32 compile errors fixed

* Subgroup size used to determine tile size -> fixes llvmpipe errors.

* Collectives disabled by default.

* Intel support is disabled as the performance is poor.

* Conv2d enabled for Intel with disabled collectives, disabled for Apple

* test-backend-ops modifications are reverted

* Trailing spaces and missing override fixed.

* Triggering pipeline relaunch.

* Code formatted with .clang-format.
2025-07-19 21:59:08 +02:00
compilade
90083283ec
imatrix : use GGUF to store importance matrices (#9400)
* imatrix : allow processing multiple chunks per batch

* perplexity : simplify filling the batch

* imatrix : fix segfault when using a single chunk per batch

* imatrix : use GGUF to store imatrix data

* imatrix : fix conversion problems

* imatrix : use FMA and sort tensor names

* py : add requirements for legacy imatrix convert script

* perplexity : revert changes

* py : include imatrix converter requirements in toplevel requirements

* imatrix : avoid using designated initializers in C++

* imatrix : remove unused n_entries

* imatrix : allow loading mis-ordered tensors

Sums and counts tensors no longer need to be consecutive.

* imatrix : more sanity checks when loading multiple imatrix files

* imatrix : use ggml_format_name instead of std::string concatenation

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>

* quantize : use unused imatrix chunk_size with LLAMA_TRACE

* common : use GGUF for imatrix output by default

* imatrix : two-way conversion between old format and GGUF

* convert : remove imatrix to gguf python script

* imatrix : use the function name in more error messages

* imatrix : don't use FMA explicitly

This should make comparisons between the formats easier
because this matches the behavior of the previous version.

* imatrix : avoid returning from void function save_imatrix

* imatrix : support 3d tensors with MUL_MAT

* quantize : fix dataset name loading from gguf imatrix

* common : move string_remove_suffix from quantize and imatrix

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* imatrix : add warning when legacy format is written

* imatrix : warn when writing partial data, to help guess dataset coverage

Also make the legacy format store partial data
by using neutral values for missing data.
This matches what is done at read-time for the new format,
and so should get the same quality in case the old format is still used.

* imatrix : avoid loading model to convert or combine imatrix

* imatrix : avoid using imatrix.dat in README

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2025-07-19 12:51:22 -04:00
Peter0x44
d4b91ea7b2
vulkan: Add logging for bf16 features to ggml_vk_print_gpu_info (#13274) (#14707) 2025-07-19 17:58:03 +02:00
0cc4m
83f5872404
Vulkan: Fix fprintf format-security warning (#14770) 2025-07-19 17:47:53 +02:00
rspOverflow
f0d4d176df
Documentation: Update build.md's Vulkan section (#14736)
* Documentation: Rewrote and updated the "Without docker" portion of the Vulkan backend build documentation.

* Documentation: Reorganize build.md's Vulkan section.
2025-07-19 12:18:36 +02:00
Georgi Gerganov
b17230917c sync : ggml 2025-07-19 11:46:50 +03:00
Georgi Gerganov
bf9087f59a
metal : fuse add, mul + add tests (#14596)
ggml-ci
2025-07-18 20:37:26 +03:00
Georgi Gerganov
9fb1042ce6
graph : fix graph reuse reset of params (#14760)
ggml-ci
2025-07-18 20:08:33 +03:00
Concedo
b0b7a07b34 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	examples/parallel/parallel.cpp
2025-07-18 23:49:45 +08:00
Georgi Gerganov
2adf8d83ac
parallel : add option for different RNG seeds (#14757)
ggml-ci
2025-07-18 17:33:41 +03:00
Oliver Simons
021cc28bef
cuda : Fix Gemma3n not executed as CUDA_GRAPH on NVGPUs (#14741)
* Fix Gemma3n not executed as CUDA_GRAPH on NVGPUs

Gemma3n uses Matrix-Matrix addition as part of its input processing,
wrongly triggering CUDA_GRAPH disablement on NVGPUs even when a batch size
of 1 is used.

* Exclude `project_per_layer_input` by matching node names

This ensures that all other graphs which don't exhibit this pattern do
not have their behavior changed.

* Revert unnecessary formatting changes
2025-07-18 04:35:32 -07:00
Georgi Gerganov
d498af3d5a
graph : avoid huge warm-up graphs for MoE models (#14753)
* graph : avoid huge warm-up graphs for MoE models

ggml-ci

* cont : bump max nodes to 8x model tensors
2025-07-18 14:31:15 +03:00
Georgi Gerganov
eacdeb5bfc
model : fix build after merge conflict (#14754) 2025-07-18 11:53:55 +03:00
lgai-exaone
e0cb5c5cb8
model : add EXAONE 4.0 support (#14630) 2025-07-18 10:45:49 +02:00
Aman Gupta
f9a31eea06
CUDA: set_rows + cpy.cu refactor (#14712) 2025-07-18 14:54:18 +08:00
Concedo
b8e3280432 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.devops/nix/package.nix
#	ggml/src/ggml-sycl/ggml-sycl.cpp
2025-07-18 13:46:32 +08:00
Georgi Gerganov
8f974bc1e9
graph : refactor context to not pass gf explicitly (#14629)
ggml-ci
2025-07-18 08:29:28 +03:00
kallewoof
226624639c
AutoGuess: Move Generic cases to end of file and put Kimi with other ChatML variants (#1648)
* AutoGuess: Move Generic cases to end of file and put Kimi with other ChatML variants

* patch Kimi ChatML template
2025-07-18 13:24:21 +08:00
Concedo
b028dd4e84 minor fixes 2025-07-18 13:22:59 +08:00
Nexes the Elder
09651d09ff
graph : Pass the graph placeholder message in debug mode (#14748)
Without that condition, this debug log clutters the screen for every batch handled during prompt processing, or for every token generated in Kobold.cpp.
2025-07-18 07:25:54 +03:00
Neo Zhang Jianyu
349ea79fce
use max work group size for device to replace the magic number (#14732) 2025-07-18 10:23:14 +08:00
Piotr Wilkin (ilintar)
670e1360cd
convert : fix Ernie4.5 MoE without shared experts (#14746) 2025-07-18 01:17:16 +02:00
Wroclaw
760b4484e3
nix : use optionalAttrs for env mkDerivation attrset argument (#14726) 2025-07-17 15:18:16 -07:00
Piotr Wilkin (ilintar)
cb887f1bc1
model: add Ernie 4.5 MoE support (#14658)
* Add Ernie4.5 MoE

* Fix Flake errors.

* Properly encode/decode MoE layer step

* Correct tensor mappings (.weight)

* Pass and read n_ff_exp

* n_ff_shexp calculation and further minor changes

* Rope fixes.

* .gitignore fix

* Add uint32 cast for Linux builds

* Apply suggestions from code review

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Further fixes from code review

* Fix trailing whitespace

* Reenable missing experts error

* Code style from code review

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Fix non-MoE regression

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2025-07-17 23:15:32 +02:00
Georgi Gerganov
d6fb3f6b49
kv-cache : fix k-shift for multiple streams (#14742)
ggml-ci
2025-07-17 20:52:33 +03:00
Concedo
1ca666f9c1 allow handling multipart files up to 999 2025-07-18 01:18:28 +08:00
Georgi Gerganov
01612b7409
llama : reuse compute graphs (#14482)
* llama : reuse compute graphs

ggml-ci

* llama-bench : add graph reuse parameter

ggml-ci

* cont : remove the parameter and the sched resets

ggml-ci

* graph : rename update() to can_reuse()

ggml-ci

* params : remove is_same()

ggml-ci

* graph : set res->params in llm_graph_context constructor

ggml-ci

* graph : avoid set_max_nodes in llm_graph_result

ggml-ci

* kv-cache : reuse llama_context's graph result instance

ggml-ci

* context : reset the previous graph result upon memory updates

ggml-ci

* batch : llama_ubatch now carries its data instead of pointing to balloc

ggml-ci

* merge : fix build

ggml-ci

* graph : fix can_reuse() checks when flash-attention is disabled

* graph : move llm_graph_result impl in source file + debug env

ggml-ci
2025-07-17 19:08:33 +03:00
Concedo
8cf812eddd updated lite 2025-07-17 20:11:17 +08:00
Concedo
f57018f722 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.github/workflows/build-linux-cross.yml
2025-07-17 18:23:26 +08:00
Concedo
afca31bfbe handle clean_env for remotetunnel 2025-07-17 18:21:22 +08:00
Tarek Dakhran
086cf81e88
llama : fix parallel processing for lfm2 (#14705) 2025-07-17 09:22:11 +02:00
Georgi Gerganov
d9b691081c
kv-cache : opt mask set input (#14600)
ggml-ci
2025-07-17 09:49:15 +03:00
Georgi Gerganov
ad57d3edd2
batch : fix uninitialized has_cpl flag (#14733)
ggml-ci
2025-07-17 09:45:54 +03:00