Commit graph

12923 commits

Author SHA1 Message Date
Concedo
19a12bb080 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	CODEOWNERS
#	common/CMakeLists.txt
#	ggml/CMakeLists.txt
#	ggml/src/ggml-webgpu/ggml-webgpu-shader-lib.hpp
#	ggml/src/ggml-webgpu/ggml-webgpu.cpp
#	ggml/src/ggml-webgpu/wgsl-shaders/common_decls.tmpl
#	ggml/src/ggml-webgpu/wgsl-shaders/mul_mat_vec.wgsl
#	scripts/sync-ggml.last
#	tools/cli/cli.cpp
#	tools/llama-bench/llama-bench.cpp
#	tools/perplexity/perplexity.cpp
2026-04-21 18:53:03 +08:00
Concedo
1feba4e4ea fixed koboldcpp.sh, fixed vision max/min when one param is missing, fixed processing count wrong, updated lite 2026-04-21 18:36:47 +08:00
Jeff Bolz
82209efb7e
vulkan: Support F16 OP_FILL (#22177) 2026-04-21 11:01:56 +02:00
Xuan-Son Nguyen
9998d88bc8
mtmd: correct mtmd_decode_use_mrope() (#22188) 2026-04-21 10:53:37 +02:00
Georgi Gerganov
cd03ec7642
llama-ext : fix exports (#22202) 2026-04-21 11:04:46 +03:00
Georgi Gerganov
4889afba5f sync : ggml 2026-04-21 11:04:21 +03:00
Georgi Gerganov
041fe83d74 ggml : bump version to 0.10.0 (ggml/1463) 2026-04-21 11:04:21 +03:00
Georgi Gerganov
cfe9838d26
fit-params : refactor + add option to output estimated memory per device (#22171)
* fit-params : add option to output estimated memory per device

* cont : minor

* cont : refactor

* cont : move fit params implementation to libcommon

* cont : header

* cont : headers

* cont : codeowners
2026-04-21 09:54:36 +03:00
xris99
ff6b1062af
server : fix hardcoded proxy connection timeout in router mode (#18760) (#22003)
Fixes: https://github.com/ggml-org/llama.cpp/issues/18760

Co-authored-by: Christian <christian@example.com>
2026-04-21 06:41:14 +02:00
leonardHONG
97895129e5
ggml-cuda: flush legacy pool on OOM and retry (#22155)
* ggml-cuda: flush legacy pool on OOM and retry

Signed-off-by: 梁厚宏 <2695316095@qq.com>

* Address review comments: add explicit sync, update destructor, clean up MUSA macros

Signed-off-by: 梁厚宏 <2695316095@qq.com>

---------

Signed-off-by: 梁厚宏 <2695316095@qq.com>
2026-04-20 23:30:38 +02:00
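The flush-and-retry idea in the commit above can be sketched in plain C++. This is a minimal illustration of the pattern, not the actual ggml-cuda pool code; `legacy_pool`, `flush`, and `alloc_with_retry` are hypothetical stand-ins (the real code operates on CUDA device memory, not `malloc`):

```cpp
#include <cstddef>
#include <cstdlib>
#include <vector>

// Hypothetical stand-in for a legacy buffer pool holding cached allocations.
struct legacy_pool {
    std::vector<void *> cached;
    void flush() {                 // release every cached buffer back to the system
        for (void * p : cached) std::free(p);
        cached.clear();
    }
};

// On allocation failure, flush the pool once and retry, mirroring the
// "flush legacy pool on OOM and retry" approach from the commit.
void * alloc_with_retry(legacy_pool & pool, size_t size) {
    void * ptr = std::malloc(size);
    if (ptr == nullptr) {
        pool.flush();              // free cached buffers, then try again
        ptr = std::malloc(size);
    }
    return ptr;
}
```

In the CUDA case the flush must be paired with an explicit device synchronization before the retry (one of the review points addressed in the commit), since cached buffers may still be in use by in-flight kernels.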
Xuan-Son Nguyen
86f8daacfe
mtmd: correct get_n_pos / get_decoder_pos (#22175) 2026-04-20 23:29:19 +02:00
Georgi Gerganov
cf8b0dbda9
server : remove /api endpoints (#22165)
* server : remove /api endpoints

* cont : remove /api/tags
2026-04-20 20:41:19 +03:00
Gaurav Garg
fd6ae4ca1c
Tensor-parallel: Fix delayed AllReduce on Gemma-4 MoE (#22129)
* Fix delayed AllReduce on Gemma-4 MoE

Skip forward past nodes that don't consume the current one, and allow a chain of MULs.

* Check for all sources before skipping nodes

* Address review comments
2026-04-20 18:25:39 +02:00
Johannes Gäßler
fb19f94c71
TP: fix 0-sized tensor slices, AllReduce fallback (#21808)
* TP: fix 0-sized tensor slices, AllReduce fallback

* fix layer structure <-> GPU count aliasing

* add missing std::fill

* fix CUDA device set, max ggml ctx size
2026-04-20 18:09:39 +02:00
pl752
7f251fdbce
ggml-cpu: Optimized x86 and generic cpu q1_0 dot (follow up) (#21636)
* Implemented optimized q1_0 dot for x86 and generic

* Removed redundant helper definition

* Removed two redundant instructions from AVX q1_0 dot

* Fixed inconsistency with fp16 conversion for generic q1_0 dot and deduplicated generic fallback

* Style cleanup around AVX q1_0 dot

* Replaced explicitly unrolled blocks with inner for loop for q1_0

* Replaced scalar ARM q1_0 impl with new generic one
2026-04-20 19:02:54 +03:00
Concedo
c17ba99812 change time.sleep to asyncio 2026-04-20 23:25:35 +08:00
neha-ha
a6cc43c286
ggml-webgpu: updated matrix-vector multiplication (#21738)
* merged properly, but slow q3_k and q5_k with u32 indexing

* Start on new mat-vec

* New format float paths working

* Working q4_0

* Work on remaining legacy q-types

* port k-quants to new matvec

* remove old shader

* Remove old constants, format

* remove accidental file

---------

Co-authored-by: Neha Abbas <nehaabbas@ReeseLevines-MacBook-Pro.local>
Co-authored-by: Reese Levine <reeselevine1@gmail.com>
2026-04-20 07:37:17 -07:00
Xuan-Son Nguyen
a678916623
mtmd: refactor mtmd_decode_use_mrope (#22161) 2026-04-20 14:45:11 +02:00
Concedo
cd6788007e Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.github/workflows/build-cross.yml
#	.github/workflows/build-self-hosted.yml
#	.github/workflows/release.yml
#	examples/llama.android/lib/src/main/cpp/CMakeLists.txt
#	ggml/CMakeLists.txt
#	ggml/src/ggml-rpc/CMakeLists.txt
#	ggml/src/ggml-rpc/ggml-rpc.cpp
#	ggml/src/ggml-sycl/mmvq.cpp
#	ggml/src/ggml-webgpu/ggml-webgpu-shader-lib.hpp
#	ggml/src/ggml-webgpu/ggml-webgpu.cpp
#	scripts/sync_vendor.py
#	tests/test-chat.cpp
#	tests/test-mtmd-c-api.c
#	tools/server/README.md
2026-04-20 20:19:11 +08:00
SamareshSingh
81df3f7cfa
fix: GLM-DSA crash in llama-tokenize when using vocab_only (#22102)
* llama: fix crash in print_info for GLM-DSA when vocab_only is set

* addressed code review comments

* cont : simplify

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-04-20 10:32:46 +03:00
Concedo
fe4c1b80a1 fix unwanted error print 2026-04-20 13:48:57 +08:00
Georgi Gerganov
de71b5f81c
server : refactor "use checkpoint" logic (#22114) 2026-04-20 08:42:37 +03:00
Katostrofik
788fcbc5dd
[SYCL] Fix reorder MMVQ assert on unaligned vocab sizes (#22035)
* [SYCL] Fix reorder MMVQ assert on unaligned vocab sizes

The reorder mul_mat_vec_q dispatchers for Q4_0, Q8_0, Q4_K, and Q6_K
asserted that block_num_y was a multiple of 16 subgroups. Models with
a vocab size not divisible by 16 (for example HY-MT at 120818) aborted
on model load when the output projection tripped the assert.

I replaced the assert with padding: block_num_y now rounds up to a
whole number of subgroup-sized workgroups. The kernel already has the
row bounds check (`if (row >= nrows) return;`) so the extra padded
threads early-exit cleanly. Row values are uniform across a subgroup
so the collective reduce stays safe.

For aligned vocab sizes the padded block_num_y equals the old value,
so the kernel launch is identical and there is no regression.

Thanks to @arthw for flagging the relationship to #21527.

Fixes #22020.

AI assisted coding, tested on Intel B70 hardware.

* sycl: use WARP_SIZE for num_subgroups in reorder MMVQ launches

Replaces the hardcoded 16 with WARP_SIZE in the four reorder_mul_mat_vec
launch helpers (Q4_0, Q8_0, Q4_K, Q6_K). Compile-time no-op on the Intel
target where WARP_SIZE is 16, but makes the relationship to subgroup
size explicit. Per review by @NeoZhangJianyu on #22035.

Assisted by Claude.
2026-04-20 08:39:45 +03:00
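The padding replacement described in the commit message can be illustrated with a short sketch. Assumptions: `WARP_SIZE` is 16 as on the Intel target mentioned in the message; `round_up_blocks` is an illustrative helper name, not the actual SYCL dispatcher code:

```cpp
#include <cstdint>

constexpr uint32_t WARP_SIZE = 16;  // subgroup size on the Intel target

// Instead of asserting that the row count is a multiple of WARP_SIZE,
// round the block count up to a whole number of subgroup-sized workgroups.
uint32_t round_up_blocks(uint32_t nrows) {
    return (nrows + WARP_SIZE - 1) / WARP_SIZE;
}

// Inside the kernel, the pre-existing row bounds check lets the extra
// padded threads exit cleanly:
//     if (row >= nrows) return;
```

For an already-aligned row count this yields exactly `nrows / WARP_SIZE`, so the kernel launch is unchanged in the aligned case, which is why the commit can claim no regression.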
Yes You Can Have Your Own
9d49acb2a7
server: rename --clear-idle to --cache-idle-slots (#21741) 2026-04-20 08:30:24 +03:00
Alessandro de Oliveira Faria (A.K.A.CABELO)
e365e658f0
vendor : update cpp-httplib to 0.42.0 (#21781) 2026-04-20 06:41:43 +08:00
Johannes Gäßler
4eac5b4509
CUDA: refactor mma data loading for AMD (#22051)
* CUDA: refactor mma data loading for AMD

* fix CDNA MMQ occupancy

* fix CDNA3 mma

* fix RDNA3 compile
2026-04-19 18:26:59 +02:00
Concedo
c3c42f6e7f updated lite 2026-04-19 23:40:29 +08:00
Concedo
a8290a072f more robust json field handling 2026-04-19 23:27:19 +08:00
Concedo
271c4c332c hack to allow kokoro to remain functional even with much higher GGML_SCHED_MAX_SPLIT_INPUTS 2026-04-19 20:40:07 +08:00
Concedo
707bb67b30 minimal uses 10% of budget 2026-04-19 20:19:45 +08:00
Aldehir Rojas
d5b780a676
common/autoparser : allow space after tool call (#22073) 2026-04-19 13:28:35 +02:00
Concedo
afaf3b960e try to make kokoro take less graph size 2026-04-19 19:00:35 +08:00
uvos
471540ae8a
HIP: Remove unnecessary NCCL_CHECK (#21914) 2026-04-19 12:59:44 +02:00
Xuan-Son Nguyen
19124078be
mtmd: add pos_0 to mtmd_image_tokens_get_decoder_pos (breaking change) (#22082)
* mtmd: add pos_0 to mtmd_image_tokens_get_decoder_pos

* fix build
2026-04-19 11:57:21 +02:00
Gaurav Garg
bcdcc1044f
ggml : reduce CPU overhead in meta backend (#22041)
* cache subgraph splits when cgraph is unchanged

Skip per-call subgraph construction in ggml_backend_meta_graph_compute when the same ggml_cgraph is used consecutively.

Assign uid to every sub-graph so that CUDA's fast uid check path hits too.

* Address review comments

* Keep the scope as is

* Rename last_uid and last_n_subgraphs field. Remove last_max_tmp_size field. Refactor code.

* Address review comments

* Update ggml/src/ggml-backend-meta.cpp

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* Update ggml/src/ggml-backend-meta.cpp

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2026-04-19 12:48:35 +03:00
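The caching idea in this commit (skip per-call subgraph construction when the same cgraph is submitted consecutively, keyed by a uid) can be sketched as follows. All names here are hypothetical stand-ins, not the actual ggml-backend-meta code:

```cpp
#include <cstdint>
#include <vector>

struct subgraph { int id; };  // placeholder for a computed subgraph split

// Hypothetical cache: rebuild the split list only when the graph uid changes.
struct split_cache {
    uint64_t last_uid = 0;
    std::vector<subgraph> splits;
    int build_count = 0;          // instrumentation for the sketch only

    const std::vector<subgraph> & get(uint64_t graph_uid) {
        if (splits.empty() || graph_uid != last_uid) {
            splits = build_splits(graph_uid);   // the expensive per-call work
            last_uid = graph_uid;
        }
        return splits;                          // reused on consecutive calls
    }

    std::vector<subgraph> build_splits(uint64_t uid) {
        ++build_count;
        return { subgraph{ (int) uid } };
    }
};
```

Assigning a uid to every sub-graph, as the commit notes, also lets the CUDA backend's own fast uid-comparison path hit, so the saving compounds across layers.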
Sigbjørn Skjæret
037bfe38d0
ci : install spirv-headers for vulkan-cross (#22109) 2026-04-19 10:32:08 +03:00
Dowon
8685e7b075
convert : support sentence-transformer 5.4 config files (#22087)
* convert : support sentence-transformer 5.4 config files

* fix: embeddinggemma

* fix: mapping

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* fix: pooling_mode

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-04-19 10:25:39 +03:00
texasich
09b4efa95f
cmake: remove CMP0194 policy to restore MSVC builds (#21934)
#21630 added the CMP0194 NEW policy to silence a CMake warning, but on Windows runners it caused CMake to prefer the MinGW toolchain for ASM and broke MSVC builds.

Reverting only that policy block restores the previous working behavior. The CMake 4.1+ warning comes back, but that is cosmetic and does not break any platform.

Reported-by: oobabooga

Refs: #21630

Co-authored-by: texasich <texasich@users.noreply.github.com>
2026-04-19 10:25:05 +03:00
Sascha Rogmann
455d8e4be8
server : speculative checkpointing (#19493)
* server : speculative decoding using checkpoints

* server : fix draft check with checkpoints

* server : rename spec vars

* server : log levels

* server : refactored spec logic to speculative.cpp

* server : renamed spec checkpoints option

* server : fix spec checkpoints, logging

* speculative : checkpoints with draft model, logging

* server : n_tokens_cur and create_checkpoint in draft

* server : fix server_speculative_callback (slot.id)

* spec : fix ngram-map/begin idx_last_check

* spec : init ckpt (begin() wasn't called)

* chore: update webui build output

* server : restore sampler in spec checkpoint and clear mem

* cont : avoid --spec-use-checkpoints argument

* cont : remove server_prompt_checkpoint_with_size

* spec : rename (leave_draft_state)

* cont : clean-up

* cont : do not ignore partial drafts even if they are short

* cont : spec callback owned by session

* cont : simplify

* cont : avoid empty speculative session

* cont : simplify

* cont : simplify

* cont : enable mtmd speculative decoding

* cont : keep the spec sampler alive

* cont : simplify

* cont : fix nullptr deref + draft checkpoints

* cont : remove common_speculative_accept_response

* cont : remove callback

* cont : simplify

* cont : minor

* cont : simplify

* cont : fix accepted number

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-04-19 10:24:06 +03:00
Radoslav Gerganov
91fef95362
rpc : refactor the RPC transport (#21998)
* rpc : refactor the RPC transport

Move all transport related code into a separate file and use the
socket_t interface to hide all transport implementation details.

* fix win32

* better socket_t construction
2026-04-19 10:21:53 +03:00
Concedo
2336c3e549 updated lite 2026-04-19 14:15:10 +08:00
Concedo
8f4eaedfd8 updated sdui 2026-04-19 13:24:41 +08:00
Concedo
71b4107bb6 fixed terminal logs 2026-04-19 11:31:12 +08:00
Cetarthoriphros
9e5647affa
server: Expose media_tag on /props endpoint. (#22028) 2026-04-19 00:27:17 +02:00
Concedo
8886e48a4a cache sd info 2026-04-19 02:19:11 +08:00
Sigbjørn Skjæret
4f02d47339
model : refactor bias tensor variable names (#22079)
* refactor bias tensor variable names

* use create_tensor_qkv for jina-bert-v2
2026-04-18 20:12:00 +02:00
Wagner Bruna
1be08b9d15
sd: report all sampler aliases and centralize name mapping (#2149)
* debug: allow loading backend libraries without normal arg parsing

This is just to be able to test backend functions directly, with e.g.:

>> import koboldcpp
>> koboldcpp.init_libraries()
>> koboldcpp.sd_get_info()

* sd: report all sampler aliases and centralize name mapping
2026-04-19 01:51:42 +08:00
Concedo
e5eab545f3 handle override jinja template 2026-04-19 00:30:28 +08:00
Concedo
ff37b336a7 updated lite 2026-04-18 18:38:32 +08:00
Concedo
2962e5bac4 updated colab image models 2026-04-18 18:02:17 +08:00