Yes You Can Have Your Own
9d49acb2a7
server: rename --clear-idle to --cache-idle-slots ( #21741 )
2026-04-20 08:30:24 +03:00
Alessandro de Oliveira Faria (A.K.A.CABELO)
e365e658f0
vendor : update cpp-httplib to 0.42.0 ( #21781 )
2026-04-20 06:41:43 +08:00
Johannes Gäßler
4eac5b4509
CUDA: refactor mma data loading for AMD ( #22051 )
...
* CUDA: refactor mma data loading for AMD
* fix CDNA MMQ occupancy
* fix CDNA3 mma
* fix RDNA3 compile
2026-04-19 18:26:59 +02:00
Concedo
c3c42f6e7f
updated lite
2026-04-19 23:40:29 +08:00
Concedo
a8290a072f
more robust json field handling
2026-04-19 23:27:19 +08:00
Concedo
271c4c332c
hack to allow kokoro to remain functional even with much higher GGML_SCHED_MAX_SPLIT_INPUTS
2026-04-19 20:40:07 +08:00
Concedo
707bb67b30
minimal uses 10% of budget
2026-04-19 20:19:45 +08:00
Aldehir Rojas
d5b780a676
common/autoparser : allow space after tool call ( #22073 )
2026-04-19 13:28:35 +02:00
Concedo
afaf3b960e
try to make kokoro take less graph size
2026-04-19 19:00:35 +08:00
uvos
471540ae8a
HIP: Remove unnecessary NCCL_CHECK ( #21914 )
2026-04-19 12:59:44 +02:00
Xuan-Son Nguyen
19124078be
mtmd: add pos_0 to mtmd_image_tokens_get_decoder_pos (breaking change) ( #22082 )
...
* mtmd: add pos_0 to mtmd_image_tokens_get_decoder_pos
* fix build
2026-04-19 11:57:21 +02:00
Gaurav Garg
bcdcc1044f
ggml : reduce CPU overhead in meta backend ( #22041 )
...
* cache subgraph splits when cgraph is unchanged
Skip per-call subgraph construction in ggml_backend_meta_graph_compute when the same ggml_cgraph is used consecutively.
Assign uid to every sub-graph so that CUDA's fast uid check path hits too.
* Address review comments
* Keep the scope as is
* Rename last_uid and last_n_subgraphs field. Remove last_max_tmp_size field. Refactor code.
* Address review comments
* Update ggml/src/ggml-backend-meta.cpp
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* Update ggml/src/ggml-backend-meta.cpp
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
---------
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2026-04-19 12:48:35 +03:00
Sigbjørn Skjæret
037bfe38d0
ci : install spirv-headers for vulkan-cross ( #22109 )
2026-04-19 10:32:08 +03:00
Dowon
8685e7b075
convert : support sentence-transformer 5.4 config files ( #22087 )
...
* convert : support sentence-transformer 5.4 config files
* fix: embeddinggemma
* fix: mapping
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* fix: pooling_mode
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-04-19 10:25:39 +03:00
texasich
09b4efa95f
cmake: remove CMP0194 policy to restore MSVC builds ( #21934 )
...
#21630 added the CMP0194 NEW policy to silence a CMake warning, but on Windows runners it caused CMake to prefer the MinGW toolchain for ASM and broke MSVC builds.
Reverting only that policy block restores the previous working behavior. The CMake 4.1+ warning comes back, but that is cosmetic and does not break any platform.
Reported-by: oobabooga
Refs: #21630
Co-authored-by: texasich <texasich@users.noreply.github.com>
2026-04-19 10:25:05 +03:00
Sascha Rogmann
455d8e4be8
server : speculative checkpointing ( #19493 )
...
* server : speculative decoding using checkpoints
* server : fix draft check with checkpoints
* server : rename spec vars
* server : log levels
* server : refactored spec logic to speculative.cpp
* server : renamed spec checkpoints option
* server : fix spec checkpoints, logging
* speculative : checkpoints with draft model, logging
* server : n_tokens_cur and create_checkpoint in draft
* server : fix server_speculative_callback (slot.id)
* spec : fix ngram-map/begin idx_last_check
* spec : init ckpt (begin() wasn't called)
* chore: update webui build output
* server : restore sampler in spec checkpoint and clear mem
* cont : avoid --spec-use-checkpoints argument
* cont : remove server_prompt_checkpoint_with_size
* spec : rename (leave_draft_state)
* cont : clean-up
* cont : do not ignore partial drafts even if they are short
* cont : spec callback owned by session
* cont : simplify
* cont : avoid empty speculative session
* cont : simplify
* cont : simplify
* cont : enable mtmd speculative decoding
* cont : keep the spec sampler alive
* cont : simplify
* cont : fix nullptr deref + draft checkpoints
* cont : remove common_speculative_accept_response
* cont : remove callback
* cont : simplify
* cont : minor
* cont : simplify
* cont : fix accepted number
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-04-19 10:24:06 +03:00
Radoslav Gerganov
91fef95362
rpc : refactor the RPC transport ( #21998 )
...
* rpc : refactor the RPC transport
Move all transport related code into a separate file and use the
socket_t interface to hide all transport implementation details.
* fix win32
* better socket_t construction
2026-04-19 10:21:53 +03:00
Concedo
2336c3e549
updated lite
2026-04-19 14:15:10 +08:00
Concedo
8f4eaedfd8
updated sdui
2026-04-19 13:24:41 +08:00
Concedo
71b4107bb6
fixed terminal logs
2026-04-19 11:31:12 +08:00
Cetarthoriphros
9e5647affa
server: Expose media_tag on /props endpoint. ( #22028 )
2026-04-19 00:27:17 +02:00
Concedo
8886e48a4a
cache sd info
2026-04-19 02:19:11 +08:00
Sigbjørn Skjæret
4f02d47339
model : refactor bias tensor variable names ( #22079 )
...
* refactor bias tensor variable names
* use create_tensor_qkv for jina-bert-v2
2026-04-18 20:12:00 +02:00
Wagner Bruna
1be08b9d15
sd: report all sampler aliases and centralize name mapping ( #2149 )
...
* debug: allow loading backend libraries without normal arg parsing
This is just to be able to test backend functions directly, with e.g.:
>>> import koboldcpp
>>> koboldcpp.init_libraries()
>>> koboldcpp.sd_get_info()
* sd: report all sampler aliases and centralize name mapping
2026-04-19 01:51:42 +08:00
Concedo
e5eab545f3
handle override jinja template
2026-04-19 00:30:28 +08:00
Concedo
ff37b336a7
updated lite
2026-04-18 18:38:32 +08:00
Concedo
2962e5bac4
updated colab image models
2026-04-18 18:02:17 +08:00
Concedo
40827ab5b5
updated lite, improved reasoning budget
2026-04-18 17:37:47 +08:00
Sigbjørn Skjæret
23b8cc4991
android : libcommon -> libllama-common ( #22076 )
2026-04-18 11:19:40 +02:00
Concedo
17c754a5fc
improved reasoning budget
2026-04-18 17:19:09 +08:00
Concedo
78589974de
updated colab
2026-04-18 16:41:27 +08:00
SamareshSingh
59accc8863
ggml-backend-meta: add multi-segment read support in get_tensor ( #22063 )
2026-04-18 10:04:51 +02:00
Sigbjørn Skjæret
83d58e02fc
ci : free disk space for rocm release ( #22012 )
2026-04-18 09:37:30 +02:00
Sigbjørn Skjæret
89a5474f0e
convert : fix (ignore for now) typings errors ( #22002 )
2026-04-18 09:36:41 +02:00
Johannes Gäßler
fd1c0ec3f0
llama: fit ctx size for CPU only ( #21568 )
2026-04-18 08:16:04 +02:00
Concedo
0b37cb9a57
added preliminary support for reasoning budget
2026-04-18 11:56:33 +08:00
Reese Levine
45cac7ca70
ggml-webgpu: fix compiler warnings and refactor FlashAttention encoding ( #21052 )
...
* Update workflows to remove dependence on llvmpipe
* Try setting Dawn_DIR
* remove c++20 initializers
* Move to proper guid
* Try avoiding segfaults on vulkan backend process exit
* Remove compiler warnings on parameter casting
* Fix soft_max and update reg_tile accumulation to f32 for better precision
* Refactor flash_attn a bit
* remove c++20 initializers and format
* Increase div precision for NVIDIA
* revert div precision and comment out ggml-ci node for now
* Formatting
* Try debugging on a failing CI node
* Revert "Try debugging on a failing CI node"
This reverts commit 1971e33cba919915e12bcfd5828abfbd54ca942e.
2026-04-17 09:17:11 -07:00
Aman Gupta
b94050e896
CUDA: use LRU based eviction for cuda graphs ( #21611 )
...
* CUDA: use a ring-buffer for cuda graphs
* bump limit to 128
* use LRU eviction
* better naming
* do periodic clean-up
2026-04-17 23:24:21 +08:00
Concedo
79882d669a
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .github/workflows/build-android.yml
# .github/workflows/build.yml
# .github/workflows/release.yml
# CMakeLists.txt
# CODEOWNERS
# common/CMakeLists.txt
# common/common.h
# docs/ops.md
# docs/ops/Metal.csv
# examples/batched/CMakeLists.txt
# examples/convert-llama2c-to-ggml/CMakeLists.txt
# examples/debug/CMakeLists.txt
# examples/diffusion/CMakeLists.txt
# examples/embedding/CMakeLists.txt
# examples/eval-callback/CMakeLists.txt
# examples/gen-docs/CMakeLists.txt
# examples/idle/CMakeLists.txt
# examples/lookahead/CMakeLists.txt
# examples/lookup/CMakeLists.txt
# examples/parallel/CMakeLists.txt
# examples/passkey/CMakeLists.txt
# examples/retrieval/CMakeLists.txt
# examples/save-load-state/CMakeLists.txt
# examples/speculative-simple/CMakeLists.txt
# examples/speculative/CMakeLists.txt
# examples/sycl/CMakeLists.txt
# examples/training/CMakeLists.txt
# ggml/src/ggml-hexagon/htp/hmx-matmul-ops.c
# ggml/src/ggml-hexagon/htp/htp-ops.h
# ggml/src/ggml-hexagon/htp/main.c
# ggml/src/ggml-opencl/CMakeLists.txt
# ggml/src/ggml-opencl/ggml-opencl.cpp
# ggml/src/ggml-opencl/kernels/cvt.cl
# pocs/vdot/CMakeLists.txt
# src/CMakeLists.txt
# tests/CMakeLists.txt
# tests/test-quantize-stats.cpp
# tools/batched-bench/CMakeLists.txt
# tools/cli/CMakeLists.txt
# tools/cli/cli.cpp
# tools/completion/CMakeLists.txt
# tools/cvector-generator/CMakeLists.txt
# tools/cvector-generator/cvector-generator.cpp
# tools/export-lora/CMakeLists.txt
# tools/gguf-split/CMakeLists.txt
# tools/gguf-split/gguf-split.cpp
# tools/imatrix/CMakeLists.txt
# tools/llama-bench/CMakeLists.txt
# tools/llama-bench/llama-bench.cpp
# tools/mtmd/CMakeLists.txt
# tools/perplexity/CMakeLists.txt
# tools/quantize/CMakeLists.txt
# tools/quantize/quantize.cpp
# tools/results/CMakeLists.txt
# tools/server/CMakeLists.txt
# tools/tokenize/CMakeLists.txt
# tools/tts/CMakeLists.txt
2026-04-17 22:37:37 +08:00
Concedo
768527b031
Merge commit '1e796eb41f' into concedo_experimental
...
# Conflicts:
# .devops/nix/package.nix
# .github/workflows/build-riscv.yml
# .github/workflows/build-vulkan.yml
# .github/workflows/build.yml
# docs/backend/SYCL.md
# docs/build.md
# docs/development/HOWTO-add-model.md
# embd_res/templates/Reka-Edge.jinja
# ggml/CMakeLists.txt
# ggml/src/ggml-rpc/CMakeLists.txt
# ggml/src/ggml-rpc/ggml-rpc.cpp
# ggml/src/ggml-sycl/CMakeLists.txt
# ggml/src/ggml-sycl/convert.cpp
# ggml/src/ggml-sycl/dequantize.hpp
# ggml/src/ggml-sycl/dmmv.cpp
# ggml/src/ggml-sycl/ggml-sycl.cpp
# ggml/src/ggml-webgpu/ggml-webgpu.cpp
# ggml/src/ggml-webgpu/wgsl-shaders/common_decls.tmpl
# ggml/src/ggml-webgpu/wgsl-shaders/get_rows.wgsl
# ggml/src/ggml-webgpu/wgsl-shaders/mul_mat.wgsl
# ggml/src/ggml-webgpu/wgsl-shaders/mul_mat_decls.tmpl
# ggml/src/ggml-webgpu/wgsl-shaders/mul_mat_id.wgsl
# ggml/src/ggml-webgpu/wgsl-shaders/mul_mat_reg_tile.wgsl
# ggml/src/ggml-webgpu/wgsl-shaders/mul_mat_subgroup_matrix.wgsl
# ggml/src/ggml-webgpu/wgsl-shaders/mul_mat_vec.wgsl
# ggml/src/ggml-webgpu/wgsl-shaders/unary.wgsl
# tests/test-chat.cpp
# tools/rpc/README.md
2026-04-17 21:47:29 +08:00
Concedo
a089d6c59b
updated lite
2026-04-17 21:12:25 +08:00
Yuri Khrustalev
a279d0f0f4
ci : add android arm64 build and release ( #21647 )
...
* server: respect the ignore eos flag
* ci: add android arm64 build and release
* patch
* pin android-setup actions to v4
* Apply suggestions from code review
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* lf in the suggestion
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-04-17 11:32:24 +02:00
Concedo
9a38091207
support q5_1 kv
2026-04-17 17:06:15 +08:00
65a
268d61e178
mtmd: add missing struct tag ( #22023 )
2026-04-17 10:48:33 +02:00
Georgi Gerganov
6990e2f1f7
libs : rename libcommon -> libllama-common ( #21936 )
...
* cmake : allow libcommon to be shared
* cmake : rename libcommon to libllama-common
* cont : set -fPIC for httplib
* cont : export all symbols
* cont : fix build_info exports
* libs : add libllama-common-base
* log : add common_log_get_verbosity_thold()
2026-04-17 11:11:46 +03:00
Eric Zhang
fcc7508759
model : Gemma4 model type detection ( #22027 )
...
* model : Gemma4 model type detection
2026-04-17 10:07:11 +02:00
Concedo
e074939c17
compact context GUI page (+1 squashed commits)
...
Squashed commits:
[136f073ce] compact context GUI page
2026-04-17 14:40:53 +08:00
Concedo
cccb45a00a
summary outputs include processed amt
2026-04-17 14:22:51 +08:00
lhez
5e6c0e18b6
opencl: refactor q8_0 set_tensor and mul_mat host side dispatch for Adreno ( #21938 )
...
* opencl: refactor q8_0 gemm/gemv Adreno dispatch
* opencl: refactor q8_0 set_tensor
* opencl: fix whitespace
2026-04-16 22:28:33 -07:00
Concedo
64ce5fca15
better approach when the SWA window is exceeded: simply refill the window. This is not 100% correct but is good enough for fastforward users. Disable FF or increase the window if it is not good enough
2026-04-17 11:44:13 +08:00