Commit graph

1484 commits

Author SHA1 Message Date
Concedo
0d320f60a6 fix multiuser regression 2026-05-17 00:17:12 +08:00
Concedo
47d5772fbe add batching failure spam logs 2026-05-16 23:21:01 +08:00
Concedo
9203b6a051 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.github/labeler.yml
#	.github/workflows/build-self-hosted.yml
#	.github/workflows/release.yml
#	.github/workflows/server-sanitize.yml
#	.github/workflows/server-self-hosted.yml
#	.github/workflows/server.yml
#	.github/workflows/ui-build.yml
#	.github/workflows/ui-ci.yml
#	.github/workflows/ui-publish.yml
#	.gitignore
#	CMakeLists.txt
#	CODEOWNERS
#	scripts/ui-download.cmake
#	scripts/xxd.cmake
#	tests/test-backend-ops.cpp
#	tests/test-reasoning-budget.cpp
#	tools/CMakeLists.txt
#	tools/server/CMakeLists.txt
#	tools/server/README.md
2026-05-16 22:56:33 +08:00
Concedo
3095da076a only fetch new popped horde requests if model is not blocked queue 2026-05-16 22:27:12 +08:00
Concedo
80ce8a50b3 allow token bans and eos handling in 2026-05-16 15:20:46 +08:00
Wagner Bruna
f273fd35b9
sd: sync to master-601-eeac950 (#2206)
* sd: sync to master-601-eeac950

* sd: add mmap support
2026-05-16 11:23:10 +08:00
Concedo
77fa2cd348 batching horde worker adjustments 2026-05-16 00:30:23 +08:00
Concedo
35f524d3e2 horde advertise more threads when batching is enabled 2026-05-15 17:36:53 +08:00
Reithan
5962bca463
Fix jinja error on case-insensitive roles and 0-len messages result (#2201)
* fix jinja error on case-insensitive roles and 0-len messages result

* check length in correct place
2026-05-15 16:48:42 +08:00
Concedo
1fe1a083cd run multiple horde workers if used with batching. 2026-05-14 23:36:42 +08:00
Concedo
286e62267e adjust batching eligibility 2026-05-11 21:54:32 +08:00
Concedo
bfaddd7a3b added support for added memory and gemma and glm prompt fixes for batching mode 2026-05-10 23:39:03 +08:00
Concedo
33ca75d56f ci for tools upload, minor function reordering 2026-05-10 23:10:43 +08:00
AlpinDale
c03302b670
feat: add a primitive form of continuous batching (#2167)
* feat: add a primitive form of continuous batching

* fix: deadlock in batching fallback

* fix: windows build

* chore: suppress the contbatch arg from --help

* feat: batch-aware rep_pen_slope

* fix: automatically disable shifting when batching is enabled

* fix: mixed-path state corruption

* fix: attempt to fully separate the two pipelines

* added a semaphore to prevent non-batchable requests from starting while batched requests are running

---------

Co-authored-by: Concedo <39025047+LostRuins@users.noreply.github.com>
2026-05-10 17:50:31 +08:00
Concedo
7a2f653451 falsy value handling on load config 2026-05-07 23:42:44 +08:00
Concedo
15e86c4f9b hard coded reasoning_effort field from the api payload and force it into the jinja kwargs (request by @henk717). field name also hardcoded. 2026-05-06 17:35:26 +08:00
Tai An
24495f6c48
docs(args): clarify --debugmode level semantics in help text (#2181)
Closes #2178

The --debugmode help string previously read "Shows additional debug
info in the terminal" with no indication of what numeric values it
accepts or what each does — making the recommended troubleshooting
flag opaque (per #2178).

Document the three values actually checked in the source:
  -1: Horde-quiet (suppresses non-essential prints; auto-applied
       when --horde* args are set, see configure_horde_settings)
   0: default
   1: verbose (extra slot/cache info; larger utfprint buffer;
       retains 'debug-' horde model prefix; etc.)

Also note that bare --debugmode (no value) implies 1, which is the
existing argparse behavior (nargs='?', const=1) but easy to miss.
2026-05-03 16:06:13 +08:00
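The argparse behavior described in the entry above (nargs='?', const=1) can be illustrated with a small standalone sketch. This is not the project's actual argument definition: the `type=int`, `default=0`, and help string below are assumptions made for illustration, and only `nargs='?'` and `const=1` come from the commit message.

```python
import argparse

# Standalone sketch: only nargs='?' and const=1 are taken from the commit
# message; type=int, default=0 and the help text are illustrative assumptions.
parser = argparse.ArgumentParser(prog="example")
parser.add_argument(
    "--debugmode",
    type=int,
    nargs="?",   # the value is optional
    const=1,     # used when the flag is given with no value
    default=0,   # used when the flag is absent
    help="-1: quiet, 0: default, 1: verbose (bare flag implies 1)",
)

omitted = parser.parse_args([]).debugmode
bare = parser.parse_args(["--debugmode"]).debugmode
quiet = parser.parse_args(["--debugmode=-1"]).debugmode

print(omitted, bare, quiet)
```

The `--debugmode=-1` form is used here so the negative value is attached to the flag unambiguously.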
Concedo
676e716ce3 try to handle duplicate think tags by swallowing them 2026-05-03 16:02:38 +08:00
Concedo
9be810628e setenv return int 2026-05-03 13:32:05 +08:00
Concedo
2fb97d9c2c explicitly set env var internally. 2026-05-03 13:18:50 +08:00
Wagner Bruna
25fab4113e
refactor: handle GGML_VK_VISIBLE_DEVICES at the Python level (#2179)
All C++ handling code currently:
- build a comma-separated list from the info_vulkan array
- if GGML_VK_VISIBLE_DEVICES isn't set
  - set GGML_VK_VISIBLE_DEVICES to the list

Once set, GGML_VK_VISIBLE_DEVICES affects the whole process. So this
can be done in the same way at the Python level, before all loading
functions.

Caveat: load_model had the default `inputs.vulkan_info = "0"`,
so the default GPU would be "0" only when loading a text model.
2026-05-02 23:10:29 +08:00
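The logic summarized in the entry above (build a comma-separated list, then set GGML_VK_VISIBLE_DEVICES only if it is not already set) might look roughly like this at the Python level. The function name `apply_vulkan_visible_devices`, its parameters, and the choice to leave the variable unset for an empty list are hypothetical; only the environment variable name and the set-if-unset rule come from the commit message.

```python
import os

ENV_NAME = "GGML_VK_VISIBLE_DEVICES"

def apply_vulkan_visible_devices(device_ids, env=os.environ):
    """Hypothetical helper: set ENV_NAME from device_ids unless already set.

    Returns the value in effect afterwards, or None if nothing was set.
    """
    if ENV_NAME in env:
        # An existing value wins; it already affects the whole process.
        return env[ENV_NAME]
    if not device_ids:
        # Assumption: with no devices selected, leave the variable unset.
        return None
    value = ",".join(str(d) for d in device_ids)
    env[ENV_NAME] = value
    return value

# Usage with plain dicts standing in for the process environment.
fresh_env = {}
result_fresh = apply_vulkan_visible_devices([0, 2], fresh_env)

preset_env = {ENV_NAME: "1"}
result_preset = apply_vulkan_visible_devices([0, 2], preset_env)
```

Because the variable is process-wide, a helper like this would need to run before any backend library is loaded, which matches the "before all loading functions" note in the commit message.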
Concedo
42ce63fd3b allow customizing multiuser queue in gui 2026-05-02 18:25:50 +08:00
Concedo
8b62e7b667 allow splitmode to be set independently, enable tensor parallelism 2026-05-02 16:41:28 +08:00
Concedo
7e98e06075 improved lora dir selection via gui 2026-05-02 10:51:35 +08:00
Concedo
b18a250205 handle raw args for model_param 2026-05-02 00:21:45 +08:00
Concedo
ef79904628 added a fix to make description optional in rosie's tool repack 2026-04-30 17:32:34 +08:00
Concedo
029cc3ad99 don't save deprecated args 2026-04-30 16:27:02 +08:00
Concedo
7bd95eb505 routermodetimeout -> reqtimeout, add to gui 2026-04-29 21:55:07 +08:00
Tai An
dfd87c4fb6
feat(router): add --routermodetimeout to make reverse-proxy timeout configurable (#2169)
Closes the hardcoded 600s timeout in the router-mode reverse proxy: long
generations through --routermode would be cut off at the upstream
HTTPConnection timeout regardless of how long the model actually takes,
because http.client.HTTPConnection('localhost', upstream_port, timeout=600)
was wired with a literal 600.

Adds a new --routermodetimeout (default 600) under the admin group, and
threads it through the three HTTPConnection sites in the router handler:
the model-swap reload, the autoswap reload, and the main upstream proxy
forward. Behavior is unchanged at the default; users with long generations
can now pass e.g. --routermodetimeout 3600.

Reported in https://github.com/LostRuins/koboldcpp/issues/2168
2026-04-29 20:20:42 +08:00
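A minimal sketch of the change described in the entry above: replacing the literal 600 in `http.client.HTTPConnection('localhost', upstream_port, timeout=600)` with a value read from the command line. The helper name `make_upstream_connection` and port 5001 are hypothetical, and the real flag sits under an admin argument group that is omitted here. Note that the log entry for 7bd95eb505 later renames this flag to reqtimeout.

```python
import argparse
import http.client

def make_upstream_connection(upstream_port, timeout_seconds):
    # Hypothetical helper: one place to build the upstream connection so
    # every call site shares the same configurable timeout.
    return http.client.HTTPConnection(
        "localhost", upstream_port, timeout=timeout_seconds
    )

parser = argparse.ArgumentParser(prog="example")
parser.add_argument("--routermodetimeout", type=int, default=600)

# Default behavior is unchanged; long generations can raise the limit.
default_args = parser.parse_args([])
long_args = parser.parse_args(["--routermodetimeout", "3600"])

# Constructing HTTPConnection does not open a socket; 5001 is a placeholder.
conn = make_upstream_connection(5001, long_args.routermodetimeout)
```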
Concedo
9eaed2ec32 make musicui accessible to screen readers 2026-04-27 19:43:51 +08:00
Concedo
f679e3fec5 fix missing ipv4 support 2026-04-26 14:44:26 +08:00
Concedo
929f214bf6 updated docs, handle seed oss thinking 2026-04-25 22:44:40 +08:00
Wagner Bruna
c04832bb2b
sd: add eta support (#2164) 2026-04-25 19:04:13 +08:00
Concedo
18a3bedf63 fixed a deadlock 2026-04-25 19:03:03 +08:00
Concedo
4090400dff improved gemma toolcall handling 2026-04-25 09:51:29 +08:00
Concedo
cfb14bd844 fixed more args 2026-04-23 11:11:24 +08:00
Concedo
68e238857f fixed args 2026-04-23 11:00:42 +08:00
Concedo
c818716f57 router mode fixed for parallel requests 2026-04-21 22:33:46 +08:00
Concedo
96ec87127a updated colab, handle connection dropping during prompt processing 2026-04-21 21:46:13 +08:00
Concedo
1feba4e4ea fixed koboldcpp.sh, fixed vision max/min when one param is missing, fixed processing count wrong, updated lite 2026-04-21 18:36:47 +08:00
Concedo
c17ba99812 change time.sleep to asyncio 2026-04-20 23:25:35 +08:00
Concedo
fe4c1b80a1 fix unwanted error print 2026-04-20 13:48:57 +08:00
Concedo
a8290a072f more robust json field handling 2026-04-19 23:27:19 +08:00
Concedo
707bb67b30 minimal uses 10% of budget 2026-04-19 20:19:45 +08:00
Concedo
71b4107bb6 fixed terminal logs 2026-04-19 11:31:12 +08:00
Concedo
8886e48a4a cache sd info 2026-04-19 02:19:11 +08:00
Wagner Bruna
1be08b9d15
sd: report all sampler aliases and centralize name mapping (#2149)
* debug: allow loading backend libraries without normal arg parsing

This is just to be able to test backend functions directly, with e.g.:

>> import koboldcpp
>> koboldcpp.init_libraries()
>> koboldcpp.sd_get_info()

* sd: report all sampler aliases and centralize name mapping
2026-04-19 01:51:42 +08:00
Concedo
e5eab545f3 handle override jinja template 2026-04-19 00:30:28 +08:00
Concedo
17c754a5fc improved reasoning budget 2026-04-18 17:19:09 +08:00
Concedo
0b37cb9a57 added preliminary support for reasoning budget 2026-04-18 11:56:33 +08:00