* feat: add a primitive form of continuous batching
* fix: deadlock in batching fallback
* fix: windows build
* chore: suppress the contbatch arg from --help
* feat: batch-aware rep_pen_slope
* fix: automatically disable shifting when batching is enabled
* fix: mixed-path state corruption
* fix: attempt to fully separate the two pipelines
* added a semaphore to prevent non-batchable requests from starting while batched requests are running (see the sketch below)
---------
Co-authored-by: Concedo <39025047+LostRuins@users.noreply.github.com>
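The gate mentioned above amounts to roughly the following pattern (a minimal sketch, not the actual koboldcpp code; all names are illustrative, and this version makes the exclusion mutual in both directions):

```python
import threading

batch_gate = threading.Semaphore(1)   # held while any batched work is in flight
active_batched = 0
batched_lock = threading.Lock()

def begin_batched_request():
    global active_batched
    with batched_lock:
        if active_batched == 0:
            batch_gate.acquire()      # first batched request closes the gate
        active_batched += 1

def end_batched_request():
    global active_batched
    with batched_lock:
        active_batched -= 1
        if active_batched == 0:
            batch_gate.release()      # last batched request reopens the gate

def run_non_batchable(job):
    with batch_gate:                  # blocks while batched requests are running
        return job()
```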
Closes #2178
The --debugmode help string previously read "Shows additional debug
info in the terminal" with no indication of what numeric values it
accepts or what each does — making the recommended troubleshooting
flag opaque (per #2178).
Document the three values actually checked in the source:
-1: Horde-quiet (suppresses non-essential prints; auto-applied
when --horde* args are set, see configure_horde_settings)
0: default
1: verbose (extra slot/cache info; larger utfprint buffer;
retains 'debug-' horde model prefix; etc.)
Also note that bare --debugmode (no value) implies 1, which is the
existing argparse behavior (nargs='?', const=1) but easy to miss.
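For reference, the documented behavior corresponds to an argparse definition along these lines (a sketch: nargs='?' and const=1 are from the existing definition, while the type/default and exact help wording here are illustrative rather than quoted from koboldcpp.py):

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--debugmode",
    type=int, nargs='?', const=1, default=0,
    help="Shows additional debug info in the terminal. "
         "-1 = quiet mode for Horde (suppresses non-essential prints; "
         "auto-applied when --horde* args are set), 0 = default, "
         "1 = verbose. Passing --debugmode with no value implies 1.",
)

print(parser.parse_args([]).debugmode)                      # 0 (default)
print(parser.parse_args(["--debugmode"]).debugmode)         # 1 (bare flag -> const)
print(parser.parse_args(["--debugmode", "-1"]).debugmode)   # -1 (horde-quiet)
```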
All the C++ handling code currently does is:
- build a comma-separated list from the info_vulkan array
- if GGML_VK_VISIBLE_DEVICES isn't set, set it to that list
Once set, GGML_VK_VISIBLE_DEVICES affects the whole process, so the same
thing can be done at the Python level, before any of the loading
functions are called.
Caveat: load_model had the default `inputs.vulkan_info = "0"`,
so the default GPU would be "0" only when loading a text model.
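At the Python level that boils down to something like this (a sketch; the function name is illustrative, and it assumes the comma-separated device list has already been built from the user's selection):

```python
import os

def apply_vulkan_visible_devices(device_list: str):
    # device_list is the comma-separated Vulkan device index list, e.g. "0" or "0,2".
    # Mirror the C++ behavior: only set the variable if it isn't already set,
    # so an explicit user-provided GGML_VK_VISIBLE_DEVICES still wins.
    if device_list and "GGML_VK_VISIBLE_DEVICES" not in os.environ:
        os.environ["GGML_VK_VISIBLE_DEVICES"] = device_list

# Must run before any of the loading functions, since the variable is read when
# the Vulkan backend initializes and then applies to the whole process.
apply_vulkan_visible_devices("0")
```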
Fixes the hardcoded 600s timeout in the router-mode reverse proxy: long
generations through --routermode would be cut off at the upstream
HTTPConnection timeout regardless of how long the model actually takes,
because http.client.HTTPConnection('localhost', upstream_port, timeout=600)
was wired with a literal 600.
Adds a new --routermodetimeout (default 600) under the admin group, and
threads it through the three HTTPConnection sites in the router handler:
the model-swap reload, the autoswap reload, and the main upstream proxy
forward. Behavior is unchanged at the default; users with long generations
can now pass e.g. --routermodetimeout 3600.
Reported in https://github.com/LostRuins/koboldcpp/issues/2168
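The change at each of the three connection sites is mechanical; roughly (a sketch with the surrounding handler code and names simplified, not the exact diff):

```python
import http.client

def proxy_to_upstream(upstream_port, method, path, body, headers,
                      routermodetimeout=600):
    # Previously: http.client.HTTPConnection('localhost', upstream_port, timeout=600)
    # Now the timeout comes from the --routermodetimeout value (default still 600).
    conn = http.client.HTTPConnection("localhost", upstream_port,
                                      timeout=routermodetimeout)
    conn.request(method, path, body=body, headers=headers)
    resp = conn.getresponse()
    return resp.status, resp.read()
```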
* debug: allow loading backend libraries without normal arg parsing
This is just to be able to test backend functions directly, with e.g.:
>>> import koboldcpp
>>> koboldcpp.init_libraries()
>>> koboldcpp.sd_get_info()
* sd: report all sampler aliases and centralize name mapping
* Pass img_min_params and img_max_params to ctx_clip_params
These values determine the minimum and maximum size (in
tokens) of vision embeddings. The default value of -1
uses a model-dependent default size; for example, for
Gemma 4 the default is a 280-token embedding. For higher
quality results (at the cost of more memory and slower
generation) you can increase the size of the embedding
to 1120 tokens.
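As an illustration of the intended trade-off (a self-contained sketch; in the real code these values live on the load-inputs struct and are forwarded to ctx_clip_params on the C++ side, and the container type here is made up):

```python
from dataclasses import dataclass

@dataclass
class VisionEmbeddingConfig:
    img_min_params: int = -1   # -1 = model-dependent default minimum (in tokens)
    img_max_params: int = -1   # -1 = model-dependent default maximum (e.g. 280 tokens)

# Trade memory and speed for quality: allow vision embeddings up to 1120 tokens.
cfg = VisionEmbeddingConfig(img_max_params=1120)
print(cfg)
```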
* Change dict to mydict to match the change to the method