* feat: add a primitive form of continuous batching
* fix: deadlock in batching fallback
* fix: windows build
* chore: suppress the contbatch arg from --help
* feat: batch-aware rep_pen_slope
* fix: automatically disable shifting when batching is enabled
* fix: mixed-path state corruption
* fix: attempt to fully separate the two pipelines
* fix: add a semaphore to prevent non-batchable requests from starting while batched requests are running
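
A minimal sketch of that gating idea (illustrative only, not the actual koboldcpp code; the function and variable names here are invented): batched requests register themselves as in-flight, and a non-batchable request waits until the in-flight count drops to zero before it runs exclusively.

```python
import threading
import time

# Hypothetical sketch of the gate described above.
_batched_in_flight = 0
_gate = threading.Condition()

def run_batched(request):
    """Batched pipeline: register as in-flight so exclusive requests wait."""
    global _batched_in_flight
    with _gate:
        _batched_in_flight += 1
    try:
        time.sleep(0.1)  # stand-in for batched generation work
    finally:
        with _gate:
            _batched_in_flight -= 1
            _gate.notify_all()

def run_non_batchable(request):
    """Legacy pipeline: start only once all batched work has drained."""
    with _gate:
        _gate.wait_for(lambda: _batched_in_flight == 0)
        time.sleep(0.1)  # stand-in for sequential generation work
```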
---------
Co-authored-by: Concedo <39025047+LostRuins@users.noreply.github.com>
The C++ handling code currently:
- builds a comma-separated list from the info_vulkan array
- if GGML_VK_VISIBLE_DEVICES isn't set, sets GGML_VK_VISIBLE_DEVICES to that list
Once set, GGML_VK_VISIBLE_DEVICES affects the whole process, so the same
thing can be done at the Python level, before any of the loading
functions run.
Caveat: load_model had the default `inputs.vulkan_info = "0"`,
so the default GPU would be "0" only when loading a text model.
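
A minimal sketch of doing this at the Python level, assuming the selected device indices are already available as a Python list (names here are illustrative, not the actual koboldcpp variables):

```python
import os

def restrict_vulkan_devices(selected_devices):
    """Mirror the C++ behaviour: build a comma-separated device list and
    export it before the backend is loaded, unless the user already set it."""
    # e.g. selected_devices = [0, 2]  ->  "0,2"
    device_list = ",".join(str(d) for d in selected_devices)
    # setdefault keeps a user-provided GGML_VK_VISIBLE_DEVICES untouched.
    os.environ.setdefault("GGML_VK_VISIBLE_DEVICES", device_list)

# Must run before any of the loading functions, so the backend sees the
# variable when it initializes.
restrict_vulkan_devices([0])
```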
Previously, logprobs only contained the token string
and byte data, as well as the log probability itself.
For workflows that require the token id, translating
from the token bytes to the token id is potentially
costly and unreliable. It is simple and inexpensive
to expose the numeric token ids directly instead.
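
An illustrative sketch of what a logprobs entry looks like to a consumer once the id is exposed (field names and values are assumptions, not the exact API schema):

```python
# Illustrative shape only; the real field names in the response may differ.
logprob_entry = {
    "token": "Hello",                    # token string (present before this change)
    "bytes": [72, 101, 108, 108, 111],   # raw byte data (present before this change)
    "logprob": -0.0123,                  # log probability (present before this change)
    "token_id": 9906,                    # hypothetical example of the id now exposed
}

# A client that needs ids can read them directly instead of re-tokenizing the bytes:
token_id = logprob_entry["token_id"]
```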
Squashed commit:
[0a6306ca0] draft wip dont use (will be squashed)
[a758a1c9c] wip dont use (will be squashed)
[e1994d3ce] wip dont use
[f59690d68] wip
[77228147d] wip on spec decoding. dont use yet
[2445bca54] wip adding speculative decoding (+1 squashed commits)
Squashed commits:
[50e341bb7] wip adding speculative decoding
* API: add an /extra/chat_template route
A lot of manual tweaking is done when swapping between models. We can automate some of it, or at least make better assumptions, by having more information available, such as the chat template. This PR adds an /extra/chat_template endpoint that returns the model's chat template string as-is under a 'chat_template' key. The front end can then use it to derive the proper templates, use it directly, or at least warn the user when they are trying to use e.g. a Mistral preset with a Llama 3.1 model.
* switch to pre-established /props endpoint for chat template
* bug-fix (upstream): off-by-one in string juggling
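
A minimal usage sketch, assuming the template is exposed under a 'chat_template' key on the /props endpoint as described above (the base URL and fallback behaviour are illustrative):

```python
import json
import urllib.request

def fetch_chat_template(base_url="http://localhost:5001"):
    """Ask the server for the model's chat template so the front end can pick
    (or warn about) a matching prompt preset."""
    with urllib.request.urlopen(f"{base_url}/props") as resp:
        props = json.load(resp)
    # Empty string if the model ships no template; the UI can fall back to a preset.
    return props.get("chat_template", "")

template = fetch_chat_template()
if not template:
    print("Model has no embedded chat template; using the selected preset as-is.")
```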