* sd: generalize internal interfaces to place generation on CPU
* sd: backend support for multi-device selection
* sd: frontend support for multi-device selection
* add deprecated flags to avoid breaking old cli args
---------
Co-authored-by: Concedo <39025047+LostRuins@users.noreply.github.com>
* sd: sync to master-593-3d6064b
* sd: use the same sdtype_adapter object for all builds
Since master-592-b8079e2, no sd.cpp source depends on the ggml
backend build anymore.
* sd: fix main_gpu selection
* sd: report backend devices to the Python layer
All of the C++ handling code currently does the same thing:
- build a comma-separated list from the info_vulkan array
- if GGML_VK_VISIBLE_DEVICES isn't set, set it to that list
Once set, GGML_VK_VISIBLE_DEVICES affects the whole process. So this
can be done in the same way at the Python level, before all loading
functions.
Caveat: load_model had the default `inputs.vulkan_info = "0"`,
so the default GPU would be "0" only when loading a text model.
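A minimal sketch of the Python-level equivalent, for illustration only.
The helper name `apply_vulkan_visible_devices` and its `environ` parameter
are hypothetical and not part of koboldcpp; only the
GGML_VK_VISIBLE_DEVICES variable comes from the description above.

```python
import os

ENV_NAME = "GGML_VK_VISIBLE_DEVICES"

def apply_vulkan_visible_devices(device_ids, environ=None):
    """Hypothetical helper: mirror the C++ logic described above.

    Builds a comma-separated list from device_ids and stores it in
    GGML_VK_VISIBLE_DEVICES, but only if the variable isn't set yet.
    Returns the value in effect afterwards (None if still unset).
    """
    env = os.environ if environ is None else environ
    if device_ids and ENV_NAME not in env:
        # The variable is process-wide, so set it before any model loads.
        env[ENV_NAME] = ",".join(str(d) for d in device_ids)
    return env.get(ENV_NAME)
```

Passing a plain dict as `environ` allows the logic to be checked without
touching the real process environment.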
* sd: remove sampler alias handling from the C++ layer
It's already handled at the Python layer.
* sd: sync to master-580-7d33d4b
* sd: sync to master-582-7023fc4
* debug: allow loading backend libraries without normal arg parsing
This is just to be able to test backend functions directly, with e.g.:
>>> import koboldcpp
>>> koboldcpp.init_libraries()
>>> koboldcpp.sd_get_info()
* sd: report all sampler aliases and centralize name mapping
* Pass img_min_params and img_max_params to ctx_clip_params
These values determine the minimum and maximum size (in
tokens) of vision embeddings. The default value of -1
uses a model-dependent default size; for example, the
default for Gemma 4 is a 280-token embedding. For
higher-quality results (at the cost of more memory and
slower speed), you can increase the embedding size to
1120 tokens.
* Change dict to mydict to match change to method
* sd: sync to master-540-f16a110
* tae post-merge fixes
* build fixes
* restore image mask for non-inpainting models
* sd: sync to master-551-99c1de3
* avoid nlohmann/json.hpp include diffs
* Euler A now works on Flux
* sd: sync to master-555-7397dda
avi_writer.h got removed upstream, but I've simply kept the local
copy for now.
* sd: sync to master-558-8afbeb6
* sd: sync to master-560-e8323ca
* Fix music generation token stopping for quantized models
In Phase 1 lyrics mode, the FSM transitions to CODES state after
TOKEN_THINK_END and disables itself. The quantized Q4_K_M model was
not reliably generating TOKEN_IM_END to stop the generation,
causing it to continue until hitting the 8192 token limit.
This fix forces TOKEN_IM_END to be generated immediately after
TOKEN_THINK_END in lyrics mode, ensuring clean completion of the
planning phase without excessive token generation.
Testing shows generation now completes in ~500ms, instead of taking
80+ seconds and ending in timeout errors.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
* Clarify comment - fix applies to all models, not just quantized
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
* Improve fix: only force TOKEN_IM_END at token limit
Instead of forcing TOKEN_IM_END immediately after TOKEN_THINK_END,
only force it when we've reached the token limit. This allows the model
to generate lyrics after the thinking block while still preventing KV
cache exhaustion.
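A rough sketch of the idea, not the actual implementation: the function
name, the list-of-floats logits, and all parameters below are assumptions
made for illustration. Below the limit the logits pass through unchanged;
at the limit every token except the end token is masked out.

```python
import math

def force_end_at_limit(logits, tokens_generated, token_limit, end_token_id):
    """Hypothetical helper: force the end token once the limit is reached."""
    if tokens_generated < token_limit:
        # Under the limit: leave the model free to keep generating lyrics.
        return list(logits)
    # At or past the limit: only end_token_id remains selectable.
    forced = [-math.inf] * len(logits)
    forced[end_token_id] = 0.0
    return forced
```

With any sampler, the masked output can only select `end_token_id`, so
generation stops before the KV cache is exhausted.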
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
---------
Co-authored-by: Claude <noreply@anthropic.com>