* sd: sync to master-593-3d6064b
* sd: use the same sdtype_adapter object for all builds
Since master-592-b8079e2, no sd.cpp source depends on the ggml
backend build anymore.
* sd: fix main_gpu selection
* sd: report backend devices to the Python layer
The C++ handling code currently:
- builds a comma-separated list from the info_vulkan array
- sets GGML_VK_VISIBLE_DEVICES to that list, if GGML_VK_VISIBLE_DEVICES
  isn't already set
Once set, GGML_VK_VISIBLE_DEVICES affects the whole process, so the
same can be done at the Python level, before any of the loading
functions run.
Caveat: load_model had the default `inputs.vulkan_info = "0"`,
so the default GPU would be "0" only when loading a text model.
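The device-selection step described above can be sketched at the Python level as follows; `select_vulkan_devices` is a hypothetical helper name, and `info_vulkan` stands for the device list reported by the backend:

```python
import os

def select_vulkan_devices(info_vulkan: list) -> None:
    """Mirror the former C++ behaviour at the Python level.

    If GGML_VK_VISIBLE_DEVICES is not already set, build a
    comma-separated device list and export it. Once set, the
    variable affects the whole process, so this must run before
    any model-loading function.
    """
    if "GGML_VK_VISIBLE_DEVICES" not in os.environ:
        os.environ["GGML_VK_VISIBLE_DEVICES"] = ",".join(info_vulkan)
```

Because the variable is only exported when absent, an explicit user setting always wins over the generated list.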
* sd: remove sampler alias handling from the C++ layer
It's already handled at the Python layer.
* sd: sync to master-580-7d33d4b
* sd: sync to master-582-7023fc4
* debug: allow loading backend libraries without normal arg parsing
This is just to be able to test backend functions directly, with e.g.:
>>> import koboldcpp
>>> koboldcpp.init_libraries()
>>> koboldcpp.sd_get_info()
* sd: report all sampler aliases and centralize name mapping
* Pass img_min_params and img_max_params to ctx_clip_params
These values determine the minimum and maximum size (in
tokens) of vision embeddings. The default value of -1
uses a model-dependent default size; for Gemma 4, for
example, the default is a 280-token embedding. For
higher-quality results (at the cost of more memory and
slower speed) you can increase the embedding size to
1120 tokens.
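The -1 sentinel behaviour can be sketched as follows; `resolve_embedding_size` is a hypothetical helper, and `default_size` stands for the model-dependent default (e.g. 280 tokens):

```python
def resolve_embedding_size(requested: int, default_size: int) -> int:
    """Return the vision-embedding size in tokens.

    A requested value of -1 selects the model-dependent default;
    any positive value overrides it (e.g. 1120 for higher quality
    at the cost of more memory and slower speed).
    """
    if requested == -1:
        return default_size
    return requested
```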
* Rename dict to mydict to match the change to the method
* sd: sync to master-540-f16a110
* tae post-merge fixes
* build fixes
* restore image mask for non-inpainting models
* sd: sync to master-551-99c1de3
* avoid nlohmann/json.hpp include diffs
* Euler A now works on Flux
* sd: sync to master-555-7397dda
avi_writer.h got removed upstream, but I've simply kept the local
copy for now.
* sd: sync to master-558-8afbeb6
* sd: sync to master-560-e8323ca
* Fix music generation token stopping for quantized models
In Phase 1 lyrics mode, the FSM transitions to the CODES state after
TOKEN_THINK_END and disables itself. The quantized Q4_K_M model was
not reliably generating TOKEN_IM_END to stop generation, so it
continued until hitting the 8192-token limit.
This fix forces TOKEN_IM_END to be generated immediately after
TOKEN_THINK_END in lyrics mode, ensuring clean completion of the
planning phase without excessive token generation.
Testing shows generation now completes in ~500ms instead of 80+
seconds with timeout errors.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
* Clarify comment - fix applies to all models, not just quantized
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
* Improve fix: only force TOKEN_IM_END at token limit
Instead of forcing TOKEN_IM_END immediately after TOKEN_THINK_END,
only force it when we've reached the token limit. This allows the model
to generate lyrics after the thinking block while still preventing KV
cache exhaustion.
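The limit-based forcing described above can be sketched as a small sampling filter; `filter_token` and the token id are hypothetical names for illustration:

```python
TOKEN_IM_END = 2  # hypothetical token id for the end-of-message token

def filter_token(sampled: int, n_generated: int, token_limit: int) -> int:
    """Force TOKEN_IM_END only once the token limit is reached.

    This lets the model keep generating lyrics after the thinking
    block, while still guaranteeing a clean stop before the KV
    cache is exhausted.
    """
    if n_generated >= token_limit:
        return TOKEN_IM_END
    return sampled
```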
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
---------
Co-authored-by: Claude <noreply@anthropic.com>
* sd: remove C++ support for enforcing fixed LoRA multipliers
The logic at the Python level is enough.
* sd: support changing preloaded LoRA multipliers
We keep the same rules as before:
- Any LoRA with multiplier 0 can be changed
- If all LoRAs have a multiplier != 0, they are fixed and optimized
but tweak the corner case of LoRAs specified more than once: the
multiplier becomes adjustable if the same LoRA is also specified
with a zero multiplier, as if they were two different LoRAs.
So the following keeps working as before:
- --sdlora /loras/lcm.gguf --sdloramult 1 : fixed as 1
- --sdlora /loras/lcm.gguf --sdloramult 0 : dynamic, default 0
- --sdlora /loras/ : dynamic, default 0
- --sdlora /loras/lcm.gguf /loras/lcm.gguf --sdloramult 1 1 : fixed as 2
But now we have:
- --sdlora /loras/lcm.gguf /loras/lcm.gguf --sdloramult 1 0 : dynamic, default 1
- --sdlora /loras/lcm.gguf /loras/ --sdloramult 1 : dynamic, default 1
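The table of cases above can be sketched as a small resolver; `resolve_loras` is a hypothetical helper, and the assumptions are: repeated paths have their multipliers summed, a missing multiplier defaults to 0, and a path listed with a zero multiplier anywhere stays dynamic:

```python
from collections import defaultdict

def resolve_loras(paths, mults):
    """Map each LoRA path to (default multiplier, is_dynamic).

    - Multipliers of repeated paths are summed (fixed as 2 for
      the same LoRA listed twice with multiplier 1).
    - A path also listed with multiplier 0 (explicitly, or via a
      directory entry) remains dynamic, with the nonzero sum as
      its default.
    """
    totals = defaultdict(float)
    dynamic = set()
    for i, path in enumerate(paths):
        m = mults[i] if i < len(mults) else 0.0
        totals[path] += m
        if m == 0.0:
            dynamic.add(path)
    return {p: (totals[p], p in dynamic) for p in totals}
```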
* backend support for controlling LoRA cache and fixed multipliers
The generation LoRA multipliers are now added to the initial
multipliers, so e.g. a merged LCM model will behave the same as
a normal model with a preloaded LCM LoRA.
* frontend support
* sd: sync to master-525-d6dd6d7
* sd: add support for cache modes for inference acceleration
* keep gendefaults as a JSON object inside the config file
* covered more invalid cases on gendefaults parsing