* Fix music generation token stopping for quantized models
In Phase 1 lyrics mode, the FSM transitions to CODES state after
TOKEN_THINK_END and disables itself. The quantized Q4_K_M model was
not reliably generating TOKEN_IM_END to stop the generation,
causing it to continue until hitting the 8192 token limit.
This fix forces TOKEN_IM_END to be generated immediately after
TOKEN_THINK_END in lyrics mode, ensuring clean completion of the
planning phase without excessive token generation.
Testing shows generation now completes in ~500ms instead of 80+
seconds with timeout errors.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
* Clarify comment - fix applies to all models, not just quantized
* Improve fix: only force TOKEN_IM_END at token limit
Instead of forcing TOKEN_IM_END immediately after TOKEN_THINK_END,
only force it when we've reached the token limit. This allows the model
to generate lyrics after the thinking block while still preventing KV
cache exhaustion.
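
A minimal sketch of the "force only at the limit" idea, not the actual
implementation. The token ids, the `generate` function and the
`sample_next` callback are hypothetical names used for illustration:

```python
from typing import Callable, List

# Hypothetical token ids; the real values come from the model vocabulary.
TOKEN_THINK_END = 1001
TOKEN_IM_END = 1002


def generate(sample_next: Callable[[List[int]], int], max_tokens: int) -> List[int]:
    """Sample tokens until TOKEN_IM_END, forcing it on the last free slot."""
    out: List[int] = []
    while len(out) < max_tokens:
        if len(out) == max_tokens - 1:
            # Token budget is about to run out: emit the end token
            # instead of sampling, so generation always terminates cleanly.
            out.append(TOKEN_IM_END)
            break
        tok = sample_next(out)
        out.append(tok)
        if tok == TOKEN_IM_END:
            # The model stopped on its own (e.g. after writing lyrics).
            break
    return out
```

With this shape, a model that emits TOKEN_IM_END by itself is unaffected,
and a model that never emits it still ends with TOKEN_IM_END at the limit.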
---------
Co-authored-by: Claude <noreply@anthropic.com>
* sd: remove C++ support for enforcing fixed LoRA multipliers
The logic at the Python level is enough.
* sd: support changing preloaded LoRA multipliers
We keep the same rules as before:
- Any LoRA with multiplier 0 can be changed
- If all LoRAs have multiplier != 0, they are fixed and optimized
The one change is the corner case of a LoRA specified more than once:
its multiplier can now be adjusted if the same LoRA is also specified
with a zero multiplier, as if they were two different LoRAs.
So the following keeps working as before:
- --sdlora /loras/lcm.gguf --sdloramult 1 : fixed as 1
- --sdlora /loras/lcm.gguf --sdloramult 0 : dynamic, default 0
- --sdlora /loras/ : dynamic, default 0
- --sdlora /loras/lcm.gguf /loras/lcm.gguf --sdloramult 1 1 : fixed as 2
But now we have:
- --sdlora /loras/lcm.gguf /loras/lcm.gguf --sdloramult 1 0 : dynamic, default 1
- --sdlora /loras/lcm.gguf /loras/ --sdloramult 1 : dynamic, default 1
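
The single-file cases listed above can be modelled roughly as follows.
This is a sketch under assumptions, not the real code: `resolve_loras` is
a hypothetical helper, directory expansion is left out, and repeated
entries are assumed to add up their multipliers, as the "fixed as 2"
case suggests:

```python
from typing import Dict, List, Tuple


def resolve_loras(paths: List[str], mults: List[float]) -> Dict[str, Tuple[float, bool]]:
    """Map each LoRA path to (default multiplier, is_dynamic)."""
    resolved: Dict[str, Tuple[float, bool]] = {}
    for path, mult in zip(paths, mults):
        default, dynamic = resolved.get(path, (0.0, False))
        # Repeated entries accumulate; any zero entry makes the LoRA dynamic.
        resolved[path] = (default + mult, dynamic or mult == 0)
    return resolved
```

For example, `resolve_loras(["lcm.gguf", "lcm.gguf"], [1, 0])` marks
`lcm.gguf` as dynamic with a default of 1.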
* backend support for controlling LoRA cache and fixed multipliers
The generation LoRA multipliers are now added to the initial
multipliers, so e.g. a merged LCM model will behave the same as
a normal model with a preloaded LCM LoRA.
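
As a rough illustration of the additive behavior (the function name and
the dict layout are assumptions made for this sketch, and treating an
unknown name as starting from 0 is also an assumption):

```python
from typing import Dict


def effective_multipliers(initial: Dict[str, float], requested: Dict[str, float]) -> Dict[str, float]:
    """Add per-generation multipliers on top of the preloaded ones."""
    combined = dict(initial)
    for name, mult in requested.items():
        # Assumption: a name without an initial multiplier starts from 0.
        combined[name] = combined.get(name, 0.0) + mult
    return combined
```

So a preloaded LCM LoRA at 1.0 stays at 1.0 when a request only touches
other LoRAs, which matches what a merged LCM model would do.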
* frontend support
* sd: sync to master-525-d6dd6d7
* sd: add support for cache modes for inference acceleration
* keep gendefaults as a JSON object inside the config file
* covered more invalid cases on gendefaults parsing
* fix corner case in sd_oai_transform_params
Also fix typo in the function name.
* support for customizing loaded LoRA multipliers
The `sdloramult` flag now accepts a list of multipliers, one for each
LoRA. If all multipliers are non-zero, LoRAs load as before, with no extra
VRAM usage or performance impact.
If any LoRA has a multiplier of 0, we switch to `at_runtime` mode, and these
LoRAs will be available for multiplier changes via the `lora` sdapi field and
show up in the `sdapi/v1/loras` endpoint. All LoRAs are still preloaded on
startup, and cached to avoid file reloads.
If the list of multipliers is shorter than the list of LoRAs, the multiplier
list is extended with the first multiplier (1.0 by default), to keep it
compatible with the previous behavior.
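
The padding rule and the mode switch could look roughly like this.
`extend_multipliers` is a hypothetical name (it is not claimed to be the
actual `sanitize_lora_multipliers` helper), and falling back to 1.0 for
an empty list is an assumption based on the stated default:

```python
from typing import List, Tuple


def extend_multipliers(num_loras: int, mults: List[float]) -> Tuple[List[float], bool]:
    """Pad the multiplier list and report whether at_runtime mode is needed."""
    first = mults[0] if mults else 1.0  # assumed default when nothing is given
    missing = max(0, num_loras - len(mults))
    padded = list(mults) + [first] * missing
    # Any zero multiplier means that LoRA must stay adjustable at runtime.
    at_runtime = any(m == 0 for m in padded)
    return padded, at_runtime
```

With three LoRAs and `--sdloramult 0.5`, this yields `[0.5, 0.5, 0.5]`
and keeps the optimized, fixed-multiplier path.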
* support for `<lora:name:multiplier>` prompt syntax and metadata
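A sketch of how such tags might be pulled out of a prompt. The regex,
the `extract_lora_tags` name and the choice to strip the tags and
normalize whitespace are assumptions for illustration, not the actual
parser:

```python
import re
from typing import List, Tuple

# Matches <lora:name:multiplier>, e.g. <lora:lcm:0.8>
LORA_TAG = re.compile(r"<lora:([^:>]+):([-+]?\d*\.?\d+)>")


def extract_lora_tags(prompt: str) -> Tuple[str, List[Tuple[str, float]]]:
    """Return the prompt without LoRA tags, plus (name, multiplier) pairs."""
    loras = [(m.group(1), float(m.group(2))) for m in LORA_TAG.finditer(prompt)]
    cleaned = LORA_TAG.sub("", prompt)
    # Collapse the whitespace left behind by removed tags.
    cleaned = " ".join(cleaned.split())
    return cleaned, loras
```

Malformed tags (for example a missing multiplier) are left in the prompt
untouched by this sketch.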
* add a few tests for sanitize_lora_multipliers
* sd: sync to master-509-4cdfff5
* sd: Anima support
* sd: sync to master-514-5792c66
* sd: additional workaround for Anima .safetensors model
* sd: sync to master-517-ba35dd7
* sd: sync to master-520-d950627
* tweak format string types
This may not be all of them, but these are the ones that warn on OpenBSD
* complete the changes needed to fix the format string specifiers
* avoid using inttypes, directly cast to size_t (u64 usually) instead
---------
Co-authored-by: Concedo <39025047+LostRuins@users.noreply.github.com>