* fix corner case in sd_oai_transform_params
Also fix typo in the function name.
* support for customizing loaded LoRA multipliers
The `sdloramult` flag now accepts a list of multipliers, one for each
LoRA. If all multipliers are non-zero, LoRAs load as before, with no extra
VRAM usage or performance impact.
If any LoRA has a multiplier of 0, we switch to `at_runtime` mode: these
LoRAs can then have their multipliers changed via the `lora` sdapi field,
and they show up in the `sdapi/v1/loras` endpoint. All LoRAs are still
preloaded on startup and cached to avoid file reloads.
If the list of multipliers is shorter than the list of LoRAs, the multiplier
list is extended with the first multiplier (1.0 by default), to keep it
compatible with the previous behavior.
* support for `<lora:name:multiplier>` prompt syntax and metadata
* add a few tests for sanitize_lora_multipliers
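A minimal sketch of the multiplier handling and prompt syntax described above. It is not the actual implementation: the helper names (`extend_multipliers`, `needs_runtime_mode`, `extract_lora_tags`), the regex, and the choice to strip tags from the prompt are all assumptions made for illustration.

```python
import re

def extend_multipliers(multipliers, lora_count, default=1.0):
    """Pad the list to one multiplier per LoRA (hypothetical helper).

    Missing entries are filled with the first multiplier, or with
    `default` when no multiplier was given at all.
    """
    mults = [float(m) for m in multipliers]
    fill = mults[0] if mults else default
    return mults + [fill] * (lora_count - len(mults))

def needs_runtime_mode(multipliers):
    """Any zero multiplier selects `at_runtime` mode (per the notes above)."""
    return any(m == 0 for m in multipliers)

# Assumed tag shape: <lora:name:multiplier>, with a plain decimal multiplier.
LORA_TAG = re.compile(r"<lora:([^:<>]+):([-+]?\d*\.?\d+)>")

def extract_lora_tags(prompt):
    """Return (prompt without tags, {name: multiplier})."""
    found = {}

    def _collect(match):
        found[match.group(1)] = float(match.group(2))
        return ""  # assumption: the tag is removed from the prompt text

    cleaned = LORA_TAG.sub(_collect, prompt)
    return cleaned.strip(), found
```

For example, `extend_multipliers([0.5], 3)` yields three entries of 0.5, and an empty list with two LoRAs yields two entries of 1.0, matching the backward-compatible default.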
* sd: sync to master-509-4cdfff5
* sd: Anima support
* sd: sync to master-514-5792c66
* sd: additional workaround for Anima .safetensors model
* sd: sync to master-517-ba35dd7
* sd: sync to master-520-d950627
* ggml-cuda: add mem check for fusion
* Replace NaNs with -FLT_MAX
* fix typo
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
---------
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
This patch addresses an Internal Compiler Error (Segmentation fault)
observed with GCC 15. Instead of calling the intrinsic and then casting,
it casts the data first and then calls the intrinsic. This bypasses the
buggy compiler path while maintaining identical instruction selection.
Performance Verification:
Assembly analysis on RHEL 9 (GCC 15.1.1) confirms that both the original
code and this fix generate the identical Power10 prefixed load instruction:
`plxv 40, 2(14)`
This ensures zero performance regression while unblocking builds on
newer toolchains.
Reproduced on:
- Alpine Linux + GCC 15.2.0-r2
- RHEL 9 + GCC 15.1.1 (gcc-toolset-15)
Signed-off-by: Shalini Salomi Bodapati <Shalini.Salomi.Bodapati@ibm.com>
* CUDA: use shared mem for ssm_conv
* fuse silu + ssm_conv
* fuse unary + mul
* enable for fp16
* formatting
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
---------
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* fix: token usage fix for mistral-vibe
* fix: generate unique request IDs for OAI-compatible responses
* fix: prompt_tokens reporting KV cache size instead of actual count during streaming
* fixes for PR #2015
For (1), this is not a good idea. If it returned 0 (e.g. during an error), this value may not be updated and will return the value of a previous or different request. It's better to return 0 in those cases.
For (2), this is a good idea, but we don't need that level of randomness. I'll probably swap it with a 6-digit random number instead.
For (3), the official OpenAI spec gates it behind stream_options.include_usage = true, so I'll do that too.
* missed 1 item
---------
Co-authored-by: Concedo <39025047+LostRuins@users.noreply.github.com>
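A rough sketch of points (2) and (3) from the review notes above, under stated assumptions: the function names, the `chatcmpl` prefix, and the chunk layout are illustrative and are not taken from the actual code. The only behaviors drawn from the notes are the 6-digit random ID and gating usage behind `stream_options.include_usage`.

```python
import random

def make_request_id(prefix="chatcmpl"):
    """Build a response ID with a 6-digit random suffix (hypothetical helper)."""
    return f"{prefix}-{random.randint(100000, 999999)}"

def final_stream_chunk(request, prompt_tokens, completion_tokens):
    """Build the last streamed chunk for a request.

    Usage is attached only when the client set
    stream_options.include_usage to true.
    """
    chunk = {
        "id": make_request_id(),
        "object": "chat.completion.chunk",
        "choices": [],
    }
    options = request.get("stream_options") or {}
    if options.get("include_usage") is True:
        chunk["usage"] = {
            "prompt_tokens": prompt_tokens,
            "completion_tokens": completion_tokens,
            "total_tokens": prompt_tokens + completion_tokens,
        }
    return chunk
```

A request without `stream_options` gets a chunk with no `usage` key, so clients that did not opt in see no change in the stream.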
* tweak format string types
This may not be all of them, but these are the ones that warn on OpenBSD
* complete the changes needed to fix the format string specifiers
* avoid using inttypes, directly cast to size_t (u64 usually) instead
---------
Co-authored-by: Concedo <39025047+LostRuins@users.noreply.github.com>
* hexagon: add fp16 support for binary ops: add,sub,mul,div
* hexagon: fix test-backend-ops failures for fp16 binary ops on older arches (<v79)
* hexagon: decide on n_threads (aka n_jobs) early to avoid overallocating scratchpad
* snapdragon: fix readme link
---------
Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>
* models : add llm_build_delta_net_base
* cont : keep qwen35 and qwen35moe graphs intact
* cont : add comments [no ci]
* add kimi linear to delta-net-base
* removed unnecessary ggml_cont from g_exp_t
* removed ggml_cont from g_diff_exp_t. moved ggml_cont for o to kimi-linear.cpp
* removed unnecessary diag mask
* cont : simplify
* cont : avoid graph splits
* scale q after mul instead of beginning
* scale q after mul instead of beginning
* identical ppl
* cont : fix scale and decay mask
* minor : remove TODO
* block implementation for kda
* remove space at the end of line 101
* concat+pad
* pad+binary row concat
* chunk size 16 for kda
* removed minor differences to master
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Adds CPU-to-CUDA copy capability to
ggml_backend_cuda_cpy_tensor_async()
* Adds function to relax sync requirements between input copies on
supported backends (CUDA for now)
* Exchanges synchronous copy with async copy function.
* Adds macro guards to allow compilation in non-CUDA builds
* Reworked backend detection in ggml-backend.cpp to avoid linking
conflicts
* Relax requirement of checks in async CUDA copies from backend and buffer type to just buffer type, to avoid linking issues
* Minor cleanup
* Makes opt-in to relax use of explicit syncs more general. Backends like
vulkan which require a synchronization between HtoD copies and graph
execution could also adopt this change now.
* Reintroduces stricter check for CPU->CUDA backend async copy via
GGML_DEVICE_TYPE_CPU.
* Corrects initialization of ggml_backend_sync_mode in
ggml_backend_sched_split initialization
* Simplifies synchronizations to adhere to `saaasg` pattern.
* Apply suggestion from @ggerganov (src->buffer to buf_src)
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Apply suggestion from @ggerganov (src->buffer to buf_src) v2
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* model : fix Qwen3.5 model type detection
* Update src/llama-model.cpp
whoops, my bad
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>