Include <sstream>; otherwise the build fails with many errors like the ones below:
```
C:\koboldcpp\otherarch\acestep\ace-qwen3.cpp(1278,9): error C2297: '<<': not valid as right operand has type 'const char [26]' [C:\koboldcpp\build\music_adapter.vcxproj]
(compiling source file '../otherarch/acestep/music_adapter.cpp')
C:\koboldcpp\otherarch\acestep\ace-qwen3.cpp(1278,9): error C2679: binary '<<': no operator found which takes a right-hand operand of type 'std::string' (or there is no acceptable conversion) [C:\koboldcpp\build\music_adapter.vcxproj]
(compiling source file '../otherarch/acestep/music_adapter.cpp')
C:\Program Files (x86)\Microsoft Visual Studio\18\BuildTools\VC\Tools\MSVC\14.50.35717\include\__msvc_int128.hpp(753,46):
could be 'std::_Unsigned128 std::operator <<(const std::_Unsigned128 &,const std::_Base128 &) noexcept' [found using argument-dependent lookup]
C:\koboldcpp\otherarch\acestep\ace-qwen3.cpp(1278,9):
'std::_Unsigned128 std::operator <<(const std::_Unsigned128 &,const std::_Base128 &) noexcept': cannot convert argument 2 from 'std::string' to 'const std::_Base128 &'
C:\koboldcpp\otherarch\acestep\ace-qwen3.cpp(1278,57):
Reason: cannot convert from 'std::string' to 'const std::_Base128'
C:\koboldcpp\otherarch\acestep\ace-qwen3.cpp(1278,57):
No user-defined-conversion operator available that can perform this conversion, or the operator cannot be called
```
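For illustration, a minimal sketch of the kind of code that needs the header. The function name `describe_model` is hypothetical and is not taken from ace-qwen3.cpp. `std::ostringstream` is defined in `<sstream>`, so streaming a string literal or a `std::string` into one should not rely on some other header pulling it in transitively:

```cpp
#include <sstream>  // defines std::ostringstream; do not rely on transitive includes
#include <string>

// Hypothetical helper, for illustration only.
std::string describe_model(const std::string & name, int layers) {
    std::ostringstream ss;
    // Both operands below need the stream insertion operators that come
    // with a complete std::ostringstream type.
    ss << "model: " << name << ", layers: " << layers;
    return ss.str();
}
```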
Round image dimensions to the specific multiple required by each
DiT model, which ranges from 32 (certain Wan models) to 1 (Chroma
Radiance), with most requiring multiples of 8 or 16. UNet models
are still rounded to multiples of 64.
The current sd.cpp rounds the sizes internally, but it always rounds
up, so we still need to round on our side to apply image size
restrictions and to trigger VAE tiling correctly.
Also, remove a legacy test that could abort a generation with
unsupported image sizes: it would never run, because it was applied
after the image side adjustments.
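A rough sketch of rounding a dimension to a per-model multiple. The helper name is hypothetical, and rounding down is an assumption made here for illustration: the description above only says that sd.cpp itself rounds up, not which direction the adapter uses.

```cpp
#include <algorithm>

// Hypothetical helper: snap `value` down to a multiple of `multiple`,
// but never below one full multiple. A multiple of 1 (or less) leaves
// the value unchanged, as for a model with no size restriction.
int round_down_to_multiple(int value, int multiple) {
    if (multiple <= 1) {
        return value;
    }
    return std::max(multiple, (value / multiple) * multiple);
}
```

With these assumptions, a requested side of 1000 becomes 960 for a multiple of 64 and 992 for a multiple of 16, and stays 1000 for a multiple of 1.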
* Improve CUDA graph capture
Currently, CUDA graphs are eagerly enabled on the first call to ggml_backend_cuda_graph_compute. If the graph properties keep changing (4+ consecutive updates), the graph is permanently disabled. This is suboptimal because:
- The first call always incurs CUDA graph capture overhead even if the graph is unstable
- Once permanently disabled, CUDA graphs never re-enable even after the graph stabilizes (e.g., switching from prompt processing to decode)
The new approach delays CUDA graph activation until warmup completes: the same cgraph must be called at least twice with matching properties before CUDA graph capture begins. This avoids wasted capture overhead on volatile graphs and allows graphs to become eligible once they stabilize.
This also fixes issues such as https://github.com/ggml-org/llama.cpp/discussions/19708
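The warmup idea can be sketched roughly as follows. This is not the code in ggml-cuda.cu: `GraphProps`, `WarmupTracker` and the fields compared are hypothetical, and the sketch assumes that "called at least twice with matching properties" means capture becomes eligible on the call after two matching ones.

```cpp
// Hypothetical stand-in for whatever properties identify a cgraph.
struct GraphProps {
    const void * key;     // identity of the graph
    int          n_nodes; // one example property that may change
};

static bool same_props(const GraphProps & a, const GraphProps & b) {
    return a.key == b.key && a.n_nodes == b.n_nodes;
}

class WarmupTracker {
public:
    explicit WarmupTracker(int warmup_calls = 2) : warmup_calls_(warmup_calls) {}

    // Returns true if graph capture may be used for this call, i.e. the
    // preceding `warmup_calls_` calls all had the same properties as this one.
    bool observe(const GraphProps & props) {
        if (seen_ > 0 && same_props(props, last_)) {
            const bool ready = seen_ >= warmup_calls_;
            ++seen_;
            return ready;
        }
        // Properties changed: restart the warmup instead of disabling forever.
        last_ = props;
        seen_ = 1;
        return false;
    }

private:
    int        warmup_calls_;
    int        seen_ = 0;
    GraphProps last_{};
};
```

Because a change only resets the counter, a graph that was unstable during prompt processing becomes eligible again once it stops changing during decode.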
* Update ggml/src/ggml-cuda/ggml-cuda.cu
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* Remove EM dashes
* Update ggml/src/ggml-cuda/ggml-cuda.cu
Co-authored-by: Aman Gupta <amangupta052@gmail.com>
---------
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: Aman Gupta <amangupta052@gmail.com>
* common : fix Step-3.5-Flash format detection and thinking support
Step-3.5-Flash uses the same XML-style tool call format as Qwen3-Coder
(<tool_call><function=...><parameter=...>) but its Jinja template lacks
the bare <function> and plural <parameters> markers that the detection
logic previously required. This caused it to fall through to Hermes 2
Pro, which doesn't call func_args_not_string(), so arguments stayed as
JSON strings and templates using arguments|items crashed.
Additionally, the Qwen3-Coder-XML format handler had no thinking support.
Models like Step-3.5-Flash that unconditionally emit <think> in their
generation prompt need the same thinking_forced_open handling that
Nemotron v3 and Hermes 2 Pro already have, otherwise reasoning_content
is never separated from content in API responses.
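As a simplified illustration of what thinking_forced_open implies for parsing (this is not the parser in common/chat.cpp; `ParsedOutput` and `split_reasoning` are hypothetical names): when the generation prompt already ends with <think>, the model output starts inside the reasoning block, so everything up to </think> belongs in reasoning_content.

```cpp
#include <string>

struct ParsedOutput {
    std::string reasoning_content;
    std::string content;
};

// Hypothetical sketch: split raw model output into reasoning and content.
// No whitespace trimming or tool-call parsing is attempted here.
ParsedOutput split_reasoning(const std::string & output, bool thinking_forced_open) {
    static const std::string open_tag  = "<think>";
    static const std::string close_tag = "</think>";

    ParsedOutput result;
    std::string::size_type start = 0;
    bool in_thinking = thinking_forced_open;

    // Without a forced-open block, only treat the output as reasoning
    // if it explicitly starts with <think>.
    if (!in_thinking && output.compare(0, open_tag.size(), open_tag) == 0) {
        in_thinking = true;
        start = open_tag.size();
    }
    if (!in_thinking) {
        result.content = output;
        return result;
    }

    const std::string::size_type end = output.find(close_tag, start);
    if (end == std::string::npos) {
        // Reasoning block not closed yet (e.g. mid-stream).
        result.reasoning_content = output.substr(start);
        return result;
    }
    result.reasoning_content = output.substr(start, end - start);
    result.content = output.substr(end + close_tag.size());
    return result;
}
```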
Changes:
- Relax Qwen3-Coder XML detection to only require the 3 shared markers
- Tighten Nemotron v3 branch to also require bare <function> and plural
<parameters>, preventing Step-3.5-Flash from being misrouted via <think>
- Add thinking_forced_open support to Qwen3-Coder-XML init function
- Add <think>/</think> to preserved tokens
- Fix build_grammar_xml_tool_call to handle thinking_forced_open in the
grammar root rule, allowing </think> before tool calls
- Add Step-3.5-Flash chat template and format detection test
Builds on: https://github.com/ggml-org/llama.cpp/pull/19283
* chat : route Step-3.5-Flash to Nemotron v3 PEG parser, add tests
Step-3.5-Flash uses the same XML tool call format as Qwen3-Coder and
Nemotron 3 Nano (<tool_call>/<function=...>/<parameter=...>) but with
unconditional <think> output. Route it to the Nemotron v3 PEG parser
for streaming and schema-aware parameter parsing.
Detection: templates with <think> + XML tool tags use Nemotron v3 PEG
parser; templates without <think> (Qwen3-Coder) use GBNF grammar.
Tests cover: basic messages, tool calls with/without thinking content,
parallel tool calls, code string parameters, optional </parameter>
closing tags, and JSON schema response format.
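The detection rule described above could look roughly like this. It is a sketch only: `ToolParser` and `pick_parser` are hypothetical names, and the real detection in llama.cpp may check more markers or run in a different order.

```cpp
#include <string>

enum class ToolParser { NemotronV3Peg, Qwen3CoderGbnf, Other };

static bool contains(const std::string & haystack, const char * needle) {
    return haystack.find(needle) != std::string::npos;
}

// Hypothetical sketch: choose a parser from the Jinja template source.
ToolParser pick_parser(const std::string & tmpl) {
    // The XML tool tags shared by Qwen3-Coder, Nemotron 3 Nano and Step-3.5-Flash.
    const bool xml_tool_tags = contains(tmpl, "<tool_call>") &&
                               contains(tmpl, "<function=")  &&
                               contains(tmpl, "<parameter=");
    if (!xml_tool_tags) {
        return ToolParser::Other;
    }
    // <think> in the template selects the PEG parser; otherwise GBNF.
    return contains(tmpl, "<think>") ? ToolParser::NemotronV3Peg
                                     : ToolParser::Qwen3CoderGbnf;
}
```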
* chat : remove dead thinking code from qwen3_coder_xml
Remove thinking handling code that became unreachable after routing
Step-3.5-Flash to the Nemotron v3 PEG parser. Qwen3-Coder has no
<think> in its template, so the thinking_forced_open logic, preserved
tokens, and grammar prefix were dead paths.