qwen-code/docs/design/adaptive-output-token-escalation
Shaojin Wen 1e8bc031cc
feat(core): adaptive output token escalation (8K default + 64K retry) (#2898)
* feat(core): adaptive output token escalation (8K default + 64K retry)

99% of model responses are under 5K tokens, but we previously reserved
32K for every request. This wastes GPU slot capacity by ~4x.

Now the default output limit is 8K. When a response hits this cap
(stop_reason=max_tokens), it automatically retries once at 64K — only
the ~1% of requests that actually need more tokens pay the cost.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: add design doc and user doc for adaptive output token escalation

- Add design doc covering problem, architecture, token limit
  determination, escalation mechanism, and design decisions
- Document QWEN_CODE_MAX_OUTPUT_TOKENS env var in settings.md
- Add max_tokens adaptive behavior explanation in model config section

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-08 17:30:39 +08:00
..
adaptive-output-token-escalation-design.md feat(core): adaptive output token escalation (8K default + 64K retry) (#2898) 2026-04-08 17:30:39 +08:00