mirror of
https://github.com/QwenLM/qwen-code.git
synced 2026-05-01 21:20:44 +00:00
feat(core): adaptive output token escalation (8K default + 64K retry) (#2898)

* feat(core): adaptive output token escalation (8K default + 64K retry)

  99% of model responses are under 5K tokens, but we previously reserved 32K for every request, wasting GPU slot capacity by ~4x. Now the default output limit is 8K. When a response hits this cap (stop_reason=max_tokens), it automatically retries once at 64K, so only the ~1% of requests that actually need more tokens pay the cost.

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: add design doc and user doc for adaptive output token escalation

  - Add design doc covering problem, architecture, token limit determination, escalation mechanism, and design decisions
  - Document QWEN_CODE_MAX_OUTPUT_TOKENS env var in settings.md
  - Add max_tokens adaptive behavior explanation in model config section

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This commit is contained in:
parent
3c23952ef7
commit
1e8bc031cc
11 changed files with 299 additions and 57 deletions
# Adaptive Output Token Escalation Design

> Reduces GPU slot over-reservation by ~4x through a "low default + escalate on truncation" strategy for output tokens.

## Problem

Every API request reserves a fixed GPU slot proportional to `max_tokens`. The previous default of 32K tokens meant each request reserved a 32K output slot, even though 99% of responses are under 5K tokens. This over-reserved GPU capacity by 4-6x, limiting server concurrency and increasing cost.

## Solution

Use a capped default of **8K** output tokens. When a response is truncated (the model hits `max_tokens`), automatically retry once with an escalated limit of **64K**. Since fewer than 1% of requests are actually truncated, this significantly reduces average slot reservation while preserving output quality for long responses.

## Architecture

```
┌─────────────────────────┐
│     Request starts      │
│     max_tokens = 8K     │
└───────────┬─────────────┘
            │
            ▼
┌─────────────────────────┐
│     Stream response     │
└───────────┬─────────────┘
            │
  ┌─────────┴─────────┐
  │                   │
finish_reason   finish_reason
!= MAX_TOKENS   == MAX_TOKENS
  │                   │
  ▼                   ▼
┌───────────┐   ┌─────────────────────┐
│   Done    │   │ Check conditions:   │
└───────────┘   │ - No user override? │
                │ - No env override?  │
                │ - Not already       │
                │   escalated?        │
                └─────────┬───────────┘
                    YES   │   NO
                ┌─────────┴────┐
                │              │
                ▼              ▼
        ┌─────────────┐   ┌──────────┐
        │ Pop partial │   │   Done   │
        │ model resp  │   │ (truncd) │
        │ from history│   └──────────┘
        │             │
        │ Yield RETRY │
        │ event       │
        │             │
        │ Re-send     │
        │ max_tokens  │
        │ = 64K       │
        └─────────────┘
```

## Token limit determination

The effective `max_tokens` is resolved in the following priority order:

| Priority    | Source                                               | Value (known model)          | Value (unknown model) | Escalation behavior            |
| ----------- | ---------------------------------------------------- | ---------------------------- | --------------------- | ------------------------------ |
| 1 (highest) | User config (`samplingParams.max_tokens`)            | `min(userValue, modelLimit)` | `userValue`           | No escalation                  |
| 2           | Environment variable (`QWEN_CODE_MAX_OUTPUT_TOKENS`) | `min(envValue, modelLimit)`  | `envValue`            | No escalation                  |
| 3 (lowest)  | Capped default                                       | `min(modelLimit, 8K)`        | `min(32K, 8K)` = 8K   | Escalates to 64K on truncation |

A "known model" is one that has an explicit entry in `OUTPUT_PATTERNS` (checked via `hasExplicitOutputLimit()`). For known models, the effective value is always capped at the model's declared output limit to avoid API errors. Unknown models (custom deployments, self-hosted endpoints) pass the user's value through directly, since the backend may support larger limits.

This logic is implemented in three content generators:

- `DefaultOpenAICompatibleProvider.applyOutputTokenLimit()` — OpenAI-compatible providers
- `DashScopeProvider` — inherits `applyOutputTokenLimit()` from the default provider
- `AnthropicContentGenerator.buildSamplingParameters()` — Anthropic provider

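The priority order above can be sketched as a small pure function. This is an illustrative sketch only: the helper name `resolveMaxTokens` and its parameter shape are invented for this doc; the real logic lives in `applyOutputTokenLimit()` / `buildSamplingParameters()` and may differ in detail.

```typescript
const CAPPED_DEFAULT_MAX_TOKENS = 8_000;

interface ResolveInput {
  userMaxTokens?: number; // samplingParams.max_tokens, if the user set one
  envMaxTokens?: number;  // QWEN_CODE_MAX_OUTPUT_TOKENS, if set
  modelLimit?: number;    // declared output limit for a "known" model, else undefined
}

// Returns the effective max_tokens plus whether automatic escalation is allowed.
function resolveMaxTokens({ userMaxTokens, envMaxTokens, modelLimit }: ResolveInput): {
  maxTokens: number;
  canEscalate: boolean;
} {
  // Known models are capped at their declared limit; unknown models pass through.
  const cap = (v: number) => (modelLimit !== undefined ? Math.min(v, modelLimit) : v);

  // Priority 1: user config. Priority 2: env var. Neither ever escalates.
  if (userMaxTokens !== undefined) return { maxTokens: cap(userMaxTokens), canEscalate: false };
  if (envMaxTokens !== undefined) return { maxTokens: cap(envMaxTokens), canEscalate: false };

  // Priority 3: capped default. Unknown models fall back to min(32K, 8K) = 8K.
  return { maxTokens: Math.min(modelLimit ?? 32_000, CAPPED_DEFAULT_MAX_TOKENS), canEscalate: true };
}
```

Note how only the lowest-priority branch sets `canEscalate`: any explicit user or env value is treated as intentional and never silently raised.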
## Escalation mechanism

The escalation logic lives in `geminiChat.ts`, placed **outside** the main retry loop. This is intentional:

1. The retry loop handles transient errors (rate limits, invalid streams, content validation)
2. Truncation is not an error — it's a successful response that was cut short
3. Errors from the escalated stream should propagate directly to the caller, not be caught by retry logic

### Escalation steps (geminiChat.ts)

```
1. Stream completes successfully (lastError === null)
2. Last chunk has finishReason === MAX_TOKENS
3. Guard checks pass:
   - maxTokensEscalated === false (prevent infinite escalation)
   - hasUserMaxTokensOverride === false (respect user intent)
4. Pop the partial model response from chat history
5. Yield RETRY event → UI discards partial output
6. Re-send the same request with maxOutputTokens: 64K
```

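Steps 1-3 amount to a single boolean guard. A minimal sketch, where the `StreamOutcome` wrapper type is hypothetical but the flag names (`lastError`, `finishReason`, `maxTokensEscalated`, `hasUserMaxTokensOverride`) come from the steps above:

```typescript
interface StreamOutcome {
  lastError: Error | null;           // null means the stream itself succeeded
  finishReason?: string;             // e.g. "MAX_TOKENS" when truncated
  maxTokensEscalated: boolean;       // true once we have already retried at 64K
  hasUserMaxTokensOverride: boolean; // user/env value set -> never escalate
}

// True only for a successful-but-truncated response that has not yet been
// escalated and where the user did not explicitly choose a limit.
function shouldEscalate(o: StreamOutcome): boolean {
  return (
    o.lastError === null &&
    o.finishReason === "MAX_TOKENS" &&
    !o.maxTokensEscalated &&
    !o.hasUserMaxTokensOverride
  );
}
```

Keeping this guard outside the retry loop means a stream that *errored* never escalates, matching point 3 above: errors propagate instead of being retried with a different `max_tokens`.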
### State cleanup on RETRY (turn.ts)

When the `Turn` class receives a RETRY event, it clears accumulated state to prevent inconsistencies:

- `pendingToolCalls` — cleared to avoid duplicate tool calls if the first truncated response contained completed tool calls that are repeated in the escalated response
- `pendingCitations` — cleared to avoid duplicate citations
- `debugResponses` — cleared to avoid stale debug data
- `finishReason` — reset to `undefined` so the new response's finish reason is used

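The cleanup in the list above can be sketched as follows. This is a hypothetical reduction of the `Turn` class to just the four fields named here; the real class in `turn.ts` carries more state and a different event-handling shape:

```typescript
class Turn {
  pendingToolCalls: unknown[] = [];
  pendingCitations: string[] = [];
  debugResponses: unknown[] = [];
  finishReason: string | undefined;

  // On RETRY, discard everything accumulated from the truncated first attempt
  // so the escalated response starts from a clean slate.
  handleRetryEvent(): void {
    this.pendingToolCalls = [];
    this.pendingCitations = [];
    this.debugResponses = [];
    this.finishReason = undefined;
  }
}
```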
## Constants

Defined in `tokenLimits.ts`:

| Constant                    | Value  | Purpose                                                 |
| --------------------------- | ------ | ------------------------------------------------------- |
| `CAPPED_DEFAULT_MAX_TOKENS` | 8,000  | Default output token limit when no user override is set |
| `ESCALATED_MAX_TOKENS`      | 64,000 | Output token limit used on truncation retry             |

## Design decisions

### Why 8K default?

- 99% of responses are under 5K tokens
- 8K provides reasonable headroom for slightly longer responses without triggering unnecessary retries
- Reduces average slot reservation from 32K to 8K (4x improvement)

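The 4x figure can be sanity-checked with the doc's own numbers. Even if every escalated request is charged for both attempts (8K + 64K), the average reservation stays near 8.6K; note the 1% escalation rate is the doc's estimate, not a measured constant:

```typescript
// Back-of-the-envelope check of the savings claim.
const oldReservation = 32_000; // tokens reserved per request previously

// 99% of requests finish within the 8K default; the ~1% that escalate pay
// for the truncated 8K attempt plus the 64K retry.
const newReservation = 0.99 * 8_000 + 0.01 * (8_000 + 64_000); // = 8_640

const improvement = oldReservation / newReservation; // ≈ 3.7x even counting retries
```

So the headline "4x" holds for slot sizing (32K vs 8K per slot), and the effective average improvement remains close to it once retry cost is included.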
### Why 64K escalated limit?

- Covers the vast majority of long outputs that were truncated at 8K
- Matches the output limit of many modern models (Claude Sonnet, Gemini 3.x, Qwen3.x)
- Higher values (e.g., 128K) would negate slot optimization benefits for the <1% of requests that escalate

### Why not progressive escalation (8K → 16K → 32K → 64K)?

- Each retry adds latency (the full response must be regenerated)
- A single retry is the simplest approach that captures almost all cases
- The <1% truncation rate at 8K means almost no requests need escalation; those that do are likely to need significantly more than 16K

### Why is escalation outside the retry loop?

- Truncation is a success case, not an error
- Errors from the escalated stream (rate limits, network failures) should propagate directly rather than being silently retried with incorrect parameters
- Keeps the retry loop focused on its original purpose (transient error recovery)