mirror of
https://github.com/QwenLM/qwen-code.git
synced 2026-05-01 21:20:44 +00:00
feat(core): adaptive output token escalation (8K default + 64K retry) (#2898)

* feat(core): adaptive output token escalation (8K default + 64K retry)

  99% of model responses are under 5K tokens, but we previously reserved 32K for every request, wasting GPU slot capacity by ~4x. Now the default output limit is 8K. When a response hits this cap (stop_reason=max_tokens), it automatically retries once at 64K, so only the ~1% of requests that actually need more tokens pay the cost.

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: add design doc and user doc for adaptive output token escalation

  - Add design doc covering problem, architecture, token limit determination, escalation mechanism, and design decisions
  - Document QWEN_CODE_MAX_OUTPUT_TOKENS env var in settings.md
  - Add max_tokens adaptive behavior explanation in model config section

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This commit is contained in:
parent
3c23952ef7
commit
1e8bc031cc
11 changed files with 299 additions and 57 deletions
# Adaptive Output Token Escalation Design

> Reduces GPU slot over-reservation by ~4x through a "low default + escalate on truncation" strategy for output tokens.

## Problem

Every API request reserves a fixed GPU slot proportional to `max_tokens`. The previous default of 32K tokens meant each request reserved a 32K output slot, even though 99% of responses are under 5K tokens. This over-reserved GPU capacity by 4-6x, limiting server concurrency and increasing cost.

## Solution

Use a capped default of **8K** output tokens. When a response is truncated (the model hits `max_tokens`), automatically retry once with an escalated limit of **64K**. Since fewer than 1% of requests are actually truncated, this significantly reduces average slot reservation while preserving output quality for long responses.

## Architecture

```
┌─────────────────────────┐
│     Request starts      │
│     max_tokens = 8K     │
└───────────┬─────────────┘
            │
            ▼
┌─────────────────────────┐
│     Stream response     │
└───────────┬─────────────┘
            │
  ┌─────────┴─────────┐
  │                   │
finish_reason   finish_reason
!= MAX_TOKENS   == MAX_TOKENS
  │                   │
  ▼                   ▼
┌───────────┐   ┌─────────────────────┐
│   Done    │   │ Check conditions:   │
└───────────┘   │ - No user override? │
                │ - No env override?  │
                │ - Not already       │
                │   escalated?        │
                └─────────┬───────────┘
                    YES   │   NO
                ┌─────────┴────┐
                │              │
                ▼              ▼
        ┌─────────────┐   ┌──────────┐
        │ Pop partial │   │   Done   │
        │ model resp  │   │ (truncd) │
        │ from history│   └──────────┘
        │             │
        │ Yield RETRY │
        │ event       │
        │             │
        │ Re-send     │
        │ max_tokens  │
        │ = 64K       │
        └─────────────┘
```

## Token limit determination

The effective `max_tokens` is resolved in the following priority order:

| Priority    | Source                                               | Value (known model)          | Value (unknown model) | Escalation behavior            |
| ----------- | ---------------------------------------------------- | ---------------------------- | --------------------- | ------------------------------ |
| 1 (highest) | User config (`samplingParams.max_tokens`)            | `min(userValue, modelLimit)` | `userValue`           | No escalation                  |
| 2           | Environment variable (`QWEN_CODE_MAX_OUTPUT_TOKENS`) | `min(envValue, modelLimit)`  | `envValue`            | No escalation                  |
| 3 (lowest)  | Capped default                                       | `min(modelLimit, 8K)`        | `min(32K, 8K)` = 8K   | Escalates to 64K on truncation |

A "known model" is one that has an explicit entry in `OUTPUT_PATTERNS` (checked via `hasExplicitOutputLimit()`). For known models, the effective value is always capped at the model's declared output limit to avoid API errors. Unknown models (custom deployments, self-hosted endpoints) pass the user's value through directly, since the backend may support larger limits.

This logic is implemented in three content generators:

- `DefaultOpenAICompatibleProvider.applyOutputTokenLimit()` — OpenAI-compatible providers
- `DashScopeProvider` — inherits `applyOutputTokenLimit()` from the default provider
- `AnthropicContentGenerator.buildSamplingParameters()` — Anthropic provider

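The priority order above can be sketched as a small pure function. This is an illustrative sketch only: the helper name `resolveMaxTokens` and its parameter shape are invented for this doc; the real logic lives in `applyOutputTokenLimit()` / `buildSamplingParameters()` and may differ in detail.

```typescript
const CAPPED_DEFAULT_MAX_TOKENS = 8_000;

interface ResolveInput {
  userMaxTokens?: number; // samplingParams.max_tokens, if the user set one
  envMaxTokens?: number;  // QWEN_CODE_MAX_OUTPUT_TOKENS, if set
  modelLimit?: number;    // declared output limit for a "known" model, else undefined
}

// Returns the effective max_tokens plus whether automatic escalation is allowed.
function resolveMaxTokens({ userMaxTokens, envMaxTokens, modelLimit }: ResolveInput): {
  maxTokens: number;
  canEscalate: boolean;
} {
  // Known models are capped at their declared limit; unknown models pass through.
  const cap = (v: number) => (modelLimit !== undefined ? Math.min(v, modelLimit) : v);

  // Priority 1: user config. Priority 2: env var. Neither ever escalates.
  if (userMaxTokens !== undefined) return { maxTokens: cap(userMaxTokens), canEscalate: false };
  if (envMaxTokens !== undefined) return { maxTokens: cap(envMaxTokens), canEscalate: false };

  // Priority 3: capped default. Unknown models fall back to min(32K, 8K) = 8K.
  return { maxTokens: Math.min(modelLimit ?? 32_000, CAPPED_DEFAULT_MAX_TOKENS), canEscalate: true };
}
```

Note how only the lowest-priority branch sets `canEscalate`: any explicit user or env value is treated as intentional and never silently raised.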
## Escalation mechanism

The escalation logic lives in `geminiChat.ts`, placed **outside** the main retry loop. This is intentional:

1. The retry loop handles transient errors (rate limits, invalid streams, content validation)
2. Truncation is not an error — it's a successful response that was cut short
3. Errors from the escalated stream should propagate directly to the caller, not be caught by retry logic

### Escalation steps (geminiChat.ts)

```
1. Stream completes successfully (lastError === null)
2. Last chunk has finishReason === MAX_TOKENS
3. Guard checks pass:
   - maxTokensEscalated === false (prevent infinite escalation)
   - hasUserMaxTokensOverride === false (respect user intent)
4. Pop the partial model response from chat history
5. Yield RETRY event → UI discards partial output
6. Re-send the same request with maxOutputTokens: 64K
```

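Steps 1-3 amount to a single boolean guard. A minimal sketch, where the `StreamOutcome` wrapper type is hypothetical but the flag names (`lastError`, `finishReason`, `maxTokensEscalated`, `hasUserMaxTokensOverride`) come from the steps above:

```typescript
interface StreamOutcome {
  lastError: Error | null;           // null means the stream itself succeeded
  finishReason?: string;             // e.g. "MAX_TOKENS" when truncated
  maxTokensEscalated: boolean;       // true once we have already retried at 64K
  hasUserMaxTokensOverride: boolean; // user/env value set -> never escalate
}

// True only for a successful-but-truncated response that has not yet been
// escalated and where the user did not explicitly choose a limit.
function shouldEscalate(o: StreamOutcome): boolean {
  return (
    o.lastError === null &&
    o.finishReason === "MAX_TOKENS" &&
    !o.maxTokensEscalated &&
    !o.hasUserMaxTokensOverride
  );
}
```

Keeping this guard outside the retry loop means a stream that *errored* never escalates, matching point 3 above: errors propagate instead of being retried with a different `max_tokens`.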
### State cleanup on RETRY (turn.ts)

When the `Turn` class receives a RETRY event, it clears accumulated state to prevent inconsistencies:

- `pendingToolCalls` — cleared to avoid duplicate tool calls if the first truncated response contained completed tool calls that are repeated in the escalated response
- `pendingCitations` — cleared to avoid duplicate citations
- `debugResponses` — cleared to avoid stale debug data
- `finishReason` — reset to `undefined` so the new response's finish reason is used

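The cleanup in the list above can be sketched as follows. This is a hypothetical reduction of the `Turn` class to just the four fields named here; the real class in `turn.ts` carries more state and a different event-handling shape:

```typescript
class Turn {
  pendingToolCalls: unknown[] = [];
  pendingCitations: string[] = [];
  debugResponses: unknown[] = [];
  finishReason: string | undefined;

  // On RETRY, discard everything accumulated from the truncated first attempt
  // so the escalated response starts from a clean slate.
  handleRetryEvent(): void {
    this.pendingToolCalls = [];
    this.pendingCitations = [];
    this.debugResponses = [];
    this.finishReason = undefined;
  }
}
```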
## Constants

Defined in `tokenLimits.ts`:

| Constant                    | Value  | Purpose                                                 |
| --------------------------- | ------ | ------------------------------------------------------- |
| `CAPPED_DEFAULT_MAX_TOKENS` | 8,000  | Default output token limit when no user override is set |
| `ESCALATED_MAX_TOKENS`      | 64,000 | Output token limit used on truncation retry             |

## Design decisions

### Why 8K default?

- 99% of responses are under 5K tokens
- 8K provides reasonable headroom for slightly longer responses without triggering unnecessary retries
- Reduces average slot reservation from 32K to 8K (4x improvement)

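The 4x figure can be sanity-checked with the doc's own numbers. Even if every escalated request is charged for both attempts (8K + 64K), the average reservation stays near 8.6K; note the 1% escalation rate is the doc's estimate, not a measured constant:

```typescript
// Back-of-the-envelope check of the savings claim.
const oldReservation = 32_000; // tokens reserved per request previously

// 99% of requests finish within the 8K default; the ~1% that escalate pay
// for the truncated 8K attempt plus the 64K retry.
const newReservation = 0.99 * 8_000 + 0.01 * (8_000 + 64_000); // = 8_640

const improvement = oldReservation / newReservation; // ≈ 3.7x even counting retries
```

So the headline "4x" holds for slot sizing (32K vs 8K per slot), and the effective average improvement remains close to it once retry cost is included.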
### Why 64K escalated limit?

- Covers the vast majority of long outputs that were truncated at 8K
- Matches the output limit of many modern models (Claude Sonnet, Gemini 3.x, Qwen3.x)
- Higher values (e.g., 128K) would negate slot optimization benefits for the <1% of requests that escalate

### Why not progressive escalation (8K → 16K → 32K → 64K)?

- Each retry adds latency (the full response must be regenerated)
- A single retry is the simplest approach that captures almost all cases
- The <1% truncation rate at 8K means almost no requests need escalation; those that do are likely to need significantly more than 16K

### Why is escalation outside the retry loop?

- Truncation is a success case, not an error
- Errors from the escalated stream (rate limits, network failures) should propagate directly rather than being silently retried with incorrect parameters
- Keeps the retry loop focused on its original purpose (transient error recovery)