# Adaptive Output Token Escalation Design

> Reduces GPU slot over-reservation by ~4x through a "low default + escalate on truncation" strategy for output tokens, with multi-turn recovery for responses that exceed even the escalated limit.

## Problem

Every API request reserves a fixed GPU slot proportional to `max_tokens`. The previous default of 32K tokens means each request reserves a 32K output slot, but 99% of responses are under 5K tokens. This over-reserves GPU capacity by 4-6x, limiting server concurrency and increasing cost.

## Solution

Use a capped default of **8K** output tokens. When a response is truncated (the model hits `max_tokens`):

1. **Escalate** to the model's full output limit (with 64K as a floor for unknown models)
2. If still truncated, **recover** by keeping the partial response in history and injecting a continuation message, up to 3 times
3. If recovery is exhausted, fall back to the tool scheduler's truncation guidance

Since <1% of requests are actually truncated, this reduces average slot reservation significantly while preserving output quality for long responses.

## Architecture

```
Request (max_tokens = 8K)
            │
            ▼
┌─────────────────────────┐
│ Response truncated?     │──── No ──▶ Done ✓
│ (MAX_TOKENS)            │
└───────────┬─────────────┘
            │ Yes
            ▼
┌──────────────────────────────────────────────────┐
│ Layer 1: Escalate to model output limit          │
│ ┌────────────────────────────────────────────┐   │
│ │ Pop partial response from history          │   │
│ │ RETRY (isContinuation: false → reset UI)   │   │
│ │ Re-send at max(64K, model output limit)    │   │
│ └────────────────────────────────────────────┘   │
└───────────┬──────────────────────────────────────┘
            │
            ▼
┌─────────────────────────┐
│ Still truncated?        │──── No ──▶ Done ✓
│ (MAX_TOKENS)            │
└───────────┬─────────────┘
            │ Yes
            ▼
┌──────────────────────────────────────────────────┐
│ Layer 2: Multi-turn recovery (up to 3×)          │
│ ┌────────────────────────────────────────────┐   │
│ │ Keep partial response in history           │   │
│ │ Push user message: "Resume directly..."    │   │
│ │ RETRY (isContinuation: true → keep UI buf) │   │
│ │ Re-send with updated history               │   │
│ │ Model continues from where it left off     │   │
│ └──────────────┬─────────────────────────────┘   │
│                │                                 │
│         ┌──────┴──────┐                          │
│         │ Succeeded?  │── Yes ──▶ Done ✓         │
│         └──────┬──────┘                          │
│                │ No (still truncated)            │
│                ▼                                 │
│         attempt < 3? ── Yes ──▶ loop back ↑      │
└───────────┬──────────────────────────────────────┘
            │ No (exhausted)
            ▼
┌──────────────────────────────────────────────────┐
│ Layer 3: Tool scheduler fallback                 │
│ ┌────────────────────────────────────────────┐   │
│ │ Reject truncated Edit/Write tool calls     │   │
│ │ Return guidance: "You MUST split into      │   │
│ │ smaller parts — write skeleton first,      │   │
│ │ then edit incrementally."                  │   │
│ └────────────────────────────────────────────┘   │
└──────────────────────────────────────────────────┘
```

## Token limit determination

The effective `max_tokens` is resolved in the following priority order:

| Priority | Source | Value (known model) | Value (unknown model) | Escalation behavior |
| ----------- | ---------------------------------------------------- | ---------------------------- | --------------------- | ----------------------------------------------- |
| 1 (highest) | User config (`samplingParams.max_tokens`) | `min(userValue, modelLimit)` | `userValue` | No escalation |
| 2 | Environment variable (`QWEN_CODE_MAX_OUTPUT_TOKENS`) | `min(envValue, modelLimit)` | `envValue` | No escalation |
| 3 (lowest) | Capped default | `min(modelLimit, 8K)` | `min(32K, 8K)` = 8K | Escalates to model limit (64K floor) + recovery |

A "known model" is one that has an explicit entry in `OUTPUT_PATTERNS` (checked via `hasExplicitOutputLimit()`). For known models, the effective value is always capped at the model's declared output limit to avoid API errors. Unknown models (custom deployments, self-hosted endpoints) pass the user's value through directly, since the backend may support larger limits.

This logic is implemented in three content generators:

- `DefaultOpenAICompatibleProvider.applyOutputTokenLimit()` — OpenAI-compatible providers
- `DashScopeProvider` — inherits `applyOutputTokenLimit()` from the default provider
- `AnthropicContentGenerator.buildSamplingParameters()` — Anthropic provider
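As a rough illustration of this priority order, the sketch below shows how the effective limit could be resolved. It is a simplified stand-in rather than the actual `applyOutputTokenLimit()` code: `resolveMaxOutputTokens` and its parameters are invented for this example, while `hasExplicitOutputLimit()`, `tokenLimit()`, and `CAPPED_DEFAULT_MAX_TOKENS` are the names used elsewhere in this design.

```typescript
// Assumed signatures from tokenLimits.ts, as referenced in this design:
declare function hasExplicitOutputLimit(model: string): boolean;
declare function tokenLimit(model: string, kind: 'output'): number; // 32K default for unknown models

const CAPPED_DEFAULT_MAX_TOKENS = 8_000;

// Hypothetical helper: resolves the effective max_tokens per the table above.
function resolveMaxOutputTokens(
  model: string,
  userConfigValue?: number, // samplingParams.max_tokens
  envValue?: number,        // QWEN_CODE_MAX_OUTPUT_TOKENS
): { maxTokens: number; escalatable: boolean } {
  const modelLimit = tokenLimit(model, 'output');
  const known = hasExplicitOutputLimit(model);

  // Priority 1: user config. Capped at the model limit only for known models;
  // unknown backends may accept larger values, so the value passes through.
  if (userConfigValue !== undefined) {
    return {
      maxTokens: known ? Math.min(userConfigValue, modelLimit) : userConfigValue,
      escalatable: false, // user intent is respected; never escalated
    };
  }

  // Priority 2: environment variable, same capping rule, no escalation.
  if (envValue !== undefined) {
    return {
      maxTokens: known ? Math.min(envValue, modelLimit) : envValue,
      escalatable: false,
    };
  }

  // Priority 3: capped default. Only this path is eligible for escalation
  // and multi-turn recovery when the response is truncated.
  return {
    maxTokens: Math.min(modelLimit, CAPPED_DEFAULT_MAX_TOKENS),
    escalatable: true,
  };
}
```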
## Escalation mechanism

The escalation logic lives in `geminiChat.ts`, placed **outside** the main retry loop. This is intentional:

1. The retry loop handles transient errors (rate limits, invalid streams, content validation)
2. Truncation is not an error — it's a successful response that was cut short
3. Errors from the escalated stream should propagate directly to the caller, not be caught by retry logic

### Escalation steps (geminiChat.ts)

```
1. Stream completes successfully (lastError === null)
2. Last chunk has finishReason === MAX_TOKENS
3. Guard checks pass:
   - maxTokensEscalated === false (prevent infinite escalation)
   - hasUserMaxTokensOverride === false (respect user intent)
4. Compute escalated limit: max(ESCALATED_MAX_TOKENS, tokenLimit(model, 'output'))
5. Pop the partial model response from chat history
6. Yield RETRY event (isContinuation: false) → UI discards partial output and resets buffers
7. Re-send the same request with maxOutputTokens: escalatedLimit
```

### Recovery steps (geminiChat.ts)

If the escalated response is also truncated (finishReason === MAX_TOKENS), the recovery loop runs up to `MAX_OUTPUT_RECOVERY_ATTEMPTS` (3) times:

```
1. Partial model response is already in history (pushed by processStreamResponse)
2. Push a recovery user message: OUTPUT_RECOVERY_MESSAGE
3. Yield RETRY event (isContinuation: true) → UI keeps text buffer for continuation
4. Re-send with updated history (model sees its partial output + recovery instruction)
5. If still truncated and attempts remain, loop back to step 1
6. If recovery attempt throws (empty response, network error):
   - Pop the dangling recovery message from history
   - Break out of recovery loop
```
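Put together, the two layers behave roughly like the sketch below. This is a condensed illustration of the flow described above, not the actual `geminiChat.ts` implementation: `sendOnce`, the `Content` shape, and the recovery-message wording are placeholders, while the constants and the `isContinuation` semantics follow this design.

```typescript
interface Content { role: 'user' | 'model'; text: string; }

// Placeholder for a single streamed request/response round trip.
declare function sendOnce(
  history: Content[],
  maxOutputTokens: number,
): Promise<{ finishReason: 'STOP' | 'MAX_TOKENS'; content: Content }>;
declare function tokenLimit(model: string, kind: 'output'): number;

const ESCALATED_MAX_TOKENS = 64_000;
const MAX_OUTPUT_RECOVERY_ATTEMPTS = 3;
// Placeholder wording; the real OUTPUT_RECOVERY_MESSAGE is defined in geminiChat.ts.
const OUTPUT_RECOVERY_MESSAGE: Content = { role: 'user', text: 'Resume directly...' };

async function* sendWithEscalation(
  model: string,
  history: Content[],
  cappedDefault: number,
  hasUserMaxTokensOverride: boolean,
) {
  // Initial attempt at the capped default (8K).
  let response = await sendOnce(history, cappedDefault);
  history.push(response.content);
  if (response.finishReason !== 'MAX_TOKENS' || hasUserMaxTokensOverride) {
    return response;
  }

  // Layer 1: escalate to the model's full output limit, with a 64K floor.
  const escalatedLimit = Math.max(ESCALATED_MAX_TOKENS, tokenLimit(model, 'output'));
  history.pop();                                   // discard the partial response
  yield { type: 'RETRY', isContinuation: false };  // UI resets its text buffers
  response = await sendOnce(history, escalatedLimit);
  history.push(response.content);

  // Layer 2: multi-turn recovery; the partial response stays in history.
  for (
    let attempt = 0;
    response.finishReason === 'MAX_TOKENS' && attempt < MAX_OUTPUT_RECOVERY_ATTEMPTS;
    attempt++
  ) {
    history.push(OUTPUT_RECOVERY_MESSAGE);
    yield { type: 'RETRY', isContinuation: true };  // UI keeps its text buffer
    try {
      response = await sendOnce(history, escalatedLimit);
      history.push(response.content);
    } catch {
      history.pop(); // drop the dangling recovery message
      break;
    }
  }
  return response;
}
```

In the real code the escalation sits outside the retry loop and the `maxTokensEscalated` / `hasUserMaxTokensOverride` guards prevent repeated escalation; in this sketch the single-pass structure plays that role.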
### State cleanup on RETRY (turn.ts)

When the `Turn` class receives a RETRY event, it clears accumulated state to prevent inconsistencies:

- `pendingToolCalls` — cleared to avoid duplicate tool calls if the first truncated response contained completed tool calls that are repeated in the escalated response
- `pendingCitations` — cleared to avoid duplicate citations
- `debugResponses` — cleared to avoid stale debug data
- `finishReason` — reset to `undefined` so the new response's finish reason is used

The `isContinuation` flag is passed through to the UI so it can decide whether to reset text buffers (escalation) or keep them (recovery).

## Constants

Defined in `geminiChat.ts` and `tokenLimits.ts`:

| Constant | Value | Purpose |
| ------------------------------ | ------ | -------------------------------------------------------- |
| `CAPPED_DEFAULT_MAX_TOKENS` | 8,000 | Default output token limit when no user override is set |
| `ESCALATED_MAX_TOKENS` | 64,000 | Floor for escalation (used when model limit is unknown) |
| `MAX_OUTPUT_RECOVERY_ATTEMPTS` | 3 | Max multi-turn recovery attempts after escalation |

The effective escalated limit is `max(ESCALATED_MAX_TOKENS, tokenLimit(model, 'output'))`:

| Model | Escalated limit |
| ---------------- | --------------- |
| Claude Opus 4.6 | 131,072 (128K) |
| GPT-5 / o-series | 131,072 (128K) |
| Qwen3.x | 65,536 (64K) |
| Unknown models | 64,000 (floor) |

## Design decisions

### Why 8K default?

- 99% of responses are under 5K tokens
- 8K provides reasonable headroom for slightly longer responses without triggering unnecessary retries
- Reduces average slot reservation from 32K to 8K (4x improvement)

### Why escalate to model limit instead of fixed 64K?

- Models with higher output limits (Claude Opus 128K, GPT-5 128K) were constrained to 64K unnecessarily
- Using the model's actual limit captures the vast majority of long outputs without a second retry
- `ESCALATED_MAX_TOKENS` (64K) serves as a floor for unknown models where `tokenLimit()` returns the default 32K

### Why multi-turn recovery instead of progressive escalation?

- Progressive escalation (8K → 16K → 32K → 64K) requires regenerating the full response each time
- Multi-turn recovery keeps the partial response and lets the model continue, saving tokens and latency
- Recovery messages are cheap (~40 tokens each) compared to regenerating large responses
- The 3-attempt limit prevents infinite loops while covering most practical cases

### Why is escalation outside the retry loop?

- Truncation is a success case, not an error
- Errors from the escalated stream (rate limits, network failures) should propagate directly rather than being silently retried with incorrect parameters
- Keeps the retry loop focused on its original purpose (transient error recovery)
- Recovery errors are caught separately to avoid aborting the entire conversation
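To make the trade-off discussed under "Why multi-turn recovery instead of progressive escalation?" concrete, the sketch below counts generated output tokens for a hypothetical 20K-token response. The response length and the intermediate caps are assumptions for illustration only, not measured figures.

```typescript
// Back-of-the-envelope comparison; the 20K response length is an assumed
// example, not a measurement.
const RESPONSE_TOKENS = 20_000;

// Progressive escalation (8K → 16K → 32K): every truncated attempt regenerates
// the response from scratch up to its cap before retrying at the next cap.
const progressive = [8_000, 16_000, 32_000]
  .map((cap) => Math.min(cap, RESPONSE_TOKENS))
  .reduce((sum, generated) => sum + generated, 0); // 8K + 16K + 20K = 44K tokens

// Capped default plus a single escalation to the model limit: one truncated
// 8K attempt, then one full regeneration that completes.
const escalateOnce = 8_000 + RESPONSE_TOKENS; // 28K tokens

// Multi-turn recovery only kicks in beyond the escalated limit; each extra turn
// adds a ~40-token continuation prompt plus the remaining tail, rather than a
// full regeneration.
console.log({ progressive, escalateOnce }); // { progressive: 44000, escalateOnce: 28000 }
```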