qwen-code/docs/design/adaptive-output-token-escalation/adaptive-output-token-escalation-design.md
Shaojin Wen 519e5aa1de
fix(core): recover from truncated tool calls via multi-turn continuation (#3313)
* fix(core): recover from truncated tool calls via multi-turn continuation (#3049)

When large tool calls (e.g., WriteFile with big HTML) exceed the output
token limit, the model's response gets truncated and required parameters
like file_path are missing. Previously this surfaced as a confusing
"params must have required property" error.

Three-layer defense:

1. Escalate to model's actual output limit (not fixed 64K). Models with
   128K output (Claude Opus, GPT-5) now use their full capacity.

2. Multi-turn recovery: if the escalated response is still truncated,
   keep the partial response in history and inject a recovery message
   ("Resume directly — pick up mid-thought") so the model continues
   from where it left off. Up to 3 recovery attempts before falling
   back to the tool scheduler's guidance.

3. Stronger truncation guidance as fallback: "you MUST split" instead
   of "consider splitting".

Also fixes:
- Clear toolCallRequests on RETRY to prevent duplicate tool execution
- Add isContinuation flag to RETRY events so the UI preserves text
  buffers during recovery (continuation) but resets them during
  escalation (fresh restart)
- Catch errors during recovery to prevent dangling history entries

* docs: update adaptive output token escalation design for recovery mechanism

Update the design doc to reflect:
- Escalation now targets model's actual output limit (64K floor)
- Multi-turn recovery loop after escalation (up to 3 attempts)
- isContinuation flag on RETRY events
- Recovery error handling (pop dangling message, break)
- Updated constants table and model-specific escalation limits
- New design decision: why multi-turn recovery over progressive escalation

* fix: remove competitor reference from code comment

* fix: address review feedback on recovery mechanism

Three correctness fixes from @tanzhenxin's review:

1. Partial text lost during continuation (useGeminiStream.ts):
   On continuation RETRY, setPendingHistoryItem(null) cleared the pending
   gemini item. The next Content event then saw a null pending item,
   created a fresh one, and reset geminiMessageBuffer = eventValue —
   discarding the preserved partial text. Now the pending item AND
   buffers are kept on continuation, so the continuation appends.

2. Recovery on truncated tool-call turns (geminiChat.ts):
   When the truncated turn already contains a complete functionCall,
   appending a user recovery message produces model(functionCall) →
   user(text) with no intervening functionResponse — an invalid API
   sequence. Now recovery skips turns with functionCall parts and
   defers to the tool scheduler's layer-3 fallback.

3. Recovery errors swallowed after partial chunks (geminiChat.ts):
   If a recovery attempt yielded chunks then failed, the catch block
   broke without emitting any terminal signal, leaving the UI with
   partial text and no Finished event. Now emits a synthetic
   finishReason=STOP chunk in the catch so the UI gets a proper
   terminal signal.

* test: add coverage for output token recovery loop

Four targeted tests for the recovery mechanism introduced in the
truncated-tool-call-recovery PR:

1. Recovery loop fires when escalated response is also truncated:
   initial MAX_TOKENS → escalation MAX_TOKENS → recovery STOP. Verifies
   two RETRY events (one escalation, one continuation) and three API
   calls.

2. Recovery is skipped when truncated turn contains a functionCall:
   prevents the invalid model(functionCall) → user(text) sequence.
   Verifies no continuation RETRY and history ends with the functionCall
   intact.

3. Recovery attempts are capped at MAX_OUTPUT_RECOVERY_ATTEMPTS (3):
   persistent MAX_TOKENS triggers exactly 5 API calls (1 initial + 1
   escalation + 3 recovery).

4. Recovery catch block emits synthetic STOP chunk and pops dangling
   user message: when a recovery attempt fails (empty stream →
   InvalidStreamError), the UI gets a terminal signal and history
   ends on the model turn, not a dangling user recovery message.

* test: cover cross-iteration functionCall detection in recovery loop

Existing tests cover the functionCall guard when both initial and
escalated responses have functionCall. This adds a test for the
cross-iteration case: iter 1 returns text (recovery proceeds), iter 2
returns functionCall (recovery must break before iter 3).

Verifies:
- API called exactly 4 times (1 initial + 1 escalation + 2 recovery)
- History ends with the functionCall model turn, not a dangling user
  recovery message
- Iter 3's user recovery message is never pushed (guard fires at top
  of loop before recoveryCount increment)

* fix(core): cast synthetic STOP chunk via unknown for TS2352

The object literal {candidates, content, parts} doesn't structurally
overlap enough with GenerateContentResponse for TypeScript's strict
narrow cast. Casting through 'unknown' is required per TS2352.

Build error from CI:
  src/core/geminiChat.ts(651,24): error TS2352: Conversion of type '...'
  to type 'GenerateContentResponse' may be a mistake because neither
  type sufficiently overlaps with the other. If this was intentional,
  convert the expression to 'unknown' first.

* test(core): tighten recovery history integrity assertions

Strengthen the "pop dangling recovery message" test to catch any
future regression that leaves consecutive same-role entries or an
empty last-model placeholder in history — conditions providers
reject on the next turn.

* fix(core): coalesce recovery pairs to avoid leaking control prompt

Previously every output-token recovery iteration left a (user, model)
pair in durable history where the user turn was the internal
OUTPUT_RECOVERY_MESSAGE control prompt. That prompt was then visible
to every later turn, biasing responses and polluting compression,
replay, and export.

Track successful recovery iterations and, after the recovery loop,
fold each completed pair back into the preceding model turn via a
new `coalesceRecoveryPairs` helper. Failed iterations already pop
their user turn in the catch block, so they need no coalescing.

Adds a targeted test that runs escalation + two successful recovery
iterations + a clean STOP, and asserts the merged history has
exactly one user turn and one model turn, no trace of the control
prompt text, and content ordered as B (escalation) + C + D.
2026-04-21 17:04:24 +08:00


Adaptive Output Token Escalation Design

Reduces GPU slot over-reservation by ~4x through a "low default + escalate on truncation" strategy for output tokens, with multi-turn recovery for responses that exceed even the escalated limit.

Problem

Every API request reserves a fixed GPU slot proportional to max_tokens. The previous default of 32K meant each request reserved a 32K output slot, even though 99% of responses are under 5K tokens. This over-reserved GPU capacity by 4-6x, limiting server concurrency and increasing cost.

Solution

Use a capped default of 8K output tokens. When a response is truncated (the model hits max_tokens):

  1. Escalate to the model's full output limit (with 64K as a floor for unknown models)
  2. If still truncated, recover by keeping the partial response in history and injecting a continuation message, up to 3 times
  3. If recovery is exhausted, fall back to the tool scheduler's truncation guidance

Since <1% of requests are actually truncated, this reduces average slot reservation significantly while preserving output quality for long responses.

Architecture

Request (max_tokens = 8K)
│
▼
┌─────────────────────────┐
│  Response truncated?     │──── No ──▶ Done ✓
│  (MAX_TOKENS)            │
└───────────┬──────────────┘
            │ Yes
            ▼
┌──────────────────────────────────────────────────┐
│  Layer 1: Escalate to model output limit         │
│  ┌────────────────────────────────────────────┐  │
│  │ Pop partial response from history          │  │
│  │ RETRY (isContinuation: false → reset UI)   │  │
│  │ Re-send at max(64K, model output limit)    │  │
│  └────────────────────────────────────────────┘  │
└───────────┬──────────────────────────────────────┘
            │
            ▼
┌─────────────────────────┐
│  Still truncated?        │──── No ──▶ Done ✓
│  (MAX_TOKENS)            │
└───────────┬──────────────┘
            │ Yes
            ▼
┌──────────────────────────────────────────────────┐
│  Layer 2: Multi-turn recovery (up to 3×)         │
│  ┌────────────────────────────────────────────┐  │
│  │ Keep partial response in history           │  │
│  │ Push user message: "Resume directly..."    │  │
│  │ RETRY (isContinuation: true → keep UI buf) │  │
│  │ Re-send with updated history               │  │
│  │ Model continues from where it left off     │  │
│  └──────────────┬─────────────────────────────┘  │
│                 │                                 │
│          ┌──────┴──────┐                          │
│          │ Succeeded?  │── Yes ──▶ Done ✓         │
│          └──────┬──────┘                          │
│                 │ No (still truncated)            │
│                 ▼                                 │
│          attempt < 3? ── Yes ──▶ loop back ↑      │
└───────────┬──────────────────────────────────────┘
            │ No (exhausted)
            ▼
┌──────────────────────────────────────────────────┐
│  Layer 3: Tool scheduler fallback                │
│  ┌────────────────────────────────────────────┐  │
│  │ Reject truncated Edit/Write tool calls     │  │
│  │ Return guidance: "You MUST split into      │  │
│  │ smaller parts — write skeleton first,      │  │
│  │ then edit incrementally."                  │  │
│  └────────────────────────────────────────────┘  │
└──────────────────────────────────────────────────┘

Token limit determination

The effective max_tokens is resolved in the following priority order:

| Priority | Source | Value (known model) | Value (unknown model) | Escalation behavior |
|---|---|---|---|---|
| 1 (highest) | User config (samplingParams.max_tokens) | min(userValue, modelLimit) | userValue | No escalation |
| 2 | Environment variable (QWEN_CODE_MAX_OUTPUT_TOKENS) | min(envValue, modelLimit) | envValue | No escalation |
| 3 (lowest) | Capped default | min(modelLimit, 8K) | min(32K, 8K) = 8K | Escalates to model limit (64K floor) + recovery |

A "known model" is one that has an explicit entry in OUTPUT_PATTERNS (checked via hasExplicitOutputLimit()). For known models, the effective value is always capped at the model's declared output limit to avoid API errors. Unknown models (custom deployments, self-hosted endpoints) pass the user's value through directly, since the backend may support larger limits.
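The priority order can be expressed roughly as follows. This is an illustrative sketch, not the actual provider API: the function name `resolveMaxTokens` and its input shape are assumptions, though the constant and the capping behavior follow the table above.

```typescript
// Hypothetical sketch of the resolution order; real signatures in
// DefaultOpenAICompatibleProvider.applyOutputTokenLimit() may differ.
const CAPPED_DEFAULT_MAX_TOKENS = 8_000;

interface ResolveInput {
  userMaxTokens?: number;          // samplingParams.max_tokens (priority 1)
  envMaxTokens?: number;           // QWEN_CODE_MAX_OUTPUT_TOKENS (priority 2)
  modelLimit?: number;             // tokenLimit(model, 'output'), if known
  hasExplicitOutputLimit: boolean; // model has an OUTPUT_PATTERNS entry
}

function resolveMaxTokens(input: ResolveInput): number {
  const cap = (value: number): number =>
    // Known models are capped at their declared output limit to avoid
    // API errors; unknown models pass the value through unchanged.
    input.hasExplicitOutputLimit && input.modelLimit !== undefined
      ? Math.min(value, input.modelLimit)
      : value;

  if (input.userMaxTokens !== undefined) return cap(input.userMaxTokens); // priority 1
  if (input.envMaxTokens !== undefined) return cap(input.envMaxTokens);   // priority 2
  return cap(CAPPED_DEFAULT_MAX_TOKENS);                                  // priority 3
}
```

For example, a user override of 100K against a Qwen3.x model (65,536 output limit) resolves to 65,536, while the same override against an unknown self-hosted endpoint passes through as 100K.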

This logic is implemented in three content generators:

  • DefaultOpenAICompatibleProvider.applyOutputTokenLimit() — OpenAI-compatible providers
  • DashScopeProvider — inherits applyOutputTokenLimit() from the default provider
  • AnthropicContentGenerator.buildSamplingParameters() — Anthropic provider

Escalation mechanism

The escalation logic lives in geminiChat.ts, placed outside the main retry loop. This is intentional:

  1. The retry loop handles transient errors (rate limits, invalid streams, content validation)
  2. Truncation is not an error — it's a successful response that was cut short
  3. Errors from the escalated stream should propagate directly to the caller, not be caught by retry logic

Escalation steps (geminiChat.ts)

1. Stream completes successfully (lastError === null)
2. Last chunk has finishReason === MAX_TOKENS
3. Guard checks pass:
   - maxTokensEscalated === false (prevent infinite escalation)
   - hasUserMaxTokensOverride === false (respect user intent)
4. Compute escalated limit: max(ESCALATED_MAX_TOKENS, tokenLimit(model, 'output'))
5. Pop the partial model response from chat history
6. Yield RETRY event (isContinuation: false) → UI discards partial output and resets buffers
7. Re-send the same request with maxOutputTokens: escalatedLimit
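Steps 1-3 amount to a guard check before anything else happens. A condensed sketch of that guard (the predicate name and flat argument list are assumptions; the real code in geminiChat.ts operates on streamed chunks and chat history):

```typescript
// Simplified sketch of the escalation guard; not the actual geminiChat.ts code.
interface EscalationState {
  maxTokensEscalated: boolean;       // set once the first escalation has run
  hasUserMaxTokensOverride: boolean; // user config or env var set max_tokens
}

function shouldEscalate(
  lastError: Error | null,
  finishReason: string | undefined,
  state: EscalationState,
): boolean {
  return (
    lastError === null &&            // step 1: stream completed successfully
    finishReason === 'MAX_TOKENS' && // step 2: response was truncated
    !state.maxTokensEscalated &&     // step 3: prevent infinite escalation
    !state.hasUserMaxTokensOverride  // step 3: respect user intent
  );
}
```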

Recovery steps (geminiChat.ts)

If the escalated response is also truncated (finishReason === MAX_TOKENS), the recovery loop runs up to MAX_OUTPUT_RECOVERY_ATTEMPTS (3) times:

1. Partial model response is already in history (pushed by processStreamResponse)
2. Push a recovery user message: OUTPUT_RECOVERY_MESSAGE
3. Yield RETRY event (isContinuation: true) → UI keeps text buffer for continuation
4. Re-send with updated history (model sees its partial output + recovery instruction)
5. If still truncated and attempts remain, loop back to step 1
6. If recovery attempt throws (empty response, network error):
   - Pop the dangling recovery message from history
   - Break out of recovery loop
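The loop above can be condensed into a synchronous sketch. The helper names, the `send` callback, and the exact recovery-message wording are assumptions for illustration; the real code streams responses asynchronously and yields RETRY events to the UI.

```typescript
// Hypothetical condensed form of the recovery loop in geminiChat.ts.
const MAX_OUTPUT_RECOVERY_ATTEMPTS = 3;
const OUTPUT_RECOVERY_MESSAGE = 'Resume directly...'; // placeholder; actual prompt is longer

type HistoryTurn = { role: 'user' | 'model'; text: string };

// `send` simulates one model call against the running history and
// reports whether that response was itself truncated.
function recover(
  history: HistoryTurn[],
  send: (h: HistoryTurn[]) => { text: string; truncated: boolean },
): void {
  for (let attempt = 0; attempt < MAX_OUTPUT_RECOVERY_ATTEMPTS; attempt++) {
    history.push({ role: 'user', text: OUTPUT_RECOVERY_MESSAGE }); // step 2
    try {
      const res = send(history);                      // steps 3-4: re-send with history
      history.push({ role: 'model', text: res.text }); // step 1 holds again next iteration
      if (!res.truncated) return;                     // recovery succeeded
    } catch {
      history.pop();   // step 6: pop the dangling recovery message
      return;          // break out of the recovery loop
    }
  }
}
```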

State cleanup on RETRY (turn.ts)

When the Turn class receives a RETRY event, it clears accumulated state to prevent inconsistencies:

  • pendingToolCalls — cleared to avoid duplicate tool calls if the first truncated response contained completed tool calls that are repeated in the escalated response
  • pendingCitations — cleared to avoid duplicate citations
  • debugResponses — cleared to avoid stale debug data
  • finishReason — reset to undefined so the new response's finish reason is used

The isContinuation flag is passed through to the UI so it can decide whether to reset text buffers (escalation) or keep them (recovery).
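A minimal sketch of that reset, assuming a class shape that mirrors the field list above (the actual Turn class in turn.ts carries more state):

```typescript
// Illustrative sketch of the RETRY reset; field names follow the doc,
// the class shape is an assumption.
class TurnState {
  pendingToolCalls: object[] = [];
  pendingCitations = new Set<string>();
  debugResponses: object[] = [];
  finishReason: string | undefined = undefined;

  onRetry(): void {
    this.pendingToolCalls = [];    // avoid duplicate tool execution
    this.pendingCitations.clear(); // avoid duplicate citations
    this.debugResponses = [];      // drop stale debug data
    this.finishReason = undefined; // let the new response set it fresh
  }
}
```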

Constants

Defined in geminiChat.ts and tokenLimits.ts:

| Constant | Value | Purpose |
|---|---|---|
| CAPPED_DEFAULT_MAX_TOKENS | 8,000 | Default output token limit when no user override is set |
| ESCALATED_MAX_TOKENS | 64,000 | Floor for escalation (used when model limit is unknown) |
| MAX_OUTPUT_RECOVERY_ATTEMPTS | 3 | Max multi-turn recovery attempts after escalation |

The effective escalated limit is max(ESCALATED_MAX_TOKENS, tokenLimit(model, 'output')):

| Model | Escalated limit |
|---|---|
| Claude Opus 4.6 | 131,072 (128K) |
| GPT-5 / o-series | 131,072 (128K) |
| Qwen3.x | 65,536 (64K) |
| Unknown models | 64,000 (floor) |
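The table rows follow directly from the formula, assuming tokenLimit(model, 'output') returns the declared limits shown (a sketch; the 32,768 default for unknown models is taken from the design decision below):

```typescript
// Reproduces the escalated-limit table rows from the formula above.
const ESCALATED_MAX_TOKENS = 64_000;
const escalated = (modelOutputLimit: number): number =>
  Math.max(ESCALATED_MAX_TOKENS, modelOutputLimit);

const opusClass = escalated(131_072); // Claude Opus / GPT-5 class: 128K wins
const qwen = escalated(65_536);       // Qwen3.x: declared limit wins
const unknown = escalated(32_768);    // unknown: tokenLimit() default, floor wins
```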

Design decisions

Why 8K default?

  • 99% of responses are under 5K tokens
  • 8K provides reasonable headroom for slightly longer responses without triggering unnecessary retries
  • Reduces average slot reservation from 32K to 8K (4x improvement)

Why escalate to model limit instead of fixed 64K?

  • Models with higher output limits (Claude Opus 128K, GPT-5 128K) were constrained to 64K unnecessarily
  • Using the model's actual limit captures the vast majority of long outputs without a second retry
  • ESCALATED_MAX_TOKENS (64K) serves as a floor for unknown models where tokenLimit() returns the default 32K

Why multi-turn recovery instead of progressive escalation?

  • Progressive escalation (8K → 16K → 32K → 64K) requires regenerating the full response each time
  • Multi-turn recovery keeps the partial response and lets the model continue, saving tokens and latency
  • Recovery messages are cheap (~40 tokens each) compared to regenerating large responses
  • The 3-attempt limit prevents infinite loops while covering most practical cases

Why is escalation outside the retry loop?

  • Truncation is a success case, not an error
  • Errors from the escalated stream (rate limits, network failures) should propagate directly rather than being silently retried with incorrect parameters
  • Keeps the retry loop focused on its original purpose (transient error recovery)
  • Recovery errors are caught separately to avoid aborting the entire conversation