feat(core): adaptive output token escalation (8K default + 64K retry) (#2898)

* feat(core): adaptive output token escalation (8K default + 64K retry)

99% of model responses are under 5K tokens, but we previously reserved
32K for every request, over-reserving GPU slot capacity by roughly 4x.

Now the default output limit is 8K. When a response hits this cap
(stop_reason=max_tokens), it automatically retries once at 64K — only
the ~1% of requests that actually need more tokens pay the cost.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: add design doc and user doc for adaptive output token escalation

- Add design doc covering problem, architecture, token limit
  determination, escalation mechanism, and design decisions
- Document QWEN_CODE_MAX_OUTPUT_TOKENS env var in settings.md
- Add max_tokens adaptive behavior explanation in model config section

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Shaojin Wen 2026-04-08 17:30:39 +08:00 committed by GitHub
parent 3c23952ef7
commit 1e8bc031cc
11 changed files with 299 additions and 57 deletions

@ -0,0 +1,138 @@
# Adaptive Output Token Escalation Design
> Reduces GPU slot over-reservation by ~4x through a "low default + escalate on truncation" strategy for output tokens.
## Problem
Every API request reserves a fixed GPU slot proportional to `max_tokens`. The previous default of 32K tokens means each request reserves a 32K output slot, but 99% of responses are under 5K tokens. This over-reserves GPU capacity by 4-6x, limiting server concurrency and increasing cost.
## Solution
Use a capped default of **8K** output tokens. When a response is truncated (the model hits `max_tokens`), automatically retry once with an escalated limit of **64K**. Since <1% of requests are actually truncated, this reduces average slot reservation significantly while preserving output quality for long responses.
## Architecture
```
        ┌─────────────────────────┐
        │     Request starts      │
        │     max_tokens = 8K     │
        └────────────┬────────────┘
                     │
                     ▼
        ┌─────────────────────────┐
        │     Stream response     │
        └────────────┬────────────┘
                     │
          ┌──────────┴──────────┐
          │                     │
    finish_reason         finish_reason
    != MAX_TOKENS         == MAX_TOKENS
          │                     │
          ▼                     ▼
    ┌───────────┐    ┌─────────────────────┐
    │   Done    │    │ Check conditions:   │
    └───────────┘    │ - No user override? │
                     │ - No env override?  │
                     │ - Not already       │
                     │   escalated?        │
                     └──────────┬──────────┘
                          YES   │  NO
                        ┌───────┴───────┐
                        │               │
                        ▼               ▼
                 ┌─────────────┐   ┌──────────┐
                 │ Pop partial │   │   Done   │
                 │ model resp  │   │ (truncd) │
                 │ from history│   └──────────┘
                 │             │
                 │ Yield RETRY │
                 │ event       │
                 │             │
                 │ Re-send     │
                 │ max_tokens  │
                 │ = 64K       │
                 └─────────────┘
```
## Token limit determination
The effective `max_tokens` is resolved in the following priority order:
| Priority | Source | Value (known model) | Value (unknown model) | Escalation behavior |
| ----------- | ---------------------------------------------------- | ---------------------------- | --------------------- | ------------------------------ |
| 1 (highest) | User config (`samplingParams.max_tokens`) | `min(userValue, modelLimit)` | `userValue` | No escalation |
| 2 | Environment variable (`QWEN_CODE_MAX_OUTPUT_TOKENS`) | `min(envValue, modelLimit)` | `envValue` | No escalation |
| 3 (lowest) | Capped default | `min(modelLimit, 8K)` | `min(32K, 8K)` = 8K | Escalates to 64K on truncation |
A "known model" is one that has an explicit entry in `OUTPUT_PATTERNS` (checked via `hasExplicitOutputLimit()`). For known models, the effective value is always capped at the model's declared output limit to avoid API errors. Unknown models (custom deployments, self-hosted endpoints) pass the user's value through directly, since the backend may support larger limits.
This logic is implemented in three content generators:
- `DefaultOpenAICompatibleProvider.applyOutputTokenLimit()` — OpenAI-compatible providers
- `DashScopeProvider` — inherits `applyOutputTokenLimit()` from the default provider
- `AnthropicContentGenerator.buildSamplingParameters()` — Anthropic provider
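A condensed, illustrative sketch of that resolution order (the function name is illustrative; the real logic lives in the provider methods listed above, using `tokenLimit`, `hasExplicitOutputLimit`, and `CAPPED_DEFAULT_MAX_TOKENS` from `tokenLimits.ts`):
```
function resolveMaxTokens(
  userMaxTokens: number | undefined | null,
  modelId: string,
): number {
  const modelLimit = tokenLimit(modelId, 'output');
  const isKnownModel = hasExplicitOutputLimit(modelId);

  // Priority 1: explicit user config (never escalated).
  if (userMaxTokens !== undefined && userMaxTokens !== null) {
    return isKnownModel ? Math.min(userMaxTokens, modelLimit) : userMaxTokens;
  }

  // Priority 2: QWEN_CODE_MAX_OUTPUT_TOKENS env var (never escalated).
  const envVal = process.env['QWEN_CODE_MAX_OUTPUT_TOKENS'];
  const envMaxTokens = envVal ? parseInt(envVal, 10) : NaN;
  if (!isNaN(envMaxTokens) && envMaxTokens > 0) {
    return isKnownModel ? Math.min(envMaxTokens, modelLimit) : envMaxTokens;
  }

  // Priority 3: capped default (eligible for 8K -> 64K escalation).
  return Math.min(modelLimit, CAPPED_DEFAULT_MAX_TOKENS);
}
```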
## Escalation mechanism
The escalation logic lives in `geminiChat.ts`, placed **outside** the main retry loop. This is intentional:
1. The retry loop handles transient errors (rate limits, invalid streams, content validation)
2. Truncation is not an error — it's a successful response that was cut short
3. Errors from the escalated stream should propagate directly to the caller, not be caught by retry logic
### Escalation steps (geminiChat.ts)
```
1. Stream completes successfully (lastError === null)
2. Last chunk has finishReason === MAX_TOKENS
3. Guard checks pass:
- maxTokensEscalated === false (prevent infinite escalation)
- hasUserMaxTokensOverride === false (respect user intent)
4. Pop the partial model response from chat history
5. Yield RETRY event → UI discards partial output
6. Re-send the same request with maxOutputTokens: 64K
```
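Condensed from the `geminiChat.ts` change in this PR (the surrounding generator and `self` plumbing are elided), the core of the escalation path looks like:
```
if (
  lastError === null &&
  lastFinishReason === FinishReason.MAX_TOKENS &&
  !maxTokensEscalated &&
  !hasUserMaxTokensOverride
) {
  maxTokensEscalated = true;

  // processStreamResponse already pushed the partial model turn; drop it.
  if (self.history.at(-1)?.role === 'model') {
    self.history.pop();
  }

  // Turn discards the partial output it has accumulated so far.
  yield { type: StreamEventType.RETRY };

  // Same request, escalated output limit; errors propagate to the caller.
  const escalatedStream = await self.makeApiCallAndProcessStream(
    model,
    requestContents,
    { ...params, config: { ...params.config, maxOutputTokens: ESCALATED_MAX_TOKENS } },
    prompt_id,
  );
  for await (const chunk of escalatedStream) {
    yield { type: StreamEventType.CHUNK, value: chunk };
  }
}
```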
### State cleanup on RETRY (turn.ts)
When the `Turn` class receives a RETRY event, it clears accumulated state to prevent inconsistencies:
- `pendingToolCalls` — cleared to avoid duplicate tool calls if the first truncated response contained completed tool calls that are repeated in the escalated response
- `pendingCitations` — cleared to avoid duplicate citations
- `debugResponses` — cleared to avoid stale debug data
- `finishReason` — reset to `undefined` so the new response's finish reason is used
## Constants
Defined in `tokenLimits.ts`:
| Constant | Value | Purpose |
| --------------------------- | ------ | ------------------------------------------------------- |
| `CAPPED_DEFAULT_MAX_TOKENS` | 8,000 | Default output token limit when no user override is set |
| `ESCALATED_MAX_TOKENS` | 64,000 | Output token limit used on truncation retry |
## Design decisions
### Why 8K default?
- 99% of responses are under 5K tokens
- 8K provides reasonable headroom for slightly longer responses without triggering unnecessary retries
- Reduces average slot reservation from 32K to 8K (4x improvement)
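As a back-of-envelope check using the figures above: with a <1% escalation rate, the expected reservation per request is roughly `0.99 × 8K + 0.01 × (8K + 64K) ≈ 8.6K` tokens, so the effective improvement over the previous 32K default stays close to 4x even after accounting for retries (an estimate from the stated rates, not a measured figure).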
### Why 64K escalated limit?
- Covers the vast majority of long outputs that were truncated at 8K
- Matches the output limit of many modern models (Claude Sonnet, Gemini 3.x, Qwen3.x)
- Higher values (e.g., 128K) would negate slot optimization benefits for the <1% of requests that escalate
### Why not progressive escalation (8K → 16K → 32K → 64K)?
- Each retry adds latency (the full response must be regenerated)
- A single retry is the simplest approach that captures almost all cases
- The <1% truncation rate at 8K means almost no requests need escalation; those that do are likely to need significantly more than 16K
### Why is escalation outside the retry loop?
- Truncation is a success case, not an error
- Errors from the escalated stream (rate limits, network failures) should propagate directly rather than being silently retried with incorrect parameters
- Keeps the retry loop focused on its original purpose (transient error recovery)

@ -168,6 +168,18 @@ Settings are organized into categories. All settings should be placed within the
}
```
**max_tokens (adaptive output tokens):**
When `samplingParams.max_tokens` is not set, Qwen Code uses an adaptive output token strategy to optimize GPU resource usage:
1. Requests start with a default limit of **8K** output tokens
2. If the response is truncated (the model hits the limit), Qwen Code automatically retries with **64K** tokens
3. The partial output is discarded and replaced with the full response from the retry
This happens transparently; at most you will briefly see a retry indicator when escalation occurs. Since 99% of responses are under 5K tokens, escalation is rare (<1% of requests).
To override this behavior, either set `samplingParams.max_tokens` in your settings or use the `QWEN_CODE_MAX_OUTPUT_TOKENS` environment variable.
**contextWindowSize:**
Overrides the default context window size for the selected model. Qwen Code determines the context window using built-in defaults based on model name matching, with a constant fallback value. Use this setting when a provider's effective context limit differs from Qwen Code's default. This value defines the model's assumed maximum context capacity, not a per-request token limit.
@ -491,22 +503,23 @@ For authentication-related variables (like `OPENAI_*`) and the recommended `.qwe
### Environment Variables Table
| Variable | Description | Notes |
| ------------------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------ | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `QWEN_TELEMETRY_ENABLED` | Set to `true` or `1` to enable telemetry. Any other value is treated as disabling it. | Overrides the `telemetry.enabled` setting. |
| `QWEN_TELEMETRY_TARGET` | Sets the telemetry target (`local` or `gcp`). | Overrides the `telemetry.target` setting. |
| `QWEN_TELEMETRY_OTLP_ENDPOINT` | Sets the OTLP endpoint for telemetry. | Overrides the `telemetry.otlpEndpoint` setting. |
| `QWEN_TELEMETRY_OTLP_PROTOCOL` | Sets the OTLP protocol (`grpc` or `http`). | Overrides the `telemetry.otlpProtocol` setting. |
| `QWEN_TELEMETRY_LOG_PROMPTS` | Set to `true` or `1` to enable or disable logging of user prompts. Any other value is treated as disabling it. | Overrides the `telemetry.logPrompts` setting. |
| `QWEN_TELEMETRY_OUTFILE` | Sets the file path to write telemetry to when the target is `local`. | Overrides the `telemetry.outfile` setting. |
| `QWEN_TELEMETRY_USE_COLLECTOR` | Set to `true` or `1` to enable or disable using an external OTLP collector. Any other value is treated as disabling it. | Overrides the `telemetry.useCollector` setting. |
| `QWEN_SANDBOX` | Alternative to the `sandbox` setting in `settings.json`. | Accepts `true`, `false`, `docker`, `podman`, or a custom command string. |
| `SEATBELT_PROFILE` | (macOS specific) Switches the Seatbelt (`sandbox-exec`) profile on macOS. | `permissive-open`: (Default) Restricts writes to the project folder (and a few other folders, see `packages/cli/src/utils/sandbox-macos-permissive-open.sb`) but allows other operations. `strict`: Uses a strict profile that declines operations by default. `<profile_name>`: Uses a custom profile. To define a custom profile, create a file named `sandbox-macos-<profile_name>.sb` in your project's `.qwen/` directory (e.g., `my-project/.qwen/sandbox-macos-custom.sb`). |
| `DEBUG` or `DEBUG_MODE` | (often used by underlying libraries or the CLI itself) Set to `true` or `1` to enable verbose debug logging, which can be helpful for troubleshooting. | **Note:** These variables are automatically excluded from project `.env` files by default to prevent interference with the CLI behavior. Use `.qwen/.env` files if you need to set these for Qwen Code specifically. |
| `NO_COLOR` | Set to any value to disable all color output in the CLI. | |
| `CLI_TITLE` | Set to a string to customize the title of the CLI. | |
| `CODE_ASSIST_ENDPOINT` | Specifies the endpoint for the code assist server. | This is useful for development and testing. |
| `TAVILY_API_KEY` | Your API key for the Tavily web search service. | Used to enable the `web_search` tool functionality. Example: `export TAVILY_API_KEY="tvly-your-api-key-here"` |
| Variable | Description | Notes |
| ------------------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `QWEN_TELEMETRY_ENABLED` | Set to `true` or `1` to enable telemetry. Any other value is treated as disabling it. | Overrides the `telemetry.enabled` setting. |
| `QWEN_TELEMETRY_TARGET` | Sets the telemetry target (`local` or `gcp`). | Overrides the `telemetry.target` setting. |
| `QWEN_TELEMETRY_OTLP_ENDPOINT` | Sets the OTLP endpoint for telemetry. | Overrides the `telemetry.otlpEndpoint` setting. |
| `QWEN_TELEMETRY_OTLP_PROTOCOL` | Sets the OTLP protocol (`grpc` or `http`). | Overrides the `telemetry.otlpProtocol` setting. |
| `QWEN_TELEMETRY_LOG_PROMPTS` | Set to `true` or `1` to enable or disable logging of user prompts. Any other value is treated as disabling it. | Overrides the `telemetry.logPrompts` setting. |
| `QWEN_TELEMETRY_OUTFILE` | Sets the file path to write telemetry to when the target is `local`. | Overrides the `telemetry.outfile` setting. |
| `QWEN_TELEMETRY_USE_COLLECTOR` | Set to `true` or `1` to enable or disable using an external OTLP collector. Any other value is treated as disabling it. | Overrides the `telemetry.useCollector` setting. |
| `QWEN_SANDBOX` | Alternative to the `sandbox` setting in `settings.json`. | Accepts `true`, `false`, `docker`, `podman`, or a custom command string. |
| `SEATBELT_PROFILE` | (macOS specific) Switches the Seatbelt (`sandbox-exec`) profile on macOS. | `permissive-open`: (Default) Restricts writes to the project folder (and a few other folders, see `packages/cli/src/utils/sandbox-macos-permissive-open.sb`) but allows other operations. `strict`: Uses a strict profile that declines operations by default. `<profile_name>`: Uses a custom profile. To define a custom profile, create a file named `sandbox-macos-<profile_name>.sb` in your project's `.qwen/` directory (e.g., `my-project/.qwen/sandbox-macos-custom.sb`). |
| `DEBUG` or `DEBUG_MODE` | (often used by underlying libraries or the CLI itself) Set to `true` or `1` to enable verbose debug logging, which can be helpful for troubleshooting. | **Note:** These variables are automatically excluded from project `.env` files by default to prevent interference with the CLI behavior. Use `.qwen/.env` files if you need to set these for Qwen Code specifically. |
| `NO_COLOR` | Set to any value to disable all color output in the CLI. | |
| `CLI_TITLE` | Set to a string to customize the title of the CLI. | |
| `CODE_ASSIST_ENDPOINT` | Specifies the endpoint for the code assist server. | This is useful for development and testing. |
| `QWEN_CODE_MAX_OUTPUT_TOKENS` | Overrides the default maximum output tokens per response. When not set, Qwen Code uses an adaptive strategy: starts with 8K tokens and automatically retries with 64K if the response is truncated. Set this to a specific value (e.g., `16000`) to use a fixed limit instead. | Takes precedence over the capped default (8K) but is overridden by `samplingParams.max_tokens` in settings. Disables automatic escalation when set. Example: `export QWEN_CODE_MAX_OUTPUT_TOKENS=16000` |
| `TAVILY_API_KEY` | Your API key for the Tavily web search service. | Used to enable the `web_search` tool functionality. Example: `export TAVILY_API_KEY="tvly-your-api-key-here"` |
## Command-Line Arguments

@ -423,7 +423,7 @@ describe('AnthropicContentGenerator', () => {
const [anthropicRequest] =
anthropicState.lastCreateArgs as AnthropicCreateArgs;
expect(anthropicRequest).toEqual(
expect.objectContaining({ max_tokens: 32000 }),
expect.objectContaining({ max_tokens: 8000 }),
);
});
@ -488,7 +488,7 @@ describe('AnthropicContentGenerator', () => {
const [anthropicRequest] =
anthropicState.lastCreateArgs as AnthropicCreateArgs;
expect(anthropicRequest).toEqual(
expect.objectContaining({ max_tokens: 32000 }),
expect.objectContaining({ max_tokens: 8000 }),
);
});
});

@ -33,7 +33,7 @@ import { DEFAULT_TIMEOUT } from '../openaiContentGenerator/constants.js';
import { createDebugLogger } from '../../utils/debugLogger.js';
import {
tokenLimit,
DEFAULT_OUTPUT_TOKEN_LIMIT,
CAPPED_DEFAULT_MAX_TOKENS,
hasExplicitOutputLimit,
} from '../tokenLimits.js';
@ -234,12 +234,23 @@ export class AnthropicContentGenerator implements ContentGenerator {
const modelLimit = tokenLimit(modelId, 'output');
const isKnownModel = hasExplicitOutputLimit(modelId);
const maxTokens =
userMaxTokens !== undefined && userMaxTokens !== null
? isKnownModel
? Math.min(userMaxTokens, modelLimit)
: userMaxTokens
: Math.min(modelLimit, DEFAULT_OUTPUT_TOKEN_LIMIT);
let maxTokens: number;
if (userMaxTokens !== undefined && userMaxTokens !== null) {
maxTokens = isKnownModel
? Math.min(userMaxTokens, modelLimit)
: userMaxTokens;
} else {
// No explicit user config — check env var, then use capped default.
const envVal = process.env['QWEN_CODE_MAX_OUTPUT_TOKENS'];
const envMaxTokens = envVal ? parseInt(envVal, 10) : NaN;
if (!isNaN(envMaxTokens) && envMaxTokens > 0) {
maxTokens = isKnownModel
? Math.min(envMaxTokens, modelLimit)
: envMaxTokens;
} else {
maxTokens = Math.min(modelLimit, CAPPED_DEFAULT_MAX_TOKENS);
}
}
return {
max_tokens: maxTokens,

@ -16,13 +16,14 @@ import type {
Tool,
GenerateContentResponseUsageMetadata,
} from '@google/genai';
import { createUserContent } from '@google/genai';
import { createUserContent, FinishReason } from '@google/genai';
import { retryWithBackoff } from '../utils/retry.js';
import { getErrorStatus } from '../utils/errors.js';
import { createDebugLogger } from '../utils/debugLogger.js';
import { parseAndFormatApiError } from '../utils/errorParsing.js';
import { isRateLimitError, type RetryInfo } from '../utils/rateLimit.js';
import type { Config } from '../config/config.js';
import { ESCALATED_MAX_TOKENS } from './tokenLimits.js';
import { hasCycleInSchema } from '../tools/tools.js';
import type { StructuredError } from './turn.js';
import {
@ -355,6 +356,17 @@ export class GeminiChat {
cgConfig?.maxRetries ?? RATE_LIMIT_RETRY_OPTIONS.maxRetries;
const extraRetryErrorCodes = cgConfig?.retryErrorCodes;
// Max output tokens escalation: when no user/env override is set,
// the capped default (8K) is used. If the model hits MAX_TOKENS,
// retry once with escalated limit (64K).
let maxTokensEscalated = false;
const hasUserMaxTokensOverride =
(cgConfig?.samplingParams?.max_tokens !== undefined &&
cgConfig?.samplingParams?.max_tokens !== null) ||
!!process.env['QWEN_CODE_MAX_OUTPUT_TOKENS'];
let lastFinishReason: string | undefined;
for (
let attempt = 0;
attempt < INVALID_CONTENT_RETRY_OPTIONS.maxAttempts;
@ -376,7 +388,10 @@ export class GeminiChat {
prompt_id,
);
lastFinishReason = undefined;
for await (const chunk of stream) {
const fr = chunk.candidates?.[0]?.finishReason;
if (fr) lastFinishReason = fr;
yield { type: StreamEventType.CHUNK, value: chunk };
}
@ -481,6 +496,49 @@ export class GeminiChat {
}
}
// Max output tokens escalation: if the retry loop succeeded with
// the capped default (8K) but hit MAX_TOKENS, retry once at 64K.
// Placed outside the retry loop so that any errors from the
// escalated stream propagate directly (not caught by retry logic).
if (
lastError === null &&
lastFinishReason === FinishReason.MAX_TOKENS &&
!maxTokensEscalated &&
!hasUserMaxTokensOverride
) {
maxTokensEscalated = true;
debugLogger.info(
`Output truncated at capped default. Escalating to ${ESCALATED_MAX_TOKENS} tokens.`,
);
// Remove partial model response from history
// (processStreamResponse already pushed it)
if (
self.history.length > 0 &&
self.history[self.history.length - 1].role === 'model'
) {
self.history.pop();
}
// Signal UI to discard partial output
yield { type: StreamEventType.RETRY };
// Retry with escalated max_tokens
const escalatedParams: SendMessageParameters = {
...params,
config: {
...params.config,
maxOutputTokens: ESCALATED_MAX_TOKENS,
},
};
const escalatedStream = await self.makeApiCallAndProcessStream(
model,
requestContents,
escalatedParams,
prompt_id,
);
for await (const chunk of escalatedStream) {
yield { type: StreamEventType.CHUNK, value: chunk };
}
}
if (lastError) {
if (lastError instanceof InvalidStreamError) {
const totalAttempts = invalidStreamRetryCount + 1;

@ -786,9 +786,9 @@ describe('DashScopeOpenAICompatibleProvider', () => {
const result = provider.buildRequest(request, 'test-prompt-id');
// Should set conservative default (min of model limit and DEFAULT_OUTPUT_TOKEN_LIMIT)
// qwen3-max has 32K output limit, so min(32K, 32K) = 32K
expect(result.max_tokens).toBe(32000);
// Should set capped default (min of model limit and CAPPED_DEFAULT_MAX_TOKENS)
// qwen3-max has 32K output limit, so min(32K, 8K) = 8K
expect(result.max_tokens).toBe(8000);
});
it('should set conservative max_tokens when null is provided', () => {
@ -800,8 +800,8 @@ describe('DashScopeOpenAICompatibleProvider', () => {
const result = provider.buildRequest(request, 'test-prompt-id');
// null is treated as not configured, so set conservative default
expect(result.max_tokens).toBe(32000);
// null is treated as not configured, so set capped default: min(32K, 8K) = 8K
expect(result.max_tokens).toBe(8000);
});
it('should respect user max_tokens for unknown models', () => {

@ -110,8 +110,8 @@ export class DashScopeOpenAICompatibleProvider extends DefaultOpenAICompatiblePr
}
// Apply output token limits using parent class logic
// Uses conservative default (min of model limit and DEFAULT_OUTPUT_TOKEN_LIMIT)
// to preserve input quota when user hasn't explicitly configured max_tokens
// Uses capped default (min of model limit and CAPPED_DEFAULT_MAX_TOKENS=8K)
// Requests hitting the cap get one clean retry at 64K (geminiChat.ts)
const requestWithTokenLimits = this.applyOutputTokenLimit(request);
const extraBody = this.contentGeneratorConfig.extra_body;

@ -204,9 +204,9 @@ describe('DefaultOpenAICompatibleProvider', () => {
'prompt-id',
);
// Should set conservative default (min of model limit and DEFAULT_OUTPUT_TOKEN_LIMIT)
// GPT-4 has 16K output limit, so min(16K, 32K) = 16K
expect(result.max_tokens).toBe(16384);
// Should set capped default (min of model limit and CAPPED_DEFAULT_MAX_TOKENS)
// GPT-4 has 16K output limit, so min(16K, 8K) = 8K
expect(result.max_tokens).toBe(8000);
});
it('should respect user max_tokens for unknown models (deployment aliases, self-hosted)', () => {
@ -223,8 +223,8 @@ describe('DefaultOpenAICompatibleProvider', () => {
expect(result.max_tokens).toBe(100000);
});
it('should use conservative default for unknown models when max_tokens not configured', () => {
// Unknown models without user config: use DEFAULT_OUTPUT_TOKEN_LIMIT
it('should use capped default for unknown models when max_tokens not configured', () => {
// Unknown models without user config: use CAPPED_DEFAULT_MAX_TOKENS
const request: OpenAI.Chat.ChatCompletionCreateParams = {
model: 'custom-deployment-alias',
messages: [{ role: 'user', content: 'Hello' }],
@ -232,8 +232,8 @@ describe('DefaultOpenAICompatibleProvider', () => {
const result = provider.buildRequest(request, 'prompt-id');
// Uses conservative default (32K)
expect(result.max_tokens).toBe(32000);
// Uses capped default (8K)
expect(result.max_tokens).toBe(8000);
});
it('should cap max_tokens for known models to avoid API errors', () => {
@ -259,8 +259,8 @@ describe('DefaultOpenAICompatibleProvider', () => {
const result = provider.buildRequest(request, 'prompt-id');
// GPT-4 has 16K output limit, so conservative default is still 16K
expect(result.max_tokens).toBe(16384);
// GPT-4 has 16K output limit, capped default is 8K: min(16K, 8K) = 8K
expect(result.max_tokens).toBe(8000);
});
it('should preserve all sampling parameters', () => {
@ -303,7 +303,7 @@ describe('DefaultOpenAICompatibleProvider', () => {
// Should set conservative max_tokens default
expect(result.model).toBe('gpt-4');
expect(result.messages).toEqual(minimalRequest.messages);
expect(result.max_tokens).toBe(16384); // GPT-4 has 16K limit, min(16K, 32K) = 16K
expect(result.max_tokens).toBe(8000); // GPT-4 has 16K limit, min(16K, 8K) = 8K
});
it('should handle streaming requests', () => {
@ -319,7 +319,7 @@ describe('DefaultOpenAICompatibleProvider', () => {
expect(result.model).toBe('gpt-4');
expect(result.messages).toEqual(streamingRequest.messages);
expect(result.stream).toBe(true);
expect(result.max_tokens).toBe(16384); // GPT-4 has 16K limit, min(16K, 32K) = 16K
expect(result.max_tokens).toBe(8000); // GPT-4 has 16K limit, min(16K, 8K) = 8K
});
it('should not modify the original request object', () => {
@ -363,7 +363,7 @@ describe('DefaultOpenAICompatibleProvider', () => {
expect(result).toEqual({
...originalRequest,
max_tokens: 16384, // GPT-4 has 16K limit, min(16K, 32K) = 16K
max_tokens: 8000, // GPT-4 has 16K limit, min(16K, 8K) = 8K
custom_param: 'custom_value',
nested: { key: 'value' },
});
@ -382,7 +382,7 @@ describe('DefaultOpenAICompatibleProvider', () => {
expect(result.model).toBe('gpt-4');
expect(result.messages).toEqual(originalRequest.messages);
expect(result.temperature).toBe(0.7);
expect(result.max_tokens).toBe(16384); // GPT-4 has 16K limit, min(16K, 32K) = 16K
expect(result.max_tokens).toBe(8000); // GPT-4 has 16K limit, min(16K, 8K) = 8K
expect(result).not.toHaveProperty('custom_param');
});
});

@ -7,7 +7,7 @@ import type { OpenAICompatibleProvider } from './types.js';
import { buildRuntimeFetchOptions } from '../../../utils/runtimeFetchOptions.js';
import {
tokenLimit,
DEFAULT_OUTPUT_TOKEN_LIMIT,
CAPPED_DEFAULT_MAX_TOKENS,
hasExplicitOutputLimit,
} from '../../tokenLimits.js';
@ -101,18 +101,19 @@ export class DefaultOpenAICompatibleProvider
* - For unknown models (deployment aliases, self-hosted): respect user's
* configured value entirely (backend may support larger limits)
* 2. If user didn't configure max_tokens:
* - Use min(modelLimit, DEFAULT_OUTPUT_TOKEN_LIMIT)
* - This provides a conservative default (32K) that avoids truncating output
* while preserving input quota (not occupying too much context window)
* - Check QWEN_CODE_MAX_OUTPUT_TOKENS env var first
* - Otherwise use min(modelLimit, CAPPED_DEFAULT_MAX_TOKENS=8K)
* - Requests hitting the 8K cap get one clean retry at 64K (geminiChat.ts)
* 3. If model has no specific limit (tokenLimit returns default):
* - Still apply DEFAULT_OUTPUT_TOKEN_LIMIT as safeguard
* - Still apply CAPPED_DEFAULT_MAX_TOKENS as safeguard
*
* Examples:
* - User sets 4K, known model limit 64K → uses 4K (respects user preference)
* - User sets 100K, known model limit 64K → uses 64K (capped to avoid API error)
* - User sets 100K, unknown model → uses 100K (respects user, backend may support it)
* - User not set, model limit 64K → uses 32K (conservative default)
* - User not set, model limit 8K → uses 8K (model limit is lower)
* - User not set, model limit 64K → uses 8K (capped default for slot optimization)
* - User not set, model limit 4K → uses 4K (model limit is lower)
* - User not set, env QWEN_CODE_MAX_OUTPUT_TOKENS=16000 -> uses 16K
*
* @param request - The chat completion request parameters
* @returns The request with max_tokens adjusted according to the logic
@ -140,9 +141,18 @@ export class DefaultOpenAICompatibleProvider
effectiveMaxTokens = userMaxTokens;
}
} else {
// User didn't configure, use conservative default:
// min(model-specific limit, DEFAULT_OUTPUT_TOKEN_LIMIT)
effectiveMaxTokens = Math.min(modelLimit, DEFAULT_OUTPUT_TOKEN_LIMIT);
// No explicit user config — check env var, then use capped default.
// Capped default (8K) reduces GPU slot over-reservation by ~4×.
// Requests hitting the cap get one clean retry at 64K (geminiChat.ts).
const envVal = process.env['QWEN_CODE_MAX_OUTPUT_TOKENS'];
const envMaxTokens = envVal ? parseInt(envVal, 10) : NaN;
if (!isNaN(envMaxTokens) && envMaxTokens > 0) {
effectiveMaxTokens = isKnownModel
? Math.min(envMaxTokens, modelLimit)
: envMaxTokens;
} else {
effectiveMaxTokens = Math.min(modelLimit, CAPPED_DEFAULT_MAX_TOKENS);
}
}
return {

@ -11,6 +11,13 @@ export type TokenLimitType = 'input' | 'output';
export const DEFAULT_TOKEN_LIMIT: TokenCount = 131_072; // 128K (power-of-two)
export const DEFAULT_OUTPUT_TOKEN_LIMIT: TokenCount = 32_000; // 32K tokens
// Capped default for slot-reservation optimization. 99% of outputs are under 5K
// tokens, so 32K defaults over-reserve 4-6× slot capacity. With the cap
// enabled, <1% of requests hit the limit; those get one clean retry at 64K
// (see geminiChat.ts max_output_tokens escalation).
export const CAPPED_DEFAULT_MAX_TOKENS: TokenCount = 8_000;
export const ESCALATED_MAX_TOKENS: TokenCount = 64_000;
/**
* Accurate numeric limits:
* - power-of-two approximations (128K -> 131072, 256K -> 262144, etc.)

@ -280,8 +280,13 @@ export class Turn {
return;
}
// Handle the new RETRY event
// Handle the new RETRY event: clear accumulated state from the
// previous attempt to avoid duplicate tool calls and stale metadata.
if (streamEvent.type === 'retry') {
this.pendingToolCalls.length = 0;
this.pendingCitations.clear();
this.debugResponses = [];
this.finishReason = undefined;
yield {
type: GeminiEventType.Retry,
retryInfo: streamEvent.retryInfo,