feat(core): adaptive output token escalation (8K default + 64K retry) (#2898)

* feat(core): adaptive output token escalation (8K default + 64K retry)

99% of model responses are under 5K tokens, but we previously reserved
32K for every request, over-reserving GPU slot capacity by roughly 4x.

Now the default output limit is 8K. When a response hits this cap
(stop_reason=max_tokens), it automatically retries once at 64K — only
the ~1% of requests that actually need more tokens pay the cost.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: add design doc and user doc for adaptive output token escalation

- Add design doc covering problem, architecture, token limit
  determination, escalation mechanism, and design decisions
- Document QWEN_CODE_MAX_OUTPUT_TOKENS env var in settings.md
- Add max_tokens adaptive behavior explanation in model config section

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Shaojin Wen 2026-04-08 17:30:39 +08:00 committed by GitHub
parent 3c23952ef7
commit 1e8bc031cc
11 changed files with 299 additions and 57 deletions

@ -0,0 +1,138 @@
# Adaptive Output Token Escalation Design
> Reduces GPU slot over-reservation by ~4x through a "low default + escalate on truncation" strategy for output tokens.
## Problem
Every API request reserves a fixed GPU slot proportional to `max_tokens`. The previous default of 32K tokens means each request reserves a 32K output slot, but 99% of responses are under 5K tokens. This over-reserves GPU capacity by 4-6x, limiting server concurrency and increasing cost.
## Solution
Use a capped default of **8K** output tokens. When a response is truncated (the model hits `max_tokens`), automatically retry once with an escalated limit of **64K**. Since <1% of requests are actually truncated, this reduces average slot reservation significantly while preserving output quality for long responses.
## Architecture
```
        ┌─────────────────────────┐
        │     Request starts      │
        │     max_tokens = 8K     │
        └────────────┬────────────┘
                     │
                     ▼
        ┌─────────────────────────┐
        │     Stream response     │
        └────────────┬────────────┘
                     │
          ┌──────────┴──────────┐
          │                     │
    finish_reason         finish_reason
    != MAX_TOKENS         == MAX_TOKENS
          │                     │
          ▼                     ▼
    ┌───────────┐    ┌─────────────────────┐
    │   Done    │    │ Check conditions:   │
    └───────────┘    │ - No user override? │
                     │ - No env override?  │
                     │ - Not already       │
                     │   escalated?        │
                     └──────────┬──────────┘
                          YES   │  NO
                        ┌───────┴───────┐
                        │               │
                        ▼               ▼
                 ┌─────────────┐   ┌──────────┐
                 │ Pop partial │   │   Done   │
                 │ model resp  │   │ (truncd) │
                 │ from history│   └──────────┘
                 │             │
                 │ Yield RETRY │
                 │ event       │
                 │             │
                 │ Re-send     │
                 │ max_tokens  │
                 │ = 64K       │
                 └─────────────┘
```
## Token limit determination
The effective `max_tokens` is resolved in the following priority order:
| Priority | Source | Value (known model) | Value (unknown model) | Escalation behavior |
| ----------- | ---------------------------------------------------- | ---------------------------- | --------------------- | ------------------------------ |
| 1 (highest) | User config (`samplingParams.max_tokens`) | `min(userValue, modelLimit)` | `userValue` | No escalation |
| 2 | Environment variable (`QWEN_CODE_MAX_OUTPUT_TOKENS`) | `min(envValue, modelLimit)` | `envValue` | No escalation |
| 3 (lowest) | Capped default | `min(modelLimit, 8K)` | `min(32K, 8K)` = 8K | Escalates to 64K on truncation |
A "known model" is one that has an explicit entry in `OUTPUT_PATTERNS` (checked via `hasExplicitOutputLimit()`). For known models, the effective value is always capped at the model's declared output limit to avoid API errors. Unknown models (custom deployments, self-hosted endpoints) pass the user's value through directly, since the backend may support larger limits.
This logic is implemented in three content generators:
- `DefaultOpenAICompatibleProvider.applyOutputTokenLimit()` — OpenAI-compatible providers
- `DashScopeProvider` — inherits `applyOutputTokenLimit()` from the default provider
- `AnthropicContentGenerator.buildSamplingParameters()` — Anthropic provider
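A condensed, illustrative sketch of that resolution order (the function name is illustrative; the real logic lives in the provider methods listed above, using `tokenLimit`, `hasExplicitOutputLimit`, and `CAPPED_DEFAULT_MAX_TOKENS` from `tokenLimits.ts`):
```
function resolveMaxTokens(
  userMaxTokens: number | undefined | null,
  modelId: string,
): number {
  const modelLimit = tokenLimit(modelId, 'output');
  const isKnownModel = hasExplicitOutputLimit(modelId);

  // Priority 1: explicit user config (never escalated).
  if (userMaxTokens !== undefined && userMaxTokens !== null) {
    return isKnownModel ? Math.min(userMaxTokens, modelLimit) : userMaxTokens;
  }

  // Priority 2: QWEN_CODE_MAX_OUTPUT_TOKENS env var (never escalated).
  const envVal = process.env['QWEN_CODE_MAX_OUTPUT_TOKENS'];
  const envMaxTokens = envVal ? parseInt(envVal, 10) : NaN;
  if (!isNaN(envMaxTokens) && envMaxTokens > 0) {
    return isKnownModel ? Math.min(envMaxTokens, modelLimit) : envMaxTokens;
  }

  // Priority 3: capped default (eligible for 8K -> 64K escalation).
  return Math.min(modelLimit, CAPPED_DEFAULT_MAX_TOKENS);
}
```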
## Escalation mechanism
The escalation logic lives in `geminiChat.ts`, placed **outside** the main retry loop. This is intentional:
1. The retry loop handles transient errors (rate limits, invalid streams, content validation)
2. Truncation is not an error — it's a successful response that was cut short
3. Errors from the escalated stream should propagate directly to the caller, not be caught by retry logic
### Escalation steps (geminiChat.ts)
```
1. Stream completes successfully (lastError === null)
2. Last chunk has finishReason === MAX_TOKENS
3. Guard checks pass:
- maxTokensEscalated === false (prevent infinite escalation)
- hasUserMaxTokensOverride === false (respect user intent)
4. Pop the partial model response from chat history
5. Yield RETRY event → UI discards partial output
6. Re-send the same request with maxOutputTokens: 64K
```
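Condensed from the `geminiChat.ts` change in this PR (the surrounding generator and `self` plumbing are elided), the core of the escalation path looks like:
```
if (
  lastError === null &&
  lastFinishReason === FinishReason.MAX_TOKENS &&
  !maxTokensEscalated &&
  !hasUserMaxTokensOverride
) {
  maxTokensEscalated = true;

  // processStreamResponse already pushed the partial model turn; drop it.
  if (self.history.at(-1)?.role === 'model') {
    self.history.pop();
  }

  // Turn discards the partial output it has accumulated so far.
  yield { type: StreamEventType.RETRY };

  // Same request, escalated output limit; errors propagate to the caller.
  const escalatedStream = await self.makeApiCallAndProcessStream(
    model,
    requestContents,
    { ...params, config: { ...params.config, maxOutputTokens: ESCALATED_MAX_TOKENS } },
    prompt_id,
  );
  for await (const chunk of escalatedStream) {
    yield { type: StreamEventType.CHUNK, value: chunk };
  }
}
```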
### State cleanup on RETRY (turn.ts)
When the `Turn` class receives a RETRY event, it clears accumulated state to prevent inconsistencies:
- `pendingToolCalls` — cleared to avoid duplicate tool calls if the first truncated response contained completed tool calls that are repeated in the escalated response
- `pendingCitations` — cleared to avoid duplicate citations
- `debugResponses` — cleared to avoid stale debug data
- `finishReason` — reset to `undefined` so the new response's finish reason is used
## Constants
Defined in `tokenLimits.ts`:
| Constant | Value | Purpose |
| --------------------------- | ------ | ------------------------------------------------------- |
| `CAPPED_DEFAULT_MAX_TOKENS` | 8,000 | Default output token limit when no user override is set |
| `ESCALATED_MAX_TOKENS` | 64,000 | Output token limit used on truncation retry |
## Design decisions
### Why 8K default?
- 99% of responses are under 5K tokens
- 8K provides reasonable headroom for slightly longer responses without triggering unnecessary retries
- Reduces average slot reservation from 32K to 8K (4x improvement)
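As a back-of-envelope check using the figures above: with a <1% escalation rate, the expected reservation per request is roughly `0.99 × 8K + 0.01 × (8K + 64K) ≈ 8.6K` tokens, so the effective improvement over the previous 32K default stays close to 4x even after accounting for retries (an estimate from the stated rates, not a measured figure).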
### Why 64K escalated limit?
- Covers the vast majority of long outputs that were truncated at 8K
- Matches the output limit of many modern models (Claude Sonnet, Gemini 3.x, Qwen3.x)
- Higher values (e.g., 128K) would negate slot optimization benefits for the <1% of requests that escalate
### Why not progressive escalation (8K → 16K → 32K → 64K)?
- Each retry adds latency (the full response must be regenerated)
- A single retry is the simplest approach that captures almost all cases
- The <1% truncation rate at 8K means almost no requests need escalation; those that do are likely to need significantly more than 16K
### Why is escalation outside the retry loop?
- Truncation is a success case, not an error
- Errors from the escalated stream (rate limits, network failures) should propagate directly rather than being silently retried with incorrect parameters
- Keeps the retry loop focused on its original purpose (transient error recovery)

@ -168,6 +168,18 @@ Settings are organized into categories. All settings should be placed within the
}
```
**max_tokens (adaptive output tokens):**
When `samplingParams.max_tokens` is not set, Qwen Code uses an adaptive output token strategy to optimize GPU resource usage:
1. Requests start with a default limit of **8K** output tokens
2. If the response is truncated (the model hits the limit), Qwen Code automatically retries with **64K** tokens
3. The partial output is discarded and replaced with the full response from the retry
This happens transparently; at most you will briefly see a retry indicator when escalation occurs. Since 99% of responses are under 5K tokens, escalation is rare (<1% of requests).
To override this behavior, either set `samplingParams.max_tokens` in your settings or use the `QWEN_CODE_MAX_OUTPUT_TOKENS` environment variable.
**contextWindowSize:**
Overrides the default context window size for the selected model. Qwen Code determines the context window using built-in defaults based on model name matching, with a constant fallback value. Use this setting when a provider's effective context limit differs from Qwen Code's default. This value defines the model's assumed maximum context capacity, not a per-request token limit.
@ -491,22 +503,23 @@ For authentication-related variables (like `OPENAI_*`) and the recommended `.qwe
### Environment Variables Table
| Variable | Description | Notes |
| ------------------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------ | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `QWEN_TELEMETRY_ENABLED` | Set to `true` or `1` to enable telemetry. Any other value is treated as disabling it. | Overrides the `telemetry.enabled` setting. |
| `QWEN_TELEMETRY_TARGET` | Sets the telemetry target (`local` or `gcp`). | Overrides the `telemetry.target` setting. |
| `QWEN_TELEMETRY_OTLP_ENDPOINT` | Sets the OTLP endpoint for telemetry. | Overrides the `telemetry.otlpEndpoint` setting. |
| `QWEN_TELEMETRY_OTLP_PROTOCOL` | Sets the OTLP protocol (`grpc` or `http`). | Overrides the `telemetry.otlpProtocol` setting. |
| `QWEN_TELEMETRY_LOG_PROMPTS` | Set to `true` or `1` to enable or disable logging of user prompts. Any other value is treated as disabling it. | Overrides the `telemetry.logPrompts` setting. |
| `QWEN_TELEMETRY_OUTFILE` | Sets the file path to write telemetry to when the target is `local`. | Overrides the `telemetry.outfile` setting. |
| `QWEN_TELEMETRY_USE_COLLECTOR` | Set to `true` or `1` to enable or disable using an external OTLP collector. Any other value is treated as disabling it. | Overrides the `telemetry.useCollector` setting. |
| `QWEN_SANDBOX` | Alternative to the `sandbox` setting in `settings.json`. | Accepts `true`, `false`, `docker`, `podman`, or a custom command string. |
| `SEATBELT_PROFILE` | (macOS specific) Switches the Seatbelt (`sandbox-exec`) profile on macOS. | `permissive-open`: (Default) Restricts writes to the project folder (and a few other folders, see `packages/cli/src/utils/sandbox-macos-permissive-open.sb`) but allows other operations. `strict`: Uses a strict profile that declines operations by default. `<profile_name>`: Uses a custom profile. To define a custom profile, create a file named `sandbox-macos-<profile_name>.sb` in your project's `.qwen/` directory (e.g., `my-project/.qwen/sandbox-macos-custom.sb`). |
| `DEBUG` or `DEBUG_MODE` | (often used by underlying libraries or the CLI itself) Set to `true` or `1` to enable verbose debug logging, which can be helpful for troubleshooting. | **Note:** These variables are automatically excluded from project `.env` files by default to prevent interference with the CLI behavior. Use `.qwen/.env` files if you need to set these for Qwen Code specifically. |
| `NO_COLOR` | Set to any value to disable all color output in the CLI. | |
| `CLI_TITLE` | Set to a string to customize the title of the CLI. | |
| `CODE_ASSIST_ENDPOINT` | Specifies the endpoint for the code assist server. | This is useful for development and testing. |
| `TAVILY_API_KEY` | Your API key for the Tavily web search service. | Used to enable the `web_search` tool functionality. Example: `export TAVILY_API_KEY="tvly-your-api-key-here"` |
| Variable | Description | Notes |
| ------------------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `QWEN_TELEMETRY_ENABLED` | Set to `true` or `1` to enable telemetry. Any other value is treated as disabling it. | Overrides the `telemetry.enabled` setting. |
| `QWEN_TELEMETRY_TARGET` | Sets the telemetry target (`local` or `gcp`). | Overrides the `telemetry.target` setting. |
| `QWEN_TELEMETRY_OTLP_ENDPOINT` | Sets the OTLP endpoint for telemetry. | Overrides the `telemetry.otlpEndpoint` setting. |
| `QWEN_TELEMETRY_OTLP_PROTOCOL` | Sets the OTLP protocol (`grpc` or `http`). | Overrides the `telemetry.otlpProtocol` setting. |
| `QWEN_TELEMETRY_LOG_PROMPTS` | Set to `true` or `1` to enable or disable logging of user prompts. Any other value is treated as disabling it. | Overrides the `telemetry.logPrompts` setting. |
| `QWEN_TELEMETRY_OUTFILE` | Sets the file path to write telemetry to when the target is `local`. | Overrides the `telemetry.outfile` setting. |
| `QWEN_TELEMETRY_USE_COLLECTOR` | Set to `true` or `1` to enable or disable using an external OTLP collector. Any other value is treated as disabling it. | Overrides the `telemetry.useCollector` setting. |
| `QWEN_SANDBOX` | Alternative to the `sandbox` setting in `settings.json`. | Accepts `true`, `false`, `docker`, `podman`, or a custom command string. |
| `SEATBELT_PROFILE` | (macOS specific) Switches the Seatbelt (`sandbox-exec`) profile on macOS. | `permissive-open`: (Default) Restricts writes to the project folder (and a few other folders, see `packages/cli/src/utils/sandbox-macos-permissive-open.sb`) but allows other operations. `strict`: Uses a strict profile that declines operations by default. `<profile_name>`: Uses a custom profile. To define a custom profile, create a file named `sandbox-macos-<profile_name>.sb` in your project's `.qwen/` directory (e.g., `my-project/.qwen/sandbox-macos-custom.sb`). |
| `DEBUG` or `DEBUG_MODE` | (often used by underlying libraries or the CLI itself) Set to `true` or `1` to enable verbose debug logging, which can be helpful for troubleshooting. | **Note:** These variables are automatically excluded from project `.env` files by default to prevent interference with the CLI behavior. Use `.qwen/.env` files if you need to set these for Qwen Code specifically. |
| `NO_COLOR` | Set to any value to disable all color output in the CLI. | |
| `CLI_TITLE` | Set to a string to customize the title of the CLI. | |
| `CODE_ASSIST_ENDPOINT` | Specifies the endpoint for the code assist server. | This is useful for development and testing. |
| `QWEN_CODE_MAX_OUTPUT_TOKENS` | Overrides the default maximum output tokens per response. When not set, Qwen Code uses an adaptive strategy: starts with 8K tokens and automatically retries with 64K if the response is truncated. Set this to a specific value (e.g., `16000`) to use a fixed limit instead. | Takes precedence over the capped default (8K) but is overridden by `samplingParams.max_tokens` in settings. Disables automatic escalation when set. Example: `export QWEN_CODE_MAX_OUTPUT_TOKENS=16000` |
| `TAVILY_API_KEY` | Your API key for the Tavily web search service. | Used to enable the `web_search` tool functionality. Example: `export TAVILY_API_KEY="tvly-your-api-key-here"` |
## Command-Line Arguments

@ -423,7 +423,7 @@ describe('AnthropicContentGenerator', () => {
const [anthropicRequest] =
anthropicState.lastCreateArgs as AnthropicCreateArgs;
expect(anthropicRequest).toEqual(
expect.objectContaining({ max_tokens: 32000 }),
expect.objectContaining({ max_tokens: 8000 }),
);
});
@ -488,7 +488,7 @@ describe('AnthropicContentGenerator', () => {
const [anthropicRequest] =
anthropicState.lastCreateArgs as AnthropicCreateArgs;
expect(anthropicRequest).toEqual(
expect.objectContaining({ max_tokens: 32000 }),
expect.objectContaining({ max_tokens: 8000 }),
);
});
});

@ -33,7 +33,7 @@ import { DEFAULT_TIMEOUT } from '../openaiContentGenerator/constants.js';
import { createDebugLogger } from '../../utils/debugLogger.js';
import {
tokenLimit,
DEFAULT_OUTPUT_TOKEN_LIMIT,
CAPPED_DEFAULT_MAX_TOKENS,
hasExplicitOutputLimit,
} from '../tokenLimits.js';
@ -234,12 +234,23 @@ export class AnthropicContentGenerator implements ContentGenerator {
const modelLimit = tokenLimit(modelId, 'output');
const isKnownModel = hasExplicitOutputLimit(modelId);
const maxTokens =
userMaxTokens !== undefined && userMaxTokens !== null
? isKnownModel
? Math.min(userMaxTokens, modelLimit)
: userMaxTokens
: Math.min(modelLimit, DEFAULT_OUTPUT_TOKEN_LIMIT);
let maxTokens: number;
if (userMaxTokens !== undefined && userMaxTokens !== null) {
maxTokens = isKnownModel
? Math.min(userMaxTokens, modelLimit)
: userMaxTokens;
} else {
// No explicit user config — check env var, then use capped default.
const envVal = process.env['QWEN_CODE_MAX_OUTPUT_TOKENS'];
const envMaxTokens = envVal ? parseInt(envVal, 10) : NaN;
if (!isNaN(envMaxTokens) && envMaxTokens > 0) {
maxTokens = isKnownModel
? Math.min(envMaxTokens, modelLimit)
: envMaxTokens;
} else {
maxTokens = Math.min(modelLimit, CAPPED_DEFAULT_MAX_TOKENS);
}
}
return {
max_tokens: maxTokens,

@ -16,13 +16,14 @@ import type {
Tool,
GenerateContentResponseUsageMetadata,
} from '@google/genai';
import { createUserContent } from '@google/genai';
import { createUserContent, FinishReason } from '@google/genai';
import { retryWithBackoff } from '../utils/retry.js';
import { getErrorStatus } from '../utils/errors.js';
import { createDebugLogger } from '../utils/debugLogger.js';
import { parseAndFormatApiError } from '../utils/errorParsing.js';
import { isRateLimitError, type RetryInfo } from '../utils/rateLimit.js';
import type { Config } from '../config/config.js';
import { ESCALATED_MAX_TOKENS } from './tokenLimits.js';
import { hasCycleInSchema } from '../tools/tools.js';
import type { StructuredError } from './turn.js';
import {
@ -355,6 +356,17 @@ export class GeminiChat {
cgConfig?.maxRetries ?? RATE_LIMIT_RETRY_OPTIONS.maxRetries;
const extraRetryErrorCodes = cgConfig?.retryErrorCodes;
// Max output tokens escalation: when no user/env override is set,
// the capped default (8K) is used. If the model hits MAX_TOKENS,
// retry once with escalated limit (64K).
let maxTokensEscalated = false;
const hasUserMaxTokensOverride =
(cgConfig?.samplingParams?.max_tokens !== undefined &&
cgConfig?.samplingParams?.max_tokens !== null) ||
!!process.env['QWEN_CODE_MAX_OUTPUT_TOKENS'];
let lastFinishReason: string | undefined;
for (
let attempt = 0;
attempt < INVALID_CONTENT_RETRY_OPTIONS.maxAttempts;
@ -376,7 +388,10 @@ export class GeminiChat {
prompt_id,
);
lastFinishReason = undefined;
for await (const chunk of stream) {
const fr = chunk.candidates?.[0]?.finishReason;
if (fr) lastFinishReason = fr;
yield { type: StreamEventType.CHUNK, value: chunk };
}
@ -481,6 +496,49 @@ export class GeminiChat {
}
}
// Max output tokens escalation: if the retry loop succeeded with
// the capped default (8K) but hit MAX_TOKENS, retry once at 64K.
// Placed outside the retry loop so that any errors from the
// escalated stream propagate directly (not caught by retry logic).
if (
lastError === null &&
lastFinishReason === FinishReason.MAX_TOKENS &&
!maxTokensEscalated &&
!hasUserMaxTokensOverride
) {
maxTokensEscalated = true;
debugLogger.info(
`Output truncated at capped default. Escalating to ${ESCALATED_MAX_TOKENS} tokens.`,
);
// Remove partial model response from history
// (processStreamResponse already pushed it)
if (
self.history.length > 0 &&
self.history[self.history.length - 1].role === 'model'
) {
self.history.pop();
}
// Signal UI to discard partial output
yield { type: StreamEventType.RETRY };
// Retry with escalated max_tokens
const escalatedParams: SendMessageParameters = {
...params,
config: {
...params.config,
maxOutputTokens: ESCALATED_MAX_TOKENS,
},
};
const escalatedStream = await self.makeApiCallAndProcessStream(
model,
requestContents,
escalatedParams,
prompt_id,
);
for await (const chunk of escalatedStream) {
yield { type: StreamEventType.CHUNK, value: chunk };
}
}
if (lastError) {
if (lastError instanceof InvalidStreamError) {
const totalAttempts = invalidStreamRetryCount + 1;

@ -786,9 +786,9 @@ describe('DashScopeOpenAICompatibleProvider', () => {
const result = provider.buildRequest(request, 'test-prompt-id');
// Should set conservative default (min of model limit and DEFAULT_OUTPUT_TOKEN_LIMIT)
// qwen3-max has 32K output limit, so min(32K, 32K) = 32K
expect(result.max_tokens).toBe(32000);
// Should set capped default (min of model limit and CAPPED_DEFAULT_MAX_TOKENS)
// qwen3-max has 32K output limit, so min(32K, 8K) = 8K
expect(result.max_tokens).toBe(8000);
});
it('should set conservative max_tokens when null is provided', () => {
@ -800,8 +800,8 @@ describe('DashScopeOpenAICompatibleProvider', () => {
const result = provider.buildRequest(request, 'test-prompt-id');
// null is treated as not configured, so set conservative default
expect(result.max_tokens).toBe(32000);
// null is treated as not configured, so set capped default: min(32K, 8K) = 8K
expect(result.max_tokens).toBe(8000);
});
it('should respect user max_tokens for unknown models', () => {

@ -110,8 +110,8 @@ export class DashScopeOpenAICompatibleProvider extends DefaultOpenAICompatiblePr
}
// Apply output token limits using parent class logic
// Uses conservative default (min of model limit and DEFAULT_OUTPUT_TOKEN_LIMIT)
// to preserve input quota when user hasn't explicitly configured max_tokens
// Uses capped default (min of model limit and CAPPED_DEFAULT_MAX_TOKENS=8K)
// Requests hitting the cap get one clean retry at 64K (geminiChat.ts)
const requestWithTokenLimits = this.applyOutputTokenLimit(request);
const extraBody = this.contentGeneratorConfig.extra_body;

@ -204,9 +204,9 @@ describe('DefaultOpenAICompatibleProvider', () => {
'prompt-id',
);
// Should set conservative default (min of model limit and DEFAULT_OUTPUT_TOKEN_LIMIT)
// GPT-4 has 16K output limit, so min(16K, 32K) = 16K
expect(result.max_tokens).toBe(16384);
// Should set capped default (min of model limit and CAPPED_DEFAULT_MAX_TOKENS)
// GPT-4 has 16K output limit, so min(16K, 8K) = 8K
expect(result.max_tokens).toBe(8000);
});
it('should respect user max_tokens for unknown models (deployment aliases, self-hosted)', () => {
@ -223,8 +223,8 @@ describe('DefaultOpenAICompatibleProvider', () => {
expect(result.max_tokens).toBe(100000);
});
it('should use conservative default for unknown models when max_tokens not configured', () => {
// Unknown models without user config: use DEFAULT_OUTPUT_TOKEN_LIMIT
it('should use capped default for unknown models when max_tokens not configured', () => {
// Unknown models without user config: use CAPPED_DEFAULT_MAX_TOKENS
const request: OpenAI.Chat.ChatCompletionCreateParams = {
model: 'custom-deployment-alias',
messages: [{ role: 'user', content: 'Hello' }],
@ -232,8 +232,8 @@ describe('DefaultOpenAICompatibleProvider', () => {
const result = provider.buildRequest(request, 'prompt-id');
// Uses conservative default (32K)
expect(result.max_tokens).toBe(32000);
// Uses capped default (8K)
expect(result.max_tokens).toBe(8000);
});
it('should cap max_tokens for known models to avoid API errors', () => {
@ -259,8 +259,8 @@ describe('DefaultOpenAICompatibleProvider', () => {
const result = provider.buildRequest(request, 'prompt-id');
// GPT-4 has 16K output limit, so conservative default is still 16K
expect(result.max_tokens).toBe(16384);
// GPT-4 has 16K output limit, capped default is 8K: min(16K, 8K) = 8K
expect(result.max_tokens).toBe(8000);
});
it('should preserve all sampling parameters', () => {
@ -303,7 +303,7 @@ describe('DefaultOpenAICompatibleProvider', () => {
// Should set conservative max_tokens default
expect(result.model).toBe('gpt-4');
expect(result.messages).toEqual(minimalRequest.messages);
expect(result.max_tokens).toBe(16384); // GPT-4 has 16K limit, min(16K, 32K) = 16K
expect(result.max_tokens).toBe(8000); // GPT-4 has 16K limit, min(16K, 8K) = 8K
});
it('should handle streaming requests', () => {
@ -319,7 +319,7 @@ describe('DefaultOpenAICompatibleProvider', () => {
expect(result.model).toBe('gpt-4');
expect(result.messages).toEqual(streamingRequest.messages);
expect(result.stream).toBe(true);
expect(result.max_tokens).toBe(16384); // GPT-4 has 16K limit, min(16K, 32K) = 16K
expect(result.max_tokens).toBe(8000); // GPT-4 has 16K limit, min(16K, 8K) = 8K
});
it('should not modify the original request object', () => {
@ -363,7 +363,7 @@ describe('DefaultOpenAICompatibleProvider', () => {
expect(result).toEqual({
...originalRequest,
max_tokens: 16384, // GPT-4 has 16K limit, min(16K, 32K) = 16K
max_tokens: 8000, // GPT-4 has 16K limit, min(16K, 8K) = 8K
custom_param: 'custom_value',
nested: { key: 'value' },
});
@ -382,7 +382,7 @@ describe('DefaultOpenAICompatibleProvider', () => {
expect(result.model).toBe('gpt-4');
expect(result.messages).toEqual(originalRequest.messages);
expect(result.temperature).toBe(0.7);
expect(result.max_tokens).toBe(16384); // GPT-4 has 16K limit, min(16K, 32K) = 16K
expect(result.max_tokens).toBe(8000); // GPT-4 has 16K limit, min(16K, 8K) = 8K
expect(result).not.toHaveProperty('custom_param');
});
});

@ -7,7 +7,7 @@ import type { OpenAICompatibleProvider } from './types.js';
import { buildRuntimeFetchOptions } from '../../../utils/runtimeFetchOptions.js';
import {
tokenLimit,
DEFAULT_OUTPUT_TOKEN_LIMIT,
CAPPED_DEFAULT_MAX_TOKENS,
hasExplicitOutputLimit,
} from '../../tokenLimits.js';
@ -101,18 +101,19 @@ export class DefaultOpenAICompatibleProvider
* - For unknown models (deployment aliases, self-hosted): respect user's
* configured value entirely (backend may support larger limits)
* 2. If user didn't configure max_tokens:
* - Use min(modelLimit, DEFAULT_OUTPUT_TOKEN_LIMIT)
* - This provides a conservative default (32K) that avoids truncating output
* while preserving input quota (not occupying too much context window)
* - Check QWEN_CODE_MAX_OUTPUT_TOKENS env var first
* - Otherwise use min(modelLimit, CAPPED_DEFAULT_MAX_TOKENS=8K)
* - Requests hitting the 8K cap get one clean retry at 64K (geminiChat.ts)
* 3. If model has no specific limit (tokenLimit returns default):
* - Still apply DEFAULT_OUTPUT_TOKEN_LIMIT as safeguard
* - Still apply CAPPED_DEFAULT_MAX_TOKENS as safeguard
*
* Examples:
* - User sets 4K, known model limit 64K → uses 4K (respects user preference)
* - User sets 100K, known model limit 64K → uses 64K (capped to avoid API error)
* - User sets 100K, unknown model → uses 100K (respects user, backend may support it)
* - User not set, model limit 64K → uses 32K (conservative default)
* - User not set, model limit 8K → uses 8K (model limit is lower)
* - User not set, model limit 64K → uses 8K (capped default for slot optimization)
* - User not set, model limit 4K → uses 4K (model limit is lower)
* - User not set, env QWEN_CODE_MAX_OUTPUT_TOKENS=16000 -> uses 16K
*
* @param request - The chat completion request parameters
* @returns The request with max_tokens adjusted according to the logic
@ -140,9 +141,18 @@ export class DefaultOpenAICompatibleProvider
effectiveMaxTokens = userMaxTokens;
}
} else {
// User didn't configure, use conservative default:
// min(model-specific limit, DEFAULT_OUTPUT_TOKEN_LIMIT)
effectiveMaxTokens = Math.min(modelLimit, DEFAULT_OUTPUT_TOKEN_LIMIT);
// No explicit user config — check env var, then use capped default.
// Capped default (8K) reduces GPU slot over-reservation by ~4×.
// Requests hitting the cap get one clean retry at 64K (geminiChat.ts).
const envVal = process.env['QWEN_CODE_MAX_OUTPUT_TOKENS'];
const envMaxTokens = envVal ? parseInt(envVal, 10) : NaN;
if (!isNaN(envMaxTokens) && envMaxTokens > 0) {
effectiveMaxTokens = isKnownModel
? Math.min(envMaxTokens, modelLimit)
: envMaxTokens;
} else {
effectiveMaxTokens = Math.min(modelLimit, CAPPED_DEFAULT_MAX_TOKENS);
}
}
return {

@ -11,6 +11,13 @@ export type TokenLimitType = 'input' | 'output';
export const DEFAULT_TOKEN_LIMIT: TokenCount = 131_072; // 128K (power-of-two)
export const DEFAULT_OUTPUT_TOKEN_LIMIT: TokenCount = 32_000; // 32K tokens
// Capped default for slot-reservation optimization. 99% of outputs are under 5K
// tokens, so 32K defaults over-reserve 4-6× slot capacity. With the cap
// enabled, <1% of requests hit the limit; those get one clean retry at 64K
// (see geminiChat.ts max_output_tokens escalation).
export const CAPPED_DEFAULT_MAX_TOKENS: TokenCount = 8_000;
export const ESCALATED_MAX_TOKENS: TokenCount = 64_000;
/**
* Accurate numeric limits:
* - power-of-two approximations (128K -> 131072, 256K -> 262144, etc.)

@ -280,8 +280,13 @@ export class Turn {
return;
}
// Handle the new RETRY event
// Handle the new RETRY event: clear accumulated state from the
// previous attempt to avoid duplicate tool calls and stale metadata.
if (streamEvent.type === 'retry') {
this.pendingToolCalls.length = 0;
this.pendingCitations.clear();
this.debugResponses = [];
this.finishReason = undefined;
yield {
type: GeminiEventType.Retry,
retryInfo: streamEvent.retryInfo,