mirror of
https://github.com/QwenLM/qwen-code.git
synced 2026-04-28 11:41:04 +00:00
feat(core): adaptive output token escalation (8K default + 64K retry) (#2898)
* feat(core): adaptive output token escalation (8K default + 64K retry)

  99% of model responses are under 5K tokens, but we previously reserved 32K
  for every request. This wastes GPU slot capacity by ~4x. Now the default
  output limit is 8K. When a response hits this cap (stop_reason=max_tokens),
  it automatically retries once at 64K — only the ~1% of requests that
  actually need more tokens pay the cost.

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: add design doc and user doc for adaptive output token escalation

  - Add design doc covering problem, architecture, token limit determination,
    escalation mechanism, and design decisions
  - Document QWEN_CODE_MAX_OUTPUT_TOKENS env var in settings.md
  - Add max_tokens adaptive behavior explanation in model config section

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This commit is contained in:
parent
3c23952ef7
commit
1e8bc031cc
11 changed files with 299 additions and 57 deletions
@@ -0,0 +1,138 @@

# Adaptive Output Token Escalation Design

> Reduces GPU slot over-reservation by ~4x through a "low default + escalate on truncation" strategy for output tokens.

## Problem

Every API request reserves a fixed GPU slot proportional to `max_tokens`. The previous default of 32K tokens means each request reserves a 32K output slot, but 99% of responses are under 5K tokens. This over-reserves GPU capacity by 4-6x, limiting server concurrency and increasing cost.

## Solution

Use a capped default of **8K** output tokens. When a response is truncated (the model hits `max_tokens`), automatically retry once with an escalated limit of **64K**. Since <1% of requests are actually truncated, this reduces average slot reservation significantly while preserving output quality for long responses.

## Architecture

```
┌─────────────────────────┐
│ Request starts          │
│ max_tokens = 8K         │
└───────────┬─────────────┘
            │
            ▼
┌─────────────────────────┐
│ Stream response         │
└───────────┬─────────────┘
            │
  ┌─────────┴─────────┐
  │                   │
finish_reason      finish_reason
!= MAX_TOKENS      == MAX_TOKENS
  │                   │
  ▼                   ▼
┌───────────┐  ┌─────────────────────┐
│ Done      │  │ Check conditions:   │
└───────────┘  │ - No user override? │
               │ - No env override?  │
               │ - Not already       │
               │   escalated?        │
               └─────────┬───────────┘
                   YES   │   NO
                ┌────────┴─────┐
                │              │
                ▼              ▼
         ┌─────────────┐  ┌──────────┐
         │ Pop partial │  │ Done     │
         │ model resp  │  │ (truncd) │
         │ from history│  └──────────┘
         │             │
         │ Yield RETRY │
         │ event       │
         │             │
         │ Re-send     │
         │ max_tokens  │
         │ = 64K       │
         └─────────────┘
```
## Token limit determination

The effective `max_tokens` is resolved in the following priority order:

| Priority    | Source                                               | Value (known model)          | Value (unknown model) | Escalation behavior            |
| ----------- | ---------------------------------------------------- | ---------------------------- | --------------------- | ------------------------------ |
| 1 (highest) | User config (`samplingParams.max_tokens`)            | `min(userValue, modelLimit)` | `userValue`           | No escalation                  |
| 2           | Environment variable (`QWEN_CODE_MAX_OUTPUT_TOKENS`) | `min(envValue, modelLimit)`  | `envValue`            | No escalation                  |
| 3 (lowest)  | Capped default                                       | `min(modelLimit, 8K)`        | `min(32K, 8K)` = 8K   | Escalates to 64K on truncation |

A "known model" is one that has an explicit entry in `OUTPUT_PATTERNS` (checked via `hasExplicitOutputLimit()`). For known models, the effective value is always capped at the model's declared output limit to avoid API errors. Unknown models (custom deployments, self-hosted endpoints) pass the user's value through directly, since the backend may support larger limits.

This logic is implemented in three content generators:

- `DefaultOpenAICompatibleProvider.applyOutputTokenLimit()` — OpenAI-compatible providers
- `DashScopeProvider` — inherits `applyOutputTokenLimit()` from the default provider
- `AnthropicContentGenerator.buildSamplingParameters()` — Anthropic provider
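The resolution order above can be sketched as a small pure function. This is an illustrative sketch only; `resolveMaxTokens` and its signature are hypothetical, not the actual implementation (which lives in the three generators listed above).

```typescript
// Hypothetical sketch of the max_tokens resolution priority; not the real code.
const CAPPED_DEFAULT_MAX_TOKENS = 8_000;

interface ResolvedLimit {
  maxTokens: number;
  canEscalate: boolean; // only the capped default is eligible for escalation
}

function resolveMaxTokens(
  userValue: number | undefined,
  envValue: number | undefined,
  modelLimit: number,
  isKnownModel: boolean,
): ResolvedLimit {
  // Priority 1: explicit user config (samplingParams.max_tokens).
  if (userValue !== undefined) {
    return {
      maxTokens: isKnownModel ? Math.min(userValue, modelLimit) : userValue,
      canEscalate: false,
    };
  }
  // Priority 2: QWEN_CODE_MAX_OUTPUT_TOKENS environment variable.
  if (envValue !== undefined) {
    return {
      maxTokens: isKnownModel ? Math.min(envValue, modelLimit) : envValue,
      canEscalate: false,
    };
  }
  // Priority 3: capped default, escalates to 64K on truncation.
  return {
    maxTokens: Math.min(modelLimit, CAPPED_DEFAULT_MAX_TOKENS),
    canEscalate: true,
  };
}
```

For a known model with a 32K output limit and no overrides, this resolves to the 8K capped default with escalation enabled; a user value of 100K on the same model would be clamped to 32K with escalation disabled.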
## Escalation mechanism

The escalation logic lives in `geminiChat.ts`, placed **outside** the main retry loop. This is intentional:

1. The retry loop handles transient errors (rate limits, invalid streams, content validation)
2. Truncation is not an error — it's a successful response that was cut short
3. Errors from the escalated stream should propagate directly to the caller, not be caught by retry logic

### Escalation steps (geminiChat.ts)

```
1. Stream completes successfully (lastError === null)
2. Last chunk has finishReason === MAX_TOKENS
3. Guard checks pass:
   - maxTokensEscalated === false (prevent infinite escalation)
   - hasUserMaxTokensOverride === false (respect user intent)
4. Pop the partial model response from chat history
5. Yield RETRY event → UI discards partial output
6. Re-send the same request with maxOutputTokens: 64K
```
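The guard in steps 1-3 can be condensed into a single predicate. This is a minimal sketch, assuming the real code compares the last chunk's finish reason against `FinishReason.MAX_TOKENS` from `@google/genai`; plain strings are used here to stay self-contained.

```typescript
// Illustrative predicate for the escalation guard; not the actual geminiChat.ts code.
interface StreamOutcome {
  lastError: Error | null;
  lastFinishReason: string | undefined;
}

function shouldEscalate(
  outcome: StreamOutcome,
  maxTokensEscalated: boolean,
  hasUserMaxTokensOverride: boolean,
): boolean {
  return (
    outcome.lastError === null && // step 1: stream completed successfully
    outcome.lastFinishReason === 'MAX_TOKENS' && // step 2: output was truncated
    !maxTokensEscalated && // step 3a: escalate at most once
    !hasUserMaxTokensOverride // step 3b: respect user/env overrides
  );
}
```

Note that a truncated response with any error, a prior escalation, or an explicit limit falls straight through to "Done (truncated)".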
### State cleanup on RETRY (turn.ts)

When the `Turn` class receives a RETRY event, it clears accumulated state to prevent inconsistencies:

- `pendingToolCalls` — cleared to avoid duplicate tool calls if the first truncated response contained completed tool calls that are repeated in the escalated response
- `pendingCitations` — cleared to avoid duplicate citations
- `debugResponses` — cleared to avoid stale debug data
- `finishReason` — reset to `undefined` so the new response's finish reason is used
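The cleanup can be pictured as a reset method over the accumulated state. The class below is a minimal stand-in whose field names mirror the bullets above; it is not the actual `Turn` implementation.

```typescript
// Minimal stand-in for the Turn state cleared on RETRY; illustrative only.
class TurnState {
  pendingToolCalls: unknown[] = [];
  pendingCitations = new Set<string>();
  debugResponses: unknown[] = [];
  finishReason: string | undefined;

  resetOnRetry(): void {
    this.pendingToolCalls.length = 0; // avoid duplicate tool calls
    this.pendingCitations.clear(); // avoid duplicate citations
    this.debugResponses = []; // drop stale debug data
    this.finishReason = undefined; // use the escalated response's finish reason
  }
}
```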
## Constants

Defined in `tokenLimits.ts`:

| Constant                    | Value  | Purpose                                                 |
| --------------------------- | ------ | ------------------------------------------------------- |
| `CAPPED_DEFAULT_MAX_TOKENS` | 8,000  | Default output token limit when no user override is set |
| `ESCALATED_MAX_TOKENS`      | 64,000 | Output token limit used on truncation retry             |

## Design decisions

### Why 8K default?

- 99% of responses are under 5K tokens
- 8K provides reasonable headroom for slightly longer responses without triggering unnecessary retries
- Reduces average slot reservation from 32K to 8K (4x improvement)

### Why 64K escalated limit?

- Covers the vast majority of long outputs that were truncated at 8K
- Matches the output limit of many modern models (Claude Sonnet, Gemini 3.x, Qwen3.x)
- Higher values (e.g., 128K) would negate slot optimization benefits for the <1% of requests that escalate

### Why not progressive escalation (8K → 16K → 32K → 64K)?

- Each retry adds latency (the full response must be regenerated)
- A single retry is the simplest approach that captures almost all cases
- The <1% truncation rate at 8K means almost no requests need escalation; those that do are likely to need significantly more than 16K

### Why is escalation outside the retry loop?

- Truncation is a success case, not an error
- Errors from the escalated stream (rate limits, network failures) should propagate directly rather than being silently retried with incorrect parameters
- Keeps the retry loop focused on its original purpose (transient error recovery)
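As a back-of-envelope check of the numbers above (assuming a truncated request reserves both its 8K first attempt and its 64K retry):

```typescript
// Expected per-request output-slot reservation under the stated ~1% truncation rate.
const truncationRate = 0.01;
const cappedDefault = 8_000;
const escalated = 64_000;

// 99% of requests reserve 8K once; ~1% reserve 8K and then 64K on retry.
const expectedReservation =
  (1 - truncationRate) * cappedDefault +
  truncationRate * (cappedDefault + escalated);
// expectedReservation ≈ 8.6K tokens on average, versus the old flat 32K.
```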
@@ -168,6 +168,18 @@ Settings are organized into categories. All settings should be placed within the
}
```

**max_tokens (adaptive output tokens):**

When `samplingParams.max_tokens` is not set, Qwen Code uses an adaptive output token strategy to optimize GPU resource usage:

1. Requests start with a default limit of **8K** output tokens
2. If the response is truncated (the model hits the limit), Qwen Code automatically retries with **64K** tokens
3. The partial output is discarded and replaced with the full response from the retry

This is transparent to users — you may briefly see a retry indicator if escalation occurs. Since 99% of responses are under 5K tokens, the retry happens rarely (<1% of requests).

To override this behavior, either set `samplingParams.max_tokens` in your settings or use the `QWEN_CODE_MAX_OUTPUT_TOKENS` environment variable.

**contextWindowSize:**

Overrides the default context window size for the selected model. Qwen Code determines the context window using built-in defaults based on model name matching, with a constant fallback value. Use this setting when a provider's effective context limit differs from Qwen Code's default. This value defines the model's assumed maximum context capacity, not a per-request token limit.
@@ -491,22 +503,23 @@ For authentication-related variables (like `OPENAI_*`) and the recommended `.qwe

### Environment Variables Table

| Variable | Description | Notes |
| --- | --- | --- |
| `QWEN_TELEMETRY_ENABLED` | Set to `true` or `1` to enable telemetry. Any other value is treated as disabling it. | Overrides the `telemetry.enabled` setting. |
| `QWEN_TELEMETRY_TARGET` | Sets the telemetry target (`local` or `gcp`). | Overrides the `telemetry.target` setting. |
| `QWEN_TELEMETRY_OTLP_ENDPOINT` | Sets the OTLP endpoint for telemetry. | Overrides the `telemetry.otlpEndpoint` setting. |
| `QWEN_TELEMETRY_OTLP_PROTOCOL` | Sets the OTLP protocol (`grpc` or `http`). | Overrides the `telemetry.otlpProtocol` setting. |
| `QWEN_TELEMETRY_LOG_PROMPTS` | Set to `true` or `1` to enable logging of user prompts. Any other value is treated as disabling it. | Overrides the `telemetry.logPrompts` setting. |
| `QWEN_TELEMETRY_OUTFILE` | Sets the file path to write telemetry to when the target is `local`. | Overrides the `telemetry.outfile` setting. |
| `QWEN_TELEMETRY_USE_COLLECTOR` | Set to `true` or `1` to enable using an external OTLP collector. Any other value is treated as disabling it. | Overrides the `telemetry.useCollector` setting. |
| `QWEN_SANDBOX` | Alternative to the `sandbox` setting in `settings.json`. | Accepts `true`, `false`, `docker`, `podman`, or a custom command string. |
| `SEATBELT_PROFILE` | (macOS specific) Switches the Seatbelt (`sandbox-exec`) profile on macOS. | `permissive-open`: (Default) Restricts writes to the project folder (and a few other folders, see `packages/cli/src/utils/sandbox-macos-permissive-open.sb`) but allows other operations. `strict`: Uses a strict profile that declines operations by default. `<profile_name>`: Uses a custom profile. To define a custom profile, create a file named `sandbox-macos-<profile_name>.sb` in your project's `.qwen/` directory (e.g., `my-project/.qwen/sandbox-macos-custom.sb`). |
| `DEBUG` or `DEBUG_MODE` | (often used by underlying libraries or the CLI itself) Set to `true` or `1` to enable verbose debug logging, which can be helpful for troubleshooting. | **Note:** These variables are automatically excluded from project `.env` files by default to prevent interference with the CLI behavior. Use `.qwen/.env` files if you need to set these for Qwen Code specifically. |
| `NO_COLOR` | Set to any value to disable all color output in the CLI. | |
| `CLI_TITLE` | Set to a string to customize the title of the CLI. | |
| `CODE_ASSIST_ENDPOINT` | Specifies the endpoint for the code assist server. | This is useful for development and testing. |
| `QWEN_CODE_MAX_OUTPUT_TOKENS` | Overrides the default maximum output tokens per response. When not set, Qwen Code uses an adaptive strategy: starts with 8K tokens and automatically retries with 64K if the response is truncated. Set this to a specific value (e.g., `16000`) to use a fixed limit instead. | Takes precedence over the capped default (8K) but is overridden by `samplingParams.max_tokens` in settings. Disables automatic escalation when set. Example: `export QWEN_CODE_MAX_OUTPUT_TOKENS=16000` |
| `TAVILY_API_KEY` | Your API key for the Tavily web search service. | Used to enable the `web_search` tool functionality. Example: `export TAVILY_API_KEY="tvly-your-api-key-here"` |

## Command-Line Arguments
@@ -423,7 +423,7 @@ describe('AnthropicContentGenerator', () => {
      const [anthropicRequest] =
        anthropicState.lastCreateArgs as AnthropicCreateArgs;
      expect(anthropicRequest).toEqual(
-       expect.objectContaining({ max_tokens: 32000 }),
+       expect.objectContaining({ max_tokens: 8000 }),
      );
    });
@@ -488,7 +488,7 @@ describe('AnthropicContentGenerator', () => {
      const [anthropicRequest] =
        anthropicState.lastCreateArgs as AnthropicCreateArgs;
      expect(anthropicRequest).toEqual(
-       expect.objectContaining({ max_tokens: 32000 }),
+       expect.objectContaining({ max_tokens: 8000 }),
      );
    });
  });
@@ -33,7 +33,7 @@ import { DEFAULT_TIMEOUT } from '../openaiContentGenerator/constants.js';
import { createDebugLogger } from '../../utils/debugLogger.js';
import {
  tokenLimit,
-  DEFAULT_OUTPUT_TOKEN_LIMIT,
+  CAPPED_DEFAULT_MAX_TOKENS,
  hasExplicitOutputLimit,
} from '../tokenLimits.js';
@@ -234,12 +234,23 @@ export class AnthropicContentGenerator implements ContentGenerator {
    const modelLimit = tokenLimit(modelId, 'output');
    const isKnownModel = hasExplicitOutputLimit(modelId);

-    const maxTokens =
-      userMaxTokens !== undefined && userMaxTokens !== null
-        ? isKnownModel
-          ? Math.min(userMaxTokens, modelLimit)
-          : userMaxTokens
-        : Math.min(modelLimit, DEFAULT_OUTPUT_TOKEN_LIMIT);
+    let maxTokens: number;
+    if (userMaxTokens !== undefined && userMaxTokens !== null) {
+      maxTokens = isKnownModel
+        ? Math.min(userMaxTokens, modelLimit)
+        : userMaxTokens;
+    } else {
+      // No explicit user config — check env var, then use capped default.
+      const envVal = process.env['QWEN_CODE_MAX_OUTPUT_TOKENS'];
+      const envMaxTokens = envVal ? parseInt(envVal, 10) : NaN;
+      if (!isNaN(envMaxTokens) && envMaxTokens > 0) {
+        maxTokens = isKnownModel
+          ? Math.min(envMaxTokens, modelLimit)
+          : envMaxTokens;
+      } else {
+        maxTokens = Math.min(modelLimit, CAPPED_DEFAULT_MAX_TOKENS);
+      }
+    }

    return {
      max_tokens: maxTokens,
@@ -16,13 +16,14 @@ import type {
  Tool,
  GenerateContentResponseUsageMetadata,
} from '@google/genai';
-import { createUserContent } from '@google/genai';
+import { createUserContent, FinishReason } from '@google/genai';
import { retryWithBackoff } from '../utils/retry.js';
import { getErrorStatus } from '../utils/errors.js';
import { createDebugLogger } from '../utils/debugLogger.js';
import { parseAndFormatApiError } from '../utils/errorParsing.js';
import { isRateLimitError, type RetryInfo } from '../utils/rateLimit.js';
import type { Config } from '../config/config.js';
+import { ESCALATED_MAX_TOKENS } from './tokenLimits.js';
import { hasCycleInSchema } from '../tools/tools.js';
import type { StructuredError } from './turn.js';
import {
@@ -355,6 +356,17 @@ export class GeminiChat {
      cgConfig?.maxRetries ?? RATE_LIMIT_RETRY_OPTIONS.maxRetries;
    const extraRetryErrorCodes = cgConfig?.retryErrorCodes;

+    // Max output tokens escalation: when no user/env override is set,
+    // the capped default (8K) is used. If the model hits MAX_TOKENS,
+    // retry once with escalated limit (64K).
+    let maxTokensEscalated = false;
+    const hasUserMaxTokensOverride =
+      (cgConfig?.samplingParams?.max_tokens !== undefined &&
+        cgConfig?.samplingParams?.max_tokens !== null) ||
+      !!process.env['QWEN_CODE_MAX_OUTPUT_TOKENS'];
+
+    let lastFinishReason: string | undefined;
+
    for (
      let attempt = 0;
      attempt < INVALID_CONTENT_RETRY_OPTIONS.maxAttempts;
@@ -376,7 +388,10 @@ export class GeminiChat {
          prompt_id,
        );

+        lastFinishReason = undefined;
        for await (const chunk of stream) {
+          const fr = chunk.candidates?.[0]?.finishReason;
+          if (fr) lastFinishReason = fr;
          yield { type: StreamEventType.CHUNK, value: chunk };
        }
@@ -481,6 +496,49 @@ export class GeminiChat {
        }
      }

+      // Max output tokens escalation: if the retry loop succeeded with
+      // the capped default (8K) but hit MAX_TOKENS, retry once at 64K.
+      // Placed outside the retry loop so that any errors from the
+      // escalated stream propagate directly (not caught by retry logic).
+      if (
+        lastError === null &&
+        lastFinishReason === FinishReason.MAX_TOKENS &&
+        !maxTokensEscalated &&
+        !hasUserMaxTokensOverride
+      ) {
+        maxTokensEscalated = true;
+        debugLogger.info(
+          `Output truncated at capped default. Escalating to ${ESCALATED_MAX_TOKENS} tokens.`,
+        );
+        // Remove partial model response from history
+        // (processStreamResponse already pushed it)
+        if (
+          self.history.length > 0 &&
+          self.history[self.history.length - 1].role === 'model'
+        ) {
+          self.history.pop();
+        }
+        // Signal UI to discard partial output
+        yield { type: StreamEventType.RETRY };
+        // Retry with escalated max_tokens
+        const escalatedParams: SendMessageParameters = {
+          ...params,
+          config: {
+            ...params.config,
+            maxOutputTokens: ESCALATED_MAX_TOKENS,
+          },
+        };
+        const escalatedStream = await self.makeApiCallAndProcessStream(
+          model,
+          requestContents,
+          escalatedParams,
+          prompt_id,
+        );
+        for await (const chunk of escalatedStream) {
+          yield { type: StreamEventType.CHUNK, value: chunk };
+        }
+      }
+
      if (lastError) {
        if (lastError instanceof InvalidStreamError) {
          const totalAttempts = invalidStreamRetryCount + 1;
@ -786,9 +786,9 @@ describe('DashScopeOpenAICompatibleProvider', () => {
|
|||
|
||||
const result = provider.buildRequest(request, 'test-prompt-id');
|
||||
|
||||
// Should set conservative default (min of model limit and DEFAULT_OUTPUT_TOKEN_LIMIT)
|
||||
// qwen3-max has 32K output limit, so min(32K, 32K) = 32K
|
||||
expect(result.max_tokens).toBe(32000);
|
||||
// Should set capped default (min of model limit and CAPPED_DEFAULT_MAX_TOKENS)
|
||||
// qwen3-max has 32K output limit, so min(32K, 8K) = 8K
|
||||
expect(result.max_tokens).toBe(8000);
|
||||
});
|
||||
|
||||
it('should set conservative max_tokens when null is provided', () => {
|
||||
|
|
@@ -800,8 +800,8 @@ describe('DashScopeOpenAICompatibleProvider', () => {

      const result = provider.buildRequest(request, 'test-prompt-id');

-      // null is treated as not configured, so set conservative default
-      expect(result.max_tokens).toBe(32000);
+      // null is treated as not configured, so set capped default: min(32K, 8K) = 8K
+      expect(result.max_tokens).toBe(8000);
    });

    it('should respect user max_tokens for unknown models', () => {
@@ -110,8 +110,8 @@ export class DashScopeOpenAICompatibleProvider extends DefaultOpenAICompatiblePr
    }

    // Apply output token limits using parent class logic
-    // Uses conservative default (min of model limit and DEFAULT_OUTPUT_TOKEN_LIMIT)
-    // to preserve input quota when user hasn't explicitly configured max_tokens
+    // Uses capped default (min of model limit and CAPPED_DEFAULT_MAX_TOKENS=8K)
+    // Requests hitting the cap get one clean retry at 64K (geminiChat.ts)
    const requestWithTokenLimits = this.applyOutputTokenLimit(request);

    const extraBody = this.contentGeneratorConfig.extra_body;
@@ -204,9 +204,9 @@ describe('DefaultOpenAICompatibleProvider', () => {
        'prompt-id',
      );

-      // Should set conservative default (min of model limit and DEFAULT_OUTPUT_TOKEN_LIMIT)
-      // GPT-4 has 16K output limit, so min(16K, 32K) = 16K
-      expect(result.max_tokens).toBe(16384);
+      // Should set capped default (min of model limit and CAPPED_DEFAULT_MAX_TOKENS)
+      // GPT-4 has 16K output limit, so min(16K, 8K) = 8K
+      expect(result.max_tokens).toBe(8000);
    });

    it('should respect user max_tokens for unknown models (deployment aliases, self-hosted)', () => {
@@ -223,8 +223,8 @@ describe('DefaultOpenAICompatibleProvider', () => {
      expect(result.max_tokens).toBe(100000);
    });

-    it('should use conservative default for unknown models when max_tokens not configured', () => {
-      // Unknown models without user config: use DEFAULT_OUTPUT_TOKEN_LIMIT
+    it('should use capped default for unknown models when max_tokens not configured', () => {
+      // Unknown models without user config: use CAPPED_DEFAULT_MAX_TOKENS
      const request: OpenAI.Chat.ChatCompletionCreateParams = {
        model: 'custom-deployment-alias',
        messages: [{ role: 'user', content: 'Hello' }],
@@ -232,8 +232,8 @@ describe('DefaultOpenAICompatibleProvider', () => {

      const result = provider.buildRequest(request, 'prompt-id');

-      // Uses conservative default (32K)
-      expect(result.max_tokens).toBe(32000);
+      // Uses capped default (8K)
+      expect(result.max_tokens).toBe(8000);
    });

    it('should cap max_tokens for known models to avoid API errors', () => {
@@ -259,8 +259,8 @@ describe('DefaultOpenAICompatibleProvider', () => {

      const result = provider.buildRequest(request, 'prompt-id');

-      // GPT-4 has 16K output limit, so conservative default is still 16K
-      expect(result.max_tokens).toBe(16384);
+      // GPT-4 has 16K output limit, capped default is 8K: min(16K, 8K) = 8K
+      expect(result.max_tokens).toBe(8000);
    });

    it('should preserve all sampling parameters', () => {
@@ -303,7 +303,7 @@ describe('DefaultOpenAICompatibleProvider', () => {
      // Should set conservative max_tokens default
      expect(result.model).toBe('gpt-4');
      expect(result.messages).toEqual(minimalRequest.messages);
-      expect(result.max_tokens).toBe(16384); // GPT-4 has 16K limit, min(16K, 32K) = 16K
+      expect(result.max_tokens).toBe(8000); // GPT-4 has 16K limit, min(16K, 8K) = 8K
    });

    it('should handle streaming requests', () => {
@@ -319,7 +319,7 @@ describe('DefaultOpenAICompatibleProvider', () => {
      expect(result.model).toBe('gpt-4');
      expect(result.messages).toEqual(streamingRequest.messages);
      expect(result.stream).toBe(true);
-      expect(result.max_tokens).toBe(16384); // GPT-4 has 16K limit, min(16K, 32K) = 16K
+      expect(result.max_tokens).toBe(8000); // GPT-4 has 16K limit, min(16K, 8K) = 8K
    });

    it('should not modify the original request object', () => {
@@ -363,7 +363,7 @@ describe('DefaultOpenAICompatibleProvider', () => {

      expect(result).toEqual({
        ...originalRequest,
-        max_tokens: 16384, // GPT-4 has 16K limit, min(16K, 32K) = 16K
+        max_tokens: 8000, // GPT-4 has 16K limit, min(16K, 8K) = 8K
        custom_param: 'custom_value',
        nested: { key: 'value' },
      });
@@ -382,7 +382,7 @@ describe('DefaultOpenAICompatibleProvider', () => {
      expect(result.model).toBe('gpt-4');
      expect(result.messages).toEqual(originalRequest.messages);
      expect(result.temperature).toBe(0.7);
-      expect(result.max_tokens).toBe(16384); // GPT-4 has 16K limit, min(16K, 32K) = 16K
+      expect(result.max_tokens).toBe(8000); // GPT-4 has 16K limit, min(16K, 8K) = 8K
      expect(result).not.toHaveProperty('custom_param');
    });
  });
@@ -7,7 +7,7 @@ import type { OpenAICompatibleProvider } from './types.js';
import { buildRuntimeFetchOptions } from '../../../utils/runtimeFetchOptions.js';
import {
  tokenLimit,
-  DEFAULT_OUTPUT_TOKEN_LIMIT,
+  CAPPED_DEFAULT_MAX_TOKENS,
  hasExplicitOutputLimit,
} from '../../tokenLimits.js';
@@ -101,18 +101,19 @@ export class DefaultOpenAICompatibleProvider
   * - For unknown models (deployment aliases, self-hosted): respect user's
   *   configured value entirely (backend may support larger limits)
   * 2. If user didn't configure max_tokens:
-  *    - Use min(modelLimit, DEFAULT_OUTPUT_TOKEN_LIMIT)
-  *    - This provides a conservative default (32K) that avoids truncating output
-  *      while preserving input quota (not occupying too much context window)
+  *    - Check QWEN_CODE_MAX_OUTPUT_TOKENS env var first
+  *    - Otherwise use min(modelLimit, CAPPED_DEFAULT_MAX_TOKENS=8K)
+  *    - Requests hitting the 8K cap get one clean retry at 64K (geminiChat.ts)
   * 3. If model has no specific limit (tokenLimit returns default):
-  *    - Still apply DEFAULT_OUTPUT_TOKEN_LIMIT as safeguard
+  *    - Still apply CAPPED_DEFAULT_MAX_TOKENS as safeguard
   *
   * Examples:
   * - User sets 4K, known model limit 64K → uses 4K (respects user preference)
   * - User sets 100K, known model limit 64K → uses 64K (capped to avoid API error)
   * - User sets 100K, unknown model → uses 100K (respects user, backend may support it)
-  * - User not set, model limit 64K → uses 32K (conservative default)
-  * - User not set, model limit 8K → uses 8K (model limit is lower)
+  * - User not set, model limit 64K → uses 8K (capped default for slot optimization)
+  * - User not set, model limit 4K → uses 4K (model limit is lower)
+  * - User not set, env QWEN_CODE_MAX_OUTPUT_TOKENS=16000 -> uses 16K
   *
   * @param request - The chat completion request parameters
   * @returns The request with max_tokens adjusted according to the logic
@@ -140,9 +141,18 @@ export class DefaultOpenAICompatibleProvider
        effectiveMaxTokens = userMaxTokens;
      }
    } else {
-      // User didn't configure, use conservative default:
-      // min(model-specific limit, DEFAULT_OUTPUT_TOKEN_LIMIT)
-      effectiveMaxTokens = Math.min(modelLimit, DEFAULT_OUTPUT_TOKEN_LIMIT);
+      // No explicit user config — check env var, then use capped default.
+      // Capped default (8K) reduces GPU slot over-reservation by ~4×.
+      // Requests hitting the cap get one clean retry at 64K (geminiChat.ts).
+      const envVal = process.env['QWEN_CODE_MAX_OUTPUT_TOKENS'];
+      const envMaxTokens = envVal ? parseInt(envVal, 10) : NaN;
+      if (!isNaN(envMaxTokens) && envMaxTokens > 0) {
+        effectiveMaxTokens = isKnownModel
+          ? Math.min(envMaxTokens, modelLimit)
+          : envMaxTokens;
+      } else {
+        effectiveMaxTokens = Math.min(modelLimit, CAPPED_DEFAULT_MAX_TOKENS);
+      }
    }

    return {
@@ -11,6 +11,13 @@ export type TokenLimitType = 'input' | 'output';
export const DEFAULT_TOKEN_LIMIT: TokenCount = 131_072; // 128K (power-of-two)
export const DEFAULT_OUTPUT_TOKEN_LIMIT: TokenCount = 32_000; // 32K tokens

+// Capped default for slot-reservation optimization. 99% of outputs are under 5K
+// tokens, so 32K defaults over-reserve 4-6× slot capacity. With the cap
+// enabled, <1% of requests hit the limit; those get one clean retry at 64K
+// (see geminiChat.ts max_output_tokens escalation).
+export const CAPPED_DEFAULT_MAX_TOKENS: TokenCount = 8_000;
+export const ESCALATED_MAX_TOKENS: TokenCount = 64_000;
+
/**
 * Accurate numeric limits:
 * - power-of-two approximations (128K -> 131072, 256K -> 262144, etc.)
@@ -280,8 +280,13 @@ export class Turn {
      return;
    }

-    // Handle the new RETRY event
+    // Handle the new RETRY event: clear accumulated state from the
+    // previous attempt to avoid duplicate tool calls and stale metadata.
    if (streamEvent.type === 'retry') {
+      this.pendingToolCalls.length = 0;
+      this.pendingCitations.clear();
+      this.debugResponses = [];
+      this.finishReason = undefined;
      yield {
        type: GeminiEventType.Retry,
        retryInfo: streamEvent.retryInfo,