feat(retry): add persistent retry mode for unattended CI/CD environments (#3080)

* feat(retry): add persistent retry mode for unattended CI/CD environments

When running in CI/CD pipelines or background daemon mode, transient API
capacity errors (429/529) should not terminate long-running tasks after a
fixed number of retries. This adds an environment-aware persistent retry
mode that retries indefinitely for transient errors, with exponential
backoff capped at 5 minutes and heartbeat keepalives every 30 seconds to
prevent CI runner timeouts.

* docs: add persistent retry mode documentation

Add environment variable entries (QWEN_CODE_UNATTENDED_RETRY, QWEN_CODE_BG)
to the settings reference, and a new "Persistent Retry Mode" section to the
headless mode docs covering activation, behavior, and CI/CD usage examples.

* refactor(retry): simplify to single explicit env var QWEN_CODE_UNATTENDED_RETRY

Remove QWEN_CODE_BG and CI=true as activation triggers for persistent retry.
Having multiple env vars with identical behavior adds confusion, and silently
activating infinite retry on CI=true is dangerous — a regular CI test hitting
a 429 would hang forever instead of failing fast.

* fix(retry): address PR review feedback

- Forward caller's abortSignal into retryWithBackoff in both
  baseLlmClient.ts and geminiChat.ts so persistent waits remain
  cancellable (wenshao)
- Re-apply maxBackoff and capMs after jitter so delays strictly
  respect stated caps (Copilot)
- Respect shouldRetryOnError in persistent mode so callers can
  force fast-fail even for transient 429/529 errors (Copilot)
- Guard sleepWithHeartbeat against infinite loop when heartbeat
  interval is <= 0 via Math.max(1, ...) (Copilot)
- Normalize isEnvTruthy with trim/toLowerCase for robust env
  var parsing across CI conventions (Copilot)

* test(retry): add missing UT for shouldRetryOnError override and heartbeat zero-interval guard

* fix(retry): do not cap Retry-After delays at maxBackoff

Server-specified Retry-After values should only be limited by the
absolute cap (capMs/6h), not the exponential backoff cap (maxBackoff/5min).
Jitter is also skipped for Retry-After since the server already specified
the exact wait time.
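The delay-selection rule this commit describes can be sketched as follows. This is an illustrative TypeScript sketch, not the project's actual retry module; the names `nextDelayMs`, `MAX_BACKOFF_MS`, and `CAP_MS` are assumptions for the example.

```typescript
const MAX_BACKOFF_MS = 5 * 60 * 1000; // exponential backoff cap (5 min)
const CAP_MS = 6 * 60 * 60 * 1000; // absolute cap (6 h)

// Pick the next retry delay. A server-specified Retry-After is honored
// exactly (no jitter) and limited only by the absolute cap; computed
// backoff is jittered, then re-capped so the stated limits hold strictly.
function nextDelayMs(attempt: number, retryAfterMs?: number): number {
  if (retryAfterMs !== undefined) {
    return Math.min(retryAfterMs, CAP_MS);
  }
  const exp = 1000 * 2 ** attempt; // exponential growth
  const jittered = exp * (0.5 + Math.random()); // randomize to avoid thundering herd
  return Math.min(jittered, MAX_BACKOFF_MS, CAP_MS);
}
```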

* refactor(retry): align isUnattendedMode with project env parsing convention

Replace custom isEnvTruthy (trim + toLowerCase) with strict matching
(val === 'true' || val === '1') to match parseBooleanEnvFlag used
elsewhere in the codebase. Prevents inconsistent behavior where
'TRUE' or ' 1 ' would activate persistent retry here but not in
telemetry or other env-driven features.

* test(retry): add Retry-After handling tests for persistent mode

Cover three key behaviors:
- Retry-After is NOT capped at maxBackoff (only at capMs)
- Retry-After IS capped at persistentCapMs absolute limit
- Retry-After delays have no jitter applied

* fix(test): add isUnattendedMode to retry.js mock in baseLlmClient tests

The existing vi.mock for retry.js only exported retryWithBackoff.
After adding isUnattendedMode to the retry module, baseLlmClient.ts
imports it, causing all 10 generateJson tests to fail with
'No "isUnattendedMode" export is defined on the mock'.

* fix(retry): wire persistent retry mode into client.ts generateContent

Forward persistentMode and abortSignal to retryWithBackoff() in
GeminiClient.generateContent(), matching the existing wiring in
baseLlmClient.ts and geminiChat.ts.
zhangxy-zju 2026-04-21 22:08:11 +08:00 committed by GitHub
parent 309b25d256
commit ebe364d0b8
10 changed files with 731 additions and 45 deletions

@@ -310,6 +310,67 @@ echo "Recent usage trends:"
tail -5 usage.log
```

## Persistent Retry Mode

When Qwen Code runs in CI/CD pipelines or as a background daemon, a brief API outage (rate limiting or overload) should not kill a multi-hour task. **Persistent retry mode** makes Qwen Code retry transient API errors indefinitely until the service recovers.

### How it works

- **Transient errors only**: HTTP 429 (Rate Limit) and 529 (Overloaded) are retried indefinitely. Other errors (400, 500, etc.) still fail normally.
- **Exponential backoff with cap**: Retry delays grow exponentially but are capped at **5 minutes** per retry.
- **Heartbeat keepalive**: During long waits, a status line is printed to stderr every **30 seconds** to prevent CI runners from killing the process due to inactivity.
- **Graceful degradation**: Non-transient errors and interactive mode are completely unaffected.
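The behavior above can be sketched as a retry loop (a minimal TypeScript sketch; `isTransient` and `persistentRetry` are illustrative names, not the actual exports of the retry module):

```typescript
// Only rate-limit (429) and overloaded (529) responses are transient.
function isTransient(status: number): boolean {
  return status === 429 || status === 529;
}

// Retry fn() indefinitely on transient errors, with exponential backoff
// capped at maxBackoffMs. Any other error is rethrown immediately.
async function persistentRetry<T>(
  fn: () => Promise<T>,
  maxBackoffMs = 5 * 60 * 1000,
  baseMs = 1000,
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      const status = (err as { status?: number }).status;
      if (status === undefined || !isTransient(status)) throw err; // fail fast
      const delay = Math.min(baseMs * 2 ** attempt, maxBackoffMs);
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
}
```

Note there is no attempt limit in the loop: termination comes from the call eventually succeeding, a non-transient error, or external cancellation.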

### Activation

Set the `QWEN_CODE_UNATTENDED_RETRY` environment variable to `true` or `1` (strict match, case-sensitive):
```bash
export QWEN_CODE_UNATTENDED_RETRY=1
```

> [!IMPORTANT]
> Persistent retry requires an **explicit opt-in**. `CI=true` alone does **not** activate it — silently turning a fast-fail CI job into an infinite-wait job would be dangerous. Always set `QWEN_CODE_UNATTENDED_RETRY` explicitly in your pipeline configuration.

### Examples

#### GitHub Actions

```yaml
- name: Automated code review
env:
QWEN_CODE_UNATTENDED_RETRY: '1'
run: |
qwen -p "Review all files in src/ for security issues" \
--output-format json \
--yolo > review.json
```

#### Overnight batch processing

```bash
export QWEN_CODE_UNATTENDED_RETRY=1
qwen -p "Migrate all callback-style functions to async/await in src/" --yolo
```

#### Background daemon

```bash
QWEN_CODE_UNATTENDED_RETRY=1 nohup qwen -p "Audit all dependencies for known CVEs" \
--output-format json > audit.json 2> audit.log &
```

### Monitoring

During persistent retry, heartbeat messages are printed to **stderr**:
```
[qwen-code] Waiting for API capacity... attempt 3, retry in 45s
[qwen-code] Waiting for API capacity... attempt 3, retry in 15s
```
These messages keep CI runners alive and let you monitor progress. They do not appear in stdout, so JSON output piped to other tools remains clean.
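The heartbeat wait can be sketched like this. `sleepWithHeartbeat` is named after the helper mentioned in the commit log, but its signature here is an assumption; the callback stands in for the stderr status line:

```typescript
// Sleep for totalMs, invoking onBeat at each heartbeat interval with the
// remaining wait time. The interval is floored at 1 ms so a zero or
// negative configuration cannot produce an infinite loop.
async function sleepWithHeartbeat(
  totalMs: number,
  heartbeatMs: number,
  onBeat: (remainingMs: number) => void,
): Promise<void> {
  const interval = Math.max(1, heartbeatMs);
  for (let remaining = totalMs; remaining > 0; remaining -= interval) {
    onBeat(remaining);
    await new Promise((resolve) => setTimeout(resolve, Math.min(interval, remaining)));
  }
}
```

In practice the callback would write to stderr, e.g. `process.stderr.write(...)`, keeping stdout clean for piped JSON output.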

## Resources

- [CLI Configuration](../configuration/settings#command-line-arguments) - Complete configuration guide