feat(retry): add persistent retry mode for unattended CI/CD environments (#3080)

* feat(retry): add persistent retry mode for unattended CI/CD environments

When running in CI/CD pipelines or background daemon mode, transient API
capacity errors (429/529) should not terminate long-running tasks after a
fixed number of retries. This adds an environment-aware persistent retry
mode that retries indefinitely for transient errors, with exponential
backoff capped at 5 minutes and heartbeat keepalives every 30 seconds to
prevent CI runner timeouts.

* docs: add persistent retry mode documentation

Add environment variable entries (QWEN_CODE_UNATTENDED_RETRY, QWEN_CODE_BG)
to the settings reference, and a new "Persistent Retry Mode" section to the
headless mode docs covering activation, behavior, and CI/CD usage examples.

* refactor(retry): simplify to single explicit env var QWEN_CODE_UNATTENDED_RETRY

Remove QWEN_CODE_BG and CI=true as activation triggers for persistent retry.
Having multiple env vars with identical behavior adds confusion, and silently
activating infinite retry on CI=true is dangerous — a regular CI test hitting
a 429 would hang forever instead of failing fast.

* fix(retry): address PR review feedback

- Forward caller's abortSignal into retryWithBackoff in both
  baseLlmClient.ts and geminiChat.ts so persistent waits remain
  cancellable (wenshao)
- Re-apply maxBackoff and capMs after jitter so delays strictly
  respect stated caps (Copilot)
- Respect shouldRetryOnError in persistent mode so callers can
  force fast-fail even for transient 429/529 errors (Copilot)
- Guard sleepWithHeartbeat against infinite loop when heartbeat
  interval is <= 0 via Math.max(1, ...) (Copilot)
- Normalize isEnvTruthy with trim/toLowerCase for robust env
  var parsing across CI conventions (Copilot)

* test(retry): add missing UT for shouldRetryOnError override and heartbeat zero-interval guard

* fix(retry): do not cap Retry-After delays at maxBackoff

Server-specified Retry-After values should only be limited by the
absolute cap (capMs/6h), not the exponential backoff cap (maxBackoff/5min).
Jitter is also skipped for Retry-After since the server already specified
the exact wait time.
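The delay-selection rule this commit describes can be sketched as follows. This is an illustrative TypeScript sketch, not the project's actual retry module; the names `nextDelayMs`, `MAX_BACKOFF_MS`, and `CAP_MS` are assumptions for the example.

```typescript
const MAX_BACKOFF_MS = 5 * 60 * 1000; // exponential backoff cap (5 min)
const CAP_MS = 6 * 60 * 60 * 1000; // absolute cap (6 h)

// Pick the next retry delay. A server-specified Retry-After is honored
// exactly (no jitter) and limited only by the absolute cap; computed
// backoff is jittered, then re-capped so the stated limits hold strictly.
function nextDelayMs(attempt: number, retryAfterMs?: number): number {
  if (retryAfterMs !== undefined) {
    return Math.min(retryAfterMs, CAP_MS);
  }
  const exp = 1000 * 2 ** attempt; // exponential growth
  const jittered = exp * (0.5 + Math.random()); // randomize to avoid thundering herd
  return Math.min(jittered, MAX_BACKOFF_MS, CAP_MS);
}
```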

* refactor(retry): align isUnattendedMode with project env parsing convention

Replace custom isEnvTruthy (trim + toLowerCase) with strict matching
(val === 'true' || val === '1') to match parseBooleanEnvFlag used
elsewhere in the codebase. Prevents inconsistent behavior where
'TRUE' or ' 1 ' would activate persistent retry here but not in
telemetry or other env-driven features.

* test(retry): add Retry-After handling tests for persistent mode

Cover three key behaviors:
- Retry-After is NOT capped at maxBackoff (only at capMs)
- Retry-After IS capped at persistentCapMs absolute limit
- Retry-After delays have no jitter applied

* fix(test): add isUnattendedMode to retry.js mock in baseLlmClient tests

The existing vi.mock for retry.js only exported retryWithBackoff.
After adding isUnattendedMode to the retry module, baseLlmClient.ts
imports it, causing all 10 generateJson tests to fail with
'No "isUnattendedMode" export is defined on the mock'.

* fix(retry): wire persistent retry mode into client.ts generateContent

Forward persistentMode and abortSignal to retryWithBackoff() in
GeminiClient.generateContent(), matching the existing wiring in
baseLlmClient.ts and geminiChat.ts.
zhangxy-zju 2026-04-21 22:08:11 +08:00 committed by GitHub
parent 309b25d256
commit ebe364d0b8
10 changed files with 731 additions and 45 deletions

@@ -310,6 +310,67 @@ echo "Recent usage trends:"
tail -5 usage.log
```

## Persistent Retry Mode

When Qwen Code runs in CI/CD pipelines or as a background daemon, a brief API outage (rate limiting or overload) should not kill a multi-hour task. **Persistent retry mode** makes Qwen Code retry transient API errors indefinitely until the service recovers.

### How it works

- **Transient errors only**: HTTP 429 (Rate Limit) and 529 (Overloaded) are retried indefinitely. Other errors (400, 500, etc.) still fail normally.
- **Exponential backoff with cap**: Retry delays grow exponentially but are capped at **5 minutes** per retry.
- **Heartbeat keepalive**: During long waits, a status line is printed to stderr every **30 seconds** to prevent CI runners from killing the process due to inactivity.
- **Graceful degradation**: Non-transient errors and interactive mode are completely unaffected.
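The behavior above can be sketched as a retry loop (a minimal TypeScript sketch; `isTransient` and `persistentRetry` are illustrative names, not the actual exports of the retry module):

```typescript
// Only rate-limit (429) and overloaded (529) responses are transient.
function isTransient(status: number): boolean {
  return status === 429 || status === 529;
}

// Retry fn() indefinitely on transient errors, with exponential backoff
// capped at maxBackoffMs. Any other error is rethrown immediately.
async function persistentRetry<T>(
  fn: () => Promise<T>,
  maxBackoffMs = 5 * 60 * 1000,
  baseMs = 1000,
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      const status = (err as { status?: number }).status;
      if (status === undefined || !isTransient(status)) throw err; // fail fast
      const delay = Math.min(baseMs * 2 ** attempt, maxBackoffMs);
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
}
```

Note there is no attempt limit in the loop: termination comes from the call eventually succeeding, a non-transient error, or external cancellation.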

### Activation

Set the `QWEN_CODE_UNATTENDED_RETRY` environment variable to `true` or `1` (strict match, case-sensitive):
```bash
export QWEN_CODE_UNATTENDED_RETRY=1
```

> [!IMPORTANT]
> Persistent retry requires an **explicit opt-in**. `CI=true` alone does **not** activate it — silently turning a fast-fail CI job into an infinite-wait job would be dangerous. Always set `QWEN_CODE_UNATTENDED_RETRY` explicitly in your pipeline configuration.

### Examples

#### GitHub Actions

```yaml
- name: Automated code review
env:
QWEN_CODE_UNATTENDED_RETRY: '1'
run: |
qwen -p "Review all files in src/ for security issues" \
--output-format json \
--yolo > review.json
```

#### Overnight batch processing

```bash
export QWEN_CODE_UNATTENDED_RETRY=1
qwen -p "Migrate all callback-style functions to async/await in src/" --yolo
```

#### Background daemon

```bash
QWEN_CODE_UNATTENDED_RETRY=1 nohup qwen -p "Audit all dependencies for known CVEs" \
--output-format json > audit.json 2> audit.log &
```

### Monitoring

During persistent retry, heartbeat messages are printed to **stderr**:
```
[qwen-code] Waiting for API capacity... attempt 3, retry in 45s
[qwen-code] Waiting for API capacity... attempt 3, retry in 15s
```
These messages keep CI runners alive and let you monitor progress. They do not appear in stdout, so JSON output piped to other tools remains clean.
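The heartbeat wait can be sketched like this. `sleepWithHeartbeat` is named after the helper mentioned in the commit log, but its signature here is an assumption; the callback stands in for the stderr status line:

```typescript
// Sleep for totalMs, invoking onBeat at each heartbeat interval with the
// remaining wait time. The interval is floored at 1 ms so a zero or
// negative configuration cannot produce an infinite loop.
async function sleepWithHeartbeat(
  totalMs: number,
  heartbeatMs: number,
  onBeat: (remainingMs: number) => void,
): Promise<void> {
  const interval = Math.max(1, heartbeatMs);
  for (let remaining = totalMs; remaining > 0; remaining -= interval) {
    onBeat(remaining);
    await new Promise((resolve) => setTimeout(resolve, Math.min(interval, remaining)));
  }
}
```

In practice the callback would write to stderr, e.g. `process.stderr.write(...)`, keeping stdout clean for piped JSON output.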

## Resources

- [CLI Configuration](../configuration/settings#command-line-arguments) - Complete configuration guide