mirror of
https://github.com/onestardao/WFGY.git
synced 2026-04-28 11:40:07 +00:00
Retry and Backoff: OpsDeploy Guardrails
🧭 Quick Return to Map
You are in a sub-page of OpsDeploy.
To reorient, go back here:
- OpsDeploy — operations automation and deployment pipelines
- WFGY Global Fix Map — main Emergency Room, 300+ structured fixes
- WFGY Problem Map 1.0 — 16 reproducible failure modes
Think of this page as a desk within a ward.
If you need the full triage and all prescriptions, return to the Emergency Room lobby.
Make retries safe and predictable. Use structured backoff, deadlines, and idempotent fences so a bad minute does not become a bad day.
Open these first
- Readiness gate: rollout_readiness_gate.md
- Rate limits and pressure control: rate_limit_backpressure.md
- Idempotent fences: idempotency_dedupe.md
- Cache discipline: cache_warmup_invalidation.md
- Canary and cutover: staged_rollout_canary.md, blue_green_switchovers.md
- Boot and first run traps: bootstrap-ordering.md, deployment-deadlock.md, predeploy-collapse.md
- Live ops and rollbacks: live_monitoring_rag.md, debug_playbook.md
When to use
- Provider returns 429 or 5xx, or timeouts rise.
- You integrate a new model provider or gateway.
- Queue consumers replay after crash or failover.
- You see retry storms that multiply side effects.
Acceptance targets
- 429 rate ≤ 0.5 percent at steady state, p95 ≤ 2 percent during bursts.
- Success after retry ≥ 90 percent for transient classes.
- Queue wait p95 ≤ 200 ms for read paths, writes use deadlines ≤ 1× p95 service time.
- ΔS(question, retrieved) and coverage stay inside ship targets while retries occur. λ remains convergent across 2 seeds.
60 second blueprint
- Classify errors
  408, 429, 500, 502, 503, 504 are retriable. 400, 401, 403, 404 are not. 409 is a conflict. Treat it as success if idempotency proves the effect already exists.
- Set a hard deadline
  Each request carries a client-side deadline. Do not exceed it while retrying.
- Use full jitter backoff
  Randomized exponential backoff with a cap. Honor Retry-After when present.
- Coalesce with single flight
  Collapse duplicate work under one in-flight key to prevent stampede.
- Fence side effects
  Idempotency key on every write and webhook. Exactly once for the effect.
Retry policy matrix
| Class | Examples | Retry? | Notes |
|---|---|---|---|
| 408 Timeout | network idle, upstream slow | Yes | Backoff with jitter until deadline. |
| 429 Rate limit | burst guard | Yes | Respect Retry-After and global token budget. |
| 5xx Transient | 500 502 503 504 | Yes | Retry with jitter. Stop on deadline. |
| 4xx Permanent | 400 401 403 404 | No | Fix request or auth. Do not retry. |
| 409 Conflict | idempotency collision | No | Fetch prior receipt and return. |
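The matrix above can be encoded as a small lookup. This is a minimal sketch, assuming the HTTP status code is the only signal available; the function name `classify` and its three labels are hypothetical, and treating every unlisted status as non-retriable is an assumption the matrix does not state.

```python
RETRIABLE = {408, 429, 500, 502, 503, 504}
CONFLICT = {409}

def classify(status: int) -> str:
    """Map an HTTP status to 'retry', 'conflict', or 'fail' per the matrix."""
    if status in RETRIABLE:
        return "retry"      # back off with jitter until the deadline
    if status in CONFLICT:
        return "conflict"   # fetch the prior receipt instead of retrying
    # Listed permanent 4xx codes land here. Unlisted codes are also treated
    # as non-retriable, which is an assumption of this sketch.
    return "fail"
```

Keeping the sets in one place lets clients and workers share the same classification, so a code does not become retriable in one component and permanent in another.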
Backoff recipes
Full jitter exponential
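import random, time

def sleep_time(attempt, base=0.25, cap=10.0):
    # Full jitter: uniform in [0, min(cap, base * 2**attempt)).
    return random.random() * min(cap, base * (2 ** attempt))

def retry(call, max_attempts=6, deadline_s=20):
    # call() returns (ok, res, err, retry_after); retry_after is seconds or None.
    start = time.monotonic()  # monotonic clock is immune to wall-clock jumps
    for attempt in range(max_attempts):
        ok, res, err, retry_after = call()
        if ok:
            return res
        remaining = deadline_s - (time.monotonic() - start)
        # Stop when the budget is spent, and do not sleep after the last attempt.
        if remaining <= 0 or attempt == max_attempts - 1:
            break
        # Prefer the server hint, but never sleep past the deadline.
        wait = retry_after if retry_after else sleep_time(attempt)
        time.sleep(min(wait, remaining))
    raise RuntimeError("deadline or attempts exhausted")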
Decorrelated jitter
import random

def deco_jitter(prev, base=0.25, cap=10.0):
    # Next sleep is uniform in [base, 3 * prev], capped. Pass prev=0 on the first attempt.
    return min(cap, random.uniform(base, max(base, prev * 3)))
Choose one strategy and stick to it across clients and workers.
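Decorrelated jitter only works if each sleep is fed back in as `prev` on the next attempt. The sketch below shows that threading; the generator name `delays` is hypothetical, and starting from `prev = 0` (so the first sleep equals `base`) is an assumption of this example.

```python
import random

def deco_jitter(prev, base=0.25, cap=10.0):
    # Uniform in [base, 3 * prev], capped at cap.
    return min(cap, random.uniform(base, max(base, prev * 3)))

def delays(n, base=0.25, cap=10.0):
    """Yield n sleep times, feeding each one back in as prev."""
    prev = 0.0
    for _ in range(n):
        prev = deco_jitter(prev, base, cap)
        yield prev

if __name__ == "__main__":
    # Values after the first are random and differ between runs.
    print([round(d, 3) for d in delays(6)])
```

Unlike full jitter, the upper bound here grows from the previous sleep rather than from the attempt count, so the caller has to carry one extra piece of state per request.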
Queue consumers and jobs
- Acknowledge only after the effect is sealed. See idempotency_dedupe.md.
- Use per message deadlines and per topic concurrency caps.
- Backoff queues on provider 429 or store saturation.
- Include idempotency_key, attempt, and first_seen_at in logs.
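The first and last rules above can be sketched together: apply the effect at most once per idempotency key, and acknowledge only after the receipt is recorded. This is a minimal illustration, not the method described in idempotency_dedupe.md. The in-memory `sealed` dict stands in for a durable store, and `handle`, `apply_effect`, and `ack` are hypothetical names.

```python
import time

sealed = {}  # idempotency_key -> receipt; stand-in for a durable store

def handle(msg, apply_effect, ack, log=print):
    """Process one message and ack only once the effect is recorded."""
    key = msg["idempotency_key"]
    msg.setdefault("first_seen_at", time.time())
    msg["attempt"] = msg.get("attempt", 0) + 1
    log({
        "idempotency_key": key,
        "attempt": msg["attempt"],
        "first_seen_at": msg["first_seen_at"],
    })
    if key not in sealed:
        # If apply_effect raises, nothing is sealed and no ack is sent,
        # so the broker is free to redeliver the message.
        sealed[key] = apply_effect(msg)
    ack(msg)
    return sealed[key]
```

A replayed message with the same key skips the effect and returns the stored receipt. In a real system the check and the write would need to be atomic in the store, which this single-process sketch does not attempt.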
HTTP client rules
- Honor Retry-After in seconds or HTTP date.
- Propagate X-Request-Id and a stable Idempotency-Key for writes.
- Set a Request-Deadline header for end-to-end visibility.
- Stop retries on 4xx except 408 and 429. Stop after deadline or attempts.
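The Retry-After header can carry either a number of seconds or an HTTP date, so a client needs to handle both forms. Here is a minimal sketch using only the standard library; the function name `retry_after_seconds` is hypothetical, and returning `None` for an unparseable value (so the caller falls back to jittered backoff) is an assumption of this example.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

def retry_after_seconds(value, now=None):
    """Return the delay in seconds from a Retry-After value, or None."""
    if value is None:
        return None
    value = value.strip()
    if value.isdigit():
        return float(value)              # delta-seconds form
    try:
        when = parsedate_to_datetime(value)  # HTTP-date form
    except (TypeError, ValueError):
        return None                      # unparseable: caller uses its own backoff
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    # A date in the past means "retry now".
    return max(0.0, (when - now).total_seconds())
```

The result still has to be clamped to the remaining deadline budget before sleeping, as in the retry loop earlier on this page.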
Single flight pattern
Coalesce identical misses so one worker computes the result.
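import hashlib

def single_flight(redis, request_body, compute):
    # Built-in hash() is salted per process, so use a stable digest that all
    # workers agree on. Assumes request_body is a str.
    key = "wf:" + hashlib.sha256(request_body.encode("utf-8")).hexdigest()
    # NX + PX: only one worker takes the 30 s lease.
    if redis.set(f"lock:{key}", "1", nx=True, px=30000):
        try:
            val = compute()
            redis.set(f"ans:{key}", serialize(val), ex=60)
        finally:
            redis.delete(f"lock:{key}")
    else:
        # Followers wait for the leader's answer. wait_poll and serialize
        # are helpers defined elsewhere.
        val = wait_poll(f"ans:{key}", timeout_ms=1500)
    return val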
YAML policy you can paste
# opsdeploy/retry_backoff.yml
retry:
  strategy: full_jitter
  base_s: 0.25
  cap_s: 10
  max_attempts: 6
  honor_retry_after: true
deadlines:
  read_ms: 2000
  write_ms: 3000
class_rules:
  retriable: [408, 429, 500, 502, 503, 504]
  nonretry: [400, 401, 403, 404]
  conflict_ok: [409]
single_flight:
  enabled: true
  lease_ms: 30000
observability:
  log_fields:
    - request_id
    - idempotency_key
    - attempt
    - retry_after
    - sleep_ms
    - deadline_ms
decision:
  abort_when:
    timeout_rate_p95: ">=0.02"
    ds_p95_drift: ">=0.15"
    error_rate: ">=0.01"
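The abort_when thresholds in the policy above are comparator strings, so something has to parse and evaluate them. The sketch below shows one possible reading, assuming the YAML has already been loaded into a dict by whatever parser you use. The function name `should_abort` and the "any tripped rule aborts" semantics are assumptions, not part of the policy file.

```python
import operator

OPS = {">=": operator.ge, "<=": operator.le, ">": operator.gt, "<": operator.lt}

def should_abort(abort_when, metrics):
    """Return the names of rules whose threshold is met by the metrics."""
    tripped = []
    for name, rule in abort_when.items():
        # Try two-character operators before one-character ones.
        for sym in (">=", "<=", ">", "<"):
            if rule.startswith(sym):
                limit = float(rule[len(sym):])
                # Metrics that were not reported never trip a rule.
                if name in metrics and OPS[sym](metrics[name], limit):
                    tripped.append(name)
                break
    return tripped

if __name__ == "__main__":
    rules = {"timeout_rate_p95": ">=0.02", "ds_p95_drift": ">=0.15", "error_rate": ">=0.01"}
    print(should_abort(rules, {"timeout_rate_p95": 0.01, "ds_p95_drift": 0.2, "error_rate": 0.01}))
```

Returning the tripped rule names rather than a bare boolean makes it straightforward to log why a rollout was stopped.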
Observability you must log
- Attempts, sleep times, adherence to Retry-After.
- Deadline budget spent and remaining at each hop.
- 429 and 5xx counts by endpoint and tenant.
- Success after retry ratio and time to success.
- Quality under pressure, ΔS and coverage, λ states.
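One way to cover the first two items is a single structured log line per attempt. This is a minimal sketch built on the log_fields from the YAML policy; the function name `retry_log_line` and the extra `budget_remaining_ms` field are hypothetical additions for illustration.

```python
import json

def retry_log_line(request_id, idempotency_key, attempt,
                   retry_after, sleep_ms, deadline_ms, elapsed_ms):
    """Build one JSON log line describing a retry attempt."""
    record = {
        "request_id": request_id,
        "idempotency_key": idempotency_key,
        "attempt": attempt,
        "retry_after": retry_after,   # server hint in seconds, or None
        "sleep_ms": sleep_ms,         # what the client actually slept
        "deadline_ms": deadline_ms,
        # Hypothetical extra field: budget left at this hop.
        "budget_remaining_ms": max(0, deadline_ms - elapsed_ms),
    }
    return json.dumps(record, sort_keys=True)
```

Logging both `retry_after` and `sleep_ms` on the same line lets you check adherence to the server hint after the fact, without joining against response headers.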
Symptom to fix map
| Symptom | Likely cause | Open this |
|---|---|---|
| Retry storms after deploy | no jitter, no global cap | rate_limit_backpressure.md |
| Double writes on retry | missing idempotency fences | idempotency_dedupe.md |
| Mixed answers across versions | cache keys not partitioned by pins | cache_warmup_invalidation.md |
| First call fails after cutover | boot order or index pointer wrong | bootstrap-ordering.md, vector_index_build_and_swap.md |
| Tail latency explodes | unbounded concurrency or no deadlines | rate_limit_backpressure.md |
Common pitfalls
- Retrying nonretriable 4xx and burning budget.
- Ignoring Retry-After and syncing retries across clients.
- No deadline at the client so retries outlive the user.
- No idempotency for writes so duplicates slip in.
- Retries that cross a blue green boundary without version pins.
🔗 Quick-Start Downloads (60 sec)
| Tool | Link | 3-Step Setup |
|---|---|---|
| WFGY 1.0 PDF | Engine Paper | 1️⃣ Download · 2️⃣ Upload to your LLM · 3️⃣ Ask “Answer using WFGY + <your question>” |
| TXT OS (plain-text OS) | TXTOS.txt | 1️⃣ Download · 2️⃣ Paste into any LLM chat · 3️⃣ Type “hello world” — OS boots instantly |