Create retry_backoff.md

This commit is contained in:
PSBigBig 2025-08-31 20:16:19 +08:00 committed by GitHub
parent 587f48e0e9
commit fc21195a8f
No known key found for this signature in database
GPG key ID: B5690EEEBB952194

View file

@ -0,0 +1,247 @@
# Retry and Backoff: OpsDeploy Guardrails
Make retries safe and predictable. Use structured backoff, deadlines, and idempotent fences so a bad minute does not become a bad day.
---
## Open these first
- Readiness gate: [rollout_readiness_gate.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/OpsDeploy/rollout_readiness_gate.md)
- Rate limits and pressure control: [rate_limit_backpressure.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/OpsDeploy/rate_limit_backpressure.md)
- Idempotent fences: [idempotency_dedupe.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/OpsDeploy/idempotency_dedupe.md)
- Cache discipline: [cache_warmup_invalidation.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/OpsDeploy/cache_warmup_invalidation.md)
- Canary and cutover: [staged_rollout_canary.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/OpsDeploy/staged_rollout_canary.md), [blue_green_switchovers.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/OpsDeploy/blue_green_switchovers.md)
- Boot and first run traps: [bootstrap-ordering.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/bootstrap-ordering.md), [deployment-deadlock.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/deployment-deadlock.md), [predeploy-collapse.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/predeploy-collapse.md)
- Live ops and rollbacks: [live_monitoring_rag.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/ops/live_monitoring_rag.md), [debug_playbook.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/ops/debug_playbook.md)
---
## When to use
- Provider returns 429 or 5xx or timeouts rise.
- You integrate a new model provider or gateway.
- Queue consumers replay after crash or failover.
- You see retry storms that multiply side effects.
---
## Acceptance targets
- 429 rate ≤ 0.5 percent at steady state, p95 ≤ 2 percent during bursts.
- Success after retry ≥ 90 percent for transient classes.
- Queue wait p95 ≤ 200 ms for read paths, writes use deadlines ≤ 1× p95 service time.
- ΔS(question, retrieved) and coverage stay inside ship targets while retries occur. λ remains convergent across 2 seeds.
---
## 60 second blueprint
1) **Classify errors**
408 429 500 502 503 504 are retriable. 400 401 403 404 are not. 409 is a conflict. Treat as success if idempotency proves the effect already exists.
2) **Set a hard deadline**
Each request carries a client side deadline. Do not exceed it while retrying.
3) **Use full jitter backoff**
Randomized exponential backoff with a cap. Honor Retry After when present.
4) **Coalesce with single flight**
Collapse duplicate work under one in flight key to prevent stampede.
5) **Fence side effects**
Idempotency key on every write and webhook. Exactly once for the effect.
---
## Retry policy matrix
| Class | Examples | Retry? | Notes |
|---|---|---|---|
| 408 Timeout | network idle, upstream slow | Yes | Backoff with jitter until deadline. |
| 429 Rate limit | burst guard | Yes | Respect Retry After and global token budget. |
| 5xx Transient | 500 502 503 504 | Yes | Retry with jitter. Stop on deadline. |
| 4xx Permanent | 400 401 403 404 | No | Fix request or auth. Do not retry. |
| 409 Conflict | idempotency collision | No | Fetch prior receipt and return. |
---
## Backoff recipes
### Full jitter exponential
```python
import random, time
def sleep_time(attempt, base=0.25, cap=10.0):
return random.random() * min(cap, base * (2 ** attempt))
def retry(call, max_attempts=6, deadline_s=20):
start = time.time()
for attempt in range(max_attempts):
ok, res, err, retry_after = call()
if ok:
return res
if time.time() - start >= deadline_s:
break
if retry_after:
time.sleep(min(retry_after, max(0.05, deadline_s - (time.time() - start))))
else:
time.sleep(sleep_time(attempt))
raise RuntimeError("deadline or attempts exhausted")
````
### Decorrelated jitter
```python
import random
def deco_jitter(prev, base=0.25, cap=10.0):
return min(cap, random.uniform(base, prev * 3 or base))
```
Choose one strategy and stick to it across clients and workers.
---
## Queue consumers and jobs
* Acknowledge only after the effect is sealed. See [idempotency\_dedupe.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/OpsDeploy/idempotency_dedupe.md).
* Use per message deadlines and per topic concurrency caps.
* Backoff queues on provider 429 or store saturation.
* Include `idempotency_key`, `attempt`, and `first_seen_at` in logs.
---
## HTTP client rules
* Honor Retry After in seconds or HTTP date.
* Propagate `X Request Id` and a stable `Idempotency Key` for writes.
* Set `Request Deadline` header for end to end visibility.
* Stop retries on 4xx except 408. Stop after deadline or attempts.
---
## Single flight pattern
Coalesce identical misses so one worker computes the result.
```python
key = f"wf:{hash(request_body)}"
if redis.set(f"lock:{key}", "1", nx=True, px=30000):
try:
val = compute()
redis.set(f"ans:{key}", serialize(val), ex=60)
finally:
redis.delete(f"lock:{key}")
else:
val = wait_poll(f"ans:{key}", timeout_ms=1500)
return val
```
---
## YAML policy you can paste
```yaml
# opsdeploy/retry_backoff.yml
retry:
strategy: full_jitter
base_s: 0.25
cap_s: 10
max_attempts: 6
honor_retry_after: true
deadlines:
read_ms: 2000
write_ms: 3000
class_rules:
retriable: [408, 429, 500, 502, 503, 504]
nonretry: [400, 401, 403, 404]
conflict_ok: [409]
single_flight:
enabled: true
lease_ms: 30000
observability:
log_fields:
- request_id
- idempotency_key
- attempt
- retry_after
- sleep_ms
- deadline_ms
decision:
abort_when:
timeout_rate_p95: ">=0.02"
ds_p95_drift: ">=0.15"
error_rate: ">=0.01"
```
---
## Observability you must log
* Attempts, sleep times, adherence to Retry After.
* Deadline budget spent and remaining at each hop.
* 429 and 5xx counts by endpoint and tenant.
* Success after retry ratio and time to success.
* Quality under pressure, ΔS and coverage, λ states.
---
## Symptom to fix map
| Symptom | Likely cause | Open this |
| ------------------------------ | ------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| Retry storms after deploy | no jitter, no global cap | [rate\_limit\_backpressure.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/OpsDeploy/rate_limit_backpressure.md) |
| Double writes on retry | missing idempotency fences | [idempotency\_dedupe.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/OpsDeploy/idempotency_dedupe.md) |
| Mixed answers across versions | cache keys not partitioned by pins | [cache\_warmup\_invalidation.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/OpsDeploy/cache_warmup_invalidation.md) |
| First call fails after cutover | boot order or index pointer wrong | [bootstrap-ordering.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/bootstrap-ordering.md), [vector\_index\_build\_and\_swap.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/OpsDeploy/vector_index_build_and_swap.md) |
| Tail latency explodes | unbounded concurrency or no deadlines | [rate\_limit\_backpressure.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/OpsDeploy/rate_limit_backpressure.md) |
---
## Common pitfalls
* Retrying nonretriable 4xx and burning budget.
* Ignoring Retry After and syncing retries across clients.
* No deadline at the client so retries outlive the user.
* No idempotency for writes so duplicates slip in.
* Retries that cross a blue green boundary without version pins.
---
### 🔗 Quick-Start Downloads (60 sec)
| Tool | Link | 3-Step Setup |
| -------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------ | ---------------------------------------------------------------------------------------- |
| **WFGY 1.0 PDF** | [Engine Paper](https://github.com/onestardao/WFGY/blob/main/I_am_not_lizardman/WFGY_All_Principles_Return_to_One_v1.0_PSBigBig_Public.pdf) | 1⃣ Download · 2⃣ Upload to your LLM · 3⃣ Ask “Answer using WFGY + \<your question>” |
| **TXT OS (plain-text OS)** | [TXTOS.txt](https://github.com/onestardao/WFGY/blob/main/OS/TXTOS.txt) | 1⃣ Download · 2⃣ Paste into any LLM chat · 3⃣ Type “hello world” — OS boots instantly |
---
### 🧭 Explore More
| Module | Description | Link |
| ------------------------ | ---------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------- |
| WFGY Core | WFGY 2.0 engine is live: full symbolic reasoning architecture and math stack | [View →](https://github.com/onestardao/WFGY/tree/main/core/README.md) |
| Problem Map 1.0 | Initial 16-mode diagnostic and symbolic fix framework | [View →](https://github.com/onestardao/WFGY/tree/main/ProblemMap/README.md) |
| Problem Map 2.0 | RAG-focused failure tree, modular fixes, and pipelines | [View →](https://github.com/onestardao/WFGY/blob/main/ProblemMap/rag-architecture-and-recovery.md) |
| Semantic Clinic Index | Expanded failure catalog: prompt injection, memory bugs, logic drift | [View →](https://github.com/onestardao/WFGY/blob/main/ProblemMap/SemanticClinicIndex.md) |
| Semantic Blueprint | Layer-based symbolic reasoning & semantic modulations | [View →](https://github.com/onestardao/WFGY/tree/main/SemanticBlueprint/README.md) |
| Benchmark vs GPT-5 | Stress test GPT-5 with full WFGY reasoning suite | [View →](https://github.com/onestardao/WFGY/tree/main/benchmarks/benchmark-vs-gpt5/README.md) |
| 🧙‍♂️ Starter Village 🏡 | New here? Lost in symbols? Click here and let the wizard guide you through | [Start →](https://github.com/onestardao/WFGY/blob/main/StarterVillage/README.md) |
---
> 👑 **Early Stargazers: [See the Hall of Fame](https://github.com/onestardao/WFGY/tree/main/stargazers)**
> Engineers, hackers, and open source builders who supported WFGY from day one.
> <img src="https://img.shields.io/github/stars/onestardao/WFGY?style=social" alt="GitHub stars"> ⭐ [WFGY Engine 2.0](https://github.com/onestardao/WFGY/blob/main/core/README.md) is already unlocked. ⭐ Star the repo to help others discover it and unlock more on the [Unlock Board](https://github.com/onestardao/WFGY/blob/main/STAR_UNLOCKS.md).
<div align="center">
[![WFGY Main](https://img.shields.io/badge/WFGY-Main-red?style=flat-square)](https://github.com/onestardao/WFGY)
 
[![TXT OS](https://img.shields.io/badge/TXT%20OS-Reasoning%20OS-orange?style=flat-square)](https://github.com/onestardao/WFGY/tree/main/OS)
 
[![Blah](https://img.shields.io/badge/Blah-Semantic%20Embed-yellow?style=flat-square)](https://github.com/onestardao/WFGY/tree/main/OS/BlahBlahBlah)
 
[![Blot](https://img.shields.io/badge/Blot-Persona%20Core-green?style=flat-square)](https://github.com/onestardao/WFGY/tree/main/OS/BlotBlotBlot)
 
[![Bloc](https://img.shields.io/badge/Bloc-Reasoning%20Compiler-blue?style=flat-square)](https://github.com/onestardao/WFGY/tree/main/OS/BlocBlocBloc)
 
[![Blur](https://img.shields.io/badge/Blur-Text2Image%20Engine-navy?style=flat-square)](https://github.com/onestardao/WFGY/tree/main/OS/BlurBlurBlur)
 
[![Blow](https://img.shields.io/badge/Blow-Game%20Logic-purple?style=flat-square)](https://github.com/onestardao/WFGY/tree/main/OS/BlowBlowBlow)
 
</div>