14 KiB
Region Failover Drills — Serverless and Edge
🧭 Quick Return to Map
You are in a sub-page of Cloud_Serverless.
To reorient, go back here:
- Cloud_Serverless — scalable functions and event-driven pipelines
- WFGY Global Fix Map — main Emergency Room, 300+ structured fixes
- WFGY Problem Map 1.0 — 16 reproducible failure modes
Think of this page as a desk within a ward.
If you need the full triage and all prescriptions, return to the Emergency Room lobby.
Practice failover until it is boring. This page gives a repeatable drill that proves your system can evacuate a region, keep answers consistent, and return to steady state without split-brain or hidden drift.
When to use this page
- You run in 2+ regions and need evidence your plan actually works.
- Users in one geography see timeouts or changing answers during incidents.
- RAG indices or caches differ by region and you want a clean promotion flow.
- Compliance requires planned evacuation and return-to-home tests.
Open these first
- Visual map and recovery: RAG Architecture & Recovery
- Retrieval knobs and ordering: Retrieval Playbook · Rerankers
- Trace and prove snippets: Retrieval Traceability · Contract payloads: Data Contracts
- Meaning vs distance: Embedding ≠ Semantic
- Boot and deploy hazards: Bootstrap Ordering · Deployment Deadlock · Pre-Deploy Collapse
- Related Cloud/Serverless ops: Cold Start & Concurrency · Timeouts & Streaming · Stateless KV & Queues · Edge Cache Invalidation · Runtime Env Parity · Egress & Webhooks · Pricing vs Latency · Secrets Rotation · Multi-Region Routing · Observability & SLO
Acceptance targets
- Evacuation decision to clean cutover ≤ 30 seconds, no 5xx bursts > 60 seconds.
- p95 latency within 25 percent of pre-incident baseline for served markets.
- Zero write loss. Queue backlog drains to pre-drill baseline in ≤ 10 minutes.
- Identical RAG contract fields and
INDEX_HASHacross promoted region and followers. - ΔS(question, retrieved) ≤ 0.45 and λ convergent for probe set before and after drill.
Drill types you should run
A) Planned evacuation (brownout) Throttle a region by policy and prove traffic drains to survivor without flapping.
B) Hard outage (blackhole) Block the region’s ingress and egress. Verify stickiness, retries, and queue replay.
C) Index skew recovery Deliberately publish a follower with wrong metric or analyzer. Ensure contracts refuse reuse and force rebuild.
D) Webhook and egress reroute Fail the region, then deliver third-party webhooks only from the survivor. Confirm no duplicates.
E) Return-to-home After repair, reintroduce the region, rebuild indices, re-warm caches, and rebalance.
Prerequisites before any drill
-
Version pinning and env parity are equal in both regions. Open: Runtime Env Parity
-
RAG artifacts are stamped:
{INDEX_HASH, METRIC, ANALYZER, BUILD_TS}and exposed via a health endpoint. Open: Retrieval Traceability -
Idempotent write path with
dedupe_keyand visible, durable queues. Open: Stateless KV & Queues -
Dual-accept secrets/keys to avoid auth flaps during route flips. Open: Secrets Rotation
-
Edge cache purge API tested per prefix and per tenant. Open: Edge Cache Invalidation
Copy-paste drill plan (JSON)
{
"regions": ["us-east", "eu-west"],
"evacuate": "us-east",
"promote_survivor": "eu-west",
"routing": { "mode": "latency", "stick_minutes": 15, "hysteresis_s": 30 },
"freeze_writes_on_followers": true,
"checks": {
"pre": ["env_parity", "index_hash_equal", "queue_empty", "cache_warm"],
"live": ["p95_latency", "error_rate", "queue_backlog", "ΔS_probe", "λ_state", "index_hash"],
"post": ["answers_equal", "contracts_equal", "backlog_zero", "cache_rewarmed"]
},
"rag_probe": { "k": 10, "dualhome_k": 5, "delta_s_risk": 0.60, "coverage_min": 0.70 },
"webhook_policy": { "emit_from": "eu-west", "dedupe_key": "sha256(event_id+rev)" },
"return_to_home": { "rebuild_follower": true, "purge_cache": true, "gradual_weights": [10,30,60,100] }
}
Step-by-step runbook
1) Pre-checks (gate the drill)
- Health: both regions report
READY=true, equalINDEX_HASH. - Queues: backlog < threshold.
- Caches: hot-path keys exist in both regions.
- Canary: probe the fixed Q&A set and log ΔS, λ, coverage.
2) Evacuate the region
- Flip routing weight to 0 for the target region, or blackhole ingress.
- Freeze follower writes: accept but enqueue, no direct store writes.
- Announce stick region in response headers for ongoing sessions.
3) Promote survivor
- Promote exactly one region to take writes.
- Route webhooks and outbound calls only from survivor. Open: Egress & Webhooks
4) Observe and clamp
- Separate cold-start from hot latency. Adjust concurrency limits in survivor to avoid thrash. Open: Cold Start & Concurrency
- If p95 spikes beyond SLO, reduce parallel tools or temporarily lower k in retrieval while keeping reranking deterministic.
5) Verify answers
- Run probe set in survivor. If ΔS ≥ 0.60 or citations diverge from gold, rebuild index and purge caches. Open: Retrieval Playbook · Data Contracts
6) Return-to-home (after repair)
- Rebuild follower indices from the same artifact, confirm
INDEX_HASH. - Purge edge cache prefixes, re-warm.
- Gradually restore weights, keeping stickiness for multi-turn sessions. Open: Multi-Region Routing
7) Post-drill closeout
- Prove
answers_equalbetween regions on gold questions. - Export SLO chart, queue backlog graph, ΔS histogram pre vs post.
Metrics and evidence to capture
- p75 / p95 / p99 latency per region.
- Error rate and timeout breakdown (connect, TLS, body read, tool call). Open: Timeouts & Streaming
- Queue backlog length and age percentiles.
- Cache hit ratio changes around purge events.
- ΔS and λ distributions on the probe set, before vs after.
- Index metadata parity logs
{INDEX_HASH, METRIC, ANALYZER, BUILD_TS}for both regions.
Typical drill failures → exact fix
-
Split-brain writes during cutover Missing freeze or idempotency. Enforce queue-first and dedupe keys. Open: Stateless KV & Queues
-
Answers differ after return-to-home Follower index rebuilt with different analyzer or metric. Refuse reuse until
INDEX_HASHmatches. Open: Retrieval Traceability -
Webhook loops or duplicates Third-party still targets evacuated region. Emit only from survivor and apply
dedupe_key. Open: Egress & Webhooks -
Cold-start storm after promotion Concurrency limits not scaled. Pre-warm and clamp. Open: Cold Start & Concurrency
-
Auth failures after route flip Stale region-pinned keys. Rotate with dual-accept and force refresh. Open: Secrets Rotation
Verification checklist
- Single survivor region takes all writes. No duplicates after replay.
INDEX_HASHparity across regions after rebuild.- ΔS and λ within targets on probe set before and after.
- Edge caches purged and re-warmed.
- SLO budget impact recorded and accepted.
🔗 Quick-Start Downloads (60 sec)
| Tool | Link | 3-Step Setup |
|---|---|---|
| WFGY 1.0 PDF | Engine Paper | 1️⃣ Download · 2️⃣ Upload to your LLM · 3️⃣ Ask “Answer using WFGY + <your question>” |
| TXT OS (plain-text OS) | TXTOS.txt | 1️⃣ Download · 2️⃣ Paste into any LLM chat · 3️⃣ Type “hello world” — OS boots instantly |
Explore More
| Layer | Page | What it’s for |
|---|---|---|
| ⭐ Proof | WFGY Recognition Map | External citations, integrations, and ecosystem proof |
| ⚙️ Engine | WFGY 1.0 | Original PDF tension engine and early logic sketch (legacy reference) |
| ⚙️ Engine | WFGY 2.0 | Production tension kernel for RAG and agent systems |
| ⚙️ Engine | WFGY 3.0 | TXT based Singularity tension engine (131 S class set) |
| 🗺️ Map | Problem Map 1.0 | Flagship 16 problem RAG failure taxonomy and fix map |
| 🗺️ Map | Problem Map 2.0 | Global Debug Card for RAG and agent pipeline diagnosis |
| 🗺️ Map | Problem Map 3.0 | Global AI troubleshooting atlas and failure pattern map |
| 🧰 App | TXT OS | .txt semantic OS with fast bootstrap |
| 🧰 App | Blah Blah Blah | Abstract and paradox Q&A built on TXT OS |
| 🧰 App | Blur Blur Blur | Text to image generation with semantic control |
| 🏡 Onboarding | Starter Village | Guided entry point for new users |
If this repository helped, starring it improves discovery so more builders can find the docs and tools.
Next page to write: ProblemMap/GlobalFixMap/Cloud_Serverless/observability_slo.md