Region Failover Drills — Serverless and Edge

Practice failover until it is boring. This page gives a repeatable drill that proves your system can evacuate a region, keep answers consistent, and return to steady state without split-brain or hidden drift.

When to use this page

  • You run in 2+ regions and need evidence your plan actually works.
  • Users in one geography see timeouts or changing answers during incidents.
  • RAG indices or caches differ by region and you want a clean promotion flow.
  • Compliance requires planned evacuation and return-to-home tests.

Open these first

Acceptance targets

  • Time from evacuation decision to clean cutover ≤ 30 seconds, with no 5xx burst lasting longer than 60 seconds.
  • p95 latency within 25 percent of pre-incident baseline for served markets.
  • Zero write loss. Queue backlog drains to pre-drill baseline in ≤ 10 minutes.
  • Identical RAG contract fields and INDEX_HASH across promoted region and followers.
  • ΔS(question, retrieved) ≤ 0.45 and λ convergent for probe set before and after drill.
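
The targets above can be turned into a pass/fail check. The following is a minimal sketch, not part of WFGY: `DrillResult` and `failed_targets` are hypothetical names, "within 25 percent" is read as "no more than 1.25x baseline", and the ΔS target is applied to the worst probe.

```python
from dataclasses import dataclass

@dataclass
class DrillResult:
    # All field names are hypothetical; map them to your own telemetry.
    cutover_s: float            # evacuation decision to clean cutover
    longest_5xx_burst_s: float  # longest continuous 5xx burst
    p95_ms: float               # p95 latency during the drill
    baseline_p95_ms: float      # p95 latency before the drill
    writes_lost: int
    backlog_drain_min: float    # minutes until backlog is back at baseline
    index_hashes: dict          # region name -> INDEX_HASH
    delta_s_max: float          # worst ΔS(question, retrieved) on the probe set
    lambda_convergent: bool

def failed_targets(r):
    """Return the names of the acceptance targets that were missed."""
    checks = {
        "cutover": r.cutover_s <= 30,
        "5xx_burst": r.longest_5xx_burst_s <= 60,
        "p95_latency": r.p95_ms <= 1.25 * r.baseline_p95_ms,  # assumed reading
        "write_loss": r.writes_lost == 0,
        "backlog_drain": r.backlog_drain_min <= 10,
        "index_hash": len(set(r.index_hashes.values())) == 1,
        "delta_s": r.delta_s_max <= 0.45,
        "lambda": r.lambda_convergent,
    }
    return [name for name, ok in checks.items() if not ok]
```

An empty list means the drill met every target; anything else names what to investigate first.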

Drill types you should run

A) Planned evacuation (brownout): Throttle a region by policy and prove traffic drains to the survivor without flapping.

B) Hard outage (blackhole): Block the region's ingress and egress. Verify stickiness, retries, and queue replay.

C) Index skew recovery: Deliberately publish a follower with the wrong metric or analyzer. Ensure contracts refuse reuse and force a rebuild.

D) Webhook and egress reroute: Fail the region, then deliver third-party webhooks only from the survivor. Confirm no duplicates.

E) Return-to-home: After repair, reintroduce the region, rebuild indices, re-warm caches, and rebalance.


Prerequisites before any drill


Copy-paste drill plan (JSON)

{
  "regions": ["us-east", "eu-west"],
  "evacuate": "us-east",
  "promote_survivor": "eu-west",
  "routing": { "mode": "latency", "stick_minutes": 15, "hysteresis_s": 30 },
  "freeze_writes_on_followers": true,
  "checks": {
    "pre": ["env_parity", "index_hash_equal", "queue_empty", "cache_warm"],
    "live": ["p95_latency", "error_rate", "queue_backlog", "ΔS_probe", "λ_state", "index_hash"],
    "post": ["answers_equal", "contracts_equal", "backlog_zero", "cache_rewarmed"]
  },
  "rag_probe": { "k": 10, "dualhome_k": 5, "delta_s_risk": 0.60, "coverage_min": 0.70 },
  "webhook_policy": { "emit_from": "eu-west", "dedupe_key": "sha256(event_id+rev)" },
  "return_to_home": { "rebuild_follower": true, "purge_cache": true, "gradual_weights": [10,30,60,100] }
}
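
Before running a plan like the one above, it may help to sanity-check it. The sketch below is an assumption about what "consistent" means for this plan shape (the `validate_plan` helper is hypothetical, not a WFGY tool): two distinct regions, a survivor different from the evacuated region, webhooks emitted from the survivor, and return-to-home weights that rise to 100.

```python
def validate_plan(plan):
    """Return a list of human-readable problems; empty means the plan looks sane."""
    problems = []
    regions = plan.get("regions", [])
    evacuate = plan.get("evacuate")
    survivor = plan.get("promote_survivor")

    if len(set(regions)) < 2:
        problems.append("need at least two distinct regions")
    if evacuate not in regions:
        problems.append("evacuate must name a listed region")
    if survivor not in regions:
        problems.append("promote_survivor must name a listed region")
    if evacuate == survivor:
        problems.append("survivor must differ from the evacuated region")

    # Webhooks should leave only from the promoted survivor.
    if plan.get("webhook_policy", {}).get("emit_from") != survivor:
        problems.append("webhook_policy.emit_from must equal promote_survivor")

    # Return-to-home weights should ramp up and finish at full traffic.
    weights = plan.get("return_to_home", {}).get("gradual_weights", [])
    if not weights or weights != sorted(weights) or weights[-1] != 100:
        problems.append("gradual_weights must be ascending and end at 100")
    return problems
```

Load the plan with `json.load` and refuse to start the drill if the returned list is non-empty.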

Step-by-step runbook

1) Pre-checks (gate the drill)

  • Health: both regions report READY=true, equal INDEX_HASH.
  • Queues: backlog < threshold.
  • Caches: hot-path keys exist in both regions.
  • Canary: probe the fixed Q&A set and log ΔS, λ, coverage.
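
The health, queue, and cache pre-checks above could be gated with something like the following sketch. The input shape (a dict per region with `ready`, `index_hash`, `backlog`, and `cache_keys`) and the function name are assumptions for illustration; the canary probe is left out because it depends on your ΔS and λ tooling.

```python
def precheck_blockers(regions, backlog_threshold, hot_keys):
    """Return reasons the drill must not start; empty means go.

    regions: name -> {"ready": bool, "index_hash": str,
                      "backlog": int, "cache_keys": set}  (hypothetical shape)
    """
    blockers = []
    for name, r in sorted(regions.items()):
        if not r["ready"]:
            blockers.append(f"{name}: not READY")
        if r["backlog"] >= backlog_threshold:
            blockers.append(f"{name}: backlog {r['backlog']} >= {backlog_threshold}")
        missing = hot_keys - r["cache_keys"]
        if missing:
            blockers.append(f"{name}: cold keys {sorted(missing)}")
    # Every region must serve the same index build.
    if len({r["index_hash"] for r in regions.values()}) > 1:
        blockers.append("INDEX_HASH differs between regions")
    return blockers
```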

2) Evacuate the region

  • Flip routing weight to 0 for the target region, or blackhole ingress.
  • Freeze follower writes: accept but enqueue, no direct store writes.
  • Announce the sticky region in response headers for ongoing sessions.
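
The "accept but enqueue" rule for frozen followers might be sketched as below. This is an in-memory illustration only: `FollowerWriteGate` is a hypothetical name, a dict stands in for the real store, and a real deployment would replay against the promoted region through a durable queue.

```python
from collections import deque

class FollowerWriteGate:
    """Accepts writes; while frozen, queues them instead of touching the store."""

    def __init__(self):
        self.frozen = False
        self.store = {}       # stand-in for the real data store
        self.queue = deque()  # stand-in for a durable queue
        self.applied = set()  # operation ids already applied (idempotency)

    def write(self, op_id, key, value):
        if self.frozen:
            self.queue.append((op_id, key, value))
            return "queued"
        self._apply(op_id, key, value)
        return "stored"

    def _apply(self, op_id, key, value):
        if op_id in self.applied:
            return False  # duplicate delivery, skip
        self.store[key] = value
        self.applied.add(op_id)
        return True

    def replay(self):
        """Drain the queue in arrival order; return how many ops took effect."""
        count = 0
        while self.queue:
            if self._apply(*self.queue.popleft()):
                count += 1
        return count
```

Tracking applied operation ids is what keeps a retried request from being written twice during replay.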

3) Promote survivor

  • Promote exactly one region to take writes.
  • Route webhooks and outbound calls only from survivor. Open: Egress & Webhooks
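
A survivor-only emitter with the plan's `sha256(event_id+rev)` dedupe key could look like this sketch. Two assumptions to note: a `:` separator is added between `event_id` and `rev` so that ("a1", 2) and ("a", 12) do not collide, and the seen-set lives in memory where a real system would use a shared store. `WebhookEmitter` is a hypothetical class.

```python
import hashlib

def dedupe_key(event_id, rev):
    # Separator is an assumption; the plan only says sha256(event_id+rev).
    return hashlib.sha256(f"{event_id}:{rev}".encode("utf-8")).hexdigest()

class WebhookEmitter:
    def __init__(self, region, survivor):
        self.region = region
        self.survivor = survivor
        self.seen = set()  # in-memory stand-in for a shared dedupe store
        self.sent = []     # records what would have been delivered

    def emit(self, event_id, rev, payload):
        """Return True if the webhook was sent, False if suppressed."""
        if self.region != self.survivor:
            return False   # only the promoted survivor may emit
        key = dedupe_key(event_id, rev)
        if key in self.seen:
            return False   # duplicate of an already delivered event revision
        self.seen.add(key)
        self.sent.append((key, payload))
        return True
```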

4) Observe and clamp

  • Separate cold-start from hot latency. Adjust concurrency limits in survivor to avoid thrash. Open: Cold Start & Concurrency
  • If p95 spikes beyond SLO, reduce parallel tools or temporarily lower k in retrieval while keeping reranking deterministic.
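
One way to lower k under pressure without making results unstable is sketched below. The halving policy, the `min_k` floor, and both function names are assumptions; the point is that the reranker breaks score ties on document id, so the same candidates always produce the same order.

```python
def clamp_k(k, p95_ms, slo_ms, min_k=3):
    """Halve k while p95 exceeds the SLO, never going below min_k."""
    if p95_ms > slo_ms:
        return max(min_k, k // 2)
    return k

def rerank(candidates, k):
    """candidates: list of (doc_id, score). Highest score first,
    ties broken by doc_id so the order is deterministic."""
    return sorted(candidates, key=lambda c: (-c[1], c[0]))[:k]
```

Restore the original k once p95 is back under the SLO, and record the clamp window so the ΔS probe results can be read in context.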

5) Verify answers

6) Return-to-home (after repair)

  • Rebuild follower indices from the same artifact, confirm INDEX_HASH.
  • Purge edge cache prefixes, re-warm.
  • Gradually restore weights, keeping stickiness for multi-turn sessions. Open: Multi-Region Routing
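
The gradual weight restore can be sketched as a ramp that stops and steps back when a health check fails. `restore_weights` and the `healthy_at` callback are hypothetical; in practice the callback would sit on your routing layer and SLO metrics, with a soak period at each step.

```python
def restore_weights(steps, healthy_at):
    """Apply each weight in order; on the first unhealthy step,
    fall back to the previous weight (or 0) and stop.

    Returns the sequence of weights that were applied.
    """
    applied = []
    for weight in steps:
        applied.append(weight)
        if not healthy_at(weight):
            previous = applied[-2] if len(applied) >= 2 else 0
            applied.append(previous)  # roll back one step
            break
    return applied
```

With the plan's `gradual_weights` of `[10, 30, 60, 100]`, a failure at 60 leaves the region at 30 rather than evacuating it again.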

7) Post-drill closeout

  • Prove answers_equal between regions on gold questions.
  • Export SLO chart, queue backlog graph, ΔS histogram pre vs post.
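
A simple way to show `answers_equal` is to ask both regions the same gold questions and list the ones that disagree. This sketch assumes exact agreement after whitespace and case normalization, which may be too strict for generative answers; `answer_mismatches` and the two `ask_*` callables are hypothetical.

```python
def answer_mismatches(gold_questions, ask_region_a, ask_region_b):
    """Return the gold questions whose answers differ between two regions."""
    def norm(text):
        # Collapse whitespace and ignore case before comparing.
        return " ".join(text.split()).casefold()

    return [
        q for q in gold_questions
        if norm(ask_region_a(q)) != norm(ask_region_b(q))
    ]
```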

Metrics and evidence to capture

  • p75 / p95 / p99 latency per region.
  • Error rate and timeout breakdown (connect, TLS, body read, tool call). Open: Timeouts & Streaming
  • Queue backlog length and age percentiles.
  • Cache hit ratio changes around purge events.
  • ΔS and λ distributions on the probe set, before vs after.
  • Index metadata parity logs {INDEX_HASH, METRIC, ANALYZER, BUILD_TS} for both regions.
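
For the index metadata parity logs, a small diff over the four listed fields is often enough. The sketch below assumes each region exposes its metadata as a dict keyed by those field names, and that followers built from the same artifact share `BUILD_TS`; `parity_diff` is a hypothetical helper.

```python
PARITY_FIELDS = ("INDEX_HASH", "METRIC", "ANALYZER", "BUILD_TS")

def parity_diff(meta_a, meta_b):
    """Return {field: (value_a, value_b)} for every field that differs."""
    return {
        field: (meta_a.get(field), meta_b.get(field))
        for field in PARITY_FIELDS
        if meta_a.get(field) != meta_b.get(field)
    }
```

An empty dict is the evidence to attach to the drill record; a non-empty one names the field to fix before reusing the follower.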

Typical drill failures → exact fix

  • Split-brain writes during cutover: Missing freeze or idempotency. Enforce queue-first and dedupe keys. Open: Stateless KV & Queues

  • Answers differ after return-to-home: Follower index rebuilt with a different analyzer or metric. Refuse reuse until INDEX_HASH matches. Open: Retrieval Traceability

  • Webhook loops or duplicates: Third party still targets the evacuated region. Emit only from the survivor and apply dedupe_key. Open: Egress & Webhooks

  • Cold-start storm after promotion: Concurrency limits not scaled. Pre-warm and clamp. Open: Cold Start & Concurrency

  • Auth failures after route flip: Stale region-pinned keys. Rotate with dual-accept and force refresh. Open: Secrets Rotation
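
The "dual-accept" rotation in the last fix can be illustrated with HMAC signatures: during the rotation window, a verifier accepts either the current key or the previous one. This is a generic sketch under that assumption, not the Secrets Rotation page's procedure; `sign` and `verify_dual_accept` are hypothetical names.

```python
import hashlib
import hmac

def sign(key, message):
    """HMAC-SHA256 signature of message (bytes) under key (bytes), as hex."""
    return hmac.new(key, message, hashlib.sha256).hexdigest()

def verify_dual_accept(message, signature, current_key, previous_key=None):
    """Accept signatures made with the current key or, during rotation,
    the previous key. Drop previous_key once all callers have refreshed."""
    keys = [current_key]
    if previous_key is not None:
        keys.append(previous_key)
    # compare_digest avoids leaking match position through timing.
    return any(hmac.compare_digest(sign(k, message), signature) for k in keys)
```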


Verification checklist

  • Single survivor region takes all writes. No duplicates after replay.
  • INDEX_HASH parity across regions after rebuild.
  • ΔS and λ within targets on probe set before and after.
  • Edge caches purged and re-warmed.
  • SLO budget impact recorded and accepted.

Next page to write: ProblemMap/GlobalFixMap/Cloud_Serverless/observability_slo.md