Multi-Region and Failover Routing Guardrails

Keep latency low, fail over safely, and avoid split-brain state while your RAG, queues, and streams operate across regions. This page gives a compact playbook that you can paste into any load balancer or edge router policy.

When to use this page

  • Users in different geos see different answers or stale citations.
  • Region outage triggers duplicate tool calls or replayed webhooks.
  • Vector writes happen in one region while reads route to another.
  • Canary in a single region is green yet the global cutover fails.
  • DNS or anycast flips cut long-running streams.

Acceptance targets

  • p95 latency improves for local users or stays within ten percent of the single-region baseline.
  • No increase in 5xx responses at failover or failback.
  • Idempotency dedupe rate ≥ 99.9 percent on all write paths during failover windows.
  • For RAG: ΔS(question, retrieved) drift ≤ 0.03 across regions, and λ remains convergent on two seeds.
  • Vector index hash is identical in every region before users are routed: INDEX_HASH matches on probes.
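
The latency and dedupe targets above can be checked mechanically. The following is a minimal sketch, assuming you already collect latency samples in milliseconds and counts of duplicate writes seen and dropped; the function names are illustrative and not part of any WFGY tooling.

```python
from statistics import quantiles


def p95(samples_ms):
    """Return the 95th percentile of a list of latency samples (milliseconds)."""
    # quantiles(n=100) yields 99 cut points; index 94 is the 95th percentile.
    return quantiles(samples_ms, n=100, method="inclusive")[94]


def latency_within_target(baseline_ms, candidate_ms, tolerance=0.10):
    """True if the candidate p95 is no worse than baseline p95 plus the tolerance."""
    return p95(candidate_ms) <= p95(baseline_ms) * (1 + tolerance)


def dedupe_rate(duplicates_seen, duplicates_dropped):
    """Fraction of duplicate writes that the fence dropped."""
    if duplicates_seen == 0:
        return 1.0  # assumption: nothing to dedupe counts as a pass
    return duplicates_dropped / duplicates_seen
```

Compare `dedupe_rate(...)` against 0.999 per write path, and run `latency_within_target` per region against the single-region baseline.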

Fix in 60 seconds

  1. Pin by header, not only DNS. Add X-Region-Pin: {region} and X-Release: {rev}. The edge selects the nearest healthy region unless a pin is present. Synthetic probes always include the pin.

  2. Fence writes with idempotency keys. Compute sha256(source_id + revision + index_hash + partition) and drop duplicates in the KV. Keep the fence for failover_window plus 24 hours.

  3. Replicate index and blobs before traffic. Block user routing until INDEX_HASH is equal across regions and blob manifests match.

  4. Graceful streams. Sticky-route long-lived connections. Drain the old region for N seconds. Do not cut an active stream at the router.

  5. Health with contract checks. Health is green only if schema_rev, model_tag, index_hash, and secret versions match. A bare 200 is not sufficient.
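
The idempotency fence in step 2 can be sketched as follows. This is a minimal illustration, not a production store: the in-memory dictionary stands in for the shared KV, the field separator is an added assumption to avoid ambiguous concatenation, and the 15-minute `FAILOVER_WINDOW_S` is a placeholder value.

```python
import hashlib
import time

FAILOVER_WINDOW_S = 15 * 60  # placeholder; use your own failover window
FENCE_TTL_S = FAILOVER_WINDOW_S + 24 * 60 * 60  # failover_window plus 24 hours


def idempotency_key(source_id, revision, index_hash, partition):
    """sha256 over the four fields named in step 2."""
    # Assumption: join with a separator so ("ab", "c") and ("a", "bc") differ.
    raw = "\x1f".join([source_id, revision, index_hash, partition])
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


class IdempotencyFence:
    """In-memory stand-in for a KV that every region can reach."""

    def __init__(self, ttl_seconds=FENCE_TTL_S, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._seen = {}  # key -> expiry timestamp

    def first_time(self, key):
        """Return True once per key within the TTL; False for duplicates."""
        now = self._clock()
        expiry = self._seen.get(key)
        if expiry is not None and expiry > now:
            return False  # duplicate inside the fence window: drop the write
        self._seen[key] = now + self._ttl
        return True
```

A write handler would call `fence.first_time(idempotency_key(...))` and skip the side effect when it returns False. In a real deployment the check-and-set must be atomic in the backing store.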


Routing patterns that work

  • Active-active with sticky reads. Reads route to nearest healthy region and stay sticky for the session. Writes go to the region that owns the partition. Use a queue to replicate to others.

  • Active-passive for stateful writers. All writes go to primary. Secondary serves read-only. Promote only after index and blob parity plus a clean queue tail.

  • Geo-partitioned stores. Partition by tenant or namespace. Keep retrieval within the same partition and region. Cross-partition requests require a join step with explicit contracts.
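
The active-active pattern can be sketched as a small router. This is an illustration under stated assumptions: the class name, the `nearest` lookup table, and the choice to refuse writes when the owning region is unhealthy (rather than reroute them) are all hypothetical, chosen to avoid split-brain writes.

```python
class ActiveActiveRouter:
    """Sticky reads to the nearest healthy region; writes to the partition owner."""

    def __init__(self, partition_owner, healthy, nearest):
        self.partition_owner = dict(partition_owner)  # partition -> owning region
        self.healthy = set(healthy)                   # regions currently healthy
        self.nearest = dict(nearest)                  # client geo -> regions, nearest first
        self._sessions = {}                           # session id -> sticky region

    def route_read(self, session_id, client_geo):
        sticky = self._sessions.get(session_id)
        if sticky in self.healthy:
            return sticky  # keep the session where it already reads
        for region in self.nearest[client_geo]:
            if region in self.healthy:
                self._sessions[session_id] = region
                return region
        raise RuntimeError("no healthy region for reads")

    def route_write(self, partition):
        owner = self.partition_owner[partition]
        if owner not in self.healthy:
            # Assumption: fail the write instead of sending it to a non-owner.
            raise RuntimeError(f"owner {owner} of {partition} is unhealthy")
        return owner
```

Replication to the other regions is out of scope here; the pattern above delegates it to a queue.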


Typical breakpoints → exact fix

  • Wrong snippet in far region despite high similarity. Index or metric differs. Compare INDEX_HASH and analyzer settings. Rebuild and verify a small gold set before routing traffic. Open: Retrieval Playbook, Embedding ≠ Semantic

  • Duplicate webhooks during regional flip. Retry plus DNS cutover replays the same event. Use the idempotency fence and a replay TTL beyond the failover window. Open: Bootstrap Ordering

  • Split-brain memory or tool cache. Agent memory writes without version pins cross regions. Namespace by tenant, mem_rev, and region. Open: Multi-Agent Problems

  • Streams cut at failover. Router does not support drain. Pin streams with a cookie or header, then change the default path only for new connections. Open: Timeouts and Streaming Limits

  • Cost spikes after enabling global anycast. Cold starts increase in remote regions. Raise min instances or provisioned concurrency selectively on hot routes. Open: Cold Start and Concurrency
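
The stream fix above (pin existing streams, flip the default only for new connections) can be modelled in a few lines. This is a toy sketch, assuming the router can key state by a connection id carried in a cookie or header; `DrainingRouter` and its methods are hypothetical names.

```python
class DrainingRouter:
    """Flip the default region without moving streams that are already open."""

    def __init__(self, default_region):
        self.default_region = default_region
        self._pins = {}  # connection id -> region chosen when the stream opened

    def open_stream(self, conn_id):
        # New connections always take the current default and stay pinned to it.
        self._pins[conn_id] = self.default_region
        return self.default_region

    def route(self, conn_id):
        # Open streams keep their pin; unknown ids fall back to the default.
        return self._pins.get(conn_id, self.default_region)

    def close_stream(self, conn_id):
        self._pins.pop(conn_id, None)

    def flip_default(self, new_region):
        self.default_region = new_region  # affects new connections only

    def draining(self, region):
        """Connection ids still pinned to a region that is being drained."""
        return [c for c, r in self._pins.items() if r == region]
```

The old region can be taken out of rotation once `draining(old_region)` is empty or the drain timeout of N seconds expires.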


Minimal recipes you can copy

A) Region pinning contract

Request headers
- X-Region-Pin: us-east-1 | eu-west-1 | ap-southeast-1
- X-Release: r2025-08-27
- X-Index-Hash: a1b2c3
- X-Schema-Rev: sc-12
Router rule
- If pin present and region healthy → route pinned
- Else pick nearest healthy with matching {schema_rev, index_hash, model_tag}
- Sticky cookie for streaming connections
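
The router rule above can be sketched as a selection function. This is a minimal illustration, assuming the caller supplies regions ordered nearest first and an `expected` contract; `RegionState` and `select_region` are hypothetical names, and the sticky cookie for streams is not modelled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RegionState:
    name: str
    healthy: bool
    schema_rev: str
    index_hash: str
    model_tag: str


def select_region(headers, regions_by_distance, expected) -> Optional[str]:
    """Apply the pinning contract: pinned if healthy, else nearest matching region."""
    by_name = {r.name: r for r in regions_by_distance}
    pin = headers.get("X-Region-Pin")
    if pin:
        pinned = by_name.get(pin)
        # As written in the recipe, a healthy pin is routed without a contract check.
        if pinned is not None and pinned.healthy:
            return pinned.name
    for region in regions_by_distance:  # nearest first
        matches = (
            region.schema_rev == expected["schema_rev"]
            and region.index_hash == expected["index_hash"]
            and region.model_tag == expected["model_tag"]
        )
        if region.healthy and matches:
            return region.name
    return None  # no eligible region; the caller decides how to fail
```

Returning None rather than a mismatched region keeps the decision to fail closed explicit at the call site.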

B) Failover gate

Gate conditions before user traffic
- INDEX_HASH equal across regions
- Blob manifest parity
- Health probes return {schema_rev, model_tag, secrets_rev} exact match
- Synthetic RAG probes: ΔS drift ≤ 0.03 on k=10 questions
- Dedupe KV warm and reachable in both regions
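
The gate conditions above can be evaluated in one pass. This is a sketch under assumptions: each region's probe is a dictionary with the listed fields, `blob_manifest_hash` is a hypothetical field standing in for blob manifest parity, and the ΔS drift is computed elsewhere and passed in as a number.

```python
GATE_FIELDS = ("index_hash", "blob_manifest_hash", "schema_rev", "model_tag", "secrets_rev")


def failover_gate(probes, delta_s_drift, dedupe_kv_reachable, max_drift=0.03):
    """Return a list of gate failures; an empty list means traffic may flow.

    probes: region -> dict with every key in GATE_FIELDS
    delta_s_drift: drift measured on the synthetic RAG probe set
    dedupe_kv_reachable: region -> bool
    """
    failures = []
    for name in GATE_FIELDS:
        values = {probe[name] for probe in probes.values()}
        if len(values) != 1:
            failures.append(f"{name} differs across regions")
    if delta_s_drift > max_drift:
        failures.append(f"delta_s drift {delta_s_drift} exceeds {max_drift}")
    for region, reachable in sorted(dedupe_kv_reachable.items()):
        if not reachable:
            failures.append(f"dedupe KV unreachable in {region}")
    return failures
```

Keep user routing blocked while the returned list is non-empty, and log the entries so the failing condition is visible.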

C) Vector replication note

Replication
- Prefer periodic rebuild from source texts per region
- If log shipping: checkpoint offsets, verify analyzer parity
- After topology change: run gold-set eval and lock reranker order
Refs: retrieval-playbook, reindex-migration
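
This page does not specify how INDEX_HASH is derived. One possible approach, shown here purely as an assumption, is to hash the analyzer and metric settings together with the sorted chunk contents, so that a rebuild from the same source texts yields the same value in every region and an analyzer mismatch changes it.

```python
import hashlib
import json


def index_hash(chunks, analyzer, metric, length=12):
    """Hypothetical INDEX_HASH: settings plus sorted chunk ids and content digests.

    chunks: dict mapping chunk id -> source text
    """
    h = hashlib.sha256()
    settings = json.dumps({"analyzer": analyzer, "metric": metric}, sort_keys=True)
    h.update(settings.encode("utf-8"))
    for chunk_id in sorted(chunks):  # sort so insertion order does not matter
        h.update(b"\x00")
        h.update(chunk_id.encode("utf-8"))
        h.update(b"\x1f")
        h.update(hashlib.sha256(chunks[chunk_id].encode("utf-8")).digest())
    return h.hexdigest()[:length]
```

Two regions that report the same value were built from the same texts with the same analyzer and metric; this does not prove the vectors themselves are identical, so the gold-set eval is still needed.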

Observability you must add

  • Split metrics by region, release_id, and revision.
  • Health includes schema_rev, index_hash, model_tag, secrets_rev.
  • Dedupe hit rate, queue age, replay counts per region.
  • ΔS and λ on a fixed probe set, per region.
  • Stream drain success count at flips.

Verification

  • Probe set stable within acceptance targets.
  • No duplicate side effects in the failover window.
  • p95 improves for local users or remains flat.
  • Queue age does not spike at promotion.

When to escalate

  • Persistent ΔS drift across regions after rebuild. Re-embed with the same analyzer and metric, then re-run the gold set.
  • Dedupe misses during outage replay. Increase KV TTL and ensure consistent hashing across regions.
  • Health green yet errors rise. Add contract checks to the probe and block routing when versions disagree.
