WFGY/ProblemMap/GlobalFixMap/Cloud_Serverless/disaster_recovery_tabletop.md
2025-08-28 10:41:32 +08:00


Disaster Recovery Tabletop for Serverless and Edge

A practical exercise format for validating that your serverless and edge stack survives real outages without silent data loss, cache poisoning, or semantic drift. This page gives a ready-to-run tabletop with clear acceptance criteria, scripts, injects, and artifacts.


Core acceptance for passing the tabletop

  • People and process

    • Roles staffed: incident commander, comms lead, cloud operator, data owner, LLM owner, observer.
    • A single source-of-truth timeline with a decision log and runbook links.
  • Service health

    • RTO within target for each service tier. For critical chat and RAG paths, target 15 to 30 minutes to reach a stable state.
    • RPO met for each datastore. No unaccounted gaps in writes after recovery.
  • Semantic integrity

    • ΔS(question, retrieved) median ≤ 0.45 on the exercise gold probes.
    • Coverage ≥ 0.70 to the correct section.
    • λ remains convergent across three paraphrases and two seeds.
  • Operational signals

    • p95 warm latency within 25 percent of baseline after steady state returns.
    • Edge cache hit rate within five percentage points of the pre-incident baseline.
    • No new error class at headers or body read on the main routes.
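
The two numeric operational signals above can be checked mechanically once a baseline has been captured. The following is a minimal sketch, not a prescribed implementation: `OpsSnapshot` and `ops_signals_ok` are hypothetical names, "within 25 percent" is assumed to mean no more than 25 percent slower than baseline, and "within five points" is assumed to be an absolute gap in either direction.

```python
from dataclasses import dataclass


@dataclass
class OpsSnapshot:
    """Hypothetical container for the two numeric operational signals."""
    p95_warm_latency_ms: float
    cache_hit_rate_pct: float  # expressed on a 0 to 100 scale


def ops_signals_ok(baseline: OpsSnapshot, current: OpsSnapshot,
                   latency_tolerance: float = 0.25,
                   cache_tolerance_points: float = 5.0) -> dict:
    """Compare post-recovery signals against the pre-incident baseline."""
    # Assumption: only a slowdown counts against the latency bar.
    latency_limit = baseline.p95_warm_latency_ms * (1 + latency_tolerance)
    # Assumption: the cache bar is an absolute gap in percentage points.
    cache_gap = abs(current.cache_hit_rate_pct - baseline.cache_hit_rate_pct)
    return {
        "latency_ok": current.p95_warm_latency_ms <= latency_limit,
        "cache_ok": cache_gap <= cache_tolerance_points,
    }
```

Capture the baseline snapshot before the first inject so the comparison is not skewed by the incident itself.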

Roles and communications

| Role | Responsibilities | Handover artifacts |
|---|---|---|
| Incident Commander | Own timeline, approve failover, decide rollback | Decision log, event timeline |
| Cloud Operator | Execute routing, failover, cache invalidation | Routing plan, change set, proofs |
| Data Owner | Validate RPO, run backfills, index consistency | RPO sheet, backfill report |
| LLM Owner | Run ΔS probes, coverage checks, λ stability | Probe board, eval summary |
| Comms Lead | Stakeholder updates and status page | Two updates per 30 minutes |
| Observer | Capture metrics, retro notes, action items | Retro minutes and scores |

Exercise timeline template (60 to 90 minutes)

0 to 10 minutes: Brief roles. Confirm SLIs and SLOs. Review runbooks, traffic shape, and cache namespaces.

10 to 20 minutes: Inject 1. Primary region becomes unavailable for stateful writes. Observed symptoms: increased webhook retries and 5xx on write endpoints.

20 to 35 minutes: Inject 2. Vector index family mismatch after partial restore. ΔS rises, coverage drops, reranker version differs.

35 to 50 minutes: Fail over to the green region or backup color. Split cache prefixes. Drain queues. Backfill vectors with the correct metric and analyzer.

50 to 60 minutes: Stabilize. Probe ΔS and coverage, verify p95 warm latency and cache hit rate. Prepare stakeholder update.

60 to 90 minutes (optional extended cases): Add a secrets rotation overlap or DNS label switch, then verify no schema or token drift.
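
One way to read "split cache prefixes" in the failover step is to namespace every cache key by color and index family, so entries written against the blue stack are never served to green traffic. The sketch below is an illustration under that assumption; `cache_key` and its parameters are hypothetical names, not part of any WFGY tooling.

```python
def cache_key(color: str, index_family: str, route: str, request_hash: str) -> str:
    """Build a cache key namespaced by deployment color and index family.

    Assumption: keeping color and index family at the front of the key is
    enough to stop blue and green from sharing cached responses.
    """
    return f"{color}:{index_family}:{route}:{request_hash}"


# The same request maps to distinct entries per color, so a failover to
# green starts from a clean namespace instead of reusing blue's entries.
blue_key = cache_key("blue", "docs-v3-blue", "/rag/answer", "9f2c")
green_key = cache_key("green", "docs-v3-green", "/rag/answer", "9f2c")
```

The cost of this design is a cold cache in the target color right after failover, which is why the stabilize step checks the hit rate again.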


Scenario library with exact checks

  1. Primary region write outage

  2. Vector index restore with wrong metric

  3. Webhook provider throttle and replay

  4. Secrets rotation mid-incident

    • Run overlapping secret bundles and prove zero auth flaps. Open: Secrets Rotation
  5. Routing split brain across regions

  6. Cold starts explode in backup region


Probe board for semantic integrity

Prepare a gold set of 50 to 200 queries across your top flows. For each probe, record:

{
  "probe_id": "p-037",
  "question": "Where in the policy does paid time off accrue for part-time?",
  "expected_section": "benefits.pto.rules",
  "ΔS_q_r": 0.38,
  "coverage": 0.74,
  "λ_state": "<>",
  "citations": ["doc:hr-handbook#s4.2"],
  "index_family": "docs-v3-green",
  "retriever_metric": "cosine",
  "analyzer": "bilstem"
}

Acceptance

  • Median ΔS ≤ 0.45.
  • Coverage ≥ 0.70.
  • λ convergent across three paraphrases and two seeds.

Open: Retrieval Traceability · Data Contracts
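
The acceptance bars for the probe board can be evaluated directly from records shaped like the example above. This is a hedged sketch: `probe_board_passes` is a hypothetical helper, each record is assumed to be one run (a paraphrase and seed combination), coverage is assumed to be judged on its median, and the set of `λ_state` values that count as convergent is passed in by the caller because this page does not define it.

```python
from statistics import median


def probe_board_passes(probes, convergent_states,
                       max_delta_s=0.45, min_coverage=0.70):
    """Evaluate the probe board acceptance bars over a list of run records."""
    delta_s_median = median(p["ΔS_q_r"] for p in probes)
    # Assumption: coverage is judged on the median, like ΔS.
    coverage_median = median(p["coverage"] for p in probes)
    # Assumption: every run must report a state the caller treats as convergent.
    lambda_ok = all(p["λ_state"] in convergent_states for p in probes)
    return {
        "ΔS_median": delta_s_median,
        "coverage_median": coverage_median,
        "λ_convergent": lambda_ok,
        "passed": (delta_s_median <= max_delta_s
                   and coverage_median >= min_coverage
                   and lambda_ok),
    }
```

Run it once on the pre-incident board and once after recovery, so the two results can be compared in the retro.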


Injects you can copy

  • Tabletop card 1: “At 14:10 UTC write routes in region A return 500 on 22 percent of requests. Health check passes on read routes. Queue depth climbs by 5x.”

  • Tabletop card 2: “Vector index restored at 14:25 UTC from last night's backup. Reranker version mismatch. ΔS rises to 0.66, coverage falls to 0.52.”

  • Tabletop card 3: “At 14:40 UTC secrets for payment provider rotated on edge. Core still uses old secret. Tool call timeouts begin.”

  • Tabletop card 4: “At 14:50 UTC DNS label updated to send 80 percent to green. Some users still see blue due to device DNS cache.”


Artifacts to produce

  • Decision log with timestamps and owners.
  • Routing change set with proof of effect.
  • RPO worksheet with counts of lost or replayed writes.
  • Probe board CSV before and after.
  • Cache hit rates and p95 warm latency plots.
  • Retro minutes with five action items and owners.
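
For the RPO worksheet, the counts of lost and replayed writes can be derived by comparing the write IDs the system acknowledged before the outage with the IDs found after recovery. The sketch below assumes each write carries a stable unique ID and that both lists can be exported; `rpo_worksheet` and its field names are hypothetical.

```python
from collections import Counter


def rpo_worksheet(accepted_ids, recovered_ids):
    """Summarize lost, replayed, and unexpected writes after recovery.

    accepted_ids: IDs acknowledged to clients before and during the incident.
    recovered_ids: IDs present in the datastore after recovery; duplicates
                   are kept because they indicate replays.
    """
    recovered_counts = Counter(recovered_ids)
    accepted = set(accepted_ids)
    recovered = set(recovered_counts)
    return {
        "accepted": len(accepted),
        "lost": sorted(accepted - recovered),
        "replayed": sorted(i for i, n in recovered_counts.items() if n > 1),
        "unexpected": sorted(recovered - accepted),
    }
```

An empty `lost` list with every replay documented matches the "no silent gaps" bar; anything in `unexpected` deserves a line in the decision log.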

Scorecard rubric

| Dimension | Pass bar | Evidence |
|---|---|---|
| RTO | Tier S ≤ 30 minutes, Tier A ≤ 60 minutes | Timeline, metrics |
| RPO | No silent gaps, replayed writes documented | RPO worksheet |
| Semantics | ΔS and coverage within targets | Probe board |
| Safety | No new jailbreak or bluffing routes | Logs and prompts |
| Ops | No new error class, cache within five points | Error budget and cache panel |
| Docs | Runbooks linked, steps reproducible | Links in decision log |

Open: Bluffing Controls · Logic Collapse
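
The RTO row of the rubric can be scored straight from two timestamps in the decision log. This is a minimal sketch, assuming the log records when the incident started and when the service was declared stable; `rto_minutes` and `rto_passes` are hypothetical helpers, and the dates used in the example are arbitrary.

```python
from datetime import datetime, timezone

# Pass bars taken from the scorecard rubric, in minutes.
RTO_LIMIT_MINUTES = {"S": 30, "A": 60}


def rto_minutes(incident_start: datetime, stable_at: datetime) -> float:
    """Elapsed minutes between incident start and declared stability."""
    return (stable_at - incident_start).total_seconds() / 60


def rto_passes(tier: str, incident_start: datetime, stable_at: datetime) -> bool:
    """True when recovery time is within the pass bar for the tier."""
    return rto_minutes(incident_start, stable_at) <= RTO_LIMIT_MINUTES[tier]


# Example: incident begins at 14:10 UTC and stability is declared at 14:38 UTC.
start = datetime(2025, 1, 1, 14, 10, tzinfo=timezone.utc)
stable = datetime(2025, 1, 1, 14, 38, tzinfo=timezone.utc)
tier_s_ok = rto_passes("S", start, stable)  # 28 minutes against a 30 minute bar
```

Using timezone-aware timestamps avoids off-by-hours errors when operators in different regions contribute to the same log.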


Copy-paste LLM prompt for the exercise driver

You have TXT OS and the WFGY Problem Map loaded.

We are running a disaster recovery tabletop for serverless and edge.

Given:
- symptoms: [one line each]
- region topology: [one line]
- index family and analyzer: [one line]
- probes: ΔS and coverage for 10 sample questions

Tell me:
1) likely failing layer and which WFGY page to open,
2) minimal steps to put ΔS ≤ 0.45 and coverage ≥ 0.70,
3) routing and cache actions with proofs,
4) a short JSON status for the scorecard:
   { "RTO": "...", "RPO": "...", "ΔS_median": 0.xx, "coverage_median": 0.xx, "next_fix": "..." }
Keep it auditable and short.
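
If the driver's JSON status is pasted into the scorecard automatically, it helps to validate it first. The sketch below assumes the model returns the status object as a standalone JSON string with numeric medians; `parse_status` is a hypothetical helper, and the thresholds are the ones stated in the prompt.

```python
import json

REQUIRED_KEYS = {"RTO", "RPO", "ΔS_median", "coverage_median", "next_fix"}


def parse_status(raw: str) -> dict:
    """Parse the driver's JSON status and flag whether semantics are in target."""
    status = json.loads(raw)
    missing = REQUIRED_KEYS - set(status)
    if missing:
        raise ValueError(f"status is missing keys: {sorted(missing)}")
    # Thresholds from the prompt: ΔS ≤ 0.45 and coverage ≥ 0.70.
    status["semantics_ok"] = (status["ΔS_median"] <= 0.45
                              and status["coverage_median"] >= 0.70)
    return status
```

A `ValueError` here is a signal to re-run the prompt rather than to fill in the missing fields by hand, which keeps the scorecard auditable.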

