mirror of
https://github.com/onestardao/WFGY.git
synced 2026-04-28 19:50:17 +00:00
Rollback and Fast Recovery — OpsDeploy Guardrails
Bring production back to a good state fast. This page gives you the exact levers and a 60-second checklist to reverse a bad rollout, recover service, and preserve data integrity.
Open these first
- Canary staging: staged_rollout_canary.md
- Blue green cutover: blue_green_switchovers.md
- Version fences: version_pinning_and_model_lock.md
- Index swap: vector_index_build_and_swap.md
- Feature flags: feature_flags_safe_launch.md
- Rate and retry: rate_limit_backpressure.md, retry_backoff.md
- Cache and dedupe: cache_warmup_invalidation.md, idempotency_dedupe.md
- First-run traps: bootstrap-ordering.md, deployment-deadlock.md, predeploy-collapse.md
- Live ops: live_monitoring_rag.md, debug_playbook.md
When to pull this page
- ΔS drift p95 over 0.15 or coverage under 0.60 on the canary window.
- λ flips rise above 0.20 or tool loops are detected.
- 5xx over 1 percent or 429 storms that do not subside with backoff.
- p99 latency doubles, the cache is poisoned, or answers are mixed across versions.
- Suspicion of duplicate side effects or index mismatch.
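The numeric triggers above can be checked automatically on each canary window. The following is a minimal sketch, not the project's implementation: the `CanaryWindow` fields and the `rollback_triggers` function are hypothetical names, and it assumes the λ flip figure is a rate between 0 and 1. The qualitative triggers (429 storms, cache poisoning, duplicate side effects) still need a human or a dedicated detector.

```python
from dataclasses import dataclass


@dataclass
class CanaryWindow:
    """Metrics sampled over one canary window (field names are assumptions)."""
    delta_s_p95: float            # ΔS drift, p95
    coverage: float               # retrieval coverage, 0..1
    lambda_flip_rate: float       # share of requests where λ flipped, 0..1
    error_5xx_rate: float         # share of responses that were 5xx, 0..1
    p99_latency_ms: float
    baseline_p99_latency_ms: float
    tool_loop_detected: bool = False


def rollback_triggers(w: CanaryWindow) -> list[str]:
    """Return the name of every trigger that fired; an empty list means none."""
    fired = []
    if w.delta_s_p95 > 0.15:
        fired.append("delta_s_drift")
    if w.coverage < 0.60:
        fired.append("low_coverage")
    if w.lambda_flip_rate > 0.20 or w.tool_loop_detected:
        fired.append("lambda_flips_or_tool_loop")
    if w.error_5xx_rate > 0.01:
        fired.append("5xx_rate")
    if w.p99_latency_ms >= 2 * w.baseline_p99_latency_ms:
        fired.append("p99_latency_doubled")
    return fired
```

Returning the list of fired triggers rather than a single boolean keeps the evidence for the postmortem.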
Recovery exit targets
- User visible error rate back under 0.5 percent within ten minutes.
- ΔS and coverage match the pinned baseline window.
- p95 latency no more than 15 percent above baseline.
- Duplicate side effects equal zero during and after rollback.
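One way to make the exit targets testable is a per-target pass/fail report. This is a sketch under stated assumptions: `Baseline`, `exit_targets_met`, and the `match_tol` tolerance are hypothetical, since the page says ΔS and coverage should "match" the baseline without giving a tolerance. The ten-minute deadline is left to the caller.

```python
from dataclasses import dataclass


@dataclass
class Baseline:
    """Values from the pinned baseline window (names are assumptions)."""
    delta_s: float
    coverage: float
    p95_latency_ms: float


def exit_targets_met(error_rate: float, delta_s: float, coverage: float,
                     p95_latency_ms: float, duplicate_side_effects: int,
                     baseline: Baseline, match_tol: float = 0.02) -> dict[str, bool]:
    """Report each recovery exit target separately so failures are visible."""
    return {
        "error_rate": error_rate < 0.005,  # under 0.5 percent
        # match_tol is an assumed tolerance for "matches the baseline"
        "delta_s": abs(delta_s - baseline.delta_s) <= match_tol,
        "coverage": abs(coverage - baseline.coverage) <= match_tol,
        "p95_latency": p95_latency_ms <= 1.15 * baseline.p95_latency_ms,
        "no_duplicates": duplicate_side_effects == 0,
    }
```

Recovery is complete only when `all(report.values())` is true.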
60-second rollback checklist
- Capture state: Dump BUILD_ID, GIT_SHA, MODEL_VER, PROMPT_VER, INDEX_HASH, RERANK_CONF, TOK_VER, ANALYZER_CONF, sample ΔS and coverage. Keep for the postmortem.
- Freeze writes: Pause non-idempotent writers and queue consumers. Confirm fences are active. See idempotency_dedupe.md.
- Pick the fastest lever:
  - Feature flag kill switch
  - Blue to Green pointer back to Blue
  - Index alias from docs_vB back to docs_vA
  - Version unpin to the last known good set
- Flip one pointer: Single operation. DNS or Ingress or alias or service selector. See the blue green and index swap pages.
- Rotate cache namespace: Keys must include INDEX_HASH and PROMPT_VER. Do not delete in place. See cache page.
- Throttling on: Global token budget and circuit breaker while the system cools. See rate and retry pages.
- Watch for green: Hold ten to fifteen minutes. Confirm targets are met. Then unfreeze writers.
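Two steps in the checklist, capturing state and rotating the cache namespace, lend themselves to a small helper. The sketch below assumes the pins are exposed as environment variables with the names listed above. `capture_state`, `cache_namespace`, and `cache_key` are hypothetical helpers, and the key layout is only one possible scheme.

```python
import hashlib
import json
import os
import time

PIN_KEYS = ["BUILD_ID", "GIT_SHA", "MODEL_VER", "PROMPT_VER", "INDEX_HASH",
            "RERANK_CONF", "TOK_VER", "ANALYZER_CONF"]


def capture_state(env=os.environ, delta_s=None, coverage=None) -> dict:
    """Snapshot the version pins plus sampled ΔS and coverage for the postmortem."""
    return {
        "captured_at": time.time(),
        "pins": {k: env.get(k, "<unset>") for k in PIN_KEYS},
        "sample": {"delta_s": delta_s, "coverage": coverage},
    }


def cache_namespace(index_hash: str, prompt_ver: str) -> str:
    """Namespace derived from INDEX_HASH and PROMPT_VER (assumed layout)."""
    return f"idx={index_hash}:prompt={prompt_ver}"


def cache_key(index_hash: str, prompt_ver: str, query: str) -> str:
    """Build a key that changes whenever the index or the prompt pack changes."""
    digest = hashlib.sha256(query.encode("utf-8")).hexdigest()[:16]
    return f"{cache_namespace(index_hash, prompt_ver)}:{digest}"


if __name__ == "__main__":
    # Write the snapshot somewhere durable before touching anything else.
    print(json.dumps(capture_state(delta_s=0.12, coverage=0.71), indent=2))
```

Because the namespace is part of every key, rolling back the index or the prompt pack switches readers to a different key space without deleting anything in place. Entries from the bad version can expire on their own.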
Rollback decision tree
- Quality regression: Revert prompt pack or model by flag, then warm cache and verify.
- Retriever wrong or mixed answers: Alias index back to previous docs_vA. Keys rotate by INDEX_HASH.
- Hot cost or tail latency: Degrade chain length, return cite only when deadline is tight, enforce global caps.
- Duplicate effects: Block writes, enable fences, reconcile receipts, then reopen.
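The decision tree can also live in code as a lookup table, so on-call tooling prints the same steps as this page. This is an illustrative sketch: `Symptom`, `PLAYBOOK`, and `rollback_steps` are hypothetical names, and the step strings simply restate the branches above.

```python
from enum import Enum


class Symptom(Enum):
    QUALITY_REGRESSION = "quality_regression"
    RETRIEVER_WRONG_OR_MIXED = "retriever_wrong_or_mixed_answers"
    HOT_COST_OR_TAIL_LATENCY = "hot_cost_or_tail_latency"
    DUPLICATE_EFFECTS = "duplicate_effects"


# Ordered steps per symptom, copied from the decision tree.
PLAYBOOK = {
    Symptom.QUALITY_REGRESSION: [
        "revert prompt pack or model by flag", "warm cache", "verify"],
    Symptom.RETRIEVER_WRONG_OR_MIXED: [
        "alias index back to docs_vA", "rotate cache keys by INDEX_HASH"],
    Symptom.HOT_COST_OR_TAIL_LATENCY: [
        "degrade chain length", "return cite only when deadline is tight",
        "enforce global caps"],
    Symptom.DUPLICATE_EFFECTS: [
        "block writes", "enable fences", "reconcile receipts", "reopen"],
}


def rollback_steps(symptom: Symptom) -> list[str]:
    """Return a copy of the ordered steps for one symptom."""
    return list(PLAYBOOK[symptom])
```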
Paste-ready snippets
Kubernetes service selector flip
apiVersion: v1
kind: Service
metadata: { name: wfgy-live }
spec:
  selector: { app: wfgy-blue } # flip back to blue
  ports: [ { port: 80, targetPort: http } ]
Vector index alias rollback
vec alias update docs_live --to docs_vA
Feature flag kill switch
# opsdeploy/flags/off.yml
flags:
  prompt_pack_vNplus1:
    traffic: { baseline_weight: 1.0, canary_weight: 0.0 }
    abort_rules:
      hard:
        - kind: "global_off"
Fast recovery patterns
- Degrade mode: Skip rerank, lower k, or return cite only with links.
- Sticky routing: Keep users pinned to the same arm while the window stabilizes.
- Single-flight: Collapse identical work on misses to avoid stampede.
- Region by region: Roll back one region at a time with audit hashes.
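Of the patterns above, single-flight is the easiest to get subtly wrong. Here is a minimal thread-based sketch: the `SingleFlight` class is hypothetical, and a real deployment might need a cross-process or async variant instead. The first caller for a key does the work, and concurrent callers for the same key wait for and share its result.

```python
import threading
from concurrent.futures import Future


class SingleFlight:
    """Collapse concurrent calls for the same key into one execution."""

    def __init__(self):
        self._lock = threading.Lock()
        self._inflight: dict[str, Future] = {}

    def do(self, key: str, fn):
        with self._lock:
            fut = self._inflight.get(key)
            leader = fut is None
            if leader:
                fut = Future()
                self._inflight[key] = fut
        if not leader:
            # Follower: block until the leader publishes a result or an error.
            return fut.result()
        try:
            fut.set_result(fn())
        except BaseException as exc:
            # Propagate the failure to followers so nobody waits forever.
            fut.set_exception(exc)
        finally:
            with self._lock:
                self._inflight.pop(key, None)
        return fut.result()
```

The entry is removed once the call finishes, so this collapses only in-flight duplicates and is not a cache. Use the cache key (including its namespace) as `key`, so requests against different index versions are never merged.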
Post-rollback audit
- Verify the pins you expect are actually in logs for every request.
- Confirm cache namespace rotation took effect.
- Validate ΔS, coverage, λ on the warmed gold set.
- Reconcile any side effects created during the incident.
- Open the debug playbook and file the root cause link.
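The first audit item, checking that every request carries the expected pins, can be scripted against structured logs. The sketch below assumes one JSON object per log line with `request_id` and `pins` fields. That log shape and the `audit_pins` function are assumptions for illustration, not a format this page defines.

```python
import json


def audit_pins(log_lines, expected: dict) -> list[dict]:
    """Return one finding per request whose logged pins differ from `expected`."""
    findings = []
    for line_no, line in enumerate(log_lines, start=1):
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        pins = record.get("pins", {})
        # A missing pin counts as a mismatch (logged value is None).
        mismatch = {k: pins.get(k) for k, v in expected.items() if pins.get(k) != v}
        if mismatch:
            findings.append({
                "line": line_no,
                "request_id": record.get("request_id"),
                "mismatch": mismatch,
            })
    return findings
```

An empty result means every logged request ran on the expected pins. Any finding points to traffic still served by the rolled-back version or to a request that did not log its pins.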
Observability to pin
- Version pins and INDEX_HASH on every request.
- ΔS(question,retrieved) and ΔS(retrieved,anchor).
- Coverage and λ states.
- Error rates, 429, queue waits, breaker state.
- Side effect receipts and dedupe decisions.
Common pitfalls
- Rolling back traffic without rotating cache namespace.
- Two writers active during the flip.
- Client only flags that drift from server reality.
- Forgetting to unfreeze jobs after stability returns.
- No captured evidence for the postmortem.