
Failover & recovery — deterministic recovery steps

Purpose: deterministic operator steps to fail over or recover critical components (vectorstore, retriever, generator, indexer, controller). The aim is to minimize data loss and return to a safe state quickly.


Basic principles

  1. Fail fast to a safe mode — prefer read-only answers or cached responses over uncontrolled writes or risky LLM calls (see the example after this list).
  2. Preserve evidence — do not truncate logs or delete index segments until the investigation is complete.
  3. Prefer scoped recovery — restart a single pod/shard before taking cluster-wide action.
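
As a concrete example of principle 1, a single feature flag can flip the API into cached/read-only answers before any invasive recovery step. READ_ONLY_MODE and the rag-api deployment name are illustrative assumptions, not flags this repo defines:

    # force the serving path to cached / read-only answers (flag name is illustrative)
    kubectl -n $NS set env deploy/rag-api READ_ONLY_MODE=true
    kubectl -n $NS rollout status deploy/rag-api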

Scenario A: Vectorstore shard down / index corrupt

Symptoms

  • Retriever returns empty sets or inconsistent scores for golden queries.
  • Vectorstore pod logs show IO / index errors.

Steps

  1. Mark the shard unhealthy in the service registry (so the retriever avoids it; see the sketch after this list).

  2. If a replica exists, route traffic to the other replica.

  3. Attempt a graceful re-open:

    kubectl -n $NS exec deploy/vectorstore -- /bin/sh -c "ctl index reopen shard-5"

  4. If the reopen fails, restore from the latest snapshot (S3) to a new shard:

    • Create a new PV and restore the snapshot onto it.
    • Start a fresh pod pointed at the restored PV.
  5. Re-run the small validation suite (10–50 golden qids) before reintroducing the shard.
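
A minimal sketch of steps 1, 4, and 5, assuming a Consul-style registry, snapshots under s3://backups/vectorstore/, and helper manifests/scripts (restore-pvc.yaml, restore-job.yaml, restored-shard-pod.yaml, run_golden_suite.sh) that are illustrative, not part of this repo:

    # step 1: take the shard out of rotation (Consul maintenance mode, if Consul is your registry)
    curl -X PUT "http://consul:8500/v1/agent/service/maintenance/vectorstore-shard-5?enable=true"

    # step 4: find the latest snapshot, restore it onto a fresh PVC via a one-off job,
    # then start a new pod pointed at the restored volume
    LATEST=$(aws s3 ls s3://backups/vectorstore/shard-5/ | awk '{print $2}' | sort | tail -n 1)
    export SNAPSHOT="s3://backups/vectorstore/shard-5/${LATEST%/}"
    kubectl -n $NS apply -f restore-pvc.yaml
    envsubst < restore-job.yaml | kubectl -n $NS apply -f -   # job syncs $SNAPSHOT onto the new PVC
    kubectl -n $NS wait --for=condition=complete job/restore-shard-5 --timeout=600s
    kubectl -n $NS apply -f restored-shard-pod.yaml
    kubectl -n $NS wait --for=condition=Ready pod/vectorstore-shard-5-restored --timeout=300s

    # step 5: validate against golden queries before re-registering the shard
    ./run_golden_suite.sh --target vectorstore-shard-5-restored --qids golden_small.txt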

Post-recovery

  • Re-index missing docs if necessary; track reindex job progress.
  • Add a postmortem entry and schedule a permanent fix.

Scenario B: Generator (LLM) provider outage

Symptoms

  • LLM errors (5xx), rate-limit responses, or auth failures.

Steps

  1. Switch to the backup LLM provider (if configured) via a config flag:

    # toggle provider in config map or feature flag
    kubectl -n $NS set env deploy/rag-api PROVIDER=backup-provider
    
  2. If no backup exists, enable the local fallback:

    • Return cached answers for known qids.
    • Return a safe refusal for unknown qids.
  3. Throttle traffic and divert long-running requests to a worker queue for later processing.

  4. Once the provider is restored, slowly ramp traffic back and compare CHR/precision to the baseline (see the sketch below).
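
A minimal ramp sketch for step 4, assuming the primary/backup traffic split is controlled by a PRIMARY_WEIGHT env var and that a compare_metrics.sh script checks CHR/precision against a stored baseline; both names are illustrative:

    # step primary traffic back up in stages, gating each stage on CHR/precision
    for W in 10 25 50 100; do
      kubectl -n $NS set env deploy/rag-api PRIMARY_WEIGHT="$W"
      kubectl -n $NS rollout status deploy/rag-api
      sleep 300   # let metrics accumulate at this weight
      ./compare_metrics.sh --window 5m --baseline baseline.json \
        || { echo "regression at weight $W, holding ramp"; exit 1; }
    done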


Scenario C: Bootstrap deadlock at startup

Symptoms

  • Pods stuck in CrashLoopBackOff or never reaching Ready; logs show a circular dependency or a missing migration.

Steps

  1. Inspect init containers & migration jobs:

    kubectl -n $NS get jobs
    kubectl -n $NS logs job/migrations
    
  2. Run migrations manually in a controlled pod:

    kubectl -n $NS run --rm -it migration-runner --image=myimage -- bash -c "python migrate.py"
    
  3. Ensure the controller component (if any) is up before starting the retriever/generator. Use Helm hooks or manual kubectl apply ordering.

  4. If necessary, scale everything down and start components one by one (see the sketch below).
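
A minimal ordered-startup sketch for steps 3–4, assuming the deployments are named controller, retriever, and generator; names and replica counts are illustrative:

    # scale everything to zero, then bring components up strictly in dependency order
    for D in generator retriever controller; do kubectl -n $NS scale deploy/"$D" --replicas=0; done
    for D in controller retriever generator; do
      kubectl -n $NS scale deploy/"$D" --replicas=1
      kubectl -n $NS rollout status deploy/"$D" --timeout=180s \
        || { echo "$D failed to become ready, stopping"; exit 1; }
    done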


Safety nets & best practices

  • Keep automated daily snapshots of the vectorstore; retain 7–14 days (see the sketch after this list).
  • Maintain a tested restore playbook and run a “mini-cluster” restore test monthly.
  • Automate warm failover for LLMs: pre-warm API tokens for backup providers.
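
A minimal daily-snapshot sketch for the first bullet. `ctl snapshot create` is assumed by analogy with the `ctl index reopen` command above, and the bucket layout plus an aws CLI inside the pod are illustrative assumptions; run it from cron or a Kubernetes CronJob:

    # snapshot the vectorstore, ship it to S3, and prune all but the 14 newest snapshots
    DATE=$(date +%F)
    kubectl -n $NS exec deploy/vectorstore -- /bin/sh -c \
      "ctl snapshot create --out /snapshots/$DATE && aws s3 sync /snapshots/$DATE s3://backups/vectorstore/$DATE/"
    aws s3 ls s3://backups/vectorstore/ | awk '{print $2}' | sort | head -n -14 \
      | while read -r OLD; do aws s3 rm --recursive "s3://backups/vectorstore/$OLD"; done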

Post-incident

  • Triage the root cause and assign fixes.
  • Add an automated test that would have caught this incident.
  • Update runbooks and notify stakeholders.


🔗 Quick-Start Downloads (60 sec)

| Tool | Link | 3-Step Setup |
|------|------|--------------|
| WFGY 1.0 PDF | Engine Paper | 1 Download · 2 Upload to your LLM · 3 Ask “Answer using WFGY + <your question>” |
| TXT OS (plain-text OS) | TXTOS.txt | 1 Download · 2 Paste into any LLM chat · 3 Type “hello world” — OS boots instantly |

Explore More

| Module | Description | Link |
|--------|-------------|------|
| WFGY Core | Canonical framework entry point | View |
| Problem Map | Diagnostic map and navigation hub | View |
| Tension Universe Experiments | MVP experiment field | View |
| Recognition | Where WFGY is referenced or adopted | View |
| AI Guide | Anti-hallucination reading protocol for tools | View |

If this repository helps, starring it improves discovery for other builders.