Ops — Deploy & Runbook (Problem Map)
Purpose: this folder contains operational runbooks, checklists, and playbooks for deploying, observing, debugging, and failing over RAG pipelines and their surrounding infra.
Target audience: SREs and engineers responsible for production RAG services. Beginner-friendly: each section has a checklist and exact commands.
Quick nav
- Deployment checklist → deployment_checklist.md
- Live monitoring & alerts (RAG) → live_monitoring_rag.md
- Debug playbook (step-by-step) → debug_playbook.md
- Failover & recovery → failover_and_recovery.md
Scope & assumptions
- Production topology: API gateway → RAG service (retriever + generator + guard) → Vector DB + Source storage.
- Infra: Kubernetes (Helm) or docker-compose for small envs. Prometheus + Grafana for metrics; centralized logs (ELK/Fluentd/Vector).
- Safety-first: ops steps favor read-only diagnostic commands until root cause is clear.
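One way to enforce the safety-first rule during an incident is a thin wrapper that only passes read-only verbs to kubectl. This is a minimal sketch, not part of these runbooks: `is_read_only_verb` and `kubectl_ro` are hypothetical helper names, and the allowlist is an assumption you should adjust to your own policy.

```shell
# Hypothetical helper: succeed only for kubectl verbs that do not modify cluster state.
# The allowlist is an assumption; extend or shrink it to match your policy.
is_read_only_verb() {
  case "$1" in
    get|describe|logs|top|explain|events|version|api-resources) return 0 ;;
    *) return 1 ;;
  esac
}

# Hypothetical wrapper: forward read-only commands to kubectl, refuse everything else.
kubectl_ro() {
  if is_read_only_verb "${1:-}"; then
    kubectl "$@"
  else
    echo "refusing non-read-only verb: ${1:-<none>}" >&2
    return 2
  fi
}

# Usage (requires cluster access):
#   kubectl_ro get pods -n "$NS"       # forwarded to kubectl
#   kubectl_ro delete pod foo -n "$NS" # refused with exit status 2
```

Once root cause is clear, switch back to plain kubectl for the mitigation step.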
How to use these runbooks
- Read the deployment checklist before you deploy.
- After deploying, use live monitoring to confirm that SLOs are being met.
- If an incident occurs, follow debug_playbook.md (triage → isolate → mitigate → fix).
- If the controller/broker or core services fail, follow failover_and_recovery.md.
Quick operator checks (first 60s)
- Is service responding?
  curl -fsS http://$SERVICE/healthz || true
- Are pods healthy?
  kubectl get pods -n $NS
- Any obvious error spikes in logs (last 1 minute):
  kubectl logs -n $NS -l app=$APP --since=1m | tail -n 200
- Check key metrics in Prometheus (latency/p95, error rate, retriever QPS).
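The checks above can be wrapped in small shell helpers so the first minute produces numbers instead of raw output. This is a sketch under stated assumptions: the function names are hypothetical, `PROM_URL` is an assumed variable pointing at your Prometheus server, the error pattern is a guess at your log format, and the pod check looks only at the STATUS column (a Running pod that is not Ready is not flagged).

```shell
# First-60s triage helpers. SERVICE, NS, APP and PROM_URL are assumed to be
# exported by the operator; none of these function names come from the runbooks.

# Probe the health endpoint; a non-zero exit status means the probe failed.
check_health() {
  curl -fsS --max-time 5 "http://${SERVICE}/healthz"
}

# Read `kubectl get pods --no-headers` output on stdin and count pods whose
# STATUS column (3rd field) is neither Running nor Completed.
count_unhealthy_pods() {
  awk '$3 != "Running" && $3 != "Completed" { n++ } END { print n + 0 }'
}

# Read log lines on stdin and count those that look like errors.
# The pattern is an assumption; adjust it to your log format.
count_error_lines() {
  grep -ciE 'error|exception|traceback' || true
}

# Run an instant PromQL query through the Prometheus HTTP API.
prom_query() {
  curl -fsS --max-time 5 -G --data-urlencode "query=$1" "${PROM_URL}/api/v1/query"
}

# Usage (requires cluster and Prometheus access):
#   check_health || echo "health probe failed"
#   kubectl get pods -n "$NS" --no-headers | count_unhealthy_pods
#   kubectl logs -n "$NS" -l app="$APP" --since=1m | count_error_lines
#   prom_query 'up'
```

For latency or error-rate queries, pass your own PromQL expression to `prom_query`; metric names depend on how the RAG service is instrumented.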
Where patterns & examples map here
- If retrieval is bad → see ProblemMap/retrieval-collapse.md and examples for vector-store repair.
- If bootstrap ordering fails on start → see ProblemMap/bootstrap-ordering.md & pattern_bootstrap_deadlock.md.
- For memory/state issues → ProblemMap/patterns/pattern_memory_desync.md.
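If operators prefer a lookup over scanning this list, the mapping can be expressed as a small shell function. This is only an illustrative sketch: `runbook_for` and its symptom keywords are hypothetical, and the fallback to debug_playbook.md is an assumption based on the incident guidance earlier in this README.

```shell
# Hypothetical lookup: map a symptom keyword to the document that covers it.
# Keywords are illustrative; the paths are the ones listed in this README.
runbook_for() {
  case "$1" in
    retrieval)    echo "ProblemMap/retrieval-collapse.md" ;;
    bootstrap)    echo "ProblemMap/bootstrap-ordering.md" ;;
    memory|state) echo "ProblemMap/patterns/pattern_memory_desync.md" ;;
    *)            echo "debug_playbook.md" ;;  # assumed default: general triage
  esac
}

# Usage:
#   runbook_for retrieval
```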