Ops — Deploy & Runbook (Problem Map)
Purpose: this folder contains operational runbooks, checklists and playbooks for deploying, observing, debugging and failing-over RAG pipelines and their surrounding infra.
Target audience: SREs and engineers responsible for production RAG services. Beginner-friendly: each section has a checklist and exact commands.
Quick nav
- Deployment checklist → deployment_checklist.md
- Live monitoring & alerts (RAG) → live_monitoring_rag.md
- Debug playbook (step-by-step) → debug_playbook.md
- Failover & recovery → failover_and_recovery.md
Scope & assumptions
- Production topology: API gateway → RAG service (retriever + generator + guard) → Vector DB + Source storage.
- Infra: Kubernetes (Helm) or docker-compose for small envs. Prometheus + Grafana for metrics; centralized logs (ELK/Fluentd/Vector).
- Safety-first: ops steps favor read-only diagnostic commands until the root cause is clear.
How to use these runbooks
- Read the deployment checklist before you deploy.
- Use live monitoring to confirm SLOs are met after deploy.
- If an incident happens, follow debug_playbook.md (triage → isolate → mitigate → fix).
- If the controller/broker or core services fail, follow failover_and_recovery.md.
Quick operator checks (first 60s)
- Is the service responding?
  curl -fsS http://$SERVICE/healthz || true
- Are pods healthy?
  kubectl get pods -n $NS
- Any obvious error spikes in logs (last 1 minute)?
  kubectl logs -n $NS -l app=$APP --since=1m --tail=-1 | tail -n 200
  (With a label selector, kubectl logs returns only the last 10 lines per container unless --tail is set.)
- Check key metrics in Prometheus (latency/p95, error rate, retriever QPS).
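The checks above can be wrapped into one read-only helper so they run the same way during every incident. The following is a minimal sketch, not a definitive tool: the function names quick_checks and prom_query, the PROM_URL variable, the 5-second timeout, and the example metric name are assumptions made for illustration. Only the curl and kubectl commands come from the list above, and the Prometheus call uses the standard /api/v1/query endpoint.

```shell
#!/usr/bin/env bash
# First-60-seconds triage sketch. Read-only: nothing here changes cluster state.
# Expected environment variables:
#   SERVICE   host[:port] of the RAG service
#   NS        Kubernetes namespace
#   APP       value of the pods' "app" label
#   PROM_URL  base URL of Prometheus (assumed name, only needed by prom_query)

quick_checks() {
  qc_failed=0

  echo "== health =="
  # -f makes curl exit non-zero on HTTP errors; --max-time keeps triage fast.
  if curl -fsS --max-time 5 "http://${SERVICE}/healthz"; then
    echo
    echo "health: OK"
  else
    echo "health: FAILED"
    qc_failed=1
  fi

  echo "== pods =="
  kubectl get pods -n "${NS}" || qc_failed=1

  echo "== logs (last 1m) =="
  # With a label selector kubectl prints only 10 lines per container by default,
  # so request the whole window and trim locally.
  kubectl logs -n "${NS}" -l "app=${APP}" --since=1m --tail=-1 | tail -n 200

  # Non-zero when the health check or the pod listing failed.
  return "${qc_failed}"
}

prom_query() {
  # $1 is a PromQL expression; prints the raw JSON response from Prometheus.
  curl -fsS --max-time 5 --get --data-urlencode "query=$1" "${PROM_URL}/api/v1/query"
}

# Example invocation (all values are placeholders):
#   SERVICE=rag.example.internal NS=rag APP=rag-service quick_checks
#   PROM_URL=http://prometheus.example.internal:9090 \
#     prom_query 'histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))'
```

The helper reports a failed health check through its exit status instead of hiding it behind || true, which makes it usable from other scripts. The metric name in the example query is hypothetical; substitute whatever your service actually exports.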
Where patterns & examples map here
- If retrieval is bad → see ProblemMap/retrieval-collapse.md and examples for vector-store repair.
- If bootstrap ordering fails on start → see ProblemMap/bootstrap-ordering.md & pattern_bootstrap_deadlock.md.
- For memory/state issues → ProblemMap/patterns/pattern_memory_desync.md.
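For on-call use, the mapping above can be kept as a small lookup function. This is only a sketch: the function name runbook_for and the symptom keywords (retrieval, bootstrap, memory, state) are hypothetical labels, while the file names are the ones listed above. Falling back to debug_playbook.md for anything unrecognized is an assumption based on the general incident guidance in this document.

```shell
#!/usr/bin/env bash
# Symptom-to-runbook lookup sketch. Keywords are illustrative; paths are copied
# from the list above without modification.
runbook_for() {
  case "$1" in
    retrieval)    echo "ProblemMap/retrieval-collapse.md" ;;
    bootstrap)    echo "ProblemMap/bootstrap-ordering.md pattern_bootstrap_deadlock.md" ;;
    memory|state) echo "ProblemMap/patterns/pattern_memory_desync.md" ;;
    # Unknown symptom: fall back to the general debug playbook.
    *)            echo "debug_playbook.md" ;;
  esac
}

# Example: runbook_for retrieval
```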