5.7 KiB
Debug playbook — incident triage for RAG pipelines
Purpose: step-by-step incident response guide emphasizing reproducible diagnostics and minimal-impact mitigations.
1) Immediate triage (first 120s)
A — Gather context
- Who reported it? (pager/Slack/ticket)
- When did it start (wall time)?
- Scope: single user / single shard / whole cluster?
B — Quick readouts
- Health:
curl -fsS http://$SERVICE/healthz - Pods:
kubectl -n $NS get pods -o wide - Recent errors (last 10m):
kubectl -n $NS logs -l app=rag --since=10m | tail -n 200
- Prometheus: check E2E p95 and error rate for last 10m.
C — Decide action mode
- If P0 (site down / data corruption) → Mitigate (circuit-break / rollback / redirect).
- If P1 (functional degradation, e.g., CHR drop) → Isolate & debug.
2) Deterministic checks (no LLM calls)
Run these before calling LLMs — they’re cheap and often reveal root causes:
-
Check retrieval consistency for sample qids:
curl -X POST http://$SERVICE/debug/retrieve -d '{"qid":"A123","q":"sample question"}' | jqValidate
retrieved_idsand their hashes. -
Check mem_rev/mem_hash: verify read vs bound value for turn:
- Compare
retrieved_snapshot.mem_revvsgeneration.mem_rev.
- Compare
-
Vectorstore health:
- ping vectorstore API; check index shard status.
-
Index size & recent writes:
kubectl exec -n $NS <vector-pod> -- ls -lh /data/index
3) Common root causes & mitigations
A. Retrieval empty / irrelevant
-
Root cause: indexing job failed or namespace mismatch.
-
Mitigation:
- Restart indexer pod:
kubectl -n $NS rollout restart deploy/indexer - Run reindex on a small sample and validate.
- Restart indexer pod:
B. CHR drop but retrieval OK
-
Root cause: generator hallucinating or prompt/template drift.
-
Mitigation:
- Turn on guard/refusal stricter mode (feature flag).
- Re-run golden queries with
?dbg=fullto capture prompt+context.
C. Bootstrap / readiness flapping
-
Root cause: bootstrap order or missing dependency.
-
Mitigation:
- Ensure controller/migrations complete before retriever/generator start;
kubectl applyordering or Helm hooks.
- Ensure controller/migrations complete before retriever/generator start;
D. LLM provider errors / rate limits
- Root cause: key expired or provider quota.
- Mitigation: switch to backup key or provider; throttle traffic until resolved.
4) Live mitigation patterns (minimize impact)
- Circuit-breaker (fast): return cached answer for known queries.
- Throttle LLM: queue requests, lower concurrency.
- Rollback: to last known-good release if config causes issue.
- Read-only mode: stop writes to vectorstore if index corruption suspected.
5) Postmortem checklist
- Timestamped timeline created.
- Root cause identified (primary + contributing).
- Actions taken documented.
- Follow-up tasks created (reindex, fix probe, add tests).
- Update runbook if new failure mode discovered.
6) Useful debug commands (reference)
-
Pod logs since N minutes:
kubectl -n $NS logs -l app=rag --since=5m -
Exec into retriever pod:
kubectl -n $NS exec -it deploy/retriever -- /bin/sh -
Check helm history:
helm -n $NS history rag
Links
- Deployment checklist → deployment_checklist.md
- Live monitoring → live_monitoring_rag.md
- Failover & Recovery → failover_and_recovery.md
🔗 Quick-Start Downloads (60 sec)
| Tool | Link | 3-Step Setup |
|---|---|---|
| WFGY 1.0 PDF | Engine Paper | 1️⃣ Download · 2️⃣ Upload to your LLM · 3️⃣ Ask “Answer using WFGY + <your question>” |
| TXT OS (plain-text OS) | TXTOS.txt | 1️⃣ Download · 2️⃣ Paste into any LLM chat · 3️⃣ Type “hello world” — OS boots instantly |
Explore More
| Layer | Page | What it’s for |
|---|---|---|
| ⭐ Proof | WFGY Recognition Map | External citations, integrations, and ecosystem proof |
| ⚙️ Engine | WFGY 1.0 | Original PDF tension engine and early logic sketch (legacy reference) |
| ⚙️ Engine | WFGY 2.0 | Production tension kernel for RAG and agent systems |
| ⚙️ Engine | WFGY 3.0 | TXT based Singularity tension engine (131 S class set) |
| 🗺️ Map | Problem Map 1.0 | Flagship 16 problem RAG failure taxonomy and fix map |
| 🗺️ Map | Problem Map 2.0 | Global Debug Card for RAG and agent pipeline diagnosis |
| 🗺️ Map | Problem Map 3.0 | Global AI troubleshooting atlas and failure pattern map |
| 🧰 App | TXT OS | .txt semantic OS with fast bootstrap |
| 🧰 App | Blah Blah Blah | Abstract and paradox Q&A built on TXT OS |
| 🧰 App | Blur Blur Blur | Text to image generation with semantic control |
| 🏡 Onboarding | Starter Village | Guided entry point for new users |
If this repository helped, starring it improves discovery so more builders can find the docs and tools.