# Debug playbook — incident triage for RAG pipelines

Purpose: a step-by-step incident response guide emphasizing reproducible diagnostics and minimal-impact mitigations.
## 1) Immediate triage (first 120s)
### A — Gather context
- Who reported it? (pager/Slack/ticket)
- When did it start (wall time)?
- Scope: single user / single shard / whole cluster?
### B — Quick readouts
- Health:
  `curl -fsS http://$SERVICE/healthz`
- Pods:
  `kubectl -n $NS get pods -o wide`
- Recent errors (last 10m):
  `kubectl -n $NS logs -l app=rag --since=10m | tail -n 200`
- Prometheus: check E2E p95 and error rate for the last 10m (see the query sketch below).
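A minimal sketch of the Prometheus readout via the HTTP API, assuming `$PROM` points at the Prometheus server and that the gateway exports a `rag_request_duration_seconds` histogram and a `rag_requests_total` counter (both metric names are assumptions, not confirmed by this runbook):

```sh
# p95 end-to-end latency over the last 10m (metric name is an assumption)
curl -fsS "http://$PROM/api/v1/query" --data-urlencode \
  'query=histogram_quantile(0.95, sum(rate(rag_request_duration_seconds_bucket[10m])) by (le))'

# share of 5xx responses over the last 10m (metric/label names are assumptions)
curl -fsS "http://$PROM/api/v1/query" --data-urlencode \
  'query=sum(rate(rag_requests_total{status=~"5.."}[10m])) / sum(rate(rag_requests_total[10m]))'
```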
### C — Decide action mode
- If P0 (site down / data corruption) → Mitigate (circuit-break / rollback / redirect).
- If P1 (functional degradation, e.g., CHR drop) → Isolate & debug.
## 2) Deterministic checks (no LLM calls)
Run these before calling LLMs — they’re cheap and often reveal root causes:
- Check retrieval consistency for sample qids (see the consistency-check sketch after this list):
  `curl -X POST http://$SERVICE/debug/retrieve -d '{"qid":"A123","q":"sample question"}' | jq`
  Validate `retrieved_ids` and their hashes.
- Check mem_rev/mem_hash: verify the read vs the bound value for the turn:
  - Compare `retrieved_snapshot.mem_rev` vs `generation.mem_rev`.
  - Compare `retrieved_snapshot.mem_hash` vs `generation.mem_hash`.
- Vectorstore health:
  - Ping the vectorstore API; check index shard status.
- Index size & recent writes:
  `kubectl exec -n $NS <vector-pod> -- ls -lh /data/index`
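A minimal consistency-check sketch, assuming `/debug/retrieve` returns the `retrieved_snapshot.mem_rev` and `generation.mem_rev` fields named above (the sample qids are hypothetical):

```sh
# Compare the read mem_rev against the bound mem_rev for a few sample qids.
for qid in A123 B456 C789; do   # hypothetical sample qids
  resp=$(curl -fsS -X POST "http://$SERVICE/debug/retrieve" \
           -d "{\"qid\":\"$qid\",\"q\":\"sample question\"}")
  read_rev=$(echo "$resp" | jq -r '.retrieved_snapshot.mem_rev')
  bound_rev=$(echo "$resp" | jq -r '.generation.mem_rev')
  [ "$read_rev" = "$bound_rev" ] || echo "qid=$qid mem_rev mismatch: $read_rev != $bound_rev"
done
```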
## 3) Common root causes & mitigations
### A. Retrieval empty / irrelevant
- Root cause: indexing job failed or namespace mismatch.
- Mitigation (see the reindex sketch below):
  - Restart the indexer pod:
    `kubectl -n $NS rollout restart deploy/indexer`
  - Run a reindex on a small sample and validate.
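A minimal sketch of the restart-then-validate flow, assuming the service exposes an `/admin/reindex` endpoint alongside the `/debug/retrieve` endpoint shown earlier (`/admin/reindex` and its payload are assumptions):

```sh
# Restart the indexer and wait until the rollout settles.
kubectl -n "$NS" rollout restart deploy/indexer
kubectl -n "$NS" rollout status deploy/indexer --timeout=120s

# Reindex a small sample, then confirm retrieval is non-empty again.
curl -fsS -X POST "http://$SERVICE/admin/reindex" -d '{"sample_size": 100}'   # hypothetical endpoint
curl -fsS -X POST "http://$SERVICE/debug/retrieve" \
  -d '{"qid":"A123","q":"sample question"}' \
  | jq -e '.retrieved_ids | length > 0' || echo "retrieval still empty"
```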
### B. CHR drop but retrieval OK
- Root cause: generator hallucinating or prompt/template drift.
- Mitigation:
  - Turn on the stricter guard/refusal mode (feature flag).
  - Re-run golden queries with `?dbg=full` to capture prompt + context (see the replay sketch below).
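A minimal replay sketch, assuming golden queries live one per line in a local `goldens.txt` and that the query endpoint is `POST /query` (both the file and the endpoint path are assumptions; `?dbg=full` comes from this runbook):

```sh
# Replay each golden query with full debug capture and keep the raw responses.
# Assumes queries contain no embedded double quotes.
while IFS= read -r q; do
  curl -fsS -X POST "http://$SERVICE/query?dbg=full" \
    -d "{\"q\":\"$q\"}" >> /tmp/golden_debug.jsonl
done < goldens.txt
```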
### C. Bootstrap / readiness flapping
- Root cause: bootstrap ordering or a missing dependency.
- Mitigation:
  - Ensure controller/migrations complete before retriever/generator start; enforce `kubectl apply` ordering or Helm hooks (see the ordering sketch below).
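A minimal ordering sketch using plain kubectl, assuming the migrations run as a Job named `migrations` and the manifests are split into the files shown (file and Job names are hypothetical; Helm users would reach for pre-install hooks instead):

```sh
# Apply migrations first and block until the Job completes,
# then bring up the components that depend on it.
kubectl -n "$NS" apply -f migrations-job.yaml
kubectl -n "$NS" wait --for=condition=complete job/migrations --timeout=300s
kubectl -n "$NS" apply -f retriever.yaml -f generator.yaml
```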
### D. LLM provider errors / rate limits
- Root cause: key expired or provider quota exhausted.
- Mitigation: switch to a backup key or provider; throttle traffic until resolved (see the key-swap sketch below).
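A minimal key-swap sketch, assuming the provider key lives in a Secret named `llm-provider` consumed by the generator Deployment (the Secret, key, and Deployment names are assumptions):

```sh
# Overwrite the provider key with the backup, then restart the consumer.
kubectl -n "$NS" create secret generic llm-provider \
  --from-literal=API_KEY="$BACKUP_KEY" --dry-run=client -o yaml \
  | kubectl -n "$NS" apply -f -
kubectl -n "$NS" rollout restart deploy/generator
```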
## 4) Live mitigation patterns (minimize impact)
- Circuit-breaker (fast): return cached answers for known queries.
- Throttle LLM: queue requests, lower concurrency.
- Rollback: roll back to the last known-good release if a config change caused the issue (see the sketch below).
- Read-only mode: stop writes to the vectorstore if index corruption is suspected.
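A minimal rollback sketch using the `rag` Helm release referenced in section 6 (the revision number is a placeholder you read off the history):

```sh
# Find the last known-good revision, then roll back to it.
helm -n "$NS" history rag
helm -n "$NS" rollback rag <REVISION>   # substitute the known-good revision
```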
## 5) Postmortem checklist
- Timestamped timeline created.
- Root cause identified (primary + contributing).
- Actions taken documented.
- Follow-up tasks created (reindex, fix probe, add tests).
- Update runbook if new failure mode discovered.
## 6) Useful debug commands (reference)
- Pod logs since N minutes:
  `kubectl -n $NS logs -l app=rag --since=5m`
- Exec into the retriever pod:
  `kubectl -n $NS exec -it deploy/retriever -- /bin/sh`
- Check Helm history:
  `helm -n $NS history rag`
## Links
- Deployment checklist → deployment_checklist.md
- Live monitoring → live_monitoring_rag.md
- Failover & Recovery → failover_and_recovery.md