# Debug playbook — incident triage for RAG pipelines

Purpose: a step-by-step incident response guide emphasizing reproducible diagnostics and minimal-impact mitigations.
## 1) Immediate triage (first 120s)
### A — Gather context
- Who reported it? (pager/Slack/ticket)
- When did it start (wall time)?
- Scope: single user / single shard / whole cluster?
### B — Quick readouts
- Health:
  `curl -fsS http://$SERVICE/healthz`
- Pods:
  `kubectl -n $NS get pods -o wide`
- Recent errors (last 10m):
  `kubectl -n $NS logs -l app=rag --since=10m | tail -n 200`
- Prometheus: check E2E p95 and error rate for the last 10m (see the query sketch below).
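A minimal sketch of the Prometheus readout via the HTTP API, assuming `$PROM` points at the Prometheus server and that the gateway exports a `rag_request_duration_seconds` histogram and a `rag_requests_total` counter (both metric names are assumptions, not confirmed by this runbook):

```sh
# p95 end-to-end latency over the last 10m (metric name is an assumption)
curl -fsS "http://$PROM/api/v1/query" --data-urlencode \
  'query=histogram_quantile(0.95, sum(rate(rag_request_duration_seconds_bucket[10m])) by (le))'

# share of 5xx responses over the last 10m (metric/label names are assumptions)
curl -fsS "http://$PROM/api/v1/query" --data-urlencode \
  'query=sum(rate(rag_requests_total{status=~"5.."}[10m])) / sum(rate(rag_requests_total[10m]))'
```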
### C — Decide action mode
- If P0 (site down / data corruption) → Mitigate (circuit-break / rollback / redirect).
- If P1 (functional degradation, e.g., CHR drop) → Isolate & debug.
## 2) Deterministic checks (no LLM calls)
Run these before calling LLMs — they’re cheap and often reveal root causes:
- Check retrieval consistency for sample qids (see the consistency-check sketch after this list):
  `curl -X POST http://$SERVICE/debug/retrieve -d '{"qid":"A123","q":"sample question"}' | jq`
  Validate `retrieved_ids` and their hashes.
- Check mem_rev/mem_hash: verify the read vs the bound value for the turn:
  - Compare `retrieved_snapshot.mem_rev` vs `generation.mem_rev`.
  - Compare `retrieved_snapshot.mem_hash` vs `generation.mem_hash`.
- Vectorstore health:
  - Ping the vectorstore API; check index shard status.
- Index size & recent writes:
  `kubectl exec -n $NS <vector-pod> -- ls -lh /data/index`
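A minimal consistency-check sketch, assuming `/debug/retrieve` returns the `retrieved_snapshot.mem_rev` and `generation.mem_rev` fields named above (the sample qids are hypothetical):

```sh
# Compare the read mem_rev against the bound mem_rev for a few sample qids.
for qid in A123 B456 C789; do   # hypothetical sample qids
  resp=$(curl -fsS -X POST "http://$SERVICE/debug/retrieve" \
           -d "{\"qid\":\"$qid\",\"q\":\"sample question\"}")
  read_rev=$(echo "$resp" | jq -r '.retrieved_snapshot.mem_rev')
  bound_rev=$(echo "$resp" | jq -r '.generation.mem_rev')
  [ "$read_rev" = "$bound_rev" ] || echo "qid=$qid mem_rev mismatch: $read_rev != $bound_rev"
done
```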
## 3) Common root causes & mitigations
### A. Retrieval empty / irrelevant
- Root cause: indexing job failed or namespace mismatch.
- Mitigation (see the reindex sketch below):
  - Restart the indexer pod:
    `kubectl -n $NS rollout restart deploy/indexer`
  - Run a reindex on a small sample and validate.
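A minimal sketch of the restart-then-validate flow, assuming the service exposes an `/admin/reindex` endpoint alongside the `/debug/retrieve` endpoint shown earlier (`/admin/reindex` and its payload are assumptions):

```sh
# Restart the indexer and wait until the rollout settles.
kubectl -n "$NS" rollout restart deploy/indexer
kubectl -n "$NS" rollout status deploy/indexer --timeout=120s

# Reindex a small sample, then confirm retrieval is non-empty again.
curl -fsS -X POST "http://$SERVICE/admin/reindex" -d '{"sample_size": 100}'   # hypothetical endpoint
curl -fsS -X POST "http://$SERVICE/debug/retrieve" \
  -d '{"qid":"A123","q":"sample question"}' \
  | jq -e '.retrieved_ids | length > 0' || echo "retrieval still empty"
```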
### B. CHR drop but retrieval OK
- Root cause: generator hallucinating or prompt/template drift.
- Mitigation:
  - Turn on the stricter guard/refusal mode (feature flag).
  - Re-run golden queries with `?dbg=full` to capture prompt + context (see the replay sketch below).
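A minimal replay sketch, assuming golden queries live one per line in a local `goldens.txt` and that the query endpoint is `POST /query` (both the file and the endpoint path are assumptions; `?dbg=full` comes from this runbook):

```sh
# Replay each golden query with full debug capture and keep the raw responses.
# Assumes queries contain no embedded double quotes.
while IFS= read -r q; do
  curl -fsS -X POST "http://$SERVICE/query?dbg=full" \
    -d "{\"q\":\"$q\"}" >> /tmp/golden_debug.jsonl
done < goldens.txt
```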
### C. Bootstrap / readiness flapping
- Root cause: bootstrap ordering or a missing dependency.
- Mitigation:
  - Ensure controller/migrations complete before retriever/generator start; enforce `kubectl apply` ordering or Helm hooks (see the ordering sketch below).
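A minimal ordering sketch using plain kubectl, assuming the migrations run as a Job named `migrations` and the manifests are split into the files shown (file and Job names are hypothetical; Helm users would reach for pre-install hooks instead):

```sh
# Apply migrations first and block until the Job completes,
# then bring up the components that depend on it.
kubectl -n "$NS" apply -f migrations-job.yaml
kubectl -n "$NS" wait --for=condition=complete job/migrations --timeout=300s
kubectl -n "$NS" apply -f retriever.yaml -f generator.yaml
```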
### D. LLM provider errors / rate limits
- Root cause: key expired or provider quota exhausted.
- Mitigation: switch to a backup key or provider; throttle traffic until resolved (see the key-swap sketch below).
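A minimal key-swap sketch, assuming the provider key lives in a Secret named `llm-provider` consumed by the generator Deployment (the Secret, key, and Deployment names are assumptions):

```sh
# Overwrite the provider key with the backup, then restart the consumer.
kubectl -n "$NS" create secret generic llm-provider \
  --from-literal=API_KEY="$BACKUP_KEY" --dry-run=client -o yaml \
  | kubectl -n "$NS" apply -f -
kubectl -n "$NS" rollout restart deploy/generator
```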
## 4) Live mitigation patterns (minimize impact)
- Circuit-breaker (fast): return cached answers for known queries.
- Throttle LLM: queue requests, lower concurrency.
- Rollback: roll back to the last known-good release if a config change caused the issue (see the sketch below).
- Read-only mode: stop writes to the vectorstore if index corruption is suspected.
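A minimal rollback sketch using the `rag` Helm release referenced in section 6 (the revision number is a placeholder you read off the history):

```sh
# Find the last known-good revision, then roll back to it.
helm -n "$NS" history rag
helm -n "$NS" rollback rag <REVISION>   # substitute the known-good revision
```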
## 5) Postmortem checklist
- Timestamped timeline created.
- Root cause identified (primary + contributing).
- Actions taken documented.
- Follow-up tasks created (reindex, fix probe, add tests).
- Update runbook if new failure mode discovered.
## 6) Useful debug commands (reference)
- Pod logs since N minutes:
  `kubectl -n $NS logs -l app=rag --since=5m`
- Exec into the retriever pod:
  `kubectl -n $NS exec -it deploy/retriever -- /bin/sh`
- Check Helm history:
  `helm -n $NS history rag`
## Links
- Deployment checklist → deployment_checklist.md
- Live monitoring → live_monitoring_rag.md
- Failover & Recovery → failover_and_recovery.md