vrr/WFGY

mirror of https://github.com/onestardao/WFGY.git synced 2026-04-28 11:40:07 +00:00

PSBigBig 9ad3dc9e98

Update debug_playbook.md

2025-08-15 23:36:13 +08:00

Debug playbook — incident triage for RAG pipelines

Purpose: step-by-step incident response guide emphasizing reproducible diagnostics and minimal-impact mitigations.

1) Immediate triage (first 120s)

A — Gather context

Who reported it? (pager/Slack/ticket)
When did it start (wall time)?
Scope: single user / single shard / whole cluster?

B — Quick readouts

Health: curl -fsS http://$SERVICE/healthz
Pods: kubectl -n $NS get pods -o wide

Recent errors (last 10m):

kubectl -n $NS logs -l app=rag --since=10m | tail -n 200

Prometheus: check E2E p95 and error rate for last 10m.
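
The readouts above can be captured in one pass so the evidence survives pod restarts. The following is a minimal sketch, assuming `$SERVICE` and `$NS` are set as in the commands above; the function name `triage_snapshot` and the output file names are illustrative and not part of the service.

```sh
# Capture the quick readouts into one directory (timestamped by default).
# Assumes $SERVICE and $NS are set; triage_snapshot and the file names are illustrative.
triage_snapshot() {
  out="${1:-triage-$(date -u +%Y%m%dT%H%M%SZ)}"
  mkdir -p "$out" || return 1
  # Each readout is best-effort: a failing probe is itself evidence, so keep going.
  curl -fsS "http://$SERVICE/healthz" >"$out/healthz.txt" 2>&1 \
    || echo "healthz probe failed" >>"$out/healthz.txt"
  kubectl -n "$NS" get pods -o wide >"$out/pods.txt" 2>&1
  kubectl -n "$NS" logs -l app=rag --since=10m 2>&1 | tail -n 200 >"$out/errors.txt"
  echo "$out"   # print the directory so it can be attached to the ticket
}
```

Running `triage_snapshot` with no argument writes to a UTC-timestamped directory under the current working directory.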

C — Decide action mode

If P0 (site down / data corruption) → Mitigate (circuit-break / rollback / redirect).
If P1 (functional degradation, e.g., CHR drop) → Isolate & debug.

2) Deterministic checks (no LLM calls)

Run these before calling LLMs — they’re cheap and often reveal root causes:

Check retrieval consistency for sample qids:

curl -sS -X POST http://$SERVICE/debug/retrieve -H 'Content-Type: application/json' -d '{"qid":"A123","q":"sample question"}' | jq .

Validate retrieved_ids and their hashes.
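
One way to check retrieval consistency is to issue the same request several times and confirm the output is byte-identical. The helper below is a hedged sketch: `is_stable` is a hypothetical name, and the check is only meaningful if the debug response carries no per-request fields such as timestamps or request IDs (an assumption about the endpoint).

```sh
# Run a command N times and succeed only if every run prints identical output.
# Usage: is_stable N cmd [args...]
# Returns 0 if stable, 1 if output drifted, 2 if the command itself failed.
is_stable() {
  n="$1"; shift
  first=$("$@") || return 2
  i=1
  while [ "$i" -lt "$n" ]; do
    next=$("$@") || return 2
    [ "$next" = "$first" ] || return 1   # output drifted between runs
    i=$((i + 1))
  done
  return 0
}

# Example (same request as above, repeated three times):
#   is_stable 3 curl -sS -X POST "http://$SERVICE/debug/retrieve" \
#     -d '{"qid":"A123","q":"sample question"}' || echo "retrieval is not deterministic"
```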

Check mem_rev/mem_hash: verify that the value read matches the value bound for the turn:
- Compare retrieved_snapshot.mem_rev vs generation.mem_rev.
Vectorstore health:
- Ping the vectorstore API and check index shard status.
Index size & recent writes:
- kubectl exec -n $NS <vector-pod> -- ls -lh /data/index
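
Once both values are extracted, the mem_rev comparison reduces to a string equality check. The sketch below assumes nothing about the service beyond the two field names mentioned above; `check_mem_rev` is a hypothetical helper, and the `debug.json` file in the example comment is an assumed saved payload.

```sh
# Compare the mem_rev seen at retrieval time with the one bound at generation time.
# Returns 0 on match, 1 on mismatch, 2 if either value is missing.
check_mem_rev() {
  retrieved="$1"; generated="$2"
  if [ -z "$retrieved" ] || [ -z "$generated" ]; then
    echo "mem_rev missing (retrieved='$retrieved' generation='$generated')" >&2
    return 2
  fi
  if [ "$retrieved" != "$generated" ]; then
    echo "MISMATCH retrieved=$retrieved generation=$generated"
    return 1
  fi
  echo "OK mem_rev=$retrieved"
}

# Example (assumes a saved debug payload named debug.json with these fields):
#   check_mem_rev "$(jq -r '.retrieved_snapshot.mem_rev // empty' debug.json)" \
#                 "$(jq -r '.generation.mem_rev // empty' debug.json)"
```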

3) Common root causes & mitigations

A. Retrieval empty / irrelevant

Root cause: indexing job failed or namespace mismatch.
Mitigation:
- Restart indexer pod: kubectl -n $NS rollout restart deploy/indexer
- Run reindex on a small sample and validate.
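
A restart is only useful if the new pods actually become ready, so it helps to pair the restart with `kubectl rollout status`. This is a small sketch, assuming `$NS` is set as above; `restart_and_wait` is an illustrative helper name and the 120s default timeout is an arbitrary choice.

```sh
# Restart a deployment and block until the rollout finishes or times out.
# Usage: restart_and_wait DEPLOYMENT [TIMEOUT]   (TIMEOUT defaults to 120s, an arbitrary value)
restart_and_wait() {
  deploy="$1"; timeout="${2:-120s}"
  kubectl -n "$NS" rollout restart "deploy/$deploy" || return 1
  # rollout status exits non-zero if the deployment does not become ready in time
  kubectl -n "$NS" rollout status "deploy/$deploy" --timeout="$timeout"
}

# Example:
#   restart_and_wait indexer 180s || echo "indexer did not recover; escalate"
```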

B. CHR drop but retrieval OK

Root cause: generator hallucinating or prompt/template drift.
Mitigation:
- Enable the stricter guard/refusal mode (feature flag).
- Re-run golden queries with ?dbg=full to capture prompt+context.

C. Bootstrap / readiness flapping

Root cause: bootstrap order or missing dependency.
Mitigation:
- Ensure controller/migrations complete before the retriever/generator start; enforce this with kubectl apply ordering or Helm hooks.
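
Where Helm hooks are not an option, start-up ordering can be approximated by having the dependent service wait on a readiness probe before it starts. The following is a sketch of that idea, not the project's own mechanism; `wait_for`, `$CONTROLLER`, and `./start-retriever` are hypothetical names.

```sh
# Retry a probe command until it succeeds or the attempt budget runs out.
# Usage: wait_for MAX_ATTEMPTS SLEEP_SECONDS cmd [args...]
wait_for() {
  max="$1"; pause="$2"; shift 2
  attempt=1
  while ! "$@" >/dev/null 2>&1; do
    if [ "$attempt" -ge "$max" ]; then
      echo "gave up after $attempt attempts: $*" >&2
      return 1
    fi
    attempt=$((attempt + 1))
    sleep "$pause"
  done
  echo "ready after $attempt attempt(s)"
}

# Example ($CONTROLLER and ./start-retriever are placeholders):
#   wait_for 30 2 curl -fsS "http://$CONTROLLER/healthz" && exec ./start-retriever
```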

D. LLM provider errors / rate limits

Root cause: key expired or provider quota.
Mitigation: switch to backup key or provider; throttle traffic until resolved.

4) Live mitigation patterns (minimize impact)

Circuit-breaker (fast): return cached answer for known queries.
Throttle LLM: queue requests, lower concurrency.
Rollback: revert to the last known-good release if a config change caused the issue.
Read-only mode: stop writes to vectorstore if index corruption suspected.
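
The circuit-breaker pattern can be sketched as a wrapper that serves cached answers while a breaker flag is set and otherwise calls the backend and refreshes the cache. This is an illustration under stated assumptions, not how the service implements it: `answer`, `CACHE_DIR`, and `BREAKER_FILE` are hypothetical names, and the cache key is a plain `cksum` of the query text.

```sh
# Serve cached answers while the breaker is open; otherwise call the backend and refresh the cache.
# CACHE_DIR and BREAKER_FILE are illustrative names, not part of the service.
# Usage: answer QUERY backend_cmd [args...]   (the query is appended as the last argument)
answer() {
  query="$1"; shift
  key=$(printf '%s' "$query" | cksum | cut -d ' ' -f 1)
  cached="$CACHE_DIR/$key"
  if [ -e "$BREAKER_FILE" ]; then
    # Breaker open: never touch the backend, answer known queries only.
    [ -f "$cached" ] || { echo "breaker open, no cached answer" >&2; return 1; }
    cat "$cached"
    return 0
  fi
  mkdir -p "$CACHE_DIR" || return 1
  "$@" "$query" >"$cached.tmp" || { rm -f "$cached.tmp"; return 1; }
  mv "$cached.tmp" "$cached"   # publish only complete responses
  cat "$cached"
}
```

Opening the breaker is then `touch "$BREAKER_FILE"` and closing it is `rm -f "$BREAKER_FILE"`, which keeps the mitigation reversible without a deploy.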

5) Postmortem checklist

Timestamped timeline created.
Root cause identified (primary + contributing).
Actions taken documented.
Follow-up tasks created (reindex, fix probe, add tests).
Runbook updated if a new failure mode was discovered.
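
A timestamped timeline is easiest to produce if entries are recorded during the incident instead of reconstructed afterwards. A minimal sketch, assuming a local log file is acceptable; `note` and `TIMELINE` are illustrative names.

```sh
# Append UTC-timestamped entries to an incident timeline file.
# TIMELINE is an illustrative variable; it defaults to ./timeline.log.
note() {
  printf '%s  %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$*" >>"${TIMELINE:-timeline.log}"
}

# Example:
#   note "paged by on-call, CHR drop on shard 3"
#   note "restarted indexer"
```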

6) Useful debug commands (reference)

Pod logs from the last N minutes (N=5 here):

kubectl -n $NS logs -l app=rag --since=5m

Exec into retriever pod:

kubectl -n $NS exec -it deploy/retriever -- /bin/sh

Check helm history:
```
helm -n $NS history rag
```
