Debug playbook — incident triage for RAG pipelines

Purpose: step-by-step incident response guide emphasizing reproducible diagnostics and minimal-impact mitigations.


1) Immediate triage (first 120s)

A — Gather context

  • Who reported it? (pager/Slack/ticket)
  • When did it start (wall time)?
  • Scope: single user / single shard / whole cluster?

B — Quick readouts

  • Health: curl -fsS http://$SERVICE/healthz
  • Pods: kubectl -n $NS get pods -o wide
  • Recent errors (last 10m):
    kubectl -n $NS logs -l app=rag --since=10m | tail -n 200

  • Prometheus: check E2E p95 latency and error rate for the last 10m (see the sketch below).
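A minimal sketch of the Prometheus readout, assuming the service exports rag_request_duration_seconds_bucket and rag_requests_total (metric names are assumptions; adjust to yours) and that $PROM points at the Prometheus HTTP API:

    # E2E p95 latency over the last 10m (metric names are assumptions)
    curl -fsS --data-urlencode \
      'query=histogram_quantile(0.95, sum(rate(rag_request_duration_seconds_bucket[10m])) by (le))' \
      "$PROM/api/v1/query" | jq '.data.result'

    # Error rate over the last 10m
    curl -fsS --data-urlencode \
      'query=sum(rate(rag_requests_total{status=~"5.."}[10m])) / sum(rate(rag_requests_total[10m]))' \
      "$PROM/api/v1/query" | jq '.data.result'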

C — Decide action mode

  • If P0 (site down / data corruption) → Mitigate (circuit-break / rollback / redirect).
  • If P1 (functional degradation, e.g., CHR drop) → Isolate & debug.

2) Deterministic checks (no LLM calls)

Run these before calling any LLMs; they're cheap and often reveal the root cause:

  1. Check retrieval consistency for sample qids:

    curl -X POST http://$SERVICE/debug/retrieve -d '{"qid":"A123","q":"sample question"}' | jq
    

    Validate the retrieved_ids and their hashes (a minimal sketch follows this list).

  2. Check mem_rev/mem_hash: verify that the value read at retrieval matches the value bound to the turn at generation:

    • Compare retrieved_snapshot.mem_rev vs generation.mem_rev.
  3. Vectorstore health:

    • ping vectorstore API; check index shard status.
  4. Index size & recent writes:

    • kubectl exec -n $NS <vector-pod> -- ls -lh /data/index
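
A minimal sketch that loops check 1 over a few sample qids; the payload shape and the retrieved_ids field follow the example above, while the qid list itself is a placeholder:

    # Flag empty retrievals for a handful of sample qids
    for qid in A123 B456 C789; do
      n=$(curl -fsS -X POST "http://$SERVICE/debug/retrieve" \
        -d "{\"qid\":\"$qid\",\"q\":\"sample question\"}" | jq '.retrieved_ids | length')
      echo "$qid: $n retrieved ids"
      [ "$n" -eq 0 ] && echo "WARN: empty retrieval for $qid"
    done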

3) Common root causes & mitigations

A. Retrieval empty / irrelevant

  • Root cause: indexing job failed or namespace mismatch.

  • Mitigation:

    • Restart indexer pod: kubectl -n $NS rollout restart deploy/indexer
    • Run reindex on a small sample and validate.
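
A minimal sketch of the sample validation, assuming a golden_queries.tsv fixture of qid<TAB>question pairs (the fixture name is an assumption; the endpoint and payload follow check 1 in section 2):

    # After reindexing the sample, confirm each golden qid retrieves something again
    fail=0
    while IFS=$'\t' read -r qid q; do
      n=$(curl -fsS -X POST "http://$SERVICE/debug/retrieve" \
        -d "{\"qid\":\"$qid\",\"q\":\"$q\"}" | jq '.retrieved_ids | length')
      if [ "$n" -eq 0 ]; then echo "still empty: $qid"; fail=1; fi
    done < golden_queries.tsv
    exit "$fail"   # non-zero if any golden qid still retrieves nothing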

B. CHR drop but retrieval OK

  • Root cause: generator hallucinating or prompt/template drift.

  • Mitigation:

    • Turn on the stricter guard/refusal mode (feature flag).
    • Re-run golden queries with ?dbg=full to capture the prompt and context (see the sketch below).
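
A minimal sketch of the golden-query re-run; the /answer path and the golden_queries.tsv fixture are assumptions (only the ?dbg=full flag is given above):

    # Capture prompt+context for each golden query so drift can be diffed offline
    mkdir -p /tmp/dbg
    while IFS=$'\t' read -r qid q; do
      curl -fsS -X POST "http://$SERVICE/answer?dbg=full" \
        -d "{\"qid\":\"$qid\",\"q\":\"$q\"}" > "/tmp/dbg/$qid.json"
    done < golden_queries.tsv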

C. Bootstrap / readiness flapping

  • Root cause: bootstrap order or missing dependency.

  • Mitigation:

    • Ensure the controller and migrations complete before the retriever/generator start, e.g. via kubectl apply ordering or Helm hooks.

D. LLM provider errors / rate limits

  • Root cause: key expired or provider quota.
  • Mitigation: switch to backup key or provider; throttle traffic until resolved.
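
A minimal sketch of the key swap, assuming the generator reads its provider key from a Secret named llm-provider-key and runs as deploy/generator (both names are assumptions):

    # Point the secret at the backup key, then restart the generator to pick it up
    kubectl -n $NS create secret generic llm-provider-key \
      --from-literal=api-key="$BACKUP_KEY" \
      --dry-run=client -o yaml | kubectl -n $NS apply -f -
    kubectl -n $NS rollout restart deploy/generator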

4) Live mitigation patterns (minimize impact)

  1. Circuit-breaker (fast): return cached answer for known queries.
  2. Throttle LLM: queue requests, lower concurrency.
  3. Rollback: revert to the last known-good release if a config change caused the issue (see the helm sketch below).
  4. Read-only mode: stop writes to vectorstore if index corruption suspected.
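
A minimal sketch of the rollback path, assuming the release is named rag as in the helm history command in section 6; the deployment name in the last step is an example:

    # Pick the last known-good revision from history, then roll back and watch it settle
    helm -n $NS history rag
    helm -n $NS rollback rag <REVISION> --wait
    kubectl -n $NS rollout status deploy/retriever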

5) Postmortem checklist

  • Timestamped timeline created.
  • Root cause identified (primary + contributing).
  • Actions taken documented.
  • Follow-up tasks created (reindex, fix probe, add tests).
  • Update runbook if new failure mode discovered.

6) Useful debug commands (reference)

  • Pod logs since N minutes:

    kubectl -n $NS logs -l app=rag --since=5m
    
  • Exec into retriever pod:

    kubectl -n $NS exec -it deploy/retriever -- /bin/sh
    
  • Check helm history:

    helm -n $NS history rag
    


🧭 Explore More

| Module | Description | Link |
|--------|-------------|------|
| WFGY Core | Standalone semantic reasoning engine for any LLM | View → |
| Problem Map 1.0 | Initial 16-mode diagnostic and symbolic fix framework | View → |
| Problem Map 2.0 | RAG-focused failure tree, modular fixes, and pipelines | View → |
| Semantic Clinic Index | Expanded failure catalog: prompt injection, memory bugs, logic drift | View → |
| Semantic Blueprint | Layer-based symbolic reasoning & semantic modulations | View → |
| Benchmark vs GPT-5 | Stress test GPT-5 with full WFGY reasoning suite | View → |

👑 Early Stargazers: See the Hall of Fame — Engineers, hackers, and open source builders who supported WFGY from day one.

⭐ Help reach 10,000 stars by 2025-09-01 to unlock Engine 2.0 for everyone: Star WFGY on GitHub.
