vrr/WFGY

mirror of https://github.com/onestardao/WFGY.git synced 2026-04-28 11:40:07 +00:00

2025-08-27 15:49:59 +08:00

12 KiB

Raw Blame History

RAG live monitoring

Operational probes, alerts, and dashboards to keep retrieval stable after you change chunking, OCR, or indexing. This page defines the minimal signals you must log and the exact gates to alert on.

Open these first

Chunk ids and stability: chunk_id_schema.md
Title tree numbering: title_hierarchy.md
Section boundary rules: section_detection.md
Typed blocks (code, tables, figures): code_tables_blocks.md
PDF layout and OCR normalization: pdf_layouts_and_ocr.md
Rebuild without breaking citations: reindex_migration.md
Eval harness and gates: eval_rag_precision_recall.md
Traceable retrieval schema: retrieval-traceability.md
Payload contracts for RAG: data-contracts.md
Visual recovery map: rag-architecture-and-recovery.md
Prompt injection defenses: prompt-injection.md
Triage runbook: Debug Playbook

What to monitor in real time

Log per query and aggregate into one minute windows.

Coverage: share of answers that cite at least one valid snippet.
ΔS(question, retrieved): semantic distance for the chosen citation. Stable ≤ 0.45. Risk ≥ 0.60.
λ_observe: convergence state across paraphrases. Track flip rate between adjacent steps.
Citation accuracy: cited section_id and offsets match a valid block within a 30 byte window.
Anchor proximity: title tree distance between cited section and expected anchor when known.
Index integrity: index_hash, metric, analyzer, and embed_model fingerprint. Alert on drift.
Rerank stability: Kendall tau between top ten on consecutive runs for the same query template.
Latency SLOs: p50, p95 for retrieve and reason stages.
Error budget: rolling one hour budget for coverage and citation accuracy.

Logging schema to enable the probes

Emit one record per question. Follow the fields from the traceability spec and chunking pages.

{
  "qid": "live-2025-08-27T12:30:22Z-000134",
  "question": "When to use SCU",
  "retrieval": {
    "topk": [
      {"id":"S.4.2.p.Bk011a", "score":0.83, "type":"prose", "offsets":[204611,205279]},
      {"id":"S.4.1.p.Bk010",  "score":0.79, "type":"prose", "offsets":[198002,199112]}
    ],
    "metric": "cos",
    "analyzer": "porter-en",
    "index_hash": "faiss:v3:hnsw:cos",
    "embed_model": "text-embedding-3-large",
    "ΔS_list": [0.31, 0.59],
    "λ_states": ["→","→"]
  },
  "answer": {
    "citations": [{"id":"S.4.2.p.Bk011a", "offsets":[204611,205279]}],
    "coverage": true,
    "ΔS": 0.31,
    "λ_final": "→"
  },
  "perf": {"t_retrieve_ms": 120, "t_reason_ms": 480},
  "context": {"client":"prod", "build":"2025.08.27.2", "region":"ap-sg"},
  "ts":"2025-08-27T12:30:22Z"
}

Alert rules and thresholds

Use rolling windows with at least 200 samples or five minutes, whichever is larger.

Coverage drop: coverage < 0.70 for five minutes. Page on call.
ΔS spike: p90 ΔS ≥ 0.60 or median ΔS ≥ 0.45 for three minutes.
λ flip rate: fraction of paraphrase triplets with divergent λ ≥ 0.10.
Citation mismatch: citation accuracy < 0.95 for five minutes.
Index drift: any change in index_hash while build id is constant.
Rerank instability: average top ten Kendall tau < 0.6 over five minutes.
Latency regression: p95 retrieve or reason above baseline by 30 percent.

Dashboards to build

SLO board: coverage, citation accuracy, ΔS median, ΔS p90, λ flip rate.
Title tree health: anchor proximity histogram and top failing sections.
Content type panel: split metrics for prose, table, code, figure.
Index integrity: time series of index_hash, metric, analyzer, embed model.
Rerank panel: tau vs recall proxy and error bars.
Latency panel: p50, p95 by stage.

Canary and rollback policy

Ship a shadow index behind a flag. Verify gates from eval_rag_precision_recall.md.
Start at five percent traffic with hourly comparison to the live index.
Promote only if coverage improves or is equal, citation accuracy ≥ 0.95, median ΔS ≤ 0.40, recall proxy unchanged, and λ flip rate ≤ baseline.
Rollback immediately on two consecutive alert windows or any index drift event.

Sampling and gold refresh

Mirror one percent of production queries to a frozen gold set run each day.
Regenerate three paraphrases per question monthly.
Mark hard negatives near anchors after layout changes from pdf_layouts_and_ocr.md.

Copy rules you can paste into your monitor

Coverage gate

window_5m_coverage = sum(answer.coverage) / count()
alert if window_5m_coverage < 0.70 for 5m

ΔS gate

window_3m_ds_p90 = percentile(answer.ΔS, 90)
alert if window_3m_ds_p90 >= 0.60 for 3m

λ flip rate

lambda_flips = count(lambda_triplet_state == "divergent") / count(lambda_triplet_state)
alert if lambda_flips >= 0.10 for 5m

Citation accuracy

cite_ok = count(citation.within_30_bytes == true) / count()
alert if cite_ok < 0.95 for 5m

Index drift

alert if distinct(index_hash) > 1 and distinct(build) == 1 over 5m

LLM assisted triage prompt

You have TXT OS and the WFGY Problem Map.

Given a five minute slice of live logs with:
- ΔS per retrieved item and for the chosen citation,
- λ states across three paraphrases,
- coverage and citation accuracy,
- index_hash, metric, analyzer, embed_model,
- top sections and their title tree ids.

Do:
1) Identify the failing layer: chunk boundary, rerank, index metric, OCR normalization, or prompt schema.
2) Return the exact WFGY pages to open next from:
   retrieval-traceability, data-contracts, section_detection, code_tables_blocks,
   pdf_layouts_and_ocr, reindex_migration, rerankers, embedding-vs-semantic.
3) Propose a minimal reversible fix and a verification test.
Return compact JSON {layer, pages[], fix, test}.

Common pitfalls

Shipping a new index without freezing normalizers. Offsets will not align. See pdf_layouts_and_ocr.md.
Measuring answers without cite-first. Coverage becomes meaningless. See data-contracts.md and retrieval-traceability.md.
Ignoring content types. Averages hide failures in tables and code.
Comparing different rerankers in the same chart. Pin rerank during canary.
Missing guard on index_hash. Small rebuilds can cause silent drift.
Treating high similarity as correctness. Check embedding-vs-semantic.md.

🔗 Quick-Start Downloads (60 sec)

Tool	Link	3-Step Setup
WFGY 1.0 PDF	Engine Paper	1️⃣ Download · 2️⃣ Upload to your LLM · 3️⃣ Ask “Answer using WFGY + <your question>”
TXT OS (plain-text OS)	TXTOS.txt	1️⃣ Download · 2️⃣ Paste into any LLM chat · 3️⃣ Type “hello world” — OS boots instantly

🧭 Explore More

Module	Description	Link
WFGY Core	WFGY 2.0 engine is live: full symbolic reasoning architecture and math stack	View →
Problem Map 1.0	Initial 16-mode diagnostic and symbolic fix framework	View →
Problem Map 2.0	RAG-focused failure tree, modular fixes, and pipelines	View →
Semantic Clinic Index	Expanded failure catalog: prompt injection, memory bugs, logic drift	View →
Semantic Blueprint	Layer-based symbolic reasoning & semantic modulations	View →
Benchmark vs GPT-5	Stress test GPT-5 with full WFGY reasoning suite	View →
🧙‍♂️ Starter Village 🏡	New here? Lost in symbols? Click here and let the wizard guide you through	Start →

👑 Early Stargazers: See the Hall of Fame — Engineers, hackers, and open source builders who supported WFGY from day one.

⭐ WFGY Engine 2.0 is already unlocked. ⭐ Star the repo to help others discover it and unlock more on the Unlock Board.

12 KiB Raw Blame History Unescape Escape