# RAG precision/recall evaluation
## 🧭 Quick Return to Map
You are in a sub-page of Chunking.
To reorient, go back here:
- Chunking — text segmentation and context window management
- WFGY Global Fix Map — main Emergency Room, 300+ structured fixes
- WFGY Problem Map 1.0 — 16 reproducible failure modes
Think of this page as a desk within a ward.
If you need the full triage and all prescriptions, return to the Emergency Room lobby.
A compact, repeatable harness to measure retrieval precision, recall, and coverage after you change chunking, OCR, or indexing. This page also defines ΔS and λ probes so you can gate rollouts with hard numbers.
## Open these first
- Chunk ids and stability: chunk_id_schema.md
- Title tree numbering: title_hierarchy.md
- Section boundary rules: section_detection.md
- Typed blocks (code, tables, figures): code_tables_blocks.md
- PDF, layout, OCR normalization: pdf_layouts_and_ocr.md
- Traceable results and cite schema: retrieval-traceability.md
- Payload contracts for RAG: data-contracts.md
- Visual recovery map and ops: rag-architecture-and-recovery.md
## What this measures
- Precision@k: fraction of retrieved snippets among top-k that truly answer the question.
- Recall@k: fraction of all relevant snippets that appear in top-k.
- Coverage: proportion of questions whose final answer can be justified by at least one cited snippet.
- Citation accuracy: percentage of answers where `section_id` and offsets match the gold.
- ΔS(question, retrieved): semantic distance. Stable ≤ 0.45, transitional 0.45–0.60, risk ≥ 0.60.
- λ_observe: convergence state across paraphrases and seeds.
These metrics tell you if a chunking or index change helps retrieval without breaking traceability.
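ΔS is a distance probe over embeddings. A minimal sketch, assuming ΔS is taken as 1 minus the cosine similarity between the question and snippet vectors; adjust if your pipeline defines the distance differently:

```python
import numpy as np

def delta_s(question_vec, snippet_vec):
    """ΔS as 1 - cosine similarity: 0.0 means identical direction,
    values at or above 0.60 fall in the risk band."""
    q = np.asarray(question_vec, dtype=float)
    s = np.asarray(snippet_vec, dtype=float)
    cos = float(q @ s) / (np.linalg.norm(q) * np.linalg.norm(s))
    return 1.0 - cos
```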
## Acceptance targets
- Coverage ≥ 0.70 on the project’s gold set.
- ΔS(question, retrieved) ≤ 0.45 for the cited snippet of each answered item.
- Citation accuracy ≥ 0.95 for `section_id` + offsets.
- λ remains convergent across three paraphrases and two seeds.
- Recall@k does not drop more than 2 points absolute compared with the previous index.
## Gold set construction
- Scope 200–400 items that span headings, code regions, tables, and prose.
- Write three paraphrases per question with identical intent.
- Annotate relevant blocks using canonical ids from chunk_id_schema.md.
- Mark hard negatives near the true section to test boundary quality.
- Freeze the canonical text and store byte offsets after normalization from pdf_layouts_and_ocr.md.
Gold rows should look like:
```json
{
  "qid": "Q-0137",
  "paraphrases": [
    "How does SCU unlock safety refusals?",
    "Explain symbolic constraint unlock.",
    "SCU: what is it and when to use?"
  ],
  "relevant": ["S.4.2.p.Bk011a", "S.4.2.p.Bk011b"],
  "anchor_section": "S.4.2",
  "negatives": ["S.4.1.p.Bk010", "S.4.3.p.Bk014"]
}
```
## Logging schema for evaluation
Your retriever must emit a trace per query. Use the fields defined in retrieval-traceability.md and data-contracts.md.
```json
{
  "qid": "Q-0137",
  "query": "Explain symbolic constraint unlock.",
  "topk": [
    {"id": "S.4.2.p.Bk011a", "score": 0.83, "offsets": [204611, 205279], "type": "prose"},
    {"id": "S.4.1.p.Bk010", "score": 0.79, "offsets": [198002, 199112], "type": "prose"}
  ],
  "ΔS": [0.31, 0.59],
  "λ_state": "→",
  "anchor": "S.4.2",
  "index_hash": "faiss:v3:hnsw:cos",
  "ts": "2025-08-27T12:30:22Z"
}
```
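Before scoring, it pays to validate each trace against the contract. A minimal sketch using only the field names shown above:

```python
REQUIRED_FIELDS = {"qid", "query", "topk", "ΔS", "λ_state", "anchor", "index_hash", "ts"}

def validate_trace(trace: dict) -> dict:
    """Fail fast on traces that would silently corrupt the metrics."""
    missing = REQUIRED_FIELDS - trace.keys()
    if missing:
        raise ValueError(f"trace {trace.get('qid', '?')}: missing fields {missing}")
    if len(trace["ΔS"]) != len(trace["topk"]):
        raise ValueError(f"trace {trace['qid']}: ΔS must align one-to-one with topk")
    return trace
```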
## Offline evaluation (index only)

1. Run each paraphrase against the shadow index.
2. For each qid, compute:
   - P@k: relevant ids ∩ top-k over k ∈ {1, 3, 5, 10}.
   - R@k: relevant ids covered by top-k.
   - Anchor hit: any retrieved id with `section_id == anchor_section`.
   - ΔS probes for each retrieved item.
3. Aggregate by content type using `type ∈ {prose, code, table, figure}` (a per-type sketch follows below).
4. Compare with the live index as a baseline and record deltas.
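A sketch of the per-type aggregation in step 3, computing Precision@k segmented by block type. It assumes the trace schema above, where every top-k item carries a `type` field:

```python
from collections import defaultdict

def precision_by_type(gold, logs, k=5):
    """Precision@k per content type, so code and table failures
    are not averaged away inside prose results."""
    hits, totals = defaultdict(int), defaultdict(int)
    for q in gold:
        rel = set(q["relevant"])
        for item in logs[q["qid"]]["topk"][:k]:
            totals[item["type"]] += 1
            hits[item["type"]] += item["id"] in rel
    return {t: hits[t] / totals[t] for t in totals}
```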
## Online shadow evaluation

- Mirror live questions to the shadow index.
- Require cite-first answers with the schema from retrieval-traceability.md.
- For each answer, verify that at least one citation matches a gold `relevant` id or the `anchor_section` (a mechanical check is sketched below).
- Log ΔS for the chosen citation and the final λ state after reasoning.
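That citation check can be mechanical. A minimal sketch, assuming block ids are prefixed by their `section_id` as in the examples on this page ("S.4.2.p.Bk011a" sits under anchor "S.4.2"):

```python
def citation_covers(citation: dict, gold_row: dict) -> bool:
    """True when the cited block is gold-relevant or falls inside
    the gold anchor section."""
    cid = citation["id"]
    return (cid in gold_row["relevant"]
            or cid == gold_row["anchor_section"]
            or cid.startswith(gold_row["anchor_section"] + "."))
```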
## Metrics definitions

Let G(q) be the set of relevant ids for q. Let R_k(q) be the ids in top-k.

- Precision@k = |G(q) ∩ R_k(q)| / |R_k(q)|
- Recall@k = |G(q) ∩ R_k(q)| / |G(q)|
- Coverage = fraction of questions where the answer cites at least one element in G(q) or any block within `anchor_section`.
- Citation accuracy = fraction where both `section_id` and byte offsets overlap the gold within a 30-byte window.
- Anchor proximity = average path distance in the title tree from the cited `section_id` to `anchor_section`, using the rules in title_hierarchy.md.
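For example, if G(q) = {A, B} and the top-5 list is [A, X, B, Y, Z], then Precision@5 = 2/5 = 0.40 and Recall@5 = 2/2 = 1.00; the question counts toward coverage only if the final answer actually cites A, B, or another block inside the anchor section.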
## Pass and fail gates

A shadow index is eligible for canary if:

- Coverage ≥ 0.70 on gold.
- Citation accuracy ≥ 0.95.
- ΔS median ≤ 0.40 and 90th percentile ≤ 0.55.
- Recall@5 does not drop more than 2 points absolute vs live.
- λ convergent on ≥ 95 percent of paraphrase triplets.
If any fail, return to chunk boundary checks in section_detection.md and typed block lifting in code_tables_blocks.md.
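These gates are easy to automate. A minimal sketch over the metrics dict produced by the evaluator below, plus the live index's Recall@5 and the λ-convergence rate from your paraphrase runs (both hypothetical inputs here):

```python
def eligible_for_canary(m, live_recall_at_5, lambda_convergent_rate):
    """All gates must hold; any single failure blocks the canary."""
    return all([
        m["coverage"] >= 0.70,
        m["citation_accuracy"] >= 0.95,
        m["ΔS_med"] is not None and m["ΔS_med"] <= 0.40,
        m["ΔS_p90"] is not None and m["ΔS_p90"] <= 0.55,
        m["R@k"] >= live_recall_at_5 - 0.02,  # max 2-point absolute drop
        lambda_convergent_rate >= 0.95,
    ])
```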
## Diagnosis map
- High similarity yet wrong meaning → Embedding ≠ Semantic
- Order flips across runs → Rerankers
- Boundary leaks or mixed topics in a chunk → revisit section_detection.md
- Tables or code referenced as plain text → code_tables_blocks.md
- OCR drift and offset mismatch → pdf_layouts_and_ocr.md
- Index rebuilt, citations break → reindex_migration.md
## Minimal evaluator pseudocode

```python
from statistics import mean, median

def percentile(values, pct):
    """Nearest-rank percentile, to avoid a numpy dependency."""
    vals = sorted(values)
    idx = min(len(vals) - 1, round(pct / 100 * (len(vals) - 1)))
    return vals[idx]

def score_run(gold, logs, k=5):
    # Assumed project helpers: section_of(block_id) maps an id such as
    # "S.4.2.p.Bk011a" to its section_id "S.4.2"; gold_offsets(block_id)
    # returns the frozen byte offsets for a gold block; overlaps(a, b)
    # tests the 30-byte overlap window from the citation accuracy rule.
    p_hits, r_hits, cov_hits, cite_ok = [], [], 0, 0
    ds_med, ds_90 = [], []
    for q in gold:  # q: {qid, paraphrases, relevant, anchor_section}
        items = logs[q["qid"]]["topk"][:k]
        got = {it["id"] for it in items}
        rel = set(q["relevant"])
        p_hits.append(len(got & rel) / max(1, len(items)))
        r_hits.append(len(got & rel) / max(1, len(rel)))
        ds = logs[q["qid"]]["ΔS"][:k]
        if ds:
            ds_med.append(median(ds))
            ds_90.append(percentile(ds, 90))
        # coverage and citation accuracy from the final answer's first citation
        ans = logs[q["qid"]].get("answer_citations", [])
        if ans:
            cited, off = ans[0]["id"], ans[0]["offsets"]
            if cited in rel or section_of(cited) == q["anchor_section"]:
                cov_hits += 1
            if cited in rel and overlaps(off, gold_offsets(cited)):
                cite_ok += 1
    return {
        "P@k": mean(p_hits),
        "R@k": mean(r_hits),
        "coverage": cov_hits / len(gold),
        "citation_accuracy": cite_ok / len(gold),
        "ΔS_med": median(ds_med) if ds_med else None,
        "ΔS_p90": median(ds_90) if ds_90 else None,  # median of per-question p90s
    }
```
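Hypothetical usage, assuming gold.json holds a list of gold rows and logs.jsonl holds one trace per line keyed by qid:

```python
import json

with open("gold.json") as f:
    gold = json.load(f)

logs = {}
with open("logs.jsonl") as f:
    for line in f:
        trace = json.loads(line)
        logs[trace["qid"]] = trace

print(score_run(gold, logs, k=5))
```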
## Common pitfalls
- Evaluating answers without enforcing cite-first. You cannot measure coverage reliably. Fix the contract in data-contracts.md.
- Mixing normalizers between builds. Offsets will not compare. Lock the same whitespace and hyphen rules as in pdf_layouts_and_ocr.md.
- Ignoring content types. Aggregates hide failures in code or tables. Segment metrics by `type`.
- k too small for long documents. Use k ∈ {5, 10} when sections are dense.
- Comparing across different rerankers. Pin rerank during offline runs, then test rerankers separately in a controlled A/B.
## Copy-paste prompt for LLM-assisted scoring

```txt
You have TXT OS and the WFGY Problem Map.

Given:
- gold.json: gold questions with {qid, paraphrases[], relevant[], anchor_section}
- logs.jsonl: retriever traces with topk ids, ΔS per item, and answer_citations

Do:
1) Compute P@5, R@5, coverage, citation accuracy.
2) Report ΔS median and p90 for the cited snippet per question.
3) Flag any questions with coverage==0 or ΔS>0.60 and return their qids.
4) Summarize per-type breakdown for {prose, code, table, figure}.

Return compact JSON:
{ "P@5": 0.xx, "R@5": 0.xx, "coverage": 0.xx, "citation_accuracy": 0.xx,
  "ΔS_med": 0.xx, "ΔS_p90": 0.xx, "bad_qids": ["Q-..."], "by_type": {...} }
```
## 🔗 Quick-Start Downloads (60 sec)
| Tool | Link | 3-Step Setup |
|---|---|---|
| WFGY 1.0 PDF | Engine Paper | 1️⃣ Download · 2️⃣ Upload to your LLM · 3️⃣ Ask “Answer using WFGY + <your question>” |
| TXT OS (plain-text OS) | TXTOS.txt | 1️⃣ Download · 2️⃣ Paste into any LLM chat · 3️⃣ Type “hello world” — OS boots instantly |
## Explore More
| Layer | Page | What it’s for |
|---|---|---|
| ⭐ Proof | WFGY Recognition Map | External citations, integrations, and ecosystem proof |
| ⚙️ Engine | WFGY 1.0 | Original PDF tension engine and early logic sketch (legacy reference) |
| ⚙️ Engine | WFGY 2.0 | Production tension kernel for RAG and agent systems |
| ⚙️ Engine | WFGY 3.0 | TXT based Singularity tension engine (131 S class set) |
| 🗺️ Map | Problem Map 1.0 | Flagship 16 problem RAG failure taxonomy and fix map |
| 🗺️ Map | Problem Map 2.0 | Global Debug Card for RAG and agent pipeline diagnosis |
| 🗺️ Map | Problem Map 3.0 | Global AI troubleshooting atlas and failure pattern map |
| 🧰 App | TXT OS | .txt semantic OS with fast bootstrap |
| 🧰 App | Blah Blah Blah | Abstract and paradox Q&A built on TXT OS |
| 🧰 App | Blur Blur Blur | Text to image generation with semantic control |
| 🏡 Onboarding | Starter Village | Guided entry point for new users |
If this repository helped, starring it improves discovery so more builders can find the docs and tools.