# RAG precision/recall evaluation

A compact, repeatable harness for measuring retrieval precision, recall, and coverage after you change chunking, OCR, or indexing. This page also defines ΔS and λ probes so you can gate rollouts with hard numbers.

## Open these first

- Chunk ids and stability: [chunk_id_schema.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Chunking/chunk_id_schema.md)
- Title tree numbering: [title_hierarchy.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Chunking/title_hierarchy.md)
- Section boundary rules: [section_detection.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Chunking/section_detection.md)
- Typed blocks (code, tables, figures): [code_tables_blocks.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Chunking/code_tables_blocks.md)
- PDF, layout, OCR normalization: [pdf_layouts_and_ocr.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Chunking/pdf_layouts_and_ocr.md)
- Traceable results and cite schema: [retrieval-traceability.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/retrieval-traceability.md)
- Payload contracts for RAG: [data-contracts.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/data-contracts.md)
- Visual recovery map and ops: [rag-architecture-and-recovery.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/rag-architecture-and-recovery.md)

## What this measures

- **Precision@k**: fraction of the top-k retrieved snippets that are relevant to the question.
- **Recall@k**: fraction of all relevant snippets that appear in the top-k.
- **Coverage**: proportion of questions whose final answer can be justified by at least one cited snippet.
- **Citation accuracy**: percentage of answers where `section_id` and offsets match the gold.
- **ΔS(question, retrieved)**: semantic distance. Stable < 0.40, transitional 0.40–0.60, risk ≥ 0.60.
- **λ_observe**: convergence state across paraphrases and seeds.

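This page describes ΔS only as a semantic distance, so the probe below is a minimal sketch under a stated assumption: ΔS is taken to be one minus the cosine similarity of the question and snippet embeddings. The names `delta_s` and `delta_s_band` are illustrative, and the default cut points 0.40 and 0.60 are the edges of the transitional band listed above. Swap in your own distance if your probe is defined differently.

```python
import math


def delta_s(question_vec, snippet_vec):
    """Assumed ΔS: 1 - cosine similarity of two embedding vectors."""
    dot = sum(x * y for x, y in zip(question_vec, snippet_vec))
    norm = math.sqrt(sum(x * x for x in question_vec)) * math.sqrt(
        sum(y * y for y in snippet_vec)
    )
    if norm == 0:
        raise ValueError("cannot compare a zero-length embedding")
    return 1.0 - dot / norm


def delta_s_band(value, lower=0.40, upper=0.60):
    """Map a ΔS value to a band using the transitional edges as cut points."""
    if value >= upper:
        return "risk"
    if value >= lower:
        return "transitional"
    return "stable"


if __name__ == "__main__":
    # Toy 2-d vectors, not real embeddings.
    print(delta_s_band(delta_s([1.0, 0.0], [1.0, 0.0])))  # stable
    print(delta_s_band(delta_s([1.0, 0.0], [0.0, 1.0])))  # risk
```

Keeping the cut points as parameters makes it easy to tighten them for a specific gate without touching the distance function.
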
These metrics tell you if a chunking or index change helps retrieval without breaking traceability.

## Acceptance targets

- Coverage ≥ 0.70 on the project’s gold set.
- ΔS(question, retrieved) ≤ 0.45 for the cited snippet of each answered item.
- Citation accuracy ≥ 0.95 for `section_id` + offsets.
- λ remains convergent on three paraphrases and two seeds.
- No drop in Recall@k compared with the previous index beyond 2 percent absolute.

---

## Gold set construction

1) **Scope 200–400 items** that span headings, code regions, tables, and prose.
2) **Write three paraphrases** per question with identical intent.
3) **Annotate relevant blocks** using canonical ids from [chunk_id_schema.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Chunking/chunk_id_schema.md).
4) **Mark hard negatives** near the true section to test boundary quality.
5) **Freeze** the canonical text and store byte offsets after normalization from [pdf_layouts_and_ocr.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Chunking/pdf_layouts_and_ocr.md).

Gold rows should look like:

```json
{
  "qid": "Q-0137",
  "paraphrases": [
    "How does SCU unlock safety refusals?",
    "Explain symbolic constraint unlock.",
    "SCU: what is it and when to use?"
  ],
  "relevant": ["S.4.2.p.Bk011a", "S.4.2.p.Bk011b"],
  "anchor_section": "S.4.2",
  "negatives": ["S.4.1.p.Bk010", "S.4.3.p.Bk014"]
}
```

---

## Logging schema for evaluation

Your retriever must emit a trace per query. Use the fields defined in [retrieval-traceability.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/retrieval-traceability.md) and [data-contracts.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/data-contracts.md).

```json
{
  "qid": "Q-0137",
  "query": "Explain symbolic constraint unlock.",
  "topk": [
    {"id": "S.4.2.p.Bk011a", "score": 0.83, "offsets": [204611, 205279], "type": "prose"},
    {"id": "S.4.1.p.Bk010", "score": 0.79, "offsets": [198002, 199112], "type": "prose"}
  ],
  "ΔS": [0.31, 0.59],
  "λ_state": "→",
  "anchor": "S.4.2",
  "index_hash": "faiss:v3:hnsw:cos",
  "ts": "2025-08-27T12:30:22Z"
}
```

---

## Offline evaluation (index only)

1. Run each paraphrase against the **shadow index**.
2. For each qid, compute:

   * **P@k**: share of the top-k ids that are in the relevant set, for k ∈ {1, 3, 5, 10}.
   * **R@k**: share of the relevant ids covered by the top-k.
   * **Anchor hit**: any retrieved id with `section_id == anchor_section`.
   * **ΔS probes** for each retrieved item.
3. Aggregate by content type using `type ∈ {prose, code, table, figure}`.
4. Compare with the **live index** as a baseline and record deltas.

---

## Online shadow evaluation

* Mirror live questions to the shadow index.
* Require **cite-first** answers with the schema from [retrieval-traceability.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/retrieval-traceability.md).
* For each answer, verify that at least one citation matches a gold `relevant` id or the `anchor_section`.
* Log ΔS for the chosen citation and the final λ state after reasoning.

---

## Metrics definitions

Let `G(q)` be the set of relevant ids for q. Let `R_k(q)` be the ids in top-k.

* **Precision@k** = `|G(q) ∩ R_k(q)| / |R_k(q)|`
* **Recall@k** = `|G(q) ∩ R_k(q)| / |G(q)|`
* **Coverage** = fraction of questions where the answer cites at least one element in `G(q)` or any block within `anchor_section`.
* **Citation accuracy** = fraction of answers where the `section_id` matches the gold and the byte offsets overlap the gold within a 30-byte window.
* **Anchor proximity** = average path distance in the title tree from the cited `section_id` to `anchor_section` using rules in [title_hierarchy.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Chunking/title_hierarchy.md).

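The definitions above can be sketched for a single question as follows. This is an illustration under stated assumptions, not the project's reference implementation: the function names are hypothetical, and `tree_distance` assumes section ids are dotted paths such as `S.4.2`, where each component is one level of the title tree. The actual numbering rules live in title_hierarchy.md and may differ.

```python
def precision_recall_at(relevant, ranked_ids, ks=(1, 3, 5, 10)):
    """Return {k: (precision, recall)} for one question.

    Precision divides by the number of ids actually returned in the top-k,
    which is |R_k(q)| in the definition above.
    """
    rel = set(relevant)
    out = {}
    for k in ks:
        top = ranked_ids[:k]
        hits = len(rel & set(top))
        precision = hits / len(top) if top else 0.0
        recall = hits / len(rel) if rel else 0.0
        out[k] = (precision, recall)
    return out


def tree_distance(section_a, section_b):
    """Path length between two title-tree nodes (assumes dotted-path ids)."""
    a, b = section_a.split("."), section_b.split(".")
    shared = 0
    for x, y in zip(a, b):
        if x != y:
            break
        shared += 1
    # Steps up from a to the common ancestor, then down to b.
    return (len(a) - shared) + (len(b) - shared)


def anchor_proximity(pairs):
    """Average tree distance over (cited_section, anchor_section) pairs."""
    distances = [tree_distance(cited, anchor) for cited, anchor in pairs]
    return sum(distances) / len(distances) if distances else None


if __name__ == "__main__":
    # Ids taken from the sample gold row and trace on this page.
    relevant = ["S.4.2.p.Bk011a", "S.4.2.p.Bk011b"]
    ranked = ["S.4.2.p.Bk011a", "S.4.1.p.Bk010"]
    print(precision_recall_at(relevant, ranked, ks=(1, 3)))
    print(anchor_proximity([("S.4.2", "S.4.2"), ("S.4.1", "S.4.2")]))
```

With the sample rows, one of the two relevant blocks is retrieved, so recall stays at 0.5 for every k, while precision falls from 1.0 at k=1 to 0.5 once the hard negative enters the list.
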
---

## Pass and fail gates

A shadow index is **eligible for canary** if:

* Coverage ≥ 0.70 on gold.
* Citation accuracy ≥ 0.95.
* ΔS median ≤ 0.40 and 90th percentile ≤ 0.55.
* Recall@5 does not drop more than 2 percentage points absolute vs live.
* λ convergent on ≥ 95 percent of paraphrase triplets.

If any gate fails, return to chunk boundary checks in [section_detection.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Chunking/section_detection.md) and typed block lifting in [code_tables_blocks.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Chunking/code_tables_blocks.md).

---

## Diagnosis map

* **High similarity yet wrong meaning** → [Embedding ≠ Semantic](https://github.com/onestardao/WFGY/blob/main/ProblemMap/embedding-vs-semantic.md)
* **Order flips across runs** → [Rerankers](https://github.com/onestardao/WFGY/blob/main/ProblemMap/rerankers.md)
* **Boundary leaks or mixed topics in a chunk** → revisit [section_detection.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Chunking/section_detection.md)
* **Tables or code referenced as plain text** → [code_tables_blocks.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Chunking/code_tables_blocks.md)
* **OCR drift and offset mismatch** → [pdf_layouts_and_ocr.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Chunking/pdf_layouts_and_ocr.md)
* **Index rebuilt, citations break** → [reindex_migration.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Chunking/reindex_migration.md)

---

## Minimal evaluator pseudocode

The helpers `percentile`, `section_of`, and `overlaps` are minimal stand-ins, and `gold_offsets` is assumed to map each gold chunk id to its frozen `[start, end]` byte offsets. Adapt them to your id schema and normalizer.

```python
import math
from statistics import mean, median


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    return ordered[max(1, math.ceil(pct / 100 * len(ordered))) - 1]


def section_of(chunk_id):
    # Assumption: ids look like "S.4.2.p.Bk011a"; the section is the part before ".p.".
    return chunk_id.split(".p.")[0]


def overlaps(a, b, window=30):
    # One reading of the 30-byte window: ranges match if they overlap once widened by `window`.
    return a[0] <= b[1] + window and b[0] <= a[1] + window


def score_run(gold, logs, gold_offsets, k=5):
    """gold: list of gold rows, logs: {qid: trace}, gold_offsets: {chunk_id: [start, end]}."""
    p_hits, r_hits, cov_hits, cite_ok = [], [], 0, 0
    ds_med, ds_90 = [], []
    for q in gold:  # q has qid, paraphrases, relevant, anchor_section
        trace = logs[q["qid"]]
        items = trace["topk"][:k]
        got = {it["id"] for it in items}
        rel = set(q["relevant"])
        p_hits.append(len(got & rel) / max(1, len(items)))
        r_hits.append(len(got & rel) / max(1, len(rel)))
        ds = trace.get("ΔS", [])[:k]
        if ds:
            ds_med.append(median(ds))
            ds_90.append(percentile(ds, 90))
        # coverage and citation accuracy from the final answer's first citation
        ans = trace.get("answer_citations", [])
        if ans:
            cited, off = ans[0]["id"], ans[0]["offsets"]
            if cited in rel or section_of(cited) == q["anchor_section"]:
                cov_hits += 1
            if cited in rel and cited in gold_offsets and overlaps(off, gold_offsets[cited]):
                cite_ok += 1
    n = max(1, len(gold))
    return {
        "P@k": mean(p_hits) if p_hits else None,
        "R@k": mean(r_hits) if r_hits else None,
        "coverage": cov_hits / n,
        "citation_accuracy": cite_ok / n,
        # Both ΔS figures are medians, across questions, of the per-question statistic.
        "ΔS_med": median(ds_med) if ds_med else None,
        "ΔS_p90": median(ds_90) if ds_90 else None,
    }
```

---

## Common pitfalls

* **Evaluating answers without enforcing cite-first**. You cannot measure coverage reliably. Fix the contract in [data-contracts.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/data-contracts.md).
* **Mixing normalizers between builds**. Offsets will not compare. Lock the same whitespace and hyphen rules as in [pdf_layouts_and_ocr.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Chunking/pdf_layouts_and_ocr.md).
* **Ignoring content types**. Aggregates hide failures in code or tables. Segment metrics by `type`.
* **k too small for long documents**. Use k ∈ {5, 10} when sections are dense.
* **Comparing across different rerankers**. Pin the reranker during offline runs, then test rerankers separately in a controlled A/B.

---

## Copy-paste prompt for LLM-assisted scoring

```
You have TXT OS and the WFGY Problem Map.

Given:
- gold.json: gold questions with {qid, paraphrases[], relevant[], anchor_section}
- logs.jsonl: retriever traces with topk ids, ΔS per item, and answer_citations

Do:
1) Compute P@5, R@5, coverage, citation accuracy.
2) Report ΔS median and p90 for the cited snippet per question.
3) Flag any questions with coverage==0 or ΔS>0.60 and return their qids.
4) Summarize per-type breakdown for {prose, code, table, figure}.

Return compact JSON:
{
  "P@5": 0.xx,
  "R@5": 0.xx,
  "coverage": 0.xx,
  "citation_accuracy": 0.xx,
  "ΔS_med": 0.xx,
  "ΔS_p90": 0.xx,
  "bad_qids": ["Q-..."],
  "by_type": {...}
}
```

---

### 🔗 Quick-Start Downloads (60 sec)

| Tool | Link | 3-Step Setup |
| --- | --- | --- |
| **WFGY 1.0 PDF** | [Engine Paper](https://github.com/onestardao/WFGY/blob/main/I_am_not_lizardman/WFGY_All_Principles_Return_to_One_v1.0_PSBigBig_Public.pdf) | 1️⃣ Download · 2️⃣ Upload to your LLM · 3️⃣ Ask “Answer using WFGY + \” |
| **TXT OS (plain-text OS)** | [TXTOS.txt](https://github.com/onestardao/WFGY/blob/main/OS/TXTOS.txt) | 1️⃣ Download · 2️⃣ Paste into any LLM chat · 3️⃣ Type “hello world”; the OS boots instantly |

---

### 🧭 Explore More

| Module | Description | Link |
| --- | --- | --- |
| WFGY Core | WFGY 2.0 engine is live: full symbolic reasoning architecture and math stack | [View →](https://github.com/onestardao/WFGY/tree/main/core/README.md) |
| Problem Map 1.0 | Initial 16-mode diagnostic and symbolic fix framework | [View →](https://github.com/onestardao/WFGY/tree/main/ProblemMap/README.md) |
| Problem Map 2.0 | RAG-focused failure tree, modular fixes, and pipelines | [View →](https://github.com/onestardao/WFGY/blob/main/ProblemMap/rag-architecture-and-recovery.md) |
| Semantic Clinic Index | Expanded failure catalog: prompt injection, memory bugs, logic drift | [View →](https://github.com/onestardao/WFGY/blob/main/ProblemMap/SemanticClinicIndex.md) |
| Semantic Blueprint | Layer-based symbolic reasoning & semantic modulations | [View →](https://github.com/onestardao/WFGY/tree/main/SemanticBlueprint/README.md) |
| Benchmark vs GPT-5 | Stress test GPT-5 with full WFGY reasoning suite | [View →](https://github.com/onestardao/WFGY/tree/main/benchmarks/benchmark-vs-gpt5/README.md) |
| 🧙‍♂️ Starter Village 🏡 | New here? Lost in symbols? Click here and let the wizard guide you through | [Start →](https://github.com/onestardao/WFGY/blob/main/StarterVillage/README.md) |

---

> 👑 **Early Stargazers: [See the Hall of Fame](https://github.com/onestardao/WFGY/tree/main/stargazers)**
> Engineers, hackers, and open source builders who supported WFGY from day one.

> ⭐ [WFGY Engine 2.0](https://github.com/onestardao/WFGY/blob/main/core/README.md) is already unlocked. ⭐ Star the repo to help others discover it and unlock more on the [Unlock Board](https://github.com/onestardao/WFGY/blob/main/STAR_UNLOCKS.md).

[![WFGY Main](https://img.shields.io/badge/WFGY-Main-red?style=flat-square)](https://github.com/onestardao/WFGY)   [![TXT OS](https://img.shields.io/badge/TXT%20OS-Reasoning%20OS-orange?style=flat-square)](https://github.com/onestardao/WFGY/tree/main/OS)   [![Blah](https://img.shields.io/badge/Blah-Semantic%20Embed-yellow?style=flat-square)](https://github.com/onestardao/WFGY/tree/main/OS/BlahBlahBlah)   [![Blot](https://img.shields.io/badge/Blot-Persona%20Core-green?style=flat-square)](https://github.com/onestardao/WFGY/tree/main/OS/BlotBlotBlot)   [![Bloc](https://img.shields.io/badge/Bloc-Reasoning%20Compiler-blue?style=flat-square)](https://github.com/onestardao/WFGY/tree/main/OS/BlocBlocBloc)   [![Blur](https://img.shields.io/badge/Blur-Text2Image%20Engine-navy?style=flat-square)](https://github.com/onestardao/WFGY/tree/main/OS/BlurBlurBlur)   [![Blow](https://img.shields.io/badge/Blow-Game%20Logic-purple?style=flat-square)](https://github.com/onestardao/WFGY/tree/main/OS/BlowBlowBlow)  