# RAG precision/recall evaluation

<details>
<summary><strong>🧭 Quick Return to Map</strong></summary>

<br>

> You are in a sub-page of **Chunking**.
> To reorient, go back here:
>
> - [**Chunking** — text segmentation and context window management](./README.md)
> - [**WFGY Global Fix Map** — main Emergency Room, 300+ structured fixes](../README.md)
> - [**WFGY Problem Map 1.0** — 16 reproducible failure modes](../../README.md)
>
> Think of this page as a desk within a ward.
> If you need the full triage and all prescriptions, return to the Emergency Room lobby.

</details>
A compact, repeatable harness to measure retrieval precision, recall, and coverage after you change chunking, OCR, or indexing. This page also defines ΔS and λ probes so you can gate rollouts with hard numbers.

## Open these first

- Chunk ids and stability: [chunk_id_schema.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Chunking/chunk_id_schema.md)
- Title tree numbering: [title_hierarchy.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Chunking/title_hierarchy.md)
- Section boundary rules: [section_detection.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Chunking/section_detection.md)
- Typed blocks (code, tables, figures): [code_tables_blocks.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Chunking/code_tables_blocks.md)
- PDF, layout, OCR normalization: [pdf_layouts_and_ocr.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Chunking/pdf_layouts_and_ocr.md)
- Traceable results and cite schema: [retrieval-traceability.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/retrieval-traceability.md)
- Payload contracts for RAG: [data-contracts.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/data-contracts.md)
- Visual recovery map and ops: [rag-architecture-and-recovery.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/rag-architecture-and-recovery.md)
## What this measures

- **Precision@k**: fraction of retrieved snippets among top-k that truly answer the question.
- **Recall@k**: fraction of all relevant snippets that appear in top-k.
- **Coverage**: proportion of questions whose final answer can be justified by at least one cited snippet.
- **Citation accuracy**: percentage of answers where `section_id` and offsets match the gold.
- **ΔS(question, retrieved)**: semantic distance. Below 0.40 is stable, 0.40–0.60 is transitional, and ≥ 0.60 is risk; the acceptance target for a cited snippet is ≤ 0.45.
- **λ_observe**: convergence state across paraphrases and seeds.

These metrics tell you if a chunking or index change helps retrieval without breaking traceability.
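
As a concrete probe, the sketch below instantiates ΔS as 1 minus the cosine similarity between question and snippet embeddings; the `embed` call is a placeholder for whatever embedding model your pipeline already uses.

```python
import numpy as np

def delta_s(q_vec: np.ndarray, s_vec: np.ndarray) -> float:
    """ΔS probe: 1 - cosine similarity. Lower means semantically closer."""
    cos = float(np.dot(q_vec, s_vec) /
                (np.linalg.norm(q_vec) * np.linalg.norm(s_vec)))
    return 1.0 - cos

# ds = delta_s(embed(question), embed(snippet))  # embed() is your model, not defined here
# gate: ds <= 0.45 for the cited snippet (acceptance target)
```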
## Acceptance targets

- Coverage ≥ 0.70 on the project’s gold set.
- ΔS(question, retrieved) ≤ 0.45 for the cited snippet of each answered item.
- Citation accuracy ≥ 0.95 for `section_id` + offsets.
- λ remains convergent on three paraphrases and two seeds.
- Recall@k drops no more than 2 points absolute compared with the previous index.

---
## Gold set construction

1) **Scope 200–400 items** that span headings, code regions, tables, and prose.
2) **Write three paraphrases** per question with identical intent.
3) **Annotate relevant blocks** using canonical ids from [chunk_id_schema.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Chunking/chunk_id_schema.md).
4) **Mark hard negatives** near the true section to test boundary quality.
5) **Freeze** the canonical text and store byte offsets after normalization from [pdf_layouts_and_ocr.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Chunking/pdf_layouts_and_ocr.md).

Gold rows should look like:
```json
{
  "qid": "Q-0137",
  "paraphrases": [
    "How does SCU unlock safety refusals?",
    "Explain symbolic constraint unlock.",
    "SCU: what is it and when to use?"
  ],
  "relevant": ["S.4.2.p.Bk011a", "S.4.2.p.Bk011b"],
  "anchor_section": "S.4.2",
  "negatives": ["S.4.1.p.Bk010", "S.4.3.p.Bk014"]
}
```
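
Before freezing the set, it is worth linting every row. A minimal sketch, assuming gold rows follow the JSON shape above and live in a `gold.json` array (the filename matches the scoring prompt later on this page):

```python
import json

REQUIRED = ("qid", "paraphrases", "relevant", "anchor_section", "negatives")

def lint_gold_row(row: dict) -> list[str]:
    """Return the problems found in one gold row; an empty list means it passes."""
    problems = [f"missing field: {f}" for f in REQUIRED if f not in row]
    if len(row.get("paraphrases", [])) != 3:
        problems.append("expected exactly 3 paraphrases")
    if not row.get("relevant"):
        problems.append("no relevant ids annotated")
    both = set(row.get("relevant", [])) & set(row.get("negatives", []))
    if both:
        problems.append(f"ids marked both relevant and negative: {sorted(both)}")
    return problems

with open("gold.json") as f:
    for row in json.load(f):
        for problem in lint_gold_row(row):
            print(row.get("qid", "?"), problem)
```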
---

## Logging schema for evaluation

Your retriever must emit a trace per query. Use the fields defined in [retrieval-traceability.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/retrieval-traceability.md) and [data-contracts.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/data-contracts.md).
```json
{
  "qid": "Q-0137",
  "query": "Explain symbolic constraint unlock.",
  "topk": [
    {"id": "S.4.2.p.Bk011a", "score": 0.83, "offsets": [204611, 205279], "type": "prose"},
    {"id": "S.4.1.p.Bk010", "score": 0.79, "offsets": [198002, 199112], "type": "prose"}
  ],
  "ΔS": [0.31, 0.59],
  "λ_state": "→",
  "anchor": "S.4.2",
  "index_hash": "faiss:v3:hnsw:cos",
  "ts": "2025-08-27T12:30:22Z"
}
```
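
A minimal loader sketch for these traces, assuming one JSON object per line (`logs.jsonl`, as named in the scoring prompt below) with the fields shown above:

```python
import json

def load_traces(path: str) -> dict:
    """Index retriever traces by qid, checking the fields the evaluator needs."""
    required = ("qid", "topk", "ΔS")
    traces = {}
    with open(path) as f:
        for n, line in enumerate(f, 1):
            rec = json.loads(line)
            missing = [k for k in required if k not in rec]
            if missing:
                raise ValueError(f"line {n}: missing {missing}")
            if len(rec["ΔS"]) != len(rec["topk"]):
                raise ValueError(f"line {n}: ΔS list must align with topk")
            traces[rec["qid"]] = rec
    return traces
```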
---

## Offline evaluation (index only)

1. Run each paraphrase against the **shadow index**.
2. For each qid, compute:

   * **P@k**: relevant ids ∩ top-k over k ∈ {1, 3, 5, 10}.
   * **R@k**: relevant ids covered by top-k.
   * **Anchor hit**: any retrieved id with `section_id == anchor_section`.
   * **ΔS probes** for each retrieved item.

3. Aggregate by content type using `type ∈ {prose, code, table, figure}` (see the group-by sketch after this list).
4. Compare with the **live index** as a baseline and record deltas.
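
Step 3's per-type breakdown is a plain group-by over trace items. A minimal sketch, assuming gold rows and traces shaped like the JSON examples above:

```python
from collections import defaultdict

def per_type_hit_rate(gold: list, traces: dict, k: int = 5) -> dict:
    """Fraction of top-k items that are gold-relevant, grouped by content type."""
    hits, totals = defaultdict(int), defaultdict(int)
    for q in gold:
        rel = set(q["relevant"])
        for item in traces[q["qid"]]["topk"][:k]:
            totals[item["type"]] += 1
            hits[item["type"]] += item["id"] in rel
    return {t: hits[t] / totals[t] for t in totals}
```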
---

## Online shadow evaluation

* Mirror live questions to the shadow index.
* Require **cite-first** answers with the schema from [retrieval-traceability.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/retrieval-traceability.md).
* For each answer, verify that at least one citation matches a gold `relevant` id or the `anchor_section`.
* Log ΔS for the chosen citation and the final λ state after reasoning.
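
That per-answer check reduces to one predicate. A minimal sketch, assuming block ids extend their section id as in the examples above (e.g. `S.4.2.p.Bk011a` sits under `S.4.2`):

```python
def citation_covers(cited_id: str, relevant: set, anchor_section: str) -> bool:
    """True if the cited block is gold-relevant or sits inside the anchor section."""
    return cited_id in relevant or cited_id.startswith(anchor_section + ".")
```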
---

## Metrics definitions

Let `G(q)` be the set of relevant ids for q. Let `R_k(q)` be the ids in top-k.

* **Precision@k** = |G(q) ∩ R_k(q)| / |R_k(q)|
* **Recall@k** = |G(q) ∩ R_k(q)| / |G(q)|
* **Coverage** = fraction of questions where the answer cites at least one element in `G(q)` or any block within `anchor_section`.
* **Citation accuracy** = fraction where both `section_id` and byte offsets overlap the gold within a 30-byte window.
* **Anchor proximity** = average path distance in the title tree from the cited `section_id` to `anchor_section`, using the rules in [title_hierarchy.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Chunking/title_hierarchy.md).
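
Worked example: if `G(q) = {A, B}` and the top-5 ids are `[A, X, Y, B, Z]`, then Precision@5 = 2/5 = 0.40 and Recall@5 = 2/2 = 1.00.
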
---

## Pass and fail gates

A shadow index is **eligible for canary** if:

* Coverage ≥ 0.70 on gold.
* Citation accuracy ≥ 0.95.
* ΔS median ≤ 0.40 and 90th percentile ≤ 0.55.
* Recall@5 does not drop more than 2 points absolute vs live.
* λ convergent on ≥ 95 percent of paraphrase triplets.
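
A minimal sketch of that gate, assuming the metrics dict returned by `score_run` below, plus the live baseline Recall@5 and a λ convergence rate you compute separately:

```python
def eligible_for_canary(m: dict, live_recall_at_5: float, lambda_rate: float) -> bool:
    """Apply the canary gates to a shadow run; any failed clause blocks promotion."""
    return (
        m["coverage"] >= 0.70
        and m["citation_accuracy"] >= 0.95
        and m["ΔS_med"] is not None and m["ΔS_med"] <= 0.40
        and m["ΔS_p90"] is not None and m["ΔS_p90"] <= 0.55
        and m["R@k"] >= live_recall_at_5 - 0.02
        and lambda_rate >= 0.95
    )
```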
If any fail, return to chunk boundary checks in [section_detection.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Chunking/section_detection.md) and typed block lifting in [code_tables_blocks.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Chunking/code_tables_blocks.md).

---
## Diagnosis map

* **High similarity yet wrong meaning** → [Embedding ≠ Semantic](https://github.com/onestardao/WFGY/blob/main/ProblemMap/embedding-vs-semantic.md)
* **Order flips across runs** → [Rerankers](https://github.com/onestardao/WFGY/blob/main/ProblemMap/rerankers.md)
* **Boundary leaks or mixed topics in a chunk** → revisit [section_detection.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Chunking/section_detection.md)
* **Tables or code referenced as plain text** → [code_tables_blocks.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Chunking/code_tables_blocks.md)
* **OCR drift and offset mismatch** → [pdf_layouts_and_ocr.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Chunking/pdf_layouts_and_ocr.md)
* **Index rebuilt, citations break** → [reindex_migration.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Chunking/reindex_migration.md)

---
## Minimal evaluator pseudocode

```python
from statistics import mean, median

def percentile(xs, p):
    """Nearest-rank percentile, adequate for gating on small lists."""
    xs = sorted(xs)
    i = max(0, min(len(xs) - 1, round(p / 100 * (len(xs) - 1))))
    return xs[i]

def section_of(block_id):
    """Section id of a prose block; assumes the S.<sec>.p.Bk<nnn> grammar."""
    return block_id.split(".p.")[0]

def overlaps(a, b, window=30):
    """Byte ranges match when both ends land within the 30-byte window."""
    return abs(a[0] - b[0]) <= window and abs(a[1] - b[1]) <= window

def score_run(gold, logs, gold_offsets, k=5):
    """gold: rows shaped like gold.json; logs: traces keyed by qid;
    gold_offsets: map from block id to frozen (start, end) byte offsets."""
    p_hits, r_hits, cov_hits, cite_ok = [], [], 0, 0
    ds_med, ds_90 = [], []

    for q in gold:  # q has qid, paraphrases, relevant, anchor_section
        items = logs[q["qid"]]["topk"][:k]
        got = {it["id"] for it in items}
        rel = set(q["relevant"])

        p_hits.append(len(got & rel) / max(1, len(items)))
        r_hits.append(len(got & rel) / max(1, len(rel)))

        ds = logs[q["qid"]]["ΔS"][:k]
        if ds:
            ds_med.append(median(ds))
            ds_90.append(percentile(ds, 90))

        # coverage and citation accuracy from the final answer's first citation
        ans = logs[q["qid"]].get("answer_citations", [])
        if ans:
            cited, off = ans[0]["id"], ans[0]["offsets"]
            if cited in rel or section_of(cited) == q["anchor_section"]:
                cov_hits += 1
            if cited in rel and overlaps(off, gold_offsets[cited]):
                cite_ok += 1

    return {
        "P@k": mean(p_hits),
        "R@k": mean(r_hits),
        "coverage": cov_hits / len(gold),
        "citation_accuracy": cite_ok / len(gold),
        "ΔS_med": median(ds_med) if ds_med else None,
        "ΔS_p90": median(ds_90) if ds_90 else None,
    }
```
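
To run it end to end, a sketch along these lines, where `gold.json` and `logs.jsonl` follow the shapes above and `gold_offsets.json` is a hypothetical companion file mapping each block id to its frozen `[start, end]` offsets:

```python
import json

gold = json.load(open("gold.json"))
logs = load_traces("logs.jsonl")  # loader sketched in the logging section
offsets = {bid: tuple(span) for bid, span in json.load(open("gold_offsets.json")).items()}
print(score_run(gold, logs, offsets, k=5))
```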
---

## Common pitfalls

* **Evaluating answers without enforcing cite-first**. You cannot measure coverage reliably. Fix the contract in [data-contracts.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/data-contracts.md).
* **Mixing normalizers between builds**. Offsets will not compare. Lock the same whitespace and hyphen rules as in [pdf_layouts_and_ocr.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Chunking/pdf_layouts_and_ocr.md).
* **Ignoring content types**. Aggregates hide failures in code or tables. Segment metrics by `type`.
* **k too small for long documents**. Use k ∈ {5, 10} when sections are dense.
* **Comparing across different rerankers**. Pin the reranker during offline runs, then test rerankers separately in a controlled A/B.

---
## Copy-paste prompt for LLM-assisted scoring

```
You have TXT OS and the WFGY Problem Map.

Given:
- gold.json: gold questions with {qid, paraphrases[], relevant[], anchor_section}
- logs.jsonl: retriever traces with topk ids, ΔS per item, and answer_citations

Do:
1) Compute P@5, R@5, coverage, citation accuracy.
2) Report ΔS median and p90 for the cited snippet per question.
3) Flag any questions with coverage==0 or ΔS>0.60 and return their qids.
4) Summarize per-type breakdown for {prose, code, table, figure}.

Return compact JSON:
{ "P@5": 0.xx, "R@5": 0.xx, "coverage": 0.xx, "citation_accuracy": 0.xx,
  "ΔS_med": 0.xx, "ΔS_p90": 0.xx, "bad_qids": ["Q-..."], "by_type": {...} }
```
---

### 🔗 Quick-Start Downloads (60 sec)

| Tool | Link | 3-Step Setup |
| --- | --- | --- |
| **WFGY 1.0 PDF** | [Engine Paper](https://github.com/onestardao/WFGY/blob/main/I_am_not_lizardman/WFGY_All_Principles_Return_to_One_v1.0_PSBigBig_Public.pdf) | 1️⃣ Download · 2️⃣ Upload to your LLM · 3️⃣ Ask “Answer using WFGY + \<your question>” |
| **TXT OS (plain-text OS)** | [TXTOS.txt](https://github.com/onestardao/WFGY/blob/main/OS/TXTOS.txt) | 1️⃣ Download · 2️⃣ Paste into any LLM chat · 3️⃣ Type “hello world” — OS boots instantly |

---
<!-- WFGY_FOOTER_START -->

### Explore More

| Layer | Page | What it’s for |
| --- | --- | --- |
| ⭐ Proof | [WFGY Recognition Map](/recognition/README.md) | External citations, integrations, and ecosystem proof |
| ⚙️ Engine | [WFGY 1.0](/legacy/README.md) | Original PDF tension engine and early logic sketch (legacy reference) |
| ⚙️ Engine | [WFGY 2.0](/core/README.md) | Production tension kernel for RAG and agent systems |
| ⚙️ Engine | [WFGY 3.0](/TensionUniverse/EventHorizon/README.md) | TXT based Singularity tension engine (131 S class set) |
| 🗺️ Map | [Problem Map 1.0](/ProblemMap/README.md) | Flagship 16 problem RAG failure taxonomy and fix map |
| 🗺️ Map | [Problem Map 2.0](/ProblemMap/wfgy-rag-16-problem-map-global-debug-card.md) | Global Debug Card for RAG and agent pipeline diagnosis |
| 🗺️ Map | [Problem Map 3.0](/ProblemMap/wfgy-ai-problem-map-troubleshooting-atlas.md) | Global AI troubleshooting atlas and failure pattern map |
| 🧰 App | [TXT OS](/OS/README.md) | .txt semantic OS with fast bootstrap |
| 🧰 App | [Blah Blah Blah](/OS/BlahBlahBlah/README.md) | Abstract and paradox Q&A built on TXT OS |
| 🧰 App | [Blur Blur Blur](/OS/BlurBlurBlur/README.md) | Text to image generation with semantic control |
| 🏡 Onboarding | [Starter Village](/StarterVillage/README.md) | Guided entry point for new users |
If this repository helped, starring it improves discovery so more builders can find the docs and tools.

[Star WFGY on GitHub](https://github.com/onestardao/WFGY)

<!-- WFGY_FOOTER_END -->