WFGY/ProblemMap/GlobalFixMap/Chunking/eval_rag_precision_recall.md


RAG precision/recall evaluation

🧭 Quick Return to Map

You are in a sub-page of Chunking.
To reorient, go back here:

Think of this page as a desk within a ward.
If you need the full triage and all prescriptions, return to the Emergency Room lobby.

A compact, repeatable harness to measure retrieval precision, recall, and coverage after you change chunking, OCR, or indexing. This page also defines ΔS and λ probes so you can gate rollouts with hard numbers.

Open these first

What this measures

  • Precision@k: fraction of retrieved snippets among top-k that truly answer the question.
  • Recall@k: fraction of all relevant snippets that appear in top-k.
  • Coverage: proportion of questions whose final answer can be justified by at least one cited snippet.
  • Citation accuracy: percentage of answers where section_id and offsets match the gold.
  • ΔS(question, retrieved): semantic distance. Stable ≤ 0.45, transitional 0.40 to 0.60, risk ≥ 0.60.
  • λ_observe: convergence state across paraphrases and seeds.

These metrics tell you if a chunking or index change helps retrieval without breaking traceability.
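The probes need a concrete ΔS implementation. A minimal sketch, assuming ΔS is approximated as 1 minus the cosine similarity between the question embedding and the retrieved snippet embedding (how you embed is up to your pipeline; delta_s is an illustrative name):

import numpy as np

def delta_s(question_vec, snippet_vec):
    # ΔS approximated as 1 - cosine similarity; lower means the snippet sits closer to the question.
    q = np.asarray(question_vec, dtype=float)
    s = np.asarray(snippet_vec, dtype=float)
    cos = float(q @ s / (np.linalg.norm(q) * np.linalg.norm(s) + 1e-12))
    return 1.0 - cos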

Acceptance targets

  • Coverage ≥ 0.70 on the project's gold set.
  • ΔS(question, retrieved) ≤ 0.45 for the cited snippet of each answered item.
  • Citation accuracy ≥ 0.95 for section_id + offsets.
  • λ remains convergent on three paraphrases and two seeds.
  • Recall@k does not drop by more than 2 points absolute compared with the previous index.

Gold set construction

  1. Scope 200 to 400 items that span headings, code regions, tables, and prose.
  2. Write three paraphrases per question with identical intent.
  3. Annotate relevant blocks using canonical ids from chunk_id_schema.md.
  4. Mark hard negatives near the true section to test boundary quality.
  5. Freeze the canonical text and store byte offsets after normalization from pdf_layouts_and_ocr.md.

Gold rows should look like:

{
  "qid": "Q-0137",
  "paraphrases": [
    "How does SCU unlock safety refusals?",
    "Explain symbolic constraint unlock.",
    "SCU: what is it and when to use?"
  ],
  "relevant": ["S.4.2.p.Bk011a", "S.4.2.p.Bk011b"],
  "anchor_section": "S.4.2",
  "negatives": ["S.4.1.p.Bk010", "S.4.3.p.Bk014"]
}
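Before freezing the gold set, it helps to lint each row against the shape above. A minimal sketch, assuming gold.json is a list of rows like the example; the id regex is a placeholder, align it with the real rules in chunk_id_schema.md:

import json, re

# Placeholder pattern; replace with the canonical id rules from chunk_id_schema.md.
ID_PATTERN = re.compile(r"^S\.\d+(\.\d+)*\.")

def lint_gold(path="gold.json"):
    problems = []
    with open(path, encoding="utf-8") as f:
        rows = json.load(f)
    for row in rows:
        qid = row.get("qid", "<missing qid>")
        if len(row.get("paraphrases", [])) != 3:
            problems.append(f"{qid}: expected three paraphrases")
        if not row.get("relevant"):
            problems.append(f"{qid}: no relevant ids")
        if set(row.get("relevant", [])) & set(row.get("negatives", [])):
            problems.append(f"{qid}: negatives overlap relevant ids")
        for cid in row.get("relevant", []) + row.get("negatives", []):
            if not ID_PATTERN.match(cid):
                problems.append(f"{qid}: id {cid} does not match the canonical schema")
    return problems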

Logging schema for evaluation

Your retriever must emit a trace per query. Use the fields defined in retrieval-traceability.md and data-contracts.md.

{
  "qid": "Q-0137",
  "query": "Explain symbolic constraint unlock.",
  "topk": [
    {"id": "S.4.2.p.Bk011a", "score": 0.83, "offsets": [204611,205279], "type": "prose"},
    {"id": "S.4.1.p.Bk010",  "score": 0.79, "offsets": [198002,199112], "type": "prose"}
  ],
  "ΔS": [0.31, 0.59],
  "λ_state": "→",
  "anchor": "S.4.2",
  "index_hash": "faiss:v3:hnsw:cos",
  "ts": "2025-08-27T12:30:22Z"
}
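A minimal sketch of a JSONL trace writer with that shape, assuming your retriever wrapper calls it once per query (log_trace is an illustrative name; append mode keeps one run per file):

import json
from datetime import datetime, timezone

def log_trace(path, qid, query, topk, ds_values, lambda_state, anchor, index_hash):
    # One JSON object per line so the evaluator can stream logs.jsonl.
    record = {
        "qid": qid,
        "query": query,
        "topk": topk,              # list of {"id", "score", "offsets", "type"}
        "ΔS": ds_values,           # one ΔS value per top-k item, same order
        "λ_state": lambda_state,
        "anchor": anchor,
        "index_hash": index_hash,
        "ts": datetime.now(timezone.utc).isoformat(),
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")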

Offline evaluation (index only)

  1. Run each paraphrase against the shadow index.

  2. For each qid, compute:

    • P@k: fraction of the top-k ids that are in the relevant set, for k ∈ {1, 3, 5, 10}.
    • R@k: fraction of the relevant ids covered by the top-k.
    • Anchor hit: any retrieved id with section_id == anchor_section.
    • ΔS probes for each retrieved item.
  3. Aggregate by content type using type ∈ {prose, code, table, figure}, as sketched after this list.

  4. Compare with the live index as a baseline and record deltas.
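Step 3 asks for a per-type breakdown. A minimal sketch of one slice of it, assuming every top-k item carries the type field from the trace schema and gold_by_qid maps qid to its gold row (gold_by_qid and precision_by_type are illustrative names):

from collections import defaultdict

def precision_by_type(gold_by_qid, logs, k=5):
    # Precision@k segmented by content type, so failures in code or tables are not hidden by prose.
    hits, totals = defaultdict(int), defaultdict(int)
    for qid, row in gold_by_qid.items():
        rel = set(row["relevant"])
        for item in logs[qid]["topk"][:k]:
            totals[item["type"]] += 1
            if item["id"] in rel:
                hits[item["type"]] += 1
    return {t: hits[t] / totals[t] for t in totals}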


Online shadow evaluation

  • Mirror live questions to the shadow index.
  • Require cite-first answers with the schema from retrieval-traceability.md.
  • For each answer, verify that at least one citation matches a gold relevant id or the anchor_section.
  • Log ΔS for the chosen citation and the final λ state after reasoning.

Metrics definitions

Let G(q) be the set of relevant ids for q. Let R_k(q) be the ids in top-k.

  • Precision@k = |G(q) ∩ R_k(q)| / |R_k(q)|
  • Recall@k = |G(q) ∩ R_k(q)| / |G(q)|
  • Coverage = fraction of questions where the answer cites at least one element in G(q) or any block within anchor_section.
  • Citation accuracy = fraction where the cited section_id matches the gold and the byte offsets overlap the gold span within a 30-byte window (see the helper sketch after this list).
  • Anchor proximity = average path distance in the title tree from the cited section_id to anchor_section using rules in title_hierarchy.md.
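The 30-byte window can be checked with a small helper; this is the overlaps check the evaluator below relies on. A minimal sketch under one reading of the rule, namely that both edges of the cited span must land within 30 bytes of the gold span:

def overlaps(cited_span, gold_span, window=30):
    # Spans are [start, end] byte pairs over the frozen canonical text.
    (c_start, c_end), (g_start, g_end) = cited_span, gold_span
    return abs(c_start - g_start) <= window and abs(c_end - g_end) <= window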

Pass and fail gates

A shadow index is eligible for canary if:

  • Coverage ≥ 0.70 on gold.
  • Citation accuracy ≥ 0.95.
  • ΔS median ≤ 0.40 and 90-pct ≤ 0.55.
  • Recall@5 does not drop more than 2 points absolute vs live.
  • λ convergent on ≥ 95 percent of paraphrase triplets.

If any fail, return to chunk boundary checks in section_detection.md and typed block lifting in code_tables_blocks.md.
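A minimal sketch of the gate itself, assuming scores is the dict returned by score_run below with k=5, lambda_rate is the measured fraction of convergent paraphrase triplets, and baseline_recall_at_5 comes from the live index (all names are illustrative):

def eligible_for_canary(scores, lambda_rate, baseline_recall_at_5):
    checks = {
        "coverage":     scores["coverage"] >= 0.70,
        "citation_acc": scores["citation_accuracy"] >= 0.95,
        "ds_median":    scores["ΔS_med"] is not None and scores["ΔS_med"] <= 0.40,
        "ds_p90":       scores["ΔS_p90"] is not None and scores["ΔS_p90"] <= 0.55,
        "recall_drop":  scores["R@k"] >= baseline_recall_at_5 - 0.02,
        "lambda":       lambda_rate >= 0.95,
    }
    return all(checks.values()), checks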


Diagnosis map


Minimal evaluator pseudocode

from statistics import mean, median

def percentile(values, pct):
    # Nearest-rank percentile over a non-empty list.
    s = sorted(values)
    return s[min(len(s) - 1, round(pct / 100.0 * (len(s) - 1)))]

def score_run(gold, logs, k=5):
    # gold: list of dicts with qid, paraphrases, relevant, anchor_section.
    # logs: dict keyed by qid with topk, ΔS, and optional answer_citations.
    # section_of and gold_offsets are project helpers; overlaps is sketched in the metrics section.
    p_hits, r_hits, cov_hits, cite_ok = [], [], 0, 0
    ds_med, ds_90 = [], []

    for q in gold:
        items = logs[q["qid"]]["topk"][:k]
        got = {it["id"] for it in items}
        rel = set(q["relevant"])

        prec = len(got & rel) / max(1, len(items))
        rec  = len(got & rel) / max(1, len(rel))
        p_hits.append(prec); r_hits.append(rec)

        ds = logs[q["qid"]]["ΔS"][:k]
        if ds:
            ds_med.append(median(ds)); ds_90.append(percentile(ds, 90))

        # coverage and citation accuracy from the final answer's first citation
        ans = logs[q["qid"]].get("answer_citations", [])
        if ans:
            cited = ans[0]["id"]
            off   = ans[0]["offsets"]
            if cited in rel or section_of(cited) == q["anchor_section"]:
                cov_hits += 1
            if cited in rel and overlaps(off, gold_offsets(cited)):
                cite_ok += 1

    return {
        "P@k": mean(p_hits) if p_hits else None,
        "R@k": mean(r_hits) if r_hits else None,
        "coverage": cov_hits / len(gold),
        "citation_accuracy": cite_ok / len(gold),
        "ΔS_med": median(ds_med) if ds_med else None,
        "ΔS_p90": median(ds_90) if ds_90 else None   # median of per-question p90 values
    }

Common pitfalls

  • Evaluating answers without enforcing cite-first. You cannot measure coverage reliably. Fix the contract in data-contracts.md.
  • Mixing normalizers between builds. Offsets will not compare. Lock the same whitespace and hyphen rules as in pdf_layouts_and_ocr.md.
  • Ignoring content types. Aggregates hide failures in code or tables. Segment metrics by type.
  • k too small for long documents. Use k ∈ {5, 10} when sections are dense.
  • Comparing across different rerankers. Pin rerank during offline runs, then test rerankers separately in a controlled A/B.

Copy-paste prompt for LLM-assisted scoring

You have TXT OS and the WFGY Problem Map.

Given:
- gold.json: gold questions with {qid, paraphrases[], relevant[], anchor_section}
- logs.jsonl: retriever traces with topk ids, ΔS per item, and answer_citations

Do:
1) Compute P@5, R@5, coverage, citation accuracy.
2) Report ΔS median and p90 for the cited snippet per question.
3) Flag any questions with coverage==0 or ΔS>0.60 and return their qids.
4) Summarize per-type breakdown for {prose, code, table, figure}.

Return compact JSON:
{ "P@5": 0.xx, "R@5": 0.xx, "coverage": 0.xx, "citation_accuracy": 0.xx,
  "ΔS_med": 0.xx, "ΔS_p90": 0.xx, "bad_qids": ["Q-..."], "by_type": {...} }

🔗 Quick-Start Downloads (60 sec)

Tool Link 3-Step Setup
WFGY 1.0 PDF Engine Paper 1 Download · 2 Upload to your LLM · 3 Ask “Answer using WFGY + <your question>”
TXT OS (plain-text OS) TXTOS.txt 1 Download · 2 Paste into any LLM chat · 3 Type “hello world” — OS boots instantly

Explore More

Layer Page What it's for
Proof WFGY Recognition Map External citations, integrations, and ecosystem proof
⚙️ Engine WFGY 1.0 Original PDF tension engine and early logic sketch (legacy reference)
⚙️ Engine WFGY 2.0 Production tension kernel for RAG and agent systems
⚙️ Engine WFGY 3.0 TXT based Singularity tension engine (131 S class set)
🗺️ Map Problem Map 1.0 Flagship 16 problem RAG failure taxonomy and fix map
🗺️ Map Problem Map 2.0 Global Debug Card for RAG and agent pipeline diagnosis
🗺️ Map Problem Map 3.0 Global AI troubleshooting atlas and failure pattern map
🧰 App TXT OS .txt semantic OS with fast bootstrap
🧰 App Blah Blah Blah Abstract and paradox Q&A built on TXT OS
🧰 App Blur Blur Blur Text to image generation with semantic control
🏡 Onboarding Starter Village Guided entry point for new users

If this repository helped, starring it improves discovery so more builders can find the docs and tools.
