# RAG precision/recall evaluation
## 🧭 Quick Return to Map
You are in a sub-page of Chunking.
To reorient, go back here:
- Chunking — text segmentation and context window management
- WFGY Global Fix Map — main Emergency Room, 300+ structured fixes
- WFGY Problem Map 1.0 — 16 reproducible failure modes
Think of this page as a desk within a ward.
If you need the full triage and all prescriptions, return to the Emergency Room lobby.
A compact, repeatable harness to measure retrieval precision, recall, and coverage after you change chunking, OCR, or indexing. This page also defines ΔS and λ probes so you can gate rollouts with hard numbers.
## Open these first
- Chunk ids and stability: chunk_id_schema.md
- Title tree numbering: title_hierarchy.md
- Section boundary rules: section_detection.md
- Typed blocks (code, tables, figures): code_tables_blocks.md
- PDF, layout, OCR normalization: pdf_layouts_and_ocr.md
- Traceable results and cite schema: retrieval-traceability.md
- Payload contracts for RAG: data-contracts.md
- Visual recovery map and ops: rag-architecture-and-recovery.md
## What this measures
- Precision@k: fraction of retrieved snippets among top-k that truly answer the question.
- Recall@k: fraction of all relevant snippets that appear in top-k.
- Coverage: proportion of questions whose final answer can be justified by at least one cited snippet.
- Citation accuracy: percentage of answers where `section_id` and offsets match the gold.
- ΔS(question, retrieved): semantic distance. Stable ≤ 0.45, transitional 0.45–0.60, risk ≥ 0.60.
- λ_observe: convergence state across paraphrases and seeds.
These metrics tell you if a chunking or index change helps retrieval without breaking traceability.
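ΔS is a distance probe over embeddings. A minimal sketch, assuming ΔS is taken as 1 minus the cosine similarity between the question and snippet vectors; adjust if your pipeline defines the distance differently:

```python
import numpy as np

def delta_s(question_vec, snippet_vec):
    """ΔS as 1 - cosine similarity: 0.0 means identical direction,
    values at or above 0.60 fall in the risk band."""
    q = np.asarray(question_vec, dtype=float)
    s = np.asarray(snippet_vec, dtype=float)
    cos = float(q @ s) / (np.linalg.norm(q) * np.linalg.norm(s))
    return 1.0 - cos
```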
## Acceptance targets
- Coverage ≥ 0.70 on the project’s gold set.
- ΔS(question, retrieved) ≤ 0.45 for the cited snippet of each answered item.
- Citation accuracy ≥ 0.95 for `section_id` + offsets.
- λ remains convergent across three paraphrases and two seeds.
- Recall@k does not drop more than 2 points absolute compared with the previous index.
## Gold set construction
- Scope 200–400 items that span headings, code regions, tables, and prose.
- Write three paraphrases per question with identical intent.
- Annotate relevant blocks using canonical ids from chunk_id_schema.md.
- Mark hard negatives near the true section to test boundary quality.
- Freeze the canonical text and store byte offsets after normalization from pdf_layouts_and_ocr.md.
Gold rows should look like:
```json
{
  "qid": "Q-0137",
  "paraphrases": [
    "How does SCU unlock safety refusals?",
    "Explain symbolic constraint unlock.",
    "SCU: what is it and when to use?"
  ],
  "relevant": ["S.4.2.p.Bk011a", "S.4.2.p.Bk011b"],
  "anchor_section": "S.4.2",
  "negatives": ["S.4.1.p.Bk010", "S.4.3.p.Bk014"]
}
```
## Logging schema for evaluation
Your retriever must emit a trace per query. Use the fields defined in retrieval-traceability.md and data-contracts.md.
```json
{
  "qid": "Q-0137",
  "query": "Explain symbolic constraint unlock.",
  "topk": [
    {"id": "S.4.2.p.Bk011a", "score": 0.83, "offsets": [204611, 205279], "type": "prose"},
    {"id": "S.4.1.p.Bk010", "score": 0.79, "offsets": [198002, 199112], "type": "prose"}
  ],
  "ΔS": [0.31, 0.59],
  "λ_state": "→",
  "anchor": "S.4.2",
  "index_hash": "faiss:v3:hnsw:cos",
  "ts": "2025-08-27T12:30:22Z"
}
```
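Before scoring, it pays to validate each trace against the contract. A minimal sketch using only the field names shown above:

```python
REQUIRED_FIELDS = {"qid", "query", "topk", "ΔS", "λ_state", "anchor", "index_hash", "ts"}

def validate_trace(trace: dict) -> dict:
    """Fail fast on traces that would silently corrupt the metrics."""
    missing = REQUIRED_FIELDS - trace.keys()
    if missing:
        raise ValueError(f"trace {trace.get('qid', '?')}: missing fields {missing}")
    if len(trace["ΔS"]) != len(trace["topk"]):
        raise ValueError(f"trace {trace['qid']}: ΔS must align one-to-one with topk")
    return trace
```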
## Offline evaluation (index only)

1. Run each paraphrase against the shadow index.
2. For each qid, compute:
   - P@k: relevant ids ∩ top-k over k ∈ {1, 3, 5, 10}.
   - R@k: relevant ids covered by top-k.
   - Anchor hit: any retrieved id with `section_id == anchor_section`.
   - ΔS probes for each retrieved item.
3. Aggregate by content type using `type ∈ {prose, code, table, figure}` (a per-type sketch follows below).
4. Compare with the live index as a baseline and record deltas.
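A sketch of the per-type aggregation in step 3, computing Precision@k segmented by block type. It assumes the trace schema above, where every top-k item carries a `type` field:

```python
from collections import defaultdict

def precision_by_type(gold, logs, k=5):
    """Precision@k per content type, so code and table failures
    are not averaged away inside prose results."""
    hits, totals = defaultdict(int), defaultdict(int)
    for q in gold:
        rel = set(q["relevant"])
        for item in logs[q["qid"]]["topk"][:k]:
            totals[item["type"]] += 1
            hits[item["type"]] += item["id"] in rel
    return {t: hits[t] / totals[t] for t in totals}
```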
## Online shadow evaluation

- Mirror live questions to the shadow index.
- Require cite-first answers with the schema from retrieval-traceability.md.
- For each answer, verify that at least one citation matches a gold `relevant` id or the `anchor_section` (a mechanical check is sketched below).
- Log ΔS for the chosen citation and the final λ state after reasoning.
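That citation check can be mechanical. A minimal sketch, assuming block ids are prefixed by their `section_id` as in the examples on this page ("S.4.2.p.Bk011a" sits under anchor "S.4.2"):

```python
def citation_covers(citation: dict, gold_row: dict) -> bool:
    """True when the cited block is gold-relevant or falls inside
    the gold anchor section."""
    cid = citation["id"]
    return (cid in gold_row["relevant"]
            or cid == gold_row["anchor_section"]
            or cid.startswith(gold_row["anchor_section"] + "."))
```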
## Metrics definitions

Let G(q) be the set of relevant ids for q. Let R_k(q) be the ids in top-k.

- Precision@k = |G(q) ∩ R_k(q)| / |R_k(q)|
- Recall@k = |G(q) ∩ R_k(q)| / |G(q)|
- Coverage = fraction of questions where the answer cites at least one element in G(q) or any block within `anchor_section`.
- Citation accuracy = fraction where both `section_id` and byte offsets overlap the gold within a 30-byte window.
- Anchor proximity = average path distance in the title tree from the cited `section_id` to `anchor_section`, using the rules in title_hierarchy.md.
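For example, if G(q) = {A, B} and the top-5 list is [A, X, B, Y, Z], then Precision@5 = 2/5 = 0.40 and Recall@5 = 2/2 = 1.00; the question counts toward coverage only if the final answer actually cites A, B, or another block inside the anchor section.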
## Pass and fail gates

A shadow index is eligible for canary if:

- Coverage ≥ 0.70 on gold.
- Citation accuracy ≥ 0.95.
- ΔS median ≤ 0.40 and 90th percentile ≤ 0.55.
- Recall@5 does not drop more than 2 points absolute vs live.
- λ convergent on ≥ 95 percent of paraphrase triplets.
If any fail, return to chunk boundary checks in section_detection.md and typed block lifting in code_tables_blocks.md.
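These gates are easy to automate. A minimal sketch over the metrics dict produced by the evaluator below, plus the live index's Recall@5 and the λ-convergence rate from your paraphrase runs (both hypothetical inputs here):

```python
def eligible_for_canary(m, live_recall_at_5, lambda_convergent_rate):
    """All gates must hold; any single failure blocks the canary."""
    return all([
        m["coverage"] >= 0.70,
        m["citation_accuracy"] >= 0.95,
        m["ΔS_med"] is not None and m["ΔS_med"] <= 0.40,
        m["ΔS_p90"] is not None and m["ΔS_p90"] <= 0.55,
        m["R@k"] >= live_recall_at_5 - 0.02,  # max 2-point absolute drop
        lambda_convergent_rate >= 0.95,
    ])
```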
## Diagnosis map
- High similarity yet wrong meaning → Embedding ≠ Semantic
- Order flips across runs → Rerankers
- Boundary leaks or mixed topics in a chunk → revisit section_detection.md
- Tables or code referenced as plain text → code_tables_blocks.md
- OCR drift and offset mismatch → pdf_layouts_and_ocr.md
- Index rebuilt, citations break → reindex_migration.md
## Minimal evaluator pseudocode

```python
from statistics import mean, median

def percentile(values, pct):
    """Nearest-rank percentile, to avoid a numpy dependency."""
    vals = sorted(values)
    idx = min(len(vals) - 1, round(pct / 100 * (len(vals) - 1)))
    return vals[idx]

def score_run(gold, logs, k=5):
    # Assumed project helpers: section_of(block_id) maps an id such as
    # "S.4.2.p.Bk011a" to its section_id "S.4.2"; gold_offsets(block_id)
    # returns the frozen byte offsets for a gold block; overlaps(a, b)
    # tests the 30-byte overlap window from the citation accuracy rule.
    p_hits, r_hits, cov_hits, cite_ok = [], [], 0, 0
    ds_med, ds_90 = [], []
    for q in gold:  # q: {qid, paraphrases, relevant, anchor_section}
        items = logs[q["qid"]]["topk"][:k]
        got = {it["id"] for it in items}
        rel = set(q["relevant"])
        p_hits.append(len(got & rel) / max(1, len(items)))
        r_hits.append(len(got & rel) / max(1, len(rel)))
        ds = logs[q["qid"]]["ΔS"][:k]
        if ds:
            ds_med.append(median(ds))
            ds_90.append(percentile(ds, 90))
        # coverage and citation accuracy from the final answer's first citation
        ans = logs[q["qid"]].get("answer_citations", [])
        if ans:
            cited, off = ans[0]["id"], ans[0]["offsets"]
            if cited in rel or section_of(cited) == q["anchor_section"]:
                cov_hits += 1
            if cited in rel and overlaps(off, gold_offsets(cited)):
                cite_ok += 1
    return {
        "P@k": mean(p_hits),
        "R@k": mean(r_hits),
        "coverage": cov_hits / len(gold),
        "citation_accuracy": cite_ok / len(gold),
        "ΔS_med": median(ds_med) if ds_med else None,
        "ΔS_p90": median(ds_90) if ds_90 else None,  # median of per-question p90s
    }
```
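Hypothetical usage, assuming gold.json holds a list of gold rows and logs.jsonl holds one trace per line keyed by qid:

```python
import json

with open("gold.json") as f:
    gold = json.load(f)

logs = {}
with open("logs.jsonl") as f:
    for line in f:
        trace = json.loads(line)
        logs[trace["qid"]] = trace

print(score_run(gold, logs, k=5))
```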
## Common pitfalls
- Evaluating answers without enforcing cite-first. You cannot measure coverage reliably. Fix the contract in data-contracts.md.
- Mixing normalizers between builds. Offsets will not compare. Lock the same whitespace and hyphen rules as in pdf_layouts_and_ocr.md.
- Ignoring content types. Aggregates hide failures in code or tables. Segment metrics by `type`.
- k too small for long documents. Use k ∈ {5, 10} when sections are dense.
- Comparing across different rerankers. Pin rerank during offline runs, then test rerankers separately in a controlled A/B.
## Copy-paste prompt for LLM-assisted scoring

```txt
You have TXT OS and the WFGY Problem Map.

Given:
- gold.json: gold questions with {qid, paraphrases[], relevant[], anchor_section}
- logs.jsonl: retriever traces with topk ids, ΔS per item, and answer_citations

Do:
1) Compute P@5, R@5, coverage, citation accuracy.
2) Report ΔS median and p90 for the cited snippet per question.
3) Flag any questions with coverage==0 or ΔS>0.60 and return their qids.
4) Summarize per-type breakdown for {prose, code, table, figure}.

Return compact JSON:
{ "P@5": 0.xx, "R@5": 0.xx, "coverage": 0.xx, "citation_accuracy": 0.xx,
  "ΔS_med": 0.xx, "ΔS_p90": 0.xx, "bad_qids": ["Q-..."], "by_type": {...} }
```
## 🔗 Quick-Start Downloads (60 sec)
| Tool | Link | 3-Step Setup |
|---|---|---|
| WFGY 1.0 PDF | Engine Paper | 1️⃣ Download · 2️⃣ Upload to your LLM · 3️⃣ Ask “Answer using WFGY + <your question>” |
| TXT OS (plain-text OS) | TXTOS.txt | 1️⃣ Download · 2️⃣ Paste into any LLM chat · 3️⃣ Type “hello world” — OS boots instantly |
## Explore More
| Layer | Page | What it’s for |
|---|---|---|
| ⭐ Proof | WFGY Recognition Map | External citations, integrations, and ecosystem proof |
| ⚙️ Engine | WFGY 1.0 | Original PDF tension engine and early logic sketch (legacy reference) |
| ⚙️ Engine | WFGY 2.0 | Production tension kernel for RAG and agent systems |
| ⚙️ Engine | WFGY 3.0 | TXT based Singularity tension engine (131 S class set) |
| 🗺️ Map | Problem Map 1.0 | Flagship 16 problem RAG failure taxonomy and fix map |
| 🗺️ Map | Problem Map 2.0 | Global Debug Card for RAG and agent pipeline diagnosis |
| 🗺️ Map | Problem Map 3.0 | Global AI troubleshooting atlas and failure pattern map |
| 🧰 App | TXT OS | .txt semantic OS with fast bootstrap |
| 🧰 App | Blah Blah Blah | Abstract and paradox Q&A built on TXT OS |
| 🧰 App | Blur Blur Blur | Text to image generation with semantic control |
| 🏡 Onboarding | Starter Village | Guided entry point for new users |
If this repository helped, starring it improves discovery so more builders can find the docs and tools.