Eval — RAG Precision, Refusals, CHR, and Recall@k (stdlib-only)

Goal
Provide a deterministic, SDK-free way to score grounded Q&A quality and retrieval quality for RAG pipelines. This page defines metrics, data contracts, and a ≤100-line reference scorer.

What you get

  • Clear precision/CHR/refusal definitions for grounded answers
  • Recall@k for retrieval (upper bound on answerability)
  • A tiny reference scorer (Python stdlib-only) + sample JSONL

1) Data Contracts

1.1 Gold set (eval/gold.jsonl)

One JSON object per line:

{"qid":"A0001","question":"Does X support null keys?","answerable":true,"gold_claim_substr":["rejects null keys"],"gold_citations":["p1#2"],"constraints":["X rejects null keys."]}
{"qid":"A0002","question":"Explain Z.","answerable":false,"gold_claim_substr":[],"gold_citations":[]}
{"qid":"A0003","question":"What domain is allowed?","answerable":true,"gold_claim_substr":["only domain example.com"],"gold_citations":["pB#1"]}

Rules:

  • answerable=false ⇒ the only correct output is the exact refusal token: not in context
  • gold_claim_substr: minimal substrings that must appear in the shipped claim (case-insensitive, ≥5 chars)
  • gold_citations: any overlap with shipped citations counts as a citation hit
  • constraints (optional): used by SCU checks elsewhere; ignored in this page's scorer unless you extend it
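
These rules are mechanical, so you can lint a gold set before scoring it. A minimal validator sketch; the check_gold name and default path are illustrative, and it assumes answerable items always carry at least one substring and one citation, as in the samples above:

import json, sys

def check_gold(path):
    seen = set()
    for n, line in enumerate(open(path, encoding="utf8"), 1):
        line = line.strip()
        if not line:
            continue
        g = json.loads(line)
        qid = g.get("qid")
        assert qid and qid not in seen, f"line {n}: missing or duplicate qid"
        seen.add(qid)
        assert isinstance(g.get("answerable"), bool), f"line {n}: answerable must be true or false"
        subs  = g.get("gold_claim_substr") or []
        cites = g.get("gold_citations") or []
        if g["answerable"]:
            assert subs and cites, f"line {n}: answerable items need substrings and citations"
            assert all(len(s) >= 5 for s in subs), f"line {n}: gold substrings must be >= 5 chars"
        else:
            assert not subs and not cites, f"line {n}: unanswerable items take empty lists"
    print(f"gold ok: {len(seen)} items")

if __name__ == "__main__":
    check_gold(sys.argv[1] if len(sys.argv) > 1 else "ProblemMap/eval/gold.jsonl")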

1.2 System traces (runs/trace.jsonl)

Emitted by your guarded pipeline:

{"qid":"A0001","q":"Does X support null keys?","retrieved_ids":["p1#1","p1#2","p2#1"],"answer_json":{"claim":"X rejects null keys.","citations":["p1#2"]}}
{"qid":"A0002","q":"Explain Z.","retrieved_ids":["p1#1","p2#1"],"answer_json":{"claim":"not in context","citations":[]}}
{"qid":"A0003","q":"What domain is allowed?","retrieved_ids":["pB#1","p1#2"],"answer_json":{"claim":"Only domain example.com is allowed.","citations":["pB#1"]}}

Rules:

  • answer_json.claim is either a sentence or the exact refusal token not in context
  • answer_json.citations must be ids taken from retrieved_ids (scoped grounding)
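
Both rules can be checked mechanically before scoring. A minimal sketch; the check_trace name is illustrative, and the refusal-carries-no-citations assertion is inferred from the A0002 sample rather than stated in the contract:

import json

REFUSAL = "not in context"

def check_trace(path):
    for n, line in enumerate(open(path, encoding="utf8"), 1):
        line = line.strip()
        if not line:
            continue
        t = json.loads(line)
        aj = t.get("answer_json") or {}
        claim = (aj.get("claim") or "").strip()
        cits = set(aj.get("citations") or [])
        ret = set(t.get("retrieved_ids") or [])
        assert claim, f"line {n}: claim must be a sentence or the refusal token"
        if claim.lower() == REFUSAL:
            # refusals ship no citations, as in the A0002 sample
            assert not cits, f"line {n}: refusal with citations"
        else:
            # scoped grounding: every shipped citation must come from this trace's retrieval
            assert cits <= ret, f"line {n}: citation outside retrieved_ids"
    print("trace ok")

if __name__ == "__main__":
    check_trace("runs/trace.jsonl")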

2) Metrics (definitions)

Let

  • S = set of shipped answers (claim ≠ not in context)
  • R = set of refusals (claim == not in context)
  • A = gold items with answerable=true
  • U = gold items with answerable=false

Derived checks per qid:

  • Containment (C): any gold_claim_substr appears in the shipped claim (case-insensitive, min len ≥ 5)
  • Citation hit (H): citations ∩ gold_citations ≠ ∅ and citations ⊆ retrieved_ids

Scores:

  • Precision (answered) = |{ x ∈ S ∩ A : C ∧ H }| / |S| (Of what we shipped, how many are correct and properly cited?)
  • Under-refusal rate = |{ x ∈ S ∩ U }| / |U| (Should have refused but answered anyway.)
  • Over-refusal rate = |{ x ∈ R ∩ A }| / |A| (Should have answered but refused.)
  • Citation Hit Rate (CHR) = |{ x ∈ S : H }| / |S|
  • Recall@k = |{ x ∈ A : gold_citations ⊆ top-k(retrieved_ids) }| / |A|

Tip: track Precision and CHR together. High precision with low CHR usually means “answers look right but cite the wrong evidence.”

Default ship gates (suggested)

  • Precision (answered) ≥ 0.80
  • CHR ≥ 0.75
  • Under-refusal ≤ 0.05
  • Over-refusal ≤ 0.10

Commit gate thresholds to your repo and enforce them in CI.


3) Worked Mini-Example

Gold (3 items) + Traces above:

  • A0001 → shipped, contains “rejects null keys”, cites p1#2 (hit) → contributes to Precision and CHR
  • A0002 → refusal, gold says unanswerable → correct refusal (counts against neither Precision nor Over-refusal)
  • A0003 → shipped, contains “only domain example.com”, cites pB#1 (hit)

With all three correct, you'll see:

  • Precision (answered) = 2/2 = 1.00
  • CHR = 2/2 = 1.00
  • Under-refusal = 0/1 = 0.00
  • Over-refusal = 0/2 = 0.00
  • Recall@k depends on your chosen k and the retrieved ids (here every gold citation appears within the top 5, so Recall@5 = 2/2 = 1.00)
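
The same numbers fall straight out of the Section 2 definitions. A hand-check sketch that hard-codes the three per-qid verdicts above (shipped, answerable, containment C, citation hit H):

# Per-qid verdicts from the traces above: (shipped, answerable, C, H)
verdicts = {
    "A0001": (True,  True,  True,  True),
    "A0002": (False, False, False, False),  # correct refusal
    "A0003": (True,  True,  True,  True),
}

S = {q for q, v in verdicts.items() if v[0]}   # shipped answers
A = {q for q, v in verdicts.items() if v[1]}   # answerable gold
U = set(verdicts) - A                          # unanswerable gold

precision = sum(q in A and verdicts[q][2] and verdicts[q][3] for q in S) / len(S)
chr_rate  = sum(verdicts[q][3] for q in S) / len(S)
under     = len(S & U) / len(U)                # answered but unanswerable
over      = len(A - S) / len(A)                # refused but answerable

# recall@5 from the retrieved ids in the traces above
retrieved  = {"A0001": ["p1#1", "p1#2", "p2#1"], "A0003": ["pB#1", "p1#2"]}
gold_cites = {"A0001": ["p1#2"], "A0003": ["pB#1"]}
recall5 = sum(set(gold_cites[q]) <= set(retrieved[q][:5]) for q in A) / len(A)

assert (precision, chr_rate, under, over, recall5) == (1.0, 1.0, 0.0, 0.0, 1.0)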

4) Reference scorer (≤100 lines, Python stdlib-only)

Save as ProblemMap/eval/score_eval.py:

#!/usr/bin/env python3
import json, sys, argparse

REFUSAL = "not in context"

def load_jsonl(path):
    with open(path, encoding="utf8") as f:
        for line in f:
            line=line.strip()
            if line: yield json.loads(line)

def contains_substr(claim, subs):
    c = (claim or "").lower()
    for s in subs or []:
        s = s.lower()
        if len(s) >= 5 and s in c:
            return True
    return not subs  # no gold substrings: containment is vacuously true

def citation_hit(citations, gold_cites, retrieved):
    # citations must be a list scoped to this trace's retrieved_ids
    if not isinstance(citations, list): return False
    if not set(citations).issubset(set(retrieved or [])): return False
    # any overlap with gold counts; with no gold citations, require none shipped
    return bool(set(citations) & set(gold_cites or [])) if gold_cites else (citations == [])

def topk(ids, k): return (ids or [])[:k]

def main():
    ap = argparse.ArgumentParser()
    ap.add_argument("--gold", required=True)
    ap.add_argument("--trace", required=True)
    ap.add_argument("--k", type=int, default=5)
    ap.add_argument("--gates", default="precision=0.80,chr=0.75,under=0.05,over=0.10")
    args = ap.parse_args()

    gold = {g["qid"]: g for g in load_jsonl(args.gold)}
    traces = {}
    for t in load_jsonl(args.trace):
        qid = t.get("qid") or t.get("q_id")
        if qid: traces[qid] = t  # keep last

    S=R=A=U=0
    TP=CHR_hit=0
    UNDER=OVER=0
    RECALL=REC_DEN=0

    for qid, g in gold.items():
        ans = traces.get(qid, {})  # a missing trace scores as a shipped empty claim
        aj = (ans.get("answer_json") or {})
        claim = (aj.get("claim") or "").strip()
        cits  = aj.get("citations") or []
        ret   = ans.get("retrieved_ids") or []
        is_ans = claim.lower() != REFUSAL
        if g.get("answerable"): A+=1
        else: U+=1

        # sets
        if is_ans: S+=1
        else: R+=1

        # precision / chr (only shipped answers can contribute)
        if is_ans:
            C = contains_substr(claim, g.get("gold_claim_substr"))
            H = citation_hit(cits, g.get("gold_citations"), ret)
            if not g.get("answerable"): UNDER += 1
            else:
                if H: CHR_hit += 1
                if C and H: TP += 1
        else:
            if g.get("answerable"): OVER += 1

        # recall@k: every gold citation must appear in the top-k retrieved ids
        if g.get("answerable"):
            REC_DEN += 1
            kset = set(topk(ret, args.k))
            if set(g.get("gold_citations") or []).issubset(kset):
                RECALL += 1

    precision = (TP / S) if S else 1.0
    chr_rate  = (CHR_hit / S) if S else 1.0
    under     = (UNDER / U) if U else 0.0
    over      = (OVER / A) if A else 0.0
    recallk   = (RECALL / REC_DEN) if REC_DEN else 0.0

    gates = dict(x.split("=") for x in args.gates.split(","))
    def ok(name, val, ge=True):
        thr = float(gates[name])
        return val >= thr if ge else val <= thr

    pass_all = (ok("precision", precision) and ok("chr", chr_rate) and
                ok("under", under, ge=False) and ok("over", over, ge=False))

    print(json.dumps({
        "answered": S, "refused": R, "answerable": A, "unanswerable": U,
        "precision": round(precision,4),
        "chr": round(chr_rate,4),
        "under_refusal": round(under,4),
        "over_refusal": round(over,4),
        "recall@k": round(recallk,4),
        "k": args.k,
        "gates": gates,
        "pass": pass_all
    }, indent=2))

if __name__ == "__main__":
    main()

Run:

python ProblemMap/eval/score_eval.py \
  --gold ProblemMap/eval/gold.jsonl \
  --trace runs/trace.jsonl \
  --k 5 \
  --gates precision=0.80,chr=0.75,under=0.05,over=0.10

Output (example):

{
  "answered": 2,
  "refused": 1,
  "answerable": 2,
  "unanswerable": 1,
  "precision": 1.0,
  "chr": 1.0,
  "under_refusal": 0.0,
  "over_refusal": 0.0,
  "recall@k": 1.0,
  "k": 5,
  "gates": {"precision":"0.80","chr":"0.75","under":"0.05","over":"0.10"},
  "pass": true
}

5) CI wiring

  • Add a job that runs the scorer on every PR and fails if pass=false.
  • Store last report at eval/report.md (or JSON) to track regressions.
  • Freeze gold.jsonl per release; changes require sign-off.

Example CI step

python ProblemMap/eval/score_eval.py \
  --gold ProblemMap/eval/gold.jsonl \
  --trace runs/trace.jsonl \
  --k 5 \
  --gates precision=0.80,chr=0.75,under=0.05,over=0.10 \
| tee eval/last_report.json

jq -e '.pass == true' eval/last_report.json
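
If jq is not available on the CI image, the same gate fits in a Python one-liner (assuming the report path used above):

python -c "import json,sys; sys.exit(0 if json.load(open('eval/last_report.json'))['pass'] else 1)"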

6) Troubleshooting

  • Precision low, CHR low → Grounding broken. Apply Pattern: RAG Semantic Drift.
  • Precision low, CHR high → Claim text misses gold_claim_substr. Tighten claim schema / substrings.
  • Under-refusal high → Many unanswerables were answered; strengthen refusal behavior or retrieval constraints.
  • Over-refusal high → You're refusing real questions; improve recall@k or shrink chunks.
  • Recall@k low → Index/manifest drift or retrieval logic. See Vector Store Fragmentation and Example 03.
