# Eval — RAG Precision, Refusals, CHR, and Recall@k (stdlib-only)

**Goal**
Provide a deterministic, SDK-free way to score grounded Q&A quality and retrieval quality for RAG pipelines. This page defines metrics, data contracts, and a ≤100-line reference scorer.

**What you get**

- Clear **precision/CHR/refusal** definitions for grounded answers
- **Recall@k** for retrieval (an upper bound on answerability)
- A tiny **reference scorer** (Python stdlib-only) + sample JSONL

---

## 1) Data Contracts

### 1.1 Gold set (`eval/gold.jsonl`)

One JSON object per line:

```json
{"qid":"A0001","question":"Does X support null keys?","answerable":true,"gold_claim_substr":["rejects null keys"],"gold_citations":["p1#2"],"constraints":["X rejects null keys."]}
{"qid":"A0002","question":"Explain Z.","answerable":false,"gold_claim_substr":[],"gold_citations":[]}
{"qid":"A0003","question":"What domain is allowed?","answerable":true,"gold_claim_substr":["only domain example.com"],"gold_citations":["pB#1"]}
```

Rules:

* `answerable=false` ⇒ the **only** correct output is the exact refusal token: `not in context`
* `gold_claim_substr`: minimal substrings that must appear in the shipped claim (case-insensitive, ≥5 chars)
* `gold_citations`: any overlap with shipped `citations` counts as a citation hit
* `constraints` (optional): used by SCU checks elsewhere; ignored in this page’s scorer unless you extend it

### 1.2 System traces (`runs/trace.jsonl`)

Emitted by your guarded pipeline:

```json
{"qid":"A0001","q":"Does X support null keys?","retrieved_ids":["p1#1","p1#2","p2#1"],"answer_json":{"claim":"X rejects null keys.","citations":["p1#2"]}}
{"qid":"A0002","q":"Explain Z.","retrieved_ids":["p1#1","p2#1"],"answer_json":{"claim":"not in context","citations":[]}}
{"qid":"A0003","q":"What domain is allowed?","retrieved_ids":["pB#1","p1#2"],"answer_json":{"claim":"Only domain example.com is allowed.","citations":["pB#1"]}}
```

Rules:

* `answer_json.claim` is either a sentence **or** the exact refusal token `not in context`
* `answer_json.citations` must be ids taken from `retrieved_ids` (scoped grounding)

---

## 2) Metrics (definitions)

Let

* **S** = set of shipped answers (claim ≠ `not in context`)
* **R** = set of refusals (claim == `not in context`)
* **A** = gold items with `answerable=true`
* **U** = gold items with `answerable=false`

Derived checks per qid:

* **Containment (C)**: any `gold_claim_substr` appears in the shipped `claim` (case-insensitive, min length ≥ 5)
* **Citation hit (H)**: `citations ∩ gold_citations ≠ ∅` and `citations ⊆ retrieved_ids`

Scores:

* **Precision (answered)** = |{ x ∈ S ∩ A : C ∧ H }| / |S|
  (Of what we shipped, how many are correct and properly cited?)
* **Under-refusal rate** = |{ x ∈ S ∩ U }| / |U|
  (Should have refused but answered anyway.)
* **Over-refusal rate** = |{ x ∈ R ∩ A }| / |A|
  (Should have answered but refused.)
* **Citation Hit Rate (CHR)** = |{ x ∈ S : H }| / |S|
* **Recall@k** = |{ x ∈ A : `gold_citations` ⊆ top-k(`retrieved_ids`) }| / |A|

> Tip: track Precision and CHR together. High precision with low CHR usually means “answers look right but cite the wrong evidence.”

**Default ship gates (suggested)**

* Precision (answered) ≥ **0.80**
* CHR ≥ **0.75**
* Under-refusal ≤ **0.05**
* Over-refusal ≤ **0.10**

Commit gate thresholds to your repo and enforce them in CI.
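Before scoring, it also helps to lint both files against the contracts in section 1, so violations fail fast instead of silently skewing metrics. A minimal stdlib sketch (the `lint_contracts.py` name is hypothetical; default paths and the empty-gold-fields check follow the examples above):

```python
#!/usr/bin/env python3
# lint_contracts.py — illustrative pre-flight checks for the data contracts above
import json, sys

def rows(path):
    with open(path, encoding="utf8") as f:
        return [json.loads(line) for line in f if line.strip()]

def main(gold_path="eval/gold.jsonl", trace_path="runs/trace.jsonl"):
    errs = []
    gold = {g["qid"]: g for g in rows(gold_path)}
    for qid, g in gold.items():
        # unanswerable items should carry no gold evidence (see A0002 above)
        if not g.get("answerable") and (g.get("gold_claim_substr") or g.get("gold_citations")):
            errs.append(f"{qid}: unanswerable but gold fields are non-empty")
        # substrings under 5 chars are ignored by the scorer, so flag them here
        for s in g.get("gold_claim_substr") or []:
            if len(s) < 5:
                errs.append(f"{qid}: gold substring {s!r} shorter than 5 chars")
    for t in rows(trace_path):
        qid = t.get("qid")
        aj = t.get("answer_json") or {}
        # scoped grounding: citations must come from retrieved_ids
        if not set(aj.get("citations") or []) <= set(t.get("retrieved_ids") or []):
            errs.append(f"{qid}: citations not scoped to retrieved_ids")
        if qid not in gold:
            errs.append(f"{qid}: trace qid missing from gold set")
    print("\n".join(errs) or "contracts OK")
    sys.exit(1 if errs else 0)

if __name__ == "__main__":
    main(*sys.argv[1:3])
```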
---

## 3) Worked Mini-Example

Gold (3 items) + traces above:

* A0001 → shipped, contains “rejects null keys”, cites `p1#2` (hit) → contributes to Precision and CHR
* A0002 → refusal, gold says unanswerable → correct refusal (it stays out of **S**, so it cannot hurt Precision, and it avoids the Under-refusal penalty)
* A0003 → shipped, contains “only domain example.com”, cites `pB#1` (hit)

With all three correct, you’ll see:

* Precision (answered) = 2/2 = 1.00
* CHR = 2/2 = 1.00
* Under-refusal = 0/1 = 0.00
* Over-refusal = 0/2 = 0.00
* Recall@k depends on your chosen `k`; here both gold citations sit inside the top retrieved ids, so Recall@5 = 2/2 = 1.00

---

## 4) Reference scorer (≤100 lines, Python stdlib-only)

Save as `ProblemMap/eval/score_eval.py`:

```python
#!/usr/bin/env python3
import json, argparse

REFUSAL = "not in context"

def load_jsonl(path):
    with open(path, encoding="utf8") as f:
        for line in f:
            line = line.strip()
            if line: yield json.loads(line)

def contains_substr(claim, subs):
    # Containment (C): any gold substring (case-insensitive, >= 5 chars) in the claim.
    c = (claim or "").lower()
    for s in subs or []:
        s = s.lower()
        if len(s) >= 5 and s in c: return True
    return not subs  # no gold substrings: containment is vacuously true

def citation_hit(citations, gold_cites, retrieved):
    # Citation hit (H): citations scoped to retrieved_ids AND overlapping gold_citations.
    if not isinstance(citations, list): return False
    if not set(citations).issubset(set(retrieved or [])): return False
    return bool(set(citations) & set(gold_cites or [])) if gold_cites else (citations == [])

def topk(ids, k):
    return (ids or [])[:k]

def main():
    ap = argparse.ArgumentParser()
    ap.add_argument("--gold", required=True)
    ap.add_argument("--trace", required=True)
    ap.add_argument("--k", type=int, default=5)
    ap.add_argument("--gates", default="precision=0.80,chr=0.75,under=0.05,over=0.10")
    args = ap.parse_args()

    gold = {g["qid"]: g for g in load_jsonl(args.gold)}
    traces = {}
    for t in load_jsonl(args.trace):
        qid = t.get("qid") or t.get("q_id")
        if qid: traces[qid] = t  # keep last

    S = R = A = U = 0
    TP = CHR_hit = UNDER = OVER = 0
    RECALL = REC_DEN = 0

    for qid, g in gold.items():
        ans = traces.get(qid, {})
        aj = ans.get("answer_json") or {}
        claim = (aj.get("claim") or "").strip()
        cits = aj.get("citations") or []
        ret = ans.get("retrieved_ids") or []
        is_ans = claim.lower() != REFUSAL

        if g.get("answerable"): A += 1
        else: U += 1
        # sets
        if is_ans: S += 1
        else: R += 1
        # precision / chr
        if is_ans:
            C = contains_substr(claim, g.get("gold_claim_substr"))
            H = citation_hit(cits, g.get("gold_citations"), ret)
            if not g.get("answerable"):
                UNDER += 1  # answered an unanswerable
            else:
                if H: CHR_hit += 1
                if C and H: TP += 1
        elif g.get("answerable"):
            OVER += 1  # refused an answerable
        # recall@k
        if g.get("answerable"):
            REC_DEN += 1
            if set(g.get("gold_citations") or []).issubset(set(topk(ret, args.k))):
                RECALL += 1

    precision = (TP / S) if S else 1.0
    chr_rate = (CHR_hit / S) if S else 1.0
    under = (UNDER / U) if U else 0.0
    over = (OVER / A) if A else 0.0
    recallk = (RECALL / REC_DEN) if REC_DEN else 0.0

    gates = dict(x.split("=") for x in args.gates.split(","))
    def ok(name, val, ge=True):
        thr = float(gates[name])
        return val >= thr if ge else val <= thr
    pass_all = (ok("precision", precision) and ok("chr", chr_rate)
                and ok("under", under, ge=False) and ok("over", over, ge=False))

    print(json.dumps({
        "answered": S, "refused": R, "answerable": A, "unanswerable": U,
        "precision": round(precision, 4), "chr": round(chr_rate, 4),
        "under_refusal": round(under, 4), "over_refusal": round(over, 4),
        "recall@k": round(recallk, 4), "k": args.k,
        "gates": gates, "pass": pass_all
    }, indent=2))

if __name__ == "__main__":
    main()
```
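Before wiring the scorer into CI, a few asserts against its helper predicates catch regressions in the matching rules. A quick illustrative check (the `sanity_check.py` name is hypothetical; assumes you run it from the directory containing `score_eval.py` so the import resolves):

```python
# sanity_check.py — illustrative asserts for score_eval's helper predicates
from score_eval import contains_substr, citation_hit, topk

# Containment (C): case-insensitive; substrings under 5 chars never match
assert contains_substr("X rejects null keys.", ["rejects null keys"])
assert not contains_substr("X rejects null keys.", ["keys"])   # len < 5, ignored
assert contains_substr("anything", [])                         # vacuously true

# Citation hit (H): must be scoped to retrieved_ids AND overlap gold_citations
assert citation_hit(["p1#2"], ["p1#2"], ["p1#1", "p1#2"])
assert not citation_hit(["p9#9"], ["p1#2"], ["p1#1", "p1#2"])  # not retrieved
assert not citation_hit(["p1#1"], ["p1#2"], ["p1#1", "p1#2"])  # no gold overlap

# top-k truncation feeding recall@k
assert topk(["a", "b", "c"], 2) == ["a", "b"]

print("helper checks passed")
```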
Run:

```bash
python ProblemMap/eval/score_eval.py \
  --gold ProblemMap/eval/gold.jsonl \
  --trace runs/trace.jsonl \
  --k 5 \
  --gates precision=0.80,chr=0.75,under=0.05,over=0.10
```

Output (example):

```json
{
  "answered": 2,
  "refused": 1,
  "answerable": 2,
  "unanswerable": 1,
  "precision": 1.0,
  "chr": 1.0,
  "under_refusal": 0.0,
  "over_refusal": 0.0,
  "recall@k": 1.0,
  "k": 5,
  "gates": {
    "precision": "0.80",
    "chr": "0.75",
    "under": "0.05",
    "over": "0.10"
  },
  "pass": true
}
```

---

## 5) CI wiring

* Add a job that runs the scorer on every PR and fails if `pass=false`.
* Store the last report at `eval/report.md` (or JSON) to track regressions.
* Freeze `gold.jsonl` per release; changes require sign-off.

**Example CI step**

```bash
python ProblemMap/eval/score_eval.py \
  --gold ProblemMap/eval/gold.jsonl \
  --trace runs/trace.jsonl \
  --k 5 \
  --gates precision=0.80,chr=0.75,under=0.05,over=0.10 \
  | tee eval/last_report.json
jq -e '.pass == true' eval/last_report.json
```

---

## 6) Troubleshooting

A triage sketch that mechanizes this list follows below.

* **Precision low, CHR low** → grounding is broken. Apply pattern: *RAG Semantic Drift*.
* **Precision low, CHR high** → claim text misses `gold_claim_substr`. Tighten the claim schema / substrings.
* **Under-refusal high** → many unanswerables were answered; strengthen refusal behavior or retrieval constraints.
* **Over-refusal high** → you’re refusing real questions; improve recall@k or shrink chunks.
* **Recall@k low** → index/manifest drift or retrieval logic. See *Vector Store Fragmentation* and Example 03.
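A minimal sketch of that triage script (the `triage.py` name and the recall@k threshold are illustrative, not part of the scorer; it reads the `eval/last_report.json` written by the CI step above):

```python
#!/usr/bin/env python3
# triage.py — map a scorer report to the troubleshooting buckets above (illustrative)
import json, sys

def triage(r):
    hints = []
    low_p, low_c = r["precision"] < 0.80, r["chr"] < 0.75
    if low_p and low_c:
        hints.append("Precision low, CHR low -> grounding broken (RAG Semantic Drift)")
    elif low_p:
        hints.append("Precision low, CHR high -> claims miss gold_claim_substr; tighten claim schema")
    if r["under_refusal"] > 0.05:
        hints.append("Under-refusal high -> strengthen refusal behavior / retrieval constraints")
    if r["over_refusal"] > 0.10:
        hints.append("Over-refusal high -> improve recall@k or shrink chunks")
    if r["recall@k"] < 0.80:  # illustrative threshold; no default gate is defined for recall@k
        hints.append("Recall@k low -> check index/manifest drift (Vector Store Fragmentation)")
    return hints or ["all tracked symptoms look healthy"]

if __name__ == "__main__":
    path = sys.argv[1] if len(sys.argv) > 1 else "eval/last_report.json"
    with open(path, encoding="utf8") as f:
        print("\n".join(triage(json.load(f))))
```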
---

### 🧭 Explore More

| Module | Description | Link |
| ------ | ----------- | ---- |
| WFGY Core | Standalone semantic reasoning engine for any LLM | [View →](https://github.com/onestardao/WFGY/tree/main/core/README.md) |
| Problem Map 1.0 | Initial 16-mode diagnostic and symbolic fix framework | [View →](https://github.com/onestardao/WFGY/tree/main/ProblemMap/README.md) |
| Problem Map 2.0 | RAG-focused failure tree, modular fixes, and pipelines | [View →](https://github.com/onestardao/WFGY/blob/main/ProblemMap/rag-architecture-and-recovery.md) |
| Semantic Clinic Index | Expanded failure catalog: prompt injection, memory bugs, logic drift | [View →](https://github.com/onestardao/WFGY/blob/main/ProblemMap/SemanticClinicIndex.md) |
| Semantic Blueprint | Layer-based symbolic reasoning & semantic modulations | [View →](https://github.com/onestardao/WFGY/tree/main/SemanticBlueprint/README.md) |
| Benchmark vs GPT-5 | Stress test GPT-5 with full WFGY reasoning suite | [View →](https://github.com/onestardao/WFGY/tree/main/benchmarks/benchmark-vs-gpt5/README.md) |

---

> 👑 **Early Stargazers: [See the Hall of Fame](https://github.com/onestardao/WFGY/tree/main/stargazers)** —
> Engineers, hackers, and open source builders who supported WFGY from day one.

⭐ Help reach 10,000 stars by 2025-09-01 to unlock Engine 2.0 for everyone ⭐
**[Star WFGY on GitHub](https://github.com/onestardao/WFGY)**

[![WFGY Main](https://img.shields.io/badge/WFGY-Main-red?style=flat-square)](https://github.com/onestardao/WFGY) [![TXT OS](https://img.shields.io/badge/TXT%20OS-Reasoning%20OS-orange?style=flat-square)](https://github.com/onestardao/WFGY/tree/main/OS) [![Blah](https://img.shields.io/badge/Blah-Semantic%20Embed-yellow?style=flat-square)](https://github.com/onestardao/WFGY/tree/main/OS/BlahBlahBlah) [![Blot](https://img.shields.io/badge/Blot-Persona%20Core-green?style=flat-square)](https://github.com/onestardao/WFGY/tree/main/OS/BlotBlotBlot) [![Bloc](https://img.shields.io/badge/Bloc-Reasoning%20Compiler-blue?style=flat-square)](https://github.com/onestardao/WFGY/tree/main/OS/BlocBlocBloc) [![Blur](https://img.shields.io/badge/Blur-Text2Image%20Engine-navy?style=flat-square)](https://github.com/onestardao/WFGY/tree/main/OS/BlurBlurBlur) [![Blow](https://img.shields.io/badge/Blow-Game%20Logic-purple?style=flat-square)](https://github.com/onestardao/WFGY/tree/main/OS/BlowBlowBlow)