# Eval — RAG Precision, Refusals, CHR, and Recall@k (stdlib-only)

## Goal

Provide a deterministic, SDK-free way to score grounded Q&A quality and retrieval quality for RAG pipelines. This page defines metrics, data contracts, and a ≤100-line reference scorer.

## What you get

- Clear precision/CHR/refusal definitions for grounded answers
- Recall@k for retrieval (an upper bound on answerability)
- A tiny reference scorer (Python stdlib-only) plus sample JSONL
## 1) Data Contracts

### 1.1 Gold set (`eval/gold.jsonl`)

One JSON object per line:

```jsonl
{"qid":"A0001","question":"Does X support null keys?","answerable":true,"gold_claim_substr":["rejects null keys"],"gold_citations":["p1#2"],"constraints":["X rejects null keys."]}
{"qid":"A0002","question":"Explain Z.","answerable":false,"gold_claim_substr":[],"gold_citations":[]}
{"qid":"A0003","question":"What domain is allowed?","answerable":true,"gold_claim_substr":["only domain example.com"],"gold_citations":["pB#1"]}
```
Rules:

- `answerable=false` ⇒ the only correct output is the exact refusal token `not in context`
- `gold_claim_substr`: minimal substrings that must appear in the shipped claim (case-insensitive, ≥5 chars)
- `gold_citations`: any overlap with shipped `citations` counts as a citation hit
- `constraints` (optional): used by SCU checks elsewhere; ignored in this page's scorer unless you extend it
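Under the rules above, each gold line can be lint-checked before it enters the eval set. A minimal sketch, stdlib only; `validate_gold_line` and its specific checks are illustrative, not part of the reference scorer:

```python
import json

MIN_SUBSTR_LEN = 5  # matches the >=5-char containment rule above

def validate_gold_line(line: str) -> list:
    """Return a list of problems with one gold JSONL line (empty list = OK)."""
    problems = []
    g = json.loads(line)
    for key in ("qid", "question", "answerable"):
        if key not in g:
            problems.append(f"missing field: {key}")
    if g.get("answerable"):
        if not g.get("gold_claim_substr"):
            problems.append("answerable item needs gold_claim_substr")
        for s in g.get("gold_claim_substr") or []:
            if len(s) < MIN_SUBSTR_LEN:
                problems.append(f"substring too short (<{MIN_SUBSTR_LEN}): {s!r}")
        if not g.get("gold_citations"):
            problems.append("answerable item needs gold_citations")
    else:
        # unanswerable items must carry no gold evidence
        if g.get("gold_claim_substr") or g.get("gold_citations"):
            problems.append("unanswerable item must have empty gold fields")
    return problems

ok_line = '{"qid":"A0002","question":"Explain Z.","answerable":false,"gold_claim_substr":[],"gold_citations":[]}'
print(validate_gold_line(ok_line))  # → []
```

Running this over the whole file before freezing it catches short substrings and missing citations early, when they are cheap to fix.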
### 1.2 System traces (`runs/trace.jsonl`)

Emitted by your guarded pipeline:

```jsonl
{"qid":"A0001","q":"Does X support null keys?","retrieved_ids":["p1#1","p1#2","p2#1"],"answer_json":{"claim":"X rejects null keys.","citations":["p1#2"]}}
{"qid":"A0002","q":"Explain Z.","retrieved_ids":["p1#1","p2#1"],"answer_json":{"claim":"not in context","citations":[]}}
{"qid":"A0003","q":"What domain is allowed?","retrieved_ids":["pB#1","p1#2"],"answer_json":{"claim":"Only domain example.com is allowed.","citations":["pB#1"]}}
```
Rules:

- `answer_json.claim` is either a sentence or the exact refusal token `not in context`
- `answer_json.citations` must be ids taken from `retrieved_ids` (scoped grounding)
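Both trace rules are mechanically checkable at emission time. A small sketch of a trace linter; the `check_trace` helper is illustrative, not part of the reference scorer:

```python
import json

REFUSAL = "not in context"

def check_trace(trace: dict) -> list:
    """Check one trace record against the two rules above (empty list = OK)."""
    problems = []
    aj = trace.get("answer_json") or {}
    claim = (aj.get("claim") or "").strip()
    cits = aj.get("citations") or []
    retrieved = set(trace.get("retrieved_ids") or [])
    if claim.lower() == REFUSAL:
        if cits:
            problems.append("refusal must carry no citations")
    else:
        # scoped grounding: every cited id must come from this query's retrieval
        bad = [c for c in cits if c not in retrieved]
        if bad:
            problems.append(f"citations outside retrieved_ids: {bad}")
    return problems

t = json.loads('{"qid":"A0001","retrieved_ids":["p1#1","p1#2"],'
               '"answer_json":{"claim":"X rejects null keys.","citations":["p9#9"]}}')
print(check_trace(t))  # flags p9#9 as out of scope
```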
## 2) Metrics (definitions)

Let:

- S = set of shipped answers (claim ≠ `not in context`)
- R = set of refusals (claim == `not in context`)
- A = gold items with `answerable=true`
- U = gold items with `answerable=false`

Derived checks per qid:

- Containment (C): any `gold_claim_substr` appears in the shipped `claim` (case-insensitive, min length ≥ 5)
- Citation hit (H): `citations ∩ gold_citations ≠ ∅` and `citations ⊆ retrieved_ids`
Scores:

- Precision (answered) = |{ x ∈ S ∩ A : C ∧ H }| / |S| (Of what we shipped, how many are correct and properly cited?)
- Under-refusal rate = |{ x ∈ S ∩ U }| / |U| (Should have refused but answered anyway.)
- Over-refusal rate = |{ x ∈ R ∩ A }| / |A| (Should have answered but refused.)
- Citation Hit Rate (CHR) = |{ x ∈ S : H }| / |S|
- Recall@k = |{ x ∈ A : `gold_citations` ⊆ top-k(`retrieved_ids`) }| / |A|
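The Recall@k definition maps directly onto a few lines of code. A standalone sketch (the reference scorer in section 4 computes the same quantity inline):

```python
def recall_at_k(gold_items, traces, k):
    """Fraction of answerable gold items whose gold citations all land in the top-k retrieved ids."""
    answerable = [g for g in gold_items if g.get("answerable")]
    if not answerable:
        return 0.0
    hits = 0
    for g in answerable:
        ret = traces.get(g["qid"], {}).get("retrieved_ids") or []
        if set(g.get("gold_citations") or []).issubset(set(ret[:k])):
            hits += 1
    return hits / len(answerable)

# Data from the worked mini-example below
gold = [
    {"qid": "A0001", "answerable": True, "gold_citations": ["p1#2"]},
    {"qid": "A0003", "answerable": True, "gold_citations": ["pB#1"]},
]
traces = {
    "A0001": {"retrieved_ids": ["p1#1", "p1#2", "p2#1"]},
    "A0003": {"retrieved_ids": ["pB#1", "p1#2"]},
}
print(recall_at_k(gold, traces, 2))  # → 1.0
print(recall_at_k(gold, traces, 1))  # → 0.5  (p1#2 is ranked second for A0001)
```

Note how Recall@k is sensitive to rank order, which is why it is the right early-warning metric for retrieval regressions even when Precision still looks fine.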
Tip: track Precision and CHR together. High precision with low CHR usually means “answers look right but cite the wrong evidence.”
### Default ship gates (suggested)
- Precision (answered) ≥ 0.80
- CHR ≥ 0.75
- Under-refusal ≤ 0.05
- Over-refusal ≤ 0.10
Commit gate thresholds to your repo and enforce them in CI.
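The four gates differ in direction: quality metrics must clear a floor, refusal-error rates must stay under a ceiling. A sketch of encoding that directionality in one table (the `GATES` constant mirrors the suggested thresholds; names are illustrative):

```python
# Each gate: (threshold, direction). ">=" = floor, "<=" = ceiling.
GATES = {
    "precision": (0.80, ">="),
    "chr":       (0.75, ">="),
    "under":     (0.05, "<="),
    "over":      (0.10, "<="),
}

def passes(report: dict) -> bool:
    """True iff every metric clears its gate in the right direction."""
    for name, (thr, direction) in GATES.items():
        val = report[name]
        ok = val >= thr if direction == ">=" else val <= thr
        if not ok:
            return False
    return True

print(passes({"precision": 0.9, "chr": 0.8, "under": 0.0, "over": 0.05}))  # → True
print(passes({"precision": 0.9, "chr": 0.8, "under": 0.2, "over": 0.05}))  # → False
```

Keeping the directions next to the thresholds in one committed table avoids the classic mistake of gating a rate metric with a floor.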
## 3) Worked Mini-Example

Gold (3 items) + traces above:

- A0001 → shipped, contains "rejects null keys", cites `p1#2` (hit) → contributes to Precision and CHR
- A0002 → refusal, gold says unanswerable → correct refusal (hurts neither Precision nor Over-refusal)
- A0003 → shipped, contains "only domain example.com", cites `pB#1` (hit)
With all three correct, you’ll see:
- Precision (answered) = 2/2 = 1.00
- CHR = 2/2 = 1.00
- Under-refusal = 0/1 = 0.00
- Over-refusal = 0/2 = 0.00
- Recall@k depends on your chosen `k` and the retrieved ids (here, the gold citations appear near the top of each list)
## 4) Reference scorer (≤100 lines, Python stdlib-only)

Save as `ProblemMap/eval/score_eval.py`:
```python
#!/usr/bin/env python3
import argparse
import json

REFUSAL = "not in context"

def load_jsonl(path):
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)

def contains_substr(claim, subs):
    c = (claim or "").lower()
    for s in subs or []:
        s = s.lower()
        if len(s) >= 5 and s in c:
            return True
    return not subs  # no gold substrings -> containment is vacuously true

def citation_hit(citations, gold_cites, retrieved):
    if not isinstance(citations, list):
        return False
    if not set(citations).issubset(set(retrieved or [])):
        return False  # scoped grounding: citations must come from retrieved_ids
    if gold_cites:
        return bool(set(citations) & set(gold_cites))
    return citations == []

def topk(ids, k):
    return (ids or [])[:k]

def main():
    ap = argparse.ArgumentParser()
    ap.add_argument("--gold", required=True)
    ap.add_argument("--trace", required=True)
    ap.add_argument("--k", type=int, default=5)
    ap.add_argument("--gates", default="precision=0.80,chr=0.75,under=0.05,over=0.10")
    args = ap.parse_args()

    gold = {g["qid"]: g for g in load_jsonl(args.gold)}
    traces = {}
    for t in load_jsonl(args.trace):
        qid = t.get("qid") or t.get("q_id")
        if qid:
            traces[qid] = t  # keep the last record per qid

    S = R = A = U = 0
    TP = CHR_hit = 0
    UNDER = OVER = 0
    RECALL = REC_DEN = 0
    for qid, g in gold.items():
        ans = traces.get(qid, {})
        aj = ans.get("answer_json") or {}
        claim = (aj.get("claim") or "").strip()
        cits = aj.get("citations") or []
        ret = ans.get("retrieved_ids") or []
        is_ans = claim.lower() != REFUSAL
        if g.get("answerable"):
            A += 1
        else:
            U += 1
        if is_ans:
            S += 1
            H = citation_hit(cits, g.get("gold_citations"), ret)
            if H:
                CHR_hit += 1  # CHR counts every shipped answer, per the definition
            if g.get("answerable"):
                if contains_substr(claim, g.get("gold_claim_substr")) and H:
                    TP += 1
            else:
                UNDER += 1  # shipped an answer to an unanswerable item
        else:
            R += 1
            if g.get("answerable"):
                OVER += 1  # refused an answerable item
        if g.get("answerable"):
            REC_DEN += 1
            if set(g.get("gold_citations") or []).issubset(set(topk(ret, args.k))):
                RECALL += 1

    precision = (TP / S) if S else 1.0
    chr_rate = (CHR_hit / S) if S else 1.0
    under = (UNDER / U) if U else 0.0
    over = (OVER / A) if A else 0.0
    recallk = (RECALL / REC_DEN) if REC_DEN else 0.0

    gates = dict(x.split("=") for x in args.gates.split(","))
    def ok(name, val, ge=True):
        thr = float(gates[name])
        return val >= thr if ge else val <= thr
    pass_all = (ok("precision", precision) and ok("chr", chr_rate) and
                ok("under", under, ge=False) and ok("over", over, ge=False))

    print(json.dumps({
        "answered": S, "refused": R, "answerable": A, "unanswerable": U,
        "precision": round(precision, 4),
        "chr": round(chr_rate, 4),
        "under_refusal": round(under, 4),
        "over_refusal": round(over, 4),
        "recall@k": round(recallk, 4),
        "k": args.k,
        "gates": gates,
        "pass": pass_all,
    }, indent=2))

if __name__ == "__main__":
    main()
```
Run:

```shell
python ProblemMap/eval/score_eval.py \
  --gold ProblemMap/eval/gold.jsonl \
  --trace runs/trace.jsonl \
  --k 5 \
  --gates precision=0.80,chr=0.75,under=0.05,over=0.10
```
Output (example):

```json
{
  "answered": 2,
  "refused": 1,
  "answerable": 2,
  "unanswerable": 1,
  "precision": 1.0,
  "chr": 1.0,
  "under_refusal": 0.0,
  "over_refusal": 0.0,
  "recall@k": 1.0,
  "k": 5,
  "gates": {"precision": "0.80", "chr": "0.75", "under": "0.05", "over": "0.10"},
  "pass": true
}
```
## 5) CI wiring

- Add a job that runs the scorer on every PR and fails if `pass=false`.
- Store the last report at `eval/report.md` (or JSON) to track regressions.
- Freeze `gold.jsonl` per release; changes require sign-off.
### Example CI step

```shell
python ProblemMap/eval/score_eval.py \
  --gold ProblemMap/eval/gold.jsonl \
  --trace runs/trace.jsonl \
  --k 5 \
  --gates precision=0.80,chr=0.75,under=0.05,over=0.10 \
  | tee eval/last_report.json
jq -e '.pass == true' eval/last_report.json
```
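If `jq` is not available in the CI image, the same pass/fail check can be done with a short stdlib-only script. A sketch; the report path is an assumption matching the `tee` target above:

```python
import json

def gate_exit_code(path):
    """Return 0 when the stored report passed its gates, 1 otherwise (shell-friendly)."""
    with open(path, encoding="utf-8") as f:
        report = json.load(f)
    return 0 if report.get("pass") is True else 1

# In CI, call: raise SystemExit(gate_exit_code("eval/last_report.json"))
```

Returning an exit code rather than printing keeps the step drop-in compatible with the `jq -e` line it replaces.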
## 6) Troubleshooting

- Precision low, CHR low → grounding is broken. Apply Pattern: RAG Semantic Drift.
- Precision low, CHR high → claim text misses `gold_claim_substr`. Tighten the claim schema or the substrings.
- Under-refusal high → many unanswerables were answered; strengthen refusal behavior or retrieval constraints.
- Over-refusal high → you're refusing real questions; improve Recall@k or shrink chunks.
- Recall@k low → index/manifest drift or retrieval logic. See Vector Store Fragmentation and Example 03.
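When a gate trips, it helps to bucket each failing qid by mode before reaching for a pattern fix. A minimal triage sketch; the bucket names and the `triage` helper are illustrative, not part of the scorer:

```python
REFUSAL = "not in context"

def triage(gold, trace):
    """Classify one (gold, trace) pair into a failure bucket; None means the pair is fine."""
    aj = trace.get("answer_json") or {}
    claim = (aj.get("claim") or "").strip()
    shipped = claim.lower() != REFUSAL
    if shipped and not gold.get("answerable"):
        return "under_refusal"      # answered an unanswerable item
    if not shipped and gold.get("answerable"):
        return "over_refusal"       # refused an answerable item
    if shipped:
        cits = set(aj.get("citations") or [])
        if gold.get("gold_citations") and not (cits & set(gold["gold_citations"])):
            return "citation_miss"  # hurts CHR
        c = claim.lower()
        subs = gold.get("gold_claim_substr") or []
        if subs and not any(s.lower() in c for s in subs):
            return "substring_miss"  # hurts Precision despite good citations
    return None

g = {"answerable": True, "gold_claim_substr": ["rejects null keys"], "gold_citations": ["p1#2"]}
t = {"answer_json": {"claim": "X allows null keys.", "citations": ["p1#2"]}}
print(triage(g, t))  # → substring_miss
```

Counting buckets across the run tells you directly which troubleshooting row above applies.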
🔗 Quick-Start Downloads (60 sec)
| Tool | Link | 3-Step Setup |
|---|---|---|
| WFGY 1.0 PDF | Engine Paper | 1️⃣ Download · 2️⃣ Upload to your LLM · 3️⃣ Ask “Answer using WFGY + <your question>” |
| TXT OS (plain-text OS) | TXTOS.txt | 1️⃣ Download · 2️⃣ Paste into any LLM chat · 3️⃣ Type “hello world” — OS boots instantly |
🧭 Explore More
| Module | Description | Link |
|---|---|---|
| WFGY Core | WFGY 2.0 engine is live: full symbolic reasoning architecture and math stack | View → |
| Problem Map 1.0 | Initial 16-mode diagnostic and symbolic fix framework | View → |
| Problem Map 2.0 | RAG-focused failure tree, modular fixes, and pipelines | View → |
| Semantic Clinic Index | Expanded failure catalog: prompt injection, memory bugs, logic drift | View → |
| Semantic Blueprint | Layer-based symbolic reasoning & semantic modulations | View → |
| Benchmark vs GPT-5 | Stress test GPT-5 with full WFGY reasoning suite | View → |
| 🧙♂️ Starter Village 🏡 | New here? Lost in symbols? Click here and let the wizard guide you through | Start → |
👑 Early Stargazers: See the Hall of Fame —
Engineers, hackers, and open source builders who supported WFGY from day one.
⭐ WFGY Engine 2.0 is already unlocked. ⭐ Star the repo to help others discover it and unlock more on the Unlock Board.