
Eval — RAG Precision, Refusals, CHR, and Recall@k (stdlib-only)

Goal
Provide a deterministic, SDK-free way to score grounded Q&A quality and retrieval quality for RAG pipelines. This page defines metrics, data contracts, and a ≤100-line reference scorer.

What you get

  • Clear precision/CHR/refusal definitions for grounded answers
  • Recall@k for retrieval (upper bound on answerability)
  • A tiny reference scorer (Python stdlib-only) + sample JSONL

1) Data Contracts

1.1 Gold set (eval/gold.jsonl)

One JSON object per line:

{"qid":"A0001","question":"Does X support null keys?","answerable":true,"gold_claim_substr":["rejects null keys"],"gold_citations":["p1#2"],"constraints":["X rejects null keys."]}
{"qid":"A0002","question":"Explain Z.","answerable":false,"gold_claim_substr":[],"gold_citations":[]}
{"qid":"A0003","question":"What domain is allowed?","answerable":true,"gold_claim_substr":["only domain example.com"],"gold_citations":["pB#1"]}

Rules:

  • answerable=false ⇒ the only correct output is the exact refusal token: not in context
  • gold_claim_substr: minimal substrings that must appear in the shipped claim (case-insensitive, ≥5 chars)
  • gold_citations: any overlap with shipped citations counts as a citation hit
  • constraints (optional): used by SCU checks elsewhere; ignored in this page's scorer unless you extend it
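
A minimal validation sketch for gold.jsonl, assuming only the contract above (the script name check_gold.py is hypothetical; required fields and the ≥5-char substring rule come from the rules listed here):

#!/usr/bin/env python3
# check_gold.py (hypothetical name) - sanity-check gold.jsonl against the contract above
import json, sys

def check(path):
    ok = True
    with open(path, encoding="utf8") as f:
        for i, line in enumerate(f, 1):
            line = line.strip()
            if not line:
                continue
            g = json.loads(line)
            # required fields
            for field in ("qid", "question", "answerable"):
                if field not in g:
                    print(f"line {i}: missing field {field!r}"); ok = False
            # unanswerable items should not carry gold substrings or citations
            if g.get("answerable") is False and (g.get("gold_claim_substr") or g.get("gold_citations")):
                print(f"line {i}: unanswerable item carries gold substrings/citations"); ok = False
            # substrings shorter than 5 chars are ignored by the scorer
            for s in g.get("gold_claim_substr") or []:
                if len(s) < 5:
                    print(f"line {i}: substring shorter than 5 chars: {s!r}"); ok = False
    return ok

if __name__ == "__main__":
    sys.exit(0 if check(sys.argv[1]) else 1)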

1.2 System traces (runs/trace.jsonl)

Emitted by your guarded pipeline:

{"qid":"A0001","q":"Does X support null keys?","retrieved_ids":["p1#1","p1#2","p2#1"],"answer_json":{"claim":"X rejects null keys.","citations":["p1#2"]}}
{"qid":"A0002","q":"Explain Z.","retrieved_ids":["p1#1","p2#1"],"answer_json":{"claim":"not in context","citations":[]}}
{"qid":"A0003","q":"What domain is allowed?","retrieved_ids":["pB#1","p1#2"],"answer_json":{"claim":"Only domain example.com is allowed.","citations":["pB#1"]}}

Rules:

  • answer_json.claim is either a sentence or the exact refusal token not in context
  • answer_json.citations must be ids taken from retrieved_ids (scoped grounding)
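
A minimal sketch of how a guarded pipeline might append one trace record per question (the append_trace helper and file handling are assumptions; only the JSON shape above is required):

#!/usr/bin/env python3
# emit_trace.py - minimal sketch: append one trace record per question
import json

def append_trace(path, qid, question, retrieved_ids, claim, citations):
    # citations must be ids taken from retrieved_ids; refusals use the exact token
    record = {
        "qid": qid,
        "q": question,
        "retrieved_ids": retrieved_ids,
        "answer_json": {"claim": claim, "citations": citations},
    }
    with open(path, "a", encoding="utf8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")

# example usage with the A0002 refusal from the sample above
append_trace("runs/trace.jsonl", "A0002", "Explain Z.",
             ["p1#1", "p2#1"], "not in context", [])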

2) Metrics (definitions)

Let

  • S = set of shipped answers (claim ≠ not in context)
  • R = set of refusals (claim == not in context)
  • A = gold items with answerable=true
  • U = gold items with answerable=false

Derived checks per qid:

  • Containment (C): any gold_claim_substr appears in the shipped claim (case-insensitive, min len ≥ 5)
  • Citation hit (H): citations ∩ gold_citations ≠ ∅ and citations ⊆ retrieved_ids

Scores:

  • Precision (answered) = |{ x ∈ S ∩ A : C ∧ H }| / |S| (Of what we shipped, how many are correct and properly cited?)
  • Under-refusal rate = |{ x ∈ S ∩ U }| / |U| (Should have refused but answered anyway.)
  • Over-refusal rate = |{ x ∈ R ∩ A }| / |A| (Should have answered but refused.)
  • Citation Hit Rate (CHR) = |{ x ∈ S : H }| / |S|
  • Recall@k = |{ x ∈ A : gold_citations ⊆ top-k(retrieved_ids) }| / |A|

Tip: track Precision and CHR together. High precision with low CHR usually means “answers look right but cite the wrong evidence.”

Default ship gates (suggested)

  • Precision (answered) ≥ 0.80
  • CHR ≥ 0.75
  • Under-refusal ≤ 0.05
  • Over-refusal ≤ 0.10

Commit gate thresholds to your repo and enforce them in CI.


3) Worked Mini-Example

Gold (3 items) + Traces above:

  • A0001 → shipped, contains “rejects null keys”, cites p1#2 (hit) → contributes to Precision and CHR
  • A0002 → refusal, gold says unanswerable → correct refusal (affects neither Precision nor Over-refusal)
  • A0003 → shipped, contains “only domain example.com”, cites pB#1 (hit)

With all three correct, you'll see:

  • Precision (answered) = 2/2 = 1.00
  • CHR = 2/2 = 1.00
  • Under-refusal = 0/1 = 0.00
  • Over-refusal = 0/2 = 0.00
  • Recall@k depends on your chosen k and the retrieved ids (here, every gold citation appears in the top-k retrieved ids, so Recall@5 = 1.00; see the sketch below)
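
A quick check of Recall@5 on the two answerable items above, using the values from the sample gold and trace lines and the top-k set test from Section 2:

gold_cites = {"A0001": ["p1#2"], "A0003": ["pB#1"]}
retrieved  = {"A0001": ["p1#1", "p1#2", "p2#1"], "A0003": ["pB#1", "p1#2"]}

k = 5
hits = sum(1 for qid in gold_cites
           if set(gold_cites[qid]).issubset(set(retrieved[qid][:k])))
print(hits / len(gold_cites))  # 1.0, i.e. Recall@5 = 1.00 for this mini-set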

4) Reference scorer (≤100 lines, Python stdlib-only)

Save as ProblemMap/eval/score_eval.py:

#!/usr/bin/env python3
import json, sys, argparse

REFUSAL = "not in context"

def load_jsonl(path):
    with open(path, encoding="utf8") as f:
        for line in f:
            line=line.strip()
            if line: yield json.loads(line)

def contains_substr(claim, subs):
    c = (claim or "").lower()
    for s in subs or []:
        s = s.lower()
        if len(s) >= 5 and s in c:
            return True
    return not subs  # if no gold substrings are given, treat containment as vacuously true

def citation_hit(citations, gold_cites, retrieved):
    if not isinstance(citations, list): return False
    if not set(citations).issubset(set(retrieved or [])): return False
    return bool(set(citations or []) & set(gold_cites or [])) if gold_cites else (citations == [])

def topk(ids, k): return (ids or [])[:k]

def main():
    ap = argparse.ArgumentParser()
    ap.add_argument("--gold", required=True)
    ap.add_argument("--trace", required=True)
    ap.add_argument("--k", type=int, default=5)
    ap.add_argument("--gates", default="precision=0.80,chr=0.75,under=0.05,over=0.10")
    args = ap.parse_args()

    gold = {g["qid"]: g for g in load_jsonl(args.gold)}
    traces = {}
    for t in load_jsonl(args.trace):
        qid = t.get("qid") or t.get("q_id")
        if qid: traces[qid] = t  # keep last

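    # counters: S/R = shipped vs refused, A/U = answerable vs unanswerable gold,
    # TP = correct-and-cited answers, CHR_hit = citation hits,
    # UNDER/OVER = refusal errors, RECALL/REC_DEN = recall@k numerator/denominator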
    S=R=A=U=0
    TP=CHR_hit=0
    UNDER=OVER=0
    RECALL=REC_DEN=0

    for qid,g in gold.items():
        ans = traces.get(qid, {})
        aj = (ans.get("answer_json") or {})
        claim = (aj.get("claim") or "").strip()
        cits  = aj.get("citations") or []
        ret   = ans.get("retrieved_ids") or []
        is_ans = claim.lower() != REFUSAL
        if g.get("answerable"): A+=1
        else: U+=1

        # sets
        if is_ans: S+=1
        else: R+=1

        # precision / chr
        if is_ans:
            C = contains_substr(claim, g.get("gold_claim_substr"))
            H = citation_hit(cits, g.get("gold_citations"), ret)
            if g.get("answerable")==False: UNDER += 1
            else:
                if H: CHR_hit += 1
                if C and H: TP += 1
        else:
            if g.get("answerable")==True: OVER += 1

        # recall@k
        if g.get("answerable")==True:
            REC_DEN += 1
            kset = set(topk(ret, args.k))
            if set(g.get("gold_citations") or []).issubset(kset):
                RECALL += 1

    precision = (TP / S) if S else 1.0
    chr_rate  = (CHR_hit / S) if S else 1.0
    under     = (UNDER / U) if U else 0.0
    over      = (OVER / A) if A else 0.0
    recallk   = (RECALL / REC_DEN) if REC_DEN else 0.0

    gates = dict(x.split("=") for x in args.gates.split(","))
    def ok(name, val, ge=True):
        thr = float(gates[name])
        return val >= thr if ge else val <= thr

    pass_all = (ok("precision", precision) and ok("chr", chr_rate) and
                ok("under", under, ge=False) and ok("over", over, ge=False))

    print(json.dumps({
        "answered": S, "refused": R, "answerable": A, "unanswerable": U,
        "precision": round(precision,4),
        "chr": round(chr_rate,4),
        "under_refusal": round(under,4),
        "over_refusal": round(over,4),
        "recall@k": round(recallk,4),
        "k": args.k,
        "gates": gates,
        "pass": pass_all
    }, indent=2))

if __name__ == "__main__":
    main()

Run:

python ProblemMap/eval/score_eval.py \
  --gold ProblemMap/eval/gold.jsonl \
  --trace runs/trace.jsonl \
  --k 5 \
  --gates precision=0.80,chr=0.75,under=0.05,over=0.10

Output (example):

{
  "answered": 2,
  "refused": 1,
  "answerable": 2,
  "unanswerable": 1,
  "precision": 1.0,
  "chr": 1.0,
  "under_refusal": 0.0,
  "over_refusal": 0.0,
  "recall@k": 1.0,
  "k": 5,
  "gates": {"precision":"0.80","chr":"0.75","under":"0.05","over":"0.10"},
  "pass": true
}

5) CI wiring

  • Add a job that runs the scorer on every PR and fails if pass=false.
  • Store last report at eval/report.md (or JSON) to track regressions.
  • Freeze gold.jsonl per release; changes require sign-off.

Example CI step

python ProblemMap/eval/score_eval.py \
  --gold ProblemMap/eval/gold.jsonl \
  --trace runs/trace.jsonl \
  --k 5 \
  --gates precision=0.80,chr=0.75,under=0.05,over=0.10 \
| tee eval/last_report.json

jq -e '.pass == true' eval/last_report.json
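
If jq is not available in the CI image, the same gate can be enforced with a stdlib-only one-liner that reads the report written by tee above:

python -c "import json,sys; sys.exit(0 if json.load(open('eval/last_report.json'))['pass'] else 1)"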

6) Troubleshooting

  • Precision low, CHR low → Grounding broken. Apply Pattern: RAG Semantic Drift.
  • Precision low, CHR high → Claim text misses gold_claim_substr. Tighten claim schema / substrings.
  • Under-refusal high → Many unanswerables were answered; strengthen refusal behavior or retrieval constraints.
  • Over-refusal high → You're refusing real questions; improve recall@k or shrink chunks.
  • Recall@k low → Index/manifest drift or retrieval logic. See Vector Store Fragmentation and Example 03.

🧭 Explore More

  • WFGY Core: Standalone semantic reasoning engine for any LLM
  • Problem Map 1.0: Initial 16-mode diagnostic and symbolic fix framework
  • Problem Map 2.0: RAG-focused failure tree, modular fixes, and pipelines
  • Semantic Clinic Index: Expanded failure catalog (prompt injection, memory bugs, logic drift)
  • Semantic Blueprint: Layer-based symbolic reasoning & semantic modulations
  • Benchmark vs GPT-5: Stress test GPT-5 with the full WFGY reasoning suite

👑 Early Stargazers: See the Hall of Fame — Engineers, hackers, and open source builders who supported WFGY from day one.

Help reach 10,000 GitHub stars by 2025-09-01 to unlock Engine 2.0 for everyone. Star WFGY on GitHub.
