# Eval — Cross-Agent Consistency (Scholar ↔ Auditor, Cohen’s κ, Conflict Policy)
## 🧭 Quick Return to Map

You are in a sub-page of Eval. To reorient, go back here:

- Eval — model evaluation and benchmarking
- WFGY Global Fix Map — main Emergency Room, 300+ structured fixes
- WFGY Problem Map 1.0 — 16 reproducible failure modes

Think of this page as a desk within a ward. If you need the full triage and all prescriptions, return to the Emergency Room lobby.
## Goal

Measure and enforce agreement between two independent validators: a Scholar (claims/citations checker) and an Auditor (policy/provenance/constraints gate). Produce (1) quantitative agreement (Percent Agreement and Cohen’s κ) and (2) a deterministic conflict-resolution policy for ship/no-ship decisions.
## What you get
- A small label space and file formats
- A ≤120-line reference scorer (Python stdlib-only)
- CI gates and a transparent arbitration rule
## 1) Label Space (tri-state + abstain)

Every agent emits exactly one label per QID:

- `VALID` — grounded & policy-compliant answer (citations scoped to retrieved ids, passes guards)
- `NOT_IN_CONTEXT` — correct refusal (exact token `not in context`)
- `REJECT` — answer present but violates grounding/provenance/constraints/template
- `ABSTAIN` — (optional) the agent can’t decide; counted separately

Tip: treat `VALID` and `NOT_IN_CONTEXT` as both acceptable outcomes; `REJECT` is a hard fail.
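This contract is easy to enforce mechanically before anything reaches the scorer. A minimal sketch — the `check_label` helper and the `ALLOWED` set are illustrative names, not part of the reference scorer:

```python
# Hypothetical guard for the tri-state (+ optional ABSTAIN) label contract.
ALLOWED = {"VALID", "NOT_IN_CONTEXT", "REJECT", "ABSTAIN"}

def check_label(label: str) -> str:
    """Return 'ok' for acceptable outcomes, 'fail' for REJECT; raise on junk."""
    if label not in ALLOWED:
        raise ValueError(f"unknown label: {label!r}")
    return "fail" if label == "REJECT" else "ok"
```

Rejecting unknown strings early keeps a typo like `VALD` from silently shrinking N downstream.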
## 2) Data Contracts

### 2.1 Pairwise judgments (`eval/consistency_pairs.jsonl`)

One JSON object per line, merged per `qid`:
```json
{
  "qid": "A0001",
  "scholar": {"label": "VALID", "reason": "claim contained; cites p1#2"},
  "auditor": {"label": "VALID", "reason": "provenance ok; constraints echoed"},
  "answer_json": {"claim": "X rejects null keys.", "citations": ["p1#2"], "constraints_echo": ["X rejects null keys."]},
  "retrieved_ids": ["p1#1", "p1#2", "p2#1"],
  "flags": {"provenance_violation": false, "constraints_mismatch": false}
}
```
You may also keep two separate files:

- `eval/scholar.jsonl` with `{ "qid": "...", "label": "...", "reason": "..." }`
- `eval/auditor.jsonl` — same schema

The scorer accepts either the merged pairs file or the two separate files (it joins them by `qid`).
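Before scoring, it can pay to validate each merged record against the contract above. A minimal sketch — `validate_pair` is a hypothetical helper, not part of the reference scorer; the keys match section 2.1:

```python
import json

REQUIRED = ("qid", "scholar", "auditor")

def validate_pair(line: str) -> dict:
    """Parse one merged-pairs JSONL line and check the minimal schema."""
    r = json.loads(line)
    for key in REQUIRED:
        if key not in r:
            raise KeyError(f"missing field: {key}")
    for agent in ("scholar", "auditor"):
        if "label" not in r[agent]:
            raise KeyError(f"{agent} record lacks a label")
    # Provenance pre-check: citations must stay within the retrieved ids.
    cits = set((r.get("answer_json") or {}).get("citations") or [])
    retrieved = set(r.get("retrieved_ids") or [])
    r["citations_in_scope"] = (not cits) or cits.issubset(retrieved)
    return r
```

Running this as a pre-flight step surfaces schema drift as a loud error instead of a silently dropped item.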
## 3) Metrics

Let N be the number of items labeled by both agents (items missing either label are excluded). Let L be the label set {VALID, NOT_IN_CONTEXT, REJECT, ABSTAIN} (ABSTAIN is optional).
- Percent Agreement (PA):

  $$\text{PA} = \frac{1}{N}\sum_{i=1}^{N} \mathbf{1}\left[y^{\text{sch}}_i = y^{\text{aud}}_i\right]$$

- Cohen’s κ on nominal labels (include `ABSTAIN` if present):

  $$\kappa = \frac{P_o - P_e}{1 - P_e}$$

  where $P_o$ is the observed agreement (confusion-matrix diagonal divided by N) and $P_e = \sum_{\ell \in L} p^{\text{sch}}_\ell \cdot p^{\text{aud}}_\ell$ is the chance agreement (product of the two agents’ label marginals, summed over labels).
Rule of thumb (not law): κ ≥ 0.75 = substantial agreement.
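Both formulas can be checked by hand on a small confusion matrix. A worked example — the 3×3 matrix below is invented for illustration:

```python
# Worked example: PA and Cohen's kappa on a tiny 3x3 confusion matrix
# (rows = Scholar, cols = Auditor; labels VALID, NOT_IN_CONTEXT, REJECT).
M = [[12, 1, 1],
     [ 1, 3, 0],
     [ 0, 0, 2]]
n = sum(sum(row) for row in M)                              # 20 items
po = sum(M[i][i] for i in range(3)) / n                     # observed agreement
rows = [sum(M[i]) for i in range(3)]                        # Scholar marginals
cols = [sum(M[i][j] for i in range(3)) for j in range(3)]   # Auditor marginals
pe = sum((rows[i] / n) * (cols[i] / n) for i in range(3))   # chance agreement
kappa = (po - pe) / (1 - pe)
print(round(po, 2), round(pe, 2), round(kappa, 3))          # 0.85 0.51 0.694
```

Note how κ (≈0.69) sits well below raw agreement (0.85): the skew toward VALID makes a lot of agreement expected by chance.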
### Default CI gates (suggested)

- Percent Agreement ≥ 0.90
- Cohen’s κ ≥ 0.75
- `ABSTAIN` rate ≤ 0.02
- Disagreements carrying red flags (`provenance_violation` or `constraints_mismatch`) must be auto-`REJECT`ed in the final arbitration
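Where `jq` is unavailable in CI, the same gates can be checked in a few lines of Python. A sketch over the report JSON the scorer prints — `gates_pass` is a hypothetical helper, not part of the scorer:

```python
# Mirror of the suggested CI gates, for runners without jq.
GATES = {"percent_agreement": 0.90, "kappa": 0.75}

def gates_pass(report: dict, abstain_max: float = 0.02) -> bool:
    """True iff every floor gate is met and the abstain ceiling holds."""
    floors_ok = all(report.get(k, 0.0) >= v for k, v in GATES.items())
    return floors_ok and report.get("abstain_rate", 0.0) <= abstain_max
```

Missing keys default to failing values, so a truncated report cannot pass by accident.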
## 4) Deterministic Conflict Policy (Arbitration)

Given a pair `(label_s, label_a)`:

1. If any hard red flag is true (`provenance_violation`, `constraints_mismatch`, or citations ⊄ `retrieved_ids`) → FINAL = `REJECT`.
2. Else if `label_a ≠ VALID` → FINAL = `REJECT` (the Auditor wins on policy).
3. Else if `label_a = VALID` and `label_s ∈ {VALID, NOT_IN_CONTEXT}` → FINAL = `VALID`.
4. Else → FINAL = `REJECT`.

Rationale: shipping requires policy and provenance to be green. The Scholar validates content; the Auditor vetoes unsafe ships.

The scorer produces a TSV of disagreements with the computed FINAL so you can eyeball the deltas.
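The four rules collapse into a small pure function, mirroring the logic the reference scorer in section 5 implements — `arbitrate` is an illustrative name:

```python
def arbitrate(label_s: str, label_a: str, red_flag: bool = False) -> str:
    """Deterministic FINAL per the conflict policy (sketch)."""
    if red_flag:                        # rule 1: hard flag always vetoes
        return "REJECT"
    if label_a != "VALID":              # rule 2: Auditor wins on policy
        return "REJECT"
    if label_s in ("VALID", "NOT_IN_CONTEXT"):
        return "VALID"                  # rule 3: both sides acceptable
    return "REJECT"                     # rule 4: incoherent pair
```

Because the function is total and order-dependent, two reviewers replaying the same pairs always reach the same FINAL.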
## 5) Reference Scorer (stdlib-only, ≤120 lines)

Save as `ProblemMap/eval/cross_agent_consistency.py`:
```python
#!/usr/bin/env python3
import json, argparse, os

LABELS = ["VALID", "NOT_IN_CONTEXT", "REJECT", "ABSTAIN"]

def load_jsonl(path):
    with open(path, encoding="utf8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)

def join_pairs(pairs_path=None, scholar_path=None, auditor_path=None):
    if pairs_path:
        yield from load_jsonl(pairs_path)
        return
    S = {r["qid"]: r for r in load_jsonl(scholar_path)}
    A = {r["qid"]: r for r in load_jsonl(auditor_path)}
    for qid in sorted(S.keys() & A.keys()):
        yield {
            "qid": qid,
            "scholar": {"label": S[qid]["label"], "reason": S[qid].get("reason", "")},
            "auditor": {"label": A[qid]["label"], "reason": A[qid].get("reason", "")},
        }

def confusion_and_kappa(rows):
    # Confusion matrix over the 4 labels (rows = Scholar, cols = Auditor).
    idx = {l: i for i, l in enumerate(LABELS)}
    K = len(LABELS)
    M = [[0] * K for _ in range(K)]
    n = 0
    for r in rows:
        ls = r["scholar"]["label"]; la = r["auditor"]["label"]
        if ls not in idx or la not in idx:
            continue
        M[idx[ls]][idx[la]] += 1; n += 1
    if n == 0:
        return M, 0.0, 0.0, 0, {}
    po = sum(M[i][i] for i in range(K)) / n
    rs = [sum(M[i]) for i in range(K)]                       # Scholar marginals
    cs = [sum(M[i][j] for i in range(K)) for j in range(K)]  # Auditor marginals
    pe = sum((rs[i] / n) * (cs[i] / n) for i in range(K))
    kappa = (po - pe) / (1 - pe) if (1 - pe) > 1e-12 else 0.0
    # Abstain rate counts items where either agent abstained (no double count).
    ab = idx["ABSTAIN"]
    abstain_rate = (rs[ab] + cs[ab] - M[ab][ab]) / n
    return M, po, kappa, n, {"abstain_rate": abstain_rate}

def final_arbitration(r):
    # Rule 1: hard flags (optional field) always veto.
    flags = r.get("flags", {})
    if flags.get("provenance_violation") or flags.get("constraints_mismatch"):
        return "REJECT", "hard_flag"
    # Rule 1 (cont.): citations must be a subset of retrieved_ids, if present.
    aj = r.get("answer_json") or {}
    cits = set(aj.get("citations") or [])
    retrieved = set(r.get("retrieved_ids") or [])
    if cits and not cits.issubset(retrieved):
        return "REJECT", "citation_out_of_scope"
    # Rules 2-4: Auditor decides policy; Scholar validates content.
    la = r["auditor"]["label"]; ls = r["scholar"]["label"]
    if la != "VALID":
        return "REJECT", "auditor_veto"
    if ls in ("VALID", "NOT_IN_CONTEXT"):
        return "VALID", "auditor_ok"
    return "REJECT", "incoherent_pair"

def summarize_disagreements(rows, out_path="runs/consistency_disagreements.tsv"):
    out = []
    for r in rows:
        if r["scholar"]["label"] != r["auditor"]["label"]:
            final, why = final_arbitration(r)
            out.append((r["qid"], r["scholar"]["label"], r["auditor"]["label"], final, why))
    if out:
        os.makedirs(os.path.dirname(out_path) or ".", exist_ok=True)
        with open(out_path, "w", encoding="utf8") as f:
            f.write("qid\tscholar\tauditor\tfinal\twhy\n")
            for row in out:
                f.write("\t".join(row) + "\n")
    return len(out)

def main():
    ap = argparse.ArgumentParser()
    ap.add_argument("--pairs", default=None)
    ap.add_argument("--scholar", default=None)
    ap.add_argument("--auditor", default=None)
    ap.add_argument("--pa_gate", type=float, default=0.90)
    ap.add_argument("--kappa_gate", type=float, default=0.75)
    ap.add_argument("--abstain_gate", type=float, default=0.02)
    args = ap.parse_args()
    if not args.pairs and not (args.scholar and args.auditor):
        ap.error("provide --pairs, or both --scholar and --auditor")
    rows = list(join_pairs(args.pairs, args.scholar, args.auditor))
    M, pa, kappa, n, extra = confusion_and_kappa(rows)
    n_dis = summarize_disagreements(rows)
    report = {
        "n": n,
        "percent_agreement": round(pa, 4),
        "kappa": round(kappa, 4),
        "abstain_rate": round(extra.get("abstain_rate", 0.0), 4),
        "disagreements": n_dis,
        "gates": {"pa": args.pa_gate, "kappa": args.kappa_gate, "abstain": args.abstain_gate},
        "pass": (pa >= args.pa_gate and kappa >= args.kappa_gate
                 and extra.get("abstain_rate", 0.0) <= args.abstain_gate),
    }
    print(json.dumps(report, indent=2))

if __name__ == "__main__":
    main()
```
Run (pairs file):

```bash
python ProblemMap/eval/cross_agent_consistency.py \
  --pairs ProblemMap/eval/consistency_pairs.jsonl \
  --pa_gate 0.90 --kappa_gate 0.75 --abstain_gate 0.02 > eval/consistency.json
```

Run (separate files):

```bash
python ProblemMap/eval/cross_agent_consistency.py \
  --scholar ProblemMap/eval/scholar.jsonl \
  --auditor ProblemMap/eval/auditor.jsonl \
  > eval/consistency.json
```
This writes a human-readable TSV of disagreements at runs/consistency_disagreements.tsv with a FINAL arbitration recommendation per item.
## 6) CI Gates (copy-paste)

```bash
python ProblemMap/eval/cross_agent_consistency.py --pairs ProblemMap/eval/consistency_pairs.jsonl \
  | tee eval/consistency.json
jq -e '.percent_agreement >= 0.90 and .kappa >= 0.75 and .abstain_rate <= 0.02 and .pass == true' eval/consistency.json
```
If the job fails, attach `runs/consistency_disagreements.tsv` to the PR and prioritize fixes in this order:

1. Hard flags first (provenance/constraints)
2. Template/schema drift
3. Retriever recall
## 7) Hygiene & Design Notes

- Blind independence: do not let the Scholar see the Auditor’s label (and vice versa).
- Same evidence: both agents must evaluate the identical `retrieved_ids` and `answer_json`.
- Fixed refusal token: `not in context` only. No synonyms.
- Stable seeds: if any agent is an LLM, fix its temperature/seed (or run multi-vote with a deterministic reducer).
- Audit trail: keep `reason` fields minimal, actionable, and free of PII.
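The "deterministic reducer" for multi-vote judges can be as simple as a strict-majority vote that degrades ties to `ABSTAIN`. A minimal sketch — `reduce_votes` is a hypothetical helper:

```python
from collections import Counter

def reduce_votes(votes):
    """Strict majority wins; any tie for first place degrades to ABSTAIN."""
    counts = Counter(votes).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return "ABSTAIN"
    return counts[0][0]
```

Degrading ties to `ABSTAIN` (rather than picking arbitrarily) keeps the reducer order-independent and feeds the separately counted abstain bucket.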
## 🔗 Quick-Start Downloads (60 sec)
| Tool | Link | 3-Step Setup |
|---|---|---|
| WFGY 1.0 PDF | Engine Paper | 1️⃣ Download · 2️⃣ Upload to your LLM · 3️⃣ Ask “Answer using WFGY + <your question>” |
| TXT OS (plain-text OS) | TXTOS.txt | 1️⃣ Download · 2️⃣ Paste into any LLM chat · 3️⃣ Type “hello world” — OS boots instantly |
## 🧭 Explore More
| Module | Description | Link |
|---|---|---|
| WFGY Core | WFGY 2.0 engine is live: full symbolic reasoning architecture and math stack | View → |
| Problem Map 1.0 | Initial 16-mode diagnostic and symbolic fix framework | View → |
| Problem Map 2.0 | RAG-focused failure tree, modular fixes, and pipelines | View → |
| Semantic Clinic Index | Expanded failure catalog: prompt injection, memory bugs, logic drift | View → |
| Semantic Blueprint | Layer-based symbolic reasoning & semantic modulations | View → |
| Benchmark vs GPT-5 | Stress test GPT-5 with full WFGY reasoning suite | View → |
| 🧙♂️ Starter Village 🏡 | New here? Lost in symbols? Click here and let the wizard guide you through | Start → |
👑 Early Stargazers: See the Hall of Fame —
Engineers, hackers, and open source builders who supported WFGY from day one.
⭐ WFGY Engine 2.0 is already unlocked. ⭐ Star the repo to help others discover it and unlock more on the Unlock Board.