# Eval — Quality & Readiness Gates

This folder defines how we measure whether a pipeline is allowed to ship.
All evals are SDK-free (stdlib-only) and deterministic: given the same inputs, you must get the same scores and the same ship/no-ship decision.
## Quick Links (run these first)
- RAG Precision / Refusals / CHR / Recall@k → eval_rag_precision_recall.md
- Latency vs Accuracy (SLO & Pareto) → eval_latency_vs_accuracy.md
- Cross-Agent Consistency (Scholar ↔ Auditor, κ) → eval_cross_agent_consistency.md
- Semantic Stability (Seeds & Prompt Jitter) → eval_semantic_stability.md
Start with precision/CHR, then latency SLO, then agent agreement, then stability. Fail fast on each gate.
## 0) What we score (TL;DR)

**Grounded Q&A quality**
- Precision (answered) — fraction of shipped answers that are correct and properly cited
- Under-refusal rate — fraction of should-refuse questions that were wrongly answered
- Over-refusal rate — fraction of should-answer questions that were refused
- Citation Hit Rate (CHR) — fraction of shipped answers whose citations actually contain the claim
- Constraint Integrity (SCU) — fraction of shipped answers that preserve locked constraints (no contradictions)
**Retrieval quality**
- Recall@k — fraction of questions whose gold evidence appears in top-k retrieved ids
- CHR@k (upper bound) — best-case CHR if the model always picked the right ids from the retrieved pool
**Operational**
- Latency vs Accuracy — P50/P95/P99 vs Precision/CHR under knob sweeps and load
- Cross-agent consistency — Scholar vs Auditor agreement (Percent Agreement, Cohen’s κ)
- Semantic stability — invariance to seeds and benign prompt jitters (ACR/CGHC/CSS/NED)
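To make the arithmetic concrete, here is a minimal stdlib-only sketch of how the answered-set rates aggregate once each question has been judged against the gold set. The `verdicts` field names are illustrative, not the interface of the scorers listed in section 3.

```python
def summarize(verdicts):
    """verdicts: list of dicts like
    {"answerable": bool, "refused": bool, "correct": bool, "cited_ok": bool, "gold_in_topk": bool}
    (illustrative shape; see section 1 for the real data contracts).
    """
    answered = [v for v in verdicts if not v["refused"]]
    should_refuse = [v for v in verdicts if not v["answerable"]]
    should_answer = [v for v in verdicts if v["answerable"]]
    return {
        # shipped answers that are correct and properly cited
        "precision": sum(v["correct"] and v["cited_ok"] for v in answered) / max(len(answered), 1),
        # shipped answers whose citations contain the claim
        "chr": sum(v["cited_ok"] for v in answered) / max(len(answered), 1),
        # should-refuse questions that were wrongly answered
        "under_refusal": sum(not v["refused"] for v in should_refuse) / max(len(should_refuse), 1),
        # should-answer questions that were refused
        "over_refusal": sum(v["refused"] for v in should_answer) / max(len(should_answer), 1),
        # answerable questions whose gold evidence appears in the top-k retrieved ids
        "recall_at_k": sum(v["gold_in_topk"] for v in should_answer) / max(len(should_answer), 1),
    }
```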
## 1) Data contracts

### 1.1 Gold set (`eval/gold.jsonl`)
Each line:
```json
{
  "qid": "A0001",
  "question": "Does X support null keys?",
  "answerable": true,
  "gold_claim_substr": ["rejects null keys"],
  "gold_citations": ["p1#2"],
  "constraints": ["X rejects null keys."]
}
```
**Rules**

- `answerable=false` ⇒ the only correct model output is the exact refusal token: `not in context`
- `gold_claim_substr`: minimal substrings that must appear in the shipped `claim`
- `gold_citations`: any overlap with the shipped `citations` counts as a hit
- `constraints`: optional; enables SCU checks
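A minimal sketch of how one `runs/trace.jsonl` line might be judged against its `eval/gold.jsonl` line under these rules; the `judge` helper and its return fields are illustrative, not part of `score_eval.py`.

```python
REFUSAL = "not in context"   # exact refusal token from the rules above

def judge(gold, trace):
    """Judge one trace line against its matching gold line (matched on qid)."""
    ans = trace.get("answer_json") or {}
    claim = ans.get("claim", "")
    refused = claim.strip() == REFUSAL

    if not gold["answerable"]:
        # the only correct output for an unanswerable question is the exact refusal token
        return {"refused": refused, "correct": refused, "cited_ok": refused}

    # every gold substring must appear in the shipped claim
    correct = all(s in claim for s in gold.get("gold_claim_substr", []))
    # any overlap between shipped and gold citations counts as a hit
    cited_ok = bool(set(ans.get("citations", [])) & set(gold.get("gold_citations", [])))
    return {"refused": refused, "correct": correct, "cited_ok": cited_ok}
```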
### 1.2 System traces (`runs/trace.jsonl`)
Emitted by your guarded pipeline:
```json
{
  "ts": 1723430400,
  "qid": "A0001",
  "q": "Does X support null keys?",
  "retrieved_ids": ["p1#1", "p1#2", "p2#1"],
  "answer_json": {
    "claim": "X rejects null keys.",
    "citations": ["p1#2"],
    "constraints_echo": ["X rejects null keys."]
  },
  "ok": true,
  "reason": "ok"
}
```
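If your pipeline does not already write this file, a stdlib-only appender could look like the sketch below. The `log_trace` helper is illustrative; only the record fields come from the contract above.

```python
import json
import time

def log_trace(path, qid, q, retrieved_ids, answer_json, ok, reason="ok"):
    """Append one trace line in the shape the scorers expect."""
    record = {
        "ts": int(time.time()),
        "qid": qid,
        "q": q,
        "retrieved_ids": retrieved_ids,
        "answer_json": answer_json,   # {"claim": ..., "citations": [...], "constraints_echo": [...]}
        "ok": ok,
        "reason": reason,
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```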
## 2) Default ship gates
- Precision (answered) ≥ 0.80
- CHR ≥ 0.75
- Under-refusal ≤ 0.05
- Over-refusal ≤ 0.10
- If SCU used: 0 constraint violations
- Latency SLO: P95 (E2E) ≤ 2000 ms (interactive)
- Cross-agent: Percent Agreement ≥ 0.90, κ ≥ 0.75, ABSTAIN ≤ 0.02
- Stability: ACR ≥ 0.95, CGHC ≥ 0.95, CSS ≥ 0.70, NED₅₀ ≤ 0.20, RCR ≥ 0.98
Tune per product, pin thresholds in repo, and enforce in CI.
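Pinning thresholds in the repo and enforcing them in CI can be as small as the sketch below. The gate names mirror the list above; the script itself is illustrative and is not one of the files listed in section 3.

```python
import json
import sys

# pinned thresholds (tune per product and freeze per release)
GATES = {
    "precision": (">=", 0.80),
    "chr": (">=", 0.75),
    "under_refusal": ("<=", 0.05),
    "over_refusal": ("<=", 0.10),
}

def check(metrics, gates=GATES):
    failures = []
    for name, (op, limit) in gates.items():
        value = metrics.get(name)
        ok = value is not None and (value >= limit if op == ">=" else value <= limit)
        if not ok:
            failures.append(f"{name}={value} (want {op} {limit})")
    return failures

if __name__ == "__main__":
    metrics = json.load(open(sys.argv[1]))   # e.g. eval/acc.json
    failures = check(metrics)
    if failures:
        print("NO-SHIP:", "; ".join(failures))
        sys.exit(1)                          # non-zero exit fails the CI job
    print("SHIP: all gates passed")
```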
## 3) File layout
- README.md — this file (entrypoint)
- eval_rag_precision_recall.md — answer/retrieval metrics + sample JSONL + scorer
- eval_latency_vs_accuracy.md — SLO curves; sweep harness; Pareto selection
- eval_cross_agent_consistency.md — Scholar vs Auditor, κ, arbitration policy
- eval_semantic_stability.md — seeds/jitters invariance, ACR/CGHC/CSS/NED
- score_eval.py — reference scorer for precision/CHR/refusals/recall@k
- latency_sweep.py — sweep harness → runs/latency.csv + summary
- cross_agent_consistency.py — PA/κ + disagreements TSV + arbitration
- semantic_stability.py — runner+scorer for seeds/jitters
- gold.jsonl — canonical gold set (freeze per release)
## 4) Minimal quickstart (stdlib-only)
- Create `eval/gold.jsonl` (10–50 items to start).
- Run your guarded pipeline to append `runs/trace.jsonl`.
- Score core metrics:
```bash
python ProblemMap/eval/score_eval.py \
  --gold ProblemMap/eval/gold.jsonl \
  --trace runs/trace.jsonl \
  --k 5 \
  --gates precision=0.80,chr=0.75,under=0.05,over=0.10
```
## 5) Run the full eval suite (CI recipe, copy/paste)
```bash
# A) Latency sweeps (1 rps & 5 rps)
python ProblemMap/eval/latency_sweep.py --gold ProblemMap/eval/gold.jsonl --rps 1 --duration 30 | tee eval/lat_1rps.json
python ProblemMap/eval/latency_sweep.py --gold ProblemMap/eval/gold.jsonl --rps 5 --duration 60 | tee eval/lat_5rps.json

# B) Core accuracy
python ProblemMap/eval/score_eval.py --gold ProblemMap/eval/gold.jsonl --trace runs/trace.jsonl --k 5 > eval/acc.json

# C) Cross-agent consistency
python ProblemMap/eval/cross_agent_consistency.py --pairs ProblemMap/eval/consistency_pairs.jsonl > eval/consistency.json

# D) Semantic stability (small daily sweep)
python ProblemMap/eval/semantic_stability.py --mode run --gold ProblemMap/eval/gold.jsonl --http http://localhost:8080/qa --seeds 0,1,2 --jitters none,ws,syn
python ProblemMap/eval/semantic_stability.py --mode score --gold ProblemMap/eval/gold.jsonl --stability runs/stability.jsonl > eval/stability.json

# E) Gates
jq -e '.p95 <= 2000' eval/lat_1rps.json
jq -e '.p95 <= 2500' eval/lat_5rps.json
jq -e '.precision >= 0.80 and .chr >= 0.75 and .under_refusal <= 0.05 and .over_refusal <= 0.10' eval/acc.json
jq -e '.percent_agreement >= 0.90 and .kappa >= 0.75 and .abstain_rate <= 0.02 and .pass==true' eval/consistency.json
jq -e '.pass == true' eval/stability.json
```
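For reference, the Percent Agreement and Cohen's κ that step C reports reduce to the standard two-rater formulas. A minimal sketch over paired Scholar/Auditor labels; the pairing format here is an assumption, not the input schema of `cross_agent_consistency.py`.

```python
from collections import Counter

def agreement(pairs):
    """pairs: list of (scholar_label, auditor_label) tuples, e.g. [("ship", "ship"), ...]."""
    n = len(pairs)
    if n == 0:
        return {"percent_agreement": 0.0, "kappa": 0.0}
    po = sum(a == b for a, b in pairs) / n                      # observed (percent) agreement
    scholar = Counter(a for a, _ in pairs)
    auditor = Counter(b for _, b in pairs)
    # chance agreement from each rater's marginal label distribution
    pe = sum(scholar[l] * auditor[l] for l in set(scholar) | set(auditor)) / (n * n)
    kappa = 1.0 if pe == 1 else (po - pe) / (1 - pe)
    return {"percent_agreement": po, "kappa": kappa}
```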
## 6) Troubleshooting map
- Precision low, CHR low → Start at RAG Precision & CHR page; apply Semantic Drift pattern (guard + intersection + knee).
- Over-refusal high → Recall@k too low or chunks split facts; shrink chunks and re-rank.
- Latency P95 blown → Trim `max_tokens`, enable intersection+knee, reduce `rerank_depth` (see Latency vs Accuracy).
- Agent κ low → Templates drift or inconsistent guards; fix schema and audit rules (Cross-Agent Consistency).
- Stability fails → Retrieval pool unstable; apply Vector Store Fragmentation and SCU patterns.
## 🔗 Quick-Start Downloads (60 sec)
| Tool | Link | 3-Step Setup |
|---|---|---|
| WFGY 1.0 PDF | Engine Paper | 1️⃣ Download · 2️⃣ Upload to your LLM · 3️⃣ Ask “Answer using WFGY + <your question>” |
| TXT OS (plain-text OS) | TXTOS.txt | 1️⃣ Download · 2️⃣ Paste into any LLM chat · 3️⃣ Type “hello world” — OS boots instantly |
## 🧭 Explore More
| Module | Description | Link |
|---|---|---|
| WFGY Core | WFGY 2.0 engine is live: full symbolic reasoning architecture and math stack | View → |
| Problem Map 1.0 | Initial 16-mode diagnostic and symbolic fix framework | View → |
| Problem Map 2.0 | RAG-focused failure tree, modular fixes, and pipelines | View → |
| Semantic Clinic Index | Expanded failure catalog: prompt injection, memory bugs, logic drift | View → |
| Semantic Blueprint | Layer-based symbolic reasoning & semantic modulations | View → |
| Benchmark vs GPT-5 | Stress test GPT-5 with full WFGY reasoning suite | View → |
| 🧙♂️ Starter Village 🏡 | New here? Lost in symbols? Click here and let the wizard guide you through | Start → |
👑 Early Stargazers: See the Hall of Fame —
Engineers, hackers, and open source builders who supported WFGY from day one.
⭐ WFGY Engine 2.0 is already unlocked. ⭐ Star the repo to help others discover it and unlock more on the Unlock Board.