# 📊 Evaluation Playbook
Ship metrics, not vibes — catch regressions before users do.
**Who is this for?**

- RAG owners tired of “looks fine on my prompt.”
- Agent builders chasing flaky benchmark wins.
- Product teams that must prove ΔS ↓, accuracy ↑, cost ↘.

**What you get:** one YAML suite + CLI commands that output a single-line health verdict:

```
PASS: ΔS<=0.45 | λ→ | E_res stable | cost $0.0013 / query
```
## 0 · Quick Start
```bash
pip install wfgy-eval
wfgy-eval init   # generates eval.yaml + example dataset
wfgy-eval run    # prints CSV & HTML report
```
The default template covers retrieval, reasoning, tool routing, latency, and cost.
## 1 · Metric Matrix
| Layer | Key Metric | Target (prod) | Source Fn |
|---|---|---|---|
| Retrieval | ΔS(q, ctx) | ≤ 0.45 | `deltaS()` |
| Reasoning | λ_observe (3-paraphrase avg) | convergent | `lambda_state()` |
| Stability | E_resonance (rolling) | flat / ↓ | `e_resonance()` |
| Answer | F1 / Exact Match (QA) | ≥ 0.80 | `qa_match()` |
| Hallucination | citation_precision | ≥ 0.90 | `cite_check()` |
| Cost | $ / 1k tokens | ≤ baseline | provider bill |
| Latency | 95th-percentile response (ms) | SLA-dependent | timer |
| ΔS Drift | slope over 100 queries (sketch below) | ~0 | linear fit |
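The ΔS drift row is a plain least-squares fit over logged per-query ΔS values. A minimal sketch, assuming one ΔS value is logged per query; the `deltas_log` name and the alert threshold are illustrative, not part of wfgy-eval:

```python
import numpy as np

def deltaS_drift(deltas_log: list[float]) -> float:
    """Least-squares slope of ΔS over query index; ~0 means no drift."""
    x = np.arange(len(deltas_log))
    slope, _intercept = np.polyfit(x, deltas_log, deg=1)
    return float(slope)

# Example: 100 logged ΔS values creeping upward
log = list(np.linspace(0.30, 0.42, 100))
if abs(deltaS_drift(log)) > 0.001:  # illustrative per-query threshold
    print("ΔS drift detected; re-check the index before it crosses 0.45")
```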
## 2 · Dataset Design
```yaml
sets:
  - name: faq_small
    type: qa
    source: ./data/faq.tsv
    fields: [question, answer, anchor]
  - name: chain_logic
    type: chain-of-thought
    source: ./data/logic.jsonl
    fields: [prompt, expected_reasoning]
  - name: tool_router
    type: tool
    source: ./data/router.csv
    fields: [task, tool_expected]
```
**Guidelines**

- Anchor field = the text chunk you expect to be retrieved; it drives the ΔS and citation checks (see the sketch below).
- Use at least 50 items per set for stable statistics, and 10× that for a leaderboard-grade noise floor.
- Store plain text only; no PII.
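The anchor field makes the retrieval check mechanical: embed the question and the retrieved chunk, then score their distance. A minimal sketch, assuming ΔS(q, ctx) is one minus the cosine similarity of the embeddings; the embedding model and this ΔS definition are assumptions, not wfgy-eval internals:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedder

def deltaS(question: str, retrieved_chunk: str) -> float:
    """Assumed definition: ΔS = 1 - cosine similarity of the two embeddings."""
    q, c = model.encode([question, retrieved_chunk])
    cos = float(np.dot(q, c) / (np.linalg.norm(q) * np.linalg.norm(c)))
    return 1.0 - cos

# One faq_small row passes when the retrieved chunk stays near its anchor
def check_row(question: str, retrieved_chunk: str, threshold: float = 0.45) -> bool:
    return deltaS(question, retrieved_chunk) <= threshold
```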
## 3 · CI Workflow Example (GitHub Actions)
```yaml
name: wfgy-eval
on: [push, pull_request]
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install wfgy-eval
      - run: wfgy-eval run --fail-on 'ΔS_q_ctx>0.50 or citation_precision<0.85'
      - uses: actions/upload-artifact@v4
        with:
          name: eval-report
          path: reports/latest.html
```
Any push that drifts past ΔS 0.50 or drops citation precision below 0.85 blocks the merge.
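The gate itself is just threshold checks on the metrics JSON plus a nonzero exit code. A hypothetical sketch of what the `--fail-on` expression evaluates; the report path and key names are assumptions mirroring the expression above:

```python
import json
import sys

# Assumed report location and key names, mirroring the --fail-on expression
with open("reports/latest.json") as f:
    metrics = json.load(f)

failures = []
if metrics["deltaS_q_ctx"] > 0.50:
    failures.append(f"ΔS_q_ctx={metrics['deltaS_q_ctx']:.2f} > 0.50")
if metrics["citation_precision"] < 0.85:
    failures.append(f"citation_precision={metrics['citation_precision']:.2f} < 0.85")

if failures:
    print("FAIL:", "; ".join(failures))
    sys.exit(1)  # nonzero exit fails the CI job and blocks the merge
print("PASS")
```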
## 4 · Reading the HTML Report
| Section | What to Look For | Action Rule |
|---|---|---|
| Heatmap ΔS vs. k | curve stays flat at high ΔS → wrong index metric | rebuild index / re-embed |
| λ Timeline | spikes to ← or × (see sketch below) | inspect prompt / tool |
| E_res Trend | upward slope | apply BBAM / shorten ctx |
| Outliers Table | worst 5 queries by ΔS | manual deep dive; log bug |
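The λ timeline check is easy to automate offline: scan the per-query λ states and surface the divergent ones for inspection. A sketch under the assumption that λ_observe states are logged as the symbols → (convergent), ← (divergent), and × (chaotic):

```python
# Assumed log format: one (query_id, λ_state) pair per query
lambda_timeline = [("q1", "→"), ("q2", "→"), ("q3", "←"), ("q4", "×")]

spikes = [(qid, s) for qid, s in lambda_timeline if s in ("←", "×")]
for qid, state in spikes:
    print(f"{qid}: λ={state}; inspect the prompt or tool call for this query")
```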
## 5 · A/B Pattern
```bash
wfgy-eval dump --ref=main                      # baseline metrics JSON
git switch feature-x                           # apply the fix
wfgy-eval run                                  # produces metrics B
wfgy-eval compare baseline.json current.json
```

```
ΔS_q_ctx   -0.08      ✅
F1         +0.07      ✅
Cost       +$0.0002   ❌
```
Decide whether the cost bump is acceptable.
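`compare` reduces to signed deltas plus a direction-of-goodness rule per metric. A minimal sketch; the JSON schema and the `LOWER_IS_BETTER` map are illustrative assumptions:

```python
import json

# Metrics where a decrease counts as an improvement (illustrative)
LOWER_IS_BETTER = {"deltaS_q_ctx": True, "f1": False, "cost_usd": True}

def compare(baseline_path: str, current_path: str) -> None:
    with open(baseline_path) as fa, open(current_path) as fb:
        a, b = json.load(fa), json.load(fb)
    for key, lower_better in LOWER_IS_BETTER.items():
        delta = b[key] - a[key]
        improved = delta <= 0 if lower_better else delta >= 0
        print(f"{key:15s} {delta:+.4f} {'✅' if improved else '❌'}")

compare("baseline.json", "current.json")
```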
## 6 · Edge-Case Suites (extend as needed)
| Suite | Purpose | Sample Size |
|---|---|---|
| Contradiction | fact statements w/ subtle negation | 30 |
| Long-PDF | >50 k token OCR, check segmentation | 10 docs |
| Jailbreak | prompt injection attempts vs. policy | 40 |
| High Noise | OCR confidence < 0.8 | 25 |
Add the new suites to eval.yaml and rerun; the contradiction set can even be generated mechanically, as sketched below.
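A hypothetical generator for the contradiction suite, pairing each fact with a subtle negation that flips only one clause; the templates and output path are illustrative:

```python
import csv

# Illustrative fact/negation pairs; the negation flips a single clause
facts = [
    ("The 1976 Viking landers found no confirmed evidence of life on Mars.",
     "The 1976 Viking landers found confirmed evidence of life on Mars."),
    ("Water boils at 100 °C at sea-level pressure.",
     "Water boils at 100 °C regardless of pressure."),
]

with open("./data/contradiction.tsv", "w", newline="") as f:
    writer = csv.writer(f, delimiter="\t")
    writer.writerow(["statement", "label"])
    for fact, negation in facts:
        writer.writerow([fact, "true"])
        writer.writerow([negation, "false"])
```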
## 7 · FAQ
Q: Can I plug in LangSmith, Phoenix, Traceloop, or custom spans?
A: Yes. wfgy-eval reads any OpenTelemetry JSON; map fields via otel_map.yaml.
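A sketch of that field mapping, assuming otel_map.yaml is a flat source → target key map; this schema is an assumption, not documented behavior:

```python
import json
import yaml

# Assumed flat map, e.g. {"gen_ai.prompt": "question", "gen_ai.completion": "answer"}
with open("otel_map.yaml") as f:
    otel_map = yaml.safe_load(f)

def remap_span(span: dict) -> dict:
    """Rename OpenTelemetry span attributes to wfgy-eval field names."""
    attrs = span.get("attributes", {})
    return {target: attrs[src] for src, target in otel_map.items() if src in attrs}

with open("traces.json") as f:      # exported OTel JSON (assumed path)
    spans = json.load(f)
rows = [remap_span(s) for s in spans]
```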
Q: Does this cover reinforcement-style eval (RLHF)?
A: Use the same ΔS/λ hooks during reward model scoring; see examples/rlhf_eval.ipynb.
Q: How hard is a vendor swap?
A: Run `wfgy-eval provider openai` or `wfgy-eval provider anthropic`; cost and latency recalculate automatically.
## 🔗 Quick-Start Downloads (60 sec)
| Tool | Link | 3-Step Setup |
|---|---|---|
| WFGY 1.0 PDF | Engine Paper | 1️⃣ Download · 2️⃣ Upload to your LLM · 3️⃣ Ask “Answer using WFGY + <your question>” |
| TXT OS (plain-text OS) | TXTOS.txt | 1️⃣ Download · 2️⃣ Paste into any LLM chat · 3️⃣ Type “hello world” — OS boots instantly |
## 🧭 Explore More
| Module | Description | Link |
|---|---|---|
| WFGY Core | WFGY 2.0 engine is live: full symbolic reasoning architecture and math stack | View → |
| Problem Map 1.0 | Initial 16-mode diagnostic and symbolic fix framework | View → |
| Problem Map 2.0 | RAG-focused failure tree, modular fixes, and pipelines | View → |
| Semantic Clinic Index | Expanded failure catalog: prompt injection, memory bugs, logic drift | View → |
| Semantic Blueprint | Layer-based symbolic reasoning & semantic modulations | View → |
| Benchmark vs GPT-5 | Stress test GPT-5 with full WFGY reasoning suite | View → |
| 🧙♂️ Starter Village 🏡 | New here? Lost in symbols? Click here and let the wizard guide you through | Start → |
👑 Early Stargazers: see the Hall of Fame — engineers, hackers, and open-source builders who supported WFGY from day one.
⭐ WFGY Engine 2.0 is already unlocked. ⭐ Star the repo to help others discover it and unlock more on the Unlock Board.