📊 Evaluation Playbook
Ship metrics, not vibes — catch regressions before users do.
Evaluation disclaimer (WFGY)
This playbook describes practical recipes for evaluating and debugging AI pipelines.
Any scores, rankings, or labels produced by these methods are context-dependent signals, not formal scientific proof.
Results depend on the specific models, prompts, datasets, and parameters used in each run.
Re-run the experiments, vary the configuration, and treat the numbers as guidance for engineering and operations, not as guarantees of real-world safety or general model quality.
Who is this for?
– RAG owners tired of “looks fine on my prompt.”
– Agent builders chasing flaky benchmark wins.
– Product teams that must prove ΔS ↓, accuracy ↑, cost ↘.

What you get: one YAML suite + CLI commands that output a single-line health verdict:

```
PASS: ΔS<=0.45 | λ→ | E_res stable | cost $0.0013 / query
```
0 · Quick Start
```bash
pip install wfgy-eval
wfgy-eval init   # generates eval.yaml + example dataset
wfgy-eval run    # prints CSV & HTML report
```
The default template covers retrieval, reasoning, tool routing, latency, and cost.
1 · Metric Matrix
| Layer | Key Metric | Target (prod) | Source Fn |
|---|---|---|---|
| Retrieval | ΔS(q, ctx) | ≤ 0.45 | deltaS() |
| Reasoning | λ_observe (3-paraphrase avg) | convergent | lambda_state() |
| Stability | E_resonance (rolling) | flat / ↓ | e_resonance() |
| Answer | F1 / Exact Match (QA) | ≥ 0.80 | qa_match() |
| Hallucination | citation_precision | ≥ 0.90 | cite_check() |
| Cost | $ / 1k tokens | ≤ baseline | provider bill |
| Latency | 95th-percentile response (ms) | SLA-dependent | timer |
| ΔS Drift | slope over 100 queries | ~0 | linear fit |
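For intuition, here is a minimal sketch of a ΔS-style retrieval check, assuming ΔS is computed as 1 − cosine similarity between question and context embeddings; the toy `embed` function is a stand-in for a real embedding model, and `delta_s` is illustrative, not the packaged `deltaS()`.

```python
# Sketch of a ΔS-style check, assuming ΔS(q, ctx) = 1 - cos(emb(q), emb(ctx)).
# `embed` is a toy bag-of-words stand-in; swap in your real embedding model.
import numpy as np

def embed(text: str, dim: int = 256) -> np.ndarray:
    """Toy hashed bag-of-words vector; placeholder only."""
    v = np.zeros(dim)
    for tok in text.lower().split():
        v[hash(tok) % dim] += 1.0
    return v

def delta_s(question: str, context: str) -> float:
    """1 - cosine similarity; lower is better (target <= 0.45 in prod)."""
    q, c = embed(question), embed(context)
    denom = float(np.linalg.norm(q) * np.linalg.norm(c))
    return 1.0 if denom == 0.0 else 1.0 - float(q @ c) / denom

ds = delta_s("What is the refund window?",
             "Refunds are accepted within 30 days of purchase.")
print(f"{'PASS' if ds <= 0.45 else 'FAIL'}: ΔS={ds:.2f}")
```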
2 · Dataset Design
```yaml
sets:
  - name: faq_small
    type: qa
    source: ./data/faq.tsv
    fields: [question, answer, anchor]
  - name: chain_logic
    type: chain-of-thought
    source: ./data/logic.jsonl
    fields: [prompt, expected_reasoning]
  - name: tool_router
    type: tool
    source: ./data/router.csv
    fields: [task, tool_expected]
```
Guidelines
- Anchor field = the text chunk you expect to be retrieved; it drives the ΔS and citation checks.
- Minimum 50 items per set for stable stats; use 10× that if you want a leaderboard-grade noise floor.
- Store plain text only, no PII (see the validator sketch after this list).
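Before a run, it helps to sanity-check each set against these guidelines. A minimal validator sketch, assuming the faq_small TSV layout above; the loader and its checks are illustrative, not part of wfgy-eval:

```python
# Sketch: validate a QA set against the dataset guidelines above.
# Assumes the TSV columns match the `fields` list in eval.yaml.
import csv

REQUIRED = ["question", "answer", "anchor"]  # anchor drives ΔS & citation checks
MIN_ITEMS = 50                               # floor for stable stats

def validate_qa_set(path: str) -> list[str]:
    problems = []
    with open(path, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f, delimiter="\t"))
    missing = [c for c in REQUIRED if rows and c not in rows[0]]
    if missing:
        problems.append(f"missing columns: {missing}")
    if len(rows) < MIN_ITEMS:
        problems.append(f"only {len(rows)} items; need >= {MIN_ITEMS}")
    empty_anchor = sum(1 for r in rows if not (r.get("anchor") or "").strip())
    if empty_anchor:
        problems.append(f"{empty_anchor} rows have an empty anchor field")
    return problems

if __name__ == "__main__":
    for issue in validate_qa_set("./data/faq.tsv") or ["ok"]:
        print(issue)
```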
3 · CI Workflow Example (GitHub Actions)
```yaml
name: wfgy-eval
on: [push, pull_request]
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install wfgy-eval
      - run: wfgy-eval run --fail-on 'ΔS_q_ctx>0.50 or citation_precision<0.85'
      - uses: actions/upload-artifact@v4
        with:
          name: eval-report
          path: reports/latest.html
```
Any push that drifts past ΔS 0.50 or drops citation precision below 0.85 blocks the merge.
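If you want the same gate without the CLI (or a second guard alongside it), a metrics dump can be checked in a few lines. The report path and metric keys below are assumptions; adapt them to your report schema:

```python
# Sketch: a merge gate over a metrics dump.
# The report path and metric keys are assumptions; match them to your output.
import json
import sys

LIMITS = {
    "deltaS_q_ctx": ("max", 0.50),        # fail if above
    "citation_precision": ("min", 0.85),  # fail if below
}

with open("reports/latest.json", encoding="utf-8") as f:  # hypothetical path
    metrics = json.load(f)

failures = []
for key, (kind, bound) in LIMITS.items():
    value = metrics[key]
    bad = value > bound if kind == "max" else value < bound
    if bad:
        failures.append(f"{key}={value} breaches {kind} bound {bound}")

if failures:
    print("FAIL:", "; ".join(failures))
    sys.exit(1)  # non-zero exit fails the CI step and blocks the merge
print("PASS")
```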
4 · Reading the HTML Report
| Section | What to Look For | Action Rule |
|---|---|---|
| Heatmap ΔS vs. k | flat @ high ΔS → bad index metric | rebuild index / embed |
| λ Timeline | spikes to ← or × | inspect prompt / tool |
| E_res Trend | upward slope | apply BBAM / shorten ctx |
| Outliers Table | worst 5 queries by ΔS | manual deep dive; log bug |
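Both the E_res trend row here and the ΔS drift row in the metric matrix reduce to a slope check over recent queries. A minimal sketch, assuming you can export per-query ΔS values as a list:

```python
# Sketch: ΔS drift as the least-squares slope over recent queries.
import numpy as np

def drift_slope(delta_s_values: list[float]) -> float:
    """Slope of ΔS over query index; ~0 means no drift."""
    x = np.arange(len(delta_s_values))
    slope, _intercept = np.polyfit(x, np.asarray(delta_s_values), 1)
    return float(slope)

# Synthetic example: a slow upward creep of ΔS across 100 queries.
rng = np.random.default_rng(0)
history = (0.30 + 0.001 * np.arange(100) + rng.normal(0, 0.02, 100)).tolist()
s = drift_slope(history)
print(f"slope={s:+.5f} -> {'investigate' if abs(s) > 5e-4 else 'stable'}")
```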
5 · A/B Pattern
1. `wfgy-eval dump --ref=main` → baseline metrics JSON
2. `git switch feature-x` → apply the fix
3. `wfgy-eval run` → produces metrics B
4. `wfgy-eval compare baseline.json current.json`

```
ΔS_q_ctx  -0.08  ✅
F1        +0.07  ✅
Cost      +$0.0002 ❌
```

Decide whether the cost bump is acceptable.
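Under the hood, a compare step is just a per-metric delta with a pass/fail mark. A minimal sketch, assuming flat JSON dumps of metric → number; the key names are illustrative:

```python
# Sketch: diff two metric dumps, in the spirit of `wfgy-eval compare`.
# Assumes flat JSON of metric -> number; the "lower is better" set is assumed.
import json

LOWER_IS_BETTER = {"deltaS_q_ctx", "cost_usd_per_query"}  # assumed key names

def compare(baseline_path: str, current_path: str) -> None:
    with open(baseline_path, encoding="utf-8") as f:
        base = json.load(f)
    with open(current_path, encoding="utf-8") as f:
        curr = json.load(f)
    for key in sorted(base.keys() & curr.keys()):
        delta = curr[key] - base[key]
        improved = delta <= 0 if key in LOWER_IS_BETTER else delta >= 0
        print(f"{key:24s} {delta:+.4f} {'✅' if improved else '❌'}")

if __name__ == "__main__":
    compare("baseline.json", "current.json")
```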
6 · Edge-Case Suites (extend as needed)
| Suite | Purpose | Sample Size |
|---|---|---|
| Contradiction | fact statements w/ subtle negation | 30 |
| Long-PDF | >50 k token OCR, check segmentation | 10 docs |
| Jailbreak | prompt injection attempts vs. policy | 40 |
| High Noise | OCR confidence < 0.8 | 25 |
Add them to eval.yaml and rerun; a sketch for building the High Noise suite follows.
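A minimal sketch for assembling the High Noise suite, assuming an OCR dump in JSONL where each record carries `text` and `ocr_confidence` fields; the schema and file paths are assumptions:

```python
# Sketch: build the High Noise suite from an OCR dump.
# Assumes JSONL records with `text` and `ocr_confidence`; names are assumed.
import json

def build_high_noise(src: str, dst: str, threshold: float = 0.8, cap: int = 25):
    kept = []
    with open(src, encoding="utf-8") as f:
        for line in f:
            rec = json.loads(line)
            if rec.get("ocr_confidence", 1.0) < threshold:
                kept.append(rec)
            if len(kept) >= cap:
                break
    with open(dst, "w", encoding="utf-8") as out:
        for rec in kept:
            out.write(json.dumps(rec, ensure_ascii=False) + "\n")
    print(f"wrote {len(kept)} low-confidence items to {dst}")

if __name__ == "__main__":
    build_high_noise("./data/ocr_dump.jsonl", "./data/high_noise.jsonl")
```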
7 · FAQ
Q: Can I plug in LangSmith, Phoenix, Traceloop, or custom spans?
A: Yes. wfgy-eval reads any OpenTelemetry JSON; map fields via otel_map.yaml.
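The actual otel_map.yaml schema is defined by wfgy-eval; purely to illustrate the field-mapping idea, here is a sketch that remaps OpenTelemetry span attributes to eval fields, with every attribute and field name assumed:

```python
# Sketch: remap OpenTelemetry JSON spans to eval fields via a YAML map.
# All attribute/field names here are assumptions, not the wfgy-eval schema.
import json
import yaml  # pip install pyyaml

def map_spans(otel_json_path: str, map_yaml_path: str) -> list[dict]:
    with open(otel_json_path, encoding="utf-8") as f:
        spans = json.load(f)            # assumed: a list of span dicts
    with open(map_yaml_path, encoding="utf-8") as f:
        mapping = yaml.safe_load(f)     # e.g. {"question": "gen_ai.prompt"}
    rows = []
    for span in spans:
        attrs = span.get("attributes", {})
        rows.append({field: attrs.get(src) for field, src in mapping.items()})
    return rows

if __name__ == "__main__":
    for row in map_spans("traces.json", "otel_map.yaml")[:3]:
        print(row)
```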
Q: Does this cover reinforcement-style eval (RLHF)?
A: Use the same ΔS/λ hooks during reward model scoring; see examples/rlhf_eval.ipynb.
Q: How hard is a vendor swap?
A: `wfgy-eval provider openai` or `wfgy-eval provider anthropic`; cost and latency recalculate automatically.
🔗 Quick-Start Downloads (60 sec)
| Tool | Link | 3-Step Setup |
|---|---|---|
| WFGY 1.0 PDF | Engine Paper | 1️⃣ Download · 2️⃣ Upload to your LLM · 3️⃣ Ask “Answer using WFGY + <your question>” |
| TXT OS (plain-text OS) | TXTOS.txt | 1️⃣ Download · 2️⃣ Paste into any LLM chat · 3️⃣ Type “hello world” — OS boots instantly |
Explore More
| Layer | Page | What it’s for |
|---|---|---|
| ⭐ Proof | WFGY Recognition Map | External citations, integrations, and ecosystem proof |
| ⚙️ Engine | WFGY 1.0 | Original PDF tension engine and early logic sketch (legacy reference) |
| ⚙️ Engine | WFGY 2.0 | Production tension kernel for RAG and agent systems |
| ⚙️ Engine | WFGY 3.0 | TXT based Singularity tension engine (131 S class set) |
| 🗺️ Map | Problem Map 1.0 | Flagship 16 problem RAG failure taxonomy and fix map |
| 🗺️ Map | Problem Map 2.0 | Global Debug Card for RAG and agent pipeline diagnosis |
| 🗺️ Map | Problem Map 3.0 | Global AI troubleshooting atlas and failure pattern map |
| 🧰 App | TXT OS | .txt semantic OS with fast bootstrap |
| 🧰 App | Blah Blah Blah | Abstract and paradox Q&A built on TXT OS |
| 🧰 App | Blur Blur Blur | Text to image generation with semantic control |
| 🏡 Onboarding | Starter Village | Guided entry point for new users |
If this repository helped, starring it improves discovery so more builders can find the docs and tools.