📊 Evaluation Playbook
Ship metrics, not vibes — catch regressions before users do.
Evaluation disclaimer (WFGY)
This playbook describes practical recipes for evaluating and debugging AI pipelines.
Any scores, rankings, or labels produced by these methods are context-dependent signals, not formal scientific proof.
Results depend on the specific models, prompts, datasets, and parameters used in each run.
Re-run the experiments, vary the configuration, and treat the numbers as guidance for engineering and operations, not as guarantees of real-world safety or general model quality.
Who is this for?
– RAG owners tired of “looks fine on my prompt.”
– Agent builders chasing flaky benchmark wins.
– Product teams that must prove ΔS ↓, accuracy ↑, cost ↘.

What you get: one YAML suite + CLI commands that output a single-line health verdict:

```
PASS: ΔS<=0.45 | λ→ | E_res stable | cost $0.0013 / query
```
0 · Quick Start
```
pip install wfgy-eval
wfgy-eval init   # generates eval.yaml + example dataset
wfgy-eval run    # prints CSV & HTML report
```
Default template covers retrieval, reasoning, tool routing, latency, cost.
1 · Metric Matrix
| Layer | Key Metric | Target (prod) | Source Fn |
|---|---|---|---|
| Retrieval | ΔS(q, ctx) | ≤ 0.45 | deltaS() |
| Reasoning | λ_observe (3-paraphrase avg) | convergent | lambda_state() |
| Stability | E_resonance (rolling) | flat / ↓ | e_resonance() |
| Answer | F1 / Exact Match (QA) | ≥ 0.80 | qa_match() |
| Hallucination | citation_precision | ≥ 0.90 | cite_check() |
| Cost | $ / 1k tokens | ≤ baseline | provider bill |
| Latency | 95th-percentile response (ms) | SLA-dependent | timer |
| ΔS Drift | slope over 100 queries | ~0 | linear fit |
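Two of these metrics are easy to reproduce outside the suite while debugging. A minimal sketch, assuming ΔS(q, ctx) is measured as one minus the cosine similarity between the question and retrieved-context embeddings (see the engine paper for the canonical deltaS() definition) and drift is the slope of a linear fit over the last 100 ΔS values, per the table:

```python
import numpy as np

def delta_s(q_vec: np.ndarray, ctx_vec: np.ndarray) -> float:
    """Semantic stress between question and context; prod target is <= 0.45."""
    cos = np.dot(q_vec, ctx_vec) / (np.linalg.norm(q_vec) * np.linalg.norm(ctx_vec))
    return float(1.0 - cos)

def delta_s_drift(history: list[float], window: int = 100) -> float:
    """Slope of a linear fit over the last `window` ΔS values; ~0 means no drift."""
    recent = history[-window:]
    if len(recent) < 2:
        return 0.0
    slope, _intercept = np.polyfit(np.arange(len(recent)), recent, 1)
    return float(slope)
```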
2 · Dataset Design
```yaml
sets:
  - name: faq_small
    type: qa
    source: ./data/faq.tsv
    fields: [question, answer, anchor]

  - name: chain_logic
    type: chain-of-thought
    source: ./data/logic.jsonl
    fields: [prompt, expected_reasoning]

  - name: tool_router
    type: tool
    source: ./data/router.csv
    fields: [task, tool_expected]
```
Guidelines
- Anchor field = the text chunk you expect to be retrieved; it drives the ΔS and citation checks.
- Minimum 50 items per set for stable stats; use 10× that for a leaderboard-grade noise floor.
- Store plain text only, no PII.
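A quick way to enforce the first two rules before a run. A minimal sketch, assuming faq.tsv is tab-separated with a header row naming the fields declared above; check_qa_set is illustrative, not part of wfgy-eval:

```python
import csv

REQUIRED_FIELDS = ["question", "answer", "anchor"]  # from the faq_small `fields` entry
MIN_ITEMS = 50  # guideline: at least 50 items per set for stable stats

def check_qa_set(path: str) -> list[str]:
    """Return problems found in a QA eval set; an empty list means it passes."""
    problems: list[str] = []
    with open(path, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f, delimiter="\t"))
    if len(rows) < MIN_ITEMS:
        problems.append(f"only {len(rows)} items; need >= {MIN_ITEMS}")
    for i, row in enumerate(rows):
        for field in REQUIRED_FIELDS:
            if not (row.get(field) or "").strip():
                problems.append(f"row {i}: empty '{field}'")
    return problems
```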
3 · CI Workflow Example (GitHub Actions)
```yaml
name: wfgy-eval
on: [push, pull_request]
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install wfgy-eval
      - run: wfgy-eval run --fail-on 'ΔS_q_ctx>0.50 or citation_precision<0.85'
      - uses: actions/upload-artifact@v4
        with:
          name: eval-report
          path: reports/latest.html
```
Any push that drifts past ΔS 0.50 or drops citation precision below 0.85 blocks the merge.
4 · Reading the HTML Report
| Section | What to Look For | Action Rule |
|---|---|---|
| Heatmap ΔS vs. k | flat @ high ΔS → bad index metric | rebuild index / embed |
| λ Timeline | spikes to ← or × | inspect prompt / tool |
| E_res Trend | upward slope | apply BBAM / shorten ctx |
| Outliers Table | worst 5 queries by ΔS | manual deep dive; log bug |
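When five rows are not enough, the outliers table is trivial to rebuild from raw run logs. A minimal sketch, assuming `results` is a list of (query, ΔS) pairs collected during the run:

```python
def worst_by_delta_s(results: list[tuple[str, float]], n: int = 5) -> list[tuple[str, float]]:
    """Queries with the highest semantic stress first, for the manual deep dive."""
    return sorted(results, key=lambda pair: pair[1], reverse=True)[:n]
```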
5 · A/B Pattern
1. `wfgy-eval dump --ref=main` (baseline metrics JSON)
2. `git switch feature-x` → apply your fix
3. `wfgy-eval run` (produces metrics B)
4. `wfgy-eval compare baseline.json current.json`

```
ΔS_q_ctx  -0.08  ✅
F1        +0.07  ✅
Cost      +$0.0002  ❌
```

Decide whether the cost bump is acceptable.
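The compare step is also easy to reproduce in a notebook while you iterate. A minimal sketch, assuming the dump files are flat JSON objects mapping metric names to numbers; the actual schema wfgy-eval writes may differ:

```python
import json

# From the sample output above: ΔS and cost improve when they go down.
LOWER_IS_BETTER = {"ΔS_q_ctx", "Cost"}

def compare(baseline_path: str, current_path: str) -> None:
    """Print per-metric deltas between two metric dumps and flag regressions."""
    with open(baseline_path, encoding="utf-8") as f:
        baseline = json.load(f)
    with open(current_path, encoding="utf-8") as f:
        current = json.load(f)
    for name in sorted(set(baseline) & set(current)):
        delta = current[name] - baseline[name]
        ok = delta <= 0 if name in LOWER_IS_BETTER else delta >= 0
        print(f"{name:<10} {delta:+.4f} {'✅' if ok else '❌'}")

compare("baseline.json", "current.json")
```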
6 · Edge-Case Suites (extend as needed)
| Suite | Purpose | Sample Size |
|---|---|---|
| Contradiction | fact statements w/ subtle negation | 30 |
| Long-PDF | >50 k token OCR, check segmentation | 10 docs |
| Jailbreak | prompt injection attempts vs. policy | 40 |
| High Noise | OCR confidence < 0.8 | 25 |
Add each suite to eval.yaml and rerun; an example entry is sketched below.
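A minimal sketch of a Contradiction suite entry, reusing the sets: schema from section 2; the source path is hypothetical, and reusing the qa type for this suite is an assumption:

```yaml
sets:
  - name: contradiction
    type: qa                            # same layout as faq_small in section 2
    source: ./data/contradiction.tsv    # hypothetical path
    fields: [question, answer, anchor]  # anchor drives the ΔS + citation checks
```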
7 · FAQ
Q: Can I plug in LangSmith, Phoenix, Traceloop, or custom spans?
A: Yes. wfgy-eval reads any OpenTelemetry JSON; map fields via otel_map.yaml.
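A minimal sketch of what that mapping file could look like; the real otel_map.yaml schema belongs to wfgy-eval, so every key below is an assumption:

```yaml
# Hypothetical otel_map.yaml: wfgy-eval field -> OpenTelemetry span attribute
question: gen_ai.prompt
context: retrieval.documents
answer: gen_ai.completion
latency_ms: span.duration_ms
```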
Q: Does this cover reinforcement-style eval (RLHF)?
A: Use the same ΔS/λ hooks during reward model scoring; see examples/rlhf_eval.ipynb.
Q: How hard is vendor swap?
A: Run `wfgy-eval provider openai` or `wfgy-eval provider anthropic`; cost and latency figures recalculate automatically.
🔗 Quick-Start Downloads (60 sec)
| Tool | Link | 3-Step Setup |
|---|---|---|
| WFGY 1.0 PDF | Engine Paper | 1️⃣ Download · 2️⃣ Upload to your LLM · 3️⃣ Ask “Answer using WFGY + <your question>” |
| TXT OS (plain-text OS) | TXTOS.txt | 1️⃣ Download · 2️⃣ Paste into any LLM chat · 3️⃣ Type “hello world” — OS boots instantly |
🧭 Explore More
| Module | Description | Link |
|---|---|---|
| WFGY Core | WFGY 2.0 engine is live: full symbolic reasoning architecture and math stack | View → |
| Problem Map 1.0 | Initial 16-mode diagnostic and symbolic fix framework | View → |
| Problem Map 2.0 | RAG-focused failure tree, modular fixes, and pipelines | View → |
| Semantic Clinic Index | Expanded failure catalog: prompt injection, memory bugs, logic drift | View → |
| Semantic Blueprint | Layer-based symbolic reasoning & semantic modulations | View → |
| Benchmark vs GPT-5 | Stress test GPT-5 with full WFGY reasoning suite | View → |
| 🧙‍♂️ Starter Village 🏡 | New here? Lost in symbols? Click here and let the wizard guide you through | Start → |
👑 Early Stargazers: see the Hall of Fame for the engineers, hackers, and open-source builders who supported WFGY from day one.
⭐ WFGY Engine 2.0 is already unlocked. ⭐ Star the repo to help others discover it and unlock more on the Unlock Board.