WFGY/ProblemMap/evaluation-playbook.md
2025-08-09 13:31:03 +08:00

# 📊 Evaluation Playbook

> Ship metrics, not vibes — catch regressions before users do.

**Who is this for?**

- RAG owners tired of “looks fine on my prompt.”
- Agent builders chasing flaky benchmark wins.
- Product teams that must prove ΔS ↓, accuracy ↑, cost ↘.

What you get: one YAML suite + CLI commands that output a single-line health verdict:

```
PASS: ΔS<=0.45 | λ→ | E_res stable | cost $0.0013 / query
```


## 0 · Quick Start

```shell
pip install wfgy-eval
wfgy-eval init  # generates eval.yaml + example dataset
wfgy-eval run   # prints CSV & HTML report
```

The default template covers retrieval, reasoning, tool routing, latency, and cost.


## 1 · Metric Matrix

| Layer | Key Metric | Target (prod) | Source Fn |
|---|---|---|---|
| Retrieval | ΔS(q, ctx) | ≤ 0.45 | `deltaS()` |
| Reasoning | λ_observe (3-paraphrase avg) | convergent | `lambda_state()` |
| Stability | E_resonance (rolling) | flat / ↓ | `e_resonance()` |
| Answer | F1 / Exact Match (QA) | ≥ 0.80 | `qa_match()` |
| Hallucination | citation_precision | ≥ 0.90 | `cite_check()` |
| Cost | $ / 1k tokens | ≤ baseline | provider bill |
| Latency | 95th-percentile response (ms) | SLA-dependent | timer |
| ΔS Drift | slope over 100 queries | ~0 | linear fit |
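Two of the rows above, Retrieval ΔS and ΔS Drift, can be sketched numerically. A minimal sketch, assuming ΔS is one minus the cosine similarity of the question and context embeddings, and drift is a least-squares slope over the recent ΔS series; the actual `deltaS()` implementation may differ:

```python
import math

def delta_s(vec_q, vec_ctx):
    """Assumed ΔS: 1 - cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(vec_q, vec_ctx))
    norm = math.sqrt(sum(a * a for a in vec_q)) * math.sqrt(sum(b * b for b in vec_ctx))
    return 1.0 - dot / norm

def drift_slope(values):
    """Least-squares slope of a metric series; ~0 means no drift."""
    n = len(values)
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    var = sum((x - mean_x) ** 2 for x in range(n))
    return cov / var

# Identical vectors give ΔS 0; orthogonal vectors give ΔS 1.
print(round(delta_s([1, 0], [1, 0]), 3))  # 0.0
print(round(delta_s([1, 0], [0, 1]), 3))  # 1.0
print(round(drift_slope([0.40, 0.42, 0.44, 0.46]), 3))  # 0.02
```

A per-query ΔS above 0.45 breaches the retrieval target; a sustained positive slope breaches the drift target even while individual queries still pass.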

## 2 · Dataset Design

```yaml
sets:
  - name: faq_small
    type: qa
    source: ./data/faq.tsv
    fields: [question, answer, anchor]
  - name: chain_logic
    type: chain-of-thought
    source: ./data/logic.jsonl
    fields: [prompt, expected_reasoning]
  - name: tool_router
    type: tool
    source: ./data/router.csv
    fields: [task, tool_expected]
```

**Guidelines**

  1. **Anchor field** = the text chunk you expect to be retrieved; used for ΔS and citation checks.
  2. Minimum 50 items per set for stable stats; use 10× that if you want a leaderboard-grade noise floor.
  3. Store plain text only; no PII.
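A minimal sketch of a guideline check, assuming each set is loaded as a list of dicts; the function name and shape are illustrative, not part of `wfgy-eval`:

```python
def validate_set(rows, required_fields, min_items=50):
    """Check a dataset against the guidelines above: every required
    field non-empty on every row, and enough items for stable stats."""
    problems = []
    if len(rows) < min_items:
        problems.append(f"only {len(rows)} items; need >= {min_items}")
    for i, row in enumerate(rows):
        missing = [f for f in required_fields if not row.get(f)]
        if missing:
            problems.append(f"row {i} missing fields: {missing}")
    return problems

rows = [{"question": "q", "answer": "a", "anchor": "chunk"}] * 50
print(validate_set(rows, ["question", "answer", "anchor"]))  # []
```

Running this once at suite load time turns a silent bad set into a loud failure before any model calls are spent.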

## 3 · CI Workflow Example (GitHub Actions)

```yaml
name: wfgy-eval
on: [push, pull_request]
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install wfgy-eval
      - run: wfgy-eval run --fail-on 'ΔS_q_ctx>0.50 or citation_precision<0.85'
      - uses: actions/upload-artifact@v4
        with:
          name: eval-report
          path: reports/latest.html
```

Any push that drifts past ΔS 0.50 or drops citation precision below 0.85 blocks the merge.
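The `--fail-on` gate amounts to comparisons over the metrics report. A sketch under the assumption of a simple `metric>x` / `metric<x` grammar joined by `or`; the flag's real grammar may be richer:

```python
import re

def should_fail(expr, metrics):
    """Fail if ANY comparison clause (joined by 'or') is true."""
    ops = {">": lambda a, b: a > b, "<": lambda a, b: a < b}
    for clause in expr.split(" or "):
        m = re.fullmatch(r"\s*(\S+?)\s*([<>])\s*([\d.]+)\s*", clause)
        name, op, threshold = m.group(1), m.group(2), float(m.group(3))
        if ops[op](metrics[name], threshold):
            return True
    return False

metrics = {"ΔS_q_ctx": 0.48, "citation_precision": 0.92}
print(should_fail("ΔS_q_ctx>0.50 or citation_precision<0.85", metrics))  # False
```

With ΔS at 0.55 the first clause trips and the gate returns True, which is what blocks the merge in CI.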


## 4 · Reading the HTML Report

| Section | What to Look For | Action Rule |
|---|---|---|
| Heatmap (ΔS vs. k) | flat @ high ΔS → bad index/metric | rebuild index / re-embed |
| λ Timeline | spikes to ← or × | inspect prompt / tool |
| E_res Trend | upward slope | apply BBAM / shorten ctx |
| Outliers Table | worst 5 queries by ΔS | manual deep dive; log a bug |

## 5 · A/B Pattern

  1. `wfgy-eval dump --ref=main` (baseline metrics JSON)
  2. `git switch feature-x` → apply the fix
  3. `wfgy-eval run` (produces metrics B)
  4. `wfgy-eval compare baseline.json current.json`

```
ΔS_q_ctx   -0.08  ✅
F1         +0.07  ✅
Cost       +$0.0002 ❌
```

Decide whether the cost bump is acceptable.
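Step 4's verdict can be reproduced from two metrics files. A sketch assuming flat `{metric: value}` dicts and a per-metric "lower is better" map; names and shape are illustrative, not the actual `compare` implementation:

```python
# Direction of "good" per metric: ΔS and cost should fall, F1 should rise.
LOWER_IS_BETTER = {"ΔS_q_ctx": True, "F1": False, "Cost": True}

def compare(baseline, current):
    """Return {metric: (delta, ok)} where ok means the delta moved the right way."""
    out = {}
    for name, base in baseline.items():
        delta = current[name] - base
        improved = delta <= 0 if LOWER_IS_BETTER[name] else delta >= 0
        out[name] = (round(delta, 4), improved)
    return out

base = {"ΔS_q_ctx": 0.46, "F1": 0.78, "Cost": 0.0011}
curr = {"ΔS_q_ctx": 0.38, "F1": 0.85, "Cost": 0.0013}
for name, (delta, ok) in compare(base, curr).items():
    print(f"{name:10} {delta:+.4f} {'✅' if ok else '❌'}")
```

This reproduces the sample output above: ΔS and F1 pass, the +$0.0002 cost delta is flagged, and a human decides the trade-off.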


## 6 · Edge-Case Suites (extend as needed)

| Suite | Purpose | Sample Size |
|---|---|---|
| Contradiction | fact statements w/ subtle negation | 30 |
| Long-PDF | >50k-token OCR; check segmentation | 10 docs |
| Jailbreak | prompt-injection attempts vs. policy | 40 |
| High Noise | OCR confidence < 0.8 | 25 |

Add the new suites to eval.yaml and rerun.
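As an illustration, the Contradiction suite could be registered next to the existing sets; the path and fields below are placeholders that should be matched to your own data, not part of the shipped template:

```yaml
sets:
  - name: contradiction
    type: qa
    source: ./data/contradiction.tsv   # 30 fact statements with subtle negation
    fields: [question, answer, anchor]
```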


## 7 · FAQ

**Q: Can I plug in LangSmith, Phoenix, Traceloop, or custom spans?**
A: Yes. `wfgy-eval` reads any OpenTelemetry JSON; map fields via `otel_map.yaml`.

**Q: Does this cover reinforcement-style eval (RLHF)?**
A: Use the same ΔS/λ hooks during reward-model scoring; see `examples/rlhf_eval.ipynb`.

**Q: How hard is a vendor swap?**
A: `wfgy-eval provider openai` or `wfgy-eval provider anthropic`; costs and latency recalculate automatically.


## Quick-Start Downloads (60 sec)

| Tool | Link | 3-Step Setup |
|---|---|---|
| WFGY 1.0 PDF | Engine Paper | 1 Download · 2 Upload to LLM · 3 Ask “answer using WFGY + `<question>`” |
| TXT OS (plain-text OS) | TXTOS.txt | 1 Download · 2 Paste into any LLM chat · 3 Type “hello world” — OS boots instantly |

## 🧭 Explore More

| Module | Description | Link |
|---|---|---|
| WFGY Core | Standalone semantic reasoning engine for any LLM | View → |
| Problem Map 1.0 | Initial 16-mode diagnostic and symbolic fix framework | View → |
| Problem Map 2.0 | RAG-focused failure tree, modular fixes, and pipelines | View → |
| Semantic Clinic Index | Expanded failure catalog: prompt injection, memory bugs, logic drift | View → |
| Semantic Blueprint | Layer-based symbolic reasoning & semantic modulations | View → |
| Benchmark vs GPT-5 | Stress test GPT-5 with full WFGY reasoning suite | View → |

👑 **Early Stargazers:** See the Hall of Fame: engineers, hackers, and open-source builders who supported WFGY from day one.

⭐ Help reach 10,000 stars by 2025-09-01 to unlock Engine 2.0 for everyone. Star WFGY on GitHub.
