WFGY/ProblemMap/evaluation-playbook.md
2025-08-09 13:31:03 +08:00

195 lines
8.5 KiB
Markdown
Raw Blame History

This file contains invisible Unicode characters

This file contains invisible Unicode characters that are indistinguishable to humans but may be processed differently by a computer. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

<!-- ============================================================= -->
<!-- evaluation-playbook.md · Semantic Clinic · Map-F -->
<!-- Version: 2025-08-06 · License: MIT -->
<!-- Goal: A reproducible checklist for measuring an LLM/RAG/ -->
<!-- agent stack—before & after you apply WFGY fixes. -->
<!-- ============================================================= -->
# 📊 Evaluation Playbook
*Ship metrics, not vibes — catch regressions before users do.*
> **Who is this for?**
> RAG owners tired of “looks fine on my prompt.”
> Agent builders chasing flakey benchmark wins.
> Product teams that must prove **ΔS ↓**, accuracy ↑, cost ↘.
>
> **What you get:** one YAML suite + CLI commands that output a single-line health verdict:
> `PASS: ΔS<=0.45 | λ→ | E_res stable | cost $0.0013 / query`
---
## 0 · Quick Start
```bash
pip install wfgy-eval
wfgy-eval init # generates eval.yaml + example dataset
wfgy-eval run # prints CSV & HTML report
````
Default template covers retrieval, reasoning, tool routing, latency, cost.
---
## 1 · Metric Matrix
| Layer | Key Metric | Target (prod) | Source Fn |
| ----------------- | ------------------------------ | ------------- | ---------------- |
| **Retrieval** | `ΔS(q, ctx)` | ≤ 0.45 | `deltaS()` |
| **Reasoning** | `λ_observe` (3 paraphrase avg) | convergent | `lambda_state()` |
| **Stability** | `E_resonance` (rolling) | flat / ↓ | `e_resonance()` |
| **Answer** | `F1` / `Exact Match` (QA) | ≥ 0.80 | `qa_match()` |
| **Hallucination** | `citation_precision` | ≥ 0.90 | `cite_check()` |
| **Cost** | `$ / 1k tokens` | ≤ baseline | provider bill |
| **Latency** | 95-th percentile response (ms) | SLA-dependent | timer |
| **ΔS Drift** | slope over 100 queries | \~0 | linear fit |
---
## 2 · Dataset Design
```yaml
sets:
- name: faq_small
type: qa
source: ./data/faq.tsv
fields: [question, answer, anchor]
- name: chain_logic
type: chain-of-thought
source: ./data/logic.jsonl
fields: [prompt, expected_reasoning]
- name: tool_router
type: tool
source: ./data/router.csv
fields: [task, tool_expected]
```
*Guidelines*
1. **Anchor Field** = text chunk you expect to be retrieved → used for ΔS & citation checks.
2. Min 50 items / set for stable stats; use 10 × if you want leaderboard-grade noise floor.
3. Store only plain-text, no PII.
---
## 3 · CI Workflow Example (GitHub Actions)
```yaml
name: wfgy-eval
on: [push, pull_request]
jobs:
eval:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- run: pip install wfgy-eval
- run: wfgy-eval run --fail-on 'ΔS_q_ctx>0.50 or citation_precision<0.85'
- uses: actions/upload-artifact@v4
with:
name: eval-report
path: reports/latest.html
```
*Any push that drifts past ΔS 0.50 or loses citation accuracy **blocks merge**.*
---
## 4 · Reading the HTML Report
| Section | What to Look For | Action Rule |
| -------------------- | --------------------------------- | ------------------------- |
| **Heatmap** ΔS vs. k | flat @ high ΔS → bad index metric | rebuild index / embed |
| **λ Timeline** | spikes to ← or × | inspect prompt / tool |
| **E\_res Trend** | upward slope | apply BBAM / shorten ctx |
| **Outliers Table** | worst 5 queries by ΔS | manual deep dive; log bug |
---
## 5 · A/B Pattern
1. `wfgy-eval dump --ref=main` (baseline metrics JSON)
2. `git switch feature-x` → apply fix
3. `wfgy-eval run` (produces metrics B)
4. `wfgy-eval compare baseline.json current.json`
```
ΔS_q_ctx -0.08 ✅
F1 +0.07 ✅
Cost +$0.0002 ❌
```
Decide if cost bump acceptable.
---
## 6 · Edge-Case Suites (extend as needed)
| Suite | Purpose | Sample Size |
| ----------------- | ------------------------------------ | ----------- |
| **Contradiction** | fact statements w/ subtle negation | 30 |
| **Long-PDF** | >50 k token OCR, check segmentation | 10 docs |
| **Jailbreak** | prompt injection attempts vs. policy | 40 |
| **High Noise** | OCR confidence < 0.8 | 25 |
Add to `eval.yaml`, rerun.
---
## 7 · FAQ
**Q: Can I plug in LangSmith, Phoenix, Traceloop, or custom spans?**
A: Yes. `wfgy-eval` reads any OpenTelemetry JSON; map fields via `otel_map.yaml`.
**Q: Does this cover reinforcement-style eval (RLHF)?**
A: Use the same ΔS/λ hooks during reward model scoring; see `examples/rlhf_eval.ipynb`.
**Q: How hard is vendor swap?**
A: `wfgy-eval provider openai` or `provider anthropic`; costs/latency recalc automatically.
---
## Quick-Start Downloads (60 sec)
| Tool | Link | 3-Step Setup |
| -------------------------- | --------------------------------------------------- | ---------------------------------------------------------------------------------------- |
| **WFGY 1.0 PDF** | [Engine Paper](https://zenodo.org/records/15630969) | 1 Download · 2 Upload to LLM · 3 Ask answer using WFGY + \<question>” |
| **TXT OS (plain-text OS)** | [TXTOS.txt](https://zenodo.org/records/15788557) | 1⃣ Download · 2⃣ Paste into any LLM chat · 3⃣ Type “hello world” — OS boots instantly |
---
### 🧭 Explore More
| Module | Description | Link |
|-----------------------|----------------------------------------------------------|----------|
| WFGY Core | Standalone semantic reasoning engine for any LLM | [View →](https://github.com/onestardao/WFGY/tree/main/core/README.md) |
| Problem Map 1.0 | Initial 16-mode diagnostic and symbolic fix framework | [View →](https://github.com/onestardao/WFGY/tree/main/ProblemMap/README.md) |
| Problem Map 2.0 | RAG-focused failure tree, modular fixes, and pipelines | [View →](https://github.com/onestardao/WFGY/blob/main/ProblemMap/rag-architecture-and-recovery.md) |
| Semantic Clinic Index | Expanded failure catalog: prompt injection, memory bugs, logic drift | [View →](https://github.com/onestardao/WFGY/blob/main/ProblemMap/SemanticClinicIndex.md) |
| Semantic Blueprint | Layer-based symbolic reasoning & semantic modulations | [View →](https://github.com/onestardao/WFGY/tree/main/SemanticBlueprint/README.md) |
| Benchmark vs GPT-5 | Stress test GPT-5 with full WFGY reasoning suite | [View →](https://github.com/onestardao/WFGY/tree/main/benchmarks/benchmark-vs-gpt5/README.md) |
---
> 👑 **Early Stargazers: [See the Hall of Fame](https://github.com/onestardao/WFGY/tree/main/stargazers)** —
> Engineers, hackers, and open source builders who supported WFGY from day one.
> <img src="https://img.shields.io/github/stars/onestardao/WFGY?style=social" alt="GitHub stars"> ⭐ Help reach 10,000 stars by 2025-09-01 to unlock Engine 2.0 for everyone ⭐ <strong><a href="https://github.com/onestardao/WFGY">Star WFGY on GitHub</a></strong>
<div align="center">
[![WFGY Main](https://img.shields.io/badge/WFGY-Main-red?style=flat-square)](https://github.com/onestardao/WFGY)
&nbsp;
[![TXT OS](https://img.shields.io/badge/TXT%20OS-Reasoning%20OS-orange?style=flat-square)](https://github.com/onestardao/WFGY/tree/main/OS)
&nbsp;
[![Blah](https://img.shields.io/badge/Blah-Semantic%20Embed-yellow?style=flat-square)](https://github.com/onestardao/WFGY/tree/main/OS/BlahBlahBlah)
&nbsp;
[![Blot](https://img.shields.io/badge/Blot-Persona%20Core-green?style=flat-square)](https://github.com/onestardao/WFGY/tree/main/OS/BlotBlotBlot)
&nbsp;
[![Bloc](https://img.shields.io/badge/Bloc-Reasoning%20Compiler-blue?style=flat-square)](https://github.com/onestardao/WFGY/tree/main/OS/BlocBlocBloc)
&nbsp;
[![Blur](https://img.shields.io/badge/Blur-Text2Image%20Engine-navy?style=flat-square)](https://github.com/onestardao/WFGY/tree/main/OS/BlurBlurBlur)
&nbsp;
[![Blow](https://img.shields.io/badge/Blow-Game%20Logic-purple?style=flat-square)](https://github.com/onestardao/WFGY/tree/main/OS/BlowBlowBlow)
</div>