Eval Observability — Evaluation Playbook

A compact playbook to stabilize evaluation and ensure results are reproducible.
Use this when metrics look inconsistent, coverage drifts, or benchmarks feel untrustworthy.

Core acceptance

ΔS(question, retrieved) ≤ 0.45
Coverage ≥ 0.70 to target section
λ remains convergent across 3 paraphrases and 2 seeds
Variance ratio ≤ 0.15 across seed runs
No monotonic downward drift beyond 3 windows

Step 1 — Regression Gate

Run a gate before promoting new models or retrievers.

Require ΔS ≤ 0.45 and coverage ≥ 0.70 on 3 paraphrases.
Block deploy if λ flips divergent.
Open: regression_gate.md

Step 2 — Alerting and Probes

Set live probes to catch collapse early.

Alert when ΔS ≥ 0.60 or λ flips.
Emit logs for coverage gaps, drift slopes, entropy rises.
Open: alerting_and_probes.md

Step 3 — Coverage Tracking

Measure retrieved vs target section alignment.

At least 70% coverage on gold targets.
Track section-level recall, not just overall hit-rate.
Open: coverage_tracking.md

Step 4 — ΔS Thresholds

Normalize ΔS scoring and enforce thresholds.

Stable < 0.40
Transitional 0.40–0.60
Collapse ≥ 0.60
Open: deltaS_thresholds.md

Step 5 — λ Observe

Use λ probes to detect schema drift.

Run same query across seeds and paraphrases.
If λ flips, lock schema and enforce reranker.
Open: lambda_observe.md

Step 6 — Variance and Drift

Log variance ratios and detect slow regressions.

Variance ratio ≤ 0.15
Drift slope ≤ 0.02 per batch
Alert if monotonic decline lasts 3+ windows
Open: variance_and_drift.md

Step 7 — Audit & Escalation

If instability remains:

Rebuild embeddings with embedding-vs-semantic.md
Patch retriever fragmentation with vectorstore-fragmentation.md
Apply reranking schema: rerankers.md

Copy-paste eval contract

eval_contract:
  seeds: 3
  paraphrases: 3
  targets:
    deltaS: <=0.45
    coverage: >=0.70
    lambda: convergent
    variance: <=0.15
    drift: <=0.02
alerts:
  - deltaS >=0.60
  - lambda divergent
  - drift slope >0.02

🔗 Quick-Start Downloads (60 sec)

Tool	Link	3-Step Setup
WFGY 1.0 PDF	Engine Paper	1️⃣ Download · 2️⃣ Upload to your LLM · 3️⃣ Ask “Answer using WFGY + <your question>”
TXT OS (plain-text OS)	TXTOS.txt	1️⃣ Download · 2️⃣ Paste into any LLM chat · 3️⃣ Type “hello world” — OS boots instantly

🧭 Explore More

Module	Description	Link
WFGY Core	WFGY 2.0 engine is live: full symbolic reasoning architecture and math stack	View →
Problem Map 1.0	Initial 16-mode diagnostic and symbolic fix framework	View →
Problem Map 2.0	RAG-focused failure tree, modular fixes, and pipelines	View →
Semantic Clinic Index	Expanded failure catalog: prompt injection, memory bugs, logic drift	View →
Semantic Blueprint	Layer-based symbolic reasoning & semantic modulations	View →
Benchmark vs GPT-5	Stress test GPT-5 with full WFGY reasoning suite	View →
🧙‍♂️ Starter Village 🏡	New here? Lost in symbols? Click here and let the wizard guide you through	Start →

👑 Early Stargazers: See the Hall of Fame — Engineers, hackers, and open source builders who supported WFGY from day one.

⭐ WFGY Engine 2.0 is already unlocked. ⭐ Star the repo to help others discover it and unlock more on the Unlock Board.

7.7 KiB Raw Blame History Unescape Escape