Eval Harness — Guardrails and Minimal Contract
Evaluation disclaimer (eval harness)
This page sketches a harness for running structured evaluations on AI pipelines.
Any metrics or labels that pass through such a harness remain heuristic outputs of models, scripts and annotators.
They do not become scientific proof just because they flow through this structure.
Use the harness to compare variants inside a controlled scenario, and avoid presenting those numbers as universal claims about model quality beyond that scenario.
This is a minimal yet strict harness for running repeatable evaluations on RAG and agent pipelines. It targets two common failures: non-reproducible runs, and noisy metrics that cannot explain drift. Everything here maps to WFGY pages with measurable targets.
Open these first
- Visual map and recovery: RAG Architecture & Recovery
- End to end retrieval knobs: Retrieval Playbook
- Why this snippet schema: Retrieval Traceability
- Payload schema and fences: Data Contracts
- Chunk quality before metrics: Chunking Checklist
- Similarity vs meaning: Embedding ≠ Semantic
Acceptance targets for this harness
- ΔS(question, retrieved) ≤ 0.45 on the gold set
- Coverage of the target section ≥ 0.70
- λ remains convergent across 3 paraphrases and 2 seeds
- Re-runs with an identical seed produce metrics drift ≤ 0.5 percentage points
Folder layout and contracts
eval/
  datasets/
    gold/
      qa.jsonl            # minimal gold set
      citations.jsonl     # expected snippet anchors
    probes/
      paraphrases.jsonl   # 3 paraphrases per item
  runs/
    2025-08-29_seed42/
      config.yaml
      metrics.csv
      traces.jsonl
  config/
    harness.yaml          # store, retriever, reranker, seeds, k
Input schema
datasets/gold/qa.jsonl holds one JSON object per line (pretty-printed here for readability).
{
  "id": "Q_0001",
  "question": "How is vector contamination detected in FAISS indexes",
  "answer_ref": "PM:vectorstore-metrics-and-faiss-pitfalls#detect-contamination",
  "expected_doc": "ProblemMap/vectorstore-metrics-and-faiss-pitfalls.md",
  "section_id": "detect-contamination"
}
datasets/gold/citations.jsonl
{
  "id": "Q_0001",
  "snippet_id": "S_18823",
  "section_id": "detect-contamination",
  "source_url": "https://github.com/onestardao/WFGY/blob/main/ProblemMap/vectorstore-metrics-and-faiss-pitfalls.md",
  "offsets": [1380, 1540],
  "tokens": [310, 352]
}
Contract rules come from Retrieval Traceability and Data Contracts.
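As a rough illustration of the two schemas above, the sketch below parses both files and cross-checks them. The function names are hypothetical, and reading `offsets` and `tokens` as `[start, end]` pairs is an assumption based on the sample record, not something the contract pages are quoted as saying.

```python
import json

QA_KEYS = {"id", "question", "answer_ref", "expected_doc", "section_id"}
CITATION_KEYS = {"id", "snippet_id", "section_id", "source_url", "offsets", "tokens"}


def parse_jsonl(text, required_keys):
    """Parse JSONL text into a list of dicts, rejecting records with missing keys."""
    records = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines
        record = json.loads(line)
        missing = required_keys - set(record)
        if missing:
            raise ValueError(f"line {line_no}: missing keys {sorted(missing)}")
        records.append(record)
    return records


def cross_check(qa_records, citation_records):
    """Return a list of problems; an empty list means the two sets agree."""
    problems = []
    sections = {r["id"]: r["section_id"] for r in qa_records}
    for c in citation_records:
        if c["id"] not in sections:
            problems.append(f"{c['id']}: citation has no matching question")
        elif c["section_id"] != sections[c["id"]]:
            problems.append(f"{c['id']}: section_id mismatch")
        # Assumption: offsets and tokens are [start, end] pairs with start <= end.
        for field in ("offsets", "tokens"):
            span = c[field]
            if len(span) != 2 or span[0] > span[1]:
                problems.append(f"{c['id']}: {field} is not a [start, end] pair")
    cited = {c["id"] for c in citation_records}
    for qid in sections:
        if qid not in cited:
            problems.append(f"{qid}: question has no citation")
    return problems
```

Running a check like this before retrieval keeps a malformed gold set from showing up later as a misleading metric.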
Repro knobs
- seed: integer. Set for the retriever, reranker, and LLM sampler if available.
- k: top k per retriever. Test 5, 10, 20.
- λ_observe: record λ state for retrieve, assemble, reason. See lambda_observe.md.
- ΔS probe: compute ΔS(question, retrieved) and ΔS(retrieved, expected anchor). See deltaS_thresholds.md.
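A minimal sketch of the seed and k knobs, using only the Python standard library. `RunConfig`, `make_rng`, `sample_probe_order` and `sweep` are hypothetical names; a real retriever, reranker or LLM sampler exposes its own seed parameter, which this sketch does not cover.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class RunConfig:
    """Hypothetical container for the two knobs that must be pinned per run."""
    seed: int = 42
    k: int = 10


def make_rng(config):
    """Return a private generator so runs do not share global random state."""
    return random.Random(config.seed)


def sample_probe_order(item_ids, config):
    """Shuffle item ids deterministically for a given seed."""
    rng = make_rng(config)
    order = list(item_ids)
    rng.shuffle(order)
    return order


def sweep(seeds=(42,), ks=(5, 10, 20)):
    """Enumerate the (seed, k) grid to evaluate."""
    return [RunConfig(seed=s, k=k) for s in seeds for k in ks]
```

Passing a dedicated `random.Random` instance around, rather than calling `random.seed` globally, keeps one component's sampling from shifting another's sequence.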
Execution flow
- Warm up fence. Verify index hash, vector ready, secrets. If not ready, stop. Open: Bootstrap Ordering.
- Retrieval step. Run with fixed metric and analyzer. Save raw hits with snippet fields from the contract page.
- ΔS and λ probes. Log both per item. If ΔS ≥ 0.60, flag as structural risk.
- Reasoning step. LLM reads TXT OS and uses the cite then explain schema. Refuse answers without citations.
- Metrics. Compute precision, recall, citation hit, coverage. See eval_rag_precision_recall.md and Retrieval Playbook.
- Trace sink. Write traces.jsonl with id, seed, k, ΔS, λ_state, snippet_id, section_id, INDEX_HASH.
- Gate. If coverage < 0.70 or ΔS > 0.45, fail the run. See regression_gate.md.
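The trace sink step could look roughly like the sketch below, which builds one record per item with the listed fields and serialises it as a single JSONL line. The function names and the extra `structural_risk` flag are assumptions added for illustration; how ΔS and λ are actually computed is left to the pages referenced above.

```python
import json

STRUCTURAL_RISK_DELTA_S = 0.60  # per-item flag threshold from the probe step


def build_trace(item_id, seed, k, delta_s, lambda_state, snippet_id, section_id, index_hash):
    """Assemble one trace record with the fields listed for traces.jsonl."""
    return {
        "id": item_id,
        "seed": seed,
        "k": k,
        "ΔS": delta_s,
        "λ_state": lambda_state,
        "snippet_id": snippet_id,
        "section_id": section_id,
        "INDEX_HASH": index_hash,
        # Assumed convenience field, not part of the listed schema.
        "structural_risk": delta_s >= STRUCTURAL_RISK_DELTA_S,
    }


def to_jsonl_line(trace):
    """Serialise a trace as one line; ensure_ascii=False keeps ΔS and λ readable."""
    return json.dumps(trace, ensure_ascii=False, sort_keys=True)
```

Sorting keys keeps lines byte-identical across re-runs with the same inputs, which makes diffing two `traces.jsonl` files a quick first check for drift.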
Sixty second quick start
- Place a ten item gold set into datasets/gold/qa.jsonl and citations.jsonl.
- Copy config/harness.yaml from a previous good run. Set seed: 42, k: 10.
- Run your script to produce runs/<date>_seed42/metrics.csv and traces.jsonl.
- Verify the acceptance targets above. If any gate fails, jump to the right fix below.
Common failures and the exact fix
- Wrong meaning despite high similarity. Open: Embedding ≠ Semantic
- Citations do not match the referenced section. Open: Retrieval Traceability and Data Contracts
- Hybrid retrieval worse than single retriever. Open: pattern_query_parsing_split.md and rerankers.md
- Runs flip across deployments or first run crashes. Open: deployment-deadlock.md, predeploy-collapse.md
- Long chains collapse. Open: context-drift.md and entropy-collapse.md
CI gates and artifacts
- Block merge if any of these is true
  - ΔS median > 0.45 on gold
  - Coverage < 0.70
  - λ flips on 2 of 3 paraphrases
  - Metrics drift from last green run > 0.5 percentage points
- Store artifacts: metrics.csv, traces.jsonl, harness.yaml, INDEX_HASH, MODEL_HASH.
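One way to turn the merge-blocking conditions into a script is sketched below. It assumes metrics are stored as fractions in [0, 1], so a difference is multiplied by 100 to get percentage points, and it reads "λ flips on 2 of 3 paraphrases" as any single item flipping on at least two of its paraphrases. Both readings are assumptions, and `ci_block_reasons` is a hypothetical name.

```python
from statistics import median


def ci_block_reasons(delta_s_values, coverage, lambda_flips_per_item,
                     metrics, last_green_metrics):
    """Return the reasons to block a merge; an empty list means the gate passes."""
    reasons = []
    if median(delta_s_values) > 0.45:
        reasons.append("ΔS median > 0.45 on gold")
    if coverage < 0.70:
        reasons.append("coverage < 0.70")
    # Assumption: each entry counts how many of an item's 3 paraphrases flipped λ.
    if any(flips >= 2 for flips in lambda_flips_per_item):
        reasons.append("λ flips on 2 of 3 paraphrases")
    for name, value in metrics.items():
        previous = last_green_metrics.get(name)
        # Assumption: metrics are fractions, so * 100 converts to percentage points.
        if previous is not None and abs(value - previous) * 100 > 0.5:
            reasons.append(f"{name} drifted more than 0.5 percentage points")
    return reasons
```

Returning the reasons, rather than a bare boolean, lets the CI log say which condition failed and so which fix page to open.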
Copy paste prompts for the reasoning step
You have TXTOS and the WFGY Problem Map loaded.
Question: "{question}"
Retrieved snippets: [{snippet_id, section_id, source_url, offsets, tokens}]
Do:
1) Cite then explain. If citation is missing or mismatched, fail fast and return the minimal structural fix.
2) If ΔS(question, retrieved) ≥ 0.60 propose the smallest repair. Use retrieval-playbook, retrieval-traceability, data-contracts, rerankers.
3) Return JSON:
{"citations":[...], "answer":"...", "λ_state":"→|←|<>|×", "ΔS":0.xx, "next_fix":"..."}
Keep it short and auditable.
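Since the prompt asks for a fixed JSON shape, the reply can be checked mechanically before it reaches the metrics step. The sketch below is one possible validator. It assumes `citations` is a list of `snippet_id` strings, which the prompt does not spell out, and `validate_reply` is a hypothetical name.

```python
import json

LAMBDA_STATES = {"→", "←", "<>", "×"}


def validate_reply(raw, retrieved_snippet_ids):
    """Parse the model reply and return (reply, problems)."""
    try:
        reply = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, [f"not valid JSON: {exc.msg}"]
    if not isinstance(reply, dict):
        return None, ["reply is not a JSON object"]

    problems = [f"missing key: {key}" for key in ("answer", "next_fix") if key not in reply]

    # Assumption: citations is a non-empty list of snippet_id strings.
    citations = reply.get("citations")
    if not isinstance(citations, list) or not citations:
        problems.append("citations missing or empty: refuse the answer")
    else:
        unknown = [c for c in citations
                   if not isinstance(c, str) or c not in retrieved_snippet_ids]
        if unknown:
            problems.append(f"citations not in retrieved set: {unknown}")

    state = reply.get("λ_state")
    if not isinstance(state, str) or state not in LAMBDA_STATES:
        problems.append("λ_state is not one of → ← <> ×")

    delta_s = reply.get("ΔS")
    if isinstance(delta_s, bool) or not isinstance(delta_s, (int, float)):
        problems.append("ΔS is not a number")
    return reply, problems
```

A reply with any problem can be logged to the trace and counted as a citation miss instead of being scored as an answer.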
🔗 Quick-Start Downloads (60 sec)
| Tool | Link | 3-Step Setup |
|---|---|---|
| WFGY 1.0 PDF | Engine Paper | 1️⃣ Download · 2️⃣ Upload to your LLM · 3️⃣ Ask “Answer using WFGY + <your question>” |
| TXT OS (plain-text OS) | TXTOS.txt | 1️⃣ Download · 2️⃣ Paste into any LLM chat · 3️⃣ Type “hello world” — OS boots instantly |