
Eval Harness — Guardrails and Minimal Contract

A minimal yet strict harness for running repeatable evaluations over RAG and agent pipelines. It guards against the two usual failures: non-reproducible runs, and noisy metrics that cannot explain drift. Everything here maps to WFGY pages with measurable targets.

Open these first

  • Retrieval Playbook and Retrieval Traceability
  • Data Contracts
  • deltaS_thresholds.md and lambda_observe.md
  • eval_rag_precision_recall.md and regression_gate.md

Acceptance targets for this harness

  • ΔS(question, retrieved) ≤ 0.45 on the gold set
  • Coverage of the target section ≥ 0.70
  • λ remains convergent across 3 paraphrases and 2 seeds
  • Re-runs with identical seed produce metrics drift ≤ 0.5 percentage point
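
As a sanity check, these four targets reduce to a single boolean gate. A minimal sketch, assuming per-item ΔS values, an aggregate coverage fraction, λ states recorded per paraphrase and seed cell, and drift measured in percentage points; none of these parameter names are a fixed WFGY schema.

from statistics import median

# Minimal acceptance gate. Inputs are illustrative, not a fixed schema:
#   delta_s        per-item ΔS(question, retrieved) on the gold set
#   coverage       fraction of the target section covered, 0..1
#   lambda_states  λ per (paraphrase, seed) cell; "→" means convergent
#   drift_pp       metric drift vs an identical-seed re-run, in percentage points
def passes_acceptance(delta_s, coverage, lambda_states, drift_pp):
    return (
        median(delta_s) <= 0.45
        and coverage >= 0.70
        and all(state == "→" for state in lambda_states)
        and drift_pp <= 0.5
    )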

Folder layout and contracts

eval/
  datasets/
    gold/
      qa.jsonl            # minimal gold set
      citations.jsonl     # expected snippet anchors
    probes/
      paraphrases.jsonl   # 3 paraphrases per item
  runs/
    2025-08-29_seed42/
      config.yaml
      metrics.csv
      traces.jsonl
  config/
    harness.yaml          # store, retriever, reranker, seeds, k

Input schema

datasets/gold/qa.jsonl, one JSON object per line.

{
  "id": "Q_0001",
  "question": "How is vector contamination detected in FAISS indexes",
  "answer_ref": "PM:vectorstore-metrics-and-faiss-pitfalls#detect-contamination",
  "expected_doc": "ProblemMap/vectorstore-metrics-and-faiss-pitfalls.md",
  "section_id": "detect-contamination"
}

datasets/gold/citations.jsonl

{
  "id": "Q_0001",
  "snippet_id": "S_18823",
  "section_id": "detect-contamination",
  "source_url": "https://github.com/onestardao/WFGY/blob/main/ProblemMap/vectorstore-metrics-and-faiss-pitfalls.md",
  "offsets": [1380, 1540],
  "tokens": [310, 352]
}

Contract rules come from Retrieval Traceability and Data Contracts.
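
A minimal loader sketch that enforces both schemas at read time and joins the two files by id. The required field sets mirror the examples above; the fail-fast behavior is the contract rule, while the function names and default path are illustrative.

import json
from pathlib import Path

REQUIRED_QA = {"id", "question", "answer_ref", "expected_doc", "section_id"}
REQUIRED_CITATION = {"id", "snippet_id", "section_id", "source_url", "offsets", "tokens"}

def load_jsonl(path):
    # One JSON object per line, blank lines ignored.
    with Path(path).open() as f:
        return [json.loads(line) for line in f if line.strip()]

def load_gold(root="eval/datasets/gold"):
    root = Path(root)
    qa = {row["id"]: row for row in load_jsonl(root / "qa.jsonl")}
    for row in load_jsonl(root / "citations.jsonl"):
        missing = REQUIRED_CITATION - row.keys()
        if missing:
            raise ValueError(f"{row.get('id')}: citation missing {missing}")
        if row["id"] not in qa:
            raise ValueError(f"{row['id']}: citation without a qa row")
        qa[row["id"]]["citation"] = row
    for qid, row in qa.items():
        missing = REQUIRED_QA - row.keys()
        if missing:
            raise ValueError(f"{qid}: qa row missing {missing}")
    return qa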

Repro knobs

  • seed: integer. Set for the retriever, reranker, and LLM sampler if available.
  • k: top k per retriever. Test 5, 10, 20.
  • λ_observe: record λ state for retrieve, assemble, reason. See lambda_observe.md.
  • ΔS probe: compute ΔS(question, retrieved) and ΔS(retrieved, expected anchor). See deltaS_thresholds.md.
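
A sketch of wiring the first two knobs, assuming harness.yaml holds top-level seed and k keys as described above. The numpy and torch seeding is guarded so the harness still runs when those libraries are absent.

import random
import yaml  # pip install pyyaml

def load_config(path="eval/config/harness.yaml"):
    with open(path) as f:
        return yaml.safe_load(f)

def apply_seed(seed):
    # Seed every sampler you control; skip libraries that are not installed.
    random.seed(seed)
    try:
        import numpy
        numpy.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)
    except ImportError:
        pass

cfg = load_config()
apply_seed(cfg["seed"])   # e.g. seed: 42, k: 10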

Execution flow

  1. Warm up fence. Verify the index hash, vector store readiness, and secrets. If anything is not ready, stop. Open: Bootstrap Ordering.

  2. Retrieval step. Run with fixed metric and analyzer. Save raw hits with snippet fields from the contract page.

  3. ΔS and λ probes. Log both per item. If ΔS ≥ 0.60, flag the item as a structural risk. A minimal probe is sketched after this list.

  4. Reasoning step. The LLM reads TXT OS and uses the cite-then-explain schema. Refuse any answer without citations.

  5. Metrics. Compute precision, recall, citation hit, coverage. See eval_rag_precision_recall.md and Retrieval Playbook.

  6. Trace sink. Write traces.jsonl with id, seed, k, ΔS, λ_state, snippet_id, section_id, INDEX_HASH.

  7. Gate. If coverage < 0.70 or median ΔS > 0.45, fail the run. See regression_gate.md.
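
The probe from step 3 can stay tiny. A sketch, assuming ΔS is computed as 1 minus cosine similarity between embedding vectors; confirm the exact form against deltaS_thresholds.md before trusting the numbers.

import math

def delta_s(vec_a, vec_b):
    # ΔS as 1 - cosine similarity; an assumption, see deltaS_thresholds.md.
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    norm = math.sqrt(sum(a * a for a in vec_a)) * math.sqrt(sum(b * b for b in vec_b))
    return 1.0 - dot / max(norm, 1e-12)

def probe(question_vec, retrieved_vec, anchor_vec):
    ds_qr = delta_s(question_vec, retrieved_vec)
    return {
        "deltaS_question_retrieved": round(ds_qr, 2),
        "deltaS_retrieved_anchor": round(delta_s(retrieved_vec, anchor_vec), 2),
        "structural_risk": ds_qr >= 0.60,   # the step 3 flag
    }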

Sixty second quick start

  1. Place a ten item gold set into datasets/gold/qa.jsonl and citations.jsonl.
  2. Copy config/harness.yaml from a previous good run. Set seed: 42, k: 10.
  3. Run your script to produce runs/<date>_seed42/metrics.csv and traces.jsonl.
  4. Verify the acceptance targets above. If any gate fails, jump to the matching fix below.
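
For quick verification of the coverage target, the simplest scorable metrics are exact-match citation hit and coverage over the gold set. A sketch, assuming predictions carry snippet_id and section_id per the contract and are already aligned with gold by id.

def citation_hit(pred, gold):
    # A hit requires both the snippet and its section to match exactly.
    return (pred.get("snippet_id") == gold["snippet_id"]
            and pred.get("section_id") == gold["section_id"])

def coverage(preds, golds):
    # preds and golds aligned by id upstream; swap in token-offset overlap
    # if your gold relies on the offsets field instead of exact ids.
    hits = sum(citation_hit(p, g) for p, g in zip(preds, golds))
    return hits / len(golds)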

Common failures and the exact fix

  • ΔS(question, retrieved) stays ≥ 0.60: structural retrieval risk. Open Retrieval Playbook and rerankers.
  • Citations missing or mismatched: enforce the cite-then-explain schema. Open Retrieval Traceability and Data Contracts.
  • λ flips across paraphrases or seeds: reasoning is not convergent. Open lambda_observe.md.
  • Metrics drift between identical-seed re-runs: check seed application, INDEX_HASH, and MODEL_HASH. Open regression_gate.md.

CI gates and artifacts

  • Block merge if any of these is true:

    1. ΔS median > 0.45 on gold
    2. Coverage < 0.70
    3. λ flips on 2 of 3 paraphrases
    4. Metrics drift from last green run > 0.5 percentage point
  • Store artifacts: metrics.csv, traces.jsonl, harness.yaml, INDEX_HASH, MODEL_HASH.
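
A regression-gate sketch for CI, assuming metrics.csv stores one aggregate row with fraction-valued columns; the column names are illustrative, and the 0.5 percentage point budget appears as 0.005.

import csv

DRIFT_COLUMNS = ("precision", "recall", "citation_hit", "coverage")

def read_metrics(path):
    # Assumes a single aggregate row; adapt if you store per-item rows.
    with open(path) as f:
        row = next(csv.DictReader(f))
    return {key: float(value) for key, value in row.items()}

def gate(current, last_green):
    failures = []
    if current["delta_s_median"] > 0.45:
        failures.append("ΔS median > 0.45 on gold")
    if current["coverage"] < 0.70:
        failures.append("coverage < 0.70")
    if current["lambda_flips"] >= 2:
        failures.append("λ flips on 2 of 3 paraphrases")
    drift = max(abs(current[c] - last_green[c]) for c in DRIFT_COLUMNS)
    if drift > 0.005:  # 0.5 percentage point on fraction-valued metrics
        failures.append("drift vs last green run > 0.5 pp")
    return failures   # an empty list means the merge may proceed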

Copy-paste prompt for the reasoning step

You have TXTOS and the WFGY Problem Map loaded.

Question: "{question}"
Retrieved snippets: [{snippet_id, section_id, source_url, offsets, tokens}]

Do:
1) Cite then explain. If citation is missing or mismatched, fail fast and return the minimal structural fix.
2) If ΔS(question, retrieved) ≥ 0.60 propose the smallest repair. Use retrieval-playbook, retrieval-traceability, data-contracts, rerankers.
3) Return JSON:
   {"citations":[...], "answer":"...", "λ_state":"→|←|<>|×", "ΔS":0.xx, "next_fix":"..."}
Keep it short and auditable.
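
The harness should not trust that JSON blindly. A validator sketch that enforces the refuse-without-citations rule and the field shapes the prompt requests; key names follow the prompt above.

import json

ALLOWED_LAMBDA = {"→", "←", "<>", "×"}

def validate_response(raw):
    resp = json.loads(raw)
    if not resp.get("citations"):
        raise ValueError("refused: answer carries no citations")
    if resp.get("λ_state") not in ALLOWED_LAMBDA:
        raise ValueError(f"unknown λ_state: {resp.get('λ_state')!r}")
    ds = float(resp.get("ΔS", -1.0))
    if not 0.0 <= ds <= 2.0:  # 1 - cosine similarity ranges over 0..2
        raise ValueError(f"ΔS out of range: {ds}")
    return resp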

🔗 Quick-Start Downloads (60 sec)

  • WFGY 1.0 PDF (Engine Paper): 1 Download · 2 Upload to your LLM · 3 Ask "Answer using WFGY + <your question>"
  • TXT OS (plain-text OS, TXTOS.txt): 1 Download · 2 Paste into any LLM chat · 3 Type "hello world" — OS boots instantly

🧭 Explore More

  • WFGY Core: WFGY 2.0 engine is live, full symbolic reasoning architecture and math stack. View →
  • Problem Map 1.0: initial 16-mode diagnostic and symbolic fix framework. View →
  • Problem Map 2.0: RAG-focused failure tree, modular fixes, and pipelines. View →
  • Semantic Clinic Index: expanded failure catalog covering prompt injection, memory bugs, logic drift. View →
  • Semantic Blueprint: layer-based symbolic reasoning and semantic modulations. View →
  • Benchmark vs GPT-5: stress test GPT-5 with the full WFGY reasoning suite. View →
  • 🧙‍♂️ Starter Village 🏡: new here? Lost in symbols? Let the wizard guide you through. Start →
