
# Eval Observability — λ_observe

A core probe for evaluating semantic convergence across multiple seeds, paraphrases, and retrieval variations.
While ΔS measures semantic distance, λ_observe captures whether reasoning paths stay stable or diverge.


## Why λ_observe matters

- **Detect fragile reasoning.** Even when ΔS looks safe, λ divergence indicates unstable chains.
- **Identify paraphrase sensitivity.** If λ flips across harmless rewordings, the system is brittle.
- **Audit retrieval randomness.** Different seeds producing opposite λ signals reveal a weak schema.
- **Ensure eval reproducibility.** Stable λ means tests repeat reliably under small perturbations.

## λ state encoding

| Symbol | Meaning | Example failure |
|--------|---------|-----------------|
| → | Forward convergence, stable path | Same citations and reasoning across paraphrases |
| ← | Backward collapse, early abort | Tool-call retries, empty citations |
| <> | Split state, partial divergence | One paraphrase cites the correct snippet, the others miss |
| × | Total collapse | Random answers, no citation alignment |

## Acceptance targets

- Convergence rate ≥ 0.80 across 3 paraphrases × 2 seeds.
- No × states tolerated in gold-set eval.
- Split states (`<>`) acceptable in ≤ 10% of test cases.
- Forward (→) must dominate stable runs.

## Evaluation workflow

1. **Run a triple paraphrase probe.** Ask the same question three ways and collect the λ state of each run.
2. **Repeat with two seeds.** Track variance across seeds.
3. **Roll up the stats.** Compute the convergence ratio, collapse frequency, and divergence rate (see the sketch after the probe schema below).
4. **Escalate.** If the convergence rate falls below 0.80 or any × state appears, run root-cause analysis: schema audit, retriever split, prompt ordering.

## Example probe schema

```json
{
  "query_id": "Q42",
  "runs": [
    {"paraphrase": 1, "seed": 123, "λ": "→"},
    {"paraphrase": 2, "seed": 123, "λ": "→"},
    {"paraphrase": 3, "seed": 123, "λ": "<>"},
    {"paraphrase": 1, "seed": 456, "λ": "→"},
    {"paraphrase": 2, "seed": 456, "λ": "×"},
    {"paraphrase": 3, "seed": 456, "λ": "→"}
  ]
}
```
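
Below is a minimal roll-up sketch in Python, assuming probe results follow the schema above. The thresholds restate the acceptance targets; the function names (`rollup`, `needs_escalation`) are illustrative, not part of WFGY.

```python
from collections import Counter

# λ states as recorded by probe runs
FORWARD, BACKWARD, SPLIT, COLLAPSE = "→", "←", "<>", "×"

def rollup(probe: dict) -> dict:
    """Convergence ratio, collapse frequency, and divergence rate
    for one query, computed from its probe runs."""
    states = Counter(run["λ"] for run in probe["runs"])
    total = sum(states.values())
    return {
        "query_id": probe["query_id"],
        "convergence": states[FORWARD] / total,
        "collapse": states[COLLAPSE] / total,
        "divergence": (states[SPLIT] + states[BACKWARD]) / total,
    }

def needs_escalation(stats: dict) -> bool:
    """Acceptance targets: convergence ≥ 0.80 and zero × states."""
    return stats["convergence"] < 0.80 or stats["collapse"] > 0.0

# The Q42 example from above, as a Python literal.
probe = {
    "query_id": "Q42",
    "runs": [
        {"paraphrase": 1, "seed": 123, "λ": "→"},
        {"paraphrase": 2, "seed": 123, "λ": "→"},
        {"paraphrase": 3, "seed": 123, "λ": "<>"},
        {"paraphrase": 1, "seed": 456, "λ": "→"},
        {"paraphrase": 2, "seed": 456, "λ": "×"},
        {"paraphrase": 3, "seed": 456, "λ": "→"},
    ],
}

stats = rollup(probe)
if needs_escalation(stats):
    print(f"{stats['query_id']}: escalate (convergence {stats['convergence']:.2f})")
```

On the Q42 runs, convergence is 4/6 ≈ 0.67 with one × state, so the query is flagged for root-cause analysis.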

## Common pitfalls

- Measuring only ΔS misses hidden divergence.
- Seed-fixed evals look stable but are fragile in production.
- Ignoring split states lets small divergences grow into collapse.
- Without per-query logs, averages hide catastrophic single failures.

## Reporting recommendations

- **λ distribution table.** Percentage of →, ←, <>, × states.
- **Convergence trend.** Chart over time by eval batch.
- **Drift alerts.** Trigger if convergence < 0.80 or any × appears (see the sketch below).
- **Correlation.** Track ΔS against λ to spot mixed failures.
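
A short batch-level sketch, again in Python and again with illustrative names (`lambda_distribution`, `drift_alert`); the alert thresholds are the acceptance targets above.

```python
from collections import Counter
from typing import Iterable

STATES = ["→", "←", "<>", "×"]

def lambda_distribution(all_states: Iterable[str]) -> dict:
    """Percentage of each λ state across an eval batch."""
    counts = Counter(all_states)
    total = sum(counts.values()) or 1  # avoid division by zero on an empty batch
    return {s: 100.0 * counts[s] / total for s in STATES}

def drift_alert(dist: dict) -> bool:
    """Trigger if convergence < 80% or any × appears."""
    return dist["→"] < 80.0 or dist["×"] > 0.0

batch = ["→", "→", "<>", "→", "→", "→"]  # toy batch with one split state
dist = lambda_distribution(batch)
print({s: f"{p:.1f}%" for s, p in dist.items()})
if drift_alert(dist):
    print("drift alert: investigate before the next eval batch")
```

Keep the per-query logs alongside the batch distribution; as noted under common pitfalls, the batch averages alone can hide single-query failures.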

## 🔗 Quick-Start Downloads (60 sec)

| Tool | Link | 3-Step Setup |
|------|------|--------------|
| WFGY 1.0 PDF | Engine Paper | 1 Download · 2 Upload to your LLM · 3 Ask “Answer using WFGY + \<your question\>” |
| TXT OS (plain-text OS) | TXTOS.txt | 1 Download · 2 Paste into any LLM chat · 3 Type “hello world” — OS boots instantly |

## 🧭 Explore More

| Module | Description | Link |
|--------|-------------|------|
| WFGY Core | WFGY 2.0 engine is live: full symbolic reasoning architecture and math stack | View → |
| Problem Map 1.0 | Initial 16-mode diagnostic and symbolic fix framework | View → |
| Problem Map 2.0 | RAG-focused failure tree, modular fixes, and pipelines | View → |
| Semantic Clinic Index | Expanded failure catalog: prompt injection, memory bugs, logic drift | View → |
| Semantic Blueprint | Layer-based symbolic reasoning & semantic modulations | View → |
| Benchmark vs GPT-5 | Stress test GPT-5 with full WFGY reasoning suite | View → |
| 🧙‍♂️ Starter Village 🏡 | New here? Lost in symbols? Click here and let the wizard guide you through | Start → |

👑 Early Stargazers: see the Hall of Fame — engineers, hackers, and open-source builders who supported WFGY from day one.

WFGY Engine 2.0 is already unlocked. Star the repo to help others discover it and unlock more on the Unlock Board.

WFGY Main · TXT OS · Blah · Blot · Bloc · Blur · Blow