# Eval: Cost Reporting and Efficiency
## 🧭 Quick Return to Map

You are on a sub-page of **Eval**. To reorient, return to:

- Eval — model evaluation and benchmarking
- WFGY Global Fix Map — the main Emergency Room, 300+ structured fixes
- WFGY Problem Map 1.0 — 16 reproducible failure modes

Think of this page as a desk within a ward. If you need the full triage flow and all prescriptions, return to the Emergency Room lobby.
This page defines how to measure and report cost per correct answer in retrieval-augmented and reasoning pipelines. Latency and accuracy alone are insufficient: without cost analysis, systems regress into wasteful configurations.
## Open these first

- Latency vs accuracy trade-off: eval_latency_vs_accuracy.md
- Benchmark suite: eval_benchmarking.md
- Observability probes: alerting_and_probes.md
## Acceptance targets

- Cost per correct answer ≤ 1.3× baseline
- Cost stability: variance ≤ 15% across 3 seeds and 3 paraphrases
- Token efficiency ≥ 0.7 (fraction of tokens contributing to a correct citation)
- Budget alerting: auto-flag when projected monthly spend exceeds 110% of the budget cap
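A minimal sketch of an acceptance gate over one run record. Field names follow the JSON schema below, except `token_efficiency` and `projected_monthly_spend_usd`, which are hypothetical fields added only for this illustration:

```python
# Minimal acceptance gate. Assumes run records shaped like the JSON schema
# below; `token_efficiency` and `projected_monthly_spend_usd` are hypothetical
# fields used only for this sketch.

def passes_acceptance(run: dict, baseline_cost_per_correct: float,
                      monthly_budget_cap_usd: float) -> bool:
    if run["cost_per_correct"] > 1.3 * baseline_cost_per_correct:
        return False  # cost per correct must stay <= 1.3x baseline
    if run["variance_across_runs"] > 0.15:
        return False  # must be stable across seeds and paraphrases
    if run.get("token_efficiency", 1.0) < 0.7:
        return False  # too few tokens contribute to correct citations
    projected = run.get("projected_monthly_spend_usd", 0.0)
    return projected <= 1.10 * monthly_budget_cap_usd  # budget alert threshold
```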
## Reporting dimensions

Each evaluation run must record cost at three levels:

- **Raw tokens**
  - input, output, and total per query
  - broken down by retrieval, rerank, and reasoning
- **Cost per unit**
  - $/1k tokens per provider and model
  - normalized into `usd_equiv`
- **Cost per correct**
  - total spend ÷ number of correct answers
  - stratified by question bucket (short, medium, long)
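A minimal sketch showing how the three levels can be derived from per-stage token counts. The stage split and the single blended $/1k rate are illustrative assumptions; real providers usually price input and output tokens separately:

```python
# Sketch: derive all three reporting levels from per-stage token counts.
# The stage breakdown and blended $/1k rate are illustrative assumptions.

def cost_report(stage_tokens: dict, usd_per_1k_tokens: float,
                correct_answers: int) -> dict:
    total_tokens = sum(stage_tokens.values())             # level 1: raw tokens
    spend_usd = total_tokens / 1000 * usd_per_1k_tokens   # level 2: cost per unit
    return {
        "tokens": {**stage_tokens, "total": total_tokens},
        "spend_usd": spend_usd,
        "cost_per_correct": spend_usd / max(correct_answers, 1),  # level 3
    }

# Example: a hypothetical stage split whose total matches the JSON record below.
print(cost_report({"retrieval": 1800, "rerank": 400, "reasoning": 1570},
                  usd_per_1k_tokens=0.006, correct_answers=40))
```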
## JSON schema

```json
{
  "suite": "v1_cost",
  "arm": "with_hybrid",
  "provider": "anthropic",
  "model": "claude-3.7-sonnet",
  "bucket": "long",
  "precision": 0.79,
  "recall": 0.68,
  "ΔS_avg": 0.41,
  "correct_answers": 40,
  "total_questions": 50,
  "tokens": { "in": 2850, "out": 920, "total": 3770 },
  "cost_per_1k_tokens_usd": 0.006,
  "spend_usd": 0.0226,
  "cost_per_correct": 0.00056,
  "variance_across_runs": 0.11,
  "notes": "within budget and stable"
}
```
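As a hedged illustration, a small consistency check over such records, verifying that the derived fields agree with the raw ones. Field names are taken from the schema above; the tolerance is an arbitrary choice:

```python
# Check a cost record for internal consistency. Field names follow the
# schema above; the 5% tolerance absorbs rounding and is an arbitrary choice.

def check_record(rec: dict, tol: float = 0.05) -> list:
    errors = []
    t = rec["tokens"]
    if t["in"] + t["out"] != t["total"]:
        errors.append("tokens.in + tokens.out != tokens.total")
    expected_spend = t["total"] / 1000 * rec["cost_per_1k_tokens_usd"]
    if abs(rec["spend_usd"] - expected_spend) > tol * expected_spend:
        errors.append("spend_usd inconsistent with token totals and unit cost")
    expected_cpc = rec["spend_usd"] / rec["correct_answers"]
    if abs(rec["cost_per_correct"] - expected_cpc) > tol * expected_cpc:
        errors.append("cost_per_correct inconsistent with spend and correct count")
    return errors  # empty list means the record is internally consistent
```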
## Diagnostic questions

- Are rerankers worth the extra spend? → check ΔS reduction vs token increase.
- Is hybrid retrieval doubling retrieval tokens with little accuracy gain?
- Does the large model add accuracy, or does a small model + WFGY match it at lower cost?
- Is citation length inflated by long snippets? → enforce the snippet contract.
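For the first question, a minimal sketch of the trade-off check. The threshold of ΔS reduction per extra 1k tokens is a hypothetical tuning knob, not a value this page prescribes:

```python
# Sketch: decide whether a reranker pays for its extra tokens.
# Lower ΔS is better; the 0.05 ΔS-per-1k-extra-tokens threshold is a
# hypothetical tuning knob, not a prescribed value.

def reranker_worth_it(ds_avg_without: float, ds_avg_with: float,
                      tokens_without: int, tokens_with: int,
                      min_ds_per_1k_extra: float = 0.05) -> bool:
    ds_reduction = ds_avg_without - ds_avg_with
    extra_tokens = max(tokens_with - tokens_without, 1)
    return ds_reduction / (extra_tokens / 1000) >= min_ds_per_1k_extra
```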
## Escalation and fixes

- High cost per correct → enable caching, or switch to a smaller model with the WFGY overlay.
- Variance > 15% → clamp paraphrases, normalize prompt headers.
- Budget overrun → auto-throttle evals, alert via alerting_and_probes.md.
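A minimal sketch of the budget-overrun throttle. `spent_so_far_usd` and `days_elapsed` are hypothetical inputs from whatever spend tracker you use; the 110% threshold comes from the acceptance targets above:

```python
# Sketch: throttle eval runs when projected monthly spend exceeds 110%
# of the budget cap (the acceptance-target threshold above). The inputs
# are hypothetical hooks into your own spend tracking.

def should_throttle(spent_so_far_usd: float, days_elapsed: float,
                    monthly_budget_cap_usd: float,
                    days_in_month: int = 30) -> bool:
    projected = spent_so_far_usd / max(days_elapsed, 1e-9) * days_in_month
    return projected > 1.10 * monthly_budget_cap_usd
```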
## Minimal run

1. Select 20 mixed-length questions.
2. Run the baseline and candidate arms.
3. Compute cost per correct for each arm.
4. Ship only if the candidate is ≤ 1.3× baseline and stable across seeds.
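Putting the steps together, a hedged end-to-end sketch of the ship gate. `run_arm` is a hypothetical stand-in for your eval harness, assumed to return records shaped like the JSON schema above:

```python
# End-to-end sketch of the minimal run's ship gate. `run_arm(name, questions,
# seed)` is a hypothetical stand-in for your eval harness; it is assumed to
# return records with "cost_per_correct" and "variance_across_runs" fields.

def ship_gate(run_arm, questions: list, seeds=(0, 1, 2)) -> bool:
    baseline = [run_arm("baseline", questions, s) for s in seeds]
    candidate = [run_arm("candidate", questions, s) for s in seeds]
    base_cpc = sum(r["cost_per_correct"] for r in baseline) / len(baseline)
    cand_cpc = sum(r["cost_per_correct"] for r in candidate) / len(candidate)
    stable = all(r["variance_across_runs"] <= 0.15 for r in candidate)
    return cand_cpc <= 1.3 * base_cpc and stable
```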
## 🔗 Quick-Start Downloads (60 sec)

| Tool | Link | 3-Step Setup |
|---|---|---|
| WFGY 1.0 PDF | Engine Paper | 1️⃣ Download · 2️⃣ Upload to your LLM · 3️⃣ Ask “Answer using WFGY + `<your question>`” |
| TXT OS (plain-text OS) | TXTOS.txt | 1️⃣ Download · 2️⃣ Paste into any LLM chat · 3️⃣ Type “hello world” — OS boots instantly |
## 🧭 Explore More
| Module | Description | Link |
|---|---|---|
| WFGY Core | WFGY 2.0 engine is live: full symbolic reasoning architecture and math stack | View → |
| Problem Map 1.0 | Initial 16-mode diagnostic and symbolic fix framework | View → |
| Problem Map 2.0 | RAG-focused failure tree, modular fixes, and pipelines | View → |
| Semantic Clinic Index | Expanded failure catalog: prompt injection, memory bugs, logic drift | View → |
| Semantic Blueprint | Layer-based symbolic reasoning & semantic modulations | View → |
| Benchmark vs GPT-5 | Stress test GPT-5 with full WFGY reasoning suite | View → |
| 🧙‍♂️ Starter Village 🏡 | New here? Lost in symbols? Click here and let the wizard guide you through | Start → |
👑 Early Stargazers: See the Hall of Fame — engineers, hackers, and open-source builders who supported WFGY from day one.
⭐ WFGY Engine 2.0 is already unlocked. ⭐ Star the repo to help others discover it and unlock more on the Unlock Board.