# Eval: Cost Reporting and Efficiency
## 🧭 Quick Return to Map

You are on a sub-page of **Eval**. To reorient, return to:

- Eval — model evaluation and benchmarking
- WFGY Global Fix Map — the main Emergency Room, 300+ structured fixes
- WFGY Problem Map 1.0 — 16 reproducible failure modes

Think of this page as a desk within a ward. If you need the full triage flow and all prescriptions, return to the Emergency Room lobby.
This page defines how to measure and report cost per correct answer in retrieval-augmented and reasoning pipelines. Latency and accuracy alone are insufficient: without cost analysis, systems regress into wasteful configurations.
## Open these first

- Latency vs accuracy trade-off: eval_latency_vs_accuracy.md
- Benchmark suite: eval_benchmarking.md
- Observability probes: alerting_and_probes.md
## Acceptance targets

- Cost per correct answer ≤ 1.3× baseline
- Cost stability: variance ≤ 15% across 3 seeds and 3 paraphrases
- Token efficiency ≥ 0.7 (fraction of tokens contributing to a correct citation)
- Budget alerting: auto-flag when projected monthly spend exceeds 110% of the budget cap
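A minimal sketch of an acceptance gate over one run record. Field names follow the JSON schema below, except `token_efficiency` and `projected_monthly_spend_usd`, which are hypothetical fields added only for this illustration:

```python
# Minimal acceptance gate. Assumes run records shaped like the JSON schema
# below; `token_efficiency` and `projected_monthly_spend_usd` are hypothetical
# fields used only for this sketch.

def passes_acceptance(run: dict, baseline_cost_per_correct: float,
                      monthly_budget_cap_usd: float) -> bool:
    if run["cost_per_correct"] > 1.3 * baseline_cost_per_correct:
        return False  # cost per correct must stay <= 1.3x baseline
    if run["variance_across_runs"] > 0.15:
        return False  # must be stable across seeds and paraphrases
    if run.get("token_efficiency", 1.0) < 0.7:
        return False  # too few tokens contribute to correct citations
    projected = run.get("projected_monthly_spend_usd", 0.0)
    return projected <= 1.10 * monthly_budget_cap_usd  # budget alert threshold
```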
## Reporting dimensions

Each evaluation run must record cost at three levels:

- **Raw tokens**
  - input, output, and total per query
  - broken down by retrieval, rerank, and reasoning
- **Cost per unit**
  - $/1k tokens per provider and model
  - normalized into `usd_equiv`
- **Cost per correct**
  - total spend ÷ number of correct answers
  - stratified by question bucket (short, medium, long)
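A minimal sketch showing how the three levels can be derived from per-stage token counts. The stage split and the single blended $/1k rate are illustrative assumptions; real providers usually price input and output tokens separately:

```python
# Sketch: derive all three reporting levels from per-stage token counts.
# The stage breakdown and blended $/1k rate are illustrative assumptions.

def cost_report(stage_tokens: dict, usd_per_1k_tokens: float,
                correct_answers: int) -> dict:
    total_tokens = sum(stage_tokens.values())             # level 1: raw tokens
    spend_usd = total_tokens / 1000 * usd_per_1k_tokens   # level 2: cost per unit
    return {
        "tokens": {**stage_tokens, "total": total_tokens},
        "spend_usd": spend_usd,
        "cost_per_correct": spend_usd / max(correct_answers, 1),  # level 3
    }

# Example: a hypothetical stage split whose total matches the JSON record below.
print(cost_report({"retrieval": 1800, "rerank": 400, "reasoning": 1570},
                  usd_per_1k_tokens=0.006, correct_answers=40))
```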
## JSON schema

```json
{
  "suite": "v1_cost",
  "arm": "with_hybrid",
  "provider": "anthropic",
  "model": "claude-3.7-sonnet",
  "bucket": "long",
  "precision": 0.79,
  "recall": 0.68,
  "ΔS_avg": 0.41,
  "correct_answers": 40,
  "total_questions": 50,
  "tokens": { "in": 2850, "out": 920, "total": 3770 },
  "cost_per_1k_tokens_usd": 0.006,
  "spend_usd": 0.0226,
  "cost_per_correct": 0.00056,
  "variance_across_runs": 0.11,
  "notes": "within budget and stable"
}
```
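As a hedged illustration, a small consistency check over such records, verifying that the derived fields agree with the raw ones. Field names are taken from the schema above; the tolerance is an arbitrary choice:

```python
# Check a cost record for internal consistency. Field names follow the
# schema above; the 5% tolerance absorbs rounding and is an arbitrary choice.

def check_record(rec: dict, tol: float = 0.05) -> list:
    errors = []
    t = rec["tokens"]
    if t["in"] + t["out"] != t["total"]:
        errors.append("tokens.in + tokens.out != tokens.total")
    expected_spend = t["total"] / 1000 * rec["cost_per_1k_tokens_usd"]
    if abs(rec["spend_usd"] - expected_spend) > tol * expected_spend:
        errors.append("spend_usd inconsistent with token totals and unit cost")
    expected_cpc = rec["spend_usd"] / rec["correct_answers"]
    if abs(rec["cost_per_correct"] - expected_cpc) > tol * expected_cpc:
        errors.append("cost_per_correct inconsistent with spend and correct count")
    return errors  # empty list means the record is internally consistent
```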
## Diagnostic questions

- Are rerankers worth the extra spend? → check ΔS reduction vs token increase.
- Is hybrid retrieval doubling retrieval tokens with little accuracy gain?
- Does the large model add accuracy, or does a small model + WFGY match it at lower cost?
- Is citation length inflated by long snippets? → enforce the snippet contract.
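For the first question, a minimal sketch of the trade-off check. The threshold of ΔS reduction per extra 1k tokens is a hypothetical tuning knob, not a value this page prescribes:

```python
# Sketch: decide whether a reranker pays for its extra tokens.
# Lower ΔS is better; the 0.05 ΔS-per-1k-extra-tokens threshold is a
# hypothetical tuning knob, not a prescribed value.

def reranker_worth_it(ds_avg_without: float, ds_avg_with: float,
                      tokens_without: int, tokens_with: int,
                      min_ds_per_1k_extra: float = 0.05) -> bool:
    ds_reduction = ds_avg_without - ds_avg_with
    extra_tokens = max(tokens_with - tokens_without, 1)
    return ds_reduction / (extra_tokens / 1000) >= min_ds_per_1k_extra
```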
## Escalation and fixes

- High cost per correct → enable caching, or switch to a smaller model with the WFGY overlay.
- Variance > 15% → clamp paraphrases, normalize prompt headers.
- Budget overrun → auto-throttle evals, alert via alerting_and_probes.md.
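A minimal sketch of the budget-overrun throttle. `spent_so_far_usd` and `days_elapsed` are hypothetical inputs from whatever spend tracker you use; the 110% threshold comes from the acceptance targets above:

```python
# Sketch: throttle eval runs when projected monthly spend exceeds 110%
# of the budget cap (the acceptance-target threshold above). The inputs
# are hypothetical hooks into your own spend tracking.

def should_throttle(spent_so_far_usd: float, days_elapsed: float,
                    monthly_budget_cap_usd: float,
                    days_in_month: int = 30) -> bool:
    projected = spent_so_far_usd / max(days_elapsed, 1e-9) * days_in_month
    return projected > 1.10 * monthly_budget_cap_usd
```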
## Minimal run

1. Select 20 mixed-length questions.
2. Run the baseline and candidate arms.
3. Compute cost per correct for each arm.
4. Ship only if the candidate is ≤ 1.3× baseline and stable across seeds.
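Putting the steps together, a hedged end-to-end sketch of the ship gate. `run_arm` is a hypothetical stand-in for your eval harness, assumed to return records shaped like the JSON schema above:

```python
# End-to-end sketch of the minimal run's ship gate. `run_arm(name, questions,
# seed)` is a hypothetical stand-in for your eval harness; it is assumed to
# return records with "cost_per_correct" and "variance_across_runs" fields.

def ship_gate(run_arm, questions: list, seeds=(0, 1, 2)) -> bool:
    baseline = [run_arm("baseline", questions, s) for s in seeds]
    candidate = [run_arm("candidate", questions, s) for s in seeds]
    base_cpc = sum(r["cost_per_correct"] for r in baseline) / len(baseline)
    cand_cpc = sum(r["cost_per_correct"] for r in candidate) / len(candidate)
    stable = all(r["variance_across_runs"] <= 0.15 for r in candidate)
    return cand_cpc <= 1.3 * base_cpc and stable
```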
## 🔗 Quick-Start Downloads (60 sec)

| Tool | Link | 3-Step Setup |
|---|---|---|
| WFGY 1.0 PDF | Engine Paper | 1️⃣ Download · 2️⃣ Upload to your LLM · 3️⃣ Ask “Answer using WFGY + `<your question>`” |
| TXT OS (plain-text OS) | TXTOS.txt | 1️⃣ Download · 2️⃣ Paste into any LLM chat · 3️⃣ Type “hello world” — OS boots instantly |
## 🧭 Explore More
| Module | Description | Link |
|---|---|---|
| WFGY Core | WFGY 2.0 engine is live: full symbolic reasoning architecture and math stack | View → |
| Problem Map 1.0 | Initial 16-mode diagnostic and symbolic fix framework | View → |
| Problem Map 2.0 | RAG-focused failure tree, modular fixes, and pipelines | View → |
| Semantic Clinic Index | Expanded failure catalog: prompt injection, memory bugs, logic drift | View → |
| Semantic Blueprint | Layer-based symbolic reasoning & semantic modulations | View → |
| Benchmark vs GPT-5 | Stress test GPT-5 with full WFGY reasoning suite | View → |
| 🧙‍♂️ Starter Village 🏡 | New here? Lost in symbols? Click here and let the wizard guide you through | Start → |
👑 Early Stargazers: See the Hall of Fame — engineers, hackers, and open-source builders who supported WFGY from day one.
⭐ WFGY Engine 2.0 is already unlocked. ⭐ Star the repo to help others discover it and unlock more on the Unlock Board.