Eval: Latency vs Accuracy Trade-off
This page defines how to measure, report, and optimize the trade-off between model latency and retrieval/answer accuracy. It is not enough to chase precision; stable systems must also meet latency SLOs while holding ΔS and λ within guardrails.
Open these first
- Core eval protocols: Eval Benchmarking
- Precision/recall metrics: Eval RAG Precision/Recall
- Observability instruments: deltaS_thresholds.md, lambda_observe.md
- Drift and variance: variance_and_drift.md
Acceptance targets
- Latency:
  - Median ≤ 1.2× baseline
  - P90 ≤ 1.5× baseline
- Accuracy:
  - Precision ≥ 0.80
  - Recall ≥ 0.70
  - ΔS(question, cited) ≤ 0.45 for ≥ 80 percent of runs
  - λ convergent across paraphrases
- Cost stability:
  - Tokens or API cost per correct answer ≤ 1.3× baseline
If accuracy improves but latency inflates beyond thresholds, classify as not production-ready. Only ship when both dimensions pass.
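The ship/no-ship rule above can be sketched as a single gate function. This is a minimal illustration, not the project's own tooling: the `ArmResult` container, its field names, and the `passes_gate` helper are hypothetical, and the thresholds are simply the targets listed in this section.

```python
from dataclasses import dataclass
from statistics import median, quantiles


@dataclass
class ArmResult:
    """Hypothetical container for one evaluation arm (names are assumptions)."""
    latencies_ms: list[float]   # total latency per query
    precision: float
    recall: float
    delta_s: list[float]        # ΔS(question, cited) per run
    lambda_convergent: bool     # λ convergent across paraphrases
    cost_per_correct: float     # tokens or API cost per correct answer


def p90(values: list[float]) -> float:
    # quantiles(n=10) returns nine cut points; the last one is the 90th percentile.
    return quantiles(values, n=10, method="inclusive")[-1]


def passes_gate(arm: ArmResult, baseline: ArmResult) -> bool:
    """Return True only when latency, accuracy, and cost targets all pass."""
    latency_ok = (
        median(arm.latencies_ms) <= 1.2 * median(baseline.latencies_ms)
        and p90(arm.latencies_ms) <= 1.5 * p90(baseline.latencies_ms)
    )
    # Share of runs whose ΔS stays at or below 0.45 (target: at least 80 percent).
    low_delta_share = sum(d <= 0.45 for d in arm.delta_s) / len(arm.delta_s)
    accuracy_ok = (
        arm.precision >= 0.80
        and arm.recall >= 0.70
        and low_delta_share >= 0.80
        and arm.lambda_convergent
    )
    cost_ok = arm.cost_per_correct <= 1.3 * baseline.cost_per_correct
    return latency_ok and accuracy_ok and cost_ok
```

An arm that raises precision but pushes median latency past 1.2× baseline returns `False` here, which matches the "not production-ready" classification.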
Measurement protocol
- Dual track runs
  - Run with and without extra retrieval steps (rerank, multi-hop, HyDE, etc.).
  - Record latency per stage (retrieve, rerank, reason).
- Buckets
  - Short queries: <50 tokens
  - Medium queries: 50–200 tokens
  - Long queries: >200 tokens
  - Latency vs accuracy must be reported per bucket.
- Seeds and paraphrases
  - Use 2 random seeds, 3 paraphrases each.
  - Average and variance required for both latency and accuracy metrics.
- Normalization
  - Report cost per correct answer, not raw tokens.
  - Normalize across providers for fair comparison.
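As a rough sketch of the bucketing and normalization steps, the snippet below groups runs by query length and reports mean, variance, and cost per correct answer for each bucket. The run-record keys (`query_tokens`, `latency_ms`, `correct`, `cost`) are assumptions made for illustration; they are not defined by the protocol.

```python
from collections import defaultdict
from statistics import mean, variance


def bucket_for(n_tokens: int) -> str:
    """Map a query length in tokens to the buckets defined above."""
    if n_tokens < 50:
        return "short"
    if n_tokens <= 200:
        return "medium"
    return "long"


def summarize(runs: list[dict]) -> dict:
    """Aggregate per-run records (hypothetical keys) into a per-bucket report."""
    grouped = defaultdict(list)
    for run in runs:
        grouped[bucket_for(run["query_tokens"])].append(run)

    report = {}
    for bucket, rows in grouped.items():
        latencies = [r["latency_ms"] for r in rows]
        correct = [1.0 if r["correct"] else 0.0 for r in rows]
        n_correct = sum(correct)
        total_cost = sum(r["cost"] for r in rows)
        report[bucket] = {
            "n": len(rows),
            "latency_mean": mean(latencies),
            # Sample variance needs at least two points.
            "latency_var": variance(latencies) if len(rows) > 1 else 0.0,
            "accuracy_mean": mean(correct),
            "accuracy_var": variance(correct) if len(rows) > 1 else 0.0,
            # Cost per correct answer, not raw tokens; undefined if nothing was correct.
            "cost_per_correct": total_cost / n_correct if n_correct else float("inf"),
        }
    return report
```

With 2 seeds and 3 paraphrases, each question contributes six rows per arm, so the variance figures are coarse and best read as a stability signal rather than a precise estimate.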
Reporting schema
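Append one record per arm and bucket to the JSONL logs from Eval Benchmarking. The record is pretty-printed here for readability; in the log, write each record on a single line: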
{
  "suite": "v1_latency",
  "arm": "with_rerank",
  "provider": "openai",
  "model": "gpt-4o-mini-2025-07",
  "bucket": "medium",
  "precision": 0.82,
  "recall": 0.71,
  "ΔS_avg": 0.39,
  "λ_flip_rate": 0.02,
  "latency_ms": { "retrieve": 120, "rerank": 85, "reason": 910 },
  "latency_total_ms": 1115,
  "latency_vs_baseline": 1.35,
  "tokens": { "in": 1980, "out": 510 },
  "cost_per_correct": 1.25,
  "notes": "acceptable trade-off"
}
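One way to produce such a record is sketched below. The helper names (`build_record`, `to_jsonl_line`, `append_jsonl`) and the `baseline_total_ms` argument are hypothetical; the sketch only shows that `latency_total_ms` is the sum of the per-stage latencies and that `latency_vs_baseline` is that total divided by the baseline total.

```python
import json


def build_record(arm: str, bucket: str, stage_ms: dict, baseline_total_ms: float, **metrics) -> dict:
    """Assemble one log record; extra metrics (precision, recall, ...) pass through."""
    total = sum(stage_ms.values())
    return {
        "suite": "v1_latency",
        "arm": arm,
        "bucket": bucket,
        "latency_ms": stage_ms,
        "latency_total_ms": total,
        "latency_vs_baseline": round(total / baseline_total_ms, 2),
        **metrics,
    }


def to_jsonl_line(record: dict) -> str:
    # One compact JSON object per line; keep ΔS and λ keys as readable UTF-8.
    return json.dumps(record, ensure_ascii=False) + "\n"


def append_jsonl(path: str, record: dict) -> None:
    """Append a single record to an existing JSONL log."""
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(to_jsonl_line(record))
```

For the stage latencies in the sample record (120, 85, 910), the total is 1115 ms. A baseline total of 826 ms, a made-up figure used only for this illustration, would give the ratio 1.35 shown there.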
Diagnostic questions
When latency grows faster than accuracy:
- Is reranking adding value or just delay? → check ΔS histograms pre/post rerank.
- Are paraphrases redundant? → drop to 2 if λ stability holds.
- Is retrieval k too large? → compare 5, 10, 20.
- Are you re-embedding too often? → reuse cached vectors.
- Is model size the bottleneck? → test smaller model + WFGY vs large model baseline.
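To make "latency grows faster than accuracy" concrete, one option is to compute the accuracy gained per 100 ms of added latency between successive configurations, for example the k = 5, 10, 20 sweep. The function below is an illustrative sketch; `marginal_gains` is a hypothetical helper and the figures in the usage note are invented.

```python
def marginal_gains(points: list[tuple[str, float, float]]) -> list[dict]:
    """Compare consecutive configurations.

    points: (label, latency_ms, accuracy) tuples, ordered from cheapest to
    most expensive configuration.
    """
    steps = []
    for (label_a, ms_a, acc_a), (label_b, ms_b, acc_b) in zip(points, points[1:]):
        added_ms = ms_b - ms_a
        gain = acc_b - acc_a
        steps.append({
            "from": label_a,
            "to": label_b,
            "added_ms": added_ms,
            "accuracy_gain": gain,
            # Accuracy bought per 100 ms of extra latency; None if no latency was added.
            "gain_per_100ms": gain / added_ms * 100 if added_ms > 0 else None,
        })
    return steps
```

Given invented measurements such as `[("k=5", 400.0, 0.70), ("k=10", 500.0, 0.78), ("k=20", 800.0, 0.79)]`, the second step buys far less accuracy per 100 ms than the first, which would suggest stopping at k = 10.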
Escalation and fixes
- Latency regressions without accuracy gain → cut rerank or hybrid steps. See Rerankers.
- High ΔS despite more steps → rebuild index and re-chunk. See Embedding ≠ Semantic.
- Unstable λ across seeds → clamp variance with BBAM, see variance_and_drift.md.
Minimal 60-second run
- Pick 5 medium-length questions.
- Run baseline and WFGY rerank arm.
- Record latency_total_ms and accuracy metrics.
- Accept only if ΔS ≤ 0.45 and latency inflation ≤ 1.5× baseline.
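The quick run can be scripted roughly as follows. This is a sketch under stated assumptions: `pipeline` stands for any callable you supply that answers a question and returns a dict with a `delta_s` value, and "ΔS ≤ 0.45" is read here as holding for every question in the sample.

```python
import time
from statistics import mean


def timed_run(pipeline, questions: list[str]) -> dict:
    """Run a pipeline over the questions, recording wall-clock latency and ΔS."""
    latencies_ms, delta_s = [], []
    for question in questions:
        start = time.perf_counter()
        result = pipeline(question)  # hypothetical: returns {"delta_s": float, ...}
        latencies_ms.append((time.perf_counter() - start) * 1000.0)
        delta_s.append(result["delta_s"])
    return {"mean_latency_ms": mean(latencies_ms), "delta_s": delta_s}


def quick_accept(baseline: dict, arm: dict) -> bool:
    """Accept only if every ΔS is within 0.45 and latency inflation is at most 1.5×."""
    inflation = arm["mean_latency_ms"] / baseline["mean_latency_ms"]
    return all(d <= 0.45 for d in arm["delta_s"]) and inflation <= 1.5
```

Call `timed_run` once with the baseline pipeline and once with the rerank arm over the same five questions, then pass both summaries to `quick_accept`.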
🔗 Quick-Start Downloads (60 sec)
| Tool | Link | 3-Step Setup |
|---|---|---|
| WFGY 1.0 PDF | Engine Paper | 1️⃣ Download · 2️⃣ Upload to your LLM · 3️⃣ Ask “Answer using WFGY + <your question>” |
| TXT OS (plain-text OS) | TXTOS.txt | 1️⃣ Download · 2️⃣ Paste into any LLM chat · 3️⃣ Type “hello world” — OS boots instantly |