Eval: Latency vs Accuracy Trade-off

This page defines how to measure, report, and optimize the trade-off between model latency and retrieval/answer accuracy. It is not enough to chase precision; stable systems must also meet latency SLOs while holding ΔS and λ within guardrails.

Open these first

Core eval protocols: Eval Benchmarking
Precision/recall metrics: Eval RAG Precision/Recall
Observability instruments: deltaS_thresholds.md, lambda_observe.md
Drift and variance: variance_and_drift.md

Acceptance targets

Latency:
- Median ≤ 1.2× baseline
- P90 ≤ 1.5× baseline
Accuracy:
- Precision ≥ 0.80
- Recall ≥ 0.70
- ΔS(question, cited) ≤ 0.45 for ≥ 80 percent of runs
- λ convergent across paraphrases
Cost stability:
- Tokens or API cost per correct answer ≤ 1.3× baseline

If accuracy improves but latency exceeds these thresholds, classify the change as not production-ready. Ship only when both dimensions pass.
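
The targets above can be expressed as a single gate. The following is a minimal sketch, not a reference implementation. The function names, the nearest-rank P90, and the choice to compare each percentile against the same percentile of the baseline are assumptions; the page does not specify them. λ convergence is passed in as a boolean because its computation is defined elsewhere.

```python
from statistics import median


def p90(values):
    """Nearest-rank 90th percentile (an assumed convention)."""
    ordered = sorted(values)
    rank = (90 * len(ordered) + 99) // 100  # integer ceiling of 0.9 * n
    return ordered[rank - 1]


def check_acceptance(arm_latencies_ms, baseline_latencies_ms, precision, recall,
                     delta_s_values, lambda_convergent, cost_ratio):
    """Return (passed, per-check results) for the acceptance targets.

    cost_ratio is cost per correct answer divided by the baseline's.
    """
    # Share of runs whose ΔS(question, cited) is within the 0.45 guardrail.
    low_delta_s_share = sum(d <= 0.45 for d in delta_s_values) / len(delta_s_values)
    checks = {
        "latency_median": median(arm_latencies_ms) <= 1.2 * median(baseline_latencies_ms),
        "latency_p90": p90(arm_latencies_ms) <= 1.5 * p90(baseline_latencies_ms),
        "precision": precision >= 0.80,
        "recall": recall >= 0.70,
        "delta_s": low_delta_s_share >= 0.80,
        "lambda": bool(lambda_convergent),
        "cost": cost_ratio <= 1.3,
    }
    return all(checks.values()), checks
```

Returning the per-check dictionary alongside the overall verdict makes it easy to see which target failed when an arm is rejected.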

Measurement protocol

Dual track runs
- Run with and without extra retrieval steps (rerank, multi-hop, HyDE, etc).
- Record latency per stage (retrieve, rerank, reason).
Buckets
- Short queries: <50 tokens
- Medium queries: 50–200 tokens
- Long queries: >200 tokens
Latency vs accuracy must be reported per bucket.
Seeds and paraphrases
- Use 2 random seeds, 3 paraphrases each.
- Report the mean and variance of both latency and accuracy metrics.
Normalization
- Report cost per correct answer, not raw tokens.
- Normalize across providers for fair comparison.
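
As a rough illustration of the protocol above, the sketch below assigns each run to a length bucket and reports mean, variance, and cost per correct answer per bucket. The run fields (`query_tokens`, `latency_ms`, `correct`, `cost`) and function names are hypothetical, and population variance is an assumed choice so that a bucket with a single run does not raise an error. Cross-provider normalization is left out; `cost` is assumed to already be in a comparable unit.

```python
from collections import defaultdict
from statistics import mean, pvariance


def bucket_for(token_count):
    """Map a query length in tokens to the short/medium/long buckets."""
    if token_count < 50:
        return "short"
    if token_count <= 200:
        return "medium"
    return "long"


def summarize(runs):
    """Aggregate runs (one per seed and paraphrase) into per-bucket stats."""
    grouped = defaultdict(list)
    for run in runs:
        grouped[bucket_for(run["query_tokens"])].append(run)

    report = {}
    for bucket, items in grouped.items():
        latencies = [r["latency_ms"] for r in items]
        hits = [1.0 if r["correct"] else 0.0 for r in items]
        correct = int(sum(hits))
        total_cost = sum(r["cost"] for r in items)
        report[bucket] = {
            "runs": len(items),
            "latency_mean_ms": mean(latencies),
            "latency_var": pvariance(latencies),
            "accuracy_mean": mean(hits),
            "accuracy_var": pvariance(hits),
            # Undefined when nothing was answered correctly.
            "cost_per_correct": total_cost / correct if correct else None,
        }
    return report
```

With 2 seeds and 3 paraphrases, each question contributes 6 runs to its bucket.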

Reporting schema

Append to the JSONL logs from Eval Benchmarking:

{
  "suite": "v1_latency",
  "arm": "with_rerank",
  "provider": "openai",
  "model": "gpt-4o-mini-2025-07",
  "bucket": "medium",
  "precision": 0.82,
  "recall": 0.71,
  "ΔS_avg": 0.39,
  "λ_flip_rate": 0.02,
  "latency_ms": { "retrieve": 120, "rerank": 85, "reason": 910 },
  "latency_total_ms": 1115,
  "latency_vs_baseline": 1.35,
  "tokens": { "in": 1980, "out": 510 },
  "cost_per_correct": 1.25,
  "notes": "acceptable trade-off"
}
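
A record like the one above could be assembled and appended as follows. This is a sketch under assumptions: `build_record` and `append_record` are hypothetical helpers, and `latency_vs_baseline` is taken to be the arm's total latency divided by the baseline's total latency, which the page does not state explicitly.

```python
import json


def build_record(arm, bucket, stage_latency_ms, baseline_total_ms, metrics):
    """Build one log record; total and baseline ratio are derived, not hand-typed."""
    total = sum(stage_latency_ms.values())
    record = {"suite": "v1_latency", "arm": arm, "bucket": bucket}
    record.update(metrics)  # e.g. precision, recall, ΔS_avg, λ_flip_rate
    record["latency_ms"] = dict(stage_latency_ms)
    record["latency_total_ms"] = total
    # Assumed definition: arm total latency over baseline total latency.
    record["latency_vs_baseline"] = round(total / baseline_total_ms, 2)
    return record


def append_record(path, record):
    """Append a record as one JSON object per line (JSONL)."""
    with open(path, "a", encoding="utf-8") as handle:
        # ensure_ascii=False keeps keys such as "ΔS_avg" readable in the log.
        handle.write(json.dumps(record, ensure_ascii=False) + "\n")
```

Deriving `latency_total_ms` from the per-stage values keeps the two fields from drifting apart.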

Diagnostic questions

When latency grows faster than accuracy:

Is reranking adding value or just delay? → check ΔS histograms pre/post rerank.
Are paraphrases redundant? → drop to 2 if λ stability holds.
Is retrieval k too large? → compare 5, 10, 20.
Are you re-embedding too often? → reuse cached vectors.
Is model size the bottleneck? → test smaller model + WFGY vs large model baseline.
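
One possible way to make "latency grows faster than accuracy" concrete is to compare relative growth against the cheapest arm, for example across k = 5, 10, 20. The sketch below assumes that reading; the function name and the tuple layout are hypothetical, and any accuracy metric (precision, recall, or a combined score) can be supplied.

```python
def marginal_tradeoff(arms):
    """Compare each arm with the first (cheapest) arm.

    arms: list of (label, latency_ms, accuracy) tuples, cheapest first.
    Returns one row per later arm with relative growth in both dimensions.
    """
    _, base_latency, base_accuracy = arms[0]
    rows = []
    for label, latency, accuracy in arms[1:]:
        latency_growth = latency / base_latency - 1
        accuracy_growth = accuracy / base_accuracy - 1
        rows.append({
            "arm": label,
            "latency_growth": latency_growth,
            "accuracy_growth": accuracy_growth,
            # True means the extra step costs proportionally more than it buys.
            "latency_outpaces_accuracy": latency_growth > accuracy_growth,
        })
    return rows
```

Arms flagged with `latency_outpaces_accuracy` are the natural candidates for the diagnostic questions above.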

Escalation and fixes

Latency regressions without accuracy gain → cut rerank or hybrid steps. See Rerankers.
High ΔS despite more steps → rebuild index and re-chunk. See Embedding ≠ Semantic.
Unstable λ across seeds → clamp variance with BBAM. See variance_and_drift.md.

Minimal 60-second run

Pick 5 medium-length questions.
Run baseline and WFGY rerank arm.
Record latency_total_ms and accuracy metrics.
Accept only if ΔS ≤ 0.45 and latency inflation ≤ 1.5× baseline.

🔗 Quick-Start Downloads (60 sec)

Tool	Link	3-Step Setup
WFGY 1.0 PDF	Engine Paper	1️⃣ Download · 2️⃣ Upload to your LLM · 3️⃣ Ask “Answer using WFGY + <your question>”
TXT OS (plain-text OS)	TXTOS.txt	1️⃣ Download · 2️⃣ Paste into any LLM chat · 3️⃣ Type “hello world” — OS boots instantly

🧭 Explore More

Module	Description	Link
WFGY Core	WFGY 2.0 engine is live: full symbolic reasoning architecture and math stack	View →
Problem Map 1.0	Initial 16-mode diagnostic and symbolic fix framework	View →
Problem Map 2.0	RAG-focused failure tree, modular fixes, and pipelines	View →
Semantic Clinic Index	Expanded failure catalog: prompt injection, memory bugs, logic drift	View →
Semantic Blueprint	Layer-based symbolic reasoning & semantic modulations	View →
Benchmark vs GPT-5	Stress test GPT-5 with full WFGY reasoning suite	View →
🧙‍♂️ Starter Village 🏡	New here? Lost in symbols? Click here and let the wizard guide you through	Start →

