# Eval — Latency vs Accuracy (SLO Gating, stdlib-only)
## Goal
Decide whether a pipeline is allowed to ship under a latency budget while preserving grounded accuracy. This page defines metrics, experiment design, and a reference harness to collect P50/P95/P99 latency together with Precision/CHR.
## What you get
- Precise end-to-end vs per-stage latency definitions
- A sweep harness (stdlib-only) to explore retrieval/rerank/LLM knobs
- SLO gates and a Pareto-frontier selection rule to choose a config
## 1) Metrics (definitions)

**Latency scope**

- E2E latency: time from receiving a question to a fully validated answer (includes retrieval, rerank, LLM, auditor/guards, JSON parse, acceptance checks).
- Per-stage latency (optional): `t_retrieval`, `t_rerank`, `t_llm`, `t_guard`.
**Aggregates**

- P50, P90, P95, P99 (milliseconds)
- Tail amplification: P99 / P50 (smaller is better)
**Accuracy side (from the Precision/CHR page)**

- Precision (answered), CHR, Under-/Over-refusal
- Same data contract: `runs/trace.jsonl` + `eval/gold.jsonl`
**Default SLO gates (suggested)**
- P95 (E2E) ≤ 2000 ms (interactive UX)
- Precision (answered) ≥ 0.80
- CHR ≥ 0.75
- Under-refusal ≤ 0.05, Over-refusal ≤ 0.10
Tune per product, but pin thresholds in repo and enforce in CI.
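If you want the gates machine-checkable from day one, a minimal sketch is shown below. It assumes the summary JSON fields emitted by the harness in section 3 (`p95`, `precision`, `chr`, `under`, `over`); the `GATES` values and script name are placeholders you would pin in your own repo.

```python
import json, sys

# Suggested gates; pin these in the repo and treat any change as a reviewed diff.
GATES = {"p95_ms": 2000, "precision": 0.80, "chr": 0.75, "under": 0.05, "over": 0.10}

def check_gates(summary):
    """Return a list of human-readable gate violations (empty list = ship)."""
    fails = []
    if summary["p95"] > GATES["p95_ms"]:
        fails.append(f"P95 {summary['p95']} ms > {GATES['p95_ms']} ms")
    if summary["precision"] < GATES["precision"]:
        fails.append(f"precision {summary['precision']} < {GATES['precision']}")
    if summary["chr"] < GATES["chr"]:
        fails.append(f"CHR {summary['chr']} < {GATES['chr']}")
    if summary["under"] > GATES["under"]:
        fails.append(f"under-refusal {summary['under']} > {GATES['under']}")
    if summary["over"] > GATES["over"]:
        fails.append(f"over-refusal {summary['over']} > {GATES['over']}")
    return fails

if __name__ == "__main__":
    fails = check_gates(json.load(open(sys.argv[1])))
    print("\n".join(fails) or "all gates pass")
    sys.exit(1 if fails else 0)
```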
## 2) Experiment design
You will sweep low-cost knobs that trade latency for accuracy:
| Knob | Effect on latency | Effect on accuracy |
|---|---|---|
| `k_lex` (BM25 top-k) | ↑ retrieval time with k | ↑ recall (to a point) |
| `k_sem` (embed top-k) | ↑ | ↑ |
| Intersection vs Union | Intersection often ↓ rerank set | ↑ precision / ↓ tail noise |
| `rerank_depth` (N→M) | ↑ linearly with N | ↑ CHR up to a knee |
| `knee_cut` | ↓ (smaller context) | Often ↑ (less junk), but risk of recall loss |
| `max_tokens` (LLM output) | ↑ decode time | Weak effect on grounding |
| `temperature` | No change | High temp may hurt containment/CHR |
| Model choice | Varies | Varies; measure, don't guess |
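If you later want to expand the sweep beyond the hand-picked grid in the harness, one way (a sketch, not part of the harness below; the value ranges and pruning rule are assumptions to adapt) is to generate configs with `itertools.product` and drop combinations you already know are wasteful:

```python
import itertools

# Hypothetical value ranges; adjust to your corpus and latency budget.
K_VALUES      = [20, 40, 60]
RERANK_DEPTHS = [16, 32, 64]
MAX_TOKENS    = [192, 256]

def make_grid():
    grid = []
    for k, depth, mt, intersect, knee in itertools.product(
            K_VALUES, RERANK_DEPTHS, MAX_TOKENS, [True, False], [True, False]):
        if depth > 2 * k:
            continue  # reranking deeper than the lex+sem candidate pool is pointless
        grid.append({"k_lex": k, "k_sem": k, "intersect": intersect,
                     "rerank_depth": depth, "knee": knee, "max_tokens": mt})
    return grid
```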
**Loads**
Measure at 3 loads (open-loop; stdlib-only):
- 1 rps (single user feel)
- 5 rps (light team usage)
- 20 rps (stress upper bound)
## 3) Reference harness (stdlib-only)
Save as `ProblemMap/eval/latency_sweep.py`.
It calls either your local function `pipeline_qa(q, knobs)` (returning the answer JSON plus trace fields) or an HTTP endpoint (toggle with `--http`). It writes:

- `runs/trace.jsonl` (answers for accuracy)
- `runs/latency.csv` (per-run timings + knobs)
- A final summary JSON with P95/P99 and pass/fail
```python
#!/usr/bin/env python3
import time, json, csv, os, random, argparse, urllib.request

REFUSAL = "not in context"

# --- plug points --------------------------------------------------------------
def pipeline_qa_local(question, knobs):
    """
    Implement by importing your guarded baseline (Example 01/03).
    Must return:
      {
        "answer_json": {"claim": str, "citations": [str,...]},
        "retrieved_ids": [str,...],
        "stage_ms": {"retrieval":int,"rerank":int,"llm":int,"guard":int}
      }
    """
    # Minimal demo: wire this to your ask.py (local call or HTTP); here we stub.
    # Fake per-stage timings -- replace with real measurements around each call.
    retrieval_ms = random.randint(5, 15)
    rerank_ms = random.randint(3, 10)
    llm_ms = random.randint(220, 420)
    guard_ms = random.randint(1, 3)
    ans = {"claim": REFUSAL, "citations": []}  # replace with real guarded output
    return {"answer_json": ans, "retrieved_ids": [],
            "stage_ms": {"retrieval": retrieval_ms, "rerank": rerank_ms,
                         "llm": llm_ms, "guard": guard_ms}}

def pipeline_qa_http(url, question, knobs):
    body = json.dumps({"q": question, "knobs": knobs}).encode("utf-8")
    req = urllib.request.Request(url, data=body, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req, timeout=60) as r:
        # Expect the same contract as the local variant
        return json.loads(r.read().decode("utf-8"))

# --- helpers ------------------------------------------------------------------
def percentiles(samples, ps=(50, 90, 95, 99)):
    if not samples:
        return {p: 0 for p in ps}
    xs = sorted(samples)
    out = {}
    for p in ps:
        k = (p / 100) * (len(xs) - 1)
        f = int(k); c = min(f + 1, len(xs) - 1); d = k - f
        out[p] = xs[f] * (1 - d) + xs[c] * d  # linear interpolation between ranks
    return {p: int(out[p]) for p in ps}

def contains_substr(claim, subs):
    c = (claim or "").lower()
    if not subs:
        return True
    return any((s.lower() in c and len(s) >= 5) for s in subs)

def citation_hit(cits, gold, retrieved):
    if not isinstance(cits, list):
        return False
    if not set(cits).issubset(set(retrieved or [])):
        return False
    return bool(set(cits or []) & set(gold or [])) if gold else (cits == [])

# --- main sweep ---------------------------------------------------------------
def run_sweep(gold_path, questions, knobs_grid, http_url=None, rps=1, duration_s=20):
    gold = {g["qid"]: g for g in (json.loads(l) for l in open(gold_path, encoding="utf8"))}
    lat_ms = []
    answered = refused = tp = chr_hit = under = over = 0
    ans_seen = unans_seen = 0  # refusal-rate denominators, counted per run (questions repeat)
    os.makedirs("runs", exist_ok=True)
    start = time.perf_counter()
    trace_f = open("runs/trace.jsonl", "a", encoding="utf8")
    lat_f = open("runs/latency.csv", "a", newline=""); lat_csv = csv.writer(lat_f)
    lat_csv.writerow(["ts", "qid", "e2e_ms", "retrieval_ms", "rerank_ms", "llm_ms", "guard_ms", "knobs"])
    i = 0
    while time.perf_counter() - start < duration_s:
        qid = questions[i % len(questions)]
        g = gold[qid]; q = g["question"]; knobs = knobs_grid[i % len(knobs_grid)]
        t0 = time.perf_counter()
        out = pipeline_qa_http(http_url, q, knobs) if http_url else pipeline_qa_local(q, knobs)
        e2e_ms = int((time.perf_counter() - t0) * 1000)
        lat_ms.append(e2e_ms)
        st = out.get("stage_ms", {})
        lat_csv.writerow([int(time.time()), qid, e2e_ms, st.get("retrieval", 0), st.get("rerank", 0),
                          st.get("llm", 0), st.get("guard", 0), json.dumps(knobs)])
        # accuracy tallies
        aj = out.get("answer_json", {}); claim = aj.get("claim", ""); cits = aj.get("citations", [])
        ret = out.get("retrieved_ids", [])
        is_ans = (claim.strip().lower() != REFUSAL)
        A = bool(g.get("answerable"))
        if A: ans_seen += 1
        else: unans_seen += 1
        if is_ans:
            answered += 1
            C = contains_substr(claim, g.get("gold_claim_substr"))
            H = citation_hit(cits, g.get("gold_citations"), ret)
            if not A:
                under += 1  # answered a question that should have been refused
            else:
                if H: chr_hit += 1
                if C and H: tp += 1
        else:
            refused += 1
            if A: over += 1  # refused a question that was answerable
        trace_f.write(json.dumps({"qid": qid, "q": q, "retrieved_ids": ret, "answer_json": aj}) + "\n")
        # open-loop pacing: sleep for whatever remains of the 1/rps interval
        time.sleep(max(0.0, 1.0 / rps - (time.perf_counter() - t0)))
        i += 1
    trace_f.close(); lat_f.close()
    # aggregates
    P = percentiles(lat_ms); S = max(answered, 1)
    precision = tp / S; chr_rate = chr_hit / S
    under_rate = under / max(unans_seen, 1)
    over_rate = over / max(ans_seen, 1)
    return {"p50": P[50], "p95": P[95], "p99": P[99], "answered": answered, "refused": refused,
            "precision": round(precision, 4), "chr": round(chr_rate, 4),
            "under": round(under_rate, 4), "over": round(over_rate, 4),
            "samples": len(lat_ms)}

if __name__ == "__main__":
    ap = argparse.ArgumentParser()
    ap.add_argument("--gold", required=True)
    ap.add_argument("--http", default=None, help="http://localhost:8080/qa if using HTTP")
    ap.add_argument("--rps", type=float, default=1.0)
    ap.add_argument("--duration", type=int, default=20)
    args = ap.parse_args()
    # small grid (expand in CI)
    knobs_grid = [
        {"k_lex": 40, "k_sem": 40, "intersect": True, "rerank_depth": 32, "knee": True, "max_tokens": 256},
        {"k_lex": 20, "k_sem": 20, "intersect": True, "rerank_depth": 16, "knee": True, "max_tokens": 192},
        {"k_lex": 60, "k_sem": 60, "intersect": False, "rerank_depth": 64, "knee": False, "max_tokens": 256},
    ]
    # choose 20-50 mixed answerable/unanswerable qids from gold
    qids = [json.loads(l)["qid"] for l in open(args.gold, encoding="utf8")]
    res = run_sweep(args.gold, qids[:30], knobs_grid, http_url=args.http, rps=args.rps, duration_s=args.duration)
    print(json.dumps(res, indent=2))
```
**How to use**

```bash
# Single-user feel
python ProblemMap/eval/latency_sweep.py --gold ProblemMap/eval/gold.jsonl --rps 1 --duration 30

# Team load
python ProblemMap/eval/latency_sweep.py --gold ProblemMap/eval/gold.jsonl --rps 5 --duration 60

# Stress
python ProblemMap/eval/latency_sweep.py --gold ProblemMap/eval/gold.jsonl --rps 20 --duration 60
```
This writes `runs/latency.csv`. Use any plotting tool later; gating does not require plots.
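Because `runs/latency.csv` records the knobs alongside every request, you can split latency out per config before gating. A small stdlib-only sketch (it reuses the same interpolated-percentile math as the harness; the output format is an assumption, not a fixed contract):

```python
import csv, json
from collections import defaultdict

def p95(xs):
    """Linear-interpolation percentile, fixed at p=95, matching the harness."""
    xs = sorted(xs)
    k = 0.95 * (len(xs) - 1)
    f = int(k); c = min(f + 1, len(xs) - 1)
    return int(xs[f] + (xs[c] - xs[f]) * (k - f))

by_knobs = defaultdict(list)
with open("runs/latency.csv", encoding="utf8") as fh:
    for row in csv.DictReader(fh):
        val = row.get("e2e_ms") or ""
        if not val.isdigit():
            continue  # skip repeated header rows (the harness appends across runs)
        by_knobs[row["knobs"]].append(int(val))

for knobs, samples in by_knobs.items():
    print(json.dumps({"knobs": json.loads(knobs), "n": len(samples), "p95_ms": p95(samples)}))
```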
## 4) SLO gating & Pareto selection
**Ship rule (AND):**
- P95 ≤ budget (e.g., 2000 ms)
- Precision ≥ threshold (e.g., 0.80)
- CHR ≥ threshold (e.g., 0.75)
- Under/Over-refusal within limits
**Pareto frontier.** Given multiple knob configs, keep only those for which no other config is both faster (lower P95) and more accurate (higher Precision); a small filter is sketched after this list. Then choose:
- Interactive app: the fastest config on the frontier that still meets accuracy gates.
- Back-office batch: the most accurate config that meets a relaxed latency gate.
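A sketch of the frontier filter, assuming each candidate is a dict carrying the per-config `p95` and `precision` (and optionally `chr`) you computed in the previous step:

```python
def pareto_frontier(configs):
    """Keep configs not dominated by another config that is both faster and more accurate."""
    frontier = []
    for a in configs:
        dominated = any(
            b["p95"] <= a["p95"] and b["precision"] >= a["precision"]
            and (b["p95"] < a["p95"] or b["precision"] > a["precision"])
            for b in configs if b is not a
        )
        if not dominated:
            frontier.append(a)
    return frontier

# Interactive app: fastest frontier config that still meets the accuracy gates, e.g.
# pick = min((c for c in pareto_frontier(cands)
#             if c["precision"] >= 0.80 and c.get("chr", 1.0) >= 0.75),
#            key=lambda c: c["p95"])
```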
**Rollback guard.** Fail the PR if P95 increases by >15% or Precision drops by >2% vs the last release.
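A sketch of the rollback check, assuming you keep the previous release's summary JSON in the repo (the path `eval/baseline.json` is hypothetical) and reading the "2%" drop as two absolute points:

```python
import json, sys

def rollback_check(current_path, baseline_path):
    cur, base = json.load(open(current_path)), json.load(open(baseline_path))
    fails = []
    if cur["p95"] > base["p95"] * 1.15:
        fails.append(f"P95 regression: {base['p95']} ms -> {cur['p95']} ms (> +15%)")
    if cur["precision"] < base["precision"] - 0.02:
        fails.append(f"Precision drop: {base['precision']} -> {cur['precision']} (> 2%)")
    return fails

if __name__ == "__main__":
    fails = rollback_check(sys.argv[1], sys.argv[2])
    print("\n".join(fails) or "no regression")
    sys.exit(1 if fails else 0)
```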
## 5) Troubleshooting map
- P95 blown but P50 ok → tail comes from the LLM. Trim `max_tokens`, enable intersection + knee, reduce `rerank_depth`.
- Precision low, CHR low → grounding broken. Apply the RAG Semantic Drift pattern.
- Precision fine, CHR low → claim substrings not matched; fix the claim schema or the gold substrings.
- Throughput collapse at 20 rps → remove cross-service `/readyz` waits; pre-warm the model and the index (see Bootstrap Deadlock).
- Variance across runs → check Vector Store Fragmentation and lock normalization.
## 6) CI wiring (copy/paste)
Example (bash):

```bash
# 1) Run sweep at 1 rps (smoke)
python ProblemMap/eval/latency_sweep.py --gold ProblemMap/eval/gold.jsonl --rps 1 --duration 30 | tee eval/lat_1rps.json

# 2) Run sweep at 5 rps (light load)
python ProblemMap/eval/latency_sweep.py --gold ProblemMap/eval/gold.jsonl --rps 5 --duration 60 | tee eval/lat_5rps.json

# 3) Score accuracy using the RAG scorer
python ProblemMap/eval/score_eval.py --gold ProblemMap/eval/gold.jsonl --trace runs/trace.jsonl --k 5 > eval/acc.json

# 4) Gate: jq asserts
jq -e '.p95 <= 2000' eval/lat_1rps.json
jq -e '.p95 <= 2500' eval/lat_5rps.json
jq -e '.precision >= 0.80 and .chr >= 0.75 and .under <= 0.05 and .over <= 0.10' eval/acc.json
```
## 7) Notes & caveats
- Use open-loop pacing (`sleep`) to avoid feedback artifacts from server backpressure.
- Warm up separately; capture steady-state latency (see the sketch below).
- Fix random seeds for prompts (if you jitter prompts, do it in the Semantic Stability eval).
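For the warmup caveat, a minimal sketch of separating warmup from measurement before calling `run_sweep`; the function name `warm_up` and the request count are assumptions, and `pipeline_qa_local` is the harness's own plug point:

```python
import time

def warm_up(questions, knobs, n=10):
    """Issue a few unmeasured requests so caches, connection pools, and index mmaps are hot."""
    for q in questions[:n]:
        pipeline_qa_local(q, knobs)  # results and timings are intentionally discarded
    time.sleep(1.0)                  # let any background work settle before measuring
```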