# Eval — Latency vs Accuracy (SLO Gating, stdlib-only)
**Goal**
Decide whether a pipeline is allowed to ship under a **latency budget** while preserving **grounded accuracy**. This page defines metrics, experiment design, and a reference harness to collect P50/P95/P99 latency together with Precision/CHR.
**What you get**
- Precise **end-to-end** vs **per-stage** latency definitions
- A **sweep harness** (stdlib-only) to explore retrieval/rerank/LLM knobs
- **SLO gates** and a Pareto-frontier selection rule to choose a config
---
## 1) Metrics (definitions)
**Latency scope**
- **E2E latency**: time from receiving a question to a fully validated answer (**includes** retrieval, rerank, LLM, auditor/guards, JSON parse, acceptance checks).
- **Per-stage latency** (optional): `t_retrieval`, `t_rerank`, `t_llm`, `t_guard`.
**Aggregates**
- P50, P90, **P95**, **P99** (milliseconds)
- **Tail amplification**: P99 / P50 (smaller is better)
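
As a quick sanity check, tail amplification can be read straight off the sweep summary; a minimal sketch, assuming the summary JSON written by the CI step in section 6 (e.g. `eval/lat_1rps.json`):
```python
import json

# Summary JSON produced by latency_sweep.py (path from the CI wiring in section 6).
with open("eval/lat_1rps.json", encoding="utf8") as f:
    summary = json.load(f)

tail_amp = summary["p99"] / max(summary["p50"], 1)   # P99 / P50, smaller is better
print(f"P50={summary['p50']} ms  P99={summary['p99']} ms  tail amplification={tail_amp:.1f}x")
```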
**Accuracy side (from Precision/CHR page)**
- **Precision (answered)**, **CHR**, **Under-/Over-refusal**
- Same data contract: `runs/trace.jsonl` + `eval/gold.jsonl`
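
For reference, a minimal sketch of one record per file. The field names match the harness in section 3; the IDs, question, and answer text are purely illustrative. The first line is a `eval/gold.jsonl` entry, the second a matching `runs/trace.jsonl` entry:
```jsonl
{"qid": "q001", "question": "What latency budget applies to interactive answers?", "answerable": true, "gold_claim_substr": ["2000 ms"], "gold_citations": ["doc_slo_001"]}
{"qid": "q001", "q": "What latency budget applies to interactive answers?", "retrieved_ids": ["doc_slo_001", "doc_intro_004"], "answer_json": {"claim": "Interactive answers must return within 2000 ms at P95.", "citations": ["doc_slo_001"]}}
```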
**Default SLO gates (suggested)**
- P95 (E2E) **≤ 2000 ms** (interactive UX)
- Precision (answered) **≥ 0.80**
- CHR **≥ 0.75**
- Under-refusal **≤ 0.05**, Over-refusal **≤ 0.10**
> Tune per product, but pin thresholds in repo and enforce in CI.
---
## 2) Experiment design
You will **sweep** low-cost knobs that trade latency for accuracy (a grid-expansion sketch follows the table):
| Knob | Effect on latency | Effect on accuracy |
|---|---|---|
| `k_lex` (BM25 top-k) | ↑ retrieval time with k | ↑ recall (to a point) |
| `k_sem` (embed top-k) | ↑ | ↑ |
| **Intersection** vs **Union** | Intersection often ↓ rerank set | ↑ precision / ↓ tail noise |
| `rerank_depth` (N→M) | ↑ linearly with N | ↑ CHR up to knee |
| `knee_cut` | ↓ (smaller context) | Often ↑ (less junk), but risk recall loss |
| `max_tokens` (LLM output) | ↑ decode time | weak effect on grounding |
| `temperature` | no change | high temp may hurt containment/CHR |
| **Model choice** | varies | varies; measure not guess |
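
If you want to sweep more than the three hand-picked configs in the harness below, here is a minimal, stdlib-only sketch for expanding a full grid. The knob names match the table; the candidate values are illustrative:
```python
from itertools import product

# Candidate values per knob (illustrative; trim aggressively for CI runs).
grid = {
    "k_lex":        [20, 40, 60],
    "k_sem":        [20, 40, 60],
    "intersect":    [True, False],
    "rerank_depth": [16, 32, 64],
    "knee":         [True, False],
    "max_tokens":   [192, 256],
}

# Cartesian product -> list of knob dicts usable as knobs_grid in latency_sweep.py
knobs_grid = [dict(zip(grid, values)) for values in product(*grid.values())]
print(len(knobs_grid), "configs")   # 3*3*2*3*2*2 = 216
```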
**Loads**
Measure at 3 loads (open-loop; stdlib-only):
- **1 rps** (single user feel)
- **5 rps** (light team usage)
- **20 rps** (stress upper bound)
---
## 3) Reference harness (stdlib-only)
Save as `ProblemMap/eval/latency_sweep.py`.
It calls either your **local function** `pipeline_qa_local(question, knobs)` (returning the answer JSON plus trace fields) **or** an HTTP endpoint (toggle with `--http`). It writes:
- `runs/trace.jsonl` (answers for accuracy)
- `runs/latency.csv` (per run timings + knobs)
- A final **summary JSON** with P95/P99 and pass/fail
```python
#!/usr/bin/env python3
import time, json, csv, random, argparse, os, urllib.request
REFUSAL = "not in context"
# --- plug points --------------------------------------------------------------
def pipeline_qa_local(question, knobs):
    """
    Implement by importing your guarded baseline (Example 01/03).
    Must return:
      {
        "answer_json": {"claim": str, "citations": [str,...]},
        "retrieved_ids": [str,...],
        "stage_ms": {"retrieval": int, "rerank": int, "llm": int, "guard": int}
      }
    """
    # Minimal demo: call your ask.py via a local HTTP call or function; here we stub.
    # Fake per-stage timings (replace with real calls); sleep so measured E2E matches the stages.
    retrieval_ms = random.randint(5, 15)
    rerank_ms = random.randint(3, 10)
    llm_ms = random.randint(220, 420)
    guard_ms = random.randint(1, 3)
    time.sleep((retrieval_ms + rerank_ms + llm_ms + guard_ms) / 1000.0)
    ans = {"claim": REFUSAL, "citations": []}  # replace with real guarded output
    return {"answer_json": ans, "retrieved_ids": [],
            "stage_ms": {"retrieval": retrieval_ms, "rerank": rerank_ms, "llm": llm_ms, "guard": guard_ms}}
def pipeline_qa_http(url, question, knobs):
    body = json.dumps({"q": question, "knobs": knobs}).encode("utf-8")
    req = urllib.request.Request(url, data=body, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req, timeout=60) as r:
        j = json.loads(r.read().decode("utf-8"))
    # Expect the same contract as the local variant
    return j
# --- helpers ------------------------------------------------------------------
def percentiles(samples, ps=(50, 90, 95, 99)):
    if not samples: return {p: 0 for p in ps}
    xs = sorted(samples)
    out = {}
    for p in ps:
        k = (p / 100) * (len(xs) - 1)                 # linear interpolation between ranks
        f = int(k); c = min(f + 1, len(xs) - 1); d = k - f
        out[p] = xs[f] * (1 - d) + xs[c] * d
    return {p: int(out[p]) for p in ps}
def contains_substr(claim, subs):
    c = (claim or "").lower()
    if not subs: return True
    return any((s.lower() in c and len(s) >= 5) for s in subs)
def citation_hit(cits, gold, retrieved):
    if not isinstance(cits, list): return False
    if not set(cits).issubset(set(retrieved or [])): return False   # citations must come from the retrieved set
    return bool(set(cits or []) & set(gold or [])) if gold else (cits == [])
# --- main sweep ---------------------------------------------------------------
def run_sweep(gold_path, questions, knobs_grid, http_url=None, rps=1, duration_s=20):
    gold = {g["qid"]: g for g in (json.loads(l) for l in open(gold_path, encoding="utf8"))}
    lat_ms = []; answered = refused = tp = chr_hit = under = over = 0
    start = time.perf_counter()
    os.makedirs("runs", exist_ok=True)
    trace_f = open("runs/trace.jsonl", "a", encoding="utf8")
    lat_f = open("runs/latency.csv", "a", newline=""); lat_csv = csv.writer(lat_f)
    lat_csv.writerow(["ts","qid","e2e_ms","retrieval_ms","rerank_ms","llm_ms","guard_ms","knobs"])
    i = 0
    while time.perf_counter() - start < duration_s:
        qid = questions[i % len(questions)]
        g = gold[qid]; q = g["question"]; knobs = knobs_grid[i % len(knobs_grid)]
        t0 = time.perf_counter()
        if http_url:
            out = pipeline_qa_http(http_url, q, knobs)
        else:
            out = pipeline_qa_local(q, knobs)
        e2e_ms = int((time.perf_counter() - t0) * 1000)
        lat_ms.append(e2e_ms)
        st = out.get("stage_ms", {})
        lat_csv.writerow([int(time.time()), qid, e2e_ms, st.get("retrieval",0), st.get("rerank",0), st.get("llm",0), st.get("guard",0), json.dumps(knobs)])
        # accuracy tallies
        aj = out.get("answer_json", {}); claim = aj.get("claim",""); cits = aj.get("citations",[]); ret = out.get("retrieved_ids",[])
        is_ans = (claim.strip().lower() != REFUSAL)
        A = bool(g.get("answerable"))
        if is_ans:
            answered += 1
            C = contains_substr(claim, g.get("gold_claim_substr"))
            H = citation_hit(cits, g.get("gold_citations"), ret)
            if not A: under += 1
            else:
                if H: chr_hit += 1
                if C and H: tp += 1
        else:
            refused += 1
            if A: over += 1
        trace_f.write(json.dumps({"qid": qid, "q": q, "retrieved_ids": ret, "answer_json": aj}) + "\n")
        # open-loop pacing toward the target rps (single-threaded, so achieved rps is capped by pipeline time)
        time.sleep(max(0.0, 1.0/rps - (time.perf_counter() - t0)))
        i += 1
    trace_f.close(); lat_f.close()
    # aggregates
    P = percentiles(lat_ms); S = max(answered, 1)
    precision = tp / S; chr_rate = chr_hit / S
    under_rate = under / max(sum(1 for x in gold.values() if not x["answerable"]), 1)
    over_rate = over / max(sum(1 for x in gold.values() if x["answerable"]), 1)
    return {"p50": P[50], "p95": P[95], "p99": P[99], "answered": answered, "refused": refused,
            "precision": round(precision, 4), "chr": round(chr_rate, 4),
            "under": round(under_rate, 4), "over": round(over_rate, 4),
            "samples": len(lat_ms)}
if __name__ == "__main__":
    ap = argparse.ArgumentParser()
    ap.add_argument("--gold", required=True)
    ap.add_argument("--http", default=None, help="http://localhost:8080/qa if using HTTP")
    ap.add_argument("--rps", type=float, default=1.0)
    ap.add_argument("--duration", type=int, default=20)
    args = ap.parse_args()
    # small grid (expand in CI)
    knobs_grid = [
        {"k_lex": 40, "k_sem": 40, "intersect": True,  "rerank_depth": 32, "knee": True,  "max_tokens": 256},
        {"k_lex": 20, "k_sem": 20, "intersect": True,  "rerank_depth": 16, "knee": True,  "max_tokens": 192},
        {"k_lex": 60, "k_sem": 60, "intersect": False, "rerank_depth": 64, "knee": False, "max_tokens": 256},
    ]
    # choose 20 to 50 mixed A/U qids from gold
    qids = [json.loads(l)["qid"] for l in open(args.gold, encoding="utf8")]
    res = run_sweep(args.gold, qids[:30], knobs_grid, http_url=args.http, rps=args.rps, duration_s=args.duration)
    print(json.dumps(res, indent=2))
```
**How to use**
```bash
# Single-user feel
python ProblemMap/eval/latency_sweep.py --gold ProblemMap/eval/gold.jsonl --rps 1 --duration 30
# Team load
python ProblemMap/eval/latency_sweep.py --gold ProblemMap/eval/gold.jsonl --rps 5 --duration 60
# Stress
python ProblemMap/eval/latency_sweep.py --gold ProblemMap/eval/gold.jsonl --rps 20 --duration 60
```
This writes `runs/latency.csv`. Use any plotting tool later; gating does **not** require plots.
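If you do want a quick per-config view without plots, a minimal stdlib-only sketch for summarizing `runs/latency.csv` by knob config (the column names match the header the harness writes; adjust paths to your layout):
```python
import csv, collections, statistics

by_cfg = collections.defaultdict(list)
with open("runs/latency.csv", encoding="utf8") as f:
    for row in csv.DictReader(f):
        if not (row.get("e2e_ms") or "").isdigit():
            continue                        # skip repeated header rows from appended runs
        by_cfg[row["knobs"]].append(int(row["e2e_ms"]))

for knobs, xs in sorted(by_cfg.items()):
    xs.sort()
    p50 = int(statistics.median(xs))
    p95 = xs[int(0.95 * (len(xs) - 1))]     # nearest-rank approximation
    print(f"P50={p50}ms  P95={p95}ms  n={len(xs)}  {knobs}")
```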
---
## 4) SLO gating & Pareto selection
**Ship rule (AND):**
* P95 ≤ budget (e.g., 2000 ms)
* Precision ≥ threshold (e.g., 0.80)
* CHR ≥ threshold (e.g., 0.75)
* Under/Over-refusal within limits
**Pareto frontier**
Given multiple knob configs, keep only those where no other config is **both** faster (lower P95) **and** more accurate (higher Precision). Choose:
* **Interactive app**: the **fastest** config on the frontier that still meets accuracy gates.
* **Back-office batch**: the **most accurate** config that meets a relaxed latency gate.
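A minimal frontier-selection sketch: each config is summarized as `(name, p95_ms, precision)`, with names and numbers that are purely illustrative (you would pull them from the sweep summaries):
```python
def pareto_frontier(configs):
    """Keep configs that no other config beats on both P95 (lower) and Precision (higher)."""
    frontier = [c for c in configs
                if not any(o[1] < c[1] and o[2] > c[2] for o in configs)]
    return sorted(frontier, key=lambda c: c[1])          # fastest first

configs = [("small", 900, 0.78), ("medium", 1400, 0.83), ("large", 2300, 0.86)]
frontier = pareto_frontier(configs)

# Interactive app: fastest frontier config that still meets the accuracy gates.
interactive = next((c for c in frontier if c[1] <= 2000 and c[2] >= 0.80), None)
# Back-office batch: most accurate frontier config under a relaxed latency gate.
batch = max((c for c in frontier if c[1] <= 3000), key=lambda c: c[2], default=None)
print(frontier, interactive, batch)
```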
**Rollback guard**
Fail the PR if: P95 increases by >15% **or** Precision drops by >2% vs last release.
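A minimal sketch of that guard in Python, assuming you keep the last release's summary at a path like `eval/last_release.json` (hypothetical) next to the current `eval/lat_1rps.json`:
```python
import json, sys

prev = json.load(open("eval/last_release.json", encoding="utf8"))   # baseline summary from the last release (hypothetical path)
cur = json.load(open("eval/lat_1rps.json", encoding="utf8"))        # current sweep summary

p95_ok = cur["p95"] <= prev["p95"] * 1.15                  # no more than 15% slower
prec_ok = cur["precision"] >= prev["precision"] - 0.02     # no more than 2 points lower (adjust if you mean relative)
sys.exit(0 if (p95_ok and prec_ok) else 1)                 # non-zero exit fails the PR check
```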
---
## 5) Troubleshooting map
* **P95 blown but P50 ok** → tail from LLM. Trim `max_tokens`, enable intersection+knee, reduce `rerank_depth`.
* **Precision low, CHR low** → grounding broken. Apply *RAG Semantic Drift* pattern.
* **Precision fine, CHR low** → claim substrings not matched; fix claim schema or gold substrings.
* **Throughput collapse at 20 rps** → remove cross-service `/readyz` waits; pre-warm model and index (see *Bootstrap Deadlock*).
* **Variance across runs** → check *Vector Store Fragmentation* and lock normalization.
---
## 6) CI wiring (copy/paste)
Example (bash):
```bash
# 1) Run sweep at 1 rps (smoke)
python ProblemMap/eval/latency_sweep.py --gold ProblemMap/eval/gold.jsonl --rps 1 --duration 30 | tee eval/lat_1rps.json
# 2) Run sweep at 5 rps (light load)
python ProblemMap/eval/latency_sweep.py --gold ProblemMap/eval/gold.jsonl --rps 5 --duration 60 | tee eval/lat_5rps.json
# 3) Score accuracy using the RAG scorer
python ProblemMap/eval/score_eval.py --gold ProblemMap/eval/gold.jsonl --trace runs/trace.jsonl --k 5 > eval/acc.json
# 4) Gate: jq asserts
jq -e '.p95 <= 2000' eval/lat_1rps.json
jq -e '.p95 <= 2500' eval/lat_5rps.json
jq -e '.precision >= 0.80 and .chr >= 0.75 and .under <= 0.05 and .over <= 0.10' eval/acc.json
```
---
## 7) Notes & caveats
* Use **open-loop** pacing (`sleep`) to avoid feedback artifacts from server backpressure.
* Warmup separately; capture **steady-state** latency.
* Fix random seeds for prompts (if you jitter prompts, do it in the *Semantic Stability* eval).
---
### 🧭 Explore More
| Module | Description | Link |
| --------------------- | -------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------- |
| WFGY Core | Standalone semantic reasoning engine for any LLM | [View →](https://github.com/onestardao/WFGY/tree/main/core/README.md) |
| Problem Map 1.0 | Initial 16-mode diagnostic and symbolic fix framework | [View →](https://github.com/onestardao/WFGY/tree/main/ProblemMap/README.md) |
| Problem Map 2.0 | RAG-focused failure tree, modular fixes, and pipelines | [View →](https://github.com/onestardao/WFGY/blob/main/ProblemMap/rag-architecture-and-recovery.md) |
| Semantic Clinic Index | Expanded failure catalog: prompt injection, memory bugs, logic drift | [View →](https://github.com/onestardao/WFGY/blob/main/ProblemMap/SemanticClinicIndex.md) |
| Semantic Blueprint | Layer-based symbolic reasoning & semantic modulations | [View →](https://github.com/onestardao/WFGY/tree/main/SemanticBlueprint/README.md) |
| Benchmark vs GPT-5 | Stress test GPT-5 with full WFGY reasoning suite | [View →](https://github.com/onestardao/WFGY/tree/main/benchmarks/benchmark-vs-gpt5/README.md) |
---
> 👑 **Early Stargazers: [See the Hall of Fame](https://github.com/onestardao/WFGY/tree/main/stargazers)** —
> Engineers, hackers, and open source builders who supported WFGY from day one.
> <img src="https://img.shields.io/github/stars/onestardao/WFGY?style=social" alt="GitHub stars"> ⭐ Help reach 10,000 stars by 2025-09-01 to unlock Engine 2.0 for everyone ⭐ **[Star WFGY on GitHub](https://github.com/onestardao/WFGY)**
<div align="center">
[![WFGY Main](https://img.shields.io/badge/WFGY-Main-red?style=flat-square)](https://github.com/onestardao/WFGY)
 
[![TXT OS](https://img.shields.io/badge/TXT%20OS-Reasoning%20OS-orange?style=flat-square)](https://github.com/onestardao/WFGY/tree/main/OS)
 
[![Blah](https://img.shields.io/badge/Blah-Semantic%20Embed-yellow?style=flat-square)](https://github.com/onestardao/WFGY/tree/main/OS/BlahBlahBlah)
 
[![Blot](https://img.shields.io/badge/Blot-Persona%20Core-green?style=flat-square)](https://github.com/onestardao/WFGY/tree/main/OS/BlotBlotBlot)
 
[![Bloc](https://img.shields.io/badge/Bloc-Reasoning%20Compiler-blue?style=flat-square)](https://github.com/onestardao/WFGY/tree/main/OS/BlocBlocBloc)
 
[![Blur](https://img.shields.io/badge/Blur-Text2Image%20Engine-navy?style=flat-square)](https://github.com/onestardao/WFGY/tree/main/OS/BlurBlurBlur)
 
[![Blow](https://img.shields.io/badge/Blow-Game%20Logic-purple?style=flat-square)](https://github.com/onestardao/WFGY/tree/main/OS/BlowBlowBlow)
</div>