
Eval — Latency vs Accuracy (SLO Gating, stdlib-only)

Goal
Decide whether a pipeline is allowed to ship under a latency budget while preserving grounded accuracy. This page defines metrics, experiment design, and a reference harness to collect P50/P95/P99 latency together with Precision/CHR.

What you get

  • Precise end-to-end vs per-stage latency definitions
  • A sweep harness (stdlib-only) to explore retrieval/rerank/LLM knobs
  • SLO gates and a Pareto-frontier selection rule to choose a config

1) Metrics (definitions)

Latency scope

  • E2E latency: time from receiving a question to a fully validated answer (includes retrieval, rerank, LLM, auditor/guards, JSON parse, acceptance checks).
  • Per-stage latency (optional): t_retrieval, t_rerank, t_llm, t_guard.

Aggregates

  • P50, P90, P95, P99 (milliseconds)
  • Tail amplification: P99 / P50 (smaller is better)

Accuracy side (from Precision/CHR page)

  • Precision (answered), CHR, Under-/Over-refusal
  • Same data contract: runs/trace.jsonl + eval/gold.jsonl
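
For reference, one record of each file in the form the harness below expects (field names match the code; the values are purely illustrative):

# eval/gold.jsonl — one question per line (illustrative values)
{"qid": "q001", "question": "What is the default P95 budget?", "answerable": true, "gold_claim_substr": ["2000 ms"], "gold_citations": ["doc_slo#p95"]}

# runs/trace.jsonl — one line per run, written by the harness
{"qid": "q001", "q": "What is the default P95 budget?", "retrieved_ids": ["doc_slo#p95"], "answer_json": {"claim": "The P95 budget is 2000 ms.", "citations": ["doc_slo#p95"]}}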

Default SLO gates (suggested)

  • P95 (E2E) ≤ 2000 ms (interactive UX)
  • Precision (answered) ≥ 0.80
  • CHR ≥ 0.75
  • Under-refusal ≤ 0.05, Over-refusal ≤ 0.10

Tune per product, but pin thresholds in repo and enforce in CI.
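
One way to pin them is a small thresholds file checked into the repo (a sketch; the path eval/slo.json is an assumption, not something the harness reads):

{
  "p95_ms_e2e": 2000,
  "precision_answered_min": 0.80,
  "chr_min": 0.75,
  "under_refusal_max": 0.05,
  "over_refusal_max": 0.10
}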


2) Experiment design

You will sweep low-cost knobs that trade latency for accuracy:

| Knob | Effect on latency | Effect on accuracy |
| --- | --- | --- |
| k_lex (BM25 top-k) | ↑ retrieval time with k | ↑ recall (to a point) |
| k_sem (embed top-k) | ↑ retrieval time with k | ↑ recall (to a point) |
| Intersection vs Union | Intersection often ↓ rerank set | ↑ precision / ↓ tail noise |
| rerank_depth (N→M) | ↑ linearly with N | ↑ CHR up to the knee |
| knee_cut | ↓ (smaller context) | Often ↑ (less junk), but risks recall loss |
| max_tokens (LLM output) | ↑ decode time | Weak effect on grounding |
| temperature | No change | High temp may hurt containment/CHR |
| Model choice | Varies | Varies; measure, don't guess |
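
Each knob maps to a key in the knobs dict the harness passes to pipeline_qa; the default grid in section 3 uses these names (the values shown are just one point in that grid):

# One knob configuration as the harness expects it
knobs = {
    "k_lex": 40,         # BM25 top-k
    "k_sem": 40,         # embedding top-k
    "intersect": True,   # intersect lexical and semantic candidates (vs union)
    "rerank_depth": 32,  # candidates passed to the reranker
    "knee": True,        # cut the ranked context at the score knee
    "max_tokens": 256,   # LLM output budget
}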

Loads
Measure at 3 loads (open-loop; stdlib-only):

  • 1 rps (single user feel)
  • 5 rps (light team usage)
  • 20 rps (stress upper bound)

3) Reference harness (stdlib-only)

Save as ProblemMap/eval/latency_sweep.py.
It calls either your local function pipeline_qa(q, knobs) -> answer JSON + trace, or an HTTP endpoint (toggle with --http). It writes:

  • runs/trace.jsonl (answers for accuracy)
  • runs/latency.csv (per run timings + knobs)
  • A final summary JSON with P95/P99 and pass/fail
#!/usr/bin/env python3
import time, json, csv, random, argparse, os, urllib.request

REFUSAL = "not in context"

# --- plug points --------------------------------------------------------------
def pipeline_qa_local(question, knobs):
    """
    Implement by importing your guarded baseline (Example 01/03).
    Must return:
      {
        "answer_json": {"claim": str, "citations": [str,...]},
        "retrieved_ids": [str,...],
        "stage_ms": {"retrieval":int,"rerank":int,"llm":int,"guard":int}
      }
    """
    # Minimal demo: call your ask.py via a local function or HTTP; here we stub.
    # Fake per-stage timings (replace with real, measured calls).
    retrieval_ms = random.randint(5, 15)
    rerank_ms    = random.randint(3, 10)
    llm_ms       = random.randint(220, 420)
    guard_ms     = random.randint(1, 3)
    ans = {"claim": REFUSAL, "citations": []}  # replace with real guarded output
    return {"answer_json": ans, "retrieved_ids": [],
            "stage_ms": {"retrieval": retrieval_ms, "rerank": rerank_ms, "llm": llm_ms, "guard": guard_ms}}

def pipeline_qa_http(url, question, knobs):
    body = json.dumps({"q":question, "knobs":knobs}).encode("utf-8")
    req  = urllib.request.Request(url, data=body, headers={"Content-Type":"application/json"})
    with urllib.request.urlopen(req, timeout=60) as r:
        j=json.loads(r.read().decode("utf-8"))
    # Expect the same contract as local variant
    return j

# --- helpers ------------------------------------------------------------------
def percentiles(samples, ps=(50,90,95,99)):
    if not samples: return {p:0 for p in ps}
    xs=sorted(samples)
    out={}
    for p in ps:
        k=(p/100)*(len(xs)-1)
        f=int(k); c=min(f+1,len(xs)-1); d=k-f
        out[p]=xs[f]*(1-d)+xs[c]*d
    return {p:int(out[p]) for p in ps}

def contains_substr(claim, subs):
    # True if any gold substring (ignoring trivially short ones, <5 chars) appears in the claim.
    c=(claim or "").lower()
    if not subs: return True
    return any((s.lower() in c and len(s)>=5) for s in subs)

def citation_hit(cits, gold, retrieved):
    # Citations must be a subset of what was actually retrieved and overlap the gold citations.
    # When there are no gold citations (unanswerable), only an empty citation list counts as a hit.
    if not isinstance(cits,list): return False
    if not set(cits).issubset(set(retrieved or [])): return False
    return bool(set(cits or []) & set(gold or [])) if gold else (cits==[])

# --- main sweep ---------------------------------------------------------------
def run_sweep(gold_path, questions, knobs_grid, http_url=None, rps=1, duration_s=20):
    gold = {g["qid"]: g for g in (json.loads(l) for l in open(gold_path, encoding="utf8"))}
    lat_ms=[]; answered=refused=tp=chr_hit=under=over=0
    start=time.perf_counter()
    os.makedirs("runs", exist_ok=True)  # ensure the output dir exists before appending
    new_csv = not os.path.exists("runs/latency.csv") or os.path.getsize("runs/latency.csv") == 0
    trace_f=open("runs/trace.jsonl","a",encoding="utf8"); lat_f=open("runs/latency.csv","a",newline=""); lat_csv=csv.writer(lat_f)
    if new_csv:  # write the header only once per file
        lat_csv.writerow(["ts","qid","e2e_ms","retrieval_ms","rerank_ms","llm_ms","guard_ms","knobs"])
    i=0
    while time.perf_counter()-start < duration_s:
        qid=questions[i % len(questions)]
        g=gold[qid]; q=g["question"]; knobs=knobs_grid[i % len(knobs_grid)]
        t0=time.perf_counter()
        if http_url:
            out=pipeline_qa_http(http_url, q, knobs)
        else:
            out=pipeline_qa_local(q, knobs)
        e2e_ms=int((time.perf_counter()-t0)*1000)
        lat_ms.append(e2e_ms)
        st=out.get("stage_ms",{})
        lat_csv.writerow([int(time.time()), qid, e2e_ms, st.get("retrieval",0), st.get("rerank",0), st.get("llm",0), st.get("guard",0), json.dumps(knobs)])
        # accuracy tallies
        aj=out.get("answer_json",{}); claim=aj.get("claim",""); cits=aj.get("citations",[]); ret=out.get("retrieved_ids",[])
        is_ans = (claim.strip().lower() != REFUSAL)
        A = bool(g.get("answerable"))
        if is_ans:
            answered+=1
            C=contains_substr(claim, g.get("gold_claim_substr"))
            H=citation_hit(cits, g.get("gold_citations"), ret)
            if not A: under+=1
            else:
                if H: chr_hit+=1
                if C and H: tp+=1
        else:
            refused+=1
            if A: over+=1
        trace_f.write(json.dumps({"qid":qid,"q":q,"retrieved_ids":ret,"answer_json":aj})+"\n")
        # open-loop pacing
        time.sleep(max(0.0, 1.0/rps - (time.perf_counter()-t0)))
        i+=1
    trace_f.close(); lat_f.close()
    # aggregates
    P=percentiles(lat_ms); S=max(answered,1)
    precision=tp/S; chr_rate=chr_hit/S; under_rate=under/max(sum(1 for x in gold.values() if not x["answerable"]),1)
    over_rate=over/max(sum(1 for x in gold.values() if x["answerable"]),1)
    return {"p50":P[50],"p95":P[95],"p99":P[99],"answered":answered,"refused":refused,
            "precision":round(precision,4),"chr":round(chr_rate,4),
            "under":round(under_rate,4),"over":round(over_rate,4),
            "samples":len(lat_ms)}

if __name__=="__main__":
    ap=argparse.ArgumentParser()
    ap.add_argument("--gold", required=True)
    ap.add_argument("--http", default=None, help="http://localhost:8080/qa if using HTTP")
    ap.add_argument("--rps", type=float, default=1.0)
    ap.add_argument("--duration", type=int, default=20)
    args=ap.parse_args()
    # small grid (expand in CI)
    knobs_grid=[
        {"k_lex":40, "k_sem":40, "intersect":True,  "rerank_depth":32, "knee":True,  "max_tokens":256},
        {"k_lex":20, "k_sem":20, "intersect":True,  "rerank_depth":16, "knee":True,  "max_tokens":192},
        {"k_lex":60, "k_sem":60, "intersect":False, "rerank_depth":64, "knee":False, "max_tokens":256},
    ]
    # choose 20–50 mixed answerable/unanswerable qids from gold
    qids=[json.loads(l)["qid"] for l in open(args.gold,encoding="utf8")]
    res=run_sweep(args.gold, qids[:30], knobs_grid, http_url=args.http, rps=args.rps, duration_s=args.duration)
    print(json.dumps(res, indent=2))

How to use

# Single-user feel
python ProblemMap/eval/latency_sweep.py --gold ProblemMap/eval/gold.jsonl --rps 1 --duration 30
# Team load
python ProblemMap/eval/latency_sweep.py --gold ProblemMap/eval/gold.jsonl --rps 5 --duration 60
# Stress
python ProblemMap/eval/latency_sweep.py --gold ProblemMap/eval/gold.jsonl --rps 20 --duration 60

This writes runs/latency.csv. Use any plotting tool later; gating does not require plots.
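
If you run with --http, the endpoint must accept a JSON body {"q", "knobs"} and return the same contract as pipeline_qa_local. A minimal stdlib-only server sketch (hypothetical; swap the import for your real guarded pipeline):

#!/usr/bin/env python3
# Sketch of the /qa endpoint expected by latency_sweep.py --http (assumes pipeline_qa_local is importable).
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from latency_sweep import pipeline_qa_local  # replace with your real pipeline entry point

class QAHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        body = json.loads(self.rfile.read(int(self.headers["Content-Length"])))
        out = pipeline_qa_local(body["q"], body.get("knobs", {}))
        payload = json.dumps(out).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), QAHandler).serve_forever()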


4) SLO gating & Pareto selection

Ship rule (AND):

  • P95 ≤ budget (e.g., 2000 ms)
  • Precision ≥ threshold (e.g., 0.80)
  • CHR ≥ threshold (e.g., 0.75)
  • Under/Over-refusal within limits

Pareto frontier: Given multiple knob configs, keep only those for which no other config is both faster (lower P95) and more accurate (higher Precision); a selection sketch follows the list below. Choose:

  • Interactive app: the fastest config on the frontier that still meets accuracy gates.
  • Back-office batch: the most accurate config that meets a relaxed latency gate.
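
A minimal frontier filter over the sweep summaries (a sketch; it assumes each config's result dict carries the p95, precision, and chr fields the harness emits):

def pareto_frontier(results):
    # Keep configs that no other config dominates (lower p95 AND higher precision).
    frontier = []
    for a in results:
        dominated = any(
            b["p95"] <= a["p95"] and b["precision"] >= a["precision"]
            and (b["p95"] < a["p95"] or b["precision"] > a["precision"])
            for b in results
        )
        if not dominated:
            frontier.append(a)
    return frontier

def pick_interactive(results, precision_min=0.80, chr_min=0.75):
    # Interactive app: fastest frontier config that still meets the accuracy gates.
    ok = [r for r in pareto_frontier(results)
          if r["precision"] >= precision_min and r["chr"] >= chr_min]
    return min(ok, key=lambda r: r["p95"]) if ok else None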

Rollback guard: fail the PR if P95 increases by >15% or Precision drops by >2% vs the last release (a stdlib check sketch is below).
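
A stdlib version of that guard (a sketch; eval/baseline.json is an assumed path holding the last release's summary, and the 2% drop is read as 0.02 absolute):

#!/usr/bin/env python3
# Fail (exit 1) if P95 regresses by >15% or Precision drops by >0.02 vs the stored baseline.
import json, sys

cur  = json.load(open("eval/lat_1rps.json", encoding="utf8"))
base = json.load(open("eval/baseline.json", encoding="utf8"))  # last release's summary (assumed path)

if cur["p95"] > base["p95"] * 1.15 or cur["precision"] < base["precision"] - 0.02:
    print("rollback guard failed:", {"p95": cur["p95"], "precision": cur["precision"]})
    sys.exit(1)
print("rollback guard ok")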


5) Troubleshooting map

  • P95 blown but P50 ok → tail from LLM. Trim max_tokens, enable intersection+knee, reduce rerank_depth.
  • Precision low, CHR low → grounding broken. Apply RAG Semantic Drift pattern.
  • Precision fine, CHR low → claim substrings not matched; fix claim schema or gold substrings.
  • Throughput collapse at 20 rps → remove cross-service /readyz waits; pre-warm model and index (see Bootstrap Deadlock).
  • Variance across runs → check Vector Store Fragmentation and lock normalization.

6) CI wiring (copy/paste)

Example (bash):

# 1) Run sweep at 1 rps (smoke)
python ProblemMap/eval/latency_sweep.py --gold ProblemMap/eval/gold.jsonl --rps 1 --duration 30 | tee eval/lat_1rps.json
# 2) Run sweep at 5 rps (light load)
python ProblemMap/eval/latency_sweep.py --gold ProblemMap/eval/gold.jsonl --rps 5 --duration 60 | tee eval/lat_5rps.json
# 3) Score accuracy using the RAG scorer
python ProblemMap/eval/score_eval.py --gold ProblemMap/eval/gold.jsonl --trace runs/trace.jsonl --k 5 > eval/acc.json

# 4) Gate: jq asserts
jq -e '.p95 <= 2000' eval/lat_1rps.json
jq -e '.p95 <= 2500' eval/lat_5rps.json
jq -e '.precision >= 0.80 and .chr >= 0.75 and .under <= 0.05 and .over <= 0.10' eval/acc.json
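
If jq is not available in CI, the same gates in stdlib Python (a sketch; it assumes eval/acc.json exposes the same precision/chr/under/over fields as the sweep summary):

#!/usr/bin/env python3
# Equivalent of the jq asserts above: exit non-zero when any gate fails.
import json, sys

lat1 = json.load(open("eval/lat_1rps.json", encoding="utf8"))
lat5 = json.load(open("eval/lat_5rps.json", encoding="utf8"))
acc  = json.load(open("eval/acc.json", encoding="utf8"))

ok = (lat1["p95"] <= 2000 and lat5["p95"] <= 2500
      and acc["precision"] >= 0.80 and acc["chr"] >= 0.75
      and acc["under"] <= 0.05 and acc["over"] <= 0.10)
sys.exit(0 if ok else 1)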

7) Notes & caveats

  • Use open-loop pacing (sleep) to avoid feedback artifacts from server backpressure.
  • Warmup separately; capture steady-state latency.
  • Fix random seeds for prompts (if you jitter prompts, do it in the Semantic Stability eval).
