
Eval — Latency vs Accuracy (SLO Gating, stdlib-only)

Goal
Decide whether a pipeline is allowed to ship under a latency budget while preserving grounded accuracy. This page defines metrics, experiment design, and a reference harness to collect P50/P95/P99 latency together with Precision/CHR.

What you get

  • Precise end-to-end vs per-stage latency definitions
  • A sweep harness (stdlib-only) to explore retrieval/rerank/LLM knobs
  • SLO gates and a Pareto-frontier selection rule to choose a config

1) Metrics (definitions)

Latency scope

  • E2E latency: time from receiving a question to a fully validated answer (includes retrieval, rerank, LLM, auditor/guards, JSON parse, acceptance checks).
  • Per-stage latency (optional): t_retrieval, t_rerank, t_llm, t_guard.

Aggregates

  • P50, P90, P95, P99 (milliseconds)
  • Tail amplification: P99 / P50 (smaller is better)

Accuracy side (from Precision/CHR page)

  • Precision (answered), CHR, Under-/Over-refusal
  • Same data contract: runs/trace.jsonl + eval/gold.jsonl (sample record after this list)
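For reference, a single gold.jsonl record with the fields the harness below actually reads (values are illustrative):

```json
{"qid":"q001","question":"What is the default P95 budget?","answerable":true,"gold_claim_substr":["2000 ms"],"gold_citations":["doc_slo#3"]}
```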

Default SLO gates (suggested)

  • P95 (E2E) ≤ 2000 ms (interactive UX)
  • Precision (answered) ≥ 0.80
  • CHR ≥ 0.75
  • Under-refusal ≤ 0.05, Over-refusal ≤ 0.10

Tune per product, but pin thresholds in repo and enforce in CI.
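One way to pin them: a small gate file checked into the repo, read by both CI and local runs. A minimal sketch, assuming a hypothetical eval/slo.json whose keys mirror the sweep summary produced below:

```python
import json

def gate(summary, slo_path="eval/slo.json"):
    """Return True iff a sweep summary passes every pinned SLO gate."""
    # eval/slo.json (hypothetical), e.g.:
    # {"p95_ms": 2000, "precision": 0.80, "chr": 0.75, "under": 0.05, "over": 0.10}
    slo = json.load(open(slo_path, encoding="utf8"))
    return (summary["p95"]       <= slo["p95_ms"]    and
            summary["precision"] >= slo["precision"] and
            summary["chr"]       >= slo["chr"]       and
            summary["under"]     <= slo["under"]     and
            summary["over"]      <= slo["over"])
```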


2) Experiment design

You will sweep low-cost knobs that trade latency for accuracy; a grid-expansion sketch follows the table:

| Knob | Effect on latency | Effect on accuracy |
|---|---|---|
| k_lex (BM25 top-k) | ↑ retrieval time with k | ↑ recall (to a point) |
| k_sem (embed top-k) | ↑ retrieval time with k | ↑ recall (to a point) |
| Intersection vs Union | intersection often ↓ rerank set | ↑ precision / ↓ tail noise |
| rerank_depth (N→M) | ↑ linearly with N | ↑ CHR up to a knee |
| knee_cut | ↓ (smaller context) | often ↑ (less junk), but risks recall loss |
| max_tokens (LLM output) | ↑ decode time | weak effect on grounding |
| temperature | no change | high temperature may hurt containment/CHR |
| Model choice | varies | varies; measure, don't guess |
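If you want to expand the grid mechanically rather than hand-pick configs, a cross-product helper is enough. A minimal sketch (expand_grid and the value lists are illustrative, not part of the harness below):

```python
import itertools

def expand_grid(**axes):
    """Cross-product of per-knob value lists into a list of knob dicts."""
    keys = list(axes)
    return [dict(zip(keys, vals)) for vals in itertools.product(*axes.values())]

knobs_grid = expand_grid(
    k_lex=[20, 40], k_sem=[20, 40], intersect=[True, False],
    rerank_depth=[16, 32], knee=[True], max_tokens=[192, 256],
)  # 2*2*2*2*1*2 = 32 configs
```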

Loads
Measure at 3 loads (open-loop; stdlib-only); a scheduling sketch follows the list:

  • 1 rps (single user feel)
  • 5 rps (light team usage)
  • 20 rps (stress upper bound)
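Note that the harness below paces requests one at a time, which drifts toward closed-loop behavior once service time exceeds 1/rps. If you need strictly open-loop arrivals, a thread-per-request scheduler is a minimal stdlib sketch (assumes your pipeline call is thread-safe):

```python
import threading, time

def open_loop_fire(fire_one, rps, duration_s):
    """Fire requests on a fixed schedule, regardless of completion times."""
    interval = 1.0 / rps
    deadline = time.perf_counter() + duration_s
    next_t = time.perf_counter()
    threads = []
    while next_t < deadline:
        time.sleep(max(0.0, next_t - time.perf_counter()))
        t = threading.Thread(target=fire_one)  # do not wait for completion
        t.start()
        threads.append(t)
        next_t += interval
    for t in threads:
        t.join()
```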

3) Reference harness (stdlib-only)

Save as ProblemMap/eval/latency_sweep.py.
It calls either your local pipeline function pipeline_qa(q, knobs), which returns the answer JSON plus trace fields, or an HTTP endpoint (toggle with --http). It writes:

  • runs/trace.jsonl (answers for accuracy)
  • runs/latency.csv (per run timings + knobs)
  • A final summary JSON with P95/P99 and pass/fail
```python
#!/usr/bin/env python3
import os, time, json, csv, random, argparse, urllib.request

REFUSAL = "not in context"

# --- plug points --------------------------------------------------------------
def pipeline_qa_local(question, knobs):
    """
    Implement by importing your guarded baseline (Example 01/03).
    Must return:
      {
        "answer_json": {"claim": str, "citations": [str,...]},
        "retrieved_ids": [str,...],
        "stage_ms": {"retrieval":int,"rerank":int,"llm":int,"guard":int}
      }
    """
    # Minimal demo stub; replace each fake timing with a real call into your guarded baseline.
    retrieval_ms = random.randint(5, 15)     # stub: real retrieval timing goes here
    rerank_ms    = random.randint(3, 10)     # stub: real rerank timing goes here
    llm_ms       = random.randint(220, 420)  # stub: real LLM timing goes here
    guard_ms     = random.randint(1, 3)      # stub: real guard timing goes here
    ans = {"claim": REFUSAL, "citations": []}  # replace with real guarded output
    return {"answer_json": ans, "retrieved_ids": [],
            "stage_ms": {"retrieval": retrieval_ms, "rerank": rerank_ms,
                         "llm": llm_ms, "guard": guard_ms}}

def pipeline_qa_http(url, question, knobs):
    body = json.dumps({"q": question, "knobs": knobs}).encode("utf-8")
    req  = urllib.request.Request(url, data=body, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req, timeout=60) as r:
        j = json.loads(r.read().decode("utf-8"))
    # Expect the same contract as the local variant
    return j

# --- helpers ------------------------------------------------------------------
def percentiles(samples, ps=(50,90,95,99)):
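    # Linear interpolation between adjacent order statistics (numpy's "linear" rule).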
    if not samples: return {p:0 for p in ps}
    xs=sorted(samples)
    out={}
    for p in ps:
        k=(p/100)*(len(xs)-1)
        f=int(k); c=min(f+1,len(xs)-1); d=k-f
        out[p]=xs[f]*(1-d)+xs[c]*d
    return {p:int(out[p]) for p in ps}

def contains_substr(claim, subs):
    c=(claim or "").lower()
    if not subs: return True
    return any((s.lower() in c and len(s)>=5) for s in subs)

def citation_hit(cits, gold, retrieved):
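    # A "hit" needs citations ⊆ retrieved ids AND overlap with gold citations;
    # unanswerable items (no gold citations) are expected to cite nothing.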
    if not isinstance(cits,list): return False
    if not set(cits).issubset(set(retrieved or [])): return False
    return bool(set(cits or []) & set(gold or [])) if gold else (cits==[])

# --- main sweep ---------------------------------------------------------------
def run_sweep(gold_path, questions, knobs_grid, http_url=None, rps=1, duration_s=20):
    gold = {g["qid"]: g for g in (json.loads(l) for l in open(gold_path, encoding="utf8"))}
    lat_ms=[]; answered=refused=tp=chr_hit=under=over=0
    ans_attempts=unans_attempts=0  # attempt counts, for refusal-rate denominators
    start=time.perf_counter()
    os.makedirs("runs", exist_ok=True)
    trace_f=open("runs/trace.jsonl","a",encoding="utf8"); lat_f=open("runs/latency.csv","a",newline=""); lat_csv=csv.writer(lat_f)
    lat_csv.writerow(["ts","qid","e2e_ms","retrieval_ms","rerank_ms","llm_ms","guard_ms","knobs"])
    i=0
    while time.perf_counter()-start < duration_s:
        qid=questions[i % len(questions)]
        g=gold[qid]; q=g["question"]; knobs=knobs_grid[i % len(knobs_grid)]
        t0=time.perf_counter()
        if http_url:
            out=pipeline_qa_http(http_url, q, knobs)
        else:
            out=pipeline_qa_local(q, knobs)
        e2e_ms=int((time.perf_counter()-t0)*1000)
        lat_ms.append(e2e_ms)
        st=out.get("stage_ms",{})
        lat_csv.writerow([int(time.time()), qid, e2e_ms, st.get("retrieval",0), st.get("rerank",0), st.get("llm",0), st.get("guard",0), json.dumps(knobs)])
        # accuracy tallies
        aj=out.get("answer_json",{}); claim=aj.get("claim",""); cits=aj.get("citations",[]); ret=out.get("retrieved_ids",[])
        is_ans=(claim.strip().lower()!=REFUSAL)
        A=bool(g.get("answerable"))
        if A: ans_attempts+=1
        else: unans_attempts+=1
        if is_ans:
            answered+=1
            C=contains_substr(claim, g.get("gold_claim_substr"))
            H=citation_hit(cits, g.get("gold_citations"), ret)
            if not A: under+=1
            else:
                if H: chr_hit+=1
                if C and H: tp+=1
        else:
            refused+=1
            if A: over+=1
        trace_f.write(json.dumps({"qid":qid,"q":q,"retrieved_ids":ret,"answer_json":aj})+"\n")
        # paced requests: approximates open-loop; degrades to closed-loop once service time exceeds 1/rps
        time.sleep(max(0.0, 1.0/rps - (time.perf_counter()-t0)))
        i+=1
    trace_f.close(); lat_f.close()
    # aggregates
    P=percentiles(lat_ms); S=max(answered,1)
    precision=tp/S; chr_rate=chr_hit/S
    # refusal rates are per attempt: questions repeat under duration-based load,
    # so dividing by the gold-set size could exceed 1.0
    under_rate=under/max(unans_attempts,1)
    over_rate=over/max(ans_attempts,1)
    return {"p50":P[50],"p95":P[95],"p99":P[99],"answered":answered,"refused":refused,
            "precision":round(precision,4),"chr":round(chr_rate,4),
            "under":round(under_rate,4),"over":round(over_rate,4),
            "samples":len(lat_ms)}

if __name__=="__main__":
    ap=argparse.ArgumentParser()
    ap.add_argument("--gold", required=True)
    ap.add_argument("--http", default=None, help="http://localhost:8080/qa if using HTTP")
    ap.add_argument("--rps", type=float, default=1.0)
    ap.add_argument("--duration", type=int, default=20)
    args=ap.parse_args()
    # small grid (expand in CI)
    knobs_grid=[
        {"k_lex":40, "k_sem":40, "intersect":True,  "rerank_depth":32, "knee":True,  "max_tokens":256},
        {"k_lex":20, "k_sem":20, "intersect":True,  "rerank_depth":16, "knee":True,  "max_tokens":192},
        {"k_lex":60, "k_sem":60, "intersect":False, "rerank_depth":64, "knee":False, "max_tokens":256},
    ]
    # choose 20-50 mixed answerable/unanswerable qids from gold
    qids=[json.loads(l)["qid"] for l in open(args.gold,encoding="utf8")]
    res=run_sweep(args.gold, qids[:30], knobs_grid, http_url=args.http, rps=args.rps, duration_s=args.duration)
    print(json.dumps(res, indent=2))
```

How to use

```bash
# Single-user feel
python ProblemMap/eval/latency_sweep.py --gold ProblemMap/eval/gold.jsonl --rps 1 --duration 30
# Team load
python ProblemMap/eval/latency_sweep.py --gold ProblemMap/eval/gold.jsonl --rps 5 --duration 60
# Stress
python ProblemMap/eval/latency_sweep.py --gold ProblemMap/eval/gold.jsonl --rps 20 --duration 60
```

This writes runs/latency.csv. Use any plotting tool later; gating does not require plots.
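If you want a quick per-stage percentile breakdown without plots, a stdlib pass over the CSV is enough. A sketch (stage column names as written by the harness):

```python
import csv
from statistics import quantiles

rows = list(csv.DictReader(open("runs/latency.csv", encoding="utf8")))
for col in ("e2e_ms", "retrieval_ms", "rerank_ms", "llm_ms", "guard_ms"):
    # isdigit() also skips any repeated header rows left by append mode
    xs = sorted(int(r[col]) for r in rows if r[col].isdigit())
    if len(xs) >= 2:
        q = quantiles(xs, n=100)  # q[k-1] approximates the k-th percentile
        print(f"{col}: p50={q[49]:.0f} p95={q[94]:.0f} p99={q[98]:.0f}")
```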


4) SLO gating & Pareto selection

Ship rule (AND):

  • P95 ≤ budget (e.g., 2000 ms)
  • Precision ≥ threshold (e.g., 0.80)
  • CHR ≥ threshold (e.g., 0.75)
  • Under/Over-refusal within limits

Pareto frontier. Given multiple knob configs, keep only those for which no other config is both faster (lower P95) and more accurate (higher Precision); a filtering sketch follows this list. Choose:

  • Interactive app: the fastest config on the frontier that still meets accuracy gates.
  • Back-office batch: the most accurate config that meets a relaxed latency gate.
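A minimal frontier filter, assuming each candidate is a dict carrying at least the p95 and precision keys that run_sweep returns:

```python
def pareto_frontier(configs):
    """Keep configs that no other config dominates on (P95, Precision)."""
    def dominated(a):
        return any(b["p95"] <= a["p95"] and b["precision"] >= a["precision"]
                   and (b["p95"] < a["p95"] or b["precision"] > a["precision"])
                   for b in configs)
    return [c for c in configs if not dominated(c)]
```

Sort survivors by P95 ascending; the interactive pick is the first that still clears the accuracy gates.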

Rollback guard. Fail the PR if P95 increases by >15% or Precision drops by >2% vs the last release.
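A sketch of that guard, assuming the last release's summary is checked in at a hypothetical eval/baseline.json:

```python
import json, sys

base = json.load(open("eval/baseline.json", encoding="utf8"))
cur  = json.load(open("eval/lat_1rps.json", encoding="utf8"))

if cur["p95"] > base["p95"] * 1.15:
    sys.exit("rollback guard: P95 regressed by more than 15%")
if cur["precision"] < base["precision"] * 0.98:
    sys.exit("rollback guard: Precision dropped by more than 2%")
```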


5) Troubleshooting map

  • P95 blown but P50 ok → tail from LLM. Trim max_tokens, enable intersection+knee, reduce rerank_depth.
  • Precision low, CHR low → grounding broken. Apply RAG Semantic Drift pattern.
  • Precision low, CHR fine → claim substrings not matched; fix claim schema or gold substrings.
  • Throughput collapse at 20 rps → remove cross-service /readyz waits; pre-warm model and index (see Bootstrap Deadlock).
  • Variance across runs → check Vector Store Fragmentation and lock normalization.

6) CI wiring (copy/paste)

Example (bash):

```bash
# 1) Run sweep at 1 rps (smoke)
python ProblemMap/eval/latency_sweep.py --gold ProblemMap/eval/gold.jsonl --rps 1 --duration 30 | tee eval/lat_1rps.json
# 2) Run sweep at 5 rps (light load)
python ProblemMap/eval/latency_sweep.py --gold ProblemMap/eval/gold.jsonl --rps 5 --duration 60 | tee eval/lat_5rps.json
# 3) Score accuracy using the RAG scorer
python ProblemMap/eval/score_eval.py --gold ProblemMap/eval/gold.jsonl --trace runs/trace.jsonl --k 5 > eval/acc.json

# 4) Gate: jq asserts
jq -e '.p95 <= 2000' eval/lat_1rps.json
jq -e '.p95 <= 2500' eval/lat_5rps.json
jq -e '.precision >= 0.80 and .chr >= 0.75 and .under <= 0.05 and .over <= 0.10' eval/acc.json
```

7) Notes & caveats

  • Use open-loop pacing (sleep) to avoid feedback artifacts from server backpressure.
  • Warmup separately; capture steady-state latency.
  • Fix random seeds for prompts (if you jitter prompts, do it in the Semantic Stability eval).
