Create eval_semantic_stability.md

2026-04-30 12:39:55 +00:00 · 2025-08-30 09:13:01 +08:00 · 2025-08-30 09:13:01 +08:00 · 78ac8f493f
commit 78ac8f493f
parent bd25b2356b
1 changed files with 395 additions and 0 deletions
--- a/ProblemMap/GlobalFixMap/Eval/eval_semantic_stability.md
+++ b/ProblemMap/GlobalFixMap/Eval/eval_semantic_stability.md
@ -0,0 +1,395 @@
+# Eval — Semantic Stability (Seed & Prompt Jitter, stdlib-only)
+
+**Goal**  
+Quantify how **stable** your pipeline is under small, *non-semantic* perturbations: different seeds, low temperature noise, and benign **prompt jitters** (punctuation/whitespace/synonym swaps). A robust system should keep claims, citations, refusals, and constraint echos **invariant** (or nearly so).
+
+**What you get**
+- Clear invariants: **Claim**, **Citation**, **Refusal**, **Constraint Echo (SCU)**
+- Deterministic metrics: **I-AA** (intra-answer agreement), **NED** (normalized edit distance), **CSS** (citation-set stability), **RCR** (refusal consistency rate)
+- A stdlib-only **runner** that repeats each QID under seeds & jitters, and a strict **CI gate**
+
+---
+
+## 1) Stability Invariants & Metrics
+
+Let a question `q` be executed **N** times with seed set `S` and jitter set `J`. Each run `r` produces:
+- `claim` (string) or exact refusal token `not in context`
+- `citations` (list of ids, scoped to retrieved ids)
+- Optional: `constraints_echo` (list of locked constraints)
+
+**Normalization**
+- `canon(claim)` = lowercase → strip punctuation/extra spaces.
+- `C` (containment): at least one `gold_claim_substr` ∈ `canon(claim)` (≥5 chars, case-insensitive).
+- `H` (citation hit): `(citations ∩ gold_citations) ≠ ∅` and `citations ⊆ retrieved_ids`.
+
+**Per-QID metrics**
+- **RCR** (Refusal Consistency Rate) = max class frequency of `is_refusal` over N runs.
+- **ACR** (All-run Containment Rate) = |{runs with C=true}| / N.
+- **CGHC** (Citation Gold Hit Consistency) = |{runs with H=true}| / N.
+- **CSS** (Citation Set Stability) = `|∩_runs citations| / |∪_runs citations|` (IoU of citation sets).
+- **NED₅₀** (Median Normalized Edit Distance) = median over all run pairs `ned(claim_i, claim_j)` where
+  \[
+  \text{ned}(a,b) = \frac{\text{lev}(a,b)}{\max(|a|,|b|,1)}
+  \]
+- **SCU-Cons** (Constraint Echo Consistency) = 1 if every run’s `constraints_echo` equals the locked set; else 0.
+
+> We prefer **semantic invariants** (ACR, CGHC, CSS) over surface-similarity alone (NED), but we track both.
+
+**Default stability gates (suggested)**
+- For `answerable=true` items:
+  - **ACR ≥ 0.95**, **CGHC ≥ 0.95**, **CSS ≥ 0.70**, **NED₅₀ ≤ 0.20**
+  - If constraints provided: **SCU-Cons = 1**
+- For `answerable=false` items:
+  - **RCR ≥ 0.98** (refusal nearly always preserved)
+
+---
+
+## 2) Experiment Design (Seeds × Jitters)
+
+**Seeds:** e.g., S = {0,1,2,3,4} with `temperature ∈ {0.0, 0.2}` (model-side)  
+**Prompt jitters (question-side; deterministic & benign):**
+- `ws` — normalize/densify whitespace (adds/removes single spaces)
+- `punct` — swap `?`↔`—`, add commas where harmless
+- `syn` — swap common verbs: *explain→describe, list→enumerate, compare→contrast*
+- `order` — reorder a pair of non-semantic clauses (“with citations, in one sentence”)
+
+Each run picks `(seed, jitter)` from a small grid. No external libs; all jitters are **pure string transforms**.
+
+---
+
+## 3) Data Contracts (recap)
+
+- **Gold** (`ProblemMap/eval/gold.jsonl`)  
+  Same as other evals: `qid`, `question`, `answerable`, `gold_claim_substr`, `gold_citations`, optional `constraints`.
+
+- **Traces (per-run)** appended to `runs/stability.jsonl` by this runner:
+```json
+{
+  "qid":"A0001","run_id":"A0001#seed=2;j=punct",
+  "seed":2,"jitter":"punct",
+  "answer_json":{"claim":"X rejects null keys.","citations":["p1#2"],"constraints_echo":["X rejects null keys."]},
+  "retrieved_ids":["p1#1","p1#2","p2#1"]
+}
+````
+
+---
+
+## 4) Reference Runner (stdlib-only)
+
+Save as `ProblemMap/eval/semantic_stability.py`:
+
+```python
+#!/usr/bin/env python3
+import json, argparse, random, time, re, urllib.request, itertools, math, string, sys, os
+
+REFUSAL = "not in context"
+
+# --------- prompt jitters (benign, deterministic) ----------------------------
+SYN_MAP = {
+    r"\bexplain\b": "describe",
+    r"\blist\b": "enumerate",
+    r"\bcompare\b": "contrast",
+    r"\bshow\b": "display",
+}
+
+def jitter_ws(q):  # add/remove single spaces around commas/colons
+    q = re.sub(r"\s+,", ",", q)
+    q = re.sub(r",\s*", ", ", q)
+    q = re.sub(r"\s+:", ": ", q)
+    q = re.sub(r"\s{2,}", " ", q)
+    return q.strip()
+
+def jitter_punct(q):
+    q = q.replace("?", " ?").replace("  ", " ")
+    q = q.replace("—", "-").replace("–", "-")
+    if not q.endswith("?") and q[-1] not in ".!?":
+        q += "?"
+    return q
+
+def jitter_syn(q):
+    s = q
+    for pat, repl in SYN_MAP.items():
+        s = re.sub(pat, repl, s, flags=re.IGNORECASE)
+    return s
+
+def jitter_order(q):
+    # swap order of two benign optional clauses if present
+    parts = re.split(r"\s+with\s+citations\s*,?\s*|\s*,?\s*in\s+one\s+sentence\s*", q, flags=re.IGNORECASE)
+    if len(parts) >= 2 and "with citations" in q.lower() and "in one sentence" in q.lower():
+        # rebuild in swapped order
+        base = parts[0].strip()
+        return f"{base} in one sentence, with citations"
+    return q
+
+JITTERS = {"none": lambda x:x, "ws": jitter_ws, "punct": jitter_punct, "syn": jitter_syn, "order": jitter_order}
+
+# --------- helpers -----------------------------------------------------------
+def canon(s):
+    s = s.lower().strip()
+    s = s.translate(str.maketrans("", "", string.punctuation))
+    s = re.sub(r"\s+", " ", s)
+    return s
+
+def contains_substr(claim, subs):
+    c = canon(claim or "")
+    if not subs: return True
+    subs = [canon(x) for x in subs if x and len(x)>=5]
+    return any(s in c for s in subs)
+
+def citation_hit(citations, gold, retrieved):
+    if not isinstance(citations, list): return False
+    if not set(citations).issubset(set(retrieved or [])): return False
+    return bool(set(citations or []) & set(gold or [])) if gold else (citations == [])
+
+def levenshtein(a, b):
+    a, b = a or "", b or ""
+    if a == b: return 0
+    la, lb = len(a), len(b)
+    if la == 0: return lb
+    if lb == 0: return la
+    dp = list(range(lb+1))
+    for i in range(1, la+1):
+        prev, dp[0] = dp[0], i
+        for j in range(1, lb+1):
+            cur = dp[j]
+            cost = 0 if a[i-1]==b[j-1] else 1
+            dp[j] = min(dp[j]+1, dp[j-1]+1, prev+cost)
+            prev = cur
+    return dp[lb]
+
+def ned(a, b):
+    m = max(len(a or ""), len(b or ""), 1)
+    return levenshtein(a or "", b or "") / m
+
+def percentiles(xs, ps=(50,)):
+    if not xs: return {p:0 for p in ps}
+    xs = sorted(xs)
+    out={}
+    for p in ps:
+        k=(p/100)*(len(xs)-1)
+        f=int(k); c=min(f+1,len(xs)-1); d=k-f
+        out[p]=xs[f]*(1-d)+xs[c]*d
+    return out
+
+# --------- pipeline hook -----------------------------------------------------
+def call_pipeline_http(url, question, seed, jitter_name, knobs=None):
+    body = json.dumps({"q": question, "seed": seed, "jitter": jitter_name, "knobs": knobs or {}}).encode("utf-8")
+    req  = urllib.request.Request(url, data=body, headers={"Content-Type":"application/json"})
+    with urllib.request.urlopen(req, timeout=90) as r:
+        return json.loads(r.read().decode("utf-8"))
+    # Expected:
+    # {
+    #   "answer_json": {"claim": str, "citations":[id,...], "constraints_echo":[...]?},
+    #   "retrieved_ids":[id,...]
+    # }
+
+def call_pipeline_local(question, seed, jitter_name, knobs=None):
+    # Replace with your local guarded call (Example 01/03). Stub here:
+    random.seed(seed)
+    return {"answer_json":{"claim": REFUSAL, "citations":[]}, "retrieved_ids":[]}
+
+# --------- experiment runner -------------------------------------------------
+def run_stability(gold_path, url=None, seeds="0,1,2,3,4", jitters="none,ws,punct,syn", out="runs/stability.jsonl"):
+    seeds = [int(x) for x in seeds.split(",") if x!=""]
+    jitters = [j for j in jitters.split(",") if j in JITTERS]
+    gold = [json.loads(l) for l in open(gold_path,encoding="utf8") if l.strip()]
+    os.makedirs(os.path.dirname(out) or ".", exist_ok=True)
+    f = open(out,"a",encoding="utf8")
+    for g in gold:
+        qid, q = g["qid"], g["question"]
+        for s, j in itertools.product(seeds, jitters):
+            qjit = JITTERS[j](q)
+            res = call_pipeline_http(url, qjit, s, j) if url else call_pipeline_local(qjit, s, j)
+            rec = {
+                "qid": qid,
+                "run_id": f"{qid}#seed={s};j={j}",
+                "seed": s,
+                "jitter": j,
+                "answer_json": res.get("answer_json",{}),
+                "retrieved_ids": res.get("retrieved_ids",[])
+            }
+            f.write(json.dumps(rec, ensure_ascii=False)+"\n")
+    f.close()
+
+# --------- scorer over stability.jsonl --------------------------------------
+def score_stability(gold_path, stability_path, gates=None):
+    gates = gates or {"acr":0.95,"cghc":0.95,"css":0.70,"ned50":0.20,"rcr":0.98}
+    gold = {g["qid"]: g for g in (json.loads(l) for l in open(gold_path,encoding="utf8") if l.strip())}
+    rows = [json.loads(l) for l in open(stability_path,encoding="utf8") if l.strip()]
+    by_q = {}
+    for r in rows:
+        by_q.setdefault(r["qid"], []).append(r)
+
+    agg = {"answerable":0,"unanswerable":0,"fail":0,"pass":0}
+    details = {}
+
+    for qid, runs in by_q.items():
+        g = gold.get(qid, {})
+        answ = bool(g.get("answerable"))
+        if answ: agg["answerable"]+=1
+        else: agg["unanswerable"]+=1
+
+        claims = [r.get("answer_json",{}).get("claim","") for r in runs]
+        cits   = [tuple(r.get("answer_json",{}).get("citations",[]) or []) for r in runs]
+        ret_ids= [set(r.get("retrieved_ids",[]) or []) for r in runs]
+        is_ref = [ (c or "").strip().lower()==REFUSAL for c in claims ]
+        # ACR/CGHC
+        C = [contains_substr(c, g.get("gold_claim_substr")) for c in claims]
+        H = [citation_hit(list(cits[i]), g.get("gold_citations"), list(ret_ids[i])) for i in range(len(runs))]
+        acr  = sum(1 for x in C if x)/max(len(runs),1)
+        cghc = sum(1 for x in H if x)/max(len(runs),1)
+        # CSS
+        inter = set(cits[0])
+        union = set(cits[0])
+        for s in cits[1:]:
+            inter &= set(s); union |= set(s)
+        css = (len(inter)/len(union)) if union else 1.0
+        # NED50 (pairwise)
+        claim_norm = [canon(c) for c in claims if c and (c.strip().lower()!=REFUSAL)]
+        dists = []
+        for i in range(len(claim_norm)):
+            for j in range(i+1, len(claim_norm)):
+                dists.append(ned(claim_norm[i], claim_norm[j]))
+        ned50 = percentiles(dists, (50,)).get(50, 0.0) if dists else 0.0
+        # RCR
+        maj = max(sum(1 for x in is_ref if x), sum(1 for x in is_ref if not x))
+        rcr = maj/max(len(is_ref),1)
+        # SCU
+        if g.get("constraints"):
+            ech = [tuple((r.get("answer_json",{}).get("constraints_echo") or [])) for r in runs]
+            scu_cons = 1 if all(set(e)==set(g["constraints"]) for e in ech) else 0
+        else:
+            scu_cons = None
+
+        # decide per QID
+        if answ:
+            ok = (acr>=gates["acr"] and cghc>=gates["cghc"] and css>=gates["css"] and ned50<=gates["ned50"] and (scu_cons in (None,1)))
+        else:
+            ok = (rcr>=gates["rcr"])
+        details[qid] = {"acr":round(acr,4),"cghc":round(cghc,4),"css":round(css,4),"ned50":round(ned50,4),"rcr":round(rcr,4),"scu_cons":scu_cons}
+        agg["pass" if ok else "fail"] += 1
+
+    # overall decision: must have zero fails
+    report = {"totals": agg, "gates": gates, "pass": agg["fail"]==0, "details": details}
+    return report
+
+def main():
+    ap = argparse.ArgumentParser()
+    ap.add_argument("--mode", choices=["run","score"], required=True)
+    ap.add_argument("--gold", required=True)
+    ap.add_argument("--stability", default="runs/stability.jsonl")
+    ap.add_argument("--http", default=None, help="http://localhost:8080/qa (optional)")
+    ap.add_argument("--seeds", default="0,1,2,3,4")
+    ap.add_argument("--jitters", default="none,ws,punct,syn")
+    ap.add_argument("--gates", default="acr=0.95,cghc=0.95,css=0.70,ned50=0.20,rcr=0.98")
+    args = ap.parse_args()
+
+    if args.mode=="run":
+        run_stability(args.gold, url=args.http, seeds=args.seeds, jitters=args.jitters, out=args.stability)
+        print(json.dumps({"ok":True,"wrote":args.stability}, indent=2))
+    else:
+        gates = dict((k,float(v)) for k,v in (kv.split("=") for kv in args.gates.split(",")))
+        print(json.dumps(score_stability(args.gold, args.stability, gates), indent=2))
+
+if __name__=="__main__":
+    main()
+```
+
+**Usage**
+
+```bash
+# 1) Run stability (local stub or your HTTP guard)
+python ProblemMap/eval/semantic_stability.py --mode run --gold ProblemMap/eval/gold.jsonl --http http://localhost:8080/qa --seeds 0,1,2,3,4 --jitters none,ws,punct,syn
+
+# 2) Score stability
+python ProblemMap/eval/semantic_stability.py --mode score --gold ProblemMap/eval/gold.jsonl --stability runs/stability.jsonl \
+  --gates acr=0.95,cghc=0.95,css=0.70,ned50=0.20,rcr=0.98 | tee eval/stability.json
+```
+
+**Output (example)**
+
+```json
+{
+  "totals": {"answerable": 20, "unanswerable": 10, "fail": 0, "pass": 30},
+  "gates": {"acr":0.95,"cghc":0.95,"css":0.7,"ned50":0.2,"rcr":0.98},
+  "pass": true,
+  "details": { "A0001": {"acr":1.0,"cghc":1.0,"css":0.75,"ned50":0.08,"rcr":1.0,"scu_cons":1}, "...": "..." }
+}
+```
+
+---
+
+## 5) CI Wiring (copy/paste)
+
+```bash
+# run the sweep once a day or per PR (small seeds/jitters)
+python ProblemMap/eval/semantic_stability.py --mode run --gold ProblemMap/eval/gold.jsonl --http http://localhost:8080/qa \
+  --seeds 0,1,2 --jitters none,ws,syn
+
+python ProblemMap/eval/semantic_stability.py --mode score --gold ProblemMap/eval/gold.jsonl --stability runs/stability.jsonl \
+  --gates acr=0.95,cghc=0.95,css=0.70,ned50=0.20,rcr=0.98 > eval/stability.json
+
+jq -e '.pass == true' eval/stability.json
+```
+
+---
+
+## 6) Troubleshooting Map
+
+* **Low ACR/CGHC** (answerable): Retrieval context unstable → use **intersection + knee** and shrink chunks to collocate constraints. See *RAG Semantic Drift* + *Vector Store Fragmentation*.
+* **Low CSS**: Multiple near-duplicate evidence ids; stabilize re-ranker and cap rerank depth.
+* **High NED** with good ACR/CGHC: Wording variance only; OK if gates passed.
+* **Low RCR** (unanswerable): Refusal token drifting → enforce exact token in prompt; tighten acceptance gate.
+* **SCU-Cons=0**: Constraint echo not re-injected; apply *Symbolic Constraint Unlock* pattern.
+
+---
+
+
+### 🔗 Quick-Start Downloads (60 sec)
+
+| Tool | Link | 3-Step Setup |
+|------|------|--------------|
+| **WFGY 1.0 PDF** | [Engine Paper](https://github.com/onestardao/WFGY/blob/main/I_am_not_lizardman/WFGY_All_Principles_Return_to_One_v1.0_PSBigBig_Public.pdf) | 1️⃣ Download · 2️⃣ Upload to your LLM · 3️⃣ Ask “Answer using WFGY + \<your question>” |
+| **TXT OS (plain-text OS)** | [TXTOS.txt](https://github.com/onestardao/WFGY/blob/main/OS/TXTOS.txt) | 1️⃣ Download · 2️⃣ Paste into any LLM chat · 3️⃣ Type “hello world” — OS boots instantly |
+
+---
+
+### 🧭 Explore More
+
+| Module                | Description                                              | Link     |
+|-----------------------|----------------------------------------------------------|----------|
+| WFGY Core             | WFGY 2.0 engine is live: full symbolic reasoning architecture and math stack | [View →](https://github.com/onestardao/WFGY/tree/main/core/README.md) |
+| Problem Map 1.0       | Initial 16-mode diagnostic and symbolic fix framework    | [View →](https://github.com/onestardao/WFGY/tree/main/ProblemMap/README.md) |
+| Problem Map 2.0       | RAG-focused failure tree, modular fixes, and pipelines   | [View →](https://github.com/onestardao/WFGY/blob/main/ProblemMap/rag-architecture-and-recovery.md) |
+| Semantic Clinic Index | Expanded failure catalog: prompt injection, memory bugs, logic drift | [View →](https://github.com/onestardao/WFGY/blob/main/ProblemMap/SemanticClinicIndex.md) |
+| Semantic Blueprint    | Layer-based symbolic reasoning & semantic modulations   | [View →](https://github.com/onestardao/WFGY/tree/main/SemanticBlueprint/README.md) |
+| Benchmark vs GPT-5    | Stress test GPT-5 with full WFGY reasoning suite         | [View →](https://github.com/onestardao/WFGY/tree/main/benchmarks/benchmark-vs-gpt5/README.md) |
+| 🧙‍♂️ Starter Village 🏡 | New here? Lost in symbols? Click here and let the wizard guide you through | [Start →](https://github.com/onestardao/WFGY/blob/main/StarterVillage/README.md) |
+
+---
+
+> 👑 **Early Stargazers: [See the Hall of Fame](https://github.com/onestardao/WFGY/tree/main/stargazers)** —  
+> Engineers, hackers, and open source builders who supported WFGY from day one.
+
+> <img src="https://img.shields.io/github/stars/onestardao/WFGY?style=social" alt="GitHub stars"> ⭐ [WFGY Engine 2.0](https://github.com/onestardao/WFGY/blob/main/core/README.md) is already unlocked. ⭐ Star the repo to help others discover it and unlock more on the [Unlock Board](https://github.com/onestardao/WFGY/blob/main/STAR_UNLOCKS.md).
+
+<div align="center">
+
+[![WFGY Main](https://img.shields.io/badge/WFGY-Main-red?style=flat-square)](https://github.com/onestardao/WFGY)
+&nbsp;
+[![TXT OS](https://img.shields.io/badge/TXT%20OS-Reasoning%20OS-orange?style=flat-square)](https://github.com/onestardao/WFGY/tree/main/OS)
+&nbsp;
+[![Blah](https://img.shields.io/badge/Blah-Semantic%20Embed-yellow?style=flat-square)](https://github.com/onestardao/WFGY/tree/main/OS/BlahBlahBlah)
+&nbsp;
+[![Blot](https://img.shields.io/badge/Blot-Persona%20Core-green?style=flat-square)](https://github.com/onestardao/WFGY/tree/main/OS/BlotBlotBlot)
+&nbsp;
+[![Bloc](https://img.shields.io/badge/Bloc-Reasoning%20Compiler-blue?style=flat-square)](https://github.com/onestardao/WFGY/tree/main/OS/BlocBlocBloc)
+&nbsp;
+[![Blur](https://img.shields.io/badge/Blur-Text2Image%20Engine-navy?style=flat-square)](https://github.com/onestardao/WFGY/tree/main/OS/BlurBlurBlur)
+&nbsp;
+[![Blow](https://img.shields.io/badge/Blow-Game%20Logic-purple?style=flat-square)](https://github.com/onestardao/WFGY/tree/main/OS/BlowBlowBlow)
+&nbsp;
+</div>
+
+
+