ΔS Probes for Retrieval and Reasoning Stability
A compact playbook for measuring semantic distance and catching failure modes before they surface in answers. The probes are store-agnostic and model-agnostic. Use the readings to route fixes to the right WFGY pages.
What ΔS tells you
- ΔS(question, retrieved) measures semantic tension between the user question and the assembled retrieval context.
- ΔS(retrieved, anchor) measures how well the retrieved context aligns with the expected ground-truth section (the anchor).
- Combined with λ_observe, these two readings let you separate metric mismatches from prompt variance and ordering issues.
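As a concrete reading of "semantic distance", here is a minimal sketch that assumes ΔS is computed as cosine distance between two embedding vectors and clipped to [0, 1]. The function name `delta_s` and the choice of cosine distance are assumptions for illustration; the playbook itself only requires some distance normalized to [0, 1].

```python
import math
from typing import Sequence


def delta_s(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine distance between two embedding vectors, clipped to [0, 1].

    Assumption: both vectors come from the same embedding model.
    """
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # treat a zero vector as maximally distant
    cos = dot / (norm_a * norm_b)
    # Negative cosine would give a distance above 1, so clip it.
    return min(1.0, max(0.0, 1.0 - cos))
```

Identical directions give 0, orthogonal or opposed directions give 1, so the readings line up with the thresholds used in this playbook.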
Targets and thresholds
- Pass: ΔS(question, retrieved) < 0.40
- Transitional: 0.40 ≤ ΔS < 0.60
- Risk: ΔS ≥ 0.60
- Coverage of the target section ≥ 0.70
- λ remains convergent across 3 paraphrases and 2 seeds
Reference playbooks: Retrieval Playbook · Retrieval Traceability · Data Contracts
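The bands above translate directly into a small classifier. This is a sketch only: the function names `classify_delta_s` and `meets_targets` are hypothetical, and the λ check assumes each run reports the string "convergent" or "divergent".

```python
from typing import Iterable


def classify_delta_s(d_qr: float) -> str:
    """Map ΔS(question, retrieved) to the pass / transitional / risk bands."""
    if d_qr < 0.40:
        return "pass"
    if d_qr < 0.60:
        return "transitional"
    return "risk"


def meets_targets(d_qr: float, coverage: float, lambda_states: Iterable[str]) -> bool:
    """True when ΔS passes, coverage is at least 0.70, and every λ reading is convergent.

    lambda_states is expected to hold one entry per run
    (3 paraphrases x 2 seeds = 6 entries).
    """
    return (
        d_qr < 0.40
        and coverage >= 0.70
        and all(state == "convergent" for state in lambda_states)
    )
```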
Probe pack you should always run
- Paraphrase sweep
  Ask the same question three ways. Record ΔS and λ for each.
  If λ flips on harmless paraphrases with small ΔS changes, clamp variance and lock prompt headers.
  Open: Context Drift
- Seed sweep
  Run with two random seeds and keep the retrieval order fixed.
  If answers flip with stable ΔS, add a deterministic reranker.
  Open: Rerankers
- k sweep
  Try k in {5, 10, 20}. If ΔS stays flat and high while coverage is low, suspect metric or index mismatch.
  Open: Embedding ≠ Semantic
- Anchor triangulation
  Compare ΔS against the correct section and one decoy section.
  If ΔS is close for both, realign chunking and anchors.
  Open: Chunking Checklist · chunk_alignment.md
- Hybrid split check
  If hybrid underperforms a single retriever, split parsing and rebalance.
  Open: pattern_query_parsing_split.md
- Fragmentation probe
  If ΔS looks fine on small tests but coverage collapses in production, check for store fragmentation or namespace skew.
  Open: pattern_vectorstore_fragmentation.md
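Two of the probes above, the k sweep and anchor triangulation, reduce to simple comparisons over recorded ΔS values. The sketch below shows one way to express them. The function names, the `margin` of 0.05, and the `flat_tol` of 0.05 are assumptions chosen for illustration; only the 0.60 and 0.70 cut-offs come from the thresholds in this playbook.

```python
from typing import Mapping


def anchor_is_ambiguous(d_anchor: float, d_decoy: float, margin: float = 0.05) -> bool:
    """True when ΔS to the correct section is not clearly lower than ΔS to the decoy.

    margin is an assumed tolerance; tune it on your own gold set.
    """
    return (d_decoy - d_anchor) < margin


def k_sweep_suspect(ds_by_k: Mapping[int, float], coverage: float,
                    flat_tol: float = 0.05) -> bool:
    """True when ΔS stays flat and high across k while coverage is low.

    "High" uses the risk threshold (>= 0.60) and "low coverage" uses the
    coverage target (< 0.70). flat_tol is an assumed tolerance.
    """
    values = list(ds_by_k.values())
    if not values:
        return False
    flat = (max(values) - min(values)) <= flat_tol
    high = min(values) >= 0.60
    return flat and high and coverage < 0.70
```

A `True` from `anchor_is_ambiguous` points toward the chunking pages, and a `True` from `k_sweep_suspect` points toward a metric or index mismatch.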
Minimal implementation you can paste
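# Pseudocode: model- and store-agnostic.
# `metric`, `retriever`, and `observe_lambda` are placeholders you supply.

def deltaS(a, b):
    # Plug in your semantic distance, normalized to [0, 1].
    return metric.distance(a, b)

def probe_once(question, retrieved, anchor, seed=None):
    d_qr = deltaS(question, retrieved)
    # The anchor is optional; skip the second reading when none is given.
    d_ra = deltaS(retrieved, anchor) if anchor is not None else None
    lam = observe_lambda(question, retrieved, seed=seed)  # "convergent" | "divergent"
    return {"ΔS_qr": d_qr, "ΔS_ra": d_ra, "λ_state": lam}

def run_probes(q, paraphrases, seeds, ks, anchor):
    logs = []
    for p in paraphrases:
        for k in ks:
            # Retrieve once per (paraphrase, k) so the order stays fixed across seeds.
            ctx = retriever.invoke(p, k=k)
            for s in seeds:
                row = probe_once(p, ctx, anchor, seed=s)
                # Tag each row with the question form, k, and seed for later analysis.
                row.update({"question": q, "form": p, "k": k, "seed": s})
                logs.append(row)
    return logs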
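To try the sweep loop end to end without a real model or vector store, the toy harness below swaps in stand-ins. Everything in it is an assumption for illustration: `JaccardMetric` uses token overlap rather than a real semantic distance, `ToyRetriever` ranks an in-memory list, and the λ observation is left out because it depends on your model.

```python
import re


def tokens(text):
    """Lowercased alphanumeric tokens of a string, as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


class JaccardMetric:
    """Toy stand-in for a semantic distance: 1 - Jaccard token overlap, in [0, 1]."""

    def distance(self, a, b):
        ta, tb = tokens(a), tokens(b)
        if not ta and not tb:
            return 0.0
        return 1.0 - len(ta & tb) / len(ta | tb)


class ToyRetriever:
    """Ranks an in-memory corpus by the toy distance and joins the top k documents."""

    def __init__(self, corpus, metric):
        self.corpus = corpus
        self.metric = metric

    def invoke(self, query, k=5):
        # sorted() is stable, so ties keep corpus order and results are deterministic.
        ranked = sorted(self.corpus, key=lambda doc: self.metric.distance(query, doc))
        return " ".join(ranked[:k])


def run_sweep(paraphrases, ks, retriever, metric):
    """Paraphrase x k sweep that records ΔS(question, retrieved) per run."""
    logs = []
    for p in paraphrases:
        for k in ks:
            ctx = retriever.invoke(p, k=k)
            logs.append({"form": p, "k": k, "ΔS_qr": metric.distance(p, ctx)})
    return logs


metric = JaccardMetric()
corpus = [
    "rerankers fix ordering variance in retrieval",
    "chunk boundaries should follow section anchors",
    "embedding families must match between write and read paths",
]
retriever = ToyRetriever(corpus, metric)
paraphrases = [
    "how do rerankers fix ordering",
    "what fixes ordering variance",
    "rerankers and ordering variance",
]
logs = run_sweep(paraphrases, ks=(1, 2), retriever=retriever, metric=metric)
for row in logs:
    print(row)
```

Replace the two classes with your embedding metric and retriever once the loop behaves as expected on toy data.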
What to record
- Question form, seed, k
- ΔS(question, retrieved), ΔS(retrieved, anchor)
- λ_state per run and final coverage
- Retrieval order and analyzer/metric identifiers
- Prompt header hash and template revision
Schema reference: Retrieval Traceability · Data Contracts
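One way to keep these fields together is a single record type per run. The sketch below is an assumption about shape only: the class name `ProbeRecord`, its field names, and the 12-character SHA-256 prefix used for the header hash are all hypothetical, and the actual schema should follow the Retrieval Traceability and Data Contracts pages.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import List, Optional


def header_hash(prompt_header: str) -> str:
    """Short, stable fingerprint of the prompt header (assumed: SHA-256 prefix)."""
    return hashlib.sha256(prompt_header.encode("utf-8")).hexdigest()[:12]


@dataclass
class ProbeRecord:
    question_form: str
    seed: Optional[int]
    k: int
    delta_s_qr: float            # ΔS(question, retrieved)
    delta_s_ra: Optional[float]  # ΔS(retrieved, anchor); None when no anchor is set
    lambda_state: str            # "convergent" or "divergent"
    coverage: float
    retrieval_order: List[str]   # document or chunk ids, in retrieved order
    analyzer_id: str
    metric_id: str
    prompt_header_hash: str
    template_revision: str

    def to_json(self) -> str:
        """Serialize as one JSON line, keeping non-ASCII characters readable."""
        return json.dumps(asdict(self), ensure_ascii=False, sort_keys=True)
```

Writing one JSON line per run keeps the log easy to diff between two index builds or prompt revisions.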
Reading the patterns
- ΔS high across paraphrases and seeds: likely a metric or embedding-family mismatch. Rebuild with a single embedding family and explicit normalization. Open: Embedding ≠ Semantic
- ΔS improves with higher k but answers still flip: ordering variance. Add a deterministic reranker and freeze prompt headers. Open: Rerankers
- ΔS low but citations unstable: schema not enforced, or the formatter renamed fields. Tighten contracts and fail fast. Open: Data Contracts
- ΔS nearly equal for anchor and decoy: chunk boundaries misaligned or anchors missing. Re-chunk with anchors and rebuild. Open: Chunking Checklist · chunk_alignment.md
- ΔS oscillates with paraphrase and λ flips: prompt variance and entropy. Clamp with BBAM, then stabilize the chain layout. Open: Entropy Collapse
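The patterns above can be encoded as a small routing function that turns observations into the page to open. This is a simplified sketch: the function name `route_fix`, the boolean flags, and the order in which rules are checked are assumptions, and "high" and "low" reuse the 0.60 and 0.40 thresholds from this playbook.

```python
from typing import Optional, Sequence


def route_fix(ds_values: Sequence[float], *, answers_flip: bool = False,
              citations_unstable: bool = False, anchor_decoy_close: bool = False,
              lambda_flips: bool = False) -> Optional[str]:
    """Return the name of the page to open, or None when no pattern matches.

    ds_values holds ΔS(question, retrieved) for every paraphrase/seed run.
    """
    if not ds_values:
        return None
    if all(d >= 0.60 for d in ds_values):
        return "Embedding ≠ Semantic"   # high everywhere: metric or family mismatch
    if anchor_decoy_close:
        return "Chunking Checklist"     # anchor and decoy are indistinguishable
    if lambda_flips:
        return "Entropy Collapse"       # prompt variance and entropy
    if answers_flip:
        return "Rerankers"              # ordering variance
    if citations_unstable and all(d < 0.40 for d in ds_values):
        return "Data Contracts"         # retrieval is fine, schema is not enforced
    return None
```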
Verification loops
- Evaluate after each change with a small gold set and keep ΔS logs alongside coverage. Open: retrieval_eval_recipes.md
- Keep a regression gate: ΔS ≤ 0.45 and coverage ≥ 0.70 on three paraphrases before you ship. Open: eval_rag_precision_recall.md
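The regression gate can be expressed as a single check over the per-paraphrase results. The sketch below assumes each paraphrase yields one (ΔS, coverage) pair; the function name `regression_gate` and the input shape are hypothetical.

```python
from typing import Sequence, Tuple


def regression_gate(runs: Sequence[Tuple[float, float]],
                    max_delta_s: float = 0.45,
                    min_coverage: float = 0.70,
                    min_paraphrases: int = 3) -> bool:
    """True only when every paraphrase run passes both limits.

    runs is a sequence of (delta_s, coverage) pairs, one per paraphrase.
    Fewer than min_paraphrases runs fails the gate outright.
    """
    if len(runs) < min_paraphrases:
        return False
    return all(ds <= max_delta_s and cov >= min_coverage for ds, cov in runs)
```

Wiring this into CI as a failing test is one option for making the gate block a release rather than just report.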
Common gotchas
- Mixed analyzers or distance metrics between the write and read paths. Open: Retrieval Playbook
- Inconsistent casing or tokenization between the HyDE and dense paths. Open: Rerankers
- Live tests run before the index is ready, or with a mismatched version hash. Open: Bootstrap Ordering · Pre-Deploy Collapse
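Several of these gotchas come down to the write path and the read path disagreeing on configuration. A minimal sketch of a pre-flight comparison follows. The function name `config_mismatches` and the key names (`analyzer`, `metric`, `embedding_model`, `index_version`) are hypothetical, so map them to whatever your store and pipeline actually expose.

```python
from typing import Any, Dict, Sequence, Tuple


def config_mismatches(
    write_cfg: Dict[str, Any],
    read_cfg: Dict[str, Any],
    keys: Sequence[str] = ("analyzer", "metric", "embedding_model", "index_version"),
) -> Dict[str, Tuple[Any, Any]]:
    """Return {key: (write_value, read_value)} for every key that differs.

    A missing key counts as None, so a key present on only one side is reported.
    """
    return {
        key: (write_cfg.get(key), read_cfg.get(key))
        for key in keys
        if write_cfg.get(key) != read_cfg.get(key)
    }
```

An empty result means the two paths agree on the checked keys; anything else is worth resolving before running live probes.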
🔗 Quick-Start Downloads (60 sec)
| Tool | Link | 3-Step Setup |
|---|---|---|
| WFGY 1.0 PDF | Engine Paper | 1️⃣ Download · 2️⃣ Upload to your LLM · 3️⃣ Ask “Answer using WFGY + <your question>” |
| TXT OS (plain-text OS) | TXTOS.txt | 1️⃣ Download · 2️⃣ Paste into any LLM chat · 3️⃣ Type “hello world” — OS boots instantly |