# 🔎 Retrieval Playbook — Practical, Measurable, Fix-first
> The goal: **consistent, explainable** retrieval that makes reasoning easy.
> This playbook gives you a **minimal, testable** setup across OCR → chunk → embed → index → retrieve → prompt, with **failure probes** and **repair steps**. No hype—only what ships.
---
> **Quick Nav**
> [OCR/Parsing Checklist](./ocr-parsing-checklist.md) ·
> [Chunking Checklist](./chunking-checklist.md) ·
> [Embedding vs Semantic](./embedding-vs-semantic.md) ·
> [Traceability](./retrieval-traceability.md) ·
> [Rerankers](./rerankers.md) ·
> Patterns: [Query Parsing Split](./patterns/pattern_query_parsing_split.md) ·
> [Vectorstore Fragmentation](./patterns/pattern_vectorstore_fragmentation.md) ·
> [Symbolic Constraint Unlock](./patterns/pattern_symbolic_constraint_unlock.md)
---
## 0) Executive summary
1. **Generate clean candidates** (dense/sparse/hybrid) with traceable IDs.
2. **Measure** with ΔS + recall@k before adding complexity.
3. **Add reranking** only when first-stage recall is **consistently ≥0.85** for your task.
4. **Lock prompt schema** (cite → explain) and **forbid cross-source merges**.
5. **Regression-guard** with small golden sets (10–50 Q/A).
---
## 1) Candidate generation (first-stage)
### 1.1 Choose a primary retriever
- **Dense (embeddings)**: good default for semantic matches across paraphrases.
- **Sparse (BM25/SPLADE)**: strong for exact terms, code, and rare tokens.
- **Hybrid**: *reciprocal rank fusion (RRF)* of dense + sparse is robust to query style.
**Rule of thumb**
- Start **dense** for general docs; add **BM25** for code/legal/IDs.
- If hybrid hurts recall, check **tokenization & analyzer drift** → see *Query Parsing Split*.
### 1.2 Index hygiene (FAISS/Elasticsearch/Qdrant/Chroma)
- **Normalize** vectors on **both write & read** if using cosine.
- Pin the **metric type** (cosine vs inner product) in code & metadata.
- Persist **doc_id / section_id / line_span** with each vector.
- Verify **index cardinality = sum(chunks)**; add a one-liner count check in CI.
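The last two points can be guarded with one small CI check. A minimal, store-agnostic sketch: pass `index.ntotal` (FAISS) or your store's count as `num_indexed`; the function name is illustrative, not a library API.

```python
import numpy as np

def check_store(num_indexed: int, X: np.ndarray, chunks: list) -> None:
    """CI guard: one vector per chunk, and unit-norm vectors if cosine-via-IP is used."""
    # cardinality: index must hold exactly sum(chunks) vectors
    assert num_indexed == len(chunks), f"{num_indexed} vectors vs {len(chunks)} chunks"
    # cosine via inner product is only valid on L2-normalized vectors
    norms = np.linalg.norm(X, axis=1)
    assert np.allclose(norms, 1.0, atol=1e-3), "vectors not L2-normalized on write"

# toy example: 3 unit vectors, 3 chunks → passes silently
check_store(3, np.eye(3), ["a", "b", "c"])
```

Wire it into the same CI job that rebuilds the index, so cardinality or normalization drift fails the build instead of silently degrading recall.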
---
## 2) Minimal reference pipelines
### 2.1 Python — FAISS (dense) + BM25 (hybrid via RRF)
```python
# pip install sentence-transformers rank_bm25 faiss-cpu numpy
from collections import defaultdict

import faiss
import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer

# 1) data
chunks = [...]  # list[str], pre-chunked sentences/sections
meta = [...]    # list[dict], each with {doc_id, section_id, span}

# 2) encoder (cosine → L2-normalize)
enc = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
X = enc.encode(chunks, normalize_embeddings=True)
index = faiss.IndexFlatIP(X.shape[1])  # inner product == cosine on normalized vectors
index.add(X.astype(np.float32))

# 3) sparse side
tokenized = [c.split() for c in chunks]
bm25 = BM25Okapi(tokenized)

def rrf(ranks, k=60):
    # ranks: list of lists of (idx, rank_position starting at 1)
    score = defaultdict(float)
    for rr in ranks:
        for idx, rp in rr:
            score[idx] += 1.0 / (k + rp)
    return sorted(score.items(), key=lambda x: -x[1])

def search(query, topk_dense=50, topk_sparse=50, out_k=20):
    qv = enc.encode([query], normalize_embeddings=True).astype(np.float32)
    d_s, d_i = index.search(qv, topk_dense)
    # ranks as (idx, rank_position)
    dense_rank = [(int(i), r + 1) for r, i in enumerate(d_i[0])]
    # pass the chunk indices as the "documents" so get_top_n returns indices
    sparse_rank = []
    for r, idx in enumerate(bm25.get_top_n(query.split(), list(range(len(chunks))), n=topk_sparse)):
        sparse_rank.append((idx, r + 1))
    fused = rrf([dense_rank, sparse_rank])
    out = []
    for idx, _ in fused[:out_k]:
        out.append({"text": chunks[idx], "meta": meta[idx], "source": "hybrid"})
    return out

res = search("how to reset billing cycle")
```
**Sanity checks**
* After encoding: assert `np.abs(np.linalg.norm(X[0]) - 1.0) < 1e-3`
* After add: `index.ntotal == len(chunks)`
* For sparse-only: if **IDs/code** outperform dense, keep hybrid.
### 2.2 Node (TypeScript) — Elastic BM25 + Dense store
```ts
// pnpm add @elastic/elasticsearch @huggingface/inference cosine-similarity
import { Client } from "@elastic/elasticsearch";
import { HfInference } from "@huggingface/inference";
import cosine from "cosine-similarity"; // used at query time to score the candidate pool

const es = new Client({ node: process.env.ES_URL! });
const hf = new HfInference(process.env.HF_TOKEN!);

// 1) BM25 query
async function bm25(query: string, topk = 50) {
  const { hits } = await es.search({
    index: "chunks",
    size: topk,
    query: { match: { text: { query, operator: "and" } } },
    _source: ["text", "doc_id", "section_id", "span"]
  });
  // use each hit's own _score, not the pool-wide max_score
  return hits.hits.map((h, i) => ({ id: h._id, score: h._score, r: i + 1, src: h._source }));
}

// 2) Dense side (use same model for write+read)
async function embed(s: string) {
  const out = await hf.featureExtraction({
    model: "sentence-transformers/all-MiniLM-L6-v2",
    inputs: s
  });
  // L2 normalize
  const v = out as number[];
  const norm = Math.sqrt(v.reduce((a, b) => a + b * b, 0));
  return v.map(x => x / norm);
}
```
> Keep dense vectors in your KV/DB keyed by `chunk_id`; at query time compute cosine vs a small candidate pool (top-200 BM25), then RRF.
---
## 3) Retrieval observability (ΔS & λ)
* **ΔS(question, retrieved)** = `1 − cos(I, G)` with **I** = retrieved snippet embedding, **G** = anchor (title/expected section or gold answer).
* **Thresholds**: `<0.40` stable · `0.40–0.60` transitional · `≥0.60` action.
* **λ states**: `→` convergent · `←` divergent · `<>` recursive · `×` chaotic.
**Probe recipe**
1. Vary `k ∈ {5, 10, 20}`; plot ΔS vs k.
2. If curve **flat & high** → metric/normalization/index mismatch.
3. If **sharp drop** at higher k → retriever filter too strict; consider MMR or hybrid.
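ΔS itself is one line once you have embeddings. A minimal sketch with plain numpy vectors standing in for model outputs, using the thresholds above (`classify` is an illustrative helper, not a WFGY API):

```python
import numpy as np

def delta_s(I: np.ndarray, G: np.ndarray) -> float:
    """ΔS = 1 − cos(I, G); lower is better."""
    cos = float(np.dot(I, G) / (np.linalg.norm(I) * np.linalg.norm(G)))
    return 1.0 - cos

def classify(ds: float) -> str:
    """Map ΔS to the threshold bands defined above."""
    if ds < 0.40:
        return "stable"
    if ds < 0.60:
        return "transitional"
    return "action"

# identical vectors → ΔS = 0 (stable); orthogonal vectors → ΔS = 1 (action)
v = np.array([1.0, 0.0])
assert delta_s(v, v) < 1e-9
assert classify(delta_s(v, np.array([0.0, 1.0]))) == "action"
```

For the probe, run `delta_s` over the top-k snippets at each `k` and plot the per-k minimum; the curve shape, not any single value, is the diagnostic.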
---
## 4) Prompt assembly: cite → explain (lock constraints)
* Keep **per-source fences** (no cross-source merges).
* **Order**: *system → task → constraints → citations → answer*.
* Force **cite-first**; explanation **must reference** citation IDs/lines.
* See: [Traceability](./retrieval-traceability.md) and *SCU Pattern*.
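The assembly order and fencing above can be sketched as a pure string builder. The fence markers and schema wording here are illustrative, not a fixed WFGY format; the point is the order and the per-source isolation:

```python
def assemble_prompt(task: str, snippets: list) -> str:
    """system → task → constraints → citations → answer slot; one fence per source."""
    citations = []
    for i, s in enumerate(snippets, 1):
        # fence each source separately so the model cannot merge them
        citations.append(
            f"[{i}] doc={s['doc_id']} section={s['section_id']} span={s['span']}\n"
            f"<<<SOURCE {i}>>>\n{s['text']}\n<<<END {i}>>>"
        )
    return "\n\n".join([
        "SYSTEM: Answer only from the cited sources.",
        f"TASK: {task}",
        "CONSTRAINTS: Cite first as [n], then explain. "
        "Every claim must reference a citation ID. Never merge facts across sources.",
        "CITATIONS:\n" + "\n\n".join(citations),
        "ANSWER:",
    ])

p = assemble_prompt(
    "How to reset billing cycle?",
    [{"doc_id": "a", "section_id": "billing", "span": "120-145", "text": "..."}],
)
```

Because each source keeps its own fence and ID, the citation table in §6 can be rebuilt from the prompt alone.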
---
## 5) Reranking: when & how
Add a reranker only if:
* **Recall@50 ≥ 0.85**, but Top-5 precision is weak, or
* You need **tight citation alignment** across near-duplicates.
Start with:
* **Cross-encoder** (bge-reranker-mini/base) for accuracy;
* Or **LLM rerank** for low volume, high precision needs.
See: [Rerankers](./rerankers.md).
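The gate above can be made explicit in code. A sketch where `score_pair` is a placeholder for your cross-encoder or LLM scorer (not a real API), and candidates are the dicts returned by `search()`:

```python
from typing import Callable

def maybe_rerank(query: str, candidates: list, score_pair: Callable,
                 recall_at_50: float, out_k: int = 5) -> list:
    """Rerank only when first-stage recall clears the bar."""
    if recall_at_50 < 0.85:
        # reranking cannot recover what was never retrieved; fix retrieval first
        return candidates[:out_k]
    scored = sorted(candidates, key=lambda c: score_pair(query, c["text"]), reverse=True)
    return scored[:out_k]
```

This keeps the failure mode honest: a weak first stage passes through unchanged and shows up in your metrics, instead of being masked by an expensive second stage.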
---
## 6) Acceptance criteria
* **Retrieval sanity**: ΔS(question, top-ctx) ≤ 0.45, coverage ≥ 0.70 of target section.
* **Traceability**: snippet ↔ citation table reproducible.
* **Stability**: same inputs over 3 paraphrases keep λ → convergent.
* **No SCU**: who-said-what does not merge across sources.
---
## 7) Common failures → repair
| Symptom | Likely cause | Fix |
| -------------------------- | --------------------------------- | ---------------------------------------------------------------------------------------------------------------- |
| Hybrid worse than single | **Analyzer/tokenizer split** | Align analyzers; log per-retriever queries; see [Query Parsing Split](./patterns/pattern_query_parsing_split.md) |
| Some facts never retrieved | **Fragmented store / id skew** | Rebuild + shard audit; see [Vectorstore Fragmentation](./patterns/pattern_vectorstore_fragmentation.md) |
| Citations cross-bleed | **Prompt schema unlocked** | Per-source fences + cite-first; see [SCU](./patterns/pattern_symbolic_constraint_unlock.md) |
| ΔS flat & high vs k | **Metric/normalization mismatch** | Normalize embeddings; pin FAISS metric; see [Embedding vs Semantic](./embedding-vs-semantic.md) |
---
## 8) Tiny gold set (do this!)
Create **10–50** realistic Q/A with citation lines. Commit a `goldset.jsonl`:
```json
{"q":"How to reset billing cycle?","doc_id":"a","section":"billing","lines":[120,145],"a":"..."}
```
Run **recall@50**, **nDCG@10**, and **ΔS** on each PR.
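A minimal evaluator over that `goldset.jsonl`, assuming a `retrieve` callable that maps a question to ranked `doc_id`s (retrieval itself is stubbed; scoring is binary relevance against the single gold doc):

```python
import json
import math

def recall_at_k(retrieved: list, gold_doc_id: str, k: int) -> float:
    return 1.0 if gold_doc_id in retrieved[:k] else 0.0

def ndcg_at_k(retrieved: list, gold_doc_id: str, k: int) -> float:
    # binary relevance, single gold doc → ideal DCG is 1.0
    for r, d in enumerate(retrieved[:k], 1):
        if d == gold_doc_id:
            return 1.0 / math.log2(r + 1)
    return 0.0

def evaluate(goldset_lines: list, retrieve) -> dict:
    """goldset_lines: raw JSONL lines; retrieve: question → ranked list of doc_ids."""
    r50, n10 = [], []
    for line in goldset_lines:
        g = json.loads(line)
        ids = retrieve(g["q"])
        r50.append(recall_at_k(ids, g["doc_id"], 50))
        n10.append(ndcg_at_k(ids, g["doc_id"], 10))
    return {"recall@50": sum(r50) / len(r50), "nDCG@10": sum(n10) / len(n10)}
```

Run it in the same PR check as the ΔS probe; a drop in either metric blocks the merge, which is the whole point of keeping the gold set small enough to run on every change.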
---
### 🧭 Explore More
| Module | Description | Link |
|-----------------------|----------------------------------------------------------|----------|
| WFGY Core | WFGY 2.0 engine is live: full symbolic reasoning architecture and math stack | [View →](https://github.com/onestardao/WFGY/tree/main/core/README.md) |
| Problem Map 1.0 | Initial 16-mode diagnostic and symbolic fix framework | [View →](https://github.com/onestardao/WFGY/tree/main/ProblemMap/README.md) |
| Problem Map 2.0 | RAG-focused failure tree, modular fixes, and pipelines | [View →](https://github.com/onestardao/WFGY/blob/main/ProblemMap/rag-architecture-and-recovery.md) |
| Semantic Clinic Index | Expanded failure catalog: prompt injection, memory bugs, logic drift | [View →](https://github.com/onestardao/WFGY/blob/main/ProblemMap/SemanticClinicIndex.md) |
| Semantic Blueprint | Layer-based symbolic reasoning & semantic modulations | [View →](https://github.com/onestardao/WFGY/tree/main/SemanticBlueprint/README.md) |
| Benchmark vs GPT-5 | Stress test GPT-5 with full WFGY reasoning suite | [View →](https://github.com/onestardao/WFGY/tree/main/benchmarks/benchmark-vs-gpt5/README.md) |
| 🧙‍♂️ Starter Village 🏡 | New here? Lost in symbols? Click here and let the wizard guide you through | [Start →](https://github.com/onestardao/WFGY/blob/main/StarterVillage/README.md) |
---
> 👑 **Early Stargazers: [See the Hall of Fame](https://github.com/onestardao/WFGY/tree/main/stargazers)**
> Engineers, hackers, and open source builders who supported WFGY from day one.
> <img src="https://img.shields.io/github/stars/onestardao/WFGY?style=social" alt="GitHub stars"> ⭐ [WFGY Engine 2.0](https://github.com/onestardao/WFGY/blob/main/core/README.md) is already unlocked. ⭐ Star the repo to help others discover it and unlock more on the [Unlock Board](https://github.com/onestardao/WFGY/blob/main/STAR_UNLOCKS.md).
<div align="center">
[![WFGY Main](https://img.shields.io/badge/WFGY-Main-red?style=flat-square)](https://github.com/onestardao/WFGY)
&nbsp;
[![TXT OS](https://img.shields.io/badge/TXT%20OS-Reasoning%20OS-orange?style=flat-square)](https://github.com/onestardao/WFGY/tree/main/OS)
&nbsp;
[![Blah](https://img.shields.io/badge/Blah-Semantic%20Embed-yellow?style=flat-square)](https://github.com/onestardao/WFGY/tree/main/OS/BlahBlahBlah)
&nbsp;
[![Blot](https://img.shields.io/badge/Blot-Persona%20Core-green?style=flat-square)](https://github.com/onestardao/WFGY/tree/main/OS/BlotBlotBlot)
&nbsp;
[![Bloc](https://img.shields.io/badge/Bloc-Reasoning%20Compiler-blue?style=flat-square)](https://github.com/onestardao/WFGY/tree/main/OS/BlocBlocBloc)
&nbsp;
[![Blur](https://img.shields.io/badge/Blur-Text2Image%20Engine-navy?style=flat-square)](https://github.com/onestardao/WFGY/tree/main/OS/BlurBlurBlur)
&nbsp;
[![Blow](https://img.shields.io/badge/Blow-Game%20Logic-purple?style=flat-square)](https://github.com/onestardao/WFGY/tree/main/OS/BlowBlowBlow)
&nbsp;
</div>