# 🔎 Retrieval Playbook — Practical, Measurable, Fix-first
> The goal: **consistent, explainable** retrieval that makes reasoning easy.
> This playbook gives you a **minimal, testable** setup across OCR → chunk → embed → index → retrieve → prompt, with **failure probes** and **repair steps**. No hype—only what ships.
---
> **Quick Nav**
> [OCR/Parsing Checklist](./ocr-parsing-checklist.md) ·
> [Chunking Checklist](./chunking-checklist.md) ·
> [Embedding vs Semantic](./embedding-vs-semantic.md) ·
> [Traceability](./retrieval-traceability.md) ·
> [Rerankers](./rerankers.md) ·
> Patterns: [Query Parsing Split](./patterns/pattern_query_parsing_split.md) ·
> [Vectorstore Fragmentation](./patterns/pattern_vectorstore_fragmentation.md) ·
> [Symbolic Constraint Unlock](./patterns/pattern_symbolic_constraint_unlock.md)
---
## 0) Executive summary
1. **Generate clean candidates** (dense/sparse/hybrid) with traceable IDs.
2. **Measure** with ΔS + recall@k before adding complexity.
3. **Add reranking** only when first-stage recall is **consistently ≥0.85** for your task.
4. **Lock prompt schema** (cite → explain) and **forbid cross-source merges**.
5. **Regression-guard** with small golden sets (10–50 Q/A).
---
## 1) Candidate generation (first-stage)
### 1.1 Choose a primary retriever
- **Dense (embeddings)**: good default for semantic matches across paraphrases.
- **Sparse (BM25/SPLADE)**: strong for exact terms, code, and rare tokens.
- **Hybrid**: *reciprocal rank fusion (RRF)* of dense + sparse is robust to query style.
**Rule of thumb**
- Start **dense** for general docs; add **BM25** for code/legal/IDs.
- If hybrid hurts recall, check **tokenization & analyzer drift** → see *Query Parsing Split*.
### 1.2 Index hygiene (FAISS/Elasticsearch/Qdrant/Chroma)
- **Normalize** vectors on **both write & read** if using cosine.
- Pin the **metric type** (cosine vs inner product) in code & metadata.
- Persist **doc_id / section_id / line_span** with each vector.
- Verify **index cardinality = sum(chunks)**; add a one-liner count check in CI.
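The last two points can be guarded with one small CI check. A minimal, store-agnostic sketch: pass `index.ntotal` (FAISS) or your store's count as `num_indexed`; the function name is illustrative, not a library API.

```python
import numpy as np

def check_store(num_indexed: int, X: np.ndarray, chunks: list) -> None:
    """CI guard: one vector per chunk, and unit-norm vectors if cosine-via-IP is used."""
    # cardinality: index must hold exactly sum(chunks) vectors
    assert num_indexed == len(chunks), f"{num_indexed} vectors vs {len(chunks)} chunks"
    # cosine via inner product is only valid on L2-normalized vectors
    norms = np.linalg.norm(X, axis=1)
    assert np.allclose(norms, 1.0, atol=1e-3), "vectors not L2-normalized on write"

# toy example: 3 unit vectors, 3 chunks → passes silently
check_store(3, np.eye(3), ["a", "b", "c"])
```

Wire it into the same CI job that rebuilds the index, so cardinality or normalization drift fails the build instead of silently degrading recall.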
---
## 2) Minimal reference pipelines
### 2.1 Python — FAISS (dense) + BM25 (hybrid via RRF)
```python
# pip install sentence-transformers rank_bm25 faiss-cpu numpy
from collections import defaultdict

import faiss
import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer

# 1) data
chunks = [...]  # list[str], pre-chunked sentences/sections
meta = [...]    # list[dict], each with {doc_id, section_id, span}

# 2) encoder (cosine → L2-normalize)
enc = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
X = enc.encode(chunks, normalize_embeddings=True)
index = faiss.IndexFlatIP(X.shape[1])  # inner product == cosine on normalized vectors
index.add(X.astype(np.float32))

# 3) sparse side
tokenized = [c.split() for c in chunks]
bm25 = BM25Okapi(tokenized)

def rrf(ranks, k=60):
    # ranks: list of lists of (idx, rank_position starting at 1)
    score = defaultdict(float)
    for rr in ranks:
        for idx, rp in rr:
            score[idx] += 1.0 / (k + rp)
    return sorted(score.items(), key=lambda x: -x[1])

def search(query, topk_dense=50, topk_sparse=50, out_k=20):
    qv = enc.encode([query], normalize_embeddings=True).astype(np.float32)
    d_s, d_i = index.search(qv, topk_dense)
    # ranks as (idx, rank_position)
    dense_rank = [(int(i), r + 1) for r, i in enumerate(d_i[0])]
    # pass the chunk indices as the "documents" so get_top_n returns indices
    sparse_rank = []
    for r, idx in enumerate(bm25.get_top_n(query.split(), list(range(len(chunks))), n=topk_sparse)):
        sparse_rank.append((idx, r + 1))
    fused = rrf([dense_rank, sparse_rank])
    out = []
    for idx, _ in fused[:out_k]:
        out.append({"text": chunks[idx], "meta": meta[idx], "source": "hybrid"})
    return out

res = search("how to reset billing cycle")
```
**Sanity checks**
* After encoding: assert `np.abs(np.linalg.norm(X[0]) - 1.0) < 1e-3`
* After add: `index.ntotal == len(chunks)`
* For sparse-only: if **IDs/code** outperform dense, keep hybrid.
### 2.2 Node (TypeScript) — Elastic BM25 + Dense store
```ts
// pnpm add @elastic/elasticsearch @huggingface/inference cosine-similarity
import { Client } from "@elastic/elasticsearch";
import { HfInference } from "@huggingface/inference";
import cosine from "cosine-similarity"; // used at query time to score the candidate pool

const es = new Client({ node: process.env.ES_URL! });
const hf = new HfInference(process.env.HF_TOKEN!);

// 1) BM25 query
async function bm25(query: string, topk = 50) {
  const { hits } = await es.search({
    index: "chunks",
    size: topk,
    query: { match: { text: { query, operator: "and" } } },
    _source: ["text", "doc_id", "section_id", "span"]
  });
  // use each hit's own _score, not the pool-wide max_score
  return hits.hits.map((h, i) => ({ id: h._id, score: h._score, r: i + 1, src: h._source }));
}

// 2) Dense side (use same model for write+read)
async function embed(s: string) {
  const out = await hf.featureExtraction({
    model: "sentence-transformers/all-MiniLM-L6-v2",
    inputs: s
  });
  // L2 normalize
  const v = out as number[];
  const norm = Math.sqrt(v.reduce((a, b) => a + b * b, 0));
  return v.map(x => x / norm);
}
```
> Keep dense vectors in your KV/DB keyed by `chunk_id`; at query time compute cosine vs a small candidate pool (top-200 BM25), then RRF.
---
## 3) Retrieval observability (ΔS & λ)
* **ΔS(question, retrieved)** = `1 − cos(I, G)` with **I** = retrieved snippet embedding, **G** = anchor (title/expected section or gold answer).
* **Thresholds**: `<0.40` stable · `0.40–0.60` transitional · `≥0.60` action.
* **λ states**: `→` convergent · `←` divergent · `<>` recursive · `×` chaotic.
**Probe recipe**
1. Vary `k ∈ {5, 10, 20}`; plot ΔS vs k.
2. If curve **flat & high** → metric/normalization/index mismatch.
3. If **sharp drop** at higher k → retriever filter too strict; consider MMR or hybrid.
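ΔS itself is one line once you have embeddings. A minimal sketch with plain numpy vectors standing in for model outputs, using the thresholds above (`classify` is an illustrative helper, not a WFGY API):

```python
import numpy as np

def delta_s(I: np.ndarray, G: np.ndarray) -> float:
    """ΔS = 1 − cos(I, G); lower is better."""
    cos = float(np.dot(I, G) / (np.linalg.norm(I) * np.linalg.norm(G)))
    return 1.0 - cos

def classify(ds: float) -> str:
    """Map ΔS to the threshold bands defined above."""
    if ds < 0.40:
        return "stable"
    if ds < 0.60:
        return "transitional"
    return "action"

# identical vectors → ΔS = 0 (stable); orthogonal vectors → ΔS = 1 (action)
v = np.array([1.0, 0.0])
assert delta_s(v, v) < 1e-9
assert classify(delta_s(v, np.array([0.0, 1.0]))) == "action"
```

For the probe, run `delta_s` over the top-k snippets at each `k` and plot the per-k minimum; the curve shape, not any single value, is the diagnostic.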
---
## 4) Prompt assembly: cite → explain (lock constraints)
* Keep **per-source fences** (no cross-source merges).
* **Order**: *system → task → constraints → citations → answer*.
* Force **cite-first**; explanation **must reference** citation IDs/lines.
* See: [Traceability](./retrieval-traceability.md) and *SCU Pattern*.
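The assembly order and fencing above can be sketched as a pure string builder. The fence markers and schema wording here are illustrative, not a fixed WFGY format; the point is the order and the per-source isolation:

```python
def assemble_prompt(task: str, snippets: list) -> str:
    """system → task → constraints → citations → answer slot; one fence per source."""
    citations = []
    for i, s in enumerate(snippets, 1):
        # fence each source separately so the model cannot merge them
        citations.append(
            f"[{i}] doc={s['doc_id']} section={s['section_id']} span={s['span']}\n"
            f"<<<SOURCE {i}>>>\n{s['text']}\n<<<END {i}>>>"
        )
    return "\n\n".join([
        "SYSTEM: Answer only from the cited sources.",
        f"TASK: {task}",
        "CONSTRAINTS: Cite first as [n], then explain. "
        "Every claim must reference a citation ID. Never merge facts across sources.",
        "CITATIONS:\n" + "\n\n".join(citations),
        "ANSWER:",
    ])

p = assemble_prompt(
    "How to reset billing cycle?",
    [{"doc_id": "a", "section_id": "billing", "span": "120-145", "text": "..."}],
)
```

Because each source keeps its own fence and ID, the citation table in §6 can be rebuilt from the prompt alone.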
---
## 5) Reranking: when & how
Add a reranker only if:
* **Recall@50 ≥ 0.85**, but Top-5 precision is weak, or
* You need **tight citation alignment** across near-duplicates.
Start with:
* **Cross-encoder** (bge-reranker-mini/base) for accuracy;
* Or **LLM rerank** for low volume, high precision needs.
See: [Rerankers](./rerankers.md).
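The gate above can be made explicit in code. A sketch where `score_pair` is a placeholder for your cross-encoder or LLM scorer (not a real API), and candidates are the dicts returned by `search()`:

```python
from typing import Callable

def maybe_rerank(query: str, candidates: list, score_pair: Callable,
                 recall_at_50: float, out_k: int = 5) -> list:
    """Rerank only when first-stage recall clears the bar."""
    if recall_at_50 < 0.85:
        # reranking cannot recover what was never retrieved; fix retrieval first
        return candidates[:out_k]
    scored = sorted(candidates, key=lambda c: score_pair(query, c["text"]), reverse=True)
    return scored[:out_k]
```

This keeps the failure mode honest: a weak first stage passes through unchanged and shows up in your metrics, instead of being masked by an expensive second stage.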
---
## 6) Acceptance criteria
* **Retrieval sanity**: ΔS(question, top-ctx) ≤ 0.45, coverage ≥ 0.70 of target section.
* **Traceability**: snippet ↔ citation table reproducible.
* **Stability**: same inputs over 3 paraphrases keep λ → convergent.
* **No SCU**: who-said-what does not merge across sources.
---
## 7) Common failures → repair
| Symptom | Likely cause | Fix |
| -------------------------- | --------------------------------- | ---------------------------------------------------------------------------------------------------------------- |
| Hybrid worse than single | **Analyzer/tokenizer split** | Align analyzers; log per-retriever queries; see [Query Parsing Split](./patterns/pattern_query_parsing_split.md) |
| Some facts never retrieved | **Fragmented store / id skew** | Rebuild + shard audit; see [Vectorstore Fragmentation](./patterns/pattern_vectorstore_fragmentation.md) |
| Citations cross-bleed | **Prompt schema unlocked** | Per-source fences + cite-first; see [SCU](./patterns/pattern_symbolic_constraint_unlock.md) |
| ΔS flat & high vs k | **Metric/normalization mismatch** | Normalize embeddings; pin FAISS metric; see [Embedding vs Semantic](./embedding-vs-semantic.md) |
---
## 8) Tiny gold set (do this!)
Create **10–50** realistic Q/A with citation lines. Commit a `goldset.jsonl`:
```json
{"q":"How to reset billing cycle?","doc_id":"a","section":"billing","lines":[120,145],"a":"..."}
```
Run **recall@50**, **nDCG@10**, and **ΔS** on each PR.
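A minimal evaluator over that `goldset.jsonl`, assuming a `retrieve` callable that maps a question to ranked `doc_id`s (retrieval itself is stubbed; scoring is binary relevance against the single gold doc):

```python
import json
import math

def recall_at_k(retrieved: list, gold_doc_id: str, k: int) -> float:
    return 1.0 if gold_doc_id in retrieved[:k] else 0.0

def ndcg_at_k(retrieved: list, gold_doc_id: str, k: int) -> float:
    # binary relevance, single gold doc → ideal DCG is 1.0
    for r, d in enumerate(retrieved[:k], 1):
        if d == gold_doc_id:
            return 1.0 / math.log2(r + 1)
    return 0.0

def evaluate(goldset_lines: list, retrieve) -> dict:
    """goldset_lines: raw JSONL lines; retrieve: question → ranked list of doc_ids."""
    r50, n10 = [], []
    for line in goldset_lines:
        g = json.loads(line)
        ids = retrieve(g["q"])
        r50.append(recall_at_k(ids, g["doc_id"], 50))
        n10.append(ndcg_at_k(ids, g["doc_id"], 10))
    return {"recall@50": sum(r50) / len(r50), "nDCG@10": sum(n10) / len(n10)}
```

Run it in the same PR check as the ΔS probe; a drop in either metric blocks the merge, which is the whole point of keeping the gold set small enough to run on every change.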
---
### 🧭 Explore More
| Module | Description | Link |
|-----------------------|----------------------------------------------------------|----------|
| WFGY Core | WFGY 2.0 engine is live: full symbolic reasoning architecture and math stack | [View →](https://github.com/onestardao/WFGY/tree/main/core/README.md) |
| Problem Map 1.0 | Initial 16-mode diagnostic and symbolic fix framework | [View →](https://github.com/onestardao/WFGY/tree/main/ProblemMap/README.md) |
| Problem Map 2.0 | RAG-focused failure tree, modular fixes, and pipelines | [View →](https://github.com/onestardao/WFGY/blob/main/ProblemMap/rag-architecture-and-recovery.md) |
| Semantic Clinic Index | Expanded failure catalog: prompt injection, memory bugs, logic drift | [View →](https://github.com/onestardao/WFGY/blob/main/ProblemMap/SemanticClinicIndex.md) |
| Semantic Blueprint | Layer-based symbolic reasoning & semantic modulations | [View →](https://github.com/onestardao/WFGY/tree/main/SemanticBlueprint/README.md) |
| Benchmark vs GPT-5 | Stress test GPT-5 with full WFGY reasoning suite | [View →](https://github.com/onestardao/WFGY/tree/main/benchmarks/benchmark-vs-gpt5/README.md) |
| 🧙‍♂️ Starter Village 🏡 | New here? Lost in symbols? Click here and let the wizard guide you through | [Start →](https://github.com/onestardao/WFGY/blob/main/StarterVillage/README.md) |
---
> 👑 **Early Stargazers: [See the Hall of Fame](https://github.com/onestardao/WFGY/tree/main/stargazers)**
> Engineers, hackers, and open source builders who supported WFGY from day one.
> <img src="https://img.shields.io/github/stars/onestardao/WFGY?style=social" alt="GitHub stars"> ⭐ [WFGY Engine 2.0](https://github.com/onestardao/WFGY/blob/main/core/README.md) is already unlocked. ⭐ Star the repo to help others discover it and unlock more on the [Unlock Board](https://github.com/onestardao/WFGY/blob/main/STAR_UNLOCKS.md).
<div align="center">
[![WFGY Main](https://img.shields.io/badge/WFGY-Main-red?style=flat-square)](https://github.com/onestardao/WFGY)
&nbsp;
[![TXT OS](https://img.shields.io/badge/TXT%20OS-Reasoning%20OS-orange?style=flat-square)](https://github.com/onestardao/WFGY/tree/main/OS)
&nbsp;
[![Blah](https://img.shields.io/badge/Blah-Semantic%20Embed-yellow?style=flat-square)](https://github.com/onestardao/WFGY/tree/main/OS/BlahBlahBlah)
&nbsp;
[![Blot](https://img.shields.io/badge/Blot-Persona%20Core-green?style=flat-square)](https://github.com/onestardao/WFGY/tree/main/OS/BlotBlotBlot)
&nbsp;
[![Bloc](https://img.shields.io/badge/Bloc-Reasoning%20Compiler-blue?style=flat-square)](https://github.com/onestardao/WFGY/tree/main/OS/BlocBlocBloc)
&nbsp;
[![Blur](https://img.shields.io/badge/Blur-Text2Image%20Engine-navy?style=flat-square)](https://github.com/onestardao/WFGY/tree/main/OS/BlurBlurBlur)
&nbsp;
[![Blow](https://img.shields.io/badge/Blow-Game%20Logic-purple?style=flat-square)](https://github.com/onestardao/WFGY/tree/main/OS/BlowBlowBlow)
&nbsp;
</div>