
# 🔎 Retrieval Playbook — Practical, Measurable, Fix-first
> The goal: **consistent, explainable** retrieval that makes reasoning easy.
> This playbook gives you a **minimal, testable** setup across OCR → chunk → embed → index → retrieve → prompt, with **failure probes** and **repair steps**. No hype—only what ships.
---
> **Quick Nav**
> [OCR/Parsing Checklist](./ocr-parsing-checklist.md) ·
> [Chunking Checklist](./chunking-checklist.md) ·
> [Embedding vs Semantic](./embedding-vs-semantic.md) ·
> [Traceability](./retrieval-traceability.md) ·
> [Rerankers](./rerankers.md) ·
> Patterns: [Query Parsing Split](./patterns/pattern_query_parsing_split.md) ·
> [Vectorstore Fragmentation](./patterns/pattern_vectorstore_fragmentation.md) ·
> [Symbolic Constraint Unlock](./patterns/pattern_symbolic_constraint_unlock.md)
---
## 0) Executive summary
1. **Generate clean candidates** (dense/sparse/hybrid) with traceable IDs.
2. **Measure** with ΔS + recall@k before adding complexity.
3. **Add reranking** only when first-stage recall is **consistently ≥0.85** for your task.
4. **Lock prompt schema** (cite → explain) and **forbid cross-source merges**.
5. **Regression-guard** with small golden sets (10–50 Q/A).
---
## 1) Candidate generation (first-stage)
### 1.1 Choose a primary retriever
- **Dense (embeddings)**: good default for semantic matches across paraphrases.
- **Sparse (BM25/SPLADE)**: strong for exact terms, code, and rare tokens.
- **Hybrid**: *reciprocal rank fusion (RRF)* of dense + sparse is robust to query style.
**Rule of thumb**
- Start **dense** for general docs; add **BM25** for code/legal/IDs.
- If hybrid hurts recall, check **tokenization & analyzer drift** → see *Query Parsing Split*.
### 1.2 Index hygiene (FAISS/Elasticsearch/Qdrant/Chroma)
- **Normalize** vectors on **both write & read** if using cosine.
- Pin the **metric type** (cosine vs inner product) in code & metadata.
- Persist **doc_id / section_id / line_span** with each vector.
- Verify **index cardinality = sum(chunks)**; add a one-liner count check in CI.
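These hygiene points can be wired into CI as a small sanity script. A minimal sketch, assuming the FAISS setup shown in §2.1 below (`index`, `X`, `chunks`, `meta` are those objects; adapt the field names to your store):
```python
# CI-style hygiene check (sketch; assumes the FAISS pipeline from §2.1 below)
import numpy as np

def check_index_hygiene(index, X, chunks, meta, metric="cosine"):
    # 1) cardinality: every chunk has exactly one vector and one metadata row
    assert index.ntotal == len(chunks) == len(meta), "index cardinality != chunk count"
    # 2) normalization: required when treating inner product as cosine
    if metric == "cosine":
        norms = np.linalg.norm(X, axis=1)
        assert np.allclose(norms, 1.0, atol=1e-3), "vectors are not L2-normalized"
    # 3) traceability: every vector carries doc_id / section_id / span
    for m in meta:
        assert {"doc_id", "section_id", "span"} <= set(m), f"missing trace fields: {m}"
```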
---
## 2) Minimal reference pipelines
### 2.1 Python — FAISS (dense) + BM25 (hybrid via RRF)
```python
# pip install sentence-transformers rank_bm25 faiss-cpu numpy
from collections import defaultdict

import faiss
import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer

# 1) data
chunks = [...]  # list[str], pre-chunked sentences/sections
meta = [...]    # list[dict], each with {doc_id, section_id, span}

# 2) encoder (cosine → L2-normalize)
enc = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
X = enc.encode(chunks, normalize_embeddings=True)
index = faiss.IndexFlatIP(X.shape[1])  # inner product == cosine on normalized vectors
index.add(X.astype(np.float32))

# 3) sparse side
tokenized = [c.split() for c in chunks]
bm25 = BM25Okapi(tokenized)

def rrf(ranks, k=60):
    # ranks: list of lists of (idx, rank_position starting at 1)
    score = defaultdict(float)
    for rr in ranks:
        for idx, rp in rr:
            score[idx] += 1.0 / (k + rp)
    return sorted(score.items(), key=lambda x: -x[1])

def search(query, topk_dense=50, topk_sparse=50, out_k=20):
    qv = enc.encode([query], normalize_embeddings=True).astype(np.float32)
    d_s, d_i = index.search(qv, topk_dense)
    # dense ranks as (idx, rank_position)
    dense_rank = [(int(i), r + 1) for r, i in enumerate(d_i[0])]
    # sparse ranks: pass chunk indices as "documents" so get_top_n returns ranked indices
    sparse_rank = []
    top_idx = bm25.get_top_n(query.split(), list(range(len(chunks))), n=topk_sparse)
    for r, idx in enumerate(top_idx):
        sparse_rank.append((idx, r + 1))
    fused = rrf([dense_rank, sparse_rank])
    out = []
    for idx, _ in fused[:out_k]:
        out.append({"text": chunks[idx], "meta": meta[idx], "source": "hybrid"})
    return out

res = search("how to reset billing cycle")
```
**Sanity checks**
* After encoding: assert `np.abs(np.linalg.norm(X[0]) - 1.0) < 1e-3`
* After add: `index.ntotal == len(chunks)`
* If sparse-only queries (exact **IDs/code**) outperform dense, keep the hybrid path.
### 2.2 Node (TypeScript) — Elastic BM25 + Dense store
```ts
// pnpm add @elastic/elasticsearch @huggingface/inference cosine-similarity
import { Client } from "@elastic/elasticsearch";
import { HfInference } from "@huggingface/inference";
import cosine from "cosine-similarity";

const es = new Client({ node: process.env.ES_URL! });
const hf = new HfInference(process.env.HF_TOKEN!);

// 1) BM25 query
async function bm25(query: string, topk = 50) {
  const { hits } = await es.search({
    index: "chunks",
    size: topk,
    query: { match: { text: { query, operator: "and" } } },
    _source: ["text", "doc_id", "section_id", "span"]
  });
  return hits.hits.map((h, i) => ({ id: h._id, score: h._score, r: i + 1, src: h._source }));
}

// 2) Dense side (use same model for write+read)
async function embed(s: string) {
  const out = await hf.featureExtraction({
    model: "sentence-transformers/all-MiniLM-L6-v2",
    inputs: s
  });
  // L2 normalize so dot product == cosine
  const v = out as number[];
  const norm = Math.sqrt(v.reduce((a, b) => a + b * b, 0));
  return v.map((x) => x / norm);
}
```
> Keep dense vectors in your KV/DB keyed by `chunk_id`; at query time compute cosine vs a small candidate pool (top-200 BM25), then RRF.
---
## 3) Retrieval observability (ΔS & λ)
* **ΔS(question, retrieved)** = `1 − cos(I, G)` with **I** = retrieved snippet embedding, **G** = anchor (title/expected section or gold answer).
* **Thresholds**: `< 0.40` stable · `0.40–0.60` transitional · `≥ 0.60` action.
* **λ states**: `→` convergent · `←` divergent · `<>` recursive · `×` chaotic.
**Probe recipe**
1. Vary `k ∈ {5, 10, 20}`; plot ΔS vs k.
2. If curve **flat & high** → metric/normalization/index mismatch.
3. If **sharp drop** at higher k → retriever filter too strict; consider MMR or hybrid.
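A minimal ΔS probe, reusing `enc` and `search` from §2.1. The `anchor` string (expected section title or gold answer) is something you supply per question; the bands follow the thresholds above:
```python
# ΔS probe sketch; `enc` and `search` come from §2.1, `anchor` is your expected section/gold answer
import numpy as np

def delta_s(snippet: str, anchor: str) -> float:
    i_vec = enc.encode([snippet], normalize_embeddings=True)[0]
    g_vec = enc.encode([anchor], normalize_embeddings=True)[0]
    return 1.0 - float(np.dot(i_vec, g_vec))  # ΔS = 1 − cos(I, G)

def probe(question: str, anchor: str, ks=(5, 10, 20)):
    curve = {}
    for k in ks:
        hits = search(question, out_k=k)
        curve[k] = min(delta_s(h["text"], anchor) for h in hits)  # best snippet per k
    # flat & high across all k → suspect metric/normalization/index mismatch
    return curve
```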
---
## 4) Prompt assembly: cite → explain (lock constraints)
* Keep **per-source fences** (no cross-source merges).
* **Order**: *system → task → constraints → citations → answer*.
* Force **cite-first**; explanation **must reference** citation IDs/lines.
* See: [Traceability](./retrieval-traceability.md) and *SCU Pattern*.
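A minimal assembly sketch of the cite-first schema. The fence markers and field names here are illustrative, not a fixed spec; candidates are the dicts returned by `search` in §2.1:
```python
# Cite-first prompt assembly sketch; fence/field names are illustrative
def build_prompt(task: str, constraints: str, candidates: list[dict]) -> str:
    cites = []
    for n, c in enumerate(candidates, 1):
        m = c["meta"]
        # per-source fence: one block per citation, never merged across doc_ids
        cites.append(f"[{n}] doc={m['doc_id']} sec={m['section_id']} span={m['span']}\n{c['text']}")
    return (
        "SYSTEM: Answer only from the citations below. Cite before you explain.\n"
        f"TASK: {task}\n"
        f"CONSTRAINTS: {constraints}\n"
        "CITATIONS:\n" + "\n---\n".join(cites) + "\n"
        "ANSWER: list citation IDs/lines first, then an explanation that references those IDs."
    )
```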
---
## 5) Reranking: when & how
Add a reranker only if:
* **Recall\@50 ≥ 0.85**, but Top-5 precision is weak, or
* You need **tight citation alignment** across near-duplicates.
Start with:
* **Cross-encoder** (bge-reranker-mini/base) for accuracy;
* Or **LLM rerank** for low volume, high precision needs.
See: [Rerankers](./rerankers.md).
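A cross-encoder rerank sketch over the fused candidates from §2.1. The model name is one common choice, not a requirement:
```python
# Cross-encoder rerank sketch; "BAAI/bge-reranker-base" is one common choice, not a requirement
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("BAAI/bge-reranker-base")

def rerank(query: str, candidates: list[dict], out_k: int = 5) -> list[dict]:
    pairs = [(query, c["text"]) for c in candidates]
    scores = reranker.predict(pairs)  # higher = more relevant
    order = sorted(range(len(candidates)), key=lambda i: -scores[i])
    return [{**candidates[i], "rerank_score": float(scores[i])} for i in order[:out_k]]
```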
---
## 6) Acceptance criteria
* **Retrieval sanity**: ΔS(question, top-ctx) ≤ 0.45, coverage ≥ 0.70 of target section.
* **Traceability**: snippet ↔ citation table reproducible.
* **Stability**: same inputs over 3 paraphrases keep λ → convergent.
* **No SCU**: who-said-what does not merge across sources.
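A sketch of wiring these criteria into a single gate. ΔS comes from the probe in §3; the coverage, λ, and SCU inputs are placeholders you compute upstream:
```python
# Acceptance-gate sketch; coverage, λ states, and SCU count are supplied by your own checks
def acceptance_gate(ds_top_ctx: float, coverage: float,
                    lambda_states: list[str], scu_violations: int) -> bool:
    return (
        ds_top_ctx <= 0.45                          # retrieval sanity
        and coverage >= 0.70                        # target-section coverage
        and all(s == "→" for s in lambda_states)    # convergent across 3 paraphrases
        and scu_violations == 0                     # no cross-source merges
    )
```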
---
## 7) Common failures → repair
| Symptom | Likely cause | Fix |
| -------------------------- | --------------------------------- | ---------------------------------------------------------------------------------------------------------------- |
| Hybrid worse than single | **Analyzer/tokenizer split** | Align analyzers; log per-retriever queries; see [Query Parsing Split](./patterns/pattern_query_parsing_split.md) |
| Some facts never retrieved | **Fragmented store / id skew** | Rebuild + shard audit; see [Vectorstore Fragmentation](./patterns/pattern_vectorstore_fragmentation.md) |
| Citations cross-bleed | **Prompt schema unlocked** | Per-source fences + cite-first; see [SCU](./patterns/pattern_symbolic_constraint_unlock.md) |
| ΔS flat & high vs k | **Metric/normalization mismatch** | Normalize embeddings; pin FAISS metric; see [Embedding vs Semantic](./embedding-vs-semantic.md) |
---
## 8) Tiny gold set (do this!)
Create **10–50** realistic Q/A with citation lines. Commit a `goldset.jsonl`:
```json
{"q":"How to reset billing cycle?","doc_id":"a","section":"billing","lines":[120,145],"a":"..."}
```
Run **recall\@50**, **nDCG\@10**, and **ΔS** on each PR.
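A minimal evaluation loop over `goldset.jsonl`, reusing `search` from §2.1. It treats "right doc_id + section" as binary relevance for both recall@50 and nDCG@10, which is a simplification you may want to refine:
```python
# Gold-set evaluation sketch; binary relevance (right doc_id + section) is a simplification
import json, math

def relevant(hit, gold):
    m = hit["meta"]
    return m["doc_id"] == gold["doc_id"] and m["section_id"] == gold["section"]

def evaluate(path="goldset.jsonl"):
    recalls, ndcgs = [], []
    for line in open(path, encoding="utf-8"):
        gold = json.loads(line)
        hits = search(gold["q"], out_k=50)      # from §2.1
        recalls.append(float(any(relevant(h, gold) for h in hits)))
        dcg = sum(1.0 / math.log2(r + 2) for r, h in enumerate(hits[:10]) if relevant(h, gold))
        ndcgs.append(dcg / 1.0)                  # ideal DCG = 1 for a single gold section
    return {"recall@50": sum(recalls) / len(recalls), "nDCG@10": sum(ndcgs) / len(ndcgs)}
```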
---
### 🔗 Quick-Start Downloads (60 sec)
| Tool | Link | 3-Step Setup |
|------|------|--------------|
| **WFGY 1.0 PDF** | [Engine Paper](https://github.com/onestardao/WFGY/blob/main/I_am_not_lizardman/WFGY_All_Principles_Return_to_One_v1.0_PSBigBig_Public.pdf) | 1⃣ Download · 2⃣ Upload to your LLM · 3⃣ Ask “Answer using WFGY + \<your question>” |
| **TXT OS (plain-text OS)** | [TXTOS.txt](https://github.com/onestardao/WFGY/blob/main/OS/TXTOS.txt) | 1⃣ Download · 2⃣ Paste into any LLM chat · 3⃣ Type “hello world” — OS boots instantly |
---
### 🧭 Explore More
| Module | Description | Link |
|-----------------------|----------------------------------------------------------|----------|
| WFGY Core | WFGY 2.0 engine is live: full symbolic reasoning architecture and math stack | [View →](https://github.com/onestardao/WFGY/tree/main/core/README.md) |
| Problem Map 1.0 | Initial 16-mode diagnostic and symbolic fix framework | [View →](https://github.com/onestardao/WFGY/tree/main/ProblemMap/README.md) |
| Problem Map 2.0 | RAG-focused failure tree, modular fixes, and pipelines | [View →](https://github.com/onestardao/WFGY/blob/main/ProblemMap/rag-architecture-and-recovery.md) |
| Semantic Clinic Index | Expanded failure catalog: prompt injection, memory bugs, logic drift | [View →](https://github.com/onestardao/WFGY/blob/main/ProblemMap/SemanticClinicIndex.md) |
| Semantic Blueprint | Layer-based symbolic reasoning & semantic modulations | [View →](https://github.com/onestardao/WFGY/tree/main/SemanticBlueprint/README.md) |
| Benchmark vs GPT-5 | Stress test GPT-5 with full WFGY reasoning suite | [View →](https://github.com/onestardao/WFGY/tree/main/benchmarks/benchmark-vs-gpt5/README.md) |
| 🧙‍♂️ Starter Village 🏡 | New here? Lost in symbols? Click here and let the wizard guide you through | [Start →](https://github.com/onestardao/WFGY/blob/main/StarterVillage/README.md) |
---
> 👑 **Early Stargazers: [See the Hall of Fame](https://github.com/onestardao/WFGY/tree/main/stargazers)** —
> Engineers, hackers, and open source builders who supported WFGY from day one.
> <img src="https://img.shields.io/github/stars/onestardao/WFGY?style=social" alt="GitHub stars"> ⭐ [WFGY Engine 2.0](https://github.com/onestardao/WFGY/blob/main/core/README.md) is already unlocked. ⭐ Star the repo to help others discover it and unlock more on the [Unlock Board](https://github.com/onestardao/WFGY/blob/main/STAR_UNLOCKS.md).
<div align="center">
[![WFGY Main](https://img.shields.io/badge/WFGY-Main-red?style=flat-square)](https://github.com/onestardao/WFGY)
&nbsp;
[![TXT OS](https://img.shields.io/badge/TXT%20OS-Reasoning%20OS-orange?style=flat-square)](https://github.com/onestardao/WFGY/tree/main/OS)
&nbsp;
[![Blah](https://img.shields.io/badge/Blah-Semantic%20Embed-yellow?style=flat-square)](https://github.com/onestardao/WFGY/tree/main/OS/BlahBlahBlah)
&nbsp;
[![Blot](https://img.shields.io/badge/Blot-Persona%20Core-green?style=flat-square)](https://github.com/onestardao/WFGY/tree/main/OS/BlotBlotBlot)
&nbsp;
[![Bloc](https://img.shields.io/badge/Bloc-Reasoning%20Compiler-blue?style=flat-square)](https://github.com/onestardao/WFGY/tree/main/OS/BlocBlocBloc)
&nbsp;
[![Blur](https://img.shields.io/badge/Blur-Text2Image%20Engine-navy?style=flat-square)](https://github.com/onestardao/WFGY/tree/main/OS/BlurBlurBlur)
&nbsp;
[![Blow](https://img.shields.io/badge/Blow-Game%20Logic-purple?style=flat-square)](https://github.com/onestardao/WFGY/tree/main/OS/BlowBlowBlow)
&nbsp;
</div>