🔎 Retrieval Playbook — Practical, Measurable, Fix-first
The goal: consistent, explainable retrieval that makes reasoning easy.
This playbook gives you a minimal, testable setup across OCR → chunk → embed → index → retrieve → prompt, with failure probes and repair steps. No hype—only what ships.
Quick Nav
OCR/Parsing Checklist · Chunking Checklist · Embedding vs Semantic · Traceability · Rerankers · Patterns: Query Parsing Split · Vectorstore Fragmentation · Symbolic Constraint Unlock
0) Executive summary
- Generate clean candidates (dense/sparse/hybrid) with traceable IDs.
- Measure with ΔS + recall@k before adding complexity.
- Add reranking only when first-stage recall is consistently ≥0.85 for your task.
- Lock prompt schema (cite → explain) and forbid cross-source merges.
- Regression-guard with small golden sets (10–50 Q/A).
1) Candidate generation (first-stage)
1.1 Choose a primary retriever
- Dense (embeddings): good default for semantic matches across paraphrases.
- Sparse (BM25/SPLADE): strong for exact terms, code, and rare tokens.
- Hybrid: reciprocal rank fusion (RRF) of dense + sparse is robust to query style.
Rule of thumb
- Start dense for general docs; add BM25 for code/legal/IDs.
- If hybrid hurts recall, check tokenization & analyzer drift → see Query Parsing Split.
1.2 Index hygiene (FAISS/Elasticsearch/Qdrant/Chroma)
- Normalize vectors on both write & read if using cosine.
- Pin the metric type (cosine vs inner product) in code & metadata.
- Persist doc_id / section_id / line_span with each vector.
- Verify index cardinality == number of chunks; add a one-line count check in CI (sketch below).
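A minimal sketch of that count check, assuming the index is persisted as `index.faiss` alongside a `chunks.jsonl` manifest with one chunk per line (both paths are placeholders):

```python
import faiss

index = faiss.read_index("index.faiss")  # placeholder path
with open("chunks.jsonl") as f:          # placeholder manifest, one chunk per line
    n_chunks = sum(1 for _ in f)
assert index.ntotal == n_chunks, f"index holds {index.ntotal} vectors, expected {n_chunks}"
```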
2) Minimal reference pipelines
2.1 Python — FAISS (dense) + BM25 (hybrid via RRF)
```python
# pip install sentence-transformers rank_bm25 faiss-cpu numpy
from collections import defaultdict

import faiss
import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer

# 1) data
chunks = [...]  # list[str], pre-chunked sentences/sections
meta = [...]    # list[dict], each with {doc_id, section_id, span}

# 2) encoder (cosine → L2-normalize)
enc = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
X = enc.encode(chunks, normalize_embeddings=True)
index = faiss.IndexFlatIP(X.shape[1])  # inner product == cosine on normalized vectors
index.add(X.astype(np.float32))

# 3) sparse side
tokenized = [c.split() for c in chunks]
bm25 = BM25Okapi(tokenized)

def rrf(ranks, k=60):
    # ranks: list of lists of (idx, rank_position starting at 1)
    score = defaultdict(float)
    for rr in ranks:
        for idx, rp in rr:
            score[idx] += 1.0 / (k + rp)
    return sorted(score.items(), key=lambda x: -x[1])

def search(query, topk_dense=50, topk_sparse=50, out_k=20):
    qv = enc.encode([query], normalize_embeddings=True).astype(np.float32)
    d_s, d_i = index.search(qv, topk_dense)
    # dense ranks as (idx, rank_position); FAISS pads missing hits with -1
    dense_rank = [(int(i), r + 1) for r, i in enumerate(d_i[0]) if i != -1]
    # sparse ranks: score the query once, then rank chunk indices directly
    s_scores = bm25.get_scores(query.split())
    top_sparse = np.argsort(s_scores)[::-1][:topk_sparse]
    sparse_rank = [(int(idx), r + 1) for r, idx in enumerate(top_sparse)]
    fused = rrf([dense_rank, sparse_rank])
    return [
        {"text": chunks[idx], "meta": meta[idx], "source": "hybrid"}
        for idx, _ in fused[:out_k]
    ]

res = search("how to reset billing cycle")
```
Sanity checks
- After encoding: `assert np.abs(np.linalg.norm(X[0]) - 1.0) < 1e-3`.
- After adding to the index: `assert index.ntotal == len(chunks)`.
- If BM25 alone beats dense on ID- or code-heavy queries, keep the hybrid rather than going sparse-only.
2.2 Node (TypeScript) — Elastic BM25 + Dense store
```ts
// pnpm add @elastic/elasticsearch @huggingface/inference cosine-similarity
import { Client } from "@elastic/elasticsearch";
import { HfInference } from "@huggingface/inference";
import cosine from "cosine-similarity";

const es = new Client({ node: process.env.ES_URL! });
const hf = new HfInference(process.env.HF_TOKEN!);

// 1) BM25 query
async function bm25(query: string, topk = 50) {
  const { hits } = await es.search({
    index: "chunks",
    size: topk,
    query: { match: { text: { query, operator: "and" } } },
    _source: ["text", "doc_id", "section_id", "span"],
  });
  // use each hit's own score, not the list-wide max_score
  return hits.hits.map((h, i) => ({ id: h._id, score: h._score, r: i + 1, src: h._source }));
}

// 2) Dense side (use same model for write+read)
async function embed(s: string) {
  const out = await hf.featureExtraction({
    model: "sentence-transformers/all-MiniLM-L6-v2",
    inputs: s,
  });
  // L2 normalize
  const v = out as number[];
  const norm = Math.sqrt(v.reduce((a, b) => a + b * b, 0));
  return v.map((x) => x / norm);
}
```
Keep dense vectors in your KV/DB keyed by `chunk_id`; at query time, compute cosine against a small candidate pool (e.g., the top-200 BM25 hits), then fuse with RRF.
3) Retrieval observability (ΔS & λ)
- ΔS(question, retrieved) = 1 − cos(I, G), where I is the retrieved-snippet embedding and G is the anchor embedding (title, expected section, or gold answer).
- Thresholds: ΔS < 0.40 stable · 0.40–0.60 transitional · ≥ 0.60 action.
- λ states: → convergent · ← divergent · <> recursive · × chaotic.
Probe recipe
- Vary k ∈ {5, 10, 20}; plot ΔS vs k (a minimal sketch follows this list).
- If the curve is flat and high → metric/normalization/index mismatch.
- If there is a sharp drop at higher k → the retriever filter is too strict; consider MMR or hybrid.
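A minimal probe sketch, reusing `enc`, `index`, and `chunks` from §2.1; the anchor string (section title or gold answer) is an input you supply per question:

```python
import numpy as np

def delta_s(snippet: str, anchor: str) -> float:
    # ΔS = 1 − cos(I, G); encode() returns L2-normalized vectors, so dot == cosine
    i_vec, g_vec = enc.encode([snippet, anchor], normalize_embeddings=True)
    return float(1.0 - np.dot(i_vec, g_vec))

def probe(question: str, anchor: str, ks=(5, 10, 20)):
    qv = enc.encode([question], normalize_embeddings=True).astype(np.float32)
    for k in ks:
        _, ids = index.search(qv, k)
        scores = [delta_s(chunks[i], anchor) for i in ids[0] if i != -1]
        print(f"k={k:>2}  min ΔS={min(scores):.3f}  mean ΔS={np.mean(scores):.3f}")
```

A min ΔS that stays flat and high across all three k values is the metric/normalization signature described above.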
4) Prompt assembly: cite → explain (lock constraints)
- Keep per-source fences (no cross-source merges).
- Order: system → task → constraints → citations → answer.
- Force cite-first; explanation must reference citation IDs/lines.
- See: Traceability and SCU Pattern.
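A minimal sketch of a locked cite → explain assembly in Python, reusing the snippet dicts returned by `search` in §2.1; the fence wording and field names are illustrative, not a fixed spec:

```python
def build_prompt(task: str, snippets: list[dict]) -> str:
    # one numbered, per-source block; never merge text across doc_ids
    citations = "\n\n".join(
        f"[{i}] doc={s['meta']['doc_id']} section={s['meta']['section_id']} "
        f"span={s['meta']['span']}\n{s['text']}"
        for i, s in enumerate(snippets, 1)
    )
    return (
        "SYSTEM: Answer only from the citations below.\n"
        f"TASK: {task}\n"
        "CONSTRAINTS: cite first; every claim must reference a citation ID; "
        "do not merge content across sources.\n"
        f"CITATIONS:\n{citations}\n"
        "ANSWER (list the citation IDs you used, then explain):"
    )
```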
5) Reranking: when & how
Add a reranker only if:
- Recall@50 ≥ 0.85, but Top-5 precision is weak, or
- You need tight citation alignment across near-duplicates.
Start with:
- Cross-encoder (bge-reranker-mini/base) for accuracy;
- Or LLM rerank for low-volume, high-precision needs. See: Rerankers.
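A minimal cross-encoder sketch via sentence-transformers; `BAAI/bge-reranker-base` is one public checkpoint of the family named above, and the candidate dicts are the ones `search` in §2.1 returns:

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("BAAI/bge-reranker-base")

def rerank(query: str, candidates: list[dict], out_k: int = 5) -> list[dict]:
    # score each (query, passage) pair jointly; higher means more relevant
    scores = reranker.predict([(query, c["text"]) for c in candidates])
    order = sorted(range(len(candidates)), key=lambda i: -scores[i])
    return [candidates[i] for i in order[:out_k]]
```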
6) Acceptance criteria
- Retrieval sanity: ΔS(question, top-ctx) ≤ 0.45, coverage ≥ 0.70 of target section.
- Traceability: snippet ↔ citation table reproducible.
- Stability: the same question asked as 3 paraphrases keeps λ convergent (→).
- No SCU: who-said-what does not merge across sources.
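A minimal sketch of the first and third checks, reusing `search` from §2.1 and `delta_s` from §3; coverage and SCU checks depend on your span metadata and source fencing, so they are not shown:

```python
def accept(question: str, paraphrases: list[str], anchor: str, thresh: float = 0.45) -> bool:
    # the top context must stay under the ΔS threshold for the question
    # and for every paraphrase (a proxy for λ staying convergent)
    for q in [question, *paraphrases]:
        top = search(q, out_k=1)[0]
        if delta_s(top["text"], anchor) > thresh:
            return False
    return True
```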
7) Common failures → repair
| Symptom | Likely cause | Fix |
|---|---|---|
| Hybrid worse than single | Analyzer/tokenizer split | Align analyzers; log per-retriever queries; see Query Parsing Split |
| Some facts never retrieved | Fragmented store / id skew | Rebuild + shard audit; see Vectorstore Fragmentation |
| Citations cross-bleed | Prompt schema unlocked | Per-source fences + cite-first; see SCU |
| ΔS flat & high vs k | Metric/normalization mismatch | Normalize embeddings; pin FAISS metric; see Embedding vs Semantic |
8) Tiny gold set (do this!)
Create 10–50 realistic Q/A with citation lines. Commit a goldset.jsonl:
{"q":"How to reset billing cycle?","doc_id":"a","section":"billing","lines":[120,145],"a":"..."}
Run recall@50, nDCG@10, and ΔS on each PR.
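A minimal recall@k sketch over `goldset.jsonl`, reusing `search` from §2.1; a hit counts any retrieved chunk from the gold (doc_id, section), and the nDCG@10 and ΔS hooks are omitted for brevity:

```python
import json

def recall_at_k(goldset_path: str = "goldset.jsonl", k: int = 50) -> float:
    rows = [json.loads(line) for line in open(goldset_path)]
    hits = 0
    for row in rows:
        results = search(row["q"], out_k=k)
        # hit = any retrieved chunk from the gold (doc_id, section)
        if any(
            r["meta"]["doc_id"] == row["doc_id"]
            and r["meta"]["section_id"] == row["section"]
            for r in results
        ):
            hits += 1
    return hits / len(rows)

print(f"recall@50 = {recall_at_k():.2f}")
```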
🔗 Quick-Start Downloads (60 sec)
| Tool | Link | 3-Step Setup |
|---|---|---|
| WFGY 1.0 PDF | Engine Paper | 1️⃣ Download · 2️⃣ Upload to your LLM · 3️⃣ Ask “Answer using WFGY + <your question>” |
| TXT OS (plain-text OS) | TXTOS.txt | 1️⃣ Download · 2️⃣ Paste into any LLM chat · 3️⃣ Type “hello world” — OS boots instantly |
Explore More
| Layer | Page | What it’s for |
|---|---|---|
| ⭐ Proof | WFGY Recognition Map | External citations, integrations, and ecosystem proof |
| ⚙️ Engine | WFGY 1.0 | Original PDF tension engine and early logic sketch (legacy reference) |
| ⚙️ Engine | WFGY 2.0 | Production tension kernel for RAG and agent systems |
| ⚙️ Engine | WFGY 3.0 | TXT-based Singularity tension engine (131 S-class set) |
| 🗺️ Map | Problem Map 1.0 | Flagship 16-problem RAG failure taxonomy and fix map |
| 🗺️ Map | Problem Map 2.0 | Global Debug Card for RAG and agent pipeline diagnosis |
| 🗺️ Map | Problem Map 3.0 | Global AI troubleshooting atlas and failure pattern map |
| 🧰 App | TXT OS | .txt semantic OS with fast bootstrap |
| 🧰 App | Blah Blah Blah | Abstract and paradox Q&A built on TXT OS |
| 🧰 App | Blur Blur Blur | Text to image generation with semantic control |
| 🏡 Onboarding | Starter Village | Guided entry point for new users |
If this repository helped, starring it improves discovery so more builders can find the docs and tools.