WFGY/ProblemMap/retrieval-playbook.md
2025-08-15 23:24:05 +08:00


🔎 Retrieval Playbook — Practical, Measurable, Fix-first

The goal: consistent, explainable retrieval that makes reasoning easy.
This playbook gives you a minimal, testable setup across OCR → chunk → embed → index → retrieve → prompt, with failure probes and repair steps. No hype—only what ships.


Quick Nav
OCR/Parsing Checklist · Chunking Checklist · Embedding vs Semantic · Traceability · Rerankers · Patterns: Query Parsing Split · Vectorstore Fragmentation · Symbolic Constraint Unlock


0) Executive summary

  1. Generate clean candidates (dense/sparse/hybrid) with traceable IDs.
  2. Measure with ΔS + recall@k before adding complexity.
  3. Add reranking only when first-stage recall is consistently ≥0.85 for your task.
  4. Lock prompt schema (cite → explain) and forbid cross-source merges.
  5. Regression-guard with small golden sets (10–50 Q/A).

1) Candidate generation (first-stage)

1.1 Choose a primary retriever

  • Dense (embeddings): good default for semantic matches across paraphrases.
  • Sparse (BM25/SPLADE): strong for exact terms, code, and rare tokens.
  • Hybrid: reciprocal rank fusion (RRF) of dense + sparse is robust to query style.

Rule of thumb

  • Start dense for general docs; add BM25 for code/legal/IDs.
  • If hybrid hurts recall, check tokenization & analyzer drift → see Query Parsing Split.

1.2 Index hygiene (FAISS/Elasticsearch/Qdrant/Chroma)

  • Normalize vectors on both write & read if using cosine.
  • Pin the metric type (cosine vs inner product) in code & metadata.
  • Persist doc_id / section_id / line_span with each vector.
  • Verify that the index's vector count equals the total number of chunks; add a one-line count check in CI.
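
The hygiene rules above can be sketched as a small pre-flight check. This is a minimal sketch in plain Python, assuming vectors are lists of floats and each record carries `doc_id`, `section_id`, and `line_span`. The names `l2_normalize` and `check_index` and the record layout are illustrative and not part of any vector store's API.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so inner product equals cosine."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]

REQUIRED_KEYS = ("doc_id", "section_id", "line_span")  # from the checklist above

def check_index(records, expected_chunks, metric, tol=1e-3):
    """Return a list of hygiene problems; an empty list means the index passes.

    records: list of {"vector": [...], "meta": {...}} (hypothetical layout)
    metric:  the metric name stored alongside the index
    """
    problems = []
    if metric not in ("cosine", "ip"):
        problems.append(f"unpinned or unknown metric: {metric!r}")
    if len(records) != expected_chunks:
        problems.append(f"cardinality {len(records)} != {expected_chunks} chunks")
    for i, rec in enumerate(records):
        missing = [k for k in REQUIRED_KEYS if k not in rec["meta"]]
        if missing:
            problems.append(f"record {i} missing {missing}")
        norm = math.sqrt(sum(x * x for x in rec["vector"]))
        if metric == "cosine" and abs(norm - 1.0) > tol:
            problems.append(f"record {i} not unit-norm ({norm:.3f})")
    return problems
```

Running `check_index` in CI and failing the build on a non-empty list is one way to make the cardinality and normalization rules enforceable rather than advisory.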

2) Minimal reference pipelines

2.1 Python — FAISS (dense) + BM25 (hybrid via RRF)

# pip install sentence-transformers rank_bm25 faiss-cpu numpy
from sentence_transformers import SentenceTransformer
from rank_bm25 import BM25Okapi
import faiss, numpy as np

# 1) data
chunks = [...]                  # list[str], pre-chunked sentences/sections
meta   = [...]                  # list[dict], each with {doc_id, section_id, span}

# 2) encoder (cosine → L2-normalize)
enc = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
X = enc.encode(chunks, normalize_embeddings=True)
index = faiss.IndexFlatIP(X.shape[1])         # inner product == cosine on normalized vectors
index.add(X.astype(np.float32))

# 3) sparse side
tokenized = [c.split() for c in chunks]
bm25 = BM25Okapi(tokenized)

def rrf(ranks, k=60):
    # ranks: list of lists of (idx, rank_position starting at 1)
    from collections import defaultdict
    score = defaultdict(float)
    for rr in ranks:
        for idx, rp in rr:
            score[idx] += 1.0 / (k + rp)
    return sorted(score.items(), key=lambda x: -x[1])

def search(query, topk_dense=50, topk_sparse=50, out_k=20):
    qv = enc.encode([query], normalize_embeddings=True).astype(np.float32)
    # FAISS pads results with -1 when fewer than k vectors exist: cap k and skip negatives
    _, d_i = index.search(qv, min(topk_dense, index.ntotal))
    # ranks as (idx, rank position starting at 1)
    dense_rank = [(int(i), r + 1) for r, i in enumerate(d_i[0]) if i >= 0]
    # get_top_n returns the "documents" it is given; passing chunk indices yields ranked indices
    top_sparse = bm25.get_top_n(query.split(), list(range(len(chunks))), n=topk_sparse)
    sparse_rank = [(idx, r + 1) for r, idx in enumerate(top_sparse)]
    fused = rrf([dense_rank, sparse_rank])
    # keep each text together with its traceability metadata
    return [{"text": chunks[idx], "meta": meta[idx], "source": "hybrid"} for idx, _ in fused[:out_k]]

res = search("how to reset billing cycle")

Sanity checks

  • After encoding: assert np.abs(np.linalg.norm(X[0]) - 1.0) < 1e-3
  • After add: index.ntotal == len(chunks)
  • Sparse-only check: if BM25 alone beats dense on ID/code queries, keep the hybrid setup.

2.2 Node (TypeScript) — Elastic BM25 + Dense store

// pnpm add @elastic/elasticsearch @huggingface/inference cosine-similarity
import { Client } from "@elastic/elasticsearch";
import { HfInference } from "@huggingface/inference";
import cosine from "cosine-similarity";

const es = new Client({ node: process.env.ES_URL! });
const hf = new HfInference(process.env.HF_TOKEN!);

// 1) BM25 query
async function bm25(query: string, topk=50) {
  const { hits } = await es.search({
    index: "chunks",
    size: topk,
    query: { match: { text: { query, operator: "and" } } },
    _source: ["text", "doc_id", "section_id", "span"]
  });
  // use each hit's own _score (hits.max_score is the same value for every hit)
  return hits.hits.map((h, i) => ({ id: h._id, score: h._score, r: i + 1, src: h._source }));
}

// 2) Dense side (use same model for write+read)
async function embed(s: string) {
  const out = await hf.featureExtraction({
    model: "sentence-transformers/all-MiniLM-L6-v2",
    inputs: s
  });
  // L2 normalize
  const v = (out as number[]); 
  const norm = Math.sqrt(v.reduce((a,b)=>a+b*b,0)); 
  return v.map(x=>x/norm);
}

Keep dense vectors in your KV/DB keyed by chunk_id; at query time compute cosine vs a small candidate pool (top-200 BM25), then RRF.
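
The candidate-pool step described above can be sketched as follows. This is a minimal Python sketch under stated assumptions: `bm25_ids` is the BM25 result list in rank order, `store` is a hypothetical in-memory dict standing in for your KV/DB, and all vectors are already L2-normalized. The function name `rescore_and_fuse` is illustrative.

```python
def dot(a, b):
    """Dot product; equals cosine similarity when both vectors are unit length."""
    return sum(x * y for x, y in zip(a, b))

def rescore_and_fuse(query_vec, bm25_ids, store, k=60, out_k=20):
    """Fuse the BM25 order with a dense order computed over the same pool.

    query_vec: L2-normalized query embedding
    bm25_ids:  chunk ids in BM25 rank order (best first), e.g. top-200
    store:     dict chunk_id -> L2-normalized vector (hypothetical KV layout)
    """
    # Only chunks that have a stored vector can be ranked on the dense side.
    pool = [cid for cid in bm25_ids if cid in store]
    dense_ids = sorted(pool, key=lambda cid: -dot(query_vec, store[cid]))
    # Reciprocal rank fusion: sum 1 / (k + rank) over both rankings.
    scores = {}
    for ranking in (bm25_ids, dense_ids):
        for pos, cid in enumerate(ranking, start=1):
            scores[cid] = scores.get(cid, 0.0) + 1.0 / (k + pos)
    fused = sorted(scores, key=lambda cid: -scores[cid])
    return fused[:out_k]
```

Because the dense side only reorders the BM25 pool, recall is bounded by what BM25 returns; a strict `operator: "and"` match can shrink that pool considerably.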


3) Retrieval observability (ΔS & λ)

  • ΔS(question, retrieved) = 1 − cos(I, G), where I is the retrieved snippet embedding and G is the anchor (title, expected section, or gold answer).
  • Thresholds: <0.40 stable · 0.40–0.60 transitional · ≥0.60 action.
  • λ states: → convergent · ← divergent · <> recursive · × chaotic.
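
A minimal sketch of the ΔS definition and its bands, assuming embeddings are plain lists of floats. The function names `delta_s` and `zone` are illustrative. The boundaries follow the thresholds above, with 0.40 counted as transitional and 0.60 as action.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0.0 or nv == 0.0:
        raise ValueError("zero vector has no direction")
    return dot / (nu * nv)

def delta_s(retrieved_vec, anchor_vec):
    """ΔS = 1 - cos(I, G): 0 means aligned, larger means semantic drift."""
    return 1.0 - cosine(retrieved_vec, anchor_vec)

def zone(ds):
    """Map a ΔS value to the bands listed above."""
    if ds < 0.40:
        return "stable"
    if ds < 0.60:
        return "transitional"
    return "action"
```

Logging `zone(delta_s(...))` next to each retrieved snippet gives a quick per-query health signal before any heavier evaluation runs.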

Probe recipe

  1. Vary k ∈ {5, 10, 20}; plot ΔS vs k.
  2. If curve flat & high → metric/normalization/index mismatch.
  3. If sharp drop at higher k → retriever filter too strict; consider MMR or hybrid.
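
The probe can be approximated without plotting. The sketch below computes mean ΔS over the top-k results for each k and flags the "flat & high" pattern. The cut-offs `high` and `flat_tol` are assumptions, not values from this playbook (0.60 borrows the action-band boundary), and the function names are illustrative.

```python
def delta_s_curve(per_rank_ds, ks=(5, 10, 20)):
    """Mean ΔS over the top-k results for each k.

    per_rank_ds: ΔS values in rank order (best first) for one query
    """
    curve = {}
    for k in ks:
        top = per_rank_ds[:k]
        if top:
            curve[k] = sum(top) / len(top)
    return curve

def flat_and_high(curve, high=0.60, flat_tol=0.05):
    """True when every point is high and the spread across k is small.

    high and flat_tol are assumed cut-offs; tune them on your own data.
    """
    values = list(curve.values())
    if not values:
        return False
    return min(values) >= high and (max(values) - min(values)) <= flat_tol
```

A `True` result points at step 2 of the recipe (metric, normalization, or index mismatch) rather than at the retriever's ranking quality.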

4) Prompt assembly: cite → explain (lock constraints)

  • Keep per-source fences (no cross-source merges).
  • Order: system → task → constraints → citations → answer.
  • Force cite-first; explanation must reference citation IDs/lines.
  • See: Traceability and SCU Pattern.
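
One possible way to lock the schema is to build the prompt from parts in a fixed order and wrap each source in its own fence. This is a sketch, not a prescribed template: the snippet field names (`doc_id`, `section_id`, `span`, `text`), the `C1`, `C2` citation IDs, and the `<<<`/`>>>` fence markers are all assumptions.

```python
def build_prompt(system, task, constraints, snippets, question):
    """Assemble system -> task -> constraints -> citations -> answer slot.

    snippets: list of dicts with doc_id, section_id, span, text
    (field names are assumptions; adapt them to your metadata).
    """
    fenced = []
    for n, s in enumerate(snippets, start=1):
        cite_id = f"C{n}"
        header = (f"[{cite_id}] doc={s['doc_id']} section={s['section_id']} "
                  f"lines={s['span'][0]}-{s['span'][1]}")
        # One fence per source: text from different sources is never concatenated.
        fenced.append(f"{header}\n<<<\n{s['text']}\n>>>")
    parts = [
        f"SYSTEM:\n{system}",
        f"TASK:\n{task}\nQuestion: {question}",
        "CONSTRAINTS:\n" + "\n".join(f"- {c}" for c in constraints),
        "CITATIONS:\n" + "\n\n".join(fenced),
        "ANSWER (list citation IDs first, then explain, referencing IDs and lines):",
    ]
    return "\n\n".join(parts)
```

Keeping the citation ID, document, and line range in each fence header is what lets the explanation step point back to an exact source.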

5) Reranking: when & how

Add a reranker only if:

  • Recall@50 ≥ 0.85, but Top-5 precision is weak, or
  • You need tight citation alignment across near-duplicates.

Start with:

  • Cross-encoder (bge-reranker-mini/base) for accuracy;
  • Or LLM rerank for low volume, high precision needs. See: Rerankers.
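
The gating rule above can be written down as a small predicate so the decision is reproducible. In this sketch, `min_recall=0.85` comes from the text, while `weak_precision=0.60` is an assumed cut-off for "weak" Top-5 precision that should be tuned per task. The function name is illustrative.

```python
def should_add_reranker(recall_at_50, precision_at_5,
                        near_duplicate_citations=False,
                        min_recall=0.85, weak_precision=0.60):
    """Return True when either condition for adding a reranker holds.

    min_recall follows the playbook; weak_precision is an assumption.
    """
    # Condition 2: tight citation alignment across near-duplicates is needed.
    if near_duplicate_citations:
        return True
    # Condition 1: first-stage recall is fine, but the top of the list is noisy.
    return recall_at_50 >= min_recall and precision_at_5 < weak_precision
```

If recall is below the threshold, a reranker cannot recover candidates that were never retrieved, so the first stage is the place to fix.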

6) Acceptance criteria

  • Retrieval sanity: ΔS(question, top-ctx) ≤ 0.45, coverage ≥ 0.70 of target section.
  • Traceability: snippet ↔ citation table reproducible.
  • Stability: same inputs over 3 paraphrases keep λ → convergent.
  • No SCU: who-said-what does not merge across sources.
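
The numeric gates above can be checked mechanically. This sketch assumes coverage means line overlap between retrieved spans and the target section's inclusive line span, and that λ states arrive as string labels from your own observability step. Both are assumptions, as are the function names.

```python
def coverage(retrieved_spans, target_span):
    """Fraction of target lines (inclusive span) covered by retrieved spans."""
    lo, hi = target_span
    target = set(range(lo, hi + 1))
    covered = set()
    for a, b in retrieved_spans:
        # Clip each retrieved span to the target before counting lines.
        covered.update(range(max(a, lo), min(b, hi) + 1))
    return len(covered & target) / len(target)

def accept(ds_top, retrieved_spans, target_span, lambda_states):
    """Check ΔS <= 0.45, coverage >= 0.70, and convergence on 3+ paraphrases."""
    checks = {
        "delta_s": ds_top <= 0.45,
        "coverage": coverage(retrieved_spans, target_span) >= 0.70,
        "stability": len(lambda_states) >= 3
                     and all(s == "convergent" for s in lambda_states),
    }
    return all(checks.values()), checks
```

Returning the per-check dict alongside the verdict makes it clear which gate failed. Traceability and the no-SCU rule still need a review of the citation table, since they are not reducible to a single number here.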

7) Common failures → repair

| Symptom | Likely cause | Fix |
|---|---|---|
| Hybrid worse than single | Analyzer/tokenizer split | Align analyzers; log per-retriever queries; see Query Parsing Split |
| Some facts never retrieved | Fragmented store / id skew | Rebuild + shard audit; see Vectorstore Fragmentation |
| Citations cross-bleed | Prompt schema unlocked | Per-source fences + cite-first; see SCU |
| ΔS flat & high vs k | Metric/normalization mismatch | Normalize embeddings; pin FAISS metric; see Embedding vs Semantic |

8) Tiny gold set (do this!)

Create 10–50 realistic Q/A pairs with citation lines. Commit a goldset.jsonl:

{"q":"How to reset billing cycle?","doc_id":"a","section":"billing","lines":[120,145],"a":"..."}

Run recall@50, nDCG@10, and ΔS on each PR.
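
A minimal scoring sketch for the gold set, using only the standard library. It assumes each retrieved result carries `doc_id` and `span` (as in the `meta` records of section 2.1) and that a result counts as relevant when it comes from the gold document and overlaps the gold line range. The nDCG variant shown is the binary-gain form with a single relevant target per question, which is a simplification.

```python
import json
import math

def load_goldset(lines):
    """Parse goldset.jsonl content (an iterable of JSON lines)."""
    return [json.loads(line) for line in lines if line.strip()]

def is_hit(result, gold):
    """Relevant if from the gold doc and overlapping the gold line range."""
    lo, hi = gold["lines"]
    a, b = result["span"]
    return result["doc_id"] == gold["doc_id"] and a <= hi and b >= lo

def recall_at_k(results, gold, k):
    """1.0 if any of the top-k results hits the gold citation, else 0.0."""
    return 1.0 if any(is_hit(r, gold) for r in results[:k]) else 0.0

def ndcg_at_k(results, gold, k):
    """Binary-gain nDCG with one relevant target (ideal DCG = 1)."""
    for pos, r in enumerate(results[:k], start=1):
        if is_hit(r, gold):
            return 1.0 / math.log2(pos + 1)
    return 0.0
```

Averaging `recall_at_k(..., 50)` and `ndcg_at_k(..., 10)` over all gold rows yields the two numbers to track on each PR; ΔS can be logged alongside them from the same run.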


🔗 Quick-Start Downloads (60 sec)

| Tool | Link | 3-Step Setup |
|---|---|---|
| WFGY 1.0 PDF | Engine Paper | 1 Download · 2 Upload to your LLM · 3 Ask “Answer using WFGY + <your question>” |
| TXT OS (plain-text OS) | TXTOS.txt | 1 Download · 2 Paste into any LLM chat · 3 Type “hello world” (the OS boots instantly) |
