vrr/WFGY

mirror of https://github.com/onestardao/WFGY.git synced 2026-04-28 11:40:07 +00:00

2025-08-15 23:24:05 +08:00

🔎 Retrieval Playbook — Practical, Measurable, Fix-first

The goal: consistent, explainable retrieval that makes reasoning easy.
This playbook gives you a minimal, testable setup across OCR → chunk → embed → index → retrieve → prompt, with failure probes and repair steps. No hype—only what ships.

Quick Nav
OCR/Parsing Checklist · Chunking Checklist · Embedding vs Semantic · Traceability · Rerankers · Patterns: Query Parsing Split · Vectorstore Fragmentation · Symbolic Constraint Unlock

0) Executive summary

Generate clean candidates (dense/sparse/hybrid) with traceable IDs.
Measure with ΔS + recall@k before adding complexity.
Add reranking only when first-stage recall is consistently ≥0.85 for your task.
Lock prompt schema (cite → explain) and forbid cross-source merges.
Regression-guard with small golden sets (10–50 Q/A).

1) Candidate generation (first-stage)

1.1 Choose a primary retriever

Dense (embeddings): good default for semantic matches across paraphrases.
Sparse (BM25/SPLADE): strong for exact terms, code, and rare tokens.
Hybrid: reciprocal rank fusion (RRF) of dense + sparse is robust to query style.

Rule of thumb

Start dense for general docs; add BM25 for code/legal/IDs.
If hybrid hurts recall, check tokenization & analyzer drift → see Query Parsing Split.

1.2 Index hygiene (FAISS/Elasticsearch/Qdrant/Chroma)

Normalize vectors on both write & read if using cosine.
Pin the metric type (cosine vs inner product) in code & metadata.
Persist doc_id / section_id / line_span with each vector.
Verify index cardinality equals the total number of chunks (for FAISS, index.ntotal == len(chunks)); add a one-liner count check in CI.
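
The hygiene rules above can be sketched without committing to a particular vector store. This is a minimal illustration, not the playbook's own implementation: the helper names (`l2_normalize`, `build_records`, `check_cardinality`) and the record layout are assumptions, and a plain Python list stands in for the index.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so inner product equals cosine."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def build_records(vectors, meta, metric="cosine"):
    """Attach traceable IDs and the pinned metric to every stored vector."""
    if len(vectors) != len(meta):
        raise ValueError("vectors and meta must have the same length")
    records = []
    for vec, m in zip(vectors, meta):
        records.append({
            "vector": l2_normalize(vec),   # normalize on write
            "metric": metric,              # pinned next to the data
            "doc_id": m["doc_id"],
            "section_id": m["section_id"],
            "line_span": m["line_span"],
        })
    return records

def check_cardinality(records, chunks):
    """CI-style guard: the index must hold exactly one vector per chunk."""
    assert len(records) == len(chunks), (
        f"index has {len(records)} vectors but there are {len(chunks)} chunks"
    )

# Hypothetical two-chunk corpus with hand-written vectors.
chunks = ["reset billing cycle", "update payment method"]
meta = [
    {"doc_id": "a", "section_id": "billing", "line_span": (120, 145)},
    {"doc_id": "a", "section_id": "payments", "line_span": (146, 170)},
]
records = build_records([[3.0, 4.0], [1.0, 0.0]], meta)
check_cardinality(records, chunks)
```

The same normalization has to be applied to the query vector at read time; otherwise the inner product is no longer a cosine.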

2) Minimal reference pipelines

2.1 Python — FAISS (dense) + BM25 (hybrid via RRF)

# pip install sentence-transformers rank_bm25 faiss-cpu numpy
from sentence_transformers import SentenceTransformer
from rank_bm25 import BM25Okapi
import faiss, numpy as np

# 1) data
chunks = [...]                  # list[str], pre-chunked sentences/sections
meta   = [...]                  # list[dict], each with {doc_id, section_id, span}

# 2) encoder (cosine → L2-normalize)
enc = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
X = enc.encode(chunks, normalize_embeddings=True)
index = faiss.IndexFlatIP(X.shape[1])         # inner product == cosine on normalized vectors
index.add(X.astype(np.float32))

# 3) sparse side
tokenized = [c.split() for c in chunks]
bm25 = BM25Okapi(tokenized)

def rrf(ranks, k=60):
    # ranks: list of lists of (idx, rank_position starting at 1)
    from collections import defaultdict
    score = defaultdict(float)
    for rr in ranks:
        for idx, rp in rr:
            score[idx] += 1.0 / (k + rp)
    return sorted(score.items(), key=lambda x: -x[1])

def search(query, topk_dense=50, topk_sparse=50, out_k=20):
    qv = enc.encode([query], normalize_embeddings=True).astype(np.float32)
    d_s, d_i = index.search(qv, topk_dense)
    # ranks as (idx, rankpos)
    # FAISS pads the result with -1 when the index holds fewer than topk_dense vectors
    dense_rank  = [(int(i), r+1) for r, i in enumerate(d_i[0]) if i >= 0]
    sparse_rank = []
    for r, idx in enumerate(bm25.get_top_n(query.split(), list(range(len(chunks))), n=topk_sparse)):
        sparse_rank.append((idx, r+1))
    fused = rrf([dense_rank, sparse_rank])
    out = []
    for idx, _ in fused[:out_k]:
        out.append({"text": chunks[idx], "meta": meta[idx], "source":"hybrid"})
    return out

res = search("how to reset billing cycle")

Sanity checks

After encoding: assert np.abs(np.linalg.norm(X[0]) - 1.0) < 1e-3
After add: index.ntotal == len(chunks)
Sparse-only check: if BM25 alone outperforms dense on ID or code queries, keep the hybrid setup.

2.2 Node (TypeScript) — Elastic BM25 + Dense store

// pnpm add @elastic/elasticsearch @huggingface/inference cosine-similarity
import { Client } from "@elastic/elasticsearch";
import { HfInference } from "@huggingface/inference";
import cosine from "cosine-similarity";

const es = new Client({ node: process.env.ES_URL! });
const hf = new HfInference(process.env.HF_TOKEN!);

// 1) BM25 query
async function bm25(query: string, topk=50) {
  const { hits } = await es.search({
    index: "chunks",
    size: topk,
    query: { match: { text: { query, operator: "and" } } },
    _source: ["text", "doc_id", "section_id", "span"]
  });
  // use each hit's own _score (hits.max_score is the same value for every hit)
  return hits.hits.map((h, i) => ({ id: h._id, score: h._score, r: i + 1, src: h._source }));
}

// 2) Dense side (use same model for write+read)
async function embed(s: string) {
  const out = await hf.featureExtraction({
    model: "sentence-transformers/all-MiniLM-L6-v2",
    inputs: s
  });
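  // L2 normalize (assumes one pooled vector; unwrap it if the response is nested as [[...]])
  const v = (Array.isArray((out as any)[0]) ? (out as number[][])[0] : out) as number[];
  const norm = Math.sqrt(v.reduce((a, b) => a + b * b, 0)) || 1; // guard against a zero vector
  return v.map(x => x / norm);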
}

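Keep dense vectors in your KV/DB keyed by chunk_id. At query time, compute cosine similarity against a small candidate pool (for example the top-200 BM25 hits, i.e. call bm25 with topk=200), then fuse the BM25 and dense rankings with RRF.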
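
The pool-then-fuse step described here can be sketched in a few lines. The sketch is in Python for brevity, and everything in it is an assumption for illustration: `vector_store` is a plain dict standing in for your KV/DB, the vectors are taken to be L2-normalized already, and `rerank_pool_with_rrf` is a hypothetical helper name.

```python
def cosine(a, b):
    # Vectors are assumed to be L2-normalized, so the dot product is the cosine.
    return sum(x * y for x, y in zip(a, b))

def rerank_pool_with_rrf(bm25_ids, query_vec, vector_store, k=60, out_k=20):
    """Fuse a BM25 ranking with a dense ranking computed over the same pool.

    bm25_ids: chunk IDs in BM25 order, best first.
    vector_store: dict mapping chunk_id -> normalized vector.
    """
    # Dense ranking restricted to the BM25 candidate pool. Candidates without
    # a stored vector are skipped on the dense side only.
    scored = [
        (cid, cosine(query_vec, vector_store[cid]))
        for cid in bm25_ids
        if cid in vector_store
    ]
    dense_ids = [cid for cid, _ in sorted(scored, key=lambda p: -p[1])]

    # Reciprocal rank fusion: each list contributes 1 / (k + rank).
    fused = {}
    for ranking in (bm25_ids, dense_ids):
        for pos, cid in enumerate(ranking, start=1):
            fused[cid] = fused.get(cid, 0.0) + 1.0 / (k + pos)
    return sorted(fused, key=lambda cid: -fused[cid])[:out_k]

# Toy pool: three BM25 hits and their stored vectors.
bm25_ids = ["c1", "c2", "c3"]
vector_store = {"c1": [0.0, 1.0], "c2": [1.0, 0.0], "c3": [0.6, 0.8]}
result = rerank_pool_with_rrf(bm25_ids, [1.0, 0.0], vector_store)
```

Because the dense side only sees BM25 candidates, a chunk that BM25 misses entirely cannot be recovered here; that trade-off is what keeps the pool small.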

3) Retrieval observability (ΔS & λ)

ΔS(question, retrieved) = 1 − cos(I, G) with I = retrieved snippet embedding, G = anchor (title/expected section or gold answer).
Thresholds: <0.40 stable · 0.40–0.60 transitional · ≥0.60 action.
λ states: → convergent · ← divergent · <> recursive · × chaotic.

Probe recipe

Vary k ∈ {5, 10, 20}; plot ΔS vs k.
If curve flat & high → metric/normalization/index mismatch.
If sharp drop at higher k → retriever filter too strict; consider MMR or hybrid.
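
A minimal sketch of the probe, assuming you already have the embeddings of the ranked snippets and of the anchor. The source does not say how to aggregate ΔS over the top-k snippets; this sketch takes the minimum (best) ΔS within the top k, which is an assumption, as are the helper names.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        raise ValueError("cosine is undefined for a zero vector")
    return dot / (na * nb)

def delta_s(retrieved_vec, anchor_vec):
    """ΔS = 1 - cos(I, G), with I the snippet and G the anchor embedding."""
    return 1.0 - cosine(retrieved_vec, anchor_vec)

def zone(ds):
    """Map ΔS to the thresholds listed above."""
    if ds < 0.40:
        return "stable"
    if ds < 0.60:
        return "transitional"
    return "action"

def probe(ranked_vecs, anchor_vec, ks=(5, 10, 20)):
    """Best (lowest) ΔS among the top-k snippets, for each k.

    ASSUMPTION: min-aggregation; swap in a mean if that suits your task.
    """
    curve = {}
    for k in ks:
        top = ranked_vecs[:k]
        curve[k] = min(delta_s(v, anchor_vec) for v in top)
    return curve

# Toy ranking where the only well-aligned snippet sits at rank 4.
anchor = [1.0, 0.0]
ranked = [[0.0, 1.0], [0.0, 1.0], [1.0, 1.0], [1.0, 0.0]]
curve = probe(ranked, anchor, ks=(2, 4))
```

In this toy run ΔS is high at k=2 and drops sharply at k=4, the "sharp drop at higher k" pattern from the recipe.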

4) Prompt assembly: cite → explain (lock constraints)

Keep per-source fences (no cross-source merges).
Order: system → task → constraints → citations → answer.
Force cite-first; explanation must reference citation IDs/lines.
See: Traceability and SCU Pattern.
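
One way to assemble such a prompt is sketched below. The section labels, the `C1`, `C2` citation IDs, and the `<<<` / `>>>` fence markers are assumptions chosen for illustration; the snippet dicts follow the `{"text", "meta"}` shape returned by `search()` in 2.1.

```python
def build_prompt(system, task, constraints, snippets, question):
    """Assemble system -> task -> constraints -> citations -> answer slot."""
    fenced = []
    for i, s in enumerate(snippets, start=1):
        m = s["meta"]
        header = (
            f"[C{i}] doc={m['doc_id']} section={m['section_id']} "
            f"lines={m['span'][0]}-{m['span'][1]}"
        )
        # One fence per source: text from different sources is never merged.
        fenced.append(f"{header}\n<<<\n{s['text']}\n>>>")

    parts = [
        "SYSTEM:\n" + system,
        "TASK:\n" + task + "\nQuestion: " + question,
        "CONSTRAINTS:\n" + "\n".join(f"- {c}" for c in constraints),
        "CITATIONS:\n" + "\n\n".join(fenced),
        "ANSWER (cite first, then explain):",
    ]
    return "\n\n".join(parts)

snippets = [
    {"text": "To reset the billing cycle, open Settings.",
     "meta": {"doc_id": "a", "section_id": "billing", "span": (120, 145)}},
    {"text": "Payment methods can be changed at any time.",
     "meta": {"doc_id": "b", "section_id": "payments", "span": (10, 22)}},
]
prompt = build_prompt(
    system="Use only the cited snippets.",
    task="Reply to the user question.",
    constraints=[
        "List citation IDs before any explanation.",
        "Every claim in the explanation must reference a citation ID and its lines.",
        "Do not combine statements from different sources into one claim.",
    ],
    snippets=snippets,
    question="How to reset billing cycle?",
)
```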

5) Reranking: when & how

Add a reranker only if:

Recall@50 ≥ 0.85, but Top-5 precision is weak, or
You need tight citation alignment across near-duplicates.

Start with:

Cross-encoder (e.g. bge-reranker-base) for accuracy;
Or LLM rerank for low volume, high precision needs. See: Rerankers.
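
The gating rule and the rerank step can be sketched with a pluggable scorer. Assumptions: the 0.6 cutoff for "weak" Top-5 precision is a hypothetical value (the playbook gives none), the near-duplicate criterion is not modeled, and `overlap_score` is a toy stand-in where a cross-encoder or LLM scorer would go.

```python
def should_add_reranker(recall_at_50, precision_at_5,
                        min_recall=0.85, weak_precision=0.6):
    """Gate: first-stage recall is good enough, but the top of the list is weak.

    weak_precision is a hypothetical threshold; tune it for your task.
    """
    return recall_at_50 >= min_recall and precision_at_5 < weak_precision

def rerank(query, candidates, score_fn, top_n=5):
    """Re-order first-stage candidates by score_fn(query, text), best first."""
    scored = [(score_fn(query, c["text"]), c) for c in candidates]
    scored.sort(key=lambda p: -p[0])
    return [c for _, c in scored[:top_n]]

def overlap_score(query, text):
    """Toy scorer: fraction of query tokens present in the text."""
    q = set(query.lower().split())
    t = set(text.lower().split())
    return len(q & t) / len(q) if q else 0.0

candidates = [
    {"text": "update payment method"},
    {"text": "how to reset the billing cycle"},
]
top = rerank("reset billing cycle", candidates, overlap_score, top_n=1)
```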

6) Acceptance criteria

Retrieval sanity: ΔS(question, top-ctx) ≤ 0.45, coverage ≥ 0.70 of target section.
Traceability: snippet ↔ citation table reproducible.
Stability: 3 paraphrases of the same question keep λ convergent (→).
No SCU: who-said-what does not merge across sources.
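
The retrieval-sanity criterion can be turned into an automated check. The sketch below assumes that "coverage" means the fraction of the target section's lines that fall inside at least one retrieved span; the source does not define it, so treat that definition and the helper names as assumptions.

```python
def span_coverage(retrieved_spans, target_span):
    """Fraction of target lines (inclusive span) covered by retrieved spans."""
    start, end = target_span
    total = end - start + 1
    covered = set()
    for s, e in retrieved_spans:
        # Clip each retrieved span to the target; non-overlapping spans add nothing.
        covered.update(range(max(s, start), min(e, end) + 1))
    return len(covered) / total

def passes_retrieval_sanity(delta_s_top, coverage,
                            max_delta_s=0.45, min_coverage=0.70):
    """ΔS(question, top-ctx) <= 0.45 and coverage >= 0.70 of the target section."""
    return delta_s_top <= max_delta_s and coverage >= min_coverage

# Target section spans lines 120-145 (26 lines); two retrieved chunks overlap it.
cov = span_coverage([(118, 130), (135, 145)], (120, 145))
ok = passes_retrieval_sanity(delta_s_top=0.38, coverage=cov)
```

Here the two chunks cover 22 of the 26 target lines, so the check passes as long as ΔS stays at or below 0.45.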

7) Common failures → repair

Symptom	Likely cause	Fix
Hybrid worse than single	Analyzer/tokenizer split	Align analyzers; log per-retriever queries; see Query Parsing Split
Some facts never retrieved	Fragmented store / id skew	Rebuild + shard audit; see Vectorstore Fragmentation
Citations cross-bleed	Prompt schema unlocked	Per-source fences + cite-first; see SCU
ΔS flat & high vs k	Metric/normalization mismatch	Normalize embeddings; pin FAISS metric; see Embedding vs Semantic

8) Tiny gold set (do this!)

Create 10–50 realistic Q/A with citation lines. Commit a goldset.jsonl:

{"q":"How to reset billing cycle?","doc_id":"a","section":"billing","lines":[120,145],"a":"..."}

Run recall@50, nDCG@10, and ΔS on each PR.
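
A minimal evaluation loop over the gold set might look like the sketch below. Assumptions: a hit counts as relevant when it comes from the gold `doc_id` and its line span overlaps the gold `lines`; hits use the `{"meta": {"doc_id", "span"}}` shape from 2.1; and nDCG uses binary gains with the ideal ordering taken from the retrieved list, a common simplification. `fake_search` is a hypothetical stand-in for your retriever.

```python
import json
import math

def is_relevant(hit_meta, gold):
    """Same document and overlapping line span."""
    if hit_meta["doc_id"] != gold["doc_id"]:
        return False
    s, e = hit_meta["span"]
    gs, ge = gold["lines"]
    return s <= ge and gs <= e

def recall_at_k(hits, gold, k=50):
    """1.0 if any of the top-k hits overlaps the gold lines, else 0.0."""
    return 1.0 if any(is_relevant(h["meta"], gold) for h in hits[:k]) else 0.0

def ndcg_at_k(hits, gold, k=10):
    """Binary-gain nDCG; ideal ranking is the retrieved gains sorted descending."""
    gains = [1.0 if is_relevant(h["meta"], gold) else 0.0 for h in hits[:k]]
    dcg = sum(g / math.log2(i + 2) for i, g in enumerate(gains))
    idcg = sum(g / math.log2(i + 2)
               for i, g in enumerate(sorted(gains, reverse=True)))
    return dcg / idcg if idcg > 0 else 0.0

def evaluate(jsonl_text, search_fn):
    """Mean recall@50 and nDCG@10 over every non-empty JSONL line."""
    golds = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]
    recalls, ndcgs = [], []
    for gold in golds:
        hits = search_fn(gold["q"])
        recalls.append(recall_at_k(hits, gold, k=50))
        ndcgs.append(ndcg_at_k(hits, gold, k=10))
    n = len(golds)
    return {"recall@50": sum(recalls) / n, "ndcg@10": sum(ndcgs) / n}

def fake_search(query):
    # Hypothetical retriever: the relevant chunk is returned at rank 2.
    return [
        {"meta": {"doc_id": "b", "span": (1, 10)}},
        {"meta": {"doc_id": "a", "span": (118, 130)}},
    ]

goldset = '{"q":"How to reset billing cycle?","doc_id":"a","section":"billing","lines":[120,145],"a":"..."}\n'
report = evaluate(goldset, fake_search)
```

In CI, read `goldset.jsonl` from disk, pass your real `search` function, and fail the build when either metric falls below the last accepted value.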

🔗 Quick-Start Downloads (60 sec)

Tool	Link	3-Step Setup
WFGY 1.0 PDF	Engine Paper	1️⃣ Download · 2️⃣ Upload to your LLM · 3️⃣ Ask “Answer using WFGY + <your question>”
TXT OS (plain-text OS)	TXTOS.txt	1️⃣ Download · 2️⃣ Paste into any LLM chat · 3️⃣ Type “hello world” — OS boots instantly

🧭 Explore More

Module	Description	Link
WFGY Core	WFGY 2.0 engine is live: full symbolic reasoning architecture and math stack	View →
Problem Map 1.0	Initial 16-mode diagnostic and symbolic fix framework	View →
Problem Map 2.0	RAG-focused failure tree, modular fixes, and pipelines	View →
Semantic Clinic Index	Expanded failure catalog: prompt injection, memory bugs, logic drift	View →
Semantic Blueprint	Layer-based symbolic reasoning & semantic modulations	View →
Benchmark vs GPT-5	Stress test GPT-5 with full WFGY reasoning suite	View →
🧙‍♂️ Starter Village 🏡	New here? Lost in symbols? Click here and let the wizard guide you through	Start →
