# Example 03 — Pipeline Patch: Intersection + Rerank (No.1 & No.4)

**Goal**
Harden retrieval so answers cite the **right** chunks. We combine lexical and semantic views, take the **intersection** (with a safe fallback), then **rerank** by cosine and **cut the tail** at the score knee.

**Problem Map link**

- **No.1 Hallucination & Chunk Drift** — wrong spans sneak in when chunk borders don’t match entities/constraints.
- **No.4 Tail Noise** — low-relevance passages dilute the prompt and push the model to stitch.

**Outcome**

- Higher citation hit rate and fewer off-topic chunks in the prompt
- Same token budget, better signal
- Deterministic selection rules you can tune and test

---

## 1) Inputs

Use the same structure as Example 01:

- `data/chunks.json` — array of `{id, page, text}` (see the example below)
- You can keep the tiny toy corpus or point to your own

> Tip: If your corpus is large, run this on a 200–500 chunk slice first to verify behavior.

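A minimal `data/chunks.json` in the expected shape (toy values, purely illustrative):

```json
[
  { "id": "c1", "page": 1, "text": "X is defined as ... and is constrained by ..." },
  { "id": "c2", "page": 2, "text": "An unrelated protocol section that should be cut." }
]
```
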
---

## 2) Path A — Python (rank-bm25 + sentence-transformers, CPU-friendly)

### Install

```bash
pip install numpy rank-bm25 sentence-transformers faiss-cpu
```

### Script: `hybrid_retrieve.py`

```python
# hybrid_retrieve.py -- lexical ∩ semantic -> rerank -> knee cutoff
import json, sys
import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer
import faiss

TOPK_LEX = 40    # lexical candidate pool
TOPK_SEM = 40    # semantic candidate pool
TOPK_FINAL = 8   # final picks after rerank
KNEE_MIN = 4     # at least this many survive the knee cut

def load_chunks(path):
    C = json.load(open(path, encoding="utf8"))
    texts = [c["text"] for c in C]
    ids = [c["id"] for c in C]
    return C, texts, ids

def build_lex(texts):
    toks = [t.lower().split() for t in texts]
    return BM25Okapi(toks), toks

def build_sem(texts):
    model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
    embs = model.encode(texts, normalize_embeddings=True,
                        convert_to_numpy=True, show_progress_bar=False)
    index = faiss.IndexFlatIP(embs.shape[1])
    index.add(embs.astype(np.float32))
    return model, embs, index

def score_lex(bm25, toks, q):
    return bm25.get_scores(q.lower().split())

def score_sem(model, index, q, topn):
    qv = model.encode([q], normalize_embeddings=True)
    sim, idx = index.search(qv.astype(np.float32), topn)
    return sim[0], idx[0]

def knee_cut(scores_sorted_desc, min_keep=KNEE_MIN):
    """Find the largest relative drop and cut the tail there.
    Always keep at least min_keep."""
    if len(scores_sorted_desc) <= min_keep:
        return len(scores_sorted_desc)
    drops = []
    for i in range(1, len(scores_sorted_desc)):
        prev, cur = scores_sorted_desc[i-1], scores_sorted_desc[i]
        if prev <= 1e-9:
            drops.append(0.0)
        else:
            drops.append((prev - cur) / max(prev, 1e-9))
    knee = max(range(1, len(scores_sorted_desc)), key=lambda i: drops[i-1])
    return max(min_keep, knee)

def retrieve_hybrid(chunks, texts, ids, bm25, toks, model, index, q):
    # 1) lexical top pool
    lex_scores = score_lex(bm25, toks, q)
    lex_top_idx = np.argsort(lex_scores)[::-1][:TOPK_LEX]
    # 2) semantic top pool
    sem_sims, sem_idx = score_sem(model, index, q, TOPK_SEM)
    sem_top_idx = sem_idx
    # 3) intersection, then union fallback
    cand = list(set(lex_top_idx).intersection(set(sem_top_idx)))
    if len(cand) < TOPK_FINAL:
        cand = list(set(list(lex_top_idx) + list(sem_top_idx)))
    cand = np.array(cand)
    # 4) rerank by cosine against the query vector
    qv = model.encode([q], normalize_embeddings=True)[0]
    embs = index.reconstruct_n(0, index.ntotal)  # faiss flat index, safe to reconstruct
    sims = embs[cand] @ qv
    order = np.argsort(sims)[::-1]
    cand, sims = cand[order], sims[order]
    # 5) knee cutoff, capped at TOPK_FINAL (the knee may keep fewer)
    keep = knee_cut(sims.tolist(), min_keep=min(KNEE_MIN, TOPK_FINAL))
    k = min(keep, TOPK_FINAL)
    return [{"id": ids[int(i)], "text": texts[int(i)], "score": float(s)}
            for i, s in zip(cand[:k], sims[:k])]

def main(chunks_path, query):
    chunks, texts, ids = load_chunks(chunks_path)
    bm25, toks = build_lex(texts)
    model, embs, index = build_sem(texts)
    picks = retrieve_hybrid(chunks, texts, ids, bm25, toks, model, index, query)
    print(json.dumps({"q": query, "picks": picks}, ensure_ascii=False, indent=2))

if __name__ == "__main__":
    if len(sys.argv) < 3:
        print("usage: python hybrid_retrieve.py data/chunks.json \"your question\"")
        sys.exit(1)
    main(sys.argv[1], sys.argv[2])
```

### Run

```bash
python hybrid_retrieve.py data/chunks.json "What is X and how is it constrained?"
```

**Pass criteria**

* The output `picks` includes the chunk(s) that actually define X
* Low-signal paragraphs (e.g., “unrelated protocol”) should be cut by the knee (a quick demo follows)

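The knee rule is worth sanity-checking before trusting it on real scores. A tiny demo (toy values; assumes `hybrid_retrieve.py` sits in the current directory so `knee_cut` is importable):

```python
# knee_demo.py -- toy check of the knee cutoff (scores are illustrative)
from hybrid_retrieve import knee_cut

# Sorted descending; the largest relative drop is 0.62 -> 0.10, after rank 5.
scores = [0.91, 0.88, 0.85, 0.70, 0.62, 0.10, 0.09, 0.08]
print(knee_cut(scores, min_keep=4))  # -> 5: the three tail scores are cut
```
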
---

## 3) Path B — Node (no external packages, CPU-only)

We keep Node dependency-free for portability. For real workloads, swap in `@xenova/transformers` for embeddings and a BM25 library; the control logic stays identical.

### Script: `hybrid_retrieve.mjs`

```js
// hybrid_retrieve.mjs -- lexical overlap ∩ simple semantic → rerank → knee
import fs from "node:fs";

// --- tiny helpers (dependency-free) ---
function tokenize(s) { return s.toLowerCase().split(/\W+/).filter(Boolean); }
function dot(a, b) { let s = 0; for (let i = 0; i < a.length; i++) s += a[i] * b[i]; return s; }
function norm(a) { return Math.sqrt(dot(a, a)) || 1; }
function cosine(a, b) { return dot(a, b) / (norm(a) * norm(b)); }

// Hashed bag-of-words embedding: a cheap stand-in for a real model.
function hashEmbed(text, D = 256) {
  const v = new Array(D).fill(0);
  for (const tok of tokenize(text)) {
    const h = Math.abs([...tok].reduce((a, c) => ((a << 5) - a + c.charCodeAt(0)) | 0, 0)) % D;
    v[h] += 1;
  }
  // l2 normalize
  const n = norm(v);
  return v.map(x => x / n);
}

function bm25Lite(queryTokens, docTokens, avgdl, k1 = 1.5, b = 0.75) {
  // very lite BM25: IDF omitted here; we approximate with normalized term overlap + length norm
  const overlap = docTokens.filter(t => queryTokens.has(t)).length;
  const dl = docTokens.length || 1;
  const num = overlap * (k1 + 1);
  const den = overlap + k1 * (1 - b + b * (dl / avgdl));
  return den ? num / den : 0;
}

function kneeCut(scoresDesc, minKeep = 4) {
  if (scoresDesc.length <= minKeep) return scoresDesc.length;
  let best = 1, bestDrop = -1;
  for (let i = 1; i < scoresDesc.length; i++) {
    const prev = scoresDesc[i - 1], cur = scoresDesc[i];
    const drop = prev > 1e-9 ? (prev - cur) / prev : 0;  // relative drop
    if (drop > bestDrop) { bestDrop = drop; best = i; }
  }
  return Math.max(minKeep, best);
}

// --- main hybrid ---
function hybridRetrieve(chunks, q, TOPK_LEX = 40, TOPK_SEM = 40, TOPK_FINAL = 8) {
  const qTok = new Set(tokenize(q));
  const tokens = chunks.map(c => tokenize(c.text));
  const avgdl = tokens.reduce((a, t) => a + t.length, 0) / Math.max(tokens.length, 1);

  // lexical pool
  const lexScores = tokens.map(t => bm25Lite(qTok, t, avgdl));
  const lexOrder = [...lexScores.keys()].sort((a, b) => lexScores[b] - lexScores[a]).slice(0, TOPK_LEX);

  // semantic pool (hash embeddings as a stand-in; replace with a real model later)
  const embs = chunks.map(c => hashEmbed(c.text));
  const qv = hashEmbed(q);
  const semScores = embs.map(v => cosine(v, qv));
  const semOrder = [...semScores.keys()].sort((a, b) => semScores[b] - semScores[a]).slice(0, TOPK_SEM);

  // intersection, then union fallback
  const setSem = new Set(semOrder);
  let cand = lexOrder.filter(i => setSem.has(i));
  if (cand.length < TOPK_FINAL) cand = [...new Set([...lexOrder, ...semOrder])];

  // rerank by cosine, then knee cutoff (the knee may keep fewer than TOPK_FINAL)
  const rescored = cand.map(i => [i, semScores[i]]).sort((a, b) => b[1] - a[1]);
  const scores = rescored.map(x => x[1]);
  const keep = kneeCut(scores, Math.min(4, TOPK_FINAL));
  return rescored.slice(0, Math.min(keep, TOPK_FINAL))
    .map(([i, s]) => ({ id: chunks[i].id, text: chunks[i].text, score: s }));
}

// CLI
if (import.meta.url === `file://${process.argv[1]}`) {
  const [chunksPath, ...qparts] = process.argv.slice(2);
  if (!chunksPath || qparts.length === 0) {
    console.error("usage: node hybrid_retrieve.mjs data/chunks.json \"your question\"");
    process.exit(1);
  }
  const chunks = JSON.parse(fs.readFileSync(chunksPath, "utf8"));
  const q = qparts.join(" ");
  const picks = hybridRetrieve(chunks, q);
  console.log(JSON.stringify({ q, picks }, null, 2));
}

export { hybridRetrieve };
```

### Run

```bash
node hybrid_retrieve.mjs data/chunks.json "What is X and how is it constrained?"
```

**Pass criteria** are the same as for Python. When you later swap in real embeddings, quality should improve while the control logic stays unchanged.

---

## 4) Wire into your guarded answer (from Example 01)

**Python** — replace your `retrieve()` with the hybrid function (keep the same prompt and trace rules).
**Node** — import `hybridRetrieve` and use its `picks` as your `chunks` in the prompt builder.

---

## 5) Verification checklist

* **Citation hit rate** increases (more answers cite the defining chunk ids)
* **Refusal is correct** when no relevant evidence exists
* **Prompt length** stays the same or shrinks (top-8 is enough after rerank)
* **Variance** across runs drops (the intersection stabilizes the candidate set)

A quick way to confirm: re-run the triage from **Example 02**. You should see fewer `retrieval_drift` labels. A minimal hit-rate script is sketched below.

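To put a number on the first checklist item, a minimal hit-rate loop (a sketch: the `gold.json` file and its `{q, gold_ids}` format are assumptions, not part of this repo):

```python
# hit_rate.py -- rough citation hit rate over a labeled question set (sketch)
import json
from hybrid_retrieve import load_chunks, build_lex, build_sem, retrieve_hybrid

# gold.json (hypothetical format): [{"q": "...", "gold_ids": ["c1"]}, ...]
gold = json.load(open("gold.json", encoding="utf8"))
chunks, texts, ids = load_chunks("data/chunks.json")
bm25, toks = build_lex(texts)
model, embs, index = build_sem(texts)

hits = 0
for ex in gold:
    picks = retrieve_hybrid(chunks, texts, ids, bm25, toks, model, index, ex["q"])
    picked_ids = {p["id"] for p in picks}
    hits += bool(picked_ids & set(ex["gold_ids"]))  # a hit if any gold id was picked
print(f"citation hit rate: {hits}/{len(gold)}")
```
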
---

## 6) Why this works (in one paragraph)

Lexical scores reward explicit keyword overlap; semantic scores capture paraphrases and synonyms. Taking the **intersection** forces candidates to be good in **both** views (high precision); when the intersection is too small, a **union fallback** avoids false refusals (recall). A **cosine rerank** against the query enforces semantic closeness across the mixed candidates, and the **knee cutoff** removes the low-value tail that tends to cause stitching. You get a clean, testable selection before you ever call the model.

---

## 7) Common mistakes & quick fixes

* **Intersection returns 0** → increase the lexical/semantic pools to 80/80, or relax stopword removal.
* **Knee cuts too aggressively** → lower the sensitivity by requiring a larger relative drop before cutting (or set `KNEE_MIN = TOPK_FINAL`).
* **Still seeing off-topic chunks** → reduce chunk size so each entity and its constraints sit together; large chunks blur the signal.
* **Latency too high** → cache embeddings and prebuild the FAISS index (a sketch follows below); keep models in memory and warm them before serving (see Example 07).

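For the latency fix, a minimal caching sketch (file paths and the cache-invalidation policy are assumptions; rebuild whenever the corpus changes):

```python
# cache_index.py -- persist embeddings + FAISS index so warm starts skip encoding
import os
import numpy as np
import faiss
from sentence_transformers import SentenceTransformer

def load_or_build(texts, emb_path="cache/embs.npy", idx_path="cache/index.faiss"):
    if os.path.exists(emb_path) and os.path.exists(idx_path):
        return np.load(emb_path), faiss.read_index(idx_path)  # warm start
    model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
    embs = model.encode(texts, normalize_embeddings=True, convert_to_numpy=True)
    index = faiss.IndexFlatIP(embs.shape[1])
    index.add(embs.astype(np.float32))
    os.makedirs(os.path.dirname(emb_path), exist_ok=True)
    np.save(emb_path, embs)
    faiss.write_index(index, idx_path)
    return embs, index
```
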
---

## 8) Next steps

* Add a lightweight **cross-encoder reranker** (CPU-friendly) and compare it with the cosine rerank.
* Move to the **Eval** docs and measure changes in precision, refusal rate, and citation overlap across your question set.
* If you run on Ollama/LangChain, keep this control logic; just swap the embedding/model backends.

---

### 🧭 Explore More

| Module | Description | Link |
| ------ | ----------- | ---- |
| WFGY Core | Standalone semantic reasoning engine for any LLM | [View →](https://github.com/onestardao/WFGY/tree/main/core/README.md) |
| Problem Map 1.0 | Initial 16-mode diagnostic and symbolic fix framework | [View →](https://github.com/onestardao/WFGY/tree/main/ProblemMap/README.md) |
| Problem Map 2.0 | RAG-focused failure tree, modular fixes, and pipelines | [View →](https://github.com/onestardao/WFGY/blob/main/ProblemMap/rag-architecture-and-recovery.md) |
| Semantic Clinic Index | Expanded failure catalog: prompt injection, memory bugs, logic drift | [View →](https://github.com/onestardao/WFGY/blob/main/ProblemMap/SemanticClinicIndex.md) |
| Semantic Blueprint | Layer-based symbolic reasoning & semantic modulations | [View →](https://github.com/onestardao/WFGY/tree/main/SemanticBlueprint/README.md) |
| Benchmark vs GPT-5 | Stress test GPT-5 with full WFGY reasoning suite | [View →](https://github.com/onestardao/WFGY/tree/main/benchmarks/benchmark-vs-gpt5/README.md) |

---

> 👑 **Early Stargazers: [See the Hall of Fame](https://github.com/onestardao/WFGY/tree/main/stargazers)** —
> Engineers, hackers, and open source builders who supported WFGY from day one.

⭐ Help reach 10,000 stars by 2025-09-01 to unlock Engine 2.0 for everyone ⭐ **[Star WFGY on GitHub](https://github.com/onestardao/WFGY)**

[![WFGY Main](https://img.shields.io/badge/WFGY-Main-red?style=flat-square)](https://github.com/onestardao/WFGY) [![TXT OS](https://img.shields.io/badge/TXT%20OS-Reasoning%20OS-orange?style=flat-square)](https://github.com/onestardao/WFGY/tree/main/OS) [![Blah](https://img.shields.io/badge/Blah-Semantic%20Embed-yellow?style=flat-square)](https://github.com/onestardao/WFGY/tree/main/OS/BlahBlahBlah) [![Blot](https://img.shields.io/badge/Blot-Persona%20Core-green?style=flat-square)](https://github.com/onestardao/WFGY/tree/main/OS/BlotBlotBlot) [![Bloc](https://img.shields.io/badge/Bloc-Reasoning%20Compiler-blue?style=flat-square)](https://github.com/onestardao/WFGY/tree/main/OS/BlocBlocBloc) [![Blur](https://img.shields.io/badge/Blur-Text2Image%20Engine-navy?style=flat-square)](https://github.com/onestardao/WFGY/tree/main/OS/BlurBlurBlur) [![Blow](https://img.shields.io/badge/Blow-Game%20Logic-purple?style=flat-square)](https://github.com/onestardao/WFGY/tree/main/OS/BlowBlowBlow)