# Example 03 — Pipeline Patch: Intersection + Rerank (No.1 & No.4)

**Goal**
Harden retrieval so answers cite the **right** chunks. We combine lexical and semantic views, take the **intersection** (with a safe fallback), then **rerank** by cosine and **cut the tail** at the score knee.

**Problem Map link**

- **No.1 Hallucination & Chunk Drift** — wrong spans sneak in when chunk borders don’t match entities/constraints.
- **No.4 Tail Noise** — low-relevance passages dilute the prompt and push the model to stitch.

**Outcome**

- Higher citation hit rate and fewer off-topic chunks in the prompt
- Same token budget, better signal
- Deterministic selection rules you can tune and test

---

## 1) Inputs

Use the same structure as Example 01:

- `data/chunks.json` — array of `{id, page, text}` (see the example below)
- You can keep the tiny toy corpus or point to your own

> Tip: If your corpus is large, run this on a 200–500 chunk slice first to verify behavior.

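A minimal `data/chunks.json` in the expected shape (toy values, purely illustrative):

```json
[
  { "id": "c1", "page": 1, "text": "X is defined as ... and is constrained by ..." },
  { "id": "c2", "page": 2, "text": "An unrelated protocol section that should be cut." }
]
```
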
---

## 2) Path A — Python (rank-bm25 + sentence-transformers, CPU-friendly)

### Install

```bash
pip install numpy rank-bm25 sentence-transformers faiss-cpu
```

### Script: `hybrid_retrieve.py`

```python
# hybrid_retrieve.py -- lexical ∩ semantic -> rerank -> knee cutoff
import json, sys
import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer
import faiss

TOPK_LEX = 40    # lexical candidate pool
TOPK_SEM = 40    # semantic candidate pool
TOPK_FINAL = 8   # final picks after rerank
KNEE_MIN = 4     # at least this many survive the knee cut

def load_chunks(path):
    C = json.load(open(path, encoding="utf8"))
    texts = [c["text"] for c in C]
    ids = [c["id"] for c in C]
    return C, texts, ids

def build_lex(texts):
    toks = [t.lower().split() for t in texts]
    return BM25Okapi(toks), toks

def build_sem(texts):
    model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
    embs = model.encode(texts, normalize_embeddings=True,
                        convert_to_numpy=True, show_progress_bar=False)
    index = faiss.IndexFlatIP(embs.shape[1])
    index.add(embs.astype(np.float32))
    return model, embs, index

def score_lex(bm25, toks, q):
    return bm25.get_scores(q.lower().split())

def score_sem(model, index, q, topn):
    qv = model.encode([q], normalize_embeddings=True)
    sim, idx = index.search(qv.astype(np.float32), topn)
    return sim[0], idx[0]

def knee_cut(scores_sorted_desc, min_keep=KNEE_MIN):
    """Find the largest relative drop and cut the tail there.
    Always keep at least min_keep."""
    if len(scores_sorted_desc) <= min_keep:
        return len(scores_sorted_desc)
    drops = []
    for i in range(1, len(scores_sorted_desc)):
        prev, cur = scores_sorted_desc[i-1], scores_sorted_desc[i]
        if prev <= 1e-9:
            drops.append(0.0)
        else:
            drops.append((prev - cur) / max(prev, 1e-9))
    knee = max(range(1, len(scores_sorted_desc)), key=lambda i: drops[i-1])
    return max(min_keep, knee)

def retrieve_hybrid(chunks, texts, ids, bm25, toks, model, index, q):
    # 1) lexical top pool
    lex_scores = score_lex(bm25, toks, q)
    lex_top_idx = np.argsort(lex_scores)[::-1][:TOPK_LEX]
    # 2) semantic top pool
    sem_sims, sem_idx = score_sem(model, index, q, TOPK_SEM)
    sem_top_idx = sem_idx
    # 3) intersection, then union fallback
    cand = list(set(lex_top_idx).intersection(set(sem_top_idx)))
    if len(cand) < TOPK_FINAL:
        cand = list(set(list(lex_top_idx) + list(sem_top_idx)))
    cand = np.array(cand)
    # 4) rerank by cosine against the query vector
    qv = model.encode([q], normalize_embeddings=True)[0]
    embs = index.reconstruct_n(0, index.ntotal)  # faiss flat index, safe to reconstruct
    sims = embs[cand] @ qv
    order = np.argsort(sims)[::-1]
    cand, sims = cand[order], sims[order]
    # 5) knee cutoff, capped at TOPK_FINAL (the knee may keep fewer)
    keep = knee_cut(sims.tolist(), min_keep=min(KNEE_MIN, TOPK_FINAL))
    k = min(keep, TOPK_FINAL)
    return [{"id": ids[int(i)], "text": texts[int(i)], "score": float(s)}
            for i, s in zip(cand[:k], sims[:k])]

def main(chunks_path, query):
    chunks, texts, ids = load_chunks(chunks_path)
    bm25, toks = build_lex(texts)
    model, embs, index = build_sem(texts)
    picks = retrieve_hybrid(chunks, texts, ids, bm25, toks, model, index, query)
    print(json.dumps({"q": query, "picks": picks}, ensure_ascii=False, indent=2))

if __name__ == "__main__":
    if len(sys.argv) < 3:
        print("usage: python hybrid_retrieve.py data/chunks.json \"your question\"")
        sys.exit(1)
    main(sys.argv[1], sys.argv[2])
```

### Run

```bash
python hybrid_retrieve.py data/chunks.json "What is X and how is it constrained?"
```

**Pass criteria**

* The output `picks` includes the chunk(s) that actually define X
* Low-signal paragraphs (e.g., “unrelated protocol”) should be cut by the knee (a quick demo follows)

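The knee rule is worth sanity-checking before trusting it on real scores. A tiny demo (toy values; assumes `hybrid_retrieve.py` sits in the current directory so `knee_cut` is importable):

```python
# knee_demo.py -- toy check of the knee cutoff (scores are illustrative)
from hybrid_retrieve import knee_cut

# Sorted descending; the largest relative drop is 0.62 -> 0.10, after rank 5.
scores = [0.91, 0.88, 0.85, 0.70, 0.62, 0.10, 0.09, 0.08]
print(knee_cut(scores, min_keep=4))  # -> 5: the three tail scores are cut
```
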
---

## 3) Path B — Node (no external packages, CPU-only)

We keep Node dependency-free for portability. For real workloads, swap in `@xenova/transformers` for embeddings and a BM25 library; the control logic stays identical.

### Script: `hybrid_retrieve.mjs`

```js
// hybrid_retrieve.mjs -- lexical overlap ∩ simple semantic → rerank → knee
import fs from "node:fs";

// --- tiny helpers (dependency-free) ---
function tokenize(s) { return s.toLowerCase().split(/\W+/).filter(Boolean); }
function dot(a, b) { let s = 0; for (let i = 0; i < a.length; i++) s += a[i] * b[i]; return s; }
function norm(a) { return Math.sqrt(dot(a, a)) || 1; }
function cosine(a, b) { return dot(a, b) / (norm(a) * norm(b)); }

// Hashed bag-of-words embedding: a cheap stand-in for a real model.
function hashEmbed(text, D = 256) {
  const v = new Array(D).fill(0);
  for (const tok of tokenize(text)) {
    const h = Math.abs([...tok].reduce((a, c) => ((a << 5) - a + c.charCodeAt(0)) | 0, 0)) % D;
    v[h] += 1;
  }
  // l2 normalize
  const n = norm(v);
  return v.map(x => x / n);
}

function bm25Lite(queryTokens, docTokens, avgdl, k1 = 1.5, b = 0.75) {
  // very lite BM25: IDF omitted here; we approximate with normalized term overlap + length norm
  const overlap = docTokens.filter(t => queryTokens.has(t)).length;
  const dl = docTokens.length || 1;
  const num = overlap * (k1 + 1);
  const den = overlap + k1 * (1 - b + b * (dl / avgdl));
  return den ? num / den : 0;
}

function kneeCut(scoresDesc, minKeep = 4) {
  if (scoresDesc.length <= minKeep) return scoresDesc.length;
  let best = 1, bestDrop = -1;
  for (let i = 1; i < scoresDesc.length; i++) {
    const prev = scoresDesc[i - 1], cur = scoresDesc[i];
    const drop = prev > 1e-9 ? (prev - cur) / prev : 0;  // relative drop
    if (drop > bestDrop) { bestDrop = drop; best = i; }
  }
  return Math.max(minKeep, best);
}

// --- main hybrid ---
function hybridRetrieve(chunks, q, TOPK_LEX = 40, TOPK_SEM = 40, TOPK_FINAL = 8) {
  const qTok = new Set(tokenize(q));
  const tokens = chunks.map(c => tokenize(c.text));
  const avgdl = tokens.reduce((a, t) => a + t.length, 0) / Math.max(tokens.length, 1);

  // lexical pool
  const lexScores = tokens.map(t => bm25Lite(qTok, t, avgdl));
  const lexOrder = [...lexScores.keys()].sort((a, b) => lexScores[b] - lexScores[a]).slice(0, TOPK_LEX);

  // semantic pool (hash embeddings as a stand-in; replace with a real model later)
  const embs = chunks.map(c => hashEmbed(c.text));
  const qv = hashEmbed(q);
  const semScores = embs.map(v => cosine(v, qv));
  const semOrder = [...semScores.keys()].sort((a, b) => semScores[b] - semScores[a]).slice(0, TOPK_SEM);

  // intersection, then union fallback
  const setSem = new Set(semOrder);
  let cand = lexOrder.filter(i => setSem.has(i));
  if (cand.length < TOPK_FINAL) cand = [...new Set([...lexOrder, ...semOrder])];

  // rerank by cosine, then knee cutoff (the knee may keep fewer than TOPK_FINAL)
  const rescored = cand.map(i => [i, semScores[i]]).sort((a, b) => b[1] - a[1]);
  const scores = rescored.map(x => x[1]);
  const keep = kneeCut(scores, Math.min(4, TOPK_FINAL));
  return rescored.slice(0, Math.min(keep, TOPK_FINAL))
    .map(([i, s]) => ({ id: chunks[i].id, text: chunks[i].text, score: s }));
}

// CLI
if (import.meta.url === `file://${process.argv[1]}`) {
  const [chunksPath, ...qparts] = process.argv.slice(2);
  if (!chunksPath || qparts.length === 0) {
    console.error("usage: node hybrid_retrieve.mjs data/chunks.json \"your question\"");
    process.exit(1);
  }
  const chunks = JSON.parse(fs.readFileSync(chunksPath, "utf8"));
  const q = qparts.join(" ");
  const picks = hybridRetrieve(chunks, q);
  console.log(JSON.stringify({ q, picks }, null, 2));
}

export { hybridRetrieve };
```

### Run

```bash
node hybrid_retrieve.mjs data/chunks.json "What is X and how is it constrained?"
```

**Pass criteria** are the same as for Python. When you later swap in real embeddings, quality should improve while the control logic stays unchanged.

---

## 4) Wire into your guarded answer (from Example 01)

**Python** — replace your `retrieve()` with the hybrid function (keep the same prompt and trace rules).
**Node** — import `hybridRetrieve` and use its `picks` as your `chunks` in the prompt builder.

---

## 5) Verification checklist

* **Citation hit rate** increases (more answers cite the defining chunk ids)
* **Refusal is correct** when no relevant evidence exists
* **Prompt length** stays the same or shrinks (top-8 is enough after rerank)
* **Variance** across runs drops (the intersection stabilizes the candidate set)

A quick way to confirm: re-run the triage from **Example 02**. You should see fewer `retrieval_drift` labels. A minimal hit-rate script is sketched below.

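To put a number on the first checklist item, a minimal hit-rate loop (a sketch: the `gold.json` file and its `{q, gold_ids}` format are assumptions, not part of this repo):

```python
# hit_rate.py -- rough citation hit rate over a labeled question set (sketch)
import json
from hybrid_retrieve import load_chunks, build_lex, build_sem, retrieve_hybrid

# gold.json (hypothetical format): [{"q": "...", "gold_ids": ["c1"]}, ...]
gold = json.load(open("gold.json", encoding="utf8"))
chunks, texts, ids = load_chunks("data/chunks.json")
bm25, toks = build_lex(texts)
model, embs, index = build_sem(texts)

hits = 0
for ex in gold:
    picks = retrieve_hybrid(chunks, texts, ids, bm25, toks, model, index, ex["q"])
    picked_ids = {p["id"] for p in picks}
    hits += bool(picked_ids & set(ex["gold_ids"]))  # a hit if any gold id was picked
print(f"citation hit rate: {hits}/{len(gold)}")
```
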
---

## 6) Why this works (in one paragraph)

Lexical scores reward explicit keyword overlap; semantic scores capture paraphrases and synonyms. Taking the **intersection** forces candidates to be good in **both** views (high precision); when the intersection is too small, a **union fallback** avoids false refusals (recall). A **cosine rerank** against the query enforces semantic closeness across the mixed candidates, and the **knee cutoff** removes the low-value tail that tends to cause stitching. You get a clean, testable selection before you ever call the model.

---

## 7) Common mistakes & quick fixes

* **Intersection returns 0** → increase the lexical/semantic pools to 80/80, or relax stopword removal.
* **Knee cuts too aggressively** → lower the sensitivity by requiring a larger relative drop before cutting (or set `KNEE_MIN = TOPK_FINAL`).
* **Still seeing off-topic chunks** → reduce chunk size so each entity and its constraints sit together; large chunks blur the signal.
* **Latency too high** → cache embeddings and prebuild the FAISS index (a sketch follows below); keep models in memory and warm them before serving (see Example 07).

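For the latency fix, a minimal caching sketch (file paths and the cache-invalidation policy are assumptions; rebuild whenever the corpus changes):

```python
# cache_index.py -- persist embeddings + FAISS index so warm starts skip encoding
import os
import numpy as np
import faiss
from sentence_transformers import SentenceTransformer

def load_or_build(texts, emb_path="cache/embs.npy", idx_path="cache/index.faiss"):
    if os.path.exists(emb_path) and os.path.exists(idx_path):
        return np.load(emb_path), faiss.read_index(idx_path)  # warm start
    model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
    embs = model.encode(texts, normalize_embeddings=True, convert_to_numpy=True)
    index = faiss.IndexFlatIP(embs.shape[1])
    index.add(embs.astype(np.float32))
    os.makedirs(os.path.dirname(emb_path), exist_ok=True)
    np.save(emb_path, embs)
    faiss.write_index(index, idx_path)
    return embs, index
```
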
---

## 8) Next steps

* Add a lightweight **cross-encoder reranker** (CPU-friendly) and compare it with the cosine rerank.
* Move to the **Eval** docs and measure changes in precision, refusal rate, and citation overlap across your question set.
* If you run on Ollama/LangChain, keep this control logic; just swap the embedding/model backends.

---

### 🧭 Explore More

| Module | Description | Link |
| ------ | ----------- | ---- |
| WFGY Core | Standalone semantic reasoning engine for any LLM | [View →](https://github.com/onestardao/WFGY/tree/main/core/README.md) |
| Problem Map 1.0 | Initial 16-mode diagnostic and symbolic fix framework | [View →](https://github.com/onestardao/WFGY/tree/main/ProblemMap/README.md) |
| Problem Map 2.0 | RAG-focused failure tree, modular fixes, and pipelines | [View →](https://github.com/onestardao/WFGY/blob/main/ProblemMap/rag-architecture-and-recovery.md) |
| Semantic Clinic Index | Expanded failure catalog: prompt injection, memory bugs, logic drift | [View →](https://github.com/onestardao/WFGY/blob/main/ProblemMap/SemanticClinicIndex.md) |
| Semantic Blueprint | Layer-based symbolic reasoning & semantic modulations | [View →](https://github.com/onestardao/WFGY/tree/main/SemanticBlueprint/README.md) |
| Benchmark vs GPT-5 | Stress test GPT-5 with full WFGY reasoning suite | [View →](https://github.com/onestardao/WFGY/tree/main/benchmarks/benchmark-vs-gpt5/README.md) |

---

> 👑 **Early Stargazers: [See the Hall of Fame](https://github.com/onestardao/WFGY/tree/main/stargazers)** —
> Engineers, hackers, and open source builders who supported WFGY from day one.

⭐ Help reach 10,000 stars by 2025-09-01 to unlock Engine 2.0 for everyone ⭐ **[Star WFGY on GitHub](https://github.com/onestardao/WFGY)**

[![WFGY Main](https://img.shields.io/badge/WFGY-Main-red?style=flat-square)](https://github.com/onestardao/WFGY) [![TXT OS](https://img.shields.io/badge/TXT%20OS-Reasoning%20OS-orange?style=flat-square)](https://github.com/onestardao/WFGY/tree/main/OS) [![Blah](https://img.shields.io/badge/Blah-Semantic%20Embed-yellow?style=flat-square)](https://github.com/onestardao/WFGY/tree/main/OS/BlahBlahBlah) [![Blot](https://img.shields.io/badge/Blot-Persona%20Core-green?style=flat-square)](https://github.com/onestardao/WFGY/tree/main/OS/BlotBlotBlot) [![Bloc](https://img.shields.io/badge/Bloc-Reasoning%20Compiler-blue?style=flat-square)](https://github.com/onestardao/WFGY/tree/main/OS/BlocBlocBloc) [![Blur](https://img.shields.io/badge/Blur-Text2Image%20Engine-navy?style=flat-square)](https://github.com/onestardao/WFGY/tree/main/OS/BlurBlurBlur) [![Blow](https://img.shields.io/badge/Blow-Game%20Logic-purple?style=flat-square)](https://github.com/onestardao/WFGY/tree/main/OS/BlowBlowBlow)