# Example 03 — Pipeline Patch: Intersection + Rerank (No.1 & No.4)

**Goal**

Harden retrieval so answers cite the **right** chunks. We combine lexical and semantic views, take the **intersection** (with a safe fallback), then **rerank** by cosine and **cut the tail** at the score knee.

**Problem Map link**

- **No.1 Hallucination & Chunk Drift** — wrong spans sneak in when chunk borders don’t match entities/constraints.
- **No.4 Tail Noise** — low-relevance passages dilute the prompt and push the model to stitch.

**Outcome**

- Higher citation hit rate and fewer off-topic chunks in the prompt
- Same token budget, better signal
- Deterministic selection rules you can tune and test

---
## 1) Inputs

Use the same structure as Example 01:

- `data/chunks.json` — array of `{id, page, text}` (a minimal sample follows this list)
- You can keep the tiny toy corpus or point to your own
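
For reference, a minimal `data/chunks.json` might look like this (ids, pages, and text are illustrative, not taken from the toy corpus):

```json
[
  { "id": "c001", "page": 1, "text": "X is defined as ..." },
  { "id": "c002", "page": 2, "text": "X is constrained by ..." },
  { "id": "c003", "page": 7, "text": "An unrelated protocol ..." }
]
```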

> Tip: If your corpus is large, run this on a 200–500 chunk slice first to verify behavior.

---
## 2) Path A — Python (rank-bm25 + sentence-transformers, CPU-friendly)
### Install

```bash
pip install numpy rank-bm25 sentence-transformers faiss-cpu
```
### Script: `hybrid_retrieve.py`

```python
# hybrid_retrieve.py -- lexical ∩ semantic -> rerank -> knee cutoff
import json, sys
import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer
import faiss

TOPK_LEX = 40    # lexical candidate pool
TOPK_SEM = 40    # semantic candidate pool
TOPK_FINAL = 8   # final picks after rerank
KNEE_MIN = 4     # at least this many survive before the knee

def load_chunks(path):
    with open(path, encoding="utf8") as f:
        C = json.load(f)
    texts = [c["text"] for c in C]
    ids = [c["id"] for c in C]
    return C, texts, ids

def build_lex(texts):
    toks = [t.lower().split() for t in texts]
    return BM25Okapi(toks)

def build_sem(texts):
    model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
    embs = model.encode(texts, normalize_embeddings=True, convert_to_numpy=True, show_progress_bar=False)
    index = faiss.IndexFlatIP(embs.shape[1])   # inner product == cosine on normalized vectors
    index.add(embs.astype(np.float32))
    return model, embs, index

def score_lex(bm25, q):
    return bm25.get_scores(q.lower().split())

def score_sem(model, index, q, topn):
    qv = model.encode([q], normalize_embeddings=True)
    sim, idx = index.search(qv.astype(np.float32), topn)
    return sim[0], idx[0]

def knee_cut(scores_sorted_desc, min_keep=KNEE_MIN):
    """Cut the tail at the largest relative drop. Always keep at least min_keep."""
    if len(scores_sorted_desc) <= min_keep:
        return len(scores_sorted_desc)
    drops = []
    for i in range(1, len(scores_sorted_desc)):
        prev, cur = scores_sorted_desc[i - 1], scores_sorted_desc[i]
        drops.append(0.0 if prev <= 1e-9 else (prev - cur) / prev)
    knee = max(range(1, len(scores_sorted_desc)), key=lambda i: drops[i - 1])
    return max(min_keep, knee)

def retrieve_hybrid(chunks, texts, ids, bm25, model, index, q):
    # 1) lexical top pool
    lex_scores = score_lex(bm25, q)
    lex_top_idx = np.argsort(lex_scores)[::-1][:TOPK_LEX]

    # 2) semantic top pool
    _, sem_top_idx = score_sem(model, index, q, TOPK_SEM)

    # 3) intersection, then union fallback when the intersection is too small
    cand = list(set(lex_top_idx).intersection(set(sem_top_idx)))
    if len(cand) < TOPK_FINAL:
        cand = list(set(lex_top_idx) | set(sem_top_idx))
    cand = np.array(cand)

    # 4) rerank by cosine against the query vector
    qv = model.encode([q], normalize_embeddings=True)[0]
    embs = index.reconstruct_n(0, index.ntotal)  # faiss flat index, safe to reconstruct
    sims = embs[cand] @ qv

    order = np.argsort(sims)[::-1]
    cand, sims = cand[order], sims[order]

    # 5) knee cutoff, capped at TOPK_FINAL (the knee may keep fewer than TOPK_FINAL)
    keep = knee_cut(sims.tolist(), min_keep=min(KNEE_MIN, TOPK_FINAL))
    n = min(keep, TOPK_FINAL)
    return [{"id": ids[i], "text": texts[i], "score": float(s)}
            for i, s in zip(cand[:n], sims[:n])]

def main(chunks_path, query):
    chunks, texts, ids = load_chunks(chunks_path)
    bm25 = build_lex(texts)
    model, embs, index = build_sem(texts)
    picks = retrieve_hybrid(chunks, texts, ids, bm25, model, index, query)
    print(json.dumps({"q": query, "picks": picks}, ensure_ascii=False, indent=2))

if __name__ == "__main__":
    if len(sys.argv) < 3:
        print("usage: python hybrid_retrieve.py data/chunks.json \"your question\"")
        sys.exit(1)
    main(sys.argv[1], sys.argv[2])
```

### Run

```bash
python hybrid_retrieve.py data/chunks.json "What is X and how is it constrained?"
```
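
The script prints a JSON object with the query and the surviving picks. The shape below is illustrative; ids, text, and scores depend on your corpus:

```json
{
  "q": "What is X and how is it constrained?",
  "picks": [
    { "id": "c001", "text": "X is defined as ...", "score": 0.78 },
    { "id": "c002", "text": "X is constrained by ...", "score": 0.71 }
  ]
}
```
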
**Pass criteria**

* The output `picks` includes the chunk(s) that actually define X
* Low-signal paragraphs (e.g., “unrelated protocol”) should be cut by the knee

---
## 3) Path B — Node (no external packages, CPU-only)

We’ll keep Node dependency-free for portability.
For real workloads, swap in `@xenova/transformers` for embeddings and a BM25 lib; the logic stays identical.

### Script: `hybrid_retrieve.mjs`

```js
// hybrid_retrieve.mjs -- lexical overlap ∩ simple semantic → rerank → knee
import fs from "node:fs";

// --- tiny helpers (dependency-free) ---
function tokenize(s) { return s.toLowerCase().split(/\W+/).filter(Boolean); }
function dot(a, b) { let s = 0; for (let i = 0; i < a.length; i++) s += a[i] * b[i]; return s; }
function norm(a) { return Math.sqrt(dot(a, a)) || 1; }
function cosine(a, b) { return dot(a, b) / (norm(a) * norm(b)); }

// naive "embedding": bag of hashed token counts, L2-normalized (a stand-in for a real model)
const D = 512;
function hashEmbed(text) {
  const v = new Array(D).fill(0);
  for (const w of tokenize(text)) {
    const h = Math.abs([...w].reduce((a, c) => ((a << 5) - a + c.charCodeAt(0)) | 0, 0)) % D;
    v[h] += 1;
  }
  const n = norm(v);
  return v.map(x => x / n);
}

function bm25Lite(queryTokens, docTokens, avgdl, k1 = 1.5, b = 0.75) {
  // very lite BM25: IDF omitted; we approximate with term overlap + BM25-style length normalization
  const overlap = docTokens.filter(t => queryTokens.has(t)).length;
  const dl = docTokens.length || 1;
  const num = overlap * (k1 + 1);
  const den = overlap + k1 * (1 - b + b * (dl / avgdl));
  return den ? num / den : 0;
}

function kneeCut(scoresDesc, minKeep = 4) {
  // cut at the largest relative drop; always keep at least minKeep
  if (scoresDesc.length <= minKeep) return scoresDesc.length;
  let best = 1, bestDrop = -1;
  for (let i = 1; i < scoresDesc.length; i++) {
    const prev = scoresDesc[i - 1], cur = scoresDesc[i];
    const drop = (prev - cur) / Math.max(prev, 1e-9);
    if (drop > bestDrop) { bestDrop = drop; best = i; }
  }
  return Math.max(minKeep, best);
}

// --- main hybrid ---
function hybridRetrieve(chunks, q, TOPK_LEX = 40, TOPK_SEM = 40, TOPK_FINAL = 8) {
  const qTok = new Set(tokenize(q));
  const tokens = chunks.map(c => tokenize(c.text));
  const avgdl = tokens.reduce((a, t) => a + t.length, 0) / Math.max(tokens.length, 1);

  // lexical pool
  const lexScores = tokens.map(t => bm25Lite(qTok, t, avgdl));
  const lexOrder = [...lexScores.keys()].sort((a, b) => lexScores[b] - lexScores[a]).slice(0, TOPK_LEX);

  // semantic pool (hash embeddings as a stand-in; replace with a real model later)
  const embs = chunks.map(c => hashEmbed(c.text));
  const qv = hashEmbed(q);
  const semScores = embs.map(v => cosine(v, qv));
  const semOrder = [...semScores.keys()].sort((a, b) => semScores[b] - semScores[a]).slice(0, TOPK_SEM);

  // intersection, then union fallback when the intersection is too small
  const setSem = new Set(semOrder);
  let cand = lexOrder.filter(i => setSem.has(i));
  if (cand.length < TOPK_FINAL) cand = [...new Set([...lexOrder, ...semOrder])];

  // rerank by cosine, then knee cutoff capped at TOPK_FINAL
  const rescored = cand.map(i => [i, semScores[i]]).sort((a, b) => b[1] - a[1]);
  const keep = kneeCut(rescored.map(x => x[1]), Math.min(4, TOPK_FINAL));
  return rescored.slice(0, Math.min(keep, TOPK_FINAL)).map(([i, s]) => ({
    id: chunks[i].id, text: chunks[i].text, score: s
  }));
}

// CLI
if (import.meta.url === `file://${process.argv[1]}`) {
  const [chunksPath, ...qparts] = process.argv.slice(2);
  if (!chunksPath || qparts.length === 0) {
    console.error("usage: node hybrid_retrieve.mjs data/chunks.json \"your question\"");
    process.exit(1);
  }
  const chunks = JSON.parse(fs.readFileSync(chunksPath, "utf8"));
  const q = qparts.join(" ");
  const picks = hybridRetrieve(chunks, q);
  console.log(JSON.stringify({ q, picks }, null, 2));
}

export { hybridRetrieve };
```

### Run

```bash
node hybrid_retrieve.mjs data/chunks.json "What is X and how is it constrained?"
```

**Pass criteria** are the same as for the Python path.
When you later swap in real embeddings, behavior should improve while the control logic stays unchanged.

---
## 4) Wire into your guarded answer (from Example 01)

**Python** — replace your `retrieve()` with the hybrid function (keep the same prompt and trace rules).

**Node** — import `hybridRetrieve` and use its `picks` as your `chunks` in the prompt builder.
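
For the Python path, a minimal wiring sketch (the `build_prompt` helper name from Example 01 is illustrative; use whatever your script calls it):

```python
# Sketch: swap the hybrid retriever into the Example 01 answer loop.
from hybrid_retrieve import load_chunks, build_lex, build_sem, retrieve_hybrid

chunks, texts, ids = load_chunks("data/chunks.json")
bm25 = build_lex(texts)                  # build once, reuse per query
model, embs, index = build_sem(texts)

def retrieve(q):
    # drop-in replacement for Example 01's retrieve()
    return retrieve_hybrid(chunks, texts, ids, bm25, model, index, q)

# prompt = build_prompt(retrieve(question))  # same prompt and trace rules as before
```
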
---
## 5) Verification checklist

* **Citation hit rate** increases (more answers cite the defining chunk ids)
* **Refusal is correct** when no relevant evidence exists
* **Prompt length** stays the same or shrinks (top-8 is enough after rerank)
* **Variance** across runs drops (the intersection stabilizes the candidate set)

A quick way to confirm: re-run the triage from **Example 02**. You should see fewer `retrieval_drift` labels.
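
If you keep a small labeled set (question → ids of the defining chunks), citation hit rate is a few lines to compute. A sketch, assuming that label format:

```python
# Sketch: fraction of questions where at least one gold chunk id is retrieved.
def citation_hit_rate(labeled, retrieve):
    hits = 0
    for q, gold_ids in labeled.items():
        picked = {p["id"] for p in retrieve(q)}
        hits += bool(picked & set(gold_ids))
    return hits / max(len(labeled), 1)
```
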
---
## 6) Why this works (in one paragraph)

Lexical scores reward explicit keyword overlap; semantic scores capture paraphrases and synonyms.
Taking the **intersection** forces candidates to be good in **both** views (high precision); when the intersection is too small, a **union fallback** avoids false refusals (recall).
A **cosine rerank** against the query enforces semantic closeness across mixed candidates, and the **knee cutoff** removes the low-value tail that tends to cause stitching.
You get a clean, testable selection before you ever call the model.
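
To see the knee concretely, run `knee_cut` on a toy score list; the biggest relative drop decides the cut:

```python
# Toy illustration of the knee: scores fall off a cliff after the second item.
from hybrid_retrieve import knee_cut

scores = [0.82, 0.80, 0.30, 0.28, 0.26]
print(knee_cut(scores, min_keep=2))  # -> 2: the 0.80 -> 0.30 drop is the knee
```
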
---
## 7) Common mistakes & quick fixes

* **Intersection returns 0** → increase the lexical/semantic pools to 80/80, or relax stopword removal.
* **Knee cuts too aggressively** → make the cut less sensitive by requiring a larger relative drop before it fires (or set `KNEE_MIN = TOPK_FINAL`).
* **Still seeing off-topic chunks** → reduce chunk size so an entity and its constraints sit together; large chunks blur the signal.
* **Latency too high** → cache embeddings, prebuild FAISS, and keep models warm in memory before serving (see Example 07 and the sketch below).
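
For the latency item, a minimal cache sketch (file paths are illustrative; `build_sem` is from `hybrid_retrieve.py`):

```python
# Sketch: persist embeddings and the FAISS index so warm starts skip encoding.
import os
import numpy as np
import faiss

def load_or_build(texts, emb_path="cache/embs.npy", idx_path="cache/index.faiss"):
    if os.path.exists(emb_path) and os.path.exists(idx_path):
        return np.load(emb_path), faiss.read_index(idx_path)
    from hybrid_retrieve import build_sem
    _, embs, index = build_sem(texts)    # the model itself still loads each run
    os.makedirs(os.path.dirname(emb_path), exist_ok=True)
    np.save(emb_path, embs)
    faiss.write_index(index, idx_path)
    return embs, index
```
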
---
## 8) Next steps

* Add a lightweight **cross-encoder reranker** (CPU-friendly) and compare it with the cosine rerank (a sketch follows this list).
* Move to the **Eval** docs and measure changes in precision, refusal rate, and citation overlap across your question set.
* If you run on Ollama/LangChain, keep this control logic; just swap the embedding/model backends.
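
For the cross-encoder item, a CPU-friendly sketch using `sentence-transformers` (the model name is one common choice, not a requirement):

```python
# Sketch: rescore the hybrid picks with a cross-encoder and keep the top K.
from sentence_transformers import CrossEncoder

ce = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def ce_rerank(q, picks, topk=8):
    scores = ce.predict([(q, p["text"]) for p in picks])
    order = sorted(range(len(picks)), key=lambda i: -scores[i])
    return [picks[i] for i in order[:topk]]
```
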
---
### 🔗 Quick-Start Downloads (60 sec)

| Tool | Link | 3-Step Setup |
|------|------|--------------|
| **WFGY 1.0 PDF** | [Engine Paper](https://github.com/onestardao/WFGY/blob/main/I_am_not_lizardman/WFGY_All_Principles_Return_to_One_v1.0_PSBigBig_Public.pdf) | 1️⃣ Download · 2️⃣ Upload to your LLM · 3️⃣ Ask “Answer using WFGY + \<your question>” |
| **TXT OS (plain-text OS)** | [TXTOS.txt](https://github.com/onestardao/WFGY/blob/main/OS/TXTOS.txt) | 1️⃣ Download · 2️⃣ Paste into any LLM chat · 3️⃣ Type “hello world” — OS boots instantly |

---
<!-- WFGY_FOOTER_START -->
### Explore More

| Layer | Page | What it’s for |
| --- | --- | --- |
| ⭐ Proof | [WFGY Recognition Map](/recognition/README.md) | External citations, integrations, and ecosystem proof |
| ⚙️ Engine | [WFGY 1.0](/legacy/README.md) | Original PDF tension engine and early logic sketch (legacy reference) |
| ⚙️ Engine | [WFGY 2.0](/core/README.md) | Production tension kernel for RAG and agent systems |
| ⚙️ Engine | [WFGY 3.0](/TensionUniverse/EventHorizon/README.md) | TXT based Singularity tension engine (131 S class set) |
| 🗺️ Map | [Problem Map 1.0](/ProblemMap/README.md) | Flagship 16 problem RAG failure taxonomy and fix map |
| 🗺️ Map | [Problem Map 2.0](/ProblemMap/wfgy-rag-16-problem-map-global-debug-card.md) | Global Debug Card for RAG and agent pipeline diagnosis |
| 🗺️ Map | [Problem Map 3.0](/ProblemMap/wfgy-ai-problem-map-troubleshooting-atlas.md) | Global AI troubleshooting atlas and failure pattern map |
| 🧰 App | [TXT OS](/OS/README.md) | .txt semantic OS with fast bootstrap |
| 🧰 App | [Blah Blah Blah](/OS/BlahBlahBlah/README.md) | Abstract and paradox Q&A built on TXT OS |
| 🧰 App | [Blur Blur Blur](/OS/BlurBlurBlur/README.md) | Text to image generation with semantic control |
| 🏡 Onboarding | [Starter Village](/StarterVillage/README.md) | Guided entry point for new users |

If this repository helped, starring it improves discovery so more builders can find the docs and tools.
[](https://github.com/onestardao/WFGY)
<!-- WFGY_FOOTER_END -->