# Example 03 — Pipeline Patch: Intersection + Rerank (No.1 & No.4)
**Goal**

Harden retrieval so answers cite the **right** chunks. We combine lexical and semantic views, take the **intersection** (with a safe fallback), then **rerank** by cosine and **cut the tail** at the score knee.

**Problem Map link**

- **No.1 Hallucination & Chunk Drift**: wrong spans sneak in when chunk borders don't match entities/constraints.
- **No.4 Tail Noise**: low-relevance passages dilute the prompt and push the model to stitch.

**Outcome**

- Higher citation hit rate and fewer off-topic chunks in the prompt
- Same token budget, better signal
- Deterministic selection rules you can tune and test
---
## 1) Inputs
Use the same structure as Example 01:
- `data/chunks.json` — array of `{id, page, text}`
- You can keep the tiny toy corpus or point to your own
> Tip: If your corpus is large, run this on a 200–500 chunk slice first to verify behavior.
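
Before building any index, it can help to confirm that the file matches the `{id, page, text}` shape and to carve out a small slice, as the tip suggests. The following is a minimal sketch; the helper names `validate_chunks`, `take_slice`, and `load_validated`, and the default slice size of 300, are assumptions and not part of Example 01.

```python
import json

REQUIRED_KEYS = {"id", "page", "text"}  # fields named in the Inputs section


def validate_chunks(chunks):
    """Return a list of human-readable problems; an empty list means the data looks usable."""
    if not isinstance(chunks, list):
        return ["top-level JSON value must be an array"]
    problems, seen = [], set()
    for n, c in enumerate(chunks):
        if not isinstance(c, dict):
            problems.append(f"item {n}: not an object")
            continue
        missing = REQUIRED_KEYS - c.keys()
        if missing:
            problems.append(f"item {n}: missing {sorted(missing)}")
            continue
        if c["id"] in seen:
            problems.append(f"item {n}: duplicate id {c['id']!r}")
        seen.add(c["id"])
    return problems


def take_slice(chunks, size=300):
    """Keep the first `size` chunks (hypothetical default inside the 200-500 range)."""
    return chunks[:size]


def load_validated(path, size=300):
    """Load, validate, and slice a chunks file; raises ValueError on the first few problems."""
    with open(path, encoding="utf8") as f:
        chunks = json.load(f)
    problems = validate_chunks(chunks)
    if problems:
        raise ValueError("; ".join(problems[:5]))
    return take_slice(chunks, size)
```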
---
## 2) Path A — Python (rank-bm25 + sentence-transformers, CPU-friendly)
### Install
```bash
pip install numpy rank-bm25 sentence-transformers faiss-cpu
```
### Script: `hybrid_retrieve.py`
```python
# hybrid_retrieve.py -- lexical ∩ semantic -> rerank -> knee cutoff
import json
import sys

import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer
import faiss

TOPK_LEX = 40    # lexical candidate pool
TOPK_SEM = 40    # semantic candidate pool
TOPK_FINAL = 8   # final picks after rerank
KNEE_MIN = 4     # at least this many survive before knee


def load_chunks(path):
    with open(path, encoding="utf8") as f:
        C = json.load(f)
    texts = [c["text"] for c in C]
    ids = [c["id"] for c in C]
    return C, texts, ids


def build_lex(texts):
    toks = [t.lower().split() for t in texts]
    return BM25Okapi(toks), toks


def build_sem(texts):
    model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
    embs = model.encode(texts, normalize_embeddings=True, convert_to_numpy=True, show_progress_bar=False)
    index = faiss.IndexFlatIP(embs.shape[1])  # inner product equals cosine on unit vectors
    index.add(embs.astype(np.float32))
    return model, embs, index


def score_lex(bm25, toks, q):
    return bm25.get_scores(q.lower().split())


def score_sem(model, index, q, topn):
    qv = model.encode([q], normalize_embeddings=True)
    # Never ask for more neighbours than exist: FAISS pads missing hits with -1.
    topn = min(topn, index.ntotal)
    sim, idx = index.search(qv.astype(np.float32), topn)
    return sim[0], idx[0]


def knee_cut(scores_sorted_desc, min_keep=KNEE_MIN):
    """Find largest drop (relative) to cut the tail. Always keep at least min_keep."""
    if len(scores_sorted_desc) <= min_keep:
        return len(scores_sorted_desc)
    drops = []
    for i in range(1, len(scores_sorted_desc)):
        prev, cur = scores_sorted_desc[i - 1], scores_sorted_desc[i]
        if prev <= 1e-9:
            drops.append(0.0)
        else:
            drops.append((prev - cur) / prev)
    knee = max(range(1, len(scores_sorted_desc)), key=lambda i: drops[i - 1])
    return max(min_keep, knee)


def retrieve_hybrid(chunks, texts, ids, bm25, toks, model, index, q):
    # 1) lexical top pool
    lex_scores = score_lex(bm25, toks, q)
    lex_top_idx = np.argsort(lex_scores)[::-1][:TOPK_LEX]
    # 2) semantic top pool
    _, sem_top_idx = score_sem(model, index, q, TOPK_SEM)
    # 3) intersection, then union fallback
    lex_set = {int(i) for i in lex_top_idx}
    sem_set = {int(i) for i in sem_top_idx if i >= 0}  # drop FAISS "no hit" markers
    cand = lex_set & sem_set
    if len(cand) < TOPK_FINAL:
        cand = lex_set | sem_set
    cand = np.array(sorted(cand), dtype=int)  # sorted for a deterministic order
    # 4) rerank by cosine against query vector
    qv = model.encode([q], normalize_embeddings=True)[0]
    embs = index.reconstruct_n(0, index.ntotal)  # faiss flat, safe to reconstruct
    sims = embs[cand] @ qv
    order = np.argsort(sims)[::-1]
    cand, sims = cand[order], sims[order]
    # 5) knee cutoff, capped at TOPK_FINAL (min, not max, so the knee can actually trim the tail)
    keep = knee_cut(sims.tolist(), min_keep=min(KNEE_MIN, TOPK_FINAL))
    n = min(keep, TOPK_FINAL)
    return [{"id": ids[i], "text": texts[i], "score": float(s)} for i, s in zip(cand[:n], sims[:n])]


def main(chunks_path, query):
    chunks, texts, ids = load_chunks(chunks_path)
    bm25, toks = build_lex(texts)
    model, embs, index = build_sem(texts)
    picks = retrieve_hybrid(chunks, texts, ids, bm25, toks, model, index, query)
    print(json.dumps({"q": query, "picks": picks}, ensure_ascii=False, indent=2))


if __name__ == "__main__":
    if len(sys.argv) < 3:
        print("usage: python hybrid_retrieve.py data/chunks.json \"your question\"")
        sys.exit(1)
    main(sys.argv[1], sys.argv[2])
```
### Run
```bash
python hybrid_retrieve.py data/chunks.json "What is X and how is it constrained?"
```
**Pass criteria**
* The output `picks` includes the chunk(s) that actually define X
* Low-signal paragraphs (e.g., “unrelated protocol”) should be cut by the knee
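
The pass criteria can be checked mechanically once you have labelled, by hand, which chunk ids define X and which ones are known noise. This is a small sketch under that assumption; `check_picks` and the ids `p1#1` and `p2#1` are hypothetical, and the sample output only mimics the JSON shape the script prints.

```python
import json


def check_picks(output, must_include, must_exclude=()):
    """Compare the script's JSON output against hand-labelled chunk ids."""
    picked = [p["id"] for p in output["picks"]]
    return {
        "missing": [i for i in must_include if i not in picked],  # defining chunks not retrieved
        "leaked": [i for i in must_exclude if i in picked],       # noise that survived the knee
    }


# Hypothetical output, shaped like what hybrid_retrieve.py prints.
sample = json.loads(
    '{"q": "What is X?", "picks": [{"id": "p1#1", "text": "X is ...", "score": 0.71}]}'
)
report = check_picks(sample, must_include=["p1#1"], must_exclude=["p2#1"])
passed = not report["missing"] and not report["leaked"]
```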
---
## 3) Path B — Node (no external packages, CPU-only)
We'll keep Node dependency-free for portability.
For real workloads, swap in `@xenova/transformers` for embeddings and a BM25 lib; the logic stays identical.
### Script: `hybrid_retrieve.mjs`
```js
// hybrid_retrieve.mjs -- lexical overlap ∩ simple semantic → rerank → knee
import fs from "node:fs";
import { pathToFileURL } from "node:url";

// --- tiny helpers (dependency-free) ---
function tokenize(s) { return s.toLowerCase().split(/\W+/).filter(Boolean); }
function dot(a, b) { let s = 0; for (let i = 0; i < a.length; i++) s += a[i] * b[i]; return s; }
function norm(a) { return Math.sqrt(dot(a, a)) || 1; }
function cosine(a, b) { return dot(a, b) / (norm(a) * norm(b)); }

// naive "embedding": L2-normalized counts of hashed tokens (works as a stand-in)
const D = 512;
function hashEmbed(text) {
  const v = new Array(D).fill(0);
  for (const w of tokenize(text)) {
    const h = Math.abs([...w].reduce((a, c) => ((a << 5) - a + c.charCodeAt(0)) | 0, 0)) % D;
    v[h] += 1;
  }
  const n = norm(v);
  return v.map(x => x / n);
}

function bm25Lite(queryTokens, docTokens, avgdl, k1 = 1.5, b = 0.75) {
  // very lite BM25: IDF omitted here; we approximate with normalized term overlap + length norm
  const overlap = docTokens.filter(t => queryTokens.has(t)).length;
  const dl = docTokens.length || 1;
  const num = overlap * (k1 + 1);
  const den = overlap + k1 * (1 - b + b * (dl / avgdl));
  return den ? num / den : 0;
}

function kneeCut(scoresDesc, minKeep = 4) {
  if (scoresDesc.length <= minKeep) return scoresDesc.length;
  let best = 1, bestDrop = -1;
  for (let i = 1; i < scoresDesc.length; i++) {
    const prev = scoresDesc[i - 1], cur = scoresDesc[i];
    const drop = (prev - cur) / Math.max(prev, 1e-9);
    if (drop > bestDrop) { bestDrop = drop; best = i; }
  }
  return Math.max(minKeep, best);
}

// --- main hybrid ---
function hybridRetrieve(chunks, q, TOPK_LEX = 40, TOPK_SEM = 40, TOPK_FINAL = 8) {
  const qTok = new Set(tokenize(q));
  const tokens = chunks.map(c => tokenize(c.text));
  const avgdl = tokens.reduce((a, t) => a + t.length, 0) / Math.max(tokens.length, 1);

  // lexical pool
  const lexScores = tokens.map(t => bm25Lite(qTok, t, avgdl));
  const lexOrder = [...lexScores.keys()].sort((a, b) => lexScores[b] - lexScores[a]).slice(0, TOPK_LEX);

  // semantic pool (hash embeddings as stand-in; replace with real model later)
  const embs = chunks.map(c => hashEmbed(c.text));
  const qv = hashEmbed(q);
  const semScores = embs.map(v => cosine(v, qv));
  const semOrder = [...semScores.keys()].sort((a, b) => semScores[b] - semScores[a]).slice(0, TOPK_SEM);

  // intersection then union fallback
  const setSem = new Set(semOrder);
  let cand = lexOrder.filter(i => setSem.has(i));
  if (cand.length < TOPK_FINAL) cand = [...new Set([...lexOrder, ...semOrder])];

  // rerank by cosine, knee cutoff
  const rescored = cand.map(i => [i, semScores[i]]).sort((a, b) => b[1] - a[1]);
  const scores = rescored.map(x => x[1]);
  const keep = kneeCut(scores, Math.min(4, TOPK_FINAL));
  // min, not max: otherwise the knee could never trim below TOPK_FINAL
  return rescored.slice(0, Math.min(keep, TOPK_FINAL)).map(([i, s]) => ({
    id: chunks[i].id, text: chunks[i].text, score: s
  }));
}

// CLI (pathToFileURL keeps the check correct for paths with spaces and on Windows)
if (process.argv[1] && import.meta.url === pathToFileURL(process.argv[1]).href) {
  const [chunksPath, ...qparts] = process.argv.slice(2);
  if (!chunksPath || qparts.length === 0) {
    console.error("usage: node hybrid_retrieve.mjs data/chunks.json \"your question\"");
    process.exit(1);
  }
  const chunks = JSON.parse(fs.readFileSync(chunksPath, "utf8"));
  const q = qparts.join(" ");
  const picks = hybridRetrieve(chunks, q);
  console.log(JSON.stringify({ q, picks }, null, 2));
}

export { hybridRetrieve };
```
### Run
```bash
node hybrid_retrieve.mjs data/chunks.json "What is X and how is it constrained?"
```
**Pass criteria** are the same as for Python.
When you later swap in real embeddings, retrieval quality should improve while the control logic stays unchanged.

---
## 4) Wire into your guarded answer (from Example 01)
- **Python**: replace your `retrieve()` with the hybrid function (keep the same prompt and trace rules).
- **Node**: import `hybridRetrieve` and use its `picks` as your `chunks` in the prompt builder.
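
As a rough illustration of the Python wiring, the sketch below formats `picks` as citable context. Example 01's actual prompt and trace rules are not reproduced here, so `build_prompt` and its template wording are placeholders; only the `{id, text, score}` shape of each pick comes from the script above.

```python
def build_prompt(question, picks):
    """Format retrieved picks as citable context (placeholder template, not Example 01's)."""
    context = "\n".join(f"[{p['id']}] {p['text']}" for p in picks)
    return (
        "Answer using only the context below and cite chunk ids.\n"
        "If the context is not sufficient, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\n"
    )


# Hypothetical pick, shaped like one element returned by retrieve_hybrid(...).
picks = [{"id": "p1#1", "text": "X is defined as ...", "score": 0.71}]
prompt = build_prompt("What is X?", picks)
```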
---
## 5) Verification checklist
* **Citation hit rate** increases (more answers cite the defining chunk ids)
* **Refusal is correct** when no relevant evidence exists
* **Prompt length** stays the same or shrinks (top-8 is enough after rerank)
* **Variance** across runs reduces (intersection stabilizes candidate set)
A quick way to confirm: re-run the triage from **Example 02**. You should see fewer `retrieval_drift` labels.
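
Two of the checklist items can be turned into numbers with a few lines of code. The sketch below assumes you keep one record per question with the cited chunk ids, the hand-labelled defining ids (`gold`, empty when no evidence exists), and whether the model refused; that record shape and both function names are hypothetical, and the records are hand-made toy data, not measured results.

```python
def citation_hit_rate(records):
    """Share of answered questions whose citations include at least one defining chunk id."""
    answered = [r for r in records if not r["refused"]]
    if not answered:
        return 0.0
    hits = sum(1 for r in answered if set(r["cited"]) & set(r["gold"]))
    return hits / len(answered)


def refusal_accuracy(records):
    """Share of questions where refusing (or not) matched whether gold evidence exists."""
    if not records:
        return 0.0
    ok = sum(1 for r in records if r["refused"] == (len(r["gold"]) == 0))
    return ok / len(records)


# Toy records for the same three questions, before and after the patch.
before = [
    {"cited": ["p2#1"], "gold": ["p1#1"], "refused": False},  # cites the wrong chunk
    {"cited": ["p1#2"], "gold": ["p1#2"], "refused": False},
    {"cited": [], "gold": [], "refused": True},                # correct refusal
]
after = [
    {"cited": ["p1#1"], "gold": ["p1#1"], "refused": False},
    {"cited": ["p1#2"], "gold": ["p1#2"], "refused": False},
    {"cited": [], "gold": [], "refused": True},
]
improved = citation_hit_rate(after) > citation_hit_rate(before)
```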
---
## 6) Why this works (in one paragraph)
Lexical scores reward explicit keyword overlap; semantic scores capture paraphrases and synonyms.
Taking the **intersection** forces candidates to be good in **both** views (high precision); when the intersection is too small, a **union fallback** avoids false refusals (recall).
A **cosine rerank** against the query enforces semantic closeness across mixed candidates, and the **knee cutoff** removes the low-value tail that tends to cause stitching.
You get a clean, testable selection before you ever call the model.
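
The same reasoning can be traced on a toy case with five chunks and made-up scores. This is a sketch only: `select` is a hypothetical, dependency-free condensation of the pipeline, and it assumes the knee is allowed to trim the result below `top_final`.

```python
def select(lex_ranked, sem_ranked, sem_score, top_final=4, min_keep=2):
    """Intersection -> union fallback -> rerank by semantic score -> knee cut."""
    both = set(lex_ranked) & set(sem_ranked)
    # Precision first; fall back to the union when the intersection is too small.
    cand = both if len(both) >= top_final else set(lex_ranked) | set(sem_ranked)
    ranked = sorted(cand, key=lambda c: (-sem_score[c], c))
    scores = [sem_score[c] for c in ranked]
    if len(scores) <= min_keep:
        return ranked[:top_final]
    # The largest relative drop between neighbours marks the knee.
    drops = [(scores[i - 1] - scores[i]) / max(scores[i - 1], 1e-9) for i in range(1, len(scores))]
    knee = 1 + max(range(len(drops)), key=lambda i: drops[i])
    return ranked[:min(max(min_keep, knee), top_final)]


# Made-up scores: a, b, c are on topic; d and e are tail noise.
sem_score = {"a": 0.82, "b": 0.79, "c": 0.75, "d": 0.31, "e": 0.28}
lex_ranked = ["a", "c", "b", "d"]   # lexical pool
sem_ranked = ["a", "b", "c", "e"]   # semantic pool

# The intersection {a, b, c} is smaller than top_final=4, so the union is used;
# the drop from 0.75 to 0.31 is the knee, which removes d and e again.
picked = select(lex_ranked, sem_ranked, sem_score, top_final=4, min_keep=2)
```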
---
## 7) Common mistakes & quick fixes
* **Intersection returns 0** → increase the lexical/semantic pools to 80/80, or relax stopword removal.
* **Knee cuts too aggressively** → lower the drop sensitivity by requiring a larger relative drop (or set `KNEE_MIN = TOPK_FINAL`).
* **Still seeing off-topic chunks** → reduce chunk size so entity and constraints sit together; large chunks blur the signal.
* **Latency too high** → cache embeddings; prebuild FAISS; keep models in memory and warm them before serving (see Example 07).
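
For the "knee cuts too aggressively" case, one way to require a larger relative drop is to cut only when the biggest drop clears a threshold. The variant below is a sketch, not the script's own `knee_cut`; the name `knee_cut_thresholded` and the default `min_rel_drop=0.25` are assumptions you would tune on your own score curves.

```python
def knee_cut_thresholded(scores_desc, min_keep=4, min_rel_drop=0.25):
    """Like knee_cut, but only cut when the largest relative drop is at least min_rel_drop."""
    n = len(scores_desc)
    if n <= min_keep:
        return n
    best_i, best_drop = n, 0.0
    for i in range(1, n):
        prev, cur = scores_desc[i - 1], scores_desc[i]
        drop = (prev - cur) / prev if prev > 1e-9 else 0.0
        if drop > best_drop:
            best_i, best_drop = i, drop
    if best_drop < min_rel_drop:
        return n  # flat curve: no convincing knee, keep everything
    return max(min_keep, best_i)


flat = [0.80, 0.78, 0.76, 0.74, 0.72, 0.70]          # no real knee
steep = [0.80, 0.78, 0.76, 0.74, 0.72, 0.30, 0.28]   # clear knee after the fifth score
keep_flat = knee_cut_thresholded(flat)
keep_steep = knee_cut_thresholded(steep)
```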
---
## 8) Next steps
* Add a lightweight **cross-encoder reranker** (CPU-friendly) and compare with the cosine rerank.
* Move to **Eval** docs and measure changes on precision, refusal rate, and citation overlap across your question set.
* If you run on Ollama/LangChain, keep this control logic; just swap the embedding/model backends.
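
To compare a cross-encoder with the cosine rerank, it may be convenient to keep the scorer pluggable. In this sketch `rerank_with`, `jaccard_at_k`, and `toy_scorer` are hypothetical helpers; the toy scorer only counts shared tokens so the example runs offline, and the commented lines show how a `sentence-transformers` `CrossEncoder` could be plugged in instead.

```python
def rerank_with(score_pairs, query, picks, top_k=8):
    """Re-order picks by an external pairwise scorer; score_pairs takes [(query, text), ...]."""
    scores = score_pairs([(query, p["text"]) for p in picks])
    ranked = sorted(zip(picks, scores), key=lambda ps: -float(ps[1]))
    return [dict(p, rerank_score=float(s)) for p, s in ranked[:top_k]]


def jaccard_at_k(a, b, k):
    """Jaccard overlap of the top-k ids in two rankings (1.0 means the same set)."""
    top_a, top_b = {p["id"] for p in a[:k]}, {p["id"] for p in b[:k]}
    return len(top_a & top_b) / max(1, len(top_a | top_b))


def toy_scorer(pairs):
    # Stand-in so the sketch runs offline: counts shared lowercase tokens.
    return [len(set(q.lower().split()) & set(t.lower().split())) for q, t in pairs]


# With sentence-transformers installed, a real scorer could be used instead:
#   from sentence_transformers import CrossEncoder
#   ce = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
#   reranked = rerank_with(ce.predict, query, picks)

query = "how is X constrained"
picks = [  # hypothetical cosine-ranked picks
    {"id": "a", "text": "unrelated protocol notes", "score": 0.60},
    {"id": "b", "text": "X is constrained by Y", "score": 0.50},
]
reranked = rerank_with(toy_scorer, query, picks)
agreement_top1 = jaccard_at_k(picks, reranked, 1)
```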
---
### 🔗 Quick-Start Downloads (60 sec)
| Tool | Link | 3-Step Setup |
|------|------|--------------|
| **WFGY 1.0 PDF** | [Engine Paper](https://github.com/onestardao/WFGY/blob/main/I_am_not_lizardman/WFGY_All_Principles_Return_to_One_v1.0_PSBigBig_Public.pdf) | 1️⃣ Download · 2️⃣ Upload to your LLM · 3️⃣ Ask “Answer using WFGY + \<your question>” |
| **TXT OS (plain-text OS)** | [TXTOS.txt](https://github.com/onestardao/WFGY/blob/main/OS/TXTOS.txt) | 1️⃣ Download · 2️⃣ Paste into any LLM chat · 3️⃣ Type “hello world”; the OS boots instantly |
---
<!-- WFGY_FOOTER_START -->
### Explore More
| Layer | Page | What it's for |
| --- | --- | --- |
| ⭐ Proof | [WFGY Recognition Map](/recognition/README.md) | External citations, integrations, and ecosystem proof |
| ⚙️ Engine | [WFGY 1.0](/legacy/README.md) | Original PDF tension engine and early logic sketch (legacy reference) |
| ⚙️ Engine | [WFGY 2.0](/core/README.md) | Production tension kernel for RAG and agent systems |
| ⚙️ Engine | [WFGY 3.0](/TensionUniverse/EventHorizon/README.md) | TXT based Singularity tension engine (131 S class set) |
| 🗺️ Map | [Problem Map 1.0](/ProblemMap/README.md) | Flagship 16 problem RAG failure taxonomy and fix map |
| 🗺️ Map | [Problem Map 2.0](/ProblemMap/wfgy-rag-16-problem-map-global-debug-card.md) | Global Debug Card for RAG and agent pipeline diagnosis |
| 🗺️ Map | [Problem Map 3.0](/ProblemMap/wfgy-ai-problem-map-troubleshooting-atlas.md) | Global AI troubleshooting atlas and failure pattern map |
| 🧰 App | [TXT OS](/OS/README.md) | .txt semantic OS with fast bootstrap |
| 🧰 App | [Blah Blah Blah](/OS/BlahBlahBlah/README.md) | Abstract and paradox Q&A built on TXT OS |
| 🧰 App | [Blur Blur Blur](/OS/BlurBlurBlur/README.md) | Text to image generation with semantic control |
| 🏡 Onboarding | [Starter Village](/StarterVillage/README.md) | Guided entry point for new users |

If this repository helped, starring it improves discovery so more builders can find the docs and tools.

[![GitHub Repo stars](https://img.shields.io/github/stars/onestardao/WFGY?style=social)](https://github.com/onestardao/WFGY)
<!-- WFGY_FOOTER_END -->