🧮 Rerankers — When to Use Them, How to Tune, How to Prove It
Reranking boosts precision@k by re-scoring a candidate set from first-stage retrieval.
Used correctly, it tightens citations and reduces “looks-right-but-wrong” answers. Used blindly, it burns latency & money.
Quick Nav
Retrieval Playbook · Embedding vs Semantic · Traceability · Patterns: Query Parsing Split · Symbolic Constraint Unlock
0) TL;DR — Decision table
| Situation | Use | Why |
|---|---|---|
| First-stage recall@50 < 0.85 | Do NOT add reranker yet | You’re promoting the wrong pool; fix candidate generation first |
| Recall is good but Top-5 irrelevant | Add cross-encoder reranker | Cross-attends Q–D; best precision |
| Need tight citations across near-duplicates | Cross-encoder or ColBERT style | Fine-grained token interactions |
| Very low volume, high stakes | LLM-as-reranker | Expensive but accurate, great for audits |
| High QPS, tight budget | Light cross-encoder (mini) or linear fusion | 80/20 precision for minimal cost |
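A minimal sketch of the first row's gate, assuming a small gold set (`gold` maps each query to its relevant doc ids) and your own first-stage `search(query, k)`; both names are placeholders:

```python
def recall_at_k(gold, search, k=50):
    # gold: {query: set(relevant_doc_ids)}; search(query, k) returns dicts with an "id" field.
    hits, total = 0, 0
    for query, relevant_ids in gold.items():
        retrieved = {d["id"] for d in search(query, k)}
        hits += len(retrieved & relevant_ids)
        total += len(relevant_ids)
    return hits / max(total, 1)

# Gate from the table: only add a reranker once the candidate pool is healthy.
# if recall_at_k(gold, search, k=50) < 0.85:
#     fix chunking / analyzers / hybrid weights before spending on rerank
```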
1) Families of rerankers
- Cross-encoder (e.g., bge-reranker, ms-marco MiniLM)
  - Jointly encodes [query ⊕ doc]; outputs a relevance score.
  - Pros: best precision. Cons: O(k) forward passes.
- Late-interaction (e.g., ColBERT-style); see the MaxSim sketch below
  - Token-level max-sim interactions; faster than a full cross-encoder.
  - Pros: scalable. Cons: infra heavier than CE.
- LLM-as-reranker
  - Ask the model to score or rank candidates with a schema.
  - Pros: reasoning-aware. Cons: latency and cost; needs a strict judging prompt.
Start point: cross-encoder mini/base → upgrade if needed.
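For intuition on the late-interaction family, a minimal MaxSim sketch over per-token embeddings (numpy only; the token encoders are assumed to exist and are out of scope here):

```python
import numpy as np

def maxsim_score(query_vecs: np.ndarray, doc_vecs: np.ndarray) -> float:
    # query_vecs: (n_query_tokens, dim), doc_vecs: (n_doc_tokens, dim), rows L2-normalized.
    sims = query_vecs @ doc_vecs.T           # cosine similarities, shape (n_q, n_d)
    return float(sims.max(axis=1).sum())     # best doc token per query token, summed
```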
2) Minimal implementations
2.1 Python — Cross-encoder (bge-reranker)
```python
# pip install FlagEmbedding
from FlagEmbedding import FlagReranker

rerank = FlagReranker('BAAI/bge-reranker-base', use_fp16=True)

def rerank_topk(query, candidates, out_k=10):
    # candidates: list[{"text": ..., "meta": {...}}]
    pairs = [(query, c["text"]) for c in candidates]
    scores = rerank.compute_score(pairs, normalize=True)
    ranked = sorted(zip(candidates, scores), key=lambda x: -x[1])
    out = []
    for c, s in ranked[:out_k]:
        c["rerank_score"] = float(s)
        c["source"] = c.get("source", "") + "|ce"
        out.append(c)
    return out
```
Tips
- Use normalize=True for score comparability across batches.
- Batch size 16–64 depending on VRAM/CPU.
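If memory is tight, a minimal sketch of manual batching around the same `rerank` object from above; the chunk size is an assumption to tune per machine:

```python
def rerank_scores_batched(query, candidates, batch_size=32):
    # Score in fixed-size chunks so peak memory stays predictable on small GPUs/CPUs.
    scores = []
    for i in range(0, len(candidates), batch_size):
        batch = [(query, c["text"]) for c in candidates[i:i + batch_size]]
        batch_scores = rerank.compute_score(batch, normalize=True)
        if isinstance(batch_scores, float):   # some versions return a scalar for a single pair
            batch_scores = [batch_scores]
        scores.extend(batch_scores)
    return scores
```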
2.2 Node — LLM-as-reranker (schema-locked)
```ts
// Example sketch using any chat LLM SDK
const SYSTEM = `You are a strict retrieval judge.
Return JSON array of {id,score,reason} with score in [0,1].
Score by factual support for the query; do not invent.`;

function judgingPrompt(query: string, cands: {id: string, text: string}[]) {
  const body = cands.map((c, i) => `[${i}] id=${c.id}\n${c.text}`).join("\n\n");
  return `Query: ${query}\n\nCandidates:\n${body}\n\nRules:\n- Cite terms that match\n- Penalize off-topic\n- Prefer exact sections\n\nNow return JSON only.`;
}

// call your LLM and parse JSON;
// accept top-k with score ≥ threshold and keep justification in logs.
```
Guardrails
- JSON-only response.
- Enforce max tokens and refuse long doc bodies (pass snippets only).
- Never let LLM rewrite the snippet; judge only.
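A minimal sketch of the acceptance step on the caller side (in Python, like the rest of the pipeline on this page); the judge is assumed to return the JSON array described above, and the 0.6 threshold is a placeholder to tune on your gold set:

```python
import json

def accept_judged(raw_llm_output, candidates_by_id, threshold=0.6, k=8):
    # Parse the JSON-only judge response; keep well-formed, above-threshold candidates.
    try:
        items = json.loads(raw_llm_output)
    except json.JSONDecodeError:
        return []   # unparseable judge output: fall back to the pre-rerank order upstream
    kept = []
    for it in items:
        if not isinstance(it, dict) or "id" not in it or "score" not in it:
            continue
        score = float(it["score"])
        if 0.0 <= score <= 1.0 and score >= threshold and it["id"] in candidates_by_id:
            kept.append({**candidates_by_id[it["id"]],
                         "rerank_score": score,
                         "reason": it.get("reason", "")})
    kept.sort(key=lambda c: -c["rerank_score"])
    return kept[:k]
```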
3) Tuning knobs that actually matter
- Candidate pool size (`k_in`): 50–200 typical. Small pool → missed gold; huge pool → latency.
- Output size (`k_out`): 5–20. For grounded QA, 6–8 is a sweet spot.
- Score calibration: Normalize CE outputs to `[0,1]`; keep per-query z-scores for audit.
- Hybrid gate: If BM25 and dense disagree drastically, log both top-5 and check Query Parsing Split.
- Dedup by doc/section: Keep at most N chunks per section to avoid overfitting to near-duplicates (see the dedup sketch below).
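A minimal dedup sketch for the last knob, assuming each candidate carries a `meta["section_id"]` field (the field name is an assumption; adapt it to your schema):

```python
from collections import defaultdict

def dedup_by_section(ranked, max_per_section=2):
    # Keep at most N chunks per section, preserving the rerank order.
    seen = defaultdict(int)
    kept = []
    for c in ranked:
        section = c.get("meta", {}).get("section_id", "unknown")
        if seen[section] < max_per_section:
            seen[section] += 1
            kept.append(c)
    return kept
```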
4) Verification (don’t skip)
Metrics
- nDCG@10, MRR@10, Recall@50, and ΔS(question, top-ctx).
- Expect ΔS ≤ 0.45 after rerank on accepted top-ctx.
- Track citation hit rate (does the final answer cite a reranked chunk?).
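If you don't already have an eval harness, a minimal nDCG@10 sketch over binary relevance labels (`relevant_ids` is your gold annotation per query):

```python
import math

def ndcg_at_k(ranked_ids, relevant_ids, k=10):
    # Binary gains: 1 if the retrieved id is in the gold set, else 0.
    dcg = sum(1.0 / math.log2(i + 2)
              for i, doc_id in enumerate(ranked_ids[:k]) if doc_id in relevant_ids)
    idcg = sum(1.0 / math.log2(i + 2) for i in range(min(len(relevant_ids), k)))
    return dcg / idcg if idcg > 0 else 0.0
```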
A/B checklist
- Freeze the first-stage retriever.
- Compare with vs without reranker on the same gold set.
- Record latency p95 and cost/query.
- If nDCG@10 ↑ < +0.05 but latency doubles → not worth it.
5) Failure modes → fixes
| Symptom | Likely cause | Fix |
|---|---|---|
| Reranker prefers off-topic “fluent” text | Judge prompt vague / CE miscalibrated | Tighten judging schema; penalize missing query terms; normalize scores |
| Great demo, but prod recall tanks | k_in too small / drift | Increase k_in to 100–200; re-check recall@50 |
| Citations merge across sources | Prompt schema unlocked | Enforce per-source fences; see SCU |
| Hybrid suddenly worse than dense | Tokenizers diverged | Align analyzers; log per-retriever queries; see Query Parsing Split |
6) Cost model (back-of-envelope)
- Cross-encoder base: ~3–6 ms/doc on an A10G-class GPU, slower on CPU.
- For `k_in=100` and p95 ~500 ms on CPU, consider:
  - shrink text by sentence-windowing,
  - use a mini model,
  - pre-filter by BM25 top-60 then CE top-10.
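The same arithmetic as a tiny helper; `ms_per_doc` is whatever you actually measure on your hardware, and the 300 ms budget is a placeholder:

```python
def rerank_latency_ok(k_in, ms_per_doc, budget_ms=300):
    # k_in forward passes, ignoring batching and tokenization overhead.
    est_ms = k_in * ms_per_doc
    return est_ms, est_ms <= budget_ms

# e.g. rerank_latency_ok(100, 5) -> (500, False): shrink k_in, switch to a mini model,
# or pre-filter with BM25 before the cross-encoder.
```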
7) Acceptance criteria
- nDCG@10 improves by ≥ +0.05 vs baseline.
- Recall@50 unchanged (±0.02) after adding reranker (candidate pool must remain wide).
- ΔS(question, top-ctx) ≤ 0.45 and λ stays convergent on 3 paraphrases.
- Traceability: store `{query, cand_id, pre_score, post_score, reason}`.
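A minimal sketch of that trace record as JSON lines; the file path and the `retrieval_score` field name are assumptions:

```python
import json
import time

def log_rerank_trace(query, candidate, path="rerank_trace.jsonl"):
    # One JSON line per reranked candidate: enough to replay and audit the decision later.
    record = {
        "ts": time.time(),
        "query": query,
        "cand_id": candidate.get("id"),
        "pre_score": candidate.get("retrieval_score"),
        "post_score": candidate.get("rerank_score"),
        "reason": candidate.get("reason", ""),
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```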
8) Example pipeline glue
```python
def answer(query):
    cands = search(query, topk_dense=80, topk_sparse=80, out_k=60)  # from retrieval-playbook
    reranked = rerank_topk(query, cands, out_k=8)                   # CE/LLM reranker
    prompt = build_prompt(query, reranked)                          # cite → explain, fenced by section
    return call_llm(prompt)
```
- Do not exceed 8–10 context chunks for QA—precision collapses after that.
- Always log which reranker selected which chunk.
🔗 Quick-Start Downloads (60 sec)
| Tool | Link | 3-Step Setup |
|---|---|---|
| WFGY 1.0 PDF | Engine Paper | 1️⃣ Download · 2️⃣ Upload to your LLM · 3️⃣ Ask “Answer using WFGY + <your question>” |
| TXT OS (plain-text OS) | TXTOS.txt | 1️⃣ Download · 2️⃣ Paste into any LLM chat · 3️⃣ Type “hello world” — OS boots instantly |
Explore More
| Layer | Page | What it’s for |
|---|---|---|
| ⭐ Proof | WFGY Recognition Map | External citations, integrations, and ecosystem proof |
| ⚙️ Engine | WFGY 1.0 | Original PDF tension engine and early logic sketch (legacy reference) |
| ⚙️ Engine | WFGY 2.0 | Production tension kernel for RAG and agent systems |
| ⚙️ Engine | WFGY 3.0 | TXT based Singularity tension engine (131 S class set) |
| 🗺️ Map | Problem Map 1.0 | Flagship 16 problem RAG failure taxonomy and fix map |
| 🗺️ Map | Problem Map 2.0 | Global Debug Card for RAG and agent pipeline diagnosis |
| 🗺️ Map | Problem Map 3.0 | Global AI troubleshooting atlas and failure pattern map |
| 🧰 App | TXT OS | .txt semantic OS with fast bootstrap |
| 🧰 App | Blah Blah Blah | Abstract and paradox Q&A built on TXT OS |
| 🧰 App | Blur Blur Blur | Text to image generation with semantic control |
| 🏡 Onboarding | Starter Village | Guided entry point for new users |
If this repository helped, starring it improves discovery so more builders can find the docs and tools.