# 🧮 Rerankers — When to Use Them, How to Tune, How to Prove It

Reranking boosts **precision@k** by re-scoring a candidate set from first-stage retrieval.

Used correctly, it **tightens citations** and reduces “looks-right-but-wrong” answers. Used blindly, it burns latency and money.

---
> **Quick Nav**
> [Retrieval Playbook](./retrieval-playbook.md) ·
> [Embedding vs Semantic](./embedding-vs-semantic.md) ·
> [Traceability](./retrieval-traceability.md) ·
> Patterns: [Query Parsing Split](./patterns/pattern_query_parsing_split.md) ·
> [Symbolic Constraint Unlock](./patterns/pattern_symbolic_constraint_unlock.md)

---

## 0) TL;DR — Decision table

| Situation | Use | Why |
|---|---|---|
| First-stage **recall@50 < 0.85** | **Do NOT** add a reranker yet | You’re promoting the wrong pool; fix candidate generation first |
| Recall is good but **Top-5 irrelevant** | Add a **cross-encoder** reranker | Cross-attends Q–D; best precision |
| Need **tight citations** across near-duplicates | Cross-encoder or **ColBERT**-style | Fine-grained token interactions |
| Very low volume, high stakes | **LLM-as-reranker** | Expensive but accurate; great for audits |
| High QPS, tight budget | **Light cross-encoder** (mini) or **linear fusion** | 80/20 precision for minimal cost |

---

## 1) Families of rerankers

1. **Cross-encoder** (e.g., bge-reranker, ms-marco MiniLM)
   - Jointly encodes **[query ⊕ doc]**; outputs a relevance score.
   - **Pros**: best precision. **Cons**: O(k) forward passes.

2. **Late-interaction** (e.g., ColBERT-style)
   - Token-level max-sim interactions; faster than a full cross-encoder (see the MaxSim sketch after this list).
   - **Pros**: scalable. **Cons**: heavier infra than a plain cross-encoder.

3. **LLM-as-reranker**
   - Ask a model to score or rank candidates against a schema.
   - **Pros**: reasoning-aware. **Cons**: latency and cost; needs a **strict judging prompt**.

**Start point**: cross-encoder mini/base → upgrade if needed.
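
To make the late-interaction scoring rule concrete, here is a minimal MaxSim sketch in NumPy. The function name and the random vectors are illustrative assumptions; in practice the token embeddings come from your ColBERT-style encoder.

```python
import numpy as np

def maxsim_score(query_tokens: np.ndarray, doc_tokens: np.ndarray) -> float:
    """ColBERT-style late interaction: each query token takes its best cosine
    match among document tokens; the per-token maxima are summed.
    Both arrays are assumed L2-normalized with shape (n_tokens, dim)."""
    sim = query_tokens @ doc_tokens.T        # (n_q, n_d) cosine similarities
    return float(sim.max(axis=1).sum())      # best doc token per query token

# Toy usage with random unit vectors standing in for real token embeddings
rng = np.random.default_rng(0)
q = rng.normal(size=(8, 128));   q /= np.linalg.norm(q, axis=1, keepdims=True)
d = rng.normal(size=(200, 128)); d /= np.linalg.norm(d, axis=1, keepdims=True)
print(maxsim_score(q, d))
```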
---

## 2) Minimal implementations

### 2.1 Python — Cross-encoder (bge-reranker)

```python
# pip install FlagEmbedding
from FlagEmbedding import FlagReranker

rerank = FlagReranker('BAAI/bge-reranker-base', use_fp16=True)

def rerank_topk(query, candidates, out_k=10):
    # candidates: list[{"text": ..., "meta": {...}}]
    pairs = [(query, c["text"]) for c in candidates]
    scores = rerank.compute_score(pairs, normalize=True)
    ranked = sorted(zip(candidates, scores), key=lambda x: -x[1])
    out = []
    for c, s in ranked[:out_k]:
        c["rerank_score"] = float(s)
        c["source"] = c.get("source", "") + "|ce"
        out.append(c)
    return out
```

**Tips**

* Use **normalize=True** for score comparability across batches.
* Batch size 16–64 depending on VRAM/CPU.
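
If memory is tight, the simplest way to follow the batch-size tip is to chunk the pairs yourself before scoring. A minimal sketch that reuses the `rerank` instance above; the chunking loop is ours, not a library feature:

```python
def score_in_batches(query, candidates, batch_size=32):
    """Score (query, text) pairs in fixed-size chunks to cap peak VRAM/RAM.
    Reuses the FlagReranker instance `rerank` created above."""
    scores = []
    for i in range(0, len(candidates), batch_size):
        pairs = [(query, c["text"]) for c in candidates[i:i + batch_size]]
        batch_scores = rerank.compute_score(pairs, normalize=True)
        if not isinstance(batch_scores, list):   # a single pair may come back as a scalar
            batch_scores = [batch_scores]
        scores.extend(batch_scores)
    return scores
```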
### 2.2 Node — LLM-as-reranker (schema-locked)

```ts
// Example sketch using any chat LLM SDK
const SYSTEM = `You are a strict retrieval judge.
Return JSON array of {id,score,reason} with score in [0,1].
Score by factual support for the query; do not invent.`;

function judgingPrompt(query: string, cands: { id: string; text: string }[]) {
  const body = cands.map((c, i) => `[${i}] id=${c.id}\n${c.text}`).join("\n\n");
  return `Query: ${query}\n\nCandidates:\n${body}\n\nRules:\n- Cite terms that match\n- Penalize off-topic\n- Prefer exact sections\n\nNow return JSON only.`;
}

// Call your LLM with SYSTEM + judgingPrompt(...) and parse the JSON;
// accept top-k with score ≥ threshold and keep the justification in logs.
```

**Guardrails**

* **JSON-only** response.
* Enforce **max tokens** and refuse long doc bodies (pass snippets only).
* Never let the LLM **rewrite** the snippet; it should judge only.
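
On the consumer side, one way to enforce these guardrails is to validate the judge’s reply before trusting it. A minimal Python sketch, assuming the raw model reply is in `raw` and the candidate ids you sent are in `known_ids`; the threshold and `top_k` defaults are placeholders:

```python
import json

def accept_judged(raw: str, known_ids: set, threshold: float = 0.5, top_k: int = 8):
    """Parse a JSON-only judge reply and keep well-formed, on-topic entries.
    Drops anything that is not valid JSON, references an unknown id,
    or carries a score outside [0, 1]."""
    try:
        items = json.loads(raw)
    except json.JSONDecodeError:
        return []  # judge broke the JSON-only rule; treat as no evidence
    kept = []
    for it in items:
        if not isinstance(it, dict):
            continue
        cid, score = it.get("id"), it.get("score")
        if cid in known_ids and isinstance(score, (int, float)) and 0.0 <= score <= 1.0:
            kept.append(it)
    kept.sort(key=lambda it: it["score"], reverse=True)
    return [it for it in kept[:top_k] if it["score"] >= threshold]
```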
---

## 3) Tuning knobs that actually matter

* **Candidate pool size (`k_in`)**: 50–200 typical. Small pool → missed gold; huge pool → latency.
* **Output size (`k_out`)**: 5–20. For grounded QA, 6–8 is a sweet spot.
* **Score calibration**: normalize CE outputs to `[0,1]`; keep **per-query z-scores** for audit.
* **Hybrid gate**: if BM25 and dense disagree drastically, log both top-5 lists and check [Query Parsing Split](./patterns/pattern_query_parsing_split.md).
* **Dedup by doc/section**: keep at most **N** chunks per section to avoid overfitting to near-duplicates (see the sketch after this list).
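
A minimal sketch of two of these knobs, per-section dedup and per-query z-scores for audit. The `meta["section"]` and `meta["doc_id"]` keys are assumed chunk-metadata fields; adapt them to your schema:

```python
from collections import defaultdict
import statistics

def dedup_by_section(ranked, max_per_section=2):
    """Keep at most N chunks per section so near-duplicates cannot crowd out
    other evidence. `ranked` is assumed sorted by rerank score, best first."""
    seen, out = defaultdict(int), []
    for c in ranked:
        meta = c.get("meta", {})
        key = meta.get("section", meta.get("doc_id"))
        if seen[key] < max_per_section:
            seen[key] += 1
            out.append(c)
    return out

def zscores(scores):
    """Per-query z-scores of reranker outputs, kept in logs for later audit."""
    mu = statistics.fmean(scores)
    sd = statistics.pstdev(scores) or 1.0   # avoid divide-by-zero on all-tied scores
    return [(s - mu) / sd for s in scores]
```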
---

## 4) Verification (don’t skip)

**Metrics**

* **nDCG@10**, **MRR@10**, **Recall@50**, and **ΔS(question, top-ctx)** (a minimal metric sketch follows below).
* Expect **ΔS ≤ 0.45** after rerank on the accepted top-ctx.
* Track **citation hit rate** (does the final answer cite a reranked chunk?).
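
If you do not already have an eval harness, the two ranking metrics fit in a few lines. A sketch assuming binary relevance, with `ranked_ids` as the reranked id list and `gold` as the set of relevant ids; both names are illustrative:

```python
import math

def ndcg_at_k(ranked_ids, gold, k=10):
    """Binary-relevance nDCG@k: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(1.0 / math.log2(i + 2) for i, rid in enumerate(ranked_ids[:k]) if rid in gold)
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(k, len(gold))))
    return dcg / ideal if ideal else 0.0

def mrr_at_k(ranked_ids, gold, k=10):
    """Reciprocal rank of the first relevant hit within the top k, else 0."""
    for i, rid in enumerate(ranked_ids[:k]):
        if rid in gold:
            return 1.0 / (i + 1)
    return 0.0
```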
**A/B checklist**

1. Freeze the first-stage retriever.
2. Compare **with vs without** the reranker on the same gold set.
3. Record latency p95 and cost per query.
4. If nDCG@10 improves by less than **+0.05** but latency doubles → not worth it.

---

## 5) Failure modes → fixes

| Symptom | Likely cause | Fix |
| --- | --- | --- |
| Reranker prefers off-topic “fluent” text | Judge prompt vague / CE miscalibrated | Tighten judging schema; penalize missing query terms; normalize scores |
| Great demo, but prod recall tanks | `k_in` too small / drift | Increase `k_in` to 100–200; re-check recall@50 |
| Citations merge across sources | Prompt schema unlocked | Enforce per-source fences; see [SCU](./patterns/pattern_symbolic_constraint_unlock.md) |
| Hybrid suddenly worse than dense | Tokenizers diverged | Align analyzers; log per-retriever queries; see [Query Parsing Split](./patterns/pattern_query_parsing_split.md) |

---

## 6) Cost model (back-of-envelope)

* Cross-encoder base: ~**3–6 ms**/doc on an A10G-level GPU; slower on CPU.
* For **`k_in` = 100** and p95 **~500 ms** on CPU, consider:
  * shrink text by **sentence-windowing**,
  * use a **mini** model,
  * pre-filter by **BM25 top-60**, then CE top-10.
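
As a worked example of the arithmetic: 100 candidates at roughly 4 ms each is already ~400 ms of reranking before any overhead, which is why pre-filtering matters. A throwaway estimator; the per-doc and overhead numbers are placeholders, not benchmarks:

```python
def rerank_latency_ms(k_in, ms_per_doc=4.0, batch_size=32, overhead_ms=20.0):
    """Back-of-envelope: k_in docs scored in batches, plus a fixed per-batch
    overhead for tokenization and transfer. Numbers are illustrative only."""
    batches = -(-k_in // batch_size)                 # ceiling division
    return k_in * ms_per_doc + batches * overhead_ms

print(rerank_latency_ms(100))   # ≈ 480 ms → trim k_in or use a mini model
```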
---

## 7) Acceptance criteria

* **nDCG@10** improves by **≥ +0.05** vs baseline.
* **Recall@50** unchanged (±0.02) after adding the reranker (the candidate pool must remain wide).
* **ΔS(question, top-ctx) ≤ 0.45** and λ stays **convergent** on 3 paraphrases.
* **Traceability**: store `{query, cand_id, pre_score, post_score, reason}` (a gate-check sketch follows below).
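
These gates can be wired into CI as a single pass/fail check. A sketch assuming you have already computed the metrics for both runs under the hypothetical keys shown; ΔS and λ come from your own instrumentation:

```python
def passes_acceptance(baseline: dict, reranked: dict) -> bool:
    """Apply the acceptance gates above to two metric dicts with keys
    'ndcg10', 'recall50', 'delta_s', and 'lambda_convergent' (assumed names)."""
    return (
        reranked["ndcg10"] - baseline["ndcg10"] >= 0.05                 # precision gain
        and abs(reranked["recall50"] - baseline["recall50"]) <= 0.02    # pool stayed wide
        and reranked["delta_s"] <= 0.45                                 # ΔS(question, top-ctx)
        and reranked["lambda_convergent"]                               # λ stable on 3 paraphrases
    )
```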
---

## 8) Example pipeline glue

```python
def answer(query):
    cands = search(query, topk_dense=80, topk_sparse=80, out_k=60)  # from retrieval-playbook
    reranked = rerank_topk(query, cands, out_k=8)                   # CE/LLM reranker
    prompt = build_prompt(query, reranked)                          # cite → explain, fenced by section
    return call_llm(prompt)
```

* **Do not** exceed **8–10** context chunks for QA; precision collapses after that.
* Always **log** which reranker selected which chunk (a minimal trace-log sketch follows below).
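
One low-effort way to satisfy both bullets is a JSONL trace written at rerank time, mirroring the traceability tuple from section 7. The file path and the `meta["id"]` key are assumptions:

```python
import json, time

def log_rerank_trace(query, candidate, pre_score, post_score, reason="",
                     path="rerank_trace.jsonl"):
    """Append one audit record per admitted chunk so a final answer can be
    traced back to the reranker decision that selected it."""
    record = {
        "ts": time.time(),
        "query": query,
        "cand_id": candidate.get("meta", {}).get("id"),
        "pre_score": pre_score,
        "post_score": post_score,
        "reason": reason,
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```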
---

### 🔗 Quick-Start Downloads (60 sec)

| Tool | Link | 3-Step Setup |
|------|------|--------------|
| **WFGY 1.0 PDF** | [Engine Paper](https://github.com/onestardao/WFGY/blob/main/I_am_not_lizardman/WFGY_All_Principles_Return_to_One_v1.0_PSBigBig_Public.pdf) | 1️⃣ Download · 2️⃣ Upload to your LLM · 3️⃣ Ask “Answer using WFGY + \<your question>” |
| **TXT OS (plain-text OS)** | [TXTOS.txt](https://github.com/onestardao/WFGY/blob/main/OS/TXTOS.txt) | 1️⃣ Download · 2️⃣ Paste into any LLM chat · 3️⃣ Type “hello world” — OS boots instantly |

---

<!-- WFGY_FOOTER_START -->

### Explore More

| Layer | Page | What it’s for |
| --- | --- | --- |
| ⭐ Proof | [WFGY Recognition Map](/recognition/README.md) | External citations, integrations, and ecosystem proof |
| ⚙️ Engine | [WFGY 1.0](/legacy/README.md) | Original PDF tension engine and early logic sketch (legacy reference) |
| ⚙️ Engine | [WFGY 2.0](/core/README.md) | Production tension kernel for RAG and agent systems |
| ⚙️ Engine | [WFGY 3.0](/TensionUniverse/EventHorizon/README.md) | TXT based Singularity tension engine (131 S class set) |
| 🗺️ Map | [Problem Map 1.0](/ProblemMap/README.md) | Flagship 16 problem RAG failure taxonomy and fix map |
| 🗺️ Map | [Problem Map 2.0](/ProblemMap/wfgy-rag-16-problem-map-global-debug-card.md) | Global Debug Card for RAG and agent pipeline diagnosis |
| 🗺️ Map | [Problem Map 3.0](/ProblemMap/wfgy-ai-problem-map-troubleshooting-atlas.md) | Global AI troubleshooting atlas and failure pattern map |
| 🧰 App | [TXT OS](/OS/README.md) | .txt semantic OS with fast bootstrap |
| 🧰 App | [Blah Blah Blah](/OS/BlahBlahBlah/README.md) | Abstract and paradox Q&A built on TXT OS |
| 🧰 App | [Blur Blur Blur](/OS/BlurBlurBlur/README.md) | Text to image generation with semantic control |
| 🏡 Onboarding | [Starter Village](/StarterVillage/README.md) | Guided entry point for new users |

If this repository helped, starring it improves discovery so more builders can find the docs and tools.

[](https://github.com/onestardao/WFGY)

<!-- WFGY_FOOTER_END -->