Rerankers — Ordering Control and Stability
🧭 Quick Return to Map
You are in a sub-page of Retrieval.
To reorient, go back here:
- Retrieval — information access and knowledge lookup
- WFGY Global Fix Map — main Emergency Room, 300+ structured fixes
- WFGY Problem Map 1.0 — 16 reproducible failure modes
Think of this page as a desk within a ward.
If you need the full triage and all prescriptions, return to the Emergency Room lobby.
Use rerankers when recall is fine but the top hits are mis-ordered, unstable, or biased toward the wrong metric. This page shows listwise and pairwise recipes, fusion knobs, and stability fences you can drop into any stack.
References you may want open already:
RAG Architecture & Recovery ·
Retrieval Playbook ·
Retrieval Traceability ·
Data Contracts ·
Embedding ≠ Semantic ·
Query Parsing Split ·
Vectorstore Fragmentation
Acceptance targets
- ΔS(question, top1.text) ≤ 0.45
- Anchor coverage of the final topk ≥ 0.70
- Kendall τ against gold ranking improves by ≥ 0.20 over baseline bi-encoder order
- λ remains convergent across 3 paraphrases and 2 seeds
If ΔS sits in 0.40 to 0.60 and τ gains are small, fix chunking or metric before adding complexity.
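These targets compose into a single go/no-go gate. A minimal sketch, assuming your eval harness already computes ΔS, anchor coverage, τ gain, and a λ convergence flag (all names here are hypothetical):

```python
# Hypothetical gate over the acceptance targets above; every input is
# assumed to be computed by your own eval harness.
def passes_acceptance(delta_s_top1: float, anchor_coverage: float,
                      tau_gain: float, lambda_convergent: bool) -> bool:
    return (delta_s_top1 <= 0.45
            and anchor_coverage >= 0.70
            and tau_gain >= 0.20
            and lambda_convergent)
```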
Symptoms → exact fix
| Symptom | Likely cause | Open this fix |
|---|---|---|
| Correct passage appears in top20 but not in top3 | wrong ordering after recall | Retrieval Playbook, this page |
| Topk flips between identical runs | non-deterministic tie breaks or LLM variance | Retrieval Traceability |
| BM25 beats dense when queries are abstractive | fusion uncalibrated or query parsing split | Query Parsing Split |
| Duplicated near-identical hits crowd out diversity | no MMR or section-aware penalties | this page (MMR recipe) |
| Great similarity, wrong meaning | metric mismatch at index time | Embedding ≠ Semantic |
| Hits vanish after ingest or rebuild | fragmented store, mixed analyzers | Vectorstore Fragmentation |
Strategy 1: Cross-encoder reranker (robust default)
When you have bi-encoder recall and need precise order.
Why it works: a cross-encoder reads the full (q, passage) pair, so it recovers semantics that bi-encoder embeddings compress away.
Deterministic sort key
```python
sort_key = (-score_ce, section_priority, snippet_id, start_offset)
```
Keep the tie-break stable so pagination and caching never reshuffle results.
Minimal pipeline
```python
# Pseudocode: dense_recall, bm25_recall, cross_encoder, and mmr are your own adapters
candidates = dense_recall(q, k=50) + bm25_recall(q, k=50)  # union, then dedupe by snippet_id
scored = []
for c in candidates:
    s = cross_encoder.score(q, c["text"])  # e.g., monoT5, E5-mistral-ce, etc.
    scored.append({**c, "score_ce": s})

# diversity
scored = mmr(q, scored, lambda_rank="score_ce", alpha=0.7)  # see MMR recipe below

# deterministic order
ordered = sorted(scored, key=lambda x: (-x["score_ce"], x["section_priority"],
                                        x["snippet_id"], x["offsets"][0]))
topk = ordered[:k]
```
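If you want a concrete scorer for the loop above, here is a sketch assuming the sentence-transformers package; the checkpoint name is one public option, not a requirement:

```python
from sentence_transformers import CrossEncoder

# Assumed setup: swap in whatever cross-encoder you actually deploy.
ce = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
pairs = [(q, c["text"]) for c in candidates]  # candidates from the union step above
for c, s in zip(candidates, ce.predict(pairs)):
    c["score_ce"] = float(s)
```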
Strategy 2: LLM-as-reranker with schema locks
Use an LLM to score evidence only. Do not let it answer. Force a strict schema and cite-then-explain in the trace.
Prompt skeleton
```txt
Task: score each candidate passage for "is this the best evidence to answer Q".
Return JSON with fields: {id, score in [0,1], why_short}. Do not answer Q.

Q: "<question>"
Candidates:
- id: s001, section_id: A.3, snippet_id: 19, text: "<passage>"
- id: s002, section_id: B.1, snippet_id: 7, text: "<passage>"
...

Scoring rubric:
1) directness to the likely anchor section,
2) presence of atomic facts that must be cited,
3) low ambiguity, low cross-topic bleed.

Output JSON list only.
```
Variance controls
- Pin the model, set temperature to 0, and fix the seed if the provider supports it.
- Add BBAM clamp in the system preface to keep λ convergent.
- Keep the rubric short and stable across runs.
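To make the schema lock enforceable, parse and validate before any score is used; a minimal sketch, assuming the LLM returns the JSON array requested above:

```python
import json

def parse_rerank_output(raw: str) -> list[dict]:
    items = json.loads(raw)  # raises on malformed output: fail loud, then retry
    for it in items:
        if not {"id", "score", "why_short"} <= set(it):
            raise ValueError(f"missing fields: {it}")
        if not 0.0 <= float(it["score"]) <= 1.0:
            raise ValueError(f"score out of range: {it}")
    return items
```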
Strategy 3: Fusion that behaves
RRF (reciprocal rank fusion)
s_fused = Σ_m 1 / (k0 + rank_m), with k0 around 60 for top100 feeds. RRF is robust when scores are not comparable.
Z-score fusion
Normalize each retriever to zero mean and unit variance then sum. Good when score ranges are stable over time.
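Both rules fit in a few lines. A sketch, assuming each retriever hands you either a best-first id list (for RRF) or an id-to-score map (for z-score):

```python
from collections import defaultdict
from statistics import mean, pstdev

def rrf(rankings: list[list[str]], k0: int = 60) -> dict[str, float]:
    # rankings: one best-first list of doc ids per retriever
    fused = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += 1.0 / (k0 + rank)
    return dict(fused)

def zscore_fusion(score_maps: list[dict[str, float]]) -> dict[str, float]:
    # score_maps: one {doc_id: raw_score} map per retriever
    fused = defaultdict(float)
    for scores in score_maps:
        mu, sd = mean(scores.values()), pstdev(scores.values()) or 1.0
        for doc_id, s in scores.items():
            fused[doc_id] += (s - mu) / sd
    return dict(fused)
```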
Two-stage order
- union and dedupe by `(section_id, snippet_id)` (see the sketch after this list)
- fast fusion to top50
- cross-encoder or LLM rerank to topk
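The first stage is just a keyed union; a sketch assuming each hit carries `section_id` and `snippet_id` per the Data Contracts spec:

```python
def union_dedupe(*result_lists):
    # keep the first occurrence of each (section_id, snippet_id) key
    seen, merged = set(), []
    for hits in result_lists:
        for h in hits:
            key = (h["section_id"], h["snippet_id"])
            if key not in seen:
                seen.add(key)
                merged.append(h)
    return merged
```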
Strategy 4: Diversity with MMR
Maximal marginal relevance avoids redundant hits and expands anchor coverage.
```python
# Greedy MMR: relevance comes from the chosen score field, max_sim is your
# similarity hook (one sketch is given below)
def mmr(q, items, k=20, lambda_rank="score", alpha=0.7):
    selected, pool = [], list(items)
    while pool and len(selected) < k:
        best = max(pool, key=lambda x: alpha * x[lambda_rank]
                                       - (1 - alpha) * max_sim(x, selected))
        selected.append(best)
        pool.remove(best)
    return selected
```
- Use cosine on the embedding space for `max_sim` (see the sketch after this list).
- Penalize items sharing the same `section_id` unless the anchor spans multiple snippets.
- Track coverage per section to avoid starving small but relevant sections.
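One way to wire those three bullets into the `mmr` function above; a sketch assuming unit-normalized embedding vectors on each item (the 0.2 same-section penalty is an illustrative value):

```python
import numpy as np

def max_sim(x, selected, section_penalty=0.2):
    if not selected:
        return 0.0
    sims = [float(np.dot(x["vec"], s["vec"])) for s in selected]  # cosine on unit vectors
    same_section = any(s["section_id"] == x["section_id"] for s in selected)
    return max(sims) + (section_penalty if same_section else 0.0)
```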
Stability and observability fences
- Log `reranker_version`, `fusion_type`, `alpha`, `k0`, and `index_hash`.
- Write the final order, and why each item made the topk, into the trace.
- Freeze prompt headers for LLM rerankers.
- Use a single deterministic tie-break chain as shown above.
- Alert when the top1 ΔS drifts by more than 0.10 week over week.
Specs to follow while wiring traces: Retrieval Traceability · Data Contracts
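A minimal trace record covering those fields, assuming one JSON line per query; the writer interface is illustrative:

```python
import json, time

def log_rerank_trace(fh, topk, *, reranker_version, fusion_type, alpha, k0, index_hash):
    # one JSON line per query, so diffs between deployments stay greppable
    fh.write(json.dumps({
        "ts": time.time(),
        "reranker_version": reranker_version,
        "fusion_type": fusion_type,
        "alpha": alpha,
        "k0": k0,
        "index_hash": index_hash,
        "topk": [{"snippet_id": h["snippet_id"], "score_ce": h["score_ce"]} for h in topk],
    }) + "\n")
```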
Evaluation that catches the real failures
- ΔS(question, top1) and ΔS(top1, anchor)
- Kendall τ against a small gold ranking
- Hit@k for anchor coverage
- Flip rate across 2 seeds and 3 paraphrases
- Time budget per query and p95 latency
See recipes: Retrieval Evaluation Recipes
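Two of those metrics in code; a sketch assuming scipy is available and that ranks are given over the same id set across runs:

```python
from scipy.stats import kendalltau

def tau_gain(gold_ranks, baseline_ranks, reranked_ranks):
    tau_base, _ = kendalltau(gold_ranks, baseline_ranks)
    tau_rerank, _ = kendalltau(gold_ranks, reranked_ranks)
    return tau_rerank - tau_base  # acceptance target: >= 0.20

def flip_rate(ids_run_a, ids_run_b):
    # fraction of topk positions that disagree between two otherwise identical runs
    return sum(a != b for a, b in zip(ids_run_a, ids_run_b)) / len(ids_run_a)
```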
Copy-paste prompt: LLM reranker (listwise)
```txt
You have TXT OS and the WFGY Problem Map loaded.

Goal: score passages for evidence quality only. Do not answer the question.
Question: "<q>"

Return a JSON array: [{"id":"...","score":0.00..1.00,"why_short":"..."}].
Scoring considers:
1) directness to the required anchor,
2) atomic facts present,
3) low ambiguity and low bleed from other topics.
If two are equal, prefer the one with clearer citation spans.
```
When to escalate
- Rerankers improve τ but ΔS remains high: rebuild metric, analyzer, and window. Open: Embedding ≠ Semantic and Chunking Checklist.
- Ordering still flips across runs or deployments: inspect schema drift and boot sequencing. Open: Retrieval Traceability, Bootstrap Ordering, Pre-Deploy Collapse.
🔗 Quick-Start Downloads (60 sec)
| Tool | Link | 3-Step Setup |
|---|---|---|
| WFGY 1.0 PDF | Engine Paper | 1️⃣ Download · 2️⃣ Upload to your LLM · 3️⃣ Ask “Answer using WFGY + <your question>” |
| TXT OS (plain-text OS) | TXTOS.txt | 1️⃣ Download · 2️⃣ Paste into any LLM chat · 3️⃣ Type “hello world” — OS boots instantly |
Explore More
| Layer | Page | What it’s for |
|---|---|---|
| ⭐ Proof | WFGY Recognition Map | External citations, integrations, and ecosystem proof |
| ⚙️ Engine | WFGY 1.0 | Original PDF tension engine and early logic sketch (legacy reference) |
| ⚙️ Engine | WFGY 2.0 | Production tension kernel for RAG and agent systems |
| ⚙️ Engine | WFGY 3.0 | TXT based Singularity tension engine (131 S class set) |
| 🗺️ Map | Problem Map 1.0 | Flagship 16 problem RAG failure taxonomy and fix map |
| 🗺️ Map | Problem Map 2.0 | Global Debug Card for RAG and agent pipeline diagnosis |
| 🗺️ Map | Problem Map 3.0 | Global AI troubleshooting atlas and failure pattern map |
| 🧰 App | TXT OS | .txt semantic OS with fast bootstrap |
| 🧰 App | Blah Blah Blah | Abstract and paradox Q&A built on TXT OS |
| 🧰 App | Blur Blur Blur | Text to image generation with semantic control |
| 🏡 Onboarding | Starter Village | Guided entry point for new users |
If this repository helped, starring it improves discovery so more builders can find the docs and tools.