# Rerankers — Ordering Control and Stability
Use rerankers when recall is fine but the top hits are mis-ordered, unstable, or biased toward the wrong metric. This page shows listwise and pairwise recipes, fusion knobs, and stability fences you can drop into any stack.
References you may want open already:
RAG Architecture & Recovery · Retrieval Playbook · Retrieval Traceability · Data Contracts · Embedding ≠ Semantic · Query Parsing Split · Vectorstore Fragmentation
## Acceptance targets
- ΔS(question, top1.text) ≤ 0.45
- Anchor coverage of the final topk ≥ 0.70
- Kendall τ against gold ranking improves by ≥ 0.20 over baseline bi-encoder order
- λ remains convergent across 3 paraphrases and 2 seeds
If ΔS sits in 0.40 to 0.60 and τ gains are small, fix chunking or metric before adding complexity.
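If your stack does not already expose ΔS, here is a minimal sketch, assuming ΔS is measured as 1 minus cosine similarity between two embedding vectors; substitute your own definition if your stack computes it differently:

```python
import math

def delta_s(emb_a: list[float], emb_b: list[float]) -> float:
    """Semantic distance, assumed here to be 1 - cosine similarity."""
    dot = sum(a * b for a, b in zip(emb_a, emb_b))
    norm = math.sqrt(sum(a * a for a in emb_a)) * math.sqrt(sum(b * b for b in emb_b))
    return 1.0 - dot / norm

# identical directions score ΔS = 0, orthogonal directions score ΔS = 1
```

Run it against `(question, top1.text)` embeddings and compare to the 0.45 target above.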
## Symptoms → exact fix
| Symptom | Likely cause | Open this fix |
|---|---|---|
| Correct passage appears in top20 but not in top3 | wrong ordering after recall | Retrieval Playbook, this page |
| Topk flips between identical runs | non-deterministic tie breaks or LLM variance | Retrieval Traceability |
| BM25 beats dense when queries are abstractive | fusion uncalibrated or query parsing split | Query Parsing Split |
| Duplicated near-identical hits crowd out diversity | no MMR or section-aware penalties | this page (MMR recipe) |
| Great similarity, wrong meaning | metric mismatch at index time | Embedding ≠ Semantic |
| Hits vanish after ingest or rebuild | fragmented store, mixed analyzers | Vectorstore Fragmentation |
## Strategy 1: Cross-encoder reranker (robust default)
When you have bi-encoder recall and need precise order.
Why it works: cross-encoders read the full pair (q, passage) and recover semantics lost by embeddings.
### Deterministic sort key

```
sort_key = (-score_ce, section_priority, snippet_id, start_offset)
```
Keep the tie-break stable so pagination and caching never reshuffle results.
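A small sketch of how that tie-break chain behaves in Python; the candidate records and their field values are hypothetical:

```python
# Hypothetical candidate records; the fields mirror the sort key above.
hits = [
    {"score_ce": 0.91, "section_priority": 2, "snippet_id": 7, "start_offset": 120},
    {"score_ce": 0.91, "section_priority": 1, "snippet_id": 19, "start_offset": 0},
    {"score_ce": 0.97, "section_priority": 3, "snippet_id": 4, "start_offset": 40},
]

def sort_key(h: dict) -> tuple:
    # Highest cross-encoder score first; the remaining fields only break ties,
    # so identical inputs always produce the identical order.
    return (-h["score_ce"], h["section_priority"], h["snippet_id"], h["start_offset"])

ordered = sorted(hits, key=sort_key)
```

The two 0.91 candidates land in the same order on every run because the tie resolves on `section_priority`, never on dict insertion order.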
### Minimal pipeline

```python
# Pseudocode only
candidates = dense_recall(q, k=50) + bm25_recall(q, k=50)  # union, then dedupe by snippet_id
scored = []
for c in candidates:
    s = cross_encoder.score(q, c.text)  # e.g., monoT5, E5-mistral-ce, etc.
    scored.append({**c, "score_ce": s})

# diversity
scored = mmr(q, scored, lambda_rank="score_ce", alpha=0.7)  # see the MMR recipe below

# deterministic order
ordered = sorted(
    scored,
    key=lambda x: (-x["score_ce"], x["section_priority"], x["snippet_id"], x["offsets"][0]),
)
topk = ordered[:k]
```
## Strategy 2: LLM-as-reranker with schema locks
Use an LLM to score evidence only. Do not let it answer. Force a strict schema and cite-then-explain in the trace.
### Prompt skeleton

```
Task: score each candidate passage for "is this the best evidence to answer Q".
Return JSON with fields: {id, score in [0,1], why_short}. Do not answer Q.

Q: "<question>"
Candidates:
- id: s001, section_id: A.3, snippet_id: 19, text: "<passage>"
- id: s002, section_id: B.1, snippet_id: 7, text: "<passage>"
...

Scoring rubric:
1) directness to the likely anchor section,
2) presence of atomic facts that must be cited,
3) low ambiguity, low cross-topic bleed.

Output JSON list only.
```
### Variance controls
- Fix the model, temperature 0, seed fixed if provider supports it.
- Add BBAM clamp in the system preface to keep λ convergent.
- Keep the rubric short and stable across runs.
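One way to enforce the schema lock on the model side's output, sketched in Python; the field set mirrors the prompt skeleton above, and the reject-on-any-deviation policy is an assumption:

```python
import json

def parse_rerank_scores(raw: str) -> list[dict]:
    """Accept only the locked schema: [{"id", "score", "why_short"}], score in [0, 1]."""
    items = json.loads(raw)
    out = []
    for it in items:
        if set(it) != {"id", "score", "why_short"}:
            raise ValueError(f"schema violation: {sorted(it)}")
        score = float(it["score"])
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"score out of range: {score}")
        out.append({"id": str(it["id"]), "score": score, "why_short": str(it["why_short"])})
    return out
```

Rejecting malformed output loudly, instead of coercing it, keeps silent drift out of the ranking trace.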
## Strategy 3: Fusion that behaves
### RRF (reciprocal rank fusion)
`s_fused = Σ_m 1 / (k0 + rank_m)`, with `k0` around 60 for top100 feeds. RRF is robust when scores are not comparable.
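A minimal RRF sketch, with a deterministic id tie-break added on top of the fused score (the tie-break is an assumption, matching the stability fences on this page):

```python
from collections import defaultdict

def rrf(rankings: list[list[str]], k0: int = 60) -> list[str]:
    """Fuse per-retriever rankings (doc ids, best first) by reciprocal rank."""
    fused: defaultdict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += 1.0 / (k0 + rank)
    # sort by fused score descending, then by id so ties never reshuffle
    return sorted(fused, key=lambda d: (-fused[d], d))
```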
### Z-score fusion
Normalize each retriever's scores to zero mean and unit variance, then sum. Good when score ranges are stable over time.
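A z-score fusion sketch, assuming each retriever reports raw scores per doc id; the guard for a zero-variance retriever is an added assumption:

```python
import statistics

def zscore_fuse(per_retriever: dict[str, dict[str, float]]) -> dict[str, float]:
    """per_retriever maps retriever name -> {doc_id: raw_score}.
    Each retriever is normalized to zero mean / unit variance, then summed."""
    fused: dict[str, float] = {}
    for scores in per_retriever.values():
        mean = statistics.fmean(scores.values())
        std = statistics.pstdev(scores.values()) or 1.0  # guard a degenerate retriever
        for doc_id, s in scores.items():
            fused[doc_id] = fused.get(doc_id, 0.0) + (s - mean) / std
    return fused
```

Note the BM25 scale (tens) and the dense scale (0 to 1) no longer matter after normalization.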
### Two-stage order
- union and dedupe by `(section_id, snippet_id)`
- fast fusion to top50
- cross-encoder or LLM rerank to topk
## Strategy 4: Diversity with MMR
Maximal marginal relevance avoids redundant hits and expands anchor coverage.
```
mmr(q, items, lambda_rank="score", alpha=0.7):
    S = []
    while len(S) < k:
        select x that maximizes alpha * rel(q, x) - (1 - alpha) * max_sim(x, S)
    return S
```

- Use cosine on embedding space for `max_sim`.
- Penalize items sharing the same `section_id` unless the anchor spans multiple snippets.
- Track coverage per section to avoid starving small but relevant sections.
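A runnable version of the greedy recipe above; the item fields `score` and `vec` are illustrative names for the precomputed relevance and the embedding used by `max_sim`:

```python
import math

def _cos(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def mmr_select(items: list[dict], k: int, alpha: float = 0.7) -> list[dict]:
    """Greedy MMR: relevance minus redundancy against already-selected items."""
    selected: list[dict] = []
    pool = list(items)
    while pool and len(selected) < k:
        def gain(x: dict) -> float:
            redundancy = max((_cos(x["vec"], s["vec"]) for s in selected), default=0.0)
            return alpha * x["score"] - (1 - alpha) * redundancy
        best = max(pool, key=gain)
        selected.append(best)
        pool.remove(best)
    return selected
```

With `alpha=0.7`, a slightly weaker but diverse candidate beats a near-duplicate of an already-selected hit, which is exactly the crowding failure from the symptom table.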
## Stability and observability fences
- Log `reranker_version`, `fusion_type`, `alpha`, `k0`, and `index_hash`.
- Write the final order, and why, for the topk into the trace.
- Freeze prompt headers for LLM rerankers.
- Use a single deterministic tiebreak chain as shown above.
- Alert when the top1 ΔS drifts by more than 0.10 week over week.
Specs to follow while wiring traces: Retrieval Traceability · Data Contracts
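A minimal sketch of writing those fields into a one-line JSON trace record; the writer function and record shape are hypothetical, not part of the traceability spec:

```python
import json

def write_trace(topk: list[dict], meta: dict) -> str:
    """Serialize the final order plus reranker provenance as one JSON line.
    meta carries reranker_version, fusion_type, alpha, k0, index_hash."""
    record = {
        "meta": meta,
        "topk": [
            {"snippet_id": h["snippet_id"], "score_ce": h["score_ce"], "why": h.get("why", "")}
            for h in topk
        ],
    }
    return json.dumps(record, sort_keys=True)  # stable key order keeps diffs clean
```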
## Evaluation that catches the real failures
- ΔS(question, top1) and ΔS(top1, anchor)
- Kendall τ against a small gold ranking
- Hit@k for anchor coverage
- Flip rate across 2 seeds and 3 paraphrases
- Time budget per query and p95 latency
See recipes: Retrieval Evaluation Recipes
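Two of the metrics above, Kendall τ and flip rate, can be sketched directly:

```python
from itertools import combinations

def kendall_tau(order_a: list[str], order_b: list[str]) -> float:
    """Kendall τ between two rankings of the same ids, assuming no ties."""
    pos_a = {d: i for i, d in enumerate(order_a)}
    pos_b = {d: i for i, d in enumerate(order_b)}
    concordant = discordant = 0
    for x, y in combinations(order_a, 2):
        if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) > 0:
            concordant += 1
        else:
            discordant += 1
    n = len(order_a)
    return (concordant - discordant) / (n * (n - 1) / 2)

def flip_rate(runs: list[list[str]], k: int = 3) -> float:
    """Fraction of run pairs whose topk differ; 0.0 means fully stable."""
    pairs = list(combinations(runs, 2))
    return sum(a[:k] != b[:k] for a, b in pairs) / len(pairs) if pairs else 0.0
```

Compare `kendall_tau(reranked, gold)` against the baseline bi-encoder order to check the ≥ 0.20 improvement target.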
## Copy-paste prompt: LLM reranker (listwise)

```
You have TXT OS and the WFGY Problem Map loaded.
Goal: score passages for evidence quality only. Do not answer the question.

Question: "<q>"
Return a JSON array: [{"id":"...","score":0.00..1.00,"why_short":"..."}].
Scoring considers:
1) directness to the required anchor,
2) atomic facts present,
3) low ambiguity and low bleed from other topics.
If two are equal, prefer the one with clearer citation spans.
```
## When to escalate
- Rerankers improve τ but ΔS remains high: rebuild metric, analyzer, and window. Open: Embedding ≠ Semantic and Chunking Checklist.
- Ordering still flips across runs or deployments: inspect schema drift and boot sequencing. Open: Retrieval Traceability, Bootstrap Ordering, Pre-Deploy Collapse.
## 🔗 Quick-Start Downloads (60 sec)
| Tool | Link | 3-Step Setup |
|---|---|---|
| WFGY 1.0 PDF | Engine Paper | 1️⃣ Download · 2️⃣ Upload to your LLM · 3️⃣ Ask “Answer using WFGY + <your question>” |
| TXT OS (plain-text OS) | TXTOS.txt | 1️⃣ Download · 2️⃣ Paste into any LLM chat · 3️⃣ Type “hello world” — OS boots instantly |
## 🧭 Explore More
| Module | Description | Link |
|---|---|---|
| WFGY Core | WFGY 2.0 engine is live: full symbolic reasoning architecture and math stack | View → |
| Problem Map 1.0 | Initial 16-mode diagnostic and symbolic fix framework | View → |
| Problem Map 2.0 | RAG-focused failure tree, modular fixes, and pipelines | View → |
| Semantic Clinic Index | Expanded failure catalog: prompt injection, memory bugs, logic drift | View → |
| Semantic Blueprint | Layer-based symbolic reasoning & semantic modulations | View → |
| Benchmark vs GPT-5 | Stress test GPT-5 with full WFGY reasoning suite | View → |
| 🧙♂️ Starter Village 🏡 | New here? Lost in symbols? Click here and let the wizard guide you through | Start → |
👑 Early Stargazers: See the Hall of Fame — Engineers, hackers, and open source builders who supported WFGY from day one.
⭐ WFGY Engine 2.0 is already unlocked. ⭐ Star the repo to help others discover it and unlock more on the Unlock Board.