WFGY/ProblemMap/GlobalFixMap/Retrieval/hybrid_retrieval.md
2025-09-05 11:47:06 +08:00


Hybrid Retrieval

🧭 Quick Return to Map

You are in a sub-page of Retrieval.
To reorient, go back here:

Think of this page as a desk within a ward.
If you need the full triage and all prescriptions, return to the Emergency Room lobby.

A practical guide to fusing dense and sparse retrieval without losing meaning. Use this when BM25 and embeddings each work well alone but the combined pipeline gets worse or unstable.

Read together with


Acceptance targets

  • ΔS(question, retrieved) ≤ 0.45
  • Coverage ≥ 0.70 to the intended section
  • λ convergent across 3 paraphrases and 2 seeds
  • Fusion stability score ≥ 0.95 match across two identical runs

Why hybrid breaks

  1. Score scales differ
    Dense stores output cosine or dot-product similarities while BM25 outputs TF-IDF-style scores. Adding them directly biases one side.

  2. Different analyzers
    Casing, stemming, stopwords, and Unicode folding differ between the dense write path and the sparse read path.

  3. Query parsing split
    HyDE or query rewriting mutates the text that BM25 expects, so dense receives the rewritten query while sparse receives the original.

  4. Rerank not deterministic
    A cross-encoder's ranking can change with the seed, or ties break arbitrarily when sort tiebreakers are missing.

  5. Window mismatch
    Sparse hits long pages while dense hits small chunks. Fusing at different granularities leads to cross-section reuse.


Normalization rules before fusion

Apply these rules in this order.

  1. Per-source min-max scale to 0..1

score_norm = (score_raw - min_score) / (max_score - min_score + eps)

  2. Clip long tails
    If a source has heavy tails, apply score_norm = sqrt(score_norm).

  3. Align analyzers
    Use the same casing and ASCII-fold policy for both write and read paths. Log the analyzer in citations.

  4. Granularity fence
    Convert all hits to the same unit, either page level or chunk level. Prefer chunk level with explicit offsets.

  5. Deduplicate by snippet_id
    If the same snippet appears from both sources, keep the best score.
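The normalization and dedupe rules above can be sketched in Python. A minimal sketch, assuming each hit is a dict carrying snippet_id and score_raw, as in the recipes further down:

```python
import math

def normalize_source(hits, clip_tails=False, eps=1e-9):
    # Per-source min-max scale to 0..1.
    if not hits:
        return []
    raw = [h["score_raw"] for h in hits]
    lo, hi = min(raw), max(raw)
    out = []
    for h in hits:
        s = (h["score_raw"] - lo) / (hi - lo + eps)
        if clip_tails:
            # Compress heavy tails so one source cannot dominate the sum.
            s = math.sqrt(s)
        out.append({**h, "score_norm": s})
    return out

def dedupe_best(hits):
    # If the same snippet arrives from both sources, keep the best score.
    best = {}
    for h in hits:
        i = h["snippet_id"]
        if i not in best or h["score_norm"] > best[i]["score_norm"]:
            best[i] = h
    return list(best.values())
```

Run normalize_source once per branch before any fusion, then dedupe after taking the union.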


Fusion algorithms

Pick one algorithm and keep it stable. Log the chosen method with parameters in your traces.

1) Linear weighted sum


final = α * dense_norm + (1 - α) * sparse_norm
α ∈ [0.3, 0.7]

Good default for balanced corpora.

2) Reciprocal Rank Fusion


RRF_k = 60 by default
final = Σ 1 / (RRF_k + rank_i)

Robust when score scales are very different.
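As a minimal sketch, RRF needs only the rank order from each branch, not the raw scores. This assumes each branch's hit list arrives best-first:

```python
def fuse_rrf(dense_hits, sparse_hits, rrf_k=60, k=20):
    # hits: rank-ordered lists of {snippet_id, ...}, best hit first
    scores, meta = {}, {}
    for hits in (dense_hits, sparse_hits):
        for rank, h in enumerate(hits, start=1):
            i = h["snippet_id"]
            scores[i] = scores.get(i, 0.0) + 1.0 / (rrf_k + rank)
            meta.setdefault(i, h)
    fused = [{**meta[i], "score_norm": s} for i, s in scores.items()]
    # deterministic tiebreak on snippet_id for equal scores
    fused.sort(key=lambda x: (-x["score_norm"], x["snippet_id"]))
    return fused[:k]
```

Because only ranks enter the sum, no per-source min-max scaling is needed before calling it.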

3) Two stage with rerank

  1. Union top k from dense and sparse.
  2. Rerank with a cross encoder.
  3. Deterministic tiebreak on (rerank_score desc, section_id asc, snippet_id asc).

Use when you can afford extra latency and want semantic ordering.
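A sketch of the three steps; rerank_fn is a hypothetical wrapper around your cross encoder that returns one score per hit:

```python
def rerank_two_stage(dense_hits, sparse_hits, rerank_fn, k=20):
    # Step 1: union the candidates from both branches, dedup by snippet_id.
    pool = {}
    for h in dense_hits + sparse_hits:
        pool.setdefault(h["snippet_id"], h)
    candidates = list(pool.values())
    # Step 2: score every candidate with the cross encoder.
    for h in candidates:
        h["rerank_score"] = rerank_fn(h)
    # Step 3: deterministic tiebreak so two identical runs match exactly.
    candidates.sort(key=lambda x: (-x["rerank_score"], x["section_id"], x["snippet_id"]))
    return candidates[:k]
```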


Minimal recipes

Python pseudocode for linear fusion

def minmax(xs):
    lo, hi = min(xs), max(xs)
    rng = (hi - lo) or 1e-9
    return [(x - lo) / rng for x in xs]

def fuse_linear(dense_hits, sparse_hits, alpha=0.5, k=20):
    # hits: list of {snippet_id, section_id, score_raw, ...}
    dense_map = {h["snippet_id"]: h for h in dense_hits}
    sparse_map = {h["snippet_id"]: h for h in sparse_hits}
    # sort ids so iteration order, and therefore output, is deterministic
    all_ids = sorted(dense_map.keys() | sparse_map.keys())
    d_scores = [dense_map.get(i, {"score_raw": 0})["score_raw"] for i in all_ids]
    s_scores = [sparse_map.get(i, {"score_raw": 0})["score_raw"] for i in all_ids]
    d_norm = dict(zip(all_ids, minmax(d_scores)))
    s_norm = dict(zip(all_ids, minmax(s_scores)))
    fused = []
    for i in all_ids:
        sc = alpha * d_norm[i] + (1 - alpha) * s_norm[i]
        meta = dense_map.get(i) or sparse_map.get(i)
        fused.append({**meta, "score_norm": sc})
    # deterministic tiebreak: score desc, then section_id, then snippet_id
    fused.sort(key=lambda x: (-x["score_norm"], x["section_id"], x["snippet_id"]))
    return fused[:k]

LCEL style outline

# 1) run dense and sparse branches with the same analyzer policy name in metadata
# 2) project to citation payload
# 3) fuse with linear or RRF
# 4) optional rerank
# 5) validate with ΔS and the traceability validator

fused = fuse_linear(dense_hits, sparse_hits, alpha=0.55, k=20)
if use_rerank:
    fused = cross_encoder_rerank(fused, model="bce-en-v1.5")
validate_citations(fused)

LlamaIndex outline

dense = vector_index.as_retriever(similarity_top_k=20).retrieve(q)
sparse = bm25_retriever.retrieve(q, top_k=50)
fused = fuse_linear(project(dense), project(sparse), alpha=0.6, k=20)
fused = optional_rerank(fused)

Knobs that actually move the needle

| Area | Knob | Defaults | Notes |
|---|---|---|---|
| Dense | metric | cosine or dot | Use cosine with normalized vectors |
| Dense | pooling | mean or cls | Keep pooling constant for write and read |
| Dense | k | 10 to 30 | Run a k sweep with ΔS probes |
| Sparse | analyzer | lowercase, ascii fold | Keep parity with dense preprocessing |
| Sparse | BM25 k1 | 0.9 to 1.6 | Higher k1 favors term frequency |
| Sparse | BM25 b | 0.3 to 0.8 | Lower b reduces length normalization |
| Fusion | α | 0.4 to 0.6 | Start at 0.55 for dense-heavy corpora |
| Fusion | RRF k | 60 | Larger k reduces early-rank dominance |
| Rerank | model | bce or cohere or e5 | Pick one, keep the seed fixed |
| Window | pre, post | 80 to 160 chars | Match across all sources |

ΔS and λ probes for hybrid

  1. Run dense only, record ΔS and λ.
  2. Run sparse only, record ΔS and λ.
  3. Run fused. If fused ΔS is worse than both singles, the fusion is wrong.
  4. If λ flips between paraphrases only for the fused case, add rerank and fix tiebreakers.

Use the helper → deltaS_probes.md
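The four probes can be wrapped in one comparison helper. Here delta_s is an assumed stand-in for your ΔS probe function, not a real API:

```python
def fusion_probe(question, dense_hits, sparse_hits, fused_hits, delta_s):
    # delta_s(question, hits) -> float, lower is better
    d = delta_s(question, dense_hits)
    s = delta_s(question, sparse_hits)
    f = delta_s(question, fused_hits)
    return {
        "dense": d,
        "sparse": s,
        "fused": f,
        # if fused is worse than both singles, the fusion itself is wrong
        "fusion_regressed": f > max(d, s),
    }
```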


Typical failure modes and exact fixes

  • Fused hits look plausible but citations jump sections.
    Add a granularity fence and forbid cross-section reuse. Open: retrieval-traceability.md

  • Dense dominates even with α near 0.5.
    Normalize per source, then apply a square root to the dense scores. Open: retrieval-playbook.md

  • Sparse collapses on HyDE queries.
    Send the same rewritten query to both branches, or disable HyDE for sparse. Open: query_parsing_split.md

  • Unstable final order between runs.
    Add a cross-encoder rerank and a deterministic tiebreak. Open: rerankers.md

  • High similarity but wrong meaning.
    Align the analyzer and metric, then rebuild with the correct pooling. Open: embedding-vs-semantic.md


Evaluation recipe

  1. Create a gold set with positive and confusing negative sections.
  2. Run dense only, sparse only, and fused with identical k.
  3. Chart ΔS and coverage across three paraphrases.
  4. Choose α or RRF k that improves ΔS by at least 0.05 vs best single.
  5. Lock the configuration. Add a regression gate in CI.
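Step 5 can be enforced with a small gate function. A minimal sketch, assuming you log per-run ΔS and coverage into a results dict; the thresholds mirror the 0.05 improvement rule and the acceptance targets at the top of this page:

```python
def regression_gate(results, min_improvement=0.05, max_delta_s=0.45, min_coverage=0.70):
    # results: {"dense": {"delta_s": ...}, "sparse": {"delta_s": ...},
    #           "fused": {"delta_s": ..., "coverage": ...}}
    best_single = min(results["dense"]["delta_s"], results["sparse"]["delta_s"])
    fused = results["fused"]
    checks = {
        "delta_s_target": fused["delta_s"] <= max_delta_s,
        "coverage_target": fused["coverage"] >= min_coverage,
        "beats_best_single": best_single - fused["delta_s"] >= min_improvement,
    }
    return all(checks.values()), checks
```

Fail the CI job whenever the first return value is False, and log the checks dict so the failing target is visible.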

Copy paste test prompt

You have TXTOS and the WFGY Problem Map loaded.

My hybrid plan:
- dense = {store, metric, pooling, k}
- sparse = {analyzer, k1, b, k}
- fusion = {method: linear or rrf, alpha or rrf_k}
- rerank = {model, seed}

Return:
1) whether analyzer and window parity hold,
2) the minimal steps to normalize scores,
3) the recommended α or RRF k given the probe results,
4) a JSON checklist I can paste into CI to keep it stable.

🔗 Quick-Start Downloads (60 sec)

| Tool | Link | 3-Step Setup |
|---|---|---|
| WFGY 1.0 PDF | Engine Paper | 1 Download · 2 Upload to your LLM · 3 Ask “Answer using WFGY + <your question>” |
| TXT OS (plain-text OS) | TXTOS.txt | 1 Download · 2 Paste into any LLM chat · 3 Type “hello world” — OS boots instantly |

🧭 Explore More

| Module | Description | Link |
|---|---|---|
| WFGY Core | WFGY 2.0 engine is live: full symbolic reasoning architecture and math stack | View → |
| Problem Map 1.0 | Initial 16-mode diagnostic and symbolic fix framework | View → |
| Problem Map 2.0 | RAG-focused failure tree, modular fixes, and pipelines | View → |
| Semantic Clinic Index | Expanded failure catalog: prompt injection, memory bugs, logic drift | View → |
| Semantic Blueprint | Layer-based symbolic reasoning & semantic modulations | View → |
| Benchmark vs GPT-5 | Stress test GPT-5 with full WFGY reasoning suite | View → |

👑 Early Stargazers: See the Hall of Fame — Engineers, hackers, and open source builders who supported WFGY from day one.

GitHub stars WFGY Engine 2.0 is already unlocked. Star the repo to help others discover it and unlock more on the Unlock Board.

WFGY Main

TXT OS

Blow