WFGY/ProblemMap/GlobalFixMap/Retrieval/hybrid_retrieval.md
2025-09-05 11:47:06 +08:00


Hybrid Retrieval

🧭 Quick Return to Map

You are in a sub-page of Retrieval.
To reorient, go back here:

Think of this page as a desk within a ward.
If you need the full triage and all prescriptions, return to the Emergency Room lobby.

A practical guide to fusing dense and sparse retrieval without losing meaning. Use this when BM25 and embeddings each work well alone but the combined pipeline gets worse or unstable.

Read together with


Acceptance targets

  • ΔS(question, retrieved) ≤ 0.45
  • Coverage ≥ 0.70 to the intended section
  • λ convergent across 3 paraphrases and 2 seeds
  • Fusion stability score ≥ 0.95 match across two identical runs

Why hybrid breaks

  1. Score scales differ
    Dense stores output cosine or dot-product similarities while BM25 outputs TF-IDF-style scores. Adding them directly biases one side.

  2. Different analyzers
    Casing, stemming, stopwords, and Unicode folding differ between the dense write path and the sparse read path.

  3. Query parsing split
    HyDE or query rewriting mutates the text that BM25 expects, so dense receives the rewritten query while sparse receives the original.

  4. Rerank not deterministic
    A cross-encoder's ranking can change with the seed, or ties break arbitrarily when sort tiebreakers are missing.

  5. Window mismatch
    Sparse hits long pages while dense hits small chunks. Fusing at different granularities leads to cross-section reuse.


Normalization rules before fusion

Apply these rules in this order.

  1. Per-source min-max scale to 0..1

score_norm = (score_raw - min_score) / (max_score - min_score + eps)

  2. Clip long tails
    If a source has heavy tails, apply score_norm = sqrt(score_norm).

  3. Align analyzers
    Use the same casing and ASCII-fold policy for both write and read paths. Log the analyzer in citations.

  4. Granularity fence
    Convert all hits to the same unit, either page level or chunk level. Prefer chunk level with explicit offsets.

  5. Deduplicate by snippet_id
    If the same snippet appears from both sources, keep the best score.
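The normalization and dedupe rules above can be sketched in Python. A minimal sketch, assuming each hit is a dict carrying snippet_id and score_raw, as in the recipes further down:

```python
import math

def normalize_source(hits, clip_tails=False, eps=1e-9):
    # Per-source min-max scale to 0..1.
    if not hits:
        return []
    raw = [h["score_raw"] for h in hits]
    lo, hi = min(raw), max(raw)
    out = []
    for h in hits:
        s = (h["score_raw"] - lo) / (hi - lo + eps)
        if clip_tails:
            # Compress heavy tails so one source cannot dominate the sum.
            s = math.sqrt(s)
        out.append({**h, "score_norm": s})
    return out

def dedupe_best(hits):
    # If the same snippet arrives from both sources, keep the best score.
    best = {}
    for h in hits:
        i = h["snippet_id"]
        if i not in best or h["score_norm"] > best[i]["score_norm"]:
            best[i] = h
    return list(best.values())
```

Run normalize_source once per branch before any fusion, then dedupe after taking the union.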


Fusion algorithms

Pick one algorithm and keep it stable. Log the chosen method with parameters in your traces.

1) Linear weighted sum


final = α * dense_norm + (1 - α) * sparse_norm
α ∈ [0.3, 0.7]

Good default for balanced corpora.

2) Reciprocal Rank Fusion


RRF_k = 60 by default
final = Σ 1 / (RRF_k + rank_i)

Robust when score scales are very different.
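As a minimal sketch, RRF needs only the rank order from each branch, not the raw scores. This assumes each branch's hit list arrives best-first:

```python
def fuse_rrf(dense_hits, sparse_hits, rrf_k=60, k=20):
    # hits: rank-ordered lists of {snippet_id, ...}, best hit first
    scores, meta = {}, {}
    for hits in (dense_hits, sparse_hits):
        for rank, h in enumerate(hits, start=1):
            i = h["snippet_id"]
            scores[i] = scores.get(i, 0.0) + 1.0 / (rrf_k + rank)
            meta.setdefault(i, h)
    fused = [{**meta[i], "score_norm": s} for i, s in scores.items()]
    # deterministic tiebreak on snippet_id for equal scores
    fused.sort(key=lambda x: (-x["score_norm"], x["snippet_id"]))
    return fused[:k]
```

Because only ranks enter the sum, no per-source min-max scaling is needed before calling it.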

3) Two stage with rerank

  1. Union top k from dense and sparse.
  2. Rerank with a cross encoder.
  3. Deterministic tiebreak on (rerank_score desc, section_id asc, snippet_id asc).

Use when you can afford extra latency and want semantic ordering.
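A sketch of the three steps; rerank_fn is a hypothetical wrapper around your cross encoder that returns one score per hit:

```python
def rerank_two_stage(dense_hits, sparse_hits, rerank_fn, k=20):
    # Step 1: union the candidates from both branches, dedup by snippet_id.
    pool = {}
    for h in dense_hits + sparse_hits:
        pool.setdefault(h["snippet_id"], h)
    candidates = list(pool.values())
    # Step 2: score every candidate with the cross encoder.
    for h in candidates:
        h["rerank_score"] = rerank_fn(h)
    # Step 3: deterministic tiebreak so two identical runs match exactly.
    candidates.sort(key=lambda x: (-x["rerank_score"], x["section_id"], x["snippet_id"]))
    return candidates[:k]
```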


Minimal recipes

Python pseudocode for linear fusion

def minmax(xs):
    lo, hi = min(xs), max(xs)
    rng = (hi - lo) or 1e-9
    return [(x - lo) / rng for x in xs]

def fuse_linear(dense_hits, sparse_hits, alpha=0.5, k=20):
    # hits: list of {snippet_id, section_id, score_raw, ...}
    dense_map = {h["snippet_id"]: h for h in dense_hits}
    sparse_map = {h["snippet_id"]: h for h in sparse_hits}
    # sort ids so iteration order, and therefore output, is deterministic
    all_ids = sorted(dense_map.keys() | sparse_map.keys())
    d_scores = [dense_map.get(i, {"score_raw": 0})["score_raw"] for i in all_ids]
    s_scores = [sparse_map.get(i, {"score_raw": 0})["score_raw"] for i in all_ids]
    d_norm = dict(zip(all_ids, minmax(d_scores)))
    s_norm = dict(zip(all_ids, minmax(s_scores)))
    fused = []
    for i in all_ids:
        sc = alpha * d_norm[i] + (1 - alpha) * s_norm[i]
        meta = dense_map.get(i) or sparse_map.get(i)
        fused.append({**meta, "score_norm": sc})
    # deterministic tiebreak: score desc, then section_id, then snippet_id
    fused.sort(key=lambda x: (-x["score_norm"], x["section_id"], x["snippet_id"]))
    return fused[:k]

LCEL style outline

# 1) run dense and sparse branches with the same analyzer policy name in metadata
# 2) project to citation payload
# 3) fuse with linear or RRF
# 4) optional rerank
# 5) validate with ΔS and the traceability validator

fused = fuse_linear(dense_hits, sparse_hits, alpha=0.55, k=20)
if use_rerank:
    fused = cross_encoder_rerank(fused, model="bce-en-v1.5")
validate_citations(fused)

LlamaIndex outline

dense = vector_index.as_retriever(similarity_top_k=20).retrieve(q)
sparse = bm25_retriever.retrieve(q, top_k=50)
fused = fuse_linear(project(dense), project(sparse), alpha=0.6, k=20)
fused = optional_rerank(fused)

Knobs that actually move the needle

| Area | Knob | Defaults | Notes |
|---|---|---|---|
| Dense | metric | cosine or dot | Use cosine with normalized vectors |
| Dense | pooling | mean or cls | Keep pooling constant for write and read |
| Dense | k | 10 to 30 | Run a k sweep with ΔS probes |
| Sparse | analyzer | lowercase, ascii fold | Keep parity with dense preprocessing |
| Sparse | BM25 k1 | 0.9 to 1.6 | Higher k1 favors term frequency |
| Sparse | BM25 b | 0.3 to 0.8 | Lower b reduces length normalization |
| Fusion | α | 0.4 to 0.6 | Start at 0.55 for dense-heavy corpora |
| Fusion | RRF k | 60 | Larger k reduces early-rank dominance |
| Rerank | model | bce or cohere or e5 | Pick one, keep the seed fixed |
| Window | pre, post | 80 to 160 chars | Match across all sources |

ΔS and λ probes for hybrid

  1. Run dense only, record ΔS and λ.
  2. Run sparse only, record ΔS and λ.
  3. Run fused. If fused ΔS is worse than both singles, the fusion is wrong.
  4. If λ flips between paraphrases only for the fused case, add rerank and fix tiebreakers.

Use the helper → deltaS_probes.md
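The four probes can be wrapped in one comparison helper. Here delta_s is an assumed stand-in for your ΔS probe function, not a real API:

```python
def fusion_probe(question, dense_hits, sparse_hits, fused_hits, delta_s):
    # delta_s(question, hits) -> float, lower is better
    d = delta_s(question, dense_hits)
    s = delta_s(question, sparse_hits)
    f = delta_s(question, fused_hits)
    return {
        "dense": d,
        "sparse": s,
        "fused": f,
        # if fused is worse than both singles, the fusion itself is wrong
        "fusion_regressed": f > max(d, s),
    }
```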


Typical failure modes and exact fixes

  • Fused hits look plausible but citations jump sections.
    Add a granularity fence and forbid cross-section reuse. Open: retrieval-traceability.md

  • Dense dominates even with α near 0.5.
    Normalize per source, then apply a square root to the dense scores. Open: retrieval-playbook.md

  • Sparse collapses on HyDE queries.
    Send the same rewritten query to both branches, or disable HyDE for sparse. Open: query_parsing_split.md

  • Unstable final order between runs.
    Add a cross-encoder rerank and a deterministic tiebreak. Open: rerankers.md

  • High similarity but wrong meaning.
    Align the analyzer and metric, then rebuild with the correct pooling. Open: embedding-vs-semantic.md


Evaluation recipe

  1. Create a gold set with positive and confusing negative sections.
  2. Run dense only, sparse only, and fused with identical k.
  3. Chart ΔS and coverage across three paraphrases.
  4. Choose α or RRF k that improves ΔS by at least 0.05 vs best single.
  5. Lock the configuration. Add a regression gate in CI.
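Step 5 can be enforced with a small gate function. A minimal sketch, assuming you log per-run ΔS and coverage into a results dict; the thresholds mirror the 0.05 improvement rule and the acceptance targets at the top of this page:

```python
def regression_gate(results, min_improvement=0.05, max_delta_s=0.45, min_coverage=0.70):
    # results: {"dense": {"delta_s": ...}, "sparse": {"delta_s": ...},
    #           "fused": {"delta_s": ..., "coverage": ...}}
    best_single = min(results["dense"]["delta_s"], results["sparse"]["delta_s"])
    fused = results["fused"]
    checks = {
        "delta_s_target": fused["delta_s"] <= max_delta_s,
        "coverage_target": fused["coverage"] >= min_coverage,
        "beats_best_single": best_single - fused["delta_s"] >= min_improvement,
    }
    return all(checks.values()), checks
```

Fail the CI job whenever the first return value is False, and log the checks dict so the failing target is visible.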

Copy paste test prompt

You have TXTOS and the WFGY Problem Map loaded.

My hybrid plan:
- dense = {store, metric, pooling, k}
- sparse = {analyzer, k1, b, k}
- fusion = {method: linear or rrf, alpha or rrf_k}
- rerank = {model, seed}

Return:
1) whether analyzer and window parity hold,
2) the minimal steps to normalize scores,
3) the recommended α or RRF k given the probe results,
4) a JSON checklist I can paste into CI to keep it stable.

🔗 Quick-Start Downloads (60 sec)

| Tool | Link | 3-Step Setup |
|---|---|---|
| WFGY 1.0 PDF | Engine Paper | 1 Download · 2 Upload to your LLM · 3 Ask “Answer using WFGY + <your question>” |
| TXT OS (plain-text OS) | TXTOS.txt | 1 Download · 2 Paste into any LLM chat · 3 Type “hello world” — OS boots instantly |

🧭 Explore More

| Module | Description | Link |
|---|---|---|
| WFGY Core | WFGY 2.0 engine is live: full symbolic reasoning architecture and math stack | View → |
| Problem Map 1.0 | Initial 16-mode diagnostic and symbolic fix framework | View → |
| Problem Map 2.0 | RAG-focused failure tree, modular fixes, and pipelines | View → |
| Semantic Clinic Index | Expanded failure catalog: prompt injection, memory bugs, logic drift | View → |
| Semantic Blueprint | Layer-based symbolic reasoning & semantic modulations | View → |
| Benchmark vs GPT-5 | Stress test GPT-5 with full WFGY reasoning suite | View → |

👑 Early Stargazers: See the Hall of Fame — Engineers, hackers, and open source builders who supported WFGY from day one.

GitHub stars WFGY Engine 2.0 is already unlocked. Star the repo to help others discover it and unlock more on the Unlock Board.

WFGY Main

TXT OS

Blow