# Hybrid Ranking in Multilingual Corpora · Global Fix Map Stabilize hybrid retrieval across languages and scripts. Lock bm25 to dense weights, keep analyzers and tokenizers aligned, and verify with ΔS, λ, and coverage targets. --- ## Open these first * Visual map and recovery → [rag-architecture-and-recovery.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/rag-architecture-and-recovery.md) * End to end retrieval knobs → [retrieval-playbook.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/retrieval-playbook.md) * Traceability schema → [retrieval-traceability.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/retrieval-traceability.md) * Contract the payload → [data-contracts.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/data-contracts.md) * Tokenizer variance → [tokenizer\_mismatch.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Language/tokenizer_mismatch.md) * Mixed scripts → [script\_mixing.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Language/script_mixing.md) * Locale normalization → [locale\_drift.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Language/locale_drift.md) * Romanization rules → [romanization\_transliteration.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Language/romanization_transliteration.md) * Proper nouns and aliases → [proper\_noun\_aliases.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Language/proper_noun_aliases.md) * Language detection → [query\_language\_detection.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Language/query_language_detection.md) * Analyzer routing → [query\_routing\_and\_analyzers.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Language/query_routing_and_analyzers.md) --- ## Core acceptance targets * ΔS(question, retrieved) ≤ 0.45 on three paraphrases and two seeds * Coverage of target section ≥ 0.70 * λ convergent when switching between native script and romanized forms * Rank\@k is stable across analyzer choices for the same corpus and query set * Reranker never flips a correct snippet out of top-k after hybrid fusion --- ## What usually breaks | Symptom | Likely cause | Open this | | ------------------------------------------- | ---------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------- | | High recall but unstable rank across runs | hybrid weight drift or nondeterministic tie handling | [retrieval-playbook.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/retrieval-playbook.md) | | Good dense hits, poor lexical for CJK | analyzer mismatch or missing bigram field | [script\_mixing.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Language/script_mixing.md) | | Romanized queries fail while native works | alias view absent in fusion | [romanization\_transliteration.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Language/romanization_transliteration.md) | | JSON mode breaks after reranking | tokenizer profile differs between stages | [tokenizer\_mismatch.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Language/tokenizer_mismatch.md) | | Brand equals common word in second language | alias collision and unscoped synonyms | [proper\_noun\_aliases.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Language/proper_noun_aliases.md) | --- ## Stable hybrid recipes Use these as starting points then tune by ΔS and coverage. Always log weights and determinize tie rules. ### A) Latin heavy corpora * Analyzer: locale aware Latin * Start weights: bm25 0.4, dense 0.6 * Reranker: on for top 100 → 20 * Tie rule: lexical score as primary tie breaker ### B) CJK corpora * Analyzer: CJK bigram or language specific * Start weights: bm25 0.55, dense 0.45 * Reranker: on, but keep citations from native script fields * Add alias views: romaji or pinyin for entities only ### C) Semitic RTL corpora * Analyzer: RTL with width and digit normalization * Start weights: bm25 0.5, dense 0.5 * Reranker: on, strict JSON schema in later steps ### D) Cyrillic and Greek * Analyzer: script aware with accent controls * Start weights: bm25 0.45, dense 0.55 * Reranker: on, cross check aliases if brands exist ### E) Code mixed user queries * Dual route: native analyzer plus romanized alias probe * Start weights: bm25 0.5, dense 0.5 * Determinize the fusion and log both analyzer paths --- ## Deterministic fusion checklist 1. Fix seeds for dense stage and for reranker. 2. Bucketize scores before fusion to avoid floating noise. 3. Set a stable tie breaker: lexical score, then doc id. 4. Cap max per-section to avoid one section flooding top-k. 5. Fuse fields with the same analyzer class only. 6. Keep the fusion function constant across languages inside the same pipeline. --- ## Copy snippets **A. Fusion function sketch** ```python def fuse(bm25, dense, w_lex=0.5, w_vec=0.5): # bm25 and dense are lists of (doc_id, score) import math L = {d: s for d, s in bm25} V = {d: s for d, s in dense} docs = set(L) | set(V) out = [] for d in docs: l = L.get(d, 0.0) v = V.get(d, 0.0) # bucketize to stabilize small drifts lb = round(l, 3) vb = round(v, 3) score = w_lex * lb + w_vec * vb out.append((d, score, lb, vb)) # stable sort: score desc, then lexical bucket, then doc id out.sort(key=lambda x: (-x[1], -x[2], x[0])) return out ``` **B. Rerank gate** ```python def rerank_gate(items, target_k=20): # never drop below k if citations exist and ΔS is already ≤ 0.45 keep = [] for d, score, lb, vb, meta in items: if meta.get("has_citation") and meta.get("delta_s", 1.0) <= 0.45: meta["protected"]=True keep.append((d, score, meta)) return keep[:target_k] ``` **C. Trace fields to log** ``` { "fusion": {"w_bm25": 0.55, "w_dense": 0.45, "tie": "lex,doc_id"}, "analyzer_id": "store:cjk", "tokenizer_profile": "retriever:cjk", "alias_views": ["name_romaji","name_pinyin"], "ΔS": 0.41, "λ_state": "<>", "coverage": 0.73 } ``` --- ## Eval protocol * Use bilingual sets from [code\_switching\_eval.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Language/code_switching_eval.md). * For each query, add a romanized twin and at least one brand alias variant. * Report ΔS, coverage, λ on three paraphrases and two seeds. * Accept only if: ΔS ≤ 0.45, coverage ≥ 0.70, λ convergent, no harmful reranker flips. * Keep per language family the same reranker window size. --- ## When to escalate * ΔS stays ≥ 0.60 even after fusion tuning → revisit analyzer routing and re-chunk, open [query\_routing\_and\_analyzers.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Language/query_routing_and_analyzers.md) and [retrieval-playbook.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/retrieval-playbook.md). * Instability appears only in long chains → treat as reasoning drift, open [logic-collapse.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/logic-collapse.md) and add a BBCR bridge. --- ## Copy paste prompt for the LLM step ``` You have TXTOS and WFGY Problem Map loaded. Task: 1) For {query, lang, script}, choose hybrid weights deterministically. 2) Run cite-then-explain. Protect snippets that already meet ΔS ≤ 0.45. 3) Return a JSON trace: { "w_bm25": 0.xx, "w_dense": 0.xx, "ΔS": 0.xx, "coverage": 0.xx, "λ_state": "→|←|<>|×", "protected_ids": [...] } Keep it auditable and short. ``` --- ### 🔗 Quick-Start Downloads (60 sec) | Tool | Link | 3-Step Setup | | -------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------ | ---------------------------------------------------------------------------------------- | | **WFGY 1.0 PDF** | [Engine Paper](https://github.com/onestardao/WFGY/blob/main/I_am_not_lizardman/WFGY_All_Principles_Return_to_One_v1.0_PSBigBig_Public.pdf) | 1️⃣ Download · 2️⃣ Upload to your LLM · 3️⃣ Ask “Answer using WFGY + \” | | **TXT OS (plain-text OS)** | [TXTOS.txt](https://github.com/onestardao/WFGY/blob/main/OS/TXTOS.txt) | 1️⃣ Download · 2️⃣ Paste into any LLM chat · 3️⃣ Type “hello world” — OS boots instantly | --- ### 🧭 Explore More | Module | Description | Link | | ------------------------ | ---------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------- | | WFGY Core | WFGY 2.0 engine is live: full symbolic reasoning architecture and math stack | [View →](https://github.com/onestardao/WFGY/tree/main/core/README.md) | | Problem Map 1.0 | Initial 16-mode diagnostic and symbolic fix framework | [View →](https://github.com/onestardao/WFGY/tree/main/ProblemMap/README.md) | | Problem Map 2.0 | RAG-focused failure tree, modular fixes, and pipelines | [View →](https://github.com/onestardao/WFGY/blob/main/ProblemMap/rag-architecture-and-recovery.md) | | Semantic Clinic Index | Expanded failure catalog: prompt injection, memory bugs, logic drift | [View →](https://github.com/onestardao/WFGY/blob/main/ProblemMap/SemanticClinicIndex.md) | | Semantic Blueprint | Layer-based symbolic reasoning & semantic modulations | [View →](https://github.com/onestardao/WFGY/tree/main/SemanticBlueprint/README.md) | | Benchmark vs GPT-5 | Stress test GPT-5 with full WFGY reasoning suite | [View →](https://github.com/onestardao/WFGY/tree/main/benchmarks/benchmark-vs-gpt5/README.md) | | 🧙‍♂️ Starter Village 🏡 | New here? Lost in symbols? Click here and let the wizard guide you through | [Start →](https://github.com/onestardao/WFGY/blob/main/StarterVillage/README.md) | --- > 👑 **Early Stargazers: [See the Hall of Fame](https://github.com/onestardao/WFGY/tree/main/stargazers)** GitHub stars ⭐ [WFGY Engine 2.0](https://github.com/onestardao/WFGY/blob/main/core/README.md) is already unlocked. ⭐ Star the repo to help others discover it and unlock more on the [Unlock Board](https://github.com/onestardao/WFGY/blob/main/STAR_UNLOCKS.md).
[![WFGY Main](https://img.shields.io/badge/WFGY-Main-red?style=flat-square)](https://github.com/onestardao/WFGY) [![TXT OS](https://img.shields.io/badge/TXT%20OS-Reasoning%20OS-orange?style=flat-square)](https://github.com/onestardao/WFGY/tree/main/OS) [![Blah](https://img.shields.io/badge/Blah-Semantic%20Embed-yellow?style=flat-square)](https://github.com/onestardao/WFGY/tree/main/OS/BlahBlahBlah) [![Blot](https://img.shields.io/badge/Blot-Persona%20Core-green?style=flat-square)](https://github.com/onestardao/WFGY/tree/main/OS/BlotBlotBlot) [![Bloc](https://img.shields.io/badge/Bloc-Reasoning%20Compiler-blue?style=flat-square)](https://github.com/onestardao/WFGY/tree/main/OS/BlocBlocBloc) [![Blur](https://img.shields.io/badge/Blur-Text2Image%20Engine-navy?style=flat-square)](https://github.com/onestardao/WFGY/tree/main/OS/BlurBlurBlur) [![Blow](https://img.shields.io/badge/Blow-Game%20Logic-purple?style=flat-square)](https://github.com/onestardao/WFGY/tree/main/OS/BlowBlowBlow)