Language & Multilingual · Global Fix Map
A compact hub to stabilize cross-lingual retrieval and reasoning.
Use this folder when your corpus or queries include CJK, RTL, Indic, Cyrillic, accented Latin, or frequent code-switching. No infra change required.
Orientation — pages and what they solve
| Page | What it solves | Typical symptom |
|---|---|---|
| tokenizer_mismatch.md | Locks tokenization and segmentation for CJK/Thai/Indic | High similarity but low recall on CJK/Thai; broken tokens |
| script_mixing.md | Keeps analyzers from splitting a single query that mixes scripts | Mixed Latin+CJK queries under-recall or flip |
| locale_drift.md | Normalization for width/accents/variants (Hans↔Hant) | zh-Hans/zh-Hant never co-retrieve; accent variants miss |
| multilingual_guide.md | End-to-end recipes and acceptance targets | Unsure where drift comes from across languages |
| proper_noun_aliases.md | Alias shield for names, brands, products | Proper nouns oscillate across spellings |
| romanization_transliteration.md | Romanization pairs and transliteration consistency | Inconsistent transliteration causes misses |
| query_language_detection.md | Stable language detection contract | Detection flips per run; routing becomes random |
| query_routing_and_analyzers.md | Routes analyzers per language and keeps parity with the index | Search and index behave differently |
| hybrid_ranking_multilingual.md | Deterministic hybrid rerank across languages | Multilingual ranking unstable; hybrid underperforms a single retriever |
| stopword_and_morphology_controls.md | Clamp stopwords/lemmatizers to protect meaning | Negations/particles vanish; unit words lost |
| fallback_translation_and_glossary_bridge.md | Controlled translation bridge with glossary | Local path ΔS stays high; glossary needed |
| code_switching_eval.md | Bilingual & code-switch eval sets + checks | Cannot prove multilingual stability before ship |
When to use this folder
- High similarity yet wrong meaning on bilingual or mixed-script corpora
- Citations point to the wrong section after translating the question
- Hybrid retrievers underperform a single retriever across languages
- Index looks healthy while coverage stays low for non-Latin scripts
- Names flip between native, transliteration, and English aliases
- zh-Hans and zh-Hant never co-retrieve; Thai recall drops for no apparent reason
Acceptance targets
- ΔS(question, retrieved) ≤ 0.45 across language variants
- Coverage ≥ 0.70 to the intended section after repair
- λ_observe convergent across 3 paraphrases and 2 seeds
- E_resonance flat on long windows that mix scripts
- Citation fields complete; alias noise does not leak into evidence
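The two numeric targets above can be wired into a simple gate. The sketch below is only an illustration: it assumes ΔS and coverage have already been measured per language variant (how they are computed is not defined on this page), and the names `VariantResult` and `failing_variants` are hypothetical.

```python
from dataclasses import dataclass

# Thresholds copied from the acceptance targets above.
MAX_DELTA_S = 0.45
MIN_COVERAGE = 0.70


@dataclass(frozen=True)
class VariantResult:
    """Measured values for one language variant of a question (hypothetical shape)."""
    variant: str     # e.g. "zh-Hans", "zh-Hant", "en"
    delta_s: float   # ΔS(question, retrieved), computed elsewhere
    coverage: float  # coverage of the intended section, computed elsewhere


def failing_variants(results):
    """Return the variants that miss either threshold, in input order."""
    return [
        r.variant
        for r in results
        if r.delta_s > MAX_DELTA_S or r.coverage < MIN_COVERAGE
    ]
```

A run passes only when `failing_variants` returns an empty list for every paraphrase and seed; the λ_observe and E_resonance checks are not covered by this sketch.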
Map symptoms → structural fixes
| Symptom | Likely cause | Open this |
|---|---|---|
| High similarity yet wrong meaning | Embedding model is not multilingual, or normalization differs between index and query | embedding-vs-semantic.md |
| Citations jump sections after translation | Snippet schema too loose | data-contracts.md · retrieval-traceability.md |
| zh-Hans and zh-Hant never co-retrieve | Variant mapping and width rules missing | locale_drift.md |
| Thai or CJK recall collapses | Tokenizer mismatch or missing segmenter | tokenizer_mismatch.md |
| Mixed Latin + CJK query under-recalls | Analyzer split across scripts | script_mixing.md |
| Hybrid worse than single | Query parsing split or mis-weighted rerank | patterns/pattern_query_parsing_split.md · rerankers.md |
| Proper nouns oscillate | Missing alias fields and entity shield | proper_noun_aliases.md |
| Transliteration inconsistency | Romanization rules not aligned | romanization_transliteration.md |
| Language detection drifts | Detection contract weak or unlocked | query_language_detection.md |
| Search vs index disagree | Analyzer routing error | query_routing_and_analyzers.md |
| Ranking unstable across languages | Mono-lingual reranker or unaligned features | hybrid_ranking_multilingual.md |
| Negations/particles vanish | Stopword or morphology too aggressive | stopword_and_morphology_controls.md |
| Persistent high ΔS on local path | Need glossary-backed translation bridge | fallback_translation_and_glossary_bridge.md |
Fix in 60 seconds
- Detect language
  Emit stable language + confidence. If unstable, fix detection first.
  → query_language_detection.md
- Lock normalization and analyzers
  Keep locale, width, accents, and segmentation identical on write/read.
  → locale_drift.md · query_routing_and_analyzers.md
- Protect entities and syntax
  Alias fields and romanization pairs; clamp stopwords/morphology for negations and units.
  → proper_noun_aliases.md · romanization_transliteration.md · stopword_and_morphology_controls.md
- Stabilize ranking
  Use multilingual or dual-track rerank with deterministic ordering.
  → hybrid_ranking_multilingual.md
- Translation bridge only if needed
  Pair with a glossary and keep native path as default.
  → fallback_translation_and_glossary_bridge.md
- Verify
  With bilingual & code-switch sets confirm ΔS ≤ 0.45, Coverage ≥ 0.70, λ convergent.
  → code_switching_eval.md
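The "Stabilize ranking" step asks for deterministic ordering. One way to get it, sketched below under stated assumptions, is to fuse the two score tracks with fixed weights, round the fused score so floating-point noise cannot reorder near-ties, and break exact ties on the document id. The function name `hybrid_rank`, the equal weights, and the assumption that both score sets are already scaled to [0, 1] are illustrative and not taken from hybrid_ranking_multilingual.md.

```python
def hybrid_rank(dense, lexical, w_dense=0.5, w_lex=0.5):
    """Fuse two score dicts (doc_id -> score in [0, 1]) into one stable ranking.

    Documents missing from one track contribute 0.0 for that track.
    """
    doc_ids = set(dense) | set(lexical)
    scored = [
        (round(w_dense * dense.get(d, 0.0) + w_lex * lexical.get(d, 0.0), 6), d)
        for d in doc_ids
    ]
    # Sort by fused score (descending), then by doc id (ascending) so that
    # ties always resolve the same way regardless of set iteration order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [d for _, d in scored]
```

Because the tie-break is explicit, repeated runs over the same scores return the same order, which makes ranking changes across language variants attributable to the scores rather than to iteration order.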
Store-agnostic quick recipes
- Normalize corpus and queries the same way before embedding or indexing
- CJK and Thai need segmentation or bigrams; keep entity fields as unanalyzed keyword fields
- If no multilingual embeddings are available, add a lexical sidecar and align features with a deterministic rerank
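As a rough illustration of the first two recipes, the sketch below applies one normalization function to both corpus text and queries and emits overlapping bigrams for CJK runs. It uses only the Python standard library. The helper names `normalize` and `tokenize` are hypothetical, the CJK test covers only the main unified-ideograph block, and Thai (which needs a dictionary-based segmenter), accent stripping, and Hans↔Hant mapping are out of scope here.

```python
import re
import unicodedata

# Assumption: the main CJK Unified Ideographs block is enough for a demo.
_CJK = "\u4e00-\u9fff"
# A token is either a run of CJK ideographs or a run of other word characters.
_TOKEN_RE = re.compile(rf"[{_CJK}]+|[^\W_{_CJK}]+")


def normalize(text: str) -> str:
    """Shared write/read normalization: NFKC folds full-width forms, casefold drops case."""
    return unicodedata.normalize("NFKC", text).casefold().strip()


def tokenize(text: str) -> list[str]:
    """Normalize, then emit bigrams for CJK runs and whole tokens for everything else."""
    out: list[str] = []
    for run in _TOKEN_RE.findall(normalize(text)):
        if "\u4e00" <= run[0] <= "\u9fff" and len(run) > 1:
            # Overlapping bigrams stand in for a real segmenter.
            out.extend(run[i:i + 2] for i in range(len(run) - 1))
        else:
            out.append(run)
    return out
```

Calling `tokenize` on both the indexed text and the incoming query keeps the write and read paths identical, which is the point of the first recipe; swapping in a real segmenter only requires changing the CJK branch.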
🔗 Quick-Start Downloads (60 sec)
| Tool | Link | 3-Step Setup |
|---|---|---|
| WFGY 1.0 PDF | Engine Paper | 1️⃣ Download · 2️⃣ Upload to your LLM · 3️⃣ Ask “Answer using WFGY + <your question>” |
| TXT OS (plain-text OS) | TXTOS.txt | 1️⃣ Download · 2️⃣ Paste into any LLM chat · 3️⃣ Type “hello world” — OS boots instantly |
🧭 Explore More
| Module | Description | Link |
|---|---|---|
| WFGY Core | WFGY 2.0 engine is live: full symbolic reasoning architecture and math stack | View → |
| Problem Map 1.0 | Initial 16-mode diagnostic and symbolic fix framework | View → |
| Problem Map 2.0 | RAG-focused failure tree, modular fixes, and pipelines | View → |
| Semantic Clinic Index | Expanded failure catalog: prompt injection, memory bugs, logic drift | View → |
| Semantic Blueprint | Layer-based symbolic reasoning & semantic modulations | View → |
| Benchmark vs GPT-5 | Stress test GPT-5 with full WFGY reasoning suite | View → |
| 🧙‍♂️ Starter Village 🏡 | New here? Lost in symbols? Click here and let the wizard guide you through | Start → |