Language & Locale · Global Fix Map
Stabilize multilingual RAG and reasoning across CJK, RTL, Indic, Latin, emoji, and locale variants.
This hub pinpoints language-layer failures and routes you to the exact structural fix. No infrastructure change required.
What this page is
- A compact language-aware repair guide for retrieval → ranking → reasoning.
- Structural fixes with measurable acceptance targets.
- Store-agnostic. Works with FAISS, Redis, pgvector, Elastic, Weaviate, Milvus, and more.
When to use
- Corpus spans CJK or Indic scripts and retrieval keeps missing the correct section.
- Queries code-switch or mix scripts, and top-k order drifts across runs.
- Accents/diacritics or fullwidth/halfwidth forms break matching or citations.
- RTL punctuation or control chars flip token order or offsets.
- Token counts jump after deploy even though data did not change.
Open these first
- Visual recovery map → rag-architecture-and-recovery.md
- Retrieval knobs end-to-end → retrieval-playbook.md
- Traceability and snippet schema → retrieval-traceability.md · data-contracts.md
- Embedding vs meaning → embedding-vs-semantic.md
- Metric and normalization → metric_mismatch.md · normalization_and_scaling.md
- OCR confusables and hyphens → OCR_Parsing README
Quick routes to per-page guides
| Topic | Page |
|---|---|
| Tokenizer mismatch across languages | tokenizer_mismatch.md |
| Script mixing in a single query | script_mixing.md |
| Locale drift and analyzer skew | locale_drift.md |
| Unicode normalization policy | unicode_normalization.md |
| CJK segmentation and word-break | cjk_segmentation_wordbreak.md |
| Fullwidth vs halfwidth, punctuation variants | digits_width_punctuation.md |
| Diacritics folding rules | diacritics_and_folding.md |
| RTL and bidi control characters | rtl_bidi_control.md |
| Transliteration and romanization | transliteration_and_romanization.md |
| Collation and stable sort keys | locale_collation_and_sorting.md |
| Numbering systems and sort orders | numbering_and_sort_orders.md |
| Date and time format variants | date_time_format_variants.md |
| Time zones and DST stability | timezones_and_dst.md |
| Keyboard IMEs and composition | keyboard_input_methods.md |
| Input language switching guards | input_language_switching.md |
| Emoji, ZWJ, grapheme clusters | emoji_zwj_grapheme_clusters.md |
| Mixed-locale metadata fields | mixed_locale_metadata.md |
Acceptance targets
- ΔS(question, retrieved) ≤ 0.45 on three paraphrases
- Coverage of target section ≥ 0.70
- λ remains convergent across two seeds
- Tokenization variance for the same query ≤ 12% across environments
- Normalization pass rate for NFKC + width + diacritics ≥ 0.98
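
The targets above do not say how tokenization variance is computed. As a minimal sketch, assuming it means the relative spread of token counts for one query across environments, `(max - min) / min`, a check could look like this. The function names and the environment labels are hypothetical.

```python
from __future__ import annotations


def token_count_spread(counts: dict[str, int]) -> float:
    """Relative spread of token counts across environments: (max - min) / min.

    Assumption: this is one plausible reading of "tokenization variance";
    the page itself does not define the formula.
    """
    values = list(counts.values())
    if not values or min(values) <= 0:
        raise ValueError("need at least one positive token count")
    return (max(values) - min(values)) / min(values)


def within_target(counts: dict[str, int], limit: float = 0.12) -> bool:
    """True when the spread stays at or under the 12% acceptance target."""
    return token_count_spread(counts) <= limit


if __name__ == "__main__":
    # Token counts for the same query, measured in two hypothetical environments.
    print(within_target({"staging": 100, "prod": 112}))
    print(within_target({"staging": 100, "prod": 120}))
```

Run the same query through each environment's tokenizer, record the counts, and fail the deploy check when `within_target` returns `False`.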
Fix in 60 seconds
- Normalize once, up front → Apply NFKC, collapse fullwidth/halfwidth, unify diacritics.
- Match tokenizer and analyzer → Same segmenter for CJK/Indic across embed + store analyzers.
- Stabilize mixed-script queries → Detect code-switch, split per script, rerank deterministically.
- Verify → ΔS ≤ 0.45, Coverage ≥ 0.70, λ convergent across two seeds.
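
The "normalize once, up front" step can be sketched with Python's standard `unicodedata` module. This is an illustration under assumptions, not the policy from unicode_normalization.md: NFKC handles the fullwidth/halfwidth collapse, and diacritics are folded only on Latin base letters, because stripping every combining mark would damage scripts such as Japanese kana or Indic text. The function names are hypothetical.

```python
import unicodedata


def fold_latin_diacritics(text: str) -> str:
    """Drop combining marks that follow a Latin base letter; keep all others."""
    kept = []
    base_is_latin = False
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch):
            if base_is_latin:
                continue  # mark attached to a Latin letter: fold it away
        else:
            base_is_latin = unicodedata.name(ch, "").startswith("LATIN")
        kept.append(ch)
    # Recompose whatever marks were kept (for example kana voicing marks).
    return unicodedata.normalize("NFC", "".join(kept))


def normalize_text(text: str, fold_diacritics: bool = True) -> str:
    """NFKC first (collapses width variants), then optional Latin folding."""
    text = unicodedata.normalize("NFKC", text)
    return fold_latin_diacritics(text) if fold_diacritics else text


if __name__ == "__main__":
    # Fullwidth "Caf", precomposed e-acute, fullwidth digits.
    print(normalize_text("\uff23\uff41\uff46\u00e9 \uff11\uff12\uff13"))
```

Apply the same function to documents at index time and to queries at search time; folding is lossy, so keep the original text alongside the normalized form if citations must quote it exactly.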
FAQ (Beginner-Friendly)
Q1: Why do answers break when I mix English and Chinese in one query?
A: Tokenizers and store analyzers often treat each script differently. Without alignment, Chinese words get split incorrectly and English tokens dominate. Fix with script_mixing.md and tokenizer_mismatch.md.
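
To illustrate the "split per script" idea, here is a rough sketch that breaks a query into runs of one script each. It is an approximation under assumptions: Python's standard library has no script property, so the script is guessed from the Unicode character name prefix, and spaces, digits, and punctuation are attached to the neighboring run. The function names and the prefix table are hypothetical and cover only a few scripts.

```python
import unicodedata

# Character-name prefix -> coarse script label (illustrative, not exhaustive).
_PREFIXES = (
    ("CJK", "HAN"),
    ("HIRAGANA", "KANA"),
    ("KATAKANA", "KANA"),
    ("HANGUL", "HANGUL"),
    ("LATIN", "LATIN"),
    ("CYRILLIC", "CYRILLIC"),
    ("ARABIC", "ARABIC"),
    ("HEBREW", "HEBREW"),
    ("DEVANAGARI", "DEVANAGARI"),
)


def script_of(ch: str) -> str:
    """Guess a coarse script label from the Unicode name of one character."""
    name = unicodedata.name(ch, "")
    for prefix, script in _PREFIXES:
        if name.startswith(prefix):
            return script
    return "COMMON"  # spaces, digits, punctuation, unknown


def split_by_script(text: str) -> list:
    """Split text into (script, substring) runs, gluing COMMON chars to a neighbor."""
    runs = []
    for ch in text:
        script = script_of(ch)
        if runs and (script == "COMMON" or runs[-1][0] in (script, "COMMON")):
            prev_script, prev_text = runs[-1]
            merged = script if prev_script == "COMMON" else prev_script
            runs[-1] = (merged, prev_text + ch)
        else:
            runs.append((script, ch))
    return runs


if __name__ == "__main__":
    for script, chunk in split_by_script("reset \u5bc6\u7801 on iPhone"):
        print(script, repr(chunk))
```

Each run can then be sent to the segmenter that matches its script before the results are merged and reranked.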
Q2: What does “locale drift” mean?
A: Locale drift happens when environments use different analyzers (e.g., zh_TW vs zh_CN), so the same query splits differently. See locale_drift.md.
Q3: Why do “identical-looking” characters not match?
A: They may differ in width (fullwidth vs halfwidth), normalization form (NFKC vs NFD), or diacritics. Always apply the rules in unicode_normalization.md and digits_width_punctuation.md.
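
A quick way to see why two look-alike strings fail to match is to list their code points and compare them under each normalization form. The helper names below are hypothetical; only the standard `unicodedata` module is used.

```python
import unicodedata


def describe(text: str) -> list:
    """List each character as 'U+XXXX NAME' so hidden differences become visible."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch, 'UNNAMED')}" for ch in text]


def explain_mismatch(a: str, b: str) -> dict:
    """Report whether two strings are equal raw, after NFC, and after NFKC."""
    def norm(form: str, s: str) -> str:
        return unicodedata.normalize(form, s)

    return {
        "equal_raw": a == b,
        "equal_nfc": norm("NFC", a) == norm("NFC", b),
        "equal_nfkc": norm("NFKC", a) == norm("NFKC", b),
    }


if __name__ == "__main__":
    composed, decomposed = "\u00e9", "e\u0301"  # precomposed vs e + combining acute
    print(describe(decomposed))
    print(explain_mismatch(composed, decomposed))
    print(explain_mismatch("\uff21", "A"))  # fullwidth A vs ASCII A
```

If two strings become equal only under NFKC, the difference is a compatibility variant such as width; if they become equal under NFC, it is a composed versus decomposed form.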
Q4: How do I handle Arabic or Hebrew text?
A: RTL text often carries invisible bidi control characters that can flip token order or shift offsets. See rtl_bidi_control.md.
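
As a minimal sketch of detecting those invisible characters, the code below reports and removes the Unicode bidi formatting characters (the marks, embeddings, overrides, and isolates). Whether to strip them or only flag them is a policy choice that this sketch does not settle; the function names are hypothetical.

```python
# Bidi formatting characters: ALM, LRM, RLM, LRE, RLE, PDF, LRO, RLO,
# LRI, RLI, FSI, PDI.
BIDI_CONTROLS = frozenset(
    "\u061c\u200e\u200f"
    "\u202a\u202b\u202c\u202d\u202e"
    "\u2066\u2067\u2068\u2069"
)


def find_bidi_controls(text: str) -> list:
    """Return (index, 'U+XXXX') for every bidi control character in text."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if ch in BIDI_CONTROLS
    ]


def strip_bidi_controls(text: str) -> str:
    """Remove bidi control characters; note that this shifts character offsets."""
    return "".join(ch for ch in text if ch not in BIDI_CONTROLS)


if __name__ == "__main__":
    sample = "\u202b\u05e9\u05dc\u05d5\u05dd\u202c abc"  # RLE + Hebrew word + PDF
    print(find_bidi_controls(sample))
    print(len(sample), len(strip_bidi_controls(sample)))
```

Because stripping changes offsets, do it before chunking and indexing so that stored citation offsets refer to the cleaned text.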
Q5: Do I need different embeddings for each language?
A: No. You can combine multilingual embeddings with deterministic normalization and alias fields. Only if that fails, fall back to translation bridges.
Q6: How do I debug when results change between environments?
A: Compare tokenizer version, analyzer settings, normalization passes, and collation rules. Document them in data-contracts.md.
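
One way to make that comparison repeatable is to capture a small fingerprint in each environment and diff the two. The sketch below records only what the Python standard library can report (interpreter version, Unicode database version, collation locale); the tokenizer and analyzer entries are passed in by the caller, and their keys are hypothetical examples.

```python
import locale
import sys
import unicodedata


def environment_fingerprint(extra=None) -> dict:
    """Collect language-layer settings that commonly differ between environments."""
    fingerprint = {
        "python": sys.version.split()[0],
        "unicode_data": unicodedata.unidata_version,
        # Calling setlocale with no second argument only queries the setting.
        "collation_locale": locale.setlocale(locale.LC_COLLATE),
    }
    # Caller-supplied entries, e.g. tokenizer version or analyzer name.
    fingerprint.update(extra or {})
    return fingerprint


def diff_fingerprints(a: dict, b: dict) -> dict:
    """Return {key: (value_in_a, value_in_b)} for every key whose values differ."""
    return {
        key: (a.get(key), b.get(key))
        for key in sorted(set(a) | set(b))
        if a.get(key) != b.get(key)
    }


if __name__ == "__main__":
    staging = {"tokenizer": "1.0", "analyzer": "zh_TW", "normalization": "NFKC"}
    prod = {"tokenizer": "1.0", "analyzer": "zh_CN", "normalization": "NFKC"}
    print(diff_fingerprints(staging, prod))
```

Storing the fingerprint next to each index build gives you a record to diff when results change after a deploy.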
🔗 Quick-Start Downloads (60 sec)
| Tool | Link | 3-Step Setup |
|---|---|---|
| WFGY 1.0 PDF | Engine Paper | 1️⃣ Download · 2️⃣ Upload to your LLM · 3️⃣ Ask “Answer using WFGY + <your question>” |
| TXT OS (plain-text OS) | TXTOS.txt | 1️⃣ Download · 2️⃣ Paste into any LLM chat · 3️⃣ Type “hello world” — OS boots instantly |
🧭 Explore More
| Module | Description | Link |
|---|---|---|
| WFGY Core | WFGY 2.0 engine is live: full symbolic reasoning architecture and math stack | View → |
| Problem Map 1.0 | Initial 16-mode diagnostic and symbolic fix framework | View → |
| Problem Map 2.0 | RAG-focused failure tree, modular fixes, and pipelines | View → |
| Semantic Clinic Index | Expanded failure catalog: prompt injection, memory bugs, logic drift | View → |
| Semantic Blueprint | Layer-based symbolic reasoning & semantic modulations | View → |
| Benchmark vs GPT-5 | Stress test GPT-5 with full WFGY reasoning suite | View → |
| 🧙‍♂️ Starter Village 🏡 | New here? Lost in symbols? Click here and let the wizard guide you through | Start → |
👑 Early Stargazers: See the Hall of Fame.
Engineers, hackers, and open source builders who supported WFGY from day one.
⭐ WFGY Engine 2.0 is already unlocked. ⭐ Star the repo to help others discover it and unlock more on the Unlock Board.