| .. | ||
| cjk_segmentation_wordbreak.md | ||
| date_time_format_variants.md | ||
| diacritics_and_folding.md | ||
| digits_width_punctuation.md | ||
| emoji_zwj_grapheme_clusters.md | ||
| input_language_switching.md | ||
| keyboard_input_methods.md | ||
| locale_collation_and_sorting.md | ||
| locale_drift.md | ||
| mixed_locale_metadata.md | ||
| numbering_and_sort_orders.md | ||
| README.md | ||
| rtl_bidi_control.md | ||
| script_mixing.md | ||
| timezones_and_dst.md | ||
| tokenizer_mismatch.md | ||
| transliteration_and_romanization.md | ||
| unicode_normalization.md | ||
Language & Locale: Global Fix Map
Stabilize multilingual RAG and reasoning across CJK, RTL, Indic, and Latin mixes.
This hub localizes language-layer failures and routes you to the exact structural fix. No infra change required.
What this page is
- A compact language-aware repair guide for retrieval → ranking → reasoning.
- Structural fixes with measurable acceptance targets.
- Store-agnostic. Works with FAISS, Redis, pgvector, Elastic, Weaviate, Milvus, and more.
When to use
- Corpus spans CJK or Indic scripts and retrieval keeps missing the correct section.
- Queries code-switch or mix scripts and top-k order drifts across runs.
- Accents/diacritics or fullwidth/halfwidth forms break matching or citations.
- RTL punctuation or control chars flip token order or offsets.
- Token counts jump after deploy even though data did not change.
Open these first
- Visual recovery map: rag-architecture-and-recovery.md
- Retrieval knobs end-to-end: retrieval-playbook.md
- Traceability and snippet schema: retrieval-traceability.md · data-contracts.md
- Embedding vs meaning: embedding-vs-semantic.md
- Metric and normalization: metric_mismatch.md · normalization_and_scaling.md
- OCR confusables and hyphens: OCR_Parsing README
Quick routes to per-page guides
- Tokenizer mismatch across languages → tokenizer_mismatch.md
- Script mixing in a single query → script_mixing.md
- Locale drift and analyzer skew → locale_drift.md
- Unicode normalization policy (NFKC/NFD etc.) → unicode_normalization.md
- CJK segmentation and word-break contracts → cjk_segmentation_wordbreak.md
- Fullwidth vs halfwidth, punctuation variants → digits_width_punctuation.md
- Diacritics policy and folding → diacritics_and_folding.md
- RTL and bidi control characters → bidi_rtl_control_chars.md
- Transliteration and romanization traps → transliteration_and_romanization.md
- Collation and stable sort keys → locale_collation_and_sorting.md
- Numbering systems and sort orders → numbering_and_sort_orders.md
- Date and time format variants → date_time_format_variants.md
- Time zones and DST stability → timezones_and_dst.md
- Keyboard IMEs and composition → keyboard_input_methods.md
- Input language switching guards → input_language_switching.md
- Emoji, ZWJ, grapheme clusters → emoji_zwj_grapheme_clusters.md
- Mixed-locale metadata fields → mixed_locale_metadata.md
MVP coverage includes the first 8–10 pages. Add the rest when traffic is mixed-locale or search intensive.
Acceptance targets
- ΔS(question, retrieved) ≤ 0.45 on three paraphrases
- Coverage of target section ≥ 0.70
- λ remains convergent across two seeds
- Tokenization variance for the same query ≤ 12% across environments
- Normalization pass rate for NFKC + width + diacritics ≥ 0.98
Map symptoms to structural fixes
-
Wrong-meaning hits despite high similarity
→ embedding-vs-semantic.md -
Similarity drops when switching locales or analyzers
→ metric_mismatch.md · normalization_and_scaling.md -
CJK tokens split differently between dev and prod
→ tokenizer_mismatch.md · locale_drift.md -
Mixed scripts in one query derails ranking
→ script_mixing.md · rerankers.md -
Fullwidth punctuation or RTL marks break citations
→ digits_width_punctuation.md · retrieval-traceability.md -
“Looks identical” after OCR but fails to match
→ OCR_Parsing README
Fix in 60 seconds
-
Normalize once, up front
Apply NFKC, collapse fullwidth to halfwidth where appropriate, unify diacritics policy. Lock it in ingestion and query paths. -
Match tokenizer and analyzer
Use the same segmenter for CJK/Indic in both embedding and store analyzers. Record exact versions in the data contract. -
Stabilize mixed-script queries
Detect code-switch, split by script, run per-script retrieval, rerank deterministically. -
Verify
Compute ΔS on three paraphrases, check coverage ≥ 0.70, ensure λ stays convergent across two seeds.
Copy-paste prompt for your LLM step
You have TXT OS and the WFGY Problem Map loaded.
My multilingual bug:
* symptom: \[one line]
* traces: ΔS(question,retrieved)=..., ΔS(retrieved,anchor)=..., λ states
* notes: tokenizer/analyzer versions, normalization policy, scripts seen
Tell me:
1. which layer is failing and why,
2. the exact WFGY page to open from this repo,
3. the minimal steps to push ΔS ≤ 0.45 and keep λ convergent,
4. a reproducible test to verify.
Use BBMC/BBCR/BBPF/BBAM when relevant.
🔗 Quick-Start Downloads (60 sec)
| Tool | Link | 3-Step Setup |
|---|---|---|
| WFGY 1.0 PDF | Engine Paper | 1️⃣ Download · 2️⃣ Upload to your LLM · 3️⃣ Ask “Answer using WFGY + <your question>” |
| TXT OS (plain-text OS) | TXTOS.txt | 1️⃣ Download · 2️⃣ Paste into any LLM chat · 3️⃣ Type “hello world” — OS boots instantly |
🧭 Explore More
| Module | Description | Link |
|---|---|---|
| WFGY Core | WFGY 2.0 engine is live: full symbolic reasoning architecture and math stack | View → |
| Problem Map 1.0 | Initial 16-mode diagnostic and symbolic fix framework | View → |
| Problem Map 2.0 | RAG-focused failure tree, modular fixes, and pipelines | View → |
| Semantic Clinic Index | Expanded failure catalog: prompt injection, memory bugs, logic drift | View → |
| Semantic Blueprint | Layer-based symbolic reasoning & semantic modulations | View → |
| Benchmark vs GPT-5 | Stress test GPT-5 with full WFGY reasoning suite | View → |
| 🧙♂️ Starter Village 🏡 | New here? Lost in symbols? Click here and let the wizard guide you through | Start → |
👑 Early Stargazers: See the Hall of Fame —
Engineers, hackers, and open source builders who supported WFGY from day one.
⭐ WFGY Engine 2.0 is already unlocked. ⭐ Star the repo to help others discover it and unlock more on the Unlock Board.