Language & Locale · Global Fix Map

Stabilize multilingual RAG and reasoning across CJK, RTL, Indic, Latin, emoji, and locale variants.
This hub pinpoints language-layer failures and routes you to the matching structural fix. No infrastructure change is required.


What this page is

  • A compact language-aware repair guide for retrieval → ranking → reasoning.
  • Structural fixes with measurable acceptance targets.
  • Store-agnostic. Works with FAISS, Redis, pgvector, Elastic, Weaviate, Milvus, and more.

When to use

  • Corpus spans CJK or Indic scripts and retrieval keeps missing the correct section.
  • Queries code-switch or mix scripts, and top-k order drifts across runs.
  • Accents/diacritics or fullwidth/halfwidth forms break matching or citations.
  • RTL punctuation or control chars flip token order or offsets.
  • Token counts jump after deploy even though data did not change.

Open these first


Quick routes to per-page guides

| Topic | Page |
|---|---|
| Tokenizer mismatch across languages | tokenizer_mismatch.md |
| Script mixing in a single query | script_mixing.md |
| Locale drift and analyzer skew | locale_drift.md |
| Unicode normalization policy | unicode_normalization.md |
| CJK segmentation and word-break | cjk_segmentation_wordbreak.md |
| Fullwidth vs halfwidth, punctuation variants | digits_width_punctuation.md |
| Diacritics folding rules | diacritics_and_folding.md |
| RTL and bidi control characters | rtl_bidi_control.md |
| Transliteration and romanization | transliteration_and_romanization.md |
| Collation and stable sort keys | locale_collation_and_sorting.md |
| Numbering systems and sort orders | numbering_and_sort_orders.md |
| Date and time format variants | date_time_format_variants.md |
| Time zones and DST stability | timezones_and_dst.md |
| Keyboard IMEs and composition | keyboard_input_methods.md |
| Input language switching guards | input_language_switching.md |
| Emoji, ZWJ, grapheme clusters | emoji_zwj_grapheme_clusters.md |
| Mixed-locale metadata fields | mixed_locale_metadata.md |

Acceptance targets

  • ΔS(question, retrieved) ≤ 0.45 on three paraphrases
  • Coverage of target section ≥ 0.70
  • λ remains convergent across two seeds
  • Tokenization variance for the same query ≤ 12% across environments
  • Normalization pass rate for NFKC + width + diacritics ≥ 0.98
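
The last two targets can be checked with a few lines of code. The sketch below is only one possible reading: this page does not define how tokenization variance or normalization pass rate are computed, so both formulas are assumptions, as are the function names. Variance is taken as the relative spread of token counts, and the pass rate counts strings that NFKC leaves unchanged (the width and diacritics checks are not covered here).

```python
import unicodedata

def tokenization_variance(token_counts: dict[str, int]) -> float:
    """Relative spread of token counts for one query across environments.

    Assumed definition: (max - min) / min. The target above would then
    read as tokenization_variance(...) <= 0.12.
    """
    counts = list(token_counts.values())
    if not counts or min(counts) <= 0:
        raise ValueError("need at least one positive token count")
    return (max(counts) - min(counts)) / min(counts)

def normalization_pass_rate(texts: list[str]) -> float:
    """Share of strings that are already in NFKC form.

    Assumed definition: a string passes when NFKC leaves it unchanged.
    """
    if not texts:
        raise ValueError("need at least one string")
    passed = sum(1 for t in texts if unicodedata.normalize("NFKC", t) == t)
    return passed / len(texts)

# Hypothetical token counts reported by two environments for the same query.
variance = tokenization_variance({"dev": 100, "prod": 110})
rate = normalization_pass_rate(["abc", "ＡＢＣ", "caf\u00e9"])
```

With these inputs the variance is 0.10, inside the 12% target, while the pass rate is 2/3 because the fullwidth string changes under NFKC.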

Fix in 60 seconds

  1. Normalize once, up front → Apply NFKC, collapse fullwidth/halfwidth, unify diacritics.
  2. Match tokenizer and analyzer → Same segmenter for CJK/Indic across embed + store analyzers.
  3. Stabilize mixed-script queries → Detect code-switch, split per script, rerank deterministically.
  4. Verify → ΔS ≤ 0.45, Coverage ≥ 0.70, λ convergent across two seeds.
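
Step 1 can be sketched with Python's standard `unicodedata` module. This is a minimal illustration, not the policy from unicode_normalization.md: the function names are hypothetical, and restricting diacritic folding to Latin base letters is an assumption made so that Japanese voiced marks and Indic combining signs survive.

```python
import unicodedata

def fold_latin_diacritics(text: str) -> str:
    """Drop combining marks that follow a Latin base letter.

    Marks on non-Latin bases are kept, because removing them would
    change the letter itself in scripts such as Japanese kana.
    """
    out = []
    base_is_latin = False
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch):
            if base_is_latin:
                continue  # drop the accent on a Latin letter
        else:
            base_is_latin = unicodedata.name(ch, "").startswith("LATIN")
        out.append(ch)
    return unicodedata.normalize("NFC", "".join(out))

def normalize_text(text: str, fold: bool = True) -> str:
    """Normalize once, up front: NFKC first, then optional folding."""
    # NFKC maps fullwidth ASCII and halfwidth katakana to their standard forms.
    text = unicodedata.normalize("NFKC", text)
    return fold_latin_diacritics(text) if fold else text

print(normalize_text("ＡＢＣ１２３"))   # fullwidth letters and digits
print(normalize_text("cafe\u0301"))   # "e" followed by a combining acute accent
```

Running the same function at both index time and query time is what keeps the two sides comparable. Letters without a canonical decomposition, such as "ø" or "ß", pass through unchanged and would need an explicit mapping.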

FAQ (Beginner-Friendly)

Q1: Why do answers break when I mix English and Chinese in one query?
A: Embedding tokenizers and store analyzers often segment text differently per script. Without alignment, Chinese words get split incorrectly and English tokens dominate. Fix with script_mixing.md and tokenizer_mismatch.md.
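
As a rough illustration of splitting a mixed-script query and ordering results deterministically, the sketch below uses only the standard library. It is not the method from script_mixing.md: the script label is approximated by the first word of each character's Unicode name, and `split_by_script` and `rerank` are hypothetical helper names.

```python
import unicodedata

def script_of(ch: str) -> str:
    """Approximate script label taken from the Unicode character name."""
    if ch.isdigit() or not ch.isalnum():
        return "COMMON"  # digits, spaces, punctuation
    name = unicodedata.name(ch, "")
    return name.split(" ")[0] if name else "UNKNOWN"

def split_by_script(text: str) -> list[tuple[str, str]]:
    """Split text into (script, run) pairs; COMMON chars join a neighbor."""
    runs: list[tuple[str, str]] = []
    for ch in text:
        script = script_of(ch)
        if runs and (script == "COMMON" or script == runs[-1][0]):
            runs[-1] = (runs[-1][0], runs[-1][1] + ch)
        elif runs and runs[-1][0] == "COMMON":
            # Leading punctuation or spaces adopt the first real script.
            runs[-1] = (script, runs[-1][1] + ch)
        else:
            runs.append((script, ch))
    return [(s, t.strip()) for s, t in runs if t.strip()]

def rerank(hits: list[tuple[str, float]]) -> list[tuple[str, float]]:
    """Sort by score descending, breaking ties by document id."""
    return sorted(hits, key=lambda hit: (-hit[1], hit[0]))

parts = split_by_script("reset 密码 password")
ordered = rerank([("b", 0.9), ("a", 0.9), ("c", 0.95)])
```

Each run can then go through the tokenizer suited to its script. The tie-break on document id is what keeps the top-k order stable when two hits share a score.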

Q2: What does “locale drift” mean?
A: Locale drift happens when environments use different analyzers (e.g., zh_TW vs zh_CN) so the same query splits differently. See locale_drift.md.

Q3: Why do “identical-looking” characters not match?
A: They may differ in width (fullwidth vs halfwidth), normalization form (NFKC vs NFD), or diacritics. Always apply the policies in unicode_normalization.md and digits_width_punctuation.md.
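
A quick way to confirm this kind of mismatch is to print the code points behind two strings that look the same. The helper names below are hypothetical, and the sketch is for diagnosis only, not a fix.

```python
import unicodedata

def describe(text: str) -> list[str]:
    """List each character as 'U+XXXX NAME'."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}" for ch in text]

def first_difference(a: str, b: str):
    """Return (index, char_in_a, char_in_b) at the first mismatch, else None."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i, describe(x)[0], describe(y)[0]
    if len(a) != len(b):
        # One string is a prefix of the other.
        return min(len(a), len(b)), None, None
    return None

# A halfwidth "A" next to a fullwidth "Ａ": similar on screen, different code points.
print(first_difference("A1", "Ａ1"))
```

For this pair the helper reports U+0041 LATIN CAPITAL LETTER A against U+FF21 FULLWIDTH LATIN CAPITAL LETTER A at index 0.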

Q4: How do I handle Arabic or Hebrew text?
A: RTL text often carries invisible bidi control characters that can flip token order or shift offsets. See rtl_bidi_control.md.
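
The sketch below shows one way to detect and strip these characters before indexing. The set covers the Unicode bidi marks, embeddings, overrides, and isolates; the function names are hypothetical, and whether stripping is right for your corpus is a judgment call, since the controls can matter for display.

```python
# Bidi control characters: marks, embeddings/overrides, and isolates.
BIDI_CONTROLS = frozenset(
    "\u061c"                          # ARABIC LETTER MARK
    "\u200e\u200f"                    # LRM, RLM
    "\u202a\u202b\u202c\u202d\u202e"  # LRE, RLE, PDF, LRO, RLO
    "\u2066\u2067\u2068\u2069"        # LRI, RLI, FSI, PDI
)

def find_bidi_controls(text: str) -> list[tuple[int, str]]:
    """Return (offset, 'U+XXXX') for every bidi control in the text."""
    return [(i, f"U+{ord(ch):04X}") for i, ch in enumerate(text) if ch in BIDI_CONTROLS]

def strip_bidi_controls(text: str) -> str:
    """Remove bidi controls so token order and offsets stay stable."""
    return "".join(ch for ch in text if ch not in BIDI_CONTROLS)

raw = "\u202bשלום\u202c"          # Hebrew word wrapped in RLE ... PDF
print(find_bidi_controls(raw))
print(len(raw), len(strip_bidi_controls(raw)))
```

Keeping the raw string for display and indexing only the stripped copy avoids changing how the text renders. Citation offsets should then be computed against the stripped copy.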

Q5: Do I need different embeddings for each language?
A: Usually not. You can combine multilingual embeddings with deterministic normalization and alias fields. Use fallback translation bridges only if that fails.

Q6: How do I debug when results change between environments?
A: Compare tokenizer version, analyzer settings, normalization passes, and collation rules. Document them in data-contracts.md.
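
One way to make that comparison repeatable is to record a small fingerprint per environment and diff the two. The field names and the shape of the record are assumptions for illustration, not a format defined by data-contracts.md. Tokenizer and analyzer details are passed in by the caller because they depend on your stack.

```python
import locale
import platform
import unicodedata

def environment_fingerprint(tokenizer: str, tokenizer_version: str,
                            analyzer: str, normalization: str = "NFKC") -> dict:
    """Collect the settings that commonly cause cross-environment drift."""
    return {
        "python": platform.python_version(),
        "unicode_data": unicodedata.unidata_version,  # Unicode database version
        "locale": locale.getlocale()[0],
        "tokenizer": tokenizer,
        "tokenizer_version": tokenizer_version,
        "analyzer": analyzer,
        "normalization": normalization,
    }

def diff_fingerprints(a: dict, b: dict) -> dict:
    """Return {field: (value_in_a, value_in_b)} for fields that differ."""
    keys = sorted(set(a) | set(b))
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}

# Hypothetical values: in practice each record is produced in its own environment.
dev = environment_fingerprint("example-tokenizer", "1.2.0", "zh_TW")
prod = dict(dev, tokenizer_version="1.3.0", analyzer="zh_CN")
print(diff_fingerprints(dev, prod))
```

An empty diff means the recorded settings match. Any non-empty field is a candidate cause for changed token counts or result order.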


🔗 Quick-Start Downloads (60 sec)

| Tool | Link | 3-Step Setup |
|---|---|---|
| WFGY 1.0 PDF | Engine Paper | 1. Download · 2. Upload to your LLM · 3. Ask “Answer using WFGY + `<your question>`” |
| TXT OS (plain-text OS) | TXTOS.txt | 1. Download · 2. Paste into any LLM chat · 3. Type “hello world” and the OS boots instantly |

Explore More

| Layer | Page | What it's for |
|---|---|---|
| Proof | WFGY Recognition Map | External citations, integrations, and ecosystem proof |
| ⚙️ Engine | WFGY 1.0 | Original PDF tension engine and early logic sketch (legacy reference) |
| ⚙️ Engine | WFGY 2.0 | Production tension kernel for RAG and agent systems |
| ⚙️ Engine | WFGY 3.0 | TXT-based Singularity tension engine (131 S class set) |
| 🗺️ Map | Problem Map 1.0 | Flagship 16-problem RAG failure taxonomy and fix map |
| 🗺️ Map | Problem Map 2.0 | Global Debug Card for RAG and agent pipeline diagnosis |
| 🗺️ Map | Problem Map 3.0 | Global AI troubleshooting atlas and failure pattern map |
| 🧰 App | TXT OS | .txt semantic OS with fast bootstrap |
| 🧰 App | Blah Blah Blah | Abstract and paradox Q&A built on TXT OS |
| 🧰 App | Blur Blur Blur | Text-to-image generation with semantic control |
| 🏡 Onboarding | Starter Village | Guided entry point for new users |

If this repository helped, starring it improves discovery so more builders can find the docs and tools.