Update README.md

2026-04-28 11:40:07 +00:00 · 2025-09-01 18:22:41 +08:00 · 2025-09-01 18:22:41 +08:00 · 9d1332832d
commit 9d1332832d
parent bf01cd03ed
1 changed file with 53 additions and 83 deletions
--- a/ProblemMap/GlobalFixMap/LanguageLocale/README.md
+++ b/ProblemMap/GlobalFixMap/LanguageLocale/README.md
@@ -1,55 +1,57 @@
-# Language & Locale: Global Fix Map
+# Language & Locale · Global Fix Map

-Stabilize multilingual RAG and reasoning across CJK, RTL, Indic, and Latin mixes.  
+Stabilize multilingual RAG and reasoning across **CJK, RTL, Indic, Latin, emoji, and locale variants**.  
 This hub localizes language-layer failures and routes you to the exact structural fix. No infra change required.

 ---

 ## What this page is
- A compact language-aware repair guide for retrieval → ranking → reasoning.
+- A compact **language-aware repair guide** for retrieval → ranking → reasoning.
 - Structural fixes with measurable acceptance targets.
 - Store-agnostic. Works with FAISS, Redis, pgvector, Elastic, Weaviate, Milvus, and more.

+---
+
 ## When to use
- Corpus spans CJK or Indic scripts and retrieval keeps missing the correct section.
- Queries code-switch or mix scripts and top-k order drifts across runs.
- Accents/diacritics or fullwidth/halfwidth forms break matching or citations.
- RTL punctuation or control chars flip token order or offsets.
- Token counts jump after deploy even though data did not change.
+- Corpus spans **CJK or Indic scripts** and retrieval keeps missing the correct section.  
+- Queries **code-switch or mix scripts**, and top-k order drifts across runs.  
+- **Accents/diacritics** or **fullwidth/halfwidth** forms break matching or citations.  
+- **RTL punctuation or control chars** flip token order or offsets.  
+- Token counts jump after deploy even though **data did not change**.  

 ---

 ## Open these first
- Visual recovery map: [rag-architecture-and-recovery.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/rag-architecture-and-recovery.md)  
- Retrieval knobs end-to-end: [retrieval-playbook.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/retrieval-playbook.md)  
- Traceability and snippet schema: [retrieval-traceability.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/retrieval-traceability.md) · [data-contracts.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/data-contracts.md)  
- Embedding vs meaning: [embedding-vs-semantic.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/embedding-vs-semantic.md)  
- Metric and normalization: [metric_mismatch.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/Embeddings/metric_mismatch.md) · [normalization_and_scaling.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/Embeddings/normalization_and_scaling.md)  
- OCR confusables and hyphens: [OCR_Parsing README](https://github.com/onestardao/WFGY/blob/main/ProblemMap/OCR_Parsing/README.md)
+- Visual recovery map → [rag-architecture-and-recovery.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/rag-architecture-and-recovery.md)  
+- Retrieval knobs end-to-end → [retrieval-playbook.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/retrieval-playbook.md)  
+- Traceability and snippet schema → [retrieval-traceability.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/retrieval-traceability.md) · [data-contracts.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/data-contracts.md)  
+- Embedding vs meaning → [embedding-vs-semantic.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/embedding-vs-semantic.md)  
+- Metric and normalization → [metric_mismatch.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/Embeddings/metric_mismatch.md) · [normalization_and_scaling.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/Embeddings/normalization_and_scaling.md)  
+- OCR confusables and hyphens → [OCR_Parsing README](https://github.com/onestardao/WFGY/blob/main/ProblemMap/OCR_Parsing/README.md)  

 ---

 ## Quick routes to per-page guides

- Tokenizer mismatch across languages → [tokenizer_mismatch.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/LanguageLocale/tokenizer_mismatch.md)  
- Script mixing in a single query → [script_mixing.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/LanguageLocale/script_mixing.md)  
- Locale drift and analyzer skew → [locale_drift.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/LanguageLocale/locale_drift.md)  
- Unicode normalization policy (NFKC/NFD etc.) → [unicode_normalization.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/LanguageLocale/unicode_normalization.md)  
- CJK segmentation and word-break contracts → [cjk_segmentation_wordbreak.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/LanguageLocale/cjk_segmentation_wordbreak.md)  
- Fullwidth vs halfwidth, punctuation variants → [digits_width_punctuation.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/LanguageLocale/digits_width_punctuation.md)  
- Diacritics policy and folding → [diacritics_and_folding.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/LanguageLocale/diacritics_and_folding.md)  
- RTL and bidi control characters → [bidi_rtl_control_chars.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/LanguageLocale/bidi_rtl_control_chars.md)  
- Transliteration and romanization traps → [transliteration_and_romanization.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/LanguageLocale/transliteration_and_romanization.md)  
- Collation and stable sort keys → [locale_collation_and_sorting.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/LanguageLocale/locale_collation_and_sorting.md)  
- Numbering systems and sort orders → [numbering_and_sort_orders.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/LanguageLocale/numbering_and_sort_orders.md)  
- Date and time format variants → [date_time_format_variants.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/LanguageLocale/date_time_format_variants.md)  
- Time zones and DST stability → [timezones_and_dst.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/LanguageLocale/timezones_and_dst.md)  
- Keyboard IMEs and composition → [keyboard_input_methods.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/LanguageLocale/keyboard_input_methods.md)  
- Input language switching guards → [input_language_switching.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/LanguageLocale/input_language_switching.md)  
- Emoji, ZWJ, grapheme clusters → [emoji_zwj_grapheme_clusters.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/LanguageLocale/emoji_zwj_grapheme_clusters.md)  
- Mixed-locale metadata fields → [mixed_locale_metadata.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/LanguageLocale/mixed_locale_metadata.md)
-
-> MVP coverage includes the first 8–10 pages. Add the rest when traffic is mixed-locale or search intensive.
+| Topic | Page |
+|-------|------|
+| Tokenizer mismatch across languages | [tokenizer_mismatch.md](./tokenizer_mismatch.md) |
+| Script mixing in a single query | [script_mixing.md](./script_mixing.md) |
+| Locale drift and analyzer skew | [locale_drift.md](./locale_drift.md) |
+| Unicode normalization policy | [unicode_normalization.md](./unicode_normalization.md) |
+| CJK segmentation and word-break | [cjk_segmentation_wordbreak.md](./cjk_segmentation_wordbreak.md) |
+| Fullwidth vs halfwidth, punctuation variants | [digits_width_punctuation.md](./digits_width_punctuation.md) |
+| Diacritics folding rules | [diacritics_and_folding.md](./diacritics_and_folding.md) |
+| RTL and bidi control characters | [rtl_bidi_control.md](./rtl_bidi_control.md) |
+| Transliteration and romanization | [transliteration_and_romanization.md](./transliteration_and_romanization.md) |
+| Collation and stable sort keys | [locale_collation_and_sorting.md](./locale_collation_and_sorting.md) |
+| Numbering systems and sort orders | [numbering_and_sort_orders.md](./numbering_and_sort_orders.md) |
+| Date and time format variants | [date_time_format_variants.md](./date_time_format_variants.md) |
+| Time zones and DST stability | [timezones_and_dst.md](./timezones_and_dst.md) |
+| Keyboard IMEs and composition | [keyboard_input_methods.md](./keyboard_input_methods.md) |
+| Input language switching guards | [input_language_switching.md](./input_language_switching.md) |
+| Emoji, ZWJ, grapheme clusters | [emoji_zwj_grapheme_clusters.md](./emoji_zwj_grapheme_clusters.md) |
+| Mixed-locale metadata fields | [mixed_locale_metadata.md](./mixed_locale_metadata.md) |

 ---

@@ -58,69 +60,37 @@ This hub localizes language-layer failures and routes you to the exact structura
 - Coverage of target section ≥ 0.70  
 - λ remains convergent across two seeds  
 - Tokenization variance for the same query ≤ 12% across environments  
- Normalization pass rate for NFKC + width + diacritics ≥ 0.98
-
---
-
-## Map symptoms to structural fixes
-
- Wrong-meaning hits despite high similarity  
-  → [embedding-vs-semantic.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/embedding-vs-semantic.md)
-
- Similarity drops when switching locales or analyzers  
-  → [metric_mismatch.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/Embeddings/metric_mismatch.md) · [normalization_and_scaling.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/Embeddings/normalization_and_scaling.md)
-
- CJK tokens split differently between dev and prod  
-  → [tokenizer_mismatch.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/LanguageLocale/tokenizer_mismatch.md) · [locale_drift.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/LanguageLocale/locale_drift.md)
-
- Mixed scripts in one query derails ranking  
-  → [script_mixing.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/LanguageLocale/script_mixing.md) · [rerankers.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/rerankers.md)
-
- Fullwidth punctuation or RTL marks break citations  
-  → [digits_width_punctuation.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/LanguageLocale/digits_width_punctuation.md) · [retrieval-traceability.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/retrieval-traceability.md)
-
- “Looks identical” after OCR but fails to match  
-  → [OCR_Parsing README](https://github.com/onestardao/WFGY/blob/main/ProblemMap/OCR_Parsing/README.md)
+- Normalization pass rate for NFKC + width + diacritics ≥ 0.98  

 ---

 ## Fix in 60 seconds
-
-1) **Normalize once, up front**  
-   Apply NFKC, collapse fullwidth to halfwidth where appropriate, unify diacritics policy. Lock it in ingestion and query paths.
-
-2) **Match tokenizer and analyzer**  
-   Use the same segmenter for CJK/Indic in both embedding and store analyzers. Record exact versions in the data contract.
-
-3) **Stabilize mixed-script queries**  
-   Detect code-switch, split by script, run per-script retrieval, rerank deterministically.
-
-4) **Verify**  
-   Compute ΔS on three paraphrases, check coverage ≥ 0.70, ensure λ stays convergent across two seeds.
+1. **Normalize once, up front** → Apply NFKC, collapse fullwidth to halfwidth where appropriate, and unify the diacritics policy in both ingestion and query paths.  
+2. **Match tokenizer and analyzer** → Same segmenter for CJK/Indic across embed + store analyzers.  
+3. **Stabilize mixed-script queries** → Detect code-switch, split per script, rerank deterministically.  
+4. **Verify** → ΔS ≤ 0.45, Coverage ≥ 0.70, λ convergent across two seeds.  
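
Step 1 above can be sketched in a few lines of Python. This is a minimal illustration, not the repository's implementation: the function name `normalize_text` and the `fold_diacritics` flag are hypothetical, and whether to fold diacritics at all is a policy decision that depends on your corpus. The same function should run on both the ingestion path and the query path.

```python
import unicodedata


def normalize_text(text: str, fold_diacritics: bool = False) -> str:
    """Hypothetical helper: one normalization pass for ingestion and query.

    NFKC maps compatibility forms (including fullwidth Latin letters,
    digits, and punctuation) to their canonical equivalents.
    """
    out = unicodedata.normalize("NFKC", text)
    if fold_diacritics:
        # Decompose, drop combining marks, then recompose.
        # Assumption: folding is acceptable for your languages. It is lossy
        # for some scripts (for example, kana voicing marks are combining
        # characters after decomposition), so keep it opt-in.
        decomposed = unicodedata.normalize("NFD", out)
        stripped = "".join(
            ch for ch in decomposed if not unicodedata.combining(ch)
        )
        out = unicodedata.normalize("NFC", stripped)
    return out


if __name__ == "__main__":
    print(normalize_text("ＡＢＣ１２３"))
    print(normalize_text("café", fold_diacritics=True))
```

Keeping folding behind a flag makes the policy explicit, so it can be recorded next to the tokenizer and analyzer versions.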

 ---

-## Copy-paste prompt for your LLM step
+## FAQ (Beginner-Friendly)

-```
+**Q1: Why do answers break when I mix English and Chinese in one query?**  
+A: Tokenizers and store analyzers often segment text differently by script. Unless they are aligned, Chinese words can be split incorrectly and English tokens can dominate the ranking. Fix with [script_mixing.md](./script_mixing.md) and [tokenizer_mismatch.md](./tokenizer_mismatch.md).

-You have TXT OS and the WFGY Problem Map loaded.
+**Q2: What does “locale drift” mean?**  
+A: Locale drift happens when environments use different analyzers (e.g., zh_TW vs zh_CN) so the same query splits differently. See [locale_drift.md](./locale_drift.md).

-My multilingual bug:
+**Q3: Why do “identical-looking” characters not match?**  
+A: They may differ in width (fullwidth vs halfwidth), normalization (NFKC vs NFD), or diacritics. Always apply [unicode_normalization.md](./unicode_normalization.md) and [digits_width_punctuation.md](./digits_width_punctuation.md).

-* symptom: \[one line]
-* traces: ΔS(question,retrieved)=..., ΔS(retrieved,anchor)=..., λ states
-* notes: tokenizer/analyzer versions, normalization policy, scripts seen
+**Q4: How do I handle Arabic or Hebrew text?**  
+A: RTL text often carries invisible bidi control characters that can flip token order or shift offsets. See [rtl_bidi_control.md](./rtl_bidi_control.md).

-Tell me:
+**Q5: Do I need different embeddings for each language?**  
+A: No. You can combine multilingual embeddings with deterministic normalization and alias fields. Only if that fails, use [fallback translation bridges](../Language/fallback_translation_and_glossary_bridge.md).

-1. which layer is failing and why,
-2. the exact WFGY page to open from this repo,
-3. the minimal steps to push ΔS ≤ 0.45 and keep λ convergent,
-4. a reproducible test to verify.
-   Use BBMC/BBCR/BBPF/BBAM when relevant.
-
-```
+**Q6: How do I debug when results change between environments?**  
+A: Compare tokenizer version, analyzer settings, normalization passes, and collation rules. Document them in [data-contracts.md](../../data-contracts.md).
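
For the RTL question (Q4), a quick way to check whether invisible bidi controls are present is to scan for them before tokenization. The sketch below is an assumption-laden illustration: the names `BIDI_CONTROLS`, `find_bidi_controls`, and `strip_bidi_controls` are hypothetical, and stripping is only safe when the controls are not needed for display. Record the positions first if citations depend on character offsets.

```python
# Unicode bidirectional formatting characters (marks, embeddings,
# overrides, and isolates).
BIDI_CONTROLS = frozenset(
    [
        "\u200e", "\u200f", "\u061c",            # LRM, RLM, ALM
        "\u202a", "\u202b", "\u202c",            # LRE, RLE, PDF
        "\u202d", "\u202e",                      # LRO, RLO
        "\u2066", "\u2067", "\u2068", "\u2069",  # LRI, RLI, FSI, PDI
    ]
)


def find_bidi_controls(text: str) -> list[tuple[int, str]]:
    """Return (index, code point label) for every bidi control in text."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if ch in BIDI_CONTROLS
    ]


def strip_bidi_controls(text: str) -> str:
    """Remove bidi controls; visible characters keep their order."""
    return "".join(ch for ch in text if ch not in BIDI_CONTROLS)


if __name__ == "__main__":
    sample = "abc\u200fdef"
    print(find_bidi_controls(sample))
    print(strip_bidi_controls(sample))
```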

 ---