Update README.md

PSBigBig 2025-09-01 18:22:41 +08:00 committed by GitHub
parent bf01cd03ed
commit 9d1332832d

@@ -1,55 +1,57 @@
# Language & Locale: Global Fix Map
# Language & Locale · Global Fix Map
Stabilize multilingual RAG and reasoning across CJK, RTL, Indic, and Latin mixes.
Stabilize multilingual RAG and reasoning across **CJK, RTL, Indic, Latin, emoji, and locale variants**.
This hub localizes language-layer failures and routes you to the exact structural fix. No infra change required.
---
## What this page is
- A compact language-aware repair guide for retrieval → ranking → reasoning.
- A compact **language-aware repair guide** for retrieval → ranking → reasoning.
- Structural fixes with measurable acceptance targets.
- Store-agnostic. Works with FAISS, Redis, pgvector, Elastic, Weaviate, Milvus, and more.
---
## When to use
- Corpus spans CJK or Indic scripts and retrieval keeps missing the correct section.
- Queries code-switch or mix scripts and top-k order drifts across runs.
- Accents/diacritics or fullwidth/halfwidth forms break matching or citations.
- RTL punctuation or control chars flip token order or offsets.
- Token counts jump after deploy even though data did not change.
- Corpus spans **CJK or Indic scripts** and retrieval keeps missing the correct section.
- Queries **code-switch or mix scripts**, and top-k order drifts across runs.
- **Accents/diacritics** or **fullwidth/halfwidth** forms break matching or citations.
- **RTL punctuation or control chars** flip token order or offsets.
- Token counts jump after deploy even though **data did not change**.
---
## Open these first
- Visual recovery map: [rag-architecture-and-recovery.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/rag-architecture-and-recovery.md)
- Retrieval knobs end-to-end: [retrieval-playbook.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/retrieval-playbook.md)
- Traceability and snippet schema: [retrieval-traceability.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/retrieval-traceability.md) · [data-contracts.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/data-contracts.md)
- Embedding vs meaning: [embedding-vs-semantic.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/embedding-vs-semantic.md)
- Metric and normalization: [metric_mismatch.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/Embeddings/metric_mismatch.md) · [normalization_and_scaling.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/Embeddings/normalization_and_scaling.md)
- OCR confusables and hyphens: [OCR_Parsing README](https://github.com/onestardao/WFGY/blob/main/ProblemMap/OCR_Parsing/README.md)
- Visual recovery map [rag-architecture-and-recovery.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/rag-architecture-and-recovery.md)
- Retrieval knobs end-to-end [retrieval-playbook.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/retrieval-playbook.md)
- Traceability and snippet schema [retrieval-traceability.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/retrieval-traceability.md) · [data-contracts.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/data-contracts.md)
- Embedding vs meaning [embedding-vs-semantic.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/embedding-vs-semantic.md)
- Metric and normalization [metric_mismatch.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/Embeddings/metric_mismatch.md) · [normalization_and_scaling.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/Embeddings/normalization_and_scaling.md)
- OCR confusables and hyphens [OCR_Parsing README](https://github.com/onestardao/WFGY/blob/main/ProblemMap/OCR_Parsing/README.md)
---
## Quick routes to per-page guides
- Tokenizer mismatch across languages → [tokenizer_mismatch.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/LanguageLocale/tokenizer_mismatch.md)
- Script mixing in a single query → [script_mixing.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/LanguageLocale/script_mixing.md)
- Locale drift and analyzer skew → [locale_drift.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/LanguageLocale/locale_drift.md)
- Unicode normalization policy (NFKC/NFD etc.) → [unicode_normalization.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/LanguageLocale/unicode_normalization.md)
- CJK segmentation and word-break contracts → [cjk_segmentation_wordbreak.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/LanguageLocale/cjk_segmentation_wordbreak.md)
- Fullwidth vs halfwidth, punctuation variants → [digits_width_punctuation.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/LanguageLocale/digits_width_punctuation.md)
- Diacritics policy and folding → [diacritics_and_folding.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/LanguageLocale/diacritics_and_folding.md)
- RTL and bidi control characters → [bidi_rtl_control_chars.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/LanguageLocale/bidi_rtl_control_chars.md)
- Transliteration and romanization traps → [transliteration_and_romanization.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/LanguageLocale/transliteration_and_romanization.md)
- Collation and stable sort keys → [locale_collation_and_sorting.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/LanguageLocale/locale_collation_and_sorting.md)
- Numbering systems and sort orders → [numbering_and_sort_orders.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/LanguageLocale/numbering_and_sort_orders.md)
- Date and time format variants → [date_time_format_variants.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/LanguageLocale/date_time_format_variants.md)
- Time zones and DST stability → [timezones_and_dst.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/LanguageLocale/timezones_and_dst.md)
- Keyboard IMEs and composition → [keyboard_input_methods.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/LanguageLocale/keyboard_input_methods.md)
- Input language switching guards → [input_language_switching.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/LanguageLocale/input_language_switching.md)
- Emoji, ZWJ, grapheme clusters → [emoji_zwj_grapheme_clusters.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/LanguageLocale/emoji_zwj_grapheme_clusters.md)
- Mixed-locale metadata fields → [mixed_locale_metadata.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/LanguageLocale/mixed_locale_metadata.md)
> MVP coverage includes the first 8-10 pages. Add the rest when traffic is mixed-locale or search intensive.
| Topic | Page |
|-------|------|
| Tokenizer mismatch across languages | [tokenizer_mismatch.md](./tokenizer_mismatch.md) |
| Script mixing in a single query | [script_mixing.md](./script_mixing.md) |
| Locale drift and analyzer skew | [locale_drift.md](./locale_drift.md) |
| Unicode normalization policy | [unicode_normalization.md](./unicode_normalization.md) |
| CJK segmentation and word-break | [cjk_segmentation_wordbreak.md](./cjk_segmentation_wordbreak.md) |
| Fullwidth vs halfwidth, punctuation variants | [digits_width_punctuation.md](./digits_width_punctuation.md) |
| Diacritics folding rules | [diacritics_and_folding.md](./diacritics_and_folding.md) |
| RTL and bidi control characters | [rtl_bidi_control.md](./rtl_bidi_control.md) |
| Transliteration and romanization | [transliteration_and_romanization.md](./transliteration_and_romanization.md) |
| Collation and stable sort keys | [locale_collation_and_sorting.md](./locale_collation_and_sorting.md) |
| Numbering systems and sort orders | [numbering_and_sort_orders.md](./numbering_and_sort_orders.md) |
| Date and time format variants | [date_time_format_variants.md](./date_time_format_variants.md) |
| Time zones and DST stability | [timezones_and_dst.md](./timezones_and_dst.md) |
| Keyboard IMEs and composition | [keyboard_input_methods.md](./keyboard_input_methods.md) |
| Input language switching guards | [input_language_switching.md](./input_language_switching.md) |
| Emoji, ZWJ, grapheme clusters | [emoji_zwj_grapheme_clusters.md](./emoji_zwj_grapheme_clusters.md) |
| Mixed-locale metadata fields | [mixed_locale_metadata.md](./mixed_locale_metadata.md) |
---
@@ -58,69 +60,37 @@ This hub localizes language-layer failures and routes you to the exact structura
- Coverage of target section ≥ 0.70
- λ remains convergent across two seeds
- Tokenization variance for the same query ≤ 12% across environments
- Normalization pass rate for NFKC + width + diacritics ≥ 0.98
---
## Map symptoms to structural fixes
- Wrong-meaning hits despite high similarity
→ [embedding-vs-semantic.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/embedding-vs-semantic.md)
- Similarity drops when switching locales or analyzers
→ [metric_mismatch.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/Embeddings/metric_mismatch.md) · [normalization_and_scaling.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/Embeddings/normalization_and_scaling.md)
- CJK tokens split differently between dev and prod
→ [tokenizer_mismatch.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/LanguageLocale/tokenizer_mismatch.md) · [locale_drift.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/LanguageLocale/locale_drift.md)
- Mixed scripts in one query derails ranking
→ [script_mixing.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/LanguageLocale/script_mixing.md) · [rerankers.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/rerankers.md)
- Fullwidth punctuation or RTL marks break citations
→ [digits_width_punctuation.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/LanguageLocale/digits_width_punctuation.md) · [retrieval-traceability.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/retrieval-traceability.md)
- “Looks identical” after OCR but fails to match
→ [OCR_Parsing README](https://github.com/onestardao/WFGY/blob/main/ProblemMap/OCR_Parsing/README.md)
- Normalization pass rate for NFKC + width + diacritics ≥ 0.98
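The page does not say how tokenization variance or normalization pass rate are computed. The Python sketch below shows one possible reading, and both definitions are assumptions: variance is taken as the relative spread of token counts for one query across environments, and pass rate as the fraction of strings already in NFKC form. The function names are hypothetical, and the NFKC check covers width folding but not a diacritics policy.

```python
import unicodedata


def token_count_variance(counts):
    """Relative spread of token counts for one query across environments.

    Assumed definition: (max - min) / max. The source page does not define it.
    """
    if not counts or max(counts) == 0:
        return 0.0
    return (max(counts) - min(counts)) / max(counts)


def nfkc_pass_rate(texts):
    """Fraction of strings that are already NFKC-normalized (assumed definition)."""
    if not texts:
        return 1.0
    passed = sum(1 for t in texts if unicodedata.is_normalized("NFKC", t))
    return passed / len(texts)


# Token counts for the same query from three hypothetical environments.
variance = token_count_variance([100, 95, 90])
# One clean string and one fullwidth string that NFKC would rewrite.
rate = nfkc_pass_rate(["abc", "\uff21\uff22\uff23"])
```

Under these assumed definitions, `variance` would be compared against the 12% target and `rate` against 0.98. `unicodedata.is_normalized` requires Python 3.8 or later.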
---
## Fix in 60 seconds
1) **Normalize once, up front**
Apply NFKC, collapse fullwidth to halfwidth where appropriate, unify diacritics policy. Lock it in ingestion and query paths.
2) **Match tokenizer and analyzer**
Use the same segmenter for CJK/Indic in both embedding and store analyzers. Record exact versions in the data contract.
3) **Stabilize mixed-script queries**
Detect code-switch, split by script, run per-script retrieval, rerank deterministically.
4) **Verify**
Compute ΔS on three paraphrases, check coverage ≥ 0.70, ensure λ stays convergent across two seeds.
1. **Normalize once, up front** → Apply NFKC, collapse fullwidth/halfwidth, unify diacritics.
2. **Match tokenizer and analyzer** → Same segmenter for CJK/Indic across embed + store analyzers.
3. **Stabilize mixed-script queries** → Detect code-switch, split per script, rerank deterministically.
4. **Verify** → ΔS ≤ 0.45, Coverage ≥ 0.70, λ convergent across two seeds.
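Step 1 can be sketched in Python with the standard `unicodedata` module. This is a minimal illustration, not the repo's implementation: `normalize_text` and its `fold_diacritics` flag are hypothetical names, and removing combining marks is only one possible diacritics policy.

```python
import unicodedata


def normalize_text(text, fold_diacritics=False):
    """Apply NFKC, then optionally remove nonspacing combining marks."""
    # NFKC folds compatibility forms, e.g. fullwidth ASCII to halfwidth
    # and halfwidth katakana to fullwidth katakana.
    out = unicodedata.normalize("NFKC", text)
    if fold_diacritics:
        # Decompose, drop nonspacing marks (category Mn), then recompose.
        decomposed = unicodedata.normalize("NFD", out)
        kept = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
        out = unicodedata.normalize("NFC", kept)
    return out


# The same function should run on both documents (ingestion) and queries.
doc_key = normalize_text("\uff21\uff22\uff23\uff11\uff12\uff13")  # fullwidth "ABC123"
query_key = normalize_text("caf\u00e9", fold_diacritics=True)
```

Dropping every `Mn` character is lossy for scripts where those marks carry meaning (Devanagari vowel signs and Arabic vowel marks fall in that category), so folding is best kept off by default and enabled per language.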
---
## Copy-paste prompt for your LLM step
```
You have TXT OS and the WFGY Problem Map loaded.
My multilingual bug:
* symptom: [one line]
* traces: ΔS(question,retrieved)=..., ΔS(retrieved,anchor)=..., λ states
* notes: tokenizer/analyzer versions, normalization policy, scripts seen
Tell me:
1. which layer is failing and why,
2. the exact WFGY page to open from this repo,
3. the minimal steps to push ΔS ≤ 0.45 and keep λ convergent,
4. a reproducible test to verify.
Use BBMC/BBCR/BBPF/BBAM when relevant.
```
## FAQ (Beginner-Friendly)
**Q1: Why do answers break when I mix English and Chinese in one query?**
A: Tokenizers and store analyzers often segment text differently by script. Without alignment, Chinese words get split incorrectly and English tokens dominate. Fix with [script_mixing.md](./script_mixing.md) and [tokenizer_mismatch.md](./tokenizer_mismatch.md).
**Q2: What does “locale drift” mean?**
A: Locale drift happens when environments use different analyzers (e.g., zh_TW vs zh_CN), so the same query splits differently. See [locale_drift.md](./locale_drift.md).
**Q3: Why do “identical-looking” characters not match?**
A: They may differ in width (fullwidth vs halfwidth), normalization (NFKC vs NFD), or diacritics. Always apply [unicode_normalization.md](./unicode_normalization.md) and [digits_width_punctuation.md](./digits_width_punctuation.md).
**Q4: How do I handle Arabic or Hebrew text?**
A: RTL text can carry invisible bidi control chars that flip token order. See [rtl_bidi_control.md](./rtl_bidi_control.md).
**Q5: Do I need different embeddings for each language?**
A: No. You can combine multilingual embeddings with deterministic normalization and alias fields. Only if that fails, use [fallback translation bridges](../Language/fallback_translation_and_glossary_bridge.md).
**Q6: How do I debug when results change between environments?**
A: Compare tokenizer version, analyzer settings, normalization passes, and collation rules. Document them in [data-contracts.md](../../data-contracts.md).
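For the invisible bidi control characters mentioned in Q4, a small Python check can report where they occur before deciding what to do with them. This is a sketch under assumptions: the helper names are hypothetical, the code-point set covers the Unicode directional marks, embeddings, overrides, and isolates, and whether to strip or merely log them is a policy choice this page does not specify.

```python
# Unicode bidi controls: LRM, RLM, ALM, LRE..RLO (incl. PDF), LRI..PDI.
BIDI_CONTROLS = {
    0x200E, 0x200F, 0x061C,
    *range(0x202A, 0x202F),
    *range(0x2066, 0x206A),
}


def find_bidi_controls(text):
    """Return (index, code point label) for each bidi control in text."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if ord(ch) in BIDI_CONTROLS
    ]


def strip_bidi_controls(text):
    """Remove bidi controls. Character offsets shift after removal."""
    return "".join(ch for ch in text if ord(ch) not in BIDI_CONTROLS)


sample = "abc\u202edef\u202c"
hits = find_bidi_controls(sample)
clean = strip_bidi_controls(sample)
```

Because stripping changes offsets, citation spans computed on the raw text no longer line up with the cleaned text. Applying the same step at ingestion and at query time keeps the two sides consistent.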
---