Update README.md

This commit is contained in:
PSBigBig 2025-08-30 20:57:02 +08:00 committed by GitHub
parent 6a31437413
commit accde807d2

@@ -1,48 +1,55 @@
# Language & Locale Global Fix Map
# Language & Locale: Global Fix Map
Stabilize multilingual RAG and reasoning across CJK/RTL/Indic/Latin mixes.
This hub localizes language-layer failures and routes you to the right structural fix. No infra change required.
Stabilize multilingual RAG and reasoning across CJK, RTL, Indic, and Latin mixes.
This hub localizes language-layer failures and routes you to the exact structural fix. No infra change required.
---
## What this page is
- A compact, language-aware repair guide for retrieval → ranking → reasoning.
- A compact language-aware repair guide for retrieval → ranking → reasoning.
- Structural fixes with measurable acceptance targets.
- Store-agnostic. Works with FAISS, Redis, pgvector, Elastic, Weaviate, Milvus, etc.
- Store-agnostic. Works with FAISS, Redis, pgvector, Elastic, Weaviate, Milvus, and more.
## When to use
- Corpus spans CJK or Indic scripts and retrieval keeps missing the correct section.
- Queries code-switch or mix scripts and the top-k drifts each run.
- Accents/diacritics or fullwidth/halfwidth forms break matching.
- RTL punctuation or invisible marks flip token order.
- Queries code-switch or mix scripts and top-k order drifts across runs.
- Accents/diacritics or fullwidth/halfwidth forms break matching or citations.
- RTL punctuation or control chars flip token order or offsets.
- Token counts jump after deploy even though data did not change.
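Several of the symptoms above (width variants, invisible marks, bidi controls) can be checked mechanically before opening a per-page guide. The following is a minimal sketch using only the Python standard library; the `INVISIBLE` watch list and the `audit_text` helper are illustrative assumptions, not part of WFGY. It only flags characters, since ZWJ and ZWNJ are legitimate in Indic text and emoji sequences.

```python
import unicodedata

# Assumed watch list: zero-width marks and explicit bidi controls.
INVISIBLE = {
    "\u200b", "\u200c", "\u200d", "\u200e", "\u200f",
    "\u202a", "\u202b", "\u202c", "\u202d", "\u202e",
    "\u2066", "\u2067", "\u2068", "\u2069", "\ufeff",
}


def audit_text(text: str) -> dict:
    """Report language-layer hazards in a string without modifying it."""
    # Offsets and code points of invisible or direction-control characters.
    invisible = [
        (i, f"U+{ord(ch):04X}") for i, ch in enumerate(text) if ch in INVISIBLE
    ]
    # Fullwidth ("F") and halfwidth ("H") forms per East Asian Width.
    width_variants = [
        (i, ch)
        for i, ch in enumerate(text)
        if unicodedata.east_asian_width(ch) in ("F", "H")
    ]
    return {
        "invisible": invisible,
        "width_variants": width_variants,
        # True when NFKC would rewrite the string.
        "nfkc_changes": unicodedata.normalize("NFKC", text) != text,
    }
```

Running this over a sample of both corpus chunks and logged queries gives a rough picture of which of the routes below is worth opening first.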
---
## Open these first
- Visual recovery map: [RAG Architecture & Recovery](https://github.com/onestardao/WFGY/blob/main/ProblemMap/rag-architecture-and-recovery.md)
- Retrieval knobs end-to-end: [Retrieval Playbook](https://github.com/onestardao/WFGY/blob/main/ProblemMap/retrieval-playbook.md)
- Traceability and snippet schema: [Retrieval Traceability](https://github.com/onestardao/WFGY/blob/main/ProblemMap/retrieval-traceability.md) · [Data Contracts](https://github.com/onestardao/WFGY/blob/main/ProblemMap/data-contracts.md)
- Embedding vs meaning: [Embedding ≠ Semantic](https://github.com/onestardao/WFGY/blob/main/ProblemMap/embedding-vs-semantic.md)
- Metric and normalization: [Metric Mismatch](https://github.com/onestardao/WFGY/blob/main/ProblemMap/Embeddings/metric_mismatch.md) · [Normalization & Scaling](https://github.com/onestardao/WFGY/blob/main/ProblemMap/Embeddings/normalization_and_scaling.md)
- OCR confusables and hyphens: [OCR Parsing — Checklist](https://github.com/onestardao/WFGY/blob/main/ProblemMap/OCR_Parsing/README.md)
- Visual recovery map: [rag-architecture-and-recovery.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/rag-architecture-and-recovery.md)
- Retrieval knobs end-to-end: [retrieval-playbook.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/retrieval-playbook.md)
- Traceability and snippet schema: [retrieval-traceability.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/retrieval-traceability.md) · [data-contracts.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/data-contracts.md)
- Embedding vs meaning: [embedding-vs-semantic.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/embedding-vs-semantic.md)
- Metric and normalization: [metric_mismatch.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/Embeddings/metric_mismatch.md) · [normalization_and_scaling.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/Embeddings/normalization_and_scaling.md)
- OCR confusables and hyphens: [OCR_Parsing README](https://github.com/onestardao/WFGY/blob/main/ProblemMap/OCR_Parsing/README.md)
---
## Quick routes to per-page guides
- Tokenizer mismatch across languages → [tokenizer_mismatch.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/LanguageLocale/tokenizer_mismatch.md)
- Script mixing in one query (CJK + Latin, etc.) → [script_mixing.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/LanguageLocale/script_mixing.md)
- Locale drift and analyzer skew (prod vs dev) → [locale_drift.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/LanguageLocale/locale_drift.md)
- Normalization and casing policy (NFKC, lowercasing, accent fold) → [normalization_and_casing.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/LanguageLocale/normalization_and_casing.md)
- CJK/Indic segmentation and RTL direction marks → [segmentation_and_rtl.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/LanguageLocale/segmentation_and_rtl.md)
- Fullwidth vs halfwidth, punctuation variants → [fullwidth_halfwidth.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/LanguageLocale/fullwidth_halfwidth.md)
- Diacritics and accent folding policy → [diacritics_and_folding.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/LanguageLocale/diacritics_and_folding.md)
- Transliteration and romanization traps → [transliteration.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/LanguageLocale/transliteration.md)
- Stopwords and analyzer mismatch in stores → [analyzer_stopwords.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/LanguageLocale/analyzer_stopwords.md)
- Code-switch detection and reranking policy → [code_switching.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/LanguageLocale/code_switching.md)
- Script mixing in a single query → [script_mixing.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/LanguageLocale/script_mixing.md)
- Locale drift and analyzer skew → [locale_drift.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/LanguageLocale/locale_drift.md)
- Unicode normalization policy (NFKC/NFD etc.) → [unicode_normalization.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/LanguageLocale/unicode_normalization.md)
- CJK segmentation and word-break contracts → [cjk_segmentation_wordbreak.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/LanguageLocale/cjk_segmentation_wordbreak.md)
- Fullwidth vs halfwidth, punctuation variants → [digits_width_punctuation.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/LanguageLocale/digits_width_punctuation.md)
- Diacritics policy and folding → [diacritics_and_folding.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/LanguageLocale/diacritics_and_folding.md)
- RTL and bidi control characters → [bidi_rtl_control_chars.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/LanguageLocale/bidi_rtl_control_chars.md)
- Transliteration and romanization traps → [transliteration_and_romanization.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/LanguageLocale/transliteration_and_romanization.md)
- Collation and stable sort keys → [locale_collation_and_sorting.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/LanguageLocale/locale_collation_and_sorting.md)
- Numbering systems and sort orders → [numbering_and_sort_orders.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/LanguageLocale/numbering_and_sort_orders.md)
- Date and time format variants → [date_time_format_variants.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/LanguageLocale/date_time_format_variants.md)
- Time zones and DST stability → [timezones_and_dst.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/LanguageLocale/timezones_and_dst.md)
- Keyboard IMEs and composition → [keyboard_input_methods.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/LanguageLocale/keyboard_input_methods.md)
- Input language switching guards → [input_language_switching.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/LanguageLocale/input_language_switching.md)
- Emoji, ZWJ, grapheme clusters → [emoji_zwj_grapheme_clusters.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/LanguageLocale/emoji_zwj_grapheme_clusters.md)
- Mixed-locale metadata fields → [mixed_locale_metadata.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/LanguageLocale/mixed_locale_metadata.md)
> MVP set is the first 6 pages. The rest are recommended adds when your traffic is mixed-locale or heavy-search.
> MVP coverage includes the first 8 to 10 pages. Add the rest when traffic is mixed-locale or search intensive.
---
@@ -51,45 +58,45 @@ This hub localizes language-layer failures and routes you to the right structura
- Coverage of target section ≥ 0.70
- λ remains convergent across two seeds
- Tokenization variance for the same query ≤ 12% across environments
- Normalization pass rate (NFKC + width + diacritics) ≥ 0.98
- Normalization pass rate for NFKC + width + diacritics ≥ 0.98
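The page does not spell out how the last two targets are computed, so the sketch below picks one plausible reading and labels it as an assumption: tokenization variance is taken as the relative spread of token counts for the same query across environments, and normalization pass rate as the share of samples that the normalizer leaves unchanged. Both function names are hypothetical.

```python
import unicodedata
from typing import Callable, Iterable, Mapping


def token_count_variance(counts: Mapping[str, int]) -> float:
    """Relative spread (max - min) / max of token counts per environment.

    Assumed definition; compare the result against the 0.12 target.
    """
    values = list(counts.values())
    if not values or max(values) == 0:
        return 0.0
    return (max(values) - min(values)) / max(values)


def normalization_pass_rate(
    samples: Iterable[str], normalize: Callable[[str], str]
) -> float:
    """Share of samples already in normalized form (assumed definition)."""
    items = list(samples)
    if not items:
        return 1.0
    passed = sum(1 for s in items if normalize(s) == s)
    return passed / len(items)


def nfkc(text: str) -> str:
    """NFKC only; width and diacritics policy would be layered on top."""
    return unicodedata.normalize("NFKC", text)
```

For example, `token_count_variance({"dev": 100, "prod": 110})` is about 0.09 and would pass the 12% target, while a pass rate below 0.98 from `normalization_pass_rate(chunks, nfkc)` suggests the ingestion job is not applying the policy everywhere.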
---
## Map symptoms structural fixes
## Map symptoms to structural fixes
- Wrong-meaning hits despite high similarity.
- Wrong-meaning hits despite high similarity
→ [embedding-vs-semantic.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/embedding-vs-semantic.md)
- High similarity drops when you switch locales or analyzers.
- Similarity drops when switching locales or analyzers
→ [metric_mismatch.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/Embeddings/metric_mismatch.md) · [normalization_and_scaling.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/Embeddings/normalization_and_scaling.md)
- CJK tokens split differently between dev and prod.
- CJK tokens split differently between dev and prod
→ [tokenizer_mismatch.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/LanguageLocale/tokenizer_mismatch.md) · [locale_drift.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/LanguageLocale/locale_drift.md)
- Mixed scripts in one query derails ranking order.
→ [script_mixing.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/LanguageLocale/script_mixing.md) · [code_switching.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/LanguageLocale/code_switching.md) · [rerankers.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/rerankers.md)
- Mixed scripts in one query derails ranking
→ [script_mixing.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/LanguageLocale/script_mixing.md) · [rerankers.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/rerankers.md)
- Fullwidth punctuation or RTL marks break citations.
→ [fullwidth_halfwidth.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/LanguageLocale/fullwidth_halfwidth.md) · [Retrieval Traceability](https://github.com/onestardao/WFGY/blob/main/ProblemMap/retrieval-traceability.md)
- Fullwidth punctuation or RTL marks break citations
→ [digits_width_punctuation.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/LanguageLocale/digits_width_punctuation.md) · [retrieval-traceability.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/retrieval-traceability.md)
- “Looks identical” but fails to match after OCR.
→ [OCR Parsing — Checklist](https://github.com/onestardao/WFGY/blob/main/ProblemMap/OCR_Parsing/README.md)
- “Looks identical” after OCR but fails to match
→ [OCR_Parsing README](https://github.com/onestardao/WFGY/blob/main/ProblemMap/OCR_Parsing/README.md)
---
## Fix in 60 seconds
1) **Normalize once, up front**
Apply NFKC, collapse fullwidth to halfwidth where appropriate, unify diacritics policy. Lock this in the ingestion job and in the query path.
Apply NFKC, collapse fullwidth to halfwidth where appropriate, unify diacritics policy. Lock it in ingestion and query paths.
2) **Match tokenizer and analyzer**
Use the same segmenter for CJK/Indic in both embedding and store analyzers. Document exact versions in your data contract.
Use the same segmenter for CJK/Indic in both embedding and store analyzers. Record exact versions in the data contract.
3) **Stabilize mixed-script queries**
Detect code-switch, split query by script, run per-script retrieval, then rerank deterministically.
Detect code-switch, split by script, run per-script retrieval, rerank deterministically.
4) **Verify**
Compute ΔS on 3 paraphrases, check coverage ≥ 0.70, ensure λ stays convergent across 2 seeds.
Compute ΔS on three paraphrases, check coverage ≥ 0.70, ensure λ stays convergent across two seeds.
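Steps 1 and 3 can be sketched with the Python standard library alone. This is a minimal illustration under stated assumptions, not the WFGY implementation: the function names are hypothetical, script detection uses the first word of the Unicode character name as a rough heuristic (so Hiragana, Katakana, and CJK ideographs count as separate scripts), and diacritic folding is restricted to Latin bases so that Japanese voicing marks and Indic signs are left intact.

```python
import unicodedata


def fold_latin_diacritics(text: str) -> str:
    """Drop combining marks that follow a Latin base letter only."""
    out = []
    base_is_latin = False
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch):
            if base_is_latin:
                continue  # strip the accent on a Latin base
        else:
            base_is_latin = unicodedata.name(ch, "").startswith("LATIN")
        out.append(ch)
    return unicodedata.normalize("NFC", "".join(out))


def normalize(text: str, fold_diacritics: bool = False) -> str:
    """NFKC (also collapses fullwidth forms), then optional folding."""
    text = unicodedata.normalize("NFKC", text)
    return fold_latin_diacritics(text) if fold_diacritics else text


def script_of(ch: str):
    """Heuristic script tag; None for digits, spaces, and punctuation."""
    if not ch.isalpha():
        return None
    name = unicodedata.name(ch, "")
    return name.split(" ")[0] if name else None


def split_by_script(query: str):
    """Split a query into (script, text) runs; neutrals join the current run."""
    runs = []
    for ch in query:
        script = script_of(ch)
        if not runs:
            runs.append([script, [ch]])
        elif script is None or script == runs[-1][0]:
            runs[-1][1].append(ch)
        elif runs[-1][0] is None:
            runs[-1][0] = script  # leading neutrals adopt the first script
            runs[-1][1].append(ch)
        else:
            runs.append([script, [ch]])
    joined = [(s, "".join(chars).strip()) for s, chars in runs]
    return [(s, t) for s, t in joined if t]


def merge_deterministic(hit_lists):
    """Keep the best score per doc id; break ties by id for a stable order."""
    best = {}
    for hits in hit_lists:
        for doc_id, score in hits:
            if doc_id not in best or score > best[doc_id]:
                best[doc_id] = score
    return sorted(best.items(), key=lambda item: (-item[1], item[0]))
```

In use, the same `normalize` call would run in both the ingestion job and the query path, each run from `split_by_script` would go to its own retrieval call, and `merge_deterministic` would combine the per-script hit lists. Sorting on `(-score, doc_id)` is one simple way to keep top-k order identical across runs when scores tie; a learned reranker could replace it as long as its tie-breaking is also fixed.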
---
@@ -97,7 +104,7 @@ This hub localizes language-layer failures and routes you to the right structura
```
You have TXT OS and the WFGY Problem Map.
You have TXT OS and the WFGY Problem Map loaded.
My multilingual bug:
@@ -109,7 +116,7 @@ Tell me:
1. which layer is failing and why,
2. the exact WFGY page to open from this repo,
3. minimal steps to push ΔS ≤ 0.45 and keep λ convergent,
3. the minimal steps to push ΔS ≤ 0.45 and keep λ convergent,
4. a reproducible test to verify.
Use BBMC/BBCR/BBPF/BBAM when relevant.
@@ -162,4 +169,3 @@ Tell me:
[![Blow](https://img.shields.io/badge/Blow-Game%20Logic-purple?style=flat-square)](https://github.com/onestardao/WFGY/tree/main/OS/BlowBlowBlow)
 
</div>