mirror of https://github.com/onestardao/WFGY.git (synced 2026-04-28 11:40:07 +00:00)
Update README.md
parent 6a31437413
commit accde807d2
1 changed file with 48 additions and 42 deletions
|
|
@@ -1,48 +1,55 @@
|
|||
# Language & Locale — Global Fix Map
|
||||
# Language & Locale: Global Fix Map
|
||||
|
||||
Stabilize multilingual RAG and reasoning across CJK/RTL/Indic/Latin mixes.
|
||||
This hub localizes language-layer failures and routes you to the right structural fix. No infra change required.
|
||||
Stabilize multilingual RAG and reasoning across CJK, RTL, Indic, and Latin mixes.
|
||||
This hub localizes language-layer failures and routes you to the exact structural fix. No infra change required.
|
||||
|
||||
---
|
||||
|
||||
## What this page is
|
||||
- A compact, language-aware repair guide for retrieval → ranking → reasoning.
|
||||
- A compact language-aware repair guide for retrieval → ranking → reasoning.
|
||||
- Structural fixes with measurable acceptance targets.
|
||||
- Store-agnostic. Works with FAISS, Redis, pgvector, Elastic, Weaviate, Milvus, etc.
|
||||
- Store-agnostic. Works with FAISS, Redis, pgvector, Elastic, Weaviate, Milvus, and more.
|
||||
|
||||
## When to use
|
||||
- Corpus spans CJK or Indic scripts and retrieval keeps missing the correct section.
|
||||
- Queries code-switch or mix scripts and the top-k drifts each run.
|
||||
- Accents/diacritics or fullwidth/halfwidth forms break matching.
|
||||
- RTL punctuation or invisible marks flip token order.
|
||||
- Queries code-switch or mix scripts and top-k order drifts across runs.
|
||||
- Accents/diacritics or fullwidth/halfwidth forms break matching or citations.
|
||||
- RTL punctuation or control chars flip token order or offsets.
|
||||
- Token counts jump after deploy even though data did not change.
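A quick way to check whether a string carries any of the symptoms listed above is to audit its characters before it reaches the index. The following is a minimal sketch, not part of the WFGY repo: the function name `audit_text` and the report keys are hypothetical, and it relies only on Python's standard `unicodedata` module.

```python
import unicodedata

# Bidi control characters: LRM, RLM, ALM, embeddings/overrides, isolates.
BIDI_CONTROLS = {
    "\u200e", "\u200f", "\u061c",
    "\u202a", "\u202b", "\u202c", "\u202d", "\u202e",
    "\u2066", "\u2067", "\u2068", "\u2069",
}

def audit_text(text: str) -> dict:
    """Count characters that commonly break matching, offsets, or token order.

    Hypothetical helper for illustration only.
    """
    report = {
        "bidi_controls": 0,       # explicit direction marks and overrides
        "other_format_chars": 0,  # other invisible format chars (ZWSP, ZWJ, ...)
        "width_variants": 0,      # fullwidth or halfwidth forms
        "combining_marks": 0,     # decomposed accents and similar marks
    }
    for ch in text:
        if ch in BIDI_CONTROLS:
            report["bidi_controls"] += 1
        elif unicodedata.category(ch) == "Cf":
            report["other_format_chars"] += 1
        if unicodedata.east_asian_width(ch) in ("F", "H"):
            report["width_variants"] += 1
        if unicodedata.combining(ch):
            report["combining_marks"] += 1
    return report
```

Running this over a sample of chunks and queries gives a rough picture of which per-page guide is likely to matter; non-zero counts are a hint, not a diagnosis.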
|
||||
|
||||
---
|
||||
|
||||
## Open these first
|
||||
- Visual recovery map: [RAG Architecture & Recovery](https://github.com/onestardao/WFGY/blob/main/ProblemMap/rag-architecture-and-recovery.md)
|
||||
- Retrieval knobs end-to-end: [Retrieval Playbook](https://github.com/onestardao/WFGY/blob/main/ProblemMap/retrieval-playbook.md)
|
||||
- Traceability and snippet schema: [Retrieval Traceability](https://github.com/onestardao/WFGY/blob/main/ProblemMap/retrieval-traceability.md) · [Data Contracts](https://github.com/onestardao/WFGY/blob/main/ProblemMap/data-contracts.md)
|
||||
- Embedding vs meaning: [Embedding ≠ Semantic](https://github.com/onestardao/WFGY/blob/main/ProblemMap/embedding-vs-semantic.md)
|
||||
- Metric and normalization: [Metric Mismatch](https://github.com/onestardao/WFGY/blob/main/ProblemMap/Embeddings/metric_mismatch.md) · [Normalization & Scaling](https://github.com/onestardao/WFGY/blob/main/ProblemMap/Embeddings/normalization_and_scaling.md)
|
||||
- OCR confusables and hyphens: [OCR Parsing — Checklist](https://github.com/onestardao/WFGY/blob/main/ProblemMap/OCR_Parsing/README.md)
|
||||
- Visual recovery map: [rag-architecture-and-recovery.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/rag-architecture-and-recovery.md)
|
||||
- Retrieval knobs end-to-end: [retrieval-playbook.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/retrieval-playbook.md)
|
||||
- Traceability and snippet schema: [retrieval-traceability.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/retrieval-traceability.md) · [data-contracts.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/data-contracts.md)
|
||||
- Embedding vs meaning: [embedding-vs-semantic.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/embedding-vs-semantic.md)
|
||||
- Metric and normalization: [metric_mismatch.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/Embeddings/metric_mismatch.md) · [normalization_and_scaling.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/Embeddings/normalization_and_scaling.md)
|
||||
- OCR confusables and hyphens: [OCR_Parsing README](https://github.com/onestardao/WFGY/blob/main/ProblemMap/OCR_Parsing/README.md)
|
||||
|
||||
---
|
||||
|
||||
## Quick routes to per-page guides
|
||||
|
||||
- Tokenizer mismatch across languages → [tokenizer_mismatch.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/LanguageLocale/tokenizer_mismatch.md)
|
||||
- Script mixing in one query (CJK + Latin, etc.) → [script_mixing.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/LanguageLocale/script_mixing.md)
|
||||
- Locale drift and analyzer skew (prod vs dev) → [locale_drift.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/LanguageLocale/locale_drift.md)
|
||||
- Normalization and casing policy (NFKC, lowercasing, accent fold) → [normalization_and_casing.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/LanguageLocale/normalization_and_casing.md)
|
||||
- CJK/Indic segmentation and RTL direction marks → [segmentation_and_rtl.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/LanguageLocale/segmentation_and_rtl.md)
|
||||
- Fullwidth vs halfwidth, punctuation variants → [fullwidth_halfwidth.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/LanguageLocale/fullwidth_halfwidth.md)
|
||||
- Diacritics and accent folding policy → [diacritics_and_folding.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/LanguageLocale/diacritics_and_folding.md)
|
||||
- Transliteration and romanization traps → [transliteration.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/LanguageLocale/transliteration.md)
|
||||
- Stopwords and analyzer mismatch in stores → [analyzer_stopwords.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/LanguageLocale/analyzer_stopwords.md)
|
||||
- Code-switch detection and reranking policy → [code_switching.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/LanguageLocale/code_switching.md)
|
||||
- Script mixing in a single query → [script_mixing.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/LanguageLocale/script_mixing.md)
|
||||
- Locale drift and analyzer skew → [locale_drift.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/LanguageLocale/locale_drift.md)
|
||||
- Unicode normalization policy (NFKC/NFD etc.) → [unicode_normalization.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/LanguageLocale/unicode_normalization.md)
|
||||
- CJK segmentation and word-break contracts → [cjk_segmentation_wordbreak.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/LanguageLocale/cjk_segmentation_wordbreak.md)
|
||||
- Fullwidth vs halfwidth, punctuation variants → [digits_width_punctuation.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/LanguageLocale/digits_width_punctuation.md)
|
||||
- Diacritics policy and folding → [diacritics_and_folding.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/LanguageLocale/diacritics_and_folding.md)
|
||||
- RTL and bidi control characters → [bidi_rtl_control_chars.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/LanguageLocale/bidi_rtl_control_chars.md)
|
||||
- Transliteration and romanization traps → [transliteration_and_romanization.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/LanguageLocale/transliteration_and_romanization.md)
|
||||
- Collation and stable sort keys → [locale_collation_and_sorting.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/LanguageLocale/locale_collation_and_sorting.md)
|
||||
- Numbering systems and sort orders → [numbering_and_sort_orders.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/LanguageLocale/numbering_and_sort_orders.md)
|
||||
- Date and time format variants → [date_time_format_variants.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/LanguageLocale/date_time_format_variants.md)
|
||||
- Time zones and DST stability → [timezones_and_dst.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/LanguageLocale/timezones_and_dst.md)
|
||||
- Keyboard IMEs and composition → [keyboard_input_methods.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/LanguageLocale/keyboard_input_methods.md)
|
||||
- Input language switching guards → [input_language_switching.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/LanguageLocale/input_language_switching.md)
|
||||
- Emoji, ZWJ, grapheme clusters → [emoji_zwj_grapheme_clusters.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/LanguageLocale/emoji_zwj_grapheme_clusters.md)
|
||||
- Mixed-locale metadata fields → [mixed_locale_metadata.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/LanguageLocale/mixed_locale_metadata.md)
|
||||
|
||||
> MVP set is the first 6 pages. The rest are recommended adds when your traffic is mixed-locale or heavy-search.
|
||||
> MVP coverage includes the first 8–10 pages. Add the rest when traffic is mixed-locale or search-intensive.
|
||||
|
||||
---
|
||||
|
||||
|
|
@@ -51,45 +58,45 @@ This hub localizes language-layer failures and routes you to the right structura
|
|||
- Coverage of target section ≥ 0.70
|
||||
- λ remains convergent across two seeds
|
||||
- Tokenization variance for the same query ≤ 12% across environments
|
||||
- Normalization pass rate (NFKC + width + diacritics) ≥ 0.98
|
||||
- Normalization pass rate for NFKC + width + diacritics ≥ 0.98
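The last two targets can be measured with a few lines of code. The page does not define either metric precisely, so the sketch below makes its own assumptions: "pass rate" is taken to be the share of stored strings that are unchanged by the canonical form (NFKC by default), and "tokenization variance" is taken to be the relative spread `(max - min) / max` of token counts for one query across environments. The function names are hypothetical.

```python
import unicodedata
from typing import Callable, Sequence

def nfkc(text: str) -> str:
    """Default canonical form: NFKC, which also folds fullwidth/halfwidth variants."""
    return unicodedata.normalize("NFKC", text)

def normalization_pass_rate(
    texts: Sequence[str],
    canonical: Callable[[str], str] = nfkc,
) -> float:
    """Share of strings already in canonical form (assumed definition)."""
    if not texts:
        return 1.0
    passed = sum(1 for t in texts if canonical(t) == t)
    return passed / len(texts)

def tokenization_variance(token_counts: Sequence[int]) -> float:
    """Relative spread of token counts for the same query (assumed definition)."""
    if not token_counts:
        raise ValueError("need at least one token count")
    hi, lo = max(token_counts), min(token_counts)
    return 0.0 if hi == 0 else (hi - lo) / hi
```

For example, token counts of 25, 24, and 23 from three environments give a spread of 0.08, which would sit inside the 12% target under this definition. Pass a different `canonical` callable if your policy also folds diacritics.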
|
||||
|
||||
---
|
||||
|
||||
## Map symptoms → structural fixes
|
||||
## Map symptoms to structural fixes
|
||||
|
||||
- Wrong-meaning hits despite high similarity.
|
||||
- Wrong-meaning hits despite high similarity
|
||||
→ [embedding-vs-semantic.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/embedding-vs-semantic.md)
|
||||
|
||||
- High similarity drops when you switch locales or analyzers.
|
||||
- Similarity drops when switching locales or analyzers
|
||||
→ [metric_mismatch.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/Embeddings/metric_mismatch.md) · [normalization_and_scaling.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/Embeddings/normalization_and_scaling.md)
|
||||
|
||||
- CJK tokens split differently between dev and prod.
|
||||
- CJK tokens split differently between dev and prod
|
||||
→ [tokenizer_mismatch.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/LanguageLocale/tokenizer_mismatch.md) · [locale_drift.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/LanguageLocale/locale_drift.md)
|
||||
|
||||
- Mixed scripts in one query derail ranking order.
|
||||
→ [script_mixing.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/LanguageLocale/script_mixing.md) · [code_switching.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/LanguageLocale/code_switching.md) · [rerankers.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/rerankers.md)
|
||||
- Mixed scripts in one query derail ranking
|
||||
→ [script_mixing.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/LanguageLocale/script_mixing.md) · [rerankers.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/rerankers.md)
|
||||
|
||||
- Fullwidth punctuation or RTL marks break citations.
|
||||
→ [fullwidth_halfwidth.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/LanguageLocale/fullwidth_halfwidth.md) · [Retrieval Traceability](https://github.com/onestardao/WFGY/blob/main/ProblemMap/retrieval-traceability.md)
|
||||
- Fullwidth punctuation or RTL marks break citations
|
||||
→ [digits_width_punctuation.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/LanguageLocale/digits_width_punctuation.md) · [retrieval-traceability.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/retrieval-traceability.md)
|
||||
|
||||
- “Looks identical” but fails to match after OCR.
|
||||
→ [OCR Parsing — Checklist](https://github.com/onestardao/WFGY/blob/main/ProblemMap/OCR_Parsing/README.md)
|
||||
- “Looks identical” after OCR but fails to match
|
||||
→ [OCR_Parsing README](https://github.com/onestardao/WFGY/blob/main/ProblemMap/OCR_Parsing/README.md)
|
||||
|
||||
---
|
||||
|
||||
## Fix in 60 seconds
|
||||
|
||||
1) **Normalize once, up front**
|
||||
Apply NFKC, collapse fullwidth to halfwidth where appropriate, unify diacritics policy. Lock this in the ingestion job and in the query path.
|
||||
Apply NFKC, collapse fullwidth to halfwidth where appropriate, unify diacritics policy. Lock it in ingestion and query paths.
|
||||
|
||||
2) **Match tokenizer and analyzer**
|
||||
Use the same segmenter for CJK/Indic in both embedding and store analyzers. Document exact versions in your data contract.
|
||||
Use the same segmenter for CJK/Indic in both embedding and store analyzers. Record exact versions in the data contract.
|
||||
|
||||
3) **Stabilize mixed-script queries**
|
||||
Detect code-switch, split query by script, run per-script retrieval, then rerank deterministically.
|
||||
Detect code-switch, split by script, run per-script retrieval, rerank deterministically.
|
||||
|
||||
4) **Verify**
|
||||
Compute ΔS on 3 paraphrases, check coverage ≥ 0.70, ensure λ stays convergent across 2 seeds.
|
||||
Compute ΔS on three paraphrases, check coverage ≥ 0.70, ensure λ stays convergent across two seeds.
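Steps 1 and 3 above can be sketched in plain Python. This is an illustration under stated assumptions, not the WFGY implementation: `normalize`, `split_by_script`, and `merge_deterministic` are hypothetical names, script detection is a coarse heuristic based on the first word of each character's Unicode name (the standard library has no script property), and the retrieval call itself is left out because it depends on your store.

```python
import unicodedata
from typing import Dict, Iterable, List, Optional, Tuple

def normalize(text: str, fold_diacritics: bool = False) -> str:
    """NFKC first (folds width and compatibility forms), optional accent folding.

    Folding strips every combining mark, so enable it only for fields where
    that is safe (for example Latin-script text); it would damage Indic text.
    """
    text = unicodedata.normalize("NFKC", text)
    if fold_diacritics:
        decomposed = unicodedata.normalize("NFD", text)
        text = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
        text = unicodedata.normalize("NFC", text)
    return text

def script_of(ch: str) -> Optional[str]:
    """Coarse script bucket from the Unicode name; None for non-letters."""
    if not ch.isalpha():
        return None
    return unicodedata.name(ch, "UNKNOWN").split(" ", 1)[0]

def split_by_script(query: str) -> List[Tuple[str, str]]:
    """Split a query into (script, text) runs; non-letters stay with the current run."""
    runs: List[Tuple[str, str]] = []
    current: Optional[str] = None
    buf: List[str] = []
    for ch in query:
        script = script_of(ch)
        if script is not None and current is not None and script != current:
            runs.append((current, "".join(buf).strip()))
            buf = []
        if script is not None:
            current = script
        buf.append(ch)
    if buf and current is not None:
        runs.append((current, "".join(buf).strip()))
    return runs

def merge_deterministic(
    hit_lists: Iterable[Iterable[Tuple[str, float]]],
) -> List[Tuple[str, float]]:
    """Merge per-script hits, keep the best score per doc, break ties by doc id."""
    best: Dict[str, float] = {}
    for hits in hit_lists:
        for doc_id, score in hits:
            if doc_id not in best or score > best[doc_id]:
                best[doc_id] = score
    return sorted(best.items(), key=lambda kv: (-kv[1], kv[0]))
```

Call `normalize` identically in the ingestion job and the query path, run your own retrieval once per run returned by `split_by_script`, and pass the resulting hit lists to `merge_deterministic`. Sorting on `(-score, doc_id)` keeps the final order stable when scores tie, which is one simple way to avoid top-k drift between runs.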
|
||||
|
||||
---
|
||||
|
||||
|
|
@@ -97,7 +104,7 @@ This hub localizes language-layer failures and routes you to the right structura
|
|||
|
||||
```
|
||||
|
||||
You have TXT OS and the WFGY Problem Map.
|
||||
You have TXT OS and the WFGY Problem Map loaded.
|
||||
|
||||
My multilingual bug:
|
||||
|
||||
|
|
@@ -109,7 +116,7 @@ Tell me:
|
|||
|
||||
1. which layer is failing and why,
|
||||
2. the exact WFGY page to open from this repo,
|
||||
3. minimal steps to push ΔS ≤ 0.45 and keep λ convergent,
|
||||
3. the minimal steps to push ΔS ≤ 0.45 and keep λ convergent,
|
||||
4. a reproducible test to verify.
|
||||
Use BBMC/BBCR/BBPF/BBAM when relevant.
|
||||
|
||||
|
|
@@ -162,4 +169,3 @@ Tell me:
|
|||
[](https://github.com/onestardao/WFGY/tree/main/OS/BlowBlowBlow)
|
||||
|
||||
</div>
|
||||
|
||||