Update README.md

This commit is contained in:
PSBigBig 2025-08-30 14:26:41 +08:00 committed by GitHub
parent a16754a60e
commit fca8244bd2
No known key found for this signature in database
GPG key ID: B5690EEEBB952194

View file

@ -1,87 +1,113 @@
# Language & Multilingual Global Fix Map
# Language & Multilingual · Global Fix Map
A compact hub to **stabilize cross-lingual retrieval and reasoning**. Use this folder when your corpus or queries span CJK, RTL, Indic, Cyrillic, accented Latin, or when users code-switch.
A compact hub to **stabilize cross-lingual retrieval and reasoning**.
Use this folder when your corpus or queries include CJK, RTL, Indic, Cyrillic, accented Latin, or frequent code-switching.
---
## Quick routes to per-page guides
* Tokenizer mismatch → [tokenizer\_mismatch.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Language/tokenizer_mismatch.md)
* Script mixing in one query → [script\_mixing.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Language/script_mixing.md)
* Locale drift and normalization → [locale\_drift.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Language/locale_drift.md)
* End-to-end overview and recipes → [multilingual\_guide.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Language/multilingual_guide.md)
- Tokenizer mismatch → [tokenizer_mismatch.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Language/tokenizer_mismatch.md)
- Script mixing inside one query → [script_mixing.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Language/script_mixing.md)
- Locale normalization and variants → [locale_drift.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Language/locale_drift.md)
- End-to-end overview and recipes → [multilingual_guide.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Language/multilingual_guide.md)
- Proper nouns and alias shield → [proper_noun_aliases.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Language/proper_noun_aliases.md)
- Romanization and transliteration rules → [romanization_transliteration.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Language/romanization_transliteration.md)
- Query language detection contract → [query_language_detection.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Language/query_language_detection.md)
- Analyzer routing per language → [query_routing_and_analyzers.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Language/query_routing_and_analyzers.md)
- Multilingual hybrid ranking → [hybrid_ranking_multilingual.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Language/hybrid_ranking_multilingual.md)
- Stopwords and morphology locks → [stopword_and_morphology_controls.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Language/stopword_and_morphology_controls.md)
- Fallback translation with glossary → [fallback_translation_and_glossary_bridge.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Language/fallback_translation_and_glossary_bridge.md)
- Bilingual and code-switch eval sets → [code_switching_eval.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Language/code_switching_eval.md)
---
## When to use this folder
* High similarity yet wrong meaning on bilingual or mixed-script corpora
* Citations point to the wrong section after translating the question
* Hybrid retrievers underperform a single retriever across languages
* Index “looks” healthy while coverage remains low for non-Latin scripts
* Names flip between native, transliteration, and English aliases
* zh-Hans and zh-Hant never co-retrieve, Thai recall collapses without clear reason
- High similarity yet wrong meaning on bilingual or mixed-script corpora
- Citations point to the wrong section after translating the question
- Hybrid retrievers underperform a single retriever across languages
- Index looks healthy while coverage stays low for non-Latin scripts
- Names flip between native, transliteration, and English aliases
- zh-Hans and zh-Hant never co-retrieve, Thai recall drops with no clear cause
---
## Acceptance targets
* ΔS(question, retrieved) ≤ 0.45 across language variants
* Coverage of the target section ≥ 0.70 after repair
* λ remains convergent across three paraphrases and two seeds
* E\_resonance stays flat on long windows that mix scripts
- ΔS(question, retrieved) ≤ 0.45 across language variants
- Coverage of the target section ≥ 0.70 after repair
- λ remains convergent across three paraphrases and two seeds
- E_resonance stays flat on long windows that mix scripts
- Citation fields complete, alias noise does not leak into evidence
---
## Open these first
* Visual map and recovery → [rag-architecture-and-recovery.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/rag-architecture-and-recovery.md)
* End-to-end retrieval knobs → [retrieval-playbook.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/retrieval-playbook.md)
* Why this snippet, how to cite → [retrieval-traceability.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/retrieval-traceability.md)
* Snippet schema fence → [data-contracts.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/data-contracts.md)
* Embedding vs meaning → [embedding-vs-semantic.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/embedding-vs-semantic.md)
* Chunk boundary sanity → [chunking-checklist.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/chunking-checklist.md)
- Visual map and recovery → [rag-architecture-and-recovery.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/rag-architecture-and-recovery.md)
- End-to-end retrieval knobs → [retrieval-playbook.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/retrieval-playbook.md)
- Why this snippet and how to cite → [retrieval-traceability.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/retrieval-traceability.md)
- Snippet schema fence → [data-contracts.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/data-contracts.md)
- Embedding vs meaning → [embedding-vs-semantic.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/embedding-vs-semantic.md)
- Chunk boundary sanity → [chunking-checklist.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/chunking-checklist.md)
---
## Map symptoms → structural fixes (Problem Map)
## Map symptoms → structural fixes
| Symptom | Likely cause | Open this |
| ------------------------------------------ | ----------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| Wrong-meaning hits despite high similarity | Embedding not multilingual or pre-norm mismatch | [embedding-vs-semantic.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/embedding-vs-semantic.md) |
| Citations jump sections after translation | Snippet schema too loose | [data-contracts.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/data-contracts.md), [retrieval-traceability.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/retrieval-traceability.md) |
| zh-Hans/zh-Hant never co-retrieve | Variant mapping missing | [locale\_drift.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Language/locale_drift.md) |
| Thai/CJK recall collapses | Tokenizer mismatch, missing segmenter | [tokenizer\_mismatch.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Language/tokenizer_mismatch.md) |
| Mixed Latin + CJK query under-recalls | Analyzer split across scripts | [script\_mixing.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Language/script_mixing.md) |
| Hybrid retriever worse than single | Query parsing split, mis-weighted rerank | [patterns/pattern\_query\_parsing\_split.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/patterns/pattern_query_parsing_split.md), [rerankers.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/rerankers.md) |
| Symptom | Likely cause | Open this |
|---|---|---|
| High similarity yet wrong meaning | Embedding not multilingual or pre-normalization mismatch | [embedding-vs-semantic.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/embedding-vs-semantic.md) |
| Citations jump sections after translation | Snippet schema too loose | [data-contracts.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/data-contracts.md) · [retrieval-traceability.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/retrieval-traceability.md) |
| zh-Hans and zh-Hant never co-retrieve | Variant mapping and width rules missing | [locale_drift.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Language/locale_drift.md) |
| Thai or CJK recall collapses | Tokenizer mismatch or missing segmenter | [tokenizer_mismatch.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Language/tokenizer_mismatch.md) |
| Mixed Latin plus CJK query under-recalls | Analyzer split across scripts | [script_mixing.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Language/script_mixing.md) |
| Hybrid retriever worse than single | Query parsing split or mis-weighted rerank | [patterns/pattern_query_parsing_split.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/patterns/pattern_query_parsing_split.md) · [rerankers.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/rerankers.md) |
| Proper nouns oscillate across spellings | Missing alias fields and entity shield | [proper_noun_aliases.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Language/proper_noun_aliases.md) |
| Inconsistent transliteration causes misses | Romanization rules and aliases not aligned | [romanization_transliteration.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Language/romanization_transliteration.md) |
| Language detection drifts | Detection contract unlocked or weak samples | [query_language_detection.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Language/query_language_detection.md) |
| Search vs index behave differently | Analyzer routing error | [query_routing_and_analyzers.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Language/query_routing_and_analyzers.md) |
| Ranking unstable across languages | Monolingual reranker or unaligned features | [hybrid_ranking_multilingual.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Language/hybrid_ranking_multilingual.md) |
| Negations or particles vanish | Stopword or morphology rules too aggressive | [stopword_and_morphology_controls.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Language/stopword_and_morphology_controls.md) |
| Persistent high ΔS on local language path | Need controlled translation bridge with glossary | [fallback_translation_and_glossary_bridge.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Language/fallback_translation_and_glossary_bridge.md) |
---
## Fix in 60 seconds
1. **Measure ΔS across variants**
Run question in source language, target language, and a code-switched version. If ΔS improves only after transliteration or segmentation, language handling is the failing layer.
1. **Detect language**
Emit stable language and confidence per the detection contract. If unstable, stop and fix detection first.
Open → [query_language_detection.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Language/query_language_detection.md)
2. **Probe λ\_observe**
Hold citations fixed and flip only script or segmentation. If λ flips, lock schema and unify analyzers.
2. **Lock normalization and analyzers**
Keep the same locale, width, accents, and segmentation for both index and search.
Open → [locale_drift.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Language/locale_drift.md) · [query_routing_and_analyzers.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Language/query_routing_and_analyzers.md)
3. **Apply the smallest change**
3. **Protect entities and syntax**
Add alias fields and romanization pairs. Clamp stopwords and morphological rules for scope words like negations or units.
Open → [proper_noun_aliases.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Language/proper_noun_aliases.md) · [romanization_transliteration.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Language/romanization_transliteration.md) · [stopword_and_morphology_controls.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Language/stopword_and_morphology_controls.md)
* Unify index and search analyzers, add Thai/CJK segmenters
* Map zh-Hans ↔ zh-Hant at ingest or store in variant subfields
* Add alias fields for proper nouns: `native`, `romanized`, `normalized`
* Normalize digits, punctuation, and diacritics for RTL/Indic consistently
* Use a multilingual embedding; otherwise add a lexical sidecar and deterministic rerank
4. **Stabilize ranking and hybrid flows**
Use multilingual reranker or dual-track lexical plus vector, keep ordering deterministic.
Open → [hybrid_ranking_multilingual.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Language/hybrid_ranking_multilingual.md)
4. **Verify**
Coverage ≥ 0.70 and ΔS ≤ 0.45 across two languages and one mixed query. λ convergent on two seeds.
5. **Use a translation bridge only as last resort**
Enable only when the native path keeps high ΔS. Always pair with a glossary.
Open → [fallback_translation_and_glossary_bridge.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Language/fallback_translation_and_glossary_bridge.md)
6. **Verify**
With bilingual and code-switch test sets confirm ΔS ≤ 0.45 and Coverage ≥ 0.70, λ convergent.
Open → [code_switching_eval.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Language/code_switching_eval.md)
---
## Store-agnostic quick recipes
* Elasticsearch/OpenSearch: ICU normalization, width fold, bigram analyzers for CJK, Thai segmenter, keyword subfields for product names.
* Vector stores: pre-normalize identically for corpus and queries, validate with a bilingual gold set, rerank deterministically if embeddings are monolingual.
- Normalize the same way for corpus and queries before any vector store, keep tokenizer consistent on both sides
- CJK and Thai need segmentation or bigrams, keep critical fields as keyword to protect entities
- If you cannot use multilingual embeddings, add a lexical sidecar then align features in a deterministic rerank
---