mirror of
https://github.com/onestardao/WFGY.git
synced 2026-04-28 11:40:07 +00:00
Update README.md
parent a16754a60e
commit fca8244bd2
1 changed file with 71 additions and 45 deletions
@@ -1,87 +1,113 @@
-# Language & Multilingual — Global Fix Map
+# Language & Multilingual · Global Fix Map

-A compact hub to **stabilize cross-lingual retrieval and reasoning**. Use this folder when your corpus or queries span CJK, RTL, Indic, Cyrillic, accented Latin, or when users code-switch.
+A compact hub to **stabilize cross-lingual retrieval and reasoning**.
+Use this folder when your corpus or queries include CJK, RTL, Indic, Cyrillic, accented Latin, or frequent code-switching.

 ---

 ## Quick routes to per-page guides

-* Tokenizer mismatch → [tokenizer\_mismatch.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Language/tokenizer_mismatch.md)
-* Script mixing in one query → [script\_mixing.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Language/script_mixing.md)
-* Locale drift and normalization → [locale\_drift.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Language/locale_drift.md)
-* End-to-end overview and recipes → [multilingual\_guide.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Language/multilingual_guide.md)
+- Tokenizer mismatch → [tokenizer_mismatch.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Language/tokenizer_mismatch.md)
+- Script mixing inside one query → [script_mixing.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Language/script_mixing.md)
+- Locale normalization and variants → [locale_drift.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Language/locale_drift.md)
+- End-to-end overview and recipes → [multilingual_guide.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Language/multilingual_guide.md)
+- Proper nouns and alias shield → [proper_noun_aliases.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Language/proper_noun_aliases.md)
+- Romanization and transliteration rules → [romanization_transliteration.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Language/romanization_transliteration.md)
+- Query language detection contract → [query_language_detection.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Language/query_language_detection.md)
+- Analyzer routing per language → [query_routing_and_analyzers.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Language/query_routing_and_analyzers.md)
+- Multilingual hybrid ranking → [hybrid_ranking_multilingual.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Language/hybrid_ranking_multilingual.md)
+- Stopwords and morphology locks → [stopword_and_morphology_controls.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Language/stopword_and_morphology_controls.md)
+- Fallback translation with glossary → [fallback_translation_and_glossary_bridge.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Language/fallback_translation_and_glossary_bridge.md)
+- Bilingual and code-switch eval sets → [code_switching_eval.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Language/code_switching_eval.md)

 ---

 ## When to use this folder

-* High similarity yet wrong meaning on bilingual or mixed-script corpora
-* Citations point to the wrong section after translating the question
-* Hybrid retrievers underperform a single retriever across languages
-* Index “looks” healthy while coverage remains low for non-Latin scripts
-* Names flip between native, transliteration, and English aliases
-* zh-Hans and zh-Hant never co-retrieve, Thai recall collapses without clear reason
+- High similarity yet wrong meaning on bilingual or mixed-script corpora
+- Citations point to the wrong section after translating the question
+- Hybrid retrievers underperform a single retriever across languages
+- Index looks healthy while coverage stays low for non-Latin scripts
+- Names flip between native, transliteration, and English aliases
+- zh-Hans and zh-Hant never co-retrieve, Thai recall drops with no clear cause

 ---

 ## Acceptance targets

-* ΔS(question, retrieved) ≤ 0.45 across language variants
-* Coverage of the target section ≥ 0.70 after repair
-* λ remains convergent across three paraphrases and two seeds
-* E\_resonance stays flat on long windows that mix scripts
+- ΔS(question, retrieved) ≤ 0.45 across language variants
+- Coverage of the target section ≥ 0.70 after repair
+- λ remains convergent across three paraphrases and two seeds
+- E_resonance stays flat on long windows that mix scripts
+- Citation fields complete, alias noise does not leak into evidence
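
Reviewer sketch, not part of the commit: the two numeric targets in this section (ΔS ≤ 0.45, coverage ≥ 0.70) could be turned into a small gate that runs over each language variant of a question. The `VariantResult` shape, the function name, and the sample numbers are all hypothetical; how ΔS and coverage are actually computed is not shown in this diff and is assumed to happen elsewhere.

```python
from dataclasses import dataclass

DELTA_S_MAX = 0.45   # threshold taken from the acceptance targets
COVERAGE_MIN = 0.70  # threshold taken from the acceptance targets


@dataclass
class VariantResult:
    """Measurements for one language variant of a question (hypothetical shape)."""
    variant: str     # e.g. "en", "zh-Hans", "mixed"
    delta_s: float   # ΔS(question, retrieved), computed elsewhere
    coverage: float  # fraction of the target section that was retrieved


def failing_variants(results):
    """Return the names of variants that miss either threshold."""
    return [
        r.variant
        for r in results
        if r.delta_s > DELTA_S_MAX or r.coverage < COVERAGE_MIN
    ]


# Illustrative numbers only, not measured results.
results = [
    VariantResult("en", 0.31, 0.82),
    VariantResult("zh-Hans", 0.52, 0.74),
    VariantResult("mixed", 0.40, 0.61),
]
print(failing_variants(results))  # ['zh-Hans', 'mixed']
```

The λ and E_resonance targets are left out because this diff gives no numeric threshold for them.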

 ---

 ## Open these first

-* Visual map and recovery → [rag-architecture-and-recovery.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/rag-architecture-and-recovery.md)
-* End-to-end retrieval knobs → [retrieval-playbook.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/retrieval-playbook.md)
-* Why this snippet, how to cite → [retrieval-traceability.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/retrieval-traceability.md)
-* Snippet schema fence → [data-contracts.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/data-contracts.md)
-* Embedding vs meaning → [embedding-vs-semantic.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/embedding-vs-semantic.md)
-* Chunk boundary sanity → [chunking-checklist.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/chunking-checklist.md)
+- Visual map and recovery → [rag-architecture-and-recovery.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/rag-architecture-and-recovery.md)
+- End-to-end retrieval knobs → [retrieval-playbook.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/retrieval-playbook.md)
+- Why this snippet and how to cite → [retrieval-traceability.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/retrieval-traceability.md)
+- Snippet schema fence → [data-contracts.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/data-contracts.md)
+- Embedding vs meaning → [embedding-vs-semantic.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/embedding-vs-semantic.md)
+- Chunk boundary sanity → [chunking-checklist.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/chunking-checklist.md)

 ---

-## Map symptoms → structural fixes (Problem Map)
+## Map symptoms → structural fixes

-| Symptom | Likely cause | Open this |
-| ------------------------------------------ | ----------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
-| Wrong-meaning hits despite high similarity | Embedding not multilingual or pre-norm mismatch | [embedding-vs-semantic.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/embedding-vs-semantic.md) |
-| Citations jump sections after translation | Snippet schema too loose | [data-contracts.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/data-contracts.md), [retrieval-traceability.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/retrieval-traceability.md) |
-| zh-Hans/zh-Hant never co-retrieve | Variant mapping missing | [locale\_drift.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Language/locale_drift.md) |
-| Thai/CJK recall collapses | Tokenizer mismatch, missing segmenter | [tokenizer\_mismatch.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Language/tokenizer_mismatch.md) |
-| Mixed Latin + CJK query under-recalls | Analyzer split across scripts | [script\_mixing.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Language/script_mixing.md) |
-| Hybrid retriever worse than single | Query parsing split, mis-weighted rerank | [patterns/pattern\_query\_parsing\_split.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/patterns/pattern_query_parsing_split.md), [rerankers.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/rerankers.md) |
+| Symptom | Likely cause | Open this |
+|---|---|---|
+| High similarity yet wrong meaning | Embedding not multilingual or pre-normalization mismatch | [embedding-vs-semantic.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/embedding-vs-semantic.md) |
+| Citations jump sections after translation | Snippet schema too loose | [data-contracts.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/data-contracts.md) · [retrieval-traceability.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/retrieval-traceability.md) |
+| zh-Hans and zh-Hant never co-retrieve | Variant mapping and width rules missing | [locale_drift.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Language/locale_drift.md) |
+| Thai or CJK recall collapses | Tokenizer mismatch or missing segmenter | [tokenizer_mismatch.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Language/tokenizer_mismatch.md) |
+| Mixed Latin plus CJK query under-recalls | Analyzer split across scripts | [script_mixing.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Language/script_mixing.md) |
+| Hybrid retriever worse than single | Query parsing split or mis-weighted rerank | [patterns/pattern_query_parsing_split.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/patterns/pattern_query_parsing_split.md) · [rerankers.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/rerankers.md) |
+| Proper nouns oscillate across spellings | Missing alias fields and entity shield | [proper_noun_aliases.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Language/proper_noun_aliases.md) |
+| Inconsistent transliteration causes misses | Romanization rules and aliases not aligned | [romanization_transliteration.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Language/romanization_transliteration.md) |
+| Language detection drifts | Detection contract unlocked or weak samples | [query_language_detection.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Language/query_language_detection.md) |
+| Search vs index behave differently | Analyzer routing error | [query_routing_and_analyzers.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Language/query_routing_and_analyzers.md) |
+| Ranking unstable across languages | Monolingual reranker or unaligned features | [hybrid_ranking_multilingual.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Language/hybrid_ranking_multilingual.md) |
+| Negations or particles vanish | Stopword or morphology rules too aggressive | [stopword_and_morphology_controls.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Language/stopword_and_morphology_controls.md) |
+| Persistent high ΔS on local language path | Need controlled translation bridge with glossary | [fallback_translation_and_glossary_bridge.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Language/fallback_translation_and_glossary_bridge.md) |
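
Reviewer sketch, not part of the commit: to decide whether the "Mixed Latin plus CJK query" row applies, it may help to check which scripts a query actually contains. The helper below is a hypothetical heuristic that uses the first word of each character's Unicode name as a coarse script label. It is not a full Unicode Script property lookup (full-width Latin letters, for example, are labelled `FULLWIDTH`), so it should run after normalization.

```python
import unicodedata


def scripts_in(text):
    """Return the set of coarse script labels found in text.

    The label is the first word of the character's Unicode name,
    e.g. LATIN, CJK, THAI, ARABIC, CYRILLIC. Heuristic only.
    """
    found = set()
    for ch in text:
        if not ch.isalpha():
            continue  # skip digits, punctuation, whitespace, combining marks
        name = unicodedata.name(ch, "")
        if name:
            found.add(name.split()[0])
    return found


def is_mixed_script(text):
    """True when letters from more than one script label appear."""
    return len(scripts_in(text)) > 1


print(is_mixed_script("WFGY 検索 guide"))  # True
print(is_mixed_script("plain query"))      # False
```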

 ---

 ## Fix in 60 seconds

-1. **Measure ΔS across variants**
-Run question in source language, target language, and a code-switched version. If ΔS improves only after transliteration or segmentation, language handling is the failing layer.
+1. **Detect language**
+Emit stable language and confidence per the detection contract. If unstable, stop and fix detection first.
+Open → [query_language_detection.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Language/query_language_detection.md)

-2. **Probe λ\_observe**
-Hold citations fixed and flip only script or segmentation. If λ flips, lock schema and unify analyzers.
+2. **Lock normalization and analyzers**
+Keep the same locale, width, accents, and segmentation for both index and search.
+Open → [locale_drift.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Language/locale_drift.md) · [query_routing_and_analyzers.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Language/query_routing_and_analyzers.md)

-3. **Apply the smallest change**
+3. **Protect entities and syntax**
+Add alias fields and romanization pairs. Clamp stopwords and morphological rules for scope words like negations or units.
+Open → [proper_noun_aliases.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Language/proper_noun_aliases.md) · [romanization_transliteration.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Language/romanization_transliteration.md) · [stopword_and_morphology_controls.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Language/stopword_and_morphology_controls.md)

-* Unify index and search analyzers, add Thai/CJK segmenters
-* Map zh-Hans ↔ zh-Hant at ingest or store in variant subfields
-* Add alias fields for proper nouns: `native`, `romanized`, `normalized`
-* Normalize digits, punctuation, and diacritics for RTL/Indic consistently
-* Use a multilingual embedding; otherwise add a lexical sidecar and deterministic rerank
+4. **Stabilize ranking and hybrid flows**
+Use multilingual reranker or dual-track lexical plus vector, keep ordering deterministic.
+Open → [hybrid_ranking_multilingual.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Language/hybrid_ranking_multilingual.md)

-4. **Verify**
-Coverage ≥ 0.70 and ΔS ≤ 0.45 across two languages and one mixed query. λ convergent on two seeds.
+5. **Use a translation bridge only as last resort**
+Enable only when the native path keeps high ΔS. Always pair with a glossary.
+Open → [fallback_translation_and_glossary_bridge.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Language/fallback_translation_and_glossary_bridge.md)

+6. **Verify**
+With bilingual and code-switch test sets confirm ΔS ≤ 0.45 and Coverage ≥ 0.70, λ convergent.
+Open → [code_switching_eval.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Language/code_switching_eval.md)
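
Reviewer sketch, not part of the commit: step 2 asks for the same width, accent, and whitespace handling at index time and at search time. One way to make that hard to get wrong is a single normalization function that both paths import. The function name and the `strip_accents` flag are hypothetical. Accent stripping is off by default because removing combining marks is destructive for scripts such as Thai, where marks carry meaning.

```python
import unicodedata


def normalize_text(text, strip_accents=False):
    """Normalize text identically for the index path and the query path."""
    # NFKC folds full-width and half-width forms and other compatibility variants.
    text = unicodedata.normalize("NFKC", text)
    # casefold() is a more aggressive, language-neutral lowercase.
    text = text.casefold()
    if strip_accents:
        # Decompose, drop combining marks, recompose. Intended for accented Latin only.
        decomposed = unicodedata.normalize("NFD", text)
        text = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
        text = unicodedata.normalize("NFC", text)
    # Collapse runs of whitespace.
    return " ".join(text.split())


# Both sides call the same function, so they cannot drift apart.
indexed = normalize_text("ＷＦＧＹ　Ｇｕｉｄｅ")   # full-width input from a document
queried = normalize_text("wfgy  guide")        # half-width input from a user
print(indexed == queried)  # True
```

Segmentation and zh-Hans/zh-Hant variant mapping are not covered here, since they need language-specific resources that this diff does not name.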

 ---

 ## Store-agnostic quick recipes

-* Elasticsearch/OpenSearch: ICU normalization, width fold, bigram analyzers for CJK, Thai segmenter, keyword subfields for product names.
-* Vector stores: pre-normalize identically for corpus and queries, validate with a bilingual gold set, rerank deterministically if embeddings are monolingual.
+- Normalize the same way for corpus and queries before any vector store, keep tokenizer consistent on both sides
+- CJK and Thai need segmentation or bigrams, keep critical fields as keyword to protect entities
+- If you cannot use multilingual embeddings, add a lexical sidecar then align features in a deterministic rerank
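
Reviewer sketch, not part of the commit: the last two recipes (bigrams for unsegmented scripts, and a lexical sidecar feeding a deterministic rerank) might look like the following. All names, the Jaccard overlap measure, and the 0.5 blend weight are illustrative assumptions. The bigram step ignores whitespace, so it is crude for space-delimited languages and is meant only for scripts written without word boundaries.

```python
def char_bigrams(text):
    """Overlapping character bigrams; a lone character yields itself."""
    chars = [ch for ch in text if not ch.isspace()]
    if len(chars) < 2:
        return chars
    return [chars[i] + chars[i + 1] for i in range(len(chars) - 1)]


def lexical_overlap(query, doc):
    """Jaccard overlap of the two bigram sets, in [0, 1]."""
    q, d = set(char_bigrams(query)), set(char_bigrams(doc))
    if not q or not d:
        return 0.0
    return len(q & d) / len(q | d)


def rerank(query, candidates, weight=0.5):
    """Blend vector score with lexical overlap and return doc_ids in order.

    candidates: list of (doc_id, text, vector_score).
    Sort key is (score descending, doc_id ascending), so ties always
    resolve the same way regardless of input order.
    """
    scored = [
        (weight * vec + (1 - weight) * lexical_overlap(query, text), doc_id)
        for doc_id, text, vec in candidates
    ]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored]


candidates = [("x", "大阪", 0.6), ("y", "東京都", 0.5)]
print(rerank("東京", candidates))  # ['y', 'x']
```

Here the lexical sidecar lifts `y` above `x` even though `x` has the higher vector score, which is the behaviour the recipe aims for when embeddings are monolingual.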

 ---