# Language & Multilingual — Global Fix Map

Make cross-lingual RAG stable. Handle CJK/RTL scripts, mixed-script text, tokenizer differences, and locale drift without breaking retrieval.
## What this page is
- A compact playbook for multilingual corpora and queries
- Practical fixes for tokenizer and analyzer mismatch
- Steps to keep ΔS low across languages and scripts
## When to use

- Your corpus has Chinese/Japanese/Korean, RTL scripts, or code-switching
- OCR text looks fine, but retrieval or citations miss the target
- Similarity is high but meaning is wrong across locales
- HyDE/BM25 behave differently per language
## Open these first
- Language and locale guide: Multilingual Guide
- Embedding vs true meaning: Embedding ≠ Semantic
- OCR quality and pitfalls: OCR / Parsing Checklist
- Chunk boundaries and joins: Semantic Chunking Checklist
- Why this snippet: Retrieval Traceability
- Ordering control: Rerankers
- Snippet schema: Data Contracts
## Common failure patterns

- **Tokenizer mismatch**: the dense retriever applies whitespace rules to CJK text or splits accented characters poorly
- **Analyzer split**: the BM25 analyzer at query time differs from the analyzer used at index (write) time
- **Script variants**: Traditional vs Simplified Chinese, Kana vs Kanji, Arabic presentation forms
- **Normalization gaps**: mixed character widths, NFC/NFKC differences, and punctuation variants break exact matches
- **Romanization drift**: queries use Pinyin or Hepburn while documents keep the native script
- **Code-switching**: sentences mix English and local terms; embeddings latch onto one side
- **OCR artifacts**: diacritics lost, ligatures broken, zero-width joiners preserved
- **Stopword shock**: default analyzers drop particles that carry meaning in some languages
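The first two failure patterns come down to token sets that never overlap. A minimal sketch below shows why: the two tokenizers are hypothetical stand-ins (not any specific retriever's implementation), with a whitespace tokenizer on the query side and a character-bigram tokenizer, a common CJK analyzer strategy, on the index side.

```python
def whitespace_tokens(text: str) -> set[str]:
    """Whitespace tokenizer: treats an unspaced CJK phrase as one token."""
    return set(text.split())

def bigram_tokens(text: str) -> set[str]:
    """Character-bigram tokenizer, a common choice for CJK analyzers."""
    compact = text.replace(" ", "")
    return {compact[i:i + 2] for i in range(len(compact) - 1)}

doc = "東京大学 の 研究"   # indexed with bigrams
query = "東京大学"          # queried with whitespace rules

# Zero lexical overlap: BM25 would score this document as irrelevant
# even though it literally contains the query string.
overlap = whitespace_tokens(query) & bigram_tokens(doc)
print(overlap)  # → set()
```

The fix is not a better tokenizer in the abstract; it is using the same analyzer on both sides, which is what the steps below enforce.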
## Fix in 60 seconds

1. **Normalize before anything.** Apply NFC or NFKC, collapse character widths, unify punctuation. Persist the normalized form you index.
2. **Pick language-aware analyzers.** Set BM25 analyzers that match the language at both write and read time. Log tokenizer output for a few queries to confirm.
3. **Embed with multilingual models.** Use a single multilingual embedding model for mixed corpora. Do not mix English-only and multilingual embedding spaces in one index.
4. **Add transliteration bridges.** Generate light alias fields per document for the title and key entities, e.g., Traditional ↔ Simplified, Kana ↔ Romaji, Arabic ↔ Latin.
5. **Rerank cross-lingually.** Retrieve with a generous k, then apply a cross-lingual reranker. Confirm ΔS(question, context) ≤ 0.45.
6. **Lock citations and sections.** Use Data Contracts with `section_id`, `source_lang`, and `norm_ops`. Require cite-then-answer to avoid language mixing.
7. **Probe λ across locales.** Ask for "cite lines" and "explain why" in both the user language and the source language. Divergence marks the failing boundary.
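The normalization step above can be sketched in a few lines. This is a minimal illustration, assuming NFKC plus quote unification; the function name and the `norm_ops` log format are placeholders, not a fixed WFGY schema.

```python
import unicodedata

def normalize_for_index(text: str) -> tuple[str, list[str]]:
    """Return normalized text plus a log of applied operations (norm_ops)."""
    ops = []
    # NFKC also collapses full-width Latin, digits, and punctuation.
    nfkc = unicodedata.normalize("NFKC", text)
    if nfkc != text:
        ops.append("NFKC")
    # Unify curly quotes, which NFKC leaves untouched.
    unified = nfkc.translate(str.maketrans({"“": '"', "”": '"', "‘": "'", "’": "'"}))
    if unified != nfkc:
        ops.append("quote_unification")
    return unified, ops

text, ops = normalize_for_index("Ｔｏｋｙｏ　“大学”")
print(text)  # → Tokyo "大学"
print(ops)   # → ['NFKC', 'quote_unification']
```

Persisting `ops` alongside the document is what lets a Data Contract later prove the indexed form matches the retrieved form.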
## Copy-paste prompt
You have TXT OS and the WFGY Problem Map.

Goal:
Stabilize a multilingual RAG corpus with CJK and English. Prevent tokenizer mismatch and script drift.

Tasks:
1. Show a normalization plan:
   * Unicode form (NFC/NFKC), width collapse, punctuation unification
   * sample before/after lines
2. Configure retrieval:
   * pick BM25 analyzers that match the corpus languages
   * ensure the same analyzer is used at write and read time
   * use a multilingual embedding model with one index space
3. Add transliteration bridges:
   * alias fields for key entities (e.g., 簡↔繁, かな↔ローマ字)
   * show how aliases are added to the index document
4. Verify with WFGY:
   * compute ΔS(question, context) for three bilingual queries
   * report λ_observe at retrieval and at reasoning
   * target ΔS ≤ 0.45 and convergent λ

Output:
* Normalization spec
* Analyzer and embedding choices
* Example index doc with alias fields
* A trace table with citations, ΔS, and λ for 3 queries
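An index document with transliteration bridges (task 3 in the prompt) could look like the sketch below. The field names `alias_scripts`, `norm_ops`, and the alias values are illustrative assumptions, not a fixed schema; in practice the aliases would come from a conversion table or transliteration library.

```python
import json

# Illustrative index document with alias fields for cross-script retrieval.
doc = {
    "section_id": "sec-042",
    "source_lang": "zh-Hant",
    "title": "臺灣大學簡介",
    "alias_scripts": {
        "zh-Hans": "台湾大学简介",         # Traditional -> Simplified bridge
        "latin": "Taiwan Daxue Jianjie",   # Pinyin romanization bridge
    },
    "norm_ops": ["NFKC", "width_collapse"],
    "text": "…",
}
print(json.dumps(doc, ensure_ascii=False, indent=2))
```

Indexing the alias fields alongside the title lets a Simplified-script or romanized query match a Traditional-script document through plain BM25, without touching the dense retriever.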
## Minimal checklist

- Unicode normalization applied before embedding and indexing
- Language-aware analyzers configured identically for write and read
- One multilingual embedding space per index
- Alias fields or transliteration for key entities
- Data Contract includes `source_lang`, `norm_ops`, and citations
- ΔS and λ checks pass in both the user and source language
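The ΔS check in the last item can be computed once you have question and context embeddings. One common reading of ΔS is one minus the cosine similarity between the two vectors; that reading is an assumption here, and the toy vectors stand in for the output of whatever multilingual embedding model you use.

```python
import math

def delta_s(q_vec: list[float], c_vec: list[float]) -> float:
    """ΔS as 1 - cosine similarity (an assumed reading of the WFGY metric)."""
    dot = sum(a * b for a, b in zip(q_vec, c_vec))
    norm = math.sqrt(sum(a * a for a in q_vec)) * math.sqrt(sum(b * b for b in c_vec))
    return 1.0 - dot / norm

# Toy vectors standing in for embeddings of a bilingual question/context pair.
q = [0.9, 0.1, 0.3]
c = [0.8, 0.2, 0.4]
score = delta_s(q, c)
print(f"ΔS = {score:.3f}, pass = {score <= 0.45}")
```

Run this for each bilingual smoke-test query and keep the per-query scores in the trace table; the acceptance target below asks for a median ≤ 0.45, not just one passing pair.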
## Acceptance targets
- ΔS(question, context) median ≤ 0.45 for bilingual smoke tests
- λ remains convergent when switching question language
- Citations point to the correct section in the original script
- Hybrid retrieval improves with reranking instead of oscillating
- No analyzer or tokenizer mismatch logs during queries
## 🔗 Quick-Start Downloads (60 sec)
| Tool | Link | 3-Step Setup |
|---|---|---|
| WFGY 1.0 PDF | Engine Paper | 1️⃣ Download · 2️⃣ Upload to your LLM · 3️⃣ Ask “Answer using WFGY + <your question>” |
| TXT OS (plain-text OS) | TXTOS.txt | 1️⃣ Download · 2️⃣ Paste into any LLM chat · 3️⃣ Type “hello world” — OS boots instantly |
## 🧭 Explore More
| Module | Description | Link |
|---|---|---|
| WFGY Core | WFGY 2.0 engine is live: full symbolic reasoning architecture and math stack | View → |
| Problem Map 1.0 | Initial 16-mode diagnostic and symbolic fix framework | View → |
| Problem Map 2.0 | RAG-focused failure tree, modular fixes, and pipelines | View → |
| Semantic Clinic Index | Expanded failure catalog: prompt injection, memory bugs, logic drift | View → |
| Semantic Blueprint | Layer-based symbolic reasoning & semantic modulations | View → |
| Benchmark vs GPT-5 | Stress test GPT-5 with full WFGY reasoning suite | View → |
| 🧙‍♂️ Starter Village 🏡 | New here? Lost in symbols? Click here and let the wizard guide you through | Start → |
👑 Early Stargazers: see the Hall of Fame for the engineers, hackers, and open source builders who supported WFGY from day one.
⭐ WFGY Engine 2.0 is already unlocked. ⭐ Star the repo to help others discover it and unlock more on the Unlock Board.