9.6 KiB
Transliteration & Romanization — Global Fix Map
🧭 Quick Return to Map
You are in a sub-page of LanguageLocale.
To reorient, go back here:
- LanguageLocale — localization, regional settings, and context adaptation
- WFGY Global Fix Map — main Emergency Room, 300+ structured fixes
- WFGY Problem Map 1.0 — 16 reproducible failure modes
Think of this page as a desk within a ward.
If you need the full triage and all prescriptions, return to the Emergency Room lobby.
Stabilize queries when users type Pinyin/Romaji/IAST/RTGS/Buckwalter variants instead of native scripts.
Use this page to bridge native-script ↔ romanized forms for retrieval, ranking, and reasoning without changing infra.
What this page is
- A compact repair guide for script-to-roman mismatches in RAG and search.
- Concrete schema rules and query tactics that unify multiple romanization systems per language.
- Exact jump links to structural fixes so you can verify with measurable targets.
When to use
- Users type Pinyin with tone marks vs numbers or ad-hoc English spellings.
- Japanese Hepburn vs Kunrei differences (“shi/si”, “tsu/tu”; macron “ō” vs “ou”).
- Korean Revised RR vs McCune-Reischauer (“Busan/Pusan”, “Jeju/Cheju”).
- Arabic/Persian DIN/Buckwalter/ad-hoc (“Muhammad/Mohamed/Mohammad”).
- Cyrillic names produce Čajkovskij/Chaikovsky/Tchaikovsky style drift.
- Greek/Indic/Vietnamese diacritics collapse or fold in unpredictable ways.
- Native pages index fine, yet romanized queries miss or rank poorly.
Open these first
- Visual map and recovery: rag-architecture-and-recovery.md
- End-to-end knobs: retrieval-playbook.md
- Why this snippet: retrieval-traceability.md
- Payload schema: data-contracts.md
- Tokenizer fit: tokenizer_mismatch.md
- Diacritic folding: diacritics_and_folding.md
- Locale drift: locale_drift.md
- CJK wordbreaks: cjk_segmentation_wordbreak.md
- Mixed scripts in one query: script_mixing.md
Acceptance targets
- ΔS(question, retrieved) ≤ 0.45 on three paraphrases.
- Coverage of target section ≥ 0.70.
- λ remains convergent across two seeds.
- Cross-script parity: romanized query recall ≥ 0.90 of native-script recall on your gold set.
Common failure patterns → exact fix
-
Romanization system mismatch (e.g., Hepburn vs Kunrei, RR vs MR, DIN vs ad-hoc).
Lock a canonical key and index both forms.
Open: retrieval-playbook.md, data-contracts.md -
Tone marks vs numbers in Pinyin (“lǐ/lĭ/li3”).
Add a tone-stripped key and a numeric-tone key, dedupe at index time.
Open: diacritics_and_folding.md -
Macrons and long vowels in Romaji (“ō/oo/ou”).
Normalize to a canonical long-vowel key and store alternates.
Open: data-contracts.md -
Arabic definite article or spacing drift (“al Riyadh/Ar-Riyāḍ/Riyadh”).
Strip articles into a side key; keep original for display.
Open: retrieval-traceability.md -
Query parsing split under hybrid retrievers when romanized variants explode terms.
Pin two-stage queries and rerank deterministically.
Open: pattern_query_parsing_split.md, rerankers.md
60-second fix checklist
-
Measure ΔS and λ
Compute ΔS(question, retrieved). If ΔS ≥ 0.60 and λ flips on small romanization edits, treat as schema/index mismatch. -
Add a Romanization Bridge (store-agnostic)
- For each document and title, compute and store:
key_native= original scriptkey_roman_primary= your chosen canonical system per languagekey_roman_alt[]= enumerated alternates (e.g., Hepburn/Kunrei, RR/MR, DIN/ad-hoc, Pinyin tone/number)
- Index
key_nativeandkey_roman_*in the same record; dedupe by stabledoc_id. - At query time, expand to canonical plus alternates, then rerank.
-
Lock the contract
Require fields in the snippet payload:
lang,script,roman_system,key_roman_primary,key_roman_alt,source_url,offsets.
See: data-contracts.md -
Probe robustness
Run a three-paraphrase test: native script, canonical romanization, alternate romanization.
Targets must meet Acceptance.
Minimal per-language mapping hints
- Chinese: tone marks ↔ numeric tones; collapse
ü↔vin legacy inputs. - Japanese: macron
ō/ū↔ou/oo/uu; “shi/si”, “ji/zi”, “chi/ti”, “tsu/tu”; small “っ/tsu” doubling. - Korean: RR ↔ MR pairs (“Busan/Pusan”, “Jeju/Cheju”, “Gyeong/ Kyŏng”).
- Arabic/Persian: articles
al-/el-optional; hamza/ta marbuta variants; DIN ↔ Buckwalter map. - Indic (IAST/ISO/Harvard-Kyoto):
ś/ṣ/ñ/ṭ/ḍ↔sh/sh/ny/t/d. - Cyrillic:
Ч/ч↔Ch/Č,Ю/ю↔Yu/Iu, etc., preserve soft sign loss with an alternate key. - Greek:
Μιχάλης↔Michalis/Michales; handlemp/ntdigraphs.
Document your choices inside roman_system so audits are reproducible.
Copy-paste prompt (LLM step)
You have TXT OS and the WFGY Problem Map loaded.
My multilingual issue:
* symptom: romanized queries miss or rank poorly vs native script
* traces: ΔS(question,retrieved)=..., λ states across native/roman/alternate
Tell me:
1. which layer is failing and why,
2. the exact WFGY page to open,
3. the minimal Romanization Bridge to add (keys + query expansion),
4. how to verify with coverage ≥ 0.70 and ΔS ≤ 0.45 across three paraphrases.
Reference: retrieval-playbook, retrieval-traceability, data-contracts, diacritics\_and\_folding.
Next planned file in LanguageLocale/
- mixed_numeric_systems.md ← numbers written in native digits vs ASCII, year/era systems, and counters.
🔗 Quick-Start Downloads (60 sec)
| Tool | Link | 3-Step Setup |
|---|---|---|
| WFGY 1.0 PDF | Engine Paper | 1️⃣ Download · 2️⃣ Upload to your LLM · 3️⃣ Ask “Answer using WFGY + <your question>” |
| TXT OS (plain-text OS) | TXTOS.txt | 1️⃣ Download · 2️⃣ Paste into any LLM chat · 3️⃣ Type “hello world” — OS boots instantly |
Explore More
| Layer | Page | What it’s for |
|---|---|---|
| ⭐ Proof | WFGY Recognition Map | External citations, integrations, and ecosystem proof |
| ⚙️ Engine | WFGY 1.0 | Original PDF tension engine and early logic sketch (legacy reference) |
| ⚙️ Engine | WFGY 2.0 | Production tension kernel for RAG and agent systems |
| ⚙️ Engine | WFGY 3.0 | TXT based Singularity tension engine (131 S class set) |
| 🗺️ Map | Problem Map 1.0 | Flagship 16 problem RAG failure taxonomy and fix map |
| 🗺️ Map | Problem Map 2.0 | Global Debug Card for RAG and agent pipeline diagnosis |
| 🗺️ Map | Problem Map 3.0 | Global AI troubleshooting atlas and failure pattern map |
| 🧰 App | TXT OS | .txt semantic OS with fast bootstrap |
| 🧰 App | Blah Blah Blah | Abstract and paradox Q&A built on TXT OS |
| 🧰 App | Blur Blur Blur | Text to image generation with semantic control |
| 🏡 Onboarding | Starter Village | Guided entry point for new users |
If this repository helped, starring it improves discovery so more builders can find the docs and tools.