Romanization & Transliteration — Guardrails and Fix Pattern
Make cross-script search and RAG stable when users type Latin transliterations of non-Latin names and terms. This page gives a minimal contract, store wiring, and tests so that competing systems (Hepburn vs Kunrei, Pinyin with tone marks vs tone digits vs no tones, RR vs MR, ISO 9 vs GOST, Buckwalter vs ISO 233, and similar) do not break recall or flip ranking.
Open these first
- Visual map and recovery → rag-architecture-and-recovery.md
- End to end retrieval knobs → retrieval-playbook.md
- Traceability and cite-then-explain → retrieval-traceability.md
- Contract the payload → data-contracts.md
- Embedding vs meaning → embedding-vs-semantic.md
- Tokenizer variance → tokenizer_mismatch.md
- Mixed scripts in one query → script_mixing.md
- Locale normalization and width/diacritics → locale_drift.md
- Names and brand aliases → proper_noun_aliases.md
- End to end multilingual playbook → multilingual_guide.md
Core acceptance targets
- ΔS(question, retrieved) ≤ 0.45 for native script, romanized, and accent-stripped variants
- Coverage of target section ≥ 0.70 under three paraphrases and two seeds
- λ remains convergent when switching romanizers inside the same language
- No false merges across entities when romanized forms collide
Minimal contract
Add a small, explicit layer around romanization so behavior is auditable.
Document side fields
raw_text # untouched source
lang # BCP-47 primary tag
script # ISO 15924 code (Hani, Cyrl, Arab, Hira, Kana, Hang, etc.)
canonical # preferred display form for proper nouns if known
alias_tail # pipe-joined alias list incl. romanized forms
romanizers # systems observed for this doc: "pinyin|rr|hepburn|iso9|buckwalter"
Query side context
q_text # user input
q_lang_guess # detector result, nullable
q_script_guess # detector result, nullable
q_romanizer_hint # optional, from UI or logs, e.g. "hepburn"
Rules
- Never mutate raw_text or canonical.
- Romanized strings live only in alias_tail and store-specific synonym views.
- Record which systems were used. Mixing systems without a record increases ΔS variance.
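The contract and rules above can be sketched as an immutable record. This is a minimal illustration, not the project's own schema: the class name `DocRecord`, the `aliases` tuple, and the `with_alias` helper are assumptions introduced here, and only the field names come from the contract.

```python
from __future__ import annotations

from dataclasses import dataclass, replace


@dataclass(frozen=True)  # frozen: raw_text and canonical cannot be mutated in place
class DocRecord:
    raw_text: str                      # untouched source
    lang: str                          # BCP-47 primary tag, e.g. "ja"
    script: str                        # ISO 15924 code, e.g. "Jpan"
    canonical: str | None = None       # preferred display form, if known
    aliases: tuple[str, ...] = ()      # romanized forms live only here
    romanizers: tuple[str, ...] = ()   # systems that produced the aliases

    @property
    def alias_tail(self) -> str:
        # Pipe-joined alias list, as described in the contract.
        return "|".join(self.aliases)

    def with_alias(self, alias: str, system: str) -> DocRecord:
        """Return a copy with one more alias and the system that produced it."""
        if not system:
            raise ValueError("every alias must name its romanization system")
        aliases = self.aliases if alias in self.aliases else self.aliases + (alias,)
        systems = self.romanizers if system in self.romanizers else self.romanizers + (system,)
        return replace(self, aliases=aliases, romanizers=systems)


doc = DocRecord(raw_text="東京に行く", lang="ja", script="Jpan", canonical="東京")
doc = doc.with_alias("Tōkyō", "hepburn").with_alias("Tôkyô", "kunrei")
```

Requiring a system name for every alias is one way to enforce the third rule at the type boundary instead of relying on convention.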
Store wiring
BM25 style indexes
- Keep raw_text with a locale-aware analyzer.
- Add a synonym graph on a separate field that contains romanized aliases.
- Apply width normalization and diacritic strip only in alias field. Keep canonical untouched. See locale_drift.md.
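A minimal sketch of the alias-only normalization described above, using only the standard library. The function names and the `alias_field` key are hypothetical, and the fold is meant for Latin-script aliases: stripping combining marks from native-script text would damage it (for example Japanese voicing marks), which is one reason to keep it out of raw_text and canonical.

```python
import unicodedata


def fold_for_alias_field(s: str) -> str:
    """Width-fold, strip diacritics, and case-fold a Latin-script alias."""
    s = unicodedata.normalize("NFKC", s)          # full-width forms become ASCII-width
    decomposed = unicodedata.normalize("NFD", s)  # split base letters from combining marks
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped).casefold()


def build_index_doc(raw_text: str, canonical: str, aliases: list) -> dict:
    """Assemble the per-field payload; only alias_field is normalized."""
    return {
        "raw_text": raw_text,    # indexed with a locale-aware analyzer, unchanged here
        "canonical": canonical,  # never normalized
        "alias_field": sorted({fold_for_alias_field(a) for a in aliases}),
    }
```

Because folded aliases are deduplicated, tone-marked, plain, and full-width spellings of the same romanization collapse into one alias entry.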
Vector stores
- Append alias_tail to the chunk text right after the first canonical mention.
- Keep short, high precision alias lists. Over-expansion harms meaning.
- If nearest neighbors look similar yet wrong, verify metric per embedding-vs-semantic.md.
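One possible way to place the aliases right after the first canonical mention before embedding. The function name, the parenthesized layout, and the default cap of four aliases are assumptions for illustration; the source only asks for short lists placed after the first mention.

```python
def inject_alias_tail(chunk: str, canonical: str, aliases: list, max_aliases: int = 4) -> str:
    """Insert a short pipe-joined alias list after the first canonical mention."""
    if not canonical or not aliases:
        return chunk
    pos = chunk.find(canonical)
    if pos < 0:
        return chunk  # canonical form not present in this chunk, leave it alone

    kept = []
    for alias in aliases:
        # Drop duplicates and aliases identical to the canonical form.
        if alias != canonical and alias not in kept:
            kept.append(alias)
    kept = kept[:max_aliases]  # cap the list: over-expansion harms meaning
    if not kept:
        return chunk

    end = pos + len(canonical)
    return chunk[:end] + " (" + "|".join(kept) + ")" + chunk[end:]
```

Only the first mention is expanded, so later mentions in the same chunk keep their original text and the embedding input grows by a bounded amount.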
Hybrid
- When BM25 yields an exact canonical match, bias reranker features to keep it above looser transliterations.
- Log ΔS and λ per candidate so you can see when a romanized neighbor outranks the native script without evidence.
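The first hybrid rule can be sketched as an additive bonus applied before the final sort. The `Candidate` fields and the bonus value of 0.15 are hypothetical; a real reranker would expose this as a feature or a tie-break rule, and the right magnitude depends on the score scale in use.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    doc_id: str
    score: float           # reranker score, higher is better
    canonical_exact: bool  # True when BM25 matched the canonical form exactly


def rank_with_canonical_bias(cands: list, bonus: float = 0.15) -> list:
    """Sort candidates by score, giving exact canonical matches a fixed bonus."""
    def biased(c: Candidate) -> float:
        return c.score + (bonus if c.canonical_exact else 0.0)

    return sorted(cands, key=biased, reverse=True)
```

A bonus keeps the ordering soft: a transliterated hit with much stronger evidence can still win, while near-ties resolve in favor of the canonical snippet.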
System map (examples)
| Language | Common systems | Notes |
|---|---|---|
| Chinese | Pinyin (tone marks or digits) | Keep tone-less aliases for user input, but preserve tone marks in canonical forms. |
| Japanese | Hepburn, Kunrei, Nihon | Handle long vowels (ō vs ou) and small tsu (doubled consonants). |
| Korean | RR (Revised Romanization), MR (McCune-Reischauer) | Names appear both with and without hyphens; index both forms. |
| Russian and Cyrillic | ISO 9, GOST, BGN/PCGN | Map soft sign and yo/ë variants. |
| Arabic | Buckwalter, ISO 233, DMG | Decide on hamza and taa marbuta conventions, keep both if present in corpus. |
| Hebrew | SBL, Academy rules | Deal with mater lectionis and dagesh normalization. |
| Hindi and Indic | ITRANS, ISO 15919 | Normalize nukta forms. |
Keep this list in code comments and in your ops runbook, not only in the model prompt.
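Two of the notes in the table, Japanese long vowels and Korean hyphens, can be sketched as small variant generators. The function names and the three spelling styles are assumptions; applying one style uniformly to a whole string is a heuristic, since a macron vowel can correspond to different kana spellings.

```python
import unicodedata

# Each style rewrites macron vowels one way across the whole string.
_LONG_VOWEL_STYLES = (
    {"\u014d": "o", "\u016b": "u"},    # macron dropped: ō -> o, ū -> u
    {"\u014d": "ou", "\u016b": "uu"},  # kana-style spelling: ō -> ou
    {"\u014d": "oo", "\u016b": "uu"},  # doubled vowel: ō -> oo
)


def long_vowel_variants(hepburn: str) -> set:
    """Return the input plus common spellings of its macron long vowels."""
    s = unicodedata.normalize("NFC", hepburn)  # make sure macron vowels are precomposed
    out = {s}
    for style in _LONG_VOWEL_STYLES:
        out.add(s.translate(str.maketrans(style)))
    return out


def hyphen_variants(name: str) -> set:
    """Return the name with and without hyphens."""
    return {name, name.replace("-", "")}
```

Both helpers return small, bounded sets, which fits the advice to keep alias lists short.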
Typical failure → fix
| Symptom | Likely cause | Open this |
|---|---|---|
| Native script doc exists, romanized query misses it | no alias view built at index time | retrieval-playbook.md |
| Romanized neighbor outranks exact canonical snippet | reranker features not constrained | retrieval-traceability.md |
| Answers flip between Hepburn and Kunrei inputs | mixed systems without logging, λ not clamped | tokenizer_mismatch.md |
| Cyrillic ISO 9 vs GOST produce different chunks | analyzer mismatch per field | locale_drift.md |
| Arabic Buckwalter forms merge two entities | alias collision, missing scope fence | proper_noun_aliases.md |
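The alias-collision row can be checked mechanically at ingest time. The sketch below is an assumption about how such a check might look: the function name and the entity ids are hypothetical. The example uses a well-known Pinyin case, where 山西 (Shānxī) and 陕西 (Shǎnxī) share the tone-less form "Shanxi".

```python
from __future__ import annotations

from collections import defaultdict


def find_alias_collisions(entity_aliases: dict[str, list[str]]) -> dict[str, set[str]]:
    """Map each case-folded alias claimed by more than one entity to its owners."""
    owners: dict[str, set[str]] = defaultdict(set)
    for entity_id, aliases in entity_aliases.items():
        for alias in aliases:
            owners[alias.casefold()].add(entity_id)
    return {alias: ids for alias, ids in owners.items() if len(ids) > 1}


collisions = find_alias_collisions({
    "province:山西": ["Shanxi"],
    "province:陕西": ["Shanxi", "Shaanxi"],
})
```

Any alias reported here should not be used to merge entities on its own; it needs a scope fence such as the native-script form or an entity type.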
60-second fix checklist
- Wire alias view for documents that carry non-Latin scripts.
- Record the system used for any generated alias.
- Normalize only in alias fields for width and diacritics, never in canonical.
- Bias reranker to keep exact canonical hits above loose transliterations.
- Log ΔS and λ for native vs romanized queries and compare.
Copy snippets
Alias expansion at ingest time (no external libs)
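from __future__ import annotations  # lets "str | None" parse on Python versions before 3.10

import unicodedata as ud

def simple_pinyin_drop_tones(s: str) -> str:
    # map tone-marked lowercase vowels to bare vowels; ü and its toned forms fold to "u"
    tone_map = str.maketrans("āáǎàēéěèīíǐìōóǒòūúǔùǖǘǚǜü", "aaaaeeeeiiiioooouuuuuuuuu")
    return s.translate(tone_map)

def width_fold(s: str) -> str:
    # simple NFKC fold (full-width to half-width and other compatibility forms)
    return ud.normalize("NFKC", s)

def alias_pack(canonical: str, lang: str, romanizer_hint: str | None = None) -> list[str]:
    # romanizer_hint is reserved for per-system rules and is unused in this minimal version
    out = {canonical}
    if lang == "zh":
        # only has an effect when canonical is tone-marked Pinyin; Han characters pass through unchanged
        out.add(simple_pinyin_drop_tones(canonical))
    # add more light rules per language as needed
    # dedupe again after folding and sort so the output order is stable
    return sorted({width_fold(x) for x in out})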
Prompt fence for romanizers
You have TXTOS and the WFGY Problem Map.
When the question or snippet contains a non-Latin name or term:
1) Try native script first. If the user input looks romanized, search both native and alias views.
2) Keep the canonical form in the final answer. Cite the exact snippet that contains the canonical form.
3) If multiple romanization systems match, state which system appears in the cited text.
Eval plan
Use a code-switching set with 5 languages and 10 entities each. For every entity build 3 questions:
- native script,
- romanized in system A,
- romanized in system B.
Run the suite with code_switching_eval.md.
Targets
- recall@10 for each of the three forms ≥ 0.85
- ΔS(question, retrieved) ≤ 0.45 on the best hit
- λ convergent across two seeds and three paraphrases
If recall is fine but ranking flips between systems, tighten reranker constraints and verify with retrieval-traceability.md.
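The recall target can be checked per form with a few lines of Python. This is a sketch under assumptions: the tuple layout `(form, gold_doc_id, ranked_doc_ids)` and the function name are hypothetical, and ΔS and λ are not computed here.

```python
def recall_at_k_by_form(runs: list, k: int = 10) -> dict:
    """Compute recall@k separately for each query form.

    Each run is (form, gold_doc_id, ranked_doc_ids), where form is a label
    such as "native", "roman_a", or "roman_b".
    """
    hits = {}
    totals = {}
    for form, gold, ranked in runs:
        totals[form] = totals.get(form, 0) + 1
        if gold in ranked[:k]:
            hits[form] = hits.get(form, 0) + 1
    return {form: hits.get(form, 0) / totals[form] for form in totals}


def meets_target(recalls: dict, threshold: float = 0.85) -> bool:
    """True only when every form reaches the threshold."""
    return bool(recalls) and all(r >= threshold for r in recalls.values())
```

Reporting recall per form rather than pooled makes it visible when one romanization system lags while the average still looks healthy.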