WFGY/ProblemMap/GlobalFixMap/Language/romanization_transliteration.md


Romanization & Transliteration — Guardrails and Fix Pattern

🧭 Quick Return to Map

You are in a sub-page of Language.
To reorient, go back here:

Think of this page as a desk within a ward.
If you need the full triage and all prescriptions, return to the Emergency Room lobby.

Make cross-script search and RAG stable when users type Latin transliterations of non-Latin names and terms. This page gives a minimal contract, store wiring, and tests so that Hepburn vs Kunrei, Pinyin with tone marks vs tone digits, RR vs MR, ISO 9 vs GOST, Buckwalter vs ISO 233, and similar system pairs do not break recall or flip ranking.


Core acceptance targets

  • ΔS(question, retrieved) ≤ 0.45 for native script, romanized, and accent-stripped variants
  • Coverage of target section ≥ 0.70 under three paraphrases and two seeds
  • λ remains convergent when switching romanizers inside the same language
  • No false merges across entities when romanized forms collide

Minimal contract

Add a small, explicit layer around romanization so behavior is auditable.

Document side fields

raw_text            # untouched source
lang                # BCP-47 primary tag
script              # ISO 15924 (Han, Cyrl, Arab, Hira, Kana, Hang, etc.)
canonical           # preferred display form for proper nouns if known
alias_tail          # pipe-joined alias list incl. romanized forms
romanizers          # systems observed for this doc: "pinyin|rr|hepburn|iso9|buckwalter"

Query side context

q_text              # user input
q_lang_guess        # detector result, nullable
q_script_guess      # detector result, nullable
q_romanizer_hint    # optional, from UI or logs, e.g. "hepburn"

Rules

  • Never mutate raw_text or canonical.
  • Romanized strings live only in alias_tail and store-specific synonym views.
  • Record which systems were used. Mixing systems without a record increases ΔS variance.
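The contract above can be sketched as a small constructor. This is a minimal sketch: `make_doc_record` and its dict shape are illustrative, not part of WFGY.

```python
def make_doc_record(raw_text, lang, script, canonical=None,
                    aliases=None, romanizers=None):
    """Build an auditable document-side record per the contract.
    raw_text and canonical are stored untouched, never mutated."""
    return {
        "raw_text": raw_text,                      # untouched source
        "lang": lang,                              # BCP-47 primary tag
        "script": script,                          # ISO 15924 code
        "canonical": canonical,                    # preferred display form, if known
        "alias_tail": "|".join(aliases or []),     # pipe-joined aliases incl. romanized forms
        "romanizers": "|".join(romanizers or []),  # systems observed, e.g. "pinyin|hepburn"
    }
```

Keeping the romanized forms only in `alias_tail`, with the systems recorded in `romanizers`, is what makes later ΔS variance debuggable.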

Store wiring

BM25 style indexes

  • Keep raw_text with a locale-aware analyzer.
  • Add a synonym graph on a separate field that contains romanized aliases.
  • Apply width normalization and diacritic strip only in alias field. Keep canonical untouched. See locale_drift.md.
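The width-and-diacritic normalization that the last bullet restricts to the alias field can be sketched with the standard library. `alias_field_normalize` is a hypothetical helper name; apply it only when building the alias field, never to `raw_text` or `canonical`.

```python
import unicodedata

def alias_field_normalize(s: str) -> str:
    """Alias-field-only normalization: width fold plus diacritic strip."""
    s = unicodedata.normalize("NFKC", s)          # width fold, e.g. full-width A -> A
    decomposed = unicodedata.normalize("NFD", s)  # split base chars from combining marks
    return "".join(c for c in decomposed if not unicodedata.combining(c))
```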

Vector stores

  • Append alias_tail to the chunk text right after the first canonical mention.
  • Keep short, high precision alias lists. Over-expansion harms meaning.
  • If nearest neighbors look similar yet wrong, verify metric per embedding-vs-semantic.md.
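Appending `alias_tail` right after the first canonical mention can be sketched as follows; `inject_alias_tail` is an illustrative helper, not a library call.

```python
def inject_alias_tail(chunk: str, canonical: str, alias_tail: str) -> str:
    """Insert aliases in parentheses after the first canonical mention.
    Leaves the chunk unchanged if the canonical form is absent."""
    idx = chunk.find(canonical)
    if idx < 0 or not alias_tail:
        return chunk
    end = idx + len(canonical)
    aliases = alias_tail.replace("|", ", ")
    return chunk[:end] + " (" + aliases + ")" + chunk[end:]
```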

Hybrid

  • When BM25 yields an exact canonical match, bias reranker features to keep it above looser transliterations.
  • Log ΔS and λ per candidate so you can see when a romanized neighbor outranks the native script without evidence.
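A minimal sketch of the bias-and-log step, assuming a simple additive bonus `alpha` for exact canonical BM25 hits. The names and the bonus scheme are illustrative, not the WFGY reranker.

```python
def rerank(candidates, canonical_exact_ids, alpha=0.2):
    """candidates: list of (doc_id, score, delta_s, lam) tuples.
    Exact canonical hits get a fixed bonus so looser transliterations
    cannot outrank them on score alone; ΔS and λ are logged per candidate."""
    out = []
    for doc_id, score, delta_s, lam in candidates:
        bonus = alpha if doc_id in canonical_exact_ids else 0.0
        print(f"{doc_id}: score={score:.3f} bonus={bonus} dS={delta_s:.2f} lambda={lam}")
        out.append((doc_id, score + bonus))
    return sorted(out, key=lambda t: t[1], reverse=True)
```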

System map (examples)

| Language | Common systems | Notes |
|---|---|---|
| Chinese | Pinyin (tone marks or digits) | Keep tone-less aliases for user input, but preserve tone marks in canonical forms. |
| Japanese | Hepburn, Kunrei, Nihon-shiki | Handle long vowels (ō vs ou) and the small tsu. |
| Korean | RR (Revised Romanization), MR (McCune-Reischauer) | Names often appear without hyphens; add both forms. |
| Russian and Cyrillic | ISO 9, GOST, BGN/PCGN | Map soft sign and yo/ë variants. |
| Arabic | Buckwalter, ISO 233, DMG | Decide on hamza and taa marbuta conventions; keep both if present in the corpus. |
| Hebrew | SBL, Academy rules | Handle matres lectionis and dagesh normalization. |
| Hindi and Indic | ITRANS, ISO 15919 | Normalize nukta forms. |
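The Japanese long-vowel note can be sketched as a variant generator. This is simplified for illustration: real Hepburn/Kunrei handling also covers ē/ei and sokuon doubling.

```python
def long_vowel_variants(s: str) -> set[str]:
    """Generate long-vowel aliases: macron form (ō), wapuro form (ou),
    and the plain stripped form (o)."""
    pairs = {"ō": ("ou", "o"), "ū": ("uu", "u")}
    out = {s}
    for macron, repls in pairs.items():
        new = set()
        for form in out:
            if macron in form:
                for r in repls:
                    new.add(form.replace(macron, r))
        out |= new
    return out
```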

Keep this list in code comments and in your ops runbook, not only in the model prompt.


Typical failure → fix

| Symptom | Likely cause | Open this |
|---|---|---|
| Native script doc exists, romanized query misses it | No alias view built at index time | retrieval-playbook.md |
| Romanized neighbor outranks exact canonical snippet | Reranker features not constrained | retrieval-traceability.md |
| Answers flip between Hepburn and Kunrei inputs | Mixed systems without logging, λ not clamped | tokenizer_mismatch.md |
| Cyrillic ISO 9 vs GOST produce different chunks | Analyzer mismatch per field | locale_drift.md |
| Arabic Buckwalter forms merge two entities | Alias collision, missing scope fence | proper_noun_aliases.md |

60-second fix checklist

  1. Wire alias view for documents that carry non-Latin scripts.
  2. Record the system used for any generated alias.
  3. Normalize only in alias fields for width and diacritics, never in canonical.
  4. Bias reranker to keep exact canonical hits above loose translits.
  5. Log ΔS and λ for native vs romanized queries and compare.
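Step 5 can be sketched with a cosine-distance proxy for ΔS. The proxy is an assumption for illustration; the WFGY ΔS definition may differ.

```python
import math

def delta_s(vec_q, vec_r):
    """ΔS proxy: 1 - cosine similarity between question and retrieved embeddings."""
    dot = sum(a * b for a, b in zip(vec_q, vec_r))
    nq = math.sqrt(sum(a * a for a in vec_q))
    nr = math.sqrt(sum(b * b for b in vec_r))
    return 1.0 - dot / (nq * nr)

def compare_forms(pairs):
    """pairs: {form_name: (q_vec, r_vec)}. Returns ΔS per form so native
    and romanized queries can be compared against the 0.45 target."""
    return {form: delta_s(q, r) for form, (q, r) in pairs.items()}
```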

Copy snippets

Alias expansion at ingest time (no external libs)

```python
import unicodedata as ud

def simple_pinyin_drop_tones(s: str) -> str:
    """Strip Pinyin tone marks so tone-less user input matches, e.g. 'Běijīng' -> 'Beijing'."""
    tone_map = str.maketrans("āáǎàēéěèīíǐìōóǒòūúǔùǖǘǚǜü",
                             "aaaaeeeeiiiioooouuuuuuuuu")
    return s.translate(tone_map)

def width_fold(s: str) -> str:
    """Simple NFKC fold: full-width Latin and half-width Kana collapse to canonical width."""
    return ud.normalize("NFKC", s)

def alias_pack(canonical: str, lang: str, romanizer_hint: str | None = None) -> list[str]:
    """Build a short, high-precision alias list. Never mutates canonical itself."""
    out = {canonical}
    if lang == "zh":
        out.add(simple_pinyin_drop_tones(canonical))
    # add more light rules per language as needed; when romanizer_hint is
    # supplied, record it in the doc's `romanizers` field for auditability
    return [width_fold(x) for x in out]

# usage: alias_pack("Běijīng", "zh") returns the canonical form plus "Beijing"
```

Prompt fence for romanizers

You have TXTOS and the WFGY Problem Map.

When the question or snippet contains a non-Latin name or term:
1) Try native script first. If the user input looks romanized, search both native and alias views.
2) Keep the canonical form in the final answer. Cite the exact snippet that contains the canonical form.
3) If multiple romanization systems match, state which system appears in the cited text.

Eval plan

Use a code-switching set with 5 languages and 10 entities each. For every entity build 3 questions:

  1. native script,
  2. romanized in system A,
  3. romanized in system B.

Run the suite with code_switching_eval.md.
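The suite above can be sketched as follows; `build_suite` and the question template are illustrative.

```python
def build_suite(entities):
    """entities: list of (native, rom_a, rom_b) triples per entity.
    Returns one question per form, tagged so recall and ranking can be
    compared across native and romanized inputs."""
    suite = []
    for native, rom_a, rom_b in entities:
        for form, text in (("native", native), ("system_a", rom_a), ("system_b", rom_b)):
            suite.append({"form": form, "question": f"Who or what is {text}?"})
    return suite
```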

Targets

  • top-k 10 recall across forms ≥ 0.85
  • ΔS(question, retrieved) ≤ 0.45 on the best hit
  • λ convergent across two seeds and three paraphrases

If recall is fine but ranking flips between systems, tighten reranker constraints and verify with retrieval-traceability.md.


🔗 Quick-Start Downloads (60 sec)

| Tool | Link | 3-Step Setup |
|---|---|---|
| WFGY 1.0 PDF Engine | Paper | 1. Download · 2. Upload to your LLM · 3. Ask "Answer using WFGY + <your question>" |
| TXT OS (plain-text OS) | TXTOS.txt | 1. Download · 2. Paste into any LLM chat · 3. Type "hello world" — OS boots instantly |

Explore More

| Layer | Page | What it's for |
|---|---|---|
| Proof | WFGY Recognition Map | External citations, integrations, and ecosystem proof |
| ⚙️ Engine | WFGY 1.0 | Original PDF tension engine and early logic sketch (legacy reference) |
| ⚙️ Engine | WFGY 2.0 | Production tension kernel for RAG and agent systems |
| ⚙️ Engine | WFGY 3.0 | TXT-based Singularity tension engine (131 S-class set) |
| 🗺️ Map | Problem Map 1.0 | Flagship 16-problem RAG failure taxonomy and fix map |
| 🗺️ Map | Problem Map 2.0 | Global Debug Card for RAG and agent pipeline diagnosis |
| 🗺️ Map | Problem Map 3.0 | Global AI troubleshooting atlas and failure pattern map |
| 🧰 App | TXT OS | .txt semantic OS with fast bootstrap |
| 🧰 App | Blah Blah Blah | Abstract and paradox Q&A built on TXT OS |
| 🧰 App | Blur Blur Blur | Text-to-image generation with semantic control |
| 🏡 Onboarding | Starter Village | Guided entry point for new users |

If this repository helped, starring it improves discovery so more builders can find the docs and tools.