mirror of
https://github.com/onestardao/WFGY.git
synced 2026-04-28 19:50:17 +00:00
225 lines
13 KiB
Markdown
225 lines
13 KiB
Markdown
# Romanization & Transliteration — Guardrails and Fix Pattern
|
||
|
||
<details>
|
||
<summary><strong>🧭 Quick Return to Map</strong></summary>
|
||
|
||
<br>
|
||
|
||
> You are in a sub-page of **Language**.
|
||
> To reorient, go back here:
|
||
>
|
||
> - [**Language** — multilingual processing and semantic alignment](./README.md)
|
||
> - [**WFGY Global Fix Map** — main Emergency Room, 300+ structured fixes](../README.md)
|
||
> - [**WFGY Problem Map 1.0** — 16 reproducible failure modes](../../README.md)
|
||
>
|
||
> Think of this page as a desk within a ward.
|
||
> If you need the full triage and all prescriptions, return to the Emergency Room lobby.
|
||
</details>
|
||
|
||
|
||
Make cross-script search and RAG stable when users type Latin transliterations of non-Latin names and terms. This page gives a minimal contract, store wiring, and tests so **Hepburn vs Kunrei**, **Pinyin vs mixed tone marks**, **RR vs MR**, **ISO9 vs GOST**, **Buckwalter vs ISO 233**, and similar systems do not break recall or flip ranking.
|
||
|
||
---
|
||
|
||
## Open these first
|
||
|
||
* Visual map and recovery → [rag-architecture-and-recovery.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/rag-architecture-and-recovery.md)
|
||
* End to end retrieval knobs → [retrieval-playbook.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/retrieval-playbook.md)
|
||
* Traceability and cite-then-explain → [retrieval-traceability.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/retrieval-traceability.md)
|
||
* Contract the payload → [data-contracts.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/data-contracts.md)
|
||
* Embedding vs meaning → [embedding-vs-semantic.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/embedding-vs-semantic.md)
|
||
* Tokenizer variance → [tokenizer\_mismatch.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Language/tokenizer_mismatch.md)
|
||
* Mixed scripts in one query → [script\_mixing.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Language/script_mixing.md)
|
||
* Locale normalization and width/diacritics → [locale\_drift.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Language/locale_drift.md)
|
||
* Names and brand aliases → [proper\_noun\_aliases.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Language/proper_noun_aliases.md)
|
||
* End to end multilingual playbook → [multilingual\_guide.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Language/multilingual_guide.md)
|
||
|
||
---
|
||
|
||
## Core acceptance targets
|
||
|
||
* ΔS(question, retrieved) ≤ 0.45 for native script, romanized, and accent-stripped variants
|
||
* Coverage of target section ≥ 0.70 under three paraphrases and two seeds
|
||
* λ remains convergent when switching romanizers inside the same language
|
||
* No false merges across entities when romanized forms collide
|
||
|
||
---
|
||
|
||
## Minimal contract
|
||
|
||
Add a small, explicit layer around romanization so behavior is auditable.
|
||
|
||
**Document side fields**
|
||
|
||
```
|
||
raw_text # untouched source
|
||
lang # BCP-47 primary tag
|
||
script # ISO 15924 (Han, Cyrl, Arab, Hira, Kana, Hang, etc.)
|
||
canonical # preferred display form for proper nouns if known
|
||
alias_tail # pipe-joined alias list incl. romanized forms
|
||
romanizers # systems observed for this doc: "pinyin|rr|hepburn|iso9|buckwalter"
|
||
```
|
||
|
||
**Query side context**
|
||
|
||
```
|
||
q_text # user input
|
||
q_lang_guess # detector result, nullable
|
||
q_script_guess # detector result, nullable
|
||
q_romanizer_hint # optional, from UI or logs, e.g. "hepburn"
|
||
```
|
||
|
||
**Rules**
|
||
|
||
* Never mutate `raw_text` or `canonical`.
|
||
* Romanized strings live only in `alias_tail` and store-specific synonym views.
|
||
* Record which systems were used. Mixing systems without a record increases ΔS variance.
|
||
|
||
---
|
||
|
||
## Store wiring
|
||
|
||
**BM25 style indexes**
|
||
|
||
* Keep `raw_text` with a locale-aware analyzer.
|
||
* Add a **synonym graph** on a separate field that contains romanized aliases.
|
||
* Apply **width normalization** and **diacritic strip** only in alias field. Keep canonical untouched. See [locale\_drift.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Language/locale_drift.md).
|
||
|
||
**Vector stores**
|
||
|
||
* Append `alias_tail` to the chunk text right after the first canonical mention.
|
||
* Keep short, high precision alias lists. Over-expansion harms meaning.
|
||
* If nearest neighbors look similar yet wrong, verify metric per [embedding-vs-semantic.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/embedding-vs-semantic.md).
|
||
|
||
**Hybrid**
|
||
|
||
* When BM25 yields an exact canonical match, bias reranker features to keep it above looser transliterations.
|
||
* Log ΔS and λ per candidate so you can see when a romanized neighbor outranks the native script without evidence.
|
||
|
||
---
|
||
|
||
## System map (examples)
|
||
|
||
| Language | Common systems | Notes |
|
||
| -------------------- | ----------------------------- | ---------------------------------------------------------------------------------- |
|
||
| Chinese | Pinyin (tone marks or digits) | Keep tone-less aliases for user input, but preserve tone marks in canonical forms. |
|
||
| Japanese | Hepburn, Kunrei, Nihon | Handle long vowels (ō vs ou) and small tsu. |
|
||
| Korean | RR (Revised Romanization), MR | Names often appear without hyphens, add both. |
|
||
| Russian and Cyrillic | ISO 9, GOST, BGN/PCGN | Map soft sign and yo/ë variants. |
|
||
| Arabic | Buckwalter, ISO 233, DMG | Decide on hamza and taa marbuta conventions, keep both if present in corpus. |
|
||
| Hebrew | SBL, Academy rules | Deal with mater lectionis and dagesh normalization. |
|
||
| Hindi and Indic | ITRANS, ISO 15919 | Normalize nukta forms. |
|
||
|
||
Keep this list in code comments and in your ops runbook, not only in the model prompt.
|
||
|
||
---
|
||
|
||
## Typical failure → fix
|
||
|
||
| Symptom | Likely cause | Open this |
|
||
| --------------------------------------------------- | -------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------- |
|
||
| Native script doc exists, romanized query misses it | no alias view built at index time | [retrieval-playbook.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/retrieval-playbook.md) |
|
||
| Romanized neighbor outranks exact canonical snippet | reranker features not constrained | [retrieval-traceability.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/retrieval-traceability.md) |
|
||
| Answers flip between Hepburn and Kunrei inputs | mixed systems without logging, λ not clamped | [tokenizer\_mismatch.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Language/tokenizer_mismatch.md) |
|
||
| Cyrillic ISO9 vs GOST produce different chunks | analyzer mismatch per field | [locale\_drift.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Language/locale_drift.md) |
|
||
| Arabic Buckwalter forms merge two entities | alias collision, missing scope fence | [proper\_noun\_aliases.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Language/proper_noun_aliases.md) |
|
||
|
||
---
|
||
|
||
## 60-second fix checklist
|
||
|
||
1. **Wire alias view** for documents that carry non-Latin scripts.
|
||
2. **Record the system** used for any generated alias.
|
||
3. **Normalize only in alias fields** for width and diacritics, never in canonical.
|
||
4. **Bias reranker** to keep exact canonical hits above loose translits.
|
||
5. **Log ΔS and λ** for native vs romanized queries and compare.
|
||
|
||
---
|
||
|
||
## Copy snippets
|
||
|
||
**Alias expansion at ingest time (no external libs)**
|
||
|
||
```python
|
||
def simple_pinyin_drop_tones(s: str) -> str:
|
||
tone_map = str.maketrans("āáǎàēéěèīíǐìōóǒòūúǔùǖǘǚǜü", "aaaaeeeeiiiioooouuuuuuuuu")
|
||
return s.translate(tone_map)
|
||
|
||
def width_fold(s: str) -> str:
|
||
# simple NFKC fold
|
||
import unicodedata as ud
|
||
return ud.normalize("NFKC", s)
|
||
|
||
def alias_pack(canonical: str, lang: str, romanizer_hint: str | None = None) -> list[str]:
|
||
out = {canonical}
|
||
if lang == "zh":
|
||
out.add(simple_pinyin_drop_tones(canonical))
|
||
# add more light rules per language as needed
|
||
return [width_fold(x) for x in out]
|
||
```
|
||
|
||
**Prompt fence for romanizers**
|
||
|
||
```
|
||
You have TXTOS and the WFGY Problem Map.
|
||
|
||
When the question or snippet contains a non-Latin name or term:
|
||
1) Try native script first. If the user input looks romanized, search both native and alias views.
|
||
2) Keep the canonical form in the final answer. Cite the exact snippet that contains the canonical form.
|
||
3) If multiple romanization systems match, state which system appears in the cited text.
|
||
```
|
||
|
||
---
|
||
|
||
## Eval plan
|
||
|
||
Use a **code-switching set** with 5 languages and 10 entities each. For every entity build 3 questions:
|
||
|
||
1. native script,
|
||
2. romanized in system A,
|
||
3. romanized in system B.
|
||
|
||
Run the suite with [code\_switching\_eval.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Language/code_switching_eval.md).
|
||
|
||
Targets
|
||
|
||
* top-k 10 recall across forms ≥ 0.85
|
||
* ΔS(question, retrieved) ≤ 0.45 on the best hit
|
||
* λ convergent across two seeds and three paraphrases
|
||
|
||
If recall is fine but ranking flips between systems, tighten reranker constraints and verify with [retrieval-traceability.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/retrieval-traceability.md).
|
||
|
||
---
|
||
|
||
### 🔗 Quick-Start Downloads (60 sec)
|
||
|
||
| Tool | Link | 3-Step Setup |
|
||
| -------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------ | ---------------------------------------------------------------------------------------- |
|
||
| **WFGY 1.0 PDF** | [Engine Paper](https://github.com/onestardao/WFGY/blob/main/I_am_not_lizardman/WFGY_All_Principles_Return_to_One_v1.0_PSBigBig_Public.pdf) | 1️⃣ Download · 2️⃣ Upload to your LLM · 3️⃣ Ask “Answer using WFGY + \<your question>” |
|
||
| **TXT OS (plain-text OS)** | [TXTOS.txt](https://github.com/onestardao/WFGY/blob/main/OS/TXTOS.txt) | 1️⃣ Download · 2️⃣ Paste into any LLM chat · 3️⃣ Type “hello world” — OS boots instantly |
|
||
|
||
---
|
||
|
||
<!-- WFGY_FOOTER_START -->
|
||
|
||
### Explore More
|
||
|
||
| Layer | Page | What it’s for |
|
||
| --- | --- | --- |
|
||
| ⭐ Proof | [WFGY Recognition Map](/recognition/README.md) | External citations, integrations, and ecosystem proof |
|
||
| ⚙️ Engine | [WFGY 1.0](/legacy/README.md) | Original PDF tension engine and early logic sketch (legacy reference) |
|
||
| ⚙️ Engine | [WFGY 2.0](/core/README.md) | Production tension kernel for RAG and agent systems |
|
||
| ⚙️ Engine | [WFGY 3.0](/TensionUniverse/EventHorizon/README.md) | TXT based Singularity tension engine (131 S class set) |
|
||
| 🗺️ Map | [Problem Map 1.0](/ProblemMap/README.md) | Flagship 16 problem RAG failure taxonomy and fix map |
|
||
| 🗺️ Map | [Problem Map 2.0](/ProblemMap/wfgy-rag-16-problem-map-global-debug-card.md) | Global Debug Card for RAG and agent pipeline diagnosis |
|
||
| 🗺️ Map | [Problem Map 3.0](/ProblemMap/wfgy-ai-problem-map-troubleshooting-atlas.md) | Global AI troubleshooting atlas and failure pattern map |
|
||
| 🧰 App | [TXT OS](/OS/README.md) | .txt semantic OS with fast bootstrap |
|
||
| 🧰 App | [Blah Blah Blah](/OS/BlahBlahBlah/README.md) | Abstract and paradox Q&A built on TXT OS |
|
||
| 🧰 App | [Blur Blur Blur](/OS/BlurBlurBlur/README.md) | Text to image generation with semantic control |
|
||
| 🏡 Onboarding | [Starter Village](/StarterVillage/README.md) | Guided entry point for new users |
|
||
|
||
If this repository helped, starring it improves discovery so more builders can find the docs and tools.
|
||
[](https://github.com/onestardao/WFGY)
|
||
|
||
<!-- WFGY_FOOTER_END -->
|
||
|