# RTL & BiDi Control — Guardrails and Fix Pattern

Stabilize retrieval and reasoning when left-to-right content mixes with right-to-left scripts or invisible BiDi marks. No infra change is required. All fixes map back to WFGY pages with measurable targets.
## What this page is
- A compact repair guide for directionality bugs that flip tokens, citations, or numbers.
- Steps to normalize control characters, lock direction metadata, and keep offsets verifiable.
- Store-agnostic checks you can run in minutes.
## When to use
- Citations look correct to the eye but snippet offsets do not match.
- Punctuation or brackets render on the wrong side in answers.
- Arabic or Hebrew lines invert number order or collapse after parsing.
- JSON fields with mixed direction break validation or flip keys.
- Search returns near hits but ΔS stays high on RTL content.
## Open these first
- Visual map and recovery: RAG Architecture & Recovery
- End to end retrieval knobs: Retrieval Playbook
- Why this snippet (traceability schema): Retrieval Traceability
- Snippet and citation schema: Data Contracts
- Wrong-meaning hits despite high similarity: Embedding ≠ Semantic
- Hybrid instability and reorder issues: Rerankers
- Digits width and punctuation mix: Digits • Width • Punctuation
## Acceptance targets

- ΔS(question, retrieved) ≤ 0.45 on three paraphrases.
- Coverage of the target section ≥ 0.70.
- λ remains convergent across two seeds.
- Offsets verified after normalization on both query and snippet.
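The ΔS probe can be approximated locally. Below is a minimal sketch, assuming ΔS(question, retrieved) is 1 minus the cosine similarity of the two embedding vectors; the full WFGY definition lives in the engine paper, and the vectors come from whatever embedding store you already run:

```python
import math

def delta_s(q: list[float], r: list[float]) -> float:
    """Approximate ΔS(question, retrieved) as 1 - cosine(q, r).
    Assumption: your embedding store exposes raw vectors."""
    dot = sum(a * b for a, b in zip(q, r))
    norm_q = math.sqrt(sum(a * a for a in q))
    norm_r = math.sqrt(sum(b * b for b in r))
    return 1.0 - dot / (norm_q * norm_r)

# Identical vectors give ΔS = 0; orthogonal vectors give ΔS = 1.
```

Run the probe on three paraphrases of the same question; if any pair lands above 0.45, the failure is at the retrieval layer rather than the prompt.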
## Typical breakpoints → exact fix

- **Invisible BiDi marks inside snippets cause reversed punctuation or bracket order.**
  Fix: strip control code points during indexing and query pre-norm. Persist a `dir` flag on the clean text.
  Open: Data Contracts, Retrieval Traceability

- **Rendered order vs stored order mismatch makes citations fail.**
  Fix: compute character offsets on the normalized text only. Log the normalization pipeline in the trace.
  Open: Retrieval Traceability

- **Numbers flip in Arabic or Hebrew lines when Eastern Arabic digits mix with Latin punctuation.**
  Fix: normalize digits to a single system for retrieval. Keep the original form for display.
  Open: Digits • Width • Punctuation

- **JSON payloads break or tool calls mis-route because keys include RTL marks.**
  Fix: forbid control characters in keys through the schema; allow them in values only after normalization.
  Open: Prompt Injection, Data Contracts
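The first two fixes (strip control code points, persist a `dir` flag) can be sketched in Python. This is a minimal sketch, not a full UAX #9 implementation: `detect_dir` is an illustrative first-strong-character heuristic, and the function names are assumptions, not WFGY APIs:

```python
import unicodedata

# Invisible BiDi control code points that flip rendered order.
BIDI_CONTROLS = {
    0x200E, 0x200F,                   # LRM, RLM
    0x202A, 0x202B, 0x202C,           # LRE, RLE, PDF
    0x202D, 0x202E,                   # LRO, RLO
    0x2066, 0x2067, 0x2068, 0x2069,   # LRI, RLI, FSI, PDI
}

def strip_bidi_controls(text: str) -> str:
    """Remove invisible BiDi marks so stored order matches logical order."""
    return "".join(ch for ch in text if ord(ch) not in BIDI_CONTROLS)

def detect_dir(text: str) -> str:
    """Coarse direction flag from the first strong character:
    'rtl' for R/AL, 'ltr' for L, 'auto' when no strong character exists."""
    for ch in text:
        bidi = unicodedata.bidirectional(ch)
        if bidi in ("R", "AL"):
            return "rtl"
        if bidi == "L":
            return "ltr"
    return "auto"
```

Run `strip_bidi_controls` on both the indexed text and the incoming query, then store `detect_dir`'s result as the `dir` flag on the clean text.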
## 60-second fix checklist

- **Strip BiDi controls during ingest and query.** Remove these if present:
  LRM U+200E, RLM U+200F, LRE U+202A, RLE U+202B, LRO U+202D, RLO U+202E, PDF U+202C,
  LRI U+2066, RLI U+2067, FSI U+2068, PDI U+2069.
  Also normalize NBSP U+00A0 and ZWJ U+200D when they change tokenization.

- **Persist direction metadata.** Add `dir = "rtl" | "ltr" | "auto"` at the snippet and paragraph levels. Store it in the trace envelope.

- **Index on normalized text only.**
  - Normalize to NFC.
  - Strip BiDi marks.
  - Fold digits per store policy.
  - Keep the original text for rendering.

- **Contract the payload.** Require the fields `snippet_id`, `dir`, `norm_hash`, `offsets_on_norm`, `source_url`. Reject the payload if `dir` is missing on RTL sources.
  Open: Data Contracts

- **Probe λ_observe.** Vary k = 5, 10, 20. If ΔS stays flat and high, rebuild the index after normalization and re-verify offsets.
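The normalization and payload-contract steps above can be sketched in Python. The field names follow the contract in the checklist; the digit-folding table and the SHA-256 choice for `norm_hash` are illustrative assumptions, and `dir` is left at `"auto"` where a real direction probe would run:

```python
import hashlib
import unicodedata

# Invisible BiDi marks to strip before indexing.
BIDI = "\u200e\u200f\u202a\u202b\u202c\u202d\u202e\u2066\u2067\u2068\u2069"

# Illustrative digit folding: Eastern Arabic digits -> ASCII for retrieval.
ARABIC_INDIC = {ord(c): str(i) for i, c in enumerate("\u0660\u0661\u0662\u0663\u0664\u0665\u0666\u0667\u0668\u0669")}

def normalize_for_index(text: str) -> str:
    """NFC, strip BiDi marks, fold digits; keep the original text for rendering."""
    t = unicodedata.normalize("NFC", text)
    t = t.translate({ord(c): None for c in BIDI})
    return t.translate(ARABIC_INDIC)

def make_payload(snippet_id: str, text: str, source_url: str) -> dict:
    """Build a payload carrying the contracted fields; offsets refer to the
    normalized text, never the raw rendered form."""
    norm = normalize_for_index(text)
    return {
        "snippet_id": snippet_id,
        "dir": "auto",  # replace with a real first-strong-character probe
        "norm_hash": hashlib.sha256(norm.encode("utf-8")).hexdigest(),
        "offsets_on_norm": [0, len(norm)],
        "source_url": source_url,
    }
```

Storing `norm_hash` lets a later audit prove that the offsets were computed against the same normalized text the index saw.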
## Copy-paste prompt
You have TXT OS and the WFGY Problem Map loaded.
My multilingual issue:
* symptoms: punctuation flips or offsets fail on RTL lines
* traces: ΔS(question,retrieved)=..., λ across 3 paraphrases, direction flags
Tell me:
1. the failing layer and why,
2. the exact WFGY page to open,
3. the minimal steps to push ΔS ≤ 0.45 and keep λ convergent,
4. a reproducible check that verifies offsets after normalization.
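For step 4 of the prompt, a reproducible offset check can be sketched as follows; this is a minimal sketch in which `normalize` stands in for whatever ingest-time normalization you actually run, and the function names are assumptions:

```python
import unicodedata

# Invisible BiDi marks stripped at ingest.
BIDI = "\u200e\u200f\u202a\u202b\u202c\u202d\u202e\u2066\u2067\u2068\u2069"

def normalize(text: str) -> str:
    """Same normalization used at ingest: NFC plus BiDi-mark stripping."""
    t = unicodedata.normalize("NFC", text)
    return t.translate({ord(c): None for c in BIDI})

def offsets_verified(raw: str, start: int, end: int, cited: str) -> bool:
    """Offsets are valid only if slicing the *normalized* text at
    [start, end) reproduces the cited span exactly."""
    return normalize(raw)[start:end] == normalize(cited)

# A snippet with an invisible RLM: offsets computed on the raw text
# would drift by one, but they verify against the normalized text.
raw = "see \u200fsection 3"
assert offsets_verified(raw, 4, 13, "section 3")
```

If this check fails on a citation, the offsets were computed on rendered or raw text; rebuild them against the normalized form.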
## 🔗 Quick-Start Downloads (60 sec)
| Tool | Link | 3-Step Setup |
|---|---|---|
| WFGY 1.0 PDF | Engine Paper | 1️⃣ Download · 2️⃣ Upload to your LLM · 3️⃣ Ask “Answer using WFGY + ” |
| TXT OS (plain-text OS) | TXTOS.txt | 1️⃣ Download · 2️⃣ Paste into any LLM chat · 3️⃣ Type “hello world” — OS boots instantly |
## 🧭 Explore More
| Module | Description | Link |
|---|---|---|
| WFGY Core | WFGY 2.0 engine is live: full symbolic reasoning architecture and math stack | View → |
| Problem Map 1.0 | Initial 16-mode diagnostic and symbolic fix framework | View → |
| Problem Map 2.0 | RAG-focused failure tree, modular fixes, and pipelines | View → |
| Semantic Clinic Index | Expanded failure catalog: prompt injection, memory bugs, logic drift | View → |
| Semantic Blueprint | Layer-based symbolic reasoning & semantic modulations | View → |
| Benchmark vs GPT-5 | Stress test GPT-5 with full WFGY reasoning suite | View → |
| 🧙♂️ Starter Village 🏡 | New here? Lost in symbols? Click here and let the wizard guide you through | Start → |
👑 **Early Stargazers**: See the Hall of Fame — engineers, hackers, and open source builders who supported WFGY from day one.
⭐ WFGY Engine 2.0 is already unlocked. ⭐ Star the repo to help others discover it and unlock more on the Unlock Board.