BiDi and RTL Control Characters: Guardrails and Fix Pattern

Stabilize Arabic, Hebrew, and mixed LTR/RTL flows when retrieval looks correct but citations or JSON fields render out of order. This page localizes failures caused by invisible BiDi controls and mixed number shapes.

Open these first

Visual map and recovery: RAG Architecture & Recovery
End to end retrieval knobs: Retrieval Playbook
Snippet and citation schema: Data Contracts
Why this snippet: Retrieval Traceability
Tokenizer and casing traps: tokenizer_mismatch.md · digits_width_punctuation.md · diacritics_and_folding.md · locale_drift.md · script_mixing.md
CJK counterpart for word breaks: cjk_segmentation_wordbreak.md

Core acceptance targets

ΔS(question, retrieved) ≤ 0.45
Coverage of target section ≥ 0.70
λ remains convergent across 3 paraphrases and 2 seeds
Render and storage orders are consistent for fields tagged as dir aware
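
The three numeric targets above can be checked in one gate. This is a rough sketch only: the function name `meets_targets` and the representation of λ as a list of per-run booleans (one entry per paraphrase and seed combination) are assumptions made for illustration, not part of WFGY. The render/storage consistency target is not covered here.

```python
def meets_targets(delta_s, coverage, lambda_convergent):
    """Return True when the numeric acceptance targets hold.

    delta_s           -- ΔS(question, retrieved) for the run under test
    coverage          -- fraction of the target section covered (0.0 to 1.0)
    lambda_convergent -- hypothetical list of booleans, one per paraphrase/seed run
    """
    if not lambda_convergent:
        # No runs recorded: treat as a failure rather than a vacuous pass.
        return False
    return delta_s <= 0.45 and coverage >= 0.70 and all(lambda_convergent)
```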

Typical failure patterns → exact fix

Symptom	Likely cause	Open this
Citations render right-to-left and break JSON order	unisolated RLM/LRM/RLE/LRE/RLO/PDF or mixed numerals	Data Contracts, Retrieval Traceability
Answer text looks correct, but ΔS spikes and λ flips between runs	invisible controls in prompt or snippet, header reorder	Retrieval Playbook
Punctuation mirrors or shifts around anchors	missing isolates, UI direction not declared	tokenizer_mismatch.md
Arabic-Indic vs European digits change ranking	digit shape inconsistency, analyzer mismatch	digits_width_punctuation.md
Mixed Hebrew + English entity names fail exact match	script mixing without explicit isolation	script_mixing.md

60-second fix checklist

Measure
- Log ΔS(question, retrieved) and ΔS(retrieved, anchor). If ΔS ≥ 0.60, suspect direction controls or digit shapes.
Normalize direction metadata
- On ingest: strip the legacy embedding and override controls (LRE/RLE/LRO/RLO and their terminator PDF) unless the source requires them.
- Keep isolates only: LRI, RLI, FSI, each closed by PDI.
- Tag fields with dir="auto|ltr|rtl" at render time. Store text in logical order.
Unify digits and punctuation
- Convert digits to one canonical shape for indexing.
- Normalize Arabic comma and question mark to canonical forms in the index layer.
Schema fences
- Contract requires text, dir, normalized_text, raw_text, digit_shape, controls_present.
- Reject snippets that contain unclosed BiDi controls.
Retrieval probe
- Re-run with k in {5, 10, 20}. If rank stays unstable, rebuild analyzer with explicit direction and the same tokenization used at query time.
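
The ingest-side normalization steps in this checklist could look roughly like the sketch below. The function name `normalize_for_index` is hypothetical. Choosing European (ASCII) digits and ASCII comma and question mark as the canonical forms is an assumption; any single shape works as long as the index and the query side agree. Isolates (LRI, RLI, FSI, PDI) are left in place.

```python
import unicodedata

# Legacy embedding/override controls LRE, RLE, PDF, LRO, RLO (U+202A..U+202E).
# Mapping a code point to None makes str.translate delete it.
LEGACY_CONTROLS = dict.fromkeys(range(0x202A, 0x202F))

# Assumed canonical punctuation: Arabic comma and question mark -> ASCII.
PUNCT_MAP = {0x060C: ",", 0x061F: "?"}


def normalize_for_index(text):
    """Strip legacy BiDi controls and unify digits and punctuation."""
    text = text.translate(LEGACY_CONTROLS)
    text = text.translate(PUNCT_MAP)
    # Every Unicode decimal digit (category Nd) becomes its ASCII counterpart.
    # This also folds full-width and other scripts' digits, not only Arabic-Indic.
    return "".join(
        str(unicodedata.decimal(ch)) if unicodedata.category(ch) == "Nd" else ch
        for ch in text
    )
```

Run the same function over queries before retrieval so both sides of the index see the same digit and punctuation shapes.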

Minimal schema addon

{
  "snippet_id": "S123",
  "raw_text": "…",
  "normalized_text": "…",
  "dir": "rtl",
  "digit_shape": "arabic-indic|european",
  "controls_present": ["RLI","PDI"],
  "offsets": [120, 240]
}
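
The `digit_shape` and `controls_present` fields can be derived from `raw_text` at ingest. The sketch below is one way to do that. The helper names are hypothetical, the extra values "mixed" and "none" are assumptions that the schema above does not define, and Extended Arabic-Indic (Persian) digits are grouped under "arabic-indic" as a simplification.

```python
# Short names for the BiDi control characters, keyed by character.
BIDI_CONTROL_NAMES = {
    "\u200e": "LRM", "\u200f": "RLM", "\u061c": "ALM",
    "\u202a": "LRE", "\u202b": "RLE", "\u202c": "PDF",
    "\u202d": "LRO", "\u202e": "RLO",
    "\u2066": "LRI", "\u2067": "RLI", "\u2068": "FSI", "\u2069": "PDI",
}


def controls_present(text):
    """List each BiDi control found in text, once, in order of first appearance."""
    seen = []
    for ch in text:
        name = BIDI_CONTROL_NAMES.get(ch)
        if name and name not in seen:
            seen.append(name)
    return seen


def digit_shape(text):
    """Classify digits as 'european', 'arabic-indic', 'mixed', or 'none'."""
    shapes = set()
    for ch in text:
        if "0" <= ch <= "9":
            shapes.add("european")
        elif "\u0660" <= ch <= "\u0669" or "\u06f0" <= ch <= "\u06f9":
            shapes.add("arabic-indic")
    if not shapes:
        return "none"
    if len(shapes) > 1:
        return "mixed"
    return shapes.pop()
```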

Deep diagnostics

Control scan. Count and classify BiDi controls per snippet. Fail if any LRE/RLE/LRO/RLO lacks a closing PDF, or if any LRI/RLI/FSI lacks a closing PDI.
Render vs storage. Serialize JSON, render in the UI, then parse back and compare field order and values. Mismatch implies missing isolates or UI dir drift.
Anchor triangulation. Compare ΔS to the expected section against ΔS to a decoy with similar punctuation density. If the two scores are close, re-index with digit normalization.
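
A minimal sketch of the control scan follows. The function name `scan_controls` and the shape of its result are hypothetical. It balances embeddings/overrides against PDF and isolates against PDI with two independent counters, which is a simplification of the full Unicode Bidirectional Algorithm (where a PDI also terminates embeddings opened inside its isolate).

```python
from collections import Counter

NAMES = {
    "\u202a": "LRE", "\u202b": "RLE", "\u202d": "LRO", "\u202e": "RLO",
    "\u202c": "PDF",
    "\u2066": "LRI", "\u2067": "RLI", "\u2068": "FSI",
    "\u2069": "PDI",
}
EMBED_OPENERS = {"\u202a", "\u202b", "\u202d", "\u202e"}
ISOLATE_OPENERS = {"\u2066", "\u2067", "\u2068"}
PDF = "\u202c"
PDI = "\u2069"


def scan_controls(text):
    """Count BiDi controls and report unclosed openers and stray closers."""
    counts = Counter()
    open_embeds = open_isolates = stray_closers = 0
    for ch in text:
        if ch in NAMES:
            counts[NAMES[ch]] += 1
        if ch in EMBED_OPENERS:
            open_embeds += 1
        elif ch == PDF:
            if open_embeds:
                open_embeds -= 1
            else:
                stray_closers += 1  # PDF with nothing to close
        elif ch in ISOLATE_OPENERS:
            open_isolates += 1
        elif ch == PDI:
            if open_isolates:
                open_isolates -= 1
            else:
                stray_closers += 1  # PDI with nothing to close
    unclosed = open_embeds + open_isolates
    return {
        "counts": dict(counts),
        "unclosed": unclosed,
        "stray_closers": stray_closers,
        "ok": unclosed == 0 and stray_closers == 0,
    }
```

A snippet whose result has `ok` set to False is a candidate for rejection under the schema fence.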

Copy-paste prompt

You have TXT OS and the WFGY Problem Map loaded.

My issue: mixed Hebrew/Arabic with English numerals causes wrong citation order.
Traces: ΔS(question,retrieved)=..., ΔS(retrieved,anchor)=..., λ across 3 paraphrases.

Do:
1) Identify direction-control risks and whether digits or punctuation cause rank flips.
2) Point me to the exact WFGY pages to apply.
3) Give the minimal steps to push ΔS ≤ 0.45 while keeping λ convergent.
Return a short JSON plan with {dir_policy, digit_policy, controls_fix, verify_steps}.

Next planned page: indic_tokenization_schwa.md

🔗 Quick-Start Downloads (60 sec)

Tool	Link	3-Step Setup
WFGY 1.0 PDF	Engine Paper	1️⃣ Download · 2️⃣ Upload to your LLM · 3️⃣ Ask “Answer using WFGY + <your question>”
TXT OS (plain-text OS)	TXTOS.txt	1️⃣ Download · 2️⃣ Paste into any LLM chat · 3️⃣ Type “hello world” — OS boots instantly

🧭 Explore More

Module	Description	Link
WFGY Core	WFGY 2.0 engine is live: full symbolic reasoning architecture and math stack	View →
Problem Map 1.0	Initial 16-mode diagnostic and symbolic fix framework	View →
Problem Map 2.0	RAG-focused failure tree, modular fixes, and pipelines	View →
Semantic Clinic Index	Expanded failure catalog: prompt injection, memory bugs, logic drift	View →
Semantic Blueprint	Layer-based symbolic reasoning & semantic modulations	View →
Benchmark vs GPT-5	Stress test GPT-5 with full WFGY reasoning suite	View →
🧙‍♂️ Starter Village 🏡	New here? Lost in symbols? Click here and let the wizard guide you through	Start →

👑 Early Stargazers: See the Hall of Fame — Engineers, hackers, and open source builders who supported WFGY from day one.

⭐ WFGY Engine 2.0 is already unlocked. ⭐ Star the repo to help others discover it and unlock more on the Unlock Board.
