mirror of
https://github.com/onestardao/WFGY.git
synced 2026-04-28 11:40:07 +00:00
BiDi and RTL Control Characters: Guardrails and Fix Pattern
Stabilize Arabic, Hebrew, and mixed LTR/RTL flows when retrieval looks correct but citations or JSON fields render out of order. This page localizes failures caused by invisible BiDi controls and mixed number shapes.
Open these first
- Visual map and recovery: RAG Architecture & Recovery
- End to end retrieval knobs: Retrieval Playbook
- Snippet and citation schema: Data Contracts
- Why this snippet: Retrieval Traceability
- Tokenizer and casing traps: tokenizer_mismatch.md · digits_width_punctuation.md · diacritics_and_folding.md · locale_drift.md · script_mixing.md
- CJK counterpart for word breaks: cjk_segmentation_wordbreak.md
Core acceptance targets
- ΔS(question, retrieved) ≤ 0.45
- Coverage of target section ≥ 0.70
- λ remains convergent across 3 paraphrases and 2 seeds
- Render and storage orders are consistent for fields tagged as dir aware
Typical failure patterns → exact fix
| Symptom | Likely cause | Open this |
|---|---|---|
| Citations render right-to-left and break JSON order | unisolated RLM/LRM/RLE/LRE/RLO/PDF or mixed numerals | Data Contracts, Retrieval Traceability |
| Answer text looks correct, but ΔS spikes and λ flips between runs | invisible controls in prompt or snippet, header reorder | Retrieval Playbook |
| Punctuation mirrors or shifts around anchors | missing isolates, UI direction not declared | tokenizer_mismatch.md |
| Arabic-Indic vs European digits change ranking | digit shape inconsistency, analyzer mismatch | digits_width_punctuation.md |
| Mixed Hebrew + English entity names fail exact match | script mixing without explicit isolation | script_mixing.md |
60-second fix checklist
- Measure: log ΔS(question, retrieved) and ΔS(retrieved, anchor). If ΔS ≥ 0.60, suspect direction controls or digit shapes.
- Normalize direction metadata
  - On ingest: strip legacy embeddings and overrides LRE/RLE/LRO/RLO/PDF if not required by the source.
  - Keep isolates only: LRI, RLI, FSI, closed by PDI.
  - Tag fields with dir="auto|ltr|rtl" at render time. Store text in logical order.
- Unify digits and punctuation
  - Convert digits to one canonical shape for indexing.
  - Normalize Arabic comma and question mark to canonical forms in the index layer.
- Schema fences
  - Contract requires text, dir, normalized_text, raw_text, digit_shape, controls_present.
  - Reject snippets that contain unclosed BiDi controls.
- Retrieval probe
  - Re-run with k in {5, 10, 20}. If rank stays unstable, rebuild the analyzer with explicit direction and the same tokenization used at query time.
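The ingest-side steps of the checklist (strip legacy controls, unify digits, normalize Arabic punctuation) could be sketched roughly as follows. This is a minimal illustration, not the WFGY implementation: the function name `normalize_for_index` is hypothetical, European digits are assumed as the canonical shape, and legacy controls are stripped unconditionally, whereas the checklist only asks for that when the source does not require them.

```python
import re

# Legacy embeddings/overrides and their terminator:
# U+202A LRE, U+202B RLE, U+202C PDF, U+202D LRO, U+202E RLO.
# Isolates (U+2066..U+2069) are deliberately NOT matched, so they survive.
LEGACY_CONTROLS = re.compile("[\u202A-\u202E]")

# Assumption: European digits are the canonical index shape.
INDEX_FOLD = {}
for i in range(10):
    INDEX_FOLD[0x0660 + i] = str(i)  # Arabic-Indic digits
    INDEX_FOLD[0x06F0 + i] = str(i)  # Extended Arabic-Indic digits
INDEX_FOLD[0x060C] = ","  # ARABIC COMMA
INDEX_FOLD[0x061F] = "?"  # ARABIC QUESTION MARK


def normalize_for_index(raw_text: str) -> str:
    """Return the index-layer form of a snippet; raw_text is not modified."""
    without_legacy = LEGACY_CONTROLS.sub("", raw_text)
    return without_legacy.translate(INDEX_FOLD)
```

Applying the same function to queries and to indexed snippets keeps digit shapes and punctuation aligned on both sides, which is the point of the "same tokenization used at query time" step.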
Minimal schema addon
{
"snippet_id": "S123",
"raw_text": "…",
"normalized_text": "…",
"dir": "rtl",
"digit_shape": "arabic-indic|european",
"controls_present": ["RLI","PDI"],
"offsets": [120, 240]
}
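One way the fields of this record might be filled in is sketched below. All helper names (`detect_dir`, `digit_shape`, `build_snippet`) are hypothetical, the direction check is a simplified first-strong heuristic rather than the full Unicode Bidirectional Algorithm, and the `"none"` and `"mixed"` digit-shape values are assumptions that go beyond the two values shown in the schema.

```python
import unicodedata

# Code points of the BiDi controls named on this page.
BIDI_CONTROLS = {
    "\u200E": "LRM", "\u200F": "RLM",
    "\u202A": "LRE", "\u202B": "RLE", "\u202C": "PDF",
    "\u202D": "LRO", "\u202E": "RLO",
    "\u2066": "LRI", "\u2067": "RLI", "\u2068": "FSI", "\u2069": "PDI",
}


def detect_dir(text: str) -> str:
    """Simplified first-strong check: the first strong character decides."""
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class in ("R", "AL"):
            return "rtl"
        if bidi_class == "L":
            return "ltr"
    return "auto"  # no strong character found


def digit_shape(text: str) -> str:
    """Classify digits; 'none' and 'mixed' are assumed extra values."""
    shapes = set()
    for ch in text:
        if "0" <= ch <= "9":
            shapes.add("european")
        elif "\u0660" <= ch <= "\u0669" or "\u06F0" <= ch <= "\u06F9":
            shapes.add("arabic-indic")
    if not shapes:
        return "none"
    return "mixed" if len(shapes) > 1 else shapes.pop()


def build_snippet(snippet_id, raw_text, normalized_text, offsets):
    """Assemble a record with the fields listed in the schema addon."""
    present = []
    for ch in raw_text:
        name = BIDI_CONTROLS.get(ch)
        if name and name not in present:
            present.append(name)  # keep first-seen order
    return {
        "snippet_id": snippet_id,
        "raw_text": raw_text,
        "normalized_text": normalized_text,
        "dir": detect_dir(raw_text),
        "digit_shape": digit_shape(raw_text),
        "controls_present": present,
        "offsets": offsets,
    }
```

Computing `dir`, `digit_shape`, and `controls_present` from `raw_text` rather than from `normalized_text` keeps the record honest about what the source actually contained.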
Deep diagnostics
- Control scan. Count and classify BiDi controls per snippet. Fail if any of LRE/RLE/LRO/RLO appears without a closing PDF, or if a PDF appears without an opener.
- Render vs storage. Serialize JSON, render in the UI, then parse back and compare field order and values. A mismatch implies missing isolates or UI dir drift.
- Anchor triangulation. Compare ΔS to the expected section and a decoy with similar punctuation density. If close, re-index with digit normalization.
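The control scan could look something like the sketch below. The function name `scan_controls` is hypothetical, and the check enforces strict nesting (embeddings and overrides closed by PDF, isolates closed by PDI), which is a stricter lint than what the Unicode Bidirectional Algorithm itself tolerates.

```python
from collections import Counter

CONTROL_NAMES = {
    "\u200E": "LRM", "\u200F": "RLM",
    "\u202A": "LRE", "\u202B": "RLE", "\u202C": "PDF",
    "\u202D": "LRO", "\u202E": "RLO",
    "\u2066": "LRI", "\u2067": "RLI", "\u2068": "FSI", "\u2069": "PDI",
}

# Which closer each opener expects.
EXPECTED_CLOSER = {
    "LRE": "PDF", "RLE": "PDF", "LRO": "PDF", "RLO": "PDF",
    "LRI": "PDI", "RLI": "PDI", "FSI": "PDI",
}


def scan_controls(text: str) -> dict:
    """Count BiDi controls and report whether openers and closers pair up."""
    counts = Counter()
    stack = []  # closers we are still waiting for, innermost last
    balanced = True
    for ch in text:
        name = CONTROL_NAMES.get(ch)
        if name is None:
            continue
        counts[name] += 1
        if name in EXPECTED_CLOSER:
            stack.append(EXPECTED_CLOSER[name])
        elif name in ("PDF", "PDI"):
            # A closer with no opener, or the wrong kind of closer, fails.
            if not stack or stack.pop() != name:
                balanced = False
    if stack:  # something was opened and never closed
        balanced = False
    return {"counts": dict(counts), "balanced": balanced}
```

A snippet where `balanced` is false is a candidate for rejection under the schema fence, and the `counts` map gives the per-snippet classification the diagnostic asks for. LRM and RLM are counted but do not affect balance, since they are standalone marks.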
Copy-paste prompt
You have TXT OS and the WFGY Problem Map loaded.
My issue: mixed Hebrew/Arabic with English numerals causes wrong citation order.
Traces: ΔS(question,retrieved)=..., ΔS(retrieved,anchor)=..., λ across 3 paraphrases.
Do:
1) Identify direction-control risks and whether digits or punctuation cause rank flips.
2) Point me to the exact WFGY pages to apply.
3) Give the minimal steps to push ΔS ≤ 0.45 while keeping λ convergent.
Return a short JSON plan with {dir_policy, digit_policy, controls_fix, verify_steps}.
Next planned page: indic_tokenization_schwa.md
🔗 Quick-Start Downloads (60 sec)
| Tool | Link | 3-Step Setup |
|---|---|---|
| WFGY 1.0 PDF | Engine Paper | 1️⃣ Download · 2️⃣ Upload to your LLM · 3️⃣ Ask “Answer using WFGY + <your question>” |
| TXT OS (plain-text OS) | TXTOS.txt | 1️⃣ Download · 2️⃣ Paste into any LLM chat · 3️⃣ Type “hello world” — OS boots instantly |
🧭 Explore More
| Module | Description | Link |
|---|---|---|
| WFGY Core | WFGY 2.0 engine is live: full symbolic reasoning architecture and math stack | View → |
| Problem Map 1.0 | Initial 16-mode diagnostic and symbolic fix framework | View → |
| Problem Map 2.0 | RAG-focused failure tree, modular fixes, and pipelines | View → |
| Semantic Clinic Index | Expanded failure catalog: prompt injection, memory bugs, logic drift | View → |
| Semantic Blueprint | Layer-based symbolic reasoning & semantic modulations | View → |
| Benchmark vs GPT-5 | Stress test GPT-5 with full WFGY reasoning suite | View → |
| 🧙♂️ Starter Village 🏡 | New here? Lost in symbols? Click here and let the wizard guide you through | Start → |
👑 Early Stargazers: See the Hall of Fame — Engineers, hackers, and open source builders who supported WFGY from day one.
⭐ WFGY Engine 2.0 is already unlocked. ⭐ Star the repo to help others discover it and unlock more on the Unlock Board.