WFGY/ProblemMap/GlobalFixMap/LanguageLocale/bidi_rtl_control_chars.md
2025-08-30 16:43:29 +08:00


BiDi and RTL Control Characters: Guardrails and Fix Pattern

Stabilize Arabic, Hebrew, and mixed LTR/RTL flows when retrieval looks correct but citations or JSON fields render out of order. This page localizes failures caused by invisible BiDi controls and mixed number shapes.

Open these first

Core acceptance targets

  • ΔS(question, retrieved) ≤ 0.45
  • Coverage of target section ≥ 0.70
  • λ remains convergent across 3 paraphrases and 2 seeds
  • Render and storage orders are consistent for fields tagged as dir aware

Typical failure patterns → exact fix

| Symptom | Likely cause | Open this |
|---|---|---|
| Citations render right-to-left and break JSON order | unisolated RLM/LRM/RLE/LRE/RLO/PDF or mixed numerals | Data Contracts, Retrieval Traceability |
| Answer text looks correct, but ΔS spikes and λ flips between runs | invisible controls in prompt or snippet, header reorder | Retrieval Playbook |
| Punctuation mirrors or shifts around anchors | missing isolates, UI direction not declared | tokenizer_mismatch.md |
| Arabic-Indic vs European digits change ranking | digit shape inconsistency, analyzer mismatch | digits_width_punctuation.md |
| Mixed Hebrew + English entity names fail exact match | script mixing without explicit isolation | script_mixing.md |

60-second fix checklist

  1. Measure

    • Log ΔS(question, retrieved) and ΔS(retrieved, anchor). If ΔS ≥ 0.60, suspect direction controls or digit shapes.

  2. Normalize direction metadata

    • On ingest: strip the legacy embedding and override controls (LRE, RLE, LRO, RLO) and their terminator PDF unless the source requires them.
    • Keep isolates only: LRI, RLI, FSI, closed by PDI.
    • Tag fields with dir="auto|ltr|rtl" at render time. Store as logical order.
  3. Unify digits and punctuation

    • Convert digits to one canonical shape for indexing.
    • Normalize Arabic comma and question mark to canonical forms in the index layer.
  4. Schema fences

    • Contract requires text, dir, normalized_text, raw_text, digit_shape, controls_present.
    • Reject snippets that contain unclosed BiDi controls.
  5. Retrieval probe

    • Re-run with k in {5, 10, 20}. If rank stays unstable, rebuild analyzer with explicit direction and the same tokenization used at query time.
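
Steps 2 and 3 of the checklist can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the WFGY implementation: the function names `strip_legacy_controls` and `normalize_for_index` are hypothetical, European digits are assumed as the canonical shape, and the Arabic comma and question mark are assumed to fold to their ASCII counterparts.

```python
# Legacy embedding/override controls and their terminator (PDF).
# Isolates (LRI, RLI, FSI, PDI) are deliberately left untouched.
LEGACY_CONTROLS = {
    "\u202A",  # LRE
    "\u202B",  # RLE
    "\u202C",  # PDF
    "\u202D",  # LRO
    "\u202E",  # RLO
}

# Assumed index-layer policy: map Arabic-Indic (U+0660..U+0669) and
# Extended Arabic-Indic (U+06F0..U+06F9) digits to European digits,
# and fold the Arabic comma and question mark to ASCII.
INDEX_MAP = {0x0660 + i: str(i) for i in range(10)}
INDEX_MAP.update({0x06F0 + i: str(i) for i in range(10)})
INDEX_MAP.update({0x060C: ",", 0x061F: "?"})


def strip_legacy_controls(text: str) -> str:
    """Drop LRE/RLE/LRO/RLO/PDF while keeping every other character."""
    return "".join(ch for ch in text if ch not in LEGACY_CONTROLS)


def normalize_for_index(text: str) -> str:
    """Produce the text that goes into the index; raw text is stored separately."""
    return strip_legacy_controls(text).translate(INDEX_MAP)
```

Whatever policy is chosen, the same function would need to run on both documents and queries, so that the index and the query side see the same digit and punctuation shapes.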

Minimal schema addon

{
  "snippet_id": "S123",
  "raw_text": "…",
  "normalized_text": "…",
  "dir": "rtl",
  "digit_shape": "arabic-indic|european",
  "controls_present": ["RLI","PDI"],
  "offsets": [120, 240]
}
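
One way to populate these fields is sketched below. The helper names (`detect_dir`, `detect_digit_shape`, `list_controls`, `build_record`) are hypothetical, the direction is guessed from the first strong character (a heuristic, not a full BiDi resolution), and the extra `digit_shape` values `"mixed"` and `None` are assumptions that go beyond the two values listed in the schema.

```python
import json
import unicodedata

CONTROL_NAMES = {
    "\u200E": "LRM", "\u200F": "RLM", "\u061C": "ALM",
    "\u202A": "LRE", "\u202B": "RLE", "\u202C": "PDF",
    "\u202D": "LRO", "\u202E": "RLO",
    "\u2066": "LRI", "\u2067": "RLI", "\u2068": "FSI", "\u2069": "PDI",
}


def detect_dir(text):
    """Return 'ltr' or 'rtl' from the first strong character, else 'auto'."""
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":
            return "ltr"
        if bidi_class in ("R", "AL"):
            return "rtl"
    return "auto"


def detect_digit_shape(text):
    """Classify digits as 'arabic-indic', 'european', 'mixed', or None."""
    arabic = any("\u0660" <= c <= "\u0669" or "\u06F0" <= c <= "\u06F9" for c in text)
    european = any("0" <= c <= "9" for c in text)
    if arabic and european:
        return "mixed"
    if arabic:
        return "arabic-indic"
    return "european" if european else None


def list_controls(text):
    """Names of BiDi controls in order of first appearance, without repeats."""
    seen = []
    for ch in text:
        name = CONTROL_NAMES.get(ch)
        if name and name not in seen:
            seen.append(name)
    return seen


def build_record(snippet_id, raw_text, normalized_text, offsets):
    """Assemble a record with the schema fields; normalization happens upstream."""
    return {
        "snippet_id": snippet_id,
        "raw_text": raw_text,
        "normalized_text": normalized_text,
        "dir": detect_dir(raw_text),
        "digit_shape": detect_digit_shape(raw_text),
        "controls_present": list_controls(raw_text),
        "offsets": offsets,
    }


if __name__ == "__main__":
    record = build_record("S123", "\u2067\u05E9\u05DC\u05D5\u05DD 42\u2069",
                          "\u05E9\u05DC\u05D5\u05DD 42", [120, 240])
    print(json.dumps(record, ensure_ascii=False, indent=2))
```

Deriving `dir`, `digit_shape`, and `controls_present` from `raw_text` rather than from `normalized_text` keeps the record honest about what the source actually contained.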

Deep diagnostics

  • Control scan. Count and classify BiDi controls per snippet. Fail if any LRE/RLE/LRO/RLO lacks a closing PDF, or any LRI/RLI/FSI lacks a closing PDI.
  • Render vs storage. Serialize JSON, render in the UI, then parse back and compare field order and values. Mismatch implies missing isolates or UI dir drift.
  • Anchor triangulation. Compare ΔS to the expected section and a decoy with similar punctuation density. If close, re-index with digit normalization.
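
The control scan can be sketched as a small stack-based checker. The function name `scan_controls` and the issue strings are hypothetical, and the check assumes strict nesting, which is deliberately stricter than what the Unicode Bidirectional Algorithm itself tolerates (a renderer silently ignores stray terminators and closes open controls at the end of a paragraph).

```python
from collections import Counter

# Each opener maps to (name, required closing character).
OPENERS = {
    "\u202A": ("LRE", "\u202C"),
    "\u202B": ("RLE", "\u202C"),
    "\u202D": ("LRO", "\u202C"),
    "\u202E": ("RLO", "\u202C"),
    "\u2066": ("LRI", "\u2069"),
    "\u2067": ("RLI", "\u2069"),
    "\u2068": ("FSI", "\u2069"),
}
CLOSERS = {"\u202C": "PDF", "\u2069": "PDI"}


def scan_controls(text):
    """Return (counts per control name, list of balance issues)."""
    counts = Counter()
    stack = []   # entries: (name, expected closer, position)
    issues = []
    for pos, ch in enumerate(text):
        if ch in OPENERS:
            name, closer = OPENERS[ch]
            counts[name] += 1
            stack.append((name, closer, pos))
        elif ch in CLOSERS:
            counts[CLOSERS[ch]] += 1
            if stack and stack[-1][1] == ch:
                stack.pop()  # properly closes the innermost opener
            else:
                issues.append(f"stray {CLOSERS[ch]} at {pos}")
    # Anything still on the stack was never closed.
    issues.extend(f"unclosed {name} at {pos}" for name, _, pos in stack)
    return counts, issues
```

A snippet would be rejected whenever the returned issue list is non-empty; the counts can be logged alongside `controls_present` for later triage.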

Copy-paste prompt

You have TXT OS and the WFGY Problem Map loaded.

My issue: mixed Hebrew/Arabic with English numerals causes wrong citation order.
Traces: ΔS(question,retrieved)=..., ΔS(retrieved,anchor)=..., λ across 3 paraphrases.

Do:
1) Identify direction-control risks and whether digits or punctuation cause rank flips.
2) Point me to the exact WFGY pages to apply.
3) Give the minimal steps to push ΔS ≤ 0.45 while keeping λ convergent.
Return a short JSON plan with {dir_policy, digit_policy, controls_fix, verify_steps}.

Next planned page: indic_tokenization_schwa.md

