10 KiB
Digits, Width, and Punctuation — Guardrails and Fix Pattern
🧭 Quick Return to Map
You are in a sub-page of LanguageLocale.
To reorient, go back here:
- LanguageLocale — localization, regional settings, and context adaptation
- WFGY Global Fix Map — main Emergency Room, 300+ structured fixes
- WFGY Problem Map 1.0 — 16 reproducible failure modes
Think of this page as a desk within a ward.
If you need the full triage and all prescriptions, return to the Emergency Room lobby.
Stabilize retrieval when digits, character width, and punctuation variants silently change tokenization and ranking. This page aligns numeric classes, width folding, quotes/hyphens, and exotic spaces across ingest → index → query → display.
Open these first
- Visual map and recovery: RAG Architecture & Recovery
- End-to-end retrieval knobs: Retrieval Playbook
- Traceability and snippet schema: Retrieval Traceability • Data Contracts
- Tokenizers and normalization: tokenizer_mismatch.md • locale_drift.md
- Reranking strategies: rerankers.md
When to use
- Arabic-Indic digits or CJK fullwidth digits never match Latin digits.
- Smart quotes, em/en/quasi hyphens, or dashes break phrase matching or offsets.
- Thousands separators vary by locale (
,vs.vs NBSP) and kill numeric recall. - Halfwidth vs fullwidth punctuation breaks token boundaries and citations.
- Mixed unit formats (
1,234.56vs1 234,56) return different snippets.
Core acceptance
- ΔS(question, retrieved) ≤ 0.45 on three paraphrases
- Coverage ≥ 0.70 to the correct section
- λ remains convergent across two seeds
- Offset parity: citation offsets match visible glyphs after normalization
- Numeric fold pass rate ≥ 0.98 on a 200-sample mixed-locale set
60-second checklist
-
Digit class fold
- Map Arabic-Indic and other locale digits to ASCII
0–9in a search_text view. - Keep visual_text unchanged for display. Store both. See Data Contracts.
- Map Arabic-Indic and other locale digits to ASCII
-
Width fold
- Convert fullwidth Latin letters, digits, and punctuation to halfwidth in
search_text. - Log a
width_fold=true|falseflag in snippet metadata. See locale_drift.md.
- Convert fullwidth Latin letters, digits, and punctuation to halfwidth in
-
Punctuation normalization
- Quotes: map “ ” ‘ ’ to " ' for search view; keep raw in display.
- Hyphens/dashes: map U+2010..U+2015 to ASCII
-in search view; track original in trace. - Spaces: collapse NBSP, NNBSP, thin/narrow spaces to ASCII space for search.
-
Number normalization
- Normalize thousands and decimal separators by locale rules into a canonical numeric token for search.
- Keep raw string for display; store a
numeric_normfield per snippet.
-
Analyzer parity
- Ensure store analyzers (BM25/ES/OpenSearch) apply the same width/digit/punct rules as embedding pre-processing.
-
Verify
- Three paraphrases, two seeds. Check ΔS, coverage, λ. Validate offsets visually.
Symptom map → exact fix
| Symptom | Likely cause | Open this |
|---|---|---|
١٢٣٤ fails to match 1234 |
digit classes differ | locale_drift.md • retrieval-playbook.md |
1234 (fullwidth) misses 1234 |
width fold missing | tokenizer_mismatch.md |
| “quoted phrase” not found | smart quotes not normalized | locale_drift.md |
co-founder vs co-founder mismatch |
hyphen/dash variants differ | tokenizer_mismatch.md |
1 234,56 vs 1,234.56 mismatch |
thousands/decimal separators differ | retrieval-traceability.md |
| Offsets jump after PDF/OCR | NBSP/soft hyphen/ZWJ artifacts | OCR Parsing Checklist |
Minimal field plan
visual_text: original text for display/citation.search_text: NFC+width fold+digit fold+punct fold+space fold.numeric_norm: canonical numbers extracted fromsearch_textwhen present.- Trace fields:
unicode_form,width_fold,digit_class,punct_fold,space_class.
Store notes
- Elasticsearch/OpenSearch: pin analyzers in index template; apply char filters for digit/width/quotes/hyphens; verify analyzer in both ingest and query.
- Vector stores: embed
search_text; keepvisual_textonly for display and exact citation strings. - Hybrid pipelines: run BM25 over
search_text, then cross-encoder rerank on raw snippets to protect nuance.
Repro test (gold set outline)
- Build a 50-item set mixing Arabic-Indic digits, fullwidth digits, smart quotes, multiple hyphens, NBSP, and locale number formats.
- Run retrieval before/after normalization; compute ΔS and coverage.
- Manually verify top-1 offsets against
visual_text. - Accept if ΔS ≤ 0.45, coverage ≥ 0.70, λ convergent, and offsets stable.
Copy-paste prompt
You have TXT OS and the WFGY Problem Map.
My bug: digits/width/punctuation drift.
Traces: ΔS=..., coverage=..., λ=..., examples: {١٢٣٤ vs 1234, 1234 vs 1234, “ ” vs " ", co- vs co-}.
Tell me:
1. failing normalization step and why,
2. exact WFGY pages to open,
3. minimal changes to push ΔS ≤ 0.45 and keep λ convergent,
4. a 50-item gold test to verify, including offset checks.
🔗 Quick-Start Downloads (60 sec)
| Tool | Link | 3-Step Setup |
|---|---|---|
| WFGY 1.0 PDF | Engine Paper | 1️⃣ Download · 2️⃣ Upload to your LLM · 3️⃣ Ask “Answer using WFGY + <your question>” |
| TXT OS (plain-text OS) | TXTOS.txt | 1️⃣ Download · 2️⃣ Paste into any LLM chat · 3️⃣ Type “hello world” — OS boots instantly |
🧭 Explore More
| Module | Description | Link |
|---|---|---|
| WFGY Core | WFGY 2.0 engine is live: full symbolic reasoning architecture and math stack | View → |
| Problem Map 1.0 | Initial 16-mode diagnostic and symbolic fix framework | View → |
| Problem Map 2.0 | RAG-focused failure tree, modular fixes, and pipelines | View → |
| Semantic Clinic Index | Expanded failure catalog: prompt injection, memory bugs, logic drift | View → |
| Semantic Blueprint | Layer-based symbolic reasoning & semantic modulations | View → |
| Benchmark vs GPT-5 | Stress test GPT-5 with full WFGY reasoning suite | View → |
| 🧙♂️ Starter Village 🏡 | New here? Lost in symbols? Click here and let the wizard guide you through | Start → |
👑 Early Stargazers: See the Hall of Fame —
Engineers, hackers, and open source builders who supported WFGY from day one.
⭐ WFGY Engine 2.0 is already unlocked. ⭐ Star the repo to help others discover it and unlock more on the Unlock Board.