# Digits, Width, and Punctuation — Guardrails and Fix Pattern
Stabilize retrieval when digits, character width, and punctuation variants silently change tokenization and ranking. This page aligns numeric classes, width folding, quotes/hyphens, and exotic spaces across ingest → index → query → display.
## Open these first
- Visual map and recovery: RAG Architecture & Recovery
- End-to-end retrieval knobs: Retrieval Playbook
- Traceability and snippet schema: Retrieval Traceability • Data Contracts
- Tokenizers and normalization: tokenizer_mismatch.md • locale_drift.md
- Reranking strategies: rerankers.md
## When to use
- Arabic-Indic digits or CJK fullwidth digits never match Latin digits.
- Smart quotes, em/en/quasi hyphens, or dashes break phrase matching or offsets.
- Thousands separators vary by locale (`,` vs `.` vs NBSP) and kill numeric recall.
- Halfwidth vs fullwidth punctuation breaks token boundaries and citations.
- Mixed number formats (`1,234.56` vs `1 234,56`) return different snippets.
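A quick way to see this failure class in isolation, as a minimal Python sketch using the stdlib `unicodedata` module: every variant is a distinct codepoint sequence, and even NFKC normalization only covers part of the problem.

```python
import unicodedata

# Exact matching fails: each variant is a different codepoint sequence,
# so term matching, BM25, and citation offsets silently diverge.
assert "١٢٣٤" != "1234"          # Arabic-Indic vs ASCII digits
assert "１２３４" != "1234"        # fullwidth vs halfwidth digits
assert "“quote”" != '"quote"'     # smart vs ASCII quotes

# NFKC alone is not enough: it folds width, but leaves Arabic-Indic
# digits and smart quotes untouched, so a dedicated fold pass is needed.
assert unicodedata.normalize("NFKC", "１２３４") == "1234"
assert unicodedata.normalize("NFKC", "١٢٣٤") != "1234"
assert unicodedata.normalize("NFKC", "“quote”") != '"quote"'
```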
## Core acceptance
- ΔS(question, retrieved) ≤ 0.45 on three paraphrases
- Coverage ≥ 0.70 to the correct section
- λ remains convergent across two seeds
- Offset parity: citation offsets match visible glyphs after normalization
- Numeric fold pass rate ≥ 0.98 on a 200-sample mixed-locale set
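These gates can be wired into CI as a single boolean check. A minimal sketch, assuming ΔS per paraphrase, coverage, λ convergence, and the fold pass rate are all computed upstream by your eval harness:

```python
def acceptance_gate(delta_s: list[float], coverage: float,
                    lambda_convergent: bool, fold_pass_rate: float) -> bool:
    """Return True only if all core acceptance targets are met."""
    return (
        all(ds <= 0.45 for ds in delta_s)   # ΔS on each of three paraphrases
        and coverage >= 0.70                 # coverage to the correct section
        and lambda_convergent                # λ stable across two seeds
        and fold_pass_rate >= 0.98           # 200-sample mixed-locale fold set
    )
```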
## 60-second checklist

1. **Digit class fold**
   - Map Arabic-Indic and other locale digits to ASCII `0–9` in a `search_text` view.
   - Keep `visual_text` unchanged for display. Store both. See Data Contracts.
2. **Width fold**
   - Convert fullwidth Latin letters, digits, and punctuation to halfwidth in `search_text`.
   - Log a `width_fold=true|false` flag in snippet metadata. See locale_drift.md.
3. **Punctuation normalization**
   - Quotes: map “ ” ‘ ’ to " ' in the search view; keep raw in display.
   - Hyphens/dashes: map U+2010..U+2015 to ASCII `-` in the search view; track the original in the trace.
   - Spaces: collapse NBSP, NNBSP, and thin/narrow spaces to ASCII space for search.
4. **Number normalization**
   - Normalize thousands and decimal separators by locale rules into a canonical numeric token for search.
   - Keep the raw string for display; store a `numeric_norm` field per snippet.
5. **Analyzer parity**
   - Ensure store analyzers (BM25/ES/OpenSearch) apply the same width/digit/punct rules as the embedding pre-processing.
6. **Verify**
   - Run three paraphrases on two seeds. Check ΔS, coverage, and λ. Validate offsets visually.
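The digit, width, punctuation, and space folds above can be sketched as a single `search_text` builder. A minimal Python sketch: the fold tables here are abbreviated examples, and the exact maps should mirror whatever your analyzer chain applies.

```python
import unicodedata

# Example fold tables (abbreviated; extend to match your analyzer chain).
QUOTES = {"\u201c": '"', "\u201d": '"', "\u2018": "'", "\u2019": "'"}
DASHES = {chr(cp): "-" for cp in range(0x2010, 0x2016)}   # U+2010..U+2015
SPACES = {"\u2009": " ", "\u200a": " ", "\u202f": " "}    # thin/hair/narrow

def to_search_text(visual_text: str) -> str:
    """Build the search_text view; visual_text stays untouched for display."""
    # 1. Width fold: NFKC maps fullwidth letters/digits/punct to halfwidth
    #    (and folds NBSP and NNBSP to a plain space as a side effect).
    s = unicodedata.normalize("NFKC", visual_text)
    # 2. Digit class fold: any decimal-digit codepoint (category Nd),
    #    e.g. Arabic-Indic ١٢٣, maps to its ASCII value.
    s = "".join(
        str(unicodedata.decimal(ch)) if unicodedata.category(ch) == "Nd" else ch
        for ch in s
    )
    # 3. Punctuation and space folds.
    for table in (QUOTES, DASHES, SPACES):
        for src, dst in table.items():
            s = s.replace(src, dst)
    return s
```

For example, `to_search_text("co\u2011founder")` yields `"co-founder"`, so a non-breaking hyphen no longer splits the phrase from its ASCII form.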
## Symptom map → exact fix

| Symptom | Likely cause | Open this |
|---|---|---|
| `١٢٣٤` fails to match `1234` | digit classes differ | locale_drift.md • retrieval-playbook.md |
| `１２３４` (fullwidth) misses `1234` | width fold missing | tokenizer_mismatch.md |
| “quoted phrase” not found | smart quotes not normalized | locale_drift.md |
| `co‑founder` vs `co-founder` mismatch | hyphen/dash variants differ | tokenizer_mismatch.md |
| `1 234,56` vs `1,234.56` mismatch | thousands/decimal separators differ | retrieval-traceability.md |
| Offsets jump after PDF/OCR | NBSP/soft hyphen/ZWJ artifacts | OCR Parsing Checklist |
## Minimal field plan

- `visual_text`: original text for display/citation.
- `search_text`: NFC + width fold + digit fold + punct fold + space fold.
- `numeric_norm`: canonical numbers extracted from `search_text` when present.
- Trace fields: `unicode_form`, `width_fold`, `digit_class`, `punct_fold`, `space_class`.
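A minimal sketch of this field plan as a record type; the field names follow the plan above, and the trace dict is assumed to hold the listed trace keys.

```python
from dataclasses import dataclass, field

@dataclass
class Snippet:
    visual_text: str                 # raw glyphs for display and citation offsets
    search_text: str                 # NFC + width/digit/punct/space folds
    numeric_norm: list[str] = field(default_factory=list)  # canonical numbers, if any
    trace: dict = field(default_factory=dict)  # unicode_form, width_fold, digit_class, ...
```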
## Store notes

- Elasticsearch/OpenSearch: pin analyzers in the index template; apply char filters for digits/width/quotes/hyphens; verify the analyzer in both ingest and query.
- Vector stores: embed `search_text`; keep `visual_text` only for display and exact citation strings.
- Hybrid pipelines: run BM25 over `search_text`, then cross-encoder rerank on raw snippets to protect nuance.
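For analyzer parity, the fold rules can be pinned in the index template using the Elasticsearch/OpenSearch `mapping` char filter. A sketch expressed as a Python dict: the names `fold_chars` and `search_text_analyzer` are placeholders, and the mapping list is abbreviated, not exhaustive.

```python
index_settings = {
    "settings": {
        "analysis": {
            "char_filter": {
                "fold_chars": {
                    "type": "mapping",
                    "mappings": [
                        "\u201c => \"", "\u201d => \"",  # smart double quotes
                        "\u2013 => -", "\u2014 => -",    # en/em dash
                        "\u00a0 => \u0020",              # NBSP -> space
                        "\u0661 => 1", "\u0662 => 2",    # Arabic-Indic digits (sample)
                    ],
                }
            },
            "analyzer": {
                "search_text_analyzer": {
                    "type": "custom",
                    "char_filter": ["fold_chars"],       # same folds at ingest and query
                    "tokenizer": "standard",
                    "filter": ["lowercase"],
                }
            },
        }
    }
}
```

The same settings object should be used when creating the index and when testing the analyzer, so ingest and query can never drift apart.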
## Repro test (gold set outline)

- Build a 50-item set mixing Arabic-Indic digits, fullwidth digits, smart quotes, multiple hyphens, NBSP, and locale number formats.
- Run retrieval before and after normalization; compute ΔS and coverage.
- Manually verify top-1 offsets against `visual_text`.
- Accept if ΔS ≤ 0.45, coverage ≥ 0.70, λ convergent, and offsets stable.
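The fold-rate portion of this test can be sketched as a small scorer; the gold pairs and the `normalize` callable are supplied by your own pipeline.

```python
def fold_pass_rate(gold: list[tuple[str, str]], normalize) -> float:
    """gold: (variant, canonical) pairs that must fold to identical search text."""
    hits = sum(
        1 for variant, canonical in gold
        if normalize(variant) == normalize(canonical)
    )
    return hits / len(gold)
```

Run it before and after the normalization change; the acceptance target above is a rate of at least 0.98.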
## Copy-paste prompt

```txt
You have TXT OS and the WFGY Problem Map.
My bug: digits/width/punctuation drift.
Traces: ΔS=..., coverage=..., λ=..., examples: {١٢٣٤ vs 1234, １２３４ vs 1234, “ ” vs " ", co‑ vs co-}.
Tell me:
1. failing normalization step and why,
2. exact WFGY pages to open,
3. minimal changes to push ΔS ≤ 0.45 and keep λ convergent,
4. a 50-item gold test to verify, including offset checks.
```
## 🔗 Quick-Start Downloads (60 sec)
| Tool | Link | 3-Step Setup |
|---|---|---|
| WFGY 1.0 PDF | Engine Paper | 1️⃣ Download · 2️⃣ Upload to your LLM · 3️⃣ Ask “Answer using WFGY + <your question>” |
| TXT OS (plain-text OS) | TXTOS.txt | 1️⃣ Download · 2️⃣ Paste into any LLM chat · 3️⃣ Type “hello world” — OS boots instantly |
## 🧭 Explore More
| Module | Description | Link |
|---|---|---|
| WFGY Core | WFGY 2.0 engine is live: full symbolic reasoning architecture and math stack | View → |
| Problem Map 1.0 | Initial 16-mode diagnostic and symbolic fix framework | View → |
| Problem Map 2.0 | RAG-focused failure tree, modular fixes, and pipelines | View → |
| Semantic Clinic Index | Expanded failure catalog: prompt injection, memory bugs, logic drift | View → |
| Semantic Blueprint | Layer-based symbolic reasoning & semantic modulations | View → |
| Benchmark vs GPT-5 | Stress test GPT-5 with full WFGY reasoning suite | View → |
| 🧙‍♂️ Starter Village 🏡 | New here? Lost in symbols? Click here and let the wizard guide you through | Start → |
👑 Early Stargazers: See the Hall of Fame —
Engineers, hackers, and open source builders who supported WFGY from day one.
⭐ WFGY Engine 2.0 is already unlocked. ⭐ Star the repo to help others discover it and unlock more on the Unlock Board.