Tokenization & Casing — OCR Parsing Guardrails

🧭 Quick Return to Map

You are in a sub-page of OCR_Parsing.
To reorient, go back here:

OCR_Parsing — text recognition and document structure parsing

WFGY Global Fix Map — main Emergency Room, 300+ structured fixes

WFGY Problem Map 1.0 — 16 reproducible failure modes

Think of this page as a desk within a ward.
If you need the full triage and all prescriptions, return to the Emergency Room lobby.

A focused fix page for post-OCR text where casing, spaces, or token boundaries are corrupted. Use this to normalize the stream before chunking/embedding, and verify with measurable targets. Works across Tesseract, Google DocAI, Azure OCR, ABBYY, PaddleOCR, and custom engines.

When to use this page

Words are split or glued (e.g., re tri eval, metadataindex).
Case flaps mid-sentence (tHE DocUment), acronyms collapse (R a G).
Invisible characters or double spaces change token counts.
Chunkers behave inconsistently between runs with the same image/PDF.
Embedding recall looks fine locally but retrieval ΔS stays high.

Open these first

Visual map and recovery: RAG Architecture & Recovery
Chunking checklist: Chunking Checklist
Retrieval traceability (cite-then-explain schema): Retrieval Traceability
Payload schema fences: Data Contracts
Embedding vs meaning (when token noise leaks into vectors): Embedding ≠ Semantic

Acceptance targets

ΔS(question, retrieved) ≤ 0.45 after normalization.
Coverage to target section ≥ 0.70 on three paraphrases.
λ remains convergent across two seeds.
Token count variance for the same page ≤ 1 percent after normalization.

Symptoms → exact fix

Mid-token splits or merges (infor mation, vectorstoreindex)
Fix No.1: Rejoin by dictionary and layout anchors.
Use wordlist + n-gram agree check, prefer joins that reduce ΔS on a small gold set.
See: Chunking Checklist
Casing drift (random upper/lower), acronym scatter (r.a.g.)
Fix No.2: Casing normalization with protected spans.
Protect enums, acronyms, chemical names, LaTeX blocks, code fonts.
Whitespace noise (NBSP, thin space, double space)
Fix No.3: Unicode normalization + space collapse, keep offsets table.
Record before/after offset map to preserve citation alignment.
See: Retrieval Traceability
Punctuation misreads (l vs 1, O vs 0, , vs .)
Fix No.4: Confusable set pass + local language model vote.
Only apply inside numeric or acronym contexts, keep audit log.
Tokenizer mismatch across components
Fix No.5: Single tokenizer contract for parse → chunk → embed.
Declare tokenizer_name, lowercase, strip_accents in payload.
See: Data Contracts

60-second fix checklist

Normalize
- Unicode NFC, strip BOM, collapse spaces except inside code/math blocks.
- Replace NBSP and thin spaces with ASCII space.
- Build offset_map old→new for citations.
Protect
- Detect protected spans: URLs, emails, hex, code, LaTeX, table headers, known acronyms.
- Freeze casing and punctuation inside protected spans.
Rejoin / Split
- Rejoin candidates by dictionary + bigram score.
- Split stuck words when edit distance to dictionary is lower after split.
Contract
- Emit tokenizer_contract.json:
  { "tokenizer": "bert-base-uncased", "lowercase": true, "strip_accents": true }
  Attach to every downstream step.
  See: Data Contracts
Verify
- Recompute ΔS on a 20-question gold set.
- If ΔS stays ≥ 0.60, open Embedding ≠ Semantic and rebuild index with the same tokenizer contract.

Minimal pipeline patch (pseudo)

text = ocr_text

text = unicode_normalize_nfc(text)
text, offset_map = collapse_spaces_with_offsets(text)
spans = detect_protected_spans(text)  # urls, code, latex, acronyms
text = normalize_casing(text, protect=spans)

cands = find_split_merge_candidates(text, protect=spans)
text = apply_split_merge(text, cands, scorer="bigram+dict")

emit_payload(
  content=text,
  offset_map=offset_map,
  tokenizer_contract={
    "tokenizer": "bert-base-uncased",
    "lowercase": True,
    "strip_accents": True
  }
)

Keep the offset map, or citations will drift. Tracing and schema rules come from: Retrieval Traceability · Data Contracts

Eval recipe

Build a tiny page-level gold: 10–20 questions with expected anchor sections.
Measure before/after: ΔS(question, retrieved), coverage, λ states.
Acceptance to sign off: ΔS ≤ 0.45, coverage ≥ 0.70, λ convergent on both seeds.
Log token count and confusable corrections per page for audit.

When to escalate

ΔS remains high after normalization and re-chunking Open: Embedding ≠ Semantic
Citations drift after formatting changes Open: Retrieval Traceability
Layout destroys sentence flow or table cells Open: OCR_Parsing/table_parsing.md once added, and the general Chunking Checklist

🔗 Quick-Start Downloads (60 sec)

Tool	Link	3-Step Setup
WFGY 1.0 PDF	Engine Paper	1️⃣ Download · 2️⃣ Upload to your LLM · 3️⃣ Ask “Answer using WFGY + <your question>”
TXT OS (plain-text OS)	TXTOS.txt	1️⃣ Download · 2️⃣ Paste into any LLM chat · 3️⃣ Type “hello world” — OS boots instantly

🧭 Explore More

Module	Description	Link
WFGY Core	WFGY 2.0 engine is live: full symbolic reasoning architecture and math stack	View →
Problem Map 1.0	Initial 16-mode diagnostic and symbolic fix framework	View →
Problem Map 2.0	RAG-focused failure tree, modular fixes, and pipelines	View →
Semantic Clinic Index	Expanded failure catalog: prompt injection, memory bugs, logic drift	View →
Semantic Blueprint	Layer-based symbolic reasoning & semantic modulations	View →
Benchmark vs GPT-5	Stress test GPT-5 with full WFGY reasoning suite	View →
🧙‍♂️ Starter Village 🏡	New here? Lost in symbols? Click here and let the wizard guide you through	Start →

👑 Early Stargazers: See the Hall of Fame — Engineers, hackers, and open source builders who supported WFGY from day one.

⭐ WFGY Engine 2.0 is already unlocked. ⭐ Star the repo to help others discover it and unlock more on the Unlock Board.

11 KiB Raw Blame History Unescape Escape