
Tokenization & Casing — OCR Parsing Guardrails

🧭 Quick Return to Map

You are in a sub-page of OCR_Parsing.
To reorient, go back to the OCR_Parsing index.

Think of this page as a desk within a ward.
If you need the full triage and all prescriptions, return to the Emergency Room lobby.

A focused fix page for post-OCR text where casing, spaces, or token boundaries are corrupted. Use this to normalize the stream before chunking/embedding, and verify with measurable targets. Works across Tesseract, Google DocAI, Azure OCR, ABBYY, PaddleOCR, and custom engines.

When to use this page

  • Words are split or glued (e.g., re tri eval, metadataindex).
  • Case flaps mid-sentence (tHE DocUment), acronyms collapse (R a G).
  • Invisible characters or double spaces change token counts.
  • Chunkers behave inconsistently between runs with the same image/PDF.
  • Embedding recall looks fine locally but retrieval ΔS stays high.

Open these first

  • Chunking Checklist
  • Retrieval Traceability
  • Data Contracts
  • Embedding ≠ Semantic

Acceptance targets

  • ΔS(question, retrieved) ≤ 0.45 after normalization.
  • Coverage to target section ≥ 0.70 on three paraphrases.
  • λ remains convergent across two seeds.
  • Token count variance for the same page ≤ 1 percent after normalization.
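
These targets can be checked mechanically. A minimal sketch of the ΔS check, assuming the convention used across this Problem Map that ΔS = 1 − cosine similarity between embeddings; `embed` is a stand-in for your embedding call, not a real API:

```python
import numpy as np

def delta_s(q_vec, r_vec):
    """Semantic stress between two embedding vectors: 1 - cosine. Lower is better."""
    cos = np.dot(q_vec, r_vec) / (np.linalg.norm(q_vec) * np.linalg.norm(r_vec))
    return 1.0 - float(cos)

# usage sketch; embed() is your embedding client, not defined here
# assert delta_s(embed(question), embed(retrieved)) <= 0.45
```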

Symptoms → exact fix

  • Mid-token splits or merges (infor mation, vectorstoreindex)
    Fix No.1: Rejoin by dictionary and layout anchors.
    Use a wordlist + n-gram agreement check, and prefer joins that reduce ΔS on a small gold set (a minimal sketch follows this list).
    See: Chunking Checklist

  • Casing drift (random upper/lower), acronym scatter (r.a.g.)
    Fix No.2: Casing normalization with protected spans.
    Protect enums, acronyms, chemical names, LaTeX blocks, code fonts.

  • Whitespace noise (NBSP, thin space, double space)
    Fix No.3: Unicode normalization + space collapse, keep offsets table.
    Record before/after offset map to preserve citation alignment.
    See: Retrieval Traceability

  • Punctuation misreads (l vs 1, O vs 0, comma vs period)
    Fix No.4: Confusable set pass + local language model vote.
    Only apply inside numeric or acronym contexts, keep audit log.

  • Tokenizer mismatch across components
    Fix No.5: Single tokenizer contract for parse → chunk → embed.
    Declare tokenizer_name, lowercase, strip_accents in payload.
    See: Data Contracts
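
A minimal sketch of the Fix No.1 scorer. `WORDS` and `BIGRAMS` below are illustrative placeholders for your own wordlist and bigram counts, and a production pass would also prefer the candidate that lowers ΔS on your gold set:

```python
WORDS = {"information", "vector", "store", "index", "retrieval"}   # your wordlist
BIGRAMS = {("vector", "store"): 120, ("store", "index"): 80}       # your bigram counts

def bigram_score(tokens):
    """Sum of bigram counts; higher means the segmentation looks like real text."""
    return sum(BIGRAMS.get(pair, 0) for pair in zip(tokens, tokens[1:]))

def rejoin(a, b):
    """Join 'infor mation' -> 'information' only when the join is a known word
    and the two halves are not both words on their own."""
    joined = a + b
    if joined.lower() in WORDS and not (a.lower() in WORDS and b.lower() in WORDS):
        return joined
    return None

def split_stuck(word):
    """Split 'vectorstoreindex' into known words, longest match first.
    Returns None when the dictionary cannot cover the whole word."""
    parts, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j].lower() in WORDS:
                parts.append(word[i:j])
                i = j
                break
        else:
            return None
    return parts if len(parts) > 1 else None
```

Greedy longest-match is a deliberate simplification; with bigram_score you can rank competing segmentations instead of taking the first one.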


60-second fix checklist

  1. Normalize

    • Unicode NFC, strip BOM, collapse spaces except inside code/math blocks.
    • Replace NBSP and thin spaces with ASCII space.
    • Build offset_map old→new for citations.
  2. Protect

    • Detect protected spans: URLs, emails, hex, code, LaTeX, table headers, known acronyms (a minimal sketch follows this checklist).
    • Freeze casing and punctuation inside protected spans.
  3. Rejoin / Split

    • Rejoin candidates by dictionary + bigram score.
    • Split stuck words when edit distance to dictionary is lower after split.
  4. Contract

    • Emit tokenizer_contract.json:
      { "tokenizer": "bert-base-uncased", "lowercase": true, "strip_accents": true }
      Attach to every downstream step.
      See: Data Contracts
  5. Verify

    • Recompute ΔS on a 20-question gold set.
    • If ΔS stays ≥ 0.60, open Embedding ≠ Semantic and rebuild index with the same tokenizer contract.
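
A minimal sketch of step 2's span detection and casing freeze. The regexes and `ACRONYMS` set are illustrative and should be extended for your corpus; real casing repair (sentence case, proper nouns) is smarter than the crude lowercase baseline shown here:

```python
import re

ACRONYMS = {"RAG", "OCR", "NFC", "BOM"}          # illustrative; use your own list

PATTERNS = [
    ("url",   re.compile(r"https?://\S+")),
    ("email", re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b")),
    ("hex",   re.compile(r"\b0x[0-9a-fA-F]+\b")),
    ("latex", re.compile(r"\$[^$]+\$")),
]

def detect_protected_spans(text):
    """Return sorted (start, end, kind) spans whose casing and punctuation
    must stay frozen during normalization."""
    spans = [(m.start(), m.end(), kind)
             for kind, pat in PATTERNS for m in pat.finditer(text)]
    for m in re.finditer(r"\b[A-Z]{2,6}\b", text):   # candidate acronyms
        if m.group() in ACRONYMS:
            spans.append((m.start(), m.end(), "acronym"))
    return sorted(spans)

def normalize_casing(text, protect):
    """Lowercase outside protected spans; leave protected characters untouched."""
    frozen = [False] * len(text)
    for start, end, _ in protect:
        for i in range(start, end):
            frozen[i] = True
    return "".join(c if frozen[i] else c.lower() for i, c in enumerate(text))
```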

Minimal pipeline patch (pseudo)

# order matters: normalize first, then protect, then repair, then emit with the contract
text = ocr_text

text = unicode_normalize_nfc(text)                      # NFC + strip BOM
text, offset_map = collapse_spaces_with_offsets(text)   # record old→new offsets
spans = detect_protected_spans(text)                    # urls, code, latex, acronyms
text = normalize_casing(text, protect=spans)            # casing fixed outside spans only

cands = find_split_merge_candidates(text, protect=spans)
text = apply_split_merge(text, cands, scorer="bigram+dict")

emit_payload(
    content=text,
    offset_map=offset_map,                # keeps citations aligned
    tokenizer_contract={                  # one contract for parse → chunk → embed
        "tokenizer": "bert-base-uncased",
        "lowercase": True,
        "strip_accents": True
    }
)

Keep the offset map, or citations will drift. Tracing and schema rules come from: Retrieval Traceability · Data Contracts
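
The patch assumes two normalization helpers. A minimal sketch of both, with the offset map recorded as old→new index pairs; a full version would also skip code and math spans when collapsing:

```python
import unicodedata

def unicode_normalize_nfc(text):
    """NFC-normalize and strip a leading BOM."""
    return unicodedata.normalize("NFC", text.lstrip("\ufeff"))

def collapse_spaces_with_offsets(text):
    """Collapse whitespace runs (including NBSP and thin space) to one ASCII
    space. Returns (new_text, offset_map) with offset_map[old_i] = new_i."""
    out, offset_map = [], {}
    prev_space = False
    for i, ch in enumerate(text):
        if ch.isspace():                       # True for \xa0 (NBSP) and \u2009
            if not prev_space:
                offset_map[i] = len(out)
                out.append(" ")
            else:
                offset_map[i] = len(out) - 1   # folded into the previous space
            prev_space = True
        else:
            offset_map[i] = len(out)
            out.append(ch)
            prev_space = False
    return "".join(out), offset_map
```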


Eval recipe

  • Build a tiny page-level gold: 10–20 questions with expected anchor sections.
  • Measure before/after: ΔS(question, retrieved), coverage, λ states.
  • Acceptance to sign off: ΔS ≤ 0.45, coverage ≥ 0.70, λ convergent on both seeds.
  • Log token count and confusable corrections per page for audit.
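
A minimal before/after harness over that gold set, reusing the delta_s convention sketched above; retrieve() and the chunk fields text and anchor_id are stand-ins for your stack:

```python
def evaluate(gold, retrieve, embed):
    """gold: list of (question, expected_anchor_id). Prints the metrics and
    returns True when the acceptance targets are met."""
    ds_values, hits = [], 0
    for question, expected in gold:
        chunks = retrieve(question, k=5)                 # your retriever
        ds_values.append(min(delta_s(embed(question), embed(c.text))
                             for c in chunks))
        hits += any(c.anchor_id == expected for c in chunks)
    mean_ds = sum(ds_values) / len(ds_values)
    coverage = hits / len(gold)
    print(f"mean ΔS={mean_ds:.2f}  coverage={coverage:.2f}")
    return mean_ds <= 0.45 and coverage >= 0.70          # sign-off gate
```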

When to escalate

  • ΔS stays ≥ 0.60 after normalization: rebuild the index with the same tokenizer contract, then open Embedding ≠ Semantic.
  • Citations drift even with the offset map applied: audit the pipeline with Retrieval Traceability.

🔗 Quick-Start Downloads (60 sec)

| Tool | Link | 3-Step Setup |
|------|------|--------------|
| WFGY 1.0 PDF | Engine Paper | 1 Download · 2 Upload to your LLM · 3 Ask “Answer using WFGY + \<your question\>” |
| TXT OS (plain-text OS) | TXTOS.txt | 1 Download · 2 Paste into any LLM chat · 3 Type “hello world” — OS boots instantly |

Explore More

| Layer | Page | What it's for |
|-------|------|---------------|
| Proof | WFGY Recognition Map | External citations, integrations, and ecosystem proof |
| ⚙️ Engine | WFGY 1.0 | Original PDF tension engine and early logic sketch (legacy reference) |
| ⚙️ Engine | WFGY 2.0 | Production tension kernel for RAG and agent systems |
| ⚙️ Engine | WFGY 3.0 | TXT-based Singularity tension engine (131 S-class set) |
| 🗺️ Map | Problem Map 1.0 | Flagship 16-problem RAG failure taxonomy and fix map |
| 🗺️ Map | Problem Map 2.0 | Global Debug Card for RAG and agent pipeline diagnosis |
| 🗺️ Map | Problem Map 3.0 | Global AI troubleshooting atlas and failure pattern map |
| 🧰 App | TXT OS | .txt semantic OS with fast bootstrap |
| 🧰 App | Blah Blah Blah | Abstract and paradox Q&A built on TXT OS |
| 🧰 App | Blur Blur Blur | Text-to-image generation with semantic control |
| 🏡 Onboarding | Starter Village | Guided entry point for new users |

If this repository helped, starring it improves discovery so more builders can find the docs and tools.