🔍 OCR & Parsing Checklist — From Scanned Chaos to Structured Knowledge

A field manual for turning PDFs, images, and legacy docs into RAG-ready semantic chunks

Goal: Eliminate invisible OCR noise and parsing drift before vectors are built.
Audience: Devs shipping RAG, search, or data-extraction pipelines who wonder why “the model read the page but still hallucinated.”

1 Why OCR + Parsing Is the First Failure Point

Garbage-in, hallucination-out — A 98 % accurate LLM fed a 90 % OCR text yields 0 % trustworthy reasoning.
Error propagation — Mis-segmented tokens poison embeddings, which pollute the vector store, which mislead retrieval, which derail the LLM.
Silence — OCR engines rarely shout when confidence drops; they hand you corrupt UTF-8 and wish you luck.

2 “Bad OCR” Signatures (Quick Detection)

Signal	How to Spot	Impact on RAG
`<ff>` ligature anomalies	Regex: `ﬁ	ﬂ
Spurious hyphens at line ends	Regex: `[a-zA-Z]-\n[a-z]`	Token mismatch → irrelevant vectors
Repeated header/footer noise	90 %+ duplication across pages	Clutters top-k retrieval
Empty columns (table lost)	Sudden token drop for numeric blocks	Answer extraction impossible
Confidence < 0.85 for full page	Engine API output	Replace / re-OCR image segment

3 The WFGY-Enhanced OCR Pipeline (Checklist)

3.1 Pre-OCR

Page Split — Detect multi-column layout; slice images horizontally before OCR.
DPI Normalisation — Upscale to 300 dpi if <200 dpi to stabilise character shapes.
Noise Removal — Median blur + dilation; boosts Tesseract accuracy by ≥ 8 %.
Language Model — Set explicit --lang list (avoid auto-detect drift).

3.2 OCR Engine (Tesseract CLI or API)

Flag	Recommended Value	Why
`--oem`	`3`	LSTM + legacy for mixed fonts
`--psm`	`6`	Assume block of text; preserves line order
`--dpi`	Explicit numeric	Overrides header mis-detect
`tessedit_char_blacklist`	`¢£€¥©®™` etc.	Remove unneeded symbols to reduce noise

WFGY Hook: BBMC runs post line-level; drops ΔS peaks > 0.7 (likely OCR mis-read).

3.3 Parsing & Chunking

Heading Detection — Regex + font-size heuristic → create logical anchors.
Paragraph Merge — Join lines if hyphenated split; remove double spaces.
Table Rebuild — Recognise numbers with > 60 % digits; store CSV separately.
Semantic Chunk Size — 70–120 tokens; cut on natural boundaries only.
λ_observe Tagging — Mark each chunk as → convergent; flag if internal ΔS > 0.6.

3.4 Post-OCR Validation

Test	Threshold	Action
`mean_confidence`	≥ 0.90 page-level	Accept
ΔS(header, body)	< 0.45	Accept; else inspect
Duplicate line ratio	< 5 %	If higher → de-dup background noise
Line length entropy	0.5–1.5 bits	Abnormal ⇒ table or code block; treat separately

4 Common Pitfalls & Fix Recipes

Pitfall	Symptom	WFGY Fix
Skewed scans	Text slants; letters fused	Pre-deskew (Hough) → re-OCR
Watermarks	Random “DRAFT” tokens mid-sentence	Regex filter; BBMC residue cut
Marginalia leakage	Handwritten notes become tokens	Detect bounding boxes; mask before OCR
Large equations	OCR turns into `= =` noise	Frame extract; feed MathPix → LaTeX; store separate

5 End-to-End Smoke Test

Choose a 10-page PDF with tables + images.
Run full pipeline with WFGY hooks.
Metrics to verify:
- Token overlap with human ground truth ≥ 0.93
- ΔS(question, retrieved_context) ≤ 0.45 on sample QA
- λ_observe stays convergent after 3 paraphrase queries
Manual QA: at least 8 / 10 answers correct with citations.

6 FAQ

Q: Is Google Vision OCR “good enough”?
A: Accuracy is high, but without BBMC boundary checks you still risk semantic drift.

Q: Do I need a layout-aware model (Donut, LayoutLM)?
A: Recommended for complex forms. WFGY integrates their outputs seamlessly; the checklist still applies.

Q: Can I skip the table CSV step?
A: Only if your downstream task never asks for numeric QA. Otherwise chunk ordering will fail.

🔗 Quick-Start Downloads (60 sec)

Tool	Link	3-Step Setup
WFGY 1.0 PDF	Engine Paper	1️⃣ Download · 2️⃣ Upload to your LLM · 3️⃣ Ask “Answer using WFGY + <your question>”
TXT OS (plain-text OS)	TXTOS.txt	1️⃣ Download · 2️⃣ Paste into any LLM chat · 3️⃣ Type “hello world” — OS boots instantly

🧭 Explore More

Module	Description	Link
WFGY Core	WFGY 2.0 engine is live: full symbolic reasoning architecture and math stack	View →
Problem Map 1.0	Initial 16-mode diagnostic and symbolic fix framework	View →
Problem Map 2.0	RAG-focused failure tree, modular fixes, and pipelines	View →
Semantic Clinic Index	Expanded failure catalog: prompt injection, memory bugs, logic drift	View →
Semantic Blueprint	Layer-based symbolic reasoning & semantic modulations	View →
Benchmark vs GPT-5	Stress test GPT-5 with full WFGY reasoning suite	View →
🧙‍♂️ Starter Village 🏡	New here? Lost in symbols? Click here and let the wizard guide you through	Start →

👑 Early Stargazers: See the Hall of Fame —
Engineers, hackers, and open source builders who supported WFGY from day one.

⭐ WFGY Engine 2.0 is already unlocked. ⭐ Star the repo to help others discover it and unlock more on the Unlock Board.

8.1 KiB Raw Blame History Unescape Escape