🔍 OCR & Parsing Checklist — From Scanned Chaos to Structured Knowledge
A field manual for turning PDFs, images, and legacy docs into RAG-ready semantic chunks
Goal: Eliminate invisible OCR noise and parsing drift before vectors are built.
Audience: Devs shipping RAG, search, or data-extraction pipelines who wonder why “the model read the page but still hallucinated.”
1 Why OCR + Parsing Is the First Failure Point
- Garbage-in, hallucination-out — A 98 % accurate LLM fed 90 % accurate OCR text cannot produce trustworthy reasoning.
- Error propagation — Mis-segmented tokens poison embeddings, which pollute the vector store, which mislead retrieval, which derail the LLM.
- Silence — OCR engines rarely shout when confidence drops; they hand you corrupt UTF-8 and wish you luck.
2 “Bad OCR” Signatures (Quick Detection)
| Signal | How to Spot | Impact on RAG |
|---|---|---|
| `<ff>` ligature anomalies | Regex: `ﬁ\|ﬂ\|…` | |
| Spurious hyphens at line ends | Regex: `[a-zA-Z]-\n[a-z]` | Token mismatch → irrelevant vectors |
| Repeated header/footer noise | 90 %+ duplication across pages | Clutters top-k retrieval |
| Empty columns (table lost) | Sudden token drop for numeric blocks | Answer extraction impossible |
| Confidence < 0.85 for full page | Engine API output | Replace / re-OCR image segment |
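The regex and duplication signals in the table can be checked with a few lines of standard-library Python. This is a minimal sketch, not part of WFGY: the function names are hypothetical, the ligature range (U+FB00 to U+FB06) is an assumption about which code points matter, and "90 %+ duplication across pages" is interpreted as "the same line appears on at least 90 % of pages".

```python
import re
from collections import Counter

# Assumption: ligature anomalies are the Unicode presentation forms U+FB00..U+FB06.
LIGATURES = re.compile("[\uFB00-\uFB06]")
# A letter, a hyphen, a newline, then a lowercase letter: a word split across lines.
LINE_END_HYPHEN = re.compile(r"[a-zA-Z]-\n[a-z]")


def count_signatures(page_text: str) -> dict:
    """Count ligature characters and line-end hyphen splits on one page."""
    return {
        "ligatures": len(LIGATURES.findall(page_text)),
        "line_end_hyphens": len(LINE_END_HYPHEN.findall(page_text)),
    }


def repeated_lines(pages: list[str], min_share: float = 0.9) -> set[str]:
    """Return lines present on at least min_share of pages (likely headers/footers)."""
    seen = Counter()
    for page in pages:
        # Count each distinct line once per page.
        seen.update({line.strip() for line in page.splitlines() if line.strip()})
    return {line for line, n in seen.items() if n / len(pages) >= min_share}
```

Lines returned by `repeated_lines` are candidates for stripping before chunking; the counts from `count_signatures` are only a trigger for closer inspection, not a verdict.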
3 The WFGY-Enhanced OCR Pipeline (Checklist)
3.1 Pre-OCR
- Page Split — Detect multi-column layout; slice the page image into one strip per column before OCR.
- DPI Normalisation — Upscale to 300 dpi if <200 dpi to stabilise character shapes.
- Noise Removal — Median blur + dilation; boosts Tesseract accuracy by ≥ 8 %.
- Language Model — Set an explicit `-l` language list, for example `-l eng+deu` (avoid auto-detect drift).
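Two of the pre-OCR steps reduce to simple arithmetic and can be sketched without an imaging library. The code below is an illustration under stated assumptions: the page has already been binarised into rows of 0/1 values, columns are separated by gutters containing no ink at all, and the helper names and the `min_gutter` parameter are hypothetical. A real pipeline would do the resizing and denoising with an image library.

```python
def upscale_factor(dpi: int, min_dpi: int = 200, target_dpi: int = 300) -> float:
    """Return the resize factor: scale to target_dpi only when below min_dpi."""
    return target_dpi / dpi if dpi < min_dpi else 1.0


def column_spans(binary_rows: list[list[int]], min_gutter: int = 2) -> list[tuple[int, int]]:
    """Find text columns on a binarised page (1 = ink, 0 = background).

    Uses a vertical projection profile: x positions with no ink in any row
    are gutter candidates, and a run of at least min_gutter of them ends a
    column. Returns (start, end) x ranges with end exclusive.
    """
    width = len(binary_rows[0])
    ink = [any(row[x] for row in binary_rows) for x in range(width)]
    spans, start, gap = [], None, 0
    for x, has_ink in enumerate(ink):
        if has_ink:
            if start is None:
                start = x          # a new column begins here
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gutter:  # wide enough to count as a gutter
                spans.append((start, x - gap + 1))
                start, gap = None, 0
    if start is not None:
        spans.append((start, width - gap))
    return spans
```

Each returned span can be cropped and sent to OCR separately, which keeps the reading order of one column from interleaving with its neighbour.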
3.2 OCR Engine (Tesseract CLI or API)
| Flag | Recommended Value | Why |
|---|---|---|
| `--oem` | `3` | Default mode: uses whichever engines the traineddata provides (`2` requests legacy + LSTM for mixed fonts) |
| `--psm` | `6` | Assume block of text; preserves line order |
| `--dpi` | Explicit numeric | Overrides header mis-detect |
| `tessedit_char_blacklist` | `¢£€¥©®™` etc. | Remove unneeded symbols to reduce noise |
WFGY Hook: BBMC runs after the line-level pass and drops ΔS peaks > 0.7 (likely OCR mis-reads).
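The settings in the table map onto a single Tesseract CLI invocation. The sketch below only builds the argument list; the function name and default values are hypothetical, and it assumes a `tesseract` binary (version 4 or later) is on the `PATH` when the command is eventually run. Config variables such as `tessedit_char_blacklist` are passed with `-c NAME=VALUE`, and languages with `-l`.

```python
def tesseract_command(image_path: str, output_base: str, langs: list[str],
                      dpi: int = 300, blacklist: str = "¢£€¥©®™") -> list[str]:
    """Build an argv list for the Tesseract CLI using the settings in the table."""
    return [
        "tesseract", image_path, output_base,
        "-l", "+".join(langs),      # explicit languages, e.g. eng+deu
        "--oem", "3",
        "--psm", "6",
        "--dpi", str(dpi),          # explicit numeric DPI
        "-c", f"tessedit_char_blacklist={blacklist}",
    ]
```

The list can be handed to `subprocess.run(cmd, check=True)`; building it as a list rather than a shell string avoids quoting problems with the blacklist symbols.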
3.3 Parsing & Chunking
- Heading Detection — Regex + font-size heuristic → create logical anchors.
- Paragraph Merge — Join lines if hyphenated split; remove double spaces.
- Table Rebuild — Flag blocks whose characters are > 60 % digits as tables; store them as CSV separately.
- Semantic Chunk Size — 70–120 tokens; cut on natural boundaries only.
- λ_observe Tagging — Mark each chunk as `→` convergent; flag if internal ΔS > 0.6.
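The merge, table-detection, and chunking steps can be sketched in plain Python. Everything here is illustrative: the function names are hypothetical, "tokens" are approximated by whitespace-separated words rather than a model tokenizer, and "natural boundaries" are taken to mean sentence ends. Only the 120-token upper bound is enforced; heading detection and the λ_observe and ΔS tagging are out of scope for this sketch.

```python
import re


def merge_lines(text: str) -> str:
    """Join hyphen-split words, unwrap single line breaks, collapse repeated spaces."""
    text = re.sub(r"([a-zA-Z])-\n([a-z])", r"\1\2", text)  # "para-\ngraph" -> "paragraph"
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)            # keep blank-line paragraph breaks
    return re.sub(r" {2,}", " ", text)


def digit_ratio(block: str) -> float:
    """Share of non-space characters that are digits."""
    chars = [c for c in block if not c.isspace()]
    return sum(c.isdigit() for c in chars) / len(chars) if chars else 0.0


def chunk_sentences(text: str, max_tokens: int = 120) -> list[str]:
    """Greedily pack whole sentences into chunks of at most max_tokens words."""
    if not text.strip():
        return []
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current, size = [], [], 0
    for sentence in sentences:
        n = len(sentence.split())
        if current and size + n > max_tokens:
            chunks.append(" ".join(current))  # close the chunk at a sentence end
            current, size = [], 0
        current.append(sentence)
        size += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```

A block with `digit_ratio(block) > 0.6` would be routed to the table path instead of the chunker. A single sentence longer than `max_tokens` is kept whole as its own oversized chunk, which favours boundary integrity over the size limit.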
3.4 Post-OCR Validation
| Test | Threshold | Action |
|---|---|---|
| `mean_confidence` | ≥ 0.90 page-level | Accept |
| ΔS(header, body) | < 0.45 | Accept; else inspect |
| Duplicate line ratio | < 5 % | If higher → de-dup background noise |
| Line length entropy | 0.5–1.5 bits | Abnormal ⇒ table or code block; treat separately |
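Three of the four tests need nothing beyond the OCR output itself. The sketch below is one possible reading of them, with loud assumptions: the function names are hypothetical, "duplicate line ratio" is taken as the share of non-empty lines that repeat an earlier line, and "line length entropy" as the Shannon entropy of line lengths grouped into 10-character buckets (the bucket width is a guess, and the result depends on it). The ΔS(header, body) test is omitted because ΔS is not defined in this checklist.

```python
import math
from collections import Counter


def duplicate_line_ratio(lines: list[str]) -> float:
    """Fraction of non-empty lines that repeat an earlier line."""
    lines = [line.strip() for line in lines if line.strip()]
    if not lines:
        return 0.0
    return 1 - len(set(lines)) / len(lines)


def line_length_entropy(lines: list[str], bucket: int = 10) -> float:
    """Shannon entropy in bits of line lengths, grouped into fixed-width buckets."""
    counts = Counter(len(line) // bucket for line in lines if line.strip())
    total = sum(counts.values())
    if not total:
        return 0.0
    return -sum(n / total * math.log2(n / total) for n in counts.values())


def validate_page(mean_confidence: float, lines: list[str]) -> list[str]:
    """Return the names of failed checks; an empty list means accept."""
    failed = []
    if mean_confidence < 0.90:
        failed.append("mean_confidence")
    if duplicate_line_ratio(lines) >= 0.05:
        failed.append("duplicate_line_ratio")
    if not 0.5 <= line_length_entropy(lines) <= 1.5:
        failed.append("line_length_entropy")
    return failed
```

Returning the list of failed checks, rather than a single boolean, makes it easy to route a page to the matching action column in the table.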
4 Common Pitfalls & Fix Recipes
| Pitfall | Symptom | WFGY Fix |
|---|---|---|
| Skewed scans | Text slants; letters fused | Pre-deskew (Hough) → re-OCR |
| Watermarks | Random “DRAFT” tokens mid-sentence | Regex filter; BBMC residue cut |
| Marginalia leakage | Handwritten notes become tokens | Detect bounding boxes; mask before OCR |
| Large equations | OCR turns into `= =` noise | Frame extract; feed MathPix → LaTeX; store separate |
5 End-to-End Smoke Test
- Choose a 10-page PDF with tables + images.
- Run full pipeline with WFGY hooks.
- Metrics to verify:
  - Token overlap with human ground truth ≥ 0.93
  - ΔS(question, retrieved_context) ≤ 0.45 on sample QA
  - λ_observe stays convergent after 3 paraphrase queries
- Manual QA: at least 8 / 10 answers correct with citations.
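The token-overlap metric is not defined in the checklist, so the sketch below picks one plausible definition as an assumption: the share of ground-truth tokens, counted with multiplicity, that also occur in the OCR output, with tokens split on whitespace. The function name is hypothetical.

```python
from collections import Counter


def token_overlap(ocr_text: str, truth_text: str) -> float:
    """Share of ground-truth tokens (with multiplicity) also present in the OCR text."""
    ocr, truth = Counter(ocr_text.split()), Counter(truth_text.split())
    total = sum(truth.values())
    # Counter intersection keeps the minimum count of each shared token.
    return sum((ocr & truth).values()) / total if total else 0.0
```

Under this definition a page passes the smoke test when `token_overlap(ocr_page, human_page) >= 0.93`. The measure ignores word order, so it complements rather than replaces the manual QA step.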
6 FAQ
Q: Is Google Vision OCR “good enough”?
A: Accuracy is high, but without BBMC boundary checks you still risk semantic drift.
Q: Do I need a layout-aware model (Donut, LayoutLM)?
A: Recommended for complex forms. WFGY integrates their outputs seamlessly; the checklist still applies.
Q: Can I skip the table CSV step?
A: Only if your downstream task never asks for numeric QA. Otherwise chunk ordering will fail.
🔗 Quick-Start Downloads (60 sec)
| Tool | Link | 3-Step Setup |
|---|---|---|
| WFGY 1.0 PDF | Engine Paper | 1️⃣ Download · 2️⃣ Upload to your LLM · 3️⃣ Ask “Answer using WFGY + <your question>” |
| TXT OS (plain-text OS) | TXTOS.txt | 1️⃣ Download · 2️⃣ Paste into any LLM chat · 3️⃣ Type “hello world” — OS boots instantly |