WFGY/ProblemMap/ocr-parsing-checklist.md

6.7 KiB
Raw Blame History

🔍 OCR & Parsing Checklist — From Scanned Chaos to Structured Knowledge

A field manual for turning PDFs, images, and legacy docs into RAG-ready semantic chunks


Goal: Eliminate invisible OCR noise and parsing drift before vectors are built.
Audience: Devs shipping RAG, search, or data-extraction pipelines who wonder why “the model read the page but still hallucinated.”


1 Why OCR + Parsing Is the First Failure Point

  1. Garbage-in, hallucination-out — A 98 % accurate LLM fed a 90 % OCR text yields 0 % trustworthy reasoning.
  2. Error propagation — Mis-segmented tokens poison embeddings, which pollute the vector store, which mislead retrieval, which derail the LLM.
  3. Silence — OCR engines rarely shout when confidence drops; they hand you corrupt UTF-8 and wish you luck.

2 “Bad OCR” Signatures (Quick Detection)

Signal How to Spot Impact on RAG
<ff> ligature anomalies Regex: `fi
Spurious hyphens at line ends Regex: [a-zA-Z]-\n[a-z] Token mismatch → irrelevant vectors
Repeated header/footer noise 90 %+ duplication across pages Clutters top-k retrieval
Empty columns (table lost) Sudden token drop for numeric blocks Answer extraction impossible
Confidence < 0.85 for full page Engine API output Replace / re-OCR image segment

3 The WFGY-Enhanced OCR Pipeline (Checklist)

3.1 Pre-OCR

  • Page Split — Detect multi-column layout; slice images horizontally before OCR.
  • DPI Normalisation — Upscale to 300 dpi if <200 dpi to stabilise character shapes.
  • Noise Removal — Median blur + dilation; boosts Tesseract accuracy by ≥ 8 %.
  • Language Model — Set explicit --lang list (avoid auto-detect drift).

3.2 OCR Engine (Tesseract CLI or API)

Flag Recommended Value Why
--oem 3 LSTM + legacy for mixed fonts
--psm 6 Assume block of text; preserves line order
--dpi Explicit numeric Overrides header mis-detect
tessedit_char_blacklist ¢£€¥©®™ etc. Remove unneeded symbols to reduce noise

WFGY Hook: BBMC runs post line-level; drops ΔS peaks > 0.7 (likely OCR mis-read).

3.3 Parsing & Chunking

  • Heading Detection — Regex + font-size heuristic → create logical anchors.
  • Paragraph Merge — Join lines if hyphenated split; remove double spaces.
  • Table Rebuild — Recognise numbers with > 60 % digits; store CSV separately.
  • Semantic Chunk Size — 70120 tokens; cut on natural boundaries only.
  • λ_observe Tagging — Mark each chunk as convergent; flag if internal ΔS > 0.6.

3.4 Post-OCR Validation

Test Threshold Action
mean_confidence ≥ 0.90 page-level Accept
ΔS(header, body) < 0.45 Accept; else inspect
Duplicate line ratio < 5 % If higher → de-dup background noise
Line length entropy 0.51.5 bits Abnormal ⇒ table or code block; treat separately

4 Common Pitfalls & Fix Recipes

Pitfall Symptom WFGY Fix
Skewed scans Text slants; letters fused Pre-deskew (Hough) → re-OCR
Watermarks Random “DRAFT” tokens mid-sentence Regex filter; BBMC residue cut
Marginalia leakage Handwritten notes become tokens Detect bounding boxes; mask before OCR
Large equations OCR turns into = = noise Frame extract; feed MathPix → LaTeX; store separate

5 End-to-End Smoke Test

  1. Choose a 10-page PDF with tables + images.
  2. Run full pipeline with WFGY hooks.
  3. Metrics to verify:
    • Token overlap with human ground truth ≥ 0.93
    • ΔS(question, retrieved_context) ≤ 0.45 on sample QA
    • λ_observe stays convergent after 3 paraphrase queries
  4. Manual QA: at least 8 / 10 answers correct with citations.

6 FAQ

Q: Is Google Vision OCR “good enough”?
A: Accuracy is high, but without BBMC boundary checks you still risk semantic drift.

Q: Do I need a layout-aware model (Donut, LayoutLM)?
A: Recommended for complex forms. WFGY integrates their outputs seamlessly; the checklist still applies.

Q: Can I skip the table CSV step?
A: Only if your downstream task never asks for numeric QA. Otherwise chunk ordering will fail.


🔗 Quick-Start Downloads (60 sec)

Tool Link 3-Step Setup
WFGY 1.0 PDF Engine Paper 1 Download · 2 Upload to your LLM · 3 Ask “Answer using WFGY + <your question>”
TXT OS (plain-text OS) TXTOS.txt 1 Download · 2 Paste into any LLM chat · 3 Type “hello world” — OS boots instantly

Explore More

Layer Page What its for
Proof WFGY Recognition Map External citations, integrations, and ecosystem proof
⚙️ Engine WFGY 1.0 Original PDF tension engine and early logic sketch (legacy reference)
⚙️ Engine WFGY 2.0 Production tension kernel for RAG and agent systems
⚙️ Engine WFGY 3.0 TXT based Singularity tension engine (131 S class set)
🗺️ Map Problem Map 1.0 Flagship 16 problem RAG failure taxonomy and fix map
🗺️ Map Problem Map 2.0 Global Debug Card for RAG and agent pipeline diagnosis
🗺️ Map Problem Map 3.0 Global AI troubleshooting atlas and failure pattern map
🧰 App TXT OS .txt semantic OS with fast bootstrap
🧰 App Blah Blah Blah Abstract and paradox Q&A built on TXT OS
🧰 App Blur Blur Blur Text to image generation with semantic control
🏡 Onboarding Starter Village Guided entry point for new users

If this repository helped, starring it improves discovery so more builders can find the docs and tools.
GitHub Repo stars