WFGY/ProblemMap/ocr-parsing-checklist.md
2025-08-15 23:21:20 +08:00

8.1 KiB
Raw Blame History

🔍 OCR & Parsing Checklist — From Scanned Chaos to Structured Knowledge

A field manual for turning PDFs, images, and legacy docs into RAG-ready semantic chunks


Goal: Eliminate invisible OCR noise and parsing drift before vectors are built.
Audience: Devs shipping RAG, search, or data-extraction pipelines who wonder why “the model read the page but still hallucinated.”


1 Why OCR + Parsing Is the First Failure Point

  1. Garbage-in, hallucination-out — A 98 % accurate LLM fed a 90 % OCR text yields 0 % trustworthy reasoning.
  2. Error propagation — Mis-segmented tokens poison embeddings, which pollute the vector store, which mislead retrieval, which derail the LLM.
  3. Silence — OCR engines rarely shout when confidence drops; they hand you corrupt UTF-8 and wish you luck.

2 “Bad OCR” Signatures (Quick Detection)

Signal How to Spot Impact on RAG
<ff> ligature anomalies Regex: `fi
Spurious hyphens at line ends Regex: [a-zA-Z]-\n[a-z] Token mismatch → irrelevant vectors
Repeated header/footer noise 90 %+ duplication across pages Clutters top-k retrieval
Empty columns (table lost) Sudden token drop for numeric blocks Answer extraction impossible
Confidence < 0.85 for full page Engine API output Replace / re-OCR image segment

3 The WFGY-Enhanced OCR Pipeline (Checklist)

3.1 Pre-OCR

  • Page Split — Detect multi-column layout; slice images horizontally before OCR.
  • DPI Normalisation — Upscale to 300 dpi if <200 dpi to stabilise character shapes.
  • Noise Removal — Median blur + dilation; boosts Tesseract accuracy by ≥ 8 %.
  • Language Model — Set explicit --lang list (avoid auto-detect drift).

3.2 OCR Engine (Tesseract CLI or API)

Flag Recommended Value Why
--oem 3 LSTM + legacy for mixed fonts
--psm 6 Assume block of text; preserves line order
--dpi Explicit numeric Overrides header mis-detect
tessedit_char_blacklist ¢£€¥©®™ etc. Remove unneeded symbols to reduce noise

WFGY Hook: BBMC runs post line-level; drops ΔS peaks > 0.7 (likely OCR mis-read).

3.3 Parsing & Chunking

  • Heading Detection — Regex + font-size heuristic → create logical anchors.
  • Paragraph Merge — Join lines if hyphenated split; remove double spaces.
  • Table Rebuild — Recognise numbers with > 60 % digits; store CSV separately.
  • Semantic Chunk Size — 70120 tokens; cut on natural boundaries only.
  • λ_observe Tagging — Mark each chunk as convergent; flag if internal ΔS > 0.6.

3.4 Post-OCR Validation

Test Threshold Action
mean_confidence ≥ 0.90 page-level Accept
ΔS(header, body) < 0.45 Accept; else inspect
Duplicate line ratio < 5 % If higher → de-dup background noise
Line length entropy 0.51.5 bits Abnormal ⇒ table or code block; treat separately

4 Common Pitfalls & Fix Recipes

Pitfall Symptom WFGY Fix
Skewed scans Text slants; letters fused Pre-deskew (Hough) → re-OCR
Watermarks Random “DRAFT” tokens mid-sentence Regex filter; BBMC residue cut
Marginalia leakage Handwritten notes become tokens Detect bounding boxes; mask before OCR
Large equations OCR turns into = = noise Frame extract; feed MathPix → LaTeX; store separate

5 End-to-End Smoke Test

  1. Choose a 10-page PDF with tables + images.
  2. Run full pipeline with WFGY hooks.
  3. Metrics to verify:
    • Token overlap with human ground truth ≥ 0.93
    • ΔS(question, retrieved_context) ≤ 0.45 on sample QA
    • λ_observe stays convergent after 3 paraphrase queries
  4. Manual QA: at least 8 / 10 answers correct with citations.

6 FAQ

Q: Is Google Vision OCR “good enough”?
A: Accuracy is high, but without BBMC boundary checks you still risk semantic drift.

Q: Do I need a layout-aware model (Donut, LayoutLM)?
A: Recommended for complex forms. WFGY integrates their outputs seamlessly; the checklist still applies.

Q: Can I skip the table CSV step?
A: Only if your downstream task never asks for numeric QA. Otherwise chunk ordering will fail.


🔗 Quick-Start Downloads (60 sec)

Tool Link 3-Step Setup
WFGY 1.0 PDF Engine Paper 1 Download · 2 Upload to your LLM · 3 Ask “Answer using WFGY + <your question>”
TXT OS (plain-text OS) TXTOS.txt 1 Download · 2 Paste into any LLM chat · 3 Type “hello world” — OS boots instantly

🧭 Explore More

Module Description Link
WFGY Core WFGY 2.0 engine is live: full symbolic reasoning architecture and math stack View →
Problem Map 1.0 Initial 16-mode diagnostic and symbolic fix framework View →
Problem Map 2.0 RAG-focused failure tree, modular fixes, and pipelines View →
Semantic Clinic Index Expanded failure catalog: prompt injection, memory bugs, logic drift View →
Semantic Blueprint Layer-based symbolic reasoning & semantic modulations View →
Benchmark vs GPT-5 Stress test GPT-5 with full WFGY reasoning suite View →
🧙‍♂️ Starter Village 🏡 New here? Lost in symbols? Click here and let the wizard guide you through Start →

👑 Early Stargazers: See the Hall of Fame
Engineers, hackers, and open source builders who supported WFGY from day one.

GitHub stars WFGY Engine 2.0 is already unlocked. Star the repo to help others discover it and unlock more on the Unlock Board.

WFGY Main   TXT OS   Blah   Blot   Bloc   Blur   Blow