Scanned PDFs and Quality: OCR Parsing Guardrails
Stabilize OCR extraction on noisy scans, low-resolution images, and multi-generation photocopies. Ensure text is auditable, retrievable, and bound by schema despite quality issues.
Open these first
- OCR parsing checklist: ocr-parsing-checklist.md
- Data contracts: data-contracts.md
- Hallucination control: hallucination.md
- Chunking guide: chunking-checklist.md
Acceptance targets
- OCR character error rate (CER) ≤ 2% after cleanup
- ΔS(question, retrieved) ≤ 0.45 even when scan resolution is below 300 dpi
- λ remains convergent across paraphrases
- All extracted text auditable against source image hash
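The CER target above can be checked with a short script. The following is a minimal sketch, assuming CER is defined in the usual way as character-level edit distance divided by the length of a hand-verified reference transcription; the function names are illustrative and not part of any WFGY tooling.

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Character-level edit distance (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate of OCR output against a trusted reference."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return levenshtein(reference, hypothesis) / len(reference)


def meets_cer_target(reference: str, hypothesis: str, target: float = 0.02) -> bool:
    """True when the OCR output is within the 2% acceptance target."""
    return cer(reference, hypothesis) <= target
```

The reference text has to come from a manually verified transcription of the same page; comparing two OCR runs against each other measures agreement, not accuracy.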
Typical failure signatures → fix
- Broken characters and merged glyphs
  Apply normalization and Unicode repair before indexing. Validate against a whitelist of expected ranges.
- Multi-generation photocopy blur
  Route through OCR engine supporting adaptive binarization. Anchor outputs with image hash to avoid ghost drift.
- Double-encoded PDFs (text + image overlay)
  Deduplicate layers. Choose the higher-confidence text layer and tag source.
- Skewed pages or rotated scans
  Run deskew filter before OCR. Capture skew angle metadata for audit.
- Mixed-language or font variants
  Force language models per region. Split by script. Store per-block language code.
- Noise artifacts (staple marks, stamps, watermarks)
  Strip bounding boxes below token threshold. Mark as `noise_block` instead of narrative text.
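Two of the fixes above, the Unicode whitelist and the noise-block tagging, can be sketched in a few lines. This is an illustration under stated assumptions: the allowed ranges are a hypothetical whitelist for a Western European corpus, the block shape (`text` key) is made up for the example, and the `min_tokens` threshold is a placeholder to tune on your own scans.

```python
import unicodedata

# Hypothetical whitelist; adjust to the scripts you actually expect.
ALLOWED_RANGES = [
    (0x0020, 0x007E),  # Basic Latin, printable
    (0x00A0, 0x00FF),  # Latin-1 Supplement
    (0x2010, 0x2027),  # dashes, quotation marks, bullets
]
ALLOWED_CONTROL = {"\n", "\t"}


def is_allowed(ch: str) -> bool:
    """True when the character falls inside the whitelist."""
    if ch in ALLOWED_CONTROL:
        return True
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in ALLOWED_RANGES)


def repair_and_check(text: str):
    """NFKC-normalize, then list (index, char) pairs outside the whitelist."""
    repaired = unicodedata.normalize("NFKC", text)
    offenders = [(i, ch) for i, ch in enumerate(repaired) if not is_allowed(ch)]
    return repaired, offenders


def tag_noise_blocks(blocks, min_tokens: int = 3):
    """Label blocks with too few tokens as noise_block instead of narrative."""
    tagged = []
    for block in blocks:
        n_tokens = len(block["text"].split())
        kind = "noise_block" if n_tokens < min_tokens else "narrative"
        tagged.append({**block, "kind": kind})
    return tagged
```

Offenders are reported rather than silently dropped, so a page with many out-of-range characters can be sent back for re-OCR instead of being indexed with holes in it.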
Fix in 60 seconds
- Hash source image
  Store `scan_id` and `image_hash` for every page. Tie all extracted text back to this anchor.
- Normalize text
  Apply Unicode NFKC. Collapse broken ligatures and fix spacing errors.
- De-layer double PDFs
  Choose the OCR text layer with confidence ≥ 0.90. Drop shadow text.
- Audit with ΔS
  Probe scanned text with 3 paraphrases. If ΔS ≥ 0.60, run re-OCR with stricter binarization.
- Chunk and contract
  Split by page. Enforce data contract fields: `page_no`, `scan_id`, `text_clean`, `bbox`.
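The hashing, normalization, and contract steps above can be combined into one per-page helper. The sketch below uses only the standard library; the `scan_id` format (page number plus a hash prefix) and the function names are assumptions made for illustration, and the `bbox` value is passed through unchanged from whatever OCR engine produced it.

```python
import hashlib
import re
import unicodedata


def image_hash(image_bytes: bytes) -> str:
    """SHA-256 of the raw page image, prefixed as in the data contract."""
    return "sha256:" + hashlib.sha256(image_bytes).hexdigest()


def clean_text(raw: str) -> str:
    """NFKC folds ligatures and compatibility forms; then tidy spacing."""
    text = unicodedata.normalize("NFKC", raw)
    text = re.sub(r"[ \t]+", " ", text)    # collapse runs of spaces and tabs
    text = re.sub(r" ?\n ?", "\n", text)   # trim spaces around line breaks
    return text.strip()


def build_page_record(page_no: int, image_bytes: bytes, raw_text: str, bbox):
    """One contract record per page, anchored to the source image hash."""
    digest = image_hash(image_bytes)
    return {
        "page_no": page_no,
        # Hypothetical id scheme: page number plus first 12 hex digits.
        "scan_id": f"p{page_no}_{digest[7:19]}",
        "image_hash": digest,
        "text_clean": clean_text(raw_text),
        "bbox": bbox,
    }
```

Hashing the image bytes rather than the extracted text means a later re-OCR of the same page keeps the same anchor, so old and new extractions can be compared directly.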
Minimal recipes by engine
- Google Document AI
  Use the `qualityScores.confidence` field. Reject blocks with confidence < 0.7.
- AWS Textract
  Hash `BlockType=PAGE`. Keep page-level confidence. Store as `scan_id`.
- Azure OCR
  Normalize boundingRegions. Add `language` code explicitly if detected.
- ABBYY
  Use `<charParams>` confidence. Flag low confidence segments for secondary OCR.
- PaddleOCR
  Use angle classification for deskew. Split multilingual pages into per-line language tags.
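Whichever engine is used, the reject-or-flag rule reduces to the same threshold check once the engine output has been mapped into a common shape. The sketch below assumes that mapping has already happened and that each block is a plain dict with an optional `confidence` key; this shape is hypothetical and is not any engine's native response format. The `scale` parameter covers engines that report confidence as a percentage (AWS Textract reports `Confidence` on a 0 to 100 scale).

```python
def split_by_confidence(blocks, threshold: float = 0.7, scale: float = 1.0):
    """Partition blocks into accepted and flagged lists.

    threshold is expressed on a 0..1 scale; pass scale=100.0 when the
    engine reports percentages. Blocks without a confidence value are
    flagged rather than trusted.
    """
    accepted, flagged = [], []
    for block in blocks:
        raw = block.get("confidence")
        if raw is not None and raw / scale >= threshold:
            accepted.append(block)
        else:
            flagged.append(block)
    return accepted, flagged
```

Flagged blocks are kept rather than discarded so they can be routed to secondary OCR or recorded as noise, in line with the auditability goal of this page.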
Data contract extension
{
  "scan_id": "p12_imghash",
  "page_no": 12,
  "image_hash": "sha256:...",
  "text_clean": "...",
  "language": "en",
  "confidence": 0.92,
  "noise_blocks": [...],
  "source_url": "..."
}
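A record in this shape can be validated before indexing. The following is a minimal sketch that treats every field in the example above as required and checks only types and two simple invariants; which fields are truly mandatory is an assumption here, so defer to data-contracts.md where it says otherwise.

```python
# Field names come from the contract example; the required/optional split
# and the type choices are assumptions made for this sketch.
EXPECTED_TYPES = {
    "scan_id": str,
    "page_no": int,
    "image_hash": str,
    "text_clean": str,
    "language": str,
    "confidence": (int, float),
    "noise_blocks": list,
    "source_url": str,
}


def validate_record(record: dict) -> list:
    """Return a list of human-readable contract violations (empty if valid)."""
    errors = []
    for name, expected in EXPECTED_TYPES.items():
        if name not in record:
            errors.append(f"missing field: {name}")
            continue
        value = record[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            errors.append(f"wrong type for {name}: {type(value).__name__}")
    conf = record.get("confidence")
    if isinstance(conf, (int, float)) and not isinstance(conf, bool):
        if not 0.0 <= conf <= 1.0:
            errors.append("confidence out of range [0, 1]")
    img = record.get("image_hash")
    if isinstance(img, str) and not img.startswith("sha256:"):
        errors.append("image_hash must start with 'sha256:'")
    return errors
```

Returning the full list of violations instead of raising on the first one makes it easier to log every problem with a page in a single pass.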
Verification
- Leak check: ensure no shadow/duplicate text.
- Quality probe: CER ≤ 2% on a sample of 1,000 characters.
- Stability probe: ΔS stable across paraphrases.
- Auditability: all text traceable to image hash.
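The leak check and the auditability check lend themselves to automation. The sketch below assumes that shadow text from a double-encoded PDF shows up as repeated lines within a page's `text_clean`, and that the original page image bytes are still available for re-hashing; both helper names and the `min_len` cutoff are illustrative.

```python
import hashlib
from collections import Counter


def find_duplicate_lines(text: str, min_len: int = 8):
    """Lines that occur more than once; short lines are ignored as noise."""
    counts = Counter(
        line.strip() for line in text.splitlines() if len(line.strip()) >= min_len
    )
    return [line for line, n in counts.items() if n > 1]


def is_auditable(record: dict, image_bytes: bytes) -> bool:
    """True when the stored image_hash matches the actual page image."""
    expected = "sha256:" + hashlib.sha256(image_bytes).hexdigest()
    return record.get("image_hash") == expected
```

Legitimately repeated lines, such as running headers or table labels, will also be reported, so treat the output as a list of candidates for review rather than an automatic rejection.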
Copy-paste LLM prompt
You have TXTOS and WFGY Problem Map.
My scan:
* page_no: {n}
* text_clean: "..."
* confidence: 0.xx
* image_hash: "..."
Tasks:
1. If text looks corrupted, fail fast and cite fix page.
2. Validate schema (ocr-parsing-checklist, data-contracts).
3. Return JSON: { "answer":"...", "citations":[...], "ΔS":0.xx, "λ_state":"..." }
🔗 Quick-Start Downloads (60 sec)
| Tool | Link | 3-Step Setup |
|---|---|---|
| WFGY 1.0 PDF | Engine Paper | 1️⃣ Download · 2️⃣ Upload to your LLM · 3️⃣ Ask “Answer using WFGY + <your question>” |
| TXT OS (plain-text OS) | TXTOS.txt | 1️⃣ Download · 2️⃣ Paste into any LLM chat · 3️⃣ Type “hello world” — OS boots instantly |