WFGY/ProblemMap/GlobalFixMap/OCR_Parsing/scanned_pdfs_and_quality.md
2025-08-28 19:26:15 +08:00

7.4 KiB
Raw Blame History

Scanned PDFs and Quality: OCR Parsing Guardrails

Stabilize OCR extraction on noisy scans, low-resolution images, and multi-generation photocopies. Ensure text is auditable, retrievable, and bound by schema despite quality issues.

Open these first

Acceptance targets

  • OCR character error rate (CER) ≤ 2% after cleanup
  • ΔS(question, retrieved) ≤ 0.45 even when scan quality < 300 dpi
  • λ remains convergent across paraphrases
  • All extracted text auditable against source image hash

Typical failure signatures → fix

  • Broken characters and merged glyphs
    Apply normalization and Unicode repair before indexing. Validate against a whitelist of expected ranges.

  • Multi-generation photocopy blur
    Route through OCR engine supporting adaptive binarization. Anchor outputs with image hash to avoid ghost drift.

  • Double-encoded PDFs (text + image overlay)
    Deduplicate layers. Choose the higher-confidence text layer and tag source.

  • Skewed pages or rotated scans
    Run deskew filter before OCR. Capture skew angle metadata for audit.

  • Mixed-language or font variants
    Force language models per region. Split by script. Store per-block language code.

  • Noise artifacts (staple marks, stamps, watermarks)
    Strip bounding boxes below token threshold. Mark as noise_block instead of narrative text.


Fix in 60 seconds

  1. Hash source image
    Store scan_id and image_hash for every page. Tie all extracted text back to this anchor.

  2. Normalize text
    Apply Unicode NFKC. Collapse broken ligatures and fix spacing errors.

  3. De-layer double PDFs
    Choose the OCR text layer with confidence ≥ 0.90. Drop shadow text.

  4. Audit with ΔS
    Probe scanned text with 3 paraphrases. If ΔS ≥ 0.60, run re-OCR with stricter binarization.

  5. Chunk and contract
    Split by page. Enforce data contract fields: page_no, scan_id, text_clean, bbox.


Minimal recipes by engine

  • Google Document AI
    Use qualityScores.confidence field. Reject blocks with confidence < 0.7.

  • AWS Textract
    Hash BlockType=PAGE. Keep page-level confidence. Store as scan_id.

  • Azure OCR
    Normalize boundingRegions. Add language code explicitly if detected.

  • ABBYY
    Use <charParams> confidence. Flag low confidence segments for secondary OCR.

  • PaddleOCR
    Use angle classification for deskew. Split multilingual pages into per-line language tags.


Data contract extension


{
"scan\_id": "p12\_imghash",
"page\_no": 12,
"image\_hash": "sha256:...",
"text\_clean": "...",
"language": "en",
"confidence": 0.92,
"noise\_blocks": \[...],
"source\_url": "..."
}


Verification

  • Leak check: ensure no shadow/duplicate text.
  • Quality probe: CER ≤ 2% on 1k sample chars.
  • Stability probe: ΔS stable across paraphrases.
  • Auditability: all text traceable to image hash.

Copy-paste LLM prompt


You have TXTOS and WFGY Problem Map.

My scan:

* page\_no: {n}
* text\_clean: "..."
* confidence: 0.xx
* image\_hash: "..."

Tasks:

1. If text looks corrupted, fail fast and cite fix page.
2. Validate schema (ocr-parsing-checklist, data-contracts).
3. Return JSON: { "answer":"...", "citations":\[...], "ΔS":0.xx, "λ\_state":"..." }


🔗 Quick-Start Downloads (60 sec)

Tool Link 3-Step Setup
WFGY 1.0 PDF Engine Paper 1 Download · 2 Upload to your LLM · 3 Ask “Answer using WFGY + <your question>”
TXT OS (plain-text OS) TXTOS.txt 1 Download · 2 Paste into any LLM chat · 3 Type “hello world” — OS boots instantly

🧭 Explore More

Module Description Link
WFGY Core WFGY 2.0 engine is live: full symbolic reasoning architecture and math stack View →
Problem Map 1.0 Initial 16-mode diagnostic and symbolic fix framework View →
Problem Map 2.0 RAG-focused failure tree, modular fixes, and pipelines View →
Semantic Clinic Index Expanded failure catalog: prompt injection, memory bugs, logic drift View →
Semantic Blueprint Layer-based symbolic reasoning & semantic modulations View →
Benchmark vs GPT-5 Stress test GPT-5 with full WFGY reasoning suite View →
🧙‍♂️ Starter Village 🏡 New here? Lost in symbols? Click here and let the wizard guide you through Start →

👑 Early Stargazers: See the Hall of Fame
GitHub stars WFGY Engine 2.0 is already unlocked. Star the repo to help others discover it and unlock more on the Unlock Board.

WFGY Main   TXT OS   Blah   Blot   Bloc   Blur   Blow  

要不要我直接幫你接續做下一個 multi_language_and_fonts.md