mirror of
https://github.com/onestardao/WFGY.git
synced 2026-04-28 19:50:17 +00:00
260 lines
9.8 KiB
Markdown
260 lines
9.8 KiB
Markdown
<!--
|
||
Search Anchor:
|
||
document ai ocr global fix map
|
||
ocr pipeline bugs
|
||
pdf ocr failures
|
||
scanned pdf processing
|
||
two column pdf ocr
|
||
multi column layout broken
|
||
tables lost after ocr
|
||
table structure lost
|
||
column alignment lost
|
||
semantic grouping broken
|
||
paragraphs split wrong
|
||
lines merged across columns
|
||
forms invoices receipts ocr
|
||
handwritten notes ocr
|
||
layout aware model drift
|
||
doc ai layout models
|
||
ocr plus rag pipeline
|
||
ocr plus embeddings
|
||
ocr json schema mismatch
|
||
ocr json fields change
|
||
bbox page_id confidence text
|
||
page coordinate traceability
|
||
citations do not match scanned page
|
||
scanned region mismatch
|
||
image to text but meaning wrong
|
||
high similarity wrong snippet
|
||
delta s question extracted text
|
||
lambda observe ocr stability
|
||
coverage target page section
|
||
coverage >= 0.70 on gold page
|
||
e_resonance flat long document
|
||
|
||
providers and tools:
|
||
tesseract
|
||
tesseract js
|
||
google document ai
|
||
google docai
|
||
aws textract
|
||
azure ocr
|
||
abbyy
|
||
paddleocr
|
||
open source ocr engines
|
||
cloud ocr apis
|
||
vision plus ocr models
|
||
ocr evaluation
|
||
|
||
pipelines and formats:
|
||
pdf scans
|
||
fax like images
|
||
rotated scans
|
||
sideways pages
|
||
deskew and derotate
|
||
multi page pdf
|
||
forms invoices receipts
|
||
two pass ocr
|
||
dual provider comparison
|
||
json lines output
|
||
bbox normalization
|
||
page segmentation
|
||
region detection
|
||
table detection
|
||
header footer detection
|
||
|
||
contracts and guardrails:
|
||
traceability contract
|
||
snippet_id section_id source_url offsets tokens
|
||
ocr traceability fields
|
||
page_id bbox text confidence
|
||
stable ids for pages and blocks
|
||
boot ordering for ocr version
|
||
pre deploy collapse ocr version drift
|
||
cold start fences for ocr engines
|
||
cite then explain from scanned page
|
||
ocr plus retrieval traceability
|
||
rag on top of scanned corpus
|
||
|
||
common incidents:
|
||
ocr text looks fine but rag answers wrong
|
||
citations go to wrong pdf page
|
||
multi column layout breaks retrieval
|
||
ocr version changed after deploy
|
||
provider returns different json fields
|
||
ocr plus vision hybrid worse than single
|
||
different answers for paraphrased question
|
||
two ocr runs disagree strongly
|
||
|
||
use with:
|
||
delta s <= 0.45
|
||
coverage >= 0.70
|
||
lambda stable across 3 paraphrases 2 seeds
|
||
e_resonance flat across pages
|
||
-->
|
||
|
||
<!--
|
||
Primary pages in this folder:
|
||
ProblemMap/GlobalFixMap/DocumentAI_and_OCR/tesseract.md
|
||
ProblemMap/GlobalFixMap/DocumentAI_and_OCR/google_docai.md
|
||
ProblemMap/GlobalFixMap/DocumentAI_and_OCR/aws_textract.md
|
||
ProblemMap/GlobalFixMap/DocumentAI_and_OCR/azure_ocr.md
|
||
ProblemMap/GlobalFixMap/DocumentAI_and_OCR/abbyy.md
|
||
ProblemMap/GlobalFixMap/DocumentAI_and_OCR/paddleocr.md
|
||
-->
|
||
|
||
<!--
|
||
Related routing pages:
|
||
ProblemMap/embedding-vs-semantic.md
|
||
ProblemMap/retrieval-traceability.md
|
||
ProblemMap/data-contracts.md
|
||
ProblemMap/GlobalFixMap/Chunking/pdf_layouts_and_ocr.md
|
||
ProblemMap/GlobalFixMap/Chunking/chunking-checklist.md
|
||
ProblemMap/bootstrap-ordering.md
|
||
ProblemMap/predeploy-collapse.md
|
||
ProblemMap/patterns/pattern_query_parsing_split.md
|
||
ProblemMap/retrieval-playbook.md
|
||
ProblemMap/context-drift.md
|
||
-->
|
||
|
||
<!--
|
||
Cross folder jumps:
|
||
ProblemMap/GlobalFixMap/Chunking/README.md
|
||
ProblemMap/GlobalFixMap/Retrieval/README.md
|
||
ProblemMap/GlobalFixMap/Embeddings/README.md
|
||
ProblemMap/GlobalFixMap/VectorDBs_and_Stores/README.md
|
||
ProblemMap/SemanticClinicIndex.md
|
||
-->
|
||
|
||
|
||
# Document AI & OCR — Global Fix Map
|
||
|
||
<details>
|
||
<summary><strong>🏥 Quick Return to Emergency Room</strong></summary>
|
||
|
||
<br>
|
||
|
||
> You are in a specialist desk.
|
||
> For full triage and doctors on duty, return here:
|
||
>
|
||
> - [**WFGY Global Fix Map** — main Emergency Room, 300+ structured fixes](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/README.md)
|
||
> - [**WFGY Problem Map 1.0** — 16 reproducible failure modes](https://github.com/onestardao/WFGY/blob/main/ProblemMap/README.md)
|
||
>
|
||
> Think of this page as a sub-room.
|
||
> If you want full consultation and prescriptions, go back to the Emergency Room lobby.
|
||
</details>
|
||
|
||
A **beginner-friendly hub** to stabilize OCR (Optical Character Recognition) and document AI pipelines across providers and open-source stacks.
|
||
This page helps you:
|
||
1. Understand common OCR failures.
|
||
2. Jump directly to per-tool guides.
|
||
3. Apply structural WFGY fixes with measurable acceptance targets.
|
||
|
||
---
|
||
|
||
## 📌 When to use this folder
|
||
Use this map if you see any of these problems:
|
||
- OCR extracts text but loses **tables or column alignment**.
|
||
- Words are captured but **semantic grouping is wrong** (paragraphs broken).
|
||
- Citations don’t match the **original scanned page**.
|
||
- Layout-aware models drift after **format changes** (e.g. headers, forms).
|
||
- Two-column PDFs or rotated scans break retrieval.
|
||
- Cloud OCR services return **different JSON fields** each run.
|
||
|
||
---
|
||
|
||
## 🎯 Acceptance targets for OCR systems
|
||
Think of these as “green lights” after your OCR step:
|
||
- **ΔS(question, extracted text) ≤ 0.45** (semantic match stays tight).
|
||
- **Coverage ≥ 0.70** of target section or table.
|
||
- **λ stays convergent** across 3 paraphrases and 2 random seeds.
|
||
- **E_resonance stays flat** across long documents (no drifting answers).
|
||
|
||
---
|
||
|
||
## 🚀 Quick routes — per-provider guides
|
||
|
||
| Provider / Tool | Open this guide |
|
||
|-------------------------|-----------------|
|
||
| **Tesseract** (open-source OCR) | [tesseract.md](./tesseract.md) |
|
||
| **Google Document AI** | [google_docai.md](./google_docai.md) |
|
||
| **AWS Textract** | [aws_textract.md](./aws_textract.md) |
|
||
| **Azure OCR** | [azure_ocr.md](./azure_ocr.md) |
|
||
| **ABBYY** (enterprise OCR) | [abbyy.md](./abbyy.md) |
|
||
| **PaddleOCR** (open-source) | [paddleocr.md](./paddleocr.md) |
|
||
|
||
---
|
||
|
||
## 🛠️ Common symptoms → exact fixes
|
||
|
||
| Symptom | Likely cause | Fix page |
|
||
|---------|--------------|----------|
|
||
| High similarity but wrong snippet | Embeddings confuse words with meaning | [embedding-vs-semantic.md](../../embedding-vs-semantic.md) |
|
||
| Citations don’t line up with scanned region | Missing traceability or weak schema | [retrieval-traceability.md](../../retrieval-traceability.md) · [data-contracts.md](../../data-contracts.md) |
|
||
| Multi-column / rotated pages fail | Chunking instability | [chunking-checklist.md](../../chunking-checklist.md) |
|
||
| Wrong OCR version after deploy | Boot ordering or pre-deploy collapse | [bootstrap-ordering.md](../../bootstrap-ordering.md) · [predeploy-collapse.md](../../predeploy-collapse.md) |
|
||
| OCR+Vision hybrid worse than single | Query parsing split issue | [pattern_query_parsing_split.md](../../patterns/pattern_query_parsing_split.md) |
|
||
|
||
---
|
||
|
||
## ✅ 60-second fix checklist
|
||
1. Run OCR twice (two providers or seeds) → compare ΔS & λ.
|
||
2. Validate JSON schema → enforce `{page_id, bbox, text, confidence}`.
|
||
3. De-rotate scans, split multi-column before embedding.
|
||
4. Confirm **coverage ≥ 0.70** on a gold page.
|
||
5. Force “cite then explain” in downstream reasoning steps.
|
||
|
||
---
|
||
|
||
## ❓ FAQ (beginner-friendly)
|
||
|
||
**Q: What is ΔS and why should I care?**
|
||
ΔS measures semantic drift — if it’s above 0.45, your OCR text no longer matches the question well. Keep it lower to ensure stable answers.
|
||
|
||
**Q: What does λ mean in practice?**
|
||
λ checks consistency across paraphrases. If the system gives different answers for re-phrased questions, λ is unstable.
|
||
|
||
**Q: Why do my citations not match the scanned PDF?**
|
||
Usually because the OCR JSON has no stable IDs or coordinates. Fix by enforcing traceability fields like `page_id` and `bbox`.
|
||
|
||
**Q: My OCR works on simple PDFs but fails on forms or invoices. Why?**
|
||
That’s a **chunking issue**. Multi-column and rotated layouts need pre-processing before feeding to embeddings.
|
||
|
||
**Q: Do I need to switch providers if accuracy is low?**
|
||
Not always. Most errors come from pipeline design (chunking, contracts, retrieval) rather than the OCR engine itself.
|
||
|
||
|
||
---
|
||
|
||
### 🔗 Quick-Start Downloads (60 sec)
|
||
|
||
| Tool | Link | 3-Step Setup |
|
||
|------|------|--------------|
|
||
| **WFGY 1.0 PDF** | [Engine Paper](https://github.com/onestardao/WFGY/blob/main/I_am_not_lizardman/WFGY_All_Principles_Return_to_One_v1.0_PSBigBig_Public.pdf) | 1️⃣ Download · 2️⃣ Upload to your LLM · 3️⃣ Ask “Answer using WFGY + <your OCR issue>” |
|
||
| **TXT OS (plain-text OS)** | [TXTOS.txt](https://github.com/onestardao/WFGY/blob/main/OS/TXTOS.txt) | 1️⃣ Download · 2️⃣ Paste into any LLM chat · 3️⃣ Type “hello world” — OS boots instantly |
|
||
|
||
---
|
||
|
||
<!-- WFGY_FOOTER_START -->
|
||
|
||
### Explore More
|
||
|
||
| Layer | Page | What it’s for |
|
||
| --- | --- | --- |
|
||
| ⭐ Proof | [WFGY Recognition Map](/recognition/README.md) | External citations, integrations, and ecosystem proof |
|
||
| ⚙️ Engine | [WFGY 1.0](/legacy/README.md) | Original PDF tension engine and early logic sketch (legacy reference) |
|
||
| ⚙️ Engine | [WFGY 2.0](/core/README.md) | Production tension kernel for RAG and agent systems |
|
||
| ⚙️ Engine | [WFGY 3.0](/TensionUniverse/EventHorizon/README.md) | TXT based Singularity tension engine (131 S class set) |
|
||
| 🗺️ Map | [Problem Map 1.0](/ProblemMap/README.md) | Flagship 16 problem RAG failure taxonomy and fix map |
|
||
| 🗺️ Map | [Problem Map 2.0](/ProblemMap/wfgy-rag-16-problem-map-global-debug-card.md) | Global Debug Card for RAG and agent pipeline diagnosis |
|
||
| 🗺️ Map | [Problem Map 3.0](/ProblemMap/wfgy-ai-problem-map-troubleshooting-atlas.md) | Global AI troubleshooting atlas and failure pattern map |
|
||
| 🧰 App | [TXT OS](/OS/README.md) | .txt semantic OS with fast bootstrap |
|
||
| 🧰 App | [Blah Blah Blah](/OS/BlahBlahBlah/README.md) | Abstract and paradox Q&A built on TXT OS |
|
||
| 🧰 App | [Blur Blur Blur](/OS/BlurBlurBlur/README.md) | Text to image generation with semantic control |
|
||
| 🏡 Onboarding | [Starter Village](/StarterVillage/README.md) | Guided entry point for new users |
|
||
|
||
If this repository helped, starring it improves discovery so more builders can find the docs and tools.
|
||
[](https://github.com/onestardao/WFGY)
|
||
|
||
<!-- WFGY_FOOTER_END -->
|
||
|