mirror of
https://github.com/onestardao/WFGY.git
synced 2026-04-28 19:50:17 +00:00
Document AI & OCR — Global Fix Map
A hub to stabilize OCR and document AI pipelines across providers and open-source stacks.
Use this folder to jump to guardrails, check common breakpoints, and apply structural fixes with measurable targets.
Quick routes to per-provider pages
- Tesseract: tesseract.md
- Google Document AI: google_docai.md
- AWS Textract: aws_textract.md
- Azure OCR: azure_ocr.md
- ABBYY: abbyy.md
- PaddleOCR: paddleocr.md
When to use this folder
- OCR extracts text but misses table alignment or field boundaries.
- Word recall is high, but the words are grouped into the wrong semantic units.
- Citations do not match the scanned sections they point to.
- Layout-aware models drift when the document format changes.
- Two-column or rotated pages break retrieval.
- A cloud OCR service returns an inconsistent JSON schema across runs.
Acceptance targets for any OCR system
- ΔS(question, extracted text) ≤ 0.45
- Field/section coverage ≥ 0.70
- λ remains convergent across 3 paraphrases and 2 seeds
- E_resonance flat over long document windows
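
The first two targets can be checked mechanically. The sketch below is a minimal illustration, not the project's reference implementation. It assumes ΔS is computed as 1 minus the cosine similarity of two embedding vectors, and that coverage is the share of expected fields with a non-empty extracted value; this page defines neither, so treat both as assumptions. The function names are hypothetical. λ and E_resonance are left out because this page does not say how to compute them.

```python
import math


def cosine_similarity(a, b):
    """Plain cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def delta_s(question_vec, text_vec):
    # ASSUMPTION: delta-S = 1 - cosine similarity of the two embeddings.
    return 1.0 - cosine_similarity(question_vec, text_vec)


def field_coverage(expected_fields, extracted):
    # ASSUMPTION: coverage = fraction of expected fields that came back
    # with a non-empty value.
    if not expected_fields:
        return 0.0
    found = 0
    for name in expected_fields:
        value = extracted.get(name)
        if value is not None and str(value).strip() != "":
            found += 1
    return found / len(expected_fields)


def passes_targets(ds, coverage, max_delta_s=0.45, min_coverage=0.70):
    """Apply the two numeric thresholds listed above."""
    return ds <= max_delta_s and coverage >= min_coverage
```

With four expected fields and three extracted, `field_coverage` returns 0.75, which clears the 0.70 target; a ΔS of 0.50 would still fail the gate.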
Map symptoms → structural fixes (Problem Map)
- High similarity but wrong snippet
  → embedding-vs-semantic.md
- Traceability missing, citations don’t line up with scanned region
  → retrieval-traceability.md
  → data-contracts.md
- Chunking instability (multi-column / rotated scans)
  → chunking-checklist.md
- Cold boot / wrong version OCR model
  → bootstrap-ordering.md
  → predeploy-collapse.md
- Hybrid OCR (vision + text) worse than single mode
  → pattern_query_parsing_split.md
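
For the multi-column chunking symptom, one common cause is that OCR blocks are read row by row across both columns. The sketch below shows one simple way to restore reading order before chunking. It assumes each block carries a `bbox` as `[x0, y0, x1, y1]` in page coordinates with the origin at the top left, and that the page has exactly two columns split at the horizontal midpoint. The function name is hypothetical, and full-width headers or three-column layouts would need a real layout detector.

```python
def two_column_reading_order(blocks, page_width):
    """Order OCR blocks as: left column top-to-bottom, then right column.

    ASSUMPTION: bbox is [x0, y0, x1, y1], y grows downward, and the two
    columns are separated at the page's horizontal midpoint.
    """
    midpoint = page_width / 2
    left, right = [], []
    for block in blocks:
        x0, _, x1, _ = block["bbox"]
        center_x = (x0 + x1) / 2
        if center_x < midpoint:
            left.append(block)
        else:
            right.append(block)

    def top_then_left(block):
        # Sort by top edge first, then by left edge to break ties.
        return (block["bbox"][1], block["bbox"][0])

    return sorted(left, key=top_then_left) + sorted(right, key=top_then_left)
```

Chunking the reordered blocks keeps each chunk inside a single column, which is the property the chunking checklist is meant to protect.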
60-second fix checklist
- Run OCR twice with different seeds / providers. Compare ΔS and λ.
- Validate JSON schema consistency: enforce fields {page_id, bbox, text, confidence}.
- Apply de-rotation and multi-column split before embedding.
- Check coverage ≥ 0.70 on a gold page.
- Enforce cite-then-explain in downstream reasoning.
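
The schema step in the checklist can be enforced with a small validator that runs on every OCR record before it reaches the embedding stage. This is a minimal sketch using only the standard library. The four field names come from the checklist; the value constraints (a four-number `bbox`, a `confidence` between 0 and 1, a string `text`) are assumptions, since this page does not specify types. The function name is hypothetical.

```python
REQUIRED_FIELDS = {"page_id", "bbox", "text", "confidence"}


def _is_number(value):
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def validate_record(record):
    """Return a list of problems; an empty list means the record is valid."""
    problems = []

    missing = REQUIRED_FIELDS - set(record)
    if missing:
        problems.append("missing fields: " + ", ".join(sorted(missing)))

    # ASSUMPTION: bbox is a sequence of four numbers.
    if "bbox" in record:
        bbox = record["bbox"]
        if not (isinstance(bbox, (list, tuple)) and len(bbox) == 4
                and all(_is_number(v) for v in bbox)):
            problems.append("bbox must be four numbers")

    # ASSUMPTION: text is a string.
    if "text" in record and not isinstance(record["text"], str):
        problems.append("text must be a string")

    # ASSUMPTION: confidence is a number in [0, 1].
    if "confidence" in record:
        conf = record["confidence"]
        if not (_is_number(conf) and 0.0 <= conf <= 1.0):
            problems.append("confidence must be a number in [0, 1]")

    return problems
```

Running the same validator over the output of two OCR runs, or two providers, gives a quick signal of schema drift: any record that returns a non-empty list should be rejected or normalized before comparison.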
🔗 Quick-Start Downloads (60 sec)
| Tool | Link | 3-Step Setup |
|---|---|---|
| WFGY 1.0 PDF | Engine Paper | 1️⃣ Download · 2️⃣ Upload to your LLM · 3️⃣ Ask “Answer using WFGY + ” |
| TXT OS (plain-text OS) | TXTOS.txt | 1️⃣ Download · 2️⃣ Paste into any LLM chat · 3️⃣ Type “hello world” — OS boots instantly |
🧭 Explore More
| Module | Description | Link |
|---|---|---|
| WFGY Core | Semantic firewall engine (reasoning & math) | View → |
| Problem Map 1.0 | Original 16-mode fix framework | View → |
| Semantic Clinic Index | Expanded clinic: OCR, prompt injection, memory drift | View → |
| Benchmarks vs GPT-5 | OCR + reasoning stress test | View → |
👑 Hall of Fame: See the Stargazers who supported this from the start.