vrr/WFGY

mirror of https://github.com/onestardao/WFGY.git synced 2026-05-19 16:31:07 +00:00

PSBigBig 2547498a8f Create README.md		2025-08-28 17:11:41 +08:00

Document AI & OCR — Global Fix Map

A hub to stabilize OCR and document AI pipelines across providers and open-source stacks.
Use this folder to jump to guardrails, check common breakpoints, and apply structural fixes with measurable targets.

Quick routes to per-provider pages

Tesseract: tesseract.md
Google Document AI: google_docai.md
AWS Textract: aws_textract.md
Azure OCR: azure_ocr.md
ABBYY: abbyy.md
PaddleOCR: paddleocr.md

When to use this folder

OCR extracts text but misses table alignment or field boundaries.
High word recall but wrong semantic grouping.
Citations do not match the scanned sections they point to.
Layout-aware models drift when the document format changes.
Two-column or rotated pages break retrieval.
A cloud OCR service returns an inconsistent JSON schema across runs.

Acceptance targets for any OCR system

ΔS(question, extracted text) ≤ 0.45
Field/section coverage ≥ 0.70
λ remains convergent across 3 paraphrases and 2 seeds
E_resonance flat over long document windows
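
The first two targets above can be checked mechanically. The sketch below is a minimal illustration, not the WFGY implementation: it assumes ΔS is computed as 1 minus the cosine similarity of two embedding vectors, and that coverage is the fraction of expected fields with a non-empty extracted value. The function names (`delta_s`, `coverage`, `meets_targets`) and the input shapes are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def delta_s(question_vec, text_vec):
    # Assumption: delta S = 1 - cosine similarity of the two embeddings.
    return 1.0 - cosine_similarity(question_vec, text_vec)


def coverage(expected_fields, extracted):
    """Fraction of expected fields that have a non-empty extracted value."""
    if not expected_fields:
        return 1.0
    found = 0
    for name in expected_fields:
        value = extracted.get(name)
        if value is not None and str(value).strip():
            found += 1
    return found / len(expected_fields)


def meets_targets(question_vec, text_vec, expected_fields, extracted,
                  max_delta_s=0.45, min_coverage=0.70):
    """Check the delta S and coverage thresholds listed in this section."""
    ds = delta_s(question_vec, text_vec)
    cov = coverage(expected_fields, extracted)
    return {
        "delta_s": ds,
        "coverage": cov,
        "pass": ds <= max_delta_s and cov >= min_coverage,
    }
```

The embeddings would come from whatever model the pipeline already uses. The λ and E_resonance targets are not covered here because this page does not define how they are measured.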

Map symptoms → structural fixes (Problem Map)

High similarity but wrong snippet
→ embedding-vs-semantic.md
Traceability missing, citations don’t line up with scanned region
→ retrieval-traceability.md
→ data-contracts.md
Chunking instability (multi-column / rotated scans)
→ chunking-checklist.md
Cold boot or wrong OCR model version
→ bootstrap-ordering.md
→ predeploy-collapse.md
Hybrid OCR (vision + text) worse than single mode
→ pattern_query_parsing_split.md
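
For the multi-column chunking symptom in the map above, one common cause is reading blocks in plain top-to-bottom order, which interleaves the columns. The sketch below is a minimal illustration under strong assumptions: exactly two columns with a gutter at the page midpoint, and blocks carrying a `bbox` of `[x0, y0, x1, y1]`. The function name `reading_order` is hypothetical, and full-width blocks such as headers are not handled.

```python
def reading_order(blocks, page_width):
    """Order OCR blocks left column first, then right column.

    Assumes a two-column page split at the horizontal midpoint and
    bbox = [x0, y0, x1, y1] with the origin at the top-left corner.
    """
    midpoint = page_width / 2.0
    left, right = [], []
    for block in blocks:
        x0, _, x1, _ = block["bbox"]
        center_x = (x0 + x1) / 2.0
        # Assign each block to a column by the center of its box.
        (left if center_x < midpoint else right).append(block)
    # Within a column, read from top to bottom.
    left.sort(key=lambda b: b["bbox"][1])
    right.sort(key=lambda b: b["bbox"][1])
    return left + right
```

Chunks built from this order keep each column's text contiguous, so an embedding window is less likely to mix sentences from unrelated columns.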

60-second fix checklist

Run OCR twice with different seeds / providers. Compare ΔS and λ.
Validate JSON schema consistency: enforce fields {page_id, bbox, text, confidence}.
Apply de-rotation and multi-column split before embedding.
Check coverage ≥ 0.70 on a gold page.
Enforce cite-then-explain in downstream reasoning.
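
The schema check in the list above can be sketched as a small validator. The four required field names come from the checklist; everything else is an assumption made for illustration: that `bbox` is four numbers, that `confidence` lies in [0, 1], and the function names `validate_block` and `validate_run`.

```python
REQUIRED_FIELDS = {"page_id", "bbox", "text", "confidence"}


def _is_number(value):
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def validate_block(block):
    """Return a list of problems for one OCR block; empty means valid."""
    missing = REQUIRED_FIELDS - set(block)
    if missing:
        return [f"missing fields: {sorted(missing)}"]
    problems = []
    bbox = block["bbox"]
    # Assumption: bbox is a sequence of four numbers.
    if not (isinstance(bbox, (list, tuple)) and len(bbox) == 4
            and all(_is_number(v) for v in bbox)):
        problems.append("bbox must be four numbers")
    if not isinstance(block["text"], str):
        problems.append("text must be a string")
    conf = block["confidence"]
    # Assumption: confidence is normalized to the range [0, 1].
    if not _is_number(conf) or not 0.0 <= conf <= 1.0:
        problems.append("confidence must be a number in [0, 1]")
    return problems


def validate_run(blocks):
    """Map block index to its problems, keeping only invalid blocks."""
    report = {}
    for index, block in enumerate(blocks):
        problems = validate_block(block)
        if problems:
            report[index] = problems
    return report
```

Running `validate_run` on the output of each OCR run and comparing the reports gives a quick signal of schema drift between runs or providers.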

🔗 Quick-Start Downloads (60 sec)

Tool	Link	3-Step Setup
WFGY 1.0 PDF	Engine Paper	1️⃣ Download · 2️⃣ Upload to your LLM · 3️⃣ Ask “Answer using WFGY + ”
TXT OS (plain-text OS)	TXTOS.txt	1️⃣ Download · 2️⃣ Paste into any LLM chat · 3️⃣ Type “hello world” — OS boots instantly

🧭 Explore More

Module	Description	Link
WFGY Core	Semantic firewall engine (reasoning & math)	View →
Problem Map 1.0	Original 16-mode fix framework	View →
Semantic Clinic Index	Expanded clinic: OCR, prompt injection, memory drift	View →
Benchmarks vs GPT-5	OCR + reasoning stress test	View →

👑 Hall of Fame: See the Stargazers who supported this from the start.
