Layout, Headers, and Footers: OCR Parsing Guardrails
Strip or normalize page furniture before chunking or embedding. Stop headers, footers, page numbers, and watermarks from polluting semantic meaning and wrecking retrieval.
Open these first
- OCR end-to-end checklist: ocr-parsing-checklist.md
- Snippet and citation schema: data-contracts.md
- Why this snippet: retrieval-traceability.md
- Chunking checklist: chunking-checklist.md
Acceptance targets
- ΔS(question, retrieved) ≤ 0.45 after layout cleanup
- Coverage ≥ 0.70 for the target section
- λ stays convergent across three paraphrases and two seeds
- Zero header/footer strings inside content tokens of any chunk
Typical failure signatures → exact fix
- Running headers become part of every chunk
  Detect repeating lines at top bands per page. Move them to `page_furniture.header` metadata or drop by rule. See: ocr-parsing-checklist.md
- Footers and page numbers leak into answers
  Identify bottom band strings and numeric patterns. Attach to metadata `page_furniture.footer` and `page_furniture.page_num`, not the content body. See: data-contracts.md
- Watermarks mix with paragraphs
  Use low opacity or diagonal angle cues plus oversized bbox to flag watermark blocks. Remove from body, keep `watermark_text` only in metadata. See: chunking-checklist.md
- Section title duplicated on every page
  Treat it as header when identical across ≥ 2 consecutive pages. Promote a single canonical section anchor. See: retrieval-traceability.md
- Footnotes interleaved with paragraph flow
  Extract footnote blocks. Keep `footnote_id`, `anchor_offset`, and a separate citation lane. Never mix into paragraph tokens. See: data-contracts.md
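The duplicated-title rule in the list above (identical across ≥ 2 consecutive pages, then one canonical anchor) can be sketched roughly as follows. This is a minimal illustration, not a reference implementation: the function name `canonical_section_anchors` and the input shape (top-band lines grouped by page number) are assumptions made for the example.

```python
from typing import Dict, List


def canonical_section_anchors(top_lines_by_page: Dict[int, List[str]]) -> Dict[str, int]:
    """Return {title: first_page} for lines identical on >= 2 consecutive pages.

    Assumption: top_lines_by_page maps a page number to the text lines found
    in that page's top band. This input shape is hypothetical.
    """
    anchors: Dict[str, int] = {}
    pages = sorted(top_lines_by_page)
    for prev, cur in zip(pages, pages[1:]):
        if cur != prev + 1:
            continue  # a gap in page numbers breaks the "consecutive" condition
        repeated = set(top_lines_by_page[prev]) & set(top_lines_by_page[cur])
        for line in repeated:
            # Pages are visited in ascending order, so the first hit is the
            # earliest page; keep it as the single canonical anchor.
            anchors.setdefault(line, prev)
    return anchors
```

Lines returned here would be treated as headers on every page and kept once as the section anchor on the page number given.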
Fix in 60 seconds
- Detect bands
  For each page, split objects by vertical bands: `top ≤ 12%`, `bottom ≥ 88%`. Flag candidates for headers and footers.
- Find repeats
  Normalize text (case, whitespace, punctuation), then mark strings that repeat across pages. Anything stable goes to furniture.
- Rewrite content
  Remove furniture from body tokens. Store originals under `page_furniture`.
- Rechunk
  Chunk paragraphs without furniture. Carry `section_id`, `page`, and cleaned offsets.
- Probe
  Re-run three paraphrases. ΔS drops and λ stops flipping if furniture is out of the body.
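The first three steps (detect bands, find repeats, rewrite content) might look like the sketch below. It assumes each OCR line arrives as a `(page, y_top, y_bottom, text)` tuple with y values expressed as fractions of page height, and that a line counts as in-band only when it sits entirely inside the band; both are assumptions for illustration, as are the names `normalize`, `band_of`, and `split_furniture`. Furniture is stored as lists here, which differs slightly from a single-string field.

```python
import re
from collections import Counter
from typing import Dict, List, Tuple

# (page, y_top, y_bottom, text); y values are fractions of page height,
# 0.0 = top edge, 1.0 = bottom edge. This tuple layout is an assumption.
Line = Tuple[int, float, float, str]

TOP_BAND = 0.12     # top <= 12% of the page
BOTTOM_BAND = 0.88  # bottom >= 88% of the page


def normalize(text: str) -> str:
    """Fold case, drop punctuation, collapse whitespace."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return re.sub(r"\s+", " ", text).strip()


def band_of(line: Line) -> str:
    """Classify a line as header, footer, or body by vertical position."""
    _, y_top, y_bottom, _ = line
    if y_bottom <= TOP_BAND:
        return "header"
    if y_top >= BOTTOM_BAND:
        return "footer"
    return "body"


def split_furniture(lines: List[Line], min_pages: int = 2) -> Dict[int, dict]:
    """Return {page: {"text_clean": str, "page_furniture": {...}}}."""
    # Count on how many distinct pages each normalized in-band string appears.
    seen = set()
    page_counts: Counter = Counter()
    for line in lines:
        band = band_of(line)
        key = (band, normalize(line[3]))
        if band != "body" and (line[0], key) not in seen:
            seen.add((line[0], key))
            page_counts[key] += 1

    out: Dict[int, dict] = {}
    for line in lines:
        page, _, _, text = line
        entry = out.setdefault(
            page, {"body": [], "page_furniture": {"header": [], "footer": []}}
        )
        band = band_of(line)
        if band != "body" and page_counts[(band, normalize(text))] >= min_pages:
            entry["page_furniture"][band].append(text)  # original kept, out of body
        else:
            entry["body"].append(text)
    for entry in out.values():
        entry["text_clean"] = "\n".join(entry.pop("body"))
    return out
```

A line that sits in a band but appears on only one page stays in the body, so a title at the top of the first page is not discarded by the band check alone.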
Minimal recipes by engine
- Google Document AI
  Use `layout.boundingPoly` for band checks. Identify `paragraph` blocks within top/bottom polygons and mark as furniture. Keep `detectedLanguages`. Apply after-parse dedupe across pages.
- AWS Textract
  From `Blocks` with `BlockType=LINE` or `WORD`, inspect `Geometry.BoundingBox.Top`/`Height`. Route repeated top-band lines to `page_furniture.header`, bottom band to `footer`. Keep `PAGE` relationships for page numbers.
- Azure OCR
  Use `lines` with `boundingRegions`. Sort by `polygon` y positions, isolate bands, then repeat-check across pages. Store page number patterns `^\s*[ivxlcdm]+$|^\s*\d+\s*$` as `page_num`.
- ABBYY
  Export XML, read `<block>` with coordinates. Apply band and repeat filters. Preserve watermarks as `watermark_text` if block has style or angle attributes.
- PaddleOCR
  Use bbox outputs to filter by top/bottom thresholds. De-dup by normalized text across pages. Keep furniture in metadata only.
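As one concrete case, the Textract recipe could be sketched like this. The code works on an already-fetched response dictionary (the shape returned by Textract, with `Blocks`, `BlockType`, `Text`, `Page`, and `Geometry.BoundingBox`), so it makes no API call. The function name `route_textract_lines`, the output layout, and the choice to apply the page-number pattern from the Azure recipe to both bands are assumptions for illustration. It only produces band candidates; the cross-page repeat check still has to run afterwards.

```python
import re
from collections import defaultdict
from typing import Dict

# Page-number pattern taken from the recipe list above
# (lowercase roman numerals or plain digits).
PAGE_NUM = re.compile(r"^\s*[ivxlcdm]+$|^\s*\d+\s*$")


def route_textract_lines(response: dict, top: float = 0.12, bottom: float = 0.88) -> Dict[int, dict]:
    """Split Textract LINE blocks into body / header / footer candidates per page."""
    pages: Dict[int, dict] = defaultdict(
        lambda: {"body": [], "header": [], "footer": [], "page_num": None}
    )
    for block in response.get("Blocks", []):
        if block.get("BlockType") != "LINE":
            continue  # PAGE, WORD and other block types are ignored in this sketch
        box = block["Geometry"]["BoundingBox"]  # values are ratios of page size
        text = block.get("Text", "")
        page = pages[block.get("Page", 1)]
        y_top, y_bottom = box["Top"], box["Top"] + box["Height"]
        if y_bottom <= top:
            band = "header"
        elif y_top >= bottom:
            band = "footer"
        else:
            band = "body"
        if band != "body" and PAGE_NUM.match(text):
            page["page_num"] = text.strip()
        else:
            page[band].append(text)
    return dict(pages)
```

The other engines expose the same information under different field names, so the band logic carries over once coordinates are normalized to page fractions.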
Data contract additions for layout cleanup
Add these fields to your snippet schema:
{
  "page": 7,
  "section_id": "2.3",
  "bbox": [x0,y0,x1,y1],
  "text_clean": "...", // body without furniture
  "text_raw": "...", // optional, original page text
  "page_furniture": {
    "header": "Company Annual Report 2024",
    "footer": "Confidential",
    "page_num": "xv",
    "watermark_text": "DRAFT"
  },
  "footnotes": [
    {"id":"fn12","text":"...","anchor_offset":325}
  ],
  "source_url": "..."
}
Mandatory rule: the model must read `text_clean` for reasoning. `page_furniture` is trace-only.
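One way to enforce this rule in code is to build the model's context from `text_clean` alone, so furniture cannot reach the prompt by accident. The helper below is a hypothetical sketch; the name `build_model_context` and the citation prefix format are assumptions, while the field names come from the schema above.

```python
from typing import Iterable


def build_model_context(snippets: Iterable[dict]) -> str:
    """Join only text_clean fields; page_furniture never reaches the model."""
    parts = []
    for snip in snippets:
        # Prefix each snippet with citable coordinates from the contract.
        prefix = f"[section {snip['section_id']}, page {snip['page']}]"
        parts.append(f"{prefix}\n{snip['text_clean']}")
    return "\n\n".join(parts)
```

Because the function never reads `page_furniture` or `text_raw`, those fields remain available for tracing without influencing the answer.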
Verification
- Furniture leak test: sample 20 chunks, assert no header/footer strings inside `text_clean`.
- ΔS drop: compare ΔS before and after cleaning on the same question. Target ≤ 0.45.
- λ stability: shuffle prompt headers, confirm λ stays convergent.
- Footnote audit: a question about a footnote must cite `footnotes[*].id`, not guess from body text.
If ΔS remains flat and high, reopen chunking and metric checks. See: chunking-checklist.md
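The furniture leak test from the list above can be automated with a few lines. The sketch below assumes chunks follow the snippet schema with `text_clean` and string-valued `page_furniture.header` / `page_furniture.footer`; the function names, the fixed seed, and the normalized substring comparison are assumptions chosen for the example.

```python
import random
import re
from typing import List


def _norm(text: str) -> str:
    """Fold case, drop punctuation, collapse whitespace."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return re.sub(r"\s+", " ", text).strip()


def furniture_leaks(chunks: List[dict], sample_size: int = 20, seed: int = 0) -> List[dict]:
    """Return sampled chunks whose text_clean still contains a header or footer string."""
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    sample = rng.sample(chunks, min(sample_size, len(chunks)))
    leaks = []
    for chunk in sample:
        body = _norm(chunk["text_clean"])
        furniture = chunk.get("page_furniture", {})
        for key in ("header", "footer"):
            needle = _norm(furniture.get(key) or "")
            if needle and needle in body:
                leaks.append(chunk)
                break  # one hit is enough to flag the chunk
    return leaks
```

In a test suite this would typically be used as `assert not furniture_leaks(chunks)`, failing the build when any sampled chunk still carries furniture.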
Copy-paste prompt for the LLM step
You have TXT OS and the WFGY Problem Map.
For each snippet I provide:
* use text_clean for reasoning,
* treat page_furniture as trace only,
* cite then explain.
Tasks:
1. If header/footer strings appear inside text_clean, fail fast and return the minimal structural fix referencing:
ocr-parsing-checklist, data-contracts, retrieval-traceability, chunking-checklist.
2. Return JSON:
{ "citations":[...], "answer":"...", "λ_state":"→|←|<>|×", "ΔS":0.xx, "next_fix":"..." }
Keep it auditable and short.
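To keep the reply auditable downstream, the JSON returned by the model can be checked against the shape requested in the prompt. The validator below is a sketch under assumptions: it reads `"→|←|<>|×"` as "one of these four symbols" and expects `ΔS` to be numeric; the name `check_reply` is hypothetical.

```python
import json
from typing import List

REQUIRED = {"citations", "answer", "λ_state", "ΔS", "next_fix"}
# Assumption: the prompt's "→|←|<>|×" lists the four allowed values.
LAMBDA_STATES = ("→", "←", "<>", "×")


def check_reply(raw: str) -> List[str]:
    """Return a list of problems found in the model's JSON reply (empty if none)."""
    try:
        reply = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc}"]
    if not isinstance(reply, dict):
        return ["reply is not a JSON object"]
    problems = [f"missing key: {key}" for key in sorted(REQUIRED - set(reply))]
    if "λ_state" in reply and reply["λ_state"] not in LAMBDA_STATES:
        problems.append("λ_state must be one of →, ←, <>, ×")
    if "ΔS" in reply and not isinstance(reply["ΔS"], (int, float)):
        problems.append("ΔS must be a number")
    if "citations" in reply and not isinstance(reply["citations"], list):
        problems.append("citations must be a list")
    return problems
```

An empty list means the reply has the expected shape; it says nothing about whether the cited snippets actually support the answer.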
🔗 Quick-Start Downloads (60 sec)
| Tool | Link | 3-Step Setup |
|---|---|---|
| WFGY 1.0 PDF | Engine Paper | 1️⃣ Download · 2️⃣ Upload to your LLM · 3️⃣ Ask “Answer using WFGY + <your question>” |
| TXT OS (plain-text OS) | TXTOS.txt | 1️⃣ Download · 2️⃣ Paste into any LLM chat · 3️⃣ Type “hello world” — OS boots instantly |