Layout, Headers, and Footers: OCR Parsing Guardrails
Strip or normalize page furniture before chunking or embedding. Stop headers, footers, page numbers, and watermarks from polluting semantic meaning and wrecking retrieval.
Open these first
- OCR end to end checklist: ocr-parsing-checklist.md
- Snippet and citation schema: data-contracts.md
- Why this snippet: retrieval-traceability.md
- Chunking checklist: chunking-checklist.md
Acceptance targets
- ΔS(question, retrieved) ≤ 0.45 after layout cleanup
- Coverage ≥ 0.70 for the target section
- λ stays convergent across three paraphrases and two seeds
- Zero header/footer strings inside content tokens of any chunk
Typical failure signatures → exact fix
- Running headers become part of every chunk
  Detect repeating lines at top bands per page. Move them to `page_furniture.header` metadata or drop by rule. See: ocr-parsing-checklist.md
- Footers and page numbers leak into answers
  Identify bottom band strings and numeric patterns. Attach to metadata `page_furniture.footer` and `page_furniture.page_num`, not the content body. See: data-contracts.md
- Watermarks mix with paragraphs
  Use low opacity or diagonal angle cues plus oversized bbox to flag watermark blocks. Remove from body, keep `watermark_text` only in metadata. See: chunking-checklist.md
- Section title duplicated on every page
  Treat it as header when identical across ≥ 2 consecutive pages. Promote a single canonical section anchor. See: retrieval-traceability.md
- Footnotes interleaved with paragraph flow
  Extract footnote blocks. Keep `footnote_id`, `anchor_offset`, and a separate citation lane. Never mix into paragraph tokens. See: data-contracts.md
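The "identical across ≥ 2 consecutive pages" rule for duplicated section titles can be sketched as a run-length scan. This is a minimal illustration, not a reference implementation: `find_running_titles` and its input shape (one top-band string per page, in page order) are hypothetical names chosen for the example.

```python
def find_running_titles(top_lines, min_run=2):
    """Return {title: first_page_index} for titles repeated on consecutive pages.

    top_lines: hypothetical input, the first top-band line of each page,
    indexed in page order. min_run mirrors the ">= 2 consecutive pages" rule.
    """
    anchors = {}
    run_start = 0
    for i in range(1, len(top_lines) + 1):
        # Close the current run when the text changes or the pages end.
        if i == len(top_lines) or top_lines[i] != top_lines[run_start]:
            run_len = i - run_start
            title = top_lines[run_start]
            if title and run_len >= min_run and title not in anchors:
                # Keep one canonical anchor: the page where the run begins.
                anchors[title] = run_start
            run_start = i
    return anchors
```

Titles found this way would go to header furniture on every page of the run, while the returned page index serves as the single section anchor.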
Fix in 60 seconds
- Detect bands
  For each page, split objects by vertical bands: `top ≤ 12%`, `bottom ≥ 88%`. Flag candidates for headers and footers.
- Find repeats
  Normalize text (case, whitespace, punctuation), then mark strings that repeat across pages. Anything stable goes to furniture.
- Rewrite content
  Remove furniture from body tokens. Store originals under `page_furniture`.
- Rechunk
  Chunk paragraphs without furniture. Carry `section_id`, `page`, and cleaned offsets.
- Probe
  Re-run three paraphrases. ΔS drops and λ stops flipping if furniture is out of the body.
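The first three steps above (detect bands, find repeats, rewrite content) can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the line dict shape (`page`, `text`, and `y0`/`y1` as fractions of page height measured from the top), the function names, and the reading of the band rule as "the whole line sits inside the band".

```python
import re

TOP_BAND = 0.12     # header candidates: line ends within the top 12%
BOTTOM_BAND = 0.88  # footer candidates: line starts at or below 88%

def normalize(text):
    """Fold case, drop punctuation, collapse whitespace."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return re.sub(r"\s+", " ", text).strip()

def band_of(line):
    """Return 'header', 'footer', or None for a line dict (hypothetical shape)."""
    if line["y1"] <= TOP_BAND:
        return "header"
    if line["y0"] >= BOTTOM_BAND:
        return "footer"
    return None

def split_furniture(lines, min_pages=2):
    """Separate body lines from band strings that repeat across pages."""
    # Pass 1: record on which pages each normalized band string appears.
    seen = {}
    for ln in lines:
        band = band_of(ln)
        if band:
            seen.setdefault((band, normalize(ln["text"])), set()).add(ln["page"])
    repeated = {key for key, pages in seen.items() if len(pages) >= min_pages}

    # Pass 2: move repeated strings to per-page furniture, keep the rest as body.
    body, furniture = [], {}
    for ln in lines:
        band = band_of(ln)
        if band and (band, normalize(ln["text"])) in repeated:
            furniture.setdefault(ln["page"], {})[band] = ln["text"]
        else:
            body.append(ln)
    return body, furniture
```

A band line that appears on only one page stays in the body, so a one-off title near the top of a page is not dropped by accident. Footers that embed a changing page number will not repeat verbatim and need the separate page-number pattern check.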
Minimal recipes by engine
- Google Document AI
  Use `layout.boundingPoly` for band checks. Identify `paragraph` blocks within top/bottom polygons and mark as furniture. Keep `detectedLanguages`. Apply after-parse dedupe across pages.
- AWS Textract
  From `Blocks` with `BlockType=LINE` or `WORD`, inspect `Geometry.BoundingBox.Top/Height`. Route repeated top-band lines to `page_furniture.header`, bottom band to `footer`. Keep `PAGE` relationships for page numbers.
- Azure OCR
  Use `lines` with their `polygon`, or `paragraphs` with `boundingRegions`. Sort by `polygon` y positions, isolate bands, then repeat-check across pages. Store page number patterns `^\s*[ivxlcdm]+$|^\s*\d+\s*$` as `page_num`.
- ABBYY
  Export XML, read `<block>` with coordinates. Apply band and repeat filters. Preserve watermarks as `watermark_text` if block has style or angle attributes.
- PaddleOCR
  Use bbox outputs to filter by top/bottom thresholds. De-dup by normalized text across pages. Keep furniture in metadata only.
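As one illustration of these recipes, the sketch below routes Textract-shaped `LINE` blocks into lanes using `Geometry.BoundingBox.Top` and `Height`, and applies the page-number pattern quoted above. It works on plain dicts and makes no API call. The lane names, the function name, and the `re.IGNORECASE` flag (added so uppercase roman numerals also match) are assumptions of this example.

```python
import re

# Pattern from the Azure recipe; IGNORECASE is an addition for "XV"-style numerals.
PAGE_NUM = re.compile(r"^\s*[ivxlcdm]+$|^\s*\d+\s*$", re.IGNORECASE)

def route_line_blocks(blocks, top=0.12, bottom=0.88):
    """Return (page, lane, text) tuples for each LINE block.

    Lanes (hypothetical names): header_candidate, footer_candidate,
    page_num, body. Candidates still need the cross-page repeat check.
    """
    routed = []
    for block in blocks:
        if block.get("BlockType") != "LINE":
            continue  # skip PAGE, WORD, and other block types
        box = block["Geometry"]["BoundingBox"]
        y0 = box["Top"]                   # top edge, fraction of page height
        y1 = box["Top"] + box["Height"]   # bottom edge
        text = block.get("Text", "")
        if y1 <= top:
            lane = "header_candidate"
        elif y0 >= bottom:
            lane = "page_num" if PAGE_NUM.match(text) else "footer_candidate"
        else:
            lane = "body"
        routed.append((block.get("Page", 1), lane, text))
    return routed
```

The roman-numeral half of the pattern also matches short words built only from the letters i, v, x, l, c, d, m (for example "mild"), so restricting it to the bottom band, as done here, keeps false positives down.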
Data contract additions for layout cleanup
Add these fields to your snippet schema:
{
  "page": 7,
  "section_id": "2.3",
  "bbox": [x0,y0,x1,y1],
  "text_clean": "...", // body without furniture
  "text_raw": "...", // optional, original page text
  "page_furniture": {
    "header": "Company Annual Report 2024",
    "footer": "Confidential",
    "page_num": "xv",
    "watermark_text": "DRAFT"
  },
  "footnotes": [
    {"id":"fn12","text":"...","anchor_offset":325}
  ],
  "source_url": "..."
}
Mandatory rule: model must read text_clean for reasoning. page_furniture is trace-only.
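A structural check on snippets before they reach the model might look like the sketch below. It is only an illustration of the contract: which fields count as required (everything except `text_raw`, which the schema marks optional) and the reading of `anchor_offset` as a character offset into `text_clean` are assumptions, and `validate_snippet` is a hypothetical name.

```python
# Assumption: every schema field except the optional text_raw is required.
REQUIRED = ("page", "section_id", "bbox", "text_clean",
            "page_furniture", "footnotes", "source_url")
FURNITURE_KEYS = {"header", "footer", "page_num", "watermark_text"}

def validate_snippet(snippet):
    """Return a list of human-readable problems; an empty list means it passes."""
    problems = [f"missing field: {key}" for key in REQUIRED if key not in snippet]

    # Furniture may only use the keys named in the contract.
    extra = set(snippet.get("page_furniture", {})) - FURNITURE_KEYS
    problems += [f"unknown furniture key: {key}" for key in sorted(extra)]

    # Assumption: anchor_offset indexes into text_clean.
    text_len = len(snippet.get("text_clean", ""))
    for note in snippet.get("footnotes", []):
        for key in ("id", "text", "anchor_offset"):
            if key not in note:
                problems.append(f"footnote missing {key}")
        offset = note.get("anchor_offset")
        if isinstance(offset, int) and not 0 <= offset <= text_len:
            problems.append(f"footnote {note.get('id')} anchor_offset out of range")
    return problems
```

Running this at ingestion time keeps malformed snippets out of the index, so later checks only have to look at content rather than shape.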
Verification
- Furniture leak test: sample 20 chunks, assert no header/footer strings inside `text_clean`.
- ΔS drop: compare ΔS before and after cleaning on the same question. Target ≤ 0.45.
- λ stability: shuffle prompt headers, confirm λ stays convergent.
- Footnote audit: a question about a footnote must cite `footnotes[*].id`, not guess from body text.
If ΔS remains flat and high, reopen chunking and metric checks. See: chunking-checklist.md
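The furniture leak test can be automated with a short script. The sketch below assumes chunks follow the snippet schema from the data contract section; the function name, the fixed seed, and the normalization step are choices made for this example rather than part of the contract.

```python
import random
import re

def _norm(text):
    """Fold case, drop punctuation, collapse whitespace."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return re.sub(r"\s+", " ", text).strip()

def furniture_leaks(chunks, sample_size=20, seed=0):
    """Sample chunks and report (page, furniture_key) pairs found inside text_clean."""
    rng = random.Random(seed)  # fixed seed so a failing sample can be reproduced
    sample = rng.sample(chunks, min(sample_size, len(chunks)))
    leaks = []
    for chunk in sample:
        body = _norm(chunk["text_clean"])
        furniture = chunk.get("page_furniture", {})
        for key in ("header", "footer"):
            needle = _norm(furniture.get(key) or "")
            if needle and needle in body:
                leaks.append((chunk.get("page"), key))
    return leaks
```

In a test suite this would be used as `assert furniture_leaks(chunks) == []`. Short furniture strings such as a one-word footer can legitimately occur in body text, so a reported leak is a prompt to inspect the chunk, not proof of a parsing fault.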
Copy-paste prompt for the LLM step
You have TXT OS and the WFGY Problem Map.
For each snippet I provide:
* use text_clean for reasoning,
* treat page_furniture as trace only,
* cite then explain.
Tasks:
1. If header/footer strings appear inside text_clean, fail fast and return the minimal structural fix referencing:
ocr-parsing-checklist, data-contracts, retrieval-traceability, chunking-checklist.
2. Return JSON:
{ "citations":[...], "answer":"...", "λ_state":"→|←|<>|×", "ΔS":0.xx, "next_fix":"..." }
Keep it auditable and short.
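To keep the LLM step auditable, the JSON reply can be checked before it is used. The sketch below validates only the shape requested in the prompt above; `check_reply` is a hypothetical helper, and treating any other `λ_state` value as an error is an assumption of this example.

```python
import json

LAMBDA_STATES = {"→", "←", "<>", "×"}
EXPECTED_KEYS = {"citations", "answer", "λ_state", "ΔS", "next_fix"}

def check_reply(raw):
    """Parse the model's JSON reply and raise ValueError if the shape is off."""
    reply = json.loads(raw)
    missing = EXPECTED_KEYS - reply.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    if reply["λ_state"] not in LAMBDA_STATES:
        raise ValueError(f"unexpected λ_state: {reply['λ_state']!r}")
    # bool is a subclass of int in Python, so exclude it explicitly.
    delta = reply["ΔS"]
    if isinstance(delta, bool) or not isinstance(delta, (int, float)):
        raise ValueError("ΔS must be a number")
    if not isinstance(reply["citations"], list):
        raise ValueError("citations must be a list")
    return reply
```

Replies that fail this check are best logged and retried rather than repaired by hand, so the audit trail stays intact.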
🔗 Quick-Start Downloads (60 sec)
| Tool | Link | 3-Step Setup |
|---|---|---|
| WFGY 1.0 PDF | Engine Paper | 1️⃣ Download · 2️⃣ Upload to your LLM · 3️⃣ Ask “Answer using WFGY + <your question>” |
| TXT OS (plain-text OS) | TXTOS.txt | 1️⃣ Download · 2️⃣ Paste into any LLM chat · 3️⃣ Type “hello world” — OS boots instantly |
Explore More
| Layer | Page | What it’s for |
|---|---|---|
| ⭐ Proof | WFGY Recognition Map | External citations, integrations, and ecosystem proof |
| ⚙️ Engine | WFGY 1.0 | Original PDF tension engine and early logic sketch (legacy reference) |
| ⚙️ Engine | WFGY 2.0 | Production tension kernel for RAG and agent systems |
| ⚙️ Engine | WFGY 3.0 | TXT based Singularity tension engine (131 S class set) |
| 🗺️ Map | Problem Map 1.0 | Flagship 16 problem RAG failure taxonomy and fix map |
| 🗺️ Map | Problem Map 2.0 | Global Debug Card for RAG and agent pipeline diagnosis |
| 🗺️ Map | Problem Map 3.0 | Global AI troubleshooting atlas and failure pattern map |
| 🧰 App | TXT OS | .txt semantic OS with fast bootstrap |
| 🧰 App | Blah Blah Blah | Abstract and paradox Q&A built on TXT OS |
| 🧰 App | Blur Blur Blur | Text to image generation with semantic control |
| 🏡 Onboarding | Starter Village | Guided entry point for new users |