Layout, Headers, and Footers: OCR Parsing Guardrails
Strip or normalize page furniture before chunking or embedding. Stop headers, footers, page numbers, and watermarks from polluting semantic meaning and wrecking retrieval.
Open these first
- OCR end-to-end checklist: ocr-parsing-checklist.md
- Snippet and citation schema: data-contracts.md
- Why this snippet: retrieval-traceability.md
- Chunking checklist: chunking-checklist.md
Acceptance targets
- ΔS(question, retrieved) ≤ 0.45 after layout cleanup
- Coverage ≥ 0.70 for the target section
- λ stays convergent across three paraphrases and two seeds
- Zero header/footer strings inside content tokens of any chunk
Typical failure signatures → exact fix
- Running headers become part of every chunk
  Detect repeating lines at top bands per page. Move them to `page_furniture.header` metadata or drop by rule. See: ocr-parsing-checklist.md
- Footers and page numbers leak into answers
  Identify bottom band strings and numeric patterns. Attach to metadata `page_furniture.footer` and `page_furniture.page_num`, not the content body. See: data-contracts.md
- Watermarks mix with paragraphs
  Use low opacity or diagonal angle cues plus oversized bbox to flag watermark blocks. Remove from body, keep `watermark_text` only in metadata. See: chunking-checklist.md
- Section title duplicated on every page
  Treat it as header when identical across ≥ 2 consecutive pages. Promote a single canonical section anchor. See: retrieval-traceability.md
- Footnotes interleaved with paragraph flow
  Extract footnote blocks. Keep `footnote_id`, `anchor_offset`, and a separate citation lane. Never mix into paragraph tokens. See: data-contracts.md
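The watermark rule in the list above can be sketched as follows. This is a minimal illustration, assuming each OCR block has already been mapped to a hypothetical `Block` record with a page-normalized bbox plus `opacity` and `angle_deg` fields. Many engines do not expose opacity or rotation directly, and the thresholds are placeholders to tune, not recommended values.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Block:
    """Hypothetical OCR block; field names are illustrative, not an engine API."""
    text: str
    bbox: Tuple[float, float, float, float]  # x0, y0, x1, y1 as fractions of the page
    opacity: float = 1.0     # 1.0 means fully opaque
    angle_deg: float = 0.0   # text rotation in degrees


def is_watermark(block, max_opacity=0.5, min_tilt=15.0, min_area=0.25):
    """Flag a block that is oversized AND (faint OR diagonal)."""
    x0, y0, x1, y1 = block.bbox
    area = max(0.0, x1 - x0) * max(0.0, y1 - y0)
    # Distance of the rotation from the nearest horizontal/vertical axis.
    remainder = abs(block.angle_deg) % 90.0
    tilt = min(remainder, 90.0 - remainder)
    faint = block.opacity <= max_opacity
    diagonal = tilt >= min_tilt
    return area >= min_area and (faint or diagonal)


def split_watermarks(blocks: List[Block]):
    """Return (body_blocks, watermark_texts); watermarks never reach the body."""
    body, watermark_texts = [], []
    for block in blocks:
        if is_watermark(block):
            watermark_texts.append(block.text)
        else:
            body.append(block)
    return body, watermark_texts
```

The returned watermark strings would go into `page_furniture.watermark_text`, while only the body blocks continue to chunking.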
Fix in 60 seconds
- Detect bands
  For each page, split objects by vertical bands: `top ≤ 12%`, `bottom ≥ 88%`. Flag candidates for headers and footers.
- Find repeats
  Normalize text (case, whitespace, punctuation), then mark strings that repeat across pages. Anything stable goes to furniture.
- Rewrite content
  Remove furniture from body tokens. Store originals under `page_furniture`.
- Rechunk
  Chunk paragraphs without furniture. Carry `section_id`, `page`, and cleaned offsets.
- Probe
  Re-run three paraphrases. ΔS drops and λ stops flipping if furniture is out of the body.
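The first three steps above (detect bands, find repeats, rewrite content) can be sketched in a few lines. This is a minimal illustration, not a reference implementation: it assumes each page is a list of line dicts with a `text` field and a hypothetical `y` field holding the line's top edge as a fraction of page height, and it treats a string as stable when it appears in the same band on at least `min_pages` pages (an assumed threshold). Furniture is collected as lists here; join them if your schema stores single strings.

```python
import re
from collections import Counter

TOP_BAND = 0.12      # lines starting in the top 12% are header candidates
BOTTOM_BAND = 0.88   # lines starting in the bottom 12% are footer candidates


def normalize(text):
    """Lowercase, drop punctuation, collapse whitespace."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return re.sub(r"\s+", " ", text).strip()


def band_of(line):
    if line["y"] <= TOP_BAND:
        return "header"
    if line["y"] >= BOTTOM_BAND:
        return "footer"
    return None


def furniture_key(line):
    """(band, normalized text) for band candidates, else None."""
    band, norm = band_of(line), normalize(line["text"])
    return (band, norm) if band and norm else None


def clean_pages(pages, min_pages=2):
    # Count each candidate once per page, so repeats mean "seen on N pages".
    page_counts = Counter()
    for lines in pages:
        page_counts.update({furniture_key(line) for line in lines} - {None})
    repeated = {key for key, n in page_counts.items() if n >= min_pages}

    cleaned = []
    for page_no, lines in enumerate(pages, start=1):
        body = []
        furniture = {"header": [], "footer": []}
        for line in lines:
            key = furniture_key(line)
            if key in repeated:
                furniture[key[0]].append(line["text"])  # keep the original string
            else:
                body.append(line["text"])
        cleaned.append({
            "page": page_no,
            "text_clean": "\n".join(body),
            "page_furniture": furniture,
        })
    return cleaned
```

A line that sits in a band but appears on only one page, such as a chapter title, stays in the body. Page numbers change from page to page, so they need the separate numeric pattern check rather than the repeat check.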
Minimal recipes by engine
- Google Document AI
  Use `layout.boundingPoly` for band checks. Identify `paragraph` blocks within top/bottom polygons and mark as furniture. Keep `detectedLanguages`. Apply after-parse dedupe across pages.
- AWS Textract
  From `Blocks` with `BlockType=LINE` or `WORD`, inspect `Geometry.BoundingBox.Top/Height`. Route repeated top-band lines to `page_furniture.header`, bottom band to `footer`. Keep `PAGE` relationships for page numbers.
- Azure OCR
  Use `lines` with `boundingRegions`. Sort by `polygon` y positions, isolate bands, then repeat-check across pages. Store page number patterns `^\s*[ivxlcdm]+$|^\s*\d+\s*$` as `page_num`.
- ABBYY
  Export XML, read `<block>` with coordinates. Apply band and repeat filters. Preserve watermarks as `watermark_text` if block has style or angle attributes.
- PaddleOCR
  Use bbox outputs to filter by top/bottom thresholds. De-dup by normalized text across pages. Keep furniture in metadata only.
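As one concrete case, the Textract recipe above can be sketched against an already-fetched response dict. The sketch relies only on the response fields named in the recipe (`Blocks`, `BlockType`, `Text`, `Page`, `Geometry.BoundingBox.Top`/`Height`) and reuses the page number pattern from the Azure recipe. The function name, the output layout, and the choice to test the line's bottom edge against the top band are assumptions. The result is a list of candidates only, so the cross-page repeat check still has to run afterwards.

```python
import re

# Page number pattern quoted in the Azure recipe: roman numerals or digits.
PAGE_NUM = re.compile(r"^\s*[ivxlcdm]+$|^\s*\d+\s*$")


def route_textract_lines(response, top=0.12, bottom=0.88):
    """Group Textract LINE blocks per page into band candidates and body."""
    pages = {}
    for block in response.get("Blocks", []):
        if block.get("BlockType") != "LINE":
            continue
        box = block["Geometry"]["BoundingBox"]  # values are fractions of the page
        text = block.get("Text", "")
        page = pages.setdefault(
            block.get("Page", 1),
            {"header": [], "footer": [], "page_num": [], "body": []},
        )
        in_top = box["Top"] + box["Height"] <= top   # whole line inside top band
        in_bottom = box["Top"] >= bottom             # line starts in bottom band
        if (in_top or in_bottom) and PAGE_NUM.match(text):
            page["page_num"].append(text)
        elif in_top:
            page["header"].append(text)
        elif in_bottom:
            page["footer"].append(text)
        else:
            page["body"].append(text)
    return pages
```

The same routing applies to the other engines once their coordinates are converted to page fractions.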
Data contract additions for layout cleanup
Add these fields to your snippet schema:

    {
      "page": 7,
      "section_id": "2.3",
      "bbox": [x0, y0, x1, y1],
      "text_clean": "...",   // body without furniture
      "text_raw": "...",     // optional, original page text
      "page_furniture": {
        "header": "Company Annual Report 2024",
        "footer": "Confidential",
        "page_num": "xv",
        "watermark_text": "DRAFT"
      },
      "footnotes": [
        {"id": "fn12", "text": "...", "anchor_offset": 325}
      ],
      "source_url": "..."
    }

Mandatory rule: the model must read `text_clean` for reasoning. `page_furniture` is trace-only.
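One way to enforce this rule in code is to build the model input from an allow-list, so furniture cannot reach the prompt by accident. The sketch below is an assumption about how such a split could look. `model_view`, `trace_view`, and the exact allow-list are hypothetical names and choices, not part of the data contract.

```python
# Fields the model may read. Everything else stays in the trace lane.
# The allow-list itself is an assumption; adjust it to your contract.
MODEL_FIELDS = ("page", "section_id", "text_clean", "footnotes", "source_url")


def model_view(snippet):
    """Return only the fields the model is allowed to reason over."""
    if not snippet.get("text_clean"):
        raise ValueError("snippet has no text_clean; run layout cleanup first")
    return {key: snippet[key] for key in MODEL_FIELDS if key in snippet}


def trace_view(snippet):
    """Return the trace-only remainder (page_furniture, text_raw, bbox, ...)."""
    return {key: value for key, value in snippet.items() if key not in MODEL_FIELDS}
```

Passing `model_view(snippet)` to the prompt builder and logging `trace_view(snippet)` alongside the answer keeps both lanes auditable.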
Verification
- Furniture leak test: sample 20 chunks, assert no header/footer strings inside `text_clean`.
- ΔS drop: compare ΔS before and after cleaning on the same question. Target ≤ 0.45.
- λ stability: shuffle prompt headers, confirm λ stays convergent.
- Footnote audit: a question about a footnote must cite `footnotes[*].id`, not guess from body text.
If ΔS remains flat and high, reopen chunking and metric checks. See: chunking-checklist.md
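The furniture leak test is straightforward to automate. The following is a minimal sketch, assuming chunks follow the snippet schema with single-string `header` and `footer` values. The function name, the fixed seed, and the case-insensitive substring match are illustrative choices.

```python
import random


def furniture_leaks(chunks, sample_size=20, seed=0):
    """Sample chunks and report header/footer strings found inside text_clean."""
    rng = random.Random(seed)  # fixed seed keeps the audit reproducible
    sample = rng.sample(chunks, min(sample_size, len(chunks)))
    leaks = []
    for chunk in sample:
        body = chunk["text_clean"].casefold()
        furniture = chunk.get("page_furniture", {})
        for key in ("header", "footer"):
            value = (furniture.get(key) or "").strip()
            if value and value.casefold() in body:
                leaks.append((chunk.get("page"), key, value))
    return leaks
```

An empty list means the sample passed. Any entry names the page, the furniture lane, and the string that leaked, which points back at the band or repeat rule that missed it.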
Copy-paste prompt for the LLM step

    You have TXT OS and the WFGY Problem Map.
    For each snippet I provide:
    * use text_clean for reasoning,
    * treat page_furniture as trace only,
    * cite then explain.
    Tasks:
    1. If header/footer strings appear inside text_clean, fail fast and return the minimal structural fix referencing:
       ocr-parsing-checklist, data-contracts, retrieval-traceability, chunking-checklist.
    2. Return JSON:
       { "citations":[...], "answer":"...", "λ_state":"→|←|<>|×", "ΔS":0.xx, "next_fix":"..." }
    Keep it auditable and short.

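Because the prompt asks for a fixed JSON shape, the reply can be checked before it is trusted. This is a hedged sketch that validates only what the prompt states: the five keys and the four λ_state symbols. `check_reply` is a hypothetical helper, and the type checks on `citations` and `ΔS` are assumptions.

```python
import json

REQUIRED_KEYS = ("citations", "answer", "λ_state", "ΔS", "next_fix")
LAMBDA_STATES = {"→", "←", "<>", "×"}


def check_reply(raw):
    """Parse the model's JSON reply and list structural problems, if any."""
    reply = json.loads(raw)
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in reply]
    if "λ_state" in reply and reply["λ_state"] not in LAMBDA_STATES:
        problems.append("λ_state is not one of →, ←, <>, ×")
    if "citations" in reply and not isinstance(reply["citations"], list):
        problems.append("citations is not a list")
    delta = reply.get("ΔS")
    if "ΔS" in reply and (isinstance(delta, bool) or not isinstance(delta, (int, float))):
        problems.append("ΔS is not a number")
    return reply, problems
```

A reply with a non-empty problem list is best rejected or retried rather than patched by hand.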
🔗 Quick-Start Downloads (60 sec)
| Tool | Link | 3-Step Setup |
|---|---|---|
| WFGY 1.0 PDF | Engine Paper | 1️⃣ Download · 2️⃣ Upload to your LLM · 3️⃣ Ask “Answer using WFGY + <your question>” |
| TXT OS (plain-text OS) | TXTOS.txt | 1️⃣ Download · 2️⃣ Paste into any LLM chat · 3️⃣ Type “hello world” — OS boots instantly |