Code and tables as typed blocks


This page explains how to detect, normalize, and persist code snippets and tables as first-class blocks, so that section spans stay clean and citations land precisely.

Acceptance targets

  • Typed block detection is reproducible. Two runs over the same source agree on off_begin and off_end with a match rate ≥ 0.98.
  • No code or table text leaks into adjacent prose blocks.
  • Captions attach to their figure or table. No orphan captions.
  • ΔS(question, retrieved) ≤ 0.45 when the question cites a code block id or table cell anchor.
  • For tables, header detection accuracy ≥ 0.95 on a ten table sample.
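
The reproducibility target can be checked with a few lines of code. The following is a minimal sketch, not part of the original page: `offset_match_rate` is a hypothetical helper, and it assumes each run yields a list of dicts with `off_begin` and `off_end` keys, and that a block matches only when the exact offset pair appears in both runs.

```python
def offset_match_rate(run_a, run_b):
    """Fraction of blocks whose (off_begin, off_end) pair appears in both runs.

    Assumption: the denominator is the larger run, so missing or extra
    blocks lower the rate.
    """
    spans_a = {(b["off_begin"], b["off_end"]) for b in run_a}
    spans_b = {(b["off_begin"], b["off_end"]) for b in run_b}
    total = max(len(spans_a), len(spans_b))
    if total == 0:
        return 1.0  # two empty runs agree trivially
    return len(spans_a & spans_b) / total


run_1 = [{"off_begin": 0, "off_end": 40}, {"off_begin": 41, "off_end": 90}]
run_2 = [{"off_begin": 0, "off_end": 40}, {"off_begin": 41, "off_end": 92}]
rate = offset_match_rate(run_1, run_2)
print(f"match rate: {rate:.2f}, passes target: {rate >= 0.98}")
```

A CI gate would compare the returned rate against 0.98 and fail the build below that value.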

Block model

Every block carries a type and byte offsets in the canonical text.

{
  "block_id": "A.3.1.bk012",
  "type": "code | table | figure | prose | caption",
  "off_begin": 120445,
  "off_end": 120992,
  "page": 33,
  "attrs": {
    "lang": "python",
    "fence": "```",
    "table": {
      "rows": 14,
      "cols": 6,
      "header_rows": 1,
      "grid_kind": "markdown | csv | pdf_grid",
      "cell_spans": [[r,c,rowspan,colspan], ...]
    },
    "figure_ref": "Fig.4.2",
    "caption_for": "figure|table:block_id"
  }
}

Block ids follow the rules in chunk_id_schema.md.
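
One way to hold this model in code is a small immutable record that rejects malformed blocks early. This is a sketch under assumptions: the class name `Block`, the method `text_slice`, and the reading of `off_end` as an exclusive byte offset are illustrative choices, not something the page specifies.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional

BLOCK_TYPES = {"code", "table", "figure", "prose", "caption"}


@dataclass(frozen=True)
class Block:
    block_id: str
    type: str
    off_begin: int
    off_end: int  # assumed exclusive, like a Python slice end
    page: Optional[int] = None
    attrs: Dict[str, Any] = field(default_factory=dict)

    def __post_init__(self):
        # Validate on construction so bad blocks never reach the index.
        if self.type not in BLOCK_TYPES:
            raise ValueError(f"unknown block type: {self.type!r}")
        if not 0 <= self.off_begin <= self.off_end:
            raise ValueError("offsets must satisfy 0 <= off_begin <= off_end")

    def text_slice(self, canonical: bytes) -> bytes:
        """Return the bytes this block covers in the canonical text."""
        return canonical[self.off_begin:self.off_end]


block = Block("A.3.1.bk012", "code", 6, 11, attrs={"lang": "python"})
print(block.text_slice(b"hello world"))
```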


Detection rules for code

Prefer fences and indentation signals. Normalize before detection.

  • Fenced code. Lines bounded by triple backticks or tildes. Extract an optional language tag from the opening fence.
  • Indented code. Most lines in the run start with four spaces or one tab. Reject the run if it also satisfies the prose paragraph rules.
  • Inline code. Short single-line code spans are not separate blocks. Keep them inside the enclosing prose block.
  • PDF and OCR. Use monospaced font tags if available. Otherwise detect low entropy per line and stable left margins.
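
The fenced code rule above can be sketched as follows. This is an illustrative assumption, not the page's implementation: `find_fenced_blocks` is a hypothetical name, fences are only recognized at the start of a line, and an unclosed fence is treated as running to the end of the input.

```python
import re

# Opening fence: three or more backticks or tildes, then an optional language tag.
FENCE_OPEN = re.compile(r"^(?P<fence>`{3,}|~{3,})\s*(?P<lang>[\w+#.-]*)")


def find_fenced_blocks(lines):
    """Return (first_line, last_line, lang) for each fenced run, fences included."""
    found = []
    i = 0
    while i < len(lines):
        m = FENCE_OPEN.match(lines[i])
        if not m:
            i += 1
            continue
        fence = m.group("fence")
        lang = m.group("lang") or None
        j = i + 1
        # A closing fence repeats the opener's character and is at least as long.
        while j < len(lines):
            s = lines[j].strip()
            if s and set(s) == {fence[0]} and len(s) >= len(fence):
                break
            j += 1
        j = min(j, len(lines) - 1)  # unclosed fence: run to end of input
        found.append((i, j, lang))
        i = j + 1
    return found


TICKS = "`" * 3  # built at runtime to keep this example easy to embed
doc = ["Intro", TICKS + "python", "print('hi')", TICKS, "Outro", "~~~", "x = 1", "~~~"]
print(find_fenced_blocks(doc))
```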

Normalization

  • Convert tabs to four spaces for offset stability.
  • Keep internal spacing. Do not wrap lines.
  • Remove soft hyphen artifacts. See pdf_layouts_and_ocr.md.
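
A minimal sketch of these normalization steps, assuming that "soft hyphen artifacts" means the U+00AD character and that tabs are replaced literally rather than aligned to tab stops. The function names are hypothetical. Offsets should be computed only after this pass so they refer to the normalized text.

```python
SOFT_HYPHEN = "\u00ad"


def normalize_code_line(line, tab_width=4):
    """Drop soft hyphens and replace each tab with a fixed number of spaces.

    Internal spacing is left untouched and lines are never wrapped.
    """
    without_shy = line.replace(SOFT_HYPHEN, "")
    return without_shy.replace("\t", " " * tab_width)


def normalize_code_lines(lines):
    return [normalize_code_line(line) for line in lines]


raw = ["\tif ok:", "\t\treturn a  +  b", "im\u00adport os"]
print(normalize_code_lines(raw))
```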

Attributes

  • Take lang from the fence tag when present. Otherwise guess from a small whitelist and mark the guess with lang_guess=true.
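
The attribute rule could look like the sketch below. The whitelist contents and the `code_attrs` helper are assumptions chosen for illustration. A real whitelist would be tuned to the corpus, and these patterns are deliberately crude (for example, `import` also appears in JavaScript).

```python
import re

# Hypothetical whitelist: (language, pattern that hints at it).
LANG_HINTS = [
    ("python", re.compile(r"^\s*(def |import |from \w+ import )")),
    ("javascript", re.compile(r"^\s*(const |let |function )")),
    ("sql", re.compile(r"^\s*(SELECT|INSERT|UPDATE|DELETE)\b", re.IGNORECASE)),
]


def code_attrs(fence_tag, body_lines):
    """Build the lang attributes for a code block."""
    if fence_tag:
        return {"lang": fence_tag}  # explicit tag wins, no guess flag
    for lang, pattern in LANG_HINTS:
        if any(pattern.match(line) for line in body_lines):
            return {"lang": lang, "lang_guess": True}
    return {"lang": None}  # nothing matched; leave the language unset


print(code_attrs(None, ["def f():", "    return 1"]))
```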

Detection rules for tables

Handle three major forms.

  • Markdown tables. Lines with | separators and a second line that contains only pipes, colons, and dashes. Count columns from the header line.
  • CSV-like blocks. Comma-separated lines with a similar column count for at least three consecutive rows. Accept quoted fields. Reject when commas appear inside most cells.
  • PDF grid tables. Use vertical alignment and repeated left margins to find columns. Merge wrapped cells when the next line starts under the same column start.
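
The first two forms can be sketched with the standard library. This is a simplified illustration under stated assumptions: the helper names are hypothetical, "similar column count" is tightened to "equal column count", escaped pipes are not handled, and the "commas inside most cells" rejection is left out.

```python
import csv
import re

# Divider line: only pipes, colons, dashes, and spaces, with at least one dash per cell.
DIVIDER = re.compile(r"^\s*\|?\s*:?-+:?\s*(\|\s*:?-+:?\s*)*\|?\s*$")


def split_cells(line):
    """Split a markdown row on pipes, ignoring one leading and one trailing pipe."""
    s = line.strip()
    if s.startswith("|"):
        s = s[1:]
    if s.endswith("|"):
        s = s[:-1]
    return [cell.strip() for cell in s.split("|")]


def is_markdown_table_head(lines, i):
    """True when line i is a header row and line i + 1 is a matching divider."""
    if i + 1 >= len(lines) or "|" not in lines[i]:
        return False
    if not DIVIDER.match(lines[i + 1]):
        return False
    return len(split_cells(lines[i])) == len(split_cells(lines[i + 1]))


def csv_run_length(lines, i, min_rows=3):
    """Length of the CSV-like run starting at line i, or 0 if it is too short."""
    def cols(line):
        return len(next(csv.reader([line]), []))  # csv handles quoted fields

    if i >= len(lines):
        return 0
    width = cols(lines[i])
    if width < 2:
        return 0
    j = i
    while j < len(lines) and cols(lines[j]) == width:
        j += 1
    return j - i if j - i >= min_rows else 0


md = ["| name | qty |", "| :--- | --: |", "| apple | 3 |"]
rows = ["id,name,note", '1,"Doe, Jane",ok', "2,Smith,late", "plain sentence"]
print(is_markdown_table_head(md, 0), csv_run_length(rows, 0))
```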

Normalization

  • Collapse internal multiple spaces to a single space unless inside code marks.
  • Trim trailing spaces. Preserve newlines and row order.
  • Resolve header row count from the markdown divider or from bold font in PDF.
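
The space-collapsing rule needs care around code marks. A minimal sketch, assuming "code marks" means single-backtick spans inside a cell. `normalize_cell` is a hypothetical helper name.

```python
import re


def normalize_cell(cell):
    """Collapse runs of spaces outside backtick spans and trim trailing spaces."""
    # With a capturing group, re.split puts the backtick spans at odd indices.
    parts = re.split(r"(`[^`]*`)", cell)
    out = []
    for idx, part in enumerate(parts):
        if idx % 2 == 1:
            out.append(part)  # code span: keep spacing exactly
        else:
            out.append(re.sub(r" {2,}", " ", part))
    return "".join(out).rstrip(" ")


print(normalize_cell("a   b  `x   y`   c  "))
```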

Attributes

  • rows, cols, header_rows, grid_kind.
  • Optional cell_spans for merged cells in PDF grids.

Figures and captions

  • A figure block represents an image or diagram placeholder in the canonical text. It keeps off_begin and off_end for the caption zone only.
  • A caption block follows the figure or table. If it contains a local token like Figure 3 or Table 2, set caption_for to the nearest preceding figure or table block inside the same section.
  • During section sweep, pull a caption into the same section as its target block. See section_detection.md.
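
The caption linking rule can be sketched as below. Several things here are assumptions made for illustration: blocks are plain dicts that carry `text` and `section_id`, a prose block that starts with a caption token is promoted to `caption` only when a target is found, the token kind (Figure or Table) selects the target type, and `caption_for` is written as `type:block_id`.

```python
import re

CAPTION_TOKEN = re.compile(r"^\s*(Figure|Fig\.|Table)\s*\d+", re.IGNORECASE)


def attach_captions(blocks):
    """Link captions to the nearest preceding figure or table in the same section."""
    for idx, block in enumerate(blocks):
        if block["type"] not in ("prose", "caption"):
            continue
        m = CAPTION_TOKEN.match(block.get("text", ""))
        if not m:
            continue
        wanted = "table" if m.group(1).lower() == "table" else "figure"
        # Walk backwards so the nearest candidate wins.
        for prev in reversed(blocks[:idx]):
            same_section = prev["section_id"] == block["section_id"]
            if same_section and prev["type"] == wanted:
                block["type"] = "caption"
                attrs = block.setdefault("attrs", {})
                attrs["caption_for"] = f"{wanted}:{prev['block_id']}"
                break
    return blocks


blocks = attach_captions([
    {"block_id": "A.1.bk001", "type": "table", "section_id": "A.1", "text": "| a | b |"},
    {"block_id": "A.1.bk002", "type": "prose", "section_id": "A.1", "text": "Table 2: Sample counts"},
    {"block_id": "A.2.bk001", "type": "prose", "section_id": "A.2", "text": "Figure 3: No target here"},
])
print(blocks[1]["attrs"])
```

Blocks that match a caption token but find no target stay unlinked, which makes orphan captions easy to report in the section validator.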

Anchors and citation

  • A code block uses its first non blank line as anchor for cite first prompts.
  • A table uses the header row. If no header exists, use the first data row.
  • A caption may also be the anchor when the question references the figure or the table name.
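
For code blocks, and for tables whose first line is the header or first data row, the anchor reduces to the first non-blank line of the block. A minimal sketch, assuming byte offsets into the canonical text with an exclusive end. `anchor_offsets` as a function name is hypothetical.

```python
def anchor_offsets(canonical, off_begin, off_end):
    """Byte span of the first non-blank line inside [off_begin, off_end)."""
    pos = off_begin
    for raw in canonical[off_begin:off_end].splitlines(keepends=True):
        line = raw.rstrip(b"\r\n")
        if line.strip():
            return (pos, pos + len(line))
        pos += len(raw)  # skip the blank line, newline included
    return (off_begin, off_end)  # no non-blank line: fall back to the whole block


text = b"intro\n\n  x = 1\ny = 2\n"
span = anchor_offsets(text, 6, len(text))
print(span, text[span[0]:span[1]])
```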

Payload example in RAG

{
  "snippet_id": "A.3.1.bk012",
  "type": "code",
  "source_url": "...",
  "offsets": [120445, 120992],
  "anchor_offsets": [120445, 120489],
  "section_id": "A.3.1"
}

The payload schema belongs in your data contract. See data-contracts.md, and the cite rules in retrieval-traceability.md.
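
A contract check for this payload might look like the sketch below. The checks are assumptions inferred from the example (all six fields present, anchor span inside the block span), and `payload_errors` is a hypothetical helper. The actual rules live in the contract documents.

```python
REQUIRED_FIELDS = (
    "snippet_id", "type", "source_url", "offsets", "anchor_offsets", "section_id",
)


def payload_errors(payload):
    """Return a list of contract violations; an empty list means the payload passes."""
    errors = [f"missing field: {key}" for key in REQUIRED_FIELDS if key not in payload]
    if errors:
        return errors
    start, end = payload["offsets"]
    a_start, a_end = payload["anchor_offsets"]
    if start > end:
        errors.append("offsets must be ordered")
    if not (start <= a_start <= a_end <= end):
        errors.append("anchor_offsets must lie inside offsets")
    return errors


example = {
    "snippet_id": "A.3.1.bk012",
    "type": "code",
    "source_url": "...",
    "offsets": [120445, 120992],
    "anchor_offsets": [120445, 120489],
    "section_id": "A.3.1",
}
print(payload_errors(example))
```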


Pseudocode

def detect_typed_blocks(lines):
    blocks = []
    i = 0
    while i < len(lines):
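        if is_fenced_open(lines[i]):
            lang = fence_lang(lines[i])
            # Record the marker actually used (backticks or tildes) rather than
            # hardcoding backticks. fence_marker is a hypothetical helper.
            fence = fence_marker(lines[i])
            j = find_fence_close(lines, i + 1)
            off0, off1 = span_offsets(lines, i, j)
            blocks.append(make_block("code", off0, off1, {"lang": lang, "fence": fence}))
            i = j + 1
            continue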

        if is_indented_code_run(lines, i):
            j = expand_indented_run(lines, i)
            off0, off1 = span_offsets(lines, i, j)
            blocks.append(make_block("code", off0, off1, {"lang": None}))
            i = j + 1
            continue

        if is_markdown_table_head(lines, i):
            j = expand_markdown_table(lines, i)
            rows, cols, hdr = table_shape_markdown(lines, i, j)
            off0, off1 = span_offsets(lines, i, j)
            blocks.append(make_block("table", off0, off1, {
                "rows": rows, "cols": cols, "header_rows": hdr, "grid_kind": "markdown"
            }))
            i = j + 1
            continue

        if is_csv_like_run(lines, i):
            j = expand_csv_run(lines, i)
            rows, cols = csv_shape(lines, i, j)
            off0, off1 = span_offsets(lines, i, j)
            # header_rows=1 assumes the first CSV row is a header.
            blocks.append(make_block("table", off0, off1, {
                "rows": rows, "cols": cols, "header_rows": 1, "grid_kind": "csv"
            }))
            i = j + 1
            continue

        # fallback to prose
        j = expand_paragraph(lines, i)
        off0, off1 = span_offsets(lines, i, j)
        blocks.append(make_block("prose", off0, off1, {}))
        i = j + 1
    return attach_captions(blocks)

attach_captions scans for caption patterns and links them with the nearest figure or table in the same section.
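
The pseudocode relies on span_offsets to turn line indices into byte offsets. One possible implementation is sketched here, assuming the canonical text is the lines joined by a single "\n" and encoded as UTF-8, and that off_end excludes the trailing newline of the last line. The `line_byte_starts` helper is hypothetical.

```python
def line_byte_starts(lines, encoding="utf-8"):
    """Byte offset at which each line starts in the newline-joined canonical text."""
    starts = []
    pos = 0
    for line in lines:
        starts.append(pos)
        pos += len(line.encode(encoding)) + 1  # +1 for the "\n" separator
    return starts


def span_offsets(lines, i, j, encoding="utf-8"):
    """Byte offsets (off_begin, off_end) covering lines i..j inclusive."""
    starts = line_byte_starts(lines, encoding)
    off_begin = starts[i]
    off_end = starts[j] + len(lines[j].encode(encoding))
    return off_begin, off_end


lines = ["héllo", "x = 1", "y"]
canonical = "\n".join(lines).encode("utf-8")
begin, end = span_offsets(lines, 1, 2)
print(begin, end, canonical[begin:end])
```

Counting bytes rather than characters matters as soon as the source contains non-ASCII text: "héllo" is five characters but six bytes. A production version would compute the line starts once per document instead of on every call.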


Common pitfalls and fixes

  • Tables split by page headers. Normalize headers and footers before detection. See pdf_layouts_and_ocr.md.

  • Indented code misread as a list. Reject when most lines begin with list markers. Require a minimum run length and a stable left margin.

  • Markdown tables with wrapped cells. Use a soft wrap join that keeps pipe counts aligned. Preserve row count and column count.

  • Captions orphaned or pulled into parents. Link captions to the nearest figure or table and keep them in the same section. Rerun the section validator.

  • Offsets change after minor edits. Keep block ids stable and run the migration map. See reindex_migration.md.
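
The list versus indented code pitfall can be guarded with a check like the following. It is a sketch under assumptions: "most" is read as more than half of the non-blank lines, the default minimum run length of 2 is arbitrary, the list marker pattern is illustrative, and the stable left margin test is left out for brevity.

```python
import re

# Bullet or ordered-list marker followed by whitespace.
LIST_MARKER = re.compile(r"^\s*(?:[-*+•]|\d+[.)])\s+")


def looks_like_indented_code(run, min_len=2):
    """Accept an indented run as code unless it is too short or mostly list items."""
    lines = [line for line in run if line.strip()]
    if len(lines) < min_len:
        return False
    indented = sum(line.startswith(("    ", "\t")) for line in lines)
    listy = sum(bool(LIST_MARKER.match(line)) for line in lines)
    half = len(lines) / 2
    return indented > half and listy <= half


code_run = ["    x = 1", "    y = 2", "", "    print(x + y)"]
list_run = ["    - first item", "    - second item", "    - third item"]
print(looks_like_indented_code(code_run), looks_like_indented_code(list_run))
```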


Tests to include in CI

  • Markdown file with mixed prose, code fences, and tables. Expect correct block counts and stable offsets across runs.
  • PDF with grid tables and multi line cells. Expect merged cells recorded in cell_spans where needed.
  • OCR source with monospaced code and broken hyphenation. Expect code detection without line wraps.
  • Document with figure and caption pairs across section boundaries. Expect captions linked to the correct target block.

Copy paste prompt for a quick check

You have TXT OS and WFGY Problem Map loaded.

Given a typed block:
block_id = {id}
type = {code|table|figure|caption}
offsets = {start, end}
attrs = {...}

Task:
1) Verify the block type given the text slice.
2) If caption, confirm caption_for links to the nearest figure or table in the same section.
3) Suggest the minimal structural fix page if detection would fail.
Return JSON:
{ "ok": true|false, "why": "...", "open": "pdf_layouts_and_ocr.md | section_detection.md | reindex_migration.md" }
