WFGY/ProblemMap/GlobalFixMap/Chunking/code_tables_blocks.md
2025-08-27 15:19:02 +08:00

14 KiB
Raw Blame History

Code and tables as typed blocks

How to detect, normalize, and persist code snippets and tables as first class blocks so section spans stay clean and citations land precisely.

Open these first

Acceptance targets

  • Typed block detection reproduces across two runs from the same source. Match rate for off_begin and off_end ≥ 0.98.
  • No code or table text leaks into adjacent prose blocks.
  • Captions attach to their figure or table. No orphan captions.
  • ΔS(question, retrieved) ≤ 0.45 when the question cites a code block id or table cell anchor.
  • For tables, header detection accuracy ≥ 0.95 on a ten table sample.

Block model

Every block carries a type and byte offsets in the canonical text.

{
  "block_id": "A.3.1.bk012",
  "type": "code | table | figure | prose | caption",
  "off_begin": 120445,
  "off_end": 120992,
  "page": 33,
  "attrs": {
    "lang": "python",
    "fence": "```",
    "table": {
      "rows": 14,
      "cols": 6,
      "header_rows": 1,
      "grid_kind": "markdown | csv | pdf_grid",
      "cell_spans": [[r,c,rowspan,colspan], ...]
    },
    "figure_ref": "Fig.4.2",
    "caption_for": "figure|table:block_id"
  }
}

Block ids follow the rules in chunk_id_schema.md.


Detection rules for code

Prefer fences and indentation signals. Normalize before detection.

  • Fenced code. Lines bounded by triple backticks or tildes. Extract an optional language tag from the opening fence.
  • Indented code. Four space or one tab leaders on most lines in a run. Reject if the run also satisfies prose paragraph rules.
  • Inline code. The short one line version is not a separate block. Keep it inside the prose block.
  • PDF and OCR. Use monospaced font tags if available. Otherwise detect low entropy per line and stable left margins.

Normalization

  • Convert tabs to four spaces for offset stability.
  • Keep internal spacing. Do not wrap lines.
  • Remove soft hyphen artifacts. See pdf_layouts_and_ocr.md.

Attributes

  • lang from the fence tag when present. Else guess with a small whitelist. Store guesses with lang_guess=true.

Detection rules for tables

Handle three major forms.

  • Markdown tables. Lines with | separators and a second line that contains only pipes, colons, and dashes. Count columns from the header line.
  • CSV like blocks. Comma separated lines of similar column count for at least three consecutive rows. Accept quoted fields. Reject when commas appear inside most cells.
  • PDF grid tables. Use vertical alignment and repeated left margins to find columns. Merge wrapped cells when the next line starts under the same column start.

Normalization

  • Collapse internal multiple spaces to a single space unless inside code marks.
  • Trim trailing spaces. Preserve newlines and row order.
  • Resolve header row count from the markdown divider or from bold font in PDF.

Attributes

  • rows, cols, header_rows, grid_kind.
  • Optional cell_spans for merged cells in PDF grids.

Figures and captions

  • A figure block represents an image or diagram placeholder in the canonical text. It keeps off_begin and off_end for the caption zone only.
  • A caption block follows the figure or table. If it contains a local token like Figure 3 or Table 2, set caption_for to the nearest preceding figure or table block inside the same section.
  • During section sweep, pull a caption into the same section as its target block. See section_detection.md.

Anchors and citation

  • A code block uses its first non blank line as anchor for cite first prompts.
  • A table uses the header row. If no header exists, use the first data row.
  • A caption may also be the anchor when the question references the figure or the table name.

Payload example in RAG

{
  "snippet_id": "A.3.1.bk012",
  "type": "code",
  "source_url": "...",
  "offsets": [120445, 120992],
  "anchor_offsets": [120445, 120489],
  "section_id": "A.3.1"
}

Schema belongs in your contract. See data-contracts.md and cite rules in retrieval-traceability.md.


Pseudocode

def detect_typed_blocks(lines):
    blocks = []
    i = 0
    while i < len(lines):
        if is_fenced_open(lines[i]):
            lang = fence_lang(lines[i])
            j = find_fence_close(lines, i + 1)
            off0, off1 = span_offsets(lines, i, j)
            blocks.append(make_block("code", off0, off1, {"lang": lang, "fence": "```"}))
            i = j + 1
            continue

        if is_indented_code_run(lines, i):
            j = expand_indented_run(lines, i)
            off0, off1 = span_offsets(lines, i, j)
            blocks.append(make_block("code", off0, off1, {"lang": None}))
            i = j + 1
            continue

        if is_markdown_table_head(lines, i):
            j = expand_markdown_table(lines, i)
            rows, cols, hdr = table_shape_markdown(lines, i, j)
            off0, off1 = span_offsets(lines, i, j)
            blocks.append(make_block("table", off0, off1, {
                "rows": rows, "cols": cols, "header_rows": hdr, "grid_kind": "markdown"
            }))
            i = j + 1
            continue

        if is_csv_like_run(lines, i):
            j = expand_csv_run(lines, i)
            rows, cols = csv_shape(lines, i, j)
            off0, off1 = span_offsets(lines, i, j)
            blocks.append(make_block("table", off0, off1, {
                "rows": rows, "cols": cols, "header_rows": 1, "grid_kind": "csv"
            }))
            i = j + 1
            continue

        # fallback to prose
        j = expand_paragraph(lines, i)
        off0, off1 = span_offsets(lines, i, j)
        blocks.append(make_block("prose", off0, off1, {}))
        i = j + 1
    return attach_captions(blocks)

attach_captions scans for caption patterns and links them with the nearest figure or table in the same section.


Common pitfalls and fixes

  • Tables split by page headers Normalize headers and footers before detection. See pdf_layouts_and_ocr.md.

  • Indented code misread as a list Reject when most lines begin with list markers. Require a minimum run length and stable left margin.

  • Markdown tables with wrapped cells Use a soft wrap join that keeps pipe counts aligned. Preserve row count and column count.

  • Captions orphaned or pulled into parents Link captions to the nearest figure or table and keep them in the same section. Re run the section validator.

  • Offsets change after minor edits Keep block ids stable and run the migration map. See reindex_migration.md.


Tests to include in CI

  • Markdown file with mixed prose, code fences, and tables. Expect correct block counts and stable offsets across runs.
  • PDF with grid tables and multi line cells. Expect merged cells recorded in cell_spans where needed.
  • OCR source with monospaced code and broken hyphenation. Expect code detection without line wraps.
  • Document with figure and caption pairs across section boundaries. Expect captions linked to the correct target block.

Copy paste prompt for a quick check

You have TXT OS and WFGY Problem Map loaded.

Given a typed block:
block_id = {id}
type = {code|table|figure|caption}
offsets = {start, end}
attrs = {...}

Task:
1) Verify the block type given the text slice.
2) If caption, confirm caption_for links to the nearest figure or table in the same section.
3) Suggest the minimal structural fix page if detection would fail.
Return JSON:
{ "ok": true|false, "why": "...", "open": "pdf_layouts_and_ocr.md | section_detection.md | reindex_migration.md" }

🔗 Quick-Start Downloads (60 sec)

Tool Link 3-Step Setup
WFGY 1.0 PDF Engine Paper 1 Download · 2 Upload to your LLM · 3 Ask “Answer using WFGY + <your question>”
TXT OS (plain-text OS) TXTOS.txt 1 Download · 2 Paste into any LLM chat · 3 Type “hello world” — OS boots instantly

🧭 Explore More

Module Description Link
WFGY Core WFGY 2.0 engine is live: full symbolic reasoning architecture and math stack View →
Problem Map 1.0 Initial 16-mode diagnostic and symbolic fix framework View →
Problem Map 2.0 RAG-focused failure tree, modular fixes, and pipelines View →
Semantic Clinic Index Expanded failure catalog: prompt injection, memory bugs, logic drift View →
Semantic Blueprint Layer-based symbolic reasoning & semantic modulations View →
Benchmark vs GPT-5 Stress test GPT-5 with full WFGY reasoning suite View →
🧙‍♂️ Starter Village 🏡 New here? Lost in symbols? Click here and let the wizard guide you through Start →

👑 Early Stargazers: See the Hall of Fame — Engineers, hackers, and open source builders who supported WFGY from day one.

GitHub stars WFGY Engine 2.0 is already unlocked. Star the repo to help others discover it and unlock more on the Unlock Board.

WFGY Main   TXT OS   Blah   Blot   Bloc   Blur   Blow  

要我繼續下一頁就說:GO pdf_layouts_and_ocr.md