mirror of
https://github.com/onestardao/WFGY.git
synced 2026-04-28 11:40:07 +00:00
282 lines
13 KiB
Markdown
282 lines
13 KiB
Markdown
# Code and tables as typed blocks
|
||
|
||
<details>
|
||
<summary><strong>🧭 Quick Return to Map</strong></summary>
|
||
|
||
<br>
|
||
|
||
> You are in a sub-page of **Chunking**.
|
||
> To reorient, go back here:
|
||
>
|
||
> - [**Chunking** — text segmentation and context window management](./README.md)
|
||
> - [**WFGY Global Fix Map** — main Emergency Room, 300+ structured fixes](../README.md)
|
||
> - [**WFGY Problem Map 1.0** — 16 reproducible failure modes](../../README.md)
|
||
>
|
||
> Think of this page as a desk within a ward.
|
||
> If you need the full triage and all prescriptions, return to the Emergency Room lobby.
|
||
</details>
|
||
|
||
|
||
How to detect, normalize, and persist code snippets and tables as first class blocks so section spans stay clean and citations land precisely.
|
||
|
||
## Open these first
|
||
- PDF layouts and OCR cleanup: [pdf_layouts_and_ocr.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Chunking/pdf_layouts_and_ocr.md)
|
||
- Section boundaries after titles: [section_detection.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Chunking/section_detection.md)
|
||
- Title hierarchy and numbering: [title_hierarchy.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Chunking/title_hierarchy.md)
|
||
- Stable chunk ids: [chunk_id_schema.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Chunking/chunk_id_schema.md)
|
||
- Why this snippet and offsets: [retrieval-traceability.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/retrieval-traceability.md)
|
||
- Payload contracts for RAG: [data-contracts.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/data-contracts.md)
|
||
- Safe reindex after tiny edits: [reindex_migration.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Chunking/reindex_migration.md)
|
||
|
||
## Acceptance targets
|
||
- Typed block detection reproduces across two runs from the same source. Match rate for `off_begin` and `off_end` ≥ 0.98.
|
||
- No code or table text leaks into adjacent prose blocks.
|
||
- Captions attach to their figure or table. No orphan captions.
|
||
- ΔS(question, retrieved) ≤ 0.45 when the question cites a code block id or table cell anchor.
|
||
- For tables, header detection accuracy ≥ 0.95 on a ten table sample.
|
||
|
||
---
|
||
|
||
## Block model
|
||
|
||
Every block carries a type and byte offsets in the canonical text.
|
||
|
||
```json
|
||
{
|
||
"block_id": "A.3.1.bk012",
|
||
"type": "code | table | figure | prose | caption",
|
||
"off_begin": 120445,
|
||
"off_end": 120992,
|
||
"page": 33,
|
||
"attrs": {
|
||
"lang": "python",
|
||
"fence": "```",
|
||
"table": {
|
||
"rows": 14,
|
||
"cols": 6,
|
||
"header_rows": 1,
|
||
"grid_kind": "markdown | csv | pdf_grid",
|
||
"cell_spans": [[r,c,rowspan,colspan], ...]
|
||
},
|
||
"figure_ref": "Fig.4.2",
|
||
"caption_for": "figure|table:block_id"
|
||
}
|
||
}
|
||
````
|
||
|
||
Block ids follow the rules in [chunk\_id\_schema.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Chunking/chunk_id_schema.md).
|
||
|
||
---
|
||
|
||
## Detection rules for code
|
||
|
||
Prefer fences and indentation signals. Normalize before detection.
|
||
|
||
* **Fenced code**. Lines bounded by triple backticks or tildes. Extract an optional language tag from the opening fence.
|
||
* **Indented code**. Four space or one tab leaders on most lines in a run. Reject if the run also satisfies prose paragraph rules.
|
||
* **Inline code**. The short one line version is not a separate block. Keep it inside the prose block.
|
||
* **PDF and OCR**. Use monospaced font tags if available. Otherwise detect low entropy per line and stable left margins.
|
||
|
||
Normalization
|
||
|
||
* Convert tabs to four spaces for offset stability.
|
||
* Keep internal spacing. Do not wrap lines.
|
||
* Remove soft hyphen artifacts. See [pdf\_layouts\_and\_ocr.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Chunking/pdf_layouts_and_ocr.md).
|
||
|
||
Attributes
|
||
|
||
* `lang` from the fence tag when present. Else guess with a small whitelist. Store guesses with `lang_guess=true`.
|
||
|
||
---
|
||
|
||
## Detection rules for tables
|
||
|
||
Handle three major forms.
|
||
|
||
* **Markdown tables**. Lines with `|` separators and a second line that contains only pipes, colons, and dashes. Count columns from the header line.
|
||
* **CSV like blocks**. Comma separated lines of similar column count for at least three consecutive rows. Accept quoted fields. Reject when commas appear inside most cells.
|
||
* **PDF grid tables**. Use vertical alignment and repeated left margins to find columns. Merge wrapped cells when the next line starts under the same column start.
|
||
|
||
Normalization
|
||
|
||
* Collapse internal multiple spaces to a single space unless inside code marks.
|
||
* Trim trailing spaces. Preserve newlines and row order.
|
||
* Resolve header row count from the markdown divider or from bold font in PDF.
|
||
|
||
Attributes
|
||
|
||
* `rows`, `cols`, `header_rows`, `grid_kind`.
|
||
* Optional `cell_spans` for merged cells in PDF grids.
|
||
|
||
---
|
||
|
||
## Figures and captions
|
||
|
||
* A **figure** block represents an image or diagram placeholder in the canonical text. It keeps `off_begin` and `off_end` for the caption zone only.
|
||
* A **caption** block follows the figure or table. If it contains a local token like `Figure 3` or `Table 2`, set `caption_for` to the nearest preceding figure or table block inside the same section.
|
||
* During section sweep, pull a caption into the same section as its target block. See [section\_detection.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Chunking/section_detection.md).
|
||
|
||
---
|
||
|
||
## Anchors and citation
|
||
|
||
* A code block uses its first non blank line as anchor for cite first prompts.
|
||
* A table uses the header row. If no header exists, use the first data row.
|
||
* A caption may also be the anchor when the question references the figure or the table name.
|
||
|
||
Payload example in RAG
|
||
|
||
```json
|
||
{
|
||
"snippet_id": "A.3.1.bk012",
|
||
"type": "code",
|
||
"source_url": "...",
|
||
"offsets": [120445, 120992],
|
||
"anchor_offsets": [120445, 120489],
|
||
"section_id": "A.3.1"
|
||
}
|
||
```
|
||
|
||
Schema belongs in your contract. See [data-contracts.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/data-contracts.md) and cite rules in [retrieval-traceability.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/retrieval-traceability.md).
|
||
|
||
---
|
||
|
||
## Pseudocode
|
||
|
||
````python
|
||
def detect_typed_blocks(lines):
|
||
blocks = []
|
||
i = 0
|
||
while i < len(lines):
|
||
if is_fenced_open(lines[i]):
|
||
lang = fence_lang(lines[i])
|
||
j = find_fence_close(lines, i + 1)
|
||
off0, off1 = span_offsets(lines, i, j)
|
||
blocks.append(make_block("code", off0, off1, {"lang": lang, "fence": "```"}))
|
||
i = j + 1
|
||
continue
|
||
|
||
if is_indented_code_run(lines, i):
|
||
j = expand_indented_run(lines, i)
|
||
off0, off1 = span_offsets(lines, i, j)
|
||
blocks.append(make_block("code", off0, off1, {"lang": None}))
|
||
i = j + 1
|
||
continue
|
||
|
||
if is_markdown_table_head(lines, i):
|
||
j = expand_markdown_table(lines, i)
|
||
rows, cols, hdr = table_shape_markdown(lines, i, j)
|
||
off0, off1 = span_offsets(lines, i, j)
|
||
blocks.append(make_block("table", off0, off1, {
|
||
"rows": rows, "cols": cols, "header_rows": hdr, "grid_kind": "markdown"
|
||
}))
|
||
i = j + 1
|
||
continue
|
||
|
||
if is_csv_like_run(lines, i):
|
||
j = expand_csv_run(lines, i)
|
||
rows, cols = csv_shape(lines, i, j)
|
||
off0, off1 = span_offsets(lines, i, j)
|
||
blocks.append(make_block("table", off0, off1, {
|
||
"rows": rows, "cols": cols, "header_rows": 1, "grid_kind": "csv"
|
||
}))
|
||
i = j + 1
|
||
continue
|
||
|
||
# fallback to prose
|
||
j = expand_paragraph(lines, i)
|
||
off0, off1 = span_offsets(lines, i, j)
|
||
blocks.append(make_block("prose", off0, off1, {}))
|
||
i = j + 1
|
||
return attach_captions(blocks)
|
||
````
|
||
|
||
`attach_captions` scans for caption patterns and links them with the nearest figure or table in the same section.
|
||
|
||
---
|
||
|
||
## Common pitfalls and fixes
|
||
|
||
* **Tables split by page headers**
|
||
Normalize headers and footers before detection. See [pdf\_layouts\_and\_ocr.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Chunking/pdf_layouts_and_ocr.md).
|
||
|
||
* **Indented code misread as a list**
|
||
Reject when most lines begin with list markers. Require a minimum run length and stable left margin.
|
||
|
||
* **Markdown tables with wrapped cells**
|
||
Use a soft wrap join that keeps pipe counts aligned. Preserve row count and column count.
|
||
|
||
* **Captions orphaned or pulled into parents**
|
||
Link captions to the nearest figure or table and keep them in the same section. Re run the section validator.
|
||
|
||
* **Offsets change after minor edits**
|
||
Keep block ids stable and run the migration map. See [reindex\_migration.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Chunking/reindex_migration.md).
|
||
|
||
---
|
||
|
||
## Tests to include in CI
|
||
|
||
* Markdown file with mixed prose, code fences, and tables. Expect correct block counts and stable offsets across runs.
|
||
* PDF with grid tables and multi line cells. Expect merged cells recorded in `cell_spans` where needed.
|
||
* OCR source with monospaced code and broken hyphenation. Expect code detection without line wraps.
|
||
* Document with figure and caption pairs across section boundaries. Expect captions linked to the correct target block.
|
||
|
||
---
|
||
|
||
## Copy paste prompt for a quick check
|
||
|
||
```
|
||
You have TXT OS and WFGY Problem Map loaded.
|
||
|
||
Given a typed block:
|
||
block_id = {id}
|
||
type = {code|table|figure|caption}
|
||
offsets = {start, end}
|
||
attrs = {...}
|
||
|
||
Task:
|
||
1) Verify the block type given the text slice.
|
||
2) If caption, confirm caption_for links to the nearest figure or table in the same section.
|
||
3) Suggest the minimal structural fix page if detection would fail.
|
||
Return JSON:
|
||
{ "ok": true|false, "why": "...", "open": "pdf_layouts_and_ocr.md | section_detection.md | reindex_migration.md" }
|
||
```
|
||
|
||
---
|
||
|
||
### 🔗 Quick-Start Downloads (60 sec)
|
||
|
||
| Tool | Link | 3-Step Setup |
|
||
| -------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------ | ---------------------------------------------------------------------------------------- |
|
||
| **WFGY 1.0 PDF** | [Engine Paper](https://github.com/onestardao/WFGY/blob/main/I_am_not_lizardman/WFGY_All_Principles_Return_to_One_v1.0_PSBigBig_Public.pdf) | 1️⃣ Download · 2️⃣ Upload to your LLM · 3️⃣ Ask “Answer using WFGY + \<your question>” |
|
||
| **TXT OS (plain-text OS)** | [TXTOS.txt](https://github.com/onestardao/WFGY/blob/main/OS/TXTOS.txt) | 1️⃣ Download · 2️⃣ Paste into any LLM chat · 3️⃣ Type “hello world” — OS boots instantly |
|
||
|
||
---
|
||
|
||
<!-- WFGY_FOOTER_START -->
|
||
|
||
### Explore More
|
||
|
||
| Layer | Page | What it’s for |
|
||
| --- | --- | --- |
|
||
| Proof | [WFGY Recognition Map](/recognition/README.md) | External citations, integrations, and ecosystem proof |
|
||
| Engine | [WFGY 1.0](/legacy/README.md) | Original PDF based tension engine |
|
||
| Engine | [WFGY 2.0](/core/README.md) | Production tension kernel and math engine for RAG and agents |
|
||
| Engine | [WFGY 3.0](/TensionUniverse/EventHorizon/README.md) | TXT based Singularity tension engine, 131 S class set |
|
||
| Map | [Problem Map 1.0](/ProblemMap/README.md) | Flagship 16 problem RAG failure checklist and fix map |
|
||
| Map | [Problem Map 2.0](/ProblemMap/rag-architecture-and-recovery.md) | RAG focused recovery pipeline |
|
||
| Map | [Problem Map 3.0](/ProblemMap/wfgy-rag-16-problem-map-global-debug-card.md) | Global Debug Card, image as a debug protocol layer |
|
||
| Map | [Semantic Clinic](/ProblemMap/SemanticClinicIndex.md) | Symptom to family to exact fix |
|
||
| Map | [Grandma’s Clinic](/ProblemMap/GrandmaClinic/README.md) | Plain language stories mapped to Problem Map 1.0 |
|
||
| Onboarding | [Starter Village](/StarterVillage/README.md) | Guided tour for newcomers |
|
||
| App | [TXT OS](/OS/README.md) | TXT semantic OS, fast boot |
|
||
| App | [Blah Blah Blah](/OS/BlahBlahBlah/README.md) | Abstract and paradox Q and A built on TXT OS |
|
||
| App | [Blur Blur Blur](/OS/BlurBlurBlur/README.md) | Text to image with semantic control |
|
||
| App | [Blow Blow Blow](/OS/BlowBlowBlow/README.md) | Reasoning game engine and memory demo |
|
||
|
||
If this repository helped, starring it improves discovery so more builders can find the docs and tools.
|
||
[](https://github.com/onestardao/WFGY)
|
||
|
||
<!-- WFGY_FOOTER_END -->
|
||
|
||
要我繼續下一頁就說:**GO pdf\_layouts\_and\_ocr.md**
|