Chunking — Global Fix Map

🏥 Quick Return to Emergency Room

You are at a specialist desk.
For full triage and doctors on duty, return here:

WFGY Global Fix Map — main Emergency Room, 300+ structured fixes

WFGY Problem Map 1.0 — 16 reproducible failure modes

Think of this page as a sub-room.
If you want full consultation and prescriptions, go back to the Emergency Room lobby.

A compact hub to stabilize document chunking across formats, pipelines, and retrieval systems.
This folder routes chunk-related bugs to structural fixes and provides checklists, schema, and live recipes.
No infra change required.

Orientation: what each page does

Page	What it solves	Typical symptom
Chunk ID Schema	Unique ID + schema for each chunk	Duplicate or drifting chunks across runs
Chunking Checklist	Minimal audit list for validity	Chunks too long, too short, or incomplete
Code / Tables / Blocks	Preserve structure for code, tables, blocks	Retrieval drops formatting or logic
Section Detection	Detect paragraph and section anchors	Anchors missing, snippets cut mid-thought
Title Hierarchy	Maintain document heading hierarchy	Only partial or meaningless sub-sections retrieved
PDF Layouts & OCR	Repair PDF/OCR-specific chunking	Citations collapse after parsing
Reindex & Migration	Safe chunk migration during reindex	Index rebuilt but old refs mismatch
Eval RAG Precision & Recall	Deterministic evaluation recipes	“Better” chunking cannot be proven
Live Monitoring (RAG)	Online health checks for chunking	Sudden drift or collapse after deploy

When to use this folder

Your chunks look fine by eye but retrieval skips important sections.
PDF / OCR parsing collapses headers, math, or tables.
Hybrid retrievers underperform due to inconsistent chunk boundaries.
Reindexing breaks old citations.
Context flips between runs with the same corpus.

Acceptance targets

Chunk boundaries align with semantic windows
ΔS(question, retrieved) ≤ 0.45
Coverage of target section ≥ 0.70
λ_observe convergent across 3 paraphrases and 2 seeds
Traceability contract fields always present: {snippet_id, section_id, source_url, offsets, tokens}
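
The targets above can be checked mechanically once ΔS and coverage have been measured. The sketch below is one possible way to do that, not the official WFGY tooling: it assumes ΔS and coverage arrive as plain floats from your own evaluation step, and the function names `missing_fields` and `check_targets` are hypothetical. Only the thresholds and the five field names come from this page.

```python
# Hypothetical acceptance check. Thresholds and field names are taken from
# the "Acceptance targets" list; everything else is an illustrative assumption.
REQUIRED_FIELDS = ("snippet_id", "section_id", "source_url", "offsets", "tokens")
MAX_DELTA_S = 0.45   # ΔS(question, retrieved) must be at or below this
MIN_COVERAGE = 0.70  # coverage of the target section must be at or above this


def missing_fields(chunk):
    """Return the traceability fields that are absent or None in one chunk."""
    return [name for name in REQUIRED_FIELDS if chunk.get(name) is None]


def check_targets(delta_s, coverage, chunks):
    """Compare measured values and chunk metadata against the targets."""
    missing = {}
    for index, chunk in enumerate(chunks):
        gaps = missing_fields(chunk)
        if gaps:
            # Fall back to the list position when the chunk has no ID at all.
            key = chunk.get("snippet_id") or f"chunk[{index}]"
            missing[key] = gaps
    delta_s_ok = delta_s <= MAX_DELTA_S
    coverage_ok = coverage >= MIN_COVERAGE
    return {
        "delta_s_ok": delta_s_ok,
        "coverage_ok": coverage_ok,
        "missing": missing,
        "passed": delta_s_ok and coverage_ok and not missing,
    }
```

The λ_observe target is left out here because it needs several runs to evaluate.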

60-second fix checklist

1. Check chunk IDs: apply chunk_id_schema. Ensure IDs are unique and stable across reindex.
2. Audit with checklist: run the chunking-checklist before ingest.
3. Preserve structure: use code_tables_blocks for code, tables, blocks.
4. Validate anchors: confirm section and title detection. Apply title_hierarchy.
5. Reindex safely: use reindex_migration with hash/version lock.
6. Monitor live: apply live_monitoring_rag to catch collapse early.
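
The "unique and stable across reindex" requirement in the first step is commonly met by deriving the ID from the chunk's content and location instead of from its position in a run. The sketch below shows that idea under stated assumptions: the real chunk_id_schema page may define a different layout, and `make_chunk_id`, `find_duplicates`, and the `schema_version` parameter are hypothetical names used only for illustration.

```python
import hashlib
import re


def normalize(text):
    """Collapse whitespace so cosmetic reflow does not change the ID."""
    return re.sub(r"\s+", " ", text).strip()


def make_chunk_id(source_url, section_id, text, schema_version="v1"):
    """Derive a deterministic ID from version, location, and content.

    Assumption: the ID layout here is illustrative, not the WFGY schema.
    """
    # The unit separator keeps field boundaries unambiguous in the payload.
    payload = "\x1f".join([schema_version, source_url, section_id, normalize(text)])
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]
    return f"{schema_version}-{digest}"


def find_duplicates(chunk_ids):
    """Return the set of IDs that occur more than once."""
    seen, duplicates = set(), set()
    for chunk_id in chunk_ids:
        if chunk_id in seen:
            duplicates.add(chunk_id)
        seen.add(chunk_id)
    return duplicates
```

Because the ID depends only on its inputs, re-running ingestion on an unchanged corpus yields the same IDs, and bumping `schema_version` gives a clean break when the chunking rules change.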

Minimal probe pack

Context: I loaded TXT OS and the WFGY pages.

Task:
- Given doc corpus D, log ΔS(question, retrieved) and λ across 3 paraphrases.
- Validate chunk IDs and section anchors.
- If ΔS ≥ 0.60 or λ flips, propose the smallest structural change:
  chunk schema, checklist, or reindex.
- Verify coverage ≥ 0.70 after fix.

Return JSON:
{ "citations": [...], "ΔS": 0.xx, "λ_state": "<>", "coverage": 0.xx, "next_fix": "..." }
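
The decision rule in the probe ("ΔS ≥ 0.60 or λ flips") can also be applied outside the LLM to the logged runs. This is a minimal sketch, assuming each paraphrase run is logged as a dict with a numeric `delta_s` and an opaque `lambda_state` string; those key names and the `triage` function are hypothetical, and only the 0.60 threshold comes from the probe text.

```python
RISK_DELTA_S = 0.60  # threshold from the probe: ΔS at or above this needs a fix


def triage(runs):
    """Decide whether a set of paraphrase runs calls for a structural fix.

    runs: one dict per paraphrase, each with 'delta_s' (float) and
    'lambda_state' (any string label). Both key names are assumptions.
    """
    if not runs:
        raise ValueError("at least one run is required")
    worst_delta_s = max(run["delta_s"] for run in runs)
    states = {run["lambda_state"] for run in runs}
    lambda_flips = len(states) > 1  # more than one distinct state across runs
    return {
        "worst_delta_s": worst_delta_s,
        "lambda_flips": lambda_flips,
        "needs_fix": worst_delta_s >= RISK_DELTA_S or lambda_flips,
    }
```

Here a "flip" is taken to mean any disagreement in λ state across the paraphrases, which is the strictest reading of the probe.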

🔗 Quick-Start Downloads (60 sec)

Tool	Link	3-Step Setup
WFGY 1.0 PDF	Engine Paper	1️⃣ Download · 2️⃣ Upload to your LLM · 3️⃣ Ask “Answer using WFGY + <your question>”
TXT OS (plain-text OS)	TXTOS.txt	1️⃣ Download · 2️⃣ Paste into any LLM chat · 3️⃣ Type “hello world” — OS boots instantly

Explore More

Module	Description	Link
WFGY Core	Canonical framework entry point	View
Problem Map	Diagnostic map and navigation hub	View
Tension Universe Experiments	MVP experiment field	View
Recognition	Where WFGY is referenced or adopted	View
AI Guide	Anti-hallucination reading protocol for tools	View

If this repository helps, starring it improves discovery for other builders.
