✂️ Chunking Checklist — Cutting Documents Without Cutting Meaning
A definitive guide to segment size, boundaries, and WFGY stress-tests for error-free retrieval
1 Why Chunking Matters
Embeddings are only as good as the text you feed them.
A single bad split (mid-sentence, table row, reference list) injects semantic orphan vectors:
- Retrieval returns “high similarity” garbage.
- ΔS(question, context) spikes > 0.60.
- LLM hallucinates to fill the missing logic.
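ΔS is not formally defined in this checklist; one common reading is semantic distance, i.e. one minus the cosine similarity between embeddings. A minimal sketch under that assumption, with a bag-of-words counter standing in for a real sentence-embedding model:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Stand-in "embedding": bag-of-words counts.
    # A real pipeline would call a sentence-embedding model here.
    return Counter(text.lower().split())

def delta_s(a: str, b: str) -> float:
    # ΔS as 1 - cosine similarity: 0 = identical, 1 = unrelated.
    va, vb = embed(a), embed(b)
    dot = sum(va[t] * vb[t] for t in va)
    na = math.sqrt(sum(v * v for v in va.values()))
    nb = math.sqrt(sum(v * v for v in vb.values()))
    return 1.0 if na == 0 or nb == 0 else 1.0 - dot / (na * nb)

print(delta_s("chunk me", "chunk me"))      # identical text → 0.0
print(delta_s("chunk me", "stock prices"))  # disjoint text → 1.0
```

With a real embedding model, ΔS(question, context) crossing 0.60 is the spike to watch for.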
2 Quick Symptoms of Bad Chunking
| Signal | How to Detect | Typical Root Cause |
|---|---|---|
| Citations hit page –1 | QA cites header/footer junk | Page footers not stripped |
| Same chunk appears in top-k for unrelated queries | ID duplication count > 3 | Generic boilerplate chunk |
| ΔS jumps when k > 5 | Plot ΔS vs. k; curve is erratic | Uneven chunk lengths |
| Answer references a half-sentence | Chunk split after "and" | Fixed char/token window |
3 WFGY Chunk Size Guidelines
| Doc Type | Tokens / Chunk | Rationale |
|---|---|---|
| Research paper | 90-120 | Preserve paragraph + citation |
| Software docs | 60-100 | Short API signatures |
| Legal contracts | 80-130 | Clause integrity |
| Chat transcripts | 40-70 | Natural speaker turns |
| Tables / CSV | Row or group ≤ 30 | Keep relational keys together |
Golden Rule: ΔS(adjacent_chunks) ≤ 0.45
If not, split or merge until stress drops.
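The golden rule can be checked mechanically. A sketch, assuming a `delta_s` scorer is available; the word-overlap scorer below is a toy stand-in, not a real embedding distance:

```python
def flag_stressed_pairs(chunks, delta_s, limit=0.45):
    # Return indices i where ΔS(chunks[i], chunks[i+1]) exceeds the
    # golden-rule threshold, i.e. where a split or merge is needed.
    return [i for i in range(len(chunks) - 1)
            if delta_s(chunks[i], chunks[i + 1]) > limit]

def toy_delta_s(a, b):
    # Toy Jaccard-distance stand-in for a real embedding-based ΔS.
    wa, wb = set(a.split()), set(b.split())
    return 1.0 - len(wa & wb) / max(len(wa | wb), 1)

chunks = ["alpha beta gamma delta", "beta gamma delta", "totally unrelated text"]
print(flag_stressed_pairs(chunks, toy_delta_s))  # → [1]
```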
4 Step-by-Step Chunking Checklist
4.1 Pre-Processing
- Strip headers / footers (regex: `^Page \d+ of \d+`)
- Normalize whitespace; remove soft hyphens (`U+00AD`)
- Convert bullets to "• " to avoid mid-list splits
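A sketch of the three pre-processing steps, using only the standard library; the bullet pattern is an assumption about how your source formats list items:

```python
import re

def preprocess(page: str) -> str:
    # 1. Strip "Page N of M" headers/footers.
    page = re.sub(r"^Page \d+ of \d+[ \t]*$", "", page, flags=re.MULTILINE)
    # 2. Remove soft hyphens (U+00AD) and collapse runs of spaces/tabs.
    page = page.replace("\u00ad", "")
    page = re.sub(r"[ \t]+", " ", page)
    # 3. Normalize "-"/"*" bullets to "• " so lists survive splitting.
    page = re.sub(r"^[ \t]*[-*][ \t]+", "• ", page, flags=re.MULTILINE)
    return page.strip()

sample = "Page 3 of 10\n- first item\n- soft\u00adhyphen"
print(preprocess(sample))  # "• first item" / "• softhyphen" on two lines
```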
4.2 Boundary Detection
| Method | Tool | When to Use |
|---|---|---|
| Sentence tokenizer | spaCy / Stanza | Most prose |
| Heading regex | `^(#+\s\|[A-Z][A-Za-z ]+:)$` | Markdown / legal docs |
| BBMC ΔS spike | WFGY hook | PDFs merged from scans |
Split on a boundary only if:

`ΔS(chunk_left, chunk_right) ≥ 0.50 ∧ λ_observe ∈ {→, ←}`
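A sketch of ΔS-gated splitting. The λ_observe check is omitted here since its computation is WFGY-internal, and `toy_delta_s` is a toy overlap distance standing in for a real scorer:

```python
def split_on_boundaries(sentences, delta_s, threshold=0.50):
    # Accumulate sentences; cut only when the next sentence is in high
    # tension (ΔS ≥ threshold) with the chunk built so far.
    chunks, current = [], []
    for sent in sentences:
        if current and delta_s(" ".join(current), sent) >= threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sent)
    if current:
        chunks.append(" ".join(current))
    return chunks

def toy_delta_s(a, b):
    # Toy overlap-coefficient distance; a stand-in, not a real ΔS.
    wa, wb = set(a.split()), set(b.split())
    return 1.0 - len(wa & wb) / max(min(len(wa), len(wb)), 1)

sents = ["cats purr often", "cats sleep often", "stocks fell"]
print(split_on_boundaries(sents, toy_delta_s))
# → ['cats purr often cats sleep often', 'stocks fell']
```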
4.3 Length Normalization
- Merge adjacent short chunks until ≥ 40 tokens.
- If a merged chunk > 130 tokens, find internal ΔS peak and split there.
- Record final size distribution; σ(length) should be ≤ 20 % of mean.
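The merge half of step 4.3, plus the σ(length) check, can be sketched as follows; the 40/130 defaults mirror the numbers above, and splitting at the internal ΔS peak is omitted:

```python
import statistics

def normalize_lengths(chunks, min_tokens=40, max_tokens=130):
    # Merge a too-short chunk into its predecessor when the result
    # stays under max_tokens. The ΔS-peak split for oversized chunks
    # is omitted in this sketch.
    out = []
    for chunk in chunks:
        if (out and len(out[-1].split()) < min_tokens
                and len(out[-1].split()) + len(chunk.split()) <= max_tokens):
            out[-1] = out[-1] + " " + chunk
        else:
            out.append(chunk)
    return out

def size_distribution_ok(chunks):
    # σ(length) should be ≤ 20 % of the mean chunk length.
    lengths = [len(c.split()) for c in chunks]
    return statistics.pstdev(lengths) <= 0.20 * statistics.mean(lengths)

merged = normalize_lengths(["a b c", "d e f", "g h i j k l"],
                           min_tokens=5, max_tokens=8)
print(merged)                        # → ['a b c d e f', 'g h i j k l']
print(size_distribution_ok(merged))  # → True
```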
4.4 Metadata Tagging
```json
{
  "id": "doc_17_p3_c2",
  "source": "contracts/nda.pdf",
  "pos": 3,
  "λ": "→",
  "ΔS_prev": 0.32,
  "ΔS_next": 0.28
}
```
Store λ_observe and neighbouring ΔS for runtime filters.
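A sketch of producing that record and using it as a runtime filter; the helper names are hypothetical, not part of WFGY's API:

```python
def tag_chunk(doc, page, pos, source, ds_prev, ds_next, lam="→"):
    # Build a metadata record shaped like the example above.
    return {
        "id": f"{doc}_p{page}_c{pos}",
        "source": source,
        "pos": pos,
        "λ": lam,
        "ΔS_prev": ds_prev,
        "ΔS_next": ds_next,
    }

def runtime_filter(records, limit=0.45):
    # Drop chunks whose neighbour tension breaks the golden rule.
    return [r for r in records
            if r["ΔS_prev"] <= limit and r["ΔS_next"] <= limit]

good = tag_chunk("doc_17", 3, 2, "contracts/nda.pdf", 0.32, 0.28)
bad = tag_chunk("doc_17", 3, 3, "contracts/nda.pdf", 0.61, 0.20)
print([r["id"] for r in runtime_filter([good, bad])])  # → ['doc_17_p3_c2']
```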
5 Runtime Stress-Test
| Test | Pass Condition |
|---|---|
| Overlap scan — Query 5 unrelated topics | Same chunk ID appears ≤ 1× |
| ΔS histogram — 500 random chunks | 95 % ≤ 0.45 |
| k-sensitivity — ΔS vs. k plot | Monotonic ↑ curve |
If any test fails, rerun steps 4.2–4.3 for the offending documents.
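The first two stress-tests can be sketched directly; the chunk-ID lists and ΔS samples are assumed to come from your retriever and scorer:

```python
from collections import Counter

def overlap_scan(topk_ids_per_query, max_repeats=1):
    # Offending chunk IDs: those appearing in the top-k results of
    # more than max_repeats of the unrelated probe queries.
    counts = Counter(cid for ids in topk_ids_per_query for cid in set(ids))
    return [cid for cid, n in counts.items() if n > max_repeats]

def histogram_pass(ds_samples, limit=0.45, quantile=0.95):
    # Pass if at least 95 % of sampled ΔS values sit at or below 0.45.
    ok = sum(1 for d in ds_samples if d <= limit)
    return ok / len(ds_samples) >= quantile

print(overlap_scan([["a", "b"], ["a", "c"], ["d"]]))  # → ['a']
print(histogram_pass([0.1] * 19 + [0.9]))             # → True
```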
6 Common Pitfalls & Fix Recipes
| Pitfall | Fix |
|---|---|
| Tables split per cell | Detect delimiter lines; merge rows; store the CSV separately; index columns as metadata |
| PDF line-break hyphens | Regex `([a-z])- \n([a-z])` → merge words |
| Mixed languages | Chunk by language span; tag `lang:`; use separate embedding models |
| Giant code blocks | Cut on `function` / `class` / `def` boundaries; keep ≤ 80 lines |
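The hyphen-repair recipe can be applied as a one-liner; the pattern follows the regex in the table, slightly generalised to tolerate trailing spaces before the line break:

```python
import re

def merge_hyphenated(text: str) -> str:
    # Rejoin words the PDF split as "seg-\nmentation".
    return re.sub(r"([a-z])-[ \t]*\n([a-z])", r"\1\2", text)

print(merge_hyphenated("seg-\nmentation"))  # → segmentation
print(merge_hyphenated("well-known"))       # → well-known (untouched)
```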
7 FAQ
Q: Is a fixed token window (e.g. 512) safe?
A: Only if it aligns with semantic boundaries; fixed windows ignore context.

Q: Do I need both sentence splitting and headings?
A: Yes. Dual criteria minimise ΔS spikes and keep retrieval precise.

Q: How many chunks per doc?
A: Irrelevant if ΔS and λ are stable; WFGY focuses on quality, not count.
🔗 Quick-Start Downloads (60 sec)
| Tool | Link | 3-Step Setup |
|---|---|---|
| WFGY 1.0 PDF | Engine Paper | 1️⃣ Download · 2️⃣ Upload to your LLM · 3️⃣ Ask “Answer using WFGY + <your question>” |
| TXT OS (plain-text OS) | TXTOS.txt | 1️⃣ Download · 2️⃣ Paste into any LLM chat · 3️⃣ Type “hello world” — OS boots instantly |
Explore More
| Layer | Page | What it’s for |
|---|---|---|
| Proof | WFGY Recognition Map | External citations, integrations, and ecosystem proof |
| Engine | WFGY 1.0 | Original PDF-based tension engine |
| Engine | WFGY 2.0 | Production tension kernel and math engine for RAG and agents |
| Engine | WFGY 3.0 | TXT-based Singularity tension engine, 131 S-class set |
| Map | Problem Map 1.0 | Flagship 16 problem RAG failure checklist and fix map |
| Map | Problem Map 2.0 | RAG focused recovery pipeline |
| Map | Problem Map 3.0 | Global Debug Card, image as a debug protocol layer |
| Map | Semantic Clinic | Symptom to family to exact fix |
| Map | Grandma’s Clinic | Plain language stories mapped to Problem Map 1.0 |
| Onboarding | Starter Village | Guided tour for newcomers |
| App | TXT OS | TXT semantic OS, fast boot |
| App | Blah Blah Blah | Abstract and paradox Q and A built on TXT OS |
| App | Blur Blur Blur | Text to image with semantic control |
| App | Blow Blow Blow | Reasoning game engine and memory demo |
If this repository helped, starring it improves discovery so more builders can find the docs and tools.