8 KiB
✂️ Chunking Checklist — Cutting Documents Without Cutting Meaning
A definitive guide to segment size, boundaries, and WFGY stress-tests for error-free retrieval
1 Why Chunking Matters
Embeddings are only as good as the text you feed them.
A single bad split (mid-sentence, table row, reference list) injects semantic orphan vectors:
- Retrieval returns “high similarity” garbage.
- ΔS(question, context) spikes > 0.60.
- LLM hallucinates to fill the missing logic.
2 Quick Symptoms of Bad Chunking
| Signal | How to Detect | Typical Root |
|---|---|---|
| Citations hit page –1 | QA cites header/footer junk | Page footers not stripped |
| Same chunk appears in top-k for unrelated queries | id duplication count > 3 |
Generic boiler-plate chunk |
| ΔS jumps when k > 5 | Plot ΔS vs. k; curve erratic | Uneven chunk lengths |
| Answer references half-sentence | Chunk split after “and” | Fixed char/token window |
3 WFGY Chunk Size Guidelines
| Doc Type | Tokens / Chunk | Rationale |
|---|---|---|
| Research paper | 90-120 | Preserve paragraph + citation |
| Software docs | 60-100 | Short API signatures |
| Legal contracts | 80-130 | Clause integrity |
| Chat transcripts | 40-70 | Natural speaker turns |
| Tables / CSV | Row or group ≤ 30 | Keep relational keys together |
Golden Rule: ΔS(adjacent_chunks) ≤ 0.45
If not, split or merge until stress drops.
4 Step-by-Step Chunking Checklist
4.1 Pre-Processing
- Strip headers / footers (
regex: ^Page \d+ of \d+) - Normalize whitespace, remove soft hyphens (
U+00AD) - Convert bullets → “• ” to avoid mid-list splits
4.2 Boundary Detection
| Method | Tool | When to Use |
|---|---|---|
| Sentence tokenizer | spaCy / Stanza | Most prose |
| Heading regex `^(#+\s | [A-Z][A-Za-z ]+:)$` | Markdown / legal docs |
| BBMC ΔS spike | WFGY hook | PDFs merged from scans |
Split on boundaries only if:
ΔS(chunk\_left, chunk\_right) ≥ 0.50 ∧ λ\_observe ∈ {→, ←}
4.3 Length Normalisation
- Merge adjacent short chunks until ≥ 40 tokens.
- If a merged chunk > 130 tokens, find internal ΔS peak and split there.
- Record final size distribution; σ(length) should be ≤ 20 % of mean.
4.4 Metadata Tagging
{
"id": "doc_17_p3_c2",
"source": "contracts/nda.pdf",
"pos": 3,
"λ": "→",
"ΔS_prev": 0.32,
"ΔS_next": 0.28
}
Store λ_observe and neighbouring ΔS for runtime filters.
5 Runtime Stress-Test
| Test | Pass Condition |
|---|---|
| Overlap scan — Query 5 unrelated topics | Same chunk ID appears ≤ 1× |
| ΔS histogram — 500 random chunks | 95 % ≤ 0.45 |
| k-sensitivity — ΔS vs. k plot | Monotonic ↑ curve |
If any fail, rerun 4.2–4.3 for offending documents.
6 Common Pitfalls & Fix Recipes
| Pitfall | Fix | ||
|---|---|---|---|
| Tables split per cell | Detect delimiter lines; merge rows; store CSV separate; index columns as metadata | ||
| PDF line-break hyphens | Regex ([a-z])- \n([a-z]) → merge words |
||
| Mixed languages | Chunk by language span; tag lang:; separate embedding models |
||
| Giant code blocks | Cut on `function | class | def` boundaries; keep ≤ 80 lines |
7 FAQ
Q: Is a token window (e.g. 512) safe? A: Only if it aligns with semantic boundaries; fixed windows ignore context.
Q: Do I need sentence splitting and headings? A: Yes. Dual criteria minimise ΔS spikes and keep retrieval precise.
Q: How many chunks per doc? A: Irrelevant if ΔS and λ are stable — WFGY focuses on quality, not count.
🔗 Quick-Start Downloads (60 sec)
| Tool | Link | 3-Step Setup |
|---|---|---|
| WFGY 1.0 PDF | Engine Paper | 1️⃣ Download · 2️⃣ Upload to your LLM · 3️⃣ Ask “Answer using WFGY + <your question>” |
| TXT OS (plain-text OS) | TXTOS.txt | 1️⃣ Download · 2️⃣ Paste into any LLM chat · 3️⃣ Type “hello world” — OS boots instantly |
🧭 Explore More
| Module | Description | Link |
|---|---|---|
| WFGY Core | WFGY 2.0 engine is live: full symbolic reasoning architecture and math stack | View → |
| Problem Map 1.0 | Initial 16-mode diagnostic and symbolic fix framework | View → |
| Problem Map 2.0 | RAG-focused failure tree, modular fixes, and pipelines | View → |
| Semantic Clinic Index | Expanded failure catalog: prompt injection, memory bugs, logic drift | View → |
| Semantic Blueprint | Layer-based symbolic reasoning & semantic modulations | View → |
| Benchmark vs GPT-5 | Stress test GPT-5 with full WFGY reasoning suite | View → |
| 🧙♂️ Starter Village 🏡 | New here? Lost in symbols? Click here and let the wizard guide you through | Start → |
👑 Early Stargazers: See the Hall of Fame —
Engineers, hackers, and open source builders who supported WFGY from day one.
⭐ WFGY Engine 2.0 is already unlocked. ⭐ Star the repo to help others discover it and unlock more on the Unlock Board.