WFGY/ProblemMap/chunking-checklist.md
2025-08-15 23:14:24 +08:00


✂️ Chunking Checklist — Cutting Documents Without Cutting Meaning

A definitive guide to segment size, boundaries, and WFGY stress-tests for error-free retrieval


1 Why Chunking Matters

Embeddings are only as good as the text you feed them.
A single bad split (mid-sentence, table row, reference list) injects semantic orphan vectors:

  • Retrieval returns “high similarity” garbage.
  • ΔS(question, context) spikes > 0.60.
  • LLM hallucinates to fill the missing logic.

2 Quick Symptoms of Bad Chunking

| Signal | How to Detect | Typical Root |
|---|---|---|
| Citations hit page 1 | QA cites header/footer junk | Page footers not stripped |
| Same chunk appears in top-k for unrelated queries | id duplication count > 3 | Generic boilerplate chunk |
| ΔS jumps when k > 5 | Plot ΔS vs. k; curve erratic | Uneven chunk lengths |
| Answer references half-sentence | Chunk split after “and” | Fixed char/token window |

3 WFGY Chunk Size Guidelines

| Doc Type | Tokens / Chunk | Rationale |
|---|---|---|
| Research paper | 90-120 | Preserve paragraph + citation |
| Software docs | 60-100 | Short API signatures |
| Legal contracts | 80-130 | Clause integrity |
| Chat transcripts | 40-70 | Natural speaker turns |
| Tables / CSV | Row or group ≤ 30 | Keep relational keys together |

Golden Rule: ΔS(adjacent_chunks) ≤ 0.45
If not, split or merge until stress drops.
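
A minimal sketch of how the Golden Rule could be checked. It assumes ΔS is computed as 1 minus the cosine similarity of two embedding vectors, which this checklist does not spell out. The names `delta_s` and `adjacent_stress` are illustrative, and the toy 2-D vectors stand in for real chunk embeddings.

```python
import math

def delta_s(a, b):
    """Assumed definition: semantic stress = 1 - cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0:
        return 1.0  # treat a zero vector as maximally distant
    return 1.0 - dot / norm

def adjacent_stress(embeddings, limit=0.45):
    """Return (index, ΔS) for every adjacent pair whose ΔS exceeds the limit."""
    flagged = []
    for i in range(len(embeddings) - 1):
        ds = delta_s(embeddings[i], embeddings[i + 1])
        if ds > limit:
            flagged.append((i, round(ds, 3)))
    return flagged

# Toy vectors: chunks 0 and 1 are close, chunk 2 points elsewhere.
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
violations = adjacent_stress(vectors)
```

Each flagged index `i` marks the boundary between chunk `i` and chunk `i + 1`, which is where a re-split or merge would be attempted.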


4 Step-by-Step Chunking Checklist

4.1 Pre-Processing

  • Strip headers / footers (regex: ^Page \d+ of \d+)
  • Normalize whitespace, remove soft hyphens (U+00AD)
  • Convert bullets → “• ” to avoid mid-list splits
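
The three pre-processing steps can be sketched as one function. This is an illustration under assumptions, not the WFGY implementation: the footer pattern is the regex above anchored to a whole line, the set of bullet markers (`-`, `*`, `•`) is a guess at what typical inputs contain, and `preprocess` is a hypothetical name.

```python
import re

# Footer regex from the checklist, anchored to a full line (an assumption).
PAGE_FOOTER = re.compile(r"^Page \d+ of \d+[ \t]*$", re.MULTILINE)
# Leading "-", "*" or "•" followed by spaces counts as a bullet marker here.
BULLET = re.compile(r"^[ \t]*[-*\u2022][ \t]+", re.MULTILINE)

def preprocess(text):
    text = PAGE_FOOTER.sub("", text)         # strip "Page 3 of 12" lines
    text = text.replace("\u00ad", "")        # remove soft hyphens (U+00AD)
    text = BULLET.sub("\u2022 ", text)       # unify bullet markers to "• "
    text = re.sub(r"[ \t]+", " ", text)      # collapse runs of spaces and tabs
    text = re.sub(r"\n{3,}", "\n\n", text)   # collapse runs of blank lines
    return text.strip()

sample = "Intro\u00adduction text\nPage 1 of 9\n- first  item\n* second item\n\n\n\nEnd"
cleaned = preprocess(sample)
```

Running the footer pass first keeps page markers from being mistaken for content by the later steps.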

4.2 Boundary Detection

| Method | Tool | When to Use |
|---|---|---|
| Sentence tokenizer | spaCy / Stanza | Most prose |
| Heading regex | `^(#+\s [A-Z][A-Za-z ]+:)$` | Markdown / legal docs |
| BBMC ΔS spike | WFGY hook | PDFs merged from scans |

Split on boundaries only if:


ΔS(chunk_left, chunk_right) ≥ 0.50  ∧  λ_observe ∈ {→, ←}
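
The split condition can be read as a simple predicate over each candidate boundary. The sketch below treats the λ_observe values as opaque labels, since their meaning is not defined in this checklist. The function names and the `"×"` state in the example are hypothetical placeholders, and the ΔS values are supplied by hand rather than computed.

```python
SPLIT_THRESHOLD = 0.50
SPLIT_STATES = {"→", "←"}

def should_split(delta_s, lambda_observe):
    """True when both parts of the split condition hold."""
    return delta_s >= SPLIT_THRESHOLD and lambda_observe in SPLIT_STATES

def split_at_boundaries(sentences, stresses, states):
    """Boundary i sits between sentences[i] and sentences[i + 1]."""
    if not sentences:
        return []
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        if should_split(stresses[i - 1], states[i - 1]):
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks

# Boundary 1 passes both tests; boundary 2 has high ΔS but a non-listed state.
chunks = split_at_boundaries(
    ["A.", "B.", "C.", "D."],
    stresses=[0.20, 0.62, 0.55],
    states=["→", "→", "×"],
)
```

Here only the second boundary triggers a split, so the result is two chunks: `"A. B."` and `"C. D."`.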

4.3 Length Normalisation

  1. Merge adjacent short chunks until ≥ 40 tokens.
  2. If a merged chunk > 130 tokens, find internal ΔS peak and split there.
  3. Record final size distribution; σ(length) should be ≤ 20 % of mean.
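
The three normalisation steps might look like the sketch below. It counts whitespace-separated words as tokens, which is an assumption (a real pipeline would use the embedding model's tokenizer). It also uses the population standard deviation for σ and folds a short final chunk into its left neighbour. All function names are illustrative.

```python
import statistics

MIN_TOKENS, MAX_TOKENS = 40, 130

def n_tokens(text):
    return len(text.split())  # assumption: whitespace tokens

def merge_short(chunks, min_tokens=MIN_TOKENS):
    """Step 1: grow each chunk with its right neighbour until it is long enough."""
    merged = []
    for chunk in chunks:
        if merged and n_tokens(merged[-1]) < min_tokens:
            merged[-1] = merged[-1] + " " + chunk
        else:
            merged.append(chunk)
    # A short tail has no right neighbour, so fold it into the previous chunk.
    if len(merged) > 1 and n_tokens(merged[-1]) < min_tokens:
        tail = merged.pop()
        merged[-1] = merged[-1] + " " + tail
    return merged

def split_at_peak(sentences, stresses):
    """Step 2: split an oversized chunk's sentences at the highest internal ΔS."""
    peak = max(range(len(stresses)), key=stresses.__getitem__)
    return sentences[: peak + 1], sentences[peak + 1 :]

def length_spread(chunks):
    """Step 3: σ(length) / mean(length); the checklist asks for <= 0.20."""
    lengths = [n_tokens(c) for c in chunks]
    return statistics.pstdev(lengths) / statistics.mean(lengths)
```

`split_at_peak` expects one ΔS value per sentence boundary inside the oversized chunk, so a chunk of n sentences needs n - 1 values.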

4.4 Metadata Tagging

{
  "id": "doc_17_p3_c2",
  "source": "contracts/nda.pdf",
  "pos": 3,
  "λ": "→",
  "ΔS_prev": 0.32,
  "ΔS_next": 0.28
}

Store λ_observe and neighbouring ΔS for runtime filters.
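
One way to produce and use such a record is sketched here. The `ChunkMeta` class, its ASCII field names, and the `is_stable` filter are illustrative assumptions. The checklist does not define what a runtime filter looks like, so the example simply reuses the 0.45 limit from the Golden Rule.

```python
import json
from dataclasses import dataclass

@dataclass
class ChunkMeta:
    id: str
    source: str
    pos: int
    lam: str        # stored under the key "λ"
    ds_prev: float  # ΔS against the previous chunk
    ds_next: float  # ΔS against the next chunk

    def to_record(self):
        """Map the ASCII field names onto the keys used in the example above."""
        return {
            "id": self.id,
            "source": self.source,
            "pos": self.pos,
            "λ": self.lam,
            "ΔS_prev": self.ds_prev,
            "ΔS_next": self.ds_next,
        }

def is_stable(record, limit=0.45):
    """Hypothetical runtime filter: both neighbouring ΔS values within the limit."""
    return max(record["ΔS_prev"], record["ΔS_next"]) <= limit

meta = ChunkMeta("doc_17_p3_c2", "contracts/nda.pdf", 3, "→", 0.32, 0.28)
record = meta.to_record()
line = json.dumps(record, ensure_ascii=False)  # keep λ and ΔS readable
```

Passing `ensure_ascii=False` keeps the non-ASCII keys human-readable in the stored JSON; the default would escape them as `\uXXXX` sequences.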


5 Runtime Stress-Test

| Test | Pass Condition |
|---|---|
| Overlap scan: query 5 unrelated topics | Same chunk ID appears ≤ 1× |
| ΔS histogram: 500 random chunks | 95 % ≤ 0.45 |
| k-sensitivity: ΔS vs. k plot | Monotonic ↑ curve |

If any test fails, rerun steps 4.2 and 4.3 for the offending documents.
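
The three pass conditions can be expressed as small checks over data you have already collected. This sketch assumes retrieval results arrive as lists of chunk IDs and that ΔS values are plain floats. It reads "monotonic ↑" as non-decreasing, and all function names are hypothetical.

```python
from collections import Counter

def overlap_scan(results_per_query, max_hits=1):
    """Return chunk IDs that show up in more than max_hits of the queries."""
    counts = Counter(cid for ids in results_per_query for cid in set(ids))
    return {cid: n for cid, n in counts.items() if n > max_hits}

def histogram_passes(values, limit=0.45, share=0.95):
    """True when at least `share` of the ΔS values are within the limit."""
    within = sum(1 for v in values if v <= limit)
    return within / len(values) >= share

def is_monotonic_up(values):
    """True when ΔS never decreases as k grows (non-decreasing reading)."""
    return all(b >= a for a, b in zip(values, values[1:]))

# Chunk "a" is returned for two unrelated queries, so it is reported.
offenders = overlap_scan([["a", "b"], ["c", "a"], ["d"]])
```

An empty result from `overlap_scan` means the overlap test passes; any reported ID points at a likely boilerplate chunk.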


6 Common Pitfalls & Fix Recipes

| Pitfall | Fix |
|---|---|
| Tables split per cell | Detect delimiter lines; merge rows; store CSV separately; index columns as metadata |
| PDF line-break hyphens | Regex `([a-z])- \n([a-z])` → merge words |
| Mixed languages | Chunk by language span; tag `lang:`; separate embedding models |
| Giant code blocks | Cut on `function` / `class` / `def` boundaries; keep ≤ 80 lines |
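
Two of these fixes are sketched below. The hyphen pattern is a loosened variant of the regex above that tolerates optional spaces around the line break, which is an assumption about typical PDF output. The code splitter only recognises unindented `def`, `class`, or `function` lines. Both function names are illustrative.

```python
import re

# Lowercase letter, hyphen, line break (with optional spaces), lowercase letter.
HYPHEN_BREAK = re.compile(r"([a-z])-[ \t]*\n[ \t]*([a-z])")

def join_hyphenated(text):
    """Rejoin words that were hyphenated across a line break."""
    return HYPHEN_BREAK.sub(r"\1\2", text)

# Only top-level (unindented) declarations start a new block.
BLOCK_START = re.compile(r"^(?:def|class|function)\b")

def split_code(source, max_lines=80):
    """Cut source on declaration lines, and force a cut at max_lines."""
    blocks, current = [], []
    for line in source.splitlines():
        starts_block = BLOCK_START.match(line) is not None
        if current and (starts_block or len(current) >= max_lines):
            blocks.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        blocks.append("\n".join(current))
    return blocks

joined = join_hyphenated("seman-\ntic chunk-  \n  ing")
blocks = split_code("import os\n\ndef a():\n    return 1\n\nclass B:\n    pass")
```

Note that `join_hyphenated` also merges genuine compounds that happen to break at the hyphen (for example "well-" / "known"), so a dictionary check may be worth adding for prose-heavy sources.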

7 FAQ

Q: Is a token window (e.g. 512) safe? A: Only if it aligns with semantic boundaries; fixed windows ignore context.

Q: Do I need sentence splitting and headings? A: Yes. Dual criteria minimise ΔS spikes and keep retrieval precise.

Q: How many chunks per doc? A: Irrelevant if ΔS and λ are stable — WFGY focuses on quality, not count.

