WFGY/ProblemMap/chunking-checklist.md
2025-08-07 23:58:08 +08:00


✂️ Chunking Checklist — Cutting Documents Without Cutting Meaning

A definitive guide to segment size, boundaries, and WFGY stress-tests for error-free retrieval


1 Why Chunking Matters

Embeddings are only as good as the text you feed them.
A single bad split (mid-sentence, table row, reference list) injects semantic orphan vectors:

  • Retrieval returns “high similarity” garbage.
  • ΔS(question, context) spikes > 0.60.
  • LLM hallucinates to fill the missing logic.

2 Quick Symptoms of Bad Chunking

| Signal | How to Detect | Typical Root |
|---|---|---|
| Citations hit page 1 | QA cites header/footer junk | Page footers not stripped |
| Same chunk appears in top-k for unrelated queries | id duplication count > 3 | Generic boilerplate chunk |
| ΔS jumps when k > 5 | Plot ΔS vs. k; curve erratic | Uneven chunk lengths |
| Answer references half-sentence | Chunk split after “and” | Fixed char/token window |

3 WFGY Chunk Size Guidelines

| Doc Type | Tokens / Chunk | Rationale |
|---|---|---|
| Research paper | 90-120 | Preserve paragraph + citation |
| Software docs | 60-100 | Short API signatures |
| Legal contracts | 80-130 | Clause integrity |
| Chat transcripts | 40-70 | Natural speaker turns |
| Tables / CSV | Row or group ≤ 30 | Keep relational keys together |

Golden Rule: ΔS(adjacent_chunks) ≤ 0.45
If not, split or merge until stress drops.
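
The golden rule can be checked mechanically once ΔS is pinned down. This page does not give the formula, so the sketch below assumes ΔS is 1 minus the cosine similarity of two embedding vectors; swap in your own definition if it differs. The names `delta_s` and `adjacent_violations` are illustrative, and the vectors stand in for the output of whatever embedding model you use.

```python
import math
from typing import List, Sequence, Tuple


def delta_s(a: Sequence[float], b: Sequence[float]) -> float:
    """Assumed definition: ΔS = 1 - cosine similarity of two embeddings."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # treat a zero vector as maximally distant
    return 1.0 - dot / (norm_a * norm_b)


def adjacent_violations(
    embeddings: Sequence[Sequence[float]], limit: float = 0.45
) -> List[Tuple[int, float]]:
    """Return (index, ΔS) for every adjacent pair whose ΔS exceeds the limit.

    Index i refers to the boundary between chunk i and chunk i + 1.
    """
    violations = []
    for i in range(len(embeddings) - 1):
        d = delta_s(embeddings[i], embeddings[i + 1])
        if d > limit:
            violations.append((i, d))
    return violations
```

Each returned index marks a boundary to revisit: merge the two chunks or move the split until the pair drops back under the limit.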


4 Step-by-Step Chunking Checklist

4.1 Pre-Processing

  • Strip headers / footers (regex: ^Page \d+ of \d+)
  • Normalize whitespace, remove soft hyphens (U+00AD)
  • Convert bullets → “• ” to avoid mid-list splits
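
A minimal sketch of the three pre-processing steps, using only the standard library. The footer pattern is the one listed above with an added end-of-line anchor; the set of bullet markers (`-`, `*`, `•`) and the whitespace rules are assumptions, so adjust them to your corpus.

```python
import re

# Footer pattern from the checklist, anchored to a whole line.
PAGE_FOOTER = re.compile(r"^Page \d+ of \d+\s*$", re.MULTILINE)
# Assumed bullet markers at the start of a line: "-", "*" or "•".
BULLET = re.compile(r"^[ \t]*[-*\u2022][ \t]+", re.MULTILINE)


def preprocess(text: str) -> str:
    text = PAGE_FOOTER.sub("", text)         # strip "Page N of M" lines
    text = text.replace("\u00ad", "")        # remove soft hyphens (U+00AD)
    text = BULLET.sub("\u2022 ", text)       # normalise bullets to "• "
    text = re.sub(r"[ \t]+", " ", text)      # collapse runs of spaces/tabs
    text = re.sub(r" ?\n ?", "\n", text)     # trim spaces around newlines
    text = re.sub(r"\n{3,}", "\n\n", text)   # keep at most one blank line
    return text.strip()
```

Run this before any boundary detection so that footers and stray hyphens never end up inside a chunk.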

4.2 Boundary Detection

| Method | Tool | When to Use |
|---|---|---|
| Sentence tokenizer | spaCy / Stanza | Most prose |
| Heading regex | `^(#+\s [A-Z][A-Za-z ]+:)$` | Markdown / legal docs |
| BBMC ΔS spike | WFGY hook | PDFs merged from scans |

Split on boundaries only if:

ΔS(chunk_left, chunk_right) ≥ 0.50  ∧  λ_observe ∈ {→, ←}
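
The split rule can be sketched as a predicate applied to each candidate boundary. The code assumes the ΔS value and the λ_observe state for every boundary have already been computed elsewhere (how WFGY derives λ_observe is not covered on this page); `should_split` and `group_units` are hypothetical names.

```python
from typing import List, Sequence

SPLIT_THRESHOLD = 0.50
SPLIT_STATES = {"→", "←"}  # λ_observe values that permit a split


def should_split(delta_s: float, lam: str) -> bool:
    """Apply the rule: ΔS ≥ 0.50 and λ_observe ∈ {→, ←}."""
    return delta_s >= SPLIT_THRESHOLD and lam in SPLIT_STATES


def group_units(
    units: Sequence[str],
    boundary_delta_s: Sequence[float],
    boundary_lambda: Sequence[str],
) -> List[List[str]]:
    """Group sentences (or other units) into chunks.

    Boundary i sits between units[i] and units[i + 1], so both boundary
    sequences must have len(units) - 1 entries.
    """
    if not units:
        return []
    chunks = [[units[0]]]
    for i, unit in enumerate(units[1:]):
        if should_split(boundary_delta_s[i], boundary_lambda[i]):
            chunks.append([unit])      # start a new chunk at this boundary
        else:
            chunks[-1].append(unit)    # keep the unit with its left neighbour
    return chunks
```

Candidate boundaries come from the methods in the table above; this step only decides which of them become real cuts.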

4.3 Length Normalisation

  1. Merge adjacent short chunks until ≥ 40 tokens.
  2. If a merged chunk > 130 tokens, find internal ΔS peak and split there.
  3. Record final size distribution; σ(length) should be ≤ 20 % of mean.
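
The three steps can be sketched as follows. Token counts use whitespace splitting as a stand-in for a real tokenizer, and `split_fn` is a caller-supplied hook standing in for the internal ΔS-peak finder, which is not implemented here; both are assumptions.

```python
import statistics
from typing import Callable, List, Sequence


def n_tokens(text: str) -> int:
    """Rough token count; replace with your model's tokenizer."""
    return len(text.split())


def merge_short(chunks: Sequence[str], min_tokens: int = 40) -> List[str]:
    """Step 1: merge adjacent chunks until each has at least min_tokens."""
    merged: List[str] = []
    for chunk in chunks:
        if merged and n_tokens(merged[-1]) < min_tokens:
            merged[-1] = merged[-1] + " " + chunk
        else:
            merged.append(chunk)
    # A short trailing chunk has no right neighbour; fold it into the previous one.
    if len(merged) > 1 and n_tokens(merged[-1]) < min_tokens:
        tail = merged.pop()
        merged[-1] = merged[-1] + " " + tail
    return merged


def split_long(
    chunks: Sequence[str],
    split_fn: Callable[[str], List[str]],
    max_tokens: int = 130,
) -> List[str]:
    """Step 2: hand oversized chunks to split_fn (e.g. a ΔS-peak splitter)."""
    out: List[str] = []
    for chunk in chunks:
        out.extend(split_fn(chunk) if n_tokens(chunk) > max_tokens else [chunk])
    return out


def length_spread(chunks: Sequence[str]) -> float:
    """Step 3: σ(length) / mean(length); the checklist target is ≤ 0.20."""
    lengths = [n_tokens(c) for c in chunks]
    return statistics.pstdev(lengths) / statistics.mean(lengths)
```

`split_long` makes a single pass, so a piece returned by `split_fn` can still fall outside the 40 to 130 range; rerun the merge and split steps until `length_spread` settles under 0.20.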

4.4 Metadata Tagging

{
  "id": "doc_17_p3_c2",
  "source": "contracts/nda.pdf",
  "pos": 3,
  "λ": "→",
  "ΔS_prev": 0.32,
  "ΔS_next": 0.28
}

Store λ_observe and neighbouring ΔS for runtime filters.
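
A small sketch of building that record and using it at query time. The field names mirror the JSON above; the `ChunkMeta` class, the use of `None` for a missing neighbour, and the `neighbours_stable` filter (which reuses the 0.45 limit from the golden rule) are illustrative assumptions, not part of WFGY.

```python
import json
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class ChunkMeta:
    id: str
    source: str
    pos: int
    lam: str                       # λ_observe state, e.g. "→"
    delta_s_prev: Optional[float]  # None for the first chunk (assumption)
    delta_s_next: Optional[float]  # None for the last chunk (assumption)

    def to_record(self) -> Dict[str, Any]:
        """Serialise with the same keys as the example record."""
        return {
            "id": self.id,
            "source": self.source,
            "pos": self.pos,
            "λ": self.lam,
            "ΔS_prev": self.delta_s_prev,
            "ΔS_next": self.delta_s_next,
        }


def neighbours_stable(record: Dict[str, Any], limit: float = 0.45) -> bool:
    """Hypothetical runtime filter: both neighbouring ΔS values within the limit."""
    values = [record.get("ΔS_prev"), record.get("ΔS_next")]
    return all(v is None or v <= limit for v in values)


meta = ChunkMeta("doc_17_p3_c2", "contracts/nda.pdf", 3, "→", 0.32, 0.28)
print(json.dumps(meta.to_record(), ensure_ascii=False, indent=2))
```

Passing `ensure_ascii=False` keeps the `λ` and `ΔS` keys readable in the stored JSON.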


5 Runtime Stress-Test

| Test | Pass Condition |
|---|---|
| Overlap scan: query 5 unrelated topics | Same chunk ID appears ≤ 1× |
| ΔS histogram: 500 random chunks | 95 % ≤ 0.45 |
| k-sensitivity: ΔS vs. k plot | Monotonic ↑ curve |

If any test fails, rerun steps 4.2 to 4.3 for the offending documents.
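
The three pass conditions can be expressed as plain checks. This sketch assumes you already have the retrieved chunk IDs per query and the measured ΔS values; it reads "monotonic ↑" as non-decreasing, which is an interpretation. All function names are illustrative.

```python
from collections import Counter
from typing import Dict, List, Sequence


def overlap_offenders(results_per_query: Sequence[Sequence[str]]) -> Dict[str, int]:
    """Overlap scan: chunk IDs retrieved for more than one unrelated query."""
    counts: Counter = Counter()
    for ids in results_per_query:
        counts.update(set(ids))  # count each ID once per query
    return {cid: n for cid, n in counts.items() if n > 1}


def histogram_pass(
    delta_s_values: Sequence[float], limit: float = 0.45, share: float = 0.95
) -> bool:
    """ΔS histogram: at least `share` of the sampled values are ≤ limit."""
    within = sum(1 for d in delta_s_values if d <= limit)
    return within / len(delta_s_values) >= share


def k_sensitivity_pass(delta_s_by_k: Sequence[float]) -> bool:
    """k-sensitivity: ΔS never decreases as k grows (values ordered by k)."""
    return all(b >= a for a, b in zip(delta_s_by_k, delta_s_by_k[1:]))
```

An empty result from `overlap_offenders` means the overlap scan passed; otherwise the returned IDs point at the boilerplate chunks to fix first.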


6 Common Pitfalls & Fix Recipes

| Pitfall | Fix |
|---|---|
| Tables split per cell | Detect delimiter lines; merge rows; store CSV separately; index columns as metadata |
| PDF line-break hyphens | Regex `([a-z])- \n([a-z])` → merge words |
| Mixed languages | Chunk by language span; tag `lang:`; use separate embedding models |
| Giant code blocks | Cut on `function`, `class`, `def` boundaries; keep ≤ 80 lines |
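
For the giant-code-block case, one possible sketch: cut before every top-level `function`, `class`, or `def` line and hard-cut anything that still exceeds 80 lines. Matching only unindented keywords is an assumption made here so that methods stay with their class; `split_code` is a hypothetical helper.

```python
import re
from typing import List

# Top-level definitions only (no leading whitespace); an assumption of this sketch.
BOUNDARY = re.compile(r"^(?:function|class|def)\b")


def split_code(source: str, max_lines: int = 80) -> List[str]:
    """Split source code on definition boundaries, capping chunks at max_lines."""
    chunks: List[str] = []
    current: List[str] = []
    for line in source.splitlines():
        at_boundary = BOUNDARY.match(line) is not None
        # Flush before a new definition, or when the current chunk is full.
        if current and (at_boundary or len(current) >= max_lines):
            chunks.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current))
    return chunks
```

Decorators and comments that sit directly above a definition stay with the previous chunk in this version; extend the boundary pattern if that matters for your codebase.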

7 FAQ

Q: Is a token window (e.g. 512) safe? A: Only if it aligns with semantic boundaries; fixed windows ignore context.

Q: Do I need sentence splitting and headings? A: Yes. Dual criteria minimise ΔS spikes and keep retrieval precise.

Q: How many chunks per doc? A: Irrelevant if ΔS and λ are stable — WFGY focuses on quality, not count.


🔗 Quick-Start Downloads (60 sec)

| Tool | Link | 3-Step Setup |
|---|---|---|
| WFGY 1.0 PDF | Engine Paper | 1 Download · 2 Upload to LLM · 3 Ask “Answer using WFGY + <your question>” |
| TXT OS (plain-text OS) | TXTOS.txt | 1 Download · 2 Paste into any LLM chat · 3 Type “hello world”; the OS boots instantly |

🧭 Explore More

| Module | Description | Link |
|---|---|---|
| Problem Map 1.0 | Initial 16-mode diagnostic and symbolic fix framework | View → |
| Problem Map 2.0 | RAG-focused failure tree, modular fixes, and pipelines | View → |
| Semantic Clinic Index | Expanded failure catalog: prompt injection, memory bugs, logic drift | View → |
| Semantic Blueprint | Layer-based symbolic reasoning & semantic modulations | View → |
| Benchmark vs GPT-5 | Stress test GPT-5 with full WFGY reasoning suite | View → |
