✂️ Chunking Checklist — Cutting Documents Without Cutting Meaning
A definitive guide to segment size, boundaries, and WFGY stress-tests for error-free retrieval
1 Why Chunking Matters
Embeddings are only as good as the text you feed them.
A single bad split (mid-sentence, table row, reference list) injects semantic orphan vectors:
- Retrieval returns “high similarity” garbage.
- ΔS(question, context) spikes > 0.60.
- LLM hallucinates to fill the missing logic.
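ΔS is not formally defined in this checklist; one common reading is semantic distance, i.e. one minus the cosine similarity between embeddings. A minimal sketch under that assumption, with a bag-of-words counter standing in for a real sentence-embedding model:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Stand-in "embedding": bag-of-words counts.
    # A real pipeline would call a sentence-embedding model here.
    return Counter(text.lower().split())

def delta_s(a: str, b: str) -> float:
    # ΔS as 1 - cosine similarity: 0 = identical, 1 = unrelated.
    va, vb = embed(a), embed(b)
    dot = sum(va[t] * vb[t] for t in va)
    na = math.sqrt(sum(v * v for v in va.values()))
    nb = math.sqrt(sum(v * v for v in vb.values()))
    return 1.0 if na == 0 or nb == 0 else 1.0 - dot / (na * nb)

print(delta_s("chunk me", "chunk me"))      # identical text → 0.0
print(delta_s("chunk me", "stock prices"))  # disjoint text → 1.0
```

With a real embedding model, ΔS(question, context) crossing 0.60 is the spike to watch for.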
2 Quick Symptoms of Bad Chunking
| Signal | How to Detect | Typical Root Cause |
|---|---|---|
| Citations hit page –1 | QA cites header/footer junk | Page footers not stripped |
| Same chunk appears in top-k for unrelated queries | ID duplication count > 3 | Generic boilerplate chunk |
| ΔS jumps when k > 5 | Plot ΔS vs. k; curve is erratic | Uneven chunk lengths |
| Answer references a half-sentence | Chunk split after "and" | Fixed char/token window |
3 WFGY Chunk Size Guidelines
| Doc Type | Tokens / Chunk | Rationale |
|---|---|---|
| Research paper | 90-120 | Preserve paragraph + citation |
| Software docs | 60-100 | Short API signatures |
| Legal contracts | 80-130 | Clause integrity |
| Chat transcripts | 40-70 | Natural speaker turns |
| Tables / CSV | Row or group ≤ 30 | Keep relational keys together |
Golden Rule: ΔS(adjacent_chunks) ≤ 0.45
If not, split or merge until stress drops.
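The golden rule can be checked mechanically. A sketch, assuming a `delta_s` scorer is available; the word-overlap scorer below is a toy stand-in, not a real embedding distance:

```python
def flag_stressed_pairs(chunks, delta_s, limit=0.45):
    # Return indices i where ΔS(chunks[i], chunks[i+1]) exceeds the
    # golden-rule threshold, i.e. where a split or merge is needed.
    return [i for i in range(len(chunks) - 1)
            if delta_s(chunks[i], chunks[i + 1]) > limit]

def toy_delta_s(a, b):
    # Toy Jaccard-distance stand-in for a real embedding-based ΔS.
    wa, wb = set(a.split()), set(b.split())
    return 1.0 - len(wa & wb) / max(len(wa | wb), 1)

chunks = ["alpha beta gamma delta", "beta gamma delta", "totally unrelated text"]
print(flag_stressed_pairs(chunks, toy_delta_s))  # → [1]
```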
4 Step-by-Step Chunking Checklist
4.1 Pre-Processing
- Strip headers / footers (regex: `^Page \d+ of \d+`)
- Normalize whitespace; remove soft hyphens (`U+00AD`)
- Convert bullets to "• " to avoid mid-list splits
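A sketch of the three pre-processing steps, using only the standard library; the bullet pattern is an assumption about how your source formats list items:

```python
import re

def preprocess(page: str) -> str:
    # 1. Strip "Page N of M" headers/footers.
    page = re.sub(r"^Page \d+ of \d+[ \t]*$", "", page, flags=re.MULTILINE)
    # 2. Remove soft hyphens (U+00AD) and collapse runs of spaces/tabs.
    page = page.replace("\u00ad", "")
    page = re.sub(r"[ \t]+", " ", page)
    # 3. Normalize "-"/"*" bullets to "• " so lists survive splitting.
    page = re.sub(r"^[ \t]*[-*][ \t]+", "• ", page, flags=re.MULTILINE)
    return page.strip()

sample = "Page 3 of 10\n- first item\n- soft\u00adhyphen"
print(preprocess(sample))  # "• first item" / "• softhyphen" on two lines
```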
4.2 Boundary Detection
| Method | Tool | When to Use |
|---|---|---|
| Sentence tokenizer | spaCy / Stanza | Most prose |
| Heading regex | `^(#+\s\|[A-Z][A-Za-z ]+:)$` | Markdown / legal docs |
| BBMC ΔS spike | WFGY hook | PDFs merged from scans |
Split on a boundary only if:

`ΔS(chunk_left, chunk_right) ≥ 0.50 ∧ λ_observe ∈ {→, ←}`
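A sketch of ΔS-gated splitting. The λ_observe check is omitted here since its computation is WFGY-internal, and `toy_delta_s` is a toy overlap distance standing in for a real scorer:

```python
def split_on_boundaries(sentences, delta_s, threshold=0.50):
    # Accumulate sentences; cut only when the next sentence is in high
    # tension (ΔS ≥ threshold) with the chunk built so far.
    chunks, current = [], []
    for sent in sentences:
        if current and delta_s(" ".join(current), sent) >= threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sent)
    if current:
        chunks.append(" ".join(current))
    return chunks

def toy_delta_s(a, b):
    # Toy overlap-coefficient distance; a stand-in, not a real ΔS.
    wa, wb = set(a.split()), set(b.split())
    return 1.0 - len(wa & wb) / max(min(len(wa), len(wb)), 1)

sents = ["cats purr often", "cats sleep often", "stocks fell"]
print(split_on_boundaries(sents, toy_delta_s))
# → ['cats purr often cats sleep often', 'stocks fell']
```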
4.3 Length Normalization
- Merge adjacent short chunks until ≥ 40 tokens.
- If a merged chunk > 130 tokens, find internal ΔS peak and split there.
- Record final size distribution; σ(length) should be ≤ 20 % of mean.
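The merge half of step 4.3, plus the σ(length) check, can be sketched as follows; the 40/130 defaults mirror the numbers above, and splitting at the internal ΔS peak is omitted:

```python
import statistics

def normalize_lengths(chunks, min_tokens=40, max_tokens=130):
    # Merge a too-short chunk into its predecessor when the result
    # stays under max_tokens. The ΔS-peak split for oversized chunks
    # is omitted in this sketch.
    out = []
    for chunk in chunks:
        if (out and len(out[-1].split()) < min_tokens
                and len(out[-1].split()) + len(chunk.split()) <= max_tokens):
            out[-1] = out[-1] + " " + chunk
        else:
            out.append(chunk)
    return out

def size_distribution_ok(chunks):
    # σ(length) should be ≤ 20 % of the mean chunk length.
    lengths = [len(c.split()) for c in chunks]
    return statistics.pstdev(lengths) <= 0.20 * statistics.mean(lengths)

merged = normalize_lengths(["a b c", "d e f", "g h i j k l"],
                           min_tokens=5, max_tokens=8)
print(merged)                        # → ['a b c d e f', 'g h i j k l']
print(size_distribution_ok(merged))  # → True
```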
4.4 Metadata Tagging
```json
{
  "id": "doc_17_p3_c2",
  "source": "contracts/nda.pdf",
  "pos": 3,
  "λ": "→",
  "ΔS_prev": 0.32,
  "ΔS_next": 0.28
}
```
Store λ_observe and neighbouring ΔS for runtime filters.
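A sketch of producing that record and using it as a runtime filter; the helper names are hypothetical, not part of WFGY's API:

```python
def tag_chunk(doc, page, pos, source, ds_prev, ds_next, lam="→"):
    # Build a metadata record shaped like the example above.
    return {
        "id": f"{doc}_p{page}_c{pos}",
        "source": source,
        "pos": pos,
        "λ": lam,
        "ΔS_prev": ds_prev,
        "ΔS_next": ds_next,
    }

def runtime_filter(records, limit=0.45):
    # Drop chunks whose neighbour tension breaks the golden rule.
    return [r for r in records
            if r["ΔS_prev"] <= limit and r["ΔS_next"] <= limit]

good = tag_chunk("doc_17", 3, 2, "contracts/nda.pdf", 0.32, 0.28)
bad = tag_chunk("doc_17", 3, 3, "contracts/nda.pdf", 0.61, 0.20)
print([r["id"] for r in runtime_filter([good, bad])])  # → ['doc_17_p3_c2']
```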
5 Runtime Stress-Test
| Test | Pass Condition |
|---|---|
| Overlap scan — Query 5 unrelated topics | Same chunk ID appears ≤ 1× |
| ΔS histogram — 500 random chunks | 95 % ≤ 0.45 |
| k-sensitivity — ΔS vs. k plot | Monotonic ↑ curve |
If any test fails, rerun steps 4.2–4.3 for the offending documents.
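The first two stress-tests can be sketched directly; the chunk-ID lists and ΔS samples are assumed to come from your retriever and scorer:

```python
from collections import Counter

def overlap_scan(topk_ids_per_query, max_repeats=1):
    # Offending chunk IDs: those appearing in the top-k results of
    # more than max_repeats of the unrelated probe queries.
    counts = Counter(cid for ids in topk_ids_per_query for cid in set(ids))
    return [cid for cid, n in counts.items() if n > max_repeats]

def histogram_pass(ds_samples, limit=0.45, quantile=0.95):
    # Pass if at least 95 % of sampled ΔS values sit at or below 0.45.
    ok = sum(1 for d in ds_samples if d <= limit)
    return ok / len(ds_samples) >= quantile

print(overlap_scan([["a", "b"], ["a", "c"], ["d"]]))  # → ['a']
print(histogram_pass([0.1] * 19 + [0.9]))             # → True
```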
6 Common Pitfalls & Fix Recipes
| Pitfall | Fix |
|---|---|
| Tables split per cell | Detect delimiter lines; merge rows; store the CSV separately; index columns as metadata |
| PDF line-break hyphens | Regex `([a-z])- \n([a-z])` → merge words |
| Mixed languages | Chunk by language span; tag `lang:`; use separate embedding models |
| Giant code blocks | Cut on `function` / `class` / `def` boundaries; keep ≤ 80 lines |
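The hyphen-repair recipe can be applied as a one-liner; the pattern follows the regex in the table, slightly generalised to tolerate trailing spaces before the line break:

```python
import re

def merge_hyphenated(text: str) -> str:
    # Rejoin words the PDF split as "seg-\nmentation".
    return re.sub(r"([a-z])-[ \t]*\n([a-z])", r"\1\2", text)

print(merge_hyphenated("seg-\nmentation"))  # → segmentation
print(merge_hyphenated("well-known"))       # → well-known (untouched)
```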
7 FAQ
Q: Is a fixed token window (e.g. 512) safe?
A: Only if it aligns with semantic boundaries; fixed windows ignore context.

Q: Do I need both sentence splitting and headings?
A: Yes. Dual criteria minimise ΔS spikes and keep retrieval precise.

Q: How many chunks per doc?
A: Irrelevant if ΔS and λ are stable; WFGY focuses on quality, not count.
🔗 Quick-Start Downloads (60 sec)
| Tool | Link | 3-Step Setup |
|---|---|---|
| WFGY 1.0 PDF | Engine Paper | 1️⃣ Download · 2️⃣ Upload to your LLM · 3️⃣ Ask “Answer using WFGY + <your question>” |
| TXT OS (plain-text OS) | TXTOS.txt | 1️⃣ Download · 2️⃣ Paste into any LLM chat · 3️⃣ Type “hello world” — OS boots instantly |
Explore More
| Layer | Page | What it’s for |
|---|---|---|
| Proof | WFGY Recognition Map | External citations, integrations, and ecosystem proof |
| Engine | WFGY 1.0 | Original PDF-based tension engine |
| Engine | WFGY 2.0 | Production tension kernel and math engine for RAG and agents |
| Engine | WFGY 3.0 | TXT-based Singularity tension engine, 131 S-class set |
| Map | Problem Map 1.0 | Flagship 16 problem RAG failure checklist and fix map |
| Map | Problem Map 2.0 | RAG focused recovery pipeline |
| Map | Problem Map 3.0 | Global Debug Card, image as a debug protocol layer |
| Map | Semantic Clinic | Symptom to family to exact fix |
| Map | Grandma’s Clinic | Plain language stories mapped to Problem Map 1.0 |
| Onboarding | Starter Village | Guided tour for newcomers |
| App | TXT OS | TXT semantic OS, fast boot |
| App | Blah Blah Blah | Abstract and paradox Q and A built on TXT OS |
| App | Blur Blur Blur | Text to image with semantic control |
| App | Blow Blow Blow | Reasoning game engine and memory demo |
If this repository helped, starring it improves discovery so more builders can find the docs and tools.