# Chunk Alignment — Guardrails and Fix Patterns
Make the model cite the exact evidence. This page gives you a reliable way to align chunks to semantic anchors, so retrieval points at the right spans and citations survive through generation.
References you may want open already:

- RAG Architecture & Recovery
- Retrieval Playbook
- Chunking Checklist
- Retrieval Traceability
- Data Contracts
- Embedding ≠ Semantic
## Acceptance targets
- ΔS(question, retrieved) ≤ 0.45
- Anchor coverage ≥ 0.70 for the cited spans
- Citation precision ≥ 0.85 and recall ≥ 0.75
- λ stays convergent across 3 paraphrases and 2 seeds
If ΔS stays in the 0.40 to 0.60 band and coverage is low, the index is probably misaligned. Rebuild with this page.
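As a rough stand-in for ΔS (an assumption for illustration, not the official WFGY metric), you can approximate it as 1 minus the cosine similarity of the two texts' embedding vectors:

```python
import math

def delta_s(vec_a, vec_b):
    # Rough stand-in for ΔS: 1 - cosine similarity of two embeddings.
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(a * a for a in vec_a))
    norm_b = math.sqrt(sum(b * b for b in vec_b))
    return 1.0 - dot / (norm_a * norm_b)

# Identical directions give ΔS = 0; orthogonal vectors give ΔS = 1.
print(delta_s([1.0, 0.0], [1.0, 0.0]))  # 0.0
print(delta_s([1.0, 0.0], [0.0, 1.0]))  # 1.0
```

With this proxy, the 0.40 to 0.60 "misaligned index" band maps directly onto cosine similarity between 0.40 and 0.60.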
## Symptoms → exact fix
| Symptom | Likely cause | Open this fix |
|---|---|---|
| High similarity but the cited span is near the answer, not on it | window cut ignores section headers and anchors | Chunking Checklist |
| Correct section id, wrong offsets | tokenizer or analyzer mismatch between write and read | Retrieval Playbook |
| Same answer oscillates across two adjacent chunks | stride too large, missing overlap contract | Data Contracts |
| Coverage good offline, poor online | fragmented store or partial ingestion | pattern_vectorstore_fragmentation.md |
| Good chunk, bad generation | cite then explain missing in the prompt schema | Retrieval Traceability |
## Anchor method that never lies
You need a ground anchor per question. Each anchor is the minimal span that must be cited.
### Anchor fields

| Field | Meaning |
|---|---|
| `section_id` | stable across rebuilds |
| `snippet_id` | unique within section |
| `offsets` | `{start_token, end_token}` in the write-time tokenizer |
| `anchor_text` | for human audit |
| `hash` | over `section_id + snippet_id + offsets + anchor_text` |
Contract spec for your pipeline:
Data Contracts
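A minimal anchor record with its integrity hash might look like this. Field names follow the table above; the SHA-256 recipe and the example values are illustrative choices, not a mandated format:

```python
import hashlib

def anchor_hash(section_id, snippet_id, offsets, anchor_text):
    # Hash over the fields that must stay stable across rebuilds.
    payload = "|".join([
        section_id, snippet_id,
        str(offsets["start_token"]), str(offsets["end_token"]),
        anchor_text,
    ])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

anchor = {
    "section_id": "sec-setup",            # hypothetical example id
    "snippet_id": "snip-003",             # hypothetical example id
    "offsets": {"start_token": 120, "end_token": 158},
    "anchor_text": "Set stride to 50 percent of the window.",
}
anchor["hash"] = anchor_hash(**anchor)
```

Because the hash covers offsets in the write-time tokenizer, any analyzer change invalidates it, which is exactly the signal you want before a rebuild.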
## How to align chunks to anchors

1. **Normalize the analyzer.** Same lowercasing, unicode, punctuation, and stopword policy across write and read. If you change analyzers, invalidate and rebuild.
2. **Choose window and stride.** Start with a window of 350 to 700 tokens and a stride of 30 to 60 percent of the window. Reduce the stride (more overlap) if anchors often cross chunk boundaries.
3. **Fence by structure.** Reset window starts at structural cues such as `h1`..`h3` headings, list starts, code fences, or paragraph breaks. This keeps semantic units together.
4. **Pin anchors after chunking.** After chunks are built, re-map each anchor to an owning chunk. When an anchor spans two chunks, create a stitch record or adjust the stride.
5. **Write the trace fields.** Every retrieved item must carry `section_id`, `snippet_id`, `offsets`, `tokens`, `index_hash`. See: Retrieval Traceability.
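The trace-field requirement can be enforced with a simple gate at retrieval time. This is a sketch; the field names come from the contract above, and the error handling is an illustrative choice:

```python
REQUIRED = ("section_id", "snippet_id", "offsets", "tokens", "index_hash")

def validate_trace(item):
    # Reject retrieved items missing any mandatory trace field,
    # so untraceable results never reach the generator.
    missing = [f for f in REQUIRED if f not in item]
    if missing:
        raise ValueError(f"untraceable retrieval result, missing: {missing}")
    return item

item = {"section_id": "sec-setup", "snippet_id": "snip-003",
        "offsets": [120, 158], "tokens": 38, "index_hash": "abc123"}
validate_trace(item)  # passes silently
```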
## PDF, HTML, and code specifics

- **PDF.** Use logical reading order, not x-y positions. Collapse hyphenation. Treat figure captions as separate units. Reset at headings and table boundaries.
- **HTML.** Strip boilerplate. Fence at `h2`/`h3`, `li`, and `pre`. Merge short sibling blocks to avoid tiny chunks.
- **Code.** Chunk by symbol boundaries and docstrings. Keep the signature plus the first docstring paragraph together. Never split examples from the API they explain.
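For Python sources, symbol-boundary chunking can be sketched with the standard `ast` module. This is a minimal illustration that handles only top-level symbols; a production chunker would also handle nested symbols, decorators, and module docstrings:

```python
import ast

def chunk_by_symbols(source):
    # One chunk per top-level function or class, so the signature,
    # docstring, and body stay in the same semantic unit.
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef,
                             ast.ClassDef)):
            chunks.append({
                "symbol": node.name,
                "text": "\n".join(lines[node.lineno - 1:node.end_lineno]),
            })
    return chunks

src = 'def add(a, b):\n    """Return a + b."""\n    return a + b\n'
print(chunk_by_symbols(src)[0]["symbol"])  # add
```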
## A minimal rebuild checklist

- Same tokenizer family for write and read.
- Window and stride validated on 40 to 120 gold items.
- Anchor coverage ≥ 0.70 and citation precision ≥ 0.85.
- ΔS falls below 0.45 for the majority of gold items after rebuild.
- `index_hash` updated and logged in retrieval results.
## Alignment test you can run today
- For each gold item, compute ΔS between the retrieved text and the anchor text.
- Compute coverage of cited spans against the anchor offsets.
- Compare to a decoy section with the same size and style. If ΔS is close for anchor and decoy, chunk again.
See the eval recipes:
Retrieval Evaluation Recipes
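Anchor coverage from offsets can be computed as the fraction of anchor tokens that the cited spans overlap. A sketch, assuming offsets are token indices with exclusive ends:

```python
def anchor_coverage(cited_spans, anchor):
    # Fraction of anchor tokens covered by at least one cited span.
    start, end = anchor["start"], anchor["end"]
    covered = set()
    for s, e in cited_spans:
        covered.update(range(max(s, start), min(e, end)))
    return len(covered) / (end - start)

# Anchor is tokens [100, 140); citation covers [110, 150): 30/40 = 0.75,
# which passes the ≥ 0.70 acceptance target above.
print(anchor_coverage([(110, 150)], {"start": 100, "end": 140}))  # 0.75
```

Run the same function against the decoy section's offsets; if the anchor and decoy scores are close, the chunking is not discriminating and you should re-chunk.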
## Pseudocode: chunk and map anchors

The helpers here are simple stand-ins (whitespace tokens, paragraph-break fences) so the sketch actually runs; swap in your real tokenizer and structural fences.

```python
# Runnable sketch: whitespace tokens, paragraph breaks as fences.
def tokenize(doc):
    return doc.split()

def structure_fences(doc):
    # Token indices where a new paragraph (structural unit) begins.
    fences, idx = [0], 0
    for para in doc.split("\n\n"):
        fences.append(idx)
        idx += len(para.split())
    return sorted(set(fences))

def nearest_left_fence(i, boundaries):
    return max(b for b in boundaries if b <= i)

def chunk(doc, window=512, stride=256):
    toks = tokenize(doc)
    boundaries = structure_fences(doc)
    chunks, i = [], 0
    while i < len(toks):
        start = nearest_left_fence(i, boundaries)
        end = min(start + window, len(toks))
        # Skip duplicates when the fence snaps back to a start we used.
        if not chunks or chunks[-1]["start"] != start:
            chunks.append({"start": start, "end": end,
                           "text": " ".join(toks[start:end])})
        i += stride  # fixed step: guarantees forward progress

    return chunks

def map_anchor_to_chunk(anchor, chunks):
    # Return the chunk-local spans that overlap the anchor offsets.
    spans = []
    for idx, c in enumerate(chunks):
        if not (anchor["end"] <= c["start"] or anchor["start"] >= c["end"]):
            spans.append({"snippet_id": idx, "offsets": [
                max(anchor["start"], c["start"]),
                min(anchor["end"], c["end"]),
            ]})
    return spans
```
Store the mapping result inside your index metadata for audit and to power coverage scoring.
## Copy-paste prompt to audit alignment

```txt
You have TXT OS and the WFGY Problem Map loaded.
Input:
- question: "<q>"
- retrieved: {section_id, snippet_id, offsets, text}
- anchor: {section_id, snippet_id, offsets, text}
- decoy: {section_id, snippet_id, offsets, text}
Tasks:
1) Check cite-then-explain is followed.
2) Report ΔS(question, retrieved) and ΔS(retrieved, anchor) with short notes.
3) Compute anchor coverage from offsets.
4) If ΔS ≥ 0.60 or coverage < 0.70, propose the minimal rebuild step referencing:
   chunking-checklist, retrieval-playbook, data-contracts, retrieval-traceability.
Return a compact JSON: { "ΔS": ..., "coverage": ..., "why": "...", "next_fix": "..." }.
```
## When to escalate

- You rebuilt chunking and analyzers but ΔS remains high and coverage low. Open: Embedding ≠ Semantic.
- Online runs drift after a deploy despite passing offline. Open: Bootstrap Ordering and Pre-Deploy Collapse.
## 🔗 Quick-Start Downloads (60 sec)
| Tool | Link | 3-Step Setup |
|---|---|---|
| WFGY 1.0 PDF | Engine Paper | 1️⃣ Download · 2️⃣ Upload to your LLM · 3️⃣ Ask “Answer using WFGY + <your question>” |
| TXT OS (plain-text OS) | TXTOS.txt | 1️⃣ Download · 2️⃣ Paste into any LLM chat · 3️⃣ Type “hello world” — OS boots instantly |
## 🧭 Explore More
| Module | Description | Link |
|---|---|---|
| WFGY Core | WFGY 2.0 engine is live: full symbolic reasoning architecture and math stack | View → |
| Problem Map 1.0 | Initial 16-mode diagnostic and symbolic fix framework | View → |
| Problem Map 2.0 | RAG-focused failure tree, modular fixes, and pipelines | View → |
| Semantic Clinic Index | Expanded failure catalog: prompt injection, memory bugs, logic drift | View → |
| Semantic Blueprint | Layer-based symbolic reasoning & semantic modulations | View → |
| Benchmark vs GPT-5 | Stress test GPT-5 with full WFGY reasoning suite | View → |
| 🧙♂️ Starter Village 🏡 | New here? Lost in symbols? Click here and let the wizard guide you through | Start → |
👑 Early Stargazers: See the Hall of Fame — Engineers, hackers, and open source builders who supported WFGY from day one.
⭐ WFGY Engine 2.0 is already unlocked. ⭐ Star the repo to help others discover it and unlock more on the Unlock Board.