# Chunk Alignment — Guardrails and Fix Patterns

## 🧭 Quick Return to Map
You are in a sub-page of Retrieval.
To reorient, go back here:
- Retrieval — information access and knowledge lookup
- WFGY Global Fix Map — main Emergency Room, 300+ structured fixes
- WFGY Problem Map 1.0 — 16 reproducible failure modes
Think of this page as a desk within a ward.
If you need the full triage and all prescriptions, return to the Emergency Room lobby.
Make the model cite the exact evidence. This page gives you a reliable way to align chunks to semantic anchors, so retrieval points at the right spans and citations survive through generation.
References you may want open already:
RAG Architecture & Recovery ·
Retrieval Playbook ·
Chunking Checklist ·
Retrieval Traceability ·
Data Contracts ·
Embedding ≠ Semantic
## Acceptance targets
- ΔS(question, retrieved) ≤ 0.45
- Anchor coverage ≥ 0.70 for the cited spans
- Citation precision ≥ 0.85 and recall ≥ 0.75
- λ stays convergent across 3 paraphrases and 2 seeds
If ΔS stays in the 0.40 to 0.60 band and coverage is low, the index is probably misaligned. Rebuild with this page.
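ΔS here is the semantic-stress score used across the Problem Map. As a rough stand-in, it can be approximated as one minus the cosine similarity between embeddings of the question and the retrieved text. A minimal sketch, assuming a toy `embed` in place of your real embedding model:

```python
import math

def embed(text):
    # Toy bag-of-letters embedding, for illustration only.
    # Swap in your real embedding model here (assumption).
    vec = [0.0] * 26
    for ch in text.lower():
        if ch.isalpha():
            vec[ord(ch) - 97] += 1.0
    return vec

def delta_s(question, retrieved):
    # ΔS ≈ 1 - cosine similarity; 0 means identical, 1 means orthogonal.
    q, r = embed(question), embed(retrieved)
    dot = sum(a * b for a, b in zip(q, r))
    norm = math.sqrt(sum(a * a for a in q)) * math.sqrt(sum(b * b for b in r))
    return 1.0 - (dot / norm if norm else 0.0)
```

With a real embedding model, the same thresholds apply: flag anything above 0.60, and treat the 0.40 to 0.60 band as a chunking smell rather than a retrieval pass.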
## Symptoms → exact fix
| Symptom | Likely cause | Open this fix |
|---|---|---|
| High similarity but the cited span is near the answer, not on it | window cut ignores section headers and anchors | Chunking Checklist |
| Correct section id, wrong offsets | tokenizer or analyzer mismatch between write and read | Retrieval Playbook |
| Same answer oscillates across two adjacent chunks | stride too large, missing overlap contract | Data Contracts |
| Coverage good offline, poor online | fragmented store or partial ingestion | pattern_vectorstore_fragmentation.md |
| Good chunk, bad generation | cite then explain missing in the prompt schema | Retrieval Traceability |
## Anchor method that never lies
You need a ground anchor per question. Each anchor is the minimal span that must be cited.
Anchor fields:

| Field | Meaning |
|---|---|
| `section_id` | stable across rebuilds |
| `snippet_id` | unique within a section |
| `offsets` | `{start_token, end_token}` in the write-time tokenizer |
| `anchor_text` | kept for human audit |
| `hash` | over `section_id + snippet_id + offsets + anchor_text` |
Contract spec for your pipeline:
Data Contracts
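The `hash` field above can be sketched with stdlib tools. A minimal sketch, assuming SHA-256 over a canonical JSON serialization of the other four fields (the exact hash scheme is your contract's choice):

```python
import hashlib
import json

def make_anchor(section_id, snippet_id, start_token, end_token, anchor_text):
    # Build the anchor record with the fields from the contract above.
    record = {
        "section_id": section_id,
        "snippet_id": snippet_id,
        "offsets": {"start_token": start_token, "end_token": end_token},
        "anchor_text": anchor_text,
    }
    # Hash a canonical serialization so the same anchor always
    # yields the same hash across rebuilds (assumed scheme).
    payload = json.dumps(
        [section_id, snippet_id, start_token, end_token, anchor_text],
        ensure_ascii=False,
    )
    record["hash"] = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    return record
```

Any change to the offsets or the anchor text changes the hash, which is what lets you detect silent drift between rebuilds.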
## How to align chunks to anchors
1. **Normalize the analyzer.** Same lowercasing, unicode, punctuation, and stopword policy across write and read. If you change analyzers, invalidate and rebuild.
2. **Choose window and stride.** Start with a window of 350 to 700 tokens and a stride of 30 to 60 percent of the window. Increase the stride if anchors often cross boundaries.
3. **Fence by structure.** Reset window starts at structural cues like `h1`..`h3`, list starts, code fences, or paragraph breaks. This keeps semantic units together.
4. **Pin anchors after chunking.** After chunks are built, re-map each anchor to an owning chunk. When an anchor spans two chunks, create a stitch record or adjust the stride.
5. **Write the trace fields.** Every retrieved item must carry `section_id`, `snippet_id`, `offsets`, `tokens`, and `index_hash`. See: Retrieval Traceability.
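The trace-field requirement in the last step is easy to enforce at read time. A minimal validator sketch (`missing_trace_fields` is a hypothetical helper, not part of any existing API):

```python
# The five fields every retrieved item must carry, per the contract above.
REQUIRED_TRACE_FIELDS = ("section_id", "snippet_id", "offsets", "tokens", "index_hash")

def missing_trace_fields(item):
    # Return the names of any required trace fields absent from a retrieved item.
    return [f for f in REQUIRED_TRACE_FIELDS if f not in item]
```

Reject or quarantine any retrieved item for which this returns a non-empty list; an item without trace fields cannot be audited and should never reach generation.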
## PDF, HTML, and code specifics
- **PDF.** Use logical reading order, not x-y positions. Collapse hyphenation. Treat figure captions as separate units. Reset at headings and table boundaries.
- **HTML.** Strip boilerplate. Fence at `h2`/`h3`, `li`, and `pre`. Merge short sibling blocks to avoid tiny chunks.
- **Code.** Chunk by symbol boundaries and docstrings. Keep the signature plus the first paragraph together. Never split examples from the API they explain.
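For Python sources, symbol-boundary chunking can be sketched with the stdlib `ast` module, which exposes start and end line numbers per top-level definition. A minimal sketch (`symbol_chunks` is a hypothetical helper; it ignores decorators and nested symbols for brevity):

```python
import ast

def symbol_chunks(source):
    # One chunk per top-level function or class, so the signature,
    # docstring, and body always travel together.
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # lineno is 1-based; end_lineno is inclusive (Python 3.8+).
            chunks.append("\n".join(lines[node.lineno - 1 : node.end_lineno]))
    return chunks
```

The same idea applies to other languages via tree-sitter or a language server: the chunk boundary is the symbol boundary, never a fixed token count.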
## A minimal rebuild checklist
- Same tokenizer family for write and read.
- Window and stride validated on 40 to 120 gold items.
- Anchor coverage ≥ 0.70 and citation precision ≥ 0.85.
- ΔS falls below 0.45 for the majority after rebuild.
- `index_hash` updated and logged in retrieval results.
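The `index_hash` itself is not specified here; one plausible sketch derives it from exactly the settings whose change should invalidate the index, so any analyzer or window change forces a rebuild (the field names are assumptions, not a fixed schema):

```python
import hashlib
import json

def index_hash(analyzer, tokenizer, window, stride):
    # Hash the invalidation-relevant settings; sort_keys makes the
    # serialization canonical so equal configs hash equally.
    config = json.dumps(
        {"analyzer": analyzer, "tokenizer": tokenizer,
         "window": window, "stride": stride},
        sort_keys=True,
    )
    return hashlib.sha256(config.encode("utf-8")).hexdigest()[:16]
```

Log this value alongside every retrieved item; a mismatch between the hash at write time and at read time is the fastest tell for a stale or fragmented store.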
## Alignment test you can run today
- For each gold item, compute ΔS between the retrieved text and the anchor text.
- Compute coverage of cited spans against the anchor offsets.
- Compare to a decoy section with the same size and style. If ΔS is close for anchor and decoy, chunk again.
See the eval recipes:
Retrieval Evaluation Recipes
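Step 2 of the test, coverage of cited spans against the anchor offsets, can be sketched as the fraction of anchor tokens that fall inside at least one cited span (`anchor_coverage` is a hypothetical helper; offsets are token indices with exclusive ends):

```python
def anchor_coverage(anchor, cited_spans):
    # anchor: (start_token, end_token); cited_spans: list of such pairs.
    a_start, a_end = anchor
    total = a_end - a_start
    if total <= 0:
        return 0.0
    covered = set()
    for s, e in cited_spans:
        # Count only the tokens where a cited span overlaps the anchor.
        covered.update(range(max(s, a_start), min(e, a_end)))
    return len(covered) / total
```

Compare this number against the 0.70 acceptance target; the `set` union means overlapping cited spans are not double counted.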
## Pseudocode: chunk and map anchors
```python
# Pseudocode only: tokenize, structure_fences, nearest_left_fence,
# detokenize, and id_of stand in for your pipeline's own helpers.

def chunk(doc, window=512, stride=256, fences=None, analyzer="lc"):
    toks = tokenize(doc, analyzer)            # write-time tokenizer
    boundaries = structure_fences(doc, toks, fences=fences)
    chunks = []
    i = 0
    while i < len(toks):
        start = nearest_left_fence(i, boundaries)  # snap to a structural fence
        end = min(start + window, len(toks))
        text = detokenize(toks[start:end])
        chunks.append({"start": start, "end": end, "text": text})
        i = max(start + stride, i + 1)        # guard: always advance past i
    return chunks

def map_anchor_to_chunk(anchor, chunks):
    # Return every chunk whose token span overlaps the anchor's span.
    spans = []
    for c in chunks:
        if not (anchor["end"] <= c["start"] or anchor["start"] >= c["end"]):
            spans.append({
                "snippet_id": id_of(c),
                "offsets": [max(anchor["start"], c["start"]),
                            min(anchor["end"], c["end"])],
            })
    return spans
```
Store the mapping result inside your index metadata for audit and to power coverage scoring.
## Copy-paste prompt to audit alignment
```
You have TXT OS and the WFGY Problem Map loaded.

Input:
- question: "<q>"
- retrieved: {section_id, snippet_id, offsets, text}
- anchor: {section_id, snippet_id, offsets, text}
- decoy: {section_id, snippet_id, offsets, text}

Tasks:
1) Check cite-then-explain is followed.
2) Report ΔS(question, retrieved) and ΔS(retrieved, anchor) with short notes.
3) Compute anchor coverage from offsets.
4) If ΔS ≥ 0.60 or coverage < 0.70, propose the minimal rebuild step referencing:
   chunking-checklist, retrieval-playbook, data-contracts, retrieval-traceability.

Return a compact JSON: { "ΔS": ..., "coverage": ..., "why": "...", "next_fix": "..." }.
```
## When to escalate

- You rebuild chunking and analyzers but ΔS remains high and coverage stays low. Open: Embedding ≠ Semantic.
- Online runs drift after a deploy despite passing offline. Open: Bootstrap Ordering and Pre-Deploy Collapse.
## 🔗 Quick-Start Downloads (60 sec)
| Tool | Link | 3-Step Setup |
|---|---|---|
| WFGY 1.0 PDF | Engine Paper | 1️⃣ Download · 2️⃣ Upload to your LLM · 3️⃣ Ask “Answer using WFGY + <your question>” |
| TXT OS (plain-text OS) | TXTOS.txt | 1️⃣ Download · 2️⃣ Paste into any LLM chat · 3️⃣ Type “hello world” — OS boots instantly |
## Explore More
| Layer | Page | What it’s for |
|---|---|---|
| ⭐ Proof | WFGY Recognition Map | External citations, integrations, and ecosystem proof |
| ⚙️ Engine | WFGY 1.0 | Original PDF tension engine and early logic sketch (legacy reference) |
| ⚙️ Engine | WFGY 2.0 | Production tension kernel for RAG and agent systems |
| ⚙️ Engine | WFGY 3.0 | TXT based Singularity tension engine (131 S class set) |
| 🗺️ Map | Problem Map 1.0 | Flagship 16 problem RAG failure taxonomy and fix map |
| 🗺️ Map | Problem Map 2.0 | Global Debug Card for RAG and agent pipeline diagnosis |
| 🗺️ Map | Problem Map 3.0 | Global AI troubleshooting atlas and failure pattern map |
| 🧰 App | TXT OS | .txt semantic OS with fast bootstrap |
| 🧰 App | Blah Blah Blah | Abstract and paradox Q&A built on TXT OS |
| 🧰 App | Blur Blur Blur | Text to image generation with semantic control |
| 🏡 Onboarding | Starter Village | Guided entry point for new users |
If this repository helped, starring it improves discovery so more builders can find the docs and tools.