# Chunk Alignment — Guardrails and Fix Patterns
Make the model cite the exact evidence. This page gives you a reliable way to align chunks to semantic anchors, so retrieval points at the right spans and citations survive through generation.
References you may want open already:

- RAG Architecture & Recovery
- Retrieval Playbook
- Chunking Checklist
- Retrieval Traceability
- Data Contracts
- Embedding ≠ Semantic
## Acceptance targets
- ΔS(question, retrieved) ≤ 0.45
- Anchor coverage ≥ 0.70 for the cited spans
- Citation precision ≥ 0.85 and recall ≥ 0.75
- λ stays convergent across 3 paraphrases and 2 seeds
If ΔS stays in the 0.40 to 0.60 band and coverage is low, the index is probably misaligned. Rebuild with this page.
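As a rough stand-in for ΔS (an assumption for illustration, not the official WFGY metric), you can approximate it as 1 minus the cosine similarity of the two texts' embedding vectors:

```python
import math

def delta_s(vec_a, vec_b):
    # Rough stand-in for ΔS: 1 - cosine similarity of two embeddings.
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(a * a for a in vec_a))
    norm_b = math.sqrt(sum(b * b for b in vec_b))
    return 1.0 - dot / (norm_a * norm_b)

# Identical directions give ΔS = 0; orthogonal vectors give ΔS = 1.
print(delta_s([1.0, 0.0], [1.0, 0.0]))  # 0.0
print(delta_s([1.0, 0.0], [0.0, 1.0]))  # 1.0
```

With this proxy, the 0.40 to 0.60 "misaligned index" band maps directly onto cosine similarity between 0.40 and 0.60.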
## Symptoms → exact fix
| Symptom | Likely cause | Open this fix |
|---|---|---|
| High similarity but the cited span is near the answer, not on it | window cut ignores section headers and anchors | Chunking Checklist |
| Correct section id, wrong offsets | tokenizer or analyzer mismatch between write and read | Retrieval Playbook |
| Same answer oscillates across two adjacent chunks | stride too large, missing overlap contract | Data Contracts |
| Coverage good offline, poor online | fragmented store or partial ingestion | pattern_vectorstore_fragmentation.md |
| Good chunk, bad generation | cite then explain missing in the prompt schema | Retrieval Traceability |
## Anchor method that never lies
You need a ground anchor per question. Each anchor is the minimal span that must be cited.
### Anchor fields

| Field | Meaning |
|---|---|
| `section_id` | stable across rebuilds |
| `snippet_id` | unique within section |
| `offsets` | `{start_token, end_token}` in the write-time tokenizer |
| `anchor_text` | for human audit |
| `hash` | over `section_id + snippet_id + offsets + anchor_text` |
Contract spec for your pipeline:
Data Contracts
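A minimal anchor record with its integrity hash might look like this. Field names follow the table above; the SHA-256 recipe and the example values are illustrative choices, not a mandated format:

```python
import hashlib

def anchor_hash(section_id, snippet_id, offsets, anchor_text):
    # Hash over the fields that must stay stable across rebuilds.
    payload = "|".join([
        section_id, snippet_id,
        str(offsets["start_token"]), str(offsets["end_token"]),
        anchor_text,
    ])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

anchor = {
    "section_id": "sec-setup",            # hypothetical example id
    "snippet_id": "snip-003",             # hypothetical example id
    "offsets": {"start_token": 120, "end_token": 158},
    "anchor_text": "Set stride to 50 percent of the window.",
}
anchor["hash"] = anchor_hash(**anchor)
```

Because the hash covers offsets in the write-time tokenizer, any analyzer change invalidates it, which is exactly the signal you want before a rebuild.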
## How to align chunks to anchors

1. **Normalize the analyzer.** Same lowercasing, unicode, punctuation, and stopword policy across write and read. If you change analyzers, invalidate and rebuild.
2. **Choose window and stride.** Start with a window of 350 to 700 tokens and a stride of 30 to 60 percent of the window. Reduce the stride (more overlap) if anchors often cross chunk boundaries.
3. **Fence by structure.** Reset window starts at structural cues such as `h1`..`h3` headings, list starts, code fences, or paragraph breaks. This keeps semantic units together.
4. **Pin anchors after chunking.** After chunks are built, re-map each anchor to an owning chunk. When an anchor spans two chunks, create a stitch record or adjust the stride.
5. **Write the trace fields.** Every retrieved item must carry `section_id`, `snippet_id`, `offsets`, `tokens`, `index_hash`. See: Retrieval Traceability.
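The trace-field requirement can be enforced with a simple gate at retrieval time. This is a sketch; the field names come from the contract above, and the error handling is an illustrative choice:

```python
REQUIRED = ("section_id", "snippet_id", "offsets", "tokens", "index_hash")

def validate_trace(item):
    # Reject retrieved items missing any mandatory trace field,
    # so untraceable results never reach the generator.
    missing = [f for f in REQUIRED if f not in item]
    if missing:
        raise ValueError(f"untraceable retrieval result, missing: {missing}")
    return item

item = {"section_id": "sec-setup", "snippet_id": "snip-003",
        "offsets": [120, 158], "tokens": 38, "index_hash": "abc123"}
validate_trace(item)  # passes silently
```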
## PDF, HTML, and code specifics

- **PDF.** Use logical reading order, not x-y positions. Collapse hyphenation. Treat figure captions as separate units. Reset at headings and table boundaries.
- **HTML.** Strip boilerplate. Fence at `h2`/`h3`, `li`, and `pre`. Merge short sibling blocks to avoid tiny chunks.
- **Code.** Chunk by symbol boundaries and docstrings. Keep the signature plus the first docstring paragraph together. Never split examples from the API they explain.
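For Python sources, symbol-boundary chunking can be sketched with the standard `ast` module. This is a minimal illustration that handles only top-level symbols; a production chunker would also handle nested symbols, decorators, and module docstrings:

```python
import ast

def chunk_by_symbols(source):
    # One chunk per top-level function or class, so the signature,
    # docstring, and body stay in the same semantic unit.
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef,
                             ast.ClassDef)):
            chunks.append({
                "symbol": node.name,
                "text": "\n".join(lines[node.lineno - 1:node.end_lineno]),
            })
    return chunks

src = 'def add(a, b):\n    """Return a + b."""\n    return a + b\n'
print(chunk_by_symbols(src)[0]["symbol"])  # add
```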
## A minimal rebuild checklist

- Same tokenizer family for write and read.
- Window and stride validated on 40 to 120 gold items.
- Anchor coverage ≥ 0.70 and citation precision ≥ 0.85.
- ΔS falls below 0.45 for the majority of gold items after rebuild.
- `index_hash` updated and logged in retrieval results.
## Alignment test you can run today
- For each gold item, compute ΔS between the retrieved text and the anchor text.
- Compute coverage of cited spans against the anchor offsets.
- Compare to a decoy section with the same size and style. If ΔS is close for anchor and decoy, chunk again.
See the eval recipes:
Retrieval Evaluation Recipes
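Anchor coverage from offsets can be computed as the fraction of anchor tokens that the cited spans overlap. A sketch, assuming offsets are token indices with exclusive ends:

```python
def anchor_coverage(cited_spans, anchor):
    # Fraction of anchor tokens covered by at least one cited span.
    start, end = anchor["start"], anchor["end"]
    covered = set()
    for s, e in cited_spans:
        covered.update(range(max(s, start), min(e, end)))
    return len(covered) / (end - start)

# Anchor is tokens [100, 140); citation covers [110, 150): 30/40 = 0.75,
# which passes the ≥ 0.70 acceptance target above.
print(anchor_coverage([(110, 150)], {"start": 100, "end": 140}))  # 0.75
```

Run the same function against the decoy section's offsets; if the anchor and decoy scores are close, the chunking is not discriminating and you should re-chunk.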
## Pseudocode: chunk and map anchors

The helpers here are simple stand-ins (whitespace tokens, paragraph-break fences) so the sketch actually runs; swap in your real tokenizer and structural fences.

```python
# Runnable sketch: whitespace tokens, paragraph breaks as fences.
def tokenize(doc):
    return doc.split()

def structure_fences(doc):
    # Token indices where a new paragraph (structural unit) begins.
    fences, idx = [0], 0
    for para in doc.split("\n\n"):
        fences.append(idx)
        idx += len(para.split())
    return sorted(set(fences))

def nearest_left_fence(i, boundaries):
    return max(b for b in boundaries if b <= i)

def chunk(doc, window=512, stride=256):
    toks = tokenize(doc)
    boundaries = structure_fences(doc)
    chunks, i = [], 0
    while i < len(toks):
        start = nearest_left_fence(i, boundaries)
        end = min(start + window, len(toks))
        # Skip duplicates when the fence snaps back to a start we used.
        if not chunks or chunks[-1]["start"] != start:
            chunks.append({"start": start, "end": end,
                           "text": " ".join(toks[start:end])})
        i += stride  # fixed step: guarantees forward progress

    return chunks

def map_anchor_to_chunk(anchor, chunks):
    # Return the chunk-local spans that overlap the anchor offsets.
    spans = []
    for idx, c in enumerate(chunks):
        if not (anchor["end"] <= c["start"] or anchor["start"] >= c["end"]):
            spans.append({"snippet_id": idx, "offsets": [
                max(anchor["start"], c["start"]),
                min(anchor["end"], c["end"]),
            ]})
    return spans
```
Store the mapping result inside your index metadata for audit and to power coverage scoring.
## Copy-paste prompt to audit alignment

```txt
You have TXT OS and the WFGY Problem Map loaded.
Input:
- question: "<q>"
- retrieved: {section_id, snippet_id, offsets, text}
- anchor: {section_id, snippet_id, offsets, text}
- decoy: {section_id, snippet_id, offsets, text}
Tasks:
1) Check cite-then-explain is followed.
2) Report ΔS(question, retrieved) and ΔS(retrieved, anchor) with short notes.
3) Compute anchor coverage from offsets.
4) If ΔS ≥ 0.60 or coverage < 0.70, propose the minimal rebuild step referencing:
   chunking-checklist, retrieval-playbook, data-contracts, retrieval-traceability.
Return a compact JSON: { "ΔS": ..., "coverage": ..., "why": "...", "next_fix": "..." }.
```
## When to escalate

- You rebuilt chunking and analyzers but ΔS remains high and coverage low. Open: Embedding ≠ Semantic.
- Online runs drift after a deploy despite passing offline. Open: Bootstrap Ordering and Pre-Deploy Collapse.
## 🔗 Quick-Start Downloads (60 sec)
| Tool | Link | 3-Step Setup |
|---|---|---|
| WFGY 1.0 PDF | Engine Paper | 1️⃣ Download · 2️⃣ Upload to your LLM · 3️⃣ Ask “Answer using WFGY + <your question>” |
| TXT OS (plain-text OS) | TXTOS.txt | 1️⃣ Download · 2️⃣ Paste into any LLM chat · 3️⃣ Type “hello world” — OS boots instantly |
## 🧭 Explore More
| Module | Description | Link |
|---|---|---|
| WFGY Core | WFGY 2.0 engine is live: full symbolic reasoning architecture and math stack | View → |
| Problem Map 1.0 | Initial 16-mode diagnostic and symbolic fix framework | View → |
| Problem Map 2.0 | RAG-focused failure tree, modular fixes, and pipelines | View → |
| Semantic Clinic Index | Expanded failure catalog: prompt injection, memory bugs, logic drift | View → |
| Semantic Blueprint | Layer-based symbolic reasoning & semantic modulations | View → |
| Benchmark vs GPT-5 | Stress test GPT-5 with full WFGY reasoning suite | View → |
| 🧙♂️ Starter Village 🏡 | New here? Lost in symbols? Click here and let the wizard guide you through | Start → |
👑 Early Stargazers: See the Hall of Fame — Engineers, hackers, and open source builders who supported WFGY from day one.
⭐ WFGY Engine 2.0 is already unlocked. ⭐ Star the repo to help others discover it and unlock more on the Unlock Board.