<!-- mirror of https://github.com/onestardao/WFGY.git, synced 2026-04-28 11:40:07 +00:00 -->
# Chunk Alignment — Guardrails and Fix Patterns

<details>
<summary><strong>🧭 Quick Return to Map</strong></summary>
<br>

> You are in a sub-page of **Retrieval**.
> To reorient, go back here:
>
> - [**Retrieval** — information access and knowledge lookup](./README.md)
> - [**WFGY Global Fix Map** — main Emergency Room, 300+ structured fixes](../README.md)
> - [**WFGY Problem Map 1.0** — 16 reproducible failure modes](../../README.md)
>
> Think of this page as a desk within a ward.
> If you need the full triage and all prescriptions, return to the Emergency Room lobby.

</details>

Make the model cite the exact evidence. This page gives you a reliable way to align chunks to semantic anchors, so retrieval points at the right spans and citations survive through generation.

References you may want open already:
[RAG Architecture & Recovery](https://github.com/onestardao/WFGY/blob/main/ProblemMap/rag-architecture-and-recovery.md) ·
[Retrieval Playbook](https://github.com/onestardao/WFGY/blob/main/ProblemMap/retrieval-playbook.md) ·
[Chunking Checklist](https://github.com/onestardao/WFGY/blob/main/ProblemMap/chunking-checklist.md) ·
[Retrieval Traceability](https://github.com/onestardao/WFGY/blob/main/ProblemMap/retrieval-traceability.md) ·
[Data Contracts](https://github.com/onestardao/WFGY/blob/main/ProblemMap/data-contracts.md) ·
[Embedding ≠ Semantic](https://github.com/onestardao/WFGY/blob/main/ProblemMap/embedding-vs-semantic.md)

---
## Acceptance targets

- ΔS(question, retrieved) ≤ 0.45
- Anchor coverage ≥ 0.70 for the cited spans
- Citation precision ≥ 0.85 and recall ≥ 0.75
- λ stays convergent across 3 paraphrases and 2 seeds

If ΔS stays in the 0.40 to 0.60 band and coverage is low, the index is probably misaligned. Rebuild with this page.
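
The targets above can be gated mechanically in your eval harness. A minimal sketch, assuming you already compute ΔS, coverage, precision, and recall per run; the function and metric key names are illustrative, not part of WFGY:

```python
# Minimal sketch: gate a rebuilt index on the acceptance targets above.
# Metric names are illustrative; plug in your own ΔS and coverage scorers.

def passes_acceptance(metrics: dict) -> tuple[bool, list[str]]:
    """Return (ok, failed_check_names) for one evaluation run."""
    checks = {
        "delta_s":   metrics["delta_s"] <= 0.45,           # ΔS(question, retrieved)
        "coverage":  metrics["coverage"] >= 0.70,          # anchor coverage of cited spans
        "precision": metrics["citation_precision"] >= 0.85,
        "recall":    metrics["citation_recall"] >= 0.75,
    }
    failures = [name for name, ok in checks.items() if not ok]
    return (not failures, failures)
```

Wire this into CI so a rebuild that regresses any target fails loudly instead of shipping.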

---
## Symptoms → exact fix

| Symptom | Likely cause | Open this fix |
|---|---|---|
| High similarity but the cited span is near the answer, not on it | window cut ignores section headers and anchors | [Chunking Checklist](https://github.com/onestardao/WFGY/blob/main/ProblemMap/chunking-checklist.md) |
| Correct section id, wrong offsets | tokenizer or analyzer mismatch between write and read | [Retrieval Playbook](https://github.com/onestardao/WFGY/blob/main/ProblemMap/retrieval-playbook.md) |
| Same answer oscillates across two adjacent chunks | stride too large, missing overlap contract | [Data Contracts](https://github.com/onestardao/WFGY/blob/main/ProblemMap/data-contracts.md) |
| Coverage good offline, poor online | fragmented store or partial ingestion | [pattern_vectorstore_fragmentation.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/patterns/pattern_vectorstore_fragmentation.md) |
| Good chunk, bad generation | cite-then-explain missing in the prompt schema | [Retrieval Traceability](https://github.com/onestardao/WFGY/blob/main/ProblemMap/retrieval-traceability.md) |

---

## Anchor method that never lies

You need a ground anchor per question. Each anchor is the minimal span that must be cited.

**Anchor fields**

- `section_id` stable across rebuilds
- `snippet_id` unique within section
- `offsets` `{start_token, end_token}` in the write-time tokenizer
- `anchor_text` for human audit
- `hash` over `section_id + snippet_id + offsets + anchor_text`

Contract spec for your pipeline:
[Data Contracts](https://github.com/onestardao/WFGY/blob/main/ProblemMap/data-contracts.md)
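
The fields above can be packed into one record whose hash makes drift or tampering detectable across rebuilds. A minimal sketch, assuming token offsets come from the write-time tokenizer; `make_anchor` is an illustrative name, not a WFGY API:

```python
import hashlib
import json

def make_anchor(section_id: str, snippet_id: str,
                start_token: int, end_token: int, anchor_text: str) -> dict:
    """Build an anchor record; the hash covers every identifying field."""
    # Serialize deterministically so the same anchor always hashes the same.
    payload = json.dumps(
        [section_id, snippet_id, start_token, end_token, anchor_text],
        ensure_ascii=False, separators=(",", ":"),
    )
    return {
        "section_id": section_id,
        "snippet_id": snippet_id,
        "offsets": {"start_token": start_token, "end_token": end_token},
        "anchor_text": anchor_text,
        "hash": hashlib.sha256(payload.encode("utf-8")).hexdigest(),
    }
```

Any change to offsets or text flips the hash, so a stale anchor surfaces at audit time instead of at generation time.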

---

## How to align chunks to anchors

1) **Normalize the analyzer**
Same lowercasing, unicode, punctuation, and stopword policy across write and read. If you change analyzers, invalidate and rebuild.

2) **Choose window and stride**
Start with a window of 350 to 700 tokens and a stride of 30 to 60 percent of the window. Reduce the stride, which increases overlap, if anchors often cross chunk boundaries.

3) **Fence by structure**
Reset window starts at structural cues like `h1..h3`, list starts, code fences, or paragraph breaks. This keeps semantic units together.

4) **Pin anchors after chunking**
After chunks are built, re-map each anchor to an owning chunk. When an anchor spans two chunks, create a stitch record or adjust the stride.

5) **Write the trace fields**
Every retrieved item must carry `section_id`, `snippet_id`, `offsets`, `tokens`, `index_hash`.
See: [Retrieval Traceability](https://github.com/onestardao/WFGY/blob/main/ProblemMap/retrieval-traceability.md)
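
Step 5's contract is cheap to enforce at read time. A minimal sketch using the field list from this page; the validator itself is illustrative:

```python
# Field names come from the trace contract on this page.
REQUIRED_TRACE_FIELDS = ("section_id", "snippet_id", "offsets", "tokens", "index_hash")

def validate_trace(items: list[dict]) -> list[int]:
    """Return indices of retrieved items that violate the trace contract."""
    bad = []
    for i, item in enumerate(items):
        if any(field not in item for field in REQUIRED_TRACE_FIELDS):
            bad.append(i)
    return bad
```

Reject or re-fetch any flagged item before it reaches the prompt, so untraceable evidence never gets cited.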

---

## PDF, HTML, and code specifics

- **PDF**
  Use logical reading order, not x-y positions. Collapse hyphenation. Treat figure captions as separate units. Reset at headings and table boundaries.

- **HTML**
  Strip boilerplate. Fence at `h2/h3`, `li`, and `pre`. Merge short sibling blocks to avoid tiny chunks.

- **Code**
  Chunk by symbol boundaries and docstrings. Keep the signature plus the first paragraph together. Never split examples from the API they explain.
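
For the PDF case, hyphenation collapse can start as a single regex. A minimal sketch; note it will also join genuine end-of-line hyphens such as `state-` / `of-the-art`, so treat it as a starting point rather than a full dehyphenator:

```python
import re

def collapse_hyphenation(text: str) -> str:
    """Join words split across line breaks by end-of-line hyphens."""
    # "align-\nment" -> "alignment"; lines without a trailing hyphen are untouched
    return re.sub(r"(\w)-\n(\w)", r"\1\2", text)
```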

---

## A minimal rebuild checklist

- Same tokenizer family for write and read.
- Window and stride validated on 40 to 120 gold items.
- Anchor coverage ≥ 0.70 and citation precision ≥ 0.85.
- ΔS falls below 0.45 for the majority of gold items after the rebuild.
- `index_hash` updated and logged in retrieval results.

---

## Alignment test you can run today

1) For each gold item, compute ΔS between the retrieved text and the anchor text.
2) Compute coverage of cited spans against the anchor offsets.
3) Compare to a decoy section with the same size and style. If ΔS is close for anchor and decoy, chunk again.

See the eval recipes:
[Retrieval Evaluation Recipes](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Retrieval/retrieval_eval_recipes.md)
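
Step 3's decoy comparison is easy to script. A minimal sketch, assuming `delta_s` is your own ΔS scorer passed in as a callable, and treating a margin under 0.05 as "chunk again"; the threshold is an assumption you should tune on your gold set:

```python
def decoy_margin(delta_s, question: str, anchor_text: str, decoy_text: str) -> float:
    """Positive margin means the anchor is measurably closer than the decoy."""
    return delta_s(question, decoy_text) - delta_s(question, anchor_text)

def needs_rechunk(margin: float, min_margin: float = 0.05) -> bool:
    """Flag gold items where anchor and decoy are too close to separate."""
    return margin < min_margin
```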

---

## Pseudocode: chunk and map anchors

```python
# Pseudocode only
def chunk(doc, window=512, stride=256, fences=None, analyzer="lc"):
    toks = tokenize(doc, analyzer)  # write-time tokenizer
    boundaries = structure_fences(doc, toks, fences=fences)
    chunks = []
    i = 0
    while i < len(toks):
        start = nearest_left_fence(i, boundaries)
        end = min(start + window, len(toks))
        text = detokenize(toks[start:end])
        chunks.append({"start": start, "end": end, "text": text})
        i = max(i + 1, start + stride)  # always advance, even if the fence snaps back
    return chunks

def map_anchor_to_chunk(anchor, chunks):
    spans = []
    for c in chunks:
        # keep any chunk whose token range overlaps the anchor's offsets
        if not (anchor["end"] <= c["start"] or anchor["start"] >= c["end"]):
            spans.append({"snippet_id": id_of(c), "offsets": [
                max(anchor["start"], c["start"]),
                min(anchor["end"], c["end"])
            ]})
    return spans
```

Store the mapping result inside your index metadata for audit and to power coverage scoring.
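
A self-contained, runnable variant of the overlap mapping can be tried immediately. This sketch swaps the real write-time tokenizer for a trivial whitespace split, purely for illustration:

```python
def whitespace_chunks(text: str, window: int = 4, stride: int = 2) -> list[dict]:
    """Fixed-window chunks over whitespace tokens; stride controls overlap."""
    toks = text.split()
    chunks = []
    for start in range(0, len(toks), stride):
        end = min(start + window, len(toks))
        chunks.append({"start": start, "end": end, "text": " ".join(toks[start:end])})
        if end == len(toks):  # last window reached the end of the document
            break
    return chunks

def overlapping_spans(anchor: dict, chunks: list[dict]) -> list[dict]:
    """Clip the anchor's token range against every overlapping chunk."""
    spans = []
    for idx, c in enumerate(chunks):
        if anchor["end"] > c["start"] and anchor["start"] < c["end"]:
            spans.append({
                "chunk": idx,
                "offsets": [max(anchor["start"], c["start"]),
                            min(anchor["end"], c["end"])],
            })
    return spans
```

An anchor that lands in the overlap region maps to more than one chunk, which is exactly the case where you emit a stitch record or adjust the stride.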

---

## Copy-paste prompt to audit alignment

```
You have TXT OS and the WFGY Problem Map loaded.

Input:
- question: "<q>"
- retrieved: {section_id, snippet_id, offsets, text}
- anchor: {section_id, snippet_id, offsets, text}
- decoy: {section_id, snippet_id, offsets, text}

Tasks:
1) Check cite-then-explain is followed.
2) Report ΔS(question, retrieved) and ΔS(retrieved, anchor) with short notes.
3) Compute anchor coverage from offsets.
4) If ΔS ≥ 0.60 or coverage < 0.70, propose the minimal rebuild step referencing:
   chunking-checklist, retrieval-playbook, data-contracts, retrieval-traceability.

Return a compact JSON: { "ΔS": ..., "coverage": ..., "why": "...", "next_fix": "..." }.
```

---

## When to escalate

* You rebuild chunking and analyzers but ΔS remains high and coverage low.
  Open: [Embedding ≠ Semantic](https://github.com/onestardao/WFGY/blob/main/ProblemMap/embedding-vs-semantic.md)

* Online runs drift after a deploy despite passing offline.
  Open: [Bootstrap Ordering](https://github.com/onestardao/WFGY/blob/main/ProblemMap/bootstrap-ordering.md) and [Pre-Deploy Collapse](https://github.com/onestardao/WFGY/blob/main/ProblemMap/predeploy-collapse.md)

---

### 🔗 Quick-Start Downloads (60 sec)

| Tool | Link | 3-Step Setup |
| --- | --- | --- |
| **WFGY 1.0 PDF** | [Engine Paper](https://github.com/onestardao/WFGY/blob/main/I_am_not_lizardman/WFGY_All_Principles_Return_to_One_v1.0_PSBigBig_Public.pdf) | 1️⃣ Download · 2️⃣ Upload to your LLM · 3️⃣ Ask “Answer using WFGY + \<your question>” |
| **TXT OS (plain-text OS)** | [TXTOS.txt](https://github.com/onestardao/WFGY/blob/main/OS/TXTOS.txt) | 1️⃣ Download · 2️⃣ Paste into any LLM chat · 3️⃣ Type “hello world” — OS boots instantly |

---

<!-- WFGY_FOOTER_START -->

### Explore More

| Layer | Page | What it’s for |
| --- | --- | --- |
| ⭐ Proof | [WFGY Recognition Map](/recognition/README.md) | External citations, integrations, and ecosystem proof |
| ⚙️ Engine | [WFGY 1.0](/legacy/README.md) | Original PDF tension engine and early logic sketch (legacy reference) |
| ⚙️ Engine | [WFGY 2.0](/core/README.md) | Production tension kernel for RAG and agent systems |
| ⚙️ Engine | [WFGY 3.0](/TensionUniverse/EventHorizon/README.md) | TXT based Singularity tension engine (131 S class set) |
| 🗺️ Map | [Problem Map 1.0](/ProblemMap/README.md) | Flagship 16-problem RAG failure taxonomy and fix map |
| 🗺️ Map | [Problem Map 2.0](/ProblemMap/wfgy-rag-16-problem-map-global-debug-card.md) | Global Debug Card for RAG and agent pipeline diagnosis |
| 🗺️ Map | [Problem Map 3.0](/ProblemMap/wfgy-ai-problem-map-troubleshooting-atlas.md) | Global AI troubleshooting atlas and failure pattern map |
| 🧰 App | [TXT OS](/OS/README.md) | .txt semantic OS with fast bootstrap |
| 🧰 App | [Blah Blah Blah](/OS/BlahBlahBlah/README.md) | Abstract and paradox Q&A built on TXT OS |
| 🧰 App | [Blur Blur Blur](/OS/BlurBlurBlur/README.md) | Text-to-image generation with semantic control |
| 🏡 Onboarding | [Starter Village](/StarterVillage/README.md) | Guided entry point for new users |

If this repository helped, starring it improves discovery so more builders can find the docs and tools.

[WFGY on GitHub](https://github.com/onestardao/WFGY)

<!-- WFGY_FOOTER_END -->
|