mirror of
https://github.com/onestardao/WFGY.git
synced 2026-04-28 11:40:07 +00:00
217 lines
9.1 KiB
Markdown
217 lines
9.1 KiB
Markdown
<!--
|
||
Search Anchor:
|
||
chunking global fix map
|
||
rag chunking bugs
|
||
document chunking failures
|
||
chunk boundaries wrong
|
||
snippets cut mid thought
|
||
anchors missing sections skipped
|
||
chunk id schema stable ids
|
||
duplicate chunks across runs
|
||
chunk drift after reindex
|
||
chunking checklist preflight
|
||
code table block structure lost
|
||
markdown code tables blocks
|
||
section detection heading anchors
|
||
title hierarchy h1 h2 h3 outline
|
||
pdf layouts ocr chunking
|
||
two column pdf broken reading order
|
||
math formulas split across chunks
|
||
tables split or merged incorrectly
|
||
reindex migration chunk id mismatch
|
||
index rebuilt but old refs broken
|
||
eval rag precision recall for chunking
|
||
cannot prove better chunking
|
||
live monitoring rag chunk health
|
||
sudden drift after deploy
|
||
context flips with same corpus
|
||
|
||
metrics and contracts:
|
||
delta s question retrieved <= 0.45
|
||
coverage target section >= 0.70
|
||
lambda observe convergent 3 paraphrases 2 seeds
|
||
traceability contract snippet_id section_id source_url offsets tokens
|
||
snippet contract cite then explain
|
||
chunk boundaries align semantic windows
|
||
chunk size tokens window length overlap
|
||
hash based chunk id
|
||
versioned chunk schema
|
||
|
||
formats and pipelines:
|
||
pdf ocr html markdown word docs
|
||
code files notebooks tables logs
|
||
monolithic vs semantic chunks
|
||
sliding window chunking
|
||
title based segmentation
|
||
section anchors toc
|
||
header detection underlines numbering
|
||
hybrid retrievers fail due to chunking
|
||
vector db looks fine but chunks bad
|
||
store independent chunking guardrails
|
||
|
||
common incidents:
|
||
important section never retrieved
|
||
only meaningless sub section retrieved
|
||
citations collapse after parsing
|
||
reindex changes chunk ids breaks bookmarks
|
||
deployment changes chunking strategy
|
||
roll back but index not aligned
|
||
monitoring catches sudden chunk collapse
|
||
-->
|
||
|
||
<!--
|
||
Primary pages in this folder:
|
||
ProblemMap/GlobalFixMap/Chunking/chunk_id_schema.md
|
||
ProblemMap/GlobalFixMap/Chunking/chunking-checklist.md
|
||
ProblemMap/GlobalFixMap/Chunking/code_tables_blocks.md
|
||
ProblemMap/GlobalFixMap/Chunking/section_detection.md
|
||
ProblemMap/GlobalFixMap/Chunking/title_hierarchy.md
|
||
ProblemMap/GlobalFixMap/Chunking/pdf_layouts_and_ocr.md
|
||
ProblemMap/GlobalFixMap/Chunking/reindex_migration.md
|
||
ProblemMap/GlobalFixMap/Chunking/eval_rag_precision_recall.md
|
||
ProblemMap/GlobalFixMap/Chunking/live_monitoring_rag.md
|
||
-->
|
||
|
||
<!--
|
||
Related routing pages:
|
||
ProblemMap/retrieval-traceability.md
|
||
ProblemMap/data-contracts.md
|
||
ProblemMap/embedding-vs-semantic.md
|
||
ProblemMap/patterns/pattern_vectorstore_fragmentation.md
|
||
ProblemMap/chunking-checklist.md
|
||
ProblemMap/retrieval-playbook.md
|
||
ProblemMap/context-drift.md
|
||
ProblemMap/GlobalFixMap/Retrieval/deltaS_probes.md
|
||
|
||
Cross folder jumps:
|
||
Retrieval Global Fix Map README
|
||
Embeddings Global Fix Map README
|
||
VectorDBs_and_Stores Global Fix Map README
|
||
-->
|
||
|
||
|
||
|
||
# Chunking — Global Fix Map
|
||
|
||
<details>
|
||
<summary><strong>🏥 Quick Return to Emergency Room</strong></summary>
|
||
|
||
<br>
|
||
|
||
> You are in a specialist desk.
|
||
> For full triage and doctors on duty, return here:
|
||
>
|
||
> - [**WFGY Global Fix Map** — main Emergency Room, 300+ structured fixes](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/README.md)
|
||
> - [**WFGY Problem Map 1.0** — 16 reproducible failure modes](https://github.com/onestardao/WFGY/blob/main/ProblemMap/README.md)
|
||
>
|
||
> Think of this page as a sub-room.
|
||
> If you want full consultation and prescriptions, go back to the Emergency Room lobby.
|
||
</details>
|
||
|
||
A compact hub to **stabilize document chunking** across formats, pipelines, and retrieval systems.
|
||
This folder routes chunk-related bugs to structural fixes and provides checklists, schema, and live recipes.
|
||
No infra change required.
|
||
|
||
---
|
||
|
||
## Orientation: what each page does
|
||
|
||
| Page | What it solves | Typical symptom |
|
||
|---|---|---|
|
||
| [Chunk ID Schema](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Chunking/chunk_id_schema.md) | Unique ID + schema for each chunk | Duplicate or drifting chunks across runs |
|
||
| [Chunking Checklist](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Chunking/chunking-checklist.md) | Minimal audit list for validity | Chunks too long, too short, or incomplete |
|
||
| [Code / Tables / Blocks](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Chunking/code_tables_blocks.md) | Preserve structure for code, tables, blocks | Retrieval drops formatting or logic |
|
||
| [Section Detection](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Chunking/section_detection.md) | Detect paragraph and section anchors | Anchors missing, snippets cut mid-thought |
|
||
| [Title Hierarchy](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Chunking/title_hierarchy.md) | Maintain document heading hierarchy | Only partial or meaningless sub-sections retrieved |
|
||
| [PDF Layouts & OCR](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Chunking/pdf_layouts_and_ocr.md) | Repair PDF/OCR-specific chunking | Citations collapse after parsing |
|
||
| [Reindex & Migration](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Chunking/reindex_migration.md) | Safe chunk migration during reindex | Index rebuilt but old refs mismatch |
|
||
| [Eval RAG Precision & Recall](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Chunking/eval_rag_precision_recall.md) | Deterministic evaluation recipes | “Better” chunking cannot be proven |
|
||
| [Live Monitoring (RAG)](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Chunking/live_monitoring_rag.md) | Online health checks for chunking | Sudden drift or collapse after deploy |
|
||
|
||
---
|
||
|
||
## When to use this folder
|
||
|
||
- Your chunks look fine by eye but retrieval skips important sections.
|
||
- PDF / OCR parsing collapses headers, math, or tables.
|
||
- Hybrid retrievers underperform due to inconsistent chunk boundaries.
|
||
- Reindexing breaks old citations.
|
||
- Context flips between runs with same corpus.
|
||
|
||
---
|
||
|
||
## Acceptance targets
|
||
|
||
- Chunk boundaries align with semantic windows
|
||
- ΔS(question, retrieved) ≤ 0.45
|
||
- Coverage of target section ≥ 0.70
|
||
- λ_observe convergent across 3 paraphrases and 2 seeds
|
||
- Traceability contract fields always present: `{snippet_id, section_id, source_url, offsets, tokens}`
|
||
|
||
---
|
||
|
||
## 60-second fix checklist
|
||
|
||
1) **Check chunk IDs**
|
||
Apply `chunk_id_schema`. Ensure unique + stable across reindex.
|
||
|
||
2) **Audit with checklist**
|
||
Run the [chunking-checklist](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Chunking/chunking-checklist.md) before ingest.
|
||
|
||
3) **Preserve structure**
|
||
Use [code_tables_blocks](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Chunking/code_tables_blocks.md) for code, tables, blocks.
|
||
|
||
4) **Validate anchors**
|
||
Confirm section and title detection. Apply [title_hierarchy](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Chunking/title_hierarchy.md).
|
||
|
||
5) **Reindex safely**
|
||
Use [reindex_migration](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Chunking/reindex_migration.md) with hash/version lock.
|
||
|
||
6) **Monitor live**
|
||
Apply [live_monitoring_rag](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Chunking/live_monitoring_rag.md) to catch collapse early.
|
||
|
||
---
|
||
|
||
## Minimal probe pack
|
||
|
||
```txt
|
||
Context: I loaded TXT OS and the WFGY pages.
|
||
|
||
Task:
|
||
- Given doc corpus D, log ΔS(question, retrieved) and λ across 3 paraphrases.
|
||
- Validate chunk IDs and section anchors.
|
||
- If ΔS ≥ 0.60 or λ flips, propose the smallest structural change:
|
||
chunk schema, checklist, or reindex.
|
||
- Verify coverage ≥ 0.70 after fix.
|
||
|
||
Return JSON:
|
||
{ "citations": [...], "ΔS": 0.xx, "λ_state": "<>", "coverage": 0.xx, "next_fix": "..." }
|
||
```
|
||
---
|
||
|
||
### 🔗 Quick-Start Downloads (60 sec)
|
||
|
||
| Tool | Link | 3-Step Setup |
|
||
|------|------|--------------|
|
||
| **WFGY 1.0 PDF** | [Engine Paper](https://github.com/onestardao/WFGY/blob/main/I_am_not_lizardman/WFGY_All_Principles_Return_to_One_v1.0_PSBigBig_Public.pdf) | 1️⃣ Download · 2️⃣ Upload to your LLM · 3️⃣ Ask “Answer using WFGY + \<your question>” |
|
||
| **TXT OS (plain-text OS)** | [TXTOS.txt](https://github.com/onestardao/WFGY/blob/main/OS/TXTOS.txt) | 1️⃣ Download · 2️⃣ Paste into any LLM chat · 3️⃣ Type “hello world” — OS boots instantly |
|
||
|
||
---
|
||
|
||
<!-- WFGY_FOOTER_START -->
|
||
|
||
### Explore More
|
||
|
||
| Module | Description | Link |
|
||
| --- | --- | --- |
|
||
| WFGY Core | Canonical framework entry point | [View](https://github.com/onestardao/WFGY/tree/main/core/README.md) |
|
||
| Problem Map | Diagnostic map and navigation hub | [View](https://github.com/onestardao/WFGY/tree/main/ProblemMap/README.md) |
|
||
| Tension Universe Experiments | MVP experiment field | [View](https://github.com/onestardao/WFGY/tree/main/TensionUniverse/Experiments) |
|
||
| Recognition | Where WFGY is referenced or adopted | [View](https://github.com/onestardao/WFGY/blob/main/recognition/README.md) |
|
||
| AI Guide | Anti-hallucination reading protocol for tools | [View](https://github.com/onestardao/WFGY/blob/main/AI_GUIDE.md) |
|
||
|
||
> If this repository helps, starring it improves discovery for other builders.
|
||
> [](https://github.com/onestardao/WFGY)
|
||
|
||
<!-- WFGY_FOOTER_END -->
|
||
|