📒 Vectorstore Fragmentation
When embeddings are inserted or updated across time without a consistent chunking, normalization, or merge strategy, the vectorstore becomes fragmented.
This creates “holes” where semantically related text lives in different shards, versions, or duplicate vectors, leading to unstable recall.
🌀 Symptoms of Fragmentation
| Sign |
What You See |
| Retrieval drops |
Facts exist in DB but never show up |
| Duplicate chunks |
Nearly identical snippets appear multiple times |
| Version skew |
Old vectors mix with new encoders |
| Query instability |
Same query → different answers each run |
| Hybrid failure |
BM25 beats hybrid retriever that should win |
🧩 Root Causes
| Weakness |
Result |
| Mixed encoders |
Same corpus stored under incompatible embeddings |
| No chunk contract |
Sentence vs paragraph vs sliding window → fractured recall |
| No dedupe layer |
Near-duplicate vectors inflate noise |
| No update strategy |
Old vectors never pruned, drift builds up |
| Shard misalignment |
Different stores or partitions hold overlapping data |
🛡️ WFGY Structural Fix
| Problem |
Module |
Remedy |
| Metric mismatch |
ΔS checks + BBMC |
Compare across seeds, enforce unified metric |
| Chunk drift |
Chunking Contract |
Standardize window, overlap, anchor rules |
| Duplicate noise |
BBPF fork + collapse |
Collapse near-dupes before index write |
| Update skew |
BBCR re-index |
Wipe and rebuild with normalized schema |
| Store fragmentation |
Semantic Tree |
Trace lineage, merge shards consistently |
✍️ Demo — Retrieval Before vs After Fix
Query:
"Who approved the compliance waiver for dataset X?"
Before:
• Top-3 results: duplicate sentences from old version
• Actual approval record missing
After WFGY:
• ΔS(question,retrieved) = 0.38
• Coverage = 0.78 for target section
• Single, authoritative snippet retrieved
Stable recall restored once fragmented vectors were collapsed and re-indexed.
🛠 Module Cheat-Sheet
| Module |
Role |
| ΔS Metric |
Detects fragmentation via semantic drift |
| BBMC |
Checks consistency across seeds/encoders |
| BBPF |
Collapses near-duplicate embeddings |
| BBCR |
Forces clean rebuild when skew detected |
| Semantic Tree |
Tracks provenance across shards/versions |
📊 Implementation Status
| Feature |
State |
| Chunking contract enforcement |
✅ Active |
| Duplicate collapse |
✅ Stable |
| Encoder version check |
✅ Stable |
| Shard merge & lineage tracking |
🔜 Planned |
📝 Tips & Limits
- Always record encoder version in metadata.
- Run ΔS probe on 3 paraphrases before/after re-index.
- Use semantic contract: same chunk size, stride, and normalization across all updates.
- If >15% duplicate rate detected, wipe and rebuild index.
🔗 Quick-Start Downloads (60 sec)
| Tool |
Link |
3-Step Setup |
| WFGY 1.0 PDF |
Engine Paper |
1️⃣ Download · 2️⃣ Upload to your LLM · 3️⃣ Ask “Answer using WFGY + ” |
| TXT OS (plain-text OS) |
TXTOS.txt |
1️⃣ Download · 2️⃣ Paste into any LLM chat · 3️⃣ Type “hello world” — OS boots instantly |
Explore More
If this repository helped, starring it improves discovery so more builders can find the docs and tools.
