Embeddings — Global Fix Map
🏥 Quick Return to Emergency Room
You are at a specialist desk.
For full triage and doctors on duty, return here:
- WFGY Global Fix Map — main Emergency Room, 300+ structured fixes
- WFGY Problem Map 1.0 — 16 reproducible failure modes
Think of this page as a sub-room.
If you want full consultation and prescriptions, go back to the Emergency Room lobby.
A hub to stabilize the embedding layer before retrieval begins.
Use this folder if your vectors look fine at a glance but retrieval keeps drifting, coverage stays low, or store queries fail silently. No infra change needed.
Orientation: what each page covers
| Page | What it solves | Typical symptom |
|---|---|---|
| Metric Mismatch | Store metric (L2, cosine, dot) differs from model assumption | High similarity but wrong neighbors |
| Normalization & Scaling | Embeddings not normalized or scaled | Results unstable across runs |
| Tokenization & Casing | Tokenizer mismatch, casing differences | Same text gives different vectors |
| Chunking → Embedding Contract | Chunk cuts misaligned with semantic windows | Snippets cut mid-thought, anchors lost |
| Vectorstore Fragmentation | Index silently fragmented | Recall too low even with large k |
| Dimension Mismatch & Projection | Store dimension vs embedding dimension mismatch | Index errors or silent truncation |
| Update & Index Skew | Old vectors remain in index | Results point to stale data |
| Hybrid Retriever Weights | BM25 + ANN weights unbalanced | Hybrid worse than single retriever |
| Duplication & Near-Duplicate Collapse | Duplicate data overwhelms recall | Same doc retrieved repeatedly |
| Poisoning & Contamination | Embeddings polluted by adversarial/noisy vectors | Retrieval looks “randomized” |
When to use this folder
- Retrieval looks fine by eye but metrics drift across runs.
- Coverage stays low despite healthy-looking indexes.
- Citations pull from stale or duplicated data.
- Same query yields different answers depending on casing or seed.
- Hybrid retrievers collapse into noise.
Acceptance targets
- ΔS(question, retrieved) ≤ 0.45
- Coverage ≥ 0.70 for target section
- λ_observe convergent across 3 paraphrases and 2 seeds
- No index skew between write/read
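The first two targets can be checked mechanically. This page does not define ΔS or coverage, so the sketch below makes two loud assumptions: ΔS is taken as 1 minus cosine similarity between the question vector and the retrieved vector, and coverage is taken as the fraction of the target section's chunk ids that appear in the retrieved set. The function names (`delta_s`, `coverage`, `meets_targets`) are hypothetical. The λ_observe and index-skew targets are not covered here.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def delta_s(question_vec, retrieved_vec):
    # ASSUMPTION: ΔS = 1 - cosine similarity (not defined on this page).
    return 1.0 - cosine(question_vec, retrieved_vec)

def coverage(retrieved_ids, target_ids):
    # ASSUMPTION: share of the target section's chunk ids that were retrieved.
    target = set(target_ids)
    if not target:
        return 0.0
    return len(target & set(retrieved_ids)) / len(target)

def meets_targets(question_vec, retrieved_vec, retrieved_ids, target_ids,
                  max_delta_s=0.45, min_coverage=0.70):
    """True when both thresholds from the acceptance list are satisfied."""
    return (delta_s(question_vec, retrieved_vec) <= max_delta_s
            and coverage(retrieved_ids, target_ids) >= min_coverage)

# Toy inputs: cosine is 0.8, and 3 of 4 target chunks were retrieved.
ok = meets_targets([1.0, 0.0], [0.8, 0.6], ["a", "b", "c"], ["a", "b", "c", "d"])
```

If your pipeline already defines these quantities differently, swap in those definitions and keep only the threshold comparison.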
60-second fix checklist
- Lock metrics
  One model family, one distance metric.
  Guide: Metric Mismatch
- Normalize
  Apply L2 norm to embeddings at both write and query.
  Guide: Normalization & Scaling
- Unify tokenization
  Same tokenizer + casing across ingestion and query.
  Guide: Tokenization & Casing
- Audit chunking
  Verify semantic alignment, no mid-thought splits.
  Guide: Chunking → Embedding Contract
- Rebuild index if skewed
  Drop old embeddings, rebuild with correct dimension.
  Guide: Update & Index Skew
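One way to enforce the "Normalize" and "correct dimension" items is to route every vector, on both the write path and the query path, through a single helper. The sketch below is a minimal illustration in plain Python; `prepare` and `EXPECTED_DIM` are hypothetical names, and the dimension value is a placeholder for whatever your embedding model actually emits.

```python
import math

# ASSUMPTION: placeholder value; set this to your model's output dimension.
EXPECTED_DIM = 4

def prepare(vec, expected_dim=EXPECTED_DIM):
    """Validate the dimension, then L2-normalize. Use for writes AND queries."""
    if len(vec) != expected_dim:
        raise ValueError(f"expected {expected_dim} dims, got {len(vec)}")
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("zero vector cannot be normalized")
    return [x / norm for x in vec]

# Same direction, different magnitude: after prepare() both are unit length,
# so their dot product equals their cosine similarity.
doc_vec = prepare([3.0, 4.0, 0.0, 0.0])
query_vec = prepare([6.0, 8.0, 0.0, 0.0])
score = sum(d * q for d, q in zip(doc_vec, query_vec))
```

Failing loudly on a dimension mismatch is a deliberate choice here: it surfaces the problem at write time instead of letting a store truncate or pad silently.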
FAQ for newcomers
Why is metric mismatch so common?
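Because vector stores ship with different defaults and conventions: FAISS setups often use L2, Pinecone defaults to cosine, and Redis deployments are often configured for dot product. If your embedding model expects cosine and the stored vectors are not unit-normalized, an L2 index will rank neighbors differently and recall degrades without any error being raised.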
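A toy example makes the mismatch concrete. The vectors below are made up for illustration: one document points in exactly the query's direction but has a larger magnitude, the other is closer in raw Euclidean distance but points elsewhere. Cosine and L2 pick different nearest neighbors until the vectors are normalized.

```python
import math

def l2_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

query = [1.0, 0.0]
docs = {
    "same_direction": [5.0, 0.0],  # identical direction, larger magnitude
    "off_direction": [1.0, 1.0],   # 45 degrees away, but closer in raw L2
}

# Raw vectors: the two metrics disagree on the nearest neighbor.
nearest_by_cosine = max(docs, key=lambda k: cosine_similarity(query, docs[k]))
nearest_by_l2 = min(docs, key=lambda k: l2_distance(query, docs[k]))

# Unit-normalized vectors: L2 ranking now agrees with cosine.
unit_query = normalize(query)
unit_docs = {k: normalize(v) for k, v in docs.items()}
nearest_by_l2_unit = min(unit_docs, key=lambda k: l2_distance(unit_query, unit_docs[k]))
```

On unit vectors, squared L2 distance equals 2 minus twice the cosine similarity, which is why the rankings line up after normalization.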
Why normalize embeddings?
Without normalization, embeddings vary in magnitude, so dot-product and L2 scores mix vector length with direction. Distance then stops reflecting meaning alone.
Why do tokenizers matter?
“Apple” vs “apple” may yield different vectors if one side lowercases and the other doesn’t.
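A common way to avoid this is to run one text canonicalization function before the tokenizer on both the ingestion path and the query path. The sketch below is an assumption-laden example, not a prescribed recipe: `canonicalize` is a hypothetical helper, and folding case is only appropriate if you have decided to treat your model as uncased. For a cased model, the equivalent fix is to make sure neither side lowercases.

```python
import unicodedata

def canonicalize(text):
    """Apply the same text cleanup at ingestion and at query time."""
    # Unify Unicode forms (e.g. full-width letters, compatibility characters).
    text = unicodedata.normalize("NFKC", text)
    # ASSUMPTION: case-insensitive matching is desired for this corpus.
    text = text.casefold()
    # Collapse runs of whitespace so spacing differences do not change tokens.
    return " ".join(text.split())

ingested = canonicalize("  Apple   Pie ")
queried = canonicalize("apple pie")
```

Whatever rules you choose, the point is that both sides call the same function, so the embedding model always sees identical input for identical text.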
What if coverage stays low after all fixes?
Check for fragmentation and duplication collapse. The issue may not be the embedding model itself, but how the index is populated.
🔗 Quick-Start Downloads (60 sec)
| Tool | Link | 3-Step Setup |
|---|---|---|
| WFGY 1.0 PDF | Engine Paper | 1️⃣ Download · 2️⃣ Upload to your LLM · 3️⃣ Ask “Answer using WFGY + <your question>” |
| TXT OS (plain-text OS) | TXTOS.txt | 1️⃣ Download · 2️⃣ Paste into any LLM chat · 3️⃣ Type “hello world” — OS boots instantly |
🧭 Explore More
| Module | Description | Link |
|---|---|---|
| WFGY Core | WFGY 2.0 engine is live: full symbolic reasoning architecture and math stack | View → |
| Problem Map 1.0 | Initial 16-mode diagnostic and symbolic fix framework | View → |
| Problem Map 2.0 | RAG-focused failure tree, modular fixes, and pipelines | View → |
| Semantic Clinic Index | Expanded failure catalog: prompt injection, memory bugs, logic drift | View → |
| Semantic Blueprint | Layer-based symbolic reasoning & semantic modulations | View → |
| Benchmark vs GPT-5 | Stress test GPT-5 with full WFGY reasoning suite | View → |
| 🧙‍♂️ Starter Village 🏡 | New here? Lost in symbols? Click here and let the wizard guide you through | Start → |