vrr/WFGY

mirror of https://github.com/onestardao/WFGY.git synced 2026-04-28 11:40:07 +00:00

Embeddings — Global Fix Map

🏥 Quick Return to Emergency Room

You are at a specialist desk.
For full triage and doctors on duty, return here:

WFGY Global Fix Map — main Emergency Room, 300+ structured fixes

WFGY Problem Map 1.0 — 16 reproducible failure modes

Think of this page as a sub-room.
If you want full consultation and prescriptions, go back to the Emergency Room lobby.

A hub to stabilize the embedding layer before retrieval begins.
Use this folder if your vectors look fine at a glance but retrieval keeps drifting, coverage stays low, or store queries fail silently. No infra change needed.

Orientation: what each page covers

Page	What it solves	Typical symptom
Metric Mismatch	Store metric (L2, cosine, dot) differs from model assumption	High similarity but wrong neighbors
Normalization & Scaling	Embeddings not normalized or scaled	Results unstable across runs
Tokenization & Casing	Tokenizer mismatch, casing differences	Same text gives different vectors
Chunking → Embedding Contract	Chunk cuts misaligned with semantic windows	Snippets cut mid-thought, anchors lost
Vectorstore Fragmentation	Index silently fragmented	Recall too low even with large k
Dimension Mismatch & Projection	Store dimension vs embedding dimension mismatch	Index errors or silent truncation
Update & Index Skew	Old vectors remain in index	Results point to stale data
Hybrid Retriever Weights	BM25 + ANN weights unbalanced	Hybrid worse than single retriever
Duplication & Near-Duplicate Collapse	Duplicate data overwhelms recall	Same doc retrieved repeatedly
Poisoning & Contamination	Embeddings polluted by adversarial/noisy vectors	Retrieval looks “randomized”
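To make the Hybrid Retriever Weights row concrete, here is a minimal sketch of one common way to blend BM25 and ANN scores: min-max rescale each score set, then take a weighted sum. This is an assumption for illustration, not necessarily the method the linked guide prescribes. The function names, the `alpha` parameter, and the sample scores are all hypothetical.

```python
def min_max(scores):
    """Rescale a {doc_id: score} map to [0, 1]; a constant map becomes all zeros."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 0.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def hybrid_scores(bm25, ann, alpha=0.5):
    """Blend lexical and vector scores; alpha is the weight on the vector side."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    b, a = min_max(bm25), min_max(ann)
    docs = set(b) | set(a)
    # A document missing from one retriever contributes 0 from that side.
    return {d: alpha * a.get(d, 0.0) + (1 - alpha) * b.get(d, 0.0) for d in docs}


def top_doc(bm25, ann, alpha):
    blended = hybrid_scores(bm25, ann, alpha)
    return max(blended, key=blended.get)


# Made-up scores: BM25 is unbounded, cosine similarity is not.
bm25 = {"doc1": 12.0, "doc2": 3.0, "doc3": 7.5}
ann = {"doc1": 0.62, "doc2": 0.91, "doc4": 0.55}

lexical_only = top_doc(bm25, ann, alpha=0.0)  # "doc1"
vector_only = top_doc(bm25, ann, alpha=1.0)   # "doc2"
```

Without the rescaling step, raw BM25 scores would swamp cosine scores regardless of the chosen weight, which is one way a hybrid ends up worse than either retriever alone.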

When to use this folder

Retrieval looks fine by eye but metrics drift across runs.
Coverage stays low despite healthy-looking indexes.
Citations pull from stale or duplicated data.
Same query yields different answers depending on casing or seed.
Hybrid retrievers collapse into noise.

Acceptance targets

ΔS(question, retrieved) ≤ 0.45
Coverage ≥ 0.70 for target section
λ_observe convergent across 3 paraphrases and 2 seeds
No index skew between write/read
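The targets above can be turned into a simple gate for a retrieval run. The sketch below only compares already-measured values against the thresholds listed here. How ΔS, coverage, and λ_observe are computed is outside its scope, and the `RunMetrics` type and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RunMetrics:
    delta_s: float            # ΔS(question, retrieved), measured elsewhere
    coverage: float           # coverage of the target section, in [0, 1]
    lambda_convergent: bool   # convergent across 3 paraphrases and 2 seeds
    index_skew: bool          # True if write and read indexes disagree


def failed_targets(m, max_delta_s=0.45, min_coverage=0.70):
    """Return the names of the acceptance targets that this run misses."""
    failures = []
    if m.delta_s > max_delta_s:
        failures.append("delta_s")
    if m.coverage < min_coverage:
        failures.append("coverage")
    if not m.lambda_convergent:
        failures.append("lambda_observe")
    if m.index_skew:
        failures.append("index_skew")
    return failures


good_run = failed_targets(RunMetrics(0.38, 0.74, True, False))
bad_run = failed_targets(RunMetrics(0.52, 0.61, True, True))
```

An empty list means the run meets every target. A non-empty list names which guide in this folder to open first.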

60-second fix checklist

1. Lock metrics
   One model family, one distance metric.
   Guide: Metric Mismatch
2. Normalize
   L2-normalize embeddings at both write time and query time.
   Guide: Normalization & Scaling
3. Unify tokenization
   Use the same tokenizer and casing for ingestion and query.
   Guide: Tokenization & Casing
4. Audit chunking
   Verify that chunks align with semantic boundaries, with no mid-thought splits.
   Guide: Chunking → Embedding Contract
5. Rebuild the index if skewed
   Drop old embeddings and rebuild with the correct dimension.
   Guide: Update & Index Skew
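One way to keep the checklist enforceable is to record the embedding settings at write time and compare them with the settings used at query time. The sketch below is a minimal illustration under that assumption. The `EmbeddingConfig` type, its fields, and the model name are hypothetical, not part of WFGY.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class EmbeddingConfig:
    model: str         # embedding model identifier
    dimension: int     # vector length the index was built with
    metric: str        # "cosine", "l2", or "dot"
    normalized: bool   # whether vectors are L2-normalized
    lowercase: bool    # whether text is lowercased before embedding


def config_drift(write_cfg, query_cfg):
    """Return {field: (write_value, query_value)} for every field that differs."""
    w, q = asdict(write_cfg), asdict(query_cfg)
    return {k: (w[k], q[k]) for k in w if w[k] != q[k]}


# Hypothetical settings: the query path changed metric and casing.
write_cfg = EmbeddingConfig("example-model-v1", 768, "cosine", True, True)
query_cfg = EmbeddingConfig("example-model-v1", 768, "l2", True, False)

drift = config_drift(write_cfg, query_cfg)
# {"metric": ("cosine", "l2"), "lowercase": (True, False)}
```

Storing the write-side config next to the index makes a metric, dimension, or casing change show up as an explicit mismatch rather than as a quiet drop in recall.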

FAQ for newcomers

Why is metric mismatch so common?
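Because vector stores differ in their defaults and in the metric their examples use: FAISS examples commonly use L2, Pinecone defaults to cosine, and Redis makes you choose among L2, inner product, and cosine when you create the index. If your embedding model expects cosine and the store ranks unnormalized vectors by L2 or dot product, the neighbor order changes silently and recall drops.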
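A tiny worked example shows the effect. On raw vectors, cosine and L2 can disagree about the nearest neighbor. On unit-length vectors they agree, because ‖a − b‖² = 2 − 2·cos(a, b). The vectors and document names below are made up for illustration.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    return math.sqrt(dot(a, a))


def cosine(a, b):
    return dot(a, b) / (norm(a) * norm(b))


def l2(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def unit(a):
    n = norm(a)
    return [x / n for x in a]


query = [1.0, 0.0]
docs = {
    "aligned_long": [10.0, 0.0],     # same direction as the query, large magnitude
    "off_axis_short": [1.0, 0.5],    # different direction, small magnitude
}

# Cosine prefers the vector pointing the same way as the query.
best_by_cosine = max(docs, key=lambda d: cosine(query, docs[d]))

# Raw L2 prefers the vector that is merely close in magnitude.
best_by_l2_raw = min(docs, key=lambda d: l2(query, docs[d]))

# After normalization, L2 ranks the same way cosine does.
best_by_l2_unit = min(docs, key=lambda d: l2(unit(query), unit(docs[d])))
```

Here cosine and normalized L2 both pick `aligned_long`, while raw L2 picks `off_axis_short`.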

Why normalize embeddings?
Without normalization, embeddings vary in magnitude, so L2 and dot-product scores mix vector length with direction. Distance then stops reflecting meaning.
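As a minimal sketch, L2 normalization divides each vector by its Euclidean length. The helper name and the `eps` guard against zero vectors are illustrative choices, not something this README specifies.

```python
import math


def l2_normalize(vec, eps=1e-12):
    """Scale vec to unit Euclidean length; reject zero or near-zero vectors."""
    n = math.sqrt(sum(x * x for x in vec))
    if n < eps:
        raise ValueError("cannot normalize a zero (or near-zero) vector")
    return [x / n for x in vec]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


query = [1.0, 0.0]
doc = [3.0, 4.0]
doc_scaled = [30.0, 40.0]  # same direction, ten times the magnitude

# Raw dot product rewards magnitude: the scaled copy scores ten times higher.
raw_scores = (dot(query, doc), dot(query, doc_scaled))

# After normalization both copies get the same score.
unit_scores = (
    dot(l2_normalize(query), l2_normalize(doc)),
    dot(l2_normalize(query), l2_normalize(doc_scaled)),
)
```

Apply the same function on the write path and the query path. Normalizing only one side reintroduces the magnitude bias.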

Why do tokenizers matter?
“Apple” vs “apple” may yield different vectors if one side lowercases and the other does not. The same applies when ingestion and query use different tokenizers.
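A simple safeguard is to route both ingestion and query text through one shared canonicalization function before embedding. The sketch below is an assumption about what such a function could look like. The Unicode NFKC step and whitespace collapsing go beyond what this README specifies, and the function name is hypothetical.

```python
import unicodedata


def canonicalize(text, lowercase=True):
    """Apply the same text cleanup on the write path and the query path."""
    # Fold compatibility characters (e.g. non-breaking spaces, full-width letters).
    text = unicodedata.normalize("NFKC", text)
    # Collapse runs of whitespace and trim the ends.
    text = " ".join(text.split())
    # casefold() is a more aggressive lower() intended for caseless matching.
    return text.casefold() if lowercase else text


ingested = canonicalize("  Apple\u00a0Pie ")
queried = canonicalize("apple pie")
```

Whether to lowercase at all depends on the model: a cased model may need `lowercase=False`. What matters is that both sides call the function with the same settings.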

What if coverage stays low after all fixes?
Check for fragmentation and duplication collapse. The issue may not be the embedding model itself, but how the index is populated.
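To check whether duplicates are crowding out distinct results, one option is a greedy pass that drops any vector too similar to one already kept. This is a minimal sketch under that assumption, not the procedure from the linked guide. The 0.95 threshold and the sample vectors are hypothetical, and the pairwise comparison is quadratic, so it suits audits on samples rather than full indexes.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def collapse_near_duplicates(items, threshold=0.95):
    """Keep an item only if its cosine to every kept item is below threshold.

    items is a list of (doc_id, vector) pairs; earlier items win ties.
    """
    kept = []
    for doc_id, vec in items:
        if all(cosine(vec, kept_vec) < threshold for _, kept_vec in kept):
            kept.append((doc_id, vec))
    return [doc_id for doc_id, _ in kept]


items = [
    ("a", [1.0, 0.0]),
    ("a_copy", [0.99, 0.01]),   # nearly identical direction to "a"
    ("b", [0.0, 1.0]),
]

unique_ids = collapse_near_duplicates(items)  # ["a", "b"]
```

If the kept list is much shorter than the input, duplication is a likely cause of low coverage: the top-k slots are being spent on copies of the same content.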

🔗 Quick-Start Downloads (60 sec)

Tool	Link	3-Step Setup
WFGY 1.0 PDF	Engine Paper	1️⃣ Download · 2️⃣ Upload to your LLM · 3️⃣ Ask “Answer using WFGY + <your question>”
TXT OS (plain-text OS)	TXTOS.txt	1️⃣ Download · 2️⃣ Paste into any LLM chat · 3️⃣ Type “hello world” — OS boots instantly

🧭 Explore More

Module	Description	Link
WFGY Core	WFGY 2.0 engine is live: full symbolic reasoning architecture and math stack	View →
Problem Map 1.0	Initial 16-mode diagnostic and symbolic fix framework	View →
Problem Map 2.0	RAG-focused failure tree, modular fixes, and pipelines	View →
Semantic Clinic Index	Expanded failure catalog: prompt injection, memory bugs, logic drift	View →
Semantic Blueprint	Layer-based symbolic reasoning & semantic modulations	View →
Benchmark vs GPT-5	Stress test GPT-5 with full WFGY reasoning suite	View →
🧙‍♂️ Starter Village 🏡	New here? Lost in symbols? Click here and let the wizard guide you through	Start →

👑 Early Stargazers: See the Hall of Fame —
Engineers, hackers, and open source builders who supported WFGY from day one.

⭐ WFGY Engine 2.0 is already unlocked. ⭐ Star the repo to help others discover it and unlock more on the Unlock Board.
