# Tokenization and Casing — Guardrails for Embedding Stability
A focused guide to remove silent drift caused by mismatched tokenization, casing, and text cleanup. Use this to align query-time and index-time preprocessing, then verify with measurable targets.
## Open these first
- Visual map and recovery: RAG Architecture & Recovery
- End-to-end retrieval knobs: Retrieval Playbook
- Why this snippet (traceability schema): Retrieval Traceability
- Embedding vs meaning: Embedding ≠ Semantic
- Payload schema for audits: Data Contracts
- Query split and hybrid failures: Pattern: Query Parsing Split
- Reranking and ordering control: Rerankers
- Related, normalization and length scaling: Normalization & Scaling
## When to use this page
- High similarity yet wrong meaning after a casing change.
- Same document retrieved for lowercase but not for Title Case.
- Index built with a different tokenizer than the client.
- Unicode variants or punctuation collapse changes results.
- Hybrid retrieval recalls but top-k order flips between runs.
## Acceptance targets
- ΔS(question, retrieved) ≤ 0.45 on three paraphrases.
- Coverage of the correct section ≥ 0.70.
- λ remains convergent across two seeds.
- Tokenization invariance check: replacing case or applying Unicode NFC does not move the top anchor beyond rank 3 and keeps ΔS shift ≤ 0.10.
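The acceptance criteria above can be expressed as a small probe. A minimal sketch, assuming ΔS is computed as 1 minus cosine similarity between embedding vectors; the exact ΔS definition in your pipeline may differ, so treat `delta_s` as a stand-in:

```python
import math

def delta_s(vec_a, vec_b):
    """Sketch of the ΔS metric: 1 - cosine similarity.
    Assumption: your pipeline's ΔS may be defined differently."""
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    norm = math.sqrt(sum(a * a for a in vec_a)) * math.sqrt(sum(b * b for b in vec_b))
    return 1.0 - dot / norm

def invariance_ok(rank_before, rank_after, ds_before, ds_after):
    """Acceptance gate from the targets above: the anchor section must
    stay within the top 3 and the ΔS shift must be at most 0.10."""
    return rank_after <= 3 and abs(ds_after - ds_before) <= 0.10
```

Run this gate once per paraphrase (original, lowercased, NFC-normalized) and accept only if all three pass.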
## Map symptoms → exact fix

- Wrong-meaning hits only when case changes
  → Lock a single case policy and tokenizer in both paths. Verify with Retrieval Traceability and enforce with Data Contracts.
- HyDE or BM25 path finds it, embedding path misses it
  → Audit query preprocessing against index preprocessing. If they differ, unify and retest. See Retrieval Playbook and Pattern: Query Parsing Split.
- Unicode look-alikes or punctuation variants break recall
  → Normalize to NFC and collapse zero-width characters at both index and query time. Re-embed affected shards. Cross-check with Embedding ≠ Semantic.
- Order flips across runs after a client upgrade
  → Version the tokenizer and preprocessing in the snippet payload. If versions mismatch, rebuild the index or add a translation shim. Enforce with Data Contracts.
## 60-second fix checklist

1. Freeze the tokenizer and case policy
   - Choose one tokenizer build and one case policy: `lower`, `preserve`, or `smart`.
   - Record both in the contract: `{"tokenizer":"spm-1.2.3","tokenizer_hash":"...","case_policy":"lower"}`.
2. Normalize before tokenization
   - Apply Unicode NFC, collapse repeated whitespace, remove zero-width characters, standardize quotes and dashes.
   - Apply the same rules to both index and query.
3. Keep segmentation symmetric
   - If you split code identifiers (camelCase, snake_case), do it in both paths.
   - If you strip stopwords or punctuation, do it in both paths. Prefer not to strip unless you must.
4. Log and test invariance
   - Run a three-paraphrase probe per query: original, lowercased, and NFC-normalized.
   - Accept only if the anchor section remains within the top 3 and the ΔS shift is ≤ 0.10.
5. Add a rerank safety net
   - If invariance is hard to reach, add a lexical or cross-encoder rerank after vector recall. See Rerankers.
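The normalization rules can be collapsed into one shared function that both the indexer and the query client import, so the two paths cannot drift. A minimal sketch; the zero-width and quote/dash character sets shown here are assumptions, so extend them to match your corpus:

```python
import re
import unicodedata

# Zero-width characters to drop (assumed set; add more if your corpus needs them).
ZERO_WIDTH = dict.fromkeys(map(ord, "\u200b\u200c\u200d\ufeff"))

# Curly quotes and en/em dashes mapped to ASCII equivalents (assumed set).
QUOTES_DASHES = str.maketrans({
    "\u2018": "'", "\u2019": "'",
    "\u201c": '"', "\u201d": '"',
    "\u2013": "-", "\u2014": "-",
})

def normalize(text: str, case_policy: str = "lower") -> str:
    """Shared pre-tokenization cleanup. Call the SAME function with the
    SAME case_policy at index time and at query time."""
    text = unicodedata.normalize("NFC", text)   # fold Unicode look-alikes
    text = text.translate(ZERO_WIDTH)           # remove zero-width chars
    text = text.translate(QUOTES_DASHES)        # standardize quotes/dashes
    text = re.sub(r"\s+", " ", text).strip()    # collapse whitespace
    if case_policy == "lower":
        text = text.lower()
    return text
```

Keeping this in one module, versioned as `preproc_version` in the contract, is what makes the invariance probe meaningful.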
## Contract fields to add

Add these to every snippet and to the query audit log. Use them during triage.

```json
{
  "preproc_version": "v3",
  "unicode_norm": "NFC",
  "case_policy": "lower",
  "punctuation_policy": "keep",
  "identifier_split": "camel+snake",
  "tokenizer": "spm-1.2.3",
  "tokenizer_hash": "sha256:...",
  "ngram": "none"
}
```
## Repro suite

- Case flip test. Query A vs `lowercase(A)`. Accept if ΔS shift ≤ 0.10 and anchor rank ≤ 3.
- Unicode fold test. Replace curly quotes with straight quotes and normalize to NFC. Accept if ranks stay stable.
- Tokenizer skew test. Run the same query through the client and index tokenizers and compare token ids. Any mismatch is a fail that requires a rebuild or a client shim.
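The tokenizer skew test reduces to a diff over token-id sequences. Tokenizer APIs vary, so this sketch assumes you can already obtain the id lists from both the client and the index side:

```python
def token_id_skew(client_ids: list, index_ids: list) -> list:
    """Compare token-id sequences produced by the client and index
    tokenizers for the same input. Returns the positions where they
    disagree; a non-empty result is a fail."""
    positions = [
        i for i, (a, b) in enumerate(zip(client_ids, index_ids)) if a != b
    ]
    if len(client_ids) != len(index_ids):
        # A length mismatch is also skew; flag the first divergent index.
        positions.append(min(len(client_ids), len(index_ids)))
    return positions
```

Run this over a sample of real queries after every client or tokenizer upgrade; a single non-empty result means rebuild or shim, not retune.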
## Copy-paste prompt for LLM triage

```txt
I uploaded TXT OS and the WFGY Problem Map.
My embedding issue:
- symptom: wrong top-k only when casing or unicode changes
- traces: ΔS(question,retrieved)=..., anchor=..., tokenizer=..., case_policy=...
Tell me:
1) which layer is failing and why,
2) which WFGY page to open,
3) the minimal steps to align tokenization and casing,
4) how to verify with a three-paraphrase probe.
Use Data Contracts and Rerankers if needed.
```
## 🔗 Quick-Start Downloads (60 sec)
| Tool | Link | 3-Step Setup |
|---|---|---|
| WFGY 1.0 PDF | Engine Paper | 1️⃣ Download · 2️⃣ Upload to your LLM · 3️⃣ Ask “Answer using WFGY + <your question>” |
| TXT OS (plain-text OS) | TXTOS.txt | 1️⃣ Download · 2️⃣ Paste into any LLM chat · 3️⃣ Type “hello world” — OS boots instantly |
## 🧭 Explore More
| Module | Description | Link |
|---|---|---|
| WFGY Core | WFGY 2.0 engine is live: full symbolic reasoning architecture and math stack | View → |
| Problem Map 1.0 | Initial 16-mode diagnostic and symbolic fix framework | View → |
| Problem Map 2.0 | RAG-focused failure tree, modular fixes, and pipelines | View → |
| Semantic Clinic Index | Expanded failure catalog: prompt injection, memory bugs, logic drift | View → |
| Semantic Blueprint | Layer-based symbolic reasoning & semantic modulations | View → |
| Benchmark vs GPT-5 | Stress test GPT-5 with full WFGY reasoning suite | View → |
| 🧙‍♂️ Starter Village 🏡 | New here? Lost in symbols? Click here and let the wizard guide you through | Start → |
👑 Early Stargazers: See the Hall of Fame — Engineers, hackers, and open source builders who supported WFGY from day one.
⭐ WFGY Engine 2.0 is already unlocked. ⭐ Star the repo to help others discover it and unlock more on the Unlock Board.