# Tokenizer Mismatch — Guardrails and Fix Pattern

A focused fix for pipelines where the embedder, retriever, reranker, and generator do not share the same tokenization or normalization rules. Use this page to localize the failure, align the text pipeline, and verify against measurable targets.

## Open these first

## When to use this page

- High similarity to the right document but the wrong snippet or misaligned offsets.
- The same query returns different top-k results after a re-index or provider switch.
- Citations do not line up with visible tokens in CJK or Indic scripts.
- Mixed-width or composed characters behave inconsistently after export.
- The reranker improves precision but answers still drift in long chains.

## Acceptance targets

- ΔS(question, retrieved) ≤ 0.45 on three paraphrases.
- Coverage of the target section ≥ 0.70 with stable offsets.
- λ remains convergent across two seeds after the tokenizer lock.
- Snippet offsets map to visible glyphs after the NFC or NFKC pass.

## 60-second fix checklist

1. **Identify the tokenizers in play.**
   Record each stage: embedder, store analyzer, reranker, generator. Note version, normalization, casing, and segmentation rules.

2. **Normalize once, early.**
   Apply one canonical pass to both the corpus and the queries. Pick NFC for general Latin scripts; pick NFKC when full-width forms, compatibility characters, or half-width punctuation appear. Keep the same pass for corpus and queries (see the normalization sketch after this checklist).
   See: normalization_and_scaling.md

3. **Lock the casing strategy.**
   Either preserve case end to end or lowercase both sides before embedding. Do not mix the two.
   See: tokenization_and_casing.md

4. **Unify the segmenter.**
   CJK and Thai cannot rely on whitespace. Use the same segmenter for chunking and for query pre-processing, then validate offsets after segmentation (see the segmenter sketch after this checklist).

5. **Version the tokenizer.**
   Store TOKENIZER_FAMILY, TOKENIZER_VERSION, and NORM_PASS inside the snippet metadata, and reject inserts that do not match (see the contract sketch after this checklist).
   Spec fields live in: data-contracts.md

6. **Rebuild the index if needed.**
   If ΔS stays high and offsets remain unstable, rebuild with the aligned tokenizer and normalization, then verify against a small gold set.
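
A minimal sketch of the single-pass rule from steps 2 and 3, in Python, using only the standard `unicodedata` module. `NORM_PASS`, `LOWERCASE`, and `canonicalize` are illustrative names, not part of any WFGY or store API; wire the same function into both the indexing path and the query path.

```python
import unicodedata

NORM_PASS = "NFKC"   # "NFC" for mostly-Latin corpora; "NFKC" when width or compatibility forms appear
LOWERCASE = True     # lock one casing strategy and apply it on both sides

def canonicalize(text: str) -> str:
    """The single canonical pass: normalize once, early, for corpus and queries alike."""
    out = unicodedata.normalize(NORM_PASS, text)
    out = out.replace("\u00ad", "")   # drop soft hyphens left behind by PDF export
    if LOWERCASE:
        out = out.casefold()          # casefold covers full Unicode casing, not just ASCII
    return out

# NFKC folds full-width Latin, so both sides now embed identically.
assert canonicalize("Ｔｏｋｙｏ") == canonicalize("Tokyo")
```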
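For step 4, a sketch assuming PyICU as the shared segmenter family; `segment` is a hypothetical helper, and the point is that one break-rule set drives both chunking and query prep, with offsets validated right after.

```python
import icu  # PyICU bindings for ICU

def segment(text: str, locale: str = "th") -> list[tuple[int, int]]:
    """Word-break offsets from one ICU rule set, shared by chunker and query prep."""
    bi = icu.BreakIterator.createWordInstance(icu.Locale(locale))
    bi.setText(text)
    spans, start = [], bi.first()
    for end in bi:                 # iterate the boundary positions
        spans.append((start, end))
        start = end
    return spans

# Offset validation: the spans must tile the text with no gaps,
# so start_offset and end_offset always point at visible glyphs.
text = "ตัวอย่างข้อความภาษาไทย"
spans = segment(text)
assert "".join(text[s:e] for s, e in spans) == text
```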
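For step 5, a sketch of the contract check; the expected values below are placeholders, and the authoritative field list lives in data-contracts.md.

```python
EXPECTED = {
    "TOKENIZER_FAMILY": "icu-word",   # placeholder values: pin whatever
    "TOKENIZER_VERSION": "74.2",      # your pipeline actually runs
    "NORM_PASS": "NFKC",
}

def accept_snippet(meta: dict) -> None:
    """Reject inserts whose tokenizer lineage does not match the live index."""
    for key, want in EXPECTED.items():
        got = meta.get(key)
        if got != want:
            raise ValueError(f"tokenizer mismatch on {key}: {got!r} != {want!r}")

accept_snippet({"TOKENIZER_FAMILY": "icu-word",
                "TOKENIZER_VERSION": "74.2",
                "NORM_PASS": "NFKC"})   # matching metadata passes silently
```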


## Symptom map → exact fix

| Symptom | Likely cause | Open this |
|---|---|---|
| Citations jump inside CJK lines | chunker uses character windows while the retriever uses wordpiece | retrieval-traceability.md · tokenization_and_casing.md |
| Wrong-meaning hits with high cosine | incompatible normalization between corpus and query | embedding-vs-semantic.md · normalization_and_scaling.md |
| BM25 improves but hybrid gets worse | query split by mixed scripts or width | script_mixing.md · retrieval-playbook.md |
| Offsets do not align after PDF export | composed characters or soft-hyphen artifacts | retrieval-traceability.md |
| Answers flip between runs | prompt headers reorder and λ becomes variant | context-drift.md |

## Deep checks

- **Normalization audit.**
  Log a 1k-token sample from the corpus and the queries. Count deltas between NFC and NFKC, and reject mismatches above 0.5 percent (see the audit sketch after this list).

- **Width and compatibility scan.**
  Count full-width Latin, half-width katakana, ligatures, ZWJ, and soft hyphens. Normalize or strip them consistently.

- **Segmenter parity.**
  For CJK, Thai, Khmer, and Lao, use the same dictionary for chunking and for query prep. Verify that start_offset and end_offset point to visible glyphs.

- **Analyzer parity in the store.**
  If the store applies analyzers, make them explicit. For Elastic or OpenSearch, pin the analyzer in the index template and document it in the snippet schema.

- **Reranker bridge.**
  If the retriever is sparse and the reranker is dense, ensure identical normalization happens before both; otherwise reranker scores become unstable.
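
A sketch of the normalization audit, assuming the sample is already tokenized; the 0.5 percent gate mirrors the threshold in the first bullet.

```python
import unicodedata

def nfc_nfkc_delta(tokens: list[str]) -> float:
    """Fraction of tokens whose NFC and NFKC forms disagree."""
    changed = sum(
        1 for t in tokens
        if unicodedata.normalize("NFC", t) != unicodedata.normalize("NFKC", t)
    )
    return changed / max(len(tokens), 1)

sample = ["ﬁle", "Ｔｏｋｙｏ", "plain"]   # ligature and full-width forms disagree under NFKC
delta = nfc_nfkc_delta(sample)
if delta > 0.005:                          # the 0.5 percent gate
    print(f"normalization delta {delta:.2%} exceeds the gate; align the pass first")
```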


## Minimal reproducible test

1. Pick three paraphrases of the same question.
2. For each, compute ΔS(question, retrieved) and record the λ state (a scripted sketch follows this list).
3. Inspect offsets on the top snippet. Confirm visual alignment after normalization.
4. Targets: ΔS ≤ 0.45 and λ convergent on two seeds, with coverage ≥ 0.70 for the correct section.
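
A scripted sketch of this test, assuming ΔS is computed as 1 minus the cosine similarity between question and snippet embeddings; check the Problem Map for the canonical definition. The `embed` mentioned in the closing comment stands in for your embedder and is hypothetical.

```python
import numpy as np

def delta_s(q_vec: np.ndarray, s_vec: np.ndarray) -> float:
    """ΔS as 1 - cosine similarity; an assumption, not the canonical WFGY formula."""
    cos = float(np.dot(q_vec, s_vec)
                / (np.linalg.norm(q_vec) * np.linalg.norm(s_vec)))
    return 1.0 - cos

def passes(paraphrase_vecs: list[np.ndarray], snippet_vec: np.ndarray,
           coverage: float) -> bool:
    """Gate on the acceptance targets: ΔS ≤ 0.45 on all paraphrases, coverage ≥ 0.70."""
    return (all(delta_s(q, snippet_vec) <= 0.45 for q in paraphrase_vecs)
            and coverage >= 0.70)

# Run once per seed with embed(...) as your embedder; λ must stay
# convergent across both runs for the fix to count.
```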

## Copy-paste prompt


You have TXT OS and WFGY Problem Map loaded.

My tokenizer issue:

* corpus normalization: NFC or NFKC?
* segmenter family and version for chunking vs query
* store analyzer and reranker tokenizer
* symptom: offsets drift, wrong-meaning hits, hybrid instability

Tell me:

1. the failing layer and why,
2. the exact WFGY pages to open,
3. the minimal steps to align tokenizer, normalization, and analyzers,
4. a short test to verify with ΔS ≤ 0.45 and stable offsets.


## 🔗 Quick-Start Downloads (60 sec)

| Tool | Link | 3-Step Setup |
|---|---|---|
| WFGY 1.0 PDF | Engine Paper | 1. Download · 2. Upload to your LLM · 3. Ask “Answer using WFGY + <your question>” |
| TXT OS (plain-text OS) | TXTOS.txt | 1. Download · 2. Paste into any LLM chat · 3. Type “hello world” — OS boots instantly |

## 🧭 Explore More

| Module | Description | Link |
|---|---|---|
| WFGY Core | WFGY 2.0 engine is live: full symbolic reasoning architecture and math stack | View → |
| Problem Map 1.0 | Initial 16-mode diagnostic and symbolic fix framework | View → |
| Problem Map 2.0 | RAG-focused failure tree, modular fixes, and pipelines | View → |
| Semantic Clinic Index | Expanded failure catalog: prompt injection, memory bugs, logic drift | View → |
| Semantic Blueprint | Layer-based symbolic reasoning and semantic modulations | View → |
| Benchmark vs GPT-5 | Stress test GPT-5 with the full WFGY reasoning suite | View → |
| 🧙‍♂️ Starter Village 🏡 | New here? Lost in symbols? Let the wizard guide you through | Start → |
