# Tokenizer Mismatch — Guardrails and Fix Pattern

A focused fix for pipelines where the embedder, retriever, reranker, and generator do not share the same tokenization or normalization rules. Use this page to localize the failure, align the text pipeline, and verify against measurable targets.

## Open these first

## When to use this page

- High similarity to the right document but the wrong snippet or misaligned offsets.
- The same query returns different top-k results after a re-index or provider switch.
- Citations do not line up with visible tokens in CJK or Indic scripts.
- Mixed-width or composed characters behave inconsistently after export.
- The reranker improves precision but answers still drift in long chains.

## Acceptance targets

- ΔS(question, retrieved) ≤ 0.45 on three paraphrases.
- Coverage of the target section ≥ 0.70 with stable offsets.
- λ remains convergent across two seeds after the tokenizer lock.
- Snippet offsets map to visible glyphs after the NFC or NFKC pass.

## 60-second fix checklist

1. **Identify the tokenizers in play.**
   Record each stage: embedder, store analyzer, reranker, generator. Note version, normalization, casing, and segmentation rules.

2. **Normalize once, early.**
   Apply one canonical pass to both the corpus and the queries. Pick NFC for general Latin scripts; pick NFKC when full-width forms, compatibility characters, or half-width punctuation appear. Keep the same pass for corpus and queries (see the normalization sketch after this checklist).
   See: normalization_and_scaling.md

3. **Lock the casing strategy.**
   Either preserve case end to end or lowercase both sides before embedding. Do not mix the two.
   See: tokenization_and_casing.md

4. **Unify the segmenter.**
   CJK and Thai cannot rely on whitespace. Use the same segmenter for chunking and for query pre-processing, then validate offsets after segmentation (see the segmenter sketch after this checklist).

5. **Version the tokenizer.**
   Store TOKENIZER_FAMILY, TOKENIZER_VERSION, and NORM_PASS inside the snippet metadata, and reject inserts that do not match (see the contract sketch after this checklist).
   Spec fields live in: data-contracts.md

6. **Rebuild the index if needed.**
   If ΔS stays high and offsets remain unstable, rebuild with the aligned tokenizer and normalization, then verify against a small gold set.
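
A minimal sketch of the single-pass rule from steps 2 and 3, in Python, using only the standard `unicodedata` module. `NORM_PASS`, `LOWERCASE`, and `canonicalize` are illustrative names, not part of any WFGY or store API; wire the same function into both the indexing path and the query path.

```python
import unicodedata

NORM_PASS = "NFKC"   # "NFC" for mostly-Latin corpora; "NFKC" when width or compatibility forms appear
LOWERCASE = True     # lock one casing strategy and apply it on both sides

def canonicalize(text: str) -> str:
    """The single canonical pass: normalize once, early, for corpus and queries alike."""
    out = unicodedata.normalize(NORM_PASS, text)
    out = out.replace("\u00ad", "")   # drop soft hyphens left behind by PDF export
    if LOWERCASE:
        out = out.casefold()          # casefold covers full Unicode casing, not just ASCII
    return out

# NFKC folds full-width Latin, so both sides now embed identically.
assert canonicalize("Ｔｏｋｙｏ") == canonicalize("Tokyo")
```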
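For step 4, a sketch assuming PyICU as the shared segmenter family; `segment` is a hypothetical helper, and the point is that one break-rule set drives both chunking and query prep, with offsets validated right after.

```python
import icu  # PyICU bindings for ICU

def segment(text: str, locale: str = "th") -> list[tuple[int, int]]:
    """Word-break offsets from one ICU rule set, shared by chunker and query prep."""
    bi = icu.BreakIterator.createWordInstance(icu.Locale(locale))
    bi.setText(text)
    spans, start = [], bi.first()
    for end in bi:                 # iterate the boundary positions
        spans.append((start, end))
        start = end
    return spans

# Offset validation: the spans must tile the text with no gaps,
# so start_offset and end_offset always point at visible glyphs.
text = "ตัวอย่างข้อความภาษาไทย"
spans = segment(text)
assert "".join(text[s:e] for s, e in spans) == text
```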
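For step 5, a sketch of the contract check; the expected values below are placeholders, and the authoritative field list lives in data-contracts.md.

```python
EXPECTED = {
    "TOKENIZER_FAMILY": "icu-word",   # placeholder values: pin whatever
    "TOKENIZER_VERSION": "74.2",      # your pipeline actually runs
    "NORM_PASS": "NFKC",
}

def accept_snippet(meta: dict) -> None:
    """Reject inserts whose tokenizer lineage does not match the live index."""
    for key, want in EXPECTED.items():
        got = meta.get(key)
        if got != want:
            raise ValueError(f"tokenizer mismatch on {key}: {got!r} != {want!r}")

accept_snippet({"TOKENIZER_FAMILY": "icu-word",
                "TOKENIZER_VERSION": "74.2",
                "NORM_PASS": "NFKC"})   # matching metadata passes silently
```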


## Symptom map → exact fix

| Symptom | Likely cause | Open this |
|---|---|---|
| Citations jump inside CJK lines | chunker uses character windows while the retriever uses wordpiece | retrieval-traceability.md · tokenization_and_casing.md |
| Wrong-meaning hits with high cosine | incompatible normalization between corpus and query | embedding-vs-semantic.md · normalization_and_scaling.md |
| BM25 improves but hybrid gets worse | query split by mixed scripts or width | script_mixing.md · retrieval-playbook.md |
| Offsets do not align after PDF export | composed characters or soft-hyphen artifacts | retrieval-traceability.md |
| Answers flip between runs | prompt headers reorder and λ becomes variant | context-drift.md |

## Deep checks

- **Normalization audit.**
  Log a 1k-token sample from the corpus and the queries. Count deltas between NFC and NFKC, and reject mismatches above 0.5 percent (see the audit sketch after this list).

- **Width and compatibility scan.**
  Count full-width Latin, half-width katakana, ligatures, ZWJ, and soft hyphens. Normalize or strip them consistently.

- **Segmenter parity.**
  For CJK, Thai, Khmer, and Lao, use the same dictionary for chunking and for query prep. Verify that start_offset and end_offset point to visible glyphs.

- **Analyzer parity in the store.**
  If the store applies analyzers, make them explicit. For Elastic or OpenSearch, pin the analyzer in the index template and document it in the snippet schema.

- **Reranker bridge.**
  If the retriever is sparse and the reranker is dense, ensure identical normalization happens before both; otherwise reranker scores become unstable.
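
A sketch of the normalization audit, assuming the sample is already tokenized; the 0.5 percent gate mirrors the threshold in the first bullet.

```python
import unicodedata

def nfc_nfkc_delta(tokens: list[str]) -> float:
    """Fraction of tokens whose NFC and NFKC forms disagree."""
    changed = sum(
        1 for t in tokens
        if unicodedata.normalize("NFC", t) != unicodedata.normalize("NFKC", t)
    )
    return changed / max(len(tokens), 1)

sample = ["ﬁle", "Ｔｏｋｙｏ", "plain"]   # ligature and full-width forms disagree under NFKC
delta = nfc_nfkc_delta(sample)
if delta > 0.005:                          # the 0.5 percent gate
    print(f"normalization delta {delta:.2%} exceeds the gate; align the pass first")
```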


## Minimal reproducible test

1. Pick three paraphrases of the same question.
2. For each, compute ΔS(question, retrieved) and record the λ state (a scripted sketch follows this list).
3. Inspect offsets on the top snippet. Confirm visual alignment after normalization.
4. Targets: ΔS ≤ 0.45 and λ convergent on two seeds, with coverage ≥ 0.70 for the correct section.
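
A scripted sketch of this test, assuming ΔS is computed as 1 minus the cosine similarity between question and snippet embeddings; check the Problem Map for the canonical definition. The `embed` mentioned in the closing comment stands in for your embedder and is hypothetical.

```python
import numpy as np

def delta_s(q_vec: np.ndarray, s_vec: np.ndarray) -> float:
    """ΔS as 1 - cosine similarity; an assumption, not the canonical WFGY formula."""
    cos = float(np.dot(q_vec, s_vec)
                / (np.linalg.norm(q_vec) * np.linalg.norm(s_vec)))
    return 1.0 - cos

def passes(paraphrase_vecs: list[np.ndarray], snippet_vec: np.ndarray,
           coverage: float) -> bool:
    """Gate on the acceptance targets: ΔS ≤ 0.45 on all paraphrases, coverage ≥ 0.70."""
    return (all(delta_s(q, snippet_vec) <= 0.45 for q in paraphrase_vecs)
            and coverage >= 0.70)

# Run once per seed with embed(...) as your embedder; λ must stay
# convergent across both runs for the fix to count.
```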

## Copy-paste prompt


You have TXT OS and WFGY Problem Map loaded.

My tokenizer issue:

* corpus normalization: NFC or NFKC?
* segmenter family and version for chunking vs query
* store analyzer and reranker tokenizer
* symptom: offsets drift, wrong-meaning hits, hybrid instability

Tell me:

1. the failing layer and why,
2. the exact WFGY pages to open,
3. the minimal steps to align tokenizer, normalization, and analyzers,
4. a short test to verify with ΔS ≤ 0.45 and stable offsets.


## 🔗 Quick-Start Downloads (60 sec)

| Tool | Link | 3-Step Setup |
|---|---|---|
| WFGY 1.0 PDF | Engine Paper | 1. Download · 2. Upload to your LLM · 3. Ask “Answer using WFGY + <your question>” |
| TXT OS (plain-text OS) | TXTOS.txt | 1. Download · 2. Paste into any LLM chat · 3. Type “hello world” — OS boots instantly |

## 🧭 Explore More

| Module | Description | Link |
|---|---|---|
| WFGY Core | WFGY 2.0 engine is live: full symbolic reasoning architecture and math stack | View → |
| Problem Map 1.0 | Initial 16-mode diagnostic and symbolic fix framework | View → |
| Problem Map 2.0 | RAG-focused failure tree, modular fixes, and pipelines | View → |
| Semantic Clinic Index | Expanded failure catalog: prompt injection, memory bugs, logic drift | View → |
| Semantic Blueprint | Layer-based symbolic reasoning and semantic modulations | View → |
| Benchmark vs GPT-5 | Stress test GPT-5 with the full WFGY reasoning suite | View → |
| 🧙‍♂️ Starter Village 🏡 | New here? Lost in symbols? Let the wizard guide you through | Start → |
