Unicode Normalization — Guardrails and Fix Pattern
A focused fix for NFC vs NFD vs NFKC drift, width and compatibility forms, and double-normalized text that breaks retrieval, dedupe, and citation offsets.
When to use
- Same word appears twice with different byte forms. Tokens do not match across pipelines.
- High similarity yet wrong matches for accented strings.
- Citations point to wrong character offsets after rendering.
- Mixed full-width and half-width forms, ZWJ or combining marks causing “near duplicates.”
Open these first
- RAG end-to-end knobs → Retrieval Playbook
- Wrong-meaning hits despite high similarity → Embedding ≠ Semantic
- OCR or export drift before embeddings → OCR Parsing Checklist
- Related locale topics → Diacritics and Folding · Digits, Width, Punctuation · Tokenizer Mismatch
Acceptance targets
- ΔS(question, retrieved) ≤ 0.45 across NFC and NFD variants of the same query.
- Coverage of target section ≥ 0.70 after normalization pass.
- λ remains convergent for three paraphrases and two seeds.
- Offset accuracy ≥ 0.99 when mapping citations from normalized text back to raw text.
Fix checklist
- Pick one normalization form per layer
- Storage and search keys: NFKC for maximal unification of width and compatibility forms.
- Human-facing content and display: NFC to preserve intent while keeping codepoints stable.
- Do not normalize code blocks or hashes. Tag code spans and skip them.
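A minimal sketch of the per-layer rule using Python's `unicodedata`; the layer names are illustrative, not a fixed API.

```python
import unicodedata

def normalize_for_layer(text: str, layer: str) -> str:
    """One fixed normalization form per layer (layer names are assumptions)."""
    if layer == "search_key":   # storage and search keys: maximal unification
        return unicodedata.normalize("NFKC", text)
    if layer == "display":      # human-facing content: stable codepoints
        return unicodedata.normalize("NFC", text)
    if layer == "code":         # code spans and hashes: never touch
        return text
    raise ValueError(f"unknown layer: {layer}")
```

The point is that the form is chosen once per layer and nowhere else, so no caller can mix forms.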
- Record the contract
Add to your snippet schema:

```
norm_form: NFC|NFD|NFKC|NFKD
norm_version: ICU/Unicode version
offsets: {raw_start, raw_end, norm_start, norm_end}
```

See schema ideas in → Retrieval Traceability · Data Contracts
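A sketch of stamping that contract onto a snippet. Field names come from the schema above; `snippet_contract` and the single-span offsets are illustrative, and `unicodedata.unidata_version` reports the Unicode data version bundled with Python rather than an ICU version.

```python
import unicodedata

def snippet_contract(raw: str, form: str = "NFC") -> dict:
    """Attach the normalization contract to a snippet (illustrative helper)."""
    norm = unicodedata.normalize(form, raw)
    return {
        "norm_form": form,
        "norm_version": unicodedata.unidata_version,  # Unicode data version in this Python
        "offsets": {"raw_start": 0, "raw_end": len(raw),
                    "norm_start": 0, "norm_end": len(norm)},
    }
```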
- Normalize on ingestion, not only at query time
- Normalize and index once. Persist an INDEX_HASH that includes the normalization form and version.
- If you normalize only at query time, ΔS will drift for long contexts and k-ordering becomes unstable.
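One way to build such an INDEX_HASH, sketched under the assumption that the corpus fits in a list of strings: the normalization form and Unicode version are hashed alongside the normalized text, so changing either forces a rebuild instead of silently serving a mixed index.

```python
import hashlib
import unicodedata

def index_hash(doc_texts: list[str], form: str = "NFKC") -> str:
    """Hash the normalized corpus plus its normalization contract."""
    h = hashlib.sha256()
    # Seed with form and Unicode version so a contract change changes the hash.
    h.update(f"{form}:{unicodedata.unidata_version}".encode())
    for t in doc_texts:
        h.update(unicodedata.normalize(form, t).encode("utf-8"))
    return h.hexdigest()
```

Two byte-different spellings of the same text hash identically, while the same text under a different form does not.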
- Preserve both raw and normalized text
- Store raw bytes for audit and exact reproduction.
- Store normalized text for retrieval and rerank.
- Maintain a fast map raw↔norm for citations.
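A sketch of the raw↔norm map, built one combining sequence at a time (base character plus trailing marks). It assumes combining-sequence boundaries are normalization-stable, which holds for typical text but not for every edge case, so treat it as a starting point rather than a complete mapper.

```python
import unicodedata

def raw_to_norm_map(raw: str, form: str = "NFC") -> list[tuple[int, int]]:
    """(raw_offset, norm_offset) pairs at each combining-sequence boundary,
    plus a final sentinel pair for end offsets."""
    spans = []
    i, n_off = 0, 0
    while i < len(raw):
        j = i + 1
        while j < len(raw) and unicodedata.combining(raw[j]):
            j += 1                                   # absorb trailing marks
        cluster = unicodedata.normalize(form, raw[i:j])
        spans.append((i, n_off))                     # boundary in both texts
        i, n_off = j, n_off + len(cluster)
    spans.append((len(raw), n_off))                  # sentinel: end offsets
    return spans
```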
- Unify width and compatibility sets
- Collapse full-width ASCII, half-width katakana, superscripts, circled numbers under NFKC for keys.
- Keep the visible form for UI by rendering raw text, not the NFKC key.
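A small demonstration of the key/display split: the NFKC key collapses width and compatibility variants, while the raw string is what the UI renders.

```python
import unicodedata

def search_key(raw: str) -> str:
    """NFKC key for indexing; the raw text stays untouched for display."""
    return unicodedata.normalize("NFKC", raw)

# Full-width ASCII, half-width katakana, and circled numbers all collapse:
#   search_key("\uFF21\uFF22\uFF23") == "ABC"    full-width letters
#   search_key("\uFF76")             == "\u30AB" half-width katakana ka
#   search_key("\u2460")             == "1"      circled digit one
```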
- Prevent double normalization
- Mark a boolean is_normalized.
- Gate every downstream step with a one-line check to avoid repeated passes that shift offsets.
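The gate can be one line around `unicodedata.is_normalized` (Python 3.8+). A sketch that also returns the flag, so downstream steps know whether an offset remap is needed; the tuple interface is an assumption, not a fixed API.

```python
import unicodedata

def normalize_once(text: str, form: str = "NFC") -> tuple[str, bool]:
    """Return (text, pass_ran). Skips the pass when the text is already
    in the target form, so earlier offset maps are never shifted twice."""
    if unicodedata.is_normalized(form, text):  # Python 3.8+
        return text, False
    return unicodedata.normalize(form, text), True
```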
- Chunk after normalization
- Chunk boundaries must be computed on the normalized string to keep k-NN neighborhoods stable.
→ Chunking Checklist
Verification
- Twin query probe: run the same question in NFC and NFD. Top-k overlap ≥ 0.8 and ΔS difference ≤ 0.05.
- Anchor triangulation: compare ΔS to the correct anchor and a decoy paragraph. After normalization, the correct anchor must win with a margin ≥ 0.15.
- Offset audit: pick ten citations with combining marks. Validate that raw offsets and UI highlights match exactly.
- k-ordering stability: for k in {5, 10, 20}, the normalized index keeps the same top-3 order or differs by at most one position.
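The twin query probe can be sketched as below; `search(query, k) -> ranked id list` is a hypothetical interface standing in for your retriever.

```python
import unicodedata

def twin_query_overlap(query: str, search, k: int = 10) -> float:
    """Run the same query in NFC and NFD through a caller-supplied
    search(query, k) and report top-k overlap. Target: >= 0.8."""
    nfc = unicodedata.normalize("NFC", query)
    nfd = unicodedata.normalize("NFD", query)
    a, b = set(search(nfc, k)), set(search(nfd, k))
    return len(a & b) / k
```

A retriever that normalizes on ingestion scores 1.0 here; one that keys on raw bytes scores near 0 for accented queries.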
Minimal test set you can copy
- “café” with precomposed é vs e + combining acute.
- Full-width “ＡＢＣ１２３” vs ASCII “ABC123”.
- Half-width katakana vs standard katakana.
- Arabic text with tatweel and optional diacritics.
- Greek polytonic marks.
- Hangul precomposed syllables vs Jamo sequences.
If any pair fails the twin query probe or offset audit, rebuild with the chosen normalization form and re-verify.
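The copyable pairs above, as code. The Arabic tatweel and Greek polytonic cases need locale-specific handling beyond a plain normalization form, so this sketch covers only the pairs a standard form is guaranteed to unify.

```python
import unicodedata

# (variant_a, variant_b, form that should unify them)
PAIRS = [
    ("caf\u00e9", "cafe\u0301", "NFC"),        # precomposed vs combining acute
    ("\uFF21\uFF22\uFF23\uFF11\uFF12\uFF13",
     "ABC123", "NFKC"),                        # full-width vs ASCII
    ("\uFF76\uFF80\uFF76\uFF85",
     "\u30AB\u30BF\u30AB\u30CA", "NFKC"),      # half-width vs standard katakana
    ("\uAC00", "\u1100\u1161", "NFC"),         # Hangul syllable vs Jamo sequence
]

def check(pairs=PAIRS) -> bool:
    """True when every pair collapses to the same key under its form."""
    return all(unicodedata.normalize(f, a) == unicodedata.normalize(f, b)
               for a, b, f in pairs)
```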
Escalate when
- ΔS remains ≥ 0.60 after normalization and chunk rebuild → Retrieval Playbook
- High recall but messy k-ordering persists → Rerankers
👑 Early Stargazers: See the Hall of Fame —
Engineers, hackers, and open source builders who supported WFGY from day one.
⭐ WFGY Engine 2.0 is already unlocked. ⭐ Star the repo to help others discover it and unlock more on the Unlock Board.