Unicode Normalization — Guardrails and Fix Pattern
A focused fix for NFC vs NFD vs NFKC drift, width and compatibility forms, and double-normalized text that breaks retrieval, dedupe, and citation offsets.
When to use
- Same word appears twice with different byte forms. Tokens do not match across pipelines.
- High similarity yet wrong matches for accented strings.
- Citations point to wrong character offsets after rendering.
- Mixed full-width and half-width forms, ZWJ or combining marks causing “near duplicates.”
Open these first
- RAG end-to-end knobs → Retrieval Playbook
- Wrong-meaning hits despite high similarity → Embedding ≠ Semantic
- OCR or export drift before embeddings → OCR Parsing Checklist
- Related locale topics → Diacritics and Folding · Digits, Width, Punctuation · Tokenizer Mismatch
Acceptance targets
- ΔS(question, retrieved) ≤ 0.45 across NFC and NFD variants of the same query.
- Coverage of target section ≥ 0.70 after normalization pass.
- λ remains convergent for three paraphrases and two seeds.
- Offset accuracy ≥ 0.99 when mapping citations from normalized text back to raw text.
Fix checklist
- Pick one normalization form per layer
- Storage and search keys: NFKC for maximal unification of width and compatibility forms.
- Human-facing content and display: NFC to preserve intent while keeping codepoints stable.
- Do not normalize code blocks or hashes. Tag code spans and skip them.
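A minimal sketch of the per-layer rule using Python's `unicodedata`; the layer names are illustrative, not a fixed API.

```python
import unicodedata

def normalize_for_layer(text: str, layer: str) -> str:
    """One fixed normalization form per layer (layer names are assumptions)."""
    if layer == "search_key":   # storage and search keys: maximal unification
        return unicodedata.normalize("NFKC", text)
    if layer == "display":      # human-facing content: stable codepoints
        return unicodedata.normalize("NFC", text)
    if layer == "code":         # code spans and hashes: never touch
        return text
    raise ValueError(f"unknown layer: {layer}")
```

The point is that the form is chosen once per layer and nowhere else, so no caller can mix forms.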
- Record the contract
Add to your snippet schema:

```
norm_form: NFC|NFD|NFKC|NFKD
norm_version: ICU/Unicode version
offsets: {raw_start, raw_end, norm_start, norm_end}
```

See schema ideas in → Retrieval Traceability · Data Contracts
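A sketch of stamping that contract onto a snippet. Field names come from the schema above; `snippet_contract` and the single-span offsets are illustrative, and `unicodedata.unidata_version` reports the Unicode data version bundled with Python rather than an ICU version.

```python
import unicodedata

def snippet_contract(raw: str, form: str = "NFC") -> dict:
    """Attach the normalization contract to a snippet (illustrative helper)."""
    norm = unicodedata.normalize(form, raw)
    return {
        "norm_form": form,
        "norm_version": unicodedata.unidata_version,  # Unicode data version in this Python
        "offsets": {"raw_start": 0, "raw_end": len(raw),
                    "norm_start": 0, "norm_end": len(norm)},
    }
```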
- Normalize on ingestion, not only at query time
- Normalize and index once. Persist an INDEX_HASH that includes the normalization form and version.
- If you normalize only at query time, ΔS will drift for long contexts and k-ordering becomes unstable.
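One way to build such an INDEX_HASH, sketched under the assumption that the corpus fits in a list of strings: the normalization form and Unicode version are hashed alongside the normalized text, so changing either forces a rebuild instead of silently serving a mixed index.

```python
import hashlib
import unicodedata

def index_hash(doc_texts: list[str], form: str = "NFKC") -> str:
    """Hash the normalized corpus plus its normalization contract."""
    h = hashlib.sha256()
    # Seed with form and Unicode version so a contract change changes the hash.
    h.update(f"{form}:{unicodedata.unidata_version}".encode())
    for t in doc_texts:
        h.update(unicodedata.normalize(form, t).encode("utf-8"))
    return h.hexdigest()
```

Two byte-different spellings of the same text hash identically, while the same text under a different form does not.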
- Preserve both raw and normalized text
- Store raw bytes for audit and exact reproduction.
- Store normalized text for retrieval and rerank.
- Maintain a fast map raw↔norm for citations.
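A sketch of the raw↔norm map, built one combining sequence at a time (base character plus trailing marks). It assumes combining-sequence boundaries are normalization-stable, which holds for typical text but not for every edge case, so treat it as a starting point rather than a complete mapper.

```python
import unicodedata

def raw_to_norm_map(raw: str, form: str = "NFC") -> list[tuple[int, int]]:
    """(raw_offset, norm_offset) pairs at each combining-sequence boundary,
    plus a final sentinel pair for end offsets."""
    spans = []
    i, n_off = 0, 0
    while i < len(raw):
        j = i + 1
        while j < len(raw) and unicodedata.combining(raw[j]):
            j += 1                                   # absorb trailing marks
        cluster = unicodedata.normalize(form, raw[i:j])
        spans.append((i, n_off))                     # boundary in both texts
        i, n_off = j, n_off + len(cluster)
    spans.append((len(raw), n_off))                  # sentinel: end offsets
    return spans
```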
- Unify width and compatibility sets
- Collapse full-width ASCII, half-width katakana, superscripts, circled numbers under NFKC for keys.
- Keep the visible form for UI by rendering raw text, not the NFKC key.
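A small demonstration of the key/display split: the NFKC key collapses width and compatibility variants, while the raw string is what the UI renders.

```python
import unicodedata

def search_key(raw: str) -> str:
    """NFKC key for indexing; the raw text stays untouched for display."""
    return unicodedata.normalize("NFKC", raw)

# Full-width ASCII, half-width katakana, and circled numbers all collapse:
#   search_key("\uFF21\uFF22\uFF23") == "ABC"    full-width letters
#   search_key("\uFF76")             == "\u30AB" half-width katakana ka
#   search_key("\u2460")             == "1"      circled digit one
```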
- Prevent double normalization
- Mark a boolean is_normalized.
- Gate every downstream step with a one-line check to avoid repeated passes that shift offsets.
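The gate can be one line around `unicodedata.is_normalized` (Python 3.8+). A sketch that also returns the flag, so downstream steps know whether an offset remap is needed; the tuple interface is an assumption, not a fixed API.

```python
import unicodedata

def normalize_once(text: str, form: str = "NFC") -> tuple[str, bool]:
    """Return (text, pass_ran). Skips the pass when the text is already
    in the target form, so earlier offset maps are never shifted twice."""
    if unicodedata.is_normalized(form, text):  # Python 3.8+
        return text, False
    return unicodedata.normalize(form, text), True
```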
- Chunk after normalization
- Chunk boundaries must be computed on the normalized string to keep k-NN neighborhoods stable.
→ Chunking Checklist
Verification
- Twin query probe: run the same question in NFC and NFD. Top-k overlap ≥ 0.8 and ΔS difference ≤ 0.05.
- Anchor triangulation: compare ΔS to the correct anchor and a decoy paragraph. After normalization, the correct anchor must win with a margin ≥ 0.15.
- Offset audit: pick ten citations with combining marks. Validate that raw offsets and UI highlights match exactly.
- k-ordering stability: for k in {5, 10, 20}, the normalized index keeps the same top-3 order or differs by at most one position.
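The twin query probe can be sketched as below; `search(query, k) -> ranked id list` is a hypothetical interface standing in for your retriever.

```python
import unicodedata

def twin_query_overlap(query: str, search, k: int = 10) -> float:
    """Run the same query in NFC and NFD through a caller-supplied
    search(query, k) and report top-k overlap. Target: >= 0.8."""
    nfc = unicodedata.normalize("NFC", query)
    nfd = unicodedata.normalize("NFD", query)
    a, b = set(search(nfc, k)), set(search(nfd, k))
    return len(a & b) / k
```

A retriever that normalizes on ingestion scores 1.0 here; one that keys on raw bytes scores near 0 for accented queries.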
Minimal test set you can copy
- “café” with precomposed é vs e + combining acute.
- Full-width “ＡＢＣ１２３” vs ASCII “ABC123”.
- Half-width katakana vs standard katakana.
- Arabic text with tatweel and optional diacritics.
- Greek polytonic marks.
- Hangul precomposed syllables vs Jamo sequences.
If any pair fails the twin query probe or offset audit, rebuild with the chosen normalization form and re-verify.
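The copyable pairs above, as code. The Arabic tatweel and Greek polytonic cases need locale-specific handling beyond a plain normalization form, so this sketch covers only the pairs a standard form is guaranteed to unify.

```python
import unicodedata

# (variant_a, variant_b, form that should unify them)
PAIRS = [
    ("caf\u00e9", "cafe\u0301", "NFC"),        # precomposed vs combining acute
    ("\uFF21\uFF22\uFF23\uFF11\uFF12\uFF13",
     "ABC123", "NFKC"),                        # full-width vs ASCII
    ("\uFF76\uFF80\uFF76\uFF85",
     "\u30AB\u30BF\u30AB\u30CA", "NFKC"),      # half-width vs standard katakana
    ("\uAC00", "\u1100\u1161", "NFC"),         # Hangul syllable vs Jamo sequence
]

def check(pairs=PAIRS) -> bool:
    """True when every pair collapses to the same key under its form."""
    return all(unicodedata.normalize(f, a) == unicodedata.normalize(f, b)
               for a, b, f in pairs)
```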
Escalate when
- ΔS remains ≥ 0.60 after normalization and chunk rebuild → Retrieval Playbook
- High recall but messy k-ordering persists → Rerankers
👑 Early Stargazers: See the Hall of Fame —
Engineers, hackers, and open source builders who supported WFGY from day one.
⭐ WFGY Engine 2.0 is already unlocked. ⭐ Star the repo to help others discover it and unlock more on the Unlock Board.