WFGY/ProblemMap/GlobalFixMap/LanguageLocale/unicode_normalization.md
2025-08-30 19:24:00 +08:00


Unicode Normalization — Guardrails and Fix Pattern

A focused fix for NFC vs NFD vs NFKC drift, width and compatibility forms, and double-normalized text that breaks retrieval, dedupe, and citation offsets.

When to use

  • Same word appears twice with different byte forms. Tokens do not match across pipelines.
  • High similarity scores yet wrong matches for accented strings.
  • Citations point to wrong character offsets after rendering.
  • Mixed full-width and half-width forms, ZWJ or combining marks causing “near duplicates.”


Acceptance targets

  • ΔS(question, retrieved) ≤ 0.45 across NFC and NFD variants of the same query.
  • Coverage of target section ≥ 0.70 after normalization pass.
  • λ remains convergent for three paraphrases and two seeds.
  • Offset accuracy ≥ 0.99 when mapping citations from normalized text back to raw text.

Fix checklist

  1. Pick one normalization form per layer

     • Storage and search keys: NFKC for maximal unification of width and compatibility forms.
     • Human-facing content and display: NFC to preserve intent while keeping codepoints stable.
     • Do not normalize code blocks or hashes. Tag code spans and skip them.

  2. Record the contract
     Add to your snippet schema:

     norm_form: NFC|NFD|NFKC|NFKD
     norm_version: ICU/Unicode version
     offsets: {raw_start, raw_end, norm_start, norm_end}

     See schema ideas in Retrieval Traceability · Data Contracts.

  3. Normalize on ingestion, not only at query time
     • Normalize and index once. Persist an INDEX_HASH that includes the normalization form and version.
     • If you normalize only at query time, ΔS drifts for long contexts and k-ordering becomes unstable.

  4. Preserve both raw and normalized text
     • Store raw bytes for audit and exact reproduction.
     • Store normalized text for retrieval and rerank.
     • Maintain a fast raw↔norm map for citations.

  5. Unify width and compatibility sets
     • Collapse full-width ASCII, half-width katakana, superscripts, and circled numbers under NFKC for keys.
     • Keep the visible form for UI by rendering raw text, not the NFKC key.

  6. Prevent double normalization
     • Mark a boolean is_normalized.
     • Gate every downstream step with a one-line check to avoid repeated passes that shift offsets.

  7. Chunk after normalization
     • Compute chunk boundaries on the normalized string to keep k-NN neighborhoods stable.
     See the Chunking Checklist.
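A minimal sketch of steps 3 and 4 in Python, using the stdlib `unicodedata` module. `normalize_with_map` is a hypothetical helper, not part of any library: it segments the raw text at base characters (a base character plus its trailing combining marks) and normalizes each segment, which yields a raw↔norm offset map for most scripts but does not handle every normalization boundary.

```python
import unicodedata

def normalize_with_map(raw: str, form: str = "NFKC"):
    """Normalize `raw` once at ingestion and return (norm, spans).

    Each span maps one raw combining sequence to its normalized
    counterpart, matching the offsets contract:
    {raw_start, raw_end, norm_start, norm_end}.
    Sketch only: segmenting at base characters covers common cases
    but not every normalization boundary (e.g. reordering marks).
    """
    # Split raw into combining sequences: a base char plus trailing marks.
    segments, start = [], 0
    for i, ch in enumerate(raw):
        if i > 0 and unicodedata.combining(ch) == 0:
            segments.append((start, i))
            start = i
    segments.append((start, len(raw)))

    norm_parts, spans, norm_pos = [], [], 0
    for raw_start, raw_end in segments:
        part = unicodedata.normalize(form, raw[raw_start:raw_end])
        spans.append({"raw_start": raw_start, "raw_end": raw_end,
                      "norm_start": norm_pos, "norm_end": norm_pos + len(part)})
        norm_parts.append(part)
        norm_pos += len(part)
    return "".join(norm_parts), spans
```

Store the raw string alongside `norm` and `spans`; citations computed on `norm` can then be mapped back to raw offsets by a lookup in `spans`.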

Verification

  • Twin query probe
    Run the same question in NFC and NFD. Top-k overlap ≥ 0.8 and ΔS difference ≤ 0.05.

  • Anchor triangulation
    Compare ΔS to the correct anchor and a decoy paragraph. After normalization, the correct anchor must win with margin ≥ 0.15.

  • Offset audit
    Pick ten citations with combining marks. Validate raw offsets and UI highlights match exactly.

  • k-ordering stability
    For k in {5, 10, 20}, the normalized index keeps the same top-3 order or differs by at most one position.
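The twin query probe above can be automated. This is a hedged sketch: `search_fn` is a stand-in for your retriever's query interface (any callable `(query, k) -> list of doc ids` works), not a real API.

```python
import unicodedata

def twin_query_probe(query: str, search_fn, k: int = 10) -> float:
    """Run the same query in NFC and NFD and return top-k overlap in [0, 1].

    `search_fn(query, k)` is a placeholder for your retriever and should
    return a list of document ids. Target: overlap >= 0.8.
    """
    nfc_hits = search_fn(unicodedata.normalize("NFC", query), k)
    nfd_hits = search_fn(unicodedata.normalize("NFD", query), k)
    return len(set(nfc_hits) & set(nfd_hits)) / k
```

If your index was normalized at ingestion as described above, both variants should hit the same keys and the overlap should be close to 1.0.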


Minimal test set you can copy

  • “café” with precomposed é (U+00E9) vs e + combining acute (U+0065 U+0301).
  • Full-width “ＡＢＣ１２３” vs ASCII “ABC123”.
  • Half-width katakana vs standard katakana.
  • Arabic text with tatweel and optional diacritics.
  • Greek polytonic marks.
  • Hangul precomposed syllables vs Jamo sequences.

If any pair fails the twin query probe or offset audit, rebuild with the chosen normalization form and re-verify.
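The pairs that NFKC alone can unify can be checked directly with the stdlib. This sketch covers four of the six cases; the Arabic tatweel and Greek polytonic cases need handling beyond plain NFKC (tatweel U+0640 survives normalization), so they are omitted here. The code points are assumptions reconstructed from the examples listed above.

```python
import unicodedata

# Each pair should collapse to the same NFKC key.
PAIRS = [
    ("caf\u00e9", "cafe\u0301"),                          # precomposed vs combining acute
    ("\uff21\uff22\uff23\uff11\uff12\uff13", "ABC123"),   # full-width vs ASCII
    ("\uff76\uff80\uff76\uff85", "\u30ab\u30bf\u30ab\u30ca"),  # half-width vs standard katakana
    ("\ud55c", "\u1112\u1161\u11ab"),                     # precomposed Hangul vs conjoining Jamo
]

for a, b in PAIRS:
    assert unicodedata.normalize("NFKC", a) == unicodedata.normalize("NFKC", b), (a, b)
```

Run this against the same Unicode version your index was built with; record that version in `norm_version` so a library upgrade cannot silently change the keys.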


Escalate when

  • ΔS remains ≥ 0.60 after normalization and chunk rebuild
    Retrieval Playbook

  • High recall but messy k-ordering persists
    Rerankers

