Locale Drift & Normalization — Guardrails and Fix Pattern

A focused fix for locale-specific normalization bugs that break retrieval or cause answers to flip between runs. Use this page to align Unicode form, width, accents, digits, quotes, spaces, and casing across ingest, index, query, and display. No infrastructure change is required.

Open these first

Visual map and recovery: RAG Architecture & Recovery
End to end retrieval knobs: Retrieval Playbook
OCR and parsing checks: OCR Parsing Checklist
Snippet and citation schema: Data Contracts
Tokenizer alignment: tokenizer_mismatch.md
Code-switching in one query: script_mixing.md
Embedding vs meaning: embedding-vs-semantic.md
Rerank determinism: rerankers.md

When to use this page

The same text looks identical to the eye, yet retrieval misses it or ranks it differently.
Queries with accented vs unaccented forms return different snippets.
Half-width vs full-width characters change similarity scores.
Arabic-Indic digits or CJK punctuation break matches.
Smart quotes vs straight quotes fragment hits.
Turkish I or locale-aware casing flips λ between runs.
NBSP or narrow spaces split tokens unpredictably.

Core acceptance targets

ΔS(question, retrieved) ≤ 0.45 in the native locale, ≤ 0.50 after cross-locale normalization.
Coverage of target section ≥ 0.70 on three paraphrases.
λ remains convergent across three locale variants of the same question.
No analyzer drift between ingest and query paths in hybrid pipelines.
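
The numeric targets above can be turned into a small gate for test runs. This is a minimal sketch, not part of WFGY itself: it assumes ΔS, coverage, and λ are measured elsewhere and passed in, and the `LocaleRun` record, its field names, and the `"convergent"` label are hypothetical choices made for illustration. The analyzer-drift target is not covered here because it needs access to the pipeline configuration.

```python
from dataclasses import dataclass
from typing import List

# Thresholds copied from the acceptance targets above.
MAX_DS_NATIVE = 0.45
MAX_DS_NORMALIZED = 0.50
MIN_COVERAGE = 0.70


@dataclass
class LocaleRun:
    """Hypothetical record of one test item; field names are assumptions."""
    delta_s_native: float       # ΔS(question, retrieved) in the native locale
    delta_s_normalized: float   # ΔS after cross-locale normalization
    coverages: List[float]      # target-section coverage, one value per paraphrase
    lambda_states: List[str]    # one λ label per locale variant (label is assumed)


def acceptance_failures(run: LocaleRun) -> List[str]:
    """Return the names of violated targets; an empty list means the run passes."""
    failures = []
    if run.delta_s_native > MAX_DS_NATIVE:
        failures.append("delta_s_native")
    if run.delta_s_normalized > MAX_DS_NORMALIZED:
        failures.append("delta_s_normalized")
    # Require three paraphrases, each at or above the coverage floor.
    if len(run.coverages) < 3 or min(run.coverages) < MIN_COVERAGE:
        failures.append("coverage")
    # Require three locale variants, all reported as convergent.
    if len(run.lambda_states) < 3 or any(s != "convergent" for s in run.lambda_states):
        failures.append("lambda")
    return failures
```

Returning the list of failed targets rather than a single boolean makes it easier to see which normalization step to revisit.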

Symptoms → Likely cause → Open this

Symptom	Likely cause	Open this
Visually identical text but no hit	Unicode form mismatch (NFD vs NFC)	Retrieval Playbook, OCR Parsing Checklist
Hits split between copies of the same doc	Width folding not applied, full-width vs half-width	Retrieval Playbook
Accented vs plain queries return different snippets	Accent folding policy inconsistent	embedding-vs-semantic.md
Numbers in Arabic-Indic scripts never match	Digit class mismatch or analyzer not locale aware	tokenizer_mismatch.md
Quotes or hyphens break phrase queries	Smart quotes, em dashes and hyphen look-alikes, Unicode confusables	OCR Parsing Checklist
Same prompt, answers flip by locale	Casing rules differ, tr locale I/i special case, λ unstable	script_mixing.md, rerankers.md
Token counts explode on CJK	NBSP, narrow no-break, or ideographic space not normalized	Retrieval Playbook
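
Several of the causes in the table can be spotted directly from the characters in a string. The sketch below is one possible detector built on Python's standard `unicodedata` module. The function name, the finding labels, and the small character sets are assumptions for illustration and are far from exhaustive; the page names in `OPEN_THIS` are taken from the table above. Accent-policy and casing issues are not detected, since they depend on pipeline policy rather than on the characters alone.

```python
import unicodedata

# Illustrative, non-exhaustive character sets (assumptions, extend for your corpus).
SMART_PUNCT = set("\u2018\u2019\u201C\u201D\u2010\u2011\u2013\u2014")
EXOTIC_SPACES = set("\u00A0\u2009\u202F\u3000")  # NBSP, thin, narrow NBSP, ideographic

# Finding label -> pages from the symptom table.
OPEN_THIS = {
    "unicode_form": ["Retrieval Playbook", "OCR Parsing Checklist"],
    "width": ["Retrieval Playbook"],
    "digit_class": ["tokenizer_mismatch.md"],
    "smart_punct": ["OCR Parsing Checklist"],
    "space_class": ["Retrieval Playbook"],
}


def detect_drift(text: str) -> set:
    """Return labels for the normalization hazards found in text."""
    findings = set()
    if unicodedata.normalize("NFC", text) != text:
        findings.add("unicode_form")          # text is not already in NFC
    for ch in text:
        if unicodedata.east_asian_width(ch) in ("F", "H"):
            findings.add("width")             # full-width or half-width variant
        if ch.isdecimal() and not ch.isascii():
            findings.add("digit_class")       # decimal digit outside ASCII 0-9
        if ch in SMART_PUNCT:
            findings.add("smart_punct")
        if ch in EXOTIC_SPACES:
            findings.add("space_class")
    return findings
```

Running `detect_drift` on both the indexed text and the incoming query, then comparing the two sets, shows which side of the pipeline is missing a normalization step.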

60-second fix checklist

Normalize at ingest and query. Apply in this order: Unicode NFC, width fold, digit fold to ASCII, smart-punct to ASCII, collapse exotic spaces, then locale-aware casefold. Keep the original text for display.
Dual-key storage. Store both visual_text and search_text. Index BM25 on search_text. Keep visual_text for citations and display. Schema in Data Contracts.
Embed both views when needed. For high-variance locales, create embeddings for raw and normalized text. Track which path produced the hit inside the snippet payload. See guidance in embedding-vs-semantic.md.
Align analyzers. Ensure the same analyzer and token rules for ingest and query in hybrid retrieval. Verify with three paraphrases and two seeds. If λ flips, pin headers and apply a BBAM variance clamp.
Gold set verification. Run a 20-item multilingual gold set. Require ΔS ≤ 0.45 native and ≤ 0.50 normalized, coverage ≥ 0.70, λ convergent.
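
The first two checklist items can be sketched with Python's standard library. This is a minimal illustration under stated assumptions, not a reference implementation: the function names, the punctuation and space maps, and the Turkic rule (applied to `tr` and `az` locales) are hypothetical choices; only full-width ASCII variants are folded (half-width katakana is left alone); and Python's `str.casefold()` is not locale-aware on its own, so the dotted and dotless I are handled by hand before calling it.

```python
import re
import unicodedata

# Illustrative maps (assumptions): extend them for your corpus.
PUNCT_MAP = str.maketrans({
    "\u2018": "'", "\u2019": "'",                                # single smart quotes
    "\u201C": '"', "\u201D": '"',                                # double smart quotes
    "\u2010": "-", "\u2011": "-", "\u2013": "-", "\u2014": "-",  # hyphen and dash variants
})
SPACE_MAP = str.maketrans({"\u00A0": " ", "\u2009": " ", "\u202F": " ", "\u3000": " "})


def fold_width(text):
    """Map full-width ASCII variants (U+FF01..U+FF5E) to their ASCII forms."""
    return "".join(
        chr(ord(ch) - 0xFEE0) if 0xFF01 <= ord(ch) <= 0xFF5E else ch for ch in text
    )


def fold_digits(text):
    """Replace any Unicode decimal digit (for example Arabic-Indic) with ASCII."""
    out = []
    for ch in text:
        value = unicodedata.decimal(ch, None)
        out.append(ch if value is None else str(value))
    return "".join(out)


def casefold_for(text, locale):
    """Casefold with a hand-written Turkic rule for dotted and dotless I."""
    lang = locale.replace("_", "-").split("-")[0].lower()
    if lang in ("tr", "az"):
        text = text.replace("I", "\u0131").replace("\u0130", "i")
    return text.casefold()


def normalize_for_search(text, locale="en"):
    """Apply the checklist order: NFC, width, digits, punctuation, spaces, casefold."""
    text = unicodedata.normalize("NFC", text)
    text = fold_width(text)
    text = fold_digits(text)
    text = text.translate(PUNCT_MAP)
    text = re.sub(r" {2,}", " ", text.translate(SPACE_MAP))
    return casefold_for(text, locale)


def make_record(text, locale="en"):
    """Dual-key storage: keep the original for display, index the normalized form."""
    return {"visual_text": text, "search_text": normalize_for_search(text, locale)}
```

Calling the same `normalize_for_search` from both the ingest path and the query path is what keeps the two sides aligned; the original string is never modified, so citations can still quote `visual_text` exactly.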

Minimal adapter spec

Payload must carry: locale, unicode_form, width_fold, digit_class, space_class, accent_fold, case_mode.
Snippet must include both visual_text and search_text plus normalization_trace.
Reject answers if citation visual_text and search_text disagree on offsets.
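
One way to read the spec above is as a pair of typed records plus an offset check. The sketch below is an assumption-heavy illustration: the source only names the fields, so every type, the `Citation` record, and the interpretation of "disagree on offsets" (the cited span of visual_text, once normalized, must equal the cited span of search_text) are hypothetical. The authoritative schema lives in Data Contracts.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class NormalizationPayload:
    """Fields named in the adapter spec; the types are assumptions."""
    locale: str
    unicode_form: str    # e.g. "NFC"
    width_fold: bool
    digit_class: str     # e.g. "ascii"
    space_class: str     # e.g. "ascii_space"
    accent_fold: bool
    case_mode: str       # e.g. "casefold"


@dataclass
class Snippet:
    visual_text: str
    search_text: str
    normalization_trace: List[str] = field(default_factory=list)


@dataclass
class Citation:
    """Hypothetical citation record: (start, end) offsets into each view."""
    visual_span: Tuple[int, int]
    search_span: Tuple[int, int]


def citation_is_consistent(
    snippet: Snippet, citation: Citation, normalize: Callable[[str], str]
) -> bool:
    """True when both spans point at the same content under the given normalizer."""
    v_start, v_end = citation.visual_span
    s_start, s_end = citation.search_span
    cited_visual = snippet.visual_text[v_start:v_end]
    cited_search = snippet.search_text[s_start:s_end]
    return normalize(cited_visual) == cited_search
```

The `normalize` argument should be the same function that produced `search_text` at ingest; an answer whose citation fails this check would be rejected under the rule above.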

See schema patterns in Data Contracts and tracing in Retrieval Traceability.

Copy-paste prompt

You have TXT OS and the WFGY Problem Map loaded.

My bug smells like locale drift.
Question variants: {q_native, q_no_accents, q_width_folded}.
Traces: ΔS_native=..., ΔS_normalized=..., λ states across three variants.

Tell me:
1) which normalization step is missing and why,
2) the exact WFGY page to open,
3) the smallest change to push ΔS ≤ 0.45 native and ≤ 0.50 normalized,
4) a reproducible test with 3 paraphrases and 2 seeds.
Use BBMC/BBCR/BBAM when relevant.

