12 KiB
Locale Drift — Guardrails and Fix Patterns
🧭 Quick Return to Map
You are in a sub-page of Language.
To reorient, go back here:
- Language — multilingual processing and semantic alignment
- WFGY Global Fix Map — main Emergency Room, 300+ structured fixes
- WFGY Problem Map 1.0 — 16 reproducible failure modes
Think of this page as a desk within a ward.
If you need the full triage and all prescriptions, return to the Emergency Room lobby.
Stabilize retrieval when locale settings silently change token rules, analyzers, and normalization between ingest, index, and query. Typical failures include en_US vs en_GB spelling, tr_TR case-folding (“i/İ”), decimal and thousands separators, date formats, Simplified/Traditional Chinese, and accent stripping differences.
Open these first
- Visual map and recovery: rag-architecture-and-recovery.md
- End to end retrieval knobs: retrieval-playbook.md
- Why this snippet and how to cite: retrieval-traceability.md
- Snippet schema fence: data-contracts.md
- Embedding vs meaning: embedding-vs-semantic.md
- Chunk boundary sanity: chunking-checklist.md
Related in this folder:
- Tokenizer drift: tokenizer_mismatch.md
- Mixed scripts in one query: script_mixing.md
Core acceptance targets
- ΔS(question, retrieved) ≤ 0.45 across locale variants of the same query
- Coverage of the target section ≥ 0.70 after repair
- λ remains convergent across three paraphrases and two seeds
- E_resonance flat on long windows that include locale-sensitive tokens (dates, numbers, currencies)
What this failure looks like
| Symptom | Likely cause | Where to fix |
|---|---|---|
High similarity yet wrong section when query switches , and . in numbers (e.g., 1.234,56) |
Different locale decimal/thousand separators between ingest and query | Normalize numerics before index and query; align analyzers |
| Dates “03/07/2024” retrieved as July instead of March | Ambiguous locale date parsing | Canonicalize to ISO YYYY-MM-DD at ingest and query |
| “istanbul” mismatches titles with “İstanbul” | Turkish case-folding rules differ across stages | Use locale-aware fold or ASCII base form consistently |
| “straße/strasse” flip in German content | ß vs ss normalization mismatch | Decide policy (preserve ß or fold to ss) and apply everywhere |
| “café” differs from “cafe” across stores | Accent stripping only on one side | Apply accent policy uniformly; prefer keeping both forms via subfield |
| English vs Chinese punctuation causes token joins/drops | Locale-specific punctuation width and spacing | Normalize width, unify punctuation rules; ensure same analyzer |
| zh-Hans vs zh-Hant documents never co-retrieve | Variant mapping missing | Map variants at ingest or add alias field; verify embeddings share policy |
Fix in 60 seconds
-
Measure ΔS Compute ΔS(question, retrieved) with current locale. Re-run with a canonicalized query: ISO dates, normalized numbers, consistent case-folding. If ΔS drops by ≥ 0.10, locale drift is your root cause.
-
Probe λ_observe Flip only the locale-sensitive tokens (date, number, currency symbol, diacritics). If λ flips or citations jump, lock schema and fix normalization before touching rerankers.
-
Apply the smallest structural change
- Canonical numerics: convert decimals to
.and thousands to thin-space or remove thousands. - Canonical dates: rewrite to
YYYY-MM-DDand store a parsed date field for filters. - Case-folding: choose locale-aware rules where needed (
tr_TRi/İ), else use simple lower with exceptions list. - Diacritics: either preserve and add an accent-folded subfield, or fold everywhere.
- CJK: unify Simplified/Traditional mapping per field and keep a raw subfield.
- Verify Coverage ≥ 0.70 and ΔS ≤ 0.45 on three paraphrases and two seeds using both locale renderings.
Minimal repair recipes by stack
Elasticsearch / OpenSearch
- Define a canonical analyzer chain shared by
indexandsearch_analyzer. Suggested: ICU normalizer (NFC) → width fold → optional accent fold (or keep + keyword subfield) → locale-aware lowercase. - Add numeric and date normalizers in an ingest pipeline. Persist ISO strings, plus typed fields for range queries.
- For German and Turkish, use dedicated token filters (
german_normalization, custom fold for Turkish i/İ). - For Chinese, Japanese: keep a keyword subfield for exact product names and a bigram analyzer for recall. Open: retrieval-playbook.md
BM25 in code or light stores
- Pre-normalize text and queries with a single code path: ISO dates, canonical numerics, consistent punctuation width, optional accent fold, locale-aware lower.
- Log the effective tokens to verify identical behavior across runs. Open: retrieval-traceability.md
Vector stores (FAISS, Milvus, Qdrant, Weaviate, pgvector)
- Apply the same locale normalization before embedding for both corpus and queries.
- For numerics and dates, consider lexical sidecar (BM25) to capture exact forms, then deterministic rerank.
- Re-embed a gold slice to validate ΔS before full rebuild. Open: pattern_vectorstore_fragmentation.md
Locale normalization policy — quick checklist
- Dates stored and queried as ISO
YYYY-MM-DD; display can be localized later. - Numerics use
.as decimal; thousands removed or unified; currency symbol separated from amount. - Case-folding policy documented; Turkish special-case applied where needed.
- Accent policy consistent: preserve + accent-folded subfield, or fold globally.
- CJK variant policy decided (Hans/Hant) and applied at both ingest and query.
- Punctuation width unified; zero-width and bidi controls stripped where not meaningful.
- Analyzer identity enforced across index and search paths.
Diagnostic checklist
- Same normalization code runs in ingest and in query clients.
- Same analyzer configuration used for the field in both
index_analyzerandsearch_analyzer. - Logging proves that effective tokens match across locales for the same meaning.
- Citations remain in the same section after locale canonicalization.
- Rerank stage reads normalized text, not raw payloads.
Copy-paste tests
Locale flip probe
Q0: original user query as typed (locale A)
Q1: ISO date, canonical number, same words (locale neutral)
Q2: render in locale B (date/number style only)
Return ΔS for Q0,Q1,Q2, λ_state per run, and a note if citations left the target section.
Turkish i/İ sanity
Build two forms: 'istanbul', 'İstanbul'.
Verify tokens and matches are identical against titles and anchor fields.
If not, log analyzer outputs and apply a Turkish-aware fold.
Accent policy audit
Index doc: 'café', 'résumé'.
Queries: 'cafe', 'resume', original diacritics.
Expect both forms to match the same snippet and citations to remain stable.
When to escalate
-
ΔS stays ≥ 0.60 after locale normalization and analyzer alignment. Re-chunk with stable boundaries and re-embed a gold slice. Open: chunking-checklist.md
-
Answers alternate between locales while citations drift. Enforce snippet schema and forbid cross-section reuse. Open: data-contracts.md, retrieval-traceability.md
-
Hybrid retrieval still underperforms a single retriever after fixes. Align locale rules before rerank and make rerank deterministic. Open: rerankers.md
🔗 Quick-Start Downloads (60 sec)
| Tool | Link | 3-Step Setup |
|---|---|---|
| WFGY 1.0 PDF | Engine Paper | 1️⃣ Download · 2️⃣ Upload to your LLM · 3️⃣ Ask “Answer using WFGY + <your question>” |
| TXT OS (plain-text OS) | TXTOS.txt | 1️⃣ Download · 2️⃣ Paste into any LLM chat · 3️⃣ Type “hello world” — OS boots instantly |
Explore More
| Module | Description | Link |
|---|---|---|
| WFGY Core | Canonical framework entry point | View |
| Problem Map | Diagnostic map and navigation hub | View |
| Tension Universe Experiments | MVP experiment field | View |
| Recognition | Where WFGY is referenced or adopted | View |
| AI Guide | Anti-hallucination reading protocol for tools | View |
If this repository helps, starring it improves discovery for other builders.