# Locale Drift — Guardrails and Fix Patterns
Stabilize retrieval when locale settings silently change token rules, analyzers, and normalization between ingest, index, and query. Typical failures include en_US vs en_GB spelling, tr_TR case-folding (“i/İ”), decimal and thousands separators, date formats, Simplified/Traditional Chinese, and accent stripping differences.
## Open these first
- Visual map and recovery: rag-architecture-and-recovery.md
- End to end retrieval knobs: retrieval-playbook.md
- Why this snippet and how to cite: retrieval-traceability.md
- Snippet schema fence: data-contracts.md
- Embedding vs meaning: embedding-vs-semantic.md
- Chunk boundary sanity: chunking-checklist.md
Related in this folder:
- Tokenizer drift: tokenizer_mismatch.md
- Mixed scripts in one query: script_mixing.md
## Core acceptance targets
- ΔS(question, retrieved) ≤ 0.45 across locale variants of the same query
- Coverage of the target section ≥ 0.70 after repair
- λ remains convergent across three paraphrases and two seeds
- E_resonance flat on long windows that include locale-sensitive tokens (dates, numbers, currencies)
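
The ΔS threshold above can be probed without special tooling. A minimal sketch, assuming ΔS is approximated as 1 − cosine similarity between question and snippet embeddings (the exact WFGY definition may differ), with a hypothetical `embed()` encoder supplied by your pipeline:

```python
# Hedged proxy for the ΔS targets above: ΔS(question, retrieved) ≈
# 1 - cosine(question_vec, snippet_vec). embed() is whatever encoder
# your pipeline already uses; it is not defined here.
import numpy as np

def delta_s(q_vec: np.ndarray, s_vec: np.ndarray) -> float:
    """Semantic-tension proxy: 0 means aligned, higher means drifting apart."""
    cos = float(np.dot(q_vec, s_vec) / (np.linalg.norm(q_vec) * np.linalg.norm(s_vec)))
    return 1.0 - cos

def locale_drift_gap(embed, query_as_typed: str, query_canonical: str, snippet: str) -> float:
    """A positive gap of >= 0.10 points at locale drift (see 'Fix in 60 seconds')."""
    s_vec = embed(snippet)
    return delta_s(embed(query_as_typed), s_vec) - delta_s(embed(query_canonical), s_vec)
```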
## What this failure looks like
| Symptom | Likely cause | Where to fix |
|---|---|---|
| High similarity yet wrong section when the query swaps "," and "." in numbers (e.g., 1.234,56) | Different locale decimal/thousands separators between ingest and query | Normalize numerics before index and query; align analyzers |
| Dates “03/07/2024” retrieved as July instead of March | Ambiguous locale date parsing | Canonicalize to ISO YYYY-MM-DD at ingest and query |
| “istanbul” mismatches titles with “İstanbul” | Turkish case-folding rules differ across stages | Use locale-aware fold or ASCII base form consistently |
| “straße/strasse” flip in German content | ß vs ss normalization mismatch | Decide policy (preserve ß or fold to ss) and apply everywhere |
| “café” differs from “cafe” across stores | Accent stripping only on one side | Apply accent policy uniformly; prefer keeping both forms via subfield |
| English vs Chinese punctuation causes token joins/drops | Locale-specific punctuation width and spacing | Normalize width, unify punctuation rules; ensure same analyzer |
| zh-Hans vs zh-Hant documents never co-retrieve | Variant mapping missing | Map variants at ingest or add alias field; verify embeddings share policy |
## Fix in 60 seconds

1. **Measure ΔS.** Compute ΔS(question, retrieved) with the current locale. Re-run with a canonicalized query: ISO dates, normalized numbers, consistent case-folding. If ΔS drops by ≥ 0.10, locale drift is your root cause.

2. **Probe λ_observe.** Flip only the locale-sensitive tokens (date, number, currency symbol, diacritics). If λ flips or citations jump, lock the schema and fix normalization before touching rerankers.

3. **Apply the smallest structural change** (a minimal canonicalization sketch follows this list).
   - Canonical numerics: use `.` as the decimal separator; replace thousands separators with a thin space or remove them.
   - Canonical dates: rewrite to `YYYY-MM-DD` and store a parsed date field for filters.
   - Case-folding: choose locale-aware rules where needed (`tr_TR` i/İ), else use a simple lowercase with an exceptions list.
   - Diacritics: either preserve and add an accent-folded subfield, or fold everywhere.
   - CJK: unify the Simplified/Traditional mapping per field and keep a raw subfield.

4. **Verify.** Coverage ≥ 0.70 and ΔS ≤ 0.45 on three paraphrases and two seeds, using both locale renderings.
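
A minimal canonicalization sketch for step 3, assuming European-style "1.234,56" numerics and day-first dates in the source text. The regexes and helper names are illustrative; a production pipeline would detect the source locale explicitly or lean on libraries such as `dateutil` or `Babel`:

```python
# Canonicalization sketch for step 3. The regexes assume a known source
# locale (thousands ".", decimal ",", day-first dates); adjust per corpus.
import re
import unicodedata
from datetime import datetime

def canonical_number(text: str) -> str:
    # "1.234,56" -> "1234.56": drop thousands dots, turn the decimal comma into a dot.
    return re.sub(
        r"\b\d{1,3}(\.\d{3})+(,\d+)?\b",
        lambda m: m.group(0).replace(".", "").replace(",", "."),
        text,
    )

def canonical_date(text: str, dayfirst: bool = True) -> str:
    # "03.07.2024" or "03/07/2024" -> "2024-07-03" (ISO), assuming day-first input.
    def _iso(m: re.Match) -> str:
        d, mth, y = int(m.group(1)), int(m.group(2)), int(m.group(3))
        if not dayfirst:
            d, mth = mth, d
        return datetime(y, mth, d).strftime("%Y-%m-%d")
    return re.sub(r"\b(\d{1,2})[./](\d{1,2})[./](\d{4})\b", _iso, text)

def fold_accents(text: str) -> str:
    # Accent-folded form; keep the original in a sibling field if the policy is "preserve".
    nfd = unicodedata.normalize("NFD", text)
    return unicodedata.normalize("NFC", "".join(c for c in nfd if not unicodedata.combining(c)))

def canonicalize(text: str) -> str:
    # One shared code path for ingest and query; accents are handled by field policy.
    text = unicodedata.normalize("NFC", text)
    return canonical_date(canonical_number(text))
```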
## Minimal repair recipes by stack

### Elasticsearch / OpenSearch

- Define a canonical analyzer chain shared by the index-time `analyzer` and the `search_analyzer`. Suggested: ICU normalizer (NFC) → width fold → optional accent fold (or keep the original plus a keyword subfield) → locale-aware lowercase.
- Add numeric and date normalizers in an ingest pipeline. Persist ISO strings, plus typed fields for range queries.
- For German and Turkish, use dedicated token filters (`german_normalization`, a custom fold for Turkish i/İ).
- For Chinese and Japanese, keep a keyword subfield for exact product names and a bigram analyzer for recall. See the mapping sketch below.

Open: retrieval-playbook.md
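
A hedged mapping sketch of one shared analyzer chain, assuming the `analysis-icu` plugin is installed and the `elasticsearch` Python client is available. Index name, field names, and the exact filter mix are illustrative, not a drop-in configuration:

```python
# Sketch: one analyzer chain used on both the index and search paths,
# plus a keyword subfield for exact forms and a typed ISO date field.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder endpoint

settings = {
    "analysis": {
        "char_filter": {
            "nfc_normalize": {"type": "icu_normalizer", "name": "nfc"},
        },
        "filter": {
            "turkish_lowercase": {"type": "lowercase", "language": "turkish"},
            "accent_fold": {"type": "icu_folding"},  # also folds width variants
        },
        "analyzer": {
            "canonical_locale": {
                "type": "custom",
                "char_filter": ["nfc_normalize"],
                "tokenizer": "icu_tokenizer",
                "filter": ["turkish_lowercase", "accent_fold"],
            }
        },
    }
}

mappings = {
    "properties": {
        "title": {
            "type": "text",
            "analyzer": "canonical_locale",
            "search_analyzer": "canonical_locale",   # same chain on both paths
            "fields": {"raw": {"type": "keyword"}},  # exact form preserved
        },
        "published": {"type": "date", "format": "strict_date"},  # ISO YYYY-MM-DD
    }
}

es.indices.create(index="docs_locale_canonical", settings=settings, mappings=mappings)
```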
### BM25 in code or light stores

- Pre-normalize text and queries with a single code path: ISO dates, canonical numerics, consistent punctuation width, optional accent fold, locale-aware lowercase.
- Log the effective tokens to verify identical behavior across runs; a minimal sketch follows below.

Open: retrieval-traceability.md
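
A minimal sketch, assuming the `rank_bm25` package and reusing the `canonicalize()` helper sketched under "Fix in 60 seconds". The point is that corpus and query share one code path and the effective tokens are logged:

```python
# One normalization path for ingest and query, with token logging so any
# drift shows up in the logs. canonicalize() is the helper sketched above.
import logging
from rank_bm25 import BM25Okapi

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("locale-drift")

def tokens(text: str) -> list[str]:
    normalized = canonicalize(text)
    toks = normalized.lower().split()        # naive lower; swap in a tr_TR-aware fold where needed
    log.info("effective tokens: %s", toks)   # proves identical behavior across runs
    return toks

corpus = ["Preis: 1.234,56 EUR am 03.07.2024", "Price: 1234.56 EUR on 2024-07-03"]
bm25 = BM25Okapi([tokens(doc) for doc in corpus])
scores = bm25.get_scores(tokens("price 1234.56 on 2024-07-03"))
```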
### Vector stores (FAISS, Milvus, Qdrant, Weaviate, pgvector)

- Apply the same locale normalization before embedding, for both corpus and queries.
- For numerics and dates, consider a lexical sidecar (BM25) to capture exact forms, then a deterministic rerank.
- Re-embed a gold slice to validate ΔS before a full rebuild; see the sketch below.

Open: pattern_vectorstore_fragmentation.md
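
A hedged sketch of the normalize-before-embed rule and the gold-slice check, reusing `canonicalize()` and `delta_s()` from the earlier sketches. Store-specific upsert calls are left out on purpose since they differ per engine:

```python
# Canonicalize on both sides before embedding, then gate the full rebuild
# on a small gold slice of (question, known-correct snippet) pairs.
def embed_for_index(embed, passages: list[str]) -> list:
    return [embed(canonicalize(p)) for p in passages]

def embed_query(embed, query: str):
    return embed(canonicalize(query))

def gold_slice_ok(embed, pairs: list[tuple[str, str]], threshold: float = 0.45) -> bool:
    """All gold pairs must stay at or under the ΔS acceptance target."""
    return all(
        delta_s(embed_query(embed, q), embed(canonicalize(s))) <= threshold
        for q, s in pairs
    )
```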
## Locale normalization policy — quick checklist

- Dates stored and queried as ISO `YYYY-MM-DD`; display can be localized later.
- Numerics use `.` as the decimal separator; thousands separators removed or unified; currency symbol separated from the amount.
- Case-folding policy documented; the Turkish special case applied where needed.
- Accent policy consistent: preserve plus an accent-folded subfield, or fold globally.
- CJK variant policy decided (Hans/Hant) and applied at both ingest and query.
- Punctuation width unified; zero-width and bidi controls stripped where not meaningful (see the sketch below).
- Analyzer identity enforced across index and search paths.
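
For the width and control-character items, a minimal sketch using only the standard library. NFKC folds full-width digits and punctuation to their ASCII forms; the explicit set strips zero-width and bidi controls, which you may want to keep where joiners carry meaning (Indic scripts, emoji sequences):

```python
# Width fold plus removal of zero-width and bidi control characters.
import unicodedata

ZERO_WIDTH_AND_BIDI = {
    "\u200b", "\u200c", "\u200d", "\ufeff",                                # zero-width chars, BOM
    "\u200e", "\u200f", "\u202a", "\u202b", "\u202c", "\u202d", "\u202e",  # bidi controls
}

def normalize_width_and_controls(text: str) -> str:
    text = unicodedata.normalize("NFKC", text)  # "１２３" -> "123", "，" -> ","
    return "".join(ch for ch in text if ch not in ZERO_WIDTH_AND_BIDI)
```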
## Diagnostic checklist

- The same normalization code runs at ingest and in query clients.
- The same analyzer configuration is used for the field as both the index-time analyzer and the `search_analyzer`.
- Logging proves that effective tokens match across locales for the same meaning (see the `_analyze` sketch below).
- Citations remain in the same section after locale canonicalization.
- The rerank stage reads normalized text, not raw payloads.
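
One way to prove token identity is the Elasticsearch `_analyze` API. A sketch, reusing the illustrative index and field names from the mapping sketch above:

```python
# Check that two locale renderings produce identical effective tokens
# against the field's own analyzer.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder endpoint

def field_tokens(index: str, field: str, text: str) -> list[str]:
    resp = es.indices.analyze(index=index, field=field, text=text)
    return [t["token"] for t in resp["tokens"]]

a = field_tokens("docs_locale_canonical", "title", "İstanbul café")
b = field_tokens("docs_locale_canonical", "title", "istanbul cafe")
assert a == b, f"analyzer drift: {a} != {b}"
```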
## Copy-paste tests

### Locale flip probe
Q0: original user query as typed (locale A)
Q1: ISO date, canonical number, same words (locale neutral)
Q2: render in locale B (date/number style only)
Return ΔS for Q0,Q1,Q2, λ_state per run, and a note if citations left the target section.
### Turkish i/İ sanity
Build two forms: 'istanbul', 'İstanbul'.
Verify tokens and matches are identical against titles and anchor fields.
If not, log analyzer outputs and apply a Turkish-aware fold.
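
A small demonstration of why this probe fails with a plain lowercase. Python's default case mapping turns "İ" into "i" plus a combining dot, so the two forms never compare equal; the locale-aware branch assumes the optional PyICU package is installed:

```python
# Plain lower() vs a Turkish-aware fold for the i/İ probe.
import unicodedata

plain = "İstanbul".lower()
print(plain == "istanbul")                       # False: "i̇stanbul" keeps U+0307

try:
    from icu import Locale, UnicodeString        # PyICU, optional dependency
    folded = str(UnicodeString("İstanbul").toLower(Locale("tr_TR")))
    print(folded == "istanbul")                  # True with Turkish rules
except ImportError:
    # Fallback policy sketch: strip the leftover combining dot after lower().
    folded = "".join(c for c in unicodedata.normalize("NFD", plain) if c != "\u0307")
    print(unicodedata.normalize("NFC", folded) == "istanbul")
```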
### Accent policy audit
Index doc: 'café', 'résumé'.
Queries: 'cafe', 'resume', original diacritics.
Expect both forms to match the same snippet and citations to remain stable.
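
A sketch of the "preserve plus accent-folded subfield" policy that makes this audit pass: store the original for display and citation, and a folded sibling field for matching. Field names are illustrative:

```python
# Keep the original diacritics and add a folded subfield so 'cafe' and
# 'resume' land on the same snippet as 'café' and 'résumé'.
import unicodedata

def accent_fold(text: str) -> str:
    nfd = unicodedata.normalize("NFD", text)
    return unicodedata.normalize("NFC", "".join(c for c in nfd if not unicodedata.combining(c)))

def to_document(snippet_id: str, body: str) -> dict:
    return {
        "id": snippet_id,
        "body": body,                      # original form for display and citations
        "body_folded": accent_fold(body),  # matched by accent-free queries
    }

doc = to_document("s1", "café résumé")
assert doc["body_folded"] == "cafe resume"
```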
## When to escalate

- ΔS stays ≥ 0.60 after locale normalization and analyzer alignment: re-chunk with stable boundaries and re-embed a gold slice. Open: chunking-checklist.md
- Answers alternate between locales while citations drift: enforce the snippet schema and forbid cross-section reuse. Open: data-contracts.md, retrieval-traceability.md
- Hybrid retrieval still underperforms a single retriever after fixes: align locale rules before rerank and make the rerank deterministic. Open: rerankers.md
## 🔗 Quick-Start Downloads (60 sec)
| Tool | Link | 3-Step Setup |
|---|---|---|
| WFGY 1.0 PDF | Engine Paper | 1️⃣ Download · 2️⃣ Upload to your LLM · 3️⃣ Ask “Answer using WFGY + <your question>” |
| TXT OS (plain-text OS) | TXTOS.txt | 1️⃣ Download · 2️⃣ Paste into any LLM chat · 3️⃣ Type “hello world” — OS boots instantly |
## 🧭 Explore More
| Module | Description | Link |
|---|---|---|
| WFGY Core | WFGY 2.0 engine is live: full symbolic reasoning architecture and math stack | View → |
| Problem Map 1.0 | Initial 16-mode diagnostic and symbolic fix framework | View → |
| Problem Map 2.0 | RAG-focused failure tree, modular fixes, and pipelines | View → |
| Semantic Clinic Index | Expanded failure catalog: prompt injection, memory bugs, logic drift | View → |
| Semantic Blueprint | Layer-based symbolic reasoning & semantic modulations | View → |
| Benchmark vs GPT-5 | Stress test GPT-5 with full WFGY reasoning suite | View → |
| 🧙♂️ Starter Village 🏡 | New here? Lost in symbols? Click here and let the wizard guide you through | Start → |
👑 Early Stargazers: See the Hall of Fame
⭐ WFGY Engine 2.0 is already unlocked. ⭐ Star the repo to help others discover it and unlock more on the Unlock Board.