WFGY/ProblemMap/GlobalFixMap/Language/locale_drift.md
2025-09-05 11:04:48 +08:00


Locale Drift — Guardrails and Fix Patterns


Stabilize retrieval when locale settings silently change token rules, analyzers, and normalization between ingest, index, and query. Typical failures include en_US vs en_GB spelling, tr_TR case-folding (“i/İ”), decimal and thousands separators, date formats, Simplified/Traditional Chinese, and accent stripping differences.



Core acceptance targets

  • ΔS(question, retrieved) ≤ 0.45 across locale variants of the same query
  • Coverage of the target section ≥ 0.70 after repair
  • λ remains convergent across three paraphrases and two seeds
  • E_resonance flat on long windows that include locale-sensitive tokens (dates, numbers, currencies)

What this failure looks like

| Symptom | Likely cause | Where to fix |
|---|---|---|
| High similarity yet wrong section when query switches , and . in numbers (e.g., 1.234,56) | Different locale decimal/thousand separators between ingest and query | Normalize numerics before index and query; align analyzers |
| Dates “03/07/2024” retrieved as July instead of March | Ambiguous locale date parsing | Canonicalize to ISO YYYY-MM-DD at ingest and query |
| “istanbul” mismatches titles with “İstanbul” | Turkish case-folding rules differ across stages | Use locale-aware fold or ASCII base form consistently |
| “straße/strasse” flip in German content | ß vs ss normalization mismatch | Decide policy (preserve ß or fold to ss) and apply everywhere |
| “café” differs from “cafe” across stores | Accent stripping only on one side | Apply accent policy uniformly; prefer keeping both forms via subfield |
| English vs Chinese punctuation causes token joins/drops | Locale-specific punctuation width and spacing | Normalize width, unify punctuation rules; ensure same analyzer |
| zh-Hans vs zh-Hant documents never co-retrieve | Variant mapping missing | Map variants at ingest or add alias field; verify embeddings share policy |

Fix in 60 seconds

  1. Measure ΔS. Compute ΔS(question, retrieved) with the current locale, then re-run with a canonicalized query: ISO dates, normalized numbers, consistent case-folding. If ΔS drops by ≥ 0.10, locale drift is the likely root cause.

  2. Probe λ_observe. Flip only the locale-sensitive tokens (date, number, currency symbol, diacritics). If λ flips or citations jump, lock the schema and fix normalization before touching rerankers.

  3. Apply the smallest structural change

  • Canonical numerics: convert decimals to . and thousands to thin-space or remove thousands.
  • Canonical dates: rewrite to YYYY-MM-DD and store a parsed date field for filters.
  • Case-folding: choose locale-aware rules where needed (tr_TR i/İ), else use simple lower with exceptions list.
  • Diacritics: either preserve and add an accent-folded subfield, or fold everywhere.
  • CJK: unify Simplified/Traditional mapping per field and keep a raw subfield.
  4. Verify. Confirm coverage ≥ 0.70 and ΔS ≤ 0.45 on three paraphrases and two seeds, using both locale renderings.
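The numeric and date canonicalization in step 3 can be sketched as follows. This is a minimal illustration, not a complete locale library: the `LOCALE_STYLES` table and both function names are assumptions made for this example, and the source locale is passed in explicitly because a string such as 03/07/2024 cannot be disambiguated from the text alone.

```python
from datetime import datetime

# Hypothetical style table for this sketch:
# locale tag -> (decimal separator, thousands separator, date format)
LOCALE_STYLES = {
    "en_US": (".", ",", "%m/%d/%Y"),
    "en_GB": (".", ",", "%d/%m/%Y"),
    "de_DE": (",", ".", "%d.%m.%Y"),
}


def canonical_number(text: str, locale: str) -> str:
    """Drop thousands separators and use '.' as the decimal separator."""
    decimal_sep, thousands_sep, _ = LOCALE_STYLES[locale]
    without_thousands = text.replace(thousands_sep, "")
    return without_thousands.replace(decimal_sep, ".")


def canonical_date(text: str, locale: str) -> str:
    """Parse a locale-styled date and return ISO YYYY-MM-DD."""
    _, _, date_format = LOCALE_STYLES[locale]
    return datetime.strptime(text, date_format).date().isoformat()


if __name__ == "__main__":
    print(canonical_number("1.234,56", "de_DE"))
    print(canonical_date("03/07/2024", "en_US"))
    print(canonical_date("03/07/2024", "en_GB"))
```

Running the same two functions at ingest and at query time is what keeps both sides on one canonical form; the typed date field for filters can be built from the ISO string.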

Minimal repair recipes by stack

Elasticsearch / OpenSearch

  • Define one canonical analyzer chain and set it as both the field's analyzer (index time) and its search_analyzer (query time). Suggested: ICU normalizer (NFC) → width fold → optional accent fold (or keep + keyword subfield) → locale-aware lowercase.
  • Add numeric and date normalizers in an ingest pipeline. Persist ISO strings, plus typed fields for range queries.
  • For German and Turkish, use dedicated token filters: german_normalization for German, and for Turkish i/İ either the lowercase filter with language set to turkish or a custom fold.
  • For Chinese and Japanese: keep a keyword subfield for exact product names and a bigram analyzer for recall. Open: retrieval-playbook.md
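As a rough sketch of the shared analyzer chain, the Python snippet below builds an index body as a plain dict. The names `build_index_body`, `canonical_text`, `canonical_folded`, `nfc_normalizer`, and `turkish_lowercase` are made up for this example. The `icu_normalizer` char filter assumes the analysis-icu plugin is installed, and the body has not been validated against any particular Elasticsearch or OpenSearch version.

```python
import json


def build_index_body(field: str = "body", language: str = "") -> dict:
    """Build settings and mappings where index and search share one analyzer."""
    # Turkish needs its own lowercase rules for i/İ; other languages use the default.
    lowercase = "turkish_lowercase" if language == "tr" else "lowercase"
    analysis = {
        "char_filter": {
            # Requires the analysis-icu plugin (assumption of this sketch).
            "nfc_normalizer": {"type": "icu_normalizer", "name": "nfc", "mode": "compose"}
        },
        "filter": {
            "turkish_lowercase": {"type": "lowercase", "language": "turkish"}
        },
        "analyzer": {
            "canonical_text": {
                "type": "custom",
                "char_filter": ["nfc_normalizer"],
                "tokenizer": "standard",
                "filter": ["cjk_width", lowercase],
            },
            # Accent-folded variant used only by the subfield below.
            "canonical_folded": {
                "type": "custom",
                "char_filter": ["nfc_normalizer"],
                "tokenizer": "standard",
                "filter": ["cjk_width", lowercase, "asciifolding"],
            },
        },
    }
    mappings = {
        "properties": {
            field: {
                "type": "text",
                "analyzer": "canonical_text",
                "search_analyzer": "canonical_text",
                "fields": {
                    "folded": {"type": "text", "analyzer": "canonical_folded"},
                    "raw": {"type": "keyword"},
                },
            }
        }
    }
    return {"settings": {"analysis": analysis}, "mappings": mappings}


if __name__ == "__main__":
    print(json.dumps(build_index_body("title", language="tr"), indent=2, ensure_ascii=False))
```

Generating the body from one function makes it harder for the index-time and query-time analyzers to diverge by hand edits.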

BM25 in code or light stores

  • Pre-normalize text and queries with a single code path: ISO dates, canonical numerics, consistent punctuation width, optional accent fold, locale-aware lower.
  • Log the effective tokens to verify identical behavior across runs. Open: retrieval-traceability.md
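A single code path with token logging might look like the sketch below. The function names and the `locale_drift` logger name are assumptions for illustration. NFKC is used as a simple stand-in for width folding, and it also rewrites other compatibility characters (ligatures, superscripts), so treat it as a policy choice rather than a neutral step.

```python
import logging
import re
import unicodedata

logger = logging.getLogger("locale_drift")  # hypothetical logger name


def normalize(text: str) -> str:
    """One normalization path shared by ingest and query."""
    text = unicodedata.normalize("NFKC", text)  # folds full-width forms to ASCII width
    return text.lower()  # simple lowercase; swap in a locale-aware fold if needed


def effective_tokens(text: str, stage: str) -> list:
    """Tokenize the normalized text and log the tokens for later comparison."""
    tokens = re.findall(r"\w+", normalize(text))
    logger.info("stage=%s tokens=%s", stage, tokens)
    return tokens


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    ingest = effective_tokens("\uff30rice \uff11\uff12\uff13", "ingest")  # full-width P and digits
    query = effective_tokens("price 123", "query")
    print(ingest == query)
```

Comparing the logged token lists for the same text at the ingest and query stages is a quick way to confirm that both sides behave identically.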

Vector stores (FAISS, Milvus, Qdrant, Weaviate, pgvector)

  • Apply the same locale normalization before embedding for both corpus and queries.
  • For numerics and dates, consider lexical sidecar (BM25) to capture exact forms, then deterministic rerank.
  • Re-embed a gold slice to validate ΔS before full rebuild. Open: pattern_vectorstore_fragmentation.md

Locale normalization policy — quick checklist

  • Dates stored and queried as ISO YYYY-MM-DD; display can be localized later.
  • Numerics use . as decimal; thousands removed or unified; currency symbol separated from amount.
  • Case-folding policy documented; Turkish special-case applied where needed.
  • Accent policy consistent: preserve + accent-folded subfield, or fold globally.
  • CJK variant policy decided (Hans/Hant) and applied at both ingest and query.
  • Punctuation width unified; zero-width and bidi controls stripped where not meaningful.
  • Analyzer identity enforced across index and search paths.
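The punctuation-width and control-character items can be sketched in a few lines of Python. The set of stripped code points below is an assumption chosen for this example (zero-width space, BOM, and the bidi marks, embeddings, and isolates); zero-width joiner and non-joiner are deliberately left alone because they carry meaning in some scripts.

```python
import unicodedata

# Code points removed in this sketch: ZWSP, BOM, LRM, RLM,
# bidi embeddings/overrides (U+202A..U+202E), bidi isolates (U+2066..U+2069).
STRIP_TABLE = dict.fromkeys(
    [0x200B, 0xFEFF, 0x200E, 0x200F, *range(0x202A, 0x202F), *range(0x2066, 0x206A)]
)


def clean(text: str) -> str:
    """Strip invisible controls, then fold full-width forms via NFKC."""
    return unicodedata.normalize("NFKC", text.translate(STRIP_TABLE))


if __name__ == "__main__":
    print(clean("abc\u200bdef"))
    print(clean("\uff21\uff22\uff23\uff0c\uff11"))  # full-width ABC, comma, digit 1
```

NFKC folds full-width ASCII variants such as the full-width comma, but it leaves CJK-only punctuation such as 。 and 、 untouched, so a separate mapping is needed if the policy calls for unifying those.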

Diagnostic checklist

  • Same normalization code runs in ingest and in query clients.
  • Same analyzer configuration used for the field at index time (analyzer) and at query time (search_analyzer).
  • Logging proves that effective tokens match across locales for the same meaning.
  • Citations remain in the same section after locale canonicalization.
  • Rerank stage reads normalized text, not raw payloads.

Copy-paste tests

Locale flip probe

Q0: original user query as typed (locale A)
Q1: ISO date, canonical number, same words (locale neutral)
Q2: render in locale B (date/number style only)

Return ΔS for Q0,Q1,Q2, λ_state per run, and a note if citations left the target section.
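One way to generate the three probe queries from a single template is sketched below. It assumes locale A is US-style and locale B is German-style; `probe_queries`, `render_number`, and the template placeholders are hypothetical names for this example. Computing ΔS and λ_state is left to whatever measurement tooling you already use.

```python
from datetime import date


def render_number(value: float, decimal_sep: str, thousands_sep: str) -> str:
    """Render a number with two decimals using the given separators."""
    us_style = f"{value:,.2f}"  # e.g. 1,234.56
    # Swap separators through a placeholder so the two replacements do not collide.
    return (
        us_style.replace(",", "\x00")
        .replace(".", decimal_sep)
        .replace("\x00", thousands_sep)
    )


def probe_queries(template: str, when: date, amount: float) -> dict:
    """Return Q0 (locale A), Q1 (locale neutral), Q2 (locale B) for one template."""
    return {
        "Q0": template.format(date=when.strftime("%m/%d/%Y"),
                              amount=render_number(amount, ".", ",")),
        "Q1": template.format(date=when.isoformat(), amount=f"{amount:.2f}"),
        "Q2": template.format(date=when.strftime("%d.%m.%Y"),
                              amount=render_number(amount, ",", ".")),
    }


if __name__ == "__main__":
    queries = probe_queries("invoices over {amount} due {date}", date(2024, 3, 7), 1234.56)
    for name, query in queries.items():
        print(name, query)
```

Because only the date and number renderings change, any difference in retrieval between the three queries can be attributed to locale handling rather than wording.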

Turkish i/İ sanity

Build two forms: 'istanbul', 'İstanbul'.
Verify tokens and matches are identical against titles and anchor fields.
If not, log analyzer outputs and apply a Turkish-aware fold.
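The mismatch is easy to reproduce in plain Python, where the default `str.lower()` maps İ (U+0130) to an i followed by a combining dot above. The `turkish_lower` helper below is a hypothetical, minimal Turkish-aware fold for this sanity check, not a full locale implementation.

```python
import unicodedata


def turkish_lower(text: str) -> str:
    """Lowercase with Turkish rules: İ -> i and I -> ı (dotless i)."""
    text = unicodedata.normalize("NFC", text)  # compose I + U+0307 into U+0130 first
    return text.replace("\u0130", "i").replace("I", "\u0131").lower()


if __name__ == "__main__":
    title = "\u0130stanbul"  # İstanbul
    # Default lowercasing leaves a combining dot, so the forms differ.
    print(title.lower() == "istanbul")
    # The Turkish-aware fold makes both forms identical.
    print(turkish_lower(title) == turkish_lower("istanbul"))
```

Note that under Turkish rules an all-caps ASCII "ISTANBUL" folds to "ıstanbul" with a dotless ı, so apply this fold only to fields known to contain Turkish text, or choose the ASCII base form policy instead.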

Accent policy audit

Index doc: 'café', 'résumé'.
Queries: 'cafe', 'resume', original diacritics.
Expect both forms to match the same snippet and citations to remain stable.
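For stores without a built-in accent filter, the "preserve plus folded subfield" policy can be approximated as below. The helper names and the `raw`/`folded` keys are assumptions for this sketch. Folding works by decomposing to NFD and dropping combining marks, so letters without a decomposition (for example ß or ø) are left unchanged.

```python
import unicodedata


def fold_accents(text: str) -> str:
    """Remove combining marks after canonical decomposition."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return unicodedata.normalize("NFC", stripped)


def index_forms(text: str) -> dict:
    """Keep the original text and add an accent-folded, lowercased form."""
    return {"raw": text, "folded": fold_accents(text).lower()}


if __name__ == "__main__":
    for word in ["café", "résumé"]:
        print(index_forms(word))
    # Apply the same fold to the query so both sides agree.
    print(fold_accents("cafe") == fold_accents("café"))
```

Running the audit then amounts to checking that the folded query matches the folded document form while the raw form still supports exact matching.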

When to escalate

  • ΔS stays ≥ 0.60 after locale normalization and analyzer alignment. Re-chunk with stable boundaries and re-embed a gold slice. Open: chunking-checklist.md

  • Answers alternate between locales while citations drift. Enforce snippet schema and forbid cross-section reuse. Open: data-contracts.md, retrieval-traceability.md

  • Hybrid retrieval still underperforms a single retriever after fixes. Align locale rules before rerank and make rerank deterministic. Open: rerankers.md

