# Locale Drift — Guardrails and Fix Patterns
Stabilize retrieval when locale settings silently change token rules, analyzers, and normalization between ingest, index, and query. Typical failures include en_US vs en_GB spelling, tr_TR case-folding (“i/İ”), decimal and thousands separators, date formats, Simplified/Traditional Chinese, and accent stripping differences.
## Open these first
- Visual map and recovery: rag-architecture-and-recovery.md
- End to end retrieval knobs: retrieval-playbook.md
- Why this snippet and how to cite: retrieval-traceability.md
- Snippet schema fence: data-contracts.md
- Embedding vs meaning: embedding-vs-semantic.md
- Chunk boundary sanity: chunking-checklist.md
Related in this folder:
- Tokenizer drift: tokenizer_mismatch.md
- Mixed scripts in one query: script_mixing.md
## Core acceptance targets
- ΔS(question, retrieved) ≤ 0.45 across locale variants of the same query
- Coverage of the target section ≥ 0.70 after repair
- λ remains convergent across three paraphrases and two seeds
- E_resonance flat on long windows that include locale-sensitive tokens (dates, numbers, currencies)
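
The ΔS threshold above can be probed without special tooling. A minimal sketch, assuming ΔS is approximated as 1 − cosine similarity between question and snippet embeddings (the exact WFGY definition may differ), with a hypothetical `embed()` encoder supplied by your pipeline:

```python
# Hedged proxy for the ΔS targets above: ΔS(question, retrieved) ≈
# 1 - cosine(question_vec, snippet_vec). embed() is whatever encoder
# your pipeline already uses; it is not defined here.
import numpy as np

def delta_s(q_vec: np.ndarray, s_vec: np.ndarray) -> float:
    """Semantic-tension proxy: 0 means aligned, higher means drifting apart."""
    cos = float(np.dot(q_vec, s_vec) / (np.linalg.norm(q_vec) * np.linalg.norm(s_vec)))
    return 1.0 - cos

def locale_drift_gap(embed, query_as_typed: str, query_canonical: str, snippet: str) -> float:
    """A positive gap of >= 0.10 points at locale drift (see 'Fix in 60 seconds')."""
    s_vec = embed(snippet)
    return delta_s(embed(query_as_typed), s_vec) - delta_s(embed(query_canonical), s_vec)
```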
## What this failure looks like
| Symptom | Likely cause | Where to fix |
|---|---|---|
| High similarity yet wrong section when the query swaps "," and "." in numbers (e.g., 1.234,56) | Different locale decimal/thousands separators between ingest and query | Normalize numerics before index and query; align analyzers |
| Dates “03/07/2024” retrieved as July instead of March | Ambiguous locale date parsing | Canonicalize to ISO YYYY-MM-DD at ingest and query |
| “istanbul” mismatches titles with “İstanbul” | Turkish case-folding rules differ across stages | Use locale-aware fold or ASCII base form consistently |
| “straße/strasse” flip in German content | ß vs ss normalization mismatch | Decide policy (preserve ß or fold to ss) and apply everywhere |
| “café” differs from “cafe” across stores | Accent stripping only on one side | Apply accent policy uniformly; prefer keeping both forms via subfield |
| English vs Chinese punctuation causes token joins/drops | Locale-specific punctuation width and spacing | Normalize width, unify punctuation rules; ensure same analyzer |
| zh-Hans vs zh-Hant documents never co-retrieve | Variant mapping missing | Map variants at ingest or add alias field; verify embeddings share policy |
## Fix in 60 seconds

1. **Measure ΔS.** Compute ΔS(question, retrieved) with the current locale. Re-run with a canonicalized query: ISO dates, normalized numbers, consistent case-folding. If ΔS drops by ≥ 0.10, locale drift is your root cause.

2. **Probe λ_observe.** Flip only the locale-sensitive tokens (date, number, currency symbol, diacritics). If λ flips or citations jump, lock the schema and fix normalization before touching rerankers.

3. **Apply the smallest structural change** (a minimal canonicalization sketch follows this list).
   - Canonical numerics: use `.` as the decimal separator; replace thousands separators with a thin space or remove them.
   - Canonical dates: rewrite to `YYYY-MM-DD` and store a parsed date field for filters.
   - Case-folding: choose locale-aware rules where needed (`tr_TR` i/İ), else use a simple lowercase with an exceptions list.
   - Diacritics: either preserve and add an accent-folded subfield, or fold everywhere.
   - CJK: unify the Simplified/Traditional mapping per field and keep a raw subfield.

4. **Verify.** Coverage ≥ 0.70 and ΔS ≤ 0.45 on three paraphrases and two seeds, using both locale renderings.
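
A minimal canonicalization sketch for step 3, assuming European-style "1.234,56" numerics and day-first dates in the source text. The regexes and helper names are illustrative; a production pipeline would detect the source locale explicitly or lean on libraries such as `dateutil` or `Babel`:

```python
# Canonicalization sketch for step 3. The regexes assume a known source
# locale (thousands ".", decimal ",", day-first dates); adjust per corpus.
import re
import unicodedata
from datetime import datetime

def canonical_number(text: str) -> str:
    # "1.234,56" -> "1234.56": drop thousands dots, turn the decimal comma into a dot.
    return re.sub(
        r"\b\d{1,3}(\.\d{3})+(,\d+)?\b",
        lambda m: m.group(0).replace(".", "").replace(",", "."),
        text,
    )

def canonical_date(text: str, dayfirst: bool = True) -> str:
    # "03.07.2024" or "03/07/2024" -> "2024-07-03" (ISO), assuming day-first input.
    def _iso(m: re.Match) -> str:
        d, mth, y = int(m.group(1)), int(m.group(2)), int(m.group(3))
        if not dayfirst:
            d, mth = mth, d
        return datetime(y, mth, d).strftime("%Y-%m-%d")
    return re.sub(r"\b(\d{1,2})[./](\d{1,2})[./](\d{4})\b", _iso, text)

def fold_accents(text: str) -> str:
    # Accent-folded form; keep the original in a sibling field if the policy is "preserve".
    nfd = unicodedata.normalize("NFD", text)
    return unicodedata.normalize("NFC", "".join(c for c in nfd if not unicodedata.combining(c)))

def canonicalize(text: str) -> str:
    # One shared code path for ingest and query; accents are handled by field policy.
    text = unicodedata.normalize("NFC", text)
    return canonical_date(canonical_number(text))
```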
## Minimal repair recipes by stack

### Elasticsearch / OpenSearch

- Define a canonical analyzer chain shared by the index-time `analyzer` and the `search_analyzer`. Suggested: ICU normalizer (NFC) → width fold → optional accent fold (or keep the original plus a keyword subfield) → locale-aware lowercase.
- Add numeric and date normalizers in an ingest pipeline. Persist ISO strings, plus typed fields for range queries.
- For German and Turkish, use dedicated token filters (`german_normalization`, a custom fold for Turkish i/İ).
- For Chinese and Japanese, keep a keyword subfield for exact product names and a bigram analyzer for recall. See the mapping sketch below.

Open: retrieval-playbook.md
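
A hedged mapping sketch of one shared analyzer chain, assuming the `analysis-icu` plugin is installed and the `elasticsearch` Python client is available. Index name, field names, and the exact filter mix are illustrative, not a drop-in configuration:

```python
# Sketch: one analyzer chain used on both the index and search paths,
# plus a keyword subfield for exact forms and a typed ISO date field.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder endpoint

settings = {
    "analysis": {
        "char_filter": {
            "nfc_normalize": {"type": "icu_normalizer", "name": "nfc"},
        },
        "filter": {
            "turkish_lowercase": {"type": "lowercase", "language": "turkish"},
            "accent_fold": {"type": "icu_folding"},  # also folds width variants
        },
        "analyzer": {
            "canonical_locale": {
                "type": "custom",
                "char_filter": ["nfc_normalize"],
                "tokenizer": "icu_tokenizer",
                "filter": ["turkish_lowercase", "accent_fold"],
            }
        },
    }
}

mappings = {
    "properties": {
        "title": {
            "type": "text",
            "analyzer": "canonical_locale",
            "search_analyzer": "canonical_locale",   # same chain on both paths
            "fields": {"raw": {"type": "keyword"}},  # exact form preserved
        },
        "published": {"type": "date", "format": "strict_date"},  # ISO YYYY-MM-DD
    }
}

es.indices.create(index="docs_locale_canonical", settings=settings, mappings=mappings)
```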
### BM25 in code or light stores

- Pre-normalize text and queries with a single code path: ISO dates, canonical numerics, consistent punctuation width, optional accent fold, locale-aware lowercase.
- Log the effective tokens to verify identical behavior across runs; a minimal sketch follows below.

Open: retrieval-traceability.md
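
A minimal sketch, assuming the `rank_bm25` package and reusing the `canonicalize()` helper sketched under "Fix in 60 seconds". The point is that corpus and query share one code path and the effective tokens are logged:

```python
# One normalization path for ingest and query, with token logging so any
# drift shows up in the logs. canonicalize() is the helper sketched above.
import logging
from rank_bm25 import BM25Okapi

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("locale-drift")

def tokens(text: str) -> list[str]:
    normalized = canonicalize(text)
    toks = normalized.lower().split()        # naive lower; swap in a tr_TR-aware fold where needed
    log.info("effective tokens: %s", toks)   # proves identical behavior across runs
    return toks

corpus = ["Preis: 1.234,56 EUR am 03.07.2024", "Price: 1234.56 EUR on 2024-07-03"]
bm25 = BM25Okapi([tokens(doc) for doc in corpus])
scores = bm25.get_scores(tokens("price 1234.56 on 2024-07-03"))
```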
### Vector stores (FAISS, Milvus, Qdrant, Weaviate, pgvector)

- Apply the same locale normalization before embedding, for both corpus and queries.
- For numerics and dates, consider a lexical sidecar (BM25) to capture exact forms, then a deterministic rerank.
- Re-embed a gold slice to validate ΔS before a full rebuild; see the sketch below.

Open: pattern_vectorstore_fragmentation.md
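
A hedged sketch of the normalize-before-embed rule and the gold-slice check, reusing `canonicalize()` and `delta_s()` from the earlier sketches. Store-specific upsert calls are left out on purpose since they differ per engine:

```python
# Canonicalize on both sides before embedding, then gate the full rebuild
# on a small gold slice of (question, known-correct snippet) pairs.
def embed_for_index(embed, passages: list[str]) -> list:
    return [embed(canonicalize(p)) for p in passages]

def embed_query(embed, query: str):
    return embed(canonicalize(query))

def gold_slice_ok(embed, pairs: list[tuple[str, str]], threshold: float = 0.45) -> bool:
    """All gold pairs must stay at or under the ΔS acceptance target."""
    return all(
        delta_s(embed_query(embed, q), embed(canonicalize(s))) <= threshold
        for q, s in pairs
    )
```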
## Locale normalization policy — quick checklist

- Dates stored and queried as ISO `YYYY-MM-DD`; display can be localized later.
- Numerics use `.` as the decimal separator; thousands separators removed or unified; currency symbol separated from the amount.
- Case-folding policy documented; the Turkish special case applied where needed.
- Accent policy consistent: preserve plus an accent-folded subfield, or fold globally.
- CJK variant policy decided (Hans/Hant) and applied at both ingest and query.
- Punctuation width unified; zero-width and bidi controls stripped where not meaningful (see the sketch below).
- Analyzer identity enforced across index and search paths.
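
For the width and control-character items, a minimal sketch using only the standard library. NFKC folds full-width digits and punctuation to their ASCII forms; the explicit set strips zero-width and bidi controls, which you may want to keep where joiners carry meaning (Indic scripts, emoji sequences):

```python
# Width fold plus removal of zero-width and bidi control characters.
import unicodedata

ZERO_WIDTH_AND_BIDI = {
    "\u200b", "\u200c", "\u200d", "\ufeff",                                # zero-width chars, BOM
    "\u200e", "\u200f", "\u202a", "\u202b", "\u202c", "\u202d", "\u202e",  # bidi controls
}

def normalize_width_and_controls(text: str) -> str:
    text = unicodedata.normalize("NFKC", text)  # "１２３" -> "123", "，" -> ","
    return "".join(ch for ch in text if ch not in ZERO_WIDTH_AND_BIDI)
```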
## Diagnostic checklist

- The same normalization code runs at ingest and in query clients.
- The same analyzer configuration is used for the field as both the index-time analyzer and the `search_analyzer`.
- Logging proves that effective tokens match across locales for the same meaning (see the `_analyze` sketch below).
- Citations remain in the same section after locale canonicalization.
- The rerank stage reads normalized text, not raw payloads.
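
One way to prove token identity is the Elasticsearch `_analyze` API. A sketch, reusing the illustrative index and field names from the mapping sketch above:

```python
# Check that two locale renderings produce identical effective tokens
# against the field's own analyzer.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder endpoint

def field_tokens(index: str, field: str, text: str) -> list[str]:
    resp = es.indices.analyze(index=index, field=field, text=text)
    return [t["token"] for t in resp["tokens"]]

a = field_tokens("docs_locale_canonical", "title", "İstanbul café")
b = field_tokens("docs_locale_canonical", "title", "istanbul cafe")
assert a == b, f"analyzer drift: {a} != {b}"
```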
## Copy-paste tests

### Locale flip probe
Q0: original user query as typed (locale A)
Q1: ISO date, canonical number, same words (locale neutral)
Q2: render in locale B (date/number style only)
Return ΔS for Q0,Q1,Q2, λ_state per run, and a note if citations left the target section.
### Turkish i/İ sanity
Build two forms: 'istanbul', 'İstanbul'.
Verify tokens and matches are identical against titles and anchor fields.
If not, log analyzer outputs and apply a Turkish-aware fold.
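
A small demonstration of why this probe fails with a plain lowercase. Python's default case mapping turns "İ" into "i" plus a combining dot, so the two forms never compare equal; the locale-aware branch assumes the optional PyICU package is installed:

```python
# Plain lower() vs a Turkish-aware fold for the i/İ probe.
import unicodedata

plain = "İstanbul".lower()
print(plain == "istanbul")                       # False: "i̇stanbul" keeps U+0307

try:
    from icu import Locale, UnicodeString        # PyICU, optional dependency
    folded = str(UnicodeString("İstanbul").toLower(Locale("tr_TR")))
    print(folded == "istanbul")                  # True with Turkish rules
except ImportError:
    # Fallback policy sketch: strip the leftover combining dot after lower().
    folded = "".join(c for c in unicodedata.normalize("NFD", plain) if c != "\u0307")
    print(unicodedata.normalize("NFC", folded) == "istanbul")
```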
### Accent policy audit
Index doc: 'café', 'résumé'.
Queries: 'cafe', 'resume', original diacritics.
Expect both forms to match the same snippet and citations to remain stable.
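
A sketch of the "preserve plus accent-folded subfield" policy that makes this audit pass: store the original for display and citation, and a folded sibling field for matching. Field names are illustrative:

```python
# Keep the original diacritics and add a folded subfield so 'cafe' and
# 'resume' land on the same snippet as 'café' and 'résumé'.
import unicodedata

def accent_fold(text: str) -> str:
    nfd = unicodedata.normalize("NFD", text)
    return unicodedata.normalize("NFC", "".join(c for c in nfd if not unicodedata.combining(c)))

def to_document(snippet_id: str, body: str) -> dict:
    return {
        "id": snippet_id,
        "body": body,                      # original form for display and citations
        "body_folded": accent_fold(body),  # matched by accent-free queries
    }

doc = to_document("s1", "café résumé")
assert doc["body_folded"] == "cafe resume"
```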
## When to escalate

- ΔS stays ≥ 0.60 after locale normalization and analyzer alignment: re-chunk with stable boundaries and re-embed a gold slice. Open: chunking-checklist.md
- Answers alternate between locales while citations drift: enforce the snippet schema and forbid cross-section reuse. Open: data-contracts.md, retrieval-traceability.md
- Hybrid retrieval still underperforms a single retriever after fixes: align locale rules before rerank and make the rerank deterministic. Open: rerankers.md
## 🔗 Quick-Start Downloads (60 sec)
| Tool | Link | 3-Step Setup |
|---|---|---|
| WFGY 1.0 PDF | Engine Paper | 1️⃣ Download · 2️⃣ Upload to your LLM · 3️⃣ Ask “Answer using WFGY + <your question>” |
| TXT OS (plain-text OS) | TXTOS.txt | 1️⃣ Download · 2️⃣ Paste into any LLM chat · 3️⃣ Type “hello world” — OS boots instantly |
## 🧭 Explore More
| Module | Description | Link |
|---|---|---|
| WFGY Core | WFGY 2.0 engine is live: full symbolic reasoning architecture and math stack | View → |
| Problem Map 1.0 | Initial 16-mode diagnostic and symbolic fix framework | View → |
| Problem Map 2.0 | RAG-focused failure tree, modular fixes, and pipelines | View → |
| Semantic Clinic Index | Expanded failure catalog: prompt injection, memory bugs, logic drift | View → |
| Semantic Blueprint | Layer-based symbolic reasoning & semantic modulations | View → |
| Benchmark vs GPT-5 | Stress test GPT-5 with full WFGY reasoning suite | View → |
| 🧙♂️ Starter Village 🏡 | New here? Lost in symbols? Click here and let the wizard guide you through | Start → |
👑 Early Stargazers: See the Hall of Fame
⭐ WFGY Engine 2.0 is already unlocked. ⭐ Star the repo to help others discover it and unlock more on the Unlock Board.