Input Language Switching — Guardrails and Fix Pattern
Use this page when the user’s keyboard or input language flips mid-thread and retrieval or reasoning starts drifting. Classic signs include mixed scripts in one query, wrong tokenizer chosen by the stack, or partial IME composition text being sent to the server.
When to use
- Users alternate between EN ↔ CJK ↔ RTL in the same session.
- Queries look semantically fine but retriever recall collapses after a language flip.
- You see half words or odd segments caused by IME composition text being submitted.
- Autocorrect or predictive text silently changes script or diacritics.
Acceptance targets
- ΔS(question, retrieved) ≤ 0.45 across 3 paraphrases in the active language.
- Coverage of target section ≥ 0.70 after the switch.
- λ remains convergent when toggling `lang=xx-BCP47` between the last two user turns.
- E_resonance stays flat over 20-40 turns.
Open these first
- Language tokenization mismatches → tokenizer_mismatch.md
- Script mixing in a single query → script_mixing.md
- Diacritics and normalization policy → diacritics_and_folding.md
- Locale drift across turns → locale_drift.md
- CJK segmentation specifics → cjk_segmentation_wordbreak.md
- Contract your snippet schema → data-contracts.md
- End to end retrieval knobs → retrieval-playbook.md
Diagnose quickly
- Spot the flip: log BCP-47 tags per turn. If `lang_prev != lang_now`, record a switch event with `t_switch`.
- IME composition fence: check inputs for composition markers or incomplete tokens. If composition is not complete, do not send the text to retrieval or indexing.
- Tokenizer audit: confirm the retriever tokenizer matches `lang_now`. English-only analyzers on CJK or RTL text usually cause flat, high ΔS.
- Punctuation and width: look for full-width digits or punctuation introduced by the keyboard. If present, apply the width normalizer from the punctuation page. Open: digits_width_punctuation.md
- Direction and collation: if RTL or locale sorting changed, set `dir` and collation explicitly in the request. Open: the existing RTL guide and the collation page, locale_collation_and_sorting.md
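The flip check in the first step can be sketched in a few lines of Python. This is a minimal sketch, not part of WFGY: `SwitchEvent` and `find_switch_events` are hypothetical names, and each turn is assumed to arrive as a `(timestamp, lang_bcp47)` pair. Tags are compared case-insensitively because BCP-47 tags are case-insensitive.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class SwitchEvent:
    turn_index: int   # index of the first turn in the new language
    lang_prev: str
    lang_now: str
    t_switch: float   # timestamp of that turn


def find_switch_events(turns: List[Tuple[float, str]]) -> List[SwitchEvent]:
    """Compare each turn's BCP-47 tag with the previous one and record flips."""
    events = []
    for i in range(1, len(turns)):
        _, lang_prev = turns[i - 1]
        t_now, lang_now = turns[i]
        if lang_prev.lower() != lang_now.lower():
            events.append(SwitchEvent(i, lang_prev, lang_now, t_now))
    return events


if __name__ == "__main__":
    session = [(0.0, "en-US"), (4.2, "en-us"), (9.7, "ja-JP"), (15.1, "ar-EG")]
    for event in find_switch_events(session):
        print(event)
```

Comparing full tags means a move from `en-US` to `en-GB` also counts as a switch. If that is too noisy for your logs, compare only the primary language subtag.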
Fix pattern
A. Add an Input Contract to every user turn
Required fields you attach to the message payload before retrieval:
{
"raw_text": "...",
"normalized_text": "...",
"lang_bcp47": "xx-YY",
"script": "Latn|Cyrl|Arab|Hans|Hant|Kana|... ",
"dir": "ltr|rtl",
"ime_composition_complete": true,
"keyboard_hint": "gboard|ios|system|unknown"
}
`normalized_text` must follow your LanguageLocale normalization policy. See: diacritics_and_folding.md
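One way to fill this contract at a gateway is sketched below. Everything in it is an assumption made for illustration: `build_input_contract` and `dominant_script` are hypothetical helpers, the script guess relies on Unicode character-name prefixes (a heuristic, not a full script detector), and NFC stands in for whatever normalization policy you actually use. Code points alone cannot separate `Hans` from `Hant`, so Han text is reported as `Hani` here.

```python
import unicodedata
from collections import Counter

# Hypothetical lookup: Unicode character-name prefix -> ISO 15924 script code.
_NAME_PREFIX_TO_SCRIPT = (
    ("LATIN", "Latn"),
    ("CYRILLIC", "Cyrl"),
    ("ARABIC", "Arab"),
    ("HEBREW", "Hebr"),
    ("HIRAGANA", "Hira"),
    ("KATAKANA", "Kana"),
    ("HANGUL", "Hang"),
    ("CJK", "Hani"),  # Han; Hans vs Hant cannot be decided from code points
)
_RTL_SCRIPTS = {"Arab", "Hebr"}


def dominant_script(text: str) -> str:
    """Return the most frequent script among the letters in text."""
    counts = Counter()
    for ch in text:
        if not ch.isalpha():
            continue
        name = unicodedata.name(ch, "")
        for prefix, code in _NAME_PREFIX_TO_SCRIPT:
            if name.startswith(prefix):
                counts[code] += 1
                break
    # "Zyyy" is the ISO 15924 code for an undetermined script.
    return counts.most_common(1)[0][0] if counts else "Zyyy"


def build_input_contract(raw_text: str, lang_bcp47: str,
                         ime_composition_complete: bool,
                         keyboard_hint: str = "unknown") -> dict:
    """Assemble the per-turn contract; NFC is a placeholder policy."""
    script = dominant_script(raw_text)
    return {
        "raw_text": raw_text,
        "normalized_text": unicodedata.normalize("NFC", raw_text),
        "lang_bcp47": lang_bcp47,
        "script": script,
        "dir": "rtl" if script in _RTL_SCRIPTS else "ltr",
        "ime_composition_complete": ime_composition_complete,
        "keyboard_hint": keyboard_hint,
    }
```

The language tag is passed in rather than guessed, since the recipe on this page detects it on the client. Deriving `dir` from the detected script keeps the two fields from disagreeing after a flip.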
B. Route to the correct analyzer
- Select tokenizer/analyzer based on `lang_bcp47` and `script`.
- For CJK, switch to a CJK-aware analyzer and re-ranker; for mixed text, run a language split, then merge with a deterministic re-rank. See: cjk_segmentation_wordbreak.md and rerankers.md
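A table-driven router is one simple way to make this selection explicit. The sketch below is illustrative only: `select_analyzer` is a hypothetical function and the returned labels (`cjk_aware`, `rtl_aware`, `english`, `multilingual_default`) are placeholders to map onto whatever analyzers your search stack provides.

```python
# Hypothetical script groups; extend them to match your corpus.
_CJK_SCRIPTS = {"Hans", "Hant", "Hani", "Hira", "Kana", "Hang"}
_RTL_SCRIPTS = {"Arab", "Hebr"}


def select_analyzer(lang_bcp47: str, script: str) -> str:
    """Pick an analyzer label from the script first, then the language tag."""
    if script in _CJK_SCRIPTS:
        return "cjk_aware"
    if script in _RTL_SCRIPTS:
        return "rtl_aware"
    primary = lang_bcp47.split("-")[0].lower()
    if primary == "en":
        return "english"
    return "multilingual_default"


if __name__ == "__main__":
    # A stale "en-US" tag on Han text still routes to the CJK analyzer.
    print(select_analyzer("en-US", "Hans"))
```

Letting the script win over the language tag is a deliberate choice in this sketch: right after a keyboard flip the tag is the field most likely to be stale, while the script is observed from the text itself.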
C. Guard IME composition
- Add a client or gateway composition fence. If `ime_composition_complete=false`, buffer and wait.
- Only proceed to retrieval once confirmed complete.
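A gateway-side fence could look like the following minimal sketch. `CompositionFence` is a hypothetical class, it assumes each turn arrives as the Input Contract dictionary from step A, and it keeps state in memory only, so a real deployment would need its own session store.

```python
from typing import Dict, Optional


class CompositionFence:
    """Holds back turns whose IME composition is still in progress."""

    def __init__(self) -> None:
        self._pending: Dict[str, dict] = {}  # latest incomplete turn per session
        self.ime_wait_count = 0

    def submit(self, session_id: str, contract: dict) -> Optional[dict]:
        """Return the contract if it may proceed to retrieval, else None."""
        # A missing flag is treated as incomplete, so the fence fails closed.
        if not contract.get("ime_composition_complete", False):
            self._pending[session_id] = contract
            self.ime_wait_count += 1
            return None
        self._pending.pop(session_id, None)
        return contract

    def is_waiting(self, session_id: str) -> bool:
        return session_id in self._pending


if __name__ == "__main__":
    fence = CompositionFence()
    partial = {"raw_text": "にほ", "ime_composition_complete": False}
    final = {"raw_text": "日本語", "ime_composition_complete": True}
    print(fence.submit("s1", partial))  # held back
    print(fence.submit("s1", final))    # released
```

The counter doubles as the `ime_wait_count` metric, so the fence and the logging stay in one place.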
D. Stabilize semantics after the flip
- Run a three-paraphrase probe in the active language. If ΔS stays flat and high, rebuild the index or switch its distance metric. Open: embedding-vs-semantic.md, then follow: retrieval-playbook.md
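The probe can be wired up without committing to a particular ΔS implementation. In this sketch `probe_paraphrases` is a hypothetical helper, and `retrieve` and `delta_s` are callables you supply. The 0.45 and 0.60 thresholds come from this page; the `flat_spread` tolerance of 0.05 is an assumption chosen only to make "flat" concrete.

```python
from typing import Callable, Dict, List


def probe_paraphrases(paraphrases: List[str],
                      retrieve: Callable[[str], str],
                      delta_s: Callable[[str, str], float],
                      target: float = 0.45,
                      high: float = 0.60,
                      flat_spread: float = 0.05) -> Dict[str, object]:
    """Score each paraphrase against its retrieved context and summarize."""
    if not paraphrases:
        raise ValueError("need at least one paraphrase")
    scores = [delta_s(q, retrieve(q)) for q in paraphrases]
    spread = max(scores) - min(scores)
    return {
        "scores": scores,
        # Acceptance target: every paraphrase at or below the target.
        "passed": all(s <= target for s in scores),
        # Flat and high: all scores above the escalation line and close together.
        "flat_high": min(scores) >= high and spread <= flat_spread,
    }
```

A `flat_high` result is the signal to rebuild the index or revisit the metric, while a failed probe with scattered scores points back at the analyzer or normalization steps.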
Minimal recipe you can copy
- Detect `lang_bcp47` and `script` on the client.
- Normalize text using your LanguageLocale policy.
- If IME composition is not complete, do nothing. Wait until completion.
- Choose the analyzer and reranker given `lang_bcp47`.
- Retrieve with schema-locked snippets and store: `lang_bcp47`, `script`, `dir`.
- Probe ΔS on 3 paraphrases in the active language. Clamp λ if needed.
- Enforce cite-then-explain with snippet and citation schema. Open: retrieval-traceability.md and data-contracts.md
Copy-paste prompt for the LLM step
You have TXTOS and WFGY Problem Map loaded.
Input meta:
- lang_bcp47 = {xx-YY}
- script = {Latn|Hans|Hant|Arab|...}
- dir = {ltr|rtl}
- ime_composition_complete = {true|false}
Tasks:
1) If ime_composition_complete=false, return: {"action":"wait_for_ime"} only.
2) Use the active language for paraphrases and answers. No silent translation.
3) Validate cite-then-explain. If citations missing, return the fix tip and stop.
4) If ΔS(question,retrieved) ≥ 0.60, suggest the smallest structural fix referencing:
retrieval-playbook, tokenizer_mismatch, script_mixing, diacritics_and_folding.
Return JSON:
{ "answer":"...", "citations":[...], "ΔS":0.xx, "λ_state":"...", "lang_used":"xx-YY" }
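On the receiving side, the JSON returned by this prompt can be checked before it reaches the user. The sketch below is an assumption about how you might do that: `check_llm_response` is a hypothetical helper, and the key names are taken from the return shape in the prompt above.

```python
import json

REQUIRED_KEYS = {"answer", "citations", "ΔS", "λ_state", "lang_used"}


def check_llm_response(raw: str, expected_lang: str) -> str:
    """Return 'wait', 'ok', or a reason string that starts with 'reject:'."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return "reject: not valid JSON"
    if not isinstance(data, dict):
        return "reject: not a JSON object"
    if data == {"action": "wait_for_ime"}:
        return "wait"
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        return "reject: missing " + ", ".join(sorted(missing))
    if not data["citations"]:
        return "reject: no citations"
    # Guard against silent translation into another language.
    if str(data["lang_used"]).lower() != expected_lang.lower():
        return "reject: language changed"
    return "ok"
```

Treating a language mismatch as a rejection enforces the "no silent translation" rule at the boundary instead of trusting the model to follow it.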
Observability
- Log the distribution of `lang_bcp47` across the session. ΔS spikes that coincide with switch events indicate an analyzer mismatch.
- Track ΔS vs k right after a switch. Flat-high curves suggest metric or index misalignment.
- Record `ime_wait_count`. A high count with good outcomes is normal and preferable to composition leakage.
Common gotchas
- Mobile keyboards emit full-width digits or punctuation after language switch. Normalize width before retrieval. Open: digits_width_punctuation.md
- Predictive text inserts diacritics that your index fold does not expect. Align fold policy. Open: diacritics_and_folding.md
- Mixed language metadata causes wrong reroute to an English index. Validate `lang_bcp47` at the gateway. Open: mixed_locale_metadata.md
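For the full-width gotcha, a narrow width fold is easy to sketch. This is not the normalizer from digits_width_punctuation.md, only an illustration: `normalize_width` is a hypothetical helper that maps the Unicode full-width ASCII variants (U+FF01 to U+FF5E) onto their ASCII counterparts and the ideographic space (U+3000) onto a plain space, leaving all other characters untouched.

```python
# Full-width ASCII variants sit exactly 0xFEE0 above their ASCII counterparts.
_WIDTH_TABLE = {cp: cp - 0xFEE0 for cp in range(0xFF01, 0xFF5F)}
_WIDTH_TABLE[0x3000] = 0x20  # ideographic space -> ASCII space


def normalize_width(text: str) -> str:
    """Fold full-width ASCII variants and the ideographic space to ASCII."""
    return text.translate(_WIDTH_TABLE)


if __name__ == "__main__":
    query = "\uff27\uff30\uff34\uff0d\uff14\u3000\uff12\uff10\uff12\uff14"
    print(normalize_width(query))
```

`unicodedata.normalize("NFKC", text)` also folds these characters, but it rewrites many other compatibility forms as well, so the narrower table is easier to reason about when the index fold policy is strict.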
Escalate
- If ΔS remains ≥ 0.60 after the analyzer and normalization fixes, split the index by language or switch to a multilingual embedding and rebuild with the correct metric. Open: embedding-vs-semantic.md and retrieval-playbook.md
Next suggested page to generate
ProblemMap/GlobalFixMap/LanguageLocale/nonstandard_whitespace.md
Focus on NBSP, NNBSP, thin-space, zero-width joiner/non-joiner, and their impact on retrieval.