WFGY/ProblemMap/GlobalFixMap/LanguageLocale/input_language_switching.md
2025-09-05 11:07:24 +08:00


Input Language Switching — Guardrails and Fix Pattern


Use this page when the user's keyboard or input language flips mid-thread and retrieval or reasoning starts drifting. Classic signs include mixed scripts in one query, the stack choosing the wrong tokenizer, or partial IME composition text being sent to the server.

When to use

  • Users alternate between EN ↔ CJK ↔ RTL in the same session.
  • Queries look semantically fine but retriever recall collapses after a language flip.
  • You see half words or odd segments caused by IME composition text being submitted.
  • Autocorrect or predictive text silently changes script or diacritics.

Acceptance targets

  • ΔS(question, retrieved) ≤ 0.45 across 3 paraphrases in the active language.
  • Coverage of target section ≥ 0.70 after the switch.
  • λ remains convergent when toggling lang=xx-BCP47 between the last two user turns.
  • E_resonance stays flat over 20-40 turns.
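
The first two targets can be checked mechanically once ΔS and coverage have been measured. The sketch below assumes those values are computed elsewhere and that "across 3 paraphrases" means every paraphrase must meet the bound; `SwitchCheck` and `passes_acceptance` are hypothetical names, not part of WFGY.

```python
from dataclasses import dataclass
from typing import Sequence

# Thresholds copied from the acceptance targets above.
DELTA_S_MAX = 0.45
COVERAGE_MIN = 0.70
MIN_PARAPHRASES = 3


@dataclass
class SwitchCheck:
    """Measurements taken after a language switch (hypothetical container)."""
    delta_s: Sequence[float]  # one ΔS(question, retrieved) value per paraphrase
    coverage: float           # coverage of the target section, 0.0 to 1.0


def passes_acceptance(check: SwitchCheck) -> bool:
    """True when every paraphrase meets the ΔS bound and coverage is high enough."""
    if len(check.delta_s) < MIN_PARAPHRASES:
        return False  # not enough paraphrases to judge
    return max(check.delta_s) <= DELTA_S_MAX and check.coverage >= COVERAGE_MIN
```

Using the worst paraphrase rather than the mean is a deliberate choice here: one paraphrase that drifts is exactly the symptom this page is about.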

Open these first


Diagnose quickly

  1. Spot the flip: Log BCP-47 tags per turn. If lang_prev != lang_now, record a switch event with t_switch.

  2. IME composition fence: Check inputs for composition markers or incomplete tokens. If composition is not complete, do not send the text to retrieval or indexing.

  3. Tokenizer audit: Confirm the retriever tokenizer matches lang_now. English-only analyzers on CJK or RTL text usually cause flat high ΔS.

  4. Punctuation and width: Look for full-width digits or punctuation introduced by the keyboard. If present, apply the width normalizer from the punctuation page. Open: digits_width_punctuation.md

  5. Direction and collation: If RTL or locale sorting changed, set dir and collation explicitly in the request. Open: existing RTL guide and collation page: locale_collation_and_sorting.md
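
Step 1 can be sketched as a small per-session logger. This is an illustration only: `FlipLogger` and `SwitchEvent` are hypothetical names, and the caller is assumed to supply the BCP-47 tag and a timestamp for each turn. Tags are compared case-insensitively, since BCP-47 tags are not case-sensitive.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SwitchEvent:
    turn_index: int
    lang_prev: str
    lang_now: str
    t_switch: float  # timestamp supplied by the caller


@dataclass
class FlipLogger:
    """Tracks the language tag per turn and records a switch event on change."""
    lang_prev: Optional[str] = None
    turn_index: int = 0
    events: List[SwitchEvent] = field(default_factory=list)

    def observe(self, lang_now: str, timestamp: float) -> Optional[SwitchEvent]:
        event = None
        # "en-US" and "en-us" are the same tag, so compare lowercased.
        if self.lang_prev is not None and lang_now.lower() != self.lang_prev.lower():
            event = SwitchEvent(self.turn_index, self.lang_prev, lang_now, timestamp)
            self.events.append(event)
        self.lang_prev = lang_now
        self.turn_index += 1
        return event
```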


Fix pattern

A. Add an Input Contract to every user turn

Required fields you attach to the message payload before retrieval:

{
  "raw_text": "...",
  "normalized_text": "...",
  "lang_bcp47": "xx-YY",
  "script": "Latn|Cyrl|Arab|Hans|Hant|Kana|... ",
  "dir": "ltr|rtl",
  "ime_composition_complete": true,
  "keyboard_hint": "gboard|ios|system|unknown"
}
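
One way to fill this contract is sketched below. The sketch makes several assumptions: NFKC stands in for whatever your LanguageLocale normalization policy actually is, the language tag and IME flag arrive from the client, and `dir` follows the first strongly directional character. Code points alone cannot distinguish Hans from Hant, so the sketch takes the script subtag from the tag when present and otherwise reports the generic `Hani`. All function names are hypothetical.

```python
import unicodedata
from collections import Counter
from typing import Optional

# First word of the Unicode character name mapped to an ISO 15924 script code.
_SCRIPT_BY_NAME_PREFIX = {
    "LATIN": "Latn", "CYRILLIC": "Cyrl", "ARABIC": "Arab", "HEBREW": "Hebr",
    "CJK": "Hani", "HIRAGANA": "Hira", "KATAKANA": "Kana", "HANGUL": "Hang",
}


def dominant_script(text: str) -> str:
    """Most frequent script among the letters of text; 'Zzzz' when unknown."""
    counts = Counter()
    for ch in text:
        if ch.isalpha():
            prefix = unicodedata.name(ch, "").split(" ")[0]
            counts[_SCRIPT_BY_NAME_PREFIX.get(prefix, "Zzzz")] += 1
    return counts.most_common(1)[0][0] if counts else "Zzzz"


def script_from_tag(lang_bcp47: str) -> Optional[str]:
    """Script subtag such as 'Hans' in 'zh-Hans-CN' (simplified: second subtag only)."""
    parts = lang_bcp47.split("-")
    if len(parts) > 1 and len(parts[1]) == 4 and parts[1].isalpha():
        return parts[1].title()
    return None


def text_direction(text: str) -> str:
    """Direction of the first strongly directional character; defaults to ltr."""
    for ch in text:
        bidi = unicodedata.bidirectional(ch)
        if bidi in ("R", "AL"):
            return "rtl"
        if bidi == "L":
            return "ltr"
    return "ltr"


def build_input_contract(raw_text: str, lang_bcp47: str,
                         ime_composition_complete: bool,
                         keyboard_hint: str = "unknown") -> dict:
    # NFKC is an assumed normalization policy; it also folds full-width forms.
    normalized = unicodedata.normalize("NFKC", raw_text).strip()
    return {
        "raw_text": raw_text,
        "normalized_text": normalized,
        "lang_bcp47": lang_bcp47,
        "script": script_from_tag(lang_bcp47) or dominant_script(normalized),
        "dir": text_direction(normalized),
        "ime_composition_complete": ime_composition_complete,
        "keyboard_hint": keyboard_hint,
    }
```

Keeping `raw_text` alongside `normalized_text` lets you audit later what the keyboard actually sent.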

B. Route to the correct analyzer

  • Select tokenizer/analyzer based on lang_bcp47 and script.
  • For CJK, switch to CJK-aware analyzer and re-ranker; for mixed text, run a language split then merge with a deterministic re-rank. See: cjk_segmentation_wordbreak.md and rerankers.md
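
A rough sketch of the split-route-merge idea for mixed text follows. The analyzer names in `ANALYZER_BY_SCRIPT` are placeholders for whatever your search stack provides, and the hit format `(doc_id, score)` is an assumption. Neutral characters such as spaces and digits stay with the run they follow, and the merge breaks score ties by `doc_id` so the final order is deterministic.

```python
import unicodedata
from typing import Dict, List, Tuple

# Placeholder analyzer names; substitute the analyzers your stack provides.
ANALYZER_BY_SCRIPT = {
    "Latn": "standard", "Hani": "cjk", "Hira": "cjk", "Kana": "cjk",
    "Hang": "cjk", "Arab": "arabic", "Hebr": "hebrew",
}
DEFAULT_ANALYZER = "standard"

_PREFIX_TO_SCRIPT = {
    "LATIN": "Latn", "CJK": "Hani", "HIRAGANA": "Hira", "KATAKANA": "Kana",
    "HANGUL": "Hang", "ARABIC": "Arab", "HEBREW": "Hebr",
}


def char_script(ch: str) -> str:
    prefix = unicodedata.name(ch, "").split(" ")[0]
    return _PREFIX_TO_SCRIPT.get(prefix, "Zzzz")


def split_by_script(text: str) -> List[Tuple[str, str]]:
    """Split text into (script, segment) runs; non-letters stay with the current run."""
    runs: List[Tuple[str, str]] = []
    current = None
    buffer = ""
    for ch in text:
        script = char_script(ch) if ch.isalpha() else None
        if script is not None and current is not None and script != current:
            runs.append((current, buffer.strip()))
            buffer = ""
        if script is not None:
            current = script
        buffer += ch
    if buffer.strip():
        runs.append((current or "Zzzz", buffer.strip()))
    return runs


def route(text: str) -> List[Tuple[str, str]]:
    """Pair each script run with the analyzer that should handle it."""
    return [(ANALYZER_BY_SCRIPT.get(s, DEFAULT_ANALYZER), seg)
            for s, seg in split_by_script(text)]


def merge_hits(hit_lists: List[List[Tuple[str, float]]]) -> List[Tuple[str, float]]:
    """Merge per-segment hits: keep the best score per doc, then sort deterministically."""
    best: Dict[str, float] = {}
    for hits in hit_lists:
        for doc_id, score in hits:
            best[doc_id] = max(score, best.get(doc_id, score))
    return sorted(best.items(), key=lambda item: (-item[1], item[0]))
```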

C. Guard IME composition

  • Add a client or gateway composition fence. If ime_composition_complete=false, buffer and wait.
  • Only proceed to retrieval once confirmed complete.
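
A gateway-side fence might look like the sketch below. `CompositionFence` is a hypothetical class; it assumes each payload carries the `ime_composition_complete` flag from the Input Contract and treats a missing flag as incomplete, so unknown clients fail closed.

```python
from typing import Dict, Optional


class CompositionFence:
    """Holds back payloads until the client reports IME composition is complete."""

    def __init__(self) -> None:
        self._pending: Dict[str, dict] = {}  # latest incomplete payload per session
        self.ime_wait_count = 0

    def submit(self, session_id: str, payload: dict) -> Optional[dict]:
        """Return the payload if it may proceed to retrieval, else None (wait)."""
        if not payload.get("ime_composition_complete", False):
            # Buffer only the newest partial text; older partials are superseded.
            self._pending[session_id] = payload
            self.ime_wait_count += 1
            return None
        self._pending.pop(session_id, None)
        return payload

    def is_waiting(self, session_id: str) -> bool:
        return session_id in self._pending
```

The counter doubles as the `ime_wait_count` metric, so the fence and the metric cannot drift apart.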

D. Stabilize semantics after the flip


Minimal recipe you can copy

  1. Detect lang_bcp47 and script on the client.
  2. Normalize text using your LanguageLocale policy.
  3. If IME composition is not complete, do nothing and wait until it completes.
  4. Choose the analyzer and reranker given lang_bcp47.
  5. Retrieve with schema-locked snippets and store: lang_bcp47, script, dir.
  6. Probe ΔS on 3 paraphrases in the active language. Clamp λ if needed.
  7. Enforce cite-then-explain with snippet and citation schema. Open: retrieval-traceability.md and data-contracts.md

Copy-paste prompt for the LLM step

You have TXTOS and WFGY Problem Map loaded.

Input meta:
- lang_bcp47 = {xx-YY}
- script = {Latn|Hans|Hant|Arab|...}
- dir = {ltr|rtl}
- ime_composition_complete = {true|false}

Tasks:
1) If ime_composition_complete=false, return: {"action":"wait_for_ime"} only.
2) Use the active language for paraphrases and answers. No silent translation.
3) Validate cite-then-explain. If citations missing, return the fix tip and stop.
4) If ΔS(question,retrieved) ≥ 0.60, suggest the smallest structural fix referencing:
   retrieval-playbook, tokenizer_mismatch, script_mixing, diacritics_and_folding.
Return JSON:
{ "answer":"...", "citations":[...], "ΔS":0.xx, "λ_state":"...", "lang_used":"xx-YY" }
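
Before trusting the model's reply, it can help to check it against the JSON shape requested above. The validator below is a sketch under assumptions: `validate_llm_reply` is a hypothetical helper, the reply is expected to be a bare JSON string, and a "fix tip" reply for missing citations is not modelled.

```python
import json
from typing import List

REQUIRED_KEYS = {"answer", "citations", "ΔS", "λ_state", "lang_used"}


def validate_llm_reply(raw: str, expected_lang: str) -> List[str]:
    """Return a list of problems found in the reply; an empty list means it looks fine."""
    try:
        reply = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(reply, dict):
        return ["reply is not a JSON object"]
    if reply == {"action": "wait_for_ime"}:
        return []  # the model correctly refused to act on partial IME text

    problems: List[str] = []
    missing = REQUIRED_KEYS - reply.keys()
    if missing:
        problems.append("missing keys: " + ", ".join(sorted(missing)))
    if not reply.get("citations"):
        problems.append("citations missing or empty")
    # "No silent translation": the reply language must match the active one.
    lang_used = str(reply.get("lang_used", ""))
    if lang_used.lower() != expected_lang.lower():
        problems.append(f"lang_used {lang_used!r} does not match {expected_lang!r}")
    return problems
```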

Observability

  • Log the distribution of lang_bcp47 across the session. A spike at switch events plus ΔS spikes indicates analyzer mismatch.
  • Track ΔS vs k right after a switch. Flat-high curves suggest metric or index misalignment.
  • Record ime_wait_count. A high count with good outcomes is normal and preferable to composition leakage.
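
The first signal can be approximated offline from a per-turn log. The sketch assumes each turn was logged as a `(lang_bcp47, ΔS)` pair and reuses the 0.60 threshold from the prompt above as the alert level; `summarize_session` is a hypothetical helper.

```python
from collections import Counter
from typing import Dict, List, Tuple


def summarize_session(turns: List[Tuple[str, float]],
                      delta_s_alert: float = 0.60) -> Dict[str, object]:
    """Summarize language distribution and flag switch turns with high ΔS.

    turns holds one (lang_bcp47, delta_s) pair per user turn, in order.
    """
    lang_counts = Counter(lang for lang, _ in turns)
    suspect_switch_turns = []
    for i in range(1, len(turns)):
        prev_lang = turns[i - 1][0]
        lang, delta_s = turns[i]
        # A language change that coincides with high ΔS hints at analyzer mismatch.
        if lang != prev_lang and delta_s >= delta_s_alert:
            suspect_switch_turns.append(i)
    return {
        "lang_counts": dict(lang_counts),
        "suspect_switch_turns": suspect_switch_turns,
    }
```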

Common gotchas

  • Mobile keyboards emit full-width digits or punctuation after language switch. Normalize width before retrieval. Open: digits_width_punctuation.md
  • Predictive text inserts diacritics that your index fold does not expect. Align fold policy. Open: diacritics_and_folding.md
  • Mixed language metadata causes wrong reroute to an English index. Validate lang_bcp47 at the gateway. Open: mixed_locale_metadata.md
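
For the first two gotchas, a minimal sketch using only the Python standard library is shown below. It assumes NFKC is acceptable as a width normalizer and that the index fold strips combining marks; both are policy choices that the linked pages may define differently. Note that NFKC also rewrites other compatibility characters, such as ligatures, not just full-width forms.

```python
import unicodedata


def normalize_width(text: str) -> str:
    """Map full-width digits, letters, and punctuation to their ASCII forms via NFKC."""
    return unicodedata.normalize("NFKC", text)


def fold_diacritics(text: str) -> str:
    """Strip combining marks after canonical decomposition, e.g. 'é' becomes 'e'."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return unicodedata.normalize("NFC", stripped)


def retrieval_key(text: str) -> str:
    """Apply the same width and fold policy to queries and indexed text."""
    return fold_diacritics(normalize_width(text)).casefold()
```

Whatever policy you pick, the point is to run the identical function at index time and at query time.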

Escalate


Next suggested page to generate: ProblemMap/GlobalFixMap/LanguageLocale/nonstandard_whitespace.md. Focus on NBSP, NNBSP, thin space, zero-width joiner/non-joiner, and their impact on retrieval.

