# Query Language Detection · Global Fix Map Detect the query language and script correctly, route it to the right analyzer and tokenizer, and keep λ stable across paraphrases. This page gives a small contract, deterministic fallbacks, and tests so short queries, code-switched inputs, and romanized forms do not break retrieval. --- ## Open these first * Visual map and recovery → [rag-architecture-and-recovery.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/rag-architecture-and-recovery.md) * End to end retrieval knobs → [retrieval-playbook.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/retrieval-playbook.md) * Why this snippet → [retrieval-traceability.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/retrieval-traceability.md) * Contract the payload → [data-contracts.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/data-contracts.md) * Embedding vs meaning → [embedding-vs-semantic.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/embedding-vs-semantic.md) * Tokenizer variance → [tokenizer\_mismatch.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Language/tokenizer_mismatch.md) * Mixed scripts in one query → [script\_mixing.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Language/script_mixing.md) * Locale normalization and width/diacritics → [locale\_drift.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Language/locale_drift.md) * Proper noun aliases → [proper\_noun\_aliases.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Language/proper_noun_aliases.md) * Romanization and transliteration → [romanization\_transliteration.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Language/romanization_transliteration.md) * Multilingual overview → [multilingual\_guide.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Language/multilingual_guide.md) --- ## Core acceptance targets * ΔS(question, retrieved) ≤ 0.45 across three paraphrases and two seeds * Coverage of target section ≥ 0.70 * λ remains convergent when detector confidence is low or when code-switching is present * Detector outputs BCP-47 `lang` and ISO 15924 `script` with an explicit confidence and rationale * No false collapse when romanized forms are used instead of native script --- ## Minimal contract **Inputs** ``` q_text # user query raw hints.lang_pref # optional ui/user preference e.g. "ja" hints.romanizer # optional, e.g. "hepburn" context.domain # optional product/domain which biases vocabulary ``` **Detector output** ``` lang # BCP-47 primary tag, null if unknown (e.g., "zh", "ja", "en") script # ISO 15924, e.g., "Hans", "Hant", "Latn", "Cyrl", "Arab" confidence # 0..1 rationale # short note, e.g., "CJK bigram ratio 0.82" variants # list of plausible alternates, sorted by confidence romanized_suspect # bool, true if looks like transliteration of non-Latin ``` **Router decision** ``` analyzer_id # store-specific analyzer to call tokenizer_id # LLM or retriever tokenizer profile alias_view # whether to search romanized alias field(s) ``` All five fields must be logged with the retrieval response so you can audit flips. --- ## Typical failure → exact fix | Symptom | Likely cause | Open this | | ------------------------------------------------------------------------- | --------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | | Short query mis-detected as English, CJK missed | length bias without script probe | [script\_mixing.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Language/script_mixing.md), [locale\_drift.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Language/locale_drift.md) | | Romanized Japanese finds wrong page or no hit | detector returns `en+Latn` but romanized\_suspect not set | [romanization\_transliteration.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Language/romanization_transliteration.md) | | Arabic mixed digits and ASCII flips direction and rank | RTL controls and width not normalized | [locale\_drift.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Language/locale_drift.md) | | Brand or person whose alias equals a common word routes to wrong language | alias collision without scope fence | [proper\_noun\_aliases.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Language/proper_noun_aliases.md), [retrieval-traceability.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/retrieval-traceability.md) | | High similarity yet wrong meaning across languages | analyzer or metric mismatch | [embedding-vs-semantic.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/embedding-vs-semantic.md), [tokenizer\_mismatch.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Language/tokenizer_mismatch.md) | --- ## 60-second fix checklist 1. **Two-stage detection** Script-first using Unicode ranges, then language model on normalized text. Never rely on language-only detectors for queries shorter than 6 tokens. 2. **Confidence bands** If `confidence < 0.65`, run mixed routing: search native analyzer for all `variants.script` plus the romanized alias view. 3. **Romanized suspect path** If `romanized_suspect=true`, search native-script alias view and bias reranker to prefer canonical snippets. 4. **Width and diacritics** Fold width and diacritics only for the detection step and alias view, not for canonical matching. See [locale\_drift.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Language/locale_drift.md). 5. **Log ΔS and λ** Keep per-variant logs so you can see which analyzer produced stable evidence. --- ## Copy snippets **A. Script-first detector skeleton** ```python import unicodedata as ud from collections import Counter def guess_script(s: str) -> tuple[str, float]: buckets = Counter() total = 0 for ch in s: if ch.isspace() or ch.isdigit(): continue total += 1 name = ud.name(ch, "") # very light bins, expand as needed if "CJK" in name or "HIRAGANA" in name or "KATAKANA" in name or "HANGUL" in name: buckets["CJK"] += 1 elif "CYRILLIC" in name: buckets["CYRL"] += 1 elif "ARABIC" in name or "HEBREW" in name: buckets["RTL"] += 1 else: buckets["LATN"] += 1 if total == 0: return "UNK", 0.0 script, cnt = max(buckets.items(), key=lambda x: x[1]) conf = cnt / total # map to ISO 15924 class iso = {"CJK":"Han", "CYRL":"Cyrl", "RTL":"Arab", "LATN":"Latn"}.get(script, "Zyyy") return iso, conf ``` **B. Romanized suspect heuristic** ```python def is_romanized_suspect(q: str, script_iso: str) -> bool: # e.g., looks like "Tōkyō", "Toukyou", "Zhongguo", "Rossiya" if script_iso != "Latn": return False vowels = sum(ch.lower() in "aeiou" for ch in q) tone_marks = any(ch in "āáǎàēéěèīíǐìōóǒòūúǔùǖǘǚǜ" for ch in q) hyphen = "-" in q long_vowel = any(seq in q.lower() for seq in ["ou","aa","ee","oo","uu"]) return tone_marks or hyphen or long_vowel or vowels >= max(4, len(q)//3) ``` **C. Router decision** ```python def route(q_text, hints): script, s_conf = guess_script(q_text) roman_sus = is_romanized_suspect(q_text, script) low_conf = s_conf < 0.65 or len(q_text.split()) < 6 routes = [] if script in ["Han", "Hira", "Kana", "Hang"]: routes.append(("analyzer:cjk", "tokenizer:cjk", False)) elif script == "Cyrl": routes.append(("analyzer:cyrl", "tokenizer:default", False)) elif script == "Arab": routes.append(("analyzer:rtl", "tokenizer:default", False)) else: routes.append(("analyzer:latn", "tokenizer:default", roman_sus)) if low_conf: # add alternates and alias view routes.append(("analyzer:latn", "tokenizer:default", True)) routes.append(("analyzer:cjk", "tokenizer:cjk", True)) return { "script": script, "confidence": round(s_conf, 2), "romanized_suspect": roman_sus, "routes": routes } ``` **D. Prompt fence for detectors** ``` You have TXTOS and the WFGY Problem Map. When a query is short or mixed: 1) Detect script first. If confidence is low, search both native script and romanized alias views. 2) Cite the snippet in the canonical script if available. Use cite-then-explain. 3) Report {lang, script, detector_confidence, romanized_suspect} in the trace. ``` --- ## Eval plan Use the set from [code\_switching\_eval.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Language/code_switching_eval.md). Add 3 extra buckets: * short queries with 1 to 3 tokens * romanized vs native for the same entity * mixed ASCII and RTL digits Targets * detector accuracy on script ≥ 0.97 for length ≥ 6 tokens, ≥ 0.90 for length 1–5 * ΔS(question, retrieved) ≤ 0.45 and λ convergent across two seeds * no rank flip between native and romanized when evidence matches If recall is fine but ranking flips, clamp reranker and verify with [retrieval-traceability.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/retrieval-traceability.md). --- ### 🧭 Explore More | Module | Description | Link | |-----------------------|----------------------------------------------------------|----------| | WFGY Core | WFGY 2.0 engine is live: full symbolic reasoning architecture and math stack | [View →](https://github.com/onestardao/WFGY/tree/main/core/README.md) | | Problem Map 1.0 | Initial 16-mode diagnostic and symbolic fix framework | [View →](https://github.com/onestardao/WFGY/tree/main/ProblemMap/README.md) | | Problem Map 2.0 | RAG-focused failure tree, modular fixes, and pipelines | [View →](https://github.com/onestardao/WFGY/blob/main/ProblemMap/rag-architecture-and-recovery.md) | | Semantic Clinic Index | Expanded failure catalog: prompt injection, memory bugs, logic drift | [View →](https://github.com/onestardao/WFGY/blob/main/ProblemMap/SemanticClinicIndex.md) | | Semantic Blueprint | Layer-based symbolic reasoning & semantic modulations | [View →](https://github.com/onestardao/WFGY/tree/main/SemanticBlueprint/README.md) | | Benchmark vs GPT-5 | Stress test GPT-5 with full WFGY reasoning suite | [View →](https://github.com/onestardao/WFGY/tree/main/benchmarks/benchmark-vs-gpt5/README.md) | --- > 👑 **Early Stargazers: [See the Hall of Fame](https://github.com/onestardao/WFGY/tree/main/stargazers)** — > Engineers, hackers, and open source builders who supported WFGY from day one. > GitHub stars ⭐ [WFGY Engine 2.0](https://github.com/onestardao/WFGY/blob/main/core/README.md) is already unlocked. ⭐ Star the repo to help others discover it and unlock more on the [Unlock Board](https://github.com/onestardao/WFGY/blob/main/STAR_UNLOCKS.md).
[![WFGY Main](https://img.shields.io/badge/WFGY-Main-red?style=flat-square)](https://github.com/onestardao/WFGY)   [![TXT OS](https://img.shields.io/badge/TXT%20OS-Reasoning%20OS-orange?style=flat-square)](https://github.com/onestardao/WFGY/tree/main/OS)   [![Blah](https://img.shields.io/badge/Blah-Semantic%20Embed-yellow?style=flat-square)](https://github.com/onestardao/WFGY/tree/main/OS/BlahBlahBlah)   [![Blot](https://img.shields.io/badge/Blot-Persona%20Core-green?style=flat-square)](https://github.com/onestardao/WFGY/tree/main/OS/BlotBlotBlot)   [![Bloc](https://img.shields.io/badge/Bloc-Reasoning%20Compiler-blue?style=flat-square)](https://github.com/onestardao/WFGY/tree/main/OS/BlocBlocBloc)   [![Blur](https://img.shields.io/badge/Blur-Text2Image%20Engine-navy?style=flat-square)](https://github.com/onestardao/WFGY/tree/main/OS/BlurBlurBlur)   [![Blow](https://img.shields.io/badge/Blow-Game%20Logic-purple?style=flat-square)](https://github.com/onestardao/WFGY/tree/main/OS/BlowBlowBlow)