Query Language Detection · Global Fix Map

Detect the query language and script correctly, route it to the right analyzer and tokenizer, and keep λ stable across paraphrases. This page gives a small contract, deterministic fallbacks, and tests so short queries, code-switched inputs, and romanized forms do not break retrieval.

Open these first

Visual map and recovery → rag-architecture-and-recovery.md
End to end retrieval knobs → retrieval-playbook.md
Why this snippet → retrieval-traceability.md
Contract the payload → data-contracts.md
Embedding vs meaning → embedding-vs-semantic.md
Tokenizer variance → tokenizer_mismatch.md
Mixed scripts in one query → script_mixing.md
Locale normalization and width/diacritics → locale_drift.md
Proper noun aliases → proper_noun_aliases.md
Romanization and transliteration → romanization_transliteration.md
Multilingual overview → multilingual_guide.md

Core acceptance targets

ΔS(question, retrieved) ≤ 0.45 across three paraphrases and two seeds
Coverage of target section ≥ 0.70
λ remains convergent when detector confidence is low or when code-switching is present
Detector outputs BCP-47 lang and ISO 15924 script with an explicit confidence and rationale
No false collapse when romanized forms are used instead of native script
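The numeric targets above can be turned into a single pass/fail gate. The sketch below is illustrative only: `Run`, `passes_acceptance`, and the field names are hypothetical, and it assumes ΔS, coverage, and the λ judgement are computed elsewhere and passed in as plain values.

```python
from dataclasses import dataclass

# Thresholds copied from the acceptance targets on this page.
MAX_DELTA_S = 0.45
MIN_COVERAGE = 0.70
N_PARAPHRASES = 3
N_SEEDS = 2

@dataclass
class Run:
    paraphrase: int    # paraphrase index, 0..2
    seed: int          # seed index, 0..1
    delta_s: float     # ΔS(question, retrieved), computed elsewhere
    coverage: float    # coverage of the target section, computed elsewhere
    convergent: bool   # whether λ was judged convergent for this run

def passes_acceptance(runs: list[Run]) -> bool:
    """True only if every paraphrase x seed cell is present and meets every target."""
    expected = {(p, s) for p in range(N_PARAPHRASES) for s in range(N_SEEDS)}
    seen = {(r.paraphrase, r.seed) for r in runs}
    if not expected <= seen:
        return False  # a missing cell counts as a failure, not a pass
    return all(
        r.delta_s <= MAX_DELTA_S and r.coverage >= MIN_COVERAGE and r.convergent
        for r in runs
    )
```

Treating a missing paraphrase or seed as a failure keeps a partial run from passing by accident.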

Minimal contract

Inputs

q_text              # raw user query
hints.lang_pref     # optional ui/user preference e.g. "ja"
hints.romanizer     # optional, e.g. "hepburn"
context.domain      # optional product/domain which biases vocabulary

Detector output

lang                # BCP-47 primary tag, null if unknown (e.g., "zh", "ja", "en")
script              # ISO 15924, e.g., "Hans", "Hant", "Latn", "Cyrl", "Arab"
confidence          # 0..1
rationale           # short note, e.g., "CJK bigram ratio 0.82"
variants            # list of plausible alternates, sorted by confidence
romanized_suspect   # bool, true if looks like transliteration of non-Latin

Router decision

analyzer_id         # store-specific analyzer to call
tokenizer_id        # LLM or retriever tokenizer profile
alias_view          # whether to search romanized alias field(s)

All detector output and router decision fields must be logged with the retrieval response so you can audit flips.
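One way to carry the contract is as typed records that serialize to a single log line per retrieval response. This is a minimal sketch, not the project's schema: the class names, `audit_record`, and the JSON layout are assumptions, and only the field names come from the contract above.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import Optional

@dataclass
class DetectorOutput:
    lang: Optional[str]          # BCP-47 primary tag, None if unknown
    script: str                  # ISO 15924 code
    confidence: float            # 0..1
    rationale: str               # short human-readable note
    variants: list = field(default_factory=list)  # alternates, sorted by confidence
    romanized_suspect: bool = False

@dataclass
class RouterDecision:
    analyzer_id: str
    tokenizer_id: str
    alias_view: bool

def audit_record(q_text: str, detector: DetectorOutput, router: RouterDecision) -> str:
    """Serialize one query's detection and routing as a JSON line for auditing."""
    payload = {
        "q_text": q_text,
        "detector": asdict(detector),
        "router": asdict(router),
    }
    # ensure_ascii=False keeps native-script queries readable in the log
    return json.dumps(payload, ensure_ascii=False, sort_keys=True)
```

Logging the detector and router records together makes it possible to see whether a rank flip came from a detection change or from a routing change.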

Typical failure → exact fix

Symptom	Likely cause	Open this
Short query mis-detected as English, CJK missed	length bias without script probe	script_mixing.md, locale_drift.md
Romanized Japanese finds wrong page or no hit	detector returns `en+Latn` but romanized_suspect not set	romanization_transliteration.md
Arabic query mixing Arabic-Indic and ASCII digits flips direction and rank	RTL controls and width not normalized	locale_drift.md
Brand or person whose alias equals a common word routes to wrong language	alias collision without scope fence	proper_noun_aliases.md, retrieval-traceability.md
High similarity yet wrong meaning across languages	analyzer or metric mismatch	embedding-vs-semantic.md, tokenizer_mismatch.md
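For the RTL row in the table above, a small normalization pass can remove the two usual sources of drift: invisible bidi control characters and non-ASCII decimal digits. The sketch below is an assumption about how such a pass might look, not the method in locale_drift.md. `normalize_for_matching` is a hypothetical name, and it is meant to run on a copy of the query used for detection and matching.

```python
import unicodedata as ud

# Unicode bidi control characters: ALM, LRM, RLM, the embedding/override
# controls U+202A..U+202E, and the isolate controls U+2066..U+2069.
BIDI_CONTROLS = {
    "\u061c", "\u200e", "\u200f",
    *map(chr, range(0x202A, 0x202F)),
    *map(chr, range(0x2066, 0x206A)),
}

def normalize_for_matching(s: str) -> str:
    """Drop bidi controls and map any Unicode decimal digit to its ASCII form."""
    out = []
    for ch in s:
        if ch in BIDI_CONTROLS:
            continue
        value = ud.decimal(ch, None)  # None for anything that is not a decimal digit
        out.append(str(value) if value is not None else ch)
    return "".join(out)
```

With this pass, a query typed with Arabic-Indic digits and one typed with ASCII digits produce the same string for matching, while the original text stays available for display.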

60-second fix checklist

Two-stage detection: script first using Unicode ranges, then a language model on normalized text. Never rely on language-only detectors for queries shorter than 6 tokens.
Confidence bands: if confidence < 0.65, run mixed routing. Search the native analyzer for every variants.script, plus the romanized alias view.
Romanized suspect path: if romanized_suspect=true, search the native-script alias view and bias the reranker to prefer canonical snippets.
Width and diacritics: fold width and diacritics only for the detection step and the alias view, not for canonical matching. See locale_drift.md.
Log ΔS and λ: keep per-variant logs so you can see which analyzer produced stable evidence.
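The width and diacritics item can be sketched with the standard `unicodedata` module. This is an illustration under stated assumptions, not the procedure in locale_drift.md: `fold_for_detection` is a hypothetical helper, and it strips combining marks only after Latin base letters so that kana voicing marks and Arabic vowel signs survive the fold.

```python
import unicodedata as ud

def fold_for_detection(s: str) -> str:
    """Fold width and Latin diacritics on a copy used for detection and alias lookup.

    The caller keeps the original string for canonical matching.
    """
    # NFKD maps fullwidth/halfwidth forms to their standard forms and
    # splits accented letters into base letter + combining mark.
    decomposed = ud.normalize("NFKD", s)
    out = []
    latin_base = False
    for ch in decomposed:
        if ud.combining(ch):
            if latin_base:
                continue  # drop the accent on a Latin letter
            out.append(ch)  # keep marks on non-Latin bases, e.g. kana dakuten
        else:
            latin_base = "LATIN" in ud.name(ch, "")
            out.append(ch)
    # Recompose whatever marks were kept.
    return "".join(ud.normalize("NFC", "".join(out)))

original = "Ｔōｋｙō"                    # fullwidth letters plus macrons
folded = fold_for_detection(original)   # "Tokyo"
```

Returning a new string rather than mutating the query keeps the two views separate: `folded` feeds the detector and alias field, `original` feeds canonical matching.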

Copy snippets

A. Script-first detector skeleton

import unicodedata as ud
from collections import Counter

# Unicode character-name keyword -> ISO 15924 code.
# Very light bins, expand as needed. Order matters: kana and hangul are
# tested before the generic "CJK" keyword.
NAME_TO_ISO = [
    ("KATAKANA", "Kana"),
    ("HIRAGANA", "Hira"),
    ("HANGUL", "Hang"),
    ("CJK", "Hani"),   # code points alone cannot separate Hans from Hant
    ("CYRILLIC", "Cyrl"),
    ("ARABIC", "Arab"),
    ("HEBREW", "Hebr"),
    ("LATIN", "Latn"),
]

def guess_script(s: str) -> tuple[str, float]:
    """Return (ISO 15924 script code, share of letters in that script)."""
    buckets = Counter()
    for ch in s:
        # count letters only, so digits, spaces, punctuation, and emoji
        # do not inflate the Latin bucket
        if not ch.isalpha():
            continue
        name = ud.name(ch, "")
        iso = next((code for key, code in NAME_TO_ISO if key in name), "Zzzz")
        buckets[iso] += 1
    total = sum(buckets.values())
    if total == 0:
        return "Zzzz", 0.0  # ISO 15924 code for an unknown script
    script, cnt = buckets.most_common(1)[0]
    return script, cnt / total

B. Romanized suspect heuristic

def is_romanized_suspect(q: str, script_iso: str) -> bool:
    # Catches forms like "Tōkyō" (macron) and "Toukyou" ("ou" digraph).
    # Forms like "Zhongguo" or "Rossiya" carry none of these cues and are missed.
    # High recall by design: plain English often trips the hyphen, doubled-vowel,
    # or vowel-count checks, so use this to widen search, not to pick a language.
    if script_iso != "Latn":
        return False
    ql = q.lower()  # lowercase once so "Ō" matches the tone-mark list
    vowels = sum(ch in "aeiou" for ch in ql)
    tone_marks = any(ch in "āáǎàēéěèīíǐìōóǒòūúǔùǖǘǚǜ" for ch in ql)
    hyphen = "-" in q
    long_vowel = any(seq in ql for seq in ["ou", "aa", "ee", "oo", "uu"])
    return tone_marks or hyphen or long_vowel or vowels >= max(4, len(q) // 3)

C. Router decision

def route(q_text, hints=None):
    # hints (lang_pref, romanizer) is part of the contract but unused in this skeleton
    script, s_conf = guess_script(q_text)
    roman_sus = is_romanized_suspect(q_text, script)
    # whitespace token count: unsegmented CJK counts as one token,
    # so it always takes the mixed-routing path
    low_conf = s_conf < 0.65 or len(q_text.split()) < 6

    # each route is (analyzer_id, tokenizer_id, alias_view)
    routes = []
    if script in ["Hani", "Hira", "Kana", "Hang"]:
        routes.append(("analyzer:cjk", "tokenizer:cjk", False))
    elif script == "Cyrl":
        routes.append(("analyzer:cyrl", "tokenizer:default", False))
    elif script in ["Arab", "Hebr"]:
        routes.append(("analyzer:rtl", "tokenizer:default", False))
    else:
        routes.append(("analyzer:latn", "tokenizer:default", roman_sus))

    if low_conf:
        # add alternates and alias view, skipping exact duplicates
        for alt in [("analyzer:latn", "tokenizer:default", True),
                    ("analyzer:cjk", "tokenizer:cjk", True)]:
            if alt not in routes:
                routes.append(alt)
    return {
        "script": script,
        "confidence": round(s_conf, 2),
        "romanized_suspect": roman_sus,
        "routes": routes
    }
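The router above reports only the winning script, while the contract also asks for `variants` sorted by confidence. The sketch below shows one way such a list might be produced for mixed-script queries. `script_variants` and `BINS` are hypothetical names, the bins are deliberately coarse, and the codes follow ISO 15924 (`Hani` for Han, `Zzzz` for anything unbinned).

```python
import unicodedata as ud
from collections import Counter

# Unicode character-name keyword -> ISO 15924 code (coarse, illustrative bins).
BINS = [
    ("KATAKANA", "Kana"), ("HIRAGANA", "Hira"), ("HANGUL", "Hang"),
    ("CJK", "Hani"), ("CYRILLIC", "Cyrl"), ("ARABIC", "Arab"),
    ("HEBREW", "Hebr"), ("LATIN", "Latn"),
]

def script_variants(s: str) -> list[dict]:
    """Return every script seen in s with its share of letters, highest first."""
    counts = Counter()
    for ch in s:
        if not ch.isalpha():
            continue  # ignore digits, spaces, punctuation
        name = ud.name(ch, "")
        counts[next((iso for key, iso in BINS if key in name), "Zzzz")] += 1
    total = sum(counts.values())
    return [
        {"script": iso, "confidence": round(n / total, 2)}
        for iso, n in counts.most_common()
    ]

variants = script_variants("東京 tower")
# [{'script': 'Latn', 'confidence': 0.71}, {'script': 'Hani', 'confidence': 0.29}]
```

In mixed routing, each entry's `script` can select one native analyzer to query, so a code-switched input is searched under every script it actually contains.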

D. Prompt fence for detectors

You have TXTOS and the WFGY Problem Map.

When a query is short or mixed:
1) Detect script first. If confidence is low, search both native script and romanized alias views.
2) Cite the snippet in the canonical script if available. Use cite-then-explain.
3) Report {lang, script, detector_confidence, romanized_suspect} in the trace.

Eval plan

Use the set from code_switching_eval.md. Add 3 extra buckets:

short queries with 1 to 3 tokens
romanized vs native for the same entity
mixed ASCII and RTL digits

Targets

detector accuracy on script ≥ 0.97 for length ≥ 6 tokens, ≥ 0.90 for length 1–5
ΔS(question, retrieved) ≤ 0.45 and λ convergent across two seeds
no rank flip between native and romanized when evidence matches
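The script-accuracy targets above depend on query length, so the eval needs to score the two length bands separately. A minimal sketch, assuming rows of `(query_text, expected_script, predicted_script)` and whitespace tokenization. The function names and row layout are hypothetical, and only the thresholds come from this page.

```python
from collections import defaultdict

# Per-bucket script accuracy targets from the eval plan.
TARGETS = {"1-5": 0.90, "6+": 0.97}

def length_bucket(text: str) -> str:
    """Bucket by whitespace token count; unsegmented CJK counts as one token."""
    return "6+" if len(text.split()) >= 6 else "1-5"

def script_accuracy(rows) -> dict:
    """rows: iterable of (query_text, expected_script, predicted_script)."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for text, expected, predicted in rows:
        b = length_bucket(text)
        totals[b] += 1
        hits[b] += int(expected == predicted)
    return {b: hits[b] / totals[b] for b in totals}

def meets_targets(acc: dict) -> bool:
    """True if every bucket that has data reaches its target."""
    return all(acc[b] >= TARGETS[b] for b in acc)
```

Reporting the buckets separately avoids a common trap where long queries dominate the sample and hide a weak short-query detector.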

If recall is fine but ranking flips, clamp reranker and verify with retrieval-traceability.md.
