WFGY/ProblemMap/GlobalFixMap/Language
2025-09-01 18:06:04 +08:00

Language & Multilingual · Global Fix Map

A compact hub to stabilize cross-lingual retrieval and reasoning.
Use this folder when your corpus or queries include CJK, RTL, Indic, Cyrillic, or accented Latin text, or when users code-switch frequently. No infrastructure change is required.


Orientation — pages and what they solve

| Page | What it solves | Typical symptom |
|---|---|---|
| tokenizer_mismatch.md | Locks tokenization and segmentation for CJK/Thai/Indic | High sim but low recall on CJK/Thai; broken tokens |
| script_mixing.md | One query carries mixed scripts and analyzers split | Mixed Latin+CJK queries under-recall or flip |
| locale_drift.md | Normalization for width/accents/variants (Hans↔Hant) | zh-Hans/zh-Hant never co-retrieve; accent variants miss |
| multilingual_guide.md | End-to-end recipes and acceptance targets | Unsure where drift comes from across languages |
| proper_noun_aliases.md | Alias shield for names, brands, products | Proper nouns oscillate across spellings |
| romanization_transliteration.md | Romanization pairs and transliteration consistency | Inconsistent transliteration causes misses |
| query_language_detection.md | Stable language detection contract | Detection flips per run; routing becomes random |
| query_routing_and_analyzers.md | Route analyzers per language + parity w/ index | Search vs index behave differently |
| hybrid_ranking_multilingual.md | Deterministic hybrid rerank across languages | Multilingual ranking unstable, hybrid < single |
| stopword_and_morphology_controls.md | Clamp stopwords/lemmatizers to protect meaning | Negations/particles vanish; unit words lost |
| fallback_translation_and_glossary_bridge.md | Controlled translation bridge with glossary | Local path ΔS stays high; glossary needed |
| code_switching_eval.md | Bilingual & code-switch eval sets + checks | Cannot prove multilingual stability before ship |

When to use this folder

  • High similarity yet wrong meaning on bilingual or mixed-script corpora
  • Citations point to the wrong section after translating the question
  • Hybrid retrievers underperform a single retriever across languages
  • Index looks healthy while coverage stays low for non-Latin scripts
  • Names flip between native, transliteration, and English aliases
  • zh-Hans and zh-Hant never co-retrieve; Thai recall drops with no apparent cause

Acceptance targets

  • ΔS(question, retrieved) ≤ 0.45 across language variants
  • Coverage ≥ 0.70 to the intended section after repair
  • λ_observe convergent across 3 paraphrases and 2 seeds
  • E_resonance flat on long windows that mix scripts
  • Citation fields complete; alias noise does not leak into evidence
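
The two numeric targets above can be checked mechanically once your pipeline has produced the measurements. The following is a minimal sketch, not part of WFGY itself: the `VariantResult` type, the `failing_variants` helper, and the reading of coverage as "fraction of the intended section that was retrieved" are assumptions made for illustration. How ΔS and coverage are actually computed is left to your own tooling.

```python
from dataclasses import dataclass

# Thresholds copied from the acceptance targets above.
MAX_DELTA_S = 0.45
MIN_COVERAGE = 0.70


@dataclass(frozen=True)
class VariantResult:
    """One measurement per language variant (hypothetical container)."""
    language: str    # e.g. "en", "zh-Hans", "th"
    delta_s: float   # ΔS(question, retrieved), measured by your pipeline
    coverage: float  # assumed: fraction of the intended section retrieved


def failing_variants(results):
    """Return (language, reason) pairs for every missed target."""
    failures = []
    for r in results:
        if r.delta_s > MAX_DELTA_S:
            failures.append((r.language, f"delta_s {r.delta_s:.2f} above {MAX_DELTA_S:.2f}"))
        if r.coverage < MIN_COVERAGE:
            failures.append((r.language, f"coverage {r.coverage:.2f} below {MIN_COVERAGE:.2f}"))
    return failures
```

An empty list means every language variant meets both thresholds; the convergence and E_resonance targets still need their own checks.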

Map symptoms → structural fixes

| Symptom | Likely cause | Open this |
|---|---|---|
| High similarity yet wrong meaning | Embedding not multilingual or pre-normalization mismatch | embedding-vs-semantic.md |
| Citations jump sections after translation | Snippet schema too loose | data-contracts.md · retrieval-traceability.md |
| zh-Hans and zh-Hant never co-retrieve | Variant mapping and width rules missing | locale_drift.md |
| Thai or CJK recall collapses | Tokenizer mismatch or missing segmenter | tokenizer_mismatch.md |
| Mixed Latin + CJK query under-recalls | Analyzer split across scripts | script_mixing.md |
| Hybrid worse than single | Query parsing split or mis-weighted rerank | patterns/pattern_query_parsing_split.md · rerankers.md |
| Proper nouns oscillate | Missing alias fields and entity shield | proper_noun_aliases.md |
| Transliteration inconsistency | Romanization rules not aligned | romanization_transliteration.md |
| Language detection drifts | Detection contract weak or unlocked | query_language_detection.md |
| Search vs index disagree | Analyzer routing error | query_routing_and_analyzers.md |
| Ranking unstable across languages | Mono-lingual reranker or unaligned features | hybrid_ranking_multilingual.md |
| Negations/particles vanish | Stopword or morphology too aggressive | stopword_and_morphology_controls.md |
| Persistent high ΔS on local path | Need glossary-backed translation bridge | fallback_translation_and_glossary_bridge.md |
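
To triage the mixed-script symptom before opening script_mixing.md, it can help to flag which queries actually mix scripts. The sketch below is a rough heuristic, not the method used by the linked pages: it uses the first word of each character's Unicode name as a script label, and the helper names `scripts_in` and `is_mixed_script` are hypothetical. Japanese text legitimately mixes kana and kanji, so a real check would likely group those labels together.

```python
import unicodedata


def scripts_in(text):
    """Collect coarse script labels for the letters in text.

    Heuristic: the first word of the Unicode character name
    ("LATIN", "CJK", "THAI", "CYRILLIC", ...) stands in for the script.
    """
    labels = set()
    # NFKC first so full-width Latin letters are labelled as LATIN.
    for ch in unicodedata.normalize("NFKC", text):
        if not ch.isalpha():
            continue  # skip digits, punctuation, whitespace
        name = unicodedata.name(ch, "")
        if name:
            labels.add(name.split()[0])
    return labels


def is_mixed_script(text):
    """True when the query contains letters from more than one script label."""
    return len(scripts_in(text)) > 1
```

Queries flagged this way are candidates for the analyzer-split and routing checks listed in the table.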

Fix in 60 seconds

  1. Detect language
    Emit stable language + confidence. If unstable, fix detection first.
    query_language_detection.md

  2. Lock normalization and analyzers
    Keep locale, width, accents, and segmentation identical on write/read.
    locale_drift.md · query_routing_and_analyzers.md

  3. Protect entities and syntax
    Alias fields and romanization pairs; clamp stopwords/morphology for negations and units.
    proper_noun_aliases.md · romanization_transliteration.md · stopword_and_morphology_controls.md

  4. Stabilize ranking
    Use multilingual or dual-track rerank with deterministic ordering.
    hybrid_ranking_multilingual.md

  5. Translation bridge only if needed
    Pair it with a glossary and keep the native path as the default.
    fallback_translation_and_glossary_bridge.md

  6. Verify
    With bilingual and code-switch sets, confirm ΔS ≤ 0.45, Coverage ≥ 0.70, and λ convergent.
    code_switching_eval.md
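
Step 2 is the one most often broken in practice, so here is a minimal sketch of write/read parity using only the Python standard library. The function name `normalize_text` and the `strip_accents` flag are assumptions for illustration; the pages linked above define the actual rules. Note that Unicode normalization does not map Simplified to Traditional Chinese, which needs a separate variant table.

```python
import unicodedata


def normalize_text(text, strip_accents=False):
    """Apply one normalization to both indexed text and queries.

    NFKC folds full-width/half-width forms and other compatibility
    variants; casefold() removes case differences.
    """
    folded = unicodedata.normalize("NFKC", text).casefold()
    if strip_accents:
        # Assumption: only enable this for Latin-script fields. Dropping
        # combining marks also removes Thai tone marks and Indic viramas,
        # which changes meaning.
        decomposed = unicodedata.normalize("NFD", folded)
        kept = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
        folded = unicodedata.normalize("NFC", kept)
    return folded


# The same function must run at index time and at query time.
indexed_key = normalize_text("ＷＦＧＹ　ＲＡＧ")  # full-width letters and space
query_key = normalize_text("wfgy rag")
assert indexed_key == query_key
```

Keeping the function in one shared module, imported by both the indexer and the query path, is a simple way to stop the two sides from drifting apart.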


Store-agnostic quick recipes

  • Apply the same normalization to corpus text and queries before embedding or storing vectors
  • CJK and Thai text needs segmentation or character bigrams; keep entity fields as keyword (unanalyzed) fields
  • If multilingual embeddings are unavailable, add a lexical sidecar and align its features with the vector scores in a deterministic rerank
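
The last two recipes can be sketched together: character bigrams as a store-agnostic lexical sidecar, fused with vector scores under a fixed tie-break. This is an illustration under stated assumptions, not the ranking defined in hybrid_ranking_multilingual.md: the function names, the equal 0.5 weights, and the bigram-overlap score are all hypothetical choices.

```python
def cjk_bigrams(text):
    """Overlapping character bigrams; a lone character yields itself."""
    chars = [ch for ch in text if not ch.isspace()]
    if len(chars) < 2:
        return chars
    return [chars[i] + chars[i + 1] for i in range(len(chars) - 1)]


def lexical_score(query, doc):
    """Share of query bigrams that also occur in the document."""
    q = set(cjk_bigrams(query))
    d = set(cjk_bigrams(doc))
    return len(q & d) / len(q) if q else 0.0


def deterministic_rerank(candidates, query, w_vec=0.5, w_lex=0.5):
    """Rank (doc_id, text, vector_score) tuples reproducibly.

    Scores are rounded so float noise cannot reorder results, and ties
    are broken by doc_id so repeated runs return the same order.
    """
    scored = [
        (round(w_vec * vec + w_lex * lexical_score(query, text), 6), doc_id)
        for doc_id, text, vec in candidates
    ]
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [doc_id for _, doc_id in scored]
```

For example, `cjk_bigrams("東京都")` yields the bigrams `東京` and `京都`, so a query for 東京都 still matches a document containing 東京都庁 even without a language-specific segmenter.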

🔗 Quick-Start Downloads (60 sec)

| Tool | Link | 3-Step Setup |
|---|---|---|
| WFGY 1.0 PDF | Engine Paper | 1 Download · 2 Upload to your LLM · 3 Ask “Answer using WFGY + <your question>” |
| TXT OS (plain-text OS) | TXTOS.txt | 1 Download · 2 Paste into any LLM chat · 3 Type “hello world”; the OS boots instantly |

