WFGY/ProblemMap/GlobalFixMap/Language/README.md
2025-08-30 10:39:52 +08:00

Language & Multilingual — Global Fix Map

A compact hub to stabilize cross-lingual retrieval and reasoning. Use this folder when your corpus or queries span CJK, RTL, Indic, Cyrillic, accented Latin, or when users code-switch.


Quick routes to per-page guides


When to use this folder

  • High similarity yet wrong meaning on bilingual or mixed-script corpora
  • Citations point to the wrong section after translating the question
  • Hybrid retrievers underperform a single retriever across languages
  • Index “looks” healthy while coverage remains low for non-Latin scripts
  • Names flip between native, transliteration, and English aliases
  • zh-Hans and zh-Hant never co-retrieve, or Thai recall collapses for no clear reason

Acceptance targets

  • ΔS(question, retrieved) ≤ 0.45 across language variants
  • Coverage of the target section ≥ 0.70 after repair
  • λ remains convergent across three paraphrases and two seeds
  • E_resonance stays flat on long windows that mix scripts
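
The first two targets can be checked mechanically. The sketch below is illustrative only: it assumes ΔS is computed as 1 minus the cosine similarity of two embedding vectors (this page does not define ΔS), and the names `delta_s` and `meets_targets` are hypothetical helpers, not part of WFGY. The λ and E_resonance targets are not covered here.

```python
import math
from typing import Dict, Sequence

DELTA_S_MAX = 0.45   # threshold taken from the acceptance targets
COVERAGE_MIN = 0.70  # threshold taken from the acceptance targets


def delta_s(a: Sequence[float], b: Sequence[float]) -> float:
    """ASSUMPTION: ΔS is taken to be 1 - cosine similarity of two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0.0:
        return 1.0  # treat a zero vector as maximally distant
    return 1.0 - dot / norm


def meets_targets(delta_s_by_variant: Dict[str, float], coverage: float) -> bool:
    """True when every language variant and the coverage pass the thresholds."""
    all_close = all(v <= DELTA_S_MAX for v in delta_s_by_variant.values())
    return all_close and coverage >= COVERAGE_MIN
```

A typical call passes one ΔS value per language variant, for example `meets_targets({"en": 0.31, "zh-Hans": 0.40, "mixed": 0.44}, coverage=0.76)`.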

Open these first


Map symptoms → structural fixes (Problem Map)

| Symptom | Likely cause | Open this |
|---|---|---|
| Wrong-meaning hits despite high similarity | Embedding not multilingual or pre-norm mismatch | embedding-vs-semantic.md |
| Citations jump sections after translation | Snippet schema too loose | data-contracts.md, retrieval-traceability.md |
| zh-Hans/zh-Hant never co-retrieve | Variant mapping missing | locale_drift.md |
| Thai/CJK recall collapses | Tokenizer mismatch, missing segmenter | tokenizer_mismatch.md |
| Mixed Latin + CJK query under-recalls | Analyzer split across scripts | script_mixing.md |
| Hybrid retriever worse than single | Query parsing split, mis-weighted rerank | patterns/pattern_query_parsing_split.md, rerankers.md |

Fix in 60 seconds

  1. Measure ΔS across variants. Run the question in the source language, the target language, and a code-switched version. If ΔS improves only after transliteration or segmentation, language handling is the failing layer.

  2. Probe λ_observe. Hold citations fixed and flip only script or segmentation. If λ flips, lock the schema and unify analyzers.

  3. Apply the smallest change:

     • Unify index and search analyzers, add Thai/CJK segmenters
     • Map zh-Hans ↔ zh-Hant at ingest or store in variant subfields
     • Add alias fields for proper nouns: native, romanized, normalized
     • Normalize digits, punctuation, and diacritics for RTL/Indic consistently
     • Use a multilingual embedding; otherwise add a lexical sidecar and deterministic rerank

  4. Verify. Coverage ≥ 0.70 and ΔS ≤ 0.45 across two languages and one mixed query, with λ convergent on two seeds.
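
As one possible reading of the normalization and alias-field items in step 3, here is a minimal sketch using only the Python standard library. The function names and the dictionary layout are hypothetical. Diacritic stripping is applied only to the romanized alias, because removing combining marks from Indic or Arabic text would change its meaning.

```python
import unicodedata
from typing import Dict


def normalize_text(text: str) -> str:
    """Fold width/compatibility forms, map decimal digits to ASCII, casefold."""
    # NFKC folds full-width Latin, half-width katakana, and similar forms.
    text = unicodedata.normalize("NFKC", text)
    out = []
    for ch in text:
        # Any Unicode decimal digit (Arabic-Indic, Devanagari, ...) becomes 0-9.
        digit = unicodedata.decimal(ch, None)
        out.append(str(digit) if digit is not None else ch)
    return "".join(out).casefold()


def strip_diacritics(text: str) -> str:
    """Drop combining marks. Intended for romanized (Latin) strings only."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(c for c in decomposed if unicodedata.category(c) != "Mn")
    return unicodedata.normalize("NFC", kept)


def alias_fields(native: str, romanized: str) -> Dict[str, str]:
    """Build the three alias fields for a proper noun (hypothetical layout)."""
    return {
        "native": native,
        "romanized": romanized,
        "normalized": normalize_text(strip_diacritics(romanized)),
    }
```

Whatever normalization is chosen, the same function should run at ingest and at query time. For example, `alias_fields("東京", "Tōkyō")` yields a `normalized` value of `tokyo`, so a query typed without the macrons can still match lexically.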

Store-agnostic quick recipes

  • Elasticsearch/OpenSearch: ICU normalization, width fold, bigram analyzers for CJK, Thai segmenter, keyword subfields for product names.
  • Vector stores: pre-normalize identically for corpus and queries, validate with a bilingual gold set, rerank deterministically if embeddings are monolingual.
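
For the Elasticsearch/OpenSearch recipe, the index body might look roughly like the sketch below, written as a Python dict so it can be serialized and sent with any client. It assumes the `analysis-icu` plugin is installed (it provides `icu_normalizer`); `cjk_width`, `cjk_bigram`, and the `thai` tokenizer are built in. The analyzer names and the `title` field are hypothetical, and the body has not been validated against a specific server version.

```python
import json

# Sketch of an index body: one CJK analyzer, one Thai analyzer, and a
# keyword subfield for exact product-name matches.
INDEX_BODY = {
    "settings": {
        "analysis": {
            "analyzer": {
                "cjk_text": {  # hypothetical analyzer name
                    "type": "custom",
                    "char_filter": ["icu_normalizer"],  # needs analysis-icu
                    "tokenizer": "standard",
                    "filter": ["cjk_width", "lowercase", "cjk_bigram"],
                },
                "thai_text": {  # hypothetical analyzer name
                    "type": "custom",
                    "char_filter": ["icu_normalizer"],  # needs analysis-icu
                    "tokenizer": "thai",
                    "filter": ["lowercase"],
                },
            }
        }
    },
    "mappings": {
        "properties": {
            "title": {  # hypothetical field name
                "type": "text",
                "analyzer": "cjk_text",  # reused at search time by default
                "fields": {
                    "th": {"type": "text", "analyzer": "thai_text"},
                    "raw": {"type": "keyword"},
                },
            }
        }
    },
}

if __name__ == "__main__":
    # Print the JSON body to pass to the create-index API.
    print(json.dumps(INDEX_BODY, indent=2, ensure_ascii=False))
```

Because no separate `search_analyzer` is set, each field uses the same analyzer at index and query time, which is the "unify index and search analyzers" rule from step 3.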

🔗 Quick-Start Downloads (60 sec)

| Tool | Link | 3-Step Setup |
|---|---|---|
| WFGY 1.0 PDF | Engine Paper | 1. Download · 2. Upload to your LLM · 3. Ask “Answer using WFGY + <your question>” |
| TXT OS (plain-text OS) | TXTOS.txt | 1. Download · 2. Paste into any LLM chat · 3. Type “hello world” and the OS boots instantly |

🧭 Explore More

| Module | Description | Link |
|---|---|---|
| WFGY Core | WFGY 2.0 engine is live: full symbolic reasoning architecture and math stack | View → |
| Problem Map 1.0 | Initial 16-mode diagnostic and symbolic fix framework | View → |
| Problem Map 2.0 | RAG-focused failure tree, modular fixes, and pipelines | View → |
| Semantic Clinic Index | Expanded failure catalog: prompt injection, memory bugs, logic drift | View → |
| Semantic Blueprint | Layer-based symbolic reasoning & semantic modulations | View → |
| Benchmark vs GPT-5 | Stress test GPT-5 with full WFGY reasoning suite | View → |
| 🧙‍♂️ Starter Village 🏡 | New here? Lost in symbols? Click here and let the wizard guide you through | Start → |

👑 Early Stargazers: See the Hall of Fame
Engineers, hackers, and open source builders who supported WFGY from day one.

WFGY Engine 2.0 is already unlocked. Star the repo to help others discover it and unlock more on the Unlock Board.