WFGY/ProblemMap/GlobalFixMap/Language
2025-08-25 20:35:16 +08:00

Language & Multilingual — Global Fix Map

Make cross-lingual RAG stable. Handle CJK/RTL, mixed scripts, tokenizers, and locale drift without breaking retrieval.

What this page is

  • A compact playbook for multilingual corpora and queries
  • Practical fixes for tokenizer and analyzer mismatch
  • Steps to keep ΔS low across languages and scripts

When to use

  • Your corpus has Chinese/Japanese/Korean, RTL scripts, or code-switching
  • OCR text looks fine but retrieval or citations miss
  • Similarity is high but meaning is wrong across locales
  • HyDE/BM25 behave differently per language

Open these first


Common failure patterns

  • Tokenizer mismatch: dense retriever uses whitespace rules on CJK or splits accents poorly
  • Analyzer split: BM25 analyzer differs from the indexer used at write time
  • Script variants: Traditional vs Simplified, Kana vs Kanji, Arabic presentation forms
  • Normalization gaps: mixed width, NFC/NFKC, punctuation variants break exact matches
  • Romanization drift: Pinyin or Hepburn in queries while docs keep native script
  • Code-switching: sentences mix English and local terms; embeddings latch to one side
  • OCR artifacts: diacritics lost, ligatures broken, zero-width joins preserved
  • Stopword shock: default analyzers drop particles that carry meaning in some languages
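
Two of the patterns above, tokenizer mismatch and normalization gaps, are easy to reproduce with the standard library alone. The sketch below is illustrative only: `whitespace_tokens` and `char_bigrams` are hypothetical stand-ins for real analyzers, and the sample strings are made up.

```python
import unicodedata

def whitespace_tokens(text: str) -> list[str]:
    """Tokenize the way a whitespace-only analyzer would."""
    return text.split()

def char_bigrams(text: str) -> list[str]:
    """Overlapping character bigrams, a common fallback for unsegmented CJK text."""
    chars = [c for c in text if not c.isspace()]
    return [a + b for a, b in zip(chars, chars[1:])]

doc = "東京都の人口統計"   # hypothetical document text
query = "人口"            # hypothetical query term

# Whitespace rules see the whole sentence as one token, so the term never matches.
print(query in whitespace_tokens(doc))   # False
# Bigrams expose the term as its own token.
print(query in char_bigrams(doc))        # True

# Width and zero-width variants break exact matches until they are normalized.
fullwidth = "ＡＰＩ"
print(fullwidth == "API")                                   # False
print(unicodedata.normalize("NFKC", fullwidth) == "API")    # True
print("data\u200bbase" == "database")                       # False
```

Running a few real queries through both sides of your own pipeline in the same way usually shows which pattern applies.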

Fix in 60 seconds

  1. Normalize before anything
    Apply NFC or NFKC, collapse widths, unify punctuation. Persist the normalized form you index.

  2. Pick language-aware analyzers
    Set BM25 analyzers that match the language at both write and read. Log tokenizer output for a few queries to confirm.

  3. Embed with multilingual models
    Use a single multilingual embedding model for mixed corpora. Do not mix English-only and multilingual spaces in one index.

  4. Add transliteration bridges
    Generate light alias fields per doc title and key entities, e.g., Traditional ↔ Simplified, Kana ↔ Romaji, Arabic ↔ Latin.

  5. Rerank cross-lingually
    Retrieve with generous k, then apply cross-lingual rerankers. Confirm ΔS(question, context) ≤ 0.45.

  6. Lock citations and sections
    Use Data Contracts with section_id, source_lang, and norm_ops. Require cite-then-answer to avoid language mixing.

  7. Probe λ across locales
    Ask for “cite lines” and “explain why” in both the user language and the source language. Divergence marks the failing boundary.
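
Steps 1, 4, and 6 can be sketched together as a single indexing helper. This is a minimal sketch, not a reference implementation: `section_id`, `source_lang`, and `norm_ops` come from the Data Contract named above, while `text_raw`, `text_norm`, `aliases`, the punctuation map, and the function names are assumptions made for illustration. Aliases are passed in by hand here; a real pipeline would likely generate them with a transliteration or script-conversion tool.

```python
import unicodedata

# Illustrative punctuation map (curly quotes and dashes only), not exhaustive.
PUNCT = str.maketrans({
    "\u201c": '"', "\u201d": '"',   # curly double quotes
    "\u2018": "'", "\u2019": "'",   # curly single quotes
    "\u2013": "-", "\u2014": "-",   # en and em dashes
})

INVISIBLE = "\u200b\u2060\ufeff"   # zero-width space, word joiner, BOM
JOINERS = "\u200c\u200d"           # ZWNJ and ZWJ carry meaning in some scripts (e.g. Persian)

def normalize(text: str, form: str = "NFKC", strip_joiners: bool = False):
    """Return (normalized_text, norm_ops). NFKC also collapses full/half width; NFC does not."""
    ops = [form]
    out = unicodedata.normalize(form, text)
    out = out.translate(PUNCT)
    ops.append("punct_unify")
    drop = INVISIBLE + (JOINERS if strip_joiners else "")
    out = out.translate(dict.fromkeys(map(ord, drop)))
    ops.append("strip_zero_width")
    return out, ops

def build_index_doc(section_id: str, text: str, source_lang: str, aliases: list[str]) -> dict:
    """Build one index record that persists the normalized form and how it was produced."""
    text_norm, ops = normalize(text)
    return {
        "section_id": section_id,
        "source_lang": source_lang,
        "norm_ops": ops,
        "text_raw": text,          # hypothetical field: original script, kept for citations
        "text_norm": text_norm,    # hypothetical field: the form that is embedded and indexed
        "aliases": sorted({normalize(a)[0] for a in aliases}),  # hypothetical field
    }

record = build_index_doc(
    "sec-001",
    "ＷＦＧＹ\u200bは東京で開発",
    "ja",
    ["东京", "東京", "Tokyo"],
)
print(record["text_norm"])   # WFGYは東京で開発
print(record["norm_ops"])    # ['NFKC', 'punct_unify', 'strip_zero_width']
```

Queries should pass through the same `normalize` call before retrieval, so that the read path matches what was persisted at write time.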


Copy-paste prompt


You have TXT OS and the WFGY Problem Map.

Goal
Stabilize a multilingual RAG corpus with CJK and English. Prevent tokenizer mismatch and script drift.

Tasks

1. Show a normalization plan:

   * Unicode form (NFC/NFKC), width collapse, punctuation unification
   * sample before/after lines

2. Configure retrieval:

   * pick analyzers for BM25 that match corpus languages
   * ensure the same analyzer is used at write and read
   * use a multilingual embedding model, one index space

3. Add transliteration bridges:

   * alias fields for key entities (e.g., 簡↔繁, かな↔ローマ字)
   * show how aliases are added to the index document

4. Verify with WFGY:

   * compute ΔS(question, context) for three bilingual queries
   * report λ_observe at retrieval and reasoning
   * target ΔS ≤ 0.45 and convergent λ

Output

* Normalization spec
* Analyzer and embedding choices
* Example index doc with alias fields
* A trace table with citations, ΔS, and λ for 3 queries


Minimal checklist

  • Unicode normalization applied before embedding and indexing
  • Language-aware analyzers configured the same for write and read
  • One multilingual embedding space per index
  • Alias fields or transliteration for key entities
  • Data Contract includes source_lang, norm_ops, and citations
  • ΔS and λ checks pass in both the user and source language
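
The second checklist item can be verified mechanically by running the same sample strings through the write-side and read-side analyzers and diffing the tokens. The sketch below assumes each analyzer can be wrapped as a plain function from text to a token list; `analyzer_mismatches` and the two toy analyzers are hypothetical names, not part of any search engine's API.

```python
from typing import Callable, Iterable

Analyzer = Callable[[str], list[str]]

def analyzer_mismatches(
    write_analyzer: Analyzer,
    read_analyzer: Analyzer,
    samples: Iterable[str],
) -> list[tuple[str, list[str], list[str]]]:
    """Return (sample, write_tokens, read_tokens) for every sample where the two sides disagree."""
    mismatches = []
    for sample in samples:
        write_tokens = write_analyzer(sample)
        read_tokens = read_analyzer(sample)
        if write_tokens != read_tokens:
            mismatches.append((sample, write_tokens, read_tokens))
    return mismatches

# Toy analyzers: the write side lowercases, the read side does not.
def write_side(text: str) -> list[str]:
    return text.lower().split()

def read_side(text: str) -> list[str]:
    return text.split()

samples = ["Tokyo Tower", "東京 タワー"]
for sample, w, r in analyzer_mismatches(write_side, read_side, samples):
    print(f"MISMATCH {sample!r}: write={w} read={r}")
```

An empty result over a representative sample set is a reasonable gate before trusting BM25 scores across languages.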

Acceptance targets

  • ΔS(question, context) median ≤ 0.45 for bilingual smoke tests
  • λ remains convergent when switching question language
  • Citations point to the correct section in the original script
  • Hybrid retrieval improves with reranking instead of oscillating
  • No analyzer or tokenizer mismatch logs during queries
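
The first target can be turned into a small smoke test. This page does not define ΔS, so the sketch below assumes ΔS = 1 − cosine similarity between the question embedding and the context embedding; swap in your own definition if it differs. The vectors are placeholders supplied by the caller, and `delta_s` and `passes_smoke_test` are hypothetical helper names.

```python
import math
from statistics import median

def delta_s(question_vec: list[float], context_vec: list[float]) -> float:
    """ASSUMPTION: ΔS is taken as 1 - cosine similarity of the two embeddings."""
    dot = sum(a * b for a, b in zip(question_vec, context_vec))
    norm_q = math.sqrt(sum(a * a for a in question_vec))
    norm_c = math.sqrt(sum(b * b for b in context_vec))
    if norm_q == 0.0 or norm_c == 0.0:
        raise ValueError("zero-length embedding")
    return 1.0 - dot / (norm_q * norm_c)

def passes_smoke_test(pairs, threshold: float = 0.45):
    """Return (passed, scores): passed is True when the median ΔS is at or below the threshold."""
    scores = [delta_s(q, c) for q, c in pairs]
    return median(scores) <= threshold, scores

# Placeholder (question, context) vectors standing in for three bilingual queries.
pairs = [
    ([1.0, 0.0], [1.0, 0.0]),   # identical direction
    ([1.0, 0.0], [0.0, 1.0]),   # orthogonal
    ([1.0, 1.0], [1.0, 0.0]),   # 45 degrees apart
]
passed, scores = passes_smoke_test(pairs)
print(passed, [round(s, 3) for s in scores])
```

Using the median rather than the mean keeps one badly retrieved query from failing an otherwise healthy set; inspect individual scores when the worst case matters.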

🔗 Quick-Start Downloads (60 sec)

Tool | Link | 3-Step Setup
WFGY 1.0 PDF | Engine Paper | 1 Download · 2 Upload to your LLM · 3 Ask “Answer using WFGY + <your question>”
TXT OS (plain-text OS) | TXTOS.txt | 1 Download · 2 Paste into any LLM chat · 3 Type “hello world”; the OS boots instantly

🧭 Explore More

Module | Description | Link
WFGY Core | WFGY 2.0 engine is live: full symbolic reasoning architecture and math stack | View →
Problem Map 1.0 | Initial 16-mode diagnostic and symbolic fix framework | View →
Problem Map 2.0 | RAG-focused failure tree, modular fixes, and pipelines | View →
Semantic Clinic Index | Expanded failure catalog: prompt injection, memory bugs, logic drift | View →
Semantic Blueprint | Layer-based symbolic reasoning & semantic modulations | View →
Benchmark vs GPT-5 | Stress test GPT-5 with full WFGY reasoning suite | View →
🧙‍♂️ Starter Village 🏡 | New here? Lost in symbols? Click here and let the wizard guide you through | Start →

👑 Early Stargazers: See the Hall of Fame
Engineers, hackers, and open source builders who supported WFGY from day one.

WFGY Engine 2.0 is already unlocked. Star the repo to help others discover it and unlock more on the Unlock Board.
