WFGY/ProblemMap/multilingual-guide.md

🌍 Multilingual RAG Guide — CJK, RTL, and Code-Mix Done Right

Build one pipeline that works across languages without wrecking recall or citations.

Quick Nav
Retrieval Playbook · Embedding vs Semantic · Patterns · Rerankers


0) Principles

  1. Detect → Segment → Normalize → Dual-index → Fence citations.
  2. Keep original text for display and normalized text for search.
  3. Pick truly multilingual embeddings or per-language encoders; don't mix them silently.

1) Language detection

  • Use CLD3/fastText or a simple rule-first fallback (CJK regex, RTL markers).
  • Store lang on chunks and queries.
  • If uncertain, mark lang="und" and route to char-level retrieval.
```json
{"chunk_id": "c1", "lang": "zh", "text": "ΔS 衡量語義張力 ..."}
```
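The rule-first fallback can be sketched with stdlib regexes alone (the bucket names `cjk`/`rtl`/`latin` and the function `detect_lang` are assumptions of this sketch; use CLD3/fastText when you need real language IDs):

```python
import re

# Coarse script buckets, a sketch: CJK ideographs/kana/hangul, common RTL
# blocks (Hebrew + Arabic), basic Latin, and "und" for anything ambiguous.
CJK_RE = re.compile(r'[\u4e00-\u9fff\u3040-\u30ff\uac00-\ud7af]')
RTL_RE = re.compile(r'[\u0590-\u05ff\u0600-\u06ff]')

def detect_lang(text: str) -> str:
    """Return a coarse script tag; 'und' routes to char-level retrieval."""
    if CJK_RE.search(text):
        return "cjk"
    if RTL_RE.search(text):
        return "rtl"
    if re.search(r'[a-zA-Z]', text):
        return "latin"
    return "und"
```

A statistical detector should still win on ambiguous short strings; this only guarantees a sane route for obvious scripts.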

2) Segmentation & normalization

| Language | Segmenter | Notes |
|---|---|---|
| Chinese (zh) | jieba / pkuseg | Also store OpenCC Simplified/Traditional variants |
| Japanese (ja) | MeCab | Preserve readings for search if useful |
| Korean (ko) | MeCab-ko / khaiii | Keep compound nouns intact when possible |
| Thai (th) | PyThaiNLP | Sentence boundaries matter for chunking |
| Arabic/Hebrew (ar/he) | ICU | Handle diacritics and RTL shaping |
| Code-mix | ICU + heuristics | Fall back to character n-grams if needed |
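The character n-gram fallback in the last row needs no dependencies; a minimal sketch (the function name `char_ngrams` is an assumption):

```python
def char_ngrams(text: str, n: int = 2) -> list:
    """Character n-grams for code-mix or lang='und' text.

    Spaces are dropped so mixed-script runs still produce grams;
    feed the result to BM25 in place of word tokens.
    """
    s = text.replace(" ", "")
    return [s[i:i + n] for i in range(len(s) - n + 1)]
```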

Normalization checklist:

  • Unify punctuation (full-width/half-width).
  • Unicode NFC.
  • Lowercase where appropriate (not for proper nouns in citations).
  • Keep both text_orig and text_norm.
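The checklist above can be sketched with the stdlib only (the language gate on lowercasing and the function name `make_text_norm` are assumptions of this sketch):

```python
import unicodedata

# Full-width ASCII block (U+FF01..U+FF5E) -> half-width, plus ideographic space.
FULLWIDTH = {code: code - 0xFEE0 for code in range(0xFF01, 0xFF5F)}
FULLWIDTH[0x3000] = 0x20

def make_text_norm(text_orig: str, lang: str) -> str:
    """Checklist in code: NFC, unify punctuation width, cautious lowercase."""
    s = unicodedata.normalize("NFC", text_orig)
    s = s.translate(FULLWIDTH)
    if lang not in ("zh", "ja", "ko"):  # lowercase only where case exists
        s = s.lower()
    return s
```

Store the result as `text_norm` next to the untouched `text_orig`; only `text_orig` is ever shown in citations.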

3) Embeddings & indexing

Recommended multilingual encoders

  • bge-m3 (multilingual, strong cross-lingual)
  • LaBSE (older but solid cross-lingual)

Patterns

  • If queries in lang A frequently need answers in lang B → cross-lingual retrieval.
  • For noisy OCR in CJK, consider char-level dense + BM25 hybrid.

Indexing

  • Add lang and script fields; route BM25 analyzers per language.
  • For FAISS dense index: single multilingual vector space is easiest; if per-lang spaces, keep a router.
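A per-language router can stay trivially small; a sketch assuming the class name `IndexRouter` (the values would be FAISS indexes in practice):

```python
class IndexRouter:
    """Pick a per-language index when one exists, else the shared space.

    `shared` is the single multilingual index; `per_lang` maps a lang
    tag (from chunk/query metadata) to a dedicated index.
    """
    def __init__(self, shared, per_lang=None):
        self.shared = shared
        self.per_lang = per_lang or {}

    def pick(self, lang: str):
        return self.per_lang.get(lang, self.shared)
```

Queries tagged `und` fall through to the shared multilingual space, which keeps the `und` → char-level route from section 1 intact.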

4) Retrieval & reranking

  • First stage: hybrid (dense + BM25) with RRF.
  • If candidates include multiple languages, re-rank with cross-encoder multilingual models (e.g., bge-reranker-base works well).
  • Never drop minority language candidates too early—keep k_in ≥ 100 when cross-lingual.
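The first-stage fusion above is plain Reciprocal Rank Fusion; a minimal stdlib sketch (the function name `rrf_fuse` and `k=60` default are assumptions):

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion over (doc_id, rank) lists; rank is 1-based.

    Each list contributes 1 / (k + rank) per document; documents seen by
    both dense and sparse stages naturally float to the top.
    """
    scores = {}
    for ranking in rankings:
        for doc_id, rank in ranking:
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Because RRF only looks at ranks, it fuses dense and BM25 lists without any cross-language score calibration, which is exactly why it is safe for minority-language candidates.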

5) Prompting & citations (SCU-safe)

  • Show citations with language labels and line spans.
  • Forbid the LLM from translating citations; allow translations only in the explanation.
  • If the answer language ≠ the citation language, state it explicitly: “Cited in {citation_lang}, answered in {answer_lang}.”
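A fenced-citation payload that satisfies the three rules above might look like this (all field names are assumptions of this sketch, not a fixed schema):

```python
# The citation carries the verbatim source text and its language;
# the LLM may translate only inside the explanation, never inside `quote`.
citation = {
    "chunk_id": "c1",
    "lang": "zh",
    "lines": [12, 18],                 # line span in the source chunk
    "quote": "ΔS 衡量語義張力",          # verbatim, never translated
}
answer = {
    "answer_lang": "en",
    "note": "Cited in zh, answered in en.",
    "citations": [citation],
}
```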

6) Evaluation

  • Build a multilingual gold set:

    • Include cross-lingual Q→A pairs (e.g., query in English, answer/citation in zh).
    • Track recall@50, nDCG@10, and ΔS per language.
  • Acceptance: ΔS ≤ 0.45 for top-ctx in each language; stable λ across 3 paraphrases per language.
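The two retrieval metrics are small enough to inline per language (stdlib only; binary relevance is an assumption of this sketch, ΔS comes from your own pipeline):

```python
import math

def recall_at_k(retrieved, relevant, k=50):
    """Fraction of gold chunks present in the top-k retrieved list."""
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant) if relevant else 0.0

def ndcg_at_k(retrieved, relevant, k=10):
    """Binary-relevance nDCG@k against a gold set."""
    dcg = sum(1.0 / math.log2(i + 2)
              for i, doc in enumerate(retrieved[:k]) if doc in relevant)
    ideal = sum(1.0 / math.log2(i + 2)
                for i in range(min(len(relevant), k)))
    return dcg / ideal if ideal else 0.0
```

Run both per `lang` bucket, not only on the pooled set; a pooled average hides a collapsed minority language.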


7) Common multilingual pitfalls → fixes

| Pitfall | Why it happens | Fix |
|---|---|---|
| Hybrid fails on CJK | Analyzer/tokenizer mismatch | Use ICU analyzers; char-level BM25 |
| S/T Chinese mismatch | Source vs query script differ | Store both via OpenCC; index both variants |
| Citations translated | Prompt schema unlocked | Fence citations; allow translation only in the explanation |
| Cross-lang recall low | Monolingual embeddings | Use bge-m3/LaBSE, or translate the query then search |
| Arabic/Hebrew garbled | RTL shaping | ICU normalization; verify the rendering layer |

8) Minimal example (Python, FAISS + BM25 + bge-m3)

```python
# pip install sentence-transformers rank_bm25 faiss-cpu opencc-python-reimplemented
from sentence_transformers import SentenceTransformer
from rank_bm25 import BM25Okapi
import numpy as np, faiss, opencc

enc = SentenceTransformer("BAAI/bge-m3")
t2s = opencc.OpenCC("t2s")  # build once; Traditional -> Simplified

def norm(s, lang):
    if lang in ("zh-hant", "zh-hans", "zh"):
        return t2s.convert(s)
    return s

chunks = [{"text": "ΔS 衡量語義張力", "lang": "zh"},
          {"text": "Delta-S measures semantic stress", "lang": "en"}]
X = enc.encode([norm(c["text"], c["lang"]) for c in chunks], normalize_embeddings=True)
index = faiss.IndexFlatIP(X.shape[1]); index.add(X.astype(np.float32))
# Whitespace tokenization is a placeholder; use a per-language segmenter for CJK.
bm25 = BM25Okapi([c["text"].split() for c in chunks])

def search(q, lang="en", k=60):
    qv = enc.encode([norm(q, lang)], normalize_embeddings=True).astype(np.float32)
    _, I = index.search(qv, 50)
    dense_rank = [(int(i), r + 1) for r, i in enumerate(I[0])]
    top = bm25.get_top_n(q.split(), list(range(len(chunks))), 50)
    sparse_rank = [(i, r + 1) for r, i in enumerate(top)]
    # RRF fuse: 1 / (k + rank) summed across both rankings
    scores = {}
    for ranking in (dense_rank, sparse_rank):
        for doc_id, rank in ranking:
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

(For a production-grade RRF, use the one from the Retrieval Playbook.)


🔗 Quick-Start Downloads (60 sec)

| Tool | Link | 3-Step Setup |
|---|---|---|
| WFGY 1.0 PDF | Engine Paper | 1. Download · 2. Upload to your LLM · 3. Ask “Answer using WFGY + \<your question\>” |
| TXT OS (plain-text OS) | TXTOS.txt | 1. Download · 2. Paste into any LLM chat · 3. Type “hello world” — OS boots instantly |

🧭 Explore More

| Module | Description | Link |
|---|---|---|
| WFGY Core | WFGY 2.0 engine is live: full symbolic reasoning architecture and math stack | View → |
| Problem Map 1.0 | Initial 16-mode diagnostic and symbolic fix framework | View → |
| Problem Map 2.0 | RAG-focused failure tree, modular fixes, and pipelines | View → |
| Semantic Clinic Index | Expanded failure catalog: prompt injection, memory bugs, logic drift | View → |
| Semantic Blueprint | Layer-based symbolic reasoning & semantic modulations | View → |
| Benchmark vs GPT-5 | Stress test GPT-5 with the full WFGY reasoning suite | View → |
| 🧙‍♂️ Starter Village 🏡 | New here? Lost in symbols? Let the wizard guide you through | Start → |

👑 Early Stargazers: See the Hall of Fame
Engineers, hackers, and open source builders who supported WFGY from day one.

WFGY Engine 2.0 is already unlocked. Star the repo to help others discover it and unlock more on the Unlock Board.

WFGY Main   TXT OS   Blah   Blot   Bloc   Blur   Blow