🌍 Multilingual RAG Guide — CJK, RTL, and Code-Mix Done Right

Build one pipeline that works across languages without wrecking recall or citations.

Quick Nav
Retrieval Playbook · Embedding vs Semantic · Patterns · Rerankers


0) Principles

  1. Detect → Segment → Normalize → Dual-index → Fence citations.
  2. Keep original text for display and normalized text for search.
  3. Pick a truly multilingual embedding model or per-language encoders; don't mix them silently.

1) Language detection

  • Use CLD3/fastText or a simple rule-first fallback (CJK regex, RTL markers).
  • Store lang on chunks and queries.
  • If uncertain, mark lang="und" and route to char-level retrieval.
```json
{"chunk_id": "c1", "lang": "zh", "text": "ΔS 衡量語義張力 ..."}
```
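The rule-first fallback can be sketched in a few lines. The Unicode ranges and the `und` fallback below are illustrative defaults, not a fixed spec:

```python
import re

# Rule-first routing sketch: check CJK and RTL script ranges before calling
# any statistical detector. Ranges and labels here are illustrative.
CJK = re.compile(r"[\u4e00-\u9fff\u3040-\u30ff\uac00-\ud7af]")
RTL = re.compile(r"[\u0590-\u05ff\u0600-\u06ff]")

def detect_lang(text: str) -> str:
    if CJK.search(text):
        return "cjk"   # hand off to a CJK-aware segmenter
    if RTL.search(text):
        return "rtl"   # Arabic/Hebrew ranges; normalize with ICU
    if text.isascii():
        return "en"    # crude default; replace with CLD3/fastText
    return "und"       # uncertain: route to char-level retrieval
```

In production you would call CLD3/fastText first and use these rules only as the tiebreaker; the point is that `und` always has a safe routing target.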

2) Segmentation & normalization

| Language | Segmenter | Notes |
|---|---|---|
| Chinese (zh) | jieba / pkuseg | Also store OpenCC Simplified/Traditional variants |
| Japanese (ja) | MeCab | Preserve readings for search if useful |
| Korean (ko) | MeCab-ko / khaiii | Keep compound nouns intact when possible |
| Thai (th) | PyThaiNLP | Sentence boundaries matter for chunking |
| Arabic/Hebrew (ar/he) | ICU | Handle diacritics and RTL shaping |
| Code-mix | ICU + heuristics | Fall back to character n-grams if needed |

Normalization checklist:

  • Unify punctuation (full-width/half-width).
  • Unicode NFC.
  • Lowercase where appropriate (not for proper nouns in citations).
  • Keep both text_orig and text_norm.
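The checklist above fits in one small helper. This is a minimal sketch (the field names `text_orig`/`text_norm` follow the convention above; the full-width mapping covers only the ASCII-compatible range):

```python
import unicodedata

def normalize_chunk(text: str) -> dict:
    # Unicode NFC: compose combining sequences into canonical form
    norm = unicodedata.normalize("NFC", text)
    # Shift full-width forms (！ ？ Ａ etc., U+FF01-U+FF5E) to half-width ASCII
    norm = "".join(
        chr(ord(ch) - 0xFEE0) if 0xFF01 <= ord(ch) <= 0xFF5E else ch
        for ch in norm
    )
    norm = norm.replace("\u3000", " ")  # ideographic space -> ASCII space
    # Note: lowercasing is deliberately omitted here; apply it per-field so
    # proper nouns in citations keep their original casing.
    return {"text_orig": text, "text_norm": norm}
```

Display always reads `text_orig`; only the search index sees `text_norm`.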

3) Embeddings & indexing

Recommended multilingual encoders

  • bge-m3 (multilingual, strong cross-lingual)
  • LaBSE (older but solid cross-lingual)

Patterns

  • If queries in lang A frequently need answers in lang B → cross-lingual retrieval.
  • For noisy OCR in CJK, consider char-level dense + BM25 hybrid.

Indexing

  • Add lang and script fields; route BM25 analyzers per language.
  • For FAISS dense index: single multilingual vector space is easiest; if per-lang spaces, keep a router.
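If you do run per-language spaces, the router can stay trivial. A minimal sketch, with index names that are assumptions rather than a fixed WFGY layout:

```python
# Map detected language to its dense index; anything unknown or code-mixed
# falls through to a char-level index so nothing is silently dropped.
INDEX_BY_LANG = {
    "zh": "dense_zh",
    "ja": "dense_ja",
    "ko": "dense_ko",
    "en": "dense_en",
}

def route_index(lang: str) -> str:
    # "und" and code-mix both land on the character n-gram fallback
    return INDEX_BY_LANG.get(lang, "char_ngram")
```

A single multilingual space (bge-m3) makes this function unnecessary, which is why it is the recommended default.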

4) Retrieval & reranking

  • First stage: hybrid (dense + BM25) with RRF.
  • If candidates include multiple languages, re-rank with cross-encoder multilingual models (e.g., bge-reranker-base works well).
  • Never drop minority-language candidates too early; keep k_in ≥ 100 when cross-lingual.
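The first-stage fusion is plain RRF over the per-retriever rank lists. A minimal sketch (k=60 is the common default from the original RRF paper):

```python
def rrf(rank_lists, k=60):
    # Each rank list is [(doc_id, rank), ...] with rank starting at 1.
    scores = {}
    for ranks in rank_lists:
        for doc_id, rank in ranks:
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense = [("d1", 1), ("d2", 2), ("d3", 3)]
sparse = [("d2", 1), ("d4", 2)]
fused = rrf([dense, sparse])  # d2 ranks first: it appears in both lists
```

Because RRF only uses ranks, it fuses dense and BM25 scores without any per-language score calibration, which is exactly what you want in a mixed-language candidate pool.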

5) Prompting & citations (SCU-safe)

  • Show citations with language labels and line spans.
  • Forbid the LLM from translating citations; allow translations only in the explanation.
  • If answer language ≠ citation language, state: “Cited in {lang}, answer in {lang}.”
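One way to keep citations fenced is to render them mechanically, outside the LLM's control. The field layout below is a hypothetical sketch, not a fixed WFGY schema:

```python
def format_citation(chunk_id, lang, lines, quote, answer_lang):
    # The quoted span stays verbatim in its source language; only the
    # label and line span are generated around it.
    cite = f"[{chunk_id} | lang={lang} | L{lines[0]}-L{lines[1]}] {quote}"
    if answer_lang != lang:
        # Cross-language case: state it explicitly instead of translating
        cite += f"\n(Cited in {lang}, answer in {answer_lang}.)"
    return cite
```

The LLM then receives these rendered strings as immutable context, and the prompt only permits translation inside the explanation section.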

6) Evaluation

  • Build a multilingual gold set:

    • Include cross-lingual Q→A pairs (e.g., query in English, answer/citation in zh).
    • Track recall@50, nDCG@10, and ΔS per language.
  • Acceptance: ΔS ≤ 0.45 for top-ctx in each language; stable λ across 3 paraphrases per language.
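The two retrieval metrics are standard and cheap to compute per language. A minimal sketch, assuming binary relevance (gold is a set of relevant chunk ids):

```python
import math

def recall_at_k(ranked, gold, k=50):
    # Fraction of gold chunks that appear in the top-k results
    return len(set(ranked[:k]) & gold) / max(len(gold), 1)

def ndcg_at_k(ranked, gold, k=10):
    # Binary-gain nDCG: hits are discounted by log2 of their position
    dcg = sum(1.0 / math.log2(i + 2) for i, d in enumerate(ranked[:k]) if d in gold)
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(len(gold), k)))
    return dcg / ideal if ideal else 0.0
```

Run both per `lang` bucket and compare across languages; a large gap on one language usually points at the segmenter or analyzer, not the encoder.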


7) Common multilingual pitfalls → fixes

| Pitfall | Why it happens | Fix |
|---|---|---|
| Hybrid fails on CJK | Analyzer/tokenizer mismatch | Use ICU analyzers; char-level BM25 |
| Simplified/Traditional Chinese mismatch | Source vs query script differ | Store both via OpenCC; index both variants |
| Citations translated | Prompt schema unlocked | Fence citations; only the explanation may translate |
| Cross-lingual recall low | Monolingual embeddings | Use bge-m3/LaBSE, or translate the query then search |
| Arabic/Hebrew garbled | RTL shaping | ICU normalization; verify the rendering layer |

8) Minimal example (Python, FAISS + BM25 + bge-m3)

```python
# pip install sentence-transformers rank_bm25 faiss-cpu opencc-python-reimplemented
from sentence_transformers import SentenceTransformer
from rank_bm25 import BM25Okapi
import numpy as np, faiss
from opencc import OpenCC

enc = SentenceTransformer("BAAI/bge-m3")
t2s = OpenCC("t2s")  # build the converter once, not per query

def norm(s, lang):
    # Fold Traditional Chinese into Simplified so both scripts share one space
    if lang in ("zh-hant", "zh-hans", "zh"):
        return t2s.convert(s)
    return s

chunks = [{"text": "ΔS 衡量語義張力", "lang": "zh"},
          {"text": "Delta-S measures semantic stress", "lang": "en"}]
X = enc.encode([norm(c["text"], c["lang"]) for c in chunks], normalize_embeddings=True)
index = faiss.IndexFlatIP(X.shape[1]); index.add(X.astype(np.float32))
bm25 = BM25Okapi([c["text"].split() for c in chunks])

def search(q, lang="en", k=60):
    qv = enc.encode([norm(q, lang)], normalize_embeddings=True).astype(np.float32)
    _, I = index.search(qv, 50)
    dense_rank = [(int(i), r + 1) for r, i in enumerate(I[0])]
    top = bm25.get_top_n(q.split(), list(range(len(chunks))), n=50)
    sparse_rank = [(i, r + 1) for r, i in enumerate(top)]
    scores = {}  # RRF fuse the two rank lists
    for ranks in (dense_rank, sparse_rank):
        for i, r in ranks:
            scores[i] = scores.get(i, 0.0) + 1.0 / (k + r)
    return sorted(scores, key=scores.get, reverse=True)
```

(Use a proper RRF from the Retrieval Playbook.)


🔗 Quick-Start Downloads (60 sec)

| Tool | Link | 3-Step Setup |
|---|---|---|
| WFGY 1.0 PDF | Engine Paper | 1. Download · 2. Upload to your LLM · 3. Ask "Answer using WFGY + \<your question\>" |
| TXT OS (plain-text OS) | TXTOS.txt | 1. Download · 2. Paste into any LLM chat · 3. Type "hello world" and the OS boots instantly |

Explore More

| Layer | Page | What it's for |
|---|---|---|
| Proof | WFGY Recognition Map | External citations, integrations, and ecosystem proof |
| Engine | WFGY 1.0 | Original PDF-based tension engine |
| Engine | WFGY 2.0 | Production tension kernel and math engine for RAG and agents |
| Engine | WFGY 3.0 | TXT-based Singularity tension engine, 131 S-class set |
| Map | Problem Map 1.0 | Flagship 16-problem RAG failure checklist and fix map |
| Map | Problem Map 2.0 | RAG-focused recovery pipeline |
| Map | Problem Map 3.0 | Global Debug Card, image as a debug-protocol layer |
| Map | Semantic Clinic | Symptom → family → exact fix |
| Map | Grandma's Clinic | Plain-language stories mapped to Problem Map 1.0 |
| Onboarding | Starter Village | Guided tour for newcomers |
| App | TXT OS | TXT semantic OS, fast boot |
| App | Blah Blah Blah | Abstract and paradox Q&A built on TXT OS |
| App | Blur Blur Blur | Text-to-image with semantic control |
| App | Blow Blow Blow | Reasoning game engine and memory demo |

If this repository helped, starring it improves discovery so more builders can find the docs and tools.