# Language & Multilingual — Global Fix Map

Make cross-lingual RAG stable. Handle CJK/RTL scripts, mixed-script text, tokenizer differences, and locale drift without breaking retrieval.
## What this page is
- A compact playbook for multilingual corpora and queries
- Practical fixes for tokenizer and analyzer mismatch
- Steps to keep ΔS low across languages and scripts
## When to use

- Your corpus has Chinese/Japanese/Korean, RTL scripts, or code-switching
- OCR text looks fine, but retrieval or citations miss the target
- Similarity is high but meaning is wrong across locales
- HyDE/BM25 behave differently per language
## Open these first
- Language and locale guide: Multilingual Guide
- Embedding vs true meaning: Embedding ≠ Semantic
- OCR quality and pitfalls: OCR / Parsing Checklist
- Chunk boundaries and joins: Semantic Chunking Checklist
- Why this snippet: Retrieval Traceability
- Ordering control: Rerankers
- Snippet schema: Data Contracts
## Common failure patterns

- **Tokenizer mismatch**: the dense retriever applies whitespace rules to CJK text or splits accented characters poorly
- **Analyzer split**: the BM25 analyzer at query time differs from the analyzer used at index (write) time
- **Script variants**: Traditional vs Simplified Chinese, Kana vs Kanji, Arabic presentation forms
- **Normalization gaps**: mixed character widths, NFC/NFKC differences, and punctuation variants break exact matches
- **Romanization drift**: queries use Pinyin or Hepburn while documents keep the native script
- **Code-switching**: sentences mix English and local terms; embeddings latch onto one side
- **OCR artifacts**: diacritics lost, ligatures broken, zero-width joiners preserved
- **Stopword shock**: default analyzers drop particles that carry meaning in some languages
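The first two failure patterns come down to token sets that never overlap. A minimal sketch below shows why: the two tokenizers are hypothetical stand-ins (not any specific retriever's implementation), with a whitespace tokenizer on the query side and a character-bigram tokenizer, a common CJK analyzer strategy, on the index side.

```python
def whitespace_tokens(text: str) -> set[str]:
    """Whitespace tokenizer: treats an unspaced CJK phrase as one token."""
    return set(text.split())

def bigram_tokens(text: str) -> set[str]:
    """Character-bigram tokenizer, a common choice for CJK analyzers."""
    compact = text.replace(" ", "")
    return {compact[i:i + 2] for i in range(len(compact) - 1)}

doc = "東京大学 の 研究"   # indexed with bigrams
query = "東京大学"          # queried with whitespace rules

# Zero lexical overlap: BM25 would score this document as irrelevant
# even though it literally contains the query string.
overlap = whitespace_tokens(query) & bigram_tokens(doc)
print(overlap)  # → set()
```

The fix is not a better tokenizer in the abstract; it is using the same analyzer on both sides, which is what the steps below enforce.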
## Fix in 60 seconds

1. **Normalize before anything.** Apply NFC or NFKC, collapse character widths, unify punctuation. Persist the normalized form you index.
2. **Pick language-aware analyzers.** Set BM25 analyzers that match the language at both write and read time. Log tokenizer output for a few queries to confirm.
3. **Embed with multilingual models.** Use a single multilingual embedding model for mixed corpora. Do not mix English-only and multilingual embedding spaces in one index.
4. **Add transliteration bridges.** Generate light alias fields per document for the title and key entities, e.g., Traditional ↔ Simplified, Kana ↔ Romaji, Arabic ↔ Latin.
5. **Rerank cross-lingually.** Retrieve with a generous k, then apply a cross-lingual reranker. Confirm ΔS(question, context) ≤ 0.45.
6. **Lock citations and sections.** Use Data Contracts with `section_id`, `source_lang`, and `norm_ops`. Require cite-then-answer to avoid language mixing.
7. **Probe λ across locales.** Ask for "cite lines" and "explain why" in both the user language and the source language. Divergence marks the failing boundary.
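The normalization step above can be sketched in a few lines. This is a minimal illustration, assuming NFKC plus quote unification; the function name and the `norm_ops` log format are placeholders, not a fixed WFGY schema.

```python
import unicodedata

def normalize_for_index(text: str) -> tuple[str, list[str]]:
    """Return normalized text plus a log of applied operations (norm_ops)."""
    ops = []
    # NFKC also collapses full-width Latin, digits, and punctuation.
    nfkc = unicodedata.normalize("NFKC", text)
    if nfkc != text:
        ops.append("NFKC")
    # Unify curly quotes, which NFKC leaves untouched.
    unified = nfkc.translate(str.maketrans({"“": '"', "”": '"', "‘": "'", "’": "'"}))
    if unified != nfkc:
        ops.append("quote_unification")
    return unified, ops

text, ops = normalize_for_index("Ｔｏｋｙｏ　“大学”")
print(text)  # → Tokyo "大学"
print(ops)   # → ['NFKC', 'quote_unification']
```

Persisting `ops` alongside the document is what lets a Data Contract later prove the indexed form matches the retrieved form.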
## Copy-paste prompt
You have TXT OS and the WFGY Problem Map.

Goal:
Stabilize a multilingual RAG corpus with CJK and English. Prevent tokenizer mismatch and script drift.

Tasks:
1. Show a normalization plan:
   * Unicode form (NFC/NFKC), width collapse, punctuation unification
   * sample before/after lines
2. Configure retrieval:
   * pick BM25 analyzers that match the corpus languages
   * ensure the same analyzer is used at write and read time
   * use a multilingual embedding model with one index space
3. Add transliteration bridges:
   * alias fields for key entities (e.g., 簡↔繁, かな↔ローマ字)
   * show how aliases are added to the index document
4. Verify with WFGY:
   * compute ΔS(question, context) for three bilingual queries
   * report λ_observe at retrieval and at reasoning
   * target ΔS ≤ 0.45 and convergent λ

Output:
* Normalization spec
* Analyzer and embedding choices
* Example index doc with alias fields
* A trace table with citations, ΔS, and λ for 3 queries
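An index document with transliteration bridges (task 3 in the prompt) could look like the sketch below. The field names `alias_scripts`, `norm_ops`, and the alias values are illustrative assumptions, not a fixed schema; in practice the aliases would come from a conversion table or transliteration library.

```python
import json

# Illustrative index document with alias fields for cross-script retrieval.
doc = {
    "section_id": "sec-042",
    "source_lang": "zh-Hant",
    "title": "臺灣大學簡介",
    "alias_scripts": {
        "zh-Hans": "台湾大学简介",         # Traditional -> Simplified bridge
        "latin": "Taiwan Daxue Jianjie",   # Pinyin romanization bridge
    },
    "norm_ops": ["NFKC", "width_collapse"],
    "text": "…",
}
print(json.dumps(doc, ensure_ascii=False, indent=2))
```

Indexing the alias fields alongside the title lets a Simplified-script or romanized query match a Traditional-script document through plain BM25, without touching the dense retriever.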
## Minimal checklist

- Unicode normalization applied before embedding and indexing
- Language-aware analyzers configured identically for write and read
- One multilingual embedding space per index
- Alias fields or transliteration for key entities
- Data Contract includes `source_lang`, `norm_ops`, and citations
- ΔS and λ checks pass in both the user and source language
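The ΔS check in the last item can be computed once you have question and context embeddings. One common reading of ΔS is one minus the cosine similarity between the two vectors; that reading is an assumption here, and the toy vectors stand in for the output of whatever multilingual embedding model you use.

```python
import math

def delta_s(q_vec: list[float], c_vec: list[float]) -> float:
    """ΔS as 1 - cosine similarity (an assumed reading of the WFGY metric)."""
    dot = sum(a * b for a, b in zip(q_vec, c_vec))
    norm = math.sqrt(sum(a * a for a in q_vec)) * math.sqrt(sum(b * b for b in c_vec))
    return 1.0 - dot / norm

# Toy vectors standing in for embeddings of a bilingual question/context pair.
q = [0.9, 0.1, 0.3]
c = [0.8, 0.2, 0.4]
score = delta_s(q, c)
print(f"ΔS = {score:.3f}, pass = {score <= 0.45}")
```

Run this for each bilingual smoke-test query and keep the per-query scores in the trace table; the acceptance target below asks for a median ≤ 0.45, not just one passing pair.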
## Acceptance targets
- ΔS(question, context) median ≤ 0.45 for bilingual smoke tests
- λ remains convergent when switching question language
- Citations point to the correct section in the original script
- Hybrid retrieval improves with reranking instead of oscillating
- No analyzer or tokenizer mismatch logs during queries
## 🔗 Quick-Start Downloads (60 sec)
| Tool | Link | 3-Step Setup |
|---|---|---|
| WFGY 1.0 PDF | Engine Paper | 1️⃣ Download · 2️⃣ Upload to your LLM · 3️⃣ Ask “Answer using WFGY + <your question>” |
| TXT OS (plain-text OS) | TXTOS.txt | 1️⃣ Download · 2️⃣ Paste into any LLM chat · 3️⃣ Type “hello world” — OS boots instantly |
## 🧭 Explore More
| Module | Description | Link |
|---|---|---|
| WFGY Core | WFGY 2.0 engine is live: full symbolic reasoning architecture and math stack | View → |
| Problem Map 1.0 | Initial 16-mode diagnostic and symbolic fix framework | View → |
| Problem Map 2.0 | RAG-focused failure tree, modular fixes, and pipelines | View → |
| Semantic Clinic Index | Expanded failure catalog: prompt injection, memory bugs, logic drift | View → |
| Semantic Blueprint | Layer-based symbolic reasoning & semantic modulations | View → |
| Benchmark vs GPT-5 | Stress test GPT-5 with full WFGY reasoning suite | View → |
| 🧙‍♂️ Starter Village 🏡 | New here? Lost in symbols? Click here and let the wizard guide you through | Start → |
👑 Early Stargazers: see the Hall of Fame for the engineers, hackers, and open source builders who supported WFGY from day one.
⭐ WFGY Engine 2.0 is already unlocked. ⭐ Star the repo to help others discover it and unlock more on the Unlock Board.