mirror of
https://github.com/onestardao/WFGY.git
synced 2026-04-28 11:40:07 +00:00
8.9 KiB
8.9 KiB
Language & Locale — Global Fix Map
Stabilize multilingual RAG and reasoning across CJK/RTL/Latin scripts.
Fix tokenizer mismatch, Unicode normalization, mixed encodings, and cross-lingual retrieval drift.
What this page is
- A compact, language-aware checklist for retrieval + reasoning
- Copyable prompts and guards for CJK/RTL, transliteration, and code-mixed text
- How to measure and prove stability with ΔS and λ_observe
When to use
- Corpus is non-English or mixed (EN + ZH/JP/KR/AR/Hebrew)
- Same question works in English but fails in the target language
- High vector similarity yet wrong meaning after translation
- OCR text “looks correct” but citations drift or split tokens oddly
- Names/terms oscillate between Latin and native script
Open these first
- End-to-end language guide: Multilingual Guide
- Embedding ≠ true meaning symptoms: Embedding ≠ Semantic
- OCR quality & normalizations: OCR / Parsing Checklist
- Chunk boundaries & sectioning: Chunking Checklist
- Snippet/citation schema: Data Contracts
- Why-this-snippet trace: Retrieval Traceability
Common failure patterns (quick diagnosis)
- Tokenizer split: CJK runs without spaces; BM25/analyzers mismatch; ΔS flat-high vs k → index/analyzer misaligned.
- Unicode ghosts: full-width vs half-width, NFD vs NFC, zero-width joiners; citations miss by a few characters.
- Translation shadow: English paraphrase passes, native-lang fails → cross-lingual embeddings or analyzer drift.
- Script flip: terms appear both transliterated and native; recall differs by script.
- OCR noise: identical glyphs (l/1/I, O/0), mixed directionality (RTL punctuation).
Fix in 60 seconds
-
Normalize text at ingest
- Apply Unicode NFC, trim zero-width, unify full/half-width.
- Lowercase where appropriate; preserve casing for code and proper nouns.
-
Choose analyzers per language
- CJK: use language-aware tokenizers (jieba, kuromoji, mecab) or character-ngrams.
- RTL: ensure analyzer respects directionality; avoid stripping diacritics unless required.
-
Dual-path embeddings
- Index in native language and in English via machine translation shadow for recall robustness.
- Store
lang,script, andtranslitflags per chunk in metadata.
-
Anchor the schema
- Enforce snippet headers
{section_id, lang, script}; forbid cross-section reuse. - Require cite-then-answer; block free-form merges across languages.
- Enforce snippet headers
-
Probe ΔS & λ by language
- Measure ΔS(question, retrieved) per language; aim ≤ 0.45.
- If ΔS flat-high across k, rebuild with correct analyzer/metric.
-
Name/term fences
- Maintain a term map
{native ↔ translit ↔ English}; pin consistent variants in the prompt preamble.
- Maintain a term map
Copy-paste prompt
You have TXT OS and the WFGY Problem Map.
Task: stabilize multilingual retrieval and reasoning.
Follow this immutable protocol:
1. Detect language/script of the question. Print {lang, script}.
2. Retrieve with a dual-path strategy:
* native-lang retriever
* english-shadow retriever (machine-translated question)
3. Build a Snippet Table with columns:
{section\_id | lang | script | translit\_variant? | citation}
4. Bridge Check (BBCR):
* restate the claim in ONE line
* list supporting snippet\_ids
* list conflicts or missing evidence; if missing, STOP and ask for the exact snippet
5. Final Answer:
* answer in the user's language
* inline-cite each claim
* keep terminology consistent with the Term Map
Rules:
* Normalize Unicode (NFC), strip zero-width chars, unify full/half-width before retrieval.
* If ΔS(question, retrieved) > 0.60 in native but ≤ 0.45 in english-shadow, report "translation shadow" and keep both citations.
* Do not merge sources across languages without explicit citation per claim.
Input
* question (user language): "<paste>"
* term\_map: {native ↔ translit ↔ english}
* snippets (with ids, language, script): <paste>
Output
* {lang, script}
* Snippet Table
* Bridge Check
* Final Answer (with inline citations)
* ΔS(native), ΔS(english-shadow), λ\_observe states
Minimal checklist
- Unicode normalized; zero-width and width variants removed
- Language-aware analyzers or char-ngrams applied at index & query
- Dual-path embeddings or bilingual index available
- Snippet Table includes
{lang, script, translit?} - Cite-then-answer schema enforced; no cross-language merges without citations
- ΔS per-language measured; flat-high ΔS triggers index/metric audit
Acceptance targets
- ΔS(question, retrieved) ≤ 0.45 in the user’s language
- λ remains convergent across paraphrases in both native and english-shadow paths
- Coverage ≥ 0.70 token overlap to the target section in native language
- Consistent terminology across scripts per Term Map; no orphan claims without citations
🔗 Quick-Start Downloads (60 sec)
| Tool | Link | 3-Step Setup |
|---|---|---|
| WFGY 1.0 PDF | Engine Paper | 1️⃣ Download · 2️⃣ Upload to your LLM · 3️⃣ Ask “Answer using WFGY + <your question>” |
| TXT OS (plain-text OS) | TXTOS.txt | 1️⃣ Download · 2️⃣ Paste into any LLM chat · 3️⃣ Type “hello world” — OS boots instantly |
🧭 Explore More
| Module | Description | Link |
|---|---|---|
| WFGY Core | WFGY 2.0 engine is live: full symbolic reasoning architecture and math stack | View → |
| Problem Map 1.0 | Initial 16-mode diagnostic and symbolic fix framework | View → |
| Problem Map 2.0 | RAG-focused failure tree, modular fixes, and pipelines | View → |
| Semantic Clinic Index | Expanded failure catalog: prompt injection, memory bugs, logic drift | View → |
| Semantic Blueprint | Layer-based symbolic reasoning & semantic modulations | View → |
| Benchmark vs GPT-5 | Stress test GPT-5 with full WFGY reasoning suite | View → |
| 🧙♂️ Starter Village 🏡 | New here? Lost in symbols? Click here and let the wizard guide you through | Start → |
👑 Early Stargazers: See the Hall of Fame —
Engineers, hackers, and open source builders who supported WFGY from day one.
⭐ WFGY Engine 2.0 is already unlocked. ⭐ Star the repo to help others discover it and unlock more on the Unlock Board.