# Language & Locale — Global Fix Map

Stabilize multilingual RAG and reasoning across CJK, RTL, Indic, and Latin mixes. This hub localizes language-layer failures and routes you to the right structural fix. No infra change required.
## What this page is
- A compact, language-aware repair guide for retrieval → ranking → reasoning.
- Structural fixes with measurable acceptance targets.
- Store-agnostic. Works with FAISS, Redis, pgvector, Elastic, Weaviate, Milvus, etc.
## When to use
- Corpus spans CJK or Indic scripts and retrieval keeps missing the correct section.
- Queries code-switch or mix scripts and the top-k drifts each run.
- Accents/diacritics or fullwidth/halfwidth forms break matching.
- RTL punctuation or invisible marks flip token order.
- Token counts jump after deploy even though data did not change.
## Open these first
- Visual recovery map: RAG Architecture & Recovery
- Retrieval knobs end-to-end: Retrieval Playbook
- Traceability and snippet schema: Retrieval Traceability · Data Contracts
- Embedding vs meaning: Embedding ≠ Semantic
- Metric and normalization: Metric Mismatch · Normalization & Scaling
- OCR confusables and hyphens: OCR Parsing — Checklist
## Quick routes to per-page guides
- Tokenizer mismatch across languages → tokenizer_mismatch.md
- Script mixing in one query (CJK + Latin, etc.) → script_mixing.md
- Locale drift and analyzer skew (prod vs dev) → locale_drift.md
- Normalization and casing policy (NFKC, lowercasing, accent fold) → normalization_and_casing.md
- CJK/Indic segmentation and RTL direction marks → segmentation_and_rtl.md
- Fullwidth vs halfwidth, punctuation variants → fullwidth_halfwidth.md
- Diacritics and accent folding policy → diacritics_and_folding.md
- Transliteration and romanization traps → transliteration.md
- Stopwords and analyzer mismatch in stores → analyzer_stopwords.md
- Code-switch detection and reranking policy → code_switching.md
The MVP set is the first six pages. The rest are recommended additions when your traffic is mixed-locale or search-heavy.
## Acceptance targets
- ΔS(question, retrieved) ≤ 0.45 on three paraphrases
- Coverage of target section ≥ 0.70
- λ remains convergent across two seeds
- Tokenization variance for the same query ≤ 12% across environments
- Normalization pass rate (NFKC + width + diacritics) ≥ 0.98
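The normalization pass rate above can be measured directly. Below is a minimal sketch, assuming a plain-Python check where "normalized" means the text is already in NFKC form (which also folds fullwidth to halfwidth); the function names are hypothetical, and accent folding beyond NFKC is a policy choice you would add on top.

```python
import unicodedata

def is_normalized(text: str) -> bool:
    """True if the text already satisfies the baseline policy:
    NFKC form, which also collapses fullwidth/halfwidth variants.
    Accent folding (café -> cafe) is policy-specific and not checked here."""
    return text == unicodedata.normalize("NFKC", text)

def pass_rate(corpus: list) -> float:
    """Fraction of strings that need no further normalization."""
    if not corpus:
        return 1.0
    return sum(is_normalized(t) for t in corpus) / len(corpus)

# Fullwidth "hello" fails the check; plain ASCII and composed accents pass.
sample = ["ｈｅｌｌｏ", "hello", "café"]
rate = pass_rate(sample)
```

Run this over a sample of ingested chunks and live queries separately; a gap between the two rates usually means the ingestion job and the query path apply different policies.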
## Map symptoms → structural fixes

- Wrong-meaning hits despite high similarity.
  → embedding-vs-semantic.md
- High similarity drops when you switch locales or analyzers.
  → metric_mismatch.md · normalization_and_scaling.md
- CJK tokens split differently between dev and prod.
  → tokenizer_mismatch.md · locale_drift.md
- Mixed scripts in one query derail ranking order.
  → script_mixing.md · code_switching.md · rerankers.md
- Fullwidth punctuation or RTL marks break citations.
  → fullwidth_halfwidth.md · Retrieval Traceability
- "Looks identical" but fails to match after OCR.
  → OCR Parsing — Checklist
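The "RTL marks break citations" symptom usually comes from invisible bidi control characters embedded in snippet IDs or anchors. A minimal sketch of stripping them before exact-match comparison, assuming plain Python (the function name is hypothetical):

```python
# Bidi control characters: invisible, but they break byte-equal citation matching.
BIDI_CONTROLS = {
    "\u200e", "\u200f",                      # LRM, RLM
    "\u202a", "\u202b", "\u202c",            # LRE, RLE, PDF
    "\u202d", "\u202e",                      # LRO, RLO
    "\u2066", "\u2067", "\u2068", "\u2069",  # LRI, RLI, FSI, PDI
}

def strip_bidi(text: str) -> str:
    """Remove invisible direction marks so snippet IDs compare byte-equal."""
    return "".join(ch for ch in text if ch not in BIDI_CONTROLS)

# A right-to-left mark hidden in a citation key no longer defeats the match.
clean = strip_bidi("section\u200f-12")
```

Apply this on both sides of the comparison (stored anchor and generated citation), never on the rendered text itself, since the marks are needed for correct RTL display.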
## Fix in 60 seconds

1. Normalize once, up front.
   Apply NFKC, collapse fullwidth to halfwidth where appropriate, and unify the diacritics policy. Lock this into both the ingestion job and the query path.
2. Match tokenizer and analyzer.
   Use the same segmenter for CJK/Indic in both the embedding pipeline and the store analyzers. Document exact versions in your data contract.
3. Stabilize mixed-script queries.
   Detect code-switching, split the query by script, run per-script retrieval, then rerank deterministically.
4. Verify.
   Compute ΔS on three paraphrases, check coverage ≥ 0.70, and ensure λ stays convergent across two seeds.
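The split-by-script step above can be sketched as follows. This is a minimal illustration assuming plain Python and Unicode character names as a rough script signal; the function name and bucket labels are hypothetical, and a production system would use a proper language-ID or script-run library instead.

```python
import unicodedata

def split_by_script(query: str) -> dict:
    """Group characters into rough script buckets so each bucket can be
    retrieved with its own analyzer, then merged by a deterministic reranker."""
    buckets = {}
    for ch in query:
        if ch.isspace():
            continue
        name = unicodedata.name(ch, "")
        if name.startswith("CJK"):
            script = "cjk"
        elif name.startswith(("HIRAGANA", "KATAKANA")):
            script = "kana"
        elif name.startswith("HANGUL"):
            script = "hangul"
        elif ch.isascii():
            script = "latin"
        else:
            script = "other"
        buckets.setdefault(script, []).append(ch)
    return {s: "".join(chs) for s, chs in buckets.items()}

# A Japanese + English query yields one sub-query per script.
parts = split_by_script("機械学習 transformer 論文")
```

Note this coarse version concatenates same-script runs and so loses word order across runs; it is enough to route each sub-query to the right analyzer, which is the point of the step.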
## Copy-paste prompt for your LLM step

```txt
You have TXT OS and the WFGY Problem Map.

My multilingual bug:
- symptom: [one line]
- traces: ΔS(question,retrieved)=..., ΔS(retrieved,anchor)=..., λ states
- notes: tokenizer/analyzer versions, normalization policy, scripts seen

Tell me:
1. which layer is failing and why,
2. the exact WFGY page to open from this repo,
3. minimal steps to push ΔS ≤ 0.45 and keep λ convergent,
4. a reproducible test to verify.
Use BBMC/BBCR/BBPF/BBAM when relevant.
```
## 🔗 Quick-Start Downloads (60 sec)
| Tool | Link | 3-Step Setup |
|---|---|---|
| WFGY 1.0 PDF | Engine Paper | 1️⃣ Download · 2️⃣ Upload to your LLM · 3️⃣ Ask “Answer using WFGY + <your question>” |
| TXT OS (plain-text OS) | TXTOS.txt | 1️⃣ Download · 2️⃣ Paste into any LLM chat · 3️⃣ Type “hello world” — OS boots instantly |
## 🧭 Explore More
| Module | Description | Link |
|---|---|---|
| WFGY Core | WFGY 2.0 engine is live: full symbolic reasoning architecture and math stack | View → |
| Problem Map 1.0 | Initial 16-mode diagnostic and symbolic fix framework | View → |
| Problem Map 2.0 | RAG-focused failure tree, modular fixes, and pipelines | View → |
| Semantic Clinic Index | Expanded failure catalog: prompt injection, memory bugs, logic drift | View → |
| Semantic Blueprint | Layer-based symbolic reasoning & semantic modulations | View → |
| Benchmark vs GPT-5 | Stress test GPT-5 with full WFGY reasoning suite | View → |
| 🧙‍♂️ Starter Village 🏡 | New here? Lost in symbols? Click here and let the wizard guide you through | Start → |
👑 Early Stargazers: see the Hall of Fame — engineers, hackers, and open-source builders who supported WFGY from day one.

⭐ WFGY Engine 2.0 is already unlocked. ⭐ Star the repo to help others discover it and unlock more on the Unlock Board.