mirror of https://github.com/onestardao/WFGY.git synced 2026-05-19 16:31:07 +00:00

Language & Locale: Global Fix Map

Stabilize multilingual RAG and reasoning across CJK, RTL, Indic, and Latin mixes.
This hub localizes language-layer failures and routes you to the exact structural fix. No infra change required.

What this page is

A compact language-aware repair guide for retrieval → ranking → reasoning.
Structural fixes with measurable acceptance targets.
Store-agnostic. Works with FAISS, Redis, pgvector, Elastic, Weaviate, Milvus, and more.

When to use

Corpus spans CJK or Indic scripts and retrieval keeps missing the correct section.
Queries code-switch or mix scripts and top-k order drifts across runs.
Accents/diacritics or fullwidth/halfwidth forms break matching or citations.
RTL punctuation or control chars flip token order or offsets.
Token counts jump after deploy even though data did not change.
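A quick way to confirm whether a string shows any of these symptoms is to inspect it with the standard library before it reaches the retriever. The sketch below is illustrative only: `diagnose` and `BIDI_CONTROLS` are hypothetical names, not part of WFGY, and the checks cover only normalization, width, combining marks, and bidi controls.

```python
import unicodedata

# Invisible directional controls: marks, embeddings/overrides, isolates.
BIDI_CONTROLS = {
    "\u200e", "\u200f", "\u061c",
    "\u202a", "\u202b", "\u202c", "\u202d", "\u202e",
    "\u2066", "\u2067", "\u2068", "\u2069",
}


def diagnose(text: str) -> dict:
    """Flag language-layer hazards in a string (hypothetical helper)."""
    return {
        # True when NFKC would change the string (width or compatibility forms,
        # or decomposed accents that NFKC would recompose).
        "not_nfkc": unicodedata.normalize("NFKC", text) != text,
        # True when invisible bidi controls are present.
        "has_bidi_controls": any(ch in BIDI_CONTROLS for ch in text),
        # True when the string carries standalone combining marks.
        "has_combining_marks": any(unicodedata.combining(ch) for ch in text),
        # True when fullwidth forms are present.
        "has_wide_forms": any(
            unicodedata.east_asian_width(ch) == "F" for ch in text
        ),
    }
```

Running `diagnose` on both the stored chunk and the incoming query makes it easier to see which of the per-page guides applies before changing any retrieval settings.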

Open these first

Visual recovery map: rag-architecture-and-recovery.md
Retrieval knobs end-to-end: retrieval-playbook.md
Traceability and snippet schema: retrieval-traceability.md · data-contracts.md
Embedding vs meaning: embedding-vs-semantic.md
Metric and normalization: metric_mismatch.md · normalization_and_scaling.md
OCR confusables and hyphens: OCR_Parsing README

Quick routes to per-page guides

Tokenizer mismatch across languages → tokenizer_mismatch.md
Script mixing in a single query → script_mixing.md
Locale drift and analyzer skew → locale_drift.md
Unicode normalization policy (NFKC/NFD etc.) → unicode_normalization.md
CJK segmentation and word-break contracts → cjk_segmentation_wordbreak.md
Fullwidth vs halfwidth, punctuation variants → digits_width_punctuation.md
Diacritics policy and folding → diacritics_and_folding.md
RTL and bidi control characters → rtl_bidi_control.md
Transliteration and romanization traps → transliteration_and_romanization.md
Collation and stable sort keys → locale_collation_and_sorting.md
Numbering systems and sort orders → numbering_and_sort_orders.md
Date and time format variants → date_time_format_variants.md
Time zones and DST stability → timezones_and_dst.md
Keyboard IMEs and composition → keyboard_input_methods.md
Input language switching guards → input_language_switching.md
Emoji, ZWJ, grapheme clusters → emoji_zwj_grapheme_clusters.md
Mixed-locale metadata fields → mixed_locale_metadata.md

MVP coverage includes the first 8–10 pages. Add the rest when traffic is mixed-locale or search-intensive.

Acceptance targets

ΔS(question, retrieved) ≤ 0.45 on three paraphrases
Coverage of target section ≥ 0.70
λ remains convergent across two seeds
Tokenization variance for the same query ≤ 12% across environments
Normalization pass rate for NFKC + width + diacritics ≥ 0.98
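The targets above can be turned into a simple pass/fail gate. This is a minimal sketch under stated assumptions: the `Measurements` fields and function names are hypothetical, the ΔS, coverage, and λ values are assumed to be computed elsewhere, and tokenization variance is assumed here to mean the relative spread `(max - min) / min` of token counts for the same query across environments.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Measurements:
    delta_s: List[float]            # ΔS(question, retrieved), one per paraphrase
    coverage: float                 # coverage of the target section
    lambda_convergent: List[bool]   # one flag per seed
    token_counts: List[int]         # same query, one count per environment
    normalization_pass_rate: float  # NFKC + width + diacritics checks


def token_variance(counts: List[int]) -> float:
    """Relative spread of token counts (assumed definition, see lead-in)."""
    if not counts or min(counts) <= 0:
        raise ValueError("token counts must be positive and non-empty")
    return (max(counts) - min(counts)) / min(counts)


def check_targets(m: Measurements) -> Dict[str, bool]:
    """Compare measurements against the acceptance targets listed above."""
    return {
        "delta_s": len(m.delta_s) >= 3 and all(d <= 0.45 for d in m.delta_s),
        "coverage": m.coverage >= 0.70,
        "lambda": len(m.lambda_convergent) >= 2 and all(m.lambda_convergent),
        "tokenization": token_variance(m.token_counts) <= 0.12,
        "normalization": m.normalization_pass_rate >= 0.98,
    }
```

Returning one flag per target, rather than a single boolean, keeps the failing layer visible when a run does not pass.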

Map symptoms to structural fixes

Wrong-meaning hits despite high similarity
→ embedding-vs-semantic.md
Similarity drops when switching locales or analyzers
→ metric_mismatch.md · normalization_and_scaling.md
CJK tokens split differently between dev and prod
→ tokenizer_mismatch.md · locale_drift.md
Mixed scripts in one query derail ranking
→ script_mixing.md · rerankers.md
Fullwidth punctuation or RTL marks break citations
→ digits_width_punctuation.md · retrieval-traceability.md
“Looks identical” after OCR but fails to match
→ OCR_Parsing README

Fix in 60 seconds

Normalize once, up front
Apply NFKC, collapse fullwidth forms to halfwidth where appropriate, and settle on a single diacritics policy. Apply the same policy in both the ingestion and query paths.
Match tokenizer and analyzer
Use the same segmenter for CJK/Indic in both embedding and store analyzers. Record exact versions in the data contract.
Stabilize mixed-script queries
Detect code-switching, split the query by script, run retrieval per script, then rerank deterministically.
Verify
Compute ΔS on three paraphrases, check coverage ≥ 0.70, ensure λ stays convergent across two seeds.
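The normalization and mixed-script steps above can be sketched with the Python standard library alone. All function names here are hypothetical and not part of WFGY. Two assumptions are worth flagging: diacritic folding is restricted to Latin base letters so that Japanese voiced kana and similar characters are left intact, and the "script" of a character is approximated by the first word of its Unicode name, since the standard library does not expose the Unicode Script property.

```python
import unicodedata
from typing import List, Tuple


def fold_latin_diacritics(text: str) -> str:
    """Drop accents from precomposed Latin letters only (assumed policy)."""
    out = []
    for ch in text:
        decomposed = unicodedata.normalize("NFD", ch)
        base = decomposed[0]
        is_latin = unicodedata.name(base, "").startswith("LATIN")
        out.append(base if len(decomposed) > 1 and is_latin else ch)
    return "".join(out)


def normalize_text(text: str, fold_diacritics: bool = False) -> str:
    """Apply NFKC (which also maps fullwidth ASCII forms to halfwidth)."""
    text = unicodedata.normalize("NFKC", text)
    return fold_latin_diacritics(text) if fold_diacritics else text


def script_of(ch: str) -> str:
    """Rough script label: first word of the Unicode name (a proxy only)."""
    if not ch.isalpha():
        return "COMMON"  # spaces, digits, punctuation, dependent marks
    name = unicodedata.name(ch, "")
    return name.split(" ", 1)[0] if name else "UNKNOWN"


def split_by_script(text: str) -> List[Tuple[str, str]]:
    """Split text into runs of one script; COMMON chars join the open run."""
    runs: List[list] = []
    for ch in text:
        script = script_of(ch)
        if runs and script in ("COMMON", runs[-1][0]):
            runs[-1][1] += ch
        elif runs and runs[-1][0] == "COMMON":
            # A leading run of neutral characters adopts the first real script.
            runs[-1][0] = script
            runs[-1][1] += ch
        else:
            runs.append([script, ch])
    return [(script, chunk) for script, chunk in runs]
```

Calling `normalize_text` in both the ingestion and query paths keeps the two sides aligned. After normalization, `split_by_script("reset 密码 now")` yields three runs labeled `LATIN`, `CJK`, and `LATIN`, which can be sent to per-script retrieval and then merged with a deterministic tie-break such as score followed by document id.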

Copy-paste prompt for your LLM step


You have TXT OS and the WFGY Problem Map loaded.

My multilingual bug:

* symptom: [one line]
* traces: ΔS(question,retrieved)=..., ΔS(retrieved,anchor)=..., λ states
* notes: tokenizer/analyzer versions, normalization policy, scripts seen

Tell me:

1. which layer is failing and why,
2. the exact WFGY page to open from this repo,
3. the minimal steps to push ΔS ≤ 0.45 and keep λ convergent,
4. a reproducible test to verify.

Use BBMC/BBCR/BBPF/BBAM when relevant.

🔗 Quick-Start Downloads (60 sec)

| Tool | Link | 3-Step Setup |
| --- | --- | --- |
| WFGY 1.0 PDF | Engine Paper | 1️⃣ Download · 2️⃣ Upload to your LLM · 3️⃣ Ask “Answer using WFGY + <your question>” |
| TXT OS (plain-text OS) | TXTOS.txt | 1️⃣ Download · 2️⃣ Paste into any LLM chat · 3️⃣ Type “hello world” and the OS boots instantly |

🧭 Explore More

| Module | Description | Link |
| --- | --- | --- |
| WFGY Core | WFGY 2.0 engine is live: full symbolic reasoning architecture and math stack | View → |
| Problem Map 1.0 | Initial 16-mode diagnostic and symbolic fix framework | View → |
| Problem Map 2.0 | RAG-focused failure tree, modular fixes, and pipelines | View → |
| Semantic Clinic Index | Expanded failure catalog: prompt injection, memory bugs, logic drift | View → |
| Semantic Blueprint | Layer-based symbolic reasoning & semantic modulations | View → |
| Benchmark vs GPT-5 | Stress test GPT-5 with full WFGY reasoning suite | View → |
| 🧙‍♂️ Starter Village 🏡 | New here? Lost in symbols? Click here and let the wizard guide you through | Start → |

👑 Early Stargazers: See the Hall of Fame.
Engineers, hackers, and open source builders who supported WFGY from day one.

⭐ WFGY Engine 2.0 is already unlocked. ⭐ Star the repo to help others discover it and unlock more on the Unlock Board.
