vrr/WFGY

mirror of https://github.com/onestardao/WFGY.git synced 2026-04-28 11:40:07 +00:00

PSBigBig 198ac74bb9 Update README.md		2025-09-03 23:51:24 +08:00

Language & Locale · Global Fix Map

🏥 Quick Return to Emergency Room

You are at a specialist desk.
For full triage and doctors on duty, return here:

WFGY Global Fix Map — main Emergency Room, 300+ structured fixes

WFGY Problem Map 1.0 — 16 reproducible failure modes

Think of this page as a sub-room.
If you want full consultation and prescriptions, go back to the Emergency Room lobby.

Stabilize multilingual RAG and reasoning across CJK, RTL, Indic, Latin, emoji, and locale variants.
This hub localizes language-layer failures and routes you to the exact structural fix. No infra change required.

What this page is

A compact language-aware repair guide for retrieval → ranking → reasoning.
Structural fixes with measurable acceptance targets.
Store-agnostic. Works with FAISS, Redis, pgvector, Elastic, Weaviate, Milvus, and more.

When to use

Corpus spans CJK or Indic scripts and retrieval keeps missing the correct section.
Queries code-switch or mix scripts, and top-k order drifts across runs.
Accents/diacritics or fullwidth/halfwidth forms break matching or citations.
RTL punctuation or control chars flip token order or offsets.
Token counts jump after deploy even though data did not change.

Open these first

Visual recovery map → rag-architecture-and-recovery.md
Retrieval knobs end-to-end → retrieval-playbook.md
Traceability and snippet schema → retrieval-traceability.md · data-contracts.md
Embedding vs meaning → embedding-vs-semantic.md
Metric and normalization → metric_mismatch.md · normalization_and_scaling.md
OCR confusables and hyphens → OCR_Parsing README

Quick routes to per-page guides

Topic	Page
Tokenizer mismatch across languages	tokenizer_mismatch.md
Script mixing in a single query	script_mixing.md
Locale drift and analyzer skew	locale_drift.md
Unicode normalization policy	unicode_normalization.md
CJK segmentation and word-break	cjk_segmentation_wordbreak.md
Fullwidth vs halfwidth, punctuation variants	digits_width_punctuation.md
Diacritics folding rules	diacritics_and_folding.md
RTL and bidi control characters	rtl_bidi_control.md
Transliteration and romanization	transliteration_and_romanization.md
Collation and stable sort keys	locale_collation_and_sorting.md
Numbering systems and sort orders	numbering_and_sort_orders.md
Date and time format variants	date_time_format_variants.md
Time zones and DST stability	timezones_and_dst.md
Keyboard IMEs and composition	keyboard_input_methods.md
Input language switching guards	input_language_switching.md
Emoji, ZWJ, grapheme clusters	emoji_zwj_grapheme_clusters.md
Mixed-locale metadata fields	mixed_locale_metadata.md

Acceptance targets

ΔS(question, retrieved) ≤ 0.45 on three paraphrases
Coverage of target section ≥ 0.70
λ remains convergent across two seeds
Tokenization variance for the same query ≤ 12% across environments
Normalization pass rate for NFKC + width + diacritics ≥ 0.98
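
The last two targets can be checked with a few lines of code. The sketch below is only one possible reading of them: it assumes "tokenization variance" means the relative spread of token counts for the same query across environments, and "normalization pass rate" means the share of stored strings that are already in NFKC form. Both function names are hypothetical and not part of WFGY.

```python
import unicodedata


def token_count_spread(counts):
    """Relative spread of token counts: (max - min) / mean.

    Assumed definition of "tokenization variance"; adjust if your
    pipeline defines the metric differently.
    """
    if not counts:
        raise ValueError("need at least one token count")
    mean = sum(counts) / len(counts)
    return (max(counts) - min(counts)) / mean if mean else 0.0


def nfkc_pass_rate(strings):
    """Share of strings that are already in NFKC form (assumed definition)."""
    if not strings:
        return 1.0
    ok = sum(1 for s in strings if unicodedata.is_normalized("NFKC", s))
    return ok / len(strings)


# Token counts for one query, measured in three environments (made-up numbers).
counts_by_env = {"dev": 42, "staging": 42, "prod": 45}
spread = token_count_spread(list(counts_by_env.values()))
print(f"spread={spread:.3f} within_target={spread <= 0.12}")
```

`unicodedata.is_normalized` requires Python 3.8 or later. The width and diacritics parts of the target would need their own checks on top of the NFKC test.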

Fix in 60 seconds

Normalize once, up front → Apply NFKC, collapse fullwidth/halfwidth, unify diacritics.
Match tokenizer and analyzer → Use the same segmenter for CJK/Indic text at embedding time and in the store's analyzers.
Stabilize mixed-script queries → Detect code-switch, split per script, rerank deterministically.
Verify → ΔS ≤ 0.45, Coverage ≥ 0.70, λ convergent across two seeds.
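
The "normalize once, up front" step can be sketched with the Python standard library. This is a minimal illustration, not the WFGY implementation: `normalize_text` is a hypothetical helper, and the diacritics folding shown here simply drops every combining mark, which is lossy and can be wrong for scripts where marks carry meaning (for example Japanese voiced-sound marks). Treat the folding step as optional and test it per language.

```python
import unicodedata


def normalize_text(text, fold_diacritics=True):
    """Apply NFKC, then optionally strip combining marks.

    NFKC already maps fullwidth Latin letters and digits, and halfwidth
    katakana, to their standard forms, so no separate width pass is
    needed in this sketch.
    """
    text = unicodedata.normalize("NFKC", text)
    if fold_diacritics:
        # Decompose so accents become separate combining marks,
        # drop the marks, then recompose what is left.
        decomposed = unicodedata.normalize("NFD", text)
        stripped = "".join(
            ch for ch in decomposed if not unicodedata.combining(ch)
        )
        text = unicodedata.normalize("NFC", stripped)
    return text


print(normalize_text("\uff21\uff22\uff23\uff11\uff12\uff13"))  # fullwidth input
print(normalize_text("caf\u00e9"))
```

Run the same function on documents at ingest time and on queries at search time, so both sides of the match see identical forms.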

FAQ (Beginner-Friendly)

Q1: Why do answers break when I mix English and Chinese in one query?
A: Tokenizers and store analyzers often handle each script differently. Without alignment, Chinese words get split incorrectly and English tokens dominate. Fix with script_mixing.md and tokenizer_mismatch.md.
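
As a rough illustration of splitting a code-switched query per script, the sketch below groups characters into runs using a few hard-coded code point ranges. The ranges, the `script_of` and `split_by_script` names, and the rule that spaces, digits, and punctuation stay with the current run are all assumptions made for this example. A production system would more likely rely on a proper script or language detector.

```python
def script_of(ch):
    """Very coarse script guess based on code point ranges (assumption)."""
    cp = ord(ch)
    if 0x4E00 <= cp <= 0x9FFF or 0x3400 <= cp <= 0x4DBF:
        return "han"
    if 0x3040 <= cp <= 0x30FF:
        return "kana"
    if ch.isascii() and ch.isalpha():
        return "latin"
    return "other"  # digits, spaces, punctuation, everything else


def split_by_script(text):
    """Split text into (script, substring) runs, in original order."""
    runs = []
    for ch in text:
        script = script_of(ch)
        if script == "other" and runs:
            # Neutral characters stay attached to the current run.
            runs[-1][1].append(ch)
        elif runs and runs[-1][0] == script:
            runs[-1][1].append(ch)
        else:
            runs.append([script, [ch]])
    joined = [(script, "".join(chars).strip()) for script, chars in runs]
    return [(script, part) for script, part in joined if part]


for script, part in split_by_script("reset 密码 for admin"):
    print(script, repr(part))
```

Because the runs come back in their original order, each one can be sent to the retriever separately and the results merged with a deterministic tie-break.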

Q2: What does “locale drift” mean?
A: Locale drift happens when environments use different analyzers (e.g., zh_TW vs zh_CN) so the same query splits differently. See locale_drift.md.

Q3: Why do “identical-looking” characters not match?
A: They may differ in width (fullwidth vs halfwidth), normalization form (NFKC vs NFD), or diacritics. Apply the policies in unicode_normalization.md and digits_width_punctuation.md to both documents and queries.
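
A quick way to see why two strings that look the same fail to match is to print their code points. The helper below is a small diagnostic sketch (the name `describe` is made up for this example) built on the standard `unicodedata` module.

```python
import unicodedata


def describe(text):
    """List each character as 'U+XXXX NAME' for side-by-side comparison."""
    return [
        f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}" for ch in text
    ]


composed = "\u00e9"      # single precomposed character
decomposed = "e\u0301"   # letter e followed by a combining accent

print(composed == decomposed)  # the raw strings are not equal
print(describe(composed))
print(describe(decomposed))
print(describe("\uff21"))      # fullwidth capital A
```

If the two listings differ, the mismatch comes from the encoding of the text rather than from the retriever.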

Q4: How do I handle Arabic or Hebrew text?
A: RTL text often carries invisible bidi control characters that can flip token order or shift offsets. See rtl_bidi_control.md.
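
One way to make these invisible characters visible is to scan for the Unicode bidi control code points and report or strip them before indexing. The sketch below hard-codes that set (ALM, LRM, RLM, the embedding and override controls U+202A to U+202E, and the isolate controls U+2066 to U+2069). The function names are hypothetical, and whether to strip or only flag these characters is a policy choice that this example does not settle.

```python
# Code points with the Unicode Bidi_Control property.
BIDI_CONTROLS = {
    0x061C,                  # ARABIC LETTER MARK
    0x200E, 0x200F,          # LRM, RLM
    *range(0x202A, 0x202F),  # LRE, RLE, PDF, LRO, RLO
    *range(0x2066, 0x206A),  # LRI, RLI, FSI, PDI
}


def find_bidi_controls(text):
    """Return (offset, 'U+XXXX') for each bidi control character found."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if ord(ch) in BIDI_CONTROLS
    ]


def strip_bidi_controls(text):
    """Remove bidi control characters; note that this shifts offsets."""
    return "".join(ch for ch in text if ord(ch) not in BIDI_CONTROLS)


sample = "\u202b\u05e9\u05dc\u05d5\u05dd\u202c"  # Hebrew word wrapped in RLE ... PDF
print(find_bidi_controls(sample))
print(len(sample), len(strip_bidi_controls(sample)))
```

Since stripping changes character offsets, citations should be computed against the same version of the text that was indexed.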

Q5: Do I need different embeddings for each language?
A: No. You can combine multilingual embeddings with deterministic normalization and alias fields. Use fallback translation bridges only if that fails.

Q6: How do I debug when results change between environments?
A: Compare tokenizer version, analyzer settings, normalization passes, and collation rules. Document them in data-contracts.md.
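
A lightweight way to compare environments is to record a small fingerprint in each one and diff the results. The sketch below collects only values available from the Python standard library. Tokenizer version, analyzer settings, and collation rules depend on your stack, so they are passed in through the hypothetical `extra` argument rather than detected automatically.

```python
import json
import locale
import platform
import sys
import unicodedata


def environment_fingerprint(extra=None):
    """Collect language-layer settings that commonly differ across deploys."""
    try:
        current_locale = locale.getlocale()
    except ValueError:
        current_locale = None  # some platforms report an unknown locale
    info = {
        "python": platform.python_version(),
        "unicode_data": unicodedata.unidata_version,
        "locale": current_locale,
        "default_encoding": sys.getdefaultencoding(),
    }
    info.update(extra or {})  # e.g. tokenizer version, analyzer name
    return info


def diff_fingerprints(a, b):
    """Return {key: (value_in_a, value_in_b)} for every differing key."""
    keys = sorted(set(a) | set(b))
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}


# Placeholder values: fill "extra" from your own tokenizer and store config.
here = environment_fingerprint({"tokenizer": "example-1.0"})
print(json.dumps(here, indent=2, default=str))
```

Saving this output next to each deployment gives a concrete record to diff when results start to drift.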

🔗 Quick-Start Downloads (60 sec)

Tool	Link	3-Step Setup
WFGY 1.0 PDF	Engine Paper	1️⃣ Download · 2️⃣ Upload to your LLM · 3️⃣ Ask “Answer using WFGY + <your question>”
TXT OS (plain-text OS)	TXTOS.txt	1️⃣ Download · 2️⃣ Paste into any LLM chat · 3️⃣ Type “hello world” — OS boots instantly

🧭 Explore More

Module	Description	Link
WFGY Core	WFGY 2.0 engine is live: full symbolic reasoning architecture and math stack	View →
Problem Map 1.0	Initial 16-mode diagnostic and symbolic fix framework	View →
Problem Map 2.0	RAG-focused failure tree, modular fixes, and pipelines	View →
Semantic Clinic Index	Expanded failure catalog: prompt injection, memory bugs, logic drift	View →
Semantic Blueprint	Layer-based symbolic reasoning & semantic modulations	View →
Benchmark vs GPT-5	Stress test GPT-5 with full WFGY reasoning suite	View →
🧙‍♂️ Starter Village 🏡	New here? Lost in symbols? Click here and let the wizard guide you through	Start →

👑 Early Stargazers: see the Hall of Fame, which lists the engineers, hackers, and open source builders who supported WFGY from day one.

⭐ WFGY Engine 2.0 is already unlocked. ⭐ Star the repo to help others discover it and unlock more on the Unlock Board.
