mirror of https://github.com/onestardao/WFGY.git synced 2026-05-22 03:02:03 +00:00

Language & Multilingual · Global Fix Map

A compact hub to stabilize cross-lingual retrieval and reasoning.
Use this folder when your corpus or queries include CJK, RTL, Indic, Cyrillic, accented Latin, or frequent code-switching. No infra change required.

Orientation — pages and what they solve

Page	What it solves	Typical symptom
tokenizer_mismatch.md	Locks tokenization and segmentation for CJK/Thai/Indic	High similarity but low recall on CJK/Thai; broken tokens
script_mixing.md	Handles single queries that mix scripts and get split across analyzers	Mixed Latin+CJK queries under-recall or flip
locale_drift.md	Normalization for width/accents/variants (Hans↔Hant)	zh-Hans/zh-Hant never co-retrieve; accent variants miss
multilingual_guide.md	End-to-end recipes and acceptance targets	Unsure where drift comes from across languages
proper_noun_aliases.md	Alias shield for names, brands, products	Proper nouns oscillate across spellings
romanization_transliteration.md	Romanization pairs and transliteration consistency	Inconsistent transliteration causes misses
query_language_detection.md	Stable language detection contract	Detection flips per run; routing becomes random
query_routing_and_analyzers.md	Route analyzers per language and keep parity with the index	Search and index behave differently
hybrid_ranking_multilingual.md	Deterministic hybrid rerank across languages	Multilingual ranking unstable; hybrid underperforms a single retriever
stopword_and_morphology_controls.md	Clamp stopwords/lemmatizers to protect meaning	Negations/particles vanish; unit words lost
fallback_translation_and_glossary_bridge.md	Controlled translation bridge with glossary	Local path ΔS stays high; glossary needed
code_switching_eval.md	Bilingual & code-switch eval sets + checks	Cannot prove multilingual stability before ship

When to use this folder

High similarity yet wrong meaning on bilingual or mixed-script corpora
Citations point to the wrong section after translating the question
Hybrid retrievers underperform a single retriever across languages
Index looks healthy while coverage stays low for non-Latin scripts
Names flip between native, transliteration, and English aliases
zh-Hans and zh-Hant never co-retrieve; Thai recall drops for no apparent reason

Acceptance targets

ΔS(question, retrieved) ≤ 0.45 across language variants
Coverage ≥ 0.70 to the intended section after repair
λ_observe convergent across 3 paraphrases and 2 seeds
E_resonance flat on long windows that mix scripts
Citation fields complete; alias noise does not leak into evidence
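
The numeric targets above can be turned into a simple pass/fail gate. The sketch below is a minimal illustration, not part of WFGY itself: the `RunMetrics` fields, the `"convergent"` label, and the function name are all hypothetical, and it assumes ΔS, coverage, and λ_observe are measured upstream by your own tooling. E_resonance and citation completeness are not checked here.

```python
from dataclasses import dataclass
from typing import Sequence

DELTA_S_MAX = 0.45   # threshold taken from the targets above
COVERAGE_MIN = 0.70  # threshold taken from the targets above


@dataclass(frozen=True)
class RunMetrics:
    """Metrics for one (paraphrase, seed) run. Field names are hypothetical."""
    delta_s: float     # ΔS(question, retrieved), measured upstream
    coverage: float    # share of the intended section that was retrieved
    lambda_state: str  # λ_observe label reported by your own tooling


def passes_acceptance(runs: Sequence[RunMetrics],
                      convergent_label: str = "convergent") -> bool:
    """Return True only if every run meets all three checks."""
    if not runs:
        return False  # no evidence is treated as a failure
    return all(
        r.delta_s <= DELTA_S_MAX
        and r.coverage >= COVERAGE_MIN
        and r.lambda_state == convergent_label
        for r in runs
    )
```

With 3 paraphrases and 2 seeds, `runs` would hold six entries per language variant; a single run over the ΔS threshold fails the whole gate.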

Map symptoms → structural fixes

Symptom	Likely cause	Open this
High similarity yet wrong meaning	Embedding not multilingual or pre-normalization mismatch	embedding-vs-semantic.md
Citations jump sections after translation	Snippet schema too loose	data-contracts.md · retrieval-traceability.md
zh-Hans and zh-Hant never co-retrieve	Variant mapping and width rules missing	locale_drift.md
Thai or CJK recall collapses	Tokenizer mismatch or missing segmenter	tokenizer_mismatch.md
Mixed Latin + CJK query under-recalls	Analyzer split across scripts	script_mixing.md
Hybrid worse than single	Query parsing split or mis-weighted rerank	patterns/pattern_query_parsing_split.md · rerankers.md
Proper nouns oscillate	Missing alias fields and entity shield	proper_noun_aliases.md
Transliteration inconsistency	Romanization rules not aligned	romanization_transliteration.md
Language detection drifts	Detection contract weak or unlocked	query_language_detection.md
Search vs index disagree	Analyzer routing error	query_routing_and_analyzers.md
Ranking unstable across languages	Mono-lingual reranker or unaligned features	hybrid_ranking_multilingual.md
Negations/particles vanish	Stopword or morphology too aggressive	stopword_and_morphology_controls.md
Persistent high ΔS on local path	Need glossary-backed translation bridge	fallback_translation_and_glossary_bridge.md
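
For the "Hybrid worse than single" and "Ranking unstable across languages" rows, one common way to make fusion repeatable is reciprocal rank fusion (RRF) with a fixed tie-break. This is a generic technique chosen here for illustration, not necessarily what hybrid_ranking_multilingual.md prescribes; the function name and the `k = 60` default are assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def rrf_fuse(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists of document ids with reciprocal rank fusion."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Only the rank matters, so score scales from different
            # retrievers (lexical vs dense) never need to be aligned.
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score (descending), then by doc id, so equal scores
    # always come out in the same order from run to run.
    return sorted(scores, key=lambda d: (-scores[d], d))
```

Because the output depends only on ranks and ids, swapping the order in which the retrievers are passed in does not change the result.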

Fix in 60 seconds

1. Detect language
   Emit a stable language label and a confidence score. If detection is unstable, fix it first.
   → query_language_detection.md
2. Lock normalization and analyzers
   Keep locale, width, accents, and segmentation identical on write and read.
   → locale_drift.md · query_routing_and_analyzers.md
3. Protect entities and syntax
   Add alias fields and romanization pairs; clamp stopwords and morphology so negations and units survive.
   → proper_noun_aliases.md · romanization_transliteration.md · stopword_and_morphology_controls.md
4. Stabilize ranking
   Use a multilingual or dual-track rerank with deterministic ordering.
   → hybrid_ranking_multilingual.md
5. Add a translation bridge only if needed
   Pair it with a glossary and keep the native path as the default.
   → fallback_translation_and_glossary_bridge.md
6. Verify
   With bilingual and code-switch sets, confirm ΔS ≤ 0.45, Coverage ≥ 0.70, and λ convergent.
   → code_switching_eval.md
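
As a rough illustration of the "Detect language" step, the sketch below emits a label plus confidence that is identical on every run. It is a deliberate simplification: it detects the dominant script from Unicode character names rather than the language, and the prefix table, the `0.6` threshold, and the function names are assumptions made for this example. A real detection contract would sit on top of a proper language identifier.

```python
import unicodedata
from collections import Counter
from typing import Tuple

# Unicode character-name prefixes mapped to coarse script labels.
# Illustrative only; extend for the scripts in your corpus.
SCRIPT_PREFIXES = (
    ("CJK", "han"),
    ("HIRAGANA", "kana"),
    ("KATAKANA", "kana"),
    ("HANGUL", "hangul"),
    ("THAI", "thai"),
    ("DEVANAGARI", "devanagari"),
    ("ARABIC", "arabic"),
    ("HEBREW", "hebrew"),
    ("CYRILLIC", "cyrillic"),
    ("LATIN", "latin"),
)


def script_of(ch: str) -> str:
    """Map one character to a coarse script label via its Unicode name."""
    name = unicodedata.name(ch, "")
    for prefix, label in SCRIPT_PREFIXES:
        if name.startswith(prefix):
            return label
    return "other"


def detect_script(text: str, min_confidence: float = 0.6) -> Tuple[str, float]:
    """Return (label, confidence); no randomness, so repeated calls agree."""
    counts = Counter(script_of(ch) for ch in text if ch.isalpha())
    total = sum(counts.values())
    if total == 0:
        return ("unknown", 0.0)
    # Sort by count, then by label, so ties always resolve the same way.
    label, count = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[0]
    confidence = count / total
    return (label if confidence >= min_confidence else "mixed", confidence)
```

Returning `"mixed"` below the threshold gives the router an explicit signal to fall back to a script-mixing path instead of guessing.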

Store-agnostic quick recipes

Normalize the same way for corpus and queries before storing vectors
CJK/Thai require segmentation or bigrams; keep entity fields as keyword
If no multilingual embeddings, add a lexical sidecar and align features with a deterministic rerank
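
The first two recipes can be sketched with the Python standard library alone. This is a minimal example under stated assumptions: the function names are hypothetical, NFKC plus case folding stands in for "width" normalization, and accent stripping is off by default because removing combining marks damages scripts such as Thai, Devanagari, and Japanese kana. Hans↔Hant variant mapping is not covered; it needs a separate mapping table.

```python
import unicodedata
from typing import List


def normalize(text: str, strip_accents: bool = False) -> str:
    """Apply the same normalization to documents and to queries."""
    # NFKC folds full-width and half-width forms; casefold removes case.
    text = unicodedata.normalize("NFKC", text).casefold()
    if strip_accents:
        # Only intended for accented Latin text: decompose, drop
        # combining marks, then recompose.
        decomposed = unicodedata.normalize("NFD", text)
        text = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
        text = unicodedata.normalize("NFC", text)
    return text


def char_bigrams(text: str) -> List[str]:
    """Overlapping character bigrams for unsegmented scripts.

    Whitespace is dropped first; a single character is returned as-is.
    """
    chars = [ch for ch in text if not ch.isspace()]
    if len(chars) < 2:
        return ["".join(chars)] if chars else []
    return ["".join(chars[i:i + 2]) for i in range(len(chars) - 1)]
```

Calling `normalize` in both the indexing job and the query path is what keeps the two sides in parity; bigrams are a fallback for when no segmenter is available for the script.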

🔗 Quick-Start Downloads (60 sec)

Tool	Link	3-Step Setup
WFGY 1.0 PDF	Engine Paper	1️⃣ Download · 2️⃣ Upload to your LLM · 3️⃣ Ask “Answer using WFGY + <your question>”
TXT OS (plain-text OS)	TXTOS.txt	1️⃣ Download · 2️⃣ Paste into any LLM chat · 3️⃣ Type “hello world” — OS boots instantly

🧭 Explore More

Module	Description	Link
WFGY Core	WFGY 2.0 engine is live: full symbolic reasoning architecture and math stack	View →
Problem Map 1.0	Initial 16-mode diagnostic and symbolic fix framework	View →
Problem Map 2.0	RAG-focused failure tree, modular fixes, and pipelines	View →
Semantic Clinic Index	Expanded failure catalog: prompt injection, memory bugs, logic drift	View →
Semantic Blueprint	Layer-based symbolic reasoning & semantic modulations	View →
Benchmark vs GPT-5	Stress test GPT-5 with full WFGY reasoning suite	View →
🧙‍♂️ Starter Village 🏡	New here? Lost in symbols? Click here and let the wizard guide you through	Start →

