vrr/WFGY

mirror of https://github.com/onestardao/WFGY.git synced 2026-05-20 01:03:33 +00:00

Language & Multilingual · Global Fix Map

🏥 Quick Return to Emergency Room

You are at a specialist desk.
For full triage and doctors on duty, return here:

WFGY Global Fix Map — main Emergency Room, 300+ structured fixes

WFGY Problem Map 1.0 — 16 reproducible failure modes

Think of this page as a sub-room.
If you want full consultation and prescriptions, go back to the Emergency Room lobby.

A compact hub to stabilize cross-lingual retrieval and reasoning.
Use this folder when your corpus or queries include CJK, RTL, Indic, Cyrillic, accented Latin, or frequent code-switching. No infra change required.

Orientation — pages and what they solve

Page	What it solves	Typical symptom
tokenizer_mismatch.md	Locks tokenization and segmentation for CJK/Thai/Indic	High sim but low recall on CJK/Thai; broken tokens
script_mixing.md	Queries that carry mixed scripts and get split across analyzers	Mixed Latin+CJK queries under-recall or flip
locale_drift.md	Normalization for width/accents/variants (Hans↔Hant)	zh-Hans/zh-Hant never co-retrieve; accent variants miss
multilingual_guide.md	End-to-end recipes and acceptance targets	Unsure where drift comes from across languages
proper_noun_aliases.md	Alias shield for names, brands, products	Proper nouns oscillate across spellings
romanization_transliteration.md	Romanization pairs and transliteration consistency	Inconsistent transliteration causes misses
query_language_detection.md	Stable language detection contract	Detection flips per run; routing becomes random
query_routing_and_analyzers.md	Route analyzers per language + parity w/ index	Search vs index behave differently
hybrid_ranking_multilingual.md	Deterministic hybrid rerank across languages	Multilingual ranking unstable, hybrid < single
stopword_and_morphology_controls.md	Clamp stopwords/lemmatizers to protect meaning	Negations/particles vanish; unit words lost
fallback_translation_and_glossary_bridge.md	Controlled translation bridge with glossary	Local path ΔS stays high; glossary needed
code_switching_eval.md	Bilingual & code-switch eval sets + checks	Cannot prove multilingual stability before ship

When to use this folder

High similarity yet wrong meaning on bilingual or mixed-script corpora
Citations point to the wrong section after translating the question
Hybrid retrievers underperform a single retriever across languages
Index looks healthy while coverage stays low for non-Latin scripts
Names flip between native, transliteration, and English aliases
zh-Hans and zh-Hant never co-retrieve; Thai recall drops for no reason

Acceptance targets

ΔS(question, retrieved) ≤ 0.45 across language variants
Coverage ≥ 0.70 to the intended section after repair
λ_observe convergent across 3 paraphrases and 2 seeds
E_resonance flat on long windows that mix scripts
Citation fields complete; alias noise does not leak into evidence
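
The two numeric targets above can be checked mechanically once ΔS and coverage have been measured per language variant. The following is a minimal sketch only: it assumes ΔS and coverage are computed elsewhere, and the `VariantResult` type, the function name, and the sample numbers are hypothetical illustrations, not part of WFGY.

```python
from dataclasses import dataclass

DELTA_S_MAX = 0.45   # threshold taken from the targets above
COVERAGE_MIN = 0.70  # threshold taken from the targets above


@dataclass(frozen=True)
class VariantResult:
    """Hypothetical record: one measurement per language variant."""
    language: str    # e.g. "en", "zh-Hans"
    delta_s: float   # ΔS(question, retrieved), measured elsewhere
    coverage: float  # coverage of the intended section, measured elsewhere


def failing_variants(results):
    """Return the language variants that miss either threshold."""
    return [
        r.language
        for r in results
        if r.delta_s > DELTA_S_MAX or r.coverage < COVERAGE_MIN
    ]


# Made-up numbers, for illustration only.
results = [
    VariantResult("en", 0.31, 0.82),
    VariantResult("zh-Hans", 0.44, 0.71),
    VariantResult("th", 0.58, 0.40),
]
print(failing_variants(results))  # ['th']
```

An empty list means every variant meets both numeric targets; the λ_observe and E_resonance targets are not covered by this sketch.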

Map symptoms → structural fixes

Symptom	Likely cause	Open this
High similarity yet wrong meaning	Embedding not multilingual or pre-normalization mismatch	embedding-vs-semantic.md
Citations jump sections after translation	Snippet schema too loose	data-contracts.md · retrieval-traceability.md
zh-Hans and zh-Hant never co-retrieve	Variant mapping and width rules missing	locale_drift.md
Thai or CJK recall collapses	Tokenizer mismatch or missing segmenter	tokenizer_mismatch.md
Mixed Latin + CJK query under-recalls	Analyzer split across scripts	script_mixing.md
Hybrid worse than single	Query parsing split or mis-weighted rerank	patterns/pattern_query_parsing_split.md · rerankers.md
Proper nouns oscillate	Missing alias fields and entity shield	proper_noun_aliases.md
Transliteration inconsistency	Romanization rules not aligned	romanization_transliteration.md
Language detection drifts	Detection contract weak or unlocked	query_language_detection.md
Search vs index disagree	Analyzer routing error	query_routing_and_analyzers.md
Ranking unstable across languages	Mono-lingual reranker or unaligned features	hybrid_ranking_multilingual.md
Negations/particles vanish	Stopword or morphology too aggressive	stopword_and_morphology_controls.md
Persistent high ΔS on local path	Need glossary-backed translation bridge	fallback_translation_and_glossary_bridge.md

Fix in 60 seconds

1. Detect language
   Emit stable language + confidence. If unstable, fix detection first.
   → query_language_detection.md
2. Lock normalization and analyzers
   Keep locale, width, accents, and segmentation identical on write/read.
   → locale_drift.md · query_routing_and_analyzers.md
3. Protect entities and syntax
   Alias fields and romanization pairs; clamp stopwords/morphology for negations and units.
   → proper_noun_aliases.md · romanization_transliteration.md · stopword_and_morphology_controls.md
4. Stabilize ranking
   Use multilingual or dual-track rerank with deterministic ordering.
   → hybrid_ranking_multilingual.md
5. Translation bridge only if needed
   Pair with a glossary and keep native path as default.
   → fallback_translation_and_glossary_bridge.md
6. Verify
   With bilingual & code-switch sets confirm ΔS ≤ 0.45, Coverage ≥ 0.70, λ convergent.
   → code_switching_eval.md

Store-agnostic quick recipes

Normalize the same way for corpus and queries before storing vectors
CJK/Thai require segmentation or bigrams; keep entity fields as keyword
If no multilingual embeddings, add a lexical sidecar and align features with a deterministic rerank
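
As a rough illustration of the first two recipes, the sketch below runs corpus text and queries through one shared `normalize` function and emits overlapping bigrams for CJK/Thai runs. It uses only the Python standard library. The function names and the code point ranges are assumptions chosen for the example; a real deployment would more likely use a proper segmenter.

```python
import unicodedata
from itertools import groupby


def normalize(text):
    """Single normalization path, called at both index time and query time."""
    # NFKC folds full-width/half-width and other compatibility forms.
    return unicodedata.normalize("NFKC", text).casefold().strip()


def is_unspaced_script(ch):
    """True for characters from scripts written without spaces between words."""
    code = ord(ch)
    return (
        0x4E00 <= code <= 0x9FFF      # CJK Unified Ideographs
        or 0x3040 <= code <= 0x30FF   # Hiragana and Katakana
        or 0x0E00 <= code <= 0x0E7F   # Thai
    )


def lexical_tokens(text):
    """Whole tokens for spaced scripts, overlapping bigrams for CJK/Thai runs."""
    tokens = []
    for word in normalize(text).split():
        for unspaced, group in groupby(word, key=is_unspaced_script):
            run = "".join(group)
            if unspaced and len(run) > 1:
                tokens.extend(run[i:i + 2] for i in range(len(run) - 1))
            else:
                tokens.append(run)
    return tokens


print(lexical_tokens("ＷＦＧＹ 東京都庁"))  # ['wfgy', '東京', '京都', '都庁']
```

Because both sides call the same function, a full-width query and a half-width document produce identical tokens.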


FAQ — Common Questions (Language & Multilingual)

Q1. Why does a bilingual or mixed query look similar but hit the wrong section?
A1. Most often the index and query use different analyzers or normalization steps, or CJK/Thai segmentation was never applied. Always lock the same normalization (width, accents, casing, segmentation) for both sides, then rebuild the index.
Open: tokenizer_mismatch.md · query_routing_and_analyzers.md

Q2. Why do zh-Hans and zh-Hant never co-retrieve?
A2. Variant and width rules are missing. Apply Unicode normalization, full/half-width mapping, and variant mapping before indexing.
Open: locale_drift.md
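
A minimal sketch of the pre-indexing step described in A2, assuming Python's standard `unicodedata` module for width folding and a deliberately tiny, hypothetical Traditional-to-Simplified table. The four-character table exists only to make the example runnable; real variant mapping needs a full conversion table (a library such as OpenCC is a common choice).

```python
import unicodedata

# Toy Traditional -> Simplified table, for illustration only.
HANT_TO_HANS = str.maketrans({"臺": "台", "灣": "湾", "學": "学", "體": "体"})


def fold_variants(text):
    """Fold width with NFKC, then map Traditional characters to Simplified."""
    return unicodedata.normalize("NFKC", text).translate(HANT_TO_HANS)


print(fold_variants("臺灣大學"))       # 台湾大学
print(fold_variants("ＡＢＣ１２３"))   # ABC123
```

Applying `fold_variants` to both documents and queries gives zh-Hans and zh-Hant spellings the same indexed form, which is what lets them co-retrieve.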

Q3. After translating the question into English, citations jump to the wrong section.
A3. The citation schema is too loose, missing fields like section_id and offsets. Enforce snippet contracts and cite-then-explain.
Open: data-contracts.md · retrieval-traceability.md
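
One way to picture a stricter snippet contract is a record that cannot pass validation without a section and offsets. This is a sketch under assumptions: only `section_id` and offsets come from the answer above, while `doc_id`, `text`, and the validation rules are hypothetical and do not reproduce the schema in data-contracts.md.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snippet:
    """Hypothetical snippet record carried from retrieval to citation."""
    doc_id: str
    section_id: str
    start: int  # character offset into the section, inclusive
    end: int    # character offset into the section, exclusive
    text: str


def contract_errors(snippet):
    """Return a list of contract violations; an empty list means citable."""
    errors = []
    if not snippet.doc_id:
        errors.append("doc_id missing")
    if not snippet.section_id:
        errors.append("section_id missing")
    if snippet.start < 0 or snippet.end <= snippet.start:
        errors.append("offsets invalid")
    elif len(snippet.text) != snippet.end - snippet.start:
        errors.append("text length does not match offsets")
    return errors


good = Snippet("doc-1", "sec-2", 10, 15, "hello")
bad = Snippet("doc-1", "", 10, 15, "hello")
print(contract_errors(good))  # []
print(contract_errors(bad))   # ['section_id missing']
```

Rejecting snippets with a non-empty error list before generation keeps a translated question from citing text whose section cannot be traced.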

Q4. Why does Thai or Japanese recall fluctuate a lot?
A4. Classic tokenizer mismatch. Ensure index and query share the same segmenter; if not, use bigram or hybrid segmentation.
Open: tokenizer_mismatch.md

Q5. Why do mixed Latin + CJK queries under-recall?
A5. The analyzer splits the query into two routes and weights them unevenly. Use script-aware splitting or fixed routing.
Open: script_mixing.md · query_routing_and_analyzers.md

Q6. Why do proper nouns oscillate between native, romanized, and English aliases?
A6. Alias fields and romanization tables are missing. Add aliases and protect them with keyword fields.
Open: proper_noun_aliases.md · romanization_transliteration.md
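
The alias idea can be illustrated with a small lookup that maps every known surface form to one canonical entity id, which is then stored in a keyword (exact-match) field. The alias table, the `canonical_entity` helper, and the entity id are hypothetical; a real table would be curated per corpus.

```python
import unicodedata

# Hypothetical alias table: canonical id -> known surface forms.
ALIASES = {
    "tokyo": ["Tokyo", "東京", "Tōkyō", "Toukyou"],
}


def _key(surface):
    """Normalize a surface form so lookups ignore width and case."""
    return unicodedata.normalize("NFKC", surface).casefold()


ALIAS_TO_CANONICAL = {
    _key(alias): canonical
    for canonical, aliases in ALIASES.items()
    for alias in aliases
}


def canonical_entity(surface):
    """Return the canonical id for a surface form, or None if unknown."""
    return ALIAS_TO_CANONICAL.get(_key(surface))


print(canonical_entity("東京"))     # tokyo
print(canonical_entity("TOUKYOU"))  # tokyo
print(canonical_entity("Kyoto"))    # None
```

Indexing the canonical id alongside the original text lets native, romanized, and English spellings hit the same entity without rewriting the evidence itself.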

Q7. Why does multilingual reranking give different orderings each run?
A7. You are using a monolingual reranker or unaligned features. Switch to a multilingual reranker or dual-track (lexical+vector) with deterministic tie-breaks.
Open: hybrid_ranking_multilingual.md
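
To show what a deterministic tie-break looks like in a dual-track setup, here is a minimal sketch that fuses lexical and vector scores with fixed weights and sorts by score, then by document id. The weights, the rounding precision, and the assumption that both score sets are already scaled to the same range are illustrative choices, not values from hybrid_ranking_multilingual.md.

```python
def fuse(lexical, vector, w_lexical=0.4, w_vector=0.6, digits=6):
    """Fuse two {doc_id: score} maps into one deterministic ranking."""
    scored = []
    for doc_id in set(lexical) | set(vector):
        score = (
            w_lexical * lexical.get(doc_id, 0.0)
            + w_vector * vector.get(doc_id, 0.0)
        )
        # Rounding keeps tiny floating-point noise from reordering ties.
        scored.append((round(score, digits), doc_id))
    # Higher score first; equal scores are ordered by doc_id ascending.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored]


lexical = {"a": 0.5, "b": 0.5, "c": 0.1}
vector = {"a": 0.5, "b": 0.5, "d": 0.9}
print(fuse(lexical, vector))  # ['d', 'a', 'b', 'c']
```

Documents "a" and "b" tie on score, and the id tie-break guarantees they come back in the same order on every run.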

Q8. Should I enable translation bridging from the start?
A8. No. Always try the native-language path first. Enable the bridge only when ΔS stays above 0.45 over time, and always pair it with a glossary.
Open: fallback_translation_and_glossary_bridge.md

Q9. Why do negations or particles disappear, flipping the meaning?
A9. Stopword or morphology rules are too aggressive. Protect negations, units, and structural particles.
Open: stopword_and_morphology_controls.md
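
A small sketch of the protection rule: tokens on a protected list survive even when a stopword list would drop them. Both word lists here are tiny hypothetical examples; real lists depend on the language and the analyzer in use.

```python
# Hypothetical stopword list that is too aggressive: it contains negations.
STOPWORDS = {"the", "a", "is", "of", "to", "not", "no", "per"}

# Tokens that must never be removed (negations, unit words, particles).
PROTECTED = {"not", "no", "never", "per", "不", "没"}


def filter_tokens(tokens):
    """Drop stopwords unless the token is protected."""
    return [t for t in tokens if t in PROTECTED or t not in STOPWORDS]


print(filter_tokens(["the", "dose", "is", "not", "safe"]))
# ['dose', 'not', 'safe']
```

Without the protected set the same input would reduce to `['dose', 'safe']`, which flips the meaning.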

Q10. Why does language detection keep flipping and causing misrouting?
A10. The detection contract isn’t locked, or samples are too short. Pin a stable model, and set the sample length, confidence threshold, and fallback paths.
Open: query_language_detection.md
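
The contract in A10 can be sketched as a thin wrapper around whatever detector is in use. Everything named here is an assumption for illustration: the `DetectionContract` fields, the default thresholds, and the `detector` callable, which is expected to return a `(language, confidence)` pair. The fallback tag "und" is the standard language tag for undetermined content.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DetectionContract:
    """Hypothetical contract: fixed thresholds and a fixed fallback."""
    min_chars: int = 20
    min_confidence: float = 0.80
    fallback_language: str = "und"


def detect_with_contract(text, detector, contract=DetectionContract()):
    """Apply the contract around a detector returning (language, confidence)."""
    sample = text.strip()
    if len(sample) < contract.min_chars:
        return {"language": contract.fallback_language,
                "confidence": 0.0, "reason": "too_short"}
    language, confidence = detector(sample)
    if confidence < contract.min_confidence:
        return {"language": contract.fallback_language,
                "confidence": confidence, "reason": "low_confidence"}
    return {"language": language, "confidence": confidence, "reason": "ok"}


def stub_detector(sample):
    """Stand-in for a real model so the example runs on its own."""
    return ("th", 0.95)


print(detect_with_contract("short", stub_detector)["reason"])   # too_short
print(detect_with_contract("x" * 30, stub_detector)["language"])  # th
```

Routing then keys off the returned `language`, and the `reason` field makes it visible when a query took the fallback path instead of a detected one.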

Q11. Metrics look fine but recall for non-Latin languages stays low.
A11. First check normalization and segmentation, then verify aliases/romanization and multilingual rerank alignment. Add code-switch eval sets for validation.
Open: multilingual_guide.md · code_switching_eval.md

Q12. What is the minimum acceptance test?
A12. Run bilingual and code-switch eval sets. Confirm all:

ΔS(question, retrieved) ≤ 0.45
Coverage ≥ 0.70
λ convergent.
If not, debug in order: detection → normalization → entity protection → rerank → translation bridge.

Q13. Is there a ready-to-paste diagnostic prompt?
A13. Yes. Use the following inside your LLM:

You have TXTOS and the WFGY Problem Map loaded.

Task:  
- Given a bilingual question Q, measure ΔS(Q, retrieved) and λ across 3 paraphrases.  
- Verify index/query normalization (width, accents, casing, segmentation).  
- Enforce cite-then-explain. Protect entities with alias/romanization.  
- If ΔS ≥ 0.60 or λ flips, output minimal structural fix until ΔS ≤ 0.45, Coverage ≥ 0.70.

Return JSON:  
{ "citations":[...], "ΔS":0.xx, "λ_state":"<>|→|←|×", "coverage":0.xx, "next_fix":"..." }
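
If the model's reply is consumed by a script, it helps to validate the JSON before trusting it. The sketch below assumes the reply is a single JSON object with the five keys shown in the prompt, and it reads "<>|→|←|×" as "one of these four symbols"; both are interpretations for illustration. The `passes` field and the sample reply are hypothetical.

```python
import json

REQUIRED_KEYS = {"citations", "ΔS", "λ_state", "coverage", "next_fix"}
LAMBDA_STATES = {"<>", "→", "←", "×"}  # assumed reading of "<>|→|←|×"


def parse_diagnostic(reply):
    """Parse and validate the JSON reply; add a pass/fail flag."""
    data = json.loads(reply)
    missing = REQUIRED_KEYS - set(data)
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    if data["λ_state"] not in LAMBDA_STATES:
        raise ValueError(f"unknown λ_state: {data['λ_state']!r}")
    # Thresholds from the prompt: ΔS ≤ 0.45 and Coverage ≥ 0.70.
    data["passes"] = data["ΔS"] <= 0.45 and data["coverage"] >= 0.70
    return data


# Made-up reply, for illustration only.
reply = ('{"citations": ["sec-3"], "ΔS": 0.38, "λ_state": "→", '
         '"coverage": 0.74, "next_fix": ""}')
print(parse_diagnostic(reply)["passes"])  # True
```

A reply that omits a key or uses an unexpected λ_state raises `ValueError` instead of silently passing.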

Q14. If I want to change the least, what’s the fix priority?
A14. 1) Lock language detection contract 2) Lock normalization and analyzers 3) Add aliases/romanization 4) Multilingual rerank 5) Only then enable translation bridge.

Q15. Accuracy improved, but rankings across languages still flip occasionally.
A15. Add stable sort keys and fixed weight tables. Inject language features into rerankers and set deterministic tie-break rules.
Open: hybrid_ranking_multilingual.md

🔗 Quick-Start Downloads (60 sec)

Tool	Link	3-Step Setup
WFGY 1.0 PDF	Engine Paper	1️⃣ Download · 2️⃣ Upload to your LLM · 3️⃣ Ask “Answer using WFGY + <your question>”
TXT OS (plain-text OS)	TXTOS.txt	1️⃣ Download · 2️⃣ Paste into any LLM chat · 3️⃣ Type “hello world” — OS boots instantly

🧭 Explore More

Module	Description	Link
WFGY Core	WFGY 2.0 engine is live: full symbolic reasoning architecture and math stack	View →
Problem Map 1.0	Initial 16-mode diagnostic and symbolic fix framework	View →
Problem Map 2.0	RAG-focused failure tree, modular fixes, and pipelines	View →
Semantic Clinic Index	Expanded failure catalog: prompt injection, memory bugs, logic drift	View →
Semantic Blueprint	Layer-based symbolic reasoning & semantic modulations	View →
Benchmark vs GPT-5	Stress test GPT-5 with full WFGY reasoning suite	View →
🧙‍♂️ Starter Village 🏡	New here? Lost in symbols? Click here and let the wizard guide you through	Start →

👑 Early Stargazers: See the Hall of Fame —
Engineers, hackers, and open source builders who supported WFGY from day one.

⭐ WFGY Engine 2.0 is already unlocked. ⭐ Star the repo to help others discover it and unlock more on the Unlock Board.
