mirror of https://github.com/onestardao/WFGY.git synced 2026-04-28 11:40:07 +00:00

PSBigBig 64f954951c

Update fallback_translation_and_glossary_bridge.md

2025-09-05 11:04:36 +08:00


Fallback Translation and Glossary Bridge · Global Fix Map

🧭 Quick Return to Map

You are in a sub-page of Language.
To reorient, go back here:

Language — multilingual processing and semantic alignment

WFGY Global Fix Map — main Emergency Room, 300+ structured fixes

WFGY Problem Map 1.0 — 16 reproducible failure modes

Think of this page as a desk within a ward.
If you need the full triage and all prescriptions, return to the Emergency Room lobby.

When native-language recall keeps missing the right snippet, switch to a controlled translation bridge with a domain glossary and alias shield. Translate only where needed, protect entities and negations, and verify improvement with ΔS, λ, and coverage.

Open these first

Visual map and recovery → rag-architecture-and-recovery.md
End to end retrieval knobs → retrieval-playbook.md
Traceability schema → retrieval-traceability.md
Contract the payload → data-contracts.md
Language overview → multilingual_guide.md
Tokenizer variance → tokenizer_mismatch.md
Mixed scripts → script_mixing.md
Locale normalization → locale_drift.md
Romanization rules → romanization_transliteration.md
Proper nouns and aliases → proper_noun_aliases.md
Language detection → query_language_detection.md
Analyzer routing → query_routing_and_analyzers.md
Multilingual ranking → hybrid_ranking_multilingual.md
Bilingual eval sets → code_switching_eval.md
Stopwords and morphology → stopword_and_morphology_controls.md

Core acceptance targets

ΔS(question, retrieved) ≤ 0.45 on three paraphrases and two seeds
Coverage of target section ≥ 0.70
λ convergent after the bridge, across native vs pivot language
No entity corruption or negation loss in the final citation set
Rank@k improves or remains flat after the bridge is enabled
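
A minimal sketch of these targets as a single gate. The `Run` record and its field names are hypothetical, and the sketch assumes Rank@k is a higher-is-better score; ΔS, coverage, and λ are taken as already computed elsewhere.

```python
from dataclasses import dataclass
from typing import List

DELTA_S_MAX = 0.45   # ΔS(question, retrieved) target from this page
COVERAGE_MIN = 0.70  # coverage target from this page


@dataclass
class Run:
    """One (paraphrase, seed) evaluation run. Field names are hypothetical."""
    delta_s: float            # ΔS(question, retrieved)
    coverage: float           # coverage of the target section
    lambda_convergent: bool   # True when λ is reported as convergent
    entities_intact: bool     # no entity corruption or negation loss in citations


def passes_targets(runs: List[Run], rank_before: float, rank_after: float) -> bool:
    """True only if every run meets every target and Rank@k did not regress."""
    if not runs:
        return False  # nothing measured, nothing accepted
    per_run_ok = all(
        r.delta_s <= DELTA_S_MAX
        and r.coverage >= COVERAGE_MIN
        and r.lambda_convergent
        and r.entities_intact
        for r in runs
    )
    # Assumption: Rank@k is higher-is-better, so "flat or improved" means >=.
    return per_run_ok and rank_after >= rank_before
```

With three paraphrases and two seeds, `runs` would hold six records.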

When to enable the bridge

Enable only if all three hold:

Native path shows flat-high ΔS across k settings.
Query language and corpus language differ or the corpus is mixed locale.
Entity recall improves during a quick pivot test without harming citations.

If any native pipeline item is obviously wrong, fix that first. See tokenizer, analyzer, or morphology pages above.
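
The three conditions can be sketched as one predicate. This page does not define "flat-high" numerically, so the `high` and `tolerance` defaults below are placeholder assumptions to tune per corpus, and all function and parameter names are hypothetical.

```python
from typing import Iterable, Sequence


def is_flat_high(delta_s_by_k: Sequence[float],
                 high: float = 0.60,
                 tolerance: float = 0.05) -> bool:
    """ΔS stays at or above `high` and moves by at most `tolerance` across k.

    Both thresholds are assumptions, not values taken from this page.
    """
    if not delta_s_by_k:
        return False
    lowest, highest = min(delta_s_by_k), max(delta_s_by_k)
    return lowest >= high and (highest - lowest) <= tolerance


def should_enable_bridge(delta_s_by_k: Sequence[float],
                         query_lang: str,
                         corpus_langs: Iterable[str],
                         recall_native: float,
                         recall_pivot: float,
                         citations_harmed: bool) -> bool:
    """All three conditions must hold before the bridge is switched on."""
    langs = set(corpus_langs)
    # Condition 2: languages differ, or the corpus mixes locales.
    language_gap = len(langs) > 1 or query_lang not in langs
    # Condition 3: the quick pivot test helps entity recall and leaves citations alone.
    pivot_helps = recall_pivot > recall_native and not citations_harmed
    return is_flat_high(delta_s_by_k) and language_gap and pivot_helps
```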

What usually breaks

| Symptom | Likely cause | Open this |
|---|---|---|
| Correct doc exists yet never ranks in top k | analyzer or tokenizer mismatch between query and store | tokenizer_mismatch.md · query_routing_and_analyzers.md |
| Names translate or transliterate inconsistently | missing alias shield or mixed romanization | proper_noun_aliases.md · romanization_transliteration.md |
| Negations flip meaning after MT | no do-not-translate list for negation tokens | stopword_and_morphology_controls.md |
| CJK queries degrade when pivoting via English | script segmentation and width rules differ by stage | script_mixing.md |
| Turkish/Greek accent fold changes matches | locale normalization not pinned per stage | locale_drift.md |
| Good recall but order is noisy across languages | reranker trained mono-lingual or features not aligned | hybrid_ranking_multilingual.md |

Design: glossary bridge in two modes

Mode A (query-side pivot). Translate the query to the corpus language with a glossary and alias shield. Run retrieval native to the store, then reason in the user language.

Mode B (corpus-side pivot). Keep the query in the user language and retrieve natively, but translate candidate snippets to the user language for reranking and reasoning. Never re-index on the pivot.

Glossary components

do_not_translate: names, products, codes, unit strings, legal terms.
preferred_terms: enforce a deterministic mapping for domain words.
romanization_map: stable transliteration table with 1-to-N aliases.
negation_and_modality: tokens that must survive intact.
protected_char_classes: width, diacritics, punctuation class locks.
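
One possible way to enforce do_not_translate is to swap protected terms for opaque placeholders before MT and restore them afterwards. This is a sketch only: the `__DNTn__` token format and both function names are hypothetical, and it assumes the MT engine passes such tokens through unchanged, which should be verified per engine.

```python
from typing import Dict, List, Tuple


def shield(text: str, protected: List[str]) -> Tuple[str, Dict[str, str]]:
    """Replace each protected term found in `text` with a placeholder token.

    Returns the shielded text plus a token -> original term mapping.
    """
    mapping: Dict[str, str] = {}
    # Longest terms first so a long term is shielded before any shorter term
    # it contains; ties are broken alphabetically to keep output deterministic.
    ordered = sorted(set(protected), key=lambda t: (-len(t), t))
    for index, term in enumerate(ordered):
        if term and term in text:
            token = f"__DNT{index}__"  # hypothetical placeholder format
            text = text.replace(term, token)
            mapping[token] = term
    return text, mapping


def unshield(text: str, mapping: Dict[str, str]) -> str:
    """Put the original protected terms back after translation."""
    for token, term in mapping.items():
        text = text.replace(token, term)
    return text
```

Usage: call `shield` on the query (Mode A) or on each candidate (Mode B), send the shielded text to MT, then call `unshield` on the MT output with the same mapping.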

Trace fields to log

{
  "bridge_mode": "A|B",
  "pivot_lang": "en|zh|..",
  "glossary_hash": "sha256:...",
  "alias_set_hash": "sha256:...",
  "ΔS_before": 0.xx,
  "ΔS_after": 0.yy,
  "coverage_before": 0.xx,
  "coverage_after": 0.yy
}
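
A sketch of how the hashes and the trace record might be produced. The canonical-JSON hashing scheme is an assumption (the page only fixes the `sha256:` prefix), and `canonical_hash` and `build_trace` are hypothetical helper names.

```python
import hashlib
import json
from typing import Iterable


def canonical_hash(obj) -> str:
    """SHA-256 over a canonical JSON encoding, so key order does not matter."""
    payload = json.dumps(
        obj, sort_keys=True, ensure_ascii=False, separators=(",", ":")
    ).encode("utf-8")
    return "sha256:" + hashlib.sha256(payload).hexdigest()


def build_trace(mode: str, pivot_lang: str, glossary: dict,
                aliases: Iterable[str],
                ds_before: float, ds_after: float,
                cov_before: float, cov_after: float) -> dict:
    """Assemble the trace fields listed above for one bridged query."""
    if mode not in ("A", "B"):
        raise ValueError("bridge_mode must be 'A' or 'B'")
    return {
        "bridge_mode": mode,
        "pivot_lang": pivot_lang,
        "glossary_hash": canonical_hash(glossary),
        # Sort and de-duplicate so the same alias set always hashes the same.
        "alias_set_hash": canonical_hash(sorted(set(aliases))),
        "ΔS_before": ds_before,
        "ΔS_after": ds_after,
        "coverage_before": cov_before,
        "coverage_after": cov_after,
    }
```

Serialize with `json.dumps(trace, ensure_ascii=False)` to keep the `ΔS_*` keys readable in logs.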

Minimal implementation steps

1. Detect language. Use the contract from query_language_detection.md. Refuse fallback if detection is unstable.
2. Assemble glossary.
   - Pull domain terms.
   - Add aliases from proper_noun_aliases.md.
   - Add romanization table from romanization_transliteration.md.
   - Add negation and unit strings to do-not-translate.
3. Choose mode.
   - Mode A if store is single-locale and analyzers are correct.
   - Mode B if store is mixed or analyzers cannot be changed.
4. Run retrieval. Route analyzers per query_routing_and_analyzers.md. For Mode B, translate only candidates for reranking.
5. Verify. Compute ΔS and coverage. Require λ convergent across two seeds and three paraphrases. Log trace fields.
6. Publish. Keep the glossary versioned and pinned in eval reports. Guard with retrieval-traceability.md and data-contracts.md.
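
The mode choice and the retrieval step can be sketched with injected callables. `translate` and `retrieve` are hypothetical stand-ins for your MT client and retriever, and treating every non-A case as Mode B is a simplifying assumption of this sketch.

```python
from typing import Callable, Dict, List

Translate = Callable[[str, str], str]   # (text, target_lang) -> translated text
Retrieve = Callable[[str], List[str]]   # query -> candidate snippets


def choose_mode(store_single_locale: bool, analyzers_correct: bool) -> str:
    """Mode A for a single-locale store with correct analyzers, else Mode B."""
    return "A" if (store_single_locale and analyzers_correct) else "B"


def run_bridge(query: str, mode: str, user_lang: str, corpus_lang: str,
               translate: Translate, retrieve: Retrieve) -> Dict[str, object]:
    """Pivot at the query stage (A) or the candidate stage (B), never the index."""
    if mode == "A":
        # Translate the query, then retrieve natively against the store.
        pivot_query = translate(query, corpus_lang)
        candidates = retrieve(pivot_query)
        return {"bridge_mode": "A", "retrieval_query": pivot_query,
                "rerank_texts": candidates}
    if mode == "B":
        # Retrieve with the untouched query; translate only the candidates.
        candidates = retrieve(query)
        return {"bridge_mode": "B", "retrieval_query": query,
                "rerank_texts": [translate(c, user_lang) for c in candidates]}
    raise ValueError("mode must be 'A' or 'B'")
```

Glossary shielding, ΔS and coverage checks, and trace logging would wrap around `run_bridge`; they are left out here to keep the control flow visible.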

Spec: glossary JSON

{
  "version": "glossary_acme_finance_2025_08_30",
  "pivot_lang": "en",
  "do_not_translate": ["Value at Risk", "CAGR", "ROE", "İstanbul", "北京市", "§"],
  "preferred_terms": {
    "账面价值": "book value",
    "净现值": "net present value"
  },
  "romanization_map": {
    "北京市": ["Beijing Shi", "Beijing City"],
    "İstanbul": ["Istanbul", "Stamboul"]
  },
  "negation_and_modality": ["not", "never", "must", "should"],
  "protected_char_classes": ["fullwidth_digit", "narrow_no_break_space"]
}
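
Before pinning a glossary version, a structural check can catch obvious mistakes. This is a sketch: `validate_glossary` is a hypothetical helper, and the two consistency rules (no term both protected and remapped, every romanization entry has at least one alias) are assumptions rather than rules stated by the spec.

```python
from typing import List

# Field names and container types as they appear in the spec above.
REQUIRED_FIELDS = {
    "version": str,
    "pivot_lang": str,
    "do_not_translate": list,
    "preferred_terms": dict,
    "romanization_map": dict,
    "negation_and_modality": list,
    "protected_char_classes": list,
}


def validate_glossary(glossary: dict) -> List[str]:
    """Return a list of human-readable problems; an empty list means it passed."""
    problems: List[str] = []
    for field, expected in REQUIRED_FIELDS.items():
        if field not in glossary:
            problems.append(f"missing field: {field}")
        elif not isinstance(glossary[field], expected):
            problems.append(f"{field} should be a {expected.__name__}")
    if problems:
        return problems  # skip deeper checks on a malformed document

    protected = set(glossary["do_not_translate"])
    for term in glossary["preferred_terms"]:
        # Assumed rule: a term cannot be both frozen and remapped.
        if term in protected:
            problems.append(f"'{term}' is in do_not_translate and preferred_terms")
    for term, aliases in glossary["romanization_map"].items():
        # Assumed rule: 1-to-N aliases means N >= 1.
        if not isinstance(aliases, list) or not aliases:
            problems.append(f"romanization_map['{term}'] needs at least one alias")
    return problems
```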

Copy-paste prompt for the LLM step

You have TXTOS and the WFGY Problem Map loaded.

My multilingual issue:
- native_lang: {xx}
- user_lang: {yy}
- mode: {A|B}
- glossary: {do_not_translate, preferred_terms, romanization_map, negation_and_modality}
- question: "{user_question}"
- candidates: [{snippet_id, text, source_url}...]

Do:
1) Apply the glossary strictly. Protect names, units, negations.
2) Perform cite-then-explain. If citations are weak, return the minimal fix and do not fabricate.
3) Return JSON:
{ "bridge_mode": "A|B", "pivot_lang": "en|...", "citations": [...],
  "answer": "...", "ΔS": 0.xx, "coverage": 0.xx, "λ_state": "→|←|<>|×",
  "next_fix": "..." }
Keep it auditable and short.
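
The JSON the prompt asks for can be checked before it is trusted. A minimal sketch, assuming the model returns a bare JSON object; `parse_bridge_reply` is a hypothetical helper, and the keys and λ symbols are copied from the prompt above.

```python
import json

REQUIRED_KEYS = {
    "bridge_mode", "pivot_lang", "citations", "answer",
    "ΔS", "coverage", "λ_state", "next_fix",
}
LAMBDA_STATES = {"→", "←", "<>", "×"}  # symbols listed in the prompt


def parse_bridge_reply(raw: str) -> dict:
    """Parse the LLM reply and reject anything that breaks the contract."""
    reply = json.loads(raw)  # json.JSONDecodeError is a ValueError subclass
    if not isinstance(reply, dict):
        raise ValueError("reply must be a JSON object")
    missing = REQUIRED_KEYS - reply.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    if reply["bridge_mode"] not in ("A", "B"):
        raise ValueError("bridge_mode must be 'A' or 'B'")
    if reply["λ_state"] not in LAMBDA_STATES:
        raise ValueError("unknown λ_state")
    if not isinstance(reply["citations"], list):
        raise ValueError("citations must be a list")
    return reply
```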

Eval protocol

Use bilingual and code-switching sets from code_switching_eval.md.
Compare native vs bridge on the same questions and seeds.
Accept only if ΔS ≤ 0.45, coverage ≥ 0.70, λ convergent, and entity recall does not regress.
Report deltas for Rank@k and citation accuracy.
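
Pairing native and bridge runs on the same questions and seeds keeps the comparison honest. A sketch under assumptions: the row keys `question_id`, `seed`, `rank_at_k`, and `citation_accuracy` are hypothetical names for whatever your eval harness emits.

```python
from statistics import mean
from typing import Dict, List, Sequence


def paired_deltas(native: List[dict], bridge: List[dict],
                  metrics: Sequence[str] = ("rank_at_k", "citation_accuracy")
                  ) -> Dict[str, float]:
    """Mean (bridge - native) per metric over identical (question, seed) pairs."""
    def key(row: dict):
        return (row["question_id"], row["seed"])

    native_by_key = {key(r): r for r in native}
    bridge_by_key = {key(r): r for r in bridge}
    if not native_by_key:
        raise ValueError("no runs to compare")
    if native_by_key.keys() != bridge_by_key.keys():
        # Comparing different questions or seeds would hide regressions.
        raise ValueError("native and bridge must cover the same questions and seeds")
    return {
        m: mean(bridge_by_key[k][m] - native_by_key[k][m] for k in native_by_key)
        for m in metrics
    }
```

A positive delta means the bridge scored higher on that metric; the acceptance thresholds for ΔS, coverage, and λ still apply separately.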

Common gotchas

Translating the index. Never translate and re-index as a “quick fix”. Pivot only at query or candidate stage.
Letting the MT rewrite units or numbers. Add them to do-not-translate.
Dropping diacritics or width during translation. Pin normalization from locale_drift.md and script_mixing.md.
Reranking with a mono-lingual model. If scores are noisy across languages, follow hybrid_ranking_multilingual.md.
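
A cheap guard against the number and width gotchas is to compare the digit runs before and after MT. This is a sketch: `numbers_preserved` is a hypothetical helper, and it checks only numeric strings, not unit words or diacritics.

```python
import re
from collections import Counter

# In Python 3, \d matches any Unicode decimal digit, so fullwidth digits such
# as "１２" are captured as-is and compare unequal to ASCII "12".
NUMBER = re.compile(r"\d+(?:[.,]\d+)*")


def numbers_preserved(source: str, translated: str) -> bool:
    """True when both texts contain exactly the same numeric strings.

    Order is ignored because translation may legitimately reorder clauses.
    """
    return Counter(NUMBER.findall(source)) == Counter(NUMBER.findall(translated))
```

If the check fails, treat the translation as unsafe for citation and fall back to the shielded original span.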

🔗 Quick-Start Downloads (60 sec)

| Tool | Link | 3-Step Setup |
|---|---|---|
| WFGY 1.0 PDF | Engine Paper | 1️⃣ Download · 2️⃣ Upload to your LLM · 3️⃣ Ask “Answer using WFGY + <your question>” |
| TXT OS (plain-text OS) | TXTOS.txt | 1️⃣ Download · 2️⃣ Paste into any LLM chat · 3️⃣ Type “hello world” — OS boots instantly |

🧭 Explore More

| Module | Description | Link |
|---|---|---|
| WFGY Core | WFGY 2.0 engine is live: full symbolic reasoning architecture and math stack | View → |
| Problem Map 1.0 | Initial 16-mode diagnostic and symbolic fix framework | View → |
| Problem Map 2.0 | RAG-focused failure tree, modular fixes, and pipelines | View → |
| Semantic Clinic Index | Expanded failure catalog: prompt injection, memory bugs, logic drift | View → |
| Semantic Blueprint | Layer-based symbolic reasoning & semantic modulations | View → |
| Benchmark vs GPT-5 | Stress test GPT-5 with full WFGY reasoning suite | View → |
| 🧙‍♂️ Starter Village 🏡 | New here? Lost in symbols? Click here and let the wizard guide you through | Start → |

👑 Early Stargazers: See the Hall of Fame

⭐ WFGY Engine 2.0 is already unlocked. ⭐ Star the repo to help others discover it and unlock more on the Unlock Board.
