Fallback Translation and Glossary Bridge · Global Fix Map

When native-language recall keeps missing the right snippet, switch to a controlled translation bridge with a domain glossary and alias shield. Translate only where needed, protect entities and negations, and verify improvement with ΔS, λ, and coverage.


Open these first


Core acceptance targets

  • ΔS(question, retrieved) ≤ 0.45 on three paraphrases and two seeds
  • Coverage of target section ≥ 0.70
  • λ convergent after the bridge, across native vs pivot language
  • No entity corruption or negation loss in the final citation set
  • Rank@k improves or remains flat after the bridge is enabled
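
The targets above can be turned into a single pass/fail gate. The sketch below is one possible encoding, not the project's own checker: the `Run` record and its field names are hypothetical, ΔS and coverage are taken as already computed, `"→"` is assumed to be the convergent λ symbol, and "Rank@k improves or remains flat" is modelled as the target snippet's rank not getting worse.

```python
from dataclasses import dataclass

DELTA_S_MAX = 0.45    # target from the list above
COVERAGE_MIN = 0.70   # target from the list above
CONVERGENT = "→"      # assumption: this λ symbol marks a convergent state


@dataclass(frozen=True)
class Run:
    """Metrics for one (paraphrase, seed) pair. All field names are hypothetical."""
    delta_s: float        # ΔS(question, retrieved) after the bridge
    coverage: float       # fraction of the target section covered
    lambda_state: str     # "→", "←", "<>" or "×"
    entity_errors: int    # corrupted entities or lost negations in the citations
    rank_native: int      # rank of the target snippet without the bridge (1 = best)
    rank_bridge: int      # rank of the target snippet with the bridge


def passes_targets(runs, min_runs=6):
    """Return True only if every run meets every target.

    min_runs=6 encodes three paraphrases times two seeds.
    """
    if len(runs) < min_runs:
        return False
    return all(
        r.delta_s <= DELTA_S_MAX
        and r.coverage >= COVERAGE_MIN
        and r.lambda_state == CONVERGENT
        and r.entity_errors == 0
        and r.rank_bridge <= r.rank_native
        for r in runs
    )
```

A single failing run rejects the whole batch, which matches the "no entity corruption" wording: the gate is a conjunction, not an average.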

When to enable the bridge

Enable only if all three hold:

  1. Native path shows flat-high ΔS across k settings.
  2. Query language and corpus language differ or the corpus is mixed locale.
  3. Entity recall improves during a quick pivot test without harming citations.

If any native pipeline item is obviously wrong, fix that first. See the tokenizer, analyzer, and morphology pages: tokenizer_mismatch.md, query_routing_and_analyzers.md, and stopword_and_morphology_controls.md.
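
As a rough illustration, the three conditions can be written as one predicate. The function name, its parameters, and especially the `high` and `max_spread` thresholds are placeholders chosen for this sketch; the page does not define numeric limits for "flat-high".

```python
def should_enable_bridge(delta_s_by_k, query_lang, corpus_langs,
                         recall_native, recall_pivot, citations_harmed,
                         high=0.60, max_spread=0.05):
    """Return True only if all three enabling conditions hold.

    delta_s_by_k maps each k setting to the ΔS measured on the native path.
    high and max_spread are hypothetical thresholds, tune them per corpus.
    """
    values = list(delta_s_by_k.values())
    # 1. Flat-high: ΔS stays high and barely moves as k changes.
    flat_high = (
        bool(values)
        and min(values) >= high
        and max(values) - min(values) <= max_spread
    )
    # 2. Query and corpus languages differ, or the corpus is mixed locale.
    langs = set(corpus_langs)
    locale_gap = len(langs) > 1 or query_lang not in langs
    # 3. The quick pivot test improves entity recall without harming citations.
    pivot_helps = recall_pivot > recall_native and not citations_harmed
    return flat_high and locale_gap and pivot_helps
```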


What usually breaks

| Symptom | Likely cause | Open this |
|---|---|---|
| Correct doc exists yet never ranks in top k | analyzer or tokenizer mismatch between query and store | tokenizer_mismatch.md · query_routing_and_analyzers.md |
| Names translate or transliterate inconsistently | missing alias shield or mixed romanization | proper_noun_aliases.md · romanization_transliteration.md |
| Negations flip meaning after MT | no do-not-translate list for negation tokens | stopword_and_morphology_controls.md |
| CJK queries degrade when pivoting via English | script segmentation and width rules differ by stage | script_mixing.md |
| Turkish/Greek accent fold changes matches | locale normalization not pinned per stage | locale_drift.md |
| Good recall but order is noisy across languages | reranker trained mono-lingual or features not aligned | hybrid_ranking_multilingual.md |

Design: glossary bridge in two modes

Mode A: query-side pivot. Translate the query to the corpus language with a glossary and alias shield. Run retrieval native to the store, then reason in the user language.

Mode B: corpus-side pivot. Keep the query in the user language and retrieve in native, but translate candidate snippets to the user language for reranking and reasoning. Never re-index on the pivot.

Glossary components

  • do_not_translate: names, products, codes, unit strings, legal terms.
  • preferred_terms: enforce a deterministic mapping for domain words.
  • romanization_map: stable transliteration table with 1-to-N aliases.
  • negation_and_modality: tokens that must survive intact.
  • protected_char_classes: width, diacritics, punctuation class locks.
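
One common way to enforce do_not_translate is to swap protected terms for opaque placeholders before machine translation and swap them back afterwards. The sketch below assumes that approach; the `shield` and `restore` helpers and the `__DNT<n>__` placeholder format are hypothetical, and whether a given MT system leaves such placeholders untouched has to be verified separately.

```python
def shield(text, protected_terms):
    """Replace each protected term with a placeholder.

    Longer terms are handled first so that a term such as "Value at Risk"
    is masked before any shorter term it contains.
    Returns the masked text and the placeholder-to-term mapping.
    """
    mapping = {}
    ordered = sorted(set(protected_terms), key=lambda t: (-len(t), t))
    for i, term in enumerate(ordered):
        if term and term in text:
            token = f"__DNT{i}__"
            text = text.replace(term, token)
            mapping[token] = term
    return text, mapping


def restore(text, mapping):
    """Put the original terms back after translation."""
    for token, term in mapping.items():
        text = text.replace(token, term)
    return text
```

Usage: call `shield` on the query (Mode A) or on each candidate snippet (Mode B), translate the masked text, then call `restore`. Matching here is plain case-sensitive substring search, so very short entries such as "§" or "ROE" may also match inside longer strings; a word-boundary rule may be needed for those.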

Trace fields to log

{
  "bridge_mode": "A|B",
  "pivot_lang": "en|zh|..",
  "glossary_hash": "sha256:...",
  "alias_set_hash": "sha256:...",
  "ΔS_before": 0.xx,
  "ΔS_after": 0.yy,
  "coverage_before": 0.xx,
  "coverage_after": 0.yy
}
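
A minimal sketch of how such a record might be assembled is shown below. It assumes the two hashes are SHA-256 digests of a canonical JSON serialization, which the page does not specify; `stable_hash` and `build_trace` are hypothetical helper names, and the `before` and `after` arguments are plain dicts with `delta_s` and `coverage` keys.

```python
import hashlib
import json


def stable_hash(obj):
    """SHA-256 over canonical JSON, so key order does not change the hash."""
    payload = json.dumps(
        obj, sort_keys=True, ensure_ascii=False, separators=(",", ":")
    ).encode("utf-8")
    return "sha256:" + hashlib.sha256(payload).hexdigest()


def build_trace(mode, pivot_lang, glossary, alias_set, before, after):
    """Build one trace record with the fields listed above."""
    if mode not in ("A", "B"):
        raise ValueError(f"bridge_mode must be 'A' or 'B', got {mode!r}")
    return {
        "bridge_mode": mode,
        "pivot_lang": pivot_lang,
        "glossary_hash": stable_hash(glossary),
        # Sort aliases so the hash does not depend on input order.
        "alias_set_hash": stable_hash(sorted(alias_set)),
        "ΔS_before": before["delta_s"],
        "ΔS_after": after["delta_s"],
        "coverage_before": before["coverage"],
        "coverage_after": after["coverage"],
    }
```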

Minimal implementation steps

  1. Detect language. Use the contract from query_language_detection.md. Refuse fallback if detection is unstable.

  2. Assemble glossary. Cover the five components listed under Glossary components.

  3. Choose mode.

    • Mode A if store is single-locale and analyzers are correct.
    • Mode B if store is mixed or analyzers cannot be changed.

  4. Run retrieval. Route analyzers per query_routing_and_analyzers.md. For Mode B, translate only candidates for reranking.

  5. Verify. Compute ΔS and coverage. Require λ convergent across two seeds and three paraphrases. Log trace fields.

  6. Publish. Keep the glossary versioned and pinned in eval reports. Guard with retrieval-traceability.md and data-contracts.md.
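
The refusal rule in step 1 and the mode choice in step 3 can be sketched as a small decision function. The function name, its parameters, and the string return values are illustrative. The `"fix_analyzers_first"` branch covers a case the steps leave implicit (single-locale store, analyzers wrong but changeable) and follows the earlier advice to repair the native pipeline before bridging.

```python
def choose_mode(detection_stable, store_locales,
                analyzers_correct, analyzers_changeable):
    """Return "refuse", "A", "B", or "fix_analyzers_first"."""
    if not detection_stable:
        # Step 1: refuse fallback when language detection is unstable.
        return "refuse"
    mixed_store = len(set(store_locales)) > 1
    if mixed_store or (not analyzers_correct and not analyzers_changeable):
        # Step 3: Mode B for mixed stores or analyzers that cannot be changed.
        return "B"
    if analyzers_correct:
        # Step 3: Mode A for a single-locale store with correct analyzers.
        return "A"
    # Single locale, analyzers wrong but fixable: repair the native path first.
    return "fix_analyzers_first"
```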


Spec: glossary JSON

{
  "version": "glossary_acme_finance_2025_08_30",
  "pivot_lang": "en",
  "do_not_translate": ["Value at Risk", "CAGR", "ROE", "İstanbul", "北京市", "§"],
  "preferred_terms": {
    "账面价值": "book value",
    "净现值": "net present value"
  },
  "romanization_map": {
    "北京市": ["Beijing Shi", "Beijing City"],
    "İstanbul": ["Istanbul", "Stamboul"]
  },
  "negation_and_modality": ["not", "never", "must", "should"],
  "protected_char_classes": ["fullwidth_digit", "narrow_no_break_space"]
}
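
Before pinning a glossary file, it may help to run a structural check on it. The validator below is a sketch based only on the keys shown in the spec above. The rule that a term must not appear in both do_not_translate and preferred_terms is an assumption added here, since a term cannot be left untranslated and mapped to a translation at once.

```python
REQUIRED_FIELDS = {
    "version": str,
    "pivot_lang": str,
    "do_not_translate": list,
    "preferred_terms": dict,
    "romanization_map": dict,
    "negation_and_modality": list,
    "protected_char_classes": list,
}


def validate_glossary(glossary):
    """Return a list of problems; an empty list means the glossary looks sane."""
    problems = []
    for key, expected in REQUIRED_FIELDS.items():
        if key not in glossary:
            problems.append(f"missing key: {key}")
        elif not isinstance(glossary[key], expected):
            problems.append(f"{key} should be a {expected.__name__}")
    if problems:
        return problems  # skip content checks until the shape is right

    # Assumed rule: a protected term must not also have a preferred translation.
    clash = set(glossary["do_not_translate"]) & set(glossary["preferred_terms"])
    for term in sorted(clash):
        problems.append(f"{term!r} is both protected and mapped")

    # Each romanization entry needs a non-empty list of aliases (1-to-N).
    for source, aliases in glossary["romanization_map"].items():
        if not isinstance(aliases, list) or not aliases:
            problems.append(f"romanization_map[{source!r}] needs at least one alias")
    return problems
```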

Copy-paste prompt for the LLM step

You have TXTOS and the WFGY Problem Map loaded.

My multilingual issue:
- native_lang: {xx}
- user_lang: {yy}
- mode: {A|B}
- glossary: {do_not_translate, preferred_terms, romanization_map, negation_and_modality}
- question: "{user_question}"
- candidates: [{snippet_id, text, source_url}...]

Do:
1) Apply the glossary strictly. Protect names, units, negations.
2) Perform cite-then-explain. If citations are weak, return the minimal fix and do not fabricate.
3) Return JSON:
{ "bridge_mode": "A|B", "pivot_lang": "en|...", "citations": [...],
  "answer": "...", "ΔS": 0.xx, "coverage": 0.xx, "λ_state": "→|←|<>|×",
  "next_fix": "..." }
Keep it auditable and short.
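
Because the prompt asks for a JSON reply, the reply can be checked before it is logged or shown. The parser below is a sketch under the assumption that the model returns exactly one JSON object with the keys named in the prompt; `parse_bridge_reply` is a hypothetical helper, and real replies may need extra cleanup (for example stripping surrounding text) first.

```python
import json

REQUIRED_KEYS = {
    "bridge_mode", "pivot_lang", "citations", "answer",
    "ΔS", "coverage", "λ_state", "next_fix",
}
LAMBDA_STATES = {"→", "←", "<>", "×"}  # the four states listed in the prompt


def parse_bridge_reply(raw):
    """Parse and validate the model's JSON reply; raise ValueError on any defect."""
    try:
        reply = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"reply is not valid JSON: {exc}") from exc
    if not isinstance(reply, dict):
        raise ValueError("reply must be a JSON object")
    missing = REQUIRED_KEYS - reply.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    if reply["bridge_mode"] not in ("A", "B"):
        raise ValueError("bridge_mode must be 'A' or 'B'")
    if reply["λ_state"] not in LAMBDA_STATES:
        raise ValueError("unknown λ_state")
    for key in ("ΔS", "coverage"):
        value = reply[key]
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise ValueError(f"{key} must be a number")
    if not isinstance(reply["citations"], list):
        raise ValueError("citations must be a list")
    return reply
```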

Eval protocol

  • Use bilingual and code-switching sets from code_switching_eval.md.
  • Compare native vs bridge on the same questions and seeds.
  • Accept only if ΔS ≤ 0.45, coverage ≥ 0.70, λ convergent, and entity recall does not regress.
  • Report deltas for Rank@k and citation accuracy.

Common gotchas

  • Translating the index. Never translate and re-index as a “quick fix”. Pivot only at query or candidate stage.
  • Letting the MT rewrite units or numbers. Add them to do-not-translate.
  • Dropping diacritics or width during translation. Pin normalization from locale_drift.md and script_mixing.md.
  • Reranking with a mono-lingual model. If scores are noisy across languages, follow hybrid_ranking_multilingual.md.
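
The number and width gotchas can be caught mechanically after translation. The check below is a sketch: it compares the numeric tokens of the source and the translated text as multisets, without normalizing them, so a changed value and a fullwidth-to-ASCII digit conversion both count as failures. The function name and the token pattern are illustrative and do not cover unit strings or spelled-out numbers.

```python
import re
from collections import Counter

# Digit runs with optional grouping or decimal separators, e.g. "1,250.5".
# In Python 3 str patterns, \d also matches fullwidth digits such as "１２".
NUMBER = re.compile(r"\d+(?:[.,]\d+)*")


def numbers_preserved(source, translated):
    """True if both texts contain exactly the same numeric tokens, width included."""
    return Counter(NUMBER.findall(source)) == Counter(NUMBER.findall(translated))
```

Usage: run it on every (source, translation) pair produced by the bridge and drop or flag candidates that fail, rather than passing them on to reranking.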
