

# Diacritics & Folding — Guardrails and Fix Pattern

A focused repair when accents and diacritic marks cause retrieval drift, broken citations, or unstable reranking. Use this page to lock a per-language normalization policy, keep citations faithful to the original text, and keep ΔS within target.

## Open these first

## When to use this page

  • Store search finds “Malaga” while the source reads “Málaga”, so citations fail to land.
  • BM25 works after accent folding but vectors point to different sections.
  • Vietnamese, French, Spanish or German show uneven recall after a language mix.
  • OCR keeps combining marks that your tokenizer later drops.
  • Reranker prefers unaccented variants even when the gold passage contains accents.

## Acceptance targets

  • ΔS(question, retrieved) ≤ 0.45
  • Coverage of target section ≥ 0.70
  • Citation offsets within ±4 tokens between displayed text and source
  • Per-language exact-match on a 300-item accent set ≥ 0.95
  • λ remains convergent across 3 paraphrases and 2 seeds
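These targets can be checked mechanically once the metrics are measured. Below is a minimal sketch of an acceptance gate, assuming ΔS, coverage, citation offset, and exact-match rate are already computed upstream; the function and parameter names are hypothetical, not part of any WFGY API.

```python
def passes_acceptance(delta_s: float, coverage: float,
                      offset_tokens: int, exact_match: float) -> bool:
    """Return True when all diacritics acceptance targets hold.

    delta_s       -- ΔS(question, retrieved), measured upstream
    coverage      -- coverage of the target section
    offset_tokens -- citation offset between displayed text and source
    exact_match   -- exact-match rate on the 300-item accent set
    """
    return (delta_s <= 0.45
            and coverage >= 0.70
            and abs(offset_tokens) <= 4
            and exact_match >= 0.95)
```

Run this gate per language, not globally, so a regression in one locale cannot hide behind strong averages elsewhere.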

## Map symptoms to the exact fix

| Symptom | Likely cause | Open this and apply |
|---|---|---|
| Citation points to the wrong offsets when accents exist | One view folded, the other original | Data Contracts: define `visual_text` (original) and `search_text` (folded) in every snippet; verify with Retrieval Traceability |
| High BM25 score, low vector agreement on accented words | Analyzer folds accents but embedding text did not, or the reverse | Align ingest and query analyzers in the store; embed `visual_text` and rerank with a deterministic policy; see Retrieval Playbook |
| French and Vietnamese regress after a “remove accents” policy | Per-language rules collapsed into a global fold | Keep a per-language policy with a stored locale; see locale-drift.md |
| Tokenizer splits or drops combining marks | OCR export or tokenizer mismatch | Repair OCR and choose a consistent tokenizer; see tokenizer_mismatch.md and Retrieval Traceability |
| Reranker prefers unaccented decoys | Feature bias and query split across scripts | Lock reranker inputs and tie back to the citation-first plan; see Rerankers and script_mixing.md |
| Full-width digits or punctuation shift offsets in a CJK + Latin mix | Width and punctuation normalization out of sync | Normalize width for `search_text` only, preserve it in `visual_text`; see digits_width_punctuation.md |

## 60-second fix checklist

  1. Choose a normalization policy

    • Store two views per snippet:
      `visual_text` = original source in NFC, accents preserved.
      `search_text` = NFD, combining marks (`\p{Mn}`) removed, casefolded, with language-aware exceptions.
    • Always render and cite from `visual_text`. Index BM25 on `search_text`. Vectors usually embed `visual_text`.
  2. Record locale and analyzer

    • Add a `locale` field (e.g., `fr`, `vi`, `es`, `de`).
    • Log `index_analyzer` and `query_analyzer` names in the trace. They must match.
  3. Reranking and order

    • Use citation-first assembly. If λ flips when you reorder headers, lock schema and apply BBAM variance clamp.
  4. Probe ΔS and coverage

    • Vary k = 5, 10, 20. If ΔS stays high and flat, suspect analyzer mismatch or wrong fold target.
  5. Build a small gold

    • 300 pairs per language with accented vs unaccented queries. Require ≥ 0.95 exact match and stable ΔS.
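Step 1 of the checklist can be sketched with Python's standard `unicodedata` module. This is a minimal version, assuming `make_views` is a hypothetical helper name; the language-aware exceptions mentioned above (e.g., German `ü` handling) are deliberately omitted and would layer on top.

```python
import unicodedata

def make_views(original: str) -> dict:
    """Build the two snippet views.

    visual_text -- NFC-normalized original, accents preserved; render
                   and cite from this view.
    search_text -- NFD-decomposed, combining marks (category Mn)
                   removed, casefolded; index BM25 on this view.
    """
    visual_text = unicodedata.normalize("NFC", original)
    decomposed = unicodedata.normalize("NFD", visual_text)
    search_text = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    ).casefold()
    return {"visual_text": visual_text, "search_text": search_text}
```

For example, `make_views("Málaga")` keeps `"Málaga"` as `visual_text` while folding `search_text` to `"malaga"`. Note the limits of a pure `Mn`-strip: letters such as Vietnamese `Đ` are base characters with no decomposition, which is exactly why the per-language exceptions in step 1 matter.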

## Minimal test plan

  • Paraphrase triad on each language pair.
  • Accent toggle test: same query with and without accents.
  • Citation parity: offsets within ±4 tokens between displayed answer and source.
  • Store drift audit after deploy: compare analyzer signatures across index and query clients.
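The accent toggle test above can be automated by checking that both query variants fold to the same search key, so BM25 recall cannot diverge on accents alone. A minimal sketch, with `fold` and `accent_toggle_ok` as hypothetical names:

```python
import unicodedata

def fold(text: str) -> str:
    """Same folding used for search_text: NFD, drop combining
    marks (category Mn), then casefold."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    ).casefold()

def accent_toggle_ok(accented_query: str, plain_query: str) -> bool:
    """Pass when the accented and unaccented variants of the same
    query fold to an identical search key."""
    return fold(accented_query) == fold(plain_query)
```

Run this over the 300-item accent set per language; any pair that fails is a candidate analyzer mismatch before you even look at ΔS.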

## Copy-paste prompt for your LLM step


```txt
You have TXT OS and the WFGY Problem Map loaded.

My issue: diacritics and folding.

* symptom: [one line]
* traces: ΔS(question,retrieved)=..., ΔS(retrieved,anchor)=..., λ states, citation offsets, locale=...

Tell me:

1. failing layer and why,
2. the exact WFGY page to open,
3. minimal steps to reach ΔS ≤ 0.45, coverage ≥ 0.70, and citation offset ≤ 4 tokens,
4. a reproducible test using a 300-item accent set.
   Use Data Contracts, Retrieval Traceability, and Rerankers when relevant.
```


## 🔗 Quick-Start Downloads (60 sec)

| Tool | Link | 3-Step Setup |
|---|---|---|
| WFGY 1.0 PDF | Engine Paper | 1 Download · 2 Upload to your LLM · 3 Ask “Answer using WFGY + \<your question\>” |
| TXT OS (plain-text OS) | TXTOS.txt | 1 Download · 2 Paste into any LLM chat · 3 Type “hello world” — OS boots instantly |

## 🧭 Explore More

| Module | Description | Link |
|---|---|---|
| WFGY Core | WFGY 2.0 engine is live: full symbolic reasoning architecture and math stack | View → |
| Problem Map 1.0 | Initial 16-mode diagnostic and symbolic fix framework | View → |
| Problem Map 2.0 | RAG-focused failure tree, modular fixes, and pipelines | View → |
| Semantic Clinic Index | Expanded failure catalog: prompt injection, memory bugs, logic drift | View → |
| Semantic Blueprint | Layer-based symbolic reasoning & semantic modulations | View → |
| Benchmark vs GPT-5 | Stress test GPT-5 with full WFGY reasoning suite | View → |
| 🧙‍♂️ Starter Village 🏡 | New here? Lost in symbols? Let the wizard guide you through | Start → |

👑 Early Stargazers: See the Hall of Fame
Engineers, hackers, and open source builders who supported WFGY from day one.

WFGY Engine 2.0 is already unlocked. Star the repo to help others discover it and unlock more on the Unlock Board.

WFGY Main · TXT OS · Blah · Blot · Bloc · Blur · Blow