

# Diacritics & Folding — Guardrails and Fix Pattern

A focused repair when accents and diacritic marks cause retrieval drift, broken citations, or unstable reranking. Use this page to lock a per-language normalization policy, keep citations faithful to the original text, and keep ΔS within target.

## Open these first

## When to use this page

  • Store search finds “Malaga” while the source reads “Málaga”, so citations fail to land.
  • BM25 works after accent folding but vectors point to different sections.
  • Vietnamese, French, Spanish or German show uneven recall after a language mix.
  • OCR keeps combining marks that your tokenizer later drops.
  • Reranker prefers unaccented variants even when the gold passage contains accents.

## Acceptance targets

  • ΔS(question, retrieved) ≤ 0.45
  • Coverage of target section ≥ 0.70
  • Citation offsets within ±4 tokens between displayed text and source
  • Per-language exact-match on a 300-item accent set ≥ 0.95
  • λ remains convergent across 3 paraphrases and 2 seeds
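These targets can be checked mechanically once the metrics are measured. Below is a minimal sketch of an acceptance gate, assuming ΔS, coverage, citation offset, and exact-match rate are already computed upstream; the function and parameter names are hypothetical, not part of any WFGY API.

```python
def passes_acceptance(delta_s: float, coverage: float,
                      offset_tokens: int, exact_match: float) -> bool:
    """Return True when all diacritics acceptance targets hold.

    delta_s       -- ΔS(question, retrieved), measured upstream
    coverage      -- coverage of the target section
    offset_tokens -- citation offset between displayed text and source
    exact_match   -- exact-match rate on the 300-item accent set
    """
    return (delta_s <= 0.45
            and coverage >= 0.70
            and abs(offset_tokens) <= 4
            and exact_match >= 0.95)
```

Run this gate per language, not globally, so a regression in one locale cannot hide behind strong averages elsewhere.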

## Map symptoms to the exact fix

| Symptom | Likely cause | Open this and apply |
|---|---|---|
| Citation points to the wrong offsets when accents exist | One view folded, the other original | Data Contracts: define `visual_text` (original) and `search_text` (folded) in every snippet; verify with Retrieval Traceability |
| High BM25 score, low vector agreement on accented words | Analyzer folds accents but embedding text did not, or the reverse | Align ingest and query analyzers in the store; embed `visual_text` and rerank with a deterministic policy; see Retrieval Playbook |
| French and Vietnamese regress after a “remove accents” policy | Per-language rules collapsed into a global fold | Keep a per-language policy with a stored locale; see locale-drift.md |
| Tokenizer splits or drops combining marks | OCR export or tokenizer mismatch | Repair OCR and choose a consistent tokenizer; see tokenizer_mismatch.md and Retrieval Traceability |
| Reranker prefers unaccented decoys | Feature bias and query split across scripts | Lock reranker inputs and tie back to the citation-first plan; see Rerankers and script_mixing.md |
| Full-width digits or punctuation shift offsets in a CJK + Latin mix | Width and punctuation normalization out of sync | Normalize width for `search_text` only, preserve it in `visual_text`; see digits_width_punctuation.md |

## 60-second fix checklist

  1. Choose a normalization policy

    • Store two views per snippet:
      `visual_text` = original source in NFC, accents preserved.
      `search_text` = NFD, combining marks (`\p{Mn}`) removed, casefolded, with language-aware exceptions.
    • Always render and cite from `visual_text`. Index BM25 on `search_text`. Vectors usually embed `visual_text`.
  2. Record locale and analyzer

    • Add a `locale` field (e.g., `fr`, `vi`, `es`, `de`).
    • Log `index_analyzer` and `query_analyzer` names in the trace. They must match.
  3. Reranking and order

    • Use citation-first assembly. If λ flips when you reorder headers, lock schema and apply BBAM variance clamp.
  4. Probe ΔS and coverage

    • Vary k = 5, 10, 20. If ΔS stays high and flat, suspect analyzer mismatch or wrong fold target.
  5. Build a small gold

    • 300 pairs per language with accented vs unaccented queries. Require ≥ 0.95 exact match and stable ΔS.
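Step 1 of the checklist can be sketched with Python's standard `unicodedata` module. This is a minimal version, assuming `make_views` is a hypothetical helper name; the language-aware exceptions mentioned above (e.g., German `ü` handling) are deliberately omitted and would layer on top.

```python
import unicodedata

def make_views(original: str) -> dict:
    """Build the two snippet views.

    visual_text -- NFC-normalized original, accents preserved; render
                   and cite from this view.
    search_text -- NFD-decomposed, combining marks (category Mn)
                   removed, casefolded; index BM25 on this view.
    """
    visual_text = unicodedata.normalize("NFC", original)
    decomposed = unicodedata.normalize("NFD", visual_text)
    search_text = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    ).casefold()
    return {"visual_text": visual_text, "search_text": search_text}
```

For example, `make_views("Málaga")` keeps `"Málaga"` as `visual_text` while folding `search_text` to `"malaga"`. Note the limits of a pure `Mn`-strip: letters such as Vietnamese `Đ` are base characters with no decomposition, which is exactly why the per-language exceptions in step 1 matter.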

## Minimal test plan

  • Paraphrase triad on each language pair.
  • Accent toggle test: same query with and without accents.
  • Citation parity: offsets within ±4 tokens between displayed answer and source.
  • Store drift audit after deploy: compare analyzer signatures across index and query clients.
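The accent toggle test above can be automated by checking that both query variants fold to the same search key, so BM25 recall cannot diverge on accents alone. A minimal sketch, with `fold` and `accent_toggle_ok` as hypothetical names:

```python
import unicodedata

def fold(text: str) -> str:
    """Same folding used for search_text: NFD, drop combining
    marks (category Mn), then casefold."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    ).casefold()

def accent_toggle_ok(accented_query: str, plain_query: str) -> bool:
    """Pass when the accented and unaccented variants of the same
    query fold to an identical search key."""
    return fold(accented_query) == fold(plain_query)
```

Run this over the 300-item accent set per language; any pair that fails is a candidate analyzer mismatch before you even look at ΔS.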

## Copy-paste prompt for your LLM step


```txt
You have TXT OS and the WFGY Problem Map loaded.

My issue: diacritics and folding.

* symptom: [one line]
* traces: ΔS(question,retrieved)=..., ΔS(retrieved,anchor)=..., λ states, citation offsets, locale=...

Tell me:

1. failing layer and why,
2. the exact WFGY page to open,
3. minimal steps to reach ΔS ≤ 0.45, coverage ≥ 0.70, and citation offset ≤ 4 tokens,
4. a reproducible test using a 300-item accent set.
   Use Data Contracts, Retrieval Traceability, and Rerankers when relevant.
```


## 🔗 Quick-Start Downloads (60 sec)

| Tool | Link | 3-Step Setup |
|---|---|---|
| WFGY 1.0 PDF | Engine Paper | 1 Download · 2 Upload to your LLM · 3 Ask “Answer using WFGY + \<your question\>” |
| TXT OS (plain-text OS) | TXTOS.txt | 1 Download · 2 Paste into any LLM chat · 3 Type “hello world” — OS boots instantly |

## 🧭 Explore More

| Module | Description | Link |
|---|---|---|
| WFGY Core | WFGY 2.0 engine is live: full symbolic reasoning architecture and math stack | View → |
| Problem Map 1.0 | Initial 16-mode diagnostic and symbolic fix framework | View → |
| Problem Map 2.0 | RAG-focused failure tree, modular fixes, and pipelines | View → |
| Semantic Clinic Index | Expanded failure catalog: prompt injection, memory bugs, logic drift | View → |
| Semantic Blueprint | Layer-based symbolic reasoning & semantic modulations | View → |
| Benchmark vs GPT-5 | Stress test GPT-5 with full WFGY reasoning suite | View → |
| 🧙‍♂️ Starter Village 🏡 | New here? Lost in symbols? Let the wizard guide you through | Start → |

👑 Early Stargazers: See the Hall of Fame
Engineers, hackers, and open source builders who supported WFGY from day one.

WFGY Engine 2.0 is already unlocked. Star the repo to help others discover it and unlock more on the Unlock Board.

WFGY Main · TXT OS · Blah · Blot · Bloc · Blur · Blow