
CJK Segmentation & Word-Break — Guardrails and Fix Pattern

Stabilize retrieval and ranking for Chinese / Japanese / Korean text where whitespace is not a reliable token boundary.
Use this page to localize segmentation failures, choose the right analyzer, and verify the fix with measurable targets.


When to use this page

  • High similarity by characters but wrong meaning or empty recall on whole queries.
  • BM25 looks random; tiny single-character tokens dominate the index.
  • Citations cut through the middle of words; snippet offsets don't match what users see.
  • Mixed CJK + Latin queries split unpredictably across runs or providers.

Open these first

Acceptance targets

  • ΔS(question, retrieved) ≤ 0.45 on 3 paraphrases
  • Coverage of target section ≥ 0.70
  • λ remains convergent across 2 seeds
  • Tokenization sanity: OOV rate falls by ≥ 40% vs whitespace; tokens/char ≤ 0.7 on CJK pages
  • E_resonance flat on long windows
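
These targets are measurable before and after the fix. A minimal probe sketch, assuming ΔS is taken as 1 minus the cosine similarity of normalized embeddings; sentence-transformers, the model name, and the sample texts are illustrative choices, not requirements of this page:

```python
# Minimal ΔS probe. Assumption: ΔS ≈ 1 - cosine similarity of normalized
# embeddings; the library, model, and sample texts are illustrative only.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

def delta_s(question: str, retrieved: str) -> float:
    q, r = model.encode([question, retrieved], normalize_embeddings=True)
    return float(1.0 - np.dot(q, r))  # 0.0 = identical, larger = more drift

# Acceptance check: all three paraphrases should land at ΔS <= 0.45.
paraphrases = ["東京の人口は?", "東京には何人住んでいますか?", "東京都の人口を教えて"]
snippet = "東京都の人口は約1400万人である。"
print([round(delta_s(p, snippet), 3) for p in paraphrases])
```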

Map symptoms → structural fixes (Problem Map)

| Symptom | Likely cause | Open this |
|---------|--------------|-----------|
| Query returns almost nothing; recall jumps when you add spaces | Index built with whitespace/Latin analyzer on CJK | chunking-checklist.md, retrieval-playbook.md |
| Top-k filled with 1-char shards, citations cut mid-word | No CJK word-break at index or search time | retrieval-traceability.md, data-contracts.md |
| BM25 unstable; hybrid worse than single retriever | Search-time analyzer ≠ index-time analyzer | retrieval-playbook.md |
| Romanized terms and CJK compound in one query break apart | Mixed script + width + punctuation rules differ | script_mixing.md, digits_width_punctuation.md |
| High similarity, wrong meaning | Character-level overlap, no semantic units | embedding-vs-semantic.md |

60-second fix checklist (store-agnostic)

  1. Pick the right analyzer and lock it

    • Chinese: use a dictionary or statistical segmenter at index + search.
    • Japanese: use a MeCab/Kuromoji-class tokenizer with POS; keep base form.
    • Korean: use a Nori-class analyzer; index decompounded and compound forms consistently.
  2. Normalize before segmenting

    • Apply NFKC for width and compatibility forms (see page links above).
    • Keep punctuation folding consistent across index/search.
  3. Unify index-time and query-time configs

    • Same language, same tokenizer, same stop/fold rules. No “smart defaults”.
  4. Chunk on semantic units, not line breaks

    • Respect sentence and phrase boundaries after segmentation (a minimal end-to-end sketch follows this list).
    • Store offsets, tokens, and section_id in the snippet schema.
  5. Probe

    • Log tokens/char, unique-term ratio, OOV rate, and ΔS before/after.
    • If ΔS stays ≥ 0.60 with good segmentation, revisit metric/index mismatch.
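
A minimal store-agnostic sketch of steps 1 through 4, assuming jieba as the Chinese segmenter; fugashi (Japanese) or kiwipiepy (Korean) would slot in the same way, and the sentence-boundary regex is a deliberately simple placeholder:

```python
# Minimal NFKC -> segment -> chunk pipeline for steps 1-4 above.
# Assumption: jieba is one reasonable dictionary-based segmenter; swap in
# fugashi (JP) or kiwipiepy (KR) without changing the pipeline shape.
import hashlib
import re
import unicodedata

import jieba

def normalize(text: str) -> str:
    # Step 2: fold width and compatibility forms before any word-break.
    return unicodedata.normalize("NFKC", text)

def segment(text: str) -> list[str]:
    # Step 1: dictionary-based word-break; reuse verbatim at query time.
    return [t for t in jieba.cut(text) if t.strip()]

def chunk(raw: str, section_id: str) -> list[dict]:
    # Steps 3-4: same pipeline for ingestion and live queries; chunk on
    # sentence boundaries and keep offsets/tokens in the snippet schema.
    text = normalize(raw)
    chunks, offset = [], 0
    for sent in re.split(r"(?<=[。！？!?])", text):
        if sent.strip():
            chunks.append({
                "section_id": section_id,
                "offset": offset,
                "text": sent,
                "tokens": segment(sent),
            })
        offset += len(sent)
    return chunks

# Step 5 helper: a pipeline hash stored with every snippet lets the
# index and query paths fail closed when their configs drift apart.
PIPELINE_HASH = hashlib.sha256(b"nfkc+jieba+sentence-chunks-v1").hexdigest()[:12]
```

Writing PIPELINE_HASH into snippet metadata is what the vector-store adapters below check against.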

Store adapters (quick recipes)

  • Elasticsearch / OpenSearch

    • CN: install and set a CJK analyzer; index + search use the same analyzer.
    • JP: kuromoji with baseform filter; disable random synonyms unless audited.
    • KR: nori; keep decompound mode consistent at index+query.
    • Verify with _analyze samples; reindex after any analyzer change (a settings sketch follows this list).
  • pgvector / Postgres

    • Segmentation happens before embedding. Pre-segment text in ETL.
    • Keep the same pipeline for ingestion and live queries.
  • Weaviate / Qdrant / Chroma / Milvus / FAISS

    • The vector store won't fix segmentation. Preprocess: NFKC → CJK segmenter → chunk.
    • Log the preprocessing hash in metadata; fail closed on mismatch.
  • Vespa / Typesense / Elastic-compatible

    • Use the platform's CJK tokenizer if available; otherwise pre-segment and index the segmented text as the field value.
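
For the Elasticsearch / OpenSearch row, a minimal settings sketch, assuming the kuromoji plugin is installed; the index name, field name, and client usage are illustrative:

```python
# Japanese index whose index-time and search-time analyzers are the same
# kuromoji chain. Assumptions: kuromoji plugin installed; "docs-ja" and
# the "body" field are illustrative names, not a prescribed schema.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

es.indices.create(
    index="docs-ja",
    settings={
        "analysis": {
            "analyzer": {
                "ja_text": {
                    "type": "custom",
                    "tokenizer": "kuromoji_tokenizer",
                    # Keep base forms, fold half/full width; no synonym
                    # filter unless it has been audited.
                    "filter": ["kuromoji_baseform", "cjk_width", "lowercase"],
                }
            }
        }
    },
    mappings={
        "properties": {
            "body": {
                "type": "text",
                "analyzer": "ja_text",         # index time
                "search_analyzer": "ja_text",  # query time: must match
            }
        }
    },
)

# Verify with _analyze before trusting recall numbers; reindex after changes.
resp = es.indices.analyze(index="docs-ja", analyzer="ja_text",
                          text="東京都に住んでいます")
print([t["token"] for t in resp["tokens"]])
```

pgvector and the pure vector stores skip the analyzer config entirely; they receive pre-segmented text from a pipeline like the sketch under the fix checklist.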

Deep diagnostics

  • Three-way segmentation A/B/C
    Try three segmenters; compute ΔS and tokens/char on a small gold set. Pick the lowest ΔS with stable λ (a probe sketch follows these bullets).

  • Anchor triangulation
    Compare ΔS to the correct anchor vs a decoy section. If both are close, you're still at char-overlap, not word-level meaning.

  • Rerank sanity
    After proper segmentation, reranking should lift precision. If not, check analyzer mismatch between index and query path.
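
A sketch of the three-way probe, assuming jieba as one candidate bracketed by two degenerate baselines; the gold sentences are placeholders, and ΔS per candidate comes from the probe under "Acceptance targets":

```python
# Three-way segmentation A/B/C probe: tokens/char and unique-term ratio
# per candidate on a small gold set. Assumption: jieba is candidate C;
# the two baselines exist only to bracket the measurement.
import jieba

def seg_whitespace(text: str) -> list[str]:
    return text.split()  # A: what a whitespace analyzer effectively sees

def seg_chars(text: str) -> list[str]:
    return list(text.replace(" ", ""))  # B: worst case, 1-char shards

def seg_jieba(text: str) -> list[str]:
    return [t for t in jieba.cut(text) if t.strip()]  # C: real word-break

GOLD = ["北京今天晴，气温二十度。", "模型检索失败的常见原因是分词错误。"]

for name, seg in {"A:whitespace": seg_whitespace,
                  "B:chars": seg_chars,
                  "C:jieba": seg_jieba}.items():
    toks = [t for s in GOLD for t in seg(s)]
    chars = sum(len(s) for s in GOLD)
    print(f"{name}: tokens/char={len(toks) / chars:.2f}  "
          f"unique-term ratio={len(set(toks)) / max(len(toks), 1):.2f}")
```

The baselines bracket the reading: near-zero tokens/char means whole-sentence tokens and empty recall, a ratio near 1.0 means single-character shards, and a sound CJK segmenter lands between the two and under the 0.7 target.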


Copy-paste prompt for the LLM step


You have TXT OS and WFGY Problem Map loaded.

My CJK issue:

* symptom: [one line]
* traces: ΔS(question,retrieved)=..., tokens/char=..., OOV_before=..., OOV_after=...

Tell me:

1. which layer failed (segmentation, normalization, index/search mismatch),
2. which exact WFGY page to open,
3. the minimal steps to push ΔS ≤ 0.45 and keep λ convergent,
4. a reproducible test (3 paraphrases × 2 seeds) to verify the fix.
   Use BBMC/BBCR/BBPF/BBAM when relevant.


Next planned page

rtl_bidi_directionality.md (Arabic/Hebrew mixing, mirroring, numerals)


🔗 Quick-Start Downloads (60 sec)

| Tool | Link | 3-Step Setup |
|------|------|--------------|
| WFGY 1.0 PDF | Engine Paper | 1 Download · 2 Upload to your LLM · 3 Ask "Answer using WFGY + \<your question\>" |
| TXT OS (plain-text OS) | TXTOS.txt | 1 Download · 2 Paste into any LLM chat · 3 Type "hello world" — OS boots instantly |

🧭 Explore More

| Module | Description | Link |
|--------|-------------|------|
| WFGY Core | WFGY 2.0 engine is live: full symbolic reasoning architecture and math stack | View → |
| Problem Map 1.0 | Initial 16-mode diagnostic and symbolic fix framework | View → |
| Problem Map 2.0 | RAG-focused failure tree, modular fixes, and pipelines | View → |
| Semantic Clinic Index | Expanded failure catalog: prompt injection, memory bugs, logic drift | View → |
| Semantic Blueprint | Layer-based symbolic reasoning & semantic modulations | View → |
| Benchmark vs GPT-5 | Stress test GPT-5 with full WFGY reasoning suite | View → |
| 🧙‍♂️ Starter Village 🏡 | New here? Lost in symbols? Click here and let the wizard guide you through | Start → |

👑 Early Stargazers: See the Hall of Fame
Engineers, hackers, and open source builders who supported WFGY from day one.

WFGY Engine 2.0 is already unlocked. Star the repo to help others discover it and unlock more on the Unlock Board.

WFGY Main   TXT OS   Blah   Blot   Bloc   Blur   Blow