9.8 KiB
CJK Segmentation & Word-Break — Guardrails and Fix Pattern
🧭 Quick Return to Map
You are in a sub-page of LanguageLocale.
To reorient, go back here:
- LanguageLocale — localization, regional settings, and context adaptation
- WFGY Global Fix Map — main Emergency Room, 300+ structured fixes
- WFGY Problem Map 1.0 — 16 reproducible failure modes
Think of this page as a desk within a ward.
If you need the full triage and all prescriptions, return to the Emergency Room lobby.
Stabilize retrieval and ranking for Chinese / Japanese / Korean text where whitespace is not a reliable token boundary.
Use this page to localize segmentation failures, choose the right analyzer, and verify the fix with measurable targets.
When to use this page
- High similarity by characters but wrong meaning or empty recall on whole queries.
- BM25 looks random; tiny single-character tokens dominate the index.
- Citations cut through the middle of words; snippet offsets don’t match what users see.
- Mixed CJK + Latin queries split unpredictably across runs or providers.
Open these first
- Visual map and recovery → rag-architecture-and-recovery.md
- End-to-end retrieval knobs → retrieval-playbook.md
- Why this snippet (traceability) → retrieval-traceability.md
- Payload schema & cite-then-explain → data-contracts.md
- Chunking checklist for semantic boundaries → chunking-checklist.md
- Embedding ≠ meaning (sanity) → embedding-vs-semantic.md
- Related locale pages: tokenizer_mismatch.md · script_mixing.md · digits_width_punctuation.md · diacritics_and_folding.md · locale_drift.md
Acceptance targets
- ΔS(question, retrieved) ≤ 0.45 on 3 paraphrases
- Coverage of target section ≥ 0.70
- λ remains convergent across 2 seeds
- Tokenization sanity: OOV rate falls by ≥ 40% vs whitespace; tokens/char ≤ 0.7 on CJK pages
- E_resonance flat on long windows
Map symptoms → structural fixes (Problem Map)
| Symptom | Likely cause | Open this |
|---|---|---|
| Query returns almost nothing; recall jumps when you add spaces | Index built with whitespace/Latin analyzer on CJK | chunking-checklist.md, retrieval-playbook.md |
| Top-k filled with 1-char shards, citations cut mid-word | No CJK word-break at index or search time | retrieval-traceability.md, data-contracts.md |
| BM25 unstable; hybrid worse than single retriever | Search-time analyzer ≠ index-time analyzer | retrieval-playbook.md |
| Romanized terms and CJK compound in one query break apart | Mixed script + width + punctuation rules differ | script_mixing.md, digits_width_punctuation.md |
| High similarity, wrong meaning | Character-level overlap, no semantic units | embedding-vs-semantic.md |
60-second fix checklist (store-agnostic)
-
Pick the right analyzer and lock it
- Chinese: use a dictionary or statistical segmenter at index + search.
- Japanese: use a MeCab/Kuromoji-class tokenizer with POS; keep base form.
- Korean: use a Nori-class analyzer; index decomp+comp forms consistently.
-
Normalize before segmenting
- Apply NFKC for width and compatibility forms (see page links above).
- Keep punctuation folding consistent across index/search.
-
Unify index-time and query-time configs
- Same language, same tokenizer, same stop/fold rules. No “smart defaults”.
-
Chunk on semantic units, not line breaks
- Respect sentence/phrase boundaries after segmentation.
- Store
offsets,tokens,section_idin the snippet schema.
-
Probe
- Log tokens/char, unique-term ratio, OOV rate, and ΔS before/after.
- If ΔS stays ≥ 0.60 with good segmentation, revisit metric/index mismatch.
Store adapters (quick recipes)
-
Elasticsearch / OpenSearch
- CN: install and set a CJK analyzer; index + search use the same analyzer.
- JP: kuromoji with baseform filter; disable random synonyms unless audited.
- KR: nori; keep decompound mode consistent at index+query.
- Verify with
_analyzesamples; reindex after any analyzer change.
-
pgvector / Postgres
- Segmentation happens before embedding. Pre-segment text in ETL.
- Keep the same pipeline for ingestion and live queries.
-
Weaviate / Qdrant / Chroma / Milvus / FAISS
- The vector store won’t fix segmentation. Preprocess: NFKC → CJK segmenter → chunk.
- Log the preprocessing hash in metadata; fail closed on mismatch.
-
Vespa / Typesense / Elastic-compatible
- Use the platform’s CJK tokenizer if available; otherwise pre-segment and index the segmented text as the field value.
Deep diagnostics
-
Three-way segmentation A/B/C
Try 3 segmenters; compute ΔS and tokens/char on a small gold set. Pick the lowest ΔS with stable λ. -
Anchor triangulation
Compare ΔS to the correct anchor vs a decoy section. If both are close, you’re still at char-overlap, not word-level meaning. -
Rerank sanity
After proper segmentation, reranking should lift precision. If not, check analyzer mismatch between index and query path.
Copy-paste prompt for the LLM step
You have TXT OS and WFGY Problem Map loaded.
My CJK issue:
* symptom: \[one line]
* traces: ΔS(question,retrieved)=..., tokens/char=..., OOV\_before=..., OOV\_after=...
Tell me:
1. which layer failed (segmentation, normalization, index/search mismatch),
2. which exact WFGY page to open,
3. the minimal steps to push ΔS ≤ 0.45 and keep λ convergent,
4. a reproducible test (3 paraphrases × 2 seeds) to verify the fix.
Use BBMC/BBCR/BBPF/BBAM when relevant.
Next planned page
rtl_bidi_directionality.md (Arabic/Hebrew mixing, mirroring, numerals)
🔗 Quick-Start Downloads (60 sec)
| Tool | Link | 3-Step Setup |
|---|---|---|
| WFGY 1.0 PDF | Engine Paper | 1️⃣ Download · 2️⃣ Upload to your LLM · 3️⃣ Ask “Answer using WFGY + <your question>” |
| TXT OS (plain-text OS) | TXTOS.txt | 1️⃣ Download · 2️⃣ Paste into any LLM chat · 3️⃣ Type “hello world” — OS boots instantly |
Explore More
| Layer | Page | What it’s for |
|---|---|---|
| ⭐ Proof | WFGY Recognition Map | External citations, integrations, and ecosystem proof |
| ⚙️ Engine | WFGY 1.0 | Original PDF tension engine and early logic sketch (legacy reference) |
| ⚙️ Engine | WFGY 2.0 | Production tension kernel for RAG and agent systems |
| ⚙️ Engine | WFGY 3.0 | TXT based Singularity tension engine (131 S class set) |
| 🗺️ Map | Problem Map 1.0 | Flagship 16 problem RAG failure taxonomy and fix map |
| 🗺️ Map | Problem Map 2.0 | Global Debug Card for RAG and agent pipeline diagnosis |
| 🗺️ Map | Problem Map 3.0 | Global AI troubleshooting atlas and failure pattern map |
| 🧰 App | TXT OS | .txt semantic OS with fast bootstrap |
| 🧰 App | Blah Blah Blah | Abstract and paradox Q&A built on TXT OS |
| 🧰 App | Blur Blur Blur | Text to image generation with semantic control |
| 🏡 Onboarding | Starter Village | Guided entry point for new users |
If this repository helped, starring it improves discovery so more builders can find the docs and tools.