koboldcpp/models
Kabir Potdar 42532afff4
unicode,test: add Qwen3.5 non-backtracking tokenizer handler and regr… (#22110)
* unicode,test: add Qwen3.5 non-backtracking tokenizer handler and regression tests

- Add unicode_regex_split_custom_qwen35() to [src/unicode.cpp](src/unicode.cpp), a non-backtracking handler for Qwen3.5's [\p{L}\p{M}]+ regex (letters + combining marks).
- Register the handler in the custom tokenizer dispatch table to prevent stack overflows on long inputs (fixes #21919).
- Add [models/ggml-vocab-qwen35.gguf](models/ggml-vocab-qwen35.gguf) (test vocab), [models/ggml-vocab-qwen35.gguf.inp](models/ggml-vocab-qwen35.gguf.inp) (test cases), and [models/ggml-vocab-qwen35.gguf.out](models/ggml-vocab-qwen35.gguf.out) (expected output) for regression testing.
- Update [tests/CMakeLists.txt](tests/CMakeLists.txt) to include the new test entry.

This mirrors the Qwen2 fix (commit 0d049d6), adapted to Qwen3.5's regex. It keeps Unicode tokenization robust on long inputs and prevents std::regex stack overflows.
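
As a rough illustration of what "non-backtracking" means for a pattern like [\p{L}\p{M}]+, the sketch below scans code points once, left to right, with an explicit loop instead of a recursive regex engine. It is not the code added to src/unicode.cpp: the names `split_letter_mark_runs`, `is_letter`, and `is_mark` are hypothetical, and the two classifiers cover only a few Latin and combining-mark ranges as a stand-in for full Unicode category tables.

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical, deliberately tiny classifiers (assumption): a real handler
// would consult complete Unicode category data for \p{L} and \p{M}.
static bool is_letter(uint32_t cp) {
    return (cp >= 'A' && cp <= 'Z') || (cp >= 'a' && cp <= 'z') ||
           // Latin-1 Supplement through Latin Extended-B, minus the
           // multiplication and division signs
           (cp >= 0x00C0 && cp <= 0x024F && cp != 0x00D7 && cp != 0x00F7);
}

static bool is_mark(uint32_t cp) {
    return cp >= 0x0300 && cp <= 0x036F; // Combining Diacritical Marks block
}

// Returns [begin, end) offsets of every maximal run of letters and marks.
// Each code point is visited once; stack depth does not grow with input size.
static std::vector<std::pair<size_t, size_t>>
split_letter_mark_runs(const std::vector<uint32_t> & cpts) {
    std::vector<std::pair<size_t, size_t>> runs;
    size_t i = 0;
    while (i < cpts.size()) {
        if (!is_letter(cpts[i]) && !is_mark(cpts[i])) {
            ++i; // not part of a run, skip it
            continue;
        }
        const size_t start = i;
        while (i < cpts.size() && (is_letter(cpts[i]) || is_mark(cpts[i]))) {
            ++i; // extend the current run
        }
        runs.emplace_back(start, i);
    }
    return runs;
}
```

Called on the code points `e`, U+0301, space, `o`, `k`, this sketch yields the offset pairs (0, 2) and (3, 5): the combining acute accent stays attached to its base letter. Because the scan is a flat loop, an input of a million letters costs no more stack than an input of one.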

Closes #21919.

* fix: enhance regex handling for Qwen3.5 tokenizer to include accent marks

* cont : remove trailing whitespace

---------

Co-authored-by: Kabir <kabir@example.com>
Co-authored-by: Alde Rojas <hello@alde.dev>
2026-05-14 11:03:40 +02:00
templates autoparser: support case of JSON_NATIVE with per-call markers (test case: Reka-Edge) (#21892) 2026-04-15 10:51:50 +02:00
.editorconfig
ggml-vocab-aquila.gguf
ggml-vocab-baichuan.gguf
ggml-vocab-bert-bge.gguf
ggml-vocab-bert-bge.gguf.inp convert : allow partial update to the chkhsh pre-tokenizer list (#13847) 2025-05-30 12:24:37 +02:00
ggml-vocab-bert-bge.gguf.out convert : allow partial update to the chkhsh pre-tokenizer list (#13847) 2025-05-30 12:24:37 +02:00
ggml-vocab-command-r.gguf command-r : add BPE pre-tokenization (#7063) 2024-05-05 08:19:30 +03:00
ggml-vocab-command-r.gguf.inp convert : allow partial update to the chkhsh pre-tokenizer list (#13847) 2025-05-30 12:24:37 +02:00
ggml-vocab-command-r.gguf.out convert : allow partial update to the chkhsh pre-tokenizer list (#13847) 2025-05-30 12:24:37 +02:00
ggml-vocab-deepseek-coder.gguf
ggml-vocab-deepseek-coder.gguf.inp convert : allow partial update to the chkhsh pre-tokenizer list (#13847) 2025-05-30 12:24:37 +02:00
ggml-vocab-deepseek-coder.gguf.out convert : allow partial update to the chkhsh pre-tokenizer list (#13847) 2025-05-30 12:24:37 +02:00
ggml-vocab-deepseek-llm.gguf
ggml-vocab-deepseek-llm.gguf.inp convert : allow partial update to the chkhsh pre-tokenizer list (#13847) 2025-05-30 12:24:37 +02:00
ggml-vocab-deepseek-llm.gguf.out convert : allow partial update to the chkhsh pre-tokenizer list (#13847) 2025-05-30 12:24:37 +02:00
ggml-vocab-falcon.gguf
ggml-vocab-falcon.gguf.inp convert : allow partial update to the chkhsh pre-tokenizer list (#13847) 2025-05-30 12:24:37 +02:00
ggml-vocab-falcon.gguf.out convert : allow partial update to the chkhsh pre-tokenizer list (#13847) 2025-05-30 12:24:37 +02:00
ggml-vocab-gemma-4.gguf vocab: add gemma4 tokenizer tests, fix edge case (#21534) 2026-04-09 11:41:14 +02:00
ggml-vocab-gemma-4.gguf.inp vocab: add gemma4 tokenizer tests, fix edge case (#21534) 2026-04-09 11:41:14 +02:00
ggml-vocab-gemma-4.gguf.out vocab: add gemma4 tokenizer tests, fix edge case (#21534) 2026-04-09 11:41:14 +02:00
ggml-vocab-gpt-2.gguf
ggml-vocab-gpt-2.gguf.inp convert : allow partial update to the chkhsh pre-tokenizer list (#13847) 2025-05-30 12:24:37 +02:00
ggml-vocab-gpt-2.gguf.out convert : allow partial update to the chkhsh pre-tokenizer list (#13847) 2025-05-30 12:24:37 +02:00
ggml-vocab-gpt-neox.gguf
ggml-vocab-llama-bpe.gguf
ggml-vocab-llama-bpe.gguf.inp convert : allow partial update to the chkhsh pre-tokenizer list (#13847) 2025-05-30 12:24:37 +02:00
ggml-vocab-llama-bpe.gguf.out convert : allow partial update to the chkhsh pre-tokenizer list (#13847) 2025-05-30 12:24:37 +02:00
ggml-vocab-llama-spm.gguf
ggml-vocab-llama-spm.gguf.inp convert : allow partial update to the chkhsh pre-tokenizer list (#13847) 2025-05-30 12:24:37 +02:00
ggml-vocab-llama-spm.gguf.out convert : allow partial update to the chkhsh pre-tokenizer list (#13847) 2025-05-30 12:24:37 +02:00
ggml-vocab-mpt.gguf
ggml-vocab-mpt.gguf.inp convert : allow partial update to the chkhsh pre-tokenizer list (#13847) 2025-05-30 12:24:37 +02:00
ggml-vocab-mpt.gguf.out convert : allow partial update to the chkhsh pre-tokenizer list (#13847) 2025-05-30 12:24:37 +02:00
ggml-vocab-nomic-bert-moe.gguf tests : improve UGM tokenizer test coverage (#13773) 2025-05-25 16:22:29 +02:00
ggml-vocab-phi-3.gguf Per token attributes (#7685) 2024-06-04 09:17:17 +02:00
ggml-vocab-phi-3.gguf.inp convert : allow partial update to the chkhsh pre-tokenizer list (#13847) 2025-05-30 12:24:37 +02:00
ggml-vocab-phi-3.gguf.out convert : allow partial update to the chkhsh pre-tokenizer list (#13847) 2025-05-30 12:24:37 +02:00
ggml-vocab-qwen2.gguf llama : add BPE pre-tokenization for Qwen2 (#7114) 2024-05-08 15:06:43 +03:00
ggml-vocab-qwen2.gguf.inp convert : allow partial update to the chkhsh pre-tokenizer list (#13847) 2025-05-30 12:24:37 +02:00
ggml-vocab-qwen2.gguf.out convert : allow partial update to the chkhsh pre-tokenizer list (#13847) 2025-05-30 12:24:37 +02:00
ggml-vocab-qwen35.gguf unicode,test: add Qwen3.5 non-backtracking tokenizer handler and regr… (#22110) 2026-05-14 11:03:40 +02:00
ggml-vocab-qwen35.gguf.inp unicode,test: add Qwen3.5 non-backtracking tokenizer handler and regr… (#22110) 2026-05-14 11:03:40 +02:00
ggml-vocab-qwen35.gguf.out unicode,test: add Qwen3.5 non-backtracking tokenizer handler and regr… (#22110) 2026-05-14 11:03:40 +02:00
ggml-vocab-refact.gguf tests : add test-tokenizer-0.sh + fix some tokenizers (#7036) 2024-05-04 08:32:32 +03:00
ggml-vocab-refact.gguf.inp convert : allow partial update to the chkhsh pre-tokenizer list (#13847) 2025-05-30 12:24:37 +02:00
ggml-vocab-refact.gguf.out convert : allow partial update to the chkhsh pre-tokenizer list (#13847) 2025-05-30 12:24:37 +02:00
ggml-vocab-starcoder.gguf
ggml-vocab-starcoder.gguf.inp convert : allow partial update to the chkhsh pre-tokenizer list (#13847) 2025-05-30 12:24:37 +02:00
ggml-vocab-starcoder.gguf.out convert : allow partial update to the chkhsh pre-tokenizer list (#13847) 2025-05-30 12:24:37 +02:00