koboldcpp

mirror of https://github.com/LostRuins/koboldcpp.git synced 2026-05-22 11:16:08 +00:00

Kashif Rasul 7ea23ddf7b vocab : add Carbon-3B (HybridDNATokenizer) support (#23410)

* vocab : add Carbon-3B (HybridDNATokenizer) support

  Adds a new BPE pre-type LLAMA_VOCAB_PRE_TYPE_CARBON for the HybridDNATokenizer used by HuggingFaceBio/Carbon-{500M,3B,8B}. The base BPE is Qwen3-4B-Base's; what differs is that text inside <dna>...</dna> regions is chunked into fixed 6-mers (right-padded with 'A' on the trailing partial), and any base outside ACGT maps to <oov>.

  * src/llama-vocab.{h,cpp}: new pre-type, dispatched from llm_tokenizer_bpe_session::tokenize.
  * src/llama-vocab-carbon.h: pure helpers (tokenize_carbon, emit_dna_kmers) factored out for unit testing; no llama_vocab dependency, vocab access goes through a std::function.
  * conversion/base.py: detect HybridDNATokenizer by class name in get_vocab_base_pre (chktxt collides with Qwen3 base since it has no <dna>), and pass trust_remote_code=True in get_vocab_base so the custom tokenizer class can load.
  * tests/test-tokenizer-carbon.cpp: 12 cases covering single 6-mer, multi 6-mer, lowercase, invalid base -> <oov>, partial k-mer right-pad, mixed text+DNA, empty <dna></dna>, unterminated <dna>, two regions, vocab miss.

* vocab : align Carbon-3B changes with llama.cpp conventions

  * Fold tokenize_carbon + emit_dna_kmers inline into llm_tokenizer_bpe_session (drop src/llama-vocab-carbon.h), matching how every other tokenizer keeps its helpers inside llama-vocab.cpp.
  * Replace the standalone unit test with the conventional test-tokenizer-0 row backed by models/ggml-vocab-carbon.gguf (vocab-only conversion) + .inp/.out fixtures covering single 6-mer, multi 6-mer, lowercase, invalid base -> <oov>, partial right-pad, mixed text+DNA, empty <dna></dna>, unterminated <dna>, two regions.
  * Register "carbon" in convert_hf_to_gguf_update.py's model list (pointing at HuggingFaceBio/Carbon-3B) and teach both AutoTokenizer call sites in the updater to pass trust_remote_code=True for it, matching how t5 is special-cased.

* vocab : move Carbon dispatch to _set_vocab_carbon + LlamaModel branch

  Refactor the conversion-side changes to follow the per-tokenizer-family convention used by _set_vocab_qwen, _set_vocab_interns1, _set_vocab_glm, etc. instead of conditionalising the shared get_vocab_base / get_vocab_base_pre paths.

  * conversion/base.py: add _set_vocab_carbon: self-contained, loads with trust_remote_code=True so HybridDNATokenizer's merged Qwen3 + DNA vocab is visible, writes tokenizer.ggml.pre = "carbon" directly.
  * conversion/llama.py: branch in LlamaModel.set_vocab on tokenizer_config.json["tokenizer_class"] == "HybridDNATokenizer" and dispatch to _set_vocab_carbon. Same precedent as conversion/bert.py (tokenizer_class branch between BertTokenizer / RobertaTokenizer) and conversion/phi.py.
  * conversion/base.py: revert the conditional in get_vocab_base and the class-name short-circuit in the auto-generated get_vocab_base_pre.

* tests : expand ggml-vocab-carbon.gguf fixtures with model-card examples

  Add 6 cases from the Carbon-3B model card on top of the existing edge coverage: the unterminated basic-completion prompt, the closed 33-bp example, the metadata-conditioned prompt (with <vertebrate_mammalian> and <protein_coding_region>, which BPE-decompose since they are not in the vocab), the documented anti-pattern of raw DNA without <dna> tags, and the two likelihood-scoring examples. Brings the suite to 19 cases.

* vocab : promote HybridDNATokenizer to its own LLAMA_VOCAB_TYPE

  Refactor per upstream review:

  > This should be its own tokenizer model, ie. carbonhybriddna instead
  > of gpt2 and not carbon pre-tokenizer. That way you can keep the
  > correct pre-tokenizer, in case that ever changes.

  Previously the tokenizer was modelled as LLAMA_VOCAB_TYPE_BPE plus a new LLAMA_VOCAB_PRE_TYPE_CARBON, which (a) put a CARBON-specific branch inside llm_tokenizer_bpe_session::tokenize (existing pre-types differ only in regex, not dispatch logic), and (b) conflated "hybrid DNA tokenization" with "Qwen3 BPE pre-tokenizer". This change moves it to its own vocab type, peer to PLAMO2, with the GGUF model name matching the HF tokenizer class (HybridDNATokenizer):

  * include/llama.h: new LLAMA_VOCAB_TYPE_HYBRIDDNA = 7.
  * src/llama-vocab.cpp: new llm_tokenizer_hybriddna + session that owns std::unique_ptr<llm_tokenizer_bpe> for non-<dna> text and routes raw text through a DNA-aware splitter; wired into init_tokenizer, tokenize, type_name, byte_to_token, and the BPE-style token_to_piece case (DNA k-mers + <dna>/</dna>/<oov> are pure ASCII, so byte-level BPE decoding handles them). LLAMA_VOCAB_TYPE_HYBRIDDNA gets its own branch in the vocab-type config block alongside SPM/WPM/UGM/RWKV, where pre_type is set to QWEN2 and the matching add_space_prefix / escape_whitespaces / clean_spaces flags are applied, mirroring qwen2's BPE path so byte-level BPE merging stays bit-identical to the Python reference for non-DNA text.
  * src/llama-vocab.h: drop the short-lived LLAMA_VOCAB_PRE_TYPE_CARBON.
  * conversion/base.py: _set_vocab_hybriddna writes tokenizer.ggml.model = "hybriddna" (no separate pre).
  * conversion/llama.py: dispatch on tokenizer_class == "HybridDNATokenizer" same as bert.py / phi.py do.
  * models/ggml-vocab-hybriddna.gguf{,.inp,.out}: renamed fixture + regenerated metadata.
  * convert_hf_to_gguf_update.py: drop the stale chkhsh entry and trust_remote_code special-case (no longer needed since dispatch is now class-name driven, not chkhsh).

  Verified end-to-end against HuggingFaceBio/Carbon-{500M,3B,8B}: tokenization is bit-identical to the Python HybridDNATokenizer for all 19 test fixtures plus the model-card metadata-conditioned prompt; greedy completion produces the same DNA continuation as the Python reference; spec-dec with 500M as draft for 8B still works.

* vocab : relax llm_tokenizer_bpe assert to allow HYBRIDDNA
* vocab : drop llm_tokenizer_bpe vocab-type assert
* vocab : write tokenizer.ggml.pre for HYBRIDDNA, share BPE dispatch
* vocab : assert BPE or HYBRIDDNA in llm_tokenizer_bpe
* vocab : annotate #endif with PRETOKENIZERDEBUG
* vocab : drop local hybriddna fixture (moves to ggml-org/vocabs)
* deduplicate
* simplify
* simplify

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

2026-05-21 08:34:32 +02:00
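
The DNA-region rule described in the commit message (fixed 6-mers inside <dna>...</dna>, trailing partial right-padded with 'A', bases outside ACGT mapped to <oov>) can be sketched in Python roughly as follows. This is an illustration only, not the code in src/llama-vocab.cpp or the Python HybridDNATokenizer. The names dna_kmers and split_hybrid are hypothetical, and three behaviours are assumptions the commit message does not spell out: lowercase bases are upper-cased, a 6-mer containing any invalid base becomes a single <oov>, and an unterminated <dna> treats the rest of the input as DNA. Non-DNA text is returned untouched here; in the real tokenizer it goes through the BPE path.

```python
from typing import List, Tuple

K = 6  # k-mer width stated in the commit message
DNA_OPEN, DNA_CLOSE, OOV = "<dna>", "</dna>", "<oov>"


def dna_kmers(seq: str, k: int = K) -> List[str]:
    """Chunk a DNA string into fixed k-mers, right-padding the last with 'A'."""
    seq = seq.upper()  # assumption: lowercase bases are case-folded
    pieces = []
    for i in range(0, len(seq), k):
        chunk = seq[i:i + k].ljust(k, "A")
        # assumption: one invalid base turns the whole k-mer into <oov>
        pieces.append(chunk if set(chunk) <= set("ACGT") else OOV)
    return pieces


def split_hybrid(text: str) -> List[Tuple[str, str]]:
    """Return ("text", s) and ("dna", piece) segments in input order."""
    out: List[Tuple[str, str]] = []
    pos = 0
    while pos < len(text):
        start = text.find(DNA_OPEN, pos)
        if start < 0:
            out.append(("text", text[pos:]))  # no more DNA regions
            break
        if start > pos:
            out.append(("text", text[pos:start]))
        out.append(("dna", DNA_OPEN))
        body_start = start + len(DNA_OPEN)
        end = text.find(DNA_CLOSE, body_start)
        # assumption: an unterminated <dna> runs to the end of the input
        body_end = len(text) if end < 0 else end
        out.extend(("dna", p) for p in dna_kmers(text[body_start:body_end]))
        if end < 0:
            break
        out.append(("dna", DNA_CLOSE))
        pos = end + len(DNA_CLOSE)
    return out
```

Under these assumptions, split_hybrid("Seq: <dna>ACGTACGTAC</dna>") yields the text segment "Seq: " followed by the DNA pieces <dna>, ACGTAC, GTACAA and </dna>: the 4-base tail GTAC is padded to GTACAA.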
__init__.py	Refactor: convert_hf_to_gguf.py (#17114)	2026-05-15 15:18:12 +02:00
afmoe.py	Refactor: convert_hf_to_gguf.py (#17114)	2026-05-15 15:18:12 +02:00
arctic.py	Refactor: convert_hf_to_gguf.py (#17114)	2026-05-15 15:18:12 +02:00
baichuan.py	Refactor: convert_hf_to_gguf.py (#17114)	2026-05-15 15:18:12 +02:00
bailingmoe.py	Refactor: convert_hf_to_gguf.py (#17114)	2026-05-15 15:18:12 +02:00
base.py	vocab : add Carbon-3B (HybridDNATokenizer) support (#23410)	2026-05-21 08:34:32 +02:00
bert.py	Refactor: convert_hf_to_gguf.py (#17114)	2026-05-15 15:18:12 +02:00
bitnet.py	Refactor: convert_hf_to_gguf.py (#17114)	2026-05-15 15:18:12 +02:00
bloom.py	Refactor: convert_hf_to_gguf.py (#17114)	2026-05-15 15:18:12 +02:00
chameleon.py	Refactor: convert_hf_to_gguf.py (#17114)	2026-05-15 15:18:12 +02:00
chatglm.py	Refactor: convert_hf_to_gguf.py (#17114)	2026-05-15 15:18:12 +02:00
codeshell.py	Refactor: convert_hf_to_gguf.py (#17114)	2026-05-15 15:18:12 +02:00
cogvlm.py	Refactor: convert_hf_to_gguf.py (#17114)	2026-05-15 15:18:12 +02:00
command_r.py	Refactor: convert_hf_to_gguf.py (#17114)	2026-05-15 15:18:12 +02:00
dbrx.py	Refactor: convert_hf_to_gguf.py (#17114)	2026-05-15 15:18:12 +02:00
deci.py	Refactor: convert_hf_to_gguf.py (#17114)	2026-05-15 15:18:12 +02:00
deepseek.py	Refactor: convert_hf_to_gguf.py (#17114)	2026-05-15 15:18:12 +02:00
dots1.py	Refactor: convert_hf_to_gguf.py (#17114)	2026-05-15 15:18:12 +02:00
dotsocr.py	Refactor: convert_hf_to_gguf.py (#17114)	2026-05-15 15:18:12 +02:00
dream.py	Refactor: convert_hf_to_gguf.py (#17114)	2026-05-15 15:18:12 +02:00
ernie.py	Refactor: convert_hf_to_gguf.py (#17114)	2026-05-15 15:18:12 +02:00
exaone.py	Refactor: convert_hf_to_gguf.py (#17114)	2026-05-15 15:18:12 +02:00
falcon.py	Refactor: convert_hf_to_gguf.py (#17114)	2026-05-15 15:18:12 +02:00
falcon_h1.py	Refactor: convert_hf_to_gguf.py (#17114)	2026-05-15 15:18:12 +02:00
gemma.py	Refactor: convert_hf_to_gguf.py (#17114)	2026-05-15 15:18:12 +02:00
glm.py	Refactor: convert_hf_to_gguf.py (#17114)	2026-05-15 15:18:12 +02:00
gpt2.py	Refactor: convert_hf_to_gguf.py (#17114)	2026-05-15 15:18:12 +02:00
gpt_oss.py	Refactor: convert_hf_to_gguf.py (#17114)	2026-05-15 15:18:12 +02:00
gptneox.py	Refactor: convert_hf_to_gguf.py (#17114)	2026-05-15 15:18:12 +02:00
granite.py	Refactor: convert_hf_to_gguf.py (#17114)	2026-05-15 15:18:12 +02:00
grok.py	Refactor: convert_hf_to_gguf.py (#17114)	2026-05-15 15:18:12 +02:00
grovemoe.py	Refactor: convert_hf_to_gguf.py (#17114)	2026-05-15 15:18:12 +02:00
hunyuan.py	mtmd, model : merge HunyuanOCR into HunyuanVL and fix OCR vision precision (#23329)	2026-05-21 00:35:37 +02:00
internlm.py	Refactor: convert_hf_to_gguf.py (#17114)	2026-05-15 15:18:12 +02:00
internvl.py	Refactor: convert_hf_to_gguf.py (#17114)	2026-05-15 15:18:12 +02:00
jais.py	Refactor: convert_hf_to_gguf.py (#17114)	2026-05-15 15:18:12 +02:00
jamba.py	Refactor: convert_hf_to_gguf.py (#17114)	2026-05-15 15:18:12 +02:00
januspro.py	Refactor: convert_hf_to_gguf.py (#17114)	2026-05-15 15:18:12 +02:00
kimi_linear.py	Refactor: convert_hf_to_gguf.py (#17114)	2026-05-15 15:18:12 +02:00
kimivl.py	Refactor: convert_hf_to_gguf.py (#17114)	2026-05-15 15:18:12 +02:00
lfm2.py	Refactor: convert_hf_to_gguf.py (#17114)	2026-05-15 15:18:12 +02:00
lighton_ocr.py	Refactor: convert_hf_to_gguf.py (#17114)	2026-05-15 15:18:12 +02:00
llada.py	Refactor: convert_hf_to_gguf.py (#17114)	2026-05-15 15:18:12 +02:00
llama.py	vocab : add Carbon-3B (HybridDNATokenizer) support (#23410)	2026-05-21 08:34:32 +02:00
llama4.py	Refactor: convert_hf_to_gguf.py (#17114)	2026-05-15 15:18:12 +02:00
llava.py	Refactor: convert_hf_to_gguf.py (#17114)	2026-05-15 15:18:12 +02:00
maincoder.py	Refactor: convert_hf_to_gguf.py (#17114)	2026-05-15 15:18:12 +02:00
mamba.py	Refactor: convert_hf_to_gguf.py (#17114)	2026-05-15 15:18:12 +02:00
mimo.py	Refactor: convert_hf_to_gguf.py (#17114)	2026-05-15 15:18:12 +02:00
minicpm.py	Refactor: convert_hf_to_gguf.py (#17114)	2026-05-15 15:18:12 +02:00
minimax.py	Refactor: convert_hf_to_gguf.py (#17114)	2026-05-15 15:18:12 +02:00
mistral.py	Refactor: convert_hf_to_gguf.py (#17114)	2026-05-15 15:18:12 +02:00
mistral3.py	Refactor: convert_hf_to_gguf.py (#17114)	2026-05-15 15:18:12 +02:00
mpt.py	Refactor: convert_hf_to_gguf.py (#17114)	2026-05-15 15:18:12 +02:00
nemotron.py	Refactor: convert_hf_to_gguf.py (#17114)	2026-05-15 15:18:12 +02:00
olmo.py	Refactor: convert_hf_to_gguf.py (#17114)	2026-05-15 15:18:12 +02:00
openelm.py	Refactor: convert_hf_to_gguf.py (#17114)	2026-05-15 15:18:12 +02:00
orion.py	Refactor: convert_hf_to_gguf.py (#17114)	2026-05-15 15:18:12 +02:00
pangu.py	Refactor: convert_hf_to_gguf.py (#17114)	2026-05-15 15:18:12 +02:00
phi.py	Refactor: convert_hf_to_gguf.py (#17114)	2026-05-15 15:18:12 +02:00
pixtral.py	Refactor: convert_hf_to_gguf.py (#17114)	2026-05-15 15:18:12 +02:00
plamo.py	Refactor: convert_hf_to_gguf.py (#17114)	2026-05-15 15:18:12 +02:00
plm.py	Refactor: convert_hf_to_gguf.py (#17114)	2026-05-15 15:18:12 +02:00
qwen.py	update bid to match each layers MTP source (#23237)	2026-05-18 12:37:12 +08:00
qwen3vl.py	convert : fix Qwen3 ASR conversion (#23081)	2026-05-15 18:38:39 +02:00
qwenvl.py	Refactor: convert_hf_to_gguf.py (#17114)	2026-05-15 15:18:12 +02:00
refact.py	Refactor: convert_hf_to_gguf.py (#17114)	2026-05-15 15:18:12 +02:00
rwkv.py	Refactor: convert_hf_to_gguf.py (#17114)	2026-05-15 15:18:12 +02:00
sarashina2.py	Refactor: convert_hf_to_gguf.py (#17114)	2026-05-15 15:18:12 +02:00
smallthinker.py	Refactor: convert_hf_to_gguf.py (#17114)	2026-05-15 15:18:12 +02:00
smolvlm.py	Refactor: convert_hf_to_gguf.py (#17114)	2026-05-15 15:18:12 +02:00
stablelm.py	Refactor: convert_hf_to_gguf.py (#17114)	2026-05-15 15:18:12 +02:00
starcoder.py	Refactor: convert_hf_to_gguf.py (#17114)	2026-05-15 15:18:12 +02:00
step3.py	Refactor: convert_hf_to_gguf.py (#17114)	2026-05-15 15:18:12 +02:00
t5.py	Refactor: convert_hf_to_gguf.py (#17114)	2026-05-15 15:18:12 +02:00
ultravox.py	Refactor: convert_hf_to_gguf.py (#17114)	2026-05-15 15:18:12 +02:00
wavtokenizer.py	Refactor: convert_hf_to_gguf.py (#17114)	2026-05-15 15:18:12 +02:00
xverse.py	Refactor: convert_hf_to_gguf.py (#17114)	2026-05-15 15:18:12 +02:00
youtuvl.py	Refactor: convert_hf_to_gguf.py (#17114)	2026-05-15 15:18:12 +02:00
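
The latest commit touching base.py and llama.py describes dispatching on the tokenizer_class field of tokenizer_config.json rather than on a checksum. A minimal sketch of that kind of check is shown here, assuming the model directory holds a standard tokenizer_config.json. The helper names tokenizer_class_of and pick_vocab_setter and the "default" fallback label are hypothetical and do not appear in the repository; only the "HybridDNATokenizer" class name and the _set_vocab_hybriddna method name come from the commit message.

```python
import json
from pathlib import Path
from typing import Optional


def tokenizer_class_of(model_dir: Path) -> Optional[str]:
    """Read tokenizer_class from tokenizer_config.json, or None if absent."""
    cfg_path = model_dir / "tokenizer_config.json"
    if not cfg_path.is_file():
        return None
    with open(cfg_path, encoding="utf-8") as f:
        return json.load(f).get("tokenizer_class")


def pick_vocab_setter(model_dir: Path) -> str:
    """Name the vocab setter to use (hypothetical helper for illustration)."""
    if tokenizer_class_of(model_dir) == "HybridDNATokenizer":
        return "_set_vocab_hybriddna"  # method name taken from the commit message
    return "default"  # placeholder label for every other tokenizer family
```

Keying on the class name sidesteps the collision the commit message mentions, where the checksum text contains no <dna> region and so hashes the same as the Qwen3 base tokenizer.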