Find a file
Kashif Rasul 7ea23ddf7b
vocab : add Carbon-3B (HybridDNATokenizer) support (#23410)
* vocab : add Carbon-3B (HybridDNATokenizer) support

Adds a new BPE pre-type LLAMA_VOCAB_PRE_TYPE_CARBON for the
HybridDNATokenizer used by HuggingFaceBio/Carbon-{500M,3B,8B}.
The base BPE is Qwen3-4B-Base's; what differs is that text inside
<dna>...</dna> regions is chunked into fixed 6-mers (right-padded
with 'A' on the trailing partial), and any base outside ACGT maps
to <oov>.

* src/llama-vocab.{h,cpp}: new pre-type, dispatched from
  llm_tokenizer_bpe_session::tokenize.
* src/llama-vocab-carbon.h: pure helpers (tokenize_carbon,
  emit_dna_kmers) factored out for unit testing — no llama_vocab
  dependency, vocab access goes through a std::function.
* conversion/base.py: detect HybridDNATokenizer by class name in
  get_vocab_base_pre (chktxt collides with Qwen3 base since it
  has no <dna>), and pass trust_remote_code=True in get_vocab_base
  so the custom tokenizer class can load.
* tests/test-tokenizer-carbon.cpp: 12 cases covering single 6-mer,
  multi 6-mer, lowercase, invalid base -> <oov>, partial k-mer
  right-pad, mixed text+DNA, empty <dna></dna>, unterminated <dna>,
  two regions, vocab miss.

* vocab : align Carbon-3B changes with llama.cpp conventions

* Fold tokenize_carbon + emit_dna_kmers inline into
  llm_tokenizer_bpe_session (drop src/llama-vocab-carbon.h),
  matching how every other tokenizer keeps its helpers inside
  llama-vocab.cpp.

* Replace the standalone unit test with the conventional
  test-tokenizer-0 row backed by models/ggml-vocab-carbon.gguf
  (vocab-only conversion) + .inp/.out fixtures covering single
  6-mer, multi 6-mer, lowercase, invalid base -> <oov>, partial
  right-pad, mixed text+DNA, empty <dna></dna>, unterminated <dna>,
  two regions.

* Register "carbon" in convert_hf_to_gguf_update.py's model list
  (pointing at HuggingFaceBio/Carbon-3B) and teach both
  AutoTokenizer call sites in the updater to pass
  trust_remote_code=True for it, matching how t5 is special-cased.

* vocab : move Carbon dispatch to _set_vocab_carbon + LlamaModel branch

Refactor the conversion-side changes to follow the per-tokenizer-family
convention used by _set_vocab_qwen, _set_vocab_interns1, _set_vocab_glm,
etc. instead of conditionalising the shared get_vocab_base /
get_vocab_base_pre paths.

* conversion/base.py: add _set_vocab_carbon — self-contained, loads
  with trust_remote_code=True so HybridDNATokenizer's merged Qwen3 + DNA
  vocab is visible, writes tokenizer.ggml.pre = "carbon" directly.
* conversion/llama.py: branch in LlamaModel.set_vocab on
  tokenizer_config.json["tokenizer_class"] == "HybridDNATokenizer" and
  dispatch to _set_vocab_carbon. Same precedent as conversion/bert.py
  (tokenizer_class branch between BertTokenizer / RobertaTokenizer) and
  conversion/phi.py.
* conversion/base.py: revert the conditional in get_vocab_base and the
  class-name short-circuit in the auto-generated get_vocab_base_pre.

* tests : expand ggml-vocab-carbon.gguf fixtures with model-card examples

Add 6 cases from the Carbon-3B model card on top of the existing edge
coverage: the unterminated basic-completion prompt, the closed 33-bp
example, the metadata-conditioned prompt (with <vertebrate_mammalian>
and <protein_coding_region> which BPE-decompose since they are not in
the vocab), the documented anti-pattern of raw DNA without <dna> tags,
and the two likelihood-scoring examples. Brings the suite to 19 cases.

* vocab : promote HybridDNATokenizer to its own LLAMA_VOCAB_TYPE

Refactor per upstream review:

> This should be its own tokenizer model, ie. carbonhybriddna instead
> of gpt2 and not carbon pre-tokenizer. That way you can keep the
> correct pre-tokenizer, in case that ever changes.

Previously the tokenizer was modelled as LLAMA_VOCAB_TYPE_BPE plus a
new LLAMA_VOCAB_PRE_TYPE_CARBON, which (a) put a CARBON-specific
branch inside llm_tokenizer_bpe_session::tokenize (only existing
pre-types differ in regex, not dispatch logic), and (b) conflated
"hybrid DNA tokenization" with "Qwen3 BPE pre-tokenizer".

This change moves it to its own vocab type, peer to PLAMO2, with the
GGUF model name matching the HF tokenizer class (HybridDNATokenizer):

* include/llama.h: new LLAMA_VOCAB_TYPE_HYBRIDDNA = 7.
* src/llama-vocab.cpp: new llm_tokenizer_hybriddna + session that
  owns std::unique_ptr<llm_tokenizer_bpe> for non-<dna> text and
  routes raw text through a DNA-aware splitter; wired into
  init_tokenizer, tokenize, type_name, byte_to_token, and the
  BPE-style token_to_piece case (DNA k-mers + <dna>/</dna>/<oov>
  are pure ASCII, so byte-level BPE decoding handles them).
  LLAMA_VOCAB_TYPE_HYBRIDDNA gets its own branch in the vocab-type
  config block alongside SPM/WPM/UGM/RWKV, where pre_type is set
  to QWEN2 and the matching add_space_prefix / escape_whitespaces /
  clean_spaces flags are applied — mirroring qwen2's BPE path so
  byte-level BPE merging stays bit-identical to the Python
  reference for non-DNA text.
* src/llama-vocab.h: drop the short-lived LLAMA_VOCAB_PRE_TYPE_CARBON.
* conversion/base.py: _set_vocab_hybriddna writes
  tokenizer.ggml.model = "hybriddna" (no separate pre).
* conversion/llama.py: dispatch on tokenizer_class ==
  "HybridDNATokenizer" same as bert.py / phi.py do.
* models/ggml-vocab-hybriddna.gguf{,.inp,.out}: renamed fixture +
  regenerated metadata.
* convert_hf_to_gguf_update.py: drop the stale chkhsh entry and
  trust_remote_code special-case (no longer needed since dispatch
  is now class-name driven, not chkhsh).

Verified end-to-end against HuggingFaceBio/Carbon-{500M,3B,8B}:
tokenization is bit-identical to the Python HybridDNATokenizer for
all 19 test fixtures plus the model-card metadata-conditioned
prompt; greedy completion produces the same DNA continuation as
the Python reference; spec-dec with 500M as draft for 8B still
works.

* vocab : relax llm_tokenizer_bpe assert to allow HYBRIDDNA

* vocab : drop llm_tokenizer_bpe vocab-type assert

* vocab : write tokenizer.ggml.pre for HYBRIDDNA, share BPE dispatch

* vocab : assert BPE or HYBRIDDNA in llm_tokenizer_bpe

* vocab : annotate #endif with PRETOKENIZERDEBUG

* vocab : drop local hybriddna fixture (moves to ggml-org/vocabs)

* deduplicate

* simplify

* simplify

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-05-21 08:34:32 +02:00
.devops docker : copy conversion files (#23370) 2026-05-20 11:03:18 +02:00
.gemini contributing: tighten AI usage policy (#18388) 2025-12-29 16:01:32 +01:00
.github snapdragon: update toolchain to v0.6 (#23369) 2026-05-19 22:04:04 -07:00
.pi/gg save-load-state : refactor tests and improve readability (#23196) 2026-05-19 09:46:34 +03:00
app app : show version (#23426) 2026-05-21 06:21:13 +02:00
benches benches : add Nemotron 3 Nano on DGX Spark (#20652) 2026-03-16 21:50:43 +02:00
ci ggml-vulkan/CMakeLists: add a check for SPIRV-Headers (#22009) 2026-05-17 13:12:11 +02:00
cmake cmake : do not check for bin install dir (#23234) 2026-05-18 02:33:14 +02:00
common Move to backend sampling for MTP draft path (#23287) 2026-05-20 22:34:45 +05:30
conversion vocab : add Carbon-3B (HybridDNATokenizer) support (#23410) 2026-05-21 08:34:32 +02:00
docs doc: fix spec mtp typo (#23435) 2026-05-21 09:30:55 +03:00
examples save-load-state : refactor tests and improve readability (#23196) 2026-05-19 09:46:34 +03:00
ggml ggml : Check the right iface method before using the fallback 2d get (#23306) 2026-05-21 09:24:40 +03:00
gguf-py mtmd, model : merge HunyuanOCR into HunyuanVL and fix OCR vision precision (#23329) 2026-05-21 00:35:37 +02:00
grammars webui: Move static build output from repo code to HF Bucket (#22937) 2026-05-14 13:21:41 +02:00
include llama + spec: MTP Support (#22673) 2026-05-16 20:06:23 +08:00
licenses refactor : remove libcurl, use OpenSSL when available (#18828) 2026-01-14 18:02:47 +01:00
media media : add transparent icon svg and png [no ci] (#15891) 2025-09-10 14:51:28 +03:00
models unicode,test: add Qwen3.5 non-backtracking tokenizer handler and regr… (#22110) 2026-05-14 11:03:40 +02:00
pocs libs : rename libcommon -> libllama-common (#21936) 2026-04-17 11:11:46 +03:00
requirements py : Bump typer to latest to fix huggingface_hub issue (#21701) 2026-04-11 09:44:15 +03:00
scripts hexagon: HMX quantized matmul rework (#23368) 2026-05-20 07:39:01 -07:00
src vocab : add Carbon-3B (HybridDNATokenizer) support (#23410) 2026-05-21 08:34:32 +02:00
tests hexagon: ssm-conv fix for large prompts (#23307) 2026-05-20 22:14:13 -07:00
tools ui: Improve Git Hooks for UI development (#23403) 2026-05-21 08:27:50 +02:00
vendor vendor : update cpp-httplib to 0.45.0 (#23103) 2026-05-16 15:25:21 +03:00
.clang-format fix: apply clang-format to CUDA macros (#16017) 2025-09-16 08:59:19 +02:00
.clang-tidy clang-tidy : disable warning about performance enum size (#16127) 2025-09-22 19:57:46 +02:00
.dockerignore ci : fix docker build number and tag name (#9638) 2024-09-25 17:26:01 +02:00
.ecrc common : Update stb_image.h to latest version (#9161) 2024-08-27 08:58:50 +03:00
.editorconfig ui: Restructure repo to use tools/ui folder and ui / UI / llama-ui / LLAMA_UI naming (#23064) 2026-05-16 02:02:40 +02:00
.flake8 llama : move end-user examples to tools directory (#13249) 2025-05-02 20:27:13 +02:00
.gitignore ui: Restructure repo to use tools/ui folder and ui / UI / llama-ui / LLAMA_UI naming (#23064) 2026-05-16 02:02:40 +02:00
.gitmodules ggml : remove kompute backend (#14501) 2025-07-03 07:48:32 +03:00
.pre-commit-config.yaml convert.py : add python logging instead of print() (#6511) 2024-05-03 22:36:41 +03:00
AGENTS.md contrib : rewrite AGENTS.md, make it more clear about project values (#21270) 2026-04-01 23:31:51 +02:00
AUTHORS authors : update (#19263) 2026-02-02 08:51:25 +02:00
build-xcframework.sh build : remove LLAMA_HTTPLIB option (#19623) 2026-02-15 15:38:50 +01:00
CLAUDE.md contributing: tighten AI usage policy (#18388) 2025-12-29 16:01:32 +01:00
CMakeLists.txt app : introduce the llama unified executable (#23296) 2026-05-20 13:22:22 +02:00
CMakePresets.json cmake : Add CMake presets for Linux and GCC (#14656) 2025-07-13 08:12:36 +03:00
CODEOWNERS add myself to conversion (#23261) 2026-05-18 12:42:56 +02:00
CONTRIBUTING.md contributing: new contributors should not submit trivial fixes (#23045) 2026-05-14 23:55:24 +08:00
convert_hf_to_gguf.py convert : update mtp related help (#23334) 2026-05-19 21:16:58 +02:00
convert_hf_to_gguf_update.py Refactor: convert_hf_to_gguf.py (#17114) 2026-05-15 15:18:12 +02:00
convert_llama_ggml_to_gguf.py ci : switch from pyright to ty (#20826) 2026-03-21 08:54:34 +01:00
convert_lora_to_gguf.py convert : filter lora tensor names (#23077) 2026-05-19 09:44:25 +03:00
flake.nix fix(nix): remove non-functional llama-cpp cachix cache from flake.nix (#15295) 2025-08-13 11:21:31 -07:00
LICENSE docs : Minor cleanups (#19252) 2026-02-02 08:38:55 +02:00
Makefile make : remove make in favor of CMake (#15449) 2025-08-20 13:31:16 +03:00
mypy.ini convert : partially revert PR #4818 (#5041) 2024-01-20 18:14:18 -05:00
pyproject.toml feat: migrate to PEP 621 and add uv support (#21907) 2026-05-06 14:04:10 +02:00
pyrightconfig.json ci : switch from pyright to ty (#20826) 2026-03-21 08:54:34 +01:00
README.md [SCYL] add chapter for performance reference in SYCL.md (#23315) 2026-05-19 09:44:51 +03:00
requirements.txt tool-call: fix Qwen 2.5 Coder support, add micro benchmarks, support trigger patterns for lazy grammars (#12034) 2025-03-05 13:05:13 +00:00
SECURITY.md docs : fix broken link and typo (#19560) 2026-02-13 09:38:09 +01:00
ty.toml mtmd : DeepSeek-OCR image processing fixes, img_tool::resize padding refactor (#23345) 2026-05-20 17:37:10 +02:00

llama.cpp

llama

License: MIT Release Server

Manifesto / ggml / ops

LLM inference in C/C++

Recent API changes

Hot topics


Quick start

Getting started with llama.cpp is straightforward. Here are several ways to install it on your machine:

Once installed, you'll need a model to work with. Head to the Obtaining and quantizing models section to learn more.

Example command:

# Use a local model file
llama-cli -m my_model.gguf

# Or download and run a model directly from Hugging Face
llama-cli -hf ggml-org/gemma-3-1b-it-GGUF

# Launch OpenAI-compatible API server
llama-server -hf ggml-org/gemma-3-1b-it-GGUF

Description

The main goal of llama.cpp is to enable LLM inference with minimal setup and state-of-the-art performance on a wide range of hardware - locally and in the cloud.

  • Plain C/C++ implementation without any dependencies
  • Apple silicon is a first-class citizen - optimized via ARM NEON, Accelerate and Metal frameworks
  • AVX, AVX2, AVX512 and AMX support for x86 architectures
  • RVV, ZVFH, ZFH, ZICBOP and ZIHINTPAUSE support for RISC-V architectures
  • 1.5-bit, 2-bit, 3-bit, 4-bit, 5-bit, 6-bit, and 8-bit integer quantization for faster inference and reduced memory use
  • Custom CUDA kernels for running LLMs on NVIDIA GPUs (support for AMD GPUs via HIP and Moore Threads GPUs via MUSA)
  • Vulkan and SYCL backend support
  • CPU+GPU hybrid inference to partially accelerate models larger than the total VRAM capacity

The llama.cpp project is the main playground for developing new features for the ggml library.

Models

Typically finetunes of the base models below are supported as well.

Instructions for adding support for new models: HOWTO-add-model.md

Text-only

Multimodal

Bindings
UIs

(to have a project listed here, it should clearly state that it depends on llama.cpp)

Tools
  • akx/ggify download PyTorch models from Hugging Face Hub and convert them to GGML
  • akx/ollama-dl download models from the Ollama library to be used directly with llama.cpp
  • crashr/gppm launch llama.cpp instances utilizing NVIDIA Tesla P40 or P100 GPUs with reduced idle power consumption
  • gpustack/gguf-parser - review/check the GGUF file and estimate the memory usage
  • Styled Lines (proprietary licensed, async wrapper of inference part for game development in Unity3d with pre-built Mobile and Web platform wrappers and a model example)
  • unslothai/unsloth 🦥 exports/saves fine-tuned and trained models to GGUF (Apache-2.0)
Infrastructure
  • Paddler - Open-source LLMOps platform for hosting and scaling AI in your own infrastructure
  • GPUStack - Manage GPU clusters for running LLMs
  • llama_cpp_canister - llama.cpp as a smart contract on the Internet Computer, using WebAssembly
  • llama-swap - transparent proxy that adds automatic model switching with llama-server
  • Kalavai - Crowdsource end to end LLM deployment at any scale
  • llmaz - ☸️ Easy, advanced inference platform for large language models on Kubernetes.
  • LLMKube - Kubernetes operator for llama.cpp with multi-GPU and Apple Silicon Metal support"
Games
  • Lucy's Labyrinth - A simple maze game where agents controlled by an AI model will try to trick you.

Supported backends

Backend Target devices
Metal Apple Silicon
BLAS All
BLIS All
SYCL Intel GPU
OpenVINO [In Progress] Intel CPUs, GPUs, and NPUs
MUSA Moore Threads GPU
CUDA Nvidia GPU
HIP AMD GPU
ZenDNN AMD CPU
Vulkan GPU
CANN Ascend NPU
OpenCL Adreno GPU
IBM zDNN IBM Z & LinuxONE
WebGPU [In Progress] All
RPC All
Hexagon [In Progress] Snapdragon
VirtGPU VirtGPU APIR

Obtaining and quantizing models

The Hugging Face platform hosts a number of LLMs compatible with llama.cpp:

You can either manually download the GGUF file or directly use any llama.cpp-compatible models from Hugging Face or other model hosting sites, by using this CLI argument: -hf <user>/<model>[:quant]. For example:

llama-cli -hf ggml-org/gemma-3-1b-it-GGUF

By default, the CLI would download from Hugging Face, you can switch to other options with the environment variable MODEL_ENDPOINT. The MODEL_ENDPOINT must point to a Hugging Face compatible API endpoint.

After downloading a model, use the CLI tools to run it locally - see below.

llama.cpp requires the model to be stored in the GGUF file format. Models in other data formats can be converted to GGUF using the convert_*.py Python scripts in this repo.

The Hugging Face platform provides a variety of online tools for converting, quantizing and hosting models with llama.cpp:

To learn more about model quantization, read this documentation

llama-cli

A CLI tool for accessing and experimenting with most of llama.cpp's functionality.

  • Run in conversation mode

    Models with a built-in chat template will automatically activate conversation mode. If this doesn't occur, you can manually enable it by adding -cnv and specifying a suitable chat template with --chat-template NAME

    llama-cli -m model.gguf
    
    # > hi, who are you?
    # Hi there! I'm your helpful assistant! I'm an AI-powered chatbot designed to assist and provide information to users like you. I'm here to help answer your questions, provide guidance, and offer support on a wide range of topics. I'm a friendly and knowledgeable AI, and I'm always happy to help with anything you need. What's on your mind, and how can I assist you today?
    #
    # > what is 1+1?
    # Easy peasy! The answer to 1+1 is... 2!
    
  • Run in conversation mode with custom chat template
    # use the "chatml" template (use -h to see the list of supported templates)
    llama-cli -m model.gguf -cnv --chat-template chatml
    
    # use a custom template
    llama-cli -m model.gguf -cnv --in-prefix 'User: ' --reverse-prompt 'User:'
    
  • Constrain the output with a custom grammar
    llama-cli -m model.gguf -n 256 --grammar-file grammars/json.gbnf -p 'Request: schedule a call at 8pm; Command:'
    
    # {"appointmentTime": "8pm", "appointmentDetails": "schedule a a call"}
    

    The grammars/ folder contains a handful of sample grammars. To write your own, check out the GBNF Guide.

    For authoring more complex JSON grammars, check out https://grammar.intrinsiclabs.ai/

llama-server

A lightweight, OpenAI API compatible, HTTP server for serving LLMs.

  • Start a local HTTP server with default configuration on port 8080
    llama-server -m model.gguf --port 8080
    
    # Basic web UI can be accessed via browser: http://localhost:8080
    # Chat completion endpoint: http://localhost:8080/v1/chat/completions
    
  • Support multiple-users and parallel decoding
    # up to 4 concurrent requests, each with 4096 max context
    llama-server -m model.gguf -c 16384 -np 4
    
  • Enable speculative decoding
    # the draft.gguf model should be a small variant of the target model.gguf
    llama-server -m model.gguf -md draft.gguf
    
  • Serve an embedding model
    # use the /embedding endpoint
    llama-server -m model.gguf --embedding --pooling cls -ub 8192
    
  • Serve a reranking model
    # use the /reranking endpoint
    llama-server -m model.gguf --reranking
    
  • Constrain all outputs with a grammar
    # custom grammar
    llama-server -m model.gguf --grammar-file grammar.gbnf
    
    # JSON
    llama-server -m model.gguf --grammar-file grammars/json.gbnf
    

llama-perplexity

A tool for measuring the perplexity 1 (and other quality metrics) of a model over a given text.

  • Measure the perplexity over a text file
    llama-perplexity -m model.gguf -f file.txt
    
    # [1]15.2701,[2]5.4007,[3]5.3073,[4]6.2965,[5]5.8940,[6]5.6096,[7]5.7942,[8]4.9297, ...
    # Final estimate: PPL = 5.4007 +/- 0.67339
    
  • Measure KL divergence
    # TODO
    

llama-bench

Benchmark the performance of the inference for various parameters.

  • Run default benchmark
    llama-bench -m model.gguf
    
    # Output:
    # | model               |       size |     params | backend    | threads |          test |                  t/s |
    # | ------------------- | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |
    # | qwen2 1.5B Q4_0     | 885.97 MiB |     1.54 B | Metal,BLAS |      16 |         pp512 |      5765.41 ± 20.55 |
    # | qwen2 1.5B Q4_0     | 885.97 MiB |     1.54 B | Metal,BLAS |      16 |         tg128 |        197.71 ± 0.81 |
    #
    # build: 3e0ba0e60 (4229)
    

llama-simple

A minimal example for implementing apps with llama.cpp. Useful for developers.

  • Basic text completion
    llama-simple -m model.gguf
    
    # Hello my name is Kaitlyn and I am a 16 year old girl. I am a junior in high school and I am currently taking a class called "The Art of
    

Contributing

  • Contributors can open PRs
  • Collaborators will be invited based on contributions
  • Maintainers can push to branches in the llama.cpp repo and merge PRs into the master branch
  • Any help with managing issues, PRs and projects is very appreciated!
  • See good first issues for tasks suitable for first contributions
  • Read the CONTRIBUTING.md for more information
  • Make sure to read this: Inference at the edge
  • A bit of backstory for those who are interested: Changelog podcast

Other documentation

Development documentation

Seminal papers and background on the models

If your issue is with model generation quality, then please at least scan the following links and papers to understand the limitations of LLaMA models. This is especially important when choosing an appropriate model size and appreciating both the significant and subtle differences between LLaMA models and ChatGPT:

XCFramework

The XCFramework is a precompiled version of the library for iOS, visionOS, tvOS, and macOS. It can be used in Swift projects without the need to compile the library from source. For example:

// swift-tools-version: 5.10
// The swift-tools-version declares the minimum version of Swift required to build this package.

import PackageDescription

let package = Package(
    name: "MyLlamaPackage",
    targets: [
        .executableTarget(
            name: "MyLlamaPackage",
            dependencies: [
                "LlamaFramework"
            ]),
        .binaryTarget(
            name: "LlamaFramework",
            url: "https://github.com/ggml-org/llama.cpp/releases/download/b5046/llama-b5046-xcframework.zip",
            checksum: "c19be78b5f00d8d29a25da41042cb7afa094cbf6280a225abe614b03b20029ab"
        )
    ]
)

The above example is using an intermediate build b5046 of the library. This can be modified to use a different version by changing the URL and checksum.

Completions

Command-line completion is available for some environments.

Bash Completion

$ build/bin/llama-cli --completion-bash > ~/.llama-completion.bash
$ source ~/.llama-completion.bash

Optionally this can be added to your .bashrc or .bash_profile to load it automatically. For example:

$ echo "source ~/.llama-completion.bash" >> ~/.bashrc

Dependencies

  • yhirose/cpp-httplib - Single-header HTTP server, used by llama-server - MIT license
  • stb-image - Single-header image format decoder, used by multimodal subsystem - Public domain
  • nlohmann/json - Single-header JSON library, used by various tools/examples - MIT License
  • miniaudio.h - Single-header audio format decoder, used by multimodal subsystem - Public domain
  • subprocess.h - Single-header process launching solution for C and C++ - Public domain