koboldcpp/tools/cli
Ethan Turner fa0b8a70a8
cli: Remove redundant local sampling variables (#20429) (#22264)
This change implements the third requested change in issue 20429.
Because defaults.sampling contains the reasoning budget token count and
the reasoning budget message, it's not necessary to assign them to
struct variables.
2026-04-24 00:53:23 +02:00
..
cli.cpp cli: Remove redundant local sampling variables (#20429) (#22264) 2026-04-24 00:53:23 +02:00
CMakeLists.txt libs : rename libcommon -> libllama-common (#21936) 2026-04-17 11:11:46 +03:00
README.md server: save and clear idle slots on new task (--clear-idle) (#20993) 2026-04-03 19:02:27 +02:00

llama.cpp/tools/cli

Usage

Common params

Argument Explanation
-h, --help, --usage print usage and exit
--version show version and build info
--license show source code license and dependencies
-cl, --cache-list show list of models in cache
--completion-bash print source-able bash completion script for llama.cpp
--verbose-prompt print a verbose prompt before generation (default: false)
-t, --threads N number of CPU threads to use during generation (default: -1)
(env: LLAMA_ARG_THREADS)
-tb, --threads-batch N number of threads to use during batch and prompt processing (default: same as --threads)
-C, --cpu-mask M CPU affinity mask: arbitrarily long hex. Complements cpu-range (default: "")
-Cr, --cpu-range lo-hi range of CPUs for affinity. Complements --cpu-mask
--cpu-strict <0|1> use strict CPU placement (default: 0)
--prio N set process/thread priority : low(-1), normal(0), medium(1), high(2), realtime(3) (default: 0)
--poll <0...100> use polling level to wait for work (0 - no polling, default: 50)
-Cb, --cpu-mask-batch M CPU affinity mask: arbitrarily long hex. Complements cpu-range-batch (default: same as --cpu-mask)
-Crb, --cpu-range-batch lo-hi ranges of CPUs for affinity. Complements --cpu-mask-batch
--cpu-strict-batch <0|1> use strict CPU placement (default: same as --cpu-strict)
--prio-batch N set process/thread priority : 0-normal, 1-medium, 2-high, 3-realtime (default: 0)
--poll-batch <0|1> use polling to wait for work (default: same as --poll)
-c, --ctx-size N size of the prompt context (default: 0, 0 = loaded from model)
(env: LLAMA_ARG_CTX_SIZE)
-n, --predict, --n-predict N number of tokens to predict (default: -1, -1 = infinity)
(env: LLAMA_ARG_N_PREDICT)
-b, --batch-size N logical maximum batch size (default: 2048)
(env: LLAMA_ARG_BATCH)
-ub, --ubatch-size N physical maximum batch size (default: 512)
(env: LLAMA_ARG_UBATCH)
--keep N number of tokens to keep from the initial prompt (default: 0, -1 = all)
--swa-full use full-size SWA cache (default: false)
(more info)
(env: LLAMA_ARG_SWA_FULL)
-fa, --flash-attn [on|off|auto] set Flash Attention use ('on', 'off', or 'auto', default: 'auto')
(env: LLAMA_ARG_FLASH_ATTN)
-p, --prompt PROMPT prompt to start generation with; for system message, use -sys
--perf, --no-perf whether to enable internal libllama performance timings (default: false)
(env: LLAMA_ARG_PERF)
-f, --file FNAME a file containing the prompt (default: none)
-bf, --binary-file FNAME binary file containing the prompt (default: none)
-e, --escape, --no-escape whether to process escapes sequences (\n, \r, \t, ', ", \) (default: true)
--rope-scaling {none,linear,yarn} RoPE frequency scaling method, defaults to linear unless specified by the model
(env: LLAMA_ARG_ROPE_SCALING_TYPE)
--rope-scale N RoPE context scaling factor, expands context by a factor of N
(env: LLAMA_ARG_ROPE_SCALE)
--rope-freq-base N RoPE base frequency, used by NTK-aware scaling (default: loaded from model)
(env: LLAMA_ARG_ROPE_FREQ_BASE)
--rope-freq-scale N RoPE frequency scaling factor, expands context by a factor of 1/N
(env: LLAMA_ARG_ROPE_FREQ_SCALE)
--yarn-orig-ctx N YaRN: original context size of model (default: 0 = model training context size)
(env: LLAMA_ARG_YARN_ORIG_CTX)
--yarn-ext-factor N YaRN: extrapolation mix factor (default: -1.00, 0.0 = full interpolation)
(env: LLAMA_ARG_YARN_EXT_FACTOR)
--yarn-attn-factor N YaRN: scale sqrt(t) or attention magnitude (default: -1.00)
(env: LLAMA_ARG_YARN_ATTN_FACTOR)
--yarn-beta-slow N YaRN: high correction dim or alpha (default: -1.00)
(env: LLAMA_ARG_YARN_BETA_SLOW)
--yarn-beta-fast N YaRN: low correction dim or beta (default: -1.00)
(env: LLAMA_ARG_YARN_BETA_FAST)
-kvo, --kv-offload, -nkvo, --no-kv-offload whether to enable KV cache offloading (default: enabled)
(env: LLAMA_ARG_KV_OFFLOAD)
--repack, -nr, --no-repack whether to enable weight repacking (default: enabled)
(env: LLAMA_ARG_REPACK)
--no-host bypass host buffer allowing extra buffers to be used
(env: LLAMA_ARG_NO_HOST)
-ctk, --cache-type-k TYPE KV cache data type for K
allowed values: f32, f16, bf16, q8_0, q4_0, q4_1, iq4_nl, q5_0, q5_1
(default: f16)
(env: LLAMA_ARG_CACHE_TYPE_K)
-ctv, --cache-type-v TYPE KV cache data type for V
allowed values: f32, f16, bf16, q8_0, q4_0, q4_1, iq4_nl, q5_0, q5_1
(default: f16)
(env: LLAMA_ARG_CACHE_TYPE_V)
-dt, --defrag-thold N KV cache defragmentation threshold (DEPRECATED)
(env: LLAMA_ARG_DEFRAG_THOLD)
-np, --parallel N number of parallel sequences to decode (default: 1)
(env: LLAMA_ARG_N_PARALLEL)
--mlock force system to keep model in RAM rather than swapping or compressing
(env: LLAMA_ARG_MLOCK)
--mmap, --no-mmap whether to memory-map model. (if mmap disabled, slower load but may reduce pageouts if not using mlock) (default: enabled)
(env: LLAMA_ARG_MMAP)
-dio, --direct-io, -ndio, --no-direct-io use DirectIO if available. (default: disabled)
(env: LLAMA_ARG_DIO)
--numa TYPE attempt optimizations that help on some NUMA systems
- distribute: spread execution evenly over all nodes
- isolate: only spawn threads on CPUs on the node that execution started on
- numactl: use the CPU map provided by numactl
if run without this previously, it is recommended to drop the system page cache before using this
see https://github.com/ggml-org/llama.cpp/issues/1437
(env: LLAMA_ARG_NUMA)
-dev, --device <dev1,dev2,..> comma-separated list of devices to use for offloading (none = don't offload)
use --list-devices to see a list of available devices
(env: LLAMA_ARG_DEVICE)
--list-devices print list of available devices and exit
-ot, --override-tensor <tensor name pattern>=<buffer type>,... override tensor buffer type
(env: LLAMA_ARG_OVERRIDE_TENSOR)
-cmoe, --cpu-moe keep all Mixture of Experts (MoE) weights in the CPU
(env: LLAMA_ARG_CPU_MOE)
-ncmoe, --n-cpu-moe N keep the Mixture of Experts (MoE) weights of the first N layers in the CPU
(env: LLAMA_ARG_N_CPU_MOE)
-ngl, --gpu-layers, --n-gpu-layers N max. number of layers to store in VRAM, either an exact number, 'auto', or 'all' (default: auto)
(env: LLAMA_ARG_N_GPU_LAYERS)
-sm, --split-mode {none,layer,row} how to split the model across multiple GPUs, one of:
- none: use one GPU only
- layer (default): split layers and KV across GPUs
- row: split rows across GPUs
(env: LLAMA_ARG_SPLIT_MODE)
-ts, --tensor-split N0,N1,N2,... fraction of the model to offload to each GPU, comma-separated list of proportions, e.g. 3,1
(env: LLAMA_ARG_TENSOR_SPLIT)
-mg, --main-gpu INDEX the GPU to use for the model (with split-mode = none), or for intermediate results and KV (with split-mode = row) (default: 0)
(env: LLAMA_ARG_MAIN_GPU)
-fit, --fit [on|off] whether to adjust unset arguments to fit in device memory ('on' or 'off', default: 'on')
(env: LLAMA_ARG_FIT)
-fitt, --fit-target MiB0,MiB1,MiB2,... target margin per device for --fit, comma-separated list of values, single value is broadcast across all devices, default: 1024
(env: LLAMA_ARG_FIT_TARGET)
-fitc, --fit-ctx N minimum ctx size that can be set by --fit option, default: 4096
(env: LLAMA_ARG_FIT_CTX)
--check-tensors check model tensor data for invalid values (default: false)
--override-kv KEY=TYPE:VALUE,... advanced option to override model metadata by key. to specify multiple overrides, either use comma-separated values.
types: int, float, bool, str. example: --override-kv tokenizer.ggml.add_bos_token=bool:false,tokenizer.ggml.add_eos_token=bool:false
--op-offload, --no-op-offload whether to offload host tensor operations to device (default: true)
--lora FNAME path to LoRA adapter (use comma-separated values to load multiple adapters)
--lora-scaled FNAME:SCALE,... path to LoRA adapter with user defined scaling (format: FNAME:SCALE,...)
note: use comma-separated values
--control-vector FNAME add a control vector
note: use comma-separated values to add multiple control vectors
--control-vector-scaled FNAME:SCALE,... add a control vector with user defined scaling SCALE
note: use comma-separated values (format: FNAME:SCALE,...)
--control-vector-layer-range START END layer range to apply the control vector(s) to, start and end inclusive
-m, --model FNAME model path to load
(env: LLAMA_ARG_MODEL)
-mu, --model-url MODEL_URL model download url (default: unused)
(env: LLAMA_ARG_MODEL_URL)
-dr, --docker-repo [<repo>/]<model>[:quant] Docker Hub model repository. repo is optional, default to ai/. quant is optional, default to :latest.
example: gemma3
(default: unused)
(env: LLAMA_ARG_DOCKER_REPO)
-hf, -hfr, --hf-repo <user>/<model>[:quant] Hugging Face model repository; quant is optional, case-insensitive, default to Q4_K_M, or falls back to the first file in the repo if Q4_K_M doesn't exist.
mmproj is also downloaded automatically if available. to disable, add --no-mmproj
example: ggml-org/GLM-4.7-Flash-GGUF:Q4_K_M
(default: unused)
(env: LLAMA_ARG_HF_REPO)
-hfd, -hfrd, --hf-repo-draft <user>/<model>[:quant] Same as --hf-repo, but for the draft model (default: unused)
(env: LLAMA_ARG_HFD_REPO)
-hff, --hf-file FILE Hugging Face model file. If specified, it will override the quant in --hf-repo (default: unused)
(env: LLAMA_ARG_HF_FILE)
-hfv, -hfrv, --hf-repo-v <user>/<model>[:quant] Hugging Face model repository for the vocoder model (default: unused)
(env: LLAMA_ARG_HF_REPO_V)
-hffv, --hf-file-v FILE Hugging Face model file for the vocoder model (default: unused)
(env: LLAMA_ARG_HF_FILE_V)
-hft, --hf-token TOKEN Hugging Face access token (default: value from HF_TOKEN environment variable)
(env: HF_TOKEN)
--log-disable Log disable
--log-file FNAME Log to file
(env: LLAMA_LOG_FILE)
--log-colors [on|off|auto] Set colored logging ('on', 'off', or 'auto', default: 'auto')
'auto' enables colors when output is to a terminal
(env: LLAMA_LOG_COLORS)
-v, --verbose, --log-verbose Set verbosity level to infinity (i.e. log all messages, useful for debugging)
--offline Offline mode: forces use of cache, prevents network access
(env: LLAMA_OFFLINE)
-lv, --verbosity, --log-verbosity N Set the verbosity threshold. Messages with a higher verbosity will be ignored. Values:
- 0: generic output
- 1: error
- 2: warning
- 3: info
- 4: debug
(default: 3)

(env: LLAMA_LOG_VERBOSITY)
--log-prefix Enable prefix in log messages
(env: LLAMA_LOG_PREFIX)
--log-timestamps Enable timestamps in log messages
(env: LLAMA_LOG_TIMESTAMPS)
-ctkd, --cache-type-k-draft TYPE KV cache data type for K for the draft model
allowed values: f32, f16, bf16, q8_0, q4_0, q4_1, iq4_nl, q5_0, q5_1
(default: f16)
(env: LLAMA_ARG_CACHE_TYPE_K_DRAFT)
-ctvd, --cache-type-v-draft TYPE KV cache data type for V for the draft model
allowed values: f32, f16, bf16, q8_0, q4_0, q4_1, iq4_nl, q5_0, q5_1
(default: f16)
(env: LLAMA_ARG_CACHE_TYPE_V_DRAFT)

Sampling params

Argument Explanation
--samplers SAMPLERS samplers that will be used for generation in the order, separated by ';'
(default: penalties;dry;top_n_sigma;top_k;typ_p;top_p;min_p;xtc;temperature)
-s, --seed SEED RNG seed (default: -1, use random seed for -1)
--sampler-seq, --sampling-seq SEQUENCE simplified sequence for samplers that will be used (default: edskypmxt)
--ignore-eos ignore end of stream token and continue generating (implies --logit-bias EOS-inf)
--temp, --temperature N temperature (default: 0.80)
--top-k N top-k sampling (default: 40, 0 = disabled)
(env: LLAMA_ARG_TOP_K)
--top-p N top-p sampling (default: 0.95, 1.0 = disabled)
--min-p N min-p sampling (default: 0.05, 0.0 = disabled)
--top-nsigma, --top-n-sigma N top-n-sigma sampling (default: -1.00, -1.0 = disabled)
--xtc-probability N xtc probability (default: 0.00, 0.0 = disabled)
--xtc-threshold N xtc threshold (default: 0.10, 1.0 = disabled)
--typical, --typical-p N locally typical sampling, parameter p (default: 1.00, 1.0 = disabled)
--repeat-last-n N last n tokens to consider for penalize (default: 64, 0 = disabled, -1 = ctx_size)
--repeat-penalty N penalize repeat sequence of tokens (default: 1.00, 1.0 = disabled)
--presence-penalty N repeat alpha presence penalty (default: 0.00, 0.0 = disabled)
--frequency-penalty N repeat alpha frequency penalty (default: 0.00, 0.0 = disabled)
--dry-multiplier N set DRY sampling multiplier (default: 0.00, 0.0 = disabled)
--dry-base N set DRY sampling base value (default: 1.75)
--dry-allowed-length N set allowed length for DRY sampling (default: 2)
--dry-penalty-last-n N set DRY penalty for the last n tokens (default: -1, 0 = disable, -1 = context size)
--dry-sequence-breaker STRING add sequence breaker for DRY sampling, clearing out default breakers ('\n', ':', '"', '*') in the process; use "none" to not use any sequence breakers
--adaptive-target N adaptive-p: select tokens near this probability (valid range 0.0 to 1.0; negative = disabled) (default: -1.00)
(more info)
--adaptive-decay N adaptive-p: decay rate for target adaptation over time. lower values are more reactive, higher values are more stable.
(valid range 0.0 to 0.99) (default: 0.90)
--dynatemp-range N dynamic temperature range (default: 0.00, 0.0 = disabled)
--dynatemp-exp N dynamic temperature exponent (default: 1.00)
--mirostat N use Mirostat sampling.
Top K, Nucleus and Locally Typical samplers are ignored if used.
(default: 0, 0 = disabled, 1 = Mirostat, 2 = Mirostat 2.0)
--mirostat-lr N Mirostat learning rate, parameter eta (default: 0.10)
--mirostat-ent N Mirostat target entropy, parameter tau (default: 5.00)
-l, --logit-bias TOKEN_ID(+/-)BIAS modifies the likelihood of token appearing in the completion,
i.e. --logit-bias 15043+1 to increase likelihood of token ' Hello',
or --logit-bias 15043-1 to decrease likelihood of token ' Hello'
--grammar GRAMMAR BNF-like grammar to constrain generations (see samples in grammars/ dir)
--grammar-file FNAME file to read grammar from
-j, --json-schema SCHEMA JSON schema to constrain generations (https://json-schema.org/), e.g. {} for any JSON object
For schemas w/ external $refs, use --grammar + example/json_schema_to_grammar.py instead
-jf, --json-schema-file FILE File containing a JSON schema to constrain generations (https://json-schema.org/), e.g. {} for any JSON object
For schemas w/ external $refs, use --grammar + example/json_schema_to_grammar.py instead
-bs, --backend-sampling enable backend sampling (experimental) (default: disabled)
(env: LLAMA_ARG_BACKEND_SAMPLING)

CLI-specific params

Argument Explanation
--display-prompt, --no-display-prompt whether to print prompt at generation (default: true)
-co, --color [on|off|auto] Colorize output to distinguish prompt and user input from generations ('on', 'off', or 'auto', default: 'auto')
'auto' enables colors when output is to a terminal
-ctxcp, --ctx-checkpoints, --swa-checkpoints N max number of context checkpoints to create per slot (default: 32)(more info)
(env: LLAMA_ARG_CTX_CHECKPOINTS)
-cpent, --checkpoint-every-n-tokens N create a checkpoint every n tokens during prefill (processing), -1 to disable (default: 8192)
(env: LLAMA_ARG_CHECKPOINT_EVERY_NT)
-cram, --cache-ram N set the maximum cache size in MiB (default: 8192, -1 - no limit, 0 - disable)(more info)
(env: LLAMA_ARG_CACHE_RAM)
--context-shift, --no-context-shift whether to use context shift on infinite text generation (default: disabled)
(env: LLAMA_ARG_CONTEXT_SHIFT)
-sys, --system-prompt PROMPT system prompt to use with model (if applicable, depending on chat template)
--show-timings, --no-show-timings whether to show timing information after each response (default: true)
(env: LLAMA_ARG_SHOW_TIMINGS)
-sysf, --system-prompt-file FNAME a file containing the system prompt (default: none)
-r, --reverse-prompt PROMPT halt generation at PROMPT, return control in interactive mode
-sp, --special special tokens output enabled (default: false)
-cnv, --conversation, -no-cnv, --no-conversation whether to run in conversation mode:
- does not print special tokens and suffix/prefix
- interactive mode is also enabled
(default: auto enabled if chat template is available)
-st, --single-turn run conversation for a single turn only, then exit when done
will not be interactive if first turn is predefined with --prompt
(default: false)
-mli, --multiline-input allows you to write or paste multiple lines without ending each in ''
--warmup, --no-warmup whether to perform warmup with an empty run (default: enabled)
-mm, --mmproj FILE path to a multimodal projector file. see tools/mtmd/README.md
note: if -hf is used, this argument can be omitted
(env: LLAMA_ARG_MMPROJ)
-mmu, --mmproj-url URL URL to a multimodal projector file. see tools/mtmd/README.md
(env: LLAMA_ARG_MMPROJ_URL)
--mmproj-auto, --no-mmproj, --no-mmproj-auto whether to use multimodal projector file (if available), useful when using -hf (default: enabled)
(env: LLAMA_ARG_MMPROJ_AUTO)
--mmproj-offload, --no-mmproj-offload whether to enable GPU offloading for multimodal projector (default: enabled)
(env: LLAMA_ARG_MMPROJ_OFFLOAD)
--image, --audio FILE path to an image or audio file. use with multimodal models, use comma-separated values for multiple files
--image-min-tokens N minimum number of tokens each image can take, only used by vision models with dynamic resolution (default: read from model)
(env: LLAMA_ARG_IMAGE_MIN_TOKENS)
--image-max-tokens N maximum number of tokens each image can take, only used by vision models with dynamic resolution (default: read from model)
(env: LLAMA_ARG_IMAGE_MAX_TOKENS)
-otd, --override-tensor-draft <tensor name pattern>=<buffer type>,... override tensor buffer type for draft model
-cmoed, --cpu-moe-draft keep all Mixture of Experts (MoE) weights in the CPU for the draft model
(env: LLAMA_ARG_CPU_MOE_DRAFT)
-ncmoed, --n-cpu-moe-draft N keep the Mixture of Experts (MoE) weights of the first N layers in the CPU for the draft model
(env: LLAMA_ARG_N_CPU_MOE_DRAFT)
--chat-template-kwargs STRING sets additional params for the json template parser, must be a valid json object string, e.g. '{"key1":"value1","key2":"value2"}'
(env: LLAMA_CHAT_TEMPLATE_KWARGS)
--jinja, --no-jinja whether to use jinja template engine for chat (default: enabled)
(env: LLAMA_ARG_JINJA)
--reasoning-format FORMAT controls whether thought tags are allowed and/or extracted from the response, and in which format they're returned; one of:
- none: leaves thoughts unparsed in message.content
- deepseek: puts thoughts in message.reasoning_content
- deepseek-legacy: keeps <think> tags in message.content while also populating message.reasoning_content
(default: auto)
(env: LLAMA_ARG_THINK)
-rea, --reasoning [on|off|auto] Use reasoning/thinking in the chat ('on', 'off', or 'auto', default: 'auto' (detect from template))
(env: LLAMA_ARG_REASONING)
--reasoning-budget N token budget for thinking: -1 for unrestricted, 0 for immediate end, N>0 for token budget (default: -1)
(env: LLAMA_ARG_THINK_BUDGET)
--reasoning-budget-message MESSAGE message injected before the end-of-thinking tag when reasoning budget is exhausted (default: none)
(env: LLAMA_ARG_THINK_BUDGET_MESSAGE)
--chat-template JINJA_TEMPLATE set custom jinja chat template (default: template taken from model's metadata)
if suffix/prefix are specified, template will be disabled
only commonly used templates are accepted (unless --jinja is set before this flag):
list of built-in templates:
bailing, bailing-think, bailing2, chatglm3, chatglm4, chatml, command-r, deepseek, deepseek-ocr, deepseek2, deepseek3, exaone-moe, exaone3, exaone4, falcon3, gemma, gigachat, glmedge, gpt-oss, granite, grok-2, hunyuan-dense, hunyuan-moe, kimi-k2, llama2, llama2-sys, llama2-sys-bos, llama2-sys-strip, llama3, llama4, megrez, minicpm, mistral-v1, mistral-v3, mistral-v3-tekken, mistral-v7, mistral-v7-tekken, monarch, openchat, orion, pangu-embedded, phi3, phi4, rwkv-world, seed_oss, smolvlm, solar-open, vicuna, vicuna-orca, yandex, zephyr
(env: LLAMA_ARG_CHAT_TEMPLATE)
--chat-template-file JINJA_TEMPLATE_FILE set custom jinja chat template file (default: template taken from model's metadata)
if suffix/prefix are specified, template will be disabled
only commonly used templates are accepted (unless --jinja is set before this flag):
list of built-in templates:
bailing, bailing-think, bailing2, chatglm3, chatglm4, chatml, command-r, deepseek, deepseek-ocr, deepseek2, deepseek3, exaone-moe, exaone3, exaone4, falcon3, gemma, gigachat, glmedge, gpt-oss, granite, grok-2, hunyuan-dense, hunyuan-moe, kimi-k2, llama2, llama2-sys, llama2-sys-bos, llama2-sys-strip, llama3, llama4, megrez, minicpm, mistral-v1, mistral-v3, mistral-v3-tekken, mistral-v7, mistral-v7-tekken, monarch, openchat, orion, pangu-embedded, phi3, phi4, rwkv-world, seed_oss, smolvlm, solar-open, vicuna, vicuna-orca, yandex, zephyr
(env: LLAMA_ARG_CHAT_TEMPLATE_FILE)
--skip-chat-parsing, --no-skip-chat-parsing force a pure content parser, even if a Jinja template is specified; model will output everything in the content section, including any reasoning and/or tool calls (default: disabled)
(env: LLAMA_ARG_SKIP_CHAT_PARSING)
--simple-io use basic IO for better compatibility in subprocesses and limited consoles
--draft, --draft-n, --draft-max N number of tokens to draft for speculative decoding (default: 16)
(env: LLAMA_ARG_DRAFT_MAX)
--draft-min, --draft-n-min N minimum number of draft tokens to use for speculative decoding (default: 0)
(env: LLAMA_ARG_DRAFT_MIN)
--draft-p-min P minimum speculative decoding probability (greedy) (default: 0.75)
(env: LLAMA_ARG_DRAFT_P_MIN)
-cd, --ctx-size-draft N size of the prompt context for the draft model (default: 0, 0 = loaded from model)
(env: LLAMA_ARG_CTX_SIZE_DRAFT)
-devd, --device-draft <dev1,dev2,..> comma-separated list of devices to use for offloading the draft model (none = don't offload)
use --list-devices to see a list of available devices
-ngld, --gpu-layers-draft, --n-gpu-layers-draft N max. number of draft model layers to store in VRAM, either an exact number, 'auto', or 'all' (default: auto)
(env: LLAMA_ARG_N_GPU_LAYERS_DRAFT)
-md, --model-draft FNAME draft model for speculative decoding (default: unused)
(env: LLAMA_ARG_MODEL_DRAFT)
--spec-replace TARGET DRAFT translate the string in TARGET into DRAFT if the draft model and main model are not compatible
--gpt-oss-20b-default use gpt-oss-20b (note: can download weights from the internet)
--gpt-oss-120b-default use gpt-oss-120b (note: can download weights from the internet)
--vision-gemma-4b-default use Gemma 3 4B QAT (note: can download weights from the internet)
--vision-gemma-12b-default use Gemma 3 12B QAT (note: can download weights from the internet)