* spec: support MTP
* fix batch size
* rename files
* cont : simplify (#7)
* MTP: clean-up (#9)
* MTP: clean-up
* review: use llama_context_type instead of llama_graph_type
* review: remove llama_model_has_mtp
* review: fix convert issues
* convert: fix pycheck
* review: formatting
* use `mtp-` for identifying mtp models
* convert: fix mtp conversion
* mtp -> draft-mtp
* remove unused llama_arch
* add need_embd in speculative
* llama: allow partial seq_rm for GDN models for speculative decoding
Currently speculative checkpoint needs to restart from a checkpoint
after some draft tokens are not accepted, this leads to some wastage in
running the target again. This PR adds the ability to rollback upto
`draft_max` by storing the GDN intermediates.
* fix pending state
* vulkan: add GDN partial rollback
* meta: extend check to axis 1
* metal: add GDN partial rollback
Extend the gated delta net kernel to store intermediate states for
partial rollback support on the Metal backend.
- Add K (snapshot slot count) as a function constant
- Read input state from slot 0 of the 3D state tensor
- Write intermediate states to different slots during token loop
- For K=1, maintain backward-compatible single-slot behavior
Ref: 8c05923630
Assisted-by: llama.cpp:local pi
* delta_net_base: use ggml_pad instead of new_tensor
* review: add need_rs_seq
* review: rename part_bounded to n_rs
* review: deslop comments
* review: rename, add asserts
* server : adjust checkpoint logic (#11)
* server : adjust checkpoint logic
* cont : rm asserts
* server-context: fix early exit
* spec : fix compatibility with n-gram and add TODOs (#13)
* metal : cleanup
* llama : fix faulty bitwise check in recurrent memory
* server : disable RS-based MTP in combination with other spec types
* spec : add TODOs
* cont : fix comment
* cont : update comment
* common : fix logic for ngram + mtp compat
* llama-memory: enable checkpointing with partial rollback
* cont: add test-case for loading into a dirty ctx
* llama-memory-recurrent: clear rs_idx in clear
* download: fix mtp path
* llama-arch: fix enorm op
* docs: update docs
* conversion: fix type annotations
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* spec : refactor
* spec : drop support for incompatible vocabs
* spec : update common_speculative_init()
* cont : pass seq_id
* cont : dedup ctx_seq_rm_type
* server : sketch the ctx_dft decode loop
* server : draft prompt cache and checkpoints
* server : improve ctx names
* server, spec : transition to unified spec context
* cont : sync main and drft contexts
* cont : async drft eval when possible
* cont : handle non-ckpt models
* cont : pass correct n_past for drafting
* cont : process images throught the draft context
* spec : handle draft running out of context
* server : fix mtmd draft processing
* server : fix URL for draft model
* server : add comment
* server : clean-up + dry
* speculative-simple : update
* spec : fix n_past type
* server : fix slot ctx_drft ptr
* tools : update readme
* naming : improve consistency
* spec : refactor for multi-sequence speculative context
* cont : prepare params
* cont : prepare params
* spec : support parallel drafts
* server : support parallel drafting
* llama : reuse device buffers when possible
* server, spec : clean-up
* cont : clean-up
* cont : minor
* spec : reset `drafting` flag at the end
* spec : introduce `common_speculative_process()`
* spec : allow for multiple spec types (chain of speculators)
* replace old type field of type common_speculative_type in the
common_params_speculative struct with a vector to allow multiple
types to be specified
* introduce common_get_enabled_speculative_impls(const std::vector<enum common_speculative_type>)
to figure out which implementations the user has enabled
* introduce common_speculative_type_from_names(const std::vector<std::string> & names)
to parse the already user provided spec types
* all speculators run sequentially, best one wins (we verify its drafted tokens)
* maximize expected accepted tokens for current round by calculating the
product between the probability of accepting current token (n_acc_tokens / n_gen_drafts)
and the draft's length
---------
Co-authored-by: Petros Sideris <petros.sideris@nokia.com>
* ggml: backend-agnostic tensor parallelism
* support for GPT-OSS, Qwen 3 MoE
* partial Vulkan fix
* add support for 4/8 GPUs
* unconditional peer access
* re-use buffers + ggml contexts
* fix output pattern
* NCCL support
* GGML: HIP: add RCCL support
* Remove shfl and AllReduce from backend interface
* move allocation workaround out of ggml-alloc.c
* 2d tensor set/get support
* Fix the seg fault without NCCL
* Apply suggestion from JohannesGaessler
* support for tensor dims % n_devs != 0
* fix view_offs scaling
* arbitrary num. of GPUs/tensor split
* fix compilation
* better granularity estimate
* Support device-specific host buffer types if all underlying backends expose the same type. This allows using pinned memory instead of pageable memory for CUDA.
Fix compilation errors.
* partial Qwen 3 Next support
* Fix qwen3 30b (#8)
* Fix crash with Qwen-30B-A3B Q4_0
Qwen-30B-A3B Q4_0 has an intermediate dimension of 768. Using a granularity of 256 forces an uneven split between GPUs, which is not supported by the current implementation.
* Decide block size based on tensor quantization type
* Fix crashes due to KV cache serialization (#9)
KV cache serialization requires non-zero offsets on the tensor. Add support in the meta backend to set/get a tensor with a non-zero offset.
* metal : fix build (#7)
* static memory allocations, fix usage count
* fix tensor granularity
* more even memory distribution
* use BF16 for allreduce
* rebase fixup
* better error message for unsupported architectures
* Fix device mismatch during scatter of allReduce. (#11)
There is a mismatch between the dst buffer device and the backend device, causing the use of sync copies
* Enable the previous allreduce implementation. It is better in both perf and stability (#12)
* delay AllReduce for Moe for less I/O
* build : clean-up compile warnings
* backend : move most of the meta backend API to ggml-backend-impl.h
* cont : hide unused public API in the implementation
* llama : use llama_device + remove ggml_backend_dev_is_meta()
* ggml-backend : remove unused alloc include
* minor : remove regex include
* ggml : introduce ggml-ext.h for staging new APIs
* rebase fixup
* fix tests
* llama : more robust logic for determining Meta devices (#16)
* llama : more robust logic for determining Meta devices
* cont : fix devs size check
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* cont : fix log type
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
---------
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* disable roundtrip for meta backend
* fix arch selection
* Qwen 3.5 support
* fix Gemma 4 MoE
* fix OpenVino, SYCL
* fix test-llama-archs for CPU-only builds
* Fix Qwen 3.5 MoE
* disable meta backend tests for WebGPU
* tests : filter CPU-based devices from the Meta backend tests (#17)
* meta : formatting, naming, indentation (#18)
* formatting : llama-model.cpp
* formatting : ggml-ext.h
* formatting : ggml-backend-meta.cpp
* meta : add TODO
* add documentation
* better error messages
* fix GPT-OSS
---------
Co-authored-by: Carl Philipp Klemm <carl@uvos.xyz>
Co-authored-by: Gaurav Garg <gaugarg@nvidia.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Refactor llama_model_quantize_params to expose a pure C interface
* Restore comment and cleanup struct def
* Code review refactoring
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Code review refactoring
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* WIP: add NVFP4 quantization support
* tests
* improve NVFP4 dot product implementation performance and fix bad super call
* typo
* Use nvfp4 kvalues
* vulkan : fix NVFP4 shader compilation by including kvalues_mxfp4 lookup table
* vulcal and perf fixes
* wip
* Fix metal
* fix vulcan
* Rename threshold & fix wrong scale
* Fix MOE
* Shelf backend implementations (CUDA, Metal, Vulkan, arch-specific SIMD)
Remove NVFP4 support from GPU backends and architecture-specific
optimized dot products. These should be added in separate PRs so
backend specialists can review them independently.
Reverted files:
- ggml-cuda: common.cuh, convert.cu, mmq.cu/cuh, mmvq.cu, vecdotq.cuh,
quantize.cu/cuh, mma.cuh, ggml-cuda.cu, fattn-tile.cuh
- ggml-metal: ggml-metal.metal, ggml-metal-device.cpp, ggml-metal-impl.h,
ggml-metal-ops.cpp
- ggml-vulkan: ggml-vulkan.cpp, all vulkan-shaders/*
- ggml-cpu arch: arm/quants.c, x86/quants.c, powerpc/quants.c, s390/quants.c
Core NVFP4 support (type definition, CPU fallback dot product,
quantization, dequantization, conversion) is retained.
* Fix arch-fallback.h: add NVFP4 generic fallback for all platforms
After shelving backend-specific SIMD implementations, the generic
CPU dot product needs to be aliased on ARM, x86, PowerPC, and s390
platforms that previously relied on arch-specific versions.
* quantize: add NVFP4 as a quantization type option
* Fix ggml_fp32_to_ue4m3: handle subnormal values
Previously, values with ue4m3_exp <= 0 were clamped to 0, causing
all small scales to underflow. This made NVFP4 quantization via
llama-quantize produce garbage (PPL = 5.8M) since typical transformer
weights have amax/6.0 in the range 0.001-0.01, which falls in the
UE4M3 subnormal range.
Now subnormals are properly encoded as man * 2^-9 (exp=0, man=1..7),
matching the decode path in ggml_ue4m3_to_fp32.
Result: NVFP4 requantization now produces PPL = 15.25 (vs F16 = 14.33),
comparable to Q4_1 (PPL = 15.81) at slightly lower BPW (4.70 vs 5.15).
* Restore ARM NEON NVFP4 dot product implementation
Restores the optimized ggml_vec_dot_nvfp4_q8_0 for ARM NEON using
vqtbl1q_s8 lookup and ggml_vdotq_s32 dot products.
tg128 performance: 4.37 t/s (generic) -> 13.66 t/s (NEON) = 3.1x speedup
* Optimize ARM NEON NVFP4 dot product: LUT + vpaddq + vfmaq
- Add ue4m3_scale_lut[128] to ggml-common.h replacing branch-heavy
ggml_ue4m3_to_fp32() in the hot loop
- Use vpaddq_s32 for pairwise int32 reduction instead of vaddvq_s32
- Accumulate with vfmaq_f32 into float32x4_t vector accumulators
tg128: 8.1 -> 31.0 t/s (3.8x speedup, 77% of Q4_1 speed)
* ARM NEON NVFP4: rearrange q8 to match nibble layout
Alternative approach: rearrange q8 data to match the NVFP4 lo/hi
nibble layout instead of rearranging the looked-up NVFP4 values.
Eliminates vcombine_s8(vget_low, vget_low) shuffles.
Performance is equivalent (~18.5 t/s) - the bottleneck is the 2x
block overhead from QK=16 vs QK=32, not the shuffle instructions.
* CPU only backend 64 super-block layout
* cleanup
* Remove unused LUT
* int
* exclude NVFP4 from unsupported ops in metal build
* remove quantization for now
* store scales as native UE4M3, preserve original model bits when possible
* Update convert_hf_to_gguf.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* correct comment
* format
* reduce duplication and cleanup
* Address comments
* move detection to prepare_tensors
* Use math instead of const
* Move
* fix comment
* Shelf quantize tests
* Rebase and move check
* cleanup
* lint
* Update gguf-py/gguf/scripts/gguf_convert_endian.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Use fallback quant config
* Simplify
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* organize
* Refactor
* Update convert_hf_to_gguf.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update convert_hf_to_gguf.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update convert_hf_to_gguf.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* add quantize_nvfp4 (required for test_quants.py)
* add quantize_nvfp4 (required for test_quants.py)
* add quantize_nvfp4 (required for test_quants.py)
* fix return type
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* tests: add end-to-end tests per model architecture
* fixup for rebase
* fix use-after-free in llama-model-loader.cpp
* fix CI
* fix WebGPU
* fix CI
* disable CI for macOS-latest-cmake-arm64
* use expert_weights_scale only if != 0.0f
* comments
* Refactoring to use new llama_put_adapter_loras
* cont : alternative lora API
---------
Co-authored-by: Jake Chavis <jakechavis6@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Updated documentation
Model is no longer a parameter
* llama : fix trailing whitespace in comment
---------
Co-authored-by: Daniel Bevenius <daniel.bevenius@gmail.com>
* initial commit for branch
* simplify constants
* add params to `struct common_params_sampling`, add reference to PR
* explicitly clamp `min_target` and `max_target` to `[0.0, 1.0]`
* add args, rename `queue_size` -> `window_size`
* improved comments
* minor
* remove old unused code from algorithm
* minor
* add power law case to `common_sampler_init`, add sampler name mappings
* clarify behaviour when `window_size = 0`
* add missing enums
* remove `target_range` param, make `target == 1` no-op, cleanup code
* oops, straggler
* add missing parameters in `server-task.cpp`
* copy from author
ref:
https://gist.github.com/MrJackSpade/9be99c7efbba7b95a41377e123b7b069
* remove old debug log, style nit
* fix compiler warning, add commented-out logging per token
* re-write + change parameters + simplify
* oops forgot args.cpp
* fix leftover `window_size`
* add missing values to `common_params_sampling::print()`
* with logging
* does this fix it?
* no, but does this?
* update default decay
* optimize
* fix bad merge
my git skills are lacking
* silence `missing initializer for member`
* update default decay to 0.9
* fix logging
* format (double)
* add power law to the new `samplers` vector
* log sampler init values
* improve logging messages in llama_sampler_power_law
* remove extraneous logging
* simplify target computation
last commit with debug logging!
* remove debug logging, explicitly clamp params at init
* add `use_power_law` flag + logic, minor cleanup
* update `power-law` -> `adaptive-p`
* fix cold start EMA
- `ctx->weighted_sum` is now initialized and reset to `target / (1.0f -
clamped_decay)`
- `ctx->total_weight` is now initialized and reset to `1.0f / (1.0f -
clamped_decay)`
this fixes a "cold start" problem with the moving average
* update `SHARPNESS` constant to `10.0f`
* minor style fixes
no functional changes
* minor style fixes cont.
* update `llama_sampler_adaptive_p_i` for backend sampling (ref: #17004)
* separate into `apply` + `accept` functions
* `pending_token_idx`: switch from `llama_token` to `int32`
functionally identical (`llama.h` has `typedef int32_t llama_token;`),
but its more correct now
* don't transform logits <= -1e9f
* fix masking in backend top-p, min-p
* address review comments
* typo in comments `RND` -> `RNG`
* add docs
* add recommended values in completion docs
* address PR feedback
* remove trailing whitespace (for CI `editorconfig`)
* add to adaptive-p to `common_sampler_types_from_chars`
* lora: make sure model keep track of associated adapters
* deprecate llama_adapter_lora_free
* minor : std::unordered_set over std::set
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Adding --direct-io flag for model loading
* Fixing read_raw() calls
* Fixing Windows read_raw_at
* Changing type off_t to size_t for windows and Renaming functions
* disable direct io when mmap is explicitly enabled
* Use read_raw_unsafe when upload_backend is available, not functional on some devices with Vulkan and SYCL
* Fallback to std::fread in case O_DIRECT fails due to bad address
* Windows: remove const keywords and unused functions
* Update src/llama-mmap.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
---------
Co-authored-by: jtischbein <jtischbein@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>