koboldcpp

mirror of https://github.com/LostRuins/koboldcpp.git synced 2026-06-01 14:29:33 +00:00

Author	SHA1	Message	Date
Concedo	8ca4283f55	Merge branch 'upstream' into concedo_experimental # Conflicts: # .github/workflows/release.yml # .github/workflows/server.yml # .github/workflows/ui-build.yml # .github/workflows/ui-publish.yml # CMakeLists.txt # docs/autoparser.md # docs/backend/snapdragon/CMakeUserPresets.json # docs/backend/snapdragon/README.md # docs/backend/snapdragon/windows.md # docs/function-calling.md # examples/model-conversion/scripts/embedding/run-original-model.py # ggml/src/ggml-hexagon/htp/hmx-flash-attn-ops.c # ggml/src/ggml-opencl/ggml-opencl.cpp # ggml/src/ggml-opencl/kernels/cvt.cl # ggml/src/ggml-opencl/kernels/gemm_moe_mxfp4_f32_ns.cl # ggml/src/ggml-opencl/kernels/gemm_moe_q4_0_f32_ns.cl # ggml/src/ggml-opencl/kernels/gemm_moe_q4_1_f32_ns.cl # ggml/src/ggml-opencl/kernels/gemm_moe_q4_k_f32_ns.cl # ggml/src/ggml-opencl/kernels/gemm_moe_q5_0_f32_ns.cl # ggml/src/ggml-opencl/kernels/gemm_moe_q5_1_f32_ns.cl # ggml/src/ggml-opencl/kernels/gemm_moe_q5_k_f32_ns.cl # ggml/src/ggml-opencl/kernels/gemm_moe_q6_k_f32_ns.cl # ggml/src/ggml-opencl/kernels/gemv_moe_mxfp4_f32_ns.cl # ggml/src/ggml-opencl/kernels/gemv_moe_q4_0_f32_ns.cl # ggml/src/ggml-opencl/kernels/gemv_moe_q4_1_f32_ns.cl # ggml/src/ggml-opencl/kernels/gemv_moe_q4_k_f32_ns.cl # ggml/src/ggml-opencl/kernels/gemv_moe_q5_0_f32_ns.cl # ggml/src/ggml-opencl/kernels/gemv_moe_q5_1_f32_ns.cl # ggml/src/ggml-opencl/kernels/gemv_moe_q5_k_f32_ns.cl # ggml/src/ggml-opencl/kernels/gemv_moe_q6_k_f32_ns.cl # ggml/src/ggml-sycl/common.hpp # ggml/src/ggml-sycl/dmmv.cpp # ggml/src/ggml-sycl/gated_delta_net.cpp # ggml/src/ggml-sycl/ggml-sycl.cpp # ggml/src/ggml-vulkan/CMakeLists.txt # ggml/src/ggml-zendnn/CMakeLists.txt # ggml/src/ggml-zendnn/ggml-zendnn.cpp # requirements/requirements-convert_hf_to_gguf.txt # scripts/snapdragon/windows/setup-build.ps1 # tools/perplexity/perplexity.cpp	2026-05-24 13:55:44 +08:00
Michael Wand	b0df4c0cfd	model : add NVFP4 MTP scale tensors (#23563 ) * Add NVFP4 MTP scale tensors * Link Qwen3.5 MTP tensors * Aligned nullptr	2026-05-23 13:30:31 +02:00
Concedo	632c41a72f	Merge branch 'upstream' into concedo_experimental # Conflicts: # .github/workflows/build-apple.yml # .github/workflows/build-cmake-pkg.yml # .github/workflows/release.yml # .pi/gg/SYSTEM.md # CMakeLists.txt # CODEOWNERS # README.md # build-xcframework.sh # ci/run.sh # docs/build.md # examples/CMakeLists.txt # examples/llama.android/lib/build.gradle.kts # ggml/src/ggml-webgpu/wgsl-shaders/flash_attn_tile.wgsl # tests/CMakeLists.txt # tests/test-backend-ops.cpp # tests/test-save-load-state.cpp # tools/batched-bench/CMakeLists.txt # tools/cli/CMakeLists.txt # tools/completion/CMakeLists.txt # tools/llama-bench/CMakeLists.txt # tools/perplexity/CMakeLists.txt # tools/quantize/CMakeLists.txt # tools/server/CMakeLists.txt	2026-05-22 20:42:51 +08:00
Kashif Rasul	afcda09d15	vocab : fix HybridDNA tokenizer (#23466 ) * vocab : mark hybriddna k-mers to avoid BPE token collisions * improved loop --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2026-05-22 11:17:31 +02:00
Concedo	718dc159b6	Merge branch 'upstream' into concedo_experimental # Conflicts: # CMakeLists.txt # docs/speculative.md # ggml/src/ggml-cuda/CMakeLists.txt # ggml/src/ggml-hexagon/ggml-hexagon.cpp # ggml/src/ggml-hexagon/htp/hmx-matmul-ops.c # ggml/src/ggml-hexagon/htp/hmx-ops.h # ggml/src/ggml-hexagon/htp/main.c # ggml/src/ggml-hexagon/htp/matmul-ops.c # ggml/src/ggml-hexagon/htp/rope-ops.c # ggml/src/ggml-hexagon/htp/ssm-conv.c # ggml/src/ggml-opencl/ggml-opencl.cpp # scripts/snapdragon/adb/run-bench.sh # scripts/snapdragon/adb/run-cli.sh # scripts/snapdragon/adb/run-completion.sh # scripts/snapdragon/adb/run-mtmd.sh # scripts/snapdragon/windows/run-bench.ps1 # scripts/snapdragon/windows/run-cli.ps1 # scripts/snapdragon/windows/run-completion.ps1 # scripts/snapdragon/windows/run-mtmd.ps1 # src/llama-vocab.cpp # tests/test-backend-ops.cpp # tools/batched-bench/CMakeLists.txt # tools/batched-bench/batched-bench.cpp # tools/cli/CMakeLists.txt # tools/cli/README.md # tools/cli/cli.cpp # tools/completion/CMakeLists.txt # tools/completion/README.md # tools/llama-bench/CMakeLists.txt # tools/llama-bench/llama-bench.cpp # tools/mtmd/CMakeLists.txt # tools/mtmd/tests/test-deepseek-ocr.py # tools/mtmd/tests/tests-requirements.txt # tools/perplexity/CMakeLists.txt # tools/perplexity/perplexity.cpp # tools/quantize/CMakeLists.txt # tools/server/CMakeLists.txt # tools/server/README.md # ty.toml	2026-05-21 23:47:21 +08:00
Aman Gupta	12e5d99078	mtp: use inp_out_ids for skipping logit computation (#23433 ) when doing a follow-up decode for the draft model, we were always doing the logit computation even though it is not required.	2026-05-21 15:23:14 +08:00
Kashif Rasul	7ea23ddf7b	vocab : add Carbon-3B (HybridDNATokenizer) support (#23410 ) * vocab : add Carbon-3B (HybridDNATokenizer) support Adds a new BPE pre-type LLAMA_VOCAB_PRE_TYPE_CARBON for the HybridDNATokenizer used by HuggingFaceBio/Carbon-{500M,3B,8B}. The base BPE is Qwen3-4B-Base's; what differs is that text inside <dna>...</dna> regions is chunked into fixed 6-mers (right-padded with 'A' on the trailing partial), and any base outside ACGT maps to <oov>. * src/llama-vocab.{h,cpp}: new pre-type, dispatched from llm_tokenizer_bpe_session::tokenize. * src/llama-vocab-carbon.h: pure helpers (tokenize_carbon, emit_dna_kmers) factored out for unit testing — no llama_vocab dependency, vocab access goes through a std::function. * conversion/base.py: detect HybridDNATokenizer by class name in get_vocab_base_pre (chktxt collides with Qwen3 base since it has no <dna>), and pass trust_remote_code=True in get_vocab_base so the custom tokenizer class can load. * tests/test-tokenizer-carbon.cpp: 12 cases covering single 6-mer, multi 6-mer, lowercase, invalid base -> <oov>, partial k-mer right-pad, mixed text+DNA, empty <dna></dna>, unterminated <dna>, two regions, vocab miss. * vocab : align Carbon-3B changes with llama.cpp conventions * Fold tokenize_carbon + emit_dna_kmers inline into llm_tokenizer_bpe_session (drop src/llama-vocab-carbon.h), matching how every other tokenizer keeps its helpers inside llama-vocab.cpp. * Replace the standalone unit test with the conventional test-tokenizer-0 row backed by models/ggml-vocab-carbon.gguf (vocab-only conversion) + .inp/.out fixtures covering single 6-mer, multi 6-mer, lowercase, invalid base -> <oov>, partial right-pad, mixed text+DNA, empty <dna></dna>, unterminated <dna>, two regions. * Register "carbon" in convert_hf_to_gguf_update.py's model list (pointing at HuggingFaceBio/Carbon-3B) and teach both AutoTokenizer call sites in the updater to pass trust_remote_code=True for it, matching how t5 is special-cased. 
* vocab : move Carbon dispatch to _set_vocab_carbon + LlamaModel branch Refactor the conversion-side changes to follow the per-tokenizer-family convention used by _set_vocab_qwen, _set_vocab_interns1, _set_vocab_glm, etc. instead of conditionalising the shared get_vocab_base / get_vocab_base_pre paths. * conversion/base.py: add _set_vocab_carbon — self-contained, loads with trust_remote_code=True so HybridDNATokenizer's merged Qwen3 + DNA vocab is visible, writes tokenizer.ggml.pre = "carbon" directly. * conversion/llama.py: branch in LlamaModel.set_vocab on tokenizer_config.json["tokenizer_class"] == "HybridDNATokenizer" and dispatch to _set_vocab_carbon. Same precedent as conversion/bert.py (tokenizer_class branch between BertTokenizer / RobertaTokenizer) and conversion/phi.py. * conversion/base.py: revert the conditional in get_vocab_base and the class-name short-circuit in the auto-generated get_vocab_base_pre. * tests : expand ggml-vocab-carbon.gguf fixtures with model-card examples Add 6 cases from the Carbon-3B model card on top of the existing edge coverage: the unterminated basic-completion prompt, the closed 33-bp example, the metadata-conditioned prompt (with <vertebrate_mammalian> and <protein_coding_region> which BPE-decompose since they are not in the vocab), the documented anti-pattern of raw DNA without <dna> tags, and the two likelihood-scoring examples. Brings the suite to 19 cases. * vocab : promote HybridDNATokenizer to its own LLAMA_VOCAB_TYPE Refactor per upstream review: > This should be its own tokenizer model, ie. carbonhybriddna instead > of gpt2 and not carbon pre-tokenizer. That way you can keep the > correct pre-tokenizer, in case that ever changes. 
Previously the tokenizer was modelled as LLAMA_VOCAB_TYPE_BPE plus a new LLAMA_VOCAB_PRE_TYPE_CARBON, which (a) put a CARBON-specific branch inside llm_tokenizer_bpe_session::tokenize (only existing pre-types differ in regex, not dispatch logic), and (b) conflated "hybrid DNA tokenization" with "Qwen3 BPE pre-tokenizer". This change moves it to its own vocab type, peer to PLAMO2, with the GGUF model name matching the HF tokenizer class (HybridDNATokenizer): * include/llama.h: new LLAMA_VOCAB_TYPE_HYBRIDDNA = 7. * src/llama-vocab.cpp: new llm_tokenizer_hybriddna + session that owns std::unique_ptr<llm_tokenizer_bpe> for non-<dna> text and routes raw text through a DNA-aware splitter; wired into init_tokenizer, tokenize, type_name, byte_to_token, and the BPE-style token_to_piece case (DNA k-mers + <dna>/</dna>/<oov> are pure ASCII, so byte-level BPE decoding handles them). LLAMA_VOCAB_TYPE_HYBRIDDNA gets its own branch in the vocab-type config block alongside SPM/WPM/UGM/RWKV, where pre_type is set to QWEN2 and the matching add_space_prefix / escape_whitespaces / clean_spaces flags are applied — mirroring qwen2's BPE path so byte-level BPE merging stays bit-identical to the Python reference for non-DNA text. * src/llama-vocab.h: drop the short-lived LLAMA_VOCAB_PRE_TYPE_CARBON. * conversion/base.py: _set_vocab_hybriddna writes tokenizer.ggml.model = "hybriddna" (no separate pre). * conversion/llama.py: dispatch on tokenizer_class == "HybridDNATokenizer" same as bert.py / phi.py do. * models/ggml-vocab-hybriddna.gguf{,.inp,.out}: renamed fixture + regenerated metadata. * convert_hf_to_gguf_update.py: drop the stale chkhsh entry and trust_remote_code special-case (no longer needed since dispatch is now class-name driven, not chkhsh). 
Verified end-to-end against HuggingFaceBio/Carbon-{500M,3B,8B}: tokenization is bit-identical to the Python HybridDNATokenizer for all 19 test fixtures plus the model-card metadata-conditioned prompt; greedy completion produces the same DNA continuation as the Python reference; spec-dec with 500M as draft for 8B still works. * vocab : relax llm_tokenizer_bpe assert to allow HYBRIDDNA * vocab : drop llm_tokenizer_bpe vocab-type assert * vocab : write tokenizer.ggml.pre for HYBRIDDNA, share BPE dispatch * vocab : assert BPE or HYBRIDDNA in llm_tokenizer_bpe * vocab : annotate #endif with PRETOKENIZERDEBUG * vocab : drop local hybriddna fixture (moves to ggml-org/vocabs) * deduplicate * simplify * simplify --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2026-05-21 08:34:32 +02:00
Daniel Elliott	eeeaf6180b	llama-graph: fix null-buffer crash in llm_graph_input_attn_kv_iswa for SWA-only models (#23131 ) When a model has zero non-SWA attention layers (e.g. a SWA-only slice of Gemma 4), the base KV cache has no layer tensors. The input tensors (self_k_idxs, self_v_idxs, self_kq_mask) are created as graph input nodes but never consumed by any compute node, so the backend scheduler never allocates a buffer for them. Calling mctx->get_base()->set_input_k_idxs() on an unallocated tensor then hits GGML_ASSERT(buffer) at ggml-backend.cpp:194. The same scenario applies symmetrically: if a model had zero SWA layers, the SWA tensors would be unallocated. Fix: guard both the base and SWA set_input calls with null/buffer checks, matching the pattern already used by llm_graph_input_mem_hybrid_iswa::set_input (line ~674) which has the comment: 'base tensors may not be allocated if there are no non-SWA attention layers'. Also fix can_reuse() in the same class to skip the ne[0] and kq_mask checks for unallocated tensors, preventing a null-dereference on the reuse path.	2026-05-21 09:20:51 +03:00
wendadawen	6a257d4463	mtmd, model : merge HunyuanOCR into HunyuanVL and fix OCR vision precision (#23329 ) - HunyuanOCR shares the same HF arch and vision layout as HunyuanVL but was split into a separate path that skipped the +0.1 bilinear sampler used by the HF reference. - Collapse OCR into the HUNYUANVL projector + HUNYUAN_VL text arch	2026-05-21 00:35:37 +02:00
Gaurav Garg	ad27757261	Move to backend sampling for MTP draft path (#23287 ) * Move to backend sampling for MTP draft path Run top_k(10) on the draft backend. D2H transfers happen only for the top 10 logits Make backend sampling more robust and fallback to CPU on failure cases, such as with "-sm tensor" or when a backend doesn't support TOP_K. * Allow sampler chains to be partially offloaded to backend * Add --spec-draft-backend-sampling argument. Enabled by default.	2026-05-20 22:34:45 +05:30
Concedo	7d987af23a	Merge branch 'upstream' into concedo_experimental # Conflicts: # .devops/cann.Dockerfile # .devops/cpu.Dockerfile # .devops/cuda.Dockerfile # .devops/intel.Dockerfile # .devops/llama-cli-cann.Dockerfile # .devops/musa.Dockerfile # .devops/openvino.Dockerfile # .devops/rocm.Dockerfile # .devops/s390x.Dockerfile # .devops/vulkan.Dockerfile # .github/ISSUE_TEMPLATE/011-bug-results.yml # .github/ISSUE_TEMPLATE/019-bug-misc.yml # .github/workflows/build-and-test-snapdragon.yml # .github/workflows/docker.yml # .github/workflows/server-self-hosted.yml # .github/workflows/ui-ci.yml # .pi/gg/SYSTEM.md # README.md # common/arg.cpp # docs/backend/SYCL.md # docs/backend/snapdragon/CMakeUserPresets.json # docs/backend/snapdragon/README.md # docs/speculative.md # examples/save-load-state/save-load-state.cpp # ggml/src/ggml-hexagon/ggml-hexagon.cpp # ggml/src/ggml-hexagon/htp/CMakeLists.txt # ggml/src/ggml-hexagon/htp/htp-ctx.h # ggml/src/ggml-hexagon/htp/htp-ops.h # ggml/src/ggml-hexagon/htp/main.c # ggml/src/ggml-hexagon/htp/rope-ops.c # ggml/src/ggml-hexagon/htp/unary-ops.c # ggml/src/ggml-opencl/CMakeLists.txt # ggml/src/ggml-opencl/ggml-opencl.cpp # ggml/src/ggml-opencl/kernels/cvt.cl # ggml/src/ggml-sycl/ggml-sycl.cpp # ggml/src/ggml-webgpu/ggml-webgpu.cpp # ggml/src/ggml-webgpu/wgsl-shaders/gated_delta_net.wgsl # tools/cli/README.md # tools/server/README.md	2026-05-20 18:48:34 +08:00
Georgi Gerganov	57ebaf4edd	metal : optimize pad + cpy (#23354 ) * metal : optimize pad * metal : optimize cpy * cont : better row packing in threadgroup	2026-05-20 09:42:00 +03:00
Daniel Bevenius	baf3cc6e1d	model : clarify MTP layer comment in qwen35.cpp [no ci] (#23338 ) This commit attempts to clarify a code comment in graph_mtp regarding where the MTP layer is stored. The motivation for this is that it was not obvious to me what the original comment meant and hopefully this makes it clearer.	2026-05-19 18:41:44 +02:00
Georgi Gerganov	d14ce3dab4	llama : MTP clean-up (#23269 ) * llama : disable equal splits for recurrent memory with partial rollback * spec : re-enable p-min with MTP drafts * spec : re-enable ngram spec in combination with RS rollback * spec : fix ngram-map-* params * spec : fix acceptance logic in combined ngram + draft configs * graph : fix reuse for combined `token` + `embd` batches * spec : log parameters for each speculative implementation - add LOG_INF in each constructor with implementation type and parameters - extract device string logic into common_speculative_get_devices_str() - move 'adding speculative implementation' log from init into constructors Assisted-by: llama.cpp:local pi * spec : extend --spec-default with ngram-map-k4v Assisted-by: llama.cpp:local pi * minor : fix n_embd log * args : update draft.n_max == 3 + regen docs * spec : relax ngram-mod rejection thold to 0.25 @ 5 low * logs : improve * docs : update speculative decoding CLI argument documentation - Add missing draft model CPU scheduling and tensor override parameters - Update --spec-type to include all available types (excluding draft-eagle3 WIP) - Fix default values to match implementation (n_max=3, n_min=0, p_min=0.0) - Remove deprecated options (spec-draft-ctx-size, spec-draft-replace) - Add environment variables for new parameters Assisted-by: llama.cpp:local pi * arg : step-back on adding k4v to the default spec config * cont : fix name	2026-05-19 15:32:58 +03:00
Concedo	fecf2dc3fa	Merge branch 'upstream' into concedo_experimental # Conflicts: # .github/workflows/server-self-hosted.yml # CMakeLists.txt # CODEOWNERS # ci/run.sh # cmake/llama-config.cmake.in # common/chat.cpp # examples/sycl/start-svr.sh # examples/sycl/test.sh # examples/sycl/win-start-svr.bat # examples/sycl/win-test.bat # ggml/src/ggml-sycl/ggml-sycl.cpp # ggml/src/ggml-sycl/vecdotq.hpp # ggml/src/ggml-vulkan/CMakeLists.txt # scripts/wc2wt.sh # tests/test-backend-ops.cpp # tests/test-chat.cpp	2026-05-18 21:27:23 +08:00
Andrei	49c21f97cd	llama: initialize pre-norm embedding mask flag (#23256 )	2026-05-18 14:20:49 +03:00
Aman Gupta	3e12fbdea5	llama: avoid copying logits during prompt decode in MTP (#23198 ) * llama: avoid copying logits during prompt decode in MTP * review: update comment * llama-graph: call set_output for t_h_pre_norm	2026-05-17 23:30:25 +08:00
Concedo	1e828ccabf	Merge branch 'upstream' into concedo_experimental # Conflicts: # common/common.cpp # ggml/CMakeLists.txt # scripts/sync-ggml.last # scripts/sync_vendor.py # src/llama-context.cpp # tests/CMakeLists.txt # tests/test-backend-ops.cpp # tools/cli/README.md # tools/completion/README.md # tools/server/README.md	2026-05-17 11:26:18 +08:00
Concedo	9203b6a051	Merge branch 'upstream' into concedo_experimental # Conflicts: # .github/labeler.yml # .github/workflows/build-self-hosted.yml # .github/workflows/release.yml # .github/workflows/server-sanitize.yml # .github/workflows/server-self-hosted.yml # .github/workflows/server.yml # .github/workflows/ui-build.yml # .github/workflows/ui-ci.yml # .github/workflows/ui-publish.yml # .gitignore # CMakeLists.txt # CODEOWNERS # scripts/ui-download.cmake # scripts/xxd.cmake # tests/test-backend-ops.cpp # tests/test-reasoning-budget.cpp # tools/CMakeLists.txt # tools/server/CMakeLists.txt # tools/server/README.md	2026-05-16 22:56:33 +08:00
Aman Gupta	255582687b	llama + spec: MTP Support (#22673 ) * spec: support MTP * fix batch size * rename files * cont : simplify (#7) * MTP: clean-up (#9) * MTP: clean-up * review: use llama_context_type instead of llama_graph_type * review: remove llama_model_has_mtp * review: fix convert issues * convert: fix pycheck * review: formatting * use `mtp-` for identifying mtp models * convert: fix mtp conversion * mtp -> draft-mtp * remove unused llama_arch * add need_embd in speculative * llama: allow partial seq_rm for GDN models for speculative decoding Currently speculative checkpoint needs to restart from a checkpoint after some draft tokens are not accepted, this leads to some wastage in running the target again. This PR adds the ability to rollback upto `draft_max` by storing the GDN intermediates. * fix pending state * vulkan: add GDN partial rollback * meta: extend check to axis 1 * metal: add GDN partial rollback Extend the gated delta net kernel to store intermediate states for partial rollback support on the Metal backend. 
- Add K (snapshot slot count) as a function constant - Read input state from slot 0 of the 3D state tensor - Write intermediate states to different slots during token loop - For K=1, maintain backward-compatible single-slot behavior Ref: `8c05923630` Assisted-by: llama.cpp:local pi * delta_net_base: use ggml_pad instead of new_tensor * review: add need_rs_seq * review: rename part_bounded to n_rs * review: deslop comments * review: rename, add asserts * server : adjust checkpoint logic (#11) * server : adjust checkpoint logic * cont : rm asserts * server-context: fix early exit * spec : fix compatibility with n-gram and add TODOs (#13) * metal : cleanup * llama : fix faulty bitwise check in recurrent memory * server : disable RS-based MTP in combination with other spec types * spec : add TODOs * cont : fix comment * cont : update comment * common : fix logic for ngram + mtp compat * llama-memory: enable checkpointing with partial rollback * cont: add test-case for loading into a dirty ctx * llama-memory-recurrent: clear rs_idx in clear * download: fix mtp path * llama-arch: fix enorm op * docs: update docs * conversion: fix type annotations --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2026-05-16 20:06:23 +08:00
ynankani	42928bc14d	model : NvFP4 quantized LM head support (#23046 ) * NvFP4 quantized LM head support Signed-off-by: ynankani <ynankani@nvidia.com> * Address review comments Signed-off-by: ynankani <ynankani@nvidia.com> * Add assert for NvFp4 lm head and tied embeddings Signed-off-by: ynankani <ynankani@nvidia.com> * Address review comments Signed-off-by: ynankani <ynankani@nvidia.com> * Create output_s tensor only when LM head NvFp4 Signed-off-by: ynankani <ynankani@nvidia.com> --------- Signed-off-by: ynankani <ynankani@nvidia.com>	2026-05-16 11:09:27 +02:00
Concedo	cc82c3164e	Merge branch 'upstream' into concedo_experimental # Conflicts: # .devops/intel.Dockerfile # .github/workflows/build-cross.yml # .github/workflows/build-sycl.yml # .github/workflows/build.yml # .github/workflows/editorconfig.yml # .github/workflows/release.yml # cmake/riscv64-spacemit-linux-gnu-gcc.cmake # docs/backend/OPENVINO.md # docs/backend/SYCL.md # docs/build-riscv64-spacemit.md # docs/ops.md # docs/ops/WebGPU.csv # embd_res/ggml-vocab-qwen35.gguf # embd_res/ggml-vocab-qwen35.gguf.inp # embd_res/ggml-vocab-qwen35.gguf.out # examples/model-conversion/Makefile # ggml/CMakeLists.txt # ggml/src/ggml-cpu/CMakeLists.txt # ggml/src/ggml-hexagon/ggml-hexagon.cpp # ggml/src/ggml-hexagon/htp/hmx-flash-attn-ops.c # ggml/src/ggml-hexagon/htp/hmx-matmul-ops.c # ggml/src/ggml-hexagon/htp/hmx-utils.h # ggml/src/ggml-hexagon/htp/htp-ops.h # ggml/src/ggml-hexagon/htp/hvx-utils.h # ggml/src/ggml-hexagon/htp/main.c # ggml/src/ggml-hexagon/htp/unary-ops.c # ggml/src/ggml-opencl/CMakeLists.txt # ggml/src/ggml-opencl/ggml-opencl.cpp # ggml/src/ggml-opencl/kernels/cvt.cl # ggml/src/ggml-sycl/CMakeLists.txt # ggml/src/ggml-sycl/common.cpp # ggml/src/ggml-sycl/common.hpp # ggml/src/ggml-sycl/ggml-sycl.cpp # ggml/src/ggml-webgpu/ggml-webgpu-shader-lib.hpp # ggml/src/ggml-webgpu/ggml-webgpu.cpp # ggml/src/ggml-webgpu/wgsl-shaders/common_decls.tmpl # ggml/src/ggml-webgpu/wgsl-shaders/flash_attn_tile.wgsl # ggml/src/ggml-webgpu/wgsl-shaders/flash_attn_vec_reduce.wgsl # ggml/src/ggml-webgpu/wgsl-shaders/flash_attn_vec_split.wgsl # ggml/src/ggml-webgpu/wgsl-shaders/get_rows.wgsl # ggml/src/ggml-webgpu/wgsl-shaders/mul_mat_decls.tmpl # ggml/src/ggml-webgpu/wgsl-shaders/mul_mat_vec_acc.tmpl # ggml/src/ggml-webgpu/wgsl-shaders/unary.wgsl # ggml/src/ggml-zendnn/CMakeLists.txt # ggml/src/ggml-zendnn/ggml-zendnn.cpp # scripts/snapdragon/adb/run-completion.sh # tests/CMakeLists.txt # tools/cli/README.md # tools/completion/README.md # tools/mtmd/clip-impl.h # tools/mtmd/clip.cpp 
# tools/mtmd/clip.h # tools/server/README.md	2026-05-14 19:04:04 +08:00
Kabir Potdar	42532afff4	unicode,test: add Qwen3.5 non-backtracking tokenizer handler and regr… (#22110 ) * unicode,test: add Qwen3.5 non-backtracking tokenizer handler and regression tests - Add unicode_regex_split_custom_qwen35() to [src/unicode.cpp](src/unicode.cpp), a non-backtracking handler for Qwen3.5's [\p{L}\p{M}]+ regex (letters + combining marks). - Register the handler in the custom tokenizer dispatch table to prevent stack overflows on long inputs (fixes #21919). - Add [models/ggml-vocab-qwen35.gguf](models/ggml-vocab-qwen35.gguf) (test vocab), [models/ggml-vocab-qwen35.gguf.inp](models/ggml-vocab-qwen35.gguf.inp) (test cases), and [models/ggml-vocab-qwen35.gguf.out](models/ggml-vocab-qwen35.gguf.out) (expected output) for regression testing. - Update [tests/CMakeLists.txt](tests/CMakeLists.txt) to include the new test entry. This mirrors the Qwen2 fix (commit `0d049d6`), but adapts for Qwen3.5's regex. Ensures robust Unicode tokenization and prevents std::regex stack overflows. Closes #21919. * fix: enhance regex handling for Qwen3.5 tokenizer to include accent marks * cont : remove trailing whitespace --------- Co-authored-by: Kabir <kabir@example.com> Co-authored-by: Alde Rojas <hello@alde.dev>	2026-05-14 11:03:40 +02:00
Concedo	f7923b261f	need to fix cuda compile. Merge branch 'upstream' into concedo_experimental # Conflicts: # .github/workflows/python-type-check.yml # examples/speculative-simple/README.md # examples/speculative-simple/speculative-simple.cpp # ggml/src/ggml-cuda/im2col.cu # ggml/src/ggml-opencl/CMakeLists.txt # ggml/src/ggml-opencl/ggml-opencl.cpp # ggml/src/ggml-opencl/kernels/cvt.cl # tests/test-backend-ops.cpp # tools/cli/README.md # tools/mtmd/CMakeLists.txt # tools/server/README.md	2026-05-12 20:47:07 +08:00
Georgi Gerganov	68e7ea3eab	spec : parallel drafting support (#22838 ) * spec : refactor * spec : drop support for incompatible vocabs * spec : update common_speculative_init() * cont : pass seq_id * cont : dedup ctx_seq_rm_type * server : sketch the ctx_dft decode loop * server : draft prompt cache and checkpoints * server : improve ctx names * server, spec : transition to unified spec context * cont : sync main and drft contexts * cont : async drft eval when possible * cont : handle non-ckpt models * cont : pass correct n_past for drafting * cont : process images through the draft context * spec : handle draft running out of context * server : fix mtmd draft processing * server : fix URL for draft model * server : add comment * server : clean-up + dry * speculative-simple : update * spec : fix n_past type * server : fix slot ctx_drft ptr * tools : update readme * naming : improve consistency * spec : refactor for multi-sequence speculative context * cont : prepare params * cont : prepare params * spec : support parallel drafts * server : support parallel drafting * llama : reuse device buffers when possible * server, spec : clean-up * cont : clean-up * cont : minor * spec : reset `drafting` flag at the end * spec : introduce `common_speculative_process()` * spec : allow for multiple spec types (chain of speculators) * replace old type field of type common_speculative_type in the common_params_speculative struct with a vector to allow multiple types to be specified * introduce common_get_enabled_speculative_impls(const std::vector<enum common_speculative_type>) to figure out which implementations the user has enabled * introduce common_speculative_type_from_names(const std::vector<std::string> & names) to parse the already user provided spec types * all speculators run sequentially, best one wins (we verify its drafted tokens) * maximize expected accepted tokens for current round by calculating the product between the probability of accepting current token 
(n_acc_tokens / n_gen_drafts) and the draft's length --------- Co-authored-by: Petros Sideris <petros.sideris@nokia.com>	2026-05-11 19:09:43 +03:00
Concedo	2771e16fbc	Merge branch 'upstream' into concedo_experimental # Conflicts: # .devops/intel.Dockerfile # .devops/nix/package.nix # .gitignore # docs/backend/SYCL.md # docs/ops.md # docs/ops/SYCL.csv # ggml/CMakeLists.txt # ggml/src/ggml-cuda/fattn.cu # ggml/src/ggml-cuda/ggml-cuda.cu # ggml/src/ggml-sycl/CMakeLists.txt # ggml/src/ggml-sycl/common.hpp # ggml/src/ggml-sycl/convert.cpp # ggml/src/ggml-sycl/dequantize.hpp # ggml/src/ggml-sycl/fattn-common.hpp # ggml/src/ggml-sycl/getrows.cpp # ggml/src/ggml-sycl/ggml-sycl.cpp # ggml/src/ggml-sycl/im2col.cpp # ggml/src/ggml-sycl/im2col.hpp # ggml/src/ggml-sycl/mmvq.cpp # ggml/src/ggml-sycl/quants.hpp # ggml/src/ggml-sycl/vecdotq.hpp # ggml/src/ggml-virtgpu/ggml-backend-device.cpp # scripts/sync-ggml.last # scripts/sync_vendor.py # tests/test-backend-ops.cpp	2026-05-11 16:18:28 +08:00
Concedo	9b0b36b5ef	Merge commit '`66001722aa`' into concedo_experimental # Conflicts: # README.md # docs/ops.md # docs/ops/SYCL.csv # examples/sycl/start-svr.sh # ggml/src/ggml-hexagon/ggml-hexagon.cpp # ggml/src/ggml-hexagon/htp/CMakeLists.txt # ggml/src/ggml-hexagon/htp/htp-ctx.h # ggml/src/ggml-hexagon/htp/htp-ops.h # ggml/src/ggml-hexagon/htp/main.c # ggml/src/ggml-hexagon/htp/unary-ops.c # ggml/src/ggml-opencl/CMakeLists.txt # ggml/src/ggml-opencl/ggml-opencl.cpp # ggml/src/ggml-opencl/kernels/cvt.cl # ggml/src/ggml-sycl/gated_delta_net.hpp # ggml/src/ggml-sycl/ggml-sycl.cpp # ggml/src/ggml-sycl/pad.cpp # ggml/src/ggml-sycl/ssm_conv.cpp # tests/test-backend-ops.cpp # tests/test-reasoning-budget.cpp # tools/server/README.md # tools/server/webui/src/lib/constants/settings-config.ts	2026-05-11 15:40:10 +08:00
Sigbjørn Skjæret	5755a100cd	model : fix model type check for granite/llama3 and deepseek2/glm4.7 lite (#22870 )	2026-05-10 08:44:29 +02:00
Sumit Chatterjee	1e5ad35d56	model : add sarvam_moe architecture support (#20275 )	2026-05-09 16:31:50 +02:00
ynankani	9f5f0e689c	model : support Gemma4_26B_A4B_NVFP4 (#22804 ) * Gemma4_26B_A4B_NvFp4 hf checkpoint convert to gguf format fixes Signed-off-by: ynankani <ynankani@nvidia.com> * Apply suggestions from code review Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Address review comments Signed-off-by: ynankani <ynankani@nvidia.com> * fix CRLF Signed-off-by: ynankani <ynankani@nvidia.com> * Lint error fix Signed-off-by: ynankani <ynankani@nvidia.com> --------- Signed-off-by: ynankani <ynankani@nvidia.com> Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2026-05-08 20:42:09 +02:00
Concedo	eb30b29d69	Merge branch 'upstream' into concedo_experimental # Conflicts: # .github/workflows/gguf-publish.yml # CODEOWNERS # examples/sycl/test.sh # pyproject.toml # tools/mtmd/CMakeLists.txt # tools/mtmd/README.md	2026-05-08 14:48:57 +08:00
Georgi Gerganov	e43431b381	llama : fix device state save/load (#22805 )	2026-05-07 21:43:40 +03:00
Georgi Gerganov	803627f121	llama : remove unnecessary seq_id check during state restore (#22797 )	2026-05-07 16:37:26 +03:00
AesSedai	8e52631d55	model: Add Mimo v2.5 model support (#22493 ) * add mimo-v2.5 support * mimo-v2.5: fix modify_tensors row split * mimo-v2.5: forgot `add_attn_value_scale` plumbing * mimo-v2.5: fix tp dequant to detect tp rows * mimo-v2.5: fix TP iteration to be descending * mimo-v2.5: fix comment * mimo-v2.5: retain fused qkv * mimo-v2.5: missed the attn_value scale during merge * mimo-v2.5: fused QKV needs contiguous for scaling attention value * mimo-v2.5: move `speech_embeddings.` to TextModel filter_tensors * Update src/llama-hparams.h Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update src/models/mimo2.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update src/models/mimo2.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update convert_hf_to_gguf.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update convert_hf_to_gguf.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update src/models/mimo2.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * mimo-v2.5: include MTP weights in gguf --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2026-05-07 13:21:58 +02:00
Adrien Gallouët	3980e04d5a	llama : add missing call to ggml_backend_load_all() (#22752 ) Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2026-05-07 08:24:47 +03:00
Gilad S.	5207d120ea	model : don't crash on unsupported architecture (#22742 ) * model: don't crash on unsupported architecture * Update src/llama-model.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2026-05-06 18:51:21 +02:00
Concedo	9e9497f0cc	Merge remote-tracking branch 'origin/upstream' into concedo_experimental # Conflicts: # examples/save-load-state/save-load-state.cpp # ggml/CMakeLists.txt # ggml/src/ggml-cpu/CMakeLists.txt # ggml/src/ggml-hexagon/htp/hmx-matmul-ops.c # ggml/src/ggml-hexagon/htp/matmul-ops.c # ggml/src/ggml-opencl/CMakeLists.txt # ggml/src/ggml-opencl/ggml-opencl.cpp # ggml/src/ggml-opencl/kernels/gemm_noshuffle_q4_0_f32.cl # ggml/src/ggml-opencl/kernels/gemm_noshuffle_q8_0_f32.cl # ggml/src/ggml-opencl/kernels/gemv_noshuffle_q4_0_f32.cl # ggml/src/ggml-opencl/kernels/gemv_noshuffle_q4_0_f32_spec.cl # ggml/src/ggml-opencl/kernels/gemv_noshuffle_q8_0_f32.cl # ggml/src/ggml-rpc/ggml-rpc.cpp # scripts/sync-ggml.last # scripts/sync_vendor.py # src/llama-graph.cpp # tests/test-backend-ops.cpp # tests/test-state-restore-fragmented.cpp	2026-05-06 21:20:06 +08:00
Concedo	7240da764a	Merge commit '`935a340292`' into concedo_experimental # Conflicts: # examples/diffusion/CMakeLists.txt # scripts/server-test-function-call.py # src/llama-model.cpp # src/models/gemma4.cpp # tests/test-chat.cpp # tests/test-reasoning-budget.cpp # tools/server/README.md	2026-05-06 21:02:25 +08:00
Adrien Gallouët	bf76ac77be	common : only load backends when required (#22290 ) * common : only load backends when required Signed-off-by: Adrien Gallouët <angt@huggingface.co> * llama : call ggml_backend_load_all() directly from llama_backend_init() Signed-off-by: Adrien Gallouët <angt@huggingface.co> * Add ggml_backend_load_all() where llama_backend_init() is not used Signed-off-by: Adrien Gallouët <angt@huggingface.co> --------- Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2026-05-05 09:23:50 +02:00
Georgi Gerganov	d6e7b033a4	llama : add option to save memory in device buffers (#22679 ) * llama : add option to save memory in device buffers * tests : extend llama-save-load-state	2026-05-05 06:35:07 +03:00
Sigbjørn Skjæret	fa595462ca	graph : handle non-contiguous Q/K/V in mul_mat_aux (#22630 ) * qkv may not always be contiguous * cont : make the cont conditional --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2026-05-05 06:34:44 +03:00
Ismail	a817a22bc6	ggml : implement fast walsh-hadamard transform for kv rotation (#21352 ) (#22631 )	2026-05-05 10:05:05 +08:00
Xuan-Son Nguyen	994118a183	model: move `load_hparams` and `load_tensors` to per-model definition (#22004 ) * git-friendly migration * add build_graph * nits * exclude old code from build * wip * add llm_arch_model_i * prepare downstream functions * nits * nits * wip * wip * add back create_tensor_qkv * fix files missing include * enforce one llm_build per arch * cmake: use glob * missing model params * nits * wip * wip (2) * wip (3) * test-llama-archs is happy * improve switch case * move more stuff into llm_arch_model_i * fix downstream code * nits * nits (2) * fix order * llama_model_base * LLAMA_LOAD_LOCALS * small fix * fix build errors * auto * rm migration script and ifdef	2026-05-04 12:36:59 +02:00
Concedo	2905c6254f	Merge branch 'upstream' into concedo_experimental # Conflicts: # .pi/gg/SYSTEM.md # docs/speculative.md # ggml/src/ggml-virtgpu/virtgpu-shm.cpp # ggml/src/ggml-virtgpu/virtgpu.cpp # ggml/src/ggml-virtgpu/virtgpu.h # ggml/src/ggml-webgpu/ggml-webgpu-shader-lib.hpp # ggml/src/ggml-webgpu/ggml-webgpu.cpp # ggml/src/ggml-webgpu/wgsl-shaders/row_norm.wgsl # tools/cli/README.md # tools/completion/README.md # tools/server/README.md	2026-05-04 15:36:13 +08:00
Julien Denize	048a490f76	convert : Mistral format yarn apply_scale support (#22612 ) * [BUGFIX] Mistral format apply_scale support. * Update convert_hf_to_gguf.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * fix misunderstood boolean parameters --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2026-05-03 21:51:21 +02:00
Georgi Gerganov	0754b7b6fe	server : avoid checkpoint data host copies (#22558 ) * server : avoid checkpoint data host copies * llama : refactor llama_io_read_i	2026-05-02 18:03:25 +03:00
Concedo	7c70187e26	Merge branch 'upstream' into concedo_experimental # Conflicts: # .github/ISSUE_TEMPLATE/010-bug-compilation.yml # .github/ISSUE_TEMPLATE/011-bug-results.yml # .github/ISSUE_TEMPLATE/019-bug-misc.yml # .github/ISSUE_TEMPLATE/020-enhancement.yml # .github/ISSUE_TEMPLATE/030-research.yml # .github/ISSUE_TEMPLATE/040-refactor.yml # ggml/CMakeLists.txt # ggml/src/ggml-cann/ggml-cann.cpp # ggml/src/ggml-hexagon/CMakeLists.txt # ggml/src/ggml-hexagon/ggml-hexagon.cpp # ggml/src/ggml-hexagon/htp/CMakeLists.txt # ggml/src/ggml-hexagon/htp/cmake-toolchain.cmake # ggml/src/ggml-hexagon/htp/flash-attn-ops.c # ggml/src/ggml-hexagon/htp/hex-utils.h # ggml/src/ggml-hexagon/htp/hmx-matmul-ops.c # ggml/src/ggml-hexagon/htp/hmx-ops.h # ggml/src/ggml-hexagon/htp/hmx-utils.h # ggml/src/ggml-hexagon/htp/hvx-base.h # ggml/src/ggml-hexagon/htp/hvx-copy.h # ggml/src/ggml-hexagon/htp/hvx-exp.h # ggml/src/ggml-hexagon/htp/unary-ops.c # ggml/src/ggml-opencl/CMakeLists.txt # ggml/src/ggml-opencl/ggml-opencl.cpp # ggml/src/ggml-opencl/kernels/cvt.cl # ggml/src/ggml-rpc/ggml-rpc.cpp # ggml/src/ggml-sycl/ggml-sycl.cpp # ggml/src/ggml-virtgpu/ggml-backend.cpp # ggml/src/ggml-webgpu/ggml-webgpu-shader-lib.hpp # ggml/src/ggml-webgpu/ggml-webgpu.cpp # ggml/src/ggml-webgpu/wgsl-shaders/mul_mat_vec.wgsl # ggml/src/ggml-zdnn/ggml-zdnn.cpp # ggml/src/ggml-zendnn/ggml-zendnn.cpp # scripts/sync-ggml.last # tests/test-backend-ops.cpp	2026-05-02 18:07:50 +08:00
ddh0	b97ebdc98f	llama-quant : fix `--tensor-type` when default `qtype` is overriden (#22572 ) fix #22544 (my fault!) Credit to @Anai-Guo, ref #22559 - since that one was closed due to the new contributor policy I am taking the liberty of re-submitting that PR here.	2026-05-01 19:55:55 +02:00
Reese Levine	5cbfb18075	Update llama-mmap to use ftello/fseeko (#22497 ) * Update llama-mmap to work with 32-bit wasm and >2GB models * Update to gguf.cpp style	2026-04-30 14:17:52 -07:00
Concedo	70be589894	Merge branch 'upstream' into concedo_experimental # Conflicts: # CODEOWNERS # examples/debug/debug.cpp # examples/eval-callback/eval-callback.cpp # ggml/src/ggml-cpu/amx/mmq.cpp # ggml/src/ggml-webgpu/ggml-webgpu-shader-lib.hpp # scripts/pr2wt.sh	2026-04-28 21:13:40 +08:00

1383 commits