koboldcpp

mirror of https://github.com/LostRuins/koboldcpp.git synced 2026-06-02 15:39:26 +00:00

Author	SHA1	Message	Date
LostRuins Concedo	fc80cdccc2	Merge commit '`bea04522ff`' into concedo_experimental # Conflicts: # scripts/sync-ggml.last # src/CMakeLists.txt # tests/test-backend-ops.cpp	2025-11-05 12:41:01 +08:00
Concedo	3aec5ed0fd	Kcpp triage for rowsplit: revert https://github.com/ggml-org/llama.cpp/pull/16715 until https://github.com/ggml-org/llama.cpp/issues/16799 is resolved revert https://github.com/ggml-org/llama.cpp/pull/16715 (+2 squashed commit) Squashed commit: [289af2ee2] Revert "Hide latency of bias and gate-loading (#16847)" This reverts commit `8b11deea46`. [a3e5c1e95] Revert "CUDA: add unused vars to mmvf and mmvq (#16807)" This reverts commit `463bbf20bf`.	2025-11-02 09:58:41 +08:00
Piotr Wilkin (ilintar)	bea04522ff	refactor : llama-model.cpp (#16252 ) * Sqashed: llama-model.cpp refactoring * Fix formatting of attn / ffn / ffn_moe calls * Fix import regression / unify spacing in models.h * totally DID NOT miss those! * Add missing qwen3vl(moe) models * Add missing new .cpp files to build * Remove extra semicolons * Editor checker * Update src/models/models.h Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2025-10-31 23:40:23 +01:00
Piotr Wilkin (ilintar)	0de0a01576	model : Minimax M2 (#16831 ) * Model: Minimax M2 * Cleanup * Cleanup pt. 2 * Cleanup pt. 3 * Update convert_hf_to_gguf_update.py - merge catch blocks Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Remove vocab models and test * Remove all redundant hparam settings covered by TextModel * Move super to start, don't set block_count * Update src/llama-model.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update gguf-py/gguf/constants.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2025-10-31 21:20:47 +01:00
Giuseppe Scrivano	e58d585604	model : add Granite Hybrid nano types (#16896 ) Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>	2025-10-31 21:20:07 +01:00
Concedo	75375157fd	Merge commit '`8da3c0e200`' into concedo_experimental # Conflicts: # tests/test-backend-ops.cpp	2025-10-31 21:35:58 +08:00
Georgi Gerganov	8da3c0e200	batch : fix consistency checks for the input positions (#16890 )	2025-10-31 13:50:33 +02:00
Concedo	2b00e55356	Merge branch 'upstream' into concedo_experimental # Conflicts: # .github/workflows/docker.yml # ggml/src/ggml-opencl/kernels/mul_mm_f16_f32_l4_lm.cl # ggml/src/ggml-opencl/kernels/mul_mm_f32_f32_l4_lm.cl # ggml/src/ggml-sycl/rope.cpp # ggml/src/ggml-webgpu/wgsl-shaders/rope.tmpl.wgsl # requirements/requirements-convert_legacy_llama.txt # tests/test-backend-ops.cpp # tests/test-rope.cpp # tools/server/README.md	2025-10-31 10:52:57 +08:00
JJJYmmm	d261223d24	model: add support for qwen3vl series (#16780 ) * support qwen3vl series. Co-authored-by: Thireus ☠ <Thireus@users.noreply.github.com> Co-authored-by: yairpatch <yairpatch@users.noreply.github.com> Co-authored-by: LETS-BEE <LETS-BEE@users.noreply.github.com> * bugfix: fix the arch check for qwen3vl-moe. * use build_ffn * optimize deepstack structure * optimize deepstack feature saving * Revert "optimize deepstack feature saving" for temporal fix This reverts commit f321b9fdf13e59527408152e73b1071e19a87e71. * code clean * use fused qkv in clip * clean up / rm is_deepstack_layers for simplification * add test model * move test model to "big" section * fix imrope check * remove trailing whitespace * fix rope fail * metal : add imrope support * add imrope support for sycl * vulkan: add imrope w/o check * fix vulkan * webgpu: add imrope w/o check * Update gguf-py/gguf/tensor_mapping.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * fix tensor mapping --------- Co-authored-by: Thireus ☠ <Thireus@users.noreply.github.com> Co-authored-by: yairpatch <yairpatch@users.noreply.github.com> Co-authored-by: LETS-BEE <LETS-BEE@users.noreply.github.com> Co-authored-by: Xuan Son Nguyen <son@huggingface.co> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2025-10-30 16:19:14 +01:00
Tianyue-Zhao	bacddc049a	model: Add support for CogVLM model (#15002 ) * Added GGUF mappings for CogVLM model * Add tensor mapping for CogVLM visual encoder * Add CogVLM to conversion script, no vision part yet * Added CogVLM vision model to conversion script * Add graph for CogVLM CLIP model * Add graph for CogVLM * Fixes for CogVLM. Now compiles. * Model now runs * Fixes for cogvlm graph * Account for graph context change after rebase * Changes for whitespace * Changes in convert script according to comments * Switch CogVLM LLM graph to merged QKV tensor * Use rope_type variable instead of direct definition * Change CogVLM CLIP encoder to use SWIGLU * Switch CogVLM CLIP to use merged QKV * Apply rebase edits and remove ggml_cont call that is now unnecessary * clean up --------- Co-authored-by: Xuan Son Nguyen <son@huggingface.co>	2025-10-30 12:18:50 +01:00
Jan Boon	d7395115ba	llama : use std::abs instead of abs (#16853 )	2025-10-30 08:30:58 +02:00
Concedo	16cbe9f24e	Merge branch 'upstream' into concedo_experimental # Conflicts: # CODEOWNERS # docs/ops.md # docs/ops/SYCL.csv # examples/embedding/README.md # ggml/src/ggml-cann/aclnn_ops.cpp # ggml/src/ggml-cann/ggml-cann.cpp # ggml/src/ggml-sycl/backend.hpp # ggml/src/ggml-sycl/ggml-sycl.cpp # ggml/src/ggml-sycl/norm.cpp # ggml/src/ggml-sycl/norm.hpp # scripts/snapdragon/adb/run-bench.sh # scripts/snapdragon/adb/run-cli.sh # src/llama-batch.cpp # tests/test-backend-ops.cpp # tests/test-chat.cpp # tests/test-json-schema-to-grammar.cpp # tools/llama-bench/README.md	2025-10-30 13:44:46 +08:00
Concedo	472438aad3	Merge commit '`5a4ff43e7d`' into concedo_experimental # Conflicts: # docs/build.md # ggml/src/ggml-hip/CMakeLists.txt # ggml/src/ggml-opencl/ggml-opencl.cpp # src/llama-context.cpp # tests/test-backend-ops.cpp	2025-10-30 13:13:00 +08:00
Xuan-Son Nguyen	3464bdac37	llama: fix ASAN error with M-RoPE (#16848 )	2025-10-29 20:11:39 +01:00
Xuan-Son Nguyen	e3af5563bd	llama: store mrope data in KV cell (#16825 ) * llama: store mrope data in KV cell * correct x,y ordering * address review comments * add consistency checks * Update src/llama-kv-cache.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * add TODO * fix asan error * kv-cells : improve ext handling * cont : fix headers --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-10-29 18:09:18 +01:00
Georgi Gerganov	85a7d8677b	memory : remove KV cache size padding (#16812 ) * memory : remove KV cache size padding * cont : restore padding for n_kv tensor shape * server : use slot context size instead of training context size * server : simplify context limit logic	2025-10-28 20:19:44 +02:00
Johannes Gäßler	7a0e900e36	llama: consistent ctx <-> buf order for KV cache (#16746 )	2025-10-28 11:23:54 +01:00
Diego Devesa	5a4ff43e7d	llama : disable pipeline parallelism if compute buffer allocation fails (#16748 )	2025-10-27 21:51:28 +01:00
Concedo	eaee2110c3	Merge branch 'upstream' into concedo_experimental # Conflicts: # README.md # ggml/src/ggml-sycl/backend.hpp # ggml/src/ggml-sycl/ggml-sycl.cpp # tests/test-backend-ops.cpp	2025-10-27 22:36:19 +08:00
Johannes Gäßler	945501f5ea	llama: fix leaked buffers for mmap + split files (#16765)	2025-10-27 09:17:31 +01:00
Sigbjørn Skjæret	73a48c9790	convert : enable expert group selection for all models with it (#16691 )	2025-10-26 17:21:23 +01:00
Sigbjørn Skjæret	f696428ce8	graph : add clamping to ffn_moe_weights_sum to avoid div-by-zero (#16655 ) * add missing norm topk bias * use clamping instead, update number and add comment	2025-10-26 17:20:32 +01:00
Sigbjørn Skjæret	7cce4f8158	model : set res->t_embd in SmallThinker models (#16782 )	2025-10-26 16:08:52 +01:00
Aman Gupta	f77c13b91f	CUDA: General GEMV fusion (#16715 )	2025-10-26 19:28:04 +08:00
Concedo	59fafefbe6	Merge branch 'upstream' into concedo_experimental	2025-10-25 22:38:24 +08:00
Shunta Saito	226f295f4d	model : set res->t_embd in PLaMo2 models (#16766 )	2025-10-25 12:26:27 +02:00
Concedo	12a8bfd453	Merge branch 'upstream' into concedo_experimental # Conflicts: # .github/workflows/build.yml # CODEOWNERS # README.md # docs/ops.md # docs/ops/SYCL.csv # docs/ops/Vulkan.csv # ggml/CMakeLists.txt # ggml/src/CMakeLists.txt # ggml/src/ggml-opencl/ggml-opencl.cpp # ggml/src/ggml-sycl/backend.hpp # ggml/src/ggml-sycl/element_wise.cpp # ggml/src/ggml-sycl/element_wise.hpp # ggml/src/ggml-sycl/ggml-sycl.cpp # tests/test-backend-ops.cpp # tests/test-thread-safety.cpp	2025-10-23 17:22:17 +08:00
Max Krasnyansky	63d2fc46e1	Add experimental ggml-hexagon backend for the Hexagon NPU (#16547 ) * model: add support for extra bufs for all devices * hexagon: add experimental ggml-hexagon backend for the Hexagon NPU This commit introduces a new experimental backend `ggml-hexagon` with support for the Hexagon NPU. Highlights: - Supports Hexagon versions: v73, v75, v79, and v81 - Targets Android devices based on Snapdragon SoCs: Gen3, 8-Elite, and 8-Elite Gen5 - Supports Q4_0, Q8_0, MXFP4, and FP32 data types - Implements core LLM ops: MUL_MAT/MUL_MAT_ID, ADD/SUB/MUL/ADD_ID, RMS_NORM, ROPE, GLU/SWIGLU, SOFTMAX Note: This backend is experimental and may exhibit instability or limited performance across supported devices. It is intended for early testing and feedback from llama.cpp/ggml developer and user community. Co-Authored-By: Rajdeep Ganguly <rganguly@qti.qualcomm.com> Co-Authored-By: Todor Boinovski <todorb@qti.qualcomm.com> * hexagon: fix format checker errors * hexagon: update readme and cmake presets * ci: add android-ndk-build jobs that build plain ARM64 and Snapdragon versions * hexagon: add simple graph optimizer for stacking MUL_MAT ops with the same input * hexagon: move ADB helper scripts into scripts/snapdragon/adb * hexagon: replace all f/printfs with GGML_LOG_... * readme: add hexagon to the list supported backends * hexagon: stack malmuts with quantized inputs only * hexagon: add TODO for fixing issues in hexagon_graph_optimize * hexagon: update to hex-sdk 6.4.0 and add scripts for running on QDC * scripts: fix lint errors * scripts: update qdc pytest script to make linter happy * hexagon: add reduce sum in fp32 * hexagon: reduce number of vector stores in matmul output * hexagon: remove the need for vdelta in reduce-multiply-x8 * hexagon: consistent use of reduce_sum_fp32 for row_sums * hexagon: some more matmul optimizations and comments Optimize cases where tensor dims are not multiple of 1024 (e.g in Qwen models). 
We've handled those cases already but at a higher overhead. * hexagon: update cmake presets * hexagon: add OPMASK support for run-bench.sh wrapper * hexagon: update to use GGML_BACKEND_API * hexagon: remove unused logic for setting tensor flags for the views * hexagon: add asserts to set/get_tensor to make sure we handle complete tensors Same asserts as the CPU backend. * hexagon: use cpy_tensor slow path for non-host buffers * hexagon: error checks in the buffer allocator * cmake: move include(extProj) under ggml-hexagon * hexagon: don't forget to delete the backend on free * hexagon: set/get_tensor size assert apply only to quantized tensors * hexagon: reintroduce HEX_VERBOSE wrapper for GGML_LOG_DEBUG for now GGML_LOG_DEBUG is always enabled for test-backend-ops and the output gets in the way. Ideally we need a bit more finer log levels. * docs: typos in hexagon developer docs (libggm-...) * hexagon: overhaul error handling in the session/device allocation this should handle all failure paths in the session allocation. * hexagon: update cmake presets to enable fp16 vectors * hexagon: remove unused time_usec function * hexagon: don't forget to release buffer contexts * hexagon: fixed indents in hvx-utils (missed clang-format auto-format failure) * hexagon: remove custom can_repeat function and use ggml_can_repeat --------- Co-authored-by: Rajdeep Ganguly <rganguly@qti.qualcomm.com> Co-authored-by: Todor Boinovski <todorb@qti.qualcomm.com>	2025-10-22 13:47:09 -07:00
Sigbjørn Skjæret	84bf3c6778	model : add BailingMoeV2 support (#16063 ) * add BailingMoeV2 support * update llm types * undo * undo * update llm types * add model collection link * update * almost working * correct group selection and rename n_group_exp * avoid large top_k and use argmax instead for now if we had something like argmax2 that would be equivalent, but this works fine until then * poke * skip group selection when there are no tokens * fix 1T conversion * hopefully fixed expert group selection third time's the charm? * make expert group selection generally available The new LLaDA2Moe model uses this method too, make it generally available regardless of architecture. * allow n_expert_groups to be 1 (Kimi K2) * address review suggestions	2025-10-20 21:38:20 +02:00
takuya kodama	06332e2867	llama-batch: fix build fails with `-Werror=missing-braces` (#16614 ) ## Why it failed When compiling with strict compiler flags (-Wmissing-braces -Werror=missing-braces), the build fails with the following error: ``` cmake \ -S . \ -B ../llama.cpp.build \ --preset=x64-linux-gcc-debug \ -DCMAKE_INSTALL_PREFIX=/tmp/local \ -DCMAKE_CXX_FLAGS="-Wmissing-braces -Werror=missing-braces" && \ cmake --build ../llama.cpp.build/ ... In file included from /home/otegami/work/cpp/llama.cpp/src/llama-graph.h:4, from /home/otegami/work/cpp/llama.cpp/src/llama-model.h:5, from /home/otegami/work/cpp/llama.cpp/src/llama.cpp:8: /home/otegami/work/cpp/llama.cpp/src/llama-batch.h:126:48: error: missing braces around initializer for 'std::__array_traits<int, 1>::_Type' {aka 'int [1]'} [-Werror=missing-braces] 126 \| std::array<llama_seq_id, 1> seq_id_0 = { 0 }; // default sequence id \| ^ cc1plus: some warnings being treated as errors ``` The issue is that std::array initialization requires double braces. ## How to fix This PR changes `{ 0 }` to `{{ 0 }}` for std::array initialization. This is part of a series of commits to fix missing braces warnings across the codebase. - src/llama-batch.h <- This PR is here. - src/llama-context.cpp - tests/test-backend-ops.cpp - tests/test-gguf.cpp - tools/mtmd/clip.cpp Benefits: - std::array is a struct containing a C-style array, requiring nested braces - Enables stricter compiler warnings to catch potential issues	2025-10-20 11:27:09 +03:00
Concedo	82137e1c89	Merge branch 'upstream' into concedo_experimental # Conflicts: # .github/workflows/update-ops-docs.yml # ggml/src/CMakeLists.txt # ggml/src/ggml-cpu/CMakeLists.txt	2025-10-20 16:05:11 +08:00
takuya kodama	7062dd8460	llama-context: only warn on pooling_type when user specified (#16674) The unexpected pooling_type warning was incorrectly shown when users did not specify the --pooling-type parameter. In this case, the parameter defaults to `LLAMA_POOLING_TYPE_UNSPECIFIED (-1)`, and the code automatically applies the model's default pooling type. Example of spurious warning: ``` $ llama-embedding -hf ggml-org/bge-m3-Q8_0-GGUF -p "hello" ... llama_init_from_model: model default pooling_type is [2], but [-1] was specified ... ``` This fix ensures the warning only appears when users explicitly specify a pooling type that differs from the model's default (e.g., using --pooling-type mean on a model that expects CLS pooling).	2025-10-20 10:44:21 +03:00
Giuseppe Scrivano	0398752dd4	model : add Granite Hybrid types (#16635 ) add Granite 4 models mapping their embedding dimensions to the # of parameters. Information taken from https://huggingface.co/ibm-granite/granite-4.0-h-tiny Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>	2025-10-19 23:54:31 +02:00
Concedo	f47a0690ac	Merge branch 'upstream' into concedo_experimental # Conflicts: # docs/ops.md # ggml/src/ggml-opencl/CMakeLists.txt # ggml/src/ggml-opencl/ggml-opencl.cpp # ggml/src/ggml-opencl/kernels/cvt.cl # ggml/src/ggml-rpc/ggml-rpc.cpp # tests/test-backend-ops.cpp # tests/test-grammar-integration.cpp # tools/rpc/rpc-server.cpp	2025-10-18 11:10:37 +08:00
Johannes Gäßler	66b0dbcb2d	llama-model: fix inconsistent ctxs <-> bufs order (#16581)	2025-10-17 17:41:09 +02:00
Concedo	ebc1cb0641	before merging conflicting round	2025-10-16 12:15:44 +08:00
Concedo	2d22e61f3d	Merge commit '`1ee9d0b415`' into concedo_experimental # Conflicts: # tests/test-backend-ops.cpp	2025-10-16 12:09:46 +08:00
Concedo	2cee3b2055	Merge commit '`e38b7c6e9e`' into concedo_experimental	2025-10-16 12:08:03 +08:00
Concedo	f3b0ed157b	Revert "graph : support cacheless embeddings with FA and iSWA" This reverts commit `d4d465bce4`.	2025-10-16 12:07:48 +08:00
Xuan-Son Nguyen	3e3cb19f64	llama-quant: add support for mmproj (#16592 ) * llama-quant: add support for mmproj * Update src/llama.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * check prefix instead * small fix --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-10-15 14:48:08 +02:00
Georgi Gerganov	e60f241eac	metal : FA support F32 K and V and head size = 32 (#16531 ) * metal : FA support F32 K and V and head size = 32 * graph : remove obsolete comment [no ci]	2025-10-13 23:07:57 +03:00
Georgi Gerganov	e38b7c6e9e	graph : support cacheless embeddings with FA and iSWA (#16528 ) * graph : support cacheless embeddings with FA and iSWA * cont : deduplicate mask creation * cont : fix name	2025-10-13 22:42:37 +03:00
Concedo	9503547ca1	Merge remote-tracking branch 'lcpp/gg/cacheless-embd' into concedo_experimental	2025-10-12 16:47:48 +08:00
Concedo	7e7da2583e	Merge branch 'upstream' into concedo_experimental # Conflicts: # ggml/src/ggml-cuda/CMakeLists.txt # ggml/src/ggml-cuda/common.cuh # ggml/src/ggml-cuda/fattn.cu # ggml/src/ggml-hip/CMakeLists.txt # ggml/src/ggml-musa/CMakeLists.txt	2025-10-12 16:42:51 +08:00
Georgi Gerganov	d4d465bce4	graph : support cacheless embeddings with FA and iSWA	2025-10-12 10:35:38 +03:00
Daniel Bevenius	a2fba89a42	hparams : add check for layer index in is_recurrent (#16511 ) * hparams : add check for layer index in is_recurrent This commit adds a check in the is_recurrent method to ensure that the provided layer index is within the valid range. The motivation for this change is to prevent potential out-of-bounds and also be consistent with other methods in the class that perform similar checks, like is_swa.	2025-10-12 07:19:06 +02:00
Concedo	720fc30832	Merge branch 'upstream' into concedo_experimental	2025-10-11 23:19:38 +08:00
Georgi Gerganov	a3cb04744f	metal : fix mul-mm condition + fix mul-mv permuted kernels (#16494)	2025-10-11 16:54:10 +03:00
Concedo	6d8f8cd65b	Merge branch 'upstream' into concedo_experimental # Conflicts: # ggml/src/CMakeLists.txt	2025-10-11 10:01:43 +08:00
Georgi Gerganov	81086cd6a3	vocab : mark EOT token for Granite models (#16499 ) * vocab : mark EOT token for Granite models * sampling : fallback to EOS when EOT is not found	2025-10-10 17:17:31 +03:00
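
The brace-initialization fix described in commit 06332e2867 can be sketched in isolation. This is a minimal illustration, not the code from src/llama-batch.h: `seq_id_t`, `batch_defaults`, and `default_seq_id` are hypothetical names introduced here, and `seq_id_t` is only assumed to be a 32-bit integer like the real sequence id type.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the sequence id type (assumed 32-bit integer).
using seq_id_t = std::int32_t;

// Hypothetical struct mirroring the shape of the member named in the commit.
struct batch_defaults {
    // std::array is a struct wrapping a C-style array: the outer braces
    // initialize the struct, the inner braces initialize the wrapped array.
    // Writing `= { 0 }` relies on brace elision, which is what
    // -Wmissing-braces reports.
    std::array<seq_id_t, 1> seq_id_0 = {{ 0 }}; // default sequence id
};

// Returns the default sequence id held by a freshly constructed struct.
inline seq_id_t default_seq_id() {
    batch_defaults d;
    return d.seq_id_0[0];
}
```

Both spellings produce the same value; the doubled braces only make the nesting explicit so that a build using `-Wmissing-braces -Werror=missing-braces`, as in the commit message, should no longer stop on this line.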

912 commits