koboldcpp

mirror of https://github.com/LostRuins/koboldcpp.git synced 2026-05-18 06:19:19 +00:00

Author	SHA1	Message	Date
Wagner Bruna	243b03586b	sd: build each source file separately (#2188) * sd: build source files separately * sd: decouple stable-diffusion.cpp and sdtype_adapter.cpp * sd: remove include util.h from sdtype_adapter.cpp * sd: update source file lists and review dependencies	2026-05-07 22:50:10 +08:00
Concedo	81f2b5c448	prepare for sdcpp build refactor	2026-05-07 22:49:14 +08:00
Concedo	9e9497f0cc	Merge remote-tracking branch 'origin/upstream' into concedo_experimental # Conflicts: # examples/save-load-state/save-load-state.cpp # ggml/CMakeLists.txt # ggml/src/ggml-cpu/CMakeLists.txt # ggml/src/ggml-hexagon/htp/hmx-matmul-ops.c # ggml/src/ggml-hexagon/htp/matmul-ops.c # ggml/src/ggml-opencl/CMakeLists.txt # ggml/src/ggml-opencl/ggml-opencl.cpp # ggml/src/ggml-opencl/kernels/gemm_noshuffle_q4_0_f32.cl # ggml/src/ggml-opencl/kernels/gemm_noshuffle_q8_0_f32.cl # ggml/src/ggml-opencl/kernels/gemv_noshuffle_q4_0_f32.cl # ggml/src/ggml-opencl/kernels/gemv_noshuffle_q4_0_f32_spec.cl # ggml/src/ggml-opencl/kernels/gemv_noshuffle_q8_0_f32.cl # ggml/src/ggml-rpc/ggml-rpc.cpp # scripts/sync-ggml.last # scripts/sync_vendor.py # src/llama-graph.cpp # tests/test-backend-ops.cpp # tests/test-state-restore-fragmented.cpp	2026-05-06 21:20:06 +08:00
Concedo	7240da764a	Merge commit '935a340292' into concedo_experimental # Conflicts: # examples/diffusion/CMakeLists.txt # scripts/server-test-function-call.py # src/llama-model.cpp # src/models/gemma4.cpp # tests/test-chat.cpp # tests/test-reasoning-budget.cpp # tools/server/README.md	2026-05-06 21:02:25 +08:00
henk717	bcf9c81e0d	Linux CUDA13 Action (#2186) * Linux CU13 CI * Bump max CUDA arch * CUDA13 Linux * Upload the correct build to rolling (CUDA13) * Downgrade cuda to get better compatibility Runpod can't handle 13.1, and if they can't handle it neither can the people with a secondary GPU of an older generation. * Add support for compute capability 89 in NVCCFLAGS	2026-05-06 18:06:39 +08:00
Concedo	15e86c4f9b	hard coded reasoning_effort field from the api payload and force it into the jinja kwargs (request by @henk717). field name also hardcoded.	2026-05-06 17:35:26 +08:00
Aleksander Grygier	e3e3f8e46a	webui: Remove Google Favicons & Improve MCP Information logic & UI (#22719) * refactor: Remove Google favicon utility * fix: MCP Server favicon * refactor: Cleanup * refactor: MCP Server Information * fix: Fix MCP Settings UI * refactor: Cleanup	2026-05-06 11:12:27 +02:00
zzzzwc	f08f20a0e3	ggml-cpu: fuse RMS_NORM + MUL on CPU backend (#22423)	2026-05-06 15:41:14 +08:00
viggy	07eaf919ed	add tabindex and aria-hidden (#22699)	2026-05-06 09:21:58 +02:00
Sigbjørn Skjæret	74d6248f71	convert : add filter_tensors method to pre-filter tensors (#22597) * add filter_tensors classmethod * remove language_model * fix parts validation	2026-05-06 08:06:05 +02:00
fl0rianr	2ca1161bd7	ggml : use `CL_DEVICE_GLOBAL_MEM_SIZE` as memory estimate for OpenCL --fit (#22688) * ggml : report estimated OpenCL memory for --fit Signed-off-by: Florian Reinle <f.reinle@otec.de> * ggml : estimated OpenCL memory backend integrated Signed-off-by: Florian Reinle <f.reinle@otec.de> --------- Signed-off-by: Florian Reinle <f.reinle@otec.de>	2026-05-05 22:12:48 -07:00
Trivikram Reddy	bbeb89d76c	Hexagon: Process M-tail rows on HMX instead of HVX (#22724) * hex-mm: process m-tail rows on HMX instead of HVX * hmx-mm: unroll and optimize padded activation loop --------- Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>	2026-05-05 09:43:03 -07:00
lhez	ff806a110d	opencl: refactor Adreno q4_0 (#22335) * opencl: refactor adreno q4_0 gemm/gemv dispatch * opencl: refactor q4_0 gemm/gemv loading, use consistent names * opencl: use consistent name for adreno q8_0 gemm/gemv * opencl: use consistent names for adreno q4_0 gemm/gemv * opencl: simplify adreno q4_0 set_tensor * opencl: refactor q4_0 get_tensor	2026-05-05 09:38:57 -07:00
Radoslav Gerganov	d5003b6e4d	rpc : use graph uid instead of graph cache (#22701) Store the last graph uid and compare against it to determine if the same graph is being computed.	2026-05-05 13:47:13 +03:00
Adrien Gallouët	2635ac76e8	common : fix missing-noreturn warnings when compiling with clang 21 (#22702) common/arg.cpp:3719:9: error: function 'operator()' could be declared with attribute 'noreturn' [-Werror,-Wmissing-noreturn] 3719 | [](common_params & /*params*/, int /*value*/) { | ^ common/arg.cpp:3726:9: error: function 'operator()' could be declared with attribute 'noreturn' [-Werror,-Wmissing-noreturn] 3726 | [](common_params & /*params*/, int /*value*/) { | ^ common/arg.cpp:3733:9: error: function 'operator()' could be declared with attribute 'noreturn' [-Werror,-Wmissing-noreturn] 3733 | [](common_params & /*params*/, int /*value*/) { | ^ common/arg.cpp:3740:9: error: function 'operator()' could be declared with attribute 'noreturn' [-Werror,-Wmissing-noreturn] 3740 | [](common_params & /*params*/, int /*value*/) { | ^ common/arg.cpp:3747:9: error: function 'operator()' could be declared with attribute 'noreturn' [-Werror,-Wmissing-noreturn] 3747 | [](common_params & /*params*/, int /*value*/) { | ^ Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2026-05-05 13:16:25 +03:00
Georgi Gerganov	70a8309114	sync : ggml	2026-05-05 13:15:59 +03:00
Georgi Gerganov	c91faf997f	ggml : bump version to 0.11.0 (ggml/1478)	2026-05-05 13:15:59 +03:00
Adrien Gallouët	bf76ac77be	common : only load backends when required (#22290) * common : only load backends when required Signed-off-by: Adrien Gallouët <angt@huggingface.co> * llama : call ggml_backend_load_all() directly from llama_backend_init() Signed-off-by: Adrien Gallouët <angt@huggingface.co> * Add ggml_backend_load_all() where llama_backend_init() is not used Signed-off-by: Adrien Gallouët <angt@huggingface.co> --------- Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2026-05-05 09:23:50 +02:00
Alessandro de Oliveira Faria (A.K.A.CABELO)	a09a00e502	vendor : update cpp-httplib to 0.43.3 (#22686)	2026-05-05 09:04:57 +02:00
Georgi Gerganov	2bacb1eb77	server : validate --tools CLI argument against known tool names (#22538) Previously, unknown tool names passed via --tools were silently ignored. Now the server validates each tool name at startup and exits with an error if an unrecognized tool is specified, listing the available tools. Assisted-by: llama.cpp:local pi	2026-05-05 06:35:27 +03:00
Georgi Gerganov	d6e7b033a4	llama : add option to save memory in device buffers (#22679) * llama : add option to save memory in device buffers * tests : extend llama-save-load-state	2026-05-05 06:35:07 +03:00
Sigbjørn Skjæret	fa595462ca	graph : handle non-contiguous Q/K/V in mul_mat_aux (#22630) * qkv may not always be contiguous * cont : make the cont conditional --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2026-05-05 06:34:44 +03:00
Ismail	a817a22bc6	ggml : implement fast walsh-hadamard transform for kv rotation (#21352) (#22631)	2026-05-05 10:05:05 +08:00
Charles Xu	eff06702b2	kleidiai : update to v1.24.0 and use release archive (#22549)	2026-05-04 22:13:31 +03:00
leonardHONG	e77056f9b2	CUDA: use fastdiv for batch index split in get_rows (#22650)	2026-05-04 16:24:05 +02:00
Xuan-Son Nguyen	935a340292	server: implement /models?reload=1 (#21848)	2026-05-04 16:23:26 +02:00
Shakhnazar Sailaukan	d8794eecd5	examples: refactor diffusion generation (#22590) * examples: refactor diffusion generation * renamed enum values	2026-05-04 20:19:30 +08:00
JusteLeo	36a694c965	webui : fix circular dependency between chat.service.ts and models.svelte.ts (#22625)	2026-05-04 13:38:10 +02:00
Piotr Wilkin (ilintar)	a4701c98f7	common/autoparser: fixes for newline handling / forced tool calls (#22654) * chat/autoparser: the fixes * Move optspace() to chat-peg-parser, comment out server tests invalidated due to content now allowed with forced tool calls. * Trim whitespace on apply instead	2026-05-04 13:18:11 +02:00
Xuan-Son Nguyen	994118a183	model: move `load_hparams` and `load_tensors` to per-model definition (#22004) * git-friendly migration * add build_graph * nits * exclude old code from build * wip * add llm_arch_model_i * prepare downstream functions * nits * nits * wip * wip * add back create_tensor_qkv * fix files missing include * enforce one llm_build per arch * cmake: use glob * missing model params * nits * wip * wip (2) * wip (3) * test-llama-archs is happy * improve switch case * move more stuff into llm_arch_model_i * fix downstream code * nits * nits (2) * fix order * llama_model_base * LLAMA_LOAD_LOCALS * small fix * fix build errors * auto * rm migration script and ifdef	2026-05-04 12:36:59 +02:00
Evan Huus	c84e6d6db5	server: Add a simple get_datetime server tool (#22649)	2026-05-04 12:19:41 +02:00
Concedo	2905c6254f	Merge branch 'upstream' into concedo_experimental # Conflicts: # .pi/gg/SYSTEM.md # docs/speculative.md # ggml/src/ggml-virtgpu/virtgpu-shm.cpp # ggml/src/ggml-virtgpu/virtgpu.cpp # ggml/src/ggml-virtgpu/virtgpu.h # ggml/src/ggml-webgpu/ggml-webgpu-shader-lib.hpp # ggml/src/ggml-webgpu/ggml-webgpu.cpp # ggml/src/ggml-webgpu/wgsl-shaders/row_norm.wgsl # tools/cli/README.md # tools/completion/README.md # tools/server/README.md	2026-05-04 15:36:13 +08:00
Concedo	4a8a51a3a7	updated sdui, increase ace step music vae chunk size	2026-05-04 15:30:45 +08:00
Concedo	950676fdb7	split utils.cpp into 2 files to support sd.cpp	2026-05-04 15:04:12 +08:00
Nick Towle	fa8feaed34	webui: restore missing settings (#22666)	2026-05-04 09:04:07 +02:00
Wagner Bruna	276c651a12	sd: sync to master-593-3d6064b (#2175) * sd: sync to master-593-3d6064b * sd: use the same sdtype_adapter object for all builds Since master-592-b8079e2, no sd.cpp source depends on the ggml backend build anymore. * sd: fix main_gpu selection * sd: report backend devices to the Python layer	2026-05-04 14:05:34 +08:00
Georgi Gerganov	846262d787	docs : update speculative decoding parameters after refactor (#22397) (#22539) * docs : update speculative decoding parameters after refactor (#22397) Update docs/speculative.md to reflect the new parameter naming scheme introduced in PR #22397: - Replace --draft-max/--draft-min with --spec-draft-n-max/--spec-draft-n-min - Replace --spec-ngram-size-n/m with per-implementation variants - Add documentation for all new --spec-ngram-- parameters - Update all example commands Assisted-by: llama.cpp:local pi pi : add rule to use gh CLI for GitHub resources Assisted-by: llama.cpp:local pi * docs : run llama-gen-docs * arg : fix typo	2026-05-04 08:52:07 +03:00
Atomic-Germ	6dcd824fce	vulkan: delete dead GGML_VK_MAX_NODES def (#22621)	2026-05-04 07:49:29 +02:00
Chen Yuan	d4b0c22f9e	ggml-webgpu: add layer norm ops (#22406) * shader(norm): add layer norm ops * shader(norm): stablize floating point computation with Kahan summation and handle mixed types * shader(norm): remove the non-contiguous strides * shader(norm): use the original implementation rather than the kahan summation	2026-05-03 20:52:53 -07:00
Aldehir Rojas	e48034dfc9	common : determine generation prompt using longest common prefix (#22657)	2026-05-04 00:18:23 +02:00
Julien Denize	048a490f76	convert : Mistral format yarn apply_scale support (#22612) * [BUGFIX] Mistral format apply_scale support. * Update convert_hf_to_gguf.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * fix misunderstood boolean parameters --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2026-05-03 21:51:21 +02:00
JM Robles	db44417b02	convert : apply Q/K RoPE permutation in NVFP4 repack path (#22611) Llama-architecture q_proj/k_proj weights need an axis-0 row permutation to match GGML's RoPE convention. The BF16 path applies this in LlamaModel.modify_tensors via LlamaModel.permute, but the NVFP4 path bypasses modify_tensors and writes weights directly through ModelBase._repack_nvfp4. Without the permutation, attention heads end up scrambled at inference and the model produces gibberish. This change overrides _repack_nvfp4 on LlamaModel and applies the same permutation to both the nibble-packed weight and the per-block scale before delegating to ModelBase._repack_nvfp4 via super(). Reuses the existing LlamaModel.permute static helper and respects the existing undo_permute flag, so subclasses (Mistral, Granite, Llama4, etc.) inherit the fix automatically. Verified on TinyLlama-1.1B reproducer: perplexity drops from 4419 (gibberish) to 43.9, matching the BF16-dequantized baseline (44.0). Also verified end-to-end on ALIA-40b-instruct-2601 (BSC, Llama architecture) with multilingual generation in Spanish/Catalan/Basque/Galician all coherent with the fix applied. Co-authored-by: Chema <chema@montevive.ai>	2026-05-03 18:22:00 +03:00
Tai An	24495f6c48	docs(args): clarify --debugmode level semantics in help text (#2181) Closes #2178 The --debugmode help string previously read "Shows additional debug info in the terminal" with no indication of what numeric values it accepts or what each does — making the recommended troubleshooting flag opaque (per #2178). Document the three values actually checked in the source: -1: Horde-quiet (suppresses non-essential prints; auto-applied when --horde* args are set, see configure_horde_settings) 0: default 1: verbose (extra slot/cache info; larger utfprint buffer; retains 'debug-' horde model prefix; etc.) Also note that bare --debugmode (no value) implies 1, which is the existing argparse behavior (nargs='?', const=1) but easy to miss.	2026-05-03 16:06:13 +08:00
Concedo	676e716ce3	try to handle duplicate think tags by swallowing them	2026-05-03 16:02:38 +08:00
Concedo	2d1c1eb54e	updated lite	2026-05-03 14:54:09 +08:00
Concedo	80a9082166	q5_1 kv in cuda	2026-05-03 13:40:24 +08:00
Concedo	9be810628e	setenv return int	2026-05-03 13:32:05 +08:00
Concedo	2fb97d9c2c	explicitly set env var internally.	2026-05-03 13:18:50 +08:00
lucy	d05fe1d7da	fix: CUDA device PCI bus ID de-dupe OOMing (ignoring other 3 gpus entirely) (#22533) * fix: CUDA device PCI bus ID detection for multi-GPU de-dupe * HIP, MUSA macros --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2026-05-02 22:19:25 +02:00
Wagner Bruna	25fab4113e	refactor: handle GGML_VK_VISIBLE_DEVICES at the Python level (#2179) All C++ handling code currently: - build a comma-separated list from the info_vulkan array - if GGML_VK_VISIBLE_DEVICES isn't set - set GGML_VK_VISIBLE_DEVICES to the list Once set, GGML_VK_VISIBLE_DEVICES affects the whole process. So this can be done in the same way at the Python level, before all loading functions. Caveat: load_model had the default `inputs.vulkan_info = "0"`, so the default GPU would be "0" only when loading a text model.	2026-05-02 23:10:29 +08:00
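
Note on 25fab4113e: the three steps in that commit message (build a comma-separated list, check whether GGML_VK_VISIBLE_DEVICES is set, set it if not) can be sketched in Python roughly as follows. This is only an illustration under assumptions, not the code from the commit: the function name `apply_vulkan_visible_devices` and its parameters are hypothetical, and "isn't set" is read here as "absent from the environment".

```python
import os

ENV_NAME = "GGML_VK_VISIBLE_DEVICES"

def apply_vulkan_visible_devices(device_ids, environ=os.environ):
    """Hypothetical helper: export the selected Vulkan device indices.

    device_ids: iterable of device indices chosen by the user (assumed input).
    environ:    mapping to modify; defaults to the real process environment.
    Returns the value in effect afterwards, or None if nothing is set.
    """
    ids = [str(d) for d in device_ids]
    # Respect a value the user already exported; only fill in when absent.
    if ids and ENV_NAME not in environ:
        environ[ENV_NAME] = ",".join(ids)
    return environ.get(ENV_NAME)
```

Because the variable affects the whole process once set, a helper like this would have to run before any model loading function is called.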

13158 commits
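
Note on 24495f6c48: the argparse behavior mentioned in that commit message (nargs='?', const=1) is easy to check in isolation. The sketch below is a standalone parser, not koboldcpp's actual argument definition; `type=int` and `default=0` are assumptions based on the message listing 0 as the default level.

```python
import argparse

# Standalone parser for illustration; the real option lives in koboldcpp's own argument setup.
parser = argparse.ArgumentParser()
parser.add_argument(
    "--debugmode",
    type=int,      # assumption: levels are parsed as integers
    nargs="?",     # the value is optional
    const=1,       # used when the flag is given without a value
    default=0,     # assumption: used when the flag is absent
)

def parse_debugmode(argv):
    """Return the debug level that argparse derives from argv."""
    return parser.parse_args(argv).debugmode
```

With this setup, an empty argument list yields 0, a bare `--debugmode` yields 1, and `--debugmode -1` yields -1, which matches the three levels the commit documents.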