Concedo
30c74d5cce
fixed mcp bug
2026-02-04 20:46:55 +08:00
Concedo
4b073f3aa0
fix sse parsing in mcp
2026-02-04 20:38:33 +08:00
Concedo
349c461453
add stop reason for error
2026-02-04 20:23:18 +08:00
Concedo
a2251a154f
Merge remote-tracking branch 'jeff/rope_noncontig' into concedo_experimental
2026-02-04 16:21:31 +08:00
Concedo
1f803ae27b
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .github/workflows/server.yml
# CMakeLists.txt
# cmake/common.cmake
# ggml/src/ggml-virtgpu/apir_cs_ggml-rpc-front.cpp
# ggml/src/ggml-virtgpu/backend/backend-dispatched-backend.cpp
# ggml/src/ggml-virtgpu/backend/backend-dispatched-buffer-type.cpp
# ggml/src/ggml-virtgpu/backend/backend-dispatched-buffer.cpp
# ggml/src/ggml-virtgpu/backend/backend-dispatched-device.cpp
# ggml/src/ggml-virtgpu/backend/backend-dispatched.cpp
# ggml/src/ggml-virtgpu/backend/backend-dispatched.gen.h
# ggml/src/ggml-virtgpu/backend/backend-dispatched.h
# ggml/src/ggml-virtgpu/backend/backend.cpp
# ggml/src/ggml-virtgpu/backend/shared/apir_cs.h
# ggml/src/ggml-virtgpu/backend/shared/apir_cs_ggml.h
# ggml/src/ggml-virtgpu/ggml-backend-buffer-type.cpp
# ggml/src/ggml-virtgpu/ggml-backend-device.cpp
# ggml/src/ggml-virtgpu/ggml-backend-reg.cpp
# ggml/src/ggml-virtgpu/ggml-remoting.h
# ggml/src/ggml-virtgpu/ggmlremoting_functions.yaml
# ggml/src/ggml-virtgpu/regenerate_remoting.py
# ggml/src/ggml-virtgpu/virtgpu-forward-backend.cpp
# ggml/src/ggml-virtgpu/virtgpu-forward-buffer-type.cpp
# ggml/src/ggml-virtgpu/virtgpu-forward-buffer.cpp
# ggml/src/ggml-virtgpu/virtgpu-forward-device.cpp
# ggml/src/ggml-virtgpu/virtgpu-forward-impl.h
# ggml/src/ggml-virtgpu/virtgpu-forward.gen.h
# ggml/src/ggml-virtgpu/virtgpu-shm.cpp
# ggml/src/ggml-virtgpu/virtgpu.cpp
# ggml/src/ggml-virtgpu/virtgpu.h
2026-02-04 16:21:06 +08:00
Wagner Bruna
d9ac52a01a
sd: sync to master-492-f957fa3 ( #1957 )
...
* sd: sync to master-492-f957fa3
* add Res Multistep and Res 2s samplers
* make sdflashattention control flash_attn too
2026-02-04 16:12:39 +08:00
Daniel Bevenius
25f40ca65f
completion : simplify batch (embd) processing ( #19286 )
...
* completion : simplify batch (embd) processing
This commit simplifies the processing of embd by removing the existing
for loop that uses params.n_batch as its increment. It also removes the
clamping of n_eval, since the size of embd is always at most
params.n_batch.
The motivation is to clarify the code: read in isolation, the loop is a
little confusing, as it suggests that multiple batches can be processed.
* add an assert to verify n_eval is not greater than n_batch
2026-02-04 05:43:28 +01:00
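A minimal sketch of the simplified shape described in the commit above, to make the invariant concrete. This is illustrative only: eval_embd is a hypothetical helper, and the names ctx, embd, n_past, and n_batch follow llama.cpp conventions rather than the actual diff.
```cpp
#include <cassert>
#include <vector>
#include "llama.h"

// Evaluate embd in a single call instead of a params.n_batch-strided loop,
// with an assert encoding the invariant that made the loop unnecessary.
static void eval_embd(llama_context * ctx, std::vector<llama_token> & embd,
                      int & n_past, int n_batch) {
    const int n_eval = (int) embd.size();
    // The caller fills embd with at most n_batch tokens, so the old clamp
    // and per-batch loop were dead generality:
    assert(n_eval <= n_batch);
    llama_decode(ctx, llama_batch_get_one(embd.data(), n_eval));
    n_past += n_eval;
}
```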
Kevin Pouget
015deb9048
ggml-virtgpu: make the code thread safe ( #19204 )
...
* ggml-virtgpu: regenerate_remoting.py: add the ability to deprecate a function
* ggml-virtgpu: deprecate buffer_type is_host remoting
not necessary
* ggml-virtgpu: stop using static vars as cache
The static init isn't thread safe.
* ggml-virtgpu: protect the use of the shared memory to transfer data
* ggml-virtgpu: make the remote calls thread-safe
* ggml-virtgpu: backend: don't continue if couldn't allocate the tensor memory
* ggml-virtgpu: add a cleanup function for consistency
* ggml-virtgpu: backend: don't crash if buft->iface.get_max_size is missing
* fix style and ordering
* Remove the static variable in apir_device_get_count
* ggml-virtgpu: improve the logging
* apply minor formatting changes from review
2026-02-04 10:46:18 +08:00
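To make the thread-safety fixes above concrete, here is a hedged sketch of the locking pattern they describe; virtgpu_conn, shmem_mutex, and remote_set_tensor are hypothetical names, not the actual ggml-virtgpu symbols.
```cpp
#include <algorithm>
#include <cstdint>
#include <cstring>
#include <mutex>

struct virtgpu_conn {
    std::mutex shmem_mutex;   // guards the shared host<->guest transfer buffer
    uint8_t *  shmem;
    size_t     shmem_size;
};

static void remote_set_tensor(virtgpu_conn & gpu, const void * data, size_t size) {
    // Every remote call that stages data through the shared memory window
    // must hold the lock; otherwise two threads can interleave payloads.
    std::lock_guard<std::mutex> lock(gpu.shmem_mutex);
    std::memcpy(gpu.shmem, data, std::min(size, gpu.shmem_size));
    // ... issue the remoting call and wait for completion before unlocking ...
}
```
The same reasoning covers the static-cache bullets: a lazily written static variable is a hidden shared write, so it is either computed once up front or guarded like any other shared state.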
Aman Gupta
2ceda3f662
ggml-cpu: use LUT for converting e8->f32 scales on x86 ( #19288 )
...
* ggml-cpu: use LUT for converting e8->f32 scales on x86
* add dispatch based on macro
2026-02-04 09:43:29 +08:00
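Assuming "e8" here means E8M0 exponent scales (as used by MXFP4), a minimal sketch of the LUT idea: each scale byte encodes 2^(e-127), so a one-time 256-entry table replaces per-element exponent math on the hot path. The real kernel is vectorized and handles the edge codes; this shows only the scalar shape.
```cpp
#include <array>
#include <cstdint>
#include <cstring>

static std::array<float, 256> make_e8_lut() {
    std::array<float, 256> lut{};
    for (uint32_t e = 1; e < 255; ++e) { // codes 0 and 255 need special-casing
        // Biased exponent with zero mantissa: the float's bits for 2^(e-127).
        uint32_t bits = e << 23;
        std::memcpy(&lut[e], &bits, sizeof(float));
    }
    return lut;
}

static const std::array<float, 256> e8_lut = make_e8_lut();

static inline float e8_to_f32(uint8_t e) { return e8_lut[e]; }
```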
Georgi Gerganov
44008ce8f9
metal : add solve_tri ( #19302 )
2026-02-03 23:43:14 +02:00
Georgi Gerganov
6a9bf2f788
ci : add sanitizer runs for server ( #19291 )
2026-02-03 22:41:20 +02:00
Georgi Gerganov
faa1bc26ee
sampling : delegate input allocation to the scheduler ( #19266 )
...
* sampling : delegate input allocation to the scheduler
* graph : compute backend samplers only if needed
2026-02-03 22:16:16 +02:00
Jeff Bolz
5de50e9d86
vulkan: fix non-contig rope
2026-02-03 12:20:08 -06:00
Ruben Ortlam
32b17abdb0
vulkan: disable coopmat1 fa on Nvidia Turing ( #19290 )
2026-02-03 17:37:32 +01:00
Aman Gupta
8bece2eb20
CUDA: use mmvq for mul-mat-id for small batch sizes ( #18958 )
...
* CUDA: use mmvq for mul-mat-id for small batch sizes
* add mmvq too
* Fix perf issue on Ampere. Use mmvf mm-id only for non-Nvidia GPUs
* templatize multi_token_path
2026-02-03 23:31:23 +08:00
Concedo
e7d980cf4a
updated sdui
2026-02-03 21:33:51 +08:00
Sigbjørn Skjæret
a6fd8ca1fe
models : remove unnecessary cont in openelm ( #19289 )
2026-02-03 14:20:57 +01:00
Concedo
dfa725c58d
make the dpi fix more universal. not a perfect solution
2026-02-03 19:49:38 +08:00
Georgi Gerganov
c55bce4159
metal : minor cleanup ( #19251 )
2026-02-03 13:43:29 +02:00
Concedo
316530e9cf
fix cuda graph spam
2026-02-03 19:00:50 +08:00
Concedo
7b393fa487
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .github/workflows/build.yml
# AUTHORS
# ci/run.sh
# docs/backend/SYCL.md
# docs/build.md
# docs/multimodal/minicpmo2.6.md
# docs/multimodal/minicpmo4.0.md
# docs/multimodal/minicpmv2.5.md
# docs/multimodal/minicpmv2.6.md
# docs/multimodal/minicpmv4.0.md
# docs/multimodal/minicpmv4.5.md
# docs/ops.md
# docs/ops/SYCL.csv
# docs/speculative.md
# examples/deprecation-warning/README.md
# examples/deprecation-warning/deprecation-warning.cpp
# examples/model-conversion/Makefile
# examples/model-conversion/scripts/causal/convert-model.sh
# ggml/include/ggml-cann.h
# ggml/src/ggml-cann/acl_tensor.cpp
# ggml/src/ggml-cann/acl_tensor.h
# ggml/src/ggml-cann/aclnn_ops.cpp
# ggml/src/ggml-cann/aclnn_ops.h
# ggml/src/ggml-cann/common.h
# ggml/src/ggml-cann/ggml-cann.cpp
# ggml/src/ggml-metal/CMakeLists.txt
# ggml/src/ggml-opencl/ggml-opencl.cpp
# ggml/src/ggml-opencl/kernels/concat.cl
# ggml/src/ggml-opencl/kernels/repeat.cl
# ggml/src/ggml-opencl/kernels/scale.cl
# ggml/src/ggml-opencl/kernels/tanh.cl
# ggml/src/ggml-sycl/CMakeLists.txt
# ggml/src/ggml-sycl/dpct/helper.hpp
# ggml/src/ggml-sycl/ggml-sycl.cpp
# ggml/src/ggml-sycl/outprod.cpp
# ggml/src/ggml-sycl/rope.cpp
# ggml/src/ggml-sycl/wkv.cpp
# src/llama-vocab.cpp
# tests/test-autorelease.cpp
# tests/test-backend-ops.cpp
# tools/cvector-generator/pca.hpp
# tools/export-lora/export-lora.cpp
# tools/perplexity/README.md
2026-02-03 19:00:42 +08:00
Oliver Simons
1f1e57f2bf
CUDA: Fix loop unrolling for BW in mul_mat_q_stream_k_fixup ( #19053 )
...
By providing the stride_* variables as size_t (i.e., 64-bit), the compiler can
correctly unroll the [two for-loops](557515be1e/ggml/src/ggml-cuda/mmq.cuh (L3789-L3816))
on Blackwell (BW). This gives some performance in the prefill/pp phase on BW,
while not affecting other SMs:
| GPU | Model | Test | t/s master | t/s osimons/fix_bw_mmq_fixup_kernel | Speedup |
|:--------------------------------------------------------|:----------------------|:-------|-------------:|--------------------------------------:|----------:|
| NVIDIA RTX 6000 Ada Generation | gpt-oss 20B MXFP4 MoE | pp8096 | 8404.05 | 8375.79 | 1.00 |
| NVIDIA RTX 6000 Ada Generation | llama 3B Q4_K_M | pp8096 | 16148.93 | 16019.60 | 0.99 |
| NVIDIA RTX 6000 Ada Generation | llama 8B Q4_0 | pp8096 | 8008.29 | 7978.80 | 1.00 |
| NVIDIA RTX 6000 Ada Generation | nemotron_h 9B BF16 | pp8096 | 4263.16 | 4248.53 | 1.00 |
| NVIDIA RTX 6000 Ada Generation | nemotron_h 9B Q4_K_M | pp8096 | 5165.11 | 5157.43 | 1.00 |
| NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition | gpt-oss 20B MXFP4 MoE | pp8096 | 12582.80 | 12758.37 | 1.01 |
| NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition | llama 3B Q4_K_M | pp8096 | 16879.10 | 17619.47 | 1.04 |
| NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition | llama 8B Q4_0 | pp8096 | 10649.90 | 10982.65 | 1.03 |
| NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition | nemotron_h 9B BF16 | pp8096 | 7717.73 | 7716.22 | 1.00 |
| NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition | nemotron_h 9B Q4_K_M | pp8096 | 7301.90 | 7370.38 | 1.01 |
2026-02-03 11:33:14 +01:00
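A hedged plain-C++ illustration of why the 64-bit strides matter (standing in for the CUDA kernel): with a narrower stride type, each 64-bit address computation needs a widening conversion inside the loop, which can keep the compiler from fully unrolling it.
```cpp
#include <cstddef>

// 32-bit stride: the index math must be widened on every iteration before
// it can feed 64-bit addressing, which can defeat full unrolling.
static void fixup_rows_32(float * dst, const float * partial, int stride, int n) {
    for (int i = 0; i < n; ++i) {
        dst[(long long) i * stride] += partial[i];
    }
}

// size_t stride: the induction arithmetic is already 64-bit and uniform,
// so the loop body is trivially unrollable.
static void fixup_rows_64(float * dst, const float * partial, size_t stride, int n) {
    for (int i = 0; i < n; ++i) {
        dst[i * stride] += partial[i];
    }
}
```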
George
e9a859db3c
ggml: added cleanups in ggml_quantize_free ( #19278 )
...
Add the missing cleanup calls for the IQ2_S and IQ1_M quantization types, and for IQ3XS with 512 blocks, to the quantization cleanup.
2026-02-03 08:43:39 +02:00
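A plausible reconstruction of the shape of the fix. The iq2xs_free_impl/iq3xs_free_impl helpers do exist in ggml's quantization code, but treat the exact added lines as approximate rather than the verbatim diff.
```cpp
void ggml_quantize_free(void) {
    ggml_critical_section_start();

    iq2xs_free_impl(GGML_TYPE_IQ2_XXS);
    iq2xs_free_impl(GGML_TYPE_IQ2_XS);
    iq2xs_free_impl(GGML_TYPE_IQ2_S);   // added: IQ2_S tables were never freed
    iq2xs_free_impl(GGML_TYPE_IQ1_S);
    iq2xs_free_impl(GGML_TYPE_IQ1_M);   // added: same for IQ1_M
    iq3xs_free_impl(256);
    iq3xs_free_impl(512);               // added: IQ3XS grid with 512 blocks

    ggml_critical_section_end();
}
```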
Gaurav Garg
41e3f02647
cuda : revert CUDA_SCALE_LAUNCH_QUEUES override until investigated ( #19227 )
...
Hangs were reported on Jetson Orin AGX when CUDA_SCALE_LAUNCH_QUEUES=4x was set. This reverts the previous PR (#19042) and updates the documentation to suggest setting CUDA_SCALE_LAUNCH_QUEUES=4x for faster throughput on multi-GPU systems.
2026-02-03 08:41:02 +02:00
Alexey Dubrov
1efb5f7ae1
vocab: add Falcon-H1-Tiny-Coder FIM tokens ( #19249 )
2026-02-03 08:31:01 +02:00
Georgi Gerganov
aeb827a3cc
spec : simplify time measurement using common_time_meas ( #19262 )
2026-02-03 08:20:15 +02:00
lhez
91ea44e89b
opencl: refactor some ops: concat, repeat, tanh and scale ( #19226 )
...
* opencl: refactor concat
* opencl: refactor repeat
* opencl: refactor tanh
* opencl: enable fp16 for tanh
* opencl: refactor scale
* opencl: fix unused variables
2026-02-02 15:54:43 -08:00
Sid Mohan
0dfcd3b607
jinja : add missing 'in' test to template engine ( #19004 ) ( #19239 )
...
* jinja : add missing 'in' test to template engine (#19004 )
The jinja template parser was missing the 'in' test from
global_builtins(), causing templates using reject("in", ...),
select("in", ...), or 'x is in(y)' to fail with
"selectattr: unknown test 'in'".
This broke tool-calling for Qwen3-Coder and any other model
whose chat template uses the 'in' test.
Added test_is_in supporting array, string, and object containment
checks, mirroring the existing 'in' operator logic in runtime.cpp.
Includes test cases for all three containment types plus
reject/select filter usage.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
* reuse test_is_in in binary op
---------
Co-authored-by: Sid Mohan <sidmohan0@users.noreply.github.com>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2026-02-02 21:00:55 +01:00
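A hedged sketch of what the containment test looks like, using nlohmann::json as a stand-in for the template engine's own value type (the real test_is_in mirrors the 'in' operator logic in runtime.cpp):
```cpp
#include <string>
#include <nlohmann/json.hpp>

using json = nlohmann::json;

// 'x is in(y)' / reject("in", ...) / select("in", ...): true when item is
// an element of an array, a substring of a string, or a key of an object.
static bool test_is_in(const json & item, const json & container) {
    if (container.is_array()) {
        for (const auto & v : container) {
            if (v == item) return true;
        }
        return false;
    }
    if (container.is_string() && item.is_string()) {
        return container.get<std::string>().find(item.get<std::string>())
               != std::string::npos;
    }
    if (container.is_object() && item.is_string()) {
        return container.contains(item.get<std::string>());
    }
    return false;
}
```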
Xuan-Son Nguyen
07a7412a3b
mtmd: add min/max pixels gguf metadata ( #19273 )
2026-02-02 20:59:06 +01:00
Aman Gupta
9f682fb640
ggml-cpu: FA split across kv for faster TG ( #19209 )
...
* ggml-cpu: split across kv for faster TG
* simplify sinks application
* add ref impl
2026-02-03 01:19:55 +08:00
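The title compresses a known technique: during token generation the query is a single token, so the remaining parallelism in flash attention comes from splitting the KV cache across threads and merging partial online-softmax states. A hedged sketch of that merge step (fa_partial is a hypothetical struct, not ggml-cpu's actual layout):
```cpp
#include <algorithm>
#include <cmath>
#include <vector>

struct fa_partial {
    float m;                 // running max of the logits seen in this KV chunk
    float s;                 // sum of exp(logit - m)
    std::vector<float> o;    // sum of exp(logit - m) * V rows
};

static fa_partial merge(const fa_partial & a, const fa_partial & b) {
    fa_partial r;
    r.m = std::max(a.m, b.m);
    const float ca = std::exp(a.m - r.m); // rescale each partial to the new max
    const float cb = std::exp(b.m - r.m);
    r.s = a.s * ca + b.s * cb;
    r.o.resize(a.o.size());
    for (size_t i = 0; i < a.o.size(); ++i) {
        r.o[i] = a.o[i] * ca + b.o[i] * cb;
    }
    return r; // after all chunks are merged, the attention output is r.o / r.s
}
```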
Matthieu Coudron
a3fa035822
server: print actual model name in 'model not found' error ( #19117 )
...
When experimenting with AI, my environment gets messy fast, and it's not
always easy to know which model my software is trying to load. This helps
with troubleshooting.
Before:
Error: {
code = 400,
message = "model not found",
type = "invalid_request_error"
}
After:
Error: {
code = 400,
message = "model 'toto' not found",
type = "invalid_request_error"
}
2026-02-02 16:55:27 +01:00
Aman Gupta
15818ac44c
ci: add test-backend-ops test for CPU ( #19268 )
2026-02-02 22:40:28 +08:00
Neo Zhang
bf38346d13
Remove support for Nvidia & AMD GPUs, because the oneAPI plugin for Nvidia & AMD GPUs is unavailable: its download/installation channels no longer work. ( #19246 )
...
Users can't build the software for Nvidia & AMD GPUs.
Also remove oneMath, since it is only used in the NV and AMD code paths.
2026-02-02 21:06:21 +08:00
Tamar
4d5e972673
sycl: implement GGML_OP_TOP_K ( #19242 )
2026-02-02 21:05:51 +08:00
Georgi Gerganov
6fdddb4987
metal : support virtual devices ( #18919 )
...
* metal : support virtual devices
* cont : manage buffer type context memory
* metal : add events
* cont : implement cpy_tensor_async
2026-02-02 14:29:44 +02:00
Daniel Bevenius
6156ae5111
model-conversion : add debug option to conversion script ( #19265 )
...
This commit adds a debug option to the model conversion script to enable
using the Python debugger (pdb) during model conversion.
The motivation is that I've found myself adding this manually a few
times now, and it would be quicker to have it as a flag, along with a
Makefile target/recipe for it.
2026-02-02 11:29:57 +01:00
Johannes Gäßler
59377a6c87
ggml-backend: fix async set/get fallback sync ( #19179 )
2026-02-02 10:00:05 +01:00
Georgi Gerganov
1239267cc4
authors : update ( #19263 )
...
[no ci]
2026-02-02 08:51:25 +02:00
Christian Kastner
7a4ca3cbd9
docs : Minor cleanups ( #19252 )
...
* Update old URLs to github.com/ggml-org/
* Bump copyrights
2026-02-02 08:38:55 +02:00
Sascha Rogmann
b4d05a3d2f
spec : various improvements to ngram-map + docs ( #19253 )
...
* spec: ngram-map and reasoning chats
* spec: add t_begin and t_accept
* ngram-map : add internal hash map
* docs : update ngram-map, add ngram-mod
* docs : fix ngram-map-k
* docs : differences between implementations
2026-02-02 08:26:58 +02:00
Concedo
77f4afe72b
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .devops/nix/nixpkgs-instances.nix
# docs/backend/snapdragon/CMakeUserPresets.json
# ggml/CMakeLists.txt
# ggml/src/ggml-webgpu/ggml-webgpu.cpp
2026-02-02 11:43:58 +08:00
Concedo
68f9c6df91
fix cuda graph spam
2026-02-02 11:28:50 +08:00
Nikhil Jain
2dc3ce2166
Remove pipeline cache mutexes ( #19195 )
...
* Remove mutex for pipeline caches, since they are now per-thread.
* Add comment
* Run clang-format
* Cleanup
* Run CI again
* Run CI once more
* Run clang-format
2026-02-01 18:47:29 -08:00
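A hedged sketch of the design this relies on (pipeline and compile_pipeline are hypothetical stand-ins): once each thread owns its own cache, lookups are race-free by construction and the mutex can simply be deleted. The trade-off is that each thread may compile and hold its own copy of a pipeline.
```cpp
#include <string>
#include <unordered_map>

struct pipeline { /* opaque stand-in for the backend's pipeline object */ };

static pipeline * compile_pipeline(const std::string & /*key*/) {
    return new pipeline(); // stand-in for the expensive pipeline build
}

static pipeline * get_pipeline(const std::string & key) {
    // thread_local: each thread sees an independent map, so concurrent
    // lookups never race and no lock is needed around the cache.
    static thread_local std::unordered_map<std::string, pipeline *> cache;
    auto it = cache.find(key);
    if (it != cache.end()) {
        return it->second;
    }
    pipeline * p = compile_pipeline(key);
    cache.emplace(key, p);
    return p;
}
```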
Max Krasnyansky
3bc8d2cf23
Bump cmake max version (needed for Windows on Snapdragon builds) ( #19188 )
...
* Bump max cmake version (needed for Windows on Snapdragon builds)
* cmake: move max version setting into ggml/CMakeLists
2026-02-01 14:13:38 -08:00
Alexis Williams
8a98ba4582
nix: fix allowUnfreePredicate for packages with multiple licenses ( #19237 )
...
The allowUnfreePredicate in pkgsCuda was wrapping p.meta.license in a
list unconditionally. This fails when meta.license is already a list
of licenses, as it creates a nested list and then tries to access
.free and .shortName on the inner list.
Use lib.toList instead, which correctly handles both cases:
- Single license attrset -> wraps in list
- List of licenses -> returns unchanged
2026-02-01 22:10:48 +02:00
Concedo
ddce19db72
Merge branch 'upstream' into concedo_experimental
...
# Conflicts:
# .devops/nix/package-gguf-py.nix
# .devops/nix/scope.nix
# common/CMakeLists.txt
# docs/backend/SYCL.md
# examples/lookahead/lookahead.cpp
# examples/lookup/lookup.cpp
# examples/sycl/run-llama2.sh
# examples/sycl/win-run-llama2.bat
# examples/sycl/win-test.bat
# ggml/src/ggml-hexagon/CMakeLists.txt
# ggml/src/ggml-hexagon/htp/flash-attn-ops.c
# ggml/src/ggml-hexagon/htp/hvx-dump.h
# ggml/src/ggml-hexagon/htp/hvx-reduce.h
# ggml/src/ggml-hexagon/htp/matmul-ops.c
# ggml/src/ggml-hexagon/htp/softmax-ops.c
# ggml/src/ggml-hexagon/htp/unary-ops.c
# ggml/src/ggml-opencl/CMakeLists.txt
# ggml/src/ggml-opencl/ggml-opencl.cpp
# ggml/src/ggml-opencl/kernels/cvt.cl
# scripts/sync-ggml.last
2026-02-01 22:35:25 +08:00
Concedo
76b22a7b23
updated lite
2026-02-01 22:16:13 +08:00
Concedo
a5ae116033
increase z-image default clamp to 4.0, to tolerate z-image base's requirement for higher CFG
2026-02-01 22:02:20 +08:00
Concedo
b13bf44285
kde fractional scaling fix, tooltip fix (+1 squashed commit)
...
Squashed commits:
[1cf02dcce] kde fractional scaling fix
2026-02-01 21:55:44 +08:00
Neo Zhang
2634ed207a
create test.sh to enhance the parameters for testing, update the guide, remove unused script ( #19243 )
2026-02-01 18:24:00 +08:00