koboldcpp

mirror of https://github.com/LostRuins/koboldcpp.git synced 2026-06-02 15:39:26 +00:00

Author	SHA1	Message	Date
Concedo	ada982b7c1	Merge branch 'upstream' into concedo_experimental # Conflicts: # .devops/vulkan.Dockerfile # benches/dgx-spark/dgx-spark.md # scripts/bench-models.sh	2026-02-05 22:24:12 +08:00
Concedo	157fac7bd0	Merge commit 'c342c3b93d' into concedo_experimental # Conflicts: # .github/workflows/build.yml # CODEOWNERS # scripts/sync_vendor.py	2026-02-05 22:23:05 +08:00
Reithan	de3ed7d7d6	add missing resolve_refs call to enable subschema use (#1959 )	2026-02-05 22:12:59 +08:00
Wagner Bruna	c2d96328fe	sd: sync to master-493-65891d7 (#1960 )	2026-02-05 22:11:47 +08:00
Concedo	ceb548f407	update text (+1 squashed commits) Squashed commits: [2a1532783] update text	2026-02-05 22:11:10 +08:00
Georgi Gerganov	3795cc1e89	benches : update models + numbers (#19359 ) * bench : update script * benches : update numbers	2026-02-05 14:34:07 +02:00
Sigbjørn Skjæret	b828e18c75	docker : fix vulkan build (#19352 )	2026-02-05 11:10:39 +01:00
Concedo	0e907e23fb	Revamped help menu	2026-02-05 17:34:39 +08:00
Adrien Gallouët	a4ea7a188f	vendor : update BoringSSL to 0.20260204.0 (#19333 ) Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2026-02-05 09:53:35 +01:00
Georgi Gerganov	7a4f97d196	metal : add diag (#19330 )	2026-02-05 10:08:45 +02:00
Oleksandr Kuvshynov	a498c75ad1	vulkan: fix GPU deduplication logic. (#19222 ) * vulkan: fix GPU deduplication logic. As reported in https://github.com/ggml-org/llama.cpp/issues/19221, the (same uuid, same driver) logic is problematic for windows+intel igpu. Let's just avoid filtering for MoltenVK which is apple-specific, and keep the logic the same as before `88d23ad5` - just dedup based on UUID. Verified that MacOS + 4xVega still reports 4 GPUs with this version. * vulkan: only skip dedup when both drivers are moltenVk	2026-02-05 09:06:59 +01:00
Jeff Bolz	3409ab842d	vulkan: Set k_load_shmem to false when K is too large (#19301 )	2026-02-05 08:48:33 +01:00
Jeff Bolz	c342c3b93d	vulkan: fix non-contig rope (#19299 )	2026-02-05 08:38:59 +01:00
will-lms	af252d0758	metal : add missing includes (#19348 )	2026-02-05 08:05:09 +02:00
Concedo	1b894a58b4	glm 4.7 nothink	2026-02-05 10:51:46 +08:00
Sigbjørn Skjæret	11fb327bf3	vendor : add missing llama_add_compile_flags (#19322 ) * add missing llama_add_compile_flags * disable all warnings for ssl, crypto and fipsmodule	2026-02-05 02:27:38 +01:00
Aaron Teo	e6e934c5ea	vendor: update cpp-httplib version (#19313 ) Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>	2026-02-05 05:15:03 +08:00
Daniel Bevenius	b536eb0233	codeowners : add danbev for examples/debug (#19332 ) * codeowners : add danbev for examples/debug * Add @pwilkin to CODEOWNERS for debug --------- Co-authored-by: Piotr Wilkin (ilintar) <piotr.wilkin@syndatis.com>	2026-02-04 20:20:40 +01:00
Xuan-Son Nguyen	e0c93af2a0	debug: make common_debug_print_tensor readable (#19331 ) * debug: make common_debug_print_tensor readable * editorconfig	2026-02-04 17:55:31 +01:00
Georgi Gerganov	423bee462b	ci : fix sanitize workflow to enable ggml sanitizers too (#19323 )	2026-02-04 15:12:03 +02:00
Concedo	1a36ef20c3	Merge branch 'upstream' into concedo_experimental # Conflicts: # tests/test-backend-ops.cpp	2026-02-04 20:53:35 +08:00
Concedo	30c74d5cce	fixed mcp bug	2026-02-04 20:46:55 +08:00
Concedo	4b073f3aa0	fix sse parsing in mcp	2026-02-04 20:38:33 +08:00
Concedo	349c461453	add stop reason for error	2026-02-04 20:23:18 +08:00
Xuan-Son Nguyen	8abcc70a74	model: (qwen3next) correct vectorized key_gdiff calculation (#19324 ) * model: (qwen3next) correct vectorized key_gdiff calculation * move transpose to outside of loop	2026-02-04 13:09:58 +01:00
Georgi Gerganov	eaba92c3dc	tests : add non-cont, inplace rope tests (#19296 ) * tests : add non-cont, inplace rope tests * cont : exercise dim 3 Co-authored-by: Jeff Bolz <jbolz@nvidia.com> * cont : more dim3 exercises --------- Co-authored-by: Jeff Bolz <jbolz@nvidia.com>	2026-02-04 12:45:21 +02:00
Daniel Bevenius	6ab881b7c3	model-conversion : add tensor-info.py utility (#18954 )	2026-02-04 10:40:53 +01:00

    This commit adds a new python script that can be used to print tensors
    information from a tensor in a safetensors model.

    The motivation for this is that during model conversion work it can
    sometimes be useful to verify the shape of tensors in the original
    model. While it is possible to print the tensors when loading the model
    this can be slow when working with larger models. With this script it is
    possible to quickly query tensor shapes.

    Example usage:
    ```console
    (venv) $ ./scripts/utils/tensor-info.py --help
    usage: tensor-info.py [-h] [-m MODEL_PATH] [-l] [tensor_name]

    Print tensor information from a safetensors model

    positional arguments:
      tensor_name           Name of the tensor to inspect

    options:
      -h, --help            show this help message and exit
      -m MODEL_PATH, --model-path MODEL_PATH
                            Path to the model directory (default: MODEL_PATH environment variable)
      -l, --list            List unique tensor patterns in the model (layer numbers replaced with #)
    ```

    Listing tensor names:
    ```console
    (venv) $ ./scripts/utils/tensor-info.py -m ~/work/ai/models/google/embeddinggemma-300m -l
    embed_tokens.weight
    layers.#.input_layernorm.weight
    layers.#.mlp.down_proj.weight
    layers.#.mlp.gate_proj.weight
    layers.#.mlp.up_proj.weight
    layers.#.post_attention_layernorm.weight
    layers.#.post_feedforward_layernorm.weight
    layers.#.pre_feedforward_layernorm.weight
    layers.#.self_attn.k_norm.weight
    layers.#.self_attn.k_proj.weight
    layers.#.self_attn.o_proj.weight
    layers.#.self_attn.q_norm.weight
    layers.#.self_attn.q_proj.weight
    layers.#.self_attn.v_proj.weight
    norm.weight
    ```

    Printing a specific tensor's information:
    ```console
    (venv) $ ./scripts/utils/tensor-info.py -m ~/work/ai/models/google/embeddinggemma-300m layers.0.input_layernorm.weight
    Tensor: layers.0.input_layernorm.weight
    File: model.safetensors
    Shape: [768]
    ```

Georgi Gerganov	d838c22bb3	spec : fix the check-rate logic of ngram-simple (#19261 ) * spec : fix the check-rate logic of ngram-simple * cont : refactor + fix checks	2026-02-04 10:39:53 +02:00
Concedo	a2251a154f	Merge remote-tracking branch 'jeff/rope_noncontig' into concedo_experimental	2026-02-04 16:21:31 +08:00
Concedo	1f803ae27b	Merge branch 'upstream' into concedo_experimental # Conflicts: # .github/workflows/server.yml # CMakeLists.txt # cmake/common.cmake # ggml/src/ggml-virtgpu/apir_cs_ggml-rpc-front.cpp # ggml/src/ggml-virtgpu/backend/backend-dispatched-backend.cpp # ggml/src/ggml-virtgpu/backend/backend-dispatched-buffer-type.cpp # ggml/src/ggml-virtgpu/backend/backend-dispatched-buffer.cpp # ggml/src/ggml-virtgpu/backend/backend-dispatched-device.cpp # ggml/src/ggml-virtgpu/backend/backend-dispatched.cpp # ggml/src/ggml-virtgpu/backend/backend-dispatched.gen.h # ggml/src/ggml-virtgpu/backend/backend-dispatched.h # ggml/src/ggml-virtgpu/backend/backend.cpp # ggml/src/ggml-virtgpu/backend/shared/apir_cs.h # ggml/src/ggml-virtgpu/backend/shared/apir_cs_ggml.h # ggml/src/ggml-virtgpu/ggml-backend-buffer-type.cpp # ggml/src/ggml-virtgpu/ggml-backend-device.cpp # ggml/src/ggml-virtgpu/ggml-backend-reg.cpp # ggml/src/ggml-virtgpu/ggml-remoting.h # ggml/src/ggml-virtgpu/ggmlremoting_functions.yaml # ggml/src/ggml-virtgpu/regenerate_remoting.py # ggml/src/ggml-virtgpu/virtgpu-forward-backend.cpp # ggml/src/ggml-virtgpu/virtgpu-forward-buffer-type.cpp # ggml/src/ggml-virtgpu/virtgpu-forward-buffer.cpp # ggml/src/ggml-virtgpu/virtgpu-forward-device.cpp # ggml/src/ggml-virtgpu/virtgpu-forward-impl.h # ggml/src/ggml-virtgpu/virtgpu-forward.gen.h # ggml/src/ggml-virtgpu/virtgpu-shm.cpp # ggml/src/ggml-virtgpu/virtgpu.cpp # ggml/src/ggml-virtgpu/virtgpu.h	2026-02-04 16:21:06 +08:00
Wagner Bruna	d9ac52a01a	sd: sync to master-492-f957fa3 (#1957 ) * sd: sync to master-492-f957fa3 * add Res Multistep and Res 2s samplers * make sdflashattention control flash_attn too	2026-02-04 16:12:39 +08:00
Daniel Bevenius	25f40ca65f	completion : simplify batch (embd) processing (#19286 ) * completion : simplify batch (embd) processing This commit simplifies the processing of embd by removing the for loop that currently exists which uses params.n_batch as its increment. This commit also removes the clamping of n_eval as the size of embd is always at most the size of params.n_batch. The motivation is to clarify the code as it is currently a little confusing when looking at this for loop in isolation and thinking that it can process multiple batches. * add an assert to verify n_eval is not greater than n_batch	2026-02-04 05:43:28 +01:00
Kevin Pouget	015deb9048	ggml-virtgpu: make the code thread safe (#19204 ) * ggml-virtgpu: regenerate_remoting.py: add the ability to deprecate a function * ggml-virtgpu: deprecate buffer_type is_host remoting not necessary * ggml-virtgpu: stop using static vars as cache The static init isn't thread safe. * ggml-virtgpu: protect the use of the shared memory to transfer data * ggml-virtgpu: make the remote calls thread-safe * ggml-virtgpu: backend: don't continue if couldn't allocate the tensor memory * ggml-virtgpu: add a cleanup function for consistency * ggml-virtgpu: backend: don't crash if buft->iface.get_max_size is missing * fix style and ordering * Remove the static variable in apir_device_get_count * ggml-virtgpu: improve the logging * fix review minor formatting changes	2026-02-04 10:46:18 +08:00
Aman Gupta	2ceda3f662	ggml-cpu: use LUT for converting e8->f32 scales on x86 (#19288 ) * ggml-cpu: use LUT for converting e8->f32 scales on x86 * add dispatch based on macro	2026-02-04 09:43:29 +08:00
Georgi Gerganov	44008ce8f9	metal : add solve_tri (#19302 )	2026-02-03 23:43:14 +02:00
Georgi Gerganov	6a9bf2f788	ci : add sanitizer runs for server (#19291 )	2026-02-03 22:41:20 +02:00
Georgi Gerganov	faa1bc26ee	sampling : delegate input allocation to the scheduler (#19266 ) * sampling : delegate input allocation to the scheduler * graph : compute backend samplers only if needed	2026-02-03 22:16:16 +02:00
Jeff Bolz	5de50e9d86	vulkan: fix non-contig rope	2026-02-03 12:20:08 -06:00
Ruben Ortlam	32b17abdb0	vulkan: disable coopmat1 fa on Nvidia Turing (#19290 )	2026-02-03 17:37:32 +01:00
Aman Gupta	8bece2eb20	CUDA: use mmvq for mul-mat-id for small batch sizes (#18958 ) * CUDA: use mmvq for mul-mat-id for small batch sizes * add mmvq too * Fix perf issue on ampere. Use mmvf mm-id only for non-nvidia GPUs * templatize multi_token_path	2026-02-03 23:31:23 +08:00
Concedo	e7d980cf4a	updated sdui	2026-02-03 21:33:51 +08:00
Sigbjørn Skjæret	a6fd8ca1fe	models : remove unnecessary cont in openelm (#19289 )	2026-02-03 14:20:57 +01:00
Concedo	dfa725c58d	make the dpi fix more universal. not a perfect solution	2026-02-03 19:49:38 +08:00
Georgi Gerganov	c55bce4159	metal : minor cleanup (#19251 )	2026-02-03 13:43:29 +02:00
Concedo	316530e9cf	fix cuda graph spams	2026-02-03 19:00:50 +08:00
Concedo	7b393fa487	Merge branch 'upstream' into concedo_experimental # Conflicts: # .github/workflows/build.yml # AUTHORS # ci/run.sh # docs/backend/SYCL.md # docs/build.md # docs/multimodal/minicpmo2.6.md # docs/multimodal/minicpmo4.0.md # docs/multimodal/minicpmv2.5.md # docs/multimodal/minicpmv2.6.md # docs/multimodal/minicpmv4.0.md # docs/multimodal/minicpmv4.5.md # docs/ops.md # docs/ops/SYCL.csv # docs/speculative.md # examples/deprecation-warning/README.md # examples/deprecation-warning/deprecation-warning.cpp # examples/model-conversion/Makefile # examples/model-conversion/scripts/causal/convert-model.sh # ggml/include/ggml-cann.h # ggml/src/ggml-cann/acl_tensor.cpp # ggml/src/ggml-cann/acl_tensor.h # ggml/src/ggml-cann/aclnn_ops.cpp # ggml/src/ggml-cann/aclnn_ops.h # ggml/src/ggml-cann/common.h # ggml/src/ggml-cann/ggml-cann.cpp # ggml/src/ggml-metal/CMakeLists.txt # ggml/src/ggml-opencl/ggml-opencl.cpp # ggml/src/ggml-opencl/kernels/concat.cl # ggml/src/ggml-opencl/kernels/repeat.cl # ggml/src/ggml-opencl/kernels/scale.cl # ggml/src/ggml-opencl/kernels/tanh.cl # ggml/src/ggml-sycl/CMakeLists.txt # ggml/src/ggml-sycl/dpct/helper.hpp # ggml/src/ggml-sycl/ggml-sycl.cpp # ggml/src/ggml-sycl/outprod.cpp # ggml/src/ggml-sycl/rope.cpp # ggml/src/ggml-sycl/wkv.cpp # src/llama-vocab.cpp # tests/test-autorelease.cpp # tests/test-backend-ops.cpp # tools/cvector-generator/pca.hpp # tools/export-lora/export-lora.cpp # tools/perplexity/README.md	2026-02-03 19:00:42 +08:00
Oliver Simons	1f1e57f2bf	CUDA: Fix loop unrolling for BW in mul_mat_q_stream_k_fixup (#19053 )	2026-02-03 11:33:14 +01:00

    By providing stride_* variables as size_t (i.e., 64-bit) the compiler can
    correctly unroll the two for-loops
    (`557515be1e/ggml/src/ggml-cuda/mmq.cuh`, L3789-L3816) on BW. This gives
    some perf for prefill/pp phase on BW, while not affecting other SMs:

    | GPU | Model | Test | t/s master | t/s osimons/fix_bw_mmq_fixup_kernel | Speedup |
    |:--------------------------------------------------------|:----------------------|:-------|-------------:|--------------------------------------:|----------:|
    | NVIDIA RTX 6000 Ada Generation | gpt-oss 20B MXFP4 MoE | pp8096 | 8404.05 | 8375.79 | 1.00 |
    | NVIDIA RTX 6000 Ada Generation | llama 3B Q4_K_M | pp8096 | 16148.93 | 16019.60 | 0.99 |
    | NVIDIA RTX 6000 Ada Generation | llama 8B Q4_0 | pp8096 | 8008.29 | 7978.80 | 1.00 |
    | NVIDIA RTX 6000 Ada Generation | nemotron_h 9B BF16 | pp8096 | 4263.16 | 4248.53 | 1.00 |
    | NVIDIA RTX 6000 Ada Generation | nemotron_h 9B Q4_K_M | pp8096 | 5165.11 | 5157.43 | 1.00 |
    | NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition | gpt-oss 20B MXFP4 MoE | pp8096 | 12582.80 | 12758.37 | 1.01 |
    | NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition | llama 3B Q4_K_M | pp8096 | 16879.10 | 17619.47 | 1.04 |
    | NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition | llama 8B Q4_0 | pp8096 | 10649.90 | 10982.65 | 1.03 |
    | NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition | nemotron_h 9B BF16 | pp8096 | 7717.73 | 7716.22 | 1.00 |
    | NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition | nemotron_h 9B Q4_K_M | pp8096 | 7301.90 | 7370.38 | 1.01 |

George	e9a859db3c	ggml: added cleanups in ggml_quantize_free (#19278 ) Add missing cleanup calls for IQ2_S, IQ1_M quantization types and IQ3XS with 512 blocks during quantization cleanup.	2026-02-03 08:43:39 +02:00
Gaurav Garg	41e3f02647	cuda : revert CUDA_SCALE_LAUNCH_QUEUES override until investigated (#19227 ) Hangs were reported on Jetson Orin AGX if we set CUDA_SCALE_LAUNCH_QUEUES=4x. Reverting the previous PR (#19042) and updating the document to consider setting CUDA_SCALE_LAUNCH_QUEUES=4x for faster throughput on multi-GPU systems.	2026-02-03 08:41:02 +02:00
Alexey Dubrov	1efb5f7ae1	vocab: add Falcon-H1-Tiny-Coder FIM tokens (#19249 )	2026-02-03 08:31:01 +02:00

11551 commits