koboldcpp

mirror of https://github.com/LostRuins/koboldcpp.git synced 2026-05-30 20:33:39 +00:00

Author	SHA1	Message	Date
Michael Wand	6fe90deffa	models : Attach Mistral3 NVFP4 weight scales (#23629 )	2026-05-26 07:59:59 +03:00
Alexey Kopytko	581d020b12	SYCL: implement ggml_sycl_pool_vmm (#22862 ) * SYCL: implement ggml_sycl_pool_vmm * Add an option to bypass VMM with GGML_SYCL_DISABLE_VMM * Clean up debugging logging * document GGML_SYCL_DISABLE_VMM * Multi-stream MoE optimization * Revert "Multi-stream MoE optimization" This reverts commit 938929c3f13a562ec67c59e87cc5d38595444cce. * Update common.hpp Co-authored-by: Neo Zhang <zhang.jianyu@outlook.com> * Flip GGML_SYCL_DISABLE_VMM to GGML_SYCL_ENABLE_VMM * add logging for GGML_SYCL_ENABLE_VMM when extension is not available (SYCL_EXT_ONEAPI_VIRTUAL_MEM macro) * Apply suggestions from code review Co-authored-by: Alexey Kopytko <alexey@kopytko.com> * Apply suggestion from @sanmai * Apply suggestion from @sanmai --------- Co-authored-by: Neo Zhang <zhang.jianyu@outlook.com>	2026-05-26 07:59:00 +03:00
Jeff Bolz	7623de11d9	tests: test-backend-ops -j <N> to run tests in parallel (#23637 ) Create a pool of N threads that grab a chunk of up to 100 tests at a time to iterate through. The number of tests at a time decreases as fewer remain. Each thread uses its own dev and cpu backend, and set_n_threads_fn is not called on the cpu backend. Fix some TSAN issues that arose: - In init_tensor_uniform, don't use static vector of generators. - Replace gmtime with versions that don't use a global variable. - Mutex calls to print_test_result.	2026-05-26 07:57:56 +03:00
Niklas Sheth	c9d98295a3	model : add support for talkie-1930-13b (#22596 ) * initial talkie support, coherent * reorder to follow convention * absorb inverse rope * stop folding scalars to improve quantization * use broadcasting instead of duplication * style cleanup * add scaling support to LoraTorchTensor; use that path in conversion * use layer_out_scale instead of embd_skip_scale	2026-05-26 07:57:38 +03:00
Masashi Yoshimura	1506d39e76	ggml-webgpu: Add MMVQ path for Q4/Q8/Q2_K/Q4_K and clean up legacy MUL_MAT pipeline (#23594 ) * ggml-webgpu: Add MMVQ path for Q4/Q8/Q2_K/Q4_K * Fix to editorconfig checking pass * Remove mul-mat-legacy pipeline * Fix to use vendor name as is and add dot_product/vendor to shader_lib_ctx	2026-05-25 20:42:49 -07:00
Nikhil Jain	54121f7325	[WebGPU] Check batch_compute_passes before sending passes when not doing GPU profiling (#23457 ) * Only run webgpu CI on my fork * Add webgpu only workflow * refactor batch_compute_passes to a per-thread variable, and submit individual passes when it is set to false and no GPU profiling is enabled * restore build.yml	2026-05-25 20:32:49 -07:00
Johannes Gäßler	192d8ae8b8	CUDA: missing PDL sync for FWHT, better fallback (#23690 )	2026-05-26 11:05:51 +08:00
forforever73	35c9b1f39e	metal : add apple device id (#23566 ) Co-authored-by: lvyichen <lvyichen@stepfun.com>	2026-05-25 21:05:16 +03:00
Max Krasnyansky	4bead4e30d	snapdragon: bump toolchain docker to v0.7 to fix ui build issues (#23680 )	2026-05-25 10:57:43 -07:00
Georgi Gerganov	302e2c2652	ci : reduce PR jobs by matching backend paths (#23675 ) * ci : disable SYCL f16 builds * ci : extract android and hip into separate workflows * ci : move webgpu to separate workflow * ci : move the rpc to a separate workflow * ci : extract s309x and ppcl jobs * ci : extract opencl job into a separate workflow	2026-05-25 20:54:54 +03:00
Pascal	328874d054	model: tag ffn_latent as MUL_MAT to fix buft probe (#23664 ) ffn_latent_down/up are declared GGML_OP_MUL in LLM_TENSOR_INFOS but nemotron-h feeds them through ggml_mul_mat. The loader buft probe asks the backend about the declared op, so it tested an elementwise MUL on a q8_0 weight. That used to return true unconditionally and the weight stayed on GPU by luck. Once supports_op told the truth, the probe got a no and the loader pushed the weight and its matmul to CPU, splitting the graph. Tagging it MUL_MAT asks the real question, the math is unchanged. Verified on Nemotron 3 Super 120B Q5_K_M: from 64.9 back to 103.22 t/s.	2026-05-25 16:05:04 +02:00
Aman Gupta	c1f1e28d29	CUDA: add fast walsh-hadamard transform (#23615 ) * CUDA: add fast walsh-hadamard transform * review: add unrolls + change size_t -> int * warp size 64 --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2026-05-25 21:12:10 +08:00
Pascal	5a4126adc1	ui: fix stop/continue during an agentic loop (#23356 )	2026-05-25 14:18:59 +02:00
Michael Wand	a4d2d4ae41	convert : add compressed-tensors NVFP4 support (#21095 ) * Refactored Compressed Tensors NVFP4 support for new base.py * Support compressed-tensors NVFP4 conversion * Moved Qwen MTP remap into filter_tensors * simplify * pathlib no longer used --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2026-05-25 14:16:11 +02:00
Georgi Gerganov	d161ea7071	sync : ggml	2026-05-25 12:43:27 +03:00
Georgi Gerganov	45158f460e	ggml : bump version to 0.13.0 (ggml/1510)	2026-05-25 12:43:27 +03:00
Georgi Gerganov	22307b3e8b	sync : ggml	2026-05-25 12:38:01 +03:00
Georgi Gerganov	ce5890b5f7	ggml : bump version to 0.12.1 (ggml/1508)	2026-05-25 12:38:01 +03:00
Ori Pekelman	b251f74f49	ggml.h: correct ggml_silu_back arg docstring (a=dy, b=x) (ggml/1500)	2026-05-25 12:38:01 +03:00
Dev-X25874	fa97041524	ggml-alloc: fix out-of-bounds read in ggml_dyn_tallocr_remove_block (ggml/1492)	2026-05-25 12:38:01 +03:00
Johannes Gäßler	ae251b5ff2	TP: fix ggml context size calculation (#22616 ) * TP: fix ggml context size calculation, memory leak * move split state cache back into the context * revert to constant ggml context size for cgraphs * increase headroom for statically allocated tensors * remove obsolete include	2026-05-25 12:37:25 +03:00
Gilad S.	66efd13375	ggml: `gguf_init_from_callback` and `gguf_init_from_buffer` (#22341 ) * ggml: implement `gguf_init_from_buffer` * test: `gguf_init_from_buffer` * fix: memory breakdown for a model loaded with `no_alloc` from a file is consistent with being loaded from a buffer * fix: use `GGML_UNUSED` Co-authored-by: Copilot <copilot@github.com> * fix: remove `total_size` from `gguf_reader` * fix: file offset calculation, rename `offset` to `data_offset` Co-authored-by: Copilot <copilot@github.com> * refactor: extract model loader bug fixes to another PR * feat: add `gguf_init_from_callback` * fix: always require a max expected size * fix: change `gguf_reader_callback_t`'s `output` type to `void `, change `max_expected_size` and offsets to `uint64_t` fix: harden against offset overflow in buffer read * fix: remove seek behavior from the callback * feat: `max_chunk_read == 0` means `SIZE_MAX` * fix: seeking in a gguf file with no tensors --------- Co-authored-by: Copilot <copilot@github.com>	2026-05-25 11:33:29 +02:00
Aman Gupta	6c4cbdc70b	server: MTP layer kv-cache should respect draft type ctk (#23646 )	2026-05-25 16:46:23 +08:00
alex-spacemit	5fdf07e33b	ci : update spacemit toolchain url and enhance curl command (#23642 ) * fix(action): update SpacemiT toolchain URL and version Change-Id: If4cc1c738a855274103f8c3ad52daa33528acd0c * fix(action): add -L flag to curl command for URL redirection Change-Id: I9b6c37390f0c7a733a36308c8fb53d22d234ab06	2026-05-25 10:43:24 +02:00
Sigbjørn Skjæret	062d3115aa	ci : fix pre-tokenizer-hashes check (#23651 )	2026-05-25 10:41:25 +02:00
Tim Neumann	314e729347	llama : document that only one on-device state can be saved per sequence (#23520 )	2026-05-25 10:29:28 +03:00
Aldehir Rojas	d55fb97174	ci : install host compiler on android-ndk build (#23630 )	2026-05-25 10:18:08 +03:00
Jeff Bolz	826539ce59	ggml : Parallelize quant LUT init (#23595 ) - Use OpenMP to parallelize iq2xs_init_impl and iq3xs_init_impl. - Move the OpenMP detection from ggml-cpu to ggml-base. - Update OpenMP dependencies in ggml-config.cmake.in.	2026-05-25 10:15:46 +03:00
Saba Fallah	b96487645c	ui: media attachments before text (#23467 ) * ui: media attachments before text * fix prettier formatting	2026-05-25 08:50:41 +02:00
Alessandro de Oliveira Faria (A.K.A.CABELO)	9627d0f540	vendor : update cpp-httplib to 0.45.1 (#23639 )	2026-05-25 09:45:22 +03:00
jacekpoplawski	e2ef8fe42c	server: fix checkpoints creation (#22929 ) * common : add common_chat_split_by_role * cont : fix spans to reach end of message * server: fix checkpoints creation - extract message_spans from chat templates - find the prompt token position before the latest user message - split prompt batching at that position - create a context checkpoint before the latest user input - avoid periodic mid-prompt checkpoints when that position is known - handle multimodal prompts when mapping text/template positions to server prompt tokens - add --checkpoint-min-step to control minimum spacing between checkpoints * cont : clean-up * Support autoparser detection for message barriers * server: fix message span delimiter and update docs --------- Co-authored-by: Alde Rojas <hello@alde.dev> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: Piotr Wilkin <piotr.wilkin@syndatis.com>	2026-05-25 08:56:18 +03:00
fairydreaming	6d57c26ef8	perplexity : fix even more integer overflows (#23623 ) Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>	2026-05-25 08:12:39 +03:00
Georgi Gerganov	28123a3937	ci : move most slim jobs to self-hosted runners (#23619 ) * ci : remove tag from build-self-hosted.yml * ci : slim -> self-hosted * ci : prevent heavy CPU jobs from running on fast runners * ci : prevent cmake pkg to run on dedicated fast runners * ci : try to bump 3.11 -> 3.13 * ci : move lint back to 3.11 * ci : back to 3.11 * ci : add comment about UI jobs * ci : move python requirements check to CPU runners this job is a bit slow for a dedicated "fast" runner * ci : add self-hosted ui workflow * ci : fix UI naming * tmp to check if arm64 fast is compatible with all jobs * revert last commit	2026-05-25 08:11:19 +03:00
Georgi Gerganov	549b9d8433	ci : update build-self-hosted.yml (#23616 )	2026-05-24 18:20:10 +03:00
Sigbjørn Skjæret	5d246a792d	convert : minor fixes for numpy 2.x (#23571 )	2026-05-24 09:51:31 +02:00
Aldehir Rojas	63248fc3e3	cmake : fix ui build (#23592 ) * cmake/ui : add -fPIC to llama-ui static lib * cmake : rename host compiled embed helper	2026-05-24 02:37:28 -05:00
Aman Gupta	83eebe9d08	server: add margin for draft model for `fit` (#23485 )	2026-05-24 14:43:08 +08:00
Johannes Gäßler	fff63b5108	TP: fix entirely zero-sized slices per device (#23525 )	2026-05-24 08:19:33 +02:00
shaofeiqi	f3061116ff	opencl: batch profiling to improve speed and prevent memory leaks (#23495 )	2026-05-23 23:11:43 -07:00
Yiwei Shao	1c0f6db545	hexagon: apply repl optimization in flash attn softmax as #22993 (#23455 )	2026-05-23 19:56:59 -07:00
Aparna M P	cec51c7a7d	snapdragon: update windows toolchain to use hsdk v6.6.0.0 (#23552 )	2026-05-23 19:56:41 -07:00
Aldehir Rojas	b22ff4b7b4	cmake/ui : refactor the build (#23352 )	2026-05-23 17:08:22 -04:00
Aditya Singh	c0c7e147e7	requirements : bump torch to 2.11.0 (#23503 ) * requirements: relax torch~=2.6.0 to torch>=2.6.0 for convert_hf_to_gguf The ~=2.6.0 operator resolves to >=2.6.0, <2.7.0, which fails on PyPI for platform/CPython combinations where 2.6.x is not present. The accompanying comment already says 'PyTorch 2.6.0 or later', so the looser >=2.6.0 matches the documented intent and unblocks pip install -r requirements/requirements-convert_hf_to_gguf.txt. Fixes #23408 * requirements: bump torch floor to 2.11.0 per maintainer * requirements: pin torch to ==2.11.0 per project policy * requirements: pin mtmd torch and torchvision to 2.11.0/0.26.0 per project policy * requirements: suppress check_requirements pin warning on mtmd The check_requirements script flags '==' on lines in files matched by //requirements.txt. Append the documented suppression comment to the pinned torch and torchvision lines (and to the s390x platform marker lines) so the check passes while keeping the pins required by project policy. * ty: silence Tensor/Module union check on model[0].auto_model With torch 2.11.0 stubs, nn.Sequential.__getitem__ now returns Tensor | Module rather than Module, so model[0].auto_model fails ty on the SentenceTransformer code path. The runtime behavior is unchanged because SentenceTransformer always wraps a Module at index 0. Adding a targeted unresolved-attribute ignore keeps the type-check green without altering behavior. A follow-up issue tracks typing the variable explicitly.	2026-05-23 18:24:39 +02:00
Michael Wand	b0df4c0cfd	model : add NVFP4 MTP scale tensors (#23563 ) * Add NVFP4 MTP scale tensors * Link Qwen3.5 MTP tensors * Aligned nullptr	2026-05-23 13:30:31 +02:00
dskwe	a497476330	ggml : Check the right iface method before using the fallback 2d get (#23514 )	2026-05-23 12:49:24 +02:00
Jeff Bolz	95405ac65f	vulkan: fix windows find_package of SPIRV-Headers (#23215 ) * vulkan: fix windows find_package of SPIRV-Headers * not windows-only	2026-05-23 09:44:46 +02:00
Shawn Gu	0f3cb3fc8b	opencl: generalize Adreno MoE kernels on M (#23449 )	2026-05-22 17:08:41 -07:00
Aldehir Rojas	1acee6bf89	server: only parse empty msg if continuing an assistant msg (#23506 )	2026-05-22 11:58:15 -04:00
fairydreaming	ef570f6308	perplexity : fix integer overflow (#23496 ) Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>	2026-05-22 15:50:44 +03:00
Alexey Kopytko	cc9e331213	SYCL: improve MoE prefill throughput (#23142 ) - change `k_copy_src1_to_contiguous` so that it uses a precomputed contiguous mapping where all rows "owned" by an expert are in one slice with known start and end offsets - switch the `O(n_as * n_routed_rows)` contraption to a counting sort-based procedure with `O(n_as + n_routed_rows)` complexity	2026-05-22 15:50:17 +03:00
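
The last entry above (cc9e331213) replaces an `O(n_as * n_routed_rows)` grouping step with a counting sort. The following is only a minimal Python sketch of that general technique, not the SYCL code from the commit: the function name `group_rows_by_expert` and its arguments are hypothetical, and it assumes each routed row carries a single expert id in the range `[0, n_experts)`.

```python
def group_rows_by_expert(expert_ids, n_experts):
    """Group row indices by expert with a counting sort.

    Hypothetical helper for illustration only; runs in
    O(n_experts + len(expert_ids)) time.
    """
    # Pass 1: count how many rows each expert owns.
    counts = [0] * n_experts
    for e in expert_ids:
        counts[e] += 1

    # Exclusive prefix sum gives the start of each expert's slice.
    starts = [0] * n_experts
    total = 0
    for e in range(n_experts):
        starts[e] = total
        total += counts[e]
    ends = [starts[e] + counts[e] for e in range(n_experts)]

    # Pass 2: scatter row indices into their expert's slice.
    cursor = list(starts)
    mapping = [0] * len(expert_ids)
    for row, e in enumerate(expert_ids):
        mapping[cursor[e]] = row
        cursor[e] += 1

    return starts, ends, mapping


starts, ends, mapping = group_rows_by_expert([2, 0, 2, 1, 0], 3)
```

After the call, `mapping[starts[e]:ends[e]]` holds the rows owned by expert `e` in their original order, so each expert's rows form one contiguous slice.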

9340 commits