koboldcpp

mirror of https://github.com/LostRuins/koboldcpp.git synced 2025-09-11 01:24:36 +00:00

Author	SHA1	Message	Date
Lukas Straub	a9f77a8be3	server : add openai-style logit_bias support (#14946 ) Signed-off-by: Lukas Straub <lukasstraub2@web.de>	2025-07-31 14:08:23 +02:00
Aman Gupta	8a4a856277	Add LLaDA 8b Diffusion model (#14771 ) * Add support for Llada-8b: diffusion model * Add README * Fix README and convert_hf_to_gguf * convert_hf_to_gguf.py: address review comments * Make everything in a single example * Remove model-specific sampling * Remove unused argmax * Remove braced initializers, improve README.md a bit * Add diffusion specific gguf params in set_vocab, remove setting rope_theta and rms_norm_eps * Remove adding the mask token * Move add_add_bos_token to set_vocab * use add_bool in gguf_writer.py	2025-07-31 19:49:09 +08:00
hipudding	11490b3672	CANN: Improve loading efficiency after converting weights to NZ format. (#14985 ) * CANN: Improve loading efficiency after converting weights to NZ format. * CANN: fix typo	2025-07-31 19:47:20 +08:00
compilade	66625a59a5	graph : reduce splits for recurrent and hybrid models (#14825 ) * graph : avoid creating redundant s_copy views * graph : comment the s_copy views	2025-07-31 08:02:46 +03:00
lhez	6e6725459a	opencl: add `mul_mat_f32_f32_l4_lm` and `mul_mat_f16_f32_l4_lm` (#14809 )	2025-07-30 14:56:55 -07:00
Ed Addario	e9192bec56	quantize : fix using combined imatrix GGUFs (multiple datasets) (#14973 )	2025-07-30 21:11:56 +02:00
Daniel Bevenius	41e78c567e	server : add support for `embd_normalize` parameter (#14964 ) This commit adds support for the `embd_normalize` parameter in the server code. The motivation for this is that currently if the server is started with a pooling type that is not `none`, then Euclidean/L2 normalization will be the normalization method used for embeddings. However, this is not always the desired behavior, and users may want to use other normalization (or none) and this commit allows that. Example usage: ```console curl --request POST \ --url http://localhost:8080/embedding \ --header "Content-Type: application/json" \ --data '{"input": "Hello world today", "embd_normalize": -1}' ```	2025-07-30 18:07:11 +02:00
uvos	ad4a700117	HIP: enable mfma mmq on gfx908 and gfx90a for select datatypes and shapes (#14949 )	2025-07-30 17:38:06 +02:00
Georgi Gerganov	e32a4ec60e	sync : ggml ggml-ci	2025-07-30 17:33:11 +03:00
Kai Pastor	e228de9449	cmake : Fix BLAS link interface (ggml/1316)	2025-07-30 17:33:11 +03:00
Kai Pastor	73a8e5ca03	vulkan : fix 32-bit builds (ggml/1313) The pipeline member can be cast to VkPipeline. This is a VkPipeline_T* on 64 bit but a uint64_t on 32 bit. Cf. VK_DEFINE_NON_DISPATCHABLE_HANDLE documentation.	2025-07-30 17:33:11 +03:00
Johannes Gäßler	92b8810ec7	CUDA: skip masked KV slices for all FA kernels (#14924 )	2025-07-30 15:46:13 +02:00
Georgi Gerganov	00131d6eaf	tests : update for LLAMA_SET_ROWS=1 (#14961 ) * test-thread-safety : each context uses a single sequence * embedding : handle --parallel argument ggml-ci * save-load : handle -np 1 ggml-ci * thread-safety : avoid overriding threads, reduce test case arg ggml-ci	2025-07-30 15:12:02 +03:00
Georgi Gerganov	1e15bfd42c	graph : fix stack-use-after-return (#14960 ) ggml-ci	2025-07-30 13:52:11 +03:00
Douglas Hanley	a118d80233	embeddings: fix extraction of CLS pooling results (#14927 ) * embeddings: fix extraction of CLS pooling results * merge RANK pooling into CLS case for inputs	2025-07-30 08:25:05 +03:00
Xinpeng Dou	61550f8231	CANN: update ops docs (#14935 ) * CANN:add ops docs * CANN: update ops docs	2025-07-30 08:39:24 +08:00
uvos	aa79524c51	HIP: remove the use of __HIP_PLATFORM_AMD__, explicitly support only AMD targets (#14945 )	2025-07-29 20:23:04 +02:00
uvos	b77d11179d	HIP: add GGML_HIP_MMQ_MFMA option to allow disabling the MFMA path. (#14930 ) This is useful for testing for regressions on GCN with CDNA hardware. With GGML_HIP_MMQ_MFMA=Off and GGML_CUDA_FORCE_MMQ=On we can conveniently test the GCN code path on CDNA. As CDNA is just GCN renamed with MFMA added and limited use ACC registers, this provides a good alternative for regression testing when GCN hardware is not available.	2025-07-29 17:44:30 +02:00
uvos	c7aa1364fd	HIP: Ignore unsupported unroll transformation in fattn-vec (#14931 ) llvm with the amdgcn target does not support unrolling loops with conditional break statements, when those statements cannot be resolved at compile time. Similar to other places in GGML let's simply ignore this warning.	2025-07-29 17:43:43 +02:00
kallewoof	1a67fcc306	common : avoid logging partial messages (which can contain broken UTF-8 sequences) (#14937 ) * bug-fix: don't attempt to log partial parsed messages to avoid crash due to unfinished UTF-8 sequences	2025-07-29 17:05:38 +02:00
hipudding	204f2cf168	CANN: Add ggml_set_rows (#14943 )	2025-07-29 22:36:43 +08:00
Sigbjørn Skjæret	138b288b59	cuda : add softcap fusion (#14907 )	2025-07-29 14:22:03 +02:00
Johannes Gäßler	bbd0f91779	server-bench: make seed choice configurable (#14929 ) * server-bench: make seed choice configurable * Update scripts/server-bench.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update scripts/server-bench.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * fix error formatting * Update scripts/server-bench.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2025-07-29 10:40:50 +02:00
Aman Gupta	0a5036bee9	CUDA: add roll (#14919 ) * CUDA: add roll * Make everything const, use __restrict__	2025-07-29 14:45:18 +08:00
lhez	8ad7b3e65b	opencl : add ops docs (#14910 )	2025-07-28 18:50:17 +02:00
Leonard Mosescu	bda62193b2	test-backend-ops : extend test case filtering (#14865 ) * Extend test case filtering 1. Allow passing multiple (comma-separated?) ops to test-backend-ops. This can be convenient when working on a set of ops, when you'd want to test them together (but without having to run every single op). For example: `test-backend-ops.exe test -o "ADD,RMS_NORM,ROPE,SILU,SOFT_MAX"` 2. Support full test-case variation string in addition to basic op names. This would make it easy to select a single variation, either for testing or for benchmarking. It can be particularly useful for profiling a particular variation (ex. a CUDA kernel), for example: `test-backend-ops.exe perf -b CUDA0 -o "MUL_MAT(type_a=f16,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=2)"` These two can be combined. As the current `-o`, this change doesn't try to detect/report an error if a filter doesn't name existing ops (ex. misspelled) * Updating the usage help text * Update tests/test-backend-ops.cpp	2025-07-28 18:04:27 +02:00
Radoslav Gerganov	c556418b60	llama-bench : use local GPUs along with RPC servers (#14917 ) Currently if RPC servers are specified with '--rpc' and there is a local GPU available (e.g. CUDA), the benchmark will be performed only on the RPC device(s) but the backend result column will say "CUDA,RPC" which is incorrect. This patch is adding all local GPU devices and makes llama-bench consistent with llama-cli.	2025-07-28 18:59:04 +03:00
xctan	db16e2831c	ggml-cpu : deduplicate scalar implementations (#14897 ) * remove redundant code in riscv * remove redundant code in arm * remove redundant code in loongarch * remove redundant code in ppc * remove redundant code in s390 * remove redundant code in wasm * remove redundant code in x86 * remove fallback headers * fix x86 ggml_vec_dot_q8_0_q8_0	2025-07-28 17:40:24 +02:00
Akarshan Biswas	cd1fce6d4f	SYCL: Add set_rows support for quantized types (#14883 ) * SYCL: Add set_rows support for quantized types This commit adds support for GGML_OP_SET_ROWS operation for various quantized tensor types (Q8_0, Q5_1, Q5_0, Q4_1, Q4_0, IQ4_NL) and BF16 type in the SYCL backend. The quantization/dequantization copy kernels were moved from cpy.cpp to cpy.hpp to make them available for set_rows.cpp. This addresses part of the TODOs mentioned in the code. * Use get_global_linear_id() instead ggml-ci * Fix formatting ggml-ci * Use const for ne11 and size_t variables in set_rows_sycl_q ggml-ci * Increase block size for q kernel to 256 ggml-ci * Cleanup imports * Add float.h to cpy.hpp	2025-07-28 20:32:15 +05:30
Xuan-Son Nguyen	00fa15fedc	mtmd : add support for Voxtral (#14862 ) * mtmd : add support for Voxtral * clean up * fix python requirements * add [BEGIN_AUDIO] token * also support Devstral conversion * add docs and tests * fix regression for ultravox * minor coding style improvement * correct project activation fn * Apply suggestions from code review Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2025-07-28 15:01:48 +02:00
Johannes Gäßler	946b1f6859	CUDA: fix pointer incrementation in FA (#14916 )	2025-07-28 14:30:22 +02:00
Dongliang Wei	6c6e397aff	model : add support for SmallThinker series (#14898 ) * support smallthinker * support 20b softmax, 4b no sliding window * new build_moe_ffn_from_probs, and can run 4b * fix 4b rope bug * fix python type check * remove is_moe judge * remove set_dense_start_swa_pattern function and modify set_swa_pattern function * trim trailing whitespace * remove get_vocab_base of SmallThinkerModel in convert_hf_to_gguf.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * better whitespace Apply suggestions from code review Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * use GGML_ASSERT for expert count validation Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Improve null pointer check for probs Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * use template parameter for SWA attention logic * better whitespace Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * move the creation of inp_out_ids before the layer loop * remove redundant judge for probs --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-07-28 13:47:00 +02:00
Alberto Cabrera Pérez	afc0e89698	sycl: refactor quantization to q8_1 (#14815 ) * sycl: quantization to q8_1 refactor * Refactored src1 copy logic in op_mul_mat	2025-07-28 11:05:53 +01:00
Georgi Gerganov	a5771c9eea	ops : update BLAS (#14914 )	2025-07-28 10:01:03 +02:00
Georgi Gerganov	c35f9eaf09	ops : update Metal (#14912 )	2025-07-28 08:22:56 +03:00
Georgi Gerganov	1f45f2890e	sync : ggml	2025-07-28 08:15:01 +03:00
Kai Pastor	613c5095c3	cmake : Indent ggml-config.cmake (ggml/1310)	2025-07-28 08:15:01 +03:00
Ed Addario	7f97599581	quantize : update README.md (#14905 ) * Update README.md * Fix trailing whitespace * Update README.md Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2025-07-27 23:31:11 +02:00
Ruben Ortlam	bf78f5439e	vulkan: add ops docs (#14900 )	2025-07-27 15:33:08 +02:00
Akarshan Biswas	bbfc849274	SYCL: add ops doc (#14901 )	2025-07-27 17:52:58 +05:30
Daniel Bevenius	ca0ef2dddb	llama : clarify comment about pp and tg graphs [no ci] (#14895 ) * llama : clarify comment about pp and tg graphs [no ci] This commit clarifies the comment in `llama-context.cpp` regarding the prefill prompt (pp), and token generation (tg) graphs. The motivation for this is that I've struggled to remember these and had to look them up more than once, so I thought it would be helpful to add a comment that makes it clear what these stand for. * squash! llama : clarify comment about pp and tg graphs [no ci] Change "pp" to "prompt processing".	2025-07-27 12:10:51 +02:00
Erik Scholz	89d1029559	vulkan : add fp16 support for the conv_2d kernel (#14872 ) * add f16 to conv_2d testing * weaken conv2d test error threshold	2025-07-27 12:04:33 +02:00
Jeff Bolz	f1a4e72de5	vulkan: skip empty set_rows to avoid invalid API usage (#14860 )	2025-07-27 11:05:34 +02:00
Gabriel Larson	4762ad7316	model : make rope_yarn_log_mul optional for deepseek2 (#14896 ) * make rope_yarn_log_mul optional for deepseek2 * default rope_yarn_log_mul = 0.0f	2025-07-27 11:18:37 +03:00
Shunta Saito	1dc9614e06	llama : fix kq_scale for the attention layers of PLaMo2 (#14892 ) * Fix dimensions for expand * Change dimensions to copy states to cache * Fix the default value for plamo2 conversion * Fix scale given to build_attn * Update src/llama-model.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update src/llama-model.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update src/llama-model.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2025-07-27 09:38:44 +02:00
Aman Gupta	446595b9b3	Docs: add instructions for adding backends (#14889 )	2025-07-27 09:36:43 +08:00
deepsek	66906cd82a	HIP: Enable Matrix cores for MMQ Kernels, Enable stream-K for CDNA 3 (#14624 ) This commit adds support for MFMA instructions to MMQ. CDNA1/GFX908 CDNA2/GFX90a and CDNA3/GFX942 are supported by the MFMA-enabled code path added by this commit. The code path and stream-k is only enabled on CDNA3 for now as it fails to outperform blas in all cases on the other devices. Blas is currently only consistently outperformed on CDNA3 due to issues in the amd-provided blas libraries. This commit also improves the awareness of MMQ towards different warp sizes and as a side effect improves the performance of all quant formats besides q4_0 and q4_1, which regress slightly, on GCN gpus.	2025-07-27 00:28:14 +02:00
hipudding	11dd5a44eb	CANN: Implement GLU ops (#14884 ) Implement REGLU, GEGLU, SWIGLU ops according to #14158	2025-07-26 17:56:18 +08:00
R0CKSTAR	9b8f3c6c77	musa: fix build warnings (unused variable) (#14869 ) Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>	2025-07-26 10:36:02 +08:00
Aaron Teo	c7f3169cd5	ggml-cpu : disable GGML_NNPA by default due to instability (#14880 ) * docs: update s390x document for sentencepiece Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> (cherry picked from commit e086c5e3a7ab3463d8e0906efcfa39352db0a48d) * docs: update huggingface links + reword Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> (cherry picked from commit 8410b085ea8c46e22be38266147a1e94757ef108) * ggml-cpu: disable ggml-nnpa compile flag by default fixes #14877 Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> (cherry picked from commit 412f4c7c88894b8f55846b4719c76892a23cfe09) * docs: update s390x build docs to reflect nnpa disable Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> (cherry picked from commit c1eeae1d0c2edc74ab9fbeff2707b0d357cf0b4d) --------- Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>	2025-07-25 19:09:03 +02:00


6043 commits