koboldcpp

mirror of https://github.com/LostRuins/koboldcpp.git synced 2026-05-08 09:59:50 +00:00

Author	SHA1	Message	Date
Concedo	5e667659ec	Merge commit '`0fc16b42e8`' into concedo_experimental # Conflicts: # src/CMakeLists.txt # src/llama-kv-cache.cpp	2025-06-02 23:14:23 +08:00
Concedo	6ce85c54d6	not working correctly	2025-06-02 22:12:10 +08:00
Concedo	c7d42a5a07	updated lite	2025-06-02 17:28:36 +08:00
Concedo	8e1ebc55b5	dropped support for lora base as upstream no longer uses it. If provided it will be silently ignored	2025-06-02 12:49:53 +08:00
Concedo	51dc1cf920	added scale for text lora	2025-06-02 00:13:42 +08:00
Concedo	74ef097c4a	added ability to set koboldcpp as default handler for gguf and kcpps	2025-06-01 22:36:41 +08:00
Georgi Gerganov	0fc16b42e8	kv-cache : split implementation in separate sources (#13920 ) ggml-ci	2025-06-01 11:39:27 +03:00
Concedo	2142d6ba68	Updated Colab to use internal downloader fixed model command (+1 squashed commits) Squashed commits: [a4d8fd9f1] tryout new colab (+1 squashed commits) Squashed commits: [c97333d44] tryout new colab	2025-06-01 11:41:45 +08:00
Concedo	f3bb947a13	cuda use wmma flash attention for turing (+1 squashed commits) Squashed commits: [3c5112398] 117 (+10 squashed commit) Squashed commit: [4f01bb2d4] 117 graphs 80v [7549034ea] 117 graphs [dabf9cb99] checking if cuda 11.5.2 works [ba7ccdb7a] another try cu11.7 only [752cf2ae5] increase aria2c download log rate [dc4f198fd] test send turing to wmma flash attention [496a22e83] temp build test cu11.7.0 [ca759c424] temp build test cu11.7 [c46ada17c] test build: enable virtual80 for oldcpu [3ccfd939a] test build: with cuda graphs for all	2025-06-01 11:41:45 +08:00
Max Krasnyansky	053b1539c0	threading: support for GGML_SCHED_PRIO_LOW, update thread info on Windows to avoid throttling (#12995 ) * threading: support for GGML_SCHED_PRIO_LOW, update thread info on Windows to avoid throttling We talked about adding LOW priority for GGML threads in the original threadpool PR. It might be useful for some cases to avoid contention. Latest Windows ARM64 releases started parking (offlining) the CPU cores more aggresively which results in suboptimal performance with n_threads > 4. To deal with that we now disable Power Throttling for our threads for the NORMAL and higher priorities. Co-authored-by: Diego Devesa <slarengh@gmail.com> * threading: disable SetThreadInfo() calls for older Windows versions * Update tools/llama-bench/llama-bench.cpp Co-authored-by: Diego Devesa <slarengh@gmail.com> --------- Co-authored-by: Diego Devesa <slarengh@gmail.com>	2025-05-31 15:39:19 -07:00
Jiří Podivín	b3a89c3d9e	docs : Note about necessity of having libcurl installed for standard build. (#13945 ) Signed-off-by: Jiri Podivin <jpodivin@gmail.com>	2025-05-31 18:58:35 +02:00
Olivier Chafik	e15898d1c7	server: allow unclosed thinking tags (#13931 )	2025-05-31 08:26:10 -07:00
Georgi Gerganov	803f8baf4f	llama : deprecate explicit kv_self defrag/update calls (#13921 ) ggml-ci	2025-05-31 15:58:33 +03:00
Georgi Gerganov	3600cc2886	llama : use n_swa + n_ubatch cells for SWA cache (#13833 ) * llama : use n_swa + n_ubatch cells for SWA cache ggml-ci * llama : add warning about multi-sqeuence SWA contexts	2025-05-31 15:57:44 +03:00
igardev	c7e0a2054b	webui : Replace alert and confirm with custom modals. (#13711 ) * Replace alert and confirm with custom modals. This is needed as Webview in VS Code doesn't permit alert and confirm for security reasons. * use Modal Provider to simplify the use of confirm and alert modals. * Increase the z index of the modal dialogs. * Update index.html.gz * also add showPrompt * rebuild --------- Co-authored-by: igardev <ivailo.gardev@akros.ch> Co-authored-by: Xuan Son Nguyen <son@huggingface.co>	2025-05-31 11:56:08 +02:00
Georgi Gerganov	3f55f781f1	llama : auto-batch preparation (#13845 ) * llama : auto-batch ggml-ci * context : simplify if branching	2025-05-31 12:55:57 +03:00
Xuan-Son Nguyen	51fa76f172	mtmd : drop `_shared` from `libmtmd` name, merge helpers into libmtmd (⚠️ breaking change) (#13917 ) * mtmd : fix missing public header * no object * apply suggestion from Georgi * rm mtmd-helper, merge it to mtmd * missing vendor include dir	2025-05-31 10:14:29 +02:00
Georgi Gerganov	12d0188c0d	kv-cache : refactor + add llama_memory_state_i (#13746 ) * kv-cache : simplify the "struct llama_kv_cache" interface ggml-ci * kv-cache : revert the (n_swa + n_ubatch) change (for next PR) ggml-ci * kv-cache : some comments ggml-ci * context : fix graph reserve for multiple sequences ggml-ci * kv-cache : fix typo [no ci] * kv-cache : fix find_slot() logic for free slots ggml-ci * llama : add TODO for deprecating the defrag API in the future * kv-cache : improve find_slot() using min/max seq pos info ggml-ci * llama : handle aborts and compute errors ggml-ci * memory : extract state into llama_memory_state ggml-ci * kv-cache : add comments ggml-ci * server : update batching logic to reset n_batch on successful decode * server : upon full re-processing, remove the sequence from the cache * kv-cache : add TODO for doing split_equal when split_simple fails ggml-ci	2025-05-31 10:24:04 +03:00
Shawn yang	eb3949938e	CUDA: add a prop in ggml_cuda_device_infor for distinguish iGPU or dGPU in cuda (#13856 ) (#13895 ) * 1. add "integrated" in ggml_cuda_device_info for distinguish whether it is Intergrate_gpu or discrete_gpu 2. Adjust the func:"ggml_backend_cuda_device_supports_buft" for this new feature * Update ggml/src/ggml-cuda/ggml-cuda.cu Adjusted code indentation Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * Update ggml/src/ggml-cuda/ggml-cuda.cu Fixed incorrect setting of variable types Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * Update ggml/src/ggml-cuda/ggml-cuda.cu Adjusted the judgment logic Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * add a host_buft assert in case of integrated_cuda_device with func:'evaluate_and_capture_cuda_graph()' * Update ggml/src/ggml-cuda/ggml-cuda.cu Add a defensive security assert Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * Update ggml/src/ggml-cuda/ggml-cuda.cu Adjusted the support judgment logic. Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * revoke the suggest commit changes due to it's not applicable in jetson_device * Update ggml/src/ggml-cuda/ggml-cuda.cu Add parentheses to enforce operator precedence Co-authored-by: Diego Devesa <slarengh@gmail.com> * Update ggml/src/ggml-cuda/ggml-cuda.cu Fix ci bug: add a spaces Co-authored-by: Johannes Gäßler <johannesg@5d6.de> --------- Co-authored-by: yangxiao <yang_xl@tju.edu.cn> Co-authored-by: Johannes Gäßler <johannesg@5d6.de> Co-authored-by: yangxiao <yangxl_zz@qq.com> Co-authored-by: Diego Devesa <slarengh@gmail.com>	2025-05-31 08:48:04 +02:00
Concedo	b08dca65ed	Merge branch 'upstream' into concedo_experimental # Conflicts: # common/CMakeLists.txt # common/arg.cpp # common/chat.cpp # examples/parallel/README.md # examples/parallel/parallel.cpp # ggml/cmake/common.cmake # ggml/src/CMakeLists.txt # ggml/src/ggml-cpu/CMakeLists.txt # ggml/src/ggml-sycl/ggml-sycl.cpp # ggml/src/ggml-sycl/rope.cpp # models/ggml-vocab-bert-bge.gguf.inp # models/ggml-vocab-bert-bge.gguf.out # models/ggml-vocab-command-r.gguf.inp # models/ggml-vocab-command-r.gguf.out # models/ggml-vocab-deepseek-coder.gguf.inp # models/ggml-vocab-deepseek-coder.gguf.out # models/ggml-vocab-deepseek-llm.gguf.inp # models/ggml-vocab-deepseek-llm.gguf.out # models/ggml-vocab-falcon.gguf.inp # models/ggml-vocab-falcon.gguf.out # models/ggml-vocab-gpt-2.gguf.inp # models/ggml-vocab-gpt-2.gguf.out # models/ggml-vocab-llama-bpe.gguf.inp # models/ggml-vocab-llama-bpe.gguf.out # models/ggml-vocab-llama-spm.gguf.inp # models/ggml-vocab-llama-spm.gguf.out # models/ggml-vocab-mpt.gguf.inp # models/ggml-vocab-mpt.gguf.out # models/ggml-vocab-phi-3.gguf.inp # models/ggml-vocab-phi-3.gguf.out # models/ggml-vocab-qwen2.gguf.inp # models/ggml-vocab-qwen2.gguf.out # models/ggml-vocab-refact.gguf.inp # models/ggml-vocab-refact.gguf.out # models/ggml-vocab-starcoder.gguf.inp # models/ggml-vocab-starcoder.gguf.out # requirements/requirements-gguf_editor_gui.txt # tests/CMakeLists.txt # tests/test-chat.cpp # tests/test-grammar-integration.cpp # tests/test-json-schema-to-grammar.cpp # tools/mtmd/CMakeLists.txt # tools/run/run.cpp # tools/server/CMakeLists.txt	2025-05-31 13:04:21 +08:00
Concedo	c987abf9f5	Merge commit '`763d06edb7`' into concedo_experimental # Conflicts: # .github/workflows/build-linux-cross.yml # ggml/CMakeLists.txt # ggml/src/ggml-cann/CMakeLists.txt # ggml/src/ggml-opencl/CMakeLists.txt # ggml/src/ggml-opencl/ggml-opencl.cpp # ggml/src/ggml-vulkan/CMakeLists.txt # tools/mtmd/CMakeLists.txt # tools/mtmd/clip.cpp # tools/mtmd/mtmd.cpp # tools/server/CMakeLists.txt	2025-05-31 12:44:18 +08:00
Concedo	0c108f6054	Merge commit '`34b7c0439e`' into concedo_experimental # Conflicts: # ggml/CMakeLists.txt # ggml/src/ggml-cpu/CMakeLists.txt # ggml/src/ggml-sycl/element_wise.cpp # ggml/src/ggml-sycl/element_wise.hpp # ggml/src/ggml-sycl/ggml-sycl.cpp # scripts/sync-ggml.last # src/CMakeLists.txt # tools/mtmd/clip.cpp	2025-05-31 12:27:45 +08:00
Concedo	c923e9fe46	added option to unload model from admin control	2025-05-31 11:51:09 +08:00
Concedo	08e0745e7e	added singleinstance flag and local shutdown api	2025-05-31 11:37:32 +08:00
Wagner Bruna	12f99ba907	fix: workaround for default clip_skip issues (#1572 ) Sets the clip_skip value explicitly to 1 or 2 for all generation requests, aligning with the tests in the Conditioner objects in conditioner.hpp. This should fix #1546 regardless of future changes to the default behavior of sdcpp. This workaround can be removed once a proper fix is implemented in sdcpp.	2025-05-31 10:36:30 +08:00
Johannes Gäßler	e562eece7c	CUDA: fix typo in FlashAttention code (#13926 )	2025-05-30 21:22:03 +02:00
Concedo	3829ae728e	attempt to debug	2025-05-31 01:02:30 +08:00
Diego Devesa	b47ab7b8e9	sched : avoid changing cur_copy when a graph is already allocated (#13922 )	2025-05-30 18:56:19 +02:00
Georgi Gerganov	dd665cc9d4	parallel : increase the variability of the prompt lengths (#13927 ) ggml-ci	2025-05-30 19:38:07 +03:00
Diego Devesa	df0c0c7d02	cuda : prevent using split buffers with 3d/4d matrices (#13919 )	2025-05-30 16:37:18 +02:00
Akarshan Biswas	b49a8ff96b	SYCL: Add mrope kernel (#13755 ) * SYCL: Add mrope kernel * feat: Optimize rope operations with vectorization Uses `sycl::vec` to load and store two elements at a time, significantly improving performance in `rope_norm`, `rope_neox`, and `rope_multi`. This reduces the number of memory accesses and leverages SIMD instructions for faster execution. * Use ceil_div	2025-05-30 19:40:57 +05:30
Georgi Gerganov	53f925074d	sync : vendor (#13901 ) * sync : vendor ggml-ci * cont : fix httplib version ggml-ci * cont : fix lint * cont : fix lint * vendor : move to common folder /vendor ggml-ci * cont : fix lint * cont : move httplib to /vendor + use json_fwd.hpp ggml-ci * cont : fix server build ggml-ci * cont : add missing headers ggml-ci * cont : header clean-up ggml-ci	2025-05-30 16:25:45 +03:00
Sigbjørn Skjæret	db38704f01	convert : fix rwkv bos/eos token (#13844 )	2025-05-30 14:50:43 +02:00
Xuan-Son Nguyen	07e4351ce6	convert : allow partial update to the chkhsh pre-tokenizer list (#13847 ) * convert : allow partial update to the chkhsh pre-tokenizer list * code style * update tokenizer out * rm inp/out files for models not having gguf * fixed hash for glm * skip nomic-bert-moe test * Update convert_hf_to_gguf_update.py * fix minerva-7b hash * rm redundant import	2025-05-30 12:24:37 +02:00
Đinh Trọng Huy	291f2b6913	llama : add support for DistilBert (#13907 ) * add distilbert * small fixes * add note for LLM_ARCH_DISTIL_BERT * Use MODEL_ARCH.BERT for DistilBert --------- Co-authored-by: dinhhuy <huy.dinh@brains-tech.co.jp>	2025-05-30 11:56:02 +02:00
zhangkaihuo	2c90da4c7e	llama : use llm_build_granite for minicpm (#13911 )	2025-05-30 10:31:48 +02:00
Concedo	6529326c59	allow temperatures up to 1.0 when function calling	2025-05-30 15:59:18 +08:00
Concedo	d99f362513	added 2 more readme images	2025-05-30 14:34:20 +08:00
Concedo	fe401ca4c2	fixed a typo	2025-05-30 13:35:42 +08:00
Concedo	a11ab0b08e	reverse clip skip fix as it might be breaking some sdxl models	2025-05-30 10:40:03 +08:00
Christian Kastner	ec9e0301fe	cmake: Guard GGML_CPU_ALL_VARIANTS by architecture (#13890 )	2025-05-30 01:28:54 +02:00
Sigbjørn Skjæret	e83ba3e460	llama : add support for jina-reranker-v2 (#13900 )	2025-05-29 21:42:31 +02:00
Concedo	2a309c144d	updated lite	2025-05-30 00:29:46 +08:00
Concedo	c881bb7348	match a few common oai voices	2025-05-29 23:29:17 +08:00
Sigbjørn Skjæret	2b131621e6	gguf-py : add support for sub_type (in arrays) in GGUFWriter add_key_value method (#13561 )	2025-05-29 15:36:05 +02:00
Yibo Cai	54a2c7a8cd	arm64: optimize q4_k_q8_k kernel with i8mm (#13886 ) This PR improves q4_k_q8_k gemm kernel with arm64 i8mm instruction. Tested on neoverse-n2 with llama3 8b q4_k_m quantization model. - 34% ~ 50% S_PP uplift for all batch sizes - 12% ~ 37% S_TG uplift for batch size 4 and above Perplexity doesn't change with this PR. ``` // tested on neoverse-n2 $ llama-batched-bench \ -m Meta-Llama-3-8B-Instruct-Q4_K_M.gguf \ --no-mmap -fa \ -c 8192 -b 4096 -ub 512 -npp 128 -ntg 128 \ -npl 1,2,4,8,16,32 \ -t 64 --------------------------------------------------------------------- \| PP \| TG \| B \| S_PP t/s \| S_TG t/s \| \| \| \| \| original \| this pr \| original \| this pr \| \|-------\|--------\|------\|----------\|----------\|----------\|----------\| \| 128 \| 128 \| 1 \| 110.12 \| 147.83 \| 24.36 \| 24.28 \| \| 128 \| 128 \| 2 \| 121.16 \| 172.42 \| 46.36 \| 47.93 \| \| 128 \| 128 \| 4 \| 120.15 \| 169.75 \| 74.68 \| 84.00 \| \| 128 \| 128 \| 8 \| 130.97 \| 196.81 \| 91.04 \| 114.74 \| \| 128 \| 128 \| 16 \| 131.01 \| 196.88 \| 101.43 \| 135.79 \| \| 128 \| 128 \| 32 \| 130.85 \| 196.51 \| 106.97 \| 147.29 \| --------------------------------------------------------------------- ```	2025-05-29 14:39:20 +03:00
Christian Kastner	21fcc21ad5	cmake: Factor out CPU architecture detection (#13883 ) * cmake: Define function for querying architecture The tests and results match exactly those of ggml/src/CMakeLists.txt * Switch arch detection over to new function	2025-05-29 12:50:25 +02:00
Vineel Abhinav	dd8ba93416	ggml: aarch64: Implement SVE F32 kernels for Mamba Sequential Scan Algorithm (#13882 ) * F32-Mamba-Seq_Scan-SVE * Fix formatting * ggml : missing space --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-05-29 12:18:43 +03:00
Georgi Gerganov	66c92061f5	tests : remove json.hpp from a test (#13880 ) ggml-ci	2025-05-29 12:17:16 +03:00
Sigbjørn Skjæret	5ca82fc1d7	convert : workaround for AutoConfig dummy labels (#13881 )	2025-05-29 10:00:57 +02:00

1 2 3 4 5 ...

8267 commits