koboldcpp

mirror of https://github.com/LostRuins/koboldcpp.git synced 2025-09-12 18:09:42 +00:00

Author	SHA1	Message	Date
wangshuai09	6e2b6000e5	cann: update cmake (#8765 )	2024-07-30 12:37:35 +02:00
zhentaoyu	c887d8b017	[SYCL] Add `TIMESTEP_EMBEDDING` OP (#8707 ) Signed-off-by: zhentaoyu <zhentao.yu@intel.com>	2024-07-30 14:56:51 +08:00
CarterLi999	75af08c475	ggml: bugfix: fix the inactive elements is agnostic for risc-v vector (#8748 ) In these codes, we want to retain the value that they previously held when mask[i] is false. So we should use undisturbed. With the default agnostic policy of rvv intrinsic, these values can be held or be written with 1s. Co-authored-by: carter.li <carter.li@starfivetech.com>	2024-07-29 18:38:34 +02:00
R0CKSTAR	439b3fc75a	cuda : organize vendor-specific headers into vendors directory (#8746 ) Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>	2024-07-29 14:56:12 +02:00
Concedo	102eec3d22	more bugfixes in auto gpu layers selection	2024-07-29 20:38:24 +08:00
Llama	26f1df5e5f	Fix the penultimate token sometimes being lost with SSE streaming (#1031 ) The token immediately before an eot token was lost when SSE streaming was enabled if that token was contained entirely within a stop sequence. As an example of when this could happen, consider this prompt: Type the phrase 'pleas' once. In a Llama 3-derived model, 'pleas' tokenizes as 'ple' 'as'. The token 'as' is contained within this instruct mode stop sequence: <\|eot_id\|><\|start_header_id\|>assistant<\|end_header_id\|> due to the word 'assistant'. Since `string_contains_sequence_substring` returns True for 'as', this token is added to `tokenReserve` instead of being streamed immediately. If the '<\|eot_id\|>' token was generated next, the text in `tokenReserve` would be discarded.	2024-07-29 20:16:47 +08:00
Concedo	948646ff7a	do not offload if auto layers is less than 2, as its usually slower	2024-07-29 20:13:43 +08:00
Concedo	e39b8aab8b	improvements to auto layer calcs	2024-07-29 18:51:10 +08:00
Meng, Hengyu	0832de7236	[SYCL] add conv support (#8688 )	2024-07-29 10:50:27 +08:00
Johannes Gäßler	6eeaeba126	cmake: use 1 more thread for non-ggml in CI (#8740 )	2024-07-28 22:32:44 +02:00
Concedo	f289fb494a	bump size of some payload arr sequences from 16 to 24	2024-07-28 20:29:39 +08:00
Concedo	e47477fd4d	don't build rope factors from https://github.com/ggerganov/llama.cpp/pull/8676 for CLBlast as it segfaults	2024-07-28 17:27:09 +08:00
Concedo	edbdfbced2	Revert "cu11 build threads" This reverts commit c3aa259907a77b19bb5c94015de61b8178b9d283. (+2 squashed commit) Squashed commit: [bf2f7e7c] missing include [c3aa2599] cu11 build threads	2024-07-28 16:46:10 +08:00
Austin	4730faca61	chore : Fix vulkan related compiler warnings, add help text, improve CLI options (#8477 ) * chore: Fix compiler warnings, add help text, improve CLI options * Add prototypes for function definitions * Invert logic of --no-clean option to be more intuitive * Provide a new help prompt with clear instructions * chore : Add ignore rule for vulkan shader generator Signed-off-by: teleprint-me <77757836+teleprint-me@users.noreply.github.com> * Update ggml/src/vulkan-shaders/vulkan-shaders-gen.cpp Co-authored-by: 0cc4m <picard12@live.de> * chore : Remove void and apply C++ style empty parameters * chore : Remove void and apply C++ style empty parameters --------- Signed-off-by: teleprint-me <77757836+teleprint-me@users.noreply.github.com> Co-authored-by: 0cc4m <picard12@live.de>	2024-07-28 09:52:42 +02:00
compilade	4c676c85e5	llama : refactor session file management (#8699 ) * llama : refactor session file management * llama : saving and restoring state checks for overflow The size of the buffers should now be given to the functions working with them, otherwise a truncated file could cause out of bound reads. * llama : stream from session file instead of copying into a big buffer Loading session files should no longer cause a memory usage spike. * llama : llama_state_get_size returns the actual size instead of max This is a breaking change, but makes that function much easier to keep up to date, and it also makes it reflect the behavior of llama_state_seq_get_size. * llama : share code between whole and seq_id-specific state saving Both session file types now use a more similar format. * llama : no longer store all hparams in session files Instead, the model arch name is stored. The layer count and the embedding dimensions of the KV cache are still verified when loading. Storing all the hparams is not necessary. * llama : fix uint64_t format type * llama : various integer type cast and format string fixes Some platforms use "%lu" and others "%llu" for uint64_t. Not sure how to handle that, so casting to size_t when displaying errors. * llama : remove _context suffix for llama_data_context * llama : fix session file loading llama_state_get_size cannot be used to get the max size anymore. * llama : more graceful error handling of invalid session files * llama : remove LLAMA_MAX_RNG_STATE It's no longer necessary to limit the size of the RNG state, because the max size of session files is not estimated anymore. * llama : cast seq_id in comparison with unsigned n_seq_max	2024-07-28 00:42:05 -04:00
Concedo	0029e36f50	fix for older phi3 models without swa	2024-07-28 12:13:38 +08:00
Concedo	01afb28a63	not working	2024-07-28 11:43:10 +08:00
R0CKSTAR	e54c35e4fb	feat: Support Moore Threads GPU (#8383 ) * Update doc for MUSA Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> * Add GGML_MUSA in Makefile Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> * Add GGML_MUSA in CMake Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> * CUDA => MUSA Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> * MUSA adds support for __vsubss4 Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> * Fix CI build failure Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> --------- Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>	2024-07-28 01:41:25 +02:00
Concedo	ba5babb876	Merge branch 'upstream' into concedo_experimental # Conflicts: # .devops/nix/apps.nix # .devops/tools.sh # Makefile # README.md # docs/backend/SYCL.md # docs/build.md # examples/CMakeLists.txt # ggml/include/ggml.h # src/llama-vocab.cpp # tests/test-backend-ops.cpp # tests/test-chat-template.cpp # tests/test-sampling.cpp	2024-07-27 23:15:54 +08:00
Georgi Gerganov	5e2727fe03	scripts : sync vulkan-shaders (#0 )	2024-07-27 18:08:47 +03:00
Georgi Gerganov	56f20aa25d	scripts : sync ggml-aarch64 sources	2024-07-27 18:07:33 +03:00
Georgi Gerganov	345c8c0c87	ggml : add missing semicolon (#0 ) ggml-ci	2024-07-27 17:43:44 +03:00
Georgi Gerganov	ae7985cd7b	sync : ggml ggml-ci	2024-07-27 17:43:44 +03:00
Mahesh Madhav	a05ca93697	ggml : loop tiling optimizations for scalar path (ggml/898) Apply a loop tiling technique to the generic path, which provides performance upside for ISAs with enough registers to take advantage of it. Also helps the compiler optimize this path.	2024-07-27 17:43:44 +03:00
Ivan Filipov	9f77d899b7	ggml: add support for float16 input tensors in pooling operations (ggml/895) * Add support for float16 tensors in 1d pooling operations * Add support for float16 input tensors in 2d pooling operations * code cleanup remove unnecessary casting during srow ptr initialization --------- Co-authored-by: vanaka11 <vanaka1189@gmail.com>	2024-07-27 17:43:44 +03:00
Tony Wasserka	203b7f1531	vulkan : initialize vk_buffer_struct members to VK_NULL_HANDLE (ggml/893) This prevents invalid frees when destroying a partially initialized vk_buffer_struct. For example, this could happen in ggml_vk_create_buffer when running out of device memory. Co-authored-by: Tony Wasserka <neobrain@users.noreply.github.com>	2024-07-27 17:43:44 +03:00
Borislav Stanimirov	d2b851bfa1	cmake : only enable GGML_NATIVE and x86 flags if not crosscompiling (ggml/885)	2024-07-27 17:43:44 +03:00
Daniel Bevenius	c12b6e8ee7	ggml : remove unnecessary UNUSED macro call (ggml/880) This commit removes an UNUSED macro call that is not needed as the variable n0 is used in the code and will not produce a warning. Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>	2024-07-27 17:43:44 +03:00
Concedo	eaa702852d	increased padding, it is still way too little but whatever	2024-07-27 22:32:13 +08:00
Jeffrey Morgan	b5e95468b1	llama : add support for llama 3.1 rope scaling factors (#8676 ) * Add llama 3.1 rope scaling factors to llama conversion and inference This commit generates the rope factors on conversion and adds them to the resulting model as a tensor. At inference time, these factors are passed to the `ggml_rope_ext` rope oepration, improving results for context windows above 8192 * Update convert_hf_to_gguf.py Co-authored-by: compilade <git@compilade.net> * address comments * address comments * Update src/llama.cpp Co-authored-by: compilade <git@compilade.net> * Update convert_hf_to_gguf.py Co-authored-by: compilade <git@compilade.net> --------- Co-authored-by: compilade <git@compilade.net>	2024-07-27 15:03:45 +03:00
Georgi Gerganov	92090eca21	llama : add function for model-based max number of graph nodes (#8622 ) * llama : model-based max number of graph nodes ggml-ci * llama : disable 405B max_nodes path due to lack of complaints ggml-ci	2024-07-27 14:59:29 +03:00
Daniel Bevenius	9d03d085dd	common : add --no-warmup option for main/llama-cli (#8712 ) This commit adds a --no-warmup option for llama-cli. The motivation for this is that it can be convenient to skip the warmup llama_decode call when debugging. Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>	2024-07-27 13:45:02 +03:00
wangshuai09	bfb4c74981	cann: Fix Multi-NPU execution error (#8710 ) * cann: fix multi-npu exec error * cann: update comment for ggml_backend_cann_supports_buft	2024-07-27 16:36:44 +08:00
Concedo	729eb1e552	no fast forward for empty prompt	2024-07-27 16:29:35 +08:00
slaren	2b1f616b20	ggml : reduce hash table reset cost (#8698 ) * ggml : reduce hash table reset cost * fix unreachable code warnings after GGML_ASSERT(false) * GGML_ASSERT(false) -> GGML_ABORT("fatal error") * GGML_ABORT use format string	2024-07-27 04:41:55 +02:00
Concedo	4531ab5465	refactor some fields	2024-07-27 00:04:29 +08:00
Judd	01245f5b16	llama : fix order of parameters (#8706 ) usage of `aclrtGetMemInfo` is correct: https://www.hiascend.com/doc_center/source/zh/canncommercial/63RC2/inferapplicationdev/aclcppdevg/aclcppdevg_03_0103.html Co-authored-by: Judd <foldl@boxvest.com>	2024-07-26 11:38:12 +03:00
Yaiko	01aec4a631	server : add Speech Recognition & Synthesis to UI (#8679 ) * server : add Speech Recognition & Synthesis to UI * server : add Speech Recognition & Synthesis to UI (fixes)	2024-07-26 00:10:16 +02:00
Xuan Son Nguyen	41cd47caab	examples : export-lora : fix issue with quantized base models (#8687 )	2024-07-25 23:49:39 +02:00
DavidKorczynski	49ce0ab6d4	ggml: handle ggml_init failure to fix NULL pointer deref (#8692 ) `ggml_init` can fail if no unused context is found. In that case, a NULL-pointer deref will happen later in the code during a call to `ggml_set_on_alloc`. This fixes it by bailing out if no context is found.	2024-07-25 23:23:05 +02:00
Georgi Gerganov	4226a8d10e	llama : fix build + fix fabs compile warnings (#8683 ) ggml-ci	2024-07-25 19:57:31 +03:00
Andreas (Andi) Kunar	bf5a81df37	ggml : fix build on Windows with Snapdragon X (#8531 ) * Improvements for Windows with Snapdragon X * Revert "Improvements for Windows with Snapdragon X" This reverts commit bf21397ae5ea7c73d3494db3b91505599909227d. * Improvements for Windows with Snapdragon X * WOA build clarifications * WIndows on ARM build clarifications * cmake build for Windows clarifications * Update docs/build.md Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Co-authored-by: AndreasKunar <andreaskmsn.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-07-25 19:01:00 +03:00
Georgi Gerganov	88954f7fbd	tests : fix printfs (#8068 )	2024-07-25 18:58:04 +03:00
Concedo	9f2076b4b3	fix rocminfo error	2024-07-25 22:23:36 +08:00
Chen Xi	ed67bcb24f	[SYCL] fix multi-gpu issue on sycl (#8554 ) --------- Signed-off-by: Chen Xi <xi2chen@intel.com> Co-authored-by: Meng, Hengyu <hengyu.meng@intel.com>	2024-07-25 19:45:18 +08:00
Georgi Gerganov	eddcb5238b	ggml : add and use ggml_cpu_has_llamafile() (#8664 )	2024-07-25 12:37:42 +03:00
Xuan Son Nguyen	be6d7c0791	examples : remove `finetune` and `train-text-from-scratch` (#8669 ) * examples : remove finetune and train-text-from-scratch * fix build * update help message * fix small typo for export-lora	2024-07-25 10:39:04 +02:00
Ujjawal Panchal	4b0eff3df5	docs : Quantum -> Quantized (#8666 ) * docfix: imatrix readme, quantum models -> quantized models. * docfix: server readme: quantum models -> quantized models.	2024-07-25 11:13:27 +03:00
Fan Shupei	8a4bad50a8	llama: use sliding window for phi3 (#8627 ) * use sliding window for phi3 * fix typo, "data_swa" -> "data" * [conver_hf_to_gguf.py] add phi3 sliding window	2024-07-25 10:21:09 +03:00
Concedo	a84f7c5d81	revert num old cpu for ci	2024-07-25 13:24:34 +08:00

... 13 14 15 16 17 ...

6004 commits