koboldcpp

mirror of https://github.com/LostRuins/koboldcpp.git synced 2025-09-13 02:19:41 +00:00

Author	SHA1	Message	Date
wangshuai09	c8a0090922	cann: support q8_0 for Ascend backend (#8805 )	2024-08-01 10:39:05 +08:00
Igor Okulist	afbbcf3c04	server : update llama-server embedding flag documentation (#8779 ) Fixes #8763	2024-07-31 19:59:09 -04:00
Clint Herron	ed9d2854c9	Build: Fix potential race condition (#8781 ) * Fix potential race condition as pointed out by @fairydreaming in #8776 * Reference the .o rather than rebuilding every time. * Adding in CXXFLAGS and LDFLAGS * Removing unnecessary linker flags.	2024-07-31 15:51:06 -04:00
pculliton	398ede5efe	Adding Gemma 2 2B configs (#8784 ) * Adding Gemma 2 2B configs Updates to Q scaling and Gemma 2 model sizes to match v2 2B model. * Update src/llama.cpp Co-authored-by: slaren <slarengh@gmail.com> --------- Co-authored-by: slaren <slarengh@gmail.com>	2024-07-31 17:12:10 +02:00
Borislav Stanimirov	44d28ddd5c	cmake : fix use of external ggml (#8787 )	2024-07-31 15:40:08 +02:00
Someone	268c566006	nix: cuda: rely on propagatedBuildInputs (#8772 ) Listing individual outputs no longer necessary to reduce the runtime closure size after https://github.com/NixOS/nixpkgs/pull/323056.	2024-07-30 13:35:30 -07:00
Concedo	9a04060aaa	also apply even if tensor split is set	2024-07-30 23:01:50 +08:00
Concedo	2f04f848e1	if gpuid is specified, force specific order	2024-07-30 22:58:25 +08:00
Brian	7e72aa74fd	py: add_array() will not add to kv store if value is an empty array (#8774 ) * gguf_writer.py: add_array() should not add to kv store if empty * Apply suggestions from code review I was wondering if there was a specific reason for `if val` but good to hear we can safely use `len(val == 0` Co-authored-by: compilade <git@compilade.net> --------- Co-authored-by: compilade <git@compilade.net>	2024-07-31 00:57:03 +10:00
Concedo	7e95b80211	we don't need this	2024-07-30 22:45:41 +08:00
Concedo	265f37f13c	Merge branch 'upstream' into concedo_experimental	2024-07-30 22:44:41 +08:00
l3utterfly	7c27a19b2e	added android implementation of ggml_print_backtrace_symbols (#8751 ) * added android implementation of ggml_print_backtrace_symbols * Update ggml/src/ggml.c Co-authored-by: slaren <slarengh@gmail.com> * Update ggml/src/ggml.c Co-authored-by: slaren <slarengh@gmail.com> * Update ggml/src/ggml.c Co-authored-by: slaren <slarengh@gmail.com> * Update ggml/src/ggml.c Co-authored-by: slaren <slarengh@gmail.com> * Update ggml/src/ggml.c Co-authored-by: slaren <slarengh@gmail.com> --------- Co-authored-by: slaren <slarengh@gmail.com>	2024-07-30 16:40:18 +02:00
Concedo	bf35652ef7	Merge branch 'upstream' into concedo_experimental # Conflicts: # .github/workflows/build.yml # .gitignore # flake.lock # ggml/CMakeLists.txt # ggml/src/CMakeLists.txt	2024-07-30 22:31:49 +08:00
Concedo	43c55bb7e2	hack to fix bad unicode fragments corrupting streamed output	2024-07-30 22:18:22 +08:00
Concedo	1df850c95c	add magnum to colab models	2024-07-30 21:13:29 +08:00
Georgi Gerganov	140074bb86	flake.lock: Update (#8729 )	2024-07-30 05:58:57 -07:00
wangshuai09	6e2b6000e5	cann: update cmake (#8765 )	2024-07-30 12:37:35 +02:00
zhentaoyu	c887d8b017	[SYCL] Add `TIMESTEP_EMBEDDING` OP (#8707 ) Signed-off-by: zhentaoyu <zhentao.yu@intel.com>	2024-07-30 14:56:51 +08:00
CarterLi999	75af08c475	ggml: bugfix: fix the inactive elements is agnostic for risc-v vector (#8748 ) In these codes, we want to retain the value that they previously held when mask[i] is false. So we should use undisturbed. With the default agnostic policy of rvv intrinsic, these values can be held or be written with 1s. Co-authored-by: carter.li <carter.li@starfivetech.com>	2024-07-29 18:38:34 +02:00
R0CKSTAR	439b3fc75a	cuda : organize vendor-specific headers into vendors directory (#8746 ) Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>	2024-07-29 14:56:12 +02:00
Concedo	102eec3d22	more bugfixes in auto gpu layers selection	2024-07-29 20:38:24 +08:00
Llama	26f1df5e5f	Fix the penultimate token sometimes being lost with SSE streaming (#1031 ) The token immediately before an eot token was lost when SSE streaming was enabled if that token was contained entirely within a stop sequence. As an example of when this could happen, consider this prompt: Type the phrase 'pleas' once. In a Llama 3-derived model, 'pleas' tokenizes as 'ple' 'as'. The token 'as' is contained within this instruct mode stop sequence: <\|eot_id\|><\|start_header_id\|>assistant<\|end_header_id\|> due to the word 'assistant'. Since `string_contains_sequence_substring` returns True for 'as', this token is added to `tokenReserve` instead of being streamed immediately. If the '<\|eot_id\|>' token was generated next, the text in `tokenReserve` would be discarded.	2024-07-29 20:16:47 +08:00
Concedo	948646ff7a	do not offload if auto layers is less than 2, as its usually slower	2024-07-29 20:13:43 +08:00
Concedo	e39b8aab8b	improvements to auto layer calcs	2024-07-29 18:51:10 +08:00
Meng, Hengyu	0832de7236	[SYCL] add conv support (#8688 )	2024-07-29 10:50:27 +08:00
Johannes Gäßler	6eeaeba126	cmake: use 1 more thread for non-ggml in CI (#8740 )	2024-07-28 22:32:44 +02:00
Concedo	f289fb494a	bump size of some payload arr sequences from 16 to 24	2024-07-28 20:29:39 +08:00
Concedo	e47477fd4d	don't build rope factors from https://github.com/ggerganov/llama.cpp/pull/8676 for CLBlast as it segfaults	2024-07-28 17:27:09 +08:00
Concedo	edbdfbced2	Revert "cu11 build threads" This reverts commit c3aa259907a77b19bb5c94015de61b8178b9d283. (+2 squashed commit) Squashed commit: [bf2f7e7c] missing include [c3aa2599] cu11 build threads	2024-07-28 16:46:10 +08:00
Austin	4730faca61	chore : Fix vulkan related compiler warnings, add help text, improve CLI options (#8477 ) * chore: Fix compiler warnings, add help text, improve CLI options * Add prototypes for function definitions * Invert logic of --no-clean option to be more intuitive * Provide a new help prompt with clear instructions * chore : Add ignore rule for vulkan shader generator Signed-off-by: teleprint-me <77757836+teleprint-me@users.noreply.github.com> * Update ggml/src/vulkan-shaders/vulkan-shaders-gen.cpp Co-authored-by: 0cc4m <picard12@live.de> * chore : Remove void and apply C++ style empty parameters * chore : Remove void and apply C++ style empty parameters --------- Signed-off-by: teleprint-me <77757836+teleprint-me@users.noreply.github.com> Co-authored-by: 0cc4m <picard12@live.de>	2024-07-28 09:52:42 +02:00
compilade	4c676c85e5	llama : refactor session file management (#8699 ) * llama : refactor session file management * llama : saving and restoring state checks for overflow The size of the buffers should now be given to the functions working with them, otherwise a truncated file could cause out of bound reads. * llama : stream from session file instead of copying into a big buffer Loading session files should no longer cause a memory usage spike. * llama : llama_state_get_size returns the actual size instead of max This is a breaking change, but makes that function much easier to keep up to date, and it also makes it reflect the behavior of llama_state_seq_get_size. * llama : share code between whole and seq_id-specific state saving Both session file types now use a more similar format. * llama : no longer store all hparams in session files Instead, the model arch name is stored. The layer count and the embedding dimensions of the KV cache are still verified when loading. Storing all the hparams is not necessary. * llama : fix uint64_t format type * llama : various integer type cast and format string fixes Some platforms use "%lu" and others "%llu" for uint64_t. Not sure how to handle that, so casting to size_t when displaying errors. * llama : remove _context suffix for llama_data_context * llama : fix session file loading llama_state_get_size cannot be used to get the max size anymore. * llama : more graceful error handling of invalid session files * llama : remove LLAMA_MAX_RNG_STATE It's no longer necessary to limit the size of the RNG state, because the max size of session files is not estimated anymore. * llama : cast seq_id in comparison with unsigned n_seq_max	2024-07-28 00:42:05 -04:00
Concedo	0029e36f50	fix for older phi3 models without swa	2024-07-28 12:13:38 +08:00
Concedo	01afb28a63	not working	2024-07-28 11:43:10 +08:00
R0CKSTAR	e54c35e4fb	feat: Support Moore Threads GPU (#8383 ) * Update doc for MUSA Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> * Add GGML_MUSA in Makefile Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> * Add GGML_MUSA in CMake Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> * CUDA => MUSA Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> * MUSA adds support for __vsubss4 Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> * Fix CI build failure Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> --------- Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>	2024-07-28 01:41:25 +02:00
Concedo	ba5babb876	Merge branch 'upstream' into concedo_experimental # Conflicts: # .devops/nix/apps.nix # .devops/tools.sh # Makefile # README.md # docs/backend/SYCL.md # docs/build.md # examples/CMakeLists.txt # ggml/include/ggml.h # src/llama-vocab.cpp # tests/test-backend-ops.cpp # tests/test-chat-template.cpp # tests/test-sampling.cpp	2024-07-27 23:15:54 +08:00
Georgi Gerganov	5e2727fe03	scripts : sync vulkan-shaders (#0 )	2024-07-27 18:08:47 +03:00
Georgi Gerganov	56f20aa25d	scripts : sync ggml-aarch64 sources	2024-07-27 18:07:33 +03:00
Georgi Gerganov	345c8c0c87	ggml : add missing semicolon (#0 ) ggml-ci	2024-07-27 17:43:44 +03:00
Georgi Gerganov	ae7985cd7b	sync : ggml ggml-ci	2024-07-27 17:43:44 +03:00
Mahesh Madhav	a05ca93697	ggml : loop tiling optimizations for scalar path (ggml/898) Apply a loop tiling technique to the generic path, which provides performance upside for ISAs with enough registers to take advantage of it. Also helps the compiler optimize this path.	2024-07-27 17:43:44 +03:00
Ivan Filipov	9f77d899b7	ggml: add support for float16 input tensors in pooling operations (ggml/895) * Add support for float16 tensors in 1d pooling operations * Add support for float16 input tensors in 2d pooling operations * code cleanup remove unnecessary casting during srow ptr initialization --------- Co-authored-by: vanaka11 <vanaka1189@gmail.com>	2024-07-27 17:43:44 +03:00
Tony Wasserka	203b7f1531	vulkan : initialize vk_buffer_struct members to VK_NULL_HANDLE (ggml/893) This prevents invalid frees when destroying a partially initialized vk_buffer_struct. For example, this could happen in ggml_vk_create_buffer when running out of device memory. Co-authored-by: Tony Wasserka <neobrain@users.noreply.github.com>	2024-07-27 17:43:44 +03:00
Borislav Stanimirov	d2b851bfa1	cmake : only enable GGML_NATIVE and x86 flags if not crosscompiling (ggml/885)	2024-07-27 17:43:44 +03:00
Daniel Bevenius	c12b6e8ee7	ggml : remove unnecessary UNUSED macro call (ggml/880) This commit removes an UNUSED macro call that is not needed as the variable n0 is used in the code and will not produce a warning. Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>	2024-07-27 17:43:44 +03:00
Concedo	eaa702852d	increased padding, it is still way too little but whatever	2024-07-27 22:32:13 +08:00
Jeffrey Morgan	b5e95468b1	llama : add support for llama 3.1 rope scaling factors (#8676 ) * Add llama 3.1 rope scaling factors to llama conversion and inference This commit generates the rope factors on conversion and adds them to the resulting model as a tensor. At inference time, these factors are passed to the `ggml_rope_ext` rope oepration, improving results for context windows above 8192 * Update convert_hf_to_gguf.py Co-authored-by: compilade <git@compilade.net> * address comments * address comments * Update src/llama.cpp Co-authored-by: compilade <git@compilade.net> * Update convert_hf_to_gguf.py Co-authored-by: compilade <git@compilade.net> --------- Co-authored-by: compilade <git@compilade.net>	2024-07-27 15:03:45 +03:00
Georgi Gerganov	92090eca21	llama : add function for model-based max number of graph nodes (#8622 ) * llama : model-based max number of graph nodes ggml-ci * llama : disable 405B max_nodes path due to lack of complaints ggml-ci	2024-07-27 14:59:29 +03:00
Daniel Bevenius	9d03d085dd	common : add --no-warmup option for main/llama-cli (#8712 ) This commit adds a --no-warmup option for llama-cli. The motivation for this is that it can be convenient to skip the warmup llama_decode call when debugging. Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>	2024-07-27 13:45:02 +03:00
wangshuai09	bfb4c74981	cann: Fix Multi-NPU execution error (#8710 ) * cann: fix multi-npu exec error * cann: update comment for ggml_backend_cann_supports_buft	2024-07-27 16:36:44 +08:00
Concedo	729eb1e552	no fast forward for empty prompt	2024-07-27 16:29:35 +08:00

1 2 3 4 5 ...

5370 commits