* Henk's version of the fsize algo
This is the current version of the fsize algo, based on Pyro's algorithm with added padding.
* Update koboldcpp.py
Add debug prints and bump the padding
* Pyro version
Pyro didn't agree with my version, so here is a test with his version
* Polish new auto layers
This one cleans up some debug prints, restores the max behavior in case the old algorithm suits someone better, and changes the 200 layers to the actual maximum for all backends so users get a better feel for the models.
* Remove 10% margin
The new version has been much more accurate; on low-VRAM systems I only notice a 1-layer difference. Getting rid of the margin so users can test whether it still stays within safe limits as I expect. On a 6GB system this results in 18 layers instead of 17 being chosen for Tiefighter.
* Restore 500MB buffer to play it safe
I'm not confident that most people keep their VRAM usage under 1GB with background tasks. For now, since we are aiming to have this work on as many systems as possible, I'm restoring the 500MB of extra space, since the fsize inflation is gone.
* Cap layers at maximum
When using the auto predict we don't want to go over the maximum number of layers. Users should have a realistic feel for how large the model is.
For example, when I was using the new auto guesser to communicate whether a larger model would fit on someone's system at a higher context, it originally made me think that the model had 60 layers. In reality it had fewer.
This commit takes the layer count of the model and adds 3 extra, since that is the highest number of additional layers a backend adds for context handling (for most it's 1).
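The actual change lives in koboldcpp.py; the snippet below is only a hedged sketch of the arithmetic, with hypothetical names, to show how the guess gets capped.

```cpp
#include <algorithm>
#include <cstdio>

// Hypothetical sketch of the cap (the real logic is in koboldcpp.py): never
// suggest more GPU layers than the model actually has, plus an allowance of 3
// for the extra context-handling layers a backend may add.
static int cap_layer_guess(int estimated_layers, int model_layer_count) {
    return std::min(estimated_layers, model_layer_count + 3);
}

int main() {
    std::printf("%d\n", cap_layer_guess(60, 40));  // prints 43, not the inflated 60
    return 0;
}
```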
* Remove old max layer code
Turns out that at extreme contexts on new models such as Nemo, the old code incorrectly assumes we can offload everything. It's also redundant to check for max layers the old way since I capped our new guesses.
The old code is now removed to simplify things, and this changed the Nemo guess from 43 layers to 15 layers. Still looking into the 15 part; it still seems too high, but that may be the old algorithm taking over.
* Restructure algorithm into multiple parts
As requested, the different calculations in the algorithm now have their own sections and names so it's easier to understand which parts are being used. This also fixes a typo that was caused by the code being harder to read; the typo made no difference during execution and the algorithm is confirmed to still work the same.
* Rudimentary support for OpenAI chat completions tool calls
-Most small models are not smart enough to do this, especially a combined tool call + role-play response, but at least this allows experimentation along these lines with koboldcpp
* try to also support a specified function and tool_choice set to none
Allow tools start and end messages to be configured in the adapter
Try to force the grammar to a specific function call if one is specified (untested)
* ensure tools get listed right after user content and before end of user message content
* omit grammars approach, try prompting instead
-use more extensive JSON parsing and direct instructions to models to try to obtain the desired result
-seems to work relatively well with Mistral-7B-Instruct-v.0.3.Q4_K_M.gguf and neuralhermes-2.5-mistral-7b.Q4_K_M.gguf
-question of whether this is too opinionated an approach; should the instructions be something that can be passed with the prompt template?
* add back llamacpp recommended json grammar
Go back to adding a grammar, but use only the "official" llamacpp grammar, not a custom one just for OpenAI
* Tidy up, remove unnecessary globals
* clarity
* fix missing local variable error
This worked to fix the error I mentioned in my last comment
---------
Co-authored-by: Concedo <39025047+LostRuins@users.noreply.github.com>
* Add the DRY dynamic N-gram anti-repetition sampler
The DRY (Do not Repeat Yourself) sampler is a dynamic N-gram
repetition penalty that negatively scores tokens that would extend
sequences that already appear in the context.
See this discussion for a motivation and explanation of the sampler:
https://github.com/oobabooga/text-generation-webui/pull/5677
This implementation of DRY mostly aligns with the oobabooga version
with a few modifications. It uses a more efficient linear scanning
algorithm to identify repetitions. It also supports multi-token
sequence breakers. As a limitation, this implementation reuses
the rep pen range parameter, rather than introducing a new range
just for the DRY sampler.
There is a separate change to lite.koboldai.net that exposes the DRY
sampler parameters to KoboldAI Lite, so none of the embed files have
been changed as part of this commit.
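For reference, here is a simplified sketch of the scoring idea. It is not the koboldcpp implementation: names are hypothetical, sequence breakers are omitted, and it uses a quadratic scan instead of the linear one described above.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdio>
#include <unordered_map>
#include <vector>

// For each candidate token, find the longest tail of the current context
// that, extended by the candidate, would repeat a sequence that already
// appears earlier in the context, then penalize exponentially in that length.
static std::unordered_map<int, float> dry_penalties(
        const std::vector<int> &context,     // token ids generated so far
        const std::vector<int> &candidates,  // token ids under consideration
        float multiplier, float base, size_t allowed_length) {
    std::unordered_map<int, float> penalty;  // token id -> amount subtracted from its logit
    const size_t n = context.size();
    for (int cand : candidates) {
        size_t best = 0;
        for (size_t i = 0; i < n; ++i) {
            if (context[i] != cand) continue;  // earlier occurrence of the candidate
            // Count how many tokens before position i match the tail of the context.
            size_t len = 0;
            while (len < i && context[i - 1 - len] == context[n - 1 - len]) {
                ++len;
            }
            best = std::max(best, len);
        }
        if (best >= allowed_length && best > 0) {
            penalty[cand] = multiplier * std::pow(base, (float)(best - allowed_length));
        }
    }
    return penalty;
}

int main() {
    // Context "A B C A B": generating C would extend the repeated "A B" into
    // "A B C", which already occurred, so C gets penalized.
    std::vector<int> context = {1, 2, 3, 1, 2};
    std::vector<int> candidates = {1, 2, 3};
    auto p = dry_penalties(context, candidates, 0.8f, 1.75f, 2);
    for (const auto &kv : p) {
        std::printf("token %d penalty %.3f\n", kv.first, kv.second);
    }
    return 0;
}
```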
* Update default DRY parameters to match lite
* Improve DRY token debug logging
* Replace `and` with `&&` to fix MSVC compile error
Little-known fact: the C++98 standard defines `and` as an
alternative token for the `&&` operator (along with a number of
other alternative tokens and digraphs). MSVC does not allow these
without using the /Za option or including the <iso646.h> header.
Change to the more conventional operator to make this code more
portable.
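A minimal illustration of the change, with hypothetical variable names:

```cpp
#include <cstdio>

// `and` is valid ISO C++, but MSVC's default mode rejects it, so the
// conventional operator is used instead.
static bool dry_enabled(float dry_multiplier, float dry_base) {
    return dry_multiplier > 0.0f && dry_base > 1.0f;  // was: `... > 0.0f and dry_base > 1.0f`
}

int main() {
    std::printf("%d\n", dry_enabled(0.8f, 1.75f));  // prints 1
    return 0;
}
```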
* Fix MSVC compile error because log is not constexpr
Replace the compile-time computation with a floating-point
approximation of log(std::numeric_limits<float>::max()).
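A sketch of the workaround; the constant name is hypothetical, only the value is meaningful:

```cpp
#include <cmath>
#include <cstdio>
#include <limits>

// std::log is not constexpr in standard C++, so rather than computing
// log(FLT_MAX) in a constant expression, a precomputed approximation is used.
static const float kFloatMaxLog = 88.7228391f;  // ~ log(std::numeric_limits<float>::max())

int main() {
    std::printf("%f vs %f\n", kFloatMaxLog, std::log(std::numeric_limits<float>::max()));
    return 0;
}
```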
* Remove unused llama sampler variables and clean up sequence breakers.
* Remove KCPP_SAMPLER_DRY as a separate enum entry
The DRY sampler is effectively a repetition penalty and there
are very few reasons to apply it at a different place in sampler
order than the standard single-token penalty. There are also
multiple projects that have dependencies on the existing sampler
IDs, including KoboldAI, KoboldAI Lite, and Silly Tavern. In order
to minimize the impact on those dependencies when adding the DRY
sampler to koboldcpp, it makes the most sense not to add a new ID
for now, and instead to piggyback on KCPP_SAMPLER_REP_PEN. In the
future, if we find a use case for splitting the application of rep
pen and DRY, we can introduce a new enum entry then.
* Add the dry_penalty_last_n to independently control DRY penalty range
This parameter follows the oobabooga semantics: it's optional, with a
default value of zero. Zero means that DRY should scan the entire
context. Otherwise, it's the number of tokens from the end of the
context that are scanned for repetitions.
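A small sketch of the range semantics; the function name is hypothetical:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdio>

// Zero means the whole context is scanned; otherwise only the last
// dry_penalty_last_n tokens are considered for repetition matching.
static size_t dry_scan_range(size_t context_len, int dry_penalty_last_n) {
    if (dry_penalty_last_n <= 0) {
        return context_len;  // 0 => scan the entire context
    }
    return std::min(context_len, (size_t)dry_penalty_last_n);
}

int main() {
    std::printf("%zu %zu\n", dry_scan_range(4096, 0), dry_scan_range(4096, 1024));  // 4096 1024
    return 0;
}
```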
* Limit sequence breaker lengths in tokens and characters
The core DRY sampler algorithm is linear in the context length, but
there are several parts of the sampler related to multi-token
sequence breakers that are potentially quadratic. Without any
restrictions, a suitably crafted context and sequence breaker could
result in a denial-of-service attack on a server running koboldcpp.
This change limits the maximum number of characters and the maximum
token length of a sequence breaker in order to limit the maximum
overhead associated with the sampler.
This change also improves some comments, adding more detail and
changing the wording to increase clarity.
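A sketch of the kind of caps involved; the limit values and names here are assumptions for illustration, not the actual koboldcpp constants:

```cpp
#include <cstddef>
#include <cstdio>
#include <string>
#include <vector>

// Sequence breakers are user-supplied strings, so both their character length
// and the number of tokens they expand to are bounded, keeping the per-breaker
// overhead fixed instead of letting a crafted input blow up the quadratic parts.
static const size_t kMaxBreakerChars  = 40;  // assumed character cap
static const size_t kMaxBreakerTokens = 10;  // assumed token cap

static std::string clamp_breaker_text(const std::string &breaker) {
    return breaker.size() <= kMaxBreakerChars ? breaker : breaker.substr(0, kMaxBreakerChars);
}

static std::vector<int> clamp_breaker_tokens(std::vector<int> tokens) {
    if (tokens.size() > kMaxBreakerTokens) {
        tokens.resize(kMaxBreakerTokens);  // keep only the leading tokens
    }
    return tokens;
}

int main() {
    std::string breaker(200, 'x');  // an adversarially long breaker
    std::printf("%zu chars after clamping\n", clamp_breaker_text(breaker).size());  // 40
    return 0;
}
```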