Commit graph

634 commits

Concedo
2c108ab17e correct phrasing 2024-08-14 21:55:53 +08:00
Concedo
f4f24d0e14 small text change 2024-08-11 21:30:46 +08:00
Concedo
139ab3d198 generate passes whole object now 2024-08-11 00:08:13 +08:00
Concedo
da8a96199c add a space to the bench prompt to fix a stack-overflow issue in the old BPE tokenizer (+1 squashed commit)
Squashed commits:

[44a689de] add a space to the bench prompt to fix a stack-overflow issue in the old BPE tokenizer
2024-08-10 19:35:56 +08:00
Concedo
86e687ae8b updated lite, added promptlimit 2024-08-10 16:05:24 +08:00
Concedo
03adb90dc6 prompt command done 2024-08-07 20:52:28 +08:00
Concedo
853d57c53c wip prompt 2024-08-06 21:54:08 +08:00
Concedo
6b8b50b350 try to fix ipv6 (+1 squashed commit)
Squashed commits:

[8d95a639] try to fix ipv6
2024-08-06 15:36:46 +08:00
Concedo
381b4a1844 default multiuser true 2024-08-05 20:03:29 +08:00
Concedo
bd4e55eb74 add used memory checks, add gpulayers for metal 2024-08-05 16:32:05 +08:00
Concedo
23caa63f94 up ver 2024-08-04 23:42:22 +08:00
Concedo
bfdf4b021f adjust v4-v6 allocation, default back to localhost 2024-08-04 11:42:16 +08:00
Concedo
40481abf0c allow ipv6 as well 2024-08-04 00:53:19 +08:00
Concedo
9a0976761e use loopback ip instead of localhost 2024-08-03 00:41:32 +08:00
Concedo
6bf78967f9 more janky nonsense 2024-08-02 21:58:28 +08:00
Concedo
3a72410804 Added Vulkan support for SD (+1 squashed commit)
Squashed commits:

[13f42f83] Added Vulkan support for SD
2024-08-01 17:12:33 +08:00
Concedo
9a04060aaa also apply even if tensor split is set 2024-07-30 23:01:50 +08:00
Concedo
2f04f848e1 if gpuid is specified, force specific order 2024-07-30 22:58:25 +08:00
Concedo
43c55bb7e2 hack to fix bad unicode fragments corrupting streamed output 2024-07-30 22:18:22 +08:00
Concedo
102eec3d22 more bugfixes in auto gpu layers selection 2024-07-29 20:38:24 +08:00
Llama
26f1df5e5f
Fix the penultimate token sometimes being lost with SSE streaming (#1031)
The token immediately before an eot token was lost when SSE streaming
was enabled if that token was contained entirely within a stop sequence.
As an example of when this could happen, consider this prompt:
  Type the phrase 'pleas' once.
In a Llama 3-derived model, 'pleas' tokenizes as 'ple' 'as'. The token
'as' is contained within this instruct mode stop sequence:
  <|eot_id|><|start_header_id|>assistant<|end_header_id|>
due to the word 'assistant'. Since `string_contains_sequence_substring`
returns True for 'as', this token is added to `tokenReserve` instead of
being streamed immediately. If the '<|eot_id|>' token was generated
next, the text in `tokenReserve` would be discarded.
2024-07-29 20:16:47 +08:00
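A minimal sketch of the buffering behaviour described in the fix above, with hypothetical names (`token_reserve` mirrors `tokenReserve`; the real koboldcpp logic differs in detail):

```python
def stream_tokens(tokens, stop_sequences):
    """Illustrative only: hold back text that appears inside a stop
    sequence, since it may be the start of one, and flush it at the end."""
    token_reserve = ""  # corresponds to tokenReserve in the report above
    for tok in tokens:
        pending = token_reserve + tok
        # 'as' is a substring of '...assistant...', so it gets reserved
        # rather than streamed immediately.
        if any(pending in stop for stop in stop_sequences):
            token_reserve = pending
            continue
        yield pending
        token_reserve = ""
    # The pre-fix behaviour discarded token_reserve when generation ended
    # on an eot token; the fix flushes it instead.
    if token_reserve:
        yield token_reserve
```

With the example prompt, 'ple' streams immediately while 'as' is held in reserve; flushing on end of generation keeps it from being silently dropped.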
Concedo
948646ff7a do not offload if auto layers is less than 2, as it's usually slower 2024-07-29 20:13:43 +08:00
Concedo
e39b8aab8b improvements to auto layer calcs 2024-07-29 18:51:10 +08:00
Concedo
f289fb494a bump size of some payload arr sequences from 16 to 24 2024-07-28 20:29:39 +08:00
Concedo
01afb28a63 not working 2024-07-28 11:43:10 +08:00
Concedo
eaa702852d increased padding, it is still way too little but whatever 2024-07-27 22:32:13 +08:00
Concedo
4531ab5465 refactor some fields 2024-07-27 00:04:29 +08:00
Concedo
9f2076b4b3 fix rocminfo error 2024-07-25 22:23:36 +08:00
Concedo
57a98ba308 fixed dict loading 2024-07-25 11:41:05 +08:00
Concedo
0024d9d682 fixed order of selection 2024-07-25 11:15:30 +08:00
Concedo
d1f7832d21 adjusted layer estimation 2024-07-24 22:51:02 +08:00
Concedo
e28c42d7f7 adjusted layer estimation 2024-07-24 21:54:49 +08:00
Concedo
b7fc8e644a fix broken template, updated lite 2024-07-24 20:47:05 +08:00
Concedo
c76f3401e3 remove extra padding for layer guessing 2024-07-24 16:36:34 +08:00
Concedo
c80d5af014 add a tiny amount of padding 2024-07-23 18:58:26 +08:00
henk717
e493f14a3e
New automatic layers (#1012)
* Henk's version of the fsize algo

This is the current version of the fsize algo based on Pyro's algorithm with added padding.

* Update koboldcpp.py

Add debugs and bump padding

* Pyro version

Pyro didn't agree with my version, so here is a test with his version

* Polish new auto layers

This one cleans up some debug prints, restores the max behavior in case the old algorithm suits someone better, and changes the 200-layer value to the actual max for all backends so users have a better feel for the models.

* Remove 10% margin

The new version has been much more accurate; on low-VRAM systems I only notice a 1-layer difference. Getting rid of the margin so users can test whether it's still within safe limits, as I expect. On a 6GB system it results in 18 layers instead of 17 being chosen for Tiefighter.

* Restore 500MB buffer to play it safe

I'm not confident most people keep their VRAM usage under 1GB with background tasks. For now, since we are aiming to have this work on as many systems as possible, I restore the 500MB of extra space, since the fsize inflation is gone.

* Cap layers at maximum

When using the auto predict, we don't want to go over the maximum number of layers. Users should have a realistic feel for how large the model is.

For example, when I was using the new auto guesser to communicate whether a larger model would fit on someone's system at a higher context, it originally made me think that the model had 60 layers. In reality it had fewer.

This commit takes the model's layer count and adds 3 extra, since that is the highest number of additional layers a backend adds for context handling (for most it's 1).

* Remove old max layer code

It turns out that at extreme contexts on new models such as Nemo, the old code incorrectly assumes we can offload everything. It's also redundant to check for max layers the old way, since I capped our new guesses.

The old code is now removed to simplify things, and this changed the Nemo guess from 43 layers to 15. Still looking into the 15; it still seems too high, but that may be the old algorithm taking over.

* Restructure algorithm into multiple parts

As requested, the different calculations in the algorithm now have their own sections and names, so it's easier to understand which parts are being used. This also fixes the typo that resulted from the code being harder to read; the typo made no difference during execution, and the algorithm is confirmed to still work the same.
2024-07-22 15:47:31 +08:00
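A minimal sketch of the capped guess described above, under assumed names and a deliberately simplified per-layer cost (the real calculation in koboldcpp.py also accounts for context size and backend overhead):

```python
def guess_gpu_layers(model_fsize: int, model_layers: int, free_vram: int) -> int:
    """Illustrative only: estimate offloadable layers from file size,
    keep a 500MB safety buffer, and cap the guess at model_layers + 3."""
    reserved = 500 * 1024 * 1024            # leave ~500MB for background tasks
    usable = max(free_vram - reserved, 0)
    per_layer = model_fsize / model_layers  # rough per-layer cost in bytes
    guess = int(usable // per_layer)
    # Never report more than the model's real layer count plus 3, the most
    # any backend adds for context handling (most add just 1).
    return min(guess, model_layers + 3)
```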
Concedo
e2b36aa6cf fixed DRY loading seq when not in use, set kcppt to -1 layers by default 2024-07-22 15:44:34 +08:00
Concedo
4d9ccddc2c don't unpack pyd 2024-07-20 18:58:49 +08:00
Concedo
1a23d49c32 serve tags endpoint 2024-07-19 16:08:54 +08:00
Concedo
a998588f3a improved estimation 2024-07-19 00:20:11 +08:00
Concedo
caab9cb8ae fixed unwanted removal 2024-07-18 22:27:22 +08:00
BBC-Esq
621801da0e
Streamline misc (#1007)
* fix typo and streamline a little

* streamline togglehorde

* oops
2024-07-18 22:25:38 +08:00
Concedo
8b0a9f7e56 remove keys, use tuple 2024-07-18 22:11:13 +08:00
BBC-Esq
7de1ebf897
Streamline with dictionaries (#1005)
* dictionary #1

* dictionary #2
2024-07-18 22:05:30 +08:00
BBC-Esq
ce971a0f3d
Streamline with fstrings (#1006)
* fstring #1

* fstring #2
2024-07-18 21:48:46 +08:00
Concedo
90c1bbbcb9 more URL download support 2024-07-18 11:56:05 +08:00
Concedo
ad86b1aeb8 Implemented Kcpp Launch Templates (+1 squashed commit)
Squashed commits:

[5ea4c1de] wip integrating skcpps templates (+1 squashed commit)

Squashed commits:

[737daa7f] skcpps wip
2024-07-18 00:22:59 +08:00
Concedo
8ccc0144d2 ability to set -1 as gpulayers and determine at runtime (+1 squashed commit)
Squashed commits:

[594263c3] ability to set -1 as gpulayers and determine at runtime
2024-07-17 20:31:19 +08:00
Concedo
6c883a4803 dummy skcpps format 2024-07-17 18:35:27 +08:00
Concedo
eca7521c13 allowed embedded chat adapters 2024-07-17 18:08:43 +08:00