Commit graph

653 commits

Author SHA1 Message Date
Concedo
53bf0fb32d removed openblas backend, merged into CPU (with llamafile for BLAS). GPU backend is now automatically selected when running from CLI unless noblas is specified. 2024-09-15 19:21:52 +08:00
Concedo
5b658ab6d4 updated lite 2024-09-12 10:47:47 +08:00
Concedo
70cdb55cc9 Merge commit '947538acb8' into concedo_experimental
# Conflicts:
#	.github/workflows/build.yml
#	.github/workflows/docker.yml
#	CMakePresets.json
#	examples/llama-bench/llama-bench.cpp
#	ggml/CMakeLists.txt
#	ggml/src/CMakeLists.txt
#	tests/test-backend-ops.cpp
#	tests/test-quantize-fns.cpp
2024-09-09 11:26:34 +08:00
Concedo
d777995991 able to handle kcpp-protected model name endpoints 2024-09-04 16:26:28 +08:00
Concedo
5d34de0c08 fix basepath 2024-09-02 18:09:58 +08:00
Concedo
3c4fa57026 allow horde worker to work with password protected instances 2024-08-31 21:30:47 +08:00
Concedo
0f9968ef64 fixed some incorrect protocol prefix for localhost 2024-08-29 10:37:43 +08:00
Concedo
5f360f659c Add 5m timeout for horde worker 2024-08-28 23:17:06 +08:00
Concedo
6acbf1d7f4 macOS defaults to full offload when using gpulayers auto (-1) 2024-08-26 12:12:51 +08:00
Concedo
97aa8648ed allow launching with no models loaded 2024-08-25 23:57:32 +08:00
Concedo
0b96097439 add version number into help page 2024-08-22 00:52:30 +08:00
Concedo
5bf527a6ae added xtc sampler 2024-08-21 23:57:15 +08:00
Concedo
cd69ab218e fixed DRY 2024-08-21 17:01:28 +08:00
Concedo
2cf6d16c40 adjust sleep time 2024-08-21 01:06:41 +08:00
Concedo
c1ae350e5b fixed race condition when generating 2024-08-20 20:17:55 +08:00
Concedo
7ee359a59b on multigpu setups, pick lowest free mem instead of highest for auto layers 2024-08-20 19:02:16 +08:00
Concedo
e9eb6fe51a move chat compl to models tab 2024-08-18 14:56:10 +08:00
Concedo
e2e6d892b4 fix declaration order 2024-08-18 02:15:34 +08:00
Concedo
d71b5477c5 update lite, cleanup, fix interrogate format 2024-08-18 00:48:53 +08:00
Concedo
2c108ab17e correct phrasing 2024-08-14 21:55:53 +08:00
Concedo
f4f24d0e14 small text change 2024-08-11 21:30:46 +08:00
Concedo
139ab3d198 generate passes whole object now 2024-08-11 00:08:13 +08:00
Concedo
da8a96199c add a space between the bench prompt to fix an issue with old bpe tokenizer stack overflow (+1 squashed commits)
Squashed commits:

[44a689de] add a space between the bench prompt to fix an issue with old bpe tokenizer stack overflow
2024-08-10 19:35:56 +08:00
Concedo
86e687ae8b updated lite, added promptlimit 2024-08-10 16:05:24 +08:00
Concedo
03adb90dc6 prompt command done 2024-08-07 20:52:28 +08:00
Concedo
853d57c53c wip prompt 2024-08-06 21:54:08 +08:00
Concedo
6b8b50b350 try fix ipv6 (+1 squashed commits)
Squashed commits:

[8d95a639] try fix ipv6
2024-08-06 15:36:46 +08:00
Concedo
381b4a1844 default multiuser true 2024-08-05 20:03:29 +08:00
Concedo
bd4e55eb74 add used memory checks, add gpulayers for metal 2024-08-05 16:32:05 +08:00
Concedo
23caa63f94 up ver 2024-08-04 23:42:22 +08:00
Concedo
bfdf4b021f adjust v4-v6 allocation, default back to localhost 2024-08-04 11:42:16 +08:00
Concedo
40481abf0c allow ipv6 as well 2024-08-04 00:53:19 +08:00
Concedo
9a0976761e use loopback ip instead of localhost 2024-08-03 00:41:32 +08:00
Concedo
6bf78967f9 more janky nonsense 2024-08-02 21:58:28 +08:00
Concedo
3a72410804 Added vulkan support for SD (+1 squashed commits)
Squashed commits:

[13f42f83] Added vulkan support for SD
2024-08-01 17:12:33 +08:00
Concedo
9a04060aaa also apply even if tensor split is set 2024-07-30 23:01:50 +08:00
Concedo
2f04f848e1 if gpuid is specified, force specific order 2024-07-30 22:58:25 +08:00
Concedo
43c55bb7e2 hack to fix bad unicode fragments corrupting streamed output 2024-07-30 22:18:22 +08:00
Concedo
102eec3d22 more bugfixes in auto gpu layers selection 2024-07-29 20:38:24 +08:00
Llama
26f1df5e5f Fix the penultimate token sometimes being lost with SSE streaming (#1031)
The token immediately before an eot token was lost when SSE streaming
was enabled if that token was contained entirely within a stop sequence.
As an example of when this could happen, consider this prompt:
  Type the phrase 'pleas' once.
In a Llama 3-derived model, 'pleas' tokenizes as 'ple' 'as'. The token
'as' is contained within this instruct mode stop sequence:
  <|eot_id|><|start_header_id|>assistant<|end_header_id|>
due to the word 'assistant'. Since `string_contains_sequence_substring`
returns True for 'as', this token is added to `tokenReserve` instead of
being streamed immediately. If the '<|eot_id|>' token was generated
next, the text in `tokenReserve` would be discarded.
2024-07-29 20:16:47 +08:00
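
A minimal sketch of the `tokenReserve` buffering this fix describes, in Python. The names `string_contains_sequence_substring` and `tokenReserve` come from the commit message itself; the loop shape, the eot handling, and the flush-before-stopping fix are assumptions for illustration, not koboldcpp's actual streaming code.

```python
def string_contains_sequence_substring(text: str, stop_sequences: list[str]) -> bool:
    # True if `text` occurs inside any stop sequence, i.e. it could be
    # part of a stop sequence that is still being generated.
    return any(text in stop for stop in stop_sequences)

def stream_tokens(tokens, stop_sequences, emit):
    token_reserve = ""  # text held back in case a stop sequence is forming
    for tok in tokens:
        if tok == "<|eot_id|>":
            # The fix: flush the reserve before stopping. Discarding it here
            # is what lost the penultimate token ('as' in the example above).
            if token_reserve:
                emit(token_reserve)
            return
        if string_contains_sequence_substring(token_reserve + tok, stop_sequences):
            token_reserve += tok  # might be a stop-sequence fragment; hold it
        else:
            emit(token_reserve + tok)  # not a stop fragment; stream it all
            token_reserve = ""
    # (A full implementation would also stop once token_reserve completes an
    # entire stop sequence; omitted here for brevity.)

# The 'pleas' example from the message, streamed correctly with the flush:
emitted = []
stream_tokens(["ple", "as", "<|eot_id|>"],
              ["<|eot_id|><|start_header_id|>assistant<|end_header_id|>"],
              emitted.append)
assert "".join(emitted) == "pleas"
```
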
Concedo
948646ff7a do not offload if auto layers is less than 2, as it's usually slower 2024-07-29 20:13:43 +08:00
Concedo
e39b8aab8b improvements to auto layer calcs 2024-07-29 18:51:10 +08:00
Concedo
f289fb494a bump size of some payload arr sequences from 16 to 24 2024-07-28 20:29:39 +08:00
Concedo
01afb28a63 not working 2024-07-28 11:43:10 +08:00
Concedo
eaa702852d increased padding, it is still way too little but whatever 2024-07-27 22:32:13 +08:00
Concedo
4531ab5465 refactor some fields 2024-07-27 00:04:29 +08:00
Concedo
9f2076b4b3 fix rocminfo error 2024-07-25 22:23:36 +08:00
Concedo
57a98ba308 fixed dict loading 2024-07-25 11:41:05 +08:00
Concedo
0024d9d682 fixed order of selection 2024-07-25 11:15:30 +08:00
Concedo
d1f7832d21 adjusted layer estimation 2024-07-24 22:51:02 +08:00