koboldcpp

mirror of https://github.com/LostRuins/koboldcpp.git synced 2025-09-10 17:14:36 +00:00

Author	SHA1	Message	Date
Concedo	853d57c53c	wip prompt	2024-08-06 21:54:08 +08:00
Concedo	6b8b50b350	try fix ipv6 (+1 squashed commits) Squashed commits: [8d95a639] try fix ipv6	2024-08-06 15:36:46 +08:00
Concedo	381b4a1844	default multiuser true	2024-08-05 20:03:29 +08:00
Concedo	bd4e55eb74	add used memory checks, add gpulayers for metal	2024-08-05 16:32:05 +08:00
Concedo	23caa63f94	up ver	2024-08-04 23:42:22 +08:00
Concedo	bfdf4b021f	adjust v4-v6 allocation, default back to localhost	2024-08-04 11:42:16 +08:00
Concedo	40481abf0c	allow ipv6 as well	2024-08-04 00:53:19 +08:00
Concedo	9a0976761e	use loopback ip instead of localhost	2024-08-03 00:41:32 +08:00
Concedo	6bf78967f9	more janky nonsense	2024-08-02 21:58:28 +08:00
Concedo	3a72410804	Added vulkan support for SD (+1 squashed commits) Squashed commits: [13f42f83] Added vulkan support for SD	2024-08-01 17:12:33 +08:00
Concedo	9a04060aaa	also apply even if tensor split is set	2024-07-30 23:01:50 +08:00
Concedo	2f04f848e1	if gpuid is specified, force specific order	2024-07-30 22:58:25 +08:00
Concedo	43c55bb7e2	hack to fix bad unicode fragments corrupting streamed output	2024-07-30 22:18:22 +08:00
Concedo	102eec3d22	more bugfixes in auto gpu layers selection	2024-07-29 20:38:24 +08:00
Llama	26f1df5e5f	Fix the penultimate token sometimes being lost with SSE streaming (#1031 ) The token immediately before an eot token was lost when SSE streaming was enabled if that token was contained entirely within a stop sequence. As an example of when this could happen, consider this prompt: Type the phrase 'pleas' once. In a Llama 3-derived model, 'pleas' tokenizes as 'ple' 'as'. The token 'as' is contained within this instruct mode stop sequence: <\|eot_id\|><\|start_header_id\|>assistant<\|end_header_id\|> due to the word 'assistant'. Since `string_contains_sequence_substring` returns True for 'as', this token is added to `tokenReserve` instead of being streamed immediately. If the '<\|eot_id\|>' token was generated next, the text in `tokenReserve` would be discarded.	2024-07-29 20:16:47 +08:00
Concedo	948646ff7a	do not offload if auto layers is less than 2, as its usually slower	2024-07-29 20:13:43 +08:00
Concedo	e39b8aab8b	improvements to auto layer calcs	2024-07-29 18:51:10 +08:00
Concedo	f289fb494a	bump size of some payload arr sequences from 16 to 24	2024-07-28 20:29:39 +08:00
Concedo	01afb28a63	not working	2024-07-28 11:43:10 +08:00
Concedo	eaa702852d	increased padding, it is still way too little but whatever	2024-07-27 22:32:13 +08:00
Concedo	4531ab5465	refactor some fields	2024-07-27 00:04:29 +08:00
Concedo	9f2076b4b3	fix rocminfo error	2024-07-25 22:23:36 +08:00
Concedo	57a98ba308	fixed dict loading	2024-07-25 11:41:05 +08:00
Concedo	0024d9d682	fixed order of selection	2024-07-25 11:15:30 +08:00
Concedo	d1f7832d21	adjusted layer estimation	2024-07-24 22:51:02 +08:00
Concedo	e28c42d7f7	adjusted layer estimation	2024-07-24 21:54:49 +08:00
Concedo	b7fc8e644a	fix broken template, updated lite	2024-07-24 20:47:05 +08:00
Concedo	c76f3401e3	remove extra padding for layer guessing	2024-07-24 16:36:34 +08:00
Concedo	c80d5af014	add a tiny amount of padding	2024-07-23 18:58:26 +08:00
henk717	e493f14a3e	New automatic layers (#1012 ) * Henk's version of the fsize algo This is the current version of the fsize algo based on Pyro's algorithm with added padding. * Update koboldcpp.py Add debugs and bump padding * Pyro version Pyro didn't agree with my version, so here is a test with his version * Polish new auto layers This one cleans up some debug prints, restores the max behavior in case the old alg suits someone better and changes the 200 layers to be the actual max for all backends so users have a better feel for the models. * Remove 10% margin The new version has been much more accurate, for low vram systems I only notice 1 layer difference. Getting rid of it so users can test if its still in safe margins like I expect. On a 6GB system it results in 18 layers instead of 17 being chosen for Tiefighter. * Restore 500MB buffer to play it safe I'm not feeling confident most people keep their vram usage under 1GB with background tasks. For now since we are aiming to have it work on as many systems as possible I restore the 500MB extra space since the fsize inflation is gone. * Cap layers at maximum When using the auto predict we don't want to go over the maximum amount of layers. Users should have a realistic feel for how large the model is. For example when I was using the new auto guesser to communicate if a larger model would fit on someone's system at a higher context, it originally made me think that the model had 60 layers. In reality it had less. This commit will take the layers of the model, and add 3 extra since that is the highest amount of additional layers a backend adds for the context handling (Most its 1). * Remove old max layer code Turns out at extreme contexts on new models such as Nemo the old code is incorrectly assuming we can offload everything. Its also redundant to check for max layers the old way since I capped our new guesses. Old code is now removed to simplify it, and it changed the nemo guess from 43 layers to 15 layers. Still looking into the 15 part, still seems to high but can be the old algo taking over. * Restructure algorithm into multiple parts As requested the different calculations in the algorithm now have their own sections and names so its easier to understand what parts are being used. This also fixes the typo that was caused as a result of it being harder to read, the typo made no difference during execution and the algorithm is confirmed to still work the same.	2024-07-22 15:47:31 +08:00
Concedo	e2b36aa6cf	fixed dry loading seq when not in use, set kcppt to -1 layers by default	2024-07-22 15:44:34 +08:00
Concedo	4d9ccddc2c	don't unpack pyd	2024-07-20 18:58:49 +08:00
Concedo	1a23d49c32	serve tags endpoint	2024-07-19 16:08:54 +08:00
Concedo	a998588f3a	improved estimation	2024-07-19 00:20:11 +08:00
Concedo	caab9cb8ae	fixed unwanted removal	2024-07-18 22:27:22 +08:00
BBC-Esq	621801da0e	Streamline misc (#1007 ) * fix typo and streamline a little * streamline togglehorde * oops	2024-07-18 22:25:38 +08:00
Concedo	8b0a9f7e56	remove keys, use tuple	2024-07-18 22:11:13 +08:00
BBC-Esq	7de1ebf897	Streamline with dictionaries (#1005 ) * dictionary #1 * dictionary #2	2024-07-18 22:05:30 +08:00
BBC-Esq	ce971a0f3d	Streamline with fstrings (#1006 ) * fstring #1 * fstring #2	2024-07-18 21:48:46 +08:00
Concedo	90c1bbbcb9	more url downoad support	2024-07-18 11:56:05 +08:00
Concedo	ad86b1aeb8	Implemented Kcpp Launch Templates (+1 squashed commits) Squashed commits: [5ea4c1de] wip integrating skcpps templates (+1 squashed commits) Squashed commits: [737daa7f] skcpps wip	2024-07-18 00:22:59 +08:00
Concedo	8ccc0144d2	ability to set -1 as gpulayers and determine at runtime (+1 squashed commits) Squashed commits: [594263c3] ability to set -1 as gpulayers and determine at runtime	2024-07-17 20:31:19 +08:00
Concedo	6c883a4803	dummy skcpps format	2024-07-17 18:35:27 +08:00
Concedo	eca7521c13	allowed embedded chat adapters	2024-07-17 18:08:43 +08:00
Concedo	5988243aee	fix wrong order, fix llava debug mode failure	2024-07-17 15:30:19 +08:00
Concedo	e99fa531a2	reorder items	2024-07-17 00:28:48 +08:00
Concedo	d775a419b2	updated lite with chat inject, added layer detect, added more console logging	2024-07-16 23:10:15 +08:00
Concedo	516fd35e93	error popups on python exits	2024-07-16 00:46:32 +08:00
Concedo	21179d675b	try ci for avx1, up ver (+2 squashed commit) Squashed commit: [74150175] up version [97b6163c] try ci for avx1 linux	2024-07-15 23:07:07 +08:00
teddybear082	c08309e773	Rudimentary support of openai chat completions tools calls (#981 ) * Rudimentary support of openai chat completions tools calls -Most small models are not smart enough to do this, especially a combined tool call + role play response, but at least this allows experimentation along these lines with koboldcpp * try to also support specified function and tool choice set to none Allow tools start and end messages to be configured in adapter Try to force grammar to specific function call if specified (untested) * ensure tools get listed right after user content and before end of user message content * omit grammars approach try prompting instead -use more extensive json parsing and direct instructions to models to try to obtain the desired result -seems to work relatively well with Mistral-7B-Instruct-v.0.3.Q4_K_M.gguf and neuralhermes-2.5-mistral-7b.Q4_K_M.gguf -question of whether this is too opinionated of an approach, should the instructions be things that can be passed with the prompt template? * add back llamacpp recommended json grammar Go back to adding grammar but use "official" llamacpp grammar only not a custom one just for openai * Tidy up, remove unnecessary globals * clarity * fix missing local variable error This worked to fix the error I mentioned on my last comment --------- Co-authored-by: Concedo <39025047+LostRuins@users.noreply.github.com>	2024-07-14 11:22:45 +08:00

1 2 3 4 5 ...

628 commits