* Henk's version of the fsize algo
This is the current version of the fsize algo, based on Pyro's algorithm with added padding.
* Update koboldcpp.py
Add debug prints and bump the padding
* Pyro version
Pyro didn't agree with my version, so here is a test with his version
* Polish new auto layers
This one cleans up some debug prints, restores the max behavior in case the old algorithm suits someone better, and changes the 200 layers to the actual maximum for all backends so users get a better feel for the models.
* Remove 10% margin
The new version has been much more accurate; on low-VRAM systems I only notice a 1-layer difference. Getting rid of the margin so users can test whether it still stays within safe limits as I expect. On a 6GB system this results in 18 layers instead of 17 being chosen for Tiefighter.
* Restore 500MB buffer to play it safe
I'm not confident that most people keep their VRAM usage under 1GB with background tasks. For now, since we are aiming to have this work on as many systems as possible, I'm restoring the 500MB of extra space, since the fsize inflation is gone.
* Cap layers at maximum
When using the auto predict we don't want to go over the maximum number of layers. Users should have a realistic feel for how large the model is.
For example, when I was using the new auto guesser to communicate whether a larger model would fit on someone's system at a higher context, it originally made me think that the model had 60 layers. In reality it had fewer.
This commit takes the layer count of the model and adds 3 extra, since that is the highest number of additional layers a backend adds for context handling (for most it's 1).
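The actual change lives in koboldcpp.py; the snippet below is only a hedged sketch of the arithmetic, with hypothetical names, to show how the guess gets capped.

```cpp
#include <algorithm>
#include <cstdio>

// Hypothetical sketch of the cap (the real logic is in koboldcpp.py): never
// suggest more GPU layers than the model actually has, plus an allowance of 3
// for the extra context-handling layers a backend may add.
static int cap_layer_guess(int estimated_layers, int model_layer_count) {
    return std::min(estimated_layers, model_layer_count + 3);
}

int main() {
    std::printf("%d\n", cap_layer_guess(60, 40));  // prints 43, not the inflated 60
    return 0;
}
```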
* Remove old max layer code
Turns out that at extreme contexts on new models such as Nemo, the old code incorrectly assumes we can offload everything. It's also redundant to check for max layers the old way since I capped our new guesses.
The old code is now removed to simplify things, and this changed the Nemo guess from 43 layers to 15 layers. Still looking into the 15 part; it still seems too high, but that may be the old algorithm taking over.
* Restructure algorithm into multiple parts
As requested, the different calculations in the algorithm now have their own sections and names so it's easier to understand which parts are being used. This also fixes a typo that was caused by the code being harder to read; the typo made no difference during execution and the algorithm is confirmed to still work the same.
* Rudimentary support for OpenAI chat completions tool calls
-Most small models are not smart enough to do this, especially a combined tool call + role-play response, but at least this allows experimentation along these lines with koboldcpp
* try to also support a specified function and tool_choice set to none
Allow tools start and end messages to be configured in the adapter
Try to force the grammar to a specific function call if one is specified (untested)
* ensure tools get listed right after user content and before end of user message content
* omit grammars approach, try prompting instead
-use more extensive JSON parsing and direct instructions to models to try to obtain the desired result
-seems to work relatively well with Mistral-7B-Instruct-v.0.3.Q4_K_M.gguf and neuralhermes-2.5-mistral-7b.Q4_K_M.gguf
-question of whether this is too opinionated an approach; should the instructions be something that can be passed with the prompt template?
* add back llamacpp recommended json grammar
Go back to adding a grammar, but use only the "official" llamacpp grammar, not a custom one just for OpenAI
* Tidy up, remove unnecessary globals
* clarity
* fix missing local variable error
This worked to fix the error I mentioned in my last comment
---------
Co-authored-by: Concedo <39025047+LostRuins@users.noreply.github.com>
* Add the DRY dynamic N-gram anti-repetition sampler
The DRY (Do not Repeat Yourself) sampler is a dynamic N-gram
repetition penalty that negatively scores tokens that would extend
sequences that already appear in the context.
See this discussion for a motivation and explanation of the sampler:
https://github.com/oobabooga/text-generation-webui/pull/5677
This implementation of DRY mostly aligns with the oobabooga version
with a few modifications. It uses a more efficient linear scanning
algorithm to identify repetitions. It also supports multi-token
sequence breakers. As a limitation, this implementation reuses
the rep pen range parameter, rather than introducing a new range
just for the DRY sampler.
There is a separate change to lite.koboldai.net that exposes the DRY
sampler parameters to KoboldAI Lite, so none of the embed files have
been changed as part of this commit.
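For reference, here is a simplified sketch of the scoring idea. It is not the koboldcpp implementation: names are hypothetical, sequence breakers are omitted, and it uses a quadratic scan instead of the linear one described above.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdio>
#include <unordered_map>
#include <vector>

// For each candidate token, find the longest tail of the current context
// that, extended by the candidate, would repeat a sequence that already
// appears earlier in the context, then penalize exponentially in that length.
static std::unordered_map<int, float> dry_penalties(
        const std::vector<int> &context,     // token ids generated so far
        const std::vector<int> &candidates,  // token ids under consideration
        float multiplier, float base, size_t allowed_length) {
    std::unordered_map<int, float> penalty;  // token id -> amount subtracted from its logit
    const size_t n = context.size();
    for (int cand : candidates) {
        size_t best = 0;
        for (size_t i = 0; i < n; ++i) {
            if (context[i] != cand) continue;  // earlier occurrence of the candidate
            // Count how many tokens before position i match the tail of the context.
            size_t len = 0;
            while (len < i && context[i - 1 - len] == context[n - 1 - len]) {
                ++len;
            }
            best = std::max(best, len);
        }
        if (best >= allowed_length && best > 0) {
            penalty[cand] = multiplier * std::pow(base, (float)(best - allowed_length));
        }
    }
    return penalty;
}

int main() {
    // Context "A B C A B": generating C would extend the repeated "A B" into
    // "A B C", which already occurred, so C gets penalized.
    std::vector<int> context = {1, 2, 3, 1, 2};
    std::vector<int> candidates = {1, 2, 3};
    auto p = dry_penalties(context, candidates, 0.8f, 1.75f, 2);
    for (const auto &kv : p) {
        std::printf("token %d penalty %.3f\n", kv.first, kv.second);
    }
    return 0;
}
```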
* Update default DRY parameters to match lite
* Improve DRY token debug logging
* Replace `and` with `&&` to fix MSVC compile error
Little-known fact: the C++98 standard defines `and` as an
alternative token for the `&&` operator (along with a number of
other alternative tokens and digraphs). MSVC does not allow these
without using the /Za option or including the <iso646.h> header.
Change to the more conventional operator to make this code more
portable.
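A minimal illustration of the change, with hypothetical variable names:

```cpp
#include <cstdio>

// `and` is valid ISO C++, but MSVC's default mode rejects it, so the
// conventional operator is used instead.
static bool dry_enabled(float dry_multiplier, float dry_base) {
    return dry_multiplier > 0.0f && dry_base > 1.0f;  // was: `... > 0.0f and dry_base > 1.0f`
}

int main() {
    std::printf("%d\n", dry_enabled(0.8f, 1.75f));  // prints 1
    return 0;
}
```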
* Fix MSVC compile error because log is not constexpr
Replace the compile-time computation with a floating-point
approximation of log(std::numeric_limits<float>::max()).
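A sketch of the workaround; the constant name is hypothetical, only the value is meaningful:

```cpp
#include <cmath>
#include <cstdio>
#include <limits>

// std::log is not constexpr in standard C++, so rather than computing
// log(FLT_MAX) in a constant expression, a precomputed approximation is used.
static const float kFloatMaxLog = 88.7228391f;  // ~ log(std::numeric_limits<float>::max())

int main() {
    std::printf("%f vs %f\n", kFloatMaxLog, std::log(std::numeric_limits<float>::max()));
    return 0;
}
```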
* Remove unused llama sampler variables and clean up sequence breakers.
* Remove KCPP_SAMPLER_DRY as a separate enum entry
The DRY sampler is effectively a repetition penalty and there
are very few reasons to apply it at a different place in sampler
order than the standard single-token penalty. There are also
multiple projects that have dependencies on the existing sampler
IDs, including KoboldAI, KoboldAI Lite, and Silly Tavern. In order
to minimize the impact on those dependencies when adding the DRY
sampler to koboldcpp, it makes the most sense not to add a new ID
for now, and instead to piggyback on KCPP_SAMPLER_REP_PEN. In the
future, if we find a use case for splitting the application of rep
pen and DRY, we can introduce a new enum entry then.
* Add the dry_penalty_last_n to independently control DRY penalty range
This parameter follows the oobabooga semantics: it's optional, with a
default value of zero. Zero means that DRY should scan the entire
context. Otherwise, it's the number of tokens from the end of the
context that are scanned for repetitions.
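A small sketch of the range semantics; the function name is hypothetical:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdio>

// Zero means the whole context is scanned; otherwise only the last
// dry_penalty_last_n tokens are considered for repetition matching.
static size_t dry_scan_range(size_t context_len, int dry_penalty_last_n) {
    if (dry_penalty_last_n <= 0) {
        return context_len;  // 0 => scan the entire context
    }
    return std::min(context_len, (size_t)dry_penalty_last_n);
}

int main() {
    std::printf("%zu %zu\n", dry_scan_range(4096, 0), dry_scan_range(4096, 1024));  // 4096 1024
    return 0;
}
```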
* Limit sequence breaker lengths in tokens and characters
The core DRY sampler algorithm is linear in the context length, but
there are several parts of the sampler related to multi-token
sequence breakers that are potentially quadratic. Without any
restrictions, a suitably crafted context and sequence breaker could
result in a denial-of-service attack on a server running koboldcpp.
This change limits the maximum number of characters and the maximum
token length of a sequence breaker in order to limit the maximum
overhead associated with the sampler.
This change also improves some comments, adding more detail and
changing the wording to increase clarity.
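A sketch of the kind of caps involved; the limit values and names here are assumptions for illustration, not the actual koboldcpp constants:

```cpp
#include <cstddef>
#include <cstdio>
#include <string>
#include <vector>

// Sequence breakers are user-supplied strings, so both their character length
// and the number of tokens they expand to are bounded, keeping the per-breaker
// overhead fixed instead of letting a crafted input blow up the quadratic parts.
static const size_t kMaxBreakerChars  = 40;  // assumed character cap
static const size_t kMaxBreakerTokens = 10;  // assumed token cap

static std::string clamp_breaker_text(const std::string &breaker) {
    return breaker.size() <= kMaxBreakerChars ? breaker : breaker.substr(0, kMaxBreakerChars);
}

static std::vector<int> clamp_breaker_tokens(std::vector<int> tokens) {
    if (tokens.size() > kMaxBreakerTokens) {
        tokens.resize(kMaxBreakerTokens);  // keep only the leading tokens
    }
    return tokens;
}

int main() {
    std::string breaker(200, 'x');  // an adversarially long breaker
    std::printf("%zu chars after clamping\n", clamp_breaker_text(breaker).size());  // 40
    return 0;
}
```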