Commit graph

255 commits

Author SHA1 Message Date
Concedo
d775a419b2 updated lite with chat inject, added layer detect, added more console logging 2024-07-16 23:10:15 +08:00
Llama
264575426e
Add the DRY dynamic N-gram anti-repetition sampler (#982)
* Add the DRY dynamic N-gram anti-repetition sampler

The DRY (Do not Repeat Yourself) sampler is a dynamic N-gram
repetition penalty that negatively scores tokens that would extend
sequences that already appear in the context.

See this discussion for a motivation and explanation of the sampler:
https://github.com/oobabooga/text-generation-webui/pull/5677

This implementation of DRY mostly aligns with the oobabooga version
with a few modifications. It uses a more efficient linear scanning
algorithm to identify repetitions. It also supports multi-token
sequence breakers. As a limitation, this implementation reuses
the rep pen range parameter, rather than introducing a new range
just for the DRY sampler.

There is a separate change to lite.koboldai.net that exposes the DRY
sampler parameters to KoboldAI Lite, so none of the embed files have
been changed as part of this commit.
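
A minimal sketch of the scoring idea (quadratic scan for clarity, without
sequence breakers; the names and signature are illustrative, not the actual
koboldcpp code):

    #include <cmath>
    #include <cstdint>
    #include <unordered_map>
    #include <vector>

    // Penalize tokens that would extend a sequence already present in the
    // context. Quadratic scan for clarity; the actual change uses a more
    // efficient linear algorithm and also honors sequence breakers.
    void dry_penalize(std::vector<float> &logits,
                      const std::vector<int32_t> &context,
                      float dry_multiplier,    // overall penalty strength
                      float dry_base,          // exponential growth per extra token
                      int   dry_allowed_len) { // repeats up to this length are free
        const int n = (int)context.size();
        if (dry_multiplier <= 0.0f || n < 2) return;

        // candidate token -> length of the longest repeat it would extend
        std::unordered_map<int32_t, int> longest;
        for (int i = 0; i + 1 < n; ++i) {
            // Longest common suffix between context[..i] and the full context.
            int len = 0;
            while (len <= i && context[i - len] == context[n - 1 - len]) ++len;
            if (len > 0) {
                int &best = longest[context[i + 1]];
                if (len > best) best = len;
            }
        }
        for (const auto &entry : longest) {
            const int32_t tok = entry.first;
            const int     len = entry.second;
            if (len >= dry_allowed_len && tok >= 0 && tok < (int32_t)logits.size()) {
                logits[tok] -= dry_multiplier * std::pow(dry_base, (float)(len - dry_allowed_len));
            }
        }
    }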

* Update default DRY parameters to match lite

* Improve DRY token debug logging

* Replace `and` with `&&` to fix MSVC compile error

Little-known fact: the C++98 standard defines `and` as an
alternative token for the `&&` operator (along with a number of
other alternative tokens and digraphs). MSVC does not allow these without using
the /Za option or including the <iso646.h> header. Change to
the more standard operator to make this code more portable.
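
In isolation, the issue looks like this (illustrative snippet, not the
actual change):

    bool in_range(int x, int lo, int hi) {
        // "and" is a standard alternative token for "&&", but MSVC rejects it
        // by default unless /Za is passed or <iso646.h> is included:
        //     return lo <= x and x <= hi;
        return lo <= x && x <= hi;   // portable spelling
    }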

* Fix MSVC compile error because log is not constexpr

Replace the compile-time computation with a floating-point
approximation of log(std::numeric_limits<float>::max()).
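
Roughly the shape of the fix (illustrative; the exact literal used in the
commit may differ):

    #include <cmath>
    #include <limits>

    // Rejected by MSVC: std::log is not constexpr there.
    // constexpr float MAX_LOGIT_EXP = std::log(std::numeric_limits<float>::max());

    // Workaround: a floating-point approximation of log(FLT_MAX).
    constexpr float MAX_LOGIT_EXP = 88.7228f;   // ln(3.402823e38)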

* Remove unused llama sampler variables and clean up sequence breakers.

* Remove KCPP_SAMPLER_DRY as a separate enum entry

The DRY sampler is effectively a repetition penalty and there
are very few reasons to apply it at a different place in sampler
order than the standard single-token penalty. There are also
multiple projects that have dependencies on the existing sampler
IDs, including KoboldAI, KoboldAI Lite, and SillyTavern. To
minimize the impact on those dependencies when adding the DRY sampler
to koboldcpp, it makes the most sense not to add a new ID for now
and instead to piggyback on KCPP_SAMPLER_REP_PEN. In the future,
if we find a use case for splitting the application of rep pen and DRY
we can introduce a new enum entry then.

* Add the dry_penalty_last_n to independently control DRY penalty range

This parameter follows the oobabooga semantics: it's optional, with a
default value of zero. Zero means that DRY scans the entire
context. Otherwise, it's the number of tokens from the end of the
context that are scanned for repetitions.
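
In other words (sketch; the names are illustrative):

    #include <algorithm>

    // Resolve how many tokens at the end of the context DRY should scan.
    // dry_penalty_last_n == 0 means the entire context.
    int resolve_dry_range(int dry_penalty_last_n, int n_ctx_tokens) {
        if (dry_penalty_last_n <= 0) return n_ctx_tokens;
        return std::min(dry_penalty_last_n, n_ctx_tokens);
    }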

* Limit sequence breaker lengths in tokens and characters

The core DRY sampler algorithm is linear in the context length, but
there are several parts of the sampler related to multi-token
sequence breakers that are potentially quadratic. Without any
restrictions, a suitably crafted context and sequence breaker could
result in a denial-of-service attack on a server running koboldcpp.
This change limits the maximum number of characters and the maximum
token length of a sequence breaker in order to limit the maximum
overhead associated with the sampler.

This change also improves some comments, adding more detail and
changing the wording to increase clarity.
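
A sketch of the kind of clamping described (the actual caps and helper
names in the commit may differ):

    #include <cstddef>
    #include <cstdint>
    #include <functional>
    #include <string>
    #include <vector>

    // Illustrative caps; the limits actually chosen in the commit may differ.
    constexpr size_t MAX_BREAKER_CHARS  = 128;
    constexpr size_t MAX_BREAKER_TOKENS = 10;

    // Clamp each sequence breaker in characters (before tokenizing) and in
    // tokens (after), so a hostile breaker cannot blow up the sampler's
    // per-step cost. `tokenize` is whatever tokenizer the caller already has.
    std::vector<std::vector<int32_t>> clamp_breakers(
            const std::vector<std::string> &breakers,
            const std::function<std::vector<int32_t>(const std::string &)> &tokenize) {
        std::vector<std::vector<int32_t>> out;
        for (const std::string &s : breakers) {
            std::vector<int32_t> toks = tokenize(s.substr(0, MAX_BREAKER_CHARS));
            if (toks.size() > MAX_BREAKER_TOKENS) toks.resize(MAX_BREAKER_TOKENS);
            out.push_back(std::move(toks));
        }
        return out;
    }
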
2024-07-13 19:08:23 +08:00
Concedo
0dd3907940 qwen2 warning FA 2024-07-09 20:53:25 +08:00
Concedo
d120c55e12 try to fix build errors (+1 squashed commits)
Squashed commits:

[27c28292] try fix build errors
2024-06-29 23:11:00 +08:00
Nexesenex
cb2336f5d9
Gradient rope formula with offsets (#938)
* Gradient rope formula with offsets

Positive offsets for Solar models,
negative offsets for Llama 1 and 2 models.

* Update gpttype_adapter.cpp

Remove L1/L2

* cleanup PR, skip llama models, keep prints behind debug mode

---------

Co-authored-by: Concedo <39025047+LostRuins@users.noreply.github.com>
2024-06-25 20:46:34 +08:00
Concedo
12abc41bb4 add llava separator 2024-06-22 21:55:13 +08:00
Concedo
13398477a1 fix ubatch, autoselect vulkan dgpu if possible 2024-06-22 00:23:46 +08:00
askmyteapot
1e72b65c38
GradientAI Auto ROPE Base calculation (#910)
* GradientAI Auto ROPE Base calculation

https://gradient.ai/blog/scaling-rotational-embeddings-for-long-context-language-models
has a formula that better fits the ideal rope scaling. 

Tested with Llama 3; verified the calculation is correct for Llama 2. Retains the logic for not scaling rope if below the trained context length.

* add in solar scaling logic

Solar-based models require the context values to be multiplied by 8. This is (I'm guessing) because the positions are based on a 32k context with a sliding window of 4k.
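
The commit message doesn't reproduce the formula itself; as a rough
stand-in, here is the common NTK-aware base scaling with the Solar
adjustment described above (the formula and all names are illustrative,
not the code from this PR):

    #include <cmath>

    float auto_rope_base(float original_base,   // e.g. 10000.0f
                         int   n_ctx_train,     // context the model was trained at
                         int   n_ctx_desired,   // context requested by the user
                         int   head_dim,        // rotary dimension, e.g. 128
                         bool  is_solar) {      // detected (per this PR) by a tensor count of 435
        // Solar adjustment: treat the trained context as 8x larger.
        int effective_train = is_solar ? n_ctx_train * 8 : n_ctx_train;
        if (n_ctx_desired <= effective_train) {
            return original_base;               // no scaling below trained context
        }
        float ratio = (float)n_ctx_desired / (float)effective_train;
        // NTK-aware stand-in: base' = base * ratio^(dim / (dim - 2)).
        return original_base * std::pow(ratio, (float)head_dim / (float)(head_dim - 2));
    }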

* Update model_adapter.h

Adding in the tensor count to identify Solar models based on a tensor count of 435.

* Update model_adapter.cpp

add in n_tensor count for solar identification

* refactor and cleanup GradientAI rope scaling

---------

Co-authored-by: Concedo <39025047+LostRuins@users.noreply.github.com>
2024-06-13 18:12:00 +08:00
Concedo
10b148f4c2 added skip bos for tokenize endpoint 2024-06-05 10:49:11 +08:00
Concedo
10a1d628ad added new binding fields for quant k and quant v 2024-06-03 14:35:59 +08:00
Concedo
4b664b3409 improved EOT handling 2024-05-19 22:04:51 +08:00
Concedo
1db3421c52 multiple minor fixes 2024-05-17 15:47:53 +08:00
Concedo
44443edfda rep pen slope works (+1 squashed commits)
Squashed commits:

[535ad566] experiment with rep pen range
2024-05-15 17:20:57 +08:00
Concedo
eff01660e4 re-added smart context due to people complaining 2024-05-11 17:25:03 +08:00
Concedo
dbe72b959e tidy up and refactor code to support old flags 2024-05-10 16:50:53 +08:00
Concedo
173c7272d5 EOS bypass mode added 2024-05-06 18:01:49 +08:00
Concedo
b48ea96ead removed unwanted debugs 2024-05-01 11:35:07 +08:00
Concedo
c65448d17a add flash attention toggle 2024-04-30 21:29:11 +08:00
Concedo
17a24d753c Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.devops/main-intel.Dockerfile
#	.devops/main-vulkan.Dockerfile
#	.devops/server-intel.Dockerfile
#	.devops/server-vulkan.Dockerfile
#	.github/workflows/bench.yml
#	.github/workflows/build.yml
#	.github/workflows/python-lint.yml
#	.github/workflows/server.yml
#	.gitignore
#	Makefile
#	README-sycl.md
#	README.md
#	ci/run.sh
#	flake.lock
#	llama.cpp
#	models/ggml-vocab-falcon.gguf
#	models/ggml-vocab-llama-spm.gguf
#	models/ggml-vocab-mpt.gguf
#	models/ggml-vocab-stablelm.gguf
#	models/ggml-vocab-starcoder.gguf
#	requirements.txt
#	scripts/check-requirements.sh
#	tests/CMakeLists.txt
#	tests/test-backend-ops.cpp
#	tests/test-grammar-integration.cpp
#	tests/test-tokenizer-0-bpe.py
#	tests/test-tokenizer-0-spm.py
#	tests/test-tokenizer-1-spm.cpp
2024-04-30 21:04:17 +08:00
Concedo
c230b78906 refactored a lot of code, remove bantokens, move it to api 2024-04-27 17:57:13 +08:00
Concedo
4ec8a9c57b expose stop reason in generation 2024-04-27 01:12:12 +08:00
Concedo
0871c7cbd1 Add additional debug info and increased ctx sizes, fixed a bug loading vulkan config 2024-04-25 23:07:37 +08:00
Concedo
cb2dbe9e9a improved rep pen speed 2024-04-24 21:29:21 +08:00
Concedo
b4d2031215 merged, added ability to render special tokens 2024-04-22 18:19:58 +08:00
Concedo
3170284fc3 added support for special tokens as stop sequences 2024-04-20 09:48:32 +08:00
Concedo
b01820dec7 auto rope scaling changes 2024-04-19 23:08:55 +08:00
Concedo
9a25d77cc1 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.github/workflows/build.yml
#	.github/workflows/docker.yml
#	Makefile
#	README-sycl.md
#	README.md
#	ci/run.sh
#	ggml-cuda.cu
#	ggml.c
#	grammars/README.md
#	scripts/get-wikitext-2.sh
#	scripts/hf.sh
#	scripts/sync-ggml.last
#	tests/test-backend-ops.cpp
#	tests/test-grammar-integration.cpp
#	tests/test-json-schema-to-grammar.cpp
2024-04-14 21:18:39 +08:00
Concedo
125f84aa02 fixed compiler warnings 2024-04-08 16:40:55 +08:00
Concedo
a530afa1e4 Merge commit '280345968d' into concedo_experimental
# Conflicts:
#	.devops/full-cuda.Dockerfile
#	.devops/llama-cpp-cuda.srpm.spec
#	.devops/main-cuda.Dockerfile
#	.devops/nix/package.nix
#	.devops/server-cuda.Dockerfile
#	.github/workflows/build.yml
#	CMakeLists.txt
#	Makefile
#	README.md
#	ci/run.sh
#	docs/token_generation_performance_tips.md
#	flake.lock
#	llama.cpp
#	scripts/LlamaConfig.cmake.in
#	scripts/compare-commits.sh
#	scripts/server-llm.sh
#	tests/test-quantize-fns.cpp
2024-04-07 20:27:17 +08:00
Concedo
2ef03c9de6 fix for physical batch size 2024-03-15 16:45:20 +08:00
Concedo
47c42fd45c fix for mamba processing 2024-03-13 13:27:46 +08:00
Concedo
484d90c330 llava support is now fully functioning 2024-03-11 15:55:32 +08:00
Concedo
d943c739a8 wip submitting of llava image to backend 2024-03-10 17:14:27 +08:00
Concedo
c08d7e5042 wip integration of llava 2024-03-10 11:18:47 +08:00
Concedo
7c64845dea Merge branch 'master' into concedo_experimental
# Conflicts:
#	.devops/nix/sif.nix
#	.github/workflows/build.yml
#	.github/workflows/python-check-requirements.yml
#	README-sycl.md
#	README.md
#	flake.lock
#	flake.nix
#	requirements/requirements-convert-hf-to-gguf.txt
#	scripts/compare-llama-bench.py
2024-03-04 15:33:33 +08:00
Concedo
2d9a90b652 try to fix ci compile errors (+1 squashed commits)
Squashed commits:

[d0d49663] fixed log multiline (+1 squashed commits)

Squashed commits:

[81a8befe] try to fix linux build error (+1 squashed commits)

Squashed commits:

[22850dda] try to fix build (+1 squashed commits)

Squashed commits:

[b8294611] missing type
2024-03-01 23:38:15 +08:00
Concedo
55af5446ad Merge branch 'master' into concedo_experimental
# Conflicts:
#	README.md
#	ci/run.sh
#	llama.cpp
#	scripts/sync-ggml.last
2024-03-01 17:41:37 +08:00
Concedo
524ba12abd refactor - do not use a copy buffer to store generation outputs, instead return a cpp allocated ptr 2024-02-29 14:02:20 +08:00
Concedo
f75e479db0 WIP on sdcpp integration 2024-02-29 00:40:07 +08:00
Concedo
ad638285de Merge branch 'master' into concedo_experimental
# Conflicts:
#	Makefile
#	README.md
#	flake.lock
#	ggml-cuda.cu
#	llama.cpp
#	tests/test-backend-ops.cpp
#	tests/test-quantize-fns.cpp
2024-02-28 13:41:35 +08:00
Concedo
d47e13c892 fixed compile error: GGML_BACKEND_TYPE_GPU (+1 squashed commits)
Squashed commits:

[00ca282a] fixed compile error: LLAMA_SPLIT_MODE_ROW
2024-02-26 10:55:35 +08:00
Concedo
b5ba6c9ece test to see if Ofast for ggml library plus batching adjustments fixes speed regression for ggmlv1 models 2024-02-25 21:14:53 +08:00
Concedo
6d6d79f359 fixed a horrible bug in thread counts 2024-02-22 23:57:40 +08:00
Concedo
8d5e25008f Merge branch 'master' into concedo_experimental
# Conflicts:
#	CMakeLists.txt
#	Makefile
#	README.md
#	ci/run.sh
#	tests/test-tokenizer-0-falcon.cpp
#	tests/test-tokenizer-0-llama.cpp
#	tests/test-tokenizer-1-bpe.cpp
#	tests/test-tokenizer-1-llama.cpp
2024-02-17 15:22:05 +08:00
Concedo
066e73d769 context shift even more lenient 2024-02-11 18:30:38 +08:00
Concedo
590af480ab contextshift more forgiving 2024-02-10 20:49:21 +08:00
Concedo
35111ce01a row split mode is now a toggle 2024-02-09 18:35:58 +08:00
Concedo
992eea71d7 fixes for vulkan multigpu 2024-02-09 14:42:27 +08:00
Concedo
fe424a5466 tensor split active text 2024-02-09 12:02:23 +08:00
Concedo
4cd571db89 vulkan multigpu, show uptime 2024-02-08 16:54:38 +08:00