Commit graph

289 commits

Author SHA1 Message Date
Concedo
fe5479f286 unify antislop and token bans 2024-10-10 18:21:07 +08:00
Concedo
9b614d46bd antislop sampler working 2024-10-09 16:33:04 +08:00
Concedo
36e9bac98f wip anti slop sampler 2024-10-09 13:34:47 +08:00
Concedo
f78f8d3d45 wip anti slop 2024-10-07 23:18:13 +08:00
Concedo
65f3c68399 wip antislop 2024-10-07 20:19:22 +08:00
Concedo
740c5e01cb added token delay feature 2024-10-07 19:45:51 +08:00
Concedo
3e8bb10e2d wip on rewind function 2024-10-06 16:21:03 +08:00
Concedo
c38d1ecc8d update templates, fix rwkv 2024-09-22 01:32:12 +08:00
Concedo
53bf0fb32d removed openblas backend, merged into CPU (with llamafile for BLAS). GPU backend is now automatically selected when running from CLI unless noblas is specified. 2024-09-15 19:21:52 +08:00
Concedo
e44ddf26ef Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.github/workflows/build.yml
#	.github/workflows/server.yml
#	CMakeLists.txt
#	Makefile
#	examples/embedding/embedding.cpp
#	examples/imatrix/imatrix.cpp
#	examples/llama-bench/llama-bench.cpp
#	examples/llava/MobileVLM-README.md
#	examples/parallel/parallel.cpp
#	examples/perplexity/perplexity.cpp
#	examples/quantize/CMakeLists.txt
#	examples/server/README.md
#	examples/speculative/speculative.cpp
#	tests/test-backend-ops.cpp
2024-09-13 16:17:24 +08:00
Concedo
7bdac9bc44 prevent shifting on rwkv 2024-09-11 20:22:45 +08:00
Concedo
eee67281be move kcpp params out 2024-09-10 16:30:12 +08:00
Concedo
fc7fe2e7a0 allow rwkv6 to run although it's broken 2024-09-09 20:50:58 +08:00
Concedo
b63158005f All samplers moved to kcpp side 2024-09-09 18:14:11 +08:00
Concedo
12fd16bfd4 Merge commit 'df270ef745' into concedo_experimental
# Conflicts:
#	Makefile
#	common/CMakeLists.txt
#	common/common.h
#	common/sampling.cpp
#	common/sampling.h
#	examples/infill/infill.cpp
#	examples/llama-bench/llama-bench.cpp
#	examples/quantize-stats/quantize-stats.cpp
#	examples/server/server.cpp
#	include/llama.h
#	src/llama-sampling.cpp
#	src/llama-sampling.h
#	src/llama.cpp
#	tests/test-grammar-integration.cpp
#	tests/test-grammar-parser.cpp
#	tests/test-json-schema-to-grammar.cpp
#	tests/test-llama-grammar.cpp
#	tests/test-sampling.cpp
2024-09-09 17:10:08 +08:00
Concedo
c78690737c fix for DRY segfault on unicode character substring tokenization 2024-09-08 18:25:00 +08:00
Concedo
d220495dd4 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.devops/full-cuda.Dockerfile
#	.devops/llama-cli-cuda.Dockerfile
#	.devops/llama-server-cuda.Dockerfile
#	.devops/llama-server-intel.Dockerfile
#	.devops/llama-server-rocm.Dockerfile
#	.devops/llama-server-vulkan.Dockerfile
#	.devops/llama-server.Dockerfile
#	.github/workflows/docker.yml
#	docs/docker.md
#	examples/llama-bench/llama-bench.cpp
#	flake.lock
#	ggml/include/ggml.h
#	ggml/src/CMakeLists.txt
#	scripts/sync-ggml.last
#	src/llama.cpp
#	tests/test-backend-ops.cpp
#	tests/test-grad0.cpp
#	tests/test-rope.cpp
2024-08-30 10:37:39 +08:00
Concedo
b78a637da5 try to optimize context shifting 2024-08-26 23:07:31 +08:00
Concedo
cca3c4c78b xtc fixes 2024-08-22 23:18:46 +08:00
Concedo
fc2545dc83 fixed a typo 2024-08-22 00:25:56 +08:00
Concedo
5bf527a6ae added xtc sampler 2024-08-21 23:57:15 +08:00
Concedo
1a7ecd55e6 timing for init step, clip for vulkan 2024-08-21 18:14:53 +08:00
Concedo
cd69ab218e fixed DRY 2024-08-21 17:01:28 +08:00
Concedo
6a4becb731 dry is still buggy because token indexes are wrong 2024-08-21 00:59:26 +08:00
Concedo
db6ef8d1e1 revert dry state reset 2024-08-20 22:22:21 +08:00
Concedo
c1ae350e5b fixed race condition when generating 2024-08-20 20:17:55 +08:00
Concedo
e12ab53488 force clear some DRY state vars on new generation - not sure if this helps 2024-08-14 21:35:39 +08:00
Concedo
689a17d756 always prefilter to 5k logits 2024-08-12 22:27:06 +08:00
Concedo
729eb1e552 no fast forward for empty prompt 2024-07-27 16:29:35 +08:00
Concedo
eb5b4d0186 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	Makefile
#	Package.swift
#	src/CMakeLists.txt
#	src/llama.cpp
#	tests/test-grammar-integration.cpp
#	tests/test-llama-grammar.cpp
2024-07-23 23:20:32 +08:00
Concedo
e2b36aa6cf fixed dry loading seq when not in use, set kcppt to -1 layers by default 2024-07-22 15:44:34 +08:00
Concedo
0ecf13fc13 updated lite, extra error logging 2024-07-21 17:55:47 +08:00
Concedo
24b9616344 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.devops/full-cuda.Dockerfile
#	.devops/full-rocm.Dockerfile
#	.devops/full.Dockerfile
#	.devops/llama-cli-cuda.Dockerfile
#	.devops/llama-cli-intel.Dockerfile
#	.devops/llama-cli-rocm.Dockerfile
#	.devops/llama-cli-vulkan.Dockerfile
#	.devops/llama-cli.Dockerfile
#	.devops/llama-server-cuda.Dockerfile
#	.devops/llama-server-intel.Dockerfile
#	.devops/llama-server-rocm.Dockerfile
#	.devops/llama-server-vulkan.Dockerfile
#	.devops/llama-server.Dockerfile
#	CMakeLists.txt
#	CONTRIBUTING.md
#	Makefile
#	ggml/CMakeLists.txt
#	ggml/src/CMakeLists.txt
#	requirements.txt
#	src/llama.cpp
#	tests/test-backend-ops.cpp
2024-07-19 14:23:33 +08:00
Concedo
5988243aee fix wrong order, fix llava debug mode failure 2024-07-17 15:30:19 +08:00
Concedo
d775a419b2 updated lite with chat inject, added layer detect, added more console logging 2024-07-16 23:10:15 +08:00
Llama
264575426e
Add the DRY dynamic N-gram anti-repetition sampler (#982)
* Add the DRY dynamic N-gram anti-repetition sampler

The DRY (Do not Repeat Yourself) sampler is a dynamic N-gram
repetition penalty that negatively scores tokens that would extend
sequences that already appear in the context.

See this discussion for a motivation and explanation of the sampler:
https://github.com/oobabooga/text-generation-webui/pull/5677

This implementation of DRY mostly aligns with the oobabooga version
with a few modifications. It uses a more efficient linear scanning
algorithm to identify repetitions. It also supports multi-token
sequence breakers. As a limitation, this implementation reuses
the rep pen range parameter, rather than introducing a new range
just for the DRY sampler.
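
As a rough illustration of the scoring rule described above, here is a minimal, naive O(n^2) sketch of the core penalty. The real implementation uses a linear-time scan and also handles sequence breakers; all names and signatures below are illustrative, not the actual koboldcpp API.

```cpp
#include <cstdint>
#include <cmath>
#include <unordered_map>
#include <vector>

// Naive DRY-style penalty sketch: tokens that would extend a sequence already
// seen in the context are penalized exponentially in the match length.
void dry_penalize(const std::vector<int32_t> & ctx,    // context tokens so far
                  std::vector<float>          & logits, // one logit per token id
                  float dry_multiplier, float dry_base, int dry_allowed_length) {
    const int n = (int) ctx.size();
    if (n < 2) return;
    // For every earlier position, measure how long a suffix of the context is
    // repeated there, and remember the token that followed that repetition.
    std::unordered_map<int32_t, int> best; // token id -> longest match length
    for (int i = 0; i < n - 1; ++i) {
        int len = 0;
        while (len <= i && ctx[i - len] == ctx[n - 1 - len]) {
            ++len;
        }
        if (len > 0) {
            const int32_t next = ctx[i + 1];
            auto it = best.find(next);
            if (it == best.end() || it->second < len) {
                best[next] = len;
            }
        }
    }
    // Tokens that would continue a repetition of at least the allowed length
    // are scored down: penalty = multiplier * base^(len - allowed_length).
    for (const auto & kv : best) {
        if (kv.second >= dry_allowed_length) {
            logits[kv.first] -= dry_multiplier *
                std::pow(dry_base, (float)(kv.second - dry_allowed_length));
        }
    }
}
```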

There is a separate change to lite.koboldai.net that exposes the DRY
sampler parameters to KoboldAI Lite, so none of the embed files have
been changed as part of this commit.

* Update default DRY parameters to match lite

* Improve DRY token debug logging

* Replace `and` with `&&` to fix MSVC compile error

Little-known fact: the C++98 standard defines `and` as an
alternative token for the `&&` operator (along with a bunch
of other digraphs). MSVC does not allow these without using
the /Za option or including the <iso646.h> header. Change to
the more standard operator to make this code more portable.
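
For context, a tiny hedged example of the portability issue (function names are made up):

```cpp
#include <iso646.h> // MSVC needs this header (or /Za) for alternative tokens

// Compiles under MSVC only because of the include above:
bool both_alt(bool a, bool b) { return a and b; }

// Portable everywhere without extra headers or compiler flags:
bool both_std(bool a, bool b) { return a && b; }
```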

* Fix MSVC compile error because log is not constexpr

Replace the compile-time computation with a floating-point
approximation of log(std::numeric_limits<float>::max()).
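
A hedged sketch of the change; the constant name is made up, and the literal is an approximation of ln(FLT_MAX):

```cpp
#include <cmath>
#include <limits>

// Rejected by MSVC: std::log is not constexpr in standard C++ (until C++26),
// even though some compilers accept this as an extension:
// constexpr float MAX_LOG = std::log(std::numeric_limits<float>::max());

// Portable replacement: a precomputed approximation of log(FLT_MAX).
static const float MAX_LOG = 88.7228f;
```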

* Remove unused llama sampler variables and clean up sequence breakers.

* Remove KCPP_SAMPLER_DRY as a separate enum entry

The DRY sampler is effectively a repetition penalty and there
are very few reasons to apply it at a different place in sampler
order than the standard single-token penalty. There are also
multiple projects that depend on the existing sampler IDs,
including KoboldAI, KoboldAI Lite, and SillyTavern. To minimize
the impact on those dependents when adding the DRY sampler to
koboldcpp, it makes the most sense not to add a new ID for now,
and instead to piggyback on KCPP_SAMPLER_REP_PEN. In the future,
if we find a use case for splitting the application of rep pen and DRY,
we can introduce a new enum entry then.

* Add the dry_penalty_last_n to independently control DRY penalty range

This parameter follows the oobabooga semantics: it's optional, with a
default value of zero. Zero means that DRY should sample the entire
context. Otherwise, it's the number of tokens from the end of the
context that are scanned for repetitions.
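
A hedged sketch of that range rule (names are illustrative):

```cpp
#include <algorithm>

// 0 (the default) means scan the whole context; otherwise scan only the
// last dry_penalty_last_n tokens.
int dry_effective_range(int dry_penalty_last_n, int n_ctx_tokens) {
    if (dry_penalty_last_n <= 0) {
        return n_ctx_tokens;
    }
    return std::min(dry_penalty_last_n, n_ctx_tokens);
}
```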

* Limit sequence breaker lengths in tokens and characters

The core DRY sampler algorithm is linear in the context length, but
there are several parts of the sampler related to multi-token
sequence breakers that are potentially quadratic. Without any
restrictions, a suitably crafted context and sequence breaker could
result in a denial-of-service attack on a server running koboldcpp.
This change limits the maximum number of characters and the maximum
token length of a sequence breaker in order to limit the maximum
overhead associated with the sampler.

This change also improves some comments, adding more detail and
changing the wording to increase clarity.
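
A hedged sketch of the clamping idea; the concrete limits and names below are invented for illustration and are not the values used in the commit:

```cpp
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

constexpr size_t MAX_BREAKER_CHARS  = 40; // hypothetical character cap
constexpr size_t MAX_BREAKER_TOKENS = 10; // hypothetical token cap

// Bound the cost of a sequence breaker regardless of what the client sends.
std::vector<int32_t> clamp_breaker(
        std::string breaker,
        const std::function<std::vector<int32_t>(const std::string &)> & tokenize) {
    if (breaker.size() > MAX_BREAKER_CHARS) {
        breaker.resize(MAX_BREAKER_CHARS);    // cap character length first
    }
    std::vector<int32_t> toks = tokenize(breaker);
    if (toks.size() > MAX_BREAKER_TOKENS) {
        toks.resize(MAX_BREAKER_TOKENS);      // then cap token count
    }
    return toks;
}
```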
2024-07-13 19:08:23 +08:00
Concedo
0dd3907940 qwen2 warning FA 2024-07-09 20:53:25 +08:00
Concedo
d120c55e12 try to fix build errors (+1 squashed commits)
Squashed commits:

[27c28292] try fix build errors
2024-06-29 23:11:00 +08:00
Nexesenex
cb2336f5d9
Gradient rope formula with offsets (#938)
* Gradient rope formula with offsets

Positive for Solar models
Negative for Llama 1 and 2 models

* Update gpttype_adapter.cpp

Remove L1/L2

* cleanup PR, skip llama models, keep prints behind debug mode

---------

Co-authored-by: Concedo <39025047+LostRuins@users.noreply.github.com>
2024-06-25 20:46:34 +08:00
Concedo
12abc41bb4 add llava separator 2024-06-22 21:55:13 +08:00
Concedo
13398477a1 fix ubatch, autoselect vulkan dgpu if possible 2024-06-22 00:23:46 +08:00
askmyteapot
1e72b65c38
GradientAI Auto ROPE Base calculation (#910)
* GradientAI Auto ROPE Base calculation

https://gradient.ai/blog/scaling-rotational-embeddings-for-long-context-language-models
describes a formula that better fits the ideal RoPE scaling (see the sketch at the end of this message).

Tested with Llama 3; checked that the calculation is correct for Llama 2. Retains the logic for not scaling RoPE if below the trained context length.

* add in solar scaling logic

Solar-based models require the context values to be multiplied by 8. This is (I'm guessing) because the positions are based on a 32k context, but with a sliding window of 4k.

* Update model_adapter.h

Add a tensor count check to identify Solar models, which have 435 tensors.

* Update model_adapter.cpp

add in n_tensor count for solar identification

* refactor and cleanup GradientAI rope scaling

---------

Co-authored-by: Concedo <39025047+LostRuins@users.noreply.github.com>
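
The sketch referenced above: a hedged reconstruction of an NTK-aware, gradient-style auto RoPE base calculation. The exact formula, constants, and names used in koboldcpp may differ; this only illustrates the shape of the logic (scale-dependent base growth, the Solar ×8 context adjustment, and no scaling below the trained context):

```cpp
#include <cmath>

// Hedged sketch, not the actual koboldcpp implementation.
float auto_rope_base(float base,        // model's original rope base (theta)
                     float target_ctx,  // requested context length
                     float train_ctx,   // context length the model was trained at
                     float head_dim,    // rotary embedding dimension
                     bool  is_solar) {
    if (is_solar) {
        // Solar positions assume a 32k context with a 4k sliding window.
        target_ctx *= 8.0f;
    }
    if (target_ctx <= train_ctx) {
        return base; // no scaling below the trained context length
    }
    const float scale = target_ctx / train_ctx;
    // NTK-aware form: base' = base * scale^(d / (d - 2))
    return base * std::pow(scale, head_dim / (head_dim - 2.0f));
}
```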
2024-06-13 18:12:00 +08:00
Concedo
10b148f4c2 added skip bos for tokenize endpoint 2024-06-05 10:49:11 +08:00
Concedo
10a1d628ad added new binding fields for quant k and quant v 2024-06-03 14:35:59 +08:00
Concedo
4b664b3409 improved EOT handling 2024-05-19 22:04:51 +08:00
Concedo
1db3421c52 multiple minor fixes 2024-05-17 15:47:53 +08:00
Concedo
44443edfda rep pen slope works (+1 squashed commits)
Squashed commits:

[535ad566] experiment with rep pen range
2024-05-15 17:20:57 +08:00
Concedo
eff01660e4 re-added smart context due to people complaining 2024-05-11 17:25:03 +08:00
Concedo
dbe72b959e tidy up and refactor code to support old flags 2024-05-10 16:50:53 +08:00
Concedo
173c7272d5 EOS bypass mode added 2024-05-06 18:01:49 +08:00