Commit graph

255 commits

Author SHA1 Message Date
Concedo
d775a419b2 updated lite with chat inject, added layer detect, added more console logging 2024-07-16 23:10:15 +08:00
Llama
264575426e
Add the DRY dynamic N-gram anti-repetition sampler (#982)
* Add the DRY dynamic N-gram anti-repetition sampler

The DRY (Do not Repeat Yourself) sampler is a dynamic N-gram
repetition penalty that negatively scores tokens that would extend
sequences that already appear in the context.

See this discussion for a motivation and explanation of the sampler:
https://github.com/oobabooga/text-generation-webui/pull/5677

This implementation of DRY mostly aligns with the oobabooga version
with a few modifications. It uses a more efficient linear scanning
algorithm to identify repetitions. It also supports multi-token
sequence breakers. As a limitation, this implementation reuses
the rep pen range parameter, rather than introducing a new range
just for the DRY sampler.

There is a separate change to lite.koboldai.net that exposes the DRY
sampler parameters to KoboldAI Lite, so none of the embed files have
been changed as part of this commit.
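
A minimal sketch of the scoring idea (quadratic scan for clarity, without
sequence breakers; the names and signature are illustrative, not the actual
koboldcpp code):

    #include <cmath>
    #include <cstdint>
    #include <unordered_map>
    #include <vector>

    // Penalize tokens that would extend a sequence already present in the
    // context. Quadratic scan for clarity; the actual change uses a more
    // efficient linear algorithm and also honors sequence breakers.
    void dry_penalize(std::vector<float> &logits,
                      const std::vector<int32_t> &context,
                      float dry_multiplier,    // overall penalty strength
                      float dry_base,          // exponential growth per extra token
                      int   dry_allowed_len) { // repeats up to this length are free
        const int n = (int)context.size();
        if (dry_multiplier <= 0.0f || n < 2) return;

        // candidate token -> length of the longest repeat it would extend
        std::unordered_map<int32_t, int> longest;
        for (int i = 0; i + 1 < n; ++i) {
            // Longest common suffix between context[..i] and the full context.
            int len = 0;
            while (len <= i && context[i - len] == context[n - 1 - len]) ++len;
            if (len > 0) {
                int &best = longest[context[i + 1]];
                if (len > best) best = len;
            }
        }
        for (const auto &entry : longest) {
            const int32_t tok = entry.first;
            const int     len = entry.second;
            if (len >= dry_allowed_len && tok >= 0 && tok < (int32_t)logits.size()) {
                logits[tok] -= dry_multiplier * std::pow(dry_base, (float)(len - dry_allowed_len));
            }
        }
    }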

* Update default DRY parameters to match lite

* Improve DRY token debug logging

* Replace `and` with `&&` to fix MSVC compile error

Little-known fact: the C++98 standard defines `and` as an
alternative token for the `&&` operator (along with a number of
other alternative tokens and digraphs). MSVC does not allow these without using
the /Za option or including the <iso646.h> header. Change to
the more standard operator to make this code more portable.
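
In isolation, the issue looks like this (illustrative snippet, not the
actual change):

    bool in_range(int x, int lo, int hi) {
        // "and" is a standard alternative token for "&&", but MSVC rejects it
        // by default unless /Za is passed or <iso646.h> is included:
        //     return lo <= x and x <= hi;
        return lo <= x && x <= hi;   // portable spelling
    }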

* Fix MSVC compile error because log is not constexpr

Replace the compile-time computation with a floating-point
approximation of log(std::numeric_limits<float>::max()).
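
Roughly the shape of the fix (illustrative; the exact literal used in the
commit may differ):

    #include <cmath>
    #include <limits>

    // Rejected by MSVC: std::log is not constexpr there.
    // constexpr float MAX_LOGIT_EXP = std::log(std::numeric_limits<float>::max());

    // Workaround: a floating-point approximation of log(FLT_MAX).
    constexpr float MAX_LOGIT_EXP = 88.7228f;   // ln(3.402823e38)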

* Remove unused llama sampler variables and clean up sequence breakers.

* Remove KCPP_SAMPLER_DRY as a separate enum entry

The DRY sampler is effectively a repetition penalty and there
are very few reasons to apply it at a different place in sampler
order than the standard single-token penalty. There are also
multiple projects that have dependencies on the existing sampler
IDs, including KoboldAI, KoboldAI Lite, and SillyTavern. To
minimize the impact on those dependencies when adding the DRY sampler
to koboldcpp, it makes the most sense not to add a new ID for now
and instead to piggyback on KCPP_SAMPLER_REP_PEN. In the future,
if we find a use case for splitting the application of rep pen and DRY
we can introduce a new enum entry then.

* Add the dry_penalty_last_n to independently control DRY penalty range

This parameter follows the oobabooga semantics: it's optional, with a
default value of zero. Zero means that DRY scans the entire
context. Otherwise, it's the number of tokens from the end of the
context that are scanned for repetitions.
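
In other words (sketch; the names are illustrative):

    #include <algorithm>

    // Resolve how many tokens at the end of the context DRY should scan.
    // dry_penalty_last_n == 0 means the entire context.
    int resolve_dry_range(int dry_penalty_last_n, int n_ctx_tokens) {
        if (dry_penalty_last_n <= 0) return n_ctx_tokens;
        return std::min(dry_penalty_last_n, n_ctx_tokens);
    }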

* Limit sequence breaker lengths in tokens and characters

The core DRY sampler algorithm is linear in the context length, but
there are several parts of the sampler related to multi-token
sequence breakers that are potentially quadratic. Without any
restrictions, a suitably crafted context and sequence breaker could
result in a denial-of-service attack on a server running koboldcpp.
This change limits the maximum number of characters and the maximum
token length of a sequence breaker in order to limit the maximum
overhead associated with the sampler.

This change also improves some comments, adding more detail and
changing the wording to increase clarity.
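
A sketch of the kind of clamping described (the actual caps and helper
names in the commit may differ):

    #include <cstddef>
    #include <cstdint>
    #include <functional>
    #include <string>
    #include <vector>

    // Illustrative caps; the limits actually chosen in the commit may differ.
    constexpr size_t MAX_BREAKER_CHARS  = 128;
    constexpr size_t MAX_BREAKER_TOKENS = 10;

    // Clamp each sequence breaker in characters (before tokenizing) and in
    // tokens (after), so a hostile breaker cannot blow up the sampler's
    // per-step cost. `tokenize` is whatever tokenizer the caller already has.
    std::vector<std::vector<int32_t>> clamp_breakers(
            const std::vector<std::string> &breakers,
            const std::function<std::vector<int32_t>(const std::string &)> &tokenize) {
        std::vector<std::vector<int32_t>> out;
        for (const std::string &s : breakers) {
            std::vector<int32_t> toks = tokenize(s.substr(0, MAX_BREAKER_CHARS));
            if (toks.size() > MAX_BREAKER_TOKENS) toks.resize(MAX_BREAKER_TOKENS);
            out.push_back(std::move(toks));
        }
        return out;
    }
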
2024-07-13 19:08:23 +08:00
Concedo
0dd3907940 qwen2 warning FA 2024-07-09 20:53:25 +08:00
Concedo
d120c55e12 try to fix build errors (+1 squashed commits)
Squashed commits:

[27c28292] try fix build errors
2024-06-29 23:11:00 +08:00
Nexesenex
cb2336f5d9
Gradient rope formula with offsets (#938)
* Gradient rope formula with offsets

Positive offsets for Solar models,
negative offsets for Llama 1 and 2 models.

* Update gpttype_adapter.cpp

Remove L1/L2

* cleanup PR, skip llama models, keep prints behind debug mode

---------

Co-authored-by: Concedo <39025047+LostRuins@users.noreply.github.com>
2024-06-25 20:46:34 +08:00
Concedo
12abc41bb4 add llava separator 2024-06-22 21:55:13 +08:00
Concedo
13398477a1 fix ubatch, autoselect vulkan dgpu if possible 2024-06-22 00:23:46 +08:00
askmyteapot
1e72b65c38
GradientAI Auto ROPE Base calculation (#910)
* GradientAI Auto ROPE Base calculation

https://gradient.ai/blog/scaling-rotational-embeddings-for-long-context-language-models
has a formula that better fits the ideal rope scaling. 

Tested with Llama 3; verified the calculation is correct for Llama 2. Retains the logic for not scaling rope if below the trained context length.

* add in solar scaling logic

Solar-based models require the context values to be multiplied by 8. This is (I'm guessing) because the positions are based on a 32k context with a sliding window of 4k.
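
The commit message doesn't reproduce the formula itself; as a rough
stand-in, here is the common NTK-aware base scaling with the Solar
adjustment described above (the formula and all names are illustrative,
not the code from this PR):

    #include <cmath>

    float auto_rope_base(float original_base,   // e.g. 10000.0f
                         int   n_ctx_train,     // context the model was trained at
                         int   n_ctx_desired,   // context requested by the user
                         int   head_dim,        // rotary dimension, e.g. 128
                         bool  is_solar) {      // detected (per this PR) by a tensor count of 435
        // Solar adjustment: treat the trained context as 8x larger.
        int effective_train = is_solar ? n_ctx_train * 8 : n_ctx_train;
        if (n_ctx_desired <= effective_train) {
            return original_base;               // no scaling below trained context
        }
        float ratio = (float)n_ctx_desired / (float)effective_train;
        // NTK-aware stand-in: base' = base * ratio^(dim / (dim - 2)).
        return original_base * std::pow(ratio, (float)head_dim / (float)(head_dim - 2));
    }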

* Update model_adapter.h

Adding in the tensor count to identify Solar models based on a tensor count of 435.

* Update model_adapter.cpp

add in n_tensor count for solar identification

* refactor and cleanup GradientAI rope scaling

---------

Co-authored-by: Concedo <39025047+LostRuins@users.noreply.github.com>
2024-06-13 18:12:00 +08:00
Concedo
10b148f4c2 added skip bos for tokenize endpoint 2024-06-05 10:49:11 +08:00
Concedo
10a1d628ad added new binding fields for quant k and quant v 2024-06-03 14:35:59 +08:00
Concedo
4b664b3409 improved EOT handling 2024-05-19 22:04:51 +08:00
Concedo
1db3421c52 multiple minor fixes 2024-05-17 15:47:53 +08:00
Concedo
44443edfda rep pen slope works (+1 squashed commits)
Squashed commits:

[535ad566] experiment with rep pen range
2024-05-15 17:20:57 +08:00
Concedo
eff01660e4 re-added smart context due to people complaining 2024-05-11 17:25:03 +08:00
Concedo
dbe72b959e tidy up and refactor code to support old flags 2024-05-10 16:50:53 +08:00
Concedo
173c7272d5 EOS bypass mode added 2024-05-06 18:01:49 +08:00
Concedo
b48ea96ead removed unwanted debugs 2024-05-01 11:35:07 +08:00
Concedo
c65448d17a add flash attention toggle 2024-04-30 21:29:11 +08:00
Concedo
17a24d753c Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.devops/main-intel.Dockerfile
#	.devops/main-vulkan.Dockerfile
#	.devops/server-intel.Dockerfile
#	.devops/server-vulkan.Dockerfile
#	.github/workflows/bench.yml
#	.github/workflows/build.yml
#	.github/workflows/python-lint.yml
#	.github/workflows/server.yml
#	.gitignore
#	Makefile
#	README-sycl.md
#	README.md
#	ci/run.sh
#	flake.lock
#	llama.cpp
#	models/ggml-vocab-falcon.gguf
#	models/ggml-vocab-llama-spm.gguf
#	models/ggml-vocab-mpt.gguf
#	models/ggml-vocab-stablelm.gguf
#	models/ggml-vocab-starcoder.gguf
#	requirements.txt
#	scripts/check-requirements.sh
#	tests/CMakeLists.txt
#	tests/test-backend-ops.cpp
#	tests/test-grammar-integration.cpp
#	tests/test-tokenizer-0-bpe.py
#	tests/test-tokenizer-0-spm.py
#	tests/test-tokenizer-1-spm.cpp
2024-04-30 21:04:17 +08:00
Concedo
c230b78906 refactored a lot of code, remove bantokens, move it to api 2024-04-27 17:57:13 +08:00
Concedo
4ec8a9c57b expose stop reason in generation 2024-04-27 01:12:12 +08:00
Concedo
0871c7cbd1 Add additional debug info and increased ctx sizes, fixed a bug loading vulkan config 2024-04-25 23:07:37 +08:00
Concedo
cb2dbe9e9a improved rep pen speed 2024-04-24 21:29:21 +08:00
Concedo
b4d2031215 merged, added ability to render special tokens 2024-04-22 18:19:58 +08:00
Concedo
3170284fc3 added support for special tokens as stop sequences 2024-04-20 09:48:32 +08:00
Concedo
b01820dec7 auto rope scaling changes 2024-04-19 23:08:55 +08:00
Concedo
9a25d77cc1 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.github/workflows/build.yml
#	.github/workflows/docker.yml
#	Makefile
#	README-sycl.md
#	README.md
#	ci/run.sh
#	ggml-cuda.cu
#	ggml.c
#	grammars/README.md
#	scripts/get-wikitext-2.sh
#	scripts/hf.sh
#	scripts/sync-ggml.last
#	tests/test-backend-ops.cpp
#	tests/test-grammar-integration.cpp
#	tests/test-json-schema-to-grammar.cpp
2024-04-14 21:18:39 +08:00
Concedo
125f84aa02 fixed compiler warnings 2024-04-08 16:40:55 +08:00
Concedo
a530afa1e4 Merge commit '280345968d' into concedo_experimental
# Conflicts:
#	.devops/full-cuda.Dockerfile
#	.devops/llama-cpp-cuda.srpm.spec
#	.devops/main-cuda.Dockerfile
#	.devops/nix/package.nix
#	.devops/server-cuda.Dockerfile
#	.github/workflows/build.yml
#	CMakeLists.txt
#	Makefile
#	README.md
#	ci/run.sh
#	docs/token_generation_performance_tips.md
#	flake.lock
#	llama.cpp
#	scripts/LlamaConfig.cmake.in
#	scripts/compare-commits.sh
#	scripts/server-llm.sh
#	tests/test-quantize-fns.cpp
2024-04-07 20:27:17 +08:00
Concedo
2ef03c9de6 fix for physical batch size 2024-03-15 16:45:20 +08:00
Concedo
47c42fd45c fix for mamba processing 2024-03-13 13:27:46 +08:00
Concedo
484d90c330 llava support is now fully functioning 2024-03-11 15:55:32 +08:00
Concedo
d943c739a8 wip submitting of llava image to backend 2024-03-10 17:14:27 +08:00
Concedo
c08d7e5042 wip integration of llava 2024-03-10 11:18:47 +08:00
Concedo
7c64845dea Merge branch 'master' into concedo_experimental
# Conflicts:
#	.devops/nix/sif.nix
#	.github/workflows/build.yml
#	.github/workflows/python-check-requirements.yml
#	README-sycl.md
#	README.md
#	flake.lock
#	flake.nix
#	requirements/requirements-convert-hf-to-gguf.txt
#	scripts/compare-llama-bench.py
2024-03-04 15:33:33 +08:00
Concedo
2d9a90b652 try to fix ci compile errors (+1 squashed commits)
Squashed commits:

[d0d49663] fixed log multiline (+1 squashed commits)

Squashed commits:

[81a8befe] try to fix linux build error (+1 squashed commits)

Squashed commits:

[22850dda] try to fix build (+1 squashed commits)

Squashed commits:

[b8294611] missing type
2024-03-01 23:38:15 +08:00
Concedo
55af5446ad Merge branch 'master' into concedo_experimental
# Conflicts:
#	README.md
#	ci/run.sh
#	llama.cpp
#	scripts/sync-ggml.last
2024-03-01 17:41:37 +08:00
Concedo
524ba12abd refactor - do not use a copy buffer to store generation outputs, instead return a cpp allocated ptr 2024-02-29 14:02:20 +08:00
Concedo
f75e479db0 WIP on sdcpp integration 2024-02-29 00:40:07 +08:00
Concedo
ad638285de Merge branch 'master' into concedo_experimental
# Conflicts:
#	Makefile
#	README.md
#	flake.lock
#	ggml-cuda.cu
#	llama.cpp
#	tests/test-backend-ops.cpp
#	tests/test-quantize-fns.cpp
2024-02-28 13:41:35 +08:00
Concedo
d47e13c892 fixed compile error: GGML_BACKEND_TYPE_GPU (+1 squashed commits)
Squashed commits:

[00ca282a] fixed compile error: LLAMA_SPLIT_MODE_ROW
2024-02-26 10:55:35 +08:00
Concedo
b5ba6c9ece test to see if Ofast for ggml library plus batching adjustments fixes speed regression for ggmlv1 models 2024-02-25 21:14:53 +08:00
Concedo
6d6d79f359 fixed a horrible bug in thread counts 2024-02-22 23:57:40 +08:00
Concedo
8d5e25008f Merge branch 'master' into concedo_experimental
# Conflicts:
#	CMakeLists.txt
#	Makefile
#	README.md
#	ci/run.sh
#	tests/test-tokenizer-0-falcon.cpp
#	tests/test-tokenizer-0-llama.cpp
#	tests/test-tokenizer-1-bpe.cpp
#	tests/test-tokenizer-1-llama.cpp
2024-02-17 15:22:05 +08:00
Concedo
066e73d769 context shift even more lenient 2024-02-11 18:30:38 +08:00
Concedo
590af480ab contextshift more forgiving 2024-02-10 20:49:21 +08:00
Concedo
35111ce01a row split mode is now a toggle 2024-02-09 18:35:58 +08:00
Concedo
992eea71d7 fixes for vulkan multigpu 2024-02-09 14:42:27 +08:00
Concedo
fe424a5466 tensor split active text 2024-02-09 12:02:23 +08:00
Concedo
4cd571db89 vulkan multigpu, show uptime 2024-02-08 16:54:38 +08:00