koboldcpp

mirror of https://github.com/LostRuins/koboldcpp.git synced 2026-05-22 03:10:03 +00:00

Author	SHA1	Message	Date
henk717	e493f14a3e	New automatic layers (#1012 ) * Henk's version of the fsize algo This is the current version of the fsize algo based on Pyro's algorithm with added padding. * Update koboldcpp.py Add debugs and bump padding * Pyro version Pyro didn't agree with my version, so here is a test with his version * Polish new auto layers This one cleans up some debug prints, restores the max behavior in case the old alg suits someone better and changes the 200 layers to be the actual max for all backends so users have a better feel for the models. * Remove 10% margin The new version has been much more accurate; for low vram systems I only notice 1 layer difference. Getting rid of it so users can test if it's still in safe margins like I expect. On a 6GB system it results in 18 layers instead of 17 being chosen for Tiefighter. * Restore 500MB buffer to play it safe I'm not feeling confident most people keep their vram usage under 1GB with background tasks. For now, since we are aiming to have it work on as many systems as possible, I restore the 500MB extra space since the fsize inflation is gone. * Cap layers at maximum When using the auto predict we don't want to go over the maximum amount of layers. Users should have a realistic feel for how large the model is. For example when I was using the new auto guesser to communicate if a larger model would fit on someone's system at a higher context, it originally made me think that the model had 60 layers. In reality it had fewer. This commit will take the layers of the model, and add 3 extra since that is the highest amount of additional layers a backend adds for the context handling (usually it's 1). * Remove old max layer code Turns out at extreme contexts on new models such as Nemo the old code is incorrectly assuming we can offload everything. It's also redundant to check for max layers the old way since I capped our new guesses. Old code is now removed to simplify it, and it changed the nemo guess from 43 layers to 15 layers. Still looking into the 15 part, still seems too high but it could be the old algo taking over. * Restructure algorithm into multiple parts As requested the different calculations in the algorithm now have their own sections and names so it's easier to understand what parts are being used. This also fixes the typo that was caused as a result of it being harder to read, the typo made no difference during execution and the algorithm is confirmed to still work the same.	2024-07-22 15:47:31 +08:00
Concedo	e2b36aa6cf	fixed dry loading seq when not in use, set kcppt to -1 layers by default	2024-07-22 15:44:34 +08:00
Concedo	0ecf13fc13	updated lite, extra error logging	2024-07-21 17:55:47 +08:00
Concedo	4d9ccddc2c	don't unpack pyd	2024-07-20 18:58:49 +08:00
Concedo	1a23d49c32	serve tags endpoint	2024-07-19 16:08:54 +08:00
Concedo	24b9616344	Merge branch 'upstream' into concedo_experimental # Conflicts: # .devops/full-cuda.Dockerfile # .devops/full-rocm.Dockerfile # .devops/full.Dockerfile # .devops/llama-cli-cuda.Dockerfile # .devops/llama-cli-intel.Dockerfile # .devops/llama-cli-rocm.Dockerfile # .devops/llama-cli-vulkan.Dockerfile # .devops/llama-cli.Dockerfile # .devops/llama-server-cuda.Dockerfile # .devops/llama-server-intel.Dockerfile # .devops/llama-server-rocm.Dockerfile # .devops/llama-server-vulkan.Dockerfile # .devops/llama-server.Dockerfile # CMakeLists.txt # CONTRIBUTING.md # Makefile # ggml/CMakeLists.txt # ggml/src/CMakeLists.txt # requirements.txt # src/llama.cpp # tests/test-backend-ops.cpp	2024-07-19 14:23:33 +08:00
Johannes Gäßler	a15ef8f8a0	CUDA: fix partial offloading for ne0 % 256 != 0 (#8572 )	2024-07-18 23:48:47 +02:00
Concedo	a998588f3a	improved estimation	2024-07-19 00:20:11 +08:00
65a	705b7ecf60	cmake : install all ggml public headers (#8480 ) Co-authored-by: 65a <65a@65a.invalid>	2024-07-18 17:47:12 +03:00
Concedo	caab9cb8ae	fixed unwanted removal	2024-07-18 22:27:22 +08:00
BBC-Esq	621801da0e	Streamline misc (#1007 ) * fix typo and streamline a little * streamline togglehorde * oops	2024-07-18 22:25:38 +08:00
Concedo	8b0a9f7e56	remove keys, use tuple	2024-07-18 22:11:13 +08:00
BBC-Esq	7de1ebf897	Streamline with dictionaries (#1005 ) * dictionary #1 * dictionary #2	2024-07-18 22:05:30 +08:00
BBC-Esq	ce971a0f3d	Streamline with fstrings (#1006 ) * fstring #1 * fstring #2	2024-07-18 21:48:46 +08:00
Eric Zhang	0d2c7321e9	server: use relative routes for static files in new UI (#8552 ) * server: public: fix api_url on non-index pages * server: public: use relative routes for static files in new UI	2024-07-18 12:43:49 +02:00
Brian	672a6f1018	convert-.py: GGUF Naming Convention Refactor and Metadata Override Refactor (#7499 ) Main thing is that the default output filename will take this form {name}{parameters}{finetune}{version}{encoding}{kind} In addition this add and remove some entries in the KV store and adds a metadata class with automatic heuristics capability to derive some values based on model card content No Change: - Internal GGUF Spec - `general.architecture` - `general.quantization_version` - `general.alignment` - `general.file_type` - General Model Details - `general.name` - `general.author` - `general.version` - `general.description` - Licensing details - `general.license` - Typically represents the converted GGUF repo (Unless made from scratch) - `general.url` - Model Source during conversion - `general.source.url` * Removed: - Model Source during conversion - `general.source.huggingface.repository` * Added: - General Model Details - `general.organization` - `general.finetune` - `general.basename` - `general.quantized_by` - `general.size_label` - Licensing details - `general.license.name` - `general.license.link` - Typically represents the converted GGUF repo (Unless made from scratch) - `general.doi` - `general.uuid` - `general.repo_url` - Model Source during conversion - `general.source.doi` - `general.source.uuid` - `general.source.repo_url` - Base Model Source - `general.base_model.count` - `general.base_model.{id}.name` - `general.base_model.{id}.author` - `general.base_model.{id}.version` - `general.base_model.{id}.organization` - `general.base_model.{id}.url` (Model Website/Paper) - `general.base_model.{id}.doi` - `general.base_model.{id}.uuid` - `general.base_model.{id}.repo_url` (Model Source Repository (git/svn/etc...)) - Array based KV stores - `general.tags` - `general.languages` - `general.datasets` --------- Co-authored-by: compilade <git@compilade.net> Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>	2024-07-18 20:40:15 +10:00
RunningLeon	3807c3de04	server : respect `--special` cli arg (#8553 )	2024-07-18 11:06:22 +03:00
Concedo	6080fa38ce	updated lite	2024-07-18 15:55:45 +08:00
Concedo	90c1bbbcb9	more url download support	2024-07-18 11:56:05 +08:00
Johannes Gäßler	e02b597be3	lookup: fibonacci hashing, fix crashes (#8548 )	2024-07-17 23:35:44 +02:00
Al Mochkin	b3283448ce	build : Fix docker build warnings (#8535 ) (#8537 )	2024-07-17 20:21:55 +02:00
Concedo	ad86b1aeb8	Implemented Kcpp Launch Templates (+1 squashed commits) Squashed commits: [5ea4c1de] wip integrating skcpps templates (+1 squashed commits) Squashed commits: [737daa7f] skcpps wip	2024-07-18 00:22:59 +08:00
Brian	30f80ca0bc	CONTRIBUTING.md : remove mention of noci (#8541 )	2024-07-17 17:57:06 +03:00
Concedo	8ccc0144d2	ability to set -1 as gpulayers and determine at runtime (+1 squashed commits) Squashed commits: [594263c3] ability to set -1 as gpulayers and determine at runtime	2024-07-17 20:31:19 +08:00
hipudding	1bdd8ae19f	[CANN] Add Ascend NPU backend (#6035 ) * [CANN] Add Ascend NPU backend Ascend is a full-stack AI computing infrastructure for industry applications and services based on Huawei Ascend processors and software. CANN (Compute Architecture of Neural Networks), developed by Huawei, is a heterogeneous computing architecture for AI. Co-authored-by: wangshuai09 <391746016@qq.com> * delete trailing whitespaces * Modify the code based on review comment * Rename LLAMA_CANN to GGML_CANN * Make ggml-common.h private * add ggml_cann prefix for acl funcs * Add logging for CANN backend * Delete Trailing whitespace --------- Co-authored-by: wangshuai09 <391746016@qq.com>	2024-07-17 14:23:50 +03:00
Concedo	869e30a6a0	Updated CLInfo from https://github.com/Oblomov/clinfo https://ci.appveyor.com/api/projects/oblomov/clinfo/artifacts/clinfo.exe?job=platform%3a+x64	2024-07-17 19:20:17 +08:00
Concedo	6c883a4803	dummy skcpps format	2024-07-17 18:35:27 +08:00
Concedo	eca7521c13	allowed embedded chat adapters	2024-07-17 18:08:43 +08:00
Masaya, Kato	da3913d8f9	batched: fix n_predict parameter (#8527 )	2024-07-17 10:34:28 +03:00
Georgi Gerganov	d65a8361fe	llama : disable context-shift for DeepSeek v2 (#8501 )	2024-07-17 10:32:59 +03:00
Concedo	5988243aee	fix wrong order, fix llava debug mode failure	2024-07-17 15:30:19 +08:00
Johannes Gäßler	5e116e8dd5	make/cmake: add missing force MMQ/cuBLAS for HIP (#8515 )	2024-07-16 21:20:59 +02:00
Concedo	e99fa531a2	reorder items	2024-07-17 00:28:48 +08:00
Concedo	d775a419b2	updated lite with chat inject, added layer detect, added more console logging	2024-07-16 23:10:15 +08:00
Brian	1666f92dcd	gguf-hash : update clib.json to point to original xxhash repo (#8491 ) * Update clib.json to point to Cyan4973 original xxhash Convinced Cyan4973 to add clib.json directly to his repo, so can now point the clib package directly to him now. Previously pointed to my fork with the clib.json package metadata https://github.com/Cyan4973/xxHash/pull/954 * gguf-hash: readme update to point to Cyan4973 xxHash repo [no ci]	2024-07-16 10:14:16 +03:00
Steve Bonds	37b12f92ab	export-lora : handle help argument (#8497 ) The --help option on export-lora isn't accepted as valid. The help still gets displayed by default, but the script exits with an error message and nonzero status.	2024-07-16 10:04:45 +03:00
Georgi Gerganov	0efec57787	llama : valign + remove unused ftype (#8502 )	2024-07-16 10:00:30 +03:00
compilade	7acfd4e8d5	convert_hf : faster lazy safetensors (#8482 ) * convert_hf : faster lazy safetensors This makes '--dry-run' much, much faster. * convert_hf : fix memory leak in lazy MoE conversion The '_lazy' queue was sometimes self-referential, which caused reference cycles of objects old enough to avoid garbage collection until potential memory exhaustion.	2024-07-15 23:13:10 -04:00
Xuan Son Nguyen	97bdd26eee	Refactor lora adapter support (#8332 ) * lora: load to devide buft * add patch tensor function * correct tensor patch * llama_lora_adapter_apply * correct ggml_backend_tensor_copy * add llm_build_mm * fix auto merge * update based on review comments * add convert script * no more transpose A * add f16 convert * add metadata check * add sanity check * fix ftype * add requirements * fix requirements * fix outfile * conversion: only allow selected models * fix types * cuda : do not use dmmv if the tensor does not have enough cols * llama : lora fixes * do not disable mmap with lora Co-authored-by: slaren <slarengh@gmail.com> * llm_build_lora_mm_id * convert_lora : MoE LoRA conversion support * convert_lora : prefer safetensors, similarly to convert_hf * convert_hf : simplify modify_tensors for InternLM2 * convert_lora : lazy conversion * llama : load and use alpha from LoRA adapters * llama : use llm_build_lora_mm in most model graphs * auto scale * Revert "auto scale" This reverts commit 42415a4874e0f963e4aca6796ea5dfb97cd17464. * remove redundant params * Apply suggestions from code review Co-authored-by: slaren <slarengh@gmail.com> * change kv metadata * move add_type to __init__ * convert_hf : move add_type to main() * convert_lora : use the GGUFWriter from Model instead of overwriting it --------- Co-authored-by: slaren <slarengh@gmail.com> Co-authored-by: Francis Couture-Harpin <git@compilade.net>	2024-07-15 20:50:47 +02:00
Xuan Son Nguyen	4db8f60fe7	fix ci (#8494 )	2024-07-15 19:23:10 +02:00
Concedo	a441c27cb5	fixed broken link	2024-07-16 01:00:16 +08:00
Concedo	e707ab9025	Merge branch 'upstream' into concedo_experimental # Conflicts: # docs/development/HOWTO-add-model.md # docs/development/token_generation_performance_tips.md # flake.lock	2024-07-16 00:49:34 +08:00
Concedo	516fd35e93	error popups on python exits	2024-07-16 00:46:32 +08:00
Concedo	8412946b9f	fix oldcpu build avx1	2024-07-15 23:42:22 +08:00
Concedo	21179d675b	try ci for avx1, up ver (+2 squashed commit) Squashed commit: [74150175] up version [97b6163c] try ci for avx1 linux	2024-07-15 23:07:07 +08:00
Daniel Bevenius	8fac431b06	ggml : suppress unknown pragma 'GCC' on windows (#8460 ) This commit adds a macro guard to pragma GCC to avoid the following warning on windows: ```console C:\llama.cpp\ggml\src\ggml-aarch64.c(17,9): warning C4068: unknown pragma 'GCC' [C:\llama.cpp\build\ggml\src\ggml.vcxproj] ```	2024-07-15 15:48:17 +03:00
M-A	f17f39ff9c	server: update README.md with llama-server --help output [no ci] (#8472 ) The README.md had stale information. In particular, the --ctx-size "defaults to 512" confused me and I had to check the code to confirm this was false. Since the server is evolving rapidly, it's probably better to keep the source of truth in a single place (in the source) and generate the README.md based on that. Did: make llama-server; ./llama-server --help > t.txt; vimdiff t.txt examples/server/README.md. I copied the content inside a backquote block. I would have preferred proper text but it would require a fair amount of surgery to make the current output compatible with markdown. A follow-up could be to automate this process with a script. No functional change.	2024-07-15 15:04:56 +03:00
Georgi Gerganov	9104bc20ed	common : add --no-cont-batching arg (#6358 )	2024-07-15 14:54:58 +03:00
NikolaiLyssogor	fc690b018e	docs: fix links in development docs [no ci] (#8481 ) Fixes a few links to within the repo that were broken in the reorganization of the documentation in #8325.	2024-07-15 14:46:39 +03:00
Meng, Hengyu	16bdfa42ac	[SYCL] add concat through dim 1/2 (#8483 ) * add concat through dim 1/2	2024-07-15 19:32:15 +08:00

5203 commits