Commit graph

4371 commits

Author SHA1 Message Date
compilade
de9692a7d2
llama : fix llama_copy_state_data with fragmented KV cache (#5840)
The row size of the saved states was based on kv_self.head, whereas
it should be based on llama_kv_cache_cell_max.

Existing session files should still work.

* llama : fix llama_kv_cache_cell_max inability to return 1

I've also changed its return type to uint32_t,
because this function is always used to set the value of uint32_t variables,
and because the index already has this type.

* llama : fix state size calculation

Some bytes in the state were unaccounted for in llama_get_state_size.
Since the logits reserve so much space, it did not cause problems.
2024-03-03 10:41:55 +02:00
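A minimal sketch of the cell_max idea, with illustrative field names rather than the exact llama.cpp internals: scan backwards for the last occupied cell, so holes left by a fragmented cache below kv_self.head do not corrupt the saved range.

```cpp
#include <cstdint>
#include <vector>

struct kv_cell {
    int32_t pos = -1; // -1 marks an empty cell (hypothetical layout)
};

// Returns one past the index of the last occupied cell, or 0 if none are
// used. Unlike kv_self.head, this stays correct when the cache has holes,
// and it can return 1 when only cell 0 is in use.
static uint32_t kv_cache_cell_max(const std::vector<kv_cell> & cells) {
    for (uint32_t i = (uint32_t) cells.size(); i > 0; --i) {
        if (cells[i - 1].pos >= 0) {
            return i;
        }
    }
    return 0;
}
```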
Pierrick Hymbert
e6029348e8
ci : schedule slow server tests only on Release or on demand (#5839) 2024-03-03 10:35:23 +02:00
Pierrick Hymbert
8ef969afce
server : init HTTP request thread pool with --parallel if set (#5836) 2024-03-03 09:48:36 +02:00
Concedo
0c59c1ed90 allow specifying width and height 2024-03-03 15:44:15 +08:00
Georgi Gerganov
fa974646e1
flake.lock: Update (#5842)
Flake lock file updates:

• Updated input 'flake-parts':
    'github:hercules-ci/flake-parts/b253292d9c0a5ead9bc98c4e9a26c6312e27d69f' (2024-02-01)
  → 'github:hercules-ci/flake-parts/f7b3c975cf067e56e7cda6cb098ebe3fb4d74ca2' (2024-03-01)
• Updated input 'flake-parts/nixpkgs-lib':
    'github:NixOS/nixpkgs/97b17f32362e475016f942bbdfda4a4a72a8a652?dir=lib' (2024-01-29)
  → 'github:NixOS/nixpkgs/1536926ef5621b09bba54035ae2bb6d806d72ac8?dir=lib' (2024-02-29)
• Updated input 'nixpkgs':
    'github:NixOS/nixpkgs/cbc4211f0afffe6dfd2478a62615dd5175a13f9a' (2024-02-23)
  → 'github:NixOS/nixpkgs/1536926ef5621b09bba54035ae2bb6d806d72ac8' (2024-02-29)

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
2024-03-02 20:11:31 -08:00
Pierrick Hymbert
9731134296
server: tests: passkey challenge / self-extend with context shift demo (#5832)
* server: tests: add models endpoint scenario

* server: /v1/models add some metadata

* server: tests: add debug field in context before scenario

* server: tests: download model from HF, add batch size

* server: tests: add passkey test

* server: tests: add group attention params

* server: do not truncate prompt tokens if self-extend through group attention is enabled

* server: logs: do not truncate log values

* server: tests - passkey - first good working value of nga

* server: tests: fix server timeout

* server: tests: fix passkey, add doc, fix regex content matching, fix timeout

* server: tests: fix regex content matching

* server: tests: schedule slow tests on master

* server: metrics: fix when no prompt processed

* server: tests: self-extend add llama-2-7B and Mixtral-8x7B-v0.1

* server: tests: increase timeout for completion

* server: tests: keep only the PHI-2 test

* server: tests: passkey add a negative test
2024-03-02 22:00:14 +01:00
Michael Podvitskiy
4a6e2d6142
llama : add abort_callback to interrupt computation (#5409)
* using abort_callback from ggml to stop llama computation

* format fix

* a brief explaining comment

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-03-02 21:52:25 +02:00
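A hedged sketch of how such an abort callback is typically wired; the callback shape below matches ggml's bool-returning convention, while the registration call appears only as an illustrative comment.

```cpp
#include <atomic>

static std::atomic<bool> g_interrupt{false};

// Return true to stop graph computation at the next checkpoint;
// `data` is the user pointer passed in at registration time.
static bool should_abort(void * /*data*/) {
    return g_interrupt.load();
}

// Illustrative wiring: register the callback on the context, e.g.
//   llama_set_abort_callback(ctx, should_abort, nullptr);
// then another thread or a signal handler sets g_interrupt to cancel
// an in-flight decode instead of waiting for it to finish.
```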
Georgi Gerganov
494c870326
ggml : fix IQ3_S AVX implementation (#5834)
ggml-ci
2024-03-02 20:00:49 +02:00
Jared Van Bortel
4d4d2366fc
convert : automatically fall back to HfVocab if tokenizer.model doesn't exist (#5821) 2024-03-02 12:27:26 -05:00
Jared Van Bortel
c7a0ad8ec9
convert-hf : make model class definitions self-contained (#5825) 2024-03-02 12:21:47 -05:00
Kawrakow
bbde6eb256
ggml : IQ3_S improvements (#5829)
* iq3_s: somewhat faster AVX2 dot product

On a Ryzen 7950X, TG-128 increases to 16 t/s from 15.5 t/s using
16 threads. For 8 threads it is 13.85 t/s vs. 11.75 t/s.
PP-512 increases to 28.5 t/s from 23.8 t/s.

* iq3_s: somewhat faster ARM_NEON dot product

Still dog slow - 10.7 t/s up from 9.9 t/s.

* iq3_s: another small ARM_NEON improvement

10.7 -> 11.0 t/s. Using vmulq_s8 is faster than the xor-sub trick
that works best on AVX2.

* iq3_s: minor improvement on Metal

49.4 t/s -> 50.3 t/s

* iq3_s: PPL improvement

E.g., for a context of 4096, LLaMA-v2-7B goes to 5.1340 from 5.1653.

* iq3_s: use new grid everywhere

* Fix ARM_NEON

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-03-02 17:00:51 +02:00
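For reference, the xor-sub trick mentioned above conditionally negates lanes without a multiply: with a mask m that is 0 (keep) or -1 (negate) per lane, (x ^ m) - m yields x or -x. A scalar sketch of both variants; the actual kernels use AVX2/NEON intrinsics (vmulq_s8 multiplies by ±1 directly).

```cpp
#include <cstdint>

// xor-sub: x ^ -1 == ~x, and ~x + 1 == -x in two's complement,
// so (x ^ m) - m negates exactly the lanes where m == -1.
static inline int8_t negate_xor_sub(int8_t x, int8_t m) {
    return (int8_t) ((x ^ m) - m);
}

// Multiply form: s is +1 or -1; this is what vmulq_s8 vectorizes on NEON.
static inline int8_t negate_mul(int8_t x, int8_t s) {
    return (int8_t) (x * s);
}
```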
Georgi Gerganov
ef2cd694c4
scripts : add pod-llama.sh 2024-03-02 16:54:20 +02:00
Concedo
fa1d8b8d95 updated lite 2024-03-02 22:33:02 +08:00
Xuan Son Nguyen
6c32d8c7ad
llama : refactor internal quantization functions (#5830) 2024-03-02 16:19:09 +02:00
compilade
802da0091b
llama : fix segfault from unknown model arch name (#5820)
* llama : fix segfault from unknown model arch name

* llama : make all LLM maps const

This also requires using `std::map::at` instead of `operator[]`,
which does not exist for const maps.

* llama : name LLM_ARCH_UNKNOWN to "(unknown)"

This avoids errors from `std::map::at` when
getting the general name of the model architecture.
Using "(unknown)" instead of an empty string as per suggestion
https://github.com/ggerganov/llama.cpp/pull/5820#issuecomment-1973735284

* llama : remove redundant inner const for LLM_TENSOR_NAMES

The extra const won't do anything here as const maps
return const references to values.

Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>

* llama : remove redundant nullptr check in llm_arch_from_string

Since LLM_ARCH_NAMES is a const map, no spurious elements
with a NULL name are inserted anymore, so this check is dead code.

---------

Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
2024-03-02 15:42:56 +02:00
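A sketch of the const-map pattern behind this fix, with abbreviated names: `operator[]` is non-const because it inserts missing keys, so a const map must use `at`, and giving the unknown architecture an explicit "(unknown)" entry keeps `at` from throwing.

```cpp
#include <map>
#include <string>

enum llm_arch { LLM_ARCH_LLAMA, LLM_ARCH_UNKNOWN };

// A const map cannot grow spurious elements on lookup, so every key that
// may be queried must be present up front, including the unknown arch.
static const std::map<llm_arch, std::string> LLM_ARCH_NAMES = {
    { LLM_ARCH_LLAMA,   "llama"     },
    { LLM_ARCH_UNKNOWN, "(unknown)" },
};

static const std::string & llm_arch_name(llm_arch arch) {
    return LLM_ARCH_NAMES.at(arch); // operator[] does not exist on const maps
}
```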
Neo Zhang Jianyu
715641391d
Support multiple GPUs (split mode) on SYCL backend (#5806)
* support multiple cards: split-mode - layer|row

* rm warning

* rebase with master, support two new OPs, disable the -sm=row feature for now, fix unit test

* update news

* fix merge error

* update according to review comments
2024-03-02 19:49:30 +08:00
Concedo
4c0beef598 updated lite, added personal notes 2024-03-02 18:53:15 +08:00
Concedo
fda905a36a fixed unable to load config 2024-03-02 18:08:45 +08:00
crasm
9bf297a02b
workflows : remove nocleanup arg for check-requirements.sh (#5826)
Reduces peak tmpfs usage and should prevent the check from failing due to
running out of space.

Fixes the 'No space left on device' issue mentioned in #5703.
2024-03-02 00:11:06 -05:00
Concedo
e1b213ae96 increase steps limit 2024-03-02 12:08:19 +08:00
Concedo
e53d21d748 sanitize SD prompt to avoid segfault 2024-03-02 12:05:59 +08:00
Concedo
59c5448ac8 fixed colab (+1 squashed commits)
Squashed commits:

[1d1c686f] updated colab and docs
2024-03-02 10:09:07 +08:00
Tushar
cb5e8f7fc4
build(nix): Introduce flake.formatter for nix fmt (#5687)
* build(nix): Introduce flake.formatter for `nix fmt`
* chore: Switch to pkgs.nixfmt-rfc-style
2024-03-01 15:18:26 -08:00
nold
da3b9ba2b7
convert-hf-to-gguf : require einops for InternLM2ForCausalLM (#5792) 2024-03-01 16:51:12 -05:00
Sourab Mangrulkar
c29af7e225
llama : add StarCoder2 support (#5795)
* Add support for starcoder2

* handle rope type

* skip serializing rope freq and rotary embeddings

* resolve comments

* Update llama.cpp

* remove redundant changes

* handle `rope-theta`

* llama : change starcoder2 rope type

* address comment

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-03-01 21:30:46 +02:00
Concedo
0978134f65 fix macos tunnel 2024-03-02 02:03:13 +08:00
Georgi Gerganov
38d16b1426
server : remove api_like_OAI.py proxy script (#5808) 2024-03-01 20:00:58 +02:00
ddpasa
c2224f003b
ggml-vulkan: fix VULKAN_CHECK_RESULTS flag, which was previously broken (#5813) 2024-03-01 18:00:00 +01:00
Concedo
2d9a90b652 try to fix ci compile errors (+1 squashed commits)
Squashed commits:

[d0d49663] fixed log multiline (+1 squashed commits)

Squashed commits:

[81a8befe] try to fix linux build error (+1 squashed commits)

Squashed commits:

[22850dda] try to fix build (+1 squashed commits)

Squashed commits:

[b8294611] missing type
2024-03-01 23:38:15 +08:00
kunal-vaishnavi
e743386728
gemma : fix bfloat16 -> float16 conversion issue (#5810) 2024-03-01 16:08:08 +02:00
Miwa / Ensan
f49a535686
common : fix flag --logits-all to --all-logits (#5805) 2024-03-01 15:48:56 +02:00
Pierrick Hymbert
3ab8b3a92e
llama : cleanup unused mmq flags (#5772)
* cleanup unused --no-mul-mat-q, -nommq, -mmq, --mul-mat-q, mul_mat_q

* remove: mul_mat_q in compare llama bench and usage

* update llama-bench

---------

Co-authored-by: slaren <slarengh@gmail.com>
2024-03-01 13:39:06 +02:00
Concedo
040de7d899 try add tunnels for macos 2024-03-01 17:52:09 +08:00
Concedo
55af5446ad Merge branch 'master' into concedo_experimental
# Conflicts:
#	README.md
#	ci/run.sh
#	llama.cpp
#	scripts/sync-ggml.last
2024-03-01 17:41:37 +08:00
Douglas Hanley
9600d59e01
unicode : switch to multimap based nfd_map (#5799)
* switch to multimap based nfd_map due to compile time issues

* simplify multimap keys

* don't construct a new locale every time
2024-03-01 11:15:36 +02:00
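A sketch of the multimap shape, with an illustrative entry: NFD can decompose one codepoint into several, and `equal_range` recovers the whole sequence without the single giant-initializer map that was slow to compile.

```cpp
#include <cstdint>
#include <map>
#include <vector>

// One key, several values: U+00E9 ('é') decomposes to U+0065 ('e')
// followed by U+0301 (combining acute accent).
static const std::multimap<uint32_t, uint32_t> nfd_map = {
    { 0x00E9, 0x0065 },
    { 0x00E9, 0x0301 },
};

static std::vector<uint32_t> nfd_decompose(uint32_t cp) {
    auto range = nfd_map.equal_range(cp);
    if (range.first == range.second) {
        return { cp }; // no decomposition: codepoint maps to itself
    }
    std::vector<uint32_t> out;
    for (auto it = range.first; it != range.second; ++it) {
        out.push_back(it->second);
    }
    return out;
}
```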
Pierrick Hymbert
5cb02b4a01
server: allow overriding the server thread pool with --threads-http (#5794) 2024-03-01 10:08:08 +01:00
Eve
6ea0f010ff
ci : add Ubuntu 22 Vulkan CI run (#5789) 2024-03-01 10:54:53 +02:00
Concedo
e5861e993d fix benchmark 2024-03-01 16:54:25 +08:00
Concedo
80011ed8aa KCPP SD: add warning and step restriction, updated lite, handle quant mode 2024-03-01 16:41:19 +08:00
Georgi Gerganov
f105471ef6
server : fix newlines in help (#5785) 2024-03-01 09:59:43 +02:00
AidanBeltonS
38d1521608
[SYCL] Use batched mul_mat pathway (#5591)
* Use batched mul_mat pathway

* rm extra line

* Explicitly state scaled data type

---------

Co-authored-by: Abhilash Majumder <30946547+abhilash1910@users.noreply.github.com>
2024-03-01 13:06:47 +05:30
Concedo
3463688a0e image generation is fully working over api (+1 squashed commits)
Squashed commits:

[c98ab0b4] single image generation is working now
2024-03-01 14:43:44 +08:00
Xuan Son Nguyen
052051d8ae
Server: normalize naming (#5779)
* server: normalize naming

* fix spacing
2024-02-29 21:42:11 +01:00
Concedo
e8f4d7b3da added model and config endpoints for sdcpp, added more samplers. speed is still not good 2024-02-29 22:56:09 +08:00
bebopkim
257015bb94
Resolve Metal compilation errors for sdcpp (#720) 2024-02-29 20:15:45 +08:00
Concedo
5a44d4de2b refactor and clean identifiers for sd, fix cmake 2024-02-29 18:28:45 +08:00
Concedo
66134bb36e ui for loading SD models done 2024-02-29 17:08:22 +08:00
Marcus Dunn
d5ab29757e
llama : constified llama_set_state_data's src (#5774) 2024-02-29 10:17:23 +02:00
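The change is visible in the signature alone; a before/after sketch (declaration abbreviated from llama.h):

```cpp
#include <cstddef>
#include <cstdint>

struct llama_context;

// Before: the source buffer was needlessly mutable.
// size_t llama_set_state_data(struct llama_context * ctx, uint8_t * src);

// After: restoring state only reads from src, so it is now const.
size_t llama_set_state_data(struct llama_context * ctx, const uint8_t * src);
```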
Concedo
524ba12abd refactor - do not use a copy buffer to store generation outputs, instead return a cpp allocated ptr 2024-02-29 14:02:20 +08:00
Georgi Gerganov
87c91c0766
ci : reduce 3b ppl chunks to 1 to avoid timeout (#5771)
ggml-ci
2024-02-28 21:44:21 +02:00