koboldcpp

mirror of https://github.com/LostRuins/koboldcpp.git synced 2025-09-12 18:09:42 +00:00

Author	SHA1	Message	Date
Sergio López	c88c74f967	vulkan: only use M-sized matmul on Apple GPUs (#5412 ) * vulkan: refactor guess_matmul_pipeline for vendor Refactor ggml_vk_guess_matmul_pipeline to simplify adding per-vendor conditionals. Signed-off-by: Sergio Lopez <slp@redhat.com> * vulkan: only use M-sized matmul on Apple GPUs L-sized and S-sized matmuls are broken on Apple GPUs, force using M-size with this vendor. Signed-off-by: Sergio Lopez <slp@redhat.com> --------- Signed-off-by: Sergio Lopez <slp@redhat.com>	2024-02-11 15:12:00 +01:00
Alexey Parfenov	a803333a4e	common : use enums for sampler types (#5418 ) * common: use enums for sampler types * Apply suggestions from code review Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * minor : spaces --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-02-11 15:43:31 +02:00
Alexey Parfenov	684780141a	server : allow to specify tokens as strings in logit_bias (#5003 ) * server: allow to specify tokens as strings in logit_bias * Apply suggestions from code review Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-02-11 15:38:14 +02:00
Georgi Gerganov	85910c5b30	main : ctrl+C print timing in non-interactive mode (#3873 )	2024-02-11 15:35:50 +02:00
Georgi Gerganov	139b62a839	common : fix compile warning	2024-02-11 15:33:43 +02:00
Georgi Gerganov	0f2411f154	ggml : fix compile warnings (unused vars) (#4966 )	2024-02-11 15:33:01 +02:00
snadampal	a07d0fee1f	ggml : add mmla kernels for quantized GEMM (#4966 ) * ggml: aarch64: implement smmla kernel for q8_0_q8_0 quantized gemm armv8.2-a and above supports MMLA instructions that have higher throughput than DOT. this commit adds mmla kernel for q8_0_q8_0 gemm. The feature is enabled if the platform supports "__ARM_FEATURE_MATMUL_INT8" On AWS Graviton3 processors this kernel resulted up to 1.5x improvement for prompt evaluation throughput compared to the default sdot kernel. * ggml: aarch64: implement smmla kernel for q4_0_q8_0 quantized gemm armv8.2-a and above supports MMLA instructions that have higher throughput than DOT. this commit adds mmla kernel for q4_0_q8_0 gemm. The feature is enabled if the platform supports "__ARM_FEATURE_MATMUL_INT8" On AWS Graviton3 processors this kernel resulted up to 1.5x improvement for prompt evaluation throughput compared to the default sdot kernel. * ggml: aarch64: implement smmla kernel for q4_1_q8_1 quantized gemm armv8.2-a and above supports MMLA instructions that have higher throughput than DOT. this commit adds mmla kernel for q4_1_q8_1 gemm. The feature is enabled if the platform supports "__ARM_FEATURE_MATMUL_INT8" On AWS Graviton3 processors this kernel resulted up to 1.5x improvement for prompt evaluation throughput compared to the default sdot kernel. * ggml: update unit tests for the new vec_dot interface * llama.cpp: add MATMUL_INT8 capability to system_info	2024-02-11 15:22:33 +02:00
Concedo	f9bc7245ab	b64 decoder	2024-02-11 20:35:34 +08:00
Johannes Gäßler	e4640d8fdf	lookup: add print for drafting performance (#5450 )	2024-02-11 12:44:51 +01:00
Concedo	066e73d769	context shift even more lenient	2024-02-11 18:30:38 +08:00
Xuan Son Nguyen	907e08c110	server : add llama2 chat template (#5425 ) * server: add mistral chat template * server: fix typo * server: rename template mistral to llama2 * server: format_llama2: remove BOS * server: validate "--chat-template" argument * server: clean up using_chatml variable Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com> --------- Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>	2024-02-11 12:16:22 +02:00
Concedo	edb3dc362a	clblast up ver	2024-02-11 17:01:09 +08:00
Concedo	ea3fd87f68	Merge branch 'master' into concedo_experimental # Conflicts: # README.md # scripts/sync-ggml.sh	2024-02-11 15:18:46 +08:00
Concedo	038779af41	another fix for lite	2024-02-11 15:17:09 +08:00
Concedo	afa41c24f5	small fix lite (+1 squashed commits) Squashed commits: [f22db79a] updated lite	2024-02-10 22:31:19 +08:00
Concedo	6f3196ad8e	fix benchmark line	2024-02-10 21:49:14 +08:00
Concedo	590af480ab	contextshift more forgiving	2024-02-10 20:49:21 +08:00
Ian Bull	f026f8120f	metal : use autoreleasepool to avoid memory leaks (#5437 ) There appears to be a known memory leak when using the `MLTCommandBuffer`. It is suggested to use `@autoreleasepool` in [1,2] [1] https://developer.apple.com/forums/thread/662721 [2] https://forums.developer.apple.com/forums/thread/120931 This change-set wraps the `ggml_metal_graph_compute` in a `@autoreleasepool`. This commit addresses https://github.com/ggerganov/llama.cpp/issues/5436	2024-02-10 12:53:28 +02:00
Georgi Gerganov	cd9aea63b5	scripts : update sync scripts with new backends	2024-02-10 09:53:05 +02:00
Georgi Gerganov	43b65f5eb8	sync : ggml	2024-02-10 09:30:36 +02:00
Michael Podvitskiy	4633d93af0	ggml : add abort_callback for cpu backend (ggml/725) * a way to use abort_callback with the cpu backend * whisper update	2024-02-10 09:29:21 +02:00
Neuman Vong	4b7b38bef5	vulkan: Set limit for task concurrency (#5427 ) A common default for the maximum number of open files is 256, which can lead to `asyncio.gather(tasks)` failing with Too many open files. $ python ggml_vk_generate_shaders.py --glslc=$ANDROID_NDK_PATH/shader-tools/darwin-x86_64/glslc ggml_vulkan: Generating and compiling shaders to SPIR-V Traceback (most recent call last): File "/Users/neuman/Code.noindex/github/llama.cpp/ggml_vk_generate_shaders.py", line 2326, in <module> asyncio.run(main()) File "/Users/neuman/Code.noindex/miniforge3/lib/python3.10/asyncio/runners.py", line 44, in run return loop.run_until_complete(main) File "/Users/neuman/Code.noindex/miniforge3/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete return future.result() File "/Users/neuman/Code.noindex/github/llama.cpp/ggml_vk_generate_shaders.py", line 2294, in main await asyncio.gather(tasks) [...snip...] OSError: [Errno 24] Too many open files This change sets a reasonable concurrency limit for tasks (and therefore open files), without significant impact on run time.	2024-02-09 19:30:19 +01:00
Concedo	0ec0055edc	updated lite	2024-02-09 22:21:58 +08:00
Daniel Bevenius	e00d2a62dd	llava : add requirements.txt and update README.md (#5428 ) * llava: add requirements.txt and update README.md This commit adds a `requirements.txt` file to the `examples/llava` directory. This file contains the required Python packages to run the scripts in the `examples/llava` directory. The motivation of this to make it easier for users to run the scripts in `examples/llava`. This will avoid users from having to possibly run into missing package issues if the packages are not installed on their system. Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com> * llava: fix typo in llava-surgery.py output Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com> --------- Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>	2024-02-09 15:00:59 +02:00
Concedo	c3d1a7d123	benchmark coherence fix	2024-02-09 19:03:48 +08:00
Riley Stewart	7c777fcd5d	server : fix prompt caching for repeated prompts (#5420 )	2024-02-09 12:49:49 +02:00
Paul Tsochantaris	e5ca3937c6	llama : do not cap thread count when MoE on CPU (#5419 ) * Not capping thread count when MoE inference is running on CPU * Whitespace	2024-02-09 12:48:06 +02:00
Concedo	35111ce01a	row split mode is now a toggle	2024-02-09 18:35:58 +08:00
Marko Tasic	e4124c2477	readme : add JavaScript/Wasm repo (#5415 )	2024-02-09 12:17:00 +02:00
Michael Podvitskiy	b2f87cb64d	ggml : fix `error C2078: too many initializers` for MSVC ARM64 (#5404 )	2024-02-09 11:56:43 +02:00
Concedo	d1aff0e964	benchmark only save under 1mb	2024-02-09 15:40:29 +08:00
Concedo	e69a505def	Merge branch 'master' into concedo_experimental # Conflicts: # .github/workflows/build.yml	2024-02-09 14:46:01 +08:00
Concedo	992eea71d7	fixes for vulkan multigpu	2024-02-09 14:42:27 +08:00
0cc4m	44fbe34360	Fix Vulkan crash on APUs with very little device memory (#5424 ) * Fix Vulkan crash on APUs with very little device memory * Fix debug output function names	2024-02-09 06:52:33 +01:00
Concedo	fe424a5466	tensor split active text	2024-02-09 12:02:23 +08:00
Johannes Gäßler	8e6a9d2de0	CUDA: more warps for mmvq on NVIDIA (#5394 )	2024-02-08 21:56:40 +01:00
slaren	41f308f58e	llama : do not print "offloading layers" message in CPU-only builds (#5416 )	2024-02-08 21:33:03 +01:00
Abhilash Majumder	6e99f2a04f	Fix f16_sycl cpy call from Arc (#5411 ) * fix f16_sycl cpy call * rm old logic * add fp16 build CI * use macro * format fix	2024-02-08 22:39:10 +05:30
Daniel Bevenius	ff4ff05c5f	llava : add missing .py, and fix paths in README.md (#5414 ) This commit adds the missing .py extension to the convert-image-encoder-to-gguf script. It also fixes the paths for the `model` and `mmproj` options in the example llava-cli command. Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>	2024-02-08 16:20:03 +02:00
Johannes Gäßler	b7b74cef36	fix trailing whitespace (#5407 )	2024-02-08 11:36:54 +01:00
runfuture	4aa43fab56	llama : fix MiniCPM (#5392 ) * fix bug for norm_rms_eps missing * to align with the same order as convert.py for model write * fix: undo HF models permute tensor * update for flake8 lint	2024-02-08 12:36:19 +02:00
Concedo	22a4d84050	updated readme	2024-02-08 17:34:44 +08:00
Concedo	f374dba49c	Merge branch 'master' into concedo_experimental # Conflicts: # CMakeLists.txt # README.md # tests/test-sampling.cpp	2024-02-08 17:33:03 +08:00
Daniel Bevenius	a6e514a85f	llava: fix typo/formatting in README.md (#5405 ) This commit fixes a typo in the README.md file for the llava example which is causing the formatting to look a little off: Clone llava-v15-7b`` and clip-vit-large-patch14-336`` locally Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>	2024-02-08 09:58:19 +01:00
Concedo	4cd571db89	vulkan multigpu, show uptime	2024-02-08 16:54:38 +08:00
Johannes Gäßler	26d4efd11e	sampling: fix top_k <= 0 (#5388 ) * sampling: fix top_k <= 0 * Update llama.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-02-08 09:46:30 +01:00
Georgi Gerganov	8504d2d0da	tests : .gitignore obj files	2024-02-08 09:46:47 +02:00
Michael Podvitskiy	c4fbb6717c	CMAKE_OSX_ARCHITECTURES for MacOS cross compilation (#5393 ) Co-authored-by: Jared Van Bortel <jared@nomic.ai>	2024-02-07 16:39:23 -05:00
Ebey Abraham	8c933b70c2	fix typo in readme (#5399 ) Co-authored-by: Ebey Abraham <ebeyabraham@microsoft.com>	2024-02-07 22:11:30 +01:00
Kamil Tomšík	b906596bb7	Add Ava in the list of llama.cpp UIs (#4362 )	2024-02-07 13:44:52 -05:00

3652 commits