koboldcpp

mirror of https://github.com/LostRuins/koboldcpp.git synced 2026-05-31 05:03:44 +00:00

Author	SHA1	Message	Date
Concedo	ec5dea14d7	merged, try to fix metal build	2024-03-14 11:15:50 +08:00
Linwei Wang	19885d205e	readme : update details about running llama in Termux on Android (#6039)	2024-03-13 20:34:40 +02:00
Georgi Gerganov	76a936c893	readme : update API changes and hot topics	2024-03-13 20:33:56 +02:00
Clint Herron	463628372d	grammar : handle missing "root" node (#6004)	2024-03-13 20:10:40 +02:00
slaren	f30ea47a87	llama : add pipeline parallelism support (#6017 ) * llama : add pipeline parallelism support for batch processing with multiple CUDA GPUs ggml-ci * server : add -ub, --ubatch-size parameter * fix server embedding test * llama : fix Mamba inference for pipeline parallelism Tested to work correctly with both `main` and `parallel` examples. * llama : limit max batch size to n_batch * add LLAMA_SCHED_MAX_COPIES to configure the number of input copies for pipeline parallelism default increase to 4 (from 2) changing this value may improve performance for some systems, but increases memory usage * fix hip build * fix sycl build (disable cpy_tensor_async) * fix hip build * llama : limit n_batch and n_ubatch to n_ctx during context creation * llama : fix norm backend * batched-bench : sync after decode * swiftui : sync after decode * ggml : allow ggml_get_rows to use multiple threads if they are available * check n_ubatch >= n_tokens with non-casual attention * llama : do not limit n_batch to n_ctx with non-casual attn * server : construct batch with size of llama_n_batch * ggml_backend_cpu_graph_compute : fix return value when alloc fails * llama : better n_batch and n_ubatch comment * fix merge * small fix * reduce default n_batch to 2048 --------- Co-authored-by: Francis Couture-Harpin <git@compilade.net> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-03-13 18:54:21 +01:00
slaren	d8fd0ccf6a	test-backend-ops : skip CPU backend by default (#6028)	2024-03-13 15:58:30 +02:00
Concedo	9f102b9db6	update makefile	2024-03-13 21:53:52 +08:00
AidanBeltonS	b3d978600f	Update get version (#6025)	2024-03-13 18:47:54 +05:30
Xuan Son Nguyen	99b71c068f	Server: Use multi-task for embeddings endpoint (#6001 ) * use multitask for embd endpoint * specify types * remove redundant {"n_predict", 0}	2024-03-13 11:39:11 +01:00
Concedo	7a2de82c96	updated lite	2024-03-13 18:27:19 +08:00
Concedo	a9435163ab	fixed uploading non square images	2024-03-13 14:19:51 +08:00
Concedo	85287c7701	handle uploading non square images	2024-03-13 13:57:14 +08:00
Concedo	47c42fd45c	fix for mamba processing	2024-03-13 13:27:46 +08:00
Concedo	ba950716a9	Merge branch 'master' into concedo_experimental # Conflicts: # CMakeLists.txt # Makefile # Package.swift # README.md # build.zig # llama.cpp # tests/test-tokenizer-1-bpe.cpp # tests/test-tokenizer-1-llama.cpp	2024-03-13 11:21:58 +08:00
slaren	306d34be7a	ci : remove tidy-review (#6021)	2024-03-12 17:55:19 +02:00
Concedo	edb05e761f	Update some prints	2024-03-12 21:40:36 +08:00
Concedo	88705cb89a	improve quiet mode for SD	2024-03-12 20:50:39 +08:00
Georgi Gerganov	8030da7afe	ggml : reuse quantum structs across backends (#5943 ) * ggml : reuse quant blocks across backends ggml-ci * ggml : define helper constants only for CUDA and SYCL ggml-ci * ggml : define helper quantum constants for SYCL ggml-ci	2024-03-12 14:27:20 +02:00
Concedo	60d234550b	fix colab	2024-03-12 20:09:49 +08:00
Georgi Gerganov	184215e783	ggml : fix UB in IQ2_S and IQ3_S (#6012)	2024-03-12 13:49:55 +02:00
Concedo	6c6ad93f01	added basic support for password protection (+2 squashed commit) Squashed commit: [ff91ca72] added basic support for password protection [91b0b208] updated docs	2024-03-12 19:47:12 +08:00
Georgi Gerganov	48358b2e5b	sycl : update IQ1_S kernels (WIP - not working!) (#5995 ) * sycl : try to fix after IQ1_S changes * sycl : iq1s_grid -> iq1s_grid_gpu * sycl : fix grid type	2024-03-12 11:15:05 +02:00
Concedo	a69bc44e7a	edit colab (+1 squashed commits) Squashed commits: [c7ccb99d] update colab with llava	2024-03-12 15:24:53 +08:00
gliptic	5cdb371731	grammar : fix unnecessarily retained pointer to rules (#6003)	2024-03-11 21:59:03 +02:00
Kawrakow	44ca159faf	1.5 bit: we can do even better (#5999 ) * iq1_s: we can do even better Spent one of the 4 scale bits on a signs of a 0.125 shift. I.e., quants are now -1 + delta, delta, 1 + delta, where delta is +/- 0.125. CUDA works, same performance as before. PPL(LLaMA-v2-7B) is now 11.85! * iq1_s: make scalar and AVX2 work with the new version * iq1_s: make Neon work with new version. ~10% drop in performance, so will need some more work. * iq1_s: make Metal work with new version * iq1_s: very slightly faster dequantize on Metal * iq1_s: fix dequantize on the CPU --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-03-11 17:53:15 +02:00
Georgi Gerganov	05b06210c9	llama : more consistent names of count variables (#5994 ) * llama : more consistent names of count variables ggml-ci * llama : n_parallel -> n_seq_max * common : fix param name * examples : fix param name	2024-03-11 17:49:47 +02:00
Georgi Gerganov	83796e62bc	llama : refactor unicode stuff (#5992 ) * llama : refactor unicode stuff ggml-ci * unicode : names * make : fix c++ compiler * unicode : names * unicode : straighten tables * zig : fix build * unicode : put nfd normalization behind API ggml-ci * swift : fix build * unicode : add BOM * unicode : add <cstdint> ggml-ci * unicode : pass as cpts as const ref	2024-03-11 17:47:47 +02:00
Concedo	6a32c14e86	Merge branch 'master' into concedo_experimental # Conflicts: # .github/workflows/build.yml # CMakeLists.txt # Makefile # README-sycl.md # README.md # flake.lock # scripts/sync-ggml-am.sh # scripts/sync-ggml.last # scripts/sync-ggml.sh # tests/.gitignore # tests/test-backend-ops.cpp	2024-03-11 23:00:47 +08:00
Concedo	9229ea664e	if no existing filepath, do not use cwd, use last path instead	2024-03-11 22:19:38 +08:00
Stefan Kapusniak	4dd1c2b81a	Improve launcher file dialog initial paths (#740 ) - In the launcher, if an existing value is set for a file value (e.g. Model), use that file's directory the initial directory when the file dialog is opened with 'Browse'. - In the launcher always set the intial directory for 'Load' to cwd.	2024-03-11 22:05:46 +08:00
Concedo	95c8090967	updated lite	2024-03-11 21:59:18 +08:00
Concedo	227f59dab6	added a simple program to do quantization for clip models	2024-03-11 21:50:30 +08:00
Jakub N	828defefb6	Update server docker image URLs (#5997)	2024-03-11 14:40:42 +01:00
Concedo	2dc647f892	updated lite (+1 squashed commits) Squashed commits: [f33ea44a] updated lite	2024-03-11 20:10:34 +08:00
Concedo	d59ec68753	added interrogate endpoint (+1 squashed commits) Squashed commits: [7bf96261] added interrogate endpoint	2024-03-11 18:50:18 +08:00
Xuan Son Nguyen	caa106d4e0	Server: format error to json (#5961 ) * server: format error to json * server: do not crash on grammar error * fix api key test case * revert limit max n_predict * small fix * correct coding style * update completion.js * launch_slot_with_task * update docs * update_slots * update webui * update readme	2024-03-11 10:56:41 +01:00
Concedo	e4946b96ea	support llava with gpt4v openai endpoint	2024-03-11 17:36:10 +08:00
Michael Podvitskiy	3202361c5b	ggml, ci : Windows ARM runner and build fixes (#5979 ) * windows arm ci * fix `error C2078: too many initializers` with ggml_vld1q_u32 macro for MSVC ARM64 * fix `warning C4146: unary minus operator applied to unsigned type, result still unsigned` * fix `error C2065: '__fp16': undeclared identifier`	2024-03-11 11:28:51 +02:00
Minsoo Cheong	332bdfd798	server : maintain chat completion id for streaming responses (#5988 ) * server: maintain chat completion id for streaming responses * Update examples/server/utils.hpp * Update examples/server/utils.hpp --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-03-11 10:09:32 +02:00
Gilad S	ecab1c75de	cmake : fix subdir for `LLAMA_METAL_EMBED_LIBRARY` (#5985)	2024-03-11 10:00:08 +02:00
Georgi Gerganov	ee35600b90	llama : fix F16/F32 downcast + improve names (#5980)	2024-03-11 09:56:47 +02:00
Concedo	484d90c330	llava support is now fully functioning	2024-03-11 15:55:32 +08:00
Kawrakow	be858f6205	Better 1.5 bit quantization (#5971 ) * Trying blocvks of 16 for IQ1_S - seems slightly better * iq1s_blocks16: Adjust scale fudge factor to 1.125 * iq1s_blocks16: going to blocks of 32 with 2048 lattice points, so same bpw. This is even better than blocks of 16. Should I try blocks of 64? But to keep the same bpw, when I go to 4096 lattice points, I need to remove blocks alltogether and just have superblocks of 256 weights. * iq1s_blocks16: Use 2<x^2> as sigma2 in weight adjustment iq1s_blocks16: scalar and AVX2 dot products * iq1s_blocks16: CUDA dot product * iq1s_blocks16: Metal works, Neon does not Metal works but TG is dog slow (35 t/s). PP is OKish (493 t/s). Not seeing the bug in the Neon implementation for now. * iq1s_blocks16: fixed Neon * iq1s_blocks16: very slightly faster TG on Metal Still pathetic at 37 t/s * iq1s_blocks16: speedup Metal by packing codebook into uint32_t's * Formatting * iq1s_blocks16: uint32_t codebook is also better in CUDA TG-128 is now 204 t/s up from 194 t/s. PP-512 is 5890 t/s, so significantly better than other quants * iq1s_blocks16: slightly faster Neon dot product * iq1s_blocks16: faster AVX2 dot product * iq1s_blocks16: adjust to ggml-common.h --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-03-11 07:51:49 +01:00
Abhilash Majumder	ef3ced26a3	[SYCL] Add q3_s and q1_s (#5886 ) * Add q3_s and q1_s * fix compilation * fix build * fix build * fix build * enable ops * rm macro * increase grid space	2024-03-11 10:27:56 +05:30
AidanBeltonS	3814a07392	[SYCL] Add support for SYCL Nvidia target (#5738 ) * Add support for nvidia target in CMake * Update sycl read-me for Nvidia target * Fix errors	2024-03-11 09:13:57 +08:00
Georgi Gerganov	bb6d00bbf9	metal : move mm_id indices to shared mem (#5982)	2024-03-10 23:12:48 +02:00
Dean	7ab7b733bb	android : fix utf8 decoding error (#5935 ) * examples: fix utf8 decoding error some models have a tokenizer that decodes an id into an incomplete utf8 sequence, need to validate and wait for next token one example would be: https://huggingface.co/Qwen/Qwen1.5-1.8B-Chat-GGUF/resolve/main/qwen1_5-1_8b-chat-q4_0.gguf and and an example of the token is 18137 * android : minor --------- Co-authored-by: zhangfuwen <zhangfuwen@foxmail.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-03-10 22:03:17 +02:00
Georgi Gerganov	d9f65c97c3	readme : update hot topics	2024-03-10 20:58:26 +02:00
Georgi Gerganov	b838b53ad6	sync : ggml	2024-03-10 20:10:46 +02:00
Georgi Gerganov	df4dc3e7cb	ggml : try fix 32-bit arm compat (whisper/1938) * ggml : try fix 32-bit arm compat * ggml : fix cont	2024-03-10 20:10:39 +02:00

3859 commits