koboldcpp

mirror of https://github.com/LostRuins/koboldcpp.git synced 2025-09-12 01:54:37 +00:00

Author	SHA1	Message	Date
Kawrakow	76aa30a263	Add ability to use Q5_0, Q5_1, and IQ4_NL for quantized K cache (#6183 ) * k_cache: be able to use Q5_0 * k_cache: be able to use Q5_1 on CODA * k_cache: be able to use Q5_0 on Metal * k_cache: be able to use Q5_1 on Metal * k_cache: be able to use IQ4_NL - just CUDA for now * k_cache: be able to use IQ4_NL on Metal * k_cache: add newly added supported types to llama-bench and CUDA supports_op --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-03-21 08:27:57 +01:00
Ziang Wu	f9c7ba3447	llava : update MobileVLM-README.md (#6180 )	2024-03-20 17:29:51 +02:00
Ziang Wu	272935b281	llava : add MobileVLM_V2 backup (#6175 ) * Add MobileVLM_V2 backup * Update MobileVLM-README.md * Update examples/llava/MobileVLM-README.md Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update examples/llava/convert-image-encoder-to-gguf.py Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * clip : fix whitespace * fix deifinition mistake in clip.cpp --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-03-20 17:02:32 +02:00
Xuan Son Nguyen	91f8ad167d	Server: version bump for httplib and json (#6169 ) * server: version bump for httplib and json * fix build * bring back content_length	2024-03-20 13:30:36 +01:00
Georgi Gerganov	bc0baab2ea	server : allow to override -ngl in tests (#6170 )	2024-03-20 14:14:32 +02:00
Georgi Gerganov	d795988d9e	Revert "llava : add a MobileVLM_V2-1.7B backup (#6152 )" This reverts commit `f8c4e745e1`.	2024-03-20 13:29:49 +02:00
Ziang Wu	f8c4e745e1	llava : add a MobileVLM_V2-1.7B backup (#6152 ) * Add MobileVLM_V2 backup * Update MobileVLM-README.md * Update examples/llava/MobileVLM-README.md Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update examples/llava/convert-image-encoder-to-gguf.py Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * clip : fix whitespace --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-03-20 13:20:37 +02:00
Karthick	47cc7a7bf9	Server: Handle n_keep parameter in the request (#6174 )	2024-03-20 12:02:34 +01:00
Jared Van Bortel	bd60d82d0c	server tests : more pythonic process management; fix bare `except:` (#6146 ) * server tests : remove seemingly redundant newlines in print() * server tests : use built-in subprocess features, not os.kill and psutil * server tests : do not catch e.g. SystemExit; use print_exc * server tests: handle TimeoutExpired exception * server tests: fix connect on dual-stack systems * server: tests: add new tokens regex on windows generated following new repeat penalties default changed in (#6127) * server: tests: remove the hack on windows since now we get the good socket family * server: tests: add new tokens regex following new repeat penalties default changed in (#6127) * server: tests: add new tokens regex following new repeat penalties default changed in (#6127) --------- Co-authored-by: Pierrick HYMBERT <pierrick.hymbert@gmail.com>	2024-03-20 06:33:49 +01:00
Neo Zhang Jianyu	6c0b287748	update readme sycl for new update (#6151 ) * update readme sycl for new update * Update README-sycl.md Co-authored-by: Abhilash Majumder <30946547+abhilash1910@users.noreply.github.com> * Update README-sycl.md Co-authored-by: Abhilash Majumder <30946547+abhilash1910@users.noreply.github.com> * Update README-sycl.md Co-authored-by: Abhilash Majumder <30946547+abhilash1910@users.noreply.github.com> * Update README-sycl.md Co-authored-by: Abhilash Majumder <30946547+abhilash1910@users.noreply.github.com> * Update README-sycl.md Co-authored-by: AidanBeltonS <87009434+AidanBeltonS@users.noreply.github.com> * Update README-sycl.md Co-authored-by: AidanBeltonS <87009434+AidanBeltonS@users.noreply.github.com> * update by review comments * update w64devkit link * update for verify device id part * Update README-sycl.md Co-authored-by: Meng, Hengyu <airdldl@163.com> --------- Co-authored-by: Abhilash Majumder <30946547+abhilash1910@users.noreply.github.com> Co-authored-by: AidanBeltonS <87009434+AidanBeltonS@users.noreply.github.com> Co-authored-by: Meng, Hengyu <airdldl@163.com>	2024-03-20 11:21:41 +08:00
DAN™	d8b009a945	Remove undeed header file. (#6158 )	2024-03-19 17:16:09 +01:00
Pierrick Hymbert	d0d5de42e5	gguf-split: split and merge gguf per batch of tensors (#6135 ) * gguf-split: split and merge gguf files per tensor * gguf-split: build with make toolchain * gguf-split: rename `--split-tensors-size` to `--split-max-tensors`. Set general.split_count KV to all split * split : minor style + fix compile warnings * gguf-split: remove --upload not implemented --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-03-19 12:05:44 +01:00
Concedo	a3fa919c67	Merge branch 'upstream' into concedo_experimental # Conflicts: # .github/workflows/build.yml # Makefile # flake.lock # ggml-cuda.cu # ggml-cuda.h	2024-03-19 18:57:22 +08:00
Felix	104f5e0fc1	clip : fix memory leak (#6138 )	2024-03-18 17:40:22 +02:00
slaren	2bf8d0f7c4	backend : offload large batches to GPU (#6083 ) * backend : offload large batches to GPU * fix hip * code cleanup * fix CUDA split buffers * Update ggml-backend-impl.h Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * cuda : fix memset without set_device * imatrix : remove sched affix from weight names * sched : add a new split if the current one has too many inputs reduce max inputs per split more cleanup * update backends ggml-ci --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2024-03-18 11:03:04 +01:00
Pierrick Hymbert	d01b3c4c32	common: llama_load_model_from_url using --model-url (#6098 ) * common: llama_load_model_from_url with libcurl dependency Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-03-17 19:12:37 +01:00
Concedo	8b360b661c	Merge branch 'upstream' into concedo_experimental # Conflicts: # Makefile # README.md # common/common.h	2024-03-17 23:03:12 +08:00
Daniel Bevenius	b5f4ae09c3	gritlm : add initial README.md (#6086 ) * gritlm: add initial README.md to examples/gritlm This commit adds a suggestion for an initial README.md for the gritlm example. Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com> * squash! gritlm: add initial README.md to examples/gritlm Use the `scripts/hf.sh` script to download the model file. Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com> * squash! gritlm: add initial README.md to examples/gritlm Fix editorconfig-checker error in examples/gritlm/README.md. Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com> --------- Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>	2024-03-16 17:46:29 +02:00
Ting Lou	4e9a7f7f7f	llava : change API to pure C style for Rust FFI bindgen (#6079 ) Co-authored-by: Lou Ting <louting.t@alibaba-inc.com>	2024-03-15 16:31:05 +02:00
Neo Zhang Jianyu	46acb36767	fix set main gpu error (#6073 )	2024-03-15 18:53:53 +08:00
slaren	b0bc9f4a9d	llama-bench : use random tokens to improve accuracy with mixtral (#6069 )	2024-03-15 10:22:24 +02:00
Concedo	93d3871056	Merge branch 'upstream' into concedo_experimental # Conflicts: # .github/workflows/build.yml # CMakeLists.txt # Makefile # ggml-metal.m	2024-03-15 10:37:48 +08:00
Steve Grubb	6e0438da3c	gguf : fix resource leaks (#6061 ) There several places where a gguf context is allocated. A call to gguf_free is missing in some error paths. Also on linux, llama-bench was missing a fclose.	2024-03-14 20:29:32 +02:00
Georgi Gerganov	044ec4b2a5	embedding : add EOS token if not present (#899 )	2024-03-14 15:14:14 +02:00
Jian Liao	15a333260a	readme : improve readme for Llava-1.6 example (#6044 ) Co-authored-by: Jian Liao <jianliao@adobe.com>	2024-03-14 13:18:23 +02:00
Pierrick Hymbert	43241adf22	server: disable debug release type sanitizer, simplify trigger (#6047 ) - increase time out for server - do not fail fast	2024-03-14 13:15:39 +02:00
Georgi Gerganov	68265ebfc6	embedding : print all resulting embeddings (#899 )	2024-03-14 12:37:20 +02:00
Georgi Gerganov	0fd6c1f015	embedding : print cosine similarity (#899 )	2024-03-14 10:12:29 +02:00
Concedo	ec5dea14d7	merged, try to fix metal build	2024-03-14 11:15:50 +08:00
slaren	f30ea47a87	llama : add pipeline parallelism support (#6017 ) * llama : add pipeline parallelism support for batch processing with multiple CUDA GPUs ggml-ci * server : add -ub, --ubatch-size parameter * fix server embedding test * llama : fix Mamba inference for pipeline parallelism Tested to work correctly with both `main` and `parallel` examples. * llama : limit max batch size to n_batch * add LLAMA_SCHED_MAX_COPIES to configure the number of input copies for pipeline parallelism default increase to 4 (from 2) changing this value may improve performance for some systems, but increases memory usage * fix hip build * fix sycl build (disable cpy_tensor_async) * fix hip build * llama : limit n_batch and n_ubatch to n_ctx during context creation * llama : fix norm backend * batched-bench : sync after decode * swiftui : sync after decode * ggml : allow ggml_get_rows to use multiple threads if they are available * check n_ubatch >= n_tokens with non-casual attention * llama : do not limit n_batch to n_ctx with non-casual attn * server : construct batch with size of llama_n_batch * ggml_backend_cpu_graph_compute : fix return value when alloc fails * llama : better n_batch and n_ubatch comment * fix merge * small fix * reduce default n_batch to 2048 --------- Co-authored-by: Francis Couture-Harpin <git@compilade.net> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-03-13 18:54:21 +01:00
Xuan Son Nguyen	99b71c068f	Server: Use multi-task for embeddings endpoint (#6001 ) * use multitask for embd endpoint * specify types * remove redundant {"n_predict", 0}	2024-03-13 11:39:11 +01:00
Concedo	ba950716a9	Merge branch 'master' into concedo_experimental # Conflicts: # CMakeLists.txt # Makefile # Package.swift # README.md # build.zig # llama.cpp # tests/test-tokenizer-1-bpe.cpp # tests/test-tokenizer-1-llama.cpp	2024-03-13 11:21:58 +08:00
Georgi Gerganov	05b06210c9	llama : more consistent names of count variables (#5994 ) * llama : more consistent names of count variables ggml-ci * llama : n_parallel -> n_seq_max * common : fix param name * examples : fix param name	2024-03-11 17:49:47 +02:00
Concedo	6a32c14e86	Merge branch 'master' into concedo_experimental # Conflicts: # .github/workflows/build.yml # CMakeLists.txt # Makefile # README-sycl.md # README.md # flake.lock # scripts/sync-ggml-am.sh # scripts/sync-ggml.last # scripts/sync-ggml.sh # tests/.gitignore # tests/test-backend-ops.cpp	2024-03-11 23:00:47 +08:00
Concedo	227f59dab6	added a simple program to do quantization for clip models	2024-03-11 21:50:30 +08:00
Jakub N	828defefb6	Update server docker image URLs (#5997 )	2024-03-11 14:40:42 +01:00
Xuan Son Nguyen	caa106d4e0	Server: format error to json (#5961 ) * server: format error to json * server: do not crash on grammar error * fix api key test case * revert limit max n_predict * small fix * correct coding style * update completion.js * launch_slot_with_task * update docs * update_slots * update webui * update readme	2024-03-11 10:56:41 +01:00
Minsoo Cheong	332bdfd798	server : maintain chat completion id for streaming responses (#5988 ) * server: maintain chat completion id for streaming responses * Update examples/server/utils.hpp * Update examples/server/utils.hpp --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-03-11 10:09:32 +02:00
Dean	7ab7b733bb	android : fix utf8 decoding error (#5935 ) * examples: fix utf8 decoding error some models have a tokenizer that decodes an id into an incomplete utf8 sequence, need to validate and wait for next token one example would be: https://huggingface.co/Qwen/Qwen1.5-1.8B-Chat-GGUF/resolve/main/qwen1_5-1_8b-chat-q4_0.gguf and and an example of the token is 18137 * android : minor --------- Co-authored-by: zhangfuwen <zhangfuwen@foxmail.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-03-10 22:03:17 +02:00
Pierrick Hymbert	fa8a809a91	server: ci: windows build and tests (#5968 ) * server: ci: windows build and tests * server: ci: remove tmp push branch * server: ci: EOF EOL * Use builti Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com> * server: tests: server graceful shutdown, then kill, then hard kill * server: tests: remove python2 unicode string * server: tests: remove wrong comment on server starting, close_fds is always true * server: tests: server kill, if pid exists * server: tests: remove dependency to killall * server: tests: ci windows: pid exists better handling --------- Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>	2024-03-10 18:17:47 +01:00
DAN™	bcebd7dbf6	llama : add support for GritLM (#5959 ) * add gritlm example * gritlm results match * tabs to spaces * comment out debug printing * rebase to new embed * gritlm embeddings are back babeee * add to gitignore * allow to toggle embedding mode * Clean-up GritLM sample code. * Fix types. * Flush stdout and output ending newline if streaming. * mostly style fixes; correct KQ_mask comment * add causal_attn flag to llama_cparams * gritml : minor * llama : minor --------- Co-authored-by: Douglas Hanley <thesecretaryofwar@gmail.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-03-10 17:56:30 +02:00
Pierrick Hymbert	621e86b331	server: benchmark: chat/completions scenario and other llm servers comparison (#5941 ) * server: bench: Init a bench scenario with K6 See #5827 * server: bench: EOL EOF * server: bench: PR feedback and improved k6 script configuration * server: bench: remove llamacpp_completions_tokens_seconds as it include prompt processing time and it's misleading server: bench: add max_tokens from SERVER_BENCH_MAX_TOKENS server: bench: increase truncated rate to 80% before failing * server: bench: fix doc * server: bench: change gauge custom metrics to trend * server: bench: change gauge custom metrics to trend server: bench: add trend custom metrics for total tokens per second average * server: bench: doc add an option to debug http request * server: bench: filter dataset too short and too long sequences * server: bench: allow to filter out conversation in the dataset based on env variable * server: bench: fix assistant message sent instead of user message * server: bench: fix assistant message sent instead of user message * server : add defrag thold parameter * server: bench: select prompts based on the current iteration id not randomly to make the bench more reproducible --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-03-09 23:41:49 +01:00
Georgi Gerganov	77d1ac7e00	server : print chat template info	2024-03-09 22:04:00 +02:00
slaren	d894f352bf	perplexity : support using multiple sequences to allow larger batch sizes (#5946 ) * perplexity : support using multiple sequences to allow larger batch sizes ggml-ci * set cparams.n_parallel to the number of sequences * print tested n_ctx, add assert	2024-03-09 19:55:54 +01:00
Georgi Gerganov	58308a0ecc	server : fix metrics init (#5964 )	2024-03-09 17:34:15 +02:00
Georgi Gerganov	5b09797321	ggml : remove old quantization functions (#5942 ) * ggml : remove old quantization functions ggml-ci * ggml : simplify ggml_quantize_chunk ggml-ci * ggml : restrict correctness ggml-ci * ggml : remove hist data from the quantization API ggml-ci * tests : remove hist usage in test-backend-ops ggml-ci * vulkan : remove hist and fix typo	2024-03-09 15:53:59 +02:00
Georgi Gerganov	97c09585d6	server : clarify some items in the readme (#5957 ) * server : clarify some items in the readme * server : fix typo	2024-03-09 15:47:47 +02:00
SeungWon Jeong	fb215c3832	server : normalize embeddings (#5956 ) * output normalize embedding in '/v1/embeddings' * common : reuse llama_embd_normalize * common : better normalize impl --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-03-09 14:27:58 +02:00
Alexey Parfenov	0db32beaf0	server : fix passing prompt as tokens (#5955 ) * server: fix passing prompt as tokens * Update examples/server/server.cpp --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-03-09 13:16:53 +02:00
Georgi Gerganov	9674aaf35c	server : simplify logic for empty prompts (#5953 )	2024-03-09 12:34:18 +02:00

... 6 7 8 9 10 ...

1137 commits