koboldcpp

mirror of https://github.com/LostRuins/koboldcpp.git synced 2025-09-11 09:34:37 +00:00

Author	SHA1	Message	Date
John	aa23412989	llava : support v1.6 (#5267 ) * Create llava-survery-v2.py * Update convert-image-encoder-to-gguf.py * Update convert-image-encoder-to-gguf.py * Rename llava-survery-v2.py to llava-surgery-v2.py * Update convert-image-encoder-to-gguf.py will now search for projector * Update convert-image-encoder-to-gguf.py whoops * Update llava-surgery-v2.py * Clip: Bugfix for normalization (it did not loat the 3 std and mean values) Clip: bicubic resize function Clip: added save-to-bmp/pil for debugging and conversion from/to 32/8 images Clip: added normalization with FP16 precision simulation (image tensors match HF implementation, can be switched off, only used for llava-1.6) Clip: added newline tensor, mergetype kv, image-grid kv, new resize-pad function with resolution from gridpoints Clip: clip_image_preprocess now returns a float * vector instead of float, this way llava 1.5 and 1.6 is supported llava: added ggml cpu graph for embedding patching, added spatial_unpad preliminary support, added a lot of comments that need to be cleaned when all is final convert-image-encoder: fixed image-grid flattening * whitespace corrections * ws * Tensors are now properly permuted. Before the embeddings were inserted 1:1, now they are split into the 24x24 patches as in reference. * ws * added verbose_prompt support into cli added stopwords for llava-1.6 into cli * moved llava functions to llava.cpp, made clip.h C compatible API, replaced vector style functions with pointers, added a debug define to remove functions from compilation while not needed * ws * convert : skip unknown tensors (need for LLaVA) * llava : update readme * llava : fix compile warnings * llava : style * convert : add --skip-unknown CLI arg * server : remove clip structs * bugfix for non llava-1.6 It should now work with llava-1.5 as well * clip : minor code rearrange * llava : update readme a bit --------- Co-authored-by: John <cmt-nct@users.noreply.github.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-02-14 09:38:35 +02:00
Concedo	3cec37c2e0	Merge branch 'master' into concedo_experimental # Conflicts: # .flake8 # .github/workflows/python-lint.yml # flake.lock # ggml-cuda.cu # ggml-quants.c # llama.cpp # pocs/vdot/q8dot.cpp # pocs/vdot/vdot.cpp # tests/test-quantize-fns.cpp # tests/test-quantize-perf.cpp	2024-02-13 00:14:22 +08:00
Alexey Parfenov	684780141a	server : allow to specify tokens as strings in logit_bias (#5003 ) * server: allow to specify tokens as strings in logit_bias * Apply suggestions from code review Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-02-11 15:38:14 +02:00
Xuan Son Nguyen	907e08c110	server : add llama2 chat template (#5425 ) * server: add mistral chat template * server: fix typo * server: rename template mistral to llama2 * server: format_llama2: remove BOS * server: validate "--chat-template" argument * server: clean up using_chatml variable Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com> --------- Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>	2024-02-11 12:16:22 +02:00
Concedo	ea3fd87f68	Merge branch 'master' into concedo_experimental # Conflicts: # README.md # scripts/sync-ggml.sh	2024-02-11 15:18:46 +08:00
Riley Stewart	7c777fcd5d	server : fix prompt caching for repeated prompts (#5420 )	2024-02-09 12:49:49 +02:00
Concedo	ec2dbd99a3	Merge branch 'master' into concedo_experimental # Conflicts: # CMakeLists.txt # Makefile # README.md # flake.lock # llama.cpp	2024-02-07 22:21:32 +08:00
Justin Parker	f3e2b4fa3f	server : update `/props` with "total_slots" value (#5373 ) * include total "num_slots" in default_generation_settings_for_props * cleanup total_slots return value in /props endpoint * update /props endpoint docs with total_slots * remove num_slots from default_generation_settings_for_props * update /props endpoint section	2024-02-07 08:15:19 +02:00
Alexey Parfenov	213d1439fa	server : remove model.json endpoint (#5371 )	2024-02-06 20:08:38 +02:00
Justin Parker	8a79c591de	server : include total "num_slots" in props endpoint (#5349 )	2024-02-06 11:20:59 +02:00
Michael Coppola	31e7903221	server : add `dynatemp_range` and `dynatemp_exponent` (#5352 ) * server: added `dynatemp_range` and `dynatemp_exponent` * Update README.md --------- Co-authored-by: Michael Coppola <info@michaeljcoppola.com>	2024-02-06 11:20:00 +02:00
Niall Coates	4ffc7a17d4	server : various fixes for the prompt field in /completion (#5300 ) server : fix deadlock when prompt array contains strings and numbers server : removed an unnecessary generation when generating multi-prompts server : removed an unnecessary assert	2024-02-06 10:16:23 +02:00
Alexey Parfenov	a2d60c9158	server : allow to get default generation settings for completion (#5307 )	2024-02-05 10:10:22 +02:00
Concedo	6dc01297f8	Merge branch 'master' into concedo_experimental # Conflicts: # .devops/nix/package.nix # .github/workflows/build.yml # CMakeLists.txt # Makefile # README.md # flake.nix # llama.cpp # llama.h # tests/test-llama-grammar.cpp	2024-02-04 19:42:57 +08:00
Michael Klimenko	52bb63c708	refactor : switch to emplace_back to avoid extra object (#5291 )	2024-02-03 13:23:37 +02:00
Georgi Gerganov	5cb04dbc16	llama : remove LLAMA_MAX_DEVICES and LLAMA_SUPPORTS_GPU_OFFLOAD (#5240 ) * llama : remove LLAMA_MAX_DEVICES from llama.h ggml-ci * Update llama.cpp Co-authored-by: slaren <slarengh@gmail.com> * server : remove LLAMA_MAX_DEVICES ggml-ci * llama : remove LLAMA_SUPPORTS_GPU_OFFLOAD ggml-ci * train : remove LLAMA_SUPPORTS_GPU_OFFLOAD * readme : add deprecation notice * readme : change deprecation notice to "remove" and fix url * llama : remove gpu includes from llama.h ggml-ci --------- Co-authored-by: slaren <slarengh@gmail.com>	2024-01-31 17:30:17 +02:00
Concedo	15deabd200	Merge branch 'master' into concedo_experimental # Conflicts: # .github/workflows/build.yml # .github/workflows/editorconfig.yml # .gitignore # CMakeLists.txt # README.md	2024-01-31 18:53:38 +08:00
Georgi Gerganov	e6f291d158	server : fix context shift (#5195 ) * server : fix context shift + simplify self-extend * server : take system_tokens into account * server : more n_past fixes * server : rever n_past_se changes	2024-01-30 20:17:30 +02:00
Concedo	54cc31f9dc	Merge branch 'master' into concedo_experimental # Conflicts: # .ecrc # .github/workflows/build.yml # CMakeLists.txt # README.md # llama.cpp # tests/test-c.c	2024-01-30 19:18:10 +08:00
Concedo	f73de33f74	Merge branch 'master' into concedo_experimental # Conflicts: # .github/workflows/build.yml # .github/workflows/docker.yml # CMakeLists.txt # Makefile # README.md # ci/README.md # ci/run.sh # flake.lock # ggml-metal.m # ggml-opencl.cpp # ggml-vulkan-shaders.hpp # ggml-vulkan.cpp # ggml-vulkan.h # ggml.c # ggml_vk_generate_shaders.py # llama.cpp # llama.h # pocs/vdot/vdot.cpp # tests/test-llama-grammar.cpp # tests/test-sampling.cpp	2024-01-29 23:12:09 +08:00
Wu Jian Ping	c82d18e863	server : embeddings compatibility for OpenAI (#5190 )	2024-01-29 15:48:10 +02:00
Abhilash Majumder	0f648573dd	ggml : add unified SYCL backend for Intel GPUs (#2690 ) * first update for migration * update init_cublas * add debug functio, commit all help code * step 1 * step 2 * step3 add fp16, slower 31->28 * add GGML_LIST_DEVICE function * step 5 format device and print * step6, enhance error check, remove CUDA macro, enhance device id to fix none-zero id issue * support main device is non-zero * step7 add debug for code path, rm log * step 8, rename all macro & func from cuda by sycl * fix error of select non-zero device, format device list * ren ggml-sycl.hpp -> ggml-sycl.h * clear CMAKE to rm unused lib and options * correct queue: rm dtct:get_queue * add print tensor function to debug * fix error: wrong result in 658746bb26702e50f2c59c0e4ada8e9da6010481 * summary dpct definition in one header file to replace folder:dpct * refactor device log * mv dpct definition from folder dpct to ggml-sycl.h * update readme, refactor build script * fix build with sycl * set nthread=1 when sycl, increase performance * add run script, comment debug code * add ls-sycl-device tool * add ls-sycl-device, rm unused files * rm rear space * dos2unix * Update README_sycl.md * fix return type * remove sycl version from include path * restore rm code to fix hang issue * add syc and link for sycl readme * rm original sycl code before refactor * fix code err * add know issue for pvc hang issue * enable SYCL_F16 support * align pr4766 * check for sycl blas, better performance * cleanup 1 * remove extra endif * add build&run script, clean CMakefile, update guide by review comments * rename macro to intel hardware * editor config format * format fixes * format fixes * editor format fix * Remove unused headers * skip build sycl tool for other code path * replace tab by space * fix blas matmul function * fix mac build * restore hip dependency * fix conflict * ren as review comments * mv internal function to .cpp file * export funciton print_sycl_devices(), mv class dpct definition to source file * update CI/action for sycl code, fix CI error of repeat/dup * fix action ID format issue * rm unused strategy * enable llama_f16 in ci * fix conflict * fix build break on MacOS, due to CI of MacOS depend on external ggml, instead of internal ggml * fix ci cases for unsupported data type * revert unrelated changed in cuda cmake remove useless nommq fix typo of GGML_USE_CLBLAS_SYCL * revert hip cmake changes * fix indent * add prefix in func name * revert no mmq * rm cpu blas duplicate * fix no_new_line * fix src1->type==F16 bug. * pass batch offset for F16 src1 * fix batch error * fix wrong code * revert sycl checking in test-sampling * pass void as arguments of ggml_backend_sycl_print_sycl_devices * remove extra blank line in test-sampling * revert setting n_threads in sycl * implement std::isinf for icpx with fast math. * Update ci/run.sh Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update examples/sycl/run-llama2.sh Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update examples/sycl/run-llama2.sh Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update CMakeLists.txt Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update CMakeLists.txt Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update CMakeLists.txt Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update CMakeLists.txt Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * add copyright and MIT license declare * update the cmd example --------- Co-authored-by: jianyuzh <jianyu.zhang@intel.com> Co-authored-by: luoyu-intel <yu.luo@intel.com> Co-authored-by: Meng, Hengyu <hengyu.meng@intel.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-01-28 17:56:23 +02:00
Michael Klimenko	35a2ee9143	Remove unused data and add fixes (#5154 ) * Remove unused data and add fixes * Add missing file * Address review comments * Replace the scope of vq allocation	2024-01-27 15:25:55 +01:00
Maximilian Winter	ec903c0341	server : add self-extend support (#5104 ) * Ported self extension to server example * Update server.cpp * Fixed prompt caching without self extend * Update server.cpp * Added description to server readme. * Update server.cpp * Update server.cpp * Update server.cpp * Update server.cpp * Update README.md * Changed descriptions * server : formatting * Update examples/server/server.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update examples/server/server.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update server.cpp * Update server.cpp --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-01-27 15:38:05 +02:00
Concedo	ed09a854f0	Merge branch 'master' into concedo_experimental # Conflicts: # .github/workflows/build.yml # .gitignore # CMakeLists.txt # Makefile # README.md # ci/run.sh # ggml-opencl.cpp # tests/CMakeLists.txt	2024-01-27 11:45:07 +08:00
Xuan Son Nguyen	48c857aa10	server : refactored the task processing logic (#5065 ) * server: add llama_server_queue struct * server: add llama_server_response_event * server: add comments * server: move all mutexes away from server.cpp * server: correct multitask response * server: only add back deferred tasks when one slot is available * server: fix a race condition cause by "request_completion"	2024-01-26 14:42:20 +02:00
Concedo	1cb8a5e955	Merge branch 'master' into concedo_experimental # Conflicts: # .github/workflows/build.yml # .gitignore # CMakeLists.txt # Makefile # README.md # ci/run.sh # flake.lock # flake.nix # ggml-cuda.cu # ggml-cuda.h # scripts/get-wikitext-2.sh # tests/CMakeLists.txt	2024-01-21 14:32:15 +08:00
Concedo	71e9a64171	Merge branch 'master' into concedo_experimental # Conflicts: # .github/workflows/nix-ci.yml # CMakeLists.txt # Makefile # ggml-cuda.cu # ggml-opencl.cpp # llama.cpp	2024-01-20 23:27:42 +08:00
Xuan Son Nguyen	821f0a271e	server : defer tasks when "slot unavailable" (#5018 ) * server: defer task when no slot is available * remove unnecessary log --------- Co-authored-by: Xuan Son Nguyen <xuanson.nguyen@snowpack.eu>	2024-01-18 22:33:05 +02:00
Concedo	dc7bc0cb50	Merge commit '`584d674be6`' into concedo_experimental # Conflicts: # .github/workflows/nix-flake-update.yml # Makefile # Package.swift # ggml-cuda.cu # tests/test-quantize-fns.cpp	2024-01-14 16:29:44 +08:00
Georgi Gerganov	0ea069b87b	server : fix prompt caching with system prompt (#4914 )	2024-01-13 19:31:26 +02:00
Ziad Ben Hadj-Alouane	356327feb3	server : fix deadlock that occurs in multi-prompt scenarios (#4905 ) * * fix deadlock * * dont ruint all whitespace	2024-01-13 16:20:46 +02:00
makomk	ee8243adaa	server : fix crash with multimodal models without BOS token (#4904 )	2024-01-13 16:16:11 +02:00
slaren	e7e4df031b	llama : ggml-backend integration (#4766 ) * llama : ggml-backend integration * ggml-backend : add names to buffers * fix unmap after loading * batched-bench : add tensor_split param * llama : check for null tensor_split * ggml-backend : increase GGML_MAX_BACKENDS * improve graph splitting, partial fix for --no-kv-offload * cuda : add ggml-backend split buffer support * cuda : do not create buffer types for devices that don't exist (fixes usage without CUDA devices available) * ggml : fix null backend dereference (#4807) * ggml : fix null backend dereference * ggml : also check ggml_backend_is_cpu * test-backend-ops : check buffer allocation failures * llama : add cparam (split_mode) and command line argument (--split-mode, -sm) to configure the split mode (none, layer or row) * ggml : fix mul_mat_id work size * llama : rewrite session kv load/set without graphs * minor * llama : only initialize used backends, free backends on context free * llama : abort ctx if cuda backend init fails * llama : rewrite lora with ggml-backend and compute on CPU ggml-ci * llama : only map to a backend buffer the region of the file mapping containing the tensors used in the buffer * opencl : add ggml-backend buffer type * cuda : only use batched_cublas with batched mat muls (fixes fp16 tg perf) * llama : on Metal, by default offload the full model ggml-ci * metal : page align the data ptr (#4854) * Apply suggestions from code review Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * cuda : fix split buffer free * address review comments * llama-bench : add split-mode parameter * fix whitespace * opencl : fix double initialization * server : add --split-mode parameter * use async copy and compute to improve multi-gpu performance ggml-ci * use async memcpys to copy the graph outputs to the CPU * fix opencl * use a host buffer for the cpu compute buffer for faster copies to the gpu --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2024-01-12 20:07:38 +01:00
Georgi Gerganov	1d118386fe	server : fix infill when prompt is empty (#4833 )	2024-01-11 23:23:49 +02:00
Laura	4330bd83fe	server : implement credentialed CORS (#4514 ) * Implement credentialed CORS according to MDN * Fix syntax error * Move validate_api_key up so it is defined before its first usage	2024-01-11 20:02:48 +02:00
Michael Coppola	27379455c3	server : support for multiple api keys (#4864 ) * server: added support for multiple api keys, added loading api keys from file * minor: fix whitespace * added file error handling to --api-key-file, changed code to better reflect current style * server: update README.md for --api-key-file --------- Co-authored-by: Michael Coppola <info@michaeljcoppola.com>	2024-01-11 19:51:17 +02:00
Behnam M	eab6795006	server : add `LOG_INFO` when model is successfully loaded (#4881 ) * added /health endpoint to the server * added comments on the additional /health endpoint * Better handling of server state When the model is being loaded, the server state is `LOADING_MODEL`. If model-loading fails, the server state becomes `ERROR`, otherwise it becomes `READY`. The `/health` endpoint provides more granular messages now according to the server_state value. * initialized server_state * fixed a typo * starting http server before initializing the model * Update server.cpp * Update server.cpp * fixes * fixes * fixes * made ServerState atomic and turned two-line spaces into one-line * updated `server` readme to document the `/health` endpoint too * used LOG_INFO after successful model loading	2024-01-11 19:41:39 +02:00
Isaac McFadyen	2f043328e3	server : fix typo in model name (#4876 )	2024-01-11 16:33:26 +02:00
Georgi Gerganov	5c1980d8d4	server : fix build + rename enums (#4870 )	2024-01-11 09:10:34 +02:00
Behnam M	cd108e641d	server : add a `/health` endpoint (#4860 ) * added /health endpoint to the server * added comments on the additional /health endpoint * Better handling of server state When the model is being loaded, the server state is `LOADING_MODEL`. If model-loading fails, the server state becomes `ERROR`, otherwise it becomes `READY`. The `/health` endpoint provides more granular messages now according to the server_state value. * initialized server_state * fixed a typo * starting http server before initializing the model * Update server.cpp * Update server.cpp * fixes * fixes * fixes * made ServerState atomic and turned two-line spaces into one-line	2024-01-10 21:56:05 +02:00
Concedo	f04b6e7287	Merge branch 'master' into concedo_experimental # Conflicts: # .devops/nix/package.nix # CMakeLists.txt # README.md # ggml-metal.m # ggml.c	2024-01-08 14:18:49 +08:00
Georgi Gerganov	67984921a7	server : fix n_predict check (#4798 )	2024-01-07 08:45:26 +02:00
Concedo	c9fdd42da2	Merge branch 'master' into concedo_experimental # Conflicts: # Package.swift	2024-01-05 18:32:54 +08:00
Georgi Gerganov	012cf349ae	server : send token probs for "stream == false" (#4714 )	2024-01-04 19:56:33 +02:00
Concedo	234f79fe9d	Merge branch 'master' into concedo_experimental # Conflicts: # CMakeLists.txt # ci/run.sh # llama.cpp	2024-01-03 22:33:38 +08:00
Georgi Gerganov	32866c5edd	editorconfig : fix whitespace and indentation #4710	2024-01-02 13:28:15 +02:00
minarchist	5d7002d437	server : add --override-kv parameter (#4710 ) * Changes to server to allow metadata override * documentation * flake.nix: expose full scope in legacyPackages * flake.nix: rocm not yet supported on aarch64, so hide the output * flake.nix: expose checks * workflows: nix-ci: init; build flake outputs * workflows: nix-ci: add a job for eval * workflows: weekly `nix flake update` * workflows: nix-flakestry: drop tag filters ...and add a job for flakehub.com * workflows: nix-ci: add a qemu job for jetsons * flake.nix: suggest the binary caches * flake.lock: update to a commit recently cached by nixpkgs-cuda-ci --------- Co-authored-by: John <john@jLap.lan> Co-authored-by: Someone Serge <sergei.kozlukov@aalto.fi>	2024-01-02 12:38:15 +02:00
Concedo	9e0dee769b	Merge branch 'master' into concedo_experimental # Conflicts: # .github/workflows/build.yml # flake.lock # flake.nix	2024-01-01 16:04:17 +08:00
Georgi Gerganov	9fbda719de	clip : refactor + bug fixes (#4696 ) * clip : refactor + bug fixes ggml-ci * server : add log message	2023-12-30 23:24:42 +02:00

... 3 4 5 6 7 ...

357 commits