diff --git a/README.md b/README.md
index 530b7ddfd..7517e40ce 100644
--- a/README.md
+++ b/README.md
@@ -53,14 +53,15 @@ when you can't use the precompiled binary directly, we provide an automated buil
 
 ## OSX and Linux Manual Compiling
 - Otherwise, you will have to compile your binaries from source. A makefile is provided, simply run `make`.
+- If you want, you can also link your own install of OpenBLAS manually with `make LLAMA_OPENBLAS=1`
 - If you want you can also link your own install of CLBlast manually with `make LLAMA_CLBLAST=1`, for this you will need to obtain and link OpenCL and CLBlast libraries.
-  - For Arch Linux: Install `cblas` and `clblast`.
-  - For Debian: Install `libclblast-dev`.
+  - For Arch Linux: Install `cblas`, `openblas` and `clblast`.
+  - For Debian: Install `libclblast-dev` and `libopenblas-dev`.
 - You can attempt a CuBLAS build with `LLAMA_CUBLAS=1`. You will need CUDA Toolkit installed. Some have also reported success with the CMake file, though that is more for windows.
-- For a full featured build (all backends), do `make LLAMA_CLBLAST=1 LLAMA_CUBLAS=1 LLAMA_VULKAN=1`
+- For a full-featured build (all backends), do `make LLAMA_OPENBLAS=1 LLAMA_CLBLAST=1 LLAMA_CUBLAS=1 LLAMA_VULKAN=1`
 - After all binaries are built, you can run the python script with the command `koboldcpp.py [ggml_model.bin] [port]`
-- Note: OpenBLAS backend is now deprecated and will be removed, as pure CPU is now almost always faster.
+- Note: Many OSX users have found that using Accelerate is actually faster than OpenBLAS. To try, you may wish to run with `--noblas` and compare speeds.
 
 ### Arch Linux Packages
 There are 4 community made AUR packages (Maintained by @AlpinDale) available: [CPU-only](https://aur.archlinux.org/packages/koboldcpp-cpu), [CLBlast](https://aur.archlinux.org/packages/koboldcpp-clblast), [CUBLAS](https://aur.archlinux.org/packages/koboldcpp-cuda), and [HIPBLAS](https://aur.archlinux.org/packages/koboldcpp-hipblas). They are, respectively, for users with no GPU, users with a GPU (vendor-agnostic), users with NVIDIA GPUs, and users with a supported AMD GPU.
@@ -88,12 +89,12 @@ You can then run koboldcpp anywhere from the terminal by running `koboldcpp` to
 - If you want to generate the .exe file, make sure you have the python module PyInstaller installed with pip ('pip install PyInstaller').
 - Run the script make_pyinstaller.bat at a regular terminal (or Windows Explorer).
 - The koboldcpp.exe file will be at your dist folder.
-- If you wish to use your own version of the additional Windows libraries (OpenCL, CLBlast), you can do it with:
+- If you wish to use your own version of the additional Windows libraries (OpenCL, CLBlast and OpenBLAS), you can do it with:
   - OpenCL - tested with https://github.com/KhronosGroup/OpenCL-SDK . If you wish to compile it, follow the repository instructions. You will need vcpkg.
   - CLBlast - tested with https://github.com/CNugteren/CLBlast . If you wish to compile it you will need to reference the OpenCL files. It will only generate the ".lib" file if you compile using MSVC.
-  - Move the respective .lib files to the /lib folder of your project, overwriting the older files.
-  - Also, replace the existing versions of the corresponding .dll files located in the project directory root
-  - Make the KoboldCPP project using the instructions above.
+  - OpenBLAS - tested with https://github.com/xianyi/OpenBLAS .
+  - Move the respective .lib files to the /lib folder of your project, overwriting the older files.
+  - Also, replace the existing versions of the corresponding .dll files located in the project directory root (e.g. libopenblas.dll).
 - You can attempt a CuBLAS build with using the provided CMake file with visual studio. If you use the CMake file to build, copy the `koboldcpp_cublas.dll` generated into the same directory as the `koboldcpp.py` file. If you are bundling executables, you may need to include CUDA dynamic libraries (such as `cublasLt64_11.dll` and `cublas64_11.dll`) in order for the executable to work correctly on a different PC.
 - Make the KoboldCPP project using the instructions above.
 
@@ -127,7 +128,7 @@ You can then run koboldcpp anywhere from the terminal by running `koboldcpp` to
 
 ## Considerations
 - For Windows: No installation, single file executable, (It Just Works)
-- Since v1.0.6, required libopenblas, however, it was later removed.
+- Since v1.0.6, requires libopenblas; the prebuilt Windows binaries are included in this repo. If not found, it will fall back to a mode without BLAS.
 - Since v1.15, requires CLBlast if enabled, the prebuilt windows binaries are included in this repo. If not found, it will fall back to a mode without CLBlast.
 - Since v1.33, you can set the context size to be above what the model supports officially. It does increases perplexity but should still work well below 4096 even on untuned models. (For GPT-NeoX, GPT-J, and LLAMA models) Customize this with `--ropeconfig`.
 - Since v1.42, supports GGUF models for LLAMA and Falcon
@@ -141,7 +142,7 @@ You can then run koboldcpp anywhere from the terminal by running `koboldcpp` to
 - The other files are also under the AGPL v3.0 License unless otherwise stated
 
 ## Notes
-- Generation delay scales linearly with original prompt length. If CLBlast is enabled then prompt ingestion becomes a few times faster. This is automatic on windows, but will require linking on OSX and Linux. Set `--gpulayers` + `--useclblast` or `--usecublas`.
+- Generation delay scales linearly with original prompt length. If OpenBLAS is enabled, prompt ingestion becomes about 2-3x faster. This is automatic on Windows, but requires linking on OSX and Linux. CLBlast speeds this up even further, and `--gpulayers` + `--useclblast` or `--usecublas` more so.
 - I have heard of someone claiming a false AV positive report. The exe is a simple pyinstaller bundle that includes the necessary python scripts and dlls to run. If this still concerns you, you might wish to rebuild everything from source code using the makefile, and you can rebuild the exe yourself with pyinstaller by using `make_pyinstaller.bat`
 - API documentation available at `/api` and https://lite.koboldai.net/koboldcpp_api
 - Supported GGML models (Includes backward compatibility for older versions/legacy GGML models, though some newer features might be unavailable):
diff --git a/koboldcpp.py b/koboldcpp.py
index e8a779040..e2754362e 100644
--- a/koboldcpp.py
+++ b/koboldcpp.py
@@ -1593,15 +1593,15 @@ def show_new_gui():
     tabcontent = {}
 
     lib_option_pairs = [
-        (lib_default, "Use CPU"),
+        (lib_openblas, "Use OpenBLAS"),
         (lib_clblast, "Use CLBlast"),
         (lib_cublas, "Use CuBLAS"),
         (lib_hipblas, "Use hipBLAS (ROCm)"),
         (lib_vulkan, "Use Vulkan"),
-        (lib_openblas, "Use OpenBLAS (Deprecated)"),
+        (lib_default, "Use No BLAS"),
         (lib_clblast_noavx2, "CLBlast NoAVX2 (Old CPU)"),
         (lib_vulkan_noavx2, "Vulkan NoAVX2 (Old CPU)"),
-        (lib_noavx2, "CPU NoAVX2 (Old CPU)"),
+        (lib_noavx2, "NoAVX2 Mode (Old CPU)"),
         (lib_failsafe, "Failsafe Mode (Old CPU)")]
     openblas_option, clblast_option, cublas_option, hipblas_option, vulkan_option, default_option, clblast_noavx2_option, vulkan_noavx2_option, noavx2_option, failsafe_option = (opt if file_exists(lib) or (os.name == 'nt' and file_exists(opt + ".dll")) else None for lib, opt in lib_option_pairs)
     # slider data
@@ -1613,7 +1613,7 @@ def show_new_gui():
 
     if not any(runopts):
         exitcounter = 999
-        show_gui_msgbox("No Backends Available!","KoboldCPP couldn't locate any backends to use (i.e CPU, CLBlast, CuBLAS).\n\nTo use the program, please run the 'make' command from the directory.")
+        show_gui_msgbox("No Backends Available!","KoboldCPP couldn't locate any backends to use (e.g. Default, OpenBLAS, CLBlast, CuBLAS).\n\nTo use the program, please run the 'make' command from the directory.")
         time.sleep(3)
         sys.exit(2)
 
@@ -1990,7 +1990,7 @@ def show_new_gui():
 
     # presets selector
-    makelabel(quick_tab, "Presets:", 1,0,"Select a backend to use.\nCPU runs purely on CPU only.\nCuBLAS runs on Nvidia GPUs, and is much faster.\nCLBlast works on all GPUs but is somewhat slower.\nNoAVX2 and Failsafe modes support older PCs.")
+    makelabel(quick_tab, "Presets:", 1,0,"Select a backend to use.\nOpenBLAS and NoBLAS run purely on CPU.\nCuBLAS runs on Nvidia GPUs, and is much faster.\nCLBlast works on all GPUs but is somewhat slower.\nNoAVX2 and Failsafe modes support older PCs.")
     runoptbox = ctk.CTkComboBox(quick_tab, values=runopts, width=180,variable=runopts_var, state="readonly")
     runoptbox.grid(row=1, column=1,padx=8, stick="nw")
@@ -2029,7 +2029,7 @@ def show_new_gui():
     hardware_tab = tabcontent["Hardware"]
 
     # presets selector
-    makelabel(hardware_tab, "Presets:", 1,0,"Select a backend to use.\nCPU runs purely on CPU only.\nCuBLAS runs on Nvidia GPUs, and is much faster.\nCLBlast works on all GPUs but is somewhat slower.\nNoAVX2 and Failsafe modes support older PCs.")
+    makelabel(hardware_tab, "Presets:", 1,0,"Select a backend to use.\nOpenBLAS and NoBLAS run purely on CPU.\nCuBLAS runs on Nvidia GPUs, and is much faster.\nCLBlast works on all GPUs but is somewhat slower.\nNoAVX2 and Failsafe modes support older PCs.")
     runoptbox = ctk.CTkComboBox(hardware_tab, values=runopts, width=180,variable=runopts_var, state="readonly")
     runoptbox.grid(row=1, column=1,padx=8, stick="nw")
     runoptbox.set(runopts[0]) # Set to first available option
@@ -2206,9 +2206,9 @@ def show_new_gui():
             args.noavx2 = True
         if gpulayers_var.get():
             args.gpulayers = int(gpulayers_var.get())
-        if runopts_var.get()=="Use CPU":
+        if runopts_var.get()=="Use No BLAS":
             args.noblas = True
-        if runopts_var.get()=="CPU NoAVX2 (Old CPU)":
+        if runopts_var.get()=="NoAVX2 Mode (Old CPU)":
             args.noavx2 = True
         if runopts_var.get()=="Failsafe Mode (Old CPU)":
             args.noavx2 = True
@@ -3257,7 +3257,7 @@ if __name__ == '__main__':
     compatgroup.add_argument("--usecublas", help="Use CuBLAS for GPU Acceleration. Requires CUDA. Select lowvram to not allocate VRAM scratch buffer. Enter a number afterwards to select and use 1 GPU. Leaving no number will use all GPUs. For hipBLAS binaries, please check YellowRoseCx rocm fork.", nargs='*',metavar=('[lowvram|normal] [main GPU ID] [mmq] [rowsplit]'), choices=['normal', 'lowvram', '0', '1', '2', '3', 'mmq', 'rowsplit'])
     compatgroup.add_argument("--usevulkan", help="Use Vulkan for GPU Acceleration. Can optionally specify GPU Device ID (e.g. --usevulkan 0).", metavar=('[Device ID]'), nargs='*', type=int, default=None)
     compatgroup.add_argument("--useclblast", help="Use CLBlast for GPU Acceleration. Must specify exactly 2 arguments, platform ID and device ID (e.g. --useclblast 1 0).", type=int, choices=range(0,9), nargs=2)
-    compatgroup.add_argument("--noblas", help="((THIS COMMAND IS DEPRECATED AND WILL BE REMOVED SOON))", action='store_true')
+    compatgroup.add_argument("--noblas", help="Do not use OpenBLAS for accelerated prompt ingestion.", action='store_true')
     parser.add_argument("--gpulayers", help="Set number of layers to offload to GPU when using GPU. Requires GPU.",metavar=('[GPU layers]'), nargs='?', const=1, type=int, default=0)
     parser.add_argument("--tensor_split", help="For CUDA and Vulkan only, ratio to split tensors across multiple GPUs, space-separated list of proportions, e.g. 7 3", metavar=('[Ratios]'), type=float, nargs='+')
     parser.add_argument("--contextsize", help="Controls the memory allocated for maximum context size, only change if you need more RAM for big contexts. (default 2048). Supported values are [256,512,1024,2048,3072,4096,6144,8192,12288,16384,24576,32768,49152,65536,98304,131072]. IF YOU USE ANYTHING ELSE YOU ARE ON YOUR OWN.",metavar=('[256,512,1024,2048,3072,4096,6144,8192,12288,16384,24576,32768,49152,65536,98304,131072]'), type=check_range(int,256,262144), default=2048)
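
For reviewers following the GUI changes above, here is a minimal, self-contained sketch of the backend-detection pattern that `show_new_gui()` applies to `lib_option_pairs`: each backend library file is paired with its menu label, labels whose library is missing are dropped, and the program exits if nothing remains. The file names below are hypothetical stand-ins for the real `lib_*` variables, and the Windows `.dll` fallback check is omitted for brevity.

```python
import os
import sys

# Hypothetical library file names standing in for the real lib_* variables
# (lib_openblas, lib_clblast, lib_cublas, lib_default) resolved at startup.
lib_option_pairs = [
    ("koboldcpp_openblas.so", "Use OpenBLAS"),
    ("koboldcpp_clblast.so", "Use CLBlast"),
    ("koboldcpp_cublas.so", "Use CuBLAS"),
    ("koboldcpp_default.so", "Use No BLAS"),
]

# Keep a menu label only if its compiled backend library actually exists;
# unavailable backends become None and are filtered out of the dropdown.
options = [label if os.path.exists(lib) else None for lib, label in lib_option_pairs]
runopts = [label for label in options if label is not None]

if not any(runopts):
    # Mirrors the GUI's "No Backends Available!" message box and exit code.
    print("No backends available - run 'make' from the project directory first.")
    sys.exit(2)

print("Selectable presets:", runopts)
```

The real code additionally accepts a matching `.dll` on Windows and unpacks each surviving label into its own `*_option` variable; this sketch keeps only the existence-filtering step that decides which presets the dropdown offers.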