diff --git a/README.md b/README.md
index 2610d4256..c7b87ff79 100644
--- a/README.md
+++ b/README.md
@@ -1,29 +1,27 @@
# koboldcpp
-KoboldCpp is an easy-to-use AI text-generation software for GGML and GGUF models. It's a single self contained distributable from Concedo, that builds off llama.cpp, and adds a versatile Kobold API endpoint, additional format support, Stable Diffusion image generation, backward compatibility, as well as a fancy UI with persistent stories, editing tools, save formats, memory, world info, author's note, characters, scenarios and everything KoboldAI and KoboldAI Lite have to offer.
+KoboldCpp is an easy-to-use AI text-generation software for GGML and GGUF models, inspired by the original KoboldAI. It's a single self-contained distributable from Concedo that builds off llama.cpp and adds a versatile **KoboldAI API endpoint**, additional format support, Stable Diffusion image generation, speech-to-text, backward compatibility, as well as a fancy UI with persistent stories, editing tools, save formats, memory, world info, author's note, characters, scenarios, and everything KoboldAI and KoboldAI Lite have to offer.




-## Windows Usage
-- **[Download the latest .exe release here](https://github.com/LostRuins/koboldcpp/releases/latest)** or clone the git repo.
-- Windows binaries are provided in the form of **koboldcpp.exe**, which is a pyinstaller wrapper for a few **.dll** files and **koboldcpp.py**. You can also rebuild it yourself with the provided makefiles and scripts.
-- Weights are not included, you can use the official llama.cpp `quantize.exe` to generate them from your official weight files (or download them from other places such as [TheBloke's Huggingface](https://huggingface.co/TheBloke).
+## Windows Usage (Precompiled Binary, Recommended)
+- Windows binaries are provided in the form of **koboldcpp.exe**, which is a pyinstaller wrapper containing all necessary files. **[Download the latest koboldcpp.exe release here](https://github.com/LostRuins/koboldcpp/releases/latest)**
- To run, simply execute **koboldcpp.exe**.
- Launching with no command line arguments displays a GUI containing a subset of configurable settings. Generally you don't have to change much besides the `Presets` and `GPU Layers`. Read the `--help` output for more info about each setting.
- By default, you can connect to http://localhost:5001
- You can also run it using the command line. For info, please check `koboldcpp.exe --help`
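+For example, to skip the GUI and start the server directly from the command line (a minimal example; `yourmodel.gguf` is a placeholder for whatever GGUF model file you are using):
+```
+koboldcpp.exe --model yourmodel.gguf --port 5001
+```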
-### Improving Performance
-- **(Nvidia Only) GPU Acceleration**: If you're on Windows with an Nvidia GPU you can get CUDA support out of the box using the `--usecublas` flag, make sure you select the correct .exe with CUDA support.
-- **Any GPU Acceleration**: As a slightly slower alternative, try CLBlast with `--useclblast` flags for a slightly slower but more GPU compatible speedup.
-- **GPU Layer Offloading**: Want even more speedup? Combine one of the above GPU flags with `--gpulayers` to offload entire layers to the GPU! **Much faster, but uses more VRAM**. Experiment to determine number of layers to offload, and reduce by a few if you run out of memory.
-- **Increasing Context Size**: Try `--contextsize 4096` to 2x your context size! without much perplexity gain. Note that you'll have to increase the max context in the KoboldAI Lite UI as well (click and edit the number text field).
-- If you are having crashes or issues, you can try turning off BLAS with the `--noblas` flag. You can also try running in a non-avx2 compatibility mode with `--noavx2`. Lastly, you can try turning off mmap with `--nommap`.
+## Linux Usage (Precompiled Binary, Recommended)
+On modern Linux systems, download the `koboldcpp-linux-x64-cuda1150` prebuilt PyInstaller binary from the **[releases page](https://github.com/LostRuins/koboldcpp/releases/latest)**, then simply run it.
-For more information, be sure to run the program with the `--help` flag, or [check the wiki](https://github.com/LostRuins/koboldcpp/wiki).
+Alternatively, you can also install koboldcpp to the current directory by running the following terminal command:
+```
+curl -fLo koboldcpp https://github.com/LostRuins/koboldcpp/releases/latest/download/koboldcpp-linux-x64-cuda1150 && chmod +x koboldcpp
+```
+After running this command you can launch Koboldcpp from the current directory using `./koboldcpp` in the terminal (for CLI usage, run with `--help`).
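+For example (a minimal sketch; the model filename is a placeholder):
+```
+./koboldcpp --model yourmodel.gguf --port 5001
+```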
## Run on Colab
- KoboldCpp now has an **official Colab GPU Notebook**! This is an easy way to get started without installing anything in a minute or two. [Try it here!](https://colab.research.google.com/github/LostRuins/koboldcpp/blob/concedo/colab.ipynb).
@@ -32,21 +30,32 @@ For more information, be sure to run the program with the `--help` flag, or [che
## Run on RunPod
- KoboldCpp can now be used on RunPod cloud GPUs! This is an easy way to get started in a minute or two without installing anything, and is very scalable, capable of running 70B+ models at affordable cost. [Try our RunPod image here!](https://koboldai.org/runpodcpp).
-## OSX and Linux
+## Docker
+- The official docker can be found at https://hub.docker.com/r/koboldai/koboldcpp
+- If you're building your own docker, remember to set CUDA_DOCKER_ARCH or enable LLAMA_PORTABLE
-### Linux Usage (Precompiled Binary, Recommended)
-On Linux, we provide a `koboldcpp-linux-x64-cuda1150` PyInstaller prebuilt binary on the **[releases](https://github.com/LostRuins/koboldcpp/releases/latest)** page for modern systems. Simply download and run the binary.
+## MacOS
+- You will need to clone the repo and compile from source code; see Compiling on MacOS below.
-Alternatively, you can also install koboldcpp to the current directory by running the following terminal command:
-```
-curl -fLo koboldcpp https://github.com/LostRuins/koboldcpp/releases/latest/download/koboldcpp-linux-x64-cuda1150 && chmod +x koboldcpp
-```
-After running this command you can launch Koboldcpp from the current directory using `./koboldcpp` in the terminal (for CLI usage, run with `--help`).
+## Obtaining a GGUF model
+- KoboldCpp uses GGUF models. They are not included here, but you can download GGUF files from other places such as [TheBloke's Huggingface](https://huggingface.co/TheBloke). Search for "GGUF" on huggingface.co for plenty of compatible models in the `.gguf` format.
+- For beginners, we recommend the models [MistRP Airoboros](https://huggingface.co/TheBloke/MistRP-Airoboros-7B-GGUF/resolve/main/mistrp-airoboros-7b.Q4_K_S.gguf) or [Tiefighter 13B](https://huggingface.co/KoboldAI/LLaMA2-13B-Tiefighter-GGUF/resolve/main/LLaMA2-13B-Tiefighter.Q4_K_S.gguf) (larger model).
+- [Alternatively, you can download the tools to convert models to the GGUF format yourself here](https://github.com/LostRuins/koboldcpp/releases/download/v1.69.1/koboldcpp_tools_6jul.zip). Run `convert-hf-to-gguf.py` to convert them, then `quantize_gguf.exe` to quantize the result.
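+For example, a hypothetical conversion workflow with those tools (the paths, output names and quantization type are illustrative; run each tool with `--help` or no arguments to confirm its exact usage):
+```
+python convert-hf-to-gguf.py path/to/your-hf-model --outfile yourmodel-f16.gguf
+quantize_gguf.exe yourmodel-f16.gguf yourmodel-q4_k_s.gguf Q4_K_S
+```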
+## Improving Performance
+- **GPU Acceleration**: If you're on Windows with an Nvidia GPU, you can get CUDA support out of the box using the `--usecublas` flag (Nvidia only), or use `--usevulkan` (any GPU); make sure you select the correct .exe with CUDA support.
+- **GPU Layer Offloading**: Add `--gpulayers` to offload model layers to the GPU. The more layers you offload to VRAM, the faster generation becomes. Experiment to determine the number of layers to offload, and reduce it by a few if you run out of memory.
+- **Increasing Context Size**: Use `--contextsize (number)` to increase context size, allowing the model to read more text. Note that you may also need to increase the max context in the KoboldAI Lite UI as well (click and edit the number text field).
+- **Old CPU Compatibility**: If you are having crashes or issues, you can try turning off BLAS with the `--noblas` flag. You can also try running in a non-avx2 compatibility mode with `--noavx2`. Lastly, you can try turning off mmap with `--nommap`.
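+For example, a typical GPU-accelerated launch combining the flags above (the model filename and layer count are placeholders; tune them to your model and available VRAM):
+```
+koboldcpp.exe --usecublas --gpulayers 32 --contextsize 8192 --model yourmodel.gguf
+```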
-### Linux Usage (koboldcpp.sh automated compiler script)
-when you can't use the precompiled binary directly, we provide an automated build script which uses conda to obtain all dependencies, and generates (from source) a ready-to-use a pyinstaller binary for linux users. Simply execute the build script with `./koboldcpp.sh dist` and run the generated binary. (Not recomended for systems that already have an existing installation of conda. Dependencies: curl, bzip2)
+For more information, be sure to run the program with the `--help` flag, or **[check the wiki](https://github.com/LostRuins/koboldcpp/wiki).**
+## Compiling KoboldCpp From Source Code
+
+### Compiling on Linux (Using koboldcpp.sh automated compiler script)
+If you can't use the precompiled binary directly, we provide an automated build script which uses conda to obtain all dependencies and generates (from source) a ready-to-use PyInstaller binary for Linux users.
+- Clone the repo with `git clone https://github.com/LostRuins/koboldcpp.git`
+- Simply execute the build script with `./koboldcpp.sh dist` and run the generated binary. (Not recommended for systems that already have an existing installation of conda. Dependencies: curl, bzip2)
```
./koboldcpp.sh # This launches the GUI for easy configuration and launching (X11 required).
./koboldcpp.sh --help # List all available terminal commands for using Koboldcpp, you can use koboldcpp.sh the same way as our python script and binaries.
@@ -54,38 +63,41 @@ when you can't use the precompiled binary directly, we provide an automated buil
./koboldcpp.sh dist # Generate your own precompiled binary (Due to the nature of Linux compiling these will only work on distributions equal or newer than your own.)
```
-## OSX and Linux Manual Compiling
-- Otherwise, you will have to compile your binaries from source. A makefile is provided, simply run `make`.
-- If you want you can also link your own install of OpenBLAS manually with `make LLAMA_OPENBLAS=1`
-- If you want you can also link your own install of CLBlast manually with `make LLAMA_CLBLAST=1`, for this you will need to obtain and link OpenCL and CLBlast libraries.
+### Compiling on Linux (Manual Method)
+- To compile your binaries from source, clone the repo with `git clone https://github.com/LostRuins/koboldcpp.git`
+- A makefile is provided, simply run `make`.
+- Optional OpenBLAS: Link your own install of OpenBLAS manually with `make LLAMA_OPENBLAS=1`
+- Optional CLBlast: Link your own install of CLBlast manually with `make LLAMA_CLBLAST=1`
+- Note: for CLBlast, you will need to obtain and link the OpenCL and CLBlast libraries.
- For Arch Linux: Install `cblas` `openblas` and `clblast`.
- For Debian: Install `libclblast-dev` and `libopenblas-dev`.
- You can attempt a CuBLAS build with `LLAMA_CUBLAS=1`. You will need the CUDA Toolkit installed. Some have also reported success with the CMake file, though that is more for Windows.
- For a full featured build (all backends), do `make LLAMA_OPENBLAS=1 LLAMA_CLBLAST=1 LLAMA_CUBLAS=1 LLAMA_VULKAN=1`. (Note that `LLAMA_CUBLAS=1` will not work on Windows; you need Visual Studio)
-- After all binaries are built, you can run the python script with the command `koboldcpp.py [ggml_model.bin] [port]`
+- After all binaries are built, you can run the python script with the command `koboldcpp.py [ggml_model.gguf] [port]`
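+For example, a minimal build-and-run sequence (the model filename is a placeholder; add any of the optional `LLAMA_*` flags above that you need):
+```
+git clone https://github.com/LostRuins/koboldcpp.git && cd koboldcpp
+make -j
+python koboldcpp.py --model yourmodel.gguf --port 5001
+```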
-- Note: Many OSX users have found that the using Accelerate is actually faster than OpenBLAS. To try, you may wish to run with `--noblas` and compare speeds.
-
-### Arch Linux Packages
-There are some community made AUR packages available: [CUBLAS](https://aur.archlinux.org/packages/koboldcpp-cuda), and [HIPBLAS](https://aur.archlinux.org/packages/koboldcpp-hipblas). They are intended for users with NVIDIA GPUs, and users with a supported AMD GPU. Note that these packages may be outdated, and it's probably better to use official KoboldCpp binaries.
-
-## Compiling on Windows
+### Compiling on Windows
- You're encouraged to use the released .exe, but if you want to compile your binaries from source on Windows, the easiest way is:
- - Use the latest release of w64devkit (https://github.com/skeeto/w64devkit). Be sure to use the "vanilla one", not i686 or other different stuff. If you try they will conflit with the precompiled libs!
- - Make sure you are using the w64devkit integrated terminal, then run 'make' at the KoboldCpp source folder. This will create the .dll files.
- - If you want to generate the .exe file, make sure you have the python module PyInstaller installed with pip ('pip install PyInstaller').
- - Run the script make_pyinstaller.bat at a regular terminal (or Windows Explorer).
+  - Get the latest release of w64devkit (https://github.com/skeeto/w64devkit). Be sure to use the "vanilla" one, not i686 or other variants, as they will conflict with the precompiled libs!
+ - Clone the repo with `git clone https://github.com/LostRuins/koboldcpp.git`
+ - Make sure you are using the w64devkit integrated terminal, then run `make` at the KoboldCpp source folder. This will create the .dll files.
+ - If you want to generate the .exe file, make sure you have the python module PyInstaller installed with pip (`pip install PyInstaller`). Then run the script `make_pyinstaller.bat`
- The koboldcpp.exe file will be at your dist folder.
-- If you wish to use your own version of the additional Windows libraries (OpenCL, CLBlast and OpenBLAS), you can do it with:
+- **Building with CUDA**: Visual Studio, CMake and the CUDA Toolkit are required. Clone the repo, then open the CMake file and compile it in Visual Studio. Copy the generated `koboldcpp_cublas.dll` into the same directory as the `koboldcpp.py` file. If you are bundling executables, you may need to include CUDA dynamic libraries (such as `cublasLt64_11.dll` and `cublas64_11.dll`) in order for the executable to work correctly on a different PC.
+- **Replacing Libraries (Not Recommended)**: If you wish to use your own version of the additional Windows libraries (OpenCL, CLBlast and OpenBLAS), you can do it with:
- OpenCL - tested with https://github.com/KhronosGroup/OpenCL-SDK . If you wish to compile it, follow the repository instructions. You will need vcpkg.
- CLBlast - tested with https://github.com/CNugteren/CLBlast . If you wish to compile it you will need to reference the OpenCL files. It will only generate the ".lib" file if you compile using MSVC.
- OpenBLAS - tested with https://github.com/xianyi/OpenBLAS .
- Move the respective .lib files to the /lib folder of your project, overwriting the older files.
- Also, replace the existing versions of the corresponding .dll files located in the project directory root (e.g. libopenblas.dll).
- - You can attempt a CuBLAS build with using the provided CMake file with visual studio. If you use the CMake file to build, copy the `koboldcpp_cublas.dll` generated into the same directory as the `koboldcpp.py` file. If you are bundling executables, you may need to include CUDA dynamic libraries (such as `cublasLt64_11.dll` and `cublas64_11.dll`) in order for the executable to work correctly on a different PC.
- - Make the KoboldCPP project using the instructions above.
+ - Make the KoboldCpp project using the instructions above.
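+For reference, a rough sketch of the basic build above condensed into commands (clone with any git client, run `make` inside the w64devkit terminal, and run the PyInstaller steps from a regular terminal with Python installed):
+```
+git clone https://github.com/LostRuins/koboldcpp.git && cd koboldcpp
+make -j
+pip install PyInstaller
+make_pyinstaller.bat
+```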
-## Compiling on Android (Termux Installation)
+### Compiling on MacOS
+- To compile your binaries from source, clone the repo with `git clone https://github.com/LostRuins/koboldcpp.git`
+- A makefile is provided, simply run `make`.
+- If you want Metal GPU support, instead run `make LLAMA_METAL=1`; note that the MacOS Metal libraries need to be installed.
+- After all binaries are built, you can run the python script with the command `koboldcpp.py --model [ggml_model.gguf]` (and add `--gpulayers (number of layers)` if you wish to offload layers to the GPU).
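+For example, a Metal-enabled build and launch (the filename is a placeholder; a high `--gpulayers` value such as 99 simply offloads as many layers as the model has):
+```
+make LLAMA_METAL=1 -j
+python koboldcpp.py --model yourmodel.gguf --gpulayers 99
+```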
+
+### Compiling on Android (Termux Installation)
- [Install and run Termux from F-Droid](https://f-droid.org/en/packages/com.termux/)
- Enter the command `termux-change-repo` and choose `Mirror by BFSU`
- Install dependencies with `pkg install wget git python` (plus any other missing packages)
@@ -99,16 +111,9 @@ There are some community made AUR packages available: [CUBLAS](https://aur.archl
- If you encounter any errors, make sure your packages are up-to-date with `pkg up`
- GPU acceleration for Termux may be possible but I have not explored it. If you find a good cross-device solution, do share or PR it.
-## AMD
+## AMD Users
- Please check out https://github.com/YellowRoseCx/koboldcpp-rocm
-## Docker
-- The official docker can be found at https://hub.docker.com/r/koboldai/koboldcpp
-- KoboldCpp also has a few unofficial third-party community created docker images. Feel free to try them out, but do not expect up-to-date support:
- - https://github.com/korewaChino/koboldCppDocker
- - https://github.com/noneabove1182/koboldcpp-docker
-- If you're building your own docker, remember to set CUDA_DOCKER_ARCH or enable LLAMA_PORTABLE
-
## Questions and Help
- **First, please check out [The KoboldCpp FAQ and Knowledgebase](https://github.com/LostRuins/koboldcpp/wiki) which may already have answers to your questions! Also please search through past issues and discussions.**
- If you cannot find an answer, open an issue on this github, or find us on the [KoboldAI Discord](https://koboldai.org/discord).
@@ -117,11 +122,11 @@ There are some community made AUR packages available: [CUBLAS](https://aur.archl
- For Windows: No installation, single file executable, (It Just Works)
- Since v1.0.6, requires libopenblas, the prebuilt windows binaries are included in this repo. If not found, it will fall back to a mode without BLAS.
- Since v1.15, requires CLBlast if enabled, the prebuilt windows binaries are included in this repo. If not found, it will fall back to a mode without CLBlast.
-- Since v1.33, you can set the context size to be above what the model supports officially. It does increases perplexity but should still work well below 4096 even on untuned models. (For GPT-NeoX, GPT-J, and LLAMA models) Customize this with `--ropeconfig`.
+- Since v1.33, you can set the context size to be above what the model supports officially. It does increase perplexity but should still work well below 4096 even on untuned models. (For GPT-NeoX, GPT-J, and Llama models) Customize this with `--ropeconfig`.
- Since v1.42, supports GGUF models for LLAMA and Falcon
- Since v1.55, lcuda paths on Linux are hardcoded and may require manual changes to the makefile if you do not use koboldcpp.sh for the compilation.
- Since v1.60, provides native image generation with StableDiffusion.cpp, you can load any SD1.5 or SDXL .safetensors model and it will provide an A1111 compatible API to use.
-- **I plan to keep backwards compatibility with ALL past llama.cpp AND alpaca.cpp models**. But you are also encouraged to reconvert/update your models if possible for best results.
+- **I try to keep backwards compatibility with ALL past llama.cpp models**. But you are also encouraged to reconvert/update your models if possible for best results.
## License
- The original GGML library and llama.cpp by ggerganov are licensed under the MIT License
@@ -129,17 +134,18 @@ There are some community made AUR packages available: [CUBLAS](https://aur.archl
- The other files are also under the AGPL v3.0 License unless otherwise stated
## Notes
-- Generation delay scales linearly with original prompt length. If OpenBLAS is enabled then prompt ingestion becomes about 2-3x faster. This is automatic on windows, but will require linking on OSX and Linux. CLBlast speeds this up even further, and `--gpulayers` + `--useclblast` or `--usecublas` more so.
-- I have heard of someone claiming a false AV positive report. The exe is a simple pyinstaller bundle that includes the necessary python scripts and dlls to run. If this still concerns you, you might wish to rebuild everything from source code using the makefile, and you can rebuild the exe yourself with pyinstaller by using `make_pyinstaller.bat`
-- API documentation available at `/api` and https://lite.koboldai.net/koboldcpp_api
-- Supported GGML models (Includes backward compatibility for older versions/legacy GGML models, though some newer features might be unavailable):
- - All up-to-date GGUF models are supported (Mistral/Mixtral/QWEN/Gemma and more)
- - LLAMA and LLAMA2 (LLaMA / Alpaca / GPT4All / Vicuna / Koala / Pygmalion 7B / Metharme 7B / WizardLM and many more)
+- If you wish, after building the koboldcpp libraries with `make`, you can rebuild the exe yourself with pyinstaller by using `make_pyinstaller.bat`
+- API documentation is available at `/api` (e.g. `http://localhost:5001/api`) and https://lite.koboldai.net/koboldcpp_api. An OpenAI compatible API is also provided at the `/v1` route (e.g. `http://localhost:5001/v1`). An example request is shown at the end of this section.
+- **All up-to-date GGUF models are supported**, and KoboldCpp also includes backward compatibility for older versions/legacy GGML `.bin` models, though some newer features might be unavailable.
+- An incomplete list of supported models and architectures follows, but there are *many hundreds of other GGUF models*. In general, if it's GGUF, it should work.
+ - Llama / Llama2 / Llama3 / Alpaca / GPT4All / Vicuna / Koala / Pygmalion / Metharme / WizardLM
+ - Mistral / Mixtral / Miqu
+ - Qwen / Qwen2 / Yi
+ - Gemma / Gemma2
- GPT-2 / Cerebras
- - GPT-J
- - RWKV
+ - Phi-2 / Phi-3
- GPT-NeoX / Pythia / StableLM / Dolly / RedPajama
- - MPT models
- - Falcon (GGUF only)
- - [Stable Diffusion and SDXL models](https://github.com/LostRuins/koboldcpp/wiki#can-i-generate-images-with-koboldcpp)
-
+ - GPT-J / RWKV4 / MPT / Falcon / Starcoder / Deepseek and many more
+ - [Stable Diffusion 1.5 and SDXL safetensor models](https://github.com/LostRuins/koboldcpp/wiki#can-i-generate-images-with-koboldcpp)
+ - [LLaVA based Vision models and multimodal projectors (mmproj)](https://github.com/LostRuins/koboldcpp/wiki#what-is-llava-and-mmproj)
+ - [Whisper models for Speech-To-Text](https://huggingface.co/koboldcpp/whisper/tree/main)
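+As a quick test of the KoboldAI API mentioned above, you can POST a generation request to a running instance (a minimal sketch assuming the default port; adjust the prompt and parameters as needed):
+```
+curl http://localhost:5001/api/v1/generate \
+  -H "Content-Type: application/json" \
+  -d '{"prompt": "Hello, my name is", "max_length": 50}'
+```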
diff --git a/colab.ipynb b/colab.ipynb
index b87209aa2..a806fa284 100644
--- a/colab.ipynb
+++ b/colab.ipynb
@@ -48,7 +48,7 @@
"source": [
"#@title v-- Enter your model below and then click this to start Koboldcpp\r\n",
"\r\n",
- "Model = \"https://huggingface.co/KoboldAI/LLaMA2-13B-Tiefighter-GGUF/resolve/main/LLaMA2-13B-Tiefighter.Q4_K_S.gguf\" #@param [\"https://huggingface.co/KoboldAI/LLaMA2-13B-Tiefighter-GGUF/resolve/main/LLaMA2-13B-Tiefighter.Q4_K_S.gguf\",\"https://huggingface.co/KoboldAI/LLaMA2-13B-Estopia-GGUF/resolve/main/LLaMA2-13B-Estopia.Q4_K_S.gguf\",\"https://huggingface.co/Sao10K/Fimbulvetr-11B-v2-GGUF/resolve/main/Fimbulvetr-11B-v2-Test-14.q4_K_M.gguf\",\"https://huggingface.co/TheBloke/MythoMax-L2-13B-GGUF/resolve/main/mythomax-l2-13b.Q4_K_M.gguf\",\"https://huggingface.co/TheBloke/ReMM-SLERP-L2-13B-GGUF/resolve/main/remm-slerp-l2-13b.Q4_K_M.gguf\",\"https://huggingface.co/TheBloke/Xwin-LM-13B-v0.2-GGUF/resolve/main/xwin-lm-13b-v0.2.Q4_K_M.gguf\",\"https://huggingface.co/TheBloke/Stheno-L2-13B-GGUF/resolve/main/stheno-l2-13b.Q4_K_M.gguf\",\"https://huggingface.co/TheBloke/MythoMax-L2-Kimiko-v2-13B-GGUF/resolve/main/mythomax-l2-kimiko-v2-13b.Q4_K_M.gguf\",\"https://huggingface.co/TheBloke/airoboros-mistral2.2-7B-GGUF/resolve/main/airoboros-mistral2.2-7b.Q4_K_S.gguf\",\"https://huggingface.co/concedo/KobbleTinyV2-1.1B-GGUF/resolve/main/KobbleTiny-Q4_K.gguf\",\"https://huggingface.co/grimjim/kukulemon-7B-GGUF/resolve/main/kukulemon-7B.Q8_0.gguf\",\"https://huggingface.co/mradermacher/LemonKunoichiWizardV3-GGUF/resolve/main/LemonKunoichiWizardV3.Q4_K_M.gguf\",\"https://huggingface.co/Lewdiculous/Kunoichi-DPO-v2-7B-GGUF-Imatrix/resolve/main/Kunoichi-DPO-v2-7B-Q4_K_M-imatrix.gguf\",\"https://huggingface.co/mradermacher/L3-8B-Stheno-v3.2-i1-GGUF/resolve/main/L3-8B-Stheno-v3.2.i1-Q4_K_M.gguf\",\"https://huggingface.co/Lewdiculous/Llama-3-Lumimaid-8B-v0.1-OAS-GGUF-IQ-Imatrix/resolve/main/v2-Llama-3-Lumimaid-8B-v0.1-OAS-Q4_K_M-imat.gguf\",\"https://huggingface.co/bartowski/NeuralDaredevil-8B-abliterated-GGUF/resolve/main/NeuralDaredevil-8B-abliterated-Q4_K_M.gguf\",\"https://huggingface.co/bartowski/L3-8B-Lunaris-v1-GGUF/resolve/main/L3-8B-Lunaris-v1-Q4_K_M.gguf\",\"https://huggingface.co/mradermacher/L3-Umbral-Mind-RP-v2.0-8B-GGUF/resolve/main/L3-Umbral-Mind-RP-v2.0-8B.Q4_K_M.gguf\"]{allow-input: true}\r\n",
+ "Model = \"https://huggingface.co/KoboldAI/LLaMA2-13B-Tiefighter-GGUF/resolve/main/LLaMA2-13B-Tiefighter.Q4_K_S.gguf\" #@param [\"https://huggingface.co/KoboldAI/LLaMA2-13B-Tiefighter-GGUF/resolve/main/LLaMA2-13B-Tiefighter.Q4_K_S.gguf\",\"https://huggingface.co/KoboldAI/LLaMA2-13B-Estopia-GGUF/resolve/main/LLaMA2-13B-Estopia.Q4_K_S.gguf\",\"https://huggingface.co/Sao10K/Fimbulvetr-11B-v2-GGUF/resolve/main/Fimbulvetr-11B-v2-Test-14.q4_K_M.gguf\",\"https://huggingface.co/TheBloke/MythoMax-L2-13B-GGUF/resolve/main/mythomax-l2-13b.Q4_K_M.gguf\",\"https://huggingface.co/TheBloke/ReMM-SLERP-L2-13B-GGUF/resolve/main/remm-slerp-l2-13b.Q4_K_M.gguf\",\"https://huggingface.co/TheBloke/Xwin-LM-13B-v0.2-GGUF/resolve/main/xwin-lm-13b-v0.2.Q4_K_M.gguf\",\"https://huggingface.co/TheBloke/Stheno-L2-13B-GGUF/resolve/main/stheno-l2-13b.Q4_K_M.gguf\",\"https://huggingface.co/TheBloke/MythoMax-L2-Kimiko-v2-13B-GGUF/resolve/main/mythomax-l2-kimiko-v2-13b.Q4_K_M.gguf\",\"https://huggingface.co/TheBloke/MistRP-Airoboros-7B-GGUF/resolve/main/mistrp-airoboros-7b.Q4_K_S.gguf\",\"https://huggingface.co/TheBloke/airoboros-mistral2.2-7B-GGUF/resolve/main/airoboros-mistral2.2-7b.Q4_K_S.gguf\",\"https://huggingface.co/concedo/KobbleTinyV2-1.1B-GGUF/resolve/main/KobbleTiny-Q4_K.gguf\",\"https://huggingface.co/grimjim/kukulemon-7B-GGUF/resolve/main/kukulemon-7B.Q8_0.gguf\",\"https://huggingface.co/mradermacher/LemonKunoichiWizardV3-GGUF/resolve/main/LemonKunoichiWizardV3.Q4_K_M.gguf\",\"https://huggingface.co/Lewdiculous/Kunoichi-DPO-v2-7B-GGUF-Imatrix/resolve/main/Kunoichi-DPO-v2-7B-Q4_K_M-imatrix.gguf\",\"https://huggingface.co/mradermacher/L3-8B-Stheno-v3.2-i1-GGUF/resolve/main/L3-8B-Stheno-v3.2.i1-Q4_K_M.gguf\",\"https://huggingface.co/Lewdiculous/Llama-3-Lumimaid-8B-v0.1-OAS-GGUF-IQ-Imatrix/resolve/main/v2-Llama-3-Lumimaid-8B-v0.1-OAS-Q4_K_M-imat.gguf\",\"https://huggingface.co/bartowski/NeuralDaredevil-8B-abliterated-GGUF/resolve/main/NeuralDaredevil-8B-abliterated-Q4_K_M.gguf\",\"https://huggingface.co/bartowski/L3-8B-Lunaris-v1-GGUF/resolve/main/L3-8B-Lunaris-v1-Q4_K_M.gguf\",\"https://huggingface.co/mradermacher/L3-Umbral-Mind-RP-v2.0-8B-GGUF/resolve/main/L3-Umbral-Mind-RP-v2.0-8B.Q4_K_M.gguf\"]{allow-input: true}\r\n",
"Layers = 99 #@param [99]{allow-input: true}\r\n",
"ContextSize = 4096 #@param [4096] {allow-input: true}\r\n",
"#@markdown
\r\n",
diff --git a/common/common.cpp b/common/common.cpp
index 3352a29bd..542ad6c6d 100644
--- a/common/common.cpp
+++ b/common/common.cpp
@@ -473,6 +473,14 @@ bool gpt_params_find_arg(int argc, char ** argv, const std::string & arg, gpt_pa
else { invalid_param = true; }
return true;
}
+ if (arg == "--attention") {
+ CHECK_ARG
+ std::string value(argv[i]);
+ /**/ if (value == "causal") { params.attention_type = LLAMA_ATTENTION_TYPE_CAUSAL; }
+ else if (value == "non-causal") { params.attention_type = LLAMA_ATTENTION_TYPE_NON_CAUSAL; }
+ else { invalid_param = true; }
+ return true;
+ }
if (arg == "--defrag-thold" || arg == "-dt") {
CHECK_ARG
params.defrag_thold = std::stof(argv[i]);
@@ -758,7 +766,7 @@ bool gpt_params_find_arg(int argc, char ** argv, const std::string & arg, gpt_pa
params.cache_type_v = argv[++i];
return true;
}
- if (arg == "--multiline-input") {
+ if (arg == "-mli" || arg == "--multiline-input") {
params.multiline_input = true;
return true;
}
@@ -1395,7 +1403,9 @@ void gpt_params_print_usage(int /*argc*/, char ** argv, const gpt_params & param
options.push_back({ "*", " --keep N", "number of tokens to keep from the initial prompt (default: %d, -1 = all)", params.n_keep });
options.push_back({ "*", " --chunks N", "max number of chunks to process (default: %d, -1 = all)", params.n_chunks });
options.push_back({ "*", "-fa, --flash-attn", "enable Flash Attention (default: %s)", params.flash_attn ? "enabled" : "disabled" });
- options.push_back({ "*", "-p, --prompt PROMPT", "prompt to start generation with (default: '%s')", params.prompt.c_str() });
+ options.push_back({ "*", "-p, --prompt PROMPT", "prompt to start generation with\n"
+ "in conversation mode, this will be used as system prompt\n"
+ "(default: '%s')", params.prompt.c_str() });
options.push_back({ "*", "-f, --file FNAME", "a file containing the prompt (default: none)" });
options.push_back({ "*", " --in-file FNAME", "an input file (repeat to specify multiple files)" });
options.push_back({ "*", "-bf, --binary-file FNAME", "binary file containing the prompt (default: none)" });
@@ -1410,7 +1420,9 @@ void gpt_params_print_usage(int /*argc*/, char ** argv, const gpt_params & param
"halt generation at PROMPT, return control in interactive mode\n"
"can be specified more than once for multiple prompts" });
options.push_back({ "main", "-sp, --special", "special tokens output enabled (default: %s)", params.special ? "true" : "false" });
- options.push_back({ "main", "-cnv, --conversation", "run in conversation mode (does not print special tokens and suffix/prefix, use default chat template) (default: %s)", params.conversation ? "true" : "false" });
+ options.push_back({ "main", "-cnv, --conversation", "run in conversation mode, does not print special tokens and suffix/prefix\n"
+ "if suffix/prefix are not specified, default chat template will be used\n"
+ "(default: %s)", params.conversation ? "true" : "false" });
options.push_back({ "main infill", "-i, --interactive", "run in interactive mode (default: %s)", params.interactive ? "true" : "false" });
options.push_back({ "main infill", "-if, --interactive-first", "run in interactive mode and wait for input right away (default: %s)", params.interactive_first ? "true" : "false" });
options.push_back({ "main infill", "-mli, --multiline-input", "allows you to write or paste multiple lines without ending each in '\\'" });
@@ -1454,6 +1466,7 @@ void gpt_params_print_usage(int /*argc*/, char ** argv, const gpt_params & param
options.push_back({ "main", " --cfg-scale N", "strength of guidance (default: %.1f, 1.0 = disable)", (double)sparams.cfg_scale });
options.push_back({ "main", " --chat-template JINJA_TEMPLATE",
"set custom jinja chat template (default: template taken from model's metadata)\n"
+ "if suffix/prefix are specified, template will be disabled\n"
"only commonly used templates are accepted:\n"
"https://github.com/ggerganov/llama.cpp/wiki/Templates-supported-by-llama_chat_apply_template" });
options.push_back({ "grammar" });
@@ -1464,8 +1477,10 @@ void gpt_params_print_usage(int /*argc*/, char ** argv, const gpt_params & param
"For schemas w/ external $refs, use --grammar + example/json_schema_to_grammar.py instead" });
options.push_back({ "embedding" });
- options.push_back({ "embedding", " --pooling {none,mean,cls}",
+ options.push_back({ "embedding", " --pooling {none,mean,cls,last}",
"pooling type for embeddings, use model default if unspecified" });
+ options.push_back({ "embedding", " --attention {causal,non-causal}",
+ "attention type for embeddings, use model default if unspecified" });
options.push_back({ "context hacking" });
options.push_back({ "*", " --rope-scaling {none,linear,yarn}",
@@ -2071,7 +2086,24 @@ std::tuple<struct llama_model *, struct llama_context *> llama_init_from_gpt_par
if (params.warmup) {
LOG("warming up the model with an empty run\n");
- std::vector<llama_token> tmp = { llama_token_bos(model), llama_token_eos(model), };
+ std::vector<llama_token> tmp;
+ llama_token bos = llama_token_bos(model);
+ llama_token eos = llama_token_eos(model);
+ // some models (e.g. T5) don't have a BOS token
+ if (bos != -1) {
+ tmp.push_back(bos);
+ }
+ tmp.push_back(eos);
+
+ if (llama_model_has_encoder(model)) {
+ llama_encode(lctx, llama_batch_get_one(tmp.data(), tmp.size(), 0, 0));
+ llama_token decoder_start_token_id = llama_model_decoder_start_token(model);
+ if (decoder_start_token_id == -1) {
+ decoder_start_token_id = bos;
+ }
+ tmp.clear();
+ tmp.push_back(decoder_start_token_id);
+ }
llama_decode(lctx, llama_batch_get_one(tmp.data(), std::min(tmp.size(), (size_t) params.n_batch), 0, 0));
llama_kv_cache_clear(lctx);
llama_synchronize(lctx);
@@ -2154,6 +2186,7 @@ struct llama_context_params llama_context_params_from_gpt_params(const gpt_param
cparams.yarn_beta_slow = params.yarn_beta_slow;
cparams.yarn_orig_ctx = params.yarn_orig_ctx;
cparams.pooling_type = params.pooling_type;
+ cparams.attention_type = params.attention_type;
cparams.defrag_thold = params.defrag_thold;
cparams.cb_eval = params.cb_eval;
cparams.cb_eval_user_data = params.cb_eval_user_data;
diff --git a/common/common.h b/common/common.h
index a811713f4..59e10f9bd 100644
--- a/common/common.h
+++ b/common/common.h
@@ -95,6 +95,7 @@ struct gpt_params {
enum llama_split_mode split_mode = LLAMA_SPLIT_MODE_LAYER; // how to split the model across GPUs
enum llama_rope_scaling_type rope_scaling_type = LLAMA_ROPE_SCALING_TYPE_UNSPECIFIED;
enum llama_pooling_type pooling_type = LLAMA_POOLING_TYPE_UNSPECIFIED; // pooling type for embeddings
+ enum llama_attention_type attention_type = LLAMA_ATTENTION_TYPE_UNSPECIFIED; // attention type for embeddings
// sampling parameters
int32_t top_k = 40; // <= 0 to use vocab size
@@ -475,5 +476,3 @@ void yaml_dump_string_multiline(FILE * stream, const char * prop_name, const cha
void yaml_dump_non_result_info(
FILE * stream, const gpt_params & params, const llama_context * lctx,
const std::string & timestamp, const std::vector<llama_token> & prompt_tokens, const char * model_desc);
-
-
diff --git a/convert-hf-to-gguf.py b/convert_hf_to_gguf.py
similarity index 91%
rename from convert-hf-to-gguf.py
rename to convert_hf_to_gguf.py
index 05fd70171..455eea883 100755
--- a/convert-hf-to-gguf.py
+++ b/convert_hf_to_gguf.py
@@ -13,7 +13,7 @@ import sys
from enum import IntEnum
from pathlib import Path
from hashlib import sha256
-from typing import TYPE_CHECKING, Any, Callable, ContextManager, Iterable, Iterator, Sequence, TypeVar, cast
+from typing import TYPE_CHECKING, Any, Callable, ContextManager, Iterable, Iterator, Literal, Sequence, TypeVar, cast
import math
import numpy as np
@@ -404,7 +404,7 @@ class Model:
return tokens, toktypes, tokpre
- # NOTE: this function is generated by convert-hf-to-gguf-update.py
+ # NOTE: this function is generated by convert_hf_to_gguf_update.py
# do not modify it manually!
# ref: https://github.com/ggerganov/llama.cpp/pull/6920
# Marker: Start get_vocab_base_pre
@@ -424,7 +424,7 @@ class Model:
res = None
- # NOTE: if you get an error here, you need to update the convert-hf-to-gguf-update.py script
+ # NOTE: if you get an error here, you need to update the convert_hf_to_gguf_update.py script
# or pull the latest version of the model from Huggingface
# don't edit the hashes manually!
if chkhsh == "0ef9807a4087ebef797fc749390439009c3b9eda9ad1a097abbe738f486c01e5":
@@ -490,15 +490,18 @@ class Model:
if chkhsh == "7fc505bd3104ca1083b150b17d088b59534ede9bde81f0dd2090967d7fe52cee":
# ref: https://huggingface.co/LumiOpen/Viking-7B
res = "viking"
+ if chkhsh == "b53802fb28e26d645c3a310b34bfe07da813026ec7c7716883404d5e0f8b1901":
+ # ref: https://huggingface.co/core42/jais-13b
+ res = "jais"
if res is None:
logger.warning("\n")
logger.warning("**************************************************************************************")
logger.warning("** WARNING: The BPE pre-tokenizer was not recognized!")
logger.warning("** There are 2 possible reasons for this:")
- logger.warning("** - the model has not been added to convert-hf-to-gguf-update.py yet")
+ logger.warning("** - the model has not been added to convert_hf_to_gguf_update.py yet")
logger.warning("** - the pre-tokenization config has changed upstream")
- logger.warning("** Check your model files and convert-hf-to-gguf-update.py and update them accordingly.")
+ logger.warning("** Check your model files and convert_hf_to_gguf_update.py and update them accordingly.")
logger.warning("** ref: https://github.com/ggerganov/llama.cpp/pull/6920")
logger.warning("**")
logger.warning(f"** chkhsh: {chkhsh}")
@@ -674,6 +677,51 @@ class Model:
special_vocab = gguf.SpecialVocab(self.dir_model, n_vocab=len(tokens))
special_vocab.add_to_gguf(self.gguf_writer)
+ def _set_vocab_builtin(self, model_name: Literal["gpt-neox", "llama-spm"], vocab_size: int):
+ tokenizer_path = Path(sys.path[0]) / "models" / f"ggml-vocab-{model_name}.gguf"
+ logger.warning(f"Using tokenizer from '{os.path.relpath(tokenizer_path, os.getcwd())}'")
+ vocab_reader = gguf.GGUFReader(tokenizer_path, "r")
+
+ default_pre = "mpt" if model_name == "gpt-neox" else "default"
+
+ field = vocab_reader.get_field(gguf.Keys.Tokenizer.MODEL)
+ assert field # tokenizer model
+ self.gguf_writer.add_tokenizer_model(bytes(field.parts[-1]).decode("utf-8"))
+
+ field = vocab_reader.get_field(gguf.Keys.Tokenizer.PRE)
+ self.gguf_writer.add_tokenizer_pre(bytes(field.parts[-1]).decode("utf-8") if field else default_pre)
+
+ field = vocab_reader.get_field(gguf.Keys.Tokenizer.LIST)
+ assert field # token list
+ self.gguf_writer.add_token_list([bytes(field.parts[i]) for i in field.data][:vocab_size])
+
+ if model_name == "llama-spm":
+ field = vocab_reader.get_field(gguf.Keys.Tokenizer.SCORES)
+ assert field # token scores
+ self.gguf_writer.add_token_scores([field.parts[i].tolist()[0] for i in field.data][:vocab_size])
+
+ field = vocab_reader.get_field(gguf.Keys.Tokenizer.TOKEN_TYPE)
+ assert field # token types
+ self.gguf_writer.add_token_types([field.parts[i].tolist()[0] for i in field.data][:vocab_size])
+
+ if model_name != "llama-spm":
+ field = vocab_reader.get_field(gguf.Keys.Tokenizer.MERGES)
+ assert field # token merges
+ self.gguf_writer.add_token_merges([bytes(field.parts[i]) for i in field.data])
+
+ if (field := vocab_reader.get_field(gguf.Keys.Tokenizer.BOS_ID)) is not None:
+ self.gguf_writer.add_bos_token_id(field.parts[-1].tolist()[0])
+ if (field := vocab_reader.get_field(gguf.Keys.Tokenizer.EOS_ID)) is not None:
+ self.gguf_writer.add_eos_token_id(field.parts[-1].tolist()[0])
+ if (field := vocab_reader.get_field(gguf.Keys.Tokenizer.UNK_ID)) is not None:
+ self.gguf_writer.add_unk_token_id(field.parts[-1].tolist()[0])
+ if (field := vocab_reader.get_field(gguf.Keys.Tokenizer.PAD_ID)) is not None:
+ self.gguf_writer.add_pad_token_id(field.parts[-1].tolist()[0])
+ if (field := vocab_reader.get_field(gguf.Keys.Tokenizer.ADD_BOS)) is not None:
+ self.gguf_writer.add_add_bos_token(field.parts[-1].tolist()[0])
+ if (field := vocab_reader.get_field(gguf.Keys.Tokenizer.ADD_EOS)) is not None:
+ self.gguf_writer.add_add_eos_token(field.parts[-1].tolist()[0])
+
@Model.register("GPTNeoXForCausalLM")
class GPTNeoXModel(Model):
@@ -1939,7 +1987,7 @@ class Phi3MiniModel(Model):
if len(rope_scaling_type) == 0:
raise KeyError('Missing the required key rope_scaling.type')
- if rope_scaling_type == 'su':
+ if rope_scaling_type == 'su' or rope_scaling_type == 'longrope':
attn_factor = math.sqrt(1 + math.log(scale) / math.log(orig_max_pos_embds)) if scale > 1.0 else 1.0
elif rope_scaling_type == 'yarn':
attn_factor = 0.1 * math.log(scale) + 1.0 if scale > 1.0 else 1.0
@@ -2313,6 +2361,8 @@ class GemmaModel(Model):
special_vocab._set_special_token("eot", 107)
special_vocab.add_to_gguf(self.gguf_writer)
+ self.gguf_writer.add_add_space_prefix(False)
+
def set_gguf_parameters(self):
hparams = self.hparams
block_count = hparams["num_hidden_layers"]
@@ -2363,6 +2413,7 @@ class Gemma2Model(Model):
special_vocab = gguf.SpecialVocab(self.dir_model, n_vocab=len(tokens))
special_vocab.add_to_gguf(self.gguf_writer)
+
self.gguf_writer.add_add_space_prefix(False)
def set_gguf_parameters(self):
@@ -2394,7 +2445,7 @@ class Gemma2Model(Model):
raise ValueError("query_pre_attn_scalar must be equal to n_embd / n_head")
def modify_tensors(self, data_torch: Tensor, name: str, bid: int | None) -> Iterable[tuple[str, Tensor]]:
- del bid # unusem
+ del bid # unused
# lm_head is not used in llama.cpp, while autoawq will include this tensor in model
# To prevent errors, skip loading lm_head.weight.
@@ -2433,39 +2484,7 @@ class MambaModel(Model):
self._set_vocab_sentencepiece()
else:
# Use the GPT-NeoX tokenizer when no tokenizer files are present
- tokenizer_path = Path(sys.path[0]) / "models" / "ggml-vocab-gpt-neox.gguf"
- logger.warning(f"Using tokenizer from '{os.path.relpath(tokenizer_path, os.getcwd())}'")
- neox_reader = gguf.GGUFReader(tokenizer_path, "r")
-
- field = neox_reader.get_field(gguf.Keys.Tokenizer.MODEL)
- self.gguf_writer.add_tokenizer_model(bytes(field.parts[-1]).decode("utf-8") if field else "gpt2")
-
- field = neox_reader.get_field(gguf.Keys.Tokenizer.PRE)
- self.gguf_writer.add_tokenizer_pre(bytes(field.parts[-1]).decode("utf-8") if field else "mpt")
-
- field = neox_reader.get_field(gguf.Keys.Tokenizer.LIST)
- assert field
- self.gguf_writer.add_token_list([bytes(field.parts[i]) for i in field.data][:vocab_size])
-
- field = neox_reader.get_field(gguf.Keys.Tokenizer.TOKEN_TYPE)
- assert field
- self.gguf_writer.add_token_types([field.parts[i].tolist()[0] for i in field.data][:vocab_size])
-
- field = neox_reader.get_field(gguf.Keys.Tokenizer.MERGES)
- assert field
- self.gguf_writer.add_token_merges([bytes(field.parts[i]) for i in field.data])
-
- field = neox_reader.get_field(gguf.Keys.Tokenizer.BOS_ID)
- self.gguf_writer.add_bos_token_id(field.parts[-1].tolist()[0] if field else 1)
-
- field = neox_reader.get_field(gguf.Keys.Tokenizer.EOS_ID)
- self.gguf_writer.add_eos_token_id(field.parts[-1].tolist()[0] if field else 0)
-
- field = neox_reader.get_field(gguf.Keys.Tokenizer.UNK_ID)
- self.gguf_writer.add_unk_token_id(field.parts[-1].tolist()[0] if field else 0)
-
- field = neox_reader.get_field(gguf.Keys.Tokenizer.PAD_ID)
- self.gguf_writer.add_pad_token_id(field.parts[-1].tolist()[0] if field else 0)
+ self._set_vocab_builtin("gpt-neox", vocab_size)
def set_gguf_parameters(self):
d_model = self.find_hparam(["hidden_size", "d_model"])
@@ -2617,6 +2636,82 @@ class JinaBertV2Model(BertModel):
self.gguf_writer.add_add_eos_token(True)
+@Model.register("OpenELMForCausalLM")
+class OpenELMModel(Model):
+ model_arch = gguf.MODEL_ARCH.OPENELM
+
+ @staticmethod
+ def _make_divisible(v: float | int, divisor: int) -> int:
+ # ref: https://huggingface.co/apple/OpenELM-270M-Instruct/blob/eb111ff2e6724348e5b905984063d4064d4bc579/configuration_openelm.py#L34-L38
+ new_v = max(divisor, int(v + divisor / 2) // divisor * divisor)
+ # Make sure that round down does not go down by more than 10%.
+ if new_v < 0.9 * v:
+ new_v += divisor
+ return new_v
+
+ def __init__(self, *args, **kwargs):
+ super().__init__(*args, **kwargs)
+
+ ffn_multipliers: list[float] = self.hparams["ffn_multipliers"]
+ ffn_dim_divisor: int = self.hparams["ffn_dim_divisor"]
+ self._n_embd: int = self.hparams["model_dim"]
+ self._num_kv_heads: list[int] = self.hparams["num_kv_heads"]
+ self._num_query_heads: list[int] = self.hparams["num_query_heads"]
+ self._ffn_dims: list[int] = [
+ OpenELMModel._make_divisible(multiplier * self._n_embd, ffn_dim_divisor)
+ for multiplier in ffn_multipliers
+ ]
+ assert isinstance(self._num_kv_heads, list) and isinstance(self._num_kv_heads[0], int)
+ assert isinstance(self._num_query_heads, list) and isinstance(self._num_query_heads[0], int)
+
+ # Uses the tokenizer from meta-llama/Llama-2-7b-hf
+ def set_vocab(self):
+ try:
+ self._set_vocab_sentencepiece()
+ except FileNotFoundError:
+ self._set_vocab_builtin("llama-spm", self.hparams["vocab_size"])
+
+ def set_gguf_parameters(self):
+ n_embd = self._n_embd
+ head_dim = self.hparams["head_dim"]
+ rot_pct = 1.0
+ assert self.block_count == len(self._num_kv_heads)
+ assert self.block_count == len(self._num_query_heads)
+ assert self.block_count == len(self._ffn_dims)
+
+ self.gguf_writer.add_name(self.dir_model.name if self.model_name is None else self.model_name)
+ self.gguf_writer.add_block_count(self.block_count)
+ self.gguf_writer.add_context_length(self.hparams["max_context_length"])
+ self.gguf_writer.add_embedding_length(n_embd)
+ self.gguf_writer.add_feed_forward_length(self._ffn_dims)
+ self.gguf_writer.add_head_count(self._num_query_heads)
+ self.gguf_writer.add_head_count_kv(self._num_kv_heads)
+ self.gguf_writer.add_rope_freq_base(self.hparams["rope_freq_constant"])
+ # https://huggingface.co/apple/OpenELM-270M-Instruct/blob/c401df2/modeling_openelm.py#L30
+ self.gguf_writer.add_layer_norm_rms_eps(1e-6)
+ self.gguf_writer.add_rope_dimension_count(int(rot_pct * head_dim))
+ self.gguf_writer.add_key_length(head_dim)
+ self.gguf_writer.add_value_length(head_dim)
+ self.gguf_writer.add_file_type(self.ftype)
+
+ def find_hparam(self, keys: Iterable[str], optional: bool = False) -> Any:
+ if "n_layers" in keys:
+ return self.hparams["num_transformer_layers"]
+
+ return super().find_hparam(keys, optional)
+
+ def modify_tensors(self, data_torch: Tensor, name: str, bid: int | None) -> Iterable[tuple[str, Tensor]]:
+
+ # split ff
+ if bid is not None and name == f"transformer.layers.{bid}.ffn.proj_1.weight":
+ ff_dim = self._ffn_dims[bid]
+ yield (self.format_tensor_name(gguf.MODEL_TENSOR.FFN_GATE, bid), data_torch[:ff_dim])
+ yield (self.format_tensor_name(gguf.MODEL_TENSOR.FFN_UP, bid), data_torch[ff_dim:])
+ return
+
+ yield (self.map_tensor_name(name), data_torch)
+
+
@Model.register("ArcticForCausalLM")
class ArcticModel(Model):
model_arch = gguf.MODEL_ARCH.ARCTIC
@@ -2847,11 +2942,17 @@ class DeepseekV2Model(Model):
raise ValueError(f"Unprocessed experts: {experts}")
-@Model.register("T5ForConditionalGeneration")
@Model.register("T5WithLMHeadModel")
+@Model.register("T5ForConditionalGeneration")
+@Model.register("MT5ForConditionalGeneration")
+@Model.register("UMT5ForConditionalGeneration")
class T5Model(Model):
model_arch = gguf.MODEL_ARCH.T5
+ def __init__(self, *args, **kwargs):
+ super().__init__(*args, **kwargs)
+ self.shared_token_embeddings_found = False
+
def set_vocab(self):
# to avoid TypeError: Descriptors cannot be created directly
# exception when importing sentencepiece_model_pb2
@@ -2859,17 +2960,29 @@ class T5Model(Model):
from sentencepiece import SentencePieceProcessor
from sentencepiece import sentencepiece_model_pb2 as model
- tokenizer_path = self.dir_model / 'spiece.model'
+ tokenizer_path = self.dir_model / 'tokenizer.model'
+
+ # many older models use spiece.model tokenizer model filename
+ if not tokenizer_path.is_file():
+ tokenizer_path = self.dir_model / 'spiece.model'
if not tokenizer_path.is_file():
raise FileNotFoundError(f"File not found: {tokenizer_path}")
sentencepiece_model = model.ModelProto()
sentencepiece_model.ParseFromString(open(tokenizer_path, "rb").read())
+
+ # some models like Pile-T5 family use BPE tokenizer instead of Unigram
+ if sentencepiece_model.trainer_spec.model_type == 2: # BPE
+ # assure the tokenizer model file name is correct
+ assert tokenizer_path.name == 'tokenizer.model'
+ return self._set_vocab_sentencepiece()
+ else:
+ assert sentencepiece_model.trainer_spec.model_type == 1 # UNIGRAM
+
add_prefix = sentencepiece_model.normalizer_spec.add_dummy_prefix
remove_whitespaces = sentencepiece_model.normalizer_spec.remove_extra_whitespaces
precompiled_charsmap = sentencepiece_model.normalizer_spec.precompiled_charsmap
- assert sentencepiece_model.trainer_spec.model_type == 1 # UNIGRAM
tokenizer = SentencePieceProcessor()
tokenizer.LoadFromFile(str(tokenizer_path))
@@ -2939,7 +3052,10 @@ class T5Model(Model):
def set_gguf_parameters(self):
self.gguf_writer.add_name("T5")
- self.gguf_writer.add_context_length(self.hparams["n_positions"])
+ if (n_ctx := self.find_hparam(["n_positions"], optional=True)) is None:
+ logger.warning("Couldn't find context length in config.json, assuming default value of 512")
+ n_ctx = 512
+ self.gguf_writer.add_context_length(n_ctx)
self.gguf_writer.add_embedding_length(self.hparams["d_model"])
self.gguf_writer.add_feed_forward_length(self.hparams["d_ff"])
self.gguf_writer.add_block_count(self.hparams["num_layers"])
@@ -2955,16 +3071,111 @@ class T5Model(Model):
def modify_tensors(self, data_torch: Tensor, name: str, bid: int | None) -> Iterable[tuple[str, Tensor]]:
del bid # unused
- # Sometimes T5 and Flan-T5 based models contain "encoder.embed_tokens.weight" tensor or
- # "decoder.embed_tokens.weight" tensors that are duplicates of "shared.weight" tensor
- # To prevent errors caused by an unnecessary unmapped tensor, skip both of them and use only "shared.weight".
- if name == "decoder.embed_tokens.weight" or name == "encoder.embed_tokens.weight":
- logger.debug(f"Skipping tensor {name!r} in safetensors so that convert can end normally.")
- return []
+ # T5 based models contain shared token embeddings tensors saved randomly as either "encoder.embed_tokens.weight",
+ # "decoder.embed_tokens.weight" or "shared.weight" tensor. In some models there are even multiple of them stored
+ # in the safetensors files. We use the first tensor from these three as the token embeddings for both encoder
+ # and decoder and ignore the remaining ones.
+ if name in ["decoder.embed_tokens.weight", "encoder.embed_tokens.weight", "shared.weight"]:
+ if not self.shared_token_embeddings_found:
+ name = "shared.weight"
+ self.shared_token_embeddings_found = True
+ else:
+ logger.debug(f"Skipping shared tensor {name!r} in safetensors so that convert can end normally.")
+ return []
return [(self.map_tensor_name(name), data_torch)]
+@Model.register("JAISLMHeadModel")
+class JaisModel(Model):
+ model_arch = gguf.MODEL_ARCH.JAIS
+
+ def __init__(self, *args, **kwargs):
+ super().__init__(*args, **kwargs)
+
+ # SwigLU activation
+ assert self.hparams["activation_function"] == "swiglu"
+ # ALiBi position embedding
+ assert self.hparams["position_embedding_type"] == "alibi"
+
+ # Embeddings scale
+ self.embeddings_scale = 1.0
+ # note: For some JAIS flavors, output is tied to (same as) wte in original model
+ self.output_is_wte = False
+ if 'mup_embeddings_scale' in self.hparams:
+ self.output_is_wte = True # Hack (?)
+ self.embeddings_scale = self.hparams['mup_embeddings_scale']
+ elif 'embeddings_scale' in self.hparams:
+ self.embeddings_scale = self.hparams['embeddings_scale']
+ else:
+ assert False
+
+ self.width_scale = 1.0
+ if 'mup_output_alpha' in self.hparams:
+ assert 'mup_width_scale' in self.hparams
+ self.width_scale = self.hparams['mup_output_alpha'] * self.hparams['mup_width_scale']
+ elif 'width_scale' in self.hparams:
+ self.width_scale = self.hparams['width_scale']
+ else:
+ assert False
+
+ self.max_alibi_bias = 8.0
+
+ def set_vocab(self):
+ self._set_vocab_gpt2()
+
+ def set_gguf_parameters(self):
+ self.gguf_writer.add_name(self.dir_model.name)
+ self.gguf_writer.add_block_count(self.hparams["n_layer"])
+ self.gguf_writer.add_context_length(self.hparams["n_positions"])
+ self.gguf_writer.add_embedding_length(self.hparams["n_embd"])
+ self.gguf_writer.add_feed_forward_length(self.hparams["n_inner"])
+ self.gguf_writer.add_head_count(self.hparams["n_head"])
+ self.gguf_writer.add_layer_norm_eps(self.hparams["layer_norm_epsilon"])
+ self.gguf_writer.add_file_type(self.ftype)
+
+ def modify_tensors(self, data_torch: Tensor, name: str, bid: int | None) -> Iterable[tuple[str, Tensor]]:
+ del bid # unused
+
+ tensors: list[tuple[str, Tensor]] = []
+
+ # we don't need these
+ if name.endswith((".attn.bias")):
+ return tensors
+
+ if name.endswith(("relative_pe.slopes")):
+ # Calculate max ALiBi bias (this is the inverse of the ALiBi calculation)
+ # Some other models have max_alibi_bias spelled out explicitly in the hyperparams,
+ # but Jais's PyTorch model simply precalculates the slope values and places them
+ # in relative_pes.slopes
+ n_head_closest_log2 = 2 ** math.floor(math.log2(self.hparams["n_head"]))
+ first_val = float(data_torch._data[0])
+ self.max_alibi_bias = -round(math.log2(first_val) * n_head_closest_log2)
+
+ return tensors
+
+ if name.endswith((".c_attn.weight", ".c_proj.weight", ".c_fc.weight", ".c_fc2.weight")):
+ data_torch = data_torch.transpose(1, 0)
+
+ new_name = self.map_tensor_name(name)
+
+ if new_name == self.format_tensor_name(gguf.MODEL_TENSOR.TOKEN_EMBD):
+ tensors.append((new_name, data_torch * self.embeddings_scale))
+ if self.output_is_wte:
+ tensors.append((self.format_tensor_name(gguf.MODEL_TENSOR.OUTPUT), data_torch * self.width_scale))
+ elif new_name == self.format_tensor_name(gguf.MODEL_TENSOR.OUTPUT):
+ assert not self.output_is_wte
+ tensors.append((new_name, data_torch * self.width_scale))
+ else:
+ tensors.append((new_name, data_torch))
+
+ return tensors
+
+ def write_tensors(self):
+ super().write_tensors()
+ self.gguf_writer.add_max_alibi_bias(self.max_alibi_bias)
+
+
###### CONVERSION LOGIC ######
@@ -3014,10 +3225,6 @@ def parse_args() -> argparse.Namespace:
"--vocab-only", action="store_true",
help="extract only the vocab",
)
- parser.add_argument(
- "--awq-path", type=Path, default=None,
- help="Path to scale awq cache file",
- )
parser.add_argument(
"--outfile", type=Path,
help="path to write to; default: based on input. {ftype} will be replaced by the outtype.",
@@ -3095,19 +3302,6 @@ def main() -> None:
dir_model = args.model
- if args.awq_path:
- sys.path.insert(1, str(Path(__file__).parent / 'awq-py'))
- from awq.apply_awq import add_scale_weights # type: ignore[import-not-found]
- tmp_model_path = args.model / "weighted_model"
- dir_model = tmp_model_path
- if tmp_model_path.is_dir():
- logger.info(f"{tmp_model_path} exists as a weighted model.")
- else:
- tmp_model_path.mkdir(parents=True, exist_ok=True)
- logger.info("Saving new weighted model ...")
- add_scale_weights(str(args.model), str(args.awq_path), str(tmp_model_path))
- logger.info(f"Saved weighted model at {tmp_model_path}.")
-
if not dir_model.is_dir():
logger.error(f'Error: {args.model} is not a directory')
sys.exit(1)
diff --git a/convert-hf-to-gguf-update.py b/convert_hf_to_gguf_update.py
similarity index 87%
rename from convert-hf-to-gguf-update.py
rename to convert_hf_to_gguf_update.py
index 2758214fa..e4165ae2d 100755
--- a/convert-hf-to-gguf-update.py
+++ b/convert_hf_to_gguf_update.py
@@ -2,7 +2,7 @@
# -*- coding: utf-8 -*-
# This script downloads the tokenizer models of the specified models from Huggingface and
-# generates the get_vocab_base_pre() function for convert-hf-to-gguf.py
+# generates the get_vocab_base_pre() function for convert_hf_to_gguf.py
#
# This is necessary in order to analyze the type of pre-tokenizer used by the model and
# provide the necessary information to llama.cpp via the GGUF header in order to implement
@@ -15,9 +15,9 @@
# - Add a new model to the "models" list
# - Run the script with your huggingface token:
#
-# python3 convert-hf-to-gguf-update.py
+# python3 convert_hf_to_gguf_update.py
#
-# - Copy-paste the generated get_vocab_base_pre() function into convert-hf-to-gguf.py
+# - Copy-paste the generated get_vocab_base_pre() function into convert_hf_to_gguf.py
# - Update llama.cpp with the new pre-tokenizer if necessary
#
# TODO: generate tokenizer tests for llama.cpp
@@ -37,7 +37,7 @@ from enum import IntEnum, auto
from transformers import AutoTokenizer
logging.basicConfig(level=logging.DEBUG)
-logger = logging.getLogger("convert-hf-to-gguf-update")
+logger = logging.getLogger("convert_hf_to_gguf_update")
sess = requests.Session()
@@ -45,6 +45,7 @@ class TOKENIZER_TYPE(IntEnum):
SPM = auto()
BPE = auto()
WPM = auto()
+ UGM = auto()
# TODO: this string has to exercise as much pre-tokenizer functionality as possible
@@ -55,10 +56,10 @@ if len(sys.argv) == 2:
token = sys.argv[1]
if not token.startswith("hf_"):
logger.info("Huggingface token seems invalid")
- logger.info("Usage: python convert-hf-to-gguf-update.py <huggingface_token>")
+ logger.info("Usage: python convert_hf_to_gguf_update.py <huggingface_token>")
sys.exit(1)
else:
- logger.info("Usage: python convert-hf-to-gguf-update.py <huggingface_token>")
+ logger.info("Usage: python convert_hf_to_gguf_update.py <huggingface_token>")
sys.exit(1)
# TODO: add models here, base models preferred
@@ -86,6 +87,10 @@ models = [
{"name": "poro-chat", "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/LumiOpen/Poro-34B-chat", },
{"name": "jina-v2-code", "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/jinaai/jina-embeddings-v2-base-code", },
{"name": "viking", "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/LumiOpen/Viking-7B", }, # Also used for Viking 13B and 33B
+ {"name": "gemma", "tokt": TOKENIZER_TYPE.SPM, "repo": "https://huggingface.co/google/gemma-2b", },
+ {"name": "gemma-2", "tokt": TOKENIZER_TYPE.SPM, "repo": "https://huggingface.co/google/gemma-2-9b", },
+ {"name": "jais", "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/core42/jais-13b", },
+ {"name": "t5", "tokt": TOKENIZER_TYPE.UGM, "repo": "https://huggingface.co/google-t5/t5-small", },
]
@@ -107,9 +112,13 @@ def download_model(model):
os.makedirs(f"models/tokenizers/{name}", exist_ok=True)
files = ["config.json", "tokenizer.json", "tokenizer_config.json"]
+
if tokt == TOKENIZER_TYPE.SPM:
files.append("tokenizer.model")
+ if tokt == TOKENIZER_TYPE.UGM:
+ files.append("spiece.model")
+
for file in files:
save_path = f"models/tokenizers/{name}/{file}"
if os.path.isfile(save_path):
@@ -125,14 +134,14 @@ for model in models:
logger.error(f"Failed to download model {model['name']}. Error: {e}")
-# generate the source code for the convert-hf-to-gguf.py:get_vocab_base_pre() function:
+# generate the source code for the convert_hf_to_gguf.py:get_vocab_base_pre() function:
src_ifs = ""
for model in models:
name = model["name"]
tokt = model["tokt"]
- if tokt == TOKENIZER_TYPE.SPM:
+ if tokt == TOKENIZER_TYPE.SPM or tokt == TOKENIZER_TYPE.UGM:
continue
# Skip if the tokenizer folder does not exist or there are other download issues previously
@@ -142,7 +151,10 @@ for model in models:
# create the tokenizer
try:
- tokenizer = AutoTokenizer.from_pretrained(f"models/tokenizers/{name}")
+ if name == "t5":
+ tokenizer = AutoTokenizer.from_pretrained(f"models/tokenizers/{name}", use_fast=False)
+ else:
+ tokenizer = AutoTokenizer.from_pretrained(f"models/tokenizers/{name}")
except OSError as e:
logger.error(f"Error loading tokenizer for model {name}. The model may not exist or is not accessible with the provided token. Error: {e}")
continue # Skip to the next model if the tokenizer can't be loaded
@@ -189,7 +201,7 @@ src_func = f"""
res = None
- # NOTE: if you get an error here, you need to update the convert-hf-to-gguf-update.py script
+ # NOTE: if you get an error here, you need to update the convert_hf_to_gguf_update.py script
# or pull the latest version of the model from Huggingface
# don't edit the hashes manually!
{src_ifs}
@@ -198,9 +210,9 @@ src_func = f"""
logger.warning("**************************************************************************************")
logger.warning("** WARNING: The BPE pre-tokenizer was not recognized!")
logger.warning("** There are 2 possible reasons for this:")
- logger.warning("** - the model has not been added to convert-hf-to-gguf-update.py yet")
+ logger.warning("** - the model has not been added to convert_hf_to_gguf_update.py yet")
logger.warning("** - the pre-tokenization config has changed upstream")
- logger.warning("** Check your model files and convert-hf-to-gguf-update.py and update them accordingly.")
+ logger.warning("** Check your model files and convert_hf_to_gguf_update.py and update them accordingly.")
logger.warning("** ref: https://github.com/ggerganov/llama.cpp/pull/6920")
logger.warning("**")
logger.warning(f"** chkhsh: {{chkhsh}}")
@@ -214,7 +226,7 @@ src_func = f"""
return res
"""
-convert_py_pth = pathlib.Path("convert-hf-to-gguf.py")
+convert_py_pth = pathlib.Path("convert_hf_to_gguf.py")
convert_py = convert_py_pth.read_text(encoding="utf-8")
convert_py = re.sub(
r"(# Marker: Start get_vocab_base_pre)(.+?)( +# Marker: End get_vocab_base_pre)",
@@ -225,7 +237,7 @@ convert_py = re.sub(
convert_py_pth.write_text(convert_py, encoding="utf-8")
-logger.info("+++ convert-hf-to-gguf.py was updated")
+logger.info("+++ convert_hf_to_gguf.py was updated")
# generate tests for each tokenizer model
@@ -263,6 +275,7 @@ tests = [
"\n =",
"' era",
"Hello, y'all! How are you 😁 ?我想在apple工作1314151天~",
+ "!!!!!!",
"3",
"33",
"333",
@@ -272,7 +285,8 @@ tests = [
"3333333",
"33333333",
"333333333",
- # "Cửa Việt", # llama-bpe fails on this
+ "Cửa Việt", # llama-bpe fails on this
+ " discards",
chktxt,
]
@@ -300,7 +314,10 @@ for model in models:
# create the tokenizer
try:
- tokenizer = AutoTokenizer.from_pretrained(f"models/tokenizers/{name}")
+ if name == "t5":
+ tokenizer = AutoTokenizer.from_pretrained(f"models/tokenizers/{name}", use_fast=False)
+ else:
+ tokenizer = AutoTokenizer.from_pretrained(f"models/tokenizers/{name}")
except OSError as e:
logger.error(f"Failed to load tokenizer for model {name}. Error: {e}")
continue # Skip this model and continue with the next one in the loop
@@ -326,6 +343,6 @@ logger.info("\nRun the following commands to generate the vocab files for testin
for model in models:
name = model["name"]
- print(f"python3 convert-hf-to-gguf.py models/tokenizers/{name}/ --outfile models/ggml-vocab-{name}.gguf --vocab-only") # noqa: NP100
+ print(f"python3 convert_hf_to_gguf.py models/tokenizers/{name}/ --outfile models/ggml-vocab-{name}.gguf --vocab-only") # noqa: NP100
logger.info("\n")
diff --git a/convert-llama-ggml-to-gguf.py b/convert_llama_ggml_to_gguf.py
similarity index 100%
rename from convert-llama-ggml-to-gguf.py
rename to convert_llama_ggml_to_gguf.py
diff --git a/docs/HOWTO-add-model.md b/docs/HOWTO-add-model.md
index 3eec077ea..87093cedd 100644
--- a/docs/HOWTO-add-model.md
+++ b/docs/HOWTO-add-model.md
@@ -17,7 +17,7 @@ Also, it is important to check that the examples and main ggml backends (CUDA, M
### 1. Convert the model to GGUF
This step is done in python with a `convert` script using the [gguf](https://pypi.org/project/gguf/) library.
-Depending on the model architecture, you can use either [convert-hf-to-gguf.py](../convert-hf-to-gguf.py) or [examples/convert-legacy-llama.py](../examples/convert-legacy-llama.py) (for `llama/llama2` models in `.pth` format).
+Depending on the model architecture, you can use either [convert_hf_to_gguf.py](../convert_hf_to_gguf.py) or [examples/convert_legacy_llama.py](../examples/convert_legacy_llama.py) (for `llama/llama2` models in `.pth` format).
The convert script reads the model configuration, tokenizer, tensor names+data and converts them to GGUF metadata and tensors.
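
To make the conversion step concrete, here is a hedged sketch of driving the renamed converter from Python. The model directory and output path are placeholders; `--outfile` appears in this patch's own examples, while `--outtype f16` is assumed and may vary between versions:

```python
# Hedged sketch: invoking convert_hf_to_gguf.py programmatically.
# Paths are placeholders; --outtype f16 is an assumption, check --help for your version.
import subprocess
import sys

model_dir = "path/to/hf-model"      # directory with config.json, tokenizer files, weights
outfile = "path/to/model-f16.gguf"  # GGUF output path

subprocess.run(
    [sys.executable, "convert_hf_to_gguf.py", model_dir,
     "--outfile", outfile, "--outtype", "f16"],
    check=True,
)
```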
diff --git a/examples/batched/batched.cpp b/examples/batched/batched.cpp
index 62d9b144d..2442e954d 100644
--- a/examples/batched/batched.cpp
+++ b/examples/batched/batched.cpp
@@ -93,14 +93,34 @@ int main(int argc, char ** argv) {
// create a llama_batch
// we use this object to submit token data for decoding
- llama_batch batch = llama_batch_init(std::max(tokens_list.size(), (size_t)n_parallel), 0, 1);
+ llama_batch batch = llama_batch_init(std::max(tokens_list.size(), (size_t) n_parallel), 0, n_parallel);
+
+ std::vector<llama_seq_id> seq_ids(n_parallel, 0);
+ for (int32_t i = 0; i < n_parallel; ++i) {
+ seq_ids[i] = i;
+ }
// evaluate the initial prompt
for (size_t i = 0; i < tokens_list.size(); ++i) {
- llama_batch_add(batch, tokens_list[i], i, { 0 }, false);
+ llama_batch_add(batch, tokens_list[i], i, seq_ids, false);
}
GGML_ASSERT(batch.n_tokens == (int) tokens_list.size());
+ if (llama_model_has_encoder(model)) {
+ if (llama_encode(ctx, batch)) {
+ LOG_TEE("%s : failed to eval\n", __func__);
+ return 1;
+ }
+
+ llama_token decoder_start_token_id = llama_model_decoder_start_token(model);
+ if (decoder_start_token_id == -1) {
+ decoder_start_token_id = llama_token_bos(model);
+ }
+
+ llama_batch_clear(batch);
+ llama_batch_add(batch, decoder_start_token_id, 0, seq_ids, false);
+ }
+
// llama_decode will output logits only for the last token of the prompt
batch.logits[batch.n_tokens - 1] = true;
@@ -109,11 +129,11 @@ int main(int argc, char ** argv) {
return 1;
}
- // assign the system KV cache to all parallel sequences
- // this way, the parallel sequences will "reuse" the prompt tokens without having to copy them
- for (int32_t i = 1; i < n_parallel; ++i) {
- llama_kv_cache_seq_cp(ctx, 0, i, -1, -1);
- }
+ //// assign the system KV cache to all parallel sequences
+ //// this way, the parallel sequences will "reuse" the prompt tokens without having to copy them
+ //for (int32_t i = 1; i < n_parallel; ++i) {
+ // llama_kv_cache_seq_cp(ctx, 0, i, -1, -1);
+ //}
if (n_parallel > 1) {
LOG_TEE("\n\n%s: generating %d sequences ...\n", __func__, n_parallel);
diff --git a/examples/convert-legacy-llama.py b/examples/convert_legacy_llama.py
similarity index 100%
rename from examples/convert-legacy-llama.py
rename to examples/convert_legacy_llama.py
diff --git a/examples/embedding/README.md b/examples/embedding/README.md
index 86df18958..e3705b454 100644
--- a/examples/embedding/README.md
+++ b/examples/embedding/README.md
@@ -58,4 +58,3 @@ The above command will output space-separated float values.
```powershell
embedding.exe -p 'Castle<#sep#>Stronghold<#sep#>Dog<#sep#>Cat' --embd-separator '<#sep#>' --embd-normalize 2 --embd-output-format '' -m './path/to/model.gguf' --n-gpu-layers 99 --log-disable 2>/dev/null
```
-
diff --git a/examples/finetune/convert-finetune-checkpoint-to-gguf.py b/examples/finetune/convert_finetune_checkpoint_to_gguf.py
similarity index 100%
rename from examples/finetune/convert-finetune-checkpoint-to-gguf.py
rename to examples/finetune/convert_finetune_checkpoint_to_gguf.py
diff --git a/examples/infill/infill.cpp b/examples/infill/infill.cpp
index 1556a2fb7..f1d82d363 100644
--- a/examples/infill/infill.cpp
+++ b/examples/infill/infill.cpp
@@ -660,4 +660,3 @@ int main(int argc, char ** argv) {
return 0;
}
-
diff --git a/examples/json-schema-pydantic-example.py b/examples/json_schema_pydantic_example.py
similarity index 98%
rename from examples/json-schema-pydantic-example.py
rename to examples/json_schema_pydantic_example.py
index 2a24f8118..c7ca7b8d9 100644
--- a/examples/json-schema-pydantic-example.py
+++ b/examples/json_schema_pydantic_example.py
@@ -1,7 +1,7 @@
# Usage:
#! ./llama-server -m some-model.gguf &
#! pip install pydantic
-#! python json-schema-pydantic-example.py
+#! python json_schema_pydantic_example.py
from pydantic import BaseModel, Extra, TypeAdapter
from annotated_types import MinLen
diff --git a/examples/llava/MobileVLM-README.md b/examples/llava/MobileVLM-README.md
index f6c619c87..06a65fba4 100644
--- a/examples/llava/MobileVLM-README.md
+++ b/examples/llava/MobileVLM-README.md
@@ -30,16 +30,16 @@ git clone https://huggingface.co/mtgv/MobileVLM-1.7B
git clone https://huggingface.co/openai/clip-vit-large-patch14-336
```
-2. Use `llava-surgery.py` to split the LLaVA model to LLaMA and multimodel projector constituents:
+2. Use `llava_surgery.py` to split the LLaVA model to LLaMA and multimodel projector constituents:
```sh
-python ./examples/llava/llava-surgery.py -m path/to/MobileVLM-1.7B
+python ./examples/llava/llava_surgery.py -m path/to/MobileVLM-1.7B
```
-3. Use `convert-image-encoder-to-gguf.py` with `--projector-type ldp` (for **V2** please use `--projector-type ldpv2`) to convert the LLaVA image encoder to GGUF:
+3. Use `convert_image_encoder_to_gguf.py` with `--projector-type ldp` (for **V2** please use `--projector-type ldpv2`) to convert the LLaVA image encoder to GGUF:
```sh
-python ./examples/llava/convert-image-encoder-to-gguf \
+python ./examples/llava/convert_image_encoder_to_gguf \
-m path/to/clip-vit-large-patch14-336 \
--llava-projector path/to/MobileVLM-1.7B/llava.projector \
--output-dir path/to/MobileVLM-1.7B \
@@ -47,17 +47,17 @@ python ./examples/llava/convert-image-encoder-to-gguf \
```
```sh
-python ./examples/llava/convert-image-encoder-to-gguf \
+python ./examples/llava/convert_image_encoder_to_gguf \
-m path/to/clip-vit-large-patch14-336 \
--llava-projector path/to/MobileVLM-1.7B_V2/llava.projector \
--output-dir path/to/MobileVLM-1.7B_V2 \
--projector-type ldpv2
```
-4. Use `examples/convert-legacy-llama.py` to convert the LLaMA part of LLaVA to GGUF:
+4. Use `examples/convert_legacy_llama.py` to convert the LLaMA part of LLaVA to GGUF:
```sh
-python ./examples/convert-legacy-llama.py path/to/MobileVLM-1.7B
+python ./examples/convert_legacy_llama.py path/to/MobileVLM-1.7B
```
5. Use `quantize` to convert LLaMA part's DataType from `fp16` to `q4_k`
diff --git a/examples/llava/README.md b/examples/llava/README.md
index f4554de67..012451361 100644
--- a/examples/llava/README.md
+++ b/examples/llava/README.md
@@ -38,22 +38,22 @@ git clone https://huggingface.co/openai/clip-vit-large-patch14-336
pip install -r examples/llava/requirements.txt
```
-3. Use `llava-surgery.py` to split the LLaVA model to LLaMA and multimodel projector constituents:
+3. Use `llava_surgery.py` to split the LLaVA model to LLaMA and multimodel projector constituents:
```sh
-python ./examples/llava/llava-surgery.py -m ../llava-v1.5-7b
+python ./examples/llava/llava_surgery.py -m ../llava-v1.5-7b
```
-4. Use `convert-image-encoder-to-gguf.py` to convert the LLaVA image encoder to GGUF:
+4. Use `convert_image_encoder_to_gguf.py` to convert the LLaVA image encoder to GGUF:
```sh
-python ./examples/llava/convert-image-encoder-to-gguf.py -m ../clip-vit-large-patch14-336 --llava-projector ../llava-v1.5-7b/llava.projector --output-dir ../llava-v1.5-7b
+python ./examples/llava/convert_image_encoder_to_gguf.py -m ../clip-vit-large-patch14-336 --llava-projector ../llava-v1.5-7b/llava.projector --output-dir ../llava-v1.5-7b
```
-5. Use `examples/convert-legacy-llama.py` to convert the LLaMA part of LLaVA to GGUF:
+5. Use `examples/convert_legacy_llama.py` to convert the LLaMA part of LLaVA to GGUF:
```sh
-python ./examples/convert-legacy-llama.py ../llava-v1.5-7b --skip-unknown
+python ./examples/convert_legacy_llama.py ../llava-v1.5-7b --skip-unknown
```
Now both the LLaMA part and the image encoder are in the `llava-v1.5-7b` directory.
@@ -70,9 +70,9 @@ git clone https://huggingface.co/liuhaotian/llava-v1.6-vicuna-7b
pip install -r examples/llava/requirements.txt
```
-3) Use `llava-surgery-v2.py` which also supports llava-1.5 variants pytorch as well as safetensor models:
+3) Use `llava_surgery_v2.py` which also supports llava-1.5 variants pytorch as well as safetensor models:
```console
-python examples/llava/llava-surgery-v2.py -C -m ../llava-v1.6-vicuna-7b/
+python examples/llava/llava_surgery_v2.py -C -m ../llava-v1.6-vicuna-7b/
```
- you will find a llava.projector and a llava.clip file in your model directory
@@ -86,13 +86,13 @@ curl -s -q https://huggingface.co/cmp-nct/llava-1.6-gguf/raw/main/config_vit.jso
5) Create the visual gguf model:
```console
-python ./examples/llava/convert-image-encoder-to-gguf.py -m vit --llava-projector vit/llava.projector --output-dir vit --clip-model-is-vision
+python ./examples/llava/convert_image_encoder_to_gguf.py -m vit --llava-projector vit/llava.projector --output-dir vit --clip-model-is-vision
```
- This is similar to llava-1.5, the difference is that we tell the encoder that we are working with the pure vision model part of CLIP
6) Then convert the model to gguf format:
```console
-python ./examples/convert-legacy-llama.py ../llava-v1.6-vicuna-7b/ --skip-unknown
+python ./examples/convert_legacy_llama.py ../llava-v1.6-vicuna-7b/ --skip-unknown
```
7) And finally we can run the llava cli using the 1.6 model version:
diff --git a/examples/llava/convert-image-encoder-to-gguf.py b/examples/llava/convert_image_encoder_to_gguf.py
similarity index 100%
rename from examples/llava/convert-image-encoder-to-gguf.py
rename to examples/llava/convert_image_encoder_to_gguf.py
diff --git a/examples/llava/llava-surgery.py b/examples/llava/llava_surgery.py
similarity index 100%
rename from examples/llava/llava-surgery.py
rename to examples/llava/llava_surgery.py
diff --git a/examples/llava/llava-surgery-v2.py b/examples/llava/llava_surgery_v2.py
similarity index 100%
rename from examples/llava/llava-surgery-v2.py
rename to examples/llava/llava_surgery_v2.py
diff --git a/examples/llava/requirements.txt b/examples/llava/requirements.txt
index 17cb4d5e5..4713f0a34 100644
--- a/examples/llava/requirements.txt
+++ b/examples/llava/requirements.txt
@@ -1,3 +1,3 @@
--r ../../requirements/requirements-convert-legacy-llama.txt
+-r ../../requirements/requirements-convert_legacy_llama.txt
pillow~=10.2.0
-torch~=2.1.1
+torch~=2.2.1
diff --git a/examples/lookup/README.md b/examples/lookup/README.md
index 5bfb0de93..71c345c03 100644
--- a/examples/lookup/README.md
+++ b/examples/lookup/README.md
@@ -10,4 +10,3 @@ More info:
https://github.com/ggerganov/llama.cpp/pull/4484
https://github.com/ggerganov/llama.cpp/issues/4226
-
diff --git a/examples/main-cmake-pkg/.gitignore b/examples/main-cmake-pkg/.gitignore
index e32c11c7f..67c01d64c 100644
--- a/examples/main-cmake-pkg/.gitignore
+++ b/examples/main-cmake-pkg/.gitignore
@@ -48,4 +48,3 @@
build*/
out/
tmp/
-
diff --git a/examples/main/main.cpp b/examples/main/main.cpp
index 002f136fd..6dcdea9bd 100644
--- a/examples/main/main.cpp
+++ b/examples/main/main.cpp
@@ -38,7 +38,8 @@ static gpt_params * g_params;
static std::vector<llama_token> * g_input_tokens;
static std::ostringstream * g_output_ss;
static std::vector<llama_token> * g_output_tokens;
-static bool is_interacting = false;
+static bool is_interacting = false;
+static bool need_insert_eot = false;
static bool file_exists(const std::string & path) {
std::ifstream f(path.c_str());
@@ -100,7 +101,8 @@ static void write_logfile(
static void sigint_handler(int signo) {
if (signo == SIGINT) {
if (!is_interacting && g_params->interactive) {
- is_interacting = true;
+ is_interacting = true;
+ need_insert_eot = true;
} else {
console::cleanup();
printf("\n");
@@ -225,7 +227,14 @@ int main(int argc, char ** argv) {
__func__, n_ctx_train, n_ctx);
}
- LOG_TEE("%s: chat template example: %s\n", __func__, llama_chat_format_example(model, params.chat_template).c_str());
+ // print chat template example in conversation mode
+ if (params.conversation) {
+ if (params.enable_chat_template) {
+ LOG_TEE("%s: chat template example: %s\n", __func__, llama_chat_format_example(model, params.chat_template).c_str());
+ } else {
+ LOG_TEE("%s: in-suffix/prefix is specified, chat template will be disabled\n", __func__);
+ }
+ }
// print system information
{
@@ -256,13 +265,15 @@ int main(int argc, char ** argv) {
}
const bool add_bos = llama_should_add_bos_token(model);
- GGML_ASSERT(llama_add_eos_token(model) != 1);
+ if (!llama_model_has_encoder(model)) {
+ GGML_ASSERT(llama_add_eos_token(model) != 1);
+ }
LOG("add_bos: %d\n", add_bos);
std::vector<llama_token> embd_inp;
{
- auto prompt = (params.conversation && params.enable_chat_template)
+ auto prompt = (params.conversation && params.enable_chat_template && !params.prompt.empty())
? chat_add_and_format(model, chat_msgs, "system", params.prompt) // format the system prompt in conversation mode
: params.prompt;
if (params.interactive_first || !params.prompt.empty() || session_tokens.empty()) {
@@ -518,6 +529,24 @@ int main(int argc, char ** argv) {
exit(1);
}
+ if (llama_model_has_encoder(model)) {
+ int enc_input_size = embd_inp.size();
+ llama_token * enc_input_buf = embd_inp.data();
+
+ if (llama_encode(ctx, llama_batch_get_one(enc_input_buf, enc_input_size, 0, 0))) {
+ LOG_TEE("%s : failed to eval\n", __func__);
+ return 1;
+ }
+
+ llama_token decoder_start_token_id = llama_model_decoder_start_token(model);
+ if (decoder_start_token_id == -1) {
+ decoder_start_token_id = llama_token_bos(model);
+ }
+
+ embd_inp.clear();
+ embd_inp.push_back(decoder_start_token_id);
+ }
+
while ((n_remain != 0 && !is_antiprompt) || params.interactive) {
// predict
if (!embd.empty()) {
@@ -886,6 +915,13 @@ int main(int argc, char ** argv) {
LOG("input tokens: %s\n", LOG_TOKENS_TOSTR_PRETTY(ctx, line_inp).c_str());
+ // if user stop generation mid-way, we must add EOT to finish model's last response
+ if (need_insert_eot && format_chat) {
+ llama_token eot = llama_token_eot(model);
+ embd_inp.push_back(eot == -1 ? llama_token_eos(model) : eot);
+ need_insert_eot = false;
+ }
+
embd_inp.insert(embd_inp.end(), line_pfx.begin(), line_pfx.end());
embd_inp.insert(embd_inp.end(), line_inp.begin(), line_inp.end());
embd_inp.insert(embd_inp.end(), line_sfx.begin(), line_sfx.end());
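
The interactive-input change above closes the model's unfinished turn with an end-of-turn token whenever the user interrupted generation (`need_insert_eot`). A small Python transliteration of that bookkeeping, with hypothetical token IDs standing in for `llama_token_eot()`/`llama_token_eos()` (-1 meaning the model defines no EOT token):

```python
# Sketch of the EOT bookkeeping added to main.cpp, transliterated to Python.
# token_eot/token_eos are hypothetical stand-ins for llama_token_eot()/llama_token_eos().
def append_user_turn(embd_inp, line_tokens, need_insert_eot, token_eot, token_eos):
    # If the user interrupted generation mid-way, first close the model's
    # unfinished response with EOT, falling back to EOS when EOT is undefined.
    if need_insert_eot:
        embd_inp.append(token_eos if token_eot == -1 else token_eot)
        need_insert_eot = False
    embd_inp.extend(line_tokens)
    return embd_inp, need_insert_eot

tokens, flag = append_user_turn([1, 42, 7], [99, 100], True, token_eot=-1, token_eos=2)
print(tokens, flag)  # [1, 42, 7, 2, 99, 100] False
```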
diff --git a/examples/passkey/README.md b/examples/passkey/README.md
index a48a6283a..2b8e910f9 100644
--- a/examples/passkey/README.md
+++ b/examples/passkey/README.md
@@ -1,5 +1,8 @@
# llama.cpp/example/passkey
+A passkey retrieval task is an evaluation method used to measure a language
+model's ability to recall information from long contexts.
+
See the following PRs for more info:
- https://github.com/ggerganov/llama.cpp/pull/3856
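
The task hides a short secret inside long filler text and then asks the model to retrieve it. A minimal sketch of building such a prompt; the filler sentence, passkey value, and insertion point are arbitrary choices, not the exact prompt used by `examples/passkey`:

```python
# Minimal sketch of constructing a passkey-retrieval prompt (illustrative only).
import random

def build_passkey_prompt(passkey: int, n_filler: int = 200, seed: int = 0) -> str:
    random.seed(seed)
    filler = "The grass is green. The sky is blue. The sun is yellow. "
    chunks = [filler] * n_filler
    needle = f"The pass key is {passkey}. Remember it. {passkey} is the pass key. "
    chunks.insert(random.randrange(len(chunks)), needle)  # bury the secret at a random position
    return "".join(chunks) + "What is the pass key?"

print(build_passkey_prompt(60317)[:120])
```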
diff --git a/examples/perplexity/perplexity.cpp b/examples/perplexity/perplexity.cpp
index 7fbc06f72..3269dfe19 100644
--- a/examples/perplexity/perplexity.cpp
+++ b/examples/perplexity/perplexity.cpp
@@ -1992,6 +1992,12 @@ int main(int argc, char ** argv) {
params.n_batch = std::min(params.n_batch, n_kv);
} else {
params.n_batch = std::min(params.n_batch, params.n_ctx);
+ if (params.kl_divergence) {
+ params.n_parallel = 1;
+ } else {
+ // ensure there's at least enough seq_ids for HellaSwag
+ params.n_parallel = std::max(4, params.n_parallel);
+ }
}
if (params.ppl_stride > 0) {
@@ -2016,9 +2022,6 @@ int main(int argc, char ** argv) {
llama_model * model;
llama_context * ctx;
- // ensure there's at least enough seq_ids for HellaSwag
- params.n_parallel = std::max(4, params.n_parallel);
-
// load the model and apply lora adapter, if any
std::tie(model, ctx) = llama_init_from_gpt_params(params);
if (model == NULL) {
diff --git a/examples/pydantic-models-to-grammar-examples.py b/examples/pydantic_models_to_grammar_examples.py
similarity index 100%
rename from examples/pydantic-models-to-grammar-examples.py
rename to examples/pydantic_models_to_grammar_examples.py
diff --git a/examples/regex-to-grammar.py b/examples/regex_to_grammar.py
similarity index 100%
rename from examples/regex-to-grammar.py
rename to examples/regex_to_grammar.py
diff --git a/examples/server/README.md b/examples/server/README.md
index 4fab006bb..aa4cbbe63 100644
--- a/examples/server/README.md
+++ b/examples/server/README.md
@@ -375,7 +375,7 @@ Notice that each `probs` is an array of length `n_probs`.
- `default_generation_settings` - the default generation settings for the `/completion` endpoint, which has the same fields as the `generation_settings` response object from the `/completion` endpoint.
- `total_slots` - the total number of slots for process requests (defined by `--parallel` option)
-- **POST** `/v1/chat/completions`: OpenAI-compatible Chat Completions API. Given a ChatML-formatted json description in `messages`, it returns the predicted completion. Both synchronous and streaming mode are supported, so scripted and interactive applications work fine. While no strong claims of compatibility with OpenAI API spec is being made, in our experience it suffices to support many apps. Only model with [supported chat template](https://github.com/ggerganov/llama.cpp/wiki/Templates-supported-by-llama_chat_apply_template) can be used optimally with this endpoint. By default, ChatML template will be used.
+- **POST** `/v1/chat/completions`: OpenAI-compatible Chat Completions API. Given a ChatML-formatted json description in `messages`, it returns the predicted completion. Both synchronous and streaming mode are supported, so scripted and interactive applications work fine. While no strong claims of compatibility with OpenAI API spec is being made, in our experience it suffices to support many apps. Only models with a [supported chat template](https://github.com/ggerganov/llama.cpp/wiki/Templates-supported-by-llama_chat_apply_template) can be used optimally with this endpoint. By default, the ChatML template will be used.
*Options:*
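
Because the endpoint follows the OpenAI Chat Completions shape, a plain HTTP POST is enough to try it. A minimal sketch using `requests`, assuming a server already running locally; the port, model field, and message content are placeholders:

```python
# Minimal sketch of calling the OpenAI-compatible chat endpoint described above.
# Assumes llama-server is running locally; port and payload values are placeholders.
import requests

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "model": "any",  # the server answers with its loaded model regardless of this field
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Write a haiku about GGUF."},
        ],
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```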
diff --git a/examples/server/tests/features/passkey.feature b/examples/server/tests/features/passkey.feature
index 1bde7aab8..6a5a84e6a 100644
--- a/examples/server/tests/features/passkey.feature
+++ b/examples/server/tests/features/passkey.feature
@@ -52,4 +52,3 @@ Feature: Passkey / Self-extend with context shift
#| TheBloke/Llama-2-7B-GGUF | llama-2-7b.Q2_K.gguf | 4096 | 3 | 16384 | 512 | 4 | 512 | 500 | 300 | 1234 | 5 | 1234 |
#| TheBloke/Mixtral-8x7B-v0.1-GGUF | mixtral-8x7b-v0.1.Q2_K.gguf | 32768 | 2 | 16384 | 512 | 4 | 512 | 500 | 100 | 0987 | 5 | 0
# 987 |
-
diff --git a/examples/server/themes/buttons-top/index.html b/examples/server/themes/buttons-top/index.html
index 6af30d307..8334bcde5 100644
--- a/examples/server/themes/buttons-top/index.html
+++ b/examples/server/themes/buttons-top/index.html
@@ -1054,4 +1054,3 @@