Merge branch 'upstream' into concedo_experimental

Conflicts: src/llama-vocab.cpp

commit cbe9fc87c5
41 changed files with 1470 additions and 27198 deletions
@@ -1,246 +0,0 @@

> [!IMPORTANT]
> This build documentation is specific only to IBM Z & LinuxONE mainframes (s390x). You can find the build documentation for other architectures in [build.md](build.md).

# Build llama.cpp locally (for s390x)

The main product of this project is the `llama` library. Its C-style interface can be found in [include/llama.h](../include/llama.h).

The project also includes many example programs and tools using the `llama` library. The examples range from simple, minimal code snippets to sophisticated sub-projects such as an OpenAI-compatible HTTP server.

**To get the code:**

```bash
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
```

## CPU Build with BLAS

Building llama.cpp with BLAS support is highly recommended as it has been shown to provide performance improvements. Make sure to have OpenBLAS installed in your environment.
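
OpenBLAS is typically available from your distribution's package manager. The package names below are common ones and may differ on your system; treat this as a sketch rather than an exhaustive list:

```bash
# Debian/Ubuntu-based distributions (package name may vary)
sudo apt install libopenblas-dev

# RHEL/Fedora-based distributions (package name may vary)
sudo dnf install openblas-devel
```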

```bash
cmake -S . -B build \
    -DCMAKE_BUILD_TYPE=Release \
    -DGGML_BLAS=ON \
    -DGGML_BLAS_VENDOR=OpenBLAS

cmake --build build --config Release -j $(nproc)
```

**Notes**:

- For faster repeated compilation, install [ccache](https://ccache.dev/) (a minimal setup is sketched after these notes)
- By default, VXE/VXE2 is enabled. To disable it (not recommended):

    ```bash
    cmake -S . -B build \
        -DCMAKE_BUILD_TYPE=Release \
        -DGGML_BLAS=ON \
        -DGGML_BLAS_VENDOR=OpenBLAS \
        -DGGML_VXE=OFF

    cmake --build build --config Release -j $(nproc)
    ```

- By default, NNPA is enabled when available. To disable it (not recommended):

    ```bash
    cmake -S . -B build \
        -DCMAKE_BUILD_TYPE=Release \
        -DGGML_BLAS=ON \
        -DGGML_BLAS_VENDOR=OpenBLAS \
        -DGGML_NNPA=OFF

    cmake --build build --config Release -j $(nproc)
    ```

- For debug builds:

    ```bash
    cmake -S . -B build \
        -DCMAKE_BUILD_TYPE=Debug \
        -DGGML_BLAS=ON \
        -DGGML_BLAS_VENDOR=OpenBLAS
    cmake --build build --config Debug -j $(nproc)
    ```

- For static builds, add `-DBUILD_SHARED_LIBS=OFF`:

    ```bash
    cmake -S . -B build \
        -DCMAKE_BUILD_TYPE=Release \
        -DGGML_BLAS=ON \
        -DGGML_BLAS_VENDOR=OpenBLAS \
        -DBUILD_SHARED_LIBS=OFF

    cmake --build build --config Release -j $(nproc)
    ```
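
As noted above, [ccache](https://ccache.dev/) speeds up repeated builds. Recent llama.cpp builds usually pick up ccache automatically once it is installed; the snippet below is only a sketch of installing it and, if needed, routing compilation through it explicitly via CMake's generic compiler-launcher variables (the package name and the explicit launcher step are assumptions, not project requirements):

```bash
# Install ccache (package name may differ on your distribution)
sudo apt install ccache

# Optional: explicitly route compilation through ccache using
# CMake's generic compiler-launcher variables
cmake -S . -B build \
    -DCMAKE_BUILD_TYPE=Release \
    -DGGML_BLAS=ON \
    -DGGML_BLAS_VENDOR=OpenBLAS \
    -DCMAKE_C_COMPILER_LAUNCHER=ccache \
    -DCMAKE_CXX_COMPILER_LAUNCHER=ccache

cmake --build build --config Release -j $(nproc)
```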

## Getting GGUF Models

All models need to be converted to Big-Endian. You can achieve this in three ways:

1. **Use pre-converted models verified for use on IBM Z & LinuxONE (easiest)**

    ![File Type - gguf](https://img.shields.io/badge/File_Type-gguf-fff)

    You can find popular models pre-converted and verified at [s390x Ready Models](https://huggingface.co/collections/taronaeo/s390x-ready-models-672765393af438d0ccb72a08).

    These models have already been converted from `safetensors` to `GGUF Big-Endian` and their respective tokenizers verified to run correctly on IBM z15 and later systems.

2. **Convert safetensors model to GGUF Big-Endian directly (recommended)**

    ![File Type - safetensors](https://img.shields.io/badge/File_Type-safetensors-da1e28)

    The model you are trying to convert must be in `safetensors` file format (for example [IBM Granite 3.3 2B](https://huggingface.co/ibm-granite/granite-3.3-2b-instruct)). Make sure you have downloaded the model repository for this case (see the example download command after this list).

    ```bash
    python3 convert_hf_to_gguf.py \
        --outfile model-name-be.f16.gguf \
        --outtype f16 \
        --bigendian \
        model-directory/
    ```

    For example,

    ```bash
    python3 convert_hf_to_gguf.py \
        --outfile granite-3.3-2b-instruct-be.f16.gguf \
        --outtype f16 \
        --bigendian \
        granite-3.3-2b-instruct/
    ```

3. **Convert existing GGUF Little-Endian model to Big-Endian**

    ![File Type - gguf](https://img.shields.io/badge/File_Type-gguf-fff)

    The model you are trying to convert must be in `gguf` file format (for example [IBM Granite 3.3 2B](https://huggingface.co/ibm-granite/granite-3.3-2b-instruct-GGUF)). Make sure you have downloaded the model file for this case.

    ```bash
    python3 gguf-py/gguf/scripts/gguf_convert_endian.py model-name.f16.gguf BIG
    ```

    For example,

    ```bash
    python3 gguf-py/gguf/scripts/gguf_convert_endian.py granite-3.3-2b-instruct-le.f16.gguf BIG
    mv granite-3.3-2b-instruct-le.f16.gguf granite-3.3-2b-instruct-be.f16.gguf
    ```
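
For case 2, the safetensors repository can be fetched from Hugging Face before conversion. The commands below are only a sketch using the `huggingface-cli` tool from the `huggingface_hub` package, which is not part of llama.cpp; the model name matches the example above:

```bash
# Install the Hugging Face CLI (outside of llama.cpp)
pip install -U "huggingface_hub[cli]"

# Download the safetensors repository into a local directory,
# then point convert_hf_to_gguf.py at that directory
huggingface-cli download ibm-granite/granite-3.3-2b-instruct --local-dir granite-3.3-2b-instruct
```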

**Notes:**

- The GGUF endian conversion script may not support all data types at the moment and may fail for some models/quantizations. When that happens, please try manually converting the safetensors model to GGUF Big-Endian via Step 2.

## IBM Accelerators

### 1. SIMD Acceleration

Only available on IBM z15 or later systems with the `-DGGML_VXE=ON` (turned on by default) compile flag. No hardware acceleration is possible with llama.cpp on older systems, such as IBM z14/arch12. On such systems, the APIs can still run but will use a scalar implementation.

### 2. NNPA Vector Intrinsics Acceleration

Only available on IBM z16 or later systems with the `-DGGML_NNPA=ON` (turned on when available) compile flag. No hardware acceleration is possible with llama.cpp on older systems, such as IBM z15/arch13. On such systems, the APIs can still run but will use a scalar implementation.
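
To see which of these facilities your LPAR and kernel actually expose, one quick check is the `features` line of `/proc/cpuinfo`. The exact flag names (`vx`, `vxe`, `vxe2`, `nnpa`) depend on your kernel version, so treat this as an indicative check rather than an authoritative one:

```bash
# Look for the vector (vx/vxe/vxe2) and NNPA facility flags reported by the kernel
grep -m1 '^features' /proc/cpuinfo | tr ' ' '\n' | grep -E '^(vx|vxe|vxe2|nnpa)$'
```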

### 3. zDNN Accelerator

_Only available on IBM z16 or later systems. No direction at the moment._

### 4. Spyre Accelerator

_No direction at the moment._

## Performance Tuning

### 1. Virtualization Setup

It is strongly recommended to use only LPAR (Type-1) virtualization to get the best performance.

Note: Type-2 virtualization is not supported at the moment. While you can get it running, the performance will not be the best.

### 2. IFL (Core) Count

It is recommended to allocate a minimum of 8 shared IFLs to the LPAR. Increasing the IFL count past 8 shared IFLs will only improve Prompt Processing performance but not Token Generation.

Note: IFL count does not equate to vCPU count.
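
When running inference, it can help to match the thread count to the number of IFLs allocated to the LPAR rather than to the number of online vCPUs. The command below is a sketch, assuming the standard `llama-cli` binary produced by the build above and an 8-IFL LPAR; adjust `-t` to your own configuration:

```bash
# Run inference with the thread count matched to the allocated IFLs
./build/bin/llama-cli \
    -m granite-3.3-2b-instruct-be.f16.gguf \
    -p "Write a haiku about mainframes" \
    -t 8
```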

### 3. SMT vs NOSMT (Simultaneous Multithreading)

It is strongly recommended to disable SMT via the kernel boot parameters, as it negatively affects performance. Please refer to your Linux distribution's guide on disabling SMT via kernel boot parameters.
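
The exact steps differ between distributions, so the snippet below is only a quick status check; disabling SMT itself is typically done by adding the generic `nosmt` parameter to the kernel command line through your bootloader configuration and rebooting.

```bash
# Check whether SMT is currently active (1 = active, 0 = inactive)
cat /sys/devices/system/cpu/smt/active
```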

### 4. BLAS vs NOBLAS

IBM VXE/VXE2 SIMD acceleration depends on the BLAS implementation. It is strongly recommended to use BLAS.

## Frequently Asked Questions (FAQ)

1. I'm getting the following error message while trying to load a model: `gguf_init_from_file_impl: failed to load model: this GGUF file version 50331648 is extremely large, is there a mismatch between the host and model endianness?`

    Answer: Please ensure that the model you have downloaded/converted is GGUFv3 Big-Endian. These models are usually denoted with the `-be` suffix, i.e., `granite-3.3-2b-instruct-be.F16.gguf`.

    You may refer to the [Getting GGUF Models](#getting-gguf-models) section to manually convert a `safetensors` model to `GGUF` Big Endian. You can also check the file's byte order directly; see the sketch after this FAQ.

2. I'm getting extremely poor performance when running inference on a model.

    Answer: Please refer to the [Appendix B: SIMD Support Matrix](#appendix-b-simd-support-matrix) to check if your model quantization is supported by SIMD acceleration. A quick way to inspect the quantization is sketched after this FAQ.

3. I'm building on IBM z17 and getting the following error message: `invalid switch -march=z17`

    Answer: Please ensure that your GCC compiler is at least version 15.1.0 and that `binutils` is updated to the latest version. If this does not fix the problem, kindly open an issue.
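
For question 1, a GGUF file begins with the 4-byte magic `GGUF` followed by a 32-bit version field stored in the file's own byte order, which is why a little-endian version 3 reads back as 50331648 (0x03000000) on a big-endian host. A quick way to check which variant you have is to dump the first few bytes (the file name below is just an example):

```bash
xxd -l 8 model-name.f16.gguf
# Big-Endian file:    4747 5546 0000 0003  ("GGUF" magic, version 3)
# Little-Endian file: 4747 5546 0300 0000  ("GGUF" magic, version 3, byte-swapped)
```

For question 2, the quantization used by a model is recorded in its GGUF metadata and tensor types. One way to inspect it is the `gguf_dump.py` script that ships alongside the endian-conversion script used earlier (the path is assumed to follow the same layout as above); compare the reported types against Appendix B:

```bash
python3 gguf-py/gguf/scripts/gguf_dump.py model-name-be.f16.gguf
```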

## Getting Help on IBM Z & LinuxONE

1. **Bugs, Feature Requests**

    Please file an issue in llama.cpp and ensure that the title contains "s390x".

2. **Other Questions**

    Please reach out directly to [aionz@us.ibm.com](mailto:aionz@us.ibm.com).

## Appendix A: Hardware Support Matrix

|         | Support | Minimum Compiler Version |
| ------- | ------- | ------------------------ |
| IBM z15 | ✅ |  |
| IBM z16 | ✅ |  |
| IBM z17 | ✅ | GCC 15.1.0 |

- ✅ - supported and verified to run as intended
- 🚫 - unsupported, we are unlikely able to provide support

## Appendix B: SIMD Support Matrix

|            | VX/VXE/VXE2 | NNPA | zDNN | Spyre |
| ---------- | ----------- | ---- | ---- | ----- |
| FP32       | ✅ | ✅ | ❓ | ❓ |
| FP16       | ✅ | ✅ | ❓ | ❓ |
| BF16       | 🚫 | 🚫 | ❓ | ❓ |
| Q4_0       | ✅ | ✅ | ❓ | ❓ |
| Q4_1       | ✅ | ✅ | ❓ | ❓ |
| Q5_0       | 🚫 | 🚫 | ❓ | ❓ |
| Q5_1       | 🚫 | 🚫 | ❓ | ❓ |
| Q8_0       | ✅ | ✅ | ❓ | ❓ |
| Q2_K       | 🚫 | 🚫 | ❓ | ❓ |
| Q3_K       | ✅ | ✅ | ❓ | ❓ |
| Q4_K       | ✅ | ✅ | ❓ | ❓ |
| Q5_K       | ✅ | ✅ | ❓ | ❓ |
| Q6_K       | ✅ | ✅ | ❓ | ❓ |
| TQ1_0      | 🚫 | 🚫 | ❓ | ❓ |
| TQ2_0      | 🚫 | 🚫 | ❓ | ❓ |
| IQ2_XXS    | 🚫 | 🚫 | ❓ | ❓ |
| IQ2_XS     | 🚫 | 🚫 | ❓ | ❓ |
| IQ2_S      | 🚫 | 🚫 | ❓ | ❓ |
| IQ3_XXS    | 🚫 | 🚫 | ❓ | ❓ |
| IQ3_S      | 🚫 | 🚫 | ❓ | ❓ |
| IQ1_S      | 🚫 | 🚫 | ❓ | ❓ |
| IQ1_M      | 🚫 | 🚫 | ❓ | ❓ |
| IQ4_NL     | ✅ | ✅ | ❓ | ❓ |
| IQ4_XS     | ✅ | ✅ | ❓ | ❓ |
| FP32->FP16 | 🚫 | ✅ | ❓ | ❓ |
| FP16->FP32 | 🚫 | ✅ | ❓ | ❓ |

- ✅ - acceleration available
- 🚫 - acceleration unavailable, will still run using scalar implementation
- ❓ - acceleration unknown, please contribute if you can test it yourself

docs/ops.md

@@ -1,95 +0,0 @@

# GGML Operations

List of GGML operations and backend support status.

Legend:
- ✅ Fully supported by this backend
- 🟡 Partially supported by this backend
- ❌ Not supported by this backend

| Operation | BLAS | CPU | CUDA | Metal |
|-----------|------|------|------|------|
| ABS | ❌ | ✅ | 🟡 | ❌ |
| ACC | ❌ | ✅ | ✅ | ✅ |
| ADD | ❌ | ✅ | ✅ | 🟡 |
| ADD1 | ❌ | ✅ | ✅ | ❌ |
| ARANGE | ❌ | ✅ | ✅ | ✅ |
| ARGMAX | ❌ | ✅ | ✅ | ✅ |
| ARGSORT | ❌ | ✅ | ✅ | ✅ |
| CLAMP | ❌ | ✅ | ✅ | 🟡 |
| CONCAT | ❌ | ✅ | 🟡 | ✅ |
| CONT | ❌ | ✅ | 🟡 | ✅ |
| CONV_2D_DW | ❌ | ✅ | ✅ | ❌ |
| CONV_TRANSPOSE_1D | ❌ | ✅ | ✅ | ✅ |
| CONV_TRANSPOSE_2D | ❌ | ✅ | ✅ | ❌ |
| COS | ❌ | ✅ | ✅ | 🟡 |
| COUNT_EQUAL | ❌ | ✅ | ✅ | ❌ |
| CPY | ❌ | 🟡 | 🟡 | 🟡 |
| CROSS_ENTROPY_LOSS | ❌ | ✅ | ✅ | ❌ |
| CROSS_ENTROPY_LOSS_BACK | ❌ | ✅ | ✅ | ❌ |
| DIAG_MASK_INF | ❌ | ✅ | ✅ | 🟡 |
| DIV | ❌ | ✅ | ✅ | 🟡 |
| DUP | ❌ | ✅ | 🟡 | 🟡 |
| ELU | ❌ | ✅ | ❌ | 🟡 |
| EXP | ❌ | ✅ | 🟡 | ❌ |
| FLASH_ATTN_EXT | ❌ | ✅ | 🟡 | 🟡 |
| GATED_LINEAR_ATTN | ❌ | ✅ | ✅ | ❌ |
| GEGLU | ❌ | ✅ | ✅ | 🟡 |
| GEGLU_ERF | ❌ | ✅ | ✅ | 🟡 |
| GEGLU_QUICK | ❌ | ✅ | ✅ | 🟡 |
| GELU | ❌ | ✅ | 🟡 | 🟡 |
| GELU_ERF | ❌ | ✅ | 🟡 | 🟡 |
| GELU_QUICK | ❌ | ✅ | 🟡 | 🟡 |
| GET_ROWS | ❌ | ✅ | 🟡 | ✅ |
| GET_ROWS_BACK | ❌ | 🟡 | 🟡 | ❌ |
| GROUP_NORM | ❌ | ✅ | ✅ | ✅ |
| HARDSIGMOID | ❌ | ✅ | 🟡 | ❌ |
| HARDSWISH | ❌ | ✅ | 🟡 | ❌ |
| IM2COL | ❌ | ✅ | ✅ | 🟡 |
| L2_NORM | ❌ | ✅ | ✅ | ✅ |
| LEAKY_RELU | ❌ | ✅ | ✅ | ✅ |
| LOG | ❌ | ✅ | ✅ | ❌ |
| MEAN | ❌ | ✅ | ✅ | ✅ |
| MUL | ❌ | ✅ | ✅ | 🟡 |
| MUL_MAT | 🟡 | 🟡 | 🟡 | 🟡 |
| MUL_MAT_ID | ❌ | ✅ | ✅ | ✅ |
| NEG | ❌ | ✅ | 🟡 | 🟡 |
| NORM | ❌ | ✅ | ✅ | 🟡 |
| OPT_STEP_ADAMW | ❌ | ✅ | ✅ | ❌ |
| OUT_PROD | 🟡 | 🟡 | 🟡 | ❌ |
| PAD | ❌ | ✅ | ✅ | ✅ |
| PAD_REFLECT_1D | ❌ | ✅ | ❌ | ✅ |
| POOL_2D | ❌ | ✅ | ✅ | ✅ |
| REGLU | ❌ | ✅ | ✅ | 🟡 |
| RELU | ❌ | ✅ | 🟡 | 🟡 |
| REPEAT | ❌ | ✅ | 🟡 | ✅ |
| REPEAT_BACK | ❌ | ✅ | ✅ | ❌ |
| RMS_NORM | ❌ | ✅ | ✅ | 🟡 |
| RMS_NORM_BACK | ❌ | ✅ | ✅ | ❌ |
| RMS_NORM_MUL | ❌ | ✅ | ✅ | ✅ |
| ROPE | ❌ | ✅ | ✅ | ✅ |
| ROPE_BACK | ❌ | ✅ | ✅ | ❌ |
| RWKV_WKV6 | ❌ | ✅ | ✅ | ✅ |
| RWKV_WKV7 | ❌ | ✅ | ✅ | ✅ |
| SCALE | ❌ | ✅ | ✅ | ✅ |
| SET | ❌ | ✅ | ❌ | ✅ |
| SET_ROWS | ❌ | 🟡 | ❌ | 🟡 |
| SGN | ❌ | ✅ | 🟡 | ❌ |
| SIGMOID | ❌ | ✅ | 🟡 | 🟡 |
| SILU | ❌ | ✅ | 🟡 | 🟡 |
| SILU_BACK | ❌ | ✅ | ✅ | ❌ |
| SIN | ❌ | ✅ | ✅ | 🟡 |
| SOFT_MAX | ❌ | ✅ | ✅ | ✅ |
| SOFT_MAX_BACK | ❌ | 🟡 | 🟡 | ❌ |
| SQR | ❌ | ✅ | ✅ | 🟡 |
| SQRT | ❌ | ✅ | ✅ | 🟡 |
| SSM_CONV | ❌ | ✅ | ✅ | ✅ |
| SSM_SCAN | ❌ | ✅ | ✅ | ✅ |
| STEP | ❌ | ✅ | 🟡 | ❌ |
| SUB | ❌ | ✅ | ✅ | 🟡 |
| SUM | ❌ | ✅ | ✅ | ❌ |
| SUM_ROWS | ❌ | ✅ | ✅ | ✅ |
| SWIGLU | ❌ | ✅ | ✅ | 🟡 |
| TANH | ❌ | ✅ | 🟡 | 🟡 |
| TIMESTEP_EMBEDDING | ❌ | ✅ | ✅ | ✅ |
| UPSCALE | ❌ | ✅ | ✅ | 🟡 |

docs/ops/BLAS.csv (6534 lines): file diff suppressed because it is too large

docs/ops/CPU.csv (6534 lines): file diff suppressed because it is too large

docs/ops/CUDA.csv (6534 lines): file diff suppressed because it is too large

docs/ops/Metal.csv (6534 lines): file diff suppressed because it is too large