[docs]: refresh KT install commands (#1958)

Peilin Li 2026-04-27 00:45:43 +08:00 committed by GitHub
parent 07e274467a
commit 0656e01ac1
8 changed files with 37 additions and 31 deletions


@@ -95,26 +95,26 @@ This section shows how to install and use **LLaMA-Factory + KTransformers** for
### Environment Setup
Following the example below, install both the **KTransformers** and **LLaMA-Factory** environments together.
-This time, to simplify the installation process of KTransformers, we have specially packaged a wheel file to avoid local compilation.
+This time, to simplify the installation process of KTransformers, use the PyPI packages to avoid local compilation.
The detailed installation steps are as follows:
-(Note: Make sure your local **Python version**, **Torch version**, **CUDA version**, and the **KTransformers wheel filename** correspond correctly.)
+(Note: Make sure your local **Python version**, **Torch version**, and **CUDA version** are compatible with the installed packages.)
```shell
# 1. Create a conda environment
-conda create -n Kllama python=3.12 # choose from : [3.10, 3.11, 3.12, 3.13]
+conda create -n Kllama python=3.12 # choose from : [3.11, 3.12, 3.13]
conda install -y -c conda-forge libstdcxx-ng gcc_impl_linux-64
conda install -y -c nvidia/label/cuda-11.8.0 cuda-runtime
# 2. Install the LLaMA-Factory environment
git clone --depth 1 https://github.com/hiyouga/LLaMA-Factory.git
cd LLaMA-Factory
-pip install -e ".[torch,metrics]" --no-build-isolation
+pip install -e .
-# 3. Install the KTransformers wheel that matches your Torch and Python versions, from https://github.com/kvcache-ai/ktransformers/releases/tag/v0.4.1 (Note: The CUDA version can differ from that in the wheel filename.)
-pip install ktransformers-0.4.1+cu128torch27fancy-cp312-cp312-linux_x86_64.whl
+# 3. Install the KTransformers SFT packages
+pip install "ktransformers[sft]"
-# 4. Install flash-attention, download the corresponding file based on your Python and Torch versions from: https://github.com/Dao-AILab/flash-attention/releases
-pip install flash_attn-2.8.3+cu12torch2.7cxx11abiTRUE-cp312-cp312-linux_x86_64.whl
+pip install flash-attn --no-build-isolation
# abi=True/False can be found as follows:
# import torch
# print(torch._C._GLIBCXX_USE_CXX11_ABI)
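# Optional sanity check (a sketch, not part of the upstream docs): print the
# local Python / Torch / CUDA versions to confirm they are compatible before
# installing the KTransformers packages. Safe to run before torch is installed.
python3 - <<'EOF'
import sys
print("python:", sys.version.split()[0])
try:
    import torch
    print("torch:", torch.__version__, "| cuda:", torch.version.cuda)
    print("cxx11-abi:", torch._C._GLIBCXX_USE_CXX11_ABI)
except ImportError:
    print("torch: not installed yet")
EOF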
@@ -128,7 +128,7 @@ pip install custom_flashinfer/
### Core Feature 1: Use KTransformers backend to fine-tune ultra-large MoE models
-Run the command: `USE_KT=1 llamafactory-cli train examples/train_lora/deepseek3_lora_sft_kt.yaml`.
+Run the command: `USE_KT=1 ACCELERATE_USE_KT=true accelerate launch --config_file examples/ktransformers/accelerate/fsdp2_kt_bf16.yaml -m llamafactory.cli train examples/ktransformers/train_lora/deepseek_v3_lora_sft_kt.yaml`.
Note: You **must** provide a **BF16** model. DeepSeek-V3-671B is released in FP8 by default; convert with [DeepSeek-V3/inference/fp8_cast_bf16.py](https://github.com/deepseek-ai/DeepSeek-V3/blob/main/inference/fp8_cast_bf16.py).
@@ -213,7 +213,7 @@ Outputs go to `output_dir` in safetensors format plus adapter metadata for later
### Core Feature 2: Chat with the fine-tuned model (base + LoRA adapter)
-Run the command: `llamafactory-cli chat examples/inference/deepseek3_lora_sft_kt.yaml`.
+Run the command: `llamafactory-cli chat examples/inference/qwen3_lora_sft.yaml`.
Use the safetensors adapter trained with KT for inference.
@@ -238,7 +238,7 @@ During loading, LLaMA-Factory maps layer names to KT's naming. You'll see lo
### Core Feature 3: Batch inference + metrics (base + LoRA adapter)
-Run the command: `API_PORT=8000 llamafactory-cli api examples/inference/deepseek3_lora_sft_kt.yaml`.
+Run the command: `API_PORT=8000 llamafactory-cli api examples/inference/qwen3_lora_sft.yaml`.
Invoke the KT fine-tuned adapter to serve the API; usage is otherwise identical to the native LLaMA-Factory API.
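Since the served API follows the OpenAI chat-completions convention, a request can be sketched as below. This is a minimal sketch: the model name `deepseek-v3-lora` and the `localhost:8000` address are assumptions matching the `API_PORT=8000` command above, not values from the docs.

```shell
# Build a chat request payload and validate it locally before sending.
cat > /tmp/kt_request.json <<'EOF'
{
  "model": "deepseek-v3-lora",
  "messages": [{"role": "user", "content": "Hello"}]
}
EOF
python3 -m json.tool /tmp/kt_request.json

# Send it to the running API server (uncomment once the server is up):
# curl http://localhost:8000/v1/chat/completions \
#   -H "Content-Type: application/json" \
#   -d @/tmp/kt_request.json
```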
```yaml