diff --git a/README.md b/README.md
index 31b3d8eb..19684b80 100644
--- a/README.md
+++ b/README.md
@@ -17,6 +17,7 @@ KTransformers is a research project focused on efficient inference and fine-tuni
 ## 🔥 Updates
 
+* **Jan 27, 2026**: Kimi-K2.5 Day0 Support! ([Tutorial](./doc/en/Kimi-K2.5.md)) ([SFT Tutorial](./doc/en/SFT_Installation_Guide_KimiK2.5.md))
 * **Jan 22, 2026**: Support [CPU-GPU Expert Scheduling](./doc/en/kt-kernel/experts-sched-Tutorial.md), [Native BF16 and FP8 per channel Precision](./doc/en/kt-kernel/Native-Precision-Tutorial.md) and [AutoDL unified fine-tuning and inference](./doc/zh/【云端低价训推】%20KTransformers%2BAutoDL%2BLlamaFactory:随用随租的低成本超大模型「微调%2B推理」一体化流程.pdf)
 * **Dec 24, 2025**: Support Native MiniMax-M2.1 inference. ([Tutorial](./doc/en/kt-kernel/MiniMax-M2.1-Tutorial.md))
 * **Dec 22, 2025**: Support RL-DPO fine-tuning with LLaMA-Factory. ([Tutorial](./doc/en/SFT/DPO_tutorial.md))
diff --git a/doc/en/Kimi-K2.5.md b/doc/en/Kimi-K2.5.md
new file mode 100644
index 00000000..e75c22d3
--- /dev/null
+++ b/doc/en/Kimi-K2.5.md
@@ -0,0 +1,154 @@
+# Running Kimi-K2.5 with SGLang and KT-Kernel
+
+This tutorial demonstrates how to run Kimi-K2.5 inference using SGLang integrated with KT-Kernel for CPU-GPU heterogeneous inference. This setup enables efficient deployment of large MoE models by offloading experts to the CPU.
+
+## Table of Contents
+
+- [Hardware Requirements](#hardware-requirements)
+- [Prerequisites](#prerequisites)
+- [Step 1: Download Model Weights](#step-1-download-model-weights)
+- [Step 2: Launch SGLang Server](#step-2-launch-sglang-server)
+- [Step 3: Send Inference Requests](#step-3-send-inference-requests)
+
+## Hardware Requirements
+
+**Minimum Configuration:**
+- **GPU**: 2x NVIDIA RTX 4090 24GB (or equivalent, with at least 48GB total VRAM)
+- **CPU**: x86 CPU with AVX512F support (e.g., Intel Sapphire Rapids)
+- **RAM**: At least 600GB system memory
+- **Storage**: ~600GB for model weights (native INT4 weights; the same weight folder serves both CPU and GPU)
+
+## Prerequisites
+
+Before starting, ensure you have:
+
+1. **KT-Kernel installed**:
+
+   Note: Support for KTransformers' latest EPLB feature on Kimi-K2.5 is coming soon.
+
+```bash
+git clone https://github.com/kvcache-ai/ktransformers.git
+cd ktransformers
+git checkout kimi_k2.5
+git submodule update --init --recursive
+cd kt-kernel && ./install.sh
+```
+
+2. **SGLang installed** - Follow [SGLang integration steps](./kt-kernel_intro.md#integration-with-sglang)
+
+   Note: Currently, please clone our custom SGLang repository:
+
+```bash
+git clone https://github.com/kvcache-ai/sglang.git
+cd sglang
+git checkout kimi_k2.5
+pip install -e "python[all]"
+# If launching SGLang fails with a cuDNN-related error, reinstall cuDNN:
+pip install nvidia-cudnn-cu12==9.16.0.29
+```
+
+3. **CUDA toolkit** - Compatible with your GPU (CUDA 12.8+ recommended)
+4. **Hugging Face CLI** - For downloading models:
+
+   ```bash
+   pip install huggingface-hub
+   ```
+
+## Step 1: Download Model Weights
+
+```bash
+# Create a directory for models
+mkdir -p /path/to/models
+cd /path/to/models
+
+# Download Kimi-K2.5 (RAW-INT4 for both CPU and GPU)
+huggingface-cli download moonshotai/Kimi-K2.5 \
+  --local-dir /path/to/kimi-k2.5
+```
+
+**Note:** Replace `/path/to/models` with your actual storage path throughout this tutorial.
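+
+If you prefer to script the download (e.g., for automation or resumable retries), the same weights can be fetched with the `huggingface_hub` Python API. This is an optional sketch equivalent to the CLI command above; adjust the local path to your environment.
+
+```python
+from huggingface_hub import snapshot_download
+
+# Fetch the Kimi-K2.5 checkpoint; partially downloaded files are resumed.
+snapshot_download(
+    repo_id="moonshotai/Kimi-K2.5",
+    local_dir="/path/to/kimi-k2.5",
+)
+```
+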
+## Step 2: Launch SGLang Server
+
+Start the SGLang server with KT-Kernel integration for CPU-GPU heterogeneous inference.
+
+### Launch Command (4x RTX 4090 Example)
+
+```bash
+python -m sglang.launch_server \
+  --host 0.0.0.0 \
+  --port 31245 \
+  --model /path/to/kimi-k2.5 \
+  --kt-weight-path /path/to/kimi-k2.5 \
+  --kt-cpuinfer 96 \
+  --kt-threadpool-count 2 \
+  --kt-num-gpu-experts 30 \
+  --kt-method RAWINT4 \
+  --kt-gpu-prefill-token-threshold 400 \
+  --trust-remote-code \
+  --mem-fraction-static 0.94 \
+  --served-model-name Kimi-K2.5 \
+  --enable-mixed-chunk \
+  --tensor-parallel-size 4 \
+  --enable-p2p-check \
+  --disable-shared-experts-fusion \
+  --chunked-prefill-size 32658 \
+  --max-total-tokens 50000 \
+  --attention-backend flashinfer
+```
+
+The server takes about 2-3 minutes to start.
+
+See [KT-Kernel Parameters](https://github.com/kvcache-ai/ktransformers/tree/main/kt-kernel#kt-kernel-parameters) for detailed parameter-tuning guidelines.
+
+## Step 3: Send Inference Requests
+
+Once the server is running, you can send inference requests using the OpenAI-compatible API.
+
+### Basic Chat Completion Request
+
+```bash
+curl -s http://localhost:31245/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "Kimi-K2.5",
+    "stream": false,
+    "messages": [
+      {"role": "user", "content": "hi, who are you?"}
+    ]
+  }'
+```
+
+### Example Response
+
+```json
+{
+  "id": "2a4e83f8a79b4b57b103b0f298fbaa7d",
+  "object": "chat.completion",
+  "created": 1769333912,
+  "model": "Kimi-K2.5",
+  "choices": [
+    {
+      "index": 0,
+      "message": {
+        "role": "assistant",
+        "content": " The user is asking \"hi, who are you?\" which is a simple greeting and identity question. I need to respond appropriately by introducing myself clearly and concisely.\n\nI am Kimi, a large language model trained by Moonshot AI. I should state my name, my nature (AI assistant), and my developer (Moonshot AI). I should keep it friendly and helpful.\n\nKey points to include:\n- Greet them back (\"hi\" or \"hello\")\n- State my name: Kimi\n- State what I am: an AI assistant/language model\n- Mention my developer: Moonshot AI\n- Briefly describe my purpose: to help answer questions, provide information, and assist with various tasks\n- Keep it concise but informative\n- Use a friendly, professional tone\n\nI should avoid overly technical jargon while being accurate. The response should be welcoming and set the stage for further interaction.\n\nPossible response:\n\"Hi! I'm Kimi, an AI assistant created by Moonshot AI. I'm designed to help answer questions, provide information, and assist with a wide range of tasks. How can I help you today?\"\n\nThis covers all the necessary points and invites the user to continue the conversation. Hi! I'm Kimi, an AI assistant created by Moonshot AI. I'm designed to help answer questions, provide information, and assist with a wide range of tasks. How can I help you today?",
+        "reasoning_content": null,
+        "tool_calls": null
+      },
+      "logprobs": null,
+      "finish_reason": "stop",
+      "matched_stop": 163586
+    }
+  ],
+  "usage": {
+    "prompt_tokens": 32,
+    "total_tokens": 317,
+    "completion_tokens": 285,
+    "prompt_tokens_details": null,
+    "reasoning_tokens": 0
+  },
+  "metadata": {
+    "weight_version": "default"
+  }
+}
+```
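+
+You can also call the server from Python. The sketch below is optional and assumes the `openai` Python package is installed and the server from Step 2 is listening on port 31245; the API key is a dummy value, since the server does not require one by default.
+
+```python
+from openai import OpenAI
+
+# The SGLang server exposes an OpenAI-compatible endpoint.
+client = OpenAI(base_url="http://localhost:31245/v1", api_key="none")
+
+response = client.chat.completions.create(
+    model="Kimi-K2.5",
+    messages=[{"role": "user", "content": "hi, who are you?"}],
+    stream=False,
+)
+print(response.choices[0].message.content)
+```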
diff --git a/doc/en/SFT_Installation_Guide_KimiK2.5.md b/doc/en/SFT_Installation_Guide_KimiK2.5.md
new file mode 100644
index 00000000..0b90d0ba
--- /dev/null
+++ b/doc/en/SFT_Installation_Guide_KimiK2.5.md
@@ -0,0 +1,222 @@
+# Kimi-K2.5 LoRA SFT Tutorial
+
+This tutorial demonstrates how to perform **LoRA Supervised Fine-Tuning (SFT)** on **Kimi-K2.5** using **LlamaFactory** with **KTransformers** as the backend, and then serve the fine-tuned model using **SGLang**.
+
+The workflow is:
+
+```txt
+KTransformers + LlamaFactory LoRA SFT → (Optional) LlamaFactory Verification → SGLang Serving
+```
+
+## Table of Contents
+
+- [Hardware Requirements](#hardware-requirements)
+- [Step 0: Environment Setup](#step-0-environment-setup)
+- [Step 1: Prepare Model Weights (BF16 for SFT)](#step-1-prepare-model-weights-bf16-for-sft)
+- [Step 2: Prepare YAML for LoRA SFT (KTransformers Backend)](#step-2-prepare-yaml-for-lora-sft-ktransformers-backend)
+- [Step 3: Run LoRA SFT](#step-3-run-lora-sft)
+- [Step 4: Post-SFT Quick Verification with LlamaFactory (Optional)](#step-4-post-sft-quick-verification-with-llamafactory-optional)
+- [Step 5: SGLang Serving with LoRA (Recommended Delivery Path)](#step-5-sglang-serving-with-lora-recommended-delivery-path)
+
+## Hardware Requirements
+
+### Training (LoRA SFT)
+
+- **Stack**: LlamaFactory + KTransformers
+- **GPU**: 4x NVIDIA RTX 4090 24GB (or equivalent, with at least 48GB total VRAM)
+- **CPU**: x86 CPU with AMX support
+- **RAM**: At least 2TB system memory
+- Swap can be used if CPU memory is insufficient
+
+### Inference (LoRA Adapter + Original Model)
+
+- **Stack**: SGLang + KTransformers
+- **GPU**: 2x NVIDIA RTX 4090 24GB (or equivalent, with at least 48GB total VRAM)
+- **CPU**: x86 CPU with AVX512F support (e.g., Intel Sapphire Rapids)
+- **RAM**: At least 600GB system memory
+- **Storage**: ~600GB for model weights (native INT4 weights; the same weight directory serves both CPU and GPU)
+
+## Step 0: Environment Setup
+
+We recommend using **two separate conda environments**:
+
+| Environment | Purpose                                              |
+| ----------- | ---------------------------------------------------- |
+| `kt-kernel` | Inference & serving (KTransformers + SGLang)         |
+| `kt-sft`    | Training (LlamaFactory + KTransformers SFT backend) |
+
+### 0.1 Inference Environment: `kt-kernel`
+
+```bash
+conda create -n kt-kernel python=3.11
+conda activate kt-kernel
+
+git clone https://github.com/kvcache-ai/ktransformers.git
+cd ktransformers
+git checkout kimi_k2.5
+git submodule update --init --recursive
+cd kt-kernel && ./install.sh
+```
+
+### 0.2 Install SGLang (Inference / Serving)
+
+**Recommended for Kimi-K2.5:**
+
+```bash
+git clone https://github.com/kvcache-ai/sglang.git
+cd sglang
+git checkout kimi_k2.5
+pip install -e "python[all]"
+```
+
+### 0.3 Training Environment: `kt-sft`
+
+```bash
+conda create -n kt-sft python=3.11
+conda activate kt-sft
+
+git clone https://github.com/hiyouga/LLaMA-Factory.git
+cd LLaMA-Factory
+pip install -e .
+```
+
+### 0.4 Install KTransformers SFT Dependencies
+
+```bash
+conda install -y -c conda-forge libstdcxx-ng gcc_impl_linux-64
+conda install -y -c nvidia/label/cuda-11.8.0 cuda-runtime
+
+# Install matching wheels (recommended), from https://github.com/kvcache-ai/ktransformers/releases
+pip install ktransformers-<version>.whl
+pip install flash_attn-<version>.whl
+```
+
+## Step 1: Prepare Model Weights (BF16 for SFT)
+
+### 1.1 Download INT4 Weights
+
+KTransformers **requires BF16 weights for SFT**, while Kimi-K2.5 is released in INT4, so download the INT4 weights first and convert them in the next step.
+
+```bash
+# Download Kimi-K2.5 (RAW-INT4 for both CPU and GPU)
+huggingface-cli download moonshotai/Kimi-K2.5 \
+  --local-dir /path/to/kimi-k2.5
+```
+
+### 1.2 Convert INT4 → BF16
+
+The Kimi-K2.5 base model is released in **INT4** format; convert it to **BF16** before SFT.
+
+## Step 2: Prepare YAML for LoRA SFT (KTransformers Backend)
+
+### 2.1 Training YAML (LoRA SFT)
+
+Example file:
+`examples/train_lora/kimik2_lora_sft_kt.yaml`
+
+Required fields:
+
+```yaml
+stage: sft
+finetuning_type: lora
+bf16: true
+
+use_kt: true
+kt_optimize_rule:   # path to the KT optimize-rule YAML for your model
+cpu_infer: 32
+chunk_size: 8192
+```
+
+Other fields (dataset, output_dir, learning rate, epochs) can be adjusted as usual; see the dataset sketch after the next subsection.
+
+### 2.2 Inference YAML (LlamaFactory Verification)
+
+Key requirements:
+
+- `adapter_name_or_path`: LoRA output directory
+- `infer_backend: ktransformers`
+- **Same `use_kt` and `kt_optimize_rule` as training**
+
+This YAML is used only for **quick verification**, not production serving.
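+
+If you are training on your own data, LlamaFactory expects datasets to be registered in `data/dataset_info.json` and stored in a supported format such as alpaca. The following is a minimal sketch assuming an alpaca-style JSON file; the file name and dataset key are illustrative, not part of this repository.
+
+```python
+import json
+
+# Minimal alpaca-style dataset: one instruction/input/output record per sample.
+samples = [
+    {
+        "instruction": "Introduce yourself briefly.",
+        "input": "",
+        "output": "I am a Kimi-K2.5 assistant fine-tuned for this demo.",
+    },
+]
+
+# Write the file under LLaMA-Factory's data/ directory, then register it in
+# data/dataset_info.json, e.g. "my_sft_data": {"file_name": "my_sft_data.json"},
+# and set `dataset: my_sft_data` in the training YAML.
+with open("data/my_sft_data.json", "w", encoding="utf-8") as f:
+    json.dump(samples, f, ensure_ascii=False, indent=2)
+```
+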
+## Step 3: Run LoRA SFT
+
+```bash
+conda activate kt-sft
+cd LLaMA-Factory
+
+USE_KT=1 llamafactory-cli train examples/train_lora/kimik2_lora_sft_kt.yaml
+```
+
+After training, the LoRA adapter is saved to `output_dir`.
+
+## Step 4: Post-SFT Quick Verification with LlamaFactory (Optional)
+
+Before production deployment, we recommend a **lightweight sanity check**:
+
+```bash
+conda activate kt-sft
+cd LLaMA-Factory
+
+llamafactory-cli chat examples/inference/kimik2_lora_sft_kt.yaml
+```
+
+Purpose:
+
+- Validate LoRA correctness
+- Ensure reproducibility
+- Not for throughput benchmarking
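+
+As an additional non-interactive check, you can confirm that the adapter directory contains the files a PEFT-style LoRA export normally produces. This is an optional sketch; the path is a placeholder for your actual `output_dir`.
+
+```python
+import json
+from pathlib import Path
+
+adapter_dir = Path("/path/to/llamafactory/output_dir")  # placeholder
+
+# A PEFT-style LoRA export typically contains these two files.
+for name in ("adapter_config.json", "adapter_model.safetensors"):
+    status = "found" if (adapter_dir / name).exists() else "MISSING"
+    print(f"{name}: {status}")
+
+# Show the key LoRA hyperparameters recorded at training time.
+cfg_path = adapter_dir / "adapter_config.json"
+if cfg_path.exists():
+    cfg = json.loads(cfg_path.read_text())
+    print({k: cfg.get(k) for k in ("r", "lora_alpha", "target_modules")})
+```
+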
+## Step 5: SGLang Serving with LoRA (Recommended Delivery Path)
+
+This is the recommended way to deliver the fine-tuned model at runtime.
+
+### 5.1 Convert LoRA for SGLang
+
+```bash
+python ktransformers/kt-kernel/scripts/convert_lora.py \
+  --base_path /path/to/kimi-base-model \
+  --lora_path /path/to/llamafactory/output_dir \
+  --output_path /path/to/lora_converted
+```
+
+### 5.2 (Optional) Convert CPU Weights to INT8
+
+To reduce CPU memory usage:
+
+```bash
+python ktransformers/kt-kernel/scripts/convert_cpu_weights.py \
+  --base_path /path/to/kimi-base-model \
+  --output_dir /path/to/kimi-base-model-int8
+```
+
+This produces:
+
+```text
+/path/to/kimi-base-model-int8/int8
+```
+
+### 5.3 Launch SGLang Server with LoRA
+
+```bash
+conda activate kt-kernel
+
+python -m sglang.launch_server \
+  --enable-lora \
+  --lora-paths lora1=/path/to/lora_converted \
+  --lora-backend triton \
+  --model-path /path/to/kimi-base-model \
+  --tp 1 \
+  --trust-remote-code \
+  --context-length 4096 \
+  --kt-weight-path /path/to/kimi-base-model-int8/int8 \
+  --mem-fraction-static 0.9
+```
+
+Notes:
+
+- `--kt-weight-path` points to the CPU INT8 weights
+- Adjust `tp`, `context-length`, and memory parameters per machine
+- RAWINT4 inference paths can follow **Kimi-K2.5-Native** directly
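+
+Once the server is up, requests can target the adapter registered as `lora1` above. The sketch below uses SGLang's native `/generate` endpoint with a `lora_path` field to select the adapter; the field name and the default port 30000 follow common SGLang conventions and may differ across versions, so verify them against your SGLang build.
+
+```python
+import requests
+
+# Query the fine-tuned adapter (registered as "lora1" via --lora-paths).
+resp = requests.post(
+    "http://localhost:30000/generate",  # assumed default SGLang port
+    json={
+        "text": "Introduce yourself briefly.",
+        "lora_path": "lora1",  # assumed field name for adapter selection
+        "sampling_params": {"max_new_tokens": 128, "temperature": 0.7},
+    },
+)
+print(resp.json()["text"])
+```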