docs(kt-kernel): add Docker usage guide for SGLang deployment
This commit is contained in:
parent f992de55da
commit 1e03e8b612
2 changed files with 146 additions and 2 deletions

@@ -10,6 +10,9 @@ High-performance kernel operations for KTransformers, featuring CPU-optimized Mo

- [Quick Installation (Recommended)](#quick-installation-recommended)
- [Manual Configuration (Advanced)](#manual-configuration-advanced)
- [Verification](#verification)
- [Docker Usage](#docker-usage)
  - [Pull the Image](#pull-the-image)
  - [Create and Start Container](#create-and-start-container)
- [Integration with SGLang](#integration-with-sglang)
  - [Installation Steps](#installation-steps)
    - [1. Install SGLang](#1-install-sglang)
@@ -120,6 +123,75 @@ For advanced build options and binary distribution, see the [Build Configuration

```bash
python -c "from kt_kernel import KTMoEWrapper; print('✓ kt-kernel installed successfully')"
```
## Docker Usage

We provide a pre-built Docker image for quick deployment with SGLang integration.

**Prerequisites:**
- NVIDIA driver with CUDA >= 12.9 on the host machine
- Docker with the NVIDIA Container Toolkit installed
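
You can sanity-check both prerequisites before pulling the image. A minimal check, assuming any CUDA-enabled base image (the `nvidia/cuda:12.9.0-base-ubuntu22.04` tag here is illustrative):

```bash
# Host side: prints the driver version and the highest CUDA version it supports
nvidia-smi

# Container toolkit: the same GPUs should be visible from inside a container
docker run --rm --gpus all nvidia/cuda:12.9.0-base-ubuntu22.04 nvidia-smi
```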
### Pull the Image

```bash
docker pull approachingai/sglang-kt:latest
```
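
To confirm the image is present locally:

```bash
docker images approachingai/sglang-kt
```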
### Create and Start Container

**Step 1: Create the container**

```bash
docker run -itd --gpus all \
  --name sglang-kt \
  -p 8000:8000 \
  -v /path/to/models:/models \
  -v /path/to/cpu-weights:/cpu-weights \
  --shm-size=32g \
  approachingai/sglang-kt:latest \
  /bin/bash
```

**Parameter Explanation:**
- `--gpus all`: Enable GPU access for the container
- `-p 8000:8000`: Map container port 8000 to host port 8000
- `-v /path/to/models:/models`: Mount the GPU model weights directory
- `-v /path/to/cpu-weights:/cpu-weights`: Mount the CPU weights directory
- `--shm-size=32g`: Set the container's shared memory size
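
A quick way to confirm the container is up before entering it:

```bash
docker ps --filter name=sglang-kt
```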
**Step 2: Enter the container**

```bash
docker exec -it sglang-kt /bin/bash
```

> **Note:** The image comes with the AMX backend pre-installed. If your CPU does not support AMX, or you want to use a different backend, reinstall inside the container:
> ```bash
> cd /workspace/ktransformers/kt-kernel && ./install.sh
> ```
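
If you are unsure whether your CPU supports AMX, check its feature flags; on AMX-capable Intel CPUs this prints flags such as `amx_tile`, `amx_int8`, and `amx_bf16`, while no output means AMX is unavailable:

```bash
# List the unique AMX-related CPU flags advertised by the kernel
grep -o 'amx[a-z0-9_]*' /proc/cpuinfo | sort -u
```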
**Step 3: Launch the SGLang server inside the container**

```bash
python -m sglang.launch_server \
  --host 0.0.0.0 \
  --port 8000 \
  --model /models/<your-model> \
  --trust-remote-code \
  --mem-fraction-static 0.92 \
  --chunked-prefill-size 4096 \
  --served-model-name <your-model-name> \
  --enable-mixed-chunk \
  --kt-method <AMXINT4|AMXINT8|LLAMAFILE> \
  --kt-weight-path /cpu-weights/<your-cpu-weights> \
  --kt-cpuinfer <physical-cores> \
  --kt-threadpool-count <numa-nodes> \
  --kt-num-gpu-experts <num-experts-on-gpu> \
  --kt-max-deferred-experts-per-token <0-4>
```

For detailed parameter explanations, see [KT-Kernel Parameters](#kt-kernel-parameters).
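
Once the server reports it is ready, you can smoke-test it from the host through the mapped port. SGLang serves an OpenAI-compatible API, so a minimal request (the model name must match `--served-model-name`) looks like:

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "<your-model-name>",
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 32
      }'
```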

## Integration with SGLang

KT-Kernel can be used standalone via the [Direct Python API](#direct-python-api-usage) or integrated with SGLang for production deployment. This section describes the SGLang integration, which enables CPU-GPU heterogeneous inference: "hot" experts run on the GPU and "cold" experts run on the CPU for optimal resource utilization.

@@ -10,6 +10,9 @@

- [Quick Installation (Recommended)](#快速安装推荐)
- [Manual Configuration (Advanced)](#手动配置进阶)
- [Verification](#验证安装)
- [Docker Usage](#docker-使用)
  - [Pull the Image](#拉取镜像)
  - [Create and Start Container](#创建并启动容器)
- [Integration with SGLang](#与-sglang-集成)
  - [Installation Steps](#安装步骤)
    - [1. Install SGLang](#1-安装-sglang)

@@ -119,10 +122,79 @@ export CPUINFER_ENABLE_AMX=OFF  # Options: ON, OFF

```bash
python -c "from kt_kernel import KTMoEWrapper; print('✓ kt-kernel installed successfully')"
```

## Docker Usage

We provide a pre-built Docker image for quickly deploying an inference service integrated with SGLang.

**Prerequisites:**
- NVIDIA driver on the host with CUDA version >= 12.9
- Docker and the NVIDIA Container Toolkit installed

### Pull the Image

```bash
docker pull approachingai/sglang-kt:latest
```

### Create and Start Container

**Step 1: Create the container**

```bash
docker run -itd --gpus all \
  --name sglang-kt \
  -p 8000:8000 \
  -v /path/to/models:/models \
  -v /path/to/cpu-weights:/cpu-weights \
  --shm-size=32g \
  approachingai/sglang-kt:latest \
  /bin/bash
```

**Parameter Explanation:**
- `--gpus all`: Enable GPU access for the container
- `-p 8000:8000`: Map container port 8000 to host port 8000
- `-v /path/to/models:/models`: Mount the GPU model weights directory
- `-v /path/to/cpu-weights:/cpu-weights`: Mount the CPU weights directory
- `--shm-size=32g`: Set the shared memory size

**Step 2: Enter the container**

```bash
docker exec -it sglang-kt /bin/bash
```

> **Note:** The image ships with the AMX backend pre-installed. If your CPU does not support AMX, or you want to use a different backend, reinstall inside the container:
> ```bash
> cd /workspace/ktransformers/kt-kernel && ./install.sh
> ```

**Step 3: Launch the SGLang server inside the container**

```bash
python -m sglang.launch_server \
  --host 0.0.0.0 \
  --port 8000 \
  --model /models/<your-model> \
  --trust-remote-code \
  --mem-fraction-static 0.92 \
  --chunked-prefill-size 4096 \
  --served-model-name <your-model-name> \
  --enable-mixed-chunk \
  --kt-method <AMXINT4|AMXINT8|LLAMAFILE> \
  --kt-weight-path /cpu-weights/<your-cpu-weights> \
  --kt-cpuinfer <physical-cores> \
  --kt-threadpool-count <numa-nodes> \
  --kt-num-gpu-experts <num-experts-on-gpu> \
  --kt-max-deferred-experts-per-token <0-4>
```

For detailed parameter explanations, see [KT-Kernel Parameters](#kt-kernel-参数).
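
As a concrete illustration, a hypothetical invocation for a dual-socket AMX host is sketched below; the model name, weight paths, core count, NUMA node count, and expert split are all placeholder assumptions that must be adapted to your hardware:

```bash
# Hypothetical values: DeepSeek-V3-style MoE checkpoint, 64 physical
# cores, 2 NUMA nodes, 32 experts kept on GPU; adjust to your setup.
python -m sglang.launch_server \
  --host 0.0.0.0 \
  --port 8000 \
  --model /models/DeepSeek-V3 \
  --trust-remote-code \
  --mem-fraction-static 0.92 \
  --chunked-prefill-size 4096 \
  --served-model-name deepseek-v3 \
  --enable-mixed-chunk \
  --kt-method AMXINT4 \
  --kt-weight-path /cpu-weights/DeepSeek-V3-AMXINT4 \
  --kt-cpuinfer 64 \
  --kt-threadpool-count 2 \
  --kt-num-gpu-experts 32 \
  --kt-max-deferred-experts-per-token 2
```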

## Integration with SGLang

KT-Kernel can be used standalone via the [Python API](#直接使用-python-api) or integrated into SGLang for production deployment.
This section describes how to integrate with SGLang to enable CPU-GPU hybrid (heterogeneous) inference: "hot" experts are placed on the GPU and "cold" experts on the CPU, balancing resource utilization and cost-effectiveness.

### Installation Steps