docs(kt-kernel): add Docker usage guide for SGLang deployment
This commit is contained in:
parent f992de55da
commit 1e03e8b612
2 changed files with 146 additions and 2 deletions

@@ -10,6 +10,9 @@ High-performance kernel operations for KTransformers, featuring CPU-optimized Mo

- [Quick Installation (Recommended)](#quick-installation-recommended)
- [Manual Configuration (Advanced)](#manual-configuration-advanced)
- [Verification](#verification)
- [Docker Usage](#docker-usage)
  - [Pull the Image](#pull-the-image)
  - [Create and Start Container](#create-and-start-container)
- [Integration with SGLang](#integration-with-sglang)
  - [Installation Steps](#installation-steps)
    - [1. Install SGLang](#1-install-sglang)
@@ -120,6 +123,75 @@ For advanced build options and binary distribution, see the [Build Configuration

```bash
python -c "from kt_kernel import KTMoEWrapper; print('✓ kt-kernel installed successfully')"
```
## Docker Usage

We provide a pre-built Docker image for quick deployment with SGLang integration.

**Prerequisites:**
- NVIDIA driver with CUDA >= 12.9 on the host machine
- Docker with the NVIDIA Container Toolkit installed
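
You can sanity-check both prerequisites before pulling the image. A minimal check, assuming any CUDA-enabled base image (the `nvidia/cuda:12.9.0-base-ubuntu22.04` tag here is illustrative):

```bash
# Host side: prints the driver version and the highest CUDA version it supports
nvidia-smi

# Container toolkit: the same GPUs should be visible from inside a container
docker run --rm --gpus all nvidia/cuda:12.9.0-base-ubuntu22.04 nvidia-smi
```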
### Pull the Image

```bash
docker pull approachingai/sglang-kt:latest
```
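
To confirm the image is present locally:

```bash
docker images approachingai/sglang-kt
```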
### Create and Start Container

**Step 1: Create the container**

```bash
docker run -itd --gpus all \
  --name sglang-kt \
  -p 8000:8000 \
  -v /path/to/models:/models \
  -v /path/to/cpu-weights:/cpu-weights \
  --shm-size=32g \
  approachingai/sglang-kt:latest \
  /bin/bash
```

**Parameter Explanation:**
- `--gpus all`: Enable GPU access for the container
- `-p 8000:8000`: Map container port 8000 to host port 8000
- `-v /path/to/models:/models`: Mount the GPU model weights directory
- `-v /path/to/cpu-weights:/cpu-weights`: Mount the CPU weights directory
- `--shm-size=32g`: Set the container's shared memory size
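
A quick way to confirm the container is up before entering it:

```bash
docker ps --filter name=sglang-kt
```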
**Step 2: Enter the container**

```bash
docker exec -it sglang-kt /bin/bash
```

> **Note:** The image comes with the AMX backend pre-installed. If your CPU does not support AMX, or you want to use a different backend, reinstall inside the container:
> ```bash
> cd /workspace/ktransformers/kt-kernel && ./install.sh
> ```
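
If you are unsure whether your CPU supports AMX, check its feature flags; on AMX-capable Intel CPUs this prints flags such as `amx_tile`, `amx_int8`, and `amx_bf16`, while no output means AMX is unavailable:

```bash
# List the unique AMX-related CPU flags advertised by the kernel
grep -o 'amx[a-z0-9_]*' /proc/cpuinfo | sort -u
```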
**Step 3: Launch the SGLang server inside the container**

```bash
python -m sglang.launch_server \
  --host 0.0.0.0 \
  --port 8000 \
  --model /models/<your-model> \
  --trust-remote-code \
  --mem-fraction-static 0.92 \
  --chunked-prefill-size 4096 \
  --served-model-name <your-model-name> \
  --enable-mixed-chunk \
  --kt-method <AMXINT4|AMXINT8|LLAMAFILE> \
  --kt-weight-path /cpu-weights/<your-cpu-weights> \
  --kt-cpuinfer <physical-cores> \
  --kt-threadpool-count <numa-nodes> \
  --kt-num-gpu-experts <num-experts-on-gpu> \
  --kt-max-deferred-experts-per-token <0-4>
```

For detailed parameter explanations, see [KT-Kernel Parameters](#kt-kernel-parameters).
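
Once the server reports it is ready, you can smoke-test it from the host through the mapped port. SGLang serves an OpenAI-compatible API, so a minimal request (the model name must match `--served-model-name`) looks like:

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "<your-model-name>",
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 32
      }'
```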

## Integration with SGLang

KT-Kernel can be used standalone via the [Direct Python API](#direct-python-api-usage) or integrated with SGLang for production deployment. This section describes the SGLang integration, which enables CPU-GPU heterogeneous inference: "hot" experts run on the GPU and "cold" experts run on the CPU for optimal resource utilization.

@@ -10,6 +10,9 @@

- [Quick Installation (Recommended)](#快速安装推荐)
- [Manual Configuration (Advanced)](#手动配置进阶)
- [Verification](#验证安装)
- [Docker Usage](#docker-使用)
  - [Pull the Image](#拉取镜像)
  - [Create and Start Container](#创建并启动容器)
- [Integration with SGLang](#与-sglang-集成)
  - [Installation Steps](#安装步骤)
    - [1. Install SGLang](#1-安装-sglang)

@@ -119,10 +122,79 @@ export CPUINFER_ENABLE_AMX=OFF  # Options: ON, OFF

```bash
python -c "from kt_kernel import KTMoEWrapper; print('✓ kt-kernel installed successfully')"
```

## Docker Usage

We provide a pre-built Docker image for quickly deploying an inference service integrated with SGLang.

**Prerequisites:**
- NVIDIA driver on the host with CUDA version >= 12.9
- Docker and the NVIDIA Container Toolkit installed

### Pull the Image

```bash
docker pull approachingai/sglang-kt:latest
```

### Create and Start Container

**Step 1: Create the container**

```bash
docker run -itd --gpus all \
  --name sglang-kt \
  -p 8000:8000 \
  -v /path/to/models:/models \
  -v /path/to/cpu-weights:/cpu-weights \
  --shm-size=32g \
  approachingai/sglang-kt:latest \
  /bin/bash
```

**Parameter Explanation:**
- `--gpus all`: Enable GPU access for the container
- `-p 8000:8000`: Map container port 8000 to host port 8000
- `-v /path/to/models:/models`: Mount the GPU model weights directory
- `-v /path/to/cpu-weights:/cpu-weights`: Mount the CPU weights directory
- `--shm-size=32g`: Set the shared memory size

**Step 2: Enter the container**

```bash
docker exec -it sglang-kt /bin/bash
```

> **Note:** The image ships with the AMX backend pre-installed. If your CPU does not support AMX, or you want to use a different backend, reinstall inside the container:
> ```bash
> cd /workspace/ktransformers/kt-kernel && ./install.sh
> ```

**Step 3: Launch the SGLang server inside the container**

```bash
python -m sglang.launch_server \
  --host 0.0.0.0 \
  --port 8000 \
  --model /models/<your-model> \
  --trust-remote-code \
  --mem-fraction-static 0.92 \
  --chunked-prefill-size 4096 \
  --served-model-name <your-model-name> \
  --enable-mixed-chunk \
  --kt-method <AMXINT4|AMXINT8|LLAMAFILE> \
  --kt-weight-path /cpu-weights/<your-cpu-weights> \
  --kt-cpuinfer <physical-cores> \
  --kt-threadpool-count <numa-nodes> \
  --kt-num-gpu-experts <num-experts-on-gpu> \
  --kt-max-deferred-experts-per-token <0-4>
```

For detailed parameter explanations, see [KT-Kernel Parameters](#kt-kernel-参数).
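
As a concrete illustration, a hypothetical invocation for a dual-socket AMX host is sketched below; the model name, weight paths, core count, NUMA node count, and expert split are all placeholder assumptions that must be adapted to your hardware:

```bash
# Hypothetical values: DeepSeek-V3-style MoE checkpoint, 64 physical
# cores, 2 NUMA nodes, 32 experts kept on GPU; adjust to your setup.
python -m sglang.launch_server \
  --host 0.0.0.0 \
  --port 8000 \
  --model /models/DeepSeek-V3 \
  --trust-remote-code \
  --mem-fraction-static 0.92 \
  --chunked-prefill-size 4096 \
  --served-model-name deepseek-v3 \
  --enable-mixed-chunk \
  --kt-method AMXINT4 \
  --kt-weight-path /cpu-weights/DeepSeek-V3-AMXINT4 \
  --kt-cpuinfer 64 \
  --kt-threadpool-count 2 \
  --kt-num-gpu-experts 32 \
  --kt-max-deferred-experts-per-token 2
```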

## Integration with SGLang

KT-Kernel can be used standalone via the [Python API](#直接使用-python-api) or integrated into SGLang for production deployment.
This section describes how to integrate with SGLang to enable CPU-GPU hybrid (heterogeneous) inference: "hot" experts are placed on the GPU and "cold" experts on the CPU, balancing resource utilization and cost-effectiveness.

### Installation Steps