kvcache-ai-ktransformers/doc/zh/AVX2-Tutorial_zh.md

# Using KTransformers on AVX2-Only CPUs
This tutorial shows how to run KTransformers on machines that only support AVX2 (no AVX512 or AMX required).
## Table of Contents
- [Supported Precision Formats](#supported-precision-formats)
- [Hardware Requirements](#hardware-requirements)
- [Installation](#installation)
- [Verification](#verification)
- [Launching the Inference Server](#launching-the-inference-server)
- [Example: Qwen3-30B-A3B (BF16)](#example-qwen3-30b-a3b-bf16)
- [Example: Qwen3.5-35B-A3B-FP8 (FP8)](#example-qwen35-35b-a3b-fp8-fp8)
- [Example: Qwen3-30B-A3B-GPTQ-Int4 (GPTQ_INT4)](#example-qwen3-30b-a3b-gptq-int4-gptq_int4)
- [Sending Requests](#sending-requests)
- [Performance Tuning](#performance-tuning)
- [FAQ](#faq)
## Supported Precision Formats
| `--kt-method` | Precision | Notes |
|---------------|-----------|-------|
| `BF16` | Native BF16 | No precision loss; uses the BF16 weights directly |
| `FP8` | Block-quantized FP8 | Roughly half the weight footprint of BF16 |
| `GPTQ_INT4` | GPTQ INT4 | Roughly a quarter of the BF16 footprint; requires a GPTQ-quantized checkpoint |
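For a sense of what each format costs in memory, the weight footprint is roughly parameter count × bytes per parameter. The sketch below prints ballpark figures for a 30B-parameter model; it ignores quantization scales, activations, and the KV cache, so treat the numbers as lower bounds:
```bash
# Ballpark weight footprint for a 30B-parameter model (scales/KV cache not counted)
awk 'BEGIN {
  params = 30e9
  printf "BF16:      ~%.0f GiB\n", params * 2.0 / 2^30   # 2 bytes per weight
  printf "FP8:       ~%.0f GiB\n", params * 1.0 / 2^30   # 1 byte per weight
  printf "GPTQ_INT4: ~%.0f GiB\n", params * 0.5 / 2^30   # 4 bits per weight
}'
```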
## Hardware Requirements
- **CPU**: x86-64 with AVX2 and FMA (Intel Haswell, 2013, or newer / AMD Zen or newer); see the flag check after this list
- **GPU**: NVIDIA with 24GB+ VRAM (e.g., RTX 3090/4090/5090)
- **RAM**: at least the size of the model weights (e.g., 64GB+ for Qwen3-30B-A3B in BF16)
- **OS**: Linux
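To pre-check a machine before installing, you can read the CPU flags directly on Linux; both `avx2` and `fma` must show up:
```bash
# Print the required flags; anything missing from the output means the CPU is unsupported
grep -m1 '^flags' /proc/cpuinfo | grep -o -w -E 'avx2|fma' | sort -u
```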
## Installation
Build and install from source (one command installs both kt-kernel and SGLang):
```bash
git clone https://github.com/kvcache-ai/ktransformers.git
cd ktransformers
git submodule update --init --recursive
# One-command install
./install.sh
```
On machines with AVX512/AMX, you can also force an AVX2 build manually:
```bash
export CPUINFER_CPU_INSTRUCT=AVX2
export CPUINFER_ENABLE_AMX=OFF
./install.sh kt-kernel --manual
```
## Verification
```bash
# Check that the CPU supports AVX2
lscpu | grep -i avx2
# Check which kt-kernel variant was loaded
python -c "import kt_kernel; print(kt_kernel.__cpu_variant__)"
# Expected output: avx2
# Run system diagnostics
kt doctor
```
## Launching the Inference Server
Pass `--kt-method BF16`, `FP8`, or `GPTQ_INT4`. KT-Kernel **auto-detects** the CPU and falls back to the AVX2 backend when AVX512/AMX is unavailable.
### Example: Qwen3-30B-A3B (BF16)
```bash
# Download the model
huggingface-cli download Qwen/Qwen3-30B-A3B --local-dir /path/to/Qwen3-30B-A3B
# Check the physical core count and number of NUMA nodes
lscpu | grep -E "^CPU\(s\)|Thread\(s\) per core|NUMA node\(s\)"
# Launch the server (adjust kt-cpuinfer and kt-threadpool-count to your hardware)
python -m sglang.launch_server \
  --host 0.0.0.0 --port 30000 \
  --model /path/to/Qwen3-30B-A3B \
  --kt-weight-path /path/to/Qwen3-30B-A3B \
  --kt-cpuinfer 16 \
  --kt-threadpool-count 1 \
  --kt-num-gpu-experts 32 \
  --kt-method BF16 \
  --attention-backend flashinfer \
  --trust-remote-code \
  --mem-fraction-static 0.80 \
  --chunked-prefill-size 8192 \
  --max-running-requests 2 \
  --served-model-name Qwen3 \
  --enable-mixed-chunk \
  --tensor-parallel-size 1 \
  --enable-p2p-check \
  --disable-shared-experts-fusion
```
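Once the server finishes loading, a quick liveness probe is useful before sending real traffic. A minimal check, assuming SGLang's standard `/health` and OpenAI-compatible `/v1/models` routes are enabled:
```bash
# Returns HTTP 200 once the server is ready
curl -sf http://localhost:30000/health && echo "server is alive"
# Should list the served model name (Qwen3 in the example above)
curl -s http://localhost:30000/v1/models
```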
### Example: Qwen3.5-35B-A3B-FP8 (FP8)
```bash
# Download the model
huggingface-cli download Qwen/Qwen3.5-35B-A3B-FP8 --local-dir /path/to/Qwen3.5-35B-A3B-FP8
# Launch the server
python -m sglang.launch_server \
  --host 0.0.0.0 --port 30000 \
  --model /path/to/Qwen3.5-35B-A3B-FP8 \
  --kt-weight-path /path/to/Qwen3.5-35B-A3B-FP8 \
  --kt-cpuinfer 16 \
  --kt-threadpool-count 1 \
  --kt-num-gpu-experts 2 \
  --kt-method FP8 \
  --kt-gpu-prefill-token-threshold 400 \
  --attention-backend triton \
  --trust-remote-code \
  --mem-fraction-static 0.85 \
  --chunked-prefill-size 4096 \
  --max-running-requests 1 \
  --max-total-tokens 32000 \
  --enable-mixed-chunk \
  --tensor-parallel-size 1 \
  --disable-shared-experts-fusion
```
### Example: Qwen3-30B-A3B-GPTQ-Int4 (GPTQ_INT4)
```bash
# Download the model
huggingface-cli download Qwen/Qwen3-30B-A3B-GPTQ-Int4 --local-dir /path/to/Qwen3-30B-A3B-GPTQ-Int4
# Launch the server
python -m sglang.launch_server \
  --host 0.0.0.0 --port 30000 \
  --model /path/to/Qwen3-30B-A3B-GPTQ-Int4 \
  --kt-weight-path /path/to/Qwen3-30B-A3B-GPTQ-Int4 \
  --kt-cpuinfer 16 \
  --kt-threadpool-count 1 \
  --kt-num-gpu-experts 2 \
  --kt-method GPTQ_INT4 \
  --attention-backend triton \
  --trust-remote-code \
  --mem-fraction-static 0.85 \
  --chunked-prefill-size 4096 \
  --max-running-requests 1 \
  --max-total-tokens 32000 \
  --enable-mixed-chunk \
  --tensor-parallel-size 1 \
  --disable-shared-experts-fusion
```
### Sending Requests
```bash
# Interactive chat
kt chat
# OpenAI-compatible API
curl http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"Qwen3","messages":[{"role":"user","content":"Hello"}],"stream":true}'
```
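For scripting, a non-streaming variant of the same request is easier to consume; the sketch below pulls out just the reply text (assumes `jq` is installed):
```bash
# Non-streaming request; extract only the assistant's reply
curl -s http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"Qwen3","messages":[{"role":"user","content":"Hello"}]}' \
  | jq -r '.choices[0].message.content'
```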
## Performance Tuning
- Set `--kt-cpuinfer` to the number of **physical cores**
- Set `--kt-threadpool-count` to the number of **NUMA nodes** (the snippet below reads both values off the machine)
- Larger `--kt-num-gpu-experts` shifts more load off the CPU, at the cost of higher GPU memory use
- Memory bandwidth is usually the bottleneck; high-frequency DDR5 helps noticeably
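A quick way to read the two values used above (hyper-threads share a physical core and should not be counted):
```bash
# Physical cores = number of unique (core, socket) pairs
lscpu -p=Core,Socket | grep -v '^#' | sort -u | wc -l
# NUMA node count
lscpu | awk -F: '/NUMA node\(s\)/ {gsub(/ /, "", $2); print $2}'
```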
## FAQ
**GPU out of memory (OOM)**
- Reduce `--kt-num-gpu-experts`, `--chunked-prefill-size`, or `--max-total-tokens`
- Lower `--mem-fraction-static`
For more issues, see the [FAQ](../en/FAQ.md).