mirror of https://github.com/kvcache-ai/ktransformers.git synced 2026-04-28 03:39:48 +00:00

mrhaoxx 7a9daf0cd4

[feat](kt-kernel): support avx2 only inference for bf16 fp8 and gptq int4 (#1892)

* feat: support avx2 bf16 fp8 inference

* feat: support avx2 gptq int4 inference

* fix: numeric issues in fp8 dequant

* Tutorial avx2 (#1900)

* fix: prevent injecting -DLLAMA_AVX512=ON on AVX2-only machines

* docs: add AVX2 tutorial for running KTransformers on AVX2-only CPUs

* Tutorial avx2 (#1901)

* fix: prevent injecting -DLLAMA_AVX512=ON on AVX2-only machines

* docs: add AVX2 tutorial for running KTransformers on AVX2-only CPUs

* docs: update README.md

---------

Co-authored-by: Benjamin F <159887351+yyj6666667@users.noreply.github.com>

2026-03-27 14:45:02 +08:00

Using KTransformers on AVX2 CPUs

This tutorial explains how to run KTransformers on machines that support only AVX2 (neither AVX512 nor AMX is required).

Supported precision formats

`--kt-method`	Precision	Notes
`BF16`	Native BF16 precision	No precision loss; uses the BF16 weights directly
`FP8`	FP8 block-wise quantization
`GPTQ_INT4`	INT4 GPTQ

Hardware requirements

CPU: x86-64 with AVX2 and FMA (Intel Haswell 2013+ / AMD Zen+)
GPU: NVIDIA with 24 GB+ of VRAM (RTX 3090/4090/5090, etc.)
Memory: at least the size of the model weights (e.g. Qwen3-30B-A3B in BF16 needs 64 GB+)
OS: Linux
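
The memory requirement can be sanity-checked with simple arithmetic. The sketch below is only a rough lower bound: the bytes-per-parameter values and the parameter count of about 30e9 (read off the "30B" in the model name) are assumptions, and quantization scales, activations, and the KV cache are ignored.

```python
# Rough lower bound on the RAM needed for the weights alone.
# Assumption: these bytes-per-parameter values ignore quantization
# scales, activations, and KV cache, so real usage will be higher.
BYTES_PER_PARAM = {"BF16": 2.0, "FP8": 1.0, "GPTQ_INT4": 0.5}


def weight_size_gb(num_params: float, kt_method: str) -> float:
    """Return the approximate weight size in GB (10**9 bytes)."""
    return num_params * BYTES_PER_PARAM[kt_method] / 1e9


if __name__ == "__main__":
    # Assumption: ~30e9 parameters, taken from the "30B" in the model name.
    for method in BYTES_PER_PARAM:
        print(f"{method}: ~{weight_size_gb(30e9, method):.0f} GB")
```

Under these assumptions the BF16 estimate comes to about 60 GB, which is in line with the 64 GB+ figure given for Qwen3-30B-A3B.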

Installation

Build and install from source (one-step install of kt-kernel + SGLang):

git clone https://github.com/kvcache-ai/ktransformers.git
cd ktransformers
git submodule update --init --recursive

# One-step install
./install.sh

On machines with AVX512 or AMX, you can also force an AVX2 build manually:

export CPUINFER_CPU_INSTRUCT=AVX2
export CPUINFER_ENABLE_AMX=OFF
./install.sh kt-kernel --manual

Verification

# Check whether the CPU supports AVX2
lscpu | grep -i avx2

# Check which variant kt-kernel loaded
python -c "import kt_kernel; print(kt_kernel.__cpu_variant__)"
# Expected output: avx2

# System diagnostics
kt doctor
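
The hardware requirements list FMA as well as AVX2, while the `lscpu` check above only greps for AVX2. As a minimal sketch, assuming a Linux x86 system where `/proc/cpuinfo` has a `flags` line, both features can be checked from Python. The helper names are hypothetical and not part of kt-kernel.

```python
from pathlib import Path

# The CPU features named under the hardware requirements.
REQUIRED_FLAGS = {"avx2", "fma"}


def parse_cpu_flags(cpuinfo_text: str) -> set[str]:
    """Collect the feature flags from /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return set(value.split())
    return set()


def missing_flags(cpuinfo_text: str) -> set[str]:
    """Return the required flags that the CPU does not report."""
    return REQUIRED_FLAGS - parse_cpu_flags(cpuinfo_text)


if __name__ == "__main__":
    cpuinfo = Path("/proc/cpuinfo")
    if cpuinfo.exists():
        missing = missing_flags(cpuinfo.read_text())
        print("OK" if not missing else f"missing: {sorted(missing)}")
```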

Launching the inference server

Use --kt-method BF16, FP8, or GPTQ_INT4. KT-Kernel detects the CPU automatically and falls back to the AVX2 backend when AVX512/AMX is not available.
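
The fallback behavior can be pictured as a priority check over CPU feature flags. The sketch below is an illustration only, not KT-Kernel's actual detection code: the function name, the returned labels other than `avx2`, and the relative order of AMX and AVX512 are assumptions. The flag names are the ones Linux reports in `/proc/cpuinfo`.

```python
def pick_backend(cpu_flags: set[str]) -> str:
    """Illustrative priority order only: AMX, then AVX512, then AVX2."""
    if "amx_tile" in cpu_flags:
        return "amx"  # hypothetical label
    if "avx512f" in cpu_flags:
        return "avx512"  # hypothetical label
    if "avx2" in cpu_flags:
        return "avx2"  # the variant name shown in the verification step
    raise RuntimeError("no supported SIMD backend found")
```

On an AVX2-only machine, neither of the first two checks matches, so the AVX2 branch is taken.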

Example: Qwen3-30B-A3B (BF16)

# Download the model
huggingface-cli download Qwen/Qwen3-30B-A3B --local-dir /path/to/Qwen3-30B-A3B

# Check the physical core count (CPU(s) / Thread(s) per core) and the NUMA node count
lscpu | grep -E "^CPU\(s\)|Thread\(s\) per core|NUMA node\(s\)"

# Start the server (adjust kt-cpuinfer and kt-threadpool-count to your hardware)
python -m sglang.launch_server \
  --host 0.0.0.0 --port 30000 \
  --model /path/to/Qwen3-30B-A3B \
  --kt-weight-path /path/to/Qwen3-30B-A3B \
  --kt-cpuinfer 16 \
  --kt-threadpool-count 1 \
  --kt-num-gpu-experts 32 \
  --kt-method BF16 \
  --attention-backend flashinfer \
  --trust-remote-code \
  --mem-fraction-static 0.80 \
  --chunked-prefill-size 8192 \
  --max-running-requests 2 \
  --served-model-name Qwen3 \
  --enable-mixed-chunk \
  --tensor-parallel-size 1 \
  --enable-p2p-check \
  --disable-shared-experts-fusion

Example: Qwen3.5-35B-A3B-FP8 (FP8)

# Download the model
huggingface-cli download Qwen/Qwen3.5-35B-A3B-FP8 --local-dir /path/to/Qwen3.5-35B-A3B-FP8

# Start the server
python -m sglang.launch_server \
  --host 0.0.0.0 --port 30000 \
  --model /path/to/Qwen3.5-35B-A3B-FP8 \
  --kt-weight-path /path/to/Qwen3.5-35B-A3B-FP8 \
  --kt-cpuinfer 16 \
  --kt-threadpool-count 1 \
  --kt-num-gpu-experts 2 \
  --kt-method FP8 \
  --kt-gpu-prefill-token-threshold 400 \
  --attention-backend triton \
  --trust-remote-code \
  --mem-fraction-static 0.85 \
  --chunked-prefill-size 4096 \
  --max-running-requests 1 \
  --max-total-tokens 32000 \
  --enable-mixed-chunk \
  --tensor-parallel-size 1 \
  --disable-shared-experts-fusion

Example: Qwen3-30B-A3B-GPTQ-Int4 (GPTQ_INT4)

# Download the model
huggingface-cli download Qwen/Qwen3-30B-A3B-GPTQ-Int4 --local-dir /path/to/Qwen3-30B-A3B-GPTQ-Int4

# Start the server
python -m sglang.launch_server \
  --host 0.0.0.0 --port 30000 \
  --model /path/to/Qwen3-30B-A3B-GPTQ-Int4 \
  --kt-weight-path /path/to/Qwen3-30B-A3B-GPTQ-Int4 \
  --kt-cpuinfer 16 \
  --kt-threadpool-count 1 \
  --kt-num-gpu-experts 2 \
  --kt-method GPTQ_INT4 \
  --attention-backend triton \
  --trust-remote-code \
  --mem-fraction-static 0.85 \
  --chunked-prefill-size 4096 \
  --max-running-requests 1 \
  --max-total-tokens 32000 \
  --enable-mixed-chunk \
  --tensor-parallel-size 1 \
  --disable-shared-experts-fusion

Sending requests

# Interactive chat
kt chat

# OpenAI-compatible API
curl http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"Qwen3","messages":[{"role":"user","content":"Hello"}],"stream":true}'
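
The same request can be sent from Python using only the standard library. This is a minimal sketch: it assumes the server streams OpenAI-style server-sent events (`data: {...}` lines ending with `data: [DONE]`), and the helper names are hypothetical.

```python
import json
import urllib.request

# Endpoint and model name taken from the curl example above.
URL = "http://localhost:30000/v1/chat/completions"


def build_request(prompt: str, model: str = "Qwen3") -> urllib.request.Request:
    """Build the same streaming chat request as the curl command."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": True,
    }
    return urllib.request.Request(
        URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def iter_stream_text(lines):
    """Yield text fragments from OpenAI-style SSE lines (assumed format)."""
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines and other fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                yield text


def chat(prompt: str) -> str:
    """Send one prompt and return the concatenated streamed reply."""
    with urllib.request.urlopen(build_request(prompt)) as response:
        return "".join(iter_stream_text(response))
```

Calling `chat("Hello")` requires a server already running on port 30000, as in the launch examples.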

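Performance tuning

Set --kt-cpuinfer to the number of physical cores
Set --kt-threadpool-count to the number of NUMA nodes
A larger --kt-num-gpu-experts reduces the CPU load but increases GPU memory usage
Memory bandwidth is often the bottleneck; high-frequency DDR5 memory helps noticeably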
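
The first two settings can be derived from `lscpu` output. The sketch below assumes the English field names `CPU(s)`, `Thread(s) per core`, and `NUMA node(s)`, and computes physical cores as logical CPUs divided by threads per core. The function name is hypothetical, and the result is only a starting point for tuning.

```python
import re
import shutil
import subprocess


def suggest_kt_flags(lscpu_text: str) -> dict[str, int]:
    """Derive starting values for the two thread-related flags."""

    def field(name: str) -> int:
        # Newer lscpu versions indent some fields, so allow leading blanks.
        pattern = rf"^[ \t]*{re.escape(name)}:\s*(\d+)"
        match = re.search(pattern, lscpu_text, re.MULTILINE)
        if match is None:
            raise ValueError(f"{name!r} not found in lscpu output")
        return int(match.group(1))

    logical_cpus = field("CPU(s)")
    threads_per_core = field("Thread(s) per core")
    numa_nodes = field("NUMA node(s)")
    return {
        "--kt-cpuinfer": logical_cpus // threads_per_core,
        "--kt-threadpool-count": numa_nodes,
    }


if __name__ == "__main__":
    if shutil.which("lscpu"):
        output = subprocess.run(
            ["lscpu"], capture_output=True, text=True, check=True
        ).stdout
        print(suggest_kt_flags(output))
```

For a machine reporting 32 logical CPUs, 2 threads per core, and 1 NUMA node, this gives 16 and 1, the values used in the launch examples.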

Common issues

GPU OOM

Reduce --kt-num-gpu-experts, --chunked-prefill-size, and --max-total-tokens
Lower --mem-fraction-static

For more issues, see the FAQ.
