[feat](kt-kernel): support avx2 only inference for bf16 fp8 and gptq int4 (#1892)

* feat: support avx2 bf16 fp8 inference
* feat: support avx2 gptq int4 inference
* fix: numeric issues in fp8 dequant
* Tutorial avx2 (#1900)
* fix: prevent injecting -DLLAMA_AVX512=ON on AVX2-only machines
* docs: add AVX2 tutorial for running KTransformers on AVX2-only CPUs
* Tutorial avx2 (#1901)
* fix: prevent injecting -DLLAMA_AVX512=ON on AVX2-only machines
* docs: add AVX2 tutorial for running KTransformers on AVX2-only CPUs
* docs: update README.md

---------

Co-authored-by: Benjamin F <159887351+yyj6666667@users.noreply.github.com>
This commit is contained in:
parent
8561a71dd1
commit
7a9daf0cd4
19 changed files with 3472 additions and 12 deletions

doc/en/kt-kernel/AVX2-Tutorial.md (new file, 188 lines)

# Running KTransformers on AVX2 CPUs

This tutorial explains how to run KTransformers on machines that only support AVX2 (without AVX512 or AMX).

## Table of Contents

- [Supported Precision Formats](#supported-precision-formats)
- [Hardware Requirements](#hardware-requirements)
- [Installation](#installation)
- [Verification](#verification)
- [Starting the Inference Server](#starting-the-inference-server)
  - [Example: Qwen3-30B-A3B (BF16)](#example-qwen3-30b-a3b-bf16)
  - [Example: Qwen3.5-35B-A3B-FP8 (FP8)](#example-qwen35-35b-a3b-fp8-fp8)
  - [Example: Qwen3-30B-A3B-GPTQ-Int4 (GPTQ_INT4)](#example-qwen3-30b-a3b-gptq-int4-gptq_int4)
  - [Sending Requests](#sending-requests)
- [Performance Tuning](#performance-tuning)
- [FAQ](#faq)

## Supported Precision Formats

| `--kt-method` | Precision | Description |
|---------------|-----------|-------------|
| `BF16` | BF16 native precision | Zero precision loss, uses BF16 weights directly |
| `FP8` | FP8 block quantization | Uses block-quantized FP8 weights; roughly half the memory of BF16 with minor precision loss |
| `GPTQ_INT4` | INT4 GPTQ | Uses GPTQ INT4-quantized weights; lowest memory footprint of the three |
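
If you are not sure which `--kt-method` matches a downloaded checkpoint, its `config.json` usually tells you: GPTQ and FP8 checkpoints carry a `quantization_config` section, while plain BF16/FP16 checkpoints do not. A minimal check (the `/path/to/model` below is a placeholder):

```bash
# Look for a quantization_config entry in the checkpoint's config.json.
# "gptq" maps to --kt-method GPTQ_INT4, "fp8" maps to --kt-method FP8,
# and no match usually means plain BF16/FP16 weights -> --kt-method BF16.
grep -o '"quant_method"[^,}]*' /path/to/model/config.json \
    || echo "no quantization_config found; try --kt-method BF16"
```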

## Hardware Requirements

- **CPU**: x86-64 + AVX2 + FMA (Intel Haswell 2013+ / AMD Zen+)
- **GPU**: NVIDIA 24GB+ VRAM (RTX 3090/4090/5090, etc.)
- **Memory**: At least the size of the model weights (e.g., Qwen3-30B-A3B BF16 requires 64GB+); a quick check is shown after this list
- **OS**: Linux
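
A minimal sketch for confirming the CPU flags and RAM headroom before installing (the model path is a placeholder and assumes the weights are already downloaded):

```bash
# Confirm the CPU exposes both AVX2 and FMA
lscpu | grep -oEw 'avx2|fma' | sort -u

# Check available system memory
free -h

# Compare against the on-disk size of the model weights
du -sh /path/to/Qwen3-30B-A3B
```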

## Installation

Build and install from source (one-click install for kt-kernel + SGLang):

```bash
git clone https://github.com/kvcache-ai/ktransformers.git
cd ktransformers
git submodule update --init --recursive

# One-click install
./install.sh
```

On AVX512 or AMX machines, you can also manually force AVX2 compilation:

```bash
export CPUINFER_CPU_INSTRUCT=AVX2
export CPUINFER_ENABLE_AMX=OFF
./install.sh kt-kernel --manual
```

## Verification

```bash
# Check if the CPU supports AVX2
lscpu | grep -i avx2

# Check the loaded kt-kernel variant
python -c "import kt_kernel; print(kt_kernel.__cpu_variant__)"
# Expected output: avx2

# System diagnostics
kt doctor
```

## Starting the Inference Server

Use `--kt-method BF16`, `FP8`, or `GPTQ_INT4`. KT-Kernel will **automatically detect** the CPU and fall back to the AVX2 backend when AVX512/AMX is unavailable.

### Example: Qwen3-30B-A3B (BF16)

```bash
# Download the model
huggingface-cli download Qwen/Qwen3-30B-A3B --local-dir /path/to/Qwen3-30B-A3B

# Check physical core count and NUMA node count
lscpu | grep -E "^CPU\(s\)|Thread\(s\) per core|NUMA node\(s\)"

# Start the server (adjust kt-cpuinfer and kt-threadpool-count based on your hardware)
python -m sglang.launch_server \
    --host 0.0.0.0 --port 30000 \
    --model /path/to/Qwen3-30B-A3B \
    --kt-weight-path /path/to/Qwen3-30B-A3B \
    --kt-cpuinfer 16 \
    --kt-threadpool-count 1 \
    --kt-num-gpu-experts 32 \
    --kt-method BF16 \
    --attention-backend flashinfer \
    --trust-remote-code \
    --mem-fraction-static 0.80 \
    --chunked-prefill-size 8192 \
    --max-running-requests 2 \
    --served-model-name Qwen3 \
    --enable-mixed-chunk \
    --tensor-parallel-size 1 \
    --enable-p2p-check \
    --disable-shared-experts-fusion
```

### Example: Qwen3.5-35B-A3B-FP8 (FP8)

```bash
# Download the model
huggingface-cli download Qwen/Qwen3.5-35B-A3B-FP8 --local-dir /path/to/Qwen3.5-35B-A3B-FP8

# Start the server
python -m sglang.launch_server \
    --host 0.0.0.0 --port 30000 \
    --model /path/to/Qwen3.5-35B-A3B-FP8 \
    --kt-weight-path /path/to/Qwen3.5-35B-A3B-FP8 \
    --kt-cpuinfer 16 \
    --kt-threadpool-count 1 \
    --kt-num-gpu-experts 2 \
    --kt-method FP8 \
    --kt-gpu-prefill-token-threshold 400 \
    --attention-backend triton \
    --trust-remote-code \
    --mem-fraction-static 0.85 \
    --chunked-prefill-size 4096 \
    --max-running-requests 1 \
    --max-total-tokens 32000 \
    --enable-mixed-chunk \
    --tensor-parallel-size 1 \
    --disable-shared-experts-fusion
```

### Example: Qwen3-30B-A3B-GPTQ-Int4 (GPTQ_INT4)

```bash
# Download the model
huggingface-cli download Qwen/Qwen3-30B-A3B-GPTQ-Int4 --local-dir /path/to/Qwen3-30B-A3B-GPTQ-Int4

# Start the server
python -m sglang.launch_server \
    --host 0.0.0.0 --port 30000 \
    --model /path/to/Qwen3-30B-A3B-GPTQ-Int4 \
    --kt-weight-path /path/to/Qwen3-30B-A3B-GPTQ-Int4 \
    --kt-cpuinfer 16 \
    --kt-threadpool-count 1 \
    --kt-num-gpu-experts 2 \
    --kt-method GPTQ_INT4 \
    --attention-backend triton \
    --trust-remote-code \
    --mem-fraction-static 0.85 \
    --chunked-prefill-size 4096 \
    --max-running-requests 1 \
    --max-total-tokens 32000 \
    --enable-mixed-chunk \
    --tensor-parallel-size 1 \
    --disable-shared-experts-fusion
```

### Sending Requests

```bash
# Interactive chat
kt chat

# OpenAI-compatible API
curl http://localhost:30000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model":"Qwen3","messages":[{"role":"user","content":"Hello"}],"stream":true}'
```
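
Model loading can take a while on large checkpoints. A minimal readiness check before sending chat requests, assuming the default SGLang OpenAI-compatible endpoints and the `--served-model-name Qwen3` used in the BF16 example above:

```bash
# Wait until the server answers the OpenAI-compatible model listing
until curl -sf http://localhost:30000/v1/models > /dev/null; do
    echo "server not ready yet, retrying in 5s..."
    sleep 5
done

# Send a single non-streaming request once it is up
curl http://localhost:30000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model":"Qwen3","messages":[{"role":"user","content":"Say hi in one sentence"}],"stream":false}'
```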

## Performance Tuning

- `--kt-cpuinfer`: set to the number of **physical cores** (one way to count them is shown after this list)
- `--kt-threadpool-count`: set to the number of **NUMA nodes**
- `--kt-num-gpu-experts`: higher values reduce CPU load but increase GPU VRAM usage
- Memory bandwidth is often the bottleneck; high-frequency DDR5 memory helps significantly
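
A small sketch for deriving these two values from a standard Linux `lscpu` output:

```bash
# Physical cores = unique (core, socket) pairs reported by lscpu
PHYS_CORES=$(lscpu -p=Core,Socket | grep -v '^#' | sort -u | wc -l)

# NUMA node count
NUMA_NODES=$(lscpu | awk -F: '/^NUMA node\(s\)/ {gsub(/ /,"",$2); print $2}')

echo "--kt-cpuinfer ${PHYS_CORES} --kt-threadpool-count ${NUMA_NODES}"
```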

## FAQ

**GPU OOM**

- Reduce `--kt-num-gpu-experts`, `--chunked-prefill-size`, `--max-total-tokens`
- Lower `--mem-fraction-static` (an illustrative adjustment follows this list)
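
For example, starting from the BF16 launch above, the command below lowers only the VRAM-related flags; the values are illustrative, not tuned defaults, and should be adjusted for your GPU:

```bash
# BF16 example with VRAM-related flags lowered (illustrative values only)
python -m sglang.launch_server \
    --host 0.0.0.0 --port 30000 \
    --model /path/to/Qwen3-30B-A3B \
    --kt-weight-path /path/to/Qwen3-30B-A3B \
    --kt-cpuinfer 16 \
    --kt-threadpool-count 1 \
    --kt-num-gpu-experts 8 \
    --kt-method BF16 \
    --attention-backend flashinfer \
    --trust-remote-code \
    --mem-fraction-static 0.70 \
    --chunked-prefill-size 2048 \
    --max-running-requests 1 \
    --max-total-tokens 16000 \
    --served-model-name Qwen3 \
    --enable-mixed-chunk \
    --tensor-parallel-size 1 \
    --enable-p2p-check \
    --disable-shared-experts-fusion
```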
For more questions, see [FAQ](../FAQ.md).