# KTransformers+SGLang Inference Deployment

> **Note:** This guide covers quantized deployment. For native Kimi K2 Thinking deployment, please refer to [Kimi-K2-Thinking-Native.md](./Kimi-K2-Thinking-Native.md).

## Installation

### Step 1: Install SGLang

Install the kvcache-ai fork of SGLang (choose one of the following):

```bash
# Option A: one-click install (from the ktransformers repository root)
./install.sh

# Option B: install from PyPI
pip install sglang-kt
```

> **Important:** Use `sglang-kt` (the kvcache-ai fork), not the official `sglang` package. If the official version is installed, run `pip uninstall sglang` first.
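
To confirm the right package is active, you can query pip directly (a minimal check; the version shown will vary with your install):

```bash
# Print whether a given pip package is present (uses only `python3 -m pip`).
check_pkg() {
    if python3 -m pip show "$1" > /dev/null 2>&1; then
        echo "$1 installed"
    else
        echo "$1 missing"
    fi
}
check_pkg sglang-kt   # the kvcache-ai fork -- should be installed
check_pkg sglang      # the official package -- should be missing
```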

### Step 2: Install KTransformers CPU Kernels

The KTransformers CPU kernels (kt-kernel) provide AMX-optimized computation for hybrid inference. For detailed installation instructions and troubleshooting, refer to the official [kt-kernel installation guide](https://github.com/kvcache-ai/ktransformers/blob/main/kt-kernel/README.md).
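
Since kt-kernel relies on AMX instructions, it is worth confirming that your CPU advertises them before installing (Linux-only sketch; the `amx_*` strings are the standard cpuinfo flag names):

```bash
# List AMX-related CPU flags (e.g. amx_tile, amx_int8, amx_bf16).
# No output means the CPU or kernel does not expose AMX.
grep -oE 'amx[a-z0-9_]*' /proc/cpuinfo 2>/dev/null | sort -u || true
```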

## Download Model

Download the official Kimi weights as GPU weights:

* Hugging Face: https://huggingface.co/moonshotai/Kimi-K2-Thinking
* ModelScope: https://modelscope.cn/models/moonshotai/Kimi-K2-Thinking

Download the AMX INT4 quantized weights from https://huggingface.co/KVCache-ai/Kimi-K2-Thinking-CPU-weight as CPU weights.
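
One way to fetch both weight sets is the `huggingface-cli download` command from the `huggingface_hub` package; the `--local-dir` paths below are placeholders to adapt to your storage layout:

```bash
# Sketch: download the GPU weights and the AMX INT4 CPU weights.
# The --local-dir targets are placeholder paths; change as needed.
download_weights() {
    huggingface-cli download moonshotai/Kimi-K2-Thinking \
        --local-dir /data/models/Kimi-K2-Thinking
    huggingface-cli download KVCache-ai/Kimi-K2-Thinking-CPU-weight \
        --local-dir /data/models/Kimi-K2-Thinking-CPU-weight
}
```

Run `download_weights` after `pip install -U huggingface_hub`; ModelScope users can fetch the mirror with the `modelscope download` CLI instead.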

## How to Start

```bash
python -m sglang.launch_server \
    --host 0.0.0.0 --port 60000 \
    --model path/to/Kimi-K2-Thinking/ \
    --kt-weight-path path/to/Kimi-K2-Thinking-CPU-weight/ \
    --kt-cpuinfer 56 \
    --kt-threadpool-count 2 \
    --kt-num-gpu-experts 200 \
    --kt-method AMXINT4 \
    --attention-backend flashinfer \
    --trust-remote-code \
    --mem-fraction-static 0.98 \
    --chunked-prefill-size 4096 \
    --max-running-requests 37 \
    --max-total-tokens 37000 \
    --enable-mixed-chunk \
    --tensor-parallel-size 8 \
    --enable-p2p-check \
    --disable-shared-experts-fusion
```
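
Once the server is up, a quick way to sanity-check it is a request against the OpenAI-compatible chat endpoint that SGLang serves (a sketch; host and port match the launch command, and the `model` field can simply name the served model):

```bash
# Smoke-test helper for the OpenAI-compatible endpoint exposed by
# sglang.launch_server (host/port match the launch command above).
smoke_test() {
    curl -s http://127.0.0.1:60000/v1/chat/completions \
        -H 'Content-Type: application/json' \
        -d '{"model": "default", "messages": [{"role": "user", "content": "Say hi"}], "max_tokens": 16}'
}
```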

Tips:

* `--kt-cpuinfer`: recommended value is the number of physical CPU cores minus the number of GPUs (with the 8 GPUs used here, cores − 8).

* `--kt-num-gpu-experts`: the number of experts retained on the GPUs; adjust it according to your available GPU memory and the KV cache space you want to reserve.
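
The `--kt-cpuinfer` rule of thumb can be computed directly (pure arithmetic; the 64-core figure is an assumption implied by `--kt-cpuinfer 56` with 8 GPUs in the example command):

```bash
# Rule of thumb from the tips above:
#   --kt-cpuinfer = (physical CPU cores) - (number of GPUs)
# On Linux, physical core count can be obtained with:
#   lscpu -p=Core,Socket | grep -v '^#' | sort -u | wc -l
recommended_cpuinfer() {
    echo $(( $1 - $2 ))   # cores - gpus
}
recommended_cpuinfer 64 8   # prints 56
```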

## Test

When benchmarking, add `--disable-radix-cache` and `--disable-chunked-prefix-cache` to the server launch command.

### Bench prefill

```bash
python -m sglang.bench_serving --backend sglang --host 127.0.0.1 --port 60000 \
    --num-prompts 37 --random-input-len 1024 --random-output-len 1 \
    --random-range-ratio 1.0 --dataset-name random
```

### Bench decode

```bash
python -m sglang.bench_serving --backend sglang --host 127.0.0.1 --port 60000 \
    --num-prompts 37 --random-input-len 10 --random-output-len 512 \
    --random-range-ratio 1.0 --dataset-name random
```

## Performance

### System Configuration

- GPUs: 8× NVIDIA L20
- CPU: Intel(R) Xeon(R) Gold 6454S

### Bench prefill

```
============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: inf
Max request concurrency: not set
Successful requests: 37
Benchmark duration (s): 65.58
Total input tokens: 37888
Total input text tokens: 37888
Total input vision tokens: 0
Total generated tokens: 37
Total generated tokens (retokenized): 37
Request throughput (req/s): 0.56
Input token throughput (tok/s): 577.74
Output token throughput (tok/s): 0.56
Total token throughput (tok/s): 578.30
Concurrency: 23.31
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 41316.50
Median E2E Latency (ms): 41500.35
---------------Time to First Token----------------
Mean TTFT (ms): 41316.48
Median TTFT (ms): 41500.35
P99 TTFT (ms): 65336.31
---------------Inter-Token Latency----------------
Mean ITL (ms): 0.00
Median ITL (ms): 0.00
P95 ITL (ms): 0.00
P99 ITL (ms): 0.00
Max ITL (ms): 0.00
==================================================
```

### Bench decode

```
============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: inf
Max request concurrency: not set
Successful requests: 37
Benchmark duration (s): 412.66
Total input tokens: 370
Total input text tokens: 370
Total input vision tokens: 0
Total generated tokens: 18944
Total generated tokens (retokenized): 18618
Request throughput (req/s): 0.09
Input token throughput (tok/s): 0.90
Output token throughput (tok/s): 45.91
Total token throughput (tok/s): 46.80
Concurrency: 37.00
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 412620.35
Median E2E Latency (ms): 412640.56
---------------Time to First Token----------------
Mean TTFT (ms): 3551.87
Median TTFT (ms): 3633.59
P99 TTFT (ms): 3637.37
---------------Inter-Token Latency----------------
Mean ITL (ms): 800.53
Median ITL (ms): 797.89
P95 ITL (ms): 840.06
P99 ITL (ms): 864.96
Max ITL (ms): 3044.56
==================================================
```
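
As a sanity check, the headline decode throughput follows directly from the totals in the decode report (total generated tokens divided by benchmark duration):

```bash
# Output token throughput = total generated tokens / benchmark duration.
# Figures taken from the decode benchmark report above.
awk 'BEGIN { printf "%.2f tok/s\n", 18944 / 412.66 }'   # prints 45.91 tok/s
```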