Mirror of https://github.com/kvcache-ai/ktransformers.git
Synced 2025-09-06 20:49:55 +00:00

Commit: delete sudo install
Parent 795524cacc, commit 8acb270c90
3 changed files with 61 additions and 51 deletions

@@ -1,50 +1,54 @@

# Balance Serve backend (multi-concurrency) for ktransformers

## KTransformers v0.2.4 Release Notes

We are excited to announce the official release of the long-awaited **KTransformers v0.2.4**!

In this version, we’ve added highly desired **multi-concurrency** support to the community through a major refactor of the whole architecture, updating more than 10,000 lines of code.

By drawing inspiration from the excellent architecture of sglang, we have implemented high-performance asynchronous concurrent scheduling in C++, including features like continuous batching, chunked prefill, and more. Thanks to GPU sharing in concurrent scenarios, overall throughput is also improved to a certain extent. The following is a demonstration:

https://github.com/user-attachments/assets/faa3bda2-928b-45a7-b44f-21e12ec84b8a

### 🚀 Key Updates

1. Multi-Concurrency Support

   - Added capability to handle multiple concurrent inference requests. Supports receiving and executing multiple tasks simultaneously.
   - We implemented [custom_flashinfer](https://github.com/kvcache-ai/custom_flashinfer/tree/fix-precision-mla-merge-main) based on the high-performance and highly flexible operator library [flashinfer](https://github.com/flashinfer-ai/flashinfer/), and achieved a variable batch size CUDA Graph, which further enhances flexibility while reducing memory and padding overhead.
   - In our benchmarks, overall throughput improved by approximately 130% under 4-way concurrency.
   - With support from Intel, we tested KTransformers v0.2.4 on the latest Xeon6 + MRDIMM-8800 platform. By increasing concurrency, the total output throughput increased from 17 tokens/s to 40 tokens/s. We observed that the bottleneck has now shifted to the GPU. Using a higher-end GPU than the 4090D could further improve performance.

2. Engine Architecture Optimization

   

   Inspired by the scheduling framework of sglang, we refactored KTransformers with a clearer three-layer architecture through an update of 11,000 lines of code, now supporting full multi-concurrency:

   - Server: Handles user requests and serves the OpenAI-compatible API.
   - Inference Engine: Executes model inference and supports chunked prefill.
   - Scheduler: Manages task scheduling and request orchestration. Supports continuous batching by organizing queued requests into batches in a first-come, first-served (FCFS) manner and sending them to the inference engine.

3. Project Structure Reorganization

   All C/C++ code is now centralized under the /csrc directory.

4. Parameter Adjustments

   Removed some legacy and deprecated launch parameters for a cleaner configuration experience.

   We plan to provide a complete parameter list and detailed documentation in future releases to facilitate flexible configuration and debugging.

### 📚 Upgrade Notes

- Due to parameter changes, users who have installed previous versions are advised to delete the `~/.ktransformers` directory and reinitialize (a minimal example follows this list).
- To enable multi-concurrency, please refer to the latest documentation for configuration examples.

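A minimal sketch of that cleanup step (the path is the one named above; back it up first if you keep custom configuration there):

```shell
# Remove the old KTransformers state/config directory so v0.2.4 can
# regenerate it with the new parameter layout.
rm -rf ~/.ktransformers
```
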
### What's Changed

Implemented **custom_flashinfer** @Atream @ovowei @qiyuxinlin

Implemented **balance_serve** engine based on **FlashInfer** @qiyuxinlin @ovowei

Implemented a **continuous batching** scheduler in C++ @ErvinXie

release: bump version v0.2.4 by @Atream @Azure-Tang @ErvinXie @qiyuxinlin @ovowei @KMSorSMS @SkqLiao

## Installation Guide

⚠️ Please note that installing this project will replace flashinfer in your environment. It is strongly recommended to create a new conda environment!!!

### 1. Set Up Conda Environment

We recommend using Miniconda3/Anaconda3 for environment management:

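The environment-creation commands themselves fall outside the hunks shown here; a minimal sketch (the environment name and Python version below are assumptions, not taken from this diff):

```shell
# Create and activate a dedicated environment so the bundled flashinfer
# build does not overwrite an existing installation.
conda create -n ktransformers python=3.11
conda activate ktransformers
```
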
@@ -82,9 +86,9 @@ git submodule update --init --recursive

```sh
# Install single NUMA dependencies
-sudo env USE_BALANCE_SERVE=1 PYTHONPATH="$(which python)" PATH="$(dirname $(which python)):$PATH" bash ./install.sh
+USE_BALANCE_SERVE=1 bash ./install.sh
# Or Install Dual NUMA dependencies
-sudo env USE_BALANCE_SERVE=1 USE_NUMA=1 PYTHONPATH="$(which python)" PATH="$(dirname $(which python)):$PATH" bash ./install.sh
+USE_BALANCE_SERVE=1 USE_NUMA=1 bash ./install.sh
```

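If you are unsure which variant applies, a quick way to see how many NUMA nodes the system exposes (assuming `lscpu` is available; `numactl` may need to be installed separately):

```shell
# One NUMA node -> single NUMA install; two or more -> dual NUMA install.
lscpu | grep -i "numa node(s)"
numactl --hardware   # optional: per-node CPU and memory detail
```
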
## Running DeepSeek-R1-Q4KM Models

@@ -116,6 +120,7 @@ It features the following arguments:

- `--backend_type`: `balance_serve` is a multi-concurrency backend engine introduced in version v0.2.4. The original single-concurrency engine is `ktransformers`.

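The full launch command for this section sits above the hunk boundary; as a rough, hedged sketch, an invocation that selects the multi-concurrency backend looks like the server command quoted later on this page (paths here are placeholders):

```shell
python ktransformers/server/main.py \
  --model_path <path to DeepSeek-R1 config> \
  --gguf_path <path to DeepSeek-R1-Q4_K_M GGUF> \
  --cpu_infer 62 --chunk_size 256 --max_new_tokens 1024 \
  --max_batch_size 4 --cache_lens 32768 --port 10002 \
  --backend_type balance_serve
```
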
### 2. Access server

```
curl -X POST http://localhost:10002/v1/chat/completions \
  -H "accept: application/json" \
```

@@ -87,47 +87,47 @@ sudo apt install libtbb-dev libssl-dev libcurl4-openssl-dev libaio1 libaio-dev l

For Windows we provide a pre-compiled whl package at [ktransformers-0.2.0+cu125torch24avx2-cp312-cp312-win_amd64.whl](https://github.com/kvcache-ai/ktransformers/releases/download/v0.2.0/ktransformers-0.2.0+cu125torch24avx2-cp312-cp312-win_amd64.whl), which requires cuda-12.5, torch-2.4, python-3.11; more pre-compiled packages are being produced. -->

Download source code and compile:

- Init source code

  ```sh
  git clone https://github.com/kvcache-ai/ktransformers.git
  cd ktransformers
  git submodule update --init --recursive
  ```

- [Optional] If you want to run with the website, please [compile the website](./api/server/website.md) before executing ``bash install.sh``

- For Linux

  - For simple install:

    ```shell
    bash install.sh
    ```

  - For those who have two CPUs and 1T of RAM:

    ```shell
    # Make sure your system has dual sockets and at least twice as much RAM as the model size (e.g. 1T RAM for a 512G model)
    apt install libnuma-dev
    export USE_NUMA=1
    bash install.sh # or: make dev_install
    ```

  - For Multi-concurrency with 500G RAM:

    ```shell
    -sudo env USE_BALANCE_SERVE=1 PYTHONPATH="\$(which python)" PATH="\$(dirname \$(which python)):\$PATH" bash ./install.sh
    +USE_BALANCE_SERVE=1 bash ./install.sh
    ```

  - For Multi-concurrency with two CPUs and 1T of RAM:

    ```shell
    -sudo env USE_BALANCE_SERVE=1 USE_NUMA=1 PYTHONPATH="\$(which python)" PATH="\$(dirname \$(which python)):\$PATH" bash ./install.sh
    +USE_BALANCE_SERVE=1 USE_NUMA=1 bash ./install.sh
    ```

- For Windows (Windows native temporarily deprecated, please try WSL)

  ```shell
  install.bat
  ```

* If you are a developer, you can make use of the Makefile to compile and format the code. <br> The detailed usage of the Makefile is [here](./makefile_usage.md)

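For example, the developer install already referenced in the snippets above can be run through the Makefile (other targets are described in the linked usage page):

```shell
# Editable/developer install via the project Makefile
make dev_install
```
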
<h3>Local Chat</h3>

@@ -157,6 +157,7 @@ python -m ktransformers.local_chat --model_path deepseek-ai/DeepSeek-V2-Lite-Cha

```sh
# GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/deepseek-ai/DeepSeek-V2-Lite
# python -m ktransformers.local_chat --model_path ./DeepSeek-V2-Lite --gguf_path ./DeepSeek-V2-Lite-Chat-GGUF
```

It features the following arguments:

- `--model_path` (required): Name of the model (such as "deepseek-ai/DeepSeek-V2-Lite-Chat", which will automatically download configs from [Hugging Face](https://huggingface.co/deepseek-ai/DeepSeek-V2-Lite)). Or, if you already have local files, you may directly use that path to initialize the model.

@@ -174,6 +175,7 @@ We provide a server script, which supports multi-concurrency functionality in ve

```
python ktransformers/server/main.py --model_path /mnt/data/models/DeepSeek-V3 --gguf_path /mnt/data/models/DeepSeek-V3-GGUF/DeepSeek-V3-Q4_K_M/ --cpu_infer 62 --optimize_config_path ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat-serve.yaml --port 10002 --chunk_size 256 --max_new_tokens 1024 --max_batch_size 4 --cache_lens 32768 --backend_type balance_serve
```

It features the following arguments:

- `--chunk_size`: Maximum number of tokens processed in a single run by the engine.

@@ -301,16 +303,19 @@ Start without website:

```sh
ktransformers --model_path deepseek-ai/DeepSeek-V2-Lite-Chat --gguf_path /path/to/DeepSeek-V2-Lite-Chat-GGUF --port 10002
```

Start with website:

```sh
ktransformers --model_path deepseek-ai/DeepSeek-V2-Lite-Chat --gguf_path /path/to/DeepSeek-V2-Lite-Chat-GGUF --port 10002 --web True
```

Or, if you want to start the server with transformers, the model_path should contain safetensors files:

```bash
ktransformers --type transformers --model_path /mnt/data/model/Qwen2-0.5B-Instruct --port 10002 --web True
```

Access the website at [http://localhost:10002/web/index.html#/chat](http://localhost:10002/web/index.html#/chat):

<p align="center">

@@ -126,9 +126,9 @@ git clone https://github.com/kvcache-ai/ktransformers.git

```sh
cd ktransformers
git submodule update --init --recursive
# If using the dual NUMA version
-sudo env USE_BALANCE_SERVE=1 USE_NUMA=1 PYTHONPATH="$(which python)" PATH="$(dirname $(which python)):$PATH" bash ./install.sh
+USE_BALANCE_SERVE=1 USE_NUMA=1 bash ./install.sh
# If using the single NUMA version
-sudo env USE_BALANCE_SERVE=1 PYTHONPATH="$(which python)" PATH="$(dirname $(which python)):$PATH" bash ./install.sh
+USE_BALANCE_SERVE=1 bash ./install.sh
# Launch command
python ktransformers/server/main.py --model_path <your model path> --gguf_path <your gguf path> --cpu_infer 62 --optimize_config_path <inject rule path> --port 10002 --chunk_size 256 --max_new_tokens 1024 --max_batch_size 4 --cache_lens 32768 --backend_type balance_serve
```
