diff --git a/doc/en/balance-serve.md b/doc/en/balance-serve.md
index ade35dd..a50e0b6 100644
--- a/doc/en/balance-serve.md
+++ b/doc/en/balance-serve.md
@@ -1,50 +1,54 @@
 # Balance Serve backend (multi-concurrency) for ktransformers
 ## KTransformers v0.2.4 Release Notes
+
 We are excited to announce the official release of the long-awaited **KTransformers v0.2.4**! In this version, we’ve added highly desired **multi-concurrency** support to the community through a major refactor of the whole architecture, updating more than 10,000 lines of code. By drawing inspiration from the excellent architecture of sglang, we have implemented high-performance asynchronous concurrent scheduling in C++, including features like continuous batching, chunked prefill, and more. Thanks to GPU sharing in concurrent scenarios, overall throughput is also improved to a certain extent. The following is a demonstration:
-
-
 https://github.com/user-attachments/assets/faa3bda2-928b-45a7-b44f-21e12ec84b8a
 
 ### 🚀 Key Updates
+
 1. Multi-Concurrency Support
    - Added capability to handle multiple concurrent inference requests. Supports receiving and executing multiple tasks simultaneously.
    - We implemented [custom_flashinfer](https://github.com/kvcache-ai/custom_flashinfer/tree/fix-precision-mla-merge-main) based on the high-performance and highly flexible operator library [flashinfer](https://github.com/flashinfer-ai/flashinfer/), and achieved a variable batch size CUDA Graph, which further enhances flexibility while reducing memory and padding overhead.
    - In our benchmarks, overall throughput improved by approximately 130% under 4-way concurrency.
   - With support from Intel, we tested KTransformers v0.2.4 on the latest Xeon6 + MRDIMM-8800 platform. By increasing concurrency, the total output throughput increased from 17 tokens/s to 40 tokens/s. We observed that the bottleneck has now shifted to the GPU. Using a higher-end GPU than the 4090D could further improve performance.
 2. Engine Architecture Optimization
-
-   Inspired by the scheduling framework of sglang, we refactored KTransformers with a clearer three-layer architecture through an update of 11,000 lines of code, now supporting full multi-concurrency:
+
+   Inspired by the scheduling framework of sglang, we refactored KTransformers with a clearer three-layer architecture through an update of 11,000 lines of code, now supporting full multi-concurrency:
   - Server:Handles user requests and serves the OpenAI-compatible API.
  - Inference Engine:Executes model inference and supports chunked prefill.
  - Scheduler:Manages task scheduling and requests orchestration. Supports continuous batching by organizing queued requests into batches in a FCFS manner and sending them to the inference engine.
 3. Project Structure Reorganization
-All C/C++ code is now centralized under the /csrc directory.
+   All C/C++ code is now centralized under the /csrc directory.
 4. Parameter Adjustments
-Removed some legacy and deprecated launch parameters for a cleaner configuration experience.
-We plan to provide a complete parameter list and detailed documentation in future releases to facilitate flexible configuration and debugging.
+   Removed some legacy and deprecated launch parameters for a cleaner configuration experience.
+   We plan to provide a complete parameter list and detailed documentation in future releases to facilitate flexible configuration and debugging.
+
 ### 📚 Upgrade Notes
+
 - Due to parameter changes, users who have installed previous versions are advised to delete the ~/.ktransformers directory and reinitialize.
 - To enable multi-concurrency, please refer to the latest documentation for configuration examples.
+
 ### What's Changed
+
 Implemented **custom_flashinfer** @Atream @ovowei @qiyuxinlin
-Implemented **balance_serve** engine based on **FlashInfer** @qiyuxinlin @ovowei
-Implemented a **continuous batching** scheduler in C++ @ErvinXie
-release: bump version v0.2.4 by @Atream @Azure-Tang @ErvinXie @qiyuxinlin @ovowei @KMSorSMS @SkqLiao
-
-
+Implemented **balance_serve** engine based on **FlashInfer** @qiyuxinlin @ovowei
+Implemented a **continuous batching** scheduler in C++ @ErvinXie
+release: bump version v0.2.4 by @Atream @Azure-Tang @ErvinXie @qiyuxinlin @ovowei @KMSorSMS @SkqLiao
 
 ## Installation Guide
-⚠️ Please note that installing this project will replace flashinfer in your environment. It is strongly recommended to create a new conda environment!!! ⚠️ Please note that installing this project will replace flashinfer in your environment. It is strongly recommended to create a new conda environment!!! ⚠️ Please note that installing this project will replace flashinfer in your environment. It is strongly recommended to create a new conda environment!!!
+
+⚠️ Please note that installing this project will replace flashinfer in your environment. It is strongly recommended to create a new conda environment!!!
+
 ### 1. Set Up Conda Environment
 We recommend using Miniconda3/Anaconda3 for environment management:
@@ -82,9 +86,9 @@ git submodule update --init --recursive
 
 # Install single NUMA dependencies
-sudo env USE_BALANCE_SERVE=1 PYTHONPATH="$(which python)" PATH="$(dirname $(which python)):$PATH" bash ./install.sh
+USE_BALANCE_SERVE=1 bash ./install.sh
 # Or Install Dual NUMA dependencies
-sudo env USE_BALANCE_SERVE=1 USE_NUMA=1 PYTHONPATH="$(which python)" PATH="$(dirname $(which python)):$PATH" bash ./install.sh
+USE_BALANCE_SERVE=1 USE_NUMA=1 bash ./install.sh
 ```
 
 ## Running DeepSeek-R1-Q4KM Models
@@ -116,6 +120,7 @@ It features the following arguments:
 - `--backend_type`: `balance_serve` is a multi-concurrency backend engine introduced in version v0.2.4. The original single-concurrency engine is `ktransformers`.
 
 ### 2. access server
+
 ```
 curl -X POST http://localhost:10002/v1/chat/completions \
   -H "accept: application/json" \
diff --git a/doc/en/install.md b/doc/en/install.md
index b4918e7..b4a3879 100644
--- a/doc/en/install.md
+++ b/doc/en/install.md
@@ -87,47 +87,47 @@ sudo apt install libtbb-dev libssl-dev libcurl4-openssl-dev libaio1 libaio-dev l
 for windows we prepare a pre compiled whl package on [ktransformers-0.2.0+cu125torch24avx2-cp312-cp312-win_amd64.whl](https://github.com/kvcache-ai/ktransformers/releases/download/v0.2.0/ktransformers-0.2.0+cu125torch24avx2-cp312-cp312-win_amd64.whl), which require cuda-12.5, torch-2.4, python-3.11, more pre compiled package are being produced. -->
 
 - Download source code and compile:
-  - init source code
+- init source code
 
-    ```sh
-    git clone https://github.com/kvcache-ai/ktransformers.git
-    cd ktransformers
-    git submodule update --init --recursive
-    ```
-  - [Optional] If you want to run with website, please [compile the website](./api/server/website.md) before execute ``bash install.sh``
-  - For Linux
+  ```sh
+  git clone https://github.com/kvcache-ai/ktransformers.git
+  cd ktransformers
+  git submodule update --init --recursive
+  ```
+- [Optional] If you want to run with website, please [compile the website](./api/server/website.md) before execute ``bash install.sh``
+- For Linux
 
-    - For simple install:
-
-      ```shell
-      bash install.sh
-      ```
-    - For those who have two cpu and 1T RAM:
-
-      ```shell
-      # Make sure your system has dual sockets and double size RAM than the model's size (e.g. 1T RAM for 512G model)
-      apt install libnuma-dev
-      export USE_NUMA=1
-      bash install.sh # or #make dev_install
-      ```
-    - For Multi-concurrency with 500G RAM:
-
-      ```shell
-      sudo env USE_BALANCE_SERVE=1 PYTHONPATH="\$(which python)" PATH="\$(dirname \$(which python)):\$PATH" bash ./install.sh
-      ```
-    - For Multi-concurrency with two cpu and 1T RAM:
-
-      ```shell
-      sudo env USE_BALANCE_SERVE=1 USE_NUMA=1 PYTHONPATH="\$(which python)" PATH="\$(dirname \$(which python)):\$PATH" bash ./install.sh
-      ```
-  - For Windows (Windows native temprarily deprecated, please try WSL)
+  - For simple install:
 
     ```shell
-    install.bat
+    bash install.sh
     ```
+  - For those who have two cpu and 1T RAM:
+
+    ```shell
+    # Make sure your system has dual sockets and double size RAM than the model's size (e.g. 1T RAM for 512G model)
+    apt install libnuma-dev
+    export USE_NUMA=1
+    bash install.sh # or #make dev_install
+    ```
+  - For Multi-concurrency with 500G RAM:
+
+    ```shell
+    USE_BALANCE_SERVE=1 bash ./install.sh
+    ```
+  - For Multi-concurrency with two cpu and 1T RAM:
+
+    ```shell
+    USE_BALANCE_SERVE=1 USE_NUMA=1 bash ./install.sh
+    ```
+- For Windows (Windows native temprarily deprecated, please try WSL)
+
+  ```shell
+  install.bat
+  ```
+
 * If you are developer, you can make use of the makefile to compile and format the code.
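The multi-concurrency behaviour described in the release notes above is easiest to see by firing several requests at the OpenAI-compatible endpoint at once. Below is a minimal Python sketch (not part of this patch), assuming a balance_serve server is already running on `http://localhost:10002` as in the curl example from doc/en/balance-serve.md; the model name, prompts, `max_tokens` value, and the standard OpenAI-style response shape are assumptions to adapt to your own deployment.

```python
# Minimal concurrency smoke test for the balance_serve backend (illustrative only).
# Assumes a server started as in the tutorial, listening on localhost:10002, with an
# OpenAI-compatible /v1/chat/completions route; MODEL is a placeholder name.
from concurrent.futures import ThreadPoolExecutor

import requests

URL = "http://localhost:10002/v1/chat/completions"
MODEL = "DeepSeek-R1"  # placeholder: use the model name your server was launched with


def ask(prompt: str) -> str:
    """Send one chat-completion request and return the generated text."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 128,
    }
    resp = requests.post(URL, json=payload, timeout=600)
    resp.raise_for_status()
    # Assumes the usual OpenAI-style response layout.
    return resp.json()["choices"][0]["message"]["content"]


if __name__ == "__main__":
    prompts = [f"Explain topic {i} in one sentence." for i in range(4)]
    # Keep four requests in flight so the scheduler can batch them together.
    with ThreadPoolExecutor(max_workers=4) as pool:
        for prompt, answer in zip(prompts, pool.map(ask, prompts)):
            print(prompt, "->", answer[:80])
```

With several requests in flight, the scheduler can group them in FCFS order, which is the scenario behind the roughly 130% throughput gain reported under 4-way concurrency.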
diff --git a/doc/zh/DeepseekR1_V3_tutorial_zh.md b/doc/zh/DeepseekR1_V3_tutorial_zh.md
index 17b51cd..5645f4f 100644
--- a/doc/zh/DeepseekR1_V3_tutorial_zh.md
+++ b/doc/zh/DeepseekR1_V3_tutorial_zh.md
@@ -126,9 +126,9 @@ git clone https://github.com/kvcache-ai/ktransformers.git
cd ktransformers
git submodule update --init --recursive
# 如果使用双 numa 版本
-sudo env USE_BALANCE_SERVE=1 USE_NUMA=1 PYTHONzPATH="$(which python)" PATH="$(dirname $(which python)):$PATH" bash ./install.sh
+USE_BALANCE_SERVE=1 USE_NUMA=1 bash ./install.sh
# 如果使用单 numa 版本
-sudo env USE_BALANCE_SERVE=1 PYTHONzPATH="$(which python)" PATH="$(dirname $(which python)):$PATH" bash ./install.sh
+USE_BALANCE_SERVE=1 bash ./install.sh
# 启动命令
python ktransformers/server/main.py --model_path
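For readers who want a concrete picture of the continuous batching mentioned throughout this patch, here is a small, self-contained Python sketch of the idea: requests are admitted in FCFS order, each decode step serves every active request, and finished requests are retired immediately so queued ones can take their slots. It only illustrates the scheduling policy; the real scheduler shipped in v0.2.4 is the C++ implementation under /csrc, and the `Request` type and `step_fn` engine stub below are hypothetical.

```python
# Conceptual sketch of FCFS continuous batching (not the actual C++ scheduler).
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Request:
    rid: int
    prompt: str
    max_new_tokens: int
    generated: list[str] = field(default_factory=list)


def continuous_batching(incoming: list[Request], max_batch: int, step_fn) -> None:
    """Admit requests FCFS, decode the whole batch each step, retire finished ones."""
    waiting = deque(incoming)        # FCFS queue of not-yet-started requests
    running: list[Request] = []      # current batch handed to the inference engine
    while waiting or running:
        # Admit new requests in arrival order while there is a free slot.
        while waiting and len(running) < max_batch:
            running.append(waiting.popleft())
        # One decode step produces one token for every request in the batch.
        for req, token in zip(running, step_fn(running)):
            req.generated.append(token)
        # Retire finished requests immediately so queued ones can start next step.
        running = [r for r in running if len(r.generated) < r.max_new_tokens]


if __name__ == "__main__":
    # Dummy "engine" that just emits placeholder tokens, one per active request.
    reqs = [Request(i, f"prompt {i}", max_new_tokens=3 + i) for i in range(5)]
    continuous_batching(reqs, max_batch=2, step_fn=lambda batch: ["tok"] * len(batch))
    for r in reqs:
        print(r.rid, r.generated)
```

Because finished requests free their slots right away, short and long requests can share a batch without padding everything to the longest sequence, which is what keeps the GPU busy under concurrent load.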