diff --git a/README.md b/README.md
index c8edb36..4f0e83f 100644
--- a/README.md
+++ b/README.md
@@ -23,7 +23,7 @@ Our vision for KTransformers is to serve as a flexible platform for experimentin
 
 🔥 Updates
 
-* **Mar 27, 2025**: Support Multi-concurrency.
+* **Apr 2, 2025**: Support multi-concurrency ([Tutorial](./doc/en/balance-serve.md)).
 * **Mar 15, 2025**: Support ROCm on AMD GPU ([Tutorial](./doc/en/ROCm.md)).
 * **Mar 5, 2025**: Support unsloth 1.58/2.51 bits weights and [IQ1_S/FP8 hybrid](./doc/en/fp8_kernel.md) weights. Support 139K [Longer Context](./doc/en/DeepseekR1_V3_tutorial.md#v022--v023-longer-context--fp8-kernel) for DeepSeek-V3 and R1 in 24GB VRAM.
 * **Feb 25, 2025**: Support [FP8 GPU kernel](./doc/en/fp8_kernel.md) for DeepSeek-V3 and R1; [Longer Context](./doc/en/DeepseekR1_V3_tutorial.md#v022-longer-context).
diff --git a/doc/en/balance-serve.md b/doc/en/balance-serve.md
new file mode 100644
index 0000000..a6b5786
--- /dev/null
+++ b/doc/en/balance-serve.md
@@ -0,0 +1,73 @@
+# balance_serve backend (multi-concurrency) for ktransformers
+
+## Installation Guide
+
+### 1. Set Up Conda Environment
+We recommend using Miniconda3/Anaconda3 for environment management:
+
+```bash
+# Download and install Miniconda
+wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
+bash Miniconda3-latest-Linux-x86_64.sh
+
+# Create environment
+conda create --name ktransformers python=3.11
+conda activate ktransformers
+
+# Install required libraries
+conda install -c conda-forge libstdcxx-ng
+
+# Verify GLIBCXX version (should include 3.4.32)
+strings ~/anaconda3/envs/ktransformers/lib/libstdc++.so.6 | grep GLIBCXX
+```
+
+> **Note:** Adjust the Anaconda path if your installation directory differs from `~/anaconda3`.
+
+### 2. Install dependencies
+
+```bash
+sudo apt install libtbb-dev libssl-dev libcurl4-openssl-dev libaio1 libaio-dev libfmt-dev libgflags-dev zlib1g-dev patchelf
+```
+
+### 3. Build ktransformers
+
+```bash
+# Clone repository
+git clone https://github.com/kvcache-ai/ktransformers.git
+cd ktransformers
+git submodule update --init --recursive
+
+# Optional: compile the web interface
+# See: api/server/website.md
+
+# Install dependencies (single NUMA node)
+sudo env USE_BALANCE_SERVE=1 PYTHONPATH="$(which python)" PATH="$(dirname $(which python)):$PATH" bash ./install.sh
+# Install dependencies (dual NUMA nodes)
+sudo env USE_BALANCE_SERVE=1 USE_NUMA=1 PYTHONPATH="$(which python)" PATH="$(dirname $(which python)):$PATH" bash ./install.sh
+```
+
+## Running DeepSeek-R1-Q4KM Models
+
+### Configuration for 24GB VRAM GPUs
+Use our optimized configuration for constrained VRAM:
+
+```bash
+python ktransformers/server/main.py \
+  --model_path <your model path> \
+  --gguf_path <your gguf path> \
+  --optimize_config_path ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat-serve.yaml \
+  --max_new_tokens 1024 \
+  --cache_lens 32768 \
+  --chunk_size 256 \
+  --max_batch_size 4 \
+  --backend_type balance_serve
+```
+
+The key arguments are:
+
+- `--max_new_tokens`: Maximum number of tokens generated per request.
+- `--cache_lens`: Total length of the KV cache allocated by the scheduler. All requests share one KV cache space, corresponding to 32768 tokens here; the space a request occupies is released after it completes.
+- `--chunk_size`: Maximum number of tokens processed by the engine in a single run.
+- `--max_batch_size`: Maximum number of requests (prefill + decode) processed by the engine in a single run. (Supported only by `balance_serve`.)
+- `--backend_type`: `balance_serve` is the multi-concurrency backend engine introduced in v0.2.4. The original single-concurrency engine is `ktransformers`.
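
To see the batching in action once the server above is running, something like the following minimal client sketch can issue several requests at once. This is not part of the tutorial: it assumes the server listens on `localhost:10002` and exposes an OpenAI-compatible `/v1/chat/completions` endpoint, and the model name in the payload is a placeholder — adjust all three to your deployment.

```python
# Minimal concurrent-client sketch (illustrative, not from the tutorial).
# Assumptions: server at localhost:10002 with an OpenAI-compatible
# /v1/chat/completions endpoint; "DeepSeek-R1" is a placeholder model name.
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

URL = "http://localhost:10002/v1/chat/completions"  # assumed address

def build_request(prompt: str, url: str = URL) -> urllib.request.Request:
    """Encode one chat-completion prompt as an HTTP POST request."""
    payload = {
        "model": "DeepSeek-R1",  # placeholder model name
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 128,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def ask(prompt: str) -> str:
    """Send one request and return the generated text."""
    with urllib.request.urlopen(build_request(prompt)) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

def run_concurrent(prompts: list[str], workers: int = 4) -> list[str]:
    """Dispatch prompts in parallel threads; up to --max_batch_size of the
    in-flight requests are processed together per engine run."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(ask, prompts))
```

With the server up, `run_concurrent(["Hello"] * 4)` keeps four requests in flight, so a `--max_batch_size 4` engine can serve them in shared runs rather than strictly one after another.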