mirror of https://github.com/kvcache-ai/ktransformers.git
synced 2025-09-11 15:54:37 +00:00
update readme
This commit is contained in:
parent d41dd23b14
commit b62cefaec9
2 changed files with 74 additions and 1 deletion
README.md
@@ -23,7 +23,7 @@ Our vision for KTransformers is to serve as a flexible platform for experimenting
<h2 id="Updates">🔥 Updates</h2>
- * **Mar 27, 2025**: Support Multi-concurrency.
+ * **Apr 2, 2025**: Support Multi-concurrency ([Tutorial](./doc/en/balance-serve.md)).
* **Mar 15, 2025**: Support ROCm on AMD GPU ([Tutorial](./doc/en/ROCm.md)).
* **Mar 5, 2025**: Support unsloth 1.58/2.51 bits weights and [IQ1_S/FP8 hybrid](./doc/en/fp8_kernel.md) weights. Support 139K [Longer Context](./doc/en/DeepseekR1_V3_tutorial.md#v022--v023-longer-context--fp8-kernel) for DeepSeek-V3 and R1 in 24GB VRAM.
* **Feb 25, 2025**: Support [FP8 GPU kernel](./doc/en/fp8_kernel.md) for DeepSeek-V3 and R1; [Longer Context](./doc/en/DeepseekR1_V3_tutorial.md#v022-longer-context).
doc/en/balance-serve.md (new file, 73 lines)
@@ -0,0 +1,73 @@

# balance_serve backend (multi-concurrency) for ktransformers

## Installation Guide

### 1. Set Up Conda Environment

We recommend using Miniconda3/Anaconda3 for environment management:

```bash
# Download Miniconda
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh

# Create environment
conda create --name ktransformers python=3.11
conda activate ktransformers

# Install required libraries
conda install -c conda-forge libstdcxx-ng

# Verify GLIBCXX version (should include 3.4.32)
strings ~/anaconda3/envs/ktransformers/lib/libstdc++.so.6 | grep GLIBCXX
```

> **Note:** Adjust the Anaconda path if your installation directory differs from `~/anaconda3`.
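
If you are unsure where your installation lives, a minimal sketch like the following (assuming `conda` is on your `PATH`) resolves the environment's library location instead of hard-coding it:

```bash
# A minimal sketch, assuming `conda` is on PATH: derive the env's libstdc++
# location from the conda base directory instead of hard-coding ~/anaconda3.
CONDA_BASE="$(conda info --base)"
strings "${CONDA_BASE}/envs/ktransformers/lib/libstdc++.so.6" | grep GLIBCXX_3.4.32
```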

### 2. Install dependencies

```bash
sudo apt install libtbb-dev libssl-dev libcurl4-openssl-dev libaio1 libaio-dev libfmt-dev libgflags-dev zlib1g-dev patchelf
```
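
As a rough follow-up check, something like this lists whether a few of those libraries are visible to the dynamic linker; it is a convenience, not part of the official steps:

```bash
# Rough sanity check (not exhaustive): confirm some of the shared libraries
# above are registered with the dynamic linker cache.
ldconfig -p | grep -E 'libtbb|libaio|libfmt|libgflags'
```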

### 3. Build ktransformers

```bash
# Clone repository
git clone https://github.com/kvcache-ai/ktransformers.git
cd ktransformers
git submodule update --init --recursive

# Optional: compile the web interface
# See: api/server/website.md

# Install dependencies (single NUMA)
sudo env USE_BALANCE_SERVE=1 PYTHONPATH="$(which python)" PATH="$(dirname $(which python)):$PATH" bash ./install.sh
# Or, for dual-NUMA systems:
sudo env USE_BALANCE_SERVE=1 USE_NUMA=1 PYTHONPATH="$(which python)" PATH="$(dirname $(which python)):$PATH" bash ./install.sh
```
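
Once the script finishes, a quick import check confirms the build landed in the active environment (the `__version__` attribute is an assumption; the import itself is the real test):

```bash
# Quick post-install check. `__version__` is an assumption — if the package
# does not export it, the plain import alone still verifies the build.
python -c "import ktransformers; print(getattr(ktransformers, '__version__', 'import ok'))"
```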

## Running DeepSeek-R1-Q4KM Models

### Configuration for 24GB VRAM GPUs

Use our optimized configuration for constrained VRAM:

```bash
python ktransformers/server/main.py \
  --model_path <path_to_safetensor_config> \
  --gguf_path <path_to_gguf_files> \
  --optimize_config_path ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat-serve.yaml \
  --max_new_tokens 1024 \
  --cache_lens 32768 \
  --chunk_size 256 \
  --max_batch_size 4 \
  --backend_type balance_serve
```

It takes the following arguments:

- `--max_new_tokens`: Maximum number of tokens generated per request.
- `--cache_lens`: Total length of the kvcache allocated by the scheduler. All requests share one kvcache space, corresponding to 32768 tokens here, and the space a request occupies is released once the request completes (e.g., with `--max_batch_size 4`, four concurrent requests average 8192 cached tokens each).
- `--chunk_size`: Maximum number of tokens the engine processes in a single run.
- `--max_batch_size`: Maximum number of requests (prefill + decode) the engine processes in a single run. (Supported only by `balance_serve`.)
- `--backend_type`: `balance_serve` is the multi-concurrency backend engine introduced in v0.2.4; the original single-concurrency engine is `ktransformers`. A quick concurrency smoke test is sketched below.
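
To see multi-concurrency in action once the server is up, a sketch along these lines fires several requests in parallel. It assumes an OpenAI-style `/v1/chat/completions` endpoint on `localhost:10002`; the port and model name are placeholders, so adjust them to your launch flags:

```bash
# Hypothetical smoke test: fire four chat requests at the server in parallel.
# The endpoint path, port, and model name are assumptions — adjust to match
# your --host/--port settings and served model.
for i in 1 2 3 4; do
  curl -s http://localhost:10002/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "DeepSeek-R1", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 32}' &
done
wait
```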
|
Loading…
Add table
Add a link
Reference in a new issue