Enabling Prefix Cache Mode in KTransformers
Balance serve now supports prefix cache reuse! To enable Prefix Cache Mode in KTransformers, you need to modify the configuration file and recompile the project.
Step 1: Modify the Configuration File
Edit the ./ktransformers/configs/config.yaml file with the following content (you can adjust the values according to your needs):
attn:
  page_size: 16 # Size of a page in KV Cache.
  chunk_size: 256
kvc2:
  gpu_only: false # Set to false to enable prefix cache mode (Disk + CPU + GPU KV storage)
  utilization_percentage: 1.0
  cpu_memory_size_GB: 500 # Amount of CPU memory allocated for KV Cache
  disk_path: /mnt/data/kvc # Path to store KV Cache on disk
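Before recompiling, it can help to check that the values in the kvc2 section are plausible for your machine. The following is a minimal sketch, not part of KTransformers: the helper names check_kv_dir, fits_in_ram, and total_ram_gb are hypothetical, and the checks rest on two assumptions, namely that disk_path must be a writable directory and that cpu_memory_size_GB should not exceed physical RAM. It reads /proc/meminfo, so it applies to Linux only.

```shell
#!/bin/sh
# Hypothetical pre-flight helpers; these are NOT part of KTransformers.

# Succeeds if the directory exists (creating it if needed) and is writable.
check_kv_dir() {
    dir="$1"
    mkdir -p "$dir" 2>/dev/null || return 1
    [ -w "$dir" ]
}

# Succeeds if the requested size (GB) does not exceed the total size (GB).
fits_in_ram() {
    requested_gb="$1"
    total_gb="$2"
    [ "$requested_gb" -le "$total_gb" ]
}

# Total physical RAM in whole GiB, read from /proc/meminfo (Linux only).
total_ram_gb() {
    awk '/^MemTotal:/ { printf "%d\n", $2 / 1024 / 1024 }' /proc/meminfo
}
```

With the example values above, you might run check_kv_dir /mnt/data/kvc and fits_in_ram 500 "$(total_ram_gb)" and lower cpu_memory_size_GB if the second check fails. The comparison is only approximate, since model weights and other processes also need memory.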
Step 2: Update Submodules and Recompile
If this is your first time using prefix cache mode, please update the submodules first:
git submodule update --init --recursive # Update PhotonLibOS submodule
Then recompile the project:
# Install single NUMA dependencies
USE_BALANCE_SERVE=1 bash ./install.sh
# For machines with two CPUs and 1 TB of RAM (dual NUMA):
USE_BALANCE_SERVE=1 USE_NUMA=1 bash ./install.sh
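If you are unsure which of the two install commands applies, the sketch below counts NUMA nodes and prints the matching command from this guide. The function names count_numa_nodes and suggest_install_cmd are hypothetical, and the sketch assumes the Linux sysfs layout under /sys/devices/system/node. It does not check the 1 TB RAM recommendation mentioned above.

```shell
#!/bin/sh
# Hypothetical helpers; these are NOT part of KTransformers.

# Count NUMA nodes by counting node<N> directories (Linux sysfs layout).
# The base directory is a parameter so the function can be tried on a test tree.
count_numa_nodes() {
    base="${1:-/sys/devices/system/node}"
    n=0
    for d in "$base"/node[0-9]*; do
        if [ -d "$d" ]; then
            n=$((n + 1))
        fi
    done
    echo "$n"
}

# Print the install command from this guide that matches the node count.
suggest_install_cmd() {
    if [ "$1" -ge 2 ]; then
        echo "USE_BALANCE_SERVE=1 USE_NUMA=1 bash ./install.sh"
    else
        echo "USE_BALANCE_SERVE=1 bash ./install.sh"
    fi
}
```

Running suggest_install_cmd "$(count_numa_nodes)" prints the command to use. It only prints the command and does not execute it, so you can review the choice first.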
Note
Balance serve uses a three-layer (GPU-CPU-Disk) scheme to store and reuse KVCache. Deleting KVCache through KTransformers is not supported yet. If the cache grows too large, you can remove the KVCache files on disk (stored under the configured disk_path) manually.
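The manual cleanup could be scripted roughly as follows. This is a sketch under stated assumptions, not a KTransformers tool: the function names kv_cache_size_kb and clear_kv_cache are hypothetical, the on-disk file layout is not documented here so the sketch treats everything under the cache directory as KV cache, and it assumes the server is stopped while files are removed. The find options -mindepth and -delete are GNU/BSD extensions rather than POSIX.

```shell
#!/bin/sh
# Hypothetical helpers; these are NOT part of KTransformers.

# Print the size of the cache directory in KiB.
kv_cache_size_kb() {
    du -sk "$1" | awk '{ print $1 }'
}

# Remove everything inside the directory but keep the directory itself.
# Refuses to run on an empty argument or a path that is not a directory.
clear_kv_cache() {
    dir="$1"
    if [ -z "$dir" ] || [ ! -d "$dir" ]; then
        return 1
    fi
    find "$dir" -mindepth 1 -delete
}
```

For the example configuration, kv_cache_size_kb /mnt/data/kvc reports the current usage and clear_kv_cache /mnt/data/kvc empties the directory. Keeping the directory itself means disk_path stays valid for the next run.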