kvcache-ai-ktransformers/doc/en/prefix_cache.md
2025-06-30 15:09:35 +00:00


Enabling Prefix Cache Mode in KTransformers

Balance serve now supports prefix cache reuse. To enable Prefix Cache Mode in KTransformers, modify the configuration file and then recompile the project.

Step 1: Modify the Configuration File

Edit the ./ktransformers/configs/config.yaml file with the following content (you can adjust the values according to your needs):

attn:
  page_size: 16 # Size of a page in KV Cache.
  chunk_size: 256
kvc2:
  gpu_only: false # Set to false to enable prefix cache mode (Disk + CPU + GPU KV storage)
  utilization_percentage: 1.0
  cpu_memory_size_GB: 500 # Amount of CPU memory allocated for KV Cache
  disk_path: /mnt/data/kvc # Path to store KV Cache on disk
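
Before committing to these values, it can help to confirm that the machine can actually back them. The following is a minimal sketch, not part of KTransformers: the helper names `check_kvc_dir` and `check_cpu_mem` are hypothetical, the memory check assumes a Linux host with `/proc/meminfo`, and it treats 1 GB as 1024 * 1024 kB, which may differ slightly from how `cpu_memory_size_GB` is interpreted.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight helpers for the kvc2 settings; not shipped with KTransformers.

# Succeeds if the directory exists (or can be created) and is writable.
check_kvc_dir() {
  local dir="$1"
  mkdir -p "$dir" 2>/dev/null && [ -w "$dir" ]
}

# Succeeds if the requested size (in GB) does not exceed the total RAM
# reported in kB by /proc/meminfo (Linux only).
check_cpu_mem() {
  local want_gb="$1"
  local total_kb
  total_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
  [ $(( want_gb * 1024 * 1024 )) -le "$total_kb" ]
}

# Example (values taken from the config above):
#   check_kvc_dir /mnt/data/kvc || echo "disk_path is not writable"
#   check_cpu_mem 500 || echo "cpu_memory_size_GB exceeds installed RAM"
```

If either check fails, lower `cpu_memory_size_GB` or point `disk_path` at a writable location before recompiling.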

Step 2: Update Submodules and Recompile

If this is your first time using prefix cache mode, please update the submodules first:

git submodule update --init --recursive # Update PhotonLibOS submodule

Then recompile the project:

# Single NUMA node:
USE_BALANCE_SERVE=1 bash ./install.sh
# Dual NUMA (two CPUs and 1 TB of RAM):
USE_BALANCE_SERVE=1 USE_NUMA=1 bash ./install.sh
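
If you are unsure which of the two commands applies, the number of NUMA nodes can be read from sysfs on Linux. The sketch below is illustrative only: `numa_node_count` and `install_cmd_for` are hypothetical helpers, the sysfs path is a Linux convention, and mapping "two or more nodes" to `USE_NUMA=1` is an assumption, since the instructions above only describe the single and dual cases. It does not check the RAM requirement.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; they only print a suggestion and never run install.sh.

# Counts NUMA nodes exposed under sysfs; falls back to 1 if none are found.
numa_node_count() {
  local n=0 d
  for d in /sys/devices/system/node/node[0-9]*; do
    if [ -d "$d" ]; then
      n=$((n + 1))
    fi
  done
  if [ "$n" -eq 0 ]; then
    n=1
  fi
  echo "$n"
}

# Prints the install command matching the given node count.
# Assumption: any count >= 2 is treated like the dual NUMA case.
install_cmd_for() {
  if [ "$1" -ge 2 ]; then
    echo "USE_BALANCE_SERVE=1 USE_NUMA=1 bash ./install.sh"
  else
    echo "USE_BALANCE_SERVE=1 bash ./install.sh"
  fi
}

# Example:
#   install_cmd_for "$(numa_node_count)"
```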

Note

Balance serve uses a three-tier (GPU, CPU, disk) scheme to store and reuse KVCache. Deleting KVCache entries through the server is not supported yet. If the cache grows too large, you can free space by removing the KVCache files stored under disk_path.
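
A manual cleanup might look like the sketch below. It rests on assumptions that this document does not confirm: that the server is stopped first, and that everything under the configured `disk_path` belongs to the cache and is safe to remove. `clear_kvcache` is a hypothetical helper, not a KTransformers command.

```shell
#!/usr/bin/env bash
# Hypothetical cleanup helper. Stop the server before running it.

# Reports the size of the directory, then removes everything inside it
# while keeping the directory itself.
clear_kvcache() {
  local dir="$1"
  # Refuse empty, root, or missing paths to avoid accidental damage.
  if [ -z "$dir" ] || [ "$dir" = "/" ] || [ ! -d "$dir" ]; then
    echo "refusing to clear '$dir'" >&2
    return 1
  fi
  du -sh "$dir"
  find "$dir" -mindepth 1 -delete
}

# Example (path taken from the config in Step 1):
#   clear_kvcache /mnt/data/kvc
```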