# Kimi-K2 Support for KTransformers

## Introduction

### Overview

We are pleased to announce that KTransformers now supports Kimi-K2 and Kimi-K2-0905. On a single-socket CPU with one consumer-grade GPU, running the Q4_K_M model yields roughly 10 TPS and requires about 600 GB of DRAM. With a dual-socket CPU and sufficient system memory, enabling NUMA optimizations increases performance to about 14 TPS.

### Model & Resource Links

- Official Kimi-K2 Release:
  - https://huggingface.co/collections/moonshotai/kimi-k2-6871243b990f2af5ba60617d
- GGUF Format (quantized models):
  - https://huggingface.co/KVCache-ai/Kimi-K2-Instruct-GGUF
- Official Kimi-K2-0905 Release:
  - https://huggingface.co/moonshotai/Kimi-K2-Instruct-0905
- GGUF Format (quantized models):
  - https://huggingface.co/KVCache-ai/Kimi-K2-Instruct-0905-GGUF

## Installation Guide

### 1. Resource Requirements

Running the model with all 384 experts requires approximately 600 GB of DRAM and 14 GB of GPU memory.

### 2. Prepare Models

```bash
# Download the quantized GGUF weights
huggingface-cli download --resume-download KVCache-ai/Kimi-K2-Instruct-GGUF
```

### 3. Install KTransformers

To install KTransformers, follow the official [Installation Guide](https://kvcache-ai.github.io/ktransformers/en/install.html).

### 4. Run Kimi-K2 Inference Server

```bash
python ktransformers/server/main.py \
  --port 10002 \
  --model_path <model_path> \
  --gguf_path <gguf_path> \
  --optimize_config_path ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat-serve.yaml \
  --max_new_tokens 1024 \
  --cache_lens 32768 \
  --chunk_size 256 \
  --max_batch_size 4 \
  --backend_type balance_serve
```

### 5. Access Server

```bash
curl -X POST http://localhost:10002/v1/chat/completions \
  -H "accept: application/json" \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "hello"}
    ],
    "model": "Kimi-K2",
    "temperature": 0.3,
    "top_p": 1.0,
    "stream": true
  }'
```
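Because the endpoint above follows the OpenAI chat-completions API shape, you can also query it from Python. Below is a minimal sketch assuming the official `openai` client package (`pip install openai`) and the same host and port as the curl example; the `api_key` value is a placeholder, on the assumption that the local server does not validate it.

```python
# Minimal sketch: streaming chat against the local KTransformers server
# started in step 4, using the official `openai` client pointed at the
# OpenAI-compatible /v1 endpoint. The api_key is assumed to be unchecked.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:10002/v1", api_key="not-needed")

stream = client.chat.completions.create(
    model="Kimi-K2",
    messages=[{"role": "user", "content": "hello"}],
    temperature=0.3,
    top_p=1.0,
    stream=True,
)

# Print the reply incrementally as chunks arrive over the stream.
for chunk in stream:
    delta = chunk.choices[0].delta
    if delta.content:
        print(delta.content, end="", flush=True)
print()
```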
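If you prefer a single response instead of incremental output, set `stream` to `false` in the curl payload (or `stream=False` above); under standard OpenAI semantics the completion then arrives in one JSON payload rather than as server-sent event chunks.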