# GPT-4/o1-level Local VSCode Copilot on a Desktop with only 24GB VRAM

## SUMMARY
https://github.com/user-attachments/assets/ebd70bfa-b2c1-4abb-ae3b-296ed38aa285
- [NEW!!!] Local 671B DeepSeek-Coder-V3/R1: Running its Q4_K_M version using only 14GB VRAM and 382GB DRAM.
- Prefill Speed (tokens/s):
  - KTransformers: 54.21 (32 cores) → 74.362 (dual-socket, 2×32 cores) → 255.26 (optimized AMX-based MoE kernel, V0.3 only) → 286.55 (selectively using 6 experts, V0.3 only)
  - Compared to 10.31 tokens/s in llama.cpp with 2×32 cores, achieving up to 27.79× speedup.
- Decode Speed (tokens/s):
  - KTransformers: 8.73 (32 cores) → 11.26 (dual-socket, 2×32 cores) → 13.69 (selectively using 6 experts, V0.3 only)
  - Compared to 4.51 tokens/s in llama.cpp with 2×32 cores, achieving up to 3.03× speedup.
- Upcoming Open Source Release:
  - AMX optimizations and selective expert activation will be open-sourced in V0.3.
  - Currently available only in a preview binary distribution, which can be found here.
## Prerequisites

We run our best performance tests (V0.2) on:

- CPU: Intel (R) Xeon (R) Gold 6454S, 1TB DRAM (2 NUMA nodes)
- GPU: 4090D, 24GB VRAM
## Bench Result

### V0.2

#### Settings
- Model: DeepseekV3-q4km (int4)
- CPU: Intel (R) Xeon (R) Gold 6454S, 32 cores per socket, 2 sockets, 2 NUMA nodes
- GPU: 4090D, 24GB VRAM
- We test after sufficient warm-up.
Memory consumption:

- Single socket: 382GB DRAM, at least 14GB VRAM
- Dual socket: 1TB DRAM, at least 14GB VRAM
#### Benchmark Results

The "6 experts" case is part of the V0.3 preview.
| Prompt (500 tokens) | Dual socket KTrans (6 experts) | Dual socket KTrans (8 experts) | Single socket KTrans (6 experts) | Single socket KTrans (8 experts) | llama.cpp (8 experts) |
| --- | --- | --- | --- | --- | --- |
| Prefill token/s | 97.32 | 82.94 | 65.14 | 54.21 | 10.31 |
| Decode token/s | 13.69 | 12.208 | 10.303 | 8.73 | 4.51 |
The highest speedup reaches 3.03× in decoding and 9.44× in prefill.
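
These figures follow directly from the table; a quick sanity check in plain Python (numbers copied from the table above):

```python
# Numbers copied from the V0.2 benchmark table above (tokens/s).
llama_prefill, llama_decode = 10.31, 4.51       # llama.cpp, 2 x 32 cores
ktrans_prefill, ktrans_decode = 97.32, 13.69    # KTrans, dual socket, 6 experts

print("prefill speedup:", ktrans_prefill / llama_prefill)  # ~9.44x
print("decode speedup: ", ktrans_decode / llama_decode)    # ~3.03x
```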
### V0.3-Preview

#### Settings

- Model: DeepseekV3-BF16 (online quantization to int8 for CPU and int4 for GPU)
- CPU: Intel (R) Xeon (R) Gold 6454S, 32 cores per socket, 2 sockets, 2 NUMA nodes
- GPU: (1~4)x 4090D, 24GB VRAM (longer prompts require more VRAM)

Memory consumption:

- 644GB DRAM, at least 14GB VRAM
#### Benchmark Results

| Prompt length | 1K | 2K | 4K | 8K |
| --- | --- | --- | --- | --- |
| KTrans (8 experts) Prefill token/s | 185.96 | 255.26 | 252.58 | 195.62 |
| KTrans (6 experts) Prefill token/s | 203.70 | 286.55 | 271.08 | 207.20 |
The prefill of KTrans V0.3 is up to 3.45× faster than KTrans V0.2, and up to 27.79× faster than llama.cpp. The decoding speed is the same as KTrans V0.2 (6 experts version), so it is omitted.
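
These ratios can again be reproduced from the two tables, assuming the 3.45× figure compares the best V0.3 number against the V0.2 dual-socket 8-expert prefill:

```python
# Numbers copied from the tables above (prefill tokens/s).
v03_best = 286.55     # KTrans V0.3 preview, 6 experts, 2K prompt
v02_base = 82.94      # KTrans V0.2, dual socket, 8 experts
llama    = 10.31      # llama.cpp, 2 x 32 cores

print("vs. V0.2:     ", v03_best / v02_base)  # ~3.45x
print("vs. llama.cpp:", v03_best / llama)     # ~27.79x
```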
The main acceleration comes from:

- the Intel AMX instruction set and our specially designed cache-friendly memory layout;
- an expert selection strategy that activates fewer experts, based on offline profiling results on out-of-domain data.

Our research on DeepSeek-V2, DeepSeek-V3, and DeepSeek-R1 shows that slightly decreasing the number of activated experts at inference time does not change the output quality, while both prefill and decoding get noticeably faster, which is encouraging. Our showcase makes use of this finding (illustrated by the sketch below).
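
For illustration only (this is not KTransformers' actual routing code, and DeepSeek-V3's real router additionally uses grouped routing and gating bias), a generic top-k MoE router shows where the saving comes from: lowering k from 8 to 6 directly cuts the per-token expert FFN work.

```python
import torch

def route_tokens(router_logits: torch.Tensor, num_experts_per_tok: int):
    """Pick the top-k experts per token and renormalize their weights.

    router_logits: [num_tokens, num_experts] scores from the gating network.
    Lowering num_experts_per_tok (e.g. 8 -> 6) reduces the expert FFN work
    per token, which is the idea behind the "6 experts" results above.
    """
    probs = torch.softmax(router_logits, dim=-1)
    weights, expert_ids = torch.topk(probs, k=num_experts_per_tok, dim=-1)
    weights = weights / weights.sum(dim=-1, keepdim=True)  # renormalize
    return weights, expert_ids

logits = torch.randn(4, 256)         # 4 tokens, 256 routed experts (DeepSeek-V3 size)
w8, e8 = route_tokens(logits, 8)     # default: 8 active experts per token
w6, e6 = route_tokens(logits, 6)     # selective activation: 6 experts per token
print(e8.shape, e6.shape)            # torch.Size([4, 8]) torch.Size([4, 6])
```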
## How to Run

### V0.2 Showcase

#### Single socket version (32 cores)
Our local_chat test command is:
```bash
git clone https://github.com/kvcache-ai/ktransformers.git
cd ktransformers
numactl -N 1 -m 1 python ./ktransformers/local_chat.py --model_path <your model path> --gguf_path <your gguf path> --prompt_file <your prompt txt file> --cpu_infer 33 --cache_lens 1536
```

When you see the chat prompt, press Enter to load the text from your prompt file.
- `<your model path>` can be a local path or an online Hugging Face repo id such as deepseek-ai/DeepSeek-V3. If the online download runs into connection problems, try the mirror (hf-mirror.com).
- `<your gguf path>` can also be online, but since it is large we recommend downloading it and quantizing the model to the format you want.
- The command `numactl -N 1 -m 1` binds the process to a single NUMA node to avoid data transfer between NUMA nodes.
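
If you are unsure how your NUMA nodes are laid out, `numactl --hardware` prints the topology; the short sketch below (an illustration assuming a Linux sysfs layout, not part of KTransformers) reads the per-node memory size the same way:

```python
import glob
import os

# Each /sys/devices/system/node/nodeN directory is one NUMA node (Linux).
for node in sorted(glob.glob("/sys/devices/system/node/node[0-9]*")):
    with open(os.path.join(node, "meminfo")) as f:
        total_line = next(line for line in f if "MemTotal" in line)
    print(os.path.basename(node), total_line.split()[-2], "kB total")
```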
#### Dual socket version (64 cores)

Make sure that before you install (using install.sh or `make dev_install`), you set the environment variable USE_NUMA=1, e.g. by `export USE_NUMA=1` (if KTransformers is already installed, reinstall it with this variable set).
Our local_chat test command is:
```bash
git clone https://github.com/kvcache-ai/ktransformers.git
cd ktransformers
export USE_NUMA=1
make dev_install # or sh ./install.sh
python ./ktransformers/local_chat.py --model_path <your model path> --gguf_path <your gguf path> --prompt_file <your prompt txt file> --cpu_infer 65 --cache_lens 1536
```

When you see the chat prompt, press Enter to load the text from your prompt file.
The parameters have the same meaning as above, but since we use dual sockets, we set --cpu_infer to 65.
### Some Explanations
- We also want to make further use of the two NUMA nodes on the Xeon Gold CPU. To avoid the cost of data transfer between nodes, we "copy" the critical matrices to both nodes, which consumes more memory but accelerates both prefill and decoding. This method uses a large amount of memory and is slow when loading weights, so be patient during loading and monitor the memory usage (see the sketch after this list). We are considering making this behavior optional and plan to optimize this memory overhead. Stay tuned~
- The command argument `--cpu_infer 65` specifies how many cores to use (it is fine if the value exceeds the physical core count, but more is not always better; tune it to slightly below your actual number of cores, which you can check as shown below).
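
For both points above, a small helper along these lines can report the physical core count (for choosing --cpu_infer) and the currently available DRAM (for monitoring during weight loading). This is an illustrative sketch assuming a Linux /proc filesystem, not part of KTransformers:

```python
import os

def physical_cores() -> int:
    """Count distinct (physical id, core id) pairs in /proc/cpuinfo (Linux)."""
    pairs, phys, core = set(), None, None
    with open("/proc/cpuinfo") as f:
        for line in f:
            if line.startswith("physical id"):
                phys = line.split(":")[1].strip()
            elif line.startswith("core id"):
                core = line.split(":")[1].strip()
            elif not line.strip() and phys is not None and core is not None:
                pairs.add((phys, core))
                phys = core = None
    return len(pairs) or os.cpu_count()

def available_dram_gib() -> float:
    """MemAvailable from /proc/meminfo (Linux), in GiB."""
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith("MemAvailable"):
                return int(line.split()[1]) / 1024 ** 2
    return 0.0

print("physical cores:", physical_cores())                 # set --cpu_infer near this value
print(f"available DRAM: {available_dram_gib():.1f} GiB")   # re-run periodically while loading
```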