fix typo

liam 2025-02-10 10:50:40 +08:00
parent 910d8c842a
commit 107e4be417


@@ -1,18 +1,18 @@
# Report
## Prerequisites
We run our best performance tests (V0.2) on <br>
CPU: Intel(R) Xeon(R) Gold 6454S, 1T DRAM (2 NUMA nodes) <br>
GPU: 4090D 24G VRAM <br>
## Benchmark Results
### V0.2
#### Settings
- Model: DeepseekV3-q4kmint4 <br>
- CPU: Intel(R) Xeon(R) Gold 6454S, 32 cores per socket, 2 sockets, 2 NUMA nodes
- GPU: 4090D, 24G VRAM
- We test after enough warm-up.
#### Memory Consumption
- Single socket: 382G DRAM, at least 12G VRAM
- Dual socket: 1T DRAM, at least 12G VRAM (a quick check is sketched below)
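A rough way to check whether your machine meets these numbers is sketched below (an assumption-laden sanity check: it presumes `numactl` and the NVIDIA driver are installed; adjust to your own tooling):
``` shell
# total DRAM and the per-NUMA-node split
free -h
numactl --hardware
# total VRAM per GPU
nvidia-smi --query-gpu=name,memory.total --format=csv
```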
#### Benchmark Results
@@ -26,22 +26,22 @@ gpu: 4090D 24G VRAM <br>
**The highest speedup reaches up to <u>3.03x</u> in decoding and <u>9.44x</u> in prefill.**
### V0.3-Preview
#### Settings
- Model: DeepseekV3-BF16 (online quant into int8 for CPU and int4 for GPU)
- CPU: Intel(R) Xeon(R) Gold 6454S, 32 cores per socket, 2 sockets, 2 NUMA nodes
- GPU: (1~4)x 4090D, 24G VRAM (requires more VRAM for longer prompts)
#### Memory Consumption
- 644GB DRAM, at least 12GB VRAM
#### Benchmark Results
| Prompt length | 1K | 2K | 4K | 8K |
|---------------|-----|-----|-----|-----|
| KTrans (8 experts) Prefill token/s | 185.96 | 255.26 | 252.58 | 195.62 |
| KTrans (6 experts) Prefill token/s | 203.70 | 286.55 | 271.08 | 207.20 |
**The prefill of KTrans V0.3 is up to <u>3.45x</u> faster than KTrans V0.2, and up to <u>63.53x</u> faster than llama.cpp.**
**The decoding speed is the same as KTrans V0.2 (6 experts version), so it is omitted.**
The main acceleration comes from
- the Intel AMX instruction set and our specially designed cache-friendly memory layout
@@ -53,9 +53,9 @@ when we slightly decrease the activation experts num in inference,
the output quality doesn't change, but the speed of decoding and prefill
improves, which is inspiring. So our showcase makes use of this finding.*
## How to Run
### V0.2 Showcase
#### Single Socket Version (32 cores)
Our local_chat test command is:
``` shell
git clone https://github.com/kvcache-ai/ktransformers.git
@@ -64,10 +64,10 @@ numactl -N 1 -m 1 python ./ktransformers/local_chat.py --model_path <your model
<when you see chat, then press enter to load the text prompt_file>
```
\<your model path\> can be local or an online Hugging Face path such as deepseek-ai/DeepSeek-V3. If the online path runs into connection problems, try the mirror (hf-mirror.com). <br>
\<your gguf path\> can also be online, but as it is large we recommend you download it and quantize the model to the format you want. <br>
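If you prefer to download the weights up front, one possible way is sketched below. It assumes the `huggingface-cli` tool from `huggingface_hub` is installed; `HF_ENDPOINT` routes downloads through the mirror, and the local directory is a placeholder:
``` shell
pip install -U huggingface_hub
# route downloads through hf-mirror.com if huggingface.co is hard to reach
export HF_ENDPOINT=https://hf-mirror.com
# fetch the original model into a local folder (placeholder path)
huggingface-cli download deepseek-ai/DeepSeek-V3 --local-dir ./DeepSeek-V3
```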
The command `numactl -N 1 -m 1` aims to avoid data transfer between NUMA nodes.
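To see which NUMA nodes and CPUs your machine has before picking the node to bind to, a quick check (assuming `numactl` is installed) is:
``` shell
# list NUMA nodes with their CPUs and memory sizes
numactl --hardware
lscpu | grep -i numa
```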
#### Dual Socket Version (64 cores)
Make sure that before you install (using install.sh or `make dev_install`), you set the env var `USE_NUMA=1` via `export USE_NUMA=1` (if already installed, reinstall with this env var set). <br>
Our local_chat test command is:
``` shell
git clone https://github.com/kvcache-ai/ktransformers.git
@@ -77,12 +77,12 @@ make dev_install # or sh ./install.sh
python ./ktransformers/local_chat.py --model_path <your model path> --gguf_path <your gguf path> --prompt_file <your prompt txt file> --cpu_infer 65 --cache_lens 1536
<when you see chat, then press enter to load the text prompt_file>
```
The parameters' meaning is the same, but as we use dual sockets, we set cpu_infer to 65.
## Some Explanations
1. We also want to make further use of the two NUMA nodes on the Xeon Gold CPU.
To avoid the cost of data transfer between nodes, we "copy" the critical matrix on
both nodes, which consumes more memory but accelerates the prefill and decoding process.
However, this method uses a huge amount of memory and is slow when loading weights, so be patient when loading
and monitor the memory usage (we are considering making this method an option). We are going to optimize this huge memory overhead. Stay tuned~ <br>
2. The command argument `--cpu_infer 65` specifies how many cores to use (it's OK if it exceeds the physical number,
but more is not always better; adjust it to slightly below your actual number of cores). A quick way to check your core count is sketched below. <br>
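A quick sketch for sizing `--cpu_infer` and for watching memory while the weights load (assumes standard Linux tools; `numastat` ships with the numactl package):
``` shell
# total physical cores = Socket(s) x Core(s) per socket; set --cpu_infer slightly below that
lscpu | grep -E 'Socket\(s\)|Core\(s\) per socket'
# watch per-NUMA-node memory usage while the weights are loading
watch -n 5 numastat -m
```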