⚡ fix typo

Commit 107e4be417 (parent 910d8c842a), 1 changed file with 26 additions and 26 deletions
@@ -1,18 +1,18 @@
# Report
## Prerequisites
We run our best performance tests (V0.2) on <br>
CPU: Intel(R) Xeon(R) Gold 6454S, 1T DRAM (2 NUMA nodes) <br>
GPU: 4090D 24G VRAM <br>
## Bench Result
### V0.2
#### Settings
- Model: DeepseekV3-q4km (int4) <br>
- CPU: Intel(R) Xeon(R) Gold 6454S, 32 cores per socket, 2 sockets, 2 NUMA nodes
- GPU: 4090D 24G VRAM
- We test after enough warm-up
#### Memory consumption
- Single socket: 382G DRAM, at least 12G VRAM
- Dual socket: 1T DRAM, at least 12G VRAM

#### Benchmark Results

@@ -26,22 +26,22 @@ gpu: 4090D 24G VRAM <br>
**The highest speedup reaches up to <u>3.03x</u> in decoding and <u>9.44x</u> in prefill.**

### V0.3-Preview
#### Settings
- Model: DeepseekV3-BF16 (online quant into int8 for CPU and int4 for GPU)
- CPU: Intel(R) Xeon(R) Gold 6454S, 32 cores per socket, 2 sockets, 2 NUMA nodes
- GPU: (1~4)x 4090D 24G VRAM (requires more VRAM for longer prompts)

#### Memory consumption
- 644GB DRAM, at least 12GB VRAM

#### Benchmark Results
| Prompt length | 1K | 2K | 4K | 8K |
|---------------|-----|-----|-----|-----|
| KTrans (8 experts) Prefill token/s | 185.96 | 255.26 | 252.58 | 195.62 |
| KTrans (6 experts) Prefill token/s | 203.70 | 286.55 | 271.08 | 207.20 |

**The prefill of KTrans V0.3 is up to <u>3.45x</u> faster than KTrans V0.2, and up to <u>63.53x</u> faster than llama.cpp.**
**The decoding speed is the same as KTrans V0.2 (6-expert version), so it is omitted.**

The main acceleration comes from
- the Intel AMX instruction set and our specially designed cache-friendly memory layout
@@ -53,9 +53,9 @@ when we slightly decrease the activation experts num in inference,
the output quality doesn't change, but the speed of decoding and prefill
is sped up, which is inspiring. So our showcase makes use of this finding*

## How to Run
### V0.2 Showcase
#### Single socket version (32 cores)
Our local_chat test command is:
``` shell
git clone https://github.com/kvcache-ai/ktransformers.git
@@ -64,10 +64,10 @@ numactl -N 1 -m 1 python ./ktransformers/local_chat.py --model_path <your model
<when you see chat, then press enter to load the text prompt_file>
```
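The diff viewer elides the middle of the block above (the hunk marker preserves only the start of the numactl line). For orientation, here is a hedged sketch of what the single-socket invocation could look like, borrowing flag names from the dual-socket command later in this report; the `cd` step and the `--cpu_infer 33` value are assumptions, not taken from the original file.

``` shell
# Hedged sketch, not the original command; flags mirror the dual-socket example below.
cd ktransformers                          # assumption: run from the cloned repo
numactl -N 1 -m 1 python ./ktransformers/local_chat.py \
  --model_path <your model path> \
  --gguf_path <your gguf path> \
  --prompt_file <your prompt txt file> \
  --cpu_infer 33                          # assumption: one 32-core socket, plus one
```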
\<your model path\> can be local or an online Hugging Face path like deepseek-ai/DeepSeek-V3. If the online path runs into connection problems, try the mirror (hf-mirror.com) <br>
\<your gguf path\> can also be online, but as it is large we recommend downloading it and quantizing the model to the format you want <br>
The command `numactl -N 1 -m 1` aims to avoid data transfer between NUMA nodes.
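As an optional aside (not part of the original instructions), numactl itself can report the topology, which helps when deciding which node to pin to:

``` shell
# Optional illustration: inspect the NUMA topology before pinning.
numactl --hardware      # lists NUMA nodes, their CPUs, and per-node memory
# In the command above, -N 1 binds execution to node 1 and -m 1 allocates memory from node 1 only.
```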
#### Dual socket version (64 cores)
Make sure before you install (using install.sh or `make dev_install`) that you set the env var `USE_NUMA=1` by `export USE_NUMA=1` (if already installed, reinstall it with this env var set) <br>
Our local_chat test command is:
``` shell
git clone https://github.com/kvcache-ai/ktransformers.git
@@ -77,12 +77,12 @@ make dev_install # or sh ./install.sh
python ./ktransformers/local_chat.py --model_path <your model path> --gguf_path <your gguf path> --prompt_file <your prompt txt file> --cpu_infer 65 --cache_lens 1536
<when you see chat, then press enter to load the text prompt_file>
```
The meaning of the parameters is the same, but since we use dual sockets, we set cpu_infer to 65.
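Putting the pieces together, a minimal end-to-end sketch of the dual-socket flow might look like the following; the `cd` step is an assumption and the install line is taken from the hunk context above, since the diff elides part of the original block.

``` shell
# Hedged end-to-end sketch of the dual-socket run; not verbatim from the original block.
git clone https://github.com/kvcache-ai/ktransformers.git
cd ktransformers                          # assumption
export USE_NUMA=1                         # must be set before installing; reinstall if it was not
make dev_install                          # or: sh ./install.sh
python ./ktransformers/local_chat.py \
  --model_path <your model path> \
  --gguf_path <your gguf path> \
  --prompt_file <your prompt txt file> \
  --cpu_infer 65 --cache_lens 1536
```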
## Some Explanations
1. We also want to make further use of our two NUMA nodes on the Xeon Gold CPU.
To avoid the cost of data transfer between nodes, we "copy" the critical matrix on
both nodes, which increases memory consumption but accelerates the prefill and decoding process.
However, this method takes huge memory and is slow when loading weights, so be patient when loading
and monitor the memory usage (we are considering making this method an option). We are going to optimize this huge memory overhead. Stay tuned~ <br>
2. The command arg `--cpu_infer 65` specifies how many cores to use (it's OK if it exceeds the physical number,
but more is not always better; adjust it down toward your actual number of cores, see the sketch below) <br>

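As a rough illustration of that advice, a hedged sketch follows; the lscpu pipeline and the plus-one offset are assumptions modeled on the 65-for-64-cores setting above, not commands from the original report.

``` shell
# Hedged illustration: derive --cpu_infer from the machine's physical core count.
# Counting unique (core, socket) pairs gives physical cores, ignoring hyperthreads.
PHYS_CORES=$(lscpu -p=Core,Socket | grep -v '^#' | sort -u | wc -l)
python ./ktransformers/local_chat.py --model_path <your model path> \
  --gguf_path <your gguf path> --cpu_infer $((PHYS_CORES + 1))
```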