Mirror of https://github.com/kvcache-ai/ktransformers.git (synced 2026-04-28 11:49:51 +00:00)

⚡ improve readme

commit 6dd4fa0e87 (parent fd8037cda1)
2 changed files with 12 additions and 16 deletions

@@ -1,8 +1,9 @@
-## prerequisites
+# Report
+## Prerequisites
 We run our best performance tests on <br>
 cpu: Intel(R) Xeon(R) Gold 6454S 1T DRAM(2 NUMA nodes)<br>
 gpu: 4090D 24G VRAM <br>
-## bench result
+## Bench result
 ### V0.2
 #### settings
 - model: DeepseekV3-q4km(int4)<br>

@@ -17,12 +18,12 @@ gpu: 4090D 24G VRAM <br>
 
 "6 experts" case is part of v0.3's preview
 
-| Prompt<br>(500 tokens) | Dual socket Ktrans (6 experts) | Dual socket Ktrans (8 experts) | Single socket Ktrans (6 experts) | Single socket Ktrans (8 experts)| Llama (8 experts) |
+| Prompt<br>(500 tokens) | Dual socket Ktrans (6 experts) | Dual socket Ktrans (8 experts) | Single socket Ktrans (6 experts) | Single socket Ktrans (8 experts) | llama.cpp (8 experts) |
 | --- | --- | --- | --- | --- | --- |
 | Prefill token/s | 97.32 | 82.94 | 65.14 | 54.21 | 10.31 |
 | Decode token/s | 13.69 | 12.208 | 10.303 | 8.73 | 4.51 |
 
-**The highest speedup reaches up to <u>x3.03</u> in decoding and <u>x9.44</u> in prefill.**
+**The highest speedup reaches up to <u>3.03x</u> in decoding and <u>9.44x</u> in prefill.**
 
 ### V0.3-Preview
 #### settings

@@ -39,7 +40,7 @@ gpu: 4090D 24G VRAM <br>
 | KTrans (8 experts) Prefill token/s | 185.96 | 255.26 | 252.58 | 195.62 |
 | KTrans (6 experts) Prefill token/s | 203.70 | 286.55 | 271.08 | 207.20 |
 
-**The prefill of KTrans V0.3 is up to <u>x3.45</u> times faster than KTrans V0.2, and is up to <u>x63.53</u> times faster than Llama.**
+**The prefill of KTrans V0.3 is up to <u>3.45x</u> faster than KTrans V0.2, and up to <u>63.53x</u> faster than llama.cpp.**
 **The decoding speed is the same as KTrans V0.2 (6 experts version), so it is omitted.**
 
 The main acceleration comes from

@@ -72,15 +73,10 @@ python ./ktransformers/local_chat.py --model_path <your model path> --gguf_path
 ```
 The parameters have the same meaning, but since we use a dual-socket machine we set cpu_infer to 65.
 ## Some explanations
 1. In our experience with DeepSeekV2, DeepSeekV3 and DeepSeekR1, slightly decreasing the number of
 activated experts at inference time does not change the output quality (within a 1% accuracy drop),
 but decoding and prefill speed up by about 30%, which is encouraging. Our showcase makes use of this
 finding and changes the activated experts of DeepSeekV3/R1 from 8 to 6 (a config sketch follows this list). <br>
-2. Also we want to make further use of our two NUMA nodes on Xeon Gold cpu.
+1. We also want to make further use of our two NUMA nodes on the Xeon Gold CPU.
 To avoid the cost of data transfer between nodes, we "copy" the critical matrices on
 both nodes, which consumes more memory but accelerates the prefill and decoding process
 (a NUMA-inspection sketch follows this list).
 This method uses a lot of memory and is slow when loading weights, so be patient when loading
-and monitor the memory usage.(we are considering to make this method as an option)<br>
-3. the command args `--cpu_infer 65` specifies how many cores to use(it's ok that it exceeds the physical number,
-but it's not the more the better. Adjust it slight lower to your actual number of cores)<br>
+and monitor the memory usage (we are considering making this method optional). We are going to optimize this huge memory overhead. Stay tuned~ <br>
+2. The command arg `--cpu_infer 65` specifies how many cores to use (it's OK if it exceeds the physical number,
+but more is not always better; adjust it to slightly below your actual core count, as in the core-counting sketch after this list)<br>
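
Below, a minimal sketch of the 6-expert change from item 1. Two assumptions are mine rather than the report's: that the DeepSeek-V3/R1 Hugging Face `config.json` exposes the activated-expert count as `num_experts_per_tok`, and that KTransformers reads that field from the directory given by `--model_path`. The authors may have used a different switch, so treat this purely as an illustration.

```bash
# Hypothetical illustration, not from the KTransformers docs: patch a local copy of the
# model's config.json so that 6 experts are activated per token instead of 8.
# Requires jq; keeps a backup of the original file.
CFG="<your model path>/config.json"   # replace the placeholder with your local model dir
cp "$CFG" "$CFG.bak"
jq '.num_experts_per_tok = 6' "$CFG.bak" > "$CFG"
jq '.num_experts_per_tok' "$CFG"      # should now print 6
```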
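
For the NUMA duplication note, a small sketch of how one might inspect the topology and watch memory while the weights load. `numactl`, `watch`, and `free` are standard Linux tools, not part of KTransformers.

```bash
# Show the two NUMA nodes and how much memory each one holds locally; the report's
# "copy the critical matrix on both nodes" trick stores those matrices on both nodes.
numactl --hardware

# Watch system memory while the weights load, since the duplication is memory-hungry.
watch -n 5 free -h
```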
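
Finally, a core-counting sketch for picking `--cpu_infer`. The report uses 65 on its dual-socket Xeon Gold machine; deriving the value from `lscpu` is our suggestion, not the project's documented procedure, and the `<...>` placeholders must be replaced with real paths, as in the command shown earlier.

```bash
# Count physical cores (unique core,socket pairs) and pass a value close to that count.
PHYS_CORES=$(lscpu -p=Core,Socket | grep -v '^#' | sort -u | wc -l)
echo "physical cores: ${PHYS_CORES}"

python ./ktransformers/local_chat.py \
  --model_path <your model path> \
  --gguf_path <your gguf path> \
  --cpu_infer "${PHYS_CORES}"
```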