Mirror of https://github.com/kvcache-ai/ktransformers.git (synced 2026-04-28 11:49:51 +00:00)

⚡ improve readme

commit 6dd4fa0e87 (parent fd8037cda1)
2 changed files with 12 additions and 16 deletions

@@ -1,8 +1,9 @@
-## prerequisites
+# Report
+## Prerequisites
 We run our best performance tests on <br>
 cpu: Intel(R) Xeon(R) Gold 6454S 1T DRAM(2 NUMA nodes)<br>
 gpu: 4090D 24G VRAM <br>
-## bench result
+## Bench result
 ### V0.2
 #### settings
 - model: DeepseekV3-q4km(int4)<br>

@@ -17,12 +18,12 @@ gpu: 4090D 24G VRAM <br>
 
 "6 experts" case is part of v0.3's preview
 
-| Prompt<br>(500 tokens) | Dual socket Ktrans (6 experts) | Dual socket Ktrans (8 experts) | Single socket Ktrans (6 experts) | Single socket Ktrans (8 experts)| Llama (8 experts) |
+| Prompt<br>(500 tokens) | Dual socket Ktrans (6 experts) | Dual socket Ktrans (8 experts) | Single socket Ktrans (6 experts) | Single socket Ktrans (8 experts) | llama.cpp (8 experts) |
 | --- | --- | --- | --- | --- | --- |
 | Prefill token/s | 97.32 | 82.94 | 65.14 | 54.21 | 10.31 |
 | Decode token/s | 13.69 | 12.208 | 10.303 | 8.73 | 4.51 |
 
-**The highest speedup reaches up to <u>x3.03</u> in decoding and <u>x9.44</u> in prefill.**
+**The highest speedup reaches up to <u>3.03x</u> in decoding and <u>9.44x</u> in prefill.**
 
 ### V0.3-Preview
 #### settings

@@ -39,7 +40,7 @@ gpu: 4090D 24G VRAM <br>
 | KTrans (8 experts) Prefill token/s | 185.96 | 255.26 | 252.58 | 195.62 |
 | KTrans (6 experts) Prefill token/s | 203.70 | 286.55 | 271.08 | 207.20 |
 
-**The prefill of KTrans V0.3 is up to <u>x3.45</u> times faster than KTrans V0.2, and is up to <u>x63.53</u> times faster than Llama.**
+**The prefill of KTrans V0.3 is up to <u>3.45x</u> faster than KTrans V0.2, and up to <u>63.53x</u> faster than llama.cpp.**
 **The decoding speed is the same as KTrans V0.2 (6 experts version), so it is omitted.**
 
 The main acceleration comes from

@@ -72,15 +73,10 @@ python ./ktransformers/local_chat.py --model_path <your model path> --gguf_path
 ```
 The parameters have the same meaning, but since we use a dual-socket machine we set cpu_infer to 65.
 ## Some explanations
 1. In our experience with DeepSeekV2, DeepSeekV3 and DeepSeekR1, slightly decreasing the number of
 activated experts at inference time does not change the output quality (within a 1% accuracy drop),
 but decoding and prefill speed up by about 30%, which is encouraging. Our showcase makes use of this
 finding and changes the activated experts of DeepSeekV3/R1 from 8 to 6 (a config sketch follows this list). <br>
-2. Also we want to make further use of our two NUMA nodes on Xeon Gold cpu.
+1. We also want to make further use of our two NUMA nodes on the Xeon Gold CPU.
 To avoid the cost of data transfer between nodes, we "copy" the critical matrices on
 both nodes, which consumes more memory but accelerates the prefill and decoding process
 (a NUMA-inspection sketch follows this list).
 This method uses a lot of memory and is slow when loading weights, so be patient when loading
-and monitor the memory usage.(we are considering to make this method as an option)<br>
-3. the command args `--cpu_infer 65` specifies how many cores to use(it's ok that it exceeds the physical number,
-but it's not the more the better. Adjust it slight lower to your actual number of cores)<br>
+and monitor the memory usage (we are considering making this method optional). We are going to optimize this huge memory overhead. Stay tuned~ <br>
+2. The command arg `--cpu_infer 65` specifies how many cores to use (it's OK if it exceeds the physical number,
+but more is not always better; adjust it to slightly below your actual core count, as in the core-counting sketch after this list)<br>
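
Below, a minimal sketch of the 6-expert change from item 1. Two assumptions are mine rather than the report's: that the DeepSeek-V3/R1 Hugging Face `config.json` exposes the activated-expert count as `num_experts_per_tok`, and that KTransformers reads that field from the directory given by `--model_path`. The authors may have used a different switch, so treat this purely as an illustration.

```bash
# Hypothetical illustration, not from the KTransformers docs: patch a local copy of the
# model's config.json so that 6 experts are activated per token instead of 8.
# Requires jq; keeps a backup of the original file.
CFG="<your model path>/config.json"   # replace the placeholder with your local model dir
cp "$CFG" "$CFG.bak"
jq '.num_experts_per_tok = 6' "$CFG.bak" > "$CFG"
jq '.num_experts_per_tok' "$CFG"      # should now print 6
```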
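
For the NUMA duplication note, a small sketch of how one might inspect the topology and watch memory while the weights load. `numactl`, `watch`, and `free` are standard Linux tools, not part of KTransformers.

```bash
# Show the two NUMA nodes and how much memory each one holds locally; the report's
# "copy the critical matrix on both nodes" trick stores those matrices on both nodes.
numactl --hardware

# Watch system memory while the weights load, since the duplication is memory-hungry.
watch -n 5 free -h
```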
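
Finally, a core-counting sketch for picking `--cpu_infer`. The report uses 65 on its dual-socket Xeon Gold machine; deriving the value from `lscpu` is our suggestion, not the project's documented procedure, and the `<...>` placeholders must be replaced with real paths, as in the command shown earlier.

```bash
# Count physical cores (unique core,socket pairs) and pass a value close to that count.
PHYS_CORES=$(lscpu -p=Core,Socket | grep -v '^#' | sort -u | wc -l)
echo "physical cores: ${PHYS_CORES}"

python ./ktransformers/local_chat.py \
  --model_path <your model path> \
  --gguf_path <your gguf path> \
  --cpu_infer "${PHYS_CORES}"
```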