improve readme

This commit is contained in:
liam 2025-02-10 09:38:26 +08:00
parent fd8037cda1
commit 6dd4fa0e87
2 changed files with 12 additions and 16 deletions

@@ -23,7 +23,7 @@ Our vision for KTransformers is to serve as a flexible platform for experimenting
<h2 id="Updates">🔥 Updates</h2>
* **Feb 10, 2025**: Support DeepseekR1 and V3 on single (24GB VRAM)/multi GPU and 382GB DRAM, up to XXX speedup. The detailed tutorial is [here](./doc/en/DeepseekR1_V3_tutorial.md)
* **Feb 10, 2025**: Support DeepseekR1 and V3 on single (24GB VRAM)/multi GPU and 382GB DRAM, up to 3~64x speedup. The detailed tutorial is [here](./doc/en/DeepseekR1_V3_tutorial.md)
* **Aug 28, 2024**: Support 1M context under the InternLM2.5-7B-Chat-1M model, utilizing 24GB of VRAM and 150GB of DRAM. The detailed tutorial is [here](./doc/en/long_context_tutorial.md).
* **Aug 28, 2024**: Decrease DeepseekV2's required VRAM from 21G to 11G.
* **Aug 15, 2024**: Update detailed [TUTORIAL](doc/en/injection_tutorial.md) for injection and multi-GPU.
@@ -50,7 +50,7 @@ https://github.com/user-attachments/assets/ebd70bfa-b2c1-4abb-ae3b-296ed38aa285
- Compared to 4.51 tokens/s in llama.cpp with 2×32 cores, achieving up to **3.03× speedup**.
- Upcoming Open Source Release:
- AMX optimizations and selective expert activation will be open-sourced in v0.3.
- Currently available only in a preview binary distribution, which can be found here.
- Currently available only in a preview binary distribution, which can be found [here](xxx).
- **Local 236B DeepSeek-Coder-V2:** Running its Q4_K_M version using only 21GB VRAM and 136GB DRAM, attainable on a local desktop machine, which scores even better than GPT4-0613 in [BigCodeBench](https://huggingface.co/blog/leaderboard-bigcodebench).

@@ -1,8 +1,9 @@
## prerequisites
# Report
## Prerequisites
We run our best performance tests on <br>
cpu: Intel(R) Xeon(R) Gold 6454S, 1TB DRAM (2 NUMA nodes)<br>
gpu: 4090D 24G VRAM <br>
## bench result
## Bench result
### V0.2
#### settings
- model: DeepseekV3-q4kmint4<br>
@@ -17,12 +18,12 @@ gpu: 4090D 24G VRAM <br>
"6 experts" case is part of v0.3's preview
| Prompt<br>(500 tokens) | Dual socket Ktrans (6 experts) | Dual socket Ktrans (8 experts) | Single socket Ktrans (6 experts) | Single socket Ktrans (8 experts) | Llama (8 experts) |
| Prompt<br>(500 tokens) | Dual socket Ktrans (6 experts) | Dual socket Ktrans (8 experts) | Single socket Ktrans (6 experts) | Single socket Ktrans (8 experts) | llama.cpp (8 experts) |
| --- | --- | --- | --- | --- | --- |
| Prefill token/s | 97.32 | 82.94 | 65.14 | 54.21 | 10.31 |
| Decode token/s | 13.69 | 12.208 | 10.303 | 8.73 | 4.51 |
**The highest speedup reaches up to <u>x3.03</u> in decoding and <u>x9.44</u> in prefill.**
**The highest speedup reaches up to <u>3.03x</u> in decoding (13.69 vs. 4.51 tokens/s) and <u>9.44x</u> in prefill (97.32 vs. 10.31 tokens/s), both relative to llama.cpp.**
### V0.3-Preview
#### settings
@@ -39,7 +40,7 @@ gpu: 4090D 24G VRAM <br>
| KTrans (8 experts) Prefill token/s | 185.96 | 255.26 | 252.58 | 195.62 |
| KTrans (6 experts) Prefill token/s | 203.70 | 286.55 | 271.08 | 207.20 |
**The prefill of KTrans V0.3 is up to <u>x3.45</u> faster than KTrans V0.2, and up to <u>x63.53</u> faster than Llama.**
**The prefill of KTrans V0.3 is up to <u>3.45x</u> faster than KTrans V0.2, and up to <u>63.53x</u> faster than llama.cpp.**
**The decoding speed is the same as KTrans V0.2 (6 experts version), so it is omitted.**
The main acceleration comes from
@@ -72,15 +73,10 @@ python ./ktransformers/local_chat.py --model_path <your model path> --gguf_path
```
The parameters have the same meaning as before, but since we use a dual-socket machine, we set cpu_infer to 65.
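For reference, a fully spelled-out dual-socket invocation might look like the sketch below. The paths are placeholders, and only the flags already mentioned in this document (`--model_path`, `--gguf_path`, `--cpu_infer`) are shown; any other options are left at their defaults.

```bash
# Sketch of a dual-socket run: 65 CPU inference threads; replace the placeholders with real paths
python ./ktransformers/local_chat.py \
  --model_path <your model path> \
  --gguf_path <your gguf path> \
  --cpu_infer 65
```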
## Some explanations
1. From our observations of DeepSeekV2, DeepSeekV3 and DeepSeekR1:
when we slightly decrease the number of activated experts during inference,
the output quality does not change (within a 1% accuracy drop), but decoding and prefill
speed up by about 30%, which is encouraging. Our showcase therefore makes use of this finding,
changing the number of activated experts of DeepSeekV3/R1 from 8 to 6. <br>
2. We also want to make further use of the two NUMA nodes on the Xeon Gold CPU.
1. We also want to make further use of the two NUMA nodes on the Xeon Gold CPU.
To avoid the cost of data transfer between nodes, we "copy" the critical matrices onto
both nodes, which consumes more memory but accelerates the prefill and decoding process.
However, this method uses a lot of memory and is slow when loading weights, so be patient during loading
and monitor the memory usage. (We are considering making this method optional.)<br>
3. The command-line argument `--cpu_infer 65` specifies how many cores to use (it is fine if this exceeds the physical number,
but more is not always better; adjust it slightly lower, toward your actual number of cores).<br>
and monitor the memory usage. (We are considering making this method optional.) We are also going to optimize this large memory overhead. Stay tuned! <br>
2. The command-line argument `--cpu_infer 65` specifies how many cores to use (it is fine if this exceeds the physical number,
but more is not always better; adjust it slightly lower, toward your actual number of cores).<br>
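As a practical aside (not part of the original tutorial), the physical core count and NUMA layout that the two points above refer to can be checked with standard Linux tools before choosing a `--cpu_infer` value. A minimal sketch, assuming `lscpu` and `numactl` are installed:

```bash
# Physical cores = Socket(s) x Core(s) per socket; Thread(s) per core shows hyper-threading
lscpu | grep -E 'Socket\(s\)|Core\(s\) per socket|Thread\(s\) per core'

# NUMA node layout and per-node memory, relevant to the weight-copying note above
numactl --hardware
```

On the dual-socket setup described here (2×32 cores), that gives 64 physical cores, which is roughly where the `--cpu_infer 65` value above comes from.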