⚡ improve readme
This commit is contained in:
parent fd8037cda1
commit 6dd4fa0e87
2 changed files with 12 additions and 16 deletions

@@ -23,7 +23,7 @@ Our vision for KTransformers is to serve as a flexible platform for experimenting

<h2 id="Updates">🔥 Updates</h2>

-* **Fed 10, 2025**: Support DeepseekR1 and V3 on single (24GB VRAM)/multi gpu and 382G DRAM, up to XXX speedup. The Detailed tutorial is [here](./doc/en/DeepseekR1_V3_tutorial.md)
+* **Feb 10, 2025**: Support DeepseekR1 and V3 on single (24GB VRAM)/multi-GPU and 382G DRAM, up to 3~64x speedup. The detailed tutorial is [here](./doc/en/DeepseekR1_V3_tutorial.md).
* **Aug 28, 2024**: Support 1M context under the InternLM2.5-7B-Chat-1M model, utilizing 24GB of VRAM and 150GB of DRAM. The detailed tutorial is [here](./doc/en/long_context_tutorial.md).
* **Aug 28, 2024**: Decrease DeepseekV2's required VRAM from 21G to 11G.
* **Aug 15, 2024**: Update detailed [TUTORIAL](doc/en/injection_tutorial.md) for injection and multi-GPU.
@@ -50,7 +50,7 @@ https://github.com/user-attachments/assets/ebd70bfa-b2c1-4abb-ae3b-296ed38aa285

- Compared to 4.51 tokens/s in llama.cpp with 2×32 cores, achieving up to **3.03× speedup**.
- Upcoming Open Source Release:
  - AMX optimizations and selective expert activation will be open-sourced in v0.3.
-  - Currently available only in preview binary distribution, which can be found here.
+  - Currently available only in preview binary distribution, which can be found [here](xxx).

- **Local 236B DeepSeek-Coder-V2:** Running its Q4_K_M version using only 21GB VRAM and 136GB DRAM, attainable on a local desktop machine, which scores even better than GPT4-0613 in [BigCodeBench](https://huggingface.co/blog/leaderboard-bigcodebench).
@@ -1,8 +1,9 @@
-## prerequisites
+# Report
+## Prerequisites
We run our best performance tests on:<br>
CPU: Intel(R) Xeon(R) Gold 6454S, 1T DRAM (2 NUMA nodes)<br>
GPU: 4090D, 24G VRAM<br>
-## bench result
+## Bench result
### V0.2
#### Settings
- model: DeepseekV3-q4km (int4)<br>
@@ -17,12 +18,12 @@ gpu: 4090D 24G VRAM <br>

The "6 experts" case is part of v0.3's preview.

-| Prompt<br>(500 tokens) | Dual socket Ktrans (6 experts) | Dual socket Ktrans (8 experts) | Single socket Ktrans (6 experts) | Single socket Ktrans (8 experts)| Llama (8 experts) |
+| Prompt<br>(500 tokens) | Dual socket Ktrans (6 experts) | Dual socket Ktrans (8 experts) | Single socket Ktrans (6 experts) | Single socket Ktrans (8 experts) | llama.cpp (8 experts) |
| --- | --- | --- | --- | --- | --- |
| Prefill token/s | 97.32 | 82.94 | 65.14 | 54.21 | 10.31 |
| Decode token/s | 13.69 | 12.208 | 10.303 | 8.73 | 4.51 |

-**The highest speedup reaches up to <u>x3.03</u> in decoding and <u>x9.44</u> in prefill.**
+**The highest speedup reaches up to <u>3.03x</u> in decoding and <u>9.44x</u> in prefill.**
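
These headline ratios follow directly from the table above, comparing dual-socket KTrans with 6 experts against llama.cpp with 8 experts:

13.69 / 4.51 ≈ 3.03 (decode) and 97.32 / 10.31 ≈ 9.44 (prefill)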

### V0.3-Preview
#### Settings
@@ -39,7 +40,7 @@ gpu: 4090D 24G VRAM <br>

| KTrans (8 experts) Prefill token/s | 185.96 | 255.26 | 252.58 | 195.62 |
| KTrans (6 experts) Prefill token/s | 203.70 | 286.55 | 271.08 | 207.20 |

-**The prefill of KTrans V0.3 is up to <u>x3.45</u> times faster than KTrans V0.2, and is up to <u>x63.53</u> times faster than Llama.**
+**The prefill of KTrans V0.3 is up to <u>3.45x</u> faster than KTrans V0.2, and up to <u>63.53x</u> faster than llama.cpp.**
**The decoding speed is the same as KTrans V0.2 (6 experts version), so it is omitted.**

The main acceleration comes from
@@ -72,15 +73,10 @@ python ./ktransformers/local_chat.py --model_path <your model path> --gguf_path
```
The parameters have the same meaning as above, but since we use a dual-socket machine, we set cpu_infer to 65.
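As a minimal sketch, the dual-socket invocation described here would look roughly like the following; the paths are placeholders, and only flags already mentioned in this document (`--model_path`, `--gguf_path`, `--cpu_infer`) are used:

```bash
# Sketch only: substitute your own model and GGUF paths.
python ./ktransformers/local_chat.py \
  --model_path <your model path> \
  --gguf_path <your gguf path> \
  --cpu_infer 65   # chosen for the dual-socket Xeon setup described in this report
```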

## Some explanations

-1. From our perspective on DeepSeekV2, DeepSeekV3 and DeepSeekR1,
-when we slightly decrease the activation experts num in inference,
-the output quality doesn't change(within 1% accuracy drop),But the speed of decoding and prefill
-is speed up about 30% which is inspiring. So our showcase makes use of this finding,
-changing the activation experts of DeepSeekV3/R1 from 8 to 6. <br>
-2. Also we want to make further use of our two NUMA nodes on Xeon Gold cpu.
+1. We want to make further use of the two NUMA nodes on the Xeon Gold CPU.
To avoid the cost of data transfer between nodes, we "copy" the critical matrices on
both nodes, which consumes more memory but accelerates the prefill and decoding process.
However, this method uses a lot of memory and is slow when loading weights, so be patient when loading
-and monitor the memory usage.(we are considering to make this method as an option)<br>
-3. the command args `--cpu_infer 65` specifies how many cores to use(it's ok that it exceeds the physical number,
-but it's not the more the better. Adjust it slight lower to your actual number of cores)<br>
+and monitor the memory usage (we are considering making this method optional). We plan to optimize this memory overhead; stay tuned.<br>
+2. The command argument `--cpu_infer 65` specifies how many cores to use (it is fine if this exceeds the number of physical cores,
+but more is not always better; adjust it to slightly below your actual core count, as shown in the sketch below).<br>
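
The two notes above lean on knowing your NUMA layout and physical core count. On Linux these can be checked with ordinary system utilities (numactl, lscpu, and nproc are standard tools, not part of ktransformers), for example:

```bash
# Show NUMA topology: how many nodes there are and which CPUs/memory belong to each.
numactl --hardware

# Physical layout: sockets and cores per socket (physical cores = sockets x cores per socket).
lscpu | grep -E '^(Socket|Core)'

# Logical CPU count; pick --cpu_infer close to your physical core count rather than far above it.
nproc
```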