From 3d7dfd61510db0cd9a8fb078cda84f2691f474ee Mon Sep 17 00:00:00 2001
From: liam
Date: Mon, 10 Feb 2025 11:12:52 +0800
Subject: [PATCH] :zap: fix typo

---
 doc/en/DeepseekR1_V3_tutorial.md | 38 ++++++++++++++++----------------
 1 file changed, 19 insertions(+), 19 deletions(-)

diff --git a/doc/en/DeepseekR1_V3_tutorial.md b/doc/en/DeepseekR1_V3_tutorial.md
index c837a6a..376ffa1 100644
--- a/doc/en/DeepseekR1_V3_tutorial.md
+++ b/doc/en/DeepseekR1_V3_tutorial.md
@@ -1,14 +1,14 @@
 # Report
 ## Prerequisites
 We run our best performance tests (V0.2) on
-CPU: Intel(R) Xeon(R) Gold 6454S 1T DRAM (2 NUMA nodes)
+CPU: Intel (R) Xeon (R) Gold 6454S 1T DRAM (2 NUMA nodes)
 GPU: 4090D 24G VRAM
 ## Bench Result
 ### V0.2
 #### Settings
-- Model: DeepseekV3-q4km(int4)
-- CPU: cpu_model_name:Intel (R) Xeon (R) Gold 6454S, 32 cores per socket, 2 socket, 2 numa nodes
-- GPU: 4090D 24GVRAM
+- Model: DeepseekV3-q4km (int4)
+- CPU: cpu_model_name: Intel (R) Xeon (R) Gold 6454S, 32 cores per socket, 2 sockets, 2 numa nodes
+- GPU: 4090D 24G VRAM
 - We test after enough warm up
 #### Memory consumption:
 - Single socket: 382G DRAM, at least 12G VRAM
@@ -16,7 +16,7 @@ GPU: 4090D 24G VRAM
 #### Benchmark Results
-"6 experts" case is part of v0.3's preview
+"6 experts" case is part of V0.3's preview
 | Prompt (500 tokens) | Dual socket Ktrans (6 experts) | Dual socket Ktrans (8 experts) | Single socket Ktrans (6 experts) | Single socket Ktrans (8 experts)| llama.cpp (8 experts) |
 | --- | --- | --- | --- | --- | --- |
@@ -28,7 +28,7 @@ GPU: 4090D 24G VRAM
 ### V0.3-Preview
 #### Settings
 - Model: DeepseekV3-BF16 (online quant into int8 for CPU and int4 for GPU)
-- CPU: cpu_model_name:Intel(R) Xeon(R) Gold 6454S, 32 cores per socket, 2 socket, 2 numa nodes
+- CPU: cpu_model_name: Intel (R) Xeon (R) Gold 6454S, 32 cores per socket, 2 socket, 2 numa nodes
 - GPU: (1~4)x 4090D 24GVRAM (requires more VRAM for longer prompt)
 #### Memory consumptions:
@@ -55,34 +55,34 @@ is speed up which is inspiring. So our showcase makes use of this finding*
 ## How to Run
 ### V0.2 Showcase
-#### Single socket version(32 cores)
-our local_chat test command is:
+#### Single socket version (32 cores)
+Our local_chat test command is:
 ``` shell
 git clone https://github.com/kvcache-ai/ktransformers.git
 cd ktransformers
-numactl -N 1 -m 1 python ./ktransformers/local_chat.py --model_path --gguf_path --prompt_file --cpu_infer 33 --cache_lens 1536
+numactl -N 1 -m 1 python ./ktransformers/local_chat.py --model_path --gguf_path --prompt_file --cpu_infer 33 --cache_lens 1536
 ```
-\ can be local or set from onlie hugging face like deepseek-ai/DeepSeek-V3. If onlie encounters connection problem, try use mirror(hf-mirror.com)
-\ can also be onlie, but as its large we recommend you download it and quantize the model to what you want
-The command numactl -N 1 -m 1 aims to adoid data transfer between numa nodes
-#### Dual socket version(64 cores)
+\ can be local or set from online hugging face like deepseek-ai/DeepSeek-V3. If online encounters connection problem, try use mirror (hf-mirror.com)
+\ can also be online, but as its large we recommend you download it and quantize the model to what you want
+The command numactl -N 1 -m 1 aims to advoid data transfer between numa nodes
+#### Dual socket version (64 cores)
 Make suer before you install (use install.sh or `make dev_install`), setting the env var `USE_NUMA=1` by `export USE_NUMA=1` (if already installed, reinstall it with this env var set)
-our local_chat test command is:
+Our local_chat test command is:
 ``` shell
 git clone https://github.com/kvcache-ai/ktransformers.git
 cd ktransformers
 export USE_NUMA=1
 make dev_install # or sh ./install.sh
-python ./ktransformers/local_chat.py --model_path --gguf_path --prompt_file --cpu_infer 65 --cache_lens 1536
+python ./ktransformers/local_chat.py --model_path --gguf_path --prompt_file --cpu_infer 65 --cache_lens 1536
 ```
-The parameters meaning is the same. But As we use dual socket, so we set cpu_infer to 65
+The parameters' meaning is the same. But As we use dual socket, we set cpu_infer to 65
 ## Some Explanations
 1. Also we want to make further use of our two NUMA nodes on Xeon Gold cpu. To avoid the cost of data transfer between nodes, we "copy" the critical matrix on both nodes which takes more memory consumption but accelerates the prefill and decoding process. But this method takes huge memory and slow when loading weights, So be patient when loading
-and monitor the memory usage.(we are considering to make this method as an option). We are going to optimize this huge memory overhead. Stay tuned~
-2. The command args `--cpu_infer 65` specifies how many cores to use(it's ok that it exceeds the physical number,
-but it's not the more the better. Adjust it slightly lower to your actual number of cores)
+and monitor the memory usage. (we are considering to make this method as an option). We are going to optimize this huge memory overhead. Stay tuned~
+2. The command args `--cpu_infer 65` specifies how many cores to use (it's ok that it exceeds the physical number,
+but it's not the more the better. Adjust it slightly lower to your actual number of cores)
\ No newline at end of file