Mirror of https://github.com/kvcache-ai/ktransformers.git (synced 2025-09-09 05:54:06 +00:00)
Merge pull request #667 from Azure-Tang/update-readme
[update] Update doc.
Commit 31bc990677
2 changed files with 24 additions and 6 deletions
@@ -16,7 +16,9 @@
- [Memory consumptions:](#memory-consumptions)
- [Benchmark results](#benchmark-results-2)
- [How to Run](#how-to-run)
- [V0.2.2 longer context](#v022-longer-context)
- [V0.2.2 longer context \& FP8 kernel](#v022-longer-context--fp8-kernel)
- [longer context](#longer-context)
- [FP8 kernel](#fp8-kernel)
- [V0.2 \& V0.2.1 Showcase](#v02--v021-showcase)
- [Single socket version (32 cores)](#single-socket-version-32-cores)
- [Dual socket version (64 cores)](#dual-socket-version-64-cores)
@@ -155,7 +157,11 @@ the output quality doesn't change. But the speed of decoding and prefill
is sped up, which is inspiring. So our showcase makes use of this finding*
## How to Run
### V0.2.2 longer context
### V0.2.2 longer context & FP8 kernel
#### longer context
To use this feature, [install flashinfer](https://github.com/flashinfer-ai/flashinfer) first.
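For reference, one typical install route is sketched below; the package name and whether a prebuilt wheel matches your CUDA/PyTorch setup are assumptions, so check the flashinfer repository for the current instructions.

```bash
# Sketch only: verify against the flashinfer repo for your CUDA / PyTorch version.
pip install flashinfer-python
```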
If you want to use long context (longer than 20K) for prefill, enable matrix absorption for MLA during the prefill phase, which will significantly reduce the size of the KV cache. Modify the yaml file like this:
```
- match:
@@ -167,6 +173,18 @@ If you want to use long context(longer than 20K) for prefill, enable the matrix
prefill_device: "cuda"
absorb_for_prefill: True # change this to True to enable long context (prefill may be slower).
```
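The diff only shows the edges of the rule, so for orientation here is a minimal sketch of what a complete attention rule carrying this flag can look like; the match regex and operator class below are assumptions based on the general shape of the DeepSeek optimize rules and may differ from the yaml you are editing.

```yaml
# Sketch only: the match pattern and class path are assumptions, adapt them to your rule file.
- match:
    name: "^model\\.layers\\..*\\.self_attn$"        # regex over module names (assumed)
  replace:
    class: ktransformers.operators.attention.KDeepseekV2Attention  # assumed attention operator
    kwargs:
      generate_device: "cuda"
      prefill_device: "cuda"
      absorb_for_prefill: True  # enable matrix absorption during prefill for long context
```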
#### FP8 kernel
The DeepSeek-AI team provides FP8 safetensors for the DeepSeek-R1/V3 models. We achieve performance optimization through the following work:
- **FP8 GPU Kernel Integration**: FP8 linear layer acceleration kernels integrated in KTransformers
- **Hybrid Quantization Architecture**:
  - Attention and Shared-Expert modules use FP8 precision (enhances computational accuracy)
  - Experts modules retain GGML quantization (GGUF format, residing on CPU to save GPU memory)
So those who are pursuing the best performance can use the FP8 linear kernel for DeepSeek-V3/R1.
The detailed guide is [here](./fp8_kernel.md).
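As a rough sketch of how this hybrid setup is expressed in the optimize rules (the class and op names here are assumptions and may not match the rule file shipped with the release; follow the linked guide for the authoritative yaml):

```yaml
# Sketch only: class and op names are assumptions, see fp8_kernel.md for the real rule file.
- match:
    name: "^model\\.layers\\..*\\.self_attn\\..*$"   # attention linear layers (assumed pattern)
  replace:
    class: ktransformers.operators.linear.KTransformersLinear  # assumed linear wrapper
    kwargs:
      generate_device: "cuda"
      prefill_device: "cuda"
      generate_op: "KLinearFP8"                      # assumed FP8 kernel op
      prefill_op: "KLinearTorch"
```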
### V0.2 & V0.2.1 Showcase
#### Single socket version (32 cores)
Our local_chat test command is:
fp8_kernel.md
@@ -1,4 +1,4 @@
# FP8 Linear Kernel for DeepSeek-V3
# FP8 Linear Kernel for DeepSeek-V3/R1
## Overview
The DeepSeek-AI team provides FP8 safetensors for the DeepSeek-R1/V3 models. We achieve performance optimization through the following work:
@@ -17,8 +17,8 @@ So those who are pursuing the best performance can use the FP8 linear kernel for
### Using Pre-Merged Weights
Pre-merged weights are available on Hugging Face:
[KVCache-ai/DeepSeek-V3](https://huggingface.co/KVCache-ai/DeepSeek-V3)
[KVCache-ai/DeepSeek-R1](https://huggingface.co/KVCache-ai/DeepSeek-R1)
[KVCache-ai/DeepSeek-V3-GGML-FP8-Hybrid](https://huggingface.co/KVCache-ai/DeepSeek-V3)
[KVCache-ai/DeepSeek-R1-GGML-FP8-Hybrid](https://huggingface.co/KVCache-ai/DeepSeek-R1)
> Please confirm the weights are fully uploaded before downloading. The large file size may extend Hugging Face upload time.
@@ -29,7 +29,7 @@ pip install -U huggingface_hub
# Optional: Use HF Mirror for faster downloads in certain regions.
# export HF_ENDPOINT=https://hf-mirror.com
huggingface-cli download --resume-download KVCache-ai/DeepSeek-V3 --local-dir <local_dir>
huggingface-cli download --resume-download KVCache-ai/DeepSeek-V3-GGML-FP8-Hybrid --local-dir <local_dir>
```
### Using Merge Scripts
If you have local DeepSeek-R1/V3 FP8 safetensors and Q4_K_M GGUF weights, you can merge them using the following scripts.
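The scripts themselves fall outside this hunk; purely as a hypothetical illustration of the kind of invocation involved (the script name, flags, and paths below are placeholders, not the actual CLI), a merge run might look like:

```bash
# Hypothetical invocation: script name, flags, and paths are placeholders.
# Use the actual merge script and options documented in fp8_kernel.md.
python merge_safetensor_gguf.py \
  --safetensor_path /models/DeepSeek-R1-FP8 \
  --gguf_path /models/DeepSeek-R1-Q4_K_M \
  --output_path /models/DeepSeek-R1-GGML-FP8-Hybrid
```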