update fp8 kernel tutorial
This commit is contained in: parent ca7366d2db, commit 4dc5518e4d
7 changed files with 46 additions and 5 deletions
@@ -5,10 +5,11 @@
 - [Installation Guide](en/install.md)

 # Tutorial
-- [Deepseek-R1/V3 Show Case](en/DeepseekR1_V3_tutorial.md)
+- [Deepseek-R1/V3 Show Case/Tutorial](en/DeepseekR1_V3_tutorial.md)
 - [Why KTransformers So Fast](en/deepseek-v2-injection.md)
 - [Injection Tutorial](en/injection_tutorial.md)
 - [Multi-GPU Tutorial](en/multi-gpu-tutorial.md)
+- [Using FP8 GPU Kernel](../merge_tensors/README.md)

 # Server
 - [Server](en/api/server/server.md)
 - [Website](en/api/server/website.md)
@@ -55,7 +55,7 @@ You have to set `--cpu_infer` to the number of cores you want to use. The more c

 ### Q: My DeepSeek-R1 model is not thinking.

-According to DeepSeek, you need to enforce the model to initiate its response with "\<think>\n" at the beginning of every output by passing the arg `--force_think true`.
+According to DeepSeek, you need to enforce the model to initiate its response with "\<think>\n" at the beginning of every output by passing the arg `--force_think True`.

 ### Q: Loading gguf error
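For context beyond the diff: a typical launch with thinking enforced might look like the sketch below. The model id and GGUF folder are placeholders; only the `--force_think` flag comes from the FAQ entry above.

```shell
# Hypothetical invocation; model path and gguf folder are placeholders.
python ktransformers/local_chat.py \
  --model_path deepseek-ai/DeepSeek-R1 \
  --gguf_path <R1 gguf folder> \
  --force_think True
```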
@@ -63,9 +63,12 @@ Make sure you:
 1. Have the `gguf` file in the `--gguf_path` directory.
 2. The directory only contains gguf files from one model. If you have multiple models, you need to separate them into different directories.
 3. The folder name itself should not end with `.gguf`, e.g. `Deep-gguf` is correct, `Deep.gguf` is wrong.
+4. The file itself is not corrupted; you can verify this by checking that the sha256sum matches the one from huggingface, modelscope, or hf-mirror.

 ### Q: Version `GLIBCXX_3.4.30' not found
 The detailed error:
 >ImportError: /mnt/data/miniconda3/envs/xxx/bin/../lib/libstdc++.so.6: version `GLIBCXX_3.4.30' not found (required by /home/xxx/xxx/ktransformers/./cpuinfer_ext.cpython-312-x86_64-linux-gnu.so)

 Running `conda install -c conda-forge libstdcxx-ng` can solve the problem.
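For reference beyond the diff: the integrity and GLIBCXX checks mentioned above can be run roughly as follows; the gguf file name and conda environment path are placeholders.

```shell
# Hypothetical checks; substitute your own gguf file and conda environment path.
sha256sum <model>.gguf                                              # compare against the checksum published on huggingface/modelscope/hf-mirror
strings <conda env path>/lib/libstdc++.so.6 | grep GLIBCXX_3.4.30   # no output means the required symbol version is missing
```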
@@ -59,6 +59,7 @@ Supported operators and their corresponding classes are as follows:
 | Linear | KTransformersLinear | KLinearMarlin | Marlin as backend |
 | | | KLinearTorch | pytorch as backend |
 | | | KLinearCPUInfer | llamafile as backend |
+| | | KLinearFP8 | Triton fp8_gemm kernel. Requires a GPU that can compute fp8 data |
 | experts | KTransformersExperts | KExpertsTorch | pytorch as backend |
 | | | KExpertsMarlin | Marlin as backend |
 | | | KExpertsCPU | llamafile as backend |
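As an aside, not part of the commit: whether a GPU can run the fp8 kernel can be probed roughly as below; treating compute capability (8, 9) or newer as fp8-capable is an assumption, not something this table states.

```shell
# Rough probe; the (8, 9) threshold (Ada/Hopper) for native fp8 support is an assumption.
python -c "import torch; cap = torch.cuda.get_device_capability(); print(cap, 'fp8 likely supported' if cap >= (8, 9) else 'fp8 likely unsupported')"
```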
@@ -340,7 +340,7 @@ class TransformersInterface(BackendInterfaceBase):
             sm_scale=(self.model.config.qk_rope_head_dim + self.model.config.qk_nope_head_dim) ** (-0.5), q_data_type=torch.bfloat16, kv_data_type=torch.bfloat16)
             next_token = self.decode_one_tokens()
             self.profiler.inc("decode")
-            if next_token == self.tokenizer.eos_token_id:
+            if next_token == self.tokenizer.eos_token_id or "<|im_end|>" == self.tokenizer.decode(next_token):
                 assert self.args.batch_size == 1
                 break
             yield self.append_new_tokens(next_token)
merge_tensors/README.md (new file, 36 lines)
@@ -0,0 +1,36 @@
+# FP8 Linear Kernel
+
+For DeepSeek-R1/V3, the DeepSeek-AI team provides fp8 safetensors. We have integrated the FP8 GPU kernel into KTransformers, but to keep the experts on the CPU and save GPU memory, we still use ggml (GGUF tensor) quantization for the experts. This way, attention is computed at higher precision, which may improve the model's performance.
+
+Therefore, to use the fp8 linear kernel, we need to merge the fp8 weights and the gguf files. We have provided prepared weights on huggingface so that you can use them directly:
+
+[KVCache-ai/DeepSeek-V3](https://huggingface.co/KVCache-ai/DeepSeek-V3/upload/main)
+
+If you want to use other formats of ggml quantization, you can use the following script to merge them.
+
+## Example
+
+To use the fp8 linear kernel and q4km experts:
+
+```shell
+python convert_model.py \
+  --safetensor_path <fp8 safetensor path> \
+  --gguf_path <q4km gguf folder path> \
+  --output_path <output path>
+```
+
+* `--safetensor_path`: input path of the safetensor file
+* `--gguf_path`: input path of the gguf folder
+* `--output_path`: output path of the merged file
+
+## To Run DeepSeek-V3 with fp8 linear kernel and q4km experts
+
+```shell
+python ktransformers/local_chat.py --model_path deepseek-ai/DeepSeek-V3 --gguf_path <new weights folder> --optimize_rule_path ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat-fp8-linear-ggml-experts.yaml --cpu_infer <cpu cores + 1>
+```
+
+> NOTES:
+> 1. Using the fp8 linear kernel and q4km experts consumes approximately 19GB of GPU memory.
+> 2. We know the new way of loading the module is ugly; we are working on it.
+> 3. Though the model is a mixture of fp8 and ggml, the merged weights are stored in .safetensors files. Please pass the folder path of the new weights to `--gguf_path`.
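Beyond the README itself: the prepared weights linked above can presumably be fetched with the standard huggingface-cli tool, roughly as sketched below; the target directory name is a placeholder and the repo's file layout is an assumption.

```shell
# Hypothetical download sketch; local directory name and repo layout are assumptions.
pip install -U "huggingface_hub[cli]"
huggingface-cli download KVCache-ai/DeepSeek-V3 --local-dir ./DeepSeek-V3-fp8-q4km
```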
@@ -3,7 +3,6 @@
 import os
 # insert the path of the project
 import sys
-sys.path.insert(0, "/home/azure/ktransformers")
 import argparse
 import torch
 from ktransformers.util.custom_gguf import GGUFLoader, translate_name_to_gguf
@@ -4,4 +4,5 @@ numpy
 torch>=2.3.0
 packaging
 cpufeature
 protobuf
+tiktoken