Update doc

commit 36fbeee341
parent 4dc5518e4d
Author: Azure
Date: 2025-02-25 08:21:18 +00:00

11 changed files with 101 additions and 59 deletions


@@ -23,7 +23,8 @@ Our vision for KTransformers is to serve as a flexible platform for experimentin
 <h2 id="Updates">🔥 Updates</h2>
-* **Feb 15, 2025**: KTransformers V0.2.1: Longer Context (from 4K to 8K for 24GB VRAM) & Slightly Faster Speed (+15%, up to 16 Tokens/s), update docs [here](./doc/en/DeepseekR1_V3_tutorial.md) and [online books](https://kvcache-ai.github.io/ktransformers/).
+* **Feb 25, 2025**: Support [FP8 GPU kernel](./doc/en/fp8_kernel.md) for DeepSeek-V3 and R1; Longer Context (from 8K to 128K for 24GB VRAM).
+* **Feb 15, 2025**: KTransformers V0.2.1: Longer Context (from 4K to 8K for 24GB VRAM) & Slightly Faster Speed (+15%, up to 16 Tokens/s), update [docs](./doc/en/DeepseekR1_V3_tutorial.md) and [online books](https://kvcache-ai.github.io/ktransformers/).
 * **Feb 10, 2025**: Support Deepseek-R1 and V3 on single (24GB VRAM)/multi GPU and 382GB DRAM, up to 3~28x speedup. For a detailed showcase and reproduction tutorial, see [here](./doc/en/DeepseekR1_V3_tutorial.md).
 * **Aug 28, 2024**: Decrease DeepseekV2's required VRAM from 21G to 11G.
 * **Aug 15, 2024**: Update detailed [tutorial](doc/en/injection_tutorial.md) for injection and multi-GPU.
@@ -125,7 +126,7 @@ To utilize the provided kernels, users only need to create a YAML-based injectio
 ```python
 with torch.device("meta"):
     model = AutoModelForCausalLM.from_config(config, trust_remote_code=True)
-optimize_and_load_gguf(model, optimize_rule_path, gguf_path, config)
+optimize_and_load_gguf(model, optimize_config_path, gguf_path, config)
 ...
 generated = prefill_and_generate(model, tokenizer, input_tensor.cuda(), max_new_tokens=1000)
 ```
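For readers unfamiliar with this API, here is a slightly fuller sketch of how the snippet above is typically wired together. The import paths and the tokenizer handling are assumptions for illustration and are not taken from this commit; the rule file shown is the multi-GPU rule mentioned elsewhere in this commit.

```python
# Sketch only: the import module paths below are assumptions, not taken from this commit.
import torch
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer
from ktransformers.optimize.optimize import optimize_and_load_gguf  # assumed module path
from ktransformers.util.utils import prefill_and_generate           # assumed module path

model_path = "deepseek-ai/DeepSeek-V3"        # HF repo that provides config + tokenizer
gguf_path = "/path/to/gguf_dir"               # directory holding the GGUF weight files
optimize_config_path = "ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat-multi-gpu.yaml"

config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Build an empty skeleton on the meta device, then let the injection framework
# replace modules according to the YAML rules and stream weights in from GGUF.
with torch.device("meta"):
    model = AutoModelForCausalLM.from_config(config, trust_remote_code=True)
optimize_and_load_gguf(model, optimize_config_path, gguf_path, config)

input_tensor = tokenizer("Hello, world!", return_tensors="pt").input_ids
generated = prefill_and_generate(model, tokenizer, input_tensor.cuda(), max_new_tokens=1000)
```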


@@ -21,7 +21,8 @@ KTransformers is a flexible, Python-centric framework designed with extensibility at its core
 <h2 id="Updates">🔥 Updates</h2>
-* **Feb 15, 2025**: KTransformers V0.2.1: Longer Context (from 4K to 8K for 24GB VRAM) & Slightly Faster Speed (+15%, up to 16 Tokens/s); see the [docs](./doc/en/DeepseekR1_V3_tutorial.md) and the [online book](https://kvcache-ai.github.io/ktransformers/).
+* **Feb 25, 2025**: Support the [FP8 GPU kernel](./doc/en/fp8_kernel.md) for DeepSeek-V3/R1; Longer Context (from 8K to 128K with only 24GB VRAM).
+* **Feb 15, 2025**: Longer Context (from 4K to 8K for 24GB VRAM) & Slightly Faster Speed (+15%, up to 16 Tokens/s); see the [docs](./doc/en/DeepseekR1_V3_tutorial.md) and the [online book](https://kvcache-ai.github.io/ktransformers/).
 * **Feb 10, 2025**: Support Deepseek-R1 and V3 on single (24GB VRAM)/multi GPU and 382GB DRAM, up to 3~28x speedup. For a detailed tutorial, see [here](./doc/en/DeepseekR1_V3_tutorial.md).
 * **Aug 28, 2024**: Support 1M context for InternLM2.5-7B-Chat-1M with 24GB VRAM and 150GB DRAM. For a detailed tutorial, see [here](./doc/en/long_context_tutorial.md).
 * **Aug 28, 2024**: Decrease DeepseekV2's required VRAM from 21G to 11G.
@@ -68,11 +69,11 @@ https://github.com/user-attachments/assets/4c6a8a38-05aa-497d-8eb1-3a5b3918429c
 </p>
-<h3>Local 1M-Context Inference on a Desktop with Only 24GB VRAM</h3>
-<p align="center">
+<!-- <h3>Local 1M-Context Inference on a Desktop with Only 24GB VRAM</h3>
+<p align="center"> -->
-https://github.com/user-attachments/assets/a865e5e4-bca3-401e-94b8-af3c080e6c12
+<!-- https://github.com/user-attachments/assets/a865e5e4-bca3-401e-94b8-af3c080e6c12 -->
+<!--
 * **1M Context InternLM 2.5 7B**: Runs in full bf16 precision using 24GB VRAM and 150GB DRAM, achievable on a local desktop setup. Achieves a 92.88% success rate on the 1M "Needle in a Haystack" test and 100% on the 128K NIAH test.
 <p align="center">
@@ -89,7 +90,7 @@ https://github.com/user-attachments/assets/a865e5e4-bca3-401e-94b8-af3c080e6c12
 * **Enhanced Speed**: Reaches 16.91 tokens/s for 1M-context generation using sparse attention, powered by the llamafile kernels. This approach is more than 10x faster than llama.cpp's full attention.
-* **Flexible Sparse Attention Framework**: Offers a flexible block-sparse attention framework for CPU-offloaded decoding. Compatible with SnapKV, Quest, and InfLLM. See [here](./doc/en/long_context_introduction.md) for more information.
+* **Flexible Sparse Attention Framework**: Offers a flexible block-sparse attention framework for CPU-offloaded decoding. Compatible with SnapKV, Quest, and InfLLM. See [here](./doc/en/long_context_introduction.md) for more information. -->
 <strong>More advanced features are coming soon, stay tuned!</strong>
@@ -116,7 +117,7 @@ At the heart of KTransformers is a user-friendly, template-based injection framework.
 ```python
 with torch.device("meta"):
     model = AutoModelForCausalLM.from_config(config, trust_remote_code=True)
-optimize_and_load_gguf(model, optimize_rule_path, gguf_path, config)
+optimize_and_load_gguf(model, optimize_config_path, gguf_path, config)
 ...
 generated = prefill_and_generate(model, tokenizer, input_tensor.cuda(), max_new_tokens=1000)
 ```
@@ -151,7 +152,7 @@ Each rule in the YAML file has two parts: `match` and `replace`. The `match`
 <h2 id="ack">Acknowledgments and Contributors</h2>
-The development of KTransformers is built on the flexible and versatile framework provided by Transformers. We have also benefited from advanced kernels such as GGUF/GGML, Llamafile, and Marlin. We plan to give back to the community by contributing our changes upstream.
+The development of KTransformers is built on the flexible and versatile framework provided by Transformers. We have also benefited from advanced kernels such as GGUF/GGML, Llamafile, Marlin, sglang, and flashinfer. We plan to give back to the community by contributing our changes upstream.
 KTransformers is actively maintained and developed by members of the <a href="https://madsys.cs.tsinghua.edu.cn/">MADSys group</a> at Tsinghua University and members of <a href="http://approaching.ai/">Approaching.AI</a>. We welcome new contributors to join us in making KTransformers faster and easier to use.


@@ -9,7 +9,7 @@
 - [Why KTransformers So Fast](en/deepseek-v2-injection.md)
 - [Injection Tutorial](en/injection_tutorial.md)
 - [Multi-GPU Tutorial](en/multi-gpu-tutorial.md)
-- [Using FP8 GPU Kernel](../merge_tensors/README.md)
+- [Use FP8 GPU Kernel](en/fp8_kernel.md)
 # Server
 - [Server](en/api/server/server.md)
 - [Website](en/api/server/website.md)


@@ -45,7 +45,7 @@ from https://github.com/kvcache-ai/ktransformers/issues/129#issue-2842799552
 ### Q: If I don't have enough VRAM, but I have multiple GPUs, how can I utilize them?
-Use the `--optimize_rule_path ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat-multi-gpu.yaml` option to load the two-GPU optimize rule YAML file. You may also use it as an example to write your own 4/8-GPU optimize rule YAML file.
+Use the `--optimize_config_path ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat-multi-gpu.yaml` option to load the two-GPU optimize rule YAML file. You may also use it as an example to write your own 4/8-GPU optimize rule YAML file.
 > Note: KTransformers' multi-GPU strategy is pipelining, which does not speed up inference; it is only used to distribute the model's weights across GPUs.
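For the same setup driven from Python rather than the command line, here is a minimal sketch using the `local_chat()` entry point whose signature appears later in this commit; the import path and the placeholder paths are assumptions.

```python
# Sketch only: the import path is assumed from ktransformers/local_chat.py.
from ktransformers.local_chat import local_chat

local_chat(
    model_path="deepseek-ai/DeepSeek-V3",
    gguf_path="/path/to/DeepSeek-V3-GGUF",  # folder with the GGUF weights
    # Equivalent to passing --optimize_config_path on the command line:
    optimize_config_path="ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat-multi-gpu.yaml",
)
```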

doc/en/fp8_kernel.md (new file, 74 lines)

@@ -0,0 +1,74 @@
# FP8 Linear Kernel for DeepSeek-V3
## Overview
The DeepSeek-AI team provides FP8 safetensors for the DeepSeek-R1/V3 models. We achieve performance optimization through the following work:
- **FP8 GPU Kernel Integration**: FP8 linear-layer acceleration kernels integrated into KTransformers
- **Hybrid Quantization Architecture**:
  - Attention and Shared-Expert modules use FP8 precision (improves computational accuracy)
  - Experts modules retain GGML quantization (GGUF format, residing in CPU memory to save GPU memory)
So those who are pursuing the best performance can use the FP8 linear kernel for DeepSeek-V3/R1.
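To make the FP8 part concrete, the sketch below shows one way block-scaled FP8 weights can be stored and applied in a linear layer. This is a naive reference in plain PyTorch, not the fused kernel shipped with KTransformers; the 128x128 block size and the e4m3 format are assumptions for illustration, and it requires a PyTorch build that provides `torch.float8_e4m3fn`.

```python
# Conceptual sketch of block-scaled FP8 linear weights; NOT the KTransformers kernel.
import torch

def quantize_fp8_blockwise(w: torch.Tensor, block: int = 128):
    """Store a weight matrix as float8_e4m3 values plus one scale per (block x block) tile."""
    out_dim, in_dim = w.shape
    scales = torch.empty(out_dim // block, in_dim // block)
    q = torch.empty_like(w, dtype=torch.float8_e4m3fn)
    for i in range(0, out_dim, block):
        for j in range(0, in_dim, block):
            tile = w[i:i + block, j:j + block]
            s = tile.abs().max().clamp(min=1e-12) / 448.0  # 448 = largest normal e4m3 value
            scales[i // block, j // block] = s
            q[i:i + block, j:j + block] = (tile / s).to(torch.float8_e4m3fn)
    return q, scales

def fp8_linear_reference(x: torch.Tensor, q: torch.Tensor, scales: torch.Tensor, block: int = 128):
    """Naive reference path: dequantize the whole weight, then matmul (a real kernel fuses this)."""
    w = q.to(torch.float32) * scales.repeat_interleave(block, 0).repeat_interleave(block, 1)
    return x @ w.t()

w = torch.randn(256, 512)          # toy linear layer: 512 -> 256
q, s = quantize_fp8_blockwise(w)
y = fp8_linear_reference(torch.randn(4, 512), q, s)
print(y.shape)                     # torch.Size([4, 256])
```

In the hybrid setup described above, only the attention and shared-expert linear layers take an FP8 path like this; the routed experts stay in GGML/GGUF form on the CPU.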
## Key Features
✅ Hybrid Precision Architecture (FP8 + GGML)
✅ Memory Optimization (~19GB VRAM usage)
## Quick Start
### Using Pre-Merged Weights
Pre-merged weights are available on Hugging Face:
- [KVCache-ai/DeepSeek-V3](https://huggingface.co/KVCache-ai/DeepSeek-V3)
- [KVCache-ai/DeepSeek-R1](https://huggingface.co/KVCache-ai/DeepSeek-R1)
> Please confirm the weights are fully uploaded before downloading; due to their large size, the Hugging Face upload may take some time to complete.
Download the pre-merged weights:
```shell
pip install -U huggingface_hub
# Optional: use the HF mirror for faster downloads in some regions.
# export HF_ENDPOINT=https://hf-mirror.com
huggingface-cli download --resume-download KVCache-ai/DeepSeek-V3 --local-dir <local_dir>
```
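If you prefer to drive the download from Python instead of the CLI, a roughly equivalent sketch (the target directory is a placeholder):

```python
# Sketch: Python equivalent of the huggingface-cli command above.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="KVCache-ai/DeepSeek-V3",
    local_dir="/path/to/DeepSeek-V3-merged",  # pick your own target directory
)
```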
### Using the merge script
If you have local DeepSeek-R1/V3 FP8 safetensors and Q4_K_M (q4km) GGUF weights, you can merge them with the following script.
```shell
python convert_model.py \
--safetensor_path <fp8_safetensor_path> \
--gguf_path <q4km_gguf_folder_path> \
--output_path <merged_output_path>
```
* `--safetensor_path`: input path of the FP8 safetensors ([Download](https://huggingface.co/deepseek-ai/DeepSeek-V3/tree/main)).
* `--gguf_path`: input path of the GGUF folder ([Download](https://huggingface.co/unsloth/DeepSeek-V3-GGUF/tree/main/DeepSeek-V3-Q4_K_M)).
* `--output_path`: output path for the merged weights.
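Before starting the merge, it can help to sanity-check the two inputs. A small sketch follows; the paths are placeholders, and the assumption that the safetensor input may be either a single file or a folder of shards is mine, not the script's documented contract.

```python
# Sketch: quick sanity check of the merge inputs before running convert_model.py.
from pathlib import Path

safetensor_path = Path("/path/to/DeepSeek-V3-fp8")   # FP8 safetensors file, or folder of shards
gguf_path = Path("/path/to/DeepSeek-V3-Q4_K_M")      # folder containing the q4km .gguf files

st_inputs = [safetensor_path] if safetensor_path.is_file() else sorted(safetensor_path.glob("*.safetensors"))
gguf_inputs = sorted(gguf_path.glob("*.gguf"))

assert st_inputs, f"no safetensors input found under {safetensor_path}"
assert gguf_inputs, f"no .gguf files found under {gguf_path}"
print(f"{len(st_inputs)} safetensors input(s), {len(gguf_inputs)} gguf file(s) found")
```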
### Execution Notes
Launch `local_chat.py` with the custom quantized experts:
```shell
python ktransformers/local_chat.py \
--model_path deepseek-ai/DeepSeek-V3 \
--gguf_path <merged_weights_folder> \
--optimize_config_path ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat-fp8-linear-ggml-experts.yaml \
--cpu_infer <cpu_cores + 1>
```
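A tiny helper for picking the `--cpu_infer` value described above ("CPU cores + 1"); note that `os.cpu_count()` reports logical cores, which may overcount on machines with hyper-threading.

```python
# Sketch: suggest a --cpu_infer value as "CPU cores + 1".
import os

cpu_infer = (os.cpu_count() or 1) + 1
print(f"suggested flag: --cpu_infer {cpu_infer}")
```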
## Notes
⚠️ Hardware Requirements
* Recommended: at least 19GB of available VRAM for the FP8 kernel.
* Requires a GPU with FP8 support (e.g., RTX 4090); see the capability-check sketch after these notes.
⏳ First-Run Optimization
JIT compilation makes the first run slower; subsequent runs keep the optimized speed.
🔄 Temporary Interface
The current weight-loading implementation is provisional and will be refined in future versions.
📁 Path Specification
Despite the hybrid quantization, the merged weights are stored as .safetensors files; pass the path of the folder containing them to `--gguf_path`.
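A rough way to check the FP8-support requirement from the hardware notes above, assuming the kernel targets Ada/Hopper-class FP8 tensor cores (compute capability 8.9 or higher, e.g. the RTX 4090):

```python
# Sketch: check whether the local GPU likely supports the FP8 kernel.
import torch

if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability()
    has_fp8 = (major, minor) >= (8, 9)  # RTX 4090 is 8.9, H100 is 9.0
    print(f"compute capability {major}.{minor}, FP8 tensor cores: {has_fp8}")
else:
    print("no CUDA device found")
```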


@@ -141,7 +141,7 @@ It features the following arguments:
 - `--gguf_path` (required): Path of a directory containing GGUF files, which can be downloaded from [Hugging Face](https://huggingface.co/mzwing/DeepSeek-V2-Lite-Chat-GGUF/tree/main). Note that the directory should contain only the GGUF files of the current model, which means you need one separate directory for each model.
-- `--optimize_rule_path` (required except for Qwen2Moe and DeepSeek-V2): Path of YAML file containing optimize rules. There are two rule files pre-written in the [ktransformers/optimize/optimize_rules](ktransformers/optimize/optimize_rules) directory for optimizing DeepSeek-V2 and Qwen2-57B-A14, two SOTA MoE models.
+- `--optimize_config_path` (required except for Qwen2Moe and DeepSeek-V2): Path of YAML file containing optimize rules. There are two rule files pre-written in the [ktransformers/optimize/optimize_rules](ktransformers/optimize/optimize_rules) directory for optimizing DeepSeek-V2 and Qwen2-57B-A14, two SOTA MoE models.
 - `--max_new_tokens`: Int (default=1000). Maximum number of new tokens to generate.


@@ -54,7 +54,7 @@ default_optimize_rules = {
 def local_chat(
     model_path: str | None = None,
-    optimize_rule_path: str = None,
+    optimize_config_path: str = None,
     gguf_path: str | None = None,
     max_new_tokens: int = 300,
     cpu_infer: int = Config().cpu_infer,
@@ -95,12 +95,12 @@ def local_chat(
         config, trust_remote_code=True, attn_implementation="flash_attention_2"
     )
-    if optimize_rule_path is None:
+    if optimize_config_path is None:
         if config.architectures[0] in default_optimize_rules:
             print("using default_optimize_rule for", config.architectures[0])
-            optimize_rule_path = default_optimize_rules[config.architectures[0]]
+            optimize_config_path = default_optimize_rules[config.architectures[0]]
         else:
-            optimize_rule_path = input(
+            optimize_config_path = input(
                 "please input the path of your rule file(yaml file containing optimize rules):"
             )
@@ -108,7 +108,7 @@ def local_chat(
         gguf_path = input(
             "please input the path of your gguf file(gguf file in the dir containing input gguf file must all belong to current model):"
         )
-    optimize_and_load_gguf(model, optimize_rule_path, gguf_path, config)
+    optimize_and_load_gguf(model, optimize_config_path, gguf_path, config)
     try:
         model.generation_config = GenerationConfig.from_pretrained(model_path)


@@ -35,9 +35,9 @@ class KTransformersInterface(TransformersInterface):
         with torch.device("meta"):
             self.model = custom_models[config.architectures[0]](config)
         if default_args.optimize_config_path is None:
-            optimize_rule_path = default_optimize_rules[config.architectures[0]]
+            optimize_config_path = default_optimize_rules[config.architectures[0]]
         else:
-            optimize_rule_path = args.optimize_config_path
+            optimize_config_path = args.optimize_config_path
         # print(optimize_config)
@@ -47,7 +47,7 @@ class KTransformersInterface(TransformersInterface):
                 "please input the path of your gguf file(gguf file in the dir containing input gguf file must all"
                 " belong to current model):"
             )
-        optimize_and_load_gguf(self.model, optimize_rule_path, gguf_path, config)
+        optimize_and_load_gguf(self.model, optimize_config_path, gguf_path, config)
         self.device_map = self.model.gguf_loader.tensor_device_map
         # logger.info(f"{args.model_name} loaded from {args.model_dir} to {self.device_map}")


@@ -1,36 +0,0 @@ (file deleted)
# FP8 Linear Kernel.
For DeepSeek-R1/V3, the DeepSeek-AI team provides fp8 safetensors. We have integrated the FP8 GPU kernel into KTransformers. But to keep the experts in CPU memory and save GPU memory, we still use ggml (GGUF tensors) quantization for the experts. In this way, we can increase the precision of the attention computation, which may improve the model's performance.
Therefore, to use the fp8 linear kernel, we need to merge the fp8 weights and the gguf files. We have provided prepared weights on Hugging Face so that you can use them directly.
[KVCache-ai/DeepSeek-V3](https://huggingface.co/KVCache-ai/DeepSeek-V3/upload/main)
If you want to use other formats of ggml quantization, you can use the following script to merge them.
## Example
To use the fp8 linear kernel and q4km experts:
```shell
python convert_model.py \
  --safetensor_path <fp8 safetensor path> \
  --gguf_path <q4km gguf folder path> \
  --output_path <output path>
```
* `--safetensor_path`: input path of the safetensor file
* `--gguf_path`: input path of the gguf folder
* `--output_path`: output path of the merged file
## To Run DeepSeek-V3 with fp8 linear kernel and q4km experts
```shell
python ktransformers/local_chat.py --model_path deepseek-ai/DeepSeek-V3 --gguf_path <new weights folder> --optimize_rule_path ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat-fp8-linear-ggml-experts.yaml --cpu_infer <cpu cores + 1>
```
> NOTES:
> 1. Using the fp8 linear kernel and q4km experts consumes approximately 19GB of GPU memory.
> 2. We know the new way to load the module is ugly; we are working on it.
> 3. Though the model is a mixture of fp8 and ggml, the weights are stored in .safetensors files. Please pass the folder path of the new weights to `--gguf_path`.


@@ -3,6 +3,7 @@
 import os
 # insert the path of the project
 import sys
+sys.path.insert(0, "/home/azure/ktransformers")
 import argparse
 import torch
 from ktransformers.util.custom_gguf import GGUFLoader, translate_name_to_gguf
@@ -180,7 +181,7 @@ def write_combined_tensor(target_tensor_map: dict, output_path: str, gguf_loader
         output_file = os.path.join(output_path, f"model-{shard_idx:05}-of-{total_shards:05}.safetensors")
         print(f"Saving layer {layer_num} to {output_file}")
-        print(tensors.keys())
+        # print(tensors.keys())
         save_file(tensors, output_file)
         shard_idx += 1


@@ -5,4 +5,5 @@ torch>=2.3.0
 packaging
 cpufeature
 protobuf
 tiktoken
+blobfile