Update doc

commit 36fbeee341
parent 4dc5518e4d
Author: Azure
Date: 2025-02-25 08:21:18 +00:00

11 changed files with 101 additions and 59 deletions


@@ -23,7 +23,8 @@ Our vision for KTransformers is to serve as a flexible platform for experimentin
 <h2 id="Updates">🔥 Updates</h2>
-* **Feb 15, 2025**: KTransformers V0.2.1: Longer Context (from 4K to 8K for 24GB VRAM) & Slightly Faster Speed (+15%, up to 16 Tokens/s), update docs [here](./doc/en/DeepseekR1_V3_tutorial.md) and [online books](https://kvcache-ai.github.io/ktransformers/).
+* **Feb 25, 2025**: Support [FP8 GPU kernel](./doc/en/fp8_kernel.md) for DeepSeek-V3 and R1; Longer Context (from 8K to 128K for 24GB VRAM).
+* **Feb 15, 2025**: KTransformers V0.2.1: Longer Context (from 4K to 8K for 24GB VRAM) & Slightly Faster Speed (+15%, up to 16 Tokens/s), update [docs](./doc/en/DeepseekR1_V3_tutorial.md) and [online books](https://kvcache-ai.github.io/ktransformers/).
 * **Feb 10, 2025**: Support Deepseek-R1 and V3 on single (24GB VRAM)/multi GPU and 382GB DRAM, up to 3~28x speedup. For a detailed showcase and reproduction tutorial, see [here](./doc/en/DeepseekR1_V3_tutorial.md).
 * **Aug 28, 2024**: Decrease DeepseekV2's required VRAM from 21G to 11G.
 * **Aug 15, 2024**: Update detailed [tutorial](doc/en/injection_tutorial.md) for injection and multi-GPU.
@@ -125,7 +126,7 @@ To utilize the provided kernels, users only need to create a YAML-based injectio
 ```python
 with torch.device("meta"):
     model = AutoModelForCausalLM.from_config(config, trust_remote_code=True)
-optimize_and_load_gguf(model, optimize_rule_path, gguf_path, config)
+optimize_and_load_gguf(model, optimize_config_path, gguf_path, config)
 ...
 generated = prefill_and_generate(model, tokenizer, input_tensor.cuda(), max_new_tokens=1000)
 ```
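For readers unfamiliar with this API, here is a slightly fuller sketch of how the snippet above is typically wired together. The import paths and the tokenizer handling are assumptions for illustration and are not taken from this commit; the rule file shown is the multi-GPU rule mentioned elsewhere in this commit.

```python
# Sketch only: the import module paths below are assumptions, not taken from this commit.
import torch
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer
from ktransformers.optimize.optimize import optimize_and_load_gguf  # assumed module path
from ktransformers.util.utils import prefill_and_generate           # assumed module path

model_path = "deepseek-ai/DeepSeek-V3"        # HF repo that provides config + tokenizer
gguf_path = "/path/to/gguf_dir"               # directory holding the GGUF weight files
optimize_config_path = "ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat-multi-gpu.yaml"

config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Build an empty skeleton on the meta device, then let the injection framework
# replace modules according to the YAML rules and stream weights in from GGUF.
with torch.device("meta"):
    model = AutoModelForCausalLM.from_config(config, trust_remote_code=True)
optimize_and_load_gguf(model, optimize_config_path, gguf_path, config)

input_tensor = tokenizer("Hello, world!", return_tensors="pt").input_ids
generated = prefill_and_generate(model, tokenizer, input_tensor.cuda(), max_new_tokens=1000)
```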


@@ -21,7 +21,8 @@ KTransformers is a flexible, Python-centric framework designed with extensibility at its core
 <h2 id="Updates">🔥 Updates</h2>
-* **Feb 15, 2025**: KTransformers V0.2.1: Longer Context (from 4K to 8K for 24GB VRAM) & Slightly Faster Speed (+15%, up to 16 Tokens/s); see the [docs](./doc/en/DeepseekR1_V3_tutorial.md) and the [online book](https://kvcache-ai.github.io/ktransformers/).
+* **Feb 25, 2025**: Support the [FP8 GPU kernel](./doc/en/fp8_kernel.md) for DeepSeek-V3/R1; Longer Context (from 8K to 128K with only 24GB VRAM).
+* **Feb 15, 2025**: Longer Context (from 4K to 8K for 24GB VRAM) & Slightly Faster Speed (+15%, up to 16 Tokens/s); see the [docs](./doc/en/DeepseekR1_V3_tutorial.md) and the [online book](https://kvcache-ai.github.io/ktransformers/).
 * **Feb 10, 2025**: Support Deepseek-R1 and V3 on single (24GB VRAM)/multi GPU and 382GB DRAM, up to 3~28x speedup. For a detailed tutorial, see [here](./doc/en/DeepseekR1_V3_tutorial.md).
 * **Aug 28, 2024**: Support 1M context for InternLM2.5-7B-Chat-1M with 24GB VRAM and 150GB DRAM. For a detailed tutorial, see [here](./doc/en/long_context_tutorial.md).
 * **Aug 28, 2024**: Decrease DeepseekV2's required VRAM from 21G to 11G.
@@ -68,11 +69,11 @@ https://github.com/user-attachments/assets/4c6a8a38-05aa-497d-8eb1-3a5b3918429c
 </p>
-<h3>Local 1M-Context Inference on a Desktop with Only 24GB VRAM</h3>
-<p align="center">
+<!-- <h3>Local 1M-Context Inference on a Desktop with Only 24GB VRAM</h3>
+<p align="center"> -->
-https://github.com/user-attachments/assets/a865e5e4-bca3-401e-94b8-af3c080e6c12
+<!-- https://github.com/user-attachments/assets/a865e5e4-bca3-401e-94b8-af3c080e6c12 -->
+<!--
 * **1M Context InternLM 2.5 7B**: Runs in full bf16 precision using 24GB VRAM and 150GB DRAM, achievable on a local desktop setup. Achieves a 92.88% success rate on the 1M "Needle in a Haystack" test and 100% on the 128K NIAH test.
 <p align="center">
@@ -89,7 +90,7 @@ https://github.com/user-attachments/assets/a865e5e4-bca3-401e-94b8-af3c080e6c12
 * **Enhanced Speed**: Reaches 16.91 tokens/s for 1M-context generation using sparse attention, powered by the llamafile kernels. This approach is more than 10x faster than llama.cpp's full attention.
-* **Flexible Sparse Attention Framework**: Offers a flexible block-sparse attention framework for CPU-offloaded decoding. Compatible with SnapKV, Quest, and InfLLM. See [here](./doc/en/long_context_introduction.md) for more information.
+* **Flexible Sparse Attention Framework**: Offers a flexible block-sparse attention framework for CPU-offloaded decoding. Compatible with SnapKV, Quest, and InfLLM. See [here](./doc/en/long_context_introduction.md) for more information. -->
 <strong>More advanced features are coming soon, stay tuned!</strong>
@@ -116,7 +117,7 @@ At the heart of KTransformers is a user-friendly, template-based injection framework.
 ```python
 with torch.device("meta"):
     model = AutoModelForCausalLM.from_config(config, trust_remote_code=True)
-optimize_and_load_gguf(model, optimize_rule_path, gguf_path, config)
+optimize_and_load_gguf(model, optimize_config_path, gguf_path, config)
 ...
 generated = prefill_and_generate(model, tokenizer, input_tensor.cuda(), max_new_tokens=1000)
 ```
@@ -151,7 +152,7 @@ Each rule in the YAML file has two parts: `match` and `replace`. The `match`
 <h2 id="ack">Acknowledgments and Contributors</h2>
-The development of KTransformers is built on the flexible and versatile framework provided by Transformers. We have also benefited from advanced kernels such as GGUF/GGML, Llamafile, and Marlin. We plan to give back to the community by contributing our changes upstream.
+The development of KTransformers is built on the flexible and versatile framework provided by Transformers. We have also benefited from advanced kernels such as GGUF/GGML, Llamafile, Marlin, sglang, and flashinfer. We plan to give back to the community by contributing our changes upstream.
 KTransformers is actively maintained and developed by members of the <a href="https://madsys.cs.tsinghua.edu.cn/">MADSys group</a> at Tsinghua University and members of <a href="http://approaching.ai/">Approaching.AI</a>. We welcome new contributors to join us in making KTransformers faster and easier to use.


@@ -9,7 +9,7 @@
 - [Why KTransformers So Fast](en/deepseek-v2-injection.md)
 - [Injection Tutorial](en/injection_tutorial.md)
 - [Multi-GPU Tutorial](en/multi-gpu-tutorial.md)
-- [Using FP8 GPU Kernel](../merge_tensors/README.md)
+- [Use FP8 GPU Kernel](en/fp8_kernel.md)
 # Server
 - [Server](en/api/server/server.md)
 - [Website](en/api/server/website.md)


@@ -45,7 +45,7 @@ from https://github.com/kvcache-ai/ktransformers/issues/129#issue-2842799552
 ### Q: If I don't have enough VRAM, but I have multiple GPUs, how can I utilize them?
-Use the `--optimize_rule_path ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat-multi-gpu.yaml` option to load the two-GPU optimize rule YAML file. You may also use it as an example to write your own 4/8-GPU optimize rule YAML file.
+Use the `--optimize_config_path ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat-multi-gpu.yaml` option to load the two-GPU optimize rule YAML file. You may also use it as an example to write your own 4/8-GPU optimize rule YAML file.
 > Note: KTransformers' multi-GPU strategy is pipelining, which does not speed up inference; it is only used to distribute the model's weights across GPUs.
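For the same setup driven from Python rather than the command line, here is a minimal sketch using the `local_chat()` entry point whose signature appears later in this commit; the import path and the placeholder paths are assumptions.

```python
# Sketch only: the import path is assumed from ktransformers/local_chat.py.
from ktransformers.local_chat import local_chat

local_chat(
    model_path="deepseek-ai/DeepSeek-V3",
    gguf_path="/path/to/DeepSeek-V3-GGUF",  # folder with the GGUF weights
    # Equivalent to passing --optimize_config_path on the command line:
    optimize_config_path="ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat-multi-gpu.yaml",
)
```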

doc/en/fp8_kernel.md (new file, 74 lines)

@@ -0,0 +1,74 @@
# FP8 Linear Kernel for DeepSeek-V3
## Overview
The DeepSeek-AI team provides FP8 safetensors for the DeepSeek-R1/V3 models. We achieve performance optimization through the following work:
- **FP8 GPU Kernel Integration**: FP8 linear-layer acceleration kernels integrated into KTransformers
- **Hybrid Quantization Architecture**:
  - Attention and Shared-Expert modules use FP8 precision (improves computational accuracy)
  - Experts modules retain GGML quantization (GGUF format, residing in CPU memory to save GPU memory)
So those who are pursuing the best performance can use the FP8 linear kernel for DeepSeek-V3/R1.
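To make the FP8 part concrete, the sketch below shows one way block-scaled FP8 weights can be stored and applied in a linear layer. This is a naive reference in plain PyTorch, not the fused kernel shipped with KTransformers; the 128x128 block size and the e4m3 format are assumptions for illustration, and it requires a PyTorch build that provides `torch.float8_e4m3fn`.

```python
# Conceptual sketch of block-scaled FP8 linear weights; NOT the KTransformers kernel.
import torch

def quantize_fp8_blockwise(w: torch.Tensor, block: int = 128):
    """Store a weight matrix as float8_e4m3 values plus one scale per (block x block) tile."""
    out_dim, in_dim = w.shape
    scales = torch.empty(out_dim // block, in_dim // block)
    q = torch.empty_like(w, dtype=torch.float8_e4m3fn)
    for i in range(0, out_dim, block):
        for j in range(0, in_dim, block):
            tile = w[i:i + block, j:j + block]
            s = tile.abs().max().clamp(min=1e-12) / 448.0  # 448 = largest normal e4m3 value
            scales[i // block, j // block] = s
            q[i:i + block, j:j + block] = (tile / s).to(torch.float8_e4m3fn)
    return q, scales

def fp8_linear_reference(x: torch.Tensor, q: torch.Tensor, scales: torch.Tensor, block: int = 128):
    """Naive reference path: dequantize the whole weight, then matmul (a real kernel fuses this)."""
    w = q.to(torch.float32) * scales.repeat_interleave(block, 0).repeat_interleave(block, 1)
    return x @ w.t()

w = torch.randn(256, 512)          # toy linear layer: 512 -> 256
q, s = quantize_fp8_blockwise(w)
y = fp8_linear_reference(torch.randn(4, 512), q, s)
print(y.shape)                     # torch.Size([4, 256])
```

In the hybrid setup described above, only the attention and shared-expert linear layers take an FP8 path like this; the routed experts stay in GGML/GGUF form on the CPU.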
## Key Features
✅ Hybrid Precision Architecture (FP8 + GGML)
✅ Memory Optimization (~19GB VRAM usage)
## Quick Start
### Using Pre-Merged Weights
Pre-merged weights are available on Hugging Face:
- [KVCache-ai/DeepSeek-V3](https://huggingface.co/KVCache-ai/DeepSeek-V3)
- [KVCache-ai/DeepSeek-R1](https://huggingface.co/KVCache-ai/DeepSeek-R1)
> Please confirm the weights are fully uploaded before downloading; due to their large size, the Hugging Face upload may take some time to complete.
Download the pre-merged weights:
```shell
pip install -U huggingface_hub
# Optional: use the HF mirror for faster downloads in some regions.
# export HF_ENDPOINT=https://hf-mirror.com
huggingface-cli download --resume-download KVCache-ai/DeepSeek-V3 --local-dir <local_dir>
```
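If you prefer to drive the download from Python instead of the CLI, a roughly equivalent sketch (the target directory is a placeholder):

```python
# Sketch: Python equivalent of the huggingface-cli command above.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="KVCache-ai/DeepSeek-V3",
    local_dir="/path/to/DeepSeek-V3-merged",  # pick your own target directory
)
```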
### Using the merge script
If you have local DeepSeek-R1/V3 FP8 safetensors and Q4_K_M (q4km) GGUF weights, you can merge them with the following script.
```shell
python convert_model.py \
--safetensor_path <fp8_safetensor_path> \
--gguf_path <q4km_gguf_folder_path> \
--output_path <merged_output_path>
```
* `--safetensor_path`: input path of the FP8 safetensors ([Download](https://huggingface.co/deepseek-ai/DeepSeek-V3/tree/main)).
* `--gguf_path`: input path of the GGUF folder ([Download](https://huggingface.co/unsloth/DeepSeek-V3-GGUF/tree/main/DeepSeek-V3-Q4_K_M)).
* `--output_path`: output path for the merged weights.
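Before starting the merge, it can help to sanity-check the two inputs. A small sketch follows; the paths are placeholders, and the assumption that the safetensor input may be either a single file or a folder of shards is mine, not the script's documented contract.

```python
# Sketch: quick sanity check of the merge inputs before running convert_model.py.
from pathlib import Path

safetensor_path = Path("/path/to/DeepSeek-V3-fp8")   # FP8 safetensors file, or folder of shards
gguf_path = Path("/path/to/DeepSeek-V3-Q4_K_M")      # folder containing the q4km .gguf files

st_inputs = [safetensor_path] if safetensor_path.is_file() else sorted(safetensor_path.glob("*.safetensors"))
gguf_inputs = sorted(gguf_path.glob("*.gguf"))

assert st_inputs, f"no safetensors input found under {safetensor_path}"
assert gguf_inputs, f"no .gguf files found under {gguf_path}"
print(f"{len(st_inputs)} safetensors input(s), {len(gguf_inputs)} gguf file(s) found")
```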
### Execution Notes
Launch `local_chat.py` with the custom quantized experts:
```shell
python ktransformers/local_chat.py \
--model_path deepseek-ai/DeepSeek-V3 \
--gguf_path <merged_weights_folder> \
--optimize_config_path ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat-fp8-linear-ggml-experts.yaml \
--cpu_infer <cpu_cores + 1>
```
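A tiny helper for picking the `--cpu_infer` value described above ("CPU cores + 1"); note that `os.cpu_count()` reports logical cores, which may overcount on machines with hyper-threading.

```python
# Sketch: suggest a --cpu_infer value as "CPU cores + 1".
import os

cpu_infer = (os.cpu_count() or 1) + 1
print(f"suggested flag: --cpu_infer {cpu_infer}")
```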
## Notes
⚠️ Hardware Requirements
* Recommended: at least 19GB of available VRAM for the FP8 kernel.
* Requires a GPU with FP8 support (e.g., RTX 4090); see the capability-check sketch after these notes.
⏳ First-Run Optimization
JIT compilation makes the first run slower; subsequent runs keep the optimized speed.
🔄 Temporary Interface
The current weight-loading implementation is provisional and will be refined in future versions.
📁 Path Specification
Despite the hybrid quantization, the merged weights are stored as .safetensors files; pass the path of the folder containing them to `--gguf_path`.
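A rough way to check the FP8-support requirement from the hardware notes above, assuming the kernel targets Ada/Hopper-class FP8 tensor cores (compute capability 8.9 or higher, e.g. the RTX 4090):

```python
# Sketch: check whether the local GPU likely supports the FP8 kernel.
import torch

if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability()
    has_fp8 = (major, minor) >= (8, 9)  # RTX 4090 is 8.9, H100 is 9.0
    print(f"compute capability {major}.{minor}, FP8 tensor cores: {has_fp8}")
else:
    print("no CUDA device found")
```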


@@ -141,7 +141,7 @@ It features the following arguments:
 - `--gguf_path` (required): Path of a directory containing GGUF files, which can be downloaded from [Hugging Face](https://huggingface.co/mzwing/DeepSeek-V2-Lite-Chat-GGUF/tree/main). Note that the directory should contain only the GGUF files of the current model, which means you need one separate directory for each model.
-- `--optimize_rule_path` (required except for Qwen2Moe and DeepSeek-V2): Path of YAML file containing optimize rules. There are two rule files pre-written in the [ktransformers/optimize/optimize_rules](ktransformers/optimize/optimize_rules) directory for optimizing DeepSeek-V2 and Qwen2-57B-A14, two SOTA MoE models.
+- `--optimize_config_path` (required except for Qwen2Moe and DeepSeek-V2): Path of YAML file containing optimize rules. There are two rule files pre-written in the [ktransformers/optimize/optimize_rules](ktransformers/optimize/optimize_rules) directory for optimizing DeepSeek-V2 and Qwen2-57B-A14, two SOTA MoE models.
 - `--max_new_tokens`: Int (default=1000). Maximum number of new tokens to generate.


@@ -54,7 +54,7 @@ default_optimize_rules = {
 def local_chat(
     model_path: str | None = None,
-    optimize_rule_path: str = None,
+    optimize_config_path: str = None,
     gguf_path: str | None = None,
     max_new_tokens: int = 300,
     cpu_infer: int = Config().cpu_infer,
@@ -95,12 +95,12 @@ def local_chat(
         config, trust_remote_code=True, attn_implementation="flash_attention_2"
     )
-    if optimize_rule_path is None:
+    if optimize_config_path is None:
         if config.architectures[0] in default_optimize_rules:
             print("using default_optimize_rule for", config.architectures[0])
-            optimize_rule_path = default_optimize_rules[config.architectures[0]]
+            optimize_config_path = default_optimize_rules[config.architectures[0]]
         else:
-            optimize_rule_path = input(
+            optimize_config_path = input(
                 "please input the path of your rule file(yaml file containing optimize rules):"
             )
@@ -108,7 +108,7 @@ def local_chat(
         gguf_path = input(
             "please input the path of your gguf file(gguf file in the dir containing input gguf file must all belong to current model):"
         )
-    optimize_and_load_gguf(model, optimize_rule_path, gguf_path, config)
+    optimize_and_load_gguf(model, optimize_config_path, gguf_path, config)
     try:
         model.generation_config = GenerationConfig.from_pretrained(model_path)


@@ -35,9 +35,9 @@ class KTransformersInterface(TransformersInterface):
         with torch.device("meta"):
             self.model = custom_models[config.architectures[0]](config)
         if default_args.optimize_config_path is None:
-            optimize_rule_path = default_optimize_rules[config.architectures[0]]
+            optimize_config_path = default_optimize_rules[config.architectures[0]]
         else:
-            optimize_rule_path = args.optimize_config_path
+            optimize_config_path = args.optimize_config_path
         # print(optimize_config)
@@ -47,7 +47,7 @@ class KTransformersInterface(TransformersInterface):
                 "please input the path of your gguf file(gguf file in the dir containing input gguf file must all"
                 " belong to current model):"
             )
-        optimize_and_load_gguf(self.model, optimize_rule_path, gguf_path, config)
+        optimize_and_load_gguf(self.model, optimize_config_path, gguf_path, config)
         self.device_map = self.model.gguf_loader.tensor_device_map
         # logger.info(f"{args.model_name} loaded from {args.model_dir} to {self.device_map}")


@@ -1,36 +0,0 @@ (file deleted)
# FP8 Linear Kernel.
For DeepSeek-R1/V3, the DeepSeek-AI team provides fp8 safetensors. We have integrated the FP8 GPU kernel into KTransformers. But to keep the experts in CPU memory and save GPU memory, we still use ggml (GGUF tensors) quantization for the experts. In this way, we can increase the precision of the attention computation, which may improve the model's performance.
Therefore, to use the fp8 linear kernel, we need to merge the fp8 weights and the gguf files. We have provided prepared weights on Hugging Face so that you can use them directly.
[KVCache-ai/DeepSeek-V3](https://huggingface.co/KVCache-ai/DeepSeek-V3/upload/main)
If you want to use other formats of ggml quantization, you can use the following script to merge them.
## Example
To use the fp8 linear kernel and q4km experts:
```shell
python convert_model.py \
  --safetensor_path <fp8 safetensor path> \
  --gguf_path <q4km gguf folder path> \
  --output_path <output path>
```
* `--safetensor_path`: input path of the safetensor file
* `--gguf_path`: input path of the gguf folder
* `--output_path`: output path of the merged file
## To Run DeepSeek-V3 with fp8 linear kernel and q4km experts
```shell
python ktransformers/local_chat.py --model_path deepseek-ai/DeepSeek-V3 --gguf_path <new weights folder> --optimize_rule_path ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat-fp8-linear-ggml-experts.yaml --cpu_infer <cpu cores + 1>
```
> NOTES:
> 1. Using the fp8 linear kernel and q4km experts consumes approximately 19GB of GPU memory.
> 2. We know the new way to load the module is ugly; we are working on it.
> 3. Though the model is a mixture of fp8 and ggml, the weights are stored in .safetensors files. Please pass the folder path of the new weights to `--gguf_path`.


@@ -3,6 +3,7 @@
 import os
 # insert the path of the project
 import sys
+sys.path.insert(0, "/home/azure/ktransformers")
 import argparse
 import torch
 from ktransformers.util.custom_gguf import GGUFLoader, translate_name_to_gguf
@@ -180,7 +181,7 @@ def write_combined_tensor(target_tensor_map: dict, output_path: str, gguf_loader
         output_file = os.path.join(output_path, f"model-{shard_idx:05}-of-{total_shards:05}.safetensors")
         print(f"Saving layer {layer_num} to {output_file}")
-        print(tensors.keys())
+        # print(tensors.keys())
         save_file(tensors, output_file)
         shard_idx += 1


@@ -5,4 +5,5 @@ torch>=2.3.0
 packaging
 cpufeature
 protobuf
 tiktoken
+blobfile