Mirror of https://github.com/kvcache-ai/ktransformers.git (synced 2025-09-11 07:44:35 +00:00)
Merge branch 'kvcache-ai:main' into main
Commit d3b45d5704
8 changed files with 86 additions and 18 deletions
@@ -32,4 +32,4 @@ CPU_INSTRUCT=NATIVE KTRANSFORMERS_FORCE_BUILD=TRUE TORCH_CUDA_ARCH_LIST="8.0;8.
pip cache purge
EOF

ENTRYPOINT [ "/opt/conda/bin/ktransformers" ]
ENTRYPOINT ["tail", "-f", "/dev/null"]

@@ -23,6 +23,7 @@ Our vision for KTransformers is to serve as a flexible platform for experimentin
<h2 id="Updates">🔥 Updates</h2>

* **Feb 15, 2025**: KTransformers V0.2.1: Longer Context (from 4K to 8K for 24GB VRAM) & Slightly Faster Speed (+15%) (Up to 16 Tokens/s), update docs [here](./doc/en/DeepseekR1_V3_tutorial.md) and [online books](https://kvcache-ai.github.io/ktransformers/).
* **Feb 10, 2025**: Support Deepseek-R1 and V3 on single (24GB VRAM)/multi gpu and 382G DRAM, up to 3~28x speedup. For detailed show case and reproduction tutorial, see [here](./doc/en/DeepseekR1_V3_tutorial.md).
* **Aug 28, 2024**: Decrease DeepseekV2's required VRAM from 21G to 11G.
* **Aug 15, 2024**: Update detailed [tutorial](doc/en/injection_tutorial.md) for injection and multi-GPU.

@@ -159,7 +160,7 @@ If you are interested in our design principles and the implementation of the inj
<h2 id="ack">Acknowledgment and Contributors</h2>

The development of KTransformer is based on the flexible and versatile framework provided by Transformers. We also benefit from advanced kernels such as GGUF/GGML, Llamafile, and Marlin. We are planning to contribute back to the community by upstreaming our modifications.
The development of KTransformer is based on the flexible and versatile framework provided by Transformers. We also benefit from advanced kernels such as GGUF/GGML, Llamafile, Marlin, sglang and flashinfer. We are planning to contribute back to the community by upstreaming our modifications.

KTransformer is actively maintained and developed by contributors from the <a href="https://madsys.cs.tsinghua.edu.cn/">MADSys group</a> at Tsinghua University and members from <a href="http://approaching.ai/">Approaching.AI</a>. We welcome new contributors to join us in making KTransformer faster and easier to use.

@@ -9,7 +9,7 @@
- [Why KTransformers So Fast](en/deepseek-v2-injection.md)
- [Injection Tutorial](en/injection_tutorial.md)
- [Multi-GPU Tutorial](en/multi-gpu-tutorial.md)
# Server(Temperary Deprected)
# Server (Temporarily Deprecated)
- [Server](en/api/server/server.md)
- [Website](en/api/server/website.md)
- [Tabby](en/api/server/tabby.md)

@@ -3,21 +3,28 @@
- [SUMMARY](#summary)
- [Show Case Environment](#show-case-environment)
- [Bench Result](#bench-result)
- [V0.2.1](#v021)
- [Memory consumption:](#memory-consumption)
- [Change Log](#change-log)
- [Benchmark Results](#benchmark-results)
- [V0.2](#v02)
- [Settings](#settings)
- [Memory consumption:](#memory-consumption)
- [Benchmark Results](#benchmark-results)
- [Memory consumption:](#memory-consumption-1)
- [Benchmark Results](#benchmark-results-1)
- [V0.3-Preview](#v03-preview)
- [Settings](#settings-1)
- [Memory consumptions:](#memory-consumptions)
- [Benchmark results](#benchmark-results-1)
- [Benchmark results](#benchmark-results-2)
- [How to Run](#how-to-run)
- [V0.2 Showcase](#v02-showcase)
- [V0.2 \& V0.2.1 Showcase](#v02--v021-showcase)
- [Single socket version (32 cores)](#single-socket-version-32-cores)
- [Dual socket version (64 cores)](#dual-socket-version-64-cores)
- [V0.3 Showcase](#v03-showcase)
- [Dual socket version (64 cores)](#dual-socket-version-64-cores-1)
- [Some Explanations](#some-explanations)
- [Next](#next)
- [Faster](#faster)
- [Easier](#easier)
- [FAQ](#faq)
- [R1 No Thinking](#r1-no-thinking)
- [More FAQ](#more-faq)

@@ -49,13 +56,54 @@ https://github.com/user-attachments/assets/ebd70bfa-b2c1-4abb-ae3b-296ed38aa285
We also preview our upcoming optimizations, including an Intel AMX-accelerated kernel and a selective expert activation method, which will significantly enhance performance. With V0.3-preview, we achieve up to 286 tokens/s for prefill, making it up to **28× faster than llama.cpp** for local inference.
The binary distribution is available now and the source code will come ASAP! Check out the wheel package [here](https://github.com/kvcache-ai/ktransformers/releases/download/v0.1.4/ktransformers-0.3.0rc0+cu126torch26fancy-cp311-cp311-linux_x86_64.whl).
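For reference, a preview wheel like this can be installed directly from its release URL with pip; this is a sketch, assuming your environment matches the constraints encoded in the wheel filename (Python 3.11, CUDA 12.6, torch 2.6, Linux x86_64):

```shell
# Install the V0.3-preview binary wheel straight from the GitHub release page.
# The cp311 / cu126 / torch26 requirements come from the wheel's filename tag.
pip install https://github.com/kvcache-ai/ktransformers/releases/download/v0.1.4/ktransformers-0.3.0rc0+cu126torch26fancy-cp311-cp311-linux_x86_64.whl
```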
> **Feb 15, 2025**: KTransformers V0.2.1: Longer Context (from 4K to 8K for 24GB VRAM) & Slightly Faster Speed (+15%) (Up to 16 Tokens/s), update docs [here](./doc/en/DeepseekR1_V3_tutorial.md) and [online books](https://kvcache-ai.github.io/ktransformers/).

We have sped up decode and prefill a little. The performance improvement is limited mainly because inference is still constrained by the CPU's computational speed and memory bandwidth; the MLA part handled by the GPU accounts for a relatively small proportion of the work.

Besides the improvements in speed, we've also significantly updated the documentation to enhance usability, including:<br>
- Added a Multi-GPU configuration tutorial.
- Consolidated the installation guide.
- Added a detailed tutorial on registering extra GPU memory with ExpertMarlin.

## Show Case Environment
We run our best performance tests (V0.2) on <br>
CPU: Intel (R) Xeon (R) Gold 6454S 1T DRAM (2 NUMA nodes) <br>
GPU: 4090D 24G VRAM <br>
Memory: standard DDR5-4800 server DRAM (1 TB)
Memory: standard DDR5-4800 server DRAM (1 TB), each socket with 8×DDR5-4800

## Bench Result
### V0.2.1
- Model: DeepseekV3-q4km (int4)<br>
- CPU: cpu_model_name: Intel (R) Xeon (R) Gold 6454S, 32 cores per socket, 2 sockets, 2 numa nodes
- GPU: 4090 24G VRAM
- We test after sufficient warm-up
#### Memory consumption:
- Single socket: 382G DRAM, at least 14GB VRAM
- Dual socket: 1T DRAM, at least 14GB VRAM
#### Change Log
- Longer Context (from 4K to 8K for 24GB VRAM) and Slightly Faster Speed (+15%):<br>
  Integrated the highly efficient Triton MLA Kernel from the fantastic sglang project, enabling much longer context lengths and slightly faster prefill/decode speed.
- We suspect that some of the improvements come from the change of hardware platform (4090D -> 4090).
#### Benchmark Results

"6 experts" case is part of V0.3's preview

| Prompt | hi (2) | 1K (969) | 2K (1930) | 4K (3846) | llama.cpp (8 experts) |
| --- | --- | --- | --- | --- | --- |
| Output length | 10 tokens | 300 tokens | 300 tokens | 300 tokens | 300 tokens |
| **6 experts V0.2.0** | | | | | |
| Prefill token/s | 13 | 105 | 102 | 88 | CUDA OOM |
| Decode token/s | 16.8 | 15.4 | 14.2 | 13.0 | CUDA OOM |
| **6 experts V0.2.1** | | | | | |
| Prefill token/s | 13 | 111 | 112.5 | 102 **(1.16x speedup)** | 101 |
| Decode token/s | 16.8 | 15.9 | 15.4 | 14.9 **(1.15x speedup)** | 13.9 |
| **8 experts V0.2.1** | | | | | |
| Prefill token/s | 12.2 | 88.2 | 88.5 | 81.9 | 80 |
| Decode token/s | 13.4 | 13.5 | 13.4 | 13.2 | 12.4 |

### V0.2
#### Settings
- Model: DeepseekV3-q4km (int4)<br>

@@ -106,7 +154,7 @@ the output quality doesn't change. But the speed of decoding and prefill
is sped up, which is inspiring. So our showcase makes use of this finding*

## How to Run
### V0.2 Showcase
### V0.2 & V0.2.1 Showcase
#### Single socket version (32 cores)
Our local_chat test command is:
``` shell
@@ -170,6 +218,17 @@ DeepSeek's MLA operators are highly computationally intensive. While running eve
5. Why Intel CPUs?
   Intel is currently the only CPU vendor that supports AMX-like instructions, which deliver significantly better performance compared to AVX-only alternatives. A quick way to check for AMX support is shown below.
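For a quick sanity check on a Linux host with a reasonably recent kernel, the AMX feature flags (if any) are reported in /proc/cpuinfo:

```shell
# List any AMX-related feature flags (amx_tile, amx_bf16, amx_int8) the kernel exposes.
# No output means the CPU, VM, or container configuration does not expose AMX.
grep -o 'amx[a-z_0-9]*' /proc/cpuinfo | sort -u
```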
## Next
### Faster
* The FlashInfer (https://github.com/flashinfer-ai/flashinfer) project is releasing an even more efficient fused MLA operator, promising further speedups
* vLLM has explored multi-token prediction in DeepSeek-V3, and support is on our roadmap for even better performance
* We are collaborating with Intel to enhance the AMX kernel (v0.3) and optimize for Xeon6/MRDIMM
### Easier
* Official Docker images to simplify installation
* Fix the server integration for web API access
* Support for more quantization types, including the highly requested dynamic quantization from unsloth

Stay tuned for more updates!
## FAQ
### R1 No Thinking
Attention! If you are testing R1, it may skip the thinking process. You can add the arg `--force_think true` (an example invocation is shown below); the details are in the [FAQ](./FAQ.md). <br>
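A minimal sketch of such an invocation, reusing the placeholder paths from the Docker usage example later in this diff (your local paths and `--cpu_infer` value will differ):

```shell
# Force R1 to emit its thinking trace; the paths below are placeholders, not real locations.
python -m ktransformers.local_chat \
  --gguf_path /models/path/to/gguf_path \
  --model_path /models/path/to/model_path \
  --cpu_infer 33 \
  --force_think true
```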
@@ -7,7 +7,7 @@
## Images
There is a Docker image available for our project; you can pull it with:
```
docker pull approachingai/ktransformers:0.1.1
docker pull approachingai/ktransformers:0.2.1
```
**Notice**: In this image, ktransformers is compiled for CPUs with the AVX512 instruction set. If your CPU does not support AVX512, it is suggested to recompile and install ktransformers in the /workspace/ktransformers directory inside the container (a sketch of this is shown below).
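A minimal sketch of that recompile, assuming the image ships the source at /workspace/ktransformers and that the build honors the same environment flags used in the Dockerfile hunk above; the exact install command in your image may differ:

```shell
docker exec -it ktransformers /bin/bash
# Inside the container:
cd /workspace/ktransformers
# Rebuild for the host CPU's native instruction set instead of the prebuilt AVX512 path
# (CPU_INSTRUCT and KTRANSFORMERS_FORCE_BUILD mirror the Dockerfile hunk above).
CPU_INSTRUCT=NATIVE KTRANSFORMERS_FORCE_BUILD=TRUE pip install . --no-build-isolation
```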
@@ -16,14 +16,16 @@ docker pull approachingai/ktransformers:0.1.1

- When it finishes, execute:
```bash
docker build -t approachingai/ktransformers:v0.1.1 .
docker build -t approachingai/ktransformers:0.2.1 .
```

## Usage

Assuming you have the [nvidia-container-toolkit](https://github.com/NVIDIA/nvidia-container-toolkit) installed so that you can use the GPU in a Docker container:
```
docker run --gpus all -v /path/to/models:/models -p 10002:10002 approachingai/ktransformers:v0.1.1 --port 10002 --gguf_path /models/path/to/gguf_path --model_path /models/path/to/model_path --web True
docker run --gpus all -v /path/to/models:/models --name ktransformers -itd approachingai/ktransformers:0.2.1
docker exec -it ktransformers /bin/bash
python -m ktransformers.local_chat --gguf_path /models/path/to/gguf_path --model_path /models/path/to/model_path --cpu_infer 33
```

You can find more operators in the [readme](../../README.md).

@@ -5,7 +5,7 @@ Description :
Author : kkk1nak0
Date : 2024-08-15 07:34:46
Version : 1.0.0
LastEditors : unicornchan
LastEditTime : 2025-02-10 00:59:53
LastEditors : chenxl
LastEditTime : 2025-02-15 03:53:02
'''
__version__ = "0.2.0"
__version__ = "0.2.1"

@@ -262,7 +262,7 @@ class KDeepseekV2Attention(BaseInjectedModule, DeepseekV2Attention):
        """

        # flash attn doesn't support head_dim bigger than 256
        # use vLLM triton attention kernel for MQA
        # use triton attention kernel adapted from vLLM and SGLang for MQA
        decode_attention_fwd_grouped(query_states, compressed_kv_with_k_pe, compressed_kv, attn_output,
                                     page_table,
                                     position_ids.squeeze(0).to(torch.int32), attn_logits,

@@ -551,4 +551,4 @@ class KLlamaAttention(BaseInjectedModule):
        if not output_attentions:
            attn_weights = None

        return attn_output, attn_weights, past_key_value
        return attn_output, attn_weights, past_key_value

@@ -1,3 +1,9 @@
# Adapted from
# https://github.com/sgl-project/sglang/blob/9f635ea50de920aa507f486daafba26a5b837574/python/sglang/srt/layers/attention/triton_ops/decode_attention.py
# which was originally adapted from
# https://github.com/ModelTC/lightllm/blob/96353e868a840db4d103138caf15ed9dbea8c186/lightllm/models/deepseek2/triton_kernel/gqa_flash_decoding_stage1.py
# https://github.com/ModelTC/lightllm/blob/96353e868a840db4d103138caf15ed9dbea8c186/lightllm/models/deepseek2/triton_kernel/gqa_flash_decoding_stage2.py

import triton
import triton.language as tl

@@ -376,4 +382,4 @@ def decode_attention_fwd_grouped(
    )

    _decode_softmax_reducev_fwd(attn_logits, q, o, v_buffer, b_seq_len,
                                num_kv_splits)
                                num_kv_splits)