From cb5617b4790aad65ecc0d8243aaf5e2a3cc09f5a Mon Sep 17 00:00:00 2001
From: qiyuxinlin <1668068727@qq.com>
Date: Mon, 28 Apr 2025 22:14:23 +0000
Subject: [PATCH] update readme

---
 README.md     |  3 ++-
 doc/en/AMX.md | 28 +++++++++++++++++++++++++++-
 2 files changed, 29 insertions(+), 2 deletions(-)

diff --git a/README.md b/README.md
index db9029a..a0b8a5f 100644
--- a/README.md
+++ b/README.md
@@ -22,7 +22,8 @@ interface, RESTful APIs compliant with OpenAI and Ollama, and even a simplified
 Our vision for KTransformers is to serve as a flexible platform for experimenting with innovative LLM inference optimizations. Please let us know if you need any other features.

🔥 Updates

-
+* **Apr 29, 2025**: Support AMX-Int8 and AMX-BF16 ([Tutorial](./doc/en/AMX.md)). Support Qwen3MoE.
+https://github.com/user-attachments/assets/14992126-5203-4855-acf3-d250acead6b2
 * **Apr 9, 2025**: Experimental support for LLaMA 4 models ([Tutorial](./doc/en/llama4.md)).
 * **Apr 2, 2025**: Support Multi-concurrency. ([Tutorial](./doc/en/balance-serve.md)).

diff --git a/doc/en/AMX.md b/doc/en/AMX.md
index b729907..048b1d7 100644
--- a/doc/en/AMX.md
+++ b/doc/en/AMX.md
@@ -1,3 +1,27 @@
+# Qwen 3 + KTransformers 0.3 (+AMX) = AI Workstation/PC
+
+Following DeepSeek-V3/R1, LLaMA 4, and Kimi-VL, Qwen has also released an impressive MoE model; this year undoubtedly belongs to MoE. As a low-barrier inference system for running MoE models in local heterogeneous environments, KTransformers naturally joins the party. Thanks to the support of the Qwen team, we completed Day 0 support for the entire Qwen 3 series of MoE models. We also took this opportunity to open-source the long-awaited preliminary version of our AMX high-performance operators (BF16 and Int8; an Int4 variant is coming soon), officially advancing to version 0.3.
+
+What excites us most about Qwen3MoE is that, unlike the 671B "giant" DeepSeek model, its two configurations, 235B-A22B and 30B-A3B, hit the performance sweet spots for local workstations and consumer-grade PCs, respectively. Accordingly, we ran benchmarks in two typical setups:
+
+* Server CPU (Xeon 4) + RTX 4090
+* Consumer-grade CPU (Core i9-14900KF + dual-channel DDR4-4000 MT/s) + RTX 4090
+
+The results are as follows:
+
+https://github.com/user-attachments/assets/14992126-5203-4855-acf3-d250acead6b2
+
+| Machine | Model | GPU Memory | RAM Usage | Prefill (tokens/s) | Decode (tokens/s) |
+| --- | --- | --- | --- | --- | --- |
+| Workstation (Xeon 4 + RTX 4090) | Qwen3-30B-A3B (8-bit) | 8.6 GB | 44 GB | 313 | 33 (single) → 50 (4-way) |
+| Workstation (Xeon 4 + RTX 4090) | Qwen3-30B-A3B (4-bit) | 8.6 GB | 20 GB | 347.7 | 49.8 (single) → 98.8 (4-way) |
+| Workstation (Xeon 4 + RTX 4090) | Qwen3-235B-A22B (4-bit) | 13 GB | 160 GB | 114.9 | 13.8 (single) → 24.4 (4-way) |
+| Personal PC (Core i9-14900KF + RTX 4090) | Qwen3-30B-A3B (4-bit) | 8.6 GB | 20 GB | 240.0 | 12.0 (single) → 26.4 (4-way) |
+| Personal PC (Core i9-14900KF + RTX 4090) | Qwen3-235B-A22B (4-bit) | 13 GB | 160 GB | 45 | 2.5 (single) → 6.0 (4-way) |
+
+You can see that, thanks to the AMX instruction optimizations, prefill reaches up to 347.7 tokens/s in the workstation scenario. On consumer-grade CPUs, we can run the large 235B-A22B model and get smooth performance from the smaller 30B-A3B; judging by the resource footprint, even a high-end gaming laptop should handle 30B-A3B comfortably. After so much talk about the AIPC concept, its feasibility is finally in sight.
+
+To make the AMX optimizations we have open-sourced easier to understand, we have prepared the brief document below. We also extend our gratitude to Intel for their assistance.
 # Introduction to AMX Instruction Set
 Intel Advanced Matrix Extensions (AMX) are a set of specialized instruction extensions introduced for the x86 architecture starting with Sapphire Rapids (4th generation Xeon Scalable processors) and onward. AMX accelerates large-scale matrix computations at the hardware level, particularly for the compute-intensive parts of deep learning inference and machine learning workloads.
 By introducing the concept of Tile registers, it loads 2D sub-matrices into dedicated Tile registers and performs matrix multiply-accumulate operations at the register level, significantly improving throughput and energy efficiency.
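To make the Tile-register model concrete, here is a minimal standalone C sketch of the AMX flow the doc describes: request tile-state permission from the kernel, configure the tile palette, load 2D sub-matrices into tile registers, and issue a BF16 multiply-accumulate. This is an illustration of the instruction set (following Intel's published usage), not the KTransformers kernel code; the tile shapes and zero-filled inputs are placeholders. Compile with `gcc -O2 -mamx-tile -mamx-bf16` on Linux with a 4th Gen Xeon Scalable or newer CPU.

```c
#include <immintrin.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>

#define ARCH_REQ_XCOMP_PERM 0x1023
#define XFEATURE_XTILEDATA  18

// Tile configuration "palette": rows and bytes-per-row for each tile register.
struct tile_config {
    uint8_t  palette_id;
    uint8_t  start_row;
    uint8_t  reserved[14];
    uint16_t colsb[16];   // bytes per row of each tile
    uint8_t  rows[16];    // number of rows of each tile
};

int main(void) {
    // Linux requires a one-time permission request before using AMX tile state.
    if (syscall(SYS_arch_prctl, ARCH_REQ_XCOMP_PERM, XFEATURE_XTILEDATA))
        return 1;

    struct tile_config cfg = { .palette_id = 1 };
    cfg.rows[0] = 16; cfg.colsb[0] = 64;  // tmm0: 16x16 FP32 accumulator tile
    cfg.rows[1] = 16; cfg.colsb[1] = 64;  // tmm1: 16x32 BF16 A sub-matrix
    cfg.rows[2] = 16; cfg.colsb[2] = 64;  // tmm2: 16x32 BF16 B sub-matrix (VNNI pair layout)
    _tile_loadconfig(&cfg);

    static uint16_t a[16][32], b[16][32]; // BF16 bit patterns, zero-filled for brevity
    static float    c[16][16];
    memset(c, 0, sizeof(c));

    _tile_loadd(1, a, 64);     // load A sub-matrix into tile register tmm1 (stride in bytes)
    _tile_loadd(2, b, 64);     // load B sub-matrix into tmm2
    _tile_loadd(0, c, 64);     // load the FP32 accumulator into tmm0
    _tile_dpbf16ps(0, 1, 2);   // tmm0 += tmm1 * tmm2: BF16 dot-product, FP32 accumulate
    _tile_stored(0, c, 64);    // write the accumulated tile back to memory
    _tile_release();           // release tile state
    return 0;
}
```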
@@ -137,4 +161,6 @@ KTransformers allows users to easily switch between different backends through s
   backend: "AMXInt8" # or "AMXBF16" or "llamafile" (default)
 ```
 
-**Note:** Currently, using AMXInt8 requires reading weights from a BF16 GGUF file and performing online quantization during model loading. This may cause slightly slower load times. Future versions will provide pre-quantized weights to eliminate this overhead.
\ No newline at end of file
+**Note:** Currently, using AMXInt8 requires reading weights from a BF16 GGUF file and performing online quantization during model loading. This may cause slightly slower load times. Future versions will provide pre-quantized weights to eliminate this overhead.
+
+![Image](https://github.com/user-attachments/assets/7c33c410-3af9-456f-aa67-5b24e19ba680)
\ No newline at end of file
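For context, the `backend` line shown in the hunk above sits inside a KTransformers optimize-rule YAML file. The sketch below shows roughly where that kwarg lives; only the `backend` value is taken from this patch, while the match regex, class path, and other kwargs are assumptions about the rule format and should be checked against the rule files shipped with the Qwen3MoE release.

```yaml
# Sketch of an optimize-rule entry selecting the AMX backend for MoE experts.
# Field names other than `backend` are illustrative, not copied from the repo.
- match:
    name: "^model\\.layers\\..*\\.mlp\\.experts$"   # route MoE expert modules
  replace:
    class: ktransformers.operators.experts.KTransformersExperts
    kwargs:
      generate_device: "cpu"       # experts decode on CPU
      generate_op: "KExpertsCPU"
      backend: "AMXInt8"           # or "AMXBF16" or "llamafile" (default)
```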