From cb5617b4790aad65ecc0d8243aaf5e2a3cc09f5a Mon Sep 17 00:00:00 2001
From: qiyuxinlin <1668068727@qq.com>
Date: Mon, 28 Apr 2025 22:14:23 +0000
Subject: [PATCH] update readme
---
README.md | 3 ++-
doc/en/AMX.md | 28 +++++++++++++++++++++++++++-
2 files changed, 29 insertions(+), 2 deletions(-)
diff --git a/README.md b/README.md
index db9029a..a0b8a5f 100644
--- a/README.md
+++ b/README.md
@@ -22,7 +22,8 @@ interface, RESTful APIs compliant with OpenAI and Ollama, and even a simplified
Our vision for KTransformers is to serve as a flexible platform for experimenting with innovative LLM inference optimizations. Please let us know if you need any other features.
🔥 Updates
-
+* **Apr 29, 2025**: Support AMX-Int8 and AMX-BF16 ([Tutorial](./doc/en/AMX.md)). Support Qwen3MoE.
+https://github.com/user-attachments/assets/14992126-5203-4855-acf3-d250acead6b2
* **Apr 9, 2025**: Experimental support for LLaMA 4 models ([Tutorial](./doc/en/llama4.md)).
* **Apr 2, 2025**: Support Multi-concurrency. ([Tutorial](./doc/en/balance-serve.md)).
diff --git a/doc/en/AMX.md b/doc/en/AMX.md
index b729907..048b1d7 100644
--- a/doc/en/AMX.md
+++ b/doc/en/AMX.md
@@ -1,3 +1,27 @@
+# Qwen 3 + KTransformers 0.3 (+AMX) = AI Workstation/PC
+Following DeepSeek-V3/R1, LLaMA 4, and Kimi-VL, Qwen has also released an impressive MoE model; undoubtedly, this year belongs to MoE. As a low-barrier inference system for running MoE models in local heterogeneous environments, KTransformers naturally joins the party. Thanks to the support of the Qwen team, we completed Day 0 support for the entire Qwen 3 series of MoE models. At the same time, we took this opportunity to open-source the long-awaited preliminary version of our AMX high-performance operators (BF16 and Int8; an Int4 variant is coming soon), officially advancing to version 0.3.
+
+What excites us most about Qwen3MoE is that, unlike the 671B “giant” model, its two configurations, 235B-A22B and 30B-A3B, hit the performance sweet spots for both local workstations and consumer-grade PCs. Accordingly, we ran benchmarks in two typical setups:
+
+* Server CPU (Xeon 4) + RTX 4090
+* Consumer-grade CPU (Core i9-14900KF + dual-channel DDR4-4000 MT/s) + RTX 4090
+
+The results are as follows:
+
+https://github.com/user-attachments/assets/14992126-5203-4855-acf3-d250acead6b2
+
+| Machine | Model | GPU Memory | RAM Usage | Prefill (tokens/s) | Decode (tokens/s) |
+| --- | --- | --- | --- | --- | --- |
+| Workstation (Xeon 4 + RTX 4090) | Qwen3-30B-A3B (8-bit) | 8.6 GB | 44 GB | 313 | 33 (single) → 50 (4-way) |
+| Workstation (Xeon 4 + RTX 4090) | Qwen3-30B-A3B (4-bit) | 8.6 GB | 20 GB | 347.7 | 49.8 (single) → 98.8 (4-way) |
+| Workstation (Xeon 4 + RTX 4090) | Qwen3-235B-A22B (4-bit) | 13 GB | 160 GB | 114.9 | 13.8 (single) → 24.4 (4-way) |
+| Personal PC (Core i9-14900KF + RTX 4090) | Qwen3-30B-A3B (4-bit) | 8.6 GB | 20 GB | 240.0 | 12.0 (single) → 26.4 (4-way) |
+| Personal PC (Core i9-14900KF + RTX 4090) | Qwen3-235B-A22B (4-bit) | 13 GB | 160 GB | 45 | 2.5 (single) → 6.0 (4-way) |
+
+Thanks to the AMX instruction optimizations, prefill reaches up to 347.7 tokens/s in the workstation scenario. On a consumer-grade CPU, we can both run the large 235B-A22B model and deliver smooth performance on the smaller 30B-A3B. Judging by the resource footprint, even a high-end gaming laptop should be able to run 30B-A3B smoothly. After years of talking about the AI PC concept, its feasibility is finally in sight.
+
+To help everyone understand the AMX optimizations we have open-sourced, we have prepared this brief document. We also extend our gratitude to Intel for their assistance.
+
# Introduction to AMX Instruction Set
Intel Advanced Matrix Extensions (AMX) are a set of specialized instruction extensions for the x86 architecture, introduced with Sapphire Rapids (4th generation Xeon Scalable processors). AMX accelerates large-scale matrix computations in hardware, particularly the compute-intensive parts of deep learning inference and machine learning workloads. It introduces dedicated Tile registers: 2D sub-matrices are loaded into these registers and matrix multiply-accumulate operations are performed directly on them, significantly improving throughput and energy efficiency.
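+
+To make the Tile workflow concrete, below is a minimal sketch of a single Int8 tile multiply-accumulate written against the AMX intrinsics in `immintrin.h`. The shapes, buffer names, and standalone `main` are illustrative only and are not taken from the KTransformers kernels:
+
+```cpp
+// Minimal sketch: one AMX Int8 tile multiply-accumulate, C += A * B.
+// Assumes Linux on Sapphire Rapids or newer.
+// Illustrative build: g++ -O2 -mamx-tile -mamx-int8 amx_demo.cpp
+#include <immintrin.h>
+#include <sys/syscall.h>
+#include <unistd.h>
+#include <cstdint>
+#include <cstring>
+#include <cstdio>
+
+// Linux requires a one-time permission request before a process may use
+// the AMX tile state.
+static bool request_amx_permission() {
+    constexpr long ARCH_REQ_XCOMP_PERM = 0x1023;
+    constexpr long XFEATURE_XTILEDATA  = 18;
+    return syscall(SYS_arch_prctl, ARCH_REQ_XCOMP_PERM, XFEATURE_XTILEDATA) == 0;
+}
+
+// Tile configuration: rows and bytes-per-row for each Tile register.
+struct alignas(64) TileConfig {
+    uint8_t  palette_id = 1;
+    uint8_t  start_row  = 0;
+    uint8_t  reserved[14] = {};
+    uint16_t colsb[16] = {};  // bytes per row of each tmm register
+    uint8_t  rows[16]  = {};
+};
+
+int main() {
+    if (!request_amx_permission()) { perror("AMX unavailable"); return 1; }
+
+    // C (16x16 int32) += A (16x64 int8) * B (int8, pre-packed in groups of
+    // 4 along K, as _tile_dpbssd expects).
+    alignas(64) int8_t  A[16 * 64];
+    alignas(64) int8_t  B[16 * 64];
+    alignas(64) int32_t C[16 * 16] = {};
+    memset(A, 1, sizeof(A));
+    memset(B, 1, sizeof(B));
+
+    TileConfig cfg;
+    cfg.rows[0] = 16; cfg.colsb[0] = 64;  // tmm0: int32 accumulator C
+    cfg.rows[1] = 16; cfg.colsb[1] = 64;  // tmm1: A sub-matrix
+    cfg.rows[2] = 16; cfg.colsb[2] = 64;  // tmm2: packed B sub-matrix
+    _tile_loadconfig(&cfg);
+
+    _tile_loadd(0, C, 64);  // load 2D sub-matrices into Tile registers
+    _tile_loadd(1, A, 64);  // (stride = 64 bytes per row)
+    _tile_loadd(2, B, 64);
+    _tile_dpbssd(0, 1, 2);  // int8 x int8 dot-products accumulated in int32
+    _tile_stored(0, C, 64);
+    _tile_release();
+
+    printf("C[0][0] = %d\n", C[0]);  // 64 products of 1*1, so prints 64
+    return 0;
+}
+```
+
+A production kernel would of course tile over the full matrices, keep the accumulator tile resident across the K dimension, and load the tile configuration only once; the BF16 path is analogous, with `_tile_dpbf16ps` accumulating into FP32.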
@@ -137,4 +161,6 @@ KTransformers allows users to easily switch between different backends through s
backend: "AMXInt8" # or "AMXBF16" or "llamafile" (default)
```
-**Note:** Currently, using AMXInt8 requires reading weights from a BF16 GGUF file and performing online quantization during model loading. This may cause slightly slower load times. Future versions will provide pre-quantized weights to eliminate this overhead.
\ No newline at end of file
+**Note:** Currently, using AMXInt8 requires reading weights from a BF16 GGUF file and quantizing them online during model loading, which slightly increases load time. Future versions will provide pre-quantized weights to eliminate this overhead.
+
+
\ No newline at end of file