update readme

qiyuxinlin 2025-04-28 22:14:23 +00:00
parent 063c5489b3
commit cb5617b479
2 changed files with 29 additions and 2 deletions


@@ -22,7 +22,8 @@ interface, RESTful APIs compliant with OpenAI and Ollama, and even a simplified
Our vision for KTransformers is to serve as a flexible platform for experimenting with innovative LLM inference optimizations. Please let us know if you need any other features.
<h2 id="Updates">🔥 Updates</h2>
* **Apr 29, 2025**: Support AMX-Int8 and AMX-BF16 ([Tutorial](./doc/en/AMX.md)). Support Qwen3MoE.
https://github.com/user-attachments/assets/14992126-5203-4855-acf3-d250acead6b2
* **Apr 9, 2025**: Experimental support for LLaMA 4 models ([Tutorial](./doc/en/llama4.md)).
* **Apr 2, 2025**: Support multi-concurrency ([Tutorial](./doc/en/balance-serve.md)).


@@ -1,3 +1,27 @@
# Qwen 3 + KTransformers 0.3 (+AMX) = AI Workstation/PC
Following DeepSeek-V3/R1, LLaMa-4, and Kimi-VL, Qwen has also released an impressive MoE model—undoubtedly, this year belongs to MoE. As a low-barrier inference system for running MoE models in local heterogeneous environments, KTransformers naturally joins the party. Thanks to the support of the Qwen team, we completed Day 0 support for the entire Qwen 3 series of MoE models. At the same time, we took this opportunity to open-source the long-awaited preliminary version of our AMX high-performance operators (BF16, Int8; an int4 variant is coming soon), officially advancing to version 0.3.
What excites me most about Qwen3MoE is that, unlike the 671B "giant" model, its two configurations, 235B-A22B and 30B-A3B, hit the performance sweet spots for both local workstations and consumer-grade PCs. Accordingly, we ran benchmarks in two typical setups:
- Server CPU (Xeon 4) + RTX 4090
- Consumer-grade CPU (Core i9-14900KF + dual-channel DDR4-4000 MT/s) + RTX 4090

The results are as follows:
https://github.com/user-attachments/assets/14992126-5203-4855-acf3-d250acead6b2
| Machine | Model | GPU Memory | RAM Usage | Prefill (tokens/s) | Decode (tokens/s) |
| --- | --- | --- | --- | --- | --- |
| Workstation (Xeon 4 + RTX 4090) | Qwen3-30B-A3B (8-bit) | 8.6 GB | 44 GB | 313 | 33 (single) → 50 (4-way) |
| Workstation (Xeon 4 + RTX 4090) | Qwen3-30B-A3B (4-bit) | 8.6 GB | 20 GB | 347.7 | 49.8 (single) → 98.8 (4-way) |
| Workstation (Xeon 4 + RTX 4090) | Qwen3-235B-A22B (4-bit) | 13 GB | 160 GB | 114.9 | 13.8 (single) → 24.4 (4-way) |
| Personal PC (Core i9-14900KF + RTX 4090) | Qwen3-30B-A3B (4-bit) | 8.6 GB | 20 GB | 240.0 | 12.0 (single) → 26.4 (4-way) |
| Personal PC (Core i9-14900KF + RTX 4090) | Qwen3-235B-A22B (4-bit) | 13 GB | 160 GB | 45 | 2.5 (single) → 6.0 (4-way) |
You can see that, thanks to the AMX instruction optimizations, we achieve up to 347 tokens/s prefill performance in the workstation scenario. On consumer-grade CPUs, we are able to run the large model (235B-A22B) and deliver smooth performance on the smaller 30B-A3B. Judging by the resource overhead, even a high-end gaming laptop should be able to handle 30B-A3B smoothly. After talking about the concept of the AI PC for so long, we can finally see its feasibility.
To make it easier for everyone to understand the AMX optimizations we've open-sourced, we've prepared a brief document. We also extend our gratitude to Intel for their assistance.
# Introduction to the AMX Instruction Set
Intel Advanced Matrix Extensions (AMX) are a set of specialized instruction extensions introduced to the x86 architecture starting with Sapphire Rapids (4th-generation Xeon Scalable processors). AMX accelerates large-scale matrix computations in hardware, particularly the compute-intensive parts of deep learning inference and machine learning workloads. It introduces dedicated 2D Tile registers: sub-matrices are loaded into these tiles and matrix multiply-accumulate operations are performed directly at the register level, significantly improving throughput and energy efficiency.
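To make the Tile-register idea concrete, below is a minimal C++ sketch (not KTransformers' actual kernel) of a single AMX Int8 tile multiply-accumulate using the compiler intrinsics from `immintrin.h`. The tile shapes, the all-ones test data, and the `TileConfig` struct name are illustrative assumptions; a real kernel would also pre-pack the B operand into the 4-byte-interleaved (VNNI) layout that `_tile_dpbssd` expects.

```cpp
// Minimal AMX Int8 sketch: C (16x16 int32) += A (16x64 int8) * B (VNNI-packed int8).
// Build (GCC/Clang on a Sapphire Rapids host): g++ -O2 -mamx-tile -mamx-int8 amx_demo.cpp
#include <immintrin.h>
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <sys/syscall.h>
#include <unistd.h>

// 64-byte tile configuration block (palette 1).
struct TileConfig {
    uint8_t  palette_id;
    uint8_t  start_row;
    uint8_t  reserved[14];
    uint16_t colsb[16];   // bytes per row of each tile
    uint8_t  rows[16];    // number of rows of each tile
};

int main() {
    // Linux requires the process to request AMX tile-data permission once.
    constexpr int ARCH_REQ_XCOMP_PERM = 0x1023;
    constexpr int XFEATURE_XTILEDATA  = 18;
    if (syscall(SYS_arch_prctl, ARCH_REQ_XCOMP_PERM, XFEATURE_XTILEDATA)) {
        std::puts("AMX not available");
        return 1;
    }

    TileConfig cfg{};
    cfg.palette_id = 1;
    cfg.rows[0] = 16; cfg.colsb[0] = 16 * 4;  // tmm0: C, 16x16 int32
    cfg.rows[1] = 16; cfg.colsb[1] = 64;      // tmm1: A, 16x64 int8
    cfg.rows[2] = 16; cfg.colsb[2] = 64;      // tmm2: B, 16 rows x 64 bytes (VNNI-packed)
    _tile_loadconfig(&cfg);

    alignas(64) int8_t  A[16][64], B[16][64];
    alignas(64) int32_t C[16][16];
    std::memset(A, 1, sizeof A);              // all-ones inputs for a quick sanity check
    std::memset(B, 1, sizeof B);

    _tile_zero(0);                            // C = 0
    _tile_loadd(1, A, 64);                    // load A into tmm1 (stride in bytes)
    _tile_loadd(2, B, 64);                    // load pre-packed B into tmm2
    _tile_dpbssd(0, 1, 2);                    // C += A * B (int8 x int8 -> int32)
    _tile_stored(0, C, 16 * 4);               // write C back to memory
    _tile_release();

    std::printf("C[0][0] = %d\n", C[0][0]);   // expect 64 (sum over K=64 of 1*1)
    return 0;
}
```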
@@ -137,4 +161,6 @@ KTransformers allows users to easily switch between different backends through s
backend: "AMXInt8" # or "AMXBF16" or "llamafile" (default)
```
**Note:** Currently, using AMXInt8 requires reading weights from a BF16 GGUF file and performing online quantization during model loading. This may cause slightly slower load times. Future versions will provide pre-quantized weights to eliminate this overhead.
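For readers wondering what that online quantization step amounts to, here is a minimal sketch under the assumption of simple per-row symmetric Int8 quantization; it is illustrative only, and the function name `quantize_row_int8` is ours, not a KTransformers API.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Illustrative per-row symmetric int8 quantization: each row of BF16/FP32
// weights is mapped to int8 plus one float scale, so that w ≈ q * scale.
float quantize_row_int8(const float* w, int8_t* q, int cols) {
    float amax = 0.0f;
    for (int i = 0; i < cols; ++i) amax = std::max(amax, std::fabs(w[i]));
    const float scale = amax > 0.0f ? amax / 127.0f : 1.0f;
    for (int i = 0; i < cols; ++i) {
        const float v = std::clamp(w[i] / scale, -127.0f, 127.0f);
        q[i] = static_cast<int8_t>(std::lround(v));
    }
    return scale;  // kept alongside q so the int8 kernel can rescale its results
}
```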
![Image](https://github.com/user-attachments/assets/7c33c410-3af9-456f-aa67-5b24e19ba680)