From e70db18b63237975f989f71abf12b0615ec43e4c Mon Sep 17 00:00:00 2001
From: qiyuxinlin <1668068727@qq.com>
Date: Mon, 28 Apr 2025 23:08:38 +0000
Subject: [PATCH] update AMX readme

---
 doc/en/AMX.md | 14 ++++++++------
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/doc/en/AMX.md b/doc/en/AMX.md
index 9588fba..760b4c0 100644
--- a/doc/en/AMX.md
+++ b/doc/en/AMX.md
@@ -19,8 +19,10 @@ You can see that, thanks to the AMX instruction optimizations, we achieve up to
 
 Here is the Qwen3MoE startup command:
 ``` python
-python ktransformers/server/main.py --architectures Qwen3MoeForCausalLM --model_path --gguf_path --optimize_config_path ktransformers/optimize/optimize_rules/Qwen3Moe-serve.yaml # llamafile backend
-python ktransformers/server/main.py --architectures Qwen3MoeForCausalLM --model_path --gguf_path --optimize_config_path ktransformers/optimize/optimize_rules/Qwen3Moe-serve-amx.yaml # AMX backend
+# llamafile backend
+python ktransformers/server/main.py --architectures Qwen3MoeForCausalLM --model_path --gguf_path --optimize_config_path ktransformers/optimize/optimize_rules/Qwen3Moe-serve.yaml
+# AMX backend
+python ktransformers/server/main.py --architectures Qwen3MoeForCausalLM --model_path --gguf_path --optimize_config_path ktransformers/optimize/optimize_rules/Qwen3Moe-serve-amx.yaml
 ```
 
 **Note: At present, Qwen3MoE running with AMX can only read BF16 GGUF; support for loading from safetensor will be added later.**
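Once either backend is up, the server can be smoke-tested over its OpenAI-compatible API. A minimal sketch, assuming the default port 10002 and the `/v1/chat/completions` route (both are assumptions about the local setup, not part of this patch):

``` python
import json
import urllib.request

# Hypothetical smoke test for the server started above; port and route
# are assumptions -- adjust to your --port setting.
payload = {
    "model": "Qwen3MoE",
    "messages": [{"role": "user", "content": "Hello"}],
    "max_tokens": 32,
}
req = urllib.request.Request(
    "http://localhost:10002/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["choices"][0]["message"]["content"])
```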
@@ -62,7 +64,7 @@ Taking INT8 as an example, AMX can perform the multiplication of two 16×64 sub-
 
-<img alt="amx_intro" …>
+<img alt="amx_intro" …>
 
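The hunk header above cites AMX multiplying 16×64 sub-matrices in INT8. As a rough numeric illustration of what one such tile operation computes (a sketch of the math only, not the kernel's code; the real TDPBSSD instruction also expects the second operand in a packed VNNI layout and runs entirely inside tile registers):

``` python
import numpy as np

# One AMX INT8 tile op, emulated: a 16x64 INT8 tile times a 64x16 INT8
# tile accumulates into a 16x16 INT32 tile register.
A = np.random.randint(-128, 128, size=(16, 64), dtype=np.int8)
B = np.random.randint(-128, 128, size=(64, 16), dtype=np.int8)
C = np.zeros((16, 16), dtype=np.int32)

C += A.astype(np.int32) @ B.astype(np.int32)  # 16*16*64 MACs per op
print(C.shape, C.dtype)                       # (16, 16) int32
```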
@@ -87,7 +89,7 @@ During inference, we designed around the CPU’s multi-level cache hierarchy to
 
-<img alt="amx" …>
+<img alt="amx" …>
 
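On the cache-hierarchy point referenced in the hunk above, the general idea is to block the GEMM so the hot operand stays resident in a core-private cache and is reused before it is evicted. A minimal sketch, assuming a 1 MiB usable L2 per core (the figure and the blocking rule are illustrative assumptions, not the kernel's actual tiling):

``` python
import numpy as np

L2_BYTES = 1 << 20  # assumed usable L2 per core

def blocked_matmul(A, B, l2_bytes=L2_BYTES):
    M, K = A.shape
    _, N = B.shape
    # Size a column block of B so K*bn bytes of weights fit in L2,
    # letting one block be reused across every row of A.
    bn = max(16, min(N, l2_bytes // (K * B.itemsize)))
    C = np.zeros((M, N), dtype=np.int32)
    for j in range(0, N, bn):  # stream over column blocks of B
        C[:, j:j+bn] = A.astype(np.int32) @ B[:, j:j+bn].astype(np.int32)
    return C

A = np.random.randint(-128, 128, size=(32, 4096), dtype=np.int8)
B = np.random.randint(-128, 128, size=(4096, 1024), dtype=np.int8)
assert np.array_equal(blocked_matmul(A, B), A.astype(np.int32) @ B.astype(np.int32))
```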
@@ -104,7 +106,7 @@ Although AMX is highly efficient for large-scale matrix multiplication, it perfo
 
-<img alt="amx_avx" …>
+<img alt="amx_avx" …>
 
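The hunk above concerns the case where AMX is not a win: very small matrices, such as single-token decode with batch size 1, cannot fill a 16-row tile, so a vector (AVX-512) path is preferable. A toy dispatch rule under that assumption (the threshold and the names are illustrative, not ktransformers' actual API):

``` python
AMX_MIN_ROWS = 16  # assumed cutoff: below one full tile of rows, AMX
                   # setup/packing overhead outweighs its throughput

def pick_backend(m_rows: int) -> str:
    """Choose a GEMM path by activation row count."""
    return "amx" if m_rows >= AMX_MIN_ROWS else "avx512"

print(pick_backend(1))    # decode, batch 1 -> avx512
print(pick_backend(512))  # prefill         -> amx
```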
@@ -124,7 +126,7 @@ Thanks to these optimizations, our kernel can achieve 21 TFLOPS of BF16 throughp
 
-<img alt="onednn_1" …>
+<img alt="onednn_1" …>
 
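For context on the 21 TFLOPS BF16 figure cited in the last hunk header: throughput numbers like this are typically derived as GEMM FLOPs divided by wall-clock time. A worked example with placeholder dimensions and timing (not the README's actual benchmark):

``` python
M, N, K = 4096, 4096, 4096
elapsed_s = 0.0065                # placeholder wall-clock time
flops = 2 * M * N * K             # one multiply + one add per MAC
print(f"{flops / elapsed_s / 1e12:.1f} TFLOPS")  # ~21.1 with these numbers
```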