update AMX readme

2025-09-12 16:19:42 +00:00 · 2025-04-28 23:08:38 +00:00 · 2025-04-28 23:08:38 +00:00 · e70db18b63
commit e70db18b63
parent 2e905c8bd4
1 changed file with 8 additions and 6 deletions
--- a/doc/en/AMX.md
+++ b/doc/en/AMX.md
@@ -19,8 +19,10 @@ You can see that, thanks to the AMX instruction optimizations, we achieve up to
 Here is the Qwen3MoE startup command:

 ``` python
-python ktransformers/server/main.py --architectures Qwen3MoeForCausalLM --model_path <model_dir> --gguf_path <gguf_dir> --optimize_config_path ktransformers/optimize/optimize_rules/Qwen3Moe-serve.yaml # llamafile backend
-python ktransformers/server/main.py --architectures Qwen3MoeForCausalLM --model_path <model_dir> --gguf_path <gguf_dir> --optimize_config_path ktransformers/optimize/optimize_rules/Qwen3Moe-serve-amx.yaml # AMX backend
+# llamafile backend
+python ktransformers/server/main.py --architectures Qwen3MoeForCausalLM --model_path <model_dir> --gguf_path <gguf_dir> --optimize_config_path ktransformers/optimize/optimize_rules/Qwen3Moe-serve.yaml 
+# AMX backend
+python ktransformers/server/main.py --architectures Qwen3MoeForCausalLM --model_path <model_dir> --gguf_path <gguf_dir> --optimize_config_path ktransformers/optimize/optimize_rules/Qwen3Moe-serve-amx.yaml 
 ```

 **Note: At present, Qwen3MoE running with AMX can only read BF16 GGUF; support for loading from safetensor will be added later.**
@@ -62,7 +64,7 @@ Taking INT8 as an example, AMX can perform the multiplication of two 16×64 sub-

 <p align="center">
  <picture>
-    <img alt="amx_intro" src="https://github.com/kvcache-ai/ktransformers/tree/main/doc/assets/amx_intro.png" width=60%>
+    <img alt="amx_intro" src="../assets/amx_intro.png" width=60%>
  </picture>
 </p>

@@ -87,7 +89,7 @@ During inference, we designed around the CPU’s multi-level cache hierarchy to

 <p align="center">
  <picture>
-    <img alt="amx" src="https://github.com/kvcache-ai/ktransformers/tree/main/doc/assets/amx.png" width=60%>
+    <img alt="amx" src="../assets/amx.png" width=60%>
  </picture>
 </p>

@@ -104,7 +106,7 @@ Although AMX is highly efficient for large-scale matrix multiplication, it perfo

 <p align="center">
  <picture>
-    <img alt="amx_avx" src="https://github.com/kvcache-ai/ktransformers/tree/main/doc/assets/amx_avx.png" width=60%>
+    <img alt="amx_avx" src="../assets/amx_avx.png" width=60%>
  </picture>
 </p>

@@ -124,7 +126,7 @@ Thanks to these optimizations, our kernel can achieve 21 TFLOPS of BF16 throughp

 <p align="center">
  <picture>
-    <img alt="onednn_1" src="https://github.com/kvcache-ai/ktransformers/tree/main/doc/assets/onednn_1.png" width=60%>
+    <img alt="onednn_1" src="../assets/onednn_1.png" width=60%>
  </picture>
 </p>