From ecc3028c13d3f7ad1e8f09206f9bca425e99da8e Mon Sep 17 00:00:00 2001
From: djw <1913953267@qq.com>
Date: Wed, 9 Apr 2025 09:34:04 +0000
Subject: [PATCH] update llama4 tutorial

---
 doc/README.md    |   1 +
 doc/en/llama4.md | 112 +++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 113 insertions(+)
 create mode 100644 doc/en/llama4.md
diff --git a/doc/README.md b/doc/README.md
index d3acca7..05df2d3 100644
--- a/doc/README.md
+++ b/doc/README.md
@@ -22,6 +22,7 @@ Our vision for KTransformers is to serve as a flexible platform for experimentin
 
 <h2 id="Updates">🔥 Updates</h2>
 
+* **Apr 9, 2025**: Experimental support for LLaMA 4 models ([Tutorial](./en/llama4.md)).
 * **Apr 2, 2025**: Support Multi-concurrency. ([Tutorial](./en/balance-serve.md)).
 * **Mar 27, 2025**: Support Multi-concurrency.
 * **Mar 15, 2025**: Support ROCm on AMD GPU ([Tutorial](./en/ROCm.md)).
diff --git a/doc/en/llama4.md b/doc/en/llama4.md
new file mode 100644
index 0000000..56d365f
--- /dev/null
+++ b/doc/en/llama4.md
@@ -0,0 +1,112 @@
+# 🦙 Tutorial: LLaMA 4 Multi-Concurrency Support with KTransformers (Balance Serve Backend)
+
+## 📌 Overview
+
+**KTransformers** now provides **experimental support for LLaMA 4 models** through the `balance_serve` backend introduced in **v0.2.4**. The support lives on a dedicated development branch, [`support-llama4`](https://github.com/kvcache-ai/ktransformers/tree/support-llama4), which targets the newly released **Meta LLaMA 4** model architecture.
+
+⚠️ This support is currently **not available on the main branch** because it depends on a newer version of `transformers` and has **compatibility limitations with inference for the currently supported models**. It will be merged into main once broader stability and compatibility have been validated.
+
+💡 **If you already have an environment based on the main branch**, it is **strongly recommended to create a new environment** to avoid potential dependency conflicts.
+
+------
+
+## 🔗 Model & Resource Links
+
+- 🔥 Official LLaMA 4 Release: https://huggingface.co/meta-llama/Llama-4-Scout-17B-16E-Instruct
+   (Note: LLaMA 4 models are served through the Meta repository. Make sure to **agree to terms** before downloading.)
+- 🧠 GGUF Format (quantized models):
+  - https://huggingface.co/mradermacher/Llama-4-Scout-17B-16E-Instruct-GGUF
+
+------
+
+## 🧪 Demo
+
+https://github.com/user-attachments/assets/449706f1-784b-4931-b2ba-07687c1aca54
+
+------
+
+## ⚙️ Usage Instructions
+
+### 1. Clone `support-llama4` Branch
+
+```bash
+git clone https://github.com/kvcache-ai/ktransformers.git
+cd ktransformers
+git checkout support-llama4
+git submodule update --init --recursive
+```
+
+### 2. Set Up Environment
+
+```bash
+# Download Miniconda (run the installer afterwards if conda is not already installed)
+wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
+
+# Create environment
+conda create --name ktransformers python=3.11
+conda activate ktransformers
+
+# Install required libraries
+conda install -c conda-forge libstdcxx-ng
+
+# Verify the GLIBCXX version (the output should include GLIBCXX_3.4.32)
+strings "$CONDA_PREFIX/lib/libstdc++.so.6" | grep GLIBCXX
+
+sudo apt install libtbb-dev libssl-dev libcurl4-openssl-dev libaio1 libaio-dev libfmt-dev libgflags-dev zlib1g-dev patchelf
+pip3 install packaging ninja cpufeature numpy openai
+pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu126
+```
+
+### 3. Build with Balance Serve Support
+
+```bash
+# Build for a single NUMA node
+USE_BALANCE_SERVE=1 bash ./install.sh
+# For machines with two CPUs and 1 TB of RAM (dual NUMA):
+USE_BALANCE_SERVE=1 USE_NUMA=1 bash ./install.sh
+```
+
+### 4. Run LLaMA 4 Inference Server
+
+Make sure you have:
+
+- `--model_path` pointing to a local config directory (not a Hugging Face name).
+- `--gguf_path` pointing to quantized `.gguf` weights.
+
+```bash
+python ktransformers/server/main.py \
+  --port 10002 \
+  --model_path <path_to_safetensor_config> \
+  --gguf_path <path_to_gguf_files> \
+  --optimize_config_path ktransformers/optimize/optimize_rules/Llama4-serve.yaml \
+  --max_new_tokens 1024 \
+  --cache_lens 32768 \
+  --chunk_size 256 \
+  --max_batch_size 4 \
+  --backend_type balance_serve
+```
+
+### 5. Access the Server
+
+```bash
+curl -X POST http://localhost:10002/v1/chat/completions \
+  -H "accept: application/json" \
+  -H "Content-Type: application/json" \
+  -d '{
+    "messages": [
+      {"role": "user", "content": "hello"}
+    ],
+    "model": "Llama4",
+    "temperature": 0.3,
+    "top_p": 1.0,
+    "stream": true
+  }'
+```
+
+------
+
+## 📌 Limitations
+
+- ✅ **Only the `balance_serve` backend is supported** for LLaMA 4 models in this version.
+- ⚠️ Requires **`transformers>=4.51.0`**. Due to potential compatibility issues with older toolchains, we have **not merged this branch to main yet**.
+- ❌ Multimodal models are not supported yet in this version. Support will be added in future releases.