kvcache-ai-ktransformers/doc/en/Qwen3-Next.md

# Qwen3-Next Support for KTransformers

## Introduction

### Overview
We are very pleased to announce that Ktransformers now supports Qwen3-Next-80B-A3B-Thinking and Qwen3-Next-80B-A3B-Instruct.

### Model & Resource Links

- Official Qwen3-Next-80B-A3B-Thinking Release:
  - https://huggingface.co/Qwen/Qwen3-Next-80B-A3B-Thinking

- Official Qwen3-Next-80B-A3B-Instruct Release
  - https://huggingface.co/Qwen/Qwen3-Next-80B-A3B-Instruct


## Installation Guide

### 1. Resource Requirements

The model running with 512 Experts requires approximately 320 GB of memory and 6 GB of GPU memory.

### 2. Prepare Models

```bash
# download gguf
huggingface-cli download --resume-download Qwen/Qwen3-Next-80B-A3B-Instruct

```

### 3. Install ktransformers

To install KTransformers, follow the official [Installation Guide](https://kvcache-ai.github.io/ktransformers/en/install.html).

### 4. Run Qwen3-Next Inference Server

```bash
python ktransformers/server/main.py \
  --port 10021 \
  --model_path path-to-Qwen3-Next-80B-A3B-Thinking \
  --gguf_path path-to-Qwen3-Next-80B-A3B-Thinking \
  --model_name Qwen3NextForCausalLM \
  --optimize_config_path <local_path>/ktransformers/optimize/optimize_rules/Qwen3Next-serve.yaml \
  --max_new_tokens 1024 \
  --cache_lens 32768 \
  --chunk_size 256 \
  --max_batch_size 4 \
  --no-use_cuda_graph \
  --backend_type balance_serve
```

### 5. Access server

```
curl -X POST http://localhost:10021/v1/chat/completions \
  -H "accept: application/json" \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "hello"}
    ],
    "model": "Qwen3-Next-80B-A3B-Instruct",
    "temperature": 0.3,
    "top_p": 1.0,
    "stream": true
  }'
```

### 6. Notes

Due to Qwen3-Next’s use of linear attention, CUDA Graph optimization is not yet support — but it’s coming soon! 🚀