kvcache-ai-ktransformers/doc/en/Qwen3-Next.md
2025-09-16 13:21:58 +00:00

71 lines
No EOL
1.8 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# Qwen3-Next Support for KTransformers
## Introduction
### Overview
We are very pleased to announce that Ktransformers now supports Qwen3-Next-80B-A3B-Thinking and Qwen3-Next-80B-A3B-Instruct.
### Model & Resource Links
- Official Qwen3-Next-80B-A3B-Thinking Release:
- https://huggingface.co/Qwen/Qwen3-Next-80B-A3B-Thinking
- Official Qwen3-Next-80B-A3B-Instruct Release
- https://huggingface.co/Qwen/Qwen3-Next-80B-A3B-Instruct
## Installation Guide
### 1. Resource Requirements
The model running with 512 Experts requires approximately 320 GB of memory and 6 GB of GPU memory.
### 2. Prepare Models
```bash
# download gguf
huggingface-cli download --resume-download Qwen/Qwen3-Next-80B-A3B-Instruct
```
### 3. Install ktransformers
To install KTransformers, follow the official [Installation Guide](https://kvcache-ai.github.io/ktransformers/en/install.html).
### 4. Run Qwen3-Next Inference Server
```bash
python ktransformers/server/main.py \
--port 10021 \
--model_path path-to-Qwen3-Next-80B-A3B-Thinking \
--gguf_path path-to-Qwen3-Next-80B-A3B-Thinking \
--model_name Qwen3NextForCausalLM \
--optimize_config_path <local_path>/ktransformers/optimize/optimize_rules/Qwen3Next-serve.yaml \
--max_new_tokens 1024 \
--cache_lens 32768 \
--chunk_size 256 \
--max_batch_size 4 \
--no-use_cuda_graph \
--backend_type balance_serve
```
### 5. Access server
```
curl -X POST http://localhost:10021/v1/chat/completions \
-H "accept: application/json" \
-H "Content-Type: application/json" \
-d '{
"messages": [
{"role": "user", "content": "hello"}
],
"model": "Qwen3-Next-80B-A3B-Instruct",
"temperature": 0.3,
"top_p": 1.0,
"stream": true
}'
```
### 6. Notes
Due to Qwen3-Nexts use of linear attention, CUDA Graph optimization is not yet support — but its coming soon! 🚀