# Qwen3-Next Support for KTransformers
## Introduction
### Overview
We are pleased to announce that KTransformers now supports Qwen3-Next-80B-A3B-Thinking and Qwen3-Next-80B-A3B-Instruct.
### Model & Resource Links
- Official Qwen3-Next-80B-A3B-Thinking Release:
  - https://huggingface.co/Qwen/Qwen3-Next-80B-A3B-Thinking
- Official Qwen3-Next-80B-A3B-Instruct Release:
  - https://huggingface.co/Qwen/Qwen3-Next-80B-A3B-Instruct
## Installation Guide
### 1. Resource Requirements
Running the model with all 512 experts requires approximately 320 GB of system RAM and about 6 GB of GPU memory.
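
Before downloading anything, it is worth confirming the host meets these requirements. A minimal check using standard tools (`free` and `nvidia-smi` are assumed to be available):

```bash
# total/available system RAM (roughly 320 GB is needed for the full model)
free -h

# GPU name and memory (roughly 6 GB of free VRAM is needed)
nvidia-smi --query-gpu=name,memory.total,memory.free --format=csv
```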
### 2. Prepare Models
```bash
# download the model weights from Hugging Face
huggingface-cli download --resume-download Qwen/Qwen3-Next-80B-A3B-Instruct
```
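
If you want the weights in a fixed location, `huggingface-cli download` also accepts a `--local-dir` flag; the resulting directory is then what `--model_path` and `--gguf_path` in step 4 should point at. A sketch for the Thinking variant (the target directory is illustrative):

```bash
# download into a known directory instead of the default HF cache
huggingface-cli download --resume-download Qwen/Qwen3-Next-80B-A3B-Thinking \
  --local-dir ./models/Qwen3-Next-80B-A3B-Thinking
```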
### 3. Install KTransformers
To install KTransformers, follow the official [Installation Guide](https://kvcache-ai.github.io/ktransformers/en/install.html).
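
For orientation, a typical source install looks roughly like the following; the linked guide is authoritative for prerequisites and platform-specific steps:

```bash
# rough sketch of a source build; see the official guide for details
git clone https://github.com/kvcache-ai/ktransformers.git
cd ktransformers
git submodule update --init --recursive
bash install.sh
```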
### 4. Run Qwen3-Next Inference Server
```bash
python ktransformers/server/main.py \
  --port 10021 \
  --model_path path-to-Qwen3-Next-80B-A3B-Thinking \
  --gguf_path path-to-Qwen3-Next-80B-A3B-Thinking \
  --model_name Qwen3NextForCausalLM \
  --optimize_config_path <local_path>/ktransformers/optimize/optimize_rules/Qwen3Next-serve.yaml \
  --max_new_tokens 1024 \
  --cache_lens 32768 \
  --chunk_size 256 \
  --max_batch_size 4 \
  --no-use_cuda_graph \
  --backend_type balance_serve
```
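
Replace the `path-to-...` and `<local_path>` placeholders with your actual paths. Loading the weights can take a while, so it can help to poll the endpoint from step 5 before sending real requests; a small sketch (it only checks that the port answers):

```bash
# wait until the server starts answering on the chat-completions route
until curl -s -o /dev/null http://localhost:10021/v1/chat/completions; do
  sleep 10
done
echo "server is up"
```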
### 5. Access the Server
```bash
curl -X POST http://localhost:10021/v1/chat/completions \
  -H "accept: application/json" \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "hello"}
    ],
    "model": "Qwen3-Next-80B-A3B-Instruct",
    "temperature": 0.3,
    "top_p": 1.0,
    "stream": true
  }'
```
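
The request above streams the reply as server-sent events. For a one-shot answer, a minimal variant that disables streaming and pulls the message text out with `jq` (assumed to be installed; the response is assumed to follow the usual OpenAI chat-completions schema):

```bash
# non-streaming request; print only the assistant's reply text
curl -s -X POST http://localhost:10021/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "hello"}],
    "model": "Qwen3-Next-80B-A3B-Instruct",
    "stream": false
  }' | jq -r '.choices[0].message.content'
```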
### 6. Notes
Because Qwen3-Next uses linear attention, CUDA Graph optimization is not yet supported (hence the `--no-use_cuda_graph` flag above), but it is coming soon! 🚀