Mirror of https://github.com/kvcache-ai/ktransformers.git (synced 2026-04-28 11:49:51 +00:00)
add ci (#1642)
Some checks failed
Book-CI / test (push) Has been cancelled
Book-CI / test-1 (push) Has been cancelled
Book-CI / test-2 (push) Has been cancelled
Deploy / deploy (macos-latest) (push) Has been cancelled
Deploy / deploy (ubuntu-latest) (push) Has been cancelled
Deploy / deploy (windows-latest) (push) Has been cancelled
This commit is contained in:
parent 2cffdf7033
commit 51745a9ea1
14 changed files with 845 additions and 48 deletions
@@ -22,6 +22,8 @@ Convert weights to INT4/INT8 format optimized for AMX inference on CPU. These qu
- **FP16**: 16-bit floating point
- **BF16**: BFloat16 format

> **⚠️ Precision Warning:** Quantizing directly from FP8 to INT4/INT8 may cause significant accuracy degradation. For best results, use the original **BF16** model as the source for INT4/INT8 quantization.

## Basic Usage

### Quantize BF16 model to INT4
@@ -213,6 +215,37 @@ python scripts/convert_gpu_weights.py \
- `--dataset`: HuggingFace dataset for calibration
- `--dataset_split`: Dataset split to use

#### Memory Management (Avoiding OOM)
GPTQ quantization requires additional GPU memory for Hessian matrix computation, on top of the model weights. Use `--max_gpu_memory` to cap the GPU memory used for weights on each device and offload the remaining layers to CPU:
```bash
python scripts/convert_gpu_weights.py \
    --model_id /path/to/model \
    --output_dir /path/to/output \
    --quant_type W4A16 \
    --max_gpu_memory "40GiB"
```
**Recommended settings:**

| GPU VRAM | Suggested `--max_gpu_memory` |
|----------|------------------------------|
| 24 GiB   | 14-16 GiB                    |
| 48 GiB   | 30-35 GiB                    |
| 80 GiB   | 50-60 GiB                    |

Reserve 40-50% of GPU memory for GPTQ's Hessian matrix computation.
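To turn that guideline into a concrete starting value, the shell sketch below reads total VRAM with `nvidia-smi` and budgets roughly 60% of it for weights, keeping the rest in reserve. The 60% factor is an illustrative choice, not a value defined by the script, and it lands at or slightly below the table's suggestions.

```bash
# Illustrative helper (not part of the repo's scripts): derive a starting
# value for --max_gpu_memory from total VRAM, keeping ~40% free for
# GPTQ's Hessian buffers. Assumes nvidia-smi is available.
total_mib=$(nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits | head -n 1)
suggested_gib=$(( total_mib * 60 / 100 / 1024 ))  # ~60% of VRAM for weights
echo "Suggested: --max_gpu_memory \"${suggested_gib}GiB\""
```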
**Options:**
- `--max_gpu_memory`: Maximum GPU memory for model weights per device (e.g., '40GiB')
- `--max_cpu_memory`: Maximum CPU memory (default: 1000GiB when `--max_gpu_memory` is set)
**Important:** llmcompressor does not support disk offloading, so your machine must have enough combined GPU + CPU memory to load the entire model. If you still encounter OOM, try the following (a combined example is sketched after the list):
1. Reduce `--num_calibration_samples` (e.g., 256)
2. Reduce `--max_sequence_length` (e.g., 1024)
3. Use `--force_cpu` to run entirely on CPU (slower but avoids GPU OOM)
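Putting these knobs together, a hedged example invocation might look like the sketch below; the paths are placeholders and the values are the suggestions above, not required settings.

```bash
# Placeholder paths; flag values follow the suggestions listed above.
python scripts/convert_gpu_weights.py \
    --model_id /path/to/model \
    --output_dir /path/to/output \
    --quant_type W4A16 \
    --max_gpu_memory "40GiB" \
    --num_calibration_samples 256 \
    --max_sequence_length 1024
```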
### Examples
#### Example 1: Quantize Qwen3-Next-80B for Hybrid Inference (W4A16)
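A minimal sketch of such a command, assuming only the flags documented above (model and output paths are placeholders; the full example in the document may differ):

```bash
# Sketch only; adjust paths and --max_gpu_memory to your hardware.
python scripts/convert_gpu_weights.py \
    --model_id /path/to/Qwen3-Next-80B \
    --output_dir /path/to/Qwen3-Next-80B-W4A16 \
    --quant_type W4A16 \
    --max_gpu_memory "40GiB"
```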