add ci (#1642)
@@ -22,6 +22,8 @@ Convert weights to INT4/INT8 format optimized for AMX inference on CPU. These qu
- **FP16**: 16-bit floating point
- **BF16**: BFloat16 format

> **⚠️ Precision Warning:** Quantizing directly from FP8 to INT4/INT8 may cause significant accuracy degradation. For best results, use the original **BF16** model as the source for INT4/INT8 quantization.

## Basic Usage

### Quantize BF16 model to INT4
@@ -213,6 +215,37 @@ python scripts/convert_gpu_weights.py \
- `--dataset`: HuggingFace dataset for calibration
- `--dataset_split`: Dataset split to use
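
For illustration, the two calibration flags above might be combined with a quantization run like this; the dataset id and split are placeholders, not defaults of `convert_gpu_weights.py`:

```bash
# Hypothetical calibration setup: the dataset id and split are illustrative only.
python scripts/convert_gpu_weights.py \
    --model_id /path/to/model \
    --output_dir /path/to/output \
    --quant_type W4A16 \
    --dataset HuggingFaceH4/ultrachat_200k \
    --dataset_split train_sft
```
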
#### Memory Management (Avoiding OOM)
GPTQ quantization requires additional GPU memory for Hessian matrix computation beyond the model weights themselves. Use `--max_gpu_memory` to cap GPU memory usage per device and offload the remaining layers to CPU:

```bash
python scripts/convert_gpu_weights.py \
--model_id /path/to/model \
--output_dir /path/to/output \
--quant_type W4A16 \
--max_gpu_memory "40GiB"
```
**Recommended settings:**

| GPU VRAM | Suggested `--max_gpu_memory` |
|----------|------------------------------|
| 24 GiB | 14-16 GiB |
| 48 GiB | 30-35 GiB |
| 80 GiB | 50-60 GiB |

Reserve 40-50% of GPU memory for GPTQ's Hessian matrix computation.
**Options:**

- `--max_gpu_memory`: Maximum GPU memory for model weights per device (e.g., '40GiB')
- `--max_cpu_memory`: Maximum CPU memory for offloaded layers (default: 1000GiB when `--max_gpu_memory` is set)
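
As a sketch, both limits can be set together; the values below assume roughly 48 GiB of VRAM and ample system RAM, and are illustrative rather than recommended defaults:

```bash
# Illustrative limits; tune --max_gpu_memory using the table above.
python scripts/convert_gpu_weights.py \
    --model_id /path/to/model \
    --output_dir /path/to/output \
    --quant_type W4A16 \
    --max_gpu_memory "30GiB" \
    --max_cpu_memory "512GiB"
```
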
**Important:** llmcompressor does not support disk offloading, so your machine must have enough combined GPU + CPU memory to hold the entire model. If you still encounter OOM, try the following (a combined sketch follows this list):

1. Reduce `--num_calibration_samples` (e.g., 256)
2. Reduce `--max_sequence_length` (e.g., 1024)
3. Use `--force_cpu` to run entirely on CPU (slower, but avoids GPU OOM)
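
For example, a conservative fallback run might combine the first two reductions, with the values mirroring the suggestions above; `--force_cpu` can be added as a last resort:

```bash
# Conservative fallback: fewer calibration samples and shorter sequences.
python scripts/convert_gpu_weights.py \
    --model_id /path/to/model \
    --output_dir /path/to/output \
    --quant_type W4A16 \
    --max_gpu_memory "40GiB" \
    --num_calibration_samples 256 \
    --max_sequence_length 1024
```
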
### Examples

#### Example 1: Quantize Qwen3-Next-80B for Hybrid Inference (W4A16)