Weight Quantization Tools
KT-Kernel provides weight conversion tools for CPU-GPU hybrid inference (e.g., integrating KTransformers with SGLang). Both tools work together to enable heterogeneous expert placement:
- CPU Weights (convert_cpu_weights.py): Quantize weights to INT4/INT8 with AMX optimization for CPU-resident "cold" experts
- GPU Weights (convert_gpu_weights.py): Apply GPTQ/RTN quantization (W4A16/W8A16) for GPU-resident "hot" experts
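In a typical hybrid setup, both converters are run against the same source model, producing one artifact for the CPU-resident experts and one for the GPU-resident experts. The sketch below shows the two steps back to back with placeholder paths; each flag is documented in the sections that follow.
Step 1, CPU-resident "cold" experts (AMX INT4):
python scripts/convert_cpu_weights.py \
--input-path /path/to/bf16/model \
--input-type bf16 \
--output /path/to/model-cpu-int4 \
--quant-method int4
Step 2, GPU-resident "hot" experts (GPTQ W4A16):
python scripts/convert_gpu_weights.py \
--model_id /path/to/bf16/model \
--output_dir /path/to/model-gpu-w4a16 \
--quant_method GPTQ \
--quant_type W4A16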
CPU Weight Quantization
Convert weights to INT4/INT8 format optimized for AMX inference on CPU. These quantized weights are used for "cold" experts (less frequently accessed) that run on CPU in hybrid inference scenarios.
Quantization Methods
- INT4: 4-bit quantization for maximum memory efficiency
- INT8: 8-bit quantization for better accuracy
Supported Input Formats
- FP8: 8-bit floating point with automatic dequantization
- FP16: 16-bit floating point
- BF16: BFloat16 format
⚠️ Precision Warning: Quantizing directly from FP8 to INT4/INT8 may cause significant accuracy degradation. For best results, use the original BF16 model as the source for INT4/INT8 quantization.
Basic Usage
Quantize BF16 model to INT4
python scripts/convert_cpu_weights.py \
--input-path /path/to/bf16/model \
--input-type bf16 \
--output /path/to/output \
--quant-method int4
Quantize FP16 model to INT8
python scripts/convert_cpu_weights.py \
--input-path /path/to/fp16/model \
--input-type fp16 \
--output /path/to/output \
--quant-method int8
Quantize FP8 model to INT4
python scripts/convert_cpu_weights.py \
--input-path /path/to/fp8/model \
--input-type fp8 \
--output /path/to/output \
--quant-method int4
Output Format
By default, the converted weights are saved in SafeTensors format with a NUMA-aware layout:
output_dir/
├── model-00001-of-00050.safetensors
├── model-00002-of-00050.safetensors
├── ...
├── config.json
└── tokenizer files...
Each expert's weights are split across NUMA nodes for optimal memory access:
- blk.{layer}.ffn_{proj}_exps.{expert}.numa.{numa_idx}.weight: Quantized weights
- blk.{layer}.ffn_{proj}_exps.{expert}.numa.{numa_idx}.scale: Quantization scales
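To sanity-check the output, you can list a few tensor names from the first shard. This is a minimal sketch assuming the safetensors Python package is installed and the shard filename follows the default layout above:
python - <<'EOF'
from safetensors import safe_open
shard = "/path/to/output/model-00001-of-00050.safetensors"  # adjust to your output directory
with safe_open(shard, framework="pt") as f:
    for name in list(f.keys())[:8]:
        print(name)  # expected form: blk.0.ffn_gate_exps.0.numa.0.weight
EOF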
Advanced Options
Low Memory Mode
For systems with insufficient memory to complete full-model quantization, use the --no-merge-safetensor flag to keep the weights in a per-layer folder structure instead of merging them into SafeTensors files:
python scripts/convert_cpu_weights.py \
--input-path /path/to/model \
--input-type bf16 \
--output /path/to/output \
--quant-method int4 \
--no-merge-safetensor
This will save quantized weights in the following folder structure:
output_dir/
├── _layer_0/
│ ├── _numa_0/
│ │ ├── INT4_down_0_*.kt
│ │ ├── INT4_gate_0_*.kt
│ │ └── INT4_up_0_*.kt
│ └── _numa_1/
│ └── ...
├── _layer_1/
│ └── ...
└── ...
When to use --no-merge-safetensor:
- Machine runs out of memory during the merge step
- Need to process very large models on memory-constrained systems
- Want to preserve intermediate layer-wise quantized weights
Resume Layer
If quantization still cannot finish in one run even with --no-merge-safetensor enabled, restart the script with the --resume-layer argument to continue from a specific layer. The example below skips layers 0-11 and resumes conversion at layer 12.
python scripts/convert_cpu_weights.py \
--input-path /path/to/model \
--input-type bf16 \
--output /path/to/output \
--quant-method int4 \
--no-merge-safetensor \
--resume-layer 12
Examples
Example 1: Quantize DeepSeek-V3.1 (FP8 → INT4)
python scripts/convert_cpu_weights.py \
--input-path /mnt/data/models/DeepSeek-V3.1 \
--input-type fp8 \
--output /mnt/data/models/DeepSeek-V3.1-INT4 \
--quant-method int4 \
--cpuinfer-threads 60 \
--threadpool-count 2
Example 2: Quantize Qwen3-Next-80B (BF16 → INT4, Low Memory)
python scripts/convert_cpu_weights.py \
--input-path /mnt/data/models/Qwen3-Next-80B-A3B-Instruct \
--input-type bf16 \
--output /mnt/data/models/Qwen3-Next-80B-A3B-Instruct-INT4 \
--quant-method int4 \
--cpuinfer-threads 60 \
--threadpool-count 2 \
--no-merge-safetensor
GPU Weight Quantization
Prerequisites
GPU weight quantization requires additional dependencies. Install them before proceeding:
pip install accelerate transformers llmcompressor datasets
Required packages:
- accelerate: For distributed model loading and device mapping
- transformers: For model and tokenizer loading
- llmcompressor: For quantization (supports GPTQ and RTN methods)
- datasets: For calibration data loading (GPTQ only)
Documentation: This tool is based on llmcompressor. For more details, see llmcompressor quantization guide.
Overview
Apply weight-only quantization to GPU-resident "hot" experts (frequently accessed) for CPU-GPU hybrid inference. This tool works together with convert_cpu_weights.py to enable heterogeneous expert placement:
- GPU-resident experts ("hot" experts) use GPTQ/RTN quantization (this tool) for efficient GPU memory usage
- CPU-resident experts ("cold" experts) use AMX-optimized INT4/INT8 quantization (convert_cpu_weights.py)
- Attention layers, gates, and shared experts remain in higher precision
This approach improves throughput and resource utilization by distributing experts across CPUs and GPUs according to how frequently they are accessed.
Quantization Methods
1. GPTQ (Calibration-based, Default)
Pros:
- Higher accuracy through calibration-based quantization
- Recommended for production deployments
Cons:
- Requires calibration dataset
- Slower quantization process
- Higher memory requirements (needs Hessian matrix)
2. RTN (Round-To-Nearest)
Pros:
- Fast quantization (no calibration needed)
- Lower memory requirements
- Good for quick testing and prototyping
Cons:
- Slightly lower accuracy compared to GPTQ
- No calibration optimization
Quantization Types
- W4A16: 4-bit weights, 16-bit activations (INT4)
- W8A16: 8-bit weights, 16-bit activations (INT8)
Basic Usage
GPTQ Quantization (Recommended for Production)
python scripts/convert_gpu_weights.py \
--model_id /path/to/model \
--output_dir /path/to/output \
--quant_method GPTQ \
--quant_type W4A16
RTN Quantization (Fast, for Testing)
python scripts/convert_gpu_weights.py \
--model_id /path/to/model \
--output_dir /path/to/output \
--quant_method RTN \
--quant_type W4A16
Memory Requirements
Understanding memory requirements is crucial for successful quantization. The requirements differ significantly between RTN and GPTQ methods.
RTN Memory Requirements
Beyond holding the model weights themselves, RTN only needs extra memory for the quantization parameters (scales/zero-points):
| Component | Requirement |
|---|---|
| DRAM (CPU Memory) | ≥ Total model parameters |
| VRAM (GPU Memory) | ≥ Single layer parameters |
Example: DeepSeek-R1-0528-BF16 (684B parameters)
- DRAM: ~1368 GB (684B params × 2 bytes)
- VRAM: ~22.4 GB (1 layer)
GPTQ Memory Requirements
GPTQ requires additional memory for Hessian matrices during calibration:
| Component | Requirement |
|---|---|
| DRAM (CPU Memory) | ≥ Total model parameters |
| VRAM (GPU Memory) | ≥ Single layer parameters × 2 |
The Hessian matrices are approximately the same size as the layer weights and are used during calibration to recover accuracy lost to quantization.
Example: DeepSeek-R1-0528-BF16 (684B parameters)
- DRAM: ~1368 GB (684B params × 2 bytes)
- VRAM: ~44.8 GB (1 layer × 2 for Hessian matrix)
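As a rough back-of-envelope check of the figures above, the sketch below reproduces them; it assumes 2 bytes per BF16 parameter and 61 transformer layers for DeepSeek-R1 (substitute your own model's numbers):
python - <<'EOF'
total_params = 684e9     # total model parameters
num_layers = 61          # hidden layers (assumed, from the model config)
bytes_per_param = 2      # BF16
per_layer_gb = round(total_params / num_layers * bytes_per_param / 1e9, 1)
print(f"DRAM, full model: {total_params * bytes_per_param / 1e9:.0f} GB")  # ~1368 GB
print(f"VRAM, RTN (1 layer): {per_layer_gb:.1f} GB")                       # ~22.4 GB
print(f"VRAM, GPTQ (1 layer x 2): {per_layer_gb * 2:.1f} GB")              # ~44.8 GB
EOF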
Method Comparison
| Method | Speed | VRAM | Accuracy | Use Case |
|---|---|---|---|---|
| RTN | Fast | Low (~22GB) | Good | Testing, prototyping |
| GPTQ | Slow | High (~45GB) | Better | Production deployment |
Advanced Options
Calibration Configuration (GPTQ Only)
For GPTQ quantization, you can tune the calibration process to improve quantization quality:
python scripts/convert_gpu_weights.py \
--model_id /path/to/model \
--output_dir /path/to/output \
--quant_method GPTQ \
--quant_type W4A16 \
--num_calibration_samples 512 \
--max_sequence_length 2048 \
--dataset HuggingFaceH4/ultrachat_200k \
--dataset_split train_sft
Options (GPTQ only):
- --num_calibration_samples: Number of samples for calibration (default: 512)
- --max_sequence_length: Maximum sequence length (default: 2048)
- --dataset: HuggingFace dataset for calibration
- --dataset_split: Dataset split to use
- --dampening_frac: Dampening fraction to reduce quantization noise (default: 0.1)
Memory Management
Use --max_gpu_memory to limit GPU memory usage and offload remaining layers to CPU:
python scripts/convert_gpu_weights.py \
--model_id /path/to/model \
--output_dir /path/to/output \
--quant_method GPTQ \
--quant_type W4A16 \
--max_gpu_memory "40GiB"
Recommended settings for GPTQ:
| GPU VRAM | Suggested --max_gpu_memory | Notes |
|---|---|---|
| 24 GiB | 10-12 GiB | Reserve ~50% for Hessian |
| 48 GiB | 24-30 GiB | Reserve ~40% for Hessian |
| 80 GiB | 40-50 GiB | Reserve ~40% for Hessian |
Recommended settings for RTN:
| GPU VRAM | Suggested --max_gpu_memory | Notes |
|---|---|---|
| 24 GiB | 18-20 GiB | No Hessian needed |
| 48 GiB | 40-45 GiB | No Hessian needed |
| 80 GiB | 70-75 GiB | No Hessian needed |
Options:
- --max_gpu_memory: Maximum GPU memory for model weights per device (e.g., '40GiB')
- --max_cpu_memory: Maximum CPU memory (default: 1000GiB when --max_gpu_memory is set)
Important: llmcompressor does not support disk offloading. Ensure your machine has enough GPU + CPU memory to load the entire model. If you still encounter OOM:
- Use RTN instead of GPTQ (requires less memory)
- Reduce --num_calibration_samples (GPTQ only, e.g., 256)
- Reduce --max_sequence_length (GPTQ only, e.g., 1024)
- Use --force_cpu to run entirely on CPU (slower but avoids GPU OOM; see the example below)
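For example, a last-resort CPU-only run could look like the following sketch (placeholder paths), combined with RTN to keep runtime manageable:
python scripts/convert_gpu_weights.py \
--model_id /path/to/model \
--output_dir /path/to/output \
--quant_method RTN \
--quant_type W4A16 \
--force_cpu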
Examples
Example 1: GPTQ Quantization for Production (Qwen3-Next-80B, W4A16)
python scripts/convert_gpu_weights.py \
--model_id /mnt/data/models/Qwen3-Next-80B-A3B-Instruct \
--output_dir /mnt/data/models/Qwen3-Next-80B-A3B-Instruct-GPTQ-W4A16 \
--quant_method GPTQ \
--quant_type W4A16 \
--num_calibration_samples 512 \
--max_sequence_length 2048 \
--max_gpu_memory "40GiB" \
--trust_remote_code
Example 2: RTN Quantization for Fast Testing (DeepSeek-R1, W4A16)
python scripts/convert_gpu_weights.py \
--model_id /mnt/data/models/DeepSeek-R1-0528-BF16 \
--output_dir /mnt/data/models/DeepSeek-R1-0528-RTN-W4A16 \
--quant_method RTN \
--quant_type W4A16 \
--max_gpu_memory "70GiB" \
--trust_remote_code
Example 3: GPTQ with Custom Calibration Dataset (GLM-4.5-Air, W8A16)
python scripts/convert_gpu_weights.py \
--model_id /mnt/data/models/GLM-4.5-Air \
--output_dir /mnt/data/models/GLM-4.5-Air-GPTQ-W8A16 \
--quant_method GPTQ \
--quant_type W8A16 \
--dataset "tatsu-lab/alpaca" \
--dataset_split "train" \
--num_calibration_samples 256 \
--max_gpu_memory "40GiB" \
--trust_remote_code