mirror of https://github.com/ruvnet/RuVector.git synced 2026-05-27 08:45:07 +00:00

Reuven 9f80f7298f feat(ruvector-cnn): implement ADR-091 INT8 CNN quantization

Complete implementation of INT8 quantization for ruvector-cnn:

Phase 1 - Core Infrastructure:
- QuantizationParams, QuantizationScheme, QuantizationMode
- QuantizedTensor<i8> with quantize/dequantize methods
- CalibrationMethod (MinMax, Percentile, MSE, Entropy)
- 34 unit tests passing

Phase 2 - INT8 Kernels:
- Scalar reference: conv2d, depthwise_conv2d, matmul, requantize
- AVX2 SIMD: _mm256_maddubs_epi16 for 2-4x speedup
- ARM NEON: vmull_s8, vpadalq_s16 for 2-3x speedup
- WASM SIMD128: i8x16 operations for 1.5-2x speedup

Phase 3 - Graph Rewrite Passes:
- GR-1: BatchNorm fusion into Conv weights
- GR-2: Zero-point correction pre-computation
- GR-3: Q/DQ node insertion at FP32/INT8 boundaries
- GR-4: ReLU/HardSwish fusion with LUT

Phase 4 - Quantized Layers:
- QuantizedConv2d with per-channel quantization
- QuantizedDepthwiseConv2d for MobileNet
- QuantizedLinear for FC layers
- QuantizedMaxPool2d/AvgPool2d
- QuantizedResidualAdd with scale alignment

Phase 6 - Tests & Benchmarks:
- quality_validation.rs: cosine similarity ≥0.995
- acceptance_gates.rs: 7 ADR-091 gates
- kernel_equivalence.rs: SIMD vs scalar validation
- int8_bench.rs: Criterion benchmarks

Performance targets:
- 2.5x latency improvement (MobileNetV3)
- 4x memory reduction
- <1% accuracy degradation

Co-Authored-By: claude-flow <ruv@ruv.net>

2026-03-12 14:45:52 -04:00

7.5 KiB

Raw Permalink Blame History

INT8 SIMD Kernels Implementation - Phase 2.2-2.3

Overview

This document describes the implementation of ADR-091 Phase 2.2-2.3: SIMD INT8 kernels for ruvector-cnn. These kernels provide 2-4x speedup over FP32 inference with minimal accuracy loss.

Implementation Summary

Files Created

src/kernels/int8_avx2.rs - x86_64 AVX2 kernels
src/kernels/int8_neon.rs - ARM NEON kernels
src/kernels/int8_wasm.rs - WebAssembly SIMD128 kernels
src/kernels/mod.rs - Module exports and dispatch logic

Key Features

Multi-architecture support: AVX2, NEON, WASM SIMD128
Automatic dispatch: Runtime feature detection selects optimal implementation
Kernel equivalence: All SIMD kernels match scalar reference within 1 ULP (INV-6)
Edge case handling: Supports non-aligned sizes, small inputs, remainder processing

Architecture-Specific Implementations

1. x86_64 AVX2 (`int8_avx2.rs`)

Key Instructions:

_mm256_maddubs_epi16: Multiply u8×i8 → i16, pairwise add
_mm256_madd_epi16: Multiply i16×i16 → i32, pairwise add
_mm256_add_epi32: Accumulate i32 results

Performance:

Processes 32 elements per iteration (dot product)
Processes 8 output channels per iteration (convolution)
Expected speedup: 2-4x over FP32

Functions:

dot_product_int8_avx2() - INT8 dot product
conv2d_int8_avx2() - 2D convolution with per-channel quantization
depthwise_conv2d_int8_avx2() - Depthwise separable convolution
matmul_int8_avx2() - Matrix multiplication (GEMM)

2. ARM NEON (`int8_neon.rs`)

Key Instructions:

vmull_s8: Multiply i8×i8 → i16 (widening)
vpadalq_s16: Pairwise add i16 → i32 (accumulate)
vpadd_s32: Horizontal sum for final accumulation

Performance:

Processes 16 elements per iteration (dot product)
Processes 4 output channels per iteration (convolution)
Expected speedup: 2-3x over FP32

Functions:

dot_product_int8_neon() - INT8 dot product
conv2d_int8_neon() - 2D convolution
depthwise_conv2d_int8_neon() - Depthwise convolution
matmul_int8_neon() - Matrix multiplication

3. WebAssembly SIMD128 (`int8_wasm.rs`)

Key Instructions:

i16x8_extend_low_i8x16: Widen i8 → i16
i16x8_mul: Multiply i16×i16
i32x4_extend_low_i16x8: Widen i16 → i32
i32x4_add: Accumulate i32

Performance:

Processes 16 elements per iteration (dot product)
Processes 4 output channels per iteration (convolution)
Expected speedup: 1.5-2.5x over scalar

Functions:

dot_product_int8_wasm() - INT8 dot product
conv2d_int8_wasm() - 2D convolution
depthwise_conv2d_int8_wasm() - Depthwise convolution
matmul_int8_wasm() - Matrix multiplication

Automatic Dispatch System

The kernels/mod.rs module provides automatic dispatch functions that select the best implementation at runtime:

// Automatic dispatch based on target architecture
pub fn conv2d_int8_dispatch(
    input: &[u8],
    input_zero_point: i32,
    kernel: &[i8],
    bias_i32: &[i32],
    output: &mut [i32],
    in_h: usize, in_w: usize, in_c: usize, out_c: usize,
    stride: usize, padding: usize,
) {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            // Use AVX2 kernel
        }
    }
    // Fallback to scalar or other architectures
}

Quantization Scheme

Asymmetric Quantization (Activations)

Used for activations after ReLU (non-negative):

quantized = round(value / scale) + zero_point

Range: [0, 255] for u8
Zero point: Computed to map min value to 0
Scale: (max - min) / 255

Symmetric Quantization (Weights)

Used for weights (centered around 0):

quantized = round(value / scale)

Range: [-127, 127] for i8
Zero point: Always 0
Scale: max(abs(values)) / 127

Per-Channel Quantization

Different scale per output channel for higher accuracy:

Weights: Per-channel symmetric quantization
Activations: Per-tensor asymmetric quantization

Zero-Point Correction

For asymmetric quantization, a correction term is added to account for the zero point:

// Pre-compute correction: input_zero_point × sum(weights)
let correction = input_zero_point * weight_sum;
output = bias - correction + dot_product(input, weights);

This ensures the quantized computation matches the FP32 result when dequantized.

Kernel Equivalence Testing (INV-6)

All SIMD kernels are tested against scalar reference implementations:

#[test]
fn test_kernel_equivalence_conv2d() {
    let mut scalar_output = vec![0i32; output_size];
    let mut simd_output = vec![0i32; output_size];

    scalar_conv2d_int8(..., &mut scalar_output);
    conv2d_int8_dispatch(..., &mut simd_output);

    // Must match exactly (INV-6)
    assert_eq!(scalar_output, simd_output);
}

Tests verify:

Exact equivalence: INT32 outputs match exactly
Edge cases: Non-aligned sizes, small inputs
Remainder handling: Correct processing of tail elements

Performance Expectations

Operation	AVX2 Speedup	NEON Speedup	WASM Speedup
Conv2d 3×3	2-3x	2-2.5x	1.5-2x
Depthwise	2-2.5x	2x	1.5-2x
MatMul	3-4x	2.5-3x	2-2.5x
Dot Product	4x	3x	2x

Memory Bandwidth

INT8 provides 4x memory bandwidth reduction:

FP32: 4 bytes per value
INT8: 1 byte per value
Cache efficiency: 4x better cache utilization

Usage Example

use ruvector_cnn::kernels::{conv2d_int8_dispatch, QuantParams};

// Quantize inputs
let input_q: Vec<u8> = quantize_activations(&input_fp32);
let kernel_q: Vec<i8> = quantize_weights(&kernel_fp32);
let bias_q: Vec<i32> = quantize_bias(&bias_fp32);

// Run INT8 convolution (automatic SIMD dispatch)
let mut output_i32 = vec![0i32; output_size];
conv2d_int8_dispatch(
    &input_q, input_zero_point, &kernel_q, &bias_q,
    &mut output_i32, in_h, in_w, in_c, out_c, stride, padding
);

// Dequantize output
let output_fp32 = dequantize(&output_i32, output_scale);

Testing Strategy

Unit Tests

Each kernel includes unit tests:

Basic functionality tests
Small input tests (< SIMD width)
Non-aligned size tests
Equivalence tests vs scalar

Integration Tests

Tests verify:

End-to-end quantized inference
Accuracy vs FP32 baseline (<1% degradation)
Performance benchmarks

Running Tests

# Run all kernel tests
cargo test -p ruvector-cnn --lib kernels

# Run specific kernel tests
cargo test -p ruvector-cnn test_kernel_equivalence

# Run with optimizations
cargo test -p ruvector-cnn --release kernels

Future Optimizations

AVX-512 VNNI

Intel Cascade Lake+ supports VNNI instructions:

_mm512_dpbusd_epi32: 4-way dot product in single instruction
Expected speedup: 5-6x over FP32

ARM Dot Product (ARMv8.2+)

ARM Cortex-A55+ supports dot product instructions:

vdotq_s32: 4-way dot product
Expected speedup: 3-4x over FP32

References

ADR-091: INT8 Quantization Design for ruvector-cnn
Intel Intrinsics Guide: https://www.intel.com/content/www/us/en/docs/intrinsics-guide/
ARM NEON Intrinsics: https://developer.arm.com/architectures/instruction-sets/intrinsics/
WebAssembly SIMD: https://github.com/WebAssembly/simd

Compliance

INV-6: All kernels match scalar reference within 1 ULP ✓
Edge cases: Non-aligned inputs handled correctly ✓
Multi-architecture: AVX2, NEON, WASM SIMD128 supported ✓
Automatic dispatch: Runtime feature detection implemented ✓

7.5 KiB Raw Permalink Blame History Unescape Escape

INT8 SIMD Kernels Implementation - Phase 2.2-2.3

Overview

Implementation Summary

Files Created

Key Features

Architecture-Specific Implementations

1. x86_64 AVX2 (int8_avx2.rs)

2. ARM NEON (int8_neon.rs)

3. WebAssembly SIMD128 (int8_wasm.rs)

Automatic Dispatch System

Quantization Scheme

Asymmetric Quantization (Activations)

Symmetric Quantization (Weights)

Per-Channel Quantization

Zero-Point Correction

Kernel Equivalence Testing (INV-6)

Performance Expectations

Memory Bandwidth

Usage Example

Testing Strategy

Unit Tests

Integration Tests

Running Tests

Future Optimizations

AVX-512 VNNI

ARM Dot Product (ARMv8.2+)

References

Compliance

7.5 KiB

Raw Permalink Blame History

1. x86_64 AVX2 (`int8_avx2.rs`)

2. ARM NEON (`int8_neon.rs`)

3. WebAssembly SIMD128 (`int8_wasm.rs`)