mirror of https://github.com/ruvnet/RuVector.git synced 2026-05-26 07:44:05 +00:00

Claude 51bb16ca09 docs(research): add TurboQuant KV cache compression research document

Comprehensive research document covering TurboQuant (ICLR 2026) and its
mapping to ruvLLM. Covers algorithm details, performance results,
integration architecture, PiQ3 comparison, risks/mitigations, and
implementation summary.

https://claude.ai/code/session_011ogX2uc7Zf8d8aQ3UAbNcd

2026-03-25 12:14:17 +00:00

7.8 KiB

Raw Permalink Blame History

TurboQuant: Data-Oblivious KV Cache & Vector Compression for ruvLLM

Abstract

TurboQuant (ICLR 2026) is a data-oblivious quantization algorithm that compresses high-dimensional vectors to ~3.5 bits per value with provably near-optimal geometry preservation. Unlike traditional quantization methods requiring codebooks or training, TurboQuant operates without dataset-specific tuning while achieving distortion within ~2.7× of information-theoretic lower bounds.

This document maps TurboQuant to ruvLLM's edge inference stack, where it addresses the KV cache memory bottleneck and enables compressed embedding stores compatible with RuVector's geometric control plane.

1. Core Algorithm

1.1 Two-Stage Pipeline

TurboQuant is a two-stage compression pipeline:

Stage 1: PolarQuant (MSE-optimal scalar quantization)

Apply randomized Hadamard rotation to input vector
After rotation, coordinates become approximately independent (Beta-distributed)
Apply optimal scalar quantizer per coordinate (no codebooks needed)

Stage 2: QJL Residual Correction (1-bit inner product correction)

Compute residual between original and Stage 1 reconstruction
Apply Quantized Johnson-Lindenstrauss (QJL) transform: store sign bits only
QJL signs produce an unbiased inner product estimator with minimal overhead

Combined: MSE quantizer + 1-bit QJL = unbiased inner product quantizer.

1.2 Mathematical Foundations

Random Rotation (Hadamard):

Orthogonal transform: H × H^T = n × I
Makes vector dimensions approximately independent
After rotation, angles follow a concentrated Beta distribution
Eliminates need for explicit normalization (saves memory)

Scalar Quantization:

Per-coordinate uniform quantizer with block-local scale/offset
Levels determined by target bit-width (e.g., 8 levels for 3 bits)
No codebook storage overhead

QJL Residual:

Sign-bit quantization: each residual component → +1 or -1
Zero memory overhead for quantization constants
Asymmetric estimator: exact query × quantized key → unbiased inner product
Total: ~0.5-1.0 extra bits per dimension

1.3 Error Bounds

Distortion within ~2.7× of information-theoretic lower bounds
Quality-neutral at 3.5 bits per channel (tested on Gemma, Mistral, Llama-3.1-8B)
Marginal quality degradation at 2.5 bits per channel

2. Performance Results

Metric	Value	Configuration
KV cache memory reduction	≥6×	3.5-bit vs FP16
Attention speedup	up to 8×	4-bit keys on H100
Recall vs PQ/RabbiQ	Superior	Zero indexing time
Training required	None	Data-oblivious
Runtime overhead	Negligible	Rotation + scalar quant

Benchmarks: LongBench, Needle-in-Haystack, ZeroSCROLLS, RULER, L-Eval.

3. Mapping to ruvLLM Architecture

3.1 KV Cache Integration (Highest ROI)

Problem: KV cache explodes with context length. ruvLLM pushes long context + continuous agents on edge devices (Pi 5, Seed appliance, Cognitum tiles).

Current architecture (kv_cache.rs):

TwoTierKvCache:
  Hot tier (FP16): Recent tokens (tail_length=256)
  Cold tier (Q4):  Older tokens (4.5 bits)

New architecture (TurboQuantKvCache):

Three-tier cache:
  Hot tier (FP16):        Recent tokens (tail_length=256)
  Cold tier (TurboQuant): Older tokens (~3.5 bits, geometry-preserving)

Impact:

5-8× more effective context window on edge devices
Preserves attention quality (unbiased inner product estimator)
No training or calibration data required
Drop-in replacement for cold tier quantization

3.2 RuVector Embedding Compression

TurboQuant preserves Euclidean distance geometry, which aligns with RuVector's use of geometry as a control layer:

HNSW search: Inner product preservation means nearest-neighbor results are stable under compression
Mincut coherence: Structural coherence signals remain valid on compressed embeddings
Hyperbolic embeddings: Require pre-transform to Euclidean space before compression (limitation)

Implementation: TurboQuantEmbeddingStore provides batch build, single retrieval, and nearest-neighbor search on compressed embeddings.

3.3 Comparison with PiQ3

Feature	PiQ3	TurboQuant	Recommended Use
Data aware	Yes	No	PiQ3 for archival, TurboQuant for live
Online	Partial	Yes	TurboQuant for streaming KV cache
Geometry preservation	Good	Provably near-optimal	TurboQuant for attention
KV cache ready	Not native	Yes	TurboQuant
Training required	Sometimes	None	TurboQuant for zero-config
Compression ratio	8-12×	6-9×	PiQ3 for cold storage

Best strategy: TurboQuant for live KV cache and real-time embeddings; PiQ3 for archival tiers and temporal compression pipelines.

4. Integration Architecture

ruvLLM Inference Pipeline
  ├── KV Cache
  │   ├── Hot Tier (FP16, recent tokens)
  │   └── Cold Tier (TurboQuant 3.5-bit) ← NEW
  ├── Embedding Store
  │   ├── Live (TurboQuant) ← NEW
  │   └── Archive (PiQ3 temporal compression)
  ├── RuVector Store
  │   ├── HNSW index (compressed embeddings)
  │   └── Mincut coherence (validation layer)
  └── Attention Computation
      └── Asymmetric inner product (exact query × compressed key)

5. Risks & Mitigations

5.1 Inner Product vs Mincut Tension

TurboQuant optimizes MSE + inner product distortion. RuVector optimizes structural coherence (mincut).

Mitigation: Run mincut as a validation layer. Reject high-distortion regions where TurboQuant error exceeds coherence threshold.

5.2 Hyperbolic Embeddings

TurboQuant assumes Euclidean space. ruvLLM uses hyperbolic + mixed curvature.

Mitigation: Pre-transform to Euclidean (logarithmic map) → quantize → inverse map (exponential map). Adds latency but preserves hyperbolic geometry.

5.3 Ultra-Low-Bit Instability (<3 bits)

Below ~3 bits, error spikes in rare vectors.

Mitigation: Existing ruvLLM infrastructure handles this:

Delta checks (detect excessive error)
Witness gating (audit trail)
Sparsifier (flag problematic vectors)

6. Implementation Summary

Phase 1: Core Compression (DONE)

turbo_quant.rs: TurboQuantCompressor with Hadamard rotation + scalar quantization + QJL residual correction
Bit configurations: 2.5, 3.0, 3.5, 4.0 bits per value
Bitstream packing for non-byte-aligned bit widths
13 passing tests

Phase 2: KV Cache Integration (DONE)

TurboQuantCacheTier: Compressed KV pair storage with push/get/evict
TurboQuantKvCache: Three-tier cache (FP16 hot + TurboQuant cold) with auto-migration from tail to cold tier
Integrated into kv_cache.rs with CacheTier::TurboQuant variant

Phase 3: Embedding Store (DONE)

TurboQuantEmbeddingStore: Batch build, single retrieval, nearest-neighbor search using asymmetric inner product
Compatible with RuVector HNSW index

Phase 4: Future Work

Mincut-based distortion gating ("coherence-aware quantization")
SIMD-optimized bit packing (NEON/AVX2)
Hyperbolic pre-transform adapter
Streaming compression for continuous agent contexts

7. References

TurboQuant (ICLR 2026): arxiv.org/abs/2504.19874
PolarQuant (AISTATS 2026): arxiv.org/abs/2502.02617
QJL: arxiv.org/abs/2406.03482
ADR-090: Ultra-Low-Bit Quantization Design
Google Research Blog: research.google/blog/turboquant-redefining-ai-efficiency-with-extreme-compression/

8. File Inventory

File	Description
`crates/ruvllm/src/quantize/turbo_quant.rs`	Core TurboQuant implementation
`crates/ruvllm/src/quantize/mod.rs`	Module exports (updated)
`crates/ruvllm/src/kv_cache.rs`	TurboQuantKvCache integration
`crates/ruvllm/src/quantize/hadamard.rs`	Hadamard transform (dependency)

7.8 KiB Raw Permalink Blame History Unescape Escape