mirror of https://github.com/ruvnet/RuVector.git synced 2026-05-25 23:24:03 +00:00

Michael O'Boyle a1a04a3570 bench: comprehensive quantization method comparison (8 methods, 3 datasets)

First benchmark comparing all ruvector-core quantization methods against
TurboQuant on standard vector search datasets. 8 configurations, 3 datasets
(GloVe d=200, SIFT d=128, PKM d=384), 3 trials per config with variance.

Key findings:
- Int4 beats TurboQuant MSE on recall at 8x compression (91.2% vs 89.6% R@1)
- QJL correction hurts recall for vector search (9-41% loss)
- PQ with 8 subspaces fails at d=200 (18.2% R@1)
- TurboQuant MSE 3-bit fills unserved 10.7x compression tier (82.0% R@1)
- QuantizedVector::distance() never called during HNSW search

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

2026-04-19 21:37:14 -04:00

11 KiB

Raw Permalink Blame History

RuVector Quantization Benchmark: All Methods Compared

Summary

First comprehensive benchmark of all quantization methods in RuVector, plus TurboQuant, on standard vector search datasets. Tests 8 configurations across 3 datasets, measuring recall, compression, and latency. Each configuration is run 3 times with independent seeds to report variance.

The results are surprising. Several ruvector-core quantization tiers underperform expectations, and no quantization method is actually used during HNSW search.

Architecture Finding

RuVector has two disconnected quantization subsystems that don't communicate with each other or with HNSW search:

crates/ruvllm/src/quantize/turbo_quant.rs     TurboQuant (1,483 lines, SIMD)
                                               └── TurboQuantEmbeddingStore: linear scan only

crates/ruvector-core/src/quantization.rs       ScalarQuantized, Int4, Product, Binary
                                               └── Never called during HNSW search

crates/ruvector-core/src/index/hnsw.rs         HNSW index
                                               └── f32 vectors only, no quantized distance

The QuantizedVector trait has 4 implementations with distance functions that are never called during graph traversal. All quantization in ruvector-core is storage-only. This affects every quantization method, not just TurboQuant.

Results

All methods pre-reconstruct to f32 before search for fair latency comparison. Values with ± show standard deviation across 3 trials with independent random seeds.

GloVe d=200 (100,000 vectors, 1,000 queries)

Method	Origin	R@1	R@10	R@100	Compress	p50 ms	Mem MB
f32 baseline	--	1.000	1.000	1.000	1.0x	2.77±0.08	80.0
scalar int8	ruvector-core	0.997	0.993	0.994	4.0x	2.85±0.21	20.8
int4	ruvector-core	0.912	0.904	0.917	8.0x	2.88±0.09	20.8
TQ MSE 4-bit	ruvllm	0.896	0.903±0.003	0.917	8.0x	3.09±0.24	10.0
TQ MSE 3-bit	ruvllm	0.820±0.007	0.826±0.003	0.845	10.7x	2.87±0.14	7.5
TQ full (QJL)	ruvllm	0.661±0.005	0.680	0.685	10.7x	27.26±0.95	7.7
binary	ruvector-core	0.514	0.503	0.498	32.0x	2.82±0.10	2.5
product_quant 8sub	ruvector-core	0.182	0.205	0.269	100.0x	2.88±0.14	1.0

SIFT d=128 (100,000 vectors, 1,000 queries)

Method	Origin	R@1	R@10	R@100	Compress	p50 ms	Mem MB
f32 baseline	--	1.000	1.000	1.000	1.0x	1.95±0.31	51.2
scalar int8	ruvector-core	0.986	0.989	0.993	4.0x	1.47	13.6
int4	ruvector-core	0.750	0.850	0.902	8.0x	1.49	13.6
TQ MSE 4-bit	ruvllm	0.448±0.032	0.577±0.036	0.691±0.035	8.0x	1.50±0.04	6.4
TQ MSE 3-bit	ruvllm	0.283±0.018	0.411±0.015	0.548±0.019	10.7x	1.49±0.02	4.8
TQ full (QJL)	ruvllm	0.168±0.012	0.249±0.007	0.377±0.003	10.7x	13.97±0.12	5.0
product_quant 8sub	ruvector-core	0.081	0.189	0.338	64.0x	1.49±0.02	0.9
binary	ruvector-core	0.000	0.000	0.003	32.0x	1.41	1.6

PKM d=384 (117 vectors, 20 queries)

Method	Origin	R@1	R@10	R@100	Compress	p50 ms	Mem MB
f32 baseline	--	1.000	1.000	1.000	1.0x	0.01	0.2
product_quant 8sub	ruvector-core	1.000	1.000	1.000	192.0x	0.01	0.2
scalar int8	ruvector-core	0.950	0.990	1.000	4.0x	0.01	0.0
int4	ruvector-core	0.900	0.960	0.991	8.0x	0.01	0.0
TQ MSE 4-bit	ruvllm	0.900	0.955±0.008	0.994±0.001	8.0x	0.01	0.0
TQ MSE 3-bit	ruvllm	0.900	0.932±0.006	0.989	10.7x	0.01	0.0
binary	ruvector-core	0.800	0.805	0.963	32.0x	0.01	0.0
TQ full (QJL)	ruvllm	0.817±0.085	0.880±0.004	0.979±0.003	10.7x	0.40	0.0

Key Findings

1. Int4 Beats TurboQuant MSE on Recall at Same Compression

At 8x compression on GloVe, naive min-max Int4 achieves 91.2% R@1 vs TurboQuant MSE 4-bit at 89.6%. The Hadamard rotation + Lloyd-Max codebook does not outperform simple linear scaling for nearest-neighbor recall.

With fair pre-reconstruction, latency is equivalent (~2.88ms vs ~3.09ms). The previously reported 6x speed advantage was an artifact of reconstructing Int4 per-query inside the timing loop while TurboQuant pre-reconstructed once. Correcting this eliminates the speed difference for brute-force search.

At d=128 (SIFT), Int4 dominates more strongly: 75.0% R@1 vs 44.8% for TQ MSE 4-bit.

2. QJL Hurts Recall Across All Datasets

Dataset	MSE-only R@1	Full (QJL) R@1	Loss
GloVe d=200	0.820±0.007	0.661±0.005	-19.4%
SIFT d=128	0.283±0.018	0.168±0.012	-40.6%
PKM d=384	0.900	0.817±0.085	-9.2%

QJL provides unbiased inner product estimation, but its variance shuffles near-neighbor rankings. For top-k retrieval, lower variance beats unbiasedness. This finding, previously observed for KV cache (softmax amplifies variance), extends to vector search. The high variance on PKM (±0.085 R@1) confirms QJL's instability at small dataset sizes.

3. Product Quantization Fails at d=200 with 8 Subspaces

ProductQuantized with 8 subspaces and 256 centroids achieves only 18.2% R@1 on GloVe (d=200). Each subspace is 25-dimensional with 256 centroids, which is insufficient to capture the distribution. The ruvector-core documentation lists PQ as "8-16x compression" for "cold data," but at d=200 it performs worse than random at practical compression ratios.

PQ works well on PKM (d=384, 117 vectors) because the small dataset allows the codebooks to memorize the data. This is not generalizable.

Note: other PQ configurations (more subspaces, OPQ rotation) could perform better. This benchmark tests the ruvector-core default configuration.

4. Binary Quantization: Near-Random on GloVe, Zero on SIFT

Binary quantization (sign-bit) achieves 51.4% R@1 on GloVe (barely above random for cosine on normalized vectors) and 0.0% R@1 on SIFT (L2-metric data where sign bits lose all magnitude information). The ruvector-core documentation suggests binary for "archive (<1% access)" data, which is appropriate, but the extreme quality loss should be documented.

5. TurboQuant's Real Niche: The 3-bit Tier

No ruvector-core method exists between int4 (4 bits, 8x) and binary (1 bit, 32x). TurboQuant MSE 3-bit fills this gap at 10.7x compression with 82.0±0.7% R@1 on GloVe. This is the strongest argument for integrating TurboQuant into ruvector-core: it occupies an unserved compression tier.

6. Quality Scales with Dimension

TurboQuant performs better at higher dimensions, consistent with theory (Beta distribution concentration):

Dimension	TQ MSE 4-bit R@1	Int4 R@1	TQ vs Int4
d=128 (SIFT)	0.448±0.032	0.750	-40.3%
d=200 (GloVe)	0.896	0.912	-1.8%
d=384 (PKM)	0.900	0.900	Tied

At d=384+, TurboQuant matches naive Int4. Modern embedding models (384-1536 dim) are in TurboQuant's effective range. Below d=200, TurboQuant is not competitive.

Recommendations

Tier 1: Quantization Comparison Table (Corrected)

The current ruvector-core documentation states:

| Quantization | Compression | Use Case              |
|--------------|-------------|-----------------------|
| Scalar (u8)  | 4x          | Warm data (40-80%)    |
| Int4         | 8x          | Cool data (10-40%)    |
| Product      | 8-16x       | Cold data (1-10%)     |
| Binary       | 32x         | Archive (<1% access)  |

Based on benchmarks, a corrected table for d >= 200:

| Quantization     | Compression | R@1 (GloVe) | Recommendation               |
|------------------|-------------|-------------|------------------------------|
| Scalar int8      | 4x          | 99.7%       | Default for quality          |
| Int4             | 8x          | 91.2%       | Best recall at 8x            |
| TQ MSE 4-bit     | 8x          | 89.6%       | Alternative at 8x (d >= 200) |
| TQ MSE 3-bit     | 10.7x       | 82.0%       | NEW: fills compression gap   |
| Binary           | 32x         | 51.4%       | Coarse filter only           |
| Product (8 sub)  | 100x        | 18.2%       | Not recommended at d=200     |

Tier 2: HNSW Integration

All quantization methods are storage-only. Bridging QuantizedVector::distance() into the HNSW search path would make every method usable for search, not just TurboQuant. This is a ~600-950 line change that benefits the entire quantization stack.

Tier 3: TurboQuant MSE-Only Integration

Add TurboQuant MSE-only (skip QJL) as a fifth quantization option in ruvector-core. The data-oblivious property (no training, no codebooks) makes it ideal for online vector databases where data arrives incrementally. Its value is the 3-bit compression tier, not speed or recall superiority over Int4.

Methodology

Protocol

All searches are brute-force (no HNSW) to isolate quantization effects
Vectors L2-normalized, inner product = cosine similarity
Ground truth: exact f32 inner product per dataset (not pre-computed files)
All methods pre-reconstruct to f32 before the timing loop (fair latency comparison)
TQ full: paper's unbiased inner product estimator (not pre-reconstructed)
Each configuration run 3 times with independent random seeds
Stochastic methods (TurboQuant, PQ) report mean±std across trials
Deterministic methods (Int4, Int8, Binary) report latency variance only

Reproducibility

pip install torch numpy scipy
python benchmark_quantized_search.py           # All datasets
python benchmark_quantized_search.py glove200  # Single dataset

Datasets: GloVe 6B (Stanford NLP), SIFT1M (INRIA Texmex), PKM (anonymized, included as .npy). Reference: tonbistudio/turboquant-pytorch

Limitations

PKM dataset (117 vectors) is too small for meaningful generalization; included for completeness at d=384
PQ tested with default 8 subspaces / 256 centroids only; optimized PQ variants may perform better
Binary and Int4 stored at full width during search (theoretical compression ratios reported)
CPU-only benchmarks; SIMD-optimized implementations would change latency characteristics

References

TurboQuant (ICLR 2026): arxiv.org/abs/2504.19874
PolarQuant (AISTATS 2026): arxiv.org/abs/2502.02617
QJL: arxiv.org/abs/2406.03482

11 KiB Raw Permalink Blame History