ruvector/crates/ruvector-rabitq
ruvnet 2c4b7dd76b perf(rabitq): AVX-512 VPOPCNTDQ scan variant — +10.5% single-thread at n=100k
Extends the scan dispatch ladder to scalar → AVX2 → AVX-512 VPOPCNTDQ.
The new kernel runs under #[target_feature(enable = "avx2,avx512f,
avx512bw,avx512vpopcntdq")] and processes 8 u64s per zmm load via
_mm512_popcnt_epi64.

select_impl() now prefers avx512f+avx512vpopcntdq, falls back to
avx2+popcnt, then scalar. The selected path is cached in the existing
OnceLock, so feature detection runs once per process.
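
A minimal sketch of the dispatch ladder described above, not the code in
scan.rs. select_impl and the OnceLock come from this commit; the ScanImpl
enum and the detect_impl helper are hypothetical names used only for
illustration. Detection uses the standard is_x86_feature_detected! macro
and falls through to scalar on non-x86_64 targets.

```rust
use std::sync::OnceLock;

/// Hypothetical tag for the three rungs of the ladder.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum ScanImpl {
    Scalar,
    Avx2,
    Avx512,
}

/// Probe CPU features at runtime, best rung first (hypothetical helper).
fn detect_impl() -> ScanImpl {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx512f")
            && is_x86_feature_detected!("avx512vpopcntdq")
        {
            return ScanImpl::Avx512;
        }
        if is_x86_feature_detected!("avx2") && is_x86_feature_detected!("popcnt") {
            return ScanImpl::Avx2;
        }
    }
    // Always-available fallback, including on non-x86_64 targets.
    ScanImpl::Scalar
}

/// Cache the choice so detection runs once per process.
fn select_impl() -> ScanImpl {
    static CACHE: OnceLock<ScanImpl> = OnceLock::new();
    *CACHE.get_or_init(detect_impl)
}
```

Callers would match on select_impl() and invoke the corresponding kernel;
the #[target_feature] kernels themselves are only sound to call after the
matching runtime check has passed.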

Measured on a host with all three levels available (n=100k, D=128,
rerank×20, single-thread, ruLake Fresh path):

  before (AVX2 path): ~3,681 QPS
  after  (AVX-512):   ~4,067 QPS  (+10.5%)

Below the 2× target because at D=128 only 2 u64s per candidate feed
VPOPCNTDQ — the kernel is memory-bandwidth-bound on the sequential
packed stream, and the _mm512_storeu_si512 → scalar fold for
per-candidate pair reduction eats part of the win. A vpsadbw-based
in-register reduction would recover more but would balloon the
intrinsics surface beyond what fits cleanly in scan.rs.
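
The lane arithmetic behind that explanation can be sketched in portable
Rust. This is a scalar emulation under stated assumptions, not the AVX-512
kernel: it assumes the per-candidate distance is a Hamming distance
(popcount of code XOR query) and that candidates are stored as consecutive
u64 pairs. scan_lanes_d128 and the signature given to scan_scalar here are
hypothetical. At D=128 each 8-lane block covers 4 candidates, and the
adjacent-pair fold after the per-lane popcount is the step the commit
performs via a store and scalar adds.

```rust
/// Scalar reference: one distance per candidate, where each candidate is
/// `query.len()` consecutive u64 words in `packed` (assumed layout).
fn scan_scalar(packed: &[u64], query: &[u64], out: &mut Vec<u32>) {
    out.clear();
    for cand in packed.chunks_exact(query.len()) {
        let d: u32 = cand
            .iter()
            .zip(query)
            .map(|(c, q)| (c ^ q).count_ones())
            .sum();
        out.push(d);
    }
}

/// Emulates the zmm lane layout for D=128: 8 u64 lanes per block,
/// 2 lanes per candidate, so 4 candidates per block.
fn scan_lanes_d128(packed: &[u64], query: &[u64; 2], out: &mut Vec<u32>) {
    out.clear();
    let mut blocks = packed.chunks_exact(8);
    for block in &mut blocks {
        // Per-lane popcount, standing in for _mm512_popcnt_epi64.
        let mut lanes = [0u64; 8];
        for (i, lane) in lanes.iter_mut().enumerate() {
            // The 2-word query repeats across the 8 lanes.
            *lane = (block[i] ^ query[i % 2]).count_ones() as u64;
        }
        // Pair reduction: lanes (0,1), (2,3), (4,5), (6,7).
        for pair in lanes.chunks_exact(2) {
            out.push((pair[0] + pair[1]) as u32);
        }
    }
    // Tail: fewer than 4 candidates left, handled one pair at a time.
    for cand in blocks.remainder().chunks_exact(2) {
        out.push((cand[0] ^ query[0]).count_ones() + (cand[1] ^ query[1]).count_ones());
    }
}
```

Comparing the two functions on the same packed stream, including sizes
that are not a multiple of 4 candidates, is the same kind of parity check
the commit applies between scan_avx512 and scan_scalar.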

Determinism preserved: scan_avx512 is byte-identical to scan_scalar
at D=64, D=100, D=128, D=192, D=200, plus tail sizes n=7 and 1023.
New test scan_avx512_matches_scalar exercises a 1000-vector D=128
run; the existing run_both harness adds AVX-512 parity to every
shape it tests.

Clippy clean (one allow(incompatible_msrv) scoped to scan_avx512
only — AVX-512 intrinsics stabilized in Rust 1.89, runtime detection
guarantees safe dispatch).

38 → 39 rabitq lib tests. ruLake unchanged (42).

Co-Authored-By: claude-flow <ruv@ruv.net>
2026-04-24 09:55:12 -04:00
benches style(rabitq): cargo fmt pass to satisfy Rustfmt CI 2026-04-23 14:48:08 -04:00
src perf(rabitq): AVX-512 VPOPCNTDQ scan variant — +10.5% single-thread at n=100k 2026-04-24 09:55:12 -04:00
BENCHMARK.md perf(rabitq): SoA storage + cos-LUT — 2.5–3.1× symmetric scan at n=100k 2026-04-23 13:28:26 -04:00
Cargo.toml perf(rabitq,rulake): parallel prime via rayon — 11× faster at n=100k 2026-04-23 21:48:41 -04:00