ruvector

mirror of https://github.com/ruvnet/RuVector.git synced 2026-05-25 15:03:46 +00:00

ruvnet 2c4b7dd76b perf(rabitq): AVX-512 VPOPCNTDQ scan variant — +10.5% single-thread at n=100k	2026-04-24 09:55:12 -04:00

Extends the scan dispatch ladder to scalar → AVX2 → AVX-512 VPOPCNTDQ. The new kernel runs under #[target_feature(enable = "avx2,avx512f,avx512bw,avx512vpopcntdq")] and processes 8 u64s per zmm load via _mm512_popcnt_epi64. select_impl() now prefers avx512f+avx512vpopcntdq, falls back to avx2+popcnt, then scalar. All paths are cached in the existing OnceLock.

Measured on a host with all three levels available (n=100k, D=128, rerank×20, single-thread, ruLake Fresh path):

  before (AVX2 path): ~3,681 QPS
  after  (AVX-512):   ~4,067 QPS (+10.5%)

Below the 2× target because at D=128 only 2 u64s per candidate feed VPOPCNTDQ: the kernel is memory-bandwidth-bound on the sequential packed stream, and the _mm512_storeu_si512 → scalar fold for per-candidate pair reduction eats part of the win. A vpsadbw-based in-register reduction would recover more but would balloon the intrinsics surface beyond what fits cleanly in scan.rs.

Determinism preserved: scan_avx512 is byte-identical to scan_scalar at D=64, D=100, D=128, D=192, D=200, plus tail sizes n=7 and 1023. New test scan_avx512_matches_scalar exercises a 1000-vector D=128 run; the existing run_both harness adds AVX-512 parity to every shape it tests.

Clippy clean (one allow(incompatible_msrv) scoped to scan_avx512 only; AVX-512 intrinsics stabilized in Rust 1.89, and runtime detection guarantees safe dispatch). 38 → 39 rabitq lib tests. Rulake unchanged (42).

Co-Authored-By: claude-flow <ruv@ruv.net>
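The dispatch ladder described in the commit message can be illustrated with a minimal Rust sketch. This is not the code in scan.rs: `ScanImpl`, `detect_impl`, `scan`, `scan_body`, and `scan_popcnt` are hypothetical names, the distance is assumed to be a plain XOR-popcount over packed `u64` words, and a single `popcnt`-enabled kernel stands in for both SIMD rungs so that the sketch builds on toolchains older than Rust 1.89. Only `select_impl`, `scan_scalar`, and the `OnceLock` caching are taken from the commit text.

```rust
use std::sync::OnceLock;

/// Rungs of the ladder named in the commit message (enum name is hypothetical).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum ScanImpl {
    Scalar,
    Avx2Popcnt,
    Avx512Vpopcntdq,
}

static IMPL: OnceLock<ScanImpl> = OnceLock::new();

/// Runtime CPU detection: prefer AVX-512 VPOPCNTDQ, then AVX2 + POPCNT, then scalar.
fn detect_impl() -> ScanImpl {
    #[cfg(target_arch = "x86_64")]
    {
        // Assumption of this sketch: the stand-in kernel below needs POPCNT,
        // so it is checked up front for both SIMD rungs.
        if !is_x86_feature_detected!("popcnt") {
            return ScanImpl::Scalar;
        }
        if is_x86_feature_detected!("avx512f") && is_x86_feature_detected!("avx512vpopcntdq") {
            return ScanImpl::Avx512Vpopcntdq;
        }
        if is_x86_feature_detected!("avx2") {
            return ScanImpl::Avx2Popcnt;
        }
    }
    ScanImpl::Scalar
}

/// Detection runs once; later calls read the cached choice.
fn select_impl() -> ScanImpl {
    *IMPL.get_or_init(detect_impl)
}

/// Shared loop body: one XOR-popcount distance per candidate.
#[inline(always)]
fn scan_body(query: &[u64], packed: &[u64], out: &mut Vec<u32>) {
    out.clear();
    // At D=128 each candidate is 2 u64 words, stored back to back.
    for cand in packed.chunks_exact(query.len()) {
        let dist: u32 = cand.iter().zip(query).map(|(c, q)| (c ^ q).count_ones()).sum();
        out.push(dist);
    }
}

/// Portable reference kernel.
fn scan_scalar(query: &[u64], packed: &[u64], out: &mut Vec<u32>) {
    scan_body(query, packed, out);
}

/// Stand-in for the SIMD kernels: same loop, compiled with hardware POPCNT.
/// A real AVX-512 rung would use _mm512_popcnt_epi64 on 8 u64s per zmm load.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "popcnt")]
unsafe fn scan_popcnt(query: &[u64], packed: &[u64], out: &mut Vec<u32>) {
    scan_body(query, packed, out);
}

/// Public entry point: dispatches on the cached implementation.
fn scan(query: &[u64], packed: &[u64]) -> Vec<u32> {
    assert!(!query.is_empty() && packed.len() % query.len() == 0);
    let mut out = Vec::with_capacity(packed.len() / query.len());
    match select_impl() {
        #[cfg(target_arch = "x86_64")]
        ScanImpl::Avx512Vpopcntdq | ScanImpl::Avx2Popcnt => {
            // Safe to call: detect_impl confirmed POPCNT before choosing these rungs.
            unsafe { scan_popcnt(query, packed, &mut out) }
        }
        _ => scan_scalar(query, packed, &mut out),
    }
    out
}
```

Calling `scan(&query, &packed)` with a 2-word query and a packed slice of 2-word candidates returns one distance per candidate. Because every rung shares the same loop body here, the output is identical to `scan_scalar` by construction; in a real kernel that parity has to be established by tests such as the `scan_avx512_matches_scalar` test the commit mentions.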
benches	style(rabitq): cargo fmt pass to satisfy Rustfmt CI	2026-04-23 14:48:08 -04:00
src	perf(rabitq): AVX-512 VPOPCNTDQ scan variant — +10.5% single-thread at n=100k	2026-04-24 09:55:12 -04:00
BENCHMARK.md	perf(rabitq): SoA storage + cos-LUT — 2.5–3.1× symmetric scan at n=100k	2026-04-23 13:28:26 -04:00
Cargo.toml	perf(rabitq,rulake): parallel prime via rayon — 11× faster at n=100k	2026-04-23 21:48:41 -04:00