mirror of
https://github.com/ruvnet/RuVector.git
synced 2026-05-25 15:03:46 +00:00
Extends the scan dispatch ladder to scalar → AVX2 → AVX-512 VPOPCNTDQ. The new kernel runs under #[target_feature(enable = "avx2,avx512f,avx512bw,avx512vpopcntdq")] and processes 8 u64s per zmm load via _mm512_popcnt_epi64. select_impl() now prefers avx512f+avx512vpopcntdq, falls back to avx2+popcnt, then to scalar. The selected path is cached in the existing OnceLock.

Measured on a host with all three levels available (n=100k, D=128, rerank×20, single-thread, ruLake Fresh path):

  before (AVX2 path): ~3,681 QPS

  after (AVX-512): ~4,067 QPS (+10.5%)

This is below the 2× target because at D=128 each candidate feeds only 2 u64s to VPOPCNTDQ, so one zmm load covers 4 candidates. The kernel is memory-bandwidth-bound on the sequential packed stream, and the _mm512_storeu_si512 → scalar fold used for the per-candidate pair reduction eats part of the win. A vpsadbw-based in-register reduction would recover more, but would balloon the intrinsics surface beyond what fits cleanly in scan.rs.

Determinism preserved: scan_avx512 is byte-identical to scan_scalar at D=64, D=100, D=128, D=192, D=200, plus tail sizes n=7 and n=1023. The new test scan_avx512_matches_scalar exercises a 1000-vector D=128 run; the existing run_both harness adds AVX-512 parity to every shape it tests.

Clippy clean, with one allow(incompatible_msrv) scoped to scan_avx512 only: AVX-512 intrinsics stabilized in Rust 1.89, and runtime detection guarantees safe dispatch.

38 → 39 rabitq lib tests. ruLake unchanged (42).

Co-Authored-By: claude-flow <ruv@ruv.net>

| | | |
|---|---|---|
| benches | ||
| src | ||
| BENCHMARK.md | ||
| Cargo.toml | ||
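
The dispatch ladder described in the commit message above can be sketched roughly as follows. This is a minimal illustration, not the code in scan.rs: only `select_impl`, `scan_scalar`, and the feature names come from the commit message, while the `ScanImpl` enum, the `detect` helper, and the function signatures are hypothetical. The scalar kernel assumes the score is an XOR-popcount over bit-packed codes, which the commit message does not state. The AVX2 and AVX-512 kernels are omitted on purpose, since the AVX-512 intrinsics need Rust 1.89 or newer.

```rust
use std::sync::OnceLock;

/// Kernel levels named in the commit message, slowest to fastest.
/// (Hypothetical enum; the real crate may represent this differently.)
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum ScanImpl {
    Scalar,
    Avx2,
    Avx512,
}

/// Probe the CPU at runtime, preferring the widest supported level.
/// (Hypothetical helper.)
fn detect() -> ScanImpl {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx512f")
            && is_x86_feature_detected!("avx512vpopcntdq")
        {
            return ScanImpl::Avx512;
        }
        if is_x86_feature_detected!("avx2") && is_x86_feature_detected!("popcnt") {
            return ScanImpl::Avx2;
        }
    }
    ScanImpl::Scalar
}

/// Cache the detection result so the probe runs only once per process.
fn select_impl() -> ScanImpl {
    static CACHE: OnceLock<ScanImpl> = OnceLock::new();
    *CACHE.get_or_init(detect)
}

/// Scalar reference kernel (assumed scoring: XOR then popcount).
/// `packed` holds candidates back to back, `query.len()` u64 words each,
/// so D=128 means 2 words per candidate.
fn scan_scalar(query: &[u64], packed: &[u64]) -> Vec<u32> {
    assert!(!query.is_empty(), "query must contain at least one word");
    packed
        .chunks_exact(query.len())
        .map(|cand| {
            cand.iter()
                .zip(query)
                .map(|(c, q)| (c ^ q).count_ones())
                .sum::<u32>()
        })
        .collect()
}
```

A SIMD kernel added to such a ladder would be checked against `scan_scalar` on the same inputs, which is the kind of parity the commit message reports for scan_avx512.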