bench(rulake): search_batch vs per-query — 1.05× warm, big latent wins

Adds a batch-vs-loop block to rulake-demo. Measures on an
already-primed LocalBackend under Eventual consistency (the hot path):

    batch=8          qps=2874   1.01×
    batch=32         qps=2961   1.04×
    batch=128        qps=2943   1.03×
    batch=300        qps=2986   1.05×
    per-query loop   qps=2855   baseline

Modest on this workload because the warm cache path is already
uncontended (single-threaded + Eventual TTL makes ensure_fresh a
HashMap lookup, not a backend RTT).

BENCHMARK.md is updated to record the honest number and name the
three latent wins the bench does not measure:

  1. Fresh consistency: a batch of N amortizes N backend RTTs to 1.
  2. Concurrent contention: fewer mutex acquires under multi-client load.
  3. Kernel dispatch (ADR-157): batch is the plug-point GPU / SIMD
     kernels need to cross over CPU.

The mechanical guarantee is unchanged and already tested
(search_batch_acquires_cache_lock_once): batch=32 registers as
1 coherence check, not 32. Speedup is workload-dependent; the shape
is correct.

Co-Authored-By: claude-flow <ruv@ruv.net>
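
The amortization claim (one coherence check and one cache lock per batch, rather than one per query) can be illustrated with a toy model. This is a minimal sketch, not the RuLake implementation: `ToyLake`, its counter, and the brute-force `nearest` scan are hypothetical stand-ins, and the real `ensure_fresh` and cache layout may differ.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Mutex;

/// Hypothetical stand-in for the lake; NOT the real `RuLake` type.
struct ToyLake {
    /// Cached vectors, guarded by a single mutex.
    cache: Mutex<Vec<Vec<f32>>>,
    /// Counts how often the (simulated) coherence check runs.
    coherence_checks: AtomicUsize,
}

impl ToyLake {
    fn new(vectors: Vec<Vec<f32>>) -> Self {
        ToyLake {
            cache: Mutex::new(vectors),
            coherence_checks: AtomicUsize::new(0),
        }
    }

    /// Assumed shape of a freshness check: here it only bumps a counter.
    fn ensure_fresh(&self) {
        self.coherence_checks.fetch_add(1, Ordering::Relaxed);
    }

    /// Brute-force nearest neighbour by squared L2 distance.
    fn nearest(cache: &[Vec<f32>], q: &[f32]) -> usize {
        let mut best = 0;
        let mut best_d = f32::INFINITY;
        for (i, v) in cache.iter().enumerate() {
            let d: f32 = v.iter().zip(q).map(|(a, b)| (a - b) * (a - b)).sum();
            if d < best_d {
                best_d = d;
                best = i;
            }
        }
        best
    }

    /// Per-query path: one coherence check and one lock per call.
    fn search_one(&self, q: &[f32]) -> usize {
        self.ensure_fresh();
        let cache = self.cache.lock().unwrap();
        Self::nearest(&cache, q)
    }

    /// Batch path: one coherence check and one lock for the whole batch.
    fn search_batch(&self, qs: &[Vec<f32>]) -> Vec<usize> {
        self.ensure_fresh();
        let cache = self.cache.lock().unwrap();
        qs.iter().map(|q| Self::nearest(&cache, q)).collect()
    }

    fn checks(&self) -> usize {
        self.coherence_checks.load(Ordering::Relaxed)
    }
}

/// Returns (checks used by a 32-query loop, checks used by one batch of 32).
fn demo() -> (usize, usize) {
    let lake = ToyLake::new(vec![vec![0.0, 0.0], vec![1.0, 1.0]]);
    let qs: Vec<Vec<f32>> = (0..32).map(|i| vec![(i % 2) as f32; 2]).collect();
    for q in &qs {
        lake.search_one(q);
    }
    let loop_checks = lake.checks();
    lake.search_batch(&qs);
    (loop_checks, lake.checks() - loop_checks)
}
```

Calling `demo()` from a `main` and printing the pair shows the shape of the guarantee. In this toy model the check is a counter increment, so the saving is negligible, which matches the modest warm-path numbers above; the gap would only widen if each check were a backend round-trip or the lock were contended.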
2026-05-26 16:04:02 +00:00 · 2026-04-23 20:34:20 -04:00 · 2026-04-23 20:34:20 -04:00 · 39110f09d9
commit 39110f09d9
parent 3daa8b1b2a
2 changed files with 70 additions and 0 deletions
--- a/crates/ruvector-rulake/BENCHMARK.md
+++ b/crates/ruvector-rulake/BENCHMARK.md
@@ -59,6 +59,37 @@ The QPS drop with shard count under this single-thread benchmark is
 *not* pure `par_iter` startup overhead — see the concurrent-client
 numbers below for the honest picture.

+### search_batch vs per-query loop (n = 100 k, warm cache, single-threaded)
+
+`RuLake::search_batch(queries, k)` amortizes `ensure_fresh` and the
+cache mutex across N queries. Measured speedup on an already-primed
+`LocalBackend` under `Consistency::Eventual` (the hot path):
+
+| batch size |     QPS | speedup vs per-query |
+|-----------:|--------:|---------------------:|
+|         8  |   2,874 |              1.01×   |
+|        32  |   2,961 |              1.04×   |
+|       128  |   2,943 |              1.03×   |
+|       300  |   2,986 |              1.05×   |
+| per-query  |   2,855 | baseline             |
+
+Modest on this workload — the warm cache path is already uncontended
+(single-threaded, Eventual-TTL so `ensure_fresh` is a HashMap lookup,
+not a backend RTT). The bigger wins for batch are latent:
+
+- **`Consistency::Fresh`** — each per-query `ensure_fresh` is a
+  backend round-trip. A batch of 300 on Fresh amortizes 300 RTTs
+  into 1, a large difference at network latency.
+- **Concurrent contention** — fewer mutex acquires under heavy
+  multi-client load. Not measured in this single-threaded bench.
+- **Kernel dispatch (ADR-157)** — GPU / SIMD kernels cross over CPU
+  only above their `min_batch`. `search_batch` is the plug-point
+  that makes dispatch tractable; a per-query API would never let
+  GPU win.
+
+Test `search_batch_acquires_cache_lock_once` proves the amortization
+mechanically: a batch of 32 registers as 1 coherence check, not 32.
+
 ### Concurrent clients × shard count (n = 100 k, 8 clients × 300 queries)

 With the **adaptive per-shard rerank** introduced via
--- a/crates/ruvector-rulake/src/bin/rulake-demo.rs
+++ b/crates/ruvector-rulake/src/bin/rulake-demo.rs
@@ -302,6 +302,45 @@ fn main() {
        println!();
    }
    if !fast {
+        println!("── search_batch vs per-query loop (n=100k) ──");
+        let n = 100_000;
+        let data = clustered(n, d, 100, seed);
+        let queries = clustered(300, d, 100, seed ^ 0xdead_beef);
+
+        let backend = Arc::new(LocalBackend::new("bench"));
+        backend
+            .put_collection("c", d, (0..n as u64).collect(), data.clone())
+            .unwrap();
+        let lake =
+            RuLake::new(rerank, seed).with_consistency(Consistency::Eventual { ttl_ms: 60_000 });
+        lake.register_backend(backend).unwrap();
+        // Prime.
+        lake.search_one("bench", "c", &queries[0], 10).unwrap();
+
+        // Per-query loop over the full 300-query set.
+        let t = Instant::now();
+        for q in &queries {
+            let _ = lake.search_one("bench", "c", q, 10).unwrap();
+        }
+        let loop_qps = queries.len() as f64 / t.elapsed().as_secs_f64();
+
+        // Batch the same 300 queries in chunks of 8, 32, 128, and 300.
+        for &batch_size in &[8usize, 32, 128, 300] {
+            let t = Instant::now();
+            for chunk in queries.chunks(batch_size) {
+                let _ = lake.search_batch("bench", "c", chunk, 10).unwrap();
+            }
+            let batch_qps = queries.len() as f64 / t.elapsed().as_secs_f64();
+            println!(
+                "  batch={:>3}   qps={:>8.0}   speedup vs per-query {:.2}×",
+                batch_size,
+                batch_qps,
+                batch_qps / loop_qps
+            );
+        }
+        println!("  per-query loop   qps={:>8.0}   (baseline)", loop_qps);
+        println!();
+
        println!("── concurrent clients × federation (n=100k, 8 clients × 300 queries) ──");
        let n = 100_000;
        let queries = clustered(300, d, 100, seed ^ 0xdead_beef);