bench(rulake): search_batch vs per-query — 1.05× warm, big latent wins

Adds a batch-vs-loop block to rulake-demo. Measures on an already-primed
LocalBackend under Eventual consistency (the hot path):

  batch=8     qps=2874   1.01×
  batch=32    qps=2961   1.04×
  batch=128   qps=2943   1.03×
  batch=300   qps=2986   1.05×
  per-query loop  2855   baseline

Modest on this workload because the warm cache path is already
uncontended (single-threaded + Eventual TTL makes ensure_fresh a
HashMap lookup, not a backend RTT). BENCHMARK.md is updated to
record the honest number and name the three latent wins the bench
does not measure:

  1. Fresh consistency — batch of N amortizes N backend RTTs to 1.
  2. Concurrent contention — fewer mutex acquires under multi-client.
  3. Kernel dispatch (ADR-157) — batch is the plug-point GPU / SIMD
     kernels need to cross over CPU.

The mechanical guarantee is unchanged and already tested
(search_batch_acquires_cache_lock_once): batch=32 registers as 1
coherence check, not 32. Speedup is workload-dependent; the shape
is correct.

Co-Authored-By: claude-flow <ruv@ruv.net>
ruvnet 2026-04-23 20:34:20 -04:00
parent 3daa8b1b2a
commit 39110f09d9
2 changed files with 70 additions and 0 deletions


@@ -59,6 +59,37 @@ The QPS drop with shard count under this single-thread benchmark is
*not* pure `par_iter` startup overhead — see the concurrent-client
numbers below for the honest picture.
### search_batch vs per-query loop (n = 100 k, warm cache, single-threaded)

`RuLake::search_batch(queries, k)` amortizes `ensure_fresh` and the
cache mutex across N queries. Measured speedup on an already-primed
`LocalBackend` under `Consistency::Eventual` (the hot path):

| batch size |   QPS | speedup vs per-query |
|-----------:|------:|---------------------:|
|          8 | 2,874 |                1.01× |
|         32 | 2,961 |                1.04× |
|        128 | 2,943 |                1.03× |
|        300 | 2,986 |                1.05× |
|  per-query | 2,855 |             baseline |

The gain is modest on this workload because the warm cache path is
already uncontended: the bench is single-threaded, and under an
Eventual TTL `ensure_fresh` is a HashMap lookup, not a backend RTT.
The bigger wins for batching are latent, and this bench does not
measure them:

- **`Consistency::Fresh`** — each per-query `ensure_fresh` is a
  backend round-trip. A batch of 300 on Fresh amortizes 300 RTTs
  into 1, a difference that dominates total time at network latency.
- **Concurrent contention** — fewer mutex acquires under heavy
  multi-client load. Not measured in this single-threaded bench.
- **Kernel dispatch (ADR-157)** — GPU / SIMD kernels cross over CPU
  only above their `min_batch`. `search_batch` is the plug-point
  that makes dispatch tractable; a per-query API would never let
  the GPU win.

Test `search_batch_acquires_cache_lock_once` proves the amortization
mechanically: a batch of 32 registers as 1 coherence check, not 32.

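The counting argument can be sketched with a toy model. Everything below is hypothetical and is not ruLake's implementation: `ToyLake`, its fields, and the two helper functions are illustrative stand-ins, and `ensure_fresh` here only increments a counter. The sketch assumes one coherence check per call, which is what the test above asserts for `search_batch`.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Mutex;

/// Hypothetical stand-in for a cache behind a single mutex (not ruLake's type).
struct ToyLake {
    cache: Mutex<HashMap<u64, f32>>,
    coherence_checks: AtomicUsize,
}

impl ToyLake {
    fn new() -> Self {
        ToyLake {
            cache: Mutex::new(HashMap::new()),
            coherence_checks: AtomicUsize::new(0),
        }
    }

    /// Stand-in for `ensure_fresh`: records one coherence check.
    /// Under a Fresh-style policy each check would be one backend round-trip.
    fn ensure_fresh(&self) {
        self.coherence_checks.fetch_add(1, Ordering::Relaxed);
    }

    /// One check and one lock acquisition per query.
    fn search_one(&self, query: u64) -> Option<f32> {
        self.ensure_fresh();
        let cache = self.cache.lock().unwrap();
        cache.get(&query).copied()
    }

    /// One check and one lock acquisition for the whole batch.
    fn search_batch(&self, queries: &[u64]) -> Vec<Option<f32>> {
        self.ensure_fresh();
        let cache = self.cache.lock().unwrap();
        queries.iter().map(|q| cache.get(q).copied()).collect()
    }

    fn checks(&self) -> usize {
        self.coherence_checks.load(Ordering::Relaxed)
    }
}

/// Coherence checks incurred by a per-query loop.
fn checks_per_query(queries: &[u64]) -> usize {
    let lake = ToyLake::new();
    for &q in queries {
        let _ = lake.search_one(q);
    }
    lake.checks()
}

/// Coherence checks incurred when the same queries are sent in chunks.
fn checks_batched(queries: &[u64], batch_size: usize) -> usize {
    let lake = ToyLake::new();
    for chunk in queries.chunks(batch_size) {
        let _ = lake.search_batch(chunk);
    }
    lake.checks()
}

fn main() {
    let queries: Vec<u64> = (0..300).collect();
    println!("per-query loop: {} checks", checks_per_query(&queries));
    for &b in &[8usize, 32, 128, 300] {
        println!("batch={:>3}: {} checks", b, checks_batched(&queries, b));
    }
}
```

In this model the number of checks is the number of chunks, so 300 queries in chunks of 32 cost 10 checks instead of 300. On the warm Eventual path each check is cheap, which is consistent with the small measured speedup; the model says nothing about the real cost of a check under Fresh.
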
### Concurrent clients × shard count (n = 100 k, 8 clients × 300 queries)
With the **adaptive per-shard rerank** introduced via


@@ -302,6 +302,45 @@ fn main() {
        println!();
    }

    if !fast {
        println!("── search_batch vs per-query loop (n=100k) ──");
        let n = 100_000;
        let data = clustered(n, d, 100, seed);
        let queries = clustered(300, d, 100, seed ^ 0xdead_beef);
        let backend = Arc::new(LocalBackend::new("bench"));
        backend
            .put_collection("c", d, (0..n as u64).collect(), data.clone())
            .unwrap();
        let lake =
            RuLake::new(rerank, seed).with_consistency(Consistency::Eventual { ttl_ms: 60_000 });
        lake.register_backend(backend).unwrap();
        // Prime the cache so both timed sections run on the warm path.
        lake.search_one("bench", "c", &queries[0], 10).unwrap();

        // Per-query loop over the full 300-query set.
        let t = Instant::now();
        for q in &queries {
            let _ = lake.search_one("bench", "c", q, 10).unwrap();
        }
        let loop_qps = queries.len() as f64 / t.elapsed().as_secs_f64();

        // Batch the same 300 queries in chunks of each batch size in turn.
        for &batch_size in &[8usize, 32, 128, 300] {
            let t = Instant::now();
            for chunk in queries.chunks(batch_size) {
                let _ = lake.search_batch("bench", "c", chunk, 10).unwrap();
            }
            let batch_qps = queries.len() as f64 / t.elapsed().as_secs_f64();
            println!(
                " batch={:>3} qps={:>8.0} speedup vs per-query {:.2}×",
                batch_size,
                batch_qps,
                batch_qps / loop_qps
            );
        }
        println!(" per-query loop qps={:>8.0} (baseline)", loop_qps);
        println!();

        println!("── concurrent clients × federation (n=100k, 8 clients × 300 queries) ──");
        let n = 100_000;
        let queries = clustered(300, d, 100, seed ^ 0xdead_beef);