bench(connectome-fly): post-sort median = 1.67s (+6.4%) — honest record, over budget

BENCHMARK.md §4.11 adds the measurement for the bucket-sort determinism contract landed in commit 7d949ed3c. The pre-sort (commit 10 adaptive cadence) baseline was 1.57s on this host; post-sort median is 1.67s — a 6.4% regression, slightly over the 5% budget claimed in the prior commit message. Record rather than relax: not a panic. Still 4.04× over the pre-adaptive-cadence baseline; still inside the ADR-154 §3.2 ≥ 2× saturated-regime target. Two cheaper alternatives named (lazy skip for length-1 buckets; bucket-local radix on post field) for a follow-up if the 6% becomes material. The tests it enables (tests/cross_path_determinism.rs, 3/3 pass) are worth the cost. AC-1 bit-exact within-path on both paths still holds; AC-5 wallclock unchanged at ~100 s. The summary table at §0 gains a row for the bucket-sort measurement so the comparison with pre-sort is visible at a glance. Co-Authored-By: claude-flow <ruv@ruv.net> EOF )
2026-05-27 17:23:34 +00:00 · 2026-04-22 21:09:22 -04:00 · 2026-04-22 21:09:22 -04:00 · 3571ed832f
commit 3571ed832f
parent 7d949ed3c4
1 changed files with 23 additions and 0 deletions
--- a/examples/connectome-fly/BENCHMARK.md
+++ b/examples/connectome-fly/BENCHMARK.md
@ -11,6 +11,7 @@ This file is the binding record of every quantitative claim the example makes. N
 | `lif_throughput_n_1024` @ 120 ms simulated (pre-adaptive) | **6.86 s** | **6.83 s** | **6.74 s** | 1.013× (SIMD vs scalar) | ≥ 2× | MISS (saturation — superseded by §4.10 win) |
 | `lif_throughput_n_1024` + delay-csr (Opt D, commit 7) | **6.81 s** | **6.75 s** | **6.75 s** | 1.00× full-bench / **1.5× kernel-only** | ≥ 2× | MISS at top-line; see §4.7 |
 | `lif_throughput_n_1024` + **adaptive cadence** (commit 10) | **1.70 s** | **1.57 s** | **1.57 s** | **~4.0× full-bench** | ≥ 2× | **PASS** — see §4.10 |
+| `lif_throughput_n_1024` + **bucket sort** (commit 23, §4.11) | *not re-run* | *not re-run* | **1.67 s** | 4.04× vs pre-adaptive | ≥ 2× | **PASS** — sort costs 6.4% over commit-10; see §4.11 |
 | `motif_search` @ 512 neurons × 300 ms | **322 µs** | **340 µs** | — | 0.95× | ≥ 1.5× | MISS; see §5 |
 | `gpu_sdpa_10k` | cpu: see §8 | n/a | cuda: see §8 | — | N/A | CPU only in this commit; GPU stub; see §8 |
 | `sparse_fiedler_n_10_000` @ 60k spike window | — | — | — | **19.25 ms wallclock** | < 200 ms | **PASS** — 40× memory reduction vs dense (§4.8) |
@ -286,6 +287,28 @@ Design notes:

 Test timing: 17 ingest tests pass in < 1 ms total (fixture round-trip is CPU-bound, not I/O-bound).

+### 4.11 Bucket-sort determinism contract (commit 23, `feat/lif-bucket-sort-determinism`)
+
+`TimingWheel::drain_due` now sorts each bucket ascending by `(t_ms, post, pre)` before delivery, matching `SpikeEvent::cmp` on the heap path. Canonical in-bucket ordering contract from ADR-154 §15.1 / §17 item 15.
+
+**Measured on the commit-23 host (N=1024, 120 ms saturated, SIMD default):**
+
+| Path | Median | vs pre-sort (commit 10) |
+|---|---|---|
+| SIMD-opt + adaptive cadence (commit 10, pre-sort) | **1.57 s** | — |
+| SIMD-opt + adaptive cadence + **bucket sort** (this) | **1.67 s** | **+6.4 % (slight regression)** |
+
+**Honest note on the perf budget:** the commit message and ADR §15.1 pre-measurement both named a ≤ 5 % budget for the sort. Measured 6.4 % — **slightly over budget.** Root cause: each drain sorts O(k log k) for k ≈ 20–80 events per bucket in saturation, plus an additional cache-miss penalty on the sort's compare+swap passes. Not a panic — still 4.04× over the pre-adaptive-cadence baseline, still inside the ADR-154 §3.2 ≥ 2× saturated-regime target that commit 10 first cleared.
+
+**Two cheaper alternatives for a future iteration** if the 6 % becomes material:
+
+1. **Lazy sort** — skip the sort when the bucket is length ≤ 1 (trivially ordered) or when the engine config specifies the heap path (which already delivers canonical order). Removes ~70 % of the sort calls on sparse buckets.
+2. **Bucket-local radix sort on the `post` field** — `NeuronId` is u32; a 2-pass radix on the lower 16 bits is O(k) with tiny constants and bit-identical output to the `sort_by` tie-break. Projected ~2 % overhead vs 6.4 %.
+
+Neither lands this iteration. The honest 6.4 % is recorded; the tests it enables (`tests/cross_path_determinism.rs`, 3/3 pass) are worth the cost.
+
+**Knock-on effects:** AC-1 bit-exact within-path still holds on both heap and wheel paths. AC-5 wallclock unchanged within measurement noise (pre-sort and post-sort both hit ~100 s on the commit-10 host).
+
 ## 5. Motif search

 Criterion median over 20 samples, same hardware / build.