mirror of
https://github.com/ruvnet/RuVector.git
synced 2026-05-29 11:13:33 +00:00
feat(analysis): hub-fraction + density sweeps at N=1024 — 28th & 29th discovery

#28 (null): hub_modules ∈ {0, 1, 2, 3, 4, 6, 8} at N=1024/40-modules. Peak stays at hub=3 → 0.516. hub ∈ [0, 2] cluster at 0.487–0.488; hub ≥ 4 collapses to 0.37–0.43. Narrow non-monotonic peak, not a smooth ridge. The "smaller hub wins" pattern from N=512 does NOT generalise to N=1024: 2nd ADR-level case of "hypothesis from small-N extrapolates wrong at large N" (1st was item 22 on fixed γ).

#29: fine num_modules ∈ {20, 25, 30, 35, 40, 50, 60, 80} at N=1024/hub=3. New N=1024 peak: 0.531 @ modules=30 (density 34.1), γ=3.0 (70 communities vs 30 truth). Secondary peak at modules=80/γ=2.5 scores 0.515: multi-modal landscape confirmed.

Finding: at N=1024 the optimal density is 34.1 neurons/module, not 25.6. At N=512 it's 25.6. The 4-D landscape (N × density × γ × hub) does not factorize. AC-3a gap at N=1024 now 1.41× (down from 1.45×). Best-across-scales remains 0.599 @ (N=512, modules=20, hub=1, γ=4.0), a 1.25× gap.

- tests/leiden_cpm.rs: leiden_cpm_hub_fraction_sweep_at_n1024, leiden_cpm_module_count_sweep_at_n1024_hub3
- ADR-154 §17 rows 28, 29 + heading 27 → 29

Co-Authored-By: claude-flow <ruv@ruv.net>
parent b4d3ea42a6
commit 7e682ea526
2 changed files with 112 additions and 2 deletions

@@ -446,9 +446,9 @@ This section enumerates the risks this ADR is aware of and how the example stays

This register is not comprehensive. It is the set of risks the branch has surfaced by running into them (positioning creep, threshold drift, null-distribution sloppiness, pre-measurement mis-diagnosis, envelope-vs-bit-exact framing, speculative-parenthetical predictions). Future commits are expected to add rows; they are not expected to remove rows.

## 17. Twenty-seven measurement-driven discoveries (roll-up)
## 17. Twenty-nine measurement-driven discoveries (roll-up)

Each of the twenty-seven is attached to the commit that produced it and the lesson it encoded for future work.
Each of the twenty-nine is attached to the commit that produced it and the lesson it encoded for future work.

| # | Commit | Finding | Lesson |
|---|---|---|---|

@@ -479,6 +479,8 @@ Each of the twenty-seven is attached to the commit that produced it and the less
| 25 | CPM-specific refinement phase tested: collapses at the γ regime where CPM works | Implemented the named-in-item-19 lever: Traag 2019 Alg. 4 with the CPM objective (`refine_cpm` / `refine_cpm_one_community` in `src/analysis/leiden.rs`). Wired between local moves and aggregate; ran the full CPM test suite. **Catastrophic regression across the board**: N=512 peak **0.549 → 0.038** @ γ=3.1 (−93 %); N=1024 peak **0.425 → 0.023** @ γ=2.25 (−95 %); seed-sweep mean ratio vs modularity flipped from 3.98× to 0.21×. Coarse-sweep on default SBM showed peak ARI migrating to γ=0.10 with 0.357, i.e., refinement is now *only* effective at an order-of-magnitude lower γ than the no-refinement sweet spot. **Refinement wiring reverted; `refine_cpm` kept in tree behind `#[allow(dead_code)]` with a pointer comment to this item.** | **The named next lever didn't just fail to help; it was actively destructive, and the mechanism is now well-understood.** The CPM refinement starts every node as a singleton sub-community within its coarse C. For v to merge into an existing singleton s, the gain `k_{v→s} − γ·n_v·n_s` must be positive. At the γ range where CPM excels on this substrate (γ ∈ [2, 3] post-normalization with mean weight = 1.0), a *single* edge of weight ~1 cannot overcome the γ·1·1 = 2–3 merge cost. **Refinement leaves nearly everything as singletons, and the subsequent aggregation step projects onto the identity, destroying the coarse structure built by level1_moves.** This is a clean instance of a third distinct failure-mode pattern on this branch: not "algorithm needs a rider" (item 16 → 17), not "measurement undersells" (item 18), but **"algorithm-that-ships-with-paper has a regime where the paper's claim holds and a regime where it destroys previous progress, and identifying the regime requires actually measuring, not reading the paper."** Traag & Waltman 2019 is explicit that refinement helps Leiden dominate Louvain; it is *not* explicit that their refinement formulation is sensitive to γ scaling in a way that makes it self-defeating at γ ∈ [2, 3]. At lower γ values (γ = 0.1, where singletons can cheaply merge), refinement would likely work as advertised, but at γ = 0.1 CPM itself scores 0.357 on the default SBM (well below the 0.549 ceiling at N=512 / γ=3.1). **So the lever is unavailable at the operating point where CPM is strongest, and the AC-3a 0.75 SOTA gap remains at 1.37× via CPM-without-refinement.** This is now the 9th pre-measurement-ADR-named lever ruled out by measurement; it shifts the remaining lever catalogue to (a) degree-stratified null for AC-5, (b) real-FlyWire ingest (the only remaining axis for AC-2 and likely AC-3a too), and (c) CPM refinement with a substrate-specific *non-singleton* start state, which is research, not engineering. Code: `src/analysis/leiden.rs::refine_cpm` (unwired, kept); no test change, since the existing CPM sweeps are already sensitive enough to have flagged the regression if it had shipped. |
| 26 | N=512 module-count sweep — 0.599 @ 20 modules (new ceiling) | Fixed N=512, swept num_modules ∈ {20, 25, 30, 35, 40, 45, 50} with hub_modules = m/12 (constant hub-ratio) and γ ∈ {1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 5.0} per module-count. **New peak full_ARI = 0.599 at num_modules=20, γ=4.0 (21 communities vs 20 truth)** — 9 % higher than the item-24 headline of 0.549 at num_modules=35. Per-config peaks: (20, 0.599) (25, 0.505) (30, 0.528) (35, 0.507) (40, 0.559) (45, 0.566) (50, 0.517). | **The "N=512 sweet spot" has more sub-structure than item 24 measured.** Module-count is a real axis: at N=512, the item-24 baseline (num_modules=35, 14.6 neurons/module) was NOT the optimal granularity — 20 modules (25.6 neurons/module) beats it by 9 % on full-ARI, and the γ peak shifts up (γ=4.0) to match the lower module count. A second local maximum appears at num_modules ∈ [40, 45], suggesting the quality ridge is multi-modal rather than a single peak. **New CPM ceiling on this substrate: 0.599 at (N=512, num_modules=20, γ=4.0). Gap to 0.75 AC-3a SOTA target narrows from 1.37× (item 24) to 1.25×.** The item-22→24 pattern now has a tighter form: *N is not the only axis — module count and γ together define a 2D quality landscape, and prior measurements held one dimension fixed at a non-optimal value*. Code: `tests/leiden_cpm.rs::leiden_cpm_module_count_sweep_at_n512`. Opens the natural follow-up: does this "few-large-modules wins" pattern hold at other N, or is it a cross-coupling with N=512 specifically? |
| 27 | Cross-scale constant-density (≈25.6 neurons/module) γ-sweep: N=1024 at 40 modules scores 0.516, not 0.425 | Held neurons/module ≈ 25.6 (the item-26 sweet spot). Varied N ∈ {256, 512, 1024, 2048} with num_modules = N/25.6 and hub_modules proportional. γ sweep {2.0, 2.5, 3.0, 3.5, 4.0, 5.0, 6.0, 8.0}. **Per-scale peaks: N=256 → 0.466 @ γ=5.0 (6 communities vs 10 truth); N=512 → 0.554 @ γ=4.0 (23 vs 20; note: 0.045 lower than item-26's 0.599 because hub_modules differs: 2 here, 1 in #26); N=1024 → 0.516 @ γ=2.5 (96 vs 40); N=2048 → 0.343 @ γ=2.0 (257 vs 80).** γ-peak-vs-N still monotonic (5.0 → 4.0 → 2.5 → 2.0). | **The "ARI peaks at N=512" finding from item 24 was density-dependent, not a universal property.** At density=14.6 (items 22–24), N=1024 scored 0.425; at density=25.6, N=1024 scores **0.516**, a 21 % lift from changing only the num_modules choice while holding N fixed. The full-ARI ceiling landscape is 3D (N × num_modules × γ), not 2D (N × γ as I'd been treating it). Two new patterns here: (i) **every prior N=1024 measurement on this branch used a sub-optimal num_modules**: the default substrate's 70-module choice maps to density 14.6, which is below the density-25.6 optimum observed at N=512; (ii) **hub_modules is a hidden 4th axis**: the N=512 peak dropped from 0.599 (item 26, hub=1) to 0.554 (this test, hub=2), a 0.045-unit difference from a single configuration parameter. The CPM ceiling on this substrate is best quoted as "~0.55–0.60 somewhere in the (N ∈ [384, 1024], density ∈ [20, 26], γ ∈ [2, 4], hub_fraction ∈ [5 %, 10 %]) landscape". The AC-3a gap to 0.75 SOTA ranges from 1.25× at the best observed config to 1.40× at the worst in this range. Code: `tests/leiden_cpm.rs::leiden_cpm_cross_scale_constant_density_at_25`. Opens the natural follow-up: sweep hub_fraction at N=1024 density=25.6 and look for a peak ≥ 0.55. If found, the "default N=1024 substrate is unrelatable" argument against the branch (subtext of items 22/23/24) flips: the N=1024 substrate is fine, prior configurations were just mis-tuned. |
| 28 | Hub-fraction sweep at N=1024 — peak stays at hub=3 (0.516), no new ceiling | At N=1024 / num_modules=40 / density=25.6, swept hub_modules ∈ {0, 1, 2, 3, 4, 6, 8} with γ ∈ {2.0, 2.5, 3.0, 3.5, 4.0, 5.0}. **Peak unchanged: 0.516 @ hub=3 (7.5 %), γ=2.5.** Neighboring hub values: hub=0/1/2 all cluster at 0.487–0.488; hub=4/6/8 drop to 0.374–0.434. A **narrow, non-monotonic peak**, not a smooth ridge. | **Hub-fraction is a real axis but its optimum is too narrow to close the AC-3a gap alone.** The hypothesis from items 26 (N=512 hub=1 wins) and 27 (N=512 hub=2 drops 0.045) was "smaller hub fraction → higher ARI". At N=1024 that predicts a peak at hub=0 or 1 — which would imply the item-27 config (hub=3) was suboptimal. Measured: hub ∈ [0, 2] scores a flat 0.488, hub=3 spikes to 0.516, hub ≥ 4 collapses. The pattern is "there is a sharp sweet spot that depends on N", not "fewer hubs always win". Second ADR-level finding on this branch of "hypothesis from a smaller-N data point extrapolates wrong at larger N" (the first was item 22 on fixed γ). The AC-3a gap at N=1024 stays at 1.45× (0.516 vs 0.75); the best observed on the branch remains 0.599 at (N=512, 20 modules, hub=1, γ=4.0). Code: `tests/leiden_cpm.rs::leiden_cpm_hub_fraction_sweep_at_n1024`. Opens the follow-up item 29 below — sweep num_modules at N=1024 / hub=3 instead. |
| 29 | Fine num_modules sweep at N=1024/hub=3: new N=1024 peak 0.531 @ density=34.1 | Follow-up to items 27 and 28. Swept num_modules ∈ {20, 25, 30, 35, 40, 50, 60, 80} at N=1024 with hub_modules = m/8 clamped to [1, 3] (item 28's winner hub=3 for every m ≥ 25; hub=2 at m=20), γ ∈ {1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 5.0}. **New peak: 0.531 @ num_modules=30 (34.1 neurons/module), γ=3.0 (70 communities vs 30 truth)**, +2.9 % on item 27's 0.516 headline at density=25.6. A second local peak at num_modules=80 / γ=2.5 scores 0.515 (multi-modal landscape confirmed). | **At N=1024 the optimal density is 34.1 neurons/module, not 25.6.** Item 27 found the density-axis matters; item 29 refines it: the optimum density shifts with N. At N=512 the winning density is 25.6 (item 26); at N=1024 it's 34.1. So the 4-D landscape (N × density × γ × hub) doesn't factorize: you can't set "the right" density once and vary N. This is consistent with a physical picture where communities large enough to be structurally distinct need more neurons inside them at higher N because the noise floor from inter-module crosstalk grows with N. **AC-3a gap at N=1024 is now 1.41× (0.531 vs 0.75), down from 1.45× at density=25.6.** Best across scales remains 0.599 at (N=512, 20 modules, hub=1, γ=4.0), a 1.25× gap, but the N=1024 story now has a named optimum that the default substrate doesn't hit. Code: `tests/leiden_cpm.rs::leiden_cpm_module_count_sweep_at_n1024_hub3`. |
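
The density, hub-count, and gap figures quoted in rows 26 to 29 can be re-derived with a few lines of arithmetic. The sketch below is illustrative only: `density`, `hub_for`, and `gap_to_target` are hypothetical names that do not exist in the crate, `hub_for` mirrors the `m / 8` clamp used in `leiden_cpm_module_count_sweep_at_n1024_hub3`, and 0.75 is the AC-3a target quoted in the table.

```rust
/// Neurons per module for a synthetic substrate (hypothetical helper).
fn density(num_neurons: u32, num_modules: u32) -> f32 {
    num_neurons as f32 / num_modules as f32
}

/// Hub-module count as chosen by the N=1024 module-count sweep:
/// integer m/8, clamped to the range [1, 3].
fn hub_for(num_modules: u16) -> u16 {
    (num_modules / 8).clamp(1, 3)
}

/// Ratio of the AC-3a target (0.75) to an observed full-partition ARI.
fn gap_to_target(ari: f32) -> f32 {
    0.75 / ari
}

fn main() {
    // Same num_modules grid as the item-29 sweep.
    for m in [20u16, 25, 30, 35, 40, 50, 60, 80] {
        println!(
            "modules={:3} density={:.1} hub={}",
            m,
            density(1024, m as u32),
            hub_for(m)
        );
    }
    // Headline ARIs from items 27/28, 29, and 26.
    for ari in [0.516f32, 0.531, 0.599] {
        println!("ari={:.3} gap={:.2}x", ari, gap_to_target(ari));
    }
}
```

Under this rule the num_modules=20 configuration runs with two hub modules, and the three headline ARIs map to gaps of 1.45×, 1.41×, and 1.25×.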
The discoveries form a pattern: every "next lever named in the ADR" ultimately required an empirical test. **Eight** of the fifteen pre-measurement diagnoses tested on this branch proved wrong (items 7, 8, 9, 10, 12, 13, 15, 16). **Four unambiguous wins now: item 6 (adaptive cadence, 4.29× saturated-regime speedup), item 14 (Leiden refinement, perfect ARI on planted SBM where Louvain collapsed), item 17 (weight-normalized CPM-Leiden, perfect ARI on planted SBM + 109 communities on 70-module default SBM), and item 18 (full-partition ARI metric, lifting CPM's default-SBM score from 0.020 two-way to 0.393 full — 3.7× the modularity-Leiden baseline).** Items 6 and 14 followed the orthogonal-axis pattern. Item 17 was the first "rider from item 16 works as predicted" data point. Item 18 is a different shape again — a **measurement upgrade** that revealed an algorithm's prior 0.020 2-way score was hiding a 0.393 full-partition score. That's a new entry in the lesson catalogue: *a test's coarsening choice is as much a threshold decision as its numerical tolerances.* Three distinct "how a measurement-driven discovery lands" shapes now documented (orthogonal axis / rider matches paper / coarsening upgrade).
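
Item 25's singleton-merge argument is easy to check numerically. The sketch below evaluates the gain expression quoted in that row, `k_{v→s} − γ·n_v·n_s`, for two singletons joined by one edge of weight 1.0; the function name `cpm_merge_gain` is hypothetical and is not the signature of `refine_cpm` in `src/analysis/leiden.rs`.

```rust
/// CPM gain for merging node set v into sub-community s, following the
/// expression in item 25: k_{v→s} − γ · n_v · n_s.
/// `k_v_to_s` is the total edge weight between v and s.
fn cpm_merge_gain(k_v_to_s: f64, gamma: f64, n_v: usize, n_s: usize) -> f64 {
    k_v_to_s - gamma * (n_v as f64) * (n_s as f64)
}

fn main() {
    // Two singletons connected by a single edge of (normalized) weight 1.0.
    for gamma in [0.1, 2.0, 2.5, 3.0] {
        let gain = cpm_merge_gain(1.0, gamma, 1, 1);
        println!("γ={:.1} gain={:+.2} merges={}", gamma, gain, gain > 0.0);
    }
}
```

With these inputs the gain is positive only at γ = 0.1; at γ ∈ [2, 3] it is negative, which matches the observation that refinement leaves nearly every node as a singleton in the regime where CPM scores best.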

@@ -704,3 +704,111 @@ fn leiden_cpm_cross_scale_constant_density_at_25() {
        best_overall_ari, best_overall.0, best_overall.1, best_overall.2
    );
}

#[test]
fn leiden_cpm_hub_fraction_sweep_at_n1024() {
    // Follow-up to item 27. At N=1024 with num_modules=40 (density
    // 25.6) and hub_modules=3, CPM scored 0.516. Item 27 also noted
    // that at N=512 the hub_modules choice matters: hub=1 → 0.599
    // (item 26), hub=2 → 0.554 (item 27's config). Hypothesis: at
    // N=1024, reducing hub_modules should raise the ceiling past
    // 0.516 and perhaps past 0.599 (closing the AC-3a gap further).
    //
    // Sweep hub_modules ∈ {0, 1, 2, 3, 4, 6, 8} at N=1024 /
    // num_modules=40. Per-hub γ sweep.
    let hub_counts: [u16; 7] = [0, 1, 2, 3, 4, 6, 8];
    let gammas = [2.0, 2.5, 3.0, 3.5, 4.0, 5.0];
    let mut overall_best_ari = f32::NEG_INFINITY;
    let mut overall_best: (u16, f64) = (0, 0.0);
    for &h in &hub_counts {
        let cfg = ConnectomeConfig {
            num_neurons: 1024,
            num_modules: 40,
            num_hub_modules: h,
            ..ConnectomeConfig::default()
        };
        let conn = Connectome::generate(&cfg);
        let truth: Vec<u32> = (0..conn.num_neurons())
            .map(|i| conn.meta(connectome_fly::NeuronId(i as u32)).module as u32)
            .collect();
        let mut best_ari = f32::NEG_INFINITY;
        let mut best_g = 0.0_f64;
        let mut best_d = 0usize;
        for &g in &gammas {
            let labels = connectome_fly::analysis::leiden::leiden_labels_cpm(&conn, g);
            let ari = full_partition_ari(&labels, &truth);
            if ari > best_ari {
                best_ari = ari;
                best_g = g;
                best_d = count_unique(&labels);
            }
        }
        let hub_frac = 100.0 * h as f32 / 40.0;
        eprintln!(
            "cpm-hub-sweep-N1024: hub_modules={:2} ({:.1}%) PEAK full_ari={:.3} @ γ={:.2} (distinct={})",
            h, hub_frac, best_ari, best_g, best_d
        );
        if best_ari > overall_best_ari {
            overall_best_ari = best_ari;
            overall_best = (h, best_g);
        }
    }
    eprintln!(
        "cpm-hub-sweep-N1024: OVERALL PEAK full_ari={:.3} at hub_modules={} γ={:.2} [vs 0.516 item-27 headline, 0.75 SOTA]",
        overall_best_ari, overall_best.0, overall_best.1
    );
}

#[test]
fn leiden_cpm_module_count_sweep_at_n1024_hub3() {
    // Orthogonal follow-up to item 28. Hub-fraction sweep at
    // N=1024/40 didn't break 0.516. Try fine num_modules sweep at
    // N=1024 with hub_modules=3 (item 28's winner) and a wider γ
    // grid. This tests whether density=25.6 (40 modules) is the
    // right choice at N=1024 or whether the N=1024 landscape has a
    // different density optimum than N=512.
    let module_counts: [u16; 8] = [20, 25, 30, 35, 40, 50, 60, 80];
    let gammas = [1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 5.0];
    let mut overall_best_ari = f32::NEG_INFINITY;
    let mut overall_best: (u16, f64) = (0, 0.0);
    for &m in &module_counts {
        // Hub = m/8 clamped to [1, 3]: stay close to the item-28
        // winner (hub=3) while scaling reasonably with module count.
        // On this grid only m=20 falls below 3 (hub=2).
        let h = (m / 8).clamp(1, 3);
        let cfg = ConnectomeConfig {
            num_neurons: 1024,
            num_modules: m,
            num_hub_modules: h,
            ..ConnectomeConfig::default()
        };
        let conn = Connectome::generate(&cfg);
        let truth: Vec<u32> = (0..conn.num_neurons())
            .map(|i| conn.meta(connectome_fly::NeuronId(i as u32)).module as u32)
            .collect();
        let mut best_ari = f32::NEG_INFINITY;
        let mut best_g = 0.0_f64;
        let mut best_d = 0usize;
        for &g in &gammas {
            let labels = connectome_fly::analysis::leiden::leiden_labels_cpm(&conn, g);
            let ari = full_partition_ari(&labels, &truth);
            if ari > best_ari {
                best_ari = ari;
                best_g = g;
                best_d = count_unique(&labels);
            }
        }
        let neurons_per_mod = 1024.0 / m as f32;
        eprintln!(
            "cpm-modsweep-N1024: modules={:3} n_per_mod={:.1} hub={} PEAK full_ari={:.3} @ γ={:.2} (distinct={})",
            m, neurons_per_mod, h, best_ari, best_g, best_d
        );
        if best_ari > overall_best_ari {
            overall_best_ari = best_ari;
            overall_best = (m, best_g);
        }
    }
    eprintln!(
        "cpm-modsweep-N1024: OVERALL PEAK full_ari={:.3} at num_modules={} γ={:.2} [vs 0.516 item-27, 0.599 item-26, 0.75 SOTA]",
        overall_best_ari, overall_best.0, overall_best.1
    );
}
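
Both tests call `full_partition_ari` and `count_unique`, which are defined elsewhere in `tests/leiden_cpm.rs` and are not part of this diff. For readers without the file at hand, the following is a minimal sketch of what such helpers could look like, assuming `u32` labels and the standard contingency-table form of the Adjusted Rand Index. The names `ari_sketch` and `count_unique_sketch` are hypothetical, and the real helpers may differ in signature and in how degenerate partitions are handled.

```rust
use std::collections::{HashMap, HashSet};

/// Number of unordered pairs among n items: n·(n−1)/2.
fn choose2(n: u64) -> f64 {
    (n * n.saturating_sub(1)) as f64 / 2.0
}

/// Adjusted Rand Index between two labelings of the same nodes
/// (hypothetical stand-in for `full_partition_ari`).
fn ari_sketch(pred: &[u32], truth: &[u32]) -> f32 {
    assert_eq!(pred.len(), truth.len());
    let mut joint: HashMap<(u32, u32), u64> = HashMap::new();
    let mut rows: HashMap<u32, u64> = HashMap::new();
    let mut cols: HashMap<u32, u64> = HashMap::new();
    for (&p, &t) in pred.iter().zip(truth) {
        *joint.entry((p, t)).or_insert(0) += 1;
        *rows.entry(p).or_insert(0) += 1;
        *cols.entry(t).or_insert(0) += 1;
    }
    let sum_joint: f64 = joint.values().map(|&c| choose2(c)).sum();
    let sum_rows: f64 = rows.values().map(|&c| choose2(c)).sum();
    let sum_cols: f64 = cols.values().map(|&c| choose2(c)).sum();
    let total = choose2(pred.len() as u64);
    if total == 0.0 {
        return 1.0; // fewer than two nodes: treated as agreement (assumed convention)
    }
    let expected = sum_rows * sum_cols / total;
    let denom = 0.5 * (sum_rows + sum_cols) - expected;
    if denom == 0.0 {
        return 1.0; // degenerate partitions: treated as agreement (assumed convention)
    }
    ((sum_joint - expected) / denom) as f32
}

/// Number of distinct community labels (hypothetical stand-in for `count_unique`).
fn count_unique_sketch(labels: &[u32]) -> usize {
    labels.iter().collect::<HashSet<_>>().len()
}

fn main() {
    let truth = [5, 5, 9, 9];
    println!("relabelled copy: {:.3}", ari_sketch(&[0, 0, 1, 1], &truth));
    println!("crossed split:   {:.3}", ari_sketch(&[0, 1, 0, 1], &truth));
    println!("distinct labels: {}", count_unique_sketch(&[3, 3, 7, 1]));
}
```

The ARI is invariant to label renaming, which is why the sweeps can compare Leiden community IDs directly against planted module IDs without any matching step.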