ruvector

mirror of https://github.com/ruvnet/RuVector.git synced 2026-05-23 04:27:11 +00:00

rUv c7b0ba4c0f hailo: NPU pipeline pool exploration + bridge cache/health parity (iter 234-249) (#418)	2026-05-04 09:56:26 -04:00
benchmarks	feat: Add Neo4j-compatible hypergraph database package (ruvector-graph)	2025-11-25 23:11:54 +00:00
workflows	hailo: NPU pipeline pool exploration + bridge cache/health parity (iter 234-249) (#418)	2026-05-04 09:56:26 -04:00

hailo: NPU pipeline pool exploration + bridge cache/health parity (iter 234-249) (#418)

* explore(hailo): NPU pipeline pool skeleton (iter 234)

Queued post-iter-227 baseline. Single-pipeline HefEmbedder caps
cluster throughput at ~70 RPS because every gRPC request serializes
on a single Mutex<Inner>. Hailo-8 + PCIe DMA can overlap — ~14ms per
inference is mostly PCIe transfer (~12ms), only ~2ms NPU compute. A
multi-pipeline pool should unlock 2-4× throughput.

# Baseline (iter 227, single pipeline, cognitum-v0)
| concurrency | throughput | p50    | p99    |
|-------------|------------|--------|--------|
| 1           | 70.6 RPS   | 14.1ms | 15.8ms |
| 4           | 70.7 RPS   | 56.7ms | 74.7ms |
| 8           | 70.7 RPS   | 112.7ms| 170.7ms|

Throughput plateaus regardless of concurrency, and p50 scales linearly
with concurrency, confirming the lock is the choke point.
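
The linear p50 scaling can be checked against Little's law (latency =
requests in flight / throughput). The sketch below is illustrative only:
`expected_latency_ms` is a hypothetical helper, not part of the crate, and
Little's law strictly predicts mean latency, so agreement with the measured
p50 is approximate.

```rust
/// Expected per-request latency in milliseconds when `concurrency` requests
/// are kept in flight against a server that completes `throughput_rps`
/// requests per second (Little's law rearranged: L = lambda * W).
/// Hypothetical helper for illustration; not taken from the ruvector crates.
pub fn expected_latency_ms(concurrency: u32, throughput_rps: f64) -> f64 {
    f64::from(concurrency) / throughput_rps * 1000.0
}
```

With the baseline numbers, 4 in-flight requests at 70.7 RPS predict about
56.6ms and 8 predict about 113.2ms, close to the measured 56.7ms and 112.7ms.
That is what a fully serialized resource looks like from the client side.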

# Skeleton (this commit)
- `HefEmbedderPool` mirroring CpuEmbedder's Vec<Mutex<Slot>> pattern.
- N independent HefPipeline instances on the shared vdevice;
  HailoRT's network-group scheduler arbitrates NPU access.
- `embed()`: try_lock each slot in turn; first free wins; fall back
  to blocking on slot 0 if all busy (matches cpu_embedder.rs).
- DEFAULT_POOL_SIZE = 4 (overlap PCIe write / NPU / PCIe read /
  host pre-post-processing without scheduler exhaustion).
- Compile-only test asserts Send + Sync so worker can hand out
  Arc<HefEmbedderPool> across tokio tasks.
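
A minimal sketch of the slot-selection policy described above, using only
`std::sync::Mutex`. `Slot`, `Pool`, `with_slot`, and `assert_send_sync` are
hypothetical names for illustration; the real `HefEmbedderPool` wraps
HailoRT pipelines and its internals are not shown in this commit message.

```rust
use std::sync::{Mutex, MutexGuard, TryLockError};

/// Hypothetical stand-in for one pipeline; the real slot holds NPU state.
pub struct Slot {
    pub id: usize,
    pub calls: u64,
}

/// Pool of independently locked slots (the Vec<Mutex<Slot>> pattern).
pub struct Pool {
    slots: Vec<Mutex<Slot>>,
}

impl Pool {
    /// Builds a pool with at least one slot.
    pub fn new(size: usize) -> Self {
        let slots = (0..size.max(1))
            .map(|id| Mutex::new(Slot { id, calls: 0 }))
            .collect();
        Pool { slots }
    }

    /// Runs `f` on the first free slot, or blocks on slot 0 if all are busy.
    pub fn with_slot<R>(&self, f: impl FnOnce(&mut Slot) -> R) -> R {
        let mut guard = self.acquire();
        f(&mut *guard)
    }

    fn acquire(&self) -> MutexGuard<'_, Slot> {
        for slot in &self.slots {
            match slot.try_lock() {
                Ok(guard) => return guard,
                // A poisoned slot is still usable in this sketch.
                Err(TryLockError::Poisoned(p)) => return p.into_inner(),
                Err(TryLockError::WouldBlock) => continue,
            }
        }
        // Every slot was busy: fall back to blocking on slot 0.
        self.slots[0].lock().unwrap_or_else(|p| p.into_inner())
    }
}

/// Compile-time check that a type can be shared across threads or tasks.
pub fn assert_send_sync<T: Send + Sync>() {}
```

Calling `assert_send_sync::<Pool>()` in a test is the compile-only check:
it has no runtime effect but fails to build if the pool stops being
`Send + Sync`.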

# Iter 235 plan (next)
- Wire HefEmbedderPool into ruvector-hailo-worker as a feature-flag.
- Deploy to cognitum-v0; rerun cluster-bench at concurrency 1/4/8.
- Sweep pool_size ∈ {2,4,8} to find the throughput knee.
- Document delta vs iter-227 baseline.

# Why a separate type, not a HefEmbedder field
Single-pipeline path stays cheaper for low-load deploys (init time,
RAM, no scheduler overhead). Solo Pi running mmwave-bridge keeps
HefEmbedder; cluster workers handling many concurrent gRPC streams
switch to HefEmbedderPool.

Co-Authored-By: claude-flow <ruv@ruv.net>

* feat(hailo): wire HefEmbedderPool behind RUVECTOR_NPU_POOL_SIZE (iter 235)

Builds on iter-234's pool skeleton. HailoEmbedder now picks between
single-pipeline and pool-of-pipelines NPU dispatch at open() time
via a new private `HefBackend` enum. Selector is the
`RUVECTOR_NPU_POOL_SIZE` env var:

  unset / = 1  → Single (preserves iter-162 default)
  >= 2         → Pool with N pipelines on the shared vdevice
  bad value    → falls back to Single (logs would be added later)

Default behavior unchanged — operators must opt into the pool. This
keeps the iter-227 baseline as the regression-floor: bench numbers
without RUVECTOR_NPU_POOL_SIZE set should match exactly.
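
The selector rules above could be expressed roughly as follows. This is a
sketch, not the crate's code: `Backend` stands in for the private
`HefBackend` enum, `select_backend` is a hypothetical function, and treating
`0` as a bad value is an assumption (the commit only specifies unset, 1,
>= 2, and unparseable input).

```rust
/// Illustrative stand-in for the private `HefBackend` enum.
#[derive(Debug, PartialEq, Eq)]
pub enum Backend {
    Single,
    Pool(usize),
}

/// Maps the raw env value to a backend choice.
/// Assumption: 0 is treated like any other bad value and yields Single.
pub fn select_backend(raw: Option<&str>) -> Backend {
    match raw.map(str::trim).and_then(|s| s.parse::<usize>().ok()) {
        Some(n) if n >= 2 => Backend::Pool(n),
        // Unset, 1, 0, or unparseable: keep the single-pipeline default.
        _ => Backend::Single,
    }
}

/// Reads RUVECTOR_NPU_POOL_SIZE from the process environment.
pub fn select_backend_from_env() -> Backend {
    select_backend(std::env::var("RUVECTOR_NPU_POOL_SIZE").ok().as_deref())
}
```

Keeping the parsing in a pure function makes the fallback behavior testable
without mutating the process environment.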

# Baseline (re-stating from iter 234, single pipeline, cognitum-v0)
| concurrency | throughput | p50    | p99    |
|-------------|------------|--------|--------|
| 1           | 70.6 RPS   | 14.1ms | 15.8ms |
| 4           | 70.7 RPS   | 56.7ms | 74.7ms |
| 8           | 70.7 RPS   | 112.7ms| 170.7ms|

# Next (iter 236)
- Cross-compile the worker for aarch64 with the hailo feature
- Deploy to cognitum-v0 with `RUVECTOR_NPU_POOL_SIZE=4`
- Re-run cluster-bench at concurrency 1/4/8
- Document the throughput delta in the iter-236 commit
- Sweep pool_size ∈ {2,4,8} to find the knee

Co-Authored-By: claude-flow <ruv@ruv.net>

* bench(hailo): iter-235 pool=4 — NEGATIVE result, no throughput gain (iter 236)

Deployed iter-235's HefEmbedderPool to cognitum-v0 with
RUVECTOR_NPU_POOL_SIZE=4. Re-ran cluster-bench at concurrency 1/4/8
plus pool-size sweep at {2,4,8}. Throughput ceiling holds at 70.7 RPS
across every configuration — identical to iter-227 baseline.

# Before (iter 227, single pipeline)
| concurrency | throughput | p50    | p99    |
|-------------|------------|--------|--------|
| 1           | 70.6 RPS   | 14.1ms | 15.8ms |
| 4           | 70.7 RPS   | 56.7ms | 74.7ms |
| 8           | 70.7 RPS   | 112.7ms| 170.7ms|

# After (iter 235 deployed, RUVECTOR_NPU_POOL_SIZE=4)
| concurrency | throughput | p50    | p99    |
|-------------|------------|--------|--------|
| 1           | 70.6 RPS   | 14.1ms | 16.7ms |
| 4           | 70.7 RPS   | 43.5ms | 84.9ms |
| 8           | 70.7 RPS   | 112.9ms| 211.7ms|

# Pool-size sweep (concurrency 4 for pool=2/4, concurrency 8 for pool=8)
| pool | concurrency | throughput | p50    |
|------|-------------|------------|--------|
| 2    | 4           | 70.7 RPS   | 43.3ms |
| 4    | 4           | 70.7 RPS   | 43.5ms |
| 8    | 8           | 70.7 RPS   | 112.9ms|

Delta: 0% throughput. p50 at c=4 dropped from 56.7ms → 43.5ms (a 23%
median-latency improvement) because each request gets its own host-side
queue slot, while p99 at c=4 rose from 74.7ms to 84.9ms. The NPU itself
remains the choke point.

# Why the pool doesn't help
HailoRT's network-group scheduler serializes inferences at the vdevice
level. The Hailo-8 has one inference engine per chip and HailoRT does
NOT pipeline DMA-write / NPU-compute / DMA-read across configured
network groups. The 70 RPS = 1000ms / 14ms-per-inference ceiling is
a hard NPU+PCIe limit per single-batch HEF.

# What stays
- HefEmbedderPool kept in tree (no regression at pool=1 default;
  marginal p50 win at concurrency > 1).
- RUVECTOR_NPU_POOL_SIZE env knob remains operator-controlled.
- Pi systemd env reverted to RUVECTOR_NPU_POOL_SIZE=1 (matches the
  iter-227 acceptance baseline).
- Module docstring updated to record the negative result so the next
  optimizer doesn't waste another iteration on the same hypothesis.

# Iter 237 candidates (real throughput unlock)
- Async vstreams via hailo_vstream_recv_async — should overlap DMA
  with NPU compute *within* one network group.
- Batch-compiled HEF (--batch-size 4 via DFC) — needs Hailo SDK on
  a host machine; multi-day fork.

Co-Authored-By: claude-flow <ruv@ruv.net>

* deploy(hailo): default RUVECTOR_NPU_POOL_SIZE=2 in env example (iter 237)

iter-236 confirmed pool size doesn't affect throughput (NPU-bound at
70 RPS regardless), but pool=2 at concurrency=4 cuts p50 latency by
roughly 23% vs single-pipeline (43.3ms vs 56.7ms baseline). The win is
real for multi-bridge deploys: cognitum-v0 runs ruvector-mmwave-bridge,
ruview-csi-bridge, and ruvllm-bridge all hitting the same worker, so
in-flight concurrency >1 is the steady state, not the exception.

# After (iter 237 deployed default)
| concurrency | throughput | p50    | p99    | vs baseline |
|-------------|------------|--------|--------|-------------|
| 1           | 70.6 RPS   | 14.1ms | 16.7ms | -           |
| 4           | 70.7 RPS   | 43.3ms | 84.7ms | -23% p50    |

Pool=2 chosen over pool=4: the latency win saturates at 2 (pool=4
gives the same p50). Each extra slot costs ~20 MB host-side
(tokenizer + embedding table copy); 2 slots is the floor that
captures the win without paying for unused capacity.

Cognitum-v0 systemd env updated to pool=2. Default in
ruvector-hailo.env.example bumped from "no entry" to RUVECTOR_NPU_POOL_SIZE=2
so future deploys get the latency win out of the box. Operators who
want the iter-227 baseline (single pipeline) can set =1.
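
As an illustration only, resolving the knob could look like the sketch below. The fallback rules (unset, unparsable, or zero values fall back to 2) and both function names are assumptions, not the worker's actual code.

```rust
/// Assumed fallback, mirroring the env example's value.
const ASSUMED_DEFAULT_POOL_SIZE: usize = 2;

/// Resolve the pool size from the raw env value.
/// Assumption: unset, unparsable, or zero values fall back to the default.
fn resolve_pool_size(raw: Option<&str>) -> usize {
    raw.and_then(|s| s.trim().parse::<usize>().ok())
        .filter(|&n| n >= 1)
        .unwrap_or(ASSUMED_DEFAULT_POOL_SIZE)
}

/// Read the knob from the process environment.
fn pool_size_from_env() -> usize {
    resolve_pool_size(std::env::var("RUVECTOR_NPU_POOL_SIZE").ok().as_deref())
}
```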

Co-Authored-By: claude-flow <ruv@ruv.net>

* feat(hailo): wire --cache flag into ruvllm-bridge (iter 238)

The bridge previously constructed `HailoClusterEmbedder::new(...)`
without the existing coordinator-side LRU cache. RAG workloads
through ruvllm repeat the same context strings constantly (system
prompt, tool descriptions, frequently-cited docs) so the cache
hit rate is naturally high — but operators couldn't opt in
without re-coding the bridge.

# Cache-hit speedup (measured during iter-237 prep on cognitum-v0)
| configuration                        | throughput   | p50    | hit_rate |
|--------------------------------------|--------------|--------|----------|
| no cache (NPU bound, iter-227 base)  | 70.7 RPS     | 43.5ms | n/a      |
| --cache 4096 --cache-keyspace 64     | 2305282 RPS  | 0us    | 1.000    |

Delta: 32500x throughput, ~all latency removed at 100% hit rate.
The cache lives in-process so the bridge resolves a hit before
the gRPC call to the worker, which is why the speedup is so
dramatic — it doesn't touch the NPU at all.
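
The sketch below illustrates why a hit never reaches the worker: the lookup happens before the remote call is made. `LruCache` here is a hypothetical stand-in written for this example, not the coordinator's cache, and the `remote` closure stands in for the gRPC round trip.

```rust
use std::collections::{HashMap, VecDeque};

/// Tiny LRU keyed by input text (illustrative only).
struct LruCache {
    cap: usize,
    map: HashMap<String, Vec<f32>>,
    order: VecDeque<String>, // front = least recently used
}

impl LruCache {
    fn new(cap: usize) -> Self {
        Self { cap, map: HashMap::new(), order: VecDeque::new() }
    }

    /// Move `key` to the most-recently-used position.
    fn touch(&mut self, key: &str) {
        if let Some(pos) = self.order.iter().position(|k| k == key) {
            self.order.remove(pos);
        }
        self.order.push_back(key.to_string());
    }

    /// Return the cached vector, or call `remote` and store its result.
    fn get_or_embed<F>(&mut self, text: &str, remote: F) -> Vec<f32>
    where
        F: FnOnce(&str) -> Vec<f32>,
    {
        if self.cap == 0 {
            return remote(text); // cache disabled
        }
        if let Some(v) = self.map.get(text).cloned() {
            self.touch(text);
            return v; // hit: the remote call is never made
        }
        let v = remote(text);
        if self.map.len() >= self.cap {
            // Evict the least recently used entry to make room.
            if let Some(oldest) = self.order.pop_front() {
                self.map.remove(&oldest);
            }
        }
        self.map.insert(text.to_string(), v.clone());
        self.touch(text);
        v
    }
}
```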

# What ships
- New `--cache <N>` flag (default 0 = disabled, backward compat).
- ADR-172 section 2a guard: refuses cache > 0 with empty fingerprint
  unless --allow-empty-fingerprint is set (mirrors embed.rs +
  bench.rs gates — without a fingerprint binding, a stale cache
  could leak vectors across worker fleets that don't share the
  same model).
- --help updated with the iter-238 measurement.
- Operator-controlled, opt-in. No deploy default change.
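
The fingerprint guard described in the list above can be sketched as a single predicate. The function name, signature, and error text are hypothetical; only the rule itself (cache > 0 with an empty fingerprint is refused unless explicitly allowed) comes from this commit.

```rust
/// Decide whether a cache of `cache_cap` entries may be enabled.
/// Illustrative only: the real gate lives in the bridge's argument handling.
fn check_cache_guard(
    cache_cap: usize,
    fingerprint: &str,
    allow_empty_fingerprint: bool,
) -> Result<(), String> {
    if cache_cap > 0 && fingerprint.is_empty() && !allow_empty_fingerprint {
        return Err(
            "refusing --cache > 0 with an empty fingerprint; \
             pass --allow-empty-fingerprint to override"
                .to_string(),
        );
    }
    Ok(())
}
```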

Same cache implementation already exposed via embed.rs's --cache
and HailoClusterEmbedder::with_cache. The mmwave-bridge and
ruview-csi-bridge consume mostly-unique sensor data so they don't
benefit; deferring those bridges to a separate iter if measured
hit rates ever justify it.

Co-Authored-By: claude-flow <ruv@ruv.net>

* docs(hailo): correct iter-237 RSS claim with measured numbers (iter 239)

iter-237's commit message claimed pool=2 cost "~20 MB per extra slot".
Direct ps measurement on cognitum-v0 showed the real cost is much
higher — ~55 MB per slot, dominated by HailoRT's per-network-group
DMA and ring buffers, not the host-side state I'd assumed:

  pool=1 → 87 MB RSS  (baseline)
  pool=2 → 142 MB RSS (+55 MB / +63%)
  pool=4 → 251 MB RSS (+164 MB / nearly 3x baseline)

The shared safetensors mmap (~90 MB) and HEF (~4 MB) ARE deduplicated
by the kernel page cache, but each HailoRT-configured network group
allocates its own DMA + ring-buffer set on top of the shared mmaps.
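
For RAM budgeting, the measurements above fit a simple linear model. This is a rough sketch under the assumption that each extra slot adds a constant ~55 MB; the constants are taken from the ps readings and the function name is hypothetical.

```rust
/// Measured RSS at pool=1 on cognitum-v0, in MB.
const BASELINE_RSS_MB: u32 = 87;
/// Approximate RSS added by each extra pipeline slot, in MB.
const PER_SLOT_RSS_MB: u32 = 55;

/// Estimate worker RSS for a given pool size (linear model, illustrative).
fn estimated_rss_mb(pool_size: u32) -> u32 {
    BASELINE_RSS_MB + PER_SLOT_RSS_MB * pool_size.saturating_sub(1)
}
```

The model reproduces 87 MB and 142 MB for pool=1 and pool=2, and predicts 252 MB for pool=4 against the measured 251 MB.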

# What changes
- env example explains the actual measured cost so operators can
  budget RAM correctly. Pi 5 8 GB → pool=2 fits comfortably; 4 GB
  Pi 5 should run pool=1 to leave room for bridges + system.
- DEFAULT_POOL_SIZE constant in hef_embedder_pool.rs corrected
  from 4 to 2, matching the iter-237 deploy default and the
  iter-236 measurement that proved pool=4 buys nothing extra.

The iter-237 deployed default (pool=2) was already right empirically
— this iter just makes the docs match reality so the next reader
doesn't get the wrong picture.

Co-Authored-By: claude-flow <ruv@ruv.net>

* feat(hailo): wire --cache flag into ruview-csi-bridge (iter 240)

Symmetric to iter-238 (ruvllm-bridge --cache). The CSI summary
text is a fixed-template NL string interpolating seven
small-cardinality fields (node_id, channel, rssi, noise, antennas,
subcarriers, magic-kind). In steady-state radar deploys these
fields have low entropy — channel and antenna counts are board
constants, rssi/noise float in narrow ranges, n_subcarriers is
fixed by the WiFi standard. Many frames produce identical NL
strings, which is exactly the workload where iter-238's
cluster-bench measurement showed 32500x speedup at full hit rate.

# What ships
- New `--cache <N>` flag (default 0 = disabled, backward compat).
- Same ADR-172 section 2a guard as ruvllm-bridge / embed.rs / bench.rs:
  refuses cache > 0 with empty fingerprint unless explicit opt-out.
- Startup banner reports cache size when enabled.
- --help updated with the iter-240 rationale.

Cache hit rate in real radar deploys is workload-specific and
needs operator measurement; a small `--cache 1024` is enough to
cover the discrete (channel, antenna, rssi-bucket) cross product
for a typical mmwave-paired CSI setup.

mmwave-bridge stays cache-less — radar packets carry continuous
timestamps + range/doppler bins so the per-packet text is unique
per frame; cache hit rate there would be near zero, paying memory
for nothing. Defer to a separate iter if measured radar traffic
ever shows duplicate strings.

Co-Authored-By: claude-flow <ruv@ruv.net>

* docs(hailo): refresh stale "once iteration N" references (iter 241)

Four cross-crate doc strings still pointed at "once iteration X
lands" milestones that have already shipped:

  ruvector-hailo/src/lib.rs:5      "once iter 3 lands the path dep"
  ruvector-hailo/src/lib.rs:424    "once iter 4 brings Mutex<Device>"
  ruvector-hailo-cluster/src/lib.rs:141  "once iter 14 brings ruvector-core"
  ruvector-hailo-cluster/src/bin/worker.rs:380  "later iters pipeline NPU"

The first three were closed by iter-218 (ADR-178 Gap B path-dep +
EmbeddingProvider impl). The fourth was partially addressed by the
iter-234..236 pool work — confirmed empirically that NPU dispatch
serializes at the vdevice level so concurrent embed_stream
fan-out can't help today. Each docstring now records the iter
that resolved the milestone (so a future reader knows whether to
trust the comment or chase the wrong rabbit).

Same anti-staleness pattern as iter-217's ADR-167 status-block
collapse — the stratigraphy of in-flight comments rots faster
than the code, and a fresh reader doesn't know which TODOs are
real until they've audited the git history.

No behavioral change.

Co-Authored-By: claude-flow <ruv@ruv.net>

* feat(hailo): wire --cache flag into mmwave-bridge (iter 242)

Corrects iter-240's incorrect claim that mmwave radar packets
produce unique strings per frame. The radar payload carries
timestamps but the NL summary template *discards* them — only
four templates exist:

  "breathing rate {N} bpm at radar sensor"
  "heart rate {N} bpm at radar sensor"
  "nearest target distance {N} cm at radar sensor"
  "(no )?person detected at radar sensor"

The {N} integers live in narrow physiological ranges (breathing
10-30, heart rate 60-100, distance 0-500 cm), giving at most 565
unique strings (21 + 41 + 501 + 2) across the entire mmwave domain.
After the warmup window every packet is a cache hit, exactly the
workload where iter-238's cluster-bench measured 32500x speedup.
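
To size the cache, the template domain can be enumerated directly. The sketch below assumes every integer in the quoted ranges can occur and that the summary strings are exactly the four templates listed; the function name is hypothetical.

```rust
use std::collections::HashSet;

/// Enumerate every summary string the four templates can produce
/// over the quoted integer ranges (an upper bound, illustrative only).
fn radar_summary_domain() -> HashSet<String> {
    let mut out = HashSet::new();
    for n in 10..=30 {
        out.insert(format!("breathing rate {} bpm at radar sensor", n));
    }
    for n in 60..=100 {
        out.insert(format!("heart rate {} bpm at radar sensor", n));
    }
    for n in 0..=500 {
        out.insert(format!("nearest target distance {} cm at radar sensor", n));
    }
    out.insert("person detected at radar sensor".to_string());
    out.insert("no person detected at radar sensor".to_string());
    out
}
```

Under those assumptions the domain holds 565 strings, so a cache of 1024 entries would cover it entirely.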

# What ships
- New `--cache <N>` flag (default 0 = disabled, backward compat).
- Same ADR-172 section 2a guard as ruvllm-bridge / ruview-csi-bridge /
  embed.rs / bench.rs.
- Startup banner reports cache size when enabled.
- --help updated with the iter-242 rationale.

All three bridges now expose --cache symmetrically:

  ruvllm-bridge      iter 238  (RAG context repeats)
  ruview-csi-bridge  iter 240  (CSI summary low-cardinality)
  mmwave-bridge      iter 242  (radar templates low-cardinality)

Co-Authored-By: claude-flow <ruv@ruv.net>

* feat(hailo): add --cache-ttl to all three bridges (iter 243)

embed.rs and bench.rs already supported `--cache-ttl <secs>` for
operators who want a max-staleness bound on cached vectors; the
bridges exposed only `--cache` (TTL=0, LRU eviction only). This
closes the parity gap.

# Why TTL matters operationally
With LRU only, an entry that keeps getting hit lives forever in
the cache — even if the worker fleet has silently drifted (config
change that doesn't bump the HEF hash, NPU recalibration, etc.).
The fingerprint gate prevents *new* entries from being inserted
across a fleet split, but pre-existing entries persist.

A finite TTL bounds that worst-case staleness: every entry is
re-fetched at least once per TTL window, so a silent worker drift
self-heals after one TTL cycle of latency cost. Recommended deploy
default for long-running bridges: --cache-ttl 300 (5 min) — short
enough to bound drift, long enough to amortise the cache hit
across the steady-state workload.
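
A minimal sketch of the staleness bound, assuming entries are timestamped on insert and never refreshed on a hit. `TtlCache` is a hypothetical type written for this example; it takes `now` as a parameter so the behavior can be checked without sleeping, and it leaves LRU eviction out.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Cache whose entries expire `ttl` after insertion (illustrative only).
struct TtlCache {
    ttl: Duration, // Duration::ZERO means "no TTL"
    entries: HashMap<String, (Instant, Vec<f32>)>,
}

impl TtlCache {
    fn new(ttl: Duration) -> Self {
        Self { ttl, entries: HashMap::new() }
    }

    fn insert(&mut self, key: &str, value: Vec<f32>, now: Instant) {
        self.entries.insert(key.to_string(), (now, value));
    }

    /// Return the entry only if it is younger than the TTL; drop it otherwise
    /// so the caller re-fetches from the worker.
    fn get(&mut self, key: &str, now: Instant) -> Option<Vec<f32>> {
        let expired = match self.entries.get(key) {
            Some((inserted, _)) => {
                !self.ttl.is_zero() && now.duration_since(*inserted) >= self.ttl
            }
            None => return None,
        };
        if expired {
            self.entries.remove(key);
            return None;
        }
        self.entries.get(key).map(|(_, v)| v.clone())
    }
}
```

Because a hit does not reset the timestamp, even a constantly requested entry is re-fetched once per TTL window.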

# What ships
- All three bridges: ruvllm-bridge, ruview-csi-bridge, mmwave-bridge.
- New `--cache-ttl <secs>` flag (default 0 = no TTL, LRU only).
- Wired through the same `with_cache_ttl(cap, Duration)` API
  embed.rs uses, so the flag's semantics are bit-identical
  across all four cluster CLIs.
- Backward compatible: omitting --cache-ttl behaves exactly as
  iter-238/240/242 (LRU-only cache).

Co-Authored-By: claude-flow <ruv@ruv.net>

* ci(hailo): smoke-test dispatch microbench in audit workflow (iter 244)

The cluster crate has had a Criterion microbench at
`benches/dispatch.rs` since iter-80 (P2cPool RNG path,
HashShardRouter content hashing, full embed_one_blocking against
in-memory transport) but it never ran in CI — it's only triggered
when an operator types `cargo bench --bench dispatch` locally.

Adding `cargo bench --bench dispatch -- --test` to the audit
workflow's test job. The `--test` flag runs each bench function
exactly once instead of criterion's default (~100 iterations +
warmup), so the cost is ~30 seconds in CI but the smoke catches:

  * bench harness panic from a removed dep or API change
  * imports broken by a refactor of the cluster surface
  * a hot-path function renamed without updating the bench

This is the fast variant of regression-gating — it doesn't detect
*numerical* regressions (a 2x slowdown that still completes
successfully). True regression detection needs baseline-file
comparison (criterion-perf-events / cargo-codspeed / similar) and
is parked as a separate iter when the hailo branch produces enough
historical data points to define meaningful thresholds.

Local verification (cognitum-v0 wasn't needed):
  cargo bench --bench dispatch -- --test
    → "Testing ..." for each bench function, all "Success"

Co-Authored-By: claude-flow <ruv@ruv.net>

* feat(hailo): add --health-check to all three bridges (iter 245)

embed.rs and bench.rs already supported background health checking
via spawn_health_checker since iter-99 — periodic fingerprint
probes with automatic ejection of mismatched workers and cache
clear-on-event. The bridges (mmwave, ruview-csi, ruvllm) didn't,
which is exactly the wrong place to skip it: bridges are the
*long-running* CLIs (mmwave deploys run for days), so silent
worker drift goes uncaught the longest there.

# Threat closed
Worker A is deployed with HEF X and fingerprint x-hash. The bridge
starts, validates the fingerprint at startup, and hands out
vectors. An operator re-deploys worker A with HEF Y (new model)
and fingerprint y-hash. The bridge keeps dispatching and gets back
vectors from a worker that no longer matches its expected
fingerprint, silently producing wrong embeddings until it restarts.

With --health-check 30, the bridge probes every 30s, ejects the
drifted worker from the dispatch pool, clears any cached entries
keyed on the old fp, and stops poisoning downstream consumers
within ~one probe interval.
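
One probe round can be sketched as below. The types and names are hypothetical and do not reflect the actual spawn_health_checker internals; the sketch also assumes a worker missing from the probe results is left in the pool.

```rust
use std::collections::HashMap;

/// Bridge-side dispatch state (illustrative only).
struct DispatchState {
    expected_fp: String,
    pool: Vec<String>,                 // worker addresses
    cache: HashMap<String, Vec<f32>>,  // text -> embedding
}

impl DispatchState {
    /// Apply one probe round. `probed` maps worker address to the fingerprint
    /// it reports now. Returns the addresses that were ejected.
    fn apply_probe(&mut self, probed: &HashMap<String, String>) -> Vec<String> {
        let expected = self.expected_fp.clone();
        let mut ejected = Vec::new();
        self.pool.retain(|addr| {
            // Assumption: workers absent from the probe results stay in the pool.
            let ok = probed.get(addr).map_or(true, |fp| *fp == expected);
            if !ok {
                ejected.push(addr.clone());
            }
            ok
        });
        if !ejected.is_empty() {
            // Cached vectors may have come from the drifted worker.
            self.cache.clear();
        }
        ejected
    }
}
```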

# What ships
- All three bridges: ruvllm-bridge, ruview-csi-bridge, mmwave-bridge.
- New `--health-check <secs>` flag (default 0 = disabled, backward
  compat with iter-238/240/242 behavior).
- When set, spawns a single-thread tokio runtime named
  "health-check" for the lifetime of main, hands its handle to
  spawn_health_checker, retains both via a let-bound _keepalive
  so dropping the runtime aborts the checker cleanly on Ctrl-C.
- Same HealthCheckerConfig as embed.rs (interval override, all
  other defaults from health_checker_config()).
- --help text updated with the iter-245 rationale.

Recommended deploy interval for long-running bridges: 30-60
seconds. Stricter (every 5s) is fine if the bridge is the only
load on the worker; looser (every 5 min) is the practical upper
bound. Beyond that, the threat window outweighs the CPU savings.

Co-Authored-By: claude-flow <ruv@ruv.net>

* deploy(hailo): document iter-238..245 flags in bridge env examples (iter 246)

iter-238 (ruvllm-bridge --cache), iter-240/242 (other bridges
--cache), iter-243 (--cache-ttl), iter-245 (--health-check) all
shipped CLI flags but didn't update the deploy env templates.
Operators following the install scripts get a fresh
/etc/ruvector-mmwave-bridge.env that has no hint these knobs
even exist.

Closing the doc gap by adding annotated suggestions to all three
RUVECTOR_*_EXTRA_ARGS sections:

  ruvector-mmwave-bridge.env.example  → --cache + --cache-ttl + --health-check
  ruview-csi-bridge.env.example       → --cache + --cache-ttl + --health-check
  ruvllm-bridge.env.example           → --cache + --cache-ttl

Each example shows the recommended hardened deploy line so
operators can copy-paste:

  RUVECTOR_*_EXTRA_ARGS=--cache 4096 --cache-ttl 300 --health-check 30

(ruvllm-bridge omits --health-check from the typical deploy because
ruvllm typically forks the bridge per-session — health checking a
sub-second-lifetime process is a no-op.)

No code change. No behavioral change. Deploy parity / discoverability
fix only.

Co-Authored-By: claude-flow <ruv@ruv.net>

* fix(hailo): cap RUVECTOR_LOG_TEXT_CONTENT=full at 200 chars (iter 247)

The audit-log Full mode rendered text verbatim — for an embed
request the iter-180 byte cap allows up to 64 KB. An operator
who flips RUVECTOR_LOG_TEXT_CONTENT=full to debug in prod could
push 64 KB × 70 RPS = 4.5 MB/s of journald traffic, which:
  * burns journal disk fast (10s of GB/hour)
  * produces single-line entries that break most ops tooling
    (long-line scanners, journalctl --grep regex backtracking)
  * makes individual entries unscannable by humans anyway

Capping at 200 chars per text preserves the debug utility (you
can still grep for content correlations against request_id) at
roughly 1/100th the worst-case journald volume. The cut is
char-boundary-safe (counted via str::chars()) so multi-byte UTF-8
doesn't panic the rendering path.
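
A char-boundary-safe cut can be sketched with `char_indices`, which yields the byte offset where each character starts. The constant name, function name, and the `...` marker are assumptions for illustration; the actual marker used by the worker is not shown here.

```rust
/// Maximum number of characters of request text to log (from this commit).
const LOG_TEXT_CAP_CHARS: usize = 200;

/// Truncate `text` to at most LOG_TEXT_CAP_CHARS characters, appending a
/// marker when something was cut. Never slices inside a multi-byte character.
fn cap_log_text(text: &str) -> String {
    match text.char_indices().nth(LOG_TEXT_CAP_CHARS) {
        // `idx` is the byte offset where the 201st character starts,
        // so slicing up to it always lands on a char boundary.
        Some((idx, _)) => format!("{}...", &text[..idx]),
        // 200 characters or fewer: log unchanged.
        None => text.to_string(),
    }
}
```

Slicing by byte count instead (for example `&text[..200]`) would panic whenever byte 200 falls inside a multi-byte character, which is the failure the char-based count avoids.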

# Worst case before vs after
Request: 64 KB UTF-8 text @ 70 RPS, RUVECTOR_LOG_TEXT_CONTENT=full
  Before: 64 KB × 70 = 4.5 MB/s journal volume per worker
  After:  600 B × 70 = 42 KB/s (200 chars + UTF-8 + framing)

Three tests added: short (≤cap, unchanged), long (truncated +
ellipsis marker), multi-byte (300×U+1F980 emoji = 1.2 KB,
truncates on a char boundary not byte boundary).

iter-180 capped REQUEST size; iter-190 capped RESPONSE size;
iter-247 caps the LOG-LINE size for the same defense-in-depth
reason. Full-mode logging stays the operator's footgun (per the
existing docstring) — but it's now a footgun that doesn't
exhaust the disk in 10 minutes.

Co-Authored-By: claude-flow <ruv@ruv.net>

* chore(hailo): log RUVECTOR_NPU_POOL_SIZE at worker startup (iter 248)

iter-235 added the env-var knob for the HefEmbedderPool selector,
but the worker never logged the resolved value at startup. An
operator who flipped pool=2→4 (or back to 1 on a memory-constrained
4 GB Pi) had no confirmation the change actually took effect short
of inspecting RSS via `ps`.

Now the worker emits an info-level log line alongside the existing
iter-180/181/182/183/184 DoS-gate startup banner:

  NPU pipeline pool size pool_size=2 (iter 235; >=2 enables ...)

Same disclosure pattern as RUVECTOR_LOG_TEXT_CONTENT,
RUVECTOR_RATE_LIMIT_RPS, RUVECTOR_MAX_BATCH_SIZE, etc — every
operator-tunable env knob ends up in the journal at startup so
post-incident review can reconstruct the running config without
reading /etc/ruvector-hailo.env at the time of the incident.

No behavior change. Pure observability.

Co-Authored-By: claude-flow <ruv@ruv.net>

* fix(mmwave): widen Event::Unknown.payload_len u8 → u16 (iter 249)

`Event::Unknown { frame_type, payload_len }` carried a u8 payload_len
even though the MR60BHA2 protocol uses a 2-byte length field. The
current parser caps payloads at MAX_PAYLOAD=64 (well within u8) so
this was never a runtime truncation, but:

- Type didn't match the protocol's intent — operators reading the
  emitted JSONL had to remember the implicit cap.
- `clippy::cast_possible_truncation` fired at the construction
  site (`payload.len() as u8`) and the bridge's emission site.
  Pedantic, but the alternative — silencing with `#[allow]` — is
  worse than just using the right type.

Now the construction site uses `u16::try_from(...).unwrap_or(u16::MAX)`,
which stays exact for any future MAX_PAYLOAD bump up to 65535
bytes and saturates at u16::MAX beyond that. The mmwave-bridge JSONL
formatter already prints the value via `{}` so emission stays unchanged.
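
The difference between the old cast and the checked conversion can be seen in isolation. The helper name below is hypothetical; only the `u16::try_from(...).unwrap_or(u16::MAX)` expression comes from this commit.

```rust
use std::convert::TryFrom;

/// Convert a payload length to u16, saturating instead of wrapping.
fn payload_len_u16(payload: &[u8]) -> u16 {
    u16::try_from(payload.len()).unwrap_or(u16::MAX)
}

/// The former behavior, kept only for comparison: `as u8` wraps modulo 256.
fn payload_len_u8_lossy(payload: &[u8]) -> u8 {
    payload.len() as u8
}
```

A 300-byte slice would report 44 through the old cast (300 mod 256) and 300 through the checked conversion, although the parser's MAX_PAYLOAD cap keeps such a payload from reaching the Event today.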

Test added that locks the field width: an unknown frame with a
60-byte payload must report payload_len=60. (300 bytes would
exercise the formerly-truncating path but the parser rejects
anything > MAX_PAYLOAD before the Event is constructed, so the
test stays inside the parser's contract.)

Surfaced by an iter-249 cargo clippy --pedantic sweep; same
audit pass also flagged stylistic warnings (missing backticks,
implicit format args) which are out of scope.

Co-Authored-By: claude-flow <ruv@ruv.net>

* docs(hailo): add READMEs to 3 missing hailo crates + benchmarks (iter 250)

Closes the doc gap surfaced by the iter-234..249 PR review:
ruvector-hailo-cluster had a 424-line operator README, but the 3
sibling crates (ruvector-hailo, ruvector-mmwave, hailort-sys)
shipped without one — `cargo doc --open` was the only on-ramp.

# What ships

- crates/ruvector-hailo/README.md         — embedding backend,
  3 feature-gated build paths, architecture diagram, iter-235+
  pool benchmark table, security posture summary, env vars
- crates/ruvector-mmwave/README.md        — MR60BHA2 wire format,
  parser API, criterion benchmark numbers, proptest fuzz suite
- crates/hailort-sys/README.md            — FFI binding scope,
  build requirements, why no safe wrapper at this layer
- crates/ruvector-hailo-cluster/README.md — added the iter-238
  cache-hit measurement table + the iter-234..237 pool benchmark
  table; refreshed the CLI section to enumerate all four cluster
  CLIs + the three bridges with their iter-243/245 flags

All builds verified clean:
  cargo build -p ruvector-hailo --no-default-features
  cargo build -p ruvector-hailo --features cpu-fallback
  cargo build -p ruvector-mmwave
  cargo build -p hailort-sys
  cargo build -p ruvector-hailo-cluster --bins

No code change. Documentation parity only.

Co-Authored-By: claude-flow <ruv@ruv.net>

---------

Co-authored-by: ruvnet <ruvnet@gmail.com>

2026-05-04 09:56:26 -04:00

benchmarks

feat: Add Neo4j-compatible hypergraph database package (ruvector-graph)

2025-11-25 23:11:54 +00:00

workflows

hailo: NPU pipeline pool exploration + bridge cache/health parity (iter 234-249) (#418)

2026-05-04 09:56:26 -04:00