mirror of https://github.com/ruvnet/RuVector.git synced 2026-07-09 17:28:42 +00:00

ruvector-mmwave

Streaming parser for the Seeed MR60BHA2 60 GHz radar's UART protocol. Pure-Rust, no_std-compatible, zero-allocation hot path.

Status: library, 11 unit tests + proptest fuzz suite passing. Shared between the host-side ruvector-mmwave-bridge (parses serial input → emits NL events → cluster embed RPC) and the iter-115 firmware self-test that runs on the radar's MCU directly.

Why a separate crate

ADR-178 Gap H: keeping the parser separate from ruvector-hailo-cluster (the bridge's home crate) means a regression on either side surfaces independently. The parser is byte-for-byte deterministic against fuzzed inputs; the bridge layers transport, TLS, and fingerprinting on top.

Wire format (Seeed MR60BHA2 v0.3)

8-byte header  | variable payload | trailing checksum
[0x01]         | <up to 64 bytes>  | invert_xor(payload)
[frame_id_hi]
[frame_id_lo]
[length_hi]    ← 16-bit big-endian payload length
[length_lo]
[type_hi]      ← 16-bit big-endian frame type
[type_lo]
[invert_xor of 7 prior bytes]
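
The layout above can be sketched as a small encoder. This is an illustration only, not code from the crate: the names `invert_xor` and `encode_frame` are hypothetical, and the checksum is assumed to be "XOR all bytes, then flip every bit", which is how the `invert_xor(...)` labels in the diagram read.

```rust
/// Inverted XOR: XOR every byte together, then flip all bits.
/// (Assumed meaning of `invert_xor` in the wire-format diagram.)
fn invert_xor(bytes: &[u8]) -> u8 {
    !bytes.iter().fold(0u8, |acc, b| acc ^ b)
}

/// Build one frame: 8-byte header, payload, trailing payload checksum.
/// Returns `None` when the payload exceeds the 64-byte cap.
/// Hypothetical helper for illustration; the crate itself only parses.
fn encode_frame(frame_id: u16, frame_type: u16, payload: &[u8]) -> Option<Vec<u8>> {
    const MAX_PAYLOAD: usize = 64;
    if payload.len() > MAX_PAYLOAD {
        return None;
    }
    let mut out = Vec::with_capacity(8 + payload.len() + 1);
    out.push(0x01); // start-of-frame marker
    out.extend_from_slice(&frame_id.to_be_bytes()); // frame_id_hi, frame_id_lo
    out.extend_from_slice(&(payload.len() as u16).to_be_bytes()); // length_hi, length_lo
    out.extend_from_slice(&frame_type.to_be_bytes()); // type_hi, type_lo
    let header_checksum = invert_xor(&out); // covers the 7 bytes pushed so far
    out.push(header_checksum);
    out.extend_from_slice(payload);
    out.push(invert_xor(payload)); // trailing checksum covers the payload only
    Some(out)
}
```

Under these assumptions, `encode_frame(0, 0x0A05, &[16])` yields the ten bytes `01 00 00 00 01 0A 05 F0 10 EF`: the header XORs to `0x0F`, which inverts to `0xF0`, and the single payload byte `0x10` inverts to `0xEF`.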

Frame types currently parsed:

`frame_type`	meaning	payload shape
`0x0A05`	breathing rate	`[bpm: u8]`
`0x0A06`	heart rate	`[bpm: u8]`
`0x0A14`	nearest target distance	`[cm: u16 BE]`
`0x0F09`	presence flag	`[present: bool]`
anything else	`Event::Unknown { frame_type, payload_len }`	(iter 249) `payload_len` is `u16`
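
To make the table concrete, here is a minimal sketch of mapping a `(frame_type, payload)` pair to a typed value. The `Decoded` enum is a local stand-in that mirrors the event names used in this README, not the crate's own `Event` type. Treating a non-zero presence byte as `true` and rejecting a known frame type with the wrong payload length are both assumptions.

```rust
use std::convert::TryFrom;

/// Local stand-in for the crate's `Event` (illustrative only).
#[derive(Debug, PartialEq)]
enum Decoded {
    Breathing { bpm: u8 },
    HeartRate { bpm: u8 },
    Distance { cm: u16 },
    Presence { present: bool },
    Unknown { frame_type: u16, payload_len: u16 },
}

/// Decode one already-checksummed payload according to the table above.
/// Returns `None` for a known frame type whose payload has the wrong shape
/// (an assumption; the real parser may handle this case differently).
fn decode(frame_type: u16, payload: &[u8]) -> Option<Decoded> {
    Some(match (frame_type, payload) {
        (0x0A05, [bpm]) => Decoded::Breathing { bpm: *bpm },
        (0x0A06, [bpm]) => Decoded::HeartRate { bpm: *bpm },
        // Distance is a 16-bit big-endian value.
        (0x0A14, [hi, lo]) => Decoded::Distance { cm: u16::from_be_bytes([*hi, *lo]) },
        // Assumed encoding: any non-zero byte means "present".
        (0x0F09, [flag]) => Decoded::Presence { present: *flag != 0 },
        (0x0A05 | 0x0A06 | 0x0A14 | 0x0F09, _) => return None,
        // Everything else is reported with a u16 length; the conversion
        // saturates instead of truncating.
        (other, p) => Decoded::Unknown {
            frame_type: other,
            payload_len: u16::try_from(p.len()).unwrap_or(u16::MAX),
        },
    })
}
```

For example, `decode(0x0A14, &[0x01, 0x2C])` gives `Distance { cm: 300 }`, and an unrecognized type with a 60-byte payload reports `payload_len: 60`.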

API surface

use ruvector_mmwave::{Event, Mr60Parser};

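// One parser instance carries the streaming state across calls.
let mut p = Mr60Parser::new();
// Placeholder input: substitute bytes read from the UART (e.g. /dev/ttyUSB0).
let frame: &[u8] = &[];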
p.feed_slice(frame, |ev| match ev {
    Event::Breathing { bpm } => println!("breathing {} bpm", bpm),
    Event::HeartRate { bpm } => println!("heart rate {} bpm", bpm),
    Event::Distance { cm } => println!("distance {} cm", cm),
    Event::Presence { present } => println!("present={}", present),
    Event::Unknown { frame_type, payload_len } => {
        eprintln!("unknown frame 0x{:04x} len={}", frame_type, payload_len);
    }
    Event::ChecksumError => eprintln!("dropped frame, parser resynced"),
    Event::Resync { skipped } => eprintln!("desync, dropped {} bytes", skipped),
});

The closure signature is FnMut(Event); the parser invokes it zero or more times per byte fed. The state machine resyncs cleanly on a checksum failure or an unexpected SOF (start-of-frame byte), so no manual reset is needed.
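
The callback-driven, self-resyncing design can be illustrated with a toy state machine. This is a simplified sketch and not the crate's implementation: `ToyParser` and `ToyEvent` are hypothetical names, the checksum is assumed to be an inverted XOR, and the resync policy here (drop the buffered bytes and hunt for the next `0x01`) is the simplest possible one.

```rust
const HEADER: usize = 8;
const MAX_PAYLOAD: usize = 64;
const SOF: u8 = 0x01;

/// Events emitted by the toy parser (stand-in, not the crate's `Event`).
#[derive(Debug, PartialEq)]
pub enum ToyEvent {
    Frame { frame_type: u16, payload_len: u16 },
    ChecksumError,
}

fn invert_xor(bytes: &[u8]) -> u8 {
    !bytes.iter().fold(0u8, |acc, b| acc ^ b)
}

/// Fixed-size buffer, so feeding bytes never allocates.
pub struct ToyParser {
    buf: [u8; HEADER + MAX_PAYLOAD + 1],
    len: usize,
}

impl ToyParser {
    pub fn new() -> Self {
        Self { buf: [0; HEADER + MAX_PAYLOAD + 1], len: 0 }
    }

    /// Feed any number of bytes; `on_event` fires once per completed or rejected frame.
    pub fn feed_slice<F: FnMut(ToyEvent)>(&mut self, bytes: &[u8], mut on_event: F) {
        for &b in bytes {
            self.feed_byte(b, &mut on_event);
        }
    }

    fn feed_byte<F: FnMut(ToyEvent)>(&mut self, b: u8, on_event: &mut F) {
        // While hunting for a frame start, ignore everything except SOF.
        if self.len == 0 && b != SOF {
            return;
        }
        self.buf[self.len] = b;
        self.len += 1;
        if self.len < HEADER {
            return;
        }
        let payload_len = u16::from_be_bytes([self.buf[3], self.buf[4]]) as usize;
        if self.len == HEADER {
            // Validate the header before trusting its length field.
            if invert_xor(&self.buf[..7]) != self.buf[7] || payload_len > MAX_PAYLOAD {
                on_event(ToyEvent::ChecksumError);
                self.len = 0; // resync: start hunting for SOF again
            }
            return;
        }
        if self.len < HEADER + payload_len + 1 {
            return; // payload or trailing checksum still incomplete
        }
        let payload = &self.buf[HEADER..HEADER + payload_len];
        let event = if invert_xor(payload) == self.buf[HEADER + payload_len] {
            ToyEvent::Frame {
                frame_type: u16::from_be_bytes([self.buf[5], self.buf[6]]),
                payload_len: payload_len as u16,
            }
        } else {
            ToyEvent::ChecksumError
        };
        on_event(event);
        self.len = 0; // ready for the next frame without a manual reset
    }
}
```

Because the callback is `FnMut`, a caller can capture mutable state, for example `p.feed_slice(chunk, |ev| events.push(ev))`. A frame split across several `feed_slice` calls is reassembled because the partial frame lives in the parser, not in the caller.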

Benchmarks

Run cargo bench -p ruvector-mmwave for the full criterion sweep. Steady-state on cognitum-v0 (Pi 5):

~3.2 GB/s feed rate on feed_slice with all-recognized frames (most expensive event type)
~7.1 GB/s on the no-event-emitted path (waiting for SOF)
Zero allocations per byte fed — the buffer is fixed at 64 bytes, see MAX_PAYLOAD const.

For typical UART rates (115200 baud → at most ~14 KB/s, or ~11.5 KB/s with 8N1 framing), the parser cost is < 0.001% of one core.
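
The arithmetic behind that estimate can be checked with a short sketch. The function names are hypothetical, and the model assumes parser cost scales linearly with bytes fed, using the ~3.2 GB/s figure quoted above as the single-core ceiling.

```rust
/// Bytes per second carried by a UART at `baud`, given how many line bits
/// each data byte costs (8 if framing is ignored, 10 for 8N1: start + 8 data + stop).
fn uart_bytes_per_sec(baud: u32, bits_per_byte: u32) -> f64 {
    f64::from(baud) / f64::from(bits_per_byte)
}

/// Fraction of one core consumed, assuming the parser's cost is linear in
/// bytes fed and `parser_bytes_per_sec` is its single-core throughput.
fn core_fraction(uart_bytes_per_sec: f64, parser_bytes_per_sec: f64) -> f64 {
    uart_bytes_per_sec / parser_bytes_per_sec
}
```

With 115200 baud and 8 bits per byte this gives 14 400 B/s; dividing by 3.2 × 10⁹ B/s gives 4.5 × 10⁻⁶, or about 0.00045% of a core, which is below the 0.001% bound. With 8N1 framing the input rate drops to 11 520 B/s and the fraction is smaller still.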

Property tests

tests/tokenizer_proptest.rs (proptest v1) feeds:

arbitrary-length byte strings to verify the parser never panics
frames with corrupted checksums to verify clean Resync
frames with valid headers but truncated payloads to verify the state machine waits without emitting