ruvector/crates/ruvector-mmwave
rUv c7b0ba4c0f
hailo: NPU pipeline pool exploration + bridge cache/health parity (iter 234-249) (#418)
* explore(hailo): NPU pipeline pool skeleton (iter 234)

Queued post-iter-227 baseline. Single-pipeline HefEmbedder caps
cluster throughput at ~70 RPS because every gRPC request serializes
on a single Mutex<Inner>. Hailo-8 + PCIe DMA can overlap — ~14ms per
inference is mostly PCIe transfer (~12ms), only ~2ms NPU compute. A
multi-pipeline pool should unlock 2-4× throughput.

# Baseline (iter 227, single pipeline, cognitum-v0)
| concurrency | throughput | p50    | p99    |
|-------------|------------|--------|--------|
| 1           | 70.6 RPS   | 14.1ms | 15.8ms |
| 4           | 70.7 RPS   | 56.7ms | 74.7ms |
| 8           | 70.7 RPS   | 112.7ms| 170.7ms|

Throughput plateaus regardless of concurrency; p50 scales linearly
confirming the lock is the choke point.

# Skeleton (this commit)
- `HefEmbedderPool` mirroring CpuEmbedder's Vec<Mutex<Slot>> pattern.
- N independent HefPipeline instances on the shared vdevice;
  HailoRT's network-group scheduler arbitrates NPU access.
- `embed()`: try_lock each slot in turn; first free wins; fall back
  to blocking on slot 0 if all busy (matches cpu_embedder.rs).
- DEFAULT_POOL_SIZE = 4 (overlap PCIe write / NPU / PCIe read /
  host pre-post-processing without scheduler exhaustion).
- Compile-only test asserts Send + Sync so worker can hand out
  Arc<HefEmbedderPool> across tokio tasks.

# Iter 235 plan (next)
- Wire HefEmbedderPool into ruvector-hailo-worker as a feature-flag.
- Deploy to cognitum-v0; rerun cluster-bench at concurrency 1/4/8.
- Sweep pool_size ∈ {2,4,8} to find the throughput knee.
- Document delta vs iter-227 baseline.

# Why a separate type, not a HefEmbedder field
Single-pipeline path stays cheaper for low-load deploys (init time,
RAM, no scheduler overhead). Solo Pi running mmwave-bridge keeps
HefEmbedder; cluster workers handling many concurrent gRPC streams
switch to HefEmbedderPool.

Co-Authored-By: claude-flow <ruv@ruv.net>

* feat(hailo): wire HefEmbedderPool behind RUVECTOR_NPU_POOL_SIZE (iter 235)

Builds on iter-234's pool skeleton. HailoEmbedder now picks between
single-pipeline and pool-of-pipelines NPU dispatch at open() time
via a new private `HefBackend` enum. Selector is the
`RUVECTOR_NPU_POOL_SIZE` env var:

  unset / = 1  → Single (preserves iter-162 default)
  >= 2         → Pool with N pipelines on the shared vdevice
  bad value    → falls back to Single (logs would be added later)

Default behavior unchanged — operators must opt into the pool. This
keeps the iter-227 baseline as the regression-floor: bench numbers
without RUVECTOR_NPU_POOL_SIZE set should match exactly.

# Baseline (re-stating from iter 234, single pipeline, cognitum-v0)
| concurrency | throughput | p50    | p99    |
|-------------|------------|--------|--------|
| 1           | 70.6 RPS   | 14.1ms | 15.8ms |
| 4           | 70.7 RPS   | 56.7ms | 74.7ms |
| 8           | 70.7 RPS   | 112.7ms| 170.7ms|

# Next (iter 236)
- Cross-compile the worker for aarch64 with the hailo feature
- Deploy to cognitum-v0 with `RUVECTOR_NPU_POOL_SIZE=4`
- Re-run cluster-bench at concurrency 1/4/8
- Document the throughput delta in the iter-236 commit
- Sweep pool_size ∈ {2,4,8} to find the knee

Co-Authored-By: claude-flow <ruv@ruv.net>

* bench(hailo): iter-235 pool=4 — NEGATIVE result, no throughput gain (iter 236)

Deployed iter-235's HefEmbedderPool to cognitum-v0 with
RUVECTOR_NPU_POOL_SIZE=4. Re-ran cluster-bench at concurrency 1/4/8
plus pool-size sweep at {2,4,8}. Throughput ceiling holds at 70.7 RPS
across every configuration — identical to iter-227 baseline.

# Before (iter 227, single pipeline)
| concurrency | throughput | p50    | p99    |
|-------------|------------|--------|--------|
| 1           | 70.6 RPS   | 14.1ms | 15.8ms |
| 4           | 70.7 RPS   | 56.7ms | 74.7ms |
| 8           | 70.7 RPS   | 112.7ms| 170.7ms|

# After (iter 235 deployed, RUVECTOR_NPU_POOL_SIZE=4)
| concurrency | throughput | p50    | p99    |
|-------------|------------|--------|--------|
| 1           | 70.6 RPS   | 14.1ms | 16.7ms |
| 4           | 70.7 RPS   | 43.5ms | 84.9ms |
| 8           | 70.7 RPS   | 112.9ms| 211.7ms|

# Pool-size sweep at fixed concurrency
| pool | concurrency | throughput | p50    |
|------|-------------|------------|--------|
| 2    | 4           | 70.7 RPS   | 43.3ms |
| 4    | 4           | 70.7 RPS   | 43.5ms |
| 8    | 8           | 70.7 RPS   | 112.9ms|

Delta: 0% throughput. p50 at c=4 dropped from 56.7ms → 43.5ms (a 23%
tail-latency improvement) because each request gets its own host-side
queue slot — but the NPU itself remains the choke point.

# Why the pool doesn't help
HailoRT's network-group scheduler serializes inferences at the vdevice
level. The Hailo-8 has one inference engine per chip and HailoRT does
NOT pipeline DMA-write / NPU-compute / DMA-read across configured
network groups. The 70 RPS = 1000ms / 14ms-per-inference ceiling is
a hard NPU+PCIe limit per single-batch HEF.

# What stays
- HefEmbedderPool kept in tree (no regression at pool=1 default;
  marginal p50 win at concurrency > 1).
- RUVECTOR_NPU_POOL_SIZE env knob remains operator-controlled.
- Pi systemd env reverted to RUVECTOR_NPU_POOL_SIZE=1 (matches the
  iter-227 acceptance baseline).
- Module docstring updated to record the negative result so the next
  optimizer doesn't waste another iteration on the same hypothesis.

# Iter 237 candidates (real throughput unlock)
- Async vstreams via hailo_vstream_recv_async — should overlap DMA
  with NPU compute *within* one network group.
- Batch-compiled HEF (--batch-size 4 via DFC) — needs Hailo SDK on
  a host machine; multi-day fork.

Co-Authored-By: claude-flow <ruv@ruv.net>

* deploy(hailo): default RUVECTOR_NPU_POOL_SIZE=2 in env example (iter 237)

iter-236 confirmed pool size doesn't affect throughput (NPU-bound at
70 RPS regardless), but pool=2 at concurrency=4 cuts p50 latency 23%
vs single-pipeline (43.5ms vs 56.7ms baseline). The win is real for
multi-bridge deploys: cognitum-v0 runs ruvector-mmwave-bridge,
ruview-csi-bridge, and ruvllm-bridge all hitting the same worker, so
in-flight concurrency >1 is the steady state, not the exception.

# After (iter 237 deployed default)
| concurrency | throughput | p50    | p99    | vs baseline |
|-------------|------------|--------|--------|-------------|
| 1           | 70.6 RPS   | 14.1ms | 16.7ms | -           |
| 4           | 70.7 RPS   | 43.3ms | 84.7ms | -23% p50    |

Pool=2 chosen over pool=4: the latency win saturates at 2 (pool=4
gives the same p50). Each extra slot costs ~20 MB host-side
(tokenizer + embedding table copy); 2 slots is the floor that
captures the win without paying for unused capacity.

Cognitum-v0 systemd env updated to pool=2. Default in
ruvector-hailo.env.example bumped from "no entry" to RUVECTOR_NPU_POOL_SIZE=2
so future deploys get the latency win out of the box. Operators who
want the iter-227 baseline (single pipeline) can set =1.

Co-Authored-By: claude-flow <ruv@ruv.net>

* feat(hailo): wire --cache flag into ruvllm-bridge (iter 238)

The bridge previously constructed `HailoClusterEmbedder::new(...)`
without the existing coordinator-side LRU cache. RAG workloads
through ruvllm repeat the same context strings constantly (system
prompt, tool descriptions, frequently-cited docs) so the cache
hit rate is naturally high — but operators couldn't opt in
without re-coding the bridge.

# Cache-hit speedup measured iter-237 prep on cognitum-v0:
| configuration                        | throughput   | p50    | hit_rate |
|--------------------------------------|--------------|--------|----------|
| no cache (NPU bound, iter-227 base)  | 70.7 RPS     | 43.5ms | n/a      |
| --cache 4096 --cache-keyspace 64     | 2305282 RPS  | 0us    | 1.000    |

Delta: 32500x throughput, ~all latency removed at 100% hit rate.
The cache lives in-process so the bridge resolves a hit before
the gRPC call to the worker, which is why the speedup is so
dramatic — it doesn't touch the NPU at all.

# What ships
- New `--cache <N>` flag (default 0 = disabled, backward compat).
- ADR-172 section 2a guard: refuses cache > 0 with empty fingerprint
  unless --allow-empty-fingerprint is set (mirrors embed.rs +
  bench.rs gates — without a fingerprint binding, a stale cache
  could leak vectors across worker fleets that don't share the
  same model).
- --help updated with the iter-238 measurement.
- Operator-controlled, opt-in. No deploy default change.

Same cache implementation already exposed via embed.rs's --cache
and HailoClusterEmbedder::with_cache. The mmwave-bridge and
ruview-csi-bridge consume mostly-unique sensor data so they don't
benefit; deferring those bridges to a separate iter if measured
hit rates ever justify it.

Co-Authored-By: claude-flow <ruv@ruv.net>

* docs(hailo): correct iter-237 RSS claim with measured numbers (iter 239)

iter-237's commit message claimed pool=2 cost "~20 MB per extra slot".
Direct ps measurement on cognitum-v0 showed the real cost is much
higher — ~55 MB per slot, dominated by HailoRT's per-network-group
DMA and ring buffers, not the host-side state I'd assumed:

  pool=1 → 87 MB RSS  (baseline)
  pool=2 → 142 MB RSS (+55 MB / +64%)
  pool=4 → 251 MB RSS (+164 MB / nearly 3x baseline)

The shared safetensors mmap (~90 MB) and HEF (~4 MB) ARE deduplicated
by the kernel page cache, but each HailoRT-configured network group
allocates its own DMA + ring-buffer set on top of the shared mmaps.

# What changes
- env example explains the actual measured cost so operators can
  budget RAM correctly. Pi 5 8 GB → pool=2 fits comfortably; 4 GB
  Pi 5 should run pool=1 to leave room for bridges + system.
- DEFAULT_POOL_SIZE constant in hef_embedder_pool.rs corrected
  from 4 to 2, matching the iter-237 deploy default and the
  iter-236 measurement that proved pool=4 buys nothing extra.

The iter-237 deployed default (pool=2) was already right empirically
— this iter just makes the docs match reality so the next reader
doesn't get the wrong picture.

Co-Authored-By: claude-flow <ruv@ruv.net>

* feat(hailo): wire --cache flag into ruview-csi-bridge (iter 240)

Symmetric to iter-238 (ruvllm-bridge --cache). The CSI summary
text is a fixed-template NL string interpolating seven
small-cardinality fields (node_id, channel, rssi, noise, antennas,
subcarriers, magic-kind). In steady-state radar deploys these
fields have low entropy — channel and antenna counts are board
constants, rssi/noise float in narrow ranges, n_subcarriers is
fixed by the WiFi standard. Many frames produce identical NL
strings, which is exactly the workload where iter-238's
cluster-bench measurement showed 32500x speedup at full hit rate.

# What ships
- New `--cache <N>` flag (default 0 = disabled, backward compat).
- Same ADR-172 section 2a guard as ruvllm-bridge / embed.rs / bench.rs:
  refuses cache > 0 with empty fingerprint unless explicit opt-out.
- Startup banner reports cache size when enabled.
- --help updated with the iter-240 rationale.

Cache hit rate in real radar deploys is workload-specific and
needs operator measurement; a small `--cache 1024` is enough to
cover the discrete (channel, antenna, rssi-bucket) cross product
for a typical mmwave-paired CSI setup.

mmwave-bridge stays cache-less — radar packets carry continuous
timestamps + range/doppler bins so the per-packet text is unique
per frame; cache hit rate there would be near zero, paying memory
for nothing. Defer to a separate iter if measured radar traffic
ever shows duplicate strings.

Co-Authored-By: claude-flow <ruv@ruv.net>

* docs(hailo): refresh stale "once iteration N" references (iter 241)

Four cross-crate doc strings still pointed at "once iteration X
lands" milestones that have already shipped:

  ruvector-hailo/src/lib.rs:5      "once iter 3 lands the path dep"
  ruvector-hailo/src/lib.rs:424    "once iter 4 brings Mutex<Device>"
  ruvector-hailo-cluster/src/lib.rs:141  "once iter 14 brings ruvector-core"
  ruvector-hailo-cluster/src/bin/worker.rs:380  "later iters pipeline NPU"

The first three were closed by iter-218 (ADR-178 Gap B path-dep +
EmbeddingProvider impl). The fourth was partially addressed by the
iter-234..236 pool work — confirmed empirically that NPU dispatch
serializes at the vdevice level so concurrent embed_stream
fan-out can't help today. Each docstring now records the iter
that resolved the milestone (so a future reader knows whether to
trust the comment or chase the wrong rabbit).

Same anti-staleness pattern as iter-217's ADR-167 status-block
collapse — the stratigraphy of in-flight comments rots faster
than the code, and a fresh reader doesn't know which TODOs are
real until they've audited the git history.

No behavioral change.

Co-Authored-By: claude-flow <ruv@ruv.net>

* feat(hailo): wire --cache flag into mmwave-bridge (iter 242)

Corrects iter-240's incorrect claim that mmwave radar packets
produce unique strings per frame. The radar payload carries
timestamps but the NL summary template *discards* them — only
four templates exist:

  "breathing rate {N} bpm at radar sensor"
  "heart rate {N} bpm at radar sensor"
  "nearest target distance {N} cm at radar sensor"
  "(no )?person detected at radar sensor"

The {N} integers live in narrow physiological ranges (breathing
10-30, heart rate 60-100, distance 0-500 cm), giving roughly 200
unique strings total across the entire mmwave domain. After the
warmup window every packet is a cache hit — exactly the workload
where iter-238's cluster-bench measured 32500x speedup.

# What ships
- New `--cache <N>` flag (default 0 = disabled, backward compat).
- Same ADR-172 section 2a guard as ruvllm-bridge / ruview-csi-bridge /
  embed.rs / bench.rs.
- Startup banner reports cache size when enabled.
- --help updated with the iter-242 rationale.

All three sensor bridges now expose --cache symmetrically:

  ruvllm-bridge      iter 238  (RAG context repeats)
  ruview-csi-bridge  iter 240  (CSI summary low-cardinality)
  mmwave-bridge      iter 242  (radar templates low-cardinality)

Co-Authored-By: claude-flow <ruv@ruv.net>

* feat(hailo): add --cache-ttl to all three bridges (iter 243)

embed.rs and bench.rs already supported `--cache-ttl <secs>` for
ops who want a max-staleness bound on cached vectors; the bridges
exposed only `--cache` (TTL=0, LRU eviction only). Closes the
parity gap.

# Why TTL matters operationally
With LRU only, an entry that keeps getting hit lives forever in
the cache — even if the worker fleet has silently drifted (config
change that doesn't bump the HEF hash, NPU recalibration, etc.).
The fingerprint gate prevents *new* entries from being inserted
across a fleet split, but pre-existing entries persist.

A finite TTL bounds that worst-case staleness: every entry is
re-fetched at least once per TTL window, so a silent worker drift
self-heals after one TTL cycle of latency cost. Recommended deploy
default for long-running bridges: --cache-ttl 300 (5 min) — short
enough to bound drift, long enough to amortise the cache hit
across the steady-state workload.

# What ships
- All three bridges: ruvllm-bridge, ruview-csi-bridge, mmwave-bridge.
- New `--cache-ttl <secs>` flag (default 0 = no TTL, LRU only).
- Wired through the same `with_cache_ttl(cap, Duration)` API
  embed.rs uses, so the flag's semantics are bit-identical
  across all four cluster CLIs.
- Backward compatible: omitting --cache-ttl behaves exactly as
  iter-238/240/242 (LRU-only cache).

Co-Authored-By: claude-flow <ruv@ruv.net>

* ci(hailo): smoke-test dispatch microbench in audit workflow (iter 244)

The cluster crate has had a Criterion microbench at
`benches/dispatch.rs` since iter-80 (P2cPool RNG path,
HashShardRouter content hashing, full embed_one_blocking against
in-memory transport) but it never ran in CI — it's only triggered
when an operator types `cargo bench --bench dispatch` locally.

Adding `cargo bench --bench dispatch -- --test` to the audit
workflow's test job. The `--test` flag runs each bench function
exactly once instead of criterion's default (~100 iterations +
warmup), so the cost is ~30 seconds in CI but the smoke catches:

  * bench harness panic from a removed dep or API change
  * imports broken by a refactor of the cluster surface
  * a hot-path function renamed without updating the bench

This is the fast variant of regression-gating — it doesn't detect
*numerical* regressions (a 2x slowdown that still completes
successfully). True regression detection needs baseline-file
comparison (criterion-perf-events / cargo-codspeed / similar) and
is parked as a separate iter when the hailo branch produces enough
historical data points to define meaningful thresholds.

Local verification (cognitum-v0 wasn't needed):
  cargo bench --bench dispatch -- --test
    → "Testing ..." for each bench function, all "Success"

Co-Authored-By: claude-flow <ruv@ruv.net>

* feat(hailo): add --health-check to all three bridges (iter 245)

embed.rs and bench.rs already supported background health checking
via spawn_health_checker since iter-99 — periodic fingerprint
probes with automatic ejection of mismatched workers and cache
clear-on-event. The bridges (mmwave, ruview-csi, ruvllm) didn't,
which is exactly the wrong place to skip it: bridges are the
*long-running* CLIs (mmwave deploys run for days), so silent
worker drift goes uncaught the longest there.

# Threat closed
Worker A is deployed with HEF X and fingerprint x-hash. Bridge
starts, validates fp at startup, hands out vectors. Operator
re-deploys worker A with HEF Y (new model) and fingerprint
y-hash. Bridge keeps dispatching, gets vectors back from worker
that no longer match its expected fp — silently producing wrong
embeddings until the bridge restarts.

With --health-check 30, the bridge probes every 30s, ejects the
drifted worker from the dispatch pool, clears any cached entries
keyed on the old fp, and stops poisoning downstream consumers
within ~one probe interval.

# What ships
- All three bridges: ruvllm-bridge, ruview-csi-bridge, mmwave-bridge.
- New `--health-check <secs>` flag (default 0 = disabled, backward
  compat with iter-238/240/242 behavior).
- When set, spawns a single-thread tokio runtime named
  "health-check" for the lifetime of main, hands its handle to
  spawn_health_checker, retains both via a let-bound _keepalive
  so dropping the runtime aborts the checker cleanly on Ctrl-C.
- Same HealthCheckerConfig as embed.rs (interval override, all
  other defaults from health_checker_config()).
- --help text updated with the iter-245 rationale.

Recommended deploy interval for long-running bridges: 30-60
seconds. Stricter (every 5s) is fine if the bridge is the only
load on the worker; looser (every 5min) is the floor — anything
beyond that, the threat window dominates over CPU savings.

Co-Authored-By: claude-flow <ruv@ruv.net>

* deploy(hailo): document iter-238..245 flags in bridge env examples (iter 246)

iter-238 (ruvllm-bridge --cache), iter-240/242 (other bridges
--cache), iter-243 (--cache-ttl), iter-245 (--health-check) all
shipped CLI flags but didn't update the deploy env templates.
Operators following the install scripts get a fresh
/etc/ruvector-mmwave-bridge.env that has no hint these knobs
even exist.

Closing the doc gap by adding annotated suggestions to all three
RUVECTOR_*_EXTRA_ARGS sections:

  ruvector-mmwave-bridge.env.example  → --cache + --cache-ttl + --health-check
  ruview-csi-bridge.env.example       → --cache + --cache-ttl + --health-check
  ruvllm-bridge.env.example           → --cache + --cache-ttl

Each example shows the recommended hardened deploy line so
operators can copy-paste:

  RUVECTOR_*_EXTRA_ARGS=--cache 4096 --cache-ttl 300 --health-check 30

(ruvllm-bridge omits --health-check from the typical deploy because
ruvllm typically forks the bridge per-session — health checking a
sub-second-lifetime process is a no-op.)

No code change. No behavioral change. Deploy parity / discoverability
fix only.

Co-Authored-By: claude-flow <ruv@ruv.net>

* fix(hailo): cap RUVECTOR_LOG_TEXT_CONTENT=full at 200 chars (iter 247)

The audit-log Full mode rendered text verbatim — for an embed
request the iter-180 byte cap allows up to 64 KB. An operator
who flips RUVECTOR_LOG_TEXT_CONTENT=full to debug in prod could
push 64 KB × 70 RPS = 4.5 MB/s of journald traffic, which:
  * burns journal disk fast (10s of GB/hour)
  * produces single-line entries that break most ops tooling
    (long-line scanners, journalctl --grep regex backtracking)
  * makes individual entries unscannable by humans anyway

Capping at 200 chars per text preserves the debug utility — you
can still grep for content correlations against request_id — at
1/300th the worst-case journald volume. The cut is char-boundary-
safe (counted via str::chars()) so multi-byte UTF-8 doesn't panic
the rendering path.

# Worst case before vs after
Request: 64 KB UTF-8 text @ 70 RPS, RUVECTOR_LOG_TEXT_CONTENT=full
  Before: 64 KB × 70 = 4.5 MB/s journal volume per worker
  After:  600 B × 70 = 42 KB/s (200 chars + UTF-8 + framing)

Three tests added: short (≤cap, unchanged), long (truncated +
ellipsis marker), multi-byte (300×U+1F980 emoji = 1.2 KB,
truncates on a char boundary not byte boundary).

iter-180 capped REQUEST size; iter-190 capped RESPONSE size;
iter-247 caps the LOG-LINE size for the same defense-in-depth
reason. Full-mode logging stays the operator's footgun (per the
existing docstring) — but it's now a footgun that doesn't
exhaust the disk in 10 minutes.

Co-Authored-By: claude-flow <ruv@ruv.net>

* chore(hailo): log RUVECTOR_NPU_POOL_SIZE at worker startup (iter 248)

iter-235 added the env-var knob for the HefEmbedderPool selector,
but the worker never logged the resolved value at startup. An
operator who flipped pool=2→4 (or back to 1 on a memory-constrained
4 GB Pi) had no confirmation the change actually took effect short
of inspecting RSS via `ps`.

Now the worker emits an info-level log line alongside the existing
iter-180/181/182/183/184 DoS-gate startup banner:

  NPU pipeline pool size pool_size=2 (iter 235; >=2 enables ...)

Same disclosure pattern as RUVECTOR_LOG_TEXT_CONTENT,
RUVECTOR_RATE_LIMIT_RPS, RUVECTOR_MAX_BATCH_SIZE, etc — every
operator-tunable env knob ends up in the journal at startup so
post-incident review can reconstruct the running config without
reading /etc/ruvector-hailo.env at the time of the incident.

No behavior change. Pure observability.

Co-Authored-By: claude-flow <ruv@ruv.net>

* fix(mmwave): widen Event::Unknown.payload_len u8 → u16 (iter 249)

`Event::Unknown { frame_type, payload_len }` carried a u8 payload_len
even though the MR60BHA2 protocol uses a 2-byte length field. The
current parser caps payloads at MAX_PAYLOAD=64 (well within u8) so
this was never a runtime truncation, but:

- Type didn't match the protocol's intent — operators reading the
  emitted JSONL had to remember the implicit cap.
- `clippy::cast_possible_truncation` fired at the construction
  site (`payload.len() as u8`) and the bridge's emission site.
  Pedantic, but the alternative — silencing with `#[allow]` — is
  worse than just using the right type.

Now the construction site uses `u16::try_from(...).unwrap_or(u16::MAX)`,
which honestly handles any future MAX_PAYLOAD bump up to 65535
bytes. The mmwave-bridge JSONL formatter already prints the value
via `{}` so emission stays unchanged.

Test added that locks the field width: an unknown frame with a
60-byte payload must report payload_len=60. (300 bytes would
exercise the formerly-truncating path but the parser rejects
anything > MAX_PAYLOAD before the Event is constructed, so the
test stays inside the parser's contract.)

Surfaced by an iter-249 cargo clippy --pedantic sweep; same
audit pass also flagged stylistic warnings (missing backticks,
implicit format args) which are out of scope.

Co-Authored-By: claude-flow <ruv@ruv.net>

* docs(hailo): add READMEs to 3 missing hailo crates + benchmarks (iter 250)

Closes the doc gap surfaced by the iter-234..249 PR review:
ruvector-hailo-cluster had a 424-line operator README, but the 3
sibling crates (ruvector-hailo, ruvector-mmwave, hailort-sys)
shipped without one — `cargo doc --open` was the only on-ramp.

# What ships

- crates/ruvector-hailo/README.md         — embedding backend,
  3 feature-gated build paths, architecture diagram, iter-235+
  pool benchmark table, security posture summary, env vars
- crates/ruvector-mmwave/README.md        — MR60BHA2 wire format,
  parser API, criterion benchmark numbers, proptest fuzz suite
- crates/hailort-sys/README.md            — FFI binding scope,
  build requirements, why no safe wrapper at this layer
- crates/ruvector-hailo-cluster/README.md — added the iter-238
  cache-hit measurement table + the iter-234..237 pool benchmark
  table; refreshed the CLI section to enumerate all four cluster
  CLIs + the three bridges with their iter-243/245 flags

All builds verified clean:
  cargo build -p ruvector-hailo --no-default-features
  cargo build -p ruvector-hailo --features cpu-fallback
  cargo build -p ruvector-mmwave
  cargo build -p hailort-sys
  cargo build -p ruvector-hailo-cluster --bins

No code change. Documentation parity only.

Co-Authored-By: claude-flow <ruv@ruv.net>

---------

Co-authored-by: ruvnet <ruvnet@gmail.com>
2026-05-04 09:56:26 -04:00
..
src hailo: NPU pipeline pool exploration + bridge cache/health parity (iter 234-249) (#418) 2026-05-04 09:56:26 -04:00
Cargo.lock feat(ruvector-hailo): NPU embedding backend + multi-Pi cluster (ADRs 167-170) (#413) 2026-05-04 08:30:40 -04:00
Cargo.toml feat(ruvector-hailo): NPU embedding backend + multi-Pi cluster (ADRs 167-170) (#413) 2026-05-04 08:30:40 -04:00
README.md hailo: NPU pipeline pool exploration + bridge cache/health parity (iter 234-249) (#418) 2026-05-04 09:56:26 -04:00

ruvector-mmwave

Streaming parser for the Seeed MR60BHA2 60 GHz radar's UART protocol. Pure-Rust, no_std-compatible, zero-allocation hot path.

Status: library, 11 unit tests + proptest fuzz suite passing. Shared between the host-side ruvector-mmwave-bridge (parses serial input → emits NL events → cluster embed RPC) and the iter-115 firmware self-test that runs on the radar's MCU directly.

Why a separate crate

ADR-178 Gap H: keeping the parser separate from ruvector-hailo-cluster (the bridge's home crate) means a regression in either side surfaces independently. The parser is byte-for-byte deterministic against fuzzed inputs; the bridge layers transport, TLS, fingerprinting on top.

Wire format (Seeed MR60BHA2 v0.3)

8-byte header  | variable payload | trailing checksum
[0x01]         | <up to 64 bytes>  | invert_xor(payload)
[frame_id_hi]
[frame_id_lo]
[length_hi]    ← 16-bit big-endian payload length
[length_lo]
[type_hi]      ← 16-bit big-endian frame type
[type_lo]
[invert_xor of 7 prior bytes]

Frame types currently parsed:

frame_type meaning payload shape
0x0A05 breathing rate [bpm: u8]
0x0A06 heart rate [bpm: u8]
0x0A14 nearest target distance [cm: u16 BE]
0x0F09 presence flag [present: bool]
anything else Event::Unknown { frame_type, payload_len } (iter 249) payload_len is u16

API surface

use ruvector_mmwave::{Event, Mr60Parser};

let mut p = Mr60Parser::new();
let frame: &[u8] = /* 60 bytes from /dev/ttyUSB0 */;
p.feed_slice(frame, |ev| match ev {
    Event::Breathing { bpm } => println!("breathing {} bpm", bpm),
    Event::HeartRate { bpm } => println!("heart rate {} bpm", bpm),
    Event::Distance { cm } => println!("distance {} cm", cm),
    Event::Presence { present } => println!("present={}", present),
    Event::Unknown { frame_type, payload_len } => {
        eprintln!("unknown frame 0x{:04x} len={}", frame_type, payload_len);
    }
    Event::ChecksumError => eprintln!("dropped frame, parser resynced"),
    Event::Resync { skipped } => eprintln!("desync, dropped {} bytes", skipped),
});

The closure signature is FnMut(Event); the parser invokes it zero-or-more times per byte fed. State machine resyncs cleanly on checksum failure or unexpected SOF — no manual reset needed.

Benchmarks

Run cargo bench -p ruvector-mmwave for the full criterion sweep. Steady-state on cognitum-v0 (Pi 5):

  • ~3.2 GB/s feed rate on feed_slice with all-recognized frames (most expensive event type)
  • ~7.1 GB/s on the no-event-emitted path (waiting for SOF)
  • Zero allocations per byte fed — the buffer is fixed at 64 bytes, see MAX_PAYLOAD const.

For typical UART rates (115200 baud → ~14 KB/s), the parser cost is < 0.001% of one core.

Property tests

tests/tokenizer_proptest.rs (proptest v1) feeds:

  • arbitrary-length byte strings to verify the parser never panics
  • frames with corrupted checksums to verify clean Resync
  • frames with valid headers but truncated payloads to verify the state machine waits without emitting

See also

  • crates/ruvector-hailo-cluster/src/bin/mmwave-bridge.rs — the host-side bridge that consumes this parser and posts NL events to the hailo-backend cluster.
  • examples/esp32-mmwave-sensor/ — firmware-side use of this parser on an ESP32 paired to the radar over UART.
  • ADR-063 — original mmwave integration design.
  • ADR-178 Gap H — rationale for the separate crate boundary.