docs(adr): ADR-176 EPIC — wire HEF into HailoEmbedder for NPU acceleration (iter 158)

Six-phase EPIC covering the remaining Rust integration to make NPU
acceleration the production-default after the iter 156b/157 breakthrough
(HEF compiled + validated at 73.4 FPS on real hardware):

  P0  Pi dev environment            [done, iter 152]
  P1  HEF loading + vstreams        [iter 158-159]
  P2  Host-side embedding lookup    [iter 160]
  P3  End-to-end pipeline compose   [iter 161]
  P4  HailoEmbedder dispatch        [iter 162]
  P5  Pi hardware validation        [iter 163-164]
  P6  ADR finalization              [iter 165]

Scoped as an EPIC because the runtime path is six distinct concerns that
can't fit in a single commit without going past 500 LOC; each iter-step
is small but they nest. Tracking as one EPIC prevents "looks done but
actually broken" partial wire-ups.

Acceptance criteria: ≥5× throughput vs cpu-fallback (iter-149 baseline
of 7/sec → ≥35/sec single-worker on Pi 5), cosine >0.95 between HEF and
cpu-fallback outputs, clippy clean both feature combos.

Loop-worker plan: self-paced iterations, one phase deliverable each;
snags loop before advancing.

Co-Authored-By: claude-flow <ruv@ruv.net>
2026-05-29 19:33:34 +00:00 · 2026-05-03 15:03:06 -04:00 · 2026-05-03 15:03:06 -04:00 · 98ab2ae7e7
commit 98ab2ae7e7
parent 2ba399fbed
1 changed files with 243 additions and 0 deletions
--- a/docs/adr/ADR-176-hef-integration-epic.md
+++ b/docs/adr/ADR-176-hef-integration-epic.md
@@ -0,0 +1,243 @@
+---
+adr: 176
+title: "EPIC — Wire HEF into HailoEmbedder for NPU-accelerated production embeddings"
+status: in-progress
+date: 2026-05-03
+authors: [ruvnet, claude-flow]
+related: [ADR-167, ADR-172, ADR-173, ADR-175]
+---
+
+# ADR-176 — EPIC: Wire HEF into HailoEmbedder for NPU-accelerated embeddings
+
+## Status
+
+**In progress** as of iter 158 (2026-05-03). Iter 156b/157 produced
+and validated the encoder HEF on real hardware (73.4 FPS on
+cognitum-v0). This EPIC tracks the remaining Rust integration work
+to make NPU acceleration the production-default embedding path.
+
+## Why this is an EPIC, not a single iteration
+
+Iters 133-157 cleared the HEF compile blocker, but the runtime path
+still spans six distinct concerns that cannot land in one commit
+without going past 500 LOC:
+
+1. HEF artifact provenance + deploy plumbing
+2. HailoRT FFI surface (HEF loading, vstreams, dequantize)
+3. Host-side embedding lookup (candle `BertEmbeddings` or hand-rolled)
+4. End-to-end pipeline composition (tokenize → embed → NPU → pool → L2)
+5. `HailoEmbedder` integration (HEF takes precedence over cpu-fallback)
+6. Hardware validation + benchmark vs cpu-fallback baseline
+
+Each concern is small on its own, but they nest: concern 5 needs
+concerns 1-4 to land first, and concern 6 needs concern 5. Tracking
+them as one EPIC prevents the "looks done but actually broken" failure
+mode that would follow from merging a partial wire-up.
+
+## Phases
+
+### Phase P0 — Pi development environment
+
+**Done.** Iter 152 ran `install.sh` on cognitum-v0; HEF runs at
+73.4 FPS via `hailortcli run`. `/dev/hailo0` accessible to the
+`ruvector-worker` group via the udev rule.
+
+### Phase P1 — HEF loading + vstream creation (Rust)
+
+**New module**: `crates/ruvector-hailo/src/hef_pipeline.rs` (or
+extend `inference.rs`). Surfaces:
+
+```rust
+#[cfg(feature = "hailo")]
+pub struct HefPipeline {
+    device: Arc<HailoDevice>,        // shared with HailoEmbedder
+    network_group: hailort_sys::hailo_configured_network_group,
+    input_vstream: hailort_sys::hailo_input_vstream,
+    output_vstream: hailort_sys::hailo_output_vstream,
+    input_quant: QuantInfo,           // scale + zero-point for input
+    output_quant: QuantInfo,          // scale + zero-point for output
+    input_shape: [usize; 3],          // (1, seq=128, hidden=384)
+    output_shape: [usize; 3],         // (1, seq=128, hidden=384)
+}
+
+impl HefPipeline {
+    pub fn open(device: &HailoDevice, hef_path: &Path) -> Result<Self>;
+    pub fn forward(&mut self, input: &[f32]) -> Result<Vec<f32>>;
+    pub fn input_dim(&self) -> usize;
+    pub fn output_dim(&self) -> usize;
+}
+```
+
+**FFI surface needed** (already in `hailort-sys` via bindgen):
+- `hailo_create_hef` — load `.hef` from disk
+- `hailo_configure_vdevice` — bind HEF to vdevice → network groups
+- `hailo_get_network_groups` — pick `minilm_encoder`
+- `hailo_create_input_vstreams` / `hailo_create_output_vstreams`
+- `hailo_get_input_vstream_info` / `hailo_get_output_vstream_info`
+  — quantization scale + zero-point per stream
+- `hailo_vstream_write_raw_buffer` — push input
+- `hailo_vstream_read_raw_buffer` — read output
+- `hailo_release_*` — drop helpers
+
+**Quantization handling**:
+- Input is FP32 in our Rust API but UINT8 to the NPU. Quantize
+  `out_u8 = clip(round(in_f32 / scale + zero_point), 0, 255)`.
+- Output is UINT8 from the NPU but FP32 in our Rust API.
+  Dequantize `out_f32 = scale * (in_u8 - zero_point)`.
+- Scale/zero-point come from vstream info at HEF-load time.
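The two formulas above can be sketched as a pair of pure functions. This is a minimal illustration, not the committed implementation: the field names on `QuantInfo` and the free functions `quantize` / `dequantize` are assumptions made for this example, since the ADR only states that scale and zero-point come from vstream info.

```rust
/// Per-stream affine quantization parameters.
/// Field names are hypothetical; the ADR only names the type `QuantInfo`.
#[derive(Debug, Clone, Copy)]
pub struct QuantInfo {
    pub scale: f32,
    pub zero_point: f32,
}

/// FP32 -> UINT8: clip(round(x / scale + zero_point), 0, 255).
pub fn quantize(input: &[f32], q: QuantInfo) -> Vec<u8> {
    input
        .iter()
        // Clamp before the cast so out-of-range values saturate at 0 or 255.
        .map(|&x| (x / q.scale + q.zero_point).round().clamp(0.0, 255.0) as u8)
        .collect()
}

/// UINT8 -> FP32: scale * (x - zero_point).
pub fn dequantize(input: &[u8], q: QuantInfo) -> Vec<f32> {
    input
        .iter()
        .map(|&x| q.scale * (f32::from(x) - q.zero_point))
        .collect()
}
```

With `scale = 0.5` and `zero_point = 10.0`, the input `1.0` quantizes to `12` and dequantizes back to `1.0`, while values far outside the representable range saturate at `0` or `255`.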
+
+**Tests**: smoke test that uses a fixed-bytes input and checks the
+output shape + dim. Skipped on `cargo test --no-default-features`.
+
+### Phase P2 — Host-side embedding lookup
+
+**Why**: the iter-156 ONNX export removed the `Gather` embedding
+lookup so the NPU graph is just the encoder block. The host has to
+do `input_ids → embeddings` before pushing to NPU.
+
+**Two possible implementations**:
+
+A. **Reuse candle's `BertEmbeddings`**: factor out the embedding
+   layer from `cpu_embedder.rs`. Candle handles the position +
+   token-type embedding sums and LayerNorm. ~60 LOC of refactor.
+
+B. **Hand-rolled embedding lookup**: read the embedding tables
+   directly from `model.safetensors` (word_embeddings,
+   position_embeddings, token_type_embeddings, LayerNorm gamma/beta)
+   and do the math without candle. ~150 LOC; avoids the candle
+   runtime overhead per call.
+
+**Recommendation**: Start with (A) for speed of implementation. If
+profiling shows the lookup takes >20% of end-to-end latency, switch to
+(B). The lookup is memory-bandwidth bound (table fetch + add), so
+SIMD is unlikely to help much.
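For a sense of what option (B) involves, here is a rough sketch of the lookup math over flat in-memory tables. Everything in it is an assumption for illustration: the `EmbeddingTables` struct, its field names, and the fixed token type 0 are hypothetical, and a real version would populate the tables from `model.safetensors`.

```rust
/// Hypothetical in-memory embedding tables, row-major with `hidden` columns.
pub struct EmbeddingTables {
    pub hidden: usize,
    pub word: Vec<f32>,       // [vocab, hidden]
    pub position: Vec<f32>,   // [max_seq, hidden]
    pub token_type: Vec<f32>, // [type_vocab, hidden]
    pub ln_gamma: Vec<f32>,   // [hidden]
    pub ln_beta: Vec<f32>,    // [hidden]
    pub ln_eps: f32,
}

impl EmbeddingTables {
    /// Returns a flat [seq, hidden] buffer: LayerNorm(word + position + type).
    /// Assumes token type 0 for every position and that all ids are in range.
    pub fn lookup(&self, input_ids: &[u32]) -> Vec<f32> {
        let h = self.hidden;
        let mut out = Vec::with_capacity(input_ids.len() * h);
        for (pos, &id) in input_ids.iter().enumerate() {
            let id = id as usize;
            let w = &self.word[id * h..(id + 1) * h];
            let p = &self.position[pos * h..(pos + 1) * h];
            let t = &self.token_type[..h];
            // Sum the three embeddings for this position.
            let row: Vec<f32> = (0..h).map(|i| w[i] + p[i] + t[i]).collect();
            // LayerNorm over the hidden axis (biased variance, as in BERT).
            let mean = row.iter().sum::<f32>() / h as f32;
            let var = row.iter().map(|x| (x - mean).powi(2)).sum::<f32>() / h as f32;
            let inv_std = 1.0 / (var + self.ln_eps).sqrt();
            out.extend(
                row.iter()
                    .enumerate()
                    .map(|(i, x)| (x - mean) * inv_std * self.ln_gamma[i] + self.ln_beta[i]),
            );
        }
        out
    }
}
```

The inner loop is a handful of table reads and adds per element, which is consistent with the expectation that the lookup is dominated by memory traffic rather than arithmetic.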
+
+### Phase P3 — End-to-end pipeline composition
+
+**New struct in `cpu_embedder.rs` or sibling**:
+
+```rust
+pub struct HefEmbedder {
+    embeddings: BertEmbeddings,   // host-side (from model.safetensors)
+    pipeline: HefPipeline,         // NPU forward pass
+    tokenizer: Tokenizer,
+    output_dim: usize,
+    max_seq: usize,
+}
+
+impl HefEmbedder {
+    pub fn open(hef_path: &Path, model_dir: &Path) -> Result<Self>;
+    pub fn embed(&mut self, text: &str) -> Result<Vec<f32>>;
+}
+```
+
+`embed()` flow:
+1. Tokenize `text` → `(input_ids, attention_mask)` (HF tokenizer)
+2. Pad to seq=128
+3. Compute embeddings host-side: `embed_table[input_ids] + position_embed + type_embed`, then LayerNorm. Output shape `[1, 128, 384]` FP32.
+4. Push embeddings to `pipeline.forward()` → output `[1, 128, 384]` FP32 (post-dequant).
+5. Mean-pool over the seq dim, weighted by `attention_mask` (existing
+   `inference::mean_pool`).
+6. L2-normalize (existing `inference::l2_normalize`).
+7. Return `Vec<f32>` of length 384.
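Steps 5 and 6 of the flow above could look roughly like the following. The ADR says `inference::mean_pool` and `inference::l2_normalize` already exist, but their signatures are not shown here, so the signatures below are assumptions chosen for this sketch.

```rust
/// Attention-mask-weighted mean over the sequence axis.
/// `hidden_states` is a flat [seq, hidden] buffer; `mask` has one entry
/// per sequence position (non-zero = real token, 0 = padding).
/// Assumes `hidden > 0`.
pub fn mean_pool(hidden_states: &[f32], mask: &[u32], hidden: usize) -> Vec<f32> {
    let mut pooled = vec![0.0f32; hidden];
    let mut count = 0.0f32;
    for (row, &m) in hidden_states.chunks_exact(hidden).zip(mask) {
        if m != 0 {
            for (acc, &x) in pooled.iter_mut().zip(row) {
                *acc += x;
            }
            count += 1.0;
        }
    }
    // Guard against an all-padding input to avoid dividing by zero.
    let denom = count.max(1.0);
    pooled.iter_mut().for_each(|v| *v /= denom);
    pooled
}

/// Scales `v` in place to unit L2 norm; leaves an all-zero vector untouched.
pub fn l2_normalize(v: &mut [f32]) {
    let norm = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm > 0.0 {
        v.iter_mut().for_each(|x| *x /= norm);
    }
}
```

Padding rows are skipped entirely, so whatever the NPU writes into positions beyond the real tokens does not leak into the pooled vector.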
+
+### Phase P4 — `HailoEmbedder` integration
+
+Modify `HailoEmbedder::open` (`crates/ruvector-hailo/src/lib.rs`):
+
+```text
+priority order at open():
+  1. If --features hailo AND model_dir contains model.hef:
+       use HefEmbedder (NPU acceleration)
+  2. Else if --features cpu-fallback AND model_dir contains
+       model.safetensors:
+       use CpuEmbedder (host CPU)
+  3. Else:
+       open(NoModelLoaded) — health probe still serves
+```
+
+`embed()` dispatch:
+```text
+1. if hef_embedder is loaded:      hef_embedder.embed(text)
+2. else if cpu_fallback is loaded: cpu_fallback.embed(text)
+3. else:                           Err(NoModelLoaded)
+```
+
+`has_model()` returns `true` if either is loaded.
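The priority order can be sketched with two optional backends. This is an illustration under stated assumptions: the `Backend` trait, the `Dispatcher` struct, the `EmbedError` enum and the toy `Constant` backend are all hypothetical stand-ins, not the actual `HailoEmbedder` fields or error type.

```rust
#[derive(Debug, PartialEq)]
pub enum EmbedError {
    NoModelLoaded,
}

/// Stand-in for the two embedding backends (hypothetical trait).
pub trait Backend {
    fn embed(&mut self, text: &str) -> Result<Vec<f32>, EmbedError>;
}

/// Holds whichever backends were found at open() time.
pub struct Dispatcher {
    pub hef: Option<Box<dyn Backend>>,
    pub cpu: Option<Box<dyn Backend>>,
}

impl Dispatcher {
    /// True if either backend is loaded.
    pub fn has_model(&self) -> bool {
        self.hef.is_some() || self.cpu.is_some()
    }

    /// HEF first, then cpu-fallback, then NoModelLoaded.
    pub fn embed(&mut self, text: &str) -> Result<Vec<f32>, EmbedError> {
        if let Some(hef) = self.hef.as_mut() {
            return hef.embed(text);
        }
        if let Some(cpu) = self.cpu.as_mut() {
            return cpu.embed(text);
        }
        Err(EmbedError::NoModelLoaded)
    }
}

/// Toy backend returning a constant vector so the dispatch order is observable.
pub struct Constant(pub f32);

impl Backend for Constant {
    fn embed(&mut self, _text: &str) -> Result<Vec<f32>, EmbedError> {
        Ok(vec![self.0])
    }
}
```

Note that `as_mut()` is used rather than `as_ref()` because `HefEmbedder::embed` takes `&mut self`. In this sketch an error from the HEF backend is returned as-is and does not fall through to the CPU path; whether the real dispatch should retry on CPU is a design decision the ADR does not state.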
+
+`compute_fingerprint` (cluster) already handles both layouts (iter
+143). It needs extending to cover `model.hef` ⊕ `model.safetensors`:
+a worker running with both must get a fingerprint distinct from a
+worker running only safetensors. Different code paths mean different
+vectors, so the cluster should refuse to mix them.
+
+### Phase P5 — Hardware validation + benchmark
+
+On cognitum-v0:
+1. Stop the systemd `ruvector-hailo-worker`
+2. Cross-build worker with `--features hailo,cpu-fallback`
+3. Drop `model.hef` into `/var/lib/ruvector-hailo/models/all-minilm-l6-v2/`
+   alongside the existing safetensors trio
+4. Restart the systemd unit
+5. Verify the iter-145 startup self-test embed completes
+   (proves the HEF path runs end-to-end on hardware)
+6. Run `cluster-bench --workers cognitum-v0:50051 --concurrency 4
+    --duration-secs 30` and capture:
+   - throughput vs cpu-fallback (expect 5-10× improvement)
+   - p50 / p99 latency vs cpu-fallback
+7. Verify output vectors are semantically similar to cpu-fallback
+   (cosine similarity >0.95 on a fixed sentence corpus; a small
+   accuracy loss is expected from INT8 quantization, but the
+   similarity ordering must hold)
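The similarity check in step 7 amounts to a per-sentence cosine followed by an average over the corpus. The sketch below assumes the two sets of vectors have already been collected in matching order; the function names `cosine_similarity` and `mean_cosine` are hypothetical helpers for this example.

```rust
/// Cosine similarity of two equal-length vectors; 0.0 if either is all zeros.
pub fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "vectors must have equal length");
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return 0.0;
    }
    dot / (norm_a * norm_b)
}

/// Average cosine over paired HEF / cpu-fallback vectors (same sentence order).
pub fn mean_cosine(hef: &[Vec<f32>], cpu: &[Vec<f32>]) -> f32 {
    assert_eq!(hef.len(), cpu.len(), "corpora must have equal length");
    assert!(!hef.is_empty(), "corpus must not be empty");
    let total: f32 = hef
        .iter()
        .zip(cpu)
        .map(|(h, c)| cosine_similarity(h, c))
        .sum();
    total / hef.len() as f32
}
```

A validation run would embed each corpus sentence through both paths and then check that `mean_cosine(&hef_vecs, &cpu_vecs) > 0.95`.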
+
+### Phase P6 — ADR-176 finalization
+
+Update this ADR with measured numbers, mark status `accepted`. Update
+ADR-167 status table. Update ADR-175 to mark Option A as the
+production path. Update worker README and env.example.
+
+## Acceptance criteria
+
+This EPIC is "complete and validated" when:
+
+1. `cargo build --features hailo,cpu-fallback --bin ruvector-hailo-worker`
+   succeeds on Pi 5
+2. `systemctl start ruvector-hailo-worker` boots cleanly with HEF
+3. Iter-145 self-test embed prints success in journald
+4. ruvllm-bridge → cluster → Pi worker returns a real semantic
+   vector (validated as in iter 149)
+5. `cluster-bench` measures ≥5× throughput improvement vs iter-149
+   cpu-fallback baseline (7.0 / sec → ≥35 / sec single-worker)
+6. Cosine similarity between HEF-produced and cpu-fallback-produced
+   vectors on a 5-sentence test corpus stays >0.95 average
+7. `cargo clippy --all-targets -- -D warnings` is clean under both
+   feature combos
+
+## Iteration plan (loop-worker driven)
+
+Each loop-worker iteration tackles one tightly-scoped chunk:
+
+| Iter | Phase | Concrete deliverable |
+|---|---|---|
+| 158 | P1 | hef_pipeline.rs scaffold + HEF load + vstream open |
+| 159 | P1 | vstream read/write + quantize/dequantize |
+| 160 | P2 | BertEmbeddings extracted from cpu_embedder |
+| 161 | P3 | HefEmbedder struct, end-to-end embed() |
+| 162 | P4 | HailoEmbedder dispatch + has_model + tests |
+| 163 | P5 | Pi 5 deploy + cluster-bench measurement |
+| 164 | P5 | Cosine similarity verification vs cpu-fallback |
+| 165 | P6 | Finalize ADR-176 + update related ADRs |
+
+The loop self-paces: if one iteration's deliverable hits a snag
+(e.g., a HailoRT API behaves differently from what the docs suggest),
+the loop iterates on it before moving on.
+
+## References
+
+- ADR-167 — original ruvector-hailo embedding backend design
+- ADR-175 — Rust-side workarounds for HEF SDK bugs
+- ADR-176 — this EPIC (in-progress)
+- iter 156b commit — `ffa3e90a6` HEF compiled
+- iter 157 commit — `2ba399fbe` NPU forward pass validated at 73.4 FPS
+- HEF artifact: 15.7 MB,
+  sha256 `cdbc892765d3099f74723ee6c28ab3f0daade2358827823ba08d2969b07ebd40`