docs(adr): ADR-176 EPIC — wire HEF into HailoEmbedder for NPU acceleration (iter 158)

Six-phase EPIC covering the remaining Rust integration to make NPU
acceleration the production-default after the iter 156b/157
breakthrough (HEF compiled + validated at 73.4 FPS on real hardware):

  P0 — Pi dev environment           [done — iter 152]
  P1 — HEF loading + vstreams       [iter 158-159]
  P2 — Host-side embedding lookup   [iter 160]
  P3 — End-to-end pipeline compose  [iter 161]
  P4 — HailoEmbedder dispatch       [iter 162]
  P5 — Pi hardware validation       [iter 163-164]
  P6 — ADR finalization             [iter 165]

Scoped as an EPIC because the runtime path is six distinct concerns
that can't fit in a single commit without going past 500 LOC; each
iter-step is small but they nest. Tracking as one EPIC prevents
"looks done but actually broken" partial wire-ups.

Acceptance criteria: ≥5× throughput vs cpu-fallback (iter-149
baseline of 7/sec → ≥35/sec single-worker on Pi 5), cosine >0.95
between HEF and cpu-fallback outputs, clippy clean both feature
combos.

Loop-worker plan: self-paced iterations, one phase deliverable each;
snags loop before advancing.

Co-Authored-By: claude-flow <ruv@ruv.net>
ruvnet 2026-05-03 15:03:06 -04:00
parent 2ba399fbed
commit 98ab2ae7e7


@ -0,0 +1,243 @@
---
adr: 176
title: "EPIC — Wire HEF into HailoEmbedder for NPU-accelerated production embeddings"
status: in-progress
date: 2026-05-03
authors: [ruvnet, claude-flow]
related: [ADR-167, ADR-172, ADR-173, ADR-175]
---
# ADR-176 — EPIC: Wire HEF into HailoEmbedder for NPU-accelerated embeddings
## Status
**In progress** as of iter 158 (2026-05-03). Iter 156b/157 produced
and validated the encoder HEF on real hardware (73.4 FPS on
cognitum-v0). This EPIC tracks the remaining Rust integration work
to make NPU acceleration the production-default embedding path.
## Why this is an EPIC, not a single iteration
Iters 133-157 cleared the HEF compile blocker, but the runtime path
still spans six distinct concerns that cannot land in one commit
without exceeding 500 LOC:
1. HEF artifact provenance + deploy plumbing
2. HailoRT FFI surface (HEF loading, vstreams, dequantize)
3. Host-side embedding lookup (candle `BertEmbeddings` or hand-rolled)
4. End-to-end pipeline composition (tokenize → embed → NPU → pool → L2)
5. `HailoEmbedder` integration (HEF takes precedence over cpu-fallback)
6. Hardware validation + benchmark vs cpu-fallback baseline
Each is small on its own, but they nest: item 5 needs items 1-4 to
land first, and item 6 needs item 5. Tracking them as one EPIC
prevents the "looks done but actually broken" failure mode that would
follow from merging a partial wire-up.
## Phases
### Phase P0 — Pi development environment
**Done.** Iter 152 ran `install.sh` on cognitum-v0; HEF runs at
73.4 FPS via `hailortcli run`. `/dev/hailo0` accessible to the
`ruvector-worker` group via the udev rule.
### Phase P1 — HEF loading + vstream creation (Rust)
**New module**: `crates/ruvector-hailo/src/hef_pipeline.rs` (or
extend `inference.rs`). Surfaces:
```rust
#[cfg(feature = "hailo")]
pub struct HefPipeline {
    device: Arc<HailoDevice>,   // shared with HailoEmbedder
    network_group: hailort_sys::hailo_configured_network_group,
    input_vstream: hailort_sys::hailo_input_vstream,
    output_vstream: hailort_sys::hailo_output_vstream,
    input_quant: QuantInfo,     // scale + zero-point for input
    output_quant: QuantInfo,    // scale + zero-point for output
    input_shape: [usize; 3],    // (1, seq=128, hidden=384)
    output_shape: [usize; 3],   // (1, seq=128, hidden=384)
}

impl HefPipeline {
    pub fn open(device: &HailoDevice, hef_path: &Path) -> Result<Self>;
    pub fn forward(&mut self, input: &[f32]) -> Result<Vec<f32>>;
    pub fn input_dim(&self) -> usize;
    pub fn output_dim(&self) -> usize;
}
```
**FFI surface needed** (already in `hailort-sys` via bindgen):
- `hailo_create_hef` — load `.hef` from disk
- `hailo_configure_vdevice` — bind HEF to vdevice → network groups
- `hailo_get_network_groups` — pick `minilm_encoder`
- `hailo_create_input_vstreams` / `hailo_create_output_vstreams`
- `hailo_get_input_vstream_info` / `hailo_get_output_vstream_info`
— quantization scale + zero-point per stream
- `hailo_vstream_write_raw_buffer` — push input
- `hailo_vstream_read_raw_buffer` — read output
- `hailo_release_*` — drop helpers
**Quantization handling**:
- Input is FP32 in our Rust API but UINT8 to the NPU. Quantize
`out_u8 = clip(round(in_f32 / scale + zero_point), 0, 255)`.
- Output is UINT8 from the NPU but FP32 in our Rust API.
Dequantize `out_f32 = scale * (in_u8 - zero_point)`.
- Scale/zero-point come from vstream info at HEF-load time.
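The two formulas above can be sketched as a pair of pure functions. This is a minimal illustration, not the crate's implementation: `QuantParams` is a hypothetical stand-in for the ADR's `QuantInfo`, and the values would come from the vstream info rather than being hard-coded.

```rust
/// Per-stream affine quantization parameters (hypothetical stand-in for the
/// ADR's `QuantInfo`; real values would be read from the vstream info).
#[derive(Clone, Copy, Debug)]
pub struct QuantParams {
    pub scale: f32,
    pub zero_point: f32,
}

/// FP32 -> UINT8: out_u8 = clip(round(in_f32 / scale + zero_point), 0, 255).
pub fn quantize(input: &[f32], q: QuantParams) -> Vec<u8> {
    input
        .iter()
        .map(|&x| (x / q.scale + q.zero_point).round().clamp(0.0, 255.0) as u8)
        .collect()
}

/// UINT8 -> FP32: out_f32 = scale * (in_u8 - zero_point).
pub fn dequantize(input: &[u8], q: QuantParams) -> Vec<f32> {
    input
        .iter()
        .map(|&b| q.scale * (f32::from(b) - q.zero_point))
        .collect()
}
```

Values outside the representable range saturate at 0 or 255 instead of wrapping, which is why the clamp happens before the cast to `u8`.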
**Tests**: smoke test that uses a fixed-bytes input and checks the
output shape + dim. Skipped on `cargo test --no-default-features`.
### Phase P2 — Host-side embedding lookup
**Why**: the iter-156 ONNX export removed the `Gather` embedding
lookup so the NPU graph is just the encoder block. The host has to
do `input_ids → embeddings` before pushing to NPU.
**Two possible implementations**:
A. **Reuse candle's `BertEmbeddings`**: factor out the embedding
layer from `cpu_embedder.rs`. Candle handles the position +
token-type embedding sums and LayerNorm. ~60 LOC of refactor.
B. **Hand-rolled embedding lookup**: read the embedding tables
directly from `model.safetensors` (word_embeddings,
position_embeddings, token_type_embeddings, LayerNorm gamma/beta)
and do the math without candle. ~150 LOC; avoids the candle
runtime overhead per call.
**Recommendation**: Start with (A) for speed of implementation. If
profiling shows the lookup is >20% of end-to-end latency, swap to
(B). The lookup is mostly memory bandwidth (table-fetch + add) so
SIMD doesn't matter much.
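For a sense of what option (B) involves, here is a rough sketch of the lookup math on plain `Vec<f32>` tables. The `EmbeddingTables` struct, its field names, and the flat row-major layout are assumptions made for illustration; loading from `model.safetensors` is left out, and token type 0 is assumed for every token (single-segment input).

```rust
/// Hypothetical flat, row-major embedding tables; in the ADR these would be
/// read from `model.safetensors`.
pub struct EmbeddingTables {
    pub hidden: usize,
    pub word: Vec<f32>,       // [vocab, hidden]
    pub position: Vec<f32>,   // [max_seq, hidden]
    pub token_type: Vec<f32>, // [type_vocab, hidden]
    pub ln_gamma: Vec<f32>,   // [hidden]
    pub ln_beta: Vec<f32>,    // [hidden]
    pub ln_eps: f32,
}

impl EmbeddingTables {
    /// Returns row-major [seq, hidden]:
    /// LayerNorm(word[id] + position[pos] + token_type[0]).
    /// Panics if an id or position is out of range for the tables.
    pub fn lookup(&self, input_ids: &[u32]) -> Vec<f32> {
        let h = self.hidden;
        let mut out = Vec::with_capacity(input_ids.len() * h);
        for (pos, &id) in input_ids.iter().enumerate() {
            let id = id as usize;
            let w = &self.word[id * h..(id + 1) * h];
            let p = &self.position[pos * h..(pos + 1) * h];
            let t = &self.token_type[..h]; // assumption: type id 0 everywhere
            // Sum the three embeddings for this token.
            let row: Vec<f32> = (0..h).map(|j| w[j] + p[j] + t[j]).collect();
            // LayerNorm over the hidden dimension.
            let mean = row.iter().sum::<f32>() / h as f32;
            let var = row.iter().map(|x| (x - mean) * (x - mean)).sum::<f32>() / h as f32;
            let inv_std = 1.0 / (var + self.ln_eps).sqrt();
            out.extend((0..h).map(|j| (row[j] - mean) * inv_std * self.ln_gamma[j] + self.ln_beta[j]));
        }
        out
    }
}
```

The inner loop is three table reads and two adds per element, which matches the observation that the lookup is dominated by memory bandwidth.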
### Phase P3 — End-to-end pipeline composition
**New struct in `cpu_embedder.rs` or sibling**:
```rust
pub struct HefEmbedder {
    embeddings: BertEmbeddings, // host-side (from model.safetensors)
    pipeline: HefPipeline,      // NPU forward pass
    tokenizer: Tokenizer,
    output_dim: usize,
    max_seq: usize,
}

impl HefEmbedder {
    pub fn open(hef_path: &Path, model_dir: &Path) -> Result<Self>;
    pub fn embed(&mut self, text: &str) -> Result<Vec<f32>>;
}
```
`embed()` flow:
1. Tokenize `text` → `(input_ids, attention_mask)` (HF tokenizer)
2. Pad to seq=128
3. Compute embeddings host-side: `embed_table[input_ids] + position_embed + type_embed`, then LayerNorm. Output shape `[1, 128, 384]` FP32.
4. Push embeddings to `pipeline.forward()` → output `[1, 128, 384]` FP32 (post-dequant).
5. Mean-pool over seq dim weighted by `attention_mask` (existing
`inference::mean_pool`).
6. L2-normalize (existing `inference::l2_normalize`).
7. Return `Vec<f32>` of length 384.
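Steps 5 and 6 are the only post-NPU math on the host. The sketch below shows one plausible shape for them; the function names mirror the existing `inference::mean_pool` and `inference::l2_normalize`, but the signatures here are assumptions and may not match the crate.

```rust
/// Attention-mask-weighted mean over the sequence dimension.
/// `hidden_states` is row-major [seq, hidden]; `mask` has one entry per token.
/// `hidden` must be non-zero.
pub fn mean_pool(hidden_states: &[f32], mask: &[u32], hidden: usize) -> Vec<f32> {
    let mut pooled = vec![0.0f32; hidden];
    let mut count = 0.0f32;
    for (row, &m) in hidden_states.chunks_exact(hidden).zip(mask) {
        if m != 0 {
            for (acc, &v) in pooled.iter_mut().zip(row) {
                *acc += v;
            }
            count += 1.0;
        }
    }
    // Guard against an all-zero mask so we never divide by zero.
    let denom = count.max(1.0);
    pooled.iter_mut().for_each(|v| *v /= denom);
    pooled
}

/// Scales `v` in place to unit L2 norm; a zero vector is left untouched.
pub fn l2_normalize(v: &mut [f32]) {
    let norm = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm > 0.0 {
        v.iter_mut().for_each(|x| *x /= norm);
    }
}
```

Padded positions (mask 0) are excluded from both the sum and the divisor, so padding to seq=128 does not dilute the pooled vector.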
### Phase P4 — `HailoEmbedder` integration
Modify `HailoEmbedder::open` (`crates/ruvector-hailo/src/lib.rs`):
```text
priority order at open():
  1. If --features hailo AND model_dir contains model.hef:
       use HefEmbedder (NPU acceleration)
  2. Else if --features cpu-fallback AND model_dir contains
     model.safetensors:
       use CpuEmbedder (host CPU)
  3. Else:
       open(NoModelLoaded); health probe still serves
```
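The priority order can be expressed as a small pure function, which keeps the selection logic testable without hardware. This is a sketch under assumptions: `BackendChoice` and `choose_backend` are hypothetical names, and in the real crate the two feature booleans would presumably come from `cfg!(feature = "hailo")` and `cfg!(feature = "cpu-fallback")`.

```rust
/// Which embedding backend `open()` should construct (hypothetical type).
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum BackendChoice {
    Hef,
    Cpu,
    NoModelLoaded,
}

/// Applies the priority order: HEF first, then cpu-fallback, then no model.
/// `has_hef` / `has_safetensors` report whether the file exists in model_dir.
pub fn choose_backend(
    hailo_enabled: bool,
    cpu_fallback_enabled: bool,
    has_hef: bool,
    has_safetensors: bool,
) -> BackendChoice {
    if hailo_enabled && has_hef {
        BackendChoice::Hef
    } else if cpu_fallback_enabled && has_safetensors {
        BackendChoice::Cpu
    } else {
        BackendChoice::NoModelLoaded
    }
}
```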
`embed()` dispatch:
```text
1. self.hef_embedder.as_ref()?.embed(text)
2. self.cpu_fallback.as_ref()?.embed(text)
3. Err(NoModelLoaded)
```
`has_model()` returns `true` if either is loaded.
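One way the dispatch and `has_model()` could look in Rust is sketched below. The `Backend` trait, `Dispatcher` struct, `EmbedError` enum, and `FixedBackend` stub are all hypothetical; they stand in for `HefEmbedder`, `CpuEmbedder`, and the crate's real error type. The sketch uses `as_mut()` because `HefEmbedder::embed` takes `&mut self`.

```rust
#[derive(Debug, PartialEq)]
pub enum EmbedError {
    NoModelLoaded,
}

/// Minimal interface shared by both backends (hypothetical).
pub trait Backend {
    fn embed(&mut self, text: &str) -> Result<Vec<f32>, EmbedError>;
}

/// Test stub that always returns the same vector.
pub struct FixedBackend(pub Vec<f32>);

impl Backend for FixedBackend {
    fn embed(&mut self, _text: &str) -> Result<Vec<f32>, EmbedError> {
        Ok(self.0.clone())
    }
}

pub struct Dispatcher {
    pub hef: Option<Box<dyn Backend>>,
    pub cpu: Option<Box<dyn Backend>>,
}

impl Dispatcher {
    /// True if either backend is loaded.
    pub fn has_model(&self) -> bool {
        self.hef.is_some() || self.cpu.is_some()
    }

    /// HEF takes precedence; cpu-fallback is used only when no HEF is loaded.
    pub fn embed(&mut self, text: &str) -> Result<Vec<f32>, EmbedError> {
        if let Some(hef) = self.hef.as_mut() {
            return hef.embed(text);
        }
        if let Some(cpu) = self.cpu.as_mut() {
            return cpu.embed(text);
        }
        Err(EmbedError::NoModelLoaded)
    }
}
```

Note that in this sketch a runtime error from the HEF backend is returned as-is and does not fall through to the CPU backend; whether the real integration should fall through is a design decision the ADR leaves open.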
`compute_fingerprint` (cluster) already handles both layouts (iter
143). It needs to be extended to cover `model.hef` in addition to
`model.safetensors`: a worker running with both must get a
fingerprint distinct from a worker running only safetensors, because
different code paths produce different vectors and the cluster
should refuse to mix them.
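The property that matters is that adding `model.hef` changes the fingerprint while file ordering does not. The sketch below illustrates that property only: `fingerprint` is a hypothetical function, and FNV-1a is a dependency-free stand-in, not the hash the existing `compute_fingerprint` uses.

```rust
/// FNV-1a 64-bit step, used here only as a dependency-free stand-in hash.
fn fnv1a(state: u64, bytes: &[u8]) -> u64 {
    bytes
        .iter()
        .fold(state, |h, &b| (h ^ u64::from(b)).wrapping_mul(0x0000_0100_0000_01b3))
}

/// Fingerprint over every model artifact present, as (file name, contents).
/// Entries are sorted by name so the result does not depend on input order.
pub fn fingerprint(artifacts: &[(&str, &[u8])]) -> u64 {
    let mut sorted = artifacts.to_vec();
    sorted.sort_by(|a, b| a.0.cmp(b.0));
    let mut h = 0xcbf2_9ce4_8422_2325u64; // FNV-1a offset basis
    for (name, bytes) in sorted {
        h = fnv1a(h, name.as_bytes());
        h = fnv1a(h, &[0]); // separator between name and payload
        h = fnv1a(h, &(bytes.len() as u64).to_le_bytes());
        h = fnv1a(h, bytes);
    }
    h
}
```

Mixing the file name and length into the hash keeps two different artifact sets with the same concatenated bytes from producing the same fingerprint.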
### Phase P5 — Hardware validation + benchmark
On cognitum-v0:
1. Stop the systemd `ruvector-hailo-worker`
2. Cross-build worker with `--features hailo,cpu-fallback`
3. Drop `model.hef` into `/var/lib/ruvector-hailo/models/all-minilm-l6-v2/`
alongside the existing safetensors trio
4. Restart the systemd unit
5. Verify the iter-145 startup self-test embed completes
(proves the HEF path runs end-to-end on hardware)
6. Run `cluster-bench --workers cognitum-v0:50051 --concurrency 4
--duration-secs 30` and capture:
- throughput vs cpu-fallback (expect 5-10× improvement)
- p50 / p99 latency vs cpu-fallback
7. Verify output vectors are semantically similar to cpu-fallback
(cosine similarity >0.95 on a fixed sentence corpus — small
accuracy loss is expected from INT8 quantization but the
ordering must hold)
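The check in step 7 reduces to a cosine similarity per sentence and an average over the corpus. A minimal sketch follows; `cosine_similarity` and `mean_cosine` are hypothetical helpers, and the vectors would come from running the same sentences through both backends.

```rust
/// Cosine similarity of two equal-length vectors.
/// Returns None for mismatched lengths, empty input, or a zero vector.
pub fn cosine_similarity(a: &[f32], b: &[f32]) -> Option<f32> {
    if a.len() != b.len() || a.is_empty() {
        return None;
    }
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return None;
    }
    Some(dot / (norm_a * norm_b))
}

/// Average similarity over (HEF vector, cpu-fallback vector) pairs.
pub fn mean_cosine(pairs: &[(Vec<f32>, Vec<f32>)]) -> Option<f32> {
    if pairs.is_empty() {
        return None;
    }
    let mut total = 0.0f32;
    for (hef, cpu) in pairs {
        total += cosine_similarity(hef, cpu)?;
    }
    Some(total / pairs.len() as f32)
}
```

Because both backends already L2-normalize their output, the similarity is effectively the dot product; computing the norms anyway keeps the check valid if normalization is ever skipped.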
### Phase P6 — ADR-176 finalization
Update this ADR with measured numbers, mark status `accepted`. Update
ADR-167 status table. Update ADR-175 to mark Option A as the
production path. Update worker README and env.example.
## Acceptance criteria
This EPIC is "complete and validated" when:
1. `cargo build --features hailo,cpu-fallback --bin ruvector-hailo-worker`
succeeds on Pi 5
2. `systemctl start ruvector-hailo-worker` boots cleanly with HEF
3. Iter-145 self-test embed prints success in journald
4. ruvllm-bridge → cluster → Pi worker returns a real semantic
vector (validated as in iter 149)
5. `cluster-bench` measures ≥5× throughput improvement vs iter-149
cpu-fallback baseline (7.0 / sec → ≥35 / sec single-worker)
6. Cosine similarity between HEF-produced and cpu-fallback-produced
vectors on a 5-sentence test corpus averages >0.95
7. `cargo clippy --all-targets -- -D warnings` is clean under both
feature combos
## Iteration plan (loop-worker driven)
Each loop-worker iteration tackles one tightly-scoped chunk:
| Iter | Phase | Concrete deliverable |
|---|---|---|
| 158 | P1 | hef_pipeline.rs scaffold + HEF load + vstream open |
| 159 | P1 | vstream read/write + quantize/dequantize |
| 160 | P2 | BertEmbeddings extracted from cpu_embedder |
| 161 | P3 | HefEmbedder struct, end-to-end embed() |
| 162 | P4 | HailoEmbedder dispatch + has_model + tests |
| 163 | P5 | Pi 5 deploy + cluster-bench measurement |
| 164 | P5 | Cosine similarity verification vs cpu-fallback |
| 165 | P6 | Finalize ADR-176 + update related ADRs |
The loop self-paces: if one iteration's deliverable hits a snag
(e.g., a HailoRT API behaves differently than the docs suggest), the
loop iterates on it before moving on.
## References
- ADR-167 — original ruvector-hailo embedding backend design
- ADR-175 — Rust-side workarounds for HEF SDK bugs
- ADR-176 — this EPIC (in-progress)
- iter 156b commit — `ffa3e90a6` HEF compiled
- iter 157 commit — `2ba399fbe` NPU forward pass validated at 73.4 FPS
- HEF artifact: 15.7 MB,
sha256 `cdbc892765d3099f74723ee6c28ab3f0daade2358827823ba08d2969b07ebd40`