From 98ab2ae7e76ddd21d4f8da7a0787201270625e2e Mon Sep 17 00:00:00 2001 From: ruvnet Date: Sun, 3 May 2026 15:03:06 -0400 Subject: [PATCH] =?UTF-8?q?docs(adr):=20ADR-176=20EPIC=20=E2=80=94=20wire?= =?UTF-8?q?=20HEF=20into=20HailoEmbedder=20for=20NPU=20acceleration=20(ite?= =?UTF-8?q?r=20158)?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Six-phase EPIC covering the remaining Rust integration to make NPU acceleration the production-default after the iter 156b/157 breakthrough (HEF compiled + validated at 73.4 FPS on real hardware): P0 — Pi dev environment [done — iter 152] P1 — HEF loading + vstreams [iter 158-159] P2 — Host-side embedding lookup [iter 160] P3 — End-to-end pipeline compose [iter 161] P4 — HailoEmbedder dispatch [iter 162] P5 — Pi hardware validation [iter 163-164] P6 — ADR finalization [iter 165] Scoped as an EPIC because the runtime path is six distinct concerns that can't fit in a single commit without going past 500 LOC; each iter-step is small but they nest. Tracking as one EPIC prevents "looks done but actually broken" partial wire-ups. Acceptance criteria: ≥5× throughput vs cpu-fallback (iter-149 baseline of 7/sec → ≥35/sec single-worker on Pi 5), cosine >0.95 between HEF and cpu-fallback outputs, clippy clean both feature combos. Loop-worker plan: self-paced iterations, one phase deliverable each; snags loop before advancing. Co-Authored-By: claude-flow --- docs/adr/ADR-176-hef-integration-epic.md | 243 +++++++++++++++++++++++ 1 file changed, 243 insertions(+) create mode 100644 docs/adr/ADR-176-hef-integration-epic.md diff --git a/docs/adr/ADR-176-hef-integration-epic.md b/docs/adr/ADR-176-hef-integration-epic.md new file mode 100644 index 00000000..50934de0 --- /dev/null +++ b/docs/adr/ADR-176-hef-integration-epic.md @@ -0,0 +1,243 @@ +--- +adr: 176 +title: "EPIC — Wire HEF into HailoEmbedder for NPU-accelerated production embeddings" +status: in-progress +date: 2026-05-03 +authors: [ruvnet, claude-flow] +related: [ADR-167, ADR-172, ADR-173, ADR-175] +--- + +# ADR-176 — EPIC: Wire HEF into HailoEmbedder for NPU-accelerated embeddings + +## Status + +**In progress** as of iter 158 (2026-05-03). Iter 156b/157 produced +and validated the encoder HEF on real hardware (73.4 FPS on +cognitum-v0). This EPIC tracks the remaining Rust integration work +to make NPU acceleration the production-default embedding path. + +## Why this is an EPIC, not a single iteration + +Through iter 133-157 we compressed the HEF compile blocker — but the +runtime path is six distinct concerns that can't be done in one +commit without going past 500 LOC: + +1. HEF artifact provenance + deploy plumbing +2. HailoRT FFI surface (HEF loading, vstreams, dequantize) +3. Host-side embedding lookup (candle `BertEmbeddings` or hand-rolled) +4. End-to-end pipeline composition (tokenize → embed → NPU → pool → L2) +5. `HailoEmbedder` integration (HEF takes precedence over cpu-fallback) +6. Hardware validation + benchmark vs cpu-fallback baseline + +Each is small individually but they nest — phase 5 needs phases 1-4 +to land first; phase 6 needs phase 5; etc. Tracking them as one EPIC +prevents the "looks done but actually broken" failure mode that would +follow from merging a partial wire-up. + +## Phases + +### Phase P0 — Pi development environment + +**Done.** Iter 152 ran `install.sh` on cognitum-v0; HEF runs at +73.4 FPS via `hailortcli run`. `/dev/hailo0` accessible to the +`ruvector-worker` group via the udev rule. + +### Phase P1 — HEF loading + vstream creation (Rust) + +**New module**: `crates/ruvector-hailo/src/hef_pipeline.rs` (or +extend `inference.rs`). Surfaces: + +```rust +#[cfg(feature = "hailo")] +pub struct HefPipeline { + device: Arc, // shared with HailoEmbedder + network_group: hailort_sys::hailo_configured_network_group, + input_vstream: hailort_sys::hailo_input_vstream, + output_vstream: hailort_sys::hailo_output_vstream, + input_quant: QuantInfo, // scale + zero-point for input + output_quant: QuantInfo, // scale + zero-point for output + input_shape: [usize; 3], // (1, seq=128, hidden=384) + output_shape: [usize; 3], // (1, seq=128, hidden=384) +} + +impl HefPipeline { + pub fn open(device: &HailoDevice, hef_path: &Path) -> Result; + pub fn forward(&mut self, input: &[f32]) -> Result>; + pub fn input_dim(&self) -> usize; + pub fn output_dim(&self) -> usize; +} +``` + +**FFI surface needed** (already in `hailort-sys` via bindgen): +- `hailo_create_hef` — load `.hef` from disk +- `hailo_configure_vdevice` — bind HEF to vdevice → network groups +- `hailo_get_network_groups` — pick `minilm_encoder` +- `hailo_create_input_vstreams` / `hailo_create_output_vstreams` +- `hailo_get_input_vstream_info` / `hailo_get_output_vstream_info` + — quantization scale + zero-point per stream +- `hailo_vstream_write_raw_buffer` — push input +- `hailo_vstream_read_raw_buffer` — read output +- `hailo_release_*` — drop helpers + +**Quantization handling**: +- Input is FP32 in our Rust API but UINT8 to the NPU. Quantize + `out_u8 = clip(round(in_f32 / scale + zero_point), 0, 255)`. +- Output is UINT8 from the NPU but FP32 in our Rust API. + Dequantize `out_f32 = scale * (in_u8 - zero_point)`. +- Scale/zero-point come from vstream info at HEF-load time. + +**Tests**: smoke test that uses a fixed-bytes input and checks the +output shape + dim. Skipped on `cargo test --no-default-features`. + +### Phase P2 — Host-side embedding lookup + +**Why**: the iter-156 ONNX export removed the `Gather` embedding +lookup so the NPU graph is just the encoder block. The host has to +do `input_ids → embeddings` before pushing to NPU. + +**Two possible implementations**: + +A. **Reuse candle's `BertEmbeddings`**: factor out the embedding + layer from `cpu_embedder.rs`. Candle handles the position + + token-type embedding sums and LayerNorm. ~60 LOC of refactor. + +B. **Hand-rolled embedding lookup**: read the embedding tables + directly from `model.safetensors` (word_embeddings, + position_embeddings, token_type_embeddings, LayerNorm gamma/beta) + and do the math without candle. ~150 LOC; avoids the candle + runtime overhead per call. + +**Recommendation**: Start with (A) for speed of implementation. If +profiling shows the lookup is >20% of end-to-end latency, swap to +(B). The lookup is mostly memory bandwidth (table-fetch + add) so +SIMD doesn't matter much. + +### Phase P3 — End-to-end pipeline composition + +**New struct in `cpu_embedder.rs` or sibling**: + +```rust +pub struct HefEmbedder { + embeddings: BertEmbeddings, // host-side (from model.safetensors) + pipeline: HefPipeline, // NPU forward pass + tokenizer: Tokenizer, + output_dim: usize, + max_seq: usize, +} + +impl HefEmbedder { + pub fn open(hef_path: &Path, model_dir: &Path) -> Result; + pub fn embed(&mut self, text: &str) -> Result>; +} +``` + +`embed()` flow: +1. Tokenize `text` → `(input_ids, attention_mask)` (HF tokenizer) +2. Pad to seq=128 +3. Compute embeddings host-side: `embed_table[input_ids] + position_embed + type_embed`, then LayerNorm. Output shape `[1, 128, 384]` FP32. +4. Push embeddings to `pipeline.forward()` → output `[1, 128, 384]` FP32 (post-dequant). +5. Mean-pool over seq dim weighted by `attention_mask` (existing + `inference::mean_pool` — already there). +6. L2-normalize (existing `inference::l2_normalize`). +7. Return `Vec` of length 384. + +### Phase P4 — `HailoEmbedder` integration + +Modify `HailoEmbedder::open` (`crates/ruvector-hailo/src/lib.rs`): + +```text +priority order at open(): + 1. If --features hailo AND model_dir contains model.hef: + use HefEmbedder (NPU acceleration) + 2. Else if --features cpu-fallback AND model_dir contains + model.safetensors: + use CpuEmbedder (host CPU) + 3. Else: + open(NoModelLoaded) — health probe still serves +``` + +`embed()` dispatch: +```text +1. self.hef_embedder.as_ref()?.embed(text) +2. self.cpu_fallback.as_ref()?.embed(text) +3. Err(NoModelLoaded) +``` + +`has_model()` returns `true` if either is loaded. + +`compute_fingerprint` (cluster) already handles both layouts (iter +143). Need to extend to `model.hef` ⊕ `model.safetensors` (worker +running with both gets a fingerprint distinct from worker running +only safetensors — different code paths means different vectors, +cluster should refuse to mix). + +### Phase P5 — Hardware validation + benchmark + +On cognitum-v0: +1. Stop the systemd `ruvector-hailo-worker` +2. Cross-build worker with `--features hailo,cpu-fallback` +3. Drop `model.hef` into `/var/lib/ruvector-hailo/models/all-minilm-l6-v2/` + alongside the existing safetensors trio +4. Restart the systemd unit +5. Verify the iter-145 startup self-test embed completes + (proves the HEF path runs end-to-end on hardware) +6. Run `cluster-bench --workers cognitum-v0:50051 --concurrency 4 + --duration-secs 30` and capture: + - throughput vs cpu-fallback (expect 5-10× improvement) + - p50 / p99 latency vs cpu-fallback +7. Verify output vectors are semantically similar to cpu-fallback + (cosine similarity >0.95 on a fixed sentence corpus — small + accuracy loss is expected from INT8 quantization but the + ordering must hold) + +### Phase P6 — ADR-176 finalization + +Update this ADR with measured numbers, mark status `accepted`. Update +ADR-167 status table. Update ADR-175 to mark Option A as the +production path. Update worker README and env.example. + +## Acceptance criteria + +This EPIC is "complete and validated" when: + +1. `cargo build --features hailo,cpu-fallback --bin ruvector-hailo-worker` + succeeds on Pi 5 +2. `systemctl start ruvector-hailo-worker` boots cleanly with HEF +3. Iter-145 self-test embed prints success in journald +4. ruvllm-bridge → cluster → Pi worker returns a real semantic + vector (validated as in iter 149) +5. `cluster-bench` measures ≥5× throughput improvement vs iter-149 + cpu-fallback baseline (7.0 / sec → ≥35 / sec single-worker) +6. Cosine similarity between HEF-produced and cpu-fallback-produced + vectors on a 5-sentence test corpus stays >0.95 average +7. `cargo clippy --all-targets -- -D warnings` clean both feature + combos + +## Iteration plan (loop-worker driven) + +Each loop-worker iteration tackles one tightly-scoped chunk: + +| Iter | Phase | Concrete deliverable | +|---|---|---| +| 158 | P1 | hef_pipeline.rs scaffold + HEF load + vstream open | +| 159 | P1 | vstream read/write + quantize/dequantize | +| 160 | P2 | BertEmbeddings extracted from cpu_embedder | +| 161 | P3 | HefEmbedder struct, end-to-end embed() | +| 162 | P4 | HailoEmbedder dispatch + has_model + tests | +| 163 | P5 | Pi 5 deploy + cluster-bench measurement | +| 164 | P5 | Cosine similarity verification vs cpu-fallback | +| 165 | P6 | Finalize ADR-176 + update related ADRs | + +Loop will self-pace; if one iteration's deliverable hits a snag +(e.g., a HailoRT API turns out different than docs suggest), the +loop iterates on it before moving on. + +## References + +- ADR-167 — original ruvector-hailo embedding backend design +- ADR-175 — Rust-side workarounds for HEF SDK bugs +- ADR-176 — this EPIC (in-progress) +- iter 156b commit — `ffa3e90a6` HEF compiled +- iter 157 commit — `2ba399fbe` NPU forward pass validated at 73.4 FPS +- HEF artifact: 15.7 MB, + sha256 `cdbc892765d3099f74723ee6c28ab3f0daade2358827823ba08d2969b07ebd40`