mirror of
https://github.com/ruvnet/RuVector.git
synced 2026-05-29 19:33:34 +00:00
docs(adr): ADR-176 EPIC — wire HEF into HailoEmbedder for NPU acceleration (iter 158)
Six-phase EPIC covering the remaining Rust integration to make NPU acceleration the production-default after the iter 156b/157 breakthrough (HEF compiled + validated at 73.4 FPS on real hardware): P0 — Pi dev environment [done — iter 152] P1 — HEF loading + vstreams [iter 158-159] P2 — Host-side embedding lookup [iter 160] P3 — End-to-end pipeline compose [iter 161] P4 — HailoEmbedder dispatch [iter 162] P5 — Pi hardware validation [iter 163-164] P6 — ADR finalization [iter 165] Scoped as an EPIC because the runtime path is six distinct concerns that can't fit in a single commit without going past 500 LOC; each iter-step is small but they nest. Tracking as one EPIC prevents "looks done but actually broken" partial wire-ups. Acceptance criteria: ≥5× throughput vs cpu-fallback (iter-149 baseline of 7/sec → ≥35/sec single-worker on Pi 5), cosine >0.95 between HEF and cpu-fallback outputs, clippy clean both feature combos. Loop-worker plan: self-paced iterations, one phase deliverable each; snags loop before advancing. Co-Authored-By: claude-flow <ruv@ruv.net>
This commit is contained in:
parent
2ba399fbed
commit
98ab2ae7e7
1 changed files with 243 additions and 0 deletions
243
docs/adr/ADR-176-hef-integration-epic.md
Normal file
243
docs/adr/ADR-176-hef-integration-epic.md
Normal file
|
|
@ -0,0 +1,243 @@
|
|||
---
|
||||
adr: 176
|
||||
title: "EPIC — Wire HEF into HailoEmbedder for NPU-accelerated production embeddings"
|
||||
status: in-progress
|
||||
date: 2026-05-03
|
||||
authors: [ruvnet, claude-flow]
|
||||
related: [ADR-167, ADR-172, ADR-173, ADR-175]
|
||||
---
|
||||
|
||||
# ADR-176 — EPIC: Wire HEF into HailoEmbedder for NPU-accelerated embeddings
|
||||
|
||||
## Status
|
||||
|
||||
**In progress** as of iter 158 (2026-05-03). Iter 156b/157 produced
|
||||
and validated the encoder HEF on real hardware (73.4 FPS on
|
||||
cognitum-v0). This EPIC tracks the remaining Rust integration work
|
||||
to make NPU acceleration the production-default embedding path.
|
||||
|
||||
## Why this is an EPIC, not a single iteration
|
||||
|
||||
Through iter 133-157 we compressed the HEF compile blocker — but the
|
||||
runtime path is six distinct concerns that can't be done in one
|
||||
commit without going past 500 LOC:
|
||||
|
||||
1. HEF artifact provenance + deploy plumbing
|
||||
2. HailoRT FFI surface (HEF loading, vstreams, dequantize)
|
||||
3. Host-side embedding lookup (candle `BertEmbeddings` or hand-rolled)
|
||||
4. End-to-end pipeline composition (tokenize → embed → NPU → pool → L2)
|
||||
5. `HailoEmbedder` integration (HEF takes precedence over cpu-fallback)
|
||||
6. Hardware validation + benchmark vs cpu-fallback baseline
|
||||
|
||||
Each is small individually but they nest — phase 5 needs phases 1-4
|
||||
to land first; phase 6 needs phase 5; etc. Tracking them as one EPIC
|
||||
prevents the "looks done but actually broken" failure mode that would
|
||||
follow from merging a partial wire-up.
|
||||
|
||||
## Phases
|
||||
|
||||
### Phase P0 — Pi development environment
|
||||
|
||||
**Done.** Iter 152 ran `install.sh` on cognitum-v0; HEF runs at
|
||||
73.4 FPS via `hailortcli run`. `/dev/hailo0` accessible to the
|
||||
`ruvector-worker` group via the udev rule.
|
||||
|
||||
### Phase P1 — HEF loading + vstream creation (Rust)
|
||||
|
||||
**New module**: `crates/ruvector-hailo/src/hef_pipeline.rs` (or
|
||||
extend `inference.rs`). Surfaces:
|
||||
|
||||
```rust
|
||||
#[cfg(feature = "hailo")]
|
||||
pub struct HefPipeline {
|
||||
device: Arc<HailoDevice>, // shared with HailoEmbedder
|
||||
network_group: hailort_sys::hailo_configured_network_group,
|
||||
input_vstream: hailort_sys::hailo_input_vstream,
|
||||
output_vstream: hailort_sys::hailo_output_vstream,
|
||||
input_quant: QuantInfo, // scale + zero-point for input
|
||||
output_quant: QuantInfo, // scale + zero-point for output
|
||||
input_shape: [usize; 3], // (1, seq=128, hidden=384)
|
||||
output_shape: [usize; 3], // (1, seq=128, hidden=384)
|
||||
}
|
||||
|
||||
impl HefPipeline {
|
||||
pub fn open(device: &HailoDevice, hef_path: &Path) -> Result<Self>;
|
||||
pub fn forward(&mut self, input: &[f32]) -> Result<Vec<f32>>;
|
||||
pub fn input_dim(&self) -> usize;
|
||||
pub fn output_dim(&self) -> usize;
|
||||
}
|
||||
```
|
||||
|
||||
**FFI surface needed** (already in `hailort-sys` via bindgen):
|
||||
- `hailo_create_hef` — load `.hef` from disk
|
||||
- `hailo_configure_vdevice` — bind HEF to vdevice → network groups
|
||||
- `hailo_get_network_groups` — pick `minilm_encoder`
|
||||
- `hailo_create_input_vstreams` / `hailo_create_output_vstreams`
|
||||
- `hailo_get_input_vstream_info` / `hailo_get_output_vstream_info`
|
||||
— quantization scale + zero-point per stream
|
||||
- `hailo_vstream_write_raw_buffer` — push input
|
||||
- `hailo_vstream_read_raw_buffer` — read output
|
||||
- `hailo_release_*` — drop helpers
|
||||
|
||||
**Quantization handling**:
|
||||
- Input is FP32 in our Rust API but UINT8 to the NPU. Quantize
|
||||
`out_u8 = clip(round(in_f32 / scale + zero_point), 0, 255)`.
|
||||
- Output is UINT8 from the NPU but FP32 in our Rust API.
|
||||
Dequantize `out_f32 = scale * (in_u8 - zero_point)`.
|
||||
- Scale/zero-point come from vstream info at HEF-load time.
|
||||
|
||||
**Tests**: smoke test that uses a fixed-bytes input and checks the
|
||||
output shape + dim. Skipped on `cargo test --no-default-features`.
|
||||
|
||||
### Phase P2 — Host-side embedding lookup
|
||||
|
||||
**Why**: the iter-156 ONNX export removed the `Gather` embedding
|
||||
lookup so the NPU graph is just the encoder block. The host has to
|
||||
do `input_ids → embeddings` before pushing to NPU.
|
||||
|
||||
**Two possible implementations**:
|
||||
|
||||
A. **Reuse candle's `BertEmbeddings`**: factor out the embedding
|
||||
layer from `cpu_embedder.rs`. Candle handles the position +
|
||||
token-type embedding sums and LayerNorm. ~60 LOC of refactor.
|
||||
|
||||
B. **Hand-rolled embedding lookup**: read the embedding tables
|
||||
directly from `model.safetensors` (word_embeddings,
|
||||
position_embeddings, token_type_embeddings, LayerNorm gamma/beta)
|
||||
and do the math without candle. ~150 LOC; avoids the candle
|
||||
runtime overhead per call.
|
||||
|
||||
**Recommendation**: Start with (A) for speed of implementation. If
|
||||
profiling shows the lookup is >20% of end-to-end latency, swap to
|
||||
(B). The lookup is mostly memory bandwidth (table-fetch + add) so
|
||||
SIMD doesn't matter much.
|
||||
|
||||
### Phase P3 — End-to-end pipeline composition
|
||||
|
||||
**New struct in `cpu_embedder.rs` or sibling**:
|
||||
|
||||
```rust
|
||||
pub struct HefEmbedder {
|
||||
embeddings: BertEmbeddings, // host-side (from model.safetensors)
|
||||
pipeline: HefPipeline, // NPU forward pass
|
||||
tokenizer: Tokenizer,
|
||||
output_dim: usize,
|
||||
max_seq: usize,
|
||||
}
|
||||
|
||||
impl HefEmbedder {
|
||||
pub fn open(hef_path: &Path, model_dir: &Path) -> Result<Self>;
|
||||
pub fn embed(&mut self, text: &str) -> Result<Vec<f32>>;
|
||||
}
|
||||
```
|
||||
|
||||
`embed()` flow:
|
||||
1. Tokenize `text` → `(input_ids, attention_mask)` (HF tokenizer)
|
||||
2. Pad to seq=128
|
||||
3. Compute embeddings host-side: `embed_table[input_ids] + position_embed + type_embed`, then LayerNorm. Output shape `[1, 128, 384]` FP32.
|
||||
4. Push embeddings to `pipeline.forward()` → output `[1, 128, 384]` FP32 (post-dequant).
|
||||
5. Mean-pool over seq dim weighted by `attention_mask` (existing
|
||||
`inference::mean_pool` — already there).
|
||||
6. L2-normalize (existing `inference::l2_normalize`).
|
||||
7. Return `Vec<f32>` of length 384.
|
||||
|
||||
### Phase P4 — `HailoEmbedder` integration
|
||||
|
||||
Modify `HailoEmbedder::open` (`crates/ruvector-hailo/src/lib.rs`):
|
||||
|
||||
```text
|
||||
priority order at open():
|
||||
1. If --features hailo AND model_dir contains model.hef:
|
||||
use HefEmbedder (NPU acceleration)
|
||||
2. Else if --features cpu-fallback AND model_dir contains
|
||||
model.safetensors:
|
||||
use CpuEmbedder (host CPU)
|
||||
3. Else:
|
||||
open(NoModelLoaded) — health probe still serves
|
||||
```
|
||||
|
||||
`embed()` dispatch:
|
||||
```text
|
||||
1. self.hef_embedder.as_ref()?.embed(text)
|
||||
2. self.cpu_fallback.as_ref()?.embed(text)
|
||||
3. Err(NoModelLoaded)
|
||||
```
|
||||
|
||||
`has_model()` returns `true` if either is loaded.
|
||||
|
||||
`compute_fingerprint` (cluster) already handles both layouts (iter
|
||||
143). Need to extend to `model.hef` ⊕ `model.safetensors` (worker
|
||||
running with both gets a fingerprint distinct from worker running
|
||||
only safetensors — different code paths means different vectors,
|
||||
cluster should refuse to mix).
|
||||
|
||||
### Phase P5 — Hardware validation + benchmark
|
||||
|
||||
On cognitum-v0:
|
||||
1. Stop the systemd `ruvector-hailo-worker`
|
||||
2. Cross-build worker with `--features hailo,cpu-fallback`
|
||||
3. Drop `model.hef` into `/var/lib/ruvector-hailo/models/all-minilm-l6-v2/`
|
||||
alongside the existing safetensors trio
|
||||
4. Restart the systemd unit
|
||||
5. Verify the iter-145 startup self-test embed completes
|
||||
(proves the HEF path runs end-to-end on hardware)
|
||||
6. Run `cluster-bench --workers cognitum-v0:50051 --concurrency 4
|
||||
--duration-secs 30` and capture:
|
||||
- throughput vs cpu-fallback (expect 5-10× improvement)
|
||||
- p50 / p99 latency vs cpu-fallback
|
||||
7. Verify output vectors are semantically similar to cpu-fallback
|
||||
(cosine similarity >0.95 on a fixed sentence corpus — small
|
||||
accuracy loss is expected from INT8 quantization but the
|
||||
ordering must hold)
|
||||
|
||||
### Phase P6 — ADR-176 finalization
|
||||
|
||||
Update this ADR with measured numbers, mark status `accepted`. Update
|
||||
ADR-167 status table. Update ADR-175 to mark Option A as the
|
||||
production path. Update worker README and env.example.
|
||||
|
||||
## Acceptance criteria
|
||||
|
||||
This EPIC is "complete and validated" when:
|
||||
|
||||
1. `cargo build --features hailo,cpu-fallback --bin ruvector-hailo-worker`
|
||||
succeeds on Pi 5
|
||||
2. `systemctl start ruvector-hailo-worker` boots cleanly with HEF
|
||||
3. Iter-145 self-test embed prints success in journald
|
||||
4. ruvllm-bridge → cluster → Pi worker returns a real semantic
|
||||
vector (validated as in iter 149)
|
||||
5. `cluster-bench` measures ≥5× throughput improvement vs iter-149
|
||||
cpu-fallback baseline (7.0 / sec → ≥35 / sec single-worker)
|
||||
6. Cosine similarity between HEF-produced and cpu-fallback-produced
|
||||
vectors on a 5-sentence test corpus stays >0.95 average
|
||||
7. `cargo clippy --all-targets -- -D warnings` clean both feature
|
||||
combos
|
||||
|
||||
## Iteration plan (loop-worker driven)
|
||||
|
||||
Each loop-worker iteration tackles one tightly-scoped chunk:
|
||||
|
||||
| Iter | Phase | Concrete deliverable |
|
||||
|---|---|---|
|
||||
| 158 | P1 | hef_pipeline.rs scaffold + HEF load + vstream open |
|
||||
| 159 | P1 | vstream read/write + quantize/dequantize |
|
||||
| 160 | P2 | BertEmbeddings extracted from cpu_embedder |
|
||||
| 161 | P3 | HefEmbedder struct, end-to-end embed() |
|
||||
| 162 | P4 | HailoEmbedder dispatch + has_model + tests |
|
||||
| 163 | P5 | Pi 5 deploy + cluster-bench measurement |
|
||||
| 164 | P5 | Cosine similarity verification vs cpu-fallback |
|
||||
| 165 | P6 | Finalize ADR-176 + update related ADRs |
|
||||
|
||||
Loop will self-pace; if one iteration's deliverable hits a snag
|
||||
(e.g., a HailoRT API turns out different than docs suggest), the
|
||||
loop iterates on it before moving on.
|
||||
|
||||
## References
|
||||
|
||||
- ADR-167 — original ruvector-hailo embedding backend design
|
||||
- ADR-175 — Rust-side workarounds for HEF SDK bugs
|
||||
- ADR-176 — this EPIC (in-progress)
|
||||
- iter 156b commit — `ffa3e90a6` HEF compiled
|
||||
- iter 157 commit — `2ba399fbe` NPU forward pass validated at 73.4 FPS
|
||||
- HEF artifact: 15.7 MB,
|
||||
sha256 `cdbc892765d3099f74723ee6c28ab3f0daade2358827823ba08d2969b07ebd40`
|
||||
Loading…
Add table
Add a link
Reference in a new issue