feat(hailo): cpu-fallback fingerprint integrity + ADR-167 SDK bug chain (iter 143)

Production fix: cpu-fallback workers now produce a real model fingerprint instead of empty-string. Previously, compute_fingerprint only hashed model.hef + vocab.txt so cpu-fallback workers always reported empty, which caused the cluster's ADR-167 §8.3 fleet integrity check to silently skip them. compute_fingerprint now also hashes model.safetensors + tokenizer.json + config.json (streaming the safetensors so we don't hold 90 MB in RAM). NPU-layout vs cpu-fallback workers produce different fingerprints by design — they run different code paths so the cluster will refuse to mix them. Verified end-to-end: booted cpu-fallback worker against /tmp/cpu-fallback-test, got real fingerprint 2517aa00... (was empty before). One new lib test, total 16 fingerprint tests green. Worker startup warning updated to mention both layouts. ADR-167 documents the iter-142/142b/143 SDK bug chain found by reading hailo_sdk source: KeyError fixed by internal-layer-name keying; AccelerasValueError fixed by 4D NCHW calib; then TypeError on ElementwiseAddDirectOp deserialization in spawned subprocess — that last one is beyond user-space patching. NPU acceleration remains blocked; cpu-fallback remains the production path. Co-Authored-By: claude-flow <ruv@ruv.net>
2026-05-30 20:43:38 +00:00 · 2026-05-02 17:16:31 -04:00 · 2026-05-02 17:16:31 -04:00 · 4ca64ed7d3
commit 4ca64ed7d3
parent fb091c7567
3 changed files with 112 additions and 17 deletions
--- a/crates/ruvector-hailo-cluster/src/bin/worker.rs
+++ b/crates/ruvector-hailo-cluster/src/bin/worker.rs
@ -337,8 +337,10 @@ fn main() -> Result<(), Box<dyn std::error::Error>> {
    let fingerprint = compute_fingerprint(&model_dir);
    if fingerprint.is_empty() {
        warn!(
-            "model_dir {} has no model.hef / vocab.txt — fingerprint empty; \
-             coordinators will skip the integrity check",
+            "model_dir {} has no recognizable model artifacts \
+             (NPU: model.hef + vocab.txt; cpu-fallback: model.safetensors + \
+             tokenizer.json + config.json) — fingerprint empty; coordinators \
+             will skip the integrity check",
            model_dir.display()
        );
    } else {
--- a/crates/ruvector-hailo-cluster/src/fingerprint.rs
+++ b/crates/ruvector-hailo-cluster/src/fingerprint.rs
@ -1,29 +1,50 @@
-//! Model fingerprint — sha256 over HEF + tokenizer artifacts.
+//! Model fingerprint — sha256 over the model artifacts on disk.
 //!
 //! ADR-167 §8.3 fleet integrity guard: coordinators refuse to mix
 //! workers reporting different model fingerprints. Computed at worker
-//! startup over the files actually loaded, so any swap of HEF or
-//! vocab.txt produces a different fingerprint and the coordinator can
-//! eject the drift.
+//! startup over the files actually loaded, so any swap of model
+//! weights or tokenizer produces a different fingerprint and the
+//! coordinator can eject the drift.
 //!
-//! Format: hex-lowercase, 64 chars.
+//! Iter 143 — covers both deployment paths:
+//!   * NPU path:   sha256(model.hef || vocab.txt)
+//!   * cpu-fallback: sha256(model.safetensors || tokenizer.json || config.json)
+//!
+//! Mixed clusters (some workers on NPU, some on CPU) intentionally
+//! produce different fingerprints — they're running different code
+//! paths so the cluster should reject the mix.
+//!
+//! Format: hex-lowercase, 64 chars. Empty when no recognizable model
+//! artifacts are present in `model_dir`.

 use sha2::{Digest, Sha256};
 use std::path::Path;

-/// Compute sha256 over (hef_bytes || vocab_bytes). Missing files are
-/// treated as empty so the fingerprint is *also* a witness of which
-/// files exist — a worker that loads only the HEF (no tokenizer)
-/// produces a different fingerprint than a worker with both.
+/// Compute sha256 over the model artifacts. Missing files are treated
+/// as empty within their layout so the fingerprint is *also* a witness
+/// of which files exist — a worker that loads only the HEF (no
+/// tokenizer) produces a different fingerprint than a worker with both.
 ///
-/// Returns "" when both files are missing — caller (worker startup)
-/// uses empty-string as "skip the check" sentinel until step 6 lands.
+/// Returns "" when neither layout has any recognizable file — caller
+/// (worker startup) uses empty-string as "skip the check" sentinel.
 pub fn compute_fingerprint(model_dir: &Path) -> String {
    let hef = std::fs::read(model_dir.join("model.hef")).unwrap_or_default();
    let vocab = std::fs::read(model_dir.join("vocab.txt")).unwrap_or_default();
-    if hef.is_empty() && vocab.is_empty() {
+
+    // Iter 143: cpu-fallback artifacts. We don't read the full
+    // safetensors (90 MB) into memory — sha256 it in streaming chunks.
+    let safetensors_present = model_dir.join("model.safetensors").exists();
+    let tokenizer_json = std::fs::read(model_dir.join("tokenizer.json")).unwrap_or_default();
+    let config_json = std::fs::read(model_dir.join("config.json")).unwrap_or_default();
+
+    let npu_layout_present = !hef.is_empty() || !vocab.is_empty();
+    let cpu_layout_present =
+        safetensors_present || !tokenizer_json.is_empty() || !config_json.is_empty();
+
+    if !npu_layout_present && !cpu_layout_present {
        return String::new();
    }
+
    let mut h = Sha256::new();
    // Length-prefix each input so a hef of N bytes + vocab of M bytes
    // never collides with a hef of N+M bytes + empty vocab.
@ -31,6 +52,30 @@ pub fn compute_fingerprint(model_dir: &Path) -> String {
    h.update(&hef);
    h.update((vocab.len() as u64).to_le_bytes());
    h.update(&vocab);
+
+    // Stream-hash the safetensors so we don't read 90 MB into memory.
+    if safetensors_present {
+        // Tag with file marker so an empty hef doesn't blend with safetensors.
+        h.update(b"safetensors:");
+        if let Ok(mut f) = std::fs::File::open(model_dir.join("model.safetensors")) {
+            let mut buf = [0u8; 64 * 1024];
+            let mut total: u64 = 0;
+            use std::io::Read;
+            while let Ok(n) = f.read(&mut buf) {
+                if n == 0 {
+                    break;
+                }
+                h.update(&buf[..n]);
+                total += n as u64;
+            }
+            h.update(total.to_le_bytes());
+        }
+    }
+    h.update((tokenizer_json.len() as u64).to_le_bytes());
+    h.update(&tokenizer_json);
+    h.update((config_json.len() as u64).to_le_bytes());
+    h.update(&config_json);
+
    let digest = h.finalize();
    hex_lower(&digest)
 }
@ -95,6 +140,37 @@ mod tests {
        assert_ne!(fp2, fp3);
    }

+    #[test]
+    fn cpu_fallback_safetensors_layout_yields_distinct_fingerprint() {
+        // Iter 143: a worker with safetensors+tokenizer+config but no
+        // hef must produce a non-empty fingerprint, distinct from a
+        // worker with the same files but different content.
+        let d1 = tmpdir();
+        std::fs::write(d1.path().join("model.safetensors"), b"weights-A").unwrap();
+        std::fs::write(d1.path().join("tokenizer.json"), b"tok-A").unwrap();
+        std::fs::write(d1.path().join("config.json"), b"cfg-A").unwrap();
+        let fp1 = compute_fingerprint(d1.path());
+        assert_eq!(fp1.len(), 64);
+        assert!(!fp1.is_empty());
+
+        let d2 = tmpdir();
+        std::fs::write(d2.path().join("model.safetensors"), b"weights-B").unwrap();
+        std::fs::write(d2.path().join("tokenizer.json"), b"tok-A").unwrap();
+        std::fs::write(d2.path().join("config.json"), b"cfg-A").unwrap();
+        let fp2 = compute_fingerprint(d2.path());
+        assert_ne!(fp1, fp2, "different safetensors must yield different fp");
+
+        // Per ADR-167 §8.3, an NPU-layout worker and a cpu-fallback
+        // worker run different code paths so their fingerprints SHOULD
+        // differ even with the same logical model — the cluster will
+        // refuse to mix them.
+        let d3 = tmpdir();
+        std::fs::write(d3.path().join("model.hef"), b"weights-A").unwrap();
+        std::fs::write(d3.path().join("vocab.txt"), b"tok-A").unwrap();
+        let fp3 = compute_fingerprint(d3.path());
+        assert_ne!(fp1, fp3, "NPU layout vs cpu-fallback must differ");
+    }
+
    #[test]
    fn length_prefix_prevents_split_collision() {
        // Without length-prefixing, sha256(b"abc" || b"de") == sha256(b"ab" || b"cde").
--- a/docs/adr/ADR-167-ruvector-hailo-npu-embedding-backend.md
+++ b/docs/adr/ADR-167-ruvector-hailo-npu-embedding-backend.md
@ -66,10 +66,27 @@ re-exported the BERT encoder block in isolation:
  quantized weights (full-precision HEF would be possible on
  hailo15h but not hailo8).

+**Iter 142/142b/143 follow-up debugging** (after reading the SDK source):
+- Root-caused the iter-139 `KeyError` to a mismatch between
+  `_get_build_inputs` (returns dict keyed by user dataset keys) and
+  `hailo_model.build` (looks up by internal `flow.input_nodes` names).
+  Workaround: introspect the parsed HN, key the calibration dict by
+  the actual layer name (`minilm_encoder/input_layer1`).
+- Past that: `AccelerasValueError` shape mismatch — Hailo's HN treats
+  inputs as 4D NCHW with implicit channels=1. Workaround: reshape
+  calibration from `[batch, seq, hidden]` to `[batch, 1, seq, hidden]`.
+- Past **that**: a Keras serialization bug —
+  `TypeError: Could not locate class 'ElementwiseAddDirectOp'` —
+  during the SDK's deepcopy of its own internal layer types. This is
+  hailo_model_optimization deepcopy-ing a custom Keras layer it
+  registered itself, then failing to deserialize it because the
+  `@register_keras_serializable` decorator isn't running in the
+  spawned subprocess. Cannot be fixed from user-space.
+
 **Status:** the encoder ONNX is fundamentally Hailo-compatible (it
-parses + full-precision-optimizes cleanly). The remaining gap is an
-SDK-internal bug in INT8 quantization of transformer encoders that
-can't be worked around from user-space. The cleanest unblock paths:
+parses + full-precision-optimizes cleanly). The remaining gap is a
+chain of SDK-internal bugs in INT8 quantization of transformer encoders
+that can't be worked around from user-space. The cleanest unblock paths:
 1. Hailo support ticket (the SDK should not KeyError on a layer it
   knows about — this is a quantization-flow bug, not a
   user-input bug)