mirror of
https://github.com/ruvnet/RuView.git
synced 2026-04-28 05:59:32 +00:00
feat(demo): wire all 6 RuVector WASM attention mechanisms into pose fusion
* feat: dual-modal WASM browser pose estimation demo (ADR-058)

  Live webcam video + WiFi CSI fusion for real-time pose estimation. Two parallel CNN pipelines (ruvector-cnn-wasm) with attention-weighted fusion and dynamic confidence gating. Three modes: Dual, Video-only, CSI-only. Includes pre-built WASM package (~52KB) for browser deployment.

  - ADR-058: Dual-modal architecture design
  - ui/pose-fusion.html: Main demo page with dark theme UI
  - 7 JS modules: video-capture, csi-simulator, cnn-embedder, fusion-engine, pose-decoder, canvas-renderer, main orchestrator
  - Pre-built ruvector-cnn-wasm WASM package for browser
  - CSI heatmap, embedding space visualization, latency metrics
  - WebSocket support for live ESP32 CSI data
  - Navigation link added to main dashboard

  Co-Authored-By: claude-flow <ruv@ruv.net>

* fix: motion-responsive skeleton + through-wall CSI tracking

  - Pose decoder now uses per-cell motion grid to track actual arm/head positions — raising arms moves the skeleton's arms, head follows lateral movement
  - Motion grid (10x8 cells) tracks intensity per body zone: head, left/right arm upper/mid, legs
  - Through-wall mode: when person exits frame, CSI maintains presence with slow decay (~10s) and skeleton drifts in exit direction
  - CSI simulator persists sensing after video loss, ghost pose renders with decreasing confidence
  - Reduced temporal smoothing (0.45) for faster response to movement

  Co-Authored-By: claude-flow <ruv@ruv.net>

* fix: video fills available space + correct WASM path resolution

  - Remove fixed aspect-ratio and max-height from video panel so it fills the available viewport space without scrolling
  - Grid uses 1fr row for content area, overflow:hidden on main grid
  - Fix WASM path: resolve relative to JS module file using import.meta.url instead of hardcoded ./pkg/ which resolved incorrectly on gh-pages
  - Responsive: mobile still gets aspect-ratio constraint

  Co-Authored-By: claude-flow <ruv@ruv.net>

* feat: live ESP32 CSI pipeline + auto-connect WebSocket

  - Add auto-connect to local sensing server WebSocket (ws://localhost:8765)
  - Demo shows "Live ESP32" when connected to real CSI data
  - Add build_firmware.ps1 for native Windows ESP-IDF builds (no Docker)
  - Add read_serial.ps1 for ESP32 serial monitor

  Pipeline: ESP32 → UDP:5005 → sensing-server → WS:8765 → browser demo

  Co-Authored-By: claude-flow <ruv@ruv.net>

* docs: add ADR-059 live ESP32 CSI pipeline + update README with demo links

  - ADR-059: Documents end-to-end ESP32 → sensing server → browser pipeline
  - README: Add dual-modal pose fusion demo link, update ADR count to 49
  - References issue #245

  Co-Authored-By: claude-flow <ruv@ruv.net>

* feat: RSSI visualization, RuVector attention WASM, cache-bust fixes

  - Add animated RSSI Signal Strength panel with sparkline history
  - Fix RuVector WasmMultiHeadAttention retptr calling convention
  - Wire up RuVector Multi-Head + Flash Attention in CNN embedder
  - Add ambient temporal drift to CSI simulator for visible heatmap animation
  - Fix embedding space projection (sparse projection replaces cancelling sum)
  - Add auto-scaling to embedding space renderer
  - Add cache busters (?v=4) to all ES module imports to prevent stale caches
  - Add diagnostic logging for module version verification
  - Add RSSI tracking with quality labels and color-coded dBm display
  - Includes ruvector-attention-wasm v2.0.5 browser ESM wrapper

  Co-Authored-By: claude-flow <ruv@ruv.net>

* feat: 26-keypoint dexterous pose + full RuVector attention pipeline

  Pose Decoder (17 → 26 keypoints):
  - Add finger approximations: thumb, index, pinky per hand (6 new)
  - Add toe tips: left/right foot index (2 new)
  - Add neck keypoint (1 new)
  - Hand openness driven by arm motion intensity
  - Finger positions computed from wrist-elbow axis angles

  CNN Embedder (full RuVector WASM pipeline):
  - Stage 1: Multi-Head Attention (global spatial reasoning)
  - Stage 2: Hyperbolic Attention (hierarchical body-part tree)
  - Stage 3: MoE Attention (3 experts: upper/lower/extremities, top-2)
  - Blended 40/30/30 weighting → final embedding projection

  Canvas Renderer:
  - Magenta finger joints with distinct glow
  - Cyan toe tips
  - White neck keypoint
  - Thinner limb lines for hand/foot connections
  - Joint count shown in overlay label

  CSI Simulator:
  - Skip synthetic person state when live ESP32 connected
  - Only simulate CSI data in demo mode (was already correct)

  Embedding Space:
  - Fixed projection: sparse 8-dim projection replaces cancelling sum
  - Auto-scaling normalizes point spread to fill canvas

  Cache busters bumped to v=5 on all imports.

  Co-Authored-By: claude-flow <ruv@ruv.net>

* fix: centroid-based pose tracking for responsive limb movement

  Rewrites pose decoder from intensity-based to position-based tracking:
  - Arms now track toward motion centroid in each body zone
  - Elbow/wrist positions computed along shoulder→centroid vector
  - Legs track toward lower-body zone centroids
  - Smoothing reduced from 0.45 to 0.25 for responsiveness
  - Zone centroids blend 30% old / 70% new each frame

  6 body zones with overlapping coverage:
  - Head (top 20%, center cols)
  - Left/Right Arm (rows 10-60%, outer cols)
  - Torso (rows 15-55%, center cols)
  - Left/Right Leg (rows 50-100%, half cols each)

  Hand openness now driven by arm spread distance + raise amount.

  Cache busters v=6.

  Co-Authored-By: claude-flow <ruv@ruv.net>

* fix: remove duplicate lAnkleX/rAnkleX declarations in pose-decoder

  Stale code block from old intensity-based tracking was left behind, re-declaring variables already defined by centroid-based tracking.

  Co-Authored-By: claude-flow <ruv@ruv.net>

* feat(demo): wire all 6 RuVector WASM attention mechanisms into pose fusion

  - Add WasmLinearAttention and WasmLocalGlobalAttention to browser ESM wrapper
  - Add 6 WASM utility functions (batch_normalize, pairwise_distances, etc.)
  - Extend CnnEmbedder to 6-stage pipeline: Flash → MHA → Hyperbolic → Linear → MoE → L+G
  - Use log-energy softmax blending across all 6 stages
  - Wire WASM cosine_similarity and normalize into FusionEngine
  - Add RuVector pipeline stats panel to UI (energy, refinement, pose impact)
  - Compute embedding-to-joint mapping stats without modifying joint positions
  - Center camera prompt with flexbox layout
  - Add cache busters v=12

  Co-Authored-By: claude-flow <ruv@ruv.net>
This commit is contained in:
parent
6e03a47867
commit
7c1351fd5d
27 changed files with 7428 additions and 1 deletions
392
docs/adr/ADR-058-ruvector-wasm-browser-pose-example.md
Normal file
# ADR-058: Dual-Modal WASM Browser Pose Estimation — Live Video + WiFi CSI Fusion

- **Status**: Proposed
- **Date**: 2026-03-12
- **Deciders**: ruv
- **Tags**: wasm, browser, cnn, pose-estimation, ruvector, video, multimodal, fusion

## Context

WiFi-DensePose estimates human poses from WiFi CSI (Channel State Information).
The `ruvector-cnn` crate provides a pure Rust CNN (MobileNet-V3) with WASM bindings.
Both modalities exist independently — what's missing is **fusing live webcam video
with WiFi CSI** in a single browser demo to achieve robust pose estimation that
works even when one modality degrades (occlusion, signal noise, poor lighting).

Existing assets:

1. **`wifi-densepose-wasm`** — CSI signal processing compiled to WASM
2. **`wifi-densepose-sensing-server`** — Axum server streaming live CSI via WebSocket
3. **`ruvector-cnn`** — Pure Rust CNN with MobileNet-V3 backbones, SIMD, contrastive learning
4. **`ruvector-cnn-wasm`** — wasm-bindgen bindings: `WasmCnnEmbedder`, `SimdOps`, `LayerOps`, contrastive losses
5. **`vendor/ruvector/examples/wasm-vanilla/`** — Reference vanilla JS WASM example

Research shows multi-modal fusion (camera + WiFi) significantly outperforms either alone:

- Camera fails under occlusion, poor lighting, privacy constraints
- WiFi CSI fails with signal noise, multipath, low spatial resolution
- Fusion compensates: WiFi provides through-wall coverage, camera provides fine-grained detail
## Decision

Build a **dual-modal browser demo** at `examples/wasm-browser-pose/` that:

1. Captures **live webcam video** via `getUserMedia` API
2. Receives **live WiFi CSI** via WebSocket from the sensing server
3. Processes **both streams** through separate CNN pipelines in `ruvector-cnn-wasm`
4. **Fuses embeddings** with learned attention weights for combined pose estimation
5. Renders **video overlay** with skeleton + WiFi confidence heatmap on Canvas
6. Runs entirely in the browser — all inference client-side via WASM

### Architecture

```
┌──────────────────────────────────────────────────────────────────┐
│                             Browser                              │
│                                                                  │
│  ┌────────────┐    ┌────────────────┐    ┌───────────────────┐   │
│  │getUserMedia│───▶│  Video Frame   │───▶│     CNN WASM      │   │
│  │  (Webcam)  │    │    Capture     │    │ (Visual Embedder) │   │
│  └────────────┘    │  224×224 RGB   │    │     → 512-dim     │   │
│                    └────────────────┘    └────────┬──────────┘   │
│                                                   │              │
│                                          visual_embedding        │
│                                                   │              │
│                                            ┌──────▼──────┐       │
│  ┌────────────┐    ┌────────────────┐      │             │       │
│  │ WebSocket  │───▶│   CSI WASM     │      │  Attention  │       │
│  │  Client    │    │  (densepose-   │      │   Fusion    │       │
│  │            │    │     wasm)      │      │   Module    │       │
│  └────────────┘    └───────┬────────┘      │             │       │
│                            │               └──────┬──────┘       │
│                    ┌───────▼────────┐             │              │
│                    │   CNN WASM     │      fused_embedding       │
│                    │ (CSI Embedder) │             │              │
│                    │   → 512-dim    │      ┌──────▼──────┐       │
│                    └───────┬────────┘      │    Pose     │       │
│                            │               │   Decoder   │       │
│                     csi_embedding          │  → 17 kpts  │       │
│                            │               └──────┬──────┘       │
│                            └──────────────────────┘              │
│                                                   │              │
│          ┌──────────────┐               ┌─────▼──────┐           │
│          │ Video Canvas │◀────────│       Overlay    │           │
│          │ + Skeleton   │               │  Renderer  │           │
│          │ + Heatmap    │               └────────────┘           │
│          └──────────────┘                                        │
│                                                                  │
└──────────────────────────────────────────────────────────────────┘
         ▲                           ▲
         │ getUserMedia              │ WebSocket
         │ (camera)                  │ (ws://host:3030/ws/csi)
         │                           │
    ┌────┴────┐              ┌───────┴─────────┐
    │ Webcam  │              │ Sensing Server  │
    └─────────┘              └─────────────────┘
```
### Dual Pipeline Design

Two parallel CNN pipelines run on each frame tick (~30 FPS):

| Pipeline | Input | Preprocessing | CNN Config | Output |
|----------|-------|---------------|------------|--------|
| **Visual** | Webcam frame (640×480) | Resize to 224×224 RGB, ImageNet normalize | MobileNet-V3 Small, 512-dim | Visual embedding |
| **CSI** | CSI frame (ADR-018 binary) | Amplitude/phase/delta → 224×224 pseudo-RGB | MobileNet-V3 Small, 512-dim | CSI embedding |

Both use the same `WasmCnnEmbedder` but with separate instances and weight sets.
### Fusion Strategy

**Learned attention-weighted fusion** combines the two 512-dim embeddings:

```javascript
// Attention fusion: learn which modality to trust per-dimension
// α ∈ [0,1]^512 — attention weights (shipped as JSON, trained offline)
// visual_emb, csi_emb ∈ R^512

function fuseEmbeddings(visual_emb, csi_emb, attention_weights) {
  const fused = new Float32Array(512);
  for (let i = 0; i < 512; i++) {
    const α = attention_weights[i];
    fused[i] = α * visual_emb[i] + (1 - α) * csi_emb[i];
  }
  return fused;
}
```
**Dynamic confidence gating** adjusts fusion based on signal quality:

| Condition | Behavior |
|-----------|----------|
| Good video + good CSI | Balanced fusion (α ≈ 0.5) |
| Poor lighting / occlusion | CSI-dominant (α → 0, WiFi takes over) |
| CSI noise / no ESP32 | Video-dominant (α → 1, camera only) |
| Video-only mode (no WiFi) | α = 1.0, pure visual CNN pose estimation |
| CSI-only mode (no camera) | α = 0.0, pure WiFi pose estimation |

Quality detection:

- **Video quality**: Frame brightness variance (dark = low quality), motion blur score
- **CSI quality**: Signal-to-noise ratio from `wifi-densepose-wasm`, coherence gate output
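The gating table above can be expressed as a small pure function. This is an illustrative sketch, not the shipped fusion-engine code; the proportional blend and the `mode` strings are assumptions for the example:

```javascript
// Sketch of dynamic confidence gating (illustrative, not the shipped code).
// videoQuality / csiQuality are assumed to be scores in [0, 1] produced by
// the quality detectors described above.
function gatedAlpha(videoQuality, csiQuality, mode = "dual") {
  if (mode === "video-only") return 1.0; // pure visual CNN pose estimation
  if (mode === "csi-only") return 0.0;   // pure WiFi pose estimation
  const total = videoQuality + csiQuality;
  if (total === 0) return 0.5;           // no signal at all: stay balanced
  // Trust each modality in proportion to its quality score:
  // good video + good CSI → α ≈ 0.5; degraded video → α → 0; degraded CSI → α → 1.
  return videoQuality / total;
}
```

The returned α feeds directly into `fuseEmbeddings` above, so the gate costs nothing beyond two additions and a division per frame.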
### CSI-to-Image Encoding

CSI data encoded as 3-channel pseudo-image for the CSI CNN pipeline:

| Channel | Data | Normalization |
|---------|------|---------------|
| R | CSI amplitude (subcarrier × time window) | Min-max to [0, 255] |
| G | CSI phase (unwrapped, subcarrier × time window) | Min-max to [0, 255] |
| B | Temporal difference (frame-to-frame Δ amplitude) | Abs, min-max to [0, 255] |
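A minimal sketch of this encoding, assuming the amplitude, phase, and delta planes arrive as flat `Float32Array`s of equal length (the function names here are illustrative, not the demo's actual module API):

```javascript
// Min-max normalize a plane to [0, 255] (flat input maps to 0 to avoid NaN).
function minMaxTo255(src) {
  let lo = Infinity, hi = -Infinity;
  for (const v of src) { if (v < lo) lo = v; if (v > hi) hi = v; }
  const range = hi - lo || 1;            // guard divide-by-zero on flat input
  return Float32Array.from(src, (v) => ((v - lo) / range) * 255);
}

// Pack amplitude / phase / |delta| into interleaved RGB bytes, per the table.
function encodePseudoImage(amp, phase, delta) {
  const r = minMaxTo255(amp);
  const g = minMaxTo255(phase);                 // phase assumed pre-unwrapped
  const b = minMaxTo255(delta.map(Math.abs));   // abs before min-max
  const rgb = new Uint8Array(amp.length * 3);
  for (let i = 0; i < amp.length; i++) {
    rgb[3 * i] = r[i];
    rgb[3 * i + 1] = g[i];
    rgb[3 * i + 2] = b[i];
  }
  return rgb;
}
```

The resulting `Uint8Array` has the same RGB layout as a captured webcam frame, so both pipelines can hand the same buffer shape to the CNN embedder.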
### Video Processing

Webcam frames processed through standard ImageNet pipeline:

```javascript
// Capture frame from video element
const frame = captureVideoFrame(videoElement, 224, 224); // Returns Uint8Array RGB

// ImageNet normalization happens inside WasmCnnEmbedder.extract()
const visual_embedding = visual_embedder.extract(frame, 224, 224);
```
### Pose Keypoint Mapping

17 COCO-format keypoints decoded from the fused 512-dim embedding:

```
0: nose              1: left_eye          2: right_eye
3: left_ear          4: right_ear         5: left_shoulder
6: right_shoulder    7: left_elbow        8: right_elbow
9: left_wrist        10: right_wrist      11: left_hip
12: right_hip        13: left_knee        14: right_knee
15: left_ankle       16: right_ankle
```

Each keypoint is decoded as (x, y, confidence), giving 17 × 3 = 51 values read
from the 512-dim embedding via a learned linear projection.
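A sketch of that projection, assuming a row-major 51×512 weight matrix and a 51-element bias as hinted by `pose-weights.json` (the exact weight-file format is not specified here):

```javascript
// Sketch of the pose decoder: learned linear projection from the fused
// embedding to 17 keypoints × (x, y, confidence) = 51 values.
// `weights` (51 × D, row-major) and `bias` (51) are assumed shapes.
function decodePose(fusedEmb, weights, bias) {
  const out = new Float32Array(51);
  for (let k = 0; k < 51; k++) {
    let acc = bias[k];
    for (let d = 0; d < fusedEmb.length; d++) {
      acc += weights[k * fusedEmb.length + d] * fusedEmb[d];
    }
    out[k] = acc;
  }
  // Group the 51 values into keypoints in COCO order 0..16.
  const keypoints = [];
  for (let k = 0; k < 17; k++) {
    keypoints.push({ x: out[3 * k], y: out[3 * k + 1], conf: out[3 * k + 2] });
  }
  return keypoints;
}
```

In the demo the embedding dimension is 512; the loop is written against `fusedEmb.length` so the same sketch works for any size.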
### Operating Modes

The demo supports three modes, selectable in the UI:

| Mode | Video | CSI | Fusion | Use Case |
|------|-------|-----|--------|----------|
| **Dual (default)** | ✅ | ✅ | Attention-weighted | Best accuracy, full demo |
| **Video Only** | ✅ | ❌ | α = 1.0 | No ESP32 available, quick demo |
| **CSI Only** | ❌ | ✅ | α = 0.0 | Privacy mode, through-wall sensing |

**Video Only mode works without any hardware** — just a webcam — making the demo
instantly accessible for anyone wanting to try it.
### File Layout

```
examples/wasm-browser-pose/
├── index.html               # Single-page app (vanilla JS, no bundler)
├── js/
│   ├── app.js               # Main entry, mode selection, orchestration
│   ├── video-capture.js     # getUserMedia, frame extraction, quality detection
│   ├── csi-processor.js     # WebSocket CSI client, frame parsing, pseudo-image encoding
│   ├── fusion.js            # Attention-weighted embedding fusion, confidence gating
│   ├── pose-decoder.js      # Fused embedding → 17 keypoints
│   └── canvas-renderer.js   # Video overlay, skeleton, CSI heatmap, confidence bars
├── data/
│   ├── visual-weights.json  # Visual CNN → embedding projection (placeholder until trained)
│   ├── csi-weights.json     # CSI CNN → embedding projection (placeholder until trained)
│   ├── fusion-weights.json  # Attention fusion α weights (512 values)
│   └── pose-weights.json    # Fused embedding → keypoint projection
├── css/
│   └── style.css            # Dark theme UI styling
├── pkg/                     # Built WASM packages (gitignored, built by script)
│   ├── wifi_densepose_wasm/
│   └── ruvector_cnn_wasm/
├── build.sh                 # wasm-pack build script for both packages
└── README.md                # Setup and usage instructions
```
### Build Pipeline

```bash
#!/bin/bash
# build.sh — builds both WASM packages into pkg/

set -e

# Build wifi-densepose-wasm (CSI processing)
wasm-pack build ../../rust-port/wifi-densepose-rs/crates/wifi-densepose-wasm \
  --target web --out-dir "$(pwd)/pkg/wifi_densepose_wasm" --no-typescript

# Build ruvector-cnn-wasm (CNN inference for both video and CSI)
wasm-pack build ../../vendor/ruvector/crates/ruvector-cnn-wasm \
  --target web --out-dir "$(pwd)/pkg/ruvector_cnn_wasm" --no-typescript

echo "Build complete. Serve with: python3 -m http.server 8080"
```
### UI Layout

```
┌─────────────────────────────────────────────────────────┐
│ WiFi-DensePose — Live Dual-Modal Pose Estimation        │
│ [Dual Mode ▼]  [⚙ Settings]          FPS: 28  ◉ Live    │
├───────────────────────────┬─────────────────────────────┤
│                           │                             │
│  ┌───────────────────┐    │   ┌───────────────────┐     │
│  │                   │    │   │                   │     │
│  │ Video + Skeleton  │    │   │   CSI Heatmap     │     │
│  │     Overlay       │    │   │   (amplitude ×    │     │
│  │  (main canvas)    │    │   │   subcarrier)     │     │
│  │                   │    │   │                   │     │
│  └───────────────────┘    │   └───────────────────┘     │
│                           │                             │
├───────────────────────────┴─────────────────────────────┤
│ Fusion Confidence: ████████░░ 78%                       │
│ Video: ██████████ 95%  │  CSI: ██████░░░░ 61%           │
├─────────────────────────────────────────────────────────┤
│  ┌─────────────────────────────────────────────────┐    │
│  │  Embedding Space (2D projection)                │    │
│  │      ·  ·    ·                                  │    │
│  │   ·  · ·  · · ·  (color = pose cluster)         │    │
│  │      ·  ·  · ·                                  │    │
│  └─────────────────────────────────────────────────┘    │
├─────────────────────────────────────────────────────────┤
│ Latency: Video 12ms │ CSI 8ms │ Fusion 1ms │ Total 21ms │
│ [▶ Record]  [📷 Snapshot]  [Confidence: ████ 0.6]       │
└─────────────────────────────────────────────────────────┘
```
### WASM Module Structure

| Package | Source Crate | Provides | Size (est.) |
|---------|-------------|----------|-------------|
| `wifi_densepose_wasm` | `wifi-densepose-wasm` | CSI frame parsing, signal processing, feature extraction | ~200KB |
| `ruvector_cnn_wasm` | `ruvector-cnn-wasm` | `WasmCnnEmbedder` (×2 instances), `SimdOps`, `LayerOps`, contrastive losses | ~150KB |

Two `WasmCnnEmbedder` instances are created — one for video frames, one for CSI pseudo-images.
They share the same WASM module but have independent state.
### Browser API Requirements

| API | Purpose | Required | Fallback |
|-----|---------|----------|----------|
| `getUserMedia` | Webcam capture | For video mode | CSI-only mode |
| WebAssembly | CNN inference | Yes | None (hard requirement) |
| WASM SIMD128 | Accelerated inference | No | Scalar fallback (~2× slower) |
| WebSocket | CSI data stream | For CSI mode | Video-only mode |
| Canvas 2D | Rendering | Yes | None |
| `requestAnimationFrame` | Render loop | Yes | `setTimeout` fallback |
| ES Modules | Code organization | Yes | None |

Target: Chrome 89+, Firefox 89+, Safari 15+, Edge 89+
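The `requestAnimationFrame` fallback row can be implemented with a tiny scheduler shim; the ~16 ms interval (≈60 Hz) is an illustrative choice:

```javascript
// Pick the render-loop scheduler: requestAnimationFrame when available,
// otherwise a ~60 Hz setTimeout fallback (interval is an illustrative choice).
const scheduleFrame =
  typeof requestAnimationFrame === "function"
    ? requestAnimationFrame.bind(globalThis)
    : (cb) => setTimeout(() => cb(Date.now()), 16);

// Usage: a minimal render loop —
//   function tick(ts) { /* capture, embed, fuse, render */ scheduleFrame(tick); }
//   scheduleFrame(tick);
```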
### Performance Budget

| Stage | Target Latency | Notes |
|-------|---------------|-------|
| Video frame capture + resize | <3ms | `drawImage` to offscreen canvas |
| Video CNN embedding | <15ms | 224×224 RGB → 512-dim |
| CSI receive + parse | <2ms | Binary WebSocket message |
| CSI pseudo-image encoding | <3ms | Amplitude/phase/delta channels |
| CSI CNN embedding | <15ms | 224×224 pseudo-RGB → 512-dim |
| Attention fusion | <1ms | Element-wise weighted sum |
| Pose decoding | <1ms | Linear projection |
| Canvas overlay render | <3ms | Video + skeleton + heatmap |
| **Total (dual mode)** | **<33ms** | **30 FPS capable** |
| **Total (video only)** | **<22ms** | **45 FPS capable** |

Note: Video and CSI CNN pipelines can run in parallel using Web Workers,
reducing dual-mode latency to ~max(15, 15) + 5 = ~20ms (50 FPS).
### Contrastive Learning Integration

The demo optionally shows real-time contrastive learning in the browser:

- **InfoNCE loss** (`WasmInfoNCELoss`): Compare video vs CSI embeddings for the same pose — trains cross-modal alignment
- **Triplet loss** (`WasmTripletLoss`): Push apart different poses, pull together same pose across modalities
- **SimdOps**: Accelerated dot products for real-time similarity computation
- **Embedding space panel**: Live 2D projection shows video and CSI embeddings converging when viewing the same person
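For the similarity display, a plain-JS cosine similarity works as a stand-in for the `SimdOps`-accelerated dot products (this sketch is illustrative; the demo would call into WASM instead):

```javascript
// Cross-modal similarity: cosine similarity between the video and CSI
// embeddings for the same frame. Returns a value in [-1, 1]; 1 means the
// two modalities agree perfectly, 0 means they are orthogonal.
function cosineSimilarity(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  const denom = Math.sqrt(na) * Math.sqrt(nb);
  return denom === 0 ? 0 : dot / denom; // zero vectors → 0, not NaN
}
```

Tracking this number over time is exactly what the embedding space panel visualizes: as cross-modal training improves, the per-frame similarity trends toward 1.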
### Relationship to Existing Crates

| Existing Crate | Role in This Demo |
|---------------|-------------------|
| `ruvector-cnn-wasm` | CNN inference for **both** video frames and CSI pseudo-images |
| `wifi-densepose-wasm` | CSI frame parsing and signal processing |
| `wifi-densepose-sensing-server` | WebSocket CSI data source |
| `wifi-densepose-core` | ADR-018 frame format definitions |
| `ruvector-cnn` | Underlying MobileNet-V3, layers, contrastive learning |

No new Rust crates are needed. The example is pure HTML/JS consuming existing WASM packages.
## Consequences

### Positive

- **Instant demo**: Video-only mode works with just a webcam — no ESP32 needed
- **Multi-modal showcase**: Demonstrates camera + WiFi fusion, the core innovation of the project
- **Graceful degradation**: Works with video-only, CSI-only, or both
- **Through-wall capability**: CSI mode shows pose estimation where cameras cannot reach
- **Zero-install**: Anyone with a browser can try it
- **Training data collection**: Can record paired (video, CSI) data for offline model training
- **Reusable**: JS modules embed directly in the Tauri desktop app's webview

### Negative

- **Model weights**: Requires offline-trained weights for visual CNN, CSI CNN, fusion, and pose decoder (~200KB total JSON)
- **WASM size**: Two WASM modules total ~350KB (acceptable)
- **No GPU**: CPU-only WASM inference; adequate at 224×224 but limits resolution scaling
- **Camera privacy**: Video mode requires camera permission (mitigated: CSI-only mode available)
- **Two CNN instances**: Memory footprint doubles vs single-modal (~10MB total, acceptable for desktop browsers)

### Risks

- **Cross-modal alignment**: Video and CSI embeddings must be trained jointly for fusion to work; without proper training, fusion may be worse than either modality alone
- **Latency on mobile**: Dual CNN on mobile browsers may exceed 33ms; implement automatic quality reduction
- **WebSocket drops**: Network jitter → CSI frame gaps; buffer last 3 frames, interpolate missing data
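The buffering/interpolation mitigation for WebSocket drops can be sketched as follows; the element-wise linear extrapolation from the last two frames is an assumption, not the shipped logic:

```javascript
// Gap handling for dropped CSI frames: keep the last few frames and
// synthesize a replacement when one is missing. Buffer size 3 matches
// the mitigation above.
const csiBuffer = [];

function pushCsiFrame(frame) {
  csiBuffer.push(frame);
  if (csiBuffer.length > 3) csiBuffer.shift(); // keep only the last 3
}

function interpolateMissingFrame() {
  if (csiBuffer.length < 2) return csiBuffer[csiBuffer.length - 1] ?? null;
  const a = csiBuffer[csiBuffer.length - 2];
  const b = csiBuffer[csiBuffer.length - 1];
  // Extrapolate one step forward: b + (b - a), element-wise.
  return Float32Array.from(b, (v, i) => v + (v - a[i]));
}
```

Feeding the extrapolated frame into the CSI pipeline keeps the fusion cadence steady across a one-frame gap instead of stalling the whole tick.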
## Implementation Plan

1. **Phase 1 — Scaffold**: File layout, build.sh, index.html shell, mode selector UI
2. **Phase 2 — Video pipeline**: getUserMedia → frame capture → CNN embedding → basic pose display
3. **Phase 3 — CSI pipeline**: WebSocket client → CSI parsing → pseudo-image → CNN embedding
4. **Phase 4 — Fusion**: Attention-weighted combination, confidence gating, mode switching
5. **Phase 5 — Pose decoder**: Linear projection with placeholder weights → 17 keypoints
6. **Phase 6 — Overlay renderer**: Video canvas with skeleton overlay, CSI heatmap panel
7. **Phase 7 — Training**: Use `wifi-densepose-train` to generate real weights for both CNNs + fusion + decoder
8. **Phase 8 — Contrastive demo**: Embedding space visualization, cross-modal similarity display
9. **Phase 9 — Web Workers**: Move CNN inference to workers for parallel video + CSI processing
10. **Phase 10 — Polish**: Recording, snapshots, adaptive quality, mobile optimization
## Alternatives Considered

### 1. CSI-Only (No Video)

Rejected: Misses the opportunity to show multi-modal fusion and makes the demo less
accessible (requires ESP32 hardware). Video-only mode as a fallback is strictly better.

### 2. Server-Side Video Inference

Rejected: Adds latency, requires webcam stream upload (privacy concern), and defeats
the WASM-first architecture. All inference must be client-side.

### 3. TensorFlow.js for Video, ruvector-cnn-wasm for CSI

Rejected: Would require two different ML frameworks. Using `ruvector-cnn-wasm` for both
keeps a single WASM module, unified embedding space, and simpler fusion.

### 4. Pre-recorded Video Demo

Rejected: Live webcam input is far more compelling for demonstrations.
Pre-recorded mode can be added as a secondary option.

### 5. React/Vue Framework

Rejected: Adds build tooling. Vanilla JS + ES modules keeps the demo self-contained.
## References

- [ADR-018: Binary CSI Frame Format](ADR-018-binary-csi-frame-format.md)
- [ADR-024: Contrastive CSI Embedding / AETHER](ADR-024-contrastive-csi-embedding.md)
- [ADR-055: Integrated Sensing Server](ADR-055-integrated-sensing-server.md)
- `vendor/ruvector/crates/ruvector-cnn/src/lib.rs` — CNN embedder implementation
- `vendor/ruvector/crates/ruvector-cnn-wasm/src/lib.rs` — WASM bindings
- `vendor/ruvector/examples/wasm-vanilla/index.html` — Reference vanilla JS WASM pattern
- Person-in-WiFi: Fine-grained Person Perception using WiFi (ICCV 2019) — camera+WiFi fusion precedent
- WiPose: Multi-Person WiFi Pose Estimation (TMC 2022) — cross-modal embedding approach