mirror of
https://github.com/ruvnet/RuView.git
synced 2026-04-28 05:59:32 +00:00
feat(demo): wire all 6 RuVector WASM attention mechanisms into pose fusion
* feat: dual-modal WASM browser pose estimation demo (ADR-058)

  Live webcam video + WiFi CSI fusion for real-time pose estimation. Two parallel CNN pipelines (ruvector-cnn-wasm) with attention-weighted fusion and dynamic confidence gating. Three modes: Dual, Video-only, CSI-only. Includes pre-built WASM package (~52KB) for browser deployment.

  - ADR-058: Dual-modal architecture design
  - ui/pose-fusion.html: Main demo page with dark theme UI
  - 7 JS modules: video-capture, csi-simulator, cnn-embedder, fusion-engine, pose-decoder, canvas-renderer, main orchestrator
  - Pre-built ruvector-cnn-wasm WASM package for browser
  - CSI heatmap, embedding space visualization, latency metrics
  - WebSocket support for live ESP32 CSI data
  - Navigation link added to main dashboard

  Co-Authored-By: claude-flow <ruv@ruv.net>

* fix: motion-responsive skeleton + through-wall CSI tracking

  - Pose decoder now uses per-cell motion grid to track actual arm/head positions — raising arms moves the skeleton's arms, head follows lateral movement
  - Motion grid (10x8 cells) tracks intensity per body zone: head, left/right arm upper/mid, legs
  - Through-wall mode: when person exits frame, CSI maintains presence with slow decay (~10s) and skeleton drifts in exit direction
  - CSI simulator persists sensing after video loss, ghost pose renders with decreasing confidence
  - Reduced temporal smoothing (0.45) for faster response to movement

  Co-Authored-By: claude-flow <ruv@ruv.net>

* fix: video fills available space + correct WASM path resolution

  - Remove fixed aspect-ratio and max-height from video panel so it fills the available viewport space without scrolling
  - Grid uses 1fr row for content area, overflow:hidden on main grid
  - Fix WASM path: resolve relative to JS module file using import.meta.url instead of hardcoded ./pkg/ which resolved incorrectly on gh-pages
  - Responsive: mobile still gets aspect-ratio constraint

  Co-Authored-By: claude-flow <ruv@ruv.net>

* feat: live ESP32 CSI pipeline + auto-connect WebSocket

  - Add auto-connect to local sensing server WebSocket (ws://localhost:8765)
  - Demo shows "Live ESP32" when connected to real CSI data
  - Add build_firmware.ps1 for native Windows ESP-IDF builds (no Docker)
  - Add read_serial.ps1 for ESP32 serial monitor

  Pipeline: ESP32 → UDP:5005 → sensing-server → WS:8765 → browser demo

  Co-Authored-By: claude-flow <ruv@ruv.net>

* docs: add ADR-059 live ESP32 CSI pipeline + update README with demo links

  - ADR-059: Documents end-to-end ESP32 → sensing server → browser pipeline
  - README: Add dual-modal pose fusion demo link, update ADR count to 49
  - References issue #245

  Co-Authored-By: claude-flow <ruv@ruv.net>

* feat: RSSI visualization, RuVector attention WASM, cache-bust fixes

  - Add animated RSSI Signal Strength panel with sparkline history
  - Fix RuVector WasmMultiHeadAttention retptr calling convention
  - Wire up RuVector Multi-Head + Flash Attention in CNN embedder
  - Add ambient temporal drift to CSI simulator for visible heatmap animation
  - Fix embedding space projection (sparse projection replaces cancelling sum)
  - Add auto-scaling to embedding space renderer
  - Add cache busters (?v=4) to all ES module imports to prevent stale caches
  - Add diagnostic logging for module version verification
  - Add RSSI tracking with quality labels and color-coded dBm display
  - Includes ruvector-attention-wasm v2.0.5 browser ESM wrapper

  Co-Authored-By: claude-flow <ruv@ruv.net>

* feat: 26-keypoint dexterous pose + full RuVector attention pipeline

  Pose Decoder (17 → 26 keypoints):
  - Add finger approximations: thumb, index, pinky per hand (6 new)
  - Add toe tips: left/right foot index (2 new)
  - Add neck keypoint (1 new)
  - Hand openness driven by arm motion intensity
  - Finger positions computed from wrist-elbow axis angles

  CNN Embedder (full RuVector WASM pipeline):
  - Stage 1: Multi-Head Attention (global spatial reasoning)
  - Stage 2: Hyperbolic Attention (hierarchical body-part tree)
  - Stage 3: MoE Attention (3 experts: upper/lower/extremities, top-2)
  - Blended 40/30/30 weighting → final embedding projection

  Canvas Renderer:
  - Magenta finger joints with distinct glow
  - Cyan toe tips
  - White neck keypoint
  - Thinner limb lines for hand/foot connections
  - Joint count shown in overlay label

  CSI Simulator:
  - Skip synthetic person state when live ESP32 connected
  - Only simulate CSI data in demo mode (was already correct)

  Embedding Space:
  - Fixed projection: sparse 8-dim projection replaces cancelling sum
  - Auto-scaling normalizes point spread to fill canvas

  Cache busters bumped to v=5 on all imports.

  Co-Authored-By: claude-flow <ruv@ruv.net>

* fix: centroid-based pose tracking for responsive limb movement

  Rewrites pose decoder from intensity-based to position-based tracking:
  - Arms now track toward motion centroid in each body zone
  - Elbow/wrist positions computed along shoulder→centroid vector
  - Legs track toward lower-body zone centroids
  - Smoothing reduced from 0.45 to 0.25 for responsiveness
  - Zone centroids blend 30% old / 70% new each frame

  6 body zones with overlapping coverage:
  - Head (top 20%, center cols)
  - Left/Right Arm (rows 10-60%, outer cols)
  - Torso (rows 15-55%, center cols)
  - Left/Right Leg (rows 50-100%, half cols each)

  Hand openness now driven by arm spread distance + raise amount.

  Cache busters v=6.

  Co-Authored-By: claude-flow <ruv@ruv.net>

* fix: remove duplicate lAnkleX/rAnkleX declarations in pose-decoder

  Stale code block from old intensity-based tracking was left behind, re-declaring variables already defined by centroid-based tracking.

  Co-Authored-By: claude-flow <ruv@ruv.net>

* feat(demo): wire all 6 RuVector WASM attention mechanisms into pose fusion

  - Add WasmLinearAttention and WasmLocalGlobalAttention to browser ESM wrapper
  - Add 6 WASM utility functions (batch_normalize, pairwise_distances, etc.)
  - Extend CnnEmbedder to 6-stage pipeline: Flash → MHA → Hyperbolic → Linear → MoE → L+G
  - Use log-energy softmax blending across all 6 stages
  - Wire WASM cosine_similarity and normalize into FusionEngine
  - Add RuVector pipeline stats panel to UI (energy, refinement, pose impact)
  - Compute embedding-to-joint mapping stats without modifying joint positions
  - Center camera prompt with flexbox layout
  - Add cache busters v=12

  Co-Authored-By: claude-flow <ruv@ruv.net>
This commit is contained in:
parent
6e03a47867
commit
7c1351fd5d
27 changed files with 7428 additions and 1 deletions
392
docs/adr/ADR-058-ruvector-wasm-browser-pose-example.md
Normal file
# ADR-058: Dual-Modal WASM Browser Pose Estimation — Live Video + WiFi CSI Fusion

- **Status**: Proposed
- **Date**: 2026-03-12
- **Deciders**: ruv
- **Tags**: wasm, browser, cnn, pose-estimation, ruvector, video, multimodal, fusion

## Context

WiFi-DensePose estimates human poses from WiFi CSI (Channel State Information).
The `ruvector-cnn` crate provides a pure Rust CNN (MobileNet-V3) with WASM bindings.
Both modalities exist independently — what's missing is **fusing live webcam video
with WiFi CSI** in a single browser demo to achieve robust pose estimation that
works even when one modality degrades (occlusion, signal noise, poor lighting).

Existing assets:

1. **`wifi-densepose-wasm`** — CSI signal processing compiled to WASM
2. **`wifi-densepose-sensing-server`** — Axum server streaming live CSI via WebSocket
3. **`ruvector-cnn`** — Pure Rust CNN with MobileNet-V3 backbones, SIMD, contrastive learning
4. **`ruvector-cnn-wasm`** — wasm-bindgen bindings: `WasmCnnEmbedder`, `SimdOps`, `LayerOps`, contrastive losses
5. **`vendor/ruvector/examples/wasm-vanilla/`** — Reference vanilla JS WASM example

Research shows multi-modal fusion (camera + WiFi) significantly outperforms either alone:

- Camera fails under occlusion, poor lighting, privacy constraints
- WiFi CSI fails with signal noise, multipath, low spatial resolution
- Fusion compensates: WiFi provides through-wall coverage, camera provides fine-grained detail
## Decision

Build a **dual-modal browser demo** at `examples/wasm-browser-pose/` that:

1. Captures **live webcam video** via `getUserMedia` API
2. Receives **live WiFi CSI** via WebSocket from the sensing server
3. Processes **both streams** through separate CNN pipelines in `ruvector-cnn-wasm`
4. **Fuses embeddings** with learned attention weights for combined pose estimation
5. Renders **video overlay** with skeleton + WiFi confidence heatmap on Canvas
6. Runs entirely in the browser — all inference client-side via WASM

### Architecture

```
┌──────────────────────────────────────────────────────────────────┐
│                             Browser                              │
│                                                                  │
│  ┌────────────┐    ┌────────────────┐    ┌───────────────────┐   │
│  │getUserMedia│───▶│  Video Frame   │───▶│     CNN WASM      │   │
│  │  (Webcam)  │    │    Capture     │    │ (Visual Embedder) │   │
│  └────────────┘    │  224×224 RGB   │    │     → 512-dim     │   │
│                    └────────────────┘    └────────┬──────────┘   │
│                                                   │              │
│                                          visual_embedding        │
│                                                   │              │
│                                            ┌──────▼──────┐       │
│  ┌────────────┐    ┌────────────────┐      │             │       │
│  │ WebSocket  │───▶│   CSI WASM     │      │  Attention  │       │
│  │  Client    │    │  (densepose-   │      │   Fusion    │       │
│  │            │    │     wasm)      │      │   Module    │       │
│  └────────────┘    └───────┬────────┘      │             │       │
│                            │               └──────┬──────┘       │
│                    ┌───────▼────────┐             │              │
│                    │   CNN WASM     │      fused_embedding       │
│                    │ (CSI Embedder) │             │              │
│                    │   → 512-dim    │      ┌──────▼──────┐       │
│                    └───────┬────────┘      │    Pose     │       │
│                            │               │   Decoder   │       │
│                     csi_embedding          │  → 17 kpts  │       │
│                            │               └──────┬──────┘       │
│                            └──────────────────────┘              │
│                                                   │              │
│          ┌──────────────┐               ┌─────▼──────┐           │
│          │ Video Canvas │◀────────│       Overlay    │           │
│          │ + Skeleton   │               │  Renderer  │           │
│          │ + Heatmap    │               └────────────┘           │
│          └──────────────┘                                        │
│                                                                  │
└──────────────────────────────────────────────────────────────────┘
         ▲                           ▲
         │ getUserMedia              │ WebSocket
         │ (camera)                  │ (ws://host:3030/ws/csi)
         │                           │
    ┌────┴────┐              ┌───────┴─────────┐
    │ Webcam  │              │ Sensing Server  │
    └─────────┘              └─────────────────┘
```
### Dual Pipeline Design

Two parallel CNN pipelines run on each frame tick (~30 FPS):

| Pipeline | Input | Preprocessing | CNN Config | Output |
|----------|-------|---------------|------------|--------|
| **Visual** | Webcam frame (640×480) | Resize to 224×224 RGB, ImageNet normalize | MobileNet-V3 Small, 512-dim | Visual embedding |
| **CSI** | CSI frame (ADR-018 binary) | Amplitude/phase/delta → 224×224 pseudo-RGB | MobileNet-V3 Small, 512-dim | CSI embedding |

Both use the same `WasmCnnEmbedder` but with separate instances and weight sets.
### Fusion Strategy

**Learned attention-weighted fusion** combines the two 512-dim embeddings:

```javascript
// Attention fusion: learn which modality to trust per-dimension
// α ∈ [0,1]^512 — attention weights (shipped as JSON, trained offline)
// visual_emb, csi_emb ∈ R^512

function fuseEmbeddings(visual_emb, csi_emb, attention_weights) {
  const fused = new Float32Array(512);
  for (let i = 0; i < 512; i++) {
    const α = attention_weights[i];
    fused[i] = α * visual_emb[i] + (1 - α) * csi_emb[i];
  }
  return fused;
}
```
**Dynamic confidence gating** adjusts fusion based on signal quality:

| Condition | Behavior |
|-----------|----------|
| Good video + good CSI | Balanced fusion (α ≈ 0.5) |
| Poor lighting / occlusion | CSI-dominant (α → 0, WiFi takes over) |
| CSI noise / no ESP32 | Video-dominant (α → 1, camera only) |
| Video-only mode (no WiFi) | α = 1.0, pure visual CNN pose estimation |
| CSI-only mode (no camera) | α = 0.0, pure WiFi pose estimation |

Quality detection:

- **Video quality**: Frame brightness variance (dark = low quality), motion blur score
- **CSI quality**: Signal-to-noise ratio from `wifi-densepose-wasm`, coherence gate output
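The gating table above can be expressed as a small pure function. This is an illustrative sketch, not the shipped fusion-engine code; the proportional blend and the `mode` strings are assumptions for the example:

```javascript
// Sketch of dynamic confidence gating (illustrative, not the shipped code).
// videoQuality / csiQuality are assumed to be scores in [0, 1] produced by
// the quality detectors described above.
function gatedAlpha(videoQuality, csiQuality, mode = "dual") {
  if (mode === "video-only") return 1.0; // pure visual CNN pose estimation
  if (mode === "csi-only") return 0.0;   // pure WiFi pose estimation
  const total = videoQuality + csiQuality;
  if (total === 0) return 0.5;           // no signal at all: stay balanced
  // Trust each modality in proportion to its quality score:
  // good video + good CSI → α ≈ 0.5; degraded video → α → 0; degraded CSI → α → 1.
  return videoQuality / total;
}
```

The returned α feeds directly into `fuseEmbeddings` above, so the gate costs nothing beyond two additions and a division per frame.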
### CSI-to-Image Encoding

CSI data encoded as 3-channel pseudo-image for the CSI CNN pipeline:

| Channel | Data | Normalization |
|---------|------|---------------|
| R | CSI amplitude (subcarrier × time window) | Min-max to [0, 255] |
| G | CSI phase (unwrapped, subcarrier × time window) | Min-max to [0, 255] |
| B | Temporal difference (frame-to-frame Δ amplitude) | Abs, min-max to [0, 255] |
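A minimal sketch of this encoding, assuming the amplitude, phase, and delta planes arrive as flat `Float32Array`s of equal length (the function names here are illustrative, not the demo's actual module API):

```javascript
// Min-max normalize a plane to [0, 255] (flat input maps to 0 to avoid NaN).
function minMaxTo255(src) {
  let lo = Infinity, hi = -Infinity;
  for (const v of src) { if (v < lo) lo = v; if (v > hi) hi = v; }
  const range = hi - lo || 1;            // guard divide-by-zero on flat input
  return Float32Array.from(src, (v) => ((v - lo) / range) * 255);
}

// Pack amplitude / phase / |delta| into interleaved RGB bytes, per the table.
function encodePseudoImage(amp, phase, delta) {
  const r = minMaxTo255(amp);
  const g = minMaxTo255(phase);                 // phase assumed pre-unwrapped
  const b = minMaxTo255(delta.map(Math.abs));   // abs before min-max
  const rgb = new Uint8Array(amp.length * 3);
  for (let i = 0; i < amp.length; i++) {
    rgb[3 * i] = r[i];
    rgb[3 * i + 1] = g[i];
    rgb[3 * i + 2] = b[i];
  }
  return rgb;
}
```

The resulting `Uint8Array` has the same RGB layout as a captured webcam frame, so both pipelines can hand the same buffer shape to the CNN embedder.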
### Video Processing

Webcam frames processed through standard ImageNet pipeline:

```javascript
// Capture frame from video element
const frame = captureVideoFrame(videoElement, 224, 224); // Returns Uint8Array RGB

// ImageNet normalization happens inside WasmCnnEmbedder.extract()
const visual_embedding = visual_embedder.extract(frame, 224, 224);
```
### Pose Keypoint Mapping

17 COCO-format keypoints decoded from the fused 512-dim embedding:

```
0: nose              1: left_eye          2: right_eye
3: left_ear          4: right_ear         5: left_shoulder
6: right_shoulder    7: left_elbow        8: right_elbow
9: left_wrist        10: right_wrist      11: left_hip
12: right_hip        13: left_knee        14: right_knee
15: left_ankle       16: right_ankle
```

Each keypoint is decoded as (x, y, confidence), giving 17 × 3 = 51 values read
from the 512-dim embedding via a learned linear projection.
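A sketch of that projection, assuming a row-major 51×512 weight matrix and a 51-element bias as hinted by `pose-weights.json` (the exact weight-file format is not specified here):

```javascript
// Sketch of the pose decoder: learned linear projection from the fused
// embedding to 17 keypoints × (x, y, confidence) = 51 values.
// `weights` (51 × D, row-major) and `bias` (51) are assumed shapes.
function decodePose(fusedEmb, weights, bias) {
  const out = new Float32Array(51);
  for (let k = 0; k < 51; k++) {
    let acc = bias[k];
    for (let d = 0; d < fusedEmb.length; d++) {
      acc += weights[k * fusedEmb.length + d] * fusedEmb[d];
    }
    out[k] = acc;
  }
  // Group the 51 values into keypoints in COCO order 0..16.
  const keypoints = [];
  for (let k = 0; k < 17; k++) {
    keypoints.push({ x: out[3 * k], y: out[3 * k + 1], conf: out[3 * k + 2] });
  }
  return keypoints;
}
```

In the demo the embedding dimension is 512; the loop is written against `fusedEmb.length` so the same sketch works for any size.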
### Operating Modes

The demo supports three modes, selectable in the UI:

| Mode | Video | CSI | Fusion | Use Case |
|------|-------|-----|--------|----------|
| **Dual (default)** | ✅ | ✅ | Attention-weighted | Best accuracy, full demo |
| **Video Only** | ✅ | ❌ | α = 1.0 | No ESP32 available, quick demo |
| **CSI Only** | ❌ | ✅ | α = 0.0 | Privacy mode, through-wall sensing |

**Video Only mode works without any hardware** — just a webcam — making the demo
instantly accessible for anyone wanting to try it.
### File Layout

```
examples/wasm-browser-pose/
├── index.html               # Single-page app (vanilla JS, no bundler)
├── js/
│   ├── app.js               # Main entry, mode selection, orchestration
│   ├── video-capture.js     # getUserMedia, frame extraction, quality detection
│   ├── csi-processor.js     # WebSocket CSI client, frame parsing, pseudo-image encoding
│   ├── fusion.js            # Attention-weighted embedding fusion, confidence gating
│   ├── pose-decoder.js      # Fused embedding → 17 keypoints
│   └── canvas-renderer.js   # Video overlay, skeleton, CSI heatmap, confidence bars
├── data/
│   ├── visual-weights.json  # Visual CNN → embedding projection (placeholder until trained)
│   ├── csi-weights.json     # CSI CNN → embedding projection (placeholder until trained)
│   ├── fusion-weights.json  # Attention fusion α weights (512 values)
│   └── pose-weights.json    # Fused embedding → keypoint projection
├── css/
│   └── style.css            # Dark theme UI styling
├── pkg/                     # Built WASM packages (gitignored, built by script)
│   ├── wifi_densepose_wasm/
│   └── ruvector_cnn_wasm/
├── build.sh                 # wasm-pack build script for both packages
└── README.md                # Setup and usage instructions
```
### Build Pipeline

```bash
#!/bin/bash
# build.sh — builds both WASM packages into pkg/

set -e

# Build wifi-densepose-wasm (CSI processing)
wasm-pack build ../../rust-port/wifi-densepose-rs/crates/wifi-densepose-wasm \
  --target web --out-dir "$(pwd)/pkg/wifi_densepose_wasm" --no-typescript

# Build ruvector-cnn-wasm (CNN inference for both video and CSI)
wasm-pack build ../../vendor/ruvector/crates/ruvector-cnn-wasm \
  --target web --out-dir "$(pwd)/pkg/ruvector_cnn_wasm" --no-typescript

echo "Build complete. Serve with: python3 -m http.server 8080"
```
### UI Layout

```
┌─────────────────────────────────────────────────────────┐
│ WiFi-DensePose — Live Dual-Modal Pose Estimation        │
│ [Dual Mode ▼]  [⚙ Settings]          FPS: 28  ◉ Live    │
├───────────────────────────┬─────────────────────────────┤
│                           │                             │
│  ┌───────────────────┐    │   ┌───────────────────┐     │
│  │                   │    │   │                   │     │
│  │ Video + Skeleton  │    │   │   CSI Heatmap     │     │
│  │     Overlay       │    │   │   (amplitude ×    │     │
│  │  (main canvas)    │    │   │   subcarrier)     │     │
│  │                   │    │   │                   │     │
│  └───────────────────┘    │   └───────────────────┘     │
│                           │                             │
├───────────────────────────┴─────────────────────────────┤
│ Fusion Confidence: ████████░░ 78%                       │
│ Video: ██████████ 95%  │  CSI: ██████░░░░ 61%           │
├─────────────────────────────────────────────────────────┤
│  ┌─────────────────────────────────────────────────┐    │
│  │  Embedding Space (2D projection)                │    │
│  │      ·  ·    ·                                  │    │
│  │   ·  · ·  · · ·  (color = pose cluster)         │    │
│  │      ·  ·  · ·                                  │    │
│  └─────────────────────────────────────────────────┘    │
├─────────────────────────────────────────────────────────┤
│ Latency: Video 12ms │ CSI 8ms │ Fusion 1ms │ Total 21ms │
│ [▶ Record]  [📷 Snapshot]  [Confidence: ████ 0.6]       │
└─────────────────────────────────────────────────────────┘
```
### WASM Module Structure

| Package | Source Crate | Provides | Size (est.) |
|---------|-------------|----------|-------------|
| `wifi_densepose_wasm` | `wifi-densepose-wasm` | CSI frame parsing, signal processing, feature extraction | ~200KB |
| `ruvector_cnn_wasm` | `ruvector-cnn-wasm` | `WasmCnnEmbedder` (×2 instances), `SimdOps`, `LayerOps`, contrastive losses | ~150KB |

Two `WasmCnnEmbedder` instances are created — one for video frames, one for CSI pseudo-images.
They share the same WASM module but have independent state.
### Browser API Requirements

| API | Purpose | Required | Fallback |
|-----|---------|----------|----------|
| `getUserMedia` | Webcam capture | For video mode | CSI-only mode |
| WebAssembly | CNN inference | Yes | None (hard requirement) |
| WASM SIMD128 | Accelerated inference | No | Scalar fallback (~2× slower) |
| WebSocket | CSI data stream | For CSI mode | Video-only mode |
| Canvas 2D | Rendering | Yes | None |
| `requestAnimationFrame` | Render loop | Yes | `setTimeout` fallback |
| ES Modules | Code organization | Yes | None |

Target: Chrome 89+, Firefox 89+, Safari 15+, Edge 89+
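The `requestAnimationFrame` fallback row can be implemented with a tiny scheduler shim; the ~16 ms interval (≈60 Hz) is an illustrative choice:

```javascript
// Pick the render-loop scheduler: requestAnimationFrame when available,
// otherwise a ~60 Hz setTimeout fallback (interval is an illustrative choice).
const scheduleFrame =
  typeof requestAnimationFrame === "function"
    ? requestAnimationFrame.bind(globalThis)
    : (cb) => setTimeout(() => cb(Date.now()), 16);

// Usage: a minimal render loop —
//   function tick(ts) { /* capture, embed, fuse, render */ scheduleFrame(tick); }
//   scheduleFrame(tick);
```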
### Performance Budget

| Stage | Target Latency | Notes |
|-------|---------------|-------|
| Video frame capture + resize | <3ms | `drawImage` to offscreen canvas |
| Video CNN embedding | <15ms | 224×224 RGB → 512-dim |
| CSI receive + parse | <2ms | Binary WebSocket message |
| CSI pseudo-image encoding | <3ms | Amplitude/phase/delta channels |
| CSI CNN embedding | <15ms | 224×224 pseudo-RGB → 512-dim |
| Attention fusion | <1ms | Element-wise weighted sum |
| Pose decoding | <1ms | Linear projection |
| Canvas overlay render | <3ms | Video + skeleton + heatmap |
| **Total (dual mode)** | **<33ms** | **30 FPS capable** |
| **Total (video only)** | **<22ms** | **45 FPS capable** |

Note: Video and CSI CNN pipelines can run in parallel using Web Workers,
reducing dual-mode latency to ~max(15, 15) + 5 = ~20ms (50 FPS).
### Contrastive Learning Integration

The demo optionally shows real-time contrastive learning in the browser:

- **InfoNCE loss** (`WasmInfoNCELoss`): Compare video vs CSI embeddings for the same pose — trains cross-modal alignment
- **Triplet loss** (`WasmTripletLoss`): Push apart different poses, pull together same pose across modalities
- **SimdOps**: Accelerated dot products for real-time similarity computation
- **Embedding space panel**: Live 2D projection shows video and CSI embeddings converging when viewing the same person
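For the similarity display, a plain-JS cosine similarity works as a stand-in for the `SimdOps`-accelerated dot products (this sketch is illustrative; the demo would call into WASM instead):

```javascript
// Cross-modal similarity: cosine similarity between the video and CSI
// embeddings for the same frame. Returns a value in [-1, 1]; 1 means the
// two modalities agree perfectly, 0 means they are orthogonal.
function cosineSimilarity(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  const denom = Math.sqrt(na) * Math.sqrt(nb);
  return denom === 0 ? 0 : dot / denom; // zero vectors → 0, not NaN
}
```

Tracking this number over time is exactly what the embedding space panel visualizes: as cross-modal training improves, the per-frame similarity trends toward 1.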
### Relationship to Existing Crates

| Existing Crate | Role in This Demo |
|---------------|-------------------|
| `ruvector-cnn-wasm` | CNN inference for **both** video frames and CSI pseudo-images |
| `wifi-densepose-wasm` | CSI frame parsing and signal processing |
| `wifi-densepose-sensing-server` | WebSocket CSI data source |
| `wifi-densepose-core` | ADR-018 frame format definitions |
| `ruvector-cnn` | Underlying MobileNet-V3, layers, contrastive learning |

No new Rust crates are needed. The example is pure HTML/JS consuming existing WASM packages.
## Consequences

### Positive

- **Instant demo**: Video-only mode works with just a webcam — no ESP32 needed
- **Multi-modal showcase**: Demonstrates camera + WiFi fusion, the core innovation of the project
- **Graceful degradation**: Works with video-only, CSI-only, or both
- **Through-wall capability**: CSI mode shows pose estimation where cameras cannot reach
- **Zero-install**: Anyone with a browser can try it
- **Training data collection**: Can record paired (video, CSI) data for offline model training
- **Reusable**: JS modules embed directly in the Tauri desktop app's webview

### Negative

- **Model weights**: Requires offline-trained weights for visual CNN, CSI CNN, fusion, and pose decoder (~200KB total JSON)
- **WASM size**: Two WASM modules total ~350KB (acceptable)
- **No GPU**: CPU-only WASM inference; adequate at 224×224 but limits resolution scaling
- **Camera privacy**: Video mode requires camera permission (mitigated: CSI-only mode available)
- **Two CNN instances**: Memory footprint doubles vs single-modal (~10MB total, acceptable for desktop browsers)

### Risks

- **Cross-modal alignment**: Video and CSI embeddings must be trained jointly for fusion to work; without proper training, fusion may be worse than either modality alone
- **Latency on mobile**: Dual CNN on mobile browsers may exceed 33ms; implement automatic quality reduction
- **WebSocket drops**: Network jitter → CSI frame gaps; buffer last 3 frames, interpolate missing data
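The buffering/interpolation mitigation for WebSocket drops can be sketched as follows; the element-wise linear extrapolation from the last two frames is an assumption, not the shipped logic:

```javascript
// Gap handling for dropped CSI frames: keep the last few frames and
// synthesize a replacement when one is missing. Buffer size 3 matches
// the mitigation above.
const csiBuffer = [];

function pushCsiFrame(frame) {
  csiBuffer.push(frame);
  if (csiBuffer.length > 3) csiBuffer.shift(); // keep only the last 3
}

function interpolateMissingFrame() {
  if (csiBuffer.length < 2) return csiBuffer[csiBuffer.length - 1] ?? null;
  const a = csiBuffer[csiBuffer.length - 2];
  const b = csiBuffer[csiBuffer.length - 1];
  // Extrapolate one step forward: b + (b - a), element-wise.
  return Float32Array.from(b, (v, i) => v + (v - a[i]));
}
```

Feeding the extrapolated frame into the CSI pipeline keeps the fusion cadence steady across a one-frame gap instead of stalling the whole tick.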
## Implementation Plan

1. **Phase 1 — Scaffold**: File layout, build.sh, index.html shell, mode selector UI
2. **Phase 2 — Video pipeline**: getUserMedia → frame capture → CNN embedding → basic pose display
3. **Phase 3 — CSI pipeline**: WebSocket client → CSI parsing → pseudo-image → CNN embedding
4. **Phase 4 — Fusion**: Attention-weighted combination, confidence gating, mode switching
5. **Phase 5 — Pose decoder**: Linear projection with placeholder weights → 17 keypoints
6. **Phase 6 — Overlay renderer**: Video canvas with skeleton overlay, CSI heatmap panel
7. **Phase 7 — Training**: Use `wifi-densepose-train` to generate real weights for both CNNs + fusion + decoder
8. **Phase 8 — Contrastive demo**: Embedding space visualization, cross-modal similarity display
9. **Phase 9 — Web Workers**: Move CNN inference to workers for parallel video + CSI processing
10. **Phase 10 — Polish**: Recording, snapshots, adaptive quality, mobile optimization
## Alternatives Considered

### 1. CSI-Only (No Video)

Rejected: Misses the opportunity to show multi-modal fusion and makes the demo less
accessible (requires ESP32 hardware). Video-only mode as a fallback is strictly better.

### 2. Server-Side Video Inference

Rejected: Adds latency, requires webcam stream upload (privacy concern), and defeats
the WASM-first architecture. All inference must be client-side.

### 3. TensorFlow.js for Video, ruvector-cnn-wasm for CSI

Rejected: Would require two different ML frameworks. Using `ruvector-cnn-wasm` for both
keeps a single WASM module, unified embedding space, and simpler fusion.

### 4. Pre-recorded Video Demo

Rejected: Live webcam input is far more compelling for demonstrations.
Pre-recorded mode can be added as a secondary option.

### 5. React/Vue Framework

Rejected: Adds build tooling. Vanilla JS + ES modules keeps the demo self-contained.
## References

- [ADR-018: Binary CSI Frame Format](ADR-018-binary-csi-frame-format.md)
- [ADR-024: Contrastive CSI Embedding / AETHER](ADR-024-contrastive-csi-embedding.md)
- [ADR-055: Integrated Sensing Server](ADR-055-integrated-sensing-server.md)
- `vendor/ruvector/crates/ruvector-cnn/src/lib.rs` — CNN embedder implementation
- `vendor/ruvector/crates/ruvector-cnn-wasm/src/lib.rs` — WASM bindings
- `vendor/ruvector/examples/wasm-vanilla/index.html` — Reference vanilla JS WASM pattern
- Person-in-WiFi: Fine-grained Person Perception using WiFi (ICCV 2019) — camera+WiFi fusion precedent
- WiPose: Multi-Person WiFi Pose Estimation (TMC 2022) — cross-modal embedding approach