ADR-079: Camera Ground-Truth Training Pipeline

  • Status: Accepted
  • Date: 2026-04-06
  • Deciders: ruv
  • Relates to: ADR-072 (WiFlow Architecture), ADR-070 (Self-Supervised Pretraining), ADR-071 (ruvllm Training Pipeline), ADR-024 (AETHER Contrastive), ADR-064 (Multimodal Ambient Intelligence), ADR-075 (MinCut Person Separation)

Context

WiFlow (ADR-072) currently trains without ground-truth pose labels, using proxy poses generated from presence/motion heuristics. This produces a PCK@20 of only 2.5% — far below the 30-50% achievable with supervised training. The fundamental bottleneck is the absence of spatial keypoint labels.

Academic WiFi pose estimation systems (Wi-Pose, Person-in-WiFi 3D, MetaFi++) all train with synchronized camera ground truth and achieve PCK@20 of 40-85%. They discard the camera at deployment — the camera is a training-time teacher, not a runtime dependency.

ADR-064 already identified this: "Record CSI + mmWave while performing signs with a camera as ground truth, then deploy camera-free." This ADR specifies the implementation.

Current Training Pipeline Gap

Current:  CSI amplitude → WiFlow → 17 keypoints (proxy-supervised, PCK@20 = 2.5%)
                                    ↑
                            Heuristic proxies:
                            - Standing skeleton when presence > 0.3
                            - Limb perturbation from motion energy
                            - No spatial accuracy

Target Pipeline

Training: CSI amplitude ──→ WiFlow ──→ 17 keypoints (camera-supervised, PCK@20 target: 35%+)
                                        ↑
          Laptop camera ──→ MediaPipe ──→ 17 COCO keypoints (ground truth)
                                        (time-synchronized, 30 fps)

Deploy:   CSI amplitude ──→ WiFlow ──→ 17 keypoints (camera-free, trained model only)

Decision

Build a camera ground-truth collection and training pipeline using the laptop webcam as a teacher signal. The camera is used only during training data collection and is not required at deployment.

Architecture Overview

┌─────────────────────────────────────────────────────────────────┐
│                    Data Collection Phase                         │
│                                                                 │
│  ESP32-S3 nodes ──UDP──→ Sensing Server ──→ CSI frames (.jsonl) │
│                              ↑ time sync                        │
│  Laptop Camera ──→ MediaPipe Pose ──→ Keypoints (.jsonl)        │
│                              ↑                                  │
│                     collect-ground-truth.py                      │
│                     (single orchestrator)                        │
└─────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────┐
│                    Training Phase                                │
│                                                                 │
│  Paired dataset: { csi_window[128,20], keypoints[17,2], conf }  │
│         ↓                                                       │
│  train-wiflow-supervised.js                                     │
│    Phase 1: Contrastive pretrain (ADR-072, reuse)               │
│    Phase 2: Supervised keypoint regression (NEW)                │
│    Phase 3: Fine-tune with bone constraints + confidence        │
│         ↓                                                       │
│  WiFlow model (1.8M params) → SafeTensors export                │
└─────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────┐
│                    Deployment (camera-free)                      │
│                                                                 │
│  ESP32-S3 CSI → Sensing Server → WiFlow inference → 17 keypoints│
│  (No camera. Trained model runs on CSI input only.)             │
└─────────────────────────────────────────────────────────────────┘

Component 1: scripts/collect-ground-truth.py

Single Python script that orchestrates synchronized capture from the laptop camera and the ESP32 CSI stream.

Dependencies: mediapipe, opencv-python, requests (all pip-installable, no GPU)

Capture flow:

```python
# Pseudocode (simplified; error handling and helper definitions omitted)
import time

import cv2
import mediapipe as mp
import requests

camera = cv2.VideoCapture(0)            # Laptop webcam
sensing_api = "http://localhost:3000"   # Sensing server

# Start CSI recording via existing API
requests.post(f"{sensing_api}/api/v1/recording/start")

while recording:
    ok, frame = camera.read()           # cv2 returns (success, frame)
    if not ok:
        continue
    t = time.time_ns()                  # Nanosecond timestamp

    # MediaPipe Pose expects RGB input: 33 landmarks → map to 17 COCO keypoints
    result = mp_pose.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    if result.pose_landmarks is None:
        continue                        # no person detected this frame
    keypoints_17 = map_mediapipe_to_coco(result.pose_landmarks)
    vis = [result.pose_landmarks.landmark[i].visibility for i in COCO_LANDMARK_IDS]
    confidence = sum(vis) / len(vis)    # mean visibility of the 17 mapped joints

    # Write to ground-truth JSONL (one line per frame)
    write_jsonl({
        "ts_ns": t,
        "keypoints": keypoints_17,      # [[x,y], ...] normalized [0,1]
        "confidence": confidence,       # 0-1, used for loss weighting
        "n_visible": sum(v > 0.5 for v in vis),
    })

    # Optional: show live preview with skeleton overlay
    if preview:
        draw_skeleton(frame, keypoints_17)
        cv2.imshow("Ground Truth", frame)
        cv2.waitKey(1)                  # required for imshow to refresh

# Stop CSI recording
requests.post(f"{sensing_api}/api/v1/recording/stop")
```

MediaPipe → COCO keypoint mapping:

| COCO Index | Joint | MediaPipe Index |
|---|---|---|
| 0 | Nose | 0 |
| 1 | Left Eye | 2 |
| 2 | Right Eye | 5 |
| 3 | Left Ear | 7 |
| 4 | Right Ear | 8 |
| 5 | Left Shoulder | 11 |
| 6 | Right Shoulder | 12 |
| 7 | Left Elbow | 13 |
| 8 | Right Elbow | 14 |
| 9 | Left Wrist | 15 |
| 10 | Right Wrist | 16 |
| 11 | Left Hip | 23 |
| 12 | Right Hip | 24 |
| 13 | Left Knee | 25 |
| 14 | Right Knee | 26 |
| 15 | Left Ankle | 27 |
| 16 | Right Ankle | 28 |
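The mapping above can be expressed as a single index list. A minimal numpy sketch (function and constant names are illustrative, not from the codebase):

```python
import numpy as np

# MediaPipe landmark index for each COCO keypoint (order = COCO 0..16),
# matching the table above
MP_TO_COCO = [0, 2, 5, 7, 8, 11, 12, 13, 14, 15, 16, 23, 24, 25, 26, 27, 28]

def map_mediapipe_to_coco(landmarks):
    """landmarks: array-like of shape [33, 3] as (x, y, visibility)."""
    lm = np.asarray(landmarks, dtype=np.float32)
    kp = lm[MP_TO_COCO, :2]     # [17, 2] normalized coordinates
    vis = lm[MP_TO_COCO, 2]     # per-joint visibility for loss weighting
    return kp, vis
```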

Component 2: Time Alignment (scripts/align-ground-truth.js)

CSI frames arrive at ~100 Hz with server-side timestamps. Camera keypoints arrive at ~30 fps with client-side timestamps. Alignment is needed because:

  1. Camera and sensing server clocks differ (typically < 50ms on LAN)
  2. CSI is aggregated into 20-frame windows for WiFlow input
  3. Ground-truth keypoints must be averaged over the same window

Alignment algorithm:

For each CSI window W_i (20 frames, ~200ms at 100Hz):
  t_start = W_i.first_frame.timestamp
  t_end   = W_i.last_frame.timestamp

  # Find all camera keypoints within this time window
  matching_keypoints = [k for k in camera_data if t_start <= k.ts <= t_end]

  if len(matching_keypoints) >= 3:   # At least 3 camera frames per window
    # Average keypoints, weighted by confidence
    avg_keypoints = weighted_mean(matching_keypoints, weights=confidences)
    avg_confidence = mean(confidences)

    paired_dataset.append({
      csi_window: W_i.amplitudes,    # [128, 20] float32
      keypoints: avg_keypoints,       # [17, 2] float32
      confidence: avg_confidence,     # scalar
      n_camera_frames: len(matching_keypoints),
    })
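The pairing step above can be sketched in numpy, assuming per-window CSI arrays and per-frame camera arrays are already loaded (all names hypothetical; the actual implementation lives in align-ground-truth.js):

```python
import numpy as np

def pair_window(window_amps, window_ts, cam_ts, cam_kp, cam_conf, min_frames=3):
    """Pair one CSI window with the camera frames inside its time span.

    window_amps: [128, 20] CSI amplitudes; window_ts: [20] timestamps (ns)
    cam_ts: [M] camera timestamps; cam_kp: [M, 17, 2]; cam_conf: [M]
    Returns a paired sample dict, or None if too few camera frames matched.
    """
    mask = (cam_ts >= window_ts[0]) & (cam_ts <= window_ts[-1])
    if mask.sum() < min_frames:
        return None
    kp, conf = cam_kp[mask], cam_conf[mask]
    w = conf / conf.sum()                    # confidence weights, sum to 1
    avg_kp = np.einsum("m,mjc->jc", w, kp)   # weighted mean over matched frames
    return {"csi": window_amps, "kp": avg_kp, "conf": float(conf.mean())}
```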

Clock sync strategy:

  • NTP is sufficient (< 20ms error on LAN)
  • The 200ms CSI window is 10x larger than typical clock drift
  • For tighter sync: use a handclap/jump as a sync marker — visible spike in both CSI motion energy and camera skeleton velocity. Auto-detect and align.

Output: data/recordings/paired-{timestamp}.jsonl — one line per paired sample:

{"csi": [128x20 flat], "kp": [[0.45,0.12], ...], "conf": 0.92, "ts": 1775300000000}

Component 3: Supervised Training (scripts/train-wiflow-supervised.js)

Extends the existing train-ruvllm.js pipeline with a supervised phase.

Phase 1: Contrastive Pretrain (reuse ADR-072)

  • Same as existing: temporal + cross-node triplets
  • Learns CSI representation without labels
  • 50 epochs, ~5 min on laptop

Phase 2: Supervised Keypoint Regression (NEW)

  • Load paired dataset from Component 2
  • Loss: confidence-weighted SmoothL1 on keypoints
L_supervised = (1/N) * sum_i [ conf_i * SmoothL1(pred_i, gt_i, beta=0.05) ]
  • Only train on samples where conf > 0.5 (discard frames where MediaPipe lost tracking)
  • Learning rate: 1e-4 with cosine decay
  • 200 epochs, ~15 min on laptop CPU (1.8M params, no GPU needed)
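The Phase 2 loss can be sketched in numpy as a reference implementation of the formula above (not the training script itself, which is JS):

```python
import numpy as np

def smooth_l1(pred, gt, beta=0.05):
    """Elementwise SmoothL1 (Huber) with quadratic-to-linear transition at beta."""
    d = np.abs(pred - gt)
    return np.where(d < beta, 0.5 * d**2 / beta, d - 0.5 * beta)

def supervised_loss(pred, gt, conf):
    """Confidence-weighted SmoothL1: pred, gt [N, 17, 2]; conf [N]."""
    per_sample = smooth_l1(pred, gt).mean(axis=(1, 2))   # [N]
    return float((conf * per_sample).mean())
```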

Phase 3: Refinement with Bone Constraints

  • Fine-tune with combined loss:
L = L_supervised + 0.3 * L_bone + 0.1 * L_temporal

L_bone     = (1/14) * sum_b (bone_len_b - prior_b)^2   # ADR-072 bone priors
L_temporal = SmoothL1(kp_t, kp_{t-1})                   # Temporal smoothness
  • 50 epochs at lower LR (1e-5)
  • Tighten bone constraint weight from 0.3 → 0.5 over epochs
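The bone term can be sketched as below. The edge list is an illustrative 14-bone COCO skeleton, assumed here because ADR-072's exact priors are not reproduced in this ADR:

```python
import numpy as np

# Assumed COCO skeleton edges for the 14 bone-length terms (illustrative)
BONES = [(5, 7), (7, 9), (6, 8), (8, 10), (5, 6), (11, 12), (5, 11),
         (6, 12), (11, 13), (13, 15), (12, 14), (14, 16), (0, 5), (0, 6)]

def bone_loss(kp, priors):
    """Mean squared deviation of predicted bone lengths from priors.

    kp: [17, 2] keypoints; priors: [14] expected normalized bone lengths.
    """
    lengths = np.array([np.linalg.norm(kp[a] - kp[b]) for a, b in BONES])
    return float(np.mean((lengths - priors) ** 2))
```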

Phase 4: Quantization + Export

  • Reuse ruvllm TurboQuant: float32 → int8 (4x smaller, ~881 KB)
  • Export via SafeTensors for cross-platform deployment
  • Validate quantized model PCK@20 within 2% of full-precision

Component 4: Evaluation Script (scripts/eval-wiflow.js)

Measure actual PCK@20 using held-out paired data (20% split).

PCK@k = (1/N) * sum_i [ (||pred_i - gt_i|| < k * torso_length) ? 1 : 0 ]
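A numpy sketch of the PCK@k formula above (a hypothetical helper matching the metric definition; PCK@20 corresponds to k = 0.20):

```python
import numpy as np

def pck(pred, gt, torso_len, k=0.20):
    """PCK@k: fraction of joints within k * torso_length of ground truth.

    pred, gt: [N, 17, 2]; torso_len: [N] per-sample torso length.
    """
    dist = np.linalg.norm(pred - gt, axis=-1)        # [N, 17] joint errors
    hits = dist < (k * torso_len)[:, None]           # broadcast per sample
    return float(hits.mean())
```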

Metrics reported:

| Metric | Description | Target |
|---|---|---|
| PCK@20 | % of keypoints within 20% torso length | > 35% |
| PCK@50 | % within 50% torso length | > 60% |
| MPJPE | Mean per-joint position error (pixels) | < 40px |
| Per-joint PCK | Breakdown by joint (wrists are hardest) | Report all 17 |
| Inference latency | Single-window prediction time | < 50ms |

Optimization Strategy

O1: Curriculum Learning

Train easy poses first, hard poses later:

| Stage | Epochs | Data Filter | Rationale |
|---|---|---|---|
| 1 | 50 | conf > 0.9, standing only | Establish stable skeleton baseline |
| 2 | 50 | conf > 0.7, low motion | Add sitting, subtle movements |
| 3 | 50 | conf > 0.5, all poses | Full dataset including occlusions |
| 4 | 50 | All data, with augmentation | Robustness via noise injection |

O2: Data Augmentation (CSI domain)

Augment CSI windows to increase effective dataset size without collecting more data:

| Augmentation | Implementation | Expected Gain |
|---|---|---|
| Time shift | Roll CSI window by ±2 frames | +30% data |
| Amplitude noise | Gaussian noise, sigma=0.02 | Robustness |
| Subcarrier dropout | Zero 10% of subcarriers randomly | Robustness |
| Temporal flip | Reverse window + reverse keypoint velocity | +100% data |
| Multi-node mix | Swap node CSI, keep same-time keypoints | Cross-node generalization |
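The first three augmentations can be sketched for a single [128, 20] CSI window (parameter defaults taken from the table):

```python
import numpy as np

rng = np.random.default_rng(0)

def augment_csi(csi, shift=2, noise_sigma=0.02, drop_frac=0.10):
    """Time shift, amplitude noise, and subcarrier dropout on one [128, 20] window."""
    out = np.roll(csi, rng.integers(-shift, shift + 1), axis=1)  # time shift ±2 frames
    out = out + rng.normal(0.0, noise_sigma, out.shape)          # Gaussian amplitude noise
    drop = rng.random(out.shape[0]) < drop_frac                  # ~10% subcarrier dropout
    out[drop, :] = 0.0
    return out
```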

O3: Knowledge Distillation from MediaPipe

Instead of raw keypoint regression, distill MediaPipe's confidence and heatmap information:

L_distill = KL_div(softmax(wifi_heatmap / T), softmax(camera_heatmap / T))
  • Temperature T=4 for soft targets (transfers inter-joint relationships)
  • WiFlow predicts a 17-channel heatmap [17, H, W] instead of direct [17, 2]
  • Argmax for final keypoint extraction
  • Trade-off: Adds ~200K params for heatmap decoder, but improves spatial precision
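The distillation loss can be sketched in numpy, following the ADR's argument order for the KL term (the heatmap shape and the epsilon smoothing are assumptions):

```python
import numpy as np

def softmax(x, T):
    z = x / T
    z -= z.max(axis=-1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distill_loss(wifi_hm, cam_hm, T=4.0, eps=1e-9):
    """KL divergence on temperature-softened heatmaps.

    wifi_hm, cam_hm: [17, H, W]; softmax is taken over the H*W spatial bins.
    """
    p = softmax(wifi_hm.reshape(17, -1), T)   # WiFi (student) distribution
    q = softmax(cam_hm.reshape(17, -1), T)    # camera (teacher) distribution
    # KL(p || q), matching the ADR's argument order; averaged over joints
    return float((p * np.log((p + eps) / (q + eps))).sum(axis=-1).mean())
```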

O4: Active Learning Loop

Identify which poses the model is worst at and collect more data for those:

1. Train initial model on first collection session
2. Run inference on new CSI data, compute prediction entropy
3. Flag high-entropy windows (model is uncertain)
4. During next collection, the preview overlay highlights these moments:
   "Hold this pose — model needs more examples"
5. Re-train with augmented dataset

Expected: 2-3 active-learning iterations before accuracy saturates.

O6: Subcarrier Selection (ruvector-solver)

Variance-based top-K subcarrier selection, equivalent to ruvector-solver's sparse interpolation (114→56). Removes noise/static subcarriers before training:

For each subcarrier d in [0, dim):
  variance[d] = mean over samples of temporal_variance(csi[d, :])
Select top-K by variance (K = dim * 0.5)

Validated: 128 → 56 subcarriers (56% input reduction), proportional model size reduction.
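A numpy sketch of the variance-based selection, assuming CSI stacked as [n_samples, dim, window]:

```python
import numpy as np

def select_subcarriers(csi, keep_frac=0.5):
    """Variance-based top-K subcarrier selection (O6).

    csi: [n_samples, dim, window] amplitudes. Returns the indices of the
    top dim*keep_frac subcarriers by mean temporal variance, sorted ascending.
    """
    var = csi.var(axis=2).mean(axis=0)       # [dim] mean temporal variance
    k = int(csi.shape[1] * keep_frac)
    return np.sort(np.argsort(var)[-k:])     # top-K, in index order
```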

O7: Attention-Weighted Subcarriers (ruvector-attention)

Compute per-subcarrier attention weights based on temporal energy correlation with ground-truth keypoint motion. High-energy subcarriers that covary with skeleton movement get amplified:

For each subcarrier d:
  energy[d] = sum of squared first-differences over time
  weight[d] = softmax(energy, temperature=0.1)
Apply: csi[d, :] *= weight[d] * dim  (mean weight = 1)

Validated: Top-5 attention subcarriers identified automatically per dataset.
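The energy-softmax weighting above can be sketched in numpy (window shape assumed as [dim, T]):

```python
import numpy as np

def attention_weights(csi_window, temperature=0.1):
    """Energy-based per-subcarrier attention (O7). csi_window: [dim, T]."""
    energy = np.square(np.diff(csi_window, axis=1)).sum(axis=1)  # sum of squared first-differences
    z = energy / temperature
    z -= z.max()                              # numerical stability
    w = np.exp(z) / np.exp(z).sum()           # softmax over subcarriers
    return w * csi_window.shape[0]            # rescale so the mean weight is 1

# Apply as: csi_window * attention_weights(csi_window)[:, None]
```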

O8: Stoer-Wagner MinCut Person Separation (ruvector-mincut / ADR-075)

JS implementation of the Stoer-Wagner algorithm for person separation in CSI, equivalent to DynamicPersonMatcher in wifi-densepose-train/src/metrics.rs. Builds a subcarrier correlation graph and finds the minimum cut to identify person-specific subcarrier clusters:

1. Build dim×dim Pearson correlation matrix across subcarriers
2. Run Stoer-Wagner min-cut on correlation graph
3. Partition subcarriers into person-specific groups
4. Train per-partition models for multi-person scenarios

Validated: Stoer-Wagner executes on 56-dim graph, identifies partition boundaries.
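As a prototype of the steps above, the correlation graph and min cut can be built with networkx's stoer_wagner (the production path is the JS implementation; absolute correlations are used here because Stoer-Wagner requires non-negative edge weights, an assumption not stated in the ADR):

```python
import numpy as np
import networkx as nx

def mincut_partition(csi):
    """Partition subcarriers via Stoer-Wagner min-cut on a correlation graph.

    csi: [n_samples, dim] amplitude features. Edge weights are absolute
    Pearson correlations between subcarriers.
    """
    corr = np.corrcoef(csi.T)                     # [dim, dim] Pearson correlation
    g = nx.Graph()
    dim = corr.shape[0]
    for i in range(dim):
        for j in range(i + 1, dim):
            g.add_edge(i, j, weight=abs(float(corr[i, j])))
    cut_value, (part_a, part_b) = nx.stoer_wagner(g)
    return cut_value, sorted(part_a), sorted(part_b)
```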

O9: Multi-SPSA Gradient Estimation

Average over K=3 random perturbation directions per gradient step. Averaging reduces the variance of the gradient estimate by a factor of K (standard deviation by sqrt(K) ≈ 1.73x) compared to single-sample SPSA, at 3x forward-pass cost (a net win for convergence quality):

For k in 1..K:
  delta_k = random ±1 per parameter
  grad_k = (loss(w + eps*delta_k) - loss(w - eps*delta_k)) / (2*eps*delta_k)
grad = mean(grad_1, ..., grad_K)
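A numpy sketch of the K-direction estimator above (the loss callable is supplied by the caller):

```python
import numpy as np

rng = np.random.default_rng(0)

def multi_spsa_grad(loss, w, eps=1e-3, k=3):
    """Average K two-sided SPSA gradient estimates (O9).

    loss: callable R^n -> scalar; w: [n] parameter vector.
    """
    grads = []
    for _ in range(k):
        delta = rng.choice([-1.0, 1.0], size=w.shape)   # Rademacher perturbation
        g = (loss(w + eps * delta) - loss(w - eps * delta)) / (2 * eps * delta)
        grads.append(g)
    return np.mean(grads, axis=0)
```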

O10: Mac M4 Pro Training via Tailscale

Training runs on Mac Mini M4 Pro (16-core GPU, ARM NEON SIMD) via Tailscale SSH, using ruvllm's native Node.js SIMD ops:

| | Windows (CPU) | Mac M4 Pro |
|---|---|---|
| Node.js | v24.12.0 (x86) | v25.9.0 (ARM) |
| SIMD | SSE4/AVX2 | NEON |
| Cores | Consumer laptop | 12P + 4E cores |
| Training | Slow (minutes/epoch) | Fast (seconds/epoch) |

O5: Cross-Environment Transfer

Train on one room, deploy in another:

| Strategy | Implementation |
|---|---|
| Room-invariant features | Normalize CSI by running mean/variance |
| LoRA adapters | Train a 4-rank LoRA per room (ADR-071) — 7.3 KB each |
| Few-shot calibration | 2 min of camera data in new room → fine-tune LoRA only |
| AETHER embeddings | Use contrastive room-independent features (ADR-024) as input |

The LoRA approach is most practical: ship a base model + collect 2 min of calibration data per new room using the laptop camera.

Data Collection Protocol

Recommended collection sessions per room:

| Session | Duration | Activity | People | Total CSI Frames |
|---|---|---|---|---|
| 1. Baseline | 5 min | Empty + 1 person entry/exit | 0-1 | 30,000 |
| 2. Standing poses | 5 min | Stand, arms up/down/sides, turn | 1 | 30,000 |
| 3. Sitting | 5 min | Sit, type, lean, stand up/sit down | 1 | 30,000 |
| 4. Walking | 5 min | Walk paths across room | 1 | 30,000 |
| 5. Mixed | 5 min | Varied activities, transitions | 1 | 30,000 |
| 6. Multi-person | 5 min | 2 people, varied activities | 2 | 30,000 |
| Total | 30 min | | | 180,000 |

At 20-frame windows: 9,000 paired training samples per 30-min session. With augmentation (O2): ~27,000 effective samples.

Camera placement: position laptop so the camera has a clear view of the sensing area. The camera FOV should cover the same space the ESP32 nodes cover.

File Structure

scripts/
  collect-ground-truth.py     # Camera capture + MediaPipe + CSI sync
  align-ground-truth.js       # Time-align CSI windows with camera keypoints
  train-wiflow-supervised.js  # Supervised training pipeline
  eval-wiflow.js              # PCK evaluation on held-out data

data/
  ground-truth/               # Raw camera keypoint captures
    gt-{timestamp}.jsonl
  paired/                     # Aligned CSI + keypoint pairs
    paired-{timestamp}.jsonl

models/
  wiflow-supervised/          # Trained model outputs
    wiflow-v1.safetensors
    wiflow-v1-int8.safetensors
    training-log.json
    eval-report.json

Privacy Considerations

  • Camera frames are processed locally by MediaPipe — no cloud upload
  • Raw video is never saved — only extracted keypoint coordinates are stored
  • The .jsonl ground-truth files contain only [x,y] joint coordinates, not images
  • The trained model runs on CSI only — no camera data leaves the laptop
  • Users can delete data/ground-truth/ after training; the model is self-contained

Consequences

Positive

  • 10-20x accuracy improvement: PCK@20 from 2.5% → 35%+ with real supervision
  • Reuses existing infrastructure: sensing server recording API, ruvllm training, SafeTensors
  • No new hardware: laptop webcam + existing ESP32 nodes
  • Privacy preserved at deployment: camera only needed during 30-min training session
  • Incremental: can improve with more collection sessions + active learning
  • Distributable: trained model weights can be shared on HuggingFace (ADR-070)

Negative

  • Camera placement matters: must see the same area ESP32 nodes sense
  • Single-room models: need LoRA calibration per room (2 min + camera)
  • MediaPipe limitations: occlusion, side views, multiple people reduce keypoint quality
  • Time sync: NTP drift can misalign frames (mitigated by 200ms windows)

Risks

| Risk | Probability | Impact | Mitigation |
|---|---|---|---|
| MediaPipe keypoints too noisy | Low | Medium | Filter by confidence; MediaPipe is robust indoors |
| Clock drift > 100ms | Low | High | Add handclap sync marker detection |
| Single camera can't see all poses | Medium | Medium | Position camera centrally; collect from 2 angles |
| Model overfits to one room | High | Medium | LoRA adapters + AETHER normalization (O5) |
| Insufficient data (< 5K pairs) | Low | High | Augmentation (O2) + active learning (O4) |

Implementation Plan

| Phase | Task | Effort | Status |
|---|---|---|---|
| P1 | collect-ground-truth.py — camera + MediaPipe capture | 2 hrs | Done |
| P2 | align-ground-truth.js — time alignment + pairing | 1 hr | Done |
| P3 | train-wiflow-supervised.js — supervised training | 3 hrs | Done |
| P4 | eval-wiflow.js — PCK evaluation | 1 hr | Done |
| P5 | ruvector optimizations (O6-O9) | 2 hrs | Done |
| P6 | Mac M4 Pro training via Tailscale (O10) | 1 hr | Done |
| P7 | Data collection session (30 min recording) | 1 hr | Pending |
| P8 | Training + evaluation on real paired data | 30 min | Pending |
| P9 | LoRA cross-room calibration (O5) | 2 hrs | Pending |

Validated Hardware

| Component | Spec | Validated |
|---|---|---|
| Mac Mini camera | 1920x1080, 30fps | Yes — 14/17 keypoints, conf 0.94-1.0 |
| MediaPipe PoseLandmarker v0.10.33 | Tasks API, lite model | Yes — via Tailscale SSH |
| Mac M4 Pro GPU | 16-core, Metal 4, NEON SIMD | Yes — Node.js v25.9.0 |
| Tailscale SSH | LAN-accessible Mac, passwordless | Yes |
| ESP32-S3 CSI | 128 subcarriers, 100Hz | Yes — existing recordings |
| Sensing server recording API | /api/v1/recording/start\|stop | Yes — existing |

Baseline Benchmark

Proxy-pose baseline (no camera supervision, standing skeleton heuristic):

PCK@10:  11.8%
PCK@20:  35.3%
PCK@50:  94.1%
MPJPE:   0.067
Latency: 0.03ms/sample

Per-joint PCK@20: upper body (nose, shoulders, wrists) at 0% — proxy has no spatial accuracy for these. Camera supervision targets these joints specifically.

References

  • WiFlow: arXiv:2602.08661 — WiFi-based pose estimation with TCN + axial attention
  • Wi-Pose (CVPR 2021) — 3D CNN WiFi pose with camera supervision
  • Person-in-WiFi 3D (CVPR 2024) — Deformable attention with camera labels
  • MediaPipe Pose — Google's real-time 33-landmark body pose estimator
  • MetaFi++ (NeurIPS 2023) — Meta-learning cross-modal WiFi sensing