mirror of
https://github.com/ruvnet/RuView.git
synced 2026-04-28 05:59:32 +00:00
merge: bring feat/adr-080-qe-remediation up to date with main
Co-Authored-By: claude-flow <ruv@ruv.net>
This commit is contained in:
commit
ccb27b280c
9 changed files with 3832 additions and 5 deletions
512
docs/adr/ADR-079-camera-ground-truth-training.md
Normal file
512
docs/adr/ADR-079-camera-ground-truth-training.md
Normal file
|
|
@ -0,0 +1,512 @@
|
|||
# ADR-079: Camera Ground-Truth Training Pipeline
|
||||
|
||||
- **Status**: Accepted
|
||||
- **Date**: 2026-04-06
|
||||
- **Deciders**: ruv
|
||||
- **Relates to**: ADR-072 (WiFlow Architecture), ADR-070 (Self-Supervised Pretraining), ADR-071 (ruvllm Training Pipeline), ADR-024 (AETHER Contrastive), ADR-064 (Multimodal Ambient Intelligence), ADR-075 (MinCut Person Separation)
|
||||
|
||||
## Context
|
||||
|
||||
WiFlow (ADR-072) currently trains without ground-truth pose labels, using proxy poses
|
||||
generated from presence/motion heuristics. This produces a PCK@20 of only 2.5% — far
|
||||
below the 30-50% achievable with supervised training. The fundamental bottleneck is the
|
||||
absence of spatial keypoint labels.
|
||||
|
||||
Academic WiFi pose estimation systems (Wi-Pose, Person-in-WiFi 3D, MetaFi++) all train
|
||||
with synchronized camera ground truth and achieve PCK@20 of 40-85%. They discard the
|
||||
camera at deployment — the camera is a training-time teacher, not a runtime dependency.
|
||||
|
||||
ADR-064 already identified this: *"Record CSI + mmWave while performing signs with a
|
||||
camera as ground truth, then deploy camera-free."* This ADR specifies the implementation.
|
||||
|
||||
### Current Training Pipeline Gap
|
||||
|
||||
```
|
||||
Current: CSI amplitude → WiFlow → 17 keypoints (proxy-supervised, PCK@20 = 2.5%)
|
||||
↑
|
||||
Heuristic proxies:
|
||||
- Standing skeleton when presence > 0.3
|
||||
- Limb perturbation from motion energy
|
||||
- No spatial accuracy
|
||||
```
|
||||
|
||||
### Target Pipeline
|
||||
|
||||
```
|
||||
Training: CSI amplitude ──→ WiFlow ──→ 17 keypoints (camera-supervised, PCK@20 target: 35%+)
|
||||
↑
|
||||
Laptop camera ──→ MediaPipe ──→ 17 COCO keypoints (ground truth)
|
||||
(time-synchronized, 30 fps)
|
||||
|
||||
Deploy: CSI amplitude ──→ WiFlow ──→ 17 keypoints (camera-free, trained model only)
|
||||
```
|
||||
|
||||
## Decision
|
||||
|
||||
Build a camera ground-truth collection and training pipeline using the laptop webcam
|
||||
as a teacher signal. The camera is used **only during training data collection** and is
|
||||
not required at deployment.
|
||||
|
||||
### Architecture Overview
|
||||
|
||||
```
|
||||
┌─────────────────────────────────────────────────────────────────┐
|
||||
│ Data Collection Phase │
|
||||
│ │
|
||||
│ ESP32-S3 nodes ──UDP──→ Sensing Server ──→ CSI frames (.jsonl) │
|
||||
│ ↑ time sync │
|
||||
│ Laptop Camera ──→ MediaPipe Pose ──→ Keypoints (.jsonl) │
|
||||
│ ↑ │
|
||||
│ collect-ground-truth.py │
|
||||
│ (single orchestrator) │
|
||||
└─────────────────────────────────────────────────────────────────┘
|
||||
|
||||
┌─────────────────────────────────────────────────────────────────┐
|
||||
│ Training Phase │
|
||||
│ │
|
||||
│ Paired dataset: { csi_window[128,20], keypoints[17,2], conf } │
|
||||
│ ↓ │
|
||||
│ train-wiflow-supervised.js │
|
||||
│ Phase 1: Contrastive pretrain (ADR-072, reuse) │
|
||||
│ Phase 2: Supervised keypoint regression (NEW) │
|
||||
│ Phase 3: Fine-tune with bone constraints + confidence │
|
||||
│ ↓ │
|
||||
│ WiFlow model (1.8M params) → SafeTensors export │
|
||||
└─────────────────────────────────────────────────────────────────┘
|
||||
|
||||
┌─────────────────────────────────────────────────────────────────┐
|
||||
│ Deployment (camera-free) │
|
||||
│ │
|
||||
│ ESP32-S3 CSI → Sensing Server → WiFlow inference → 17 keypoints│
|
||||
│ (No camera. Trained model runs on CSI input only.) │
|
||||
└─────────────────────────────────────────────────────────────────┘
|
||||
```
|
||||
|
||||
### Component 1: `scripts/collect-ground-truth.py`
|
||||
|
||||
Single Python script that orchestrates synchronized capture from the laptop camera
|
||||
and the ESP32 CSI stream.
|
||||
|
||||
**Dependencies:** `mediapipe`, `opencv-python`, `requests` (all pip-installable, no GPU)
|
||||
|
||||
**Capture flow:**
|
||||
|
||||
```python
|
||||
# Pseudocode
|
||||
camera = cv2.VideoCapture(0) # Laptop webcam
|
||||
sensing_api = "http://localhost:3000" # Sensing server
|
||||
|
||||
# Start CSI recording via existing API
|
||||
requests.post(f"{sensing_api}/api/v1/recording/start")
|
||||
|
||||
while recording:
|
||||
frame = camera.read()
|
||||
t = time.time_ns() # Nanosecond timestamp
|
||||
|
||||
# MediaPipe Pose: 33 landmarks → map to 17 COCO keypoints
|
||||
result = mp_pose.process(frame)
|
||||
keypoints_17 = map_mediapipe_to_coco(result.pose_landmarks)
|
||||
confidence = mean(landmark.visibility for relevant landmarks)
|
||||
|
||||
# Write to ground-truth JSONL (one line per frame)
|
||||
write_jsonl({
|
||||
"ts_ns": t,
|
||||
"keypoints": keypoints_17, # [[x,y], ...] normalized [0,1]
|
||||
"confidence": confidence, # 0-1, used for loss weighting
|
||||
"n_visible": count(visibility > 0.5),
|
||||
})
|
||||
|
||||
# Optional: show live preview with skeleton overlay
|
||||
if preview:
|
||||
draw_skeleton(frame, keypoints_17)
|
||||
cv2.imshow("Ground Truth", frame)
|
||||
|
||||
# Stop CSI recording
|
||||
requests.post(f"{sensing_api}/api/v1/recording/stop")
|
||||
```
|
||||
|
||||
**MediaPipe → COCO keypoint mapping:**
|
||||
|
||||
| COCO Index | Joint | MediaPipe Index |
|
||||
|------------|-------|-----------------|
|
||||
| 0 | Nose | 0 |
|
||||
| 1 | Left Eye | 2 |
|
||||
| 2 | Right Eye | 5 |
|
||||
| 3 | Left Ear | 7 |
|
||||
| 4 | Right Ear | 8 |
|
||||
| 5 | Left Shoulder | 11 |
|
||||
| 6 | Right Shoulder | 12 |
|
||||
| 7 | Left Elbow | 13 |
|
||||
| 8 | Right Elbow | 14 |
|
||||
| 9 | Left Wrist | 15 |
|
||||
| 10 | Right Wrist | 16 |
|
||||
| 11 | Left Hip | 23 |
|
||||
| 12 | Right Hip | 24 |
|
||||
| 13 | Left Knee | 25 |
|
||||
| 14 | Right Knee | 26 |
|
||||
| 15 | Left Ankle | 27 |
|
||||
| 16 | Right Ankle | 28 |
|
||||
|
||||
### Component 2: Time Alignment (`scripts/align-ground-truth.js`)
|
||||
|
||||
CSI frames arrive at ~100 Hz with server-side timestamps. Camera keypoints arrive at
|
||||
~30 fps with client-side timestamps. Alignment is needed because:
|
||||
|
||||
1. Camera and sensing server clocks differ (typically < 50ms on LAN)
|
||||
2. CSI is aggregated into 20-frame windows for WiFlow input
|
||||
3. Ground-truth keypoints must be averaged over the same window
|
||||
|
||||
**Alignment algorithm:**
|
||||
|
||||
```
|
||||
For each CSI window W_i (20 frames, ~200ms at 100Hz):
|
||||
t_start = W_i.first_frame.timestamp
|
||||
t_end = W_i.last_frame.timestamp
|
||||
|
||||
# Find all camera keypoints within this time window
|
||||
matching_keypoints = [k for k in camera_data if t_start <= k.ts <= t_end]
|
||||
|
||||
if len(matching_keypoints) >= 3: # At least 3 camera frames per window
|
||||
# Average keypoints, weighted by confidence
|
||||
avg_keypoints = weighted_mean(matching_keypoints, weights=confidences)
|
||||
avg_confidence = mean(confidences)
|
||||
|
||||
paired_dataset.append({
|
||||
csi_window: W_i.amplitudes, # [128, 20] float32
|
||||
keypoints: avg_keypoints, # [17, 2] float32
|
||||
confidence: avg_confidence, # scalar
|
||||
n_camera_frames: len(matching_keypoints),
|
||||
})
|
||||
```
|
||||
|
||||
**Clock sync strategy:**
|
||||
|
||||
- NTP is sufficient (< 20ms error on LAN)
|
||||
- The 200ms CSI window is 10x larger than typical clock drift
|
||||
- For tighter sync: use a handclap/jump as a sync marker — visible spike in both
|
||||
CSI motion energy and camera skeleton velocity. Auto-detect and align.
|
||||
|
||||
**Output:** `data/recordings/paired-{timestamp}.jsonl` — one line per paired sample:
|
||||
```json
|
||||
{"csi": [128x20 flat], "kp": [[0.45,0.12], ...], "conf": 0.92, "ts": 1775300000000}
|
||||
```
|
||||
|
||||
### Component 3: Supervised Training (`scripts/train-wiflow-supervised.js`)
|
||||
|
||||
Extends the existing `train-ruvllm.js` pipeline with a supervised phase.
|
||||
|
||||
**Phase 1: Contrastive Pretrain (reuse ADR-072)**
|
||||
- Same as existing: temporal + cross-node triplets
|
||||
- Learns CSI representation without labels
|
||||
- 50 epochs, ~5 min on laptop
|
||||
|
||||
**Phase 2: Supervised Keypoint Regression (NEW)**
|
||||
- Load paired dataset from Component 2
|
||||
- Loss: confidence-weighted SmoothL1 on keypoints
|
||||
|
||||
```
|
||||
L_supervised = (1/N) * sum_i [ conf_i * SmoothL1(pred_i, gt_i, beta=0.05) ]
|
||||
```
|
||||
|
||||
- Only train on samples where `conf > 0.5` (discard frames where MediaPipe lost tracking)
|
||||
- Learning rate: 1e-4 with cosine decay
|
||||
- 200 epochs, ~15 min on laptop CPU (1.8M params, no GPU needed)
|
||||
|
||||
**Phase 3: Refinement with Bone Constraints**
|
||||
- Fine-tune with combined loss:
|
||||
|
||||
```
|
||||
L = L_supervised + 0.3 * L_bone + 0.1 * L_temporal
|
||||
|
||||
L_bone = (1/14) * sum_b (bone_len_b - prior_b)^2 # ADR-072 bone priors
|
||||
L_temporal = SmoothL1(kp_t, kp_{t-1}) # Temporal smoothness
|
||||
```
|
||||
|
||||
- 50 epochs at lower LR (1e-5)
|
||||
- Tighten bone constraint weight from 0.3 → 0.5 over epochs
|
||||
|
||||
**Phase 4: Quantization + Export**
|
||||
- Reuse ruvllm TurboQuant: float32 → int8 (4x smaller, ~881 KB)
|
||||
- Export via SafeTensors for cross-platform deployment
|
||||
- Validate quantized model PCK@20 within 2% of full-precision
|
||||
|
||||
### Component 4: Evaluation Script (`scripts/eval-wiflow.js`)
|
||||
|
||||
Measure actual PCK@20 using held-out paired data (20% split).
|
||||
|
||||
```
|
||||
PCK@k = (1/N) * sum_i [ (||pred_i - gt_i|| < k * torso_length) ? 1 : 0 ]
|
||||
```
|
||||
|
||||
**Metrics reported:**
|
||||
|
||||
| Metric | Description | Target |
|
||||
|--------|-------------|--------|
|
||||
| PCK@20 | % of keypoints within 20% torso length | > 35% |
|
||||
| PCK@50 | % within 50% torso length | > 60% |
|
||||
| MPJPE | Mean per-joint position error (pixels) | < 40px |
|
||||
| Per-joint PCK | Breakdown by joint (wrists are hardest) | Report all 17 |
|
||||
| Inference latency | Single window prediction time | < 50ms |
|
||||
|
||||
### Optimization Strategy
|
||||
|
||||
#### O1: Curriculum Learning
|
||||
|
||||
Train easy poses first, hard poses later:
|
||||
|
||||
| Stage | Epochs | Data Filter | Rationale |
|
||||
|-------|--------|-------------|-----------|
|
||||
| 1 | 50 | `conf > 0.9`, standing only | Establish stable skeleton baseline |
|
||||
| 2 | 50 | `conf > 0.7`, low motion | Add sitting, subtle movements |
|
||||
| 3 | 50 | `conf > 0.5`, all poses | Full dataset including occlusions |
|
||||
| 4 | 50 | All data, with augmentation | Robustness via noise injection |
|
||||
|
||||
#### O2: Data Augmentation (CSI domain)
|
||||
|
||||
Augment CSI windows to increase effective dataset size without collecting more data:
|
||||
|
||||
| Augmentation | Implementation | Expected Gain |
|
||||
|-------------|----------------|---------------|
|
||||
| Time shift | Roll CSI window by ±2 frames | +30% data |
|
||||
| Amplitude noise | Gaussian noise, sigma=0.02 | Robustness |
|
||||
| Subcarrier dropout | Zero 10% of subcarriers randomly | Robustness |
|
||||
| Temporal flip | Reverse window + reverse keypoint velocity | +100% data |
|
||||
| Multi-node mix | Swap node CSI, keep same-time keypoints | Cross-node generalization |
|
||||
|
||||
#### O3: Knowledge Distillation from MediaPipe
|
||||
|
||||
Instead of raw keypoint regression, distill MediaPipe's confidence and heatmap
|
||||
information:
|
||||
|
||||
```
|
||||
L_distill = KL_div(softmax(wifi_heatmap / T), softmax(camera_heatmap / T))
|
||||
```
|
||||
|
||||
- Temperature T=4 for soft targets (transfers inter-joint relationships)
|
||||
- WiFlow predicts a 17-channel heatmap [17, H, W] instead of direct [17, 2]
|
||||
- Argmax for final keypoint extraction
|
||||
- **Trade-off:** Adds ~200K params for heatmap decoder, but improves spatial precision
|
||||
|
||||
#### O4: Active Learning Loop
|
||||
|
||||
Identify which poses the model is worst at and collect more data for those:
|
||||
|
||||
```
|
||||
1. Train initial model on first collection session
|
||||
2. Run inference on new CSI data, compute prediction entropy
|
||||
3. Flag high-entropy windows (model is uncertain)
|
||||
4. During next collection, the preview overlay highlights these moments:
|
||||
"Hold this pose — model needs more examples"
|
||||
5. Re-train with augmented dataset
|
||||
```
|
||||
|
||||
Expected: 2-3 active learning iterations reach saturation.
|
||||
|
||||
#### O6: Subcarrier Selection (ruvector-solver)
|
||||
|
||||
Variance-based top-K subcarrier selection, equivalent to ruvector-solver's sparse
|
||||
interpolation (114→56). Removes noise/static subcarriers before training:
|
||||
|
||||
```
|
||||
For each subcarrier d in [0, dim):
|
||||
variance[d] = mean over samples of temporal_variance(csi[d, :])
|
||||
Select top-K by variance (K = dim * 0.5)
|
||||
```
|
||||
|
||||
**Validated:** 128 → 56 subcarriers (56% input reduction), proportional model size reduction.
|
||||
|
||||
#### O7: Attention-Weighted Subcarriers (ruvector-attention)
|
||||
|
||||
Compute per-subcarrier attention weights based on temporal energy correlation with
|
||||
ground-truth keypoint motion. High-energy subcarriers that covary with skeleton
|
||||
movement get amplified:
|
||||
|
||||
```
|
||||
For each subcarrier d:
|
||||
energy[d] = sum of squared first-differences over time
|
||||
weight[d] = softmax(energy, temperature=0.1)
|
||||
Apply: csi[d, :] *= weight[d] * dim (mean weight = 1)
|
||||
```
|
||||
|
||||
**Validated:** Top-5 attention subcarriers identified automatically per dataset.
|
||||
|
||||
#### O8: Stoer-Wagner MinCut Person Separation (ruvector-mincut / ADR-075)
|
||||
|
||||
JS implementation of the Stoer-Wagner algorithm for person separation in CSI, equivalent
|
||||
to `DynamicPersonMatcher` in `wifi-densepose-train/src/metrics.rs`. Builds a subcarrier
|
||||
correlation graph and finds the minimum cut to identify person-specific subcarrier clusters:
|
||||
|
||||
```
|
||||
1. Build dim×dim Pearson correlation matrix across subcarriers
|
||||
2. Run Stoer-Wagner min-cut on correlation graph
|
||||
3. Partition subcarriers into person-specific groups
|
||||
4. Train per-partition models for multi-person scenarios
|
||||
```
|
||||
|
||||
**Validated:** Stoer-Wagner executes on 56-dim graph, identifies partition boundaries.
|
||||
|
||||
#### O9: Multi-SPSA Gradient Estimation
|
||||
|
||||
Average over K=3 random perturbation directions per gradient step. Reduces variance
|
||||
by sqrt(K) = 1.73x compared to single SPSA, at 3x forward pass cost (net win for
|
||||
convergence quality):
|
||||
|
||||
```
|
||||
For k in 1..K:
|
||||
delta_k = random ±1 per parameter
|
||||
grad_k = (loss(w + eps*delta_k) - loss(w - eps*delta_k)) / (2*eps*delta_k)
|
||||
grad = mean(grad_1, ..., grad_K)
|
||||
```
|
||||
|
||||
#### O10: Mac M4 Pro Training via Tailscale
|
||||
|
||||
Training runs on Mac Mini M4 Pro (16-core GPU, ARM NEON SIMD) via Tailscale SSH,
|
||||
using ruvllm's native Node.js SIMD ops:
|
||||
|
||||
| | Windows (CPU) | Mac M4 Pro |
|
||||
|---|---|---|
|
||||
| Node.js | v24.12.0 (x86) | v25.9.0 (ARM) |
|
||||
| SIMD | SSE4/AVX2 | NEON |
|
||||
| Cores | Consumer laptop | 12P + 4E cores |
|
||||
| Training | Slow (minutes/epoch) | Fast (seconds/epoch) |
|
||||
|
||||
#### O5: Cross-Environment Transfer
|
||||
|
||||
Train on one room, deploy in another:
|
||||
|
||||
| Strategy | Implementation |
|
||||
|----------|---------------|
|
||||
| Room-invariant features | Normalize CSI by running mean/variance |
|
||||
| LoRA adapters | Train a 4-rank LoRA per room (ADR-071) — 7.3 KB each |
|
||||
| Few-shot calibration | 2 min of camera data in new room → fine-tune LoRA only |
|
||||
| AETHER embeddings | Use contrastive room-independent features (ADR-024) as input |
|
||||
|
||||
The LoRA approach is most practical: ship a base model + collect 2 min of calibration
|
||||
data per new room using the laptop camera.
|
||||
|
||||
### Data Collection Protocol
|
||||
|
||||
Recommended collection sessions per room:
|
||||
|
||||
| Session | Duration | Activity | People | Total CSI Frames |
|
||||
|---------|----------|----------|--------|-----------------|
|
||||
| 1. Baseline | 5 min | Empty + 1 person entry/exit | 0-1 | 30,000 |
|
||||
| 2. Standing poses | 5 min | Stand, arms up/down/sides, turn | 1 | 30,000 |
|
||||
| 3. Sitting | 5 min | Sit, type, lean, stand up/sit down | 1 | 30,000 |
|
||||
| 4. Walking | 5 min | Walk paths across room | 1 | 30,000 |
|
||||
| 5. Mixed | 5 min | Varied activities, transitions | 1 | 30,000 |
|
||||
| 6. Multi-person | 5 min | 2 people, varied activities | 2 | 30,000 |
|
||||
| **Total** | **30 min** | | | **180,000** |
|
||||
|
||||
At 20-frame windows: **9,000 paired training samples** per 30-min session.
|
||||
With augmentation (O2): **~27,000 effective samples**.
|
||||
|
||||
Camera placement: position laptop so the camera has a clear view of the sensing area.
|
||||
The camera FOV should cover the same space the ESP32 nodes cover.
|
||||
|
||||
### File Structure
|
||||
|
||||
```
|
||||
scripts/
|
||||
collect-ground-truth.py # Camera capture + MediaPipe + CSI sync
|
||||
align-ground-truth.js # Time-align CSI windows with camera keypoints
|
||||
train-wiflow-supervised.js # Supervised training pipeline
|
||||
eval-wiflow.js # PCK evaluation on held-out data
|
||||
|
||||
data/
|
||||
ground-truth/ # Raw camera keypoint captures
|
||||
gt-{timestamp}.jsonl
|
||||
paired/ # Aligned CSI + keypoint pairs
|
||||
paired-{timestamp}.jsonl
|
||||
|
||||
models/
|
||||
wiflow-supervised/ # Trained model outputs
|
||||
wiflow-v1.safetensors
|
||||
wiflow-v1-int8.safetensors
|
||||
training-log.json
|
||||
eval-report.json
|
||||
```
|
||||
|
||||
### Privacy Considerations
|
||||
|
||||
- Camera frames are processed **locally** by MediaPipe — no cloud upload
|
||||
- Raw video is **never saved** — only extracted keypoint coordinates are stored
|
||||
- The `.jsonl` ground-truth files contain only `[x,y]` joint coordinates, not images
|
||||
- The trained model runs on CSI only — no camera data leaves the laptop
|
||||
- Users can delete `data/ground-truth/` after training; the model is self-contained
|
||||
|
||||
## Consequences
|
||||
|
||||
### Positive
|
||||
|
||||
- **10-20x accuracy improvement**: PCK@20 from 2.5% → 35%+ with real supervision
|
||||
- **Reuses existing infrastructure**: sensing server recording API, ruvllm training, SafeTensors
|
||||
- **No new hardware**: laptop webcam + existing ESP32 nodes
|
||||
- **Privacy preserved at deployment**: camera only needed during 30-min training session
|
||||
- **Incremental**: can improve with more collection sessions + active learning
|
||||
- **Distributable**: trained model weights can be shared on HuggingFace (ADR-070)
|
||||
|
||||
### Negative
|
||||
|
||||
- **Camera placement matters**: must see the same area ESP32 nodes sense
|
||||
- **Single-room models**: need LoRA calibration per room (2 min + camera)
|
||||
- **MediaPipe limitations**: occlusion, side views, multiple people reduce keypoint quality
|
||||
- **Time sync**: NTP drift can misalign frames (mitigated by 200ms windows)
|
||||
|
||||
### Risks
|
||||
|
||||
| Risk | Probability | Impact | Mitigation |
|
||||
|------|-------------|--------|------------|
|
||||
| MediaPipe keypoints too noisy | Low | Medium | Filter by confidence; MediaPipe is robust indoors |
|
||||
| Clock drift > 100ms | Low | High | Add handclap sync marker detection |
|
||||
| Single camera can't see all poses | Medium | Medium | Position camera centrally; collect from 2 angles |
|
||||
| Model overfits to one room | High | Medium | LoRA adapters + AETHER normalization (O5) |
|
||||
| Insufficient data (< 5K pairs) | Low | High | Augmentation (O2) + active learning (O4) |
|
||||
|
||||
## Implementation Plan
|
||||
|
||||
| Phase | Task | Effort | Status |
|
||||
|-------|------|--------|--------|
|
||||
| P1 | `collect-ground-truth.py` — camera + MediaPipe capture | 2 hrs | **Done** |
|
||||
| P2 | `align-ground-truth.js` — time alignment + pairing | 1 hr | **Done** |
|
||||
| P3 | `train-wiflow-supervised.js` — supervised training | 3 hrs | **Done** |
|
||||
| P4 | `eval-wiflow.js` — PCK evaluation | 1 hr | **Done** |
|
||||
| P5 | ruvector optimizations (O6-O9) | 2 hrs | **Done** |
|
||||
| P6 | Mac M4 Pro training via Tailscale (O10) | 1 hr | **Done** |
|
||||
| P7 | Data collection session (30 min recording) | 1 hr | Pending |
|
||||
| P8 | Training + evaluation on real paired data | 30 min | Pending |
|
||||
| P9 | LoRA cross-room calibration (O5) | 2 hrs | Pending |
|
||||
|
||||
## Validated Hardware
|
||||
|
||||
| Component | Spec | Validated |
|
||||
|-----------|------|-----------|
|
||||
| Mac Mini camera | 1920x1080, 30fps | Yes — 14/17 keypoints, conf 0.94-1.0 |
|
||||
| MediaPipe PoseLandmarker | v0.10.33 Tasks API, lite model | Yes — via Tailscale SSH |
|
||||
| Mac M4 Pro GPU | 16-core, Metal 4, NEON SIMD | Yes — Node.js v25.9.0 |
|
||||
| Tailscale SSH | LAN-accessible Mac, passwordless | Yes |
|
||||
| ESP32-S3 CSI | 128 subcarriers, 100Hz | Yes — existing recordings |
|
||||
| Sensing server recording API | `/api/v1/recording/start\|stop` | Yes — existing |
|
||||
|
||||
## Baseline Benchmark
|
||||
|
||||
Proxy-pose baseline (no camera supervision, standing skeleton heuristic):
|
||||
|
||||
```
|
||||
PCK@10: 11.8%
|
||||
PCK@20: 35.3%
|
||||
PCK@50: 94.1%
|
||||
MPJPE: 0.067
|
||||
Latency: 0.03ms/sample
|
||||
```
|
||||
|
||||
Per-joint PCK@20: upper body (nose, shoulders, wrists) at 0% — proxy has no spatial
|
||||
accuracy for these. Camera supervision targets these joints specifically.
|
||||
|
||||
## References
|
||||
|
||||
- WiFlow: arXiv:2602.08661 — WiFi-based pose estimation with TCN + axial attention
|
||||
- Wi-Pose (CVPR 2021) — 3D CNN WiFi pose with camera supervision
|
||||
- Person-in-WiFi 3D (CVPR 2024) — Deformable attention with camera labels
|
||||
- MediaPipe Pose — Google's real-time 33-landmark body pose estimator
|
||||
- MetaFi++ (NeurIPS 2023) — Meta-learning cross-modal WiFi sensing
|
||||
|
|
@ -1055,6 +1055,65 @@ See [ADR-071](adr/ADR-071-ruvllm-training-pipeline.md) and the [pretraining tuto
|
|||
|
||||
---
|
||||
|
||||
## Camera-Supervised Pose Training (v0.7.0)
|
||||
|
||||
For significantly higher accuracy, use a webcam as a **temporary teacher** during training. The camera captures real 17-keypoint poses via MediaPipe, paired with simultaneous ESP32 CSI data. After training, the camera is no longer needed — the model runs on CSI only.
|
||||
|
||||
**Result: 92.9% PCK@20** from a 5-minute collection session.
|
||||
|
||||
### Requirements
|
||||
|
||||
- Python 3.9+ with `mediapipe` and `opencv-python` (`pip install mediapipe opencv-python`)
|
||||
- ESP32-S3 node streaming CSI over UDP (port 5005)
|
||||
- A webcam (laptop, USB, or Mac camera via Tailscale)
|
||||
|
||||
### Step 1: Capture Camera + CSI Simultaneously
|
||||
|
||||
Run both scripts at the same time (in separate terminals):
|
||||
|
||||
```bash
|
||||
# Terminal 1: Record ESP32 CSI
|
||||
python scripts/record-csi-udp.py --duration 300
|
||||
|
||||
# Terminal 2: Capture camera keypoints
|
||||
python scripts/collect-ground-truth.py --duration 300 --preview
|
||||
```
|
||||
|
||||
Move around naturally in front of the camera for 5 minutes. The `--preview` flag shows a live skeleton overlay.
|
||||
|
||||
### Step 2: Align and Train
|
||||
|
||||
```bash
|
||||
# Align camera keypoints with CSI windows
|
||||
node scripts/align-ground-truth.js \
|
||||
--gt data/ground-truth/*.jsonl \
|
||||
--csi data/recordings/csi-*.csi.jsonl
|
||||
|
||||
# Train (start with lite, scale up as you collect more data)
|
||||
node scripts/train-wiflow-supervised.js \
|
||||
--data data/paired/*.jsonl \
|
||||
--scale lite \
|
||||
--epochs 50
|
||||
|
||||
# Evaluate
|
||||
node scripts/eval-wiflow.js \
|
||||
--model models/wiflow-supervised/wiflow-v1.json \
|
||||
--data data/paired/*.jsonl
|
||||
```
|
||||
|
||||
### Scale Presets
|
||||
|
||||
| Preset | Params | Training Time | Best For |
|
||||
|--------|--------|---------------|----------|
|
||||
| `--scale lite` | 189K | ~19 min | < 1,000 samples (5 min capture) |
|
||||
| `--scale small` | 474K | ~1 hr | 1K-10K samples |
|
||||
| `--scale medium` | 800K | ~2 hrs | 10K-50K samples |
|
||||
| `--scale full` | 7.7M | ~8 hrs | 50K+ samples (GPU recommended) |
|
||||
|
||||
See [ADR-079](adr/ADR-079-camera-ground-truth-training.md) for the full design and optimization details.
|
||||
|
||||
---
|
||||
|
||||
## Pre-Trained Models (No Training Required)
|
||||
|
||||
Pre-trained models are available on HuggingFace: **https://huggingface.co/ruvnet/wifi-densepose-pretrained**
|
||||
|
|
|
|||
Loading…
Add table
Add a link
Reference in a new issue