Covers 8 known issues encountered during multi-node ESP32-S3 deployments:
1. Node not appearing (limping state after USB flash)
2. Person count stuck at 1 (ADR-044)
3. Heart rate/breathing rate jitter (last-write-wins from multiple nodes)
4. Signal quality placeholder
5. Dashboard freezing (WS disconnect loop)
6. OTA crash at 59% (BLE vs OTA conflict)
7. SSH LAN hang (Tailscale workaround)
8. USB-C port selection
Helps with #268 (no nodes found), #375 (node_id), #366 (build errors).
RuView Troubleshooting Guide
Known issues and fixes from the rebase-to-upstream branch (upstream #301).
1. Node not appearing in /api/v1/nodes
Symptom: ESP32-S3 node associates with WiFi, LED blinks, but no CSI frames arrive at the server. Node missing from /api/v1/spatial/nodes.
Root cause: After USB flash, the node enters a limping state where WiFi associates but the UDP CSI sender silently fails. The SoftAP + mDNS stack initializes but the CSI callback never fires.
Fix: Power cycle the node (unplug USB, wait 2s, replug). If that doesn't work, send a DTR reset via serial, then exit miniterm with Ctrl+] (its default exit key):
python -m serial.tools.miniterm --dtr 0 COMx 115200
Prevention: Firmware 0.8.0+ includes a watchdog that detects zero CSI frames for 30s and triggers a software reset automatically. Nodes 1-10 are still on old firmware and lack this recovery (OTA-vs-BLE chicken-and-egg; see section 6 below).
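The watchdog rule described above can be sketched as follows. This is a minimal Python model of the logic only; the firmware itself is not Python, and the class and method names (CsiWatchdog, on_csi_frame, should_reset) are hypothetical. Only the 30s timeout comes from this guide.

```python
from dataclasses import dataclass

CSI_TIMEOUT_S = 30.0  # from the guide: reset after 30 s without CSI frames


@dataclass
class CsiWatchdog:
    """Tracks the time of the last CSI frame and reports when a reset is due."""

    timeout_s: float = CSI_TIMEOUT_S
    last_frame_s: float = 0.0  # assumption: the clock starts at 0 on boot

    def on_csi_frame(self, now_s: float) -> None:
        # Called from the (hypothetical) CSI callback each time a frame arrives.
        self.last_frame_s = now_s

    def should_reset(self, now_s: float) -> bool:
        # True once the silence since the last frame reaches the timeout.
        return (now_s - self.last_frame_s) >= self.timeout_s


wd = CsiWatchdog()
wd.on_csi_frame(now_s=5.0)
print(wd.should_reset(now_s=20.0))  # False: only 15 s of silence
print(wd.should_reset(now_s=35.0))  # True: 30 s of silence
```

A node in the limping state never calls on_csi_frame, so should_reset becomes true 30s after boot, which is the case the power cycle otherwise handles by hand.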
2. Person count stuck at 1
Symptom: estimated_persons always returns 1 regardless of how many people are in the room.
Root cause (ADR-044): Eight converging bugs:
- score_to_person_count had a ceiling of 3
- fuse_multi_node_features used .max() instead of sum; N identical readings collapsed to 1
- Four .max(1) clamps forced minimum count to 1 even when absent
- field_model.estimate_occupancy capped at .min(3)
- Normalization saturated (dividing by hardcoded thresholds instead of adaptive p95)
- No field model auto-calibration — eigenvalue path never activated
- Vitals-path clamps were asymmetric
- Tomography produced one blob (CC=1) so dedup gave wrong count
Fix applied (Waves 1-3):
- Wave 1 (9cc5f604): ceiling 3→10, .max() → sum/3 aggregation, softened .max(1) clamps
- Wave 2 (306f1262): RollingP95 adaptive normalization, field_model 30s auto-calibration, vitals clamp symmetry
- Wave 3 (c3df375a + 0d4bfb09 + 6ac70ddf): CC flood-fill infrastructure, lambda 0.1→5.0, threshold 0.01→0.15, CC>1 gate
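To illustrate what the connected-component (CC) count measures, here is a minimal flood-fill sketch in Python. It assumes a 2D grid and 4-connectivity purely for illustration; the real tomography grid, its dimensionality, and the function name count_components are assumptions, not the project's implementation. Only the 0.15 threshold is taken from Wave 3.

```python
from collections import deque


def count_components(grid, threshold=0.15):
    """Count 4-connected blobs of cells whose value exceeds the threshold."""
    rows = len(grid)
    cols = len(grid[0]) if rows else 0
    seen = [[False] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if seen[r][c] or grid[r][c] <= threshold:
                continue
            # Unvisited occupied cell: start a new blob and flood-fill it.
            count += 1
            seen[r][c] = True
            queue = deque([(r, c)])
            while queue:
                y, x = queue.popleft()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    inside = 0 <= ny < rows and 0 <= nx < cols
                    if inside and not seen[ny][nx] and grid[ny][nx] > threshold:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
    return count


grid = [
    [0.9, 0.8, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.7],
    [0.0, 0.0, 0.0, 0.6],
]
print(count_components(grid))  # 2
```

In this toy model the threshold decides whether weak cells bridge two peaks: the row [0.9, 0.05, 0.7] is one blob at threshold 0.01 and two blobs at 0.15.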
Current state: estimated_persons = 6-8 for 5 bodies (3 humans + 2 dogs). Overcounts because the sum/3 dedup factor is a guess. Tomography still produces one blob (CC=1), so the CC path doesn't activate. Runtime-configurable lambda would help tune without redeployment.
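The difference between the old .max() fusion and the sum/3 aggregation can be shown with a small sketch. It operates on per-node count estimates, which is a simplifying assumption: the real fuse_multi_node_features works on features, and the function names and sample readings below are hypothetical. The divisor 3 and the ceiling of 10 come from Wave 1.

```python
def fuse_max(counts):
    """Old behaviour: identical readings from N nodes collapse to one value."""
    return max(counts, default=0)


def fuse_sum_div(counts, dedup_factor=3.0, ceiling=10):
    """Wave 1 behaviour: sum across nodes, divide by a dedup factor, then cap."""
    if not counts:
        return 0
    return min(ceiling, round(sum(counts) / dedup_factor))


# Hypothetical readings: 11 nodes that each estimate 2 people in view.
per_node = [2] * 11
print(fuse_max(per_node))      # 2
print(fuse_sum_div(per_node))  # 7
```

Because the result scales with both the node count and the dedup factor, any error in the factor shifts the estimate directly, which is consistent with the overcount described above.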
3. Heart rate / breathing rate jitter
Symptom: HR and BR readings jump wildly between frames. BR CV was 23.3%, HR CV was 12.9%.
Root cause (ADR-045): 11 ESP32 nodes each compute independent vitals. The server used last-write-wins — whichever node's UDP packet arrived last overwrote the global vitals. At ~20 fps per node, this meant vitals randomly interleaved from different vantage points every 50ms.
Fix applied (46fbc061): Best-node selection. Each node's vitals are smoothed independently via median filter + EMA. The node with the highest combined breathing_confidence + heartbeat_confidence is selected as authoritative. Result: BR CV 23.3% → 12.6%, HR CV 12.9% → 11.6%.
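A minimal sketch of the best-node selection described above, written in Python for illustration. The window size, the EMA alpha, and the names NodeVitals and best_node are assumptions; only the median filter + EMA smoothing and the breathing_confidence + heartbeat_confidence ranking come from this guide. For brevity it tracks a single rate per node.

```python
from collections import deque
from statistics import median


class NodeVitals:
    """Per-node smoothing: median filter over a short window, then an EMA."""

    def __init__(self, window=5, alpha=0.2):
        self.samples = deque(maxlen=window)
        self.alpha = alpha
        self.smoothed = None
        self.confidence = 0.0

    def update(self, bpm, breathing_confidence, heartbeat_confidence):
        self.samples.append(bpm)
        med = median(self.samples)  # rejects single-frame spikes
        if self.smoothed is None:
            self.smoothed = med
        else:
            self.smoothed = self.alpha * med + (1 - self.alpha) * self.smoothed
        self.confidence = breathing_confidence + heartbeat_confidence


def best_node(nodes):
    """Return the id of the node with the highest combined confidence."""
    if not nodes:
        return None
    return max(nodes, key=lambda node_id: nodes[node_id].confidence)


nodes = {1: NodeVitals(), 2: NodeVitals()}
nodes[1].update(15.0, breathing_confidence=0.9, heartbeat_confidence=0.8)
nodes[2].update(22.0, breathing_confidence=0.4, heartbeat_confidence=0.3)
winner = best_node(nodes)
print(winner, nodes[winner].smoothed)  # 1 15.0
```

Each node keeps its own filter state, so a packet from a low-confidence node no longer overwrites the published value the way last-write-wins did.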
Known limitation: The wifi-densepose-vitals crate has a superior 4-stage pipeline (bandpass → Hilbert envelope → autocorrelation → peak detection) but is not yet wired into the sensing server. The current VitalSignDetector uses a simpler FFT approach with 4 BPM frequency resolution.
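The 4 BPM figure follows from the FFT bin spacing, which is the sample rate divided by the number of samples, or equivalently one over the window length. The sketch below works through the arithmetic. It assumes the resolution quoted is the raw bin spacing (no zero-padding or peak interpolation); the 300-sample window is an illustrative value chosen to match ~20 fps, not one taken from the code.

```python
def fft_resolution_bpm(sample_rate_hz, n_samples):
    """FFT bin spacing in beats per minute: 60 * fs / N."""
    return 60.0 * sample_rate_hz / n_samples


def window_seconds_for(resolution_bpm):
    """Window length needed for a given bin spacing: T = 60 / resolution."""
    return 60.0 / resolution_bpm


print(window_seconds_for(4.0))         # 15.0 seconds of data
print(fft_resolution_bpm(20.0, 300))   # 4.0 BPM at 20 Hz with 300 samples
```

Under these assumptions, halving the bin spacing to 2 BPM would require a 30s window, which trades resolution against responsiveness.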
4. Signal quality shows 50% always
Symptom: The dashboard signal quality gauge was always stuck at ~50%.
Root cause: Signal quality was a hardcoded placeholder value, not derived from actual CSI data.
Fix applied: ADR-044 Wave 2 replaced the fake gauge with RollingP95 adaptive normalization. The UI honesty pass (b2070ab4) added beta tags to unvalidated metrics, replaced the fake gauge with per-node pill indicators, and surfaced the actual per-node signal data.
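The idea behind adaptive p95 normalization can be sketched as below: values are scaled by a rolling 95th percentile of recent observations rather than a hardcoded threshold. RollingP95 is the name used in this guide, but the window size, the nearest-rank percentile method, and the method names here are assumptions rather than the server's implementation.

```python
from collections import deque


class RollingP95:
    """Rolling 95th percentile over the most recent `window` values."""

    def __init__(self, window=200):
        self.values = deque(maxlen=window)

    def push(self, x):
        self.values.append(x)

    def p95(self):
        if not self.values:
            return None
        ordered = sorted(self.values)
        # Nearest-rank percentile, using integer arithmetic for ceil(0.95 * n).
        rank = (95 * len(ordered) + 99) // 100
        return ordered[rank - 1]

    def normalize(self, x):
        """Scale x into [0, 1] relative to the current p95."""
        scale = self.p95()
        if scale is None or scale <= 0:
            return 0.0
        return min(1.0, max(0.0, x / scale))


tracker = RollingP95()
for value in range(1, 101):
    tracker.push(float(value))
print(tracker.p95())            # 95.0
print(tracker.normalize(47.5))  # 0.5
```

Since the scale follows the data, the normalized value does not saturate when an environment is consistently stronger or weaker than a fixed threshold assumed.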
5. Dashboard freezes every 2-4 seconds
Symptom: The spatial view and dashboard would freeze, then reconnect, creating a visible stutter every 2-4 seconds.
Root cause: The WebSocket broadcast channel's recv() returned Err(Lagged) when a client fell behind. The server treated this as a fatal error and dropped the connection. The client immediately reconnected, creating a connect/disconnect cycle.
Fix applied (581daf4f):
- Server: Lagged error → continue (skip missed frames instead of disconnecting)
- Server: 30s ping/pong keepalive to prevent Caddy proxy idle timeouts
- Result: 154 frames over 8 seconds sustained, zero disconnects
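The Lagged handling can be modeled in a few lines. The sketch below is a Python stand-in for a bounded broadcast channel, not the server's Rust code; Broadcast, Receiver, and drain are hypothetical names. It shows the behaviour of the fix: a lag report is counted and skipped, and the receive loop keeps running.

```python
class Lagged(Exception):
    def __init__(self, skipped):
        super().__init__(f"receiver lagged by {skipped} frames")
        self.skipped = skipped


class Broadcast:
    """Bounded broadcast buffer that keeps only the newest `capacity` frames."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.frames = []   # retained frames, oldest first
        self.next_seq = 0  # sequence number of the next frame to be sent

    def send(self, frame):
        self.frames.append(frame)
        self.next_seq += 1
        if len(self.frames) > self.capacity:
            self.frames.pop(0)  # overwrite the oldest frame

    def oldest_seq(self):
        return self.next_seq - len(self.frames)


class Receiver:
    def __init__(self, channel):
        self.channel = channel
        self.cursor = channel.next_seq

    def recv(self):
        """Return the next frame, None when caught up, or raise Lagged once."""
        oldest = self.channel.oldest_seq()
        if self.cursor < oldest:
            skipped = oldest - self.cursor
            self.cursor = oldest  # jump forward to the oldest retained frame
            raise Lagged(skipped)
        if self.cursor == self.channel.next_seq:
            return None
        frame = self.channel.frames[self.cursor - oldest]
        self.cursor += 1
        return frame


def drain(receiver):
    delivered, skipped = [], 0
    while True:
        try:
            frame = receiver.recv()
        except Lagged as lag:
            skipped += lag.skipped
            continue  # the fix: skip missed frames, keep the connection open
        if frame is None:
            return delivered, skipped
        delivered.append(frame)


channel = Broadcast(capacity=4)
slow_client = Receiver(channel)
for i in range(10):
    channel.send(f"frame-{i}")
delivered, skipped = drain(slow_client)
print(delivered, skipped)  # ['frame-6', 'frame-7', 'frame-8', 'frame-9'] 6
```

Treating Lagged as fatal would instead end the loop at the first exception, which is what produced the reconnect cycle.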
6. OTA update crashes at 59%
Symptom: OTA firmware update via /api/v1/firmware/download progresses to ~59% then the node crashes with StoreProhibited on Core 1.
Root cause: NimBLE BLE advertising/scanning runs on Core 1. During OTA, the HTTP client also runs on Core 1. BLE and OTA compete for stack space, and the BLE scan callback triggers a memory access violation during the OTA write.
Fix:
- Stop NimBLE advertising and scanning before calling esp_https_ota_begin()
- Increase httpd stack from 4KB to 8KB (CONFIG_HTTPD_MAX_REQ_HDR_LEN and task stack)
- Resume BLE after OTA completes or fails
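The ordering in the first and last steps matters: BLE must resume whether the OTA succeeds or fails. The sketch below shows that control flow in Python as a language-neutral illustration. The firmware itself is C, and stop_ble, run_ota, and resume_ble are hypothetical callables standing in for the NimBLE and OTA calls.

```python
def ota_with_ble_paused(stop_ble, run_ota, resume_ble):
    """Pause BLE, run the OTA, and always resume BLE afterwards."""
    stop_ble()
    try:
        return run_ota()
    finally:
        # Runs on success and on failure, so BLE is never left stopped.
        resume_ble()


events = []
ok = ota_with_ble_paused(
    stop_ble=lambda: events.append("ble_stop"),
    run_ota=lambda: events.append("ota") or True,
    resume_ble=lambda: events.append("ble_resume"),
)
print(ok, events)  # True ['ble_stop', 'ota', 'ble_resume']
```

In C the same guarantee is usually achieved with a single cleanup path that every exit from the OTA routine passes through.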
Caveat: Nodes running old firmware (1-10) can't receive this fix via OTA because the crash happens during the OTA itself. These nodes must be USB-flashed with firmware 0.8.0+ first, then future OTA updates will work. Node 11 was USB-flashed with the watchdog firmware and can receive OTA updates.
7. Can't SSH to babycube via LAN
Symptom: ssh thyhack@10.0.10.10 hangs at banner exchange. Ping works, TCP port 22 is open, but SSH never completes the handshake.
Workaround: Use the Tailscale IP instead:
ssh thyhack@100.90.238.87
Not the cause: CrowdSec. The 10.0.0.0/8 range is whitelisted in CrowdSec (cscli decisions list shows no active decisions for LAN IPs). The banner hang occurs before any authentication attempt, so it's not a firewall block.
Suspected cause: Unknown. Possibly an MTU/fragmentation issue on the LAN segment, or a network stack bug in the babycube's NIC driver. The Tailscale overlay network (WireGuard over UDP) bypasses whatever is causing the LAN TCP issue.
8. Right USB-C port doesn't work on some ESP32-S3 boards
Symptom: Plugging into the right USB-C port (when facing the board with USB-C toward you) shows no serial device on the host.
Fix: Use the left USB-C port. On most ESP32-S3-DevKitC boards, the left port is the USB-to-UART bridge (CP2102/CH340) used for flashing and serial monitor. The right port is the native USB (USB-JTAG) which requires different drivers and isn't used by the RuView firmware.