# ADR-062: QEMU ESP32-S3 Swarm Configurator

| Field       | Value                                          |
|-------------|------------------------------------------------|
| **Status**  | Accepted |
| **Date**    | 2026-03-14 |
| **Authors** | RuView Team |
| **Relates** | ADR-061 (QEMU testing platform), ADR-060 (channel/MAC filter), ADR-018 (binary frame), ADR-039 (edge intel) |

## Glossary

| Term | Definition |
|------|-----------|
| Swarm | A group of N QEMU ESP32-S3 instances running simultaneously |
| Topology | How nodes are connected: star, mesh, line, ring |
| Role | Node function: `sensor` (collects CSI), `coordinator` (aggregates + forwards), `gateway` (bridges to host) |
| Scenario matrix | Cross-product of topology × node count × NVS config × mock scenario |
| Health oracle | Python process that monitors all node UART logs and declares swarm health |
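
The "scenario matrix" entry can be made concrete with a small sketch. The axis values below are hypothetical and chosen only for illustration (the scenario numbers follow the comments in the YAML schema later in this document); the function name `scenario_matrix` is not part of the tooling.

```python
from itertools import product

# Hypothetical axis values, for illustration only.
TOPOLOGIES = ["star", "mesh", "line", "ring"]
NODE_COUNTS = [2, 3]
NVS_CONFIGS = [{"edge_tier": 2}]
MOCK_SCENARIOS = [0, 2, 3]  # empty room, walking person, fall event


def scenario_matrix(topologies, node_counts, nvs_configs, scenarios):
    """Return one dict per combination of the four axes."""
    return [
        {"topology": t, "node_count": n, "nvs": nvs, "scenario": s}
        for t, n, nvs, s in product(topologies, node_counts, nvs_configs, scenarios)
    ]


matrix = scenario_matrix(TOPOLOGIES, NODE_COUNTS, NVS_CONFIGS, MOCK_SCENARIOS)
print(len(matrix))  # 4 * 2 * 1 * 3 = 24 combinations
```

The matrix grows multiplicatively with each axis, which is one reason a small preset library is easier to run in CI than the full cross-product.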

## Context

ADR-061 Layer 3 provides a basic multi-node mesh test: N identical nodes with sequential TDM slots connected via a Linux bridge. This is useful but limited:

1. **All nodes are identical**: real deployments have heterogeneous roles (sensor, coordinator, gateway)
2. **Single topology**: only a fully connected bridge; no star, line, or ring topologies
3. **No scenario variation per node**: all nodes run the same mock CSI scenario
4. **Manual configuration**: each test requires hand-editing env vars and arguments
5. **No swarm-level health monitoring**: validation checks individual nodes, not collective behavior
6. **No cross-node timing validation**: TDM slot ordering and inter-frame gaps aren't verified

Real WiFi-DensePose deployments use 3 to 8 ESP32-S3 nodes in various topologies. A single coordinator aggregates CSI from multiple sensors. The firmware must handle TDM conflicts, missing nodes, role-based behavior differences, and network partitions, none of which ADR-061 Layer 3 tests.

## Decision

Build a **QEMU Swarm Configurator**: a YAML-driven tool that defines multi-node test scenarios declaratively and orchestrates them under QEMU with swarm-level validation.

### Architecture

```
┌───────────────────────────────────────────────────────┐
│                   swarm_config.yaml                   │
│  nodes: [{role: sensor, scenario: 2, channel: 6}]     │
│  topology: star                                       │
│  duration: 60s                                        │
│  assertions: [all_nodes_boot, tdm_no_collision, ...]  │
└───────────────────────────┬───────────────────────────┘
                            │
               ┌────────────▼────────────┐
               │      qemu_swarm.py      │
               │     (orchestrator)      │
               └─┬────┬────┬───────┬─────┘
                 │    │    │       │
            ┌────▼┐ ┌─▼─┐  ▼   ┌───▼──┐
            │Node0│ │N1 │ ...  │N(n-1)│  QEMU instances
            │sens │ │sen│      │coord │
            └──┬──┘ └─┬─┘      └───┬──┘
               │      │            │
            ┌──▼──────▼────────────▼──┐
            │     Virtual Network     │  TAP bridge / SLIRP
            │    (topology-shaped)    │
            └────────────┬────────────┘
                         │
            ┌────────────▼────────────┐
            │    Aggregator (Rust)    │  Collects frames
            └────────────┬────────────┘
                         │
            ┌────────────▼────────────┐
            │      Health Oracle      │  Swarm-level assertions
            │    (swarm_health.py)    │
            └─────────────────────────┘
```

### YAML Configuration Schema

```yaml
# swarm_config.yaml
swarm:
  name: "3-sensor-star"
  duration_s: 60
  topology: star          # star | mesh | line | ring
  aggregator_port: 5005

nodes:
  - role: coordinator
    node_id: 0
    scenario: 0           # empty room (baseline)
    channel: 6
    edge_tier: 2
    is_gateway: true      # receives aggregated frames

  - role: sensor
    node_id: 1
    scenario: 2           # walking person
    channel: 6
    tdm_slot: 1           # TDM slot index (auto-assigned from node position if omitted)

  - role: sensor
    node_id: 2
    scenario: 3           # fall event
    channel: 6
    tdm_slot: 2

assertions:
  - all_nodes_boot
  - no_crashes
  - tdm_no_collision
  - all_nodes_produce_frames
  - coordinator_receives_from_all
  - fall_detected_by_node_2
  - frame_rate_above: 15  # Hz minimum per node
  - max_boot_time_s: 10
```
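
As a rough sketch of how a loader might post-process this schema, the snippet below works on an already-parsed dict (for example, the result of PyYAML's `yaml.safe_load`). The helper names `normalize_nodes` and `split_assertion` are hypothetical, and the `is_gateway` default of `False` is an assumption; only the "auto-assigned from node position" rule for `tdm_slot` comes from the schema comment above.

```python
def normalize_nodes(config):
    """Fill in per-node defaults for an already-parsed swarm config dict."""
    nodes = []
    for position, raw in enumerate(config["nodes"]):
        node = dict(raw)  # copy so the caller's config is not mutated
        # From the schema comment: tdm_slot defaults to the node's list position.
        node.setdefault("tdm_slot", position)
        # Assumption: nodes are not gateways unless they say so.
        node.setdefault("is_gateway", False)
        nodes.append(node)
    ids = [n["node_id"] for n in nodes]
    if len(ids) != len(set(ids)):
        raise ValueError(f"duplicate node_id in {ids}")
    return nodes


def split_assertion(entry):
    """Bare names carry no argument; 'name: value' items parse as 1-key dicts."""
    if isinstance(entry, dict):
        (name, value), = entry.items()
        return name, value
    return entry, None


config = {
    "swarm": {"name": "3-sensor-star", "duration_s": 60, "topology": "star"},
    "nodes": [
        {"role": "coordinator", "node_id": 0, "scenario": 0, "is_gateway": True},
        {"role": "sensor", "node_id": 1, "scenario": 2},  # tdm_slot omitted
        {"role": "sensor", "node_id": 2, "scenario": 3, "tdm_slot": 2},
    ],
    "assertions": ["all_nodes_boot", {"frame_rate_above": 15}],
}
nodes = normalize_nodes(config)
checks = [split_assertion(a) for a in config["assertions"]]
```

Note that the `assertions` list mixes plain strings with single-key mappings, so any consumer has to handle both shapes.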

### Topologies

| Topology | Network | Description |
|----------|---------|-------------|
| `star` | All sensors connect to coordinator; coordinator has TAP to each sensor | Hub-and-spoke, most common |
| `mesh` | All nodes on same bridge (existing Layer 3 behavior) | Every node sees every other |
| `line` | Node 0 ↔ Node 1 ↔ Node 2 ↔ ... | Linear chain, tests multi-hop |
| `ring` | Like line but last connects to first | Circular, tests routing |
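
The four topologies can be described as sets of point-to-point links. The function below is an illustrative sketch, not the orchestrator's actual code: `topology_links` is a hypothetical name, and treating the first node as the star hub is an assumption.

```python
from itertools import combinations


def topology_links(topology, node_ids, hub=None):
    """Return the undirected links of a topology as (a, b) pairs."""
    ids = list(node_ids)
    if topology == "mesh":
        # Every node sees every other node.
        return list(combinations(ids, 2))
    if topology == "star":
        # Assumption: the first node is the coordinator unless a hub is given.
        hub = ids[0] if hub is None else hub
        return [(hub, n) for n in ids if n != hub]
    if topology in ("line", "ring"):
        links = list(zip(ids, ids[1:]))  # chain of neighbours
        if topology == "ring" and len(ids) > 2:
            links.append((ids[-1], ids[0]))  # close the loop
        return links
    raise ValueError(f"unknown topology: {topology}")


print(topology_links("ring", [0, 1, 2, 3]))
```

A star of N nodes needs N-1 links and a mesh needs N(N-1)/2, which is worth keeping in mind when each link maps to a TAP interface.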

### Node Roles

| Role | Behavior | NVS Keys |
|------|----------|----------|
| `sensor` | Runs mock CSI, sends frames to coordinator | `node_id`, `tdm_slot`, `target_ip` |
| `coordinator` | Receives frames from sensors, runs edge aggregation | `node_id`, `tdm_slot=0`, `edge_tier=2` |
| `gateway` | Like coordinator but also bridges to host UDP | `node_id`, `target_ip=host`, `is_gateway=1` |
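
One way to read the role table is as a mapping from a node entry to the NVS values it would be provisioned with. The sketch below is an assumption-heavy illustration: `nvs_values` is a hypothetical helper, and the IP addresses are placeholders from the documentation range 192.0.2.0/24, not addresses used by the tooling.

```python
def nvs_values(node, coordinator_ip, host_ip):
    """Derive NVS key/value pairs for a node from its role (per the table)."""
    role = node["role"]
    values = {"node_id": node["node_id"]}
    if role == "sensor":
        values["tdm_slot"] = node["tdm_slot"]
        values["target_ip"] = coordinator_ip  # sensors send frames to the coordinator
    elif role == "coordinator":
        values["tdm_slot"] = 0
        values["edge_tier"] = 2
    elif role == "gateway":
        values["target_ip"] = host_ip  # gateways bridge to the host
        values["is_gateway"] = 1
    else:
        raise ValueError(f"unknown role: {role}")
    return values


# Placeholder addresses, for illustration only.
COORDINATOR_IP = "192.0.2.10"
HOST_IP = "192.0.2.1"

sensor = {"role": "sensor", "node_id": 1, "tdm_slot": 1}
print(nvs_values(sensor, COORDINATOR_IP, HOST_IP))
```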

### Assertions (Swarm-Level)

| Assertion | What It Checks |
|-----------|---------------|
| `all_nodes_boot` | Every node's UART log shows boot indicators within timeout |
| `no_crashes` | No Guru Meditation, assert, panic in any log |
| `tdm_no_collision` | No two nodes transmit in the same TDM slot |
| `all_nodes_produce_frames` | Every sensor node's log contains CSI frame output |
| `coordinator_receives_from_all` | Coordinator log shows frames from each sensor's node_id |
| `fall_detected_by_node_N` | Node N's log reports a fall detection event |
| `frame_rate_above` | Each node produces at least N frames/second |
| `max_boot_time_s` | All nodes boot within N seconds |
| `no_heap_errors` | No OOM or heap corruption in any log |
| `network_partitioned_recovery` | After deliberate partition, nodes resume communication (future) |
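
Three of these checks are simple enough to sketch. The code below is a simplified illustration, not the logic of `swarm_health.py`: the function signatures, the `(passed, offenders)` return shape, and the idea of checking TDM collisions from configured slots rather than from observed transmissions are all assumptions. The crash patterns are the ones named in the table.

```python
import re
from collections import Counter

# Patterns named in the table; a real oracle would likely be stricter.
CRASH_PATTERN = re.compile(r"Guru Meditation|assert|panic", re.IGNORECASE)


def no_crashes(logs):
    """logs maps node_id -> UART log text. Returns (passed, offending ids)."""
    bad = sorted(nid for nid, text in logs.items() if CRASH_PATTERN.search(text))
    return (not bad, bad)


def tdm_no_collision(nodes):
    """Config-level check: no TDM slot is assigned to more than one node."""
    counts = Counter(n["tdm_slot"] for n in nodes)
    dupes = sorted(slot for slot, count in counts.items() if count > 1)
    return (not dupes, dupes)


def frame_rate_above(frame_counts, duration_s, min_hz):
    """frame_counts maps node_id -> frames seen over duration_s seconds."""
    slow = sorted(
        nid for nid, count in frame_counts.items() if count / duration_s < min_hz
    )
    return (not slow, slow)


# Hypothetical inputs for illustration.
logs = {
    0: "I (312) main: boot ok",
    1: "Guru Meditation Error: Core 0 panic'ed (LoadProhibited)",
}
print(no_crashes(logs))
print(frame_rate_above({1: 1200, 2: 600}, duration_s=60, min_hz=15))
```

Returning the offending node ids alongside the pass/fail flag keeps failure reports actionable when several nodes run at once.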

### Preset Configurations

| Preset | Nodes | Topology | Purpose |
|--------|-------|----------|---------|
| `smoke` | 2 | star | Quick CI smoke test (15s) |
| `standard` | 3 | star | Default 3-node (sensor + sensor + coordinator) |
| `large-mesh` | 6 | mesh | Scale test with 6 fully-connected nodes |
| `line-relay` | 4 | line | Multi-hop relay chain |
| `ring-fault` | 4 | ring | Ring with fault injection mid-test |
| `heterogeneous` | 5 | star | Mixed scenarios: walk, fall, static, channel-sweep, empty |
| `ci-matrix` | 3 | star | CI-optimized preset (30s, minimal assertions) |

## File Layout

```
scripts/
├── qemu_swarm.py              # Main orchestrator (CLI entry point)
├── swarm_health.py            # Swarm-level health oracle
└── swarm_presets/
    ├── smoke.yaml
    ├── standard.yaml
    ├── large_mesh.yaml
    ├── line_relay.yaml
    ├── ring_fault.yaml
    ├── heterogeneous.yaml
    └── ci_matrix.yaml

.github/workflows/
└── firmware-qemu.yml          # MODIFIED: add swarm test job
```
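
The preset names in the table use hyphens (`large-mesh`) while the files use underscores (`large_mesh.yaml`). A resolver along the following lines would bridge the two; this is an assumption about how names might map to files, and `preset_path` is a hypothetical helper rather than a documented part of `qemu_swarm.py`.

```python
from pathlib import PurePosixPath

# Directory taken from the file layout above.
PRESET_DIR = PurePosixPath("scripts/swarm_presets")


def preset_path(name):
    """Map a preset name such as 'large-mesh' to its YAML file path."""
    return PRESET_DIR / (name.replace("-", "_") + ".yaml")


print(preset_path("large-mesh"))
```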

## Consequences

### Benefits

1. **Declarative testing**: define swarm topology in YAML, not shell scripts
2. **Role-based nodes**: test coordinator/sensor/gateway interactions
3. **Topology variety**: star/mesh/line/ring match real deployment patterns
4. **Swarm-level assertions**: validate collective behavior, not just individual nodes
5. **Preset library**: quick CI smoke tests and thorough manual validation
6. **Reproducible**: YAML configs are version-controlled and shareable

### Limitations

1. **Still requires root** for TAP bridge topologies (star, line, ring); mesh can use SLIRP
2. **QEMU resource usage**: 6+ QEMU instances use ~2 GB RAM and may slow CI runners
3. **No real RF**: inter-node communication is IP-based, not WiFi CSI multipath

## References

- ADR-061: QEMU ESP32-S3 firmware testing platform (Layers 1-9)
- ADR-060: Channel override and MAC address filter provisioning
- ADR-018: Binary CSI frame format (magic `0xC5110001`)
- ADR-039: Edge intelligence pipeline (biquad, vitals, fall detection)