docs: implement Codex recommendations for temperature monitoring

Add comprehensive documentation improvements based on architectural review: 1. Enhanced Known Limitations section: - Document single proxy failure mode - Explain sensors output parsing brittleness with mitigation steps - Clarify cluster discovery dependencies and fallback options - Describe SSH fan-out scaling considerations for large clusters 2. Documented SSH key rotation workflow: - Promote automated rotation script as recommended approach - Include dry-run, execution, and rollback examples - Provide manual fallback process - Reference existing pulse-proxy-rotate-keys.sh script 3. Added Future Improvements roadmap: - Proxmox API integration (when available) - Agent-based architecture option - SNMP/IPMI support - Schema validation - Caching and throttling - Automated rotation timer - Health check endpoint Instrumentation verified: proxy already has comprehensive Prometheus metrics (RPC/SSH requests, latency, queue depth, rate limiting) and structured logging.
2026-04-28 03:20:11 +00:00 · 2025-10-17 12:03:31 +00:00 · 2025-10-17 12:03:31 +00:00 · a5d4d57097
commit a5d4d57097
parent 07fe382553
1 changed files with 97 additions and 24 deletions
--- a/docs/TEMPERATURE_MONITORING.md
+++ b/docs/TEMPERATURE_MONITORING.md
@ -471,7 +471,35 @@ journalctl -u pulse | grep -i "proxy\|temperature"

 ### SSH Key Rotation

-Rotate SSH keys periodically for security (recommended every 90 days):
+Rotate SSH keys periodically for security (recommended every 90 days).
+
+**Automated Rotation (Recommended):**
+
+The `/opt/pulse/scripts/pulse-proxy-rotate-keys.sh` script handles rotation safely with staging, verification, and rollback support:
+
+```bash
+# 1. Dry-run first (recommended)
+sudo /opt/pulse/scripts/pulse-proxy-rotate-keys.sh --dry-run
+
+# 2. Perform rotation
+sudo /opt/pulse/scripts/pulse-proxy-rotate-keys.sh
+```
+
+**What the script does:**
+- Generates new Ed25519 keypair in staging directory
+- Pushes new key to all cluster nodes via proxy RPC
+- Verifies SSH connectivity with new key on each node
+- Atomically swaps keys (current → backup, staging → active)
+- Preserves old keys for rollback
+
+**If rotation fails, rollback:**
+```bash
+sudo /opt/pulse/scripts/pulse-proxy-rotate-keys.sh --rollback
+```
+
+**Manual Rotation (Fallback):**
+
+If the automated script fails or is unavailable:

 ```bash
 # 1. On Proxmox host, backup old keys
@ -482,22 +510,10 @@ cp id_ed25519.pub id_ed25519.pub.backup
 # 2. Generate new keypair
 ssh-keygen -t ed25519 -f id_ed25519 -N "" -C "pulse-sensor-proxy-rotated"

-# 3. Get the new public key
-cat id_ed25519.pub
+# 3. Re-run setup to push keys to cluster
+curl -fsSL https://get.pulsenode.com/install-proxy.sh | bash -s -- --ctid <your-container-id>

-# 4. Add new key to all cluster nodes
-# For each node in your cluster:
-ssh root@node1 "echo 'NEW_PUBLIC_KEY_HERE' >> /root/.ssh/authorized_keys"
-ssh root@node2 "echo 'NEW_PUBLIC_KEY_HERE' >> /root/.ssh/authorized_keys"
-# ... repeat for all nodes
-
-# 5. Restart proxy to use new keys
-systemctl restart pulse-sensor-proxy
-
-# 6. Verify temperature data still works in Pulse UI
-
-# 7. Remove old keys from nodes (after confirming new keys work)
-ssh root@node1 "sed -i '/pulse-sensor-proxy-old/d' /root/.ssh/authorized_keys"
+# 4. Verify temperature data still works in Pulse UI
 ```

 ### Revoking Access When Nodes Leave
@ -567,18 +583,33 @@ test -S /run/pulse-sensor-proxy/pulse-sensor-proxy.sock && echo "Socket OK" || e

 ### Known Limitations

-**One Proxy Per Host:**
+**Single Proxy = Single Point of Failure:**
 - Each Proxmox host runs one pulse-sensor-proxy instance
+- If the proxy service dies, temperature monitoring stops for all containers on that host
+- This is acceptable for read-only telemetry, but be aware of the failure mode
+- Systemd auto-restart (`Restart=on-failure`) mitigates most outages
 - If multiple Pulse containers run on same host, they share the same proxy
- All containers see the same temperature data from the same cluster
+
+**Sensors Output Parsing Brittleness:**
+- Pulse depends on `sensors -j` JSON output format from lm-sensors
+- Changes to sensor names, structure, or output format could break parsing
+- Consider adding schema validation and instrumentation to detect issues early
+- Monitor proxy logs for parsing errors: `journalctl -u pulse-sensor-proxy | grep -i "parse\|error"`
+
+**Cluster Discovery Limitations:**
+- Proxy uses `pvecm status` to discover cluster nodes (requires Proxmox IPC access)
+- If Proxmox hardens IPC access or cluster topology changes unexpectedly, discovery may fail
+- Standalone Proxmox nodes work but only monitor that single node
+- Fallback: Re-run setup script manually to reconfigure cluster access
+
+**SSH Fan-Out Scaling:**
+- Proxy SSHs to each node sequentially during each polling cycle
+- Large clusters (10+ nodes) at short intervals may trigger rate limiting or increase load
+- Consider implementing caching or throttling if you experience SSH connection issues
+- Monitor SSH latency metrics: `curl -s http://127.0.0.1:9127/metrics | grep pulse_proxy_ssh_latency`

 **Requires Proxmox Cluster Membership:**
- Proxy uses `pvecm nodes` to discover cluster members
- Standalone Proxmox nodes work but only monitor that single node
- For standalone nodes, proxy is less useful (direct SSH works fine)
-
-**Passwordless Root SSH Required:**
- Proxy assumes passwordless root SSH between cluster nodes
+- Proxy requires passwordless root SSH between cluster nodes
 - Standard for Proxmox clusters, but hardened environments may differ
 - Alternative: Create dedicated service account with sudo access to `sensors`

@ -611,6 +642,48 @@ test -S /run/pulse-sensor-proxy/pulse-sensor-proxy.sock && echo "Socket OK" || e
 3. Test manually: `/usr/local/bin/pulse-sensor-proxy --version`
 4. Check socket directory: `ls -ld /var/run`

+### Future Improvements
+
+**Potential Enhancements (Roadmap):**
+
+1. **Proxmox API Integration**
+   - If future Proxmox versions expose temperature telemetry via API, retire SSH approach
+   - Would eliminate SSH key management and improve security posture
+   - Monitor Proxmox development for metrics/RRD temperature endpoints
+
+2. **Agent-Based Architecture**
+   - Deploy lightweight agents on each node for richer telemetry
+   - Reduces SSH fan-out overhead for large clusters
+   - Trade-off: Adds deployment/update complexity
+   - Consider only if demand for additional metrics grows
+
+3. **SNMP/IPMI Support**
+   - Optional integration for baseboard management controllers
+   - Better for hardware-level sensors (baseboard temps, fan speeds)
+   - Requires hardware/firmware support, so keep as optional add-on
+
+4. **Schema Validation**
+   - Add JSON schema validation for `sensors -j` output
+   - Detect format changes early with instrumentation
+   - Log warnings when unexpected sensor formats appear
+
+5. **Caching & Throttling**
+   - Implement result caching for large clusters (10+ nodes)
+   - Reduce SSH overhead with configurable TTL
+   - Add request throttling to prevent SSH rate limiting
+
+6. **Automated Key Rotation**
+   - Systemd timer for automatic 90-day rotation
+   - Already supported via `/opt/pulse/scripts/pulse-proxy-rotate-keys.sh`
+   - Just needs timer unit configuration (documented in hardening guide)
+
+7. **Health Check Endpoint**
+   - Add `/health` endpoint separate from Prometheus metrics
+   - Enable external monitoring systems (Nagios, Zabbix, etc.)
+   - Return proxy status, socket accessibility, and last successful poll
+
+**Contributions Welcome:** If any of these improvements interest you, open a GitHub issue to discuss implementation!
+
 ### Getting Help

 If temperature monitoring isn't working: