# Pulse Sensor Proxy Runbook ## Quick Reference - Binary: `/opt/pulse/sensor-proxy/bin/pulse-sensor-proxy` - Unit: `pulse-sensor-proxy.service` - Logs: `/var/log/pulse/sensor-proxy/proxy.log` - Audit trail: `/var/log/pulse/sensor-proxy/audit.log` (hash chained, forwarded via rsyslog) - Metrics: `http://127.0.0.1:9127/metrics` (set `PULSE_SENSOR_PROXY_METRICS_ADDR` to change/disable) - Limiters: 1 request/sec per UID (burst 5), per-UID concurrency 2, global concurrency 8, 2 s penalty on validation failures ## Monitoring Alerts & Response ```mermaid sequenceDiagram participant Backend participant Proxy participant Node Backend->>Proxy: get_temperature Proxy->>Proxy: Check rate limit Proxy->>Node: SSH sensors -j Node->>Proxy: JSON response Proxy->>Backend: Temperature data ``` ### Rate Limit Hits (`pulse_proxy_limiter_rejections_total`) 1. Check audit log entries tagged `limiter.rejection` for offending UID. 2. Confirm workload legitimacy; if expected, consider increasing limits via config override. 3. If malicious, block source process/user and inspect Pulse audit logs. ### Penalty Events (`pulse_proxy_limiter_penalties_total`) 1. Review corresponding validation failures in audit log (`command.validation_failed`). 2. If repeated invalid JSON/unknown methods, inspect caller code for regressions or intrusion attempts. ### Audit Log Forwarder Down 1. `journalctl -u rsyslog` to confirm transmission errors. 2. Ensure `/etc/pulse/log-forwarding` certs valid & remote host reachable. 3. Forwarding queue stored locally in `/var/log/pulse/sensor-proxy/forwarding.log`; ship manually if outage exceeds 1 hour. ### Proxy Health Endpoint Fails 1. `systemctl status pulse-sensor-proxy` 2. Check `/var/log/pulse/sensor-proxy/proxy.log` for panic or limiter exhaustion. 3. Inspect `/var/log/pulse/sensor-proxy/audit.log` for recent privileged method denials. ## Standard Procedures ### Restart Proxy Safely ```bash sudo systemctl stop pulse-sensor-proxy sudo apparmor_parser -r /etc/apparmor.d/pulse-sensor-proxy # if updating policy sudo systemctl start pulse-sensor-proxy ``` Verify: ```bash # Metrics endpoint exposes proxy build/health curl -s http://127.0.0.1:9127/metrics | grep pulse_proxy_build_info # Ensure adaptive polling sees the proxy again curl -s http://localhost:7655/api/monitoring/scheduler/health \ | jq '.instances[] | select(.key | contains("temperature")) | {key, pollStatus}' ``` Temperature instances should show recent `lastSuccess` timestamps with no DLQ entries. ### Rotate SSH Keys 1. Run `scripts/secure-sensor-files.sh` to regenerate keys (ensure environment locked down). 2. Use RPC `ensure_cluster_keys` to distribute new public key. 3. Confirm nodes accept `ssh` from proxy host. 4. Confirm the scheduler clears any temporary breakers/dlq entries: ```bash curl -s http://localhost:7655/api/monitoring/scheduler/health \ | jq '.instances[] | select(.key | contains("temperature")) | {key, breaker: .breaker.state, deadLetter: .deadLetter.present}' ``` Expect `breaker.state=="closed"` and `deadLetter.present==false` for all proxy-driven pollers. ### Rate Limit Tuning | Profile | Nodes | `per_peer_interval_ms` | `per_peer_burst` | Notes | | --- | --- | --- | --- | --- | | Default | ≤5 | 1000 | 5 | Shipped with commit 46b8b8d; no action needed for single host clusters. | | Medium | 6–10 | 500 | 10 | Doubles throughput; monitor `pulse_proxy_limiter_rejects_total`. | | Large | 11–20 | 250 | 20 | Confirm proxy CPU stays below 70 % and audit logs remain clean. | | XL | 21–40 | 150 | 30 | Requires high-trust environment; ensure UID filters are locked down. | **Procedure:** 1. Edit `/etc/pulse-sensor-proxy/config.yaml` and set the desired profile values under `rate_limit`. 2. Restart the service: ```bash sudo systemctl restart pulse-sensor-proxy ``` 3. Validate: ```bash curl -s http://127.0.0.1:9127/metrics \ | grep pulse_proxy_limiter_rejects_total ``` The counter should stop incrementing during steady-state polling. 4. Record the change in the operations log and review audit entries for unexpected callers. ## Incident Handling - **Unauthorized Command Attempt:** audit log shows `command.validation_failed` and limiter penalties; capture correlation ID, check Pulse side for compromised container. - **Excessive Temperature Failures:** refer to `pulse_proxy_ssh_requests_total{result="error"}`; validate network ACLs and node health; escalate to Proxmox team if nodes unreachable. - **Log Tampering Suspected:** verify audit hash chain by replaying `eventHash` values; compare with remote log store (immutable). Trigger security response if mismatch. ## Postmortem Checklist - Timeline: command audit entries, limiter stats, rsyslog queue depth. - Verify AppArmor/seccomp status (`aa-status`, `systemctl show pulse-sensor-proxy -p AppArmorProfile`). - Ensure firewall ACLs match `docs/security/pulse-sensor-proxy-network.md`.