mirror of
https://github.com/rcourtman/Pulse.git
synced 2026-04-29 20:10:21 +00:00
Document the pulse-sensor-proxy rate limiting bug fix and new configurability across all relevant documentation: TEMPERATURE_MONITORING.md: - Added 'Rate Limiting & Scaling' section with symptom diagnosis - Included sizing table for 1-3, 4-10, 10-20, and 30+ node deployments - Provided tuning formula: interval_ms = polling_interval / node_count TROUBLESHOOTING.md: - Added 'Temperature data flickers after adding nodes' section - Step-by-step diagnosis using limiter metrics and scheduler health - Quick fix with config example CONFIGURATION.md: - Added pulse-sensor-proxy/config.yaml reference section - Documented rate_limit.per_peer_interval_ms and per_peer_burst fields - Included defaults and example override pulse-sensor-proxy-runbook.md: - Updated quick reference with new defaults (1 req/sec, burst 5) - Added 'Rate Limit Tuning' procedure with 4 deployment profiles - Included validation steps and monitoring commands TEMPERATURE_MONITORING_SECURITY.md: - Updated rate limiting section with new defaults - Added configurable overrides guidance - Documented security considerations for production deployments Related commits: -46b8b8d08: Initial rate limit fix (hardcoded defaults) -ca534e2b6: Made rate limits configurable via YAML -e244da837: Added guidance for large deployments (30+ nodes)
105 lines
4.9 KiB
Markdown
105 lines
4.9 KiB
Markdown
# Pulse Sensor Proxy Runbook
|
||
|
||
## Quick Reference
|
||
- Binary: `/opt/pulse/sensor-proxy/bin/pulse-sensor-proxy`
|
||
- Unit: `pulse-sensor-proxy.service`
|
||
- Logs: `/var/log/pulse/sensor-proxy/proxy.log`
|
||
- Audit trail: `/var/log/pulse/sensor-proxy/audit.log` (hash chained, forwarded via rsyslog)
|
||
- Metrics: `http://127.0.0.1:9127/metrics` (set `PULSE_SENSOR_PROXY_METRICS_ADDR` to change/disable)
|
||
- Limiters: 1 request/sec per UID (burst 5), per-UID concurrency 2, global concurrency 8, 2 s penalty on validation failures
|
||
|
||
## Monitoring Alerts & Response
|
||
|
||
```mermaid
|
||
sequenceDiagram
|
||
participant Backend
|
||
participant Proxy
|
||
participant Node
|
||
|
||
Backend->>Proxy: get_temperature
|
||
Proxy->>Proxy: Check rate limit
|
||
Proxy->>Node: SSH sensors -j
|
||
Node->>Proxy: JSON response
|
||
Proxy->>Backend: Temperature data
|
||
```
|
||
|
||
### Rate Limit Hits (`pulse_proxy_limiter_rejections_total`)
|
||
1. Check audit log entries tagged `limiter.rejection` for offending UID.
|
||
2. Confirm workload legitimacy; if expected, consider increasing limits via config override.
|
||
3. If malicious, block source process/user and inspect Pulse audit logs.
|
||
|
||
### Penalty Events (`pulse_proxy_limiter_penalties_total`)
|
||
1. Review corresponding validation failures in audit log (`command.validation_failed`).
|
||
2. If repeated invalid JSON/unknown methods, inspect caller code for regressions or intrusion attempts.
|
||
|
||
### Audit Log Forwarder Down
|
||
1. `journalctl -u rsyslog` to confirm transmission errors.
|
||
2. Ensure `/etc/pulse/log-forwarding` certs valid & remote host reachable.
|
||
3. Forwarding queue stored locally in `/var/log/pulse/sensor-proxy/forwarding.log`; ship manually if outage exceeds 1 hour.
|
||
|
||
### Proxy Health Endpoint Fails
|
||
1. `systemctl status pulse-sensor-proxy`
|
||
2. Check `/var/log/pulse/sensor-proxy/proxy.log` for panic or limiter exhaustion.
|
||
3. Inspect `/var/log/pulse/sensor-proxy/audit.log` for recent privileged method denials.
|
||
|
||
## Standard Procedures
|
||
### Restart Proxy Safely
|
||
```bash
|
||
sudo systemctl stop pulse-sensor-proxy
|
||
sudo apparmor_parser -r /etc/apparmor.d/pulse-sensor-proxy # if updating policy
|
||
sudo systemctl start pulse-sensor-proxy
|
||
```
|
||
Verify:
|
||
```bash
|
||
# Metrics endpoint exposes proxy build/health
|
||
curl -s http://127.0.0.1:9127/metrics | grep pulse_proxy_build_info
|
||
|
||
# Ensure adaptive polling sees the proxy again
|
||
curl -s http://localhost:7655/api/monitoring/scheduler/health \
|
||
| jq '.instances[] | select(.key | contains("temperature")) | {key, pollStatus}'
|
||
```
|
||
Temperature instances should show recent `lastSuccess` timestamps with no DLQ entries.
|
||
|
||
### Rotate SSH Keys
|
||
1. Run `scripts/secure-sensor-files.sh` to regenerate keys (ensure environment locked down).
|
||
2. Use RPC `ensure_cluster_keys` to distribute new public key.
|
||
3. Confirm nodes accept `ssh` from proxy host.
|
||
4. Confirm the scheduler clears any temporary breakers/dlq entries:
|
||
```bash
|
||
curl -s http://localhost:7655/api/monitoring/scheduler/health \
|
||
| jq '.instances[] | select(.key | contains("temperature")) | {key, breaker: .breaker.state, deadLetter: .deadLetter.present}'
|
||
```
|
||
Expect `breaker.state=="closed"` and `deadLetter.present==false` for all proxy-driven pollers.
|
||
|
||
### Rate Limit Tuning
|
||
|
||
| Profile | Nodes | `per_peer_interval_ms` | `per_peer_burst` | Notes |
|
||
| --- | --- | --- | --- | --- |
|
||
| Default | ≤5 | 1000 | 5 | Shipped with commit 46b8b8d; no action needed for single host clusters. |
|
||
| Medium | 6–10 | 500 | 10 | Doubles throughput; monitor `pulse_proxy_limiter_rejects_total`. |
|
||
| Large | 11–20 | 250 | 20 | Confirm proxy CPU stays below 70 % and audit logs remain clean. |
|
||
| XL | 21–40 | 150 | 30 | Requires high-trust environment; ensure UID filters are locked down. |
|
||
|
||
**Procedure:**
|
||
1. Edit `/etc/pulse-sensor-proxy/config.yaml` and set the desired profile values under `rate_limit`.
|
||
2. Restart the service:
|
||
```bash
|
||
sudo systemctl restart pulse-sensor-proxy
|
||
```
|
||
3. Validate:
|
||
```bash
|
||
curl -s http://127.0.0.1:9127/metrics \
|
||
| grep pulse_proxy_limiter_rejects_total
|
||
```
|
||
The counter should stop incrementing during steady-state polling.
|
||
4. Record the change in the operations log and review audit entries for unexpected callers.
|
||
|
||
## Incident Handling
|
||
- **Unauthorized Command Attempt:** audit log shows `command.validation_failed` and limiter penalties; capture correlation ID, check Pulse side for compromised container.
|
||
- **Excessive Temperature Failures:** refer to `pulse_proxy_ssh_requests_total{result="error"}`; validate network ACLs and node health; escalate to Proxmox team if nodes unreachable.
|
||
- **Log Tampering Suspected:** verify audit hash chain by replaying `eventHash` values; compare with remote log store (immutable). Trigger security response if mismatch.
|
||
|
||
## Postmortem Checklist
|
||
- Timeline: command audit entries, limiter stats, rsyslog queue depth.
|
||
- Verify AppArmor/seccomp status (`aa-status`, `systemctl show pulse-sensor-proxy -p AppArmorProfile`).
|
||
- Ensure firewall ACLs match `docs/security/pulse-sensor-proxy-network.md`.
|