Pulse/docs/operations/pulse-sensor-proxy-runbook.md
rcourtman ddc9a7a068 docs: comprehensive documentation for rate limit fix and configurability
Document the pulse-sensor-proxy rate limiting bug fix and new
configurability across all relevant documentation:

TEMPERATURE_MONITORING.md:
- Added 'Rate Limiting & Scaling' section with symptom diagnosis
- Included sizing table for 1-3, 4-10, 10-20, and 30+ node deployments
- Provided tuning formula: interval_ms = polling_interval / node_count

TROUBLESHOOTING.md:
- Added 'Temperature data flickers after adding nodes' section
- Step-by-step diagnosis using limiter metrics and scheduler health
- Quick fix with config example

CONFIGURATION.md:
- Added pulse-sensor-proxy/config.yaml reference section
- Documented rate_limit.per_peer_interval_ms and per_peer_burst fields
- Included defaults and example override

pulse-sensor-proxy-runbook.md:
- Updated quick reference with new defaults (1 req/sec, burst 5)
- Added 'Rate Limit Tuning' procedure with 4 deployment profiles
- Included validation steps and monitoring commands

TEMPERATURE_MONITORING_SECURITY.md:
- Updated rate limiting section with new defaults
- Added configurable overrides guidance
- Documented security considerations for production deployments

Related commits:
- 46b8b8d08: Initial rate limit fix (hardcoded defaults)
- ca534e2b6: Made rate limits configurable via YAML
- e244da837: Added guidance for large deployments (30+ nodes)
2025-10-21 11:36:07 +00:00

105 lines
4.9 KiB
Markdown
Raw Blame History

This file contains invisible Unicode characters

This file contains invisible Unicode characters that are indistinguishable to humans but may be processed differently by a computer. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# Pulse Sensor Proxy Runbook
## Quick Reference
- Binary: `/opt/pulse/sensor-proxy/bin/pulse-sensor-proxy`
- Unit: `pulse-sensor-proxy.service`
- Logs: `/var/log/pulse/sensor-proxy/proxy.log`
- Audit trail: `/var/log/pulse/sensor-proxy/audit.log` (hash chained, forwarded via rsyslog)
- Metrics: `http://127.0.0.1:9127/metrics` (set `PULSE_SENSOR_PROXY_METRICS_ADDR` to change/disable)
- Limiters: 1 request/sec per UID (burst 5), per-UID concurrency 2, global concurrency 8, 2s penalty on validation failures
## Monitoring Alerts & Response
```mermaid
sequenceDiagram
participant Backend
participant Proxy
participant Node
Backend->>Proxy: get_temperature
Proxy->>Proxy: Check rate limit
Proxy->>Node: SSH sensors -j
Node->>Proxy: JSON response
Proxy->>Backend: Temperature data
```
### Rate Limit Hits (`pulse_proxy_limiter_rejections_total`)
1. Check audit log entries tagged `limiter.rejection` for offending UID.
2. Confirm workload legitimacy; if expected, consider increasing limits via config override.
3. If malicious, block source process/user and inspect Pulse audit logs.
### Penalty Events (`pulse_proxy_limiter_penalties_total`)
1. Review corresponding validation failures in audit log (`command.validation_failed`).
2. If repeated invalid JSON/unknown methods, inspect caller code for regressions or intrusion attempts.
### Audit Log Forwarder Down
1. `journalctl -u rsyslog` to confirm transmission errors.
2. Ensure `/etc/pulse/log-forwarding` certs valid & remote host reachable.
3. Forwarding queue stored locally in `/var/log/pulse/sensor-proxy/forwarding.log`; ship manually if outage exceeds 1 hour.
### Proxy Health Endpoint Fails
1. `systemctl status pulse-sensor-proxy`
2. Check `/var/log/pulse/sensor-proxy/proxy.log` for panic or limiter exhaustion.
3. Inspect `/var/log/pulse/sensor-proxy/audit.log` for recent privileged method denials.
## Standard Procedures
### Restart Proxy Safely
```bash
sudo systemctl stop pulse-sensor-proxy
sudo apparmor_parser -r /etc/apparmor.d/pulse-sensor-proxy # if updating policy
sudo systemctl start pulse-sensor-proxy
```
Verify:
```bash
# Metrics endpoint exposes proxy build/health
curl -s http://127.0.0.1:9127/metrics | grep pulse_proxy_build_info
# Ensure adaptive polling sees the proxy again
curl -s http://localhost:7655/api/monitoring/scheduler/health \
| jq '.instances[] | select(.key | contains("temperature")) | {key, pollStatus}'
```
Temperature instances should show recent `lastSuccess` timestamps with no DLQ entries.
### Rotate SSH Keys
1. Run `scripts/secure-sensor-files.sh` to regenerate keys (ensure environment locked down).
2. Use RPC `ensure_cluster_keys` to distribute new public key.
3. Confirm nodes accept `ssh` from proxy host.
4. Confirm the scheduler clears any temporary breakers/dlq entries:
```bash
curl -s http://localhost:7655/api/monitoring/scheduler/health \
| jq '.instances[] | select(.key | contains("temperature")) | {key, breaker: .breaker.state, deadLetter: .deadLetter.present}'
```
Expect `breaker.state=="closed"` and `deadLetter.present==false` for all proxy-driven pollers.
### Rate Limit Tuning
| Profile | Nodes | `per_peer_interval_ms` | `per_peer_burst` | Notes |
| --- | --- | --- | --- | --- |
| Default | ≤5 | 1000 | 5 | Shipped with commit 46b8b8d; no action needed for single host clusters. |
| Medium | 610 | 500 | 10 | Doubles throughput; monitor `pulse_proxy_limiter_rejects_total`. |
| Large | 1120 | 250 | 20 | Confirm proxy CPU stays below 70% and audit logs remain clean. |
| XL | 2140 | 150 | 30 | Requires high-trust environment; ensure UID filters are locked down. |
**Procedure:**
1. Edit `/etc/pulse-sensor-proxy/config.yaml` and set the desired profile values under `rate_limit`.
2. Restart the service:
```bash
sudo systemctl restart pulse-sensor-proxy
```
3. Validate:
```bash
curl -s http://127.0.0.1:9127/metrics \
| grep pulse_proxy_limiter_rejects_total
```
The counter should stop incrementing during steady-state polling.
4. Record the change in the operations log and review audit entries for unexpected callers.
## Incident Handling
- **Unauthorized Command Attempt:** audit log shows `command.validation_failed` and limiter penalties; capture correlation ID, check Pulse side for compromised container.
- **Excessive Temperature Failures:** refer to `pulse_proxy_ssh_requests_total{result="error"}`; validate network ACLs and node health; escalate to Proxmox team if nodes unreachable.
- **Log Tampering Suspected:** verify audit hash chain by replaying `eventHash` values; compare with remote log store (immutable). Trigger security response if mismatch.
## Postmortem Checklist
- Timeline: command audit entries, limiter stats, rsyslog queue depth.
- Verify AppArmor/seccomp status (`aa-status`, `systemctl show pulse-sensor-proxy -p AppArmorProfile`).
- Ensure firewall ACLs match `docs/security/pulse-sensor-proxy-network.md`.