mirror of
https://github.com/rcourtman/Pulse.git
synced 2026-05-19 07:54:10 +00:00
3.3 KiB
3.3 KiB
Adaptive Polling Rollout Runbook
Adaptive polling (v4.24.0+) lets the scheduler dynamically adjust poll intervals per resource. This runbook documents the safe way to enable, monitor, and, if needed, disable the feature across environments.
Scope & Prerequisites
- Pulse v4.24.0 or newer
- Admin access to Settings → System → Monitoring
- Prometheus access to
pulse_monitor_*metrics - Ability to run authenticated
curlcommands against the Pulse API
Change Windows
Run rollouts during a maintenance window where transient alert jitter is acceptable. Adaptive polling touches every monitor queue; give yourself at least 15 minutes to observe steady state metrics.
Rollout Steps
-
Snapshot current health
curl -s http://localhost:7655/api/monitoring/scheduler/health | jq '.enabled, .queue.depth'Record queue depth, breaker count, and dead-letter entries.
-
Enable adaptive polling
- UI: toggle Settings → System → Monitoring → Adaptive Polling → Enable
- CLI:
jq '.AdaptivePollingEnabled=true' /var/lib/pulse/system.json > tmp && mv tmp system.json - Env override:
ADAPTIVE_POLLING_ENABLED=truebefore starting Pulse (for containers/k8s)
-
Watch metrics (first 5 minutes)
watch -n 5 'curl -s http://localhost:9091/metrics | grep -E "pulse_monitor_(poll_queue_depth|poll_staleness_seconds)" | head'Targets:
pulse_monitor_poll_queue_depth < 50pulse_monitor_poll_staleness_secondsunder your SLA (typically < 60 s)- No spikes in
pulse_monitor_poll_errors_total{category="permanent"}
-
Validate scheduler state
curl -s http://localhost:7655/api/monitoring/scheduler/health \ | jq '{enabled, queue: .queue.depth, breakers: [.breakers[]?.instance], deadLetter: .deadLetter.count}'Expect
enabled: true, empty breaker list, anddeadLetter.count == 0. -
Document overrides
- Note any instances moved to manual polling (Settings → Nodes → Polling)
- Capture Grafana screenshots for queue depth/staleness widgets
Rollback
If queue depth climbs uncontrollably or breakers remain open for >10 minutes:
- Disable the feature the same way you enabled it (UI/environment).
- Restart Pulse if environment overrides were used, otherwise hot toggle is immediate.
- Continue monitoring until queue depth and staleness return to baseline.
Canary Strategy Suggestions
| Stage | Action | Acceptance Criteria |
|---|---|---|
| Dev | Enable flag in hot-dev (scripts/hot-dev.sh) | No scheduler panics, UI reflects flag instantly |
| Staging | Enable on one Pulse instance per region | queue.depth within ±20 % of baseline after 15 min |
| Production | Enable per cluster with 30 min soak | No more than 5 breaker openings per hour |
Instrumentation Checklist
- Grafana dashboard with
queue.depth,poll_staleness_seconds,poll_errors_totalby type - Alert rule:
rate(pulse_monitor_poll_errors_total{category="permanent"}[5m]) > 0 - Alert rule:
max_over_time(pulse_monitor_poll_queue_depth[5m]) > 75 - JSON log search for
"scheduler":warnings immediately after enablement