# Adaptive Polling Operations Runbook

**GA in v4.24.0 - Enabled by Default**

This runbook guides operators managing the adaptive polling scheduler in production. Adaptive polling is enabled by default in v4.24.0 and can be toggled via **Settings → System → Monitoring** (no restart) or environment variables (restart required). Follow these operational procedures for steady-state monitoring, troubleshooting, and rollback scenarios.

---

## 1. Preparation (Before v4.24.0 Deployment)

**For new deployments or upgrades to v4.24.0:**

1. **Monitoring readiness**
   - Review available scrape series in [Prometheus Metrics](../monitoring/PROMETHEUS_METRICS.md).
   - Set up a Grafana dashboard with:
     - Instance gauges: `pulse_monitor_poll_queue_depth`, `pulse_monitor_poll_staleness_seconds`, `pulse_monitor_poll_last_success_timestamp`
     - **Per-node coverage:** `pulse_monitor_node_poll_staleness_seconds`, `pulse_monitor_node_poll_errors_total`, `pulse_monitor_node_poll_duration_seconds`
     - Scheduler health: `pulse_scheduler_queue_depth`, `pulse_scheduler_queue_due_soon`, `pulse_scheduler_dead_letter_depth`, `pulse_scheduler_breaker_state`, and `pulse_scheduler_breaker_failure_count`
     - Diagnostics cache sanity: `increase(pulse_diagnostics_cache_misses_total[5m])` vs `pulse_diagnostics_cache_hits_total`
     - HTTP SLA: `rate(pulse_http_request_errors_total{status_class="server_error"}[5m])`
     - Continue trending `pulse_monitor_poll_total` / `pulse_monitor_poll_errors_total` for throughput
   - Configure alerts (see §4)

2. **Baseline metrics**
   - Record pre-upgrade metrics if upgrading from < v4.24.0:
     - Typical polling frequency
     - Average response times
     - Current alert volumes
   - These help assess adaptive polling impact

3. **Rollback readiness**
   - Verify the update rollback workflow: **Settings → System → Updates → Restore previous version**
   - Test the rollback in a staging environment
   - Confirm `/api/monitoring/scheduler/health` is accessible
   - Document the emergency disable procedure (see §5)

4. **Configuration review**
   - Review `system.json` or environment variables for adaptive polling tunables:
     - `ADAPTIVE_POLLING_BASE_INTERVAL` (default: 10s)
     - `ADAPTIVE_POLLING_MIN_INTERVAL` (default: 5s)
     - `ADAPTIVE_POLLING_MAX_INTERVAL` (default: 5m)
   - Adjust if needed for your environment (e.g., high-frequency monitoring)

---

## 2. Post-Deployment Verification (v4.24.0+)

**Adaptive polling is enabled by default. Verify it's working correctly (a scripted check follows these steps):**

1. **Check scheduler health**

   ```bash
   curl -s http://<pulse-host>:7655/api/monitoring/scheduler/health | jq
   ```

   **Expected response:**
   - `"enabled": true`
   - `queue.depth` reasonable (< instances × 1.5)
   - `deadLetter.count` = 0 (or only known failing instances)
   - `instances[]` array populated with your nodes
   - No `breaker.state` stuck in `open` (except known issues)

2. **Verify UI access**
   - Navigate to **Settings → System → Monitoring**
   - Confirm the "Adaptive Polling" toggle is ON
   - Review queue depth and recent poll status

3. **Check Grafana metrics** (if configured)
   - `pulse_monitor_poll_queue_depth` shows reasonable values
   - `pulse_monitor_poll_staleness_seconds` < 60s for healthy instances
   - `pulse_monitor_poll_errors_total` not rapidly increasing
   - `pulse_monitor_poll_last_success_timestamp` updating regularly

4. **Monitor update history**
   - Check **Settings → System → Updates** for the update entry
   - Verify the rollback button is available
   - Confirm the update status shows "completed"

   **Via API:**

   ```bash
   curl -s http://<pulse-host>:7655/api/updates/history | jq '.entries[0]'
   ```
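The checks above can also be scripted for repeatable verification. The sketch below is a minimal example, assuming `curl` and `jq` are available and that `PULSE_HOST` is a placeholder you substitute for your deployment; the jq field paths mirror the example responses in this runbook and may need adjusting to your version.

```bash
#!/usr/bin/env bash
# Post-deployment smoke check for adaptive polling (sketch; thresholds follow §2).
set -u

PULSE_HOST="${PULSE_HOST:-localhost}"   # placeholder: point at your Pulse host
HEALTH_URL="http://${PULSE_HOST}:7655/api/monitoring/scheduler/health"

health="$(curl -sf "$HEALTH_URL")" || { echo "FAIL: health endpoint unreachable"; exit 1; }

enabled="$(jq -r '.enabled' <<<"$health")"
queue_depth="$(jq -r '.queue.depth' <<<"$health")"
dlq_count="$(jq -r '.deadLetter.count // 0' <<<"$health")"
instances="$(jq -r '.instances | length' <<<"$health")"
open_breakers="$(jq -r '[.instances[] | select(.breaker.state != "closed")] | length' <<<"$health")"

echo "enabled=$enabled queueDepth=$queue_depth dlq=$dlq_count instances=$instances openBreakers=$open_breakers"

status=0
[ "$enabled" = "true" ]    || { echo "FAIL: adaptive polling reported disabled"; status=1; }
[ "$dlq_count" -eq 0 ]     || echo "WARN: dead-letter entries present"
[ "$open_breakers" -eq 0 ] || echo "WARN: $open_breakers breaker(s) not closed"
# Queue depth should stay below ~1.5x the number of active instances (see §2).
if awk -v d="$queue_depth" -v n="$instances" 'BEGIN { exit !(d > n * 1.5) }'; then
  echo "WARN: queue depth above 1.5x instance count"
fi
exit "$status"
```

Run it once after the upgrade and again after any configuration change; a non-zero exit or WARN lines point you at the relevant troubleshooting steps in §6.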
---

## 3. Steady-State Operations

**Ongoing monitoring and SLO checks:**

1. **Daily health checks**
   - Review the scheduler health dashboard or API endpoint
   - Check for:
     - Queue depth < 50 (alert if > 50 for 10+ minutes)
     - Instance staleness < 60s for healthy instances, < 120s for critical instances
     - **Per-node staleness**: `pulse_monitor_node_poll_staleness_seconds` < 120s
     - DLQ count stable (not growing)
     - Circuit breakers mostly `closed`
     - Diagnostics cache misses roughly follow hits (`increase(pulse_diagnostics_cache_misses_total[10m])` in line with hits)
     - HTTP error rate (`rate(pulse_http_request_errors_total{status_class="server_error"}[5m])`) near zero

2. **Weekly reviews**
   - Analyze trends in Grafana:
     - Poll success rates
     - Average queue depth over time
     - Circuit breaker trip frequency
     - Dead-letter queue patterns
   - Document any recurring issues

3. **SLO targets**
   - **Queue depth**: < 1.5× instance count (< 50 typical)
   - **Staleness**: < 60s for healthy instances, < 120s for critical instances
   - **Poll success rate**: > 99% for healthy infrastructure
   - **DLQ growth**: < 5% per week (excluding known failures)
   - **Circuit breaker recovery**: < 5 minutes for transient failures

4. **Log correlation**
   - Cross-reference scheduler health with update history
   - Check `/api/updates/history` for rollback events correlated with scheduler issues
   - Review audit logs for adaptive polling configuration changes

---

## 4. Grafana & Alert Configuration

1. **Dashboard panels**
   - **Queue Depth**: `pulse_monitor_poll_queue_depth`
     - Use a single-stat panel with an alert if > 1.5× active instances for > 10 min
   - **Instance & Node Staleness**: combine `pulse_monitor_poll_staleness_seconds` and `pulse_monitor_node_poll_staleness_seconds`
     - Alert threshold: > 60 s for > 5 min (excluding known failing instances)
   - **Polling Throughput**: rate of `pulse_monitor_poll_total{result="success"}` vs `result="error"`
   - **Per-node errors**: table or graph of `pulse_monitor_node_poll_errors_total` to spot noisy nodes
   - **Scheduler Health**: panels for `pulse_scheduler_queue_depth`, `pulse_scheduler_queue_due_soon`, `pulse_scheduler_dead_letter_depth`, `pulse_scheduler_breaker_state`, `pulse_scheduler_breaker_failure_count`
   - **Diagnostics Cache**: compare `increase(pulse_diagnostics_cache_hits_total[5m])` vs misses so spikes stand out
   - **HTTP SLA**: `rate(pulse_http_request_errors_total{status_class="server_error"}[5m])`
   - **Last Success Timestamp** (v4.24.0+): `pulse_monitor_poll_last_success_timestamp` to detect polling gaps

2. **Alerts** (a spot-check sketch follows this section)
   - Queue depth > threshold for > 10 min (Warning), > 20 min (Critical)
   - Instance or node staleness > 60 s for > 5 min (Critical)
   - Dead-letter count increase > N (based on baseline) triggers Warning
   - Any breaker stuck in `open` for > 10 min (from `pulse_scheduler_breaker_state`) triggers Critical
   - Queue wait > 5 s (95th percentile on `pulse_scheduler_queue_wait_seconds`) triggers Warning
   - Permanent failures (`pulse_monitor_poll_errors_total{category="permanent"}`) trigger an immediate Critical
   - Diagnostics refresh duration > 20 s alongside miss spikes should page engineering (`pulse_diagnostics_refresh_duration_seconds`)

3. **Notification routing**
   - Ensure alerts route to on-call + the feature owner
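Before wiring these thresholds into your alerting stack, it can help to spot-check the expressions against live data. The sketch below assumes a reachable Prometheus server at `PROM_URL` (a placeholder) that scrapes Pulse, and runs a few of the §4 expressions as instant queries via Prometheus's standard `/api/v1/query` endpoint; it is not the alert rule definition itself.

```bash
#!/usr/bin/env bash
# Spot-check §4 alert expressions against Prometheus (sketch only).
set -u

PROM_URL="${PROM_URL:-http://localhost:9090}"   # placeholder: your Prometheus server

query() {
  # Run an instant query and print each series' labels with its current value.
  curl -sG "${PROM_URL}/api/v1/query" --data-urlencode "query=$1" \
    | jq -r '.data.result[] | "\(.metric) => \(.value[1])"'
}

echo "== Queue depth =="
query 'pulse_monitor_poll_queue_depth'

echo "== Instances or nodes staler than 60s =="
query 'pulse_monitor_poll_staleness_seconds > 60 or pulse_monitor_node_poll_staleness_seconds > 60'

echo "== Breaker state per instance (inspect for non-closed values) =="
query 'pulse_scheduler_breaker_state'

echo "== Permanent poll failures in the last 5m =="
query 'increase(pulse_monitor_poll_errors_total{category="permanent"}[5m]) > 0'
```

Empty output for a check means no series currently crosses that threshold; non-empty output shows the offending instances before any alert rule has fired.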
---

## 5. Rollback Procedures

### Option A: Disable Adaptive Polling (Keep v4.24.0)

**If adaptive polling causes issues but you want to keep v4.24.0:**

1. **Via UI (No restart required)**
   - Navigate to **Settings → System → Monitoring**
   - Toggle "Adaptive Polling" OFF
   - Changes apply immediately

2. **Via environment variables (Restart required)**

   ```bash
   # Systemd
   sudo systemctl edit pulse

   # Add:
   [Service]
   Environment="ADAPTIVE_POLLING_ENABLED=false"

   # Then restart
   sudo systemctl restart pulse
   ```

3. **Verification**

   ```bash
   curl -s http://<pulse-host>:7655/api/monitoring/scheduler/health | jq '.enabled'
   # Should return: false
   ```

### Option B: Full Version Rollback

**If v4.24.0 causes broader issues:**

1. **Via UI**
   - Navigate to **Settings → System → Updates**
   - Click **"Restore previous version"**
   - Confirm the rollback
   - Pulse restarts automatically with the previous version

2. **Via CLI**

   ```bash
   # Systemd installations
   sudo /opt/pulse/pulse config rollback

   # LXC containers
   pct exec <ctid> -- bash -c "cd /opt/pulse && ./pulse config rollback"
   ```

3. **Verification**

   ```bash
   # Check version
   curl -s http://<pulse-host>:7655/api/version | jq '.version'

   # Check update history
   curl -s http://<pulse-host>:7655/api/updates/history | jq '.entries[0]'
   # Should show action="rollback", status="completed"
   ```

4. **Post-rollback**
   - Verify the rollback is logged in update history
   - Check the journal: `journalctl -u pulse | grep rollback`
   - Monitor for 15-30 minutes to ensure stability (see the watch-loop sketch at the end of §6)
   - Document the rollback reason and notify stakeholders

---

## 6. Troubleshooting

| Symptom | Possible Cause | Action |
|---------|----------------|--------|
| Queue depth remains high (> 2× usual) | Insufficient workers, hidden breaker, misconfigured flag | Check the scheduler health API for breaker states; consider increasing workers or reverting the flag |
| Staleness spikes across many instances | Backend API slowdown or connectivity issues | Inspect backend logs and network health; revert the flag if the issue persists > 15 min |
| Dead-letter count climbs rapidly | Downstream API failures | Investigate specific instances via the scheduler health API; fix credential/connectivity issues or roll back |
| Circuit breakers stuck half-open/open | Persistent transient failures | Review error logs, ensure backoff/rate limits are not starving retries; roll back if not resolved quickly |
| Grafana panels flatline | Metrics exporter or scrape job issue | Ensure Prometheus scraping is working; verify the service restarted with the flag |

### Accessing Scheduler Health API

```bash
curl -s http://<pulse-host>:7655/api/monitoring/scheduler/health | jq
```

Key sections to inspect:

- `queue.depth`, `queue.perType`
- `instances[].pollStatus` (success/failure streaks and last error)
- `instances[].breaker` (current breaker state, retry windows)
- `instances[].deadLetter` (reason, retry counts, schedules)
- `staleness` (normalized freshness score)

Common queries:

**Instances with errors:**

```bash
curl -s http://<pulse-host>:7655/api/monitoring/scheduler/health \
  | jq '.instances[] | select(.pollStatus.lastError != null) | {key, lastError: .pollStatus.lastError}'
```

**Current dead-letter entries:**

```bash
curl -s http://<pulse-host>:7655/api/monitoring/scheduler/health \
  | jq '.instances[] | select(.deadLetter.present) | {key, reason: .deadLetter.reason, retryCount: .deadLetter.retryCount}'
```

**Breakers not closed:**

```bash
curl -s http://<pulse-host>:7655/api/monitoring/scheduler/health \
  | jq '.instances[] | select(.breaker.state != "closed") | {key, breaker: .breaker}'
```

### When to Roll Back

Roll back immediately if any of the following occurs:

- Queue depth > 3× baseline for > 15 min
- Staleness > 120 s on the majority of instances
- Dead-letter count doubles without a clear cause
- Customer-facing alerts or latency regressions attributed to adaptive polling

Document the incident and notify stakeholders after the rollback.
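For the 15-30 minute stability window after a rollback (see §5, step 4), or while weighing the criteria above, a small watch loop against the scheduler health endpoint saves staring at dashboards. A minimal sketch, assuming `jq` is installed and `PULSE_HOST` is a placeholder you substitute; the jq paths follow the health examples shown earlier:

```bash
#!/usr/bin/env bash
# Sample scheduler health every INTERVAL seconds for DURATION seconds (sketch).
PULSE_HOST="${PULSE_HOST:-localhost}"   # placeholder: your Pulse host
INTERVAL="${INTERVAL:-30}"              # seconds between samples
DURATION="${DURATION:-1800}"            # total watch time (30 minutes)

end=$(( $(date +%s) + DURATION ))
while [ "$(date +%s)" -lt "$end" ]; do
  curl -s "http://${PULSE_HOST}:7655/api/monitoring/scheduler/health" \
    | jq -r '"\(now | todate) queueDepth=\(.queue.depth) dlq=\(.deadLetter.count // 0) openBreakers=\([.instances[] | select(.breaker.state != "closed")] | length)"'
  sleep "$INTERVAL"
done
```

Steadily falling or flat queue depth, an empty dead-letter count, and zero open breakers over the window are the signals that the rollback (or the current configuration) is stable.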
---

## 7. Related Documentation

- [Scheduler Health API](../api/SCHEDULER_HEALTH.md) - Complete API reference
- [Adaptive Polling Architecture](../monitoring/ADAPTIVE_POLLING.md) - Technical details
- [Management Endpoints](ADAPTIVE_POLLING_MANAGEMENT_ENDPOINTS.md) - Circuit breaker/DLQ controls
- [Configuration Guide](../CONFIGURATION.md) - Adaptive polling settings