Adaptive Polling Operations Runbook

GA in v4.24.0 - Enabled by Default

This runbook guides operators managing the adaptive polling scheduler in production. Adaptive polling is enabled by default in v4.24.0 and can be toggled via Settings → System → Monitoring (no restart) or environment variables (restart required).

Follow these operational procedures for steady-state monitoring, troubleshooting, and rollback scenarios.


1. Preparation (Before v4.24.0 Deployment)

For new deployments or upgrades to v4.24.0:

  1. Monitoring readiness

    • Review the available scrape series in the Prometheus metrics reference (docs/monitoring/PROMETHEUS_METRICS.md).
    • Set up a Grafana dashboard with:
      • Instance gauges: pulse_monitor_poll_queue_depth, pulse_monitor_poll_staleness_seconds, pulse_monitor_poll_last_success_timestamp
      • Per-node coverage: pulse_monitor_node_poll_staleness_seconds, pulse_monitor_node_poll_errors_total, pulse_monitor_node_poll_duration_seconds
      • Scheduler health: pulse_scheduler_queue_depth, pulse_scheduler_queue_due_soon, pulse_scheduler_dead_letter_depth, pulse_scheduler_breaker_state, and pulse_scheduler_breaker_failure_count
      • Diagnostics cache sanity: increase(pulse_diagnostics_cache_misses_total[5m]) vs pulse_diagnostics_cache_hits_total
      • HTTP SLA: rate(pulse_http_request_errors_total{status_class="server_error"}[5m])
      • Continue trending pulse_monitor_poll_total / pulse_monitor_poll_errors_total for throughput
    • Configure alerts (see §4)
  2. Baseline metrics

    • Record pre-upgrade metrics if upgrading from < v4.24.0:
      • Typical polling frequency
      • Average response times
      • Current alert volumes
    • These help assess adaptive polling impact
  3. Rollback readiness

    • Verify update rollback workflow: Settings → System → Updates → Restore previous version
    • Test rollback in staging environment
    • Confirm /api/monitoring/scheduler/health is accessible
    • Document emergency disable procedure (see §5)
  4. Configuration review

    • Review system.json or environment variables for adaptive polling tunables:
      • ADAPTIVE_POLLING_BASE_INTERVAL (default: 10s)
      • ADAPTIVE_POLLING_MIN_INTERVAL (default: 5s)
      • ADAPTIVE_POLLING_MAX_INTERVAL (default: 5m)
    • Adjust if needed for your environment (e.g., high-frequency monitoring)
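
If you need to adjust these tunables on a systemd installation, the sketch below mirrors the override procedure in §5; the interval values shown are the defaults, so substitute your own (environment changes require a restart):

    # Systemd
    sudo systemctl edit pulse
    # Add:
    [Service]
    Environment="ADAPTIVE_POLLING_BASE_INTERVAL=10s"
    Environment="ADAPTIVE_POLLING_MIN_INTERVAL=5s"
    Environment="ADAPTIVE_POLLING_MAX_INTERVAL=5m"

    # Then restart
    sudo systemctl restart pulse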

2. Post-Deployment Verification (v4.24.0+)

Adaptive polling is enabled by default. Verify it's working correctly:

  1. Check scheduler health

    curl -s http://<host>:7655/api/monitoring/scheduler/health | jq
    

    Expected response (a combined spot check follows at the end of this section):

    • "enabled": true
    • queue.depth reasonable (< instances × 1.5)
    • deadLetter.count = 0 (or only known failing instances)
    • instances[] array populated with your nodes
    • No breaker.state stuck in open (except known issues)
  2. Verify UI access

    • Navigate to Settings → System → Monitoring
    • Confirm "Adaptive Polling" toggle is ON
    • Review queue depth and recent poll status
  3. Check Grafana metrics (if configured)

    • pulse_monitor_poll_queue_depth shows reasonable values
    • pulse_monitor_poll_staleness_seconds < 60s for healthy instances
    • pulse_monitor_poll_errors_total not rapidly increasing
    • pulse_monitor_poll_last_success_timestamp updating regularly
  4. Monitor update history

    • Check Settings → System → Updates for update entry
    • Verify rollback button is available
    • Confirm update status shows "completed"

    Via API:

    curl -s http://<host>:7655/api/updates/history | jq '.entries[0]'
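
To roll the checks above into a single call, a one-line spot check (a sketch assuming the field names listed in step 1) is:

    curl -s http://<host>:7655/api/monitoring/scheduler/health \
      | jq '{enabled, queueDepth: .queue.depth, deadLetters: .deadLetter.count, openBreakers: [.instances[] | select(.breaker.state == "open") | .key]}'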
    

3. Steady-State Operations

Ongoing monitoring and SLO checks:

  1. Daily health checks

    • Review scheduler health dashboard or API endpoint
    • Check for:
      • Queue depth < 50 (alert if > 50 for 10+ minutes)
      • Instance staleness < 60s for healthy instances, < 120s for critical instances
      • Per-node staleness pulse_monitor_node_poll_staleness_seconds < 120s
      • DLQ count stable (not growing)
      • Circuit breakers mostly closed
      • Diagnostics cache misses roughly follow hits (increase(pulse_diagnostics_cache_misses_total[10m]) in line with hits)
      • HTTP error rate (rate(pulse_http_request_errors_total{status_class="server_error"}[5m])) near zero
  2. Weekly reviews

    • Analyze trends in Grafana:
      • Poll success rates
      • Average queue depth over time
      • Circuit breaker trip frequency
      • Dead-letter queue patterns
    • Document any recurring issues
  3. SLO targets (example PromQL expressions follow this list)

    • Queue depth: < 1.5× instance count (< 50 typical)
    • Staleness: < 60s for healthy instances, < 120s for critical instances
    • Poll success rate: > 99% for healthy infrastructure
    • DLQ growth: < 5% per week (excluding known failures)
    • Circuit breaker recovery: < 5 minutes for transient failures
  4. Log correlation

    • Cross-reference scheduler health with update history
    • Check /api/updates/history for rollback events correlated with scheduler issues
    • Review audit logs for adaptive polling configuration changes
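
The SLO targets above translate into PromQL fairly directly; the expressions below are a sketch using the metric names from §1 (adjust thresholds and label names to your environment):

    # Queue depth sustained above 50 for the last 10 minutes
    min_over_time(pulse_monitor_poll_queue_depth[10m]) > 50

    # Poll success rate over the last hour (target > 99%)
    sum(rate(pulse_monitor_poll_total{result="success"}[1h]))
      / sum(rate(pulse_monitor_poll_total[1h]))

    # Server-error rate for the HTTP SLA (target ~0)
    rate(pulse_http_request_errors_total{status_class="server_error"}[5m])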

4. Grafana & Alert Configuration

  1. Dashboard panels

    • Queue Depth: pulse_monitor_poll_queue_depth
      • Use single-stat with alert if > 1.5× active instances for > 10 min
    • Instance & Node Staleness: combine pulse_monitor_poll_staleness_seconds and pulse_monitor_node_poll_staleness_seconds
      • Alert threshold: > 60 s for > 5 min (excluding known failing instances)
    • Polling Throughput: rate of pulse_monitor_poll_total{result="success"} vs result="error"
    • Per-node errors: table or graph of pulse_monitor_node_poll_errors_total to spot noisy nodes
    • Scheduler Health: panels for pulse_scheduler_queue_depth, pulse_scheduler_queue_due_soon, pulse_scheduler_dead_letter_depth, pulse_scheduler_breaker_state, pulse_scheduler_breaker_failure_count
    • Diagnostics Cache: compare increase(pulse_diagnostics_cache_hits_total[5m]) against the corresponding misses so that miss spikes stand out
    • HTTP SLA: rate(pulse_http_request_errors_total{status_class="server_error"}[5m])
    • Last Success Timestamp (v4.24.0+): pulse_monitor_poll_last_success_timestamp to detect polling gaps
  2. Alerts (a sample Prometheus rules sketch follows this list)

    • Queue depth > threshold for >10 min (Warning), >20 min (Critical)
    • Instance or node staleness > 60 s for >5 min (Critical)
    • Dead-letter count increase > N (based on baseline) triggers Warning
    • Any breaker stuck in open for >10 min (from pulse_scheduler_breaker_state) triggers Critical
    • Queue wait > 5 s (95th percentile on pulse_scheduler_queue_wait_seconds) triggers Warning
    • Permanent failures (pulse_monitor_poll_errors_total{category="permanent"}) trigger immediate Critical
    • Diagnostics refresh duration > 20 s alongside miss spikes should page engineering (pulse_diagnostics_refresh_duration_seconds)
  3. Notification routing

    • Ensure alerts route to the on-call rotation and the feature owner
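
Two of the alerts above, expressed as Prometheus alerting rules, might look like the sketch below (group and alert names are illustrative; thresholds come from the list above and §3):

    groups:
      - name: pulse-adaptive-polling
        rules:
          - alert: PulsePollQueueDepthHigh
            expr: pulse_monitor_poll_queue_depth > 50
            for: 10m
            labels:
              severity: warning
            annotations:
              summary: "Poll queue depth above 50 for more than 10 minutes"
          - alert: PulsePollStalenessHigh
            expr: pulse_monitor_poll_staleness_seconds > 60
            for: 5m
            labels:
              severity: critical
            annotations:
              summary: "Instance staleness above 60 s for more than 5 minutes"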

5. Rollback Procedures

Option A: Disable Adaptive Polling (Keep v4.24.0)

If adaptive polling causes issues but you want to keep v4.24.0:

  1. Via UI (No restart required)

    • Navigate to Settings → System → Monitoring
    • Toggle "Adaptive Polling" OFF
    • Changes apply immediately
  2. Via environment variables (Restart required)

    # Systemd
    sudo systemctl edit pulse
    # Add:
    [Service]
    Environment="ADAPTIVE_POLLING_ENABLED=false"
    
    # Then restart
    sudo systemctl restart pulse
    
  3. Verification

    curl -s http://<host>:7655/api/monitoring/scheduler/health | jq '.enabled'
    # Should return: false
    

Option B: Full Version Rollback

If v4.24.0 causes broader issues:

  1. Via UI

    • Navigate to Settings → System → Updates
    • Click "Restore previous version"
    • Confirm rollback
    • Pulse restarts automatically with previous version
  2. Via CLI

    # Systemd installations
    sudo /opt/pulse/pulse config rollback
    
    # LXC containers
    pct exec <ctid> -- bash -c "cd /opt/pulse && ./pulse config rollback"
    
  3. Verification

    # Check version
    curl -s http://<host>:7655/api/version | jq '.version'
    
    # Check update history
    curl -s http://<host>:7655/api/updates/history | jq '.entries[0]'
    # Should show action="rollback", status="completed"
    
  4. Post-rollback

    • Verify rollback logged in update history
    • Check journal: journalctl -u pulse | grep rollback
    • Monitor for 15-30 minutes to ensure stability
    • Document rollback reason and notify stakeholders

6. Troubleshooting

Common symptoms, likely causes, and actions:

  • Queue depth remains high (> 2× usual)
    Possible cause: insufficient workers, a hidden breaker, or a misconfigured flag.
    Action: check the scheduler health API for breaker states; consider increasing workers or reverting the flag.
  • Staleness spikes across many instances
    Possible cause: backend API slowdown or connectivity issues.
    Action: inspect backend logs and network health; revert the flag if the condition lasts more than 15 minutes.
  • Dead-letter count climbs rapidly
    Possible cause: downstream API failures.
    Action: investigate specific instances via the scheduler health API; fix credential/connectivity issues or roll back.
  • Circuit breakers stuck half-open/open
    Possible cause: persistent transient failures.
    Action: review error logs and ensure backoff/rate limits are not starving retries; roll back if not resolved quickly.
  • Grafana panels flatline
    Possible cause: metrics exporter or scrape job issue.
    Action: confirm Prometheus scraping is working and that the service restarted with the flag applied.

Accessing Scheduler Health API

curl -s http://<host>:7655/api/monitoring/scheduler/health | jq

Key sections to inspect:

  • queue.depth, queue.perType
  • instances[].pollStatus (success/failure streaks and last error)
  • instances[].breaker (current breaker state, retry windows)
  • instances[].deadLetter (reason, retry counts, schedules)
  • staleness (normalized freshness score)

Common queries:

Instances with errors:

curl -s http://<host>:7655/api/monitoring/scheduler/health \
  | jq '.instances[] | select(.pollStatus.lastError != null) | {key, lastError: .pollStatus.lastError}'

Current dead-letter entries:

curl -s http://<host>:7655/api/monitoring/scheduler/health \
  | jq '.instances[] | select(.deadLetter.present) | {key, reason: .deadLetter.reason, retryCount: .deadLetter.retryCount}'

Breakers not closed:

curl -s http://<host>:7655/api/monitoring/scheduler/health \
  | jq '.instances[] | select(.breaker.state != "closed") | {key, breaker: .breaker}'
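
Queue breakdown and staleness score (a sketch using the queue.perType and staleness fields listed above):

curl -s http://<host>:7655/api/monitoring/scheduler/health \
  | jq '{perType: .queue.perType, staleness: .staleness}'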

When to Roll Back

Roll back immediately if any of the following occurs:

  • Queue depth > 3× baseline for > 15 min
  • Staleness > 120 s on majority of instances
  • Dead-letter count doubles without clear cause
  • Customer-facing alerts or latency regressions attributed to adaptive polling
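
The first two triggers can be watched directly in PromQL (a sketch; replace the example baseline with the figure you recorded in §1):

    # Staleness above 120 s on a majority of instances
    count(pulse_monitor_poll_staleness_seconds > 120)
      / count(pulse_monitor_poll_staleness_seconds) > 0.5

    # Queue depth above 3× your recorded baseline (example baseline of 20)
    pulse_monitor_poll_queue_depth > 3 * 20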

Document the incident and notify stakeholders after rollback.