Add comprehensive operator documentation for the new observability features
introduced in the previous commit.
**New Documentation:**
- docs/monitoring/PROMETHEUS_METRICS.md - Complete reference for all 18 new
Prometheus metrics with alert suggestions
**Updated Documentation:**
- docs/API.md - Document X-Request-ID and X-Diagnostics-Cached-At headers,
explain diagnostics endpoint caching behavior
- docs/TROUBLESHOOTING.md - Add section on correlating API calls with logs
using request IDs
- docs/operations/ADAPTIVE_POLLING_ROLLOUT.md - Update monitoring checklists
with new per-node and scheduler metrics
- docs/CONFIGURATION.md - Clarify LOG_FILE dual-output behavior and rotation
defaults
These updates ensure operators understand:
- How to set up monitoring/alerting for new metrics
- How to configure file logging with rotation
- How to troubleshoot using request correlation
- What metrics are available for dashboards
Related to: 495e6c794 (feat: comprehensive diagnostics improvements)
# Adaptive Polling Operations Runbook

**GA in v4.24.0 - Enabled by Default**

This runbook guides operators managing the adaptive polling scheduler in production. Adaptive polling is enabled by default in v4.24.0 and can be toggled via Settings → System → Monitoring (no restart) or environment variables (restart required).

Follow these operational procedures for steady-state monitoring, troubleshooting, and rollback scenarios.
## 1. Preparation (Before v4.24.0 Deployment)

For new deployments or upgrades to v4.24.0:

- **Monitoring readiness**
  - Review available scrape series in Prometheus Metrics.
  - Set up a Grafana dashboard with:
    - Instance gauges: `pulse_monitor_poll_queue_depth`, `pulse_monitor_poll_staleness_seconds`, `pulse_monitor_poll_last_success_timestamp`
    - Per-node coverage: `pulse_monitor_node_poll_staleness_seconds`, `pulse_monitor_node_poll_errors_total`, `pulse_monitor_node_poll_duration_seconds`
    - Scheduler health: `pulse_scheduler_queue_depth`, `pulse_scheduler_queue_due_soon`, `pulse_scheduler_dead_letter_depth`, `pulse_scheduler_breaker_state`, and `pulse_scheduler_breaker_failure_count`
    - Diagnostics cache sanity: `increase(pulse_diagnostics_cache_misses_total[5m])` vs `pulse_diagnostics_cache_hits_total`
    - HTTP SLA: `rate(pulse_http_request_errors_total{status_class="server_error"}[5m])`
    - Continue trending `pulse_monitor_poll_total` / `pulse_monitor_poll_errors_total` for throughput
  - Configure alerts (see §4)
- **Baseline metrics**
  - Record pre-upgrade metrics if upgrading from < v4.24.0:
    - Typical polling frequency
    - Average response times
    - Current alert volumes
  - These baselines help assess the impact of adaptive polling after the upgrade
- **Rollback readiness**
  - Verify the update rollback workflow: Settings → System → Updates → Restore previous version
  - Test rollback in a staging environment
  - Confirm `/api/monitoring/scheduler/health` is accessible
  - Document the emergency disable procedure (see §5)
- **Configuration review**
  - Review `system.json` or environment variables for the adaptive polling tunables:
    - `ADAPTIVE_POLLING_BASE_INTERVAL` (default: 10s)
    - `ADAPTIVE_POLLING_MIN_INTERVAL` (default: 5s)
    - `ADAPTIVE_POLLING_MAX_INTERVAL` (default: 5m)
  - Adjust if needed for your environment (e.g., high-frequency monitoring); a sample systemd override is sketched after this list
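For systemd installs, the snippet below is a minimal sketch of how these tunables could be set via a unit override. The unit name `pulse` matches the commands in §5; the duration string format (`10s`, `5m`) is an assumption, and the values shown are simply the documented defaults, so only override what you actually need to change.

```bash
# Hypothetical systemd override for the adaptive polling tunables.
# The duration format is an assumption; values shown are the documented defaults.
sudo systemctl edit pulse
# In the editor, add:
#   [Service]
#   Environment="ADAPTIVE_POLLING_BASE_INTERVAL=10s"
#   Environment="ADAPTIVE_POLLING_MIN_INTERVAL=5s"
#   Environment="ADAPTIVE_POLLING_MAX_INTERVAL=5m"

# Environment variable changes require a restart
sudo systemctl restart pulse
```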
## 2. Post-Deployment Verification (v4.24.0+)

Adaptive polling is enabled by default. Verify it's working correctly:

- **Check scheduler health**

  ```bash
  curl -s http://<host>:7655/api/monitoring/scheduler/health | jq
  ```

  Expected response:
  - `"enabled": true`
  - `queue.depth` reasonable (< instances × 1.5)
  - `deadLetter.count` = 0 (or only known failing instances)
  - `instances[]` array populated with your nodes
  - No `breaker.state` stuck in `open` (except known issues)
- **Verify UI access**
  - Navigate to Settings → System → Monitoring
  - Confirm the "Adaptive Polling" toggle is ON
  - Review queue depth and recent poll status
- **Check Grafana metrics** (if configured)
  - `pulse_monitor_poll_queue_depth` shows reasonable values
  - `pulse_monitor_poll_staleness_seconds` < 60s for healthy instances
  - `pulse_monitor_poll_errors_total` not rapidly increasing
  - `pulse_monitor_poll_last_success_timestamp` updating regularly
  - A command-line spot-check for these series is sketched after this list
- **Monitor update history**
  - Check Settings → System → Updates for the update entry
  - Verify the rollback button is available
  - Confirm the update status shows "completed"

  Via API:

  ```bash
  curl -s http://<host>:7655/api/updates/history | jq '.entries[0]'
  ```
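If Prometheus is already scraping Pulse, the sketch below shows one way to spot-check the metrics listed above from the command line using the Prometheus HTTP API. The Prometheus URL is a placeholder, and it assumes `pulse_monitor_poll_last_success_timestamp` is exposed as Unix seconds.

```bash
# Hypothetical spot-check of the post-deployment metrics via the Prometheus HTTP API.
PROM="http://prometheus.example:9090"   # adjust to your Prometheus server

# Worst-case instance staleness (expect < 60 for healthy instances)
curl -s "${PROM}/api/v1/query" \
  --data-urlencode 'query=max(pulse_monitor_poll_staleness_seconds)' \
  | jq '.data.result[0].value[1]'

# Seconds since the last successful poll, per instance (expect small values)
curl -s "${PROM}/api/v1/query" \
  --data-urlencode 'query=time() - pulse_monitor_poll_last_success_timestamp' \
  | jq '.data.result[] | {labels: .metric, secondsAgo: .value[1]}'

# Poll error rate over the last 5 minutes (expect roughly zero)
curl -s "${PROM}/api/v1/query" \
  --data-urlencode 'query=sum(rate(pulse_monitor_poll_errors_total[5m]))' \
  | jq '.data.result[0].value[1]'
```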
## 3. Steady-State Operations

Ongoing monitoring and SLO checks:

- **Daily health checks** (a scripted check is sketched at the end of this section)
  - Review the scheduler health dashboard or API endpoint
  - Check for:
    - Queue depth < 50 (alert if > 50 for 10+ minutes)
    - Instance staleness < 60s for healthy instances, < 120s for critical instances
    - Per-node staleness `pulse_monitor_node_poll_staleness_seconds` < 120s
    - DLQ count stable (not growing)
    - Circuit breakers mostly `closed`
    - Diagnostics cache misses roughly tracking hits (`increase(pulse_diagnostics_cache_misses_total[10m])` in line with hits)
    - HTTP error rate (`rate(pulse_http_request_errors_total{status_class="server_error"}[5m])`) near zero
- **Weekly reviews**
  - Analyze trends in Grafana:
    - Poll success rates
    - Average queue depth over time
    - Circuit breaker trip frequency
    - Dead-letter queue patterns
  - Document any recurring issues
- **SLO targets**
  - Queue depth: < 1.5× instance count (< 50 typical)
  - Staleness: < 60s for healthy instances, < 120s for critical instances
  - Poll success rate: > 99% for healthy infrastructure
  - DLQ growth: < 5% per week (excluding known failures)
  - Circuit breaker recovery: < 5 minutes for transient failures
- **Log correlation**
  - Cross-reference scheduler health with update history
  - Check `/api/updates/history` for rollback events correlated with scheduler issues
  - Review audit logs for adaptive polling configuration changes
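The script below is a minimal sketch of the daily health check, built only on the scheduler health fields documented under "Accessing Scheduler Health API" in §6 and the SLO thresholds above. The exact response shape may vary between versions, so treat the `jq` paths as a starting point rather than a definitive implementation.

```bash
#!/usr/bin/env bash
# Sketch of a daily scheduler health check; jq paths follow the fields shown
# in the "Accessing Scheduler Health API" section and may need adjusting.
set -euo pipefail

HOST="http://<host>:7655"
HEALTH=$(curl -s "${HOST}/api/monitoring/scheduler/health")

queue_depth=$(echo "${HEALTH}" | jq '.queue.depth')
open_breakers=$(echo "${HEALTH}" | jq '[.instances[] | select(.breaker.state != "closed")] | length')
dead_letters=$(echo "${HEALTH}" | jq '[.instances[] | select(.deadLetter.present)] | length')

echo "queue depth:   ${queue_depth} (investigate if > 50 for 10+ minutes)"
echo "open breakers: ${open_breakers} (expect 0)"
echo "dead letters:  ${dead_letters} (expect stable, not growing)"

# Instances with a recent poll error, for follow-up
echo "${HEALTH}" | jq -r '.instances[] | select(.pollStatus.lastError != null) | .key'
```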
4. Grafana & Alert Configuration
-
Dashboard panels
- Queue Depth:
pulse_monitor_poll_queue_depth- Use single-stat with alert if > 1.5× active instances for > 10 min
- Instance & Node Staleness: combine
pulse_monitor_poll_staleness_secondsandpulse_monitor_node_poll_staleness_seconds- Alert threshold: > 60 s for > 5 min (excluding known failing instances)
- Polling Throughput: rate of
pulse_monitor_poll_total{result="success"}vsresult="error" - Per-node errors: table or graph of
pulse_monitor_node_poll_errors_totalto spot noisy nodes - Scheduler Health: panels for
pulse_scheduler_queue_depth,pulse_scheduler_queue_due_soon,pulse_scheduler_dead_letter_depth,pulse_scheduler_breaker_state,pulse_scheduler_breaker_failure_count - Diagnostics Cache: compare
increase(pulse_diagnostics_cache_hits_total[5m])vs misses so spikes stand out - HTTP SLA:
rate(pulse_http_request_errors_total{status_class="server_error"}[5m]) - Last Success Timestamp (v4.24.0+):
pulse_monitor_poll_last_success_timestampto detect polling gaps
- Queue Depth:
- **Alerts** (sample Prometheus rules are sketched at the end of this section)
  - Queue depth > threshold for > 10 min (Warning), > 20 min (Critical)
  - Instance or node staleness > 60 s for > 5 min (Critical)
  - Dead-letter count increase > N (based on baseline) triggers Warning
  - Any breaker stuck in `open` for > 10 min (from `pulse_scheduler_breaker_state`) triggers Critical
  - Queue wait > 5 s (95th percentile on `pulse_scheduler_queue_wait_seconds`) triggers Warning
  - Permanent failures (`pulse_monitor_poll_errors_total{category="permanent"}`) trigger immediate Critical
  - Diagnostics refresh duration > 20 s alongside miss spikes should page engineering (`pulse_diagnostics_refresh_duration_seconds`)
- **Notification routing**
  - Ensure alerts route to both the on-call rotation and the feature owner
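As a starting point, the file below sketches how the first two alert bullets could be expressed as Prometheus alerting rules. The queue-depth threshold of 50 mirrors the §3 guidance; tune both thresholds to your own baseline before deploying, and adjust severities to match your routing. The `promtool` validation step is optional.

```bash
# Sketch of Prometheus alerting rules mirroring the thresholds above.
cat > pulse-adaptive-polling-alerts.yml <<'EOF'
groups:
  - name: pulse-adaptive-polling
    rules:
      - alert: PulsePollQueueDepthHigh
        expr: pulse_monitor_poll_queue_depth > 50
        for: 10m
        labels:
          severity: warning
      - alert: PulsePollQueueDepthCritical
        expr: pulse_monitor_poll_queue_depth > 50
        for: 20m
        labels:
          severity: critical
      - alert: PulseInstanceStale
        expr: pulse_monitor_poll_staleness_seconds > 60
        for: 5m
        labels:
          severity: critical
      - alert: PulsePermanentPollFailure
        expr: increase(pulse_monitor_poll_errors_total{category="permanent"}[5m]) > 0
        labels:
          severity: critical
EOF

# Validate before loading into Prometheus
promtool check rules pulse-adaptive-polling-alerts.yml
```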
## 5. Rollback Procedures

### Option A: Disable Adaptive Polling (Keep v4.24.0)

If adaptive polling causes issues but you want to keep v4.24.0:

- **Via UI (no restart required)**
  - Navigate to Settings → System → Monitoring
  - Toggle "Adaptive Polling" OFF
  - Changes apply immediately

- **Via environment variables (restart required)**

  ```bash
  # Systemd
  sudo systemctl edit pulse
  # Add:
  [Service]
  Environment="ADAPTIVE_POLLING_ENABLED=false"

  # Then restart
  sudo systemctl restart pulse
  ```

- **Verification**

  ```bash
  curl -s http://<host>:7655/api/monitoring/scheduler/health | jq '.enabled'
  # Should return: false
  ```
### Option B: Full Version Rollback

If v4.24.0 causes broader issues:

- **Via UI**
  - Navigate to Settings → System → Updates
  - Click "Restore previous version"
  - Confirm the rollback
  - Pulse restarts automatically with the previous version

- **Via CLI**

  ```bash
  # Systemd installations
  sudo /opt/pulse/pulse config rollback

  # LXC containers
  pct exec <ctid> -- bash -c "cd /opt/pulse && ./pulse config rollback"
  ```

- **Verification**

  ```bash
  # Check version
  curl -s http://<host>:7655/api/version | jq '.version'

  # Check update history
  curl -s http://<host>:7655/api/updates/history | jq '.entries[0]'
  # Should show action="rollback", status="completed"
  ```

- **Post-rollback**
  - Verify the rollback is logged in update history
  - Check the journal: `journalctl -u pulse | grep rollback`
  - Monitor for 15-30 minutes to ensure stability
  - Document the rollback reason and notify stakeholders
## 6. Troubleshooting
| Symptom | Possible Cause | Action |
|---|---|---|
| Queue depth remains high (> 2× usual) | Insufficient workers, hidden breaker, misconfigured flag | Check scheduler health API for breaker states; consider increasing workers or reverting flag |
| Staleness spikes across many instances | Backend API slowdown or connectivity issues | Inspect backend logs, network health; revert flag if duration > 15 min |
| Dead-letter count climbs rapidly | Downstream API failures | Investigate specific instances via scheduler health API; fix credential/connectivity issues or rollback |
| Circuit breakers stuck half-open/open | Persistent transient failures | Review error logs, ensure backoff/rate limits are not starving retries; roll back if not resolved quickly |
| Grafana panels flatline | Metrics exporter or scrape job issue | Ensure Prometheus scraping is working; verify the service was restarted with the flag |
### Accessing Scheduler Health API

```bash
curl -s http://<host>:7655/api/monitoring/scheduler/health | jq
```

Key sections to inspect:

- `queue.depth`, `queue.perType`
- `instances[].pollStatus` (success/failure streaks and last error)
- `instances[].breaker` (current breaker state, retry windows)
- `instances[].deadLetter` (reason, retry counts, schedules)
- `staleness` (normalized freshness score)
Common queries:

Instances with errors:

```bash
curl -s http://<host>:7655/api/monitoring/scheduler/health \
  | jq '.instances[] | select(.pollStatus.lastError != null) | {key, lastError: .pollStatus.lastError}'
```

Current dead-letter entries:

```bash
curl -s http://<host>:7655/api/monitoring/scheduler/health \
  | jq '.instances[] | select(.deadLetter.present) | {key, reason: .deadLetter.reason, retryCount: .deadLetter.retryCount}'
```

Breakers not closed:

```bash
curl -s http://<host>:7655/api/monitoring/scheduler/health \
  | jq '.instances[] | select(.breaker.state != "closed") | {key, breaker: .breaker}'
```
### When to Roll Back

Roll back immediately if any of the following occurs (a quick staleness spot-check is sketched below):
- Queue depth > 3× baseline for > 15 min
- Staleness > 120 s on majority of instances
- Dead-letter count doubles without clear cause
- Customer-facing alerts or latency regressions attributed to adaptive polling
Document the incident and notify stakeholders after rollback.
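For the staleness trigger above, one way to check the "majority of instances" condition against Prometheus is sketched below. The Prometheus URL is a placeholder, and a `null` result simply means no instance is currently above the 120 s threshold.

```bash
# Hypothetical check: fraction of instances with staleness above 120s.
# A value above 0.5 means the majority-of-instances rollback trigger is met;
# a null result means no instance is currently above the threshold.
PROM="http://prometheus.example:9090"
curl -s "${PROM}/api/v1/query" --data-urlencode \
  'query=count(pulse_monitor_poll_staleness_seconds > 120) / count(pulse_monitor_poll_staleness_seconds)' \
  | jq '.data.result[0].value[1]'
```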
## 7. Related Documentation
- Scheduler Health API - Complete API reference
- Adaptive Polling Architecture - Technical details
- Management Endpoints - Circuit breaker/DLQ controls
- Configuration Guide - Adaptive polling settings