Adaptive Polling Operations Runbook

GA in v4.24.0 - Enabled by Default

This runbook guides operators managing the adaptive polling scheduler in production. Adaptive polling is enabled by default in v4.24.0 and can be toggled via Settings → System → Monitoring (no restart) or environment variables (restart required).

Follow these operational procedures for steady-state monitoring, troubleshooting, and rollback scenarios.


1. Preparation (Before v4.24.0 Deployment)

For new deployments or upgrades to v4.24.0:

  1. Monitoring readiness

    • Review the available scrape series in the Prometheus metrics reference (docs/monitoring/PROMETHEUS_METRICS.md).
    • Set up a Grafana dashboard with:
      • Instance gauges: pulse_monitor_poll_queue_depth, pulse_monitor_poll_staleness_seconds, pulse_monitor_poll_last_success_timestamp
      • Per-node coverage: pulse_monitor_node_poll_staleness_seconds, pulse_monitor_node_poll_errors_total, pulse_monitor_node_poll_duration_seconds
      • Scheduler health: pulse_scheduler_queue_depth, pulse_scheduler_queue_due_soon, pulse_scheduler_dead_letter_depth, pulse_scheduler_breaker_state, and pulse_scheduler_breaker_failure_count
      • Diagnostics cache sanity: increase(pulse_diagnostics_cache_misses_total[5m]) vs pulse_diagnostics_cache_hits_total
      • HTTP SLA: rate(pulse_http_request_errors_total{status_class="server_error"}[5m])
      • Continue trending pulse_monitor_poll_total / pulse_monitor_poll_errors_total for throughput
    • Configure alerts (see §4)
  2. Baseline metrics

    • Record pre-upgrade metrics if upgrading from < v4.24.0:
      • Typical polling frequency
      • Average response times
      • Current alert volumes
    • These help assess adaptive polling impact
  3. Rollback readiness

    • Verify update rollback workflow: Settings → System → Updates → Restore previous version
    • Test rollback in staging environment
    • Confirm /api/monitoring/scheduler/health is accessible
    • Document emergency disable procedure (see §5)
  4. Configuration review

    • Review system.json or environment variables for adaptive polling tunables:
      • ADAPTIVE_POLLING_BASE_INTERVAL (default: 10s)
      • ADAPTIVE_POLLING_MIN_INTERVAL (default: 5s)
      • ADAPTIVE_POLLING_MAX_INTERVAL (default: 5m)
    • Adjust if needed for your environment (e.g., high-frequency monitoring)
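
If you need to adjust these tunables on a systemd installation, the sketch below mirrors the override procedure in §5; the interval values shown are the defaults, so substitute your own (environment changes require a restart):

    # Systemd
    sudo systemctl edit pulse
    # Add:
    [Service]
    Environment="ADAPTIVE_POLLING_BASE_INTERVAL=10s"
    Environment="ADAPTIVE_POLLING_MIN_INTERVAL=5s"
    Environment="ADAPTIVE_POLLING_MAX_INTERVAL=5m"

    # Then restart
    sudo systemctl restart pulse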

2. Post-Deployment Verification (v4.24.0+)

Adaptive polling is enabled by default. Verify it's working correctly:

  1. Check scheduler health

    curl -s http://<host>:7655/api/monitoring/scheduler/health | jq
    

    Expected response (a combined spot check follows at the end of this section):

    • "enabled": true
    • queue.depth reasonable (< instances × 1.5)
    • deadLetter.count = 0 (or only known failing instances)
    • instances[] array populated with your nodes
    • No breaker.state stuck in open (except known issues)
  2. Verify UI access

    • Navigate to Settings → System → Monitoring
    • Confirm "Adaptive Polling" toggle is ON
    • Review queue depth and recent poll status
  3. Check Grafana metrics (if configured)

    • pulse_monitor_poll_queue_depth shows reasonable values
    • pulse_monitor_poll_staleness_seconds < 60s for healthy instances
    • pulse_monitor_poll_errors_total not rapidly increasing
    • pulse_monitor_poll_last_success_timestamp updating regularly
  4. Monitor update history

    • Check Settings → System → Updates for update entry
    • Verify rollback button is available
    • Confirm update status shows "completed"

    Via API:

    curl -s http://<host>:7655/api/updates/history | jq '.entries[0]'
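
To roll the checks above into a single call, a one-line spot check (a sketch assuming the field names listed in step 1) is:

    curl -s http://<host>:7655/api/monitoring/scheduler/health \
      | jq '{enabled, queueDepth: .queue.depth, deadLetters: .deadLetter.count, openBreakers: [.instances[] | select(.breaker.state == "open") | .key]}'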
    

3. Steady-State Operations

Ongoing monitoring and SLO checks:

  1. Daily health checks

    • Review scheduler health dashboard or API endpoint
    • Check for:
      • Queue depth < 50 (alert if > 50 for 10+ minutes)
      • Instance staleness < 60s for healthy instances, < 120s for critical instances
      • Per-node staleness pulse_monitor_node_poll_staleness_seconds < 120s
      • DLQ count stable (not growing)
      • Circuit breakers mostly closed
      • Diagnostics cache misses roughly follow hits (increase(pulse_diagnostics_cache_misses_total[10m]) in line with hits)
      • HTTP error rate (rate(pulse_http_request_errors_total{status_class="server_error"}[5m])) near zero
  2. Weekly reviews

    • Analyze trends in Grafana:
      • Poll success rates
      • Average queue depth over time
      • Circuit breaker trip frequency
      • Dead-letter queue patterns
    • Document any recurring issues
  3. SLO targets (example PromQL expressions follow this list)

    • Queue depth: < 1.5× instance count (< 50 typical)
    • Staleness: < 60s for healthy instances, < 120s for critical instances
    • Poll success rate: > 99% for healthy infrastructure
    • DLQ growth: < 5% per week (excluding known failures)
    • Circuit breaker recovery: < 5 minutes for transient failures
  4. Log correlation

    • Cross-reference scheduler health with update history
    • Check /api/updates/history for rollback events correlated with scheduler issues
    • Review audit logs for adaptive polling configuration changes
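
The SLO targets above translate into PromQL fairly directly; the expressions below are a sketch using the metric names from §1 (adjust thresholds and label names to your environment):

    # Queue depth sustained above 50 for the last 10 minutes
    min_over_time(pulse_monitor_poll_queue_depth[10m]) > 50

    # Poll success rate over the last hour (target > 99%)
    sum(rate(pulse_monitor_poll_total{result="success"}[1h]))
      / sum(rate(pulse_monitor_poll_total[1h]))

    # Server-error rate for the HTTP SLA (target ~0)
    rate(pulse_http_request_errors_total{status_class="server_error"}[5m])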

4. Grafana & Alert Configuration

  1. Dashboard panels

    • Queue Depth: pulse_monitor_poll_queue_depth
      • Use single-stat with alert if > 1.5× active instances for > 10 min
    • Instance & Node Staleness: combine pulse_monitor_poll_staleness_seconds and pulse_monitor_node_poll_staleness_seconds
      • Alert threshold: > 60 s for > 5 min (excluding known failing instances)
    • Polling Throughput: rate of pulse_monitor_poll_total{result="success"} vs result="error"
    • Per-node errors: table or graph of pulse_monitor_node_poll_errors_total to spot noisy nodes
    • Scheduler Health: panels for pulse_scheduler_queue_depth, pulse_scheduler_queue_due_soon, pulse_scheduler_dead_letter_depth, pulse_scheduler_breaker_state, pulse_scheduler_breaker_failure_count
    • Diagnostics Cache: compare increase(pulse_diagnostics_cache_hits_total[5m]) against the corresponding misses so that miss spikes stand out
    • HTTP SLA: rate(pulse_http_request_errors_total{status_class="server_error"}[5m])
    • Last Success Timestamp (v4.24.0+): pulse_monitor_poll_last_success_timestamp to detect polling gaps
  2. Alerts (a sample Prometheus rules sketch follows this list)

    • Queue depth > threshold for >10 min (Warning), >20 min (Critical)
    • Instance or node staleness > 60 s for >5 min (Critical)
    • Dead-letter count increase > N (based on baseline) triggers Warning
    • Any breaker stuck in open for >10 min (from pulse_scheduler_breaker_state) triggers Critical
    • Queue wait > 5 s (95th percentile on pulse_scheduler_queue_wait_seconds) triggers Warning
    • Permanent failures (pulse_monitor_poll_errors_total{category="permanent"}) trigger immediate Critical
    • Diagnostics refresh duration > 20 s alongside miss spikes should page engineering (pulse_diagnostics_refresh_duration_seconds)
  3. Notification routing

    • Ensure alerts route to the on-call rotation and the feature owner
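
Two of the alerts above, expressed as Prometheus alerting rules, might look like the sketch below (group and alert names are illustrative; thresholds come from the list above and §3):

    groups:
      - name: pulse-adaptive-polling
        rules:
          - alert: PulsePollQueueDepthHigh
            expr: pulse_monitor_poll_queue_depth > 50
            for: 10m
            labels:
              severity: warning
            annotations:
              summary: "Poll queue depth above 50 for more than 10 minutes"
          - alert: PulsePollStalenessHigh
            expr: pulse_monitor_poll_staleness_seconds > 60
            for: 5m
            labels:
              severity: critical
            annotations:
              summary: "Instance staleness above 60 s for more than 5 minutes"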

5. Rollback Procedures

Option A: Disable Adaptive Polling (Keep v4.24.0)

If adaptive polling causes issues but you want to keep v4.24.0:

  1. Via UI (No restart required)

    • Navigate to Settings → System → Monitoring
    • Toggle "Adaptive Polling" OFF
    • Changes apply immediately
  2. Via environment variables (Restart required)

    # Systemd
    sudo systemctl edit pulse
    # Add:
    [Service]
    Environment="ADAPTIVE_POLLING_ENABLED=false"
    
    # Then restart
    sudo systemctl restart pulse
    
  3. Verification

    curl -s http://<host>:7655/api/monitoring/scheduler/health | jq '.enabled'
    # Should return: false
    

Option B: Full Version Rollback

If v4.24.0 causes broader issues:

  1. Via UI

    • Navigate to Settings → System → Updates
    • Click "Restore previous version"
    • Confirm rollback
    • Pulse restarts automatically with previous version
  2. Via CLI

    # Systemd installations
    sudo /opt/pulse/pulse config rollback
    
    # LXC containers
    pct exec <ctid> -- bash -c "cd /opt/pulse && ./pulse config rollback"
    
  3. Verification

    # Check version
    curl -s http://<host>:7655/api/version | jq '.version'
    
    # Check update history
    curl -s http://<host>:7655/api/updates/history | jq '.entries[0]'
    # Should show action="rollback", status="completed"
    
  4. Post-rollback

    • Verify rollback logged in update history
    • Check journal: journalctl -u pulse | grep rollback
    • Monitor for 15-30 minutes to ensure stability
    • Document rollback reason and notify stakeholders

6. Troubleshooting

Common symptoms, likely causes, and actions:

  • Queue depth remains high (> 2× usual)
    Possible cause: insufficient workers, a hidden breaker, or a misconfigured flag.
    Action: check the scheduler health API for breaker states; consider increasing workers or reverting the flag.
  • Staleness spikes across many instances
    Possible cause: backend API slowdown or connectivity issues.
    Action: inspect backend logs and network health; revert the flag if the condition lasts more than 15 minutes.
  • Dead-letter count climbs rapidly
    Possible cause: downstream API failures.
    Action: investigate specific instances via the scheduler health API; fix credential/connectivity issues or roll back.
  • Circuit breakers stuck half-open/open
    Possible cause: persistent transient failures.
    Action: review error logs and ensure backoff/rate limits are not starving retries; roll back if not resolved quickly.
  • Grafana panels flatline
    Possible cause: metrics exporter or scrape job issue.
    Action: confirm Prometheus scraping is working and that the service restarted with the flag applied.

Accessing Scheduler Health API

curl -s http://<host>:7655/api/monitoring/scheduler/health | jq

Key sections to inspect:

  • queue.depth, queue.perType
  • instances[].pollStatus (success/failure streaks and last error)
  • instances[].breaker (current breaker state, retry windows)
  • instances[].deadLetter (reason, retry counts, schedules)
  • staleness (normalized freshness score)

Common queries:

Instances with errors:

curl -s http://<host>:7655/api/monitoring/scheduler/health \
  | jq '.instances[] | select(.pollStatus.lastError != null) | {key, lastError: .pollStatus.lastError}'

Current dead-letter entries:

curl -s http://<host>:7655/api/monitoring/scheduler/health \
  | jq '.instances[] | select(.deadLetter.present) | {key, reason: .deadLetter.reason, retryCount: .deadLetter.retryCount}'

Breakers not closed:

curl -s http://<host>:7655/api/monitoring/scheduler/health \
  | jq '.instances[] | select(.breaker.state != "closed") | {key, breaker: .breaker}'
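
Queue breakdown and staleness score (a sketch using the queue.perType and staleness fields listed above):

curl -s http://<host>:7655/api/monitoring/scheduler/health \
  | jq '{perType: .queue.perType, staleness: .staleness}'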

When to Roll Back

Roll back immediately if any of the following occurs:

  • Queue depth > 3× baseline for > 15 min
  • Staleness > 120 s on majority of instances
  • Dead-letter count doubles without clear cause
  • Customer-facing alerts or latency regressions attributed to adaptive polling
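
The first two triggers can be watched directly in PromQL (a sketch; replace the example baseline with the figure you recorded in §1):

    # Staleness above 120 s on a majority of instances
    count(pulse_monitor_poll_staleness_seconds > 120)
      / count(pulse_monitor_poll_staleness_seconds) > 0.5

    # Queue depth above 3× your recorded baseline (example baseline of 20)
    pulse_monitor_poll_queue_depth > 3 * 20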

Document the incident and notify stakeholders after rollback.