# Adaptive Polling Operations Runbook

**GA in v4.24.0 - Enabled by Default**

This runbook guides operators managing the adaptive polling scheduler in production. Adaptive polling is enabled by default in v4.24.0 and can be toggled via **Settings → System → Monitoring** (no restart) or environment variables (restart required).

Follow these operational procedures for steady-state monitoring, troubleshooting, and rollback scenarios.

---
## 1. Preparation (Before v4.24.0 Deployment)

**For new deployments or upgrades to v4.24.0:**

1. **Monitoring readiness**
   - Review available scrape series in [Prometheus Metrics](../monitoring/PROMETHEUS_METRICS.md).
   - Set up a Grafana dashboard with:
     - Instance gauges: `pulse_monitor_poll_queue_depth`, `pulse_monitor_poll_staleness_seconds`, `pulse_monitor_poll_last_success_timestamp`
     - **Per-node coverage:** `pulse_monitor_node_poll_staleness_seconds`, `pulse_monitor_node_poll_errors_total`, `pulse_monitor_node_poll_duration_seconds`
     - Scheduler health: `pulse_scheduler_queue_depth`, `pulse_scheduler_queue_due_soon`, `pulse_scheduler_dead_letter_depth`, `pulse_scheduler_breaker_state`, and `pulse_scheduler_breaker_failure_count`
     - Diagnostics cache sanity: `increase(pulse_diagnostics_cache_misses_total[5m])` vs `increase(pulse_diagnostics_cache_hits_total[5m])`
     - HTTP SLA: `rate(pulse_http_request_errors_total{status_class="server_error"}[5m])`
     - Throughput trending: `pulse_monitor_poll_total` / `pulse_monitor_poll_errors_total`
   - Configure alerts (see §4).
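
   Before building panels, it helps to confirm these series are actually being scraped. A minimal sketch, assuming a Prometheus server at `http://prometheus:9090` (hypothetical; substitute your own):

   ```bash
   # Hypothetical Prometheus address; substitute your own server.
   PROM=http://prometheus:9090
   for metric in pulse_monitor_poll_queue_depth \
                 pulse_monitor_node_poll_staleness_seconds \
                 pulse_scheduler_breaker_state; do
     # /api/v1/query returns matching series under .data.result
     count=$(curl -s "${PROM}/api/v1/query" --data-urlencode "query=${metric}" \
       | jq '.data.result | length')
     echo "${metric}: ${count} series"
   done
   ```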
2. **Baseline metrics**
   - Record pre-upgrade metrics if upgrading from < v4.24.0:
     - Typical polling frequency
     - Average response times
     - Current alert volumes
   - These help assess adaptive polling impact
3. **Rollback readiness**
   - Verify the update rollback workflow: **Settings → System → Updates → Restore previous version**
   - Test rollback in a staging environment
   - Confirm `/api/monitoring/scheduler/health` is accessible
   - Document the emergency disable procedure (see §5)
4. **Configuration review**
   - Review `system.json` or environment variables for adaptive polling tunables:
     - `ADAPTIVE_POLLING_BASE_INTERVAL` (default: 10s)
     - `ADAPTIVE_POLLING_MIN_INTERVAL` (default: 5s)
     - `ADAPTIVE_POLLING_MAX_INTERVAL` (default: 5m)
   - Adjust if needed for your environment (e.g., high-frequency monitoring); see the sketch below
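
   A minimal sketch of applying these via a systemd override, mirroring the pattern in §5 (values shown are the defaults, for illustration only):

   ```bash
   # Open an override file for the Pulse unit
   sudo systemctl edit pulse
   # In the editor, add (illustrative values):
   #   [Service]
   #   Environment="ADAPTIVE_POLLING_BASE_INTERVAL=10s"
   #   Environment="ADAPTIVE_POLLING_MIN_INTERVAL=5s"
   #   Environment="ADAPTIVE_POLLING_MAX_INTERVAL=5m"
   # Environment variable changes require a restart
   sudo systemctl restart pulse
   ```
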
---
## 2. Post-Deployment Verification (v4.24.0+)

**Adaptive polling is enabled by default. Verify it's working correctly:**

1. **Check scheduler health**

   ```bash
   curl -s http://<host>:7655/api/monitoring/scheduler/health | jq
   ```

   **Expected response:**
   - `"enabled": true`
   - `queue.depth` reasonable (< instances × 1.5)
   - `deadLetter.count` = 0 (or only known failing instances)
   - `instances[]` array populated with your nodes
   - No `breaker.state` stuck in `open` (except known issues)
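
   A scripted version of these checks (a sketch; field paths follow the response shape referenced throughout this runbook):

   ```bash
   # Fetch once, then check each expectation above
   HEALTH=$(curl -s http://<host>:7655/api/monitoring/scheduler/health)
   echo "$HEALTH" | jq '.enabled'            # expect: true
   echo "$HEALTH" | jq '.queue.depth'        # compare against instance count × 1.5
   echo "$HEALTH" | jq '.deadLetter.count'   # expect: 0 (or only known failures)
   echo "$HEALTH" | jq '[.instances[] | select(.breaker.state == "open")] | length'  # expect: 0
   ```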
2. **Verify UI access**
   - Navigate to **Settings → System → Monitoring**
   - Confirm the "Adaptive Polling" toggle is ON
   - Review queue depth and recent poll status
3. **Check Grafana metrics** (if configured)
   - `pulse_monitor_poll_queue_depth` shows reasonable values
   - `pulse_monitor_poll_staleness_seconds` < 60s for healthy instances
   - `pulse_monitor_poll_errors_total` not rapidly increasing
   - `pulse_monitor_poll_last_success_timestamp` updating regularly
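
   If you prefer the command line over Grafana, the staleness check as a sketch (again assuming a Prometheus server at `http://prometheus:9090`):

   ```bash
   # Worst-case instance staleness across the fleet (healthy: < 60)
   curl -s 'http://prometheus:9090/api/v1/query' \
     --data-urlencode 'query=max(pulse_monitor_poll_staleness_seconds)' \
     | jq -r '.data.result[0].value[1]'
   ```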
4. **Monitor update history**
   - Check **Settings → System → Updates** for the update entry
   - Verify the rollback button is available
   - Confirm update status shows "completed"

   **Via API:**

   ```bash
   curl -s http://<host>:7655/api/updates/history | jq '.entries[0]'
   ```

---
## 3. Steady-State Operations

**Ongoing monitoring and SLO checks:**

1. **Daily health checks**
   - Review the scheduler health dashboard or API endpoint
   - Check for:
     - Queue depth < 50 (alert if > 50 for 10+ minutes)
     - Instance staleness < 60s for healthy instances, < 120s for critical instances
     - **Per-node staleness** `pulse_monitor_node_poll_staleness_seconds` < 120s
     - DLQ count stable (not growing)
     - Circuit breakers mostly `closed`
     - Diagnostics cache misses roughly tracking hits (`increase(pulse_diagnostics_cache_misses_total[10m])` in line with the equivalent hits increase)
     - HTTP error rate (`rate(pulse_http_request_errors_total{status_class="server_error"}[5m])`) near zero
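
   These checks can be rolled into a small script for cron or a morning routine; a sketch using only the endpoint and thresholds above (host is a placeholder):

   ```bash
   #!/usr/bin/env bash
   # Daily adaptive-polling health check (sketch; adjust host and thresholds)
   set -euo pipefail
   HEALTH=$(curl -s "http://<host>:7655/api/monitoring/scheduler/health")

   depth=$(echo "$HEALTH" | jq '.queue.depth')
   dlq=$(echo "$HEALTH" | jq '.deadLetter.count')
   open=$(echo "$HEALTH" | jq '[.instances[] | select(.breaker.state != "closed")] | length')

   [ "$depth" -lt 50 ] || echo "WARN: queue depth ${depth} >= 50"
   [ "$dlq" -eq 0 ]    || echo "WARN: ${dlq} dead-letter entries"
   [ "$open" -eq 0 ]   || echo "WARN: ${open} circuit breakers not closed"
   ```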
2. **Weekly reviews**
   - Analyze trends in Grafana:
     - Poll success rates
     - Average queue depth over time
     - Circuit breaker trip frequency
     - Dead-letter queue patterns
   - Document any recurring issues
3. **SLO targets**
   - **Queue depth**: < 1.5× instance count (< 50 typical)
   - **Staleness**: < 60s for healthy instances, < 120s for critical instances
   - **Poll success rate**: > 99% for healthy infrastructure
   - **DLQ growth**: < 5% per week (excluding known failures)
   - **Circuit breaker recovery**: < 5 minutes for transient failures
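
   The success-rate SLO can be computed directly in PromQL, using the `result` label shown in §4; a sketch (Prometheus address is hypothetical):

   ```bash
   # Poll success ratio over the last hour (SLO target: > 0.99)
   curl -s 'http://prometheus:9090/api/v1/query' \
     --data-urlencode 'query=sum(rate(pulse_monitor_poll_total{result="success"}[1h])) / sum(rate(pulse_monitor_poll_total[1h]))' \
     | jq -r '.data.result[0].value[1]'
   ```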
4. **Log correlation**
   - Cross-reference scheduler health with update history
   - Check `/api/updates/history` for rollback events correlated with scheduler issues
   - Review audit logs for adaptive polling configuration changes
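
   For example, to pull rollback events out of the update history (the `action` field follows the verification example in §5):

   ```bash
   # List rollback entries to correlate with scheduler incidents
   curl -s http://<host>:7655/api/updates/history \
     | jq '.entries[] | select(.action == "rollback")'
   ```
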
---
## 4. Grafana & Alert Configuration

1. **Dashboard panels**
   - **Queue Depth**: `pulse_monitor_poll_queue_depth`
     - Use a single-stat panel with an alert if > 1.5× active instances for > 10 min
   - **Instance & Node Staleness**: combine `pulse_monitor_poll_staleness_seconds` and `pulse_monitor_node_poll_staleness_seconds`
     - Alert threshold: > 60s for > 5 min (excluding known failing instances)
   - **Polling Throughput**: rate of `pulse_monitor_poll_total{result="success"}` vs `result="error"`
   - **Per-node errors**: table or graph of `pulse_monitor_node_poll_errors_total` to spot noisy nodes
   - **Scheduler Health**: panels for `pulse_scheduler_queue_depth`, `pulse_scheduler_queue_due_soon`, `pulse_scheduler_dead_letter_depth`, `pulse_scheduler_breaker_state`, `pulse_scheduler_breaker_failure_count`
   - **Diagnostics Cache**: compare `increase(pulse_diagnostics_cache_hits_total[5m])` vs misses so spikes stand out
   - **HTTP SLA**: `rate(pulse_http_request_errors_total{status_class="server_error"}[5m])`
   - **Last Success Timestamp** (v4.24.0+): `pulse_monitor_poll_last_success_timestamp` to detect polling gaps
2. **Alerts**
   - Queue depth > threshold for > 10 min (Warning), > 20 min (Critical)
   - Instance or node staleness > 60s for > 5 min (Critical)
   - Dead-letter count increase > N (based on baseline) triggers Warning
   - Any breaker stuck in `open` for > 10 min (from `pulse_scheduler_breaker_state`) triggers Critical
   - Queue wait > 5s (95th percentile on `pulse_scheduler_queue_wait_seconds`) triggers Warning
   - Permanent failures (`pulse_monitor_poll_errors_total{category="permanent"}`) trigger immediate Critical
   - Diagnostics refresh duration > 20s alongside miss spikes should page engineering (`pulse_diagnostics_refresh_duration_seconds`)
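
   Expressions can be smoke-tested against Prometheus before wiring them into alert rules; a sketch for the queue-wait percentile (assuming the metric is exported as a Prometheus histogram, hence the `_bucket` suffix):

   ```bash
   # 95th percentile queue wait over 5m (Warning threshold: 5s)
   curl -s 'http://prometheus:9090/api/v1/query' \
     --data-urlencode 'query=histogram_quantile(0.95, sum(rate(pulse_scheduler_queue_wait_seconds_bucket[5m])) by (le))' \
     | jq -r '.data.result[0].value[1]'
   ```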
3. **Notification routing**
   - Ensure alerts route to on-call + the feature owner

---
## 5. Rollback Procedures

### Option A: Disable Adaptive Polling (Keep v4.24.0)

**If adaptive polling causes issues but you want to keep v4.24.0:**

1. **Via UI (no restart required)**
   - Navigate to **Settings → System → Monitoring**
   - Toggle "Adaptive Polling" OFF
   - Changes apply immediately
2. **Via environment variables (restart required)**

   ```bash
   # Systemd: add an override, then restart
   sudo systemctl edit pulse
   # In the editor, add:
   #   [Service]
   #   Environment="ADAPTIVE_POLLING_ENABLED=false"
   sudo systemctl restart pulse
   ```

3. **Verification**

   ```bash
   curl -s http://<host>:7655/api/monitoring/scheduler/health | jq '.enabled'
   # Should return: false
   ```
### Option B: Full Version Rollback

**If v4.24.0 causes broader issues:**

1. **Via UI**
   - Navigate to **Settings → System → Updates**
   - Click **"Restore previous version"**
   - Confirm the rollback
   - Pulse restarts automatically with the previous version
2. **Via CLI**

   ```bash
   # Systemd installations
   sudo /opt/pulse/pulse config rollback
   # LXC containers
   pct exec <ctid> -- bash -c "cd /opt/pulse && ./pulse config rollback"
   ```

3. **Verification**

   ```bash
   # Check version
   curl -s http://<host>:7655/api/version | jq '.version'
   # Check update history
   curl -s http://<host>:7655/api/updates/history | jq '.entries[0]'
   # Should show action="rollback", status="completed"
   ```
4. **Post-rollback**
   - Verify the rollback is logged in update history
   - Check the journal: `journalctl -u pulse | grep rollback`
   - Monitor for 15-30 minutes to ensure stability
   - Document the rollback reason and notify stakeholders

---
## 6. Troubleshooting

| Symptom | Possible Cause | Action |
|---------|----------------|--------|
| Queue depth remains high (> 2× usual) | Insufficient workers, hidden breaker, misconfigured flag | Check the scheduler health API for breaker states; consider increasing workers or reverting the flag |
| Staleness spikes across many instances | Backend API slowdown or connectivity issues | Inspect backend logs and network health; revert the flag if the condition lasts > 15 min |
| Dead-letter count climbs rapidly | Downstream API failures | Investigate specific instances via the scheduler health API; fix credential/connectivity issues or roll back |
| Circuit breakers stuck half-open/open | Persistent transient failures | Review error logs, ensure backoff/rate limits are not starving retries; roll back if not resolved quickly |
| Grafana panels flatline | Metrics exporter or scrape job issue | Ensure Prometheus scraping is working; verify the service restarted with the flag |
### Accessing Scheduler Health API

```bash
curl -s http://<host>:7655/api/monitoring/scheduler/health | jq
```

Key sections to inspect:

- `queue.depth`, `queue.perType`
- `instances[].pollStatus` (success/failure streaks and last error)
- `instances[].breaker` (current breaker state, retry windows)
- `instances[].deadLetter` (reason, retry counts, schedules)
- `staleness` (normalized freshness score)

Common queries:

**Instances with errors:**
```bash
curl -s http://<host>:7655/api/monitoring/scheduler/health \
| jq '.instances[] | select(.pollStatus.lastError != null) | {key, lastError: .pollStatus.lastError}'
```
**Current dead-letter entries:**
```bash
curl -s http://<host>:7655/api/monitoring/scheduler/health \
| jq '.instances[] | select(.deadLetter.present) | {key, reason: .deadLetter.reason, retryCount: .deadLetter.retryCount}'
```
**Breakers not closed:**
```bash
curl -s http://<host>:7655/api/monitoring/scheduler/health \
| jq '.instances[] | select(.breaker.state != "closed") | {key, breaker: .breaker}'
```

### When to Roll Back

Roll back immediately if any of the following occurs:

- Queue depth > 3× baseline for > 15 min
- Staleness > 120s on the majority of instances
- Dead-letter count doubles without clear cause
- Customer-facing alerts or latency regressions attributed to adaptive polling
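
A simple watch loop for the first criterion (a sketch; set `BASELINE` from the pre-upgrade numbers recorded in §1):

```bash
# Watch queue depth against 3x baseline and flag the rollback criterion
BASELINE=20   # illustrative; use your recorded baseline
while sleep 60; do
  depth=$(curl -s http://<host>:7655/api/monitoring/scheduler/health | jq '.queue.depth')
  if [ "$depth" -gt $((BASELINE * 3)) ]; then
    echo "ROLLBACK CRITERION MET: queue depth ${depth} > 3x baseline (${BASELINE})"
  fi
done
```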
Document the incident and notify stakeholders after rollback.

---
## 7. Related Documentation
- [Scheduler Health API](../api/SCHEDULER_HEALTH.md) - Complete API reference
- [Adaptive Polling Architecture](../monitoring/ADAPTIVE_POLLING.md) - Technical details
- [Management Endpoints](ADAPTIVE_POLLING_MANAGEMENT_ENDPOINTS.md) - Circuit breaker/DLQ controls
- [Configuration Guide](../CONFIGURATION.md) - Adaptive polling settings