# Adaptive Polling Operations Runbook

**GA in v4.24.0 - Enabled by Default**

This runbook guides operators managing the adaptive polling scheduler in production. Adaptive polling is enabled by default in v4.24.0 and can be toggled via **Settings → System → Monitoring** (no restart) or environment variables (restart required).

Follow these operational procedures for steady-state monitoring, troubleshooting, and rollback scenarios.

---

## 1. Preparation (Before v4.24.0 Deployment)

**For new deployments or upgrades to v4.24.0:**

1. **Monitoring readiness**
   - Review available scrape series in [Prometheus Metrics](../monitoring/PROMETHEUS_METRICS.md).
   - Set up Grafana dashboard with:
     - Instance gauges: `pulse_monitor_poll_queue_depth`, `pulse_monitor_poll_staleness_seconds`, `pulse_monitor_poll_last_success_timestamp`
     - **Per-node coverage:** `pulse_monitor_node_poll_staleness_seconds`, `pulse_monitor_node_poll_errors_total`, `pulse_monitor_node_poll_duration_seconds`
     - Scheduler health: `pulse_scheduler_queue_depth`, `pulse_scheduler_queue_due_soon`, `pulse_scheduler_dead_letter_depth`, `pulse_scheduler_breaker_state`, and `pulse_scheduler_breaker_failure_count`
     - Diagnostics cache sanity: `increase(pulse_diagnostics_cache_misses_total[5m])` vs `pulse_diagnostics_cache_hits_total`
     - HTTP SLA: `rate(pulse_http_request_errors_total{status_class="server_error"}[5m])`
     - Continue trending `pulse_monitor_poll_total` / `pulse_monitor_poll_errors_total` for throughput
   - Configure alerts (see §4)

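
   Before building dashboards, it helps to confirm the new series are actually exposed. A minimal spot check, assuming Pulse serves its Prometheus series at `/metrics` on the API port (adjust the path to whatever PROMETHEUS_METRICS.md documents for your install):

   ```bash
   # List the adaptive polling / scheduler series currently exposed.
   # NOTE: the /metrics path is an assumption; use your install's documented endpoint.
   curl -s http://<host>:7655/metrics \
     | grep -E '^pulse_(monitor_poll|monitor_node_poll|scheduler)_' \
     | cut -d'{' -f1 | sort -u
   ```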
2. **Baseline metrics**
   - Record pre-upgrade metrics if upgrading from < v4.24.0:
     - Typical polling frequency
     - Average response times
     - Current alert volumes
   - These help assess adaptive polling impact

3. **Rollback readiness**
   - Verify update rollback workflow: **Settings → System → Updates → Restore previous version**
   - Test rollback in staging environment
   - Confirm `/api/monitoring/scheduler/health` accessible
   - Document emergency disable procedure (see §5)

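
   A quick reachability check against the endpoint used throughout this runbook:

   ```bash
   # Fail loudly if the scheduler health endpoint cannot be reached.
   curl -fsS http://<host>:7655/api/monitoring/scheduler/health > /dev/null \
     && echo "scheduler health endpoint reachable" \
     || echo "WARNING: scheduler health endpoint unreachable"
   ```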
4. **Configuration review**
   - Review `system.json` or environment variables for adaptive polling tunables:
     - `ADAPTIVE_POLLING_BASE_INTERVAL` (default: 10s)
     - `ADAPTIVE_POLLING_MIN_INTERVAL` (default: 5s)
     - `ADAPTIVE_POLLING_MAX_INTERVAL` (default: 5m)
   - Adjust if needed for your environment (e.g., high-frequency monitoring)

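
   If you do override the defaults, the same systemd drop-in pattern used for the emergency disable in §5 can carry these tunables. This is only a sketch; the values shown are the documented defaults, and the accepted value syntax should be confirmed in the [Configuration Guide](../CONFIGURATION.md):

   ```bash
   # Sketch: set adaptive polling tunables via a systemd drop-in (restart required).
   sudo systemctl edit pulse
   # Add:
   [Service]
   Environment="ADAPTIVE_POLLING_BASE_INTERVAL=10s"
   Environment="ADAPTIVE_POLLING_MIN_INTERVAL=5s"
   Environment="ADAPTIVE_POLLING_MAX_INTERVAL=5m"

   # Then restart
   sudo systemctl restart pulse
   ```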
---

## 2. Post-Deployment Verification (v4.24.0+)

**Adaptive polling is enabled by default. Verify it's working correctly:**

1. **Check scheduler health**

   ```bash
   curl -s http://<host>:7655/api/monitoring/scheduler/health | jq
   ```

   **Expected response:**
   - `"enabled": true`
   - `queue.depth` reasonable (< instances × 1.5)
   - `deadLetter.count` = 0 (or only known failing instances)
   - `instances[]` array populated with your nodes
   - No `breaker.state` stuck in `open` (except known issues)

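
   To pull just the headline fields from that response (the exact payload shape beyond `enabled`, `queue.depth`, `instances[].key`, and `instances[].breaker.state` is an assumption; adjust to what your version returns):

   ```bash
   curl -s http://<host>:7655/api/monitoring/scheduler/health \
     | jq '{enabled, queueDepth: .queue.depth, instances: (.instances | length),
            openBreakers: [.instances[] | select(.breaker.state != "closed") | .key]}'
   ```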
2. **Verify UI access**
   - Navigate to **Settings → System → Monitoring**
   - Confirm "Adaptive Polling" toggle is ON
   - Review queue depth and recent poll status

3. **Check Grafana metrics** (if configured)
   - `pulse_monitor_poll_queue_depth` shows reasonable values
   - `pulse_monitor_poll_staleness_seconds` < 60s for healthy instances
   - `pulse_monitor_poll_errors_total` not rapidly increasing
   - `pulse_monitor_poll_last_success_timestamp` updating regularly

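
   If you prefer checking the raw series to eyeballing panels, an instant query against Prometheus works too (the Prometheus address is a placeholder for your server):

   ```bash
   # Seconds since each instance's last successful poll.
   curl -sG http://<prometheus-host>:9090/api/v1/query \
     --data-urlencode 'query=time() - pulse_monitor_poll_last_success_timestamp' \
     | jq '.data.result[] | {labels: .metric, secondsSinceSuccess: .value[1]}'
   ```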
4. **Monitor update history**
   - Check **Settings → System → Updates** for update entry
   - Verify rollback button is available
   - Confirm update status shows "completed"

   **Via API:**

   ```bash
   curl -s http://<host>:7655/api/updates/history | jq '.entries[0]'
   ```

---

## 3. Steady-State Operations

**Ongoing monitoring and SLO checks:**

1. **Daily health checks**
   - Review scheduler health dashboard or API endpoint
   - Check for:
     - Queue depth < 50 (alert if > 50 for 10+ minutes)
     - Instance staleness < 60s for healthy instances, < 120s for critical instances
     - **Per-node staleness** `pulse_monitor_node_poll_staleness_seconds` < 120s
     - DLQ count stable (not growing)
     - Circuit breakers mostly `closed`
     - Diagnostics cache misses roughly follow hits (`increase(pulse_diagnostics_cache_misses_total[10m])` in line with hits)
     - HTTP error rate (`rate(pulse_http_request_errors_total{status_class="server_error"}[5m])`) near zero

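
   These checks can be collapsed into a single scripted pass over the health endpoint. A minimal sketch, using the same `jq` paths shown elsewhere in this runbook:

   ```bash
   #!/usr/bin/env bash
   # Daily spot check of the adaptive polling scheduler.
   HOST="http://<host>:7655"
   HEALTH=$(curl -s "$HOST/api/monitoring/scheduler/health")

   echo "queue depth: $(echo "$HEALTH" | jq '.queue.depth')"
   echo "breakers not closed:"
   echo "$HEALTH" | jq '[.instances[] | select(.breaker.state != "closed") | .key]'
   echo "dead-letter entries:"
   echo "$HEALTH" | jq '[.instances[] | select(.deadLetter.present) | .key]'
   ```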
2. **Weekly reviews**
   - Analyze trends in Grafana:
     - Poll success rates
     - Average queue depth over time
     - Circuit breaker trip frequency
     - Dead-letter queue patterns
   - Document any recurring issues

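
   Instant queries like the following can seed the weekly review. The Prometheus address is a placeholder, and using `changes()` on `pulse_scheduler_breaker_state` assumes the breaker state is exported as a numeric gauge:

   ```bash
   # Poll success ratio over the past 7 days.
   curl -sG http://<prometheus-host>:9090/api/v1/query \
     --data-urlencode 'query=sum(increase(pulse_monitor_poll_total{result="success"}[7d])) / sum(increase(pulse_monitor_poll_total[7d]))'

   # How often each breaker changed state over the past 7 days (trip-frequency proxy).
   curl -sG http://<prometheus-host>:9090/api/v1/query \
     --data-urlencode 'query=changes(pulse_scheduler_breaker_state[7d])'
   ```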
3. **SLO targets**
   - **Queue depth**: < 1.5× instance count (< 50 typical)
   - **Staleness**: < 60s for healthy instances, < 120s for critical instances
   - **Poll success rate**: > 99% for healthy infrastructure
   - **DLQ growth**: < 5% per week (excluding known failures)
   - **Circuit breaker recovery**: < 5 minutes for transient failures

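
   The DLQ growth target can be checked directly from the depth gauge. A sketch, assuming `pulse_scheduler_dead_letter_depth` reflects the current DLQ size (a result above 0.05 breaches the weekly target; the expression returns nothing if the series did not exist a week ago):

   ```bash
   curl -sG http://<prometheus-host>:9090/api/v1/query \
     --data-urlencode 'query=delta(pulse_scheduler_dead_letter_depth[7d]) / (pulse_scheduler_dead_letter_depth offset 7d)'
   ```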
4. **Log correlation**
   - Cross-reference scheduler health with update history
   - Check `/api/updates/history` for rollback events correlated with scheduler issues
   - Review audit logs for adaptive polling configuration changes

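
   Rollback events can be pulled straight from the update history for that cross-check (entry fields other than `action` and `status` are assumptions):

   ```bash
   # List rollback entries to line up against scheduler incidents.
   curl -s http://<host>:7655/api/updates/history \
     | jq '.entries[] | select(.action == "rollback")'
   ```
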
---

## 4. Grafana & Alert Configuration

1. **Dashboard panels**
   - **Queue Depth**: `pulse_monitor_poll_queue_depth`
     - Use single-stat with alert if > 1.5× active instances for > 10 min
   - **Instance & Node Staleness**: combine `pulse_monitor_poll_staleness_seconds` and `pulse_monitor_node_poll_staleness_seconds`
     - Alert threshold: > 60 s for > 5 min (excluding known failing instances)
   - **Polling Throughput**: rate of `pulse_monitor_poll_total{result="success"}` vs `result="error"`
   - **Per-node errors**: table or graph of `pulse_monitor_node_poll_errors_total` to spot noisy nodes
   - **Scheduler Health**: panels for `pulse_scheduler_queue_depth`, `pulse_scheduler_queue_due_soon`, `pulse_scheduler_dead_letter_depth`, `pulse_scheduler_breaker_state`, `pulse_scheduler_breaker_failure_count`
   - **Diagnostics Cache**: compare `increase(pulse_diagnostics_cache_hits_total[5m])` vs misses so spikes stand out
   - **HTTP SLA**: `rate(pulse_http_request_errors_total{status_class="server_error"}[5m])`
   - **Last Success Timestamp** (v4.24.0+): `pulse_monitor_poll_last_success_timestamp` to detect polling gaps

2. **Alerts**
   - Queue depth > threshold for >10 min (Warning), >20 min (Critical)
   - Instance or node staleness > 60 s for >5 min (Critical)
   - Dead-letter count increase > N (based on baseline) triggers Warning
   - Any breaker stuck in `open` for >10 min (from `pulse_scheduler_breaker_state`) triggers Critical
   - Queue wait > 5 s (95th percentile on `pulse_scheduler_queue_wait_seconds`) triggers Warning
   - Permanent failures (`pulse_monitor_poll_errors_total{category="permanent"}`) trigger immediate Critical
   - Diagnostics refresh duration > 20 s alongside miss spikes should page engineering (`pulse_diagnostics_refresh_duration_seconds`)

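
   As a starting point, the first two alerts above might look like this as Prometheus alerting rules (queue-depth threshold of 50 taken from the daily-check guidance in §3). Treat it as a sketch - group name, labels, and file path are placeholders - and validate with `promtool` before loading:

   ```bash
   # Write a rule file for the queue-depth and staleness alerts, then lint it.
   cat > pulse-adaptive-polling-alerts.yml <<'EOF'
   groups:
     - name: pulse-adaptive-polling
       rules:
         - alert: PulsePollQueueDepthHigh
           expr: pulse_monitor_poll_queue_depth > 50
           for: 10m
           labels:
             severity: warning
           annotations:
             summary: "Pulse poll queue depth above 50 for 10+ minutes"
         - alert: PulsePollStalenessHigh
           expr: pulse_monitor_poll_staleness_seconds > 60
           for: 5m
           labels:
             severity: critical
           annotations:
             summary: "Pulse instance staleness above 60 s for 5+ minutes"
   EOF

   promtool check rules pulse-adaptive-polling-alerts.yml
   ```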
3. **Notification routing**
   - Ensure alerts route to on-call + feature owner

---

## 5. Rollback Procedures

### Option A: Disable Adaptive Polling (Keep v4.24.0)

**If adaptive polling causes issues but you want to keep v4.24.0:**

1. **Via UI (No restart required)**
   - Navigate to **Settings → System → Monitoring**
   - Toggle "Adaptive Polling" OFF
   - Changes apply immediately

2. **Via environment variables (Restart required)**

   ```bash
   # Systemd
   sudo systemctl edit pulse
   # Add:
   [Service]
   Environment="ADAPTIVE_POLLING_ENABLED=false"

   # Then restart
   sudo systemctl restart pulse
   ```

3. **Verification**

   ```bash
   curl -s http://<host>:7655/api/monitoring/scheduler/health | jq '.enabled'
   # Should return: false
   ```

### Option B: Full Version Rollback

**If v4.24.0 causes broader issues:**

1. **Via UI**
   - Navigate to **Settings → System → Updates**
   - Click **"Restore previous version"**
   - Confirm rollback
   - Pulse restarts automatically with previous version

2. **Via CLI**

   ```bash
   # Systemd installations
   sudo /opt/pulse/pulse config rollback

   # LXC containers
   pct exec <ctid> -- bash -c "cd /opt/pulse && ./pulse config rollback"
   ```

3. **Verification**

   ```bash
   # Check version
   curl -s http://<host>:7655/api/version | jq '.version'

   # Check update history
   curl -s http://<host>:7655/api/updates/history | jq '.entries[0]'
   # Should show action="rollback", status="completed"
   ```

4. **Post-rollback**
   - Verify rollback logged in update history
   - Check journal: `journalctl -u pulse | grep rollback`
   - Monitor for 15-30 minutes to ensure stability
   - Document rollback reason and notify stakeholders

---

## 6. Troubleshooting

| Symptom | Possible Cause | Action |
|---------|----------------|--------|
| Queue depth remains high (> 2× usual) | Insufficient workers, hidden breaker trip, misconfiguration | Check the scheduler health API for breaker states; consider increasing workers or disabling adaptive polling (§5, Option A) |
| Staleness spikes across many instances | Backend API slowdown or connectivity issues | Inspect backend logs and network health; disable adaptive polling if the condition lasts > 15 min |
| Dead-letter count climbs rapidly | Downstream API failures | Investigate specific instances via the scheduler health API; fix credential/connectivity issues or roll back |
| Circuit breakers stuck half-open/open | Persistent transient failures | Review error logs; ensure backoff/rate limits are not starving retries; roll back if not resolved quickly |
| Grafana panels flatline | Metrics exporter or scrape job issue | Confirm Prometheus scraping is working; verify the service restarted after configuration changes |

### Accessing Scheduler Health API

```bash
curl -s http://<host>:7655/api/monitoring/scheduler/health | jq
```

Key sections to inspect:

- `queue.depth`, `queue.perType`
- `instances[].pollStatus` (success/failure streaks and last error)
- `instances[].breaker` (current breaker state, retry windows)
- `instances[].deadLetter` (reason, retry counts, schedules)
- `staleness` (normalized freshness score)

Common queries:

**Instances with errors:**

```bash
curl -s http://<host>:7655/api/monitoring/scheduler/health \
  | jq '.instances[] | select(.pollStatus.lastError != null) | {key, lastError: .pollStatus.lastError}'
```

**Current dead-letter entries:**

```bash
curl -s http://<host>:7655/api/monitoring/scheduler/health \
  | jq '.instances[] | select(.deadLetter.present) | {key, reason: .deadLetter.reason, retryCount: .deadLetter.retryCount}'
```

**Breakers not closed:**

```bash
curl -s http://<host>:7655/api/monitoring/scheduler/health \
  | jq '.instances[] | select(.breaker.state != "closed") | {key, breaker: .breaker}'
```

### When to Roll Back

Roll back immediately if any of the following occurs:

- Queue depth > 3× baseline for > 15 min
- Staleness > 120 s on the majority of instances
- Dead-letter count doubles without clear cause
- Customer-facing alerts or latency regressions attributed to adaptive polling

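
The staleness condition can be checked with a quick instant query (the Prometheus address is a placeholder):

```bash
# Fraction of instances with staleness above 120 s; a value above 0.5 means the majority are stale.
# An empty result means no instance is currently above the threshold.
curl -sG http://<prometheus-host>:9090/api/v1/query \
  --data-urlencode 'query=count(pulse_monitor_poll_staleness_seconds > 120) / count(pulse_monitor_poll_staleness_seconds)' \
  | jq '.data.result[0].value[1]'
```
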
Document the incident and notify stakeholders after rollback.

---

## 7. Related Documentation

- [Scheduler Health API](../api/SCHEDULER_HEALTH.md) - Complete API reference
- [Adaptive Polling Architecture](../monitoring/ADAPTIVE_POLLING.md) - Technical details
- [Management Endpoints](ADAPTIVE_POLLING_MANAGEMENT_ENDPOINTS.md) - Circuit breaker/DLQ controls
- [Configuration Guide](../CONFIGURATION.md) - Adaptive polling settings