Pulse/docs/monitoring/PROMETHEUS_METRICS.md
rcourtman e0396c1362 docs: update documentation for diagnostics improvements
Add comprehensive operator documentation for the new observability features
introduced in the previous commit.

**New Documentation:**
- docs/monitoring/PROMETHEUS_METRICS.md - Complete reference for all 18 new
  Prometheus metrics with alert suggestions

**Updated Documentation:**
- docs/API.md - Document X-Request-ID and X-Diagnostics-Cached-At headers,
  explain diagnostics endpoint caching behavior
- docs/TROUBLESHOOTING.md - Add section on correlating API calls with logs
  using request IDs
- docs/operations/ADAPTIVE_POLLING_ROLLOUT.md - Update monitoring checklists
  with new per-node and scheduler metrics
- docs/CONFIGURATION.md - Clarify LOG_FILE dual-output behavior and rotation
  defaults

These updates ensure operators understand:
- How to set up monitoring/alerting for new metrics
- How to configure file logging with rotation
- How to troubleshoot using request correlation
- What metrics are available for dashboards

Related to: 495e6c794 (feat: comprehensive diagnostics improvements)
2025-10-21 12:45:19 +00:00

81 lines
4.5 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# Pulse Prometheus Metrics (v4.24.0+)
Pulse exposes multiple metric families that cover HTTP ingress, per-node poll execution, scheduler health, and diagnostics caching. Use the following reference when wiring dashboards or alert rules.
---
## HTTP Request Metrics
| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| `pulse_http_request_duration_seconds` | Histogram | `method`, `route`, `status` | Request latency buckets. `route` is a normalised path (dynamic segments collapsed to `:id`, `:uuid`, etc.). |
| `pulse_http_requests_total` | Counter | `method`, `route`, `status` | Total requests handled. |
| `pulse_http_request_errors_total` | Counter | `method`, `route`, `status_class` | Counts 4xx/5xx responses. |
**Alert suggestion:**
`rate(pulse_http_request_errors_total{status_class="server_error"}[5m]) > 0.05` (more than ~3 server errors/min) should page ops.
---
## Per-Node Poll Metrics
| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| `pulse_monitor_node_poll_duration_seconds` | Histogram | `instance_type`, `instance`, `node` | Wall-clock duration for each node poll. |
| `pulse_monitor_node_poll_total` | Counter | `instance_type`, `instance`, `node`, `result` | Success/error counts per node. |
| `pulse_monitor_node_poll_errors_total` | Counter | `instance_type`, `instance`, `node`, `error_type` | Error type breakdown (connection, auth, internal, etc.). |
| `pulse_monitor_node_poll_last_success_timestamp` | Gauge | `instance_type`, `instance`, `node` | Unix timestamp of last successful poll. |
| `pulse_monitor_node_poll_staleness_seconds` | Gauge | `instance_type`, `instance`, `node` | Seconds since last success (1 means no success yet). |
**Alert suggestion:**
`max_over_time(pulse_monitor_node_poll_staleness_seconds{node!=""}[10m]) > 300` indicates a node has been stale for 5+ minutes.
---
## Scheduler Health Metrics
| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| `pulse_scheduler_queue_due_soon` | Gauge | — | Number of tasks due within 12 seconds. |
| `pulse_scheduler_queue_depth` | Gauge | `instance_type` | Queue depth per instance type (PVE, PBS, PMG). |
| `pulse_scheduler_queue_wait_seconds` | Histogram | `instance_type` | Wait time between when a task should run and when it actually executes. |
| `pulse_scheduler_dead_letter_depth` | Gauge | `instance_type`, `instance` | Dead-letter queue depth per monitored instance. |
| `pulse_scheduler_breaker_state` | Gauge | `instance_type`, `instance` | Circuit breaker state: `0`=closed, `1`=half-open, `2`=open, `-1`=unknown. |
| `pulse_scheduler_breaker_failure_count` | Gauge | `instance_type`, `instance` | Consecutive failures tracked by the breaker. |
| `pulse_scheduler_breaker_retry_seconds` | Gauge | `instance_type`, `instance` | Seconds until the breaker will allow the next attempt. |
**Alert suggestions:**
- Queue saturation: `max_over_time(pulse_scheduler_queue_depth[10m]) > <instance count * 1.5>`
- DLQ growth: `increase(pulse_scheduler_dead_letter_depth[10m]) > 0`
- Breaker stuck open: `pulse_scheduler_breaker_state == 2` for > 10 minutes.
---
## Diagnostics Cache Metrics
| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| `pulse_diagnostics_cache_hits_total` | Counter | — | Diagnostics requests served from cache. |
| `pulse_diagnostics_cache_misses_total` | Counter | — | Requests that triggered a fresh probe. |
| `pulse_diagnostics_refresh_duration_seconds` | Histogram | — | Time taken to refresh diagnostics payload. |
**Alert suggestion:**
`rate(pulse_diagnostics_cache_misses_total[5m])` spiking alongside `pulse_diagnostics_refresh_duration_seconds` > 20s can signal upstream slowness.
---
## Existing Instance-Level Poll Metrics (for completeness)
The following metrics pre-date v4.24.0 but remain essential:
| Metric | Type | Description |
| --- | --- | --- |
| `pulse_monitor_poll_duration_seconds` | Histogram | Poll duration per instance. |
| `pulse_monitor_poll_total` | Counter | Success/error counts per instance. |
| `pulse_monitor_poll_errors_total` | Counter | Error counts per instance. |
| `pulse_monitor_poll_last_success_timestamp` | Gauge | Last successful poll timestamp. |
| `pulse_monitor_poll_staleness_seconds` | Gauge | Seconds since last successful poll (instance-level). |
| `pulse_monitor_poll_queue_depth` | Gauge | Current queue depth. |
| `pulse_monitor_poll_inflight` | Gauge | Polls currently running. |
Refer to this document whenever you build dashboards or craft alert policies. Scrape all metrics from the Pulse backend `/metrics` endpoint (9091 by default for systemd installs).