# 📊 Prometheus Metrics
Pulse exposes metrics at `/metrics` (default port `9091`).

Example scrape target:

- `http://<pulse-host>:9091/metrics`

This listener is separate from the main UI/API port (`7655`). In Docker and Kubernetes you must expose `9091` explicitly if you want to scrape it from outside the container/pod.
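
A minimal `prometheus.yml` sketch for the scrape target above. Only the port and path come from this page; the job name `pulse`, the 30-second interval, and the `pulse-host` hostname are placeholders to adapt.

```yaml
# Sketch of a static scrape job for Pulse.
# Assumptions: job name, interval, and hostname are placeholders.
scrape_configs:
  - job_name: pulse              # hypothetical job name
    metrics_path: /metrics       # path documented above
    scrape_interval: 30s         # assumption; pick what suits your setup
    static_configs:
      - targets:
          - "pulse-host:9091"    # replace with your Pulse host; 9091 is the default metrics port
```

With Docker, this only works if `9091` is published on the container in addition to `7655`.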
**Helm note:** the current chart exposes only port `7655`, so Prometheus scraping requires an additional Service that targets `9091` (and, if you use the Prometheus Operator, a matching ServiceMonitor).
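
One way the extra Service and ServiceMonitor could look is sketched below. The resource names and every label value are assumptions, not taken from the chart: check the labels on your Pulse pods (for example with `kubectl get pods --show-labels`) and adjust the selector to match. The ServiceMonitor part applies only when the Prometheus Operator CRDs are installed.

```yaml
# Sketch only: names and label values are hypothetical.
apiVersion: v1
kind: Service
metadata:
  name: pulse-metrics                      # hypothetical name
  labels:
    app.kubernetes.io/name: pulse-metrics  # hypothetical label, used by the ServiceMonitor below
spec:
  selector:
    app.kubernetes.io/name: pulse          # assumption: must match the labels on the Pulse pods
  ports:
    - name: metrics
      port: 9091
      targetPort: 9091                     # metrics listener documented above
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: pulse-metrics                      # hypothetical name
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: pulse-metrics
  endpoints:
    - port: metrics                        # refers to the Service port name
      path: /metrics
      interval: 30s                        # assumption
```

Depending on how your Prometheus resource is configured, the ServiceMonitor may also need extra labels so that it is selected.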
## 🌐 HTTP Ingress
| Metric | Type | Description |
| :--- | :--- | :--- |
| `pulse_http_request_duration_seconds` | Histogram | Latency buckets by `method`, `route`, `status`. |
| `pulse_http_requests_total` | Counter | Total requests by `method`, `route`, `status`. |
| `pulse_http_request_errors_total` | Counter | Error totals by `method`, `route`, `status_class` (`client_error`, `server_error`, `none`). |
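
The HTTP metrics above can be combined into derived series. The recording rules below are a sketch: the group and rule names are hypothetical, and the ratio assumes that every request counted in `pulse_http_request_errors_total` is also counted in `pulse_http_requests_total`. The `_bucket` series and its `le` label are what a Prometheus histogram exposes by convention.

```yaml
# Sketch of recording rules for the HTTP ingress metrics.
groups:
  - name: pulse-http                                   # hypothetical group name
    rules:
      # Fraction of requests that ended in a server error over the last 5 minutes.
      - record: pulse:http_server_error_ratio:rate5m   # hypothetical rule name
        expr: |
          sum(rate(pulse_http_request_errors_total{status_class="server_error"}[5m]))
          /
          sum(rate(pulse_http_requests_total[5m]))
      # Approximate 95th percentile request latency per route.
      - record: pulse:http_request_duration_seconds:p95_5m   # hypothetical rule name
        expr: |
          histogram_quantile(
            0.95,
            sum by (le, route) (rate(pulse_http_request_duration_seconds_bucket[5m]))
          )
```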
## 🔄 Polling & Nodes
| Metric | Type | Description |
| :--- | :--- | :--- |
| `pulse_monitor_poll_duration_seconds` | Histogram | Per-instance poll latency. |
| `pulse_monitor_poll_total` | Counter | Success/error counts per instance (`result` label). |
| `pulse_monitor_poll_errors_total` | Counter | Poll failures by `error_type`. |
| `pulse_monitor_poll_last_success_timestamp` | Gauge | Unix timestamp of last success. |
| `pulse_monitor_poll_staleness_seconds` | Gauge | Seconds since last success (`-1` if never succeeded). |
| `pulse_monitor_poll_queue_depth` | Gauge | Global queue depth. |
| `pulse_monitor_poll_inflight` | Gauge | In-flight polls by `instance_type`. |
| `pulse_monitor_node_poll_duration_seconds` | Histogram | Per-node poll latency. |
| `pulse_monitor_node_poll_total` | Counter | Success/error counts per node (`result` label). |
| `pulse_monitor_node_poll_errors_total` | Counter | Node poll failures by `error_type`. |
| `pulse_monitor_node_poll_last_success_timestamp` | Gauge | Unix timestamp of last node success. |
| `pulse_monitor_node_poll_staleness_seconds` | Gauge | Seconds since last node success (`-1` if never succeeded). |
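
Because the staleness gauges report `-1` for targets that have never been polled successfully, a threshold such as `> 300` does not catch them. A sketch of a rule that covers this case follows; the alert name, the 10-minute `for` window, and the severity label are assumptions.

```yaml
# Sketch: flag nodes that have never had a successful poll.
groups:
  - name: pulse-polling                    # hypothetical group name
    rules:
      - alert: PulseNodeNeverPolled        # hypothetical alert name
        expr: pulse_monitor_node_poll_staleness_seconds == -1
        for: 10m                           # assumption: grace period after startup
        labels:
          severity: warning                # assumption
        annotations:
          summary: "A Pulse node has not completed a successful poll since startup"
```

The same pattern applies to `pulse_monitor_poll_staleness_seconds` at the instance level.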
## 🧠 Scheduler Health
| Metric | Type | Description |
| :--- | :--- | :--- |
| `pulse_scheduler_queue_due_soon` | Gauge | Tasks due within the next 12 seconds. |
| `pulse_scheduler_queue_depth` | Gauge | Queue depth per `instance_type`. |
| `pulse_scheduler_queue_wait_seconds` | Histogram | Wait time between task readiness and execution. |
| `pulse_scheduler_dead_letter_depth` | Gauge | DLQ depth by `instance_type` and `instance`. |
| `pulse_scheduler_breaker_state` | Gauge | `0`=Closed, `1`=Half-Open, `2`=Open, `-1`=Unknown. |
| `pulse_scheduler_breaker_failure_count` | Gauge | Consecutive failure count. |
| `pulse_scheduler_breaker_retry_seconds` | Gauge | Seconds until next retry allowed. |
## ⚡ Diagnostics Cache
| Metric | Type | Description |
| :--- | :--- | :--- |
| `pulse_diagnostics_cache_hits_total` | Counter | Cache hits. |
| `pulse_diagnostics_cache_misses_total` | Counter | Cache misses. |
| `pulse_diagnostics_refresh_duration_seconds` | Histogram | Refresh latency. |
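
The two counters above are typically most useful as a ratio. A sketch of a recording rule for the cache hit ratio, with hypothetical group and rule names:

```yaml
# Sketch: diagnostics cache hit ratio over the last 5 minutes.
groups:
  - name: pulse-diagnostics                          # hypothetical group name
    rules:
      - record: pulse:diagnostics_cache_hit_ratio:rate5m   # hypothetical rule name
        expr: |
          sum(rate(pulse_diagnostics_cache_hits_total[5m]))
          /
          (
            sum(rate(pulse_diagnostics_cache_hits_total[5m]))
            +
            sum(rate(pulse_diagnostics_cache_misses_total[5m]))
          )
```

The expression yields no usable value while the cache sees no traffic, since the denominator is then zero.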
## 🚨 Alert Lifecycle
| Metric | Type | Description |
| :--- | :--- | :--- |
| `pulse_alerts_active` | Gauge | Active alerts by `level` and `type`. |
| `pulse_alerts_fired_total` | Counter | Total alerts fired by `level` and `type`. |
| `pulse_alerts_resolved_total` | Counter | Total alerts resolved by `type`. |
| `pulse_alerts_acknowledged_total` | Counter | Total alerts acknowledged. |
| `pulse_alerts_suppressed_total` | Counter | Alerts suppressed by `reason` (`quiet_hours`, `rate_limit`, `duplicate`, etc.). |
| `pulse_alerts_rate_limited_total` | Counter | Alerts suppressed due to rate limiting. |
| `pulse_alert_duration_seconds` | Histogram | Time from alert fire to resolve (by `type`). |
## 🚨 Alerting Examples
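- **High Error Rate**: `rate(pulse_http_request_errors_total[5m]) > 0.05` (more than 0.05 errors per second per series, not a percentage)
- **Stale Node**: `pulse_monitor_node_poll_staleness_seconds > 300` (no successful poll for over 5 minutes; nodes that have never succeeded report `-1` and do not match)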
- **Breaker Open**: `pulse_scheduler_breaker_state == 2`
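
The expressions above could be packaged as a Prometheus rules file along these lines. This is a sketch: the group name, alert names, `for` durations, severity labels, and annotation text are all assumptions; only the expressions come from this page.

```yaml
# Sketch of an alerting rules file built from the example expressions.
groups:
  - name: pulse-alerts                     # hypothetical group name
    rules:
      - alert: PulseHighErrorRate          # hypothetical alert name
        expr: rate(pulse_http_request_errors_total[5m]) > 0.05
        for: 5m                            # assumption
        labels:
          severity: warning                # assumption
        annotations:
          summary: "Pulse HTTP errors above 0.05/s on route {{ $labels.route }}"

      - alert: PulseStaleNode              # hypothetical alert name
        expr: pulse_monitor_node_poll_staleness_seconds > 300
        for: 1m                            # assumption
        labels:
          severity: warning                # assumption
        annotations:
          summary: "No successful node poll for {{ $value }} seconds"

      - alert: PulseBreakerOpen            # hypothetical alert name
        expr: pulse_scheduler_breaker_state == 2
        for: 2m                            # assumption
        labels:
          severity: critical               # assumption
        annotations:
          summary: "A Pulse scheduler circuit breaker is open"
```

Load the file through `rule_files` in `prometheus.yml`, or wrap the same groups in a `PrometheusRule` resource when using the Prometheus Operator.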