# 📊 Prometheus Metrics Pulse exposes metrics at `/metrics` (default port `9091`). Example scrape target: - `http://:9091/metrics` This listener is separate from the main UI/API port (`7655`). In Docker and Kubernetes you must expose `9091` explicitly if you want to scrape it from outside the container/pod. **Helm note:** the current chart exposes only port `7655`, so Prometheus scraping requires an additional Service that targets `9091` (and a matching ServiceMonitor). ## 🌐 HTTP Ingress | Metric | Type | Description | | :--- | :--- | :--- | | `pulse_http_request_duration_seconds` | Histogram | Latency buckets by `method`, `route`, `status`. | | `pulse_http_requests_total` | Counter | Total requests by `method`, `route`, `status`. | | `pulse_http_request_errors_total` | Counter | Error totals by `method`, `route`, `status_class` (`client_error`, `server_error`, `none`). | ## 🔄 Polling & Nodes | Metric | Type | Description | | :--- | :--- | :--- | | `pulse_monitor_poll_duration_seconds` | Histogram | Per-instance poll latency. | | `pulse_monitor_poll_total` | Counter | Success/error counts per instance (`result` label). | | `pulse_monitor_poll_errors_total` | Counter | Poll failures by `error_type`. | | `pulse_monitor_poll_last_success_timestamp` | Gauge | Unix timestamp of last success. | | `pulse_monitor_poll_staleness_seconds` | Gauge | Seconds since last success (`-1` if never succeeded). | | `pulse_monitor_poll_queue_depth` | Gauge | Global queue depth. | | `pulse_monitor_poll_inflight` | Gauge | In-flight polls by `instance_type`. | | `pulse_monitor_node_poll_duration_seconds` | Histogram | Per-node poll latency. | | `pulse_monitor_node_poll_total` | Counter | Success/error counts per node (`result` label). | | `pulse_monitor_node_poll_errors_total` | Counter | Node poll failures by `error_type`. | | `pulse_monitor_node_poll_last_success_timestamp` | Gauge | Unix timestamp of last node success. | | `pulse_monitor_node_poll_staleness_seconds` | Gauge | Seconds since last node success (`-1` if never succeeded). | ## 🧠 Scheduler Health | Metric | Type | Description | | :--- | :--- | :--- | | `pulse_scheduler_queue_due_soon` | Gauge | Tasks due within the next 12 seconds. | | `pulse_scheduler_queue_depth` | Gauge | Queue depth per `instance_type`. | | `pulse_scheduler_queue_wait_seconds` | Histogram | Wait time between task readiness and execution. | | `pulse_scheduler_dead_letter_depth` | Gauge | DLQ depth by `instance_type` and `instance`. | | `pulse_scheduler_breaker_state` | Gauge | `0`=Closed, `1`=Half-Open, `2`=Open, `-1`=Unknown. | | `pulse_scheduler_breaker_failure_count` | Gauge | Consecutive failure count. | | `pulse_scheduler_breaker_retry_seconds` | Gauge | Seconds until next retry allowed. | ## ⚡ Diagnostics Cache | Metric | Type | Description | | :--- | :--- | :--- | | `pulse_diagnostics_cache_hits_total` | Counter | Cache hits. | | `pulse_diagnostics_cache_misses_total` | Counter | Cache misses. | | `pulse_diagnostics_refresh_duration_seconds` | Histogram | Refresh latency. | ## 🚨 Alert Lifecycle | Metric | Type | Description | | :--- | :--- | :--- | | `pulse_alerts_active` | Gauge | Active alerts by `level` and `type`. | | `pulse_alerts_fired_total` | Counter | Total alerts fired by `level` and `type`. | | `pulse_alerts_resolved_total` | Counter | Total alerts resolved by `type`. | | `pulse_alerts_acknowledged_total` | Counter | Total alerts acknowledged. | | `pulse_alerts_suppressed_total` | Counter | Alerts suppressed by `reason` (quiet_hours, rate_limit, duplicate, etc.). | | `pulse_alerts_rate_limited_total` | Counter | Alerts suppressed due to rate limiting. | | `pulse_alert_duration_seconds` | Histogram | Time from alert fire to resolve (by `type`). | ## 🚨 Alerting Examples - **High Error Rate**: `rate(pulse_http_request_errors_total[5m]) > 0.05` - **Stale Node**: `pulse_monitor_node_poll_staleness_seconds > 300` - **Breaker Open**: `pulse_scheduler_breaker_state == 2`