Pulse/docs/monitoring/PROMETHEUS_METRICS.md
rcourtman 2b48b0a459 feat: add --kube-include-all-deployments flag for Kubernetes agent
Adds IncludeAllDeployments option to show all deployments, not just
problem ones (where replicas don't match desired). This provides parity
with the existing --kube-include-all-pods flag.

- Add IncludeAllDeployments to kubernetesagent.Config
- Add --kube-include-all-deployments flag and PULSE_KUBE_INCLUDE_ALL_DEPLOYMENTS env var
- Update collectDeployments to respect the new flag
- Add test for IncludeAllDeployments functionality
- Update UNIFIED_AGENT.md documentation

Addresses feedback from PR #855
2025-12-18 20:58:30 +00:00

42 lines
1.8 KiB
Markdown

# 📊 Prometheus Metrics
Pulse exposes metrics at `/metrics` (default port `9091`).
Example scrape target:
- `http://<pulse-host>:9091/metrics`
This listener is separate from the main UI/API port (`7655`). In Docker and Kubernetes you must expose `9091` explicitly if you want to scrape it from outside the container/pod.
## 🌐 HTTP Ingress
| Metric | Type | Description |
| :--- | :--- | :--- |
| `pulse_http_request_duration_seconds` | Histogram | Latency buckets by `method`, `route`, `status`. |
| `pulse_http_requests_total` | Counter | Total requests. |
| `pulse_http_request_errors_total` | Counter | 4xx/5xx errors. |
## 🔄 Polling & Nodes
| Metric | Type | Description |
| :--- | :--- | :--- |
| `pulse_monitor_node_poll_duration_seconds` | Histogram | Per-node poll latency. |
| `pulse_monitor_node_poll_total` | Counter | Success/error counts per node. |
| `pulse_monitor_node_poll_staleness_seconds` | Gauge | Seconds since last success. |
| `pulse_monitor_poll_queue_depth` | Gauge | Global queue depth. |
## 🧠 Scheduler Health
| Metric | Type | Description |
| :--- | :--- | :--- |
| `pulse_scheduler_queue_depth` | Gauge | Queue depth per instance type. |
| `pulse_scheduler_dead_letter_depth` | Gauge | DLQ depth per instance. |
| `pulse_scheduler_breaker_state` | Gauge | `0`=Closed, `1`=Half-Open, `2`=Open. |
## ⚡ Diagnostics Cache
| Metric | Type | Description |
| :--- | :--- | :--- |
| `pulse_diagnostics_cache_hits_total` | Counter | Cache hits. |
| `pulse_diagnostics_refresh_duration_seconds` | Histogram | Refresh latency. |
## 🚨 Alerting Examples
* **High Error Rate**: `rate(pulse_http_request_errors_total[5m]) > 0.05`
* **Stale Node**: `pulse_monitor_node_poll_staleness_seconds > 300`
* **Breaker Open**: `pulse_scheduler_breaker_state == 2`