Pulse/docs/monitoring/ADAPTIVE_POLLING.md
rcourtman 2b48b0a459 feat: add --kube-include-all-deployments flag for Kubernetes agent
Adds IncludeAllDeployments option to show all deployments, not just
problem ones (where replicas don't match desired). This provides parity
with the existing --kube-include-all-pods flag.

- Add IncludeAllDeployments to kubernetesagent.Config
- Add --kube-include-all-deployments flag and PULSE_KUBE_INCLUDE_ALL_DEPLOYMENTS env var
- Update collectDeployments to respect the new flag
- Add test for IncludeAllDeployments functionality
- Update UNIFIED_AGENT.md documentation

Addresses feedback from PR #855
2025-12-18 20:58:30 +00:00

55 lines
1.9 KiB
Markdown

# 📉 Adaptive Polling
Pulse uses an adaptive scheduler to optimize polling based on instance health and activity.
## 🧠 Architecture
* **Scheduler**: Calculates intervals based on health/staleness.
* **Priority Queue**: Min-heap keyed by `NextRun`.
* **Circuit Breaker**: Prevents hot loops on failing instances.
* **Backoff**: Exponential retry delays (5s to 5m).
## ⚙️ Configuration
Adaptive polling is **disabled by default**.
### UI
There is currently no dedicated UI for adaptive polling in v5.
### Environment Variables
| Variable | Default | Description |
| :--- | :--- | :--- |
| `ADAPTIVE_POLLING_ENABLED` | `false` | Enable/disable. |
| `ADAPTIVE_POLLING_BASE_INTERVAL` | `10s` | Healthy poll rate. |
| `ADAPTIVE_POLLING_MIN_INTERVAL` | `5s` | Active/busy rate. |
| `ADAPTIVE_POLLING_MAX_INTERVAL` | `5m` | Idle/backoff rate. |
### system.json
You can also set `adaptivePollingEnabled` (and related interval fields) in `system.json` and restart Pulse.
## 📊 Metrics
Exposed at `:9091/metrics`.
| Metric | Type | Description |
| :--- | :--- | :--- |
| `pulse_monitor_poll_total` | Counter | Total poll attempts. |
| `pulse_monitor_poll_duration_seconds` | Histogram | Poll latency. |
| `pulse_monitor_poll_staleness_seconds` | Gauge | Age since last success. |
| `pulse_monitor_poll_queue_depth` | Gauge | Queue size. |
| `pulse_monitor_poll_errors_total` | Counter | Error counts by category. |
## ⚡ Circuit Breaker
| State | Trigger | Recovery |
| :--- | :--- | :--- |
| **Closed** | Normal operation. | — |
| **Open** | ≥3 failures. | Backoff (max 5m). |
| **Half-open** | Retry window elapsed. | Success = Closed; Fail = Open. |
**Dead Letter Queue**: After 5 transient or 1 permanent failure, tasks move to DLQ (30m retry).
## 🩺 Health API
`GET /api/monitoring/scheduler/health` (Auth required)
Returns:
* Queue depth & breakdown.
* Dead-letter tasks.
* Circuit breaker states.
* Per-instance staleness.