Pulse/docs/monitoring/ADAPTIVE_POLLING.md

📉 Adaptive Polling

Pulse uses an adaptive scheduler to optimize polling based on instance health and activity.

🧠 Architecture

  • Scheduler: Calculates intervals based on health/staleness.
  • Priority Queue: Min-heap keyed by NextRun.
  • Circuit Breaker: Prevents hot loops on failing instances.
  • Backoff: Exponential retry delays (5s to 5m).
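
The priority-queue idea above can be sketched as follows. This is a minimal illustration, not Pulse's scheduler code: the `Task` class, its fields, and the instance names are hypothetical. It shows only how a min-heap keyed by next-run time always surfaces the instance that is due soonest.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class Task:
    next_run: float                       # heap key: when the task is next due
    instance: str = field(compare=False)  # hypothetical instance name


def pop_due(queue: list[Task], now: float) -> list[Task]:
    """Pop every task whose next_run is at or before `now`, earliest first."""
    due = []
    while queue and queue[0].next_run <= now:
        due.append(heapq.heappop(queue))
    return due


queue: list[Task] = []
heapq.heappush(queue, Task(30.0, "pve-b"))
heapq.heappush(queue, Task(10.0, "pve-a"))
heapq.heappush(queue, Task(20.0, "pbs-a"))

due = pop_due(queue, now=20.0)
print([t.instance for t in due])  # ['pve-a', 'pbs-a']
```

After polling, a scheduler of this shape would push each task back with a new `next_run`, so the heap never needs a full re-sort.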

⚙️ Configuration

Adaptive polling is disabled by default.

UI

There is currently no dedicated UI for adaptive polling in v5.

Environment Variables

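| Variable | Default | Description |
| --- | --- | --- |
| ADAPTIVE_POLLING_ENABLED | false | Enable or disable adaptive polling. |
| ADAPTIVE_POLLING_BASE_INTERVAL | 10s | Poll interval for healthy instances. |
| ADAPTIVE_POLLING_MIN_INTERVAL | 5s | Shortest interval, used for active/busy instances. |
| ADAPTIVE_POLLING_MAX_INTERVAL | 5m | Longest interval, used for idle or backing-off instances. |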
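
To make the relationship between the three intervals concrete, here is a rough sketch that reads these variables with the defaults listed above and clamps a candidate interval to the configured bounds. The helper names, the accepted duration syntax (`10s`, `5m`), and treating only `true` as enabled are assumptions for illustration; Pulse's own parsing may differ.

```python
import os
import re

# Defaults taken from the table above.
DEFAULTS = {
    "ADAPTIVE_POLLING_ENABLED": "false",
    "ADAPTIVE_POLLING_BASE_INTERVAL": "10s",
    "ADAPTIVE_POLLING_MIN_INTERVAL": "5s",
    "ADAPTIVE_POLLING_MAX_INTERVAL": "5m",
}
UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}


def parse_duration(text: str) -> int:
    """Parse a simple duration such as '10s' or '5m' into seconds."""
    match = re.fullmatch(r"(\d+)([smh])", text.strip())
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    return int(match.group(1)) * UNIT_SECONDS[match.group(2)]


def load_config(env=None) -> dict:
    """Read the adaptive polling settings, falling back to the documented defaults."""
    env = os.environ if env is None else env

    def get(name: str) -> str:
        return env.get(name, DEFAULTS[name])

    return {
        # Assumption: only the literal "true" (any case) enables the feature.
        "enabled": get("ADAPTIVE_POLLING_ENABLED").lower() == "true",
        "base": parse_duration(get("ADAPTIVE_POLLING_BASE_INTERVAL")),
        "min": parse_duration(get("ADAPTIVE_POLLING_MIN_INTERVAL")),
        "max": parse_duration(get("ADAPTIVE_POLLING_MAX_INTERVAL")),
    }


def clamp_interval(seconds: float, cfg: dict) -> float:
    """Keep a computed interval inside [min, max]."""
    return max(cfg["min"], min(cfg["max"], seconds))


cfg = load_config({})            # empty environment: documented defaults apply
print(cfg)                       # {'enabled': False, 'base': 10, 'min': 5, 'max': 300}
print(clamp_interval(1, cfg))    # 5
print(clamp_interval(900, cfg))  # 300
```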

system.json

You can also set adaptivePollingEnabled (and related interval fields) in system.json and restart Pulse.
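
As a rough sketch of that edit, the helper below sets `adaptivePollingEnabled` in a JSON file while leaving other keys alone. The function name, the `someOtherSetting` key, and the temporary file are hypothetical; the real location of system.json depends on your installation, and the interval field names are not listed here, so the sketch does not touch them.

```python
import json
import tempfile
from pathlib import Path


def set_adaptive_polling(path: Path, enabled: bool) -> dict:
    """Set adaptivePollingEnabled in a JSON settings file, preserving other keys."""
    settings = json.loads(path.read_text()) if path.exists() else {}
    settings["adaptivePollingEnabled"] = enabled
    path.write_text(json.dumps(settings, indent=2) + "\n")
    return settings


# Demo against a throwaway file rather than a real system.json.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "system.json"
    path.write_text('{"someOtherSetting": 1}')  # hypothetical existing content
    updated = set_adaptive_polling(path, True)
    print(updated)  # {'someOtherSetting': 1, 'adaptivePollingEnabled': True}
```

Remember that the change only takes effect after Pulse is restarted.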

📊 Metrics

Exposed at :9091/metrics.

| Metric | Type | Description |
| --- | --- | --- |
| pulse_monitor_poll_total | Counter | Total poll attempts. |
| pulse_monitor_poll_duration_seconds | Histogram | Poll latency. |
| pulse_monitor_poll_staleness_seconds | Gauge | Age since last success. |
| pulse_monitor_poll_queue_depth | Gauge | Queue size. |
| pulse_monitor_poll_errors_total | Counter | Error counts by category. |
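
The sketch below shows one way to pick these series out of a scrape and flag stale instances. It assumes the endpoint serves the Prometheus text exposition format, which the metric types suggest but this page does not state. The sample payload, the `instance` label, and the 60-second threshold are made up for illustration.

```python
SAMPLE = """\
# TYPE pulse_monitor_poll_staleness_seconds gauge
pulse_monitor_poll_staleness_seconds{instance="pve-a"} 4
pulse_monitor_poll_staleness_seconds{instance="pve-b"} 95
pulse_monitor_poll_queue_depth 2
go_goroutines 41
"""


def parse_samples(text: str, prefix: str = "pulse_monitor_poll_"):
    """Return (name, labels, value) for each sample line whose name starts with prefix.

    Simplified parser: assumes 'name{labels} value' lines without timestamps.
    """
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        series, value = line.rsplit(" ", 1)
        name, _, labels = series.partition("{")
        if name.startswith(prefix):
            samples.append((name, labels.rstrip("}"), float(value)))
    return samples


samples = parse_samples(SAMPLE)
stale = [
    labels
    for name, labels, value in samples
    if name == "pulse_monitor_poll_staleness_seconds" and value > 60
]
print(stale)  # ['instance="pve-b"']
```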

Circuit Breaker

| State | Trigger | Recovery |
| --- | --- | --- |
| Closed | Normal operation. | |
| Open | ≥3 failures. | Backoff (max 5m). |
| Half-open | Retry window elapsed. | Success = Closed; Fail = Open. |
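
The state table can be read as a small state machine. The sketch below is illustrative only: the class and method names are hypothetical, and counting failures cumulatively and doubling the delay on each further failure are assumptions. The thresholds (3 failures, 5s to 5m backoff) come from this page.

```python
class CircuitBreaker:
    def __init__(self, threshold=3, base_delay=5.0, max_delay=300.0):
        self.threshold = threshold    # failures needed to open the breaker
        self.base_delay = base_delay  # first backoff delay in seconds (5s)
        self.max_delay = max_delay    # backoff cap in seconds (5m)
        self.state = "closed"
        self.failures = 0
        self.retry_at = 0.0

    def allow(self, now: float) -> bool:
        """Return True if a poll may run; open becomes half-open once the window elapses."""
        if self.state == "open" and now >= self.retry_at:
            self.state = "half-open"
        return self.state != "open"

    def record_success(self) -> None:
        self.state = "closed"
        self.failures = 0

    def record_failure(self, now: float) -> None:
        self.failures += 1
        if self.state == "half-open" or self.failures >= self.threshold:
            self.state = "open"
            # Assumption: the delay doubles per failure past the threshold, capped at max_delay.
            exponent = max(0, self.failures - self.threshold)
            self.retry_at = now + min(self.max_delay, self.base_delay * 2 ** exponent)


cb = CircuitBreaker()
for t in (0.0, 1.0, 2.0):
    cb.record_failure(now=t)
print(cb.state, cb.retry_at)       # open 7.0
print(cb.allow(now=3.0))           # False
print(cb.allow(now=7.0), cb.state) # True half-open
cb.record_failure(now=7.0)
print(cb.state, cb.retry_at)       # open 17.0
```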

Dead Letter Queue: After 5 transient failures or 1 permanent failure, a task moves to the dead-letter queue (DLQ), where it is retried on a 30m interval.
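
The routing rule can be expressed in a few lines. This is a sketch under the stated thresholds only; the function names are hypothetical, and how Pulse classifies an error as transient or permanent is not covered here.

```python
MAX_TRANSIENT_FAILURES = 5    # from the rule above
DLQ_RETRY_SECONDS = 30 * 60   # 30m retry


def should_dead_letter(transient_failures: int, permanent: bool) -> bool:
    """A single permanent failure, or five transient ones, sends a task to the DLQ."""
    return permanent or transient_failures >= MAX_TRANSIENT_FAILURES


def next_dlq_retry(now: float) -> float:
    """Time of the next retry for a task sitting in the DLQ."""
    return now + DLQ_RETRY_SECONDS


print(should_dead_letter(4, permanent=False))  # False
print(should_dead_letter(5, permanent=False))  # True
print(should_dead_letter(0, permanent=True))   # True
print(next_dlq_retry(100.0))                   # 1900.0
```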

🩺 Health API

GET /api/monitoring/scheduler/health (Auth required)

Returns:

  • Queue depth & breakdown.
  • Dead-letter tasks.
  • Circuit breaker states.
  • Per-instance staleness.
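
A minimal client sketch for this endpoint follows. The base URL, the example header, and the helper names are placeholders. This page only says that authentication is required, so the caller supplies whatever auth headers their Pulse setup expects. The response is returned as parsed JSON without assuming field names.

```python
import json
import urllib.request

HEALTH_PATH = "/api/monitoring/scheduler/health"


def build_health_request(base_url: str, auth_headers: dict) -> urllib.request.Request:
    """Build the GET request; auth_headers is whatever your deployment requires."""
    url = base_url.rstrip("/") + HEALTH_PATH
    return urllib.request.Request(url, headers=auth_headers, method="GET")


def fetch_health(base_url: str, auth_headers: dict, timeout: float = 10.0):
    """Send the request and return the decoded JSON body."""
    request = build_health_request(base_url, auth_headers)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


# Placeholder host and header; nothing is sent over the network here.
request = build_health_request("http://pulse.example:7655/", {"X-Example-Auth": "<token>"})
print(request.get_method(), request.full_url)
# GET http://pulse.example:7655/api/monitoring/scheduler/health
```

Calling `fetch_health(...)` against a live instance returns the queue, dead-letter, breaker, and staleness data listed above.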