Pulse/docs/monitoring/ADAPTIVE_POLLING.md
rcourtman 0ca6001bad docs: update documentation after sensor proxy deprecation
Update docs to reflect the simplified temperature monitoring architecture:
- Remove references to pulse-sensor-proxy throughout
- Update TEMPERATURE_MONITORING.md to focus on unified agent approach
- Update CONFIGURATION.md, DEPLOYMENT_MODELS.md, FAQ.md
- Remove SECURITY_CHANGELOG.md (proxy-specific security notes)
- Clarify current recommended setup in various guides
2026-01-21 12:00:59 +00:00

86 lines
3.6 KiB
Markdown

# 📉 Adaptive Polling
Pulse uses an adaptive scheduler to optimize polling based on instance health and activity.
## 🧠 Architecture
- **Scheduler**: Calculates intervals based on health/staleness.
- **Priority Queue**: Min-heap keyed by `NextRun`.
- **Circuit Breaker**: Prevents hot loops on failing instances using success/failure counters.
- **Backoff**: Exponential retry delays (5s min to 5m max).
- **Worker Pool**: One worker per configured instance (PVE/PBS/PMG), capped at 10.
- **Global Concurrency Cap**: At most 2 polling cycles run at once to avoid resource spikes.
## 🔬 Implementation Details (Developer Info)
### Staleness Scoring
The `AdaptiveScheduler` (`internal/monitoring/scheduler.go`) relies on the `StalenessTracker` to compute a `StalenessScore` (0.0 to 1.0) based on **how long it has been since the last successful poll**.
- `0.0` = fresh (recent success)
- `1.0` = very stale or never succeeded
The staleness score is normalized against `AdaptivePollingMaxInterval` (default 5 minutes).
The scheduler applies **Exponential Smoothing** (alpha 0.6) and a small jitter (5%) to avoid oscillation.
Additional influences:
- **Error penalty**: retries tighten the interval based on the error count.
- **Queue stretch**: large queues gently stretch intervals to avoid overload.
### Circuit Breaker Recovery
The `circuitBreaker` (`internal/monitoring/circuit_breaker.go`) follows a standard state machine:
1. **Closing the Circuit**: One successful poll moves *Half-Open**Closed* and resets failure count.
2. **Backoff Calculation**: Retries use exponential backoff starting at 5s (multiplier 2, jitter 0.2) capped at 5m.
3. **Transient vs. Permanent**:
- **Transient** errors (retryable) are retried up to 5 times before moving to the Dead Letter Queue.
- **Permanent** errors move directly to the Dead Letter Queue.
**Note:** When `AdaptivePollingMaxInterval` is set to 15 seconds or less, the retry backoff is shortened (750ms initial, 4s max) to keep fast feedback loops during tight polling windows.
## ⚙️ Configuration
Adaptive polling is **disabled by default**.
### UI
There is currently no dedicated UI for adaptive polling in v5.
### Environment Variables
| Variable | Default | Description |
| :--- | :--- | :--- |
| `ADAPTIVE_POLLING_ENABLED` | `false` | Enable/disable. |
| `ADAPTIVE_POLLING_BASE_INTERVAL` | `10s` | Healthy poll rate. |
| `ADAPTIVE_POLLING_MIN_INTERVAL` | `5s` | Active/busy rate. |
| `ADAPTIVE_POLLING_MAX_INTERVAL` | `5m` | Idle/backoff rate. |
### system.json
You can also set `adaptivePollingEnabled` (and related interval fields) in `system.json` and restart Pulse.
## 📊 Metrics
Exposed at `:9091/metrics`.
| Metric | Type | Description |
| :--- | :--- | :--- |
| `pulse_monitor_poll_total` | Counter | Total poll attempts. |
| `pulse_monitor_poll_duration_seconds` | Histogram | Poll latency. |
| `pulse_monitor_poll_staleness_seconds` | Gauge | Age since last success. |
| `pulse_monitor_poll_queue_depth` | Gauge | Queue size. |
| `pulse_monitor_poll_errors_total` | Counter | Error counts by category. |
| `pulse_scheduler_queue_due_soon` | Gauge | Tasks due in the next 12 seconds. |
## ⚡ Circuit Breaker
| State | Trigger | Recovery |
| :--- | :--- | :--- |
| **Closed** | Normal operation. | — |
| **Open** | ≥3 failures. | Backoff (max 5m). |
| **Half-open** | Retry window elapsed. | Success = Closed; Fail = Open. |
**Dead Letter Queue**: After 5 transient or 1 permanent failure, tasks move to DLQ (30m retry).
## 🩺 Health API
`GET /api/monitoring/scheduler/health` (Auth required)
Returns:
- Queue depth & breakdown.
- Dead-letter tasks.
- Circuit breaker states.
- Per-instance staleness.