# Adaptive Polling Management Endpoints (Future Enhancement)
## Status: DEFERRED
**Decision Date:** 2025-10-20

**Re-evaluated:** v4.24.0 GA release

**Current Status:** Not implemented in v4.24.0
## Overview
Manual circuit breaker and dead-letter queue (DLQ) management endpoints are **not included in v4.24.0**. The read-only scheduler health API (`/api/monitoring/scheduler/health`) provides full visibility, and automatic recovery mechanisms have proven sufficient during testing and early production rollouts.

---
## What's Available in v4.24.0
### Read-Only Scheduler Health API
**Endpoint:** `GET /api/monitoring/scheduler/health`

Provides complete visibility into:
- Queue depth and task distribution
- Circuit breaker states per instance
- Dead-letter queue contents and retry schedules
- Per-instance staleness tracking
- Failure streaks and error categorization

**Documentation:** See [Scheduler Health API](../api/SCHEDULER_HEALTH.md) for the complete reference.
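
For a quick overall summary from a shell, the health payload can be reduced with `jq`; the field names below follow the monitoring queries later on this page:

```bash
# Summarize scheduler health: instance count, open breakers, DLQ entries.
# Field names follow the examples in "Monitoring Current State" below.
curl -s http://<host>:7655/api/monitoring/scheduler/health \
  | jq '{
      instances: (.instances | length),
      openBreakers: [.instances[] | select(.breaker.state != "closed") | .key],
      deadLettered: [.instances[] | select(.deadLetter.present) | .key]
    }'
```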
### Existing Management Options
**v4.24.0 operators can:**

1. **Toggle adaptive polling** (no restart required)
   - Via UI: **Settings → System → Monitoring**
   - Via API: update `system.json` with `adaptivePollingEnabled: false` (see the sketch after this list)

2. **Service restart** (clears transient state)

   ```bash
   # Systemd
   sudo systemctl restart pulse

   # Docker
   docker restart pulse

   # LXC
   pct restart <ctid>
   ```

   - Clears all circuit breakers
   - Resets the DLQ (tasks are re-queued with fresh state)
   - Useful for recovering from stuck states

3. **Version rollback** (if broader issues appear)
   - Via UI: **Settings → System → Updates → Restore previous version**
   - Via CLI: `pulse config rollback`
   - Documented in the [Operations Runbook](ADAPTIVE_POLLING_ROLLOUT.md)

4. **Per-instance configuration fixes**
   - Update node credentials if authentication failures cause DLQ entries
   - Adjust network/firewall settings if connectivity issues trigger breakers
   - Fix underlying infrastructure problems
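
As referenced in option 1 above, toggling the flag programmatically might look like the sketch below. The endpoint path and auth header are assumptions for illustration; only the `adaptivePollingEnabled` key in `system.json` is confirmed here, so verify the exact config API against your deployment before relying on this.

```bash
# Hypothetical sketch: the /api/config/system path and X-API-Token header are
# assumptions; only the adaptivePollingEnabled setting itself is documented above.
curl -s -X PUT http://<host>:7655/api/config/system \
  -H "X-API-Token: <token>" \
  -H "Content-Type: application/json" \
  -d '{"adaptivePollingEnabled": false}'
```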
---
## Why Endpoints Are Deferred
### Test Results Demonstrate Sufficient Automation
**Integration testing** (55 seconds, 12 instances):

- Circuit breakers opened and closed automatically
- Transient failures recovered without intervention
- Permanent failures correctly routed to DLQ

**Soak testing** (2-240 minutes, 80 instances):

- Heap: 2.3 MB → 3.1 MB (healthy growth)
- Goroutines: 16 → 6 (no leak)
- No scenarios requiring manual intervention

**Production rollout** (v4.24.0):

- Automatic recovery working as designed
- Service restart sufficient for edge cases
- No operator requests for manual controls
### Implementation Cost vs. Benefit
**Would require:**
- Authentication and RBAC integration
- Comprehensive audit logging
- UI integration in Settings → System → Monitoring
- Additional testing and maintenance burden

**Current workarounds proven effective:**
- Adaptive polling toggle (immediate, no restart)
- Service restart (clears all state in < 30 seconds)
- Version rollback (if systematic issues)
---
## Future Implementation Plan
### Proposed Endpoints (When Needed)
If production usage reveals operational gaps, implement:
#### 1. Reset Circuit Breaker
```
POST /api/monitoring/breakers/{key}/reset
Authorization: Required (session or API token)
```
**Request:**
```json
{
  "reason": "Manual reset after infrastructure fix"
}
```
**Response:**
```json
{
  "success": true,
  "key": "pve::pve-node1",
  "previousState": "open",
  "newState": "closed",
  "resetBy": "admin",
  "resetAt": "2025-10-20T15:30:00Z"
}
```
**Use case:** Immediately retry a specific instance after fixing the underlying issue (e.g., restored network connectivity).
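
If this endpoint ships as proposed, a reset call might look like the following sketch; the `X-API-Token` header is an assumption, since the proposal only states that a session or API token is required:

```bash
# Sketch of the proposed (not yet implemented) breaker reset call.
# Auth header is an assumption; the proposal only says "session or API token".
curl -s -X POST "http://<host>:7655/api/monitoring/breakers/pve::pve-node1/reset" \
  -H "X-API-Token: <token>" \
  -H "Content-Type: application/json" \
  -d '{"reason": "Manual reset after infrastructure fix"}'
```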
#### 2. Retry All DLQ Tasks
```
POST /api/monitoring/dlq/retry
Authorization: Required (session or API token)
```
**Response:**
```json
{
  "success": true,
  "tasksRetried": 5,
  "keys": ["pve::pve-node1", "pbs::backup-server"]
}
```
**Use case:** Bulk retry after fixing a widespread issue (e.g., certificate renewal).
#### 3. Retry Specific DLQ Task
```
POST /api/monitoring/dlq/{key}/retry
Authorization: Required (session or API token)
```
**Response:**
```json
{
  "success": true,
  "key": "pve::pve-node1",
  "previousRetryCount": 5,
  "scheduledFor": "2025-10-20T15:35:00Z"
}
```
**Use case:** Targeted retry of a single instance.
#### 4. Remove from DLQ
```
DELETE /api/monitoring/dlq/{key}
Authorization: Required (session or API token)
```
**Response:**
```json
{
  "success": true,
  "key": "pve::decommissioned-node",
  "reason": "Instance permanently decommissioned"
}
```
**Use case:** Remove decommissioned instances from DLQ
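
Taken together, the three DLQ operations would compose into a simple remediation flow. The calls below are sketches of the proposed API, not a shipped interface; as above, the auth header is an assumption:

```bash
# Sketches of the proposed (not yet implemented) DLQ operations.

# Retry every dead-lettered task after a fleet-wide fix:
curl -s -X POST "http://<host>:7655/api/monitoring/dlq/retry" \
  -H "X-API-Token: <token>"

# Retry a single instance:
curl -s -X POST "http://<host>:7655/api/monitoring/dlq/pve::pve-node1/retry" \
  -H "X-API-Token: <token>"

# Drop a permanently decommissioned instance:
curl -s -X DELETE "http://<host>:7655/api/monitoring/dlq/pve::decommissioned-node" \
  -H "X-API-Token: <token>"
```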
### Security Requirements
All management endpoints would require:
- **Authentication:** Valid session cookie or API token
- **RBAC:** Admin-level permissions
- **Audit logging:** Every action logged with:
  - Operator username/IP
  - Instance key affected
  - Reason provided
  - Timestamp
  - Previous and new states
- **Rate limiting:** Prevent abuse (e.g., 10 requests/minute)
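
For illustration, an audit record covering those fields might look like the JSON below; the field names are assumptions, since no audit schema is specified here:

```json
{
  "timestamp": "2025-10-20T15:30:00Z",
  "operator": "admin",
  "sourceIp": "192.0.2.10",
  "action": "breaker.reset",
  "key": "pve::pve-node1",
  "reason": "Manual reset after infrastructure fix",
  "previousState": "open",
  "newState": "closed"
}
```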
---
## Re-evaluation Criteria
**Implement management endpoints if:**
1. **Operator demand:** >3 requests in first 60 days of v4.24.0 deployment
2. **Service restart frequency:** >5 restarts per week due to stuck breakers/DLQ
3. **Incident impact:** Manual controls would have prevented or accelerated recovery from >1 production incident
4. **Feedback from operations runbook:** the troubleshooting guidance in [ADAPTIVE_POLLING_ROLLOUT.md](ADAPTIVE_POLLING_ROLLOUT.md) proves inadequate

**Don't implement if:**
- Current workarounds remain effective
- Automatic recovery continues to handle 99%+ of scenarios
- No clear operational pain points emerge
---
## Monitoring Current State
### Check Circuit Breakers
```bash
curl -s http://<host>:7655/api/monitoring/scheduler/health \
| jq '.instances[] | select(.breaker.state != "closed") | {key, state: .breaker.state, since: .breaker.since}'
```
### Check Dead-Letter Queue
```bash
curl -s http://<host>:7655/api/monitoring/scheduler/health \
| jq '.instances[] | select(.deadLetter.present) | {key, reason: .deadLetter.reason, retryCount: .deadLetter.retryCount, nextRetry: .deadLetter.nextRetry}'
```
### Track Recovery Times
```bash
# Monitor breaker state changes
journalctl -u pulse | grep -E "circuit breaker|dead-letter"
```
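
The checks above can be combined into a single cron-friendly probe. A minimal sketch, assuming the same field names and default port used in the queries above:

```bash
#!/usr/bin/env bash
# Minimal sketch: exit non-zero when any breaker is open or any task sits in the DLQ.
# Field names follow the jq queries above; PULSE_HOST defaults to localhost.
set -euo pipefail

health=$(curl -sf "http://${PULSE_HOST:-localhost}:7655/api/monitoring/scheduler/health")

open=$(jq '[.instances[] | select(.breaker.state != "closed")] | length' <<<"$health")
dlq=$(jq '[.instances[] | select(.deadLetter.present)] | length' <<<"$health")

if [ "$open" -gt 0 ] || [ "$dlq" -gt 0 ]; then
  echo "scheduler degraded: $open open breaker(s), $dlq dead-lettered task(s)" >&2
  exit 1
fi
echo "scheduler healthy"
```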
---
## Feedback & Requests
If you encounter scenarios where manual management endpoints would be valuable:
1. **Document the use case**
   - What problem occurred?
   - Why wasn't automatic recovery sufficient?
   - How would manual control have helped?

2. **File an issue**
   - [GitHub Issues](https://github.com/rcourtman/Pulse/issues)
   - Include: scheduler health API output, logs, timeline

3. **Track frequency**
   - If the pattern recurs >3 times, escalate for implementation
---
## Related Documentation
- [Scheduler Health API](../api/SCHEDULER_HEALTH.md) - Complete API reference
- [Operations Runbook](ADAPTIVE_POLLING_ROLLOUT.md) - Steady-state operations and troubleshooting
- [Adaptive Polling Architecture](../monitoring/ADAPTIVE_POLLING.md) - Technical details
- [Configuration Guide](../CONFIGURATION.md) - Adaptive polling settings