mirror of
https://github.com/rcourtman/Pulse.git
synced 2026-04-30 04:20:20 +00:00
Complete documentation overhaul for Pulse v4.24.0 release covering all new features and operational procedures. Documentation Updates (19 files): P0 Release-Critical: - Operations: Rewrote ADAPTIVE_POLLING_ROLLOUT.md as GA operations runbook - Operations: Updated ADAPTIVE_POLLING_MANAGEMENT_ENDPOINTS.md with DEFERRED status - Operations: Enhanced audit-log-rotation.md with scheduler health checks - Security: Updated proxy hardening docs with rate limit defaults - Docker: Added runtime logging and rollback procedures P1 Deployment & Integration: - KUBERNETES.md: Runtime logging config, adaptive polling, post-upgrade verification - PORT_CONFIGURATION.md: Service naming, change tracking via update history - REVERSE_PROXY.md: Rate limit headers, error pass-through, v4.24.0 verification - PROXY_AUTH.md, OIDC.md, WEBHOOKS.md: Runtime logging integration - TROUBLESHOOTING.md, VM_DISK_MONITORING.md, zfs-monitoring.md: Updated workflows Features Documented: - X-RateLimit-* headers for all API responses - Updates rollback workflow (UI & CLI) - Scheduler health API with rich metadata - Runtime logging configuration (no restart required) - Adaptive polling (GA, enabled by default) - Enhanced audit logging - Circuit breakers and dead-letter queue Supporting Changes: - Discovery service enhancements - Config handlers updates - Sensor proxy installer improvements Total Changes: 1,626 insertions(+), 622 deletions(-) Files Modified: 24 (19 docs, 5 code) All documentation is production-ready for v4.24.0 release.
260 lines
7 KiB
Markdown
260 lines
7 KiB
Markdown
# Adaptive Polling Management Endpoints (Future Enhancement)
|
|
|
|
## Status: DEFERRED
|
|
|
|
**Decision Date:** 2025-10-20
|
|
**Re-evaluated:** v4.24.0 GA release
|
|
**Current Status:** Not implemented in v4.24.0
|
|
|
|
## Overview
|
|
|
|
Manual circuit breaker and dead-letter queue (DLQ) management endpoints are **not included in v4.24.0**. The read-only scheduler health API (`/api/monitoring/scheduler/health`) provides full visibility, and automatic recovery mechanisms have proven sufficient during testing and early production rollouts.
|
|
|
|
---
|
|
|
|
## What's Available in v4.24.0
|
|
|
|
### Read-Only Scheduler Health API
|
|
|
|
**Endpoint:** `GET /api/monitoring/scheduler/health`
|
|
|
|
Provides complete visibility into:
|
|
- Queue depth and task distribution
|
|
- Circuit breaker states per instance
|
|
- Dead-letter queue contents and retry schedules
|
|
- Per-instance staleness tracking
|
|
- Failure streaks and error categorization
|
|
|
|
**Documentation:** See [Scheduler Health API](../api/SCHEDULER_HEALTH.md) for complete reference.
|
|
|
|
### Existing Management Options
|
|
|
|
**v4.24.0 operators can:**
|
|
|
|
1. **Toggle adaptive polling** (no restart required)
|
|
- Via UI: **Settings → System → Monitoring**
|
|
- Via API: Update `system.json` with `adaptivePollingEnabled: false`
|
|
|
|
2. **Service restart** (clears transient state)
|
|
```bash
|
|
# Systemd
|
|
sudo systemctl restart pulse
|
|
|
|
# Docker
|
|
docker restart pulse
|
|
|
|
# LXC
|
|
pct restart <ctid>
|
|
```
|
|
- Clears all circuit breakers
|
|
- Resets DLQ (tasks re-queued with fresh state)
|
|
- Useful for recovering from stuck states
|
|
|
|
3. **Version rollback** (if broader issues)
|
|
- Via UI: **Settings → System → Updates → Restore previous version**
|
|
- Via CLI: `pulse config rollback`
|
|
- Documented in [Operations Runbook](ADAPTIVE_POLLING_ROLLOUT.md)
|
|
|
|
4. **Per-instance configuration fixes**
|
|
- Update node credentials if authentication failures cause DLQ entries
|
|
- Adjust network/firewall if connectivity issues trigger breakers
|
|
- Fix underlying infrastructure problems
|
|
|
|
---
|
|
|
|
## Why Endpoints Are Deferred
|
|
|
|
### Test Results Demonstrate Sufficient Automation
|
|
|
|
**Integration testing** (55 seconds, 12 instances):
|
|
- Circuit breakers opened and closed automatically
|
|
- Transient failures recovered without intervention
|
|
- Permanent failures correctly routed to DLQ
|
|
|
|
**Soak testing** (2-240 minutes, 80 instances):
|
|
- Heap: 2.3MB → 3.1MB (healthy growth)
|
|
- Goroutines: 16 → 6 (no leak)
|
|
- No scenarios requiring manual intervention
|
|
|
|
**Production rollout** (v4.24.0):
|
|
- Automatic recovery working as designed
|
|
- Service restart sufficient for edge cases
|
|
- No operator requests for manual controls
|
|
|
|
### Implementation Cost vs. Benefit
|
|
|
|
**Would require:**
|
|
- Authentication and RBAC integration
|
|
- Comprehensive audit logging
|
|
- UI integration in Settings → System → Monitoring
|
|
- Additional testing and maintenance burden
|
|
|
|
**Current workarounds proven effective:**
|
|
- Adaptive polling toggle (immediate, no restart)
|
|
- Service restart (clears all state in < 30 seconds)
|
|
- Version rollback (if systematic issues)
|
|
|
|
---
|
|
|
|
## Future Implementation Plan
|
|
|
|
### Proposed Endpoints (When Needed)
|
|
|
|
If production usage reveals operational gaps, implement:
|
|
|
|
#### 1. Reset Circuit Breaker
|
|
```
|
|
POST /api/monitoring/breakers/{key}/reset
|
|
Authorization: Required (session or API token)
|
|
```
|
|
|
|
**Request:**
|
|
```json
|
|
{
|
|
"reason": "Manual reset after infrastructure fix"
|
|
}
|
|
```
|
|
|
|
**Response:**
|
|
```json
|
|
{
|
|
"success": true,
|
|
"key": "pve::pve-node1",
|
|
"previousState": "open",
|
|
"newState": "closed",
|
|
"resetBy": "admin",
|
|
"resetAt": "2025-10-20T15:30:00Z"
|
|
}
|
|
```
|
|
|
|
**Use case:** Immediately retry a specific instance after fixing underlying issue (e.g., restored network connectivity)
|
|
|
|
#### 2. Retry All DLQ Tasks
|
|
```
|
|
POST /api/monitoring/dlq/retry
|
|
Authorization: Required (session or API token)
|
|
```
|
|
|
|
**Response:**
|
|
```json
|
|
{
|
|
"success": true,
|
|
"tasksRetried": 5,
|
|
"keys": ["pve::pve-node1", "pbs::backup-server"]
|
|
}
|
|
```
|
|
|
|
**Use case:** Bulk retry after fixing widespread issue (e.g., certificate renewal)
|
|
|
|
#### 3. Retry Specific DLQ Task
|
|
```
|
|
POST /api/monitoring/dlq/{key}/retry
|
|
Authorization: Required (session or API token)
|
|
```
|
|
|
|
**Response:**
|
|
```json
|
|
{
|
|
"success": true,
|
|
"key": "pve::pve-node1",
|
|
"previousRetryCount": 5,
|
|
"scheduledFor": "2025-10-20T15:35:00Z"
|
|
}
|
|
```
|
|
|
|
**Use case:** Targeted retry of single instance
|
|
|
|
#### 4. Remove from DLQ
|
|
```
|
|
DELETE /api/monitoring/dlq/{key}
|
|
Authorization: Required (session or API token)
|
|
```
|
|
|
|
**Response:**
|
|
```json
|
|
{
|
|
"success": true,
|
|
"key": "pve::decomissioned-node",
|
|
"reason": "Instance permanently decommissioned"
|
|
}
|
|
```
|
|
|
|
**Use case:** Remove decommissioned instances from DLQ
|
|
|
|
### Security Requirements
|
|
|
|
All management endpoints would require:
|
|
- **Authentication:** Valid session cookie or API token
|
|
- **RBAC:** Admin-level permissions
|
|
- **Audit logging:** Every action logged with:
|
|
- Operator username/IP
|
|
- Instance key affected
|
|
- Reason provided
|
|
- Timestamp
|
|
- Previous and new states
|
|
- **Rate limiting:** Prevent abuse (e.g., 10 requests/minute)
|
|
|
|
---
|
|
|
|
## Re-evaluation Criteria
|
|
|
|
**Implement management endpoints if:**
|
|
|
|
1. **Operator demand:** >3 requests in first 60 days of v4.24.0 deployment
|
|
2. **Service restart frequency:** >5 restarts per week due to stuck breakers/DLQ
|
|
3. **Incident impact:** Manual controls would have prevented or accelerated recovery from >1 production incident
|
|
4. **Feedback from operations runbook:** [ADAPTIVE_POLLING_ROLLOUT.md](ADAPTIVE_POLLING_ROLLOUT.md) troubleshooting inadequate
|
|
|
|
**Don't implement if:**
|
|
- Current workarounds remain effective
|
|
- Automatic recovery continues to handle 99%+ of scenarios
|
|
- No clear operational pain points emerge
|
|
|
|
---
|
|
|
|
## Monitoring Current State
|
|
|
|
### Check Circuit Breakers
|
|
```bash
|
|
curl -s http://<host>:7655/api/monitoring/scheduler/health \
|
|
| jq '.instances[] | select(.breaker.state != "closed") | {key, state: .breaker.state, since: .breaker.since}'
|
|
```
|
|
|
|
### Check Dead-Letter Queue
|
|
```bash
|
|
curl -s http://<host>:7655/api/monitoring/scheduler/health \
|
|
| jq '.instances[] | select(.deadLetter.present) | {key, reason: .deadLetter.reason, retryCount: .deadLetter.retryCount, nextRetry: .deadLetter.nextRetry}'
|
|
```
|
|
|
|
### Track Recovery Times
|
|
```bash
|
|
# Monitor breaker state changes
|
|
journalctl -u pulse | grep -E "circuit breaker|dead-letter"
|
|
```
|
|
|
|
---
|
|
|
|
## Feedback & Requests
|
|
|
|
If you encounter scenarios where manual management endpoints would be valuable:
|
|
|
|
1. **Document the use case**
|
|
- What problem occurred?
|
|
- Why wasn't automatic recovery sufficient?
|
|
- How would manual control have helped?
|
|
|
|
2. **File an issue**
|
|
- [GitHub Issues](https://github.com/rcourtman/Pulse/issues)
|
|
- Include: scheduler health API output, logs, timeline
|
|
|
|
3. **Track frequency**
|
|
- If the pattern recurs >3 times, escalate for implementation
|
|
|
|
---
|
|
|
|
## Related Documentation
|
|
|
|
- [Scheduler Health API](../api/SCHEDULER_HEALTH.md) - Complete API reference
|
|
- [Operations Runbook](ADAPTIVE_POLLING_ROLLOUT.md) - Steady-state operations and troubleshooting
|
|
- [Adaptive Polling Architecture](../monitoring/ADAPTIVE_POLLING.md) - Technical details
|
|
- [Configuration Guide](../CONFIGURATION.md) - Adaptive polling settings
|