# Adaptive Polling Management Endpoints (Future Enhancement)
**Status:** DEFERRED
**Decision Date:** 2025-10-20
**Re-evaluated:** v4.24.0 GA release
**Current Status:** Not implemented in v4.24.0
## Overview
Manual circuit breaker and dead-letter queue (DLQ) management endpoints are not included in v4.24.0. The read-only scheduler health API (/api/monitoring/scheduler/health) provides full visibility, and automatic recovery mechanisms have proven sufficient during testing and early production rollouts.
## What's Available in v4.24.0
### Read-Only Scheduler Health API
**Endpoint:** `GET /api/monitoring/scheduler/health`
Provides complete visibility into:
- Queue depth and task distribution
- Circuit breaker states per instance
- Dead-letter queue contents and retry schedules
- Per-instance staleness tracking
- Failure streaks and error categorization
**Documentation:** See Scheduler Health API for the complete reference.
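As a quick orientation, the payload can be summarized per instance with `jq`. This is a hedged sketch: the field names follow the filters used in the Monitoring Current State section below; anything beyond those is an assumption rather than the documented schema.

```bash
# Summarize per-instance health: breaker state and DLQ presence.
# Field names mirror the jq filters used later in this page; any
# others are assumptions, not the documented schema.
curl -s "http://<host>:7655/api/monitoring/scheduler/health" \
  | jq '.instances[] | {key, breaker: .breaker.state, inDLQ: .deadLetter.present}'
```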
### Existing Management Options
v4.24.0 operators can:
- **Toggle adaptive polling** (no restart required)
  - Via UI: Settings → System → Monitoring
  - Via API: Update `system.json` with `adaptivePollingEnabled: false` (see the sketch after this list)

- **Service restart** (clears transient state)

  ```bash
  # Systemd
  sudo systemctl restart pulse

  # Docker
  docker restart pulse

  # LXC
  pct restart <ctid>
  ```

  - Clears all circuit breakers
  - Resets DLQ (tasks re-queued with fresh state)
  - Useful for recovering from stuck states

- **Version rollback** (if broader issues)
  - Via UI: Settings → System → Updates → Restore previous version
  - Via CLI: `pulse config rollback`
  - Documented in the Operations Runbook

- **Per-instance configuration fixes**
  - Update node credentials if authentication failures cause DLQ entries
  - Adjust network/firewall if connectivity issues trigger breakers
  - Fix underlying infrastructure problems
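The exact API route for the toggle isn't given here, so the following is a minimal sketch assuming a systemd install with `system.json` under `/etc/pulse`; the path and live-pickup behavior are assumptions, and the UI toggle above is the safer route if you're unsure whether direct file edits are applied without a restart.

```bash
# Flip adaptive polling off by editing system.json in place.
# /etc/pulse is an assumed config path for a systemd install; Docker
# and LXC deployments will differ. Prefer the UI toggle if you are
# unsure whether file edits are picked up live.
CONF=/etc/pulse/system.json
sudo jq '.adaptivePollingEnabled = false' "$CONF" > /tmp/system.json \
  && sudo mv /tmp/system.json "$CONF"
```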
## Why Endpoints Are Deferred
### Test Results Demonstrate Sufficient Automation
**Integration testing** (55 seconds, 12 instances):
- Circuit breakers opened and closed automatically
- Transient failures recovered without intervention
- Permanent failures correctly routed to DLQ
**Soak testing** (2-240 minutes, 80 instances):
- Heap: 2.3MB → 3.1MB (healthy growth)
- Goroutines: 16 → 6 (no leak)
- No scenarios requiring manual intervention
**Production rollout** (v4.24.0):
- Automatic recovery working as designed
- Service restart sufficient for edge cases
- No operator requests for manual controls
### Implementation Cost vs. Benefit
Implementing these endpoints would require:
- Authentication and RBAC integration
- Comprehensive audit logging
- UI integration in Settings → System → Monitoring
- Additional testing and maintenance burden
Current workarounds have proven effective:
- Adaptive polling toggle (immediate, no restart)
- Service restart (clears all state in < 30 seconds)
- Version rollback (if systematic issues)
## Future Implementation Plan
### Proposed Endpoints (When Needed)
If production usage reveals operational gaps, implement:
#### 1. Reset Circuit Breaker
`POST /api/monitoring/breakers/{key}/reset`

**Authorization:** Required (session or API token)
**Request:**

```json
{
  "reason": "Manual reset after infrastructure fix"
}
```
**Response:**

```json
{
  "success": true,
  "key": "pve::pve-node1",
  "previousState": "open",
  "newState": "closed",
  "resetBy": "admin",
  "resetAt": "2025-10-20T15:30:00Z"
}
```
**Use case:** Immediately retry a specific instance after fixing the underlying issue (e.g., restored network connectivity)
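Since this endpoint does not exist in v4.24.0, the following is only a sketch of how it might be invoked: the path comes from the proposal above, while the `X-API-Token` header is an assumption about how token auth would be supplied.

```bash
# Hypothetical invocation of the proposed reset endpoint (not
# implemented in v4.24.0). The token header is an assumption.
curl -s -X POST "http://<host>:7655/api/monitoring/breakers/pve::pve-node1/reset" \
  -H "X-API-Token: <token>" \
  -H "Content-Type: application/json" \
  -d '{"reason": "Manual reset after infrastructure fix"}'
```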
#### 2. Retry All DLQ Tasks
`POST /api/monitoring/dlq/retry`

**Authorization:** Required (session or API token)
**Response:**

```json
{
  "success": true,
  "tasksRetried": 5,
  "keys": ["pve::pve-node1", "pbs::backup-server"]
}
```
**Use case:** Bulk retry after fixing a widespread issue (e.g., certificate renewal)
#### 3. Retry Specific DLQ Task
`POST /api/monitoring/dlq/{key}/retry`

**Authorization:** Required (session or API token)
**Response:**

```json
{
  "success": true,
  "key": "pve::pve-node1",
  "previousRetryCount": 5,
  "scheduledFor": "2025-10-20T15:35:00Z"
}
```
**Use case:** Targeted retry of a single instance
#### 4. Remove from DLQ
`DELETE /api/monitoring/dlq/{key}`

**Authorization:** Required (session or API token)
**Response:**

```json
{
  "success": true,
  "key": "pve::decommissioned-node",
  "reason": "Instance permanently decommissioned"
}
```
**Use case:** Remove decommissioned instances from the DLQ
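A removal call would presumably mirror the reset sketch above. Whether the reason travels in a request body or a query parameter is an open design question, so the body shown here is an assumption.

```bash
# Hypothetical DLQ removal (not implemented in v4.24.0). Carrying the
# reason in a JSON body is an assumption about the eventual design.
curl -s -X DELETE "http://<host>:7655/api/monitoring/dlq/pve::decommissioned-node" \
  -H "X-API-Token: <token>" \
  -H "Content-Type: application/json" \
  -d '{"reason": "Instance permanently decommissioned"}'
```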
## Security Requirements
All management endpoints would require:
- **Authentication:** Valid session cookie or API token
- **RBAC:** Admin-level permissions
- **Audit logging:** Every action logged with:
  - Operator username/IP
  - Instance key affected
  - Reason provided
  - Timestamp
  - Previous and new states
- **Rate limiting:** Prevent abuse (e.g., 10 requests/minute)
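For a sense of what the audit requirement implies operationally, entries like these could be filtered with standard tooling. This is a sketch only: the log path and field names are assumptions, not a shipped format.

```bash
# Hypothetical: pull recent audit entries for breaker/DLQ actions.
# The log path and the "action" field are assumptions, not a shipped format.
jq -c 'select(.action | test("breaker|dlq"))' /var/log/pulse/audit.log | tail -n 20
```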
## Re-evaluation Criteria
Implement management endpoints if:
- **Operator demand:** >3 requests in the first 60 days of v4.24.0 deployment
- **Service restart frequency:** >5 restarts per week due to stuck breakers/DLQ
- **Incident impact:** Manual controls would have prevented or accelerated recovery from >1 production incident
- **Runbook feedback:** The troubleshooting guidance in ADAPTIVE_POLLING_ROLLOUT.md proves inadequate
Don't implement if:
- Current workarounds remain effective
- Automatic recovery continues to handle 99%+ of scenarios
- No clear operational pain points emerge
## Monitoring Current State
### Check Circuit Breakers

```bash
curl -s http://<host>:7655/api/monitoring/scheduler/health \
  | jq '.instances[] | select(.breaker.state != "closed") | {key, state: .breaker.state, since: .breaker.since}'
```
### Check Dead-Letter Queue

```bash
curl -s http://<host>:7655/api/monitoring/scheduler/health \
  | jq '.instances[] | select(.deadLetter.present) | {key, reason: .deadLetter.reason, retryCount: .deadLetter.retryCount, nextRetry: .deadLetter.nextRetry}'
```
### Track Recovery Times

```bash
# Monitor breaker state changes
journalctl -u pulse | grep -E "circuit breaker|dead-letter"
```
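For unattended alerting rather than ad-hoc checks, the same health API can back a simple exit-code check. A minimal sketch, assuming the response fields used above; this is not shipped tooling:

```bash
#!/usr/bin/env bash
# Exit non-zero when any breaker is open or any DLQ entry is present,
# so the script can be wired into cron or a monitoring agent. Assumes
# the same response fields as the jq filters above; not shipped tooling.
HEALTH=$(curl -sf "http://<host>:7655/api/monitoring/scheduler/health") || exit 2
OPEN=$(echo "$HEALTH" | jq '[.instances[] | select(.breaker.state != "closed")] | length')
DLQ=$(echo "$HEALTH" | jq '[.instances[] | select(.deadLetter.present)] | length')
if [ "$OPEN" -gt 0 ] || [ "$DLQ" -gt 0 ]; then
  echo "pulse scheduler: $OPEN open breaker(s), $DLQ DLQ entry/entries"
  exit 1
fi
```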
## Feedback & Requests
If you encounter scenarios where manual management endpoints would be valuable:
1. **Document the use case**
   - What problem occurred?
   - Why wasn't automatic recovery sufficient?
   - How would manual control have helped?

2. **File an issue**
   - GitHub Issues
   - Include: scheduler health API output, logs, timeline

3. **Track frequency**
   - If the pattern recurs >3 times, escalate for implementation
## Related Documentation
- Scheduler Health API - Complete API reference
- Operations Runbook - Steady-state operations and troubleshooting
- Adaptive Polling Architecture - Technical details
- Configuration Guide - Adaptive polling settings