# Adaptive Polling Management Endpoints (Future Enhancement)

**Status:** DEFERRED

**Decision Date:** 2025-10-20
**Re-evaluated:** v4.24.0 GA release
**Current Status:** Not implemented in v4.24.0

## Overview

Manual circuit breaker and dead-letter queue (DLQ) management endpoints are not included in v4.24.0. The read-only scheduler health API (`/api/monitoring/scheduler/health`) provides full visibility, and automatic recovery mechanisms have proven sufficient during testing and early production rollouts.


## What's Available in v4.24.0

### Read-Only Scheduler Health API

**Endpoint:** `GET /api/monitoring/scheduler/health`

Provides complete visibility into:

- Queue depth and task distribution
- Circuit breaker states per instance
- Dead-letter queue contents and retry schedules
- Per-instance staleness tracking
- Failure streaks and error categorization

**Documentation:** See the Scheduler Health API documentation for the complete reference.
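For a quick spot check from the command line, a minimal sketch (it uses only the `.instances[]`, `.key`, and `.breaker.state` fields that appear in the examples later in this document; adapt to the actual response shape):

```bash
# Summarize scheduler health: instance count plus any non-closed breakers
curl -s http://<host>:7655/api/monitoring/scheduler/health \
  | jq '{instances: (.instances | length), openBreakers: [.instances[] | select(.breaker.state != "closed") | .key]}'
```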

### Existing Management Options

v4.24.0 operators can:

1. **Toggle adaptive polling** (no restart required; see the sketch after this list)
   - Via UI: Settings → System → Monitoring
   - Via API: Update `system.json` with `adaptivePollingEnabled: false`
2. **Restart the service** (clears transient state)

   ```bash
   # Systemd
   sudo systemctl restart pulse

   # Docker
   docker restart pulse

   # LXC
   pct restart <ctid>
   ```

   - Clears all circuit breakers
   - Resets the DLQ (tasks are re-queued with fresh state)
   - Useful for recovering from stuck states
3. **Roll back the version** (if broader issues persist)
   - Via UI: Settings → System → Updates → Restore previous version
   - Via CLI: `pulse config rollback`
   - Documented in the Operations Runbook
4. **Fix per-instance configuration**
   - Update node credentials if authentication failures cause DLQ entries
   - Adjust network/firewall rules if connectivity issues trip breakers
   - Fix underlying infrastructure problems
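For option 1, one way to flip the flag in `system.json` directly (a hedged sketch: the `/etc/pulse` path is an assumption about a default install, and the Settings UI remains the supported route):

```bash
# Sketch only: set adaptivePollingEnabled to false in system.json using jq.
# The /etc/pulse path is assumed; adjust to your deployment's config directory.
sudo jq '.adaptivePollingEnabled = false' /etc/pulse/system.json > /tmp/system.json \
  && sudo mv /tmp/system.json /etc/pulse/system.json
```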

## Why Endpoints Are Deferred

### Test Results Demonstrate Sufficient Automation

**Integration testing** (55 seconds, 12 instances):

- Circuit breakers opened and closed automatically
- Transient failures recovered without intervention
- Permanent failures were correctly routed to the DLQ

**Soak testing** (2-240 minutes, 80 instances):

- Heap: 2.3 MB → 3.1 MB (healthy growth)
- Goroutines: 16 → 6 (no leak)
- No scenarios requiring manual intervention

**Production rollout** (v4.24.0):

- Automatic recovery working as designed
- Service restart sufficient for edge cases
- No operator requests for manual controls

### Implementation Cost vs. Benefit

**Would require:**

- Authentication and RBAC integration
- Comprehensive audit logging
- UI integration in Settings → System → Monitoring
- Additional testing and maintenance burden

**Current workarounds have proven effective:**

- Adaptive polling toggle (immediate, no restart)
- Service restart (clears all state in < 30 seconds)
- Version rollback (if issues are systematic)

## Future Implementation Plan

### Proposed Endpoints (When Needed)

If production usage reveals operational gaps, implement:

#### 1. Reset Circuit Breaker

```
POST /api/monitoring/breakers/{key}/reset
Authorization: Required (session or API token)
```

Request:

```json
{
  "reason": "Manual reset after infrastructure fix"
}
```

Response:

```json
{
  "success": true,
  "key": "pve::pve-node1",
  "previousState": "open",
  "newState": "closed",
  "resetBy": "admin",
  "resetAt": "2025-10-20T15:30:00Z"
}
```

**Use case:** Immediately retry a specific instance after fixing the underlying issue (e.g., restored network connectivity).
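If implemented, an invocation might look like the following. This is hypothetical (the endpoint does not exist in v4.24.0), and the `X-API-Token` header is an assumption about how the token would be passed; confirm against your deployment's API authentication docs:

```bash
# Hypothetical invocation (endpoint not implemented in v4.24.0)
curl -s -X POST \
  -H "X-API-Token: <token>" \
  -H "Content-Type: application/json" \
  -d '{"reason": "Manual reset after infrastructure fix"}' \
  "http://<host>:7655/api/monitoring/breakers/pve::pve-node1/reset"
```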

#### 2. Retry All DLQ Tasks

```
POST /api/monitoring/dlq/retry
Authorization: Required (session or API token)
```

Response:

```json
{
  "success": true,
  "tasksRetried": 5,
  "keys": ["pve::pve-node1", "pbs::backup-server"]
}
```

**Use case:** Bulk retry after fixing a widespread issue (e.g., certificate renewal).
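A hypothetical bulk invocation, with the same assumed token header as above:

```bash
# Hypothetical bulk retry of all DLQ tasks (not implemented in v4.24.0)
curl -s -X POST -H "X-API-Token: <token>" \
  "http://<host>:7655/api/monitoring/dlq/retry"
```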

#### 3. Retry Specific DLQ Task

```
POST /api/monitoring/dlq/{key}/retry
Authorization: Required (session or API token)
```

Response:

```json
{
  "success": true,
  "key": "pve::pve-node1",
  "previousRetryCount": 5,
  "scheduledFor": "2025-10-20T15:35:00Z"
}
```

**Use case:** Targeted retry of a single instance.

#### 4. Remove from DLQ

```
DELETE /api/monitoring/dlq/{key}
Authorization: Required (session or API token)
```

Response:

```json
{
  "success": true,
  "key": "pve::decommissioned-node",
  "reason": "Instance permanently decommissioned"
}
```

**Use case:** Remove permanently decommissioned instances from the DLQ.
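And the corresponding hypothetical removal call, again assuming the `X-API-Token` header:

```bash
# Hypothetical removal of a decommissioned instance (not implemented in v4.24.0)
curl -s -X DELETE -H "X-API-Token: <token>" \
  "http://<host>:7655/api/monitoring/dlq/pve::decommissioned-node"
```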

### Security Requirements

All management endpoints would require:

- **Authentication:** Valid session cookie or API token
- **RBAC:** Admin-level permissions
- **Audit logging:** Every action logged with:
  - Operator username/IP
  - Instance key affected
  - Reason provided
  - Timestamp
  - Previous and new states
- **Rate limiting:** Prevent abuse (e.g., 10 requests/minute)
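A hypothetical audit record covering the fields above might look like the following. The shape is purely illustrative, not Pulse's actual audit schema:

```json
{
  "action": "breaker.reset",
  "operator": "admin",
  "sourceIp": "10.0.0.5",
  "key": "pve::pve-node1",
  "reason": "Manual reset after infrastructure fix",
  "timestamp": "2025-10-20T15:30:00Z",
  "previousState": "open",
  "newState": "closed"
}
```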

## Re-evaluation Criteria

Implement management endpoints if:

1. **Operator demand:** >3 requests in the first 60 days of v4.24.0 deployment
2. **Service restart frequency:** >5 restarts per week due to stuck breakers/DLQ entries
3. **Incident impact:** Manual controls would have prevented, or materially accelerated recovery from, more than one production incident
4. **Runbook feedback:** The troubleshooting guidance in ADAPTIVE_POLLING_ROLLOUT.md proves inadequate

Don't implement if:

- Current workarounds remain effective
- Automatic recovery continues to handle 99%+ of scenarios
- No clear operational pain points emerge

## Monitoring Current State

### Check Circuit Breakers

```bash
curl -s http://<host>:7655/api/monitoring/scheduler/health \
  | jq '.instances[] | select(.breaker.state != "closed") | {key, state: .breaker.state, since: .breaker.since}'
```

### Check Dead-Letter Queue

```bash
curl -s http://<host>:7655/api/monitoring/scheduler/health \
  | jq '.instances[] | select(.deadLetter.present) | {key, reason: .deadLetter.reason, retryCount: .deadLetter.retryCount, nextRetry: .deadLetter.nextRetry}'
```

### Track Recovery Times

```bash
# Monitor breaker state changes
journalctl -u pulse | grep -E "circuit breaker|dead-letter"
```
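To build a recovery timeline from the API instead of the logs, a minimal sketch (polls once a minute and timestamps each breaker state, using only fields shown above):

```bash
# Poll breaker states every 60s; redirect to a file to track recovery times
while true; do
  curl -s http://<host>:7655/api/monitoring/scheduler/health \
    | jq -r --arg ts "$(date -Is)" '.instances[] | "\($ts) \(.key) \(.breaker.state)"'
  sleep 60
done
```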

## Feedback & Requests

If you encounter scenarios where manual management endpoints would be valuable:

1. **Document the use case**
   - What problem occurred?
   - Why wasn't automatic recovery sufficient?
   - How would manual control have helped?
2. **File an issue**
   - GitHub Issues
   - Include: scheduler health API output, logs, and a timeline (see the capture sketch after this list)
3. **Track frequency**
   - If the pattern recurs >3 times, escalate for implementation
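For step 2, a quick sketch for capturing the suggested evidence before filing:

```bash
# Snapshot scheduler health and recent scheduler-related log lines for the report
curl -s http://<host>:7655/api/monitoring/scheduler/health > scheduler-health.json
journalctl -u pulse --since "1 hour ago" | grep -Ei "circuit breaker|dead-letter" > pulse-scheduler.log
```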