# Adaptive Polling Management Endpoints (Future Enhancement)

**Status:** DEFERRED

**Decision Date:** 2025-10-20
**Re-evaluated:** v4.24.0 GA release
**Current Status:** Not implemented in v4.24.0

## Overview

Manual circuit breaker and dead-letter queue (DLQ) management endpoints are not included in v4.24.0. The read-only scheduler health API (`/api/monitoring/scheduler/health`) provides full visibility, and automatic recovery mechanisms have proven sufficient during testing and early production rollouts.


## What's Available in v4.24.0

### Read-Only Scheduler Health API

**Endpoint:** `GET /api/monitoring/scheduler/health`

Provides complete visibility into:

- Queue depth and task distribution
- Circuit breaker states per instance
- Dead-letter queue contents and retry schedules
- Per-instance staleness tracking
- Failure streaks and error categorization

**Documentation:** See the Scheduler Health API documentation for the complete reference.
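For a quick spot check from the command line, a minimal sketch (it uses only the `.instances[]`, `.key`, and `.breaker.state` fields that appear in the examples later in this document; adapt to the actual response shape):

```bash
# Summarize scheduler health: instance count plus any non-closed breakers
curl -s http://<host>:7655/api/monitoring/scheduler/health \
  | jq '{instances: (.instances | length), openBreakers: [.instances[] | select(.breaker.state != "closed") | .key]}'
```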

### Existing Management Options

v4.24.0 operators can:

1. **Toggle adaptive polling** (no restart required; see the sketch after this list)
   - Via UI: Settings → System → Monitoring
   - Via API: Update `system.json` with `adaptivePollingEnabled: false`
2. **Restart the service** (clears transient state)

   ```bash
   # Systemd
   sudo systemctl restart pulse

   # Docker
   docker restart pulse

   # LXC
   pct restart <ctid>
   ```

   - Clears all circuit breakers
   - Resets the DLQ (tasks are re-queued with fresh state)
   - Useful for recovering from stuck states
3. **Roll back the version** (if broader issues persist)
   - Via UI: Settings → System → Updates → Restore previous version
   - Via CLI: `pulse config rollback`
   - Documented in the Operations Runbook
4. **Fix per-instance configuration**
   - Update node credentials if authentication failures cause DLQ entries
   - Adjust network/firewall rules if connectivity issues trip breakers
   - Fix underlying infrastructure problems
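For option 1, one way to flip the flag in `system.json` directly (a hedged sketch: the `/etc/pulse` path is an assumption about a default install, and the Settings UI remains the supported route):

```bash
# Sketch only: set adaptivePollingEnabled to false in system.json using jq.
# The /etc/pulse path is assumed; adjust to your deployment's config directory.
sudo jq '.adaptivePollingEnabled = false' /etc/pulse/system.json > /tmp/system.json \
  && sudo mv /tmp/system.json /etc/pulse/system.json
```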

## Why Endpoints Are Deferred

### Test Results Demonstrate Sufficient Automation

**Integration testing** (55 seconds, 12 instances):

- Circuit breakers opened and closed automatically
- Transient failures recovered without intervention
- Permanent failures were correctly routed to the DLQ

**Soak testing** (2-240 minutes, 80 instances):

- Heap: 2.3 MB → 3.1 MB (healthy growth)
- Goroutines: 16 → 6 (no leak)
- No scenarios requiring manual intervention

**Production rollout** (v4.24.0):

- Automatic recovery working as designed
- Service restart sufficient for edge cases
- No operator requests for manual controls

### Implementation Cost vs. Benefit

**Would require:**

- Authentication and RBAC integration
- Comprehensive audit logging
- UI integration in Settings → System → Monitoring
- Additional testing and maintenance burden

**Current workarounds have proven effective:**

- Adaptive polling toggle (immediate, no restart)
- Service restart (clears all state in < 30 seconds)
- Version rollback (if issues are systematic)

## Future Implementation Plan

### Proposed Endpoints (When Needed)

If production usage reveals operational gaps, implement:

#### 1. Reset Circuit Breaker

```
POST /api/monitoring/breakers/{key}/reset
Authorization: Required (session or API token)
```

Request:

```json
{
  "reason": "Manual reset after infrastructure fix"
}
```

Response:

```json
{
  "success": true,
  "key": "pve::pve-node1",
  "previousState": "open",
  "newState": "closed",
  "resetBy": "admin",
  "resetAt": "2025-10-20T15:30:00Z"
}
```

**Use case:** Immediately retry a specific instance after fixing the underlying issue (e.g., restored network connectivity).
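If implemented, an invocation might look like the following. This is hypothetical (the endpoint does not exist in v4.24.0), and the `X-API-Token` header is an assumption about how the token would be passed; confirm against your deployment's API authentication docs:

```bash
# Hypothetical invocation (endpoint not implemented in v4.24.0)
curl -s -X POST \
  -H "X-API-Token: <token>" \
  -H "Content-Type: application/json" \
  -d '{"reason": "Manual reset after infrastructure fix"}' \
  "http://<host>:7655/api/monitoring/breakers/pve::pve-node1/reset"
```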

#### 2. Retry All DLQ Tasks

```
POST /api/monitoring/dlq/retry
Authorization: Required (session or API token)
```

Response:

```json
{
  "success": true,
  "tasksRetried": 5,
  "keys": ["pve::pve-node1", "pbs::backup-server"]
}
```

**Use case:** Bulk retry after fixing a widespread issue (e.g., certificate renewal).
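A hypothetical bulk invocation, with the same assumed token header as above:

```bash
# Hypothetical bulk retry of all DLQ tasks (not implemented in v4.24.0)
curl -s -X POST -H "X-API-Token: <token>" \
  "http://<host>:7655/api/monitoring/dlq/retry"
```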

#### 3. Retry Specific DLQ Task

```
POST /api/monitoring/dlq/{key}/retry
Authorization: Required (session or API token)
```

Response:

```json
{
  "success": true,
  "key": "pve::pve-node1",
  "previousRetryCount": 5,
  "scheduledFor": "2025-10-20T15:35:00Z"
}
```

**Use case:** Targeted retry of a single instance.

#### 4. Remove from DLQ

```
DELETE /api/monitoring/dlq/{key}
Authorization: Required (session or API token)
```

Response:

```json
{
  "success": true,
  "key": "pve::decommissioned-node",
  "reason": "Instance permanently decommissioned"
}
```

**Use case:** Remove permanently decommissioned instances from the DLQ.
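And the corresponding hypothetical removal call, again assuming the `X-API-Token` header:

```bash
# Hypothetical removal of a decommissioned instance (not implemented in v4.24.0)
curl -s -X DELETE -H "X-API-Token: <token>" \
  "http://<host>:7655/api/monitoring/dlq/pve::decommissioned-node"
```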

### Security Requirements

All management endpoints would require:

- **Authentication:** Valid session cookie or API token
- **RBAC:** Admin-level permissions
- **Audit logging:** Every action logged with:
  - Operator username/IP
  - Instance key affected
  - Reason provided
  - Timestamp
  - Previous and new states
- **Rate limiting:** Prevent abuse (e.g., 10 requests/minute)
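A hypothetical audit record covering the fields above might look like the following. The shape is purely illustrative, not Pulse's actual audit schema:

```json
{
  "action": "breaker.reset",
  "operator": "admin",
  "sourceIp": "10.0.0.5",
  "key": "pve::pve-node1",
  "reason": "Manual reset after infrastructure fix",
  "timestamp": "2025-10-20T15:30:00Z",
  "previousState": "open",
  "newState": "closed"
}
```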

## Re-evaluation Criteria

Implement management endpoints if:

1. **Operator demand:** >3 requests in the first 60 days of v4.24.0 deployment
2. **Service restart frequency:** >5 restarts per week due to stuck breakers/DLQ entries
3. **Incident impact:** Manual controls would have prevented, or materially accelerated recovery from, more than one production incident
4. **Runbook feedback:** The troubleshooting guidance in ADAPTIVE_POLLING_ROLLOUT.md proves inadequate

Don't implement if:

- Current workarounds remain effective
- Automatic recovery continues to handle 99%+ of scenarios
- No clear operational pain points emerge

## Monitoring Current State

### Check Circuit Breakers

```bash
curl -s http://<host>:7655/api/monitoring/scheduler/health \
  | jq '.instances[] | select(.breaker.state != "closed") | {key, state: .breaker.state, since: .breaker.since}'
```

### Check Dead-Letter Queue

```bash
curl -s http://<host>:7655/api/monitoring/scheduler/health \
  | jq '.instances[] | select(.deadLetter.present) | {key, reason: .deadLetter.reason, retryCount: .deadLetter.retryCount, nextRetry: .deadLetter.nextRetry}'
```

### Track Recovery Times

```bash
# Monitor breaker state changes
journalctl -u pulse | grep -E "circuit breaker|dead-letter"
```
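To build a recovery timeline from the API instead of the logs, a minimal sketch (polls once a minute and timestamps each breaker state, using only fields shown above):

```bash
# Poll breaker states every 60s; redirect to a file to track recovery times
while true; do
  curl -s http://<host>:7655/api/monitoring/scheduler/health \
    | jq -r --arg ts "$(date -Is)" '.instances[] | "\($ts) \(.key) \(.breaker.state)"'
  sleep 60
done
```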

## Feedback & Requests

If you encounter scenarios where manual management endpoints would be valuable:

1. **Document the use case**
   - What problem occurred?
   - Why wasn't automatic recovery sufficient?
   - How would manual control have helped?
2. **File an issue**
   - GitHub Issues
   - Include: scheduler health API output, logs, and a timeline (see the capture sketch after this list)
3. **Track frequency**
   - If the pattern recurs >3 times, escalate for implementation
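For step 2, a quick sketch for capturing the suggested evidence before filing:

```bash
# Snapshot scheduler health and recent scheduler-related log lines for the report
curl -s http://<host>:7655/api/monitoring/scheduler/health > scheduler-health.json
journalctl -u pulse --since "1 hour ago" | grep -Ei "circuit breaker|dead-letter" > pulse-scheduler.log
```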