Improves configuration handling and system settings APIs to support
v4.24.0 features including runtime logging controls, adaptive polling
configuration, and enhanced config export/persistence.
Changes:
- Add config override system for discovery service
- Enhance system settings API with runtime logging controls
- Improve config persistence and export functionality
- Update security setup handling
- Refine monitoring and discovery service integration
These changes provide the backend support for the configuration
features documented in the v4.24.0 release.
Add comprehensive instance-level diagnostics to /api/monitoring/scheduler/health
**New Response Structure:**
Enhanced "instances" array with per-instance details:
- Instance metadata: displayName, type, connection URL
- Poll status: last success/error timestamps, error messages, error category
- Circuit breaker: state, timestamps, failure counts, retry windows
- Dead letter: present flag, reason, attempt history, retry schedule
**Implementation:**
Data structures:
- instanceInfo: cache of display names, URLs, types
- pollStatus: tracks successes/errors with timestamps and categories
- dlqInsight: DLQ entry metadata (reason, attempts, schedule)
- circuitBreaker: enhanced with stateSince, lastTransition
Tracking logic:
- buildInstanceInfoCache: populate metadata from config on startup
- recordTaskResult: track poll outcomes, error details, categories
- sendToDeadLetter: capture DLQ insights (reason, timestamps)
- circuitBreaker: record state transitions with timestamps
**Backward Compatible:**
- Existing fields (deadLetter, breakers, staleness) unchanged
- New "instances" array is additive
- Old clients can ignore new fields
**Testing:**
- Unit test: TestSchedulerHealth_EnhancedResponse validates all fields
- Integration tests: still passing (55s)
- All error tracking and breaker history verified
**Operator Benefits:**
- Diagnose issues without log digging
- See error messages directly in API
- Understand breaker states and retry schedules
- Track DLQ entries with full context
- Single API call for complete instance health view
Example: Quickly identify a "401 unauthorized" error on a specific PBS instance,
see that it landed in the DLQ after 5 retries, and know when the next retry is scheduled.
Part of Phase 2 follow-up work to improve observability.
Task 8 of 10 complete. Exposes read-only scheduler health data including:
- Queue depth and distribution by instance type
- Dead-letter queue inspection (top 25 tasks with error details)
- Circuit breaker states (instance-level)
- Staleness scores per instance
New API endpoint:
GET /api/monitoring/scheduler/health (requires authentication)
New snapshot methods:
- StalenessTracker.Snapshot() - exports all staleness data
- TaskQueue.Snapshot() - queue depth & per-type distribution
- TaskQueue.PeekAll() - dead-letter task inspection
- circuitBreaker.State() - exports state, failures, retryAt
- Monitor.SchedulerHealth() - aggregates all health data
Documentation updated with API spec, field descriptions, and usage examples.
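A rough sketch of how `Monitor.SchedulerHealth()` could aggregate the snapshot methods above; the stand-in types and field names are assumptions for illustration, not the real signatures. The key point is that each subsystem exposes a read-only copy, so the HTTP handler never serializes live internal state.

```go
package main

import "fmt"

// Stand-in snapshot types (illustrative only).
type QueueSnapshot struct {
	Depth  int
	ByType map[string]int
}

type BreakerState struct {
	State    string
	Failures int
}

type SchedulerHealthResponse struct {
	Queue     QueueSnapshot
	Breakers  map[string]BreakerState
	Staleness map[string]float64
}

type Monitor struct {
	queueDepth int
	byType     map[string]int
	breakers   map[string]BreakerState
	staleness  map[string]float64
}

// SchedulerHealth copies each subsystem's state into one response so
// callers cannot mutate the monitor's internal maps.
func (m *Monitor) SchedulerHealth() SchedulerHealthResponse {
	resp := SchedulerHealthResponse{
		Queue:     QueueSnapshot{Depth: m.queueDepth, ByType: make(map[string]int)},
		Breakers:  make(map[string]BreakerState),
		Staleness: make(map[string]float64),
	}
	for k, v := range m.byType {
		resp.Queue.ByType[k] = v
	}
	for k, v := range m.breakers {
		resp.Breakers[k] = v
	}
	for k, v := range m.staleness {
		resp.Staleness[k] = v
	}
	return resp
}

func main() {
	m := &Monitor{
		queueDepth: 3,
		byType:     map[string]int{"pve": 2, "pbs": 1},
		breakers:   map[string]BreakerState{"pbs-main": {State: "open", Failures: 5}},
		staleness:  map[string]float64{"pbs-main": 0.9},
	}
	fmt.Println(m.SchedulerHealth().Queue.Depth) // 3
}
```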
Replaces immediate polling with queue-based scheduling:
- TaskQueue with min-heap (container/heap) for NextRun-ordered execution
- Worker goroutines that block on WaitNext() until tasks are due
- Tasks only execute when NextRun <= now, respecting adaptive intervals
- Automatic rescheduling after execution via scheduler.BuildPlan
- Queue depth tracking for backpressure-aware interval adjustments
- Upsert semantics for updating scheduled tasks without duplicates
Task 6 of 10 complete (60%). Ready for error/backoff policies.
Adds freshness metadata tracking for all monitored instances:
- StalenessTracker with per-instance last success/error/mutation timestamps
- Change hash detection using SHA1 for detecting data mutations
- Normalized staleness scoring (0-1 scale) based on age vs maxStale
- Integration with PollMetrics for authoritative last-success data
- Wired into all poll functions (PVE/PBS/PMG) via UpdateSuccess/UpdateError
- Connected to scheduler as StalenessSource implementation
Task 4 of 10 complete. Ready for adaptive interval logic.
Track minimum and maximum CPU temperatures since monitoring started.
This provides better insight into temperature trends and cooling
adequacy over time.
Changes:
- Backend: Add CPUMin, CPUMaxRecord, MinRecorded, MaxRecorded fields
to Temperature model
- Backend: Implement min/max tracking logic in monitoring cycle that
preserves values across polling cycles
- Backend: Initialize min/max on first reading, update on extremes
- Frontend: Update Temperature TypeScript interface with new fields
- Frontend: Display min/max range in NodeCard tooltip (e.g., "52°C
(48-67°C since monitoring started)")
- Frontend: Rebuild dist assets
Temperature display now shows:
- Current temperature with color coding (green/yellow/red)
- Tooltip with full min-max range and context
- Min/max tracked in-memory (resets on Pulse restart)
Example tooltip: "CPU: 52°C (48-67°C since monitoring started)"
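The preservation logic described above amounts to: initialize on the first reading, then only move the bounds on new extremes. A minimal sketch, with the struct trimmed to the fields named in the commit:

```go
package main

import "fmt"

// Temperature keeps the current reading plus in-memory min/max records
// (so the range resets when Pulse restarts). Trimmed for illustration.
type Temperature struct {
	CPU          float64
	CPUMin       float64
	CPUMaxRecord float64
	MinRecorded  bool
	MaxRecorded  bool
}

// observe records one reading, initializing the bounds on the first
// sample and updating them only on new extremes.
func (t *Temperature) observe(cpu float64) {
	t.CPU = cpu
	if !t.MinRecorded || cpu < t.CPUMin {
		t.CPUMin = cpu
		t.MinRecorded = true
	}
	if !t.MaxRecorded || cpu > t.CPUMaxRecord {
		t.CPUMaxRecord = cpu
		t.MaxRecorded = true
	}
}

func main() {
	var t Temperature
	for _, c := range []float64{52, 48, 67, 55} {
		t.observe(c)
	}
	fmt.Printf("CPU: %.0f°C (%.0f-%.0f°C since monitoring started)\n",
		t.CPU, t.CPUMin, t.CPUMaxRecord)
}
```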
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Improvements to pulse-sensor-proxy:
- Fix cluster discovery to use pvecm status for IP addresses instead of node names
- Add standalone node support for non-clustered Proxmox hosts
- Enhance SSH key push with detailed logging, success/failure tracking, and error reporting
- Add --pulse-server flag to installer for custom Pulse URLs
- Configure www-data group membership for Proxmox IPC access
UI and API cleanup:
- Remove unused "Ensure cluster keys" button from Settings
- Remove /api/diagnostics/temperature-proxy/ensure-cluster-keys endpoint
- Remove EnsureClusterKeys method from tempproxy client
The setup script already handles SSH key distribution during initial configuration,
making the manual refresh button redundant.
The PUT /api/config/nodes/{id} endpoint was corrupting node configurations
when making partial updates (e.g., updating just monitorPhysicalDisks):
- Authentication fields (tokenName, tokenValue, password) were being cleared
when updating unrelated settings
- Name field was being blanked when not included in request
- Monitor* boolean fields were defaulting to false
Changes:
- Only update name field if explicitly provided in request
- Only switch authentication method when auth fields are explicitly provided
- Preserve existing auth credentials on non-auth updates
- Applied fix to all node types (PVE, PBS, PMG)
Also enables physical disk monitoring by default (opt-out instead of opt-in)
and preserves disk data between polling intervals.
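The fix pattern above can be sketched with pointer fields, which distinguish "absent from the request" (nil) from "explicitly set to the zero value", so partial updates no longer blank credentials. The types are illustrative stand-ins, not the actual config structs.

```go
package main

import "fmt"

// NodeUpdateRequest uses pointers so omitted JSON fields stay nil.
type NodeUpdateRequest struct {
	Name                 *string
	TokenName            *string
	TokenValue           *string
	MonitorPhysicalDisks *bool
}

type NodeConfig struct {
	Name                 string
	TokenName            string
	TokenValue           string
	MonitorPhysicalDisks bool
}

// applyUpdate mutates cfg only for fields explicitly present in req.
func applyUpdate(cfg *NodeConfig, req NodeUpdateRequest) {
	if req.Name != nil {
		cfg.Name = *req.Name
	}
	// Only switch auth when auth fields are explicitly provided;
	// otherwise existing credentials are preserved.
	if req.TokenName != nil && req.TokenValue != nil {
		cfg.TokenName = *req.TokenName
		cfg.TokenValue = *req.TokenValue
	}
	if req.MonitorPhysicalDisks != nil {
		cfg.MonitorPhysicalDisks = *req.MonitorPhysicalDisks
	}
}

func main() {
	cfg := NodeConfig{Name: "pve1", TokenName: "pulse", TokenValue: "secret", MonitorPhysicalDisks: true}
	off := false
	// Partial update: only monitorPhysicalDisks is present in the request.
	applyUpdate(&cfg, NodeUpdateRequest{MonitorPhysicalDisks: &off})
	fmt.Println(cfg.Name, cfg.TokenName, cfg.MonitorPhysicalDisks) // pve1 pulse false
}
```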