Related to #630
When using the efficient polling path (cluster/resources endpoint), guest
agent calls to GetVMFSInfo were made without retry logic. This could cause
transient "Guest details unavailable" errors during initialization when the
guest agent wasn't immediately ready to respond.
The traditional polling path already used retryGuestAgentCall for filesystem
info queries, providing resilience against transient timeouts. This commit
applies the same retry logic to the efficient polling path for consistency.
Changes:
- Wrap GetVMFSInfo call in efficient polling with retryGuestAgentCall
- Use configured guestAgentFSInfoTimeout and guestAgentRetries settings
- Ensure consistent behavior between the traditional and efficient polling paths
This should resolve the transient initialization issue reported in #630 where
guest details were unavailable until after a reinstall/restart.
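A minimal sketch of the retry shape, assuming a generic helper; the real retryGuestAgentCall in the monitoring code may differ in signature, and the wrapped call stands in for GetVMFSInfo:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// Assumed shape of retryGuestAgentCall: each attempt gets its own timeout,
// and transient failures are retried up to the configured count.
func retryGuestAgentCall[T any](ctx context.Context, timeout time.Duration,
	retries int, call func(context.Context) (T, error)) (T, error) {
	var zero T
	var lastErr error
	for attempt := 0; attempt <= retries; attempt++ {
		callCtx, cancel := context.WithTimeout(ctx, timeout)
		result, err := call(callCtx)
		cancel()
		if err == nil {
			return result, nil
		}
		lastErr = err
	}
	return zero, fmt.Errorf("guest agent call failed after %d attempts: %w",
		retries+1, lastErr)
}
```

In the efficient polling path, the GetVMFSInfo call is then wrapped the same way the traditional path already wraps it, using the configured guestAgentFSInfoTimeout and guestAgentRetries values.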
Related to discussion #577
When backups are stored on shared storage accessible from multiple nodes,
the backup polling code was incorrectly assigning the backup to whichever
node it was discovered on during the scan, rather than the node where the
VM/container actually resides.
This fix:
- Builds a lookup map of VMID -> actual node at the start of backup polling
- Uses this map to assign the correct node for guest backups (VMID > 0)
- Preserves existing behavior for host backups (VMID == 0)
- Falls back to the queried node if the guest is not found in the map
This ensures the NODE column accurately reflects which node hosts each
guest, matching the information displayed on the main page.
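A sketch of the lookup under simplified types; the real structs in Pulse carry many more fields:

```go
package main

// Simplified stand-ins for the guest and backup records.
type guest struct {
	VMID int
	Node string
}

type storageBackup struct {
	VMID int
	Node string // node the backup was discovered on during the scan
}

func assignBackupNodes(guests []guest, backups []storageBackup) {
	// Build the VMID -> actual node map once at the start of polling.
	nodeByVMID := make(map[int]string, len(guests))
	for _, g := range guests {
		nodeByVMID[g.VMID] = g.Node
	}
	for i := range backups {
		if backups[i].VMID == 0 {
			continue // host backup: existing behavior preserved
		}
		if node, ok := nodeByVMID[backups[i].VMID]; ok {
			backups[i].Node = node // node that actually hosts the guest
		}
		// Not found in the map: keep the queried node as the fallback.
	}
}
```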
Related to #614
Corrects three issues with PMG monitoring:
1. Remove unsupported timeframe parameter from GetMailStatistics
- PMG API /statistics/mail does not accept a timeframe parameter
- Previously sent "timeframe=day", causing a 400 error
- The API returns current-day statistics by default
2. Fix GetMailCount timespan parameter to use seconds
- Changed from 24 (hours) to 86400 (seconds)
- The PMG API expects timespan in seconds, not hours
- Previously sent "timespan=24", causing a 400 error
3. Update function signature and tests
- Renamed GetMailCount parameter from timespanHours to timespanSeconds
- Updated test expectations to match corrected API calls
- Tests verify parameters are sent correctly
These changes align the PMG client with actual PMG API requirements,
fixing the data population issues reported in v4.25.0.
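Illustrative query construction after the fix; the /statistics/mail path is from the commit, while the mailcount path and plumbing below are assumptions about the PMG API layout:

```go
package main

import (
	"net/url"
	"strconv"
)

// /statistics/mail takes no timeframe parameter; it returns the current
// day's statistics by default, so the path is used bare.
const mailStatsPath = "/api2/json/statistics/mail"

func mailCountQuery(timespanSeconds int) string {
	params := url.Values{}
	// PMG expects timespan in seconds: 86400 covers the last 24 hours.
	params.Set("timespan", strconv.Itoa(timespanSeconds))
	// NOTE: this mailcount path is an assumption about the PMG API layout.
	return "/api2/json/statistics/mailcount?" + params.Encode()
}
```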
Related to #613
When all PBS datastore queries fail (e.g., due to network issues or PBS
downtime), the system was clearing all backups and showing an empty list.
This adds the same preservation logic that exists for PVE storage backups.
Changes:
- Add shouldPreservePBSBackups() helper function
- Track datastore query success/failure counts in pollPBSBackups()
- Preserve existing backups when all datastore queries fail
- Add comprehensive unit tests for PBS backup preservation logic
This ensures users can still see their backup history even during
temporary connectivity issues with PBS, matching the behavior already
implemented for PVE storage backups.
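A minimal sketch of the preservation rule, assuming the helper reduces to counts like these; the shipped helper may take richer inputs:

```go
package main

// shouldPreservePBSBackups keeps the previous snapshot only when every
// datastore query failed and there is prior backup data worth preserving.
func shouldPreservePBSBackups(successCount, failureCount, existingBackups int) bool {
	return successCount == 0 && failureCount > 0 && existingBackups > 0
}
```

When it returns true, pollPBSBackups keeps the existing backup list instead of replacing it with an empty one.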
This change modifies the `clusterEndpointEffectiveURL` function to prioritize
IP addresses over hostnames when building cluster endpoint URLs. This eliminates
excessive DNS lookups that can overwhelm DNS servers (e.g., pi-hole), which was
causing hundreds of thousands of unnecessary DNS queries.
When Pulse communicates with Proxmox cluster nodes, it will now:
1. First try to use the IP address from ClusterEndpoint.IP
2. Fall back to ClusterEndpoint.Host only if IP is not available
This is a minimal, backwards-compatible change that maintains existing
functionality while dramatically reducing DNS traffic for clusters where
node IPs are already known and stored.
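A sketch of the IP-first selection; the field names match the commit, while the scheme and default Proxmox port 8006 are simplifications:

```go
package main

import "fmt"

// Trimmed to the two fields relevant here.
type ClusterEndpoint struct {
	IP   string
	Host string
}

func clusterEndpointEffectiveURL(ep ClusterEndpoint) string {
	host := ep.Host
	if ep.IP != "" {
		host = ep.IP // no DNS lookup needed when the IP is already stored
	}
	return fmt.Sprintf("https://%s:8006", host)
}
```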
Related to #620
Related to discussion #615
Add optional GuestURL field to PVE instances and cluster endpoints,
allowing users to specify a separate guest-accessible URL for web UI
navigation that differs from the internal management URL.
Backend changes:
- Add GuestURL field to PVEInstance and ClusterEndpoint structs
- Add GuestURL field to Node model
- Update cluster auto-discovery to preserve existing GuestURL values
- Update node creation logic to populate GuestURL from config
- Update API handlers to accept and persist GuestURL field
Frontend changes:
- Add GuestURL input field to NodeModal for configuration
- Update NodeGroupHeader and NodeSummaryTable to use GuestURL for navigation
- Add GuestURL to Node and PVENodeConfig TypeScript interfaces
When GuestURL is configured, it will be used for navigation links
instead of the Host URL, allowing users to access PVE hosts through
a reverse proxy or different domain while maintaining internal API
connections.
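A sketch of the URL choice, with the Node model trimmed to the two relevant fields and a hypothetical helper:

```go
package main

// Trimmed Node model; GuestURL and Host are the fields named above.
type Node struct {
	Host     string // internal management URL used for API connections
	GuestURL string // optional user-facing URL, e.g. behind a reverse proxy
}

// navigationURL picks the link target for the web UI.
func navigationURL(n Node) string {
	if n.GuestURL != "" {
		return n.GuestURL
	}
	return n.Host
}
```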
Implemented comprehensive state preservation to prevent temporary dropouts:
1. Node Grace Period (60s):
- Track last-online timestamp for each Proxmox node
- Preserve online status during grace period to prevent flapping
- Applied to all node status checks throughout codebase
2. Efficient Polling Preservation:
- Detect when cluster/resources returns empty arrays
- Preserve previous VMs/containers if resources were present before
- Handle cluster health check failures gracefully
3. Traditional Polling Preservation:
- Update preservation logic for per-node VM/container polling
- Trigger preservation when zero resources are returned, regardless of node response
- Fix issue where nodes responding with empty data bypassed preservation
Root cause: Intermittent Proxmox cluster health failures ("no healthy nodes
available") caused both efficient and traditional polling to return empty
arrays, immediately clearing all VMs/containers from state.
Changes:
- internal/monitoring/monitor.go: Added node grace period, efficient polling preservation
- internal/monitoring/monitor_polling.go: Fixed traditional polling preservation logic
Fixes frequent UI flickering where vmCount/containerCount would briefly drop to zero.
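A sketch of the grace-period check, assuming an in-memory last-online map; the real logic lives in internal/monitoring/monitor.go:

```go
package main

import (
	"sync"
	"time"
)

const nodeGracePeriod = 60 * time.Second

type nodeHealth struct {
	mu         sync.Mutex
	lastOnline map[string]time.Time
}

func newNodeHealth() *nodeHealth {
	return &nodeHealth{lastOnline: make(map[string]time.Time)}
}

func (h *nodeHealth) effectiveOnline(node string, reportedOnline bool) bool {
	h.mu.Lock()
	defer h.mu.Unlock()
	now := time.Now()
	if reportedOnline {
		h.lastOnline[node] = now
		return true
	}
	// Seen recently: keep reporting online to prevent status flapping.
	return now.Sub(h.lastOnline[node]) < nodeGracePeriod
}
```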
This commit implements per-node temperature monitoring control and fixes a critical
bug where partial node updates were destroying existing configuration.
Backend changes:
- Add TemperatureMonitoringEnabled field (*bool) to PVEInstance, PBSInstance, and PMGInstance
- Update monitor.go to check per-node temperature setting with global fallback
- Convert all NodeConfigRequest boolean fields to *bool pointers
- Add nil checks in HandleUpdateNode to prevent overwriting unmodified fields
- Fix critical bug where partial updates zeroed out MonitorVMs, MonitorContainers, etc.
- Update NodeResponse, NodeFrontend, and StateSnapshot to include temperature setting
- Fix HandleAddNode and test connection handlers to use pointer-based boolean fields
Frontend changes:
- Add temperatureMonitoringEnabled to Node interface and config types
- Create per-node temperature monitoring toggle handler with optimistic updates
- Update NodeModal to wire up per-node temperature toggle
- Add isTemperatureMonitoringEnabled helper to check effective monitoring state
- Update ConfiguredNodeTables to show/hide temperature badge based on monitoring state
- Update NodeSummaryTable to conditionally show temperature column
- Pass globalTemperatureMonitoringEnabled prop through component tree
The critical bug fix ensures that when updating a single field (like temperature
monitoring), the backend only modifies that specific field instead of zeroing out
all other boolean configuration fields.
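A sketch of the pointer-based merge that fixes the bug, with the structs trimmed to the relevant booleans:

```go
package main

// Trimmed config; nil TemperatureMonitoringEnabled means "use the global setting".
type PVEInstance struct {
	MonitorVMs                   bool
	MonitorContainers            bool
	TemperatureMonitoringEnabled *bool
}

// All request booleans are pointers so "absent" and "false" are distinct.
type NodeConfigRequest struct {
	MonitorVMs                   *bool `json:"monitorVMs,omitempty"`
	MonitorContainers            *bool `json:"monitorContainers,omitempty"`
	TemperatureMonitoringEnabled *bool `json:"temperatureMonitoringEnabled,omitempty"`
}

func applyUpdate(cfg *PVEInstance, req NodeConfigRequest) {
	// Only fields the client actually sent (non-nil) are applied; fields
	// absent from a partial update keep their existing values.
	if req.MonitorVMs != nil {
		cfg.MonitorVMs = *req.MonitorVMs
	}
	if req.MonitorContainers != nil {
		cfg.MonitorContainers = *req.MonitorContainers
	}
	if req.TemperatureMonitoringEnabled != nil {
		cfg.TemperatureMonitoringEnabled = req.TemperatureMonitoringEnabled
	}
}

// temperatureEnabled resolves the per-node setting with global fallback.
func temperatureEnabled(inst PVEInstance, globalEnabled bool) bool {
	if inst.TemperatureMonitoringEnabled != nil {
		return *inst.TemperatureMonitoringEnabled
	}
	return globalEnabled
}
```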
Root Cause:
The classifyError() function in tempproxy/client.go was returning nil
when err was nil, even if respError contained "rate limit exceeded".
This caused the retry logic to treat rate limit errors as retryable,
triggering 3 retries with exponential backoff (100ms, 200ms, 400ms)
for each rate-limited request.
With multiple nodes polling simultaneously and hitting the proxy's
1 req/sec default rate limit, this created a retry storm:
- 3 nodes polling every 10 seconds
- 1-2 requests rate limited per cycle
- Each rate limit triggered 3 retries
- Result: 6+ extra requests per cycle, causing temperature data to
flicker in and out as requests were dropped
Solution:
1. Reordered classifyError() to check respError first before checking
if err is nil, ensuring rate limit errors are properly classified
2. Added explicit rate limit detection that marks these errors as
non-retryable
3. Added stub EnableTemperatureMonitoring/DisableTemperatureMonitoring
methods to Monitor for interface compatibility
Impact:
- Rate limit retry attempts reduced from 151 in 10 minutes to 0
- Temperature data now stable for all nodes
- No more flickering temperature displays in dashboard
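A sketch of the reordered check; the classification type and respError convention are assumptions, but the check order mirrors the fix:

```go
package main

import "strings"

type errClass int

const (
	errNone errClass = iota
	errRetryable
	errFatal // non-retryable, e.g. rate limiting
)

func classifyError(err error, respError string) errClass {
	// Check the response-level error first: a rate-limited request can
	// return err == nil while respError is set.
	if strings.Contains(respError, "rate limit exceeded") {
		return errFatal // retrying would only amplify the storm
	}
	if err == nil && respError == "" {
		return errNone
	}
	return errRetryable
}
```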
This change addresses intermittent "Guest details unavailable" and "Disk stats
unavailable" errors affecting users with large VM deployments (50+ VMs) or
high-load Proxmox environments.
Changes:
- Increased default guest agent timeouts (3-5s → 10-15s) to better handle
environments under load
- Added automatic retry logic (1 retry by default) for transient timeout failures
- Made all timeouts and retry count configurable via environment variables:
* GUEST_AGENT_FSINFO_TIMEOUT (default: 15s)
* GUEST_AGENT_NETWORK_TIMEOUT (default: 10s)
* GUEST_AGENT_OSINFO_TIMEOUT (default: 10s)
* GUEST_AGENT_VERSION_TIMEOUT (default: 10s)
* GUEST_AGENT_RETRIES (default: 1)
- Added comprehensive documentation in VM_DISK_MONITORING.md with configuration
examples for different deployment scenarios
These improvements allow Pulse to gracefully handle intermittent API timeouts
without immediately displaying errors, while remaining configurable for
different network conditions and environment sizes.
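A sketch of the environment-variable plumbing, assuming helpers like these; the variable names and defaults are from the list above, but whether values parse as Go durations ("15s") or bare seconds is not confirmed here:

```go
package main

import (
	"os"
	"strconv"
	"time"
)

func durationFromEnv(key string, def time.Duration) time.Duration {
	if v := os.Getenv(key); v != "" {
		if d, err := time.ParseDuration(v); err == nil {
			return d
		}
	}
	return def
}

func intFromEnv(key string, def int) int {
	if v := os.Getenv(key); v != "" {
		if n, err := strconv.Atoi(v); err == nil {
			return n
		}
	}
	return def
}

var (
	guestAgentFSInfoTimeout  = durationFromEnv("GUEST_AGENT_FSINFO_TIMEOUT", 15*time.Second)
	guestAgentNetworkTimeout = durationFromEnv("GUEST_AGENT_NETWORK_TIMEOUT", 10*time.Second)
	guestAgentOSInfoTimeout  = durationFromEnv("GUEST_AGENT_OSINFO_TIMEOUT", 10*time.Second)
	guestAgentVersionTimeout = durationFromEnv("GUEST_AGENT_VERSION_TIMEOUT", 10*time.Second)
	guestAgentRetries        = intFromEnv("GUEST_AGENT_RETRIES", 1)
)
```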
Fixes: https://github.com/rcourtman/Pulse/discussions/592
- Add Access-Control-Expose-Headers to allow frontend to read X-CSRF-Token response header
- Implement proactive CSRF token issuance on GET requests when session exists but CSRF cookie is missing
- Ensures frontend always has valid CSRF token before making POST requests
- Fixes 403 Forbidden errors when toggling system settings
This resolves CSRF validation failures that occurred when CSRF tokens expired or were missing while valid sessions existed.
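A sketch of the proactive issuance as net/http middleware; the header names are from the bullets above, while the cookie names and session check are hypothetical stand-ins:

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"net/http"
)

func issueCSRF(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.Method == http.MethodGet && hasValidSession(r) {
			if _, err := r.Cookie("pulse_csrf"); err != nil { // cookie missing
				token := newCSRFToken()
				http.SetCookie(w, &http.Cookie{Name: "pulse_csrf", Value: token, Path: "/"})
				// Expose the header so the SPA can read it from the response.
				w.Header().Set("Access-Control-Expose-Headers", "X-CSRF-Token")
				w.Header().Set("X-CSRF-Token", token)
			}
		}
		next.ServeHTTP(w, r)
	})
}

func hasValidSession(r *http.Request) bool {
	_, err := r.Cookie("pulse_session") // hypothetical session cookie name
	return err == nil
}

func newCSRFToken() string {
	b := make([]byte, 32)
	rand.Read(b) // crypto/rand; error ignored for brevity in this sketch
	return hex.EncodeToString(b)
}
```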
This commit addresses multiple issues in the Docker/host agent removal flow:
Agent Stop Fix:
- Add systemctl stop command after agent acknowledgement to prevent systemd restart
- Previous behavior: agent disabled but systemd immediately restarted it (Restart=always)
- New behavior: agent disables itself, sends ack, then stops systemd service completely
UX Improvements:
- Add real-time elapsed time counter during removal wait
- Show progress indicators prominently (no longer hidden in dropdown)
- Display expected time range (30-60 seconds) and last heartbeat
- Auto-show timeout warning after 2 minutes with actionable "Force remove" button
- Add contextual help explaining what's happening at each stage
Security Enhancement:
- Automatically revoke API tokens when removing Docker/host agents
- Previous behavior: tokens remained valid after agent removal
- New behavior: tokens are revoked and persisted immediately on removal
- Prevents removed agents from re-authenticating with old credentials
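A sketch of the disable, ack, stop ordering on the agent side; the disable persistence and ack transport are stubbed, and the unit name is a guess at the agent's systemd service:

```go
package main

import "os/exec"

func handleRemoval(disable func() error, sendAck func() error) error {
	// 1. Persist the disabled state so a restart cannot re-enroll.
	if err := disable(); err != nil {
		return err
	}
	// 2. Acknowledge the removal to the Pulse server first...
	if err := sendAck(); err != nil {
		return err
	}
	// 3. ...then stop the unit so Restart=always cannot bring it back.
	return exec.Command("systemctl", "stop", "pulse-agent.service").Run()
}
```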
Extends Docker container monitoring with comprehensive disk and storage information:
- Writable layer size and root filesystem usage displayed in new Disk column
- Block I/O statistics (read/write bytes totals) shown in container drawer
- Mount metadata including type, source, destination, mode, and driver details
- Configurable via --collect-disk flag (enabled by default, can be disabled for large fleets)
Also fixes the config watcher to consistently use the production auth config path instead of following PULSE_DATA_DIR in mock mode.
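A sketch of the size collection via the Docker Go SDK (import paths vary across SDK versions); passing Size: true asks the daemon to compute the writable-layer and root filesystem sizes per container:

```go
package main

import (
	"context"
	"fmt"

	"github.com/docker/docker/api/types/container"
	"github.com/docker/docker/client"
)

func collectDiskSizes(ctx context.Context, collectDisk bool) error {
	if !collectDisk { // --collect-disk can be disabled for large fleets
		return nil
	}
	cli, err := client.NewClientWithOpts(client.FromEnv, client.WithAPIVersionNegotiation())
	if err != nil {
		return err
	}
	defer cli.Close()
	// Size: true makes dockerd fill in SizeRw (writable layer) and SizeRootFs.
	containers, err := cli.ContainerList(ctx, container.ListOptions{All: true, Size: true})
	if err != nil {
		return err
	}
	for _, c := range containers {
		fmt.Printf("%s writable=%dB rootfs=%dB\n", c.ID[:12], c.SizeRw, c.SizeRootFs)
	}
	return nil
}
```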
Windows Host Agent Enhancements:
- Implement native Windows service support using golang.org/x/sys/windows/svc
- Add Windows Event Log integration for troubleshooting
- Create professional PowerShell installation/uninstallation scripts
- Add process termination and retry logic to handle Windows file locking
- Register uninstall endpoint at /uninstall-host-agent.ps1
Host Agent UI Improvements:
- Add expandable drawer to Hosts page (click row to view details)
- Display system info, network interfaces, disks, and temperatures in cards
- Replace status badges with subtle colored indicators
- Remove redundant master-detail sidebar layout
- Add search filtering for hosts
Technical Details:
- service_windows.go: Windows service lifecycle management with graceful shutdown
- service_stub.go: Cross-platform compatibility for non-Windows builds
- install-host-agent.ps1: Full Windows installation with validation
- uninstall-host-agent.ps1: Clean removal with process termination and retries
- HostsOverview.tsx: Expandable row pattern matching Docker/Proxmox pages
Files Added:
- cmd/pulse-host-agent/service_windows.go
- cmd/pulse-host-agent/service_stub.go
- scripts/install-host-agent.ps1
- scripts/uninstall-host-agent.ps1
- frontend-modern/src/components/Hosts/HostsOverview.tsx
- frontend-modern/src/components/Hosts/HostsFilter.tsx
The Windows service now starts reliably with automatic restart on failure,
and the uninstall script handles file locking gracefully without requiring reboots.
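A condensed sketch of the service handler pattern that service_windows.go uses via golang.org/x/sys/windows/svc; the registered service name and runAgent hook are assumptions:

```go
//go:build windows

package main

import "golang.org/x/sys/windows/svc"

type agentService struct{}

func (s *agentService) Execute(args []string, req <-chan svc.ChangeRequest,
	status chan<- svc.Status) (svcSpecificEC bool, exitCode uint32) {
	status <- svc.Status{State: svc.StartPending}
	stop := make(chan struct{})
	go runAgent(stop) // the agent's normal polling loop
	status <- svc.Status{State: svc.Running, Accepts: svc.AcceptStop | svc.AcceptShutdown}
	for c := range req {
		switch c.Cmd {
		case svc.Interrogate:
			status <- c.CurrentStatus
		case svc.Stop, svc.Shutdown:
			status <- svc.Status{State: svc.StopPending}
			close(stop) // graceful shutdown
			return false, 0
		}
	}
	return false, 0
}

func runAgent(stop <-chan struct{}) { <-stop } // stand-in for the real loop

func main() {
	// svc.Run blocks until the Service Control Manager stops the service.
	_ = svc.Run("PulseHostAgent", &agentService{})
}
```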
Improves configuration handling and system settings APIs to support
v4.24.0 features including runtime logging controls, adaptive polling
configuration, and enhanced config export/persistence.
Changes:
- Add config override system for discovery service
- Enhance system settings API with runtime logging controls
- Improve config persistence and export functionality
- Update security setup handling
- Refine monitoring and discovery service integration
These changes provide the backend support for the configuration
features documented in the v4.24.0 release.
Add comprehensive instance-level diagnostics to /api/monitoring/scheduler/health
**New Response Structure:**
Enhanced "instances" array with per-instance details:
- Instance metadata: displayName, type, connection URL
- Poll status: last success/error timestamps, error messages, error category
- Circuit breaker: state, timestamps, failure counts, retry windows
- Dead letter: present flag, reason, attempt history, retry schedule
**Implementation:**
Data structures:
- instanceInfo: cache of display names, URLs, types
- pollStatus: tracks successes/errors with timestamps and categories
- dlqInsight: DLQ entry metadata (reason, attempts, schedule)
- circuitBreaker: enhanced with stateSince, lastTransition
Tracking logic:
- buildInstanceInfoCache: populate metadata from config on startup
- recordTaskResult: track poll outcomes, error details, categories
- sendToDeadLetter: capture DLQ insights (reason, timestamps)
- circuitBreaker: record state transitions with timestamps
**Backward Compatible:**
- Existing fields (deadLetter, breakers, staleness) unchanged
- New "instances" array is additive
- Old clients can ignore new fields
**Testing:**
- Unit test: TestSchedulerHealth_EnhancedResponse validates all fields
- Integration tests: still passing (55s)
- All error tracking and breaker history verified
**Operator Benefits:**
- Diagnose issues without log digging
- See error messages directly in API
- Understand breaker states and retry schedules
- Track DLQ entries with full context
- Single API call for complete instance health view
Example: quickly identify a "401 unauthorized" error on a specific PBS
instance, see that it is in the DLQ after 5 retries, and know when the
next retry is scheduled.
Part of Phase 2 follow-up work to improve observability.
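Illustrative Go shapes for the new per-instance entries; the field names follow the text above, while the JSON tags and exact types are assumptions:

```go
package main

import "time"

type InstanceHealth struct {
	DisplayName    string        `json:"displayName"`
	Type           string        `json:"type"` // pve | pbs | pmg
	URL            string        `json:"url"`
	Poll           PollStatus    `json:"poll"`
	CircuitBreaker BreakerStatus `json:"circuitBreaker"`
	DeadLetter     *DLQInsight   `json:"deadLetter,omitempty"` // nil = not in DLQ
}

type PollStatus struct {
	LastSuccess   time.Time `json:"lastSuccess"`
	LastError     time.Time `json:"lastError"`
	LastErrorMsg  string    `json:"lastErrorMessage"`
	ErrorCategory string    `json:"errorCategory"`
}

type BreakerStatus struct {
	State          string    `json:"state"` // closed | open | half-open
	StateSince     time.Time `json:"stateSince"`
	LastTransition time.Time `json:"lastTransition"`
	Failures       int       `json:"failures"`
	RetryAt        time.Time `json:"retryAt"`
}

type DLQInsight struct {
	Reason    string    `json:"reason"`
	Attempts  int       `json:"attempts"`
	NextRetry time.Time `json:"nextRetry"`
}
```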
Task 8 of 10 complete. Exposes read-only scheduler health data including:
- Queue depth and distribution by instance type
- Dead-letter queue inspection (top 25 tasks with error details)
- Circuit breaker states (instance-level)
- Staleness scores per instance
New API endpoint:
GET /api/monitoring/scheduler/health (requires authentication)
New snapshot methods:
- StalenessTracker.Snapshot() - exports all staleness data
- TaskQueue.Snapshot() - queue depth & per-type distribution
- TaskQueue.PeekAll() - dead-letter task inspection
- circuitBreaker.State() - exports state, failures, retryAt
- Monitor.SchedulerHealth() - aggregates all health data
Documentation updated with API spec, field descriptions, and usage examples.
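A sketch of the read-only snapshot pattern behind TaskQueue.Snapshot(), with the queue internals simplified:

```go
package main

import "sync"

type queuedTask struct {
	InstanceType string // pve | pbs | pmg
}

type TaskQueue struct {
	mu    sync.Mutex
	tasks []queuedTask
}

type QueueSnapshot struct {
	Depth  int            `json:"depth"`
	ByType map[string]int `json:"byType"`
}

// Snapshot copies the counters under the lock so readers never block
// or observe the queue mid-mutation.
func (q *TaskQueue) Snapshot() QueueSnapshot {
	q.mu.Lock()
	defer q.mu.Unlock()
	snap := QueueSnapshot{Depth: len(q.tasks), ByType: make(map[string]int)}
	for _, t := range q.tasks {
		snap.ByType[t.InstanceType]++
	}
	return snap
}
```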
Replaces immediate polling with queue-based scheduling:
- TaskQueue with min-heap (container/heap) for NextRun-ordered execution
- Worker goroutines that block on WaitNext() until tasks are due
- Tasks only execute when NextRun <= now, respecting adaptive intervals
- Automatic rescheduling after execution via scheduler.BuildPlan
- Queue depth tracking for backpressure-aware interval adjustments
- Upsert semantics for updating scheduled tasks without duplicates
Task 6 of 10 complete (60%). Ready for error/backoff policies.
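A condensed sketch of the NextRun-ordered min-heap via container/heap; the real TaskQueue adds upsert semantics, depth tracking, and channel-based worker wake-ups instead of sleeping:

```go
package main

import (
	"container/heap"
	"time"
)

type task struct {
	Instance string
	NextRun  time.Time
}

// taskHeap orders tasks so the earliest NextRun is always at the root.
type taskHeap []*task

func (h taskHeap) Len() int           { return len(h) }
func (h taskHeap) Less(i, j int) bool { return h[i].NextRun.Before(h[j].NextRun) }
func (h taskHeap) Swap(i, j int)      { h[i], h[j] = h[j], h[i] }
func (h *taskHeap) Push(x any)        { *h = append(*h, x.(*task)) }
func (h *taskHeap) Pop() any {
	old := *h
	t := old[len(old)-1]
	*h = old[:len(old)-1]
	return t
}

// waitNext blocks until the earliest task is due, then pops it; assumes a
// non-empty heap, and the real WaitNext also wakes early on upserts.
func waitNext(h *taskHeap) *task {
	next := (*h)[0]
	if d := time.Until(next.NextRun); d > 0 {
		time.Sleep(d)
	}
	return heap.Pop(h).(*task)
}
```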
Adds freshness metadata tracking for all monitored instances:
- StalenessTracker with per-instance last success/error/mutation timestamps
- SHA1-based change hashing to detect data mutations
- Normalized staleness scoring (0-1 scale) based on age vs maxStale
- Integration with PollMetrics for authoritative last-success data
- Wired into all poll functions (PVE/PBS/PMG) via UpdateSuccess/UpdateError
- Connected to scheduler as StalenessSource implementation
Task 4 of 10 complete. Ready for adaptive interval logic.
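A sketch of the normalized score and change hash; the maxStale semantics come from the description above, and the exact field layout is assumed:

```go
package main

import (
	"crypto/sha1"
	"encoding/hex"
	"math"
	"time"
)

// stalenessScore maps age onto the 0-1 scale: 0 = fresh,
// 1 = at or beyond maxStale.
func stalenessScore(lastSuccess time.Time, maxStale time.Duration) float64 {
	age := time.Since(lastSuccess)
	if age <= 0 {
		return 0
	}
	return math.Min(float64(age)/float64(maxStale), 1)
}

// changeHash fingerprints a serialized snapshot so mutations can be
// detected without retaining the payload itself.
func changeHash(payload []byte) string {
	sum := sha1.Sum(payload)
	return hex.EncodeToString(sum[:])
}
```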
Track minimum and maximum CPU temperatures since monitoring started.
This provides better insight into temperature trends and cooling
adequacy over time.
Changes:
- Backend: Add CPUMin, CPUMaxRecord, MinRecorded, MaxRecorded fields
to Temperature model
- Backend: Implement min/max tracking logic in monitoring cycle that
preserves values across polling cycles
- Backend: Initialize min/max on first reading, update on extremes
- Frontend: Update Temperature TypeScript interface with new fields
- Frontend: Display min/max range in NodeCard tooltip (e.g., "52°C
(48-67°C since monitoring started)")
- Frontend: Rebuild dist assets
Temperature display now shows:
- Current temperature with color coding (green/yellow/red)
- Tooltip with full min-max range and context
- Min/max tracked in-memory (resets on Pulse restart)
Example tooltip: "CPU: 52°C (48-67°C since monitoring started)"
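A sketch of the in-memory min/max update; field names follow the Temperature model above, and the extremes reset on restart because nothing is persisted:

```go
package main

import "time"

type Temperature struct {
	CPU          float64
	CPUMin       float64
	CPUMaxRecord float64
	MinRecorded  time.Time
	MaxRecorded  time.Time
}

// updateMinMax carries extremes forward across polling cycles and only
// moves them when a new reading sets a record.
func updateMinMax(prev *Temperature, current float64) Temperature {
	now := time.Now()
	t := Temperature{CPU: current}
	if prev == nil { // first reading: both extremes start at the current value
		t.CPUMin, t.CPUMaxRecord = current, current
		t.MinRecorded, t.MaxRecorded = now, now
		return t
	}
	t.CPUMin, t.MinRecorded = prev.CPUMin, prev.MinRecorded
	t.CPUMaxRecord, t.MaxRecorded = prev.CPUMaxRecord, prev.MaxRecorded
	if current < t.CPUMin {
		t.CPUMin, t.MinRecorded = current, now
	}
	if current > t.CPUMaxRecord {
		t.CPUMaxRecord, t.MaxRecorded = current, now
	}
	return t
}
```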
Improvements to pulse-sensor-proxy:
- Fix cluster discovery to use pvecm status for IP addresses instead of node names
- Add standalone node support for non-clustered Proxmox hosts
- Enhance SSH key push with detailed logging, success/failure tracking, and error reporting
- Add --pulse-server flag to installer for custom Pulse URLs
- Configure www-data group membership for Proxmox IPC access
UI and API cleanup:
- Remove unused "Ensure cluster keys" button from Settings
- Remove /api/diagnostics/temperature-proxy/ensure-cluster-keys endpoint
- Remove EnsureClusterKeys method from tempproxy client
The setup script already handles SSH key distribution during initial configuration,
making the manual refresh button redundant.
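A heavily hedged sketch of the IP discovery: pvecm status is the command named above, but this parsing of its "Membership information" table is illustrative, not the shipped proxy code:

```go
package main

import (
	"os/exec"
	"regexp"
	"strings"
)

func clusterMemberIPs() ([]string, error) {
	out, err := exec.Command("pvecm", "status").Output()
	if err != nil {
		return nil, err
	}
	ipRe := regexp.MustCompile(`\b\d{1,3}(?:\.\d{1,3}){3}\b`)
	var ips []string
	inMembers := false
	for _, line := range strings.Split(string(out), "\n") {
		if strings.HasPrefix(line, "Membership information") {
			inMembers = true
			continue
		}
		if inMembers {
			if ip := ipRe.FindString(line); ip != "" {
				ips = append(ips, ip) // corosync ring address, usually an IP
			}
		}
	}
	return ips, nil
}
```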