Storage that is disabled in Proxmox (Datacenter > Storage > enabled=no)
was incorrectly shown as "available" in Pulse. The cause was that the
Enabled/Active fields defaulted to true and were never populated from
the per-node API response.
Now the model correctly initializes Enabled/Active from the Proxmox
per-node storage API response, and the status determination prioritizes
checking the disabled state first.
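A minimal sketch of the intended behaviour; the type and field names are assumed for illustration rather than taken from the actual Pulse model:

```go
// Illustrative only; the real Pulse types differ.
type nodeStorageResponse struct {
	Storage string `json:"storage"`
	Enabled int    `json:"enabled"` // 0 when disabled under Datacenter > Storage
	Active  int    `json:"active"`  // 0 when not active on this node
}

type Storage struct {
	Name            string
	Enabled, Active bool
}

func toModel(r nodeStorageResponse) Storage {
	return Storage{
		Name:    r.Storage,
		Enabled: r.Enabled == 1, // initialized from the API instead of defaulting to true
		Active:  r.Active == 1,
	}
}

func storageStatus(s Storage) string {
	if !s.Enabled { // disabled state is checked first
		return "disabled"
	}
	if !s.Active {
		return "inactive"
	}
	return "available"
}
```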
Related to #796
- Replace custom maxInt64 helper with Go 1.21+ builtin max()
- Mark unused cfg parameter in newAdaptiveIntervalSelector
- Remove test for deleted helper function
- Use defer for tempCancel() to ensure context is always cancelled
- Remove redundant shouldCollect variable that was always true
- Fix indentation after removing the unnecessary conditional block
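For reference, the two Go idioms behind the max() and defer changes; the surrounding function is hypothetical:

```go
import (
	"context"
	"time"
)

func waitWithBudget(ctx context.Context, minInterval, backoff time.Duration) {
	// Go 1.21's builtin max() replaces the custom maxInt64 helper.
	interval := max(minInterval, backoff)

	// Deferring tempCancel() right after creation guarantees the temporary
	// context is cancelled on every return path, not only the happy path.
	tempCtx, tempCancel := context.WithTimeout(ctx, interval)
	defer tempCancel()

	<-tempCtx.Done() // stand-in for the work bounded by the temporary context
}
```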
Related to #630
Proxmox 8.3+ changed the VM status API to return the `agent` field as an
object ({"enabled":1,"available":1}) instead of an integer (0 or 1). This
caused Pulse to incorrectly treat VMs as having no guest agent, resulting
in missing disk usage data (disk:-1) even when the guest agent was running
and functional.
The issue manifested as:
- VMs showing "Guest details unavailable" or missing disk data
- Pulse logs showing no "Guest agent enabled, querying filesystem info" messages
- `pvesh get /nodes/<node>/qemu/<vmid>/agent/get-fsinfo` working correctly
from the command line, confirming the agent was functional
Root cause:
The VMStatus struct defined `Agent` as an int field. When Proxmox 8.3+ sent
the new object format, JSON unmarshaling silently left the field at zero,
causing Pulse to skip all guest agent queries.
Changes:
- Created VMAgentField type with custom UnmarshalJSON (sketched below) to handle both formats:
* Legacy (Proxmox <8.3): integer (0 or 1)
* Modern (Proxmox 8.3+): object {"enabled":N,"available":N}
- Updated VMStatus.Agent from `int` to `VMAgentField`
- Updated all references to `detailedStatus.Agent` to use `.Agent.Value`
- The unmarshaler prioritizes the "available" field over "enabled" to ensure
we only query when the agent is actually responding
This fix maintains backward compatibility with older Proxmox versions while
supporting the new format introduced in Proxmox 8.3+.
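A minimal sketch of the dual-format unmarshaler described above; exact handling in Pulse may differ, e.g. how a missing "available" key is treated:

```go
import "encoding/json"

// VMAgentField accepts both the legacy integer and the Proxmox 8.3+ object form.
type VMAgentField struct {
	Value int // non-zero when the guest agent should be queried
}

func (a *VMAgentField) UnmarshalJSON(data []byte) error {
	// Legacy format (Proxmox <8.3): plain integer 0 or 1.
	var n int
	if err := json.Unmarshal(data, &n); err == nil {
		a.Value = n
		return nil
	}
	// Modern format (Proxmox 8.3+): {"enabled":N,"available":N}.
	var obj struct {
		Enabled   int  `json:"enabled"`
		Available *int `json:"available"`
	}
	if err := json.Unmarshal(data, &obj); err != nil {
		return err
	}
	if obj.Available != nil {
		// Prefer "available" so we only query an agent that is actually responding.
		a.Value = *obj.Available
	} else {
		a.Value = obj.Enabled
	}
	return nil
}
```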
Resolves #641
## Problem
When a VM migrates between Proxmox nodes, Pulse was treating it as a new
resource and discarding custom alert threshold overrides. This occurred
because guest IDs included the node name (e.g., `instance-node-VMID`),
causing the ID to change when the VM moved to a different node.
Users reported that after migrating a VM, previously disabled alerts
(e.g., memory threshold set to 0) would resume firing.
## Root Cause
Guest IDs were constructed as:
- Standalone: `node-VMID`
- Cluster: `instance-node-VMID`
When a VM migrated from node1 to node2, the ID changed from
`instance-node1-100` to `instance-node2-100`, causing:
- Alert threshold overrides to be orphaned (keyed by old ID)
- Guest metadata (custom URLs, descriptions) to be orphaned
- Active alerts to reference the wrong resource ID
## Solution
Changed guest ID format to be stable across node migrations:
- New format: `instance-VMID` (for both standalone and cluster)
- Retains uniqueness across instances while being node-independent
- Allows VMs to migrate freely without losing configuration
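For illustration, the ID construction before and after (helper name hypothetical):

```go
import "fmt"

// Old, node-dependent ID: fmt.Sprintf("%s-%s-%d", instanceName, nodeName, vmid)
// changed whenever the guest migrated (e.g. "instance-node1-100" -> "instance-node2-100").

// New, stable ID: unique per instance but independent of the current node.
func buildGuestID(instanceName string, vmid int) string {
	return fmt.Sprintf("%s-%d", instanceName, vmid) // e.g. "instance-100"
}
```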
## Implementation
### Backend Changes
1. **Guest ID Construction** (`monitor_polling.go`):
- Simplified to always use `instance-VMID` format
- Removed node from the ID construction logic
2. **Alert Override Migration** (`alerts.go`):
   - Added lazy migration in `getGuestThresholds()` (sketched after this list)
- Detects legacy ID formats and migrates to new format
- Preserves user configurations automatically
3. **Guest Metadata Migration** (`guest_metadata.go`):
- Added `GetWithLegacyMigration()` helper method
- Called during VM/container polling to migrate metadata
- Preserves custom URLs and descriptions
4. **Active Alerts Migration** (`alerts.go`):
- Added migration logic in `LoadActiveAlerts()`
- Translates legacy alert resource IDs to new format
- Preserves alert acknowledgments across restarts
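A sketch of the lazy threshold migration from item 2; the map layout and helper name are assumptions, not the actual `alerts.go` code:

```go
import "fmt"

type Thresholds struct {
	CPU, Memory, Disk float64 // 0 disables the corresponding alert
}

// Look up by the stable ID first, then fall back to legacy IDs and rewrite
// them in place so existing overrides follow the guest after migration.
func lookupThresholds(overrides map[string]Thresholds, instance, node string, vmid int) (Thresholds, bool) {
	newID := fmt.Sprintf("%s-%d", instance, vmid)
	if t, ok := overrides[newID]; ok {
		return t, true
	}
	legacyIDs := []string{
		fmt.Sprintf("%s-%s-%d", instance, node, vmid), // cluster: instance-node-VMID
		fmt.Sprintf("%s-%d", node, vmid),              // standalone: node-VMID
	}
	for _, id := range legacyIDs {
		if t, ok := overrides[id]; ok {
			overrides[newID] = t
			delete(overrides, id)
			return t, true
		}
	}
	return Thresholds{}, false
}
```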
### Frontend Changes
5. **ID Construction Updates**:
- `ThresholdsTable.tsx`: Updated fallback from `instance-node-vmid` to `instance-vmid`
- `Dashboard.tsx`: Simplified guest ID construction
- `GuestRow.tsx`: Updated `buildGuestId()` helper
## Migration Strategy
- **Lazy Migration**: Configs are migrated as guests are discovered
- **Backwards Compatible**: Old IDs are detected and automatically converted
- **Zero Downtime**: No manual intervention required
- **Persisted**: Migrated configs are saved on next config write cycle
## Testing Recommendations
After deployment:
1. Verify existing alert overrides still apply
2. Test VM migration - confirm thresholds persist
3. Check guest metadata (custom URLs) survive migration
4. Verify active alerts maintain acknowledgment state
## Related
- Addresses similar issues with guest metadata and active alert tracking
- Lays groundwork for any future guest-specific configuration features
- Aligns with project philosophy: correctness and UX over implementation complexity
Related to #405
Enhances error reporting and logging when all cluster endpoints are
unhealthy, making it easier to diagnose connectivity issues.
Changes:
1. Enhanced error messages in cluster_client.go:
   - Error now includes the list of unreachable endpoints (sketched after this list)
- Added detailed logging when no healthy endpoints available
- Log at WARN level (not DEBUG) when cluster health check fails
- Better context in recovery attempts with start/completion summaries
2. Improved storage polling resilience in monitor_polling.go:
- Better error context when cluster storage polling fails
- Specific guidance for "no healthy nodes available" scenario
- Storage polling continues with direct node queries even if
cluster-wide query fails (already worked, but now clearer)
3. Better recovery logging:
- Log when recovery attempts start with list of unhealthy endpoints
- Log individual recovery failures at DEBUG level
- Log recovery summary (success/failure counts)
- Track throttled endpoints separately for clearer diagnostics
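A rough sketch of the enriched error from item 1; the types and message wording are illustrative:

```go
import (
	"fmt"
	"strings"
)

type Endpoint struct {
	Host    string
	Healthy bool
}

func pickHealthyEndpoint(endpoints []Endpoint) (Endpoint, error) {
	unreachable := make([]string, 0, len(endpoints))
	for _, ep := range endpoints {
		if ep.Healthy {
			return ep, nil
		}
		unreachable = append(unreachable, ep.Host)
	}
	// Name the endpoints so the log line shows exactly what could not be reached.
	return Endpoint{}, fmt.Errorf("no healthy nodes available; unreachable endpoints: %s",
		strings.Join(unreachable, ", "))
}
```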
These changes help users understand:
- Which specific endpoints are unreachable
- Whether it's a network/connectivity issue vs. API issue
- That Pulse will continue trying to recover endpoints automatically
- That storage monitoring continues via direct node queries
The root issue is that Pulse's internal health tracking can mark all
endpoints unhealthy when they're unreachable from the Pulse server,
even if Proxmox reports them as "online" in cluster status. Better
logging helps diagnose these network connectivity issues.
Related to #596
**Problem:**
Users were seeing persistent "permission denied" error messages for VMs
that simply didn't have qemu-guest-agent installed or running. The error
detection logic was too broad and classified Proxmox API 500 errors as
permission issues, even when they indicated guest agent unavailability.
**Root Cause:**
When qemu-guest-agent is not installed or not running, Proxmox API returns
various error responses (500, 403) that may contain permission-related text.
The previous error detection logic checked for "permission denied" strings
without considering the HTTP status code context, leading to:
- VMs with a running guest agent: guest details displayed correctly
- VMs without a guest agent: a false "Permission denied" error was shown
**Solution:**
Enhanced error classification logic to distinguish between:
1. Actual permission issues (401/403 with permission keywords)
2. Guest agent unavailability (500 errors)
3. Agent timeout issues
4. Other agent errors
The fix ensures that only explicit authentication/authorization errors
(401 Unauthorized, 403 Forbidden with permission keywords) are classified
as permission-denied, while API 500 errors are correctly identified as
agent-not-running issues.
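A sketch of the classification order; the permission keyword set shown is illustrative:

```go
import "strings"

func classifyAgentError(statusCode int, errStr string) string {
	msg := strings.ToLower(errStr)
	switch {
	// Most specific first: explicit auth/authz failures with permission wording.
	case (statusCode == 401 || statusCode == 403) &&
		(strings.Contains(msg, "permission") || strings.Contains(msg, "not authorized")):
		return "permission-denied"
	// A 500 from the agent endpoints means the guest agent is not installed or not running.
	case statusCode == 500:
		return "agent-not-running"
	case strings.Contains(msg, "timeout"):
		return "agent-timeout"
	default:
		return "agent-error"
	}
}
```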
**Changes:**
- Reordered error detection to check most specific patterns first
- Added HTTP status code context to permission error detection
- 500 errors now correctly map to "agent-not-running" status
- Only 401/403 errors with explicit permission keywords trigger "permission-denied"
- Improved log messages to guide users toward correct resolution
- Fixed err.Error() vs errStr variable inconsistency
**Impact:**
Users will now see accurate error messages that guide them to:
- Install qemu-guest-agent when it's missing (most common case)
- Check permissions only when there's an actual auth/authz issue
- Understand the difference between agent problems and permission problems
Implemented comprehensive state preservation to prevent temporary dropouts:
1. Node Grace Period (60s, sketched after this list):
- Track last-online timestamp for each Proxmox node
- Preserve online status during grace period to prevent flapping
- Applied to all node status checks throughout codebase
2. Efficient Polling Preservation:
- Detect when cluster/resources returns empty arrays
   - Preserve the previous VMs/containers if the instance had resources before
- Handles cluster health check failures gracefully
3. Traditional Polling Preservation:
- Updated preservation logic for per-node VM/container polling
   - Triggers when zero resources are returned, regardless of node response
- Fixed issue where nodes responding with empty data bypassed preservation
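A minimal sketch of the grace-period check from item 1 (names are hypothetical):

```go
import "time"

const nodeGracePeriod = 60 * time.Second

type nodeTracker struct {
	lastOnline map[string]time.Time
}

// effectiveOnline keeps a node "online" for up to 60s after its last successful
// poll, so a brief health-check failure does not flap the node status.
func (t *nodeTracker) effectiveOnline(node string, reportedOnline bool, now time.Time) bool {
	if reportedOnline {
		t.lastOnline[node] = now
		return true
	}
	last, ok := t.lastOnline[node]
	return ok && now.Sub(last) < nodeGracePeriod
}
```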
Root cause: Intermittent Proxmox cluster health failures ("no healthy nodes
available") caused both efficient and traditional polling to return empty
arrays, immediately clearing all VMs/containers from state.
Changes:
- internal/monitoring/monitor.go: Added node grace period, efficient polling preservation
- internal/monitoring/monitor_polling.go: Fixed traditional polling preservation logic
Fixes frequent UI flickering where vmCount/containerCount would briefly drop to zero.
This change addresses intermittent "Guest details unavailable" and "Disk stats
unavailable" errors affecting users with large VM deployments (50+ VMs) or
high-load Proxmox environments.
Changes:
- Increased default guest agent timeouts (3-5s → 10-15s) to better handle
environments under load
- Added automatic retry logic (1 retry by default) for transient timeout failures
- Made all timeouts and retry count configurable via environment variables:
* GUEST_AGENT_FSINFO_TIMEOUT (default: 15s)
* GUEST_AGENT_NETWORK_TIMEOUT (default: 10s)
* GUEST_AGENT_OSINFO_TIMEOUT (default: 10s)
* GUEST_AGENT_VERSION_TIMEOUT (default: 10s)
* GUEST_AGENT_RETRIES (default: 1)
- Added comprehensive documentation in VM_DISK_MONITORING.md with configuration
examples for different deployment scenarios
These improvements allow Pulse to gracefully handle intermittent API timeouts
without immediately displaying errors, while remaining configurable for
different network conditions and environment sizes.
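A sketch of the retry behaviour; the wrapper itself is illustrative, with the timeout and retry count taken from the variables listed above:

```go
import (
	"context"
	"time"
)

// Retry a guest agent query (GUEST_AGENT_RETRIES extra attempts) so a single
// transient timeout does not immediately surface as an error in the UI.
func queryWithRetry(ctx context.Context, retries int, timeout time.Duration,
	query func(context.Context) error) error {
	var err error
	for attempt := 0; attempt <= retries; attempt++ {
		qctx, cancel := context.WithTimeout(ctx, timeout)
		err = query(qctx)
		cancel()
		if err == nil {
			return nil
		}
	}
	return err
}
```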
Fixes: https://github.com/rcourtman/Pulse/discussions/592
Implements structured logging package with LOG_LEVEL/LOG_FORMAT env support, debug level guards for hot paths, enriched error messages with actionable context, and stack trace capture for production debugging. Improves observability and reduces log overhead in high-frequency polling loops.
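For illustration only, a level/format selection plus a hot-path debug guard using the standard library log/slog; Pulse's own logging package and option names may differ:

```go
import (
	"context"
	"log/slog"
	"os"
)

func newLogger() *slog.Logger {
	level := slog.LevelInfo
	if os.Getenv("LOG_LEVEL") == "debug" {
		level = slog.LevelDebug
	}
	opts := &slog.HandlerOptions{Level: level}
	var handler slog.Handler = slog.NewTextHandler(os.Stderr, opts)
	if os.Getenv("LOG_FORMAT") == "json" {
		handler = slog.NewJSONHandler(os.Stderr, opts)
	}
	return slog.New(handler)
}

// In high-frequency polling loops, guard debug output so formatting work is
// skipped entirely when the debug level is disabled.
func pollTick(ctx context.Context, log *slog.Logger, queueDepth int) {
	if log.Enabled(ctx, slog.LevelDebug) {
		log.Debug("poll tick", "queueDepth", queueDepth)
	}
}
```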
Replaces immediate polling with queue-based scheduling:
- TaskQueue with min-heap (container/heap) for NextRun-ordered execution
- Worker goroutines that block on WaitNext() until tasks are due
- Tasks only execute when NextRun <= now, respecting adaptive intervals
- Automatic rescheduling after execution via scheduler.BuildPlan
- Queue depth tracking for backpressure-aware interval adjustments
- Upsert semantics for updating scheduled tasks without duplicates
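A condensed sketch of the heap ordering and blocking wait; the real TaskQueue also handles upserts, wake-ups for newly queued tasks, and shutdown:

```go
import (
	"container/heap"
	"time"
)

type Task struct {
	ID      string
	NextRun time.Time
}

// taskHeap keeps the earliest-due task at the root (min-heap on NextRun).
type taskHeap []*Task

func (h taskHeap) Len() int           { return len(h) }
func (h taskHeap) Less(i, j int) bool { return h[i].NextRun.Before(h[j].NextRun) }
func (h taskHeap) Swap(i, j int)      { h[i], h[j] = h[j], h[i] }
func (h *taskHeap) Push(x any)        { *h = append(*h, x.(*Task)) }
func (h *taskHeap) Pop() any {
	old := *h
	t := old[len(old)-1]
	*h = old[:len(old)-1]
	return t
}

// waitNext blocks until the earliest task is due, then pops and returns it.
func waitNext(h *taskHeap) *Task {
	for {
		if h.Len() == 0 {
			time.Sleep(100 * time.Millisecond)
			continue
		}
		if d := time.Until((*h)[0].NextRun); d > 0 {
			time.Sleep(d)
			continue
		}
		return heap.Pop(h).(*Task)
	}
}
```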
Task 6 of 10 complete (60%). Ready for error/backoff policies.
Adds freshness metadata tracking for all monitored instances:
- StalenessTracker with per-instance last success/error/mutation timestamps
- Change hash detection using SHA1 for detecting data mutations
- Normalized staleness scoring (0-1 scale) based on age vs maxStale
- Integration with PollMetrics for authoritative last-success data
- Wired into all poll functions (PVE/PBS/PMG) via UpdateSuccess/UpdateError
- Connected to scheduler as StalenessSource implementation
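A condensed sketch of the per-instance bookkeeping (field names are illustrative):

```go
import (
	"crypto/sha1"
	"math"
	"time"
)

type instanceFreshness struct {
	LastSuccess  time.Time
	LastError    time.Time
	LastMutation time.Time
	lastHash     [sha1.Size]byte
}

// UpdateSuccess records a successful poll and, when the serialized payload
// changed since the previous poll, bumps the mutation timestamp.
func (f *instanceFreshness) UpdateSuccess(payload []byte, now time.Time) {
	f.LastSuccess = now
	if h := sha1.Sum(payload); h != f.lastHash {
		f.lastHash = h
		f.LastMutation = now
	}
}

// Staleness returns a normalized 0-1 score: 0 right after a success,
// 1 once the data is at least maxStale old (or has never succeeded).
func (f *instanceFreshness) Staleness(now time.Time, maxStale time.Duration) float64 {
	if maxStale <= 0 || f.LastSuccess.IsZero() {
		return 1
	}
	score := float64(now.Sub(f.LastSuccess)) / float64(maxStale)
	return math.Min(1, math.Max(0, score))
}
```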
Task 4 of 10 complete. Ready for adaptive interval logic.