The 15-second timeout introduced to handle unavailable NFS storage was too aggressive and caused legitimate storage queries to time out on nodes with many storage backends or higher latency, which left storage missing from the display for those nodes.
Increased the timeout to 30 seconds as a better balance between responsiveness and reliability.
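A minimal sketch of what such a per-query deadline looks like in Go, assuming a plain net/http client (the helper and constant names are illustrative, not Pulse's actual code):

```go
package proxmox

import (
	"context"
	"fmt"
	"io"
	"net/http"
	"time"
)

// storageQueryTimeout bounds a single storage status request; 15s proved
// too short for nodes with many or slow storage backends.
const storageQueryTimeout = 30 * time.Second

// getStorageStatus is an illustrative helper, not the actual Pulse code.
func getStorageStatus(ctx context.Context, client *http.Client, url string) ([]byte, error) {
	// Derive a per-call deadline so one slow NFS mount cannot stall polling.
	ctx, cancel := context.WithTimeout(ctx, storageQueryTimeout)
	defer cancel()

	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return nil, fmt.Errorf("build storage request: %w", err)
	}
	resp, err := client.Do(req)
	if err != nil {
		return nil, fmt.Errorf("storage query: %w", err)
	}
	defer resp.Body.Close()
	return io.ReadAll(resp.Body)
}
```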
- Add debug logging to guest agent filesystem API responses
- Better handle Windows drive mountpoints (C:\, D:\, etc.)
- Improve empty filesystem list detection and logging
- Add specific handling for Windows filesystems that may report differently
This should help diagnose why some VMs with guest agents installed still show 0% or missing disk usage, particularly on Windows systems.
- handle PBS node status endpoint permission errors gracefully (return nil instead of an error for 403s)
- add required cf and timeframe parameters to RRD endpoint calls
- properly handle nil nodeStatus returns in monitor.go
These API calls now fail silently because PBS API tokens often lack the required permissions for these endpoints; that is expected behavior.
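A sketch of the "treat 403 as absent data" pattern described above; the client shape, method name, and response decoding are illustrative assumptions:

```go
package pbs

import (
	"encoding/json"
	"fmt"
	"net/http"
)

type Client struct {
	http    *http.Client
	baseURL string
}

type NodeStatus struct {
	CPU    float64 `json:"cpu"`
	Uptime int64   `json:"uptime"`
}

// GetNodeStatus returns (nil, nil) on 403: many PBS API tokens lack the
// audit permission on /nodes, so missing data is expected, not an error.
func (c *Client) GetNodeStatus() (*NodeStatus, error) {
	resp, err := c.http.Get(c.baseURL + "/api2/json/nodes/localhost/status")
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	if resp.StatusCode == http.StatusForbidden {
		return nil, nil // expected with limited tokens; caller must nil-check
	}
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("node status: unexpected status %d", resp.StatusCode)
	}
	var body struct {
		Data NodeStatus `json:"data"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&body); err != nil {
		return nil, err
	}
	return &body.Data, nil
}
```

Callers (monitor.go in this case) then have to treat a nil status as "no data", not as a failed node.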
The cluster client was incorrectly marking nodes as unhealthy when encountering
VM-specific QEMU guest agent errors. This caused storage and backup operations
to fail with "no healthy nodes available" even though the nodes were actually
accessible.
Changes:
- Added broader detection for guest agent errors in executeWithFailover
- Updated recovery logic to ignore VM-specific errors when recovering nodes
- Guest agent errors no longer affect node health status
This fixes the issue where users with clusters would see storage and backup
operations fail after any VM without a guest agent was queried.
- Add Available field to MemoryStatus struct to capture memory available for allocation
- Update node memory calculation to use Available memory when present
- This excludes non-reclaimable cache/buffers from used memory calculation
- Provides more accurate memory pressure indication, avoiding false alerts
- Falls back to traditional used memory if Available field is missing (older Proxmox versions)
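A minimal sketch of that fallback, assuming the field names of the Proxmox node status payload (the struct and method here are illustrative):

```go
package models

// MemoryStatus mirrors the Proxmox node memory fields relevant here.
type MemoryStatus struct {
	Total     uint64 `json:"total"`
	Used      uint64 `json:"used"`
	Available uint64 `json:"available,omitempty"` // absent on older Proxmox
}

// EffectiveUsed returns used memory excluding reclaimable cache/buffers
// when the kernel-reported Available figure is present.
func (m MemoryStatus) EffectiveUsed() uint64 {
	if m.Available > 0 && m.Available <= m.Total {
		return m.Total - m.Available
	}
	// Older Proxmox versions omit Available: fall back to raw used.
	return m.Used
}
```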
- Added disk polling to monitoring cycle using Proxmox API
- Created CheckDiskHealth() alert manager for failing drives and low SSD life
- Added PhysicalDisk model to state with proper serialization
- Implemented DiskList component with health indicators and SSD wearout bars
- Added Physical Disks tab to Storage page with toggle between pools and disks
- Added ZFS health badges to storage cards for degraded/failed pools
- Alerts trigger when health != PASSED or when SSD wearout drops below 10% (see the sketch below)
- Frontend displays disk model, type, temperature, and usage information
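The alert conditions, sketched with an illustrative PhysicalDisk shape (the field names are assumptions based on the bullets above):

```go
package alerts

import "fmt"

type PhysicalDisk struct {
	Model   string
	Health  string // SMART self-test result, e.g. "PASSED"
	Wearout int    // remaining SSD life in percent; -1 if not reported
}

// CheckDiskHealth raises an alert for any disk whose SMART status is not
// PASSED, or whose remaining SSD life has dropped below 10%.
func CheckDiskHealth(disks []PhysicalDisk) []string {
	var alerts []string
	for _, d := range disks {
		if d.Health != "PASSED" {
			alerts = append(alerts, fmt.Sprintf("%s: health %s", d.Model, d.Health))
		}
		if d.Wearout >= 0 && d.Wearout < 10 {
			alerts = append(alerts, fmt.Sprintf("%s: only %d%% SSD life left", d.Model, d.Wearout))
		}
	}
	return alerts
}
```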
- Always query guest agent for running VMs (cluster/resources API always returns 0)
- Show allocated disk size when guest agent unavailable (instead of misleading 0%)
- Fix duplicate mount point counting issue (#425)
- Add comprehensive logging for guest agent queries
- Include diagnostic script for troubleshooting VM disk issues
- Update both monitor.go and monitor_optimized.go for consistency
- Implement proper API integration with list and detail endpoints
- Add ZFS pool and device status conversion
- Enable by default with PULSE_DISABLE_ZFS_MONITORING opt-out
- Test with real Proxmox nodes and verify functionality
- Add comprehensive error handling and logging
- Document feature configuration and requirements
The feature now properly:
- Fetches ZFS pool status from Proxmox API
- Detects degraded/faulted pools and devices
- Tracks read/write/checksum errors
- Generates appropriate alerts
- Displays issues in the Storage tab UI
Tested and verified working with real Proxmox clusters.
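A compact sketch of the opt-out and the alert conditions, with assumed struct and function names:

```go
package monitor

import "os"

// Monitoring is on by default; setting PULSE_DISABLE_ZFS_MONITORING to
// any non-empty value opts out.
func zfsMonitoringEnabled() bool {
	return os.Getenv("PULSE_DISABLE_ZFS_MONITORING") == ""
}

// ZFSPool mirrors the fields the feature tracks; the struct is a sketch.
type ZFSPool struct {
	Name                           string
	State                          string // ONLINE, DEGRADED, FAULTED, ...
	ReadErrs, WriteErrs, CksumErrs uint64
}

// poolNeedsAlert implements the conditions above: any non-ONLINE state
// or any accumulated read/write/checksum errors.
func poolNeedsAlert(p ZFSPool) bool {
	return p.State != "ONLINE" ||
		p.ReadErrs > 0 || p.WriteErrs > 0 || p.CksumErrs > 0
}
```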
- Added debug mode: localStorage.setItem('debug-pmg', 'true')
- Robust VMID=0 detection handles string and number types
- Debug logging shows exactly what's happening with PMG backups
- Created test suite that verifies all PMG backup scenarios
- All test cases pass including PBS 'ct' type with VMID='0'
Users experiencing issues can enable debug mode to help diagnose:
1. Open browser console
2. Run: localStorage.setItem('debug-pmg', 'true')
3. Reload page and check for [PMG Debug] messages
4. Share debug output if still showing as LXC
Test results:
✓ PBS PMG backup (ct type with VMID 0) → Host
✓ PBS PMG backup (ct type with numeric VMID 0) → Host
✓ Storage PMG backup (host type) → Host
✓ Storage PMG backup (lxc type with VMID 0) → Host
✓ Regular LXC backup → LXC
- Handle VMID as both string and number types consistently
- Check for both 'ct' and 'lxc' backup types (PBS uses 'ct')
- Check for both 'vm' and 'qemu' backup types for consistency
- Always check VMID=0 first before checking backup type
- PBS stores PMG backups as 'ct' type with VMID='0' (string)
This should properly identify all PMG host config backups regardless
of whether they come from PBS or regular storage, and regardless
of whether VMID is a string or number.
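The decision logic is easy to state; the actual fix lives in the frontend TypeScript, but here it is sketched in Go for illustration, with VMID accepted as either a JSON string or a number:

```go
package backups

// classifyBackup follows the rules and test cases above: VMID 0 is
// checked first and always means a host config backup, then the backup
// type decides. Names here are illustrative.
func classifyBackup(vmid any, backupType string) string {
	if isZeroVMID(vmid) || backupType == "host" {
		return "Host"
	}
	switch backupType {
	case "ct", "lxc": // PBS reports containers as "ct"
		return "LXC"
	case "vm", "qemu":
		return "VM"
	}
	return "Unknown"
}

func isZeroVMID(vmid any) bool {
	switch v := vmid.(type) {
	case string:
		return v == "0"
	case float64: // encoding/json decodes JSON numbers to float64
		return v == 0
	case int:
		return v == 0
	}
	return false
}
```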
- Add PULSE_ENABLE_ZFS_MONITORING env var (disabled by default)
- Fix API field mapping (health vs state, cksum vs checksum)
- Add proper API endpoint structures for list and detail
- Mark feature as experimental due to API complexity
- Simplify conversion to handle basic health status only
This is a safer approach until we can fully test with real Proxmox nodes.
- Add ZFS pool status data structures to models
- Implement ZFS pool data collection via Proxmox API
- Add ZFS pool health alerts for degraded/faulted states
- Add ZFS device error detection and alerting
- Display ZFS pool status in Storage tab when issues detected
- Add mock data generation for testing ZFS monitoring
- Alert on read/write/checksum errors for pools and devices
The real issue was not the overall timeout duration, but that DNS resolution and TLS handshake could hang indefinitely. Added specific timeouts for:
- DNS resolution/connection: 10 seconds
- TLS handshake: 10 seconds
- Response headers: 10 seconds
This prevents the connection from hanging on DNS lookup (like with pve-backup.lan) or during TLS negotiation, which was causing the 'context deadline exceeded' errors. (addresses #424)
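In Go's net/http these phase-level limits map directly onto transport fields; a sketch with the values above (standard library usage, not a copy of the Pulse source):

```go
package main

import (
	"net"
	"net/http"
	"time"
)

func newTransport() *http.Transport {
	return &http.Transport{
		// Bound DNS resolution and TCP connect together.
		DialContext: (&net.Dialer{
			Timeout: 10 * time.Second,
		}).DialContext,
		// Bound the TLS negotiation separately.
		TLSHandshakeTimeout: 10 * time.Second,
		// Bound the wait for the first response bytes.
		ResponseHeaderTimeout: 10 * time.Second,
	}
}
```

With these in place, a hostname that never resolves fails within 10 seconds instead of hanging until the overall request deadline.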
- Made cluster health checks less aggressive to prevent false unhealthy states
- Fixed JSON unmarshal error when Proxmox returns object instead of array for VMFileSystem
- Increased initial health check timeouts from 2s to 5s for better reliability
- Added handling for JSON unmarshal errors as data format issues, not connectivity problems
- Improved recovery check interval from 5s to 10s to reduce excessive health checks
- Changed log levels from WARN to DEBUG for transient connectivity issues
- Reduced storage API timeout from 120s to 15s to prevent blocking when storage mounts are unavailable
- Added graceful error handling for storage timeouts - continues with partial data instead of failing
- Improved error messages to clarify when timeouts are likely due to unavailable storage (e.g., NFS mounts)
This prevents Pulse from marking nodes as unhealthy when storage endpoints timeout due to temporarily unavailable network storage.
- Use main host for cluster operations when node endpoints lack FQDNs/IPs
- Skip initial health check for single-endpoint clusters (main host routing)
- Return empty lists instead of errors when cluster nodes are unreachable
- Prevent VMs/containers from disappearing when cluster has connectivity issues
- Fix the 'Instance marked as cluster but is actually standalone' false warning
- Fixed issue where QEMU guest agent errors incorrectly marked nodes as unhealthy
- Nodes with VMs missing guest agents no longer affect cluster health status
- Reduced health check retry interval from 30s to 5s for faster recovery
- Storage and backup polling now works correctly even when some VMs lack guest agents
When a VM doesn't have the QEMU guest agent configured, Proxmox returns a 500 error.
This was incorrectly marking the entire cluster node as unhealthy, preventing
all operations on that node. Now we treat these as VM-specific errors that
don't affect node health status.
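A sketch of that VM-specific error detection, in the spirit of the broader matching added to executeWithFailover (the exact strings Pulse matches may differ):

```go
package cluster

import "strings"

// isGuestAgentError reports whether an error is specific to one VM's
// QEMU guest agent rather than to the node itself. Proxmox surfaces a
// missing or unresponsive agent as a 500 with a distinctive message.
func isGuestAgentError(err error) bool {
	if err == nil {
		return false
	}
	msg := strings.ToLower(err.Error())
	return strings.Contains(msg, "guest agent") ||
		strings.Contains(msg, "qemu guest agent is not running")
}
```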
- Storage queries can time out on large clusters or slow storage backends
- Extended timeout specifically for GetStorage, GetStorageContent, and GetAllStorage
- Preserves existing context deadlines if they're shorter than 120s
- Should resolve 'context deadline exceeded' errors during storage polling
- Added storage permission errors (403) to exception list
- Permission denied errors no longer mark nodes as unhealthy
- Storage polling can now continue even with permission issues
- Prevents cascading failures when storage permissions are missing
- Nodes remain healthy for VM/container operations even if storage fails
- Changed cluster client initialization to be optimistic (assume healthy)
- Nodes now start as healthy and are marked unhealthy only on actual failures
- This prevents the issue where all nodes were marked unhealthy during init
- Storage operations can now proceed even if initial health checks fail
- Allows recovery from temporary network or auth issues during startup
- Increased default HTTP client timeout from 30s to 60s
- Added CreateHTTPClientWithTimeout function to properly set custom timeouts
- Updated Proxmox and PBS clients to use configured timeout values
- Increased default connection timeout from 45s to 60s in config
This prevents "context deadline exceeded" errors when connecting to slow or overloaded Proxmox/PBS nodes.
addresses #379 - better handling of offline nodes in clusters
- Skip polling VMs/containers from offline nodes to avoid 595 errors
- Improved error message for 595 to distinguish between auth failures and offline node access
addresses #389 - improved error messaging
- Better detection of whether 595 is an auth issue or offline node issue
- Clearer error messages to help users diagnose the actual problem
The 595 error can occur when:
1. Authentication actually fails (wrong credentials)
2. Trying to access resources on an offline node through another node in the cluster
addresses #388 - LXC containers not showing due to VMID type mismatch
- Changed Container.VMID from int to FlexInt to handle string VMIDs from older Proxmox versions
- Updated all code that references Container.VMID to cast to int where needed
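A typical FlexInt, reconstructed from the description (the real type may differ in detail): it accepts a JSON number or a quoted numeric string and stores an int, which call sites then cast with int(...):

```go
package models

import (
	"bytes"
	"strconv"
)

type FlexInt int

func (f *FlexInt) UnmarshalJSON(data []byte) error {
	// Older Proxmox versions send vmid as "101" instead of 101:
	// strip quotes if present, then parse as an integer either way.
	data = bytes.Trim(data, `"`)
	n, err := strconv.Atoi(string(data))
	if err != nil {
		return err
	}
	*f = FlexInt(n)
	return nil
}
```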
addresses #389 - connection timeout errors with Proxmox nodes
- Increased default CONNECTION_TIMEOUT from 10s to 30s to handle slower networks
- This should resolve "context deadline exceeded" errors when polling nodes
addresses #379 - authentication errors may have been related to timeouts
Addresses #379 - Added clearer error messages for common authentication issues:
- 403 errors now explain that permissions must be set on the USER (not just the token) in Proxmox GUI
- 595 errors indicate authentication failure
- 401 errors indicate invalid credentials
While the setup script handles this correctly, these messages help users who manually configure permissions.
- Add expandable namespace rows to PBS instances table
- Show deduplication factor from PBS GC status (calculated from index-data-bytes/disk-bytes)
- Move deduplication display to bottom left of backup frequency chart
- Add namespace highlighting when filtered (blue background, filtering indicator)
- Fix backup frequency chart to properly handle PBS namespace filters
- Allow clicking namespace again to clear filter (toggle behavior)
- Improve visual feedback for selected namespaces with color changes
PBS doesn't expose the deduplication factor in its standard datastore status endpoint; calculating it properly would require garbage collection stats or chunk store data.
- capture deduplication_factor from PBS API datastore status endpoint
- display average deduplication ratio in backup frequency chart header
- shows as green 'Deduplication: X.X:1' when PBS datastores provide this data
- Add GetVMFSInfo method to fetch filesystem data from guest agent
- Integrate guest agent disk stats for VMs in both polling modes
- Aggregate real disk usage from all filesystems (skip special mounts)
- Fall back gracefully to allocated size when agent unavailable
- Add VM.Monitor permission to auto-negotiation script via PulseMonitor role
- Update frontend NodeModal with new permission instructions
VMs with QEMU guest agent now show actual disk usage like LXCs do.
Addresses #344
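A rough shape of the aggregation step: sum used and total bytes across the filesystems the agent reports, skipping pseudo-mounts. The struct tags follow the guest agent's get-fsinfo output; the skip list is an illustrative assumption:

```go
package monitor

import "strings"

type VMFileSystem struct {
	Mountpoint string `json:"mountpoint"`
	Type       string `json:"type"`
	UsedBytes  uint64 `json:"used-bytes"`
	TotalBytes uint64 `json:"total-bytes"`
}

// aggregateDiskUsage sums real usage across mounted filesystems,
// ignoring special mounts that would inflate the numbers.
func aggregateDiskUsage(fss []VMFileSystem) (used, total uint64) {
	for _, fs := range fss {
		if fs.TotalBytes == 0 || isSpecialMount(fs) {
			continue
		}
		used += fs.UsedBytes
		total += fs.TotalBytes
	}
	return used, total
}

func isSpecialMount(fs VMFileSystem) bool {
	switch fs.Type {
	case "tmpfs", "devtmpfs", "squashfs", "overlay":
		return true
	}
	return strings.HasPrefix(fs.Mountpoint, "/proc") ||
		strings.HasPrefix(fs.Mountpoint, "/sys")
}
```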
- Add helpful "No Proxmox VE nodes configured" message to Storage and Backup tabs
- Include "Go to Settings" button for easy navigation when no nodes exist
- Enhance network discovery for Docker environments with smart subnet detection
- Auto-detect Docker network configuration and scan appropriate subnets
- Add support for common Docker network ranges (172.16.0.0/12, 10.0.0.0/8)
- Improve discovery logging to show subnet being scanned
- Fix discovery API endpoint to properly return discovered servers
- Auto-detect Docker environment and scan common home/office subnets
- Scans 192.168.1.0/24, 192.168.0.0/24, 10.0.0.0/24, 192.168.88.0/24, 172.16.0.0/24
- Removes friction - nodes are discovered automatically without configuration
- DISCOVERY_SUBNET env var now optional (only for non-standard networks)
- Update documentation to reflect automatic discovery
This makes the first-run experience much smoother - users see their
Proxmox nodes immediately without having to figure out subnet configuration.
- Automatically hash plain text API tokens (SHA3-256) and passwords (bcrypt) when loaded from env vars
- Remove unnecessary PULSE_SETUP_TOKEN feature in favor of simpler env var approach
- Remove HandleInitialSetup endpoint - not needed with env var configuration
- Update authentication to always use hashed comparisons (no plain text warnings)
- Update documentation to clearly explain auto-hashing capability
- Maintain backward compatibility with pre-hashed credentials
This makes Pulse secure by default while keeping deployment simple - users can
provide plain text credentials via environment variables and Pulse automatically
hashes them for security.
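A sketch of the auto-hashing with the algorithms named above (SHA3-256 for API tokens, bcrypt for passwords); the "already hashed" checks are simplified assumptions, not Pulse's exact detection:

```go
package auth

import (
	"encoding/hex"
	"strings"

	"golang.org/x/crypto/bcrypt"
	"golang.org/x/crypto/sha3"
)

// HashTokenIfNeeded leaves pre-hashed tokens alone (64 hex chars looks
// like a SHA3-256 digest) and hashes plain text ones.
func HashTokenIfNeeded(token string) string {
	if len(token) == 64 {
		if _, err := hex.DecodeString(token); err == nil {
			return token // already a digest
		}
	}
	sum := sha3.Sum256([]byte(token))
	return hex.EncodeToString(sum[:])
}

// HashPasswordIfNeeded recognizes the bcrypt prefix and otherwise
// hashes the plain text value.
func HashPasswordIfNeeded(pw string) (string, error) {
	if strings.HasPrefix(pw, "$2a$") || strings.HasPrefix(pw, "$2b$") {
		return pw, nil // already bcrypt
	}
	h, err := bcrypt.GenerateFromPassword([]byte(pw), bcrypt.DefaultCost)
	if err != nil {
		return "", err
	}
	return string(h), nil
}
```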
- Fixed PBS API endpoint to use /nodes/localhost/status directly
- PBS always uses 'localhost' as the node name, not dynamic discovery
- Updated PBSCard to properly detect Docker instances by name
- Improved display for PBS instances without Sys.Audit permission
- PBS instances now correctly show CPU, memory, and uptime when available
Non-clustered Proxmox nodes were getting certificate verification errors
when Pulse tried to use the cluster/resources endpoint. Now checks if
the node is actually in a cluster before attempting efficient polling.
- Fix alternating zero I/O metrics by implementing rate caching for stale data from Proxmox
- Hardcode polling interval to 10 seconds (matching Proxmox cluster/resources update cycle)
- Remove polling interval settings from UI (no longer user-configurable)
- Implement efficient VM/container polling using single cluster/resources API call
- Remove 'Remove Password' feature (auth is now mandatory)
- Fix CSRF validation for Basic Auth (exempt from CSRF checks)
- Fix Generate API Token modal and authentication
- Remove redundant 'Active' status from Authentication section
- Remove Connection Timeout setting from frontend (backend-only)
- Clean up frontend console logging (reduce verbosity)
- Remove PBS polling interval setting (fixed at 10s)
- Add frontend rebuild detection to backend-watch script
- Improve first-run setup flow and error handling
- Cluster now handles offline nodes gracefully without marking endpoints unhealthy
- Fixed error 595 (node unreachable) not being treated as node-specific failure
- Added parallel health checks with shorter timeouts for better performance
- Fixed inconsistent border width on offline node cards (removed conflicting border-l-4)
- Switched to ring utility for consistent outline on offline/alert nodes
- Improved logout functionality with proper CSRF token handling
addresses #312, #315
- Add automatic HTTPS defaulting when no protocol specified
- Warn users when using HTTP for PBS (which requires HTTPS)
- Improve error messages to suggest HTTPS when HTTP fails
- Add UI hints about PBS requiring HTTPS on port 8007
- Fix placeholder to show correct default port for PBS
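A minimal sketch of the defaulting described above (the helper name and port handling are illustrative; IPv6 literals would need extra care):

```go
package config

import "strings"

// normalizePBSHost defaults to HTTPS (PBS requires it) and to port 8007
// when neither a scheme nor a port was specified.
func normalizePBSHost(host string) string {
	if !strings.Contains(host, "://") {
		host = "https://" + host
	}
	// Append the default PBS port if none follows the scheme.
	rest := strings.SplitN(host, "://", 2)[1]
	if !strings.Contains(rest, ":") {
		host += ":8007"
	}
	return host
}
```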
Frontend:
- Enhanced cluster vs standalone node visual distinction in Settings
- Added glassmorphic style to all toast notifications for consistency
- Fixed test connection in edit modal to use stored encrypted credentials
- Added batch credential modal for bulk node operations
- Added network discovery modal with auto-subnet detection
- Improved notification system with dual toast/notification support
- Added event bus for component communication
Backend:
- Fixed duplicate toast notifications during auto-registration
- Fixed PBS auto-registration token extraction from JSON output
- Added network discovery service with background scanning
- Improved cluster detection with actual cluster name from API
- Added helper function to reduce code duplication in cluster detection
- Fixed host URL normalization in auto-registration
- Enhanced PBS client token authentication parsing
Bug Fixes:
- Fixed stacking toast notifications creating visual bugs
- Fixed PBS authentication failures after auto-registration
- Fixed network discovery not finding Proxmox servers
- Fixed test connection for existing nodes with encrypted tokens
- Removed duplicate WebSocket broadcasts for auto-registration events
- Removed PBS summary card from Dashboard and Backups tabs (not needed)
- Fixed backup frequency chart to use local timezone instead of UTC
- Chart now properly includes today in the date range
- Dates display according to user's browser timezone
- Fix Docker persistence bug where config was saved to /etc/pulse instead of /data
- Fix Windows VM memory reporting with balloon drivers
- Add GetVMStatus method to get detailed VM info including balloon memory
- Update diagnostics endpoint to use correct config paths
Fixes #253 (Docker persistence)
Fixes #258 (Windows VM memory reporting)
- Parse user@realm from token name if provided in full format
- Better handle various token input formats
- Require user info for token auth (either in token name or user field)
- Fix realm defaulting logic for different auth types
- Add GetDataDir() function to respect PULSE_DATA_DIR environment variable
- Update all hardcoded /var/lib/pulse paths to use configurable data directory
- Fix circular import by moving GetDataDir to utils package
- Ensures Docker containers can properly persist configuration and alerts
- Replace all 'any' types with proper TypeScript types throughout the codebase
- Fix Record<string, any> to use specific types (AlertThresholds, unknown)
- Update logger methods to use 'unknown' instead of 'any' for parameters
- Fix type assertions to use proper types instead of 'as any'
- Update generic type defaults from 'any' to 'unknown'
- Fix WebSocket message types to use 'unknown' for optional data
- Move global Toast declaration to top level to fix TypeScript errors
- Comment out legacy PBS backup code that referenced non-existent fields
- Ensure all code follows TypeScript standards as documented in CLAUDE.md
All TypeScript compilation errors have been resolved and the codebase now
adheres to strict typing standards with no 'any' types remaining.