Cover RPM field handling (numeric, string, SSD, N/A, null, invalid),
invalid JSON error path, and unexpected type fallbacks for both
wearout and RPM fields.
Coverage: 50% → 95.5%
Test error handling for password authentication user format validation:
- Missing realm separator (no @)
- Empty user string
- Multiple @ symbols
Improves NewClient coverage from 74.2% to 83.9%.
Test error handling for JSON parsing edge cases:
- Invalid JSON syntax
- Unsupported field types (bool, array)
- Unparseable string values for total-bytes and used-bytes
Improves coverage from 83.3% to 94.4%.
- Test jobid fallback when id field is missing
- Test jobnum field takes precedence over ID parsing
- Test last_sync_duration and duration fields
- Test last-sync-duration fallback format
- Test next_sync and next-sync fallback formats
Coverage: 79.7% → 100%
Add 4 new test cases covering previously untested branches:
- Float zero exactly (0.0)
- Float negative zero (-0.0)
- Only escaped quotes becoming empty after trimming
- Quoted whitespace becoming empty after trimming
Coverage improved from 95.8% to 100%.
The error pattern `/storage/` only matched storage content endpoints
(`/storage/{name}/content`) but not the main storage list endpoint
(`/nodes/{node}/storage`).
This caused storage timeout errors like:
Get ".../nodes/pve-100-224/storage": context deadline exceeded
to incorrectly mark cluster nodes as unhealthy, even though the timeout
was due to a slow cross-node storage query, not actual node connectivity
issues.
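A minimal sketch of the widened path check (the helper name and its placement
are assumptions, not the actual Pulse code):

```go
package cluster

import "strings"

// isStorageTimeoutPath is an illustrative sketch of the widened pattern: both
// the content endpoint (/storage/{name}/content) and the node storage list
// (/nodes/{node}/storage) now count as storage requests, so a timeout there is
// attributed to a slow storage query rather than to node health.
func isStorageTimeoutPath(path string) bool {
	return strings.Contains(path, "/storage/") || strings.HasSuffix(path, "/storage")
}
```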
Fixes #754
Add 13 new test cases covering previously untested branches:
- float32 timestamp with a valid value (using a smaller value to stay within float32 precision)
- float32/float64 zero and negative values
- json.Number zero and negative values
- int32 and uint32 timestamp handling
- Invalid date format strings (no matching layout)
- Partial date strings
- Unsupported types (bool, slice)
Coverage improved from 93.8% to 100%.
Add 6 new test cases covering previously untested branches:
- float64 at MaxUint64 boundary (clamping behavior)
- float64 exceeding MaxUint64 (overflow protection)
- String with quoted "null" value
- String with quoted empty value ("")
- String with single quoted empty value ('')
- Invalid float parsing in scientific notation
Coverage improved from 92.3% to 97.4%.
Previously, recovery of unhealthy nodes was only triggered when ALL nodes
were unhealthy. This left individual degraded nodes stuck indefinitely,
since operations kept succeeding on the healthy nodes and never triggered
the recovery path.
Now recovery is attempted whenever any unhealthy nodes exist, allowing
clusters to recover individual nodes over time.
Also added:
- Panic-safe unlock/lock pattern using an anonymous function (sketched below)
- Refresh of both healthy and cooling endpoints after recovery
- Updated timestamp for accurate cooldown checks
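A minimal sketch of the panic-safe unlock/lock pattern, using illustrative
type, field, and method names rather than the actual Pulse code:

```go
package cluster

import (
	"context"
	"sync"
	"time"
)

type endpoint struct {
	healthy     bool
	lastChecked time.Time
}

type ClusterClient struct {
	mu sync.Mutex
}

// tryRecover assumes the caller already holds c.mu. The lock is released only
// around the slow connectivity check, and the deferred Lock runs even if the
// check panics, so the caller never observes an unlocked mutex.
func (c *ClusterClient) tryRecover(ctx context.Context, ep *endpoint, check func(context.Context) error) {
	var err error
	func() {
		c.mu.Unlock()
		defer c.mu.Lock()
		err = check(ctx)
	}()
	if err == nil {
		ep.healthy = true
		ep.lastChecked = time.Now() // keeps cooldown checks accurate
	}
}
```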
Related to #754
Covers both Proxmox API formats for the VM agent field, plus related handling:
- Integer format (older versions): direct int value
- Object format (Proxmox 8.3+): {enabled, available} fields
- Preference order: available > enabled > 0
- Invalid input handling defaults to 0
- Integration with VMStatus struct
Tests for Proxmox client authentication error handling:
- authHTTPError.Error: message formatting based on status code
(401/403 include status in message, others don't)
- shouldFallbackToForm: determines when to retry with form encoding
(triggers on 400/415, not on auth errors or server errors)
16 test cases covering all code paths.
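A hedged sketch of the behavior these tests exercise (shapes follow the
description above; the real signatures in Pulse may differ):

```go
package proxmox

import (
	"fmt"
	"net/http"
)

// authHTTPError carries the HTTP status and server message from a failed
// authentication request.
type authHTTPError struct {
	status  int
	message string
}

// Error includes the status code only for 401/403 responses; other codes
// return the message as-is.
func (e *authHTTPError) Error() string {
	if e.status == http.StatusUnauthorized || e.status == http.StatusForbidden {
		return fmt.Sprintf("authentication failed (status %d): %s", e.status, e.message)
	}
	return e.message
}

// shouldFallbackToForm retries with form encoding only when the server
// rejects the request encoding (400/415), never for auth or server errors.
func shouldFallbackToForm(status int) bool {
	return status == http.StatusBadRequest || status == http.StatusUnsupportedMediaType
}
```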
Tests were added by ADA run #97, but the commit was missed.
Covers: RaidZ types, log/cache/spare devices, nested mirrors,
ConvertToModelZFSPool, and struct field tests.
Test coverage for error detection and retry logic:
- extractStatusCode: 13 test cases for HTTP status code extraction
- isTransientRateLimitError: 17 test cases for rate limit detection
- isNotImplementedError: 14 test cases for 501 error detection
- isVMSpecificError: 16 test cases for VM-scoped errors
- calculateRateLimitBackoff: backoff timing verification
- isAuthError: 12 test cases for authentication errors
Coverage 35.5% → 37.3%
Comprehensive test coverage for JSON parsing helpers used in
replication job status parsing: stringFromAny, intFromAny,
boolFromAny, floatFromAny, parseReplicationTime, parseDurationSeconds,
parseHHMMSSToSeconds, and parseReplicationJob.
Coverage increased from 22.6% to 35.5%.
The previous fix (6db4ee7a) cleared stale error messages but didn't mark
endpoints as healthy again after successful operations. This caused
clusters to remain in "degraded" state permanently once any endpoint had
a temporary issue, even if all endpoints were actually working.
The fix now marks endpoints healthy in clearEndpointError() after
successful operations, ensuring degraded clusters recover automatically.
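A minimal sketch of the recovery path described above, with assumed field
names:

```go
package cluster

import "sync"

// Field names here are assumptions; only the shape matters for the sketch.
type ClusterClient struct {
	mu        sync.Mutex
	lastError map[string]error
	healthy   map[string]bool
}

// clearEndpointError now drops the cached error AND marks the endpoint
// healthy again, so a degraded cluster recovers automatically after the next
// successful operation.
func (c *ClusterClient) clearEndpointError(host string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	delete(c.lastError, host)
	c.healthy[host] = true
}
```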
Related to #659
Previously, errors stored in ClusterClient.lastError were only cleared
during initial health checks or when recovering unhealthy nodes. This
caused stale error messages to persist in the UI even after the
underlying issues were resolved.
The fix clears cached errors in two places:
1. After passing connectivity test in getHealthyClient()
2. After successful operation in executeWithFailover()
This ensures that once an endpoint starts working again, any previous
error messages are cleared from the UI without requiring a restart.
Related to #659, #754
Proxmox VE 9.x removed support for the "full" parameter in the
/nodes/{node}/qemu/{vmid}/status/current endpoint. When Pulse sent
GetVMStatus() requests with ?full=1, Proxmox responded with:
API error 400: {"errors":{"full":"property is not defined in schema..."}}
This caused the cluster client to mark ALL endpoints as unhealthy, which
cascaded into multiple failures:
- VM status checks failed
- Guest agent queries were blocked
- Filesystem data collection stopped working
- All Windows VMs showed disk:-1 (unknown) instead of actual disk usage
The fix removes the ?full=1 parameter, since Proxmox 9.x returns all data
by default. This maintains backward compatibility with older Proxmox
versions while fixing the issue in 9.x.
After this fix:
- Cluster endpoints are correctly marked as healthy
- Guest agent queries work properly
- Windows VMs report actual disk usage (e.g., 26% on C:\ drive)
- VM monitoring functions normally on Proxmox 9.x
Proxmox VE 9.x removed support for the 'ds' parameter in RRD endpoints
(/nodes/{node}/rrddata and /nodes/{node}/lxc/{vmid}/rrddata). When Pulse
sent RRD requests with ds=memused,memavailable,etc., Proxmox responded with:
API error 400: {"errors":{"ds":"property is not defined in schema..."}}
This caused cluster nodes to be repeatedly marked unhealthy, which cascaded
into storage polling failures showing 'All cluster endpoints are unhealthy'
even though the nodes were actually healthy and reachable.
Changes:
- Added check in cluster_client.go executeWithFailover to recognize the ds
parameter error as a capability issue rather than node health failure
(sketched below)
- Nodes with this error no longer get marked unhealthy
- Storage polling and other operations now succeed even when RRD calls fail
- The RRD data will be unavailable but core monitoring continues
This fix maintains backward compatibility with older Proxmox versions while
gracefully handling the API change in Proxmox 9.x.
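A minimal sketch of the capability check described above (the helper name and
error-string matching are assumptions):

```go
package cluster

import "strings"

// isDSParamUnsupported reports whether an error is Proxmox VE 9.x rejecting
// the legacy "ds" parameter on an RRD request. That is an API capability gap,
// not a sign the node is down, so failover should not mark it unhealthy.
func isDSParamUnsupported(err error) bool {
	if err == nil {
		return false
	}
	msg := err.Error()
	return strings.Contains(msg, `"ds"`) &&
		strings.Contains(msg, "property is not defined in schema")
}
```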
Updated the Quick Start for Docker section in TEMPERATURE_MONITORING.md to be
more user-friendly and address common setup issues:
- Added clear explanation of why the proxy is needed (containers can't access hardware)
- Provided concrete IP example instead of placeholder
- Showed full docker-compose.yml context with proper YAML structure
- Added sudo to commands where needed
- Updated docker-compose commands to v2 syntax with note about v1
- Expanded verification steps with clearer success indicators
- Added reminder to check container name in verification commands
These improvements should help users who encounter blank temperature displays
due to missing proxy installation or bind mount configuration.
Related to #656
Windows guest agents can return multiple directory mountpoints (C:\, C:\Users,
C:\Windows) all on the same physical drive. When the QEMU guest agent omits
disk[] metadata, commit 5325ef481 falls back to using the mountpoint string
as the disk identifier. This causes every Windows directory to be treated as
a separate disk, accumulating to inflated totals (e.g., 1TB reported for a
250GB drive).
Root cause:
The fallback logic in pkg/proxmox/client.go:1585-1594 assigns fs.Disk =
fs.Mountpoint when disk[] is missing. On Windows, every directory path is
unique, so the deduplication guard in
internal/monitoring/monitor_polling.go:619-635 never triggers, causing all
directories to be summed.
Changes:
- Detect Windows-style mountpoints (drive letter + colon + backslash)
- Normalize to drive root when disk[] is missing (e.g., C:\Users → C:)
- Preserve existing behavior for Linux/BSD and VMs with disk[] metadata
- Add debug logging for synthesized Windows drive identifiers
This fix maintains backward compatibility with commit 5325ef481 while
preventing the Windows directory accumulation issue. LXC containers are
unaffected as they use a different code path.
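A minimal sketch of the normalization, assuming an illustrative helper name:

```go
package proxmox

import "regexp"

// windowsDriveRe matches Windows-style mountpoints such as `C:\` or `C:\Users`.
var windowsDriveRe = regexp.MustCompile(`^([A-Za-z]):\\`)

// normalizeDiskID collapses any path on a Windows drive to the drive root, so
// C:\, C:\Users and C:\Windows all deduplicate to a single disk identifier
// when disk[] metadata is missing. Linux/BSD mountpoints pass through
// unchanged.
func normalizeDiskID(mountpoint string) string {
	if m := windowsDriveRe.FindStringSubmatch(mountpoint); m != nil {
		return m[1] + ":"
	}
	return mountpoint
}
```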
Related to #630
Proxmox 8.3+ changed the VM status API to return the `agent` field as an
object ({"enabled":1,"available":1}) instead of an integer (0 or 1). This
caused Pulse to incorrectly treat VMs as having no guest agent, resulting
in missing disk usage data (disk:-1) even when the guest agent was running
and functional.
The issue manifested as:
- VMs showing "Guest details unavailable" or missing disk data
- Pulse logs showing no "Guest agent enabled, querying filesystem info" messages
- `pvesh get /nodes/<node>/qemu/<vmid>/agent/get-fsinfo` working correctly
from the command line, confirming the agent was functional
Root cause:
The VMStatus struct defined `Agent` as an int field. When Proxmox 8.3+ sent
the new object format, JSON unmarshaling silently left the field at zero,
causing Pulse to skip all guest agent queries.
Changes:
- Created VMAgentField type with custom UnmarshalJSON to handle both formats:
* Legacy (Proxmox <8.3): integer (0 or 1)
* Modern (Proxmox 8.3+): object {"enabled":N,"available":N}
- Updated VMStatus.Agent from `int` to `VMAgentField`
- Updated all references to `detailedStatus.Agent` to use `.Agent.Value`
- The unmarshaler prioritizes the "available" field over "enabled" to ensure
we only query when the agent is actually responding
This fix maintains backward compatibility with older Proxmox versions while
supporting the new format introduced in Proxmox 8.3+.
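A minimal sketch of the custom unmarshaling described above (the exact Pulse
implementation may differ):

```go
package proxmox

import "encoding/json"

// VMAgentField accepts both the legacy integer form (0/1) and the
// Proxmox 8.3+ object form {"enabled":N,"available":N}, preferring
// "available" so we only query agents that actually respond.
type VMAgentField struct {
	Value int
}

func (a *VMAgentField) UnmarshalJSON(data []byte) error {
	// Legacy format (Proxmox <8.3): plain integer.
	var n int
	if err := json.Unmarshal(data, &n); err == nil {
		a.Value = n
		return nil
	}
	// Modern format (Proxmox 8.3+): object with enabled/available fields.
	var obj struct {
		Enabled   int `json:"enabled"`
		Available int `json:"available"`
	}
	if err := json.Unmarshal(data, &obj); err != nil {
		a.Value = 0 // unrecognized shape: treat as "no agent"
		return nil
	}
	if obj.Available != 0 {
		a.Value = obj.Available
		return nil
	}
	a.Value = obj.Enabled
	return nil
}
```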
Related to #553
## Problem
LXC containers showed inflated memory usage (e.g., 90%+ when actual usage was 50-60%,
96% when actual was 61%) because the code used the raw `mem` value from Proxmox's
`/cluster/resources` API endpoint. This value comes from cgroup `memory.current` which
includes reclaimable cache and buffers, making memory appear nearly full even when
plenty is available.
## Root Cause
- **Nodes**: Had sophisticated cache-aware memory calculation with RRD fallbacks
- **VMs (qemu)**: Had detailed memory calculation using guest agent meminfo
- **LXCs**: Naively used `res.Mem` directly without any cache-aware correction
The Proxmox cluster resources API's `mem` field for LXCs includes cache/buffers
(from cgroup memory accounting), which should be excluded for accurate "used" memory.
## Solution
Implement cache-aware memory calculation for LXC containers by:
1. Adding `GetLXCRRDData()` method to fetch RRD metrics for LXC containers from
`/nodes/{node}/lxc/{vmid}/rrddata`
2. Using RRD `memavailable` to calculate actual used memory (total - available)
3. Falling back to RRD `memused` if `memavailable` is not available
4. Only using cluster resources `mem` value as last resort
This matches the approach already used for nodes and VMs, providing consistent
cache-aware memory reporting across all resource types.
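A minimal sketch of this fallback chain, with assumed names and a simplified
RRD point shape:

```go
package monitoring

// rrdPoint is a simplified stand-in for a guest RRD sample.
type rrdPoint struct {
	MemTotal     float64
	MemAvailable float64
	MemUsed      float64
}

// computeLXCMemUsed prefers RRD "memavailable", then RRD "memused", and only
// then the raw cluster-resources value, which includes reclaimable cache and
// therefore overstates usage.
func computeLXCMemUsed(rrd *rrdPoint, clusterMem uint64) uint64 {
	if rrd != nil && rrd.MemTotal > 0 && rrd.MemAvailable > 0 {
		return uint64(rrd.MemTotal - rrd.MemAvailable) // cache-aware "used"
	}
	if rrd != nil && rrd.MemUsed > 0 {
		return uint64(rrd.MemUsed)
	}
	return clusterMem // last resort: cgroup value, includes cache/buffers
}
```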
## Changes
- Added `GuestRRDPoint` type and `GetLXCRRDData()` method to pkg/proxmox
- Added `GetLXCRRDData()` to ClusterClient for cluster-aware operations
- Modified LXC memory calculation in `pollPVEInstance()` to use RRD data when available
- Added guest memory snapshot recording for LXC containers
- Updated test stubs to implement the new interface method
## Testing
- Code compiles successfully
- Follows the same proven pattern used for nodes and VMs
- Includes diagnostic snapshot recording for troubleshooting
Related to #405
Enhances error reporting and logging when all cluster endpoints are
unhealthy, making it easier to diagnose connectivity issues.
Changes:
1. Enhanced error messages in cluster_client.go:
- Error now includes list of unreachable endpoints
- Added detailed logging when no healthy endpoints available
- Log at WARN level (not DEBUG) when cluster health check fails
- Better context in recovery attempts with start/completion summaries
2. Improved storage polling resilience in monitor_polling.go:
- Better error context when cluster storage polling fails
- Specific guidance for "no healthy nodes available" scenario
- Storage polling continues with direct node queries even if the
cluster-wide query fails (this already worked, but is now logged more
clearly)
3. Better recovery logging:
- Log when recovery attempts start with list of unhealthy endpoints
- Log individual recovery failures at DEBUG level
- Log recovery summary (success/failure counts)
- Track throttled endpoints separately for clearer diagnostics
These changes help users understand:
- Which specific endpoints are unreachable
- Whether it's a network/connectivity issue vs. API issue
- That Pulse will continue trying to recover endpoints automatically
- That storage monitoring continues via direct node queries
The root issue is that Pulse's internal health tracking can mark all
endpoints unhealthy when they're unreachable from the Pulse server,
even if Proxmox reports them as "online" in cluster status. Better
logging helps diagnose these network connectivity issues.
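A minimal sketch of the enriched "no healthy endpoints" error described in
change 1 above (names are assumptions):

```go
package cluster

import (
	"fmt"
	"strings"
)

// noHealthyEndpointsError lists the unreachable endpoints so users can see
// exactly which hosts Pulse could not reach, rather than a generic failure.
func noHealthyEndpointsError(unreachable []string) error {
	return fmt.Errorf("no healthy cluster endpoints available; unreachable: %s",
		strings.Join(unreachable, ", "))
}
```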