Proxmox status.Mem includes page cache as "used" memory, inflating
reported VM usage. The existing fallbacks (balloon meminfo, RRD, linked
host agent) were frequently unavailable, causing most VMs to fall
through to the inflated status-mem source.
Adds a new last-resort fallback that reads /proc/meminfo via the QEMU
guest agent file-read endpoint to get accurate MemAvailable. Results
are cached (60s positive, 5min negative backoff for unsupported VMs).
Also fixes: RRD memavailable fallback missing from traditional polling
path, cache key collisions in multi-PVE setups, FreeMem underflow
guard inconsistency, and integer overflow in kB-to-bytes conversion.
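A minimal sketch of the MemAvailable parsing step, assuming the guest agent file-read endpoint returns raw /proc/meminfo text (the helper name and the placement of the overflow guard are illustrative):

```go
import (
	"math"
	"strconv"
	"strings"
)

// parseMemAvailable extracts MemAvailable from raw /proc/meminfo text and
// converts kB to bytes. Illustrative helper; the real parser and its
// 60s/5min caching live in the monitor's guest-agent fallback path.
func parseMemAvailable(meminfo string) (uint64, bool) {
	for _, line := range strings.Split(meminfo, "\n") {
		if !strings.HasPrefix(line, "MemAvailable:") {
			continue
		}
		fields := strings.Fields(line) // e.g. ["MemAvailable:", "16323412", "kB"]
		if len(fields) < 2 {
			return 0, false
		}
		kb, err := strconv.ParseUint(fields[1], 10, 64)
		if err != nil {
			return 0, false
		}
		if kb > math.MaxUint64/1024 {
			return math.MaxUint64, true // guard the kB-to-bytes overflow noted above
		}
		return kb * 1024, true
	}
	return 0, false
}
```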
Backport of v6 commits a87c9950 and 347d7db1.
Part 1 (a87c9950): Wrap the four guest agent c.get() errors with
fmt.Errorf("guest agent ...: %w", err) so isVMSpecificError() correctly
scopes them to the VM rather than the cluster endpoint.
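Schematically, the wrapping and why it matters (the operation name and the isVMSpecificError check shown here are simplified stand-ins, not the real implementations):

```go
import (
	"errors"
	"fmt"
	"strings"
)

// wrapGuestAgentErr mirrors the pattern: prefix the error so it can be
// recognized as VM-scoped, and use %w so the original cause stays inspectable.
func wrapGuestAgentErr(op string, err error) error {
	return fmt.Errorf("guest agent %s: %w", op, err)
}

// isVMSpecificError is shown here only to illustrate why the prefix matters;
// the real check covers more cases than this single marker.
func isVMSpecificError(err error) bool {
	return err != nil && strings.Contains(err.Error(), "guest agent")
}

func example() {
	cause := errors.New("QEMU guest agent is not running")
	err := wrapGuestAgentErr("get-fsinfo", cause)
	_ = isVMSpecificError(err) // true: scoped to the VM, not the endpoint
	_ = errors.Is(err, cause)  // true: %w keeps the error chain intact
}
```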
Part 2 (347d7db1): Replace the 20+ pattern blocklist in
executeWithFailover with an allowlist via isEndpointConnectivityError().
Only true TCP/DNS/TLS failures mark an endpoint unhealthy. Any HTTP
response from Proxmox — including 500 — proves the node is reachable
and returns the error without affecting endpoint health.
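A sketch of what an allowlist along these lines can look like; the real isEndpointConnectivityError() may check additional error types:

```go
import (
	"crypto/x509"
	"errors"
	"net"
)

// isEndpointConnectivityError reports whether err means the endpoint itself
// is unreachable. Anything else — including any HTTP response from Proxmox —
// proves the node answered and must not mark the endpoint unhealthy.
func isEndpointConnectivityError(err error) bool {
	if err == nil {
		return false
	}
	var dnsErr *net.DNSError
	var opErr *net.OpError
	var unknownCA x509.UnknownAuthorityError
	var hostErr x509.HostnameError
	switch {
	case errors.As(err, &dnsErr), errors.As(err, &opErr):
		return true // DNS resolution or TCP dial failure
	case errors.As(err, &unknownCA), errors.As(err, &hostErr):
		return true // TLS verification failure
	}
	return false
}
```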
Linux VM page cache (#1270): QEMU VM memory now falls back to Proxmox
RRD's memavailable metric (which excludes reclaimable page cache) when
the qemu-guest-agent doesn't provide MemInfo.Available. Previously the
fallback was detailedStatus.Mem (total - MemFree), inflating usage to
80%+ on VMs with normal Linux page cache. Mirrors the existing LXC
rrd-memavailable path.
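The resulting fallback order, reduced to a sketch (all values in bytes; the function and parameter names are illustrative):

```go
// effectiveMemUsed picks the most accurate "used" figure available, in the
// fallback order described above.
func effectiveMemUsed(maxMem, statusMem, agentAvailable, rrdAvailable uint64) uint64 {
	switch {
	case agentAvailable > 0 && agentAvailable <= maxMem:
		return maxMem - agentAvailable // guest agent MemInfo.Available
	case rrdAvailable > 0 && rrdAvailable <= maxMem:
		return maxMem - rrdAvailable // RRD memavailable, excludes reclaimable page cache
	default:
		return statusMem // last resort: status.Mem, which counts page cache as used
	}
}
```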
FreeBSD ZFS ARC (#1264, #1051): The host agent now reads
kstat.zfs.misc.arcstats.size via SysctlRaw on FreeBSD and subtracts
the ARC size from reported memory usage. ZFS ARC is reclaimable under
memory pressure (like Linux SReclaimable) but gopsutil counts it as
wired/non-reclaimable, causing false 90%+ memory alerts on TrueNAS
and FreeBSD hosts. Build-tagged so it compiles cleanly on all platforms.
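A build-tagged sketch of the ARC read, assuming golang.org/x/sys/unix; the package and function names are illustrative:

```go
//go:build freebsd

package sysinfo // illustrative package name

import (
	"encoding/binary"

	"golang.org/x/sys/unix"
)

// zfsARCSize returns the current ZFS ARC size in bytes, or 0 when the sysctl
// is unavailable (no ZFS). Callers subtract it from "used" memory before
// computing usage, since the ARC shrinks under memory pressure.
func zfsARCSize() uint64 {
	raw, err := unix.SysctlRaw("kstat.zfs.misc.arcstats.size")
	if err != nil || len(raw) < 8 {
		return 0
	}
	return binary.NativeEndian.Uint64(raw) // kstat counters are native-endian uint64 (Go 1.21+)
}
```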
Fixes #1270
Fixes #1264
Fixes #1051
(cherry picked from commit 94502f83ff9ffc6da28aaadc946a2f7d8b4e9bac)
- Add PendingUpdates and PendingUpdatesCheckedAt fields to Node model
- Add GetNodePendingUpdates method to Proxmox client (calls /nodes/{node}/apt/update)
- Add 30-minute polling cache to avoid excessive API calls
- Add pendingUpdates to frontend Node type
- Add color-coded badge in NodeSummaryTable (yellow: 1-9, orange: 10+)
- Update test stubs for interface compliance
Requires Sys.Audit permission on Proxmox API token to read apt updates.
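The 30-minute polling cache mentioned above could be as simple as the following sketch (field names are assumptions):

```go
import (
	"sync"
	"time"
)

// pendingUpdatesCache throttles calls to /nodes/{node}/apt/update, which is
// comparatively expensive and requires Sys.Audit.
type pendingUpdatesCache struct {
	mu      sync.Mutex
	checked map[string]time.Time // node name -> last successful check
	ttl     time.Duration        // 30 * time.Minute
}

// shouldCheck reports whether the node is due for a fresh apt/update query
// and records the check time if so.
func (c *pendingUpdatesCache) shouldCheck(node string, now time.Time) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	if last, ok := c.checked[node]; ok && now.Sub(last) < c.ttl {
		return false // still inside the 30-minute window
	}
	c.checked[node] = now
	return true
}
```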
- Add comprehensive tests for internal/api/config_handlers.go (Phases 1-3)
- Improve test coverage for AI tools, chat service, and session management
- Enhance alert and notification tests (ResolvedAlert, Webhook)
- Add frontend unit tests for utils (searchHistory, tagColors, temperature, url)
- Add Proxmox client API tests
Previously, when some cluster endpoints were unreachable (e.g., backup
nodes intentionally offline), the cluster was marked as "degraded" even
though the Proxmox cluster itself was healthy and had quorum.
Now the connection health check queries the Proxmox cluster's actual
quorum status. A cluster is only marked "degraded" if it has lost
quorum (not enough votes for consensus), which is the actual indicator
of cluster instability.
This means:
- Cluster with quorum + some nodes offline = "healthy"
- Cluster without quorum = "degraded" (warning)
- All endpoints down = "error"
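Reduced to a sketch, the mapping looks like this (the function name and status strings are illustrative):

```go
// clusterHealth maps endpoint reachability and quorum state to the health
// value shown in the UI, per the rules above.
func clusterHealth(reachableEndpoints int, hasQuorum bool) string {
	switch {
	case reachableEndpoints == 0:
		return "error" // no endpoint answered at all
	case !hasQuorum:
		return "degraded" // lost quorum: the real indicator of instability
	default:
		return "healthy" // quorate, even with some endpoints offline
	}
}
```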
Fixes #1085
When discovering cluster nodes, Pulse now automatically prefers IPs
on the same subnet as the initial connection. This fixes the common
issue where Pulse used internal cluster network IPs (e.g., 172.x.x.x)
instead of management network IPs (e.g., 10.x.x.x).
How it works:
1. Extract subnet from initial connection URL (assumes /24 for IPv4)
2. For each discovered node, query /nodes/{node}/network for all IPs
3. If cluster-reported IP is on a different subnet, find an IP on
the preferred subnet and set it as IPOverride
4. Manual IPOverride settings are preserved and take precedence
This eliminates the need for manual IPOverride configuration in most
multi-network Proxmox setups.
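A sketch of the subnet comparison from steps 1 and 3, assuming the /24 heuristic noted above (the helper name is illustrative; the real code also walks /nodes/{node}/network to collect candidate IPs):

```go
import "net"

// sameSubnet reports whether candidate is on the same /24 as the host used
// for the initial connection (IPv4 only).
func sameSubnet(initialHost, candidate string) bool {
	a := net.ParseIP(initialHost).To4()
	b := net.ParseIP(candidate).To4()
	if a == nil || b == nil {
		return false
	}
	mask := net.CIDRMask(24, 32) // /24 assumption from step 1
	return a.Mask(mask).Equal(b.Mask(mask))
}
```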
Refs #929, #1066
Changes:
1. Add MAX_POLL_TIMEOUT env var for large Proxmox clusters that need
more than 3 minutes for polling (default: 3m, minimum: 30s)
2. Handle external Ceph storage gracefully - don't mark nodes unhealthy
when Proxmox returns 'binary not installed' (e.g., for Ceph not
managed by Proxmox)
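A sketch of the MAX_POLL_TIMEOUT handling from (1), assuming the value is parsed as a Go duration string such as "5m" (the accepted format is an assumption):

```go
import (
	"os"
	"time"
)

// pollTimeout returns the polling timeout: MAX_POLL_TIMEOUT if set and valid,
// clamped to the 30s minimum, otherwise the 3m default.
func pollTimeout() time.Duration {
	const (
		defaultTimeout = 3 * time.Minute
		minTimeout     = 30 * time.Second
	)
	raw := os.Getenv("MAX_POLL_TIMEOUT")
	if raw == "" {
		return defaultTimeout
	}
	d, err := time.ParseDuration(raw)
	if err != nil {
		return defaultTimeout
	}
	if d < minTimeout {
		return minTimeout
	}
	return d
}
```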
Related to #965
- Add comprehensive debug logging to diagnose replication status fetch failures
- Handle both array and single-object response formats from Proxmox API
- Log raw response body for easier debugging
- Log success/failure for each enrichment step
This helps diagnose issue #992 where replication last/next sync times aren't
showing. The logging will reveal if the API call is failing, returning empty
data, or returning data in an unexpected format.
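A sketch of the array-vs-object normalization, using a simplified map-based representation (the helper name is illustrative):

```go
import (
	"encoding/json"
	"fmt"
)

// normalizeReplicationData accepts either a JSON array or a single object in
// the response "data" field and always returns a slice.
func normalizeReplicationData(raw json.RawMessage) ([]map[string]interface{}, error) {
	var list []map[string]interface{}
	if err := json.Unmarshal(raw, &list); err == nil {
		return list, nil
	}
	var single map[string]interface{}
	if err := json.Unmarshal(raw, &single); err != nil {
		return nil, fmt.Errorf("unexpected replication response format: %w", err)
	}
	return []map[string]interface{}{single}, nil
}
```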
Related to #992
Ensures that LinkedHostAgentId, CommandsEnabled, IsLegacy, and LinkedNodeId
are correctly propagated to the frontend. This prevents regressions of the
bugs fixed for #952 and #971.
- Add persistent volume mounts for Go/npm caches (faster rebuilds)
- Add shell config with helpful aliases and custom prompt
- Add comprehensive devcontainer documentation
- Add pre-commit hooks for Go formatting and linting
- Use go-version-file in CI workflows instead of hardcoded versions
- Simplify docker compose commands with --wait flag
- Add gitignore entries for devcontainer auth files
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The /cluster/replication endpoint only returns job configuration (guest,
schedule, source, target), not status data (last_sync, next_sync,
duration, fail_count, state).
This fix enriches each replication job with status from the per-node
endpoint /nodes/{node}/replication/{id}/status to get timing and state
data needed for proper UI display.
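The enrichment loop, reduced to a sketch; the job struct and client interface below are simplified stand-ins for the real types:

```go
import "context"

type replicationJob struct {
	ID, Node, State              string
	LastSync, NextSync, Duration int64
	FailCount                    int
}

type statusFetcher interface {
	// wraps GET /nodes/{node}/replication/{id}/status
	getReplicationStatus(ctx context.Context, node, id string) (replicationJob, error)
}

// enrichReplicationJobs merges per-node status into the config-only jobs
// returned by /cluster/replication.
func enrichReplicationJobs(ctx context.Context, c statusFetcher, jobs []replicationJob) {
	for i := range jobs {
		st, err := c.getReplicationStatus(ctx, jobs[i].Node, jobs[i].ID)
		if err != nil {
			continue // a failed status lookup must not drop the job or fail the poll
		}
		jobs[i].LastSync, jobs[i].NextSync = st.LastSync, st.NextSync
		jobs[i].Duration, jobs[i].FailCount = st.Duration, st.FailCount
		jobs[i].State = st.State
	}
}
```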
Added integration tests to verify:
- Status endpoint is called and data is merged correctly
- Graceful handling when status endpoint fails
Fixes #992
PBS storage content queries with encrypted backups can take 10-20+ seconds
to enumerate. The previous 30s timeout was causing intermittent failures
when polling backup data from PBS storage configured in PVE.
This increases the timeout to 60s to accommodate slow PBS backends while
still preventing indefinite hangs on unavailable NFS/network storage.
When a PVE cluster has unique self-signed certificates on each node, Pulse
would mark secondary nodes as unhealthy because only the primary node's
fingerprint was used for all connections.
Now, during cluster discovery, Pulse captures each node's TLS fingerprint
and uses it when connecting to that specific node. This enables
"Trust On First Use" (TOFU) for clusters with unique per-node certs.
Changes:
- Add Fingerprint field to ClusterEndpoint config
- Add FetchFingerprint() to tlsutil for capturing node certs
- validateNodeAPI() now captures and returns fingerprints during discovery
- NewClusterClient() accepts endpointFingerprints map for per-node certs
- All client creation paths use per-endpoint fingerprints when available
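A sketch of the fingerprint capture behind FetchFingerprint(), assuming a plain TLS dial with verification disabled for the TOFU read (naming and timeout handling are simplified):

```go
import (
	"crypto/sha256"
	"crypto/tls"
	"encoding/hex"
	"fmt"
	"strings"
)

// fetchFingerprint connects to addr (host:port), accepts whatever certificate
// the node presents, and returns its SHA-256 fingerprint in Proxmox's
// colon-separated uppercase form.
func fetchFingerprint(addr string) (string, error) {
	// InsecureSkipVerify is intentional: this is the Trust-On-First-Use capture.
	conn, err := tls.Dial("tcp", addr, &tls.Config{InsecureSkipVerify: true})
	if err != nil {
		return "", err
	}
	defer conn.Close()
	certs := conn.ConnectionState().PeerCertificates
	if len(certs) == 0 {
		return "", fmt.Errorf("no certificate presented by %s", addr)
	}
	sum := sha256.Sum256(certs[0].Raw)
	hexStr := hex.EncodeToString(sum[:])
	parts := make([]string, 0, len(hexStr)/2)
	for i := 0; i < len(hexStr); i += 2 {
		parts = append(parts, strings.ToUpper(hexStr[i:i+2]))
	}
	return strings.Join(parts, ":"), nil
}
```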
Related to #879
- Add sanitizeEndpointError() to transform raw Go errors into user-friendly messages
- Transform 'context deadline exceeded' into helpful messages mentioning possible causes
- Storage timeout errors now suggest checking PBS/NFS/Ceph backend connectivity
- Connection refused, certificate errors, and auth errors get actionable hints
- Apply sanitization everywhere cluster endpoint lastError is stored
- Add comprehensive tests for all error transformations
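A sketch covering two of the transformations; the real sanitizeEndpointError() handles more patterns and uses its own wording:

```go
import (
	"context"
	"errors"
	"strings"
)

// sanitizeEndpointError turns raw transport errors into something an operator
// can act on before the message is stored as the endpoint's lastError.
func sanitizeEndpointError(err error) string {
	if err == nil {
		return ""
	}
	msg := err.Error()
	switch {
	case errors.Is(err, context.DeadlineExceeded):
		if strings.Contains(msg, "/storage") {
			return "Storage query timed out - check PBS/NFS/Ceph backend connectivity"
		}
		return "Request timed out - the node may be overloaded or a storage backend may be slow"
	case strings.Contains(msg, "connection refused"):
		return "Connection refused - is the Proxmox API (port 8006) reachable from Pulse?"
	default:
		return msg
	}
}
```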
- Add 'content' type to StreamDisplayEvent for tracking text chunks
- Track content events in streamEvents array for chronological display
- Update render to use Switch/Match for cleaner conditional rendering
- Interleave thinking, tool calls, and content as they stream in
- Add fallback for old messages without streamEvents for backwards compat
Previously, tool/command outputs stayed at the top while AI text responses
accumulated at the bottom. Now all events appear in order like a
normal chatbot.
Cover RPM field handling (numeric, string, SSD, N/A, null, invalid),
invalid JSON error path, and unexpected type fallbacks for both
wearout and RPM fields.
Coverage: 50% → 95.5%
Test error handling for password authentication user format validation:
- Missing realm separator (no @)
- Empty user string
- Multiple @ symbols
Improves NewClient coverage from 74.2% to 83.9%.
Test error handling for JSON parsing edge cases:
- Invalid JSON syntax
- Unsupported field types (bool, array)
- Unparseable string values for total-bytes and used-bytes
Improves coverage from 83.3% to 94.4%.
- Test jobid fallback when id field is missing
- Test jobnum field takes precedence over ID parsing
- Test last_sync_duration and duration fields
- Test last-sync-duration fallback format
- Test next_sync and next-sync fallback formats
Coverage: 79.7% → 100%
Add 4 new test cases covering previously untested branches:
- Float zero exactly (0.0)
- Float negative zero (-0.0)
- Only escaped quotes becoming empty after trimming
- Quoted whitespace becoming empty after trimming
Coverage improved from 95.8% to 100%.
The error pattern `/storage/` only matched storage content endpoints
(`/storage/{name}/content`) but not the main storage list endpoint
(`/nodes/{node}/storage`).
This caused storage timeout errors like:
Get ".../nodes/pve-100-224/storage": context deadline exceeded
to incorrectly mark cluster nodes as unhealthy, even though the timeout
was due to a slow cross-node storage query, not actual node connectivity
issues.
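The broadened match, roughly (the helper name is illustrative):

```go
import "strings"

// isStorageQuery reports whether a failed request path is a storage query,
// covering both the content endpoint (/storage/{name}/content) and the
// per-node storage list (/nodes/{node}/storage). Timeouts on these paths
// should not count against node health.
func isStorageQuery(path string) bool {
	return strings.Contains(path, "/storage/") || strings.HasSuffix(path, "/storage")
}
```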
Fixes #754
Add 13 new test cases covering previously untested branches:
- float32 timestamp with valid value (using smaller value for precision)
- float32/float64 zero and negative values
- json.Number zero and negative values
- int32 and uint32 timestamp handling
- Invalid date format strings (no matching layout)
- Partial date strings
- Unsupported types (bool, slice)
Coverage improved from 93.8% to 100%.
Add 6 new test cases covering previously untested branches:
- float64 at MaxUint64 boundary (clamping behavior)
- float64 exceeding MaxUint64 (overflow protection)
- String with quoted "null" value
- String with quoted empty value ("")
- String with single quoted empty value ('')
- Invalid float parsing in scientific notation
Coverage improved from 92.3% to 97.4%.
Previously, recovery of unhealthy nodes only triggered when ALL nodes
were unhealthy. This caused individual degraded nodes to stay degraded
forever since operations would succeed on healthy nodes and never
trigger the recovery path.
Now recovery is attempted whenever any unhealthy nodes exist, allowing
clusters to recover individual nodes over time.
Also added:
- Panic-safe unlock/lock pattern using anonymous function
- Refresh of both healthy and cooling endpoints after recovery
- Updated timestamp for accurate cooldown checks
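The panic-safe unlock/lock pattern boils down to this sketch (the helper is illustrative; the real code inlines it as an anonymous function):

```go
import "sync"

// withLockReleased runs fn with mu temporarily released, re-acquiring it even
// if fn panics, so slow recovery probes never block other pollers and the
// caller always gets the lock back in the state it expects.
func withLockReleased(mu *sync.Mutex, fn func()) {
	mu.Unlock()
	defer mu.Lock() // re-acquired on normal return and during panic unwinding
	fn()
}
```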
Related to #754
Covers both Proxmox API formats and related edge cases:
- Integer format (older versions): direct int value
- Object format (Proxmox 8.3+): {enabled, available} fields
- Preference order: available > enabled > 0
- Invalid input handling defaults to 0
- Integration with VMStatus struct
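A sketch of parsing the two formats with the stated preference order; the function name and value types here are placeholders, not the real ones:

```go
// parseIntOrObject normalizes the two wire formats: a plain integer (older
// Proxmox) or an object such as {"enabled": ..., "available": ...} (8.3+),
// preferring available over enabled and defaulting to 0 on invalid input.
func parseIntOrObject(v interface{}) int64 {
	switch val := v.(type) {
	case float64: // plain JSON number
		return int64(val)
	case map[string]interface{}:
		if avail, ok := val["available"].(float64); ok {
			return int64(avail)
		}
		if enabled, ok := val["enabled"].(float64); ok {
			return int64(enabled)
		}
	}
	return 0
}
```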
Tests for Proxmox client authentication error handling:
- authHTTPError.Error: message formatting based on status code
(401/403 include status in message, others don't)
- shouldFallbackToForm: determines when to retry with form encoding
(triggers on 400/415, not on auth errors or server errors)
16 test cases covering all code paths.
Tests were added by ADA run #97, but the commit was missed.
Covers: RaidZ types, log/cache/spare devices, nested mirrors,
ConvertToModelZFSPool, and struct field tests.
Test coverage for error detection and retry logic:
- extractStatusCode: 13 test cases for HTTP status code extraction
- isTransientRateLimitError: 17 test cases for rate limit detection
- isNotImplementedError: 14 test cases for 501 error detection
- isVMSpecificError: 16 test cases for VM-scoped errors
- calculateRateLimitBackoff: backoff timing verification
- isAuthError: 12 test cases for authentication errors
Coverage 35.5% → 37.3%
Comprehensive test coverage for JSON parsing helpers used in
replication job status parsing: stringFromAny, intFromAny,
boolFromAny, floatFromAny, parseReplicationTime, parseDurationSeconds,
parseHHMMSSToSeconds, and parseReplicationJob.
Coverage increased from 22.6% to 35.5%.
The previous fix (6db4ee7a) cleared stale error messages but didn't mark
endpoints as healthy again after successful operations. This caused
clusters to remain in "degraded" state permanently once any endpoint had
a temporary issue, even if all endpoints were actually working.
The fix now marks endpoints healthy in clearEndpointError() after
successful operations, ensuring degraded clusters recover automatically.
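Reduced to a sketch with assumed field names, the missing piece looks like this:

```go
import "sync"

// Minimal stand-in for the relevant ClusterClient state.
type clusterState struct {
	mu        sync.Mutex
	lastError map[string]string
	healthy   map[string]bool
}

// clearEndpointError runs after a successful operation: it wipes the stale
// error *and* marks the endpoint healthy again, which is the step the
// previous fix (6db4ee7a) was missing.
func (s *clusterState) clearEndpointError(endpoint string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	delete(s.lastError, endpoint)
	s.healthy[endpoint] = true
}
```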
Related to #659
Previously, errors stored in ClusterClient.lastError were only cleared
during initial health checks or when recovering unhealthy nodes. This
caused stale error messages to persist in the UI even after the
underlying issues were resolved.
The fix clears cached errors in two places:
1. After passing connectivity test in getHealthyClient()
2. After successful operation in executeWithFailover()
This ensures that once an endpoint starts working again, any previous
error messages are cleared from the UI without requiring a restart.
Related to #659, #754