When Proxmox /nodes/{node}/status returns only total/used/free without
available/buffers/cached, EffectiveAvailable() returns Free (non-zero),
causing the RRD fallback gate to be skipped. This results in inflated
node memory where cache/buffers are counted as "used."
Widen the RRD fallback condition from requiring effectiveAvailable == 0
to triggering whenever missingCacheMetrics is true. Add negative caching
for failed RRD lookups (2-minute backoff) to avoid repeated retries.
Proxmox status.Mem includes page cache as "used" memory, inflating
reported VM usage. The existing fallbacks (balloon meminfo, RRD, linked
host agent) were frequently unavailable, causing most VMs to fall
through to the inflated status-mem source.
Adds a new last-resort fallback that reads /proc/meminfo via the QEMU
guest agent file-read endpoint to get accurate MemAvailable. Results
are cached (60s positive, 5min negative backoff for unsupported VMs).
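A minimal sketch of the parsing half, assuming the raw file content has already been fetched from the guest agent file-read endpoint (GET /nodes/{node}/qemu/{vmid}/agent/file-read); parseMemAvailable is an illustrative name:

```go
package monitor

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// parseMemAvailable extracts MemAvailable (in bytes) from /proc/meminfo
// content as returned by the guest agent file-read endpoint.
func parseMemAvailable(meminfo string) (uint64, error) {
	sc := bufio.NewScanner(strings.NewReader(meminfo))
	for sc.Scan() {
		line := sc.Text()
		if !strings.HasPrefix(line, "MemAvailable:") {
			continue
		}
		fields := strings.Fields(line) // "MemAvailable:", "123456", "kB"
		if len(fields) < 2 {
			break
		}
		kb, err := strconv.ParseUint(fields[1], 10, 64)
		if err != nil {
			return 0, err
		}
		return kb * 1024, nil // kB -> bytes, computed in uint64
	}
	return 0, fmt.Errorf("MemAvailable not present in /proc/meminfo")
}
```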
Also fixes: RRD memavailable fallback missing from traditional polling
path, cache key collisions in multi-PVE setups, FreeMem underflow
guard inconsistency, and integer overflow in kB-to-bytes conversion.
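For the overflow item, the hazard is doing the kB-to-bytes multiply in a narrower or signed type before widening; a hedged sketch of the safe shape (kbToBytes is a hypothetical helper, not the actual fix):

```go
package monitor

import "strconv"

// kbToBytes converts a /proc/meminfo kB field to bytes without the
// overflow risk of multiplying in a narrower signed type (e.g. int
// on 32-bit builds).
func kbToBytes(field string) (uint64, bool) {
	kb, err := strconv.ParseUint(field, 10, 64)
	if err != nil {
		return 0, false
	}
	const maxKB = ^uint64(0) / 1024
	if kb > maxKB { // guard the multiply itself
		return 0, false
	}
	return kb * 1024, true
}
```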
Linux VM page cache (#1270): QEMU VM memory now falls back to Proxmox
RRD's memavailable metric (which excludes reclaimable page cache) when
the qemu-guest-agent doesn't provide MemInfo.Available. Previously the
fallback was detailedStatus.Mem (total - MemFree), inflating usage to
80%+ on VMs with normal Linux page cache. Mirrors the existing LXC
rrd-memavailable path.
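A sketch of the resulting fallback order for QEMU VMs; the parameters stand in for values Pulse derives elsewhere, and the function name is illustrative:

```go
package monitor

// vmUsedMemory picks the best available "used" figure for a VM.
func vmUsedMemory(total, agentAvailable, rrdMemAvailable, memFree uint64,
	agentOK, rrdOK bool) uint64 {

	switch {
	case agentOK && agentAvailable > 0 && agentAvailable <= total:
		// Best source: qemu-guest-agent MemInfo.Available.
		return total - agentAvailable
	case rrdOK && rrdMemAvailable > 0 && rrdMemAvailable <= total:
		// New fallback: RRD memavailable excludes reclaimable page cache.
		return total - rrdMemAvailable
	default:
		// Last resort: detailedStatus.Mem equivalent (total - MemFree),
		// which counts page cache as used and can inflate usage.
		return total - memFree
	}
}
```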
FreeBSD ZFS ARC (#1264, #1051): The host agent now reads
kstat.zfs.misc.arcstats.size via SysctlRaw on FreeBSD and subtracts
the ARC size from reported memory usage. ZFS ARC is reclaimable under
memory pressure (like Linux SReclaimable) but gopsutil counts it as
wired/non-reclaimable, causing false 90%+ memory alerts on TrueNAS
and FreeBSD hosts. Build-tagged so it compiles cleanly on all platforms.
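A minimal sketch of the FreeBSD side, assuming golang.org/x/sys/unix; the surrounding host-agent code is paraphrased, not verbatim:

```go
//go:build freebsd

package agent

import (
	"encoding/binary"

	"golang.org/x/sys/unix"
)

// zfsARCSize returns the current ZFS ARC size in bytes, or 0 when the
// kstat sysctl is unavailable (e.g. the ZFS module is not loaded).
func zfsARCSize() uint64 {
	raw, err := unix.SysctlRaw("kstat.zfs.misc.arcstats.size")
	if err != nil || len(raw) < 8 {
		return 0
	}
	// The kstat value is a native-endian uint64; FreeBSD's supported
	// Go targets are little-endian.
	return binary.LittleEndian.Uint64(raw)
}

// adjustedUsed treats the ARC as reclaimable, mirroring how Linux
// SReclaimable is handled, and clamps to avoid uint64 underflow.
func adjustedUsed(used uint64) uint64 {
	arc := zfsARCSize()
	if arc >= used {
		return 0
	}
	return used - arc
}
```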
Fixes #1270
Fixes #1264
Fixes #1051
(cherry picked from commit 94502f83ff9ffc6da28aaadc946a2f7d8b4e9bac)
- Add PendingUpdates and PendingUpdatesCheckedAt fields to Node model
- Add GetNodePendingUpdates method to Proxmox client (calls /nodes/{node}/apt/update)
- Add 30-minute polling cache to avoid excessive API calls
- Add pendingUpdates to frontend Node type
- Add color-coded badge in NodeSummaryTable (yellow: 1-9, orange: 10+)
- Update test stubs for interface compliance
Requires Sys.Audit permission on Proxmox API token to read apt updates.
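A sketch of the 30-minute cache in front of /nodes/{node}/apt/update; updateCache and the fetch callback are illustrative shapes, not the actual Pulse types:

```go
package monitor

import (
	"sync"
	"time"
)

const pendingUpdatesTTL = 30 * time.Minute

type updateCache struct {
	mu      sync.Mutex
	count   map[string]int       // node -> pending update count
	checked map[string]time.Time // node -> last successful check
}

func newUpdateCache() *updateCache {
	return &updateCache{count: map[string]int{}, checked: map[string]time.Time{}}
}

// pendingUpdates returns the cached count when fresh, otherwise calls
// fetch (which hits the apt/update endpoint and needs Sys.Audit).
func (c *updateCache) pendingUpdates(node string, fetch func(string) (int, error)) (int, time.Time) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if at, ok := c.checked[node]; ok && time.Since(at) < pendingUpdatesTTL {
		return c.count[node], at // fresh enough: skip the API call
	}
	n, err := fetch(node)
	if err != nil {
		return c.count[node], c.checked[node] // keep stale value on error
	}
	now := time.Now()
	c.count[node], c.checked[node] = n, now
	return n, now
}
```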
Two critical fixes to prevent test timeouts:
1. Nil map panic in TestPollPVEInstanceUsesRRDMemUsedFallback:
- Test monitor was missing nodeLastOnline map initialization
- Panic occurred when pollPVEInstance tried to update nodeLastOnline[nodeID]
- Caused deadlock when panic recovery tried to acquire already-held mutex
- Added nodeLastOnline: make(map[string]time.Time) to test monitor
2. Alert manager goroutine leak in Docker tests:
- newTestMonitor() created alert manager but never stopped it
- Background goroutines (escalationChecker, periodicSaveAlerts) kept running
- Added t.Cleanup(func() { m.alertManager.Stop() }) to test helper (see the sketch below)
These fixes resolve the 10+ minute test timeouts in CI workflows.
Related to workflow run 19281508603.
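A combined sketch of the corrected helper; the Monitor and AlertManager shapes here are stand-ins, with only the two fixes above being the point:

```go
package monitor_test

import (
	"sync"
	"testing"
	"time"
)

type alertManager struct{ done chan struct{} }

func (a *alertManager) Stop() { close(a.done) } // signals background goroutines to exit

type testMonitor struct {
	mu             sync.Mutex
	nodeLastOnline map[string]time.Time
	alertManager   *alertManager
}

func newTestMonitor(t *testing.T) *testMonitor {
	m := &testMonitor{
		// Fix 1: initialize the map so pollPVEInstance-style writes
		// (m.nodeLastOnline[nodeID] = ...) cannot panic on a nil map.
		nodeLastOnline: make(map[string]time.Time),
		alertManager:   &alertManager{done: make(chan struct{})},
	}
	// Fix 2: stop background goroutines when the test ends.
	t.Cleanup(func() { m.alertManager.Stop() })
	return m
}
```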
Three categories of fixes:
1. Goroutine leak causing 10-minute timeout:
- Add defer mon.notificationMgr.Stop() in monitor_memory_test.go
- Background goroutines from notification manager weren't being stopped
2. Database NULL column scanning errors:
- Change LastError from string to *string in queue.go
- Change PayloadBytes from int to *int in queue.go
- database/sql can only scan NULL into nullable destinations (pointers or sql.Null* types), hence the pointer fields; see the sketch below
3. SSRF protection blocking test servers:
- Check allowlist for localhost before rejecting in notifications.go
- Set PULSE_DATA_DIR to temp directory in tests
- Add defer nm.Stop() calls to prevent goroutine leaks
Fixes for preflight test failures in workflow run 19280879903.
Related to #553
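A sketch of the NULL-scanning fix; the row shape is an illustrative stand-in for queue.go's actual schema:

```go
package notifications

import "database/sql"

type queueRow struct {
	ID           int64
	LastError    *string // was string: Scan fails on SQL NULL
	PayloadBytes *int    // was int: Scan fails on SQL NULL
}

func scanRow(rows *sql.Rows) (queueRow, error) {
	var r queueRow
	// database/sql maps NULL to nil only for pointer (or sql.Null*)
	// destinations; scanning NULL into string/int returns an error.
	err := rows.Scan(&r.ID, &r.LastError, &r.PayloadBytes)
	return r, err
}
```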
## Problem
LXC containers showed inflated memory usage (e.g., 90%+ when actual usage was 50-60%,
96% when actual was 61%) because the code used the raw `mem` value from Proxmox's
`/cluster/resources` API endpoint. This value comes from cgroup `memory.current` which
includes reclaimable cache and buffers, making memory appear nearly full even when
plenty is available.
## Root Cause
- **Nodes**: Had sophisticated cache-aware memory calculation with RRD fallbacks
- **VMs (qemu)**: Had detailed memory calculation using guest agent meminfo
- **LXCs**: Naively used `res.Mem` directly without any cache-aware correction
The Proxmox cluster resources API's `mem` field for LXCs includes cache/buffers
(from cgroup memory accounting), which should be excluded for accurate "used" memory.
## Solution
Implement cache-aware memory calculation for LXC containers by:
1. Adding `GetLXCRRDData()` method to fetch RRD metrics for LXC containers from
`/nodes/{node}/lxc/{vmid}/rrddata`
2. Using RRD `memavailable` to calculate actual used memory (total - available)
3. Falling back to RRD `memused` if `memavailable` is not available
4. Only using cluster resources `mem` value as last resort
This matches the approach already used for nodes and VMs, providing consistent
cache-aware memory reporting across all resource types; the fallback order is sketched below.
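A minimal sketch of that fallback order; the parameter shapes are illustrative, not the actual Pulse signatures:

```go
package monitor

// lxcUsedMemory returns the cache-aware "used" figure for a container.
// Nil pointers mark RRD metrics that were absent from the response.
func lxcUsedMemory(total, resMem uint64, rrdMemAvailable, rrdMemUsed *uint64) uint64 {
	switch {
	case rrdMemAvailable != nil && *rrdMemAvailable <= total:
		// Preferred: RRD memavailable excludes reclaimable cache/buffers.
		return total - *rrdMemAvailable
	case rrdMemUsed != nil && *rrdMemUsed <= total:
		// Next: RRD memused, still better than raw cgroup memory.current.
		return *rrdMemUsed
	default:
		// Last resort: res.Mem from /cluster/resources (includes cache).
		return resMem
	}
}
```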
## Changes
- Added `GuestRRDPoint` type and `GetLXCRRDData()` method to pkg/proxmox
- Added `GetLXCRRDData()` to ClusterClient for cluster-aware operations
- Modified LXC memory calculation in `pollPVEInstance()` to use RRD data when available
- Added guest memory snapshot recording for LXC containers
- Updated test stubs to implement the new interface method
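A hypothetical shape for the pkg/proxmox additions (field names and the getJSON helper are illustrative; the actual definitions may differ):

```go
package proxmox

import (
	"context"
	"errors"
	"fmt"
)

// GuestRRDPoint is one sample from /nodes/{node}/lxc/{vmid}/rrddata.
// Pointers distinguish "metric absent" from a genuine zero value.
type GuestRRDPoint struct {
	Time         int64    `json:"time"`
	MemAvailable *float64 `json:"memavailable,omitempty"`
	MemUsed      *float64 `json:"memused,omitempty"`
}

type Client struct{}

// getJSON stands in for the project's HTTP helper in this sketch.
func (c *Client) getJSON(ctx context.Context, path string, out any) error {
	return errors.New("not implemented in this sketch")
}

// GetLXCRRDData fetches recent RRD metrics for one container.
func (c *Client) GetLXCRRDData(ctx context.Context, node string, vmid int) ([]GuestRRDPoint, error) {
	path := fmt.Sprintf("/nodes/%s/lxc/%d/rrddata?timeframe=hour", node, vmid)
	var points []GuestRRDPoint
	if err := c.getJSON(ctx, path, &points); err != nil {
		return nil, err
	}
	return points, nil
}
```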
## Testing
- Code compiles successfully
- Follows the same proven pattern used for nodes and VMs
- Includes diagnostic snapshot recording for troubleshooting