Proxmox VM/LXC templates are intentionally excluded from the monitored
guest list, but their backup files exist on storage. The orphan-detection
logic was firing for every template backup because the VMID was never
in the guest lookup maps.
Fix: track template VMID→node pairs in State.templateVMIDs (unexported,
not serialised to API/frontend) during the resources poll loop, expose them
via StateSnapshot.TemplateVMIDs, and use them in both buildGuestLookups()
and the storage backup node-resolution map so orphan detection treats
template backups as valid. The template map is also preserved through the
cluster health grace-period path (zero-resource preservation) and the
partial-node grace-period path, and is cleared on instance removal.
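A minimal sketch of the idea, assuming a mutex-guarded VMID→node map; the
type and helper names below are illustrative, not the exact Pulse identifiers:

```go
package sketch

import "sync"

// trackedState records template VMIDs per node during the resources poll so
// orphan detection can treat their backups as owned.
type trackedState struct {
	mu            sync.Mutex
	templateVMIDs map[int]string // VMID -> node name; unexported, not serialised
}

func (s *trackedState) recordTemplate(vmid int, node string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.templateVMIDs == nil {
		s.templateVMIDs = make(map[int]string)
	}
	s.templateVMIDs[vmid] = node
}

// isKnownVMID is the orphan-detection side: a backup whose VMID belongs only
// to a template is still a valid owner, not an orphan.
func (s *trackedState) isKnownVMID(vmid int, guestVMIDs map[int]struct{}) bool {
	if _, ok := guestVMIDs[vmid]; ok {
		return true
	}
	s.mu.Lock()
	defer s.mu.Unlock()
	_, ok := s.templateVMIDs[vmid]
	return ok
}
```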
Closes #1352
Pulse was generating tag colours from a hash of the tag name instead
of using the colours configured in Proxmox. Now polls /cluster/options
once per PVE instance and merges the tag-style colour map into state,
which the frontend uses as the first-priority colour source for tag
badges. Falls back to the existing special-tag and hash-based colours
when Proxmox hasn't set a custom colour for a tag.
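A rough sketch of the merge input, assuming the tag-style option's color-map
uses tag:RRGGBB[:RRGGBB] entries separated by semicolons (the exact format
handling in Pulse may differ):

```go
package sketch

import "strings"

// parseTagColorMap turns the "color-map" part of Proxmox's tag-style cluster
// option into a tag -> hex colour map. Assumes "tag:RRGGBB[:RRGGBB]" entries
// separated by ";", with the background colour first.
func parseTagColorMap(colorMap string) map[string]string {
	colours := make(map[string]string)
	for _, entry := range strings.Split(colorMap, ";") {
		parts := strings.Split(strings.TrimSpace(entry), ":")
		if len(parts) < 2 || parts[0] == "" || parts[1] == "" {
			continue
		}
		colours[parts[0]] = "#" + parts[1]
	}
	return colours
}
```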
Show reclaimable buff/cache as a distinct amber segment between used
(green) and free (gray) in the memory bar. This explains why Pulse's
memory percentage differs from Proxmox: Pulse reports cache-aware
usage (MemAvailable) while Proxmox includes cache as used (Total-Free).
Backend: add Cache field to Memory model, derived from MemInfo
(Available - Free). Only uses MemInfo.Free (not FreeMem fallback) to
avoid inflating cache by the balloon gap on ballooned VMs.
Frontend: StackedMemoryBar renders three segments with tooltip
breakdown. Tooltip Free accounts for balloon limit when active.
Percentage label and alerts remain cache-aware (unchanged).
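A small sketch of the backend derivation, with the MemInfo values assumed to
be in bytes:

```go
package sketch

// reclaimableCache derives the amber "cache" segment from agent-reported
// MemInfo. Only the agent's Free is used (not the balloon-adjusted FreeMem),
// so the balloon gap is not misreported as cache.
func reclaimableCache(memAvailable, memFree uint64) uint64 {
	if memAvailable <= memFree {
		return 0 // nothing reclaimable (or inconsistent readings): avoid uint64 underflow
	}
	return memAvailable - memFree
}
```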
Replace the diskUsage <= 0 heuristic with a diskFromAgent bool that is
only set when the guest agent actually returns valid filesystem data.
Prevents carry-forward from firing on a genuine 0% disk reading.
Prevents stale disk data from persisting indefinitely in the efficient
poller when a user disables the guest agent after it had been providing
data. Matches the fallback poller's agent-disabled exclusion.
Carry forward previous cycle's disk data when the QEMU guest agent
times out or errors, instead of falling back to Proxmox cluster/resources
which always reports 0 for VM disk usage. Applied to both polling paths
(pollVMsAndContainersEfficient and pollVMsWithNodes) with safety guards
against uint64 underflow and permanent-failure exclusions.
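A sketch of how the carry-forward and the diskFromAgent flag fit together,
using illustrative names rather than the actual Pulse types:

```go
package sketch

// diskReading is what a poll cycle produced for one VM's disk usage.
type diskReading struct {
	Used, Total uint64
	FromAgent   bool // true only when the agent returned valid filesystem data
}

// resolveDisk reuses the previous cycle's agent-provided reading when the
// agent timed out or errored, instead of the cluster/resources value (which
// reports 0 for VM disk usage). A genuine agent-reported 0% is kept as-is.
func resolveDisk(current, previous diskReading, agentFailed bool) diskReading {
	if agentFailed && previous.FromAgent {
		// Guard against uint64 underflow if the reported capacity shrank.
		if previous.Used > previous.Total {
			previous.Used = previous.Total
		}
		return previous
	}
	return current
}
```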
Host agents removed from the UI would reappear on the next report cycle
because there was no rejection mechanism — unlike Docker agents which
already had resurrection prevention. Mirror the Docker agent pattern:
- Track removed host IDs in a `removedHosts` map with 24hr TTL
- Persist removal records in `State.RemovedHosts` for frontend display
- Reject reports from removed hosts in `ApplyHostReport()`
- Add `AllowHostReenroll()` + API route to clear the block
- Show removed host agents in the Settings UI with "Allow re-enroll"
- Sync removed-agent maps from state on startup for all agent types
- Fix mock integration snapshot missing `RemovedDockerHosts` field
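A minimal sketch of the pattern, with illustrative type and method names
rather than the actual Pulse code:

```go
package sketch

import (
	"sync"
	"time"
)

// removedHosts remembers removed host IDs for a TTL and rejects any report
// that arrives for one of them.
type removedHosts struct {
	mu  sync.Mutex
	ids map[string]time.Time // host ID -> time of removal
	ttl time.Duration        // e.g. 24 * time.Hour
}

func (r *removedHosts) remove(id string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.ids == nil {
		r.ids = make(map[string]time.Time)
	}
	r.ids[id] = time.Now()
}

// isBlocked reports whether a report from id should be rejected. Expired
// entries are pruned, so a host can re-enroll after the TTL or explicitly
// via an "allow re-enroll" action that deletes the entry.
func (r *removedHosts) isBlocked(id string) bool {
	r.mu.Lock()
	defer r.mu.Unlock()
	removedAt, ok := r.ids[id]
	if !ok {
		return false
	}
	if time.Since(removedAt) > r.ttl {
		delete(r.ids, id)
		return false
	}
	return true
}
```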
Move the guest-agent file-read of /proc/meminfo earlier in the memory
fallback chain so it runs before RRD, giving real-time MemAvailable that
correctly excludes reclaimable buff/cache on Linux VMs. Also add
VM.GuestAgent.FileRead permission for PVE 9 and fix install.sh to use
comma-separated privilege strings.
Two root causes: (1) When Proxmox cluster/resources returns a partial
response (e.g. during migration or transient API issue), VMs missing
from a responsive node were silently dropped because the node appeared
in nodesWithResources, bypassing grace-period preservation. Now
preserves recently-seen guests from online nodes for up to the grace
window. (2) The task queue allowed overlapping polls for the same PVE
instance — a slower stale poll could overwrite a newer complete VM list.
Added per-instance execution lock to skip duplicate scheduled tasks.
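A sketch of the per-instance execution lock, with names chosen for
illustration:

```go
package sketch

import "sync"

// instanceLocks skips a newly scheduled poll when one for the same PVE
// instance is already running, so a slower stale poll can never overwrite a
// newer complete VM list.
type instanceLocks struct {
	mu      sync.Mutex
	running map[string]bool // instance name -> poll in progress
}

// tryAcquire returns false when a poll for this instance is already running.
func (l *instanceLocks) tryAcquire(instance string) bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.running == nil {
		l.running = make(map[string]bool)
	}
	if l.running[instance] {
		return false
	}
	l.running[instance] = true
	return true
}

func (l *instanceLocks) release(instance string) {
	l.mu.Lock()
	defer l.mu.Unlock()
	delete(l.running, instance)
}
```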
A broken or hung qemu-agent on one VM could stall the entire polling
loop, preventing higher-VMID VMs from being detected. Wrap all guest
agent work in a 10s per-VM budget with panic recovery, and add a 2s
timeout to GetVMStatus in the efficient poller to match the legacy path.
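A sketch of the per-VM budget, assuming the guest-agent work is passed in as
a callback (the real wiring in Pulse differs):

```go
package sketch

import (
	"context"
	"fmt"
	"time"
)

// pollWithBudget runs one VM's guest-agent work under its own timeout with
// panic recovery, so a single hung or broken agent cannot stall the loop.
func pollWithBudget(parent context.Context, budget time.Duration, work func(context.Context) error) error {
	ctx, cancel := context.WithTimeout(parent, budget)
	defer cancel()

	done := make(chan error, 1)
	go func() {
		defer func() {
			if r := recover(); r != nil {
				done <- fmt.Errorf("guest agent poll panicked: %v", r)
			}
		}()
		done <- work(ctx)
	}()

	select {
	case err := <-done:
		return err
	case <-ctx.Done():
		return ctx.Err() // budget exceeded; move on to the next VM
	}
}
```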
When the balloon driver reports Free but not Buffers or Cached, the
meminfo-derived fallback computed memAvailable = Free alone, counting
all reclaimable page cache as used memory. This caused Linux VMs to
show wildly inflated usage (e.g. 93% when actual is 21%).
The meminfo-derived path now requires at least one cache metric
(Buffers > 0 or Cached > 0) before trusting the value. When both are
missing, the code falls through to the RRD/guest-agent/Total-Used
fallbacks, which provide accurate cache-aware data. Both the efficient
and traditional polling paths are now consistent.
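A sketch of the gate, assuming the meminfo-derived value approximates
MemAvailable as Free + Buffers + Cached (illustrative; the exact formula in
Pulse may differ):

```go
package sketch

// meminfoAvailable only trusts the balloon meminfo data when at least one
// cache metric was reported; otherwise it tells the caller to fall through
// to the RRD / guest-agent / Total-Used fallbacks.
func meminfoAvailable(free, buffers, cached uint64) (uint64, bool) {
	if buffers == 0 && cached == 0 {
		return 0, false // Free alone counts reclaimable page cache as used
	}
	return free + buffers + cached, true
}
```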
Escalation was calling SendAlert() which always sends to all enabled
channels, ignoring the per-level channel selection (email/webhook/all).
Add SendAlertToChannels() that snapshots only the requested channel
configs and uses a distinct "_escalation" queue type so the dequeue
handler skips cooldown writes — preventing interference with the alert
manager's own re-notify cadence.
When Proxmox /nodes/{node}/status returns only total/used/free without
available/buffers/cached, EffectiveAvailable() returns Free (non-zero),
causing the RRD fallback gate to be skipped. This results in inflated
node memory where cache/buffers are counted as "used."
Widen the RRD fallback condition from requiring effectiveAvailable == 0
to triggering whenever missingCacheMetrics is true. Add negative caching
for failed RRD lookups (2-minute backoff) to avoid repeated retries.
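A sketch of the negative cache, with illustrative names:

```go
package sketch

import (
	"sync"
	"time"
)

// rrdNegativeCache remembers when an RRD lookup failed and skips retries for
// a backoff window (2 minutes in the fix), so every poll cycle does not
// repeat a request that just failed.
type rrdNegativeCache struct {
	mu      sync.Mutex
	failed  map[string]time.Time // cache key -> time of last failure
	backoff time.Duration
}

func (c *rrdNegativeCache) shouldSkip(key string) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	lastFail, ok := c.failed[key]
	return ok && time.Since(lastFail) < c.backoff
}

func (c *rrdNegativeCache) recordFailure(key string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.failed == nil {
		c.failed = make(map[string]time.Time)
	}
	c.failed[key] = time.Now()
}
```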
Proxmox status.Mem includes page cache as "used" memory, inflating
reported VM usage. The existing fallbacks (balloon meminfo, RRD, linked
host agent) were frequently unavailable, causing most VMs to fall
through to the inflated status-mem source.
Adds a new last-resort fallback that reads /proc/meminfo via the QEMU
guest agent file-read endpoint to get accurate MemAvailable. Results
are cached (60s positive, 5min negative backoff for unsupported VMs).
Also fixes: RRD memavailable fallback missing from traditional polling
path, cache key collisions in multi-PVE setups, FreeMem underflow
guard inconsistency, and integer overflow in kB-to-bytes conversion.
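A sketch of the parsing step, assuming the raw /proc/meminfo text has already
been fetched through the guest agent file-read endpoint:

```go
package sketch

import (
	"bufio"
	"strconv"
	"strings"
)

// parseMemAvailable extracts MemAvailable (reported in kB) from /proc/meminfo
// contents. The kB-to-bytes conversion stays in uint64, avoiding the integer
// overflow the fix also addresses.
func parseMemAvailable(meminfo string) (bytes uint64, ok bool) {
	scanner := bufio.NewScanner(strings.NewReader(meminfo))
	for scanner.Scan() {
		line := scanner.Text()
		if !strings.HasPrefix(line, "MemAvailable:") {
			continue
		}
		fields := strings.Fields(line) // e.g. ["MemAvailable:", "1234567", "kB"]
		if len(fields) < 2 {
			return 0, false
		}
		kb, err := strconv.ParseUint(fields[1], 10, 64)
		if err != nil {
			return 0, false
		}
		return kb * 1024, true
	}
	return 0, false
}
```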
The --disk-exclude agent flag only filtered local metric collection but
had no effect on server-side Proxmox disk health and SSD wearout alerts,
which poll the Proxmox API directly. Users excluding disks (e.g.
--disk-exclude sda) still received alerts for those disks.
Agent now sends its DiskExclude patterns in each report. The server
stores them on the Host model and consults them during Proxmox disk
polling — excluded disks get a synthetic healthy status passed to
CheckDiskHealth so any existing alerts clear immediately.
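A sketch of the server-side exclusion check, assuming glob-style patterns;
the matching rules Pulse actually applies may be simpler or stricter:

```go
package sketch

import "path/filepath"

// isExcludedDisk checks a Proxmox-reported device against the DiskExclude
// patterns the agent sent (e.g. "sda" or "nvme*"). Excluded disks get a
// synthetic healthy status so existing alerts clear.
func isExcludedDisk(device string, patterns []string) bool {
	base := filepath.Base(device) // "/dev/sda" -> "sda"
	for _, p := range patterns {
		if matched, err := filepath.Match(p, base); err == nil && matched {
			return true
		}
		if p == base || p == device {
			return true
		}
	}
	return false
}
```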
Also adds FreeBSD pseudo-filesystem types (fdescfs, devfs, linprocfs,
linsysfs) to the virtual FS filter and /var/run/ to special mount
prefixes, fixing false disk-full alerts on FreeBSD for fdescfs mounts.
When no guest agent MemInfo or RRD data is available, prefer the linked
Pulse host agent's memory (read from /proc/meminfo via gopsutil, which
excludes page cache) over Proxmox's status.Mem (total - free, inflated
by reclaimable cache). Applied to both efficient and traditional polling
paths. Diagnostic fields added to VMMemoryRaw for visibility.
Windows VMs and VMs without qemu-guest-agent triggered an uncached
GetVMRRDData HTTP call on every poll cycle. Add vmRRDMemCache using the
same read-through cache pattern as nodeRRDMemCache (shared rrdCacheMu,
same TTL, same cleanup path).
(cherry picked from commit 582f16004a0f275de4c458e5d288be70eee613e4)
Linux VM page cache (#1270): QEMU VM memory now falls back to Proxmox
RRD's memavailable metric (which excludes reclaimable page cache) when
the qemu-guest-agent doesn't provide MemInfo.Available. Previously the
fallback was detailedStatus.Mem (total - MemFree), inflating usage to
80%+ on VMs with normal Linux page cache. Mirrors the existing LXC
rrd-memavailable path.
FreeBSD ZFS ARC (#1264, #1051): The host agent now reads
kstat.zfs.misc.arcstats.size via SysctlRaw on FreeBSD and subtracts
the ARC size from reported memory usage. ZFS ARC is reclaimable under
memory pressure (like Linux SReclaimable) but gopsutil counts it as
wired/non-reclaimable, causing false 90%+ memory alerts on TrueNAS
and FreeBSD hosts. Build-tagged so it compiles cleanly on all platforms.
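A sketch of the FreeBSD side, assuming the sysctl returns a native-endian
uint64 (little-endian on amd64/arm64); function names are illustrative:

```go
//go:build freebsd

package sketch

import (
	"encoding/binary"

	"golang.org/x/sys/unix"
)

// zfsARCSize reads the current ZFS ARC size in bytes, or 0 if unavailable.
func zfsARCSize() uint64 {
	raw, err := unix.SysctlRaw("kstat.zfs.misc.arcstats.size")
	if err != nil || len(raw) < 8 {
		return 0
	}
	return binary.LittleEndian.Uint64(raw[:8])
}

// adjustedUsed subtracts the reclaimable ARC from gopsutil's "used" figure
// before the memory percentage is computed.
func adjustedUsed(used uint64) uint64 {
	arc := zfsARCSize()
	if arc > used {
		return 0 // guard against underflow
	}
	return used - arc
}
```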
Fixes #1270, fixes #1264, fixes #1051
(cherry picked from commit 94502f83ff9ffc6da28aaadc946a2f7d8b4e9bac)
Docker container URL preserved on update (#1054): container updates
recreate the container with a new runtime ID. The agent now includes
{oldContainerId, newContainerId} in the completion ACK payload; the
server uses this to copy persisted metadata (custom URLs, descriptions,
tags) to the new ID so nothing is lost. Migration is a copy, not a move,
so rollback scenarios still find metadata under the original ID.
Reduce metrics.db write amplification (#1124): add a UNIQUE index on
(resource_type, resource_id, metric_type, timestamp, tier) so rollup
reprocessing after a failed checkpoint uses INSERT OR IGNORE instead of
creating duplicate rows. Existing duplicates are deduplicated once on
startup if the index creation would otherwise fail. Also sets
wal_autocheckpoint(500) to checkpoint the WAL more frequently, preventing
unbounded WAL growth.
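A sketch of the schema and write path, with illustrative table and column
names matching the index described above:

```go
package sketch

import "database/sql"

// prepareMetricsDB is the startup side: a more frequent WAL checkpoint and a
// unique index that makes rollup reprocessing idempotent.
func prepareMetricsDB(db *sql.DB) error {
	if _, err := db.Exec(`PRAGMA wal_autocheckpoint = 500`); err != nil {
		return err
	}
	_, err := db.Exec(`CREATE UNIQUE INDEX IF NOT EXISTS idx_metrics_dedup
		ON metrics(resource_type, resource_id, metric_type, timestamp, tier)`)
	return err
}

// writeRollupPoint relies on the unique index: replaying a batch after a
// failed checkpoint silently skips rows that already exist.
func writeRollupPoint(db *sql.DB, resourceType, resourceID, metricType string, ts int64, tier int, value float64) error {
	_, err := db.Exec(`INSERT OR IGNORE INTO metrics
		(resource_type, resource_id, metric_type, timestamp, tier, value)
		VALUES (?, ?, ?, ?, ?, ?)`,
		resourceType, resourceID, metricType, ts, tier, value)
	return err
}
```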
Fixes #1054, fixes #1124
Recovery (all-clear) notifications were being silently suppressed during
quiet hours for any non-critical alert. Since powered-off alerts default
to Warning level, users who received an alert at 2pm would never get the
recovery notification if the VM came back during quiet hours.
Quiet hours are intended to suppress noisy firing alerts, not to hide
the fact that an issue has resolved. If you got the alert, you should
always get the all-clear.
Remove the ShouldSuppressResolvedNotification gate from handleAlertResolved.
The notifyOnResolve toggle (explicit user preference) is still respected.
Fixes #1259
The shared storage deduplication key was just the storage name, causing
storages with the same name from different Proxmox instances (or PVE + PBS)
to be incorrectly merged into a single entry. This made one random host
appear to have all storages from all instances.
Include the instance name in the aggregation key so shared storage is only
merged within the same Proxmox cluster/instance.
Fixes #1246
The dedup logic only handled btrfs/zfs subvolumes, but Kubernetes
bind-mounts the same device at both pod and plugin paths, causing
xfs/ext4 volumes to be double-counted. Now deduplicates by
device+totalBytes for all filesystem types.
Fixes #1158
The Docker agent was not passing the disk exclusion list to
hostmetricsCollect(), so excluded mounts appeared in the Docker tab
disk totals. Also add server-side fsfilters filtering to Docker
report processing for parity with the host agent path.
Mock seeding wrote Docker container metrics as "docker" but the real
monitor uses "dockerContainer". This made mock-mode charts miss the
SQLite store path after the API normalization fix in 7336ec2d.
- fix(models): filter nodes by instance in UpdateNodesForInstance to prevent
PVE node duplication across poll cycles (#1214, #1192, #1217)
- fix(alerts): sort GetActiveAlerts output for stable ordering, preventing
hostname scrambling in frontend (#1218)
- fix(notifications): add ntfy-specific resolved webhook formatting with
plain-text body and proper headers (#1213)
- fix(frontend): respect "hide Docker update actions" setting in
DockerFilter Update All button (#1219)
- fix(frontend): add missing v prefix to GitHub release tag URLs (#1195)
- fix(monitoring): reduce disk detection warning from Warn to Debug to
eliminate log spam for pass-through disks (#1216)
- chore: bump VERSION to 5.1.5
The in-memory metrics buffer was changed from 1000 to 86400 points per
metric to support 30-day sparklines, but this pre-allocated ~18 MB per
guest (7 slices × 86400 × 32 bytes). With 50 guests that's 920 MB —
explaining why users needed to double their LXC memory after upgrading
to 5.1.0.
- Revert in-memory buffer to 1000 points / 24h retention
- Remove eager slice pre-allocation (use append growth instead)
- Add LTTB (Largest Triangle Three Buckets) downsampling algorithm
- Chart endpoints now use a two-tier strategy: in-memory for ranges
≤ 2h, SQLite persistent store + LTTB for longer ranges
- Reduce frontend ring buffer from 86400 to 2000 points
Related to #1190
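A compact sketch of LTTB as used for the long-range tier (generic, not the
exact Pulse implementation):

```go
package sketch

type point struct {
	T float64 // timestamp
	V float64 // value
}

// lttb is a Largest-Triangle-Three-Buckets downsampler. It keeps the first
// and last points and, for each bucket in between, the point forming the
// largest triangle with the previously kept point and the average of the
// next bucket.
func lttb(data []point, threshold int) []point {
	if threshold >= len(data) || threshold < 3 {
		return data
	}
	sampled := make([]point, 0, threshold)
	sampled = append(sampled, data[0])

	// Bucket size for the interior points (first and last are fixed).
	bucketSize := float64(len(data)-2) / float64(threshold-2)
	prev := 0

	for i := 0; i < threshold-2; i++ {
		// Average point of the next bucket.
		nextStart := int(float64(i+1)*bucketSize) + 1
		nextEnd := int(float64(i+2)*bucketSize) + 1
		if nextEnd > len(data) {
			nextEnd = len(data)
		}
		var avgT, avgV float64
		for _, p := range data[nextStart:nextEnd] {
			avgT += p.T
			avgV += p.V
		}
		n := float64(nextEnd - nextStart)
		avgT, avgV = avgT/n, avgV/n

		// Pick the point in the current bucket with the largest triangle area.
		curStart := int(float64(i)*bucketSize) + 1
		best, bestArea := curStart, -1.0
		for j := curStart; j < nextStart; j++ {
			area := abs((data[prev].T-avgT)*(data[j].V-data[prev].V) -
				(data[prev].T-data[j].T)*(avgV-data[prev].V))
			if area > bestArea {
				best, bestArea = j, area
			}
		}
		sampled = append(sampled, data[best])
		prev = best
	}
	return append(sampled, data[len(data)-1])
}

func abs(x float64) float64 {
	if x < 0 {
		return -x
	}
	return x
}
```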
Proxmox reports cumulative byte counters that update unevenly across
polling intervals, causing a steady 100 Mbps download to appear as
spikes up to 450 Mbps in sparkline charts. Replace per-interval rate
calculation with a 4-sample sliding window (30s at 10s polling) that
averages over the full span — the same approach Prometheus rate() uses.
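A sketch of the windowed rate, assuming cumulative byte counters sampled once
per poll (names illustrative):

```go
package sketch

// counterSample is one poll's snapshot of a cumulative byte counter.
type counterSample struct {
	Bytes     uint64
	Timestamp int64 // unix seconds
}

// windowedRate computes bytes/sec across the whole sample window (4 samples,
// roughly 30s at 10s polling) instead of the last interval, smoothing the
// uneven counter updates that caused the spikes.
func windowedRate(samples []counterSample) float64 {
	if len(samples) < 2 {
		return 0
	}
	first, last := samples[0], samples[len(samples)-1]
	if last.Timestamp <= first.Timestamp || last.Bytes < first.Bytes {
		return 0 // counter reset or clock issue: skip this window
	}
	return float64(last.Bytes-first.Bytes) / float64(last.Timestamp-first.Timestamp)
}
```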