Commit graph

260 commits

Author SHA1 Message Date
rcourtman
48689137ec Migrate Docker metadata on observed container recreation (#1054) 2026-03-27 22:50:19 +00:00
rcourtman
8c1d4dcc04 Honor discovery subnet policy for cluster endpoints (#1319) 2026-03-27 16:30:21 +00:00
rcourtman
01f916dcb5 Use linked host-agent disk data for guest fallback (#1319) 2026-03-27 15:56:20 +00:00
rcourtman
ad10e1f116 Discover controller-backed SMART wearout paths (#1368) 2026-03-27 15:42:44 +00:00
rcourtman
2a4432048a Continue guest-agent polling after transient status failures (#1319) 2026-03-27 14:50:28 +00:00
rcourtman
51abca6421 Treat available guest agents as healthy for VM memory carry-forward (#1319) 2026-03-27 11:04:07 +00:00
rcourtman
963670f01c Serve fresh alert snapshots from monitor state reads (#1365) 2026-03-27 10:47:56 +00:00
rcourtman
ae66647eb1 Preserve VM memory when healthy guests fall back to false 100% usage (#1319) 2026-03-27 08:27:14 +00:00
rcourtman
f344938403 Retry Linux guest meminfo sooner after transient failures (#1319)
2026-03-26 23:27:54 +00:00
rcourtman
d9b7c99f02 Rotate guest-agent poll priority across QEMU polls (#1319) 2026-03-26 22:20:27 +00:00
rcourtman
fcd2384dd5 Stabilize transient VM disk fallbacks (#1319)
2026-03-26 11:12:23 +00:00
rcourtman
e9bbc35bae Stabilize repeated low-trust VM memory fallbacks (#1319) 2026-03-26 00:23:29 +00:00
rcourtman
2196327769 Preserve VM guest metadata across transient agent gaps (#1319) 2026-03-26 00:12:19 +00:00
rcourtman
0f70aa053e Honor disk-exclude for sleeping and Proxmox disks (#1142) 2026-03-26 00:01:59 +00:00
rcourtman
333e66a8e9 Reject shared Docker token host identity collisions (#1366) 2026-03-25 23:36:57 +00:00
rcourtman
48f4438d23 Scale v5 Proxmox guest disk polling 2026-03-25 18:24:47 +00:00
rcourtman
fba1fadccd Make alert node display name resolution instance-aware (#1218) 2026-03-25 12:44:22 +00:00
rcourtman
9c2a56d351 Respect quiet hours for recovery notifications (#1068) 2026-03-25 12:27:35 +00:00
rcourtman
ffaeea18d6 Scope cluster TLS fingerprints to their own endpoints (#1199) 2026-03-25 12:10:09 +00:00
rcourtman
40249947ed Fix template backup orphan detection race (#1352) 2026-03-25 10:36:33 +00:00
rcourtman
2fe22c3308 fix(backups): prevent template backups from being flagged as orphaned
Proxmox VM/LXC templates are intentionally excluded from the monitored
guest list, but their backup files exist on storage. The orphan-detection
logic was firing for every template backup because the VMID was never
in the guest lookup maps.

Fix: track template VMID→node pairs in State.templateVMIDs (unexported,
not serialised to API/frontend) during the resources poll loop, expose
via StateSnapshot.TemplateVMIDs, and use in both buildGuestLookups() and
the storage backup node-resolution map so orphan detection treats template
backups as valid. Also preserves the template map through the cluster
health grace-period path (zero-resource preservation), the partial-node
grace-period path, and clears it on instance removal.

Closes #1352
2026-03-17 09:04:22 +00:00
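
The orphan check described in this commit can be illustrated with a minimal sketch. It is not the project's code: `find_orphaned_backups` and the dictionary shapes are hypothetical, and only the idea (a backup is valid if its VMID belongs to a monitored guest or to a tracked template) comes from the message above.

```python
def find_orphaned_backups(backups, guest_vmids, template_vmids):
    """Return backups whose VMID is neither a monitored guest nor a template.

    guest_vmids:    set of VMIDs in the monitored guest list (hypothetical shape)
    template_vmids: dict of template VMID -> node, tracked during the poll loop
    """
    orphans = []
    for backup in backups:
        vmid = backup["vmid"]
        # Templates are excluded from the guest list, so check them separately.
        if vmid in guest_vmids or vmid in template_vmids:
            continue
        orphans.append(backup)
    return orphans


backups = [{"vmid": 100}, {"vmid": 9000}, {"vmid": 555}]
orphans = find_orphaned_backups(backups, {100}, {9000: "node1"})
```

Without the template map, VMID 9000 would be reported as orphaned even though its template still exists.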
rcourtman
caff845c1a fix(ui): use Proxmox tag colours from datacenter config
Pulse was generating tag colours from a hash of the tag name instead
of using the colours configured in Proxmox. Now polls /cluster/options
once per PVE instance and merges the tag-style colour map into state,
which the frontend uses as the first-priority colour source for tag
badges. Falls back to the existing special-tag and hash-based colours
when Proxmox hasn't set a custom colour for a tag.
2026-03-15 19:49:46 +00:00
rcourtman
d05a00b931 fix(monitoring): smooth transient VM memory fallback spikes 2026-03-10 23:06:17 +00:00
rcourtman
afcfb23a30 fix(monitoring): retain intermittent FreeBSD SMART data 2026-03-10 22:52:25 +00:00
rcourtman
7dab977d91 Add split memory bar showing Used | Cache | Free segments (#1302)
Show reclaimable buff/cache as a distinct amber segment between used
(green) and free (gray) in the memory bar. This explains why Pulse's
memory percentage differs from Proxmox: Pulse reports cache-aware
usage (MemAvailable) while Proxmox includes cache as used (Total-Free).

Backend: add Cache field to Memory model, derived from MemInfo
(Available - Free). Only uses MemInfo.Free (not FreeMem fallback) to
avoid inflating cache by the balloon gap on ballooned VMs.

Frontend: StackedMemoryBar renders three segments with tooltip
breakdown. Tooltip Free accounts for balloon limit when active.
Percentage label and alerts remain cache-aware (unchanged).
2026-03-10 10:16:14 +00:00
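
As a rough illustration of the three-segment split, the sketch below derives Used, Cache, and Free from total, available, and free memory. The function name and the clamping are assumptions; only the relation Cache = Available - Free and the contrast with Proxmox's Total - Free come from the commit message.

```python
def memory_segments(total, available, free):
    """Split memory into (used, cache, free), all in the same unit.

    Clamping is an assumption added here so the segments never go negative
    and always sum to `total`.
    """
    available = min(available, total)
    free = min(free, available)
    used = total - available   # cache-aware usage (based on MemAvailable)
    cache = available - free   # reclaimable buff/cache
    return used, cache, free


used, cache, free = memory_segments(total=8, available=6, free=2)
cache_aware_pct = 100 * used / 8            # what Pulse reports
cache_as_used_pct = 100 * (8 - free) / 8    # Total - Free, as Proxmox counts it
```

With these example numbers the same VM reads as lightly loaded in the cache-aware view and as mostly full when cache is counted as used.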
rcourtman
7a394ed724 Use explicit success flag for disk carry-forward guard (#1319)
Replace the diskUsage <= 0 heuristic with a diskFromAgent bool that is
only set when the guest agent actually returns valid filesystem data.
Prevents carry-forward from firing on a genuine 0% disk reading.
2026-03-09 18:54:27 +00:00
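
The difference between the old heuristic and an explicit success flag can be shown with a small sketch. All names here (`DiskReading`, `resolve_disk`, `agent_enabled`) are hypothetical; the real change uses a `diskFromAgent` bool in the Go pollers.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DiskReading:
    used: int
    total: int


def resolve_disk(agent_reading: Optional[DiskReading],
                 previous: Optional[DiskReading],
                 agent_enabled: bool = True) -> Optional[DiskReading]:
    """Pick the disk reading for this poll cycle.

    `agent_reading` is None only when the guest agent returned no valid
    filesystem data, so a genuine 0% reading is still trusted.
    """
    disk_from_agent = agent_reading is not None
    if disk_from_agent:
        return agent_reading
    # Carry forward only while the agent is expected to come back.
    if agent_enabled and previous is not None:
        return previous
    return None
```

A check such as `used <= 0` cannot tell an empty disk from a failed agent call, which is why the flag is derived from the call's outcome rather than from the value.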
rcourtman
9c279732f7 Skip disk carry-forward when guest agent is explicitly disabled (#1319)
Prevents stale disk data from persisting indefinitely in the efficient
poller when a user disables the guest agent after it had been providing
data.  Matches the fallback poller's agent-disabled exclusion.
2026-03-09 18:37:38 +00:00
rcourtman
abbd0df609 Fix disk metric spikes when guest agent intermittently fails (#1319)
Carry forward previous cycle's disk data when the QEMU guest agent
times out or errors, instead of falling back to Proxmox cluster/resources
which always reports 0 for VM disk usage.  Applied to both polling paths
(pollVMsAndContainersEfficient and pollVMsWithNodes) with safety guards
against uint64 underflow and permanent-failure exclusions.
2026-03-09 18:23:15 +00:00
rcourtman
a4b0771974 Prevent removed host agents from resurrecting via in-flight reports (#1331)
Host agents removed from the UI would reappear on the next report cycle
because there was no rejection mechanism — unlike Docker agents which
already had resurrection prevention. Mirror the Docker agent pattern:

- Track removed host IDs in a `removedHosts` map with 24hr TTL
- Persist removal records in `State.RemovedHosts` for frontend display
- Reject reports from removed hosts in `ApplyHostReport()`
- Add `AllowHostReenroll()` + API route to clear the block
- Show removed host agents in the Settings UI with "Allow re-enroll"
- Sync removed-agent maps from state on startup for all agent types
- Fix mock integration snapshot missing `RemovedDockerHosts` field
2026-03-09 17:52:34 +00:00
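
A minimal sketch of the rejection mechanism listed above, assuming an in-memory map keyed by host ID. The class and method names are illustrative and do not mirror the Go implementation; only the 24-hour TTL, the rejection of reports, and the re-enroll escape hatch come from the commit message.

```python
import time


class RemovedHosts:
    """Tracks removed host IDs so in-flight reports cannot resurrect them."""

    TTL_SECONDS = 24 * 60 * 60  # 24hr TTL, as in the commit message

    def __init__(self, now=time.time):
        self._removed = {}  # host ID -> removal timestamp
        self._now = now     # injectable clock, an assumption for testability

    def remove(self, host_id):
        self._removed[host_id] = self._now()

    def allow_reenroll(self, host_id):
        # Clears the block so the next report is accepted again.
        self._removed.pop(host_id, None)

    def should_reject(self, host_id):
        removed_at = self._removed.get(host_id)
        if removed_at is None:
            return False
        if self._now() - removed_at >= self.TTL_SECONDS:
            del self._removed[host_id]  # record expired
            return False
        return True
```

A report handler would call `should_reject` before applying a report and drop it when the result is true.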
rcourtman
572520ebc6 Promote guest-agent /proc/meminfo fallback for accurate VM memory (#1270)
Move the guest-agent file-read of /proc/meminfo earlier in the memory
fallback chain so it runs before RRD, giving real-time MemAvailable that
correctly excludes reclaimable buff/cache on Linux VMs. Also add
VM.GuestAgent.FileRead permission for PVE 9 and fix install.sh to use
comma-separated privilege strings.
2026-03-09 10:04:28 +00:00
rcourtman
aa139b73fb Fix intermittent VM disappearance from dashboard (#555)
Two root causes: (1) When Proxmox cluster/resources returns a partial
response (e.g. during migration or transient API issue), VMs missing
from a responsive node were silently dropped because the node appeared
in nodesWithResources, bypassing grace-period preservation. Now
preserves recently-seen guests from online nodes for up to the grace
window. (2) The task queue allowed overlapping polls for the same PVE
instance — a slower stale poll could overwrite a newer complete VM list.
Added per-instance execution lock to skip duplicate scheduled tasks.
2026-03-08 22:16:24 +00:00
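
The second fix, skipping a scheduled poll while another poll for the same instance is still running, might look like the following sketch. `InstancePollGuard` is a hypothetical name, and the real task queue is written in Go.

```python
import threading


class InstancePollGuard:
    """Allows at most one in-flight poll per PVE instance."""

    def __init__(self):
        self._mu = threading.Lock()
        self._running = set()

    def try_start(self, instance):
        """Return True if the caller may poll; False if a poll is in flight."""
        with self._mu:
            if instance in self._running:
                return False
            self._running.add(instance)
            return True

    def finish(self, instance):
        with self._mu:
            self._running.discard(instance)


guard = InstancePollGuard()
if guard.try_start("pve1"):
    try:
        pass  # poll the instance here
    finally:
        guard.finish("pve1")
```

Skipping rather than queueing the duplicate means a slow, stale poll can never finish after a newer one and overwrite its VM list.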
rcourtman
ff1bbe2fb8 Guard per-VM guest agent calls with timeout and panic recovery (#1319)
A broken or hung qemu-agent on one VM could stall the entire polling
loop, preventing higher-VMID VMs from being detected. Wrap all guest
agent work in a 10s per-VM budget with panic recovery, and add a 2s
timeout to GetVMStatus in the efficient poller to match the legacy path.
2026-03-07 22:30:18 +00:00
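
The per-VM budget can be approximated in a short sketch: run the guest-agent work in a worker, wait at most the budget, and turn any crash into an error value instead of letting it escape. This is an analogy, not a port. Python exceptions stand in for Go panics, and `call_with_budget` is a hypothetical helper.

```python
import threading


def call_with_budget(fn, budget_seconds):
    """Run fn() with a time budget; return (value, error) and never raise."""
    result = {}

    def worker():
        try:
            result["value"] = fn()
        except Exception as exc:  # stands in for panic recovery
            result["error"] = exc

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    thread.join(budget_seconds)
    if thread.is_alive():
        # The call is abandoned, not cancelled; polling moves on to the next VM.
        return None, TimeoutError("guest agent work exceeded its budget")
    return result.get("value"), result.get("error")
```

With a budget per VM, one hung agent costs the loop a bounded amount of time instead of blocking every VM that comes after it.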
rcourtman
0dd3fc779b Fix alert disable notification suppression
2026-03-07 18:40:08 +00:00
rcourtman
499ab812e3 Fix post-release regressions and lock v5 to single-tenant runtime 2026-03-05 23:46:35 +00:00
rcourtman
a4571f580b fix(monitoring): harden VM memory selection and flag repeated VM usage 2026-03-03 16:19:17 +00:00
rcourtman
ff9dc34687 Fix offline host visibility/alerting across restarts (#1311) 2026-03-03 15:43:29 +00:00
rcourtman
60bdc9a101 fix(memory): skip meminfo-derived when balloon lacks cache metrics (#1302)
When the balloon driver reports Free but not Buffers or Cached, the
meminfo-derived fallback computed memAvailable = Free alone, counting
all reclaimable page cache as used memory. This caused Linux VMs to
show wildly inflated usage (e.g. 93% when actual is 21%).

Now meminfo-derived requires at least one cache metric (Buffers > 0
or Cached > 0) before trusting the value. When missing, the code
falls through to RRD/guest-agent/Total-Used fallbacks which provide
accurate cache-aware data. Both efficient and traditional polling
paths are now consistent.
2026-03-02 11:48:18 +00:00
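
The gate described here can be sketched as follows. The function name is hypothetical, and the formula Free + Buffers + Cached is an assumption inferred from the message (which says the old code degraded to Free alone when the cache metrics were absent).

```python
def meminfo_derived_available(free, buffers, cached):
    """Return derived available memory, or None when it cannot be trusted.

    Without at least one cache metric the result would equal `free`,
    which counts all reclaimable page cache as used.
    """
    if buffers <= 0 and cached <= 0:
        return None  # caller falls through to the next memory source
    return free + buffers + cached
```

Returning a sentinel instead of a low number lets the caller move on to the RRD, guest-agent, or Total-Used fallbacks.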
rcourtman
eb2397d99a fix(notifications): route escalation notifications to selected channels only (#1259)
Escalation was calling SendAlert() which always sends to all enabled
channels, ignoring the per-level channel selection (email/webhook/all).

Add SendAlertToChannels() that snapshots only the requested channel
configs and uses a distinct "_escalation" queue type so the dequeue
handler skips cooldown writes — preventing interference with the alert
manager's own re-notify cadence.
2026-02-26 20:49:10 +00:00
rcourtman
32746e2d2a fix(monitoring): use RRD memavailable fallback when PVE node cache metrics missing (#1270)
When Proxmox /nodes/{node}/status returns only total/used/free without
available/buffers/cached, EffectiveAvailable() returns Free (non-zero),
causing the RRD fallback gate to be skipped. This results in inflated
node memory where cache/buffers are counted as "used."

Widen the RRD fallback condition from requiring effectiveAvailable == 0
to triggering whenever missingCacheMetrics is true. Add negative caching
for failed RRD lookups (2-minute backoff) to avoid repeated retries.
2026-02-21 22:47:20 +00:00
rcourtman
0ae2806f18 fix(memory): add guest agent /proc/meminfo fallback to avoid VM memory inflation (#1270)
Proxmox status.Mem includes page cache as "used" memory, inflating
reported VM usage. The existing fallbacks (balloon meminfo, RRD, linked
host agent) were frequently unavailable, causing most VMs to fall
through to the inflated status-mem source.

Adds a new last-resort fallback that reads /proc/meminfo via the QEMU
guest agent file-read endpoint to get accurate MemAvailable. Results
are cached (60s positive, 5min negative backoff for unsupported VMs).

Also fixes: RRD memavailable fallback missing from traditional polling
path, cache key collisions in multi-PVE setups, FreeMem underflow
guard inconsistency, and integer overflow in kB-to-bytes conversion.
2026-02-20 13:31:52 +00:00
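
A read-through cache with separate positive and negative lifetimes, as described in this commit, might be sketched like this. `MeminfoCache`, the `fetch` callback, and the tuple key are hypothetical; the 60-second and 5-minute values are taken from the message, and the instance-qualified key reflects the multi-PVE collision fix it mentions.

```python
import time


class MeminfoCache:
    POSITIVE_TTL = 60    # seconds a successful read stays fresh
    NEGATIVE_TTL = 300   # backoff after a failed or unsupported read

    def __init__(self, fetch, now=time.time):
        self._fetch = fetch   # callable(key) -> value, or None / raises
        self._now = now
        self._entries = {}    # key -> (value, expires_at)

    def get(self, key):
        """key is (instance, node, vmid) so PVE instances never collide."""
        now = self._now()
        entry = self._entries.get(key)
        if entry is not None and now < entry[1]:
            return entry[0]
        try:
            value = self._fetch(key)
        except Exception:
            value = None
        ttl = self.POSITIVE_TTL if value is not None else self.NEGATIVE_TTL
        self._entries[key] = (value, now + ttl)
        return value
```

The longer negative lifetime keeps VMs without a usable guest agent from triggering a file-read attempt on every poll cycle.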
rcourtman
8c7d507ea4 fix(alerts): make --disk-exclude suppress Proxmox SSD wear/health alerts (#1142)
The --disk-exclude agent flag only filtered local metric collection but
had no effect on server-side Proxmox disk health and SSD wearout alerts,
which poll the Proxmox API directly. Users excluding disks (e.g.
--disk-exclude sda) still received alerts for those disks.

Agent now sends its DiskExclude patterns in each report. The server
stores them on the Host model and consults them during Proxmox disk
polling — excluded disks get a synthetic healthy status passed to
CheckDiskHealth so any existing alerts clear immediately.

Also adds FreeBSD pseudo-filesystem types (fdescfs, devfs, linprocfs,
linsysfs) to the virtual FS filter and /var/run/ to special mount
prefixes, fixing false disk-full alerts on FreeBSD for fdescfs mounts.
2026-02-20 13:31:52 +00:00
rcourtman
fb7582c7e4 fix(memory): use linked Pulse host agent memory to avoid VM inflation (#1270)
When no guest agent MemInfo or RRD data is available, prefer the linked
Pulse host agent's memory (read from /proc/meminfo via gopsutil, which
excludes page cache) over Proxmox's status.Mem (total - free, inflated
by reclaimable cache). Applied to both efficient and traditional polling
paths. Diagnostic fields added to VMMemoryRaw for visibility.
2026-02-19 19:04:19 +00:00
rcourtman
71b8b81af5 fix(monitoring): cache per-VM RRD memory lookups to avoid serial HTTP calls
Windows VMs and VMs without qemu-guest-agent triggered an uncached
GetVMRRDData HTTP call on every poll cycle. Add vmRRDMemCache using the
same read-through cache pattern as nodeRRDMemCache (shared rrdCacheMu,
same TTL, same cleanup path).

(cherry picked from commit 582f16004a0f275de4c458e5d288be70eee613e4)
2026-02-18 12:57:15 +00:00
rcourtman
efa916ee2a fix(memory): correct memory reporting for Linux VMs and FreeBSD ZFS ARC
Linux VM page cache (#1270): QEMU VM memory now falls back to Proxmox
RRD's memavailable metric (which excludes reclaimable page cache) when
the qemu-guest-agent doesn't provide MemInfo.Available. Previously the
fallback was detailedStatus.Mem (total - MemFree), inflating usage to
80%+ on VMs with normal Linux page cache. Mirrors the existing LXC
rrd-memavailable path.

FreeBSD ZFS ARC (#1264, #1051): The host agent now reads
kstat.zfs.misc.arcstats.size via SysctlRaw on FreeBSD and subtracts
the ARC size from reported memory usage. ZFS ARC is reclaimable under
memory pressure (like Linux SReclaimable) but gopsutil counts it as
wired/non-reclaimable, causing false 90%+ memory alerts on TrueNAS
and FreeBSD hosts. Build-tagged so it compiles cleanly on all platforms.

Fixes #1270
Fixes #1264
Fixes #1051

(cherry picked from commit 94502f83ff9ffc6da28aaadc946a2f7d8b4e9bac)
2026-02-18 12:56:53 +00:00
rcourtman
df23d80919 fix(alerts): always send recovery notifications regardless of quiet hours
Recovery (all-clear) notifications were being silently suppressed during
quiet hours for any non-critical alert. Since powered-off alerts default
to Warning level, users who received an alert at 2pm would never get the
recovery notification if the VM came back during quiet hours.

Quiet hours are intended to suppress noisy firing alerts, not to hide
the fact that an issue has resolved. If you got the alert, you should
always get the all-clear.

Remove the ShouldSuppressResolvedNotification gate from handleAlertResolved.
The notifyOnResolve toggle (explicit user preference) is still respected.

Fixes #1259
2026-02-18 12:53:09 +00:00
rcourtman
03939c3f9e fix: deduplicate bind-mounted volumes in disk total calculation
The dedup logic only handled btrfs/zfs subvolumes, but Kubernetes
bind-mounts the same device at both pod and plugin paths, causing
xfs/ext4 volumes to be double-counted. Now deduplicates by
device+totalBytes for all filesystem types.

Fixes #1158
2026-02-10 21:52:25 +00:00
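
Deduplicating by device and total size, independent of filesystem type, can be sketched as below. The mount dictionaries and the function name are illustrative only.

```python
def total_disk_bytes(mounts):
    """Sum total bytes, counting each (device, total) pair only once."""
    seen = set()
    total = 0
    for mount in mounts:
        key = (mount["device"], mount["total"])
        if key in seen:
            continue  # same volume mounted at another path
        seen.add(key)
        total += mount["total"]
    return total


mounts = [
    {"device": "/dev/sdb1", "total": 500, "path": "/var/lib/kubelet/pods/a/vol"},
    {"device": "/dev/sdb1", "total": 500, "path": "/var/lib/kubelet/plugins/vol"},
    {"device": "/dev/sda1", "total": 200, "path": "/"},
]
disk_total = total_disk_bytes(mounts)
```

The two example paths under `/var/lib/kubelet` are made up to show one device appearing twice; they contribute a single 500 to the total.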
rcourtman
26776b2075 fix(agent): apply --disk-exclude to Docker agent disk metrics (#1237)
The Docker agent was not passing the disk exclusion list to
hostmetricsCollect(), so excluded mounts appeared in the Docker tab
disk totals. Also add server-side fsfilters filtering to Docker
report processing for parity with the host agent path.
2026-02-10 16:59:35 +00:00
rcourtman
8a48acef1d fix: hotfix 5.1.5 — node duplication, alert scrambling, ntfy resolved formatting
- fix(models): filter nodes by instance in UpdateNodesForInstance to prevent
  PVE node duplication across poll cycles (#1214, #1192, #1217)
- fix(alerts): sort GetActiveAlerts output for stable ordering, preventing
  hostname scrambling in frontend (#1218)
- fix(notifications): add ntfy-specific resolved webhook formatting with
  plain-text body and proper headers (#1213)
- fix(frontend): respect "hide Docker update actions" setting in
  DockerFilter Update All button (#1219)
- fix(frontend): add missing v prefix to GitHub release tag URLs (#1195)
- fix(monitoring): reduce disk detection warning from Warn to Debug to
  eliminate log spam for pass-through disks (#1216)
- chore: bump VERSION to 5.1.5
2026-02-08 11:48:22 +00:00
rcourtman
d1e61d8a8a fix: ship alerting hotfixes and prepare 5.1.4 2026-02-07 22:05:55 +00:00
rcourtman
13af83f3fc fix(monitoring): preserve recent PVE nodes on empty polls (#1094) 2026-02-07 14:18:33 +00:00