Commit graph

407 commits

Author SHA1 Message Date
rcourtman
c19b06d1f4 fix(telemetry): aggregate tenant snapshots 2026-03-28 22:31:33 +00:00
rcourtman
c03ec1e74d fix(monitoring): preserve canonical agent identity 2026-03-27 12:14:40 +00:00
rcourtman
2dac3bedef test(recovery): cover downstream malformed metadata consumers 2026-03-26 22:58:27 +00:00
rcourtman
2afb96ee13 fix(release): align api and hostagent rc contracts 2026-03-26 17:08:48 +00:00
rcourtman
2b93a08558 Carry Proxmox pool membership into VM inventory export 2026-03-25 21:58:46 +00:00
rcourtman
fadd3ef2bd Prefer linked host disk metrics for Proxmox nodes 2026-03-25 16:49:47 +00:00
rcourtman
a2b9723db9 Canonicalize top-level infrastructure counting 2026-03-23 16:38:13 +00:00
rcourtman
da20a171dd Project incident timelines from canonical history 2026-03-20 11:42:26 +00:00
rcourtman
abc4b1a62a Canonicalize incident timeline events 2026-03-20 11:30:16 +00:00
rcourtman
e4ad855d90 Centralize connected infrastructure display names 2026-03-19 03:27:48 +00:00
rcourtman
778a2577b6 feat: Pulse v6 release 2026-03-18 16:06:30 +00:00
rcourtman
2fe22c3308 fix(backups): prevent template backups from being flagged as orphaned
Proxmox VM/LXC templates are intentionally excluded from the monitored
guest list, but their backup files exist on storage. The orphan-detection
logic was firing for every template backup because the VMID was never
in the guest lookup maps.

Fix: track template VMID→node pairs in State.templateVMIDs (unexported,
not serialised to API/frontend) during the resources poll loop, expose
via StateSnapshot.TemplateVMIDs, and use in both buildGuestLookups() and
the storage backup node-resolution map so orphan detection treats template
backups as valid. Also preserves the template map through the cluster
health grace-period path (zero-resource preservation), the partial-node
grace-period path, and clears it on instance removal.

Closes #1352
2026-03-17 09:04:22 +00:00
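The template-tracking mechanism in this commit can be sketched in Go. A minimal sketch, assuming a map of VMID→node pairs; the type and field names here are illustrative, not the actual Pulse internals:

```go
package main

import "fmt"

// Illustrative sketch of the fix: template VMIDs are tracked alongside
// monitored guests so orphan detection can treat their backups as valid.
// Field names are hypothetical, not the real Pulse State type.
type State struct {
	guestVMIDs    map[int]string // monitored guests: VMID -> node
	templateVMIDs map[int]string // templates: VMID -> node (unexported, not serialised)
}

// isOrphanBackup reports whether a backup's VMID matches neither a
// monitored guest nor a known template.
func (s *State) isOrphanBackup(vmid int) bool {
	if _, ok := s.guestVMIDs[vmid]; ok {
		return false
	}
	if _, ok := s.templateVMIDs[vmid]; ok {
		return false // template backup: excluded from guests, still valid
	}
	return true
}

func main() {
	s := &State{
		guestVMIDs:    map[int]string{100: "pve1"},
		templateVMIDs: map[int]string{9000: "pve1"},
	}
	fmt.Println(s.isOrphanBackup(9000)) // template backup is not an orphan
}
```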
rcourtman
caff845c1a fix(ui): use Proxmox tag colours from datacenter config
Pulse was generating tag colours from a hash of the tag name instead
of using the colours configured in Proxmox. Now polls /cluster/options
once per PVE instance and merges the tag-style colour map into state,
which the frontend uses as the first-priority colour source for tag
badges. Falls back to the existing special-tag and hash-based colours
when Proxmox hasn't set a custom colour for a tag.
2026-03-15 19:49:46 +00:00
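The colour priority chain this commit describes can be sketched as follows. A minimal sketch: the map contents and function names are assumptions for illustration, not Pulse's actual code:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// Illustrative priority chain: datacenter-configured colours first,
// then special-tag defaults, then a hash-derived hue.
var specialTagColours = map[string]string{"production": "#e74c3c"}

// hashHue derives a stable fallback hue from the tag name.
func hashHue(tag string) string {
	h := fnv.New32a()
	h.Write([]byte(tag))
	return fmt.Sprintf("hsl(%d, 70%%, 50%%)", h.Sum32()%360)
}

// tagColour resolves a badge colour, preferring the colour map polled
// from /cluster/options.
func tagColour(tag string, datacenterColours map[string]string) string {
	if c, ok := datacenterColours[tag]; ok {
		return c // colour configured in Proxmox wins
	}
	if c, ok := specialTagColours[tag]; ok {
		return c
	}
	return hashHue(tag)
}

func main() {
	fmt.Println(tagColour("web", map[string]string{"web": "#1f6feb"}))
}
```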
rcourtman
da982d0fca Prepare v5.1.24 release 2026-03-14 16:43:26 +00:00
rcourtman
d05a00b931 fix(monitoring): smooth transient VM memory fallback spikes 2026-03-10 23:06:17 +00:00
rcourtman
afcfb23a30 fix(monitoring): retain intermittent FreeBSD SMART data 2026-03-10 22:52:25 +00:00
rcourtman
7dab977d91 Add split memory bar showing Used | Cache | Free segments (#1302)
Show reclaimable buff/cache as a distinct amber segment between used
(green) and free (gray) in the memory bar. This explains why Pulse's
memory percentage differs from Proxmox: Pulse reports cache-aware
usage (MemAvailable) while Proxmox includes cache as used (Total-Free).

Backend: add Cache field to Memory model, derived from MemInfo
(Available - Free). Only uses MemInfo.Free (not FreeMem fallback) to
avoid inflating cache by the balloon gap on ballooned VMs.

Frontend: StackedMemoryBar renders three segments with tooltip
breakdown. Tooltip Free accounts for balloon limit when active.
Percentage label and alerts remain cache-aware (unchanged).
2026-03-10 10:16:14 +00:00
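The segment derivation above can be sketched directly, assuming MemInfo supplies Total, Free, and Available in bytes:

```go
package main

import "fmt"

// Sketch of the split-bar maths: Cache = Available - Free is the
// reclaimable buff/cache shown as the amber middle segment, and
// Used = Total - Available is the cache-aware usage Pulse reports.
type MemInfo struct{ Total, Free, Available uint64 }

// segments splits memory into Used | Cache | Free for the stacked bar.
func segments(m MemInfo) (used, cache, free uint64) {
	if m.Available > m.Free {
		cache = m.Available - m.Free // reclaimable buff/cache
	}
	used = m.Total - m.Available // cache-aware, matching MemAvailable
	free = m.Free
	return
}

func main() {
	u, c, f := segments(MemInfo{Total: 1000, Free: 200, Available: 600})
	fmt.Println(u, c, f) // 400 400 200
}
```

This also makes the Proxmox discrepancy concrete: Proxmox would report (Total-Free)/Total = 80% here, while the cache-aware figure is 40%.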
rcourtman
7a394ed724 Use explicit success flag for disk carry-forward guard (#1319)
Replace the diskUsage <= 0 heuristic with a diskFromAgent bool that is
only set when the guest agent actually returns valid filesystem data.
Prevents carry-forward from firing on a genuine 0% disk reading.
2026-03-09 18:54:27 +00:00
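The explicit-flag guard can be sketched like this; names are illustrative, not the real poller types:

```go
package main

import "fmt"

// Sketch of replacing the diskUsage <= 0 heuristic with an explicit
// success flag, so a genuine 0% reading is kept rather than treated
// as missing data.
type diskSample struct {
	usage     float64
	fromAgent bool // set only when the guest agent returned valid filesystem data
}

// resolveDisk keeps the current sample when the agent succeeded,
// otherwise carries the previous cycle's sample forward.
func resolveDisk(cur, prev diskSample) diskSample {
	if cur.fromAgent {
		return cur // even a true 0% reading is trusted
	}
	return prev
}

func main() {
	prev := diskSample{usage: 42, fromAgent: true}
	fmt.Println(resolveDisk(diskSample{usage: 0, fromAgent: true}, prev).usage)  // genuine 0%
	fmt.Println(resolveDisk(diskSample{usage: 0, fromAgent: false}, prev).usage) // carried forward: 42
}
```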
rcourtman
9c279732f7 Skip disk carry-forward when guest agent is explicitly disabled (#1319)
Prevents stale disk data from persisting indefinitely in the efficient
poller when a user disables the guest agent after it had been providing
data. Matches the fallback poller's agent-disabled exclusion.
2026-03-09 18:37:38 +00:00
rcourtman
abbd0df609 Fix disk metric spikes when guest agent intermittently fails (#1319)
Carry forward previous cycle's disk data when the QEMU guest agent
times out or errors, instead of falling back to Proxmox cluster/resources
which always reports 0 for VM disk usage. Applied to both polling paths
(pollVMsAndContainersEfficient and pollVMsWithNodes) with safety guards
against uint64 underflow and permanent-failure exclusions.
2026-03-09 18:23:15 +00:00
rcourtman
a4b0771974 Prevent removed host agents from resurrecting via in-flight reports (#1331)
Host agents removed from the UI would reappear on the next report cycle
because there was no rejection mechanism — unlike Docker agents which
already had resurrection prevention. Mirror the Docker agent pattern:

- Track removed host IDs in a `removedHosts` map with 24hr TTL
- Persist removal records in `State.RemovedHosts` for frontend display
- Reject reports from removed hosts in `ApplyHostReport()`
- Add `AllowHostReenroll()` + API route to clear the block
- Show removed host agents in the Settings UI with "Allow re-enroll"
- Sync removed-agent maps from state on startup for all agent types
- Fix mock integration snapshot missing `RemovedDockerHosts` field
2026-03-09 17:52:34 +00:00
rcourtman
572520ebc6 Promote guest-agent /proc/meminfo fallback for accurate VM memory (#1270)
Move the guest-agent file-read of /proc/meminfo earlier in the memory
fallback chain so it runs before RRD, giving real-time MemAvailable that
correctly excludes reclaimable buff/cache on Linux VMs. Also add
VM.GuestAgent.FileRead permission for PVE 9 and fix install.sh to use
comma-separated privilege strings.
2026-03-09 10:04:28 +00:00
rcourtman
aa139b73fb Fix intermittent VM disappearance from dashboard (#555)
Two root causes: (1) When Proxmox cluster/resources returns a partial
response (e.g. during migration or transient API issue), VMs missing
from a responsive node were silently dropped because the node appeared
in nodesWithResources, bypassing grace-period preservation. Now
preserves recently-seen guests from online nodes for up to the grace
window. (2) The task queue allowed overlapping polls for the same PVE
instance — a slower stale poll could overwrite a newer complete VM list.
Added per-instance execution lock to skip duplicate scheduled tasks.
2026-03-08 22:16:24 +00:00
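The per-instance execution lock in fix (2) can be sketched as a guarded map; a minimal illustration, not the actual task-queue code:

```go
package main

import (
	"fmt"
	"sync"
)

// Sketch: a scheduled poll is skipped when one is already in flight for
// the same PVE instance, so a slow stale poll can never overlap and
// later overwrite a newer complete VM list.
type pollGuard struct {
	mu      sync.Mutex
	running map[string]bool // instance name -> poll in flight
}

func (g *pollGuard) tryStart(instance string) bool {
	g.mu.Lock()
	defer g.mu.Unlock()
	if g.running[instance] {
		return false // duplicate scheduled task: skip it
	}
	g.running[instance] = true
	return true
}

func (g *pollGuard) finish(instance string) {
	g.mu.Lock()
	defer g.mu.Unlock()
	delete(g.running, instance)
}

func main() {
	g := &pollGuard{running: map[string]bool{}}
	fmt.Println(g.tryStart("pve1"), g.tryStart("pve1")) // second attempt skipped
	g.finish("pve1")
	fmt.Println(g.tryStart("pve1"))
}
```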
rcourtman
ff1bbe2fb8 Guard per-VM guest agent calls with timeout and panic recovery (#1319)
A broken or hung qemu-agent on one VM could stall the entire polling
loop, preventing higher-VMID VMs from being detected. Wrap all guest
agent work in a 10s per-VM budget with panic recovery, and add a 2s
timeout to GetVMStatus in the efficient poller to match the legacy path.
2026-03-07 22:30:18 +00:00
rcourtman
0dd3fc779b Fix alert disable notification suppression
2026-03-07 18:40:08 +00:00
rcourtman
499ab812e3 Fix post-release regressions and lock v5 to single-tenant runtime 2026-03-05 23:46:35 +00:00
rcourtman
a4571f580b fix(monitoring): harden VM memory selection and flag repeated VM usage 2026-03-03 16:19:17 +00:00
rcourtman
ff9dc34687 Fix offline host visibility/alerting across restarts (#1311) 2026-03-03 15:43:29 +00:00
rcourtman
60bdc9a101 fix(memory): skip meminfo-derived when balloon lacks cache metrics (#1302)
When the balloon driver reports Free but not Buffers or Cached, the
meminfo-derived fallback computed memAvailable = Free alone, counting
all reclaimable page cache as used memory. This caused Linux VMs to
show wildly inflated usage (e.g. 93% when actual is 21%).

Now meminfo-derived requires at least one cache metric (Buffers > 0
or Cached > 0) before trusting the value. When missing, the code
falls through to RRD/guest-agent/Total-Used fallbacks which provide
accurate cache-aware data. Both efficient and traditional polling
paths are now consistent.
2026-03-02 11:48:18 +00:00
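The trust check can be sketched as below. The MemAvailable estimate shown (Free + Buffers + Cached) is an assumption for illustration; the point is the gate, not the formula:

```go
package main

import "fmt"

// Sketch: the meminfo-derived value is used only when the balloon driver
// reported at least one cache metric; otherwise the caller falls through
// to the RRD/guest-agent/Total-Used fallbacks.
type balloonMemInfo struct {
	Free, Buffers, Cached uint64
}

// trustedAvailable returns a MemAvailable estimate and whether it can be
// trusted. With no cache metrics, Free alone would count all reclaimable
// page cache as used, so ok is false.
func trustedAvailable(m balloonMemInfo) (avail uint64, ok bool) {
	if m.Buffers == 0 && m.Cached == 0 {
		return 0, false // fall through to the other fallbacks
	}
	return m.Free + m.Buffers + m.Cached, true
}

func main() {
	_, ok := trustedAvailable(balloonMemInfo{Free: 100})
	fmt.Println(ok) // false: Free alone is not trusted
}
```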
rcourtman
eb2397d99a fix(notifications): route escalation notifications to selected channels only (#1259)
Escalation was calling SendAlert() which always sends to all enabled
channels, ignoring the per-level channel selection (email/webhook/all).

Add SendAlertToChannels() that snapshots only the requested channel
configs and uses a distinct "_escalation" queue type so the dequeue
handler skips cooldown writes — preventing interference with the alert
manager's own re-notify cadence.
2026-02-26 20:49:10 +00:00
rcourtman
32746e2d2a fix(monitoring): use RRD memavailable fallback when PVE node cache metrics missing (#1270)
When Proxmox /nodes/{node}/status returns only total/used/free without
available/buffers/cached, EffectiveAvailable() returns Free (non-zero),
causing the RRD fallback gate to be skipped. This results in inflated
node memory where cache/buffers are counted as "used."

Widen the RRD fallback condition from requiring effectiveAvailable == 0
to triggering whenever missingCacheMetrics is true. Add negative caching
for failed RRD lookups (2-minute backoff) to avoid repeated retries.
2026-02-21 22:47:20 +00:00
rcourtman
0ae2806f18 fix(memory): add guest agent /proc/meminfo fallback to avoid VM memory inflation (#1270)
Proxmox status.Mem includes page cache as "used" memory, inflating
reported VM usage. The existing fallbacks (balloon meminfo, RRD, linked
host agent) were frequently unavailable, causing most VMs to fall
through to the inflated status-mem source.

Adds a new last-resort fallback that reads /proc/meminfo via the QEMU
guest agent file-read endpoint to get accurate MemAvailable. Results
are cached (60s positive, 5min negative backoff for unsupported VMs).

Also fixes: RRD memavailable fallback missing from traditional polling
path, cache key collisions in multi-PVE setups, FreeMem underflow
guard inconsistency, and integer overflow in kB-to-bytes conversion.
2026-02-20 13:31:52 +00:00
rcourtman
8c7d507ea4 fix(alerts): make --disk-exclude suppress Proxmox SSD wear/health alerts (#1142)
The --disk-exclude agent flag only filtered local metric collection but
had no effect on server-side Proxmox disk health and SSD wearout alerts,
which poll the Proxmox API directly. Users excluding disks (e.g.
--disk-exclude sda) still received alerts for those disks.

Agent now sends its DiskExclude patterns in each report. The server
stores them on the Host model and consults them during Proxmox disk
polling — excluded disks get a synthetic healthy status passed to
CheckDiskHealth so any existing alerts clear immediately.

Also adds FreeBSD pseudo-filesystem types (fdescfs, devfs, linprocfs,
linsysfs) to the virtual FS filter and /var/run/ to special mount
prefixes, fixing false disk-full alerts on FreeBSD for fdescfs mounts.
2026-02-20 13:31:52 +00:00
rcourtman
fb7582c7e4 fix(memory): use linked Pulse host agent memory to avoid VM inflation (#1270)
When no guest agent MemInfo or RRD data is available, prefer the linked
Pulse host agent's memory (read from /proc/meminfo via gopsutil, which
excludes page cache) over Proxmox's status.Mem (total - free, inflated
by reclaimable cache). Applied to both efficient and traditional polling
paths. Diagnostic fields added to VMMemoryRaw for visibility.
2026-02-19 19:04:19 +00:00
rcourtman
71b8b81af5 fix(monitoring): cache per-VM RRD memory lookups to avoid serial HTTP calls
Windows VMs and VMs without qemu-guest-agent triggered an uncached
GetVMRRDData HTTP call on every poll cycle. Add vmRRDMemCache using the
same read-through cache pattern as nodeRRDMemCache (shared rrdCacheMu,
same TTL, same cleanup path).

(cherry picked from commit 582f16004a0f275de4c458e5d288be70eee613e4)
2026-02-18 12:57:15 +00:00
rcourtman
efa916ee2a fix(memory): correct memory reporting for Linux VMs and FreeBSD ZFS ARC
Linux VM page cache (#1270): QEMU VM memory now falls back to Proxmox
RRD's memavailable metric (which excludes reclaimable page cache) when
the qemu-guest-agent doesn't provide MemInfo.Available. Previously the
fallback was detailedStatus.Mem (total - MemFree), inflating usage to
80%+ on VMs with normal Linux page cache. Mirrors the existing LXC
rrd-memavailable path.

FreeBSD ZFS ARC (#1264, #1051): The host agent now reads
kstat.zfs.misc.arcstats.size via SysctlRaw on FreeBSD and subtracts
the ARC size from reported memory usage. ZFS ARC is reclaimable under
memory pressure (like Linux SReclaimable) but gopsutil counts it as
wired/non-reclaimable, causing false 90%+ memory alerts on TrueNAS
and FreeBSD hosts. Build-tagged so it compiles cleanly on all platforms.

Fixes #1270
Fixes #1264
Fixes #1051

(cherry picked from commit 94502f83ff9ffc6da28aaadc946a2f7d8b4e9bac)
2026-02-18 12:56:53 +00:00
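The FreeBSD ARC adjustment reduces to a guarded subtraction. On FreeBSD the ARC size would come from kstat.zfs.misc.arcstats.size via SysctlRaw; here it is a plain parameter for illustration:

```go
package main

import "fmt"

// Sketch: subtract the reclaimable ZFS ARC size from the used figure
// gopsutil reports (which counts ARC as wired), with an underflow guard.
func adjustedUsed(used, arcSize uint64) uint64 {
	if arcSize >= used {
		return 0 // guard against uint64 underflow
	}
	return used - arcSize
}

func main() {
	// 16 GiB reported used, 10 GiB of which is reclaimable ARC.
	fmt.Println(adjustedUsed(16<<30, 10<<30) == 6<<30) // true
}
```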
rcourtman
9d8f8b45b5 fix(docker,metrics): preserve container metadata on update and reduce DB writes
Docker container URL preserved on update (#1054): container updates
recreate the container with a new runtime ID. The agent now includes
{oldContainerId, newContainerId} in the completion ACK payload; the
server uses this to copy persisted metadata (custom URLs, descriptions,
tags) to the new ID so nothing is lost. Migration is a copy, not a move,
so rollback scenarios still find metadata under the original ID.

Reduce metrics.db write amplification (#1124): add a UNIQUE index on
(resource_type, resource_id, metric_type, timestamp, tier) so rollup
reprocessing after a failed checkpoint uses INSERT OR IGNORE instead of
creating duplicate rows. Existing duplicates are deduplicated once on
startup if the index creation would otherwise fail. Also sets
wal_autocheckpoint(500) to checkpoint the WAL more frequently, preventing
unbounded WAL growth.

Fixes #1054
Fixes #1124
2026-02-18 12:56:46 +00:00
rcourtman
df23d80919 fix(alerts): always send recovery notifications regardless of quiet hours
Recovery (all-clear) notifications were being silently suppressed during
quiet hours for any non-critical alert. Since powered-off alerts default
to Warning level, users who received an alert at 2pm would never get the
recovery notification if the VM came back during quiet hours.

Quiet hours are intended to suppress noisy firing alerts, not to hide
the fact that an issue has resolved. If you got the alert, you should
always get the all-clear.

Remove the ShouldSuppressResolvedNotification gate from handleAlertResolved.
The notifyOnResolve toggle (explicit user preference) is still respected.

Fixes #1259
2026-02-18 12:53:09 +00:00
rcourtman
d4ff967815 fix: scope shared storage aggregation to per-instance to prevent cross-instance merging
The shared storage deduplication key was just the storage name, causing
storages with the same name from different Proxmox instances (or PVE + PBS)
to be incorrectly merged into a single entry. This made one random host
appear to have all storages from all instances.

Include the instance name in the aggregation key so shared storage is only
merged within the same Proxmox cluster/instance.

Fixes #1246
2026-02-11 09:18:09 +00:00
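The scoped aggregation key can be sketched as below; the types are illustrative:

```go
package main

import "fmt"

// Sketch: shared storages are grouped by instance+name, so identically
// named storages from different Proxmox instances (or PVE + PBS) are
// never merged into one entry.
type storage struct {
	Instance, Name string
}

func aggregateShared(storages []storage) map[string][]storage {
	out := map[string][]storage{}
	for _, s := range storages {
		key := s.Instance + "/" + s.Name // instance-scoped, not Name alone
		out[key] = append(out[key], s)
	}
	return out
}

func main() {
	merged := aggregateShared([]storage{
		{Instance: "pve1", Name: "local-zfs"},
		{Instance: "pve2", Name: "local-zfs"},
	})
	fmt.Println(len(merged)) // 2: one entry per instance
}
```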
rcourtman
03939c3f9e fix: deduplicate bind-mounted volumes in disk total calculation
The dedup logic only handled btrfs/zfs subvolumes, but Kubernetes
bind-mounts the same device at both pod and plugin paths, causing
xfs/ext4 volumes to be double-counted. Now deduplicates by
device+totalBytes for all filesystem types.

Fixes #1158
2026-02-10 21:52:25 +00:00
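The generalized dedup rule can be sketched like this, assuming a simple mount record:

```go
package main

import "fmt"

// Sketch: count each device+totalBytes pair once, so a filesystem
// bind-mounted at several paths (pod and plugin mounts under
// Kubernetes) is not double-counted, regardless of filesystem type.
type mount struct {
	Device     string
	TotalBytes uint64
}

func totalDiskBytes(mounts []mount) uint64 {
	seen := map[string]bool{}
	var total uint64
	for _, m := range mounts {
		key := fmt.Sprintf("%s:%d", m.Device, m.TotalBytes)
		if seen[key] {
			continue // same device surfaced at another mount point
		}
		seen[key] = true
		total += m.TotalBytes
	}
	return total
}

func main() {
	fmt.Println(totalDiskBytes([]mount{
		{"/dev/sda1", 500}, // pod path
		{"/dev/sda1", 500}, // plugin path, same device
		{"/dev/sdb1", 200},
	})) // 700, not 1200
}
```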
rcourtman
26776b2075 fix(agent): apply --disk-exclude to Docker agent disk metrics (#1237)
The Docker agent was not passing the disk exclusion list to
hostmetricsCollect(), so excluded mounts appeared in the Docker tab
disk totals. Also add server-side fsfilters filtering to Docker
report processing for parity with the host agent path.
2026-02-10 16:59:35 +00:00
rcourtman
f7a14feb0f fix(mock): align Docker container store type with real monitor
Mock seeding wrote Docker container metrics as "docker" but the real
monitor uses "dockerContainer". This made mock-mode charts miss the
SQLite store path after the API normalization fix in 7336ec2d.
2026-02-09 22:42:08 +00:00
rcourtman
cedf0c8f0f fix(temperature): parse string sensor values without zeroing readings (#1224) 2026-02-09 14:00:09 +00:00
rcourtman
8a48acef1d fix: hotfix 5.1.5 — node duplication, alert scrambling, ntfy resolved formatting
- fix(models): filter nodes by instance in UpdateNodesForInstance to prevent
  PVE node duplication across poll cycles (#1214, #1192, #1217)
- fix(alerts): sort GetActiveAlerts output for stable ordering, preventing
  hostname scrambling in frontend (#1218)
- fix(notifications): add ntfy-specific resolved webhook formatting with
  plain-text body and proper headers (#1213)
- fix(frontend): respect "hide Docker update actions" setting in
  DockerFilter Update All button (#1219)
- fix(frontend): add missing v prefix to GitHub release tag URLs (#1195)
- fix(monitoring): reduce disk detection warning from Warn to Debug to
  eliminate log spam for pass-through disks (#1216)
- chore: bump VERSION to 5.1.5
2026-02-08 11:48:22 +00:00
rcourtman
d1e61d8a8a fix: ship alerting hotfixes and prepare 5.1.4 2026-02-07 22:05:55 +00:00
rcourtman
13af83f3fc fix(monitoring): preserve recent PVE nodes on empty polls (#1094) 2026-02-07 14:18:33 +00:00
rcourtman
ee0e89871d fix: reduce metrics memory 86x by reverting buffer and adding LTTB downsampling
The in-memory metrics buffer was changed from 1000 to 86400 points per
metric to support 30-day sparklines, but this pre-allocated ~18 MB per
guest (7 slices × 86400 × 32 bytes). With 50 guests that's 920 MB —
explaining why users needed to double their LXC memory after upgrading
to 5.1.0.

- Revert in-memory buffer to 1000 points / 24h retention
- Remove eager slice pre-allocation (use append growth instead)
- Add LTTB (Largest Triangle Three Buckets) downsampling algorithm
- Chart endpoints now use a two-tier strategy: in-memory for ranges
  ≤ 2h, SQLite persistent store + LTTB for longer ranges
- Reduce frontend ring buffer from 86400 to 2000 points

Related to #1190
2026-02-04 19:49:52 +00:00
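The LTTB algorithm named above can be sketched as follows. This is a generic textbook implementation of Largest Triangle Three Buckets, not Pulse's actual code:

```go
package main

import (
	"fmt"
	"math"
)

// LTTB keeps the first and last points, splits the rest into
// threshold-2 buckets, and from each bucket keeps the point forming the
// largest triangle with the previously kept point and the next bucket's
// average.
type point struct{ X, Y float64 }

func lttb(data []point, threshold int) []point {
	if threshold < 3 || threshold >= len(data) {
		return data
	}
	sampled := []point{data[0]}
	bucket := float64(len(data)-2) / float64(threshold-2)
	a := 0 // index of the last point kept
	for i := 0; i < threshold-2; i++ {
		// Average of the next bucket (the triangle's third anchor).
		ns, ne := int(float64(i+1)*bucket)+1, int(float64(i+2)*bucket)+1
		if ne > len(data) {
			ne = len(data)
		}
		var avgX, avgY float64
		for _, p := range data[ns:ne] {
			avgX += p.X
			avgY += p.Y
		}
		n := float64(ne - ns)
		avgX, avgY = avgX/n, avgY/n
		// Pick the max-area point from the current bucket.
		cs, ce := int(float64(i)*bucket)+1, int(float64(i+1)*bucket)+1
		best, bestArea := cs, -1.0
		for j := cs; j < ce; j++ {
			area := math.Abs((data[a].X-avgX)*(data[j].Y-data[a].Y) -
				(data[a].X-data[j].X)*(avgY-data[a].Y))
			if area > bestArea {
				bestArea, best = area, j
			}
		}
		sampled = append(sampled, data[best])
		a = best
	}
	return append(sampled, data[len(data)-1])
}

func main() {
	data := make([]point, 86400)
	for i := range data {
		data[i] = point{X: float64(i), Y: math.Sin(float64(i) / 500)}
	}
	fmt.Println(len(lttb(data, 2000))) // one day at 1s resolution down to 2000 points
}
```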
rcourtman
bcd0dbfc18 Add metrics history memory regression test 2026-02-04 19:35:19 +00:00
rcourtman
049a3e424c Add memory regression tests for agent and scheduler 2026-02-04 19:33:29 +00:00
rcourtman
64e57f0e0e fix: smooth I/O rates using sliding window like Prometheus rate()
Proxmox reports cumulative byte counters that update unevenly across
polling intervals, causing a steady 100 Mbps download to appear as
spikes up to 450 Mbps in sparkline charts. Replace per-interval rate
calculation with a 4-sample sliding window (30s at 10s polling) that
averages over the full span — the same approach Prometheus rate() uses.
2026-02-04 19:04:17 +00:00