Commit graph

407 commits

Author SHA1 Message Date
rcourtman
c19b06d1f4 fix(telemetry): aggregate tenant snapshots 2026-03-28 22:31:33 +00:00
rcourtman
c03ec1e74d fix(monitoring): preserve canonical agent identity 2026-03-27 12:14:40 +00:00
rcourtman
2dac3bedef test(recovery): cover downstream malformed metadata consumers 2026-03-26 22:58:27 +00:00
rcourtman
2afb96ee13 fix(release): align api and hostagent rc contracts 2026-03-26 17:08:48 +00:00
rcourtman
2b93a08558 Carry Proxmox pool membership into VM inventory export 2026-03-25 21:58:46 +00:00
rcourtman
fadd3ef2bd Prefer linked host disk metrics for Proxmox nodes 2026-03-25 16:49:47 +00:00
rcourtman
a2b9723db9 Canonicalize top-level infrastructure counting 2026-03-23 16:38:13 +00:00
rcourtman
da20a171dd Project incident timelines from canonical history 2026-03-20 11:42:26 +00:00
rcourtman
abc4b1a62a Canonicalize incident timeline events 2026-03-20 11:30:16 +00:00
rcourtman
e4ad855d90 Centralize connected infrastructure display names 2026-03-19 03:27:48 +00:00
rcourtman
778a2577b6 feat: Pulse v6 release 2026-03-18 16:06:30 +00:00
rcourtman
2fe22c3308 fix(backups): prevent template backups from being flagged as orphaned
Proxmox VM/LXC templates are intentionally excluded from the monitored
guest list, but their backup files exist on storage. The orphan-detection
logic was firing for every template backup because the VMID was never
in the guest lookup maps.

Fix: track template VMID→node pairs in State.templateVMIDs (unexported,
not serialised to API/frontend) during the resources poll loop, expose
via StateSnapshot.TemplateVMIDs, and use in both buildGuestLookups() and
the storage backup node-resolution map so orphan detection treats template
backups as valid. Also preserves the template map through the cluster
health grace-period path (zero-resource preservation), the partial-node
grace-period path, and clears it on instance removal.

Closes #1352
2026-03-17 09:04:22 +00:00
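The template-tracking mechanism in this commit can be sketched in Go. A minimal sketch, assuming a map of VMID→node pairs; the type and field names here are illustrative, not the actual Pulse internals:

```go
package main

import "fmt"

// Illustrative sketch of the fix: template VMIDs are tracked alongside
// monitored guests so orphan detection can treat their backups as valid.
// Field names are hypothetical, not the real Pulse State type.
type State struct {
	guestVMIDs    map[int]string // monitored guests: VMID -> node
	templateVMIDs map[int]string // templates: VMID -> node (unexported, not serialised)
}

// isOrphanBackup reports whether a backup's VMID matches neither a
// monitored guest nor a known template.
func (s *State) isOrphanBackup(vmid int) bool {
	if _, ok := s.guestVMIDs[vmid]; ok {
		return false
	}
	if _, ok := s.templateVMIDs[vmid]; ok {
		return false // template backup: excluded from guests, still valid
	}
	return true
}

func main() {
	s := &State{
		guestVMIDs:    map[int]string{100: "pve1"},
		templateVMIDs: map[int]string{9000: "pve1"},
	}
	fmt.Println(s.isOrphanBackup(9000)) // template backup is not an orphan
}
```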
rcourtman
caff845c1a fix(ui): use Proxmox tag colours from datacenter config
Pulse was generating tag colours from a hash of the tag name instead
of using the colours configured in Proxmox. Now polls /cluster/options
once per PVE instance and merges the tag-style colour map into state,
which the frontend uses as the first-priority colour source for tag
badges. Falls back to the existing special-tag and hash-based colours
when Proxmox hasn't set a custom colour for a tag.
2026-03-15 19:49:46 +00:00
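The colour priority chain this commit describes can be sketched as follows. A minimal sketch: the map contents and function names are assumptions for illustration, not Pulse's actual code:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// Illustrative priority chain: datacenter-configured colours first,
// then special-tag defaults, then a hash-derived hue.
var specialTagColours = map[string]string{"production": "#e74c3c"}

// hashHue derives a stable fallback hue from the tag name.
func hashHue(tag string) string {
	h := fnv.New32a()
	h.Write([]byte(tag))
	return fmt.Sprintf("hsl(%d, 70%%, 50%%)", h.Sum32()%360)
}

// tagColour resolves a badge colour, preferring the colour map polled
// from /cluster/options.
func tagColour(tag string, datacenterColours map[string]string) string {
	if c, ok := datacenterColours[tag]; ok {
		return c // colour configured in Proxmox wins
	}
	if c, ok := specialTagColours[tag]; ok {
		return c
	}
	return hashHue(tag)
}

func main() {
	fmt.Println(tagColour("web", map[string]string{"web": "#1f6feb"}))
}
```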
rcourtman
da982d0fca Prepare v5.1.24 release 2026-03-14 16:43:26 +00:00
rcourtman
d05a00b931 fix(monitoring): smooth transient VM memory fallback spikes 2026-03-10 23:06:17 +00:00
rcourtman
afcfb23a30 fix(monitoring): retain intermittent FreeBSD SMART data 2026-03-10 22:52:25 +00:00
rcourtman
7dab977d91 Add split memory bar showing Used | Cache | Free segments (#1302)
Show reclaimable buff/cache as a distinct amber segment between used
(green) and free (gray) in the memory bar. This explains why Pulse's
memory percentage differs from Proxmox: Pulse reports cache-aware
usage (MemAvailable) while Proxmox includes cache as used (Total-Free).

Backend: add Cache field to Memory model, derived from MemInfo
(Available - Free). Only uses MemInfo.Free (not FreeMem fallback) to
avoid inflating cache by the balloon gap on ballooned VMs.

Frontend: StackedMemoryBar renders three segments with tooltip
breakdown. Tooltip Free accounts for balloon limit when active.
Percentage label and alerts remain cache-aware (unchanged).
2026-03-10 10:16:14 +00:00
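The segment derivation above can be sketched directly, assuming MemInfo supplies Total, Free, and Available in bytes:

```go
package main

import "fmt"

// Sketch of the split-bar maths: Cache = Available - Free is the
// reclaimable buff/cache shown as the amber middle segment, and
// Used = Total - Available is the cache-aware usage Pulse reports.
type MemInfo struct{ Total, Free, Available uint64 }

// segments splits memory into Used | Cache | Free for the stacked bar.
func segments(m MemInfo) (used, cache, free uint64) {
	if m.Available > m.Free {
		cache = m.Available - m.Free // reclaimable buff/cache
	}
	used = m.Total - m.Available // cache-aware, matching MemAvailable
	free = m.Free
	return
}

func main() {
	u, c, f := segments(MemInfo{Total: 1000, Free: 200, Available: 600})
	fmt.Println(u, c, f) // 400 400 200
}
```

This also makes the Proxmox discrepancy concrete: Proxmox would report (Total-Free)/Total = 80% here, while the cache-aware figure is 40%.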
rcourtman
7a394ed724 Use explicit success flag for disk carry-forward guard (#1319)
Replace the diskUsage <= 0 heuristic with a diskFromAgent bool that is
only set when the guest agent actually returns valid filesystem data.
Prevents carry-forward from firing on a genuine 0% disk reading.
2026-03-09 18:54:27 +00:00
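The explicit-flag guard can be sketched like this; names are illustrative, not the real poller types:

```go
package main

import "fmt"

// Sketch of replacing the diskUsage <= 0 heuristic with an explicit
// success flag, so a genuine 0% reading is kept rather than treated
// as missing data.
type diskSample struct {
	usage     float64
	fromAgent bool // set only when the guest agent returned valid filesystem data
}

// resolveDisk keeps the current sample when the agent succeeded,
// otherwise carries the previous cycle's sample forward.
func resolveDisk(cur, prev diskSample) diskSample {
	if cur.fromAgent {
		return cur // even a true 0% reading is trusted
	}
	return prev
}

func main() {
	prev := diskSample{usage: 42, fromAgent: true}
	fmt.Println(resolveDisk(diskSample{usage: 0, fromAgent: true}, prev).usage)  // genuine 0%
	fmt.Println(resolveDisk(diskSample{usage: 0, fromAgent: false}, prev).usage) // carried forward: 42
}
```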
rcourtman
9c279732f7 Skip disk carry-forward when guest agent is explicitly disabled (#1319)
Prevents stale disk data from persisting indefinitely in the efficient
poller when a user disables the guest agent after it had been providing
data. Matches the fallback poller's agent-disabled exclusion.
2026-03-09 18:37:38 +00:00
rcourtman
abbd0df609 Fix disk metric spikes when guest agent intermittently fails (#1319)
Carry forward previous cycle's disk data when the QEMU guest agent
times out or errors, instead of falling back to Proxmox cluster/resources
which always reports 0 for VM disk usage. Applied to both polling paths
(pollVMsAndContainersEfficient and pollVMsWithNodes) with safety guards
against uint64 underflow and permanent-failure exclusions.
2026-03-09 18:23:15 +00:00
rcourtman
a4b0771974 Prevent removed host agents from resurrecting via in-flight reports (#1331)
Host agents removed from the UI would reappear on the next report cycle
because there was no rejection mechanism — unlike Docker agents which
already had resurrection prevention. Mirror the Docker agent pattern:

- Track removed host IDs in a `removedHosts` map with 24hr TTL
- Persist removal records in `State.RemovedHosts` for frontend display
- Reject reports from removed hosts in `ApplyHostReport()`
- Add `AllowHostReenroll()` + API route to clear the block
- Show removed host agents in the Settings UI with "Allow re-enroll"
- Sync removed-agent maps from state on startup for all agent types
- Fix mock integration snapshot missing `RemovedDockerHosts` field
2026-03-09 17:52:34 +00:00
rcourtman
572520ebc6 Promote guest-agent /proc/meminfo fallback for accurate VM memory (#1270)
Move the guest-agent file-read of /proc/meminfo earlier in the memory
fallback chain so it runs before RRD, giving real-time MemAvailable that
correctly excludes reclaimable buff/cache on Linux VMs. Also add
VM.GuestAgent.FileRead permission for PVE 9 and fix install.sh to use
comma-separated privilege strings.
2026-03-09 10:04:28 +00:00
rcourtman
aa139b73fb Fix intermittent VM disappearance from dashboard (#555)
Two root causes: (1) When Proxmox cluster/resources returns a partial
response (e.g. during migration or transient API issue), VMs missing
from a responsive node were silently dropped because the node appeared
in nodesWithResources, bypassing grace-period preservation. Now
preserves recently-seen guests from online nodes for up to the grace
window. (2) The task queue allowed overlapping polls for the same PVE
instance — a slower stale poll could overwrite a newer complete VM list.
Added per-instance execution lock to skip duplicate scheduled tasks.
2026-03-08 22:16:24 +00:00
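The per-instance execution lock in fix (2) can be sketched as a guarded map; a minimal illustration, not the actual task-queue code:

```go
package main

import (
	"fmt"
	"sync"
)

// Sketch: a scheduled poll is skipped when one is already in flight for
// the same PVE instance, so a slow stale poll can never overlap and
// later overwrite a newer complete VM list.
type pollGuard struct {
	mu      sync.Mutex
	running map[string]bool // instance name -> poll in flight
}

func (g *pollGuard) tryStart(instance string) bool {
	g.mu.Lock()
	defer g.mu.Unlock()
	if g.running[instance] {
		return false // duplicate scheduled task: skip it
	}
	g.running[instance] = true
	return true
}

func (g *pollGuard) finish(instance string) {
	g.mu.Lock()
	defer g.mu.Unlock()
	delete(g.running, instance)
}

func main() {
	g := &pollGuard{running: map[string]bool{}}
	fmt.Println(g.tryStart("pve1"), g.tryStart("pve1")) // second attempt skipped
	g.finish("pve1")
	fmt.Println(g.tryStart("pve1"))
}
```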
rcourtman
ff1bbe2fb8 Guard per-VM guest agent calls with timeout and panic recovery (#1319)
A broken or hung qemu-agent on one VM could stall the entire polling
loop, preventing higher-VMID VMs from being detected. Wrap all guest
agent work in a 10s per-VM budget with panic recovery, and add a 2s
timeout to GetVMStatus in the efficient poller to match the legacy path.
2026-03-07 22:30:18 +00:00
rcourtman
0dd3fc779b Fix alert disable notification suppression
2026-03-07 18:40:08 +00:00
rcourtman
499ab812e3 Fix post-release regressions and lock v5 to single-tenant runtime 2026-03-05 23:46:35 +00:00
rcourtman
a4571f580b fix(monitoring): harden VM memory selection and flag repeated VM usage 2026-03-03 16:19:17 +00:00
rcourtman
ff9dc34687 Fix offline host visibility/alerting across restarts (#1311) 2026-03-03 15:43:29 +00:00
rcourtman
60bdc9a101 fix(memory): skip meminfo-derived when balloon lacks cache metrics (#1302)
When the balloon driver reports Free but not Buffers or Cached, the
meminfo-derived fallback computed memAvailable = Free alone, counting
all reclaimable page cache as used memory. This caused Linux VMs to
show wildly inflated usage (e.g. 93% when actual is 21%).

Now meminfo-derived requires at least one cache metric (Buffers > 0
or Cached > 0) before trusting the value. When missing, the code
falls through to RRD/guest-agent/Total-Used fallbacks which provide
accurate cache-aware data. Both efficient and traditional polling
paths are now consistent.
2026-03-02 11:48:18 +00:00
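The trust check can be sketched as below. The MemAvailable estimate shown (Free + Buffers + Cached) is an assumption for illustration; the point is the gate, not the formula:

```go
package main

import "fmt"

// Sketch: the meminfo-derived value is used only when the balloon driver
// reported at least one cache metric; otherwise the caller falls through
// to the RRD/guest-agent/Total-Used fallbacks.
type balloonMemInfo struct {
	Free, Buffers, Cached uint64
}

// trustedAvailable returns a MemAvailable estimate and whether it can be
// trusted. With no cache metrics, Free alone would count all reclaimable
// page cache as used, so ok is false.
func trustedAvailable(m balloonMemInfo) (avail uint64, ok bool) {
	if m.Buffers == 0 && m.Cached == 0 {
		return 0, false // fall through to the other fallbacks
	}
	return m.Free + m.Buffers + m.Cached, true
}

func main() {
	_, ok := trustedAvailable(balloonMemInfo{Free: 100})
	fmt.Println(ok) // false: Free alone is not trusted
}
```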
rcourtman
eb2397d99a fix(notifications): route escalation notifications to selected channels only (#1259)
Escalation was calling SendAlert() which always sends to all enabled
channels, ignoring the per-level channel selection (email/webhook/all).

Add SendAlertToChannels() that snapshots only the requested channel
configs and uses a distinct "_escalation" queue type so the dequeue
handler skips cooldown writes — preventing interference with the alert
manager's own re-notify cadence.
2026-02-26 20:49:10 +00:00
rcourtman
32746e2d2a fix(monitoring): use RRD memavailable fallback when PVE node cache metrics missing (#1270)
When Proxmox /nodes/{node}/status returns only total/used/free without
available/buffers/cached, EffectiveAvailable() returns Free (non-zero),
causing the RRD fallback gate to be skipped. This results in inflated
node memory where cache/buffers are counted as "used."

Widen the RRD fallback condition from requiring effectiveAvailable == 0
to triggering whenever missingCacheMetrics is true. Add negative caching
for failed RRD lookups (2-minute backoff) to avoid repeated retries.
2026-02-21 22:47:20 +00:00
rcourtman
0ae2806f18 fix(memory): add guest agent /proc/meminfo fallback to avoid VM memory inflation (#1270)
Proxmox status.Mem includes page cache as "used" memory, inflating
reported VM usage. The existing fallbacks (balloon meminfo, RRD, linked
host agent) were frequently unavailable, causing most VMs to fall
through to the inflated status-mem source.

Adds a new last-resort fallback that reads /proc/meminfo via the QEMU
guest agent file-read endpoint to get accurate MemAvailable. Results
are cached (60s positive, 5min negative backoff for unsupported VMs).

Also fixes: RRD memavailable fallback missing from traditional polling
path, cache key collisions in multi-PVE setups, FreeMem underflow
guard inconsistency, and integer overflow in kB-to-bytes conversion.
2026-02-20 13:31:52 +00:00
rcourtman
8c7d507ea4 fix(alerts): make --disk-exclude suppress Proxmox SSD wear/health alerts (#1142)
The --disk-exclude agent flag only filtered local metric collection but
had no effect on server-side Proxmox disk health and SSD wearout alerts,
which poll the Proxmox API directly. Users excluding disks (e.g.
--disk-exclude sda) still received alerts for those disks.

Agent now sends its DiskExclude patterns in each report. The server
stores them on the Host model and consults them during Proxmox disk
polling — excluded disks get a synthetic healthy status passed to
CheckDiskHealth so any existing alerts clear immediately.

Also adds FreeBSD pseudo-filesystem types (fdescfs, devfs, linprocfs,
linsysfs) to the virtual FS filter and /var/run/ to special mount
prefixes, fixing false disk-full alerts on FreeBSD for fdescfs mounts.
2026-02-20 13:31:52 +00:00
rcourtman
fb7582c7e4 fix(memory): use linked Pulse host agent memory to avoid VM inflation (#1270)
When no guest agent MemInfo or RRD data is available, prefer the linked
Pulse host agent's memory (read from /proc/meminfo via gopsutil, which
excludes page cache) over Proxmox's status.Mem (total - free, inflated
by reclaimable cache). Applied to both efficient and traditional polling
paths. Diagnostic fields added to VMMemoryRaw for visibility.
2026-02-19 19:04:19 +00:00
rcourtman
71b8b81af5 fix(monitoring): cache per-VM RRD memory lookups to avoid serial HTTP calls
Windows VMs and VMs without qemu-guest-agent triggered an uncached
GetVMRRDData HTTP call on every poll cycle. Add vmRRDMemCache using the
same read-through cache pattern as nodeRRDMemCache (shared rrdCacheMu,
same TTL, same cleanup path).

(cherry picked from commit 582f16004a0f275de4c458e5d288be70eee613e4)
2026-02-18 12:57:15 +00:00
rcourtman
efa916ee2a fix(memory): correct memory reporting for Linux VMs and FreeBSD ZFS ARC
Linux VM page cache (#1270): QEMU VM memory now falls back to Proxmox
RRD's memavailable metric (which excludes reclaimable page cache) when
the qemu-guest-agent doesn't provide MemInfo.Available. Previously the
fallback was detailedStatus.Mem (total - MemFree), inflating usage to
80%+ on VMs with normal Linux page cache. Mirrors the existing LXC
rrd-memavailable path.

FreeBSD ZFS ARC (#1264, #1051): The host agent now reads
kstat.zfs.misc.arcstats.size via SysctlRaw on FreeBSD and subtracts
the ARC size from reported memory usage. ZFS ARC is reclaimable under
memory pressure (like Linux SReclaimable) but gopsutil counts it as
wired/non-reclaimable, causing false 90%+ memory alerts on TrueNAS
and FreeBSD hosts. Build-tagged so it compiles cleanly on all platforms.

Fixes #1270
Fixes #1264
Fixes #1051

(cherry picked from commit 94502f83ff9ffc6da28aaadc946a2f7d8b4e9bac)
2026-02-18 12:56:53 +00:00
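The FreeBSD ARC adjustment reduces to a guarded subtraction. On FreeBSD the ARC size would come from kstat.zfs.misc.arcstats.size via SysctlRaw; here it is a plain parameter for illustration:

```go
package main

import "fmt"

// Sketch: subtract the reclaimable ZFS ARC size from the used figure
// gopsutil reports (which counts ARC as wired), with an underflow guard.
func adjustedUsed(used, arcSize uint64) uint64 {
	if arcSize >= used {
		return 0 // guard against uint64 underflow
	}
	return used - arcSize
}

func main() {
	// 16 GiB reported used, 10 GiB of which is reclaimable ARC.
	fmt.Println(adjustedUsed(16<<30, 10<<30) == 6<<30) // true
}
```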
rcourtman
9d8f8b45b5 fix(docker,metrics): preserve container metadata on update and reduce DB writes
Docker container URL preserved on update (#1054): container updates
recreate the container with a new runtime ID. The agent now includes
{oldContainerId, newContainerId} in the completion ACK payload; the
server uses this to copy persisted metadata (custom URLs, descriptions,
tags) to the new ID so nothing is lost. Migration is a copy, not a move,
so rollback scenarios still find metadata under the original ID.

Reduce metrics.db write amplification (#1124): add a UNIQUE index on
(resource_type, resource_id, metric_type, timestamp, tier) so rollup
reprocessing after a failed checkpoint uses INSERT OR IGNORE instead of
creating duplicate rows. Existing duplicates are deduplicated once on
startup if the index creation would otherwise fail. Also sets
wal_autocheckpoint(500) to checkpoint the WAL more frequently, preventing
unbounded WAL growth.

Fixes #1054
Fixes #1124
2026-02-18 12:56:46 +00:00
rcourtman
df23d80919 fix(alerts): always send recovery notifications regardless of quiet hours
Recovery (all-clear) notifications were being silently suppressed during
quiet hours for any non-critical alert. Since powered-off alerts default
to Warning level, users who received an alert at 2pm would never get the
recovery notification if the VM came back during quiet hours.

Quiet hours are intended to suppress noisy firing alerts, not to hide
the fact that an issue has resolved. If you got the alert, you should
always get the all-clear.

Remove the ShouldSuppressResolvedNotification gate from handleAlertResolved.
The notifyOnResolve toggle (explicit user preference) is still respected.

Fixes #1259
2026-02-18 12:53:09 +00:00
rcourtman
d4ff967815 fix: scope shared storage aggregation to per-instance to prevent cross-instance merging
The shared storage deduplication key was just the storage name, causing
storages with the same name from different Proxmox instances (or PVE + PBS)
to be incorrectly merged into a single entry. This made one random host
appear to have all storages from all instances.

Include the instance name in the aggregation key so shared storage is only
merged within the same Proxmox cluster/instance.

Fixes #1246
2026-02-11 09:18:09 +00:00
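The scoped aggregation key can be sketched as below; the types are illustrative:

```go
package main

import "fmt"

// Sketch: shared storages are grouped by instance+name, so identically
// named storages from different Proxmox instances (or PVE + PBS) are
// never merged into one entry.
type storage struct {
	Instance, Name string
}

func aggregateShared(storages []storage) map[string][]storage {
	out := map[string][]storage{}
	for _, s := range storages {
		key := s.Instance + "/" + s.Name // instance-scoped, not Name alone
		out[key] = append(out[key], s)
	}
	return out
}

func main() {
	merged := aggregateShared([]storage{
		{Instance: "pve1", Name: "local-zfs"},
		{Instance: "pve2", Name: "local-zfs"},
	})
	fmt.Println(len(merged)) // 2: one entry per instance
}
```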
rcourtman
03939c3f9e fix: deduplicate bind-mounted volumes in disk total calculation
The dedup logic only handled btrfs/zfs subvolumes, but Kubernetes
bind-mounts the same device at both pod and plugin paths, causing
xfs/ext4 volumes to be double-counted. Now deduplicates by
device+totalBytes for all filesystem types.

Fixes #1158
2026-02-10 21:52:25 +00:00
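The generalized dedup rule can be sketched like this, assuming a simple mount record:

```go
package main

import "fmt"

// Sketch: count each device+totalBytes pair once, so a filesystem
// bind-mounted at several paths (pod and plugin mounts under
// Kubernetes) is not double-counted, regardless of filesystem type.
type mount struct {
	Device     string
	TotalBytes uint64
}

func totalDiskBytes(mounts []mount) uint64 {
	seen := map[string]bool{}
	var total uint64
	for _, m := range mounts {
		key := fmt.Sprintf("%s:%d", m.Device, m.TotalBytes)
		if seen[key] {
			continue // same device surfaced at another mount point
		}
		seen[key] = true
		total += m.TotalBytes
	}
	return total
}

func main() {
	fmt.Println(totalDiskBytes([]mount{
		{"/dev/sda1", 500}, // pod path
		{"/dev/sda1", 500}, // plugin path, same device
		{"/dev/sdb1", 200},
	})) // 700, not 1200
}
```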
rcourtman
26776b2075 fix(agent): apply --disk-exclude to Docker agent disk metrics (#1237)
The Docker agent was not passing the disk exclusion list to
hostmetricsCollect(), so excluded mounts appeared in the Docker tab
disk totals. Also add server-side fsfilters filtering to Docker
report processing for parity with the host agent path.
2026-02-10 16:59:35 +00:00
rcourtman
f7a14feb0f fix(mock): align Docker container store type with real monitor
Mock seeding wrote Docker container metrics as "docker" but the real
monitor uses "dockerContainer". This made mock-mode charts miss the
SQLite store path after the API normalization fix in 7336ec2d.
2026-02-09 22:42:08 +00:00
rcourtman
cedf0c8f0f fix(temperature): parse string sensor values without zeroing readings (#1224) 2026-02-09 14:00:09 +00:00
rcourtman
8a48acef1d fix: hotfix 5.1.5 — node duplication, alert scrambling, ntfy resolved formatting
- fix(models): filter nodes by instance in UpdateNodesForInstance to prevent
  PVE node duplication across poll cycles (#1214, #1192, #1217)
- fix(alerts): sort GetActiveAlerts output for stable ordering, preventing
  hostname scrambling in frontend (#1218)
- fix(notifications): add ntfy-specific resolved webhook formatting with
  plain-text body and proper headers (#1213)
- fix(frontend): respect "hide Docker update actions" setting in
  DockerFilter Update All button (#1219)
- fix(frontend): add missing v prefix to GitHub release tag URLs (#1195)
- fix(monitoring): reduce disk detection warning from Warn to Debug to
  eliminate log spam for pass-through disks (#1216)
- chore: bump VERSION to 5.1.5
2026-02-08 11:48:22 +00:00
rcourtman
d1e61d8a8a fix: ship alerting hotfixes and prepare 5.1.4 2026-02-07 22:05:55 +00:00
rcourtman
13af83f3fc fix(monitoring): preserve recent PVE nodes on empty polls (#1094) 2026-02-07 14:18:33 +00:00
rcourtman
ee0e89871d fix: reduce metrics memory 86x by reverting buffer and adding LTTB downsampling
The in-memory metrics buffer was changed from 1000 to 86400 points per
metric to support 30-day sparklines, but this pre-allocated ~18 MB per
guest (7 slices × 86400 × 32 bytes). With 50 guests that's 920 MB —
explaining why users needed to double their LXC memory after upgrading
to 5.1.0.

- Revert in-memory buffer to 1000 points / 24h retention
- Remove eager slice pre-allocation (use append growth instead)
- Add LTTB (Largest Triangle Three Buckets) downsampling algorithm
- Chart endpoints now use a two-tier strategy: in-memory for ranges
  ≤ 2h, SQLite persistent store + LTTB for longer ranges
- Reduce frontend ring buffer from 86400 to 2000 points

Related to #1190
2026-02-04 19:49:52 +00:00
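The LTTB algorithm named above can be sketched as follows. This is a generic textbook implementation of Largest Triangle Three Buckets, not Pulse's actual code:

```go
package main

import (
	"fmt"
	"math"
)

// LTTB keeps the first and last points, splits the rest into
// threshold-2 buckets, and from each bucket keeps the point forming the
// largest triangle with the previously kept point and the next bucket's
// average.
type point struct{ X, Y float64 }

func lttb(data []point, threshold int) []point {
	if threshold < 3 || threshold >= len(data) {
		return data
	}
	sampled := []point{data[0]}
	bucket := float64(len(data)-2) / float64(threshold-2)
	a := 0 // index of the last point kept
	for i := 0; i < threshold-2; i++ {
		// Average of the next bucket (the triangle's third anchor).
		ns, ne := int(float64(i+1)*bucket)+1, int(float64(i+2)*bucket)+1
		if ne > len(data) {
			ne = len(data)
		}
		var avgX, avgY float64
		for _, p := range data[ns:ne] {
			avgX += p.X
			avgY += p.Y
		}
		n := float64(ne - ns)
		avgX, avgY = avgX/n, avgY/n
		// Pick the max-area point from the current bucket.
		cs, ce := int(float64(i)*bucket)+1, int(float64(i+1)*bucket)+1
		best, bestArea := cs, -1.0
		for j := cs; j < ce; j++ {
			area := math.Abs((data[a].X-avgX)*(data[j].Y-data[a].Y) -
				(data[a].X-data[j].X)*(avgY-data[a].Y))
			if area > bestArea {
				bestArea, best = area, j
			}
		}
		sampled = append(sampled, data[best])
		a = best
	}
	return append(sampled, data[len(data)-1])
}

func main() {
	data := make([]point, 86400)
	for i := range data {
		data[i] = point{X: float64(i), Y: math.Sin(float64(i) / 500)}
	}
	fmt.Println(len(lttb(data, 2000))) // one day at 1s resolution down to 2000 points
}
```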
rcourtman
bcd0dbfc18 Add metrics history memory regression test 2026-02-04 19:35:19 +00:00
rcourtman
049a3e424c Add memory regression tests for agent and scheduler 2026-02-04 19:33:29 +00:00
rcourtman
64e57f0e0e fix: smooth I/O rates using sliding window like Prometheus rate()
Proxmox reports cumulative byte counters that update unevenly across
polling intervals, causing a steady 100 Mbps download to appear as
spikes up to 450 Mbps in sparkline charts. Replace per-interval rate
calculation with a 4-sample sliding window (30s at 10s polling) that
averages over the full span — the same approach Prometheus rate() uses.
2026-02-04 19:04:17 +00:00