Pulse

vrr/Pulse

mirror of https://github.com/rcourtman/Pulse.git synced 2026-05-15 01:07:32 +00:00

Author	SHA1	Message	Date
rcourtman	da20a171dd	Project incident timelines from canonical history	2026-03-20 11:42:26 +00:00
rcourtman	778a2577b6	feat: Pulse v6 release	2026-03-18 16:06:30 +00:00
rcourtman	2fe22c3308	fix(backups): prevent template backups from being flagged as orphaned Proxmox VM/LXC templates are intentionally excluded from the monitored guest list, but their backup files exist on storage. The orphan-detection logic was firing for every template backup because the VMID was never in the guest lookup maps. Fix: track template VMID→node pairs in State.templateVMIDs (unexported, not serialised to API/frontend) during the resources poll loop, expose via StateSnapshot.TemplateVMIDs, and use in both buildGuestLookups() and the storage backup node-resolution map so orphan detection treats template backups as valid. Also preserves the template map through the cluster health grace-period path (zero-resource preservation), the partial-node grace-period path, and clears it on instance removal. Closes #1352	2026-03-17 09:04:22 +00:00
rcourtman	caff845c1a	fix(ui): use Proxmox tag colours from datacenter config Pulse was generating tag colours from a hash of the tag name instead of using the colours configured in Proxmox. Now polls /cluster/options once per PVE instance and merges the tag-style colour map into state, which the frontend uses as the first-priority colour source for tag badges. Falls back to the existing special-tag and hash-based colours when Proxmox hasn't set a custom colour for a tag.	2026-03-15 19:49:46 +00:00
rcourtman	d05a00b931	fix(monitoring): smooth transient VM memory fallback spikes	2026-03-10 23:06:17 +00:00
rcourtman	afcfb23a30	fix(monitoring): retain intermittent FreeBSD SMART data	2026-03-10 22:52:25 +00:00
rcourtman	7dab977d91	Add split memory bar showing Used | Cache | Free segments (#1302) Show reclaimable buff/cache as a distinct amber segment between used (green) and free (gray) in the memory bar. This explains why Pulse's memory percentage differs from Proxmox: Pulse reports cache-aware usage (MemAvailable) while Proxmox includes cache as used (Total-Free). Backend: add Cache field to Memory model, derived from MemInfo (Available - Free). Only uses MemInfo.Free (not FreeMem fallback) to avoid inflating cache by the balloon gap on ballooned VMs. Frontend: StackedMemoryBar renders three segments with tooltip breakdown. Tooltip Free accounts for balloon limit when active. Percentage label and alerts remain cache-aware (unchanged).	2026-03-10 10:16:14 +00:00
rcourtman	7a394ed724	Use explicit success flag for disk carry-forward guard (#1319) Replace the diskUsage <= 0 heuristic with a diskFromAgent bool that is only set when the guest agent actually returns valid filesystem data. Prevents carry-forward from firing on a genuine 0% disk reading.	2026-03-09 18:54:27 +00:00
rcourtman	9c279732f7	Skip disk carry-forward when guest agent is explicitly disabled (#1319) Prevents stale disk data from persisting indefinitely in the efficient poller when a user disables the guest agent after it had been providing data. Matches the fallback poller's agent-disabled exclusion.	2026-03-09 18:37:38 +00:00
rcourtman	abbd0df609	Fix disk metric spikes when guest agent intermittently fails (#1319) Carry forward previous cycle's disk data when the QEMU guest agent times out or errors, instead of falling back to Proxmox cluster/resources which always reports 0 for VM disk usage. Applied to both polling paths (pollVMsAndContainersEfficient and pollVMsWithNodes) with safety guards against uint64 underflow and permanent-failure exclusions.	2026-03-09 18:23:15 +00:00
rcourtman	a4b0771974	Prevent removed host agents from resurrecting via in-flight reports (#1331 ) Host agents removed from the UI would reappear on the next report cycle because there was no rejection mechanism — unlike Docker agents which already had resurrection prevention. Mirror the Docker agent pattern: - Track removed host IDs in a `removedHosts` map with 24hr TTL - Persist removal records in `State.RemovedHosts` for frontend display - Reject reports from removed hosts in `ApplyHostReport()` - Add `AllowHostReenroll()` + API route to clear the block - Show removed host agents in the Settings UI with "Allow re-enroll" - Sync removed-agent maps from state on startup for all agent types - Fix mock integration snapshot missing `RemovedDockerHosts` field	2026-03-09 17:52:34 +00:00
rcourtman	572520ebc6	Promote guest-agent /proc/meminfo fallback for accurate VM memory (#1270) Move the guest-agent file-read of /proc/meminfo earlier in the memory fallback chain so it runs before RRD, giving real-time MemAvailable that correctly excludes reclaimable buff/cache on Linux VMs. Also add VM.GuestAgent.FileRead permission for PVE 9 and fix install.sh to use comma-separated privilege strings.	2026-03-09 10:04:28 +00:00
rcourtman	aa139b73fb	Fix intermittent VM disappearance from dashboard (#555 ) Two root causes: (1) When Proxmox cluster/resources returns a partial response (e.g. during migration or transient API issue), VMs missing from a responsive node were silently dropped because the node appeared in nodesWithResources, bypassing grace-period preservation. Now preserves recently-seen guests from online nodes for up to the grace window. (2) The task queue allowed overlapping polls for the same PVE instance — a slower stale poll could overwrite a newer complete VM list. Added per-instance execution lock to skip duplicate scheduled tasks.	2026-03-08 22:16:24 +00:00
rcourtman	ff1bbe2fb8	Guard per-VM guest agent calls with timeout and panic recovery (#1319) A broken or hung qemu-agent on one VM could stall the entire polling loop, preventing higher-VMID VMs from being detected. Wrap all guest agent work in a 10s per-VM budget with panic recovery, and add a 2s timeout to GetVMStatus in the efficient poller to match the legacy path.	2026-03-07 22:30:18 +00:00
rcourtman	0dd3fc779b	Fix alert disable notification suppression	2026-03-07 18:40:08 +00:00
rcourtman	499ab812e3	Fix post-release regressions and lock v5 to single-tenant runtime	2026-03-05 23:46:35 +00:00
rcourtman	a4571f580b	fix(monitoring): harden VM memory selection and flag repeated VM usage	2026-03-03 16:19:17 +00:00
rcourtman	ff9dc34687	Fix offline host visibility/alerting across restarts (#1311)	2026-03-03 15:43:29 +00:00
rcourtman	60bdc9a101	fix(memory): skip meminfo-derived when balloon lacks cache metrics (#1302 ) When the balloon driver reports Free but not Buffers or Cached, the meminfo-derived fallback computed memAvailable = Free alone, counting all reclaimable page cache as used memory. This caused Linux VMs to show wildly inflated usage (e.g. 93% when actual is 21%). Now meminfo-derived requires at least one cache metric (Buffers > 0 or Cached > 0) before trusting the value. When missing, the code falls through to RRD/guest-agent/Total-Used fallbacks which provide accurate cache-aware data. Both efficient and traditional polling paths are now consistent.	2026-03-02 11:48:18 +00:00
rcourtman	eb2397d99a	fix(notifications): route escalation notifications to selected channels only (#1259) Escalation was calling SendAlert() which always sends to all enabled channels, ignoring the per-level channel selection (email/webhook/all). Add SendAlertToChannels() that snapshots only the requested channel configs and uses a distinct "_escalation" queue type so the dequeue handler skips cooldown writes, preventing interference with the alert manager's own re-notify cadence.	2026-02-26 20:49:10 +00:00
rcourtman	32746e2d2a	fix(monitoring): use RRD memavailable fallback when PVE node cache metrics missing (#1270 ) When Proxmox /nodes/{node}/status returns only total/used/free without available/buffers/cached, EffectiveAvailable() returns Free (non-zero), causing the RRD fallback gate to be skipped. This results in inflated node memory where cache/buffers are counted as "used." Widen the RRD fallback condition from requiring effectiveAvailable == 0 to triggering whenever missingCacheMetrics is true. Add negative caching for failed RRD lookups (2-minute backoff) to avoid repeated retries.	2026-02-21 22:47:20 +00:00
rcourtman	0ae2806f18	fix(memory): add guest agent /proc/meminfo fallback to avoid VM memory inflation (#1270 ) Proxmox status.Mem includes page cache as "used" memory, inflating reported VM usage. The existing fallbacks (balloon meminfo, RRD, linked host agent) were frequently unavailable, causing most VMs to fall through to the inflated status-mem source. Adds a new last-resort fallback that reads /proc/meminfo via the QEMU guest agent file-read endpoint to get accurate MemAvailable. Results are cached (60s positive, 5min negative backoff for unsupported VMs). Also fixes: RRD memavailable fallback missing from traditional polling path, cache key collisions in multi-PVE setups, FreeMem underflow guard inconsistency, and integer overflow in kB-to-bytes conversion.	2026-02-20 13:31:52 +00:00
rcourtman	8c7d507ea4	fix(alerts): make --disk-exclude suppress Proxmox SSD wear/health alerts (#1142 ) The --disk-exclude agent flag only filtered local metric collection but had no effect on server-side Proxmox disk health and SSD wearout alerts, which poll the Proxmox API directly. Users excluding disks (e.g. --disk-exclude sda) still received alerts for those disks. Agent now sends its DiskExclude patterns in each report. The server stores them on the Host model and consults them during Proxmox disk polling — excluded disks get a synthetic healthy status passed to CheckDiskHealth so any existing alerts clear immediately. Also adds FreeBSD pseudo-filesystem types (fdescfs, devfs, linprocfs, linsysfs) to the virtual FS filter and /var/run/ to special mount prefixes, fixing false disk-full alerts on FreeBSD for fdescfs mounts.	2026-02-20 13:31:52 +00:00
rcourtman	fb7582c7e4	fix(memory): use linked Pulse host agent memory to avoid VM inflation (#1270 ) When no guest agent MemInfo or RRD data is available, prefer the linked Pulse host agent's memory (read from /proc/meminfo via gopsutil, which excludes page cache) over Proxmox's status.Mem (total - free, inflated by reclaimable cache). Applied to both efficient and traditional polling paths. Diagnostic fields added to VMMemoryRaw for visibility.	2026-02-19 19:04:19 +00:00
rcourtman	71b8b81af5	fix(monitoring): cache per-VM RRD memory lookups to avoid serial HTTP calls Windows VMs and VMs without qemu-guest-agent triggered an uncached GetVMRRDData HTTP call on every poll cycle. Add vmRRDMemCache using the same read-through cache pattern as nodeRRDMemCache (shared rrdCacheMu, same TTL, same cleanup path). (cherry picked from commit 582f16004a0f275de4c458e5d288be70eee613e4)	2026-02-18 12:57:15 +00:00
rcourtman	efa916ee2a	fix(memory): correct memory reporting for Linux VMs and FreeBSD ZFS ARC Linux VM page cache (#1270): QEMU VM memory now falls back to Proxmox RRD's memavailable metric (which excludes reclaimable page cache) when the qemu-guest-agent doesn't provide MemInfo.Available. Previously the fallback was detailedStatus.Mem (total - MemFree), inflating usage to 80%+ on VMs with normal Linux page cache. Mirrors the existing LXC rrd-memavailable path. FreeBSD ZFS ARC (#1264, #1051): The host agent now reads kstat.zfs.misc.arcstats.size via SysctlRaw on FreeBSD and subtracts the ARC size from reported memory usage. ZFS ARC is reclaimable under memory pressure (like Linux SReclaimable) but gopsutil counts it as wired/non-reclaimable, causing false 90%+ memory alerts on TrueNAS and FreeBSD hosts. Build-tagged so it compiles cleanly on all platforms. Fixes #1270 Fixes #1264 Fixes #1051 (cherry picked from commit 94502f83ff9ffc6da28aaadc946a2f7d8b4e9bac)	2026-02-18 12:56:53 +00:00
rcourtman	df23d80919	fix(alerts): always send recovery notifications regardless of quiet hours Recovery (all-clear) notifications were being silently suppressed during quiet hours for any non-critical alert. Since powered-off alerts default to Warning level, users who received an alert at 2pm would never get the recovery notification if the VM came back during quiet hours. Quiet hours are intended to suppress noisy firing alerts, not to hide the fact that an issue has resolved. If you got the alert, you should always get the all-clear. Remove the ShouldSuppressResolvedNotification gate from handleAlertResolved. The notifyOnResolve toggle (explicit user preference) is still respected. Fixes #1259	2026-02-18 12:53:09 +00:00
rcourtman	03939c3f9e	fix: deduplicate bind-mounted volumes in disk total calculation The dedup logic only handled btrfs/zfs subvolumes, but Kubernetes bind-mounts the same device at both pod and plugin paths, causing xfs/ext4 volumes to be double-counted. Now deduplicates by device+totalBytes for all filesystem types. Fixes #1158	2026-02-10 21:52:25 +00:00
rcourtman	26776b2075	fix(agent): apply --disk-exclude to Docker agent disk metrics (#1237) The Docker agent was not passing the disk exclusion list to hostmetricsCollect(), so excluded mounts appeared in the Docker tab disk totals. Also add server-side fsfilters filtering to Docker report processing for parity with the host agent path.	2026-02-10 16:59:35 +00:00
rcourtman	8a48acef1d	fix: hotfix 5.1.5 — node duplication, alert scrambling, ntfy resolved formatting - fix(models): filter nodes by instance in UpdateNodesForInstance to prevent PVE node duplication across poll cycles (#1214, #1192, #1217) - fix(alerts): sort GetActiveAlerts output for stable ordering, preventing hostname scrambling in frontend (#1218) - fix(notifications): add ntfy-specific resolved webhook formatting with plain-text body and proper headers (#1213) - fix(frontend): respect "hide Docker update actions" setting in DockerFilter Update All button (#1219) - fix(frontend): add missing v prefix to GitHub release tag URLs (#1195) - fix(monitoring): reduce disk detection warning from Warn to Debug to eliminate log spam for pass-through disks (#1216) - chore: bump VERSION to 5.1.5	2026-02-08 11:48:22 +00:00
rcourtman	d1e61d8a8a	fix: ship alerting hotfixes and prepare 5.1.4	2026-02-07 22:05:55 +00:00
rcourtman	13af83f3fc	fix(monitoring): preserve recent PVE nodes on empty polls (#1094)	2026-02-07 14:18:33 +00:00
rcourtman	ee0e89871d	fix: reduce metrics memory 86x by reverting buffer and adding LTTB downsampling The in-memory metrics buffer was changed from 1000 to 86400 points per metric to support 30-day sparklines, but this pre-allocated ~18 MB per guest (7 slices × 86400 × 32 bytes). With 50 guests that's 920 MB — explaining why users needed to double their LXC memory after upgrading to 5.1.0. - Revert in-memory buffer to 1000 points / 24h retention - Remove eager slice pre-allocation (use append growth instead) - Add LTTB (Largest Triangle Three Buckets) downsampling algorithm - Chart endpoints now use a two-tier strategy: in-memory for ranges ≤ 2h, SQLite persistent store + LTTB for longer ranges - Reduce frontend ring buffer from 86400 to 2000 points Related to #1190	2026-02-04 19:49:52 +00:00
rcourtman	9d4d392026	fix: host network sparklines showing cumulative bytes instead of rates Host network sparklines were displaying wildly incorrect values (e.g., 147 GB/s for an idle Raspberry Pi) because cumulative byte counters (total bytes since boot) were being stored directly instead of being converted to rates. Changes: - monitor.go: Use RateTracker to calculate network rates for hosts, matching the existing pattern used for VMs and containers. Only record network metrics when we have enough samples to calculate valid rates. - router.go: Remove network metrics from live fallback for hosts since we can't calculate rates from a single snapshot. Better to show nothing than misleading cumulative totals. The fix follows the established codebase pattern where: 1. Agent reports cumulative RXBytes/TXBytes 2. RateTracker compares consecutive samples to calculate bytes/second 3. Rates are stored in metrics history for sparkline display	2026-02-04 16:11:04 +00:00
rcourtman	cffb91f9ea	Pre-populate node display name cache before guest polling Guest polling (CheckGuest) runs before CheckNode in each poll cycle, so the display name cache was empty when the first guest alert was created. This caused the initial notification to use the raw Proxmox node name. Fix by seeding the cache from modelNodes (which are already available) before guest polling starts. Related to #1188	2026-02-04 14:29:49 +00:00
rcourtman	05266d9062	Show node display name in alerts instead of raw Proxmox node name Alerts previously showed the raw Proxmox node name (e.g., "on pve") even when users configured a display name (e.g., "SPACEX") via Settings or the host agent --hostname flag. This affected the alert UI, email notifications, and webhook payloads. Add NodeDisplayName field to the alert chain: cache display names in the alert Manager (populated by CheckNode/CheckHost on every poll), resolve them at alert creation via preserveAlertState, refresh on metric updates, and enrich at read time in GetActiveAlerts. Update models.Alert, the syncAlertsToState conversion, email templates, Apprise body text, webhook payloads, and all frontend rendering paths. Related to #1188	2026-02-04 14:26:44 +00:00
rcourtman	5c18748742	Add SMART disk lifecycle monitoring with historical charts Expand the smartctl collector to capture detailed SMART attributes (SATA and NVMe), propagate them through the full data pipeline, persist them as time-series metrics, and display them in an interactive disk detail drawer with historical sparkline charts. Backend: add SMARTAttributes struct, writeSMARTMetrics for persistent storage, "disk" resource type in metrics API with live fallback. Frontend: enhanced DiskList with Power-On column and SMART warnings, new DiskDetail drawer matching NodeDrawer styling patterns, generic HistoryChart metric support with proper tooltip formatting.	2026-02-04 13:35:40 +00:00
rcourtman	902bdd92c2	fix: prefer status-mem over status-freemem for VM memory calculation Proxmox's FreeMem field reports free memory relative to the balloon's guest-visible total (total_mem), not relative to MaxMem. When ballooning is active and the VM's memory has been reduced, subtracting FreeMem from MaxMem produces wildly inflated usage (e.g. 97% when actual usage is 20%). Proxmox's Mem field is already calculated as (total_mem - free_mem), giving the correct used bytes regardless of balloon state. Swap the priority so Mem is checked before FreeMem. Related to #1185	2026-02-04 12:08:33 +00:00
rcourtman	5a990dd554	Fix sparkline data inconsistency and support 30d range	2026-02-03 22:39:50 +00:00
rcourtman	2ebe65bbc5	security: add scope checks to AI Patrol and agent profile endpoints - AI Patrol mutation endpoints (acknowledge, dismiss, suppress, snooze, resolve, findings/note, suppressions/) now require ai:execute scope to prevent low-privilege tokens from blinding patrol by hiding/suppressing findings - Agent profile admin endpoints (/api/admin/profiles/) now require settings:write scope to prevent low-privilege tokens from modifying fleet-wide agent behavior	2026-02-03 19:29:56 +00:00
rcourtman	1733bea15c	feat(ui): show backup permission warnings on Backups page When PVE backup polling detects permission errors (403/401/permission denied), track them per instance and surface them via the scheduler health endpoint. The Backups page now fetches instance warnings and displays a banner when backup permission issues are detected, telling users exactly how to fix the problem. Related to #1139	2026-02-03 19:27:10 +00:00
rcourtman	c7f4030c29	fix(monitoring): prevent memory leak from stale metrics history and rate tracker entries MetricsHistory.Cleanup() was defined but never called, and even if called, it only removed old data points without deleting map entries for deleted containers/VMs. Each stale entry leaked ~224KB (7 pre-allocated slices). Changes: - Call metricsHistory.Cleanup() and rateTracker.Cleanup() in maintenance loop - Delete map entries entirely when all data points have expired - Return nil instead of empty slice in cleanupMetrics() to release backing arrays - Add Cleanup() method to RateTracker with 24-hour stale threshold - Add debug logging to track cleanup activity Related to #1153	2026-02-03 17:16:06 +00:00
rcourtman	4f40c3d751	fix: resolve critical stability and auth issues - Fix data race in webhook notifications by removing shared state - Fix duplicate monitors on config reload by stopping old instances - Prevent metrics ID deletion on transient startup errors - Support Bearer auth header for config export/import endpoints	2026-02-03 16:46:27 +00:00
rcourtman	aeca5e39fa	Fix multi-tenant persistence and backend stability - Initialize Alert and Notification managers with tenant-specific data directories - Add panic recovery to WebSocket safeSend for stability - Record host metrics to history for sparkline support	2026-02-03 16:24:42 +00:00
rcourtman	71f80c8a99	Fix: alert resolution now records incident timeline during quiet hours - Fixed early return in handleAlertResolved that skipped incident recording when quiet hours suppressed recovery notifications - Added Host Agent alert delay configuration (backend + UI) - Host Agents now have dedicated time threshold settings like other resource types Related to #1179	2026-02-03 12:49:41 +00:00
rcourtman	c8483f8116	Fix: PBS backup verification status not updating after cache populated The PBS backup snapshot cache only compared BackupCount and LastBackup timestamp to decide whether to re-fetch. When PBS verify jobs complete, neither field changes — only the Verification field on individual snapshots changes — so the cache served stale data indefinitely. Add a 10-minute TTL per backup group so verification status changes are picked up periodically. Also add panic recovery to PBS and PVE backup goroutines, and use runtimeCtx for PBS backup polling to respect monitor shutdown. Closes #1174	2026-02-02 23:12:26 +00:00
rcourtman	95a0d7a6bd	feat(backend): implement AI Patrol, Investigation, and system-wide refactors	2026-01-30 19:02:14 +00:00
rcourtman	70dbb495ad	fix: address triage issues #1149, #1153, #1162, #1163 - #1163: Add node badges to storage resources in threshold tables (ResourceTable.tsx, ResourceCard.tsx) - #1162: Fix PBS backup alerts showing datastore as node name (alerts.go - use "Unknown" for orphaned backups) - #1153: Fix memory leaks in tracking maps - Add max 48 sample limit for pmgQuarantineHistory - Add max 10 entry limit for flappingHistory - Add cleanup for dockerUpdateFirstSeen - Add cleanupTrackingMaps() for auth, polling, and circuit breaker maps Note: #1149 fix (chat sessions null check) is in AISettings.tsx which has other pending changes - will be committed separately.	2026-01-26 22:21:10 +00:00
rcourtman	1e77763870	feat: improve monitoring and temperature handling Temperature Monitoring: - Enhance temperature collection and processing - Add temperature tests Monitor Improvements: - Improve monitor reload handling - Add reload tests Test Coverage: - Add Ceph monitoring tests - Add Docker commands tests - Add host agent temperature tests - Add extra coverage tests	2026-01-24 22:43:31 +00:00
rcourtman	4c19fa3c1b	fix: resolve btrfs disk summing (#1158), podman disable flag (#1151), and diagnostics path (#1155)	2026-01-23 19:24:38 +00:00


242 commits