vrr/Pulse

mirror of https://github.com/rcourtman/Pulse.git synced 2026-04-28 03:20:11 +00:00

Author	SHA1	Message	Date
rcourtman	dfbe2eb873	Suppress noisy recovery notifications	2026-04-13 14:40:12 +01:00
rcourtman	19b2a4e4c4	Clear stale guest per-disk alerts	2026-04-13 14:20:54 +01:00
rcourtman	9fb76579cc	Fix backup type-aware orphan detection	2026-04-13 11:54:46 +01:00
rcourtman	754aa0e39c	Fix linked host agent threshold overrides	2026-04-12 22:47:34 +01:00
rcourtman	95409985b5	Normalize vendor-managed NAS RAID arrays	2026-04-12 22:20:04 +01:00
rcourtman	a86c7120cf	Debounce recovery for poll-driven offline alerts	2026-04-12 22:04:10 +01:00
rcourtman	005f64182f	Respect quiet hours for escalation alerts Apply quiet-hours suppression to escalation notifications so offline and other suppressed categories do not bypass the normal notification rules during escalation. Fixes #1398.	2026-04-12 21:29:32 +01:00
rcourtman	30eb9d7847	Fix repeated Docker update recovery alerts Preserve Docker container update alerts and first-seen tracking when update status is temporarily unavailable or the registry check fails. Fixes #1394.	2026-04-09 15:59:15 +01:00
rcourtman	a4834ed80f	Disambiguate linked host agent alert names	2026-04-07 10:50:52 +01:00
rcourtman	398ef8117b	Clear stale storage alerts on inventory changes	2026-04-05 23:35:54 +01:00
rcourtman	65118b4fc4	Migrate guest alerts across node moves	2026-04-01 13:02:07 +01:00
rcourtman	31753e5536	Stabilize guest threshold overrides across node moves (#1334)	2026-03-31 23:18:19 +01:00
rcourtman	dcc4747215	Harden alert history and tenant storage paths	2026-03-31 09:23:03 +01:00
rcourtman	4f89a13975	Harden storage path handling for CodeQL	2026-03-31 08:55:45 +01:00
rcourtman	2b96142ee5	Broaden NAS vendor hint matching for RAID suppression (#1362)	2026-03-25 13:24:23 +00:00
rcourtman	fba1fadccd	Make alert node display name resolution instance-aware (#1218)	2026-03-25 12:44:22 +00:00
rcourtman	5aa8be9736	Fix Docker update alert disable handling (#1355)	2026-03-25 10:47:57 +00:00
rcourtman	40249947ed	Fix template backup orphan detection race (#1352)	2026-03-25 10:36:33 +00:00
rcourtman	b9c6f504d8	Fix shared storage override matching (#1341)	2026-03-25 10:25:01 +00:00
rcourtman	b5ee2c1f98	Fix guest override migration for canonical IDs (#1334)	2026-03-25 10:13:10 +00:00
rcourtman	ab85c5a936	Suppress QNAP internal RAID false positives (#1362)	2026-03-25 10:05:41 +00:00
rcourtman	d560de15ad	Increase alerts test cleanup sleep to fix flaky test under load The 10ms goroutine drain pause was insufficient under full parallel test suite load, causing intermittent failures in TestPulseMonitorOnlySkipsDispatchButRetainsAlert.	2026-03-08 22:16:24 +00:00
rcourtman	62225e0c12	fix(alerts): scope orphaned backup detection per PVE instance to prevent false positives (#1286 ) The previous hasLiveInventory guard was a single boolean — if any PVE instance had at least one live guest, orphan detection ran for all instances. In multi-instance clusters with staggered polling, backups from instances whose VMs hadn't been polled yet appeared orphaned, producing false positive alerts with 0m duration. Replace the global boolean with a per-instance map. PVE storage backups now only run orphan detection when their specific instance has live inventory. PBS/PMG backups (which span instances) retain the "any instance has live guests" check.	2026-02-27 13:32:15 +00:00
rcourtman	fa519cd8ce	fix(alerts): prevent false positive orphaned backup alerts during startup race (#1286) Backup polling goroutines can snapshot state before VM/container polling populates the guest inventory. When guestsByVMID is empty, every backup appears orphaned. Gate orphan detection on hasLiveInventory (at least one guest with non-empty ResourceID) and preserve existing orphan alerts when inventory becomes unavailable.	2026-02-26 20:49:10 +00:00
rcourtman	4dc09a1240	feat(alerts): add dedicated backup-orphaned alert type (#1286) Fire a warning alert immediately when a backup's guest no longer exists in inventory, without requiring age thresholds to be breached. The existing alertOrphaned toggle and ignoreVMIDs UI control this feature with no frontend changes needed.	2026-02-24 09:07:43 +00:00
rcourtman	df23d80919	fix(alerts): always send recovery notifications regardless of quiet hours Recovery (all-clear) notifications were being silently suppressed during quiet hours for any non-critical alert. Since powered-off alerts default to Warning level, users who received an alert at 2pm would never get the recovery notification if the VM came back during quiet hours. Quiet hours are intended to suppress noisy firing alerts, not to hide the fact that an issue has resolved. If you got the alert, you should always get the all-clear. Remove the ShouldSuppressResolvedNotification gate from handleAlertResolved. The notifyOnResolve toggle (explicit user preference) is still respected. Fixes #1259	2026-02-18 12:53:09 +00:00
rcourtman	ae4632b5b5	fix: correct UpdateAlertDelayHours doc comment (0 normalizes to 24, -1 disables)	2026-02-10 21:13:12 +00:00
rcourtman	1f74c12ef8	fix(alerts): preserve docker update delay across host identity churn (#1226)	2026-02-09 13:59:52 +00:00
rcourtman	8a48acef1d	fix: hotfix 5.1.5 — node duplication, alert scrambling, ntfy resolved formatting - fix(models): filter nodes by instance in UpdateNodesForInstance to prevent PVE node duplication across poll cycles (#1214, #1192, #1217) - fix(alerts): sort GetActiveAlerts output for stable ordering, preventing hostname scrambling in frontend (#1218) - fix(notifications): add ntfy-specific resolved webhook formatting with plain-text body and proper headers (#1213) - fix(frontend): respect "hide Docker update actions" setting in DockerFilter Update All button (#1219) - fix(frontend): add missing v prefix to GitHub release tag URLs (#1195) - fix(monitoring): reduce disk detection warning from Warn to Debug to eliminate log spam for pass-through disks (#1216) - chore: bump VERSION to 5.1.5	2026-02-08 11:48:22 +00:00
rcourtman	d1e61d8a8a	fix: ship alerting hotfixes and prepare 5.1.4	2026-02-07 22:05:55 +00:00
rcourtman	6909264a02	fix(alerts): reduce swarm alert noise and preserve notification state (#1096)	2026-02-07 14:18:39 +00:00
rcourtman	b5373749db	Fix alert history duration and re-evaluation threshold bugs Update history entry LastSeen on alert resolution so the stored duration reflects how long the alert was actually active, not the snapshot captured at creation time. This fixes the "0m" duration display for all resolved metric-based alerts. Fix reevaluateActiveAlertsLocked to use HostDefaults for host agent alerts and PBSDefaults for PBS alerts instead of falling through to GuestDefaults and NodeDefaults respectively, which could incorrectly resolve or retain alerts on config save when thresholds differ.	2026-02-04 15:40:28 +00:00
rcourtman	cffb91f9ea	Pre-populate node display name cache before guest polling Guest polling (CheckGuest) runs before CheckNode in each poll cycle, so the display name cache was empty when the first guest alert was created. This caused the initial notification to use the raw Proxmox node name. Fix by seeding the cache from modelNodes (which are already available) before guest polling starts. Related to #1188	2026-02-04 14:29:49 +00:00
rcourtman	05266d9062	Show node display name in alerts instead of raw Proxmox node name Alerts previously showed the raw Proxmox node name (e.g., "on pve") even when users configured a display name (e.g., "SPACEX") via Settings or the host agent --hostname flag. This affected the alert UI, email notifications, and webhook payloads. Add NodeDisplayName field to the alert chain: cache display names in the alert Manager (populated by CheckNode/CheckHost on every poll), resolve them at alert creation via preserveAlertState, refresh on metric updates, and enrich at read time in GetActiveAlerts. Update models.Alert, the syncAlertsToState conversion, email templates, Apprise body text, webhook payloads, and all frontend rendering paths. Related to #1188	2026-02-04 14:26:44 +00:00
rcourtman	5073c10030	Fix alert system reliability issues and update audit report - Fix stale alerts not clearing when nodes/hosts go offline in CheckNode and HandleHostOffline - Fix stale alerts persisting when thresholds are disabled or set to 0 in CheckGuest and CheckNode - Fix CheckHost to properly clear disk alerts when overrides disable them - Update audit_report.md with findings from the Alert System Reliability Audit	2026-02-04 12:50:36 +00:00
rcourtman	dcfa8cf0ba	fix: prevent false PBS backup indicators when VMIDs collide across PVE instances (#1177 ) When namespace matching fails, the VMID-only fallback now checks whether the VMID appears on multiple PVE instances. If ambiguous, the fallback is skipped — preventing backups from being falsely attributed to the wrong guest. Unique VMIDs still fall back as before.	2026-02-04 10:11:35 +00:00
rcourtman	aeca5e39fa	Fix multi-tenant persistence and backend stability - Initialize Alert and Notification managers with tenant-specific data directories - Add panic recovery to WebSocket safeSend for stability - Record host metrics to history for sparkline support	2026-02-03 16:24:42 +00:00
rcourtman	71f80c8a99	Fix: alert resolution now records incident timeline during quiet hours - Fixed early return in handleAlertResolved that skipped incident recording when quiet hours suppressed recovery notifications - Added Host Agent alert delay configuration (backend + UI) - Host Agents now have dedicated time threshold settings like other resource types Related to #1179	2026-02-03 12:49:41 +00:00
rcourtman	454448b796	fix: deadlock in offline alert recovery notifications The quiet hours fix (`07b4765b`) added ShouldSuppressResolvedNotification() to handleAlertResolved, which acquires m.mu.RLock(). Five clear*OfflineAlert functions call the resolved callback synchronously while holding m.mu.Lock(). Go's RWMutex is not reentrant, so this deadlocks permanently when any node/PBS/PMG/storage/guest comes back online after being offline. The deadlock prevents recovery notifications from being sent and freezes the monitoring goroutine, cascading to block all subsequent polling. Fix: change the five affected functions to fire the resolved callback asynchronously (matching the pattern already used by clearAlertNoLock), so it runs after m.mu is released. Related to #1068	2026-02-02 18:17:27 +00:00
rcourtman	7444bd0468	fix(alerts): guest alerts misclassified as node alerts when threshold disabled (#1145) In single-node setups, guest alerts had Instance == Node, causing reevaluateActiveAlertsLocked to evaluate them against NodeDefaults instead of GuestDefaults. Setting guest memory threshold to 0 (disabled) wouldn't clear existing guest alerts because they were being kept alive by the still-enabled node memory threshold. - Add resourceID colon check to distinguish guest IDs (instance:node:vmid) from node IDs (instance-node) in reevaluateActiveAlertsLocked - Clear stale alerts in checkMetric when threshold is nil or disabled - Skip hysteresis validation for disabled thresholds (Trigger <= 0) - Fix frontend tooltip: "0" not "-1" disables a threshold	2026-02-02 15:17:53 +00:00
rcourtman	70dbb495ad	fix: address triage issues #1149, #1153, #1162, #1163 - #1163: Add node badges to storage resources in threshold tables (ResourceTable.tsx, ResourceCard.tsx) - #1162: Fix PBS backup alerts showing datastore as node name (alerts.go - use "Unknown" for orphaned backups) - #1153: Fix memory leaks in tracking maps - Add max 48 sample limit for pmgQuarantineHistory - Add max 10 entry limit for flappingHistory - Add cleanup for dockerUpdateFirstSeen - Add cleanupTrackingMaps() for auth, polling, and circuit breaker maps Note: #1149 fix (chat sessions null check) is in AISettings.tsx which has other pending changes - will be committed separately.	2026-01-26 22:21:10 +00:00
rcourtman	d1ab2c913e	refactor: simplify Alerts page and improve backend Frontend: - Major refactor of Alerts page removing patrol-specific UI - Patrol functionality moved to dedicated AI Intelligence page - Simplify ThresholdsTable component - Update InvestigateAlertButton Backend: - Improve alerts handling and processing - Add alerts test coverage - Add orphaned backup alerting support	2026-01-24 22:44:15 +00:00
rcourtman	889719243b	fix: reduce offline alert spam. Related to #1159, #1043	2026-01-24 13:25:25 +00:00
rcourtman	1657beeb92	feat: add alert enhancements - Improve alert handling and processing	2026-01-22 22:32:03 +00:00
rcourtman	0248f0de5a	fix(alerts): Prevent RAID check/scrub from triggering rebuild alerts. Related to #1125 DSM data scrubbing causes RAID arrays to enter a 'check' state with RebuildPercent > 0, which was incorrectly triggering rebuild warnings. Now distinguishes between: - 'check' state: scheduled data scrubbing (no alert) - 'recover'/'resync' state: actual rebuild (warning alert) - 'clean' state with RebuildPercent: scrub in progress (no alert)	2026-01-20 16:13:58 +00:00
rcourtman	96b7370f7b	test: improve coverage for API, AI, Alerts, and Frontend Utils - Add comprehensive tests for internal/api/config_handlers.go (Phases 1-3) - Improve test coverage for AI tools, chat service, and session management - Enhance alert and notification tests (ResolvedAlert, Webhook) - Add frontend unit tests for utils (searchHistory, tagColors, temperature, url) - Add proximity client API tests	2026-01-20 15:52:39 +00:00
rcourtman	1c22688d9b	fix: standardize API version format and guest key separators Closes #1115 (discussion feedback) Two API consistency issues reported by @FabienD74: 1. Version format mismatch in /api/version: - currentVersion: "5.0.16" (no prefix) - latestVersion: "v5.0.16" (with prefix) Fixed: LatestVersion now strips the "v" prefix to match CurrentVersion format. 2. Guest ID separator inconsistency: - Some code used colons: "instance:node:vmid" - BuildGuestKey used dashes: "instance-node-vmid" Fixed: BuildGuestKey now uses colon separator matching the canonical format used by makeGuestID in the monitoring package. The existing legacy migration in GetWithLegacyMigration handles old dash-format entries in guest_metadata.json.	2026-01-19 22:20:18 +00:00
rcourtman	9cd53814a3	feat(alerts): add per-volume disk thresholds for host agents Allow users to set custom disk usage thresholds per mounted filesystem on host agents, rather than applying a single threshold to all volumes. This addresses NAS/NVR use cases where some volumes (e.g., NVR storage) intentionally run at 99% while others need strict monitoring. Backend: - Check for disk-specific overrides before using HostDefaults.Disk - Override key format: host:<hostId>/disk:<mountpoint> - Support both custom thresholds and disable per-disk Frontend: - Add 'hostDisk' resource type - Add "Host Disks" collapsible section in Thresholds → Hosts tab - Group disks by host for easier navigation Closes #1103	2026-01-13 23:38:20 +00:00
rcourtman	1dda538265	fix(models): extend namespace disambiguation to SyncGuestBackupTimes (#1095) The previous commit fixed namespace disambiguation for backup alerts, but the Overview display uses SyncGuestBackupTimes to populate backup timestamps on VMs/Containers. This commit extends the same namespace matching logic to that function. Also tightened the matching algorithm to use suffix matching instead of substring matching, preventing false positives like "pve" matching "pve-nat".	2026-01-12 15:11:59 +00:00
rcourtman	a88edd7c8f	fix(alerts): disambiguate PBS backups using namespace for multi-PVE setups (#1095) When multiple PVE instances have VMs with overlapping VMIDs, PBS backups were being matched to the wrong VM because the code would just use the first matching guest. Now when a PBS backup has a namespace, it attempts to match that namespace to the PVE instance name to find the correct VM. This helps users who have separate PBS instances backing up different PVE clusters with namespaces like "pve1", "nat", etc.	2026-01-12 14:55:17 +00:00
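
The switch from substring to suffix matching described in commit 1dda538265 can be illustrated with a small Go sketch. This is not the code from the repository: the function names `matchBySubstring` and `matchBySuffix`, the argument order, and the lowercase normalization are all assumptions made for illustration. It only shows why a suffix rule rejects the "pve" versus "pve-nat" case that a substring rule accepts.

```go
package main

import (
	"fmt"
	"strings"
)

// matchBySubstring is a hypothetical version of the looser rule: the namespace
// matches if it appears anywhere inside the instance name.
func matchBySubstring(namespace, instance string) bool {
	ns := strings.ToLower(namespace)
	if ns == "" {
		return false
	}
	return strings.Contains(strings.ToLower(instance), ns)
}

// matchBySuffix is a hypothetical version of the tighter rule: the namespace
// matches only if the instance name ends with it.
func matchBySuffix(namespace, instance string) bool {
	ns := strings.ToLower(namespace)
	if ns == "" {
		return false
	}
	return strings.HasSuffix(strings.ToLower(instance), ns)
}

func main() {
	// The false positive named in the commit message.
	fmt.Println(matchBySubstring("pve", "pve-nat")) // true
	fmt.Println(matchBySuffix("pve", "pve-nat"))    // false

	// A namespace that names the tail of the instance still matches.
	fmt.Println(matchBySuffix("nat", "pve-nat")) // true
}
```

Under these assumptions the suffix rule trades some recall for precision: a namespace that only shares a prefix with an instance name no longer matches it.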

188 commits