Apply quiet-hours suppression to escalation notifications so that offline alerts and other suppressed categories do not bypass the normal notification rules when they escalate.
Fixes #1398.
The 10ms goroutine drain pause was insufficient under full parallel
test suite load, causing intermittent failures in
TestPulseMonitorOnlySkipsDispatchButRetainsAlert.
The previous hasLiveInventory guard was a single boolean — if any PVE
instance had at least one live guest, orphan detection ran for all
instances. In multi-instance clusters with staggered polling, backups
from instances whose VMs hadn't been polled yet appeared orphaned,
producing false positive alerts with 0m duration.
Replace the global boolean with a per-instance map. PVE storage backups
now only run orphan detection when their specific instance has live
inventory. PBS/PMG backups (which span instances) retain the "any
instance has live guests" check.
Backup polling goroutines can snapshot state before VM/container polling
populates the guest inventory. When guestsByVMID is empty, every backup
appears orphaned. Gate orphan detection on hasLiveInventory (at least one
guest with non-empty ResourceID) and preserve existing orphan alerts when
inventory becomes unavailable.
Fire a warning alert immediately when a backup's guest no longer exists
in inventory, without requiring age thresholds to be breached. The
existing alertOrphaned toggle and ignoreVMIDs UI already control this feature,
so no frontend changes are needed.
Recovery (all-clear) notifications were being silently suppressed during
quiet hours for any non-critical alert. Since powered-off alerts default
to Warning level, users who received an alert at 2pm would never get the
recovery notification if the VM came back during quiet hours.
Quiet hours are intended to suppress noisy firing alerts, not to hide
the fact that an issue has resolved. If you got the alert, you should
always get the all-clear.
Remove the ShouldSuppressResolvedNotification gate from handleAlertResolved.
The notifyOnResolve toggle (explicit user preference) is still respected.
Fixes #1259
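A rough sketch of the resolved-alert path after this change; Alert, Config, and the callback signature are illustrative, not the real handleAlertResolved:

```go
// Rough sketch of the resolved-alert path after the change; Alert, Config, and
// the callback signature are illustrative, not the real handleAlertResolved.
package main

import "fmt"

type Alert struct {
	ID string
}

type Config struct {
	NotifyOnResolve bool // explicit user preference, still respected
}

func handleAlertResolved(a Alert, cfg Config, send func(Alert)) {
	// Previously: if ShouldSuppressResolvedNotification() { return }
	// That quiet-hours gate is removed; quiet hours only suppress firing alerts.
	if !cfg.NotifyOnResolve {
		return
	}
	send(a) // if you got the alert, you always get the all-clear
}

func main() {
	handleAlertResolved(
		Alert{ID: "vm-101-powered-off"},
		Config{NotifyOnResolve: true},
		func(a Alert) { fmt.Println("sending all-clear for", a.ID) },
	)
}
```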
- fix(models): filter nodes by instance in UpdateNodesForInstance to prevent
PVE node duplication across poll cycles (#1214, #1192, #1217)
- fix(alerts): sort GetActiveAlerts output for stable ordering, preventing
hostname scrambling in frontend (#1218)
- fix(notifications): add ntfy-specific resolved webhook formatting with
plain-text body and proper headers (#1213)
- fix(frontend): respect "hide Docker update actions" setting in
DockerFilter Update All button (#1219)
- fix(frontend): add missing v prefix to GitHub release tag URLs (#1195)
- fix(monitoring): reduce disk detection warning from Warn to Debug to
eliminate log spam for pass-through disks (#1216)
- chore: bump VERSION to 5.1.5
Update history entry LastSeen on alert resolution so the stored duration
reflects how long the alert was actually active, not the snapshot captured
at creation time. This fixes the "0m" duration display for all resolved
metric-based alerts.
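A minimal sketch of the idea, with an illustrative HistoryEntry type: stamping LastSeen at resolution time makes the stored duration cover the alert's actual lifetime instead of reading 0m:

```go
// Minimal sketch, assuming a history entry with FirstSeen/LastSeen timestamps;
// struct and function names are illustrative.
package main

import (
	"fmt"
	"time"
)

type HistoryEntry struct {
	AlertID   string
	FirstSeen time.Time
	LastSeen  time.Time
}

func resolve(e *HistoryEntry, now time.Time) {
	// Without this, LastSeen keeps the value captured at creation time
	// and the displayed duration is "0m".
	e.LastSeen = now
}

func main() {
	e := &HistoryEntry{AlertID: "guest-cpu", FirstSeen: time.Now().Add(-42 * time.Minute)}
	e.LastSeen = e.FirstSeen // snapshot captured at creation
	resolve(e, time.Now())
	fmt.Printf("active for %s\n", e.LastSeen.Sub(e.FirstSeen).Round(time.Minute))
}
```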
Fix reevaluateActiveAlertsLocked to use HostDefaults for host agent alerts
and PBSDefaults for PBS alerts instead of falling through to GuestDefaults
and NodeDefaults respectively; that fallthrough could incorrectly resolve or
retain alerts on config save when the thresholds differ.
Guest polling (CheckGuest) runs before CheckNode in each poll cycle,
so the display name cache was empty when the first guest alert was
created. This caused the initial notification to use the raw Proxmox
node name. Fix by seeding the cache from modelNodes (which are already
available) before guest polling starts.
Related to #1188
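A simplified sketch of the ordering fix, with illustrative Node and Manager types: the display-name cache is seeded from the node models before the guest checks run, so the first guest alert already resolves a friendly name:

```go
// Sketch only: seed the node display-name cache from the already-available
// node models before guest polling. All identifiers here are illustrative.
package main

import "fmt"

type Node struct {
	Name        string // raw Proxmox node name, e.g. "pve"
	DisplayName string // user-configured, e.g. "SPACEX"
}

type Manager struct {
	displayNames map[string]string
}

func (m *Manager) seedDisplayNames(nodes []Node) {
	for _, n := range nodes {
		if n.DisplayName != "" {
			m.displayNames[n.Name] = n.DisplayName
		}
	}
}

func (m *Manager) checkGuest(nodeName string) {
	name := m.displayNames[nodeName]
	if name == "" {
		name = nodeName // fall back to the raw Proxmox name
	}
	fmt.Println("alert on", name)
}

func main() {
	m := &Manager{displayNames: map[string]string{}}
	modelNodes := []Node{{Name: "pve", DisplayName: "SPACEX"}}
	m.seedDisplayNames(modelNodes) // runs before guest polling starts
	m.checkGuest("pve")            // prints "alert on SPACEX"
}
```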
Alerts previously showed the raw Proxmox node name (e.g., "on pve") even
when users configured a display name (e.g., "SPACEX") via Settings or the
host agent --hostname flag. This affected the alert UI, email notifications,
and webhook payloads.
Add NodeDisplayName field to the alert chain: cache display names in the
alert Manager (populated by CheckNode/CheckHost on every poll), resolve
them at alert creation via preserveAlertState, refresh on metric updates,
and enrich at read time in GetActiveAlerts. Update models.Alert, the
syncAlertsToState conversion, email templates, Apprise body text, webhook
payloads, and all frontend rendering paths.
Related to #1188
- Fix stale alerts not clearing when nodes/hosts go offline in CheckNode and HandleHostOffline
- Fix stale alerts persisting when thresholds are disabled or set to 0 in CheckGuest and CheckNode
- Fix CheckHost to properly clear disk alerts when overrides disable them
- Update audit_report.md with findings from the Alert System Reliability Audit
When namespace matching fails, the VMID-only fallback now checks whether
the VMID appears on multiple PVE instances. If ambiguous, the fallback
is skipped — preventing backups from being falsely attributed to the
wrong guest. Unique VMIDs still fall back as before.
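A sketch of the ambiguity guard under assumed types (Guest and fallbackGuest are hypothetical names): the VMID-only fallback is taken only when exactly one PVE instance knows that VMID:

```go
// Illustrative sketch of the ambiguity guard for the VMID-only fallback.
package main

import "fmt"

type Guest struct {
	Instance string
	VMID     int
}

// fallbackGuest returns a guest for vmid only if the VMID is unambiguous
// across instances; otherwise the backup stays unmatched.
func fallbackGuest(guests []Guest, vmid int) (Guest, bool) {
	var match Guest
	instances := map[string]bool{}
	for _, g := range guests {
		if g.VMID == vmid {
			instances[g.Instance] = true
			match = g
		}
	}
	if len(instances) == 1 {
		return match, true
	}
	return Guest{}, false // ambiguous (or unknown): skip the fallback
}

func main() {
	guests := []Guest{{"pve1", 100}, {"pve2", 100}, {"pve1", 101}}
	if _, ok := fallbackGuest(guests, 100); !ok {
		fmt.Println("vmid 100 is ambiguous, fallback skipped")
	}
	if g, ok := fallbackGuest(guests, 101); ok {
		fmt.Println("vmid 101 uniquely belongs to", g.Instance)
	}
}
```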
- Initialize Alert and Notification managers with tenant-specific data directories
- Add panic recovery to WebSocket safeSend for stability
- Record host metrics to history for sparkline support
- Fixed early return in handleAlertResolved that skipped incident recording
when quiet hours suppressed recovery notifications
- Added Host Agent alert delay configuration (backend + UI)
- Host Agents now have dedicated time threshold settings like other resource types
Related to #1179
The quiet hours fix (07b4765b) added ShouldSuppressResolvedNotification()
to handleAlertResolved, which acquires m.mu.RLock(). Five clear*OfflineAlert
functions call the resolved callback synchronously while holding m.mu.Lock().
Go's RWMutex is not reentrant, so this deadlocks permanently when any
node/PBS/PMG/storage/guest comes back online after being offline.
The deadlock prevents recovery notifications from being sent and freezes
the monitoring goroutine, cascading to block all subsequent polling.
Fix: change the five affected functions to fire the resolved callback
asynchronously (matching the pattern already used by clearAlertNoLock),
so it runs after m.mu is released.
Related to #1068
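A simplified, self-contained illustration of the deadlock and the asynchronous fix; the Manager here is a stand-in, not the real alert manager:

```go
// Go's sync.RWMutex is not reentrant: RLock() on a goroutine that already
// holds Lock() blocks forever. Firing the callback in a new goroutine defers
// it until after the write lock is released. Illustrative code only.
package main

import (
	"fmt"
	"sync"
)

type Manager struct {
	mu         sync.RWMutex
	onResolved func(id string)
}

func (m *Manager) shouldSuppressResolved() bool {
	m.mu.RLock() // would deadlock if called on a goroutine holding m.mu.Lock()
	defer m.mu.RUnlock()
	return false
}

func (m *Manager) clearOfflineAlert(id string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	// Fix: fire the resolved callback asynchronously so it runs after m.mu is
	// released, matching the pattern already used by clearAlertNoLock.
	// Calling m.onResolved(id) synchronously here would deadlock, because the
	// callback re-acquires the non-reentrant mutex via shouldSuppressResolved.
	go m.onResolved(id)
}

func main() {
	var wg sync.WaitGroup
	wg.Add(1)
	m := &Manager{}
	m.onResolved = func(id string) {
		defer wg.Done()
		if !m.shouldSuppressResolved() {
			fmt.Println("recovery notification for", id)
		}
	}
	m.clearOfflineAlert("node-offline")
	wg.Wait()
}
```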
In single-node setups, guest alerts had Instance == Node, causing
reevaluateActiveAlertsLocked to evaluate them against NodeDefaults
instead of GuestDefaults. Setting guest memory threshold to 0 (disabled)
wouldn't clear existing guest alerts because they were being kept alive
by the still-enabled node memory threshold.
- Add resourceID colon check to distinguish guest IDs (instance:node:vmid)
from node IDs (instance-node) in reevaluateActiveAlertsLocked
- Clear stale alerts in checkMetric when threshold is nil or disabled
- Skip hysteresis validation for disabled thresholds (Trigger <= 0)
- Fix frontend tooltip: a threshold is disabled with "0", not "-1"
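A small sketch of the resourceID colon check from the first bullet above; the classification helper is illustrative:

```go
// Guest resource IDs contain colons ("instance:node:vmid") while node IDs use
// dashes ("instance-node"), so a colon check selects the right defaults.
package main

import (
	"fmt"
	"strings"
)

type kind int

const (
	nodeResource kind = iota
	guestResource
)

func classify(resourceID string) kind {
	if strings.Contains(resourceID, ":") {
		return guestResource // "instance:node:vmid"
	}
	return nodeResource // "instance-node"
}

func main() {
	for _, id := range []string{"pve:pve:105", "pve-pve"} {
		if classify(id) == guestResource {
			fmt.Println(id, "-> evaluate against GuestDefaults")
		} else {
			fmt.Println(id, "-> evaluate against NodeDefaults")
		}
	}
}
```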
- #1163: Add node badges to storage resources in threshold tables
(ResourceTable.tsx, ResourceCard.tsx)
- #1162: Fix PBS backup alerts showing datastore as node name
(alerts.go - use "Unknown" for orphaned backups)
- #1153: Fix memory leaks in tracking maps
- Add max 48 sample limit for pmgQuarantineHistory
- Add max 10 entry limit for flappingHistory
- Add cleanup for dockerUpdateFirstSeen
- Add cleanupTrackingMaps() for auth, polling, and circuit breaker maps
Note: the #1149 fix (chat sessions null check) is in AISettings.tsx,
which has other pending changes and will be committed separately.
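As one illustration of the tracking-map caps listed above, a bounded-append helper for a rolling history; the 48-sample limit comes from the pmgQuarantineHistory bullet, everything else is hypothetical:

```go
// Illustrative sketch: keep a rolling history bounded so a long-running
// process does not grow its tracking slices without limit.
package main

import "fmt"

const maxQuarantineSamples = 48

// appendBounded appends a sample and drops the oldest entries beyond the cap.
func appendBounded(history []int, sample int) []int {
	history = append(history, sample)
	if len(history) > maxQuarantineSamples {
		history = history[len(history)-maxQuarantineSamples:]
	}
	return history
}

func main() {
	var pmgQuarantineHistory []int
	for i := 0; i < 100; i++ {
		pmgQuarantineHistory = appendBounded(pmgQuarantineHistory, i)
	}
	fmt.Println("samples retained:", len(pmgQuarantineHistory)) // 48
}
```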
DSM data scrubbing causes RAID arrays to enter a 'check' state with
RebuildPercent > 0, which was incorrectly triggering rebuild warnings.
Now distinguishes between:
- 'check' state: scheduled data scrubbing (no alert)
- 'recover'/'resync' state: actual rebuild (warning alert)
- 'clean' state with RebuildPercent: scrub in progress (no alert)
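A sketch of the state classification with assumed field names for the reported RAID status; only genuine rebuild states raise a warning:

```go
// Illustrative sketch of the DSM/RAID state handling described above.
package main

import "fmt"

type RaidArray struct {
	State          string  // "check", "recover", "resync", "clean", ...
	RebuildPercent float64 // > 0 while DSM is scrubbing or rebuilding
}

// shouldWarn reports whether the array state represents a real rebuild.
func shouldWarn(a RaidArray) bool {
	switch a.State {
	case "recover", "resync":
		return true // actual rebuild: warning alert
	case "check":
		return false // scheduled data scrubbing: no alert
	case "clean":
		return false // a scrub in progress can still report RebuildPercent > 0
	default:
		return false
	}
}

func main() {
	arrays := []RaidArray{
		{State: "check", RebuildPercent: 37.5},
		{State: "recover", RebuildPercent: 12.0},
		{State: "clean", RebuildPercent: 80.0},
	}
	for _, a := range arrays {
		fmt.Printf("state=%-8s alert=%v\n", a.State, shouldWarn(a))
	}
}
```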
- Add comprehensive tests for internal/api/config_handlers.go (Phases 1-3)
- Improve test coverage for AI tools, chat service, and session management
- Enhance alert and notification tests (ResolvedAlert, Webhook)
- Add frontend unit tests for utils (searchHistory, tagColors, temperature, url)
- Add proximity client API tests
Closes #1115 (discussion feedback)
Two API consistency issues reported by @FabienD74:
1. Version format mismatch in /api/version:
- currentVersion: "5.0.16" (no prefix)
- latestVersion: "v5.0.16" (with prefix)
Fixed: LatestVersion now strips the "v" prefix to match CurrentVersion format.
2. Guest ID separator inconsistency:
- Some code used colons: "instance:node:vmid"
- BuildGuestKey used dashes: "instance-node-vmid"
Fixed: BuildGuestKey now uses colon separator matching the canonical
format used by makeGuestID in the monitoring package. The existing
legacy migration in GetWithLegacyMigration handles old dash-format
entries in guest_metadata.json.
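A minimal sketch of both consistency fixes; the helper names and signatures are assumptions for illustration, not the real API:

```go
// Sketch only: strip the "v" prefix from the latest version, and build guest
// keys with colons. normalizeLatestVersion and buildGuestKey are hypothetical.
package main

import (
	"fmt"
	"strings"
)

// normalizeLatestVersion makes latestVersion match the prefix-less
// currentVersion format, e.g. "v5.0.16" -> "5.0.16".
func normalizeLatestVersion(v string) string {
	return strings.TrimPrefix(v, "v")
}

// buildGuestKey joins the parts with colons, the canonical format used by
// makeGuestID ("instance:node:vmid") rather than the old dash format.
func buildGuestKey(instance, node string, vmid int) string {
	return fmt.Sprintf("%s:%s:%d", instance, node, vmid)
}

func main() {
	fmt.Println(normalizeLatestVersion("v5.0.16"))   // 5.0.16
	fmt.Println(buildGuestKey("pve1", "node1", 105)) // pve1:node1:105
}
```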
Allow users to set custom disk usage thresholds per mounted filesystem
on host agents, rather than applying a single threshold to all volumes.
This addresses NAS/NVR use cases where some volumes (e.g., NVR storage)
intentionally run at 99% while others need strict monitoring.
Backend:
- Check for disk-specific overrides before using HostDefaults.Disk
- Override key format: host:<hostId>/disk:<mountpoint>
- Support both custom thresholds and disable per-disk
Frontend:
- Add 'hostDisk' resource type
- Add "Host Disks" collapsible section in Thresholds → Hosts tab
- Group disks by host for easier navigation
Closes #1103
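A hedged sketch of the backend lookup described above: a per-disk override key of the form host:<hostId>/disk:<mountpoint> is checked before falling back to HostDefaults.Disk. The threshold type and map layout are assumptions:

```go
// Illustrative sketch of the per-disk override lookup.
package main

import "fmt"

type DiskThreshold struct {
	Percent  float64
	Disabled bool
}

var hostDefaultsDisk = DiskThreshold{Percent: 85}

// diskThreshold resolves the effective threshold for one mounted filesystem.
func diskThreshold(overrides map[string]DiskThreshold, hostID, mountpoint string) (DiskThreshold, bool) {
	key := fmt.Sprintf("host:%s/disk:%s", hostID, mountpoint)
	if t, ok := overrides[key]; ok {
		if t.Disabled {
			return DiskThreshold{}, false // monitoring disabled for this disk
		}
		return t, true
	}
	return hostDefaultsDisk, true
}

func main() {
	overrides := map[string]DiskThreshold{
		"host:nas01/disk:/volume2": {Percent: 99}, // NVR volume runs hot on purpose
	}
	if t, ok := diskThreshold(overrides, "nas01", "/volume2"); ok {
		fmt.Println("/volume2 threshold:", t.Percent) // 99 (override)
	}
	if t, ok := diskThreshold(overrides, "nas01", "/volume1"); ok {
		fmt.Println("/volume1 threshold:", t.Percent) // 85 (default)
	}
}
```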
The previous commit fixed namespace disambiguation for backup alerts,
but the Overview display uses SyncGuestBackupTimes to populate backup
timestamps on VMs/Containers. This commit extends the same namespace
matching logic to that function.
Also tightened the matching algorithm to use suffix matching instead
of substring matching, preventing false positives like "pve" matching
"pve-nat".
When multiple PVE instances have VMs with overlapping VMIDs, PBS backups
were being matched to the wrong VM because the code would just use the
first matching guest. Now when a PBS backup has a namespace, it attempts
to match that namespace to the PVE instance name to find the correct VM.
This helps users who have separate PBS instances backing up different
PVE clusters with namespaces like "pve1", "nat", etc.
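A combined sketch of this namespace-to-instance matching together with the suffix-matching tightening from the follow-up commit above; all types and names are illustrative:

```go
// Sketch only: match a PBS backup's namespace to a PVE instance name by
// suffix, so "pve" no longer matches "pve-nat".
package main

import (
	"fmt"
	"strings"
)

type Guest struct {
	Instance string
	VMID     int
}

// matchByNamespace picks the guest whose instance name matches the backup's
// namespace by suffix (in either direction) instead of by substring.
func matchByNamespace(guests []Guest, vmid int, namespace string) (Guest, bool) {
	for _, g := range guests {
		if g.VMID != vmid {
			continue
		}
		if strings.HasSuffix(g.Instance, namespace) || strings.HasSuffix(namespace, g.Instance) {
			return g, true
		}
	}
	return Guest{}, false
}

func main() {
	guests := []Guest{{"pve", 100}, {"pve-nat", 100}}
	if g, ok := matchByNamespace(guests, 100, "nat"); ok {
		fmt.Println("namespace nat ->", g.Instance) // pve-nat
	}
	if g, ok := matchByNamespace(guests, 100, "pve"); ok {
		fmt.Println("namespace pve ->", g.Instance) // pve, not pve-nat
	}
}
```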