174 commits

Author SHA1 Message Date
rcourtman
2b96142ee5 Broaden NAS vendor hint matching for RAID suppression (#1362) 2026-03-25 13:24:23 +00:00
rcourtman
fba1fadccd Make alert node display name resolution instance-aware (#1218) 2026-03-25 12:44:22 +00:00
rcourtman
5aa8be9736 Fix Docker update alert disable handling (#1355) 2026-03-25 10:47:57 +00:00
rcourtman
40249947ed Fix template backup orphan detection race (#1352) 2026-03-25 10:36:33 +00:00
rcourtman
b9c6f504d8 Fix shared storage override matching (#1341) 2026-03-25 10:25:01 +00:00
rcourtman
b5ee2c1f98 Fix guest override migration for canonical IDs (#1334) 2026-03-25 10:13:10 +00:00
rcourtman
ab85c5a936 Suppress QNAP internal RAID false positives (#1362) 2026-03-25 10:05:41 +00:00
rcourtman
d560de15ad Increase alerts test cleanup sleep to fix flaky test under load
The 10ms goroutine drain pause was insufficient under full parallel
test suite load, causing intermittent failures in
TestPulseMonitorOnlySkipsDispatchButRetainsAlert.
2026-03-08 22:16:24 +00:00
rcourtman
62225e0c12 fix(alerts): scope orphaned backup detection per PVE instance to prevent false positives (#1286)
The previous hasLiveInventory guard was a single boolean — if any PVE
instance had at least one live guest, orphan detection ran for all
instances. In multi-instance clusters with staggered polling, backups
from instances whose VMs hadn't been polled yet appeared orphaned,
producing false positive alerts with 0m duration.

Replace the global boolean with a per-instance map. PVE storage backups
now only run orphan detection when their specific instance has live
inventory. PBS/PMG backups (which span instances) retain the "any
instance has live guests" check.
2026-02-27 13:32:15 +00:00
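
The per-instance guard described in 62225e0c12 could look roughly like the sketch below. This is not the project's code: the `Backup` type, the `isOrphaned` helper, and the shape of the inventory map are assumptions made for illustration.

```go
package main

import "fmt"

// Backup is a hypothetical, trimmed-down backup record.
// Instance is empty for PBS/PMG backups, which span instances.
type Backup struct {
	Instance string
	VMID     int
}

// isOrphaned reports whether a backup should be flagged as orphaned.
// inventory maps an instance name to the set of VMIDs seen in its last poll;
// an instance with no polled guests has an empty (or missing) entry.
func isOrphaned(b Backup, inventory map[string]map[int]bool) bool {
	if b.Instance != "" {
		guests := inventory[b.Instance]
		if len(guests) == 0 {
			// This instance has no live inventory yet: skip detection
			// instead of reporting every backup as orphaned.
			return false
		}
		return !guests[b.VMID]
	}
	// Cross-instance backup: keep the "any instance has live guests" check.
	anyLive := false
	for _, guests := range inventory {
		if len(guests) > 0 {
			anyLive = true
		}
		if guests[b.VMID] {
			return false
		}
	}
	return anyLive
}

func main() {
	inventory := map[string]map[int]bool{
		"pve1": {100: true},
		"pve2": {}, // not polled yet
	}
	fmt.Println(isOrphaned(Backup{Instance: "pve2", VMID: 200}, inventory))
	fmt.Println(isOrphaned(Backup{Instance: "pve1", VMID: 200}, inventory))
}
```

With a single global boolean, the `pve2` backup above would have been evaluated as soon as `pve1` reported a guest. The per-instance lookup defers it until `pve2` itself has been polled.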
rcourtman
fa519cd8ce fix(alerts): prevent false positive orphaned backup alerts during startup race (#1286)
Backup polling goroutines can snapshot state before VM/container polling
populates the guest inventory. When guestsByVMID is empty, every backup
appears orphaned. Gate orphan detection on hasLiveInventory (at least one
guest with non-empty ResourceID) and preserve existing orphan alerts when
inventory becomes unavailable.
2026-02-26 20:49:10 +00:00
rcourtman
4dc09a1240 feat(alerts): add dedicated backup-orphaned alert type (#1286)
Fire a warning alert immediately when a backup's guest no longer exists
in inventory, without requiring age thresholds to be breached. The
existing alertOrphaned toggle and ignoreVMIDs UI control this feature
with no frontend changes needed.
2026-02-24 09:07:43 +00:00
rcourtman
df23d80919 fix(alerts): always send recovery notifications regardless of quiet hours
Recovery (all-clear) notifications were being silently suppressed during
quiet hours for any non-critical alert. Since powered-off alerts default
to Warning level, users who received an alert at 2pm would never get the
recovery notification if the VM came back during quiet hours.

Quiet hours are intended to suppress noisy firing alerts, not to hide
the fact that an issue has resolved. If you got the alert, you should
always get the all-clear.

Remove the ShouldSuppressResolvedNotification gate from handleAlertResolved.
The notifyOnResolve toggle (explicit user preference) is still respected.

Fixes #1259
2026-02-18 12:53:09 +00:00
rcourtman
ae4632b5b5 fix: correct UpdateAlertDelayHours doc comment (0 normalizes to 24, -1 disables) 2026-02-10 21:13:12 +00:00
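
The normalization rule named in ae4632b5b5 (0 becomes 24, -1 disables) can be illustrated with a small helper. The function name and return shape are hypothetical, and treating every negative value as "disabled" is an assumption; the commit only mentions -1.

```go
package main

import "fmt"

// normalizeUpdateDelayHours is a hypothetical helper, not the project's code.
// It returns the effective delay in hours and whether update alerts are enabled.
func normalizeUpdateDelayHours(hours int) (int, bool) {
	switch {
	case hours < 0:
		// Assumption: any negative value disables the alert, not only -1.
		return 0, false
	case hours == 0:
		// An unset value falls back to the 24-hour default.
		return 24, true
	default:
		return hours, true
	}
}

func main() {
	for _, h := range []int{-1, 0, 6} {
		delay, enabled := normalizeUpdateDelayHours(h)
		fmt.Println(h, delay, enabled)
	}
}
```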
rcourtman
1f74c12ef8 fix(alerts): preserve docker update delay across host identity churn (#1226) 2026-02-09 13:59:52 +00:00
rcourtman
8a48acef1d fix: hotfix 5.1.5 — node duplication, alert scrambling, ntfy resolved formatting
- fix(models): filter nodes by instance in UpdateNodesForInstance to prevent
  PVE node duplication across poll cycles (#1214, #1192, #1217)
- fix(alerts): sort GetActiveAlerts output for stable ordering, preventing
  hostname scrambling in frontend (#1218)
- fix(notifications): add ntfy-specific resolved webhook formatting with
  plain-text body and proper headers (#1213)
- fix(frontend): respect "hide Docker update actions" setting in
  DockerFilter Update All button (#1219)
- fix(frontend): add missing v prefix to GitHub release tag URLs (#1195)
- fix(monitoring): reduce disk detection warning from Warn to Debug to
  eliminate log spam for pass-through disks (#1216)
- chore: bump VERSION to 5.1.5
2026-02-08 11:48:22 +00:00
rcourtman
d1e61d8a8a fix: ship alerting hotfixes and prepare 5.1.4 2026-02-07 22:05:55 +00:00
rcourtman
6909264a02 fix(alerts): reduce swarm alert noise and preserve notification state (#1096) 2026-02-07 14:18:39 +00:00
rcourtman
b5373749db Fix alert history duration and re-evaluation threshold bugs
Update history entry LastSeen on alert resolution so the stored duration
reflects how long the alert was actually active, not the snapshot captured
at creation time. This fixes the "0m" duration display for all resolved
metric-based alerts.

Fix reevaluateActiveAlertsLocked to use HostDefaults for host agent alerts
and PBSDefaults for PBS alerts instead of falling through to GuestDefaults
and NodeDefaults respectively, which could incorrectly resolve or retain
alerts on config save when thresholds differ.
2026-02-04 15:40:28 +00:00
rcourtman
cffb91f9ea Pre-populate node display name cache before guest polling
Guest polling (CheckGuest) runs before CheckNode in each poll cycle,
so the display name cache was empty when the first guest alert was
created. This caused the initial notification to use the raw Proxmox
node name. Fix by seeding the cache from modelNodes (which are already
available) before guest polling starts.

Related to #1188
2026-02-04 14:29:49 +00:00
rcourtman
05266d9062 Show node display name in alerts instead of raw Proxmox node name
Alerts previously showed the raw Proxmox node name (e.g., "on pve") even
when users configured a display name (e.g., "SPACEX") via Settings or the
host agent --hostname flag. This affected the alert UI, email notifications,
and webhook payloads.

Add NodeDisplayName field to the alert chain: cache display names in the
alert Manager (populated by CheckNode/CheckHost on every poll), resolve
them at alert creation via preserveAlertState, refresh on metric updates,
and enrich at read time in GetActiveAlerts. Update models.Alert, the
syncAlertsToState conversion, email templates, Apprise body text, webhook
payloads, and all frontend rendering paths.

Related to #1188
2026-02-04 14:26:44 +00:00
rcourtman
5073c10030 Fix alert system reliability issues and update audit report
- Fix stale alerts not clearing when nodes/hosts go offline in CheckNode and HandleHostOffline
- Fix stale alerts persisting when thresholds are disabled or set to 0 in CheckGuest and CheckNode
- Fix CheckHost to properly clear disk alerts when overrides disable them
- Update audit_report.md with findings from the Alert System Reliability Audit
2026-02-04 12:50:36 +00:00
rcourtman
dcfa8cf0ba fix: prevent false PBS backup indicators when VMIDs collide across PVE instances (#1177)
When namespace matching fails, the VMID-only fallback now checks whether
the VMID appears on multiple PVE instances. If ambiguous, the fallback
is skipped — preventing backups from being falsely attributed to the
wrong guest. Unique VMIDs still fall back as before.
2026-02-04 10:11:35 +00:00
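
A minimal sketch of the ambiguity check in dcfa8cf0ba, assuming a flat list of guests. The `Guest` type and `fallbackByVMID` function are illustrative names, not the project's API.

```go
package main

import "fmt"

// Guest is a hypothetical, trimmed-down guest record.
type Guest struct {
	Instance string
	VMID     int
}

// fallbackByVMID returns the guest matching vmid only when that VMID is
// present on exactly one PVE instance. If it appears on several instances
// the match is ambiguous and the fallback is skipped.
func fallbackByVMID(guests []Guest, vmid int) (Guest, bool) {
	var match Guest
	instances := map[string]struct{}{}
	for _, g := range guests {
		if g.VMID == vmid {
			match = g
			instances[g.Instance] = struct{}{}
		}
	}
	if len(instances) != 1 {
		return Guest{}, false
	}
	return match, true
}

func main() {
	guests := []Guest{{"pve1", 100}, {"pve2", 100}, {"pve1", 101}}
	_, ok := fallbackByVMID(guests, 100)
	fmt.Println("VMID 100 resolved:", ok)
	g, ok := fallbackByVMID(guests, 101)
	fmt.Println("VMID 101 resolved:", ok, g.Instance)
}
```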
rcourtman
aeca5e39fa Fix multi-tenant persistence and backend stability
- Initialize Alert and Notification managers with tenant-specific data directories

- Add panic recovery to WebSocket safeSend for stability

- Record host metrics to history for sparkline support
2026-02-03 16:24:42 +00:00
rcourtman
71f80c8a99 Fix: alert resolution now records incident timeline during quiet hours
- Fixed early return in handleAlertResolved that skipped incident recording
  when quiet hours suppressed recovery notifications
- Added Host Agent alert delay configuration (backend + UI)
- Host Agents now have dedicated time threshold settings like other resource types

Related to #1179
2026-02-03 12:49:41 +00:00
rcourtman
454448b796 fix: deadlock in offline alert recovery notifications
The quiet hours fix (07b4765b) added ShouldSuppressResolvedNotification()
to handleAlertResolved, which acquires m.mu.RLock(). Five clear*OfflineAlert
functions call the resolved callback synchronously while holding m.mu.Lock().
Go's RWMutex is not reentrant, so this deadlocks permanently when any
node/PBS/PMG/storage/guest comes back online after being offline.

The deadlock prevents recovery notifications from being sent and freezes
the monitoring goroutine, cascading to block all subsequent polling.

Fix: change the five affected functions to fire the resolved callback
asynchronously (matching the pattern already used by clearAlertNoLock),
so it runs after m.mu is released.

Related to #1068
2026-02-02 18:17:27 +00:00
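
The locking pattern behind 454448b796 can be reproduced in a few lines. The sketch below is a simplified stand-in, not the real alert manager: the `Manager` fields, `ClearOffline`, and `ShouldSuppress` are assumed names. It shows the fixed form, where the callback is started in its own goroutine and only acquires the read lock after the writer has released `mu`.

```go
package main

import (
	"fmt"
	"sync"
)

// Manager is a hypothetical, minimal stand-in for an alert manager.
type Manager struct {
	mu         sync.RWMutex
	quiet      bool
	active     map[string]bool
	onResolved func(id string)
}

// ShouldSuppress takes the read lock, as the quiet-hours check does.
func (m *Manager) ShouldSuppress() bool {
	m.mu.RLock()
	defer m.mu.RUnlock()
	return m.quiet
}

// ClearOffline removes an alert while holding the write lock.
// Calling m.onResolved synchronously here would deadlock if the callback
// calls ShouldSuppress, because sync.RWMutex is not reentrant.
func (m *Manager) ClearOffline(id string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	if !m.active[id] {
		return
	}
	delete(m.active, id)
	if cb := m.onResolved; cb != nil {
		go cb(id) // the callback blocks on RLock until this Lock is released
	}
}

func main() {
	m := &Manager{active: map[string]bool{"node-1": true}}
	done := make(chan string, 1)
	m.onResolved = func(id string) {
		if !m.ShouldSuppress() {
			done <- id
		}
	}
	m.ClearOffline("node-1")
	fmt.Println("resolved:", <-done)
}
```

Replacing `go cb(id)` with `cb(id)` in this sketch makes the program hang, which is the failure mode the commit describes.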
rcourtman
7444bd0468 fix(alerts): guest alerts misclassified as node alerts when threshold disabled (#1145)
In single-node setups, guest alerts had Instance == Node, causing
reevaluateActiveAlertsLocked to evaluate them against NodeDefaults
instead of GuestDefaults. Setting guest memory threshold to 0 (disabled)
wouldn't clear existing guest alerts because they were being kept alive
by the still-enabled node memory threshold.

- Add resourceID colon check to distinguish guest IDs (instance:node:vmid)
  from node IDs (instance-node) in reevaluateActiveAlertsLocked
- Clear stale alerts in checkMetric when threshold is nil or disabled
- Skip hysteresis validation for disabled thresholds (Trigger <= 0)
- Fix frontend tooltip: "0" not "-1" disables a threshold
2026-02-02 15:17:53 +00:00
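
The colon check from 7444bd0468 relies on the two ID shapes quoted in the message: `instance:node:vmid` for guests and `instance-node` for nodes. A rough sketch, with hypothetical function names:

```go
package main

import (
	"fmt"
	"strings"
)

// isGuestResourceID is a hypothetical helper. Guest IDs use the
// "instance:node:vmid" form, node IDs the "instance-node" form, so the
// presence of a colon is enough to tell them apart.
func isGuestResourceID(resourceID string) bool {
	return strings.Contains(resourceID, ":")
}

// defaultsFor names which threshold defaults a resource should be
// re-evaluated against. The string return value is for illustration only.
func defaultsFor(resourceID string) string {
	if isGuestResourceID(resourceID) {
		return "GuestDefaults"
	}
	return "NodeDefaults"
}

func main() {
	// In a single-node setup Instance == Node, so comparing those two
	// fields cannot distinguish a guest from a node; the ID shape can.
	fmt.Println(defaultsFor("pve:pve:101"))
	fmt.Println(defaultsFor("pve-pve"))
}
```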
rcourtman
70dbb495ad fix: address triage issues #1149, #1153, #1162, #1163
- #1163: Add node badges to storage resources in threshold tables
  (ResourceTable.tsx, ResourceCard.tsx)
- #1162: Fix PBS backup alerts showing datastore as node name
  (alerts.go - use "Unknown" for orphaned backups)
- #1153: Fix memory leaks in tracking maps
  - Add max 48 sample limit for pmgQuarantineHistory
  - Add max 10 entry limit for flappingHistory
  - Add cleanup for dockerUpdateFirstSeen
  - Add cleanupTrackingMaps() for auth, polling, and circuit breaker maps

Note: #1149 fix (chat sessions null check) is in AISettings.tsx
which has other pending changes - will be committed separately.
2026-01-26 22:21:10 +00:00
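
The size limits listed under #1153 amount to capping a history slice on every append. The generic helper below is one way to do that. It is a sketch with an assumed name (`appendCapped`), and only the limit of 48 is taken from the commit message.

```go
package main

import "fmt"

// appendCapped appends v and keeps only the most recent max entries.
// Hypothetical helper; the project may bound its maps differently.
func appendCapped[T any](history []T, v T, max int) []T {
	history = append(history, v)
	if len(history) > max {
		history = history[len(history)-max:]
	}
	return history
}

func main() {
	const maxSamples = 48 // limit quoted for pmgQuarantineHistory
	var samples []int
	for i := 0; i < 100; i++ {
		samples = appendCapped(samples, i, maxSamples)
	}
	fmt.Println(len(samples), samples[0], samples[len(samples)-1])
}
```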
rcourtman
d1ab2c913e refactor: simplify Alerts page and improve backend
Frontend:
- Major refactor of Alerts page removing patrol-specific UI
- Patrol functionality moved to dedicated AI Intelligence page
- Simplify ThresholdsTable component
- Update InvestigateAlertButton

Backend:
- Improve alerts handling and processing
- Add alerts test coverage
- Add orphaned backup alerting support
2026-01-24 22:44:15 +00:00
rcourtman
889719243b fix: reduce offline alert spam. Related to #1159, #1043 2026-01-24 13:25:25 +00:00
rcourtman
1657beeb92 feat: add alert enhancements
- Improve alert handling and processing
2026-01-22 22:32:03 +00:00
rcourtman
0248f0de5a fix(alerts): Prevent RAID check/scrub from triggering rebuild alerts. Related to #1125
DSM data scrubbing causes RAID arrays to enter a 'check' state with
RebuildPercent > 0, which was incorrectly triggering rebuild warnings.

Now distinguishes between:
- 'check' state: scheduled data scrubbing (no alert)
- 'recover'/'resync' state: actual rebuild (warning alert)
- 'clean' state with RebuildPercent: scrub in progress (no alert)
2026-01-20 16:13:58 +00:00
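
The state distinction in 0248f0de5a can be sketched as a switch on the array state rather than a test on `RebuildPercent`. The `Array` type and `isRebuilding` function are assumptions. How states other than the three named in the commit are handled is also an assumption (no alert here).

```go
package main

import "fmt"

// Array is a hypothetical, trimmed-down view of an md RAID array.
type Array struct {
	State          string
	RebuildPercent float64
}

// isRebuilding reports whether the array is in a real rebuild.
// RebuildPercent > 0 alone is not sufficient: a scrub ("check", or "clean"
// with progress) also reports a percentage.
func isRebuilding(a Array) bool {
	switch a.State {
	case "recover", "resync":
		return true
	default:
		return false
	}
}

func main() {
	arrays := []Array{
		{State: "check", RebuildPercent: 42},
		{State: "clean", RebuildPercent: 10},
		{State: "recover", RebuildPercent: 5},
	}
	for _, a := range arrays {
		fmt.Println(a.State, isRebuilding(a))
	}
}
```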
rcourtman
96b7370f7b test: improve coverage for API, AI, Alerts, and Frontend Utils
- Add comprehensive tests for internal/api/config_handlers.go (Phases 1-3)
- Improve test coverage for AI tools, chat service, and session management
- Enhance alert and notification tests (ResolvedAlert, Webhook)
- Add frontend unit tests for utils (searchHistory, tagColors, temperature, url)
- Add proximity client API tests
2026-01-20 15:52:39 +00:00
rcourtman
1c22688d9b fix: standardize API version format and guest key separators
Closes #1115 (discussion feedback)

Two API consistency issues reported by @FabienD74:

1. Version format mismatch in /api/version:
   - currentVersion: "5.0.16" (no prefix)
   - latestVersion: "v5.0.16" (with prefix)

   Fixed: LatestVersion now strips the "v" prefix to match CurrentVersion format.

2. Guest ID separator inconsistency:
   - Some code used colons: "instance:node:vmid"
   - BuildGuestKey used dashes: "instance-node-vmid"

   Fixed: BuildGuestKey now uses colon separator matching the canonical
   format used by makeGuestID in the monitoring package. The existing
   legacy migration in GetWithLegacyMigration handles old dash-format
   entries in guest_metadata.json.
2026-01-19 22:20:18 +00:00
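
Both fixes in 1c22688d9b are small string normalizations. The sketch below uses hypothetical lowercase helpers rather than the project's actual `BuildGuestKey` signature, which is not shown in the commit message.

```go
package main

import (
	"fmt"
	"strings"
)

// normalizeVersion strips a leading "v" so that "v5.0.16" and "5.0.16"
// compare equal. Hypothetical helper name.
func normalizeVersion(v string) string {
	return strings.TrimPrefix(v, "v")
}

// buildGuestKey joins the parts with colons, the canonical
// "instance:node:vmid" form. The parameter types are an assumption.
func buildGuestKey(instance, node string, vmid int) string {
	return fmt.Sprintf("%s:%s:%d", instance, node, vmid)
}

func main() {
	fmt.Println(normalizeVersion("v5.0.16") == normalizeVersion("5.0.16"))
	fmt.Println(buildGuestKey("pve1", "node1", 101))
}
```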
rcourtman
9cd53814a3 feat(alerts): add per-volume disk thresholds for host agents
Allow users to set custom disk usage thresholds per mounted filesystem
on host agents, rather than applying a single threshold to all volumes.

This addresses NAS/NVR use cases where some volumes (e.g., NVR storage)
intentionally run at 99% while others need strict monitoring.

Backend:
- Check for disk-specific overrides before using HostDefaults.Disk
- Override key format: host:<hostId>/disk:<mountpoint>
- Support both custom thresholds and disable per-disk

Frontend:
- Add 'hostDisk' resource type
- Add "Host Disks" collapsible section in Thresholds → Hosts tab
- Group disks by host for easier navigation

Closes #1103
2026-01-13 23:38:20 +00:00
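
The override lookup in 9cd53814a3 can be sketched as follows. Only the key format `host:<hostId>/disk:<mountpoint>` comes from the commit. The `Threshold` type, the function names, and the example values are assumptions.

```go
package main

import "fmt"

// Threshold is a hypothetical, simplified threshold setting.
type Threshold struct {
	Trigger  float64
	Disabled bool
}

// hostDiskOverrideKey builds the key in the documented
// host:<hostId>/disk:<mountpoint> format.
func hostDiskOverrideKey(hostID, mountpoint string) string {
	return "host:" + hostID + "/disk:" + mountpoint
}

// diskThreshold prefers a per-volume override and falls back to the
// host-wide disk default when none is configured.
func diskThreshold(overrides map[string]Threshold, hostDefault Threshold, hostID, mountpoint string) Threshold {
	if t, ok := overrides[hostDiskOverrideKey(hostID, mountpoint)]; ok {
		return t
	}
	return hostDefault
}

func main() {
	overrides := map[string]Threshold{
		hostDiskOverrideKey("nas1", "/mnt/nvr"): {Disabled: true},
	}
	hostDefault := Threshold{Trigger: 90}
	fmt.Println(diskThreshold(overrides, hostDefault, "nas1", "/mnt/nvr"))
	fmt.Println(diskThreshold(overrides, hostDefault, "nas1", "/"))
}
```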
rcourtman
1dda538265 fix(models): extend namespace disambiguation to SyncGuestBackupTimes (#1095)
The previous commit fixed namespace disambiguation for backup alerts,
but the Overview display uses SyncGuestBackupTimes to populate backup
timestamps on VMs/Containers. This commit extends the same namespace
matching logic to that function.

Also tightened the matching algorithm to use suffix matching instead
of substring matching, preventing false positives like "pve" matching
"pve-nat".
2026-01-12 15:11:59 +00:00
rcourtman
a88edd7c8f fix(alerts): disambiguate PBS backups using namespace for multi-PVE setups (#1095)
When multiple PVE instances have VMs with overlapping VMIDs, PBS backups
were being matched to the wrong VM because the code would just use the
first matching guest. Now when a PBS backup has a namespace, it attempts
to match that namespace to the PVE instance name to find the correct VM.

This helps users who have separate PBS instances backing up different
PVE clusters with namespaces like "pve1", "nat", etc.
2026-01-12 14:55:17 +00:00
rcourtman
07b4765b8d fix: respect quiet hours for recovery notifications (#1068)
Recovery notifications were bypassing the quiet hours check, causing
users to receive recovery alerts during their configured quiet hours
window even though the original "down" alerts were suppressed.

- Add ShouldSuppressResolvedNotification() to alert manager
- Check quiet hours before sending recovery notifications in monitor
- Recovery notifications now follow same suppression rules as alerts
2026-01-09 21:47:36 +00:00
rcourtman
48fdff3efb fix: Preserve ackState for old acknowledged alerts during restore
When LoadActiveAlerts skipped acknowledged alerts older than 1 hour,
it was also not populating ackState. This meant that when the same
alert (e.g., backup-age) was recreated on the next poll cycle,
preserveAlertState couldn't find any acknowledgement record and
the alert would retrigger notifications.

Now ackState is populated even for skipped old acknowledged alerts,
so if they reappear, the acknowledgement will be restored.

Related to #1043
2026-01-06 11:00:36 +00:00
rcourtman
2cc9214336 feat: Make container update alerts a free feature
Update alerts for Docker containers are now available to all users,
not just Pro license holders. The feature alerts when container image
updates have been pending for longer than the configured delay
(default: 24 hours).

- Remove Pro license gating from update alerts
- Add FeatureUpdateAlerts to free tier features
- Remove obsolete license gating tests

Related to #1031
2026-01-04 23:59:29 +00:00
rcourtman
ed78509f92 Fix flaky tests and improve coverage across alerts, api, and config packages
- Fix deadlock and race conditions in internal/alerts
- Add comprehensive error path tests for internal/config
- Fix 401 handling in internal/api
- Fix Docker Swarm task filtering test logic
2026-01-03 18:36:17 +00:00
rcourtman
9e339957c6 fix: Update runtime config when toggling Docker update actions setting
The DisableDockerUpdateActions setting was being saved to disk but not
updated in h.config, causing the UI toggle to appear to revert on page
refresh since the API returned the stale runtime value.

Related to #1023
2026-01-03 11:14:17 +00:00
rcourtman
fbbefa4546 Improve tests for internal/alerts package
- Fix TestSaveHistoryWithRetry_WriteError to be robust on root
- Add TestOnAlert to history_test.go
- Add pmg_anomaly_test.go for PMG anomaly detection coverage
- Add cleanup_test.go for tracking map cleanup coverage
- Extend filter_evaluation_test.go to cover all guest threshold logic
2026-01-02 23:47:16 +00:00
rcourtman
9bdbf2616c chore(tests): remove unused test code and redundant test cases
- Remove unused findAlertByID helper and its min dependency from update_alerts_test.go
- Remove redundant negative zero test case from utility_test.go (-0.0 == 0.0 in Go)
2026-01-02 16:11:09 +00:00
rcourtman
3fdf753a5b Enhance devcontainer and CI workflows
- Add persistent volume mounts for Go/npm caches (faster rebuilds)
- Add shell config with helpful aliases and custom prompt
- Add comprehensive devcontainer documentation
- Add pre-commit hooks for Go formatting and linting
- Use go-version-file in CI workflows instead of hardcoded versions
- Simplify docker compose commands with --wait flag
- Add gitignore entries for devcontainer auth files

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-01 22:29:15 +00:00
rcourtman
9abe9c47a2 feat(alerts): add disk temperature alerts for host agents
- Add DiskTemperature threshold to ThresholdConfig (default: 55°C trigger, 50°C clear)
- Process host SMART sensor data in CheckHost to generate disk_temperature alerts
- Add 'Disk Temp °C' column to Host Agents thresholds table in UI
- Make temperature tooltip interactive and scrollable to fix overflow issues
- Update AlertThresholds type to include diskTemperature field

Closes: #941
2026-01-01 16:31:34 +00:00
rcourtman
3796408f04 fix: Preserve alert acknowledgement for long-standing alerts during backup
When a powered-off VM is backed up by Proxmox, the alert briefly disappears
as the VM status changes. The previous fix (3830e701) preserved ackState when
alerts were removed, but the cleanup TTL was measured from the acknowledgement
time. For alerts acknowledged > 1 hour ago (common for intentionally powered-off
VMs), the ackState was immediately considered stale and deleted when cleanup ran.

The fix adds an inactiveAt timestamp to track when an alert was removed, and
uses this time for the cleanup TTL instead of the acknowledgement time. This
ensures acknowledgement state is preserved for at least 1 hour after the alert
disappears, regardless of when it was originally acknowledged.

Related to #980
2025-12-31 09:49:11 +00:00
rcourtman
2b68bfe6ee fix: Update tests for RAID alerting md0/md1 skip and AI gating message change
- RAID tests now use /dev/md2 since md0/md1 are skipped for Synology compatibility
- AI handler tests now expect 'AI is not enabled' message after AI gating change
2025-12-30 23:39:55 +00:00
rcourtman
ed6c3d9c93 fix: Prevent acknowledged alerts from retriggering notifications. Related to #975
dispatchAlert() now checks if an alert is already acknowledged before sending
notifications. Previously, acknowledged alerts (especially backup-age alerts)
would continue to dispatch notifications every poll cycle because the
acknowledgement check was missing from the dispatch path.

The fix adds an early return in dispatchAlert() when alert.Acknowledged is true,
matching the existing checks for flapping, activation state, and quiet hours.
2025-12-30 22:20:47 +00:00
rcourtman
f855625f65 feat: Add full-width mode toggle for wider views on large monitors. Related to #974 2025-12-30 12:20:44 +00:00
rcourtman
065a59316f fix(alerts): respect per-guest backup and snapshot overrides (fixes #961) 2025-12-30 00:28:05 +00:00