Commit graph

100 commits

Author SHA1 Message Date
rcourtman
fba1fadccd Make alert node display name resolution instance-aware (#1218) 2026-03-25 12:44:22 +00:00
rcourtman
40249947ed Fix template backup orphan detection race (#1352) 2026-03-25 10:36:33 +00:00
rcourtman
b9c6f504d8 Fix shared storage override matching (#1341) 2026-03-25 10:25:01 +00:00
rcourtman
b5ee2c1f98 Fix guest override migration for canonical IDs (#1334) 2026-03-25 10:13:10 +00:00
rcourtman
d560de15ad Increase alerts test cleanup sleep to fix flaky test under load
The 10ms goroutine drain pause was insufficient under full parallel
test suite load, causing intermittent failures in
TestPulseMonitorOnlySkipsDispatchButRetainsAlert.
2026-03-08 22:16:24 +00:00
rcourtman
62225e0c12 fix(alerts): scope orphaned backup detection per PVE instance to prevent false positives (#1286)
The previous hasLiveInventory guard was a single boolean — if any PVE
instance had at least one live guest, orphan detection ran for all
instances. In multi-instance clusters with staggered polling, backups
from instances whose VMs hadn't been polled yet appeared orphaned,
producing false positive alerts with 0m duration.

Replace the global boolean with a per-instance map. PVE storage backups
now only run orphan detection when their specific instance has live
inventory. PBS/PMG backups (which span instances) retain the "any
instance has live guests" check.
2026-02-27 13:32:15 +00:00
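
The move from a single hasLiveInventory boolean to a per-instance map might look roughly like the sketch below. The `Guest` and `Backup` types and both function names are hypothetical stand-ins chosen for illustration, not the project's actual code; an empty `Instance` on a backup is assumed here to mark a PBS/PMG backup that spans instances.

```go
package main

import "fmt"

// Guest and Backup are hypothetical, simplified stand-ins for the real models.
type Guest struct {
	Instance   string
	ResourceID string
}

type Backup struct {
	Instance string // assumed empty for PBS/PMG backups, which span instances
	VMID     int
}

// liveInventoryByInstance records which instances have at least one guest
// with a non-empty ResourceID.
func liveInventoryByInstance(guests []Guest) map[string]bool {
	live := make(map[string]bool)
	for _, g := range guests {
		if g.ResourceID != "" {
			live[g.Instance] = true
		}
	}
	return live
}

// shouldCheckOrphan gates orphan detection. Instance-bound backups need live
// inventory for their own instance; instance-spanning backups only need any
// instance to be live.
func shouldCheckOrphan(b Backup, live map[string]bool) bool {
	if b.Instance != "" {
		return live[b.Instance]
	}
	return len(live) > 0
}

func main() {
	// Only pve1 has been polled so far; pve2 has no guests in inventory yet.
	live := liveInventoryByInstance([]Guest{
		{Instance: "pve1", ResourceID: "pve1:node1:100"},
	})
	fmt.Println(shouldCheckOrphan(Backup{Instance: "pve1", VMID: 100}, live)) // true
	fmt.Println(shouldCheckOrphan(Backup{Instance: "pve2", VMID: 200}, live)) // false
	fmt.Println(shouldCheckOrphan(Backup{VMID: 300}, live))                   // true
}
```

With the old global boolean, the pve2 backup in this example would have been checked (and flagged) as soon as pve1 reported a guest.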
rcourtman
fa519cd8ce fix(alerts): prevent false positive orphaned backup alerts during startup race (#1286)
Backup polling goroutines can snapshot state before VM/container polling
populates the guest inventory. When guestsByVMID is empty, every backup
appears orphaned. Gate orphan detection on hasLiveInventory (at least one
guest with non-empty ResourceID) and preserve existing orphan alerts when
inventory becomes unavailable.
2026-02-26 20:49:10 +00:00
rcourtman
4dc09a1240 feat(alerts): add dedicated backup-orphaned alert type (#1286)
Fire a warning alert immediately when a backup's guest no longer exists
in inventory, without requiring age thresholds to be breached. The
existing alertOrphaned toggle and ignoreVMIDs UI control this feature
with no frontend changes needed.
2026-02-24 09:07:43 +00:00
rcourtman
d1e61d8a8a fix: ship alerting hotfixes and prepare 5.1.4 2026-02-07 22:05:55 +00:00
rcourtman
6909264a02 fix(alerts): reduce swarm alert noise and preserve notification state (#1096) 2026-02-07 14:18:39 +00:00
rcourtman
dcfa8cf0ba fix: prevent false PBS backup indicators when VMIDs collide across PVE instances (#1177)
When namespace matching fails, the VMID-only fallback now checks whether
the VMID appears on multiple PVE instances. If ambiguous, the fallback
is skipped — preventing backups from being falsely attributed to the
wrong guest. Unique VMIDs still fall back as before.
2026-02-04 10:11:35 +00:00
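
A minimal sketch of the ambiguity check described above, assuming a flat guest list. The `Guest` type and `fallbackByVMID` are hypothetical names for illustration; the real matching code may be structured differently.

```go
package main

import "fmt"

// Guest is a hypothetical, simplified stand-in for the real guest model.
type Guest struct {
	Instance string
	VMID     int
	Name     string
}

// fallbackByVMID returns the guest with the given VMID only when that VMID
// appears on exactly one instance. Absent or ambiguous VMIDs yield false.
func fallbackByVMID(guests []Guest, vmid int) (Guest, bool) {
	var match Guest
	instances := make(map[string]struct{})
	for _, g := range guests {
		if g.VMID == vmid {
			match = g
			instances[g.Instance] = struct{}{}
		}
	}
	if len(instances) != 1 {
		return Guest{}, false
	}
	return match, true
}

func main() {
	guests := []Guest{
		{Instance: "pve1", VMID: 100, Name: "web"},
		{Instance: "pve2", VMID: 100, Name: "db"},
		{Instance: "pve1", VMID: 101, Name: "cache"},
	}
	_, ok := fallbackByVMID(guests, 100)
	fmt.Println(ok) // false: VMID 100 exists on both pve1 and pve2
	g, ok := fallbackByVMID(guests, 101)
	fmt.Println(g.Name, ok) // cache true
}
```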
rcourtman
454448b796 fix: deadlock in offline alert recovery notifications
The quiet hours fix (07b4765b) added ShouldSuppressResolvedNotification()
to handleAlertResolved, which acquires m.mu.RLock(). Five clear*OfflineAlert
functions call the resolved callback synchronously while holding m.mu.Lock().
Go's RWMutex is not reentrant, so this deadlocks permanently when any
node/PBS/PMG/storage/guest comes back online after being offline.

The deadlock prevents recovery notifications from being sent and freezes
the monitoring goroutine, cascading to block all subsequent polling.

Fix: change the five affected functions to fire the resolved callback
asynchronously (matching the pattern already used by clearAlertNoLock),
so it runs after m.mu is released.

Related to #1068
2026-02-02 18:17:27 +00:00
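
The locking problem and the asynchronous fix can be illustrated with a small, self-contained sketch. The `Manager` below is a hypothetical reduction, not the real alert manager: `shouldSuppress` stands in for ShouldSuppressResolvedNotification(), `clearOfflineAlert` for the clear*OfflineAlert functions, and the `wg` field exists only so the example can wait for the callback.

```go
package main

import (
	"fmt"
	"sync"
)

// Manager is a hypothetical reduction of the alert manager.
type Manager struct {
	mu         sync.RWMutex
	quietHours bool
	active     map[string]bool
	onResolved func(id string)
	wg         sync.WaitGroup // illustration only: lets callers wait for callbacks
}

// shouldSuppress takes the read lock, like the quiet-hours check.
func (m *Manager) shouldSuppress() bool {
	m.mu.RLock()
	defer m.mu.RUnlock()
	return m.quietHours
}

// clearOfflineAlert holds the write lock while mutating state. Calling the
// callback directly here would deadlock, because the callback needs RLock
// and sync.RWMutex is not reentrant. Running it in a goroutine lets it
// proceed once this function returns and releases the lock.
func (m *Manager) clearOfflineAlert(id string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	if !m.active[id] {
		return
	}
	delete(m.active, id)
	if cb := m.onResolved; cb != nil {
		m.wg.Add(1)
		go func() {
			defer m.wg.Done()
			cb(id)
		}()
	}
}

func main() {
	m := &Manager{active: map[string]bool{"node-offline-pve1": true}}
	m.onResolved = func(id string) {
		if !m.shouldSuppress() {
			fmt.Println("resolved:", id)
		}
	}
	m.clearOfflineAlert("node-offline-pve1")
	m.wg.Wait() // prints "resolved: node-offline-pve1"
}
```

Replacing the `go func() { ... }()` with a direct `cb(id)` call reproduces the hang: the goroutine holding `m.mu.Lock()` blocks forever on its own `RLock()`.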
rcourtman
7444bd0468 fix(alerts): guest alerts misclassified as node alerts when threshold disabled (#1145)
In single-node setups, guest alerts had Instance == Node, causing
reevaluateActiveAlertsLocked to evaluate them against NodeDefaults
instead of GuestDefaults. Setting guest memory threshold to 0 (disabled)
wouldn't clear existing guest alerts because they were being kept alive
by the still-enabled node memory threshold.

- Add resourceID colon check to distinguish guest IDs (instance:node:vmid)
  from node IDs (instance-node) in reevaluateActiveAlertsLocked
- Clear stale alerts in checkMetric when threshold is nil or disabled
- Skip hysteresis validation for disabled thresholds (Trigger <= 0)
- Fix frontend tooltip: "0" not "-1" disables a threshold
2026-02-02 15:17:53 +00:00
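
The two rules from this commit (colon check for guest IDs, Trigger <= 0 meaning disabled) can be sketched as below. `isGuestResourceID`, `thresholdEnabled`, and the `Threshold` struct are hypothetical helpers for illustration; only the ID formats and the "0 disables" rule come from the commit message.

```go
package main

import (
	"fmt"
	"strings"
)

// Threshold is a hypothetical stand-in for a hysteresis threshold.
type Threshold struct {
	Trigger float64
	Clear   float64
}

// isGuestResourceID reports whether id looks like a guest ID
// ("instance:node:vmid") rather than a node ID ("instance-node").
func isGuestResourceID(id string) bool {
	return strings.Contains(id, ":")
}

// thresholdEnabled treats a nil threshold or Trigger <= 0 as disabled.
func thresholdEnabled(t *Threshold) bool {
	return t != nil && t.Trigger > 0
}

func main() {
	fmt.Println(isGuestResourceID("pve1:node1:100"))      // true
	fmt.Println(isGuestResourceID("pve1-node1"))          // false
	fmt.Println(thresholdEnabled(&Threshold{Trigger: 0})) // false
	fmt.Println(thresholdEnabled(nil))                    // false
}
```

Checking the ID shape avoids relying on Instance == Node, which is what made single-node setups ambiguous.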
rcourtman
d1ab2c913e refactor: simplify Alerts page and improve backend
Frontend:
- Major refactor of Alerts page removing patrol-specific UI
- Patrol functionality moved to dedicated AI Intelligence page
- Simplify ThresholdsTable component
- Update InvestigateAlertButton

Backend:
- Improve alerts handling and processing
- Add alerts test coverage
- Add orphaned backup alerting support
2026-01-24 22:44:15 +00:00
rcourtman
889719243b fix: reduce offline alert spam. Related to #1159, #1043 2026-01-24 13:25:25 +00:00
rcourtman
96b7370f7b test: improve coverage for API, AI, Alerts, and Frontend Utils
- Add comprehensive tests for internal/api/config_handlers.go (Phases 1-3)
- Improve test coverage for AI tools, chat service, and session management
- Enhance alert and notification tests (ResolvedAlert, Webhook)
- Add frontend unit tests for utils (searchHistory, tagColors, temperature, url)
- Add proximity client API tests
2026-01-20 15:52:39 +00:00
rcourtman
1c22688d9b fix: standardize API version format and guest key separators
Closes #1115 (discussion feedback)

Two API consistency issues reported by @FabienD74:

1. Version format mismatch in /api/version:
   - currentVersion: "5.0.16" (no prefix)
   - latestVersion: "v5.0.16" (with prefix)

   Fixed: LatestVersion now strips the "v" prefix to match CurrentVersion format.

2. Guest ID separator inconsistency:
   - Some code used colons: "instance:node:vmid"
   - BuildGuestKey used dashes: "instance-node-vmid"

   Fixed: BuildGuestKey now uses colon separator matching the canonical
   format used by makeGuestID in the monitoring package. The existing
   legacy migration in GetWithLegacyMigration handles old dash-format
   entries in guest_metadata.json.
2026-01-19 22:20:18 +00:00
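
Both fixes are small string-normalization steps. The sketch below shows one way they could be written; `normalizeVersion` and the lowercase `buildGuestKey` are hypothetical stand-ins, and the parameter list of the real BuildGuestKey is an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// normalizeVersion strips a leading "v" so that "v5.0.16" and "5.0.16"
// compare equal.
func normalizeVersion(v string) string {
	return strings.TrimPrefix(v, "v")
}

// buildGuestKey joins the parts with colons, matching the canonical
// "instance:node:vmid" format. The signature is assumed for illustration.
func buildGuestKey(instance, node string, vmid int) string {
	return fmt.Sprintf("%s:%s:%d", instance, node, vmid)
}

func main() {
	fmt.Println(normalizeVersion("v5.0.16") == normalizeVersion("5.0.16")) // true
	fmt.Println(buildGuestKey("instance", "node", 100))                    // instance:node:100
}
```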
rcourtman
1dda538265 fix(models): extend namespace disambiguation to SyncGuestBackupTimes (#1095)
The previous commit fixed namespace disambiguation for backup alerts,
but the Overview display uses SyncGuestBackupTimes to populate backup
timestamps on VMs/Containers. This commit extends the same namespace
matching logic to that function.

Also tightened the matching algorithm to use suffix matching instead
of substring matching, preventing false positives like "pve" matching
"pve-nat".
2026-01-12 15:11:59 +00:00
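
The difference between the two matching strategies can be seen on the example from the commit message. Which string is searched and which is the pattern is an assumption here, as are the helper names and the "site-a/pve" value; the sketch only contrasts `strings.Contains` with `strings.HasSuffix`.

```go
package main

import (
	"fmt"
	"strings"
)

// substringMatch is the looser, earlier behavior.
func substringMatch(candidate, name string) bool {
	return strings.Contains(candidate, name)
}

// suffixMatch only accepts candidates that end with name.
func suffixMatch(candidate, name string) bool {
	return strings.HasSuffix(candidate, name)
}

func main() {
	fmt.Println(substringMatch("pve-nat", "pve")) // true: the false positive
	fmt.Println(suffixMatch("pve-nat", "pve"))    // false
	fmt.Println(suffixMatch("site-a/pve", "pve")) // true (hypothetical nested value)
}
```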
rcourtman
a88edd7c8f fix(alerts): disambiguate PBS backups using namespace for multi-PVE setups (#1095)
When multiple PVE instances have VMs with overlapping VMIDs, PBS backups
were being matched to the wrong VM because the code would just use the
first matching guest. Now when a PBS backup has a namespace, it attempts
to match that namespace to the PVE instance name to find the correct VM.

This helps users who have separate PBS instances backing up different
PVE clusters with namespaces like "pve1", "nat", etc.
2026-01-12 14:55:17 +00:00
rcourtman
48fdff3efb fix: Preserve ackState for old acknowledged alerts during restore
When LoadActiveAlerts skipped acknowledged alerts older than 1 hour,
it was also not populating ackState. This meant that when the same
alert (e.g., backup-age) was recreated on the next poll cycle,
preserveAlertState couldn't find any acknowledgement record and
the alert would retrigger notifications.

Now ackState is populated even for skipped old acknowledged alerts,
so if they reappear, the acknowledgement will be restored.

Related to #1043
2026-01-06 11:00:36 +00:00
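
A rough sketch of the restore behavior after this fix: old acknowledged alerts are still skipped, but their acknowledgement is recorded first. All names below are hypothetical, and measuring the one-hour cutoff from the acknowledgement time is an assumption; the commit message does not say which timestamp is used.

```go
package main

import (
	"fmt"
	"time"
)

// Alert and ackRecord are hypothetical, simplified stand-ins.
type Alert struct {
	ID           string
	Acknowledged bool
	AckTime      time.Time
}

type ackRecord struct {
	Acknowledged bool
	AckTime      time.Time
}

// restoreAlerts rebuilds the active set from saved alerts. Acknowledged
// alerts older than one hour are not restored as active, but their ack
// state is kept so a recreated alert can pick it up again.
func restoreAlerts(saved []Alert, now time.Time) (map[string]Alert, map[string]ackRecord) {
	active := make(map[string]Alert)
	ackState := make(map[string]ackRecord)
	for _, a := range saved {
		if a.Acknowledged {
			// Record the acknowledgement before deciding whether to skip.
			ackState[a.ID] = ackRecord{Acknowledged: true, AckTime: a.AckTime}
			if now.Sub(a.AckTime) > time.Hour {
				continue
			}
		}
		active[a.ID] = a
	}
	return active, ackState
}

func main() {
	now := time.Date(2026, 1, 6, 11, 0, 0, 0, time.UTC)
	saved := []Alert{
		{ID: "backup-age-100", Acknowledged: true, AckTime: now.Add(-2 * time.Hour)},
		{ID: "cpu-101"},
	}
	active, ack := restoreAlerts(saved, now)
	fmt.Println(len(active), ack["backup-age-100"].Acknowledged) // 1 true
}
```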
rcourtman
ed78509f92 Fix flaky tests and improve coverage across alerts, api, and config packages
- Fix deadlock and race conditions in internal/alerts
- Add comprehensive error path tests for internal/config
- Fix 401 handling in internal/api
- Fix Docker Swarm task filtering test logic
2026-01-03 18:36:17 +00:00
rcourtman
3fdf753a5b Enhance devcontainer and CI workflows
- Add persistent volume mounts for Go/npm caches (faster rebuilds)
- Add shell config with helpful aliases and custom prompt
- Add comprehensive devcontainer documentation
- Add pre-commit hooks for Go formatting and linting
- Use go-version-file in CI workflows instead of hardcoded versions
- Simplify docker compose commands with --wait flag
- Add gitignore entries for devcontainer auth files

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-01 22:29:15 +00:00
rcourtman
2b68bfe6ee fix: Update tests for RAID alerting md0/md1 skip and AI gating message change
- RAID tests now use /dev/md2 since md0/md1 are skipped for Synology compatibility
- AI handler tests now expect 'AI is not enabled' message after AI gating change
2025-12-30 23:39:55 +00:00
rcourtman
065a59316f fix(alerts): respect per-guest backup and snapshot overrides (fixes #961) 2025-12-30 00:28:05 +00:00
rcourtman
ae522c9a2b fix: Allow all threshold types (Storage, Temperature, Host Agent) to be set to 0 to disable alerting
- Fixed normalizeStorageDefaults to allow Trigger=0
- Fixed normalizeNodeDefaults (Temperature) to allow Trigger=0
- Added comprehensive tests for all threshold normalization patterns
- Updated existing test that expected old behavior

Related to #864
2025-12-20 20:42:23 +00:00
rcourtman
781442cdd0 test: Add comprehensive tests for Host Agent threshold normalization with Trigger=0. Related to #864 2025-12-20 20:32:59 +00:00
rcourtman
c5ab0724f1 fix: Race condition in TestLoadActiveAlerts causing flaky test
ClearActiveAlerts triggers an async save to disk, which can race with
LoadActiveAlerts reading the file. The test now clears the in-memory
map directly without triggering the async save.
2025-12-04 20:41:00 +00:00
rcourtman
d0d989289a Refactor alert system: fix race conditions, memory leaks, and improve code quality
- Rename checkFlapping to checkFlappingLocked to clarify lock contract
- Replace goto statements with structured control flow
- Wire up unused recordAlertFired/recordAlertResolved metric hooks
- Add trackingMapCleanup goroutine to prevent memory leaks from stale entries
- Tighten alert ID validation to alphanumeric + safe punctuation
- Fix history save error handling to properly manage backup lifecycle
- Add auto-migration for deprecated GroupingWindow field
- Refactor 300+ line UpdateConfig into focused helper functions
- Unify duplicate evaluateVMCondition/evaluateContainerCondition
- Add constants for magic numbers (thresholds, timing, flapping)
- Update tests to match new backup behavior
2025-12-02 23:31:36 +00:00
rcourtman
4f824ab148 style: Apply gofmt to 37 files
Standardize code formatting across test files and monitor.go.
No functional changes.
2025-12-02 17:21:48 +00:00
rcourtman
753125d189 test: Add preserveAlertState, checkPMGQuarantineBacklog, LoadActiveAlerts tests
Add comprehensive tests for three low-coverage functions:
- preserveAlertState: nil handling, state preservation from existing alerts,
  ackState fallback, new alert handling
- checkPMGQuarantineBacklog: nil quarantine handling, warning/critical
  thresholds, growth rate alerts, alert updates, virus quarantine
- LoadActiveAlerts: missing file, valid file loading, old alert filtering,
  old acknowledged alert filtering, ack state restoration, invalid JSON,
  duplicate alert handling

Coverage improvements:
- preserveAlertState: 63.6% → 100%
- checkPMGQuarantineBacklog: 12.9% → 100%
- checkQuarantineMetric: 0% → 93.1%
- LoadActiveAlerts: 26.2% → 80.0%
- Alerts package: 83.5% → 86.6%
2025-12-02 12:22:14 +00:00
rcourtman
5ff7e20539 test: Add dispatchAlert tests (55.6%→77.8%)
Add TestDispatchAlert with 8 test cases covering:
- Returns false when onAlert callback is nil
- Returns false when alert is nil
- Returns false when activation state is pending
- Returns false when activation state is snoozed
- Returns false for monitor-only alerts
- Dispatches synchronously when async is false
- Dispatches asynchronously when async is true
- Clones alert before dispatch

Alerts package coverage: 83.4%→83.5%
2025-12-02 11:47:24 +00:00
rcourtman
3d957403ef test: Add CheckStorage tests (52.4%→92.9%)
Add comprehensive TestCheckStorageComprehensive with 11 test cases covering:
- Returns early when alerts disabled
- DisableAllStorage clears existing usage and offline alerts
- Override with Disabled clears alerts
- Usage threshold checking
- Override threshold applied correctly
- Skips usage check when offline/unavailable/zero usage
- Offline status creates alert after confirmations
- Unavailable status creates alert
- Clears offline alert when back online

Alerts package coverage: 82.4%→83.4%
2025-12-02 11:43:56 +00:00
rcourtman
905d78b6a6 test: Add CheckPMG tests (0%→100%)
Add comprehensive TestCheckPMGComprehensive with 9 test cases covering:
- Returns early when alerts disabled
- DisableAllPMG clears all PMG alert types (queue-total, queue-deferred,
  queue-hold, oldest-message, offline)
- Override with Disabled clears alerts
- DisableAllPMGOffline clears offline alert
- Offline status creates alert after confirmations
- Connection health error triggers offline alert
- Connection health unhealthy triggers offline alert
- Clears offline alert when back online
- Skips metrics when PMG is offline

Alerts package coverage: 81.5%→82.4%
2025-12-02 11:40:53 +00:00
rcourtman
e1cdd6ebdb test: Add CheckPBS tests (0%→98.3%)
Add comprehensive TestCheckPBSComprehensive with 12 test cases covering:
- Returns early when alerts disabled
- DisableAllPBS clears existing CPU/memory/offline alerts
- Override with Disabled clears alerts
- DisableAllPBSOffline clears offline alert
- CPU threshold checking when online
- Memory threshold checking when online
- Skips metrics when PBS is offline
- Override thresholds applied correctly
- Offline status creates alert after confirmations
- Connection health error triggers offline alert
- Connection health unhealthy triggers offline alert
- Clears offline alert when back online

Alerts package coverage: 80.0%→81.5%
2025-12-02 11:38:36 +00:00
rcourtman
bc7fa17b54 test: Add CheckHost tests (49.6%→98.3%)
Add comprehensive TestCheckHostComprehensive with 17 test cases covering:
- Empty host ID early return
- Alerts disabled early return
- DisableAllHosts clears existing alerts
- Override with Disabled clears alerts
- CPU/Memory/Disk threshold nil clears alerts
- RAID degraded/rebuilding/healthy states
- RAID with failed devices triggers critical alert
- RAID resync triggers rebuilding alert
- Existing RAID alert not duplicated (preserves start time)
- Override thresholds applied correctly
- Multiple disks handling
- Offline alert cleared when host comes online
- Tags included in metadata

Alerts package coverage: 78.6%→80.0%
2025-12-02 11:34:20 +00:00
rcourtman
ceb54ba349 test: Add CheckGuest tests (41.4%→97.4%)
Cover all CheckGuest branches:
- Early return when alerts disabled
- Early return when all guests disabled
- VM and Container type handling
- Unsupported guest type returns early
- pulse-no-alerts tag suppresses alerts
- Stopped guest triggers powered-off check
- DisableAllGuestsOffline clears tracking
- Paused guest clears powered-off alert
- Non-running guest clears metric alerts
- Running guest clears powered-off alert
- Disabled thresholds clear existing alerts
- CPU, memory, disk metric checks
- Individual disk checks (mountpoint, device, index fallback)
- Skips disk with zero total or negative usage
- I/O metrics (diskRead, diskWrite, networkIn, networkOut)
- pulse-relaxed tag applies relaxed thresholds

Alerts package coverage: 76.0%→78.6%
2025-12-02 11:27:32 +00:00
rcourtman
dda3d866ec test: Add CheckNode tests (31%→100%)
Cover all CheckNode branches:
- Early return when alerts disabled
- DisableAllNodes clears existing alerts
- DisableNodesOffline clears tracking
- Offline/connection error/failed triggers offline check
- Online node clears offline alert
- Online node triggers metric checks
- Offline node skips metric checks
- Override thresholds applied correctly
- Temperature with package temp and max fallback
- Temperature skipped when unavailable/nil/no threshold
- Memory and disk metric checks

Alerts package coverage: 75.2%→76.0%
2025-12-02 11:24:04 +00:00
rcourtman
42890b70f8 test: Add suppressGuestAlerts and guestHasMonitorOnlyAlerts tests
Coverage improvements:
- suppressGuestAlerts: 37% -> 96.3%
- guestHasMonitorOnlyAlerts: 40% -> 90%

Tests cover:
- No alerts returns false
- Exact ResourceID match clears
- Prefix match (e.g., "vm100/disk1") clears
- All auxiliary maps cleared (pending, recent, suppressed, rateLimit)
- Multiple alerts cleared
- Monitor-only detection via metadata (bool and string types)
2025-12-02 11:14:30 +00:00
rcourtman
914b1ced2a test: Add applyThresholdOverride tests
Coverage for applyThresholdOverride: 50% -> 93.2%

Tests cover:
- Empty override returns base unchanged
- Disabled/DisableConnectivity flag overrides
- Modern CPU threshold override
- Legacy CPU threshold conversion
- Modern takes precedence over legacy
- Multiple metrics override
- Note override, clearing, trimming
- All legacy metric types (Memory, Disk, etc.)
- Temperature and Usage override
- ensureHysteresisThreshold Clear value filling
2025-12-02 11:08:58 +00:00
rcourtman
3379e90073 test: Add ClearActiveAlerts test with existing alerts
Coverage for ClearActiveAlerts: 16% -> 92%

Tests verify all internal maps are properly cleared when alerts exist:
- activeAlerts, pendingAlerts, recentAlerts
- suppressedUntil, alertRateLimit
- nodeOfflineCount, offlineConfirmations
- dockerOfflineCount, dockerStateConfirm
- ackState, recentlyResolved
2025-12-02 11:01:54 +00:00
rcourtman
e644c38071 test: Add CheckDiskHealth normal path tests
Coverage for CheckDiskHealth: 51% -> 98%

Tests cover:
- Healthy disk (PASSED/OK) creates no alert
- Failed non-Samsung disk creates critical alert
- Alert cleared when disk health recovers
- Low wearout (<10%) creates warning alert
- Wearout alert updates on subsequent checks
- Wearout alert cleared when wearout >= 10%
- Empty/UNKNOWN health creates no alert
2025-12-02 10:59:48 +00:00
rcourtman
eac8ed48c5 test: Add Docker container restart loop alert tests
Coverage for checkDockerContainerRestartLoop: 53.5% -> 95.3%

Tests cover:
- First check initializes tracking without alert
- Stable restart count doesn't alert
- Restarts under threshold (<=3) don't alert
- Restart loop threshold (>3) triggers critical alert
- Recovery after window expires clears alert
- Incremental restart accumulation
- Alert StartTime preservation on updates
2025-12-02 10:54:51 +00:00
rcourtman
e1105d68ca test: Add Docker container health and OOM kill alert tests
Coverage improvements:
- checkDockerContainerHealth: 21.1% -> 94.7%
- checkDockerContainerOOMKill: 19.2% -> 96.2%

Tests cover:
- Health states (healthy, empty, none, starting, unhealthy, degraded)
- Health alert recovery when container becomes healthy
- OOM kill detection (exit code 137)
- OOM alert deduplication (repeated 137 doesn't re-alert)
- OOM alert clearing when container recovers or exits with different code
2025-12-02 10:49:42 +00:00
rcourtman
4538b5348d fix: Isolate alerts tests with temp directories to prevent flaky failures
Tests using NewManager() were sharing /etc/pulse/alerts, causing race
conditions when running in parallel. Added newTestManager(t) helper that
creates isolated temp directories for each test.
2025-12-02 10:21:03 +00:00
rcourtman
74fe6b9223 test: Add tests for checkPMGNodeQueues
- 12 test cases covering per-node queue monitoring
- Tests total/deferred/hold queue thresholds at warning/critical levels
- Tests oldest message age per node
- Tests outlier detection, threshold clearing, existing alert updates
- Tests empty nodes and nil QueueStatus handling

Coverage: checkPMGNodeQueues 0%→85.1%
Package coverage: 70.3%→72.3%
2025-12-01 17:16:32 +00:00
rcourtman
f38736eefe test: Add tests for checkZFSPoolHealth
- 14 test cases covering pool state alerts (ONLINE, DEGRADED, FAULTED,
  UNAVAIL), pool error tracking, device-level alerts, state transitions
- Tests pool state clearing on recovery, error count updates,
  SPARE device handling, FAULTED device critical alerts

Coverage: checkZFSPoolHealth 0%→100%
Package coverage: 68.6%→70.3%
2025-12-01 17:14:12 +00:00
rcourtman
b8facf3e8a test: Add tests for checkEscalations and CleanupAlertsForNodes
- checkEscalations: 6 test cases covering disabled/enabled states,
  acknowledged alerts, threshold timing, multi-level escalation
- CleanupAlertsForNodes: 7 test cases covering node removal,
  Docker/PBS alert preservation, empty node handling, nil safety

Coverage: checkEscalations 0%→100%, CleanupAlertsForNodes 0%→92%
Package coverage: 67.6%→68.6%
2025-12-01 17:11:47 +00:00
rcourtman
b1265451a8 test: Add tests for Cleanup and convertLegacyThreshold
- Cleanup: 16 test cases covering auto-acknowledge, TTL cleanup,
  rate limit entries, suppressions, pending alerts, flapping history,
  Docker restart tracking, PMG anomaly trackers, quarantine history
- convertLegacyThreshold: 5 test cases covering nil/zero/negative input,
  default margin, custom margin

Coverage: Cleanup 0%→97.6%, convertLegacyThreshold 0%→83.3%
Package coverage: 65.4%→67.6%
2025-12-01 17:08:45 +00:00
rcourtman
73d2e91ab6 test: Add tests for checkStorageOffline and checkGuestPoweredOff
- checkStorageOffline: 5 test cases covering confirmation polling,
  alert creation, LastSeen updates, disabled storage handling
- checkGuestPoweredOff: 9 test cases covering confirmation polling,
  alert creation, severity overrides, disabled guests, monitorOnly flag

Coverage: checkStorageOffline 0%→100%, checkGuestPoweredOff 0%→100%
Package coverage: 63.4%→65.4%

Note: Pre-existing test failure in suite related to /etc/pulse file
access - not caused by these changes.
2025-12-01 17:05:15 +00:00
rcourtman
bfcd2f95a5 test: Add tests for checkPMGQueueDepths and checkPMGOldestMessage
- checkPMGQueueDepths: 10 test cases covering total/deferred/hold queues
  at warning/critical levels, existing alert updates, threshold clearing
- checkPMGOldestMessage: 8 test cases covering threshold logic,
  multi-node oldest detection, alert creation/updates/clearing

Coverage: checkPMGQueueDepths 0%→88.1%, checkPMGOldestMessage 0%→100%
Package coverage: 61.3%→63.4%
2025-12-01 16:59:59 +00:00