Pulse/internal/alerts
rcourtman 1183b87fa1 Fix critical alert system concurrency and memory leak issues
This commit addresses 7 critical issues identified during the alert system audit:

**P0 Critical - Race Conditions Fixed:**

1. **dispatchAlert race in NotifyExistingAlert** (lines 5486-5497)
   - Changed from RLock to Lock to hold mutex during dispatchAlert call
   - dispatchAlert calls checkFlapping which writes to maps (flappingHistory, flappingActive, suppressedUntil)
   - Previous code: grabbed RLock, got alert pointer, released lock, then called dispatchAlert (RACE)
   - Fixed: hold Lock through dispatchAlert call

2. **dispatchAlert race in LoadActiveAlerts startup** (lines 8216-8235)
   - Startup goroutines called dispatchAlert without holding lock
   - Added m.mu.Lock/Unlock around dispatchAlert call in goroutine
   - Also added cancellation via escalationStop channel to prevent goroutine leaks on shutdown

3. **checkFlapping documentation** (line 738)
   - Added clear comment that checkFlapping requires caller to hold m.mu
   - Prevents future race conditions from improper usage

**P1 Important - Data Loss Prevention:**

4. **History save race condition** (lines 177-180 in history.go)
   - Added saveMu mutex to serialize disk writes
   - Previous: concurrent saves could interleave, causing newer data to be overwritten by older snapshots
   - Fixed: saveMu.Lock at start of saveHistoryWithRetry serializes disk writes, so concurrent saves can no longer interleave
   - Newer snapshots now always win over older ones

**P2 Memory Leak Prevention:**

5. **PMG anomaly tracker cleanup** (lines 7318-7331)
   - Added cleanup for pmgAnomalyTrackers map (24 hour TTL based on LastSampleTime)
   - Prevents unbounded growth from decommissioned/transient PMG instances
   - Each tracker: ~1-2KB (48 samples + baselines)

6. **PMG quarantine history cleanup** (lines 7333-7354)
   - Added cleanup for pmgQuarantineHistory map (7 day TTL based on last snapshot)
   - Prevents memory leak for deleted PMG instances
   - Removes both empty histories and very old histories

**P2 Goroutine Leak Prevention:**

7. **Startup notification goroutine cancellation** (lines 8218-8234)
   - Added select with escalationStop channel to cancel startup notifications
   - Prevents goroutines from continuing after Stop() is called
   - Scales with number of restored critical alerts

All fixes maintain a consistent lock order to avoid deadlocks, and prevent data races by holding the relevant lock whenever shared maps are accessed.
2025-11-07 09:12:28 +00:00

| File | Last commit | Date |
| --- | --- | --- |
| `alerts.go` | Fix critical alert system concurrency and memory leak issues | 2025-11-07 09:12:28 +00:00 |
| `alerts_test.go` | Fix webhook alerts persisting when DisableAll* flags are enabled | 2025-11-06 21:17:56 +00:00 |
| `concurrency_test.go` | Fix settings security tab navigation | 2025-10-11 23:29:47 +00:00 |
| `history.go` | Fix critical alert system concurrency and memory leak issues | 2025-11-07 09:12:28 +00:00 |
| `history_concurrency_test.go` | Fix settings security tab navigation | 2025-10-11 23:29:47 +00:00 |
| `offline_toggle_test.go` | feat: finalize swarm service monitoring (#598) | 2025-10-26 09:35:49 +00:00 |
| `per_metric_delay_example_test.go` | Add configurable SSH port for temperature monitoring | 2025-11-05 20:03:29 +00:00 |
| `quiet_hours_test.go` | Fix settings security tab navigation | 2025-10-11 23:29:47 +00:00 |
| `threshold_update_test.go` | Fix settings security tab navigation | 2025-10-11 23:29:47 +00:00 |
| `time_threshold_test.go` | Fix settings security tab navigation | 2025-10-11 23:29:47 +00:00 |