mirror of
https://github.com/rcourtman/Pulse.git
synced 2026-05-08 18:21:55 +00:00
This commit addresses 7 critical issues identified during the alert system audit:

**P0 Critical - Race Conditions Fixed:**

1. **dispatchAlert race in NotifyExistingAlert** (lines 5486-5497)
   - Changed from RLock to Lock to hold mutex during dispatchAlert call
   - dispatchAlert calls checkFlapping which writes to maps (flappingHistory, flappingActive, suppressedUntil)
   - Previous code: grabbed RLock, got alert pointer, released lock, then called dispatchAlert (RACE)
   - Fixed: hold Lock through dispatchAlert call

2. **dispatchAlert race in LoadActiveAlerts startup** (lines 8216-8235)
   - Startup goroutines called dispatchAlert without holding lock
   - Added m.mu.Lock/Unlock around dispatchAlert call in goroutine
   - Also added cancellation via escalationStop channel to prevent goroutine leaks on shutdown

3. **checkFlapping documentation** (line 738)
   - Added clear comment that checkFlapping requires caller to hold m.mu
   - Prevents future race conditions from improper usage

**P1 Important - Data Loss Prevention:**

4. **History save race condition** (lines 177-180 in history.go)
   - Added saveMu mutex to serialize disk writes
   - Previous: concurrent saves could interleave, causing newer data to be overwritten by older snapshots
   - Fixed: saveMu.Lock at start of saveHistoryWithRetry ensures atomic disk writes
   - Newer snapshots now always win over older ones

**P2 Memory Leak Prevention:**

5. **PMG anomaly tracker cleanup** (lines 7318-7331)
   - Added cleanup for pmgAnomalyTrackers map (24 hour TTL based on LastSampleTime)
   - Prevents unbounded growth from decommissioned/transient PMG instances
   - Each tracker: ~1-2KB (48 samples + baselines)

6. **PMG quarantine history cleanup** (lines 7333-7354)
   - Added cleanup for pmgQuarantineHistory map (7 day TTL based on last snapshot)
   - Prevents memory leak for deleted PMG instances
   - Removes both empty histories and very old histories

**P2 Goroutine Leak Prevention:**

7. **Startup notification goroutine cancellation** (lines 8218-8234)
   - Added select with escalationStop channel to cancel startup notifications
   - Prevents goroutines from continuing after Stop() is called
   - Scales with number of restored critical alerts

All fixes maintain proper lock ordering and prevent deadlocks by ensuring locks are held when accessing shared maps.

| Name |
|---|
| alerts.go |
| alerts_test.go |
| concurrency_test.go |
| history.go |
| history_concurrency_test.go |
| offline_toggle_test.go |
| per_metric_delay_example_test.go |
| quiet_hours_test.go |
| threshold_update_test.go |
| time_threshold_test.go |
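
The locking and cancellation fixes described in the commit message (items 1, 2, 3 and 7) can be illustrated with a minimal sketch. The `Manager`, `Alert`, `notifyLater`, and `demo` names below are hypothetical stand-ins and not the project's real types. Only the identifiers `dispatchAlert`, `checkFlapping`, `flappingHistory`, `escalationStop`, and `m.mu` are taken from the message, and their bodies here are assumed for illustration.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Alert and Manager are hypothetical stand-ins, not the project's real types.
type Alert struct{ ID string }

type Manager struct {
	mu              sync.RWMutex
	activeAlerts    map[string]*Alert
	flappingHistory map[string][]time.Time
	escalationStop  chan struct{}
	dispatched      int
}

func NewManager() *Manager {
	return &Manager{
		activeAlerts:    map[string]*Alert{},
		flappingHistory: map[string][]time.Time{},
		escalationStop:  make(chan struct{}),
	}
}

// checkFlapping writes to a shared map. The caller must hold m.mu (write lock).
func (m *Manager) checkFlapping(id string) {
	m.flappingHistory[id] = append(m.flappingHistory[id], time.Now())
}

// dispatchAlert calls checkFlapping, so the caller must hold m.mu as well.
func (m *Manager) dispatchAlert(a *Alert) {
	m.checkFlapping(a.ID)
	m.dispatched++
}

// NotifyExistingAlert holds the write lock across both lookup and dispatch.
func (m *Manager) NotifyExistingAlert(id string) bool {
	m.mu.Lock()
	defer m.mu.Unlock()
	a, ok := m.activeAlerts[id]
	if !ok {
		return false
	}
	m.dispatchAlert(a)
	return true
}

// notifyLater dispatches after delay unless escalationStop is closed first.
func (m *Manager) notifyLater(a *Alert, delay time.Duration, wg *sync.WaitGroup) {
	wg.Add(1)
	go func() {
		defer wg.Done()
		select {
		case <-time.After(delay):
			m.mu.Lock()
			m.dispatchAlert(a)
			m.mu.Unlock()
		case <-m.escalationStop:
			// Stop was called; exit without dispatching.
		}
	}()
}

func (m *Manager) Stop() { close(m.escalationStop) }

func (m *Manager) Dispatched() int {
	m.mu.RLock()
	defer m.mu.RUnlock()
	return m.dispatched
}

// demo runs n concurrent notifications, then cancels one pending startup
// notification, and returns the final dispatch count.
func demo(n int) int {
	m := NewManager()
	m.activeAlerts["cpu-high"] = &Alert{ID: "cpu-high"}
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			m.NotifyExistingAlert("cpu-high")
		}()
	}
	wg.Wait()
	var pending sync.WaitGroup
	m.notifyLater(&Alert{ID: "restored"}, time.Hour, &pending)
	m.Stop()
	pending.Wait() // returns promptly because the goroutine saw the stop signal
	return m.Dispatched()
}

func main() {
	fmt.Println("dispatched:", demo(8))
}
```

Running a sketch like this under `go run -race` is one way to check that no map write happens outside the lock. Swapping `Lock` for `RLock` in `NotifyExistingAlert` should make the race detector report the concurrent map writes.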
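
Item 4 of the commit message (serializing history saves with `saveMu`) can be sketched as follows. The `History` type, its fields other than `saveMu`, and the in-memory `disk` slice standing in for the file are all assumptions made for illustration. In particular, taking the snapshot while `saveMu` is held is one plausible way to make the newest snapshot win, and the source does not confirm that the real `saveHistoryWithRetry` works this way.

```go
package main

import (
	"fmt"
	"sync"
)

// History is a hypothetical stand-in for the alert history store.
type History struct {
	mu      sync.RWMutex // guards entries
	entries []string

	saveMu sync.Mutex // serializes "disk" writes
	disk   []string   // stands in for the history file on disk
}

// Add appends an entry under the data lock.
func (h *History) Add(e string) {
	h.mu.Lock()
	h.entries = append(h.entries, e)
	h.mu.Unlock()
}

// Save takes saveMu first, then snapshots and writes. Because the snapshot
// is taken while saveMu is held, a later writer can never persist data that
// is older than what an earlier writer already stored.
func (h *History) Save() {
	h.saveMu.Lock()
	defer h.saveMu.Unlock()

	h.mu.RLock()
	snapshot := append([]string(nil), h.entries...)
	h.mu.RUnlock()

	h.disk = snapshot // a real implementation would write a file here
}

// OnDisk reports how many entries the last completed save persisted.
func (h *History) OnDisk() int {
	h.saveMu.Lock()
	defer h.saveMu.Unlock()
	return len(h.disk)
}

// demoSave adds and saves from n goroutines and returns the persisted count.
func demoSave(n int) int {
	h := &History{}
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			h.Add(fmt.Sprintf("alert-%d", i))
			h.Save()
		}(i)
	}
	wg.Wait()
	return h.OnDisk()
}

func main() {
	fmt.Println("entries on disk:", demoSave(20))
}
```

In this sketch the last goroutine to acquire `saveMu` runs after every other goroutine has finished its own `Add` and `Save`, so its snapshot contains all entries. If snapshots were taken before acquiring `saveMu`, an older snapshot could be written last and silently drop newer entries.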
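
The TTL-based cleanup in items 5 and 6 of the commit message follows a common pattern: walk the map and delete entries whose last activity is older than a cutoff. The sketch below assumes a hypothetical `anomalyTracker` type and `pruneTrackers` helper. Only the `LastSampleTime` field name and the 24 hour TTL come from the message, and the caller is assumed to hold whatever lock protects the map.

```go
package main

import (
	"fmt"
	"time"
)

// anomalyTracker is a hypothetical stand-in for a per-instance tracker.
type anomalyTracker struct {
	LastSampleTime time.Time
}

// trackerTTL mirrors the 24 hour TTL mentioned in the commit message.
const trackerTTL = 24 * time.Hour

// pruneTrackers deletes nil entries and entries whose last sample is older
// than ttl, and returns how many were removed. Deleting from a map while
// ranging over it is safe in Go. The caller must hold the lock that guards
// the map.
func pruneTrackers(trackers map[string]*anomalyTracker, now time.Time, ttl time.Duration) int {
	removed := 0
	for id, t := range trackers {
		if t == nil || now.Sub(t.LastSampleTime) > ttl {
			delete(trackers, id)
			removed++
		}
	}
	return removed
}

// demoPrune builds a small map with fixed timestamps and prunes it.
func demoPrune() (removed, remaining int) {
	now := time.Date(2026, 1, 2, 12, 0, 0, 0, time.UTC)
	trackers := map[string]*anomalyTracker{
		"pmg-live": {LastSampleTime: now.Add(-time.Hour)},      // recent, kept
		"pmg-edge": {LastSampleTime: now.Add(-24 * time.Hour)}, // exactly at TTL, kept
		"pmg-gone": {LastSampleTime: now.Add(-48 * time.Hour)}, // stale, removed
	}
	removed = pruneTrackers(trackers, now, trackerTTL)
	return removed, len(trackers)
}

func main() {
	removed, remaining := demoPrune()
	fmt.Printf("removed=%d remaining=%d\n", removed, remaining)
}
```

Passing `now` as a parameter instead of calling `time.Now()` inside the helper keeps the cutoff logic deterministic and easy to test. The same shape would apply to the 7 day quarantine history cleanup with a different TTL and timestamp source.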