Update history entry LastSeen on alert resolution so the stored duration
reflects how long the alert was actually active, not the snapshot captured
at creation time. This fixes the "0m" duration display for all resolved
metric-based alerts.
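
As an illustration of the fix above, here is a minimal sketch assuming a simplified history entry. `HistoryEntry`, `markResolved`, and `activeMinutes` are hypothetical names, not the project's actual types; only the `LastSeen` field comes from the commit message. The stored duration comes from `LastSeen`, so it must be refreshed when the alert resolves.

```go
package main

import (
	"fmt"
	"time"
)

// HistoryEntry is a hypothetical, simplified history record.
type HistoryEntry struct {
	AlertID   string
	StartTime time.Time
	LastSeen  time.Time
}

// Duration reports how long the alert was active according to the entry.
func (e HistoryEntry) Duration() time.Duration {
	return e.LastSeen.Sub(e.StartTime)
}

// markResolved refreshes LastSeen on the matching entry at resolution time
// and reports whether an entry was found.
func markResolved(history []HistoryEntry, alertID string, resolvedAt time.Time) bool {
	for i := range history {
		if history[i].AlertID == alertID {
			history[i].LastSeen = resolvedAt
			return true
		}
	}
	return false
}

// activeMinutes creates an entry whose LastSeen equals its StartTime (the
// snapshot taken at creation), resolves it after the given number of
// minutes, and returns the stored duration in whole minutes.
func activeMinutes(minutes int) int {
	start := time.Date(2025, 1, 1, 12, 0, 0, 0, time.UTC)
	history := []HistoryEntry{{AlertID: "vm-100-cpu", StartTime: start, LastSeen: start}}
	markResolved(history, "vm-100-cpu", start.Add(time.Duration(minutes)*time.Minute))
	return int(history[0].Duration().Minutes())
}

func main() {
	fmt.Printf("stored duration: %dm\n", activeMinutes(42))
}
```

Without the `markResolved` step the entry keeps `LastSeen == StartTime`, which is what produces a "0m" duration.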
Fix reevaluateActiveAlertsLocked to use HostDefaults for host agent alerts
and PBSDefaults for PBS alerts instead of falling through to GuestDefaults
and NodeDefaults respectively, which could incorrectly resolve or retain
alerts on config save when thresholds differ.
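
The selection logic can be sketched roughly as follows. The four default groups are named in this message; the `Thresholds` and `Config` shapes, the resource type strings, and the helper names are assumptions made for illustration only.

```go
package main

import "fmt"

// Thresholds is a hypothetical, trimmed-down threshold set.
type Thresholds struct{ CPU float64 }

// Config holds the four default groups named in the commit message.
type Config struct {
	GuestDefaults Thresholds
	NodeDefaults  Thresholds
	HostDefaults  Thresholds
	PBSDefaults   Thresholds
}

// defaultsFor picks the threshold set for a resource type. The type strings
// are assumptions; the point is that host agents and PBS get their own case
// instead of falling through to another group.
func defaultsFor(cfg Config, resourceType string) Thresholds {
	switch resourceType {
	case "host":
		return cfg.HostDefaults
	case "pbs":
		return cfg.PBSDefaults
	case "node":
		return cfg.NodeDefaults
	default:
		return cfg.GuestDefaults
	}
}

// stillFiring reports whether an active CPU alert should be kept when the
// configuration is saved and active alerts are re-evaluated.
func stillFiring(cfg Config, resourceType string, value float64) bool {
	return value >= defaultsFor(cfg, resourceType).CPU
}

func main() {
	cfg := Config{
		GuestDefaults: Thresholds{CPU: 80},
		HostDefaults:  Thresholds{CPU: 95},
	}
	// A host at 90% CPU is below the host threshold, so its alert resolves;
	// judged against the guest threshold it would have been retained.
	fmt.Println(stillFiring(cfg, "host", 90), stillFiring(cfg, "guest", 90))
}
```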
- Add persistent volume mounts for Go/npm caches (faster rebuilds)
- Add shell config with helpful aliases and custom prompt
- Add comprehensive devcontainer documentation
- Add pre-commit hooks for Go formatting and linting
- Use go-version-file in CI workflows instead of hardcoded versions
- Simplify docker compose commands with --wait flag
- Add gitignore entries for devcontainer auth files
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Connect alert system to failure prediction:
1. Add AlertCallback to HistoryManager:
- OnAlert() method to register callbacks
- Callbacks invoked when alerts are added
- Called outside lock to prevent deadlocks
2. Expose OnAlertHistory() on alerts.Manager:
- Pass-through to HistoryManager.OnAlert()
- Enables external systems to track alerts
3. Wire pattern detector in router startup:
- Register callback when pattern detector is created
- Convert alert types to trackable events
- Pattern detector now learns from production alerts
Now every alert (memory_warning, cpu_critical, etc.) is recorded as
a historical event for pattern analysis. The AI can predict:
'High memory usage typically occurs every ~3 days (next expected in ~1 day)'
All tests passing.
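
A minimal sketch of the callback pattern described in step 1, assuming simplified types. `OnAlert` and `AlertCallback` are named in this message; `Alert`, `AddAlert`, and `Len` are hypothetical stand-ins. The callback list is copied under the lock and invoked after the lock is released, so a callback may call back into the manager without deadlocking.

```go
package main

import (
	"fmt"
	"sync"
)

// Alert is a hypothetical, trimmed-down alert record.
type Alert struct{ ID, Type string }

// AlertCallback is invoked for every alert added to the history.
type AlertCallback func(Alert)

// HistoryManager is a simplified stand-in for the real history manager.
type HistoryManager struct {
	mu        sync.Mutex
	history   []Alert
	callbacks []AlertCallback
}

// OnAlert registers a callback.
func (hm *HistoryManager) OnAlert(cb AlertCallback) {
	hm.mu.Lock()
	defer hm.mu.Unlock()
	hm.callbacks = append(hm.callbacks, cb)
}

// AddAlert records the alert, then invokes callbacks outside the lock.
func (hm *HistoryManager) AddAlert(a Alert) {
	hm.mu.Lock()
	hm.history = append(hm.history, a)
	// Copy the slice so the callbacks can run after the lock is released.
	cbs := make([]AlertCallback, len(hm.callbacks))
	copy(cbs, hm.callbacks)
	hm.mu.Unlock()

	for _, cb := range cbs {
		cb(a)
	}
}

// Len returns the number of recorded alerts.
func (hm *HistoryManager) Len() int {
	hm.mu.Lock()
	defer hm.mu.Unlock()
	return len(hm.history)
}

func main() {
	hm := &HistoryManager{}
	hm.OnAlert(func(a Alert) {
		// Re-entering the manager is safe because the lock is not held here.
		fmt.Printf("pattern event %q (history size %d)\n", a.Type, hm.Len())
	})
	hm.AddAlert(Alert{ID: "a1", Type: "memory_warning"})
}
```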
- Rename checkFlapping to checkFlappingLocked to clarify lock contract
- Replace goto statements with structured control flow
- Wire up unused recordAlertFired/recordAlertResolved metric hooks
- Add trackingMapCleanup goroutine to prevent memory leaks from stale entries
- Tighten alert ID validation to alphanumeric + safe punctuation
- Fix history save error handling to properly manage backup lifecycle
- Add auto-migration for deprecated GroupingWindow field
- Refactor 300+ line UpdateConfig into focused helper functions
- Unify duplicate evaluateVMCondition/evaluateContainerCondition
- Add constants for magic numbers (thresholds, timing, flapping)
- Update tests to match new backup behavior
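
For the alert ID validation item, a sketch of what an allow-list check could look like. The exact character set and length limit are not stated in this message; the punctuation set `-_.:` and the 256-character cap below are assumptions, as are the names `isValidAlertID` and `maxAlertIDLength`.

```go
package main

import (
	"fmt"
	"strings"
)

// maxAlertIDLength is an assumed upper bound, not taken from the project.
const maxAlertIDLength = 256

// safePunctuation is an assumed allow-list of non-alphanumeric characters.
const safePunctuation = "-_.:"

// isValidAlertID accepts only ASCII letters, digits, and the assumed safe
// punctuation. Empty and over-long IDs are rejected.
func isValidAlertID(id string) bool {
	if id == "" || len(id) > maxAlertIDLength {
		return false
	}
	for _, r := range id {
		switch {
		case r >= 'a' && r <= 'z', r >= 'A' && r <= 'Z', r >= '0' && r <= '9':
			// alphanumeric: allowed
		case strings.ContainsRune(safePunctuation, r):
			// safe punctuation: allowed
		default:
			return false
		}
	}
	return true
}

func main() {
	for _, id := range []string{"pve1-vm-100-cpu", "../etc/passwd", ""} {
		fmt.Printf("%q valid: %v\n", id, isValidAlertID(id))
	}
}
```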
This commit addresses 7 critical issues identified during the alert system audit:
**P0 Critical - Race Conditions Fixed:**
1. **dispatchAlert race in NotifyExistingAlert** (lines 5486-5497)
- Changed from RLock to Lock to hold mutex during dispatchAlert call
- dispatchAlert calls checkFlapping which writes to maps (flappingHistory, flappingActive, suppressedUntil)
- Previous code: grabbed RLock, got alert pointer, released lock, then called dispatchAlert (RACE)
- Fixed: hold Lock through dispatchAlert call
2. **dispatchAlert race in LoadActiveAlerts startup** (lines 8216-8235)
- Startup goroutines called dispatchAlert without holding lock
- Added m.mu.Lock/Unlock around dispatchAlert call in goroutine
- Also added cancellation via escalationStop channel to prevent goroutine leaks on shutdown
3. **checkFlapping documentation** (line 738)
- Added clear comment that checkFlapping requires caller to hold m.mu
- Prevents future race conditions from improper usage
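
The locking change in items 1-3 can be sketched as follows, assuming a heavily simplified manager. The field and method shapes here are illustrative, not the real code; `dispatchAlertLocked` and `concurrentNotify` are hypothetical names. The flapping check writes to a map, so the caller must hold the write lock for the whole dispatch rather than an RLock that is released before dispatching.

```go
package main

import (
	"fmt"
	"sync"
)

// Manager is a simplified stand-in for the alert manager.
type Manager struct {
	mu              sync.RWMutex
	activeAlerts    map[string]string // alert ID -> message (simplified)
	flappingHistory map[string]int    // written during dispatch
}

// checkFlappingLocked records a state change. The caller must hold m.mu
// for writing because this method mutates flappingHistory.
func (m *Manager) checkFlappingLocked(id string) int {
	m.flappingHistory[id]++
	return m.flappingHistory[id]
}

// dispatchAlertLocked is a hypothetical dispatch step. The caller must hold m.mu.
func (m *Manager) dispatchAlertLocked(id string) bool {
	m.checkFlappingLocked(id)
	return true
}

// NotifyExistingAlert holds the write lock through the dispatch call.
// Taking only RLock here, or releasing the lock before dispatching,
// would let concurrent callers write to the map unsynchronized.
func (m *Manager) NotifyExistingAlert(id string) bool {
	m.mu.Lock()
	defer m.mu.Unlock()
	if _, ok := m.activeAlerts[id]; !ok {
		return false
	}
	return m.dispatchAlertLocked(id)
}

// concurrentNotify fires n concurrent notifications for one alert and
// returns how many state changes were recorded.
func concurrentNotify(n int) int {
	m := &Manager{
		activeAlerts:    map[string]string{"a1": "cpu critical"},
		flappingHistory: map[string]int{},
	}
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			m.NotifyExistingAlert("a1")
		}()
	}
	wg.Wait()
	m.mu.RLock()
	defer m.mu.RUnlock()
	return m.flappingHistory["a1"]
}

func main() {
	fmt.Println("recorded dispatches:", concurrentNotify(50))
}
```

Running a sketch like this with `go run -race` is one way to confirm that no unsynchronized map access remains.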
**P1 Important - Data Loss Prevention:**
4. **History save race condition** (lines 177-180 in history.go)
- Added saveMu mutex to serialize disk writes
- Previous: concurrent saves could interleave, causing newer data to be overwritten by older snapshots
- Fixed: saveMu.Lock at start of saveHistoryWithRetry serializes disk writes so saves can no longer interleave
- Newer snapshots now always win over older ones
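
A sketch of the serialization idea, under stated assumptions: `saveMu` is named in this message, while the struct layout, the JSON file format, the method names, and taking the snapshot after acquiring `saveMu` are choices made for this illustration rather than a description of the actual code. Because each save snapshots the in-memory history only once it owns `saveMu`, a later write can never carry older data than an earlier one.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"path/filepath"
	"sync"
)

// HistoryManager is a simplified stand-in with two locks:
// mu guards the in-memory history, saveMu serializes disk writes.
type HistoryManager struct {
	mu      sync.RWMutex
	saveMu  sync.Mutex
	history []string
	path    string
}

// Add appends an entry to the in-memory history.
func (hm *HistoryManager) Add(entry string) {
	hm.mu.Lock()
	defer hm.mu.Unlock()
	hm.history = append(hm.history, entry)
}

// save writes the current history to disk. Only one save runs at a time.
func (hm *HistoryManager) save() error {
	hm.saveMu.Lock()
	defer hm.saveMu.Unlock()

	// Snapshot after acquiring saveMu so writes happen in snapshot order.
	hm.mu.RLock()
	snapshot := make([]string, len(hm.history))
	copy(snapshot, hm.history)
	hm.mu.RUnlock()

	data, err := json.Marshal(snapshot)
	if err != nil {
		return err
	}
	return os.WriteFile(hm.path, data, 0o644)
}

// savedCountAfterConcurrentSaves runs n goroutines that each add an entry
// and save, then returns how many entries ended up on disk.
func savedCountAfterConcurrentSaves(n int) (int, error) {
	dir, err := os.MkdirTemp("", "history-demo")
	if err != nil {
		return 0, err
	}
	defer os.RemoveAll(dir)

	hm := &HistoryManager{path: filepath.Join(dir, "history.json")}
	errs := make(chan error, n)
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			hm.Add(fmt.Sprintf("alert-%d", i))
			if err := hm.save(); err != nil {
				errs <- err
			}
		}(i)
	}
	wg.Wait()
	close(errs)
	if err, ok := <-errs; ok {
		return 0, err
	}

	data, err := os.ReadFile(hm.path)
	if err != nil {
		return 0, err
	}
	var onDisk []string
	if err := json.Unmarshal(data, &onDisk); err != nil {
		return 0, err
	}
	return len(onDisk), nil
}

func main() {
	n, err := savedCountAfterConcurrentSaves(20)
	fmt.Println(n, err)
}
```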
**P2 Memory Leak Prevention:**
5. **PMG anomaly tracker cleanup** (lines 7318-7331)
- Added cleanup for pmgAnomalyTrackers map (24 hour TTL based on LastSampleTime)
- Prevents unbounded growth from decommissioned/transient PMG instances
- Each tracker: ~1-2KB (48 samples + baselines)
6. **PMG quarantine history cleanup** (lines 7333-7354)
- Added cleanup for pmgQuarantineHistory map (7 day TTL based on last snapshot)
- Prevents memory leak for deleted PMG instances
- Removes both empty histories and very old histories
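
The two cleanups in items 5 and 6 follow the same TTL pattern, sketched here with assumed types. The 24 hour and 7 day TTLs and `LastSampleTime` come from this message; the struct shapes, constant names, and function names are hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

const (
	anomalyTrackerTTL    = 24 * time.Hour     // based on LastSampleTime
	quarantineHistoryTTL = 7 * 24 * time.Hour // based on the last snapshot
)

// anomalyTracker is a hypothetical, trimmed-down tracker.
type anomalyTracker struct{ LastSampleTime time.Time }

// quarantineSnapshot is a hypothetical, trimmed-down history sample.
type quarantineSnapshot struct{ Timestamp time.Time }

// cleanupTrackers removes trackers with no sample within the TTL and
// returns how many were removed. Deleting during range is safe in Go.
func cleanupTrackers(trackers map[string]*anomalyTracker, now time.Time) int {
	removed := 0
	for id, t := range trackers {
		if t == nil || now.Sub(t.LastSampleTime) > anomalyTrackerTTL {
			delete(trackers, id)
			removed++
		}
	}
	return removed
}

// cleanupQuarantineHistory removes empty histories and histories whose
// newest snapshot is older than the TTL.
func cleanupQuarantineHistory(history map[string][]quarantineSnapshot, now time.Time) int {
	removed := 0
	for id, snaps := range history {
		if len(snaps) == 0 || now.Sub(snaps[len(snaps)-1].Timestamp) > quarantineHistoryTTL {
			delete(history, id)
			removed++
		}
	}
	return removed
}

// demoCleanup builds sample maps, runs both cleanups, and returns how many
// entries remain in each map.
func demoCleanup() (trackersLeft, historiesLeft int) {
	now := time.Date(2025, 1, 10, 0, 0, 0, 0, time.UTC)
	trackers := map[string]*anomalyTracker{
		"pmg-live": {LastSampleTime: now.Add(-time.Hour)},
		"pmg-gone": {LastSampleTime: now.Add(-48 * time.Hour)},
	}
	history := map[string][]quarantineSnapshot{
		"pmg-live":  {{Timestamp: now.Add(-time.Hour)}},
		"pmg-empty": {},
		"pmg-old":   {{Timestamp: now.Add(-8 * 24 * time.Hour)}},
	}
	cleanupTrackers(trackers, now)
	cleanupQuarantineHistory(history, now)
	return len(trackers), len(history)
}

func main() {
	t, h := demoCleanup()
	fmt.Printf("trackers left: %d, histories left: %d\n", t, h)
}
```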
**P2 Goroutine Leak Prevention:**
7. **Startup notification goroutine cancellation** (lines 8218-8234)
- Added select with escalationStop channel to cancel startup notifications
- Prevents goroutines from continuing after Stop() is called
- Scales with number of restored critical alerts
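
The cancellation in item 7 can be sketched with a `select` on a timer and a stop channel. The `escalationStop` channel is named in this message; `notifyAfterDelay` and `runStartupNotification`, the delay parameter, and the result channel are hypothetical scaffolding for the example.

```go
package main

import (
	"fmt"
	"time"
)

// notifyAfterDelay starts a goroutine that calls send after delay unless
// stop is closed first. The returned channel reports whether send ran.
func notifyAfterDelay(delay time.Duration, stop <-chan struct{}, send func()) <-chan bool {
	done := make(chan bool, 1)
	go func() {
		timer := time.NewTimer(delay)
		defer timer.Stop()
		select {
		case <-timer.C:
			send()
			done <- true
		case <-stop:
			// Stop() was called: exit without sending so the goroutine
			// does not outlive the manager.
			done <- false
		}
	}()
	return done
}

// runStartupNotification runs one delayed notification and reports whether
// it was sent. If stopFirst is true, the stop channel is closed up front.
func runStartupNotification(delayMillis int, stopFirst bool) bool {
	escalationStop := make(chan struct{})
	if stopFirst {
		close(escalationStop)
	}
	delay := time.Duration(delayMillis) * time.Millisecond
	done := notifyAfterDelay(delay, escalationStop, func() {
		// In the real manager this is where dispatch would run under m.mu.
	})
	return <-done
}

func main() {
	fmt.Println("sent when running:", runStartupNotification(1, false))
	fmt.Println("sent after stop:", runStartupNotification(3600000, true))
}
```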
All fixes maintain proper lock ordering to avoid deadlocks, and prevent data races by ensuring locks are held whenever shared maps are accessed.