Commit graph

9 commits

Author SHA1 Message Date
rcourtman
b5373749db Fix alert history duration and re-evaluation threshold bugs
Update history entry LastSeen on alert resolution so the stored duration
reflects how long the alert was actually active, not the snapshot captured
at creation time. This fixes the "0m" duration display for all resolved
metric-based alerts.
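
The effect of this fix can be illustrated with a minimal Go sketch. It is not the project's code: the `historyEntry` type, its `StartTime` field, and the `resolve` and `demo` helpers are hypothetical names, and only `LastSeen` comes from the commit message. The sketch assumes duration is derived as `LastSeen` minus the start time.

```go
package main

import (
	"fmt"
	"time"
)

// historyEntry is a hypothetical stand-in for a stored alert history record.
type historyEntry struct {
	ID        string
	StartTime time.Time
	LastSeen  time.Time
}

// duration reports how long the alert was active according to the stored entry.
func (e historyEntry) duration() time.Duration {
	return e.LastSeen.Sub(e.StartTime)
}

// resolve stamps the matching entry's LastSeen with the resolution time.
// It returns false when no entry with that ID exists.
func resolve(entries []historyEntry, id string, resolvedAt time.Time) bool {
	for i := range entries {
		if entries[i].ID == id {
			entries[i].LastSeen = resolvedAt
			return true
		}
	}
	return false
}

// demo returns the stored duration before and after the resolution update.
func demo() (before, after time.Duration) {
	start := time.Date(2026, 2, 4, 12, 0, 0, 0, time.UTC)
	// At creation time LastSeen equals StartTime, so the snapshot duration is zero.
	entries := []historyEntry{{ID: "cpu-node1", StartTime: start, LastSeen: start}}
	before = entries[0].duration()
	resolve(entries, "cpu-node1", start.Add(42*time.Minute))
	after = entries[0].duration()
	return before, after
}

func main() {
	before, after := demo()
	fmt.Println(before, after) // prints "0s 42m0s"
}
```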

Fix reevaluateActiveAlertsLocked to use HostDefaults for host agent alerts
and PBSDefaults for PBS alerts instead of falling through to GuestDefaults
and NodeDefaults respectively, which could incorrectly resolve or retain
alerts on config save when thresholds differ.
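
A rough sketch of the threshold selection described above, under stated assumptions: the `thresholds` and `alertConfig` types, the `CPUTrigger` field, and the resource type strings are invented for illustration. Only the four `*Defaults` names come from the commit message. Returning `false` for unknown types is one possible way to avoid a silent fall-through, not necessarily what the project does.

```go
package main

import "fmt"

// thresholds is a hypothetical, heavily reduced threshold set.
type thresholds struct {
	CPUTrigger float64
}

// alertConfig holds one default threshold set per resource family.
type alertConfig struct {
	GuestDefaults thresholds
	NodeDefaults  thresholds
	HostDefaults  thresholds
	PBSDefaults   thresholds
}

// defaultsFor picks the defaults for a resource type explicitly.
// Unknown types report false instead of falling through to another family.
func defaultsFor(cfg alertConfig, resourceType string) (thresholds, bool) {
	switch resourceType {
	case "guest":
		return cfg.GuestDefaults, true
	case "node":
		return cfg.NodeDefaults, true
	case "host":
		return cfg.HostDefaults, true
	case "pbs":
		return cfg.PBSDefaults, true
	default:
		return thresholds{}, false
	}
}

func main() {
	cfg := alertConfig{
		GuestDefaults: thresholds{CPUTrigger: 80},
		NodeDefaults:  thresholds{CPUTrigger: 85},
		HostDefaults:  thresholds{CPUTrigger: 90},
		PBSDefaults:   thresholds{CPUTrigger: 95},
	}
	for _, rt := range []string{"guest", "node", "host", "pbs"} {
		t, _ := defaultsFor(cfg, rt)
		fmt.Printf("%s -> %.0f\n", rt, t.CPUTrigger)
	}
}
```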
2026-02-04 15:40:28 +00:00
rcourtman
ed78509f92 Fix flaky tests and improve coverage across alerts, api, and config packages
- Fix deadlock and race conditions in internal/alerts
- Add comprehensive error path tests for internal/config
- Fix 401 handling in internal/api
- Fix Docker Swarm task filtering test logic
2026-01-03 18:36:17 +00:00
rcourtman
3fdf753a5b Enhance devcontainer and CI workflows
- Add persistent volume mounts for Go/npm caches (faster rebuilds)
- Add shell config with helpful aliases and custom prompt
- Add comprehensive devcontainer documentation
- Add pre-commit hooks for Go formatting and linting
- Use go-version-file in CI workflows instead of hardcoded versions
- Simplify docker compose commands with --wait flag
- Add gitignore entries for devcontainer auth files

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-01 22:29:15 +00:00
rcourtman
9c92bb49df feat(ai): Wire alert history to pattern detector for event tracking
Connect alert system to failure prediction:

1. Add AlertCallback to HistoryManager:
   - OnAlert() method to register callbacks
   - Callbacks invoked when alerts are added
   - Called outside lock to prevent deadlocks

2. Expose OnAlertHistory() on alerts.Manager:
   - Pass-through to HistoryManager.OnAlert()
   - Enables external systems to track alerts

3. Wire pattern detector in router startup:
   - Register callback when pattern detector is created
   - Convert alert types to trackable events
   - Pattern detector now learns from production alerts
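
The "called outside lock" point in step 1 can be sketched as follows. This is an illustrative reduction, not the project's `HistoryManager`: the `alert` and `historyManager` types, `AddAlert`, and `Len` are assumed names, and only `OnAlert` is taken from the commit message. The callback list is copied while the mutex is held and invoked after it is released, so a callback may call back into the manager.

```go
package main

import (
	"fmt"
	"sync"
)

// alert is a hypothetical minimal alert record.
type alert struct {
	ID   string
	Type string
}

// historyManager stores alerts and notifies registered callbacks.
type historyManager struct {
	mu        sync.Mutex
	history   []alert
	callbacks []func(alert)
}

// OnAlert registers a callback that runs for every alert added later.
func (h *historyManager) OnAlert(cb func(alert)) {
	h.mu.Lock()
	defer h.mu.Unlock()
	h.callbacks = append(h.callbacks, cb)
}

// AddAlert records the alert, then invokes callbacks after releasing the lock.
func (h *historyManager) AddAlert(a alert) {
	h.mu.Lock()
	h.history = append(h.history, a)
	// Copy the slice so callbacks run without holding h.mu.
	cbs := make([]func(alert), len(h.callbacks))
	copy(cbs, h.callbacks)
	h.mu.Unlock()

	for _, cb := range cbs {
		cb(a)
	}
}

// Len returns the number of stored alerts; it takes the same lock.
func (h *historyManager) Len() int {
	h.mu.Lock()
	defer h.mu.Unlock()
	return len(h.history)
}

func main() {
	h := &historyManager{}
	// This callback re-enters the manager. It would deadlock if AddAlert
	// still held h.mu while invoking callbacks.
	h.OnAlert(func(a alert) {
		fmt.Printf("event %s, history size %d\n", a.Type, h.Len())
	})
	h.AddAlert(alert{ID: "a1", Type: "memory_warning"})
	// prints "event memory_warning, history size 1"
}
```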

Now every alert (memory_warning, cpu_critical, etc.) is recorded as
a historical event for pattern analysis. The AI can predict:
'High memory usage typically occurs every ~3 days (next expected in ~1 day)'

All tests passing.
2025-12-12 14:16:03 +00:00
rcourtman
d0d989289a Refactor alert system: fix race conditions, memory leaks, and improve code quality
- Rename checkFlapping to checkFlappingLocked to clarify lock contract
- Replace goto statements with structured control flow
- Wire up unused recordAlertFired/recordAlertResolved metric hooks
- Add trackingMapCleanup goroutine to prevent memory leaks from stale entries
- Tighten alert ID validation to alphanumeric + safe punctuation
- Fix history save error handling to properly manage backup lifecycle
- Add auto-migration for deprecated GroupingWindow field
- Refactor 300+ line UpdateConfig into focused helper functions
- Unify duplicate evaluateVMCondition/evaluateContainerCondition
- Add constants for magic numbers (thresholds, timing, flapping)
- Update tests to match new backup behavior
2025-12-02 23:31:36 +00:00
rcourtman
1183b87fa1 Fix critical alert system concurrency and memory leak issues
This commit addresses 7 critical issues identified during the alert system audit:

**P0 Critical - Race Conditions Fixed:**

1. **dispatchAlert race in NotifyExistingAlert** (lines 5486-5497)
   - Changed from RLock to Lock to hold mutex during dispatchAlert call
   - dispatchAlert calls checkFlapping which writes to maps (flappingHistory, flappingActive, suppressedUntil)
   - Previous code: grabbed RLock, got alert pointer, released lock, then called dispatchAlert (RACE)
   - Fixed: hold Lock through dispatchAlert call

2. **dispatchAlert race in LoadActiveAlerts startup** (lines 8216-8235)
   - Startup goroutines called dispatchAlert without holding lock
   - Added m.mu.Lock/Unlock around dispatchAlert call in goroutine
   - Also added cancellation via escalationStop channel to prevent goroutine leaks on shutdown

3. **checkFlapping documentation** (line 738)
   - Added clear comment that checkFlapping requires caller to hold m.mu
   - Prevents future race conditions from improper usage

**P1 Important - Data Loss Prevention:**

4. **History save race condition** (lines 177-180 in history.go)
   - Added saveMu mutex to serialize disk writes
   - Previous: concurrent saves could interleave, causing newer data to be overwritten by older snapshots
   - Fixed: saveMu.Lock at start of saveHistoryWithRetry ensures atomic disk writes
   - Newer snapshots now always win over older ones
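
One way to picture the `saveMu` fix is the sketch below. It assumes the snapshot is taken after `saveMu` is acquired, which the commit message does not state explicitly. The `historyStore` type, its `disk` field (standing in for the file), and `runConcurrentSaves` are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

// historyStore is a hypothetical reduction of an alert history with persistence.
type historyStore struct {
	mu      sync.Mutex // guards entries
	saveMu  sync.Mutex // serializes writes to "disk"
	entries []string
	disk    []string // stands in for the history file
}

// add appends one entry to the in-memory history.
func (s *historyStore) add(e string) {
	s.mu.Lock()
	s.entries = append(s.entries, e)
	s.mu.Unlock()
}

// save writes a snapshot. Because the snapshot is taken while saveMu is held,
// a later save can never be overwritten by an older snapshot.
func (s *historyStore) save() {
	s.saveMu.Lock()
	defer s.saveMu.Unlock()

	s.mu.Lock()
	snapshot := append([]string(nil), s.entries...)
	s.mu.Unlock()

	s.disk = snapshot
}

// runConcurrentSaves adds and saves from n goroutines and returns how many
// entries ended up on "disk".
func runConcurrentSaves(n int) int {
	s := &historyStore{}
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			s.add(fmt.Sprintf("alert-%d", i))
			s.save()
		}(i)
	}
	wg.Wait()
	return len(s.disk)
}

func main() {
	fmt.Println(runConcurrentSaves(50)) // prints "50"
}
```

Without the second mutex, a goroutine holding an older snapshot could finish its write last and drop newer entries.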

**P2 Memory Leak Prevention:**

5. **PMG anomaly tracker cleanup** (lines 7318-7331)
   - Added cleanup for pmgAnomalyTrackers map (24 hour TTL based on LastSampleTime)
   - Prevents unbounded growth from decommissioned/transient PMG instances
   - Each tracker: ~1-2KB (48 samples + baselines)

6. **PMG quarantine history cleanup** (lines 7333-7354)
   - Added cleanup for pmgQuarantineHistory map (7 day TTL based on last snapshot)
   - Prevents memory leak for deleted PMG instances
   - Removes both empty histories and very old histories
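
The TTL-based cleanup in items 5 and 6 follows a common pattern that can be sketched generically. The `pruneStale` and `demoPrune` helpers and the map layout are assumptions for illustration; the real `pmgAnomalyTrackers` and `pmgQuarantineHistory` maps hold richer values.

```go
package main

import (
	"fmt"
	"time"
)

// pruneStale deletes every key whose last activity is older than ttl and
// returns how many keys were removed. Deleting during range is safe in Go.
func pruneStale(lastSeen map[string]time.Time, now time.Time, ttl time.Duration) int {
	removed := 0
	for key, seen := range lastSeen {
		if now.Sub(seen) > ttl {
			delete(lastSeen, key)
			removed++
		}
	}
	return removed
}

// demoPrune applies a 24 hour TTL to one fresh and one stale tracker.
func demoPrune() (removed, remaining int) {
	now := time.Date(2025, 11, 7, 9, 0, 0, 0, time.UTC)
	trackers := map[string]time.Time{
		"pmg-active":  now.Add(-1 * time.Hour),
		"pmg-retired": now.Add(-48 * time.Hour),
	}
	removed = pruneStale(trackers, now, 24*time.Hour)
	return removed, len(trackers)
}

func main() {
	removed, remaining := demoPrune()
	fmt.Println(removed, remaining) // prints "1 1"
}
```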

**P2 Goroutine Leak Prevention:**

7. **Startup notification goroutine cancellation** (lines 8218-8234)
   - Added select with escalationStop channel to cancel startup notifications
   - Prevents goroutines from continuing after Stop() is called
   - Scales with number of restored critical alerts
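
The cancellation pattern from item 7 can be sketched as below. The `notifyAfter` helper and its signature are hypothetical; only the idea of selecting on a stop channel (named `escalationStop` in the commit) comes from the message above.

```go
package main

import (
	"fmt"
	"time"
)

// notifyAfter waits for delay and then calls send, unless stop is closed
// first. It reports whether the notification was sent.
func notifyAfter(delay time.Duration, stop <-chan struct{}, send func()) bool {
	timer := time.NewTimer(delay)
	defer timer.Stop()

	select {
	case <-timer.C:
		send()
		return true
	case <-stop:
		// Shutdown requested: return so the goroutine does not linger.
		return false
	}
}

func main() {
	escalationStop := make(chan struct{})
	close(escalationStop) // simulate Stop() being called before the delay elapses

	sent := notifyAfter(time.Hour, escalationStop, func() {
		fmt.Println("notification sent")
	})
	fmt.Println("delivered:", sent) // prints "delivered: false"
}
```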

All fixes maintain proper lock ordering and prevent deadlocks by ensuring locks are held when accessing shared maps.
2025-11-07 09:12:28 +00:00
rcourtman
c8e0281953 Add comprehensive alert system reliability improvements
This commit implements critical reliability features to prevent data loss
and improve alert system robustness:

**Persistent Notification Queue:**
- SQLite-backed queue with WAL journaling for crash recovery
- Dead Letter Queue (DLQ) for notifications that exhaust retries
- Exponential backoff retry logic (100ms → 200ms → 400ms)
- Full audit trail for all notification delivery attempts
- New file: internal/notifications/queue.go (661 lines)
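
The backoff schedule listed above (100ms → 200ms → 400ms) corresponds to doubling a 100ms base on each attempt. A minimal sketch, with `backoffDelay` as a hypothetical helper name and zero-based attempt numbering as an assumption:

```go
package main

import (
	"fmt"
	"time"
)

// backoffDelay returns the wait before retry number attempt (zero-based),
// doubling a 100ms base each time. Negative attempts are treated as zero.
func backoffDelay(attempt int) time.Duration {
	const base = 100 * time.Millisecond
	if attempt < 0 {
		attempt = 0
	}
	return base << attempt
}

func main() {
	for attempt := 0; attempt < 3; attempt++ {
		fmt.Println(backoffDelay(attempt))
	}
	// prints "100ms", "200ms", "400ms" on separate lines
}
```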

**DLQ Management API:**
- GET /api/notifications/dlq - Retrieve DLQ items
- GET /api/notifications/queue/stats - Queue statistics
- POST /api/notifications/dlq/retry - Retry failed notifications
- POST /api/notifications/dlq/delete - Delete DLQ items
- New file: internal/api/notification_queue.go (145 lines)

**Prometheus Metrics:**
- 18 comprehensive metrics for alerts and notifications
- Metric hooks integrated via function pointers to avoid import cycles
- /metrics endpoint exposed for Prometheus scraping
- New file: internal/metrics/alert_metrics.go (193 lines)
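
Hooking metrics in through function pointers might look roughly like the sketch below. Everything lives in one file here for brevity. The hook name `recordAlertFired` appears in a later commit in this log, but its signature, the `fireAlert` function, and the counter map are assumptions.

```go
package main

import "fmt"

// --- would live in the alerts package ---

// recordAlertFired is a hook variable. It stays nil until some other package
// assigns it, so the alerts package never has to import the metrics package.
var recordAlertFired func(level string)

// fireAlert handles an alert and reports it through the hook if one is wired.
func fireAlert(level string) {
	// ... alert handling elided ...
	if recordAlertFired != nil {
		recordAlertFired(level)
	}
}

// --- would live in the metrics package ---

// firedByLevel is a stand-in for a real Prometheus counter.
var firedByLevel = map[string]int{}

// countFired increments the counter for the given alert level.
func countFired(level string) {
	firedByLevel[level]++
}

func main() {
	fireAlert("warning") // hook not wired yet: nothing is counted

	// Wiring happens once at startup, in code that may import both packages.
	recordAlertFired = countFired
	fireAlert("critical")

	fmt.Println(firedByLevel["warning"], firedByLevel["critical"]) // prints "0 1"
}
```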

**Alert History Reliability:**
- Exponential backoff retry for history saves (3 attempts)
- Automatic backup restoration on write failure
- Modified: internal/alerts/history.go

**Flapping Detection:**
- Detects and suppresses rapidly oscillating alerts
- Configurable window (default: 5 minutes)
- Configurable threshold (default: 5 state changes)
- Configurable cooldown (default: 15 minutes)
- Automatic cleanup of inactive flapping history
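
A sliding-window sketch of flapping detection using the defaults listed above (5 minute window, 5 state changes, 15 minute cooldown). The `flapDetector` type and its methods are hypothetical, and treating "reaches the threshold" as "greater than or equal to" is an assumption.

```go
package main

import (
	"fmt"
	"time"
)

// flapDetector counts state changes per alert inside a sliding window.
type flapDetector struct {
	window          time.Duration
	cooldown        time.Duration
	threshold       int
	history         map[string][]time.Time
	suppressedUntil map[string]time.Time
}

// newFlapDetector uses the defaults named in the commit message.
func newFlapDetector() *flapDetector {
	return &flapDetector{
		window:          5 * time.Minute,
		cooldown:        15 * time.Minute,
		threshold:       5,
		history:         map[string][]time.Time{},
		suppressedUntil: map[string]time.Time{},
	}
}

// recordChange notes a state change and reports whether the alert should be
// suppressed as flapping.
func (d *flapDetector) recordChange(id string, now time.Time) bool {
	if until, ok := d.suppressedUntil[id]; ok && now.Before(until) {
		return true // still inside the cooldown
	}
	// Keep only the changes that fall inside the window, then add this one.
	cutoff := now.Add(-d.window)
	kept := d.history[id][:0]
	for _, t := range d.history[id] {
		if t.After(cutoff) {
			kept = append(kept, t)
		}
	}
	kept = append(kept, now)
	d.history[id] = kept

	if len(kept) >= d.threshold {
		d.suppressedUntil[id] = now.Add(d.cooldown)
		return true
	}
	return false
}

// demoFlapping records six changes one minute apart for a single alert.
func demoFlapping() []bool {
	d := newFlapDetector()
	t0 := time.Date(2025, 11, 6, 16, 0, 0, 0, time.UTC)
	var results []bool
	for i := 0; i < 6; i++ {
		at := t0.Add(time.Duration(i) * time.Minute)
		results = append(results, d.recordChange("vm-100-cpu", at))
	}
	return results
}

func main() {
	fmt.Println(demoFlapping()) // prints "[false false false false true true]"
}
```

This sketch is not safe for concurrent use; a real implementation would need the caller to hold a lock around `recordChange`.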

**Alert TTL & Auto-Cleanup:**
- MaxAlertAgeDays: Auto-cleanup old alerts (default: 7 days)
- MaxAcknowledgedAgeDays: Faster cleanup for acked alerts (default: 1 day)
- AutoAcknowledgeAfterHours: Auto-ack long-running alerts (default: 24 hours)
- Prevents memory leaks from long-running alerts

**WebSocket Broadcast Sequencer:**
- Channel-based sequencing ensures ordered message delivery
- 100ms coalescing window for rapid state updates
- Prevents race conditions in WebSocket broadcasts
- Modified: internal/websocket/hub.go
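
The coalescing idea can be sketched with a single goroutine that owns the pending state and flushes it once per window. This is an assumption-laden illustration, not the code in `hub.go`: the `coalesce` function, its string payloads, and the flush-on-close behavior are all invented for the example.

```go
package main

import (
	"fmt"
	"time"
)

// coalesce forwards at most one value per window from in to out, always the
// most recent one. A single goroutine owns the state, so delivery is ordered.
// Sends on out block if the receiver is slow and out has no free buffer.
func coalesce(in <-chan string, window time.Duration, out chan<- string) {
	defer close(out)

	var pending string
	var hasPending bool
	var timer <-chan time.Time // nil channel: this select case is disabled

	for {
		select {
		case v, ok := <-in:
			if !ok {
				if hasPending {
					out <- pending // flush the last state on shutdown
				}
				return
			}
			pending, hasPending = v, true
			if timer == nil {
				timer = time.After(window) // start the window on first update
			}
		case <-timer:
			out <- pending
			hasPending, timer = false, nil
		}
	}
}

func main() {
	in := make(chan string)
	out := make(chan string, 8)
	go coalesce(in, 100*time.Millisecond, out)

	// Three rapid updates normally collapse into one broadcast of the latest.
	for _, s := range []string{"state-1", "state-2", "state-3"} {
		in <- s
	}
	close(in)

	for msg := range out {
		fmt.Println(msg)
	}
}
```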

**Configuration Fields Added:**
- FlappingEnabled, FlappingWindowSeconds, FlappingThreshold, FlappingCooldownMinutes
- MaxAlertAgeDays, MaxAcknowledgedAgeDays, AutoAcknowledgeAfterHours

All features are production-ready and build successfully.
2025-11-06 16:46:30 +00:00
rcourtman
d643dcf0bc perf: reduce polling allocations and guest metadata load
2025-10-25 13:12:47 +00:00
rcourtman
f46ff1792b Fix settings security tab navigation
2025-10-11 23:29:47 +00:00