When Proxmox backs up a powered-off VM, the VM's status briefly
changes (e.g., to "running" for the duration of the backup). This
caused the powered-off alert to be cleared, deleting its ackState
record. When the backup
completed and the alert was recreated, it appeared as a new unacknowledged
alert, generating a new notification.
The fix preserves ackState when alerts are removed, allowing
preserveAlertState to restore the acknowledgement when the same alert
reappears. Stale ackState entries (for alerts that no longer exist) are
cleaned up after 1 hour.
Related to #937
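A minimal sketch of the preservation logic, assuming hypothetical shapes
for Manager and its maps (only the names ackState and preserveAlertState
come from the change itself):

```go
package main

import (
	"fmt"
	"time"
)

// ackEntry records an acknowledgement, plus when its alert was removed
// so stale entries can be pruned later. Hypothetical shape; the real
// struct may differ.
type ackEntry struct {
	ackedBy   string
	removedAt time.Time // zero while the alert is still active
}

type Manager struct {
	active   map[string]bool      // alertID -> currently firing
	ackState map[string]*ackEntry // alertID -> acknowledgement
}

// removeAlert clears the alert but intentionally keeps ackState,
// only stamping removedAt for later cleanup.
func (m *Manager) removeAlert(id string) {
	delete(m.active, id)
	if e, ok := m.ackState[id]; ok {
		e.removedAt = time.Now()
	}
}

// preserveAlertState restores the acknowledgement if the same alert
// reappears, so no duplicate notification is sent.
func (m *Manager) preserveAlertState(id string) (acked bool) {
	m.active[id] = true
	if e, ok := m.ackState[id]; ok {
		e.removedAt = time.Time{} // alert is live again
		return true
	}
	return false
}

// cleanupStaleAcks drops ackState for alerts gone longer than maxAge.
func (m *Manager) cleanupStaleAcks(maxAge time.Duration) {
	for id, e := range m.ackState {
		if !e.removedAt.IsZero() && time.Since(e.removedAt) > maxAge {
			delete(m.ackState, id)
		}
	}
}

func main() {
	m := &Manager{active: map[string]bool{}, ackState: map[string]*ackEntry{}}
	m.active["vm-101-powered-off"] = true
	m.ackState["vm-101-powered-off"] = &ackEntry{ackedBy: "admin"}

	m.removeAlert("vm-101-powered-off") // backup flips the status
	fmt.Println(m.preserveAlertState("vm-101-powered-off")) // true: ack survived
	m.cleanupStaleAcks(time.Hour)
}
```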
- Add registry checker tests (caching, enable/disable, parsing, concurrency)
- Add alert integration tests for update detection and Pro license gating
- Add API handler tests for /api/infra-updates endpoints
- Test cleanup of tracking maps when containers are removed
- Test threshold-based alerting behavior
- Add FeatureUpdateAlerts constant for Pro license gating
- Add feature to all Pro tier feature lists
- Add SetLicenseChecker method to alerts Manager
- Check Pro license in checkDockerContainerImageUpdate before alerting
- Wire license checker from router to alert manager
Free users still see update badges in the UI.
Pro users get proactive alerts after 24h of pending updates.
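A minimal sketch of the gating, assuming a hypothetical LicenseChecker
interface (only SetLicenseChecker, FeatureUpdateAlerts, and
checkDockerContainerImageUpdate are names from the change):

```go
package main

import "fmt"

// LicenseChecker is the dependency injected via SetLicenseChecker.
// The interface shape here is an assumption.
type LicenseChecker interface {
	HasFeature(feature string) bool
}

const FeatureUpdateAlerts = "update-alerts"

type Manager struct {
	license LicenseChecker
}

func (m *Manager) SetLicenseChecker(lc LicenseChecker) { m.license = lc }

// checkDockerContainerImageUpdate gates alerting on the Pro feature.
// The free-tier UI badge path (not shown) is unaffected.
func (m *Manager) checkDockerContainerImageUpdate(container string, updatePending bool) {
	if !updatePending {
		return
	}
	if m.license == nil || !m.license.HasFeature(FeatureUpdateAlerts) {
		return // free tier: badge only, no alert
	}
	fmt.Printf("alert: image update pending for %s\n", container)
}

type proLicense struct{}

func (proLicense) HasFeature(string) bool { return true }

func main() {
	m := &Manager{}
	m.SetLicenseChecker(proLicense{}) // wired from the router at startup
	m.checkDockerContainerImageUpdate("nginx", true)
}
```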
- Add GHCR (GitHub Container Registry) token support for public images
- Clean up dockerUpdateFirstSeen tracking when containers are removed
- Improve UpdateIcon tooltip to show digest info
- Add cursor-help to indicate hoverable tooltip
Acknowledged alerts were still triggering repeated webhook notifications
because the re-notification logic only checked cooldown period, not
acknowledgment status. Now acknowledged alerts are skipped entirely.
Related to #921
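A sketch of the fixed check, with illustrative field names:

```go
package main

import (
	"fmt"
	"time"
)

// Alert carries the two fields the re-notification loop now consults.
type Alert struct {
	ID           string
	Acknowledged bool
	LastNotified time.Time
}

// shouldRenotify previously checked only the cooldown; the fix adds
// the acknowledgment check so acked alerts are skipped entirely.
func shouldRenotify(a Alert, cooldown time.Duration) bool {
	if a.Acknowledged {
		return false // fixed: acked alerts never re-fire webhooks
	}
	return time.Since(a.LastNotified) >= cooldown
}

func main() {
	acked := Alert{ID: "cpu_critical", Acknowledged: true, LastNotified: time.Now().Add(-2 * time.Hour)}
	fresh := Alert{ID: "mem_warning", LastNotified: time.Now().Add(-2 * time.Hour)}
	fmt.Println(shouldRenotify(acked, time.Hour)) // false
	fmt.Println(shouldRenotify(fresh, time.Hour)) // true
}
```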
Backend:
- Enhanced buildEnrichedResourceContext to always show learned baselines with
status indicators (normal/elevated/anomaly), not only when a metric is anomalous
- This makes Pulse Pro's 'moat' visible: users can see the AI understands
their infrastructure's normal behavior patterns
- Added baseline import to service.go
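One plausible shape for the normal/elevated/anomaly classification; the
mean/stddev model and the 2σ/3σ cutoffs are assumptions for illustration,
not Pulse's actual baseline math:

```go
package main

import "fmt"

// classifyBaseline maps a current reading against a learned baseline
// to one of the three status indicators.
func classifyBaseline(current, mean, stddev float64) string {
	if stddev == 0 {
		stddev = 1 // avoid division by zero on flat baselines
	}
	z := (current - mean) / stddev
	switch {
	case z >= 3:
		return "anomaly"
	case z >= 2:
		return "elevated"
	default:
		return "normal"
	}
}

func main() {
	// CPU baseline learned at 35% ± 8%.
	fmt.Println(classifyBaseline(38, 35, 8)) // normal
	fmt.Println(classifyBaseline(55, 35, 8)) // elevated
	fmt.Println(classifyBaseline(70, 35, 8)) // anomaly
}
```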
Frontend (user-facing changes):
- Added incident event type filtering with toggle buttons
- Added resource incident panel to view all incidents for a resource
- Added timeline expand/collapse functionality in alert history
- Added incident note saving with proper incidentId tracking
- Added startedAt parameter for proper incident timeline loading
- Fixed normalizeStorageDefaults to allow Trigger=0 (see the sketch below)
- Fixed normalizeNodeDefaults (Temperature) to allow Trigger=0
- Added comprehensive tests for all threshold normalization patterns
- Updated an existing test that expected the old behavior
Related to #864
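A minimal sketch of the zero-versus-unset distinction that lets
Trigger=0 survive normalization; the pointer representation is an
assumption about how the fix might look:

```go
package main

import "fmt"

// Threshold uses a pointer so "unset" (nil) is distinguishable from an
// explicit 0. Treating the zero value as "unset" was the original bug:
// a user-configured Trigger=0 was overwritten with the default.
type Threshold struct {
	Trigger *float64
}

func normalizeStorageDefaults(t *Threshold, def float64) {
	if t.Trigger == nil { // only fill genuinely missing values
		t.Trigger = &def
	}
}

func main() {
	zero := 0.0
	explicit := Threshold{Trigger: &zero}
	unset := Threshold{}

	normalizeStorageDefaults(&explicit, 85)
	normalizeStorageDefaults(&unset, 85)

	fmt.Println(*explicit.Trigger) // 0 — preserved
	fmt.Println(*unset.Trigger)    // 85 — default applied
}
```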
Addresses issue #861: syslog flooded on Docker host
Many routine operational messages were being logged at INFO level,
causing excessive log volume when monitoring multiple VMs/containers.
These messages are now logged at DEBUG level:
- Guest threshold checking (every guest, every poll cycle)
- Storage threshold checking (every storage, every poll cycle)
- Host agent linking messages
- Filesystem inclusion in disk calculation
- Guest agent disk usage replacement
- Polling start/completion messages
- Alert cleanup and save messages
Users can set LOG_LEVEL=debug to see these messages if needed for
troubleshooting. The default INFO level now produces significantly
less log output.
Also updated documentation in CONFIGURATION.md and DOCKER.md to:
- Clarify what each log level includes
- Add tip about using LOG_LEVEL=warn for minimal logging
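For illustration, a standard-library version of the leveled-logging
setup; Pulse's actual logger may differ:

```go
package main

import (
	"log/slog"
	"os"
)

// parseLevel maps the LOG_LEVEL variable to an slog level.
func parseLevel(s string) slog.Level {
	switch s {
	case "debug":
		return slog.LevelDebug
	case "warn":
		return slog.LevelWarn
	default:
		return slog.LevelInfo // default: routine messages suppressed
	}
}

func main() {
	h := slog.NewTextHandler(os.Stderr, &slog.HandlerOptions{
		Level: parseLevel(os.Getenv("LOG_LEVEL")),
	})
	log := slog.New(h)

	// Demoted messages: visible only with LOG_LEVEL=debug.
	log.Debug("checking guest thresholds", "guest", "vm-101")
	log.Debug("polling cycle complete", "duration_ms", 42)

	// Still at INFO: one-off lifecycle events.
	log.Info("monitor started")
}
```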
Adds FreshHours and StaleHours settings to control when the dashboard
backup indicator shows green (fresh), amber (stale), or red (critical).
- Backend: Added FreshHours/StaleHours to BackupAlertConfig (default 24/72 hours)
- Frontend: getBackupInfo() now accepts optional thresholds parameter
- Dashboard/GuestRow components use thresholds from alert config
- Settings saved/loaded with alert configuration
Closes #839
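A sketch of the resulting classification using the 24/72-hour defaults;
the function shape is illustrative:

```go
package main

import (
	"fmt"
	"time"
)

// backupStatus mirrors the green/amber/red indicator. Only FreshHours/
// StaleHours and their 24/72 defaults come from the change itself.
func backupStatus(lastBackup time.Time, freshHours, staleHours int) string {
	age := time.Since(lastBackup)
	switch {
	case age <= time.Duration(freshHours)*time.Hour:
		return "green" // fresh
	case age <= time.Duration(staleHours)*time.Hour:
		return "amber" // stale
	default:
		return "red" // critical
	}
}

func main() {
	fmt.Println(backupStatus(time.Now().Add(-6*time.Hour), 24, 72))  // green
	fmt.Println(backupStatus(time.Now().Add(-48*time.Hour), 24, 72)) // amber
	fmt.Println(backupStatus(time.Now().Add(-96*time.Hour), 24, 72)) // red
}
```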
Connect alert system to failure prediction:
1. Add AlertCallback to HistoryManager:
- OnAlert() method to register callbacks
- Callbacks invoked when alerts are added
- Called outside lock to prevent deadlocks
2. Expose OnAlertHistory() on alerts.Manager:
- Pass-through to HistoryManager.OnAlert()
- Enables external systems to track alerts
3. Wire pattern detector in router startup:
- Register callback when pattern detector is created
- Convert alert types to trackable events
- Pattern detector now learns from production alerts
Now every alert (memory_warning, cpu_critical, etc.) is recorded as
a historical event for pattern analysis. The AI can predict:
'High memory usage typically occurs every ~3 days (next expected in ~1 day)'
All tests passing.
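A sketch of the callback pattern from steps 1-3, with illustrative
types; the key detail is snapshotting callbacks under the lock and
invoking them after release:

```go
package main

import (
	"fmt"
	"sync"
)

type Alert struct{ Type string }

type HistoryManager struct {
	mu        sync.Mutex
	alerts    []Alert
	callbacks []func(Alert)
}

// OnAlert registers a callback; OnAlertHistory on alerts.Manager would
// simply pass through to this.
func (h *HistoryManager) OnAlert(cb func(Alert)) {
	h.mu.Lock()
	defer h.mu.Unlock()
	h.callbacks = append(h.callbacks, cb)
}

// AddAlert stores the alert under the lock, then invokes callbacks
// after releasing it, so a callback that calls back into the manager
// cannot deadlock.
func (h *HistoryManager) AddAlert(a Alert) {
	h.mu.Lock()
	h.alerts = append(h.alerts, a)
	cbs := append([]func(Alert){}, h.callbacks...) // snapshot under lock
	h.mu.Unlock()

	for _, cb := range cbs { // invoked outside the lock
		cb(a)
	}
}

func main() {
	h := &HistoryManager{}
	// The pattern detector registers here at router startup and converts
	// alert types into trackable events.
	h.OnAlert(func(a Alert) { fmt.Println("pattern detector recorded:", a.Type) })
	h.AddAlert(Alert{Type: "memory_warning"})
}
```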
ClearActiveAlerts triggers an async save to disk, which can race with
LoadActiveAlerts reading the file. The test now clears the in-memory
map directly without triggering the async save.
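A sketch of the test-only reset, with illustrative names:

```go
package alerts

import "sync"

type Alert struct{ ID string }

type Manager struct {
	mu           sync.Mutex
	activeAlerts map[string]*Alert
}

// clearActiveAlertsForTest resets the in-memory map directly, avoiding
// ClearActiveAlerts' async disk save that can race a concurrent
// LoadActiveAlerts reading the same file.
func clearActiveAlertsForTest(m *Manager) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.activeAlerts = make(map[string]*Alert)
}
```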
- Rename checkFlapping to checkFlappingLocked to clarify lock contract
- Replace goto statements with structured control flow
- Wire up unused recordAlertFired/recordAlertResolved metric hooks
- Add trackingMapCleanup goroutine to prevent memory leaks from stale entries (see the sketch after this list)
- Tighten alert ID validation to alphanumeric + safe punctuation
- Fix history save error handling to properly manage backup lifecycle
- Add auto-migration for deprecated GroupingWindow field
- Refactor 300+ line UpdateConfig into focused helper functions
- Unify duplicate evaluateVMCondition/evaluateContainerCondition
- Add constants for magic numbers (thresholds, timing, flapping)
- Update tests to match new backup behavior
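A sketch of the trackingMapCleanup goroutine mentioned above, with
illustrative types:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// tracker holds first-seen timestamps keyed by container ID, the kind
// of map that leaks when containers disappear.
type tracker struct {
	mu   sync.Mutex
	seen map[string]time.Time
}

// startCleanup launches a goroutine that periodically drops entries
// older than maxAge, bounding the map even if removal events are missed.
func (t *tracker) startCleanup(interval, maxAge time.Duration, stop <-chan struct{}) {
	go func() {
		ticker := time.NewTicker(interval)
		defer ticker.Stop()
		for {
			select {
			case <-ticker.C:
				t.mu.Lock()
				for id, first := range t.seen {
					if time.Since(first) > maxAge {
						delete(t.seen, id)
					}
				}
				t.mu.Unlock()
			case <-stop:
				return
			}
		}
	}()
}

func main() {
	t := &tracker{seen: map[string]time.Time{"old": time.Now().Add(-48 * time.Hour)}}
	stop := make(chan struct{})
	t.startCleanup(10*time.Millisecond, 24*time.Hour, stop)
	time.Sleep(50 * time.Millisecond)
	t.mu.Lock()
	fmt.Println(len(t.seen)) // 0 — stale entry pruned
	t.mu.Unlock()
	close(stop)
}
```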
Add missing godoc comments to:
- BuildGuestKey in alerts/alerts.go
- GenerateMockData in mock/generator.go
- NewDockerUpdater, NewAURUpdater in updates/adapter_installsh.go
- NewMockUpdater in updates/mock_updater.go
Add comprehensive tests for the saveHistoryWithRetry function covering:
- Backup file creation from existing history
- Empty history serialization
- Single retry success
- Read-only directory failure with retries
- Concurrent saves with serialization via saveMu
- Snapshot isolation during save
Coverage: saveHistoryWithRetry 58.6% → 86.2%
Coverage: alerts package 87.4% → 87.8%
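Reconstructed from those test cases, a sketch of the save path under
test; the retry count, rename-based backup, and file mode are all
assumptions:

```go
package alerts

import (
	"encoding/json"
	"os"
	"sync"
	"time"
)

type Manager struct {
	mu      sync.RWMutex
	saveMu  sync.Mutex // serializes concurrent saves
	history []string
	path    string
}

// saveHistoryWithRetry snapshots under the read lock (isolation),
// backs up the existing file, then retries the write a few times.
func (m *Manager) saveHistoryWithRetry() error {
	m.saveMu.Lock()
	defer m.saveMu.Unlock()

	m.mu.RLock()
	snapshot := append([]string(nil), m.history...) // snapshot isolation
	m.mu.RUnlock()

	if _, err := os.Stat(m.path); err == nil {
		_ = os.Rename(m.path, m.path+".bak") // back up existing history
	}

	data, err := json.Marshal(snapshot)
	if err != nil {
		return err
	}
	for attempt := 0; attempt < 3; attempt++ {
		if err = os.WriteFile(m.path, data, 0o644); err == nil {
			return nil
		}
		time.Sleep(50 * time.Millisecond)
	}
	return err // all retries failed; .bak preserves the old history
}
```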
Add TestDispatchAlert with 8 test cases covering:
- Returns false when onAlert callback is nil
- Returns false when alert is nil
- Returns false when activation state is pending
- Returns false when activation state is snoozed
- Returns false for monitor-only alerts
- Dispatches synchronously when async is false
- Dispatches asynchronously when async is true
- Clones alert before dispatch
Alerts package coverage: 83.4% → 83.5%
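The contract those cases pin down, reconstructed as a sketch with
assumed types:

```go
package main

import "fmt"

type activationState int

const (
	stateActive activationState = iota
	statePending
	stateSnoozed
)

type Alert struct {
	ID          string
	MonitorOnly bool
	State       activationState
}

type Manager struct {
	onAlert func(*Alert)
}

// dispatchAlert returns false for nil callbacks, nil alerts, pending or
// snoozed activation states, and monitor-only alerts; otherwise it
// clones the alert and dispatches sync or async.
func (m *Manager) dispatchAlert(a *Alert, async bool) bool {
	if m.onAlert == nil || a == nil {
		return false
	}
	if a.State == statePending || a.State == stateSnoozed {
		return false
	}
	if a.MonitorOnly {
		return false
	}
	clone := *a // clone before dispatch so callbacks can't mutate the original
	if async {
		go m.onAlert(&clone)
	} else {
		m.onAlert(&clone)
	}
	return true
}

func main() {
	m := &Manager{onAlert: func(a *Alert) { fmt.Println("dispatched:", a.ID) }}
	fmt.Println(m.dispatchAlert(&Alert{ID: "cpu_critical"}, false)) // true
	fmt.Println(m.dispatchAlert(nil, false))                        // false
}
```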
Tests using NewManager() were sharing /etc/pulse/alerts, causing race
conditions when running in parallel. Added newTestManager(t) helper that
creates isolated temp directories for each test.
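A sketch of such a helper, assuming a constructor variant
(NewManagerWithDir is hypothetical) that accepts the data directory:

```go
package alerts

import "testing"

type Manager struct{ dir string }

// NewManagerWithDir is an assumed constructor variant that takes the
// data directory instead of defaulting to /etc/pulse/alerts.
func NewManagerWithDir(dir string) *Manager { return &Manager{dir: dir} }

// newTestManager gives each test an isolated temp directory (cleaned up
// automatically by the testing framework), removing the shared-state
// race between parallel tests.
func newTestManager(t *testing.T) *Manager {
	t.Helper()
	return NewManagerWithDir(t.TempDir())
}
```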