Pulse

vrr/Pulse

mirror of https://github.com/rcourtman/Pulse.git synced 2026-04-29 20:10:21 +00:00

Author	SHA1	Message	Date
rcourtman	e46239d8ac	Preserve queued recovery notifications on alert cancellation (#1350)	2026-03-25 13:18:33 +00:00
rcourtman	464d3f8486	Fix stale queued notification delivery	2026-03-05 23:46:35 +00:00
rcourtman	beae4c860c	fix: address 6 security and reliability issues Security fixes: - Auto-register now requires settings:write scope for API tokens - X-Forwarded-For in auto-register only trusted from verified proxies - Public URL capture requires authentication (no loopback bypass) - Lockout reset now uses RequireAdmin for session users Reliability fixes: - Docker stop command expiration clears PendingUninstall flag - Cancelled notifications get completed_at set and are cleaned up	2026-02-03 17:32:44 +00:00
rcourtman	b2639ed5a5	Fix security vulnerabilities and critical bugs - Fix WebSocket CORS bypass by strictly verifying origin - Fix OIDC refresh token persistence by encrypting at rest - Fix grouped webhook data mutation by cloning alerts - Fix host agent uninstall authorization and config fetch logic - Fix notification queue recovery for stuck sending items - Fix ignored update history limit parameter - Fix ineffective break statement in WebSocket write pump	2026-02-03 17:16:27 +00:00
rcourtman	3b347b6548	fix: harden SQLite against I/O contention causing persistent lock errors - Move all SQLite pragmas from db.Exec() to DSN parameters so every connection the pool creates gets busy_timeout and other settings. Previously only the first connection had these applied. - Set MaxOpenConns(1) on audit, RBAC, and notification databases (metrics already had this). Fixes potential for multiple connections where new ones lack busy_timeout. - Increase busy_timeout from 5s to 30s across all databases to tolerate disk I/O pressure during backup windows. - Fix nested query deadlocks in GetRoles(), GetUserAssignments(), and CancelByAlertIDs() that would deadlock with MaxOpenConns(1). - Fix circuit breaker retryInterval not resetting on recovery, which caused the next trip to start at 5-minute backoff instead of 5s. Related to #1156	2026-02-02 17:29:14 +00:00
rcourtman	c44cb5af5b	fix: use pure Go SQLite driver for arm64 compatibility Switch from mattn/go-sqlite3 (CGO) to modernc.org/sqlite (pure Go) for auth, audit, and notification queue storage. This enables SQLite functionality on arm64 Docker images which are built with CGO_ENABLED=0. Related to #1140	2026-01-21 18:58:23 +00:00
rcourtman	3427aa7f01	fix: Deadlock in CancelByAlertIDs and add tests Fixed deadlock where CancelByAlertIDs held nq.mu.Lock() and then called UpdateStatus() which also tried to acquire the same lock. Now uses direct SQL while holding the lock. Tests added for CancelByAlertIDs: - No matching notifications (notification stays pending) - Matching notification cancelled - Multiple alerts with partial match (any match cancels) Coverage: CancelByAlertIDs 65.7% -> 81.1%	2025-12-02 10:40:07 +00:00
rcourtman	01f7d81d38	style: fix gofmt formatting inconsistencies Run gofmt -w to fix tab/space inconsistencies across 33 files.	2025-11-26 23:44:36 +00:00
rcourtman	d7766af799	Fix backend test failures blocking release workflow Three categories of fixes: 1. Goroutine leak causing 10-minute timeout: - Add defer mon.notificationMgr.Stop() in monitor_memory_test.go - Background goroutines from notification manager weren't being stopped 2. Database NULL column scanning errors: - Change LastError from string to *string in queue.go - Change PayloadBytes from int to *int in queue.go - SQL NULL values require pointer types in Go 3. SSRF protection blocking test servers: - Check allowlist for localhost before rejecting in notifications.go - Set PULSE_DATA_DIR to temp directory in tests - Add defer nm.Stop() calls to prevent goroutine leaks Fixes for preflight test failures in workflow run 19280879903.	2025-11-11 23:27:03 +00:00
rcourtman	99e5a38534	Fix critical monitoring system issues and add robustness improvements This commit addresses 9 critical issues identified during the monitoring system audit: Race Conditions Fixed: - PBS backup pollers: Moved lock earlier to eliminate check-then-act race (lines 7316-7378) - PVE backup poll timing: Fixed double write to lastPVEBackupPoll with proper synchronization (lines 5927-5977) - Docker hosts cleanup: Refactored to avoid holding both m.mu and s.mu locks simultaneously (lines 1911-1937) Context Propagation Fixed: - Replaced all context.Background() calls with parent context for proper cancellation chain: - PBS backup poller (line 7367) - PVE backup poller (line 5955) - PBS fallback check (line 7154) Memory Leak Prevention: - Added cleanup for guest metadata cache (10 minute TTL, lines 1942-1957) - Added cleanup for diagnostic snapshots (1 hour TTL, lines 1959-1987) - Added cleanup for RRD cache (1 minute TTL, lines 1989-2007) - All cleanup methods called on 10-second ticker (lines 3791-3793) Panic Recovery: - Added recoverFromPanic helper to log panics with stack traces (lines 1910-1920) - Protected all critical goroutines: - poll (line 4020) - taskWorker (line 4200) - retryFailedConnections (line 3851) - checkMockAlerts (line 8896) - pollPVEInstance (line 4886) - pollPBSInstance (line 7164) - pollPMGInstance (line 7498) Import Fixes: - Added missing sync import to email_enhanced.go - Added missing os import to queue.go All fixes maintain proper lock ordering and release locks before calling methods that acquire other locks to prevent deadlocks.	2025-11-07 08:52:37 +00:00
rcourtman	7ee11105f5	Implement queue cancellation and atomic DB operations (P1 fixes) Queue cancellation mechanism: - Add CancelByAlertIDs method to mark queued notifications as cancelled when alerts resolve - Update CancelAlert to cancel queued notifications containing resolved alert IDs - Skip cancelled notifications in queue processor - Prevents resolved alerts from triggering notifications after they clear Atomic DB operations: - Add IncrementAttemptAndSetStatus to atomically update attempt counter and status - Replace separate IncrementAttempt + UpdateStatus calls with single atomic operation - Prevents orphaned queue entries when crashes occur between operations - Eliminates race condition where rows get stuck in "pending" or "sending" status These fixes ensure queued notifications are properly cancelled when alerts resolve and prevent database inconsistencies during crash scenarios.	2025-11-07 08:33:09 +00:00
rcourtman	c6a69e525c	Fix critical notification system bugs and security issues Critical fixes (P0): - Fix cooldown timing: Mark cooldown only after successful delivery, not before enqueue - Add os.MkdirAll to queue initialization to prevent silent failures on fresh installs - Add DNS re-validation at webhook send time to prevent DNS rebinding SSRF attacks - Add SSRF validation for Apprise HTTP URLs - Remove secret logging (bot tokens, routing keys) from debug logs - Implement lastNotified cleanup to prevent unbounded memory growth - Use shared HTTP client for webhooks to enable TLS connection reuse - Add fallback to direct sending when queue enqueue fails - Make queue worker concurrent (5 workers with semaphore) to prevent head-of-line blocking - Fix webhook rate limiter race condition with separate mutex - Fix email manager thread safety with mutex on rate limiter - Fix grouping timer leak by adding stopCleanup signal - Fix webhook 429 double sleep (use Retry-After OR backoff, not both) Frontend improvements: - Add queue/DLQ management API methods (getQueueStats, getDLQ, retryDLQItem, deleteDLQItem) - Add getNotificationHealth and getWebhookHistory endpoints - Add Apprise test support to NotificationTestRequest type Related to notification system audit	2025-11-07 08:29:13 +00:00
rcourtman	febce91145	Remove internal development documentation files Remove 4 LLM-generated internal development docs that don't belong in the repository: - MIGRATION_SCAFFOLDING.md - NOTIFICATION_AUDIT.md - NOTIFICATION_QUICK_REFERENCE.md - NOTIFICATION_SYSTEM_MAP.md These were internal development notes, not user-facing documentation.	2025-11-07 08:23:19 +00:00
rcourtman	6a48c759e8	Fix critical notification system bugs and security issues This commit addresses multiple critical issues identified in the notification system audit conducted with Codex: Critical Fixes: 1. Queue Retry Logic (Critical #1) - Fixed broken retry/DLQ system where send functions never returned errors - Made sendGroupedEmail(), sendGroupedWebhook(), sendGroupedApprise() return errors - Made sendWebhookRequest() return errors - ProcessQueuedNotification() now properly propagates errors to queue - Retry logic and DLQ now function correctly 2. Attempt Counter Bug (Critical #2) - Fixed double-increment bug in queue processing - Separated UpdateStatus() from attempt tracking - Added IncrementAttempt() method - Notifications now get correct number of retry attempts 3. Secret Exposure (Critical #3 & #4) - Masked webhook headers and customFields in GET /api/notifications/webhooks - Added redactSecretsFromURL() to sanitize webhook URLs in history - Truncated/redacted response bodies in webhook history - Protected against credential harvesting via API 4. Email Rate Limiting (Critical #5) - Added emailManager field to NotificationManager - Shared EnhancedEmailManager instance across sends - Rate limiter now accumulates across multiple emails - SMTP rate limits are now enforced correctly 5. SSRF Protection (High #6) - Added DNS resolution of webhook URLs - Added isPrivateIP() check using CIDR ranges - Blocks all private IP ranges (10/8, 172.16/12, 192.168/16, 127/8, 169.254/16) - Blocks IPv6 private ranges (::1, fe80::/10, fc00::/7) - Prevents DNS rebinding attacks - Returns error instead of warning for private IPs New Features: 6. Health Endpoint (High #8) - Added GET /api/notifications/health - Returns queue stats (pending, sending, sent, failed, dlq) - Shows email/webhook configuration status - Provides overall health indicator Related to notification system audit Files changed: - internal/notifications/notifications.go: Error returns, rate limiting, SSRF hardening - internal/notifications/queue.go: Attempt tracking fix - internal/api/notifications.go: Secret masking, health endpoint	2025-11-06 23:26:03 +00:00
rcourtman	c8e0281953	Add comprehensive alert system reliability improvements This commit implements critical reliability features to prevent data loss and improve alert system robustness: Persistent Notification Queue: - SQLite-backed queue with WAL journaling for crash recovery - Dead Letter Queue (DLQ) for notifications that exhaust retries - Exponential backoff retry logic (100ms → 200ms → 400ms) - Full audit trail for all notification delivery attempts - New file: internal/notifications/queue.go (661 lines) DLQ Management API: - GET /api/notifications/dlq - Retrieve DLQ items - GET /api/notifications/queue/stats - Queue statistics - POST /api/notifications/dlq/retry - Retry failed notifications - POST /api/notifications/dlq/delete - Delete DLQ items - New file: internal/api/notification_queue.go (145 lines) Prometheus Metrics: - 18 comprehensive metrics for alerts and notifications - Metric hooks integrated via function pointers to avoid import cycles - /metrics endpoint exposed for Prometheus scraping - New file: internal/metrics/alert_metrics.go (193 lines) Alert History Reliability: - Exponential backoff retry for history saves (3 attempts) - Automatic backup restoration on write failure - Modified: internal/alerts/history.go Flapping Detection: - Detects and suppresses rapidly oscillating alerts - Configurable window (default: 5 minutes) - Configurable threshold (default: 5 state changes) - Configurable cooldown (default: 15 minutes) - Automatic cleanup of inactive flapping history Alert TTL & Auto-Cleanup: - MaxAlertAgeDays: Auto-cleanup old alerts (default: 7 days) - MaxAcknowledgedAgeDays: Faster cleanup for acked alerts (default: 1 day) - AutoAcknowledgeAfterHours: Auto-ack long-running alerts (default: 24 hours) - Prevents memory leaks from long-running alerts WebSocket Broadcast Sequencer: - Channel-based sequencing ensures ordered message delivery - 100ms coalescing window for rapid state updates - Prevents race conditions in WebSocket broadcasts - Modified: internal/websocket/hub.go Configuration Fields Added: - FlappingEnabled, FlappingWindowSeconds, FlappingThreshold, FlappingCooldownMinutes - MaxAlertAgeDays, MaxAcknowledgedAgeDays, AutoAcknowledgeAfterHours All features are production-ready and build successfully.	2025-11-06 16:46:30 +00:00
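
Commit c8e0281953 describes the queue's retry schedule as exponential backoff (100ms → 200ms → 400ms). The following is a minimal sketch of that kind of schedule, not the code from internal/notifications/queue.go. The names backoffDelay, baseDelay and maxDelay are hypothetical, and the 5-second cap is an assumption that does not appear in the commit message.

```go
package main

import (
	"fmt"
	"time"
)

// Illustrative values only: the base matches the schedule quoted in the
// commit message, the cap is an assumption made for this sketch.
const (
	baseDelay = 100 * time.Millisecond
	maxDelay  = 5 * time.Second
)

// backoffDelay returns the wait before retry number attempt (0-based).
// The delay starts at base, doubles on every attempt, and never exceeds max.
func backoffDelay(attempt int, base, max time.Duration) time.Duration {
	d := base
	for i := 0; i < attempt; i++ {
		d *= 2
		if d >= max {
			return max
		}
	}
	if d > max {
		return max
	}
	return d
}

func main() {
	// Print the delay for the first three attempts.
	for attempt := 0; attempt < 3; attempt++ {
		fmt.Println(attempt, backoffDelay(attempt, baseDelay, maxDelay))
	}
}
```

Capping the delay keeps a long run of failures from pushing the next retry arbitrarily far into the future. A notification that exhausts its attempts would then move to the dead letter queue, as the commit message describes.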

15 commits
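
Commit 6a48c759e8 mentions an isPrivateIP() check built on CIDR ranges. A minimal sketch of such a check, using only the Go standard library and the ranges listed in that commit message, could look like the following. It is an illustration rather than the repository's implementation: blockedCIDRs, mustParseCIDRs and isPrivateAddr are hypothetical names, and the real function may differ in signature and coverage.

```go
package main

import (
	"fmt"
	"net"
)

// blockedCIDRs lists the ranges named in the commit message.
var blockedCIDRs = []string{
	"10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16",
	"127.0.0.0/8", "169.254.0.0/16",
	"::1/128", "fe80::/10", "fc00::/7",
}

// mustParseCIDRs parses each CIDR string once at startup and panics on a
// malformed entry, since the list is a compile-time constant.
func mustParseCIDRs(cidrs []string) []*net.IPNet {
	nets := make([]*net.IPNet, 0, len(cidrs))
	for _, c := range cidrs {
		_, n, err := net.ParseCIDR(c)
		if err != nil {
			panic(err)
		}
		nets = append(nets, n)
	}
	return nets
}

var blockedNets = mustParseCIDRs(blockedCIDRs)

// isPrivateIP reports whether ip falls inside any blocked range.
func isPrivateIP(ip net.IP) bool {
	for _, n := range blockedNets {
		if n.Contains(ip) {
			return true
		}
	}
	return false
}

// isPrivateAddr is a convenience wrapper for string input. Unparseable
// input is treated as blocked so that callers fail closed.
func isPrivateAddr(s string) bool {
	ip := net.ParseIP(s)
	if ip == nil {
		return true
	}
	return isPrivateIP(ip)
}

func main() {
	for _, addr := range []string{"10.1.2.3", "8.8.8.8", "fe80::1"} {
		fmt.Println(addr, isPrivateAddr(addr))
	}
}
```

To guard against DNS rebinding, as the later commit c6a69e525c describes, a check like this would be applied to the addresses a hostname resolves to at send time, not only when the webhook is configured.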