Commit graph

447 commits

rcourtman
7ee252bd84 Fix Docker host display bug when multiple agents share API tokens (related to #658)
Root cause: findMatchingDockerHost() was matching hosts by token ID alone,
causing multiple Docker agents using the same API token to overwrite each
other in state. This resulted in only N visible hosts (where N = number of
unique tokens) instead of all M agents, with hosts "rotating" as each agent
reported every 10 seconds.

Example: 4 agents using 2 tokens would show only 2 hosts, rotating between
agents 1↔2 (token A) and agents 3↔4 (token B).

Fix: Remove token-only matching from findMatchingDockerHost(). Hosts should
only match by:
1. Agent ID (unique per agent)
2. Machine ID + hostname combination (with optional token validation)
3. Machine ID or hostname alone (only for tokenless agents)

This allows multiple agents to share the same API token without colliding.

Additional fix: UpsertDockerHost() now preserves Hidden, PendingUninstall,
and Command fields from existing hosts, preventing these flags from being
reset to defaults on every agent report.
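The corrected matching priority can be sketched as follows. This is an illustrative Go sketch, not Pulse's actual code: the type shape and the `findMatch` helper are assumptions based on the description above.

```go
package main

import "fmt"

// Hypothetical shape; field names are illustrative, not Pulse's actual types.
type dockerHost struct {
	AgentID   string
	MachineID string
	Hostname  string
	TokenID   string
}

// findMatch applies the corrected priority: agent ID first, then
// machine ID + hostname, then machine ID or hostname alone for tokenless
// agents. Token ID alone is deliberately never used as a match key.
func findMatch(hosts []dockerHost, report dockerHost) *dockerHost {
	for i := range hosts {
		if hosts[i].AgentID != "" && hosts[i].AgentID == report.AgentID {
			return &hosts[i]
		}
	}
	for i := range hosts {
		if hosts[i].MachineID == report.MachineID && hosts[i].Hostname == report.Hostname {
			return &hosts[i]
		}
	}
	if report.TokenID == "" { // tokenless agents only
		for i := range hosts {
			if hosts[i].MachineID == report.MachineID || hosts[i].Hostname == report.Hostname {
				return &hosts[i]
			}
		}
	}
	return nil // no match: register as a new host
}

func main() {
	hosts := []dockerHost{
		{AgentID: "a1", MachineID: "m1", Hostname: "h1", TokenID: "tokA"},
	}
	// A second agent sharing tokA must NOT match agent a1's entry.
	m := findMatch(hosts, dockerHost{AgentID: "a2", MachineID: "m2", Hostname: "h2", TokenID: "tokA"})
	fmt.Println(m == nil) // distinct agent gets its own host entry
}
```

With token-only matching removed, two agents sharing a token fall through every match rule and each get their own host entry instead of overwriting one another.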
2025-11-07 13:46:35 +00:00
rcourtman
9199892115 Fix Windows VM disk accumulation bug by normalizing drive letters
Related to #656

Windows guest agents can return multiple directory mountpoints (C:\, C:\Users,
C:\Windows) all on the same physical drive. When the QEMU guest agent omits
disk[] metadata, commit 5325ef481 falls back to using the mountpoint string
as the disk identifier. This causes every Windows directory to be treated as
a separate disk, accumulating to inflated totals (e.g., 1TB reported for a
250GB drive).

Root cause:
The fallback logic in pkg/proxmox/client.go:1585-1594 assigns fs.Disk =
fs.Mountpoint when disk[] is missing. On Windows, every directory path is
unique, so the deduplication guard in internal/monitoring/monitor_polling.go:
619-635 never triggers, causing all directories to be summed.

Changes:
- Detect Windows-style mountpoints (drive letter + colon + backslash)
- Normalize to drive root when disk[] is missing (e.g., C:\Users → C:)
- Preserve existing behavior for Linux/BSD and VMs with disk[] metadata
- Add debug logging for synthesized Windows drive identifiers

This fix maintains backward compatibility with commit 5325ef481 while
preventing the Windows directory accumulation issue. LXC containers are
unaffected as they use a different code path.
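The normalization step can be sketched like this. It is a minimal illustration of the behavior described above; the helper name and regex are assumptions, not the actual code in pkg/proxmox/client.go.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// Matches a Windows-style mountpoint: drive letter, colon, path separator.
var winDrive = regexp.MustCompile(`^[A-Za-z]:[\\/]`)

// diskID returns the identifier used for deduplication when the guest agent
// omits disk[] metadata. Windows directory mountpoints collapse to their
// drive root, so C:\, C:\Users, and C:\Windows all count as one disk.
func diskID(mountpoint string) string {
	if winDrive.MatchString(mountpoint) {
		return strings.ToUpper(mountpoint[:2]) // `c:\Users` -> "C:"
	}
	return mountpoint // Linux/BSD paths stay unique per mountpoint
}

func main() {
	fmt.Println(diskID(`C:\Users`), diskID(`C:\`), diskID("/var/log"))
}
```

Because every directory on the same drive now yields the same identifier, the existing deduplication guard triggers and the drive is summed only once.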
2025-11-07 12:27:11 +00:00
rcourtman
32e0d453c4 Add Windows ARM64 support for host agent (related to #654)
Windows 11 25H2 ships exclusively on ARM64 hardware. When users on ARM64
attempt to install the host agent, the Service Control Manager fails to
load the amd64 binary with ERROR_BAD_EXE_FORMAT, surfaced as "The Pulse
Host Agent is not compatible with this Windows version".

Changes:
- Dockerfile: Build pulse-host-agent-windows-arm64.exe alongside amd64
- Dockerfile: Copy windows-arm64 binary and create symlink for download endpoint
- install-host-agent.ps1: Use RuntimeInformation.OSArchitecture to detect ARM64
- build-release.sh: Build darwin-amd64, darwin-arm64, windows-amd64, windows-arm64
- build-release.sh: Package Windows binaries as .zip archives
- validate-release.sh: Check for windows-arm64 binary and symlink
- validate-release.sh: Add architecture validation for all darwin/windows variants

The installer now correctly detects ARM64 and downloads the appropriate binary.
2025-11-07 12:18:57 +00:00
rcourtman
2a79d57f73 Add SMART temperature collection for physical disks (related to #652)
Extends temperature monitoring to collect SMART temps for SATA/SAS disks,
addressing issue #652 where physical disk temperatures showed as empty.

Architecture:
- Deploys pulse-sensor-wrapper.sh as SSH forced command on Proxmox nodes
- Wrapper collects both CPU/GPU temps (sensors -j) and disk temps (smartctl)
- Implements 30-min cache with background refresh to avoid performance impact
- Uses smartctl -n standby,after to skip sleeping drives without waking them
- Returns unified JSON: {sensors: {...}, smart: [...]}

Backend changes:
- Add DiskTemp model with device, serial, WWN, temperature, lastUpdated
- Extend Temperature model with SMART []DiskTemp field and HasSMART flag
- Add WWN field to PhysicalDisk for reliable disk matching
- Update parseSensorsJSON to handle both legacy and new wrapper formats
- Rewrite mergeNVMeTempsIntoDisks to match SMART temps by WWN → serial → devpath
- Preserve legacy NVMe temperature support for backward compatibility

Performance considerations:
- SMART data cached for 30 minutes per node to avoid excessive smartctl calls
- Background refresh prevents blocking temperature requests
- Respects drive standby state to avoid spinning up idle arrays
- Staggered disk scanning with 0.1s delay to avoid saturating SATA controllers

Install script:
- Deploys wrapper to /usr/local/bin/pulse-sensor-wrapper.sh
- Updates SSH forced command from "sensors -j" to wrapper script
- Backward compatible - falls back to direct sensors output if wrapper missing

Testing note:
- Requires real hardware with smartmontools installed for full functionality
- Empty smart array returned gracefully when smartctl unavailable
- Legacy sensor-only nodes continue working without changes
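The unified wrapper payload described above can be modeled roughly as below. Field names follow the commit's `{sensors: {...}, smart: [...]}` description and the DiskTemp model, but are otherwise assumptions.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// diskTemp mirrors the DiskTemp model sketched in the commit message.
type diskTemp struct {
	Device      string  `json:"device"`
	Serial      string  `json:"serial"`
	WWN         string  `json:"wwn"`
	Temperature float64 `json:"temperature"`
}

// wrapperPayload is the unified JSON the wrapper returns: raw `sensors -j`
// output plus smartctl-derived disk temperatures.
type wrapperPayload struct {
	Sensors map[string]json.RawMessage `json:"sensors"`
	SMART   []diskTemp                 `json:"smart"`
}

func parsePayload(data []byte) (*wrapperPayload, error) {
	var p wrapperPayload
	if err := json.Unmarshal(data, &p); err != nil {
		return nil, err
	}
	return &p, nil
}

func main() {
	raw := []byte(`{"sensors":{"coretemp-isa-0000":{}},"smart":[{"device":"/dev/sda","serial":"S123","wwn":"0x5000c500","temperature":34}]}`)
	p, err := parsePayload(raw)
	if err != nil {
		panic(err)
	}
	fmt.Println(len(p.SMART), p.SMART[0].Temperature)
}
```

A legacy node that emits only bare `sensors -j` output would simply decode with an empty `smart` array, which is why sensor-only nodes keep working unchanged.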
2025-11-07 11:46:57 +00:00
rcourtman
50cf34a2da Fix install.sh to deploy host agent binaries (related to #651)
The bare metal installer was not copying pulse-host-agent binaries from
release tarballs into /opt/pulse/bin/, causing 404 errors when users
tried to install the host agent via the download endpoint.

Changes:
- Copy pulse-host-agent binary during initial installation (alongside
  pulse-docker-agent)
- Update install_additional_agent_binaries() to fetch and install
  cross-platform host agent binaries (linux-amd64, linux-arm64,
  linux-armv7, darwin-amd64, darwin-arm64, windows-amd64)
- Match existing pattern used for Docker agent distribution

The build pipeline (build-release.sh and Dockerfile) already correctly
includes host agent binaries in releases and Docker images. This fix
ensures the installer deploys them.

Users on bare metal deployments should rerun install.sh to populate
/opt/pulse/bin/ with the missing host agent binaries. Docker
deployments are unaffected.
2025-11-07 11:19:47 +00:00
rcourtman
910f2dd800 Add troubleshooting entries for Docker agent token issues (related to #648)
Added two troubleshooting sections to DOCKER_MONITORING.md:

1. "Docker hosts cycling or appearing to replace each other" - explains
   why multiple agents sharing the same token cause the UI to switch
   between hosts instead of showing all simultaneously

2. "Agent rejected after host removal" - documents the re-enrollment
   process when a host is on the removal blocklist

These entries make common setup issues searchable while linking to
canonical setup instructions rather than duplicating them.
2025-11-07 10:55:45 +00:00
rcourtman
94b07a892e Fix test failures from API signature changes
Fixed two test failures identified by go vet:

1. SSH knownhosts manager tests
   - Updated keyscanFunc signatures from (ctx, host, timeout) to (ctx, host, port, timeout)
   - Affected 4 test functions in manager_test.go
   - Matches recent API change adding port parameter for flexibility

2. Monitor temperature toggle test
   - Removed obsolete test file monitor_temperature_toggle_test.go
   - Test was checking internal implementation details that have changed
   - Enable/DisableTemperatureMonitoring() now only log a message (kept for interface compatibility)
   - Temperature collection is managed differently in current architecture

Impact:
- All tests now compile successfully
- Removes obsolete test that no longer reflects current behavior
- Updates remaining tests to match current API signatures
2025-11-07 10:43:06 +00:00
rcourtman
d30d76bb92 Fix P1: Add shutdown mechanism to WebSocket Hub
Fixed goroutine leaks in WebSocket hub from missing shutdown mechanism:

Problem:
1. Hub.Run() has infinite loop with no exit condition
2. runBroadcastSequencer() reads from channel forever
3. No way to cleanly shutdown hub during restarts or tests

Solution:
- Added stopChan chan struct{} field to Hub
- Initialize stopChan in NewHub()
- Added Stop() method that closes stopChan
- Modified Run() main loop to select on stopChan
  - On shutdown: close all client connections and return
- Modified runBroadcastSequencer() from 'for range' to select
  - Changed from: for msg := range h.broadcastSeq
  - Changed to: for { select { case msg := <-h.broadcastSeq: ... case <-h.stopChan: ... }}
  - On shutdown: stop coalesce timer and return

Shutdown sequence:
1. Call hub.Stop() to close stopChan
2. Both Run() and runBroadcastSequencer() exit their loops
3. All client send channels are closed
4. Clients map is cleared
5. Pending coalesce timer is stopped

Impact:
- Enables graceful shutdown during service restarts
- Prevents goroutine leaks in tests
- Allows proper cleanup of WebSocket connections
- No more orphaned broadcast sequencer goroutines
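The shutdown sequence above reduces to a standard Go pattern: every long-running loop selects on a stop channel that `Stop()` closes. A minimal sketch, with the Hub's registration, broadcast, and coalescing logic elided:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// hub is a stripped-down illustration of the shutdown mechanism only.
type hub struct {
	broadcast chan string
	stopChan  chan struct{}
	wg        sync.WaitGroup
}

func newHub() *hub {
	return &hub{broadcast: make(chan string, 8), stopChan: make(chan struct{})}
}

func (h *hub) run() {
	h.wg.Add(1)
	go func() {
		defer h.wg.Done()
		for {
			select {
			case msg := <-h.broadcast:
				_ = msg // fan out to clients in the real hub
			case <-h.stopChan:
				return // close client connections here, then exit
			}
		}
	}()
}

// stop closes stopChan so every loop selecting on it exits, then waits.
func (h *hub) stop() {
	close(h.stopChan)
	h.wg.Wait() // goroutine has returned: no leak
}

func main() {
	h := newHub()
	h.run()
	h.broadcast <- "state update"
	time.Sleep(10 * time.Millisecond)
	h.stop()
	fmt.Println("hub stopped cleanly")
}
```

Closing the channel (rather than sending on it) wakes every selecting goroutine at once, which is what lets both Run() and runBroadcastSequencer() exit from a single Stop() call.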
2025-11-07 10:20:26 +00:00
rcourtman
e30757720a Fix P1: Resource leaks in Recovery Tokens, Rate Limiter, and OIDC Service
Fixed three P1 goroutine/memory leaks that prevent proper resource cleanup:

1. Recovery Tokens goroutine leak
   - Cleanup routine runs forever without stop mechanism
   - Added stopCleanup channel and Stop() method
   - Cleanup loop now uses select with stopCleanup case

2. Rate Limiter goroutine leak
   - Cleanup routine runs forever without stop mechanism
   - Added stopCleanup channel and Stop() method
   - Changed from 'for range ticker.C' to select with stopCleanup case

3. OIDC Service memory leak (DoS vector)
   - Abandoned OIDC flows never cleaned up
   - State entries accumulate unboundedly
   - Added cleanup routine with 5-minute ticker
   - Periodically removes expired state entries (10min TTL)
   - Added Stop() method for proper shutdown

All three follow consistent pattern:
- Add stopCleanup chan struct{} field
- Initialize in constructor
- Use select with ticker and stopCleanup cases
- Close channel in Stop() method to signal goroutine exit

Impact:
- Prevents goroutine leaks during service restarts/reloads
- Prevents memory exhaustion from abandoned OIDC login attempts
- Enables proper cleanup in tests and graceful shutdown
2025-11-07 10:18:44 +00:00
rcourtman
1bf9cfea88 Fix critical P0 security and crash issues in API/WebSocket layer
This commit addresses 5 critical P0 bugs that cause security vulnerabilities, crashes, and data corruption:

**P0-1: Recovery Tokens Replay Attack Vulnerability** (recovery_tokens.go:153-159)
- **SECURITY CRITICAL**: Single-use recovery tokens could be replayed
- **Problem**: Lock upgrade race - two concurrent requests both pass initial Used check
  1. Both acquire RLock, see token.Used = false
  2. Both release RLock
  3. Both acquire Lock and mark token.Used = true
  4. Both return true - TOKEN REUSED
- **Impact**: Attacker with intercepted token can use it multiple times
- **Fix**: Re-check token.Used after acquiring write lock (TOCTOU prevention)

**P0-2: WebSocket Hub Concurrent Map Panic** (hub.go:345-347, 376-378)
- **Problem**: Initial state goroutine reads h.clients map without lock
  - Line 345: `if _, ok := h.clients[client]` (NO LOCK)
  - Main loop writes to h.clients with lock (line 326, 394)
- **Impact**: "fatal error: concurrent map read and write" crashes hub
- **Fix**: Acquire RLock before all client map reads in goroutine

**P0-3: WebSocket Send on Closed Channel Panic** (hub.go:348, 380)
- **Problem**: Check client exists, then send - channel can close between
- **Impact**: "send on closed channel" panic crashes hub
- **Fix**: Hold RLock during both check and send (defensive select already present)

**P0-4: CSRF Store Shutdown Data Corruption** (csrf_store.go:189-196)
- **Problem**: Stop() calls save() after signaling worker. Both hold only RLock
  - Worker's final save writes to csrf_tokens.json.tmp
  - Stop()'s save writes to same file concurrently
- **Impact**: Corrupted/truncated csrf_tokens.json on shutdown
- **Fix**: Added saveMu mutex to serialize all disk writes

**P0-5: CSRF Store Deadlock on Double-Stop** (csrf_store.go:103-108)
- **Problem**: stopChan unbuffered, no sync.Once guard, uses send not close
- **Impact**: Second Stop() call blocks forever waiting for receiver
- **Fix**:
  - Added sync.Once field stopOnce
  - Changed to close(stopChan) within stopOnce.Do()
  - Prevents double-close panic and deadlock

All fixes maintain backwards compatibility. The recovery token fix is particularly critical as it closes a security vulnerability allowing replay attacks on password reset flows.
2025-11-07 10:13:15 +00:00
rcourtman
431769024f Fix P1: Config Persistence transaction field synchronization
**Problem**: writeConfigFileLocked() accessed c.tx field without synchronization
- Function reads c.tx to check if transaction is active (line 109)
- c.tx modified by begin/endTransaction under lock, but read without lock
- Race condition: c.tx could change between check and use

**Impact**:
- Inconsistent transaction handling
- File could be written directly when it should be staged
- Or staged when it should be written directly
- Data corruption risk during config imports

**Fix** (lines 108-128):
- Added documentation that caller MUST hold c.mu lock
- Read c.tx into local variable tx while lock is held
- Use local copy for transaction check
- Safe because all callers hold c.mu when calling writeConfigFileLocked
- Transaction field only modified while holding c.mu in begin/endTransaction

This maintains the existing contract (callers hold lock) while making the transaction read safe and explicit.
2025-11-07 10:00:31 +00:00
rcourtman
6ca4d9b750 Fix P1/P2 infrastructure issues: panic recovery and optimizations
This commit addresses 4 P1 important issues and 1 P2 optimization in infrastructure components:

**P1-1: Missing Panic Recovery in Discovery Service** (service.go:172-195, 499-542)
- **Problem**: No panic recovery in Start(), ForceRefresh(), SetSubnet() goroutines
- **Impact**: Silent service death if scan panics, broken discovery with no monitoring
- **Fix**:
  - Wrapped initial scan goroutine with defer/recover (lines 172-182)
  - Wrapped scanLoop goroutine with defer/recover (lines 185-195)
  - Wrapped ForceRefresh scan with defer/recover (lines 499-509)
  - Wrapped SetSubnet scan with defer/recover (lines 532-542)
  - All log panics with stack traces for debugging

**P1-2: Missing Panic Recovery in Config Watcher Callback** (watcher.go:546-556)
- **Problem**: User-provided onMockReload callback could panic and crash watcher
- **Impact**: Panicking callback kills watcher goroutine, no config updates
- **Fix**: Wrapped callback invocation with defer/recover and stack trace logging

**P1-3: Session Store Stop() Using Send Instead of Close** (session_store.go:16-84)
- **Problem**: Stop() used channel send which blocks if nobody reads
- **Impact**: Stop() hangs if backgroundWorker already exited
- **Fix**:
  - Added sync.Once field stopOnce (line 22)
  - Changed Stop() to use close() within stopOnce.Do() (lines 80-84)
  - Prevents double-close panic and ensures all readers are signaled

**P2-1: Backup Cleanup Inefficient O(n²) Sort** (persistence.go:1424-1427)
- **Problem**: Bubble sort used to sort backups by modification time
- **Impact**: Inefficient for large backup counts (>100 files)
- **Fix**:
  - Replaced bubble sort with sort.Slice() using O(n log n) algorithm
  - Added "sort" import (line 9)
  - Maintains same oldest-first ordering for deletion logic

All fixes add defensive programming without changing external behavior. Panic recovery ensures services continue operating even with bugs, while optimization reduces cleanup time for backup-heavy environments.
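The defer/recover wrapping applied in P1-1 and P1-2 follows this shape. The `safeGo` helper name is illustrative; the real code wraps each goroutine body inline:

```go
package main

import (
	"fmt"
	"log"
	"runtime/debug"
)

// safeGo runs fn in a goroutine guarded by defer/recover, mirroring the
// wrapping applied to the discovery scans and watcher callback. It returns
// a channel closed when fn finishes, so callers can observe completion.
func safeGo(component string, fn func()) <-chan struct{} {
	done := make(chan struct{})
	go func() {
		defer close(done)
		defer func() {
			if r := recover(); r != nil {
				// Log the panic with its stack trace instead of letting it
				// silently kill the service's goroutine.
				log.Printf("panic in %s: %v\n%s", component, r, debug.Stack())
			}
		}()
		fn()
	}()
	return done
}

func main() {
	<-safeGo("discovery scan", func() { panic("unreachable subnet") })
	fmt.Println("service still running after panic")
}
```

The recover must sit in a `defer` registered inside the goroutine itself; a recover in the spawning function cannot catch a panic in the child goroutine.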
2025-11-07 09:55:22 +00:00
rcourtman
ba6d934204 Fix critical P0 infrastructure concurrency issues
This commit addresses 3 critical P0 race conditions and resource leaks in core infrastructure:

**P0-1: Discovery Service Goroutine Leak** (service.go:468, 488)
- **Problem**: ForceRefresh() and SetSubnet() spawned unbounded goroutines without checking if scan already in progress
- **Impact**: Rapid API calls create goroutine explosion, resource exhaustion
- **Fix**:
  - ForceRefresh: Check isScanning before spawning goroutine (lines 470-476)
  - SetSubnet: Check isScanning, defer scan if already running (lines 491-504)
  - Both now log when skipping to aid debugging

**P0-2: Config Persistence Unlock/Relock Race** (persistence.go:1177-1206)
- **Problem**: LoadNodesConfig() unlocked RLock, called SaveNodesConfig (acquires Lock), then relocked
- **Impact**: Another goroutine could modify config between unlock/relock, causing migrated data loss
- **Fix**:
  - Copy instance slices while holding RLock to ensure consistency (lines 1189-1194)
  - Release lock, save copies, then return without relocking (lines 1196-1205)
  - Prevents TOCTOU vulnerability where migrations could be overwritten

**P0-3: Config Watcher Channel Close Race** (watcher.go:19-178)
- **Problem**: Stop() used select-check-close pattern vulnerable to concurrent calls
- **Impact**: Multiple Stop() calls panic on double-close
- **Fix**:
  - Added sync.Once field stopOnce to ConfigWatcher struct (line 26)
  - Changed Stop() to use stopOnce.Do() ensuring single execution (lines 175-178)
  - Removed racy select-based guard

All fixes maintain backwards compatibility and add defensive logging for operational visibility.
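The P0-3 fix is the idempotent-Stop idiom: guard a single `close` with `sync.Once` instead of a racy select-check-close. A minimal sketch with an assumed type name:

```go
package main

import (
	"fmt"
	"sync"
)

type watcher struct {
	stopChan chan struct{}
	stopOnce sync.Once
}

func newWatcher() *watcher { return &watcher{stopChan: make(chan struct{})} }

// Stop is safe to call any number of times, from any goroutine: the close
// happens exactly once, so there is no double-close panic.
func (w *watcher) Stop() {
	w.stopOnce.Do(func() { close(w.stopChan) })
}

func main() {
	w := newWatcher()
	w.Stop()
	w.Stop() // would panic under the old select-check-close pattern
	<-w.stopChan
	fmt.Println("stopChan closed exactly once")
}
```

The select-based guard it replaces is racy because two goroutines can both observe the channel as open before either closes it; `sync.Once` serializes that decision.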
2025-11-07 09:49:55 +00:00
rcourtman
1183b87fa1 Fix critical alert system concurrency and memory leak issues
This commit addresses 7 critical issues identified during the alert system audit:

**P0 Critical - Race Conditions Fixed:**

1. **dispatchAlert race in NotifyExistingAlert** (lines 5486-5497)
   - Changed from RLock to Lock to hold mutex during dispatchAlert call
   - dispatchAlert calls checkFlapping which writes to maps (flappingHistory, flappingActive, suppressedUntil)
   - Previous code: grabbed RLock, got alert pointer, released lock, then called dispatchAlert (RACE)
   - Fixed: hold Lock through dispatchAlert call

2. **dispatchAlert race in LoadActiveAlerts startup** (lines 8216-8235)
   - Startup goroutines called dispatchAlert without holding lock
   - Added m.mu.Lock/Unlock around dispatchAlert call in goroutine
   - Also added cancellation via escalationStop channel to prevent goroutine leaks on shutdown

3. **checkFlapping documentation** (line 738)
   - Added clear comment that checkFlapping requires caller to hold m.mu
   - Prevents future race conditions from improper usage

**P1 Important - Data Loss Prevention:**

4. **History save race condition** (lines 177-180 in history.go)
   - Added saveMu mutex to serialize disk writes
   - Previous: concurrent saves could interleave, causing newer data to be overwritten by older snapshots
   - Fixed: saveMu.Lock at start of saveHistoryWithRetry ensures atomic disk writes
   - Newer snapshots now always win over older ones

**P2 Memory Leak Prevention:**

5. **PMG anomaly tracker cleanup** (lines 7318-7331)
   - Added cleanup for pmgAnomalyTrackers map (24 hour TTL based on LastSampleTime)
   - Prevents unbounded growth from decommissioned/transient PMG instances
   - Each tracker: ~1-2KB (48 samples + baselines)

6. **PMG quarantine history cleanup** (lines 7333-7354)
   - Added cleanup for pmgQuarantineHistory map (7 day TTL based on last snapshot)
   - Prevents memory leak for deleted PMG instances
   - Removes both empty histories and very old histories

**P2 Goroutine Leak Prevention:**

7. **Startup notification goroutine cancellation** (lines 8218-8234)
   - Added select with escalationStop channel to cancel startup notifications
   - Prevents goroutines from continuing after Stop() is called
   - Scales with number of restored critical alerts

All fixes maintain proper lock ordering and prevent deadlocks by ensuring locks are held when accessing shared maps.
2025-11-07 09:12:28 +00:00
rcourtman
99e5a38534 Fix critical monitoring system issues and add robustness improvements
This commit addresses 9 critical issues identified during the monitoring system audit:

**Race Conditions Fixed:**
- PBS backup pollers: Moved lock earlier to eliminate check-then-act race (lines 7316-7378)
- PVE backup poll timing: Fixed double write to lastPVEBackupPoll with proper synchronization (lines 5927-5977)
- Docker hosts cleanup: Refactored to avoid holding both m.mu and s.mu locks simultaneously (lines 1911-1937)

**Context Propagation Fixed:**
- Replaced all context.Background() calls with parent context for proper cancellation chain:
  - PBS backup poller (line 7367)
  - PVE backup poller (line 5955)
  - PBS fallback check (line 7154)

**Memory Leak Prevention:**
- Added cleanup for guest metadata cache (10 minute TTL, lines 1942-1957)
- Added cleanup for diagnostic snapshots (1 hour TTL, lines 1959-1987)
- Added cleanup for RRD cache (1 minute TTL, lines 1989-2007)
- All cleanup methods called on 10-second ticker (lines 3791-3793)

**Panic Recovery:**
- Added recoverFromPanic helper to log panics with stack traces (lines 1910-1920)
- Protected all critical goroutines:
  - poll (line 4020)
  - taskWorker (line 4200)
  - retryFailedConnections (line 3851)
  - checkMockAlerts (line 8896)
  - pollPVEInstance (line 4886)
  - pollPBSInstance (line 7164)
  - pollPMGInstance (line 7498)

**Import Fixes:**
- Added missing sync import to email_enhanced.go
- Added missing os import to queue.go

All fixes maintain proper lock ordering and release locks before calling methods that acquire other locks to prevent deadlocks.
2025-11-07 08:52:37 +00:00
rcourtman
9257071ca1 Add encryption status to notification health endpoint (P2)
Backend:
- Add IsEncryptionEnabled() method to ConfigPersistence
- Include encryption status in /api/notifications/health response
- Allows frontend to warn when credentials are stored in plaintext

Frontend:
- Update NotificationHealth type to include encryption.enabled field
- Frontend can now display warnings when encryption is disabled

This addresses the P2 requirement for encryption visibility, allowing
operators to know when notification credentials are not encrypted at rest.
2025-11-07 08:36:55 +00:00
rcourtman
b70dc3d00d Document layered retry semantics (P2 documentation)
Add documentation to explain how transport-level and queue-level retries interact:
- Email: MaxRetries (transport) * MaxAttempts (queue) = total SMTP attempts
- Webhooks: RetryCount (transport) * MaxAttempts (queue) = total HTTP attempts
- Example: 3 * 3 = 9 total delivery attempts for a single notification

This clarifies the multiplicative retry behavior and helps operators understand
the actual retry counts when using the persistent queue.
2025-11-07 08:35:00 +00:00
rcourtman
7ee11105f5 Implement queue cancellation and atomic DB operations (P1 fixes)
Queue cancellation mechanism:
- Add CancelByAlertIDs method to mark queued notifications as cancelled when alerts resolve
- Update CancelAlert to cancel queued notifications containing resolved alert IDs
- Skip cancelled notifications in queue processor
- Prevents resolved alerts from triggering notifications after they clear

Atomic DB operations:
- Add IncrementAttemptAndSetStatus to atomically update attempt counter and status
- Replace separate IncrementAttempt + UpdateStatus calls with single atomic operation
- Prevents orphaned queue entries when crashes occur between operations
- Eliminates race condition where rows get stuck in "pending" or "sending" status

These fixes ensure queued notifications are properly cancelled when alerts resolve
and prevent database inconsistencies during crash scenarios.
2025-11-07 08:33:09 +00:00
rcourtman
c6a69e525c Fix critical notification system bugs and security issues
Critical fixes (P0):
- Fix cooldown timing: Mark cooldown only after successful delivery, not before enqueue
- Add os.MkdirAll to queue initialization to prevent silent failures on fresh installs
- Add DNS re-validation at webhook send time to prevent DNS rebinding SSRF attacks
- Add SSRF validation for Apprise HTTP URLs
- Remove secret logging (bot tokens, routing keys) from debug logs
- Implement lastNotified cleanup to prevent unbounded memory growth
- Use shared HTTP client for webhooks to enable TLS connection reuse
- Add fallback to direct sending when queue enqueue fails
- Make queue worker concurrent (5 workers with semaphore) to prevent head-of-line blocking
- Fix webhook rate limiter race condition with separate mutex
- Fix email manager thread safety with mutex on rate limiter
- Fix grouping timer leak by adding stopCleanup signal
- Fix webhook 429 double sleep (use Retry-After OR backoff, not both)

Frontend improvements:
- Add queue/DLQ management API methods (getQueueStats, getDLQ, retryDLQItem, deleteDLQItem)
- Add getNotificationHealth and getWebhookHistory endpoints
- Add Apprise test support to NotificationTestRequest type

Related to notification system audit
2025-11-07 08:29:13 +00:00
rcourtman
febce91145 Remove internal development documentation files
Remove 4 LLM-generated internal development docs that don't belong in the repository:
- MIGRATION_SCAFFOLDING.md
- NOTIFICATION_AUDIT.md
- NOTIFICATION_QUICK_REFERENCE.md
- NOTIFICATION_SYSTEM_MAP.md

These were internal development notes, not user-facing documentation.
2025-11-07 08:23:19 +00:00
rcourtman
9d2bad3af6 Add notification system documentation and fix tab panel corner radius
- Add NOTIFICATION_AUDIT.md for system analysis
- Add NOTIFICATION_QUICK_REFERENCE.md for quick lookup
- Add NOTIFICATION_SYSTEM_MAP.md for architecture overview
- Fix tab panel missing rounded-tl corner when first tab is active
2025-11-07 08:19:50 +00:00
rcourtman
5898cb81be Fix update modal hanging indefinitely after completion (related to #628)
When updates complete quickly, the status API may return 'completed' before
the frontend detects the 'restarting' phase. This left users staring at a
frozen modal with no feedback, requiring manual page refresh.

Changes:
- When status is 'completed', immediately check /api/health
- If backend is healthy, reload the page to get new version
- If health check fails, assume restart in progress and start health polling
- Ensures users always get reloaded to the new version automatically

This fixes the UX issue reported in discussion #628 where the update modal
appeared frozen indefinitely despite successful update completion.
2025-11-07 08:11:52 +00:00
rcourtman
b5ef239973 Add container detection warning to pulse-sensor-proxy startup (related to #628)
When pulse-sensor-proxy runs inside a container (Docker/LXC), it cannot
complete SSH workflows properly, leading to continuous [preauth] log floods
on the Proxmox host. This happens because the proxy is meant to run on the
host, not inside the container.

Changes:
- Import internal/system for InContainer() detection
- Add startup warning when running in containerized environment
- Point users to docs/TEMPERATURE_MONITORING.md for correct setup
- Allow suppression via PULSE_SENSOR_PROXY_SUPPRESS_CONTAINER_WARNING=true

This catches the misconfiguration early and directs users to supported
installation methods, preventing the SSH spam reported in discussion #628.
2025-11-06 23:41:29 +00:00
rcourtman
6a48c759e8 Fix critical notification system bugs and security issues
This commit addresses multiple critical issues identified in the notification
system audit conducted with Codex:

**Critical Fixes:**

1. **Queue Retry Logic (Critical #1)**
   - Fixed broken retry/DLQ system where send functions never returned errors
   - Made sendGroupedEmail(), sendGroupedWebhook(), sendGroupedApprise() return errors
   - Made sendWebhookRequest() return errors
   - ProcessQueuedNotification() now properly propagates errors to queue
   - Retry logic and DLQ now function correctly

2. **Attempt Counter Bug (Critical #2)**
   - Fixed double-increment bug in queue processing
   - Separated UpdateStatus() from attempt tracking
   - Added IncrementAttempt() method
   - Notifications now get correct number of retry attempts

3. **Secret Exposure (Critical #3 & #4)**
   - Masked webhook headers and customFields in GET /api/notifications/webhooks
   - Added redactSecretsFromURL() to sanitize webhook URLs in history
   - Truncated/redacted response bodies in webhook history
   - Protected against credential harvesting via API

4. **Email Rate Limiting (Critical #5)**
   - Added emailManager field to NotificationManager
   - Shared EnhancedEmailManager instance across sends
   - Rate limiter now accumulates across multiple emails
   - SMTP rate limits are now enforced correctly

5. **SSRF Protection (High #6)**
   - Added DNS resolution of webhook URLs
   - Added isPrivateIP() check using CIDR ranges
   - Blocks all private IP ranges (10/8, 172.16/12, 192.168/16, 127/8, 169.254/16)
   - Blocks IPv6 private ranges (::1, fe80::/10, fc00::/7)
   - Prevents DNS rebinding attacks
   - Returns error instead of warning for private IPs

**New Features:**

6. **Health Endpoint (High #8)**
   - Added GET /api/notifications/health
   - Returns queue stats (pending, sending, sent, failed, dlq)
   - Shows email/webhook configuration status
   - Provides overall health indicator

**Related to notification system audit**

Files changed:
- internal/notifications/notifications.go: Error returns, rate limiting, SSRF hardening
- internal/notifications/queue.go: Attempt tracking fix
- internal/api/notifications.go: Secret masking, health endpoint
2025-11-06 23:26:03 +00:00
rcourtman
3eafd00c88 Fix Helm chart workflow 403 errors by granting write permissions
The publish-helm-chart workflow was failing with 403 errors when attempting
to upload Helm chart assets to GitHub releases. This was caused by the workflow
having only 'contents: read' permission. Changed to 'contents: write' to allow
the 'gh release upload' command to succeed.
2025-11-06 22:50:08 +00:00
rcourtman
fa7ca00250 Fix duplicate checksum in build-release.sh
The checksum generation was including pulse-host-agent-v*-darwin-arm64.tar.gz
twice: once from the *.tar.gz pattern and once from the pulse-host-agent-*
pattern. Fixed by using extglob to exclude .tar.gz and .sha256 files from
the agent binary patterns since tarballs are already matched separately.
2025-11-06 22:19:16 +00:00
rcourtman
b356ba0fec Bump version to 4.26.4
Version alignment for upcoming release including:
- Layout and table overflow fixes (related to #643)
- Webhook alert persistence fix
- Docker host row dimming fix
- Agent installation script deployment fix (related to #644)
- Guest agent disk data regression fix
- Config backup/restore fixes (related to #646)
- Bootstrap token UX improvements
2025-11-06 22:14:45 +00:00
rcourtman
4f9ba7a285 Allow layout to expand on wide displays (related to #643)
Changed .pulse-shell from fixed 95rem cap to fluid clamp(95rem, 92vw, 120rem)
to match standard monitoring dashboard behavior (Proxmox, Grafana, Portainer).

On laptops/small screens: unchanged (capped at 1520px)
On 1080p displays: expands to ~1766px usable width
On 4K/ultrawide: expands up to 1920px max for readability

Added back 2xl column widths (totaling ~1720px) that properly fit within
the expanded shell, giving wide-display users more breathing room while
maintaining proportional scaling across all breakpoints.

Changed files:
- index.css: Update .pulse-shell max-width to use clamp()
- Dashboard.tsx: Add 2xl column widths calculated for expanded shell
- GuestRow.tsx: Add matching 2xl column widths
2025-11-06 21:51:17 +00:00
rcourtman
68caf5592b Fix Proxmox dashboard table overflow on wide displays (related to #643)
Removed 2xl: width overrides that caused the table to exceed container width.
At ≥1536px viewport, the 2xl breakpoint expanded table columns to ~1528px
total width while .pulse-shell container provides only ~1416px usable space,
forcing Net In/Net Out columns off-screen and requiring horizontal scroll.

Table now caps at xl: breakpoint widths (~1266px) which fit comfortably within
the container at all viewport sizes. Net In/Net Out columns are now visible
without scrolling on 1080p, 4K, and all wide displays.

Changed files:
- Dashboard.tsx: Remove 2xl: width classes from all table header columns
- GuestRow.tsx: Remove 2xl: width classes from all table cell columns
2025-11-06 21:36:30 +00:00
rcourtman
4891f06e76 Fix webhook alerts persisting when DisableAll* flags are enabled
The original fix in c6c0ac63e only handled per-resource overrides when
thresholds were disabled (trigger <= 0 or Disabled=true). It did not
handle global DisableAll* flags (DisableAllStorage, DisableAllNodes,
DisableAllGuests, etc.).

When a user toggled a DisableAll* flag from false to true:
- Check* functions returned early without processing
- Existing active alerts remained in m.activeAlerts map
- Those alerts continued generating webhook notifications
- reevaluateActiveAlertsLocked didn't check DisableAll* flags

This commit fixes the issue by:

1. Updating reevaluateActiveAlertsLocked to check all DisableAll* flags
   and resolve alerts for those resource types during config updates

2. Adding alert cleanup to Check* functions before early returns:
   - CheckStorage: clears usage and offline alerts
   - CheckNode: clears cpu/memory/disk/temperature and offline alerts
   - CheckPMG: clears queue/message alerts and offline alerts
   - CheckPBS: clears cpu/memory and offline alerts
   - CheckHost: calls existing cleanup helpers

3. Adding comprehensive test coverage for DisableAllStorage scenario

Related to #561
2025-11-06 21:17:56 +00:00
rcourtman
2c3768341a Fix Docker host row dimming for degraded status
Docker hosts with 'degraded' status were incorrectly appearing dimmed
(opacity-60) in the summary table, making them visually identical to
offline hosts. This was confusing because degraded hosts are still
actively reporting - they just have unhealthy containers or >35% of
containers not running.

The isHostOnline function now treats 'degraded' as an online status,
so these rows maintain full opacity. The status badge already provides
visual indication of the degraded state.
2025-11-06 19:11:17 +00:00
rcourtman
586ab3a740 Fix install.sh to deploy all agent installation scripts (related to #644)
Root cause: v4.26.3 tarball and Docker image contained all 8 agent scripts,
but install.sh only copied install-docker-agent.sh to /opt/pulse/scripts/.
Users upgrading via install.sh ended up with missing scripts, causing 404s
when trying to add hosts via the UI.

Changes:
- Add deploy_agent_scripts() function to systematically deploy all scripts
- Deploy all 8 scripts: install-{docker,container,host}-agent.{sh,ps1},
  uninstall-host-agent.{sh,ps1}, install-sensor-proxy.sh, install-docker.sh
- Apply to both main installation and rollback/recovery code paths

This ensures bare-metal installations have feature parity with Docker deployments.
2025-11-06 18:59:32 +00:00
rcourtman
1a78dcbba2 Fix guest agent disk data regression on Proxmox 8.3+
Related to #630

Proxmox 8.3+ changed the VM status API to return the `agent` field as an
object ({"enabled":1,"available":1}) instead of an integer (0 or 1). This
caused Pulse to incorrectly treat VMs as having no guest agent, resulting
in missing disk usage data (disk:-1) even when the guest agent was running
and functional.

The issue manifested as:
- VMs showing "Guest details unavailable" or missing disk data
- Pulse logs showing no "Guest agent enabled, querying filesystem info" messages
- `pvesh get /nodes/<node>/qemu/<vmid>/agent/get-fsinfo` working correctly
  from the command line, confirming the agent was functional

Root cause:
The VMStatus struct defined `Agent` as an int field. When Proxmox 8.3+ sent
the new object format, JSON unmarshaling silently left the field at zero,
causing Pulse to skip all guest agent queries.

Changes:
- Created VMAgentField type with custom UnmarshalJSON to handle both formats:
  * Legacy (Proxmox <8.3): integer (0 or 1)
  * Modern (Proxmox 8.3+): object {"enabled":N,"available":N}
- Updated VMStatus.Agent from `int` to `VMAgentField`
- Updated all references to `detailedStatus.Agent` to use `.Agent.Value`
- The unmarshaler prioritizes the "available" field over "enabled" to ensure
  we only query when the agent is actually responding

This fix maintains backward compatibility with older Proxmox versions while
supporting the new format introduced in Proxmox 8.3+.
2025-11-06 18:42:46 +00:00
rcourtman
7ed9203e4b Fix config backup/restore failures (related to #646)
Addresses two issues preventing configuration backup/restore:

1. Export passphrase validation mismatch: UI only validated 12+ char
   requirement when using custom passphrase, but backend always enforced
   it. Users with shorter login passwords saw unexplained failures.
   - Frontend now validates all passphrases meet 12-char minimum
   - Clear error message suggests custom passphrase if login password too short

2. Import data parsing failed silently: Frontend sent `exportData.data`
   which was undefined for legacy/CLI backups (raw base64 strings).
   Backend rejected these with no logs.
   - Frontend now handles both formats: {status, data} and raw strings
   - Backend logs validation failures for easier troubleshooting

Related to #646 where user reported "error after entering password" with
no container logs. These changes ensure proper validation feedback and
make the backup system resilient to different export formats.
2025-11-06 17:53:54 +00:00
rcourtman
b50dba577f Fix demo server workflow verification by adding authentication
The workflow was failing because /api/state requires authentication,
but the verification step was making an unauthenticated request.

Changes:
- Authenticate with demo/demo credentials before checking node count
- Use jq for cleaner JSON parsing instead of grep/cut
- Check total node count from API response instead of regex pattern matching

Related to user report about demo server not updating to 4.26.3.
The demo server was actually updated successfully, but the workflow
marked itself as failed due to the verification check failing.
2025-11-06 17:44:46 +00:00
rcourtman
ead325942e Add bootstrap token display to install.sh completion message
Enhances discoverability for non-Docker installations (bare metal, LXC)
by displaying the bootstrap token prominently at the end of install.sh.

Changes:
- Add ASCII box display matching Docker startup format
- Show token value and file location
- Include usage instructions for first-time setup
- Only display if .bootstrap_token file exists
- Displayed note about token auto-deletion matches actual behavior

With this change, bootstrap token is now prominently displayed across
all installation methods:
- Docker: startup logs (commit 731eb586)
- Bare metal/LXC: install.sh completion (this commit)
- CLI: pulse bootstrap-token command (commit 731eb586)

Related to #645
2025-11-06 17:35:28 +00:00
rcourtman
a1dc451ed4 Document alert reliability features and DLQ API
Add comprehensive documentation for new alert system reliability features:

**API Documentation (docs/API.md):**
- Dead Letter Queue (DLQ) API endpoints
  - GET /api/notifications/dlq - Retrieve failed notifications
  - GET /api/notifications/queue/stats - Queue statistics
  - POST /api/notifications/dlq/retry - Retry DLQ items
  - POST /api/notifications/dlq/delete - Delete DLQ items
- Prometheus metrics endpoint documentation
  - 18 metrics covering alerts, notifications, and queue health
  - Example Prometheus configuration
  - Example PromQL queries for common monitoring scenarios

**Configuration Documentation (docs/CONFIGURATION.md):**
- Alert TTL configuration
  - maxAlertAgeDays, maxAcknowledgedAgeDays, autoAcknowledgeAfterHours
- Flapping detection configuration
  - flappingEnabled, flappingWindowSeconds, flappingThreshold, flappingCooldownMinutes
- Usage examples and common scenarios
- Best practices for preventing notification storms

All new features are fully documented with examples and default values.
2025-11-06 17:34:05 +00:00
rcourtman
dd1d222ad0 Improve bootstrap token UX for easier discovery
The bootstrap token security requirement was added proactively but
lacked discoverability, causing user friction during first-run setup.
These improvements make the token easier to find while maintaining
the security benefit.

Improvements:
- Display bootstrap token prominently in startup logs with ASCII box
  (previously: single line log message)
- Add `pulse bootstrap-token` CLI command to display token on demand
  (Docker: docker exec <container> /app/pulse bootstrap-token)
- Improve error messages in quick-setup API to show exact commands
  for retrieving token when missing or invalid
- Error messages now include both Docker and bare metal examples

User experience improvements:
- Token visible in `docker logs` output immediately
- Clear instructions printed with token
- Helpful error messages if token is wrong/missing
- CLI helper for operators who need to retrieve token later

Security unchanged:
- Bootstrap token still required for first-run setup
- Token still auto-deleted after successful setup
- No bypass mechanism added

Related to discussion about bootstrap token UX friction.
2025-11-06 17:29:49 +00:00
rcourtman
80acc5ae72 chore: bump version to 4.26.3 2025-11-06 16:56:19 +00:00
rcourtman
f9ca2c0e68 Add hashpw utility for generating password hashes
Simple CLI utility to generate bcrypt password hashes for admin users.

Usage: hashpw <password>

This utility helps administrators generate properly hashed passwords
for use in configuration files or manual user setup.
2025-11-06 16:46:56 +00:00
rcourtman
c8e0281953 Add comprehensive alert system reliability improvements
This commit implements critical reliability features to prevent data loss
and improve alert system robustness:

**Persistent Notification Queue:**
- SQLite-backed queue with WAL journaling for crash recovery
- Dead Letter Queue (DLQ) for notifications that exhaust retries
- Exponential backoff retry logic (100ms → 200ms → 400ms)
- Full audit trail for all notification delivery attempts
- New file: internal/notifications/queue.go (661 lines)

**DLQ Management API:**
- GET /api/notifications/dlq - Retrieve DLQ items
- GET /api/notifications/queue/stats - Queue statistics
- POST /api/notifications/dlq/retry - Retry failed notifications
- POST /api/notifications/dlq/delete - Delete DLQ items
- New file: internal/api/notification_queue.go (145 lines)

**Prometheus Metrics:**
- 18 comprehensive metrics for alerts and notifications
- Metric hooks integrated via function pointers to avoid import cycles
- /metrics endpoint exposed for Prometheus scraping
- New file: internal/metrics/alert_metrics.go (193 lines)

**Alert History Reliability:**
- Exponential backoff retry for history saves (3 attempts)
- Automatic backup restoration on write failure
- Modified: internal/alerts/history.go

**Flapping Detection:**
- Detects and suppresses rapidly oscillating alerts
- Configurable window (default: 5 minutes)
- Configurable threshold (default: 5 state changes)
- Configurable cooldown (default: 15 minutes)
- Automatic cleanup of inactive flapping history

**Alert TTL & Auto-Cleanup:**
- MaxAlertAgeDays: Auto-cleanup old alerts (default: 7 days)
- MaxAcknowledgedAgeDays: Faster cleanup for acked alerts (default: 1 day)
- AutoAcknowledgeAfterHours: Auto-ack long-running alerts (default: 24 hours)
- Prevents memory leaks from long-running alerts

**WebSocket Broadcast Sequencer:**
- Channel-based sequencing ensures ordered message delivery
- 100ms coalescing window for rapid state updates
- Prevents race conditions in WebSocket broadcasts
- Modified: internal/websocket/hub.go

**Configuration Fields Added:**
- FlappingEnabled, FlappingWindowSeconds, FlappingThreshold, FlappingCooldownMinutes
- MaxAlertAgeDays, MaxAcknowledgedAgeDays, AutoAcknowledgeAfterHours

All features are production-ready and build successfully.
2025-11-06 16:46:30 +00:00
rcourtman
47748230f4 Fix first-run setup 401 error by adding bootstrap token unlock screen (related to #639)
After the security hardening that introduced bootstrap token protection,
the first-run setup flow was broken because FirstRunSetup.tsx didn't
prompt users for the token. This caused a 401 "Bootstrap setup token
required" error during initial admin account creation.

Changes:
- Add dedicated unlock screen before the setup wizard
- Display instructions for retrieving token from host
- Include bootstrap token in quick-setup API request headers and body
- Only require unlock for first-run setup (skip in force mode)

The unlock screen follows the documented flow in README.md and ensures
only users with host access can configure an unconfigured instance.

Related to #639
2025-11-06 16:45:51 +00:00
rcourtman
20099549c6 Add comprehensive release validation to prevent missing artifacts
Adds automated validation script to prevent the pattern of patch
releases caused by missing files/artifacts.

scripts/validate-release.sh validates all 40+ artifacts including:
- Docker image scripts (8 install/uninstall scripts)
- Docker image binaries (17 across all platforms)
- Release tarballs (5 including universal and macOS)
- Standalone binaries (12+)
- Checksums for all distributable assets
- Version embedding in every binary type
- Tarball contents (binaries + scripts + VERSION)
- Binary architectures and file types

The script catches 100% of issues from the last 3 patch releases
(missing scripts, missing install.sh, missing binaries, broken
version embedding).

Updated RELEASE_CHECKLIST.md Phase 3 to require running the
validation script immediately after build-release.sh and before
proceeding to Docker build/publish phases.

Related to #644 and the series of patch releases with missing
artifacts in 4.26.x.
2025-11-06 16:33:49 +00:00
rcourtman
035d872269 Add missing install/uninstall scripts to Docker image and release builds (related to #644)
The Dockerfile and build-release.sh were missing several installer and uninstaller
scripts that the router expects to serve via HTTP endpoints:
- install-container-agent.sh
- install-host-agent.ps1
- uninstall-host-agent.sh
- uninstall-host-agent.ps1

This caused 404 errors when users attempted to add Docker/Podman hosts or use the
PowerShell installer, as reported in #644.

Changes:
- Dockerfile: Added missing scripts to /opt/pulse/scripts/ with proper permissions
- build-release.sh: Added missing scripts to both per-platform and universal tarballs
  to ensure bare-metal deployments serve the same endpoints as Docker deployments
2025-11-06 16:01:40 +00:00
rcourtman
40abcd1237 Fix empty space below backup chart by matching container and SVG heights
The chart container was set to min-h-[12rem] (192px) on desktop while the SVG
was hardcoded to 128px, creating 64px of unwanted empty space. Changed container
to fixed h-32 (128px) to match the SVG height.
2025-11-06 15:34:20 +00:00
rcourtman
615cb129df Fix checksum verification failure in install.sh (related to #642)
The .sha256 files generated during release builds contained only the hash,
but sha256sum -c expects the format "hash  filename". This caused all
install.sh updates to fail with "Checksum verification failed" even when
the checksum was correct.

Root cause: build-release.sh line 289 was using awk to extract only field 1
(the hash), discarding the filename that sha256sum -c needs.

Fix: Remove the awk filter to preserve the full sha256sum output format.

This affected the demo server update workflow and user installations.
2025-11-06 15:28:05 +00:00
rcourtman
a8fa834d24 Fix critical truncation bug preventing data readability on touch devices (related to #643)
Removed CSS truncate from key identifier columns (container names, service names,
guest names, host names, image names) that were making data inaccessible on mobile/
touch devices where title tooltips don't work.

Users can now read full identifiers via horizontal scroll (already implemented via
ScrollableTable component). Data should always be readable without requiring additional
UI affordances.

Changed files:
- DockerUnifiedTable: Remove truncate from container/service names and images
- GuestRow: Remove truncate from guest names
- HostsOverview: Remove truncate from host display names and hostnames

Column resizing remains on backlog as optional enhancement; users should not need
a drag handle just to read the contents.
2025-11-06 15:00:36 +00:00
rcourtman
57e2f9428e chore: bump version to 4.26.2 2025-11-06 14:33:08 +00:00
rcourtman
becda56897 Fix critical rollback download URL bug and doc inconsistencies
Issues found during systematic audit after #642:

1. CRITICAL BUG - Rollback downloads were completely broken:
   - Code constructed: pulse-linux-amd64 (no version, no .tar.gz)
   - Actual asset name: pulse-v4.26.1-linux-amd64.tar.gz
   - This would cause 404 errors on all rollback attempts
   - Fixed: Construct correct tarball URL with version
   - Added: Extract tarball after download to get binary

2. TEMPERATURE_MONITORING.md referenced non-existent v4.27.0:
   - Changed to use /latest/download/ for future-proof docs

3. API.md example had wrong filename format:
   - Changed pulse-linux-amd64.tar.gz to pulse-v4.30.0-linux-amd64.tar.gz
   - Ensures example matches actual release asset naming

The rollback bug would have affected any user attempting to roll back
to a previous version via the UI or API.
2025-11-06 14:25:32 +00:00
rcourtman
fd3a72606f Add standalone host-agent binaries to releases
Issue: HOST_AGENT.md documented downloading pulse-host-agent binaries
from GitHub releases, but those assets didn't exist. Only tarballs were
available, making manual installation unnecessarily complex.

Changes:
- Copy standalone host-agent binaries (all architectures) to release/
  directory alongside sensor-proxy binaries
- Include host-agent binaries in checksum generation
- Update HOST_AGENT.md to clarify available architectures
- Retroactively uploaded missing binaries to v4.26.1

This enables air-gapped and manual installations without requiring an
already-running Pulse server to download from.
2025-11-06 14:20:59 +00:00