Root cause: findMatchingDockerHost() was matching hosts by token ID alone,
causing multiple Docker agents using the same API token to overwrite each
other in state. This resulted in only N visible hosts (where N = number of
unique tokens) instead of all M agents, with hosts "rotating" as each agent
reported every 10 seconds.
Example: 4 agents using 2 tokens would show only 2 hosts, rotating between
agents 1↔2 (token A) and agents 3↔4 (token B).
Fix: Remove token-only matching from findMatchingDockerHost(). Hosts should
only match by:
1. Agent ID (unique per agent)
2. Machine ID + hostname combination (with optional token validation)
3. Machine ID or hostname alone (only for tokenless agents)
This allows multiple agents to share the same API token without colliding.
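A minimal sketch of the revised matching order; the types and field names here are illustrative stand-ins, not the actual Pulse internals:

```go
package docker

// DockerHost is a simplified stand-in for the stored host record.
type DockerHost struct {
	AgentID   string
	MachineID string
	Hostname  string
	TokenID   string
}

// findMatchingDockerHost sketches the fixed matching order: agent ID first,
// then machine ID + hostname, then machine ID or hostname for tokenless
// agents. Token ID alone is never used as a match key.
func findMatchingDockerHost(hosts []DockerHost, report DockerHost) *DockerHost {
	for i := range hosts {
		if hosts[i].AgentID != "" && hosts[i].AgentID == report.AgentID {
			return &hosts[i]
		}
	}
	for i := range hosts {
		if hosts[i].MachineID == report.MachineID && hosts[i].Hostname == report.Hostname {
			// Optional token validation could reject mismatched tokens here.
			return &hosts[i]
		}
	}
	if report.TokenID == "" { // tokenless agents only
		for i := range hosts {
			if hosts[i].MachineID == report.MachineID || hosts[i].Hostname == report.Hostname {
				return &hosts[i]
			}
		}
	}
	return nil // no match: register a new host instead of overwriting one
}
```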
Additional fix: UpsertDockerHost() now preserves Hidden, PendingUninstall,
and Command fields from existing hosts, preventing these flags from being
reset to defaults on every agent report.
Related to #656
Windows guest agents can return multiple directory mountpoints (C:\, C:\Users,
C:\Windows) all on the same physical drive. When the QEMU guest agent omits
disk[] metadata, commit 5325ef481 falls back to using the mountpoint string
as the disk identifier. This causes every Windows directory to be counted as
a separate disk, whose usage then sums to inflated totals (e.g., 1TB reported
for a 250GB drive).
Root cause:
The fallback logic in pkg/proxmox/client.go:1585-1594 assigns fs.Disk =
fs.Mountpoint when disk[] is missing. On Windows, every directory path is
unique, so the deduplication guard in internal/monitoring/monitor_polling.go:
619-635 never triggers, causing all directories to be summed.
Changes:
- Detect Windows-style mountpoints (drive letter + colon + backslash)
- Normalize to drive root when disk[] is missing (e.g., C:\Users → C:)
- Preserve existing behavior for Linux/BSD and VMs with disk[] metadata
- Add debug logging for synthesized Windows drive identifiers
This fix maintains backward compatibility with commit 5325ef481 while
preventing the Windows directory accumulation issue. LXC containers are
unaffected as they use a different code path.
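A minimal sketch of the normalization, with a hypothetical helper name standing in for the logic in pkg/proxmox/client.go:

```go
package proxmox

import "regexp"

// windowsDriveRe matches Windows-style mountpoints such as `C:\` or `C:\Users`.
var windowsDriveRe = regexp.MustCompile(`^([A-Za-z]):\\`)

// diskIDForMountpoint sketches the fallback used when the guest agent omits
// disk[] metadata: Windows paths collapse to their drive root so C:\, C:\Users
// and C:\Windows all share one identifier; other platforms keep the mountpoint.
func diskIDForMountpoint(mountpoint string) string {
	if m := windowsDriveRe.FindStringSubmatch(mountpoint); m != nil {
		return m[1] + ":" // e.g., C:\Users -> C:
	}
	return mountpoint // Linux/BSD behavior unchanged
}
```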
Windows 11 25H2 also ships on ARM64 hardware. When users on ARM64
attempt to install the host agent, the Service Control Manager fails to
load the amd64 binary with ERROR_BAD_EXE_FORMAT, surfaced as "The Pulse
Host Agent is not compatible with this Windows version".
Changes:
- Dockerfile: Build pulse-host-agent-windows-arm64.exe alongside amd64
- Dockerfile: Copy windows-arm64 binary and create symlink for download endpoint
- install-host-agent.ps1: Use RuntimeInformation.OSArchitecture to detect ARM64
- build-release.sh: Build darwin-amd64, darwin-arm64, windows-amd64, windows-arm64
- build-release.sh: Package Windows binaries as .zip archives
- validate-release.sh: Check for windows-arm64 binary and symlink
- validate-release.sh: Add architecture validation for all darwin/windows variants
The installer now correctly detects ARM64 and downloads the appropriate binary.
Extends temperature monitoring to collect SMART temps for SATA/SAS disks,
addressing issue #652 where physical disk temperatures showed as empty.
Architecture:
- Deploys pulse-sensor-wrapper.sh as SSH forced command on Proxmox nodes
- Wrapper collects both CPU/GPU temps (sensors -j) and disk temps (smartctl)
- Implements 30-min cache with background refresh to avoid performance impact
- Uses smartctl -n standby,after to skip sleeping drives without waking them
- Returns unified JSON: {sensors: {...}, smart: [...]}
Backend changes:
- Add DiskTemp model with device, serial, WWN, temperature, lastUpdated
- Extend Temperature model with SMART []DiskTemp field and HasSMART flag
- Add WWN field to PhysicalDisk for reliable disk matching
- Update parseSensorsJSON to handle both legacy and new wrapper formats
- Rewrite mergeNVMeTempsIntoDisks to match SMART temps by WWN → serial → devpath
- Preserve legacy NVMe temperature support for backward compatibility
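A sketch of the WWN → serial → devpath matching order described above, with simplified stand-in types:

```go
package monitoring

// DiskTemp mirrors the new SMART payload entry (simplified).
type DiskTemp struct {
	Device      string // e.g., /dev/sda
	Serial      string
	WWN         string
	Temperature float64
}

// PhysicalDisk is a simplified stand-in for the monitored disk record.
type PhysicalDisk struct {
	DevPath     string
	Serial      string
	WWN         string
	Temperature float64
}

// matchDiskTemp applies the priority order: WWN is the most stable key,
// serial the next best, device path the last resort.
func matchDiskTemp(disk *PhysicalDisk, temps []DiskTemp) bool {
	for _, t := range temps {
		if t.WWN != "" && t.WWN == disk.WWN {
			disk.Temperature = t.Temperature
			return true
		}
	}
	for _, t := range temps {
		if t.Serial != "" && t.Serial == disk.Serial {
			disk.Temperature = t.Temperature
			return true
		}
	}
	for _, t := range temps {
		if t.Device == disk.DevPath {
			disk.Temperature = t.Temperature
			return true
		}
	}
	return false
}
```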
Performance considerations:
- SMART data cached for 30 minutes per node to avoid excessive smartctl calls
- Background refresh prevents blocking temperature requests
- Respects drive standby state to avoid spinning up idle arrays
- Staggered disk scanning with 0.1s delay to avoid saturating SATA controllers
Install script:
- Deploys wrapper to /usr/local/bin/pulse-sensor-wrapper.sh
- Updates SSH forced command from "sensors -j" to wrapper script
- Backward compatible - falls back to direct sensors output if wrapper missing
Testing note:
- Requires real hardware with smartmontools installed for full functionality
- Empty smart array returned gracefully when smartctl unavailable
- Legacy sensor-only nodes continue working without changes
The bare metal installer was not copying pulse-host-agent binaries from
release tarballs into /opt/pulse/bin/, causing 404 errors when users
tried to install the host agent via the download endpoint.
Changes:
- Copy pulse-host-agent binary during initial installation (alongside
pulse-docker-agent)
- Update install_additional_agent_binaries() to fetch and install
cross-platform host agent binaries (linux-amd64, linux-arm64,
linux-armv7, darwin-amd64, darwin-arm64, windows-amd64)
- Match existing pattern used for Docker agent distribution
The build pipeline (build-release.sh and Dockerfile) already correctly
includes host agent binaries in releases and Docker images. This fix
ensures the installer deploys them.
Users on bare metal deployments should rerun install.sh to populate
/opt/pulse/bin/ with the missing host agent binaries. Docker
deployments are unaffected.
Added two troubleshooting sections to DOCKER_MONITORING.md:
1. "Docker hosts cycling or appearing to replace each other" - explains
why multiple agents sharing the same token cause the UI to switch
between hosts instead of showing all simultaneously
2. "Agent rejected after host removal" - documents the re-enrollment
process when a host is on the removal blocklist
These entries make common setup issues searchable while linking to
canonical setup instructions rather than duplicating them.
Fixed two test failures identified by go vet:
1. SSH knownhosts manager tests
- Updated keyscanFunc signatures from (ctx, host, timeout) to (ctx, host, port, timeout)
- Affected 4 test functions in manager_test.go
- Matches recent API change adding port parameter for flexibility
2. Monitor temperature toggle test
- Removed obsolete test file monitor_temperature_toggle_test.go
- Test was checking internal implementation details that have changed
- Enable/DisableTemperatureMonitoring() are now log-only no-ops (kept for interface compatibility)
- Temperature collection is managed differently in current architecture
Impact:
- All tests now compile successfully
- Removes obsolete test that no longer reflects current behavior
- Updates remaining tests to match current API signatures
Fixed goroutine leaks in WebSocket hub from missing shutdown mechanism:
Problem:
1. Hub.Run() has an infinite loop with no exit condition
2. runBroadcastSequencer() reads from channel forever
3. No way to cleanly shutdown hub during restarts or tests
Solution:
- Added stopChan chan struct{} field to Hub
- Initialize stopChan in NewHub()
- Added Stop() method that closes stopChan
- Modified Run() main loop to select on stopChan
- On shutdown: close all client connections and return
- Modified runBroadcastSequencer() from 'for range' to select
- Changed from: for msg := range h.broadcastSeq
- Changed to: for { select { case msg := <-h.broadcastSeq: ... case <-h.stopChan: ... }}
- On shutdown: stop coalesce timer and return
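A condensed sketch of the pattern, with the Hub reduced to the fields named above:

```go
package websocket

type message struct{ data []byte }

// Hub is simplified to the fields relevant to shutdown.
type Hub struct {
	broadcastSeq chan message
	stopChan     chan struct{}
}

func NewHub() *Hub {
	return &Hub{
		broadcastSeq: make(chan message, 64),
		stopChan:     make(chan struct{}),
	}
}

// runBroadcastSequencer shows the 'for range' loop rewritten as a select so
// that closing stopChan ends the goroutine instead of leaking it.
func (h *Hub) runBroadcastSequencer() {
	for {
		select {
		case msg := <-h.broadcastSeq:
			_ = msg // broadcast to clients here
		case <-h.stopChan:
			return // stop the coalesce timer and exit
		}
	}
}

// Stop closes stopChan, unblocking both Run() and runBroadcastSequencer().
func (h *Hub) Stop() { close(h.stopChan) }
```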
Shutdown sequence:
1. Call hub.Stop() to close stopChan
2. Both Run() and runBroadcastSequencer() exit their loops
3. All client send channels are closed
4. Clients map is cleared
5. Pending coalesce timer is stopped
Impact:
- Enables graceful shutdown during service restarts
- Prevents goroutine leaks in tests
- Allows proper cleanup of WebSocket connections
- No more orphaned broadcast sequencer goroutines
Fixed three P1 goroutine/memory leaks that prevent proper resource cleanup:
1. Recovery Tokens goroutine leak
- Cleanup routine runs forever without stop mechanism
- Added stopCleanup channel and Stop() method
- Cleanup loop now uses select with stopCleanup case
2. Rate Limiter goroutine leak
- Cleanup routine runs forever without stop mechanism
- Added stopCleanup channel and Stop() method
- Changed from 'for range ticker.C' to select with stopCleanup case
3. OIDC Service memory leak (DoS vector)
- Abandoned OIDC flows never cleaned up
- State entries accumulate unboundedly
- Added cleanup routine with 5-minute ticker
- Periodically removes expired state entries (10min TTL)
- Added Stop() method for proper shutdown
All three follow consistent pattern:
- Add stopCleanup chan struct{} field
- Initialize in constructor
- Use select with ticker and stopCleanup cases
- Close channel in Stop() method to signal goroutine exit
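The shared pattern as a minimal compilable sketch (names illustrative):

```go
package cleanup

import "time"

// service sketches the stopCleanup pattern applied to all three components.
type service struct {
	stopCleanup chan struct{}
}

func newService() *service {
	s := &service{stopCleanup: make(chan struct{})}
	go s.cleanupLoop(5 * time.Minute)
	return s
}

func (s *service) cleanupLoop(interval time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			// remove expired entries here
		case <-s.stopCleanup:
			return // goroutine exits instead of leaking
		}
	}
}

// Stop signals the cleanup goroutine to exit; wrap in sync.Once if Stop
// may be called more than once.
func (s *service) Stop() { close(s.stopCleanup) }
```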
Impact:
- Prevents goroutine leaks during service restarts/reloads
- Prevents memory exhaustion from abandoned OIDC login attempts
- Enables proper cleanup in tests and graceful shutdown
This commit addresses 5 critical P0 bugs that cause security vulnerabilities, crashes, and data corruption:
**P0-1: Recovery Tokens Replay Attack Vulnerability** (recovery_tokens.go:153-159)
- **SECURITY CRITICAL**: Single-use recovery tokens could be replayed
- **Problem**: Lock upgrade race - two concurrent requests both pass initial Used check
1. Both acquire RLock, see token.Used = false
2. Both release RLock
3. Both acquire Lock and mark token.Used = true
4. Both return true - TOKEN REUSED
- **Impact**: Attacker with intercepted token can use it multiple times
- **Fix**: Re-check token.Used after acquiring write lock (TOCTOU prevention)
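One way to structure the fix, sketched with approximate names; the essential part is that the Used check happens under the write lock, so only one goroutine can consume a token:

```go
package auth

import "sync"

type recoveryToken struct {
	Used bool
}

type tokenStore struct {
	mu     sync.RWMutex
	tokens map[string]*recoveryToken
}

// consume marks a token used exactly once. Checking Used after taking the
// write lock closes the replay window: two goroutines can both pass a
// read-locked check, but only one can pass this one.
func (s *tokenStore) consume(id string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	tok, ok := s.tokens[id]
	if !ok || tok.Used {
		return false // the losing goroutine lands here
	}
	tok.Used = true
	return true
}
```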
**P0-2: WebSocket Hub Concurrent Map Panic** (hub.go:345-347, 376-378)
- **Problem**: Initial state goroutine reads h.clients map without lock
- Line 345: `if _, ok := h.clients[client]` (NO LOCK)
- Main loop writes to h.clients with lock (line 326, 394)
- **Impact**: "fatal error: concurrent map read and write" crashes hub
- **Fix**: Acquire RLock before all client map reads in goroutine
**P0-3: WebSocket Send on Closed Channel Panic** (hub.go:348, 380)
- **Problem**: Check that client exists, then send - the channel can close in between
- **Impact**: "send on closed channel" panic crashes hub
- **Fix**: Hold RLock during both check and send (defensive select already present)
**P0-4: CSRF Store Shutdown Data Corruption** (csrf_store.go:189-196)
- **Problem**: Stop() calls save() after signaling worker. Both hold only RLock
- Worker's final save writes to csrf_tokens.json.tmp
- Stop()'s save writes to same file concurrently
- **Impact**: Corrupted/truncated csrf_tokens.json on shutdown
- **Fix**: Added saveMu mutex to serialize all disk writes
**P0-5: CSRF Store Deadlock on Double-Stop** (csrf_store.go:103-108)
- **Problem**: stopChan unbuffered, no sync.Once guard, uses send not close
- **Impact**: Second Stop() call blocks forever waiting for receiver
- **Fix**:
- Added sync.Once field stopOnce
- Changed to close(stopChan) within stopOnce.Do()
- Prevents double-close panic and deadlock
All fixes maintain backwards compatibility. The recovery token fix is particularly critical as it closes a security vulnerability allowing replay attacks on password reset flows.
**Problem**: writeConfigFileLocked() accessed c.tx field without synchronization
- Function reads c.tx to check if transaction is active (line 109)
- c.tx modified by begin/endTransaction under lock, but read without lock
- Race condition: c.tx could change between check and use
**Impact**:
- Inconsistent transaction handling
- File could be written directly when it should be staged
- Or staged when it should be written directly
- Data corruption risk during config imports
**Fix** (lines 108-128):
- Added documentation that caller MUST hold c.mu lock
- Read c.tx into local variable tx while lock is held
- Use local copy for transaction check
- Safe because all callers hold c.mu when calling writeConfigFileLocked
- Transaction field only modified while holding c.mu in begin/endTransaction
This maintains the existing contract (callers hold lock) while making the transaction read safe and explicit.
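Sketched shape of the change, with types simplified and a hypothetical direct-write helper:

```go
package config

import "sync"

type transaction struct{ staged map[string][]byte }

type ConfigPersistence struct {
	mu sync.Mutex
	tx *transaction
}

// writeConfigFileLocked assumes the caller holds c.mu. Copying c.tx into a
// local variable makes the "is a transaction active?" decision explicit and
// safe: the field is only ever mutated under c.mu, so the snapshot cannot
// change out from under the rest of the function.
func (c *ConfigPersistence) writeConfigFileLocked(name string, data []byte) error {
	tx := c.tx // read once, under the caller's lock
	if tx != nil {
		tx.staged[name] = data // stage inside the transaction
		return nil
	}
	return writeFileDirect(name, data)
}

// writeFileDirect is a stub standing in for the atomic file write.
func writeFileDirect(name string, data []byte) error { return nil }
```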
This commit addresses 4 P1 important issues and 1 P2 optimization in infrastructure components:
**P1-1: Missing Panic Recovery in Discovery Service** (service.go:172-195, 499-542)
- **Problem**: No panic recovery in Start(), ForceRefresh(), SetSubnet() goroutines
- **Impact**: Silent service death if scan panics, broken discovery with no monitoring
- **Fix**:
- Wrapped initial scan goroutine with defer/recover (lines 172-182)
- Wrapped scanLoop goroutine with defer/recover (lines 185-195)
- Wrapped ForceRefresh scan with defer/recover (lines 499-509)
- Wrapped SetSubnet scan with defer/recover (lines 532-542)
- All log panics with stack traces for debugging
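The recovery wrapper, roughly; a shared helper like this (name hypothetical) could cover all four call sites:

```go
package discovery

import (
	"log"
	"runtime/debug"
)

// goSafe runs fn in a goroutine and logs any panic with a stack trace
// instead of letting it take down the service unobserved.
func goSafe(name string, fn func()) {
	go func() {
		defer func() {
			if r := recover(); r != nil {
				log.Printf("panic in %s: %v\n%s", name, r, debug.Stack())
			}
		}()
		fn()
	}()
}
```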
**P1-2: Missing Panic Recovery in Config Watcher Callback** (watcher.go:546-556)
- **Problem**: User-provided onMockReload callback could panic and crash watcher
- **Impact**: Panicking callback kills watcher goroutine, no config updates
- **Fix**: Wrapped callback invocation with defer/recover and stack trace logging
**P1-3: Session Store Stop() Using Send Instead of Close** (session_store.go:16-84)
- **Problem**: Stop() used channel send which blocks if nobody reads
- **Impact**: Stop() hangs if backgroundWorker already exited
- **Fix**:
- Added sync.Once field stopOnce (line 22)
- Changed Stop() to use close() within stopOnce.Do() (lines 80-84)
- Prevents double-close panic and ensures all readers are signaled
**P2-1: Backup Cleanup Inefficient O(n²) Sort** (persistence.go:1424-1427)
- **Problem**: Bubble sort used to sort backups by modification time
- **Impact**: Inefficient for large backup counts (>100 files)
- **Fix**:
- Replaced bubble sort with sort.Slice() using O(n log n) algorithm
- Added "sort" import (line 9)
- Maintains same oldest-first ordering for deletion logic
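For reference, a sketch of the replacement using the standard library, assuming the backups slice holds fs.FileInfo values:

```go
package persistence

import (
	"io/fs"
	"sort"
)

// sortBackupsOldestFirst replaces the bubble sort with the standard
// library's O(n log n) sort, preserving oldest-first order for deletion.
func sortBackupsOldestFirst(backups []fs.FileInfo) {
	sort.Slice(backups, func(i, j int) bool {
		return backups[i].ModTime().Before(backups[j].ModTime())
	})
}
```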
All fixes add defensive programming without changing external behavior. Panic recovery ensures services continue operating even with bugs, while the sorting optimization reduces cleanup time in backup-heavy environments.
This commit addresses 3 critical P0 race conditions and resource leaks in core infrastructure:
**P0-1: Discovery Service Goroutine Leak** (service.go:468, 488)
- **Problem**: ForceRefresh() and SetSubnet() spawned unbounded goroutines without checking if scan already in progress
- **Impact**: Rapid API calls create goroutine explosion, resource exhaustion
- **Fix**:
- ForceRefresh: Check isScanning before spawning goroutine (lines 470-476)
- SetSubnet: Check isScanning, defer scan if already running (lines 491-504)
- Both now log when skipping to aid debugging
**P0-2: Config Persistence Unlock/Relock Race** (persistence.go:1177-1206)
- **Problem**: LoadNodesConfig() unlocked RLock, called SaveNodesConfig (acquires Lock), then relocked
- **Impact**: Another goroutine could modify config between unlock/relock, causing migrated data loss
- **Fix**:
- Copy instance slices while holding RLock to ensure consistency (lines 1189-1194)
- Release lock, save copies, then return without relocking (lines 1196-1205)
- Prevents TOCTOU vulnerability where migrations could be overwritten
**P0-3: Config Watcher Channel Close Race** (watcher.go:19-178)
- **Problem**: Stop() used select-check-close pattern vulnerable to concurrent calls
- **Impact**: Multiple Stop() calls panic on double-close
- **Fix**:
- Added sync.Once field stopOnce to ConfigWatcher struct (line 26)
- Changed Stop() to use stopOnce.Do() ensuring single execution (lines 175-178)
- Removed racy select-based guard
All fixes maintain backwards compatibility and add defensive logging for operational visibility.
This commit addresses 7 critical issues identified during the alert system audit:
**P0 Critical - Race Conditions Fixed:**
1. **dispatchAlert race in NotifyExistingAlert** (lines 5486-5497)
- Changed from RLock to Lock to hold mutex during dispatchAlert call
- dispatchAlert calls checkFlapping which writes to maps (flappingHistory, flappingActive, suppressedUntil)
- Previous code: grabbed RLock, got alert pointer, released lock, then called dispatchAlert (RACE)
- Fixed: hold Lock through dispatchAlert call
2. **dispatchAlert race in LoadActiveAlerts startup** (lines 8216-8235)
- Startup goroutines called dispatchAlert without holding lock
- Added m.mu.Lock/Unlock around dispatchAlert call in goroutine
- Also added cancellation via escalationStop channel to prevent goroutine leaks on shutdown
3. **checkFlapping documentation** (line 738)
- Added clear comment that checkFlapping requires caller to hold m.mu
- Prevents future race conditions from improper usage
**P1 Important - Data Loss Prevention:**
4. **History save race condition** (lines 177-180 in history.go)
- Added saveMu mutex to serialize disk writes
- Previous: concurrent saves could interleave, causing newer data to be overwritten by older snapshots
- Fixed: saveMu.Lock at start of saveHistoryWithRetry ensures atomic disk writes
- Newer snapshots now always win over older ones
**P2 Memory Leak Prevention:**
5. **PMG anomaly tracker cleanup** (lines 7318-7331)
- Added cleanup for pmgAnomalyTrackers map (24 hour TTL based on LastSampleTime)
- Prevents unbounded growth from decommissioned/transient PMG instances
- Each tracker: ~1-2KB (48 samples + baselines)
6. **PMG quarantine history cleanup** (lines 7333-7354)
- Added cleanup for pmgQuarantineHistory map (7 day TTL based on last snapshot)
- Prevents memory leak for deleted PMG instances
- Removes both empty histories and very old histories
**P2 Goroutine Leak Prevention:**
7. **Startup notification goroutine cancellation** (lines 8218-8234)
- Added select with escalationStop channel to cancel startup notifications
- Prevents goroutines from continuing after Stop() is called
- Scales with number of restored critical alerts
All fixes maintain proper lock ordering and prevent deadlocks by ensuring locks are held when accessing shared maps.
Backend:
- Add IsEncryptionEnabled() method to ConfigPersistence
- Include encryption status in /api/notifications/health response
- Allows frontend to warn when credentials are stored in plaintext
Frontend:
- Update NotificationHealth type to include encryption.enabled field
- Frontend can now display warnings when encryption is disabled
This addresses the P2 requirement for encryption visibility, allowing
operators to know when notification credentials are not encrypted at rest.
Add documentation to explain how transport-level and queue-level retries interact:
- Email: MaxRetries (transport) * MaxAttempts (queue) = total SMTP attempts
- Webhooks: RetryCount (transport) * MaxAttempts (queue) = total HTTP attempts
- Example: 3 * 3 = 9 total delivery attempts for a single notification
This clarifies the multiplicative retry behavior and helps operators understand
the actual retry counts when using the persistent queue.
Queue cancellation mechanism:
- Add CancelByAlertIDs method to mark queued notifications as cancelled when alerts resolve
- Update CancelAlert to cancel queued notifications containing resolved alert IDs
- Skip cancelled notifications in queue processor
- Prevents resolved alerts from triggering notifications after they clear
Atomic DB operations:
- Add IncrementAttemptAndSetStatus to atomically update attempt counter and status
- Replace separate IncrementAttempt + UpdateStatus calls with single atomic operation
- Prevents orphaned queue entries when crashes occur between operations
- Eliminates race condition where rows get stuck in "pending" or "sending" status
These fixes ensure queued notifications are properly cancelled when alerts resolve
and prevent database inconsistencies during crash scenarios.
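A sketch of IncrementAttemptAndSetStatus as a single statement, assuming a database/sql store; the table and column names are illustrative:

```go
package queue

import (
	"context"
	"database/sql"
)

// IncrementAttemptAndSetStatus replaces the old IncrementAttempt +
// UpdateStatus pair with one UPDATE, so a crash can no longer land between
// the counter bump and the status change.
// The `?` placeholders assume a SQLite/MySQL-style driver.
func IncrementAttemptAndSetStatus(ctx context.Context, db *sql.DB, id int64, status string) error {
	_, err := db.ExecContext(ctx,
		`UPDATE notification_queue
		    SET attempts = attempts + 1, status = ?
		  WHERE id = ?`,
		status, id)
	return err
}
```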
Critical fixes (P0):
- Fix cooldown timing: Mark cooldown only after successful delivery, not before enqueue
- Add os.MkdirAll to queue initialization to prevent silent failures on fresh installs
- Add DNS re-validation at webhook send time to prevent DNS rebinding SSRF attacks
- Add SSRF validation for Apprise HTTP URLs
- Remove secret logging (bot tokens, routing keys) from debug logs
- Implement lastNotified cleanup to prevent unbounded memory growth
- Use shared HTTP client for webhooks to enable TLS connection reuse
- Add fallback to direct sending when queue enqueue fails
- Make the queue worker concurrent (5 workers gated by a semaphore; see the sketch after this list) to prevent head-of-line blocking
- Fix webhook rate limiter race condition with separate mutex
- Fix email manager thread safety with mutex on rate limiter
- Fix grouping timer leak by adding stopCleanup signal
- Fix webhook 429 double sleep (use Retry-After OR backoff, not both)
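A sketch of the bounded-concurrency worker referenced above; the 5-slot semaphore mirrors the description, the surrounding details are assumptions:

```go
package notify

import "sync"

// processQueue drains pending sends with at most 5 in flight, so one slow
// webhook no longer blocks every notification behind it.
func processQueue(items []func()) {
	sem := make(chan struct{}, 5) // semaphore: 5 concurrent workers
	var wg sync.WaitGroup
	for _, send := range items {
		sem <- struct{}{} // acquire a slot (blocks while 5 are busy)
		wg.Add(1)
		go func(send func()) {
			defer wg.Done()
			defer func() { <-sem }() // release the slot
			send()
		}(send)
	}
	wg.Wait()
}
```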
Frontend improvements:
- Add queue/DLQ management API methods (getQueueStats, getDLQ, retryDLQItem, deleteDLQItem)
- Add getNotificationHealth and getWebhookHistory endpoints
- Add Apprise test support to NotificationTestRequest type
Related to notification system audit
Remove 4 LLM-generated internal development docs that don't belong in the repository:
- MIGRATION_SCAFFOLDING.md
- NOTIFICATION_AUDIT.md
- NOTIFICATION_QUICK_REFERENCE.md
- NOTIFICATION_SYSTEM_MAP.md
These were internal development notes, not user-facing documentation.
- Add NOTIFICATION_AUDIT.md for system analysis
- Add NOTIFICATION_QUICK_REFERENCE.md for quick lookup
- Add NOTIFICATION_SYSTEM_MAP.md for architecture overview
- Fix tab panel missing rounded-tl corner when first tab is active
When updates complete quickly, the status API may return 'completed' before
the frontend detects the 'restarting' phase. This left users staring at a
frozen modal with no feedback, requiring manual page refresh.
Changes:
- When status is 'completed', immediately check /api/health
- If backend is healthy, reload the page to get new version
- If health check fails, assume restart in progress and start health polling
- Ensures users always get reloaded to the new version automatically
This fixes the UX issue reported in discussion #628 where the update modal
appeared frozen indefinitely despite successful update completion.
When pulse-sensor-proxy runs inside a container (Docker/LXC), it cannot
complete SSH workflows properly, leading to continuous [preauth] log floods
on the Proxmox host. This happens because the proxy is meant to run on the
host, not inside the container.
Changes:
- Import internal/system for InContainer() detection
- Add startup warning when running in containerized environment
- Point users to docs/TEMPERATURE_MONITORING.md for correct setup
- Allow suppression via PULSE_SENSOR_PROXY_SUPPRESS_CONTAINER_WARNING=true
This catches the misconfiguration early and directs users to supported
installation methods, preventing the SSH spam reported in discussion #628.
The publish-helm-chart workflow was failing with 403 errors when attempting
to upload Helm chart assets to GitHub releases. This was caused by the workflow
having only 'contents: read' permission. Changed to 'contents: write' to allow
the 'gh release upload' command to succeed.
The checksum generation was including pulse-host-agent-v*-darwin-arm64.tar.gz
twice: once from the *.tar.gz pattern and once from the pulse-host-agent-*
pattern. Fixed by using extglob to exclude .tar.gz and .sha256 files from
the agent binary patterns since tarballs are already matched separately.
Changed .pulse-shell from fixed 95rem cap to fluid clamp(95rem, 92vw, 120rem)
to match standard monitoring dashboard behavior (Proxmox, Grafana, Portainer).
On laptops/small screens: unchanged (capped at 1520px)
On 1080p displays: expands to ~1766px usable width
On 4K/ultrawide: expands up to 1920px max for readability
Added back 2xl column widths (totaling ~1720px) that properly fit within
the expanded shell, giving wide-display users more breathing room while
maintaining proportional scaling across all breakpoints.
Changed files:
- index.css: Update .pulse-shell max-width to use clamp()
- Dashboard.tsx: Add 2xl column widths calculated for expanded shell
- GuestRow.tsx: Add matching 2xl column widths
Removed 2xl: width overrides that caused the table to exceed container width.
At ≥1536px viewport, the 2xl breakpoint expanded table columns to ~1528px
total width while .pulse-shell container provides only ~1416px usable space,
forcing Net In/Net Out columns off-screen and requiring horizontal scroll.
Table now caps at xl: breakpoint widths (~1266px) which fit comfortably within
the container at all viewport sizes. Net In/Net Out columns are now visible
without scrolling on 1080p, 4K, and all wide displays.
Changed files:
- Dashboard.tsx: Remove 2xl: width classes from all table header columns
- GuestRow.tsx: Remove 2xl: width classes from all table cell columns
The original fix in c6c0ac63e only handled per-resource overrides when
thresholds were disabled (trigger <= 0 or Disabled=true). It did not
handle global DisableAll* flags (DisableAllStorage, DisableAllNodes,
DisableAllGuests, etc.).
When a user toggled a DisableAll* flag from false to true:
- Check* functions returned early without processing
- Existing active alerts remained in m.activeAlerts map
- Those alerts continued generating webhook notifications
- reevaluateActiveAlertsLocked didn't check DisableAll* flags
This commit fixes the issue by:
1. Updating reevaluateActiveAlertsLocked to check all DisableAll* flags
and resolve alerts for those resource types during config updates
2. Adding alert cleanup to Check* functions before early returns:
- CheckStorage: clears usage and offline alerts
- CheckNode: clears cpu/memory/disk/temperature and offline alerts
- CheckPMG: clears queue/message alerts and offline alerts
- CheckPBS: clears cpu/memory and offline alerts
- CheckHost: calls existing cleanup helpers
3. Adding comprehensive test coverage for DisableAllStorage scenario
Related to #561
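A sketch of the early-return cleanup added to the Check* functions, with hypothetical names and a heavily simplified alert map:

```go
package alerts

type alertConfig struct{ DisableAllStorage bool }

type Manager struct {
	config       alertConfig
	activeAlerts map[string]struct{} // keyed by resource ID (simplified)
}

// checkStorage sketches the added guard: when the global DisableAllStorage
// flag is set, resolve any alerts already active for the resource before the
// early return, so stale alerts stop generating notifications.
func (m *Manager) checkStorage(storageID string) {
	if m.config.DisableAllStorage {
		delete(m.activeAlerts, storageID) // the actual fix also emits a resolved event
		return
	}
	// ...normal threshold evaluation...
}
```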
Docker hosts with 'degraded' status were incorrectly appearing dimmed
(opacity-60) in the summary table, making them visually identical to
offline hosts. This was confusing because degraded hosts are still
actively reporting - they just have unhealthy containers or >35% of
containers not running.
The isHostOnline function now treats 'degraded' as an online status,
so these rows maintain full opacity. The status badge already provides
visual indication of the degraded state.
Root cause: v4.26.3 tarball and Docker image contained all 8 agent scripts,
but install.sh only copied install-docker-agent.sh to /opt/pulse/scripts/.
Users upgrading via install.sh ended up with missing scripts, causing 404s
when trying to add hosts via the UI.
Changes:
- Add deploy_agent_scripts() function to systematically deploy all scripts
- Deploy all 8 scripts: install-docker-agent.sh, install-container-agent.sh,
  install-host-agent.{sh,ps1}, uninstall-host-agent.{sh,ps1},
  install-sensor-proxy.sh, install-docker.sh
- Apply to both main installation and rollback/recovery code paths
This ensures bare-metal installations have feature parity with Docker deployments.
Related to #630
Proxmox 8.3+ changed the VM status API to return the `agent` field as an
object ({"enabled":1,"available":1}) instead of an integer (0 or 1). This
caused Pulse to incorrectly treat VMs as having no guest agent, resulting
in missing disk usage data (disk:-1) even when the guest agent was running
and functional.
The issue manifested as:
- VMs showing "Guest details unavailable" or missing disk data
- Pulse logs showing no "Guest agent enabled, querying filesystem info" messages
- `pvesh get /nodes/<node>/qemu/<vmid>/agent/get-fsinfo` working correctly
from the command line, confirming the agent was functional
Root cause:
The VMStatus struct defined `Agent` as an int field. When Proxmox 8.3+ sent
the new object format, JSON unmarshaling silently left the field at zero,
causing Pulse to skip all guest agent queries.
Changes:
- Created VMAgentField type with custom UnmarshalJSON to handle both formats:
* Legacy (Proxmox <8.3): integer (0 or 1)
* Modern (Proxmox 8.3+): object {"enabled":N,"available":N}
- Updated VMStatus.Agent from `int` to `VMAgentField`
- Updated all references to `detailedStatus.Agent` to use `.Agent.Value`
- The unmarshaler prioritizes the "available" field over "enabled" to ensure
we only query when the agent is actually responding
This fix maintains backward compatibility with older Proxmox versions while
supporting the new format introduced in Proxmox 8.3+.
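A sketch of the dual-format unmarshaler; the real struct may differ in detail, but the priority of "available" over "enabled" follows the description above:

```go
package proxmox

import "encoding/json"

// VMAgentField accepts both the legacy integer and the Proxmox 8.3+ object
// form of the `agent` field.
type VMAgentField struct {
	Value int
}

func (a *VMAgentField) UnmarshalJSON(data []byte) error {
	// Legacy format (Proxmox <8.3): plain integer 0 or 1.
	var n int
	if err := json.Unmarshal(data, &n); err == nil {
		a.Value = n
		return nil
	}
	// Modern format (Proxmox 8.3+): {"enabled":N,"available":N}.
	var obj struct {
		Enabled   *int `json:"enabled"`
		Available *int `json:"available"`
	}
	if err := json.Unmarshal(data, &obj); err != nil {
		return err
	}
	// Prefer "available": only query when the agent is actually responding.
	switch {
	case obj.Available != nil:
		a.Value = *obj.Available
	case obj.Enabled != nil:
		a.Value = *obj.Enabled
	}
	return nil
}
```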
Addresses two issues preventing configuration backup/restore:
1. Export passphrase validation mismatch: UI only validated 12+ char
requirement when using custom passphrase, but backend always enforced
it. Users with shorter login passwords saw unexplained failures.
- Frontend now validates all passphrases meet 12-char minimum
- Clear error message suggests custom passphrase if login password too short
2. Import data parsing failed silently: Frontend sent `exportData.data`
which was undefined for legacy/CLI backups (raw base64 strings).
Backend rejected these with no logs.
- Frontend now handles both formats: {status, data} and raw strings
- Backend logs validation failures for easier troubleshooting
Related to #646 where user reported "error after entering password" with
no container logs. These changes ensure proper validation feedback and
make the backup system resilient to different export formats.
The workflow was failing because /api/state requires authentication,
but the verification step was making an unauthenticated request.
Changes:
- Authenticate with demo/demo credentials before checking node count
- Use jq for cleaner JSON parsing instead of grep/cut
- Check total node count from API response instead of regex pattern matching
Related to user report about demo server not updating to 4.26.3.
The demo server was actually updated successfully, but the workflow
marked itself as failed due to the verification check failing.
Enhances discoverability for non-Docker installations (bare metal, LXC)
by displaying the bootstrap token prominently at the end of install.sh.
Changes:
- Add ASCII box display matching Docker startup format
- Show token value and file location
- Include usage instructions for first-time setup
- Only display if .bootstrap_token file exists
- Note that the token file is auto-deleted after setup, matching actual behavior
With this change, bootstrap token is now prominently displayed across
all installation methods:
- Docker: startup logs (commit 731eb586)
- Bare metal/LXC: install.sh completion (this commit)
- CLI: pulse bootstrap-token command (commit 731eb586)
Related to #645
Add comprehensive documentation for new alert system reliability features:
**API Documentation (docs/API.md):**
- Dead Letter Queue (DLQ) API endpoints
- GET /api/notifications/dlq - Retrieve failed notifications
- GET /api/notifications/queue/stats - Queue statistics
- POST /api/notifications/dlq/retry - Retry DLQ items
- POST /api/notifications/dlq/delete - Delete DLQ items
- Prometheus metrics endpoint documentation
- 18 metrics covering alerts, notifications, and queue health
- Example Prometheus configuration
- Example PromQL queries for common monitoring scenarios
**Configuration Documentation (docs/CONFIGURATION.md):**
- Alert TTL configuration
- maxAlertAgeDays, maxAcknowledgedAgeDays, autoAcknowledgeAfterHours
- Flapping detection configuration
- flappingEnabled, flappingWindowSeconds, flappingThreshold, flappingCooldownMinutes
- Usage examples and common scenarios
- Best practices for preventing notification storms
All new features are fully documented with examples and default values.
The bootstrap token security requirement was added proactively but
lacked discoverability, causing user friction during first-run setup.
These improvements make the token easier to find while maintaining
the security benefit.
Improvements:
- Display bootstrap token prominently in startup logs with ASCII box
(previously: single line log message)
- Add `pulse bootstrap-token` CLI command to display token on demand
(Docker: docker exec <container> /app/pulse bootstrap-token)
- Improve error messages in quick-setup API to show exact commands
for retrieving token when missing or invalid
- Error messages now include both Docker and bare metal examples
User experience improvements:
- Token visible in `docker logs` output immediately
- Clear instructions printed with token
- Helpful error messages if token is wrong/missing
- CLI helper for operators who need to retrieve token later
Security unchanged:
- Bootstrap token still required for first-run setup
- Token still auto-deleted after successful setup
- No bypass mechanism added
Related to discussion about bootstrap token UX friction.
Simple CLI utility to generate bcrypt password hashes for admin users.
Usage: hashpw <password>
This utility helps administrators generate properly hashed passwords
for use in configuration files or manual user setup.
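The core of such a utility fits in a few lines around golang.org/x/crypto/bcrypt; a sketch, not the exact source:

```go
package main

import (
	"fmt"
	"os"

	"golang.org/x/crypto/bcrypt"
)

// hashpw: print a bcrypt hash of the password given as the sole argument.
func main() {
	if len(os.Args) != 2 {
		fmt.Fprintln(os.Stderr, "usage: hashpw <password>")
		os.Exit(1)
	}
	hash, err := bcrypt.GenerateFromPassword([]byte(os.Args[1]), bcrypt.DefaultCost)
	if err != nil {
		fmt.Fprintln(os.Stderr, "hashpw:", err)
		os.Exit(1)
	}
	fmt.Println(string(hash))
}
```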
After the security hardening that introduced bootstrap token protection,
the first-run setup flow was broken because FirstRunSetup.tsx didn't
prompt users for the token. This caused a 401 "Bootstrap setup token
required" error during initial admin account creation.
Changes:
- Add dedicated unlock screen before the setup wizard
- Display instructions for retrieving token from host
- Include bootstrap token in quick-setup API request headers and body
- Only require unlock for first-run setup (skip in force mode)
The unlock screen follows the documented flow in README.md and ensures
only users with host access can configure an unconfigured instance.
Related to #639
Adds automated validation script to prevent the pattern of patch
releases caused by missing files/artifacts.
scripts/validate-release.sh validates all 40+ artifacts including:
- Docker image scripts (8 install/uninstall scripts)
- Docker image binaries (17 across all platforms)
- Release tarballs (5 including universal and macOS)
- Standalone binaries (12+)
- Checksums for all distributable assets
- Version embedding in every binary type
- Tarball contents (binaries + scripts + VERSION)
- Binary architectures and file types
The script catches 100% of issues from the last 3 patch releases
(missing scripts, missing install.sh, missing binaries, broken
version embedding).
Updated RELEASE_CHECKLIST.md Phase 3 to require running the
validation script immediately after build-release.sh and before
proceeding to Docker build/publish phases.
Related to #644 and the series of patch releases with missing
artifacts in 4.26.x.
The Dockerfile and build-release.sh were missing several installer and uninstaller
scripts that the router expects to serve via HTTP endpoints:
- install-container-agent.sh
- install-host-agent.ps1
- uninstall-host-agent.sh
- uninstall-host-agent.ps1
This caused 404 errors when users attempted to add Docker/Podman hosts or use the
PowerShell installer, as reported in #644.
Changes:
- Dockerfile: Added missing scripts to /opt/pulse/scripts/ with proper permissions
- build-release.sh: Added missing scripts to both per-platform and universal tarballs
to ensure bare-metal deployments serve the same endpoints as Docker deployments
The chart container was set to min-h-[12rem] (192px) on desktop while the SVG
was hardcoded to 128px, creating 64px of unwanted empty space. Changed container
to fixed h-32 (128px) to match the SVG height.
The .sha256 files generated during release builds contained only the hash,
but sha256sum -c expects the format "<hash>  <filename>" (hash, two spaces,
filename). This caused all install.sh updates to fail with "Checksum
verification failed" even when the checksum was correct.
Root cause: build-release.sh line 289 was using awk to extract only field 1
(the hash), discarding the filename that sha256sum -c needs.
Fix: Remove the awk filter to preserve the full sha256sum output format.
This affected the demo server update workflow and user installations.
Removed CSS truncate from key identifier columns (container names, service names,
guest names, host names, image names) that were making data inaccessible on mobile/
touch devices where title tooltips don't work.
Users can now read full identifiers via horizontal scroll (already implemented via
ScrollableTable component). Data should always be readable without requiring additional
UI affordances.
Changed files:
- DockerUnifiedTable: Remove truncate from container/service names and images
- GuestRow: Remove truncate from guest names
- HostsOverview: Remove truncate from host display names and hostnames
Column resizing remains on backlog as optional enhancement; users should not need
a drag handle just to read the contents.
Issues found during systematic audit after #642:
1. CRITICAL BUG - Rollback downloads were completely broken:
- Code constructed: pulse-linux-amd64 (no version, no .tar.gz)
- Actual asset name: pulse-v4.26.1-linux-amd64.tar.gz
- This would cause 404 errors on all rollback attempts
- Fixed: Construct correct tarball URL with version
- Added: Extract tarball after download to get binary
2. TEMPERATURE_MONITORING.md referenced non-existent v4.27.0:
- Changed to use /latest/download/ for future-proof docs
3. API.md example had wrong filename format:
- Changed pulse-linux-amd64.tar.gz to pulse-v4.30.0-linux-amd64.tar.gz
- Ensures example matches actual release asset naming
The rollback bug would have affected any user attempting to roll back
to a previous version via the UI or API.
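A sketch of the corrected URL construction; the repository path and the linux-only assumption are illustrative:

```go
package updates

import (
	"fmt"
	"runtime"
)

// rollbackAssetURL shows the corrected asset naming: the version tag and
// .tar.gz suffix are part of the release asset name.
func rollbackAssetURL(version string) string {
	// e.g., version "v4.26.1" -> .../v4.26.1/pulse-v4.26.1-linux-amd64.tar.gz
	return fmt.Sprintf(
		"https://github.com/rcourtman/Pulse/releases/download/%s/pulse-%s-linux-%s.tar.gz",
		version, version, runtime.GOARCH)
}
```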
Issue: HOST_AGENT.md documented downloading pulse-host-agent binaries
from GitHub releases, but those assets didn't exist. Only tarballs were
available, making manual installation unnecessarily complex.
Changes:
- Copy standalone host-agent binaries (all architectures) to release/
directory alongside sensor-proxy binaries
- Include host-agent binaries in checksum generation
- Update HOST_AGENT.md to clarify available architectures
- Retroactively uploaded missing binaries to v4.26.1
This enables air-gapped and manual installations without requiring an
already-running Pulse server to download from.