Squashfs snap mounts on Ubuntu (and similar read-only filesystems like
erofs on Home Assistant OS) always report near-full usage and trigger
false disk alerts. The filter logic existed in Proxmox monitoring but
wasn't applied to host agents.
Changes:
- Extract read-only filesystem filter to shared pkg/fsfilters package
- Apply filter in hostmetrics.collectDisks() for host/docker agents
- Apply filter in monitor.ApplyHostReport() for backward compatibility
- Convert internal/monitoring/fs_filters.go to wrapper functions
This prevents squashfs, erofs, iso9660, cdfs, udf, cramfs, romfs, and
saturated overlay filesystems from generating alerts. Filtering happens
at both collection time (agents) and ingestion time (server) to ensure
older agents don't cause false alerts until they're updated.
Tests were timing out waiting for login form because fresh installations
show the first-run setup screen instead. Adding PULSE_AUTH_USER and
PULSE_AUTH_PASS environment variables pre-configures authentication and
bypasses the setup screen, allowing tests to login directly.
Related to #695
Root cause: Pulse server was crashing on startup with permission denied when
trying to create .encryption.key file.
The docker-compose test config set PULSE_DATA_DIR=/tmp/pulse-test-data, but
this directory was owned by root (created by Docker volume mount). The
entrypoint script only chowns /data, not /tmp/pulse-test-data.
Solution: Change PULSE_DATA_DIR to /data which is already handled by the
entrypoint script's chown command (line 36 of docker-entrypoint.sh).
This fixes the fatal error:
failed to get encryption key: failed to save key:
open /tmp/pulse-test-data/.encryption.key: permission denied
Related to #695
Tests were failing with connection refused even though healthcheck passed. This
suggests the Docker port mapping may not be established when healthcheck passes.
Add explicit verification step that curls localhost:7655 from the host before
running tests. This will reveal if the issue is:
1. Port mapping not working (server healthy inside container but unreachable from host)
2. Server not actually running/listening
3. Timing issue where port mapping needs more time to establish
If verification fails, output container logs to help diagnose the root cause.
Related to #695
Integration tests were failing because the workflow didn't wait for containers
to be healthy before running Playwright tests.
Changes:
- Wait for mock-github container healthcheck to pass (60s timeout)
- Wait for pulse-test-server healthcheck to pass (60s timeout)
- Output container logs if healthcheck fails for debugging
- Remove arbitrary sleep 20 in favor of actual healthcheck verification
This will help diagnose why the pulse server isn't responding on port 7655.
Related to workflow run 19281966710.
The release workflow uses 'npm ci' which requires package-lock.json to exist.
Force-add package-lock.json despite root gitignore for reproducible builds.
Fixes workflow failure in run 19281878604.
The "Check Proxy Nodes" button in Settings > Diagnostics was returning
403 Forbidden due to missing CSRF token. The frontend was using native
fetch() instead of apiFetch() which automatically includes CSRF tokens
for POST requests.
Fixed three endpoints in Settings.tsx:
- /api/diagnostics (GET) - for consistency
- /api/diagnostics/temperature-proxy/register-nodes (POST) - reported issue
- /api/diagnostics/docker/prepare-token (POST) - same bug
Note: Export/import config endpoints intentionally continue using native
fetch() because they need custom 401/403 handling to show the API token
modal instead of redirecting to login.
Critical deadlock fix:
- Stop() was holding n.mu lock while calling queue.Stop()
- queue.Stop() waits for worker goroutines to finish
- Worker goroutines call ProcessQueuedNotification() which needs n.mu lock
- This created a classic lock-order deadlock
Fix:
- Unlock n.mu before calling queue.Stop()
- Relock after queue shutdown completes
- Workers can now finish and acquire lock as needed
This resolves 30-second test timeouts in notifications package.
Tests now complete in <1s instead of timing out at 30s.
Update test expectations to match new SMART-preferred behavior:
- mergeNVMeTempsIntoDisks now prioritizes SMART temps over NVMe temps
- NVMe temps only applied to disks with Temperature == 0
- Tests were failing because disks started with non-zero temperatures
- Changed test disks to start with Temperature: 0 to simulate fresh disks
This change was introduced in commit 2a79d57f7 (Add SMART temperature
collection for physical disks) but tests weren't updated.
Fixes TestMergeNVMeTempsIntoDisks and TestMergeNVMeTempsIntoDisksClearsMissingOrInvalid.
Two critical fixes to prevent test timeouts:
1. Nil map panic in TestPollPVEInstanceUsesRRDMemUsedFallback:
- Test monitor was missing nodeLastOnline map initialization
- Panic occurred when pollPVEInstance tried to update nodeLastOnline[nodeID]
- Caused deadlock when panic recovery tried to acquire already-held mutex
- Added nodeLastOnline: make(map[string]time.Time) to test monitor
2. Alert manager goroutine leak in Docker tests:
- newTestMonitor() created alert manager but never stopped it
- Background goroutines (escalationChecker, periodicSaveAlerts) kept running
- Added t.Cleanup(func() { m.alertManager.Stop() }) to test helper
These fixes resolve the 10+ minute test timeouts in CI workflows.
Related to workflow run 19281508603.
Remove t.Parallel() from tests that verify global Prometheus gauge values.
When tests run in parallel, they update the same global gauges
(discoveryScanServers, discoveryScanErrors) causing race conditions and
incorrect metric values.
Fixes test failure in workflow run 19281332332:
- TestPerformScanRecordsHistoryAndMetrics expected 2 servers, got 1
Related to release workflow preflight tests.
Three categories of fixes:
1. Goroutine leak causing 10-minute timeout:
- Add defer mon.notificationMgr.Stop() in monitor_memory_test.go
- Background goroutines from notification manager weren't being stopped
2. Database NULL column scanning errors:
- Change LastError from string to *string in queue.go
- Change PayloadBytes from int to *int in queue.go
- SQL NULL values require pointer types in Go
3. SSRF protection blocking test servers:
- Check allowlist for localhost before rejecting in notifications.go
- Set PULSE_DATA_DIR to temp directory in tests
- Add defer nm.Stop() calls to prevent goroutine leaks
Fixes for preflight test failures in workflow run 19280879903.
Snap-installed Docker does not automatically create a docker group,
causing permission denied errors when the pulse-docker service user
tries to access /var/run/docker.sock.
Changes:
- Auto-detect Snap Docker installations
- Create docker group if missing when Snap Docker is detected
- Restart Snap Docker after group creation to refresh socket ACLs
- Add socket access validation before starting the service
- Handle symlinked Docker sockets in systemd unit ReadWritePaths
- Document troubleshooting steps in DOCKER_MONITORING.md
When pulse-sensor-proxy runs inside an LXC container on a Proxmox host,
pvecm status fails with "ipcc_send_rec[1] failed: Unknown error -1"
because the container can't access the host's corosync IPC socket.
This caused repeated warnings every few seconds even though the proxy
can function correctly by discovering local host addresses.
Extended the standalone node detection to recognize "ipcc_send_rec"
errors as indicating an LXC container deployment and gracefully fall
back to local address discovery instead of logging warnings.
Fixes three test failures that were blocking release workflow:
1. TestApplyDockerReportGeneratesUniqueIDsForCollidingHosts:
- Initialize dockerTokenBindings and dockerMetadataStore in test helper
- These maps were nil causing panic on first access
2. TestSendGroupedAppriseHTTP & TestSendTestNotificationAppriseHTTP:
- Configure allowlist to permit localhost (127.0.0.1) for test servers
- SSRF protection was blocking httptest.NewServer() URLs
- Tests need to allowlist the test server IP to bypass security checks
Related to workflow fix in 5fa78c3e3.
The Python heredoc was not indented, causing YAML parsers to interpret
the Python code as YAML syntax. This caused workflow_dispatch runs to
fail instantly with 'workflow file issue' error before any jobs could start.
The fix indents the heredoc content and changes delimiter from 'PY' to
'EOF' to match standard conventions.
Add defensive mitigation to prevent repeated guest-get-osinfo calls that
trigger buggy behavior in QEMU guest agent 9.0.2 on OpenBSD 7.6.
The issue: OpenBSD doesn't have /etc/os-release (Linux convention), and
qemu-ga 9.0.2 appears to spawn excessive helper processes trying to read
this file whenever guest-get-osinfo is called. These helpers don't clean
up properly, eventually exhausting the process table and crashing the VM.
The fix: Track consecutive OS info failures per VM. After 3 failures,
automatically skip future guest-get-osinfo calls for that VM while
continuing to fetch other guest agent data (network interfaces, version).
This prevents triggering the buggy code path while maintaining most guest
agent functionality.
The counter resets on success, so if the guest agent is upgraded or the
issue is resolved, Pulse will automatically resume OS info collection.
Related to #692
Bare metal installations couldn't serve Windows host agent downloads because
the Windows and macOS binaries weren't included in the universal tarball. The
download endpoint would return 404 when Windows users tried to install the
host agent from a bare metal Pulse deployment (Proxmox LXC, Debian VM, etc.).
Changes:
- build-release.sh: Copy Windows/macOS host agent binaries into universal tarball
- build-release.sh: Create symlinks for Windows binaries without .exe extension
- validate-release.sh: Add Windows 386 binary and symlink to Docker validation
- validate-release.sh: Add explicit validation that universal tarball contains all Windows/macOS binaries
The universal tarball now matches the Docker image, ensuring both deployment
methods can serve the complete set of downloadable binaries for the /download/
endpoint.
Replace mv with install command to ensure correct SELinux context.
The mv command preserves the user_tmp_t label from /tmp, which
prevents systemd from executing the binary on SELinux systems.
The install command creates a new file with the correct label for
/usr/local/bin. Added automatic restorecon call for SELinux systems
to ensure policy compliance.
Related to #688
The grep pattern was too loose and could match filenames like:
- pulse-v4.29.0-linux-amd64.tar.gz (correct)
- pulse-v4.29.0-linux-amd64.tar.gz.sha256 (also matched)
Using grep -w ensures we only match the exact filename as a complete word,
preventing false matches on files with the same prefix.
Following best practices for release format transitions:
- build-release.sh now generates both formats from same sha256sum run
- Workflow uploads both checksums.txt and individual .sha256 files
- Validation ensures both formats exist and match
This provides a safe transition period for users with older install scripts
while maintaining the cleaner checksums.txt format going forward. After 2-3
releases when most users have updated scripts, we can remove .sha256 generation.
Related: Install script already supports both formats (falls back gracefully).
The install script now tries checksums.txt first (v4.29.0+), then falls back
to individual .sha256 files (v4.28.0 and earlier). This ensures users can
update from any version regardless of which checksum format was used.
This fixes the release format transition issue where changing asset structure
broke updates for users on older versions.
Aligns with release asset reduction changes. The install script now downloads the unified checksums.txt file and extracts the checksum for the specific architecture being installed.
The workflow was failing because GitHub returns 302 redirects for freshly published release assets while the CDN propagates. Adding -L flag to curl commands allows them to follow redirects and properly detect when assets are available.
High-impact improvements based on Codex recommendations:
1. values.schema.json - JSON schema validation catches config errors at install time
2. helm-docs automation - Auto-generates documentation from values.yaml comments
3. kind smoke tests - Deploys and upgrades chart in real cluster to catch runtime issues
4. ServiceMonitor template - Built-in Prometheus integration for observability
5. Artifact Hub metadata - Changelog, links, and maintainer info for better discoverability
These improvements provide:
- Configuration validation before deployment
- Always up-to-date documentation
- Runtime validation in CI
- First-class monitoring support
- Better user experience on Artifact Hub
Related to #686
- Auto-update Chart.yaml version from release tag or manual input
- Add strict helm lint validation before publishing
- Validate chart templates with multiple configuration scenarios
- Ensures chart quality before publishing to GitHub Pages
Enables automatic listing on https://artifacthub.io for improved
Helm chart discovery and provides metadata like screenshots, links,
and maintainer information.
GHCR OCI packages cannot be made public through any available mechanism:
- Package doesn't appear in user/repo package lists
- API endpoints return 404
- Workflow tokens lack package visibility permissions
- Manual UI shows no packages to configure
- OCI annotations don't link package to repository
Implementing GitHub Pages Helm repo as canonical distribution method:
- Uses chart-releaser-action to publish to gh-pages branch
- Provides standard 'helm repo add' workflow without authentication
- Maintains OCI push for future use if GHCR resolves visibility issues
Resolves#686
Linux host-agent binaries don't have separate archives - they're included in
the main pulse-v*.tar.gz files. Only macOS and Windows have separate archives.
Adding org.opencontainers.image.source annotation will connect the GHCR package
to the repository, making it visible in the repo's packages section and allowing
proper visibility management.
Related to #686
Removed validation checks for standalone binaries that are no longer
uploaded to GitHub releases. These binaries are only needed in Docker
images for the /download/ endpoint.
Updated required assets list to include all versioned tarballs/zips
instead of standalone binaries.
Removed:
- Individual .sha256 files (checksums.txt already contains all checksums)
- Standalone binaries without version numbers (users should download versioned tarballs/zips)
Standalone binaries are only needed in Docker images for the /download/ endpoint.
GitHub releases should only contain versioned archives for user downloads.
This reduces release assets from ~54 files to ~19 files per release.
The pulse-chart package in GHCR currently requires authentication for pulls
because it defaults to private visibility. This affects all users trying to
install via `helm install oci://ghcr.io/rcourtman/pulse-chart`.
This commit adds a workflow step to automatically set the package to public
after each push, enabling anonymous pulls without requiring `helm registry login`.
Note: The existing package will need one-time manual configuration via GitHub
web UI until the next release triggers this workflow.
Related to discussion #686
Users don't care about CI/CD improvements, release workflows, build
processes, or testing infrastructure. Only include user-visible changes.
Related to #671
Commit hashes clutter the release notes and aren't useful for end users.
Only include issue references when explicitly mentioned in commits.
Related to #671
Remove # symbol from commit hash references so GitHub auto-links them.
Format: (abc123) instead of (#abc123)
Issue references still use #: (#123)
Related to #671
Backticks in GitHub Actions output were still being interpreted even
when assigned to a variable and then echoed to a file. Use heredoc
with single quotes to prevent any bash expansion.
Related to #671
Use --notes-file instead of --notes with variable expansion to prevent
bash from interpreting markdown code blocks as shell commands.
Fixes the error where installation examples like:
```bash
docker pull rcourtman/pulse:v4.29.0
```
Were being executed as actual commands during release creation.
Related to #671
actions/checkout@v4 does not fetch tags by default, causing the
previous tag lookup to fail and fall back to comparing with the
first commit SHA. Added fetch-depth: 0 to fetch all history including tags.
Just use the latest tag directly instead of trying to exclude the current version.
Since we're generating release notes BEFORE creating the tag, the latest tag
will always be the previous release.