Major improvements:
- Trigger on tag push (git push origin vX.Y.Z) instead of workflow_dispatch
- Auto-generate release notes using GitHub's API
- Tag is single source of truth (eliminates version/tag mismatch)
- Follows industry standard pattern (Kubernetes, Docker, HashiCorp)
- Also push 'latest' tag to Docker registries
- Simpler workflow: update VERSION → commit → tag → push tag
Breaking change: Manual workflow_dispatch releases no longer supported.
Use: git tag vX.Y.Z && git push origin vX.Y.Z
- Replace unreliable git fetch --dry-run check
- Use git rev-parse to compare local and remote commits
- Prevents false warnings about diverged branches
- Check VERSION file matches before triggering workflow
- Validate working directory is clean
- Confirm on main branch and up to date
- Load release notes from /tmp/release_notes_X.Y.Z.md
- Prevents wasting CI time on misconfigured releases
Proxmox VE 9.x removed support for the "full" parameter in the
/nodes/{node}/qemu/{vmid}/status/current endpoint. When Pulse sent
GetVMStatus() requests with ?full=1, Proxmox responded with:
API error 400: {"errors":{"full":"property is not defined in schema..."}}
This caused the cluster client to mark ALL endpoints as unhealthy, which
cascaded into multiple failures:
- VM status checks failed
- Guest agent queries were blocked
- Filesystem data collection stopped working
- All Windows VMs showed disk:-1 (unknown) instead of actual disk usage
The fix removes the ?full=1 parameter since Proxmox 9.x returns all data
by default without needing this parameter. This maintains backward
compatibility with older Proxmox versions while fixing the issue in 9.x.
After this fix:
- Cluster endpoints are correctly marked as healthy
- Guest agent queries work properly
- Windows VMs report actual disk usage (e.g., 26% on C:\ drive)
- VM monitoring functions normally on Proxmox 9.x
Implements a "Remember Me" option that allows users to stay logged in
for 30 days instead of the default 24 hours. This addresses the pain
point of frequent re-authentication in LAN-only environments while
maintaining authentication security.
Backend changes:
- Add rememberMe field to login request handling
- Support variable session durations (24h default, 30d with Remember Me)
- Implement sliding session expiration that extends sessions on each
authenticated request using the original duration
- Store OriginalDuration in session data for proper sliding window
- Update session cookie MaxAge to match session duration
Frontend changes:
- Add "Remember Me for 30 days" checkbox to login form
- Pass rememberMe flag in login request
- Improve UI with clear duration indication
Key features:
- Sessions extend automatically on each request (sliding window)
- Original duration preserved across session extension
- Backward compatible with existing sessions (legacy sessions work)
- Sessions persist across server restarts
This provides a better user experience for LAN deployments without
compromising security by completely disabling authentication.
After implementing the health gate, added comprehensive safety measures
to prevent the health checks themselves from becoming a new failure point.
**Problem**: Previous commit added strict health checks but could fail in
edge cases:
- `pct exec` could hang if container stopped/frozen → installer deadlocks
- systemctl/journalctl might not be available → diagnostics fail
- Container access check could fail for transient reasons
- pvecm error detection was fragile (string matching specific messages)
**Solutions Implemented**:
1. **Timeouts on All External Commands** (install.sh:1596,1618)
- `timeout 5` on systemctl checks
- `timeout 10` on pct exec checks
- Prevents installer from hanging indefinitely
2. **Graceful Degradation** (install.sh:1602-1630)
- Check for systemctl/pct availability before using
- Warn if tools missing instead of failing
- Container check is warning-only (may be transient)
- Only fail on critical checks: service running, socket exists
3. **Bypass Flag Support** (install.sh:1589-1594)
- Set `PULSE_SKIP_HEALTH_CHECKS=1` to bypass all checks
- Documented in error messages for troubleshooting
- Allows installation in unsupported environments
4. **Flexible Diagnostics** (install.sh:1640-1647)
- Use journalctl if available, fallback to syslog
- Conditional tool-specific advice
5. **Broader Error Detection** (ssh.go:582-628)
- List of 14 standalone indicators (vs 5 hardcoded checks)
- Case-insensitive matching for localization tolerance
- Permissive strategy: treat any known pattern as standalone
- Handles variations: "no cluster", "IPC", "connection refused", etc.
6. **Enhanced Test Coverage** (ssh_test.go:+35 lines)
- Added 3 new test cases (variation patterns)
- Tests now cover 8 standalone scenarios + 3 negative cases
- All tests pass (11/11)
**Impact**:
- Health gate won't block installation in edge cases
- Better user experience on non-standard setups
- Standalone detection handles more error message variations
- Clear escape hatch for troubleshooting (bypass flag)
**Confidence Level**: High
- All tests pass (bash syntax + Go unit tests)
- Graceful fallbacks for every external command
- Only critical checks are hard failures
- Warnings guide users through validation issues
Related to #571
Tests validate the error pattern matching logic added in previous commit,
ensuring we correctly identify:
1. **Standalone Node Patterns** (should trigger fallback):
- Classic: 'Corosync config does not exist'
- LXC ipcc errors: 'ipcc_send_rec[1] failed: Unknown error -1'
- Access control errors: 'Unable to load access control list'
- All patterns from GitHub issue #571
2. **Genuine Errors** (should NOT trigger fallback):
- Network timeouts
- Permission denied
- Command not found
Tests use real error messages from production GitHub issues to prevent
regressions. All 9 test cases pass.
Coverage:
- 6 standalone/LXC error patterns
- 3 genuine error cases (negative testing)
- References issue #571 for traceability
Related to #571
Users were abandoning Pulse due to catastrophic temperature monitoring setup failures. This commit addresses the root causes:
**Problem 1: Silent Failures**
- Installations reported "SUCCESS" even when proxy never started
- UI showed green checkmarks with no temperature data
- Zero feedback when things went wrong
**Problem 2: Missing Diagnostics**
- Service failures logged only in journald
- Users saw "Something going on with the proxy" with no actionable guidance
- No way to troubleshoot from error messages
**Problem 3: Standalone Node Issues**
- Proxy daemon logged continuous pvecm errors as warnings
- "ipcc_send_rec" and "Unknown error -1" messages confused users
- These are expected for non-clustered/LXC setups
**Solutions Implemented:**
1. **Health Gate in install.sh (lines 1588-1629)**
- Verify service is running after installation
- Check socket exists on host
- Confirm socket visible inside container via bind mount
- Fail loudly with specific diagnostics if any check fails
2. **Actionable Error Messages in install-sensor-proxy.sh (lines 822-877)**
- When service fails to start: dump full systemctl status + 40 lines of logs
- When socket missing: show permissions, service status, and remediation command
- Include common issues checklist (missing user, permission errors, lm-sensors, etc.)
- Direct link to troubleshooting docs
3. **Better Standalone Node Detection in ssh.go (lines 585-595)**
- Recognize "Unknown error -1" and "Unable to load access control list" as LXC indicators
- Log at INFO level (not WARN) since this is expected behavior
- Clarify message: "using localhost for temperature collection"
**Impact:**
- Eliminates "green checkmark but no temps" scenario
- Users get immediate actionable feedback on failures
- Standalone/LXC installations work silently without error spam
- Reduces support burden from #571 (15+ comments of user frustration)
Related to #571
CRITICAL SECURITY FIX: The /download/pulse-host-agent endpoint was directly
concatenating user-supplied platform and arch query parameters into file paths
without validation, allowing path traversal attacks.
An attacker could request:
/download/pulse-host-agent?platform=../../etc/passwd
to read arbitrary files from the container filesystem.
Fix: Add input validation to only allow alphanumeric characters and hyphens
in platform/arch parameters before using them in file paths.
Related: Codex security audit identified this during pre-release review
Draft releases created without --target get 'untagged-...' slugs instead of
the proper tag name. This breaks all download URLs since installers expect
/download/vX.Y.Z/... but assets are under /download/untagged-.../
Add --target parameter to gh release create to ensure the tag is created
properly even for draft releases.
The releases REST API endpoint is eventually consistent for draft releases.
Immediately after gh release create, the new release may not appear in the
listing yet, causing the release_id lookup to return empty and fail validation.
Add retry loop (10 attempts, 2s intervals) to wait for the release to appear
in the API before extracting the ID. Also add validation to ensure we got
a valid release_id before proceeding.
This fixes the immediate validation failure with 'Release metadata is missing'.
Related to systematic release workflow failures. The workflow has never
successfully completed from start to finish since validation was added.
Root causes identified and fixed:
1. **GraphQL node_id vs numeric release ID**: The create-release job was
using `gh release view --json id` which returns a GraphQL node_id
(RE_kwDON5nJtM4PmlTt) instead of the numeric database ID (261772525)
needed by the REST API. The validation workflow then failed with 404
when trying to download assets. Fixed by using `gh api` to get the
numeric ID from the releases list endpoint.
2. **Missing binaries in Docker image**: The validation script expects 26
binaries + 3 Windows symlinks in /opt/pulse/bin/, but the Dockerfile
was only copying a subset. Missing binaries included the main pulse
server binary, armv6/386 builds for all agents, and caused immediate
validation failure. Fixed by copying all built binaries from
backend-builder stage.
3. **Assets-only validation fallback broken**: When Docker image pull
times out, the workflow falls back to assets-only validation but was
still calling the validation script without --skip-docker flag,
causing it to fail on the first docker command. Fixed by passing
--skip-docker flag in the fallback path.
4. **Asset download pagination**: The asset download was not using
--paginate, which would cause silent failures once we exceed 30 assets
(currently at 27). Fixed by adding --paginate to gh api call.
All fixes verified locally and address the complete failure chain.