Commit graph

668 commits

Author SHA1 Message Date
rcourtman
cdb692c8fd Refactor to tag-driven release workflow with auto-changelog
Major improvements:
- Trigger on tag push (git push origin vX.Y.Z) instead of workflow_dispatch
- Auto-generate release notes using GitHub's API
- Tag is single source of truth (eliminates version/tag mismatch)
- Follows industry standard pattern (Kubernetes, Docker, HashiCorp)
- Also push 'latest' tag to Docker registries
- Simpler workflow: update VERSION → commit → tag → push tag

Breaking change: Manual workflow_dispatch releases no longer supported.
Use: git tag vX.Y.Z && git push origin vX.Y.Z
2025-11-13 11:48:10 +00:00
rcourtman
e822ab7ae1 Fix remote sync check in release trigger script
- Replace unreliable git fetch --dry-run check
- Use git rev-parse to compare local and remote commits
- Prevents false warnings about diverged branches
2025-11-13 11:43:36 +00:00
rcourtman
0739b3ade1 Prepare v4.30.0 release 2025-11-13 11:40:54 +00:00
rcourtman
6a647c3274 Revert VERSION to 4.29.5 for release testing 2025-11-13 11:38:37 +00:00
rcourtman
d400befd35 Add pre-flight validation script for releases
- Check VERSION file matches before triggering workflow
- Validate working directory is clean
- Confirm on main branch and up to date
- Load release notes from /tmp/release_notes_X.Y.Z.md
- Prevents wasting CI time on misconfigured releases
2025-11-13 11:36:53 +00:00
rcourtman
332f92b696 Prepare v4.30.0 release 2025-11-13 11:34:49 +00:00
rcourtman
67b3b2ff84 Fix Proxmox 9.x VM status endpoint incompatibility
Proxmox VE 9.x removed support for the "full" parameter in the
/nodes/{node}/qemu/{vmid}/status/current endpoint. When Pulse sent
GetVMStatus() requests with ?full=1, Proxmox responded with:

  API error 400: {"errors":{"full":"property is not defined in schema..."}}

This caused the cluster client to mark ALL endpoints as unhealthy, which
cascaded into multiple failures:
- VM status checks failed
- Guest agent queries were blocked
- Filesystem data collection stopped working
- All Windows VMs showed disk:-1 (unknown) instead of actual disk usage

The fix removes the ?full=1 parameter since Proxmox 9.x returns all data
by default without needing this parameter. This maintains backward
compatibility with older Proxmox versions while fixing the issue in 9.x.

After this fix:
- Cluster endpoints are correctly marked as healthy
- Guest agent queries work properly
- Windows VMs report actual disk usage (e.g., 26% on C:\ drive)
- VM monitoring functions normally on Proxmox 9.x
2025-11-13 11:22:36 +00:00
rcourtman
c1d076db0c Simplify Remember Me label text
Remove '30 days' from label - most apps just say 'Remember me' without
specifying the duration.
2025-11-13 10:40:53 +00:00
rcourtman
56a7579c99 Add Remember Me feature with sliding session expiration (Related to #707)
Implements a "Remember Me" option that allows users to stay logged in
for 30 days instead of the default 24 hours. This addresses the pain
point of frequent re-authentication in LAN-only environments while
maintaining authentication security.

Backend changes:
- Add rememberMe field to login request handling
- Support variable session durations (24h default, 30d with Remember Me)
- Implement sliding session expiration that extends sessions on each
  authenticated request using the original duration
- Store OriginalDuration in session data for proper sliding window
- Update session cookie MaxAge to match session duration

Frontend changes:
- Add "Remember Me for 30 days" checkbox to login form
- Pass rememberMe flag in login request
- Improve UI with clear duration indication

Key features:
- Sessions extend automatically on each request (sliding window)
- Original duration preserved across session extension
- Backward compatible with existing sessions (legacy sessions work)
- Sessions persist across server restarts

This provides a better user experience for LAN deployments without
compromising security by completely disabling authentication.
2025-11-13 10:37:08 +00:00
rcourtman
8e993ea901 Add critical safety guards to temperature proxy installation
After implementing the health gate, added comprehensive safety measures
to prevent the health checks themselves from becoming a new failure point.

**Problem**: Previous commit added strict health checks but could fail in
edge cases:
- `pct exec` could hang if container stopped/frozen → installer deadlocks
- systemctl/journalctl might not be available → diagnostics fail
- Container access check could fail for transient reasons
- pvecm error detection was fragile (string matching specific messages)

**Solutions Implemented**:

1. **Timeouts on All External Commands** (install.sh:1596,1618)
   - `timeout 5` on systemctl checks
   - `timeout 10` on pct exec checks
   - Prevents installer from hanging indefinitely

2. **Graceful Degradation** (install.sh:1602-1630)
   - Check for systemctl/pct availability before using
   - Warn if tools missing instead of failing
   - Container check is warning-only (may be transient)
   - Only fail on critical checks: service running, socket exists

3. **Bypass Flag Support** (install.sh:1589-1594)
   - Set `PULSE_SKIP_HEALTH_CHECKS=1` to bypass all checks
   - Documented in error messages for troubleshooting
   - Allows installation in unsupported environments

4. **Flexible Diagnostics** (install.sh:1640-1647)
   - Use journalctl if available, fallback to syslog
   - Conditional tool-specific advice

5. **Broader Error Detection** (ssh.go:582-628)
   - List of 14 standalone indicators (vs 5 hardcoded checks)
   - Case-insensitive matching for localization tolerance
   - Permissive strategy: treat any known pattern as standalone
   - Handles variations: "no cluster", "IPC", "connection refused", etc.

6. **Enhanced Test Coverage** (ssh_test.go:+35 lines)
   - Added 3 new test cases (variation patterns)
   - Tests now cover 8 standalone scenarios + 3 negative cases
   - All tests pass (11/11)

**Impact**:
- Health gate won't block installation in edge cases
- Better user experience on non-standard setups
- Standalone detection handles more error message variations
- Clear escape hatch for troubleshooting (bypass flag)

**Confidence Level**: High
- All tests pass (bash syntax + Go unit tests)
- Graceful fallbacks for every external command
- Only critical checks are hard failures
- Warnings guide users through validation issues

Related to #571
2025-11-13 10:26:46 +00:00
rcourtman
4f2e9055cd Add comprehensive tests for standalone node detection patterns
Tests validate the error pattern matching logic added in previous commit,
ensuring we correctly identify:

1. **Standalone Node Patterns** (should trigger fallback):
   - Classic: 'Corosync config does not exist'
   - LXC ipcc errors: 'ipcc_send_rec[1] failed: Unknown error -1'
   - Access control errors: 'Unable to load access control list'
   - All patterns from GitHub issue #571

2. **Genuine Errors** (should NOT trigger fallback):
   - Network timeouts
   - Permission denied
   - Command not found

Tests use real error messages from production GitHub issues to prevent
regressions. All 9 test cases pass.

Coverage:
- 6 standalone/LXC error patterns
- 3 genuine error cases (negative testing)
- References issue #571 for traceability

Related to #571
2025-11-13 10:17:57 +00:00
rcourtman
1dd0e74133 Dramatically improve temperature proxy installation robustness
Users were abandoning Pulse due to catastrophic temperature monitoring setup failures. This commit addresses the root causes:

**Problem 1: Silent Failures**
- Installations reported "SUCCESS" even when proxy never started
- UI showed green checkmarks with no temperature data
- Zero feedback when things went wrong

**Problem 2: Missing Diagnostics**
- Service failures logged only in journald
- Users saw "Something going on with the proxy" with no actionable guidance
- No way to troubleshoot from error messages

**Problem 3: Standalone Node Issues**
- Proxy daemon logged continuous pvecm errors as warnings
- "ipcc_send_rec" and "Unknown error -1" messages confused users
- These are expected for non-clustered/LXC setups

**Solutions Implemented:**

1. **Health Gate in install.sh (lines 1588-1629)**
   - Verify service is running after installation
   - Check socket exists on host
   - Confirm socket visible inside container via bind mount
   - Fail loudly with specific diagnostics if any check fails

2. **Actionable Error Messages in install-sensor-proxy.sh (lines 822-877)**
   - When service fails to start: dump full systemctl status + 40 lines of logs
   - When socket missing: show permissions, service status, and remediation command
   - Include common issues checklist (missing user, permission errors, lm-sensors, etc.)
   - Direct link to troubleshooting docs

3. **Better Standalone Node Detection in ssh.go (lines 585-595)**
   - Recognize "Unknown error -1" and "Unable to load access control list" as LXC indicators
   - Log at INFO level (not WARN) since this is expected behavior
   - Clarify message: "using localhost for temperature collection"

**Impact:**
- Eliminates "green checkmark but no temps" scenario
- Users get immediate actionable feedback on failures
- Standalone/LXC installations work silently without error spam
- Reduces support burden from #571 (15+ comments of user frustration)

Related to #571
2025-11-13 10:14:19 +00:00
rcourtman
1cfc6da2d9 Revert "Document release notes input requirement"
This reverts commit 5c06597b6e.
2025-11-13 09:41:43 +00:00
rcourtman
5c06597b6e Document release notes input requirement 2025-11-13 09:40:21 +00:00
rcourtman
1f3723a7ad Require release notes input for workflow 2025-11-13 09:37:38 +00:00
rcourtman
221c41b2b0 Polish release notes fallback 2025-11-13 09:10:43 +00:00
rcourtman
8330664ed2 Add deterministic release notes fallback 2025-11-13 00:00:25 +00:00
rcourtman
a6a8f0a8ef Improve release notes fallback 2025-11-12 23:40:26 +00:00
rcourtman
d4b7a61e9c Prepare v4.29.6 release 2025-11-12 23:19:20 +00:00
rcourtman
1166576b21 Ensure agent ID collisions respect token boundaries (Related to #658) 2025-11-12 22:46:56 +00:00
rcourtman
bb55144637 Improve update integration diagnostics 2025-11-12 22:27:05 +00:00
rcourtman
88d0aeebdf Ensure alert cooldown persists reliably
Related to #706
2025-11-12 21:52:24 +00:00
rcourtman
3fa7356ff3 Improve auth timeout handling (related to #705) 2025-11-12 21:50:53 +00:00
rcourtman
4f950de29d Make API update test resilient to stale cookies 2025-11-12 21:32:33 +00:00
rcourtman
e3952a073c Use internal mock host for release assets 2025-11-12 21:23:01 +00:00
rcourtman
8b2d64b583 Fix API update test to send CSRF token 2025-11-12 21:13:46 +00:00
rcourtman
98adb5e08e Preserve storage backups after partial failures (Related to #704) 2025-11-12 21:10:18 +00:00
rcourtman
3b079eeddb Add release dry run workflow and API update integration test 2025-11-12 21:02:52 +00:00
rcourtman
874f41b75c Related to discussion #702: harden uninstall cleanup 2025-11-12 19:46:45 +00:00
rcourtman
92e155ed3f Handle Snap Docker home restrictions (Related to #693) 2025-11-12 19:20:04 +00:00
rcourtman
653ce04bc1 Improve sensor proxy cluster validation (Related to #703) 2025-11-12 19:17:45 +00:00
rcourtman
923293fbf7 Related to #692: Skip unsupported guest OS info calls 2025-11-12 19:17:09 +00:00
rcourtman
b829d86d2c Handle docker validation under errexit
Related to #693
2025-11-12 18:34:53 +00:00
rcourtman
b3e8e43736 Related to #698: harden installer release detection 2025-11-12 17:56:16 +00:00
rcourtman
9dd6357328 Improve sensor-proxy release detection (related to #701) 2025-11-12 17:49:20 +00:00
rcourtman
429aa075af Ensure release validation handles published edits (related to #669) 2025-11-12 17:33:30 +00:00
rcourtman
357a7fe920 Ensure VM status requests always return meminfo (Related to #694) 2025-11-12 17:30:10 +00:00
rcourtman
29a2fa8bdb Auto-update Helm chart version to 4.29.5 2025-11-12 17:15:02 +00:00
rcourtman
9abc10265b Auto-update Helm chart documentation 2025-11-12 17:15:01 +00:00
rcourtman
70d6f911b5 Skip helm-docs commits during release workflows 2025-11-12 17:14:31 +00:00
rcourtman
39548dbbe2 Bump version to 4.29.5 2025-11-12 16:34:22 +00:00
rcourtman
ff5de0147b Fix missing regexp import for path traversal validation 2025-11-12 16:34:16 +00:00
rcourtman
be40faf67f Bump version to 4.29.4 - includes security fix 2025-11-12 16:27:49 +00:00
rcourtman
946e2e455f Security: Fix path traversal vulnerability in host-agent download endpoint
CRITICAL SECURITY FIX: The /download/pulse-host-agent endpoint was directly
concatenating user-supplied platform and arch query parameters into file paths
without validation, allowing path traversal attacks.

An attacker could request:
  /download/pulse-host-agent?platform=../../etc/passwd
to read arbitrary files from the container filesystem.

Fix: Add input validation to only allow alphanumeric characters and hyphens
in platform/arch parameters before using them in file paths.

Related: Codex security audit identified this during pre-release review
2025-11-12 16:27:11 +00:00
rcourtman
775bf99a96 Bump version to 4.29.3 for final release test 2025-11-12 16:18:51 +00:00
rcourtman
66d30c56eb Fix draft release tag creation
Draft releases created without --target get 'untagged-...' slugs instead of
the proper tag name. This breaks all download URLs since installers expect
/download/vX.Y.Z/... but assets are under /download/untagged-.../

Add --target parameter to gh release create to ensure the tag is created
properly even for draft releases.
2025-11-12 16:18:22 +00:00
rcourtman
0c98d0b801 Bump version to 4.29.2 2025-11-12 15:47:33 +00:00
rcourtman
ba6d019c5b Fix eventual consistency issue with release API lookup
The releases REST API endpoint is eventually consistent for draft releases.
Immediately after gh release create, the new release may not appear in the
listing yet, causing the release_id lookup to return empty and fail validation.

Add retry loop (10 attempts, 2s intervals) to wait for the release to appear
in the API before extracting the ID. Also add validation to ensure we got
a valid release_id before proceeding.

This fixes the immediate validation failure with 'Release metadata is missing'.
2025-11-12 15:47:21 +00:00
rcourtman
bf8622eeaa Bump version to 4.29.1 for release workflow test 2025-11-12 15:08:41 +00:00
rcourtman
e3890c2925 Fix release workflow to complete successfully end-to-end
Related to systematic release workflow failures. The workflow has never
successfully completed from start to finish since validation was added.

Root causes identified and fixed:

1. **GraphQL node_id vs numeric release ID**: The create-release job was
   using `gh release view --json id` which returns a GraphQL node_id
   (RE_kwDON5nJtM4PmlTt) instead of the numeric database ID (261772525)
   needed by the REST API. The validation workflow then failed with 404
   when trying to download assets. Fixed by using `gh api` to get the
   numeric ID from the releases list endpoint.

2. **Missing binaries in Docker image**: The validation script expects 26
   binaries + 3 Windows symlinks in /opt/pulse/bin/, but the Dockerfile
   was only copying a subset. Missing binaries included the main pulse
   server binary, armv6/386 builds for all agents, and caused immediate
   validation failure. Fixed by copying all built binaries from
   backend-builder stage.

3. **Assets-only validation fallback broken**: When Docker image pull
   times out, the workflow falls back to assets-only validation but was
   still calling the validation script without --skip-docker flag,
   causing it to fail on the first docker command. Fixed by passing
   --skip-docker flag in the fallback path.

4. **Asset download pagination**: The asset download was not using
   --paginate, which would cause silent failures once we exceed 30 assets
   (currently at 27). Fixed by adding --paginate to gh api call.

All fixes verified locally and address the complete failure chain.
2025-11-12 14:59:16 +00:00