After implementing the health gate, added comprehensive safety measures
to prevent the health checks themselves from becoming a new failure point.
**Problem**: Previous commit added strict health checks but could fail in
edge cases:
- `pct exec` could hang if container stopped/frozen → installer deadlocks
- systemctl/journalctl might not be available → diagnostics fail
- Container access check could fail for transient reasons
- pvecm error detection was fragile (string matching specific messages)
**Solutions Implemented**:
1. **Timeouts on All External Commands** (install.sh:1596,1618)
- `timeout 5` on systemctl checks
- `timeout 10` on pct exec checks
- Prevents installer from hanging indefinitely
2. **Graceful Degradation** (install.sh:1602-1630)
- Check for systemctl/pct availability before using
- Warn if tools missing instead of failing
- Container check is warning-only (may be transient)
- Only fail on critical checks: service running, socket exists
3. **Bypass Flag Support** (install.sh:1589-1594)
- Set `PULSE_SKIP_HEALTH_CHECKS=1` to bypass all checks
- Documented in error messages for troubleshooting
- Allows installation in unsupported environments
4. **Flexible Diagnostics** (install.sh:1640-1647)
- Use journalctl if available, fallback to syslog
- Conditional tool-specific advice
5. **Broader Error Detection** (ssh.go:582-628)
- List of 14 standalone indicators (vs 5 hardcoded checks)
- Case-insensitive matching for localization tolerance
- Permissive strategy: treat any known pattern as standalone
- Handles variations: "no cluster", "IPC", "connection refused", etc.
6. **Enhanced Test Coverage** (ssh_test.go:+35 lines)
- Added 3 new test cases (variation patterns)
- Tests now cover 8 standalone scenarios + 3 negative cases
- All tests pass (11/11)
**Impact**:
- Health gate won't block installation in edge cases
- Better user experience on non-standard setups
- Standalone detection handles more error message variations
- Clear escape hatch for troubleshooting (bypass flag)
**Confidence Level**: High
- All tests pass (bash syntax + Go unit tests)
- Graceful fallbacks for every external command
- Only critical checks are hard failures
- Warnings guide users through validation issues
Related to #571
Tests validate the error pattern matching logic added in previous commit,
ensuring we correctly identify:
1. **Standalone Node Patterns** (should trigger fallback):
- Classic: 'Corosync config does not exist'
- LXC ipcc errors: 'ipcc_send_rec[1] failed: Unknown error -1'
- Access control errors: 'Unable to load access control list'
- All patterns from GitHub issue #571
2. **Genuine Errors** (should NOT trigger fallback):
- Network timeouts
- Permission denied
- Command not found
Tests use real error messages from production GitHub issues to prevent
regressions. All 9 test cases pass.
Coverage:
- 6 standalone/LXC error patterns
- 3 genuine error cases (negative testing)
- References issue #571 for traceability
Related to #571
This commit resolves the recurring temperature monitoring failures that have plagued multiple releases:
1. **Fix user mismatch (v4.27.1 regression)**:
- Changed binary default user from 'pulse-sensor' to 'pulse-sensor-proxy'
- Aligns with the user created by install-sensor-proxy.sh (line 389)
- Prevents panic when binary is run outside systemd context
- Systemd unit already uses User=pulse-sensor-proxy, so this makes manual runs work too
2. **Fix standalone node validation (v4.25.0+ regression)**:
- pvecm status exits with code 2 on standalone nodes (not in a cluster)
- This caused validation to fail, rejecting all temperature requests
- Added discoverLocalHostAddresses() helper that discovers actual host IPs/hostnames
- On standalone nodes, cluster membership list is populated with host's own addresses
- Maintains SSRF protection while allowing standalone operation
- Added comprehensive test coverage
3. **Make installer fail loudly on proxy setup failure**:
- Previously, failed proxy installation only printed a warning
- Install script then claimed "Pulse installation complete!" (confusing for users)
- Now exits with clear error message and remediation steps
- Forces operators to fix proxy issues before claiming success
- Users who skip temperature monitoring are unaffected
4. **Add test coverage to prevent future regressions**:
- Added TestDiscoverLocalHostAddresses to verify local address discovery
- Validates no loopback or link-local addresses are returned
- All existing tests pass with new changes
Pattern of failures across releases:
- v4.23.0: Missing proxy binaries in release
- v4.24.0-rc.3: AMD CPU sensor naming (Tctl vs Tdie)
- v4.25.0: Single-node pvecm status exit code
- v4.27.1: User mismatch (pulse-sensor vs pulse-sensor-proxy)
This comprehensive fix addresses the root causes rather than applying another tactical patch.
Related to #571