vrr/Pulse

mirror of https://github.com/rcourtman/Pulse.git synced 2026-04-28 19:41:17 +00:00

Author	SHA1	Message	Date
rcourtman	6ff345fb6b	chore: fix staticcheck SA warnings - Fix SA4006 unused value issues in ssh.go, validation.go, generator.go - Replace deprecated ioutil with io/os in config.go - Replace deprecated tar.TypeRegA with tar.TypeReg - Remove deprecated rand.Seed calls (auto-seeded in Go 1.20+) - Fix always-true nil check in main.go - Fix impossible nil comparison in tempproxy/client.go - Add nil check for config in monitor.New()	2025-11-27 09:16:53 +00:00
rcourtman	3fce14469c	chore: remove legacy proxy handlers and unused functions Remove legacy V1 handlers replaced by V2 versions: - sendError (replaced by sendErrorV2) - handleGetStatus (replaced by handleGetStatusV2) - handleEnsureClusterKeys (replaced by handleEnsureClusterKeysV2) - handleRegisterNodes (replaced by handleRegisterNodesV2) - handleGetTemperature (replaced by handleGetTemperatureV2) Also remove related unused functions: - getPublicKey wrapper (only getPublicKeyFrom is used) - pushSSHKey wrapper (only pushSSHKeyFrom is used) - nodeValidator.ipAllowed method (standalone ipAllowed is used) - validateConfigFile (never called) - runServiceDebug (Windows debug mode, never called)	2025-11-27 08:41:28 +00:00
courtmanr@gmail.com	c91add36d2	fix: filter out qdevice from cluster node discovery	2025-11-24 22:54:58 +00:00
courtmanr@gmail.com	1ae34285c5	fix(sensor-proxy): relax pvecm status parsing to support decimal node IDs Fixes an issue where pvecm status output using decimal node IDs (e.g. '1' instead of '0x1') caused node discovery to fail. Added test case for this format.	2025-11-23 08:21:58 +00:00
courtmanr@gmail.com	a5fbe52a59	Fix pvecm status parsing for QDevice flags (#738)	2025-11-22 23:44:01 +00:00
rcourtman	429f9c45bb	Ensure sensor proxy wrapper delivers SMART temps locally	2025-11-21 10:07:42 +00:00
rcourtman	e178ae50a5	Add context timeout to local temperature collection The getTemperatureLocal() function was running sensors without a timeout, which could cause HTTP requests to hang if the sensors command stalled. This adds context.Context parameter and uses exec.CommandContext to ensure local temperature collection respects the same 15-second timeout as SSH-based collection. Fixes issue where HTTP mode worked for remote nodes but timed out for self-monitoring on the same host.	2025-11-13 20:15:05 +00:00
rcourtman	a703cc2be6	Fix HTTP mode reliability: add context timeouts to SSH collection Critical fix for intermittent HTTP endpoint hangs identified by Codex analysis. ## Root Cause SSH collection via getTemperatureViaSSH() had no timeout, causing HTTP handlers to block indefinitely when sensors command hung. This held node-level mutexes and rate limit slots, creating cascading failures where subsequent requests queued indefinitely. ## Solution - Thread request context through to SSH execution - Add exec.CommandContext with 15s timeout (vs 30s HTTP client timeout) - Create execCommandWithLimitsContext() to wrap SSH commands - Ensures handlers always release locks and respond within deadline ## Impact - HTTP temps endpoint now responds in ~70ms consistently - Temperature data successfully collected and displayed in Pulse - Eliminates 'context deadline exceeded' errors - Prevents node gate deadlocks from slow/stuck SSH sessions Related to Codex session 019a7e99-00fc-7903-afa3-01100baf47c6	2025-11-13 19:09:50 +00:00
rcourtman	aa357e5013	Fix HTTP mode for pulse-sensor-proxy and improve installer safety ## HTTP Server Fixes - Add source IP middleware to enforce allowed_source_subnets - Fix missing source subnet validation for external HTTP requests - HTTP health endpoint now respects subnet restrictions ## Installer Improvements - Auto-configure allowed_source_subnets with Pulse server IP - Add cluster node hostnames to allowed_nodes (not just IPs) - Fix node validation to accept both hostnames and IPs - Add Pulse server reachability check before installation - Add port availability check for HTTP mode - Add automatic rollback on service startup failure - Add HTTP endpoint health check after installation - Fix config backup and deduplication (prevent duplicate keys) - Fix IPv4 validation with loopback rejection - Improve registration retry logic with detailed errors - Add automatic LXC bind mount cleanup on uninstall ## Temperature Collection Fixes - Add local temperature collection for self-monitoring nodes - Fix node identifier matching (use hostname not SSH host) - Fix JSON double-encoding in HTTP client response Related to #XXX (temperature monitoring fixes)	2025-11-13 18:22:36 +00:00
rcourtman	19a960de8f	Address Codex security review feedback Changes based on independent Codex review: 1. Elevated log level from Debug to Warn for permissive mode fallback - Operators now see "SECURITY: Cluster validation unavailable" in journalctl at default log level - Added similar warning on startup when running in permissive mode - Makes it obvious when node validation is bypassed 2. Added runtime fallback for AF_NETLINK restrictions - New discoverLocalHostAddressesFallback() shells out to 'ip addr' - Triggered when net.Interfaces() fails with netlinkrib error - Ensures existing installations work even without systemd unit update - Logs recommendation to update systemd unit for better performance 3. Improved security awareness - Changed message to explicitly state "allowing all nodes" - Recommends configuring allowed_nodes for security - Makes permissive fallback behavior transparent to operators Related to #571 - temperature monitoring on standalone nodes These changes ensure the fix works for existing installations that haven't updated their systemd units, while clearly communicating when the proxy is running in an insecure permissive mode.	2025-11-13 13:55:26 +00:00
rcourtman	4bb8ab15a7	Fix temperature monitoring for clustered and LXC Proxmox environments (addresses #571) Root cause: pulse-sensor-proxy runs with strict systemd hardening that prevents access to Proxmox corosync IPC (abstract UNIX sockets). When pvecm fails with IPC errors, the code incorrectly treated it as "standalone mode" and only discovered localhost addresses, rejecting legitimate cluster members and external nodes. Changes: 1. Distinguish IPC failures from true standalone mode - Detect ipcc_send_rec and access control list errors specifically - These indicate a cluster exists but isn't accessible (LXC, systemd restrictions) - Return error to disable cluster validation instead of misusing standalone logic 2. Graceful degradation when cluster validation fails - When cluster IPC is unavailable, fall through to permissive mode - Log debug message suggesting allowed_nodes configuration - Allows requests to proceed rather than blocking all temperature monitoring 3. Improve local address discovery for true standalone nodes - Use Go's native net.Interfaces() instead of shelling out to 'ip addr' - More reliable and works with AF_NETLINK restrictions - Add helpful logging when only hostnames are discovered 4. Systemd hardening adjustments - Add AF_NETLINK to RestrictAddressFamilies (for net.Interfaces()) - Remove RemoveIPC=true (attempted fix for corosync, insufficient) - Add ReadWritePaths=-/run/corosync (optional path, corosync uses abstract sockets anyway) Result: Temperature monitoring now works in: - Clustered Proxmox hosts (falls back to permissive when IPC blocked) - LXC containers (correctly detects IPC failure, allows requests) - Standalone nodes (proper local address discovery with IPs) Workaround for maximum security: Configure allowed_nodes in /etc/pulse-sensor-proxy/config.yaml when cluster validation cannot be used.	2025-11-13 13:25:27 +00:00
rcourtman	573851a388	Fix temperature monitoring on standalone Proxmox nodes (addresses #571) Root cause: The systemd service hardening blocked AF_NETLINK sockets, preventing IP address discovery on standalone nodes. The proxy could only discover hostnames, causing node_not_cluster_member rejections when users configured Pulse with IP addresses. Changes: 1. Add AF_NETLINK to RestrictAddressFamilies in all systemd services - pulse-sensor-proxy.service - install-sensor-proxy.sh (both modes) - pulse-sensor-cleanup.service 2. Replace shell-based 'ip addr' with Go native net.Interfaces() API - More reliable and doesn't require external commands - Works even with strict systemd restrictions - Properly filters loopback, link-local, and down interfaces 3. Improve error logging and user guidance - Warn when no IP addresses can be discovered - Provide clear instructions about allowed_nodes workaround - Include address counts in logs for debugging This fix ensures standalone Proxmox nodes can properly validate temperature requests by IP address without requiring manual allowed_nodes configuration.	2025-11-13 13:02:15 +00:00
rcourtman	6a5b8d698b	Add critical safety guards to temperature proxy installation After implementing the health gate, added comprehensive safety measures to prevent the health checks themselves from becoming a new failure point. Problem: Previous commit added strict health checks but could fail in edge cases: - `pct exec` could hang if container stopped/frozen → installer deadlocks - systemctl/journalctl might not be available → diagnostics fail - Container access check could fail for transient reasons - pvecm error detection was fragile (string matching specific messages) Solutions Implemented: 1. Timeouts on All External Commands (install.sh:1596,1618) - `timeout 5` on systemctl checks - `timeout 10` on pct exec checks - Prevents installer from hanging indefinitely 2. Graceful Degradation (install.sh:1602-1630) - Check for systemctl/pct availability before using - Warn if tools missing instead of failing - Container check is warning-only (may be transient) - Only fail on critical checks: service running, socket exists 3. Bypass Flag Support (install.sh:1589-1594) - Set `PULSE_SKIP_HEALTH_CHECKS=1` to bypass all checks - Documented in error messages for troubleshooting - Allows installation in unsupported environments 4. Flexible Diagnostics (install.sh:1640-1647) - Use journalctl if available, fallback to syslog - Conditional tool-specific advice 5. Broader Error Detection (ssh.go:582-628) - List of 14 standalone indicators (vs 5 hardcoded checks) - Case-insensitive matching for localization tolerance - Permissive strategy: treat any known pattern as standalone - Handles variations: "no cluster", "IPC", "connection refused", etc. 6. Enhanced Test Coverage (ssh_test.go:+35 lines) - Added 3 new test cases (variation patterns) - Tests now cover 8 standalone scenarios + 3 negative cases - All tests pass (11/11) Impact: - Health gate won't block installation in edge cases - Better user experience on non-standard setups - Standalone detection handles more error message variations - Clear escape hatch for troubleshooting (bypass flag) Confidence Level: High - All tests pass (bash syntax + Go unit tests) - Graceful fallbacks for every external command - Only critical checks are hard failures - Warnings guide users through validation issues Related to #571	2025-11-13 10:26:46 +00:00
rcourtman	d3875eaae5	Dramatically improve temperature proxy installation robustness Users were abandoning Pulse due to catastrophic temperature monitoring setup failures. This commit addresses the root causes: Problem 1: Silent Failures - Installations reported "SUCCESS" even when proxy never started - UI showed green checkmarks with no temperature data - Zero feedback when things went wrong Problem 2: Missing Diagnostics - Service failures logged only in journald - Users saw "Something going on with the proxy" with no actionable guidance - No way to troubleshoot from error messages Problem 3: Standalone Node Issues - Proxy daemon logged continuous pvecm errors as warnings - "ipcc_send_rec" and "Unknown error -1" messages confused users - These are expected for non-clustered/LXC setups Solutions Implemented: 1. Health Gate in install.sh (lines 1588-1629) - Verify service is running after installation - Check socket exists on host - Confirm socket visible inside container via bind mount - Fail loudly with specific diagnostics if any check fails 2. Actionable Error Messages in install-sensor-proxy.sh (lines 822-877) - When service fails to start: dump full systemctl status + 40 lines of logs - When socket missing: show permissions, service status, and remediation command - Include common issues checklist (missing user, permission errors, lm-sensors, etc.) - Direct link to troubleshooting docs 3. Better Standalone Node Detection in ssh.go (lines 585-595) - Recognize "Unknown error -1" and "Unable to load access control list" as LXC indicators - Log at INFO level (not WARN) since this is expected behavior - Clarify message: "using localhost for temperature collection" Impact: - Eliminates "green checkmark but no temps" scenario - Users get immediate actionable feedback on failures - Standalone/LXC installations work silently without error spam - Reduces support burden from #571 (15+ comments of user frustration) Related to #571	2025-11-13 10:14:19 +00:00
rcourtman	b7cfafe2cf	Fix temperature monitoring on standalone Proxmox nodes (addresses #571) The standalone node detection in discoverClusterNodes was only checking stderr for "not part of a cluster" messages, but some Proxmox versions write these messages to stdout instead. This caused the fallback to discoverLocalHostAddresses to never trigger, leaving temperature monitoring broken on standalone nodes. Changes: - Check both stdout and stderr for standalone node indicators - Document exit code 255 in addition to code 2 - Improve error logging to show both stdout and stderr This ensures standalone nodes correctly fall back to local address discovery regardless of where pvecm writes its error messages.	2025-11-12 11:51:41 +00:00
rcourtman	27c2774af4	Fix pulse-sensor-proxy pvecm errors in LXC containers (related to #600) When pulse-sensor-proxy runs inside an LXC container on a Proxmox host, pvecm status fails with "ipcc_send_rec[1] failed: Unknown error -1" because the container can't access the host's corosync IPC socket. This caused repeated warnings every few seconds even though the proxy can function correctly by discovering local host addresses. Extended the standalone node detection to recognize "ipcc_send_rec" errors as indicating an LXC container deployment and gracefully fall back to local address discovery instead of logging warnings.	2025-11-11 23:04:36 +00:00
rcourtman	c9d1671afd	Fix persistent temperature monitoring issues for standalone Proxmox nodes (addresses #571) This commit resolves the recurring temperature monitoring failures that have plagued multiple releases: 1. Fix user mismatch (v4.27.1 regression): - Changed binary default user from 'pulse-sensor' to 'pulse-sensor-proxy' - Aligns with the user created by install-sensor-proxy.sh (line 389) - Prevents panic when binary is run outside systemd context - Systemd unit already uses User=pulse-sensor-proxy, so this makes manual runs work too 2. Fix standalone node validation (v4.25.0+ regression): - pvecm status exits with code 2 on standalone nodes (not in a cluster) - This caused validation to fail, rejecting all temperature requests - Added discoverLocalHostAddresses() helper that discovers actual host IPs/hostnames - On standalone nodes, cluster membership list is populated with host's own addresses - Maintains SSRF protection while allowing standalone operation - Added comprehensive test coverage 3. Make installer fail loudly on proxy setup failure: - Previously, failed proxy installation only printed a warning - Install script then claimed "Pulse installation complete!" (confusing for users) - Now exits with clear error message and remediation steps - Forces operators to fix proxy issues before claiming success - Users who skip temperature monitoring are unaffected 4. Add test coverage to prevent future regressions: - Added TestDiscoverLocalHostAddresses to verify local address discovery - Validates no loopback or link-local addresses are returned - All existing tests pass with new changes Pattern of failures across releases: - v4.23.0: Missing proxy binaries in release - v4.24.0-rc.3: AMD CPU sensor naming (Tctl vs Tdie) - v4.25.0: Single-node pvecm status exit code - v4.27.1: User mismatch (pulse-sensor vs pulse-sensor-proxy) This comprehensive fix addresses the root causes rather than applying another tactical patch. Related to #571	2025-11-09 16:53:14 +00:00
rcourtman	b2e65f7b3e	feat(security): Add SSH output limits and improve host key management Addresses two security vulnerabilities: 1. SSH Output Size Limits: - Prevents memory exhaustion from malicious remote nodes - Configurable max_ssh_output_bytes (default 1MB) - Stream with io.LimitReader to cap output size - New metric: pulse_proxy_ssh_output_oversized_total{node} - WARN logging for oversized outputs 2. Improved Host Key Management: - Seed host keys from Proxmox cluster store (/etc/pve/priv/known_hosts) - Falls back to ssh-keyscan only if Proxmox unavailable (with WARN) - Fingerprint change detection with ERROR logging - require_proxmox_hostkeys option for strict mode - New metric: pulse_proxy_hostkey_changes_total{node} - Reduces MITM attack surface significantly Known hosts manager now normalizes entries, reuses existing fingerprints, and raises typed HostKeyChangeError when fingerprints differ. Related to security audit 2025-11-07. Co-authored-by: Codex <codex@openai.com>	2025-11-07 17:09:02 +00:00
rcourtman	524f42cc28	security: complete Phase 1 sensor proxy hardening Implements comprehensive security hardening for pulse-sensor-proxy: - Privilege drop from root to unprivileged user (UID 995) - Hash-chained tamper-evident audit logging with remote forwarding - Per-UID rate limiting (0.2 QPS, burst 2) with concurrency caps - Enhanced command validation with 10+ attack pattern tests - Fuzz testing (7M+ executions, 0 crashes) - SSH hardening, AppArmor/seccomp profiles, operational runbooks All 27 Phase 1 tasks complete. Ready for production deployment.	2025-10-20 15:13:37 +00:00
rcourtman	123e0f04ca	feat: add comprehensive node cleanup system Implements automated cleanup workflow when nodes are deleted from Pulse, removing all monitoring footprint from the host. Changes include a new RPC handler in the sensor proxy for cleanup requests, enhanced node deletion modal with detailed cleanup explanations, and improved SSH key management with proper tagging for atomic updates.	2025-10-17 18:53:45 +00:00
rcourtman	f141f7db33	feat: enhance sensor proxy with improved cluster discovery and SSH management Improvements to pulse-sensor-proxy: - Fix cluster discovery to use pvecm status for IP addresses instead of node names - Add standalone node support for non-clustered Proxmox hosts - Enhanced SSH key push with detailed logging, success/failure tracking, and error reporting - Add --pulse-server flag to installer for custom Pulse URLs - Configure www-data group membership for Proxmox IPC access UI and API cleanup: - Remove unused "Ensure cluster keys" button from Settings - Remove /api/diagnostics/temperature-proxy/ensure-cluster-keys endpoint - Remove EnsureClusterKeys method from tempproxy client The setup script already handles SSH key distribution during initial configuration, making the manual refresh button redundant.	2025-10-17 11:43:26 +00:00
rcourtman	b952444837	refactor: Rename pulse-temp-proxy to pulse-sensor-proxy The name "temp-proxy" implied a temporary or incomplete implementation. The new name better reflects its purpose as a secure sensor data bridge for containerized Pulse deployments. Changes: - Renamed cmd/pulse-temp-proxy/ to cmd/pulse-sensor-proxy/ - Updated all path constants and binary references - Renamed environment variables: PULSE_TEMP_PROXY_* to PULSE_SENSOR_PROXY_* - Updated systemd service and service account name - Updated installation, rotation, and build scripts - Renamed hardening documentation - Maintained backward compatibility for key removal during upgrades	2025-10-13 13:17:05 +00:00

22 commits