vrr/Pulse

mirror of https://github.com/rcourtman/Pulse.git synced 2026-04-28 11:30:15 +00:00

Author	SHA1	Message	Date
rcourtman	d3875eaae5	Dramatically improve temperature proxy installation robustness Users were abandoning Pulse due to catastrophic temperature monitoring setup failures. This commit addresses the root causes: Problem 1: Silent Failures - Installations reported "SUCCESS" even when proxy never started - UI showed green checkmarks with no temperature data - Zero feedback when things went wrong Problem 2: Missing Diagnostics - Service failures logged only in journald - Users saw "Something going on with the proxy" with no actionable guidance - No way to troubleshoot from error messages Problem 3: Standalone Node Issues - Proxy daemon logged continuous pvecm errors as warnings - "ipcc_send_rec" and "Unknown error -1" messages confused users - These are expected for non-clustered/LXC setups Solutions Implemented: 1. Health Gate in install.sh (lines 1588-1629) - Verify service is running after installation - Check socket exists on host - Confirm socket visible inside container via bind mount - Fail loudly with specific diagnostics if any check fails 2. Actionable Error Messages in install-sensor-proxy.sh (lines 822-877) - When service fails to start: dump full systemctl status + 40 lines of logs - When socket missing: show permissions, service status, and remediation command - Include common issues checklist (missing user, permission errors, lm-sensors, etc.) - Direct link to troubleshooting docs 3. Better Standalone Node Detection in ssh.go (lines 585-595) - Recognize "Unknown error -1" and "Unable to load access control list" as LXC indicators - Log at INFO level (not WARN) since this is expected behavior - Clarify message: "using localhost for temperature collection" Impact: - Eliminates "green checkmark but no temps" scenario - Users get immediate actionable feedback on failures - Standalone/LXC installations work silently without error spam - Reduces support burden from #571 (15+ comments of user frustration) Related to #571	2025-11-13 10:14:19 +00:00
rcourtman	0899e36ad2	Improve sensor proxy cluster validation (Related to #703)	2025-11-12 19:17:45 +00:00
rcourtman	b7cfafe2cf	Fix temperature monitoring on standalone Proxmox nodes (addresses #571) The standalone node detection in discoverClusterNodes was only checking stderr for "not part of a cluster" messages, but some Proxmox versions write these messages to stdout instead. This caused the fallback to discoverLocalHostAddresses to never trigger, leaving temperature monitoring broken on standalone nodes. Changes: - Check both stdout and stderr for standalone node indicators - Document exit code 255 in addition to code 2 - Improve error logging to show both stdout and stderr This ensures standalone nodes correctly fall back to local address discovery regardless of where pvecm writes its error messages.	2025-11-12 11:51:41 +00:00
rcourtman	27c2774af4	Fix pulse-sensor-proxy pvecm errors in LXC containers (related to #600) When pulse-sensor-proxy runs inside an LXC container on a Proxmox host, pvecm status fails with "ipcc_send_rec[1] failed: Unknown error -1" because the container can't access the host's corosync IPC socket. This caused repeated warnings every few seconds even though the proxy can function correctly by discovering local host addresses. Extended the standalone node detection to recognize "ipcc_send_rec" errors as indicating an LXC container deployment and gracefully fall back to local address discovery instead of logging warnings.	2025-11-11 23:04:36 +00:00
rcourtman	c9d1671afd	Fix persistent temperature monitoring issues for standalone Proxmox nodes (addresses #571) This commit resolves the recurring temperature monitoring failures that have plagued multiple releases: 1. Fix user mismatch (v4.27.1 regression): - Changed binary default user from 'pulse-sensor' to 'pulse-sensor-proxy' - Aligns with the user created by install-sensor-proxy.sh (line 389) - Prevents panic when binary is run outside systemd context - Systemd unit already uses User=pulse-sensor-proxy, so this makes manual runs work too 2. Fix standalone node validation (v4.25.0+ regression): - pvecm status exits with code 2 on standalone nodes (not in a cluster) - This caused validation to fail, rejecting all temperature requests - Added discoverLocalHostAddresses() helper that discovers actual host IPs/hostnames - On standalone nodes, cluster membership list is populated with host's own addresses - Maintains SSRF protection while allowing standalone operation - Added comprehensive test coverage 3. Make installer fail loudly on proxy setup failure: - Previously, failed proxy installation only printed a warning - Install script then claimed "Pulse installation complete!" (confusing for users) - Now exits with clear error message and remediation steps - Forces operators to fix proxy issues before claiming success - Users who skip temperature monitoring are unaffected 4. Add test coverage to prevent future regressions: - Added TestDiscoverLocalHostAddresses to verify local address discovery - Validates no loopback or link-local addresses are returned - All existing tests pass with new changes Pattern of failures across releases: - v4.23.0: Missing proxy binaries in release - v4.24.0-rc.3: AMD CPU sensor naming (Tctl vs Tdie) - v4.25.0: Single-node pvecm status exit code - v4.27.1: User mismatch (pulse-sensor vs pulse-sensor-proxy) This comprehensive fix addresses the root causes rather than applying another tactical patch. Related to #571	2025-11-09 16:53:14 +00:00
rcourtman	9aafa6449f	feat(security): Add capability-based authorization Implements proper least-privilege model for RPC methods. Previously, any UID in allowed_peer_uids could call privileged methods, meaning another service's UID would inherit full host-level control. Capability System: - Three levels: read, write, admin - Per-UID capability assignment via allowed_peers config - Privileged methods require admin capability - Backwards compatible with legacy allowed_peer_uids format Configuration: allowed_peers: - uid: 0 capabilities: [read, write, admin] # Root gets all - uid: 1000 capabilities: [read] # Docker: read-only - uid: 1001 capabilities: [read, write] # Temps but not key distribution Security benefit: Services can be granted only the capabilities they need, preventing unintended privilege escalation. Related to security audit 2025-11-07. Co-authored-by: Codex <codex@openai.com>	2025-11-07 17:09:32 +00:00
rcourtman	734cebb4dc	feat(security): Implement GID authorization enforcement Fixes bug where allowed_peer_gids was populated from config but never checked during authorization, creating false sense of security. Changes: - authorizePeer() now checks GIDs in addition to UIDs - Peer authorized if UID OR GID matches allowlist - Debug logging shows which rule granted access (UID vs GID) - Full test coverage for GID-based authorization Security benefit: GID-based policies now actually enforced as administrators expect. Related to security audit 2025-11-07. Co-authored-by: Codex <codex@openai.com>	2025-11-07 17:09:16 +00:00
rcourtman	b2e65f7b3e	feat(security): Add SSH output limits and improve host key management Addresses two security vulnerabilities: 1. SSH Output Size Limits: - Prevents memory exhaustion from malicious remote nodes - Configurable max_ssh_output_bytes (default 1MB) - Stream with io.LimitReader to cap output size - New metric: pulse_proxy_ssh_output_oversized_total{node} - WARN logging for oversized outputs 2. Improved Host Key Management: - Seed host keys from Proxmox cluster store (/etc/pve/priv/known_hosts) - Falls back to ssh-keyscan only if Proxmox unavailable (with WARN) - Fingerprint change detection with ERROR logging - require_proxmox_hostkeys option for strict mode - New metric: pulse_proxy_hostkey_changes_total{node} - Reduces MITM attack surface significantly Known hosts manager now normalizes entries, reuses existing fingerprints, and raises typed HostKeyChangeError when fingerprints differ. Related to security audit 2025-11-07. Co-authored-by: Codex <codex@openai.com>	2025-11-07 17:09:02 +00:00
rcourtman	885a62e96b	feat(security): Implement range-based rate limiting Prevents multi-UID rate limit bypass attacks from containers. Previously, attackers could create multiple users in a container (each mapped to unique host UIDs 100000-165535) to bypass per-UID rate limits. Implementation: - Automatic detection of ID-mapped UID ranges from /etc/subuid and /etc/subgid - Rate limits applied per-range for container UIDs - Rate limits applied per-UID for host UIDs (backwards compatible) - identifyPeer() checks if BOTH UID AND GID are in mapped ranges - Metrics show peer='range:100000-165535' or peer='uid:0' Security benefit: Entire container limited as single entity, preventing 100+ UIDs from bypassing rate controls. New metrics: - pulse_proxy_limiter_rejections_total{peer,reason} - pulse_proxy_limiter_penalties_total{peer,reason} - pulse_proxy_global_concurrency_inflight Related to security audit 2025-11-07. Co-authored-by: Codex <codex@openai.com>	2025-11-07 17:08:45 +00:00
rcourtman	7062b07411	feat(security): Add node allowlist validation to prevent SSRF attacks Implements comprehensive node validation system to prevent SSRF attacks via the temperature proxy. Addresses critical vulnerability where proxy would SSH to any hostname/IP passing format validation. Features: - Configurable allowed_nodes list (hostnames, IPs, CIDR ranges) - Automatic Proxmox cluster membership validation - 5-minute cluster membership cache to reduce pvecm overhead - strict_node_validation option for strict vs permissive modes - New metric: pulse_proxy_node_validation_failures_total{node,reason} - Logs blocked attempts at WARN level with 'potential SSRF attempt' Configuration: - allowed_nodes: [] (empty = auto-discover from cluster) - strict_node_validation: true (require cluster membership) Default behavior: Empty allowlist + Proxmox host = validate cluster members (secure by default, backwards compatible). Related to security audit 2025-11-07. Co-authored-by: Codex <codex@openai.com>	2025-11-07 17:08:28 +00:00
rcourtman	b5ef239973	Add container detection warning to pulse-sensor-proxy startup (related to #628) When pulse-sensor-proxy runs inside a container (Docker/LXC), it cannot complete SSH workflows properly, leading to continuous [preauth] log floods on the Proxmox host. This happens because the proxy is meant to run on the host, not inside the container. Changes: - Import internal/system for InContainer() detection - Add startup warning when running in containerized environment - Point users to docs/TEMPERATURE_MONITORING.md for correct setup - Allow suppression via PULSE_SENSOR_PROXY_SUPPRESS_CONTAINER_WARNING=true This catches the misconfiguration early and directs users to supported installation methods, preventing the SSH spam reported in discussion #628.	2025-11-06 23:41:29 +00:00
rcourtman	5b89b2371a	Make pulse-sensor-proxy resilient to read-only filesystems Related to #637 The sensor-proxy was failing to start on systems with read-only filesystems because audit logging required a writable /var/log/pulse/sensor-proxy directory. Changes: - Modified newAuditLogger() to automatically fall back to stderr (systemd journal) if the audit log file cannot be opened - Removed error return from newAuditLogger() since it now always succeeds - Added warning logs when fallback mode is used to alert operators - Updated tests to handle the new signature - Added better debugging to audit log tests This allows the sensor-proxy to run on: - Immutable/read-only root filesystems - Hardened systems with restricted /var mounts - Containerized environments with limited write access Audit events are still captured via systemd journal when file logging is unavailable, maintaining the security audit trail.	2025-11-06 00:18:51 +00:00
rcourtman	930ad20921	Add configurable log level for pulse-sensor-proxy Users can now control logging verbosity through: - YAML config file: log_level: "debug|info|warn|error" - Environment variable: PULSE_SENSOR_PROXY_LOG_LEVEL Default log level is set to "info" instead of debug, reducing verbose output. Supported levels: trace, debug, info, warn, error, fatal, panic, disabled Related to #629	2025-11-05 19:48:00 +00:00
rcourtman	35adcf104f	docs: add guidance for large deployments (30+ nodes) in rate limit config Update config.example.yaml with: - Recommendations for very large deployments (30+ nodes) - Formula for calculating optimal rate limits based on node count - Example calculation: 30 nodes with 10s polling = 300ms interval - Security note about minimum safe intervals This helps admins properly configure the proxy for enterprise deployments with dozens of nodes.	2025-10-21 11:27:13 +00:00
rcourtman	44d5f91e92	feat: make pulse-sensor-proxy rate limits configurable Add support for configuring rate limits via config.yaml to allow administrators to tune the proxy for different deployment sizes. Changes: - Add RateLimitConfig struct to config.go with per_peer_interval_ms and per_peer_burst - Update newRateLimiter() to accept optional RateLimitConfig parameter - Load rate limit config from YAML and apply overrides to defaults - Update tests to pass nil for default behavior - Add comprehensive config.example.yaml with documentation Configuration examples: - Small (1-3 nodes): 1000ms interval, burst 5 (default) - Medium (4-10 nodes): 500ms interval, burst 10 - Large (10+ nodes): 250ms interval, burst 20 Defaults remain conservative (1 req/sec, burst 5) to support most deployments while allowing customization for larger environments. Related: #`46b8b8d08` (rate limit fix for multi-node support)	2025-10-21 11:25:21 +00:00
rcourtman	d856e75018	fix: increase pulse-sensor-proxy rate limits for multi-node support - Increase rate limit from 1 req/5sec to 1 req/sec (60/min) - Increase burst from 2 to 5 requests - Fixes temperature collection failures when monitoring 3+ nodes - All requests from containerized Pulse use same UID, causing rate limiting - New limits support 5-10 node deployments comfortably Resolves issue where adding standalone nodes broke temperature monitoring for all nodes due to aggressive rate limiting.	2025-10-21 11:21:12 +00:00
rcourtman	524f42cc28	security: complete Phase 1 sensor proxy hardening Implements comprehensive security hardening for pulse-sensor-proxy: - Privilege drop from root to unprivileged user (UID 995) - Hash-chained tamper-evident audit logging with remote forwarding - Per-UID rate limiting (0.2 QPS, burst 2) with concurrency caps - Enhanced command validation with 10+ attack pattern tests - Fuzz testing (7M+ executions, 0 crashes) - SSH hardening, AppArmor/seccomp profiles, operational runbooks All 27 Phase 1 tasks complete. Ready for production deployment.	2025-10-20 15:13:37 +00:00
rcourtman	29f4879cd4	test: add comprehensive security tests and documentation Implements all remaining Codex recommendations before launch: 1. Privileged Methods Tests: - TestPrivilegedMethodsCompleteness ensures all host-side RPCs are protected - Will fail if new privileged RPC is added without authorization - Verifies read-only methods are NOT in privilegedMethods 2. ID-Mapped Root Detection Tests: - TestIDMappedRootDetection covers all boundary conditions - Tests UID/GID range detection (both must be in range) - Tests multiple ID ranges, edge cases, disabled mode - 100% coverage of container identification logic 3. Authorization Tests: - TestPrivilegedMethodsBlocked verifies containers can't call privileged RPCs - TestIDMappedRootDisabled ensures feature can be disabled - Tests both container and host credentials 4. Comprehensive Security Documentation (23 KB): - Architecture overview with diagrams - Complete authentication & authorization flow - Rate limiting details (already implemented: 20/min per peer) - SSH security model and forced commands - Container isolation mechanisms - Monitoring & alerting recommendations - Development mode documentation (PULSE_DEV_ALLOW_CONTAINER_SSH) - Troubleshooting guide with common issues - Incident response procedures Rate Limiting Status: - Already implemented in throttle.go (20 req/min, burst 10, max 10 concurrent) - Per-peer rate limiting at line 328 in main.go - Per-node concurrency control at line 825 in main.go - Exceeds Codex's requirements All tests pass. Documentation covers all security aspects. Addresses final Codex recommendations for production readiness.	2025-10-19 16:47:13 +00:00
rcourtman	1519390f08	security: enhance logging for denied privileged method calls Improved security audit trail for attempted container privilege escalation: - Added detailed logging when containers attempt privileged methods - Logs UID, GID, PID, correlation ID, and method name - Marked with "SECURITY:" prefix for easy filtering/alerting - Helps operators detect and investigate compromise attempts Example log output: SECURITY: Container attempted to call privileged method - access denied method=ensure_cluster_keys uid=101000 gid=101000 pid=12345 Addresses Codex recommendation for comprehensive logging of denied privileged RPCs to enable monitoring and alerting on attempted abuse.	2025-10-19 16:40:42 +00:00
rcourtman	026b9c5b77	security: add method-level authorization for privileged RPC methods RELEASE BLOCKER FIX - Prevents containers from triggering host-level operations. Added host-only method restrictions: - RPCEnsureClusterKeys (SSH key distribution) - RPCRegisterNodes (node registration) - RPCRequestCleanup (cleanup operations) Implementation: - New privilegedMethods map defines host-only methods - Request handler checks if method is privileged - If privileged AND caller is from ID-mapped UID range (container), reject - Host processes (real root, configured UIDs) can still call privileged methods - Containers can still call get_temperature and get_status Security impact: - Prevents compromised containers from: • Triggering unwanted SSH key distribution to cluster nodes • Learning about cluster topology via forced registration • DOS attacks by repeatedly calling key distribution • Other host-level privileged operations Without this fix, any container with root could call these methods after authentication, undermining the security isolation between container and host. Addresses high-severity finding #2 from security audit.	2025-10-19 16:31:50 +00:00
rcourtman	3a6a4fd362	security: fix SSH command injection vulnerabilities in pulse-sensor-proxy CRITICAL security fixes for pulse-sensor-proxy: 1. Strengthened hostname validation regex: - Now requires hostnames to start with alphanumeric character - Prevents SSH option injection via hostnames starting with '-' - Pattern: ^[a-zA-Z0-9][a-zA-Z0-9._-]{0,63}$ (1-64 chars total) - Added IPv4 and IPv6 validation regexes for future use 2. Added validation to vulnerable V1 RPC handlers: - handleGetTemperature: Now validates node parameter before SSH - handleRegisterNodes: Now validates discovered cluster nodes - Previously these handlers passed unsanitized input directly to SSH 3. Defense in depth: - V2 handlers already had validation (now using improved regex) - Multiple layers of protection against malicious node identifiers - Validation prevents container from passing SSH options as hostnames Without these fixes, a compromised container could potentially inject SSH options by providing malicious node names, though the 'root@' prefix provided some mitigation. Addresses high-severity finding from security audit.	2025-10-19 16:28:38 +00:00
rcourtman	123e0f04ca	feat: add comprehensive node cleanup system Implements automated cleanup workflow when nodes are deleted from Pulse, removing all monitoring footprint from the host. Changes include a new RPC handler in the sensor proxy for cleanup requests, enhanced node deletion modal with detailed cleanup explanations, and improved SSH key management with proper tagging for atomic updates.	2025-10-17 18:53:45 +00:00
rcourtman	f141f7db33	feat: enhance sensor proxy with improved cluster discovery and SSH management Improvements to pulse-sensor-proxy: - Fix cluster discovery to use pvecm status for IP addresses instead of node names - Add standalone node support for non-clustered Proxmox hosts - Enhanced SSH key push with detailed logging, success/failure tracking, and error reporting - Add --pulse-server flag to installer for custom Pulse URLs - Configure www-data group membership for Proxmox IPC access UI and API cleanup: - Remove unused "Ensure cluster keys" button from Settings - Remove /api/diagnostics/temperature-proxy/ensure-cluster-keys endpoint - Remove EnsureClusterKeys method from tempproxy client The setup script already handles SSH key distribution during initial configuration, making the manual refresh button redundant.	2025-10-17 11:43:26 +00:00
rcourtman	e4c3b06f14	Automate sensor proxy container mount and auth	2025-10-14 12:41:48 +00:00
rcourtman	b952444837	refactor: Rename pulse-temp-proxy to pulse-sensor-proxy The name "temp-proxy" implied a temporary or incomplete implementation. The new name better reflects its purpose as a secure sensor data bridge for containerized Pulse deployments. Changes: - Renamed cmd/pulse-temp-proxy/ to cmd/pulse-sensor-proxy/ - Updated all path constants and binary references - Renamed environment variables: PULSE_TEMP_PROXY_* to PULSE_SENSOR_PROXY_* - Updated systemd service and service account name - Updated installation, rotation, and build scripts - Renamed hardening documentation - Maintained backward compatibility for key removal during upgrades	2025-10-13 13:17:05 +00:00
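
Commit 3a6a4fd362 above quotes the hostname pattern it introduced to stop SSH option injection. The following is a minimal sketch of how such a check might behave. It is not the repository's actual code: the regular expression is copied from the commit message, while the helper name `validNodeName` and the sample inputs are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

// nodeNamePattern is the pattern quoted in commit 3a6a4fd362: the first
// character must be alphanumeric, followed by up to 63 more characters
// from the set [a-zA-Z0-9._-] (1-64 characters in total).
var nodeNamePattern = regexp.MustCompile(`^[a-zA-Z0-9][a-zA-Z0-9._-]{0,63}$`)

// validNodeName is a hypothetical helper (the name is not taken from the
// Pulse source). It reports whether name matches the quoted pattern.
func validNodeName(name string) bool {
	return nodeNamePattern.MatchString(name)
}

func main() {
	// Illustrative inputs only; the third one mimics an SSH option
	// smuggled in as a hostname, which the leading-character rule rejects.
	samples := []string{"pve1", "node-2.lan", "-oProxyCommand=id", "", "a b"}
	for _, name := range samples {
		fmt.Printf("%q valid=%v\n", name, validNodeName(name))
	}
}
```

Because the first character class excludes `-`, a value such as `-oProxyCommand=id` cannot reach the SSH command line as a node name, which is the injection path the commit message describes.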

25 commits
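
Commits b7cfafe2cf, 27c2774af4 and d3875eaae5 describe how the proxy decides that `pvecm status` output means "standalone node or LXC container" and then falls back to local address discovery. The sketch below illustrates that idea under stated assumptions. The marker strings are the ones named in those commit messages; the function name `looksStandalone`, the case-sensitive substring match, and the first sample input are hypothetical and may differ from what ssh.go actually does.

```go
package main

import (
	"fmt"
	"strings"
)

// standaloneIndicators holds the substrings the commit messages name as
// signs of a non-clustered node or an LXC deployment.
var standaloneIndicators = []string{
	"not part of a cluster",
	"ipcc_send_rec",
	"Unknown error -1",
	"Unable to load access control list",
}

// looksStandalone is a hypothetical helper. It checks both stdout and
// stderr, since commit b7cfafe2cf notes that some Proxmox versions write
// the message to stdout rather than stderr.
func looksStandalone(stdout, stderr string) bool {
	combined := stdout + "\n" + stderr
	for _, marker := range standaloneIndicators {
		if strings.Contains(combined, marker) {
			return true
		}
	}
	return false
}

func main() {
	// The first sample is invented for illustration; the second is the
	// error text quoted in commit 27c2774af4.
	fmt.Println(looksStandalone("Error: node is not part of a cluster", ""))
	fmt.Println(looksStandalone("", "ipcc_send_rec[1] failed: Unknown error -1"))
	fmt.Println(looksStandalone("Cluster information", ""))
}
```

A caller following the commit descriptions would log a match at INFO level and use the host's own addresses for temperature collection, rather than treating the `pvecm` failure as an error.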