Commit graph

16 commits

Author SHA1 Message Date
rcourtman
a703cc2be6 Fix HTTP mode reliability: add context timeouts to SSH collection
Critical fix for intermittent HTTP endpoint hangs identified by Codex analysis.

## Root Cause
SSH collection via getTemperatureViaSSH() had no timeout, causing HTTP
handlers to block indefinitely when sensors command hung. This held node-level
mutexes and rate limit slots, creating cascading failures where subsequent
requests queued indefinitely.

## Solution
- Thread request context through to SSH execution
- Add exec.CommandContext with 15s timeout (vs 30s HTTP client timeout)
- Create execCommandWithLimitsContext() to wrap SSH commands
- Ensures handlers always release locks and respond within deadline

## Impact
- HTTP temps endpoint now responds in ~70ms consistently
- Temperature data successfully collected and displayed in Pulse
- Eliminates 'context deadline exceeded' errors
- Prevents node gate deadlocks from slow/stuck SSH sessions

Related to Codex session 019a7e99-00fc-7903-afa3-01100baf47c6
2025-11-13 19:09:50 +00:00
rcourtman
2ee693cc63 Add HTTP mode to pulse-sensor-proxy for multi-instance temperature monitoring
This implements HTTP/HTTPS support for pulse-sensor-proxy to enable
temperature monitoring across multiple separate Proxmox instances.

Architecture changes:
- Dual-mode operation: Unix socket (local) + HTTPS (remote)
- Unix socket remains default for security/performance (no breaking change)
- HTTP mode enables temps from external PVE hosts

Backend implementation:
- Add HTTPS server with TLS + Bearer token authentication to sensor-proxy
- Add TemperatureProxyURL and TemperatureProxyToken fields to PVEInstance
- Add HTTP client (internal/tempproxy/http_client.go) for remote proxy calls
- Update temperature collector to prefer HTTP proxy when configured
- Fallback logic: HTTP proxy → Unix socket → direct SSH (if not containerized)

Configuration:
- pulse-sensor-proxy config: http_enabled, http_listen_addr, http_tls_cert/key, http_auth_token
- PVEInstance config: temperature_proxy_url, temperature_proxy_token
- Environment variables: PULSE_SENSOR_PROXY_HTTP_* for all HTTP settings

Security:
- TLS 1.2+ with modern cipher suites
- Constant-time token comparison (timing attack prevention)
- Rate limiting applied to HTTP requests (shared with socket mode)
- Audit logging for all HTTP requests

Next steps:
- Update installer script to support HTTP mode + auto-registration
- Add Pulse API endpoint for proxy registration
- Generate TLS certificates during installation
- Test multi-instance temperature collection

Related to #571 (multi-instance architecture)
2025-11-13 16:13:53 +00:00
rcourtman
c9d1671afd Fix persistent temperature monitoring issues for standalone Proxmox nodes (addresses #571)
This commit resolves the recurring temperature monitoring failures that have plagued multiple releases:

1. **Fix user mismatch (v4.27.1 regression)**:
   - Changed binary default user from 'pulse-sensor' to 'pulse-sensor-proxy'
   - Aligns with the user created by install-sensor-proxy.sh (line 389)
   - Prevents panic when binary is run outside systemd context
   - Systemd unit already uses User=pulse-sensor-proxy, so this makes manual runs work too

2. **Fix standalone node validation (v4.25.0+ regression)**:
   - pvecm status exits with code 2 on standalone nodes (not in a cluster)
   - This caused validation to fail, rejecting all temperature requests
   - Added discoverLocalHostAddresses() helper that discovers actual host IPs/hostnames
   - On standalone nodes, cluster membership list is populated with host's own addresses
   - Maintains SSRF protection while allowing standalone operation
   - Added comprehensive test coverage

3. **Make installer fail loudly on proxy setup failure**:
   - Previously, failed proxy installation only printed a warning
   - Install script then claimed "Pulse installation complete!" (confusing for users)
   - Now exits with clear error message and remediation steps
   - Forces operators to fix proxy issues before claiming success
   - Users who skip temperature monitoring are unaffected

4. **Add test coverage to prevent future regressions**:
   - Added TestDiscoverLocalHostAddresses to verify local address discovery
   - Validates no loopback or link-local addresses are returned
   - All existing tests pass with new changes

Pattern of failures across releases:
- v4.23.0: Missing proxy binaries in release
- v4.24.0-rc.3: AMD CPU sensor naming (Tctl vs Tdie)
- v4.25.0: Single-node pvecm status exit code
- v4.27.1: User mismatch (pulse-sensor vs pulse-sensor-proxy)

This comprehensive fix addresses the root causes rather than applying another tactical patch.

Related to #571
2025-11-09 16:53:14 +00:00
rcourtman
7062b07411 feat(security): Add node allowlist validation to prevent SSRF attacks
Implements comprehensive node validation system to prevent SSRF attacks
via the temperature proxy. Addresses critical vulnerability where proxy
would SSH to any hostname/IP passing format validation.

Features:
- Configurable allowed_nodes list (hostnames, IPs, CIDR ranges)
- Automatic Proxmox cluster membership validation
- 5-minute cluster membership cache to reduce pvecm overhead
- strict_node_validation option for strict vs permissive modes
- New metric: pulse_proxy_node_validation_failures_total{node,reason}
- Logs blocked attempts at WARN level with 'potential SSRF attempt'

Configuration:
- allowed_nodes: [] (empty = auto-discover from cluster)
- strict_node_validation: true (require cluster membership)

Default behavior: Empty allowlist + Proxmox host = validate cluster
members (secure by default, backwards compatible).

Related to security audit 2025-11-07.

Co-authored-by: Codex <codex@openai.com>
2025-11-07 17:08:28 +00:00
rcourtman
b5ef239973 Add container detection warning to pulse-sensor-proxy startup (related to #628)
When pulse-sensor-proxy runs inside a container (Docker/LXC), it cannot
complete SSH workflows properly, leading to continuous [preauth] log floods
on the Proxmox host. This happens because the proxy is meant to run on the
host, not inside the container.

Changes:
- Import internal/system for InContainer() detection
- Add startup warning when running in containerized environment
- Point users to docs/TEMPERATURE_MONITORING.md for correct setup
- Allow suppression via PULSE_SENSOR_PROXY_SUPPRESS_CONTAINER_WARNING=true

This catches the misconfiguration early and directs users to supported
installation methods, preventing the SSH spam reported in discussion #628.
2025-11-06 23:41:29 +00:00
rcourtman
5b89b2371a Make pulse-sensor-proxy resilient to read-only filesystems
Related to #637

The sensor-proxy was failing to start on systems with read-only filesystems
because audit logging required a writable /var/log/pulse/sensor-proxy directory.

Changes:
- Modified newAuditLogger() to automatically fall back to stderr (systemd journal)
  if the audit log file cannot be opened
- Removed error return from newAuditLogger() since it now always succeeds
- Added warning logs when fallback mode is used to alert operators
- Updated tests to handle the new signature
- Added better debugging to audit log tests

This allows the sensor-proxy to run on:
- Immutable/read-only root filesystems
- Hardened systems with restricted /var mounts
- Containerized environments with limited write access

Audit events are still captured via systemd journal when file logging is
unavailable, maintaining the security audit trail.
2025-11-06 00:18:51 +00:00
rcourtman
930ad20921 Add configurable log level for pulse-sensor-proxy
Users can now control logging verbosity through:
- YAML config file: log_level: "debug|info|warn|error"
- Environment variable: PULSE_SENSOR_PROXY_LOG_LEVEL

Default log level is set to "info" instead of debug, reducing verbose output.
Supported levels: trace, debug, info, warn, error, fatal, panic, disabled

Related to #629
2025-11-05 19:48:00 +00:00
rcourtman
44d5f91e92 feat: make pulse-sensor-proxy rate limits configurable
Add support for configuring rate limits via config.yaml to allow
administrators to tune the proxy for different deployment sizes.

Changes:
- Add RateLimitConfig struct to config.go with per_peer_interval_ms and per_peer_burst
- Update newRateLimiter() to accept optional RateLimitConfig parameter
- Load rate limit config from YAML and apply overrides to defaults
- Update tests to pass nil for default behavior
- Add comprehensive config.example.yaml with documentation

Configuration examples:
- Small (1-3 nodes): 1000ms interval, burst 5 (default)
- Medium (4-10 nodes): 500ms interval, burst 10
- Large (10+ nodes): 250ms interval, burst 20

Defaults remain conservative (1 req/sec, burst 5) to support most
deployments while allowing customization for larger environments.

Related: #46b8b8d08 (rate limit fix for multi-node support)
2025-10-21 11:25:21 +00:00
rcourtman
524f42cc28 security: complete Phase 1 sensor proxy hardening
Implements comprehensive security hardening for pulse-sensor-proxy:
- Privilege drop from root to unprivileged user (UID 995)
- Hash-chained tamper-evident audit logging with remote forwarding
- Per-UID rate limiting (0.2 QPS, burst 2) with concurrency caps
- Enhanced command validation with 10+ attack pattern tests
- Fuzz testing (7M+ executions, 0 crashes)
- SSH hardening, AppArmor/seccomp profiles, operational runbooks

All 27 Phase 1 tasks complete. Ready for production deployment.
2025-10-20 15:13:37 +00:00
rcourtman
1519390f08 security: enhance logging for denied privileged method calls
Improved security audit trail for attempted container privilege escalation:

- Added detailed logging when containers attempt privileged methods
- Logs UID, GID, PID, correlation ID, and method name
- Marked with "SECURITY:" prefix for easy filtering/alerting
- Helps operators detect and investigate compromise attempts

Example log output:
  SECURITY: Container attempted to call privileged method - access denied
  method=ensure_cluster_keys uid=101000 gid=101000 pid=12345

Addresses Codex recommendation for comprehensive logging of denied
privileged RPCs to enable monitoring and alerting on attempted abuse.
2025-10-19 16:40:42 +00:00
rcourtman
026b9c5b77 security: add method-level authorization for privileged RPC methods
RELEASE BLOCKER FIX - Prevents containers from triggering host-level operations.

Added host-only method restrictions:
- RPCEnsureClusterKeys (SSH key distribution)
- RPCRegisterNodes (node registration)
- RPCRequestCleanup (cleanup operations)

Implementation:
- New privilegedMethods map defines host-only methods
- Request handler checks if method is privileged
- If privileged AND caller is from ID-mapped UID range (container), reject
- Host processes (real root, configured UIDs) can still call privileged methods
- Containers can still call get_temperature and get_status

Security impact:
- Prevents compromised containers from:
  • Triggering unwanted SSH key distribution to cluster nodes
  • Learning about cluster topology via forced registration
  • DOS attacks by repeatedly calling key distribution
  • Other host-level privileged operations

Without this fix, any container with root could call these methods after
authentication, undermining the security isolation between container and host.

Addresses high-severity finding #2 from security audit.
2025-10-19 16:31:50 +00:00
rcourtman
3a6a4fd362 security: fix SSH command injection vulnerabilities in pulse-sensor-proxy
CRITICAL security fixes for pulse-sensor-proxy:

1. Strengthened hostname validation regex:
   - Now requires hostnames to start with alphanumeric character
   - Prevents SSH option injection via hostnames starting with '-'
   - Pattern: ^[a-zA-Z0-9][a-zA-Z0-9._-]{0,63}$ (1-64 chars total)
   - Added IPv4 and IPv6 validation regexes for future use

2. Added validation to vulnerable V1 RPC handlers:
   - handleGetTemperature: Now validates node parameter before SSH
   - handleRegisterNodes: Now validates discovered cluster nodes
   - Previously these handlers passed unsanitized input directly to SSH

3. Defense in depth:
   - V2 handlers already had validation (now using improved regex)
   - Multiple layers of protection against malicious node identifiers
   - Validation prevents container from passing SSH options as hostnames

Without these fixes, a compromised container could potentially inject SSH
options by providing malicious node names, though the 'root@' prefix
provided some mitigation.

Addresses high-severity finding from security audit.
2025-10-19 16:28:38 +00:00
rcourtman
123e0f04ca feat: add comprehensive node cleanup system
Implements automated cleanup workflow when nodes are deleted from Pulse, removing all monitoring footprint from the host. Changes include a new RPC handler in the sensor proxy for cleanup requests, enhanced node deletion modal with detailed cleanup explanations, and improved SSH key management with proper tagging for atomic updates.
2025-10-17 18:53:45 +00:00
rcourtman
f141f7db33 feat: enhance sensor proxy with improved cluster discovery and SSH management
Improvements to pulse-sensor-proxy:
- Fix cluster discovery to use pvecm status for IP addresses instead of node names
- Add standalone node support for non-clustered Proxmox hosts
- Enhanced SSH key push with detailed logging, success/failure tracking, and error reporting
- Add --pulse-server flag to installer for custom Pulse URLs
- Configure www-data group membership for Proxmox IPC access

UI and API cleanup:
- Remove unused "Ensure cluster keys" button from Settings
- Remove /api/diagnostics/temperature-proxy/ensure-cluster-keys endpoint
- Remove EnsureClusterKeys method from tempproxy client

The setup script already handles SSH key distribution during initial configuration,
making the manual refresh button redundant.
2025-10-17 11:43:26 +00:00
rcourtman
e4c3b06f14 Automate sensor proxy container mount and auth 2025-10-14 12:41:48 +00:00
rcourtman
b952444837 refactor: Rename pulse-temp-proxy to pulse-sensor-proxy
The name "temp-proxy" implied a temporary or incomplete implementation. The new name better reflects its purpose as a secure sensor data bridge for containerized Pulse deployments.

Changes:
- Renamed cmd/pulse-temp-proxy/ to cmd/pulse-sensor-proxy/
- Updated all path constants and binary references
- Renamed environment variables: PULSE_TEMP_PROXY_* to PULSE_SENSOR_PROXY_*
- Updated systemd service and service account name
- Updated installation, rotation, and build scripts
- Renamed hardening documentation
- Maintained backward compatibility for key removal during upgrades
2025-10-13 13:17:05 +00:00