Commit graph

35 commits

Author SHA1 Message Date
rcourtman
cedf0c8f0f fix(temperature): parse string sensor values without zeroing readings (#1224) 2026-02-09 14:00:09 +00:00
rcourtman
1e77763870 feat: improve monitoring and temperature handling
Temperature Monitoring:
- Enhance temperature collection and processing
- Add temperature tests

Monitor Improvements:
- Improve monitor reload handling
- Add reload tests

Test Coverage:
- Add Ceph monitoring tests
- Add Docker commands tests
- Add host agent temperature tests
- Add extra coverage tests
2026-01-24 22:43:31 +00:00
rcourtman
7049f5b43c refactor: simplify temperature monitoring after sensor proxy removal
Remove proxy-related temperature code paths:
- temperature.go: remove proxy client integration and fallback logic
- config.go: remove SensorProxyEnabled and related config fields
- monitor.go: remove proxy client initialization and state

Temperature monitoring now relies solely on the unified agent approach.
2026-01-21 12:00:28 +00:00
rcourtman
c81bbba8a3 perf: Use strconv.Itoa instead of fmt.Sprintf for int conversion
strconv.Itoa is faster than fmt.Sprintf("%d", ...) because it doesn't
need to parse a format string. Changed 4 occurrences in monitoring
package where integers are converted to strings.
2025-12-02 15:21:41 +00:00
rcourtman
02a00473f2 fix: Make parseNVMeTemps deterministic by checking Composite before Sensor 1
Go map iteration order is non-deterministic, causing flaky test failures.
This fix ensures "Composite" sensor is always preferred over "Sensor 1"
by checking them in separate loops rather than relying on iteration order.
2025-12-01 20:03:05 +00:00
rcourtman
b370799988 chore: remove more dead code
Remove 330 lines of unreachable code:
- internal/monitoring/temperature_service.go: unused temperature service abstraction
- internal/monitoring/temperature.go: unused NewTemperatureCollector wrapper
- internal/mock/generator.go: unused GenerateAlerts function
- internal/mock/integration.go: unused ToggleMockMode wrapper
- internal/notifications/notifications.go: unused sendEmailWithContent,
  generatePayloadFromTemplate, isPrivateRange172, groupAlerts
- internal/notifications/email_providers.go: unused GetProviderDefaults
2025-11-27 00:10:55 +00:00
courtmanr@gmail.com
a68f2de3e6 Relax container SSH check for temperature monitoring (ref #727) 2025-11-25 08:00:08 +00:00
courtmanr@gmail.com
d23ab9a7f7 Feat: Add support for Raspberry Pi RP1 ADC temperature sensor (Fixes #745)
- Added 'rp1_adc' to the list of recognized CPU temperature chips
2025-11-23 22:33:16 +00:00
courtmanr@gmail.com
78308cbc10 Fix: Prevent single node auth failure from disabling global SSH temperature collection
- Removed global legacySSHDisabled flag that was triggered by any single node auth failure
- Changed disableLegacySSHOnAuthFailure to only log warnings
- Fixed potential context leak in monitor.go
- Updated tests to reflect removal of global disable logic
2025-11-23 22:24:15 +00:00
rcourtman
596bdbfb13 Handle standby SMART temps and capture disk identity 2025-11-22 07:35:13 +00:00
rcourtman
a03f8115b6 Improve installer temperature proxy and backup polling 2025-11-18 18:42:33 +00:00
rcourtman
13daa61d1d Harden turnkey install and proxy auto-registration 2025-11-18 00:24:50 +00:00
rcourtman
f9341ae1fc Improve temperature proxy workflow 2025-11-17 14:25:46 +00:00
rcourtman
47d5c14aef Improve temperature proxy control-plane flow 2025-11-15 21:49:51 +00:00
rcourtman
aa357e5013 Fix HTTP mode for pulse-sensor-proxy and improve installer safety
## HTTP Server Fixes
- Add source IP middleware to enforce allowed_source_subnets
- Fix missing source subnet validation for external HTTP requests
- HTTP health endpoint now respects subnet restrictions

## Installer Improvements
- Auto-configure allowed_source_subnets with Pulse server IP
- Add cluster node hostnames to allowed_nodes (not just IPs)
- Fix node validation to accept both hostnames and IPs
- Add Pulse server reachability check before installation
- Add port availability check for HTTP mode
- Add automatic rollback on service startup failure
- Add HTTP endpoint health check after installation
- Fix config backup and deduplication (prevent duplicate keys)
- Fix IPv4 validation with loopback rejection
- Improve registration retry logic with detailed errors
- Add automatic LXC bind mount cleanup on uninstall

## Temperature Collection Fixes
- Add local temperature collection for self-monitoring nodes
- Fix node identifier matching (use hostname not SSH host)
- Fix JSON double-encoding in HTTP client response

Related to #XXX (temperature monitoring fixes)
2025-11-13 18:22:36 +00:00
rcourtman
2ee693cc63 Add HTTP mode to pulse-sensor-proxy for multi-instance temperature monitoring
This implements HTTP/HTTPS support for pulse-sensor-proxy to enable
temperature monitoring across multiple separate Proxmox instances.

Architecture changes:
- Dual-mode operation: Unix socket (local) + HTTPS (remote)
- Unix socket remains default for security/performance (no breaking change)
- HTTP mode enables temps from external PVE hosts

Backend implementation:
- Add HTTPS server with TLS + Bearer token authentication to sensor-proxy
- Add TemperatureProxyURL and TemperatureProxyToken fields to PVEInstance
- Add HTTP client (internal/tempproxy/http_client.go) for remote proxy calls
- Update temperature collector to prefer HTTP proxy when configured
- Fallback logic: HTTP proxy → Unix socket → direct SSH (if not containerized)

Configuration:
- pulse-sensor-proxy config: http_enabled, http_listen_addr, http_tls_cert/key, http_auth_token
- PVEInstance config: temperature_proxy_url, temperature_proxy_token
- Environment variables: PULSE_SENSOR_PROXY_HTTP_* for all HTTP settings

Security:
- TLS 1.2+ with modern cipher suites
- Constant-time token comparison (timing attack prevention)
- Rate limiting applied to HTTP requests (shared with socket mode)
- Audit logging for all HTTP requests

Next steps:
- Update installer script to support HTTP mode + auto-registration
- Add Pulse API endpoint for proxy registration
- Generate TLS certificates during installation
- Test multi-instance temperature collection

Related to #571 (multi-instance architecture)
2025-11-13 16:13:53 +00:00
rcourtman
48fabdd827 Improve Docker temperature monitoring documentation for clarity (related to #600)
Updated the Quick Start for Docker section in TEMPERATURE_MONITORING.md to be
more user-friendly and address common setup issues:

- Added clear explanation of why the proxy is needed (containers can't access hardware)
- Provided concrete IP example instead of placeholder
- Showed full docker-compose.yml context with proper YAML structure
- Added sudo to commands where needed
- Updated docker-compose commands to v2 syntax with note about v1
- Expanded verification steps with clearer success indicators
- Added reminder to check container name in verification commands

These improvements should help users who encounter blank temperature displays
due to missing proxy installation or bind mount configuration.
2025-11-07 15:09:42 +00:00
rcourtman
2a79d57f73 Add SMART temperature collection for physical disks (related to #652)
Extends temperature monitoring to collect SMART temps for SATA/SAS disks,
addressing issue #652 where physical disk temperatures showed as empty.

Architecture:
- Deploys pulse-sensor-wrapper.sh as SSH forced command on Proxmox nodes
- Wrapper collects both CPU/GPU temps (sensors -j) and disk temps (smartctl)
- Implements 30-min cache with background refresh to avoid performance impact
- Uses smartctl -n standby,after to skip sleeping drives without waking them
- Returns unified JSON: {sensors: {...}, smart: [...]}

Backend changes:
- Add DiskTemp model with device, serial, WWN, temperature, lastUpdated
- Extend Temperature model with SMART []DiskTemp field and HasSMART flag
- Add WWN field to PhysicalDisk for reliable disk matching
- Update parseSensorsJSON to handle both legacy and new wrapper formats
- Rewrite mergeNVMeTempsIntoDisks to match SMART temps by WWN → serial → devpath
- Preserve legacy NVMe temperature support for backward compatibility

Performance considerations:
- SMART data cached for 30 minutes per node to avoid excessive smartctl calls
- Background refresh prevents blocking temperature requests
- Respects drive standby state to avoid spinning up idle arrays
- Staggered disk scanning with 0.1s delay to avoid saturating SATA controllers

Install script:
- Deploys wrapper to /usr/local/bin/pulse-sensor-wrapper.sh
- Updates SSH forced command from "sensors -j" to wrapper script
- Backward compatible - falls back to direct sensors output if wrapper missing

Testing note:
- Requires real hardware with smartmontools installed for full functionality
- Empty smart array returned gracefully when smartctl unavailable
- Legacy sensor-only nodes continue working without changes
2025-11-07 11:46:57 +00:00
rcourtman
dfe960deb4 Fix container SSH detection and improve troubleshooting for issue #617
Related to #617

This fixes a misconfiguration scenario where Docker containers could
attempt direct SSH connections (producing [preauth] log spam) instead
of using the sensor proxy.

Changes:
- Fix container detection to check PULSE_DOCKER=true in addition to
  system.InContainer() heuristics (both temperature.go and config_handlers.go)
- Upgrade temperature collection log from Error to Warn with actionable
  guidance about mounting the proxy socket
- Add Info log when dev mode override is active so operators understand
  the security posture
- Add troubleshooting section to docs for SSH [preauth] logs from containers

The container detection was inconsistent - monitor.go checked both flags
but temperature.go and config_handlers.go only checked InContainer().
Now all locations consistently check PULSE_DOCKER || InContainer().
2025-11-06 09:57:53 +00:00
rcourtman
12dc8693c4 Add NVIDIA GPU temperature monitoring support (nouveau driver)
- Add nouveau chip recognition to temperature parser
- Implement parseNouveauGPUTemps() for NVIDIA GPU temps via nouveau driver
- Map "GPU core" sensor to edge temperature field
- Supports systems using open-source nouveau driver

This complements the AMD GPU support added previously. Systems using
the nouveau driver will now see NVIDIA GPU temperatures in the
dashboard. For proprietary nvidia driver users, GPU temps are not
available via lm-sensors and would require nvidia-smi integration.
2025-11-06 00:24:42 +00:00
rcourtman
d62259ffa7 Add AMD GPU temperature monitoring support
Related to #600

- Add GPU field to Temperature model with edge, junction, and mem sensors
- Add amdgpu chip recognition to temperature parser
- Implement parseGPUTemps() to extract AMD GPU temperature data
- Update frontend TypeScript types to include GPU temperatures
- Display GPU temps in node table tooltip alongside CPU temps
- Set hasGPU flag when GPU data is available

This enables temperature monitoring for AMD GPUs (amdgpu sensors)
that was previously being collected via SSH but silently discarded
during parsing.
2025-11-06 00:19:04 +00:00
rcourtman
e21a72578f Add configurable SSH port for temperature monitoring
Related to #595

This change adds support for custom SSH ports when collecting temperature
data from Proxmox nodes, resolving issues for users who run SSH on non-standard
ports.

**Why SSH is still needed:**
Temperature monitoring requires reading /sys/class/hwmon sensors on Proxmox
nodes, which is not exposed via the Proxmox API. Even when using API tokens
for authentication, Pulse needs SSH access to collect temperature data.

**Changes:**
- Add `sshPort` configuration to SystemSettings (system.json)
- Add `SSHPort` field to Config with environment variable support (SSH_PORT)
- Add per-node SSH port override capability for PVE, PBS, and PMG instances
- Update TemperatureCollector to accept and use custom SSH port
- Update SSH known_hosts manager to support non-standard ports
- Add NewTemperatureCollectorWithPort() constructor with port parameter
- Maintain backward compatibility with NewTemperatureCollector() (uses port 22)
- Update frontend TypeScript types for SSH port configuration

**Configuration methods:**
1. Environment variable: SSH_PORT=2222
2. system.json: {"sshPort": 2222}
3. Per-node override in nodes.enc (future UI support)

**Default behavior:**
- Defaults to port 22 if not configured
- Maintains full backward compatibility
- No changes required for existing deployments

The implementation includes proper ssh-keyscan port handling and known_hosts
management for non-standard ports using [host]:port notation per SSH standards.
2025-11-05 20:03:29 +00:00
rcourtman
6404b6a5fc Expand temperature sensor compatibility for SuperIO and AMD CPUs
Users with NCT6687 SuperIO chips and AMD processors reporting only chiplet
temperatures were unable to see CPU temperature data. Added support for
Nuvoton/Winbond/Fintek SuperIO chips and AMD Tccd chiplet temperatures,
with debug logging to aid troubleshooting unsupported sensor configurations.

Related to discussion #586
2025-11-05 18:47:21 +00:00
rcourtman
10862db4e4 Enhance container detection for temperature SSH safeguards (refs #601) 2025-11-04 22:30:35 +00:00
rcourtman
5c4be1921c chore: snapshot current changes 2025-11-02 22:47:55 +00:00
rcourtman
dd2beffc8c Stop legacy temperature SSH retries when auth fails (#595) 2025-10-22 19:35:51 +00:00
rcourtman
30879c3b7b Handle AMD Tctl temperature readings (refs #586) 2025-10-22 12:58:34 +00:00
rcourtman
524f42cc28 security: complete Phase 1 sensor proxy hardening
Implements comprehensive security hardening for pulse-sensor-proxy:
- Privilege drop from root to unprivileged user (UID 995)
- Hash-chained tamper-evident audit logging with remote forwarding
- Per-UID rate limiting (0.2 QPS, burst 2) with concurrency caps
- Enhanced command validation with 10+ attack pattern tests
- Fuzz testing (7M+ executions, 0 crashes)
- SSH hardening, AppArmor/seccomp profiles, operational runbooks

All 27 Phase 1 tasks complete. Ready for production deployment.
2025-10-20 15:13:37 +00:00
Richard Courtman
de3bb47930 fix: improve turnkey temperature monitoring for standalone nodes
- Fix script input handling to work with standard curl | bash pattern by prioritizing /dev/tty
- Add Raspberry Pi temperature sensor support (cpu_thermal chip and generic temp sensors)
- Add comprehensive documentation for turnkey standalone node setup
- Fix printf formatting error in setup script
2025-10-18 06:51:56 +00:00
Richard Courtman
669d7dc05c feat: add turnkey temperature monitoring for standalone nodes
Implements automatic temperature monitoring setup for standalone
Proxmox/Pimox nodes without manual SSH key configuration.

Changes:
- Add /api/system/proxy-public-key endpoint to expose proxy's SSH public key
- Setup script now detects standalone nodes (non-cluster)
- Auto-fetches and installs proxy SSH key with forced commands
- Add Raspberry Pi temperature support via cpu_thermal and /sys/class/thermal
- Enhance setup script with better error handling for lm-sensors installation
- Add RPi detection to skip lm-sensors and use native thermal interface

Security:
- Public key endpoint is safe (public keys are meant to be public)
- All installed keys use forced command="sensors -j" with full restrictions
- No shell access, port forwarding, or other SSH features enabled
2025-10-17 22:15:50 +00:00
rcourtman
123e0f04ca feat: add comprehensive node cleanup system
Implements automated cleanup workflow when nodes are deleted from Pulse, removing all monitoring footprint from the host. Changes include a new RPC handler in the sensor proxy for cleanup requests, enhanced node deletion modal with detailed cleanup explanations, and improved SSH key management with proper tagging for atomic updates.
2025-10-17 18:53:45 +00:00
rcourtman
dd9bd65a2e fix: Add hasCPU/hasNVMe flags to prevent false 'no CPU sensor' errors
Addresses #101

v4.23.0 introduced a regression where systems with only NVMe temperatures
(no CPU sensor) would display "No CPU sensor" in the UI. This was caused
by the Available flag being set to true when NVMe temps existed, even
without CPU data, triggering the error message in the frontend.

Backend changes:
- Add HasCPU and HasNVMe boolean fields to Temperature model
- Extend CPU sensor detection to support more chip types: zenpower,
  k8temp, acpitz, it87 (case-insensitive matching)
- HasCPU is set based on CPU chip detection (coretemp, k10temp, etc.),
  not value thresholds
- This prevents false negatives when sensors report 0°C during resets
- CPU temperature values now accepted even when 0 (checked with !IsNaN
  instead of > 0)
- extractTempInput returns NaN instead of 0 when no data found
- Available flag means "any temperature data exists" for backward compatibility
- Update mock generator to properly set the new flags
- Add unit tests for NVMe-only and 0°C scenarios to prevent regression
- Removed amd_energy from CPU chip list (power sensor, not temperature)

Frontend changes:
- Add hasCPU and hasNVMe optional fields to Temperature interface
- Update NodeSummaryTable to check hasCPU flag with fallback to available
  for backward compatibility with older API responses
- Update NodeCard temperature display logic with same fallback pattern
- Systems with only NVMe temps now show "-" instead of error message
- Fallback ensures UI works with both old and new API responses

Testing:
- All unit tests pass including NVMe-only and 0°C test cases
- Fix prevents false "no CPU sensor" errors when sensors temporarily report 0°C
- Fix eliminates false "no CPU sensor" errors for NVMe-only systems
2025-10-13 10:17:17 +00:00
rcourtman
e7bc338891 feat: Implement secure temperature proxy for containerized deployments
Addresses #528

Introduces pulse-temp-proxy architecture to eliminate SSH key exposure in containers:

**Architecture:**
- pulse-temp-proxy runs on Proxmox host (outside LXC/Docker)
- SSH keys stored on host filesystem (/var/lib/pulse-temp-proxy/ssh/)
- Pulse communicates via unix socket (bind-mounted into container)
- Proxy handles cluster discovery, key rollout, and temperature fetching

**Components:**
- cmd/pulse-temp-proxy: Standalone Go binary with unix socket RPC server
- internal/tempproxy: Client library for Pulse backend
- scripts/install-temp-proxy.sh: Idempotent installer for existing deployments
- scripts/pulse-temp-proxy.service: Systemd service for proxy

**Integration:**
- Pulse automatically detects and uses proxy when socket exists
- Falls back to direct SSH for native installations
- Installer automatically configures proxy for new LXC deployments
- Existing LXC users can upgrade by running install-temp-proxy.sh

**Security improvements:**
- Container compromise no longer exposes SSH keys
- SSH keys never enter container filesystem
- Maintains forced command restrictions
- Transparent to users - no workflow changes

**Documentation:**
- Updated TEMPERATURE_MONITORING.md with new architecture
- Added verification steps and upgrade instructions
- Preserved legacy documentation for native installs
2025-10-12 21:35:35 +00:00
rcourtman
274f36daa8 Improve dashboard responsiveness and temperature handling 2025-10-12 10:34:06 +00:00
rcourtman
f46ff1792b Fix settings security tab navigation 2025-10-11 23:29:47 +00:00