Commit graph

23 commits

Author SHA1 Message Date
rcourtman
cedf0c8f0f fix(temperature): parse string sensor values without zeroing readings (#1224) 2026-02-09 14:00:09 +00:00
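The fix above concerns sensor readings that arrive as strings rather than numbers. A minimal sketch of that kind of tolerant parsing follows. It assumes the readings come from JSON decoded into `interface{}` values. The helper name `parseSensorValue` and the `temp1_input` key are illustrative assumptions, not the project's actual code.

```go
package main

import (
	"encoding/json"
	"fmt"
	"math"
	"strconv"
	"strings"
)

// parseSensorValue is a hypothetical helper: it converts a decoded JSON value
// into a float64 and returns NaN (not 0) when the value cannot be interpreted,
// so an unparseable reading is never mistaken for a real 0 degree reading.
func parseSensorValue(v interface{}) float64 {
	switch t := v.(type) {
	case float64:
		return t
	case int:
		return float64(t)
	case json.Number:
		if f, err := t.Float64(); err == nil {
			return f
		}
	case string:
		// Numeric strings such as "45.5" or " 61 " are accepted.
		if f, err := strconv.ParseFloat(strings.TrimSpace(t), 64); err == nil {
			return f
		}
	}
	return math.NaN()
}

func main() {
	var reading map[string]interface{}
	// The sensor value is a JSON string here, not a JSON number.
	if err := json.Unmarshal([]byte(`{"temp1_input": "45.5"}`), &reading); err != nil {
		panic(err)
	}
	fmt.Println(parseSensorValue(reading["temp1_input"]))
}
```

Returning NaN for unknown input keeps "no reading" distinguishable from a genuine zero, which is the distinction the commit title points at.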
rcourtman
1e77763870 feat: improve monitoring and temperature handling
Temperature Monitoring:
- Enhance temperature collection and processing
- Add temperature tests

Monitor Improvements:
- Improve monitor reload handling
- Add reload tests

Test Coverage:
- Add Ceph monitoring tests
- Add Docker commands tests
- Add host agent temperature tests
- Add extra coverage tests
2026-01-24 22:43:31 +00:00
rcourtman
d4a6c0d2e8 refactor: remove legacy pulse-sensor-proxy temperature monitoring
The sensor proxy approach for temperature monitoring has been superseded
by the unified agent architecture where host agents report temperature
data directly. This removes:

- cmd/pulse-sensor-proxy/ - standalone proxy daemon
- internal/tempproxy/ - client library
- internal/api/*temperature_proxy* - API handlers and tests
- internal/api/sensor_proxy_gate* - feature gate
- internal/monitoring/*proxy_test* - proxy-specific tests
- scripts/*sensor-proxy* - installation and management scripts
- security/apparmor/, security/seccomp/ - proxy security profiles

Temperature monitoring remains available via the unified agent approach.
2026-01-21 11:59:04 +00:00
rcourtman
1230099d3d fix(test): resolve flaky concurrent temperature collection test 2025-12-19 17:09:57 +00:00
rcourtman
4f824ab148 style: Apply gofmt to 37 files
Standardize code formatting across test files and monitor.go.
No functional changes.
2025-12-02 17:21:48 +00:00
rcourtman
2f86211d51 test: Add tests for GPU temperature parsing functions
- parseGPUTemps: AMD GPU edge/junction/mem temperatures, sensor mapping
- parseNouveauGPUTemps: NVIDIA nouveau GPU core temperature parsing
- Case-insensitive sensor name matching
- Invalid/zero/negative temperature handling
- Append behavior for multiple GPUs
2025-12-01 21:56:33 +00:00
rcourtman
5a56b11c3a test: Add table-driven tests for parseSensorsJSON
Comprehensive coverage for temperature sensor JSON parsing:
- Empty/whitespace input handling
- Invalid JSON error handling
- Legacy and wrapper format support
- Multiple chip types: coretemp, k10temp, acpitz, nvme, amdgpu, nouveau, zenpower, nct6795, it87, rp1_adc
- Edge cases: non-map chip data, CPUMax calculation from cores
2025-12-01 21:00:46 +00:00
rcourtman
d079b4ccbb test: Add tests for StalenessScore, parseCPUTemps, queueDockerStopCommand
- StalenessScore: 78.3%→95.7% (4 cases for metrics, future timestamp, defaults)
- parseCPUTemps: 98.1%→100% (core temp exceeding package case)
- queueDockerStopCommand: 76.9%→100% (12 cases for validation, status checks)
2025-12-01 19:50:42 +00:00
rcourtman
d9f58fc45e test: Add tests for handleProxyHostFailure, recordNodeSnapshot, evaluateHostAgents
- handleProxyHostFailure: 76.5%→100% (9 cases for per-host failure tracking)
- recordNodeSnapshot: 75%→100% (6 cases for diagnostic snapshot storage)
- evaluateHostAgents: 87%→100% (10 cases for host health evaluation)
2025-12-01 19:40:32 +00:00
rcourtman
35eb2c58f1 test: Add tests for handleProxyFailure, parseRPiTemperature, RecordResult
- handleProxyFailure: 86.7%→100% (6 cases for proxy disable threshold logic)
- parseRPiTemperature: 77.8%→100% (7 cases for RPi temp parsing edge cases)
- RecordResult: 95.8%→100% (negative staleness clamping case)
2025-12-01 19:33:06 +00:00
rcourtman
1a79d1ca3f test: Add tests for isProxyEnabled, taskHeap.Pop, NewAdaptiveScheduler
- isProxyEnabled: 94.4%→100% (7 cases for proxy cooldown/restore logic)
- taskHeap.Pop: 87.5%→100% (3 cases for empty/single/multi-element heap)
- NewAdaptiveScheduler: 84.6%→100% (14 cases for default value handling)
2025-12-01 19:26:37 +00:00
rcourtman
3aa3ab5bd0 test: Add tests for disableLegacySSHOnAuthFailure, parseNVMeTemps, updateDeadLetterMetrics
- disableLegacySSHOnAuthFailure: 87.5%→100% (13 cases for auth error detection)
- parseNVMeTemps: 88.9%→100% (16 cases for NVMe temp parsing)
- updateDeadLetterMetrics: 75%→100% (6 cases for nil handling, queue states)
2025-12-01 19:20:09 +00:00
rcourtman
dd347501d9 test: Add tests for polling interval, proxy success handlers, and failed instance removal
- effectivePVEPollingInterval: 80%→100% (6 cases for nil/clamping)
- handleProxySuccess: 80%→100% (3 cases for nil client, reset)
- handleProxyHostSuccess: →100% (6 cases for empty/whitespace/removal)
- removeFailedPBSNode: 75%→100% (5 cases for removal, backups, health)
- removeFailedPMGInstance: 75%→100% (5 cases for removal, backups, health)
2025-12-01 18:49:09 +00:00
rcourtman
2758261581 test: Add tests for recordAuthFailure, shouldSkipProxyHost, recoverFromPanic
- recordAuthFailure: 53%→100% (8 tests covering failure counting, node removal)
- shouldSkipProxyHost: 44%→100% (9 tests covering cooldown logic, state cleanup)
- recoverFromPanic: 50%→100% (7 tests covering various panic value types)

Monitoring package coverage: 45.3%→45.9%
2025-12-01 18:20:12 +00:00
rcourtman
d054c20cd4 Add unit tests for temperature utility functions
Add focused unit tests for four utility functions in temperature.go:
- extractTempInput: 16 test cases for sensor value extraction
- extractCoreNumber: 18 test cases for core number parsing
- extractHostname: 21 test cases for URL hostname extraction
- normalizeSMARTEntries: 15 test cases for SMART data normalization

70 test cases total covering type conversions, edge cases,
boundary conditions, and error handling paths.
2025-11-30 04:20:28 +00:00
courtmanr@gmail.com
78308cbc10 Fix: Prevent a single node's auth failure from disabling SSH temperature collection globally
- Removed global legacySSHDisabled flag that was triggered by any single node auth failure
- Changed disableLegacySSHOnAuthFailure to only log warnings
- Fixed potential context leak in monitor.go
- Updated tests to reflect removal of global disable logic
2025-11-23 22:24:15 +00:00
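The commit above removes a global flag so that one node's auth failure no longer affects other nodes. The message only states that `disableLegacySSHOnAuthFailure` now logs warnings. The sketch below is an illustrative assumption, not the project's code: it shows one way failure state can be scoped per node instead of globally. The type `nodeAuthTracker`, its methods, and the node names are all hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

// nodeAuthTracker is a hypothetical per-node failure counter. A failure on one
// node never changes the state of any other node, unlike a single global flag.
type nodeAuthTracker struct {
	mu       sync.Mutex
	failures map[string]int
	limit    int // consecutive failures after which a node is skipped
}

func newNodeAuthTracker(limit int) *nodeAuthTracker {
	return &nodeAuthTracker{failures: make(map[string]int), limit: limit}
}

// recordFailure increments the failure count for one node only.
func (t *nodeAuthTracker) recordFailure(node string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.failures[node]++
}

// recordSuccess clears the node's failure state.
func (t *nodeAuthTracker) recordSuccess(node string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	delete(t.failures, node)
}

// shouldSkip reports whether this specific node has reached the limit.
func (t *nodeAuthTracker) shouldSkip(node string) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	return t.failures[node] >= t.limit
}

func main() {
	tracker := newNodeAuthTracker(2)
	tracker.recordFailure("pve1")
	tracker.recordFailure("pve1")
	// Only pve1 is affected; pve2 keeps collecting.
	fmt.Println(tracker.shouldSkip("pve1"), tracker.shouldSkip("pve2"))
}
```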
rcourtman
596bdbfb13 Handle standby SMART temps and capture disk identity 2025-11-22 07:35:13 +00:00
rcourtman
6404b6a5fc Expand temperature sensor compatibility for SuperIO and AMD CPUs
Users with NCT6687 SuperIO chips and AMD processors reporting only chiplet
temperatures were unable to see CPU temperature data. Added support for
Nuvoton/Winbond/Fintek SuperIO chips and AMD Tccd chiplet temperatures,
with debug logging to aid troubleshooting unsupported sensor configurations.

Related to discussion #586
2025-11-05 18:47:21 +00:00
rcourtman
dd2beffc8c Stop legacy temperature SSH retries when auth fails (#595) 2025-10-22 19:35:51 +00:00
rcourtman
30879c3b7b Handle AMD Tctl temperature readings (refs #586) 2025-10-22 12:58:34 +00:00
rcourtman
524f42cc28 security: complete Phase 1 sensor proxy hardening
Implements comprehensive security hardening for pulse-sensor-proxy:
- Privilege drop from root to unprivileged user (UID 995)
- Hash-chained tamper-evident audit logging with remote forwarding
- Per-UID rate limiting (0.2 QPS, burst 2) with concurrency caps
- Enhanced command validation with 10+ attack pattern tests
- Fuzz testing (7M+ executions, 0 crashes)
- SSH hardening, AppArmor/seccomp profiles, operational runbooks

All 27 Phase 1 tasks complete. Ready for production deployment.
2025-10-20 15:13:37 +00:00
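The hardening commit above mentions hash-chained, tamper-evident audit logging. As a rough sketch of the general technique (not the proxy's actual format), each entry can store the hash of its predecessor, so altering any earlier entry invalidates every later hash. The types, field names, and the use of SHA-256 here are assumptions for illustration.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// auditEntry is a hypothetical log record linked to its predecessor by hash.
type auditEntry struct {
	Message  string
	PrevHash string
	Hash     string
}

// hashEntry derives an entry hash from the previous hash and the message.
func hashEntry(prevHash, message string) string {
	sum := sha256.Sum256([]byte(prevHash + "\n" + message))
	return hex.EncodeToString(sum[:])
}

// appendEntry adds a record whose hash covers the previous record's hash.
func appendEntry(log []auditEntry, message string) []auditEntry {
	prev := ""
	if len(log) > 0 {
		prev = log[len(log)-1].Hash
	}
	return append(log, auditEntry{Message: message, PrevHash: prev, Hash: hashEntry(prev, message)})
}

// verifyChain returns the index of the first inconsistent entry, or -1 if the
// whole chain is intact.
func verifyChain(log []auditEntry) int {
	prev := ""
	for i, e := range log {
		if e.PrevHash != prev || e.Hash != hashEntry(prev, e.Message) {
			return i
		}
		prev = e.Hash
	}
	return -1
}

func main() {
	var log []auditEntry
	log = appendEntry(log, "uid=1000 requested temperatures")
	log = appendEntry(log, "uid=1000 rate limited")
	fmt.Println("intact chain:", verifyChain(log))

	log[0].Message = "edited after the fact"
	fmt.Println("first bad entry:", verifyChain(log))
}
```

Forwarding the latest hash to a remote system, as the commit's "remote forwarding" suggests, would let a verifier detect truncation as well as edits. That part is not shown here.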
rcourtman
dd9bd65a2e fix: Add hasCPU/hasNVMe flags to prevent false 'no CPU sensor' errors
Addresses #101

v4.23.0 introduced a regression where systems with only NVMe temperatures
(no CPU sensor) would display "No CPU sensor" in the UI. This was caused
by the Available flag being set to true when NVMe temps existed, even
without CPU data, triggering the error message in the frontend.

Backend changes:
- Add HasCPU and HasNVMe boolean fields to Temperature model
- Extend CPU sensor detection to support more chip types: zenpower,
  k8temp, acpitz, it87 (case-insensitive matching)
- HasCPU is set based on CPU chip detection (coretemp, k10temp, etc.),
  not value thresholds
- This prevents false negatives when sensors report 0°C during resets
- CPU temperature values now accepted even when 0 (checked with !IsNaN
  instead of > 0)
- extractTempInput returns NaN instead of 0 when no data found
- Available flag means "any temperature data exists" for backward compatibility
- Update mock generator to properly set the new flags
- Add unit tests for NVMe-only and 0°C scenarios to prevent regression
- Removed amd_energy from CPU chip list (power sensor, not temperature)

Frontend changes:
- Add hasCPU and hasNVMe optional fields to Temperature interface
- Update NodeSummaryTable to check hasCPU flag with fallback to available
  for backward compatibility with older API responses
- Update NodeCard temperature display logic with same fallback pattern
- Systems with only NVMe temps now show "-" instead of error message
- Fallback ensures UI works with both old and new API responses

Testing:
- All unit tests pass including NVMe-only and 0°C test cases
- Fix prevents false "no CPU sensor" errors when sensors temporarily report 0°C
- Fix eliminates false "no CPU sensor" errors for NVMe-only systems
2025-10-13 10:17:17 +00:00
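The backend changes in the commit above can be sketched as follows. This is a simplified illustration under stated assumptions, not the project's implementation. The flat `chip -> subfeature -> value` input shape, the prefix matching, and the names `tempInputOrNaN`, `isCPUChip`, and `buildTemperature` are all hypothetical. Only the field names `Available`, `HasCPU`, and `HasNVMe`, and the listed chip types, come from the commit message.

```go
package main

import (
	"fmt"
	"math"
	"strings"
)

// Temperature carries the flags described in the commit message.
type Temperature struct {
	CPUPackage float64
	Available  bool // any temperature data exists (backward compatible)
	HasCPU     bool // a CPU chip was detected, regardless of its value
	HasNVMe    bool
}

// tempInputOrNaN returns the first "*_input" value, or NaN when none exists,
// so a missing reading cannot be confused with a real 0 degree reading.
func tempInputOrNaN(sensor map[string]interface{}) float64 {
	for key, v := range sensor {
		if strings.HasSuffix(key, "_input") {
			if f, ok := v.(float64); ok {
				return f
			}
		}
	}
	return math.NaN()
}

// Chip types named in the commit message; prefix matching is an assumption.
var cpuChipPrefixes = []string{"coretemp", "k10temp", "zenpower", "k8temp", "acpitz", "it87"}

// isCPUChip matches chip names case-insensitively.
func isCPUChip(name string) bool {
	lower := strings.ToLower(name)
	for _, p := range cpuChipPrefixes {
		if strings.HasPrefix(lower, p) {
			return true
		}
	}
	return false
}

func buildTemperature(chips map[string]map[string]interface{}) Temperature {
	var t Temperature
	for chip, sensor := range chips {
		v := tempInputOrNaN(sensor)
		switch {
		case isCPUChip(chip):
			t.HasCPU = true // set from chip detection, not from a value threshold
			if !math.IsNaN(v) {
				t.CPUPackage = v // 0 is accepted as a valid reading
			}
		case strings.HasPrefix(strings.ToLower(chip), "nvme"):
			t.HasNVMe = true
		}
	}
	t.Available = t.HasCPU || t.HasNVMe
	return t
}

func main() {
	nvmeOnly := buildTemperature(map[string]map[string]interface{}{
		"nvme-pci-0100": {"temp1_input": 38.0},
	})
	fmt.Printf("%+v\n", nvmeOnly)
}
```

With this shape, an NVMe-only host reports `Available` as true and `HasCPU` as false, which gives the frontend enough information to show "-" instead of a "no CPU sensor" error.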
rcourtman
274f36daa8 Improve dashboard responsiveness and temperature handling 2025-10-12 10:34:06 +00:00