Commit graph

53 commits

Author SHA1 Message Date
rcourtman
7ae393c8ec Refine Proxmox node memory fallback (#582) 2025-10-22 15:36:26 +00:00
rcourtman
c9543e8a7e Add qemu guest agent version metadata 2025-10-22 15:24:07 +00:00
rcourtman
77108abc65 Propagate config updates to settings nodes (#588) 2025-10-22 13:45:13 +00:00
rcourtman
be26f957c0 Add snapshot size alert thresholds (#585) 2025-10-22 13:30:40 +00:00
rcourtman
30879c3b7b Handle AMD Tctl temperature readings (refs #586) 2025-10-22 12:58:34 +00:00
rcourtman
f83caf8933 Add collision-safe Docker host identifiers (#590) 2025-10-22 12:30:25 +00:00
rcourtman
bc479643e4 release: prepare v4.25.0 2025-10-22 10:46:18 +00:00
rcourtman
4eb8bed9b5 Fix initial setup caching and container discovery defaults 2025-10-22 07:34:32 +00:00
rcourtman
2786afdff0 feat: comprehensive diagnostics and observability improvements
Upgrade diagnostics infrastructure from 5/10 to 8/10 production readiness
with enhanced metrics, logging, and request correlation capabilities.

**Request Correlation**
- Wire request IDs through context in middleware
- Return X-Request-ID header in all API responses
- Enable downstream log correlation across request lifecycle

**HTTP/API Metrics** (18 new Prometheus metrics)
- pulse_http_request_duration_seconds - API latency histogram
- pulse_http_requests_total - request counter by method/route/status
- pulse_http_request_errors_total - error counter by type
- Path normalization to control label cardinality

**Per-Node Poll Metrics**
- pulse_monitor_node_poll_duration_seconds - per-node timing
- pulse_monitor_node_poll_total - success/error counts per node
- pulse_monitor_node_poll_errors_total - error breakdown per node
- pulse_monitor_node_poll_last_success_timestamp - freshness tracking
- pulse_monitor_node_poll_staleness_seconds - age since last success
- Enables multi-node hotspot identification

**Scheduler Health Metrics**
- pulse_scheduler_queue_due_soon - ready queue depth
- pulse_scheduler_queue_depth - by instance type
- pulse_scheduler_queue_wait_seconds - time in queue histogram
- pulse_scheduler_dead_letter_depth - failed task tracking
- pulse_scheduler_breaker_state - circuit breaker state
- pulse_scheduler_breaker_failure_count - consecutive failures
- pulse_scheduler_breaker_retry_seconds - time until retry
- Enable alerting on DLQ spikes, breaker opens, queue backlogs

**Diagnostics Endpoint Caching**
- pulse_diagnostics_cache_hits_total - cache performance
- pulse_diagnostics_cache_misses_total - cache misses
- pulse_diagnostics_refresh_duration_seconds - probe timing
- 45-second TTL prevents thundering herd on /api/diagnostics
- Thread-safe with RWMutex
- X-Diagnostics-Cached-At header shows cache freshness

**Debug Log Performance**
- Gate high-frequency debug logs behind IsLevelEnabled() checks
- Reduces CPU waste in production when debug disabled
- Covers scheduler loops, poll cycles, API handlers

**Persistent Logging**
- File logging with automatic rotation
- LOG_FILE, LOG_MAX_SIZE, LOG_MAX_AGE, LOG_COMPRESS env vars
- MultiWriter sends logs to both stderr and file
- Gzip compression support for rotated logs

Files modified:
- internal/api/diagnostics.go (caching layer)
- internal/api/middleware.go (request IDs, HTTP metrics)
- internal/api/http_metrics.go (NEW - HTTP metric definitions)
- internal/logging/logging.go (file logging with rotation)
- internal/monitoring/metrics.go (node + scheduler metrics)
- internal/monitoring/monitor.go (instrumentation, debug gating)

Impact: Dramatically improved production troubleshooting with per-node
visibility, scheduler health metrics, persistent logs, and cached
diagnostics. Fast incident response now possible for multi-node deployments.
2025-10-21 12:37:39 +00:00
rcourtman
5ebb32ce10 feat: enhance runtime configuration and system settings management
Improves configuration handling and system settings APIs to support
v4.24.0 features including runtime logging controls, adaptive polling
configuration, and enhanced config export/persistence.

Changes:
- Add config override system for discovery service
- Enhance system settings API with runtime logging controls
- Improve config persistence and export functionality
- Update security setup handling
- Refine monitoring and discovery service integration

These changes provide the backend support for the configuration
features documented in the v4.24.0 release.
2025-10-20 17:41:19 +00:00
rcourtman
73fb9d986f feat: add PBS/PMG stubs to test harness and implement HTTP config fetch
Resolves two remaining TODOs from codebase audit.

## 1. PBS/PMG Test Harness Stubs

**Location:** internal/monitoring/harness_integration.go:149-151

**Changes:**
- Added PBS client stub registration: `monitor.pbsClients[inst.Name] = &pbs.Client{}`
- Added PMG client stub registration: `monitor.pmgClients[inst.Name] = &pmg.Client{}`
- Added imports for pkg/pbs and pkg/pmg

**Purpose:**
Enables integration test scenarios to include PBS and PMG instance types
alongside existing PVE support. Stubs allow scheduler to register and
execute tasks for these instance types during integration testing.

**Testing:**
 TestAdaptiveSchedulerIntegration passes (55.5s)
 Integration test harness now supports all three instance types

## 2. HTTP Config URL Fetch

**Location:** cmd/pulse/config.go:226-261

**Problem:**
`PULSE_INIT_CONFIG_URL` was recognized but not implemented, returning
"URL import not yet implemented" error.

**Implementation:**
- URL validation (http/https schemes only)
- HTTP client with 15 second timeout
- Status code validation (2xx required)
- Empty response detection
- Base64 decoding with fallback to raw data
- Matches existing env-var behavior for `PULSE_INIT_CONFIG_DATA`

**Security:**
- Both HTTP and HTTPS supported (HTTPS recommended for production)
- URL scheme validation prevents file:// or other protocols
- Timeout prevents hanging on unresponsive servers

**Usage:**
```bash
export PULSE_INIT_CONFIG_URL="https://config-server/encrypted-config"
export PULSE_INIT_CONFIG_PASSPHRASE="secret"
pulse config auto-import
```

**Testing:**
 Code compiles cleanly
 Follows same pattern as existing PULSE_INIT_CONFIG_DATA handling

## Impact

- Completes integration test infrastructure for all instance types
- Enables automated config distribution via HTTP(S) for container deployments
- Removes last TODOs from codebase (no TODO/FIXME remaining in Go files)
2025-10-20 16:05:45 +00:00
rcourtman
c1bf03fe39 fix: use proper Monitor constructor in PMG tests to initialize all maps
Fixes panic: assignment to entry in nil map in PMG polling tests.

**Problem:**
Tests were manually creating Monitor structs without initializing internal
maps like pollStatusMap, causing nil map panics when recordTaskResult()
tried to update task status.

**Root Cause:**
- TestPollPMGInstancePopulatesState (line 90)
- TestPollPMGInstanceRecordsAuthFailures (line 189)

Both created Monitor with only partial field initialization, missing:
- pollStatusMap
- dlqInsightMap
- instanceInfoCache
- Other internal state maps

**Solution:**
Changed both tests to use New() constructor which properly initializes all
maps and internal state (monitor.go:1541). This ensures tests match production
initialization and will automatically pick up any future map additions.

**Tests:**
 TestPollPMGInstancePopulatesState - now passes
 TestPollPMGInstanceRecordsAuthFailures - now passes
 All monitoring tests pass (0.125s)

Follows best practice: use constructors instead of manual struct creation
to maintain initialization invariants.
2025-10-20 15:22:23 +00:00
rcourtman
9b1709a05b feat: enhance scheduler health API with rich instance metadata
Add comprehensive instance-level diagnostics to /api/monitoring/scheduler/health

**New Response Structure:**

Enhanced "instances" array with per-instance details:
- Instance metadata: displayName, type, connection URL
- Poll status: last success/error timestamps, error messages, error category
- Circuit breaker: state, timestamps, failure counts, retry windows
- Dead letter: present flag, reason, attempt history, retry schedule

**Implementation:**

Data structures:
- instanceInfo: cache of display names, URLs, types
- pollStatus: tracks successes/errors with timestamps and categories
- dlqInsight: DLQ entry metadata (reason, attempts, schedule)
- circuitBreaker: enhanced with stateSince, lastTransition

Tracking logic:
- buildInstanceInfoCache: populate metadata from config on startup
- recordTaskResult: track poll outcomes, error details, categories
- sendToDeadLetter: capture DLQ insights (reason, timestamps)
- circuitBreaker: record state transitions with timestamps

**Backward Compatible:**
- Existing fields (deadLetter, breakers, staleness) unchanged
- New "instances" array is additive
- Old clients can ignore new fields

**Testing:**
- Unit test: TestSchedulerHealth_EnhancedResponse validates all fields
- Integration tests: still passing (55s)
- All error tracking and breaker history verified

**Operator Benefits:**
- Diagnose issues without log digging
- See error messages directly in API
- Understand breaker states and retry schedules
- Track DLQ entries with full context
- Single API call for complete instance health view

Example: Quickly identify "401 unauthorized" on specific PBS instance,
see it's in DLQ after 5 retries, and know when next retry scheduled.

Part of Phase 2 follow-up work to improve observability.
2025-10-20 15:13:38 +00:00
rcourtman
14d06a1654 test: add soak test with runtime instrumentation (Phase 2 Task 9d)
Add comprehensive soak testing capabilities:

**Runtime Instrumentation:**
- Periodic sampling of heap, stack, goroutines, GC count
- Sample every 10s during harness runs
- HarnessReport includes full RuntimeSamples history
- Detect memory leaks (>10% sustained growth)
- Detect goroutine leaks (>20 leaked goroutines)

**Soak Test:**
- TestAdaptiveSchedulerSoak with 15min+ duration
- Skip unless -soak flag or HARNESS_SOAK_MINUTES set
- 80 synthetic instances (60 healthy, 15 transient, 5 permanent)
- Configurable duration via env var
- Validates: heap growth <10%, goroutines stable, queue depth bounded
- Staleness threshold: 45s for long-running tests

**Wrapper Script:**
- testing-tools/run_adaptive_soak.sh for easy execution
- Accepts duration in minutes: ./run_adaptive_soak.sh 30
- Logs to tmp/adaptive_soak_<timestamp>.log
- Sets proper timeout (duration + 5min buffer)

**Test Results (2-minute validation):**
- 80 instances, 17 samples
- Heap: 2.3MB → 3.1MB (healthy)
- Goroutines: 16 → 6 (no leak, actually decreased)
- Circuit breakers: correctly blocking transient failures

Run with: go test -tags=integration ./internal/monitoring -run TestAdaptiveSchedulerSoak -soak -timeout 20m

Part of Phase 2 Task 9 (Integration/Soak Testing)
2025-10-20 15:13:38 +00:00
rcourtman
2636ba9137 test: add comprehensive integration test harness for adaptive polling (Phase 2 Task 9c)
Add PollExecutor seam and integration test infrastructure:

**PollExecutor Interface:**
- Add pluggable executor interface for testability
- Implement realExecutor wrapping existing poll functions
- Add SetExecutor() for test injection
- Zero impact on production behavior

**Integration Test Harness:**
- Build-tagged integration tests (go:build integration)
- Synthetic workload generator with configurable scenarios
- Fake executor simulating latencies, failures, recovery
- Runtime metrics collection (queue depth, staleness, goroutines)

**Comprehensive Assertions:**
- Queue depth bounds: stays within 1.5× instance count
- Staleness: healthy instances <20s, multiple poll cycles
- Circuit breakers: transient failures recover, permanent stay blocked
- Dead-letter queue: only permanent failures routed
- Scheduler health: snapshot consistency validation

**Test Scenarios:**
- 10 healthy PVE instances (rapid polling)
- 1 transient failure instance (fail → recover)
- 1 permanent failure instance (DLQ routing)
- 55s test duration with 3s base intervals
- Validates full adaptive scheduler lifecycle

Runs with: go test -tags=integration ./internal/monitoring -run TestAdaptiveSchedulerIntegration

Part of Phase 2 Task 9 (Integration/Soak Testing)
2025-10-20 15:13:38 +00:00
rcourtman
7d422d2909 feat: add professional logging with runtime configuration and performance optimization
Implements structured logging package with LOG_LEVEL/LOG_FORMAT env support, debug level guards for hot paths, enriched error messages with actionable context, and stack trace capture for production debugging. Improves observability and reduces log overhead in high-frequency polling loops.
2025-10-20 15:13:38 +00:00
rcourtman
25b797f18d test: add comprehensive staleness tracker unit tests (Phase 2 Task 9b)
Added 17 test cases covering:
- UpdateSuccess/UpdateError state management
- Staleness scoring (fresh, stale, max-stale, never-succeeded)
- Score normalization and capping (0.0 to 1.0 range)
- SetBounds behavior and defaults
- Snapshot merging logic
- Snapshot() API for full state export
- Nil safety and concurrent access

All tests verify correct freshness calculation based on lastSuccess
timestamps and configurable maxStale bounds.

Phase 2 testing status:
-  Backoff exponential growth and jitter (13 tests)
-  Circuit breaker state machine (10 tests)
-  Staleness tracker scoring (17 tests)
- Total: 40+ unit tests covering core scheduling logic
2025-10-20 15:13:38 +00:00
rcourtman
24ae6d8d78 test: add comprehensive unit tests for backoff and circuit breaker (Phase 2 Task 9a)
Added 30+ test cases covering:

Backoff tests (backoff_test.go):
- Exponential growth with multiplier
- Jitter distribution and bounds
- Max delay capping
- Edge cases (negative attempts, zero config values)
- Realistic production scenarios

Circuit breaker tests (circuit_breaker_test.go):
- State transitions: closed → open → half-open → closed
- Retry interval backoff with bit-shifting (5s << failureCount)
- Half-open window behavior
- Concurrent access safety
- Default parameter validation

All tests pass with proper handling of time-based state transitions
and exponential backoff mechanics (bit-shift based retry intervals).
2025-10-20 15:13:38 +00:00
rcourtman
160adeb3b8 feat: add scheduler health API endpoint (Phase 2 Task 8)
Task 8 of 10 complete. Exposes read-only scheduler health data including:
- Queue depth and distribution by instance type
- Dead-letter queue inspection (top 25 tasks with error details)
- Circuit breaker states (instance-level)
- Staleness scores per instance

New API endpoint:
  GET /api/monitoring/scheduler/health (requires authentication)

New snapshot methods:
- StalenessTracker.Snapshot() - exports all staleness data
- TaskQueue.Snapshot() - queue depth & per-type distribution
- TaskQueue.PeekAll() - dead-letter task inspection
- circuitBreaker.State() - exports state, failures, retryAt
- Monitor.SchedulerHealth() - aggregates all health data

Documentation updated with API spec, field descriptions, and usage examples.
2025-10-20 15:13:38 +00:00
rcourtman
b1f445b33d feat: implement error handling with circuit breakers and backoff (Phase 2 Task 7)
Adds comprehensive error resilience:
- Circuit breaker with closed/open/half-open states (3 failures = trip)
- Exponential backoff with jitter (2s initial, 2x multiplier, 5min max)
- Dead-letter queue for tasks exceeding 5 retry attempts
- Error classification (transient vs permanent) using internal/errors helpers
- Per-instance failure tracking and breaker state management
- Integration with staleness tracker for outcome recording

Task 7 of 10 complete (70%). Ready for API surfaces and testing.
2025-10-20 15:13:37 +00:00
rcourtman
aa5c08ad4a feat: implement priority queue-based task execution (Phase 2 Task 6)
Replaces immediate polling with queue-based scheduling:
- TaskQueue with min-heap (container/heap) for NextRun-ordered execution
- Worker goroutines that block on WaitNext() until tasks are due
- Tasks only execute when NextRun <= now, respecting adaptive intervals
- Automatic rescheduling after execution via scheduler.BuildPlan
- Queue depth tracking for backpressure-aware interval adjustments
- Upsert semantics for updating scheduled tasks without duplicates

Task 6 of 10 complete (60%). Ready for error/backoff policies.
2025-10-20 15:13:37 +00:00
rcourtman
c554380cb5 feat: verify adaptive interval logic implementation (Phase 2 Task 5)
Confirms adaptive scheduling logic is fully operational:
- EMA smoothing (alpha=0.6) to prevent interval oscillations
- Staleness-based interpolation between min/max intervals
- Error penalty (0.6x per error) for faster recovery detection
- Queue depth stretch (0.1x per task) for backpressure handling
- ±5% jitter to prevent thundering herd effects
- Per-instance state tracking for smooth transitions

Task 5 of 10 complete. Scheduler foundation ready for queue-based execution.
2025-10-20 15:13:37 +00:00
rcourtman
c7d1abf874 feat: implement staleness tracker for adaptive polling (Phase 2 Task 4)
Adds freshness metadata tracking for all monitored instances:
- StalenessTracker with per-instance last success/error/mutation timestamps
- Change hash detection using SHA1 for detecting data mutations
- Normalized staleness scoring (0-1 scale) based on age vs maxStale
- Integration with PollMetrics for authoritative last-success data
- Wired into all poll functions (PVE/PBS/PMG) via UpdateSuccess/UpdateError
- Connected to scheduler as StalenessSource implementation

Task 4 of 10 complete. Ready for adaptive interval logic.
2025-10-20 15:13:37 +00:00
rcourtman
57429900a6 feat: add adaptive polling scheduler infrastructure (Phase 2 Tasks 1-3)
Implements adaptive scheduling foundation for Phase 2:
- Poll cycle metrics: duration, staleness, queue depth, in-flight counters
- Adaptive scheduler with pluggable staleness/interval/enqueue interfaces
- Config support: ADAPTIVE_POLLING_ENABLED flag + min/max/base intervals
- Feature flag defaults to disabled for safe rollout
- Scheduler wiring into Monitor with conditional instantiation

Tasks 1-3 of 10 complete. Ready for staleness tracker implementation.
2025-10-20 15:13:37 +00:00
rcourtman
524f42cc28 security: complete Phase 1 sensor proxy hardening
Implements comprehensive security hardening for pulse-sensor-proxy:
- Privilege drop from root to unprivileged user (UID 995)
- Hash-chained tamper-evident audit logging with remote forwarding
- Per-UID rate limiting (0.2 QPS, burst 2) with concurrency caps
- Enhanced command validation with 10+ attack pattern tests
- Fuzz testing (7M+ executions, 0 crashes)
- SSH hardening, AppArmor/seccomp profiles, operational runbooks

All 27 Phase 1 tasks complete. Ready for production deployment.
2025-10-20 15:13:37 +00:00
Pulse Automation Bot
cfdfe896be Adjust backup and snapshot alert handling 2025-10-18 20:11:01 +00:00
Pulse Automation Bot
80b9d0602a Add Apprise notification integration (#570) 2025-10-18 16:39:39 +00:00
Pulse Automation Bot
0b4e4f9c59 Add configurable backup polling interval 2025-10-18 13:06:41 +00:00
Richard Courtman
97b9c6739c feat: add min/max temperature tracking for nodes
Track minimum and maximum CPU temperatures since monitoring started.
This provides better insight into temperature trends and cooling
adequacy over time.

Changes:
- Backend: Add CPUMin, CPUMaxRecord, MinRecorded, MaxRecorded fields
  to Temperature model
- Backend: Implement min/max tracking logic in monitoring cycle that
  preserves values across polling cycles
- Backend: Initialize min/max on first reading, update on extremes
- Frontend: Update Temperature TypeScript interface with new fields
- Frontend: Display min/max range in NodeCard tooltip (e.g., "52°C
  (48-67°C since monitoring started)")
- Frontend: Rebuild dist assets

Temperature display now shows:
- Current temperature with color coding (green/yellow/red)
- Tooltip with full min-max range and context
- Min/max tracked in-memory (resets on Pulse restart)

Example tooltip: "CPU: 52°C (48-67°C since monitoring started)"

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-18 08:15:10 +00:00
Richard Courtman
de3bb47930 fix: improve turnkey temperature monitoring for standalone nodes
- Fix script input handling to work with standard curl | bash pattern by prioritizing /dev/tty
- Add Raspberry Pi temperature sensor support (cpu_thermal chip and generic temp sensors)
- Add comprehensive documentation for turnkey standalone node setup
- Fix printf formatting error in setup script
2025-10-18 06:51:56 +00:00
Richard Courtman
669d7dc05c feat: add turnkey temperature monitoring for standalone nodes
Implements automatic temperature monitoring setup for standalone
Proxmox/Pimox nodes without manual SSH key configuration.

Changes:
- Add /api/system/proxy-public-key endpoint to expose proxy's SSH public key
- Setup script now detects standalone nodes (non-cluster)
- Auto-fetches and installs proxy SSH key with forced commands
- Add Raspberry Pi temperature support via cpu_thermal and /sys/class/thermal
- Enhance setup script with better error handling for lm-sensors installation
- Add RPi detection to skip lm-sensors and use native thermal interface

Security:
- Public key endpoint is safe (public keys are meant to be public)
- All installed keys use forced command="sensors -j" with full restrictions
- No shell access, port forwarding, or other SSH features enabled
2025-10-17 22:15:50 +00:00
rcourtman
123e0f04ca feat: add comprehensive node cleanup system
Implements automated cleanup workflow when nodes are deleted from Pulse, removing all monitoring footprint from the host. Changes include a new RPC handler in the sensor proxy for cleanup requests, enhanced node deletion modal with detailed cleanup explanations, and improved SSH key management with proper tagging for atomic updates.
2025-10-17 18:53:45 +00:00
rcourtman
f141f7db33 feat: enhance sensor proxy with improved cluster discovery and SSH management
Improvements to pulse-sensor-proxy:
- Fix cluster discovery to use pvecm status for IP addresses instead of node names
- Add standalone node support for non-clustered Proxmox hosts
- Enhanced SSH key push with detailed logging, success/failure tracking, and error reporting
- Add --pulse-server flag to installer for custom Pulse URLs
- Configure www-data group membership for Proxmox IPC access

UI and API cleanup:
- Remove unused "Ensure cluster keys" button from Settings
- Remove /api/diagnostics/temperature-proxy/ensure-cluster-keys endpoint
- Remove EnsureClusterKeys method from tempproxy client

The setup script already handles SSH key distribution during initial configuration,
making the manual refresh button redundant.
2025-10-17 11:43:26 +00:00
rcourtman
6fdef61710 Expand monitoring and discovery test coverage 2025-10-16 08:17:08 +00:00
rcourtman
3a4fc044ea Add guest agent caching and update doc hints (refs #560) 2025-10-16 08:15:49 +00:00
rcourtman
958d6218c2 test: cover docker command lifecycle and server info 2025-10-15 19:47:51 +00:00
rcourtman
91fecacfef feat: add docker agent command handling 2025-10-15 19:27:19 +00:00
rcourtman
aaae27dc11 Log memory source transitions for diagnostics (#553) 2025-10-15 19:19:11 +00:00
rcourtman
32421b36b8 Refs #533: add total-minus-used memory fallback 2025-10-15 18:19:54 +00:00
rcourtman
881b7f9a54 Fix false ZFS log/cache warnings 2025-10-14 20:57:43 +00:00
rcourtman
7e5fa9a147 fix: restore cache-aware node memory on PVE 8.4 2025-10-14 16:40:45 +00:00
rcourtman
78889ffedc Ignore read-only guest filesystems in disk aggregation 2025-10-14 16:13:53 +00:00
rcourtman
156fd34c50 Update Proxmox guest agent permissions docs and tooling (refs #548) 2025-10-14 10:21:52 +00:00
rcourtman
5c79d2516d feat: streamline docker agent onboarding 2025-10-14 09:45:32 +00:00
rcourtman
dd9bd65a2e fix: Add hasCPU/hasNVMe flags to prevent false 'no CPU sensor' errors
Addresses #101

v4.23.0 introduced a regression where systems with only NVMe temperatures
(no CPU sensor) would display "No CPU sensor" in the UI. This was caused
by the Available flag being set to true when NVMe temps existed, even
without CPU data, triggering the error message in the frontend.

Backend changes:
- Add HasCPU and HasNVMe boolean fields to Temperature model
- Extend CPU sensor detection to support more chip types: zenpower,
  k8temp, acpitz, it87 (case-insensitive matching)
- HasCPU is set based on CPU chip detection (coretemp, k10temp, etc.),
  not value thresholds
- This prevents false negatives when sensors report 0°C during resets
- CPU temperature values now accepted even when 0 (checked with !IsNaN
  instead of > 0)
- extractTempInput returns NaN instead of 0 when no data found
- Available flag means "any temperature data exists" for backward compatibility
- Update mock generator to properly set the new flags
- Add unit tests for NVMe-only and 0°C scenarios to prevent regression
- Removed amd_energy from CPU chip list (power sensor, not temperature)

Frontend changes:
- Add hasCPU and hasNVMe optional fields to Temperature interface
- Update NodeSummaryTable to check hasCPU flag with fallback to available
  for backward compatibility with older API responses
- Update NodeCard temperature display logic with same fallback pattern
- Systems with only NVMe temps now show "-" instead of error message
- Fallback ensures UI works with both old and new API responses

Testing:
- All unit tests pass including NVMe-only and 0°C test cases
- Fix prevents false "no CPU sensor" errors when sensors temporarily report 0°C
- Fix eliminates false "no CPU sensor" errors for NVMe-only systems
2025-10-13 10:17:17 +00:00
rcourtman
e7bc338891 feat: Implement secure temperature proxy for containerized deployments
Addresses #528

Introduces pulse-temp-proxy architecture to eliminate SSH key exposure in containers:

**Architecture:**
- pulse-temp-proxy runs on Proxmox host (outside LXC/Docker)
- SSH keys stored on host filesystem (/var/lib/pulse-temp-proxy/ssh/)
- Pulse communicates via unix socket (bind-mounted into container)
- Proxy handles cluster discovery, key rollout, and temperature fetching

**Components:**
- cmd/pulse-temp-proxy: Standalone Go binary with unix socket RPC server
- internal/tempproxy: Client library for Pulse backend
- scripts/install-temp-proxy.sh: Idempotent installer for existing deployments
- scripts/pulse-temp-proxy.service: Systemd service for proxy

**Integration:**
- Pulse automatically detects and uses proxy when socket exists
- Falls back to direct SSH for native installations
- Installer automatically configures proxy for new LXC deployments
- Existing LXC users can upgrade by running install-temp-proxy.sh

**Security improvements:**
- Container compromise no longer exposes SSH keys
- SSH keys never enter container filesystem
- Maintains forced command restrictions
- Transparent to users - no workflow changes

**Documentation:**
- Updated TEMPERATURE_MONITORING.md with new architecture
- Added verification steps and upgrade instructions
- Preserved legacy documentation for native installs
2025-10-12 21:35:35 +00:00
rcourtman
c8e3c93516 fix: Add security gates for containerized temperature monitoring
Addresses #528

- Added opt-in confirmation prompt to setup script with security notice
- Added runtime warning when containerized Pulse uses SSH temperature monitoring
- Documented security considerations and hardening recommendations
- Users must explicitly confirm understanding before enabling in containers
2025-10-12 21:01:25 +00:00
rcourtman
c18cf3d4b8 Fix node config API to preserve fields on partial updates
The PUT /api/config/nodes/{id} endpoint was corrupting node configurations
when making partial updates (e.g., updating just monitorPhysicalDisks):

- Authentication fields (tokenName, tokenValue, password) were being cleared
  when updating unrelated settings
- Name field was being blanked when not included in request
- Monitor* boolean fields were defaulting to false

Changes:
- Only update name field if explicitly provided in request
- Only switch authentication method when auth fields are explicitly provided
- Preserve existing auth credentials on non-auth updates
- Applied fix to all node types (PVE, PBS, PMG)

Also enables physical disk monitoring by default (opt-out instead of opt-in)
and preserves disk data between polling intervals.
2025-10-12 17:50:55 +00:00
rcourtman
18a88cb4cc Improve NVMe temperature handling 2025-10-12 16:06:55 +00:00
rcourtman
2163d6f5a8 Use guest meminfo available for VM memory usage 2025-10-12 11:03:56 +00:00