Pulse

vrr/Pulse

mirror of https://github.com/rcourtman/Pulse.git synced 2026-04-28 11:30:15 +00:00

Author	SHA1	Message	Date
rcourtman	7ae393c8ec	Refine Proxmox node memory fallback (#582)	2025-10-22 15:36:26 +00:00
rcourtman	c9543e8a7e	Add qemu guest agent version metadata	2025-10-22 15:24:07 +00:00
rcourtman	77108abc65	Propagate config updates to settings nodes (#588)	2025-10-22 13:45:13 +00:00
rcourtman	be26f957c0	Add snapshot size alert thresholds (#585)	2025-10-22 13:30:40 +00:00
rcourtman	30879c3b7b	Handle AMD Tctl temperature readings (refs #586)	2025-10-22 12:58:34 +00:00
rcourtman	f83caf8933	Add collision-safe Docker host identifiers (#590)	2025-10-22 12:30:25 +00:00
rcourtman	bc479643e4	release: prepare v4.25.0	2025-10-22 10:46:18 +00:00
rcourtman	4eb8bed9b5	Fix initial setup caching and container discovery defaults	2025-10-22 07:34:32 +00:00
rcourtman	2786afdff0	feat: comprehensive diagnostics and observability improvements Upgrade diagnostics infrastructure from 5/10 to 8/10 production readiness with enhanced metrics, logging, and request correlation capabilities. Request Correlation - Wire request IDs through context in middleware - Return X-Request-ID header in all API responses - Enable downstream log correlation across request lifecycle HTTP/API Metrics (18 new Prometheus metrics) - pulse_http_request_duration_seconds - API latency histogram - pulse_http_requests_total - request counter by method/route/status - pulse_http_request_errors_total - error counter by type - Path normalization to control label cardinality Per-Node Poll Metrics - pulse_monitor_node_poll_duration_seconds - per-node timing - pulse_monitor_node_poll_total - success/error counts per node - pulse_monitor_node_poll_errors_total - error breakdown per node - pulse_monitor_node_poll_last_success_timestamp - freshness tracking - pulse_monitor_node_poll_staleness_seconds - age since last success - Enables multi-node hotspot identification Scheduler Health Metrics - pulse_scheduler_queue_due_soon - ready queue depth - pulse_scheduler_queue_depth - by instance type - pulse_scheduler_queue_wait_seconds - time in queue histogram - pulse_scheduler_dead_letter_depth - failed task tracking - pulse_scheduler_breaker_state - circuit breaker state - pulse_scheduler_breaker_failure_count - consecutive failures - pulse_scheduler_breaker_retry_seconds - time until retry - Enable alerting on DLQ spikes, breaker opens, queue backlogs Diagnostics Endpoint Caching - pulse_diagnostics_cache_hits_total - cache performance - pulse_diagnostics_cache_misses_total - cache misses - pulse_diagnostics_refresh_duration_seconds - probe timing - 45-second TTL prevents thundering herd on /api/diagnostics - Thread-safe with RWMutex - X-Diagnostics-Cached-At header shows cache freshness Debug Log Performance - Gate high-frequency debug logs behind IsLevelEnabled() checks - Reduces CPU waste in production when debug logging is disabled - Covers scheduler loops, poll cycles, API handlers Persistent Logging - File logging with automatic rotation - LOG_FILE, LOG_MAX_SIZE, LOG_MAX_AGE, LOG_COMPRESS env vars - MultiWriter sends logs to both stderr and file - Gzip compression support for rotated logs Files modified: - internal/api/diagnostics.go (caching layer) - internal/api/middleware.go (request IDs, HTTP metrics) - internal/api/http_metrics.go (NEW - HTTP metric definitions) - internal/logging/logging.go (file logging with rotation) - internal/monitoring/metrics.go (node + scheduler metrics) - internal/monitoring/monitor.go (instrumentation, debug gating) Impact: Dramatically improved production troubleshooting with per-node visibility, scheduler health metrics, persistent logs, and cached diagnostics. Fast incident response now possible for multi-node deployments.	2025-10-21 12:37:39 +00:00
rcourtman	5ebb32ce10	feat: enhance runtime configuration and system settings management Improves configuration handling and system settings APIs to support v4.24.0 features including runtime logging controls, adaptive polling configuration, and enhanced config export/persistence. Changes: - Add config override system for discovery service - Enhance system settings API with runtime logging controls - Improve config persistence and export functionality - Update security setup handling - Refine monitoring and discovery service integration These changes provide the backend support for the configuration features documented in the v4.24.0 release.	2025-10-20 17:41:19 +00:00
rcourtman	73fb9d986f	feat: add PBS/PMG stubs to test harness and implement HTTP config fetch Resolves two remaining TODOs from codebase audit. ## 1. PBS/PMG Test Harness Stubs Location: internal/monitoring/harness_integration.go:149-151 Changes: - Added PBS client stub registration: `monitor.pbsClients[inst.Name] = &pbs.Client{}` - Added PMG client stub registration: `monitor.pmgClients[inst.Name] = &pmg.Client{}` - Added imports for pkg/pbs and pkg/pmg Purpose: Enables integration test scenarios to include PBS and PMG instance types alongside existing PVE support. Stubs allow the scheduler to register and execute tasks for these instance types during integration testing. Testing: ✅ TestAdaptiveSchedulerIntegration passes (55.5s) ✅ Integration test harness now supports all three instance types ## 2. HTTP Config URL Fetch Location: cmd/pulse/config.go:226-261 Problem: `PULSE_INIT_CONFIG_URL` was recognized but not implemented, returning a "URL import not yet implemented" error. Implementation: - URL validation (http/https schemes only) - HTTP client with 15 second timeout - Status code validation (2xx required) - Empty response detection - Base64 decoding with fallback to raw data - Matches existing env-var behavior for `PULSE_INIT_CONFIG_DATA` Security: - Both HTTP and HTTPS supported (HTTPS recommended for production) - URL scheme validation prevents file:// or other protocols - Timeout prevents hanging on unresponsive servers Usage: `export PULSE_INIT_CONFIG_URL="https://config-server/encrypted-config"; export PULSE_INIT_CONFIG_PASSPHRASE="secret"; pulse config auto-import` Testing: ✅ Code compiles cleanly ✅ Follows same pattern as existing PULSE_INIT_CONFIG_DATA handling ## Impact - Completes integration test infrastructure for all instance types - Enables automated config distribution via HTTP(S) for container deployments - Removes last TODOs from codebase (no TODO/FIXME remaining in Go files)	2025-10-20 16:05:45 +00:00
rcourtman	c1bf03fe39	fix: use proper Monitor constructor in PMG tests to initialize all maps Fixes panic: assignment to entry in nil map in PMG polling tests. Problem: Tests were manually creating Monitor structs without initializing internal maps like pollStatusMap, causing nil map panics when recordTaskResult() tried to update task status. Root Cause: - TestPollPMGInstancePopulatesState (line 90) - TestPollPMGInstanceRecordsAuthFailures (line 189) Both created Monitor with only partial field initialization, missing: - pollStatusMap - dlqInsightMap - instanceInfoCache - Other internal state maps Solution: Changed both tests to use New() constructor which properly initializes all maps and internal state (monitor.go:1541). This ensures tests match production initialization and will automatically pick up any future map additions. Tests: ✅ TestPollPMGInstancePopulatesState - now passes ✅ TestPollPMGInstanceRecordsAuthFailures - now passes ✅ All monitoring tests pass (0.125s) Follows best practice: use constructors instead of manual struct creation to maintain initialization invariants.	2025-10-20 15:22:23 +00:00
rcourtman	9b1709a05b	feat: enhance scheduler health API with rich instance metadata Add comprehensive instance-level diagnostics to /api/monitoring/scheduler/health New Response Structure: Enhanced "instances" array with per-instance details: - Instance metadata: displayName, type, connection URL - Poll status: last success/error timestamps, error messages, error category - Circuit breaker: state, timestamps, failure counts, retry windows - Dead letter: present flag, reason, attempt history, retry schedule Implementation: Data structures: - instanceInfo: cache of display names, URLs, types - pollStatus: tracks successes/errors with timestamps and categories - dlqInsight: DLQ entry metadata (reason, attempts, schedule) - circuitBreaker: enhanced with stateSince, lastTransition Tracking logic: - buildInstanceInfoCache: populate metadata from config on startup - recordTaskResult: track poll outcomes, error details, categories - sendToDeadLetter: capture DLQ insights (reason, timestamps) - circuitBreaker: record state transitions with timestamps Backward Compatible: - Existing fields (deadLetter, breakers, staleness) unchanged - New "instances" array is additive - Old clients can ignore new fields Testing: - Unit test: TestSchedulerHealth_EnhancedResponse validates all fields - Integration tests: still passing (55s) - All error tracking and breaker history verified Operator Benefits: - Diagnose issues without log digging - See error messages directly in API - Understand breaker states and retry schedules - Track DLQ entries with full context - Single API call for complete instance health view Example: Quickly identify a "401 unauthorized" on a specific PBS instance, see that it is in the DLQ after 5 retries, and know when the next retry is scheduled. Part of Phase 2 follow-up work to improve observability.	2025-10-20 15:13:38 +00:00
rcourtman	14d06a1654	test: add soak test with runtime instrumentation (Phase 2 Task 9d) Add comprehensive soak testing capabilities: Runtime Instrumentation: - Periodic sampling of heap, stack, goroutines, GC count - Sample every 10s during harness runs - HarnessReport includes full RuntimeSamples history - Detect memory leaks (>10% sustained growth) - Detect goroutine leaks (>20 leaked goroutines) Soak Test: - TestAdaptiveSchedulerSoak with 15min+ duration - Skip unless -soak flag or HARNESS_SOAK_MINUTES set - 80 synthetic instances (60 healthy, 15 transient, 5 permanent) - Configurable duration via env var - Validates: heap growth <10%, goroutines stable, queue depth bounded - Staleness threshold: 45s for long-running tests Wrapper Script: - testing-tools/run_adaptive_soak.sh for easy execution - Accepts duration in minutes: ./run_adaptive_soak.sh 30 - Logs to tmp/adaptive_soak_<timestamp>.log - Sets proper timeout (duration + 5min buffer) Test Results (2-minute validation): - 80 instances, 17 samples - Heap: 2.3MB → 3.1MB (healthy) - Goroutines: 16 → 6 (no leak, actually decreased) - Circuit breakers: correctly blocking transient failures Run with: go test -tags=integration ./internal/monitoring -run TestAdaptiveSchedulerSoak -soak -timeout 20m Part of Phase 2 Task 9 (Integration/Soak Testing)	2025-10-20 15:13:38 +00:00
rcourtman	2636ba9137	test: add comprehensive integration test harness for adaptive polling (Phase 2 Task 9c) Add PollExecutor seam and integration test infrastructure: PollExecutor Interface: - Add pluggable executor interface for testability - Implement realExecutor wrapping existing poll functions - Add SetExecutor() for test injection - Zero impact on production behavior Integration Test Harness: - Build-tagged integration tests (go:build integration) - Synthetic workload generator with configurable scenarios - Fake executor simulating latencies, failures, recovery - Runtime metrics collection (queue depth, staleness, goroutines) Comprehensive Assertions: - Queue depth bounds: stays within 1.5× instance count - Staleness: healthy instances <20s, multiple poll cycles - Circuit breakers: transient failures recover, permanent stay blocked - Dead-letter queue: only permanent failures routed - Scheduler health: snapshot consistency validation Test Scenarios: - 10 healthy PVE instances (rapid polling) - 1 transient failure instance (fail → recover) - 1 permanent failure instance (DLQ routing) - 55s test duration with 3s base intervals - Validates full adaptive scheduler lifecycle Runs with: go test -tags=integration ./internal/monitoring -run TestAdaptiveSchedulerIntegration Part of Phase 2 Task 9 (Integration/Soak Testing)	2025-10-20 15:13:38 +00:00
rcourtman	7d422d2909	feat: add professional logging with runtime configuration and performance optimization Implements structured logging package with LOG_LEVEL/LOG_FORMAT env support, debug level guards for hot paths, enriched error messages with actionable context, and stack trace capture for production debugging. Improves observability and reduces log overhead in high-frequency polling loops.	2025-10-20 15:13:38 +00:00
rcourtman	25b797f18d	test: add comprehensive staleness tracker unit tests (Phase 2 Task 9b) Added 17 test cases covering: - UpdateSuccess/UpdateError state management - Staleness scoring (fresh, stale, max-stale, never-succeeded) - Score normalization and capping (0.0 to 1.0 range) - SetBounds behavior and defaults - Snapshot merging logic - Snapshot() API for full state export - Nil safety and concurrent access All tests verify correct freshness calculation based on lastSuccess timestamps and configurable maxStale bounds. Phase 2 testing status: - ✅ Backoff exponential growth and jitter (13 tests) - ✅ Circuit breaker state machine (10 tests) - ✅ Staleness tracker scoring (17 tests) - Total: 40+ unit tests covering core scheduling logic	2025-10-20 15:13:38 +00:00
rcourtman	24ae6d8d78	test: add comprehensive unit tests for backoff and circuit breaker (Phase 2 Task 9a) Added 30+ test cases covering: Backoff tests (backoff_test.go): - Exponential growth with multiplier - Jitter distribution and bounds - Max delay capping - Edge cases (negative attempts, zero config values) - Realistic production scenarios Circuit breaker tests (circuit_breaker_test.go): - State transitions: closed → open → half-open → closed - Retry interval backoff with bit-shifting (5s << failureCount) - Half-open window behavior - Concurrent access safety - Default parameter validation All tests pass with proper handling of time-based state transitions and exponential backoff mechanics (bit-shift based retry intervals).	2025-10-20 15:13:38 +00:00
rcourtman	160adeb3b8	feat: add scheduler health API endpoint (Phase 2 Task 8) Task 8 of 10 complete. Exposes read-only scheduler health data including: - Queue depth and distribution by instance type - Dead-letter queue inspection (top 25 tasks with error details) - Circuit breaker states (instance-level) - Staleness scores per instance New API endpoint: GET /api/monitoring/scheduler/health (requires authentication) New snapshot methods: - StalenessTracker.Snapshot() - exports all staleness data - TaskQueue.Snapshot() - queue depth & per-type distribution - TaskQueue.PeekAll() - dead-letter task inspection - circuitBreaker.State() - exports state, failures, retryAt - Monitor.SchedulerHealth() - aggregates all health data Documentation updated with API spec, field descriptions, and usage examples.	2025-10-20 15:13:38 +00:00
rcourtman	b1f445b33d	feat: implement error handling with circuit breakers and backoff (Phase 2 Task 7) Adds comprehensive error resilience: - Circuit breaker with closed/open/half-open states (3 failures = trip) - Exponential backoff with jitter (2s initial, 2x multiplier, 5min max) - Dead-letter queue for tasks exceeding 5 retry attempts - Error classification (transient vs permanent) using internal/errors helpers - Per-instance failure tracking and breaker state management - Integration with staleness tracker for outcome recording Task 7 of 10 complete (70%). Ready for API surfaces and testing.	2025-10-20 15:13:37 +00:00
rcourtman	aa5c08ad4a	feat: implement priority queue-based task execution (Phase 2 Task 6) Replaces immediate polling with queue-based scheduling: - TaskQueue with min-heap (container/heap) for NextRun-ordered execution - Worker goroutines that block on WaitNext() until tasks are due - Tasks only execute when NextRun <= now, respecting adaptive intervals - Automatic rescheduling after execution via scheduler.BuildPlan - Queue depth tracking for backpressure-aware interval adjustments - Upsert semantics for updating scheduled tasks without duplicates Task 6 of 10 complete (60%). Ready for error/backoff policies.	2025-10-20 15:13:37 +00:00
rcourtman	c554380cb5	feat: verify adaptive interval logic implementation (Phase 2 Task 5) Confirms adaptive scheduling logic is fully operational: - EMA smoothing (alpha=0.6) to prevent interval oscillations - Staleness-based interpolation between min/max intervals - Error penalty (0.6x per error) for faster recovery detection - Queue depth stretch (0.1x per task) for backpressure handling - ±5% jitter to prevent thundering herd effects - Per-instance state tracking for smooth transitions Task 5 of 10 complete. Scheduler foundation ready for queue-based execution.	2025-10-20 15:13:37 +00:00
rcourtman	c7d1abf874	feat: implement staleness tracker for adaptive polling (Phase 2 Task 4) Adds freshness metadata tracking for all monitored instances: - StalenessTracker with per-instance last success/error/mutation timestamps - Change hash detection using SHA1 for detecting data mutations - Normalized staleness scoring (0-1 scale) based on age vs maxStale - Integration with PollMetrics for authoritative last-success data - Wired into all poll functions (PVE/PBS/PMG) via UpdateSuccess/UpdateError - Connected to scheduler as StalenessSource implementation Task 4 of 10 complete. Ready for adaptive interval logic.	2025-10-20 15:13:37 +00:00
rcourtman	57429900a6	feat: add adaptive polling scheduler infrastructure (Phase 2 Tasks 1-3) Implements adaptive scheduling foundation for Phase 2: - Poll cycle metrics: duration, staleness, queue depth, in-flight counters - Adaptive scheduler with pluggable staleness/interval/enqueue interfaces - Config support: ADAPTIVE_POLLING_ENABLED flag + min/max/base intervals - Feature flag defaults to disabled for safe rollout - Scheduler wiring into Monitor with conditional instantiation Tasks 1-3 of 10 complete. Ready for staleness tracker implementation.	2025-10-20 15:13:37 +00:00
rcourtman	524f42cc28	security: complete Phase 1 sensor proxy hardening Implements comprehensive security hardening for pulse-sensor-proxy: - Privilege drop from root to unprivileged user (UID 995) - Hash-chained tamper-evident audit logging with remote forwarding - Per-UID rate limiting (0.2 QPS, burst 2) with concurrency caps - Enhanced command validation with 10+ attack pattern tests - Fuzz testing (7M+ executions, 0 crashes) - SSH hardening, AppArmor/seccomp profiles, operational runbooks All 27 Phase 1 tasks complete. Ready for production deployment.	2025-10-20 15:13:37 +00:00
Pulse Automation Bot	cfdfe896be	Adjust backup and snapshot alert handling	2025-10-18 20:11:01 +00:00
Pulse Automation Bot	80b9d0602a	Add Apprise notification integration (#570)	2025-10-18 16:39:39 +00:00
Pulse Automation Bot	0b4e4f9c59	Add configurable backup polling interval	2025-10-18 13:06:41 +00:00
Richard Courtman	97b9c6739c	feat: add min/max temperature tracking for nodes Track minimum and maximum CPU temperatures since monitoring started. This provides better insight into temperature trends and cooling adequacy over time. Changes: - Backend: Add CPUMin, CPUMaxRecord, MinRecorded, MaxRecorded fields to Temperature model - Backend: Implement min/max tracking logic in monitoring cycle that preserves values across polling cycles - Backend: Initialize min/max on first reading, update on extremes - Frontend: Update Temperature TypeScript interface with new fields - Frontend: Display min/max range in NodeCard tooltip (e.g., "52°C (48-67°C since monitoring started)") - Frontend: Rebuild dist assets Temperature display now shows: - Current temperature with color coding (green/yellow/red) - Tooltip with full min-max range and context - Min/max tracked in-memory (resets on Pulse restart) Example tooltip: "CPU: 52°C (48-67°C since monitoring started)" 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-10-18 08:15:10 +00:00
Richard Courtman	de3bb47930	fix: improve turnkey temperature monitoring for standalone nodes - Fix script input handling to work with standard curl | bash pattern by prioritizing /dev/tty - Add Raspberry Pi temperature sensor support (cpu_thermal chip and generic temp sensors) - Add comprehensive documentation for turnkey standalone node setup - Fix printf formatting error in setup script	2025-10-18 06:51:56 +00:00
Richard Courtman	669d7dc05c	feat: add turnkey temperature monitoring for standalone nodes Implements automatic temperature monitoring setup for standalone Proxmox/Pimox nodes without manual SSH key configuration. Changes: - Add /api/system/proxy-public-key endpoint to expose proxy's SSH public key - Setup script now detects standalone nodes (non-cluster) - Auto-fetches and installs proxy SSH key with forced commands - Add Raspberry Pi temperature support via cpu_thermal and /sys/class/thermal - Enhance setup script with better error handling for lm-sensors installation - Add RPi detection to skip lm-sensors and use native thermal interface Security: - Public key endpoint is safe (public keys are meant to be public) - All installed keys use forced command="sensors -j" with full restrictions - No shell access, port forwarding, or other SSH features enabled	2025-10-17 22:15:50 +00:00
rcourtman	123e0f04ca	feat: add comprehensive node cleanup system Implements automated cleanup workflow when nodes are deleted from Pulse, removing all monitoring footprint from the host. Changes include a new RPC handler in the sensor proxy for cleanup requests, enhanced node deletion modal with detailed cleanup explanations, and improved SSH key management with proper tagging for atomic updates.	2025-10-17 18:53:45 +00:00
rcourtman	f141f7db33	feat: enhance sensor proxy with improved cluster discovery and SSH management Improvements to pulse-sensor-proxy: - Fix cluster discovery to use pvecm status for IP addresses instead of node names - Add standalone node support for non-clustered Proxmox hosts - Enhanced SSH key push with detailed logging, success/failure tracking, and error reporting - Add --pulse-server flag to installer for custom Pulse URLs - Configure www-data group membership for Proxmox IPC access UI and API cleanup: - Remove unused "Ensure cluster keys" button from Settings - Remove /api/diagnostics/temperature-proxy/ensure-cluster-keys endpoint - Remove EnsureClusterKeys method from tempproxy client The setup script already handles SSH key distribution during initial configuration, making the manual refresh button redundant.	2025-10-17 11:43:26 +00:00
rcourtman	6fdef61710	Expand monitoring and discovery test coverage	2025-10-16 08:17:08 +00:00
rcourtman	3a4fc044ea	Add guest agent caching and update doc hints (refs #560)	2025-10-16 08:15:49 +00:00
rcourtman	958d6218c2	test: cover docker command lifecycle and server info	2025-10-15 19:47:51 +00:00
rcourtman	91fecacfef	feat: add docker agent command handling	2025-10-15 19:27:19 +00:00
rcourtman	aaae27dc11	Log memory source transitions for diagnostics (#553)	2025-10-15 19:19:11 +00:00
rcourtman	32421b36b8	Refs #533: add total-minus-used memory fallback	2025-10-15 18:19:54 +00:00
rcourtman	881b7f9a54	Fix false ZFS log/cache warnings	2025-10-14 20:57:43 +00:00
rcourtman	7e5fa9a147	fix: restore cache-aware node memory on PVE 8.4	2025-10-14 16:40:45 +00:00
rcourtman	78889ffedc	Ignore read-only guest filesystems in disk aggregation	2025-10-14 16:13:53 +00:00
rcourtman	156fd34c50	Update Proxmox guest agent permissions docs and tooling (refs #548)	2025-10-14 10:21:52 +00:00
rcourtman	5c79d2516d	feat: streamline docker agent onboarding	2025-10-14 09:45:32 +00:00
rcourtman	dd9bd65a2e	fix: Add hasCPU/hasNVMe flags to prevent false 'no CPU sensor' errors Addresses #101 v4.23.0 introduced a regression where systems with only NVMe temperatures (no CPU sensor) would display "No CPU sensor" in the UI. This was caused by the Available flag being set to true when NVMe temps existed, even without CPU data, triggering the error message in the frontend. Backend changes: - Add HasCPU and HasNVMe boolean fields to Temperature model - Extend CPU sensor detection to support more chip types: zenpower, k8temp, acpitz, it87 (case-insensitive matching) - HasCPU is set based on CPU chip detection (coretemp, k10temp, etc.), not value thresholds - This prevents false negatives when sensors report 0°C during resets - CPU temperature values now accepted even when 0 (checked with !IsNaN instead of > 0) - extractTempInput returns NaN instead of 0 when no data found - Available flag means "any temperature data exists" for backward compatibility - Update mock generator to properly set the new flags - Add unit tests for NVMe-only and 0°C scenarios to prevent regression - Removed amd_energy from CPU chip list (power sensor, not temperature) Frontend changes: - Add hasCPU and hasNVMe optional fields to Temperature interface - Update NodeSummaryTable to check hasCPU flag with fallback to available for backward compatibility with older API responses - Update NodeCard temperature display logic with same fallback pattern - Systems with only NVMe temps now show "-" instead of error message - Fallback ensures UI works with both old and new API responses Testing: - All unit tests pass including NVMe-only and 0°C test cases - Fix prevents false "no CPU sensor" errors when sensors temporarily report 0°C - Fix eliminates false "no CPU sensor" errors for NVMe-only systems	2025-10-13 10:17:17 +00:00
rcourtman	e7bc338891	feat: Implement secure temperature proxy for containerized deployments Addresses #528 Introduces pulse-temp-proxy architecture to eliminate SSH key exposure in containers: Architecture: - pulse-temp-proxy runs on Proxmox host (outside LXC/Docker) - SSH keys stored on host filesystem (/var/lib/pulse-temp-proxy/ssh/) - Pulse communicates via unix socket (bind-mounted into container) - Proxy handles cluster discovery, key rollout, and temperature fetching Components: - cmd/pulse-temp-proxy: Standalone Go binary with unix socket RPC server - internal/tempproxy: Client library for Pulse backend - scripts/install-temp-proxy.sh: Idempotent installer for existing deployments - scripts/pulse-temp-proxy.service: Systemd service for proxy Integration: - Pulse automatically detects and uses proxy when socket exists - Falls back to direct SSH for native installations - Installer automatically configures proxy for new LXC deployments - Existing LXC users can upgrade by running install-temp-proxy.sh Security improvements: - Container compromise no longer exposes SSH keys - SSH keys never enter container filesystem - Maintains forced command restrictions - Transparent to users - no workflow changes Documentation: - Updated TEMPERATURE_MONITORING.md with new architecture - Added verification steps and upgrade instructions - Preserved legacy documentation for native installs	2025-10-12 21:35:35 +00:00
rcourtman	c8e3c93516	fix: Add security gates for containerized temperature monitoring Addresses #528 - Added opt-in confirmation prompt to setup script with security notice - Added runtime warning when containerized Pulse uses SSH temperature monitoring - Documented security considerations and hardening recommendations - Users must explicitly confirm understanding before enabling in containers	2025-10-12 21:01:25 +00:00
rcourtman	c18cf3d4b8	Fix node config API to preserve fields on partial updates The PUT /api/config/nodes/{id} endpoint was corrupting node configurations when making partial updates (e.g., updating just monitorPhysicalDisks): - Authentication fields (tokenName, tokenValue, password) were being cleared when updating unrelated settings - Name field was being blanked when not included in request - Monitor* boolean fields were defaulting to false Changes: - Only update name field if explicitly provided in request - Only switch authentication method when auth fields are explicitly provided - Preserve existing auth credentials on non-auth updates - Applied fix to all node types (PVE, PBS, PMG) Also enables physical disk monitoring by default (opt-out instead of opt-in) and preserves disk data between polling intervals.	2025-10-12 17:50:55 +00:00
rcourtman	18a88cb4cc	Improve NVMe temperature handling	2025-10-12 16:06:55 +00:00
rcourtman	2163d6f5a8	Use guest meminfo available for VM memory usage	2025-10-12 11:03:56 +00:00


53 commits
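The same commit (524f42cc28) describes hash-chained, tamper-evident audit logging. The idea is that each record stores the hash of its predecessor, so editing any past record breaks the chain from that point on. A minimal SHA-256 sketch follows; the entry layout and the newline separator are assumptions, not the proxy's real log format.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

type entry struct {
	Msg      string
	PrevHash string
	Hash     string
}

// hashOf links a message to the previous entry's hash.
func hashOf(prev, msg string) string {
	sum := sha256.Sum256([]byte(prev + "\n" + msg))
	return hex.EncodeToString(sum[:])
}

// appendEntry adds a record whose hash covers the previous record's hash.
func appendEntry(chain []entry, msg string) []entry {
	prev := ""
	if len(chain) > 0 {
		prev = chain[len(chain)-1].Hash
	}
	return append(chain, entry{Msg: msg, PrevHash: prev, Hash: hashOf(prev, msg)})
}

// verify recomputes the chain and returns the index of the first bad entry, or -1.
func verify(chain []entry) int {
	prev := ""
	for i, e := range chain {
		if e.PrevHash != prev || e.Hash != hashOf(prev, e.Msg) {
			return i
		}
		prev = e.Hash
	}
	return -1
}

func main() {
	var chain []entry
	for _, m := range []string{"uid=1000 cmd=sensors", "uid=1000 cmd=cleanup"} {
		chain = appendEntry(chain, m)
	}
	fmt.Println("chain ok:", verify(chain) == -1)
	chain[0].Msg = "uid=0 cmd=sensors" // tamper with history
	fmt.Println("first bad entry:", verify(chain))
}
```

A chain on its own cannot stop someone who rewrites every later hash as well; forwarding entries to a remote collector, as the commit also mentions, is what fixes the history outside the host.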
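Commit 2786afdff0 normalizes request paths so that the route dimension of `pulse_http_requests_total` stays bounded (one label value per endpoint, not per node or resource ID). The commit does not say which segments Pulse treats as dynamic; this sketch assumes numeric IDs, UUIDs, and long hex strings, and `normalizeRoute` is a hypothetical name.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

var (
	numericRe = regexp.MustCompile(`^[0-9]+$`)
	uuidRe    = regexp.MustCompile(`^[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}$`)
	hexRe     = regexp.MustCompile(`^[0-9a-fA-F]{16,}$`)
)

// normalizeRoute strips the query string and collapses ID-like path
// segments to ":id" so a metric label takes a bounded set of values.
func normalizeRoute(path string) string {
	if i := strings.IndexAny(path, "?#"); i >= 0 {
		path = path[:i]
	}
	segs := strings.Split(path, "/")
	for i, s := range segs {
		if numericRe.MatchString(s) || uuidRe.MatchString(s) || hexRe.MatchString(s) {
			segs[i] = ":id"
		}
	}
	return strings.Join(segs, "/")
}

func main() {
	for _, p := range []string{"/api/config/nodes/42", "/api/diagnostics?refresh=1"} {
		fmt.Println(p, "->", normalizeRoute(p))
	}
}
```

In practice, using the router's matched pattern (for example `/api/config/nodes/{id}`) as the label is even safer than pattern-matching segments, when the router exposes it.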