Pulse

vrr/Pulse

mirror of https://github.com/rcourtman/Pulse.git synced 2026-04-28 19:41:17 +00:00

Author	SHA1	Message	Date
rcourtman	7ae393c8ec	Refine Proxmox node memory fallback (#582 )	2025-10-22 15:36:26 +00:00
rcourtman	c9543e8a7e	Add qemu guest agent version metadata	2025-10-22 15:24:07 +00:00
rcourtman	be26f957c0	Add snapshot size alert thresholds (#585 )	2025-10-22 13:30:40 +00:00
rcourtman	f83caf8933	Add collision-safe Docker host identifiers (#590 )	2025-10-22 12:30:25 +00:00
rcourtman	4eb8bed9b5	Fix initial setup caching and container discovery defaults	2025-10-22 07:34:32 +00:00
rcourtman	2786afdff0	feat: comprehensive diagnostics and observability improvements Upgrade diagnostics infrastructure from 5/10 to 8/10 production readiness with enhanced metrics, logging, and request correlation capabilities. Request Correlation - Wire request IDs through context in middleware - Return X-Request-ID header in all API responses - Enable downstream log correlation across request lifecycle HTTP/API Metrics (18 new Prometheus metrics) - pulse_http_request_duration_seconds - API latency histogram - pulse_http_requests_total - request counter by method/route/status - pulse_http_request_errors_total - error counter by type - Path normalization to control label cardinality Per-Node Poll Metrics - pulse_monitor_node_poll_duration_seconds - per-node timing - pulse_monitor_node_poll_total - success/error counts per node - pulse_monitor_node_poll_errors_total - error breakdown per node - pulse_monitor_node_poll_last_success_timestamp - freshness tracking - pulse_monitor_node_poll_staleness_seconds - age since last success - Enables multi-node hotspot identification Scheduler Health Metrics - pulse_scheduler_queue_due_soon - ready queue depth - pulse_scheduler_queue_depth - by instance type - pulse_scheduler_queue_wait_seconds - time in queue histogram - pulse_scheduler_dead_letter_depth - failed task tracking - pulse_scheduler_breaker_state - circuit breaker state - pulse_scheduler_breaker_failure_count - consecutive failures - pulse_scheduler_breaker_retry_seconds - time until retry - Enable alerting on DLQ spikes, breaker opens, queue backlogs Diagnostics Endpoint Caching - pulse_diagnostics_cache_hits_total - cache performance - pulse_diagnostics_cache_misses_total - cache misses - pulse_diagnostics_refresh_duration_seconds - probe timing - 45-second TTL prevents thundering herd on /api/diagnostics - Thread-safe with RWMutex - X-Diagnostics-Cached-At header shows cache freshness Debug Log Performance - Gate high-frequency debug logs behind IsLevelEnabled() checks - Reduces CPU waste in production when debug disabled - Covers scheduler loops, poll cycles, API handlers Persistent Logging - File logging with automatic rotation - LOG_FILE, LOG_MAX_SIZE, LOG_MAX_AGE, LOG_COMPRESS env vars - MultiWriter sends logs to both stderr and file - Gzip compression support for rotated logs Files modified: - internal/api/diagnostics.go (caching layer) - internal/api/middleware.go (request IDs, HTTP metrics) - internal/api/http_metrics.go (NEW - HTTP metric definitions) - internal/logging/logging.go (file logging with rotation) - internal/monitoring/metrics.go (node + scheduler metrics) - internal/monitoring/monitor.go (instrumentation, debug gating) Impact: Dramatically improved production troubleshooting with per-node visibility, scheduler health metrics, persistent logs, and cached diagnostics. Fast incident response now possible for multi-node deployments.	2025-10-21 12:37:39 +00:00
rcourtman	5ebb32ce10	feat: enhance runtime configuration and system settings management Improves configuration handling and system settings APIs to support v4.24.0 features including runtime logging controls, adaptive polling configuration, and enhanced config export/persistence. Changes: - Add config override system for discovery service - Enhance system settings API with runtime logging controls - Improve config persistence and export functionality - Update security setup handling - Refine monitoring and discovery service integration These changes provide the backend support for the configuration features documented in the v4.24.0 release.	2025-10-20 17:41:19 +00:00
rcourtman	9b1709a05b	feat: enhance scheduler health API with rich instance metadata Add comprehensive instance-level diagnostics to /api/monitoring/scheduler/health New Response Structure: Enhanced "instances" array with per-instance details: - Instance metadata: displayName, type, connection URL - Poll status: last success/error timestamps, error messages, error category - Circuit breaker: state, timestamps, failure counts, retry windows - Dead letter: present flag, reason, attempt history, retry schedule Implementation: Data structures: - instanceInfo: cache of display names, URLs, types - pollStatus: tracks successes/errors with timestamps and categories - dlqInsight: DLQ entry metadata (reason, attempts, schedule) - circuitBreaker: enhanced with stateSince, lastTransition Tracking logic: - buildInstanceInfoCache: populate metadata from config on startup - recordTaskResult: track poll outcomes, error details, categories - sendToDeadLetter: capture DLQ insights (reason, timestamps) - circuitBreaker: record state transitions with timestamps Backward Compatible: - Existing fields (deadLetter, breakers, staleness) unchanged - New "instances" array is additive - Old clients can ignore new fields Testing: - Unit test: TestSchedulerHealth_EnhancedResponse validates all fields - Integration tests: still passing (55s) - All error tracking and breaker history verified Operator Benefits: - Diagnose issues without log digging - See error messages directly in API - Understand breaker states and retry schedules - Track DLQ entries with full context - Single API call for complete instance health view Example: Quickly identify "401 unauthorized" on specific PBS instance, see it's in DLQ after 5 retries, and know when next retry scheduled. Part of Phase 2 follow-up work to improve observability.	2025-10-20 15:13:38 +00:00
rcourtman	2636ba9137	test: add comprehensive integration test harness for adaptive polling (Phase 2 Task 9c) Add PollExecutor seam and integration test infrastructure: PollExecutor Interface: - Add pluggable executor interface for testability - Implement realExecutor wrapping existing poll functions - Add SetExecutor() for test injection - Zero impact on production behavior Integration Test Harness: - Build-tagged integration tests (go:build integration) - Synthetic workload generator with configurable scenarios - Fake executor simulating latencies, failures, recovery - Runtime metrics collection (queue depth, staleness, goroutines) Comprehensive Assertions: - Queue depth bounds: stays within 1.5× instance count - Staleness: healthy instances <20s, multiple poll cycles - Circuit breakers: transient failures recover, permanent stay blocked - Dead-letter queue: only permanent failures routed - Scheduler health: snapshot consistency validation Test Scenarios: - 10 healthy PVE instances (rapid polling) - 1 transient failure instance (fail → recover) - 1 permanent failure instance (DLQ routing) - 55s test duration with 3s base intervals - Validates full adaptive scheduler lifecycle Runs with: go test -tags=integration ./internal/monitoring -run TestAdaptiveSchedulerIntegration Part of Phase 2 Task 9 (Integration/Soak Testing)	2025-10-20 15:13:38 +00:00
rcourtman	160adeb3b8	feat: add scheduler health API endpoint (Phase 2 Task 8) Task 8 of 10 complete. Exposes read-only scheduler health data including: - Queue depth and distribution by instance type - Dead-letter queue inspection (top 25 tasks with error details) - Circuit breaker states (instance-level) - Staleness scores per instance New API endpoint: GET /api/monitoring/scheduler/health (requires authentication) New snapshot methods: - StalenessTracker.Snapshot() - exports all staleness data - TaskQueue.Snapshot() - queue depth & per-type distribution - TaskQueue.PeekAll() - dead-letter task inspection - circuitBreaker.State() - exports state, failures, retryAt - Monitor.SchedulerHealth() - aggregates all health data Documentation updated with API spec, field descriptions, and usage examples.	2025-10-20 15:13:38 +00:00
rcourtman	b1f445b33d	feat: implement error handling with circuit breakers and backoff (Phase 2 Task 7) Adds comprehensive error resilience: - Circuit breaker with closed/open/half-open states (3 failures = trip) - Exponential backoff with jitter (2s initial, 2x multiplier, 5min max) - Dead-letter queue for tasks exceeding 5 retry attempts - Error classification (transient vs permanent) using internal/errors helpers - Per-instance failure tracking and breaker state management - Integration with staleness tracker for outcome recording Task 7 of 10 complete (70%). Ready for API surfaces and testing.	2025-10-20 15:13:37 +00:00
rcourtman	aa5c08ad4a	feat: implement priority queue-based task execution (Phase 2 Task 6) Replaces immediate polling with queue-based scheduling: - TaskQueue with min-heap (container/heap) for NextRun-ordered execution - Worker goroutines that block on WaitNext() until tasks are due - Tasks only execute when NextRun <= now, respecting adaptive intervals - Automatic rescheduling after execution via scheduler.BuildPlan - Queue depth tracking for backpressure-aware interval adjustments - Upsert semantics for updating scheduled tasks without duplicates Task 6 of 10 complete (60%). Ready for error/backoff policies.	2025-10-20 15:13:37 +00:00
rcourtman	c7d1abf874	feat: implement staleness tracker for adaptive polling (Phase 2 Task 4) Adds freshness metadata tracking for all monitored instances: - StalenessTracker with per-instance last success/error/mutation timestamps - Change hash detection using SHA1 for detecting data mutations - Normalized staleness scoring (0-1 scale) based on age vs maxStale - Integration with PollMetrics for authoritative last-success data - Wired into all poll functions (PVE/PBS/PMG) via UpdateSuccess/UpdateError - Connected to scheduler as StalenessSource implementation Task 4 of 10 complete. Ready for adaptive interval logic.	2025-10-20 15:13:37 +00:00
rcourtman	57429900a6	feat: add adaptive polling scheduler infrastructure (Phase 2 Tasks 1-3) Implements adaptive scheduling foundation for Phase 2: - Poll cycle metrics: duration, staleness, queue depth, in-flight counters - Adaptive scheduler with pluggable staleness/interval/enqueue interfaces - Config support: ADAPTIVE_POLLING_ENABLED flag + min/max/base intervals - Feature flag defaults to disabled for safe rollout - Scheduler wiring into Monitor with conditional instantiation Tasks 1-3 of 10 complete. Ready for staleness tracker implementation.	2025-10-20 15:13:37 +00:00
rcourtman	524f42cc28	security: complete Phase 1 sensor proxy hardening Implements comprehensive security hardening for pulse-sensor-proxy: - Privilege drop from root to unprivileged user (UID 995) - Hash-chained tamper-evident audit logging with remote forwarding - Per-UID rate limiting (0.2 QPS, burst 2) with concurrency caps - Enhanced command validation with 10+ attack pattern tests - Fuzz testing (7M+ executions, 0 crashes) - SSH hardening, AppArmor/seccomp profiles, operational runbooks All 27 Phase 1 tasks complete. Ready for production deployment.	2025-10-20 15:13:37 +00:00
Pulse Automation Bot	cfdfe896be	Adjust backup and snapshot alert handling	2025-10-18 20:11:01 +00:00
Pulse Automation Bot	80b9d0602a	Add Apprise notification integration (#570 )	2025-10-18 16:39:39 +00:00
Pulse Automation Bot	0b4e4f9c59	Add configurable backup polling interval	2025-10-18 13:06:41 +00:00
Richard Courtman	97b9c6739c	feat: add min/max temperature tracking for nodes Track minimum and maximum CPU temperatures since monitoring started. This provides better insight into temperature trends and cooling adequacy over time. Changes: - Backend: Add CPUMin, CPUMaxRecord, MinRecorded, MaxRecorded fields to Temperature model - Backend: Implement min/max tracking logic in monitoring cycle that preserves values across polling cycles - Backend: Initialize min/max on first reading, update on extremes - Frontend: Update Temperature TypeScript interface with new fields - Frontend: Display min/max range in NodeCard tooltip (e.g., "52°C (48-67°C since monitoring started)") - Frontend: Rebuild dist assets Temperature display now shows: - Current temperature with color coding (green/yellow/red) - Tooltip with full min-max range and context - Min/max tracked in-memory (resets on Pulse restart) Example tooltip: "CPU: 52°C (48-67°C since monitoring started)" 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-10-18 08:15:10 +00:00
rcourtman	f141f7db33	feat: enhance sensor proxy with improved cluster discovery and SSH management Improvements to pulse-sensor-proxy: - Fix cluster discovery to use pvecm status for IP addresses instead of node names - Add standalone node support for non-clustered Proxmox hosts - Enhanced SSH key push with detailed logging, success/failure tracking, and error reporting - Add --pulse-server flag to installer for custom Pulse URLs - Configure www-data group membership for Proxmox IPC access UI and API cleanup: - Remove unused "Ensure cluster keys" button from Settings - Remove /api/diagnostics/temperature-proxy/ensure-cluster-keys endpoint - Remove EnsureClusterKeys method from tempproxy client The setup script already handles SSH key distribution during initial configuration, making the manual refresh button redundant.	2025-10-17 11:43:26 +00:00
rcourtman	3a4fc044ea	Add guest agent caching and update doc hints (refs #560 )	2025-10-16 08:15:49 +00:00
rcourtman	91fecacfef	feat: add docker agent command handling	2025-10-15 19:27:19 +00:00
rcourtman	32421b36b8	Refs #533 : add total-minus-used memory fallback	2025-10-15 18:19:54 +00:00
rcourtman	7e5fa9a147	fix: restore cache-aware node memory on PVE 8.4	2025-10-14 16:40:45 +00:00
rcourtman	156fd34c50	Update Proxmox guest agent permissions docs and tooling (refs #548 )	2025-10-14 10:21:52 +00:00
rcourtman	5c79d2516d	feat: streamline docker agent onboarding	2025-10-14 09:45:32 +00:00
rcourtman	c8e3c93516	fix: Add security gates for containerized temperature monitoring Addresses #528 - Added opt-in confirmation prompt to setup script with security notice - Added runtime warning when containerized Pulse uses SSH temperature monitoring - Documented security considerations and hardening recommendations - Users must explicitly confirm understanding before enabling in containers	2025-10-12 21:01:25 +00:00
rcourtman	c18cf3d4b8	Fix node config API to preserve fields on partial updates The PUT /api/config/nodes/{id} endpoint was corrupting node configurations when making partial updates (e.g., updating just monitorPhysicalDisks): - Authentication fields (tokenName, tokenValue, password) were being cleared when updating unrelated settings - Name field was being blanked when not included in request - Monitor* boolean fields were defaulting to false Changes: - Only update name field if explicitly provided in request - Only switch authentication method when auth fields are explicitly provided - Preserve existing auth credentials on non-auth updates - Applied fix to all node types (PVE, PBS, PMG) Also enables physical disk monitoring by default (opt-out instead of opt-in) and preserves disk data between polling intervals.	2025-10-12 17:50:55 +00:00
rcourtman	18a88cb4cc	Improve NVMe temperature handling	2025-10-12 16:06:55 +00:00
rcourtman	2163d6f5a8	Use guest meminfo available for VM memory usage	2025-10-12 11:03:56 +00:00
rcourtman	a74baed121	feat: capture Proxmox memory snapshots in diagnostics	2025-10-12 10:25:43 +00:00
rcourtman	f46ff1792b	Fix settings security tab navigation	2025-10-11 23:29:47 +00:00

32 commits