Pulse

vrr/Pulse

mirror of https://github.com/rcourtman/Pulse.git synced 2026-04-29 03:50:18 +00:00

Author	SHA1	Message	Date
rcourtman	a4edd6229f	perf: Cache lowercase error string in guest agent error handling Compute strings.ToLower(errStr) once and reuse errStrLower instead of calling ToLower three times on the same string in permission check.	2025-12-02 15:26:40 +00:00
rcourtman	f0da506524	fix: Properly show disabled storage status from Proxmox Storage that is disabled in Proxmox (Datacenter > Storage > enabled=no) was incorrectly showing as "available" in Pulse. The issue was that Enabled/Active fields defaulted to true and were never set to false from the per-node API response. Now the model correctly initializes Enabled/Active from the Proxmox per-node storage API response, and the status determination prioritizes checking the disabled state first. Related to #796	2025-12-02 15:03:40 +00:00
rcourtman	f30e2ca547	test: Add tests for allow, ensureContainerRootDiskEntry; remove dead code - allow (circuitBreaker): all states including unknown/default branch, open→half-open transition, half-open window timing (93.8% -> 100%) - ensureContainerRootDiskEntry: nil container, existing disks, used>total clamping, negative free clamping, usage calculation (92.3% -> 100%) - convertPoolInfoToModel: removed unreachable nil check (dead code since ConvertToModelZFSPool only returns nil for nil receiver) (88.9% -> 100%)	2025-12-01 20:15:32 +00:00
rcourtman	2ff0a0988f	refactor: remove unnecessary type conversions Remove redundant type conversions identified by unconvert linter: - Remove int() conversions for already-int VMID fields - Remove int64() conversions for already-int64 arithmetic results - Remove uint64() conversions for already-uint64 Disk/MaxDisk fields - Remove int() on syscall.Stdin (already int constant)	2025-11-27 10:33:35 +00:00
rcourtman	8f6a481cd2	refactor: use builtin max() and fix unused parameter - Replace custom maxInt64 helper with Go 1.21+ builtin max() - Mark unused cfg parameter in newAdaptiveIntervalSelector - Remove test for deleted helper function	2025-11-27 10:08:37 +00:00
rcourtman	e6adffb2ff	fix: prevent context leak in temperature collection - Use defer for tempCancel() to ensure context is always cancelled - Remove redundant shouldCollect variable that was always true - Fix indentation after removing the unnecessary conditional block	2025-11-26 23:43:18 +00:00
courtmanr@gmail.com	584ad94ee5	Refactor: Parallelize PVE node polling	2025-11-25 08:38:03 +00:00
rcourtman	61a7c2829c	Honor configured PVE polling interval in scheduler	2025-11-20 22:00:56 +00:00
rcourtman	b474a77b65	Add direct node fallback for storage polling	2025-11-18 19:58:38 +00:00
rcourtman	9d34327dbe	Don't disable storages when cluster metadata omits flags	2025-11-18 17:18:23 +00:00
rcourtman	1a78dcbba2	Fix guest agent disk data regression on Proxmox 8.3+ Related to #630 Proxmox 8.3+ changed the VM status API to return the `agent` field as an object ({"enabled":1,"available":1}) instead of an integer (0 or 1). This caused Pulse to incorrectly treat VMs as having no guest agent, resulting in missing disk usage data (disk:-1) even when the guest agent was running and functional. The issue manifested as: - VMs showing "Guest details unavailable" or missing disk data - Pulse logs showing no "Guest agent enabled, querying filesystem info" messages - `pvesh get /nodes/<node>/qemu/<vmid>/agent/get-fsinfo` working correctly from the command line, confirming the agent was functional Root cause: The VMStatus struct defined `Agent` as an int field. When Proxmox 8.3+ sent the new object format, JSON unmarshaling silently left the field at zero, causing Pulse to skip all guest agent queries. Changes: - Created VMAgentField type with custom UnmarshalJSON to handle both formats: * Legacy (Proxmox <8.3): integer (0 or 1) * Modern (Proxmox 8.3+): object {"enabled":N,"available":N} - Updated VMStatus.Agent from `int` to `VMAgentField` - Updated all references to `detailedStatus.Agent` to use `.Agent.Value` - The unmarshaler prioritizes the "available" field over "enabled" to ensure we only query when the agent is actually responding This fix maintains backward compatibility with older Proxmox versions while supporting the new format introduced in Proxmox 8.3+.	2025-11-06 18:42:46 +00:00
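The dual-format decoding described in the commit above can be sketched roughly as follows. The names `VMAgentField`, `VMStatus` and `.Agent.Value` come from the commit message; the method body, the `parseAgent` helper and the fallback order are assumptions for illustration, not the project's actual code.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// VMAgentField holds the guest agent flag regardless of wire format.
// The field layout is assumed; only the type name comes from the commit.
type VMAgentField struct {
	Value int
}

// UnmarshalJSON accepts either a bare integer (legacy format) or an
// object with "enabled"/"available" keys (Proxmox 8.3+ format).
func (f *VMAgentField) UnmarshalJSON(data []byte) error {
	// Legacy format: 0 or 1.
	var n int
	if err := json.Unmarshal(data, &n); err == nil {
		f.Value = n
		return nil
	}
	// Modern format: pointers distinguish "absent" from "0".
	var obj struct {
		Enabled   *int `json:"enabled"`
		Available *int `json:"available"`
	}
	if err := json.Unmarshal(data, &obj); err != nil {
		return fmt.Errorf("agent field: %w", err)
	}
	switch {
	case obj.Available != nil: // "available" takes priority over "enabled"
		f.Value = *obj.Available
	case obj.Enabled != nil:
		f.Value = *obj.Enabled
	default:
		f.Value = 0
	}
	return nil
}

// VMStatus is reduced to the one field relevant here.
type VMStatus struct {
	Agent VMAgentField `json:"agent"`
}

// parseAgent is a hypothetical helper that decodes a status payload
// and returns the resolved agent value.
func parseAgent(raw string) (int, error) {
	var s VMStatus
	if err := json.Unmarshal([]byte(raw), &s); err != nil {
		return 0, err
	}
	return s.Agent.Value, nil
}

func main() {
	for _, raw := range []string{
		`{"agent":1}`,
		`{"agent":{"enabled":1,"available":0}}`,
	} {
		v, err := parseAgent(raw)
		fmt.Println(v, err)
	}
}
```

Trying the integer form first keeps older Proxmox versions on the cheap path, and a payload that matches neither shape surfaces as an error instead of silently leaving the field at zero.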
rcourtman	20854256c3	Fix VM migration issue where custom alert thresholds are lost Resolves #641 ## Problem When a VM migrates between Proxmox nodes, Pulse was treating it as a new resource and discarding custom alert threshold overrides. This occurred because guest IDs included the node name (e.g., `instance-node-VMID`), causing the ID to change when the VM moved to a different node. Users reported that after migrating a VM, previously disabled alerts (e.g., memory threshold set to 0) would resume firing. ## Root Cause Guest IDs were constructed as: - Standalone: `node-VMID` - Cluster: `instance-node-VMID` When a VM migrated from node1 to node2, the ID changed from `instance-node1-100` to `instance-node2-100`, causing: - Alert threshold overrides to be orphaned (keyed by old ID) - Guest metadata (custom URLs, descriptions) to be orphaned - Active alerts to reference the wrong resource ID ## Solution Changed guest ID format to be stable across node migrations: - New format: `instance-VMID` (for both standalone and cluster) - Retains uniqueness across instances while being node-independent - Allows VMs to migrate freely without losing configuration ## Implementation ### Backend Changes 1. Guest ID Construction (`monitor_polling.go`): - Simplified to always use `instance-VMID` format - Removed node from the ID construction logic 2. Alert Override Migration (`alerts.go`): - Added lazy migration in `getGuestThresholds()` - Detects legacy ID formats and migrates to new format - Preserves user configurations automatically 3. Guest Metadata Migration (`guest_metadata.go`): - Added `GetWithLegacyMigration()` helper method - Called during VM/container polling to migrate metadata - Preserves custom URLs and descriptions 4. Active Alerts Migration (`alerts.go`): - Added migration logic in `LoadActiveAlerts()` - Translates legacy alert resource IDs to new format - Preserves alert acknowledgments across restarts ### Frontend Changes 5. ID Construction Updates: - `ThresholdsTable.tsx`: Updated fallback from `instance-node-vmid` to `instance-vmid` - `Dashboard.tsx`: Simplified guest ID construction - `GuestRow.tsx`: Updated `buildGuestId()` helper ## Migration Strategy - Lazy Migration: Configs are migrated as guests are discovered - Backwards Compatible: Old IDs are detected and automatically converted - Zero Downtime: No manual intervention required - Persisted: Migrated configs are saved on next config write cycle ## Testing Recommendations After deployment: 1. Verify existing alert overrides still apply 2. Test VM migration - confirm thresholds persist 3. Check guest metadata (custom URLs) survive migration 4. Verify active alerts maintain acknowledgment state ## Related - Addresses similar issues with guest metadata and active alert tracking - Lays groundwork for any future guest-specific configuration features - Aligns with project philosophy: correctness and UX over implementation complexity	2025-11-06 10:27:15 +00:00
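A minimal sketch of the lazy migration idea for the cluster ID format, assuming overrides live in a map keyed by guest ID. `Thresholds`, `guestID` and `thresholdsFor` are hypothetical names for illustration; the commit's own `getGuestThresholds()` is not reproduced here, and the prefix/suffix match is a simplification that would need care if instance or node names overlap.

```go
package main

import (
	"fmt"
	"strings"
)

// Thresholds is a stand-in for a per-guest alert override.
type Thresholds struct {
	Memory float64
}

// guestID builds the node-independent ID format: instance-VMID.
func guestID(instance string, vmid int) string {
	return fmt.Sprintf("%s-%d", instance, vmid)
}

// thresholdsFor looks up an override by the stable ID. If none exists,
// it searches for a legacy "instance-node-VMID" key, moves that entry
// to the stable key, and returns it.
func thresholdsFor(overrides map[string]Thresholds, instance string, vmid int) (Thresholds, bool) {
	id := guestID(instance, vmid)
	if th, ok := overrides[id]; ok {
		return th, true
	}
	prefix := instance + "-"
	suffix := fmt.Sprintf("-%d", vmid)
	for key, th := range overrides {
		// Require a non-empty node segment between prefix and suffix.
		if len(key) > len(prefix)+len(suffix) &&
			strings.HasPrefix(key, prefix) && strings.HasSuffix(key, suffix) {
			delete(overrides, key)
			overrides[id] = th
			return th, true
		}
	}
	return Thresholds{}, false
}

func main() {
	overrides := map[string]Thresholds{"cluster-node1-100": {Memory: 0}}
	th, ok := thresholdsFor(overrides, "cluster", 100)
	fmt.Println(th, ok, overrides)
}
```

Because the lookup matches on instance and VMID only, the override is found no matter which node the VM currently runs on, which is the property the migration relies on.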
rcourtman	23691d5b41	Improve cluster health diagnostics and error messaging Related to #405 Enhances error reporting and logging when all cluster endpoints are unhealthy, making it easier to diagnose connectivity issues. Changes: 1. Enhanced error messages in cluster_client.go: - Error now includes list of unreachable endpoints - Added detailed logging when no healthy endpoints available - Log at WARN level (not DEBUG) when cluster health check fails - Better context in recovery attempts with start/completion summaries 2. Improved storage polling resilience in monitor_polling.go: - Better error context when cluster storage polling fails - Specific guidance for "no healthy nodes available" scenario - Storage polling continues with direct node queries even if cluster-wide query fails (already worked, but now clearer) 3. Better recovery logging: - Log when recovery attempts start with list of unhealthy endpoints - Log individual recovery failures at DEBUG level - Log recovery summary (success/failure counts) - Track throttled endpoints separately for clearer diagnostics These changes help users understand: - Which specific endpoints are unreachable - Whether it's a network/connectivity issue vs. API issue - That Pulse will continue trying to recover endpoints automatically - That storage monitoring continues via direct node queries The root issue is that Pulse's internal health tracking can mark all endpoints unhealthy when they're unreachable from the Pulse server, even if Proxmox reports them as "online" in cluster status. Better logging helps diagnose these network connectivity issues.	2025-11-05 19:44:29 +00:00
rcourtman	f0088070be	Improve guest agent error classification to prevent false permission errors Related to #596 Problem: Users were seeing persistent "permission denied" error messages for VMs that simply didn't have qemu-guest-agent installed or running. The error detection logic was too broad and classified Proxmox API 500 errors as permission issues, even when they indicated guest agent unavailability. Root Cause: When qemu-guest-agent is not installed or not running, Proxmox API returns various error responses (500, 403) that may contain permission-related text. The previous error detection logic checked for "permission denied" strings without considering the HTTP status code context, leading to: - VMs with guest agent: guest details display correctly - VMs without guest agent: false "Permission denied" error shown Solution: Enhanced error classification logic to distinguish between: 1. Actual permission issues (401/403 with permission keywords) 2. Guest agent unavailability (500 errors) 3. Agent timeout issues 4. Other agent errors The fix ensures that only explicit authentication/authorization errors (401 Unauthorized, 403 Forbidden with permission keywords) are classified as permission-denied, while API 500 errors are correctly identified as agent-not-running issues. Changes: - Reordered error detection to check most specific patterns first - Added HTTP status code context to permission error detection - 500 errors now correctly map to "agent-not-running" status - Only 401/403 errors with explicit permission keywords trigger "permission-denied" - Improved log messages to guide users toward correct resolution - Fixed err.Error() vs errStr variable inconsistency Impact: Users will now see accurate error messages that guide them to: - Install qemu-guest-agent when it's missing (most common case) - Check permissions only when there's an actual auth/authz issue - Understand the difference between agent problems and permission problems	2025-11-05 19:21:58 +00:00
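The classification order described above might look roughly like this. The status strings `permission-denied` and `agent-not-running` appear in the commit message; `agent-timeout`, `agent-error`, the keyword list and the function signature are assumptions made for this sketch.

```go
package main

import (
	"fmt"
	"strings"
)

// classifyAgentError maps an HTTP status code and error text to a
// guest agent status. The signature is hypothetical.
func classifyAgentError(status int, msg string) string {
	// Lowercase once and reuse for every keyword check.
	lower := strings.ToLower(msg)
	switch {
	case strings.Contains(lower, "timeout") || strings.Contains(lower, "timed out"):
		return "agent-timeout"
	case status == 500:
		// A 500 signals agent unavailability even when the text
		// happens to mention permissions.
		return "agent-not-running"
	case (status == 401 || status == 403) && hasPermissionKeyword(lower):
		return "permission-denied"
	default:
		return "agent-error"
	}
}

// hasPermissionKeyword reports whether the lowercased message contains
// an explicit auth-related term. The keyword list is an assumption.
func hasPermissionKeyword(lower string) bool {
	for _, kw := range []string{"permission", "unauthorized", "forbidden"} {
		if strings.Contains(lower, kw) {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(classifyAgentError(500, "Permission denied: QEMU guest agent is not running"))
	fmt.Println(classifyAgentError(403, "Permission check failed"))
}
```

Putting the status code check ahead of the keyword check is what prevents a 500 response containing permission-related text from being reported as a permission problem.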
rcourtman	7dd7a0b0f9	Fix node/host dropout issue caused by cluster health failures Implemented comprehensive state preservation to prevent temporary dropouts: 1. Node Grace Period (60s): - Track last-online timestamp for each Proxmox node - Preserve online status during grace period to prevent flapping - Applied to all node status checks throughout codebase 2. Efficient Polling Preservation: - Detect when cluster/resources returns empty arrays - Preserve previous VMs/containers if had resources before - Handles cluster health check failures gracefully 3. Traditional Polling Preservation: - Updated preservation logic for per-node VM/container polling - Triggers when zero resources returned regardless of node response - Fixed issue where nodes responding with empty data bypassed preservation Root cause: Intermittent Proxmox cluster health failures ("no healthy nodes available") caused both efficient and traditional polling to return empty arrays, immediately clearing all VMs/containers from state. Changes: - internal/monitoring/monitor.go: Added node grace period, efficient polling preservation - internal/monitoring/monitor_polling.go: Fixed traditional polling preservation logic Fixes frequent UI flickering where vmCount/containerCount would briefly drop to zero.	2025-11-05 17:01:20 +00:00
rcourtman	7a185c4ab3	Improve guest agent timeout handling for high-load environments (refs #592) This change addresses intermittent "Guest details unavailable" and "Disk stats unavailable" errors affecting users with large VM deployments (50+ VMs) or high-load Proxmox environments. Changes: - Increased default guest agent timeouts (3-5s → 10-15s) to better handle environments under load - Added automatic retry logic (1 retry by default) for transient timeout failures - Made all timeouts and retry count configurable via environment variables: * GUEST_AGENT_FSINFO_TIMEOUT (default: 15s) * GUEST_AGENT_NETWORK_TIMEOUT (default: 10s) * GUEST_AGENT_OSINFO_TIMEOUT (default: 10s) * GUEST_AGENT_VERSION_TIMEOUT (default: 10s) * GUEST_AGENT_RETRIES (default: 1) - Added comprehensive documentation in VM_DISK_MONITORING.md with configuration examples for different deployment scenarios These improvements allow Pulse to gracefully handle intermittent API timeouts without immediately displaying errors, while remaining configurable for different network conditions and environment sizes. Fixes: https://github.com/rcourtman/Pulse/discussions/592	2025-11-05 09:40:58 +00:00
rcourtman	aac3dacd63	Improve LXC guest metrics visibility (#596)	2025-10-22 22:24:33 +00:00
rcourtman	c9543e8a7e	Add qemu guest agent version metadata	2025-10-22 15:24:07 +00:00
rcourtman	7d422d2909	feat: add professional logging with runtime configuration and performance optimization Implements structured logging package with LOG_LEVEL/LOG_FORMAT env support, debug level guards for hot paths, enriched error messages with actionable context, and stack trace capture for production debugging. Improves observability and reduces log overhead in high-frequency polling loops.	2025-10-20 15:13:38 +00:00
rcourtman	aa5c08ad4a	feat: implement priority queue-based task execution (Phase 2 Task 6) Replaces immediate polling with queue-based scheduling: - TaskQueue with min-heap (container/heap) for NextRun-ordered execution - Worker goroutines that block on WaitNext() until tasks are due - Tasks only execute when NextRun <= now, respecting adaptive intervals - Automatic rescheduling after execution via scheduler.BuildPlan - Queue depth tracking for backpressure-aware interval adjustments - Upsert semantics for updating scheduled tasks without duplicates Task 6 of 10 complete (60%). Ready for error/backoff policies.	2025-10-20 15:13:37 +00:00
rcourtman	c7d1abf874	feat: implement staleness tracker for adaptive polling (Phase 2 Task 4) Adds freshness metadata tracking for all monitored instances: - StalenessTracker with per-instance last success/error/mutation timestamps - Change hash detection using SHA1 for detecting data mutations - Normalized staleness scoring (0-1 scale) based on age vs maxStale - Integration with PollMetrics for authoritative last-success data - Wired into all poll functions (PVE/PBS/PMG) via UpdateSuccess/UpdateError - Connected to scheduler as StalenessSource implementation Task 4 of 10 complete. Ready for adaptive interval logic.	2025-10-20 15:13:37 +00:00
rcourtman	57429900a6	feat: add adaptive polling scheduler infrastructure (Phase 2 Tasks 1-3) Implements adaptive scheduling foundation for Phase 2: - Poll cycle metrics: duration, staleness, queue depth, in-flight counters - Adaptive scheduler with pluggable staleness/interval/enqueue interfaces - Config support: ADAPTIVE_POLLING_ENABLED flag + min/max/base intervals - Feature flag defaults to disabled for safe rollout - Scheduler wiring into Monitor with conditional instantiation Tasks 1-3 of 10 complete. Ready for staleness tracker implementation.	2025-10-20 15:13:37 +00:00
rcourtman	524f42cc28	security: complete Phase 1 sensor proxy hardening Implements comprehensive security hardening for pulse-sensor-proxy: - Privilege drop from root to unprivileged user (UID 995) - Hash-chained tamper-evident audit logging with remote forwarding - Per-UID rate limiting (0.2 QPS, burst 2) with concurrency caps - Enhanced command validation with 10+ attack pattern tests - Fuzz testing (7M+ executions, 0 crashes) - SSH hardening, AppArmor/seccomp profiles, operational runbooks All 27 Phase 1 tasks complete. Ready for production deployment.	2025-10-20 15:13:37 +00:00

73 commits