Commit graph

35 commits

rcourtman
3a02dd171b fix(proxmox): add GetClusterOptions to ClusterClient for tag colour fetch 2026-03-15 19:51:20 +00:00
rcourtman
0ae2806f18 fix(memory): add guest agent /proc/meminfo fallback to avoid VM memory inflation (#1270)
Proxmox status.Mem includes page cache as "used" memory, inflating
reported VM usage. The existing fallbacks (balloon meminfo, RRD, linked
host agent) were frequently unavailable, causing most VMs to fall
through to the inflated status-mem source.

Adds a new last-resort fallback that reads /proc/meminfo via the QEMU
guest agent file-read endpoint to get accurate MemAvailable. Results
are cached (60s positive, 5min negative backoff for unsupported VMs).

Also fixes: RRD memavailable fallback missing from traditional polling
path, cache key collisions in multi-PVE setups, FreeMem underflow
guard inconsistency, and integer overflow in kB-to-bytes conversion.
2026-02-20 13:31:52 +00:00
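A minimal sketch of the fallback and cache described in the commit above, not the actual Pulse code: `agentFileRead` stands in for a call to the QEMU guest agent file-read endpoint, and every name here is hypothetical.

```go
package guestmem

import (
	"bufio"
	"errors"
	"fmt"
	"strconv"
	"strings"
	"sync"
	"time"
)

// agentFileRead is a placeholder for the guest agent file-read call
// (/nodes/{node}/qemu/{vmid}/agent/file-read) returning the file content.
type agentFileRead func(node string, vmid int, path string) (string, error)

type cacheEntry struct {
	availBytes uint64
	err        error
	expires    time.Time
}

type memAvailableCache struct {
	mu      sync.Mutex
	entries map[string]cacheEntry // key includes the instance to avoid multi-PVE collisions
}

// MemAvailable returns MemAvailable (bytes) for a VM, caching successes for
// 60s and failures (agent missing or file-read unsupported) for 5 minutes.
func (c *memAvailableCache) MemAvailable(read agentFileRead, instance, node string, vmid int) (uint64, error) {
	key := fmt.Sprintf("%s/%s/%d", instance, node, vmid)

	c.mu.Lock()
	if e, ok := c.entries[key]; ok && time.Now().Before(e.expires) {
		c.mu.Unlock()
		return e.availBytes, e.err
	}
	c.mu.Unlock()

	content, err := read(node, vmid, "/proc/meminfo")
	var avail uint64
	if err == nil {
		avail, err = parseMemAvailable(content)
	}

	ttl := 60 * time.Second
	if err != nil {
		ttl = 5 * time.Minute // negative backoff for unsupported VMs
	}

	c.mu.Lock()
	if c.entries == nil {
		c.entries = make(map[string]cacheEntry)
	}
	c.entries[key] = cacheEntry{availBytes: avail, err: err, expires: time.Now().Add(ttl)}
	c.mu.Unlock()
	return avail, err
}

// parseMemAvailable extracts "MemAvailable:   123456 kB" from /proc/meminfo.
func parseMemAvailable(meminfo string) (uint64, error) {
	sc := bufio.NewScanner(strings.NewReader(meminfo))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) >= 2 && fields[0] == "MemAvailable:" {
			kb, err := strconv.ParseUint(fields[1], 10, 64)
			if err != nil {
				return 0, err
			}
			return kb * 1024, nil // kB to bytes in uint64, avoiding the overflow noted above
		}
	}
	return 0, errors.New("MemAvailable not present in /proc/meminfo")
}
```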
rcourtman
a54d71117b fix(proxmox): prevent guest agent errors from marking endpoints unhealthy
Backport of v6 commits a87c9950 and 347d7db1.

Part 1 (a87c9950): Wrap the four guest agent c.get() errors with
fmt.Errorf("guest agent ...: %w", err) so isVMSpecificError() correctly
scopes them to the VM rather than the cluster endpoint.

Part 2 (347d7db1): Replace the 20+ pattern blocklist in
executeWithFailover with an allowlist via isEndpointConnectivityError().
Only true TCP/DNS/TLS failures mark an endpoint unhealthy. Any HTTP
response from Proxmox — including 500 — proves the node is reachable
and returns the error without affecting endpoint health.
2026-02-18 12:59:20 +00:00
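The shape of both parts of that change, as a sketch rather than the backported code: the function names come from the commit message, but their bodies here are assumptions.

```go
package health

import (
	"errors"
	"fmt"
	"net"
	"strings"
)

// wrapAgentErr mirrors Part 1: tag guest-agent failures so they are scoped
// to the VM instead of the cluster endpoint.
func wrapAgentErr(op string, err error) error {
	if err == nil {
		return nil
	}
	return fmt.Errorf("guest agent %s: %w", op, err)
}

func isVMSpecificError(err error) bool {
	return err != nil && strings.Contains(err.Error(), "guest agent")
}

// isEndpointConnectivityError is the allowlist idea from Part 2: only genuine
// transport failures count. Any HTTP response from Proxmox, even a 500,
// proves the node is reachable and is NOT a connectivity error.
func isEndpointConnectivityError(err error) bool {
	var netErr net.Error
	if errors.As(err, &netErr) {
		return true // dial timeouts, refused connections, DNS failures
	}
	// A fuller allowlist would also match TLS handshake/certificate errors here.
	return false
}
```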
rcourtman
efa916ee2a fix(memory): correct memory reporting for Linux VMs and FreeBSD ZFS ARC
Linux VM page cache (#1270): QEMU VM memory now falls back to Proxmox
RRD's memavailable metric (which excludes reclaimable page cache) when
the qemu-guest-agent doesn't provide MemInfo.Available. Previously the
fallback was detailedStatus.Mem (total - MemFree), inflating usage to
80%+ on VMs with normal Linux page cache. Mirrors the existing LXC
rrd-memavailable path.

FreeBSD ZFS ARC (#1264, #1051): The host agent now reads
kstat.zfs.misc.arcstats.size via SysctlRaw on FreeBSD and subtracts
the ARC size from reported memory usage. ZFS ARC is reclaimable under
memory pressure (like Linux SReclaimable) but gopsutil counts it as
wired/non-reclaimable, causing false 90%+ memory alerts on TrueNAS
and FreeBSD hosts. Build-tagged so it compiles cleanly on all platforms.

Fixes #1270
Fixes #1264
Fixes #1051

(cherry picked from commit 94502f83ff9ffc6da28aaadc946a2f7d8b4e9bac)
2026-02-18 12:56:53 +00:00
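A build-tagged sketch of the FreeBSD ARC correction described above, assuming golang.org/x/sys/unix's SysctlRaw and a little-endian host; the function names are hypothetical, not the agent's real API.

```go
//go:build freebsd

package hostmem

import (
	"encoding/binary"

	"golang.org/x/sys/unix"
)

// zfsARCSize reads kstat.zfs.misc.arcstats.size via sysctl. It returns 0
// when the sysctl is missing (no ZFS), so callers can subtract unconditionally.
func zfsARCSize() uint64 {
	raw, err := unix.SysctlRaw("kstat.zfs.misc.arcstats.size")
	if err != nil || len(raw) < 8 {
		return 0
	}
	return binary.LittleEndian.Uint64(raw)
}

// adjustForZFSARC treats the ARC like reclaimable cache: subtract it from
// "used" memory so a large ARC does not look like memory pressure.
func adjustForZFSARC(usedBytes uint64) uint64 {
	arc := zfsARCSize()
	if arc >= usedBytes {
		return 0
	}
	return usedBytes - arc
}
```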
rcourtman
ebc29b4fdb feat: show pending apt updates for Proxmox nodes (#1083)
- Add PendingUpdates and PendingUpdatesCheckedAt fields to Node model
- Add GetNodePendingUpdates method to Proxmox client (calls /nodes/{node}/apt/update)
- Add 30-minute polling cache to avoid excessive API calls
- Add pendingUpdates to frontend Node type
- Add color-coded badge in NodeSummaryTable (yellow: 1-9, orange: 10+)
- Update test stubs for interface compliance

Requires the Sys.Audit permission on the Proxmox API token to read apt updates.
2026-01-21 10:53:36 +00:00
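One way the 30-minute polling cache mentioned in that list could look; a sketch under assumed names, with the fetch callback standing in for the GET /nodes/{node}/apt/update call (which needs Sys.Audit).

```go
package updates

import (
	"sync"
	"time"
)

const pendingUpdatesTTL = 30 * time.Minute

type pendingEntry struct {
	count     int
	checkedAt time.Time
}

type PendingUpdatesCache struct {
	mu      sync.Mutex
	entries map[string]pendingEntry
}

// Get returns the cached pending-update count for a node, calling fetch at
// most once per 30 minutes per node to avoid hammering the apt endpoint.
func (c *PendingUpdatesCache) Get(node string, fetch func(node string) (int, error)) (int, time.Time, error) {
	c.mu.Lock()
	defer c.mu.Unlock()

	if e, ok := c.entries[node]; ok && time.Since(e.checkedAt) < pendingUpdatesTTL {
		return e.count, e.checkedAt, nil
	}
	count, err := fetch(node)
	if err != nil {
		return 0, time.Time{}, err
	}
	if c.entries == nil {
		c.entries = make(map[string]pendingEntry)
	}
	now := time.Now()
	c.entries[node] = pendingEntry{count: count, checkedAt: now}
	return count, now, nil
}
```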
rcourtman
80444a9022 fix(monitor): use cluster quorum status instead of endpoint count for health
Previously, when some cluster endpoints were unreachable (e.g., backup
nodes intentionally offline), the cluster was marked as "degraded" even
though the Proxmox cluster itself was healthy and had quorum.

Now the connection health check queries the Proxmox cluster's actual
quorum status. A cluster is only marked "degraded" if it has lost
quorum (not enough votes for consensus), which is the actual indicator
of cluster instability.

This means:
- Cluster with quorum + some nodes offline = "healthy"
- Cluster without quorum = "degraded" (warning)
- All endpoints down = "error"

Fixes #1085
2026-01-11 11:54:02 +00:00
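The status mapping from that commit, condensed into a sketch: the quorate flag would come from the cluster's /cluster/status response, and the types here are illustrative rather than Pulse's actual model.

```go
package cluster

// ClusterHealth maps quorum and endpoint reachability to the statuses above.
func ClusterHealth(quorate bool, reachableEndpoints int) string {
	switch {
	case reachableEndpoints == 0:
		return "error" // all endpoints down
	case !quorate:
		return "degraded" // lost quorum: the real indicator of instability
	default:
		return "healthy" // quorum held, even with some nodes offline
	}
}
```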
rcourtman
d0191d136f fix: Add configurable poll timeout and handle external Ceph storage
Changes:
1. Add MAX_POLL_TIMEOUT env var for large Proxmox clusters that need
   more than 3 minutes for polling (default: 3m, minimum: 30s)
2. Handle external Ceph storage gracefully - don't mark nodes unhealthy
   when Proxmox returns 'binary not installed' (e.g., for Ceph not
   managed by Proxmox)

Related to #965
2026-01-05 23:34:33 +00:00
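A sketch of how the MAX_POLL_TIMEOUT rules above could be applied, assuming the value uses Go duration syntax (e.g. "5m"); the function name is hypothetical.

```go
package config

import (
	"os"
	"time"
)

// pollTimeout returns the poll timeout: default 3m, clamped to a 30s
// minimum, and falling back to the default on parse errors.
func pollTimeout() time.Duration {
	const (
		def     = 3 * time.Minute
		minimum = 30 * time.Second
	)
	raw := os.Getenv("MAX_POLL_TIMEOUT")
	if raw == "" {
		return def
	}
	d, err := time.ParseDuration(raw)
	if err != nil {
		return def
	}
	if d < minimum {
		return minimum
	}
	return d
}
```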
rcourtman
e0dc6695fc fix: Per-node TLS fingerprints for cluster peers (TOFU)
When a PVE cluster has unique self-signed certificates on each node, Pulse
would mark secondary nodes as unhealthy because only the primary node's
fingerprint was used for all connections.

Now, during cluster discovery, Pulse captures each node's TLS fingerprint
and uses it when connecting to that specific node. This enables
"Trust On First Use" (TOFU) for clusters with unique per-node certs.

Changes:
- Add Fingerprint field to ClusterEndpoint config
- Add FetchFingerprint() to tlsutil for capturing node certs
- validateNodeAPI() now captures and returns fingerprints during discovery
- NewClusterClient() accepts endpointFingerprints map for per-node certs
- All client creation paths use per-endpoint fingerprints when available

Related to #879
2025-12-24 10:05:03 +00:00
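A sketch of the TOFU capture step named FetchFingerprint above: one TLS handshake with verification disabled, hashing the leaf certificate into the colon-separated uppercase form Proxmox shows. This is an illustration of the idea, not the tlsutil implementation.

```go
package tlsutil

import (
	"crypto/sha256"
	"crypto/tls"
	"encoding/hex"
	"errors"
	"net"
	"strings"
	"time"
)

// FetchFingerprint captures a node's certificate fingerprint at discovery
// time so it can be pinned for later connections to that specific node.
func FetchFingerprint(hostPort string) (string, error) {
	conn, err := tls.DialWithDialer(
		&net.Dialer{Timeout: 5 * time.Second},
		"tcp", hostPort,
		&tls.Config{InsecureSkipVerify: true}, // trust is established by pinning, not CAs
	)
	if err != nil {
		return "", err
	}
	defer conn.Close()

	state := conn.ConnectionState()
	if len(state.PeerCertificates) == 0 {
		return "", errors.New("no peer certificate presented")
	}
	sum := sha256.Sum256(state.PeerCertificates[0].Raw)

	parts := make([]string, len(sum))
	for i, b := range sum {
		parts[i] = strings.ToUpper(hex.EncodeToString([]byte{b}))
	}
	return strings.Join(parts, ":"), nil
}
```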
rcourtman
a115af6906 feat: Improve cluster endpoint error messages for users
- Add sanitizeEndpointError() to transform raw Go errors into user-friendly messages
- Transform 'context deadline exceeded' into helpful messages mentioning possible causes
- Storage timeout errors now suggest checking PBS/NFS/Ceph backend connectivity
- Connection refused, certificate errors, and auth errors get actionable hints
- Apply sanitization everywhere cluster endpoint lastError is stored
- Add comprehensive tests for all error transformations
2025-12-16 21:50:02 +00:00
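sanitizeEndpointError is named in the commit above; the mapping below is a simplified guess at the kind of transformations it performs, not the actual rules.

```go
package cluster

import (
	"fmt"
	"strings"
)

// sanitizeEndpointError turns raw Go error text into an actionable hint.
func sanitizeEndpointError(endpoint string, err error) string {
	msg := err.Error()
	switch {
	case strings.Contains(msg, "context deadline exceeded"):
		return fmt.Sprintf("%s timed out - the node may be overloaded, or a storage backend (PBS/NFS/Ceph) may be slow to respond", endpoint)
	case strings.Contains(msg, "connection refused"):
		return fmt.Sprintf("%s refused the connection - check that pveproxy is running and port 8006 is reachable", endpoint)
	case strings.Contains(msg, "certificate"):
		return fmt.Sprintf("%s has a TLS certificate problem - verify the configured fingerprint or CA", endpoint)
	case strings.Contains(msg, "401"), strings.Contains(msg, "authentication"):
		return fmt.Sprintf("%s rejected the API token - check the token ID, secret, and permissions", endpoint)
	default:
		return fmt.Sprintf("%s: %s", endpoint, msg)
	}
}
```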
rcourtman
fa13919987 fix(ai-chat): Display messages chronologically in AI chatbot
- Add 'content' type to StreamDisplayEvent for tracking text chunks
- Track content events in streamEvents array for chronological display
- Update render to use Switch/Match for cleaner conditional rendering
- Interleave thinking, tool calls, and content as they stream in
- Add fallback for old messages without streamEvents for backwards compat

Previously, tool/command outputs stayed at top while AI text responses
accumulated at the bottom. Now all events appear in order like a
normal chatbot.
2025-12-11 23:02:59 +00:00
rcourtman
8948e84fe5 feat: AI features, agent improvements, and host monitoring enhancements
AI Chat Integration:
- Multi-provider support (Anthropic, OpenAI, Ollama)
- Streaming responses with markdown rendering
- Agent command execution for remote troubleshooting
- Context-aware conversations with host/container metadata

Agent Updates:
- Add --enable-proxmox flag for automatic PVE/PBS token setup
- Improve auto-update with semver comparison (prevents downgrades)
- Add updatedFrom tracking to report previous version after update
- Reduce initial update check delay from 30s to 5s
- Add agent version column to Hosts page table

Host Metrics:
- Add DiskIO stats collection (read/write bytes, ops, time)
- Improve disk filtering to exclude Docker overlay mounts
- Add RAID array monitoring via mdadm
- Enhanced temperature sensor parsing

Frontend:
- New Agent Version column on Hosts overview table
- Improved node modal with agent-first installation flow
- Add DiskIO display in host drawer
- Better responsive handling for metric bars
2025-12-05 10:37:02 +00:00
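Of the agent bullets above, only the downgrade-preventing semver comparison lends itself to a small sketch. This assumes golang.org/x/mod/semver and a hypothetical shouldUpdate helper; the agent's real check may differ.

```go
package agent

import "golang.org/x/mod/semver"

// shouldUpdate returns true only when the published version is strictly
// newer than the running one, so auto-update never downgrades.
// x/mod/semver expects a leading "v" (e.g. "v4.2.1").
func shouldUpdate(current, latest string) bool {
	if !semver.IsValid(current) || !semver.IsValid(latest) {
		return false // refuse to act on versions we cannot compare
	}
	return semver.Compare(latest, current) > 0
}
```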
rcourtman
be892f5e07 fix: match storage timeout errors without trailing slash
The error pattern `/storage/` only matched storage content endpoints
(`/storage/{name}/content`) but not the main storage list endpoint
(`/nodes/{node}/storage`).

This caused storage timeout errors like:
  Get ".../nodes/pve-100-224/storage": context deadline exceeded

to incorrectly mark cluster nodes as unhealthy, even though the timeout
was due to a slow cross-node storage query, not actual node connectivity
issues.

Fixes #754
2025-12-01 22:48:01 +00:00
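The matching fix boils down to dropping the trailing slash from the pattern; a sketch with an assumed helper name:

```go
package cluster

import "strings"

// isStorageTimeout treats a deadline-exceeded error on any storage endpoint
// as a slow storage query rather than a node connectivity failure. Matching
// "/storage" without a trailing slash covers both /nodes/{node}/storage and
// /nodes/{node}/storage/{name}/content.
func isStorageTimeout(err error) bool {
	if err == nil {
		return false
	}
	msg := err.Error()
	return strings.Contains(msg, "context deadline exceeded") &&
		strings.Contains(msg, "/storage")
}
```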
rcourtman
1f748e8670 fix: recover unhealthy cluster nodes even when some nodes are healthy
Previously, recovery of unhealthy nodes only triggered when ALL nodes
were unhealthy. This caused individual degraded nodes to stay degraded
forever since operations would succeed on healthy nodes and never
trigger the recovery path.

Now recovery is attempted whenever any unhealthy nodes exist, allowing
clusters to recover individual nodes over time.

Also added:
- Panic-safe unlock/lock pattern using anonymous function
- Refresh of both healthy and cooling endpoints after recovery
- Updated timestamp for accurate cooldown checks

Related to #754
2025-12-01 21:47:26 +00:00
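The panic-safe unlock/lock pattern mentioned in that list, sketched in isolation (the surrounding recovery logic is simplified and the names are hypothetical). See the sketch after this entry only as an illustration of the locking shape:

```go
package cluster

import "sync"

// recoverUnhealthy releases the mutex only around the slow recovery calls,
// inside an anonymous function whose deferred Lock re-acquires it even if
// a recovery attempt panics.
func recoverUnhealthy(mu *sync.Mutex, unhealthy []string, tryRecover func(endpoint string) bool) {
	// caller holds mu on entry
	func() {
		mu.Unlock()
		defer mu.Lock() // re-acquired no matter how the loop exits
		for _, ep := range unhealthy {
			tryRecover(ep)
		}
	}()
	// mu is held again here: refresh healthy/cooling endpoint sets and timestamps
}
```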
rcourtman
69de7c25ce Fix cluster degraded status not recovering after transient failures
The previous fix (6db4ee7a) cleared stale error messages but didn't mark
endpoints as healthy again after successful operations. This caused
clusters to remain in "degraded" state permanently once any endpoint had
a temporary issue, even if all endpoints were actually working.

The fix now marks endpoints healthy in clearEndpointError() after
successful operations, ensuring degraded clusters recover automatically.

Related to #659
2025-11-29 19:04:11 +00:00
rcourtman
1b5528356b fix: clear stale errors after successful cluster operations
Previously, errors stored in ClusterClient.lastError were only cleared
during initial health checks or when recovering unhealthy nodes. This
caused stale error messages to persist in the UI even after the
underlying issues were resolved.

The fix clears cached errors in two places:
1. After passing connectivity test in getHealthyClient()
2. After successful operation in executeWithFailover()

This ensures that once an endpoint starts working again, any previous
error messages are cleared from the UI without requiring a restart.

Related to #659, #754
2025-11-27 16:22:16 +00:00
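The two commits above amount to one behaviour: a successful operation must both clear the cached error and mark the endpoint healthy again. A toy illustration of that state transition, not the ClusterClient code:

```go
package cluster

import "sync"

type endpointState struct {
	mu        sync.Mutex
	healthy   bool
	lastError string
}

// recordSuccess clears the stale UI error and flips the endpoint back to
// healthy; without the second step, clusters stayed "degraded" forever.
func (e *endpointState) recordSuccess() {
	e.mu.Lock()
	defer e.mu.Unlock()
	e.lastError = ""
	e.healthy = true
}

func (e *endpointState) recordFailure(err error) {
	e.mu.Lock()
	defer e.mu.Unlock()
	e.lastError = err.Error()
	e.healthy = false
}
```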
rcourtman
bc9e89696b chore: fix staticcheck U1000 unused code warnings
- Remove unused ipv6Regex from validation.go
- Suppress unused recordAlertFired/recordAlertResolved hooks (kept for future use)
- Remove unused apiLimiter rate limiter
- Remove unused stopOnce fields from csrf_store.go and session_store.go
- Remove unused lastBroadcast field from hub.go
- Remove unused lastUsedIndex field from cluster_client.go
2025-11-27 09:12:17 +00:00
rcourtman
8276ae837e chore: cleanup proxmox IsAuthError and remove stray comment
- Make IsAuthError unexported (isAuthError) since it's only used internally
- Remove stray '// test comment' from docker_metadata.go
2025-11-27 08:59:01 +00:00
rcourtman
c439a83fba chore: remove additional dead code
Remove 241 lines of unreachable code across internal and pkg:
- internal/crypto/crypto.go: unused NewCryptoManager wrapper
- internal/monitoring/scheduler.go: unused fixedIntervalSelector type
- internal/ssh/knownhosts/manager.go: unused hostKeyExists function
- internal/updates/manager.go: unused getLatestRelease wrapper
- internal/updates/updater.go: unused GetAll method
- pkg/discovery/discovery.go: unused scanWorker and runPhase (legacy compat)
- pkg/proxmox/client.go: unused post, getTaskStatus, waitForTaskCompletion, getTaskLog
- pkg/proxmox/cluster_client.go: unused markUnhealthy wrapper
2025-11-27 05:13:26 +00:00
rcourtman
b28828a822 Handle VM guest agent errors without marking nodes unhealthy (related to #736) 2025-11-21 17:34:25 +00:00
rcourtman
2207642fa9 Related to #727: normalize persisted Proxmox hosts 2025-11-20 19:58:05 +00:00
rcourtman
766cbe573e Handle missing storage on cluster nodes 2025-11-18 15:57:29 +00:00
rcourtman
a406fe42d8 Fix Proxmox 9.x RRD parameter incompatibility causing cluster health issues
Proxmox VE 9.x removed support for the 'ds' parameter in RRD endpoints
(/nodes/{node}/rrddata and /nodes/{node}/lxc/{vmid}/rrddata). When Pulse
sent RRD requests with ds=memused,memavailable,etc., Proxmox responded with:

  API error 400: {"errors":{"ds":"property is not defined in schema..."}}

This caused cluster nodes to be repeatedly marked unhealthy, which cascaded
into storage polling failures showing 'All cluster endpoints are unhealthy'
even though the nodes were actually healthy and reachable.

Changes:
- Added check in cluster_client.go executeWithFailover to recognize the ds
  parameter error as a capability issue rather than node health failure
- Nodes with this error no longer get marked unhealthy
- Storage polling and other operations now succeed even when RRD calls fail
- The RRD data will be unavailable but core monitoring continues

This fix maintains backward compatibility with older Proxmox versions while
gracefully handling the API change in Proxmox 9.x.
2025-11-08 12:06:08 +00:00
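A sketch of the capability check described above, keyed on the quoted Proxmox 9.x error text; the function name and exact substrings are assumptions.

```go
package cluster

import "strings"

// isRRDDSParamError detects the PVE 9.x rejection of the removed "ds"
// parameter. It is treated as an API capability gap, not a node health
// failure, so the node stays healthy and only the RRD data is skipped.
func isRRDDSParamError(err error) bool {
	if err == nil {
		return false
	}
	msg := err.Error()
	return strings.Contains(msg, "400") &&
		strings.Contains(msg, `"ds"`) &&
		strings.Contains(msg, "property is not defined in schema")
}
```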
rcourtman
48fabdd827 Improve Docker temperature monitoring documentation for clarity (related to #600)
Updated the Quick Start for Docker section in TEMPERATURE_MONITORING.md to be
more user-friendly and address common setup issues:

- Added clear explanation of why the proxy is needed (containers can't access hardware)
- Provided concrete IP example instead of placeholder
- Showed full docker-compose.yml context with proper YAML structure
- Added sudo to commands where needed
- Updated docker-compose commands to v2 syntax with note about v1
- Expanded verification steps with clearer success indicators
- Added reminder to check container name in verification commands

These improvements should help users who encounter blank temperature displays
due to missing proxy installation or bind mount configuration.
2025-11-07 15:09:42 +00:00
rcourtman
af55362009 Fix inflated RAM usage reporting for LXC containers
Related to #553

## Problem

LXC containers showed inflated memory usage (e.g., 90%+ when actual usage was 50-60%,
96% when actual was 61%) because the code used the raw `mem` value from Proxmox's
`/cluster/resources` API endpoint. This value comes from cgroup `memory.current` which
includes reclaimable cache and buffers, making memory appear nearly full even when
plenty is available.

## Root Cause

- **Nodes**: Had sophisticated cache-aware memory calculation with RRD fallbacks
- **VMs (qemu)**: Had detailed memory calculation using guest agent meminfo
- **LXCs**: Naively used `res.Mem` directly without any cache-aware correction

The Proxmox cluster resources API's `mem` field for LXCs includes cache/buffers
(from cgroup memory accounting), which should be excluded for accurate "used" memory.

## Solution

Implement cache-aware memory calculation for LXC containers by:

1. Adding `GetLXCRRDData()` method to fetch RRD metrics for LXC containers from
   `/nodes/{node}/lxc/{vmid}/rrddata`
2. Using RRD `memavailable` to calculate actual used memory (total - available)
3. Falling back to RRD `memused` if `memavailable` is not available
4. Only using cluster resources `mem` value as last resort

This matches the approach already used for nodes and VMs, providing consistent
cache-aware memory reporting across all resource types.

## Changes

- Added `GuestRRDPoint` type and `GetLXCRRDData()` method to pkg/proxmox
- Added `GetLXCRRDData()` to ClusterClient for cluster-aware operations
- Modified LXC memory calculation in `pollPVEInstance()` to use RRD data when available
- Added guest memory snapshot recording for LXC containers
- Updated test stubs to implement the new interface method

## Testing

- Code compiles successfully
- Follows the same proven pattern used for nodes and VMs
- Includes diagnostic snapshot recording for troubleshooting
2025-11-06 00:16:18 +00:00
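The fallback order from the Solution section, compressed into a sketch; the GuestRRDPoint field names and the helper are assumptions, not the real structs.

```go
package monitor

// GuestRRDPoint is a stand-in for one RRD sample from
// /nodes/{node}/lxc/{vmid}/rrddata.
type GuestRRDPoint struct {
	MemTotal     float64
	MemAvailable float64
	MemUsed      float64
}

// lxcUsedMemory prefers RRD memavailable (cache-aware), then RRD memused,
// and only falls back to the cluster-resources value as a last resort.
func lxcUsedMemory(latest *GuestRRDPoint, clusterResourcesMem uint64) uint64 {
	if latest != nil {
		if latest.MemAvailable > 0 && latest.MemTotal >= latest.MemAvailable {
			return uint64(latest.MemTotal - latest.MemAvailable) // excludes reclaimable cache
		}
		if latest.MemUsed > 0 {
			return uint64(latest.MemUsed)
		}
	}
	return clusterResourcesMem // cgroup memory.current; includes cache/buffers
}
```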
rcourtman
23691d5b41 Improve cluster health diagnostics and error messaging
Related to #405

Enhances error reporting and logging when all cluster endpoints are
unhealthy, making it easier to diagnose connectivity issues.

Changes:

1. Enhanced error messages in cluster_client.go:
   - Error now includes list of unreachable endpoints
   - Added detailed logging when no healthy endpoints available
   - Log at WARN level (not DEBUG) when cluster health check fails
   - Better context in recovery attempts with start/completion summaries

2. Improved storage polling resilience in monitor_polling.go:
   - Better error context when cluster storage polling fails
   - Specific guidance for "no healthy nodes available" scenario
   - Storage polling continues with direct node queries even if
     cluster-wide query fails (already worked, but now clearer)

3. Better recovery logging:
   - Log when recovery attempts start with list of unhealthy endpoints
   - Log individual recovery failures at DEBUG level
   - Log recovery summary (success/failure counts)
   - Track throttled endpoints separately for clearer diagnostics

These changes help users understand:
- Which specific endpoints are unreachable
- Whether it's a network/connectivity issue vs. API issue
- That Pulse will continue trying to recover endpoints automatically
- That storage monitoring continues via direct node queries

The root issue is that Pulse's internal health tracking can mark all
endpoints unhealthy when they're unreachable from the Pulse server,
even if Proxmox reports them as "online" in cluster status. Better
logging helps diagnose these network connectivity issues.
2025-11-05 19:44:29 +00:00
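A small sketch of the enriched "all endpoints unhealthy" error described in point 1; the wording and function name are illustrative only.

```go
package cluster

import (
	"fmt"
	"strings"
)

// allEndpointsUnhealthyError names the unreachable endpoints so the log line
// answers "which nodes, and is it connectivity from the Pulse server?"
func allEndpointsUnhealthyError(unreachable []string) error {
	return fmt.Errorf(
		"no healthy cluster endpoints available; unreachable from Pulse: %s (Proxmox may still report these nodes online - check network connectivity from the Pulse server)",
		strings.Join(unreachable, ", "),
	)
}
```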
rcourtman
6eb1a10d9b Refactor: Code cleanup and localStorage consolidation
This commit includes comprehensive codebase cleanup and refactoring:

## Code Cleanup
- Remove dead TypeScript code (types/monitoring.ts - 194 lines duplicate)
- Remove unused Go functions (GetClusterNodes, MigratePassword, GetClusterHealthInfo)
- Clean up commented-out code blocks across multiple files
- Remove unused TypeScript exports (helpTextClass, private tag color helpers)
- Delete obsolete test files and components

## localStorage Consolidation
- Centralize all storage keys into STORAGE_KEYS constant
- Update 5 files to use centralized keys:
  * utils/apiClient.ts (AUTH, LEGACY_TOKEN)
  * components/Dashboard/Dashboard.tsx (GUEST_METADATA)
  * components/Docker/DockerHosts.tsx (DOCKER_METADATA)
  * App.tsx (PLATFORMS_SEEN)
  * stores/updates.ts (UPDATES)
- Benefits: Single source of truth, prevents typos, better maintainability

## Previous Work Committed
- Docker monitoring improvements and disk metrics
- Security enhancements and setup fixes
- API refactoring and cleanup
- Documentation updates
- Build system improvements

## Testing
- All frontend tests pass (29 tests)
- All Go tests pass (15 packages)
- Production build successful
- Zero breaking changes

Total: 186 files changed, 5825 insertions(+), 11602 deletions(-)
2025-11-04 21:50:46 +00:00
rcourtman
a885fb5472 Surface LXC interface IPs via PVE interfaces API (#596) 2025-10-23 08:07:32 +00:00
rcourtman
b95c01066e Capture dynamic LXC IP metrics (#596) 2025-10-23 07:50:45 +00:00
rcourtman
be85459db2 Add LXC config metadata for guest drawers (#596) 2025-10-23 07:30:32 +00:00
rcourtman
c9543e8a7e Add qemu guest agent version metadata 2025-10-22 15:24:07 +00:00
rcourtman
f8b6aa6c97 Treat 501 responses as non-fatal in cluster failover (#449) 2025-10-22 14:23:13 +00:00
rcourtman
7d422d2909 feat: add professional logging with runtime configuration and performance optimization
Implements a structured logging package with LOG_LEVEL/LOG_FORMAT env support, debug-level guards for hot paths, enriched error messages with actionable context, and stack-trace capture for production debugging. Improves observability and reduces log overhead in high-frequency polling loops.
2025-10-20 15:13:38 +00:00
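A sketch of the runtime configuration and hot-path guard described above, written against the standard library's log/slog; the actual package may use a different logging library and level names.

```go
package logging

import (
	"log/slog"
	"os"
	"strings"
)

// New builds a logger from LOG_LEVEL (debug|info|warn|error) and
// LOG_FORMAT (json|text).
func New() *slog.Logger {
	level := slog.LevelInfo
	switch strings.ToLower(os.Getenv("LOG_LEVEL")) {
	case "debug":
		level = slog.LevelDebug
	case "warn":
		level = slog.LevelWarn
	case "error":
		level = slog.LevelError
	}

	opts := &slog.HandlerOptions{Level: level}
	var h slog.Handler = slog.NewTextHandler(os.Stderr, opts)
	if strings.EqualFold(os.Getenv("LOG_FORMAT"), "json") {
		h = slog.NewJSONHandler(os.Stderr, opts)
	}
	return slog.New(h)
}

// In hot polling loops, guard expensive debug output so disabled levels
// cost almost nothing:
//
//	if logger.Enabled(ctx, slog.LevelDebug) {
//		logger.Debug("poll cycle", "guests", len(guests), "elapsed", elapsed)
//	}
```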
rcourtman
524f42cc28 security: complete Phase 1 sensor proxy hardening
Implements comprehensive security hardening for pulse-sensor-proxy:
- Privilege drop from root to unprivileged user (UID 995)
- Hash-chained tamper-evident audit logging with remote forwarding
- Per-UID rate limiting (0.2 QPS, burst 2) with concurrency caps
- Enhanced command validation with 10+ attack pattern tests
- Fuzz testing (7M+ executions, 0 crashes)
- SSH hardening, AppArmor/seccomp profiles, operational runbooks

All 27 Phase 1 tasks complete. Ready for production deployment.
2025-10-20 15:13:37 +00:00
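Of the hardening bullets above, the per-UID rate limit is the easiest to sketch: 0.2 requests/sec with a burst of 2, one limiter per calling UID. This assumes golang.org/x/time/rate and hypothetical names; the proxy's real limiter may differ.

```go
package proxy

import (
	"sync"

	"golang.org/x/time/rate"
)

// uidLimiters enforces the per-UID budget: 0.2 QPS, burst 2.
type uidLimiters struct {
	mu       sync.Mutex
	limiters map[uint32]*rate.Limiter
}

// allow reports whether the request from this UID fits its budget.
func (u *uidLimiters) allow(uid uint32) bool {
	u.mu.Lock()
	lim, ok := u.limiters[uid]
	if !ok {
		lim = rate.NewLimiter(rate.Limit(0.2), 2)
		if u.limiters == nil {
			u.limiters = make(map[uint32]*rate.Limiter)
		}
		u.limiters[uid] = lim
	}
	u.mu.Unlock()
	return lim.Allow()
}
```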
rcourtman
7e5fa9a147 fix: restore cache-aware node memory on PVE 8.4 2025-10-14 16:40:45 +00:00
rcourtman
f46ff1792b Fix settings security tab navigation 2025-10-11 23:29:47 +00:00