Commit graph

35 commits

rcourtman
3a02dd171b fix(proxmox): add GetClusterOptions to ClusterClient for tag colour fetch 2026-03-15 19:51:20 +00:00
rcourtman
0ae2806f18 fix(memory): add guest agent /proc/meminfo fallback to avoid VM memory inflation (#1270)
Proxmox status.Mem includes page cache as "used" memory, inflating
reported VM usage. The existing fallbacks (balloon meminfo, RRD, linked
host agent) were frequently unavailable, causing most VMs to fall
through to the inflated status-mem source.

Adds a new last-resort fallback that reads /proc/meminfo via the QEMU
guest agent file-read endpoint to get accurate MemAvailable. Results
are cached (60s positive, 5min negative backoff for unsupported VMs).

Also fixes: RRD memavailable fallback missing from traditional polling
path, cache key collisions in multi-PVE setups, FreeMem underflow
guard inconsistency, and integer overflow in kB-to-bytes conversion.
2026-02-20 13:31:52 +00:00
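A minimal sketch of the fallback and cache described in the commit above, not the actual Pulse code: `agentFileRead` stands in for a call to the QEMU guest agent file-read endpoint, and every name here is hypothetical.

```go
package guestmem

import (
	"bufio"
	"errors"
	"fmt"
	"strconv"
	"strings"
	"sync"
	"time"
)

// agentFileRead is a placeholder for the guest agent file-read call
// (/nodes/{node}/qemu/{vmid}/agent/file-read) returning the file content.
type agentFileRead func(node string, vmid int, path string) (string, error)

type cacheEntry struct {
	availBytes uint64
	err        error
	expires    time.Time
}

type memAvailableCache struct {
	mu      sync.Mutex
	entries map[string]cacheEntry // key includes the instance to avoid multi-PVE collisions
}

// MemAvailable returns MemAvailable (bytes) for a VM, caching successes for
// 60s and failures (agent missing or file-read unsupported) for 5 minutes.
func (c *memAvailableCache) MemAvailable(read agentFileRead, instance, node string, vmid int) (uint64, error) {
	key := fmt.Sprintf("%s/%s/%d", instance, node, vmid)

	c.mu.Lock()
	if e, ok := c.entries[key]; ok && time.Now().Before(e.expires) {
		c.mu.Unlock()
		return e.availBytes, e.err
	}
	c.mu.Unlock()

	content, err := read(node, vmid, "/proc/meminfo")
	var avail uint64
	if err == nil {
		avail, err = parseMemAvailable(content)
	}

	ttl := 60 * time.Second
	if err != nil {
		ttl = 5 * time.Minute // negative backoff for unsupported VMs
	}

	c.mu.Lock()
	if c.entries == nil {
		c.entries = make(map[string]cacheEntry)
	}
	c.entries[key] = cacheEntry{availBytes: avail, err: err, expires: time.Now().Add(ttl)}
	c.mu.Unlock()
	return avail, err
}

// parseMemAvailable extracts "MemAvailable:   123456 kB" from /proc/meminfo.
func parseMemAvailable(meminfo string) (uint64, error) {
	sc := bufio.NewScanner(strings.NewReader(meminfo))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) >= 2 && fields[0] == "MemAvailable:" {
			kb, err := strconv.ParseUint(fields[1], 10, 64)
			if err != nil {
				return 0, err
			}
			return kb * 1024, nil // kB to bytes in uint64, avoiding the overflow noted above
		}
	}
	return 0, errors.New("MemAvailable not present in /proc/meminfo")
}
```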
rcourtman
a54d71117b fix(proxmox): prevent guest agent errors from marking endpoints unhealthy
Backport of v6 commits a87c9950 and 347d7db1.

Part 1 (a87c9950): Wrap the four guest agent c.get() errors with
fmt.Errorf("guest agent ...: %w", err) so isVMSpecificError() correctly
scopes them to the VM rather than the cluster endpoint.

Part 2 (347d7db1): Replace the 20+ pattern blocklist in
executeWithFailover with an allowlist via isEndpointConnectivityError().
Only true TCP/DNS/TLS failures mark an endpoint unhealthy. Any HTTP
response from Proxmox — including 500 — proves the node is reachable
and returns the error without affecting endpoint health.
2026-02-18 12:59:20 +00:00
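The shape of both parts of that change, as a sketch rather than the backported code: the function names come from the commit message, but their bodies here are assumptions.

```go
package health

import (
	"errors"
	"fmt"
	"net"
	"strings"
)

// wrapAgentErr mirrors Part 1: tag guest-agent failures so they are scoped
// to the VM instead of the cluster endpoint.
func wrapAgentErr(op string, err error) error {
	if err == nil {
		return nil
	}
	return fmt.Errorf("guest agent %s: %w", op, err)
}

func isVMSpecificError(err error) bool {
	return err != nil && strings.Contains(err.Error(), "guest agent")
}

// isEndpointConnectivityError is the allowlist idea from Part 2: only genuine
// transport failures count. Any HTTP response from Proxmox, even a 500,
// proves the node is reachable and is NOT a connectivity error.
func isEndpointConnectivityError(err error) bool {
	var netErr net.Error
	if errors.As(err, &netErr) {
		return true // dial timeouts, refused connections, DNS failures
	}
	// A fuller allowlist would also match TLS handshake/certificate errors here.
	return false
}
```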
rcourtman
efa916ee2a fix(memory): correct memory reporting for Linux VMs and FreeBSD ZFS ARC
Linux VM page cache (#1270): QEMU VM memory now falls back to Proxmox
RRD's memavailable metric (which excludes reclaimable page cache) when
the qemu-guest-agent doesn't provide MemInfo.Available. Previously the
fallback was detailedStatus.Mem (total - MemFree), inflating usage to
80%+ on VMs with normal Linux page cache. Mirrors the existing LXC
rrd-memavailable path.

FreeBSD ZFS ARC (#1264, #1051): The host agent now reads
kstat.zfs.misc.arcstats.size via SysctlRaw on FreeBSD and subtracts
the ARC size from reported memory usage. ZFS ARC is reclaimable under
memory pressure (like Linux SReclaimable) but gopsutil counts it as
wired/non-reclaimable, causing false 90%+ memory alerts on TrueNAS
and FreeBSD hosts. Build-tagged so it compiles cleanly on all platforms.

Fixes #1270
Fixes #1264
Fixes #1051

(cherry picked from commit 94502f83ff9ffc6da28aaadc946a2f7d8b4e9bac)
2026-02-18 12:56:53 +00:00
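A build-tagged sketch of the FreeBSD ARC correction described above, assuming golang.org/x/sys/unix's SysctlRaw and a little-endian host; the function names are hypothetical, not the agent's real API.

```go
//go:build freebsd

package hostmem

import (
	"encoding/binary"

	"golang.org/x/sys/unix"
)

// zfsARCSize reads kstat.zfs.misc.arcstats.size via sysctl. It returns 0
// when the sysctl is missing (no ZFS), so callers can subtract unconditionally.
func zfsARCSize() uint64 {
	raw, err := unix.SysctlRaw("kstat.zfs.misc.arcstats.size")
	if err != nil || len(raw) < 8 {
		return 0
	}
	return binary.LittleEndian.Uint64(raw)
}

// adjustForZFSARC treats the ARC like reclaimable cache: subtract it from
// "used" memory so a large ARC does not look like memory pressure.
func adjustForZFSARC(usedBytes uint64) uint64 {
	arc := zfsARCSize()
	if arc >= usedBytes {
		return 0
	}
	return usedBytes - arc
}
```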
rcourtman
ebc29b4fdb feat: show pending apt updates for Proxmox nodes (#1083)
- Add PendingUpdates and PendingUpdatesCheckedAt fields to Node model
- Add GetNodePendingUpdates method to Proxmox client (calls /nodes/{node}/apt/update)
- Add 30-minute polling cache to avoid excessive API calls
- Add pendingUpdates to frontend Node type
- Add color-coded badge in NodeSummaryTable (yellow: 1-9, orange: 10+)
- Update test stubs for interface compliance

Requires the Sys.Audit permission on the Proxmox API token to read apt updates.
2026-01-21 10:53:36 +00:00
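One way the 30-minute polling cache mentioned in that list could look; a sketch under assumed names, with the fetch callback standing in for the GET /nodes/{node}/apt/update call (which needs Sys.Audit).

```go
package updates

import (
	"sync"
	"time"
)

const pendingUpdatesTTL = 30 * time.Minute

type pendingEntry struct {
	count     int
	checkedAt time.Time
}

type PendingUpdatesCache struct {
	mu      sync.Mutex
	entries map[string]pendingEntry
}

// Get returns the cached pending-update count for a node, calling fetch at
// most once per 30 minutes per node to avoid hammering the apt endpoint.
func (c *PendingUpdatesCache) Get(node string, fetch func(node string) (int, error)) (int, time.Time, error) {
	c.mu.Lock()
	defer c.mu.Unlock()

	if e, ok := c.entries[node]; ok && time.Since(e.checkedAt) < pendingUpdatesTTL {
		return e.count, e.checkedAt, nil
	}
	count, err := fetch(node)
	if err != nil {
		return 0, time.Time{}, err
	}
	if c.entries == nil {
		c.entries = make(map[string]pendingEntry)
	}
	now := time.Now()
	c.entries[node] = pendingEntry{count: count, checkedAt: now}
	return count, now, nil
}
```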
rcourtman
80444a9022 fix(monitor): use cluster quorum status instead of endpoint count for health
Previously, when some cluster endpoints were unreachable (e.g., backup
nodes intentionally offline), the cluster was marked as "degraded" even
though the Proxmox cluster itself was healthy and had quorum.

Now the connection health check queries the Proxmox cluster's actual
quorum status. A cluster is only marked "degraded" if it has lost
quorum (not enough votes for consensus), which is the actual indicator
of cluster instability.

This means:
- Cluster with quorum + some nodes offline = "healthy"
- Cluster without quorum = "degraded" (warning)
- All endpoints down = "error"

Fixes #1085
2026-01-11 11:54:02 +00:00
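The status mapping from that commit, condensed into a sketch: the quorate flag would come from the cluster's /cluster/status response, and the types here are illustrative rather than Pulse's actual model.

```go
package cluster

// ClusterHealth maps quorum and endpoint reachability to the statuses above.
func ClusterHealth(quorate bool, reachableEndpoints int) string {
	switch {
	case reachableEndpoints == 0:
		return "error" // all endpoints down
	case !quorate:
		return "degraded" // lost quorum: the real indicator of instability
	default:
		return "healthy" // quorum held, even with some nodes offline
	}
}
```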
rcourtman
d0191d136f fix: Add configurable poll timeout and handle external Ceph storage
Changes:
1. Add MAX_POLL_TIMEOUT env var for large Proxmox clusters that need
   more than 3 minutes for polling (default: 3m, minimum: 30s)
2. Handle external Ceph storage gracefully - don't mark nodes unhealthy
   when Proxmox returns 'binary not installed' (e.g., for Ceph not
   managed by Proxmox)

Related to #965
2026-01-05 23:34:33 +00:00
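A sketch of how the MAX_POLL_TIMEOUT rules above could be applied, assuming the value uses Go duration syntax (e.g. "5m"); the function name is hypothetical.

```go
package config

import (
	"os"
	"time"
)

// pollTimeout returns the poll timeout: default 3m, clamped to a 30s
// minimum, and falling back to the default on parse errors.
func pollTimeout() time.Duration {
	const (
		def     = 3 * time.Minute
		minimum = 30 * time.Second
	)
	raw := os.Getenv("MAX_POLL_TIMEOUT")
	if raw == "" {
		return def
	}
	d, err := time.ParseDuration(raw)
	if err != nil {
		return def
	}
	if d < minimum {
		return minimum
	}
	return d
}
```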
rcourtman
e0dc6695fc fix: Per-node TLS fingerprints for cluster peers (TOFU)
When a PVE cluster has unique self-signed certificates on each node, Pulse
would mark secondary nodes as unhealthy because only the primary node's
fingerprint was used for all connections.

Now, during cluster discovery, Pulse captures each node's TLS fingerprint
and uses it when connecting to that specific node. This enables
"Trust On First Use" (TOFU) for clusters with unique per-node certs.

Changes:
- Add Fingerprint field to ClusterEndpoint config
- Add FetchFingerprint() to tlsutil for capturing node certs
- validateNodeAPI() now captures and returns fingerprints during discovery
- NewClusterClient() accepts endpointFingerprints map for per-node certs
- All client creation paths use per-endpoint fingerprints when available

Related to #879
2025-12-24 10:05:03 +00:00
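A sketch of the TOFU capture step named FetchFingerprint above: one TLS handshake with verification disabled, hashing the leaf certificate into the colon-separated uppercase form Proxmox shows. This is an illustration of the idea, not the tlsutil implementation.

```go
package tlsutil

import (
	"crypto/sha256"
	"crypto/tls"
	"encoding/hex"
	"errors"
	"net"
	"strings"
	"time"
)

// FetchFingerprint captures a node's certificate fingerprint at discovery
// time so it can be pinned for later connections to that specific node.
func FetchFingerprint(hostPort string) (string, error) {
	conn, err := tls.DialWithDialer(
		&net.Dialer{Timeout: 5 * time.Second},
		"tcp", hostPort,
		&tls.Config{InsecureSkipVerify: true}, // trust is established by pinning, not CAs
	)
	if err != nil {
		return "", err
	}
	defer conn.Close()

	state := conn.ConnectionState()
	if len(state.PeerCertificates) == 0 {
		return "", errors.New("no peer certificate presented")
	}
	sum := sha256.Sum256(state.PeerCertificates[0].Raw)

	parts := make([]string, len(sum))
	for i, b := range sum {
		parts[i] = strings.ToUpper(hex.EncodeToString([]byte{b}))
	}
	return strings.Join(parts, ":"), nil
}
```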
rcourtman
a115af6906 feat: Improve cluster endpoint error messages for users
- Add sanitizeEndpointError() to transform raw Go errors into user-friendly messages
- Transform 'context deadline exceeded' into helpful messages mentioning possible causes
- Storage timeout errors now suggest checking PBS/NFS/Ceph backend connectivity
- Connection refused, certificate errors, and auth errors get actionable hints
- Apply sanitization everywhere cluster endpoint lastError is stored
- Add comprehensive tests for all error transformations
2025-12-16 21:50:02 +00:00
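sanitizeEndpointError is named in the commit above; the mapping below is a simplified guess at the kind of transformations it performs, not the actual rules.

```go
package cluster

import (
	"fmt"
	"strings"
)

// sanitizeEndpointError turns raw Go error text into an actionable hint.
func sanitizeEndpointError(endpoint string, err error) string {
	msg := err.Error()
	switch {
	case strings.Contains(msg, "context deadline exceeded"):
		return fmt.Sprintf("%s timed out - the node may be overloaded, or a storage backend (PBS/NFS/Ceph) may be slow to respond", endpoint)
	case strings.Contains(msg, "connection refused"):
		return fmt.Sprintf("%s refused the connection - check that pveproxy is running and port 8006 is reachable", endpoint)
	case strings.Contains(msg, "certificate"):
		return fmt.Sprintf("%s has a TLS certificate problem - verify the configured fingerprint or CA", endpoint)
	case strings.Contains(msg, "401"), strings.Contains(msg, "authentication"):
		return fmt.Sprintf("%s rejected the API token - check the token ID, secret, and permissions", endpoint)
	default:
		return fmt.Sprintf("%s: %s", endpoint, msg)
	}
}
```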
rcourtman
fa13919987 fix(ai-chat): Display messages chronologically in AI chatbot
- Add 'content' type to StreamDisplayEvent for tracking text chunks
- Track content events in streamEvents array for chronological display
- Update render to use Switch/Match for cleaner conditional rendering
- Interleave thinking, tool calls, and content as they stream in
- Add fallback for old messages without streamEvents for backwards compat

Previously, tool/command outputs stayed at top while AI text responses
accumulated at the bottom. Now all events appear in order like a
normal chatbot.
2025-12-11 23:02:59 +00:00
rcourtman
8948e84fe5 feat: AI features, agent improvements, and host monitoring enhancements
AI Chat Integration:
- Multi-provider support (Anthropic, OpenAI, Ollama)
- Streaming responses with markdown rendering
- Agent command execution for remote troubleshooting
- Context-aware conversations with host/container metadata

Agent Updates:
- Add --enable-proxmox flag for automatic PVE/PBS token setup
- Improve auto-update with semver comparison (prevents downgrades)
- Add updatedFrom tracking to report previous version after update
- Reduce initial update check delay from 30s to 5s
- Add agent version column to Hosts page table

Host Metrics:
- Add DiskIO stats collection (read/write bytes, ops, time)
- Improve disk filtering to exclude Docker overlay mounts
- Add RAID array monitoring via mdadm
- Enhanced temperature sensor parsing

Frontend:
- New Agent Version column on Hosts overview table
- Improved node modal with agent-first installation flow
- Add DiskIO display in host drawer
- Better responsive handling for metric bars
2025-12-05 10:37:02 +00:00
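Of the agent bullets above, only the downgrade-preventing semver comparison lends itself to a small sketch. This assumes golang.org/x/mod/semver and a hypothetical shouldUpdate helper; the agent's real check may differ.

```go
package agent

import "golang.org/x/mod/semver"

// shouldUpdate returns true only when the published version is strictly
// newer than the running one, so auto-update never downgrades.
// x/mod/semver expects a leading "v" (e.g. "v4.2.1").
func shouldUpdate(current, latest string) bool {
	if !semver.IsValid(current) || !semver.IsValid(latest) {
		return false // refuse to act on versions we cannot compare
	}
	return semver.Compare(latest, current) > 0
}
```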
rcourtman
be892f5e07 fix: match storage timeout errors without trailing slash
The error pattern `/storage/` only matched storage content endpoints
(`/storage/{name}/content`) but not the main storage list endpoint
(`/nodes/{node}/storage`).

This caused storage timeout errors like:
  Get ".../nodes/pve-100-224/storage": context deadline exceeded

to incorrectly mark cluster nodes as unhealthy, even though the timeout
was due to a slow cross-node storage query, not actual node connectivity
issues.

Fixes #754
2025-12-01 22:48:01 +00:00
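The matching fix boils down to dropping the trailing slash from the pattern; a sketch with an assumed helper name:

```go
package cluster

import "strings"

// isStorageTimeout treats a deadline-exceeded error on any storage endpoint
// as a slow storage query rather than a node connectivity failure. Matching
// "/storage" without a trailing slash covers both /nodes/{node}/storage and
// /nodes/{node}/storage/{name}/content.
func isStorageTimeout(err error) bool {
	if err == nil {
		return false
	}
	msg := err.Error()
	return strings.Contains(msg, "context deadline exceeded") &&
		strings.Contains(msg, "/storage")
}
```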
rcourtman
1f748e8670 fix: recover unhealthy cluster nodes even when some nodes are healthy
Previously, recovery of unhealthy nodes only triggered when ALL nodes
were unhealthy. This caused individual degraded nodes to stay degraded
forever since operations would succeed on healthy nodes and never
trigger the recovery path.

Now recovery is attempted whenever any unhealthy nodes exist, allowing
clusters to recover individual nodes over time.

Also added:
- Panic-safe unlock/lock pattern using anonymous function
- Refresh of both healthy and cooling endpoints after recovery
- Updated timestamp for accurate cooldown checks

Related to #754
2025-12-01 21:47:26 +00:00
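The panic-safe unlock/lock pattern mentioned in that list, sketched in isolation (the surrounding recovery logic is simplified and the names are hypothetical). See the sketch after this entry only as an illustration of the locking shape:

```go
package cluster

import "sync"

// recoverUnhealthy releases the mutex only around the slow recovery calls,
// inside an anonymous function whose deferred Lock re-acquires it even if
// a recovery attempt panics.
func recoverUnhealthy(mu *sync.Mutex, unhealthy []string, tryRecover func(endpoint string) bool) {
	// caller holds mu on entry
	func() {
		mu.Unlock()
		defer mu.Lock() // re-acquired no matter how the loop exits
		for _, ep := range unhealthy {
			tryRecover(ep)
		}
	}()
	// mu is held again here: refresh healthy/cooling endpoint sets and timestamps
}
```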
rcourtman
69de7c25ce Fix cluster degraded status not recovering after transient failures
The previous fix (6db4ee7a) cleared stale error messages but didn't mark
endpoints as healthy again after successful operations. This caused
clusters to remain in "degraded" state permanently once any endpoint had
a temporary issue, even if all endpoints were actually working.

The fix now marks endpoints healthy in clearEndpointError() after
successful operations, ensuring degraded clusters recover automatically.

Related to #659
2025-11-29 19:04:11 +00:00
rcourtman
1b5528356b fix: clear stale errors after successful cluster operations
Previously, errors stored in ClusterClient.lastError were only cleared
during initial health checks or when recovering unhealthy nodes. This
caused stale error messages to persist in the UI even after the
underlying issues were resolved.

The fix clears cached errors in two places:
1. After passing connectivity test in getHealthyClient()
2. After successful operation in executeWithFailover()

This ensures that once an endpoint starts working again, any previous
error messages are cleared from the UI without requiring a restart.

Related to #659, #754
2025-11-27 16:22:16 +00:00
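The two commits above amount to one behaviour: a successful operation must both clear the cached error and mark the endpoint healthy again. A toy illustration of that state transition, not the ClusterClient code:

```go
package cluster

import "sync"

type endpointState struct {
	mu        sync.Mutex
	healthy   bool
	lastError string
}

// recordSuccess clears the stale UI error and flips the endpoint back to
// healthy; without the second step, clusters stayed "degraded" forever.
func (e *endpointState) recordSuccess() {
	e.mu.Lock()
	defer e.mu.Unlock()
	e.lastError = ""
	e.healthy = true
}

func (e *endpointState) recordFailure(err error) {
	e.mu.Lock()
	defer e.mu.Unlock()
	e.lastError = err.Error()
	e.healthy = false
}
```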
rcourtman
bc9e89696b chore: fix staticcheck U1000 unused code warnings
- Remove unused ipv6Regex from validation.go
- Suppress unused recordAlertFired/recordAlertResolved hooks (kept for future use)
- Remove unused apiLimiter rate limiter
- Remove unused stopOnce fields from csrf_store.go and session_store.go
- Remove unused lastBroadcast field from hub.go
- Remove unused lastUsedIndex field from cluster_client.go
2025-11-27 09:12:17 +00:00
rcourtman
8276ae837e chore: cleanup proxmox IsAuthError and remove stray comment
- Make IsAuthError unexported (isAuthError) since it's only used internally
- Remove stray '// test comment' from docker_metadata.go
2025-11-27 08:59:01 +00:00
rcourtman
c439a83fba chore: remove additional dead code
Remove 241 lines of unreachable code across internal and pkg:
- internal/crypto/crypto.go: unused NewCryptoManager wrapper
- internal/monitoring/scheduler.go: unused fixedIntervalSelector type
- internal/ssh/knownhosts/manager.go: unused hostKeyExists function
- internal/updates/manager.go: unused getLatestRelease wrapper
- internal/updates/updater.go: unused GetAll method
- pkg/discovery/discovery.go: unused scanWorker and runPhase (legacy compat)
- pkg/proxmox/client.go: unused post, getTaskStatus, waitForTaskCompletion, getTaskLog
- pkg/proxmox/cluster_client.go: unused markUnhealthy wrapper
2025-11-27 05:13:26 +00:00
rcourtman
b28828a822 Handle VM guest agent errors without marking nodes unhealthy (related to #736) 2025-11-21 17:34:25 +00:00
rcourtman
2207642fa9 Related to #727: normalize persisted Proxmox hosts 2025-11-20 19:58:05 +00:00
rcourtman
766cbe573e Handle missing storage on cluster nodes 2025-11-18 15:57:29 +00:00
rcourtman
a406fe42d8 Fix Proxmox 9.x RRD parameter incompatibility causing cluster health issues
Proxmox VE 9.x removed support for the 'ds' parameter in RRD endpoints
(/nodes/{node}/rrddata and /nodes/{node}/lxc/{vmid}/rrddata). When Pulse
sent RRD requests with ds=memused,memavailable,etc., Proxmox responded with:

  API error 400: {"errors":{"ds":"property is not defined in schema..."}}

This caused cluster nodes to be repeatedly marked unhealthy, which cascaded
into storage polling failures showing 'All cluster endpoints are unhealthy'
even though the nodes were actually healthy and reachable.

Changes:
- Added check in cluster_client.go executeWithFailover to recognize the ds
  parameter error as a capability issue rather than node health failure
- Nodes with this error no longer get marked unhealthy
- Storage polling and other operations now succeed even when RRD calls fail
- The RRD data will be unavailable but core monitoring continues

This fix maintains backward compatibility with older Proxmox versions while
gracefully handling the API change in Proxmox 9.x.
2025-11-08 12:06:08 +00:00
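A sketch of the capability check described above, keyed on the quoted Proxmox 9.x error text; the function name and exact substrings are assumptions.

```go
package cluster

import "strings"

// isRRDDSParamError detects the PVE 9.x rejection of the removed "ds"
// parameter. It is treated as an API capability gap, not a node health
// failure, so the node stays healthy and only the RRD data is skipped.
func isRRDDSParamError(err error) bool {
	if err == nil {
		return false
	}
	msg := err.Error()
	return strings.Contains(msg, "400") &&
		strings.Contains(msg, `"ds"`) &&
		strings.Contains(msg, "property is not defined in schema")
}
```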
rcourtman
48fabdd827 Improve Docker temperature monitoring documentation for clarity (related to #600)
Updated the Quick Start for Docker section in TEMPERATURE_MONITORING.md to be
more user-friendly and address common setup issues:

- Added clear explanation of why the proxy is needed (containers can't access hardware)
- Provided concrete IP example instead of placeholder
- Showed full docker-compose.yml context with proper YAML structure
- Added sudo to commands where needed
- Updated docker-compose commands to v2 syntax with note about v1
- Expanded verification steps with clearer success indicators
- Added reminder to check container name in verification commands

These improvements should help users who encounter blank temperature displays
due to missing proxy installation or bind mount configuration.
2025-11-07 15:09:42 +00:00
rcourtman
af55362009 Fix inflated RAM usage reporting for LXC containers
Related to #553

## Problem

LXC containers showed inflated memory usage (e.g., 90%+ when actual usage was 50-60%,
96% when actual was 61%) because the code used the raw `mem` value from Proxmox's
`/cluster/resources` API endpoint. This value comes from cgroup `memory.current` which
includes reclaimable cache and buffers, making memory appear nearly full even when
plenty is available.

## Root Cause

- **Nodes**: Had sophisticated cache-aware memory calculation with RRD fallbacks
- **VMs (qemu)**: Had detailed memory calculation using guest agent meminfo
- **LXCs**: Naively used `res.Mem` directly without any cache-aware correction

The Proxmox cluster resources API's `mem` field for LXCs includes cache/buffers
(from cgroup memory accounting), which should be excluded for accurate "used" memory.

## Solution

Implement cache-aware memory calculation for LXC containers by:

1. Adding `GetLXCRRDData()` method to fetch RRD metrics for LXC containers from
   `/nodes/{node}/lxc/{vmid}/rrddata`
2. Using RRD `memavailable` to calculate actual used memory (total - available)
3. Falling back to RRD `memused` if `memavailable` is not available
4. Only using cluster resources `mem` value as last resort

This matches the approach already used for nodes and VMs, providing consistent
cache-aware memory reporting across all resource types.

## Changes

- Added `GuestRRDPoint` type and `GetLXCRRDData()` method to pkg/proxmox
- Added `GetLXCRRDData()` to ClusterClient for cluster-aware operations
- Modified LXC memory calculation in `pollPVEInstance()` to use RRD data when available
- Added guest memory snapshot recording for LXC containers
- Updated test stubs to implement the new interface method

## Testing

- Code compiles successfully
- Follows the same proven pattern used for nodes and VMs
- Includes diagnostic snapshot recording for troubleshooting
2025-11-06 00:16:18 +00:00
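The fallback order from the Solution section, compressed into a sketch; the GuestRRDPoint field names and the helper are assumptions, not the real structs.

```go
package monitor

// GuestRRDPoint is a stand-in for one RRD sample from
// /nodes/{node}/lxc/{vmid}/rrddata.
type GuestRRDPoint struct {
	MemTotal     float64
	MemAvailable float64
	MemUsed      float64
}

// lxcUsedMemory prefers RRD memavailable (cache-aware), then RRD memused,
// and only falls back to the cluster-resources value as a last resort.
func lxcUsedMemory(latest *GuestRRDPoint, clusterResourcesMem uint64) uint64 {
	if latest != nil {
		if latest.MemAvailable > 0 && latest.MemTotal >= latest.MemAvailable {
			return uint64(latest.MemTotal - latest.MemAvailable) // excludes reclaimable cache
		}
		if latest.MemUsed > 0 {
			return uint64(latest.MemUsed)
		}
	}
	return clusterResourcesMem // cgroup memory.current; includes cache/buffers
}
```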
rcourtman
23691d5b41 Improve cluster health diagnostics and error messaging
Related to #405

Enhances error reporting and logging when all cluster endpoints are
unhealthy, making it easier to diagnose connectivity issues.

Changes:

1. Enhanced error messages in cluster_client.go:
   - Error now includes list of unreachable endpoints
   - Added detailed logging when no healthy endpoints available
   - Log at WARN level (not DEBUG) when cluster health check fails
   - Better context in recovery attempts with start/completion summaries

2. Improved storage polling resilience in monitor_polling.go:
   - Better error context when cluster storage polling fails
   - Specific guidance for "no healthy nodes available" scenario
   - Storage polling continues with direct node queries even if
     cluster-wide query fails (already worked, but now clearer)

3. Better recovery logging:
   - Log when recovery attempts start with list of unhealthy endpoints
   - Log individual recovery failures at DEBUG level
   - Log recovery summary (success/failure counts)
   - Track throttled endpoints separately for clearer diagnostics

These changes help users understand:
- Which specific endpoints are unreachable
- Whether it's a network/connectivity issue vs. API issue
- That Pulse will continue trying to recover endpoints automatically
- That storage monitoring continues via direct node queries

The root issue is that Pulse's internal health tracking can mark all
endpoints unhealthy when they're unreachable from the Pulse server,
even if Proxmox reports them as "online" in cluster status. Better
logging helps diagnose these network connectivity issues.
2025-11-05 19:44:29 +00:00
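A small sketch of the enriched "all endpoints unhealthy" error described in point 1; the wording and function name are illustrative only.

```go
package cluster

import (
	"fmt"
	"strings"
)

// allEndpointsUnhealthyError names the unreachable endpoints so the log line
// answers "which nodes, and is it connectivity from the Pulse server?"
func allEndpointsUnhealthyError(unreachable []string) error {
	return fmt.Errorf(
		"no healthy cluster endpoints available; unreachable from Pulse: %s (Proxmox may still report these nodes online - check network connectivity from the Pulse server)",
		strings.Join(unreachable, ", "),
	)
}
```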
rcourtman
6eb1a10d9b Refactor: Code cleanup and localStorage consolidation
This commit includes comprehensive codebase cleanup and refactoring:

## Code Cleanup
- Remove dead TypeScript code (types/monitoring.ts - 194 lines duplicate)
- Remove unused Go functions (GetClusterNodes, MigratePassword, GetClusterHealthInfo)
- Clean up commented-out code blocks across multiple files
- Remove unused TypeScript exports (helpTextClass, private tag color helpers)
- Delete obsolete test files and components

## localStorage Consolidation
- Centralize all storage keys into STORAGE_KEYS constant
- Update 5 files to use centralized keys:
  * utils/apiClient.ts (AUTH, LEGACY_TOKEN)
  * components/Dashboard/Dashboard.tsx (GUEST_METADATA)
  * components/Docker/DockerHosts.tsx (DOCKER_METADATA)
  * App.tsx (PLATFORMS_SEEN)
  * stores/updates.ts (UPDATES)
- Benefits: Single source of truth, prevents typos, better maintainability

## Previous Work Committed
- Docker monitoring improvements and disk metrics
- Security enhancements and setup fixes
- API refactoring and cleanup
- Documentation updates
- Build system improvements

## Testing
- All frontend tests pass (29 tests)
- All Go tests pass (15 packages)
- Production build successful
- Zero breaking changes

Total: 186 files changed, 5825 insertions(+), 11602 deletions(-)
2025-11-04 21:50:46 +00:00
rcourtman
a885fb5472 Surface LXC interface IPs via PVE interfaces API (#596) 2025-10-23 08:07:32 +00:00
rcourtman
b95c01066e Capture dynamic LXC IP metrics (#596) 2025-10-23 07:50:45 +00:00
rcourtman
be85459db2 Add LXC config metadata for guest drawers (#596) 2025-10-23 07:30:32 +00:00
rcourtman
c9543e8a7e Add qemu guest agent version metadata 2025-10-22 15:24:07 +00:00
rcourtman
f8b6aa6c97 Treat 501 responses as non-fatal in cluster failover (#449) 2025-10-22 14:23:13 +00:00
rcourtman
7d422d2909 feat: add professional logging with runtime configuration and performance optimization
Implements a structured logging package with LOG_LEVEL/LOG_FORMAT env support, debug-level guards for hot paths, enriched error messages with actionable context, and stack-trace capture for production debugging. Improves observability and reduces log overhead in high-frequency polling loops.
2025-10-20 15:13:38 +00:00
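A sketch of the runtime configuration and hot-path guard described above, written against the standard library's log/slog; the actual package may use a different logging library and level names.

```go
package logging

import (
	"log/slog"
	"os"
	"strings"
)

// New builds a logger from LOG_LEVEL (debug|info|warn|error) and
// LOG_FORMAT (json|text).
func New() *slog.Logger {
	level := slog.LevelInfo
	switch strings.ToLower(os.Getenv("LOG_LEVEL")) {
	case "debug":
		level = slog.LevelDebug
	case "warn":
		level = slog.LevelWarn
	case "error":
		level = slog.LevelError
	}

	opts := &slog.HandlerOptions{Level: level}
	var h slog.Handler = slog.NewTextHandler(os.Stderr, opts)
	if strings.EqualFold(os.Getenv("LOG_FORMAT"), "json") {
		h = slog.NewJSONHandler(os.Stderr, opts)
	}
	return slog.New(h)
}

// In hot polling loops, guard expensive debug output so disabled levels
// cost almost nothing:
//
//	if logger.Enabled(ctx, slog.LevelDebug) {
//		logger.Debug("poll cycle", "guests", len(guests), "elapsed", elapsed)
//	}
```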
rcourtman
524f42cc28 security: complete Phase 1 sensor proxy hardening
Implements comprehensive security hardening for pulse-sensor-proxy:
- Privilege drop from root to unprivileged user (UID 995)
- Hash-chained tamper-evident audit logging with remote forwarding
- Per-UID rate limiting (0.2 QPS, burst 2) with concurrency caps
- Enhanced command validation with 10+ attack pattern tests
- Fuzz testing (7M+ executions, 0 crashes)
- SSH hardening, AppArmor/seccomp profiles, operational runbooks

All 27 Phase 1 tasks complete. Ready for production deployment.
2025-10-20 15:13:37 +00:00
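Of the hardening bullets above, the per-UID rate limit is the easiest to sketch: 0.2 requests/sec with a burst of 2, one limiter per calling UID. This assumes golang.org/x/time/rate and hypothetical names; the proxy's real limiter may differ.

```go
package proxy

import (
	"sync"

	"golang.org/x/time/rate"
)

// uidLimiters enforces the per-UID budget: 0.2 QPS, burst 2.
type uidLimiters struct {
	mu       sync.Mutex
	limiters map[uint32]*rate.Limiter
}

// allow reports whether the request from this UID fits its budget.
func (u *uidLimiters) allow(uid uint32) bool {
	u.mu.Lock()
	lim, ok := u.limiters[uid]
	if !ok {
		lim = rate.NewLimiter(rate.Limit(0.2), 2)
		if u.limiters == nil {
			u.limiters = make(map[uint32]*rate.Limiter)
		}
		u.limiters[uid] = lim
	}
	u.mu.Unlock()
	return lim.Allow()
}
```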
rcourtman
7e5fa9a147 fix: restore cache-aware node memory on PVE 8.4 2025-10-14 16:40:45 +00:00
rcourtman
f46ff1792b Fix settings security tab navigation 2025-10-11 23:29:47 +00:00