Commit graph

81 commits

Author SHA1 Message Date
rcourtman
0ae2806f18 fix(memory): add guest agent /proc/meminfo fallback to avoid VM memory inflation (#1270)
Proxmox status.Mem includes page cache as "used" memory, inflating
reported VM usage. The existing fallbacks (balloon meminfo, RRD, linked
host agent) were frequently unavailable, causing most VMs to fall
through to the inflated status-mem source.

Adds a new last-resort fallback that reads /proc/meminfo via the QEMU
guest agent file-read endpoint to get accurate MemAvailable. Results
are cached (60s positive, 5min negative backoff for unsupported VMs).

Also fixes: RRD memavailable fallback missing from traditional polling
path, cache key collisions in multi-PVE setups, FreeMem underflow
guard inconsistency, and integer overflow in kB-to-bytes conversion.
2026-02-20 13:31:52 +00:00
rcourtman
a54d71117b fix(proxmox): prevent guest agent errors from marking endpoints unhealthy
Backport of v6 commits a87c9950 and 347d7db1.

Part 1 (a87c9950): Wrap the four guest agent c.get() errors with
fmt.Errorf("guest agent ...: %w", err) so isVMSpecificError() correctly
scopes them to the VM rather than the cluster endpoint.

Part 2 (347d7db1): Replace the 20+ pattern blocklist in
executeWithFailover with an allowlist via isEndpointConnectivityError().
Only true TCP/DNS/TLS failures mark an endpoint unhealthy. Any HTTP
response from Proxmox — including 500 — proves the node is reachable
and returns the error without affecting endpoint health.
2026-02-18 12:59:20 +00:00
rcourtman
efa916ee2a fix(memory): correct memory reporting for Linux VMs and FreeBSD ZFS ARC
Linux VM page cache (#1270): QEMU VM memory now falls back to Proxmox
RRD's memavailable metric (which excludes reclaimable page cache) when
the qemu-guest-agent doesn't provide MemInfo.Available. Previously the
fallback was detailedStatus.Mem (total - MemFree), inflating usage to
80%+ on VMs with normal Linux page cache. Mirrors the existing LXC
rrd-memavailable path.

FreeBSD ZFS ARC (#1264, #1051): The host agent now reads
kstat.zfs.misc.arcstats.size via SysctlRaw on FreeBSD and subtracts
the ARC size from reported memory usage. ZFS ARC is reclaimable under
memory pressure (like Linux SReclaimable) but gopsutil counts it as
wired/non-reclaimable, causing false 90%+ memory alerts on TrueNAS
and FreeBSD hosts. Build-tagged so it compiles cleanly on all platforms.

Fixes #1270
Fixes #1264
Fixes #1051

(cherry picked from commit 94502f83ff9ffc6da28aaadc946a2f7d8b4e9bac)
2026-02-18 12:56:53 +00:00
rcourtman
815c990e85 fix(proxmox): avoid 403 on apt update checks 2026-02-09 20:28:09 +00:00
rcourtman
13a6f7750c Minor updates to main and proxmox client 2026-01-28 16:52:50 +00:00
rcourtman
ebc29b4fdb feat: show pending apt updates for Proxmox nodes (#1083)
- Add PendingUpdates and PendingUpdatesCheckedAt fields to Node model
- Add GetNodePendingUpdates method to Proxmox client (calls /nodes/{node}/apt/update)
- Add 30-minute polling cache to avoid excessive API calls
- Add pendingUpdates to frontend Node type
- Add color-coded badge in NodeSummaryTable (yellow: 1-9, orange: 10+)
- Update test stubs for interface compliance

Requires Sys.Audit permission on Proxmox API token to read apt updates.
2026-01-21 10:53:36 +00:00
rcourtman
96b7370f7b test: improve coverage for API, AI, Alerts, and Frontend Utils
- Add comprehensive tests for internal/api/config_handlers.go (Phases 1-3)
- Improve test coverage for AI tools, chat service, and session management
- Enhance alert and notification tests (ResolvedAlert, Webhook)
- Add frontend unit tests for utils (searchHistory, tagColors, temperature, url)
- Add proximity client API tests
2026-01-20 15:52:39 +00:00
rcourtman
a6a8efaa65 test: Add comprehensive test coverage across packages
New test files with expanded coverage:

API tests:
- ai_handler_test.go: AI handler unit tests with mocking
- agent_profiles_tools_test.go: Profile management tests
- alerts_endpoints_test.go: Alert API endpoint tests
- alerts_test.go: Updated for interface changes
- audit_handlers_test.go: Audit handler tests
- frontend_embed_test.go: Frontend embedding tests
- metadata_handlers_test.go, metadata_provider_test.go: Metadata tests
- notifications_test.go: Updated for interface changes
- profile_suggestions_test.go: Profile suggestion tests
- saml_service_test.go: SAML authentication tests
- sensor_proxy_gate_test.go: Sensor proxy tests
- updates_test.go: Updated for interface changes

Agent tests:
- dockeragent/signature_test.go: Docker agent signature tests
- hostagent/agent_metrics_test.go: Host agent metrics tests
- hostagent/commands_test.go: Command execution tests
- hostagent/network_helpers_test.go: Network helper tests
- hostagent/proxmox_setup_test.go: Updated setup tests
- kubernetesagent/*_test.go: Kubernetes agent tests

Core package tests:
- monitoring/kubernetes_agents_test.go, reload_test.go
- remoteconfig/client_test.go, signature_test.go
- sensors/collector_test.go
- updates/adapter_installsh_*_test.go: Install adapter tests
- updates/manager_*_test.go: Update manager tests
- websocket/hub_*_test.go: WebSocket hub tests

Library tests:
- pkg/audit/export_test.go: Audit export tests
- pkg/metrics/store_test.go: Metrics store tests
- pkg/proxmox/*_test.go: Proxmox client tests
- pkg/reporting/reporting_test.go: Reporting tests
- pkg/server/*_test.go: Server tests
- pkg/tlsutil/extra_test.go: TLS utility tests

Total: ~8000 lines of new test code
2026-01-19 19:26:18 +00:00
rcourtman
80444a9022 fix(monitor): use cluster quorum status instead of endpoint count for health
Previously, when some cluster endpoints were unreachable (e.g., backup
nodes intentionally offline), the cluster was marked as "degraded" even
though the Proxmox cluster itself was healthy and had quorum.

Now the connection health check queries the Proxmox cluster's actual
quorum status. A cluster is only marked "degraded" if it has lost
quorum (not enough votes for consensus), which is the actual indicator
of cluster instability.

This means:
- Cluster with quorum + some nodes offline = "healthy"
- Cluster without quorum = "degraded" (warning)
- All endpoints down = "error"

Fixes #1085
2026-01-11 11:54:02 +00:00
rcourtman
bd1df9f942 feat: automatic subnet preference for cluster node discovery
When discovering cluster nodes, Pulse now automatically prefers IPs
on the same subnet as the initial connection. This fixes the common
issue where Pulse used internal cluster network IPs (e.g., 172.x.x.x)
instead of management network IPs (e.g., 10.x.x.x).

How it works:
1. Extract subnet from initial connection URL (assumes /24 for IPv4)
2. For each discovered node, query /nodes/{node}/network for all IPs
3. If cluster-reported IP is on a different subnet, find an IP on
   the preferred subnet and set it as IPOverride
4. Manual IPOverride settings are preserved and take precedence

This eliminates the need for manual IPOverride configuration in most
multi-network Proxmox setups.

Refs #929, #1066
2026-01-08 23:12:30 +00:00
rcourtman
d0191d136f fix: Add configurable poll timeout and handle external Ceph storage
Changes:
1. Add MAX_POLL_TIMEOUT env var for large Proxmox clusters that need
   more than 3 minutes for polling (default: 3m, minimum: 30s)
2. Handle external Ceph storage gracefully - don't mark nodes unhealthy
   when Proxmox returns 'binary not installed' (e.g., for Ceph not
   managed by Proxmox)

Related to #965
2026-01-05 23:34:33 +00:00
rcourtman
45d4d68127 fix: Add debug logging and response format handling for replication status
- Add comprehensive debug logging to diagnose replication status fetch failures
- Handle both array and single-object response formats from Proxmox API
- Log raw response body for easier debugging
- Log success/failure for each enrichment step

This helps diagnose issue #992 where replication last/next sync times aren't
showing. The logging will reveal if the API call is failing, returning empty
data, or returning data in an unexpected format.

Related to #992
2026-01-04 15:01:32 +00:00
rcourtman
4cd3e53c3e test: add regression tests for missing frontend fields
Ensures that LinkedHostAgentId, CommandsEnabled, IsLegacy, and LinkedNodeId
are correctly propagated to the frontend. This prevents regressions of the
bugs fixed for #952 and #971.
2026-01-02 20:45:35 +00:00
rcourtman
3fdf753a5b Enhance devcontainer and CI workflows
- Add persistent volume mounts for Go/npm caches (faster rebuilds)
- Add shell config with helpful aliases and custom prompt
- Add comprehensive devcontainer documentation
- Add pre-commit hooks for Go formatting and linting
- Use go-version-file in CI workflows instead of hardcoded versions
- Simplify docker compose commands with --wait flag
- Add gitignore entries for devcontainer auth files

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-01 22:29:15 +00:00
rcourtman
567a4ad147 fix(replication): fetch status from per-node endpoint
The /cluster/replication endpoint only returns job configuration (guest,
schedule, source, target), not status data (last_sync, next_sync,
duration, fail_count, state).

This fix enriches each replication job with status from the per-node
endpoint /nodes/{node}/replication/{id}/status to get timing and state
data needed for proper UI display.

Added integration tests to verify:
- Status endpoint is called and data is merged correctly
- Graceful handling when status endpoint fails

Fixes #992
2025-12-31 23:58:06 +00:00
rcourtman
3fd20340d1 fix: increase PBS storage content timeout to 60s
PBS storage content queries with encrypted backups can take 10-20+ seconds
to enumerate. The previous 30s timeout was causing intermittent failures
when polling backup data from PBS storage configured in PVE.

This increases the timeout to 60s to accommodate slow PBS backends while
still preventing indefinite hangs on unavailable NFS/network storage.
2025-12-26 00:21:17 +00:00
rcourtman
e0dc6695fc fix: Per-node TLS fingerprints for cluster peers (TOFU)
When a PVE cluster has unique self-signed certificates on each node, Pulse
would mark secondary nodes as unhealthy because only the primary node's
fingerprint was used for all connections.

Now, during cluster discovery, Pulse captures each node's TLS fingerprint
and uses it when connecting to that specific node. This enables
"Trust On First Use" (TOFU) for clusters with unique per-node certs.

Changes:
- Add Fingerprint field to ClusterEndpoint config
- Add FetchFingerprint() to tlsutil for capturing node certs
- validateNodeAPI() now captures and returns fingerprints during discovery
- NewClusterClient() accepts endpointFingerprints map for per-node certs
- All client creation paths use per-endpoint fingerprints when available

Related to #879
2025-12-24 10:05:03 +00:00
rcourtman
969fa0e509 test: add unit tests for AI, Kubernetes agent, and clients 2025-12-17 12:47:36 +00:00
rcourtman
a115af6906 feat: Improve cluster endpoint error messages for users
- Add sanitizeEndpointError() to transform raw Go errors into user-friendly messages
- Transform 'context deadline exceeded' into helpful messages mentioning possible causes
- Storage timeout errors now suggest checking PBS/NFS/Ceph backend connectivity
- Connection refused, certificate errors, and auth errors get actionable hints
- Apply sanitization everywhere cluster endpoint lastError is stored
- Add comprehensive tests for all error transformations
2025-12-16 21:50:02 +00:00
rcourtman
fa13919987 fix(ai-chat): Display messages chronologically in AI chatbot
- Add 'content' type to StreamDisplayEvent for tracking text chunks
- Track content events in streamEvents array for chronological display
- Update render to use Switch/Match for cleaner conditional rendering
- Interleave thinking, tool calls, and content as they stream in
- Add fallback for old messages without streamEvents for backwards compat

Previously, tool/command outputs stayed at top while AI text responses
accumulated at the bottom. Now all events appear in order like a
normal chatbot.
2025-12-11 23:02:59 +00:00
rcourtman
8948e84fe5 feat: AI features, agent improvements, and host monitoring enhancements
AI Chat Integration:
- Multi-provider support (Anthropic, OpenAI, Ollama)
- Streaming responses with markdown rendering
- Agent command execution for remote troubleshooting
- Context-aware conversations with host/container metadata

Agent Updates:
- Add --enable-proxmox flag for automatic PVE/PBS token setup
- Improve auto-update with semver comparison (prevents downgrades)
- Add updatedFrom tracking to report previous version after update
- Reduce initial update check delay from 30s to 5s
- Add agent version column to Hosts page table

Host Metrics:
- Add DiskIO stats collection (read/write bytes, ops, time)
- Improve disk filtering to exclude Docker overlay mounts
- Add RAID array monitoring via mdadm
- Enhanced temperature sensor parsing

Frontend:
- New Agent Version column on Hosts overview table
- Improved node modal with agent-first installation flow
- Add DiskIO display in host drawer
- Better responsive handling for metric bars
2025-12-05 10:37:02 +00:00
rcourtman
4f824ab148 style: Apply gofmt to 37 files
Standardize code formatting across test files and monitor.go.
No functional changes.
2025-12-02 17:21:48 +00:00
rcourtman
c812720f25 test: Add Disk UnmarshalJSON RPM and error path tests
Cover RPM field handling (numeric, string, SSD, N/A, null, invalid),
invalid JSON error path, and unexpected type fallbacks for both
wearout and RPM fields.
Coverage: 50% → 95.5%
2025-12-02 02:23:44 +00:00
rcourtman
618fc084f1 test: Add invalid user format tests for NewClient
Test error handling for password authentication user format validation:
- Missing realm separator (no @)
- Empty user string
- Multiple @ symbols

Improves NewClient coverage from 74.2% to 83.9%.
2025-12-02 01:25:11 +00:00
rcourtman
de33653dc2 test: Add invalid value tests for VMFileSystem.UnmarshalJSON
Test error handling for JSON parsing edge cases:
- Invalid JSON syntax
- Unsupported field types (bool, array)
- Unparseable string values for total-bytes and used-bytes

Improves coverage from 83.3% to 94.4%.
2025-12-02 01:22:42 +00:00
rcourtman
79afff8ba2 test: Add invalid value tests for MemoryStatus.UnmarshalJSON
Test error handling for JSON parsing edge cases:
- Invalid JSON syntax
- Unsupported field types (bool, array, object)
- Unparseable string values

Improves coverage from 70.0% to 83.3%.
2025-12-02 01:20:15 +00:00
rcourtman
22d9e2795c test: Add permanent failure test for ClusterClient.GetNodes
Tests the error logging path when all endpoints fail with auth error
(83.3% to 91.7% coverage).
2025-12-02 01:05:48 +00:00
rcourtman
5bbf7de1a3 test: Add JSON decode error test for Client.GetNodes
Tests the error path when server returns invalid JSON (87.5% to 100%).
2025-12-02 01:03:30 +00:00
rcourtman
490fd9a810 test: Add edge cases for parseReplicationJob fields
- Test jobid fallback when id field is missing
- Test jobnum field takes precedence over ID parsing
- Test last_sync_duration and duration fields
- Test last-sync-duration fallback format
- Test next_sync and next-sync fallback formats

Coverage: 79.7% → 100%
2025-12-02 00:24:40 +00:00
rcourtman
29e01f8ff5 test: Add edge case for coerceUint64 ParseUint error branch
String 'abc' without .eE characters triggers ParseUint error path.
Coverage: 97.4% to 100%.
2025-12-01 23:44:04 +00:00
rcourtman
e2172b16de test: Add edge case test for isNotImplementedError fallback branch
Tab character triggers extractStatusCode fallback path (regex \s+ matches
tab but ' 501' substring check doesn't). Coverage: 87.5% to 100%.
2025-12-01 23:18:45 +00:00
rcourtman
2afc7f0c41 test: Add edge case tests for parseWearoutValue function
Add 4 new test cases covering previously untested branches:
- Float zero exactly (0.0)
- Float negative zero (-0.0)
- Only escaped quotes becoming empty after trimming
- Quoted whitespace becoming empty after trimming

Coverage improved from 95.8% to 100%.
2025-12-01 23:02:18 +00:00
rcourtman
be892f5e07 fix: match storage timeout errors without trailing slash
The error pattern `/storage/` only matched storage content endpoints
(`/storage/{name}/content`) but not the main storage list endpoint
(`/nodes/{node}/storage`).

This caused storage timeout errors like:
  Get ".../nodes/pve-100-224/storage": context deadline exceeded

to incorrectly mark cluster nodes as unhealthy, even though the timeout
was due to a slow cross-node storage query, not actual node connectivity
issues.

Fixes #754
2025-12-01 22:48:01 +00:00
rcourtman
9097b507fd test: Add edge case tests for parseReplicationTime function
Add 13 new test cases covering previously untested branches:
- float32 timestamp with valid value (using smaller value for precision)
- float32/float64 zero and negative values
- json.Number zero and negative values
- int32 and uint32 timestamp handling
- Invalid date format strings (no matching layout)
- Partial date strings
- Unsupported types (bool, slice)

Coverage improved from 93.8% to 100%.
2025-12-01 22:44:23 +00:00
rcourtman
18472f1668 test: Add float32 NaN/Inf tests for intFromAny and floatFromAny
Add 6 test cases covering float32 special values:
- intFromAny: float32 NaN, +Inf, -Inf (all return 0, false)
- floatFromAny: float32 NaN, +Inf, -Inf (all return 0, false)

Coverage improved:
- intFromAny: 96.7% -> 100%
- floatFromAny: 95.0% -> 100%
2025-12-01 22:40:08 +00:00
rcourtman
1e9fbdfdcc test: Add edge case tests for coerceUint64 function
Add 6 new test cases covering previously untested branches:
- float64 at MaxUint64 boundary (clamping behavior)
- float64 exceeding MaxUint64 (overflow protection)
- String with quoted "null" value
- String with quoted empty value ("")
- String with single quoted empty value ('')
- Invalid float parsing in scientific notation

Coverage improved from 92.3% to 97.4%.
2025-12-01 22:36:03 +00:00
rcourtman
05b9c3ab2d test: Add tests for CPUInfo.GetMHzString method
Add 11 test cases covering:
- Nil MHz returns empty string
- String MHz returned as-is
- Empty string handling
- Float64 formatted without decimals
- Float64 zero handling
- Float64 rounding for large values
- Int formatting
- Int zero handling
- Default formatting for other types (int64, bool, slice)

Coverage: GetMHzString 0% -> 100%
2025-12-01 22:29:30 +00:00
rcourtman
1f748e8670 fix: recover unhealthy cluster nodes even when some nodes are healthy
Previously, recovery of unhealthy nodes only triggered when ALL nodes
were unhealthy. This caused individual degraded nodes to stay degraded
forever since operations would succeed on healthy nodes and never
trigger the recovery path.

Now recovery is attempted whenever any unhealthy nodes exist, allowing
clusters to recover individual nodes over time.

Also added:
- Panic-safe unlock/lock pattern using anonymous function
- Refresh of both healthy and cooling endpoints after recovery
- Updated timestamp for accurate cooldown checks

Related to #754
2025-12-01 21:47:26 +00:00
rcourtman
d9331570f5 test: Add tests for VMAgentField JSON unmarshaling
Covers both Proxmox API formats:
- Integer format (older versions): direct int value
- Object format (Proxmox 8.3+): {enabled, available} fields
- Preference order: available > enabled > 0
- Invalid input handling defaults to 0
- Integration with VMStatus struct
2025-12-01 21:40:47 +00:00
rcourtman
32333cdbbe test: Add tests for authHTTPError.Error and shouldFallbackToForm
Tests for Proxmox client authentication error handling:
- authHTTPError.Error: message formatting based on status code
  (401/403 include status in message, others don't)
- shouldFallbackToForm: determines when to retry with form encoding
  (triggers on 400/415, not on auth errors or server errors)

16 test cases covering all code paths.
2025-12-01 13:39:50 +00:00
rcourtman
42eec54d6e Add unit tests for parseWearoutValue and clampWearoutConsumed functions
52 test cases covering:
- Empty/whitespace input
- Simple numeric strings and quoted values
- Percentage symbols and N/A variants
- Float values with truncation
- Messy SMART data with digit extraction fallback
- Clamping behavior for unknown, normal, and out-of-range values
2025-12-01 09:18:04 +00:00
rcourtman
f9122d736e Add unit tests for parseUint64Flexible function
32 test cases covering all code paths:
- nil, uint64, int, int64, float64 type handling
- json.Number parsing (delegates to string branch)
- String parsing: empty, decimal, hex (0x/0X), float notation, scientific
- Negative value handling (returns 0 for numeric types)
- Error cases: invalid strings, unsupported types
2025-12-01 09:11:02 +00:00
rcourtman
37550bff6d Add unit tests for ZFS device conversion functions
Tests added by ADA run #97 but commit was missed.
Covers: RaidZ types, log/cache/spare devices, nested mirrors,
ConvertToModelZFSPool, and struct field tests.
2025-12-01 09:03:48 +00:00
rcourtman
6c18849f79 Add unit tests for cluster_client utility functions
Test coverage for error detection and retry logic:
- extractStatusCode: 13 test cases for HTTP status code extraction
- isTransientRateLimitError: 17 test cases for rate limit detection
- isNotImplementedError: 14 test cases for 501 error detection
- isVMSpecificError: 16 test cases for VM-scoped errors
- calculateRateLimitBackoff: backoff timing verification
- isAuthError: 12 test cases for authentication errors

Coverage 35.5% → 37.3%
2025-12-01 00:24:21 +00:00
rcourtman
92c2d198b1 Add unit tests for Proxmox replication utility functions
Comprehensive test coverage for JSON parsing helpers used in
replication job status parsing: stringFromAny, intFromAny,
boolFromAny, floatFromAny, parseReplicationTime, parseDurationSeconds,
parseHHMMSSToSeconds, and parseReplicationJob.

Coverage increased from 22.6% to 35.5%.
2025-11-30 02:35:11 +00:00
rcourtman
316161f989 Add unit tests for coerceUint64 and FlexInt.UnmarshalJSON
45 test cases covering:
- FlexInt: integer/float/string parsing, truncation behavior, error cases
- coerceUint64: nil, float64 (including NaN/Inf), int/int32/int64,
  uint32/uint64, json.Number, string parsing (whitespace, null, quotes,
  commas, scientific notation), unsupported types

Coverage: 20.5% -> 22.6%
2025-11-30 02:17:52 +00:00
rcourtman
69de7c25ce Fix cluster degraded status not recovering after transient failures
The previous fix (6db4ee7a) cleared stale error messages but didn't mark
endpoints as healthy again after successful operations. This caused
clusters to remain in "degraded" state permanently once any endpoint had
a temporary issue, even if all endpoints were actually working.

The fix now marks endpoints healthy in clearEndpointError() after
successful operations, ensuring degraded clusters recover automatically.

Related to #659
2025-11-29 19:04:11 +00:00
rcourtman
1b5528356b fix: clear stale errors after successful cluster operations
Previously, errors stored in ClusterClient.lastError were only cleared
during initial health checks or when recovering unhealthy nodes. This
caused stale error messages to persist in the UI even after the
underlying issues were resolved.

The fix clears cached errors in two places:
1. After passing connectivity test in getHealthyClient()
2. After successful operation in executeWithFailover()

This ensures that once an endpoint starts working again, any previous
error messages are cleared from the UI without requiring a restart.

Related to #659, #754
2025-11-27 16:22:16 +00:00
rcourtman
bc9e89696b chore: fix staticcheck U1000 unused code warnings
- Remove unused ipv6Regex from validation.go
- Suppress unused recordAlertFired/recordAlertResolved hooks (kept for future use)
- Remove unused apiLimiter rate limiter
- Remove unused stopOnce fields from csrf_store.go and session_store.go
- Remove unused lastBroadcast field from hub.go
- Remove unused lastUsedIndex field from cluster_client.go
2025-11-27 09:12:17 +00:00
rcourtman
8276ae837e chore: cleanup proxmox IsAuthError and remove stray comment
- Make IsAuthError unexported (isAuthError) since it's only used internally
- Remove stray '// test comment' from docker_metadata.go
2025-11-27 08:59:01 +00:00