Pulse

vrr/Pulse

mirror of https://github.com/rcourtman/Pulse.git synced 2026-05-03 05:50:15 +00:00

Author	SHA1	Message	Date
rcourtman	0ae2806f18	fix(memory): add guest agent /proc/meminfo fallback to avoid VM memory inflation (#1270) Proxmox status.Mem includes page cache as "used" memory, inflating reported VM usage. The existing fallbacks (balloon meminfo, RRD, linked host agent) were frequently unavailable, causing most VMs to fall through to the inflated status-mem source. Adds a new last-resort fallback that reads /proc/meminfo via the QEMU guest agent file-read endpoint to get accurate MemAvailable. Results are cached (60s positive, 5min negative backoff for unsupported VMs). Also fixes: RRD memavailable fallback missing from traditional polling path, cache key collisions in multi-PVE setups, FreeMem underflow guard inconsistency, and integer overflow in kB-to-bytes conversion.	2026-02-20 13:31:52 +00:00
rcourtman	a54d71117b	fix(proxmox): prevent guest agent errors from marking endpoints unhealthy Backport of v6 commits a87c9950 and 347d7db1. Part 1 (a87c9950): Wrap the four guest agent c.get() errors with fmt.Errorf("guest agent ...: %w", err) so isVMSpecificError() correctly scopes them to the VM rather than the cluster endpoint. Part 2 (347d7db1): Replace the 20+ pattern blocklist in executeWithFailover with an allowlist via isEndpointConnectivityError(). Only true TCP/DNS/TLS failures mark an endpoint unhealthy. Any HTTP response from Proxmox — including 500 — proves the node is reachable and returns the error without affecting endpoint health.	2026-02-18 12:59:20 +00:00
rcourtman	efa916ee2a	fix(memory): correct memory reporting for Linux VMs and FreeBSD ZFS ARC Linux VM page cache (#1270): QEMU VM memory now falls back to Proxmox RRD's memavailable metric (which excludes reclaimable page cache) when the qemu-guest-agent doesn't provide MemInfo.Available. Previously the fallback was detailedStatus.Mem (total - MemFree), inflating usage to 80%+ on VMs with normal Linux page cache. Mirrors the existing LXC rrd-memavailable path. FreeBSD ZFS ARC (#1264, #1051): The host agent now reads kstat.zfs.misc.arcstats.size via SysctlRaw on FreeBSD and subtracts the ARC size from reported memory usage. ZFS ARC is reclaimable under memory pressure (like Linux SReclaimable) but gopsutil counts it as wired/non-reclaimable, causing false 90%+ memory alerts on TrueNAS and FreeBSD hosts. Build-tagged so it compiles cleanly on all platforms. Fixes #1270 Fixes #1264 Fixes #1051 (cherry picked from commit 94502f83ff9ffc6da28aaadc946a2f7d8b4e9bac)	2026-02-18 12:56:53 +00:00
rcourtman	815c990e85	fix(proxmox): avoid 403 on apt update checks	2026-02-09 20:28:09 +00:00
rcourtman	13a6f7750c	Minor updates to main and proxmox client	2026-01-28 16:52:50 +00:00
rcourtman	ebc29b4fdb	feat: show pending apt updates for Proxmox nodes (#1083) - Add PendingUpdates and PendingUpdatesCheckedAt fields to Node model - Add GetNodePendingUpdates method to Proxmox client (calls /nodes/{node}/apt/update) - Add 30-minute polling cache to avoid excessive API calls - Add pendingUpdates to frontend Node type - Add color-coded badge in NodeSummaryTable (yellow: 1-9, orange: 10+) - Update test stubs for interface compliance Requires Sys.Audit permission on Proxmox API token to read apt updates.	2026-01-21 10:53:36 +00:00
rcourtman	96b7370f7b	test: improve coverage for API, AI, Alerts, and Frontend Utils - Add comprehensive tests for internal/api/config_handlers.go (Phases 1-3) - Improve test coverage for AI tools, chat service, and session management - Enhance alert and notification tests (ResolvedAlert, Webhook) - Add frontend unit tests for utils (searchHistory, tagColors, temperature, url) - Add Proxmox client API tests	2026-01-20 15:52:39 +00:00
rcourtman	a6a8efaa65	test: Add comprehensive test coverage across packages New test files with expanded coverage: API tests: - ai_handler_test.go: AI handler unit tests with mocking - agent_profiles_tools_test.go: Profile management tests - alerts_endpoints_test.go: Alert API endpoint tests - alerts_test.go: Updated for interface changes - audit_handlers_test.go: Audit handler tests - frontend_embed_test.go: Frontend embedding tests - metadata_handlers_test.go, metadata_provider_test.go: Metadata tests - notifications_test.go: Updated for interface changes - profile_suggestions_test.go: Profile suggestion tests - saml_service_test.go: SAML authentication tests - sensor_proxy_gate_test.go: Sensor proxy tests - updates_test.go: Updated for interface changes Agent tests: - dockeragent/signature_test.go: Docker agent signature tests - hostagent/agent_metrics_test.go: Host agent metrics tests - hostagent/commands_test.go: Command execution tests - hostagent/network_helpers_test.go: Network helper tests - hostagent/proxmox_setup_test.go: Updated setup tests - kubernetesagent/*_test.go: Kubernetes agent tests Core package tests: - monitoring/kubernetes_agents_test.go, reload_test.go - remoteconfig/client_test.go, signature_test.go - sensors/collector_test.go - updates/adapter_installsh_*_test.go: Install adapter tests - updates/manager_*_test.go: Update manager tests - websocket/hub_*_test.go: WebSocket hub tests Library tests: - pkg/audit/export_test.go: Audit export tests - pkg/metrics/store_test.go: Metrics store tests - pkg/proxmox/*_test.go: Proxmox client tests - pkg/reporting/reporting_test.go: Reporting tests - pkg/server/*_test.go: Server tests - pkg/tlsutil/extra_test.go: TLS utility tests Total: ~8000 lines of new test code	2026-01-19 19:26:18 +00:00
rcourtman	80444a9022	fix(monitor): use cluster quorum status instead of endpoint count for health Previously, when some cluster endpoints were unreachable (e.g., backup nodes intentionally offline), the cluster was marked as "degraded" even though the Proxmox cluster itself was healthy and had quorum. Now the connection health check queries the Proxmox cluster's actual quorum status. A cluster is only marked "degraded" if it has lost quorum (not enough votes for consensus), which is the actual indicator of cluster instability. This means: - Cluster with quorum + some nodes offline = "healthy" - Cluster without quorum = "degraded" (warning) - All endpoints down = "error" Fixes #1085	2026-01-11 11:54:02 +00:00
rcourtman	bd1df9f942	feat: automatic subnet preference for cluster node discovery When discovering cluster nodes, Pulse now automatically prefers IPs on the same subnet as the initial connection. This fixes the common issue where Pulse used internal cluster network IPs (e.g., 172.x.x.x) instead of management network IPs (e.g., 10.x.x.x). How it works: 1. Extract subnet from initial connection URL (assumes /24 for IPv4) 2. For each discovered node, query /nodes/{node}/network for all IPs 3. If cluster-reported IP is on a different subnet, find an IP on the preferred subnet and set it as IPOverride 4. Manual IPOverride settings are preserved and take precedence This eliminates the need for manual IPOverride configuration in most multi-network Proxmox setups. Refs #929, #1066	2026-01-08 23:12:30 +00:00
rcourtman	d0191d136f	fix: Add configurable poll timeout and handle external Ceph storage Changes: 1. Add MAX_POLL_TIMEOUT env var for large Proxmox clusters that need more than 3 minutes for polling (default: 3m, minimum: 30s) 2. Handle external Ceph storage gracefully - don't mark nodes unhealthy when Proxmox returns 'binary not installed' (e.g., for Ceph not managed by Proxmox) Related to #965	2026-01-05 23:34:33 +00:00
rcourtman	45d4d68127	fix: Add debug logging and response format handling for replication status - Add comprehensive debug logging to diagnose replication status fetch failures - Handle both array and single-object response formats from Proxmox API - Log raw response body for easier debugging - Log success/failure for each enrichment step This helps diagnose issue #992 where replication last/next sync times aren't showing. The logging will reveal if the API call is failing, returning empty data, or returning data in an unexpected format. Related to #992	2026-01-04 15:01:32 +00:00
rcourtman	4cd3e53c3e	test: add regression tests for missing frontend fields Ensures that LinkedHostAgentId, CommandsEnabled, IsLegacy, and LinkedNodeId are correctly propagated to the frontend. This prevents regressions of the bugs fixed for #952 and #971.	2026-01-02 20:45:35 +00:00
rcourtman	3fdf753a5b	Enhance devcontainer and CI workflows - Add persistent volume mounts for Go/npm caches (faster rebuilds) - Add shell config with helpful aliases and custom prompt - Add comprehensive devcontainer documentation - Add pre-commit hooks for Go formatting and linting - Use go-version-file in CI workflows instead of hardcoded versions - Simplify docker compose commands with --wait flag - Add gitignore entries for devcontainer auth files 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-01 22:29:15 +00:00
rcourtman	567a4ad147	fix(replication): fetch status from per-node endpoint The /cluster/replication endpoint only returns job configuration (guest, schedule, source, target), not status data (last_sync, next_sync, duration, fail_count, state). This fix enriches each replication job with status from the per-node endpoint /nodes/{node}/replication/{id}/status to get timing and state data needed for proper UI display. Added integration tests to verify: - Status endpoint is called and data is merged correctly - Graceful handling when status endpoint fails Fixes #992	2025-12-31 23:58:06 +00:00
rcourtman	3fd20340d1	fix: increase PBS storage content timeout to 60s PBS storage content queries with encrypted backups can take 10-20+ seconds to enumerate. The previous 30s timeout was causing intermittent failures when polling backup data from PBS storage configured in PVE. This increases the timeout to 60s to accommodate slow PBS backends while still preventing indefinite hangs on unavailable NFS/network storage.	2025-12-26 00:21:17 +00:00
rcourtman	e0dc6695fc	fix: Per-node TLS fingerprints for cluster peers (TOFU) When a PVE cluster has unique self-signed certificates on each node, Pulse would mark secondary nodes as unhealthy because only the primary node's fingerprint was used for all connections. Now, during cluster discovery, Pulse captures each node's TLS fingerprint and uses it when connecting to that specific node. This enables "Trust On First Use" (TOFU) for clusters with unique per-node certs. Changes: - Add Fingerprint field to ClusterEndpoint config - Add FetchFingerprint() to tlsutil for capturing node certs - validateNodeAPI() now captures and returns fingerprints during discovery - NewClusterClient() accepts endpointFingerprints map for per-node certs - All client creation paths use per-endpoint fingerprints when available Related to #879	2025-12-24 10:05:03 +00:00
rcourtman	969fa0e509	test: add unit tests for AI, Kubernetes agent, and clients	2025-12-17 12:47:36 +00:00
rcourtman	a115af6906	feat: Improve cluster endpoint error messages for users - Add sanitizeEndpointError() to transform raw Go errors into user-friendly messages - Transform 'context deadline exceeded' into helpful messages mentioning possible causes - Storage timeout errors now suggest checking PBS/NFS/Ceph backend connectivity - Connection refused, certificate errors, and auth errors get actionable hints - Apply sanitization everywhere cluster endpoint lastError is stored - Add comprehensive tests for all error transformations	2025-12-16 21:50:02 +00:00
rcourtman	fa13919987	fix(ai-chat): Display messages chronologically in AI chatbot - Add 'content' type to StreamDisplayEvent for tracking text chunks - Track content events in streamEvents array for chronological display - Update render to use Switch/Match for cleaner conditional rendering - Interleave thinking, tool calls, and content as they stream in - Add fallback for old messages without streamEvents for backwards compat Previously, tool/command outputs stayed at top while AI text responses accumulated at the bottom. Now all events appear in order like a normal chatbot.	2025-12-11 23:02:59 +00:00
rcourtman	8948e84fe5	feat: AI features, agent improvements, and host monitoring enhancements AI Chat Integration: - Multi-provider support (Anthropic, OpenAI, Ollama) - Streaming responses with markdown rendering - Agent command execution for remote troubleshooting - Context-aware conversations with host/container metadata Agent Updates: - Add --enable-proxmox flag for automatic PVE/PBS token setup - Improve auto-update with semver comparison (prevents downgrades) - Add updatedFrom tracking to report previous version after update - Reduce initial update check delay from 30s to 5s - Add agent version column to Hosts page table Host Metrics: - Add DiskIO stats collection (read/write bytes, ops, time) - Improve disk filtering to exclude Docker overlay mounts - Add RAID array monitoring via mdadm - Enhanced temperature sensor parsing Frontend: - New Agent Version column on Hosts overview table - Improved node modal with agent-first installation flow - Add DiskIO display in host drawer - Better responsive handling for metric bars	2025-12-05 10:37:02 +00:00
rcourtman	4f824ab148	style: Apply gofmt to 37 files Standardize code formatting across test files and monitor.go. No functional changes.	2025-12-02 17:21:48 +00:00
rcourtman	c812720f25	test: Add Disk UnmarshalJSON RPM and error path tests Cover RPM field handling (numeric, string, SSD, N/A, null, invalid), invalid JSON error path, and unexpected type fallbacks for both wearout and RPM fields. Coverage: 50% → 95.5%	2025-12-02 02:23:44 +00:00
rcourtman	618fc084f1	test: Add invalid user format tests for NewClient Test error handling for password authentication user format validation: - Missing realm separator (no @) - Empty user string - Multiple @ symbols Improves NewClient coverage from 74.2% to 83.9%.	2025-12-02 01:25:11 +00:00
rcourtman	de33653dc2	test: Add invalid value tests for VMFileSystem.UnmarshalJSON Test error handling for JSON parsing edge cases: - Invalid JSON syntax - Unsupported field types (bool, array) - Unparseable string values for total-bytes and used-bytes Improves coverage from 83.3% to 94.4%.	2025-12-02 01:22:42 +00:00
rcourtman	79afff8ba2	test: Add invalid value tests for MemoryStatus.UnmarshalJSON Test error handling for JSON parsing edge cases: - Invalid JSON syntax - Unsupported field types (bool, array, object) - Unparseable string values Improves coverage from 70.0% to 83.3%.	2025-12-02 01:20:15 +00:00
rcourtman	22d9e2795c	test: Add permanent failure test for ClusterClient.GetNodes Tests the error logging path when all endpoints fail with auth error (83.3% to 91.7% coverage).	2025-12-02 01:05:48 +00:00
rcourtman	5bbf7de1a3	test: Add JSON decode error test for Client.GetNodes Tests the error path when server returns invalid JSON (87.5% to 100%).	2025-12-02 01:03:30 +00:00
rcourtman	490fd9a810	test: Add edge cases for parseReplicationJob fields - Test jobid fallback when id field is missing - Test jobnum field takes precedence over ID parsing - Test last_sync_duration and duration fields - Test last-sync-duration fallback format - Test next_sync and next-sync fallback formats Coverage: 79.7% → 100%	2025-12-02 00:24:40 +00:00
rcourtman	29e01f8ff5	test: Add edge case for coerceUint64 ParseUint error branch String 'abc' without .eE characters triggers ParseUint error path. Coverage: 97.4% to 100%.	2025-12-01 23:44:04 +00:00
rcourtman	e2172b16de	test: Add edge case test for isNotImplementedError fallback branch Tab character triggers extractStatusCode fallback path (regex \s+ matches tab but ' 501' substring check doesn't). Coverage: 87.5% to 100%.	2025-12-01 23:18:45 +00:00
rcourtman	2afc7f0c41	test: Add edge case tests for parseWearoutValue function Add 4 new test cases covering previously untested branches: - Float zero exactly (0.0) - Float negative zero (-0.0) - Only escaped quotes becoming empty after trimming - Quoted whitespace becoming empty after trimming Coverage improved from 95.8% to 100%.	2025-12-01 23:02:18 +00:00
rcourtman	be892f5e07	fix: match storage timeout errors without trailing slash The error pattern `/storage/` only matched storage content endpoints (`/storage/{name}/content`) but not the main storage list endpoint (`/nodes/{node}/storage`). This caused storage timeout errors like: Get ".../nodes/pve-100-224/storage": context deadline exceeded to incorrectly mark cluster nodes as unhealthy, even though the timeout was due to a slow cross-node storage query, not actual node connectivity issues. Fixes #754	2025-12-01 22:48:01 +00:00
rcourtman	9097b507fd	test: Add edge case tests for parseReplicationTime function Add 13 new test cases covering previously untested branches: - float32 timestamp with valid value (using smaller value for precision) - float32/float64 zero and negative values - json.Number zero and negative values - int32 and uint32 timestamp handling - Invalid date format strings (no matching layout) - Partial date strings - Unsupported types (bool, slice) Coverage improved from 93.8% to 100%.	2025-12-01 22:44:23 +00:00
rcourtman	18472f1668	test: Add float32 NaN/Inf tests for intFromAny and floatFromAny Add 6 test cases covering float32 special values: - intFromAny: float32 NaN, +Inf, -Inf (all return 0, false) - floatFromAny: float32 NaN, +Inf, -Inf (all return 0, false) Coverage improved: - intFromAny: 96.7% -> 100% - floatFromAny: 95.0% -> 100%	2025-12-01 22:40:08 +00:00
rcourtman	1e9fbdfdcc	test: Add edge case tests for coerceUint64 function Add 6 new test cases covering previously untested branches: - float64 at MaxUint64 boundary (clamping behavior) - float64 exceeding MaxUint64 (overflow protection) - String with quoted "null" value - String with quoted empty value ("") - String with single quoted empty value ('') - Invalid float parsing in scientific notation Coverage improved from 92.3% to 97.4%.	2025-12-01 22:36:03 +00:00
rcourtman	05b9c3ab2d	test: Add tests for CPUInfo.GetMHzString method Add 11 test cases covering: - Nil MHz returns empty string - String MHz returned as-is - Empty string handling - Float64 formatted without decimals - Float64 zero handling - Float64 rounding for large values - Int formatting - Int zero handling - Default formatting for other types (int64, bool, slice) Coverage: GetMHzString 0% -> 100%	2025-12-01 22:29:30 +00:00
rcourtman	1f748e8670	fix: recover unhealthy cluster nodes even when some nodes are healthy Previously, recovery of unhealthy nodes only triggered when ALL nodes were unhealthy. This caused individual degraded nodes to stay degraded forever since operations would succeed on healthy nodes and never trigger the recovery path. Now recovery is attempted whenever any unhealthy nodes exist, allowing clusters to recover individual nodes over time. Also added: - Panic-safe unlock/lock pattern using anonymous function - Refresh of both healthy and cooling endpoints after recovery - Updated timestamp for accurate cooldown checks Related to #754	2025-12-01 21:47:26 +00:00
rcourtman	d9331570f5	test: Add tests for VMAgentField JSON unmarshaling Covers both Proxmox API formats: - Integer format (older versions): direct int value - Object format (Proxmox 8.3+): {enabled, available} fields - Preference order: available > enabled > 0 - Invalid input handling defaults to 0 - Integration with VMStatus struct	2025-12-01 21:40:47 +00:00
rcourtman	32333cdbbe	test: Add tests for authHTTPError.Error and shouldFallbackToForm Tests for Proxmox client authentication error handling: - authHTTPError.Error: message formatting based on status code (401/403 include status in message, others don't) - shouldFallbackToForm: determines when to retry with form encoding (triggers on 400/415, not on auth errors or server errors) 16 test cases covering all code paths.	2025-12-01 13:39:50 +00:00
rcourtman	42eec54d6e	Add unit tests for parseWearoutValue and clampWearoutConsumed functions 52 test cases covering: - Empty/whitespace input - Simple numeric strings and quoted values - Percentage symbols and N/A variants - Float values with truncation - Messy SMART data with digit extraction fallback - Clamping behavior for unknown, normal, and out-of-range values	2025-12-01 09:18:04 +00:00
rcourtman	f9122d736e	Add unit tests for parseUint64Flexible function 32 test cases covering all code paths: - nil, uint64, int, int64, float64 type handling - json.Number parsing (delegates to string branch) - String parsing: empty, decimal, hex (0x/0X), float notation, scientific - Negative value handling (returns 0 for numeric types) - Error cases: invalid strings, unsupported types	2025-12-01 09:11:02 +00:00
rcourtman	37550bff6d	Add unit tests for ZFS device conversion functions Tests added by ADA run #97 but commit was missed. Covers: RaidZ types, log/cache/spare devices, nested mirrors, ConvertToModelZFSPool, and struct field tests.	2025-12-01 09:03:48 +00:00
rcourtman	6c18849f79	Add unit tests for cluster_client utility functions Test coverage for error detection and retry logic: - extractStatusCode: 13 test cases for HTTP status code extraction - isTransientRateLimitError: 17 test cases for rate limit detection - isNotImplementedError: 14 test cases for 501 error detection - isVMSpecificError: 16 test cases for VM-scoped errors - calculateRateLimitBackoff: backoff timing verification - isAuthError: 12 test cases for authentication errors Coverage 35.5% → 37.3%	2025-12-01 00:24:21 +00:00
rcourtman	92c2d198b1	Add unit tests for Proxmox replication utility functions Comprehensive test coverage for JSON parsing helpers used in replication job status parsing: stringFromAny, intFromAny, boolFromAny, floatFromAny, parseReplicationTime, parseDurationSeconds, parseHHMMSSToSeconds, and parseReplicationJob. Coverage increased from 22.6% to 35.5%.	2025-11-30 02:35:11 +00:00
rcourtman	316161f989	Add unit tests for coerceUint64 and FlexInt.UnmarshalJSON 45 test cases covering: - FlexInt: integer/float/string parsing, truncation behavior, error cases - coerceUint64: nil, float64 (including NaN/Inf), int/int32/int64, uint32/uint64, json.Number, string parsing (whitespace, null, quotes, commas, scientific notation), unsupported types Coverage: 20.5% -> 22.6%	2025-11-30 02:17:52 +00:00
rcourtman	69de7c25ce	Fix cluster degraded status not recovering after transient failures The previous fix (`6db4ee7a`) cleared stale error messages but didn't mark endpoints as healthy again after successful operations. This caused clusters to remain in "degraded" state permanently once any endpoint had a temporary issue, even if all endpoints were actually working. The fix now marks endpoints healthy in clearEndpointError() after successful operations, ensuring degraded clusters recover automatically. Related to #659	2025-11-29 19:04:11 +00:00
rcourtman	1b5528356b	fix: clear stale errors after successful cluster operations Previously, errors stored in ClusterClient.lastError were only cleared during initial health checks or when recovering unhealthy nodes. This caused stale error messages to persist in the UI even after the underlying issues were resolved. The fix clears cached errors in two places: 1. After passing connectivity test in getHealthyClient() 2. After successful operation in executeWithFailover() This ensures that once an endpoint starts working again, any previous error messages are cleared from the UI without requiring a restart. Related to #659, #754	2025-11-27 16:22:16 +00:00
rcourtman	bc9e89696b	chore: fix staticcheck U1000 unused code warnings - Remove unused ipv6Regex from validation.go - Suppress unused recordAlertFired/recordAlertResolved hooks (kept for future use) - Remove unused apiLimiter rate limiter - Remove unused stopOnce fields from csrf_store.go and session_store.go - Remove unused lastBroadcast field from hub.go - Remove unused lastUsedIndex field from cluster_client.go	2025-11-27 09:12:17 +00:00
rcourtman	8276ae837e	chore: cleanup proxmox IsAuthError and remove stray comment - Make IsAuthError unexported (isAuthError) since it's only used internally - Remove stray '// test comment' from docker_metadata.go	2025-11-27 08:59:01 +00:00
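
Note on commit 0ae2806f18: the message describes deriving VM memory usage from the guest's /proc/meminfo (MemAvailable rather than MemFree), with guards for underflow and for overflow in the kB-to-bytes conversion. The following is only a minimal sketch of that arithmetic, not the code from the commit. The function name `usedFromMeminfo` and its signature are hypothetical, and the guest agent file-read call and the caching are left out.

```go
package main

import (
	"bufio"
	"errors"
	"fmt"
	"math"
	"strconv"
	"strings"
)

// usedFromMeminfo is a hypothetical helper (not Pulse's actual code).
// It parses /proc/meminfo text and returns total and used memory in bytes,
// computing used as MemTotal - MemAvailable so page cache is not counted.
func usedFromMeminfo(meminfo string) (total, used uint64, err error) {
	var totalKB, availKB uint64
	var haveTotal, haveAvail bool

	sc := bufio.NewScanner(strings.NewReader(meminfo))
	for sc.Scan() {
		fields := strings.Fields(sc.Text()) // e.g. ["MemTotal:", "2048", "kB"]
		if len(fields) < 2 {
			continue
		}
		switch fields[0] {
		case "MemTotal:":
			if totalKB, err = strconv.ParseUint(fields[1], 10, 64); err != nil {
				return 0, 0, fmt.Errorf("parse MemTotal: %w", err)
			}
			haveTotal = true
		case "MemAvailable:":
			if availKB, err = strconv.ParseUint(fields[1], 10, 64); err != nil {
				return 0, 0, fmt.Errorf("parse MemAvailable: %w", err)
			}
			haveAvail = true
		}
	}
	if err = sc.Err(); err != nil {
		return 0, 0, err
	}
	if !haveTotal || !haveAvail {
		return 0, 0, errors.New("meminfo is missing MemTotal or MemAvailable")
	}
	// Guard the kB-to-bytes conversion against uint64 overflow.
	if totalKB > math.MaxUint64/1024 {
		return 0, 0, errors.New("MemTotal overflows uint64 when converted to bytes")
	}
	// Guard against underflow if available is reported above total.
	if availKB > totalKB {
		availKB = totalKB
	}
	return totalKB * 1024, (totalKB - availKB) * 1024, nil
}

func main() {
	sample := "MemTotal:       8000000 kB\nMemFree:         500000 kB\nMemAvailable:    6000000 kB\n"
	total, used, err := usedFromMeminfo(sample)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("total=%d used=%d\n", total, used)
}
```

In the sample, MemFree is far smaller than MemAvailable; a calculation based on total minus MemFree would report most of the memory as used, which is the inflation the commit message refers to.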

81 commits
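
Note on commit d9331570f5: the message lists two API shapes for the VM agent field (a plain integer, or an object with `enabled` and `available`), a preference order of available > enabled > 0, and a default of 0 for invalid input. The following is a rough sketch of a custom unmarshaler with that behavior, not the implementation in the repository. The type name `agentField` is hypothetical, and the object fields are assumed to be numeric.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// agentField is a hypothetical stand-in for the field described in the
// commit message; the real type in the repository may differ.
type agentField struct {
	Value int
}

// UnmarshalJSON accepts either a bare integer or an object with
// "enabled" and "available" keys (assumed numeric here).
// Anything else leaves Value at 0 instead of returning an error.
func (a *agentField) UnmarshalJSON(data []byte) error {
	a.Value = 0

	// Integer format: use the value directly.
	var n int
	if err := json.Unmarshal(data, &n); err == nil {
		a.Value = n
		return nil
	}

	// Object format: prefer available, then enabled, else 0.
	var obj struct {
		Enabled   int `json:"enabled"`
		Available int `json:"available"`
	}
	if err := json.Unmarshal(data, &obj); err != nil {
		return nil // invalid input defaults to 0
	}
	switch {
	case obj.Available > 0:
		a.Value = obj.Available
	case obj.Enabled > 0:
		a.Value = obj.Enabled
	}
	return nil
}

func main() {
	inputs := []string{
		`1`,
		`{"enabled":1,"available":0}`,
		`{"enabled":1,"available":1}`,
		`"bogus"`,
	}
	for _, raw := range inputs {
		var a agentField
		_ = json.Unmarshal([]byte(raw), &a)
		fmt.Printf("%s -> %d\n", raw, a.Value)
	}
}
```

Swallowing the decode error keeps one malformed field from failing the whole status response. The cost is that a malformed field looks the same as a disabled agent.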