Commit graph

269 commits

Author SHA1 Message Date
rcourtman
03e9f98ab6 fix: Exclude autofs mount type from disk counts. Related to #942 2025-12-29 23:36:58 +00:00
rcourtman
32111c7837 feat: Add --report-ip flag for multi-NIC systems (issue #945)
Allows specifying which IP address the agent should report, useful for:
- Multi-homed systems with separate management networks
- Systems with private monitoring interfaces
- VPN/overlay network scenarios

Usage:
  pulse-agent --report-ip 192.168.1.100
  PULSE_REPORT_IP=192.168.1.100 pulse-agent
2025-12-29 09:28:28 +00:00
rcourtman
35a83afbcb fix: Filter overlay filesystems from disk metrics
Docker overlay filesystems were being counted as separate disks when
monitoring hosts running Docker. These are virtual layers, not actual
storage.

Added overlay and overlayfs to the virtualFSTypes list so they are
always excluded from disk usage calculations, regardless of their
reported usage percentage.

NFS and CIFS mounts were already being filtered correctly.

Related to #942
2025-12-28 16:12:18 +00:00
rcourtman
9f3367da36 fix: Include GuestURL in NodeFrontend for cluster node navigation
The GuestURL field was missing from NodeFrontend and its converter,
causing configured Guest URLs to be ignored when clicking on cluster
node names. The frontend would fall back to the auto-detected IP
instead of using the user-configured Guest URL.

Related to #940
2025-12-28 14:49:49 +00:00
rcourtman
b50872b686 feat: Implement unified update detection system (Phase 1)
Docker container image update detection with full stack implementation:

Backend:
- Add internal/updatedetection package with types, store, registry checker, manager
- Add registry checking to Docker agent (internal/dockeragent/registry.go)
- Add ImageDigest and UpdateStatus fields to container reports
- Add /api/infra-updates API endpoints for querying updates
- Integrate with alert system - fires after 24h of pending updates

Frontend:
- Add UpdateBadge and UpdateIcon components for update indicators
- Add updateStatus to DockerContainer TypeScript interface
- Display blue update badges in Docker unified table image column
- Add 'has:update' search filter support

Features:
- Registry digest comparison for Docker Hub, GHCR, private registries
- Auth token handling for Docker Hub public images
- Caching with 6h TTL (15min for errors)
- Configurable alert delay via UpdateAlertDelayHours (default: 24h)
- Alert metadata includes digests, pending time, image info
2025-12-27 17:58:38 +00:00
rcourtman
3d671c1824 feat(pbs): add API-based token creation for turnkey PBS setup
- Added PBS client methods: CreateUser, SetUserACL, CreateUserToken
- Added SetupMonitoringAccess() turnkey method that creates user + token
- Updated handleSecureAutoRegister to use PBS API for token creation
- Enables one-click PBS setup for Docker/containerized deployments

When users provide PBS root credentials, Pulse can now create the
monitoring user and API token remotely via the PBS API, eliminating
the need to SSH/exec into the container manually.
2025-12-26 10:08:41 +00:00
rcourtman
3fd20340d1 fix: increase PBS storage content timeout to 60s
PBS storage content queries with encrypted backups can take 10-20+ seconds
to enumerate. The previous 30s timeout was causing intermittent failures
when polling backup data from PBS storage configured in PVE.

This increases the timeout to 60s to accommodate slow PBS backends while
still preventing indefinite hangs on unavailable NFS/network storage.
2025-12-26 00:21:17 +00:00
rcourtman
86e41effc0 feat: Display environment variables for Docker containers
- Add Env field to Container struct in pkg/agents/docker/report.go
- Extract env vars from inspect.Config.Env in Docker agent
- Mask sensitive values (password, secret, key, token, etc.) with ***
- Display env vars in container drawer with green badges (amber for masked)
- Add tests for maskSensitiveEnvVars function

Related to #916
2025-12-25 23:52:57 +00:00
rcourtman
08c04b78ae feat: add power consumption monitoring (Intel RAPL + AMD Energy)
- Add power.go with Intel RAPL and AMD energy driver support
- Read CPU package, core, and DRAM power consumption in watts
- Sample energy counters over 100ms interval to calculate power
- Add PowerWatts field to Sensors struct for API reporting
- Integrate power collection into host agent sensor gathering
- Add comprehensive tests for power collection module

Supports Intel CPUs (Sandy Bridge+) via RAPL and AMD Ryzen/EPYC
via the amd_energy kernel module.

Closes community-scripts/ProxmoxVE#9575
2025-12-25 21:14:12 +00:00
rcourtman
c1422882bd feat: Add disk exclusion filter for host agent. Closes #896
Users can now exclude specific mount points from disk monitoring:
- Via CLI: --disk-exclude /mnt/backup --disk-exclude '/media/*'
- Via env: PULSE_DISK_EXCLUDE=/mnt/backup,*pbs*

Patterns support:
- Exact paths: /mnt/backup
- Prefix patterns: /mnt/ext*
- Contains patterns: *pbs*

This addresses the common case where external disks or
PBS datastores are being monitored but shouldn't be.
2025-12-25 12:04:40 +00:00
rcourtman
8f9d5c1120 feat: Agent collects S.M.A.R.T. disk data via smartctl. Related to #907
- Add smartctl package to collect disk temperature and health data
- Add SMART field to agent Sensors struct
- Host agent now runs smartctl to collect disk temps when available
- Backend processes agent SMART data for temperature display
- Graceful fallback when smartctl not installed
2025-12-25 11:37:53 +00:00
rcourtman
598285d3d2 feat: Agent reports CommandsEnabled status to server. Related to #903
- Add CommandsEnabled field to AgentInfo in pkg/agents/host/report.go
- Agent now reports whether AI command execution is enabled
- Server stores and exposes this via Host model
- Frontend can now show which agents have commands enabled
- This provides visibility before implementing remote configuration
2025-12-25 07:55:22 +00:00
rcourtman
e0dc6695fc fix: Per-node TLS fingerprints for cluster peers (TOFU)
When a PVE cluster has unique self-signed certificates on each node, Pulse
would mark secondary nodes as unhealthy because only the primary node's
fingerprint was used for all connections.

Now, during cluster discovery, Pulse captures each node's TLS fingerprint
and uses it when connecting to that specific node. This enables
"Trust On First Use" (TOFU) for clusters with unique per-node certs.

Changes:
- Add Fingerprint field to ClusterEndpoint config
- Add FetchFingerprint() to tlsutil for capturing node certs
- validateNodeAPI() now captures and returns fingerprints during discovery
- NewClusterClient() accepts endpointFingerprints map for per-node certs
- All client creation paths use per-endpoint fingerprints when available

Related to #879
2025-12-24 10:05:03 +00:00
rcourtman
d663ba4342 hostagent: avoid host ID collisions and prefer LAN IP 2025-12-17 16:29:59 +00:00
rcourtman
e44a6fdadd test(envdetect): cover environment detection decisions 2025-12-17 16:08:10 +00:00
rcourtman
969fa0e509 test: add unit tests for AI, Kubernetes agent, and clients 2025-12-17 12:47:36 +00:00
rcourtman
a115af6906 feat: Improve cluster endpoint error messages for users
- Add sanitizeEndpointError() to transform raw Go errors into user-friendly messages
- Transform 'context deadline exceeded' into helpful messages mentioning possible causes
- Storage timeout errors now suggest checking PBS/NFS/Ceph backend connectivity
- Connection refused, certificate errors, and auth errors get actionable hints
- Apply sanitization everywhere cluster endpoint lastError is stored
- Add comprehensive tests for all error transformations
2025-12-16 21:50:02 +00:00
rcourtman
3a2a73f9d6 Merge main into ai-features: incorporate latest bugfixes
Resolved conflicts:
- pkg/fsfilters/filters.go: Keep both TrueNAS and EnhanceCP filter fixes
- DockerUnifiedTable.tsx: Use main's resource column overlap fix
2025-12-13 15:18:51 +00:00
rcourtman
a259b67348 feat: add Kubernetes platform support 2025-12-12 21:31:11 +00:00
rcourtman
88d419dd5b feat(ai): Add enriched context with historical trends and predictions
Phase 1 of Pulse AI differentiation:

- Create internal/ai/context package with types, trends, builder, formatter
- Implement linear regression for trend computation (growing/declining/stable/volatile)
- Add storage capacity predictions (predicts days until 90% and 100%)
- Wire MetricsHistory from monitor to patrol service
- Update patrol to use buildEnrichedContext instead of basic summary
- Update patrol prompt to reference trend indicators and predictions

This gives the AI awareness of historical patterns, enabling it to:
- Identify resources with concerning growth rates
- Predict capacity exhaustion before it happens
- Distinguish between stable high usage vs growing problems
- Provide more actionable, time-aware insights

All tests passing. Falls back to basic summary if metrics history unavailable.
2025-12-12 09:45:57 +00:00
rcourtman
fa13919987 fix(ai-chat): Display messages chronologically in AI chatbot
- Add 'content' type to StreamDisplayEvent for tracking text chunks
- Track content events in streamEvents array for chronological display
- Update render to use Switch/Match for cleaner conditional rendering
- Interleave thinking, tool calls, and content as they stream in
- Add fallback for old messages without streamEvents for backwards compat

Previously, tool/command outputs stayed at top while AI text responses
accumulated at the bottom. Now all events appear in order like a
normal chatbot.
2025-12-11 23:02:59 +00:00
rcourtman
927ac76bad feat: AI integration, Docker metrics, RAID display, and infrastructure improvements
- Add Claude OAuth authentication support with hybrid API key/OAuth flow
- Implement Docker container historical metrics in backend and charts API
- Add CEPH cluster data collection and new Ceph page
- Enhance RAID status display with detailed tooltips and visual indicators
- Fix host deduplication logic with Docker bridge IP filtering
- Fix NVMe temperature collection in host agent
- Add comprehensive test coverage for new features
- Improve frontend sparklines and metrics history handling
- Fix navigation issues and frontend reload loops
2025-12-09 09:29:27 +00:00
rcourtman
8948e84fe5 feat: AI features, agent improvements, and host monitoring enhancements
AI Chat Integration:
- Multi-provider support (Anthropic, OpenAI, Ollama)
- Streaming responses with markdown rendering
- Agent command execution for remote troubleshooting
- Context-aware conversations with host/container metadata

Agent Updates:
- Add --enable-proxmox flag for automatic PVE/PBS token setup
- Improve auto-update with semver comparison (prevents downgrades)
- Add updatedFrom tracking to report previous version after update
- Reduce initial update check delay from 30s to 5s
- Add agent version column to Hosts page table

Host Metrics:
- Add DiskIO stats collection (read/write bytes, ops, time)
- Improve disk filtering to exclude Docker overlay mounts
- Add RAID array monitoring via mdadm
- Enhanced temperature sensor parsing

Frontend:
- New Agent Version column on Hosts overview table
- Improved node modal with agent-first installation flow
- Add DiskIO display in host drawer
- Better responsive handling for metric bars
2025-12-05 10:37:02 +00:00
rcourtman
63038b5f30 fix: Filter EnhanceCP /var/container_tmp overlay mounts from disk stats
EnhanceCP uses /var/container_tmp/{uuid}/merged for container overlays.
These are ephemeral container layers, not user storage, and should be
filtered from disk usage display. Related to #790
2025-12-04 20:11:10 +00:00
rcourtman
da51449392 fix: Exclude TrueNAS Docker overlay mounts from disk stats
Host agent was including Docker overlay2 mounts from TrueNAS SCALE's
.ix-apps directory in disk totals. These mounts inherit the ZFS pool's
AVAIL space, causing massively inflated storage numbers (e.g., 173 TB
per container overlay instead of actual usage).

Changes:
- Add /mnt/.ix-apps/docker/ to container overlay path exclusions
- Use ShouldSkipFilesystem() in host agent disk collection (was only
  using ShouldIgnoreReadOnlyFilesystem() which missed container paths)
- Add test cases for TrueNAS overlay paths

Related to #718
2025-12-04 03:03:04 +00:00
rcourtman
4c98933175 fix: Filter container overlay mounts in non-standard locations
Detect container overlay filesystem paths from various container runtimes
(Docker, Podman, LXC, EnhanceCP, etc.) that may not be in standard
/var/lib/docker or /var/lib/containers locations.

Paths containing /containers/ with overlay patterns (/overlay2/, /overlay/,
/diff/, /merged) are now filtered from disk usage aggregation.

Related to #790
2025-12-03 14:06:15 +00:00
rcourtman
4f824ab148 style: Apply gofmt to 37 files
Standardize code formatting across test files and monitor.go.
No functional changes.
2025-12-02 17:21:48 +00:00
rcourtman
c05817f9de docs: Add godoc comments to exported functions
Add missing godoc comments to:
- NewRateLimiter and Allow in ratelimit.go
- SnapshotSyncStatus in temperature_proxy.go
- NewClient and GetVersion in pkg/pmg/client.go
2025-12-02 15:58:59 +00:00
rcourtman
c812720f25 test: Add Disk UnmarshalJSON RPM and error path tests
Cover RPM field handling (numeric, string, SSD, N/A, null, invalid),
invalid JSON error path, and unexpected type fallbacks for both
wearout and RPM fields.
Coverage: 50% → 95.5%
2025-12-02 02:23:44 +00:00
rcourtman
618fc084f1 test: Add invalid user format tests for NewClient
Test error handling for password authentication user format validation:
- Missing realm separator (no @)
- Empty user string
- Multiple @ symbols

Improves NewClient coverage from 74.2% to 83.9%.
2025-12-02 01:25:11 +00:00
rcourtman
de33653dc2 test: Add invalid value tests for VMFileSystem.UnmarshalJSON
Test error handling for JSON parsing edge cases:
- Invalid JSON syntax
- Unsupported field types (bool, array)
- Unparseable string values for total-bytes and used-bytes

Improves coverage from 83.3% to 94.4%.
2025-12-02 01:22:42 +00:00
rcourtman
79afff8ba2 test: Add invalid value tests for MemoryStatus.UnmarshalJSON
Test error handling for JSON parsing edge cases:
- Invalid JSON syntax
- Unsupported field types (bool, array, object)
- Unparseable string values

Improves coverage from 70.0% to 83.3%.
2025-12-02 01:20:15 +00:00
rcourtman
22d9e2795c test: Add permanent failure test for ClusterClient.GetNodes
Tests the error logging path when all endpoints fail with auth error
(83.3% to 91.7% coverage).
2025-12-02 01:05:48 +00:00
rcourtman
5bbf7de1a3 test: Add JSON decode error test for Client.GetNodes
Tests the error path when server returns invalid JSON (87.5% to 100%).
2025-12-02 01:03:30 +00:00
rcourtman
490fd9a810 test: Add edge cases for parseReplicationJob fields
- Test jobid fallback when id field is missing
- Test jobnum field takes precedence over ID parsing
- Test last_sync_duration and duration fields
- Test last-sync-duration fallback format
- Test next_sync and next-sync fallback formats

Coverage: 79.7% → 100%
2025-12-02 00:24:40 +00:00
rcourtman
29e01f8ff5 test: Add edge case for coerceUint64 ParseUint error branch
String 'abc' without .eE characters triggers ParseUint error path.
Coverage: 97.4% to 100%.
2025-12-01 23:44:04 +00:00
rcourtman
e2172b16de test: Add edge case test for isNotImplementedError fallback branch
Tab character triggers extractStatusCode fallback path (regex \s+ matches
tab but ' 501' substring check doesn't). Coverage: 87.5% to 100%.
2025-12-01 23:18:45 +00:00
rcourtman
2afc7f0c41 test: Add edge case tests for parseWearoutValue function
Add 4 new test cases covering previously untested branches:
- Float zero exactly (0.0)
- Float negative zero (-0.0)
- Only escaped quotes becoming empty after trimming
- Quoted whitespace becoming empty after trimming

Coverage improved from 95.8% to 100%.
2025-12-01 23:02:18 +00:00
rcourtman
be892f5e07 fix: match storage timeout errors without trailing slash
The error pattern `/storage/` only matched storage content endpoints
(`/storage/{name}/content`) but not the main storage list endpoint
(`/nodes/{node}/storage`).

This caused storage timeout errors like:
  Get ".../nodes/pve-100-224/storage": context deadline exceeded

to incorrectly mark cluster nodes as unhealthy, even though the timeout
was due to a slow cross-node storage query, not actual node connectivity
issues.

Fixes #754
2025-12-01 22:48:01 +00:00
rcourtman
9097b507fd test: Add edge case tests for parseReplicationTime function
Add 13 new test cases covering previously untested branches:
- float32 timestamp with valid value (using smaller value for precision)
- float32/float64 zero and negative values
- json.Number zero and negative values
- int32 and uint32 timestamp handling
- Invalid date format strings (no matching layout)
- Partial date strings
- Unsupported types (bool, slice)

Coverage improved from 93.8% to 100%.
2025-12-01 22:44:23 +00:00
rcourtman
18472f1668 test: Add float32 NaN/Inf tests for intFromAny and floatFromAny
Add 6 test cases covering float32 special values:
- intFromAny: float32 NaN, +Inf, -Inf (all return 0, false)
- floatFromAny: float32 NaN, +Inf, -Inf (all return 0, false)

Coverage improved:
- intFromAny: 96.7% -> 100%
- floatFromAny: 95.0% -> 100%
2025-12-01 22:40:08 +00:00
rcourtman
1e9fbdfdcc test: Add edge case tests for coerceUint64 function
Add 6 new test cases covering previously untested branches:
- float64 at MaxUint64 boundary (clamping behavior)
- float64 exceeding MaxUint64 (overflow protection)
- String with quoted "null" value
- String with quoted empty value ("")
- String with single quoted empty value ('')
- Invalid float parsing in scientific notation

Coverage improved from 92.3% to 97.4%.
2025-12-01 22:36:03 +00:00
rcourtman
05b9c3ab2d test: Add tests for CPUInfo.GetMHzString method
Add 11 test cases covering:
- Nil MHz returns empty string
- String MHz returned as-is
- Empty string handling
- Float64 formatted without decimals
- Float64 zero handling
- Float64 rounding for large values
- Int formatting
- Int zero handling
- Default formatting for other types (int64, bool, slice)

Coverage: GetMHzString 0% -> 100%
2025-12-01 22:29:30 +00:00
rcourtman
1f748e8670 fix: recover unhealthy cluster nodes even when some nodes are healthy
Previously, recovery of unhealthy nodes only triggered when ALL nodes
were unhealthy. This caused individual degraded nodes to stay degraded
forever since operations would succeed on healthy nodes and never
trigger the recovery path.

Now recovery is attempted whenever any unhealthy nodes exist, allowing
clusters to recover individual nodes over time.

Also added:
- Panic-safe unlock/lock pattern using anonymous function
- Refresh of both healthy and cooling endpoints after recovery
- Updated timestamp for accurate cooldown checks

Related to #754
2025-12-01 21:47:26 +00:00
rcourtman
d9331570f5 test: Add tests for VMAgentField JSON unmarshaling
Covers both Proxmox API formats:
- Integer format (older versions): direct int value
- Object format (Proxmox 8.3+): {enabled, available} fields
- Preference order: available > enabled > 0
- Invalid input handling defaults to 0
- Integration with VMStatus struct
2025-12-01 21:40:47 +00:00
rcourtman
32333cdbbe test: Add tests for authHTTPError.Error and shouldFallbackToForm
Tests for Proxmox client authentication error handling:
- authHTTPError.Error: message formatting based on status code
  (401/403 include status in message, others don't)
- shouldFallbackToForm: determines when to retry with form encoding
  (triggers on 400/415, not on auth errors or server errors)

16 test cases covering all code paths.
2025-12-01 13:39:50 +00:00
rcourtman
42eec54d6e Add unit tests for parseWearoutValue and clampWearoutConsumed functions
52 test cases covering:
- Empty/whitespace input
- Simple numeric strings and quoted values
- Percentage symbols and N/A variants
- Float values with truncation
- Messy SMART data with digit extraction fallback
- Clamping behavior for unknown, normal, and out-of-range values
2025-12-01 09:18:04 +00:00
rcourtman
f9122d736e Add unit tests for parseUint64Flexible function
32 test cases covering all code paths:
- nil, uint64, int, int64, float64 type handling
- json.Number parsing (delegates to string branch)
- String parsing: empty, decimal, hex (0x/0X), float notation, scientific
- Negative value handling (returns 0 for numeric types)
- Error cases: invalid strings, unsupported types
2025-12-01 09:11:02 +00:00
rcourtman
37550bff6d Add unit tests for ZFS device conversion functions
Tests added by ADA run #97 but commit was missed.
Covers: RaidZ types, log/cache/spare devices, nested mirrors,
ConvertToModelZFSPool, and struct field tests.
2025-12-01 09:03:48 +00:00
rcourtman
68c0e79b21 Add unit tests for cloneProfile and clonePhase functions in discovery
Add comprehensive tests for the cloneProfile and clonePhase utility
functions in pkg/discovery/discovery.go. Tests verify deep copying
behavior for all fields including subnets, metadata, warnings, extra
targets, and phases to ensure mutations don't affect original objects.
2025-12-01 01:51:54 +00:00