Commit graph

73 commits

Author SHA1 Message Date
rcourtman
4120d75359 Surface shared cluster-only storage in alerts (#1341) 2026-03-30 19:25:54 +01:00
rcourtman
03b620a429 Parallelize legacy Proxmox VM guest-agent polling (#1319) 2026-03-27 16:20:48 +00:00
rcourtman
0abbb8ba92 Rotate legacy guest-agent VM priority across polls (#1319) 2026-03-27 16:17:48 +00:00
rcourtman
01f916dcb5 Use linked host-agent disk data for guest fallback (#1319) 2026-03-27 15:56:20 +00:00
rcourtman
d4242d9a13 Fix ZFS pool attachment in storage frontend (discussion #1351) 2026-03-27 14:59:52 +00:00
rcourtman
2a4432048a Continue guest-agent polling after transient status failures (#1319) 2026-03-27 14:50:28 +00:00
rcourtman
01e4227ec7 Preserve cached guest metadata in legacy PVE VM poll (#1319) 2026-03-27 14:35:40 +00:00
rcourtman
e508bc3380 Prefer sane VM free-mem fallback over false full-usage samples (#1319) 2026-03-27 13:55:07 +00:00
rcourtman
8fc41f774c Keep normalized Windows guest disks in efficient VM polling (#1319) 2026-03-27 13:51:55 +00:00
rcourtman
627181566a Allow SSH temperature fallback when host agent lacks SMART 2026-03-26 22:40:43 +00:00
rcourtman
ae6b663e95 Attach ZFS pools for dataset-backed storages 2026-03-26 22:29:32 +00:00
rcourtman
92e6075ee4 Fix ZFS pool matching for local-zfs storages 2026-03-26 09:09:17 +00:00
rcourtman
e9bbc35bae Stabilize repeated low-trust VM memory fallbacks (#1319) 2026-03-26 00:23:29 +00:00
rcourtman
2196327769 Preserve VM guest metadata across transient agent gaps (#1319) 2026-03-26 00:12:19 +00:00
rcourtman
4ad7e51875 Prefer linked host disk metrics for v5 Proxmox nodes 2026-03-25 16:54:00 +00:00
rcourtman
7dab977d91 Add split memory bar showing Used | Cache | Free segments (#1302)
Show reclaimable buff/cache as a distinct amber segment between used
(green) and free (gray) in the memory bar. This explains why Pulse's
memory percentage differs from Proxmox: Pulse reports cache-aware
usage (MemAvailable) while Proxmox includes cache as used (Total-Free).

Backend: add Cache field to Memory model, derived from MemInfo
(Available - Free). Only uses MemInfo.Free (not FreeMem fallback) to
avoid inflating cache by the balloon gap on ballooned VMs.

Frontend: StackedMemoryBar renders three segments with tooltip
breakdown. Tooltip Free accounts for balloon limit when active.
Percentage label and alerts remain cache-aware (unchanged).
2026-03-10 10:16:14 +00:00
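Note on the commit above: the three-segment split can be sketched roughly as follows. This is a minimal illustration, not Pulse's actual model. The `MemInfo` and `Segments` types and the `SplitMemory` function are hypothetical names; only the Cache = Available - Free derivation comes from the commit message, and the clamping is an added assumption.

```go
package main

// MemInfo is a hypothetical stand-in for guest-reported memory figures (bytes).
type MemInfo struct {
	Total, Free, Available uint64
}

// Segments holds the three bar segments described in the commit message.
type Segments struct {
	Used, Cache, Free uint64
}

// SplitMemory derives Used | Cache | Free from MemInfo.
// Cache is Available - Free; values are clamped so no subtraction can wrap.
func SplitMemory(m MemInfo) Segments {
	avail := m.Available
	if avail > m.Total {
		avail = m.Total
	}
	free := m.Free
	if free > avail {
		free = avail
	}
	return Segments{
		Used:  m.Total - avail, // cache-aware usage
		Cache: avail - free,    // reclaimable buff/cache
		Free:  free,
	}
}
```

With Total 8, Available 5 and Free 2 (in any unit), the segments come out as Used 3, Cache 3, Free 2.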
rcourtman
abbd0df609 Fix disk metric spikes when guest agent intermittently fails (#1319)
Carry forward previous cycle's disk data when the QEMU guest agent
times out or errors, instead of falling back to Proxmox cluster/resources
which always reports 0 for VM disk usage. Applied to both polling paths
(pollVMsAndContainersEfficient and pollVMsWithNodes) with safety guards
against uint64 underflow and permanent-failure exclusions.
2026-03-09 18:23:15 +00:00
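Note on the commit above: a minimal sketch of the carry-forward idea, assuming a per-guest record of the previous cycle's disk data. `DiskUsage`, `ResolveDisk`, `FreeBytes` and `ErrAgentTimeout` are hypothetical names for illustration; the real polling paths and the permanent-failure exclusions are not reproduced here.

```go
package main

import "errors"

// ErrAgentTimeout is a hypothetical sentinel for a failed guest-agent call.
var ErrAgentTimeout = errors.New("guest agent timeout")

// DiskUsage holds byte counts for one guest.
type DiskUsage struct {
	Total, Used uint64
}

// ResolveDisk returns fresh data when the agent call succeeded, otherwise the
// previous cycle's value if one exists. The bool reports whether usable data
// was found, so callers never have to treat a zero reading as real.
func ResolveDisk(prev *DiskUsage, fresh DiskUsage, agentErr error) (DiskUsage, bool) {
	if agentErr == nil {
		return fresh, true
	}
	if prev != nil {
		return *prev, true
	}
	return DiskUsage{}, false
}

// FreeBytes subtracts without wrapping around when Used exceeds Total.
func FreeBytes(d DiskUsage) uint64 {
	if d.Used > d.Total {
		return 0
	}
	return d.Total - d.Used
}
```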
rcourtman
572520ebc6 Promote guest-agent /proc/meminfo fallback for accurate VM memory (#1270)
Move the guest-agent file-read of /proc/meminfo earlier in the memory
fallback chain so it runs before RRD, giving real-time MemAvailable that
correctly excludes reclaimable buff/cache on Linux VMs. Also add
VM.GuestAgent.FileRead permission for PVE 9 and fix install.sh to use
comma-separated privilege strings.
2026-03-09 10:04:28 +00:00
rcourtman
ff1bbe2fb8 Guard per-VM guest agent calls with timeout and panic recovery (#1319)
A broken or hung qemu-agent on one VM could stall the entire polling
loop, preventing higher-VMID VMs from being detected. Wrap all guest
agent work in a 10s per-VM budget with panic recovery, and add a 2s
timeout to GetVMStatus in the efficient poller to match the legacy path.
2026-03-07 22:30:18 +00:00
rcourtman
499ab812e3 Fix post-release regressions and lock v5 to single-tenant runtime 2026-03-05 23:46:35 +00:00
rcourtman
a4571f580b fix(monitoring): harden VM memory selection and flag repeated VM usage 2026-03-03 16:19:17 +00:00
rcourtman
60bdc9a101 fix(memory): skip meminfo-derived when balloon lacks cache metrics (#1302)
When the balloon driver reports Free but not Buffers or Cached, the
meminfo-derived fallback computed memAvailable = Free alone, counting
all reclaimable page cache as used memory. This caused Linux VMs to
show wildly inflated usage (e.g. 93% when actual is 21%).

Now meminfo-derived requires at least one cache metric (Buffers > 0
or Cached > 0) before trusting the value. When missing, the code
falls through to RRD/guest-agent/Total-Used fallbacks which provide
accurate cache-aware data. Both efficient and traditional polling
paths are now consistent.
2026-03-02 11:48:18 +00:00
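Note on the commit above: the gate can be sketched as follows. `BalloonInfo` and `MeminfoDerivedAvailable` are hypothetical names, and the Free + Buffers + Cached sum is an assumption inferred from the message's description of the old behavior; the actual Pulse code may differ.

```go
package main

// BalloonInfo is a hypothetical view of what the balloon driver reports (bytes).
type BalloonInfo struct {
	Free, Buffers, Cached uint64
}

// MeminfoDerivedAvailable returns an available-memory estimate only when at
// least one cache metric is present. Without Buffers or Cached, Free alone
// would count all reclaimable page cache as used, so the caller should fall
// through to the next source instead.
func MeminfoDerivedAvailable(b BalloonInfo) (uint64, bool) {
	if b.Buffers == 0 && b.Cached == 0 {
		return 0, false
	}
	return b.Free + b.Buffers + b.Cached, true
}
```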
rcourtman
32746e2d2a fix(monitoring): use RRD memavailable fallback when PVE node cache metrics missing (#1270)
When Proxmox /nodes/{node}/status returns only total/used/free without
available/buffers/cached, EffectiveAvailable() returns Free (non-zero),
causing the RRD fallback gate to be skipped. This results in inflated
node memory where cache/buffers are counted as "used."

Widen the RRD fallback condition from requiring effectiveAvailable == 0
to triggering whenever missingCacheMetrics is true. Add negative caching
for failed RRD lookups (2-minute backoff) to avoid repeated retries.
2026-02-21 22:47:20 +00:00
rcourtman
0ae2806f18 fix(memory): add guest agent /proc/meminfo fallback to avoid VM memory inflation (#1270)
Proxmox status.Mem includes page cache as "used" memory, inflating
reported VM usage. The existing fallbacks (balloon meminfo, RRD, linked
host agent) were frequently unavailable, causing most VMs to fall
through to the inflated status-mem source.

Adds a new last-resort fallback that reads /proc/meminfo via the QEMU
guest agent file-read endpoint to get accurate MemAvailable. Results
are cached (60s positive, 5min negative backoff for unsupported VMs).

Also fixes: RRD memavailable fallback missing from traditional polling
path, cache key collisions in multi-PVE setups, FreeMem underflow
guard inconsistency, and integer overflow in kB-to-bytes conversion.
2026-02-20 13:31:52 +00:00
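Note on the commit above: extracting MemAvailable from /proc/meminfo text with an overflow-safe kB-to-bytes conversion can be sketched as below. `ParseMemAvailable` is a hypothetical helper; how the file contents are fetched through the guest agent is outside this sketch.

```go
package main

import (
	"math"
	"strconv"
	"strings"
)

// ParseMemAvailable extracts MemAvailable from /proc/meminfo contents and
// returns it in bytes. It reports false when the field is missing, malformed,
// or too large to convert from kB without overflowing uint64.
func ParseMemAvailable(meminfo string) (uint64, bool) {
	for _, line := range strings.Split(meminfo, "\n") {
		fields := strings.Fields(line)
		if len(fields) < 2 || fields[0] != "MemAvailable:" {
			continue
		}
		kb, err := strconv.ParseUint(fields[1], 10, 64)
		if err != nil || kb > math.MaxUint64/1024 {
			return 0, false
		}
		return kb * 1024, true
	}
	return 0, false
}
```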
rcourtman
fb7582c7e4 fix(memory): use linked Pulse host agent memory to avoid VM inflation (#1270)
When no guest agent MemInfo or RRD data is available, prefer the linked
Pulse host agent's memory (read from /proc/meminfo via gopsutil, which
excludes page cache) over Proxmox's status.Mem (total - free, inflated
by reclaimable cache). Applied to both efficient and traditional polling
paths. Diagnostic fields added to VMMemoryRaw for visibility.
2026-02-19 19:04:19 +00:00
rcourtman
d4ff967815 fix: scope shared storage aggregation to per-instance to prevent cross-instance merging
The shared storage deduplication key was just the storage name, causing
storages with the same name from different Proxmox instances (or PVE + PBS)
to be incorrectly merged into a single entry. This made one random host
appear to have all storages from all instances.

Include the instance name in the aggregation key so shared storage is only
merged within the same Proxmox cluster/instance.

Fixes #1246
2026-02-11 09:18:09 +00:00
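Note on the commit above: scoping the deduplication key to the instance can be sketched like this. The `Storage` type, `storageKey` and `DedupeShared` are hypothetical names used only to show the key change.

```go
package main

// Storage is a hypothetical, trimmed-down storage record.
type Storage struct {
	Instance string // Proxmox instance or cluster the storage belongs to
	Name     string
	Node     string
}

// storageKey includes the instance so same-named storages on different
// instances are never merged.
type storageKey struct {
	instance, name string
}

// DedupeShared keeps the first entry seen for each (instance, name) pair.
func DedupeShared(in []Storage) []Storage {
	seen := make(map[storageKey]bool)
	var out []Storage
	for _, s := range in {
		k := storageKey{instance: s.Instance, name: s.Name}
		if seen[k] {
			continue
		}
		seen[k] = true
		out = append(out, s)
	}
	return out
}
```

Keying on the name alone would collapse the `pve-a` and `pve-b` entries in the usage below into one.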
rcourtman
902bdd92c2 fix: prefer status-mem over status-freemem for VM memory calculation
Proxmox's FreeMem field reports free memory relative to the balloon's
guest-visible total (total_mem), not relative to MaxMem. When ballooning
is active and the VM's memory has been reduced, subtracting FreeMem from
MaxMem produces wildly inflated usage (e.g. 97% when actual usage is 20%).

Proxmox's Mem field is already calculated as (total_mem - free_mem),
giving the correct used bytes regardless of balloon state. Swap the
priority so Mem is checked before FreeMem.

Related to #1185
2026-02-04 12:08:33 +00:00
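Note on the commit above: the priority swap can be sketched as follows. `VMStatus` and `UsedBytes` are hypothetical names that mirror the fields mentioned in the message; the clamping and the zero fallback are added assumptions.

```go
package main

// VMStatus is a hypothetical subset of the Proxmox VM status fields (bytes).
type VMStatus struct {
	MaxMem  uint64 // configured memory
	Mem     uint64 // used, already computed against the guest-visible total
	FreeMem uint64 // free, relative to the ballooned guest-visible total
}

// UsedBytes prefers Mem and only falls back to MaxMem - FreeMem when Mem is
// absent. The fallback is guarded so it cannot wrap around.
func UsedBytes(s VMStatus) uint64 {
	switch {
	case s.Mem > 0:
		if s.MaxMem > 0 && s.Mem > s.MaxMem {
			return s.MaxMem
		}
		return s.Mem
	case s.FreeMem > 0 && s.FreeMem <= s.MaxMem:
		return s.MaxMem - s.FreeMem
	default:
		return 0
	}
}
```

For a VM with MaxMem 4096, Mem 800 and FreeMem 200 (ballooned down to about 1000), this returns 800, whereas MaxMem - FreeMem would report 3896.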
rcourtman
19a67dd4f3 Update core infrastructure components
Config:
- AI configuration improvements
- API tokens handling
- Persistence layer updates

Host Agent:
- Command execution improvements
- Better test coverage

Infrastructure Discovery:
- Service improvements
- Enhanced test coverage

Models:
- State snapshot updates
- Model improvements

Monitoring:
- Polling improvements
- Guest config handling
- Storage config support

WebSocket:
- Hub tenant test updates

Service Discovery:
- New service discovery module
2026-01-28 16:52:35 +00:00
rcourtman
2e0da42a81 chore: reliability and maintenance improvements
Host agent:
- Add SHA256 checksum verification for downloaded binaries
- Verify checksum file matches expected bundle filename

WebSocket:
- Add write failure tracking with graceful disconnection
- Increase write deadline to 30s for large state payloads
- Better handling for slow clients (Raspberry Pi, slow networks)

Monitoring:
- Remove unused temperature proxy imports
- Add monitor polling improvements
- Expand test coverage

Other:
- Update package.json dependencies
- Fix generate-release-notes.sh path handling
- Minor reporting engine cleanup
2026-01-22 00:45:04 +00:00
rcourtman
ebc29b4fdb feat: show pending apt updates for Proxmox nodes (#1083)
- Add PendingUpdates and PendingUpdatesCheckedAt fields to Node model
- Add GetNodePendingUpdates method to Proxmox client (calls /nodes/{node}/apt/update)
- Add 30-minute polling cache to avoid excessive API calls
- Add pendingUpdates to frontend Node type
- Add color-coded badge in NodeSummaryTable (yellow: 1-9, orange: 10+)
- Update test stubs for interface compliance

Requires Sys.Audit permission on Proxmox API token to read apt updates.
2026-01-21 10:53:36 +00:00
rcourtman
103eb9c3e0 feat(monitoring): auto-detect Docker inside LXC containers
Adds automatic Docker detection for Proxmox LXC containers:
- New HasDocker and DockerCheckedAt fields on Container model
- Docker socket check via connected agents on first run, restart, or start
- Parallel checking with timeouts for efficiency
- Caches results and only re-checks after state transitions

This enables the AI to know which LXC containers are Docker hosts
for better infrastructure guidance.
2026-01-17 14:42:52 +00:00
rcourtman
1f4f0472b0 fix: use configured memory (MaxMem) instead of balloon for VM total
Previously, when memory ballooning was active on a VM, Pulse would use
the balloon value as the total memory instead of the configured MaxMem.
This caused confusing displays where a 4GB VM with 1GB balloon would
show "94% (966MB/1GB)" instead of "24% (966MB/4GB)".

The balloon value is still tracked in memory.balloon for the frontend's
yellow balloon marker visualization, but no longer replaces the total.

Fixes #1070
2026-01-10 15:37:45 +00:00
rcourtman
2a8f55d719 feat(enterprise): add Advanced Reporting and Audit Webhooks integration
This commit adds enterprise-grade reporting and audit capabilities:

Reporting:
- Refactored metrics store from internal/ to pkg/ for enterprise access
- Added pkg/reporting with shared interfaces for report generation
- Created API endpoint: GET /api/admin/reports/generate
- New ReportingPanel.tsx for PDF/CSV report configuration

Audit Webhooks:
- Extended pkg/audit with webhook URL management interface
- Added API endpoint: GET/POST /api/admin/webhooks/audit
- New AuditWebhookPanel.tsx for webhook configuration
- Updated Settings.tsx with Reporting and Webhooks tabs

Server Hardening:
- Enterprise hooks now execute outside mutex with panic recovery
- Removed dbPath from metrics Stats API to prevent path disclosure
- Added storage metrics persistence to polling loop

Documentation:
- Updated README.md feature table
- Updated docs/API.md with new endpoints
- Updated docs/PULSE_PRO.md with feature descriptions
- Updated docs/WEBHOOKS.md with audit webhooks section
2026-01-09 21:31:49 +00:00
rcourtman
3e2824a7ff feat: remove Enterprise badges, simplify Pro upgrade prompts
- Replace barrel import in AuditLogPanel.tsx to fix ad-blocker crash
- Remove all Enterprise/Pro badges from nav and feature headers
- Simplify upgrade CTAs to clean 'Upgrade to Pro' links
- Update docs: PULSE_PRO.md, API.md, README.md, SECURITY.md
- Align terminology: single Pro tier, no separate Enterprise tier

Also includes prior refactoring:
- Move auth package to pkg/auth for enterprise reuse
- Export server functions for testability
- Stabilize CLI tests
2026-01-09 16:51:08 +00:00
rcourtman
5c4399d69f feat(agent): add DisableCeph toggle, report_ip remote config, and improved IP detection (#929) 2026-01-09 14:45:29 +00:00
rcourtman
568aac6bd0 fix: multiple triage fixes for stability and correctness
1. Use correct mutex (diagMu) in cleanupDiagnosticSnapshots to prevent
   "concurrent map iteration and map write" panics (Fixes #1063)

2. Use cluster name for storage instance comparison in UpdateStorageForInstance
   to prevent storage duplication in clustered Proxmox setups (Fixes #1062)

3. Fix KUBECONFIG unbound variable error in install.sh by using ${KUBECONFIG:-}
   default parameter expansion (Fixes #1065)
2026-01-08 22:54:33 +00:00
rcourtman
06ebaf50b2 fix: use consistent ID for shared storage to prevent duplication (#1049)
Shared storage was duplicating across polling cycles because the ID
included the node name of whichever node reported it first. When a
different node reported first on the next cycle, a new ID was created.

This fix updates the shared storage aggregation to use a consistent ID
format (instance-cluster-storageName) that doesn't include the node name.

Closes #1049. Thanks to @siccous for the report and initial investigation.
2026-01-08 21:29:24 +00:00
rcourtman
9cfcdbb247 fix: Use per-node shared flag for storage deduplication
The storage deduplication logic only checked cluster config's Shared
flag, but this required the cluster config API call to succeed. When
the per-node storage API already returns shared=1 (as the user
verified), we should use that directly.

Now we check three sources for shared storage detection:
1. Per-node API shared flag (storage.Shared)
2. Cluster config shared flag (if available)
3. Storage type heuristics (NFS, RBD, PBS, etc.)

Related to #1049
2026-01-07 10:16:23 +00:00
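Note on the commit above: the three detection sources can be combined roughly as shown here. `IsShared` and `sharedTypes` are hypothetical names, and the lowercase type strings are assumed spellings of the storage types listed in the related commit; the order of checks follows the numbered list in the message.

```go
package main

import "strings"

// sharedTypes lists storage types treated as inherently cluster-wide.
// The exact strings are an assumption for illustration.
var sharedTypes = map[string]bool{
	"rbd": true, "cephfs": true, "pbs": true, "glusterfs": true,
	"nfs": true, "cifs": true, "iscsi": true,
}

// IsShared checks, in order: the per-node API flag, the cluster config flag
// (nil when the cluster config call failed), and the storage type heuristic.
func IsShared(nodeShared bool, clusterShared *bool, storageType string) bool {
	if nodeShared {
		return true
	}
	if clusterShared != nil && *clusterShared {
		return true
	}
	return sharedTypes[strings.ToLower(storageType)]
}
```

Modeling the cluster flag as a pointer makes "unavailable" distinct from "false", which matches the situation the commit describes.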
rcourtman
96d06da0d7 fix: Deduplicate shared storages (NFS, RBD, PBS, etc) in cluster view
Shared storages were appearing multiple times (once per node) because
the deduplication logic only checked the Proxmox `Shared` flag. Many
storage types are inherently cluster-wide but don't set this flag:

- RBD (Ceph block storage)
- CephFS
- PBS (Proxmox Backup Server)
- GlusterFS
- NFS
- CIFS/SMB
- iSCSI

Now we detect shared storage based on both the Shared flag AND the
storage type. Inherently shared storage types are deduplicated and
shown once with a "cluster" node designation.

Related to #1049
2026-01-06 17:44:52 +00:00
rcourtman
ed78509f92 Fix flaky tests and improve coverage across alerts, api, and config packages
- Fix deadlock and race conditions in internal/alerts
- Add comprehensive error path tests for internal/config
- Fix 401 handling in internal/api
- Fix Docker Swarm task filtering test logic
2026-01-03 18:36:17 +00:00
rcourtman
800fab10c2 fix: Use LinkedNodeID for temperature matching to fix duplicate hostname bug
When two Proxmox nodes have the same hostname (e.g., 'px1' on different IPs),
the getHostAgentTemperature function was matching by hostname alone, causing
both nodes to show temperature from whichever host agent appeared first.

The fix:
- Added getHostAgentTemperatureByID that first tries matching by LinkedNodeID
  (the unique node ID) before falling back to hostname matching
- Updated the caller to pass modelNode.ID for precise matching
- Maintains backwards compatibility for setups where linking hasn't occurred

Related to #891
2025-12-25 10:00:19 +00:00
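Note on the commit above: the ID-first lookup with hostname fallback can be sketched as below. `HostAgent` and `FindAgent` are hypothetical stand-ins; the real function named in the message is getHostAgentTemperatureByID, whose signature is not shown in this log.

```go
package main

// HostAgent is a hypothetical, trimmed-down host agent record.
type HostAgent struct {
	LinkedNodeID string // unique node ID once linking has occurred
	Hostname     string
	TempC        float64
}

// FindAgent matches by LinkedNodeID first and only falls back to hostname
// when no agent is linked to the node.
func FindAgent(agents []HostAgent, nodeID, hostname string) (HostAgent, bool) {
	if nodeID != "" {
		for _, a := range agents {
			if a.LinkedNodeID == nodeID {
				return a, true
			}
		}
	}
	for _, a := range agents {
		if a.Hostname == hostname {
			return a, true
		}
	}
	return HostAgent{}, false
}
```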
rcourtman
968e0a7b3d fix: reduce syslog flooding by downgrading routine logs to debug level
Addresses issue #861 - syslog flooded on docker host

Many routine operational messages were being logged at INFO level,
causing excessive log volume when monitoring multiple VMs/containers.
These messages are now logged at DEBUG level:

- Guest threshold checking (every guest, every poll cycle)
- Storage threshold checking (every storage, every poll cycle)
- Host agent linking messages
- Filesystem inclusion in disk calculation
- Guest agent disk usage replacement
- Polling start/completion messages
- Alert cleanup and save messages

Users can set LOG_LEVEL=debug to see these messages if needed for
troubleshooting. The default INFO level now produces significantly
less log output.

Also updated documentation in CONFIGURATION.md and DOCKER.md to:
- Clarify what each log level includes
- Add tip about using LOG_LEVEL=warn for minimal logging
2025-12-18 23:27:32 +00:00
rcourtman
c91307be94 fix: guest URL icon now appears/disappears immediately after AI sets/removes it
The issue was a SolidJS reactivity problem in the Dashboard component.
When guestMetadata signal was accessed inside a For loop callback and
assigned to a plain variable, SolidJS lost reactive tracking.

Changed from:
  const metadata = guestMetadata()[guestId] || ...
  customUrl={metadata?.customUrl}

To:
  const getMetadata = () => guestMetadata()[guestId] || ...
  customUrl={getMetadata()?.customUrl}

This ensures SolidJS properly tracks the signal dependency when the
getter function is called directly in JSX props.
2025-12-18 14:42:47 +00:00
rcourtman
397871629c fix: cluster-aware guest deduplication and multi-agent token binding
- Add cluster-aware guest ID generation (clusterName-VMID instead of instanceName-VMID)
  to prevent duplicate VMs/containers when multiple cluster nodes are monitored

- Add cluster deduplication at registration time - when a node is added that belongs
  to an already-configured cluster, merge as endpoint instead of creating duplicate

- Add startup consolidation to automatically merge duplicate cluster instances

- Change host agent token binding from agent GUID to hostname, allowing:
  - Multiple host agents to share a token (each bound by hostname)
  - Agent reinstalls on same host without token conflicts

- Remove 12-character password minimum requirement

- Remove emoji from auto-registration success message

- Fix grouped view node lookup to support both cluster-aware node IDs
  (clusterName-nodeName) and legacy guest grouping keys (instance-nodeName)

Fixes duplicate guests appearing when agents are installed on multiple
cluster nodes. Also improves multi-agent UX by allowing shared tokens.
2025-12-14 10:16:17 +00:00
rcourtman
c7361362b3 fix: Robust OCI container detection with state persistence
Backend:
- Seed OCI classification from previous state so containers never
  'downgrade' to LXC if config fetching intermittently fails
- Prevent type regression in recordGuestSnapshot when OCI was previously detected
- Move metrics zeroing before snapshot recording for cleaner flow

Frontend:
- Add isOCIContainer() memo that checks both type and isOci flag
- Use isOCI helper in Dashboard.tsx for AI context building
- Include oci-container type in useResources container conversion
- Preserve isOci and osTemplate fields through legacy conversion

This ensures OCI containers retain their classification even when
Proxmox API permissions or transient errors prevent config reads.
2025-12-12 20:06:39 +00:00
rcourtman
fa13919987 fix(ai-chat): Display messages chronologically in AI chatbot
- Add 'content' type to StreamDisplayEvent for tracking text chunks
- Track content events in streamEvents array for chronological display
- Update render to use Switch/Match for cleaner conditional rendering
- Interleave thinking, tool calls, and content as they stream in
- Add fallback for old messages without streamEvents for backwards compat

Previously, tool/command outputs stayed at top while AI text responses
accumulated at the bottom. Now all events appear in order like a
normal chatbot.
2025-12-11 23:02:59 +00:00
rcourtman
927ac76bad feat: AI integration, Docker metrics, RAID display, and infrastructure improvements
- Add Claude OAuth authentication support with hybrid API key/OAuth flow
- Implement Docker container historical metrics in backend and charts API
- Add Ceph cluster data collection and new Ceph page
- Enhance RAID status display with detailed tooltips and visual indicators
- Fix host deduplication logic with Docker bridge IP filtering
- Fix NVMe temperature collection in host agent
- Add comprehensive test coverage for new features
- Improve frontend sparklines and metrics history handling
- Fix navigation issues and frontend reload loops
2025-12-09 09:29:27 +00:00
rcourtman
bcd7b550d4 AI Problem Solver implementation and various fixes
- Implement 'Show Problems Only' toggle combining degraded status, high CPU/memory alerts, and needs backup filters
- Add 'Investigate with AI' button to filter bar for problematic guests
- Fix dashboard column sizing inconsistencies between bars and sparklines view modes
- Fix PBS backups display and polling
- Refine AI prompt for general-purpose usage
- Fix frontend flickering and reload loops during initial load
- Integrate persistent SQLite metrics store with Monitor
- Fortify AI command routing with improved validation and logging
- Fix CSRF token handling for note deletion
- Debug and fix AI command execution issues
- Various AI reliability improvements and command safety enhancements
2025-12-06 23:46:08 +00:00
rcourtman
8948e84fe5 feat: AI features, agent improvements, and host monitoring enhancements
AI Chat Integration:
- Multi-provider support (Anthropic, OpenAI, Ollama)
- Streaming responses with markdown rendering
- Agent command execution for remote troubleshooting
- Context-aware conversations with host/container metadata

Agent Updates:
- Add --enable-proxmox flag for automatic PVE/PBS token setup
- Improve auto-update with semver comparison (prevents downgrades)
- Add updatedFrom tracking to report previous version after update
- Reduce initial update check delay from 30s to 5s
- Add agent version column to Hosts page table

Host Metrics:
- Add DiskIO stats collection (read/write bytes, ops, time)
- Improve disk filtering to exclude Docker overlay mounts
- Add RAID array monitoring via mdadm
- Enhanced temperature sensor parsing

Frontend:
- New Agent Version column on Hosts overview table
- Improved node modal with agent-first installation flow
- Add DiskIO display in host drawer
- Better responsive handling for metric bars
2025-12-05 10:37:02 +00:00
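Note on the commit above: a numeric version comparison that refuses downgrades might look like this sketch. `ShouldUpdate` and `parseVersion` are hypothetical helpers, not the agent's actual code; they handle plain MAJOR.MINOR.PATCH versions with an optional leading "v" and ignore pre-release and build metadata.

```go
package main

import (
	"strconv"
	"strings"
)

// parseVersion splits "v1.2.3" or "1.2.3" into three non-negative integers.
func parseVersion(v string) ([3]int, bool) {
	var out [3]int
	parts := strings.Split(strings.TrimPrefix(v, "v"), ".")
	if len(parts) != 3 {
		return out, false
	}
	for i, p := range parts {
		n, err := strconv.Atoi(p)
		if err != nil || n < 0 {
			return out, false
		}
		out[i] = n
	}
	return out, true
}

// ShouldUpdate reports whether candidate is strictly newer than current.
// Equal, older, or unparseable versions never trigger an update.
func ShouldUpdate(current, candidate string) bool {
	cur, ok1 := parseVersion(current)
	cand, ok2 := parseVersion(candidate)
	if !ok1 || !ok2 {
		return false
	}
	for i := range cur {
		if cand[i] != cur[i] {
			return cand[i] > cur[i]
		}
	}
	return false
}
```

Comparing component by component as integers is what makes 5.0.10 newer than 5.0.9, which a plain string comparison would get wrong.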
rcourtman
0bc58f678e perf: Cache err.Error() in storage timeout error handling
Cache err.Error() result in two locations:
- monitor.go: storage query retry logic (2x calls to 1)
- monitor_polling.go: storage timeout handling (2x calls to 1)
2025-12-02 15:39:37 +00:00