Commit graph

219 commits

rcourtman
5a17456a60 Fix Ceph manager standby parsing
2026-04-13 11:57:12 +01:00
rcourtman
76c3f1ac88 test: stabilize metrics flush visibility
2026-04-09 10:58:04 +01:00
rcourtman
387a79bc2d Defer metrics store startup maintenance 2026-04-01 12:44:17 +01:00
rcourtman
177ae5f6da Tighten integer and allocation bounds for CodeQL 2026-03-31 09:50:11 +01:00
rcourtman
9155480bbd Use explicit integer bounds in Proxmox parsing 2026-03-31 09:43:04 +01:00
rcourtman
57cf7b305a Validate SAML redirects and request URLs 2026-03-31 09:34:31 +01:00
rcourtman
33efdc3fb5 Normalize outbound client and update URLs 2026-03-31 09:31:56 +01:00
rcourtman
4f89a13975 Harden storage path handling for CodeQL 2026-03-31 08:55:45 +01:00
rcourtman
e93c8b40ae Fix CodeQL integer and audit findings 2026-03-28 13:33:48 +00:00
rcourtman
e306c0a461 Tolerate partial guest network address payloads (#1319)
2026-03-27 17:09:09 +00:00
rcourtman
81b0a567ce Harden guest network interface parsing (#1319) 2026-03-27 17:05:34 +00:00
rcourtman
2ed4253573 Accept object-style single guest fsinfo results (#1319) 2026-03-27 16:33:41 +00:00
rcourtman
d11e3d8f2d Use Ceph monmap and mgrmap counts in cluster summaries (#1319) 2026-03-27 16:23:57 +00:00
rcourtman
3d27c8f006 Accept object-style guest fsinfo disk metadata (#1319) 2026-03-27 15:24:40 +00:00
rcourtman
fcfa0c2903 Skip malformed guest fsinfo entries (#1319) 2026-03-27 15:23:13 +00:00
rcourtman
d4242d9a13 Fix ZFS pool attachment in storage frontend (discussion #1351) 2026-03-27 14:59:52 +00:00
rcourtman
b5629fb1df Normalize Windows volume GUID fsinfo mountpoints (#1319) 2026-03-27 14:04:58 +00:00
rcourtman
b05d2b0489 Handle Windows fsinfo name fallback for guest disks (#1319) 2026-03-27 11:39:22 +00:00
rcourtman
1f332bee52 Support privileged fsinfo totals for guest disks (#1319) 2026-03-27 11:18:53 +00:00
rcourtman
f45f7401c0 Make metrics Flush wait for queued writes 2026-03-25 14:14:00 +00:00
rcourtman
2acf2e9ef9 Reduce metrics store transaction churn (#1124) 2026-03-25 12:06:28 +00:00
rcourtman
1885bd02c0 Fix Proxmox tag color parsing (#1348) 2026-03-25 10:40:31 +00:00
rcourtman
3a02dd171b fix(proxmox): add GetClusterOptions to ClusterClient for tag colour fetch 2026-03-15 19:51:20 +00:00
rcourtman
caff845c1a fix(ui): use Proxmox tag colours from datacenter config
Pulse was generating tag colours from a hash of the tag name instead
of using the colours configured in Proxmox. Now polls /cluster/options
once per PVE instance and merges the tag-style colour map into state,
which the frontend uses as the first-priority colour source for tag
badges. Falls back to the existing special-tag and hash-based colours
when Proxmox hasn't set a custom colour for a tag.
2026-03-15 19:49:46 +00:00
rcourtman
499ab812e3 Fix post-release regressions and lock v5 to single-tenant runtime 2026-03-05 23:46:35 +00:00
rcourtman
ac83074fc2 fix(hostmetrics): skip network mounts before usage probe (#1313) 2026-03-03 20:27:41 +00:00
rcourtman
adad2bbd1c fix(server): fail fast on frontend bind errors 2026-03-03 15:43:22 +00:00
rcourtman
b38488f2da fix(proxmox): stabilize pulse monitor token lifecycle 2026-03-03 10:57:19 +00:00
rcourtman
0ae2806f18 fix(memory): add guest agent /proc/meminfo fallback to avoid VM memory inflation (#1270)
Proxmox status.Mem includes page cache as "used" memory, inflating
reported VM usage. The existing fallbacks (balloon meminfo, RRD, linked
host agent) were frequently unavailable, causing most VMs to fall
through to the inflated status-mem source.

Adds a new last-resort fallback that reads /proc/meminfo via the QEMU
guest agent file-read endpoint to get accurate MemAvailable. Results
are cached (60s positive, 5min negative backoff for unsupported VMs).

Also fixes: RRD memavailable fallback missing from traditional polling
path, cache key collisions in multi-PVE setups, FreeMem underflow
guard inconsistency, and integer overflow in kB-to-bytes conversion.
2026-02-20 13:31:52 +00:00
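The /proc/meminfo fallback above can be sketched roughly as follows. This is a minimal illustration, not the actual Pulse code: the function name, and the assumption that the guest-agent file read yields the raw meminfo text, are mine. It shows the two details the commit calls out: extracting MemAvailable and guarding the kB-to-bytes conversion against integer overflow.

```go
package main

import (
	"bufio"
	"fmt"
	"math"
	"strconv"
	"strings"
)

// parseMemAvailable scans /proc/meminfo text (as it might be returned by
// the QEMU guest agent file-read endpoint) for MemAvailable and converts
// kB to bytes without overflowing. Returns 0 if the field is missing or
// malformed, signalling the caller to fall through to the next source.
func parseMemAvailable(meminfo string) uint64 {
	sc := bufio.NewScanner(strings.NewReader(meminfo))
	for sc.Scan() {
		line := sc.Text()
		if !strings.HasPrefix(line, "MemAvailable:") {
			continue
		}
		fields := strings.Fields(line) // e.g. ["MemAvailable:", "1024000", "kB"]
		if len(fields) < 2 {
			return 0
		}
		kb, err := strconv.ParseUint(fields[1], 10, 64)
		if err != nil {
			return 0
		}
		// Guard the kB-to-bytes conversion against overflow.
		if kb > math.MaxUint64/1024 {
			return math.MaxUint64
		}
		return kb * 1024
	}
	return 0
}

func main() {
	sample := "MemTotal:  2048000 kB\nMemFree:  512000 kB\nMemAvailable: 1024000 kB\n"
	fmt.Println(parseMemAvailable(sample)) // 1024000 kB expressed in bytes
}
```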
rcourtman
8c7d507ea4 fix(alerts): make --disk-exclude suppress Proxmox SSD wear/health alerts (#1142)
The --disk-exclude agent flag only filtered local metric collection but
had no effect on server-side Proxmox disk health and SSD wearout alerts,
which poll the Proxmox API directly. Users excluding disks (e.g.
--disk-exclude sda) still received alerts for those disks.

Agent now sends its DiskExclude patterns in each report. The server
stores them on the Host model and consults them during Proxmox disk
polling — excluded disks get a synthetic healthy status passed to
CheckDiskHealth so any existing alerts clear immediately.

Also adds FreeBSD pseudo-filesystem types (fdescfs, devfs, linprocfs,
linsysfs) to the virtual FS filter and /var/run/ to special mount
prefixes, fixing false disk-full alerts on FreeBSD for fdescfs mounts.
2026-02-20 13:31:52 +00:00
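The server-side exclusion check described above could look something like this sketch (the function name and the use of glob-style matching via `path.Match` are assumptions for illustration): a disk is skipped during Proxmox disk-health polling if its device name matches any of the agent-reported DiskExclude patterns.

```go
package main

import (
	"fmt"
	"path"
)

// isExcluded reports whether a device name matches any --disk-exclude
// pattern. Exact names ("sda") match trivially; glob patterns
// ("nvme*") cover device families. Match errors on a bad pattern are
// treated as non-matches.
func isExcluded(device string, patterns []string) bool {
	for _, p := range patterns {
		if ok, err := path.Match(p, device); err == nil && ok {
			return true
		}
	}
	return false
}

func main() {
	patterns := []string{"sda", "nvme*"}
	fmt.Println(isExcluded("sda", patterns))     // excluded by exact name
	fmt.Println(isExcluded("nvme0n1", patterns)) // excluded by glob
	fmt.Println(isExcluded("sdb", patterns))     // not excluded
}
```

For an excluded disk, the server would then pass a synthetic healthy status to CheckDiskHealth, as the commit describes, so stale alerts clear.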
rcourtman
74d46807d8 fix(test): update /var/lib/docker fsfilter expectation to match new skip=false behavior
(cherry picked from commit 6671a0d5a654ef8d0b023bd610e5c2fd5fedbf8b)
2026-02-18 12:59:23 +00:00
rcourtman
a54d71117b fix(proxmox): prevent guest agent errors from marking endpoints unhealthy
Backport of v6 commits a87c9950 and 347d7db1.

Part 1 (a87c9950): Wrap the four guest agent c.get() errors with
fmt.Errorf("guest agent ...: %w", err) so isVMSpecificError() correctly
scopes them to the VM rather than the cluster endpoint.

Part 2 (347d7db1): Replace the 20+ pattern blocklist in
executeWithFailover with an allowlist via isEndpointConnectivityError().
Only true TCP/DNS/TLS failures mark an endpoint unhealthy. Any HTTP
response from Proxmox — including 500 — proves the node is reachable
and returns the error without affecting endpoint health.
2026-02-18 12:59:20 +00:00
rcourtman
a69a2061cb fix(ui,disk): remove double 'ago' suffix and allow /var/lib/docker block devices (#1266, #1143)
#1266: ageFormatted already includes 'ago' from formatTimeDiff(); remove the
duplicate literal suffix from the backup age tooltip in GuestRow.tsx.

#1143: Remove /var/lib/docker from specialMountPrefixes so real block devices
mounted there are visible in disk usage. Container overlay layers (fstype=overlay)
are already filtered by virtualFSTypes and are unaffected.

(cherry picked from commit 5acef3405d4288f627788675123e266d661c2fe3)
2026-02-18 12:56:58 +00:00
rcourtman
efa916ee2a fix(memory): correct memory reporting for Linux VMs and FreeBSD ZFS ARC
Linux VM page cache (#1270): QEMU VM memory now falls back to Proxmox
RRD's memavailable metric (which excludes reclaimable page cache) when
the qemu-guest-agent doesn't provide MemInfo.Available. Previously the
fallback was detailedStatus.Mem (total - MemFree), inflating usage to
80%+ on VMs with normal Linux page cache. Mirrors the existing LXC
rrd-memavailable path.

FreeBSD ZFS ARC (#1264, #1051): The host agent now reads
kstat.zfs.misc.arcstats.size via SysctlRaw on FreeBSD and subtracts
the ARC size from reported memory usage. ZFS ARC is reclaimable under
memory pressure (like Linux SReclaimable) but gopsutil counts it as
wired/non-reclaimable, causing false 90%+ memory alerts on TrueNAS
and FreeBSD hosts. Build-tagged so it compiles cleanly on all platforms.

Fixes #1270
Fixes #1264
Fixes #1051

(cherry picked from commit 94502f83ff9ffc6da28aaadc946a2f7d8b4e9bac)
2026-02-18 12:56:53 +00:00
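The ARC adjustment reduces to a subtraction with an underflow guard, sketched below under illustrative names (the real agent reads the ARC size from kstat.zfs.misc.arcstats.size via SysctlRaw, which is omitted here):

```go
package main

import "fmt"

// usedExcludingARC treats the ZFS ARC as reclaimable memory and
// subtracts it from the "used" figure reported by gopsutil, clamping at
// zero so an oversized ARC reading can never underflow the unsigned
// result.
func usedExcludingARC(usedBytes, arcBytes uint64) uint64 {
	if arcBytes >= usedBytes {
		return 0
	}
	return usedBytes - arcBytes
}

func main() {
	// 14 GiB reported used, of which 6 GiB is ARC: 8 GiB effectively used.
	fmt.Println(usedExcludingARC(14<<30, 6<<30) == 8<<30) // true
}
```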
rcourtman
9d8f8b45b5 fix(docker,metrics): preserve container metadata on update and reduce DB writes
Docker container URL preserved on update (#1054): container updates
recreate the container with a new runtime ID. The agent now includes
{oldContainerId, newContainerId} in the completion ACK payload; the
server uses this to copy persisted metadata (custom URLs, descriptions,
tags) to the new ID so nothing is lost. Migration is a copy, not a move,
so rollback scenarios still find metadata under the original ID.

Reduce metrics.db write amplification (#1124): add a UNIQUE index on
(resource_type, resource_id, metric_type, timestamp, tier) so rollup
reprocessing after a failed checkpoint uses INSERT OR IGNORE instead of
creating duplicate rows. Existing duplicates are deduplicated once on
startup if the index creation would otherwise fail. Also sets
wal_autocheckpoint(500) to checkpoint the WAL more frequently, preventing
unbounded WAL growth.

Fixes #1054
Fixes #1124
2026-02-18 12:56:46 +00:00
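The INSERT OR IGNORE semantics behind the #1124 fix can be illustrated with an in-memory analogue (this is only an analogy of the SQLite behavior, not the actual store): the key mirrors the UNIQUE index columns, so replaying a rollup after a failed checkpoint writes nothing new.

```go
package main

import "fmt"

// metricKey mirrors the UNIQUE index columns
// (resource_type, resource_id, metric_type, timestamp, tier).
type metricKey struct {
	ResourceType, ResourceID, MetricType string
	Timestamp                            int64
	Tier                                 int
}

type store struct{ rows map[metricKey]float64 }

// insertOrIgnore writes a row only if the key is absent, making rollup
// reprocessing idempotent instead of accumulating duplicate rows.
func (s *store) insertOrIgnore(k metricKey, v float64) bool {
	if _, dup := s.rows[k]; dup {
		return false
	}
	s.rows[k] = v
	return true
}

func main() {
	s := &store{rows: map[metricKey]float64{}}
	k := metricKey{"vm", "instance:node:101", "cpu", 1700000000, 1}
	fmt.Println(s.insertOrIgnore(k, 0.42)) // first write succeeds
	fmt.Println(s.insertOrIgnore(k, 0.42)) // replay is ignored
	fmt.Println(len(s.rows))               // still one row
}
```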
rcourtman
c2d2e7de0e fix: run retention before auto-vacuum migration to reduce VACUUM cost
VACUUM creates a full copy of the database. Running retention first
deletes stale data (5GB → ~60MB live), so the VACUUM copies far less
data — faster startup and much less temporary disk space needed.
2026-02-11 13:37:29 +00:00
rcourtman
06a9f9694f fix: migrate existing metrics databases to incremental auto-vacuum
The auto_vacuum(INCREMENTAL) pragma from the previous commit only takes
effect on new databases. SQLite requires a full VACUUM to restructure
existing files when switching from NONE to INCREMENTAL. Without this,
users upgrading from bloated 5GB+ databases would never reclaim space.

Adds a one-time migration on startup that detects the current auto_vacuum
mode and runs VACUUM to convert if needed. Subsequent startups skip the
migration since the mode is already INCREMENTAL.
2026-02-11 13:35:03 +00:00
rcourtman
284bdd7ade fix: prevent metrics.db bloat with automatic vacuum and WAL checkpointing
The metrics database could grow to 5GB+ for modest setups because:
1. Retention deletes rows hourly but SQLite never reclaims the space
2. WAL file grows unbounded without explicit checkpointing
3. No cleanup runs on startup, so restarts accumulate stale data

Fixes:
- Enable auto_vacuum=INCREMENTAL so deleted pages can be reclaimed
- Run incremental_vacuum after each retention cleanup
- Force WAL checkpoint(TRUNCATE) after deletes to prevent WAL bloat
- Run retention on startup to clean stale data immediately

Expected DB size for a 50-resource setup drops from 5GB+ to ~60-70MB.

Ref: GitHub Discussion #1231
2026-02-10 23:13:32 +00:00
rcourtman
815c990e85 fix(proxmox): avoid 403 on apt update checks 2026-02-09 20:28:09 +00:00
rcourtman
5c18748742 Add SMART disk lifecycle monitoring with historical charts
Expand the smartctl collector to capture detailed SMART attributes (SATA
and NVMe), propagate them through the full data pipeline, persist them
as time-series metrics, and display them in an interactive disk detail
drawer with historical sparkline charts.

Backend: add SMARTAttributes struct, writeSMARTMetrics for persistent
storage, "disk" resource type in metrics API with live fallback.
Frontend: enhanced DiskList with Power-On column and SMART warnings,
new DiskDetail drawer matching NodeDrawer styling patterns, generic
HistoryChart metric support with proper tooltip formatting.
2026-02-04 13:35:40 +00:00
rcourtman
6427f28a08 Fix stale metrics store reference in reporting engine after monitor reload
The reporting engine held a direct pointer to the metrics store, which
becomes invalid after a monitor reload (settings change, node config
save, etc.) closes and recreates the store. Use a dynamic getter closure
that always resolves to the current monitor's active store.

Also adds diagnostic logging when report queries return zero metrics,
and integration tests covering the full metrics-to-report pipeline
including reload scenarios.

Fixes #1186
2026-02-04 12:34:40 +00:00
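The getter-closure pattern from this fix can be sketched as follows (type and field names are illustrative, not the actual Pulse code): instead of capturing a `*MetricsStore` at construction time, the engine holds a function that resolves through the current monitor on every call.

```go
package main

import "fmt"

type MetricsStore struct{ name string }

type Monitor struct{ store *MetricsStore }

// ReportingEngine holds a getter closure rather than a raw pointer, so
// a monitor reload that recreates the store is picked up transparently.
type ReportingEngine struct {
	getStore func() *MetricsStore // resolved per query, never cached
}

func main() {
	mon := &Monitor{store: &MetricsStore{name: "store-1"}}
	engine := &ReportingEngine{getStore: func() *MetricsStore { return mon.store }}

	fmt.Println(engine.getStore().name) // store-1

	// A settings change or node-config save closes and recreates the store...
	mon.store = &MetricsStore{name: "store-2"}

	// ...and the engine sees the new store without being rewired.
	fmt.Println(engine.getStore().name) // store-2
}
```

With a plain pointer, the second query would have hit the closed "store-1" and returned zero metrics, which is exactly the #1186 symptom.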
rcourtman
5c1487e406 feat: add resource picker and multi-resource report generation
Replace manual resource ID entry with a searchable, filterable resource
picker that uses live WebSocket state. Support selecting multiple
resources (up to 50) for combined fleet reports.

Multi-resource PDFs include a cover page, fleet summary table with
aggregate health status, and condensed per-resource detail pages with
overlaid CPU/memory charts. Multi-resource CSVs include a summary
section followed by interleaved time-series data with resource columns.

New POST /api/admin/reports/generate-multi endpoint handles multi-resource
requests while the existing single-resource GET endpoint remains unchanged.

Also fixes resource ID validation regex to allow colons used in
VM/container IDs (e.g., "instance:node:vmid").
2026-02-04 10:24:23 +00:00
rcourtman
2ebe65bbc5 security: add scope checks to AI Patrol and agent profile endpoints
- AI Patrol mutation endpoints (acknowledge, dismiss, suppress, snooze, resolve,
  findings/note, suppressions/*) now require ai:execute scope to prevent
  low-privilege tokens from blinding patrol by hiding/suppressing findings

- Agent profile admin endpoints (/api/admin/profiles/*) now require
  settings:write scope to prevent low-privilege tokens from modifying
  fleet-wide agent behavior
2026-02-03 19:29:56 +00:00
rcourtman
4fdc0cae64 feat(reporting): enrich metric reports with detailed resource info 2026-02-03 18:51:27 +00:00
rcourtman
11aa3e05af fix(reporting): correct SSD life documentation and logic
- Fix misleading comment in DiskInfo struct that said "percentage of
  life used" when it's actually "percentage of life REMAINING"
- Document that 100 = healthy, 0 = end of life, -1 = unknown
- This matches the Proxmox API behavior where wearout "100 is best"
2026-02-03 18:24:09 +00:00
rcourtman
9f412e69b3 fix(reporting): correct SSD life interpretation (100% = healthy, not worn)
The WearLevel field represents SSD life REMAINING, not wear used:
- 100% = fully healthy (new drive)
- 0% = end of life

Fixed logic to:
- Show critical warning when life <= 10% (not >= 90%)
- Show warning when life <= 30% (not >= 70%)
- Display values in green when healthy (>30% life remaining)
- Rename column from "Wear" to "Life" for clarity
2026-02-03 18:20:54 +00:00
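The corrected interpretation condenses to a small classification function, sketched below with the thresholds from the commit (the function name is illustrative): WearLevel is life remaining, matching the Proxmox wearout convention where "100 is best".

```go
package main

import "fmt"

// classifyLife maps SSD life REMAINING to a severity:
// 100 = new drive, 0 = end of life, -1 = unknown.
func classifyLife(lifeRemaining int) string {
	switch {
	case lifeRemaining < 0:
		return "unknown"
	case lifeRemaining <= 10:
		return "critical" // near end of life
	case lifeRemaining <= 30:
		return "warning"
	default:
		return "healthy" // shown in green
	}
}

func main() {
	fmt.Println(classifyLife(100)) // healthy
	fmt.Println(classifyLife(25))  // warning
	fmt.Println(classifyLife(5))   // critical
	fmt.Println(classifyLife(-1))  // unknown
}
```

The pre-fix bug was precisely the inverted comparison: warning at ">= 70" reads the field as wear used rather than life remaining.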
rcourtman
442d29e9b9 feat(reporting): enhance PDF reports with Executive Summary and actionable insights
- Add professional cover page with branding and report period
- Add Executive Summary page with health status banner (HEALTHY/WARNING/CRITICAL)
- Add Quick Stats section with color-coded metrics and trend indicators
- Add Key Observations with automated analysis of CPU, memory, disk, and disk wear
- Add Recommended Actions section with prioritized, actionable items
- Add Resource Details page with hardware info, storage pools, physical disks
- Add color-coded tables for alerts, storage, and disk health
- Add performance charts with area fills and proper scaling
- Improve overall visual design with consistent color scheme
- Fix SAML session invalidation to use correct SessionStore method
2026-02-03 18:17:31 +00:00
rcourtman
c6aeb9429b fix: initialize reporting engine in standard binary
Pro license holders running the standard Docker image/binary were
getting "Reporting engine not initialized" errors because the
reporting engine was only wired up in the enterprise build.

Now the core server initializes the reporting engine automatically
when the metrics store is ready, ensuring PDF/CSV report generation
works for all Pro license holders regardless of which binary they use.

The enterprise hooks are still honored if set, allowing the enterprise
build to override with its own implementation if needed.
2026-02-03 16:53:20 +00:00
rcourtman
35eedcb5ac Fix: metrics store tier fallback for mock mode sparklines
When querying short time ranges (1h, 6h), the metrics store only looked
in TierRaw and TierMinute which were empty in mock mode. The seeded data
was stored in TierHourly and TierDaily.

Updated tierFallbacks to include coarser tiers as fallbacks:
- TierRaw now falls back to TierMinute, then TierHourly
- TierMinute now falls back to TierRaw, then TierHourly

This ensures sparkline data is available in mock/demo mode where
historical data is seeded into coarser tiers.
2026-02-03 12:03:06 +00:00
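The fallback chain can be sketched like this (tier names follow the commit; the map shape and query signature are assumptions): each requested tier carries an ordered list of tiers to try, and the first tier with data wins, so data seeded only into coarser tiers still backs short-range queries.

```go
package main

import "fmt"

type Tier string

const (
	TierRaw    Tier = "raw"
	TierMinute Tier = "minute"
	TierHourly Tier = "hourly"
)

// tierFallbacks lists, per requested tier, the tiers to try in order,
// ending with a coarser tier as described in the commit.
var tierFallbacks = map[Tier][]Tier{
	TierRaw:    {TierRaw, TierMinute, TierHourly},
	TierMinute: {TierMinute, TierRaw, TierHourly},
}

// queryWithFallback returns the first tier in the chain that holds any
// points, along with those points.
func queryWithFallback(data map[Tier][]float64, want Tier) (Tier, []float64) {
	for _, t := range tierFallbacks[want] {
		if pts := data[t]; len(pts) > 0 {
			return t, pts
		}
	}
	return want, nil
}

func main() {
	// Mock mode: only hourly data was seeded.
	data := map[Tier][]float64{TierHourly: {0.2, 0.4}}
	tier, pts := queryWithFallback(data, TierRaw)
	fmt.Println(tier, len(pts)) // the raw query is served from the hourly tier
}
```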
rcourtman
ed5ab5eebf Fix: flaky metrics fallback test — use WriteBatchSync for deterministic writes 2026-02-02 23:32:28 +00:00