1596 commits

Author SHA1 Message Date
rcourtman
00afaec2ae fix(agent): add retry with backoff to Proxmox auto-registration (#1267, #1269, #1261, #1268)
registerWithPulse() was a one-shot call at agent startup — if it failed
(timing, transient network, Pulse not ready), the agent silently continued
as a generic Host forever. Wrap the HTTP POST in a retry loop with
exponential backoff (5s, 10s, 20s, 40s, 60s) and distinguish 4xx errors
(no retry) from 5xx/network errors (retry).
2026-02-18 16:05:40 +00:00
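The retry policy described in the commit above can be sketched roughly as follows. This is a minimal illustration, not the agent's actual code: `backoffDelays` and `shouldRetry` are hypothetical helper names, and the cap of 60s and five attempts are read off the schedule quoted in the message.

```go
package main

import "time"

// backoffDelays returns `attempts` delays that double from base and are
// capped at maxDelay, e.g. 5s, 10s, 20s, 40s, 60s for (5s, 60s, 5).
func backoffDelays(base, maxDelay time.Duration, attempts int) []time.Duration {
	delays := make([]time.Duration, 0, attempts)
	d := base
	for i := 0; i < attempts; i++ {
		if d > maxDelay {
			d = maxDelay
		}
		delays = append(delays, d)
		d *= 2
	}
	return delays
}

// shouldRetry reports whether a failed attempt is worth repeating.
// A transport error (no HTTP response at all) or a 5xx status is treated
// as transient; a 4xx status means the request itself was rejected, so
// repeating it unchanged is not expected to help.
func shouldRetry(statusCode int, transportErr error) bool {
	if transportErr != nil {
		return true
	}
	return statusCode >= 500
}
```

A caller would sleep for each delay in turn between POST attempts and stop early as soon as `shouldRetry` returns false.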
Surendra Raika
f663aade53 feat(docker): add macOS Docker Desktop socket auto-detection
Probe ~/.docker/run/docker.sock for RuntimeDocker and RuntimeAuto
before falling back to /var/run/docker.sock. This lets the agent
connect on macOS without requiring DOCKER_HOST to be set manually.

Ref #1200
2026-02-18 19:23:14 +05:30
rcourtman
5666d6a9e8 fix(ai): fsync knowledge store temp file before rename to prevent empty reads
saveToDisk used os.WriteFile which doesn't sync to disk before the
atomic rename. On CI runners with aggressive filesystem caching this
can leave the destination file with zero bytes, causing
TestKnowledgeStore_SaveLoad to fail with "unexpected end of JSON input".
2026-02-18 13:27:47 +00:00
rcourtman
6c720b7aea fix(freebsd): use golang.org/x/sys/unix.SysctlRaw instead of syscall.SysctlRaw
syscall.SysctlRaw is Darwin-only in Go's standard library; FreeBSD
requires the equivalent from golang.org/x/sys/unix. This fixes the
Docker cross-compilation build failure for the freebsd/amd64 target.

(cherry picked from commit 5fe16c75a075b817f90b7192d8270a7bd6677017)
2026-02-18 13:00:02 +00:00
rcourtman
71b8b81af5 fix(monitoring): cache per-VM RRD memory lookups to avoid serial HTTP calls
Windows VMs and VMs without qemu-guest-agent triggered an uncached
GetVMRRDData HTTP call on every poll cycle. Add vmRRDMemCache using the
same read-through cache pattern as nodeRRDMemCache (shared rrdCacheMu,
same TTL, same cleanup path).

(cherry picked from commit 582f16004a0f275de4c458e5d288be70eee613e4)
2026-02-18 12:57:15 +00:00
rcourtman
7efcec3120 fix(agents,ai): host URL field, AI Docker routing, Proxmox registration logging (#1197, #1210, #1267)
#1197: Add Custom URL input to the expanded host row in Settings → Agents.
Loads existing URL via HostMetadataAPI on row expand; saves on button click.
Only shown for host-type agent rows.

#1210: Fix agent_connected always false for Docker hosts on Proxmox VMs.
connectedAgentHostnames now also marks Docker host hostnames reachable when
their matching VM/LXC has a node with a connected Proxmox agent, mirroring
the routing logic already used in the control path.

#1267/#1269: Improve Proxmox auto-registration failure logging. Response body
is now included in the error message, and the warning directs users to delete
the state file to force re-registration rather than claiming the node exists.

(cherry picked from commit 305f6d3c94f0da4fc970450a6304da57d6d7fe80)
2026-02-18 12:57:09 +00:00
rcourtman
efa916ee2a fix(memory): correct memory reporting for Linux VMs and FreeBSD ZFS ARC
Linux VM page cache (#1270): QEMU VM memory now falls back to Proxmox
RRD's memavailable metric (which excludes reclaimable page cache) when
the qemu-guest-agent doesn't provide MemInfo.Available. Previously the
fallback was detailedStatus.Mem (total - MemFree), inflating usage to
80%+ on VMs with normal Linux page cache. Mirrors the existing LXC
rrd-memavailable path.

FreeBSD ZFS ARC (#1264, #1051): The host agent now reads
kstat.zfs.misc.arcstats.size via SysctlRaw on FreeBSD and subtracts
the ARC size from reported memory usage. ZFS ARC is reclaimable under
memory pressure (like Linux SReclaimable) but gopsutil counts it as
wired/non-reclaimable, causing false 90%+ memory alerts on TrueNAS
and FreeBSD hosts. Build-tagged so it compiles cleanly on all platforms.

Fixes #1270
Fixes #1264
Fixes #1051

(cherry picked from commit 94502f83ff9ffc6da28aaadc946a2f7d8b4e9bac)
2026-02-18 12:56:53 +00:00
rcourtman
9d8f8b45b5 fix(docker,metrics): preserve container metadata on update and reduce DB writes
Docker container URL preserved on update (#1054): container updates
recreate the container with a new runtime ID. The agent now includes
{oldContainerId, newContainerId} in the completion ACK payload; the
server uses this to copy persisted metadata (custom URLs, descriptions,
tags) to the new ID so nothing is lost. Migration is a copy, not a move,
so rollback scenarios still find metadata under the original ID.

Reduce metrics.db write amplification (#1124): add a UNIQUE index on
(resource_type, resource_id, metric_type, timestamp, tier) so rollup
reprocessing after a failed checkpoint uses INSERT OR IGNORE instead of
creating duplicate rows. Existing duplicates are deduplicated once on
startup if the index creation would otherwise fail. Also sets
wal_autocheckpoint(500) to checkpoint the WAL more frequently, preventing
unbounded WAL growth.

Fixes #1054
Fixes #1124
2026-02-18 12:56:46 +00:00
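The "copy, not a move" behaviour for container metadata can be illustrated with a small in-memory sketch. The `containerMeta` type, the map-based store, and `migrateMetadata` are hypothetical stand-ins; the real server persists this data and may structure it differently.

```go
package main

// containerMeta is a hypothetical subset of the persisted per-container metadata.
type containerMeta struct {
	CustomURL   string
	Description string
	Tags        []string
}

// migrateMetadata copies metadata from oldID to newID and leaves the
// original entry in place, so a rollback to the old container still
// finds its metadata.
func migrateMetadata(store map[string]containerMeta, oldID, newID string) {
	meta, ok := store[oldID]
	if !ok || oldID == newID {
		return
	}
	// Clone the slice so the two entries do not share backing storage.
	meta.Tags = append([]string(nil), meta.Tags...)
	store[newID] = meta
}
```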
rcourtman
7522f6599c fix(agent): three backend fixes for FreeBSD, Docker rootless, and duplicate PVE hosts
FreeBSD auto-update (#1254): determineArch() now includes freebsd in its
OS switch, producing freebsd-amd64/arm64 instead of falling through to
a uname -m fallback that incorrectly returned linux-<arch>. FreeBSD agents
were downloading Linux ELF binaries and failing to exec them.

Docker rootless socket (#1200): buildRuntimeCandidates() now probes
/run/user/<uid>/docker.sock before the system-wide /var/run/docker.sock,
enabling auto-detection of Docker rootless installations.

Duplicate PVE/PBS hosts (#1245, #1252): handleSecureAutoRegister() now
deduplicates by host URL, updating the existing instance's token in-place
instead of appending a duplicate entry on each re-run of the setup script.

Fixes #1254
Fixes #1200
Fixes #1245
Fixes #1252

(cherry picked from commit 0f1d9e9b9fea6c8b9e65872e8a78e25f93653eef)
2026-02-18 12:53:25 +00:00
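The host-URL deduplication from the third fix can be sketched as an upsert. The `instance` type and `upsertByHost` are illustrative names, and the exact-string comparison is an assumption; the real handler may normalize URLs before comparing.

```go
package main

// instance is a hypothetical, trimmed-down stand-in for a configured PVE/PBS entry.
type instance struct {
	Host  string
	Token string
}

// upsertByHost updates the token of an existing entry with the same host
// URL in place, or appends a new entry when no match exists.
func upsertByHost(list []instance, in instance) []instance {
	for i := range list {
		if list[i].Host == in.Host {
			list[i].Token = in.Token
			return list
		}
	}
	return append(list, in)
}
```

Re-running a setup script against the same host then rotates the token without growing the list.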
rcourtman
97aee77ae7 fix(sso): preserve oidc/saml sub-config when toggle sends flat update payload
The enable/disable toggle PUT sends back the flat list-response shape
(no nested oidc/saml objects). handleUpdateSSOProvider was unmarshaling
this directly, leaving OIDC and SAML as nil and overwriting all stored
credentials on every toggle.

Now preserves existing sub-config objects when the incoming payload omits
them, matching the existing ClientSecret preservation behaviour.

Fixes part of #1255

(cherry picked from commit 44868e99d66aa157f5c62d100151a6f8bc940205)
2026-02-18 12:53:18 +00:00
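One way to picture the preservation logic is a merge that only overwrites sub-configs the payload actually carries. The types and the `mergeUpdate` name below are hypothetical and much smaller than the real provider model.

```go
package main

// Hypothetical, reduced provider model for illustration only.
type oidcConfig struct{ ClientID, ClientSecret string }
type samlConfig struct{ MetadataURL string }

type ssoProvider struct {
	ID      string
	Enabled bool
	OIDC    *oidcConfig
	SAML    *samlConfig
}

// mergeUpdate applies an incoming update but keeps the stored OIDC/SAML
// sub-configs when the payload omits them (nil after unmarshaling).
func mergeUpdate(existing, incoming ssoProvider) ssoProvider {
	if incoming.OIDC == nil {
		incoming.OIDC = existing.OIDC
	}
	if incoming.SAML == nil {
		incoming.SAML = existing.SAML
	}
	return incoming
}
```

With this shape, a flat toggle payload that only flips `Enabled` leaves the stored credentials untouched.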
rcourtman
a210b01a03 fix(sso): load SSO config at startup and expose providers on login page
r.ssoConfig was never loaded from persistence in NewRouter(), so on every
restart all SSO providers were silently discarded (handleListSSOProviders
would reinitialize to an empty config on the first request).

Also adds ssoProviders to /api/security/status so the login page can
render SAML/OIDC login buttons for enabled providers.

Fixes part of #1255

(cherry picked from commit 395cd101ff4acb1b7f89ec3d907b84cbec217dc8)
2026-02-18 12:53:15 +00:00
rcourtman
43af70ca1f fix(patrol): skip alert triggers when Patrol is disabled
TriggerPatrolForAlert was enqueuing into adHocTrigger regardless of
whether Patrol was enabled. With patrolLoop not running (disabled),
nothing drained the channel — it filled on the 10th alert and spammed
"Patrol trigger queue full, dropping trigger" on every subsequent alert.

Read p.config.Enabled in the same RLock as triggerManager and return
early when disabled.

Fixes #1258

(cherry picked from commit 69f399469538f0c9cd59084f6429fed8a793c042)
2026-02-18 12:53:12 +00:00
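The guard can be sketched with a buffered channel and a non-blocking send. The `patrol` type and its fields are simplified assumptions; the buffer size of 10 follows from the commit's note that the queue filled on the 10th alert.

```go
package main

import "sync"

// patrol is a hypothetical, reduced version of the service state.
type patrol struct {
	mu       sync.RWMutex
	enabled  bool
	triggers chan string // buffered; drained by the patrol loop when it runs
}

// triggerForAlert reports whether the trigger was queued.
func (p *patrol) triggerForAlert(alertID string) bool {
	p.mu.RLock()
	enabled := p.enabled
	p.mu.RUnlock()
	if !enabled {
		// Nothing drains the queue while Patrol is disabled, so do not enqueue.
		return false
	}
	select {
	case p.triggers <- alertID:
		return true
	default:
		// Queue full: drop the trigger instead of blocking the alert path.
		return false
	}
}
```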
rcourtman
df23d80919 fix(alerts): always send recovery notifications regardless of quiet hours
Recovery (all-clear) notifications were being silently suppressed during
quiet hours for any non-critical alert. Since powered-off alerts default
to Warning level, users who received an alert at 2pm would never get the
recovery notification if the VM came back during quiet hours.

Quiet hours are intended to suppress noisy firing alerts, not to hide
the fact that an issue has resolved. If you got the alert, you should
always get the all-clear.

Remove the ShouldSuppressResolvedNotification gate from handleAlertResolved.
The notifyOnResolve toggle (explicit user preference) is still respected.

Fixes #1259
2026-02-18 12:53:09 +00:00
rcourtman
6f156cd211 fix: exit agent when exec fails after binary replacement during auto-update
When syscall.Exec() fails after the binary has already been atomically
replaced on disk, the old process would log an error and keep running
indefinitely with stale code. The next update check (1 hour later) sees
the on-disk version matches the server and skips the update — so the
restart is never retried.

Now the agent exits with code 1 when this happens, allowing systemd (or
any service manager) to restart it with the new binary. This fixes the
"temperature broken after each upgrade" reports where users had to
manually reinstall the agent after every Pulse server upgrade.

Fixes #1247
2026-02-11 14:26:14 +00:00
rcourtman
2fb6ebc25f fix: add SAML auth bypass and update route inventory tests
The SAML route registration (bee3d05f) was incomplete: the auth
middleware uses exact-match for public paths, so /api/saml/{id}/login
etc. would be blocked. Add prefix-based auth bypass for /api/saml/
paths and update route inventory tests for both SSO and SAML routes.
2026-02-11 13:48:16 +00:00
rcourtman
bee3d05f0d fix: register SAML login flow routes (login, ACS, metadata, logout, SLO)
The SAML handler functions existed but were never registered in
setupRoutes(), causing 404s for all SAML authentication flows.
Adds /api/saml/ prefix route with dispatcher for all 5 endpoints.
2026-02-11 13:29:05 +00:00
rcourtman
89969079b9 fix: register SSO provider API routes
The SSO handler functions and frontend were implemented but the HTTP
routes were never registered in setupRoutes(), causing 404 on all
/api/security/sso/providers endpoints.

Fixes #1248
2026-02-11 13:17:51 +00:00
rcourtman
2735204638 fix: skip ambiguous shared-storage backups when VMID exists on multiple instances
When two standalone (non-clustered) PVE hosts share the same storage (NFS,
etc.), both instances see the same backup files during polling. Each instance
creates its own StorageBackup entry, causing guests with the same VMID on
different hosts to incorrectly show each other's backups.

Detect shared-storage duplicates by checking if the same volid appears across
multiple instances. When it does AND the VMID is ambiguous (exists on multiple
instances), skip the backup in SyncGuestBackupTimes rather than guessing which
instance owns it. This uses the same ambiguity pattern already applied to PBS
backups.

Fixes #1177
2026-02-11 11:07:28 +00:00
rcourtman
d4ff967815 fix: scope shared storage aggregation to per-instance to prevent cross-instance merging
The shared storage deduplication key was just the storage name, causing
storages with the same name from different Proxmox instances (or PVE + PBS)
to be incorrectly merged into a single entry. This made one random host
appear to have all storages from all instances.

Include the instance name in the aggregation key so shared storage is only
merged within the same Proxmox cluster/instance.

Fixes #1246
2026-02-11 09:18:09 +00:00
rcourtman
2ba590d994 fix: fall back to SMART attributes 194/190 for disk temperature
When the top-level temperature.current field is 0 or missing (common
on some SATA drives), temperature was reported as 0°C with no fallback.
Now extracts temperature from ATA SMART attribute 194 (Temperature_Celsius)
or 190 (Airflow_Temperature_Cel) as a fallback.

Fixes #1243
2026-02-11 09:09:55 +00:00
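The fallback order can be sketched as below. The `smartAttr` type and `diskTemperature` are hypothetical names, and reading the temperature from the low byte of the raw value is an assumption about the attribute layout, not something the commit states.

```go
package main

// smartAttr is a hypothetical subset of one ATA SMART attribute.
type smartAttr struct {
	ID  int
	Raw int64
}

// diskTemperature returns the top-level reading when it is usable and
// otherwise falls back to attribute 194, then 190. It returns 0 when no
// source provides a value.
func diskTemperature(current int, attrs []smartAttr) int {
	if current > 0 {
		return current
	}
	for _, id := range []int{194, 190} {
		for _, a := range attrs {
			if a.ID == id && a.Raw > 0 {
				// Assumption: the low byte holds degrees Celsius; some
				// drives pack min/max values into the upper bytes.
				return int(a.Raw & 0xFF)
			}
		}
	}
	return 0
}
```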
rcourtman
03939c3f9e fix: deduplicate bind-mounted volumes in disk total calculation
The dedup logic only handled btrfs/zfs subvolumes, but Kubernetes
bind-mounts the same device at both pod and plugin paths, causing
xfs/ext4 volumes to be double-counted. Now deduplicates by
device+totalBytes for all filesystem types.

Fixes #1158
2026-02-10 21:52:25 +00:00
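Deduplicating on the device plus total size can be sketched with a composite map key. The `mount` type and `totalDiskBytes` are illustrative names rather than the agent's real ones.

```go
package main

// mount is a hypothetical subset of a reported filesystem entry.
type mount struct {
	Device     string
	Mountpoint string
	TotalBytes uint64
}

// totalDiskBytes sums capacity while counting each (device, size) pair
// once, regardless of filesystem type or how many paths it is mounted at.
func totalDiskBytes(mounts []mount) uint64 {
	type key struct {
		device string
		total  uint64
	}
	seen := make(map[key]bool)
	var sum uint64
	for _, m := range mounts {
		k := key{m.Device, m.TotalBytes}
		if seen[k] {
			continue
		}
		seen[k] = true
		sum += m.TotalBytes
	}
	return sum
}
```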
rcourtman
42c01c1be5 fix: probe all guest IPs for reachability, not just first
Patrol only pinged the first IP address of each VM/container, causing
false "unreachable" reports for guests with multiple IPs (common with
Windows VMs that have IPv6 or multi-adapter setups). Now probes all
IPs and marks reachable if any responds.

Fixes #1215
2026-02-10 21:46:11 +00:00
rcourtman
6140cb5be4 fix: auto-default discovery interval to 24h when enabled
When users enable AI discovery without setting an interval, the
default of 0 silently stays in manual-only mode. Now normalizes
0 to 24h on save so discovery actually starts automatically.

Fixes #1225
2026-02-10 21:45:59 +00:00
rcourtman
ae4632b5b5 fix: correct UpdateAlertDelayHours doc comment (0 normalizes to 24, -1 disables)
2026-02-10 21:13:12 +00:00
rcourtman
a68e0050f8 fix(docker): use manual CPU delta tracking instead of stale PreCPUStats (#1229)
Docker's one-shot stats API (stream=false) returns PreCPUStats from the
daemon's internal cache, which many Docker versions don't update between
non-streaming reads. This causes every call to return the same stale
PreCPUStats from container start, producing a constant lifetime-average
CPU% (e.g. 3.4%) instead of current usage.

Switch to always using manual delta tracking, which stores the previous
sample from our own reads and computes accurate deltas between collection
cycles. The first cycle returns 0 while establishing a baseline; all
subsequent cycles produce correct current CPU percentages.
2026-02-10 20:49:29 +00:00
rcourtman
47ceffe0c2 fix(smart): parse raw.string instead of raw.value for SATA attributes (#1239)
Seagate drives pack vendor-specific data in the upper bytes of the
48-bit SMART raw value, causing Power_On_Hours to report billions of
years instead of the actual value. Use smartctl's raw.string field
(e.g. "16951 (223 173 0)") and extract the first integer, which is
the correct interpretation. Falls back to raw.value when the string
is empty or non-numeric.
2026-02-10 20:42:15 +00:00
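Extracting the leading integer from `raw.string` with a fallback to `raw.value` can be sketched like this. `parseSMARTRaw` is a hypothetical name, and taking the leading run of digits is one reasonable reading of "extract the first integer".

```go
package main

import (
	"strconv"
	"strings"
)

// parseSMARTRaw returns the leading integer of rawString, e.g. 16951 for
// "16951 (223 173 0)". It falls back to rawValue when the string is empty
// or does not start with a digit.
func parseSMARTRaw(rawString string, rawValue int64) int64 {
	s := strings.TrimSpace(rawString)
	end := 0
	for end < len(s) && s[end] >= '0' && s[end] <= '9' {
		end++
	}
	if end > 0 {
		if v, err := strconv.ParseInt(s[:end], 10, 64); err == nil {
			return v
		}
	}
	return rawValue
}
```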
rcourtman
26776b2075 fix(agent): apply --disk-exclude to Docker agent disk metrics (#1237)
The Docker agent was not passing the disk exclusion list to
hostmetricsCollect(), so excluded mounts appeared in the Docker tab
disk totals. Also add server-side fsfilters filtering to Docker
report processing for parity with the host agent path.
2026-02-10 16:59:35 +00:00
rcourtman
47adcbd8af feat(agent): add FreeBSD S.M.A.R.T. disk collection support (#1236)
Relax the Linux-only gate on SMART collection to also run on FreeBSD.
Add FreeBSD disk discovery via sysctl kern.disks (lsblk is Linux-only).
The smartctl invocation and JSON parsing are already platform-agnostic.
2026-02-10 12:44:15 +00:00
rcourtman
f7a14feb0f fix(mock): align Docker container store type with real monitor
Mock seeding wrote Docker container metrics as "docker" but the real
monitor uses "dockerContainer". This made mock-mode charts miss the
SQLite store path after the API normalization fix in 7336ec2d.
2026-02-09 22:42:08 +00:00
rcourtman
7336ec2d87 fix(metrics): normalize docker resource type in metrics history API (#1229)
Frontend sends resourceType="docker" but the SQLite store uses
"dockerContainer". The /api/metrics-store/history handler now
normalizes the alias so queries return the correct historical data
instead of falling back to a single live data point.
2026-02-09 22:33:24 +00:00
rcourtman
c92ccc122e fix(state): deduplicate PVE nodes and AI mention resources (#1217, #1214)
Backend: nodes with the same logical identity (cluster+name) are merged
using a health-weighted preference, preserving host-agent links across
node-ID churn.

Frontend: extract buildMentionResources() with alias-based dedup so
docker hosts and standalone host agents sharing an ID/hostname appear
once in the @ mention autocomplete.
2026-02-09 22:19:55 +00:00
rcourtman
815c990e85 fix(proxmox): avoid 403 on apt update checks
2026-02-09 20:28:09 +00:00
rcourtman
721be9bce6 fix(config): honor legacy env aliases for docker update-action toggle (#1219)
2026-02-09 14:00:24 +00:00
rcourtman
cedf0c8f0f fix(temperature): parse string sensor values without zeroing readings (#1224)
2026-02-09 14:00:09 +00:00
rcourtman
0d6fffbb1c fix(servicediscovery): run automatic refresh for changed/stale resources (#1225)
2026-02-09 14:00:02 +00:00
rcourtman
1f74c12ef8 fix(alerts): preserve docker update delay across host identity churn (#1226)
2026-02-09 13:59:52 +00:00
rcourtman
8a48acef1d fix: hotfix 5.1.5 — node duplication, alert scrambling, ntfy resolved formatting
- fix(models): filter nodes by instance in UpdateNodesForInstance to prevent
  PVE node duplication across poll cycles (#1214, #1192, #1217)
- fix(alerts): sort GetActiveAlerts output for stable ordering, preventing
  hostname scrambling in frontend (#1218)
- fix(notifications): add ntfy-specific resolved webhook formatting with
  plain-text body and proper headers (#1213)
- fix(frontend): respect "hide Docker update actions" setting in
  DockerFilter Update All button (#1219)
- fix(frontend): add missing v prefix to GitHub release tag URLs (#1195)
- fix(monitoring): reduce disk detection warning from Warn to Debug to
  eliminate log spam for pass-through disks (#1216)
- chore: bump VERSION to 5.1.5
2026-02-08 11:48:22 +00:00
rcourtman
d1e61d8a8a fix: ship alerting hotfixes and prepare 5.1.4
2026-02-07 22:05:55 +00:00
rcourtman
f253ed2778 fix(license): harden release key validation and fingerprint logging
2026-02-07 14:18:44 +00:00
rcourtman
6909264a02 fix(alerts): reduce swarm alert noise and preserve notification state (#1096)
2026-02-07 14:18:39 +00:00
rcourtman
13af83f3fc fix(monitoring): preserve recent PVE nodes on empty polls (#1094)
2026-02-07 14:18:33 +00:00
rcourtman
0f961054c6 fix: allow agent tokens to auto-register Proxmox nodes
The security hardening in beae4c86 added a settings:write scope
requirement to /api/auto-register, but agent install tokens only have
host-agent:report scope. This broke Proxmox auto-registration for all
agent-generated tokens. Accept either settings:write or host-agent:report
scope for auto-registration.

Fixes #1191
2026-02-04 22:55:25 +00:00
rcourtman
f6338f34fa fix: add agent:exec scope to generated agent tokens
Agent tokens created from the Settings UI and the backend install
command handler were missing the agent:exec scope, which was added
as a security requirement in 60f9e6f0. This caused all newly
installed agents to fail registration with "Agent exec token missing
required scope: agent:exec".

Fixes #1191
2026-02-04 22:33:01 +00:00
rcourtman
5bbc4329bd Remove pprof diagnostics endpoint
2026-02-04 20:44:00 +00:00
rcourtman
a37b59b7e4 Add admin-gated pprof diagnostics endpoint
2026-02-04 20:39:24 +00:00
rcourtman
8bb89c4031 test: add memory regression coverage for AI stores
2026-02-04 19:56:12 +00:00
rcourtman
ee0e89871d fix: reduce metrics memory 86x by reverting buffer and adding LTTB downsampling
The in-memory metrics buffer was changed from 1000 to 86400 points per
metric to support 30-day sparklines, but this pre-allocated ~18 MB per
guest (7 slices × 86400 × 32 bytes). With 50 guests that's 920 MB —
explaining why users needed to double their LXC memory after upgrading
to 5.1.0.

- Revert in-memory buffer to 1000 points / 24h retention
- Remove eager slice pre-allocation (use append growth instead)
- Add LTTB (Largest Triangle Three Buckets) downsampling algorithm
- Chart endpoints now use a two-tier strategy: in-memory for ranges
  ≤ 2h, SQLite persistent store + LTTB for longer ranges
- Reduce frontend ring buffer from 86400 to 2000 points

Related to #1190
2026-02-04 19:49:52 +00:00
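For readers unfamiliar with LTTB, a compact sketch of the algorithm follows. It is a generic textbook-style implementation under the names `point` and `lttb`, which are assumptions; it is not taken from the repository, and returning the input unchanged for thresholds below 3 is a simplification.

```go
package main

import "math"

type point struct{ X, Y float64 }

// lttb downsamples data (sorted by X) to threshold points using Largest
// Triangle Three Buckets. The first and last points are always kept.
func lttb(data []point, threshold int) []point {
	if threshold >= len(data) || threshold < 3 {
		return data
	}
	sampled := make([]point, 0, threshold)
	sampled = append(sampled, data[0])
	every := float64(len(data)-2) / float64(threshold-2)
	a := 0 // index of the most recently selected point
	for i := 0; i < threshold-2; i++ {
		// Average point of the next bucket; it acts as the third triangle corner.
		avgStart := int(float64(i+1)*every) + 1
		avgEnd := int(float64(i+2)*every) + 1
		if avgEnd > len(data) {
			avgEnd = len(data)
		}
		var avgX, avgY float64
		for _, p := range data[avgStart:avgEnd] {
			avgX += p.X
			avgY += p.Y
		}
		n := float64(avgEnd - avgStart)
		avgX /= n
		avgY /= n

		// In the current bucket, pick the point forming the largest triangle
		// with the previous selection and the next bucket's average. The
		// constant factor 0.5 is omitted because only the ordering matters.
		start := int(float64(i)*every) + 1
		end := int(float64(i+1)*every) + 1
		maxArea, next := -1.0, start
		for j := start; j < end; j++ {
			area := math.Abs((data[a].X-avgX)*(data[j].Y-data[a].Y) -
				(data[a].X-data[j].X)*(avgY-data[a].Y))
			if area > maxArea {
				maxArea, next = area, j
			}
		}
		sampled = append(sampled, data[next])
		a = next
	}
	return append(sampled, data[len(data)-1])
}
```

Because each bucket keeps the point that deviates most from its neighbours, short spikes tend to survive downsampling, which is why the technique suits long-range charts better than plain averaging.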
rcourtman
d2604a6859 test: add AI memory regression coverage
2026-02-04 19:46:20 +00:00
rcourtman
bcd0dbfc18 Add metrics history memory regression test
2026-02-04 19:35:19 +00:00
rcourtman
049a3e424c Add memory regression tests for agent and scheduler
2026-02-04 19:33:29 +00:00