Commit graph

3186 commits

Author SHA1 Message Date
rcourtman
8a43a964b6 fix(ai): wire patrol circuit breaker on first-time configure 2026-03-13 12:10:14 +00:00
rcourtman
ae2edbde20 fix(ai): complete wiring on first-time configure; guard Ollama fallback
Three follow-up fixes:

1. RestartAIChat() now performs the full post-start wiring (MCP providers,
   patrol adapter, investigation orchestrator) when the service starts for
   the first time via Restart(). Previously these were only wired via
   StartAIChat(), leaving first-time configure with a partially wired service.

2. The Ollama→OpenAI-compatible fallback in createProviderForModel is now
   guarded by !strings.HasPrefix(modelStr, "ollama:") so explicit
   "ollama:llama3" models are never silently rerouted to a different provider.

3. Windows install script registration check now uses the $Hostname override
   (if set) instead of always looking up $env:COMPUTERNAME, so post-install
   verification works correctly when a custom hostname is specified.
2026-03-13 12:06:08 +00:00
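Item 2 of the commit above (guarding the fallback introduced in e137f3fbf7) can be sketched roughly as follows. This is a minimal illustration, not the project's code: the helper name useOpenAIFallback, its parameters, and the placeholder URL are assumptions; only the "ollama:" prefix check via strings.HasPrefix comes from the commit message.

```go
package main

import (
	"fmt"
	"strings"
)

// useOpenAIFallback is a hypothetical helper, not Pulse's real function.
// It reports whether a model that parsed as an Ollama model should instead
// be sent to a custom OpenAI-compatible base URL.
func useOpenAIFallback(modelStr string, ollamaConfigured bool, openAIBaseURL string) bool {
	if ollamaConfigured || openAIBaseURL == "" {
		return false // Ollama is usable, or there is nowhere to fall back to
	}
	// An explicit "ollama:" prefix states the user's intent; never reroute it.
	return !strings.HasPrefix(modelStr, "ollama:")
}

func main() {
	const base = "http://localhost:8000/v1" // placeholder endpoint (assumption)
	fmt.Println(useOpenAIFallback("qwen3-omni", false, base))    // bare name: falls back
	fmt.Println(useOpenAIFallback("ollama:llama3", false, base)) // explicit prefix: stays on Ollama
	fmt.Println(useOpenAIFallback("qwen3-omni", true, base))     // Ollama configured: no fallback
}
```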
rcourtman
6b317f08d2 fix(agent): add --hostname support to Windows PowerShell install script
Adds $Hostname / $env:PULSE_HOSTNAME parameter so users can set a
custom display name at install time, matching the Linux install.sh
behaviour. Persists to config.json and passes --hostname to the agent
binary args.

Closes discussion #818
2026-03-13 11:54:12 +00:00
rcourtman
e137f3fbf7 fix(ai): start chat service on first-time configure without restart
When Pulse starts before AI is configured, legacyService is nil.
Saving AI settings called Restart() which bailed immediately on the
nil check, leaving the service unstarted (503 on /api/ai/sessions)
until a full process restart.

Merged the nil and !IsRunning checks so first-time configure now
starts the service inline, same as the already-handled stopped case.

Also: bare model names that ParseModelString routes to Ollama (e.g.
"qwen3-omni") now fall back to a configured custom OpenAI base URL
when Ollama is not explicitly configured — handles manually-typed
model names on self-hosted OpenAI-compatible endpoints.

Fixes #1339, #1296
2026-03-13 11:13:27 +00:00
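The merged nil / !IsRunning check described above might look something like this. It is a sketch under stated assumptions: the types chatService and aiHandler and the start helper are hypothetical stand-ins, and only the idea of treating "never started" the same as "stopped" comes from the commit message.

```go
package main

import "fmt"

// chatService is a hypothetical stand-in for the real chat service.
type chatService struct{ running bool }

func (s *chatService) IsRunning() bool { return s.running }

// aiHandler is a hypothetical owner of the service; svc is nil until
// AI has been configured for the first time.
type aiHandler struct {
	svc    *chatService
	starts int
}

func (h *aiHandler) start() {
	h.svc = &chatService{running: true}
	h.starts++
}

// Restart starts the service inline when it was never created or is stopped,
// instead of returning early on a nil service.
func (h *aiHandler) Restart() {
	if h.svc == nil || !h.svc.IsRunning() {
		h.start()
		return
	}
	h.svc.running = false // stand-in for a real shutdown
	h.start()
}

func main() {
	h := &aiHandler{} // Pulse started before AI was configured
	h.Restart()       // first-time configure
	fmt.Println(h.svc != nil && h.svc.IsRunning(), h.starts)
}
```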
rcourtman
fde4d9124e fix(frontend): defer discovery tab initialization until opened
2026-03-10 23:14:30 +00:00
rcourtman
d05a00b931 fix(monitoring): smooth transient VM memory fallback spikes 2026-03-10 23:06:17 +00:00
rcourtman
40a85175be fix(frontend): preserve drawer chart range across live updates 2026-03-10 22:56:30 +00:00
rcourtman
afcfb23a30 fix(monitoring): retain intermittent FreeBSD SMART data 2026-03-10 22:52:25 +00:00
rcourtman
1a582ccc35 fix(diagnostics): honor PVE fingerprint in diagnostics probe 2026-03-10 22:46:12 +00:00
rcourtman
92b6da83ea Refine tooltip labels: Reclaimable cache, Shown in Proxmox
2026-03-10 10:35:19 +00:00
rcourtman
9601afb44c Rename Cache to Reclaimable and add Proxmox reconciliation in tooltip
Rename the amber segment label from "Cache" to "Reclaimable" to avoid
jargon confusion. Add a "Proxmox view: X%" line in the tooltip so
users immediately see why the percentage differs from Proxmox (which
includes reclaimable cache as used memory).
2026-03-10 10:26:50 +00:00
rcourtman
7dab977d91 Add split memory bar showing Used | Cache | Free segments (#1302)
Show reclaimable buff/cache as a distinct amber segment between used
(green) and free (gray) in the memory bar. This explains why Pulse's
memory percentage differs from Proxmox: Pulse reports cache-aware
usage (MemAvailable) while Proxmox includes cache as used (Total-Free).

Backend: add Cache field to Memory model, derived from MemInfo
(Available - Free). Only uses MemInfo.Free (not FreeMem fallback) to
avoid inflating cache by the balloon gap on ballooned VMs.

Frontend: StackedMemoryBar renders three segments with tooltip
breakdown. Tooltip Free accounts for balloon limit when active.
Percentage label and alerts remain cache-aware (unchanged).
2026-03-10 10:16:14 +00:00
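The arithmetic behind the split memory bar can be illustrated as below. This is a sketch, assuming a simplified memInfo struct and helper names that are not taken from the repository; the formulas (cache = Available - Free, Pulse usage based on MemAvailable, Proxmox usage as Total - Free) are the ones stated in the commit message.

```go
package main

import "fmt"

const gib = 1 << 30

// memInfo is a simplified, hypothetical view of guest memory counters in bytes.
type memInfo struct {
	Total, Free, Available uint64
}

// cacheBytes returns reclaimable cache as Available - Free, clamped at zero
// so an inconsistent sample cannot underflow the unsigned subtraction.
func cacheBytes(m memInfo) uint64 {
	if m.Available <= m.Free {
		return 0
	}
	return m.Available - m.Free
}

// pulseUsedPercent is cache-aware: memory that is not available counts as used.
func pulseUsedPercent(m memInfo) float64 {
	if m.Total == 0 || m.Available > m.Total {
		return 0
	}
	return float64(m.Total-m.Available) / float64(m.Total) * 100
}

// proxmoxUsedPercent counts everything that is not free, including cache.
func proxmoxUsedPercent(m memInfo) float64 {
	if m.Total == 0 || m.Free > m.Total {
		return 0
	}
	return float64(m.Total-m.Free) / float64(m.Total) * 100
}

func main() {
	m := memInfo{Total: 8 * gib, Free: 1 * gib, Available: 5 * gib}
	fmt.Println(cacheBytes(m)/gib, pulseUsedPercent(m), proxmoxUsedPercent(m))
}
```

With 8 GiB total, 1 GiB free and 5 GiB available, the bar would show 3 GiB used, 4 GiB reclaimable and 1 GiB free, which is why the two percentages diverge.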
rcourtman
5498575b8f Auto-update Helm chart documentation 2026-03-09 22:25:17 +00:00
rcourtman
83d3e3e95e Bump version to 5.1.23 2026-03-09 21:49:21 +00:00
rcourtman
7a394ed724 Use explicit success flag for disk carry-forward guard (#1319)
Replace the diskUsage <= 0 heuristic with a diskFromAgent bool that is
only set when the guest agent actually returns valid filesystem data.
Prevents carry-forward from firing on a genuine 0% disk reading.
2026-03-09 18:54:27 +00:00
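The three disk carry-forward commits (abbd0df609, 9c279732f7, 7a394ed724) combine into a decision along these lines. This is only a sketch: diskSample and resolveDiskUsage are hypothetical names, and the real pollers certainly carry more state; the fromAgent flag mirrors the diskFromAgent bool named in the message.

```go
package main

import "fmt"

// diskSample is a hypothetical per-cycle reading for one VM.
type diskSample struct {
	usage     float64 // percent
	fromAgent bool    // true only when the guest agent returned valid filesystem data
}

// resolveDiskUsage decides which value to publish for this cycle.
//   - A reading that really came from the agent is trusted, even if it is 0%.
//   - With the agent explicitly disabled, nothing is carried forward.
//   - Otherwise the agent failed this cycle, so the previous value is kept.
func resolveDiskUsage(current diskSample, previous float64, agentEnabled bool) float64 {
	if current.fromAgent || !agentEnabled {
		return current.usage
	}
	return previous
}

func main() {
	fmt.Println(resolveDiskUsage(diskSample{usage: 0, fromAgent: true}, 42, true)) // genuine 0%
	fmt.Println(resolveDiskUsage(diskSample{usage: 0}, 42, true))                  // agent timed out
	fmt.Println(resolveDiskUsage(diskSample{usage: 0}, 42, false))                 // agent disabled
}
```

The explicit flag is what distinguishes "the agent said 0%" from "no agent data this cycle", which a usage <= 0 test cannot do.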
rcourtman
9c279732f7 Skip disk carry-forward when guest agent is explicitly disabled (#1319)
Prevents stale disk data from persisting indefinitely in the efficient
poller when a user disables the guest agent after it had been providing
data.  Matches the fallback poller's agent-disabled exclusion.
2026-03-09 18:37:38 +00:00
rcourtman
abbd0df609 Fix disk metric spikes when guest agent intermittently fails (#1319)
Carry forward previous cycle's disk data when the QEMU guest agent
times out or errors, instead of falling back to Proxmox cluster/resources
which always reports 0 for VM disk usage.  Applied to both polling paths
(pollVMsAndContainersEfficient and pollVMsWithNodes) with safety guards
against uint64 underflow and permanent-failure exclusions.
2026-03-09 18:23:15 +00:00
rcourtman
a4b0771974 Prevent removed host agents from resurrecting via in-flight reports (#1331)
Host agents removed from the UI would reappear on the next report cycle
because there was no rejection mechanism — unlike Docker agents which
already had resurrection prevention. Mirror the Docker agent pattern:

- Track removed host IDs in a `removedHosts` map with 24hr TTL
- Persist removal records in `State.RemovedHosts` for frontend display
- Reject reports from removed hosts in `ApplyHostReport()`
- Add `AllowHostReenroll()` + API route to clear the block
- Show removed host agents in the Settings UI with "Allow re-enroll"
- Sync removed-agent maps from state on startup for all agent types
- Fix mock integration snapshot missing `RemovedDockerHosts` field
2026-03-09 17:52:34 +00:00
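A removal map with a TTL, as listed above, could be sketched like this. The type hostRegistry and its method names are assumptions made for illustration (the commit names removedHosts, ApplyHostReport and AllowHostReenroll but does not show their signatures), and passing the clock in explicitly is a choice made here to keep the example deterministic.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

const removedHostTTL = 24 * time.Hour

var errHostRemoved = errors.New("host was removed; re-enrollment not allowed yet")

// hostRegistry is a hypothetical holder for the removal records.
type hostRegistry struct {
	mu      sync.Mutex
	removed map[string]time.Time // host ID -> time of removal
}

func newHostRegistry() *hostRegistry {
	return &hostRegistry{removed: make(map[string]time.Time)}
}

// Remove records that a host was deleted from the UI.
func (r *hostRegistry) Remove(id string, now time.Time) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.removed[id] = now
}

// AllowReenroll clears the block so the host may report again.
func (r *hostRegistry) AllowReenroll(id string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	delete(r.removed, id)
}

// ApplyReport rejects in-flight reports from recently removed hosts and
// drops the record once the TTL has passed.
func (r *hostRegistry) ApplyReport(id string, now time.Time) error {
	r.mu.Lock()
	defer r.mu.Unlock()
	removedAt, ok := r.removed[id]
	if !ok {
		return nil
	}
	if now.Sub(removedAt) < removedHostTTL {
		return errHostRemoved
	}
	delete(r.removed, id)
	return nil
}

// epoch is an arbitrary fixed instant used only by the demo.
var epoch = time.Date(2026, 3, 9, 0, 0, 0, 0, time.UTC)

func main() {
	reg := newHostRegistry()
	reg.Remove("host-1", epoch)
	fmt.Println(reg.ApplyReport("host-1", epoch.Add(time.Minute))) // rejected
	reg.AllowReenroll("host-1")
	fmt.Println(reg.ApplyReport("host-1", epoch.Add(2*time.Minute))) // accepted
}
```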
rcourtman
9b531c547d Fix recovery notifications silently disabled by config PUT (#1332)
Two fixes for missing recovery/resolved notifications:

1. API config PUT handler now preserves notifyOnResolve when the client
   omits it from the request body. Go decodes a missing bool as false,
   which silently disabled recovery notifications on older clients.

2. CancelAlert now always cleans up the cooldown record even when the
   alert has already left the pending buffer, preventing stale cooldown
   entries from suppressing future alert cycles.
2026-03-09 11:28:28 +00:00
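Fix 1 relies on distinguishing "field omitted" from "field set to false". One common way to do that with encoding/json is a pointer field, sketched below; the struct and function names are hypothetical, and the actual handler may preserve the value differently.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// configUpdate is a hypothetical request body. A *bool stays nil when the
// client omits the key, whereas a plain bool would silently decode as false.
type configUpdate struct {
	NotifyOnResolve *bool `json:"notifyOnResolve"`
}

// applyUpdate returns the new setting, keeping the current value when the
// request body does not mention notifyOnResolve.
func applyUpdate(current bool, body []byte) (bool, error) {
	var u configUpdate
	if err := json.Unmarshal(body, &u); err != nil {
		return current, err
	}
	if u.NotifyOnResolve == nil {
		return current, nil // omitted by an older client: preserve
	}
	return *u.NotifyOnResolve, nil
}

func main() {
	kept, _ := applyUpdate(true, []byte(`{}`))
	changed, _ := applyUpdate(true, []byte(`{"notifyOnResolve":false}`))
	fmt.Println(kept, changed)
}
```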
rcourtman
572520ebc6 Promote guest-agent /proc/meminfo fallback for accurate VM memory (#1270)
Move the guest-agent file-read of /proc/meminfo earlier in the memory
fallback chain so it runs before RRD, giving real-time MemAvailable that
correctly excludes reclaimable buff/cache on Linux VMs. Also add
VM.GuestAgent.FileRead permission for PVE 9 and fix install.sh to use
comma-separated privilege strings.
2026-03-09 10:04:28 +00:00
rcourtman
aa139b73fb Fix intermittent VM disappearance from dashboard (#555)
Two root causes: (1) When Proxmox cluster/resources returns a partial
response (e.g. during migration or transient API issue), VMs missing
from a responsive node were silently dropped because the node appeared
in nodesWithResources, bypassing grace-period preservation. Now
preserves recently-seen guests from online nodes for up to the grace
window. (2) The task queue allowed overlapping polls for the same PVE
instance — a slower stale poll could overwrite a newer complete VM list.
Added per-instance execution lock to skip duplicate scheduled tasks.
2026-03-08 22:16:24 +00:00
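The per-instance execution lock from root cause (2) of aa139b73fb can be approximated with a guarded set of in-flight instance names. This is a sketch only; instanceGuard and its methods are hypothetical and say nothing about how the real task queue is structured.

```go
package main

import (
	"fmt"
	"sync"
)

// instanceGuard is a hypothetical tracker of which PVE instances are
// currently being polled.
type instanceGuard struct {
	mu      sync.Mutex
	running map[string]bool
}

func newInstanceGuard() *instanceGuard {
	return &instanceGuard{running: make(map[string]bool)}
}

// tryStart reports whether the caller may poll the instance. It returns
// false when a poll for the same instance is still in flight, so the
// duplicate scheduled task is skipped rather than queued behind it.
func (g *instanceGuard) tryStart(instance string) bool {
	g.mu.Lock()
	defer g.mu.Unlock()
	if g.running[instance] {
		return false
	}
	g.running[instance] = true
	return true
}

// finish releases the instance once its poll has completed.
func (g *instanceGuard) finish(instance string) {
	g.mu.Lock()
	defer g.mu.Unlock()
	delete(g.running, instance)
}

func main() {
	g := newInstanceGuard()
	fmt.Println(g.tryStart("pve1")) // first poll proceeds
	fmt.Println(g.tryStart("pve1")) // overlapping poll is skipped
	g.finish("pve1")
	fmt.Println(g.tryStart("pve1")) // next cycle proceeds
}
```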
rcourtman
d560de15ad Increase alerts test cleanup sleep to fix flaky test under load
The 10ms goroutine drain pause was insufficient under full parallel
test suite load, causing intermittent failures in
TestPulseMonitorOnlySkipsDispatchButRetainsAlert.
2026-03-08 22:16:24 +00:00
rcourtman
98c9de7c91 Fix FreeBSD SMART disk detection for ada/da/nvd devices (#1254)
FreeBSD disk discovery now falls back to scanning /dev for ada*, da*,
nvd*, nda* and other FreeBSD disk names when kern.disks misses them.
Probe order prefers the correct device type first (sat for ada, nvme
for nvd). Standby disks are preserved as valid results instead of
being dropped.
2026-03-08 22:16:24 +00:00
rcourtman
82c615b3b9 Filter virtual disks from SMART checks to prevent false positives (#1329)
ZFS zvols (zd*), device-mapper, virtio disks, and other virtual block
devices don't support SMART and were being reported as FAILED. Use lsblk
JSON metadata to filter by device prefix, transport, subsystem, and
vendor/model. Also treat missing smart_status as unknown rather than
failed, and ignore UNKNOWN health in Patrol/AI signals.
2026-03-08 22:16:24 +00:00
rcourtman
f66aa66e74 Auto-update Helm chart version to 5.1.22 2026-03-08 12:27:01 +00:00
rcourtman
015a33ba13 Auto-update Helm chart documentation 2026-03-08 12:27:00 +00:00
rcourtman
43864ffb95 Bump version to 5.1.22 2026-03-08 11:51:17 +00:00
rcourtman
45b5c8a861 Restore previous license on persistence failure instead of clearing it
If license save fails, the in-memory license was being cleared, which
could drop a valid existing license. Now snapshots the current license
before activation and restores it if persistence fails.
2026-03-08 11:49:26 +00:00
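The snapshot-and-restore behaviour described above follows a familiar pattern, sketched here with hypothetical types. The license struct, the Activate signature and the injected save function are assumptions for illustration, not the project's API.

```go
package main

import (
	"errors"
	"fmt"
)

// license is a hypothetical, minimal license record.
type license struct{ Key string }

type licenseService struct{ current *license }

// Activate swaps in the new license, then persists it. If persistence
// fails, the previous in-memory license is restored instead of cleared.
func (s *licenseService) Activate(next *license, save func(*license) error) error {
	previous := s.current // snapshot before touching state
	s.current = next
	if err := save(next); err != nil {
		s.current = previous
		return fmt.Errorf("persist license: %w", err)
	}
	return nil
}

// Demo persistence functions (assumptions, used only by the example).
func failingSave(*license) error { return errors.New("disk full") }
func workingSave(*license) error { return nil }

func main() {
	svc := &licenseService{current: &license{Key: "old"}}
	fmt.Println(svc.Activate(&license{Key: "new"}, failingSave), svc.current.Key)
	fmt.Println(svc.Activate(&license{Key: "new"}, workingSave), svc.current.Key)
}
```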
rcourtman
fe0706f614 Fix cluster double-registration invalidating Proxmox credentials (#1319)
Two nodes in the same PVE cluster generated identical Proxmox API token
names, so the second node's setup rotated the shared token and broke the
first node. Include the hostname in the token name so each node gets its
own token. Also refresh the stored cluster credential on the server when
a new endpoint merges into an existing cluster entry.
2026-03-07 22:36:01 +00:00
rcourtman
ff1bbe2fb8 Guard per-VM guest agent calls with timeout and panic recovery (#1319)
A broken or hung qemu-agent on one VM could stall the entire polling
loop, preventing higher-VMID VMs from being detected. Wrap all guest
agent work in a 10s per-VM budget with panic recovery, and add a 2s
timeout to GetVMStatus in the efficient poller to match the legacy path.
2026-03-07 22:30:18 +00:00
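A per-call budget with panic recovery, in the spirit of the commit above, might be wrapped like this. It is a sketch under assumptions: callWithBudget and the three demo agents are hypothetical, and the demo uses a 20 ms budget where the commit describes 10 s per VM.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

var errBudgetExceeded = errors.New("guest agent budget exceeded")

const demoBudget = 20 * time.Millisecond // the commit describes a 10s per-VM budget

// callWithBudget runs work in its own goroutine, converts a panic into an
// error, and gives up once the budget elapses so one VM cannot stall the loop.
func callWithBudget(budget time.Duration, work func(context.Context) error) error {
	ctx, cancel := context.WithTimeout(context.Background(), budget)
	defer cancel()

	done := make(chan error, 1) // buffered so a late finisher never blocks
	go func() {
		defer func() {
			if r := recover(); r != nil {
				done <- fmt.Errorf("guest agent panic: %v", r)
			}
		}()
		done <- work(ctx)
	}()

	select {
	case err := <-done:
		return err
	case <-ctx.Done():
		return errBudgetExceeded
	}
}

// Demo agents (assumptions, used only by the example).
func healthyAgent(context.Context) error { return nil }
func panickyAgent(context.Context) error { panic("malformed agent reply") }
func hungAgent(context.Context) error    { time.Sleep(200 * time.Millisecond); return nil }

func main() {
	fmt.Println(callWithBudget(demoBudget, healthyAgent))
	fmt.Println(callWithBudget(demoBudget, panickyAgent))
	fmt.Println(callWithBudget(demoBudget, hungAgent))
}
```

Note that work which ignores its context keeps running in the background until it returns; the wrapper only stops the caller from waiting on it.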
rcourtman
0dd3fc779b Fix alert disable notification suppression
2026-03-07 18:40:08 +00:00
rcourtman
d6e8bffaeb pulse/license upgrade safety hardening 2026-03-07 15:13:09 +00:00
rcourtman
a6f6f66078 Improve auto-register auth errors and setup token grace window (#1319)
The /api/auto-register endpoint returned a generic "Invalid or expired
setup code" for all auth failures, making cluster registration issues
impossible to diagnose. Now returns specific errors for expired tokens,
wrong scope, invalid API tokens, etc.

Also extend the setup token grace window to /api/auto-register so
multiple cluster nodes can register with the same token within the
1-minute grace period after first use.
2026-03-07 13:39:26 +00:00
rcourtman
c0b3a0e665 Restart Pulse service after failed auto-update (#1323)
The auto-update flow stops the Pulse service before applying updates.
If the update fails, the rollback path restored files but never
restarted the service. Since the main unit was explicitly stopped
(not crashed), systemd's Restart=always didn't rescue it.

Add restart-on-failure guards to both pulse-auto-update.sh and
install.sh so Pulse is always restarted after a failed update attempt.
2026-03-07 10:46:19 +00:00
rcourtman
64f3bfa922 Bump dompurify to 3.3.2 to fix XSS vulnerability (Dependabot #64)
DOMPurify 3.1.3–3.3.1 has an XSS vulnerability via missing rawtext
element sanitization. Bump to 3.3.2 which includes the fix.
2026-03-07 10:46:12 +00:00
rcourtman
ddecf6d00c Guard legacyMonitor typed-nil and add OIDC refresh panic recovery
Normalize SystemSettingsMonitor interface assignments via reflect to
prevent typed-nil-in-interface (same class as #1324 fix). Also add
defer/recover to the background OIDC token refresh goroutine so a
panic there cannot take down the process.
2026-03-07 10:21:07 +00:00
rcourtman
23a9fa70da Fix nil pointer crash when saving settings (#1324)
SystemSettingsHandler.mtMonitor was an interface field. A nil
*MultiTenantMonitor stored in it became a non-nil interface
(Go typed-nil-in-interface), bypassing the nil guard in getMonitor()
and panicking on every settings save in single-tenant mode.

Change mtMonitor to concrete *monitoring.MultiTenantMonitor so nil
checks work correctly. Also resolve getMonitor() once per request
instead of repeated calls to eliminate a TOCTOU race.
2026-03-07 10:21:07 +00:00
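The typed-nil-in-interface behaviour behind #1324 can be reproduced in a few lines. The types below are hypothetical stand-ins for SystemSettingsHandler and MultiTenantMonitor; the point is only how a nil pointer stored in an interface field compares against nil.

```go
package main

import "fmt"

// monitor and multiTenantMonitor are hypothetical stand-ins.
type monitor interface{ Name() string }

type multiTenantMonitor struct{}

func (m *multiTenantMonitor) Name() string { return "mt" }

// handlerWithInterface stores the monitor behind an interface (the buggy shape).
type handlerWithInterface struct{ mt monitor }

// handlerWithPointer stores the concrete pointer (the fixed shape).
type handlerWithPointer struct{ mt *multiTenantMonitor }

func (h handlerWithInterface) hasMonitor() bool { return h.mt != nil }
func (h handlerWithPointer) hasMonitor() bool   { return h.mt != nil }

func main() {
	var none *multiTenantMonitor // nil pointer, as in single-tenant mode

	viaInterface := handlerWithInterface{mt: none}
	viaPointer := handlerWithPointer{mt: none}

	// The interface value carries a type (*multiTenantMonitor) with a nil
	// pointer inside, so it is not equal to nil and the guard is bypassed.
	fmt.Println(viaInterface.hasMonitor())
	// The concrete pointer compares against nil as expected.
	fmt.Println(viaPointer.hasMonitor())
}
```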
rcourtman
4ea2f49771 Auto-update Helm chart version to 5.1.21
2026-03-06 12:15:39 +00:00
rcourtman
c26a96ef51 Auto-update Helm chart documentation 2026-03-06 12:15:38 +00:00
rcourtman
01bf637d0d Fix QNAP agent duplicate processes during upgrades (#1317)
Add singleton watchdog with lock dir, pidfile tracking, and signal
traps to prevent multiple pulse-agent instances spawning on QNAP.
Tighten procfs matching to avoid killing unrelated processes.
2026-03-06 11:40:53 +00:00
rcourtman
9244498b75 Bump version to 5.1.21 2026-03-06 11:05:01 +00:00
rcourtman
89577fe533 Fix OIDC token refresh bypass and guard AISettingsHandler nil path
The applyAuthContextHeaders early-return in CheckAuth skipped the OIDC
token refresh block, causing long-lived OIDC sessions to expire instead
of auto-refreshing. Move the refresh trigger into extractAndStoreAuthContext
so it fires at the middleware level before CheckAuth's early return.

Also add a nil guard on mtPersistence in AISettingsHandler.GetAIService
for non-default org paths, preventing a potential panic if background
code carries a non-default org context in v5 single-tenant mode.
2026-03-06 11:05:01 +00:00
rcourtman
743ef17b79 Fix AI and config profile handlers broken in v5 single-tenant mode
The single-tenant lockdown (499ab812e) set mtPersistence to nil but
only patched AISettingsHandler with a legacy fallback. AIHandler (chat
service) and ConfigProfileHandler were missed, so AI features (Patrol,
Chat) failed with "chat service not available" and config profiles
would panic on nil dereference. Wire legacy persistence into both
handlers and add the same fallback to ProfileSuggestionHandler.

Fixes #1322
2026-03-06 11:05:01 +00:00
rcourtman
73bf2c1c7b Auto-update Helm chart version to 5.1.20
2026-03-06 00:33:13 +00:00
rcourtman
4c5fbb0c04 Auto-update Helm chart documentation 2026-03-06 00:33:12 +00:00
rcourtman
6618db7799 Fix v5 single-tenant router test setup 2026-03-05 23:58:11 +00:00
rcourtman
ed8283b223 Bump version to 5.1.20
2026-03-05 23:46:35 +00:00
rcourtman
499ab812e3 Fix post-release regressions and lock v5 to single-tenant runtime 2026-03-05 23:46:35 +00:00
rcourtman
464d3f8486 Fix stale queued notification delivery 2026-03-05 23:46:35 +00:00
rcourtman
74ce77132b Auto-update Helm chart version to 5.1.19
2026-03-05 11:21:24 +00:00