Commit graph

3440 commits

Author SHA1 Message Date
rcourtman
7475c8a238 Auto-update Helm chart version to 5.1.30 2026-05-03 19:07:40 +00:00
rcourtman
719e78ce2f Auto-update Helm chart documentation 2026-05-03 19:07:39 +00:00
rcourtman
8071758ce3 Prepare v5.1.30 release
Refs #1454
2026-05-03 19:25:54 +01:00
rcourtman
8337cbc4c9 Fix v5 diagnostics GitHub export
Normalize diagnostics collection fields to empty arrays before encoding and harden the sanitized GitHub export path against null arrays so empty v5 installs can still produce issue attachments.

Refs #1454
2026-05-03 19:12:24 +01:00
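The null-array problem that 8337cbc4c9 describes follows from how Go's encoding/json treats nil slices. A minimal sketch, assuming a hypothetical `Diagnostics` struct (the real payload and its field names are not shown in the commit message):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Diagnostics is a hypothetical stand-in for the real diagnostics payload.
type Diagnostics struct {
	Nodes  []string `json:"nodes"`
	Alerts []string `json:"alerts"`
}

// normalize replaces nil slices with empty ones so that they encode as []
// instead of null.
func normalize(d *Diagnostics) {
	if d.Nodes == nil {
		d.Nodes = []string{}
	}
	if d.Alerts == nil {
		d.Alerts = []string{}
	}
}

// encode marshals the payload and returns it as a string ("" on error).
func encode(d Diagnostics) string {
	b, err := json.Marshal(d)
	if err != nil {
		return ""
	}
	return string(b)
}

func main() {
	var d Diagnostics      // an "empty install": every slice is nil
	fmt.Println(encode(d)) // {"nodes":null,"alerts":null}
	normalize(&d)
	fmt.Println(encode(d)) // {"nodes":[],"alerts":[]}
}
```

Consumers that iterate over the arrays can then treat an empty install like any other install.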
rcourtman
9bfef81d93 Fix v5 update helper installer URL
Render the maintenance installer URL into the generated update helper so it does not depend on installer-only shell functions after installation. Add a smoke test that executes the generated helper with fake curl and bash to preserve source-build forwarding.

Refs #1454
2026-05-03 18:57:28 +01:00
rcourtman
80adfe848c Bump postcss to 8.5.13 on release/5.1
Keeps the release/5.1 frontend lockfile above the patched floor for GHSA-qx2v-qp2m-jg93 and aligned with the default-branch Dependabot fix.

Refs Dependabot alert #83.
2026-05-01 20:18:00 +01:00
rcourtman
7294f795cb Auto-update Helm chart version to 5.1.29 2026-05-01 14:44:25 +00:00
rcourtman
08fd10188e Auto-update Helm chart documentation 2026-05-01 14:44:23 +00:00
rcourtman
858c894023 Prepare v5.1.29 release 2026-05-01 15:04:48 +01:00
rcourtman
84d6aa7ba8 Document issue-first contribution policy
Pulse is a single-maintainer project and does not accept unsolicited
external pull requests. README, CONTRIBUTING, and a new
PULL_REQUEST_TEMPLATE now state this directly so contributors hit the
policy before investing time in code, and so PRs opened in error point
to issues and discussions as the correct intake.

CONTRIBUTING is rewritten end-to-end around the new policy: how to
file bugs, feature requests, support questions, and security reports;
where to look for context (README, ARCHITECTURE, docs/); and the
maintainer-direction carve-out for PRs explicitly requested against
tracked issues.
2026-05-01 15:04:41 +01:00
rcourtman
3d3b1a9642 Stop re-notification spam when alert cooldown is disabled (Fixes #1444)
shouldNotifyAfterCooldown previously returned true on every call when
Schedule.Cooldown was 0 or negative, which the alert evaluation loop
runs on every metric tick. With cooldown disabled, an active alert was
re-notified on each tick.

The UI labels cooldown=0 as "Disabled," so the intuitive contract is
"do not re-notify," not "re-notify continuously." Treat <=0 as
"first-time only": fire the initial notification, then suppress
subsequent re-notifications until the alert clears or the cooldown is
configured to a positive value. Level escalation re-notifications
remain handled at the call site and are unaffected.

Tests cover all three branches: first-time fire with cooldown=0,
re-notification suppression with cooldown=0 (named regression guard
for #1444), and the same behavior for negative values.
2026-05-01 15:04:27 +01:00
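The cooldown contract from 3d3b1a9642 can be sketched as follows. This is a simplified illustration, not the project's code: the real shouldNotifyAfterCooldown signature is not shown in the commit message, so `shouldNotify` and its parameters are assumptions.

```go
package main

import (
	"fmt"
	"time"
)

// shouldNotify is a hypothetical simplification of the cooldown check.
// lastNotified is the zero time when the alert has never been notified.
func shouldNotify(cooldown time.Duration, lastNotified, now time.Time) bool {
	if lastNotified.IsZero() {
		return true // the first notification always fires
	}
	if cooldown <= 0 {
		return false // cooldown disabled: first-time only, no re-notification
	}
	return now.Sub(lastNotified) >= cooldown
}

// demo exercises the three branches named in the commit message plus the
// ordinary positive-cooldown case.
func demo() (first, repeat, negative, elapsed bool) {
	now := time.Now()
	var never time.Time
	first = shouldNotify(0, never, now)
	repeat = shouldNotify(0, now.Add(-time.Hour), now)
	negative = shouldNotify(-time.Second, now.Add(-time.Hour), now)
	elapsed = shouldNotify(5*time.Minute, now.Add(-10*time.Minute), now)
	return
}

func main() {
	fmt.Println(demo())
}
```

Level escalation is deliberately left out here, since the commit says it is handled at the call site.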
rcourtman
f0f20422da Always make UpdateProgressModal closable so a stuck update can't lock the UI
The modal had no close path when isComplete() was false: the X button
was Show-gated on isComplete(), there was no Escape handler, and the
backdrop had no onClick. So if the SSE stream dropped, the polling
fallback failed, or the update process crashed before writing a
terminal status, the modal stayed open with a black backdrop covering
the page and no way to dismiss it except a hard browser refresh — the
"page is blacked out and you can't press anything" symptom.

Make the close path always available:
  - The X button in the header is no longer Show-gated. Its tooltip
    and aria-label adapt to clarify that closing during an active
    update only hides the modal — the update keeps running.
  - Escape on the document closes the modal while it is open.
  - Clicking on the backdrop (and only the backdrop, not the modal
    body) closes the modal.

The actual update process is server-side and unaffected: closing
just unmounts the modal's local SSE/polling. GlobalUpdateProgressWatcher
keeps polling /api/updates/status independently and will surface
completion via the existing reload path or via the Updates settings
page.

Frontend type-check passes and the 447-test vitest suite is green.
2026-04-30 12:01:25 +01:00
rcourtman
611ae5b9f8 Add --agent-id-file so containerized agents keep a stable identity
Pulse agents derive their identity from /etc/machine-id by default. In
Docker containers (especially nested in LXCs), /etc/machine-id is not
guaranteed stable across container recreation: a fresh image instance
gets a new machine-id, and the resulting AgentID drift causes the
server to reject reports with 401 because the API token is bound to
the original AgentID via the bound_agent_id token-metadata check
(internal/api/router.go:1448-1458). Refs #1447.

Add a --agent-id-file (and PULSE_AGENT_ID_FILE env var) flag that:

  1. Reads the persisted AgentID from the file on start, when present,
     and short-circuits machine-id detection. The user mounts the file
     as a Docker volume (e.g. -v pulse-agent-id:/var/lib/pulse-agent)
     so it survives container recreation.
  2. On first start (or when the file is missing/empty), the existing
     machine-id derivation runs and the resolved ID is written to the
     file atomically (tmp + rename, 0600 perms, parent dir created).

Subsequent restarts of the container — even after `docker rm -f` and
a fresh `docker run` — read the same ID from the volume and the
server keeps recognising the agent.

Default is no flag set, which preserves the current
/etc/machine-id-derived behaviour for non-containerized installs.
2026-04-30 11:50:08 +01:00
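The read-or-derive-then-persist flow from 611ae5b9f8 might look roughly like this. It is a sketch under assumptions: `resolveAgentID` and the `derive` callback are hypothetical names, and the real implementation may differ in error handling and permissions.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// resolveAgentID returns the ID persisted at path. If the file is missing or
// empty, it calls derive (standing in for machine-id detection) and writes
// the result atomically via a temp file and rename.
func resolveAgentID(path string, derive func() (string, error)) (string, error) {
	if data, err := os.ReadFile(path); err == nil {
		if id := strings.TrimSpace(string(data)); id != "" {
			return id, nil // persisted ID wins; skip detection entirely
		}
	}
	id, err := derive()
	if err != nil {
		return "", err
	}
	dir := filepath.Dir(path)
	if err := os.MkdirAll(dir, 0o700); err != nil {
		return "", err
	}
	// os.CreateTemp creates the file with mode 0600.
	tmp, err := os.CreateTemp(dir, ".agent-id-*")
	if err != nil {
		return "", err
	}
	defer os.Remove(tmp.Name()) // harmless once the rename has succeeded
	_, werr := tmp.WriteString(id + "\n")
	cerr := tmp.Close()
	if werr != nil {
		return "", werr
	}
	if cerr != nil {
		return "", cerr
	}
	if err := os.Rename(tmp.Name(), path); err != nil {
		return "", err
	}
	return id, nil
}

// demo simulates two container starts that share one volume but see
// different machine IDs.
func demo() (string, string, error) {
	dir, err := os.MkdirTemp("", "agent-id-demo")
	if err != nil {
		return "", "", err
	}
	defer os.RemoveAll(dir)
	path := filepath.Join(dir, "state", "agent-id")
	first, err := resolveAgentID(path, func() (string, error) { return "machine-a", nil })
	if err != nil {
		return "", "", err
	}
	second, err := resolveAgentID(path, func() (string, error) { return "machine-b", nil })
	return first, second, err
}

func main() {
	fmt.Println(demo())
}
```

The second call returns the first ID even though the derived value changed, which is the stability property the flag is meant to provide.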
rcourtman
4a5e234c12 Carry forward previous snapshots for guests we cannot poll this cycle
When the snapshot-polling budget runs out mid-loop, or a single guest's
GetVMSnapshots/GetContainerSnapshots call returns an error, the polling
function used to early-return without writing any state. That meant:

  1. snapshots successfully fetched for earlier guests in the same
     cycle were thrown away, and
  2. on the next successful cycle, the freshly-polled snapshots
     replaced the entire instance's snapshot list — wiping out any
     snapshots whose owning VM had failed to respond this round.

For users with a busy production cluster (many guests, intermittent
per-VM API failures), this manifests as "new snapshots never appear
in the Backups tab" because the failing VM keeps blanking the list
the moment a successful poll lands (#1437).

Now we read the previous snapshots for the instance up front, track
which guests we successfully polled this cycle, and at the end merge
the fresh data with previously-known snapshots for any guest we
couldn't reach. Successfully-polled guests get their fresh data so
new snapshots appear; failed guests keep their last-known list so
transient errors do not blank state. The early-return on deadline is
removed so the merge runs even on partial-failure cycles.

Tests cover the carry-forward path: a fresh successful poll for one
VM lands a new snapshot, and a concurrent failed poll for a second
VM preserves its previously-known snapshot rather than dropping it.
2026-04-30 11:43:01 +01:00
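The carry-forward merge from 4a5e234c12 can be illustrated with a small sketch. The `Snapshot` type and `mergeSnapshots` helper are hypothetical; the commit message does not show the real types.

```go
package main

import "fmt"

// Snapshot is a hypothetical, trimmed-down snapshot record.
type Snapshot struct {
	GuestID string
	Name    string
}

// mergeSnapshots keeps fresh data for guests polled this cycle and carries
// forward the previously known snapshots of every guest that was not polled.
func mergeSnapshots(previous, fresh []Snapshot, polled map[string]bool) []Snapshot {
	merged := make([]Snapshot, 0, len(previous)+len(fresh))
	merged = append(merged, fresh...)
	for _, s := range previous {
		if !polled[s.GuestID] {
			merged = append(merged, s) // guest failed or was skipped: keep last-known state
		}
	}
	return merged
}

func main() {
	previous := []Snapshot{{"vm1", "old1"}, {"vm2", "old2"}}
	fresh := []Snapshot{{"vm1", "old1"}, {"vm1", "new1"}} // vm2 failed this cycle
	polled := map[string]bool{"vm1": true}
	fmt.Println(mergeSnapshots(previous, fresh, polled))
}
```

A guest that was polled successfully and now has no snapshots still loses its old entries, because it is marked as polled; only unreachable guests are carried forward.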
rcourtman
a53de0fc53 Surface unified-agent filesystems in linked VM/container Overview
The qemu-guest-agent's get-fsinfo cannot reliably report ZFS mounts on
some guest configurations (notably Proxmox Backup Server), so VMs that
have ZFS-formatted partitions show only their EXT4 root and datastore
in the VM Overview FILESYSTEMS card while the much larger ZFS dataset
holding the actual backups is missing entirely (Fixes #1438).

The unified pulse-agent running inside the same guest already has
direct OS-level visibility into every mounted filesystem, including
ZFS, and Pulse already knows the link between the host agent and its
guest via Host.LinkedVMID / Host.LinkedContainerID (set in
findLinkedProxmoxEntity by hostname match).

GetState now calls StateSnapshot.MergeLinkedHostDisksIntoGuests after
producing the snapshot. For each Host with a linked VM or container,
that helper:

  1. appends host-agent disks to the guest's Disks slice, deduped by
     mountpoint (qemu-guest-agent entries take precedence so we don't
     overwrite per-VM-perspective values), and
  2. updates the guest's aggregate Disk.{Total,Used,Free,Usage} to
     include the newly-added partitions so the row total stays
     consistent with the partitions visible in the FILESYSTEMS card.

The merge runs on a defensive copy of the disks slice to avoid
mutating the underlying State slice that GetSnapshot shallow-copies.
Tests cover the happy path (PBS-shaped fixture mirroring the issue
screenshots), the no-link no-op, container linking, empty-mountpoint
filtering, and the slice-isolation invariant.
2026-04-30 11:24:47 +01:00
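The dedup-by-mountpoint merge from a53de0fc53 could be sketched like this. `Disk` and `mergeDisks` are illustrative names only; the real helper is StateSnapshot.MergeLinkedHostDisksIntoGuests, whose signature is not shown in the commit message.

```go
package main

import "fmt"

// Disk is a hypothetical, trimmed-down filesystem entry.
type Disk struct {
	Mountpoint  string
	Total, Used int64
}

// mergeDisks appends host-agent disks to a copy of the guest's disks.
// Guest entries win on a mountpoint clash, and entries with an empty
// mountpoint are skipped. It also returns the capacity added by the new
// entries so the caller can update the guest's aggregate totals.
func mergeDisks(guest, host []Disk) (merged []Disk, addedTotal, addedUsed int64) {
	// Defensive copy: never append into the caller's backing array.
	merged = make([]Disk, len(guest), len(guest)+len(host))
	copy(merged, guest)
	seen := make(map[string]bool, len(guest))
	for _, d := range guest {
		seen[d.Mountpoint] = true
	}
	for _, d := range host {
		if d.Mountpoint == "" || seen[d.Mountpoint] {
			continue
		}
		seen[d.Mountpoint] = true
		merged = append(merged, d)
		addedTotal += d.Total
		addedUsed += d.Used
	}
	return merged, addedTotal, addedUsed
}

func main() {
	guest := []Disk{{"/", 100, 40}}
	host := []Disk{{"/", 100, 50}, {"", 1, 1}, {"/mnt/datastore", 900, 300}}
	fmt.Println(mergeDisks(guest, host))
}
```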
rcourtman
5c65f65a90 Pass keep_alive=30s to Ollama so the model unloads between Patrol runs
Ollama keeps the loaded model in RAM for 5 minutes by default after
each request, and every new request refreshes that 5-minute window.
Pulse never passed keep_alive, so any Ollama traffic (Patrol, alert
analysis, AI chat) within 5 minutes of the previous request kept the
model warm — and on a server with continuous Pulse activity that
meant the model never unloaded, even with Patrol set to a 24-hour
interval (Fixes #1425).

Pass keep_alive=30s on every Chat and ChatStream request. Short
enough that the model unloads shortly after a Patrol burst or
one-shot analysis ends, long enough to span the small gaps between
sequential calls within a single analysis session (so the model is
not reloaded mid-burst).

Tests assert that both the streaming and non-streaming Chat paths
include the keep_alive field in the Ollama request body.
2026-04-30 10:59:04 +01:00
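For 5c65f65a90, the request body sent to Ollama's chat endpoint might be built roughly as below. The struct and helper names are assumptions for illustration; only the `keep_alive` field name and the 30s value come from the commit message.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// message and chatRequest are hypothetical mirrors of the JSON body that
// Ollama's chat endpoint accepts.
type message struct {
	Role    string `json:"role"`
	Content string `json:"content"`
}

type chatRequest struct {
	Model     string    `json:"model"`
	Messages  []message `json:"messages"`
	Stream    bool      `json:"stream"`
	KeepAlive string    `json:"keep_alive,omitempty"`
}

// buildChatBody returns the JSON body for a single-turn, non-streaming chat
// request that asks Ollama to unload the model 30 seconds after it finishes.
func buildChatBody(model, prompt string) (string, error) {
	req := chatRequest{
		Model:     model,
		Messages:  []message{{Role: "user", Content: prompt}},
		KeepAlive: "30s",
	}
	b, err := json.Marshal(req)
	return string(b), err
}

func main() {
	body, err := buildChatBody("llama3", "hi")
	if err != nil {
		panic(err)
	}
	fmt.Println(body)
}
```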
rcourtman
012c25d604 Use /proc/mdstat operation type to gate RAID rebuilding alerts
Distinguish a real rebuild ("recovery" after disk replacement) from
routine maintenance ("check" data scrubs, "resync" after unclean
shutdown) using the in-progress sync action from /proc/mdstat. The
mdadm --detail State field does not reliably surface scrub state on
all kernel/distribution combinations (notably Synology DSM), which is
why scheduled scrubs were firing "RAID array is rebuilding" warnings
every 30 seconds (Fixes #1446).

The mdadm parser now extracts the operation keyword from the
/proc/mdstat progress line and surfaces it as RAIDArray.Operation
alongside the existing speed parse. The alert layer treats "recovery"
and "reshape" as rebuild signals; "check" and "resync" are treated as
maintenance and do not fire an alert. String-based State matching is kept
as a backstop for arrays without a /proc/mdstat progress line, but
"resync" alone in State no longer counts as a rebuild signal.

Threaded the new field through the host-agent report, the resources
converter, and the monitor's models conversion. Added /proc/mdstat
parser tests covering recovery/check/resync/reshape/idle, and
end-to-end alert tests for recovery (alerts), check (silent scrub),
and resync (silent maintenance).
2026-04-30 10:37:47 +01:00
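The operation gating from 012c25d604 can be sketched with a small parser. The regular expression and function names are assumptions, not the project's parser; the sample text follows the usual /proc/mdstat progress-line layout.

```go
package main

import (
	"fmt"
	"regexp"
)

// opRe matches the sync action on a /proc/mdstat progress line, for example
// "[==>......]  check = 12.6% (...)".
var opRe = regexp.MustCompile(`\b(recovery|resync|check|reshape)\s*=\s*[0-9.]+%`)

// parseOperation returns the in-progress operation, or "" when the array is idle.
func parseOperation(mdstat string) string {
	m := opRe.FindStringSubmatch(mdstat)
	if m == nil {
		return ""
	}
	return m[1]
}

// isRebuild reports whether the operation should raise a rebuilding alert.
// "check" and "resync" are treated as routine maintenance.
func isRebuild(op string) bool {
	return op == "recovery" || op == "reshape"
}

func main() {
	scrub := "md0 : active raid1 sdb1[1] sda1[0]\n" +
		"      976630464 blocks super 1.2 [2/2] [UU]\n" +
		"      [==>..................]  check = 12.6% (123456/976630464) finish=127.5min speed=33440K/sec\n"
	op := parseOperation(scrub)
	fmt.Println(op, isRebuild(op))
}
```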
rcourtman
0464bdbad0 Stop test-config sends from leaking stale auth into shared SMTP manager
When the email config passed to sendHTMLEmailWithError differs from the
manager's persisted config (a test send with edited but unsaved
settings), build a fresh manager so stale Username, Password,
AuthRequired, SMTPHost, SMTPPort, TLS, StartTLS, or Provider fields
cannot leak into the SMTP exchange. The shared production manager is
left untouched.

Without this, a relay-mode test (port 25, no credentials) on a
deployment that previously had authenticated SMTP saved would still
attempt AUTH and fail with "AUTH not available" because the manager's
old AuthRequired and credentials persisted (Fixes #1440).

When the configs match, the existing reuse path is preserved so the
production manager's rate limiter keeps working across grouped sends.
2026-04-30 10:28:47 +01:00
kanylbullen
4557fb8159 Refactor: extract emitFinalToolCalls helper, add EOF tests
Address review feedback:
- Extract shared tool-call finalization into emitFinalToolCalls closure
  to eliminate duplication between [DONE] and EOF-fallback paths
- Build tool calls in deterministic index order (sorted)
- Normalize stopReason consistently in both paths
- Add unit tests:
  - TestOpenAIClient_ChatStream_ToolCallWithSimultaneousEOF: verifies
    tool calls are parsed when Read returns n>0 and io.EOF together
  - TestOpenAIClient_ChatStream_ToolCallWithoutDONE: verifies fallback
    emission when stream ends without [DONE]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-30 10:05:59 +01:00
kanylbullen
c9bbe8b3a8 Fix SSE stream parser dropping tool calls on EOF
The read loop in ChatStream breaks immediately on io.EOF without
processing remaining buffered data. Per Go's io.Reader contract,
Read may return both n > 0 and io.EOF simultaneously, so the final
bytes (which may contain tool call deltas and [DONE]) are silently
discarded.

This causes the agentic loop to see tool_calls=0 even though the
model correctly produced tool calls in the stream.

Changes:
- Process pendingData when EOF is received before breaking
- Add fallback: emit accumulated tool calls if [DONE] was never
  reached (server closed connection early)

Fixes #1411

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-30 10:05:59 +01:00
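The io.Reader behaviour behind c9bbe8b3a8 is easy to reproduce in isolation. In this sketch, `eofReader` is a hypothetical reader that hands back its final bytes together with io.EOF, and the two loops contrast the buggy and the corrected handling; none of it is the project's ChatStream code.

```go
package main

import (
	"fmt"
	"io"
)

// eofReader returns its last chunk together with io.EOF in the same call,
// which the io.Reader contract explicitly allows.
type eofReader struct {
	data []byte
	done bool
}

func (r *eofReader) Read(p []byte) (int, error) {
	if r.done {
		return 0, io.EOF
	}
	n := copy(p, r.data)
	r.data = r.data[n:]
	if len(r.data) == 0 {
		r.done = true
		return n, io.EOF
	}
	return n, nil
}

// readAllBuggy breaks on any error before looking at n, so bytes returned
// alongside io.EOF are lost.
func readAllBuggy(r io.Reader) string {
	var out []byte
	buf := make([]byte, 64)
	for {
		n, err := r.Read(buf)
		if err != nil {
			break
		}
		out = append(out, buf[:n]...)
	}
	return string(out)
}

// readAll processes the n bytes first and only then inspects the error.
func readAll(r io.Reader) (string, error) {
	var out []byte
	buf := make([]byte, 64)
	for {
		n, err := r.Read(buf)
		if n > 0 {
			out = append(out, buf[:n]...)
		}
		if err == io.EOF {
			return string(out), nil
		}
		if err != nil {
			return string(out), err
		}
	}
}

func main() {
	const tail = "data: [DONE]\n"
	fmt.Printf("%q\n", readAllBuggy(&eofReader{data: []byte(tail)}))
	got, _ := readAll(&eofReader{data: []byte(tail)})
	fmt.Printf("%q\n", got)
}
```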
rcourtman
94b5bb1b28 Pin Go toolchain to 1.25.9 and bump x/net to 0.51.0
Clears 11 govulncheck findings on release/5.1:
- 10 in the Go standard library (crypto/x509 auth bypass and panics,
  crypto/tls TLS 1.3 KeyUpdate DoS, archive/tar unbounded allocation,
  html/template XSS, os.Root filesystem escape, net/url IPv6 parse) —
  fixed by 1.25.9
- 1 in golang.org/x/net (HTTP/2 frame panic, GO-2026-4559) —
  fixed by 0.51.0

CI uses go-version-file: go.mod with setup-go@v5, which honors the
toolchain directive, so workflow builds will pick up 1.25.9.

Verified that govulncheck reports no vulnerabilities and that the full Go
test suite outcome is unchanged from the v5.1.28 baseline (the
TestSetupScriptTokenLifecycleIntegration_PVE failures are pre-existing
on release/5.1 and unrelated to the bump).
2026-04-30 10:05:46 +01:00
rcourtman
570fd31548 Bump dompurify to 3.4.1 to fix four DOMPurify advisories
Dependabot #79–#82 (CVE-2026-41238/41239/41240, GHSA-39q2-94rc-95cp)
all flag dompurify@3.3.3 sanitizer bypasses. Bumps the constraint to
^3.4.0 and locks at 3.4.1.

Verified frontend-modern type-check, vite build, and the 447-test
vitest suite all pass on the new version.
2026-04-30 09:31:11 +01:00
rcourtman
b204bed8c7 Fix release/5.1 LXC installs defaulting to RC
Refs #1435
2026-04-21 17:18:42 +01:00
rcourtman
9fe622b885 Defer QNAP autorun until encrypted volume unlocks (Fixes #1422)
QNAP's autorun.sh fires well before encrypted data volumes are
unlocked, so the previous one-line entry that invoked
start-pulse-agent.sh on the encrypted volume failed immediately —
the wrapper did not exist yet, and the agent never started after
reboot.

Replace the entry with a backgrounded waiter that polls for the
wrapper (every 2 s, up to 30 min) and execs it once the volume
comes up. On unencrypted volumes the loop exits on the first
check, so behaviour is unchanged. A timeout message is logged to
/var/log/pulse-agent.log if the volume never unlocks within the
window. The block is uninstall-safe: no internal blank lines, so
the existing sed marker-to-blank-line range still removes it
cleanly.
2026-04-17 11:46:23 +01:00
rcourtman
7e4d4e07bf Persist QNAP agent updates to data volume (Fixes #1420)
On QNAP, /usr/local/bin is a tiny RAM disk that gets wiped on every
reboot. The install wrapper stores the real binary under
${QNAP_VOL}/.pulse-agent/<name> and a boot script copies it back into
/usr/local/bin. Without refreshing the stored copy, auto-updates applied
to the RAM disk were silently reverted on the next reboot.

Mirror the Unraid persistence pattern: after the atomic in-place swap,
when running on QNAP, rewrite the stored binary via a temp-file rename.
Skip when the running binary already is the persistent copy (fallback
mode, where the rename step already updated it).
2026-04-17 11:44:17 +01:00
rcourtman
8c8641e5f2 Merge unified host/docker rows when IDs diverge (Fixes #1421)
The host-side identifier path applies sanitizeDockerHostSuffix before
storing Host.ID, while the docker-side uses AgentKey() raw. For a QNAP
unified agent those two derivations can produce different IDs, so the
UnifiedAgents merge keyed on d.id === h.id split the single install
into two rows.

Add a 1:1 hostname fallback: if exactly one unmerged host row and one
unmerged docker row share the same hostname, merge them. The strict
1:1 constraint prevents distinct machines that happen to share a
hostname from being collapsed together.
2026-04-17 11:38:39 +01:00
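The strict 1:1 hostname fallback from 8c8641e5f2 can be sketched as follows. The original merge lives in the frontend; this Go version only illustrates the rule, and `Row` and `pairByHostname` are hypothetical names.

```go
package main

import "fmt"

// Row is a hypothetical, trimmed-down host or docker row that the
// ID-based merge left unpaired.
type Row struct {
	ID       string
	Hostname string
}

// pairByHostname maps host ID to docker ID for every hostname that is shared
// by exactly one unmerged host row and exactly one unmerged docker row.
func pairByHostname(hosts, dockers []Row) map[string]string {
	hostsBy := make(map[string][]Row)
	for _, h := range hosts {
		hostsBy[h.Hostname] = append(hostsBy[h.Hostname], h)
	}
	dockersBy := make(map[string][]Row)
	for _, d := range dockers {
		dockersBy[d.Hostname] = append(dockersBy[d.Hostname], d)
	}
	pairs := make(map[string]string)
	for name, hs := range hostsBy {
		ds := dockersBy[name]
		// Ambiguous hostnames (more than one row on either side) stay split.
		if name != "" && len(hs) == 1 && len(ds) == 1 {
			pairs[hs[0].ID] = ds[0].ID
		}
	}
	return pairs
}

func main() {
	hosts := []Row{{"h1", "nas"}, {"h2", "web"}, {"h3", "web"}}
	dockers := []Row{{"d1", "nas"}, {"d2", "web"}}
	fmt.Println(pairByHostname(hosts, dockers))
}
```

Here "nas" is merged, while "web" stays split because two host rows claim it.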
rcourtman
6bc3d30548 Preserve Proxmox guest drawer state across refresh ticks
Dashboard's group-level <For> iterated over Object.entries(groupedGuests()).sort(...),
which produces brand-new tuple arrays on every refresh. Solid's <For> diffs by
reference, so every tick it destroyed and recreated all child rows — wiping out
GuestDrawer's activeTab signal (snapping Discovery back to Overview), graph
hover tooltips, and scroll position inside the expanded row.

Iterate over a memoized array of instance-ID strings instead. Primitive equality
keeps the outer For stable, so only the guest data inside each group updates
on each tick and the drawer's local state survives.

Fixes #1427
2026-04-17 11:15:50 +01:00
rcourtman
e1011230b9 Align infra discovery with Patrol interval
The infra discovery service auto-started with a hardcoded 5-minute
ticker the moment the AI service initialized, regardless of the user's
Patrol schedule. Each tick called AnalyzeForDiscovery, which hit the
Ollama chat endpoint and reset Ollama's keep_alive (5 min default), so
the model never had a chance to unload between requests.

Default the discovery interval to 24h and align it with the user's
Patrol preset (GetPatrolInterval) when the AI service constructs the
discovery service. With Patrol at its 6h default, the LLM now sits idle
long enough for Ollama to release it.

Fixes #1425
2026-04-17 11:10:14 +01:00
rcourtman
4de1c3745a Preflight disk space before Pulse updates
2026-04-15 20:56:58 +01:00
rcourtman
0b836aa3af Fix v5 integration update test defaults 2026-04-14 20:24:58 +01:00
rcourtman
80dfd43f8c Fix release dry-run integration image build 2026-04-14 20:06:27 +01:00
rcourtman
65670ca011 Make v5 release automation branch-owned 2026-04-14 19:48:25 +01:00
rcourtman
10d0803262 Auto-update Helm chart version to 5.1.28 2026-04-14 19:21:20 +01:00
rcourtman
3a04896e92 Auto-update Helm chart documentation 2026-04-14 19:21:20 +01:00
rcourtman
81661a934a Move v5 maintenance flow onto release/5.1 2026-04-14 18:34:41 +01:00
rcourtman
c8f1ad75cf Bump version to 5.1.28 2026-04-14 16:58:58 +01:00
rcourtman
a24af45c67 Add v6 RC announcement surfaces to v5 2026-04-14 16:51:19 +01:00
rcourtman
dfbe2eb873 Suppress noisy recovery notifications
2026-04-13 14:40:12 +01:00
rcourtman
19b2a4e4c4 Clear stale guest per-disk alerts 2026-04-13 14:20:54 +01:00
rcourtman
efb840deae Fix installer universal bundle fallback 2026-04-13 14:13:11 +01:00
rcourtman
1f0dfd60fc Lock SAML metadata public URL refresh 2026-04-13 13:48:27 +01:00
rcourtman
5a17456a60 Fix Ceph manager standby parsing
2026-04-13 11:57:12 +01:00
rcourtman
9fb76579cc Fix backup type-aware orphan detection 2026-04-13 11:54:46 +01:00
rcourtman
3981df57a2 Detect NAS host vendors from platform files 2026-04-13 11:25:27 +01:00
rcourtman
754aa0e39c Fix linked host agent threshold overrides
2026-04-12 22:47:34 +01:00
rcourtman
5f3a4b79ba Fix oversized AI discovery responses 2026-04-12 22:33:48 +01:00
rcourtman
2ad288c091 Fix streamed installer entrypoint 2026-04-12 22:30:58 +01:00
rcourtman
95409985b5 Normalize vendor-managed NAS RAID arrays 2026-04-12 22:20:04 +01:00
rcourtman
a86c7120cf Debounce recovery for poll-driven offline alerts 2026-04-12 22:04:10 +01:00
rcourtman
005f64182f Respect quiet hours for escalation alerts
Apply quiet-hours suppression to escalation notifications so offline and other suppressed categories do not bypass the normal notification rules during escalation.

Fixes #1398.
2026-04-12 21:29:32 +01:00