Commit graph

3416 commits

Author SHA1 Message Date
rcourtman
768b6d8b7a fix(frontend): resolve npm audit advisories in lockfile 2026-03-02 23:59:34 +00:00
rcourtman
71a7249fd7 chore: bump version to 5.1.17 2026-03-02 17:51:50 +00:00
rcourtman
510ec999ab fix(api): store TLS fingerprint during auto-registration (#1303)
The legacy auto-register endpoint captured TLS fingerprints via
FetchFingerprint() but never persisted them to the node config. Nodes
with self-signed certs registered via the agent would fail with
"x509: certificate signed by unknown authority" on subsequent polls.

Store the fingerprint in all add/update paths for both PVE and PBS,
guard updates against empty-fingerprint clobber when FetchFingerprint
fails, and pass the fingerprint to cluster detection configs.
2026-03-02 14:07:18 +00:00
rcourtman
10a4e994b6 fix(api): return 404 from undismiss endpoint for invalid finding IDs (#1300)
HandleUndismissFinding now checks both patrol and unified stores
before returning. Returns 404 with error message when the finding
is not found or not dismissed, instead of silently returning success.
2026-03-02 11:48:23 +00:00
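The behaviour described in 10a4e994b6 can be illustrated with a minimal Go sketch. The `store` type, the `undismissHandler` and `status` helpers, the `/undismiss` path, and the query-parameter ID are all assumptions made for brevity; only the "check both stores, return 404 when nothing was dismissed" rule comes from the commit message.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// store maps a finding ID to whether it is currently dismissed.
// It is a hypothetical stand-in for the patrol and unified stores.
type store map[string]bool

// undismissHandler checks every store before answering. It responds 404 when
// no store holds the finding in a dismissed state, instead of reporting success.
// Reading the ID from a query parameter is an assumption of this sketch.
func undismissHandler(stores ...store) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		id := r.URL.Query().Get("id")
		found := false
		for _, s := range stores {
			if s[id] {
				s[id] = false // revert the dismissal
				found = true
			}
		}
		if !found {
			http.Error(w, "finding not found or not dismissed", http.StatusNotFound)
			return
		}
		w.WriteHeader(http.StatusOK)
	}
}

// status sends one POST through the handler and returns the response code.
func status(h http.Handler, id string) int {
	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodPost, "/undismiss?id="+id, nil)
	h.ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	h := undismissHandler(store{"f1": true}, store{})
	fmt.Println(status(h, "f1"), status(h, "missing"))
}
```

A second call for the same ID also yields 404 in this sketch, because the finding is no longer dismissed.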
rcourtman
60bdc9a101 fix(memory): skip meminfo-derived when balloon lacks cache metrics (#1302)
When the balloon driver reports Free but not Buffers or Cached, the
meminfo-derived fallback computed memAvailable = Free alone, counting
all reclaimable page cache as used memory. This caused Linux VMs to
show wildly inflated usage (e.g. 93% when actual is 21%).

Now meminfo-derived requires at least one cache metric (Buffers > 0
or Cached > 0) before trusting the value. When missing, the code
falls through to RRD/guest-agent/Total-Used fallbacks which provide
accurate cache-aware data. Both efficient and traditional polling
paths are now consistent.
2026-03-02 11:48:18 +00:00
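A rough Go sketch of the gate described in 60bdc9a101 follows. The `balloonMem` type and `meminfoAvailable` function are hypothetical names, and summing Free, Buffers, and Cached is an assumed simplification; the commit only states that at least one cache metric must be present before the meminfo-derived value is trusted.

```go
package main

import "fmt"

// balloonMem is a hypothetical stand-in for the figures a balloon driver
// reports; the field names are assumptions, not Pulse's real types.
type balloonMem struct {
	Total, Free, Buffers, Cached uint64
}

// meminfoAvailable returns an estimate of available memory and whether the
// estimate can be trusted. Without at least one cache metric, Free alone
// would count reclaimable page cache as used, so the caller should fall
// through to another source (RRD, guest agent, Total-Used).
func meminfoAvailable(m balloonMem) (uint64, bool) {
	if m.Buffers == 0 && m.Cached == 0 {
		return 0, false
	}
	return m.Free + m.Buffers + m.Cached, true
}

func main() {
	m := balloonMem{Total: 8 << 30, Free: 1 << 30}
	if _, ok := meminfoAvailable(m); !ok {
		fmt.Println("cache metrics missing; falling back to another source")
	}
}
```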
rcourtman
9f8f372f7c chore: add .mcp.json to gitignore 2026-03-01 23:33:58 +00:00
rcourtman
d43dfbc490 feat(ui): add host removal action to hosts table
Add an actions menu to the hosts overview with a "Remove host from
Pulse" button. Includes permission checks (requires settings:write
scope), confirmation handling, and a security regression test for
the delete endpoint scope enforcement.
2026-03-01 23:28:33 +00:00
rcourtman
5bd0563283 test(providers): update Ollama integration tests for timeout parameter 2026-03-01 23:28:16 +00:00
rcourtman
f5365809b3 fix(installer): remove config backup that filled disk on upgrades
The backup_existing function copied the entire config directory
(including metrics.db at ~2.5GB) on every upgrade with no cleanup.
On small VMs this filled the disk within a few releases.

The upgrade only swaps the binary; config files are not modified,
so the backup served no practical purpose.
2026-03-01 23:20:08 +00:00
rcourtman
0c78fab337 Auto-update Helm chart documentation 2026-03-01 23:15:53 +00:00
rcourtman
fa48369dbb chore(release): bump version to 5.1.16 2026-03-01 22:40:55 +00:00
rcourtman
d46b5fc84b fix(ai): route OpenRouter slash-delimited models to OpenAI provider (#1296)
createProviderForModel() only handled "provider:model" colon format.
Models like "google/gemini-2.5-flash" or "google/gemini-2.0-flash:free"
(OpenRouter format) failed because the colon split produced invalid
provider names.

Now uses config.ParseModelString() which correctly detects slash-
delimited models as OpenRouter (routed via OpenAI-compatible API).
2026-03-01 22:29:45 +00:00
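The routing rule in d46b5fc84b might look roughly like the Go sketch below. This is not the actual config.ParseModelString(); `parseModel`, the `knownProviders` set, and the empty-provider result for unprefixed names are assumptions chosen to illustrate the colon-versus-slash distinction.

```go
package main

import (
	"fmt"
	"strings"
)

// knownProviders is an illustrative set; the real list is an assumption.
var knownProviders = map[string]bool{"openai": true, "gemini": true, "ollama": true}

// parseModel splits a model string into (provider, model).
//   - "provider:model" is honoured only when the prefix is a known provider.
//   - A name containing "/" is treated as an OpenRouter ID and routed to the
//     OpenAI-compatible provider with the full string as the model name.
//   - Anything else is returned with an empty provider for the caller to resolve.
func parseModel(s string) (provider, model string) {
	if i := strings.Index(s, ":"); i > 0 && knownProviders[s[:i]] {
		return s[:i], s[i+1:]
	}
	if strings.Contains(s, "/") {
		return "openai", s
	}
	return "", s
}

func main() {
	for _, s := range []string{"ollama:llama3.2:latest", "google/gemini-2.0-flash:free", "llama3.2:latest"} {
		p, m := parseModel(s)
		fmt.Printf("%q -> provider=%q model=%q\n", s, p, m)
	}
}
```

Checking the prefix against a known set before splitting is what keeps a tag such as ":free" or ":latest" from being mistaken for a provider delimiter.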
rcourtman
2fcddecf80 feat(api): add POST /api/ai/patrol/undismiss endpoint to revert suppressed findings (#1300)
The Undismiss() method existed on FindingsStore but was never exposed
via the API. Users who dismissed findings as "not_an_issue" had no way
to revert them.

- Add HandleUndismissFinding handler and route
- Add Undismiss() to UnifiedStore for parity with FindingsStore
- Also remove matching explicit suppression rules on undismiss
2026-03-01 22:29:36 +00:00
rcourtman
027fd9932c fix(proxmox): make monitor reload synchronous after auto-registration (#1303)
Auto-register was running the monitor reload in a background goroutine,
so the HTTP response was sent before the poller picked up the new node.
If reload failed or was slow, the node appeared in Settings > Proxmox
(reads config from disk) but not on the main Proxmox tab (reads from
active polling state).

Changed both auto-register paths to reload synchronously, matching the
manual add path (HandleAddNode).
2026-03-01 21:04:20 +00:00
rcourtman
d852964696 fix(ai): record patrol and QuickAnalysis token usage in cost store for budget enforcement
Patrol runs, evaluation passes, and QuickAnalysis calls were consuming
LLM tokens without recording them in the cost store. This made the
cost_budget_usd_30d budget setting ineffective since enforceBudget()
never saw patrol spend.

- Add RecordUsage() to ai.Service for thread-safe cost recording
- Add recordPatrolUsage() helper to PatrolService, called on both
  success and error paths for main patrol and evaluation pass
- Record QuickAnalysis token usage in cost store
- Return partial PatrolResponse (with token counts) on error instead
  of nil, so callers can always record consumed tokens
- Propagate partial response through chat_service_adapter on error
2026-03-01 19:19:47 +00:00
rcourtman
b1ff7e006f fix(ui): show PULSE_PUBLIC_URL value in settings and expand node tables to full width (#1305, #1304)
Expose PublicURL from runtime config in the system settings API response
so the frontend displays the actual value instead of the placeholder when
the env var is set.

Add w-full to PVE, PBS, and PMG node tables so they expand to fill the
container in full-width mode.
2026-03-01 14:42:30 +00:00
rcourtman
c575c7e295 fix(patrol): rename wearout JSON field to ssd_life_remaining_pct (#1300)
The AI also receives disk data via tool calls (pulse_metrics type="disks"),
not just the patrol context table. The raw JSON field "wearout" was
ambiguous — rename to "ssd_life_remaining_pct" so the field name itself
communicates that 100 = healthy.
2026-02-27 23:12:27 +00:00
rcourtman
3006f51b60 fix(patrol): clarify wearout semantics so AI knows 100% = healthy (#1300)
The patrol context table header said "Wearout" and the tool returned a raw
"wearout" JSON field with no indication that 100 = full life remaining.
The AI interpreted "wearout: 100" as fully worn out and raised false
"100% Disk Wearout" findings on healthy NVMe drives.

Rename the patrol table column to "SSD Life Remaining (100%=new)" and
update the data type comment to clarify the semantics.
2026-02-27 23:05:02 +00:00
rcourtman
aae6035e66 fix(docs): audit and fix agent docs vs install script discrepancies (#1299)
- Split configuration table into "Installer flags" and "Agent-only flags"
  so users know which flags work with `curl | bash` vs the binary directly
- Add missing --cacert and --env flags to installer docs
- Fix --disable-auto-update example (install script doesn't accept it;
  use --env PULSE_DISABLE_AUTO_UPDATE=true instead)
- Add --disable-docker/kubernetes/proxmox and --proxmox-type to
  install.sh show_help()
- Fix --enable-docker=false in CENTRALIZED_MANAGEMENT.md
2026-02-27 21:20:54 +00:00
rcourtman
29a6335905 fix(docs): correct remaining --enable-*=false flags in agent docs (#1299)
All --enable-docker=false, --enable-kubernetes=false, --enable-proxmox=false
references replaced with --disable-docker, --disable-kubernetes, --disable-proxmox.
2026-02-27 21:14:05 +00:00
rcourtman
0bc9445eb8 fix(docs): correct --enable-host=false to --disable-host in agent docs (#1299)
The installer uses --disable-host as a separate flag, not --enable-host=false.
2026-02-27 20:41:32 +00:00
rcourtman
b1d58fc8aa fix(installer): avoid "No space left on device" on QNAP by writing binary to persistent storage
On QNAP, /usr/local/bin is a tiny RAM disk. The installer was downloading
the binary then mv'ing it there, which failed when the RAM disk was full.
The QNAP-specific logic that copies to the persistent data volume only
ran after that mv.

Move QNAP detection before the download step so INSTALL_DIR points to the
persistent data volume (e.g. /share/CACHEDEV1_DATA/.pulse-agent) directly.
The wrapper script still attempts to copy to /usr/local/bin at boot but
falls back to running from persistent storage if that fails.

Also fixes:
- pkill -f pattern in wrapper could match and kill the wrapper itself
  (path contains "pulse-agent"); switched to pkill -x for exact match
- Upgrade detection now checks /usr/local/bin for legacy QNAP installs
- Uninstall cleans up /usr/local/bin runtime copy
2026-02-27 20:41:32 +00:00
rcourtman
538b3c3bdb Auto-update Helm chart documentation 2026-02-27 15:20:57 +00:00
rcourtman
7530b66254 fix(setup): escape printf %s in Sprintf template to fix format verb count (#1297)
The printf '%s\n' calls in shell code within the Go Sprintf template
were being counted as format verbs, causing a build failure (10 verbs
but 9 args). Using %%s produces literal %s in the output.
2026-02-27 14:44:41 +00:00
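The escaping issue in 7530b66254 can be reproduced in a few lines of Go. The `scriptTemplate` constant and `render` function are invented for illustration and are far smaller than the real setup template.

```go
package main

import "fmt"

// scriptTemplate is a made-up shell snippet embedded in a Go format string.
// %%s survives Sprintf as a literal %s for the shell's printf; the single
// %s is the only verb that consumes a Go argument.
const scriptTemplate = "printf '%%s\\n' \"%s\"\n"

// render fills the one real verb in the template.
func render(msg string) string {
	return fmt.Sprintf(scriptTemplate, msg)
}

func main() {
	fmt.Print(render("hello"))
}
```

Had the shell's `%s` been left unescaped, Sprintf would have counted two verbs against one argument, which is the same kind of verb/argument mismatch the commit describes.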
rcourtman
2f059e650e chore(release): bump version to 5.1.15 2026-02-27 14:29:10 +00:00
rcourtman
8298852483 feat(installer): add QNAP QTS/QuTS hero agent support (#1253)
QNAP wipes /etc/init.d on every reboot, so the agent needs persistent
storage on a data volume and autorun.sh boot persistence via the flash
config partition. Adds detection, install (with watchdog wrapper), and
clean uninstall paths. Flash config mount/umount is fail-safe via
subshell isolation to prevent leaving the partition mounted on write
errors.
2026-02-27 14:19:40 +00:00
rcourtman
62225e0c12 fix(alerts): scope orphaned backup detection per PVE instance to prevent false positives (#1286)
The previous hasLiveInventory guard was a single boolean — if any PVE
instance had at least one live guest, orphan detection ran for all
instances. In multi-instance clusters with staggered polling, backups
from instances whose VMs hadn't been polled yet appeared orphaned,
producing false positive alerts with 0m duration.

Replace the global boolean with a per-instance map. PVE storage backups
now only run orphan detection when their specific instance has live
inventory. PBS/PMG backups (which span instances) retain the "any
instance has live guests" check.
2026-02-27 13:32:15 +00:00
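As a sketch of the per-instance guard from 62225e0c12, the following Go code keys live inventory by instance name. The `backup` type, the `orphans` function, and the map layout are assumptions; the PBS/PMG "any instance" path is left out.

```go
package main

import "fmt"

// backup is a hypothetical, simplified view of a PVE storage backup.
type backup struct {
	Instance string
	VMID     int
}

// orphans returns backups whose guest is missing from inventory, but only for
// instances that already have live inventory. guests maps an instance name to
// the set of VMIDs seen in its most recent poll.
func orphans(backups []backup, guests map[string]map[int]bool) []backup {
	var out []backup
	for _, b := range backups {
		live := guests[b.Instance]
		if len(live) == 0 {
			continue // instance not polled yet: skip rather than flag everything
		}
		if !live[b.VMID] {
			out = append(out, b)
		}
	}
	return out
}

func main() {
	backups := []backup{{"pve1", 100}, {"pve1", 101}, {"pve2", 200}}
	guests := map[string]map[int]bool{"pve1": {100: true}}
	fmt.Println(orphans(backups, guests))
}
```

With a single global boolean, the pve2 backup above would have been flagged as soon as pve1 reported one guest; the per-instance lookup leaves it alone until pve2 has been polled.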
rcourtman
4c7a79cecb fix(setup): preserve SSH authorized_keys symlink on Proxmox and fix key entry quoting (#1297)
The PVE setup script had three bugs in the temperature monitoring SSH key setup:

- Nested double quotes in SSH_SENSORS_KEY_ENTRY broke the bash string, causing
  "No such file or directory" errors for the key options
- The grep/mv pattern to update authorized_keys destroyed the symlink that
  Proxmox maintains from /root/.ssh/authorized_keys to /etc/pve/priv/
- The uninstall path grepped for "# pulse-managed-key" but keys were tagged
  "# pulse-sensors", so uninstall never cleaned up sensor keys

Fixes: resolve symlinks with readlink -f before operating, create temp files in
/tmp with mv-then-cp fallback for cross-device moves, escape inner quotes, and
broaden the uninstall filter to match all pulse-prefixed keys.
2026-02-27 13:23:03 +00:00
rcourtman
9aee8fa293 fix(ui): add Pro badge to Reporting tab and reduce patrol trigger log noise (#1285, #1258)
Show "Pro" badge on the Reporting settings tab so users know upfront
that advanced reporting requires a Pro license, rather than discovering
it after filling out the form.

Downgrade patrol trigger queue-full and rejection messages from Warn to
Debug — these are normal rate-limiting behavior, not actionable warnings.
2026-02-26 21:09:13 +00:00
rcourtman
af712006c9 fix(ai): allow Gemini and other models via OpenRouter without false provider warning (#1296)
Model name detection used substring matching (.includes('gemini')) which
falsely required Gemini provider config for OpenRouter model IDs like
"google/gemini-2.5-flash". Now only known provider prefixes are treated
as explicit delimiters, slash-containing names route to OpenAI (OpenRouter
convention), and colons in model names (e.g. "llama3.2:latest") are no
longer misinterpreted as provider prefixes.
2026-02-26 20:49:10 +00:00
rcourtman
fa519cd8ce fix(alerts): prevent false positive orphaned backup alerts during startup race (#1286)
Backup polling goroutines can snapshot state before VM/container polling
populates the guest inventory. When guestsByVMID is empty, every backup
appears orphaned. Gate orphan detection on hasLiveInventory (at least one
guest with non-empty ResourceID) and preserve existing orphan alerts when
inventory becomes unavailable.
2026-02-26 20:49:10 +00:00
rcourtman
eb2397d99a fix(notifications): route escalation notifications to selected channels only (#1259)
Escalation was calling SendAlert() which always sends to all enabled
channels, ignoring the per-level channel selection (email/webhook/all).

Add SendAlertToChannels() that snapshots only the requested channel
configs and uses a distinct "_escalation" queue type so the dequeue
handler skips cooldown writes — preventing interference with the alert
manager's own re-notify cadence.
2026-02-26 20:49:10 +00:00
rcourtman
c213e0ce30 Auto-update Helm chart documentation 2026-02-25 00:14:54 +00:00
rcourtman
a5fb155b88 chore(release): bump version to 5.1.14 2026-02-24 23:39:42 +00:00
rcourtman
77bd2e70d9 fix(notifications): add service-specific resolved webhook templates (#1259)
Backport from v6 (88d5865a8). Recovery webhook notifications were using
the firing PayloadTemplate which services like Telegram, Teams, Discord
etc. silently rejected as malformed. Now uses a three-tier template
pipeline matching the firing path:
- Tier 1: Custom user template (if configured)
- Tier 2: Service-specific ResolvedPayloadTemplate (Discord green embed,
  Telegram chat_id+text, Slack header blocks, Teams MessageCard/Adaptive,
  PagerDuty event_action:"resolve", Pushover, Gotify, Mattermost)
- Tier 3: Generic JSON fallback (backward compatible)

Also adds Event, ResolvedAt, ResolvedAtISO fields to WebhookPayloadData.
2026-02-24 23:28:33 +00:00
rcourtman
6221be7311 fix(docker): serialize batch container updates per host (#1289)
The backend only allows one command per host at a time. The "Update All"
button was firing requests in parallel chunks, causing the second
container per host to fail with 400. Group targets by host and process
them sequentially within each host while still updating different hosts
in parallel.
2026-02-24 23:16:22 +00:00
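The scheduling idea in 6221be7311 (sequential within a host, parallel across hosts) is sketched below in Go for illustration; the actual change lives in the frontend, and `target`, `updateAll`, and the `update` callback are hypothetical names.

```go
package main

import (
	"fmt"
	"sync"
)

// target identifies one container update request.
type target struct{ Host, Container string }

// updateAll groups targets by host, runs each host's updates one after
// another, and lets different hosts proceed concurrently. It returns the
// errors collected per host.
func updateAll(targets []target, update func(target) error) map[string][]error {
	byHost := make(map[string][]target)
	for _, t := range targets {
		byHost[t.Host] = append(byHost[t.Host], t)
	}

	var (
		mu   sync.Mutex
		wg   sync.WaitGroup
		errs = make(map[string][]error)
	)
	for host, group := range byHost {
		wg.Add(1)
		go func(host string, group []target) {
			defer wg.Done()
			for _, t := range group { // one command per host at a time
				if err := update(t); err != nil {
					mu.Lock()
					errs[host] = append(errs[host], err)
					mu.Unlock()
				}
			}
		}(host, group)
	}
	wg.Wait()
	return errs
}

func main() {
	targets := []target{{"h1", "web"}, {"h1", "db"}, {"h2", "cache"}}
	errs := updateAll(targets, func(t target) error {
		fmt.Println("updating", t.Host, t.Container)
		return nil
	})
	fmt.Println("hosts with errors:", len(errs))
}
```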
rcourtman
24f5b1cb31 fix(patrol): cap per-run tokens and reset patrol session history 2026-02-24 11:29:47 +00:00
rcourtman
82ccb662f9 fix(notifications): use service-specific templates for resolved webhooks (#1068)
Recovery notifications for Discord, Slack, Teams, PagerDuty, and other
service webhooks were sending a generic JSON payload that lacked the
required format (e.g. Discord needs `embeds`, Slack needs `blocks`),
causing resolved notifications to silently fail.

- Add `prepareResolvedWebhookData` to build template data with Level="resolved"
- Route resolved webhooks through service-specific templates with full
  URL rendering, Telegram ChatID extraction, and PagerDuty routing_key
- Custom user templates take precedence over built-in service templates
- Return errors on service template failures instead of falling back to
  generic payloads that endpoints would reject
- Fix PagerDuty template to send event_action="resolve" for resolved alerts
2026-02-24 10:49:52 +00:00
rcourtman
4dc09a1240 feat(alerts): add dedicated backup-orphaned alert type (#1286)
Fire a warning alert immediately when a backup's guest no longer exists
in inventory, without requiring age thresholds to be breached. The
existing alertOrphaned toggle and ignoreVMIDs UI control this feature
with no frontend changes needed.
2026-02-24 09:07:43 +00:00
rcourtman
ffc14c7507 fix(docker): stop CPU bars flickering for idle containers (#1288)
The isRunning prop used a `cpuPercent > 0` gate that treated idle
containers (0% CPU) as not-running, causing the bar to flip between
a percentage and an em-dash on every poll cycle. Remove the value
guard so visibility depends only on container running state, matching
how memory, disk, and restart columns already behave.
2026-02-23 22:05:18 +00:00
rcourtman
cac5be2ca1 chore(frontend-modern): remove stale pnpm lockfile 2026-02-23 11:15:37 +00:00
rcourtman
5457b04608 fix(ai): deduplicate Docker host 3-way chain in mention picker (#1252)
Replace first-match-only logic in upsertMentionResource with a
union-merge algorithm that collects all matching keys, merges losers
into a canonical winner, and re-points aliases. This fixes the case
where a host agent bridges a VM and a DockerHost but only the first
alias match was merged, leaving a duplicate entry in the picker.
2026-02-22 15:15:14 +00:00
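One way to picture the union-merge described in 5457b04608 is the Go sketch below. It is not the real upsertMentionResource; the `registry` and `resource` types and the key strings are assumptions, and "first match wins" is an arbitrary choice of canonical entry.

```go
package main

import "fmt"

// resource is a hypothetical picker entry identified by a set of keys.
type resource struct {
	Keys map[string]bool
}

// registry points every known key (ID or alias) at its canonical resource.
type registry struct {
	byAlias map[string]*resource
}

// upsert collects every existing entry matched by any key, merges the losers
// into the first match, and re-points all keys at that winner.
func (r *registry) upsert(keys ...string) *resource {
	var matches []*resource
	seen := make(map[*resource]bool)
	for _, k := range keys {
		if m, ok := r.byAlias[k]; ok && !seen[m] {
			seen[m] = true
			matches = append(matches, m)
		}
	}

	winner := &resource{Keys: make(map[string]bool)}
	if len(matches) > 0 {
		winner = matches[0]
		for _, loser := range matches[1:] {
			for k := range loser.Keys {
				winner.Keys[k] = true
				r.byAlias[k] = winner // re-point the loser's aliases
			}
		}
	}
	for _, k := range keys {
		winner.Keys[k] = true
		r.byAlias[k] = winner
	}
	return winner
}

func main() {
	r := &registry{byAlias: make(map[string]*resource)}
	r.upsert("vm:101")
	r.upsert("docker:h1")
	w := r.upsert("agent:a1", "vm:101", "docker:h1") // bridges both entries
	fmt.Println(len(w.Keys), r.byAlias["docker:h1"] == w)
}
```

A first-match-only upsert would have merged the agent into the VM entry and left the Docker host as a separate row, which is the duplicate the commit removes.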
rcourtman
2140efce36 Auto-update Helm chart documentation 2026-02-22 12:43:12 +00:00
rcourtman
180c8738b4 chore(release): bump version to 5.1.13 2026-02-22 12:01:38 +00:00
rcourtman
54a1ace2c5 fix(installer): remove stale sensor-proxy mount entries that prevent LXC start after reboot (#1280)
The v4 installer added mount entries for /run/pulse-sensor-proxy to LXC
container configs. After upgrading to v5 and rebooting, /run (tmpfs) is
wiped and the container fails to start. The installer now detects and
removes these stale mp<N> and lxc.mount.entry references automatically
when run on a PVE host, and the upgrade docs include manual fix steps.
2026-02-22 10:52:12 +00:00
rcourtman
f9654f5b7a Merge pull request #1279 from muratoda/feature/use-locale-aware-time-format
Change last refresh time display format to system locale
2026-02-21 23:11:48 +00:00
rcourtman
32746e2d2a fix(monitoring): use RRD memavailable fallback when PVE node cache metrics missing (#1270)
When Proxmox /nodes/{node}/status returns only total/used/free without
available/buffers/cached, EffectiveAvailable() returns Free (non-zero),
causing the RRD fallback gate to be skipped. This results in inflated
node memory where cache/buffers are counted as "used."

Widen the RRD fallback condition from requiring effectiveAvailable == 0
to triggering whenever missingCacheMetrics is true. Add negative caching
for failed RRD lookups (2-minute backoff) to avoid repeated retries.
2026-02-21 22:47:20 +00:00
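The negative caching mentioned in 32746e2d2a could be sketched in Go as follows. The `rrdNegCache` type and its methods are hypothetical; only the two-minute backoff comes from the commit message, and timestamps are plain Unix seconds to keep the example small.

```go
package main

import (
	"fmt"
	"time"
)

// rrdNegCache remembers when an RRD lookup last failed for each node so that
// failing lookups are not retried on every poll.
type rrdNegCache struct {
	backoffSec int64
	failedAt   map[string]int64 // node -> Unix time of last failure
}

func newRRDNegCache(backoff time.Duration) *rrdNegCache {
	return &rrdNegCache{
		backoffSec: int64(backoff / time.Second),
		failedAt:   make(map[string]int64),
	}
}

// shouldTry reports whether a lookup for node is allowed at nowUnix.
func (c *rrdNegCache) shouldTry(node string, nowUnix int64) bool {
	t, failed := c.failedAt[node]
	return !failed || nowUnix-t >= c.backoffSec
}

// recordFailure starts (or restarts) the backoff window for node.
func (c *rrdNegCache) recordFailure(node string, nowUnix int64) {
	c.failedAt[node] = nowUnix
}

// recordSuccess clears any backoff for node.
func (c *rrdNegCache) recordSuccess(node string) {
	delete(c.failedAt, node)
}

func main() {
	c := newRRDNegCache(2 * time.Minute)
	now := time.Now().Unix()
	c.recordFailure("pve1", now)
	fmt.Println(c.shouldTry("pve1", now+30), c.shouldTry("pve1", now+120))
}
```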
rcourtman
1170da6a57 fix(ai): serialize linkedVmId/linkedContainerId and harden mention status (#1252)
HostFrontend was missing LinkedVmId and LinkedContainerId fields, so the
frontend dedup aliases for VM/container agents resolved to undefined and
never matched. Also add .trim() to getStatusColor and default host agent
status to 'online' to fix grey status dots.
2026-02-21 22:00:43 +00:00
rcourtman
b445f8d8fa fix(agent): preserve user-configured host URL during agent re-registration (#1283)
When an agent re-registers with the same token, the DHCP matching case
would overwrite the Host field with the agent's local IP — even if the
user had edited it to a public URL or different IP. Now agent source
re-registrations always preserve the existing host, while non-agent
DHCP updates still work. Adds 5 regression tests covering hostname
preservation, public-IP preservation, agent DHCP, non-agent DHCP, and
PBS parity.
2026-02-21 12:46:02 +00:00
rcourtman
50e476c942 fix(ai): fix mention status colors and dedup for docker/VM/LXC agents (#1252)
Three fixes for remaining mention autocomplete issues:

- Status dots now correctly show green/red/yellow for online/offline/
  degraded statuses (previously only handled running/stopped/paused)
- Docker hosts merge with their host agent via agentId cross-reference
- VMs and LXC containers merge with host agents running inside them
  via linkedVmId/linkedContainerId backend ID aliases
2026-02-20 22:53:52 +00:00