4826 commits

Author SHA1 Message Date
rcourtman
07b4765b8d fix: respect quiet hours for recovery notifications (#1068)
Recovery notifications were bypassing the quiet hours check, causing
users to receive recovery alerts during their configured quiet hours
window even though the original "down" alerts were suppressed.

- Add ShouldSuppressResolvedNotification() to alert manager
- Check quiet hours before sending recovery notifications in monitor
- Recovery notifications now follow same suppression rules as alerts
2026-01-09 21:47:36 +00:00
rcourtman
2a8f55d719 feat(enterprise): add Advanced Reporting and Audit Webhooks integration
This commit adds enterprise-grade reporting and audit capabilities:

Reporting:
- Refactored metrics store from internal/ to pkg/ for enterprise access
- Added pkg/reporting with shared interfaces for report generation
- Created API endpoint: GET /api/admin/reports/generate
- New ReportingPanel.tsx for PDF/CSV report configuration

Audit Webhooks:
- Extended pkg/audit with webhook URL management interface
- Added API endpoint: GET/POST /api/admin/webhooks/audit
- New AuditWebhookPanel.tsx for webhook configuration
- Updated Settings.tsx with Reporting and Webhooks tabs

Server Hardening:
- Enterprise hooks now execute outside mutex with panic recovery
- Removed dbPath from metrics Stats API to prevent path disclosure
- Added storage metrics persistence to polling loop

Documentation:
- Updated README.md feature table
- Updated docs/API.md with new endpoints
- Updated docs/PULSE_PRO.md with feature descriptions
- Updated docs/WEBHOOKS.md with audit webhooks section
2026-01-09 21:31:49 +00:00
rcourtman
92c150e979 feat(rbac): add OIDC group mapping tests and audit logging for RBAC actions
2026-01-09 19:25:33 +00:00
rcourtman
6ed1fdf806 feat(rbac): implement RBAC UI, OIDC group mapping, and API standard auth
- Added Roles and Users settings panels
- Implemented OIDC group-to-role mappings in config and auth flow
- Standardized API token context handling via pkg/auth
- Added Pulse Pro branding and upgrade banners to RBAC features
- Cleanup: Removed empty code blocks and fixed lint errors
2026-01-09 19:16:34 +00:00
rcourtman
3e2824a7ff feat: remove Enterprise badges, simplify Pro upgrade prompts
- Replace barrel import in AuditLogPanel.tsx to fix ad-blocker crash
- Remove all Enterprise/Pro badges from nav and feature headers
- Simplify upgrade CTAs to clean 'Upgrade to Pro' links
- Update docs: PULSE_PRO.md, API.md, README.md, SECURITY.md
- Align terminology: single Pro tier, no separate Enterprise tier

Also includes prior refactoring:
- Move auth package to pkg/auth for enterprise reuse
- Export server functions for testability
- Stabilize CLI tests
2026-01-09 16:51:08 +00:00
rcourtman
22059210f7 fix(frontend): remove unused import and variable to satisfy hooks
2026-01-09 14:46:15 +00:00
rcourtman
5c4399d69f feat(agent): add DisableCeph toggle, report_ip remote config, and improved IP detection (#929)
2026-01-09 14:45:29 +00:00
rcourtman
6019e3e77e fix: normalize custom OpenAI-compatible API URLs (#1067)
Users providing base URLs like "https://openrouter.ai/api/v1" were
getting HTML error responses because the client used the URL directly
without appending "/chat/completions".

- Normalize baseURL in NewOpenAIClient to ensure it ends with /chat/completions
- Fix modelsEndpoint() to derive /models from the normalized baseURL
- Add tests for URL normalization with various endpoint formats
2026-01-09 09:13:36 +00:00
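The normalization in 6019e3e77e can be sketched roughly as follows. This is an illustration under assumptions, not the code in NewOpenAIClient: the helper names normalizeBaseURL and chatCompletionsPath are hypothetical, and the sketch ignores query strings.

```go
package main

import "strings"

const chatCompletionsPath = "/chat/completions"

// normalizeBaseURL returns baseURL ending in "/chat/completions",
// whether or not the caller already included that suffix.
func normalizeBaseURL(baseURL string) string {
	u := strings.TrimRight(strings.TrimSpace(baseURL), "/")
	if strings.HasSuffix(u, chatCompletionsPath) {
		return u
	}
	return u + chatCompletionsPath
}

// modelsEndpoint derives the sibling "/models" URL from a normalized URL.
func modelsEndpoint(normalized string) string {
	return strings.TrimSuffix(normalized, chatCompletionsPath) + "/models"
}
```

Under this sketch, "https://openrouter.ai/api/v1" becomes "https://openrouter.ai/api/v1/chat/completions", and the models endpoint derived from it is "https://openrouter.ai/api/v1/models".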
rcourtman
020553a12d fix: use flexible subnet matching instead of fixed /24
The previous implementation assumed /24 subnets, which failed for
larger networks (e.g., /16 or /20). Now uses progressive subnet
matching that tries /24, /20, and /16 to handle various network sizes.

Example: If connection IP is 10.1.1.5 and a node has 10.1.2.6,
it now correctly identifies them as being on the same network.
2026-01-08 23:24:50 +00:00
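A sketch of the progressive matching described in 020553a12d, using the standard net/netip package. The function name sameNetwork and its return shape are assumptions; only the /24, /20, /16 order and the example addresses come from the commit message.

```go
package main

import "net/netip"

// sameNetwork tries /24, then /20, then /16 and returns the first prefix
// length at which both IPv4 addresses fall inside the same network.
func sameNetwork(a, b string) (prefixLen int, ok bool) {
	ipA, errA := netip.ParseAddr(a)
	ipB, errB := netip.ParseAddr(b)
	if errA != nil || errB != nil || !ipA.Is4() || !ipB.Is4() {
		return 0, false
	}
	for _, bits := range []int{24, 20, 16} {
		// Prefix masks off the host bits of ipA at the given length.
		network, err := ipA.Prefix(bits)
		if err != nil {
			continue
		}
		if network.Contains(ipB) {
			return bits, true
		}
	}
	return 0, false
}
```

For the example in the commit, 10.1.1.5 and 10.1.2.6 do not share a /24 but both sit in 10.1.0.0/20, so the sketch reports a match at /20.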
rcourtman
bd1df9f942 feat: automatic subnet preference for cluster node discovery
When discovering cluster nodes, Pulse now automatically prefers IPs
on the same subnet as the initial connection. This fixes the common
issue where Pulse used internal cluster network IPs (e.g., 172.x.x.x)
instead of management network IPs (e.g., 10.x.x.x).

How it works:
1. Extract subnet from initial connection URL (assumes /24 for IPv4)
2. For each discovered node, query /nodes/{node}/network for all IPs
3. If cluster-reported IP is on a different subnet, find an IP on
   the preferred subnet and set it as IPOverride
4. Manual IPOverride settings are preserved and take precedence

This eliminates the need for manual IPOverride configuration in most
multi-network Proxmox setups.

Refs #929, #1066
2026-01-08 23:12:30 +00:00
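The four steps in bd1df9f942 could look roughly like this. It is a sketch under assumptions: preferredOverride is a hypothetical helper, the node IPs are passed in rather than fetched from /nodes/{node}/network, and the /24 assumption matches this commit (the later commit 020553a12d relaxes it).

```go
package main

import "net/netip"

// preferredOverride returns the IP to store as IPOverride, or "" when no
// override is needed or none can be found.
func preferredOverride(connectionIP, clusterIP string, nodeIPs []string, manualOverride string) string {
	// Step 4: a manual override always wins.
	if manualOverride != "" {
		return manualOverride
	}
	// Step 1: derive the preferred /24 from the initial connection IP.
	conn, err := netip.ParseAddr(connectionIP)
	if err != nil || !conn.Is4() {
		return ""
	}
	subnet, err := conn.Prefix(24)
	if err != nil {
		return ""
	}
	// Step 3a: the cluster-reported IP is already on the preferred subnet.
	if ip, err := netip.ParseAddr(clusterIP); err == nil && subnet.Contains(ip) {
		return ""
	}
	// Steps 2 and 3b: pick the first node IP on the preferred subnet.
	for _, raw := range nodeIPs {
		if ip, err := netip.ParseAddr(raw); err == nil && subnet.Contains(ip) {
			return ip.String()
		}
	}
	return ""
}
```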
rcourtman
d5c93fd226 fix: add cluster endpoint IP override and Windows agent download support
1. Add IPOverride field to ClusterEndpoint struct
   - Allows users to specify a custom IP that takes precedence over auto-discovered IPs
   - Fixes #929 and #1066 where Pulse used internal cluster IPs instead of management IPs
   - Added EffectiveIP() method to cleanly handle the override logic

2. Update connection code to use EffectiveIP()
   - monitor.go: Use override when building endpoint URLs
   - temperature_proxy.go: Use override for proxy connections

3. Add bare Windows EXE files to GitHub releases
   - Fixes #1064 where LXC/barebone installs couldn't download Windows agents
   - Modified build-release.sh to copy EXEs alongside ZIPs
   - Added EXEs to checksum generation
2026-01-08 23:04:25 +00:00
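The override logic from d5c93fd226 is small enough to sketch in full. ClusterEndpoint, IPOverride, and EffectiveIP() are named in the commit; the IP field name and the exact struct layout here are assumptions.

```go
package main

// ClusterEndpoint is reduced to the two fields the override logic needs.
type ClusterEndpoint struct {
	IP         string // auto-discovered IP (field name assumed)
	IPOverride string // user-specified IP, empty when unset
}

// EffectiveIP returns the override when one is set and the
// auto-discovered IP otherwise.
func (e ClusterEndpoint) EffectiveIP() string {
	if e.IPOverride != "" {
		return e.IPOverride
	}
	return e.IP
}
```

Callers that build endpoint URLs would then use EffectiveIP() instead of reading either field directly.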
rcourtman
568aac6bd0 fix: multiple triage fixes for stability and correctness
1. Use correct mutex (diagMu) in cleanupDiagnosticSnapshots to prevent
   "concurrent map iteration and map write" panics (Fixes #1063)

2. Use cluster name for storage instance comparison in UpdateStorageForInstance
   to prevent storage duplication in clustered Proxmox setups (Fixes #1062)

3. Fix KUBECONFIG unbound variable error in install.sh by using ${KUBECONFIG:-}
   default parameter expansion (Fixes #1065)
2026-01-08 22:54:33 +00:00
rcourtman
06ebaf50b2 fix: use consistent ID for shared storage to prevent duplication (#1049)
Shared storage was duplicating across polling cycles because the ID
included the node name of whichever node reported it first. When a
different node reported first on the next cycle, a new ID was created.

This fix updates the shared storage aggregation to use a consistent ID
format (instance-cluster-storageName) that doesn't include the node name.

Closes #1049. Thanks to @siccous for the report and initial investigation.
2026-01-08 21:29:24 +00:00
rcourtman
5f0214b949 fix: support ReportIP override in Proxmox auto-registration (#1061)
2026-01-08 21:20:51 +00:00
rcourtman
33bb0a95bb docs: Fix formatting in API reference
2026-01-08 20:15:25 +00:00
rcourtman
6de1c660b1 chore: Improve pre-commit data validation and ignore patterns
2026-01-08 20:04:02 +00:00
rcourtman
3801b7ad7a chore: Ignore husky internal directory
2026-01-08 19:37:04 +00:00
rcourtman
73c5128a87 feat(audit): Add audit log API endpoints and UI with signature verification
- Add GET /api/audit endpoint for listing events with filters
- Add GET /api/audit/:id/verify endpoint for signature verification
- Add AuditLogPanel UI component with filtering and verification
- Update docs with audit API documentation
- Add localStorage utils for persisting UI state
- Update gitignore patterns
2026-01-08 19:19:57 +00:00
rcourtman
7342191075 docs: fix Helm chart install commands to use GitHub Pages repo
The GHCR OCI registry (ghcr.io/rcourtman/pulse-chart) is returning 403/404
errors for unauthenticated users. Updated all Helm references to use the
working GitHub Pages Helm repository at https://rcourtman.github.io/Pulse

Fixes install issues reported by customers trying to deploy via Helm.

Files updated:
- docs/KUBERNETES.md
- docs/INSTALL.md
- docs/DEPLOYMENT_MODELS.md
- docs/UPGRADE_v5.md
2026-01-08 14:27:45 +00:00
rcourtman
22e01e2244 feat: Add centralized agent configuration management (Pro)
Allows administrators to create configuration profiles and assign them
to agents for centralized fleet management.

- Configuration profiles with customizable settings (Docker, K8s,
  Proxmox monitoring, log level, reporting interval)
- Profile assignment to agents by ID
- Agent-side remote config client to fetch settings on startup
- Full CRUD API at /api/admin/profiles
- Settings UI panel in Settings → Agents → Agent Profiles
- Automatic cleanup of assignments when profiles are deleted
2026-01-08 12:06:36 +00:00
rcourtman
7db6b3e47d feat: Add AI chat session sync across devices
Implements server-side persistence for AI chat sessions, allowing users
to continue conversations across devices and browser sessions. Related
to #1059.

Backend:
- Add chat session CRUD API endpoints (GET/PUT/DELETE)
- Add persistence layer with per-user session storage
- Support session cleanup for old sessions (90 days)
- Multi-user support via auth context

Frontend:
- Rewrite aiChat store with server sync (debounced)
- Add session management UI (new conversation, switch, delete)
- Local storage as fallback/cache
- Initialize sync on app startup when AI is enabled
2026-01-08 10:47:45 +00:00
rcourtman
695ced6273 docs: Add API token scopes and kiosk mode documentation
Documents all available token scopes, UI presets, and step-by-step
instructions for setting up kiosk mode with read-only dashboard tokens.

Related to #1055
2026-01-08 10:27:15 +00:00
rcourtman
f29badbd1f feat: Add kiosk mode support with read-only dashboard tokens
- Add "Kiosk / Dashboard" preset in API token manager for easy token creation
- Backend returns token scopes in /api/security/status when authenticated via token
- Frontend hides Settings tab when token lacks settings:read scope
- URL-based token auth via ?token=xxx now properly reports scopes

Users can now create a monitoring:read token and use it in kiosk displays
without exposing settings or requiring cookie persistence.

Related to #1055
2026-01-08 10:18:27 +00:00
rcourtman
49272bd48c fix: Show usable RAIDZ capacity instead of raw pool size
For RAIDZ/mirror pools, zpool list SIZE reports raw capacity (sum of
all disks), but users expect usable capacity (accounting for parity).
The dataset stats from statfs give the correct usable capacity.

Now uses dataset Total when it's smaller than zpool Size, indicating
RAIDZ/mirror overhead.

Related to #1052
2026-01-08 09:38:18 +00:00
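The selection rule in 49272bd48c amounts to a guarded minimum. A sketch, with a hypothetical function name and an added zero check that the commit does not mention:

```go
package main

// usableTotal prefers the dataset total (from statfs) when it is smaller
// than the zpool SIZE, which indicates parity or mirror overhead.
// The zero guard is an assumption to avoid reporting an empty dataset
// reading as the pool capacity.
func usableTotal(zpoolSize, datasetTotal uint64) uint64 {
	if datasetTotal > 0 && datasetTotal < zpoolSize {
		return datasetTotal
	}
	return zpoolSize
}
```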
rcourtman
8c4bef27f0 docs: improve reverse proxy HTTPS detection and Swarm troubleshooting
- Add detailed HTTPS detection troubleshooting to REVERSE_PROXY.md
- Explain X-Forwarded-Proto header requirement for nginx/Caddy/Apache
- Add Docker Swarm troubleshooting section to UNIFIED_AGENT.md
- Document how to force Docker runtime if auto-detection fails

Based on customer feedback.
2026-01-07 18:23:48 +00:00
rcourtman
e4c17777d0 feat: Add deployment strategy configuration to Helm chart
Added strategy.type option to values.yaml (default: RollingUpdate) to allow
users to configure the deployment strategy. Users with ReadWriteOnce (RWO)
persistent volumes should set this to "Recreate" to avoid Multi-Attach errors
during upgrades.

Related to #1057
2026-01-07 17:57:41 +00:00
rcourtman
95fb896a03 fix: Agent 405 errors when reverse proxy redirects HTTP to HTTPS
When a user's reverse proxy redirects HTTP to HTTPS, Go's default HTTP
client behavior converts POST requests to GET on 301/302 redirects
(per HTTP specification). This causes the Pulse server to return 405
"Only POST is allowed" errors.

Added CheckRedirect to all agent HTTP clients (host, docker, kubernetes)
that returns a clear error message guiding users to use the correct
protocol in their --url flag instead of silently following redirects.

Related to #1058
2026-01-07 17:56:07 +00:00
rcourtman
3f0808e9f9 docs: comprehensive core and Pro documentation overhaul
- Major updates to README.md and docs/README.md for Pulse v5
- Added technical deep-dives for Pulse Pro (docs/PULSE_PRO.md) and AI Patrol (docs/AI.md)
- Updated Prometheus metrics documentation and Helm schema for metrics separation
- Refreshed security, installation, and deployment documentation for unified agent models
- Cleaned up legacy summary files
2026-01-07 17:38:27 +00:00
rcourtman
9cfcdbb247 fix: Use per-node shared flag for storage deduplication
The storage deduplication logic only checked cluster config's Shared
flag, but this required the cluster config API call to succeed. When
the per-node storage API already returns shared=1 (as the user
verified), we should use that directly.

Now we check three sources for shared storage detection:
1. Per-node API shared flag (storage.Shared)
2. Cluster config shared flag (if available)
3. Storage type heuristics (NFS, RBD, PBS, etc.)

Related to #1049
2026-01-07 10:16:23 +00:00
rcourtman
dcdbee3c5c feat: Add in-app help system with HelpIcon component
Add contextual help icons throughout the UI to improve feature
discoverability. Users can click (?) icons to see explanations
with examples for settings they might not understand.

- HelpIcon component with click-to-open popover
- Centralized help content registry in /content/help/
- FeatureTip component for dismissible contextual tips
- Help added to: alert delay, AI endpoints, update channel
2026-01-07 09:22:23 +00:00
rcourtman
b75b33b9fe fix: Read form values from DOM for password manager compatibility
Password managers may fill form fields programmatically without
triggering input events, causing SolidJS signals to remain empty.
This fix reads values directly from the DOM on submit, ensuring
credentials filled by password managers are properly captured.

Related to #1036
2026-01-06 22:25:11 +00:00
rcourtman
73e6a8edc5 fix: Add missing UI for physical disk polling interval setting
The previous commit (06261627) added backend support for configurable
physical disk polling intervals but didn't include the UI to configure it.

Adds a dropdown selector (5/15/30/60 minutes) that appears when physical
disk monitoring is enabled.

Related to #1007
2026-01-06 20:32:24 +00:00
rcourtman
96d06da0d7 fix: Deduplicate shared storages (NFS, RBD, PBS, etc) in cluster view
Shared storages were appearing multiple times (once per node) because
the deduplication logic only checked the Proxmox `Shared` flag. Many
storage types are inherently cluster-wide but don't set this flag:

- RBD (Ceph block storage)
- CephFS
- PBS (Proxmox Backup Server)
- GlusterFS
- NFS
- CIFS/SMB
- iSCSI

Now we detect shared storage based on both the Shared flag AND the
storage type. Inherently shared storage types are deduplicated and
shown once with a "cluster" node designation.

Related to #1049
2026-01-06 17:44:52 +00:00
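The detection rule in 96d06da0d7 combines the Proxmox flag with a type check. A sketch, assuming lowercase type identifiers for the storage types listed in the commit (the exact strings Proxmox reports may differ) and a hypothetical function name:

```go
package main

import "strings"

// inherentlySharedTypes lists storage types treated as cluster-wide even
// when the Shared flag is not set. Identifiers are assumed spellings.
var inherentlySharedTypes = map[string]bool{
	"rbd":       true,
	"cephfs":    true,
	"pbs":       true,
	"glusterfs": true,
	"nfs":       true,
	"cifs":      true,
	"iscsi":     true,
}

// isSharedStorage reports whether a storage should be deduplicated
// across nodes: either the flag says so or the type implies it.
func isSharedStorage(sharedFlag bool, storageType string) bool {
	return sharedFlag || inherentlySharedTypes[strings.ToLower(storageType)]
}
```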
rcourtman
d3116defe3 fix: Prevent panic from send on closed websocket channel
Add atomic `closed` flag to Client struct and `safeSend()` helper method
to prevent race condition when sending to client channels. The race
occurred when a client disconnected while a goroutine was trying to send
initial state - the channel could be closed between the registration
check and the actual send.

All sends to client.send now go through safeSend() which checks the
closed flag first. The flag is set atomically before closing the channel
in all code paths (unregister, dispatchToClients, broadcast, shutdown).

Related to #1048
2026-01-06 17:41:25 +00:00
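A sketch of the pattern described in d3116defe3: an atomic closed flag that is set before the channel is closed, and a safeSend() helper that checks it. The Client fields, the newClient and shutdown helpers, and the non-blocking send are assumptions. The recover is an extra backstop added in this sketch because the flag can still flip between the check and the send; holding a mutex around both would remove that window entirely.

```go
package main

import "sync/atomic"

// Client is reduced to the two fields relevant to the race.
type Client struct {
	send   chan []byte
	closed atomic.Bool
}

func newClient(buffer int) *Client {
	return &Client{send: make(chan []byte, buffer)}
}

// shutdown sets the flag first, then closes the channel. CompareAndSwap
// guarantees the channel is closed at most once.
func (c *Client) shutdown() {
	if c.closed.CompareAndSwap(false, true) {
		close(c.send)
	}
}

// safeSend reports whether msg was queued. It never panics and never blocks.
func (c *Client) safeSend(msg []byte) (sent bool) {
	if c.closed.Load() {
		return false
	}
	defer func() {
		// Backstop for a close that lands between the check and the send.
		if recover() != nil {
			sent = false
		}
	}()
	select {
	case c.send <- msg:
		return true
	default:
		return false // buffer full
	}
}
```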
rcourtman
48fdff3efb fix: Preserve ackState for old acknowledged alerts during restore
When LoadActiveAlerts skipped acknowledged alerts older than 1 hour,
it was also not populating ackState. This meant that when the same
alert (e.g., backup-age) was recreated on the next poll cycle,
preserveAlertState couldn't find any acknowledgement record and
the alert would retrigger notifications.

Now ackState is populated even for skipped old acknowledged alerts,
so if they reappear, the acknowledgement will be restored.

Related to #1043
2026-01-06 11:00:36 +00:00
rcourtman
74ea90e4b3 fix: Podman sockets not prioritized when --docker-runtime=podman
When --docker-runtime=podman is explicitly set, the agent should try
Podman-specific sockets first before falling back to environment
defaults (which try /var/run/docker.sock).

Also adds /var/run/podman/podman.sock as a candidate socket path,
which is used by CoreOS and some Fedora configurations.

Related to #1045
2026-01-06 10:56:37 +00:00
rcourtman
d7000fafb6 fix: Empty array expansion fails on macOS bash 3.2 with set -u
macOS ships with bash 3.2 (GPLv2), which has a bug where expanding
an empty array like ${array[@]} with set -u enabled throws an
"unbound variable" error, even when the array is initialized.

Use ${arr[@]+"${arr[@]}"} pattern to safely handle empty arrays.

Related to #1046
2026-01-06 10:52:44 +00:00
rcourtman
cfcba70b2b chore: Bump version to 5.0.12
2026-01-05 23:48:57 +00:00
rcourtman
d0191d136f fix: Add configurable poll timeout and handle external Ceph storage
Changes:
1. Add MAX_POLL_TIMEOUT env var for large Proxmox clusters that need
   more than 3 minutes for polling (default: 3m, minimum: 30s)
2. Handle external Ceph storage gracefully - don't mark nodes unhealthy
   when Proxmox returns 'binary not installed' (e.g., for Ceph not
   managed by Proxmox)

Related to #965
2026-01-05 23:34:33 +00:00
rcourtman
c6182b2ed3 feat: Add FreeBSD/OPNsense support for the Pulse agent
Added FreeBSD amd64 and arm64 build targets to the release process:
- Build host-agent and unified agent binaries for FreeBSD
- Package FreeBSD tarballs in releases
- Include FreeBSD binaries in universal tarball for download endpoint

Updated agent install script with FreeBSD support:
- Fixed architecture detection (FreeBSD reports 'amd64' not 'x86_64')
- Added FreeBSD rc.d service handler with proper daemon management
- Automatic service enabling via rc.conf

This enables users to run the Pulse agent on FreeBSD-based systems
like OPNsense, pfSense, and vanilla FreeBSD.

Fixes #1041
2026-01-05 18:18:06 +00:00
rcourtman
0826c4ddb2 fix: Show linked agents in Managed Agents table with badge
Previously, agents linked to Proxmox nodes were hidden from the
Settings > Agents > Managed Agents table, which confused users who
couldn't find their installed agents.

Now all agents are shown in the table, with linked agents displaying
an indigo 'Linked' badge that explains they're also merged with
Proxmox nodes in the Dashboard.

Fixes #1038
2026-01-05 17:57:11 +00:00
rcourtman
0b6bceb96f fix: Hide non-functional edit button for Docker hosts in thresholds table
Related to discussion #1040
2026-01-05 17:13:43 +00:00
rcourtman
e4d7f6fd3d fix: Allow querying non-PBS backup storage with Active=0
Previously, only PBS-type storages were queried when Active=0 because
querying inactive storage can return 500 errors. However, this caused
backups from datacenter backup tasks on shared storage (NFS, CIFS, etc.)
to not appear when the storage reported Active=0 on some nodes.

Now any storage with backup content is queried regardless of Active status.
If the storage is truly unavailable, GetStorageContent returns an error
which is already handled gracefully (logged and skipped).

Related to #1037
2026-01-05 14:53:40 +00:00
rcourtman
2cc9214336 feat: Make container update alerts a free feature
Update alerts for Docker containers are now available to all users,
not just Pro license holders. The feature alerts when container image
updates have been pending for longer than the configured delay
(default: 24 hours).

- Remove Pro license gating from update alerts
- Add FeatureUpdateAlerts to free tier features
- Remove obsolete license gating tests

Related to #1031
2026-01-04 23:59:29 +00:00
rcourtman
f210ef5517 Auto-update Helm chart version to 5.0.11
2026-01-04 20:01:07 +00:00
rcourtman
9388a13718 Auto-update Helm chart documentation
2026-01-04 20:01:06 +00:00
rcourtman
3b70e29b87 test: add PULSE_DATA_DIR to TestMainCmd
TestMainCmd was missing PULSE_DATA_DIR setup, causing it to try to
access /etc/pulse which fails in CI.
2026-01-04 19:15:38 +00:00
rcourtman
21a819f6dc test: use t.Setenv for safer test cleanup
t.Setenv ensures environment variables are restored after test
completion, preventing race conditions where background goroutines
(like config watchers) might access unset env vars during cleanup.
2026-01-04 19:08:45 +00:00
rcourtman
fdba559167 test: skip tests requiring /etc/pulse in CI
Tests that use the default /etc/pulse data directory fail in CI
where the directory doesn't exist and can't be created.
2026-01-04 18:59:48 +00:00
rcourtman
1731489709 test: remove obsolete EnsureDirError test
The test was checking an error path that no longer exists -
NewConfigPersistence now falls back to /etc/pulse when directory
creation fails, and calls log.Fatal() only when that also fails.
2026-01-04 18:51:02 +00:00