Commit graph

4826 commits

Author SHA1 Message Date
rcourtman
5898cb81be Fix update modal hanging indefinitely after completion (related to #628)
When updates complete quickly, the status API may return 'completed' before
the frontend detects the 'restarting' phase. This left users staring at a
frozen modal with no feedback, requiring manual page refresh.

Changes:
- When status is 'completed', immediately check /api/health
- If backend is healthy, reload the page to get new version
- If health check fails, assume restart in progress and start health polling
- Ensures the page always reloads to the new version automatically

This fixes the UX issue reported in discussion #628 where the update modal
appeared frozen indefinitely despite successful update completion.
2025-11-07 08:11:52 +00:00
rcourtman
b5ef239973 Add container detection warning to pulse-sensor-proxy startup (related to #628)
When pulse-sensor-proxy runs inside a container (Docker/LXC), it cannot
complete SSH workflows properly, leading to continuous [preauth] log floods
on the Proxmox host. This happens because the proxy is meant to run on the
host, not inside the container.

Changes:
- Import internal/system for InContainer() detection
- Add startup warning when running in containerized environment
- Point users to docs/TEMPERATURE_MONITORING.md for correct setup
- Allow suppression via PULSE_SENSOR_PROXY_SUPPRESS_CONTAINER_WARNING=true

This catches the misconfiguration early and directs users to supported
installation methods, preventing the SSH spam reported in discussion #628.
2025-11-06 23:41:29 +00:00
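A minimal Go sketch of the startup warning logic this commit describes. The detection heuristic here (checking /.dockerenv and PID 1's cgroup) is an assumption about what internal/system.InContainer() might do, not its actual implementation; the suppression env var name comes from the commit message.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// inContainer is a hypothetical heuristic of the kind
// internal/system.InContainer() might use: Docker leaves /.dockerenv
// behind, and container runtimes show up in PID 1's cgroup path.
func inContainer() bool {
	if _, err := os.Stat("/.dockerenv"); err == nil {
		return true
	}
	data, err := os.ReadFile("/proc/1/cgroup")
	if err != nil {
		return false
	}
	s := string(data)
	return strings.Contains(s, "docker") || strings.Contains(s, "lxc")
}

// shouldWarn applies the suppression rule from the commit message.
func shouldWarn(inContainer bool, suppressEnv string) bool {
	return inContainer && suppressEnv != "true"
}

func main() {
	if shouldWarn(inContainer(), os.Getenv("PULSE_SENSOR_PROXY_SUPPRESS_CONTAINER_WARNING")) {
		fmt.Println("WARNING: pulse-sensor-proxy appears to be running inside a container;")
		fmt.Println("see docs/TEMPERATURE_MONITORING.md for the supported host-side setup.")
	}
}
```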
rcourtman
6a48c759e8 Fix critical notification system bugs and security issues
This commit addresses multiple critical issues identified in the notification
system audit conducted with Codex:

**Critical Fixes:**

1. **Queue Retry Logic (Critical #1)**
   - Fixed broken retry/DLQ system where send functions never returned errors
   - Made sendGroupedEmail(), sendGroupedWebhook(), sendGroupedApprise() return errors
   - Made sendWebhookRequest() return errors
   - ProcessQueuedNotification() now properly propagates errors to queue
   - Retry logic and DLQ now function correctly

2. **Attempt Counter Bug (Critical #2)**
   - Fixed double-increment bug in queue processing
   - Separated UpdateStatus() from attempt tracking
   - Added IncrementAttempt() method
   - Notifications now get correct number of retry attempts

3. **Secret Exposure (Critical #3 & #4)**
   - Masked webhook headers and customFields in GET /api/notifications/webhooks
   - Added redactSecretsFromURL() to sanitize webhook URLs in history
   - Truncated/redacted response bodies in webhook history
   - Protected against credential harvesting via API

4. **Email Rate Limiting (Critical #5)**
   - Added emailManager field to NotificationManager
   - Shared EnhancedEmailManager instance across sends
   - Rate limiter now accumulates across multiple emails
   - SMTP rate limits are now enforced correctly

5. **SSRF Protection (High #6)**
   - Added DNS resolution of webhook URLs
   - Added isPrivateIP() check using CIDR ranges
   - Blocks all private IP ranges (10/8, 172.16/12, 192.168/16, 127/8, 169.254/16)
   - Blocks IPv6 private ranges (::1, fe80::/10, fc00::/7)
   - Prevents DNS rebinding attacks
   - Returns error instead of warning for private IPs

**New Features:**

6. **Health Endpoint (High #8)**
   - Added GET /api/notifications/health
   - Returns queue stats (pending, sending, sent, failed, dlq)
   - Shows email/webhook configuration status
   - Provides overall health indicator

**Related to notification system audit**

Files changed:
- internal/notifications/notifications.go: Error returns, rate limiting, SSRF hardening
- internal/notifications/queue.go: Attempt tracking fix
- internal/api/notifications.go: Secret masking, health endpoint
2025-11-06 23:26:03 +00:00
rcourtman
3eafd00c88 Fix Helm chart workflow 403 errors by granting write permissions
The publish-helm-chart workflow was failing with 403 errors when attempting
to upload Helm chart assets to GitHub releases. This was caused by the workflow
having only 'contents: read' permission. Changed to 'contents: write' to allow
the 'gh release upload' command to succeed.
2025-11-06 22:50:08 +00:00
rcourtman
fa7ca00250 Fix duplicate checksum in build-release.sh
The checksum generation was including pulse-host-agent-v*-darwin-arm64.tar.gz
twice: once from the *.tar.gz pattern and once from the pulse-host-agent-*
pattern. Fixed by using extglob to exclude .tar.gz and .sha256 files from
the agent binary patterns since tarballs are already matched separately.
2025-11-06 22:19:16 +00:00
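The principle behind this fix, sketched in Go rather than the script's bash/extglob: when several glob patterns can match the same file, build the checksum list through a set so nothing is hashed twice. This is an illustration of the idea, not build-release.sh's actual code.

```go
package main

import "fmt"

// dedupe merges the match lists of overlapping patterns, keeping the
// first occurrence of each file so checksums are generated exactly once.
func dedupe(patternMatches ...[]string) []string {
	seen := make(map[string]bool)
	var out []string
	for _, matches := range patternMatches {
		for _, f := range matches {
			if !seen[f] {
				seen[f] = true
				out = append(out, f)
			}
		}
	}
	return out
}

func main() {
	// The darwin-arm64 tarball matched both *.tar.gz and pulse-host-agent-*.
	tarballs := []string{"pulse-host-agent-v1-darwin-arm64.tar.gz"}
	agents := []string{"pulse-host-agent-v1-darwin-arm64.tar.gz", "pulse-host-agent-v1-linux-amd64"}
	fmt.Println(dedupe(tarballs, agents))
}
```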
rcourtman
b356ba0fec Bump version to 4.26.4
Version alignment for upcoming release including:
- Layout and table overflow fixes (related to #643)
- Webhook alert persistence fix
- Docker host row dimming fix
- Agent installation script deployment fix (related to #644)
- Guest agent disk data regression fix
- Config backup/restore fixes (related to #646)
- Bootstrap token UX improvements
2025-11-06 22:14:45 +00:00
rcourtman
4f9ba7a285 Allow layout to expand on wide displays (related to #643)
Changed .pulse-shell from fixed 95rem cap to fluid clamp(95rem, 92vw, 120rem)
to match standard monitoring dashboard behavior (Proxmox, Grafana, Portainer).

On laptops/small screens: unchanged (capped at 1520px)
On 1080p displays: expands to ~1766px usable width
On 4K/ultrawide: expands up to 1920px max for readability

Added back 2xl column widths (totaling ~1720px) that properly fit within
the expanded shell, giving wide-display users more breathing room while
maintaining proportional scaling across all breakpoints.

Changed files:
- index.css: Update .pulse-shell max-width to use clamp()
- Dashboard.tsx: Add 2xl column widths calculated for expanded shell
- GuestRow.tsx: Add matching 2xl column widths
2025-11-06 21:51:17 +00:00
rcourtman
68caf5592b Fix Proxmox dashboard table overflow on wide displays (related to #643)
Removed 2xl: width overrides that caused the table to exceed container width.
At ≥1536px viewport, the 2xl breakpoint expanded table columns to ~1528px
total width while .pulse-shell container provides only ~1416px usable space,
forcing Net In/Net Out columns off-screen and requiring horizontal scroll.

Table now caps at xl: breakpoint widths (~1266px) which fit comfortably within
the container at all viewport sizes. Net In/Net Out columns are now visible
without scrolling on 1080p, 4K, and all wide displays.

Changed files:
- Dashboard.tsx: Remove 2xl: width classes from all table header columns
- GuestRow.tsx: Remove 2xl: width classes from all table cell columns
2025-11-06 21:36:30 +00:00
rcourtman
4891f06e76 Fix webhook alerts persisting when DisableAll* flags are enabled
The original fix in c6c0ac63e only handled per-resource overrides when
thresholds were disabled (trigger <= 0 or Disabled=true). It did not
handle global DisableAll* flags (DisableAllStorage, DisableAllNodes,
DisableAllGuests, etc.).

When a user toggled a DisableAll* flag from false to true:
- Check* functions returned early without processing
- Existing active alerts remained in m.activeAlerts map
- Those alerts continued generating webhook notifications
- reevaluateActiveAlertsLocked didn't check DisableAll* flags

This commit fixes the issue by:

1. Updating reevaluateActiveAlertsLocked to check all DisableAll* flags
   and resolve alerts for those resource types during config updates

2. Adding alert cleanup to Check* functions before early returns:
   - CheckStorage: clears usage and offline alerts
   - CheckNode: clears cpu/memory/disk/temperature and offline alerts
   - CheckPMG: clears queue/message alerts and offline alerts
   - CheckPBS: clears cpu/memory and offline alerts
   - CheckHost: calls existing cleanup helpers

3. Adding comprehensive test coverage for DisableAllStorage scenario

Related to #561
2025-11-06 21:17:56 +00:00
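The core of fix #1 above can be sketched as a sweep over the active-alert map: when a DisableAll* flag is set, alerts of that resource type are resolved instead of being left to keep firing webhooks. Types and names here are pared-down stand-ins, not Pulse's actual structs.

```go
package main

import "fmt"

// Alert is a minimal stand-in for Pulse's active-alert record.
type Alert struct {
	ID           string
	ResourceType string // "storage", "node", "guest", ...
}

// resolveDisabled sketches what reevaluateActiveAlertsLocked now does on
// config updates: drop every active alert whose resource type has a
// DisableAll* flag enabled, returning them so callers can emit resolutions.
func resolveDisabled(active map[string]Alert, disabled map[string]bool) []Alert {
	var resolved []Alert
	for id, a := range active {
		if disabled[a.ResourceType] {
			resolved = append(resolved, a)
			delete(active, id) // deleting during range is safe in Go
		}
	}
	return resolved
}

func main() {
	active := map[string]Alert{
		"s1": {ID: "s1", ResourceType: "storage"},
		"n1": {ID: "n1", ResourceType: "node"},
	}
	resolved := resolveDisabled(active, map[string]bool{"storage": true})
	fmt.Printf("resolved=%d remaining=%d\n", len(resolved), len(active))
}
```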
rcourtman
2c3768341a Fix Docker host row dimming for degraded status
Docker hosts with 'degraded' status were incorrectly appearing dimmed
(opacity-60) in the summary table, making them visually identical to
offline hosts. This was confusing because degraded hosts are still
actively reporting - they just have unhealthy containers or >35% of
containers not running.

The isHostOnline function now treats 'degraded' as an online status,
so these rows maintain full opacity. The status badge already provides
visual indication of the degraded state.
2025-11-06 19:11:17 +00:00
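The rule is small enough to state as code — here rendered as a Go sketch of the frontend's isHostOnline (the actual function is TypeScript, and the status strings are assumptions from the commit message):

```go
package main

import "fmt"

// isHostOnline treats 'degraded' as online: degraded hosts are still
// actively reporting, so their rows keep full opacity and only the
// status badge signals the degraded state.
func isHostOnline(status string) bool {
	switch status {
	case "online", "degraded":
		return true
	default:
		return false
	}
}

func main() {
	for _, s := range []string{"online", "degraded", "offline"} {
		fmt.Printf("%s -> online=%v\n", s, isHostOnline(s))
	}
}
```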
rcourtman
586ab3a740 Fix install.sh to deploy all agent installation scripts (related to #644)
Root cause: v4.26.3 tarball and Docker image contained all 8 agent scripts,
but install.sh only copied install-docker-agent.sh to /opt/pulse/scripts/.
Users upgrading via install.sh ended up with missing scripts, causing 404s
when trying to add hosts via the UI.

Changes:
- Add deploy_agent_scripts() function to systematically deploy all scripts
- Deploy all 8 scripts: install-{docker,container,host}-agent.{sh,ps1},
  uninstall-host-agent.{sh,ps1}, install-sensor-proxy.sh, install-docker.sh
- Apply to both main installation and rollback/recovery code paths

This ensures bare-metal installations have feature parity with Docker deployments.
2025-11-06 18:59:32 +00:00
rcourtman
1a78dcbba2 Fix guest agent disk data regression on Proxmox 8.3+
Related to #630

Proxmox 8.3+ changed the VM status API to return the `agent` field as an
object ({"enabled":1,"available":1}) instead of an integer (0 or 1). This
caused Pulse to incorrectly treat VMs as having no guest agent, resulting
in missing disk usage data (disk:-1) even when the guest agent was running
and functional.

The issue manifested as:
- VMs showing "Guest details unavailable" or missing disk data
- Pulse logs showing no "Guest agent enabled, querying filesystem info" messages
- `pvesh get /nodes/<node>/qemu/<vmid>/agent/get-fsinfo` working correctly
  from the command line, confirming the agent was functional

Root cause:
The VMStatus struct defined `Agent` as an int field. When Proxmox 8.3+ sent
the new object format, JSON unmarshaling silently left the field at zero,
causing Pulse to skip all guest agent queries.

Changes:
- Created VMAgentField type with custom UnmarshalJSON to handle both formats:
  * Legacy (Proxmox <8.3): integer (0 or 1)
  * Modern (Proxmox 8.3+): object {"enabled":N,"available":N}
- Updated VMStatus.Agent from `int` to `VMAgentField`
- Updated all references to `detailedStatus.Agent` to use `.Agent.Value`
- The unmarshaler prioritizes the "available" field over "enabled" to ensure
  we only query when the agent is actually responding

This fix maintains backward compatibility with older Proxmox versions while
supporting the new format introduced in Proxmox 8.3+.
2025-11-06 18:42:46 +00:00
rcourtman
7ed9203e4b Fix config backup/restore failures (related to #646)
Addresses two issues preventing configuration backup/restore:

1. Export passphrase validation mismatch: UI only validated 12+ char
   requirement when using custom passphrase, but backend always enforced
   it. Users with shorter login passwords saw unexplained failures.
   - Frontend now validates all passphrases meet 12-char minimum
   - Clear error message suggests custom passphrase if login password too short

2. Import data parsing failed silently: Frontend sent `exportData.data`
   which was undefined for legacy/CLI backups (raw base64 strings).
   Backend rejected these with no logs.
   - Frontend now handles both formats: {status, data} and raw strings
   - Backend logs validation failures for easier troubleshooting

Related to #646 where user reported "error after entering password" with
no container logs. These changes ensure proper validation feedback and
make the backup system resilient to different export formats.
2025-11-06 17:53:54 +00:00
rcourtman
b50dba577f Fix demo server workflow verification by adding authentication
The workflow was failing because /api/state requires authentication,
but the verification step was making an unauthenticated request.

Changes:
- Authenticate with demo/demo credentials before checking node count
- Use jq for cleaner JSON parsing instead of grep/cut
- Check total node count from API response instead of regex pattern matching

Related to user report about demo server not updating to 4.26.3.
The demo server was actually updated successfully, but the workflow
marked itself as failed due to the verification check failing.
2025-11-06 17:44:46 +00:00
rcourtman
ead325942e Add bootstrap token display to install.sh completion message
Enhances discoverability for non-Docker installations (bare metal, LXC)
by displaying the bootstrap token prominently at the end of install.sh.

Changes:
- Add ASCII box display matching Docker startup format
- Show token value and file location
- Include usage instructions for first-time setup
- Only display if .bootstrap_token file exists
- Note that the token file is auto-removed after setup, matching actual behavior

With this change, bootstrap token is now prominently displayed across
all installation methods:
- Docker: startup logs (commit 731eb586)
- Bare metal/LXC: install.sh completion (this commit)
- CLI: pulse bootstrap-token command (commit 731eb586)

Related to #645
2025-11-06 17:35:28 +00:00
rcourtman
a1dc451ed4 Document alert reliability features and DLQ API
Add comprehensive documentation for new alert system reliability features:

**API Documentation (docs/API.md):**
- Dead Letter Queue (DLQ) API endpoints
  - GET /api/notifications/dlq - Retrieve failed notifications
  - GET /api/notifications/queue/stats - Queue statistics
  - POST /api/notifications/dlq/retry - Retry DLQ items
  - POST /api/notifications/dlq/delete - Delete DLQ items
- Prometheus metrics endpoint documentation
  - 18 metrics covering alerts, notifications, and queue health
  - Example Prometheus configuration
  - Example PromQL queries for common monitoring scenarios

**Configuration Documentation (docs/CONFIGURATION.md):**
- Alert TTL configuration
  - maxAlertAgeDays, maxAcknowledgedAgeDays, autoAcknowledgeAfterHours
- Flapping detection configuration
  - flappingEnabled, flappingWindowSeconds, flappingThreshold, flappingCooldownMinutes
- Usage examples and common scenarios
- Best practices for preventing notification storms

All new features are fully documented with examples and default values.
2025-11-06 17:34:05 +00:00
rcourtman
dd1d222ad0 Improve bootstrap token UX for easier discovery
The bootstrap token security requirement was added proactively but
lacked discoverability, causing user friction during first-run setup.
These improvements make the token easier to find while maintaining
the security benefit.

Improvements:
- Display bootstrap token prominently in startup logs with ASCII box
  (previously: single line log message)
- Add `pulse bootstrap-token` CLI command to display token on demand
  (Docker: docker exec <container> /app/pulse bootstrap-token)
- Improve error messages in quick-setup API to show exact commands
  for retrieving token when missing or invalid
- Error messages now include both Docker and bare metal examples

User experience improvements:
- Token visible in `docker logs` output immediately
- Clear instructions printed with token
- Helpful error messages if token is wrong/missing
- CLI helper for operators who need to retrieve token later

Security unchanged:
- Bootstrap token still required for first-run setup
- Token still auto-deleted after successful setup
- No bypass mechanism added

Related to discussion about bootstrap token UX friction.
2025-11-06 17:29:49 +00:00
rcourtman
80acc5ae72 chore: bump version to 4.26.3 2025-11-06 16:56:19 +00:00
rcourtman
f9ca2c0e68 Add hashpw utility for generating password hashes
Simple CLI utility to generate bcrypt password hashes for admin users.

Usage: hashpw <password>

This utility helps administrators generate properly hashed passwords
for use in configuration files or manual user setup.
2025-11-06 16:46:56 +00:00
rcourtman
c8e0281953 Add comprehensive alert system reliability improvements
This commit implements critical reliability features to prevent data loss
and improve alert system robustness:

**Persistent Notification Queue:**
- SQLite-backed queue with WAL journaling for crash recovery
- Dead Letter Queue (DLQ) for notifications that exhaust retries
- Exponential backoff retry logic (100ms → 200ms → 400ms)
- Full audit trail for all notification delivery attempts
- New file: internal/notifications/queue.go (661 lines)

**DLQ Management API:**
- GET /api/notifications/dlq - Retrieve DLQ items
- GET /api/notifications/queue/stats - Queue statistics
- POST /api/notifications/dlq/retry - Retry failed notifications
- POST /api/notifications/dlq/delete - Delete DLQ items
- New file: internal/api/notification_queue.go (145 lines)

**Prometheus Metrics:**
- 18 comprehensive metrics for alerts and notifications
- Metric hooks integrated via function pointers to avoid import cycles
- /metrics endpoint exposed for Prometheus scraping
- New file: internal/metrics/alert_metrics.go (193 lines)

**Alert History Reliability:**
- Exponential backoff retry for history saves (3 attempts)
- Automatic backup restoration on write failure
- Modified: internal/alerts/history.go

**Flapping Detection:**
- Detects and suppresses rapidly oscillating alerts
- Configurable window (default: 5 minutes)
- Configurable threshold (default: 5 state changes)
- Configurable cooldown (default: 15 minutes)
- Automatic cleanup of inactive flapping history

**Alert TTL & Auto-Cleanup:**
- MaxAlertAgeDays: Auto-cleanup old alerts (default: 7 days)
- MaxAcknowledgedAgeDays: Faster cleanup for acked alerts (default: 1 day)
- AutoAcknowledgeAfterHours: Auto-ack long-running alerts (default: 24 hours)
- Prevents memory leaks from long-running alerts

**WebSocket Broadcast Sequencer:**
- Channel-based sequencing ensures ordered message delivery
- 100ms coalescing window for rapid state updates
- Prevents race conditions in WebSocket broadcasts
- Modified: internal/websocket/hub.go

**Configuration Fields Added:**
- FlappingEnabled, FlappingWindowSeconds, FlappingThreshold, FlappingCooldownMinutes
- MaxAlertAgeDays, MaxAcknowledgedAgeDays, AutoAcknowledgeAfterHours

All features are production-ready and build successfully.
2025-11-06 16:46:30 +00:00
rcourtman
47748230f4 Fix first-run setup 401 error by adding bootstrap token unlock screen (related to #639)
After the security hardening that introduced bootstrap token protection,
the first-run setup flow was broken because FirstRunSetup.tsx didn't
prompt users for the token. This caused a 401 "Bootstrap setup token
required" error during initial admin account creation.

Changes:
- Add dedicated unlock screen before the setup wizard
- Display instructions for retrieving token from host
- Include bootstrap token in quick-setup API request headers and body
- Only require unlock for first-run setup (skip in force mode)

The unlock screen follows the documented flow in README.md and ensures
only users with host access can configure an unconfigured instance.

Related to #639
2025-11-06 16:45:51 +00:00
rcourtman
20099549c6 Add comprehensive release validation to prevent missing artifacts
Adds automated validation script to prevent the pattern of patch
releases caused by missing files/artifacts.

scripts/validate-release.sh validates all 40+ artifacts including:
- Docker image scripts (8 install/uninstall scripts)
- Docker image binaries (17 across all platforms)
- Release tarballs (5 including universal and macOS)
- Standalone binaries (12+)
- Checksums for all distributable assets
- Version embedding in every binary type
- Tarball contents (binaries + scripts + VERSION)
- Binary architectures and file types

The script catches 100% of issues from the last 3 patch releases
(missing scripts, missing install.sh, missing binaries, broken
version embedding).

Updated RELEASE_CHECKLIST.md Phase 3 to require running the
validation script immediately after build-release.sh and before
proceeding to Docker build/publish phases.

Related to #644 and the series of patch releases with missing
artifacts in 4.26.x.
2025-11-06 16:33:49 +00:00
rcourtman
035d872269 Add missing install/uninstall scripts to Docker image and release builds (related to #644)
The Dockerfile and build-release.sh were missing several installer and uninstaller
scripts that the router expects to serve via HTTP endpoints:
- install-container-agent.sh
- install-host-agent.ps1
- uninstall-host-agent.sh
- uninstall-host-agent.ps1

This caused 404 errors when users attempted to add Docker/Podman hosts or use the
PowerShell installer, as reported in #644.

Changes:
- Dockerfile: Added missing scripts to /opt/pulse/scripts/ with proper permissions
- build-release.sh: Added missing scripts to both per-platform and universal tarballs
  to ensure bare-metal deployments serve the same endpoints as Docker deployments
2025-11-06 16:01:40 +00:00
rcourtman
40abcd1237 Fix empty space below backup chart by matching container and SVG heights
The chart container was set to min-h-[12rem] (192px) on desktop while the SVG
was hardcoded to 128px, creating 64px of unwanted empty space. Changed container
to fixed h-32 (128px) to match the SVG height.
2025-11-06 15:34:20 +00:00
rcourtman
615cb129df Fix checksum verification failure in install.sh (related to #642)
The .sha256 files generated during release builds contained only the hash,
but sha256sum -c expects the format "hash  filename". This caused all
install.sh updates to fail with "Checksum verification failed" even when
the checksum was correct.

Root cause: build-release.sh line 289 was using awk to extract only field 1
(the hash), discarding the filename that sha256sum -c needs.

Fix: Remove the awk filter to preserve the full sha256sum output format.

This affected the demo server update workflow and user installations.
2025-11-06 15:28:05 +00:00
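The format `sha256sum -c` requires can be demonstrated in a few lines of Go: the hex digest, two spaces, then the filename. Emitting only the hash — what the old awk filter left behind — fails verification even when the hash itself is correct. This is an illustration of the file format, not build-release.sh's code.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// writeSHA256Line produces one line in the format sha256sum -c expects:
// "<64 hex chars><two spaces><filename>\n".
func writeSHA256Line(data []byte, filename string) string {
	sum := sha256.Sum256(data)
	return fmt.Sprintf("%x  %s\n", sum, filename)
}

func main() {
	fmt.Print(writeSHA256Line([]byte("hello\n"), "pulse-v4.26.2-linux-amd64.tar.gz"))
}
```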
rcourtman
a8fa834d24 Fix critical truncation bug preventing data readability on touch devices (related to #643)
Removed CSS truncate from key identifier columns (container names, service names,
guest names, host names, image names) that were making data inaccessible on mobile/
touch devices where title tooltips don't work.

Users can now read full identifiers via horizontal scroll (already implemented via
ScrollableTable component). Data should always be readable without requiring additional
UI affordances.

Changed files:
- DockerUnifiedTable: Remove truncate from container/service names and images
- GuestRow: Remove truncate from guest names
- HostsOverview: Remove truncate from host display names and hostnames

Column resizing remains on backlog as optional enhancement; users should not need
a drag handle just to read the contents.
2025-11-06 15:00:36 +00:00
rcourtman
57e2f9428e chore: bump version to 4.26.2 2025-11-06 14:33:08 +00:00
rcourtman
becda56897 Fix critical rollback download URL bug and doc inconsistencies
Issues found during systematic audit after #642:

1. CRITICAL BUG - Rollback downloads were completely broken:
   - Code constructed: pulse-linux-amd64 (no version, no .tar.gz)
   - Actual asset name: pulse-v4.26.1-linux-amd64.tar.gz
   - This would cause 404 errors on all rollback attempts
   - Fixed: Construct correct tarball URL with version
   - Added: Extract tarball after download to get binary

2. TEMPERATURE_MONITORING.md referenced non-existent v4.27.0:
   - Changed to use /latest/download/ for future-proof docs

3. API.md example had wrong filename format:
   - Changed pulse-linux-amd64.tar.gz to pulse-v4.30.0-linux-amd64.tar.gz
   - Ensures example matches actual release asset naming

The rollback bug would have affected any user attempting to roll back
to a previous version via the UI or API.
2025-11-06 14:25:32 +00:00
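The corrected URL construction boils down to embedding the version and tarball suffix in the asset name. A sketch, with OWNER/REPO as placeholders (the repository path is not given in this log):

```go
package main

import "fmt"

// rollbackAssetURL builds the versioned tarball URL. The broken code
// produced a bare "pulse-linux-amd64" (no version, no .tar.gz), which
// matched no release asset and 404'd every rollback.
func rollbackAssetURL(version, goos, goarch string) string {
	return fmt.Sprintf(
		"https://github.com/OWNER/REPO/releases/download/v%s/pulse-v%s-%s-%s.tar.gz",
		version, version, goos, goarch)
}

func main() {
	// Matches the actual asset naming cited in the commit:
	// pulse-v4.26.1-linux-amd64.tar.gz
	fmt.Println(rollbackAssetURL("4.26.1", "linux", "amd64"))
}
```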
rcourtman
fd3a72606f Add standalone host-agent binaries to releases
Issue: HOST_AGENT.md documented downloading pulse-host-agent binaries
from GitHub releases, but those assets didn't exist. Only tarballs were
available, making manual installation unnecessarily complex.

Changes:
- Copy standalone host-agent binaries (all architectures) to release/
  directory alongside sensor-proxy binaries
- Include host-agent binaries in checksum generation
- Update HOST_AGENT.md to clarify available architectures
- Retroactively uploaded missing binaries to v4.26.1

This enables air-gapped and manual installations without requiring an
already-running Pulse server to download from.
2025-11-06 14:20:59 +00:00
rcourtman
e4378602c1 Fix install.sh missing from GitHub releases (addresses #642)
Root cause: install.sh was not being copied to the release directory
during build-release.sh execution, so it was never uploaded as a
release asset. This caused the download URL to return "Not Found",
which bash attempted to execute as a command.

Changes:
- Copy install.sh to release/ directory in build-release.sh
- Include install.sh in checksums generation

Note: RELEASE_CHECKLIST.md also updated locally to verify install.sh
presence in Phase 3 and Phase 5, but that file is gitignored.
2025-11-06 14:10:46 +00:00
rcourtman
fa3b0db243 Improve static asset caching for hashed files
Hashed static assets (e.g., index-BXHytNQV.js, index-TvhSzimt.css) are
now cached for 1 year with immutable flag since content hash changes
when files change.

Benefits:
- Faster page loads on subsequent visits
- Reduced server bandwidth
- Better user experience on demo and production instances

Only index.html and non-hashed assets remain uncached to ensure
users always get the latest version.
2025-11-06 13:54:26 +00:00
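The caching policy splits on whether the filename carries a content hash. A sketch — the regex for Vite-style hashed names is an assumption, not Pulse's actual pattern:

```go
package main

import (
	"fmt"
	"regexp"
)

// hashedAsset matches names like index-BXHytNQV.js / index-TvhSzimt.css.
var hashedAsset = regexp.MustCompile(`-[A-Za-z0-9_-]{8}\.(js|css)$`)

// cacheControlFor returns the Cache-Control value for a request path:
// hashed assets are immutable for a year (the hash changes when content
// does), while index.html and friends stay uncached so users always get
// the latest version.
func cacheControlFor(path string) string {
	if hashedAsset.MatchString(path) {
		return "public, max-age=31536000, immutable"
	}
	return "no-cache"
}

func main() {
	for _, p := range []string{"/assets/index-BXHytNQV.js", "/index.html"} {
		fmt.Println(p, "->", cacheControlFor(p))
	}
}
```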
rcourtman
a9d2209edd Fix demo mode to allow authentication endpoints
Demo mode now permits login/logout and OIDC authentication endpoints
while still blocking all modification requests. This allows demo
instances to require authentication while remaining read-only.

Authentication endpoints are read-only operations that verify
credentials and issue session tokens without modifying any state.
All POST/PUT/DELETE/PATCH operations remain blocked.
2025-11-06 13:48:28 +00:00
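The demo-mode gate can be sketched as a method/path allowlist; the endpoint paths below are assumptions, not Pulse's exact routes. Authentication POSTs pass because they only verify credentials and issue a session token; every other mutating method stays blocked.

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
)

// demoAllowed reports whether a request is permitted in demo mode:
// reads always, auth endpoints even for POST, everything else read-only.
func demoAllowed(method, path string) bool {
	if method == http.MethodGet || method == http.MethodHead {
		return true
	}
	for _, p := range []string{"/api/login", "/api/logout", "/api/oidc/"} {
		if strings.HasPrefix(path, p) {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(demoAllowed("POST", "/api/login"))  // auth: allowed
	fmt.Println(demoAllowed("POST", "/api/config")) // mutation: blocked
}
```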
rcourtman
aea9586145 Update demo credentials to demo/demo 2025-11-06 13:46:57 +00:00
rcourtman
1340ad5f77 docs: Add demo login credentials to README
The demo server at demo.pulserelay.pro now requires authentication.
Login credentials are demo / changeme.
2025-11-06 13:43:11 +00:00
rcourtman
497bdb625e Fix version embedding in Docker builds
The Docker build was only setting internal/dockeragent.Version but not
main.Version, causing the pulse binary to show "dev" instead of the
actual version. Now matches build-release.sh ldflags pattern.

Related to v4.26.1 release
2025-11-06 12:47:02 +00:00
rcourtman
6192e166f2 chore: prepare release v4.26.1 2025-11-06 12:13:56 +00:00
rcourtman
fdcec85931 Fix critical version embedding issues for 4.26 release
Addresses the root cause of issue #631 (infinite Docker agent restart loop)
and prevents similar issues with host-agent and sensor-proxy.

Changes:
- Set dockeragent.Version default to "dev" instead of hardcoded version
- Add version embedding to server build in Dockerfile
- Add version embedding to host-agent builds (all platforms)
- Add version embedding to sensor-proxy builds (all platforms)

This ensures:
1. Server's /api/agent/version endpoint returns correct v4.26.0
2. Downloaded agent binaries have matching embedded versions
3. Dev builds skip auto-update (Version="dev")
4. No version mismatch triggers infinite restart loops

Related to #631
2025-11-06 11:42:52 +00:00
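The "dev" default plus link-time override pattern looks like this (the package path in the comment is illustrative, not Pulse's actual layout):

```go
package main

import "fmt"

// Version defaults to "dev" so locally-built binaries skip auto-update;
// release builds override it at link time, e.g.:
//
//	go build -ldflags "-X main.Version=4.26.1" ./cmd/pulse
//
// Forgetting the -X flag (as the Docker build did) leaves the binary
// reporting "dev" and breaks version-match checks.
var Version = "dev"

func main() {
	fmt.Println("pulse", Version)
}
```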
rcourtman
c638a8c28c Fix checksum verification failure during installation
Related to #639

Users reported "Failed to download checksum for Pulse release" errors
during installation. The root cause was a mismatch between what the
build system generates and what the installer expects:

- install.sh downloads individual .sha256 files (e.g., pulse-v4.25.0-linux-amd64.tar.gz.sha256)
- build-release.sh only created a single checksums.txt file

This commit updates build-release.sh to generate both:
1. Individual .sha256 files for each asset (required by install.sh)
2. Combined checksums.txt for manual verification and signing

This maintains backwards compatibility with the installer while keeping
the aggregated checksums.txt for power users and GPG signing.
2025-11-06 11:21:49 +00:00
rcourtman
20854256c3 Fix VM migration issue where custom alert thresholds are lost
Resolves #641

## Problem
When a VM migrates between Proxmox nodes, Pulse was treating it as a new
resource and discarding custom alert threshold overrides. This occurred
because guest IDs included the node name (e.g., `instance-node-VMID`),
causing the ID to change when the VM moved to a different node.

Users reported that after migrating a VM, previously disabled alerts
(e.g., memory threshold set to 0) would resume firing.

## Root Cause
Guest IDs were constructed as:
- Standalone: `node-VMID`
- Cluster: `instance-node-VMID`

When a VM migrated from node1 to node2, the ID changed from
`instance-node1-100` to `instance-node2-100`, causing:
- Alert threshold overrides to be orphaned (keyed by old ID)
- Guest metadata (custom URLs, descriptions) to be orphaned
- Active alerts to reference the wrong resource ID

## Solution
Changed guest ID format to be stable across node migrations:
- New format: `instance-VMID` (for both standalone and cluster)
- Retains uniqueness across instances while being node-independent
- Allows VMs to migrate freely without losing configuration

## Implementation

### Backend Changes
1. **Guest ID Construction** (`monitor_polling.go`):
   - Simplified to always use `instance-VMID` format
   - Removed node from the ID construction logic

2. **Alert Override Migration** (`alerts.go`):
   - Added lazy migration in `getGuestThresholds()`
   - Detects legacy ID formats and migrates to new format
   - Preserves user configurations automatically

3. **Guest Metadata Migration** (`guest_metadata.go`):
   - Added `GetWithLegacyMigration()` helper method
   - Called during VM/container polling to migrate metadata
   - Preserves custom URLs and descriptions

4. **Active Alerts Migration** (`alerts.go`):
   - Added migration logic in `LoadActiveAlerts()`
   - Translates legacy alert resource IDs to new format
   - Preserves alert acknowledgments across restarts

### Frontend Changes
5. **ID Construction Updates**:
   - `ThresholdsTable.tsx`: Updated fallback from `instance-node-vmid` to `instance-vmid`
   - `Dashboard.tsx`: Simplified guest ID construction
   - `GuestRow.tsx`: Updated `buildGuestId()` helper

## Migration Strategy
- **Lazy Migration**: Configs are migrated as guests are discovered
- **Backwards Compatible**: Old IDs are detected and automatically converted
- **Zero Downtime**: No manual intervention required
- **Persisted**: Migrated configs are saved on next config write cycle

## Testing Recommendations
After deployment:
1. Verify existing alert overrides still apply
2. Test VM migration - confirm thresholds persist
3. Check guest metadata (custom URLs) survive migration
4. Verify active alerts maintain acknowledgment state

## Related
- Addresses similar issues with guest metadata and active alert tracking
- Lays groundwork for any future guest-specific configuration features
- Aligns with project philosophy: correctness and UX over implementation complexity
2025-11-06 10:27:15 +00:00
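The lazy-migration lookup described in the implementation section can be sketched as a two-key fallback; the helper names and the map-of-strings store here are illustrations, not Pulse's actual types.

```go
package main

import "fmt"

// guestID builds the new node-independent ID: instance-VMID.
func guestID(instance string, vmid int) string {
	return fmt.Sprintf("%s-%d", instance, vmid)
}

// legacyGuestID builds the pre-fix ID that embedded the node name.
func legacyGuestID(instance, node string, vmid int) string {
	return fmt.Sprintf("%s-%s-%d", instance, node, vmid)
}

// lookupThresholds sketches the lazy migration in getGuestThresholds():
// if nothing is stored under the new ID, fall back to the legacy ID and
// move the entry so future lookups (and the next config write) use the
// migration-stable key.
func lookupThresholds(overrides map[string]string, instance, node string, vmid int) (string, bool) {
	newID := guestID(instance, vmid)
	if v, ok := overrides[newID]; ok {
		return v, true
	}
	oldID := legacyGuestID(instance, node, vmid)
	if v, ok := overrides[oldID]; ok {
		overrides[newID] = v // migrate in place
		delete(overrides, oldID)
		return v, true
	}
	return "", false
}

func main() {
	overrides := map[string]string{"pve-node1-100": "memory:0"}
	v, _ := lookupThresholds(overrides, "pve", "node1", 100)
	fmt.Println(v, "now stored as:", overrides["pve-100"])
}
```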
rcourtman
dfe960deb4 Fix container SSH detection and improve troubleshooting for issue #617
Related to #617

This fixes a misconfiguration scenario where Docker containers could
attempt direct SSH connections (producing [preauth] log spam) instead
of using the sensor proxy.

Changes:
- Fix container detection to check PULSE_DOCKER=true in addition to
  system.InContainer() heuristics (both temperature.go and config_handlers.go)
- Upgrade temperature collection log from Error to Warn with actionable
  guidance about mounting the proxy socket
- Add Info log when dev mode override is active so operators understand
  the security posture
- Add troubleshooting section to docs for SSH [preauth] logs from containers

The container detection was inconsistent - monitor.go checked both flags
but temperature.go and config_handlers.go only checked InContainer().
Now all locations consistently check PULSE_DOCKER || InContainer().
2025-11-06 09:57:53 +00:00
rcourtman
12dc8693c4 Add NVIDIA GPU temperature monitoring support (nouveau driver)
- Add nouveau chip recognition to temperature parser
- Implement parseNouveauGPUTemps() for NVIDIA GPU temps via nouveau driver
- Map "GPU core" sensor to edge temperature field
- Supports systems using open-source nouveau driver

This complements the AMD GPU support added previously. Systems using
the nouveau driver will now see NVIDIA GPU temperatures in the
dashboard. For proprietary nvidia driver users, GPU temps are not
available via lm-sensors and would require nvidia-smi integration.
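
A simplified sketch of the "GPU core" mapping: scan lm-sensors style output for the nouveau label and return it as the edge temperature. The real parser handles more chips, labels, and units than this:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseNouveauGPUTemp looks for a nouveau "GPU core:" line in sensors output
// and maps its value to the edge temperature field. Sketch only; the actual
// parseNouveauGPUTemps in Pulse differs in detail.
func parseNouveauGPUTemp(output string) (edge float64, ok bool) {
	for _, line := range strings.Split(output, "\n") {
		fields := strings.Fields(line)
		if len(fields) < 3 || fields[0] != "GPU" || fields[1] != "core:" {
			continue
		}
		// Values look like "+45.0°C"; strip the sign and unit.
		val := strings.TrimSuffix(strings.TrimPrefix(fields[2], "+"), "°C")
		if t, err := strconv.ParseFloat(val, 64); err == nil {
			return t, true
		}
	}
	return 0, false
}

func main() {
	out := "nouveau-pci-0100\nAdapter: PCI adapter\nGPU core:     +45.0°C  (high = +95.0°C)\n"
	fmt.Println(parseNouveauGPUTemp(out))
}
```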
2025-11-06 00:24:42 +00:00
rcourtman
d62259ffa7 Add AMD GPU temperature monitoring support
Related to #600

- Add GPU field to Temperature model with edge, junction, and mem sensors
- Add amdgpu chip recognition to temperature parser
- Implement parseGPUTemps() to extract AMD GPU temperature data
- Update frontend TypeScript types to include GPU temperatures
- Display GPU temps in node table tooltip alongside CPU temps
- Set hasGPU flag when GPU data is available

This enables temperature monitoring for AMD GPUs (amdgpu sensors)
that was previously being collected via SSH but silently discarded
during parsing.
2025-11-06 00:19:04 +00:00
rcourtman
5b89b2371a Make pulse-sensor-proxy resilient to read-only filesystems
Related to #637

The sensor-proxy was failing to start on systems with read-only filesystems
because audit logging required a writable /var/log/pulse/sensor-proxy directory.

Changes:
- Modified newAuditLogger() to automatically fall back to stderr (systemd journal)
  if the audit log file cannot be opened
- Removed error return from newAuditLogger() since it now always succeeds
- Added warning logs when fallback mode is used to alert operators
- Updated tests to handle the new signature
- Added better debugging to audit log tests

This allows the sensor-proxy to run on:
- Immutable/read-only root filesystems
- Hardened systems with restricted /var mounts
- Containerized environments with limited write access

Audit events are still captured via systemd journal when file logging is
unavailable, maintaining the security audit trail.
2025-11-06 00:18:51 +00:00
rcourtman
af55362009 Fix inflated RAM usage reporting for LXC containers
Related to #553

## Problem

LXC containers showed inflated memory usage (e.g., 90%+ when actual usage was 50-60%,
96% when actual was 61%) because the code used the raw `mem` value from Proxmox's
`/cluster/resources` API endpoint. This value comes from cgroup `memory.current` which
includes reclaimable cache and buffers, making memory appear nearly full even when
plenty is available.

## Root Cause

- **Nodes**: Had sophisticated cache-aware memory calculation with RRD fallbacks
- **VMs (qemu)**: Had detailed memory calculation using guest agent meminfo
- **LXCs**: Naively used `res.Mem` directly without any cache-aware correction

The Proxmox cluster resources API's `mem` field for LXCs includes cache/buffers
(from cgroup memory accounting), which should be excluded for accurate "used" memory.

## Solution

Implement cache-aware memory calculation for LXC containers by:

1. Adding `GetLXCRRDData()` method to fetch RRD metrics for LXC containers from
   `/nodes/{node}/lxc/{vmid}/rrddata`
2. Using RRD `memavailable` to calculate actual used memory (total - available)
3. Falling back to RRD `memused` if `memavailable` is not available
4. Only using cluster resources `mem` value as last resort

This matches the approach already used for nodes and VMs, providing consistent
cache-aware memory reporting across all resource types.

## Changes

- Added `GuestRRDPoint` type and `GetLXCRRDData()` method to pkg/proxmox
- Added `GetLXCRRDData()` to ClusterClient for cluster-aware operations
- Modified LXC memory calculation in `pollPVEInstance()` to use RRD data when available
- Added guest memory snapshot recording for LXC containers
- Updated test stubs to implement the new interface method

## Testing

- Code compiles successfully
- Follows the same proven pattern used for nodes and VMs
- Includes diagnostic snapshot recording for troubleshooting
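
The three-step fallback chain can be sketched as follows (values in bytes; a nil pointer stands in for an absent RRD field; names are illustrative, not the exact Pulse code):

```go
package main

import "fmt"

// lxcUsedMemory implements the priority order described above:
// RRD memavailable, then RRD memused, then the raw cluster resources value.
func lxcUsedMemory(total uint64, rrdAvailable, rrdUsed *uint64, clusterMem uint64) uint64 {
	switch {
	case rrdAvailable != nil && *rrdAvailable <= total:
		// Preferred: cache-aware "used" = total - available.
		return total - *rrdAvailable
	case rrdUsed != nil:
		return *rrdUsed
	default:
		// Last resort: raw cgroup value, which includes reclaimable cache.
		return clusterMem
	}
}

func main() {
	avail := uint64(2 << 30) // RRD says 2 GiB is available
	total := uint64(4 << 30)
	// Raw cluster mem (3 GiB) would report 75%; cache-aware used is 2 GiB.
	fmt.Println(lxcUsedMemory(total, &avail, nil, 3<<30))
}
```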
2025-11-06 00:16:18 +00:00
rcourtman
88ad986877 Revert "Hide Settings tab when authentication is not configured"
This reverts commit d5a1e3d07729bad61743e8645a636e2545e11038.
2025-11-05 23:21:34 +00:00
rcourtman
7936808193 Add custom display name support for Docker hosts
This lets users assign custom display names to Docker hosts, mirroring the
existing functionality for Proxmox nodes. It addresses the issue where
multiple Docker hosts with identical hostnames but different IPs/domains
cannot be easily distinguished in the UI.

Backend changes:
- Add CustomDisplayName field to DockerHost model (internal/models/models.go:201)
- Update UpsertDockerHost to preserve custom display names across updates (internal/models/models.go:1110-1113)
- Add SetDockerHostCustomDisplayName method to State for updating names (internal/models/models.go:1221-1235)
- Add SetDockerHostCustomDisplayName method to Monitor (internal/monitoring/monitor.go:1070-1088)
- Add HandleSetCustomDisplayName API handler (internal/api/docker_agents.go:385-426)
- Route /api/agents/docker/hosts/{id}/display-name PUT requests (internal/api/docker_agents.go:117-120)

Frontend changes:
- Add customDisplayName field to DockerHost TypeScript interface (frontend-modern/src/types/api.ts:136)
- Add MonitoringAPI.setDockerHostDisplayName method (frontend-modern/src/api/monitoring.ts:151-187)
- Update getDisplayName function to prioritize custom names (frontend-modern/src/components/Settings/DockerAgents.tsx:84-89)
- Add inline editing UI with save/cancel buttons in Docker Agents settings (frontend-modern/src/components/Settings/DockerAgents.tsx:1349-1413)
- Update sorting to use custom display names (frontend-modern/src/components/Docker/DockerHosts.tsx:58-59)
- Update DockerHostSummaryTable to display custom names (frontend-modern/src/components/Docker/DockerHostSummaryTable.tsx:40-42, 87, 120, 254)

Users can now click the edit icon next to any Docker host name in Settings > Docker Agents
to set a custom display name. The custom name will be preserved across agent reconnections
and takes priority over the hostname reported by the agent.

Related to #623
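
The preservation logic in UpsertDockerHost has roughly this shape (a sketch with simplified types; the real State method holds more fields and locking):

```go
package main

import "fmt"

// DockerHost is a pared-down stand-in for the real model.
type DockerHost struct {
	ID                string
	Hostname          string
	CustomDisplayName string
}

// upsertDockerHost replaces the stored record with the incoming agent report,
// except the user-set CustomDisplayName, which is carried over so it survives
// agent reconnections.
func upsertDockerHost(existing map[string]DockerHost, incoming DockerHost) DockerHost {
	if prev, ok := existing[incoming.ID]; ok && incoming.CustomDisplayName == "" {
		incoming.CustomDisplayName = prev.CustomDisplayName
	}
	existing[incoming.ID] = incoming
	return incoming
}

func main() {
	hosts := map[string]DockerHost{}
	upsertDockerHost(hosts, DockerHost{ID: "h1", Hostname: "docker", CustomDisplayName: "Backup NAS"})
	// The agent reconnects and reports only its hostname; the custom name survives.
	h := upsertDockerHost(hosts, DockerHost{ID: "h1", Hostname: "docker"})
	fmt.Println(h.CustomDisplayName)
}
```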
2025-11-05 23:18:03 +00:00
rcourtman
0647a76c55 Fix temperature monitoring SSH key availability in containerized setup flow
Addresses issue #635 where users encounter "can't find the SSH key" errors
when enabling temperature monitoring during automated PVE setup with Pulse
running in Docker.

Root cause:
- Setup script embeds SSH keys at generation time (when downloaded)
- For containerized Pulse, keys are empty until pulse-sensor-proxy is installed
- Script auto-installs proxy, but didn't refresh keys after installation
- This caused temperature monitoring setup to fail with confusing errors

Changes:
1. After successful proxy installation, immediately fetch and populate the
   proxy's SSH public key (lines 4068-4080)
2. Update bash variables SSH_SENSORS_PUBLIC_KEY and SSH_SENSORS_KEY_ENTRY
   so temperature monitoring setup can proceed in the same script run
3. Improve error messaging when keys aren't available (lines 4424-4453):
   - Clear explanation of containerized Pulse requirements
   - Step-by-step instructions for container restart and verification
   - Separate guidance for bare-metal vs containerized deployments

Flow improvements:
- Initial run: Proxy installs → keys fetched → temp monitoring configures
- Rerun after container restart: Keys fetched at script start → works
- Both scenarios now handled correctly

Related to #635
2025-11-05 23:11:45 +00:00
rcourtman
3d1c910daa Hide Settings tab when authentication is not configured
Related to #636

When authentication is not configured (hasAuth() returns false), the
Settings tab is now automatically hidden from the web interface. This
provides a cleaner monitoring-only view for unauthenticated deployments
where users only need to check the health of their environment.

The Settings icon beside the Alerts tab will only appear when
authentication is properly configured via PULSE_AUTH_USER/PASS,
API tokens, proxy auth, or OIDC.

Changes:
- Modified utilityTabs in App.tsx to conditionally include Settings
  based on hasAuth() signal
- Updated CONFIGURATION.md to document this UI behavior
2025-11-05 23:10:20 +00:00
rcourtman
8ca31003a0 docs: document TLS certificate file permissions for HTTPS setup
Add comprehensive documentation for HTTPS/TLS configuration including:
- File ownership and permission requirements (pulse user)
- Common troubleshooting steps for startup failures
- Complete setup examples for systemd and Docker
- Validation commands for certificate/key verification

Related to discussion #634
2025-11-05 23:08:02 +00:00
rcourtman
d28cfed3c7 Improve temperature monitoring setup messaging for containerized deployments
When Pulse is running in a container and the SSH key is not available,
provide clearer guidance about the pulse-sensor-proxy requirement and
include documentation link for Docker deployments.

This helps users understand that containerized Pulse needs the host-side
sensor proxy to access temperature data from Proxmox hosts.
2025-11-05 23:05:47 +00:00