Commit graph

1163 commits

Author SHA1 Message Date
rcourtman
0dbd39d671 Add memory balloon/swap badges 2025-10-02 12:44:56 +00:00
rcourtman
6de5684d38 Trim memory tooltip to extras only 2025-10-02 12:43:53 +00:00
rcourtman
ac0b23021f Add memory tooltip detail 2025-10-02 12:42:17 +00:00
rcourtman
3949dfdd52 Remove memory extras row per design 2025-10-02 12:39:46 +00:00
rcourtman
fc08dbf61c Mock memory swap and varied balloon 2025-10-02 12:38:36 +00:00
rcourtman
55375e0b2a Restore mock per-disk data 2025-10-02 12:34:31 +00:00
rcourtman
532cca296f Tidy memory column extras 2025-10-02 12:28:01 +00:00
rcourtman
1192d51416 Expose guest agent network info and extended memory stats 2025-10-02 12:26:32 +00:00
rcourtman
655842ba9d Fix disabled alert webhook delivery 2025-10-02 12:09:46 +00:00
rcourtman
d9079f7bbb Refine dashboard disk column layout 2025-10-02 12:08:19 +00:00
rcourtman
9e31d68207 Fix wearout parsing for Proxmox disks (fixes #449) 2025-10-02 11:57:06 +00:00
rcourtman
5a2fb939de Handle non-numeric disk RPM values 2025-10-02 11:42:08 +00:00
rcourtman
cf365b5f80 fix: add default TimeThresholds to prevent hair-trigger alerts (fixes #491)
Root cause: NewManager() was missing TimeThresholds initialization, causing
all alert types to use 0-second delay. This meant alerts fired immediately
on the first sample exceeding threshold, with no debouncing.

Impact: LXC containers with brief CPU spikes to ~100% (normal for single-core
saturation) triggered constant alerts instead of only alerting on sustained
high CPU usage.

Fix: Add default TimeThresholds:
- guest: 10s delay (prevents alerts from brief CPU spikes)
- node: 15s delay
- storage: 30s delay
- pbs: 30s delay

This ensures a metric must stay above its threshold for the configured duration
before an alert fires, preventing noise from momentary spikes.

Fixes #491
2025-10-02 11:31:56 +00:00
rcourtman
ad10d43542 refactor: create reusable NodeGroupHeader component and improve styling
- Create shared NodeGroupHeader component to eliminate code duplication
- Replace vertical line indicator with circular dot matching guest rows
- Update online indicator to use bg-green-500 (matching guest indicators)
- Reduce node row padding from py-2 to py-1 for more compact layout
- Set background to dark:bg-gray-900 to match search bar styling
- Apply changes consistently across Dashboard and Storage tabs
2025-10-02 08:29:29 +00:00
rcourtman
e15f54f851 Polish node row styling and restore disk detail support 2025-10-01 21:33:59 +00:00
rcourtman
bd658c33fa feat: add powered-off alert toggle for guests
Addresses #485 - adds UI controls for disabling powered-off alerts on a per-guest basis.

Changes:
- Add "Alert Powered-Off" / "No Powered-Off" toggle button for VMs/LXCs
- Extend toggleNodeConnectivity() to handle guests in addition to nodes/PBS
- Add disableConnectivity field to guest resource mapping
- Update hasOverride logic to track connectivity state

Previously, users could only disable ALL alerts for a guest or none.
Now they can independently control resource metric alerts vs powered-off alerts,
matching the functionality already available for nodes and PBS servers.

User impact:
- Enabled + Alert Powered-Off: All alerts including power state (default)
- Enabled + No Powered-Off: Only resource alerts, ignore power state
- Disabled: No alerts at all

Backend already supports this via DisableConnectivity flag.
2025-10-01 20:56:44 +00:00
rcourtman
346cb19da6 feat(frontend): make node summary table sortable 2025-10-01 20:43:28 +00:00
rcourtman
d3c9313e3e Refine dashboard node header styling 2025-10-01 20:32:33 +00:00
rcourtman
3f1fd2b36e chore: bump version to v4.18.0 2025-10-01 19:24:20 +00:00
rcourtman
ce4c784769 feat: add powered-off VM/container alerting
Implements alerts for powered-off VMs and containers as requested in GitHub discussion #487.

Changes:
- Modified CheckGuest to generate "powered-off" alerts for stopped guests
- Added checkGuestPoweredOff() and clearGuestPoweredOffAlert() functions
- Uses 2-poll confirmation (~10 seconds) to prevent false positives
- Alert level is Warning (not Critical) by default
- Alerts are automatically cleared when guest returns to running state
- Respects existing disableConnectivity flag for per-guest configuration
- Clears only metric alerts for non-running guests, preserves powered-off alerts
- Updated DisableConnectivity comment to include powered-off alerts

Configuration:
- Can be disabled globally or per-guest via alert overrides
- Uses existing disableConnectivity toggle (same as node offline alerts)
- No frontend changes needed - types already support this

Testing:
- Build successful
- Tests pass
- Webhooks and emails will handle new alert type automatically
2025-10-01 19:19:49 +00:00
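The 2-poll confirmation above could look roughly like the following sketch. `PoweredOffTracker` and `Observe` are hypothetical names, and treating only the status string `"stopped"` as powered off is an assumption; the commit does not show the real `checkGuestPoweredOff()` implementation.

```go
package main

import "fmt"

// PoweredOffTracker counts consecutive polls in which a guest was stopped.
// It is an illustrative type, not the project's implementation.
type PoweredOffTracker struct {
	required int            // consecutive stopped polls needed before alerting
	counts   map[string]int // guest ID -> consecutive stopped polls seen
}

func NewPoweredOffTracker(required int) *PoweredOffTracker {
	return &PoweredOffTracker{required: required, counts: make(map[string]int)}
}

// Observe records one poll result and reports whether a powered-off alert
// should be active for the guest after this poll.
func (p *PoweredOffTracker) Observe(guestID, status string, disableConnectivity bool) bool {
	// Assumption: any status other than "stopped" clears the pending state.
	if disableConnectivity || status != "stopped" {
		delete(p.counts, guestID)
		return false
	}
	p.counts[guestID]++
	return p.counts[guestID] >= p.required
}

func main() {
	tracker := NewPoweredOffTracker(2)
	for _, status := range []string{"stopped", "stopped", "running"} {
		fmt.Println(status, tracker.Observe("guest-100", status, false))
	}
}
```

Requiring two consecutive stopped polls means a guest that is briefly restarted between polls never raises an alert.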
rcourtman
49be28bcae Add Pushover webhook custom field handling 2025-10-01 19:09:06 +00:00
rcourtman
c3deb6170e fix: prevent catastrophic data loss from encryption key regeneration
CRITICAL FIX: Prevents nodes.enc configuration from being permanently lost
when decryption fails due to encryption key regeneration or corruption.

Root Cause Analysis:
1. If .encryption.key is deleted/regenerated, existing .enc files become unreadable
2. Previous code would fail to decrypt, try backup (also fails), then return error
3. This left NO nodes.enc file on disk
4. Next startup would see no .enc files and happily generate a new encryption key
5. User's node configuration was permanently lost

Changes Made:

1. **persistence.go (lines 600-645)**: When decryption fails for BOTH main file
   and backup, instead of returning error and leaving no file:
   - Log CRITICAL error with clear message about encryption key issue
   - Move corrupted file to timestamped .corrupted file for forensics
   - Create EMPTY but VALID encrypted nodes.enc file
   - Return empty config so system can start
   - This prevents encryption key regeneration on next startup

2. **crypto.go (lines 93-121)**: Enhanced encryption key generation checks:
   - Now checks for nodes.enc* (including .backup, .corrupted files)
   - Uses glob patterns to find ANY encrypted file remnants
   - Refuses to generate new key if ANY .enc* files exist
   - Provides clear error message listing all found files
   - Forces manual intervention before allowing key regeneration

Benefits:
- System can still start even if decryption fails
- Corrupted files are preserved with timestamps for forensic analysis
- Encryption key cannot be silently regenerated if ANY encrypted data exists
- Clear, prominent error logging helps diagnose the root cause
- User is forced to manually address the issue rather than silently losing data

This should prevent the recurring issue where node configurations mysteriously
disappear, requiring manual reconfiguration through the UI.
2025-10-01 18:52:10 +00:00
rcourtman
b8ec92858b fix: standardize ID formats across mock data and frontend to match production
Mock data was using inconsistent ID formats that didn't match production code.
This caused alert matching and fallback ID generation to fail.

Backend changes (mock generator):
- VMs: Use conditional logic matching production - standalone nodes use "node-vmid",
  clusters use "instance-node-vmid" (generator.go:503-509, 584-590)
- Containers: Same conditional logic as VMs (generator.go:584-590)
- Storage: Always use "instance-node-name" format matching production
  (generator.go:875, 922, 947, 982)
- Shared storage: Use "shared" as node name and correct instance
  (generator.go:1007-1008)

Frontend changes:
- Dashboard.tsx: Guest ID fallback now matches backend conditional logic
  (Dashboard.tsx:964-969)
- Storage.tsx: Storage ID fallback now uses "instance-node-name" format
  (Storage.tsx:603)

Production format (from monitor.go):
- Guest IDs: Standalone uses "node-vmid", cluster uses "instance-node-vmid"
- Storage IDs: Always "instance-node-name"
- Node IDs: Always "instance-node"

This ensures:
1. Alert resourceId matching works correctly
2. Frontend fallbacks (if ever needed) generate correct IDs
3. Mock data accurately represents production behavior
4. Consistent filtering by instance+node works across all resource types
2025-10-01 18:38:42 +00:00
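The ID formats listed in this commit can be summarised in a short sketch. The function names and the `clustered` flag are hypothetical; how production code in monitor.go decides between the standalone and cluster formats is not shown in the commit message.

```go
package main

import "fmt"

// GuestID follows the format described in the commit: standalone nodes use
// "node-vmid", clusters use "instance-node-vmid". The clustered flag is an
// assumption made for this sketch.
func GuestID(instance, node string, vmid int, clustered bool) string {
	if clustered {
		return fmt.Sprintf("%s-%s-%d", instance, node, vmid)
	}
	return fmt.Sprintf("%s-%d", node, vmid)
}

// StorageID is always "instance-node-name".
func StorageID(instance, node, name string) string {
	return fmt.Sprintf("%s-%s-%s", instance, node, name)
}

// NodeID is always "instance-node".
func NodeID(instance, node string) string {
	return fmt.Sprintf("%s-%s", instance, node)
}

func main() {
	fmt.Println(GuestID("mock-cluster", "pve1", 100, true))
	fmt.Println(GuestID("standalone1", "standalone1", 100, false))
	fmt.Println(StorageID("mock-cluster", "pve1", "local"))
	fmt.Println(NodeID("mock-cluster", "pve1"))
}
```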
rcourtman
541cb12d18 fix: correct storage Instance field to match node.Instance in mock data
Previously, storage Instance fields were set to `fmt.Sprintf("pve-%s", node.Name)`,
creating values like "pve-pve1" that didn't match the parent node's Instance field
("mock-cluster"). This caused storage filtering and counting to fail when matching
by instance + node, similar to the backup/snapshot issue fixed earlier.

Changes:
- Set storage.Instance = node.Instance for local storage (generator.go:862)
- Set storage.Instance = node.Instance for local-zfs storage (generator.go:909)
- Set storage.Instance = node.Instance for random storage (generator.go:934)
- PBS storage already correctly used node.Instance (generator.go:969)

This ensures storage counts display correctly on the Storage tab node summary cards
and that filtering by instance + node works consistently across all resource types.

Note: This is part of the broader pattern fix where all resources must match by
both instance AND node name to handle duplicate hostnames across clusters correctly.
2025-10-01 18:30:57 +00:00
rcourtman
1fc905efdd fix: add Instance field to mock backup and snapshot generation
Previously, mock-generated backups and snapshots had empty Instance fields,
causing the backups tab node summary counts to show 0. The frontend filters
backups by both instance and node name (b.instance === node.instance &&
b.node === node.name), but without the Instance field populated, no matches
were found.

Changes:
- Set Instance field on VM backups (generator.go:1030)
- Set Instance field on container backups (generator.go:1068)
- Set Instance field on VM snapshots (generator.go:1323)
- Set Instance field on container snapshots (generator.go:1355)

This ensures node backup counts display correctly across all tabs.
2025-10-01 18:21:09 +00:00
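The instance-plus-node matching rule can be illustrated with a small sketch. The commit describes the filter in the frontend (`b.instance === node.instance && b.node === node.name`); the Go version below mirrors that predicate with hypothetical `Node` and `Backup` types so the effect of an empty Instance field is easy to see.

```go
package main

import "fmt"

// Node and Backup are simplified, hypothetical shapes for this example.
type Node struct {
	Instance string
	Name     string
}

type Backup struct {
	Instance string
	Node     string
	VMID     int
}

// countBackups counts backups that match the node on both instance and name,
// so identically named hosts in different clusters are kept apart.
func countBackups(backups []Backup, node Node) int {
	count := 0
	for _, b := range backups {
		if b.Instance == node.Instance && b.Node == node.Name {
			count++
		}
	}
	return count
}

func main() {
	backups := []Backup{
		{Instance: "cluster-a", Node: "pve1", VMID: 100},
		{Instance: "cluster-b", Node: "pve1", VMID: 100},
		{Instance: "", Node: "pve1", VMID: 101}, // empty Instance never matches
	}
	fmt.Println(countBackups(backups, Node{Instance: "cluster-a", Name: "pve1"}))
}
```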
rcourtman
54a9c8f7d1 fix: complete node.id to instance+name matching across all tabs
Fixed remaining instances where resources were matched using node.id instead
of matching by both instance and node name:
- Dashboard.tsx: VM and container counts in grouped view
- UnifiedNodeSelector.tsx: Backup and snapshot counts

This ensures all tabs (Dashboard, Storage, Backups) correctly count resources
for nodes in mock mode where node.id format differs from instance format.
2025-10-01 18:13:30 +00:00
rcourtman
5100a8f335 fix: improve node matching in summary table with dual field comparison
Match resources by both instance and node name for more robust duplicate
hostname handling in the node summary cards.
2025-10-01 18:10:58 +00:00
rcourtman
dc561f009f feat: improve ESC key reset behavior in dashboard
Make ESC key a complete reset button that clears all active filters:
- First press: Clears search, sorting, node selection, view mode, and status mode
- Second press: Toggles filter section visibility (collapse/expand)

This provides a quick way to reset the entire dashboard view to defaults.
2025-10-01 18:05:50 +00:00
rcourtman
a3a7ef1a56 chore: improve mock data realism for metrics
Adjust mock data generation to produce more realistic resource usage patterns:
- VMs: Lower typical CPU usage (0-25%), mean reversion toward 15%
- Containers: Even lower CPU (0-12%), mean reversion toward 8%
- Memory: More realistic distribution with mean reversion
- Metrics updates: Smaller fluctuations with natural mean reversion
- I/O patterns: Less frequent changes for more stability
2025-10-01 17:24:13 +00:00
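Mean reversion of the kind mentioned here is commonly written as a step toward a target plus a small random term. The sketch below is one such formulation under assumed parameters; `nextValue`, the reversion rate of 0.1, and the noise range are illustrative choices and are not taken from generator.go. Only the 15% target and 0-25% range for VMs come from the commit message.

```go
package main

import (
	"fmt"
	"math"
	"math/rand"
)

// nextValue moves current a fraction rate of the way toward mean, adds the
// supplied noise, and clamps the result to [lo, hi].
func nextValue(current, mean, rate, noise, lo, hi float64) float64 {
	next := current + rate*(mean-current) + noise
	return math.Max(lo, math.Min(hi, next))
}

func main() {
	cpu := 20.0 // starting VM CPU usage in percent
	for i := 0; i < 5; i++ {
		noise := (rand.Float64() - 0.5) * 2 // uniform in [-1, 1)
		cpu = nextValue(cpu, 15, 0.1, noise, 0, 25)
		fmt.Printf("%.1f\n", cpu)
	}
}
```

Taking the noise as a parameter keeps the update rule deterministic and testable; the caller decides how much randomness to inject.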
rcourtman
3253c3bdd3 fix: correct instance vs node.name usage for duplicate hostname support
Fix remaining issues where node.name was used for counting/grouping instead
of instance/node.id, causing incorrect counts with duplicate hostnames.

Changes:
- NodeSummaryTable: Use node.id for counting VMs/containers/storage/disks
- UnifiedNodeSelector: Use node.id for counting backups and snapshots
- DiskList: Display node.name in empty state message instead of node ID

The pattern is now consistent:
- User-facing filtering: use node.name (what users see/search)
- Counting/grouping: use instance/node.id (handles duplicates)
- Display: convert node.id to node.name for readability
2025-10-01 17:14:03 +00:00
rcourtman
62c7aa19d1 fix: restore node row filtering functionality across all tabs
Fix node summary card row selection to properly filter the tables below
by implementing independent selection state instead of search box filters.

Changes:
- Clicking a node row now filters by node name (visual selection state)
- Selection is independent of search, allowing both to work together
- Toggle selection by clicking the same row again
- Clear selection with ESC key
- Fixes filtering in Dashboard, Storage, and Backups tabs

The previous implementation had a mismatch between node IDs and instance
fields. Now using simple node.name matching for reliable filtering.
2025-10-01 17:08:01 +00:00
rcourtman
fe01b72541 Refine search tips popovers 2025-10-01 16:57:43 +00:00
rcourtman
03f823868d feat: refactor search tips into reusable popover component
- Extract SearchTipsPopover as shared component
- Improved visual design with better typography and spacing
- Consistent search help across Dashboard, Backups, and Storage tabs
- Better UX with clickable button instead of hover-only tooltip
2025-10-01 16:44:01 +00:00
rcourtman
abd0b67faa fix: correct node summary counts for VMs, containers, storage, and backups 2025-10-01 16:40:38 +00:00
rcourtman
35c08b9066 fix: remove 'guests' from search placeholder 2025-10-01 16:33:33 +00:00
rcourtman
0e97303431 feat: consistent filter UX across all tabs
- Deselectable radio toggles on all filter tabs
- Blue reset button when filters are active
- Clean search placeholders with help tooltips
- Working tooltips with proper styling on Dashboard tab
- Better placeholder text: "Search or filter guests..."
2025-10-01 16:33:18 +00:00
rcourtman
a236244730 feat: improve dashboard filter toggles UX
- Make radio toggles deselectable by clicking active option
- Reset button turns blue when filters are active
- Add auto-start hot-dev in development environment
2025-10-01 16:23:01 +00:00
rcourtman
f8b0d21c32 chore: add claude.md to .gitignore 2025-10-01 15:54:48 +00:00
rcourtman
31317738be chore: remove claude.md from repository 2025-10-01 15:54:43 +00:00
rcourtman
49311b1e39 fix: resolve multiple issues from #485
This commit addresses all issues reported in GitHub issue #485:

1. **SMART Status Recognition**
   - Fix disk health check to accept both "PASSED" and "OK" status
   - Previously only "PASSED" was recognized as healthy
   - Location: internal/monitoring/monitor.go:1255

2. **ZFS Spare Device False Alerts**
   - Skip ZFS SPARE devices unless they have actual errors
   - SPARE devices are intentional and should not trigger alerts
   - Updated in two locations:
     - pkg/proxmox/zfs.go:154 (device filtering)
     - internal/alerts/alerts.go:1077 (alert generation)

3. **Memory Display Granularity**
   - Increase byte formatting precision from 0 to 1 decimal place
   - Improves accuracy (e.g., "1.7 GB" instead of "1 GB" for 86% of 2 GB)
   - Location: frontend-modern/src/utils/format.ts:3

4. **Custom Alert Rules Evaluation**
   - Add ReevaluateGuestAlert() method for proper threshold reevaluation
   - Add comments explaining custom rules evaluation limitations
   - Next poll cycle will properly clear stale alerts with new thresholds

Additional improvements:
- Fix ZFS pool alert locking to prevent deadlocks
- Prevent discovery service from running in mock mode
- Restore discovery service when exiting mock mode

Fixes #485
2025-10-01 15:53:42 +00:00
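Items 1 and 2 of this commit amount to two small predicates, sketched below. The function names, the `ZFSDevice` fields, and the use of the literal state string `"SPARE"` are assumptions for illustration; the actual code in monitor.go, zfs.go, and alerts.go is not shown in the commit message.

```go
package main

import "fmt"

// isDiskHealthy accepts both SMART status strings named in the commit.
func isDiskHealthy(status string) bool {
	return status == "PASSED" || status == "OK"
}

// ZFSDevice is a hypothetical, reduced view of a pool device.
type ZFSDevice struct {
	Name           string
	State          string
	ReadErrors     int64
	WriteErrors    int64
	ChecksumErrors int64
}

// shouldEvaluate reports whether a device should be considered for alerts.
// Spare devices are skipped unless they report actual errors.
func shouldEvaluate(d ZFSDevice) bool {
	if d.State != "SPARE" {
		return true
	}
	return d.ReadErrors+d.WriteErrors+d.ChecksumErrors > 0
}

func main() {
	fmt.Println(isDiskHealthy("OK"), isDiskHealthy("FAILED"))
	fmt.Println(shouldEvaluate(ZFSDevice{Name: "sdc", State: "SPARE"}))
	fmt.Println(shouldEvaluate(ZFSDevice{Name: "sdd", State: "SPARE", ChecksumErrors: 3}))
}
```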
rcourtman
b0f68933dd docs: clarify SSH temperature usage 2025-10-01 15:26:00 +00:00
rcourtman
fa2656c8f0 docs: clarify SSH temperature usage 2025-10-01 15:23:41 +00:00
rcourtman
bd9c6444d6 Handle string wearout values from Proxmox disks 2025-10-01 15:06:35 +00:00
rcourtman
dc065e75f7 fix: auto-resolve alerts when thresholds increase
Fixes #484

When users increase alert thresholds (either global defaults or
resource-specific overrides), active alerts are now automatically
re-evaluated and resolved if the current metric value is below the
new threshold.

Previously, alerts would remain active even after increasing the
threshold above the current value, requiring manual resolution or
waiting for the metric to drop below the original threshold and
then rise again.

Changes:
- Add reevaluateActiveAlertsLocked() method to check all active
  alerts against updated thresholds
- Call re-evaluation automatically in UpdateConfig()
- Resolve alerts when current value is below new trigger/clear
  threshold
- Handle all resource types: guests (qemu/lxc), nodes, PBS, storage
- Add comprehensive unit tests for threshold update scenarios
2025-10-01 15:02:27 +00:00
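A simplified sketch of re-evaluating active alerts when a threshold changes. It assumes a single global threshold and a hypothetical `Manager` type; the real `reevaluateActiveAlertsLocked()` handles per-resource overrides and separate trigger/clear values, which are omitted here.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// Alert records the last value seen for a resource that is alerting.
type Alert struct {
	ResourceID string
	Value      float64
}

// Manager is an illustrative alert manager with one threshold.
type Manager struct {
	mu        sync.Mutex
	threshold float64
	active    map[string]Alert
}

func NewManager(threshold float64) *Manager {
	return &Manager{threshold: threshold, active: make(map[string]Alert)}
}

// Check raises or clears an alert for one sample.
func (m *Manager) Check(resourceID string, value float64) {
	m.mu.Lock()
	defer m.mu.Unlock()
	if value >= m.threshold {
		m.active[resourceID] = Alert{ResourceID: resourceID, Value: value}
		return
	}
	delete(m.active, resourceID)
}

// UpdateThreshold stores the new threshold and re-evaluates active alerts
// under the same lock. It returns the IDs of resolved alerts, sorted.
func (m *Manager) UpdateThreshold(threshold float64) []string {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.threshold = threshold
	return m.reevaluateLocked()
}

// reevaluateLocked must be called with m.mu held.
func (m *Manager) reevaluateLocked() []string {
	var resolved []string
	for id, alert := range m.active {
		if alert.Value < m.threshold {
			delete(m.active, id) // deleting during range is safe in Go
			resolved = append(resolved, id)
		}
	}
	sort.Strings(resolved)
	return resolved
}

func main() {
	m := NewManager(80)
	m.Check("guest-100", 85)
	m.Check("guest-101", 95)
	fmt.Println(m.UpdateThreshold(90))
}
```

The `Locked` suffix follows the convention used in the commit: the helper assumes the caller already holds the mutex, so it can be reused from any method that does.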
rcourtman
a6e5a24a77 fix: load mock.env files during config initialization
Ensures PULSE_MOCK_MODE environment variable is set before the mock
package's init() function runs. This allows mock mode to work correctly
when enabled via mock.env or mock.env.local files without requiring an
explicit environment variable to be set at startup.
2025-10-01 14:45:52 +00:00
rcourtman
42f2213932 fix: correct mock node ID format to match real system
Fix mock node IDs to use instance-nodename format (e.g., 'mock-cluster-pve1')
instead of 'node/pve1' format. This matches the real system ID format used
at monitoring/monitor.go:936 and fixes the grouped/list toggle in the dashboard.

Before:
- Clustered: node/pve1, node/pve2, etc.
- Standalone: node/standalone1, node/standalone2

After:
- Clustered: mock-cluster-pve1, mock-cluster-pve2, etc.
- Standalone: standalone1-standalone1, standalone2-standalone2

This allows the dashboard grouping logic to properly match nodes by instance
and display them correctly in grouped view.
2025-10-01 13:47:59 +00:00
rcourtman
1c2431fcf6 refactor: add mock.env to repository with local override support
Make mock mode configuration part of the repository instead of a local-only
file. This ensures consistent mock mode behavior across all environments
(development, CI/CD, demo server) and makes it work out of the box for
new contributors.

Changes:
- Add mock.env to repository with sensible defaults (mock mode OFF by default)
- Support mock.env.local for personal overrides (gitignored)
- Update .gitignore to allow mock.env but exclude .local variants
- Backend loads mock.env then merges mock.env.local overrides
- hot-dev.sh loads both files in correct order

Benefits:
- New developers can clone and use mock mode immediately
- Demo server gets consistent mock configuration
- Personal preferences stay private in .local file
- No surprises - mock mode disabled by default in fresh clones
- CI/CD can use mock mode without custom configuration

Documentation:
- Updated README.md to explain mock.env is in repo
- Enhanced MOCK_MODE.md with local override instructions
- Updated claude.md with new configuration strategy
- Added mock.env.local.example for quick setup

Example workflow:
  git clone <repo>
  npm run mock:on        # Works immediately with repo defaults
  # Or create personal config:
  cp docs/development/mock.env.local.example mock.env.local
  # Edit mock.env.local with your preferences
2025-10-01 13:38:39 +00:00
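The "load mock.env, then merge mock.env.local overrides" order could be sketched as follows. `parseEnv` and `mergeEnv` are hypothetical helpers working on file contents passed as strings, and `EXAMPLE_NODE_COUNT` is an invented key; only `PULSE_MOCK_MODE` appears elsewhere in this log.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// parseEnv reads KEY=VALUE lines, skipping blank lines and # comments.
func parseEnv(text string) map[string]string {
	env := make(map[string]string)
	scanner := bufio.NewScanner(strings.NewReader(text))
	for scanner.Scan() {
		line := strings.TrimSpace(scanner.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		key, value, ok := strings.Cut(line, "=")
		if !ok {
			continue // not a KEY=VALUE line
		}
		env[strings.TrimSpace(key)] = strings.TrimSpace(value)
	}
	return env
}

// mergeEnv returns base with every key in override applied on top.
func mergeEnv(base, override map[string]string) map[string]string {
	merged := make(map[string]string, len(base)+len(override))
	for k, v := range base {
		merged[k] = v
	}
	for k, v := range override {
		merged[k] = v
	}
	return merged
}

func main() {
	repo := parseEnv("# repo defaults\nPULSE_MOCK_MODE=false\nEXAMPLE_NODE_COUNT=7\n")
	local := parseEnv("PULSE_MOCK_MODE=true\n")
	fmt.Println(mergeEnv(repo, local))
}
```

Because the local file is applied last, a contributor can flip a single key without copying the full set of repository defaults.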
rcourtman
6f2b6268a4 perf: optimize mock mode state retrieval and JSON encoding
Improve performance when serving /api/state in mock mode by optimizing
alert handling and JSON serialization.

Changes:
- Add UpdateAlertSnapshots() to cache alerts without blocking
- Use lazy population of alert snapshots to avoid lock contention
- Switch to json.Marshal for better performance with large payloads
- Add debug logging to track /api/state performance
- Simplify GetState() logic in mock mode

Performance improvements:
- Eliminates alert manager lock during /api/state requests
- Reduces JSON encoding overhead for large mock datasets
- Ensures sub-second response times even with 7 nodes and 90+ guests

Testing:
- Mock mode returns state instantly without blocking
- Alert snapshots populate correctly on first request
- Debug logs confirm fast execution path
2025-10-01 13:35:49 +00:00
rcourtman
67fc5977d1 feat: add hot-reloadable mock mode with auto-detection
Implement a hot-reloadable mock mode system that works seamlessly in both
development and production environments without requiring manual restarts
or port changes.

Key Features:
- Backend watches mock.env and auto-reloads when changed (via fsnotify + polling)
- npm commands for easy toggling: mock:on, mock:off, mock:status, mock:edit
- Works in both hot-dev mode and systemd deployments
- Reload completes in 2-5 seconds with no manual intervention
- No port changes or process restarts required

Implementation:
- Extended ConfigWatcher to monitor both .env and mock.env
- Added callback system to trigger ReloadableMonitor.Reload()
- Enhanced toggle-mock.sh to support both hot-dev and systemd modes
- Updated hot-dev.sh banner to show mock status and commands
- Created comprehensive documentation in docs/development/MOCK_MODE.md

Testing:
- Backend builds successfully
- Watcher initializes and monitors both files
- npm run mock:on/off toggles successfully
- mock.env updates correctly
- Scripts work in both hot-dev and systemd modes

Documentation:
- Added Mock Mode section to README.md
- Created detailed guide in docs/development/MOCK_MODE.md
- Updated claude.md with mock mode architecture and usage

Mock mode continues to return cached data instantly from memory
(no API calls, no locks, no timeouts), ensuring fast /api/state responses.
2025-10-01 13:35:17 +00:00
rcourtman
f30e57e36d feat: add GitHub Actions workflow to auto-update demo server on release 2025-10-01 11:34:53 +00:00