Root cause: NewManager() was missing TimeThresholds initialization, causing
all alert types to use 0-second delay. This meant alerts fired immediately
on the first sample exceeding threshold, with no debouncing.
Impact: LXC containers with brief CPU spikes to ~100% (normal for single-core
saturation) triggered constant alerts instead of only alerting on sustained
high CPU usage.
Fix: Add default TimeThresholds:
- guest: 10s delay (prevents alerts from brief CPU spikes)
- node: 15s delay
- storage: 30s delay
- pbs: 30s delay
This ensures a metric must stay above its threshold for the configured duration
before an alert fires, preventing noise from momentary spikes.
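The debouncing described above can be sketched roughly as follows. This is an illustration only: `Debouncer`, `ShouldFire`, and `epoch` are hypothetical names, not the project's actual Manager API. Only the per-type delays (10s, 15s, 30s, 30s) come from this commit.

```go
package main

import (
	"fmt"
	"time"
)

// epoch is an arbitrary reference time used only by the demo below.
var epoch = time.Unix(0, 0)

// Debouncer is a hypothetical helper, not the project's Manager type.
// It remembers when each alert key first exceeded its threshold.
type Debouncer struct {
	delays  map[string]time.Duration // delay per resource type
	pending map[string]time.Time     // alert key -> first time over threshold
}

// NewDebouncer fills in the default delays listed in the commit message.
func NewDebouncer() *Debouncer {
	return &Debouncer{
		delays: map[string]time.Duration{
			"guest":   10 * time.Second,
			"node":    15 * time.Second,
			"storage": 30 * time.Second,
			"pbs":     30 * time.Second,
		},
		pending: make(map[string]time.Time),
	}
}

// ShouldFire reports whether the alert identified by key should fire at now.
// over says whether the current sample exceeds the threshold.
func (d *Debouncer) ShouldFire(resourceType, key string, over bool, now time.Time) bool {
	if !over {
		delete(d.pending, key) // dropping below the threshold resets the timer
		return false
	}
	first, seen := d.pending[key]
	if !seen {
		d.pending[key] = now
		first = now
	}
	// A missing map entry yields a zero delay, which reproduces the
	// "fires on the first sample" behavior this commit fixes.
	return now.Sub(first) >= d.delays[resourceType]
}

func main() {
	d := NewDebouncer()
	fmt.Println(d.ShouldFire("guest", "ct101-cpu", true, epoch))                     // false
	fmt.Println(d.ShouldFire("guest", "ct101-cpu", true, epoch.Add(5*time.Second)))  // false
	fmt.Println(d.ShouldFire("guest", "ct101-cpu", true, epoch.Add(10*time.Second))) // true
}
```

Note that an uninitialized delay map behaves exactly like a 0-second delay, which is why the missing initialization went unnoticed.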
Fixes #491
- Create shared NodeGroupHeader component to eliminate code duplication
- Replace vertical line indicator with circular dot matching guest rows
- Update online indicator to use bg-green-500 (matching guest indicators)
- Reduce node row padding from py-2 to py-1 for more compact layout
- Set background to dark:bg-gray-900 to match search bar styling
- Apply changes consistently across Dashboard and Storage tabs
Addresses #485 - adds UI controls for disabling powered-off alerts on a per-guest basis.
Changes:
- Add "Alert Powered-Off" / "No Powered-Off" toggle button for VMs/LXCs
- Extend toggleNodeConnectivity() to handle guests in addition to nodes/PBS
- Add disableConnectivity field to guest resource mapping
- Update hasOverride logic to track connectivity state
Previously, users could only disable ALL alerts for a guest or none.
Now they can independently control resource metric alerts vs powered-off alerts,
matching the functionality already available for nodes and PBS servers.
User impact:
- Enabled + Alert Powered-Off: All alerts including power state (default)
- Enabled + No Powered-Off: Only resource alerts, ignore power state
- Disabled: No alerts at all
Backend already supports this via DisableConnectivity flag.
Implements alerts for powered-off VMs and containers as requested in GitHub discussion #487.
Changes:
- Modified CheckGuest to generate "powered-off" alerts for stopped guests
- Added checkGuestPoweredOff() and clearGuestPoweredOffAlert() functions
- Uses 2-poll confirmation (~10 seconds) to prevent false positives
- Alert level is Warning (not Critical) by default
- Alerts are automatically cleared when guest returns to running state
- Respects existing disableConnectivity flag for per-guest configuration
- Clears only metric alerts for non-running guests, preserves powered-off alerts
- Updated DisableConnectivity comment to include powered-off alerts
Configuration:
- Can be disabled globally or per-guest via alert overrides
- Uses existing disableConnectivity toggle (same as node offline alerts)
- No frontend changes needed - types already support this
Testing:
- Build successful
- Tests pass
- Webhooks and emails will handle new alert type automatically
CRITICAL FIX: Prevents nodes.enc configuration from being permanently lost
when decryption fails due to encryption key regeneration or corruption.
Root Cause Analysis:
1. If .encryption.key is deleted/regenerated, existing .enc files become unreadable
2. Previous code would fail to decrypt, try backup (also fails), then return error
3. This left NO nodes.enc file on disk
4. Next startup would see no .enc files and happily generate a new encryption key
5. User's node configuration was permanently lost
Changes Made:
1. **persistence.go (lines 600-645)**: When decryption fails for BOTH main file
   and backup, instead of returning error and leaving no file:
   - Log CRITICAL error with clear message about encryption key issue
   - Move corrupted file to timestamped .corrupted file for forensics
   - Create EMPTY but VALID encrypted nodes.enc file
   - Return empty config so system can start
   - This prevents encryption key regeneration on next startup
2. **crypto.go (lines 93-121)**: Enhanced encryption key generation checks:
   - Now checks for nodes.enc* (including .backup, .corrupted files)
   - Uses glob patterns to find ANY encrypted file remnants
   - Refuses to generate new key if ANY .enc* files exist
   - Provides clear error message listing all found files
   - Forces manual intervention before allowing key regeneration
Benefits:
- System can still start even if decryption fails
- Corrupted files are preserved with timestamps for forensic analysis
- Encryption key cannot be silently regenerated if ANY encrypted data exists
- Clear, prominent error logging helps diagnose the root cause
- User is forced to manually address the issue rather than silently losing data
This should prevent the recurring issue where node configurations mysteriously
disappear, requiring manual reconfiguration through the UI.
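The key-generation guard can be sketched along these lines. This is not the code in crypto.go: `checkKeyGeneration` and `setupDemoDir` are hypothetical helpers, and the sketch only covers the nodes.enc* pattern named above.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// checkKeyGeneration returns an error if any nodes.enc* file (including
// .backup or .corrupted remnants) exists in configDir, so that a new key is
// never generated while encrypted data is still on disk.
func checkKeyGeneration(configDir string) error {
	matches, err := filepath.Glob(filepath.Join(configDir, "nodes.enc*"))
	if err != nil {
		return fmt.Errorf("scanning %s: %w", configDir, err)
	}
	if len(matches) > 0 {
		return fmt.Errorf("refusing to generate a new encryption key; encrypted files exist: %v", matches)
	}
	return nil
}

// setupDemoDir creates a temporary directory containing the given empty
// files. It exists only to make the example runnable.
func setupDemoDir(files ...string) (string, error) {
	dir, err := os.MkdirTemp("", "keyguard")
	if err != nil {
		return "", err
	}
	for _, name := range files {
		if err := os.WriteFile(filepath.Join(dir, name), nil, 0o600); err != nil {
			return "", err
		}
	}
	return dir, nil
}

func main() {
	dir, err := setupDemoDir("nodes.enc.backup")
	if err != nil {
		fmt.Println(err)
		return
	}
	defer os.RemoveAll(dir)
	// A leftover backup is enough to block key generation.
	fmt.Println(checkKeyGeneration(dir) != nil) // true
}
```

The pattern is deliberately anchored on the nodes.enc prefix: a broader pattern such as *.enc* would also match .encryption.key itself.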
Mock data was using inconsistent ID formats that didn't match production code.
This caused alert matching and fallback ID generation to fail.
Backend changes (mock generator):
- VMs: Use conditional logic matching production - standalone nodes use "node-vmid",
clusters use "instance-node-vmid" (generator.go:503-509, 584-590)
- Containers: Same conditional logic as VMs (generator.go:584-590)
- Storage: Always use "instance-node-name" format matching production
(generator.go:875, 922, 947, 982)
- Shared storage: Use "shared" as node name and correct instance
(generator.go:1007-1008)
Frontend changes:
- Dashboard.tsx: Guest ID fallback now matches backend conditional logic
(Dashboard.tsx:964-969)
- Storage.tsx: Storage ID fallback now uses "instance-node-name" format
(Storage.tsx:603)
Production format (from monitor.go):
- Guest IDs: Standalone uses "node-vmid", cluster uses "instance-node-vmid"
- Storage IDs: Always "instance-node-name"
- Node IDs: Always "instance-node"
This ensures:
1. Alert resourceId matching works correctly
2. Frontend fallbacks (if ever needed) generate correct IDs
3. Mock data accurately represents production behavior
4. Consistent filtering by instance+node works across all resource types
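For reference, the three ID formats can be sketched as small helpers. The function names and the `clustered` parameter are assumptions made for illustration; how monitor.go actually decides between standalone and cluster is not stated here.

```go
package main

import "fmt"

// guestID builds a VM/container ID. Standalone nodes use "node-vmid",
// clusters use "instance-node-vmid". The clustered flag is a hypothetical
// parameter for this sketch.
func guestID(instance, node string, vmid int, clustered bool) string {
	if clustered {
		return fmt.Sprintf("%s-%s-%d", instance, node, vmid)
	}
	return fmt.Sprintf("%s-%d", node, vmid)
}

// storageID always uses "instance-node-name".
func storageID(instance, node, name string) string {
	return fmt.Sprintf("%s-%s-%s", instance, node, name)
}

// nodeID always uses "instance-node".
func nodeID(instance, node string) string {
	return fmt.Sprintf("%s-%s", instance, node)
}

func main() {
	fmt.Println(guestID("mock-cluster", "pve1", 100, true))     // mock-cluster-pve1-100
	fmt.Println(guestID("standalone1", "standalone1", 200, false)) // standalone1-200
	fmt.Println(storageID("mock-cluster", "pve1", "local"))      // mock-cluster-pve1-local
	fmt.Println(nodeID("mock-cluster", "pve1"))                  // mock-cluster-pve1
}
```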
Previously, storage Instance fields were set to `fmt.Sprintf("pve-%s", node.Name)`,
creating values like "pve-pve1" that didn't match the parent node's Instance field
("mock-cluster"). This caused storage filtering and counting to fail when matching
by instance + node, similar to the backup/snapshot issue fixed earlier.
Changes:
- Set storage.Instance = node.Instance for local storage (generator.go:862)
- Set storage.Instance = node.Instance for local-zfs storage (generator.go:909)
- Set storage.Instance = node.Instance for random storage (generator.go:934)
- PBS storage already correctly used node.Instance (generator.go:969)
This ensures storage counts display correctly on the Storage tab node summary cards
and that filtering by instance + node works consistently across all resource types.
Note: This is part of the broader pattern fix where all resources must match by
both instance AND node name to handle duplicate hostnames across clusters correctly.
Previously, mock-generated backups and snapshots had empty Instance fields,
causing the backups tab node summary counts to show 0. The frontend filters
backups by both instance and node name (b.instance === node.instance &&
b.node === node.name), but without the Instance field populated, no matches
were found.
Changes:
- Set Instance field on VM backups (generator.go:1030)
- Set Instance field on container backups (generator.go:1068)
- Set Instance field on VM snapshots (generator.go:1323)
- Set Instance field on container snapshots (generator.go:1355)
This ensures node backup counts display correctly across all tabs.
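The effect of the empty Instance field can be illustrated with a small sketch of the instance + node match. The real filter is frontend TypeScript; the Go types below (`Node`, `Backup`, `countBackups`) are hypothetical and exist only to show the matching rule.

```go
package main

import "fmt"

// Node and Backup are simplified, hypothetical shapes for this sketch.
type Node struct {
	Instance string
	Name     string
}

type Backup struct {
	Instance string
	Node     string
	VMID     int
}

// countBackups counts backups whose instance AND node name both match,
// mirroring b.instance === node.instance && b.node === node.name.
func countBackups(n Node, backups []Backup) int {
	count := 0
	for _, b := range backups {
		if b.Instance == n.Instance && b.Node == n.Name {
			count++
		}
	}
	return count
}

func main() {
	node := Node{Instance: "mock-cluster", Name: "pve1"}
	backups := []Backup{
		{Instance: "mock-cluster", Node: "pve1", VMID: 100},
		{Instance: "", Node: "pve1", VMID: 101},              // empty Instance: never matches
		{Instance: "other-cluster", Node: "pve1", VMID: 102}, // same hostname, other cluster
	}
	fmt.Println(countBackups(node, backups)) // 1
}
```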
Fixed remaining instances where resources were matched using node.id instead
of matching by both instance and node name:
- Dashboard.tsx: VM and container counts in grouped view
- UnifiedNodeSelector.tsx: Backup and snapshot counts
This ensures all tabs (Dashboard, Storage, Backups) correctly count resources
for nodes in mock mode where node.id format differs from instance format.
Make ESC key a complete reset button that clears all active filters:
- First press: Clears search, sorting, node selection, view mode, and status mode
- Second press: Toggles filter section visibility (collapse/expand)
This provides a quick way to reset the entire dashboard view to defaults.
Adjust mock data generation to produce more realistic resource usage patterns:
- VMs: Lower typical CPU usage (0-25%), mean reversion toward 15%
- Containers: Even lower CPU (0-12%), mean reversion toward 8%
- Memory: More realistic distribution with mean reversion
- Metrics updates: Smaller fluctuations with natural mean reversion
- I/O patterns: Less frequent changes for more stability
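Mean reversion of this kind can be sketched as a single update step. The ranges and means (0-25% toward 15% for VMs) come from the list above; the `revert` function, the 0.1 reversion strength, and the jitter size are assumptions for illustration, not the generator's actual values.

```go
package main

import (
	"fmt"
	"math/rand"
)

// revert moves current a fraction (strength) of the way toward mean, adds a
// pre-sampled jitter, and clamps the result to [lo, hi].
func revert(current, mean, strength, jitter, lo, hi float64) float64 {
	next := current + strength*(mean-current) + jitter
	if next < lo {
		return lo
	}
	if next > hi {
		return hi
	}
	return next
}

func main() {
	rng := rand.New(rand.NewSource(42))
	cpu := 25.0 // start a VM at the top of its range
	for i := 0; i < 5; i++ {
		jitter := (rng.Float64()*2 - 1) * 1.5 // uniform in [-1.5, 1.5)
		cpu = revert(cpu, 15, 0.1, jitter, 0, 25)
		fmt.Printf("%.1f\n", cpu)
	}
}
```

With a small strength the value drifts back toward the mean over many updates rather than snapping to it, which keeps the charts smooth.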
Fix remaining issues where node.name was used for counting/grouping instead
of instance/node.id, causing incorrect counts with duplicate hostnames.
Changes:
- NodeSummaryTable: Use node.id for counting VMs/containers/storage/disks
- UnifiedNodeSelector: Use node.id for counting backups and snapshots
- DiskList: Display node.name in empty state message instead of node ID
The pattern is now consistent:
- User-facing filtering: use node.name (what users see/search)
- Counting/grouping: use instance/node.id (handles duplicates)
- Display: convert node.id to node.name for readability
Fix node summary card row selection to properly filter the tables below
by implementing independent selection state instead of search box filters.
Changes:
- Clicking a node row now filters by node name (visual selection state)
- Selection is independent of search, allowing both to work together
- Toggle selection by clicking the same row again
- Clear selection with ESC key
- Fixes filtering in Dashboard, Storage, and Backups tabs
The previous implementation had a mismatch between node IDs and instance
fields. Now using simple node.name matching for reliable filtering.
- Extract SearchTipsPopover as shared component
- Improved visual design with better typography and spacing
- Consistent search help across Dashboard, Backups, and Storage tabs
- Better UX with clickable button instead of hover-only tooltip
- Deselectable radio toggles on all filter tabs
- Blue reset button when filters are active
- Clean search placeholders with help tooltips
- Working tooltips with proper styling on Dashboard tab
- Better placeholder text: "Search or filter guests..."
- Make radio toggles deselectable by clicking active option
- Reset button turns blue when filters are active
- Add auto-start hot-dev in development environment
This commit addresses all issues reported in GitHub issue #485:
1. **SMART Status Recognition**
   - Fix disk health check to accept both "PASSED" and "OK" status
   - Previously only "PASSED" was recognized as healthy
   - Location: internal/monitoring/monitor.go:1255
2. **ZFS Spare Device False Alerts**
   - Skip ZFS SPARE devices unless they have actual errors
   - SPARE devices are intentional and should not trigger alerts
   - Updated in two locations:
     - pkg/proxmox/zfs.go:154 (device filtering)
     - internal/alerts/alerts.go:1077 (alert generation)
3. **Memory Display Granularity**
   - Increase byte formatting precision from 0 to 1 decimal place
   - Improves accuracy (e.g., "1.7 GB" instead of "1 GB" for 86% of 2GB)
   - Location: frontend-modern/src/utils/format.ts:3
4. **Custom Alert Rules Evaluation**
   - Add ReevaluateGuestAlert() method for proper threshold reevaluation
   - Add comments explaining custom rules evaluation limitations
   - Next poll cycle will properly clear stale alerts with new thresholds
Additional improvements:
- Fix ZFS pool alert locking to prevent deadlocks
- Prevent discovery service from running in mock mode
- Restore discovery service when exiting mock mode
Fixes #485
Fixes #484
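The first two fixes above amount to small predicate changes, sketched below. `ZFSDevice` and both function names are hypothetical, and the rule used for non-spare devices (alert when not ONLINE or when errors are present) is an assumption for this sketch rather than the project's exact logic.

```go
package main

import "fmt"

// smartHealthy accepts both status strings named in the commit.
func smartHealthy(status string) bool {
	return status == "PASSED" || status == "OK"
}

// ZFSDevice is a simplified, hypothetical view of a pool device.
type ZFSDevice struct {
	State          string
	ReadErrors     int
	WriteErrors    int
	ChecksumErrors int
}

// hasErrors reports whether any error counter is non-zero.
func (d ZFSDevice) hasErrors() bool {
	return d.ReadErrors > 0 || d.WriteErrors > 0 || d.ChecksumErrors > 0
}

// shouldAlert skips SPARE devices unless they carry actual errors.
func shouldAlert(d ZFSDevice) bool {
	if d.State == "SPARE" {
		return d.hasErrors()
	}
	// Assumption: any other non-ONLINE state or any error is alert-worthy.
	return d.State != "ONLINE" || d.hasErrors()
}

func main() {
	fmt.Println(smartHealthy("OK"))                                 // true
	fmt.Println(shouldAlert(ZFSDevice{State: "SPARE"}))             // false
	fmt.Println(shouldAlert(ZFSDevice{State: "SPARE", ReadErrors: 1})) // true
}
```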
When users increase alert thresholds (either global defaults or
resource-specific overrides), active alerts are now automatically
re-evaluated and resolved if the current metric value is below the
new threshold.
Previously, alerts would remain active even after increasing the
threshold above the current value, requiring manual resolution or
waiting for the metric to drop below the original threshold and
then rise again.
Changes:
- Add reevaluateActiveAlertsLocked() method to check all active
alerts against updated thresholds
- Call re-evaluation automatically in UpdateConfig()
- Resolve alerts when current value is below new trigger/clear
threshold
- Handle all resource types: guests (qemu/lxc), nodes, PBS, storage
- Add comprehensive unit tests for threshold update scenarios
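A rough sketch of the re-evaluation step, under stated assumptions: `Alert`, `Threshold`, and `reevaluate` are hypothetical, and the rule "resolve when the value is below the clear threshold, falling back to the trigger threshold when no clear value is set" is one plausible reading of the description above, not necessarily what reevaluateActiveAlertsLocked() does.

```go
package main

import (
	"fmt"
	"sort"
)

// Alert and Threshold are simplified, hypothetical types for this sketch.
type Alert struct {
	Metric string  // e.g. "cpu"
	Value  float64 // most recent metric value
}

type Threshold struct {
	Trigger float64
	Clear   float64 // 0 means "not set"
}

// reevaluate removes active alerts whose current value is below the updated
// threshold and returns the resolved alert IDs in sorted order.
func reevaluate(active map[string]Alert, thresholds map[string]Threshold) []string {
	var resolved []string
	for id, a := range active {
		th, ok := thresholds[a.Metric]
		if !ok {
			continue // no threshold known for this metric: leave the alert alone
		}
		limit := th.Clear
		if limit == 0 {
			limit = th.Trigger
		}
		if a.Value < limit {
			resolved = append(resolved, id)
			delete(active, id) // deleting during range is safe in Go
		}
	}
	sort.Strings(resolved)
	return resolved
}

func main() {
	active := map[string]Alert{
		"vm100-cpu": {Metric: "cpu", Value: 85},
		"vm101-cpu": {Metric: "cpu", Value: 97},
	}
	// Threshold raised from 80 to 95: the 85% alert is now stale.
	fmt.Println(reevaluate(active, map[string]Threshold{"cpu": {Trigger: 95, Clear: 90}})) // [vm100-cpu]
}
```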
Ensures PULSE_MOCK_MODE environment variable is set before the mock
package's init() function runs. This allows mock mode to work correctly
when enabled via mock.env or mock.env.local files without requiring an
explicit environment variable to be set at startup.
Fix mock node IDs to use instance-nodename format (e.g., 'mock-cluster-pve1')
instead of 'node/pve1' format. This matches the real system ID format used
at monitoring/monitor.go:936 and fixes the grouped/list toggle in the dashboard.
Before:
- Clustered: node/pve1, node/pve2, etc.
- Standalone: node/standalone1, node/standalone2
After:
- Clustered: mock-cluster-pve1, mock-cluster-pve2, etc.
- Standalone: standalone1-standalone1, standalone2-standalone2
This allows the dashboard grouping logic to properly match nodes by instance
and display them correctly in grouped view.
Make mock mode configuration part of the repository instead of a local-only
file. This ensures consistent mock mode behavior across all environments
(development, CI/CD, demo server) and makes it work out of the box for
new contributors.
Changes:
- Add mock.env to repository with sensible defaults (mock mode OFF by default)
- Support mock.env.local for personal overrides (gitignored)
- Update .gitignore to allow mock.env but exclude .local variants
- Backend loads mock.env then merges mock.env.local overrides
- hot-dev.sh loads both files in correct order
Benefits:
- New developers can clone and use mock mode immediately
- Demo server gets consistent mock configuration
- Personal preferences stay private in .local file
- No surprises - mock mode disabled by default in fresh clones
- CI/CD can use mock mode without custom configuration
Documentation:
- Updated README.md to explain mock.env is in repo
- Enhanced MOCK_MODE.md with local override instructions
- Updated claude.md with new configuration strategy
- Added mock.env.local.example for quick setup
Example workflow:
  git clone <repo>
  npm run mock:on  # Works immediately with repo defaults

  # Or create personal config:
  cp docs/development/mock.env.local.example mock.env.local
  # Edit mock.env.local with your preferences
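The "load mock.env, then merge mock.env.local overrides" order can be sketched as follows. `parseEnv` and `mergeEnv` are hypothetical helpers, the KEY=VALUE syntax with # comments is an assumption about the file format, and EXAMPLE_KEY is made up for the demo; only PULSE_MOCK_MODE is a variable named in this log.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// parseEnv reads KEY=VALUE lines, skipping blank lines and # comments.
func parseEnv(content string) map[string]string {
	vars := make(map[string]string)
	sc := bufio.NewScanner(strings.NewReader(content))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		key, value, ok := strings.Cut(line, "=")
		if !ok {
			continue // ignore malformed lines in this sketch
		}
		vars[strings.TrimSpace(key)] = strings.TrimSpace(value)
	}
	return vars
}

// mergeEnv returns base with every key in override replacing the base value.
func mergeEnv(base, override map[string]string) map[string]string {
	merged := make(map[string]string, len(base)+len(override))
	for k, v := range base {
		merged[k] = v
	}
	for k, v := range override {
		merged[k] = v
	}
	return merged
}

func main() {
	repo := parseEnv("# repo defaults\nPULSE_MOCK_MODE=false\nEXAMPLE_KEY=1\n")
	local := parseEnv("PULSE_MOCK_MODE=true\n")
	merged := mergeEnv(repo, local)
	fmt.Println(merged["PULSE_MOCK_MODE"]) // true
	fmt.Println(merged["EXAMPLE_KEY"])     // 1
}
```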
Improve performance when serving /api/state in mock mode by optimizing
alert handling and JSON serialization.
Changes:
- Add UpdateAlertSnapshots() to cache alerts without blocking
- Use lazy population of alert snapshots to avoid lock contention
- Switch to json.Marshal for better performance with large payloads
- Add debug logging to track /api/state performance
- Simplify GetState() logic in mock mode
Performance improvements:
- Eliminates alert manager lock during /api/state requests
- Reduces JSON encoding overhead for large mock datasets
- Ensures sub-second response times even with 7 nodes and 90+ guests
Testing:
- Mock mode returns state instantly without blocking
- Alert snapshots populate correctly on first request
- Debug logs confirm fast execution path
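One way to serve cached alerts without taking the alert manager lock is an atomically swapped snapshot with lazy population, sketched below. This is an assumption about the general approach, not the actual UpdateAlertSnapshots() implementation; `SnapshotCache` and its methods are hypothetical, and alerts are reduced to plain strings.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// SnapshotCache holds the latest alert snapshot behind an atomic pointer so
// readers never wait on a mutex. Requires Go 1.19+ for atomic.Pointer.
type SnapshotCache struct {
	snap atomic.Pointer[[]string]
}

// Update stores a private copy of alerts as the current snapshot.
func (c *SnapshotCache) Update(alerts []string) {
	cp := append([]string(nil), alerts...)
	c.snap.Store(&cp)
}

// Get returns the cached snapshot. If none exists yet, it calls load once to
// populate the cache. Under a race, load may run more than once; the last
// Store wins, which is acceptable for read-only snapshots.
func (c *SnapshotCache) Get(load func() []string) []string {
	if p := c.snap.Load(); p != nil {
		return *p
	}
	c.Update(load())
	return *c.snap.Load()
}

func main() {
	var cache SnapshotCache
	calls := 0
	load := func() []string {
		calls++ // stands in for the expensive, lock-taking path
		return []string{"vm100-cpu"}
	}
	fmt.Println(cache.Get(load), calls) // [vm100-cpu] 1
	fmt.Println(cache.Get(load), calls) // [vm100-cpu] 1
}
```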
Implement a hot-reloadable mock mode system that works seamlessly in both
development and production environments without requiring manual restarts
or port changes.
Key Features:
- Backend watches mock.env and auto-reloads when changed (via fsnotify + polling)
- npm commands for easy toggling: mock:on, mock:off, mock:status, mock:edit
- Works in both hot-dev mode and systemd deployments
- Reload completes in 2-5 seconds with no manual intervention
- No port changes or process restarts required
Implementation:
- Extended ConfigWatcher to monitor both .env and mock.env
- Added callback system to trigger ReloadableMonitor.Reload()
- Enhanced toggle-mock.sh to support both hot-dev and systemd modes
- Updated hot-dev.sh banner to show mock status and commands
- Created comprehensive documentation in docs/development/MOCK_MODE.md
Testing:
- Backend builds successfully
- Watcher initializes and monitors both files
- npm run mock:on/off toggles successfully
- mock.env updates correctly
- Scripts work in both hot-dev and systemd modes
Documentation:
- Added Mock Mode section to README.md
- Created detailed guide in docs/development/MOCK_MODE.md
- Updated claude.md with mock mode architecture and usage
Mock mode continues to return cached data instantly from memory
(no API calls, no locks, no timeouts), ensuring fast /api/state responses.