The 10ms goroutine drain pause was insufficient under full parallel
test suite load, causing intermittent failures in
TestPulseMonitorOnlySkipsDispatchButRetainsAlert.
The previous hasLiveInventory guard was a single boolean — if any PVE
instance had at least one live guest, orphan detection ran for all
instances. In multi-instance clusters with staggered polling, backups
from instances whose VMs hadn't been polled yet appeared orphaned,
producing false positive alerts with 0m duration.
Replace the global boolean with a per-instance map. PVE storage backups
now only run orphan detection when their specific instance has live
inventory. PBS/PMG backups (which span instances) retain the "any
instance has live guests" check.
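
A minimal sketch of the per-instance gate described above. The Guest type and both function names are hypothetical stand-ins, not the project's actual identifiers; an empty instance string is assumed here to represent PBS/PMG backups, which span instances.

```go
package main

import "fmt"

// Guest is a hypothetical, trimmed-down guest record.
type Guest struct {
	Instance   string
	VMID       string
	ResourceID string
}

// liveInventoryByInstance records, per PVE instance, whether at least one
// guest with a non-empty ResourceID has been polled.
func liveInventoryByInstance(guests []Guest) map[string]bool {
	live := make(map[string]bool)
	for _, g := range guests {
		if g.ResourceID != "" {
			live[g.Instance] = true
		}
	}
	return live
}

// shouldCheckOrphans decides whether orphan detection may run for a backup.
// An empty instance is assumed to mean a PBS/PMG backup, which keeps the
// "any instance has live guests" rule.
func shouldCheckOrphans(live map[string]bool, instance string) bool {
	if instance == "" {
		return len(live) > 0
	}
	return live[instance]
}

func main() {
	live := liveInventoryByInstance([]Guest{
		{Instance: "pve1", VMID: "100", ResourceID: "pve1:node1:100"},
		{Instance: "pve2", VMID: "200"}, // not polled yet: no ResourceID
	})
	fmt.Println(shouldCheckOrphans(live, "pve1")) // true
	fmt.Println(shouldCheckOrphans(live, "pve2")) // false
	fmt.Println(shouldCheckOrphans(live, ""))     // true
}
```

With a single global boolean, the pve2 backup in this example would have been checked (and flagged) as soon as pve1 reported a guest.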
Backup polling goroutines can snapshot state before VM/container polling
populates the guest inventory. When guestsByVMID is empty, every backup
appears orphaned. Gate orphan detection on hasLiveInventory (at least one
guest with non-empty ResourceID) and preserve existing orphan alerts when
inventory becomes unavailable.
Fire a warning alert immediately when a backup's guest no longer exists
in inventory, without requiring age thresholds to be breached. The
existing alertOrphaned toggle and ignoreVMIDs UI control this feature
with no frontend changes needed.
When namespace matching fails, the VMID-only fallback now checks whether
the VMID appears on multiple PVE instances. If ambiguous, the fallback
is skipped — preventing backups from being falsely attributed to the
wrong guest. Unique VMIDs still fall back as before.
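
The ambiguity check can be sketched roughly as follows. The Guest type and fallbackByVMID are illustrative names only; the real lookup may be structured differently.

```go
package main

import "fmt"

// Guest is a hypothetical, trimmed-down guest record.
type Guest struct {
	Instance string
	VMID     string
}

// fallbackByVMID returns the guest with the given VMID only when that VMID
// appears on exactly one instance. Missing or ambiguous VMIDs yield ok=false.
func fallbackByVMID(guests []Guest, vmid string) (Guest, bool) {
	var match Guest
	instances := make(map[string]struct{})
	for _, g := range guests {
		if g.VMID == vmid {
			match = g
			instances[g.Instance] = struct{}{}
		}
	}
	if len(instances) != 1 {
		return Guest{}, false
	}
	return match, true
}

func main() {
	guests := []Guest{
		{Instance: "pve1", VMID: "100"},
		{Instance: "pve2", VMID: "100"}, // same VMID on a second instance
		{Instance: "pve1", VMID: "101"},
	}
	_, ok := fallbackByVMID(guests, "100")
	fmt.Println(ok) // false: ambiguous, fallback skipped
	g, ok := fallbackByVMID(guests, "101")
	fmt.Println(g.Instance, ok) // pve1 true
}
```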
The quiet hours fix (07b4765b) added ShouldSuppressResolvedNotification()
to handleAlertResolved, which acquires m.mu.RLock(). Five clear*OfflineAlert
functions call the resolved callback synchronously while holding m.mu.Lock().
Go's RWMutex is not reentrant, so this deadlocks permanently when any
node/PBS/PMG/storage/guest comes back online after being offline.
The deadlock prevents recovery notifications from being sent and freezes
the monitoring goroutine, cascading to block all subsequent polling.
Fix: change the five affected functions to fire the resolved callback
asynchronously (matching the pattern already used by clearAlertNoLock),
so it runs after m.mu is released.
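
The locking pattern behind the fix can be illustrated with a small, self-contained sketch. The Manager fields and method names below are simplified assumptions, not the real implementation; only the use of sync.RWMutex mirrors the description above.

```go
package main

import (
	"fmt"
	"sync"
)

// Manager is a simplified stand-in for the alert manager.
type Manager struct {
	mu         sync.RWMutex
	active     map[string]bool
	quiet      bool
	onResolved func(id string)
}

// shouldSuppress stands in for ShouldSuppressResolvedNotification:
// it needs the read lock.
func (m *Manager) shouldSuppress() bool {
	m.mu.RLock()
	defer m.mu.RUnlock()
	return m.quiet
}

// clearOffline removes an alert under the write lock and fires the resolved
// callback in a goroutine. The callback blocks on RLock until the deferred
// Unlock runs. Calling cb(id) directly here would deadlock, because
// RWMutex is not reentrant.
func (m *Manager) clearOffline(id string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	if !m.active[id] {
		return
	}
	delete(m.active, id)
	if cb := m.onResolved; cb != nil {
		go cb(id)
	}
}

func main() {
	done := make(chan string, 1)
	m := &Manager{active: map[string]bool{"node-offline": true}}
	m.onResolved = func(id string) {
		if !m.shouldSuppress() {
			done <- id
		}
	}
	m.clearOffline("node-offline")
	fmt.Println("resolved:", <-done)
}
```

The trade-off of the asynchronous call is that the callback may observe state changed by a later lock holder, so it should only rely on the alert data passed to it.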
Related to #1068
In single-node setups, guest alerts had Instance == Node, causing
reevaluateActiveAlertsLocked to evaluate them against NodeDefaults
instead of GuestDefaults. Setting guest memory threshold to 0 (disabled)
wouldn't clear existing guest alerts because they were being kept alive
by the still-enabled node memory threshold.
- Add resourceID colon check to distinguish guest IDs (instance:node:vmid)
  from node IDs (instance-node) in reevaluateActiveAlertsLocked
- Clear stale alerts in checkMetric when threshold is nil or disabled
- Skip hysteresis validation for disabled thresholds (Trigger <= 0)
- Fix frontend tooltip: "0" not "-1" disables a threshold
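
As an illustration of the colon check, here is a sketch under the assumption that the two ID formats are exactly as listed above. The helper names are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// isGuestResourceID reports whether an ID uses the guest format
// (instance:node:vmid). Node IDs (instance-node) contain no colon.
func isGuestResourceID(resourceID string) bool {
	return strings.Contains(resourceID, ":")
}

// defaultsFor names which threshold set an alert should be evaluated
// against. The returned strings are labels for this example only.
func defaultsFor(resourceID string) string {
	if isGuestResourceID(resourceID) {
		return "GuestDefaults"
	}
	return "NodeDefaults"
}

func main() {
	// Single-node setup: instance and node are both "pve".
	fmt.Println(defaultsFor("pve:pve:100")) // GuestDefaults
	fmt.Println(defaultsFor("pve-pve"))     // NodeDefaults
}
```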
- Add comprehensive tests for internal/api/config_handlers.go (Phases 1-3)
- Improve test coverage for AI tools, chat service, and session management
- Enhance alert and notification tests (ResolvedAlert, Webhook)
- Add frontend unit tests for utils (searchHistory, tagColors, temperature, url)
- Add proximity client API tests
Closes #1115 (discussion feedback)
Two API consistency issues reported by @FabienD74:
1. Version format mismatch in /api/version:
   - currentVersion: "5.0.16" (no prefix)
   - latestVersion: "v5.0.16" (with prefix)
   Fixed: LatestVersion now strips the "v" prefix to match CurrentVersion format.
2. Guest ID separator inconsistency:
   - Some code used colons: "instance:node:vmid"
   - BuildGuestKey used dashes: "instance-node-vmid"
   Fixed: BuildGuestKey now uses colon separator matching the canonical
   format used by makeGuestID in the monitoring package. The existing
   legacy migration in GetWithLegacyMigration handles old dash-format
   entries in guest_metadata.json.
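
Both fixes are small string-handling changes. A sketch follows, with hypothetical lowercase helpers standing in for the real functions, whose signatures are not shown in this message:

```go
package main

import (
	"fmt"
	"strings"
)

// normalizeVersion strips a leading "v" so that "v5.0.16" and "5.0.16"
// compare equal. Hypothetical helper, not the project's function.
func normalizeVersion(v string) string {
	return strings.TrimPrefix(v, "v")
}

// buildGuestKey joins the parts with colons, the canonical guest ID format.
// The parameter list is an assumption for this example.
func buildGuestKey(instance, node string, vmid int) string {
	return fmt.Sprintf("%s:%s:%d", instance, node, vmid)
}

func main() {
	fmt.Println(normalizeVersion("v5.0.16"))         // 5.0.16
	fmt.Println(buildGuestKey("pve1", "node1", 100)) // pve1:node1:100
}
```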
The previous commit fixed namespace disambiguation for backup alerts,
but the Overview display uses SyncGuestBackupTimes to populate backup
timestamps on VMs/Containers. This commit extends the same namespace
matching logic to that function.
Also tightened the matching algorithm to use suffix matching instead
of substring matching, preventing false positives like "pve" matching
"pve-nat".
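
The difference between the two matching strategies can be shown with the standard strings package. The function name and the direction of the comparison (namespace as a suffix of the instance name) are assumptions made for this sketch.

```go
package main

import (
	"fmt"
	"strings"
)

// namespaceMatchesInstance reports whether a PBS namespace identifies a PVE
// instance, using exact or suffix matching rather than substring matching.
func namespaceMatchesInstance(namespace, instance string) bool {
	ns := strings.ToLower(namespace)
	in := strings.ToLower(instance)
	if ns == "" || in == "" {
		return false // HasSuffix would otherwise accept an empty namespace
	}
	return ns == in || strings.HasSuffix(in, ns)
}

func main() {
	// Substring matching produces the false positive described above.
	fmt.Println(strings.Contains("pve-nat", "pve")) // true

	fmt.Println(namespaceMatchesInstance("pve", "pve-nat")) // false
	fmt.Println(namespaceMatchesInstance("nat", "pve-nat")) // true
}
```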
When multiple PVE instances have VMs with overlapping VMIDs, PBS backups
were being matched to the wrong VM because the code would just use the
first matching guest. Now when a PBS backup has a namespace, it attempts
to match that namespace to the PVE instance name to find the correct VM.
This helps users who have separate PBS instances backing up different
PVE clusters with namespaces like "pve1", "nat", etc.
When LoadActiveAlerts skipped acknowledged alerts older than 1 hour,
it also failed to populate ackState. When the same alert (e.g.,
backup-age) was recreated on the next poll cycle, preserveAlertState
found no acknowledgement record, so the alert retriggered
notifications.
Now ackState is populated even for skipped old acknowledged alerts,
so if they reappear, the acknowledgement will be restored.
Related to #1043
- Add persistent volume mounts for Go/npm caches (faster rebuilds)
- Add shell config with helpful aliases and custom prompt
- Add comprehensive devcontainer documentation
- Add pre-commit hooks for Go formatting and linting
- Use go-version-file in CI workflows instead of hardcoded versions
- Simplify docker compose commands with --wait flag
- Add gitignore entries for devcontainer auth files
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- RAID tests now use /dev/md2 since md0/md1 are skipped for Synology compatibility
- AI handler tests now expect 'AI is not enabled' message after AI gating change
- Fixed normalizeStorageDefaults to allow Trigger=0
- Fixed normalizeNodeDefaults (Temperature) to allow Trigger=0
- Added comprehensive tests for all threshold normalization patterns
- Updated existing test that expected old behavior
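
A rough sketch of a normalization step that preserves Trigger=0 as "disabled". The Threshold type, the function name, and the 5-point hysteresis gap are assumptions for illustration; the real normalize functions may differ.

```go
package main

import "fmt"

// Threshold is a hypothetical trigger/clear pair.
type Threshold struct {
	Trigger float64
	Clear   float64
}

// hysteresisGap is an assumed default distance between Trigger and Clear.
const hysteresisGap = 5.0

// normalizeThreshold keeps a disabled threshold (Trigger <= 0) disabled
// instead of overwriting it, and only repairs Clear for enabled thresholds.
func normalizeThreshold(t Threshold) Threshold {
	if t.Trigger <= 0 {
		return Threshold{} // disabled: skip hysteresis validation
	}
	if t.Clear <= 0 || t.Clear >= t.Trigger {
		t.Clear = t.Trigger - hysteresisGap
		if t.Clear < 0 {
			t.Clear = 0
		}
	}
	return t
}

func main() {
	fmt.Println(normalizeThreshold(Threshold{Trigger: 0, Clear: 50}))  // {0 0}
	fmt.Println(normalizeThreshold(Threshold{Trigger: 85, Clear: 90})) // {85 80}
}
```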
Related to #864
ClearActiveAlerts triggers an async save to disk, which can race with
LoadActiveAlerts reading the file. The test now clears the in-memory
map directly without triggering the async save.
- Rename checkFlapping to checkFlappingLocked to clarify lock contract
- Replace goto statements with structured control flow
- Wire up unused recordAlertFired/recordAlertResolved metric hooks
- Add trackingMapCleanup goroutine to prevent memory leaks from stale entries
- Tighten alert ID validation to alphanumeric + safe punctuation
- Fix history save error handling to properly manage backup lifecycle
- Add auto-migration for deprecated GroupingWindow field
- Refactor 300+ line UpdateConfig into focused helper functions
- Unify duplicate evaluateVMCondition/evaluateContainerCondition
- Add constants for magic numbers (thresholds, timing, flapping)
- Update tests to match new backup behavior
Add TestDispatchAlert with 8 test cases covering:
- Returns false when onAlert callback is nil
- Returns false when alert is nil
- Returns false when activation state is pending
- Returns false when activation state is snoozed
- Returns false for monitor-only alerts
- Dispatches synchronously when async is false
- Dispatches asynchronously when async is true
- Clones alert before dispatch
Alerts package coverage: 83.4% → 83.5%
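
The behaviour these cases cover can be sketched as follows. This is an assumed shape, not the actual dispatchAlert: the state field, its string values, and the MonitorOnly flag are hypothetical.

```go
package main

import "fmt"

// Alert is a hypothetical, trimmed-down alert.
type Alert struct {
	ID          string
	MonitorOnly bool
}

// Manager holds the callback and an assumed activation state string.
type Manager struct {
	onAlert func(*Alert)
	state   string // "active", "pending", or "snoozed" in this sketch
}

// dispatchAlert returns false when nothing should be sent; otherwise it
// hands a copy of the alert to the callback, optionally in a goroutine.
func (m *Manager) dispatchAlert(a *Alert, async bool) bool {
	if m.onAlert == nil || a == nil {
		return false
	}
	if m.state == "pending" || m.state == "snoozed" {
		return false
	}
	if a.MonitorOnly {
		return false
	}
	clone := *a // the callback never sees the caller's pointer
	if async {
		go m.onAlert(&clone)
	} else {
		m.onAlert(&clone)
	}
	return true
}

func main() {
	var got *Alert
	m := &Manager{state: "active", onAlert: func(a *Alert) { got = a }}
	orig := &Alert{ID: "cpu-high"}
	fmt.Println(m.dispatchAlert(orig, false))        // true
	fmt.Println(got != orig, got.ID)                 // true cpu-high
	fmt.Println(m.dispatchAlert(&Alert{MonitorOnly: true}, false)) // false
}
```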
Tests using NewManager() were sharing /etc/pulse/alerts, causing race
conditions when running in parallel. Added newTestManager(t) helper that
creates isolated temp directories for each test.