Two fixes for missing recovery/resolved notifications:
1. API config PUT handler now preserves notifyOnResolve when the client
omits it from the request body. Go decodes a missing bool as false,
which silently disabled recovery notifications on older clients.
2. CancelAlert now always cleans up the cooldown record even when the
alert has already left the pending buffer, preventing stale cooldown
entries from suppressing future alert cycles.
Escalation was calling SendAlert() which always sends to all enabled
channels, ignoring the per-level channel selection (email/webhook/all).
Add SendAlertToChannels() that snapshots only the requested channel
configs and uses a distinct "_escalation" queue type so the dequeue
handler skips cooldown writes — preventing interference with the alert
manager's own re-notify cadence.
Backport from v6 (88d5865a8). Recovery webhook notifications were using
the firing PayloadTemplate which services like Telegram, Teams, Discord
etc. silently rejected as malformed. Now uses a three-tier template
pipeline matching the firing path:
- Tier 1: Custom user template (if configured)
- Tier 2: Service-specific ResolvedPayloadTemplate (Discord green embed,
Telegram chat_id+text, Slack header blocks, Teams MessageCard/Adaptive,
PagerDuty event_action:"resolve", Pushover, Gotify, Mattermost)
- Tier 3: Generic JSON fallback (backward compatible)
Also adds Event, ResolvedAt, ResolvedAtISO fields to WebhookPayloadData.
Recovery notifications for Discord, Slack, Teams, PagerDuty, and other
service webhooks were sending a generic JSON payload that lacked the
required format (e.g. Discord needs `embeds`, Slack needs `blocks`),
causing resolved notifications to silently fail.
- Add `prepareResolvedWebhookData` to build template data with Level="resolved"
- Route resolved webhooks through service-specific templates with full
URL rendering, Telegram ChatID extraction, and PagerDuty routing_key
- Custom user templates take precedence over built-in service templates
- Return errors on service template failures instead of falling back to
generic payloads that endpoints would reject
- Fix PagerDuty template to send event_action="resolve" for resolved alerts
- fix(models): filter nodes by instance in UpdateNodesForInstance to prevent
PVE node duplication across poll cycles (#1214, #1192, #1217)
- fix(alerts): sort GetActiveAlerts output for stable ordering, preventing
hostname scrambling in frontend (#1218)
- fix(notifications): add ntfy-specific resolved webhook formatting with
plain-text body and proper headers (#1213)
- fix(frontend): respect "hide Docker update actions" setting in
DockerFilter Update All button (#1219)
- fix(frontend): add missing v prefix to GitHub release tag URLs (#1195)
- fix(monitoring): reduce disk detection warning from Warn to Debug to
eliminate log spam for pass-through disks (#1216)
- chore: bump VERSION to 5.1.5
- Replace Go stdlib smtp.PlainAuth (which refuses credentials without TLS)
with a custom plainAuth that respects the user's explicit transport choice
- Remove TLS guard from LoginAuth for the same reason
- Add RateLimit field to EmailConfig so the user's configured value is
persisted instead of being silently overwritten with 60
- Implement actual sanitization in the "Export for GitHub" diagnostics
button (was previously ignored — both exports produced identical data)
Related to #1189
Alerts previously showed the raw Proxmox node name (e.g., "on pve") even
when users configured a display name (e.g., "SPACEX") via Settings or the
host agent --hostname flag. This affected the alert UI, email notifications,
and webhook payloads.
Add NodeDisplayName field to the alert chain: cache display names in the
alert Manager (populated by CheckNode/CheckHost on every poll), resolve
them at alert creation via preserveAlertState, refresh on metric updates,
and enrich at read time in GetActiveAlerts. Update models.Alert, the
syncAlertsToState conversion, email templates, Apprise body text, webhook
payloads, and all frontend rendering paths.
Related to #1188
- Fix data race in webhook notifications by removing shared state
- Fix duplicate monitors on config reload by stopping old instances
- Prevent metrics ID deletion on transient startup errors
- Support Bearer auth header for config export/import endpoints
- Initialize Alert and Notification managers with tenant-specific data directories
- Add panic recovery to WebSocket safeSend for stability
- Record host metrics to history for sparkline support
Adds a Mention field to webhook configurations that allows users to tag
individuals or groups when alerts are sent. This works with:
- Discord: @everyone, <@USER_ID>, <@&ROLE_ID>
- Microsoft Teams: @General, user email
- Mattermost: @channel, @all, @username
The mention is included in the webhook payload via the {{.Mention}} template
variable. Built-in templates for Discord, Slack, and Teams now conditionally
include mentions when configured.
Backend changes:
- Add Mention field to WebhookConfig struct
- Add Mention field to WebhookPayloadData for template access
- Pass mention through sendGroupedWebhook
Frontend changes:
- Add mention field to Webhook interface
- Add Mention input to webhook configuration form
- Show service-specific help text for mention formats
- Merge variable declaration with assignment (S1021)
- Use unconditional strings.TrimPrefix (S1017)
- Remove unnecessary nil checks around range (S1031)
- Remove unnecessary fmt.Sprintf (S1039)
- Use copy() instead of manual loop (S1001)
- Use time.Until instead of t.Sub(time.Now()) (S1024)
- Use buf.String() instead of string(buf.Bytes()) (S1030)
Critical deadlock fix:
- Stop() was holding n.mu lock while calling queue.Stop()
- queue.Stop() waits for worker goroutines to finish
- Worker goroutines call ProcessQueuedNotification() which needs n.mu lock
- This created a classic lock-order deadlock
Fix:
- Unlock n.mu before calling queue.Stop()
- Relock after queue shutdown completes
- Workers can now finish and acquire lock as needed
This resolves 30-second test timeouts in notifications package.
Tests now complete in <1s instead of timing out at 30s.
Three categories of fixes:
1. Goroutine leak causing 10-minute timeout:
- Add defer mon.notificationMgr.Stop() in monitor_memory_test.go
- Background goroutines from notification manager weren't being stopped
2. Database NULL column scanning errors:
- Change LastError from string to *string in queue.go
- Change PayloadBytes from int to *int in queue.go
- SQL NULL values require pointer types in Go
3. SSRF protection blocking test servers:
- Check allowlist for localhost before rejecting in notifications.go
- Set PULSE_DATA_DIR to temp directory in tests
- Add defer nm.Stop() calls to prevent goroutine leaks
Fixes for preflight test failures in workflow run 19280879903.
Allow homelab users to send webhooks to internal services while maintaining security defaults.
Changes:
- Add webhookAllowedPrivateCIDRs field to SystemSettings (persistent config)
- Implement CIDR parsing and validation in NotificationManager
- Convert ValidateWebhookURL to instance method to access allowlist
- Add UI controls in System Settings for configuring trusted CIDR ranges
- Maintain strict security by default (block all private IPs)
- Keep localhost, link-local, and cloud metadata services blocked regardless of allowlist
- Re-validate on both config save and webhook delivery (DNS rebinding protection)
- Add comprehensive tests for CIDR parsing and IP matching
Backend:
- UpdateAllowedPrivateCIDRs() parses comma-separated CIDRs with validation
- Support for bare IPs (auto-converts to /32 or /128)
- Thread-safe allowlist updates with RWMutex
- Logging when allowlist is updated or used
- Validation errors prevent invalid CIDRs from being saved
Frontend:
- New "Webhook Security" section in System Settings
- Input field with examples and helpful placeholder text
- Real-time unsaved changes tracking
- Loads and saves allowlist via system settings API
Security:
- Default behavior unchanged (all private IPs blocked)
- Explicit opt-in required via configuration
- Localhost (127/8) always blocked
- Link-local (169.254/16) always blocked
- Cloud metadata services always blocked
- DNS resolution checked at both save and send time
Testing:
- Tests for CIDR parsing (valid/invalid inputs)
- Tests for IP allowlist matching
- Tests for bare IP address handling
- Tests for security boundaries (localhost, link-local remain blocked)
Related to #673🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Queue cancellation mechanism:
- Add CancelByAlertIDs method to mark queued notifications as cancelled when alerts resolve
- Update CancelAlert to cancel queued notifications containing resolved alert IDs
- Skip cancelled notifications in queue processor
- Prevents resolved alerts from triggering notifications after they clear
Atomic DB operations:
- Add IncrementAttemptAndSetStatus to atomically update attempt counter and status
- Replace separate IncrementAttempt + UpdateStatus calls with single atomic operation
- Prevents orphaned queue entries when crashes occur between operations
- Eliminates race condition where rows get stuck in "pending" or "sending" status
These fixes ensure queued notifications are properly cancelled when alerts resolve
and prevent database inconsistencies during crash scenarios.
Critical fixes (P0):
- Fix cooldown timing: Mark cooldown only after successful delivery, not before enqueue
- Add os.MkdirAll to queue initialization to prevent silent failures on fresh installs
- Add DNS re-validation at webhook send time to prevent DNS rebinding SSRF attacks
- Add SSRF validation for Apprise HTTP URLs
- Remove secret logging (bot tokens, routing keys) from debug logs
- Implement lastNotified cleanup to prevent unbounded memory growth
- Use shared HTTP client for webhooks to enable TLS connection reuse
- Add fallback to direct sending when queue enqueue fails
- Make queue worker concurrent (5 workers with semaphore) to prevent head-of-line blocking
- Fix webhook rate limiter race condition with separate mutex
- Fix email manager thread safety with mutex on rate limiter
- Fix grouping timer leak by adding stopCleanup signal
- Fix webhook 429 double sleep (use Retry-After OR backoff, not both)
Frontend improvements:
- Add queue/DLQ management API methods (getQueueStats, getDLQ, retryDLQItem, deleteDLQItem)
- Add getNotificationHealth and getWebhookHistory endpoints
- Add Apprise test support to NotificationTestRequest type
Related to notification system audit
Remove 4 LLM-generated internal development docs that don't belong in the repository:
- MIGRATION_SCAFFOLDING.md
- NOTIFICATION_AUDIT.md
- NOTIFICATION_QUICK_REFERENCE.md
- NOTIFICATION_SYSTEM_MAP.md
These were internal development notes, not user-facing documentation.
Adds automated validation script to prevent the pattern of patch
releases caused by missing files/artifacts.
scripts/validate-release.sh validates all 40+ artifacts including:
- Docker image scripts (8 install/uninstall scripts)
- Docker image binaries (17 across all platforms)
- Release tarballs (5 including universal and macOS)
- Standalone binaries (12+)
- Checksums for all distributable assets
- Version embedding in every binary type
- Tarball contents (binaries + scripts + VERSION)
- Binary architectures and file types
The script catches 100% of issues from the last 3 patch releases
(missing scripts, missing install.sh, missing binaries, broken
version embedding).
Updated RELEASE_CHECKLIST.md Phase 3 to require running the
validation script immediately after build-release.sh and before
proceeding to Docker build/publish phases.
Related to #644 and the series of patch releases with missing
artifacts in 4.26.x.
Webhook alert payloads now round Value and Threshold fields to 1 decimal
place before template rendering. This eliminates excessive precision in
webhook messages (e.g., 62.27451680630036 becomes 62.3).
The fix is applied in prepareWebhookData() so all webhook templates
benefit automatically, including Google Space webhooks, generic JSON
webhooks, and custom templates.
Related to #619
- Add support for testing Apprise notifications via /api/notifications/test endpoint
- Users can now test their Apprise configuration (both CLI and HTTP modes) using method="apprise"
- Added comprehensive unit tests for both CLI and HTTP modes
- Tests verify correct behavior when Apprise is enabled/disabled
- Tests validate that notifications are properly sent through Apprise channels
Related to #584