Security fixes:
- Auto-register now requires settings:write scope for API tokens
- Auto-register now trusts X-Forwarded-For only from verified proxies
- Public URL capture requires authentication (no loopback bypass)
- Lockout reset now uses RequireAdmin for session users
Reliability fixes:
- Docker stop command expiration clears PendingUninstall flag
- Cancelled notifications get completed_at set and are cleaned up
- Move all SQLite pragmas from db.Exec() to DSN parameters so every
connection the pool creates gets busy_timeout and other settings.
Previously only the first connection had these applied.
- Set MaxOpenConns(1) on audit, RBAC, and notification databases
(metrics already had this). Prevents the pool from opening additional
connections that lack busy_timeout.
- Increase busy_timeout from 5s to 30s across all databases to
tolerate disk I/O pressure during backup windows.
- Fix nested queries in GetRoles(), GetUserAssignments(), and
CancelByAlertIDs() that would deadlock with MaxOpenConns(1).
- Fix circuit breaker retryInterval not resetting on recovery, which
caused the next trip to start at 5-minute backoff instead of 5s.
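The retryInterval fix can be illustrated with a minimal sketch. The breaker type, its method names, and the doubling policy are assumptions made for illustration; only the 5s starting interval and the 5-minute ceiling come from the message above.

```go
package main

import (
	"fmt"
	"time"
)

const (
	initialRetryInterval = 5 * time.Second
	maxRetryInterval     = 5 * time.Minute
)

// breaker is a hypothetical, stripped-down circuit breaker. Only the
// backoff bookkeeping is modelled here.
type breaker struct {
	open          bool
	retryInterval time.Duration
}

func newBreaker() *breaker {
	return &breaker{retryInterval: initialRetryInterval}
}

// trip opens the breaker and returns how long to wait before the next
// probe. Each consecutive trip doubles the wait, capped at the maximum.
func (b *breaker) trip() time.Duration {
	b.open = true
	wait := b.retryInterval
	b.retryInterval *= 2
	if b.retryInterval > maxRetryInterval {
		b.retryInterval = maxRetryInterval
	}
	return wait
}

// recordSuccess closes the breaker. Resetting retryInterval here is the
// step the fix describes: without it the next trip would start from
// whatever backoff the previous outage had reached.
func (b *breaker) recordSuccess() {
	b.open = false
	b.retryInterval = initialRetryInterval
}

func main() {
	b := newBreaker()
	for i := 0; i < 8; i++ {
		b.trip()
	}
	fmt.Println("next wait after repeated failures:", b.retryInterval)
	b.recordSuccess()
	fmt.Println("first wait after recovery:", b.trip())
}
```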
Related to #1156
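A rough sketch of the DSN approach from the pragma fix above. It assumes the modernc.org/sqlite driver, which accepts repeated _pragma=name(value) query parameters; other drivers use different parameter names. The helper name, file path, and the journal_mode pragma are illustrative only.

```go
package main

import (
	"fmt"
	"strings"
)

// pragma is one PRAGMA that every new connection should run.
type pragma struct {
	name, value string
}

// buildDSN returns a file: DSN carrying the pragmas as query parameters.
// Because the pragmas live in the DSN, the driver applies them each time
// the pool opens a connection, not just on the first one.
// The path is not escaped; this sketch assumes it has no '?' or '&'.
func buildDSN(path string, pragmas []pragma) string {
	if len(pragmas) == 0 {
		return "file:" + path
	}
	params := make([]string, 0, len(pragmas))
	for _, p := range pragmas {
		params = append(params, fmt.Sprintf("_pragma=%s(%s)", p.name, p.value))
	}
	return "file:" + path + "?" + strings.Join(params, "&")
}

func main() {
	dsn := buildDSN("/var/lib/example/audit.db", []pragma{
		{"busy_timeout", "30000"}, // milliseconds, i.e. 30s
		{"journal_mode", "WAL"},   // assumed here, not stated in the message
	})
	fmt.Println(dsn)
	// With the driver imported, the pool would then be opened roughly as:
	//   db, err := sql.Open("sqlite", dsn)
	//   db.SetMaxOpenConns(1)
}
```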
Switch from mattn/go-sqlite3 (CGO) to modernc.org/sqlite (pure Go)
for auth, audit, and notification queue storage. This enables SQLite
functionality on arm64 Docker images, which are built with CGO_ENABLED=0.
Related to #1140
Fixed a deadlock where CancelByAlertIDs held nq.mu and then called
UpdateStatus(), which tried to acquire the same lock. CancelByAlertIDs now
runs the SQL update directly while holding the lock.
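The shape of this fix, sketched with an in-memory map standing in for the SQL table. The queue type, its fields, and the setStatusLocked helper are hypothetical. The underlying point is that sync.Mutex is not reentrant, so a method that already holds the lock must not call another method that locks it again.

```go
package main

import (
	"fmt"
	"sync"
)

// queue is a hypothetical stand-in for the notification queue.
type queue struct {
	mu     sync.Mutex
	status map[string]string   // notification ID -> status
	alerts map[string][]string // notification ID -> alert IDs it carries
}

func newQueue() *queue {
	return &queue{status: map[string]string{}, alerts: map[string][]string{}}
}

func (q *queue) add(id string, alertIDs []string) {
	q.mu.Lock()
	defer q.mu.Unlock()
	q.status[id] = "pending"
	q.alerts[id] = alertIDs
}

// UpdateStatus is the public entry point and takes the lock itself.
func (q *queue) UpdateStatus(id, status string) {
	q.mu.Lock()
	defer q.mu.Unlock()
	q.setStatusLocked(id, status)
}

// setStatusLocked assumes the caller already holds q.mu.
func (q *queue) setStatusLocked(id, status string) {
	q.status[id] = status
}

// CancelByAlertIDs cancels pending notifications that carry any of the
// given alert IDs and returns how many were cancelled. Calling
// UpdateStatus from here would block forever on the second Lock, so the
// write goes through the lock-free helper instead.
func (q *queue) CancelByAlertIDs(alertIDs []string) int {
	q.mu.Lock()
	defer q.mu.Unlock()

	want := make(map[string]bool, len(alertIDs))
	for _, a := range alertIDs {
		want[a] = true
	}
	cancelled := 0
	for id, ids := range q.alerts {
		if q.status[id] != "pending" {
			continue
		}
		for _, a := range ids {
			if want[a] {
				q.setStatusLocked(id, "cancelled")
				cancelled++
				break
			}
		}
	}
	return cancelled
}

func main() {
	q := newQueue()
	q.add("n1", []string{"alert-a", "alert-b"})
	fmt.Println(q.CancelByAlertIDs([]string{"alert-b"}), q.status["n1"])
}
```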
Tests added for CancelByAlertIDs:
- No matching notifications (notification stays pending)
- Matching notification cancelled
- Multiple alerts with partial match (any match cancels)
Coverage: CancelByAlertIDs 65.7% -> 81.1%
Three categories of fixes:
1. Goroutine leak causing 10-minute timeout:
- Add defer mon.notificationMgr.Stop() in monitor_memory_test.go
- Background goroutines from notification manager weren't being stopped
2. Database NULL column scanning errors:
- Change LastError from string to *string in queue.go
- Change PayloadBytes from int to *int in queue.go
- Scanning SQL NULL into a plain string or int fails in Go; pointer types accept NULL as nil
3. SSRF protection blocking test servers:
- Check allowlist for localhost before rejecting in notifications.go
- Set PULSE_DATA_DIR to temp directory in tests
- Add defer nm.Stop() calls to prevent goroutine leaks
Fixes for preflight test failures in workflow run 19280879903.
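For the NULL scanning change in category 2, a small sketch of nullable columns as pointer fields. The struct, the column order, and the accessor names are assumptions, not the actual queue.go definitions. database/sql sets a pointer destination to nil when the column is NULL, whereas scanning NULL into a plain string or int returns an error.

```go
package main

import (
	"database/sql"
	"fmt"
)

// queuedRow is a hypothetical row type with two nullable columns.
type queuedRow struct {
	ID           string
	LastError    *string // nil when the column is NULL
	PayloadBytes *int    // nil when the column is NULL
}

// scanQueuedRow reads one row. The column order is assumed to be
// id, last_error, payload_bytes.
func scanQueuedRow(rows *sql.Rows) (queuedRow, error) {
	var r queuedRow
	err := rows.Scan(&r.ID, &r.LastError, &r.PayloadBytes)
	return r, err
}

// lastErrorOrEmpty hides the nil check from callers that only want text.
func (r queuedRow) lastErrorOrEmpty() string {
	if r.LastError == nil {
		return ""
	}
	return *r.LastError
}

// payloadSize returns 0 when the size was never recorded.
func (r queuedRow) payloadSize() int {
	if r.PayloadBytes == nil {
		return 0
	}
	return *r.PayloadBytes
}

func main() {
	msg, size := "connection refused", 512
	failed := queuedRow{ID: "n1", LastError: &msg, PayloadBytes: &size}
	fresh := queuedRow{ID: "n2"} // both nullable columns were NULL

	fmt.Printf("%s: %q, %d bytes\n", failed.ID, failed.lastErrorOrEmpty(), failed.payloadSize())
	fmt.Printf("%s: %q, %d bytes\n", fresh.ID, fresh.lastErrorOrEmpty(), fresh.payloadSize())
}
```

sql.NullString and sql.NullInt64 are the standard-library alternative when a nil pointer is not wanted in the struct.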
Queue cancellation mechanism:
- Add CancelByAlertIDs method to mark queued notifications as cancelled when alerts resolve
- Update CancelAlert to cancel queued notifications containing resolved alert IDs
- Skip cancelled notifications in queue processor
- Prevents resolved alerts from triggering notifications after they clear
Atomic DB operations:
- Add IncrementAttemptAndSetStatus to atomically update attempt counter and status
- Replace separate IncrementAttempt + UpdateStatus calls with single atomic operation
- Prevents orphaned queue entries when crashes occur between operations
- Eliminates race condition where rows get stuck in "pending" or "sending" status
These fixes ensure queued notifications are properly cancelled when alerts resolve
and prevent database inconsistencies during crash scenarios.
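A minimal sketch of the single-statement update described under "Atomic DB operations". The table and column names, the execer interface, and the recording fake are assumptions for illustration; the real IncrementAttemptAndSetStatus may differ. Both columns change in one UPDATE, so a crash cannot leave the attempt counter bumped while the status is stale.

```go
package main

import (
	"database/sql"
	"fmt"
)

// execer is the subset of *sql.DB and *sql.Tx this sketch needs.
type execer interface {
	Exec(query string, args ...any) (sql.Result, error)
}

// Hypothetical schema: notification_queue(id, attempts, status).
const incrementAndSetStatusSQL = `
UPDATE notification_queue
SET attempts = attempts + 1,
    status   = ?
WHERE id = ?`

// incrementAttemptAndSetStatus bumps the attempt counter and sets the
// status in one statement instead of two separate calls.
func incrementAttemptAndSetStatus(db execer, id, status string) error {
	res, err := db.Exec(incrementAndSetStatusSQL, status, id)
	if err != nil {
		return fmt.Errorf("update notification %s: %w", id, err)
	}
	n, err := res.RowsAffected()
	if err != nil {
		return fmt.Errorf("rows affected for %s: %w", id, err)
	}
	if n == 0 {
		return fmt.Errorf("notification %s not found", id)
	}
	return nil
}

// fakeResult and recordingDB let the sketch run without a SQLite driver.
type fakeResult int64

func (r fakeResult) LastInsertId() (int64, error) { return 0, nil }
func (r fakeResult) RowsAffected() (int64, error) { return int64(r), nil }

type recordingDB struct {
	rows  int64    // rows each Exec pretends to affect
	calls []string // statements received, in order
}

func (d *recordingDB) Exec(query string, args ...any) (sql.Result, error) {
	d.calls = append(d.calls, query)
	return fakeResult(d.rows), nil
}

func main() {
	db := &recordingDB{rows: 1}
	err := incrementAttemptAndSetStatus(db, "n1", "sending")
	fmt.Println("error:", err, "statements issued:", len(db.calls))
}
```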
Critical fixes (P0):
- Fix cooldown timing: Mark cooldown only after successful delivery, not before enqueue
- Add os.MkdirAll to queue initialization to prevent silent failures on fresh installs
- Add DNS re-validation at webhook send time to prevent DNS rebinding SSRF attacks
- Add SSRF validation for Apprise HTTP URLs
- Remove secret logging (bot tokens, routing keys) from debug logs
- Implement lastNotified cleanup to prevent unbounded memory growth
- Use shared HTTP client for webhooks to enable TLS connection reuse
- Add fallback to direct sending when queue enqueue fails
- Make queue worker concurrent (5 workers with semaphore) to prevent head-of-line blocking
- Fix webhook rate limiter race condition with separate mutex
- Fix email manager thread safety with mutex on rate limiter
- Fix grouping timer leak by adding stopCleanup signal
- Fix webhook 429 double sleep (use Retry-After OR backoff, not both)
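The 429 handling in the last item can be sketched as a function that picks exactly one delay. The function names, the 1s base, the 1-minute cap, and capping the server-supplied value are assumptions; the message only states that Retry-After and backoff must not both be applied.

```go
package main

import (
	"fmt"
	"net/http"
	"strconv"
	"time"
)

const (
	baseBackoff = time.Second
	maxBackoff  = time.Minute
)

// capDelay clamps d to the range [0, maxBackoff].
func capDelay(d time.Duration) time.Duration {
	if d < 0 {
		return 0
	}
	if d > maxBackoff {
		return maxBackoff
	}
	return d
}

// backoff returns baseBackoff doubled once per previous attempt, capped.
func backoff(attempt int) time.Duration {
	d := baseBackoff
	for i := 0; i < attempt && d < maxBackoff; i++ {
		d *= 2
	}
	return capDelay(d)
}

// retryDelay returns ONE wait: the server's Retry-After when it parses
// (delta-seconds or HTTP-date), otherwise exponential backoff. The two
// are never added together.
func retryDelay(h http.Header, attempt int, now time.Time) time.Duration {
	if v := h.Get("Retry-After"); v != "" {
		if secs, err := strconv.Atoi(v); err == nil && secs >= 0 {
			return capDelay(time.Duration(secs) * time.Second)
		}
		if t, err := http.ParseTime(v); err == nil {
			return capDelay(t.Sub(now))
		}
	}
	return backoff(attempt)
}

// delayFor is a convenience wrapper for callers that only have the raw
// header value; an empty string means the header was absent.
func delayFor(retryAfter string, attempt int) time.Duration {
	h := http.Header{}
	if retryAfter != "" {
		h.Set("Retry-After", retryAfter)
	}
	return retryDelay(h, attempt, time.Now())
}

func main() {
	fmt.Println("server says 7s:  ", delayFor("7", 3))
	fmt.Println("no header:       ", delayFor("", 3))
	fmt.Println("unparsable value:", delayFor("soon", 3))
}
```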
Frontend improvements:
- Add queue/DLQ management API methods (getQueueStats, getDLQ, retryDLQItem, deleteDLQItem)
- Add getNotificationHealth and getWebhookHistory endpoints
- Add Apprise test support to NotificationTestRequest type
Related to notification system audit
Remove 4 LLM-generated internal development docs that don't belong in the repository:
- MIGRATION_SCAFFOLDING.md
- NOTIFICATION_AUDIT.md
- NOTIFICATION_QUICK_REFERENCE.md
- NOTIFICATION_SYSTEM_MAP.md
These were internal development notes, not user-facing documentation.