mirror of
https://github.com/rcourtman/Pulse.git
synced 2026-05-22 03:02:35 +00:00
Remove internal development documentation files
Remove 4 LLM-generated internal development docs that don't belong in the repository: - MIGRATION_SCAFFOLDING.md - NOTIFICATION_AUDIT.md - NOTIFICATION_QUICK_REFERENCE.md - NOTIFICATION_SYSTEM_MAP.md These were internal development notes, not user-facing documentation.
This commit is contained in:
parent
9d2bad3af6
commit
febce91145
6 changed files with 60 additions and 2013 deletions
|
|
@ -1,35 +0,0 @@
|
|||
# Migration Scaffolding Tracker
|
||||
|
||||
This document tracks temporary code added to handle migration paths between versions. All code listed here should be removed according to the specified criteria.
|
||||
|
||||
**Purpose**: These features exist solely to assist users in migrating from old patterns to new ones. They serve no functional purpose beyond migration assistance and represent technical debt that should be cleaned up once the migration period is complete.
|
||||
|
||||
---
|
||||
|
||||
## Active Migration Code
|
||||
|
||||
## Removed Migration Code
|
||||
|
||||
- Legacy SSH detection banner (removed in cleanup)
|
||||
|
||||
---
|
||||
|
||||
## Guidelines for Adding New Migration Code
|
||||
|
||||
When adding new migration scaffolding:
|
||||
|
||||
1. **Mark it clearly** with `⚠️ MIGRATION SCAFFOLDING - TEMPORARY CODE` comments
|
||||
2. **Add it to this document** with:
|
||||
- Why it exists
|
||||
- Files involved
|
||||
- Removal criteria
|
||||
- Removal instructions
|
||||
3. **Set explicit removal criteria**:
|
||||
- Version number target
|
||||
- Time-based target (e.g., "6 months after release")
|
||||
- Data-driven target (e.g., "when <1% users affected")
|
||||
4. **Add a kill switch**: Environment variable or feature flag to disable without redeployment
|
||||
5. **Consider telemetry**: If removal depends on data, instrument the code to collect that data
|
||||
6. **Create a removal task**: Add to issue tracker with assigned owner and date
|
||||
|
||||
**Remember**: Migration code is technical debt. Make it easy to find and remove later.
|
||||
|
|
@ -1,527 +0,0 @@
|
|||
# Pulse Notification System - Comprehensive Security & Architecture Audit
|
||||
|
||||
**Date:** 2025-11-06
|
||||
**Auditors:** Claude Code + OpenAI Codex
|
||||
**Scope:** Complete webhook and email notification system (backend + frontend)
|
||||
|
||||
---
|
||||
|
||||
## Executive Summary
|
||||
|
||||
The Pulse notification system is architecturally sound with sophisticated features (persistent queue, retry logic, SSRF protection, rate limiting). However, it contains **critical correctness, security, and reliability issues** that undermine the guarantees it promises to users.
|
||||
|
||||
**Key findings:**
|
||||
- **Correctness**: Cooldown timing marked before actual delivery causes silent notification drops
|
||||
- **Security**: DNS rebinding vulnerability in webhook delivery, secrets logged in plaintext
|
||||
- **Reliability**: Queue initialization fails silently, single-threaded worker creates head-of-line blocking
|
||||
- **Observability**: Queue/DLQ features exist but are completely hidden from UI
|
||||
- **Concurrency**: Multiple race conditions in email manager, webhook rate limiter, and shared state
|
||||
|
||||
**Recommendation:** Address all P0 issues before the next release. P1 issues should be fixed within 2 releases.
|
||||
|
||||
---
|
||||
|
||||
## Priority 0 - Critical Issues (Fix Immediately)
|
||||
|
||||
### 1. Cooldown Marked Before Delivery Success
|
||||
**File:** `internal/notifications/notifications.go:649-656`
|
||||
**Severity:** Critical - Silent notification loss
|
||||
|
||||
**Issue:**
|
||||
`sendGroupedAlerts` marks `lastNotified[alert.ID]` immediately after enqueuing (line 649-656), before confirming the queue accepted the notification or delivery succeeded. If SQLite rejects the enqueue or the notification later moves to DLQ, subsequent `SendAlert` calls see the cooldown and bail, causing complete silence for 5 minutes.
|
||||
|
||||
**Impact:**
|
||||
Users experience alert suppression when notifications fail - Pulse appears to acknowledge incidents but never notifies anyone.
|
||||
|
||||
**Root cause:**
|
||||
```go
|
||||
// Line 642-656
|
||||
if n.queue != nil {
|
||||
n.enqueueNotifications(emailConfig, webhooks, appriseConfig, alertsToSend)
|
||||
} else {
|
||||
n.sendNotificationsDirect(emailConfig, webhooks, appriseConfig, alertsToSend)
|
||||
}
|
||||
|
||||
// Update last notified time for all alerts
|
||||
now := time.Now()
|
||||
for _, alert := range alertsToSend {
|
||||
n.lastNotified[alert.ID] = notificationRecord{
|
||||
lastSent: now,
|
||||
alertStart: alert.StartTime,
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
**Fix:**
|
||||
Move cooldown stamping to:
|
||||
1. Queue path: After successful enqueue confirmation + after successful dequeue/send
|
||||
2. Direct path: After actual delivery success
|
||||
|
||||
---
|
||||
|
||||
### 2. Queue Directory Not Created
|
||||
**File:** `internal/notifications/queue.go:62-68`
|
||||
**Severity:** Critical - Queue silently disabled on fresh installs
|
||||
|
||||
**Issue:**
|
||||
`NewNotificationQueue` assumes `/etc/pulse/notifications` (or `utils.GetDataDir()/notifications`) exists. It never calls `os.MkdirAll`, so `sql.Open` fails with "unable to open database file" on fresh installs or bare-metal deployments.
|
||||
|
||||
**Impact:**
|
||||
Queue initialization fails silently, falling back to fire-and-forget sending with no retry capability. The "persistent queue" feature is disabled by default on most installations.
|
||||
|
||||
**Code:**
|
||||
```go
|
||||
// Line 62-68
|
||||
if dataDir == "" {
|
||||
dataDir = filepath.Join(utils.GetDataDir(), "notifications")
|
||||
}
|
||||
dbPath := filepath.Join(dataDir, "notification_queue.db")
|
||||
db, err := sql.Open("sqlite3", dbPath) // FAILS if directory doesn't exist
|
||||
```
|
||||
|
||||
**Fix:**
|
||||
```go
|
||||
if dataDir == "" {
|
||||
dataDir = filepath.Join(utils.GetDataDir(), "notifications")
|
||||
}
|
||||
if err := os.MkdirAll(dataDir, 0755); err != nil {
|
||||
return nil, fmt.Errorf("failed to create queue directory: %w", err)
|
||||
}
|
||||
dbPath := filepath.Join(dataDir, "notification_queue.db")
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### 3. DNS Rebinding Vulnerability (SSRF)
|
||||
**Files:** `internal/api/notifications.go:204, 261` + `internal/notifications/notifications.go:1396-1533`
|
||||
**Severity:** Critical - SSRF bypass via DNS rebinding
|
||||
|
||||
**Issue:**
|
||||
`ValidateWebhookURL` runs only during webhook creation/update but never at send time. An attacker can create a webhook pointing to a legitimate domain, then later update DNS to point to `169.254.169.254` (cloud metadata), `127.0.0.1`, or private IPs. The validation is never re-run in `sendWebhookRequest`.
|
||||
|
||||
**Attack scenario:**
|
||||
1. Admin creates webhook for `https://attacker.com/webhook` (passes validation)
|
||||
2. Attacker updates DNS: `attacker.com` → `169.254.169.254`
|
||||
3. Alert triggers → Pulse POSTs to cloud metadata service with OAuth tokens
|
||||
|
||||
**Code path:**
|
||||
- Validation: `api/notifications.go:204` (CreateWebhook), `261` (UpdateWebhook)
|
||||
- Send: `notifications.go:1429` - `http.NewRequest` uses URL as-is, no re-validation
|
||||
|
||||
**Fix:**
|
||||
Re-run `ValidateWebhookURL` at send time in `sendWebhookRequest` before creating HTTP request.
|
||||
|
||||
**Apprise SSRF:**
|
||||
Apprise HTTP mode (`sendAppriseViaHTTP` at line 904-949) has the same issue - no validation of `serverUrl` against private IPs.
|
||||
|
||||
---
|
||||
|
||||
### 4. Secrets Logged in Plaintext
|
||||
**File:** `internal/notifications/notifications.go:1458-1465`
|
||||
**Severity:** Critical - Credential exposure in logs
|
||||
|
||||
**Issue:**
|
||||
Debug logging for Telegram/Gotify webhooks logs the full URL (containing bot tokens) and complete JSON payload (containing routing keys, API keys) at debug level.
|
||||
|
||||
**Code:**
|
||||
```go
|
||||
// Line 1458-1465
|
||||
if webhook.Service == "telegram" || webhook.Service == "gotify" {
|
||||
log.Debug().
|
||||
Str("webhook", webhook.Name).
|
||||
Str("service", webhook.Service).
|
||||
Str("url", webhookURL). // Contains bot token
|
||||
Str("payload", string(jsonData)). // Contains all secrets
|
||||
Msg("Sending webhook with payload")
|
||||
}
|
||||
```
|
||||
|
||||
**Example leaked data:**
|
||||
- Telegram: `https://api.telegram.org/bot123456:ABC-DEF1234ghIkl-zyx57W2v1u123ew11/sendMessage`
|
||||
- PagerDuty: `{"routing_key": "R123ABC456DEF789GHI012JKL"}`
|
||||
|
||||
**Fix:**
|
||||
Remove debug logging or redact using existing `redactSecretsFromURL` function.
|
||||
|
||||
---
|
||||
|
||||
### 5. Memory Leak - Unbounded `lastNotified` Map
|
||||
**File:** `internal/notifications/notifications.go:649-656`
|
||||
**Severity:** Critical - Memory leak causing eventual OOM
|
||||
|
||||
**Issue:**
|
||||
The `lastNotified` map grows indefinitely. Every alert ID is added on send (line 652-655) but never removed when alerts resolve. Long-running clusters accumulate every historical alert ID.
|
||||
|
||||
**Impact:**
|
||||
Memory usage grows unbounded, mutex contention increases proportionally.
|
||||
|
||||
**Fix:**
|
||||
Delete entries in:
|
||||
1. `CancelAlert` - when alert resolves
|
||||
2. Queue processor - after DLQ or final success
|
||||
3. Periodic cleanup - purge entries older than 24 hours
|
||||
|
||||
---
|
||||
|
||||
### 6. HTTP Client Created Per Request
|
||||
**File:** `internal/notifications/notifications.go:1469`
|
||||
**Severity:** High - Performance degradation
|
||||
|
||||
**Issue:**
|
||||
Every webhook send creates a new `http.Client` via `createSecureWebhookClient`, preventing TLS connection reuse and causing repeated TLS handshakes.
|
||||
|
||||
**Code:**
|
||||
```go
|
||||
// Line 1469
|
||||
client := createSecureWebhookClient(WebhookTimeout)
|
||||
resp, err := client.Do(req)
|
||||
```
|
||||
|
||||
**Impact:**
|
||||
High-volume webhook endpoints experience significant CPU overhead and latency from repeated TLS handshakes.
|
||||
|
||||
**Fix:**
|
||||
Create a shared `http.Client` at NotificationManager initialization, reuse for all webhook requests.
|
||||
|
||||
---
|
||||
|
||||
### 7. Queue Enqueue Failures Only Logged
|
||||
**File:** `internal/notifications/notifications.go:660-718`
|
||||
**Severity:** High - Silent notification loss
|
||||
|
||||
**Issue:**
|
||||
When `queue.Enqueue()` fails (disk full, SQLite error), the error is only logged - no metric, no retry, no fallback to direct sending.
|
||||
|
||||
**Code:**
|
||||
```go
|
||||
// Line 683-686
|
||||
if err := n.queue.Enqueue(notif); err != nil {
|
||||
log.Error().Err(err).Str("type", "email").Msg("Failed to enqueue email notification")
|
||||
}
|
||||
```
|
||||
|
||||
**Impact:**
|
||||
Transient disk errors silently drop entire batches of alerts.
|
||||
|
||||
**Fix:**
|
||||
1. Fall back to direct sending when enqueue fails
|
||||
2. Expose queue health metric in `/api/notifications/health`
|
||||
3. Surface queue errors in UI
|
||||
|
||||
---
|
||||
|
||||
### 8. No Queue Cancellation Mechanism
|
||||
**Files:** `internal/notifications/notifications.go:571-614` + `internal/notifications/queue.go`
|
||||
**Severity:** High - Resolved alerts still trigger notifications
|
||||
|
||||
**Issue:**
|
||||
`CancelAlert` only removes alerts from `pendingAlerts` buffer (line 582-618). Once alerts are serialized to SQLite queue, there's no mechanism to mark them cancelled. The queue will still send notifications for resolved incidents.
|
||||
|
||||
**Scenario:**
|
||||
1. Alert triggers, enters grouping window
|
||||
2. 10 seconds later, alert resolves → `CancelAlert` removes from pending
|
||||
3. 20 seconds later, grouping window expires, alert already in queue
|
||||
4. Queue sends notification for an incident that cleared 20 seconds ago
|
||||
|
||||
**Fix:**
|
||||
Add cancellation tracking:
|
||||
1. Add `cancelled` status to queue
|
||||
2. When `CancelAlert` fires, mark queued notifications as cancelled
|
||||
3. Queue processor skips cancelled items
|
||||
|
||||
---
|
||||
|
||||
### 9. Single-Threaded Queue Worker
|
||||
**File:** `internal/notifications/queue.go:520-534`
|
||||
**Severity:** High - Head-of-line blocking
|
||||
|
||||
**Issue:**
|
||||
Queue processor is single-threaded, processes max 10 items per 5-second tick. Slow SMTP/webhook calls (10s timeout) block the entire pipeline.
|
||||
|
||||
**Code:**
|
||||
```go
|
||||
// Line 520-534
|
||||
pending, err := nq.GetPending(10) // Max 10 items
|
||||
// ...
|
||||
for _, notif := range pending {
|
||||
nq.processNotification(notif) // Sequential, blocking
|
||||
}
|
||||
```
|
||||
|
||||
**Impact:**
|
||||
A single slow webhook endpoint (10s response time) reduces throughput to 6 notifications/minute instead of documented 120/minute.
|
||||
|
||||
**Fix:**
|
||||
Process notifications concurrently with worker pool (e.g., 5 workers).
|
||||
|
||||
---
|
||||
|
||||
### 10. Queue DB Operations Not Atomic
|
||||
**File:** `internal/notifications/queue.go:538-548`
|
||||
**Severity:** High - Orphaned queue entries on crash
|
||||
|
||||
**Issue:**
|
||||
`processNotification` updates attempts and status in separate SQL statements. Crashes between these operations leave orphaned rows.
|
||||
|
||||
**Code:**
|
||||
```go
|
||||
// Line 538-548
|
||||
if err := nq.IncrementAttempt(notif.ID); err != nil { ... } // UPDATE 1
|
||||
if err := nq.UpdateStatus(notif.ID, QueueStatusSending, ""); err != nil { ... } // UPDATE 2
|
||||
// ... processor runs ...
|
||||
```
|
||||
|
||||
**Impact:**
|
||||
If Pulse crashes between IncrementAttempt and UpdateStatus, row stays in "pending" with incremented attempts, or gets stuck in "sending" forever.
|
||||
|
||||
**Fix:**
|
||||
Use single atomic UPDATE or transaction for state transitions.
|
||||
|
||||
---
|
||||
|
||||
## Priority 1 - Important Issues (Fix Soon)
|
||||
|
||||
### 11. Webhook Rate Limiter Race Condition
|
||||
**File:** `internal/notifications/notifications.go:1356-1394`
|
||||
**Severity:** Medium - Rate limiting ineffective under concurrency
|
||||
|
||||
**Issue:**
|
||||
`checkWebhookRateLimit` reads/writes `sentCount` and `lastSent` under the global `n.mu` mutex, but callers release the lock before sending. Concurrent sends can both pass the limit check before either increments the counter.
|
||||
|
||||
**Fix:**
|
||||
Use per-URL mutex or atomic counters for rate limiting.
|
||||
|
||||
---
|
||||
|
||||
### 12. Email Manager Not Thread-Safe
|
||||
**Files:** `internal/notifications/notifications.go:1010-1050` + `internal/notifications/email_enhanced.go:21-99`
|
||||
**Severity:** Medium - Race condition in concurrent email sends
|
||||
|
||||
**Issue:**
|
||||
`sendHTMLEmailWithError` grabs `n.emailManager` under read lock, then releases before calling `SendEmailWithRetry`. Meanwhile, the rate limiter mutates `sentCount` without any mutex.
|
||||
|
||||
**Fix:**
|
||||
Add mutex to EnhancedEmailManager or use atomic counters for rate limiter.
|
||||
|
||||
---
|
||||
|
||||
### 13. Double Retry Logic (Queue + Transport)
|
||||
**Files:** `internal/notifications/email_enhanced.go:38-78` + `queue.go:538-579`
|
||||
**Severity:** Medium - Confusing retry semantics
|
||||
|
||||
**Issue:**
|
||||
Email notifications configured with `MaxRetries=3` will retry 3 times per dequeue attempt, and the queue will retry the notification 3 more times, resulting in up to 9 total SMTP attempts.
|
||||
|
||||
**Fix:**
|
||||
Document behavior clearly or disable transport-level retries when using queue.
|
||||
|
||||
---
|
||||
|
||||
### 14. Grouping Timer Leak on Disable
|
||||
**File:** `internal/notifications/notifications.go:560-568`
|
||||
**Severity:** Medium - Goroutine leak
|
||||
|
||||
**Issue:**
|
||||
`SetEnabled(false)` doesn't cancel `groupTimer`. Timers created before disabling continue running and may fire after notifications are disabled.
|
||||
|
||||
**Fix:**
|
||||
Call `groupTimer.Stop()` in `SetEnabled(false)`.
|
||||
|
||||
---
|
||||
|
||||
### 15. Webhook 429 Retry-After Double Sleep
|
||||
**File:** `internal/notifications/webhook_enhanced.go:208-233`
|
||||
**Severity:** Low - Longer delays than intended
|
||||
|
||||
**Issue:**
|
||||
Enhanced webhooks sleep twice on 429 responses: once for backoff, again for Retry-After header.
|
||||
|
||||
**Fix:**
|
||||
Use `max(backoff, retryAfter)` instead of adding them.
|
||||
|
||||
---
|
||||
|
||||
## Priority 2 - UI/Observability Gaps
|
||||
|
||||
### 16. Queue/DLQ Endpoints Missing from Frontend
|
||||
**File:** `frontend-modern/src/api/notifications.ts`
|
||||
**Severity:** Medium - Flagship features invisible to users
|
||||
|
||||
**Missing endpoints:**
|
||||
- `/api/notifications/health` - Queue health status
|
||||
- `/api/notifications/queue/stats` - Queue statistics
|
||||
- `/api/notifications/dlq` - Dead letter queue
|
||||
- `/api/notifications/dlq/retry` - Retry DLQ items
|
||||
- `/api/notifications/dlq/delete` - Delete DLQ items
|
||||
- `/api/notifications/webhook-history` - Webhook delivery history
|
||||
|
||||
**Impact:**
|
||||
Users cannot observe queue depth, drain DLQ, or debug webhook failures without curl.
|
||||
|
||||
---
|
||||
|
||||
### 17. Apprise Test Notifications Not Supported in UI
|
||||
**File:** `frontend-modern/src/api/notifications.ts:73-79, 169-190`
|
||||
**Severity:** Low - Feature gap
|
||||
|
||||
**Issue:**
|
||||
UI test functionality only supports `email` and `webhook` types, even though backend supports `apprise` test notifications.
|
||||
|
||||
**Impact:**
|
||||
Apprise users cannot validate configuration from UI.
|
||||
|
||||
---
|
||||
|
||||
### 18. Enhanced Webhook Features Not Exposed
|
||||
**File:** `frontend-modern/src/components/Alerts/WebhookConfig.tsx`
|
||||
**Severity:** Low - Advanced features inaccessible
|
||||
|
||||
**Missing UI controls:**
|
||||
- Retry configuration (`RetryEnabled`, `RetryCount`)
|
||||
- Filter rules (by level, type, node, resource)
|
||||
- Response logging toggle
|
||||
- Custom payload templates (partially supported)
|
||||
|
||||
**Impact:**
|
||||
Half the "enhanced webhook" feature set is inaccessible without manual API calls.
|
||||
|
||||
---
|
||||
|
||||
### 19. No Plaintext Config Warning
|
||||
**File:** Frontend settings pages
|
||||
**Severity:** Low - Security transparency
|
||||
|
||||
**Issue:**
|
||||
UI never warns when `ConfigPersistence` lacks encryption, so admins don't know their SMTP passwords and webhook URLs are stored in plaintext.
|
||||
|
||||
**Fix:**
|
||||
Show warning icon in settings when encryption is disabled.
|
||||
|
||||
---
|
||||
|
||||
## Security Deep Dive
|
||||
|
||||
### SSRF Protection Analysis
|
||||
|
||||
**Current protections (CreateWebhook/UpdateWebhook only):**
|
||||
- Blocks localhost: `127.0.0.1`, `::1`, `127.*`
|
||||
- Blocks private IPs: `10.0.0.0/8`, `172.16.0.0/12`, `192.168.0.0/16`
|
||||
- Blocks link-local: `169.254.*`, `fe80::*`
|
||||
- Blocks cloud metadata: `169.254.169.254`, `metadata.google.internal`
|
||||
- Requires DNS resolution
|
||||
|
||||
**Bypasses:**
|
||||
1. DNS rebinding (validated once, never re-validated)
|
||||
2. Apprise HTTP mode (no validation at all)
|
||||
3. URL templating with custom fields (can inject private IPs via template variables)
|
||||
|
||||
**Recommendation:**
|
||||
Re-validate at send time, add Apprise URL validation, sanitize template outputs.
|
||||
|
||||
---
|
||||
|
||||
### Secrets at Rest
|
||||
|
||||
**Current behavior:**
|
||||
- `ConfigPersistence` with crypto provider → AES-256 encrypted
|
||||
- `ConfigPersistence` without crypto → plaintext JSON
|
||||
|
||||
**Files affected:**
|
||||
- `/etc/pulse/email.json` - SMTP passwords
|
||||
- `/etc/pulse/webhooks.json` - Webhook URLs with tokens
|
||||
- `/etc/pulse/apprise.json` - Apprise API keys
|
||||
|
||||
**Recommendation:**
|
||||
Either enforce encryption (fail hard without crypto) or emit warnings/health failures when encryption is disabled.
|
||||
|
||||
---
|
||||
|
||||
## Performance & Scalability
|
||||
|
||||
### Current Throughput Limits
|
||||
|
||||
**Email:**
|
||||
- Rate limit: 60/minute
|
||||
- Queue worker: 10 items per 5s = 120/minute theoretical
|
||||
- Actual: ~6-10/minute with SMTP latency
|
||||
- Retry overhead: 3x transport + 3x queue = 9x amplification
|
||||
|
||||
**Webhooks:**
|
||||
- Rate limit: 10/minute per URL
|
||||
- Queue worker: Same 10 items per 5s bottleneck
|
||||
- Actual: ~6/minute with slow endpoints
|
||||
|
||||
**Recommendations:**
|
||||
1. Concurrent queue workers (5-10 workers)
|
||||
2. Separate queues for email vs webhooks
|
||||
3. Shared HTTP client with connection pooling
|
||||
4. Document actual vs theoretical throughput
|
||||
|
||||
---
|
||||
|
||||
## Recommendations Summary
|
||||
|
||||
### Immediate Actions (P0)
|
||||
1. Fix cooldown timing - mark after delivery success
|
||||
2. Add `os.MkdirAll` to queue initialization
|
||||
3. Re-validate webhook URLs at send time
|
||||
4. Remove secret logging or add redaction
|
||||
5. Implement lastNotified cleanup
|
||||
6. Use shared HTTP client for webhooks
|
||||
7. Add fallback when enqueue fails
|
||||
8. Implement queue cancellation for resolved alerts
|
||||
9. Make queue worker concurrent
|
||||
10. Use atomic DB operations for queue state
|
||||
|
||||
### Short-term Actions (P1)
|
||||
1. Fix webhook rate limiter concurrency
|
||||
2. Fix email manager thread safety
|
||||
3. Document/fix double retry behavior
|
||||
4. Fix timer leaks
|
||||
5. Fix webhook 429 retry logic
|
||||
|
||||
### Medium-term Actions (P2)
|
||||
1. Add queue/DLQ UI components
|
||||
2. Add webhook history viewer
|
||||
3. Add Apprise test support to UI
|
||||
4. Expose enhanced webhook features in UI
|
||||
5. Add encryption status warnings
|
||||
6. Add queue health monitoring
|
||||
|
||||
---
|
||||
|
||||
## Testing Recommendations
|
||||
|
||||
### Unit Tests Needed
|
||||
- Cooldown timing with queue failures
|
||||
- Queue directory creation on fresh installs
|
||||
- Concurrent sends (email + webhooks)
|
||||
- Rate limiter under concurrency
|
||||
- lastNotified cleanup
|
||||
- Queue cancellation
|
||||
|
||||
### Integration Tests Needed
|
||||
- DNS rebinding attack simulation
|
||||
- Queue crash recovery (orphaned rows)
|
||||
- Slow webhook blocking pipeline
|
||||
- Enqueue failures with fallback
|
||||
- End-to-end queue→DLQ flow
|
||||
|
||||
### Security Tests Needed
|
||||
- SSRF bypass attempts (DNS rebinding, Apprise)
|
||||
- Secret leakage in logs (debug level enabled)
|
||||
- Plaintext config file exposure
|
||||
|
||||
---
|
||||
|
||||
## Conclusion
|
||||
|
||||
The notification system has strong foundations but critical reliability and security gaps. The queue infrastructure is well-designed but undermined by initialization failures, concurrency bugs, and missing observability.
|
||||
|
||||
**Most critical finding:** Cooldown timing + queue initialization failures combine to create a scenario where alerts are silently dropped, violating the core promise of a monitoring system.
|
||||
|
||||
**Most critical security finding:** DNS rebinding vulnerability allows SSRF attacks against cloud metadata services, potentially exposing credentials.
|
||||
|
||||
**Most critical UX finding:** Queue/DLQ features exist but are completely hidden from users, making the "persistent queue" feature essentially invisible.
|
||||
|
||||
All P0 issues should be addressed before the next release to restore trust in notification delivery.
|
||||
|
|
@ -1,205 +0,0 @@
|
|||
# Pulse Notification System - Quick Reference
|
||||
|
||||
## File Structure
|
||||
|
||||
### Backend Core Files
|
||||
```
|
||||
internal/notifications/
|
||||
├── notifications.go (2,358 lines) - Main NotificationManager, email, webhook, apprise
|
||||
├── webhook_enhanced.go (605 lines) - Enhanced webhook support with retry & filtering
|
||||
├── email_enhanced.go (415 lines) - EnhancedEmailManager with SMTP provider support
|
||||
├── email_providers.go (217 lines) - 11 email provider templates (Gmail, SendGrid, etc)
|
||||
├── email_template.go (200+ lines) - HTML email templates
|
||||
├── webhook_templates.go (300+ lines) - 8+ webhook service templates
|
||||
└── queue.go (600+ lines) - Persistent SQLite notification queue
|
||||
|
||||
internal/api/
|
||||
├── notifications.go (753 lines) - HTTP handlers for notification endpoints
|
||||
└── notification_queue.go (80+ lines) - Queue management endpoints
|
||||
```
|
||||
|
||||
### Frontend Files
|
||||
```
|
||||
frontend-modern/src/
|
||||
├── api/notifications.ts (201 lines) - API client for notifications
|
||||
├── components/Alerts/
|
||||
│ ├── WebhookConfig.tsx (300+ lines) - Webhook config UI
|
||||
│ └── EmailProviderSelect.tsx (200+ lines) - Email provider selector
|
||||
└── stores/notifications.ts - Reactive state management
|
||||
```
|
||||
|
||||
## Core Data Structures
|
||||
|
||||
### EmailConfig
|
||||
```go
|
||||
Enabled, Provider, SMTPHost, SMTPPort, Username, Password,
|
||||
From, To (array), TLS, StartTLS
|
||||
```
|
||||
|
||||
### WebhookConfig
|
||||
```go
|
||||
ID, Name, URL, Method, Headers (map), Enabled,
|
||||
Service (discord/slack/teams/telegram/etc), Template, CustomFields (map)
|
||||
```
|
||||
|
||||
### AppriseConfig
|
||||
```go
|
||||
Enabled, Mode (cli/http), Targets (array), CLIPath, TimeoutSeconds,
|
||||
ServerURL, ConfigKey, APIKey, APIKeyHeader, SkipTLSVerify
|
||||
```
|
||||
|
||||
## Key Methods
|
||||
|
||||
### NotificationManager
|
||||
| Method | Purpose |
|
||||
|--------|---------|
|
||||
| `SendAlert(alert)` | Add alert to queue, start grouping timer |
|
||||
| `CancelAlert(alertID)` | Remove resolved alert |
|
||||
| `SetEmailConfig(config)` | Update & persist email config |
|
||||
| `SetAppriseConfig(config)` | Update & persist apprise config |
|
||||
| `AddWebhook/UpdateWebhook/DeleteWebhook(...)` | Webhook CRUD |
|
||||
| `GetWebhooks/GetEmailConfig/GetAppriseConfig()` | Safe getters |
|
||||
| `SendTestNotification(method)` | Test email/webhook/apprise |
|
||||
| `ProcessQueuedNotification(notif)` | Queue processor callback |
|
||||
|
||||
### Alert Flow
|
||||
```
|
||||
SendAlert() → [Cooldown check] → [Add to pending] → [Start timer] →
|
||||
[Timer expires] → sendGroupedAlerts() → [Email/Webhook/Apprise async] →
|
||||
[Queue if available] → [Retry on failure]
|
||||
```
|
||||
|
||||
## API Endpoints
|
||||
|
||||
### Configuration
|
||||
- `GET /api/notifications/email` - Get email config (masked)
|
||||
- `PUT /api/notifications/email` - Update email config
|
||||
- `GET /api/notifications/webhooks` - List webhooks
|
||||
- `POST /api/notifications/webhooks` - Create webhook
|
||||
- `PUT /api/notifications/webhooks/{id}` - Update webhook
|
||||
- `DELETE /api/notifications/webhooks/{id}` - Delete webhook
|
||||
- `GET /api/notifications/apprise` - Get apprise config
|
||||
- `PUT /api/notifications/apprise` - Update apprise config
|
||||
|
||||
### Templates & Providers
|
||||
- `GET /api/notifications/email-providers` - Email provider list
|
||||
- `GET /api/notifications/webhook-templates` - Webhook templates
|
||||
|
||||
### Testing & Monitoring
|
||||
- `POST /api/notifications/test` - Test notification (email/webhook)
|
||||
- `POST /api/notifications/webhooks/test` - Test specific webhook
|
||||
- `GET /api/notifications/webhook-history` - Delivery history
|
||||
- `GET /api/notifications/health` - Queue & health status
|
||||
- `GET /api/notifications/dlq` - Dead letter queue
|
||||
- `GET /api/notifications/queue/stats` - Queue statistics
|
||||
- `POST /api/notifications/dlq/retry` - Retry DLQ item
|
||||
- `POST /api/notifications/dlq/delete` - Delete DLQ item
|
||||
|
||||
## Email Providers
|
||||
|
||||
1. Gmail / Google Workspace (STARTTLS 587, App Password)
|
||||
2. SendGrid (STARTTLS 587, apikey username)
|
||||
3. Mailgun (STARTTLS 587)
|
||||
4. Amazon SES (STARTTLS 587)
|
||||
5. Microsoft 365 (STARTTLS 587, App Password)
|
||||
6. Brevo (STARTTLS 587)
|
||||
7. Postmark (STARTTLS 587, token auth)
|
||||
8. SparkPost (STARTTLS 587, SMTP_Injection)
|
||||
9. Resend (STARTTLS 587)
|
||||
10. SMTP2GO (STARTTLS 587)
|
||||
11. Custom SMTP
|
||||
|
||||
## Webhook Services
|
||||
|
||||
1. **Discord** - Embeds with colors, fields
|
||||
2. **Telegram** - Markdown, requires chat_id in URL
|
||||
3. **Slack** - Blocks, section fields
|
||||
4. **Microsoft Teams** - Adaptive Cards
|
||||
5. **PagerDuty** - Event API with routing key
|
||||
6. **Gotify** - Title + message + priority
|
||||
7. **Pushover** - Custom app/user tokens
|
||||
8. **ntfy.sh** - Plain text with headers
|
||||
|
||||
## Security Features
|
||||
|
||||
### SSRF Protection
|
||||
- DNS resolution required
|
||||
- Blocks: localhost, 127.*, private ranges (10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16)
|
||||
- Blocks: link-local (169.254.*, fe80::*), cloud metadata (169.254.169.254)
|
||||
- Max 3 redirects
|
||||
|
||||
### Rate Limiting
|
||||
- Email: 60 per minute (configurable)
|
||||
- Webhooks: 10 per minute per URL
|
||||
|
||||
### Retry Logic
|
||||
- Email: Up to 3 attempts, 5s delay
|
||||
- Webhooks: Exponential backoff 1s→2s→4s→8s→16s→30s (max)
|
||||
- Queue: Configurable max attempts, DLQ for failures
|
||||
|
||||
## Constants
|
||||
|
||||
| Setting | Default | Notes |
|
||||
|---------|---------|-------|
|
||||
| Webhook timeout | 30s | 10s for tests |
|
||||
| Webhook response limit | 1 MB | Prevents memory exhaustion |
|
||||
| Email SMTP timeout | 10s dial, 30s total | Both TLS & STARTTLS |
|
||||
| Cooldown | 5 minutes | Per-alert grace period |
|
||||
| Grouping window | 30 seconds | Alert batching window |
|
||||
| Email rate limit | 60/minute | Configurable |
|
||||
| Webhook rate limit | 10/minute per URL | Configurable |
|
||||
| Queue processor | 5 seconds | Retry interval |
|
||||
| Queue cleanup | 1 hour | Remove old entries |
|
||||
|
||||
## Template Variables
|
||||
|
||||
For webhook/email templates, available fields:
|
||||
```
|
||||
{{.ID}} {{.Level}} {{.Type}} {{.ResourceName}} {{.ResourceID}}
|
||||
{{.Node}} {{.Instance}} {{.Message}} {{.Value}} {{.Threshold}}
|
||||
{{.ValueFormatted}} {{.ThresholdFormatted}} {{.StartTime}} {{.Duration}}
|
||||
{{.Timestamp}} {{.ResourceType}} {{.Acknowledged}} {{.AckTime}}
|
||||
{{.AckUser}} {{.CustomFields}} {{.AlertCount}} {{.Alerts}}
|
||||
```
|
||||
|
||||
## Database Schema (Queue)
|
||||
|
||||
### notification_queue
|
||||
- id (TEXT PRIMARY KEY)
|
||||
- type (TEXT) - email, webhook, apprise
|
||||
- status (TEXT) - pending, sending, sent, failed, dlq, cancelled
|
||||
- alerts (TEXT JSON) - Array of alerts
|
||||
- config (TEXT JSON) - EmailConfig/WebhookConfig/AppriseConfig
|
||||
- attempts, max_attempts, last_attempt, last_error
|
||||
- created_at, next_retry_at, completed_at
|
||||
- Indexes: status, next_retry_at (pending), created_at
|
||||
|
||||
### notification_audit
|
||||
- Audit trail for all notifications
|
||||
- alert_ids, alert_count, success flag
|
||||
|
||||
## Important Notes
|
||||
|
||||
### Password Handling
|
||||
- Passwords NEVER logged
|
||||
- Passwords masked in API responses (get returns empty)
|
||||
- Passwords preserved on update if empty field sent
|
||||
- For testing, must include existing password
|
||||
|
||||
### Webhook URL Templating
|
||||
- Go template syntax: {{.FieldName}}
|
||||
- Custom functions: title, upper, lower, printf, urlquery, urlencode, urlpath
|
||||
- Telegram requires chat_id in URL: `?chat_id={{.ChatID}}`
|
||||
- Service-specific data enrichment (Telegram chat_id, PagerDuty routing_key)
|
||||
|
||||
### Email Templates
|
||||
- Single alert: Level-specific colors, detailed metrics
|
||||
- Grouped: Multiple alerts summary
|
||||
- Both: Responsive HTML + plain text multipart
|
||||
|
||||
### Persistent Queue
|
||||
- SQLite WAL mode for concurrency
|
||||
- 64MB cache, 5s busy timeout
|
||||
- Background processor: 5s intervals
|
||||
- Cleanup job: Hourly
|
||||
- Dead letter queue for permanently failed items
|
||||
File diff suppressed because it is too large
Load diff
|
|
@ -607,10 +607,13 @@ func (n *NotificationManager) CancelAlert(alertID string) {
|
|||
n.groupTimer = nil
|
||||
}
|
||||
|
||||
// Clean up cooldown record for resolved alert
|
||||
delete(n.lastNotified, alertID)
|
||||
|
||||
log.Debug().
|
||||
Str("alertID", alertID).
|
||||
Int("remaining", len(n.pendingAlerts)).
|
||||
Msg("Removed resolved alert from pending notifications")
|
||||
Msg("Removed resolved alert from pending notifications and cooldown map")
|
||||
}
|
||||
|
||||
// sendGroupedAlerts sends all pending alerts as a group
|
||||
|
|
@ -642,16 +645,16 @@ func (n *NotificationManager) sendGroupedAlerts() {
|
|||
// Use persistent queue if available, otherwise send directly
|
||||
if n.queue != nil {
|
||||
n.enqueueNotifications(emailConfig, webhooks, appriseConfig, alertsToSend)
|
||||
// Note: Cooldown will be marked after successful dequeue and send
|
||||
} else {
|
||||
n.sendNotificationsDirect(emailConfig, webhooks, appriseConfig, alertsToSend)
|
||||
}
|
||||
|
||||
// Update last notified time for all alerts
|
||||
now := time.Now()
|
||||
for _, alert := range alertsToSend {
|
||||
n.lastNotified[alert.ID] = notificationRecord{
|
||||
lastSent: now,
|
||||
alertStart: alert.StartTime,
|
||||
// For direct sends, mark cooldown immediately (fire-and-forget)
|
||||
now := time.Now()
|
||||
for _, alert := range alertsToSend {
|
||||
n.lastNotified[alert.ID] = notificationRecord{
|
||||
lastSent: now,
|
||||
alertStart: alert.StartTime,
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
|
@ -912,6 +915,15 @@ func (n *NotificationManager) sendAppriseViaHTTP(cfg AppriseConfig, title, body,
|
|||
return fmt.Errorf("apprise server URL must start with http or https: %s", serverURL)
|
||||
}
|
||||
|
||||
// Validate Apprise server URL to prevent SSRF
|
||||
if err := ValidateWebhookURL(serverURL); err != nil {
|
||||
log.Error().
|
||||
Err(err).
|
||||
Str("serverURL", serverURL).
|
||||
Msg("Apprise server URL validation failed - possible SSRF attempt")
|
||||
return fmt.Errorf("apprise server URL validation failed: %w", err)
|
||||
}
|
||||
|
||||
ctx, cancel := context.WithTimeout(context.Background(), time.Duration(cfg.TimeoutSeconds)*time.Second)
|
||||
defer cancel()
|
||||
|
||||
|
|
@ -1395,6 +1407,16 @@ func (n *NotificationManager) checkWebhookRateLimit(webhookURL string) bool {
|
|||
|
||||
// sendWebhookRequest sends the actual webhook request
|
||||
func (n *NotificationManager) sendWebhookRequest(webhook WebhookConfig, jsonData []byte, alertType string) error {
|
||||
// Re-validate webhook URL to prevent DNS rebinding attacks
|
||||
if err := ValidateWebhookURL(webhook.URL); err != nil {
|
||||
log.Error().
|
||||
Err(err).
|
||||
Str("webhook", webhook.Name).
|
||||
Str("url", webhook.URL).
|
||||
Msg("Webhook URL validation failed at send time - possible DNS rebinding")
|
||||
return fmt.Errorf("webhook URL validation failed: %w", err)
|
||||
}
|
||||
|
||||
// Check rate limit before sending
|
||||
if !n.checkWebhookRateLimit(webhook.URL) {
|
||||
log.Warn().
|
||||
|
|
@ -1455,14 +1477,12 @@ func (n *NotificationManager) sendWebhookRequest(webhook WebhookConfig, jsonData
|
|||
}
|
||||
}
|
||||
|
||||
// Debug log the payload for Telegram and Gotify webhooks
|
||||
// Debug log for Telegram and Gotify webhooks (without secrets)
|
||||
if webhook.Service == "telegram" || webhook.Service == "gotify" {
|
||||
log.Debug().
|
||||
Str("webhook", webhook.Name).
|
||||
Str("service", webhook.Service).
|
||||
Str("url", webhookURL).
|
||||
Str("payload", string(jsonData)).
|
||||
Msg("Sending webhook with payload")
|
||||
Msg("Sending webhook")
|
||||
}
|
||||
|
||||
// Send request with secure client
|
||||
|
|
@ -2307,31 +2327,47 @@ func (n *NotificationManager) ProcessQueuedNotification(notif *QueuedNotificatio
|
|||
Int("alertCount", len(notif.Alerts)).
|
||||
Msg("Processing queued notification")
|
||||
|
||||
var err error
|
||||
switch notif.Type {
|
||||
case "email":
|
||||
var emailConfig EmailConfig
|
||||
if err := json.Unmarshal(notif.Config, &emailConfig); err != nil {
|
||||
if err = json.Unmarshal(notif.Config, &emailConfig); err != nil {
|
||||
return fmt.Errorf("failed to unmarshal email config: %w", err)
|
||||
}
|
||||
return n.sendGroupedEmail(emailConfig, notif.Alerts)
|
||||
err = n.sendGroupedEmail(emailConfig, notif.Alerts)
|
||||
|
||||
case "webhook":
|
||||
var webhookConfig WebhookConfig
|
||||
if err := json.Unmarshal(notif.Config, &webhookConfig); err != nil {
|
||||
if err = json.Unmarshal(notif.Config, &webhookConfig); err != nil {
|
||||
return fmt.Errorf("failed to unmarshal webhook config: %w", err)
|
||||
}
|
||||
return n.sendGroupedWebhook(webhookConfig, notif.Alerts)
|
||||
err = n.sendGroupedWebhook(webhookConfig, notif.Alerts)
|
||||
|
||||
case "apprise":
|
||||
var appriseConfig AppriseConfig
|
||||
if err := json.Unmarshal(notif.Config, &appriseConfig); err != nil {
|
||||
if err = json.Unmarshal(notif.Config, &appriseConfig); err != nil {
|
||||
return fmt.Errorf("failed to unmarshal apprise config: %w", err)
|
||||
}
|
||||
return n.sendGroupedApprise(appriseConfig, notif.Alerts)
|
||||
err = n.sendGroupedApprise(appriseConfig, notif.Alerts)
|
||||
|
||||
default:
|
||||
return fmt.Errorf("unknown notification type: %s", notif.Type)
|
||||
}
|
||||
|
||||
// Mark cooldown after successful send
|
||||
if err == nil {
|
||||
n.mu.Lock()
|
||||
now := time.Now()
|
||||
for _, alert := range notif.Alerts {
|
||||
n.lastNotified[alert.ID] = notificationRecord{
|
||||
lastSent: now,
|
||||
alertStart: alert.StartTime,
|
||||
}
|
||||
}
|
||||
n.mu.Unlock()
|
||||
}
|
||||
|
||||
return err
|
||||
}
|
||||
|
||||
// Stop gracefully stops the notification manager
|
||||
|
|
|
|||
|
|
@ -63,6 +63,11 @@ func NewNotificationQueue(dataDir string) (*NotificationQueue, error) {
|
|||
dataDir = filepath.Join(utils.GetDataDir(), "notifications")
|
||||
}
|
||||
|
||||
// Ensure directory exists
|
||||
if err := os.MkdirAll(dataDir, 0755); err != nil {
|
||||
return nil, fmt.Errorf("failed to create notification queue directory: %w", err)
|
||||
}
|
||||
|
||||
dbPath := filepath.Join(dataDir, "notification_queue.db")
|
||||
|
||||
db, err := sql.Open("sqlite3", dbPath)
|
||||
|
|
|
|||
Loading…
Add table
Add a link
Reference in a new issue