Commit graph

16 commits

Author SHA1 Message Date
rcourtman
778a2577b6 feat: Pulse v6 release 2026-03-18 16:06:30 +00:00
rcourtman
499ab812e3 Fix post-release regressions and lock v5 to single-tenant runtime 2026-03-05 23:46:35 +00:00
rcourtman
a9ed380718 fix(websocket): respect X-Forwarded headers in same-origin check
- Use X-Forwarded-Proto/X-Forwarded-Scheme for scheme detection
- Use X-Forwarded-Host for host matching behind reverse proxies
- Update tests with remoteAddr for CSWSH protection validation
2026-02-03 21:45:39 +00:00
rcourtman
c295ee277f security: add scope checks to AI endpoints and mitigate CSWSH
- AI Intelligence endpoints (/api/ai/intelligence/*, /api/ai/forecast/*,
  /api/ai/unified/findings, etc.) now require ai:execute scope to prevent
  low-privilege tokens from reading sensitive intelligence data

- AI Knowledge endpoints (/api/ai/knowledge/*) now require ai:chat scope
  to prevent arbitrary guest data access across the fleet

- AI Debug Context (/api/ai/debug/context) now requires settings:read scope
  to prevent system prompt and infrastructure details leakage

- WebSocket origin check now validates peer IP is private when allowing
  private network origins, mitigating CSWSH attacks where a malicious page
  on the same LAN tries to hijack connections using victim's session cookie
2026-02-03 19:40:46 +00:00
rcourtman
b2639ed5a5 Fix security vulnerabilities and critical bugs
- Fix WebSocket CORS bypass by strictly verifying origin
- Fix OIDC refresh token persistence by encrypting at rest
- Fix grouped webhook data mutation by cloning alerts
- Fix host agent uninstall authorization and config fetch logic
- Fix notification queue recovery for stuck sending items
- Fix ignored update history limit parameter
- Fix ineffective break statement in WebSocket write pump
2026-02-03 17:16:27 +00:00
rcourtman
aeca5e39fa Fix multi-tenant persistence and backend stability
- Initialize Alert and Notification managers with tenant-specific data directories

- Add panic recovery to WebSocket safeSend for stability

- Record host metrics to history for sparkline support
2026-02-03 16:24:42 +00:00
rcourtman
62e8d609de feat: add WebSocket multi-tenant isolation
- Enhance WebSocket hub with tenant awareness
- Add tenant isolation for real-time updates
- Add hub tenant isolation tests
2026-01-24 22:43:50 +00:00
rcourtman
2e0da42a81 chore: reliability and maintenance improvements
Host agent:
- Add SHA256 checksum verification for downloaded binaries
- Verify checksum file matches expected bundle filename

WebSocket:
- Add write failure tracking with graceful disconnection
- Increase write deadline to 30s for large state payloads
- Better handling for slow clients (Raspberry Pi, slow networks)

Monitoring:
- Remove unused temperature proxy imports
- Add monitor polling improvements
- Expand test coverage

Other:
- Update package.json dependencies
- Fix generate-release-notes.sh path handling
- Minor reporting engine cleanup
2026-01-22 00:45:04 +00:00
rcourtman
d3116defe3 fix: Prevent panic from send on closed websocket channel
Add atomic `closed` flag to Client struct and `safeSend()` helper method
to prevent race condition when sending to client channels. The race
occurred when a client disconnected while a goroutine was trying to send
initial state - the channel could be closed between the registration
check and the actual send.

All sends to client.send now go through safeSend() which checks the
closed flag first. The flag is set atomically before closing the channel
in all code paths (unregister, dispatchToClients, broadcast, shutdown).

Related to #1048
2026-01-06 17:41:25 +00:00
rcourtman
80baf3ad95 refactor: Remove unreachable dead code branches
- checkOrigin: Remove redundant same-origin check at line 155 that was
  already handled at line 116 (origin == requestOrigin)

Function improved from 95.1% to 97.4% coverage.
2025-12-02 14:56:35 +00:00
rcourtman
bc9e89696b chore: fix staticcheck U1000 unused code warnings
- Remove unused ipv6Regex from validation.go
- Suppress unused recordAlertFired/recordAlertResolved hooks (kept for future use)
- Remove unused apiLimiter rate limiter
- Remove unused stopOnce fields from csrf_store.go and session_store.go
- Remove unused lastBroadcast field from hub.go
- Remove unused lastUsedIndex field from cluster_client.go
2025-11-27 09:12:17 +00:00
rcourtman
9a33dc568e Guard WebSocket broadcast buffers when clients stall (Related to #734) 2025-11-21 10:14:10 +00:00
rcourtman
d30d76bb92 Fix P1: Add shutdown mechanism to WebSocket Hub
Fixed goroutine leaks in WebSocket hub from missing shutdown mechanism:

Problem:
1. Hub.Run() has infinite loop with no exit condition
2. runBroadcastSequencer() reads from channel forever
3. No way to cleanly shutdown hub during restarts or tests

Solution:
- Added stopChan chan struct{} field to Hub
- Initialize stopChan in NewHub()
- Added Stop() method that closes stopChan
- Modified Run() main loop to select on stopChan
  - On shutdown: close all client connections and return
- Modified runBroadcastSequencer() from 'for range' to select
  - Changed from: for msg := range h.broadcastSeq
  - Changed to: for { select { case msg := <-h.broadcastSeq: ... case <-h.stopChan: ... }}
  - On shutdown: stop coalesce timer and return

Shutdown sequence:
1. Call hub.Stop() to close stopChan
2. Both Run() and runBroadcastSequencer() exit their loops
3. All client send channels are closed
4. Clients map is cleared
5. Pending coalesce timer is stopped

Impact:
- Enables graceful shutdown during service restarts
- Prevents goroutine leaks in tests
- Allows proper cleanup of WebSocket connections
- No more orphaned broadcast sequencer goroutines
2025-11-07 10:20:26 +00:00
rcourtman
1bf9cfea88 Fix critical P0 security and crash issues in API/WebSocket layer
This commit addresses 5 critical P0 bugs that cause security vulnerabilities, crashes, and data corruption:

**P0-1: Recovery Tokens Replay Attack Vulnerability** (recovery_tokens.go:153-159)
- **SECURITY CRITICAL**: Single-use recovery tokens could be replayed
- **Problem**: Lock upgrade race - two concurrent requests both pass initial Used check
  1. Both acquire RLock, see token.Used = false
  2. Both release RLock
  3. Both acquire Lock and mark token.Used = true
  4. Both return true - TOKEN REUSED
- **Impact**: Attacker with intercepted token can use it multiple times
- **Fix**: Re-check token.Used after acquiring write lock (TOCTOU prevention)

**P0-2: WebSocket Hub Concurrent Map Panic** (hub.go:345-347, 376-378)
- **Problem**: Initial state goroutine reads h.clients map without lock
  - Line 345: `if _, ok := h.clients[client]` (NO LOCK)
  - Main loop writes to h.clients with lock (line 326, 394)
- **Impact**: "fatal error: concurrent map read and write" crashes hub
- **Fix**: Acquire RLock before all client map reads in goroutine

**P0-3: WebSocket Send on Closed Channel Panic** (hub.go:348, 380)
- **Problem**: Check client exists, then send - channel can close between
- **Impact**: "send on closed channel" panic crashes hub
- **Fix**: Hold RLock during both check and send (defensive select already present)

**P0-4: CSRF Store Shutdown Data Corruption** (csrf_store.go:189-196)
- **Problem**: Stop() calls save() after signaling worker. Both hold only RLock
  - Worker's final save writes to csrf_tokens.json.tmp
  - Stop()'s save writes to same file concurrently
- **Impact**: Corrupted/truncated csrf_tokens.json on shutdown
- **Fix**: Added saveMu mutex to serialize all disk writes

**P0-5: CSRF Store Deadlock on Double-Stop** (csrf_store.go:103-108)
- **Problem**: stopChan unbuffered, no sync.Once guard, uses send not close
- **Impact**: Second Stop() call blocks forever waiting for receiver
- **Fix**:
  - Added sync.Once field stopOnce
  - Changed to close(stopChan) within stopOnce.Do()
  - Prevents double-close panic and deadlock

All fixes maintain backwards compatibility. The recovery token fix is particularly critical as it closes a security vulnerability allowing replay attacks on password reset flows.
2025-11-07 10:13:15 +00:00
rcourtman
c8e0281953 Add comprehensive alert system reliability improvements
This commit implements critical reliability features to prevent data loss
and improve alert system robustness:

**Persistent Notification Queue:**
- SQLite-backed queue with WAL journaling for crash recovery
- Dead Letter Queue (DLQ) for notifications that exhaust retries
- Exponential backoff retry logic (100ms → 200ms → 400ms)
- Full audit trail for all notification delivery attempts
- New file: internal/notifications/queue.go (661 lines)

**DLQ Management API:**
- GET /api/notifications/dlq - Retrieve DLQ items
- GET /api/notifications/queue/stats - Queue statistics
- POST /api/notifications/dlq/retry - Retry failed notifications
- POST /api/notifications/dlq/delete - Delete DLQ items
- New file: internal/api/notification_queue.go (145 lines)

**Prometheus Metrics:**
- 18 comprehensive metrics for alerts and notifications
- Metric hooks integrated via function pointers to avoid import cycles
- /metrics endpoint exposed for Prometheus scraping
- New file: internal/metrics/alert_metrics.go (193 lines)

**Alert History Reliability:**
- Exponential backoff retry for history saves (3 attempts)
- Automatic backup restoration on write failure
- Modified: internal/alerts/history.go

**Flapping Detection:**
- Detects and suppresses rapidly oscillating alerts
- Configurable window (default: 5 minutes)
- Configurable threshold (default: 5 state changes)
- Configurable cooldown (default: 15 minutes)
- Automatic cleanup of inactive flapping history

**Alert TTL & Auto-Cleanup:**
- MaxAlertAgeDays: Auto-cleanup old alerts (default: 7 days)
- MaxAcknowledgedAgeDays: Faster cleanup for acked alerts (default: 1 day)
- AutoAcknowledgeAfterHours: Auto-ack long-running alerts (default: 24 hours)
- Prevents memory leaks from long-running alerts

**WebSocket Broadcast Sequencer:**
- Channel-based sequencing ensures ordered message delivery
- 100ms coalescing window for rapid state updates
- Prevents race conditions in WebSocket broadcasts
- Modified: internal/websocket/hub.go

**Configuration Fields Added:**
- FlappingEnabled, FlappingWindowSeconds, FlappingThreshold, FlappingCooldownMinutes
- MaxAlertAgeDays, MaxAcknowledgedAgeDays, AutoAcknowledgeAfterHours

All features are production-ready and build successfully.
2025-11-06 16:46:30 +00:00
rcourtman
f46ff1792b Fix settings security tab navigation 2025-10-11 23:29:47 +00:00