Commit graph

21 commits

Author SHA1 Message Date
rcourtman
499ab812e3 Fix post-release regressions and lock v5 to single-tenant runtime 2026-03-05 23:46:35 +00:00
rcourtman
a9ed380718 fix(websocket): respect X-Forwarded headers in same-origin check
- Use X-Forwarded-Proto/X-Forwarded-Scheme for scheme detection
- Use X-Forwarded-Host for host matching behind reverse proxies
- Update tests with remoteAddr for CSWSH protection validation
2026-02-03 21:45:39 +00:00
rcourtman
c295ee277f security: add scope checks to AI endpoints and mitigate CSWSH
- AI Intelligence endpoints (/api/ai/intelligence/*, /api/ai/forecast/*,
  /api/ai/unified/findings, etc.) now require ai:execute scope to prevent
  low-privilege tokens from reading sensitive intelligence data

- AI Knowledge endpoints (/api/ai/knowledge/*) now require ai:chat scope
  to prevent arbitrary guest data access across the fleet

- AI Debug Context (/api/ai/debug/context) now requires settings:read scope
  to prevent system prompt and infrastructure details leakage

- WebSocket origin check now validates peer IP is private when allowing
  private network origins, mitigating CSWSH attacks where a malicious page
  on the same LAN tries to hijack connections using victim's session cookie
2026-02-03 19:40:46 +00:00
rcourtman
b2639ed5a5 Fix security vulnerabilities and critical bugs
- Fix WebSocket CORS bypass by strictly verifying origin
- Fix OIDC refresh token persistence by encrypting at rest
- Fix grouped webhook data mutation by cloning alerts
- Fix host agent uninstall authorization and config fetch logic
- Fix notification queue recovery for stuck sending items
- Fix ignored update history limit parameter
- Fix ineffective break statement in WebSocket write pump
2026-02-03 17:16:27 +00:00
rcourtman
aeca5e39fa Fix multi-tenant persistence and backend stability
- Initialize Alert and Notification managers with tenant-specific data directories

- Add panic recovery to WebSocket safeSend for stability

- Record host metrics to history for sparkline support
2026-02-03 16:24:42 +00:00
rcourtman
19a67dd4f3 Update core infrastructure components
Config:
- AI configuration improvements
- API tokens handling
- Persistence layer updates

Host Agent:
- Command execution improvements
- Better test coverage

Infrastructure Discovery:
- Service improvements
- Enhanced test coverage

Models:
- State snapshot updates
- Model improvements

Monitoring:
- Polling improvements
- Guest config handling
- Storage config support

WebSocket:
- Hub tenant test updates

Service Discovery:
- New service discovery module
2026-01-28 16:52:35 +00:00
rcourtman
62e8d609de feat: add WebSocket multi-tenant isolation
- Enhance WebSocket hub with tenant awareness
- Add tenant isolation for real-time updates
- Add hub tenant isolation tests
2026-01-24 22:43:50 +00:00
rcourtman
c4ca169e2b feat: add multi-tenant isolation foundation (disabled by default)
Implements multi-tenant infrastructure for organization-based data isolation.
Feature is gated behind PULSE_MULTI_TENANT_ENABLED env var and requires
Enterprise license - no impact on existing users.

Core components:
- TenantMiddleware: extracts org ID, validates access, 501/402 responses
- AuthorizationChecker: token/user access validation for organizations
- MultiTenantChecker: WebSocket upgrade gating with license check
- Per-tenant audit logging via LogAuditEventForTenant
- Organization model with membership support

Gating behavior:
- Feature flag disabled: 501 Not Implemented for non-default orgs
- Flag enabled, no license: 402 Payment Required
- Default org always works regardless of flag/license

Documentation added: docs/MULTI_TENANT.md
2026-01-23 21:42:27 +00:00
rcourtman
2e0da42a81 chore: reliability and maintenance improvements
Host agent:
- Add SHA256 checksum verification for downloaded binaries
- Verify checksum file matches expected bundle filename

WebSocket:
- Add write failure tracking with graceful disconnection
- Increase write deadline to 30s for large state payloads
- Better handling for slow clients (Raspberry Pi, slow networks)

Monitoring:
- Remove unused temperature proxy imports
- Add monitor polling improvements
- Expand test coverage

Other:
- Update package.json dependencies
- Fix generate-release-notes.sh path handling
- Minor reporting engine cleanup
2026-01-22 00:45:04 +00:00
rcourtman
a6a8efaa65 test: Add comprehensive test coverage across packages
New test files with expanded coverage:

API tests:
- ai_handler_test.go: AI handler unit tests with mocking
- agent_profiles_tools_test.go: Profile management tests
- alerts_endpoints_test.go: Alert API endpoint tests
- alerts_test.go: Updated for interface changes
- audit_handlers_test.go: Audit handler tests
- frontend_embed_test.go: Frontend embedding tests
- metadata_handlers_test.go, metadata_provider_test.go: Metadata tests
- notifications_test.go: Updated for interface changes
- profile_suggestions_test.go: Profile suggestion tests
- saml_service_test.go: SAML authentication tests
- sensor_proxy_gate_test.go: Sensor proxy tests
- updates_test.go: Updated for interface changes

Agent tests:
- dockeragent/signature_test.go: Docker agent signature tests
- hostagent/agent_metrics_test.go: Host agent metrics tests
- hostagent/commands_test.go: Command execution tests
- hostagent/network_helpers_test.go: Network helper tests
- hostagent/proxmox_setup_test.go: Updated setup tests
- kubernetesagent/*_test.go: Kubernetes agent tests

Core package tests:
- monitoring/kubernetes_agents_test.go, reload_test.go
- remoteconfig/client_test.go, signature_test.go
- sensors/collector_test.go
- updates/adapter_installsh_*_test.go: Install adapter tests
- updates/manager_*_test.go: Update manager tests
- websocket/hub_*_test.go: WebSocket hub tests

Library tests:
- pkg/audit/export_test.go: Audit export tests
- pkg/metrics/store_test.go: Metrics store tests
- pkg/proxmox/*_test.go: Proxmox client tests
- pkg/reporting/reporting_test.go: Reporting tests
- pkg/server/*_test.go: Server tests
- pkg/tlsutil/extra_test.go: TLS utility tests

Total: ~8000 lines of new test code
2026-01-19 19:26:18 +00:00
rcourtman
d3116defe3 fix: Prevent panic from send on closed websocket channel
Add atomic `closed` flag to Client struct and `safeSend()` helper method
to prevent race condition when sending to client channels. The race
occurred when a client disconnected while a goroutine was trying to send
initial state - the channel could be closed between the registration
check and the actual send.

All sends to client.send now go through safeSend() which checks the
closed flag first. The flag is set atomically before closing the channel
in all code paths (unregister, dispatchToClients, broadcast, shutdown).

Related to #1048
2026-01-06 17:41:25 +00:00
rcourtman
80baf3ad95 refactor: Remove unreachable dead code branches
- checkOrigin: Remove redundant same-origin check at line 155 that was
  already handled at line 116 (origin == requestOrigin)

Function improved from 95.1% to 97.4% coverage.
2025-12-02 14:56:35 +00:00
rcourtman
b2659ef6a8 Add unit tests for websocket Hub.checkOrigin function
Test coverage for CORS origin validation including:
- No origin header (non-browser clients)
- Same-origin requests with/without proxy headers
- Wildcard and explicit allowed origins
- Private network fallback (192.168.x, 10.x, localhost, .local, .lan)
- Public IP/domain rejection
- X-Forwarded-Proto/Host/Scheme header handling
- WebSocket protocol normalization (ws->http, wss->https)

Coverage: 37.3% → 48.6%
2025-12-01 09:33:40 +00:00
rcourtman
27b1188368 Add unit tests for websocket utility functions
Test coverage for pure utility functions:
- isValidPrivateOrigin: validates private network origins (security)
- normalizeForwardedProto: normalizes ws/wss to http/https for proxies
- sanitizeValue: handles NaN/Inf values in JSON data
- cloneMetadata/cloneMetadataValue: deep copies metadata maps/slices
- cloneAlert/cloneAlertData: deep copies alert structures

Coverage increased from 20.9% to 37.3% (80 test cases).
2025-11-30 00:34:38 +00:00
rcourtman
bc9e89696b chore: fix staticcheck U1000 unused code warnings
- Remove unused ipv6Regex from validation.go
- Suppress unused recordAlertFired/recordAlertResolved hooks (kept for future use)
- Remove unused apiLimiter rate limiter
- Remove unused stopOnce fields from csrf_store.go and session_store.go
- Remove unused lastBroadcast field from hub.go
- Remove unused lastUsedIndex field from cluster_client.go
2025-11-27 09:12:17 +00:00
rcourtman
9a33dc568e Guard WebSocket broadcast buffers when clients stall (Related to #734) 2025-11-21 10:14:10 +00:00
rcourtman
d30d76bb92 Fix P1: Add shutdown mechanism to WebSocket Hub
Fixed goroutine leaks in WebSocket hub from missing shutdown mechanism:

Problem:
1. Hub.Run() has infinite loop with no exit condition
2. runBroadcastSequencer() reads from channel forever
3. No way to cleanly shutdown hub during restarts or tests

Solution:
- Added stopChan chan struct{} field to Hub
- Initialize stopChan in NewHub()
- Added Stop() method that closes stopChan
- Modified Run() main loop to select on stopChan
  - On shutdown: close all client connections and return
- Modified runBroadcastSequencer() from 'for range' to select
  - Changed from: for msg := range h.broadcastSeq
  - Changed to: for { select { case msg := <-h.broadcastSeq: ... case <-h.stopChan: ... }}
  - On shutdown: stop coalesce timer and return

Shutdown sequence:
1. Call hub.Stop() to close stopChan
2. Both Run() and runBroadcastSequencer() exit their loops
3. All client send channels are closed
4. Clients map is cleared
5. Pending coalesce timer is stopped

Impact:
- Enables graceful shutdown during service restarts
- Prevents goroutine leaks in tests
- Allows proper cleanup of WebSocket connections
- No more orphaned broadcast sequencer goroutines
2025-11-07 10:20:26 +00:00
rcourtman
1bf9cfea88 Fix critical P0 security and crash issues in API/WebSocket layer
This commit addresses 5 critical P0 bugs that cause security vulnerabilities, crashes, and data corruption:

**P0-1: Recovery Tokens Replay Attack Vulnerability** (recovery_tokens.go:153-159)
- **SECURITY CRITICAL**: Single-use recovery tokens could be replayed
- **Problem**: Lock upgrade race - two concurrent requests both pass initial Used check
  1. Both acquire RLock, see token.Used = false
  2. Both release RLock
  3. Both acquire Lock and mark token.Used = true
  4. Both return true - TOKEN REUSED
- **Impact**: Attacker with intercepted token can use it multiple times
- **Fix**: Re-check token.Used after acquiring write lock (TOCTOU prevention)

**P0-2: WebSocket Hub Concurrent Map Panic** (hub.go:345-347, 376-378)
- **Problem**: Initial state goroutine reads h.clients map without lock
  - Line 345: `if _, ok := h.clients[client]` (NO LOCK)
  - Main loop writes to h.clients with lock (line 326, 394)
- **Impact**: "fatal error: concurrent map read and write" crashes hub
- **Fix**: Acquire RLock before all client map reads in goroutine

**P0-3: WebSocket Send on Closed Channel Panic** (hub.go:348, 380)
- **Problem**: Check client exists, then send - channel can close between
- **Impact**: "send on closed channel" panic crashes hub
- **Fix**: Hold RLock during both check and send (defensive select already present)

**P0-4: CSRF Store Shutdown Data Corruption** (csrf_store.go:189-196)
- **Problem**: Stop() calls save() after signaling worker. Both hold only RLock
  - Worker's final save writes to csrf_tokens.json.tmp
  - Stop()'s save writes to same file concurrently
- **Impact**: Corrupted/truncated csrf_tokens.json on shutdown
- **Fix**: Added saveMu mutex to serialize all disk writes

**P0-5: CSRF Store Deadlock on Double-Stop** (csrf_store.go:103-108)
- **Problem**: stopChan unbuffered, no sync.Once guard, uses send not close
- **Impact**: Second Stop() call blocks forever waiting for receiver
- **Fix**:
  - Added sync.Once field stopOnce
  - Changed to close(stopChan) within stopOnce.Do()
  - Prevents double-close panic and deadlock

All fixes maintain backwards compatibility. The recovery token fix is particularly critical as it closes a security vulnerability allowing replay attacks on password reset flows.
2025-11-07 10:13:15 +00:00
rcourtman
c8e0281953 Add comprehensive alert system reliability improvements
This commit implements critical reliability features to prevent data loss
and improve alert system robustness:

**Persistent Notification Queue:**
- SQLite-backed queue with WAL journaling for crash recovery
- Dead Letter Queue (DLQ) for notifications that exhaust retries
- Exponential backoff retry logic (100ms → 200ms → 400ms)
- Full audit trail for all notification delivery attempts
- New file: internal/notifications/queue.go (661 lines)

**DLQ Management API:**
- GET /api/notifications/dlq - Retrieve DLQ items
- GET /api/notifications/queue/stats - Queue statistics
- POST /api/notifications/dlq/retry - Retry failed notifications
- POST /api/notifications/dlq/delete - Delete DLQ items
- New file: internal/api/notification_queue.go (145 lines)

**Prometheus Metrics:**
- 18 comprehensive metrics for alerts and notifications
- Metric hooks integrated via function pointers to avoid import cycles
- /metrics endpoint exposed for Prometheus scraping
- New file: internal/metrics/alert_metrics.go (193 lines)

**Alert History Reliability:**
- Exponential backoff retry for history saves (3 attempts)
- Automatic backup restoration on write failure
- Modified: internal/alerts/history.go

**Flapping Detection:**
- Detects and suppresses rapidly oscillating alerts
- Configurable window (default: 5 minutes)
- Configurable threshold (default: 5 state changes)
- Configurable cooldown (default: 15 minutes)
- Automatic cleanup of inactive flapping history

**Alert TTL & Auto-Cleanup:**
- MaxAlertAgeDays: Auto-cleanup old alerts (default: 7 days)
- MaxAcknowledgedAgeDays: Faster cleanup for acked alerts (default: 1 day)
- AutoAcknowledgeAfterHours: Auto-ack long-running alerts (default: 24 hours)
- Prevents memory leaks from long-running alerts

**WebSocket Broadcast Sequencer:**
- Channel-based sequencing ensures ordered message delivery
- 100ms coalescing window for rapid state updates
- Prevents race conditions in WebSocket broadcasts
- Modified: internal/websocket/hub.go

**Configuration Fields Added:**
- FlappingEnabled, FlappingWindowSeconds, FlappingThreshold, FlappingCooldownMinutes
- MaxAlertAgeDays, MaxAcknowledgedAgeDays, AutoAcknowledgeAfterHours

All features are production-ready and build successfully.
2025-11-06 16:46:30 +00:00
rcourtman
6619dc803e refactor: use strconv.Itoa instead of string(rune()) in test
Replace string(rune(i)) with strconv.Itoa(i) in hub_concurrency_test.go
for generating client IDs. While this is test code and not a production bug,
it uses the same incorrect pattern that caused the PR #575 bug.

This ensures consistent best practices across the codebase and avoids
confusion for developers who might copy this pattern.

Related: #575
2025-10-20 15:12:14 +00:00
rcourtman
f46ff1792b Fix settings security tab navigation 2025-10-11 23:29:47 +00:00