Commit graph

223 commits

Author SHA1 Message Date
rcourtman
7849dd012b Remember AI chat state provider across restart (#1339) 2026-03-27 15:34:42 +00:00
rcourtman
fd13d7f59e Avoid background update goroutines in api tests 2026-03-25 13:57:49 +00:00
rcourtman
2ed1c3b839 Proxy missing host-agent binaries from GitHub releases (#1254) 2026-03-25 13:11:31 +00:00
rcourtman
73786a9e27 Skip patrol triggers when patrol is disabled (#1258) 2026-03-25 11:33:34 +00:00
rcourtman
8a43a964b6 fix(ai): wire patrol circuit breaker on first-time configure 2026-03-13 12:10:14 +00:00
rcourtman
ae2edbde20 fix(ai): complete wiring on first-time configure; guard Ollama fallback
Three follow-up fixes:

1. RestartAIChat() now performs the full post-start wiring (MCP providers,
   patrol adapter, investigation orchestrator) when the service starts for
   the first time via Restart(). Previously these were only wired via
   StartAIChat(), leaving first-time configure with a partially wired service.

2. The Ollama→OpenAI-compatible fallback in createProviderForModel is now
   guarded by !strings.HasPrefix(modelStr, "ollama:") so explicit
   "ollama:llama3" models are never silently rerouted to a different provider.

3. Windows install script registration check now uses the $Hostname override
   (if set) instead of always looking up $env:COMPUTERNAME, so post-install
   verification works correctly when a custom hostname is specified.
2026-03-13 12:06:08 +00:00
rcourtman
a4b0771974 Prevent removed host agents from resurrecting via in-flight reports (#1331)
Host agents removed from the UI would reappear on the next report cycle
because there was no rejection mechanism — unlike Docker agents which
already had resurrection prevention. Mirror the Docker agent pattern:

- Track removed host IDs in a `removedHosts` map with 24hr TTL
- Persist removal records in `State.RemovedHosts` for frontend display
- Reject reports from removed hosts in `ApplyHostReport()`
- Add `AllowHostReenroll()` + API route to clear the block
- Show removed host agents in the Settings UI with "Allow re-enroll"
- Sync removed-agent maps from state on startup for all agent types
- Fix mock integration snapshot missing `RemovedDockerHosts` field
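
A minimal sketch of the rejection mechanism listed above, assuming an in-memory map keyed by host ID. `removalTracker` and its methods are hypothetical names for illustration; only the 24hr TTL and the reject / allow-re-enroll behaviour come from the commit message.

```go
package main

import (
	"sync"
	"time"
)

// removedHostTTL matches the 24hr TTL named in the commit message.
const removedHostTTL = 24 * time.Hour

// removalTracker is a hypothetical type used only for this sketch.
type removalTracker struct {
	mu      sync.Mutex
	ttl     time.Duration
	removed map[string]time.Time // host ID -> time of removal
}

func newRemovalTracker(ttl time.Duration) *removalTracker {
	return &removalTracker{ttl: ttl, removed: make(map[string]time.Time)}
}

// MarkRemoved records that a host was removed from the UI.
func (t *removalTracker) MarkRemoved(hostID string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.removed[hostID] = time.Now()
}

// ShouldReject reports whether a report from hostID must be dropped.
// Entries older than the TTL are pruned lazily and no longer block reports.
func (t *removalTracker) ShouldReject(hostID string) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	removedAt, ok := t.removed[hostID]
	if !ok {
		return false
	}
	if time.Since(removedAt) >= t.ttl {
		delete(t.removed, hostID)
		return false
	}
	return true
}

// AllowReenroll clears the block so the host can report again.
func (t *removalTracker) AllowReenroll(hostID string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	delete(t.removed, hostID)
}
```

In this sketch an in-flight report that arrives after `MarkRemoved` is dropped until either the TTL lapses or `AllowReenroll` is called for that host.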
2026-03-09 17:52:34 +00:00
rcourtman
ddecf6d00c Guard legacyMonitor typed-nil and add OIDC refresh panic recovery
Normalize SystemSettingsMonitor interface assignments via reflect to
prevent typed-nil-in-interface (same class as #1324 fix). Also add
defer/recover to the background OIDC token refresh goroutine so a
panic there cannot take down the process.
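
A sketch of the two patterns this message names, under stated assumptions: `Monitor`, `legacyMonitor`, `normalize`, and `safeGo` are hypothetical stand-ins, not the project's types. It shows why a nil pointer stored in an interface compares non-nil, how `reflect` can normalize it, and how `defer`/`recover` contains a panic in a background goroutine.

```go
package main

import "reflect"

// Monitor is a hypothetical interface standing in for SystemSettingsMonitor.
type Monitor interface{ Name() string }

type legacyMonitor struct{}

func (m *legacyMonitor) Name() string { return "legacy" }

// normalize returns a true nil interface when m wraps a nil pointer
// (or another nil-able kind), so later `m != nil` checks behave as expected.
func normalize(m Monitor) Monitor {
	if m == nil {
		return nil
	}
	v := reflect.ValueOf(m)
	switch v.Kind() {
	case reflect.Ptr, reflect.Map, reflect.Slice, reflect.Func, reflect.Chan:
		if v.IsNil() {
			return nil
		}
	}
	return m
}

// safeGo runs fn in a goroutine and reports a panic to onPanic instead of
// letting it crash the process. The returned channel closes when fn is done.
func safeGo(fn func(), onPanic func(recovered interface{})) <-chan struct{} {
	done := make(chan struct{})
	go func() {
		defer close(done)
		defer func() {
			if r := recover(); r != nil {
				onPanic(r)
			}
		}()
		fn()
	}()
	return done
}
```

Assigning a `*legacyMonitor` that happens to be nil to a `Monitor` variable yields an interface that is not equal to `nil`; passing it through `normalize` first restores the expected comparison.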
2026-03-07 10:21:07 +00:00
rcourtman
743ef17b79 Fix AI and config profile handlers broken in v5 single-tenant mode
The single-tenant lockdown (499ab812e) set mtPersistence to nil but
only patched AISettingsHandler with a legacy fallback. AIHandler (chat
service) and ConfigProfileHandler were missed, so AI features (Patrol,
Chat) failed with "chat service not available" and config profiles
would panic on nil dereference. Wire legacy persistence into both
handlers and add the same fallback to ProfileSuggestionHandler.

Fixes #1322
2026-03-06 11:05:01 +00:00
rcourtman
499ab812e3 Fix post-release regressions and lock v5 to single-tenant runtime 2026-03-05 23:46:35 +00:00
rcourtman
10872c8ca8 fix(patrol): remove noisy per-alert log when patrol is disabled (#1258)
The alert callback logged at Info level for every alert regardless of
whether patrol was enabled. TriggerPatrolForAlert already has an
enabled/running guard and its own debug logging.
2026-03-05 10:01:43 +00:00
rcourtman
2fcddecf80 feat(api): add POST /api/ai/patrol/undismiss endpoint to revert suppressed findings (#1300)
The Undismiss() method existed on FindingsStore but was never exposed
via the API. Users who dismissed findings as "not_an_issue" had no way
to revert them.

- Add HandleUndismissFinding handler and route
- Add Undismiss() to UnifiedStore for parity with FindingsStore
- Also remove matching explicit suppression rules on undismiss
2026-03-01 22:29:36 +00:00
rcourtman
a210b01a03 fix(sso): load SSO config at startup and expose providers on login page
r.ssoConfig was never loaded from persistence in NewRouter(), so on every
restart all SSO providers were silently discarded (handleListSSOProviders
would reinitialize to an empty config on the first request).

Also adds ssoProviders to /api/security/status so the login page can
render SAML/OIDC login buttons for enabled providers.

Fixes part of #1255

(cherry picked from commit 395cd101ff4acb1b7f89ec3d907b84cbec217dc8)
2026-02-18 12:53:15 +00:00
rcourtman
2fb6ebc25f fix: add SAML auth bypass and update route inventory tests
The SAML route registration (bee3d05f) was incomplete: the auth
middleware uses exact-match for public paths, so /api/saml/{id}/login
etc. would be blocked. Add prefix-based auth bypass for /api/saml/
paths and update route inventory tests for both SSO and SAML routes.
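
The difference between exact-match and prefix-based public paths can be sketched as follows. This is an illustration, not the project's middleware: `isPublicPath`, `publicExact`, and `publicPrefixes` are hypothetical names, and cleaning the path before matching is an added assumption.

```go
package main

import (
	"path"
	"strings"
)

// publicExact lists paths that bypass auth only on an exact match.
var publicExact = map[string]bool{"/api/security/status": true}

// publicPrefixes lists path prefixes that bypass auth, e.g. /api/saml/{id}/login.
var publicPrefixes = []string{"/api/saml/"}

// isPublicPath reports whether the request path may skip authentication.
// The path is cleaned first so "/api/saml/../x" cannot ride on the prefix.
func isPublicPath(p string) bool {
	clean := path.Clean(p)
	if publicExact[clean] {
		return true
	}
	for _, prefix := range publicPrefixes {
		if strings.HasPrefix(clean, prefix) {
			return true
		}
	}
	return false
}
```

With exact matching alone, `/api/saml/{id}/login` can never match because the provider ID varies per request, which is why a prefix rule is needed for these routes.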
2026-02-11 13:48:16 +00:00
rcourtman
bee3d05f0d fix: register SAML login flow routes (login, ACS, metadata, logout, SLO)
The SAML handler functions existed but were never registered in
setupRoutes(), causing 404s for all SAML authentication flows.
Adds /api/saml/ prefix route with dispatcher for all 5 endpoints.
2026-02-11 13:29:05 +00:00
rcourtman
89969079b9 fix: register SSO provider API routes
The SSO handler functions and frontend were implemented but the HTTP
routes were never registered in setupRoutes(), causing 404 on all
/api/security/sso/providers endpoints.

Fixes #1248
2026-02-11 13:17:51 +00:00
rcourtman
7336ec2d87 fix(metrics): normalize docker resource type in metrics history API (#1229)
Frontend sends resourceType="docker" but the SQLite store uses
"dockerContainer". The /api/metrics-store/history handler now
normalizes the alias so queries return the correct historical data
instead of falling back to a single live data point.
2026-02-09 22:33:24 +00:00
rcourtman
5bbc4329bd Remove pprof diagnostics endpoint 2026-02-04 20:44:00 +00:00
rcourtman
a37b59b7e4 Add admin-gated pprof diagnostics endpoint 2026-02-04 20:39:24 +00:00
rcourtman
ee0e89871d fix: reduce metrics memory 86x by reverting buffer and adding LTTB downsampling
The in-memory metrics buffer was changed from 1000 to 86400 points per
metric to support 30-day sparklines, but this pre-allocated ~18 MB per
guest (7 slices × 86400 × 32 bytes). With 50 guests that's 920 MB —
explaining why users needed to double their LXC memory after upgrading
to 5.1.0.

- Revert in-memory buffer to 1000 points / 24h retention
- Remove eager slice pre-allocation (use append growth instead)
- Add LTTB (Largest Triangle Three Buckets) downsampling algorithm
- Chart endpoints now use a two-tier strategy: in-memory for ranges
  ≤ 2h, SQLite persistent store + LTTB for longer ranges
- Reduce frontend ring buffer from 86400 to 2000 points

Related to #1190
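
For reference, a compact sketch of Largest Triangle Three Buckets downsampling as the algorithm is commonly described. It is not the code from this commit; `Point` and `LTTB` are illustrative names. The first and last points are always kept, and each remaining bucket contributes the point forming the largest triangle with the previously chosen point and the average of the next bucket.

```go
package main

// Point is a single (timestamp, value) sample.
type Point struct{ X, Y float64 }

// LTTB reduces data to threshold points. If threshold is below 3 or not
// smaller than len(data), a copy of the input is returned unchanged.
func LTTB(data []Point, threshold int) []Point {
	n := len(data)
	if threshold >= n || threshold < 3 {
		out := make([]Point, n)
		copy(out, data)
		return out
	}
	sampled := make([]Point, 0, threshold)
	sampled = append(sampled, data[0]) // always keep the first point
	bucket := float64(n-2) / float64(threshold-2)
	a := 0 // index of the most recently selected point
	for i := 0; i < threshold-2; i++ {
		// Average of the next bucket (the last point for the final bucket).
		avgStart := int(float64(i+1)*bucket) + 1
		avgEnd := int(float64(i+2)*bucket) + 1
		if avgEnd > n {
			avgEnd = n
		}
		var avgX, avgY float64
		for j := avgStart; j < avgEnd; j++ {
			avgX += data[j].X
			avgY += data[j].Y
		}
		count := float64(avgEnd - avgStart)
		avgX /= count
		avgY /= count

		// Pick the point in the current bucket with the largest triangle area.
		start := int(float64(i)*bucket) + 1
		end := int(float64(i+1)*bucket) + 1
		maxArea := -1.0
		next := start
		for j := start; j < end; j++ {
			area := (data[a].X-avgX)*(data[j].Y-data[a].Y) -
				(data[a].X-data[j].X)*(avgY-data[a].Y)
			if area < 0 {
				area = -area
			}
			if area > maxArea {
				maxArea = area
				next = j
			}
		}
		sampled = append(sampled, data[next])
		a = next
	}
	return append(sampled, data[n-1]) // always keep the last point
}
```

Because selection is driven by triangle area, isolated spikes tend to survive downsampling, which is the usual reason to prefer LTTB over plain averaging for charts.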
2026-02-04 19:49:52 +00:00
rcourtman
9d4d392026 fix: host network sparklines showing cumulative bytes instead of rates
Host network sparklines were displaying wildly incorrect values (e.g., 147 GB/s
for an idle Raspberry Pi) because cumulative byte counters (total bytes since
boot) were being stored directly instead of being converted to rates.

Changes:
- monitor.go: Use RateTracker to calculate network rates for hosts, matching
  the existing pattern used for VMs and containers. Only record network
  metrics when we have enough samples to calculate valid rates.
- router.go: Remove network metrics from live fallback for hosts since we
  can't calculate rates from a single snapshot. Better to show nothing than
  misleading cumulative totals.

The fix follows the established codebase pattern where:
1. Agent reports cumulative RXBytes/TXBytes
2. RateTracker compares consecutive samples to calculate bytes/second
3. Rates are stored in metrics history for sparkline display
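
The three steps above can be sketched as follows. This is a simplified illustration, not the project's RateTracker: the `rateTracker` type, its `Observe` method, and the use of Unix seconds for timestamps are assumptions made for the example.

```go
package main

// counterSample is the last cumulative reading seen for one host.
type counterSample struct {
	rx, tx uint64
	atUnix int64 // sample time in Unix seconds
}

// rateTracker is a hypothetical, simplified stand-in for RateTracker.
type rateTracker struct {
	last map[string]counterSample
}

func newRateTracker() *rateTracker {
	return &rateTracker{last: make(map[string]counterSample)}
}

// Observe stores the cumulative counters and returns bytes/second relative
// to the previous sample. ok is false when no valid rate can be computed:
// first sample, non-increasing time, or a counter reset (e.g. after reboot).
func (r *rateTracker) Observe(id string, rx, tx uint64, atUnix int64) (rxRate, txRate float64, ok bool) {
	prev, seen := r.last[id]
	r.last[id] = counterSample{rx: rx, tx: tx, atUnix: atUnix}
	if !seen {
		return 0, 0, false
	}
	elapsed := atUnix - prev.atUnix
	if elapsed <= 0 || rx < prev.rx || tx < prev.tx {
		return 0, 0, false
	}
	secs := float64(elapsed)
	return float64(rx-prev.rx) / secs, float64(tx-prev.tx) / secs, true
}
```

A caller would record a network metric only when `ok` is true, which corresponds to the "enough samples to calculate valid rates" condition in the message.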
2026-02-04 16:11:04 +00:00
rcourtman
5f2990deec Require proxy admin for SSH config endpoints 2026-02-04 15:57:59 +00:00
rcourtman
145e5c46bb Require admin for host config patch and delete 2026-02-04 15:56:07 +00:00
rcourtman
5ede1f6a97 Harden apply-restart auth for proxy/OIDC 2026-02-04 15:48:06 +00:00
rcourtman
34ca427458 Add unified guest intelligence to patrol seed context
Enrich the patrol seed context with service identity (from discovery
store) and network reachability (via ICMP ping through host agents).
The guest metrics table now includes Service and Reachable columns,
and a Service Health Issues section highlights running-but-unreachable
guests. A new SignalGuestUnreachable signal type creates deterministic
findings for unreachable guests.

New files:
- patrol_intelligence.go: GuestProber interface, GuestIntelligence
  type, gatherGuestIntelligence() with concurrent per-node probing
- patrol_prober.go: agentExecProber implementation using batch ping
  commands via connected host agents
2026-02-04 14:08:57 +00:00
rcourtman
5c18748742 Add SMART disk lifecycle monitoring with historical charts
Expand the smartctl collector to capture detailed SMART attributes (SATA
and NVMe), propagate them through the full data pipeline, persist them
as time-series metrics, and display them in an interactive disk detail
drawer with historical sparkline charts.

Backend: add SMARTAttributes struct, writeSMARTMetrics for persistent
storage, "disk" resource type in metrics API with live fallback.
Frontend: enhanced DiskList with Power-On column and SMART warnings,
new DiskDetail drawer matching NodeDrawer styling patterns, generic
HistoryChart metric support with proper tooltip formatting.
2026-02-04 13:35:40 +00:00
rcourtman
8951b6f7f9 Require monitoring scope for socket.io 2026-02-04 12:41:12 +00:00
rcourtman
5c1487e406 feat: add resource picker and multi-resource report generation
Replace manual resource ID entry with a searchable, filterable resource
picker that uses live WebSocket state. Support selecting multiple
resources (up to 50) for combined fleet reports.

Multi-resource PDFs include a cover page, fleet summary table with
aggregate health status, and condensed per-resource detail pages with
overlaid CPU/memory charts. Multi-resource CSVs include a summary
section followed by interleaved time-series data with resource columns.

New POST /api/admin/reports/generate-multi endpoint handles multi-resource
requests while the existing single-resource GET endpoint remains unchanged.

Also fixes resource ID validation regex to allow colons used in
VM/container IDs (e.g., "instance:node:vmid").
2026-02-04 10:24:23 +00:00
rcourtman
6059759958 feat: Add sparkline support for unified host agents on hosts page
Backend:
- Add HostData field to ChartResponse struct in types.go
- Add host data processing in /api/charts endpoint using 'host:' prefix key
- Include hosts count in debug logging for chart responses

Frontend:
- Add 'host' to MetricResourceKind type in metricsKeys.ts
- Add hostData field to ChartsResponse interface in charts.ts
- Process hostData in seedFromBackend() in metricsHistory.ts
- Pass resourceId to EnhancedCPUBar and StackedMemoryBar in HostsOverview.tsx
- Add '7d' and '30d' to TIME_RANGE_OPTIONS in metricsViewMode.ts

This enables sparkline trend visualization for unified host agents,
consistent with Proxmox guests. Data accumulates over time at 30s intervals.
2026-02-03 22:59:55 +00:00
rcourtman
5a990dd554 Fix sparkline data inconsistency and support 30d range 2026-02-03 22:39:50 +00:00
rcourtman
b7a94bad9f security: fix websocket scope and agent impersonation
1. Enforce monitoring:read scope on WebSocket upgrades
   - Prevents low-privilege tokens (e.g. host-agent:report) from accessing
     full infra state via requestData on the main WebSocket.

2. Enforce agent token binding to prevent impersonation
   - Added Metadata field to APITokenRecord to support bound_agent_id
   - Updated agentexec server to validate token-to-agent binding if present
   - Prevents agent:exec tokens from registering as arbitrary agent IDs
2026-02-03 20:40:08 +00:00
rcourtman
0dfe3d16b3 security: secure socket.io, test-notification, and stats endpoints
1. Secure /socket.io/ endpoint
   - Previously allowed unauthenticated WebSocket upgrades via transport=websocket
   - Now enforces CheckAuth() before upgrade

2. Secure /api/test-notification
   - Previously unauthenticated and allowed broadcasting to all clients
   - Now requires Admin + settings:write scope

3. Secure /simple-stats
   - Added authentication requirement (was public)
2026-02-03 20:08:16 +00:00
rcourtman
dd47cbe5b4 security: fix host token binding, AI findings scope, and DLQ credential exposure
1. Host agent link/unlink/delete now require settings:write scope
   - Prevents compromised host-agent:manage tokens from manipulating
     or deleting unrelated hosts
   - Host tokens scoped to one host can no longer affect other hosts

2. AI investigation endpoints now require ai:execute scope
   - /api/ai/findings/* was only protected by RequireAuth
   - Low-privilege tokens could read investigation details and chat logs

3. Notification DLQ endpoints now require settings:read/write scope
   - DLQ entries contain notification configs (webhooks, SMTP, etc.)
   - Prevents monitoring:read tokens from reading credential data
   - DLQ retry/delete operations require settings:write
2026-02-03 19:59:46 +00:00
rcourtman
fdc99418d6 security: add authentication to /api/security/apply-restart endpoint
CRITICAL FIX: This endpoint previously allowed unauthenticated users to
trigger service restarts, which is a denial-of-service vulnerability.

Now requires:
- Authentication (CheckAuth) when auth is configured
- Admin role for proxy auth users
- settings:write scope for API tokens

Initial setup (no auth configured yet) remains accessible to allow
first-time security configuration to trigger restart.
2026-02-03 19:55:29 +00:00
rcourtman
832fda6c96 security: add scope checks to alerts, AI models, patrol status/stream, and remaining AI endpoints
- /api/alerts/* now requires monitoring:read scope
- /api/ai/models now requires ai:chat scope
- /api/ai/patrol/status and /api/ai/patrol/stream now require ai:execute scope
- /api/ai/patrol/findings now requires ai:execute scope
- /api/ai/remediation/* endpoints now require ai:execute scope
- /api/ai/circuit/status now requires ai:execute scope
- /api/ai/incidents/* now requires ai:execute scope
- /api/ai/question/* now requires ai:chat scope
- /api/ai/agents now requires ai:execute scope
- /api/ai/cost/summary now requires settings:read scope
2026-02-03 19:48:43 +00:00
rcourtman
c295ee277f security: add scope checks to AI endpoints and mitigate CSWSH
- AI Intelligence endpoints (/api/ai/intelligence/*, /api/ai/forecast/*,
  /api/ai/unified/findings, etc.) now require ai:execute scope to prevent
  low-privilege tokens from reading sensitive intelligence data

- AI Knowledge endpoints (/api/ai/knowledge/*) now require ai:chat scope
  to prevent arbitrary guest data access across the fleet

- AI Debug Context (/api/ai/debug/context) now requires settings:read scope
  to prevent system prompt and infrastructure details leakage

- WebSocket origin check now validates peer IP is private when allowing
  private network origins, mitigating CSWSH attacks where a malicious page
  on the same LAN tries to hijack connections using victim's session cookie
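
A rough sketch of the origin check described in the last item, assuming origins are given as IP literals. `allowPrivateOrigin` and `isPrivateIP` are hypothetical helpers; the project's actual check may differ, for example in how it treats hostnames or trusted proxies.

```go
package main

import (
	"net"
	"net/url"
)

// isPrivateIP reports whether host is a literal loopback or private
// (RFC 1918 / RFC 4193) address. Hostnames are not resolved in this sketch.
func isPrivateIP(host string) bool {
	ip := net.ParseIP(host)
	return ip != nil && (ip.IsLoopback() || ip.IsPrivate())
}

// allowPrivateOrigin accepts a private-network Origin only when the TCP peer
// is itself on a private address. remoteAddr has the "ip:port" form of
// http.Request.RemoteAddr.
func allowPrivateOrigin(origin, remoteAddr string) bool {
	u, err := url.Parse(origin)
	if err != nil || !isPrivateIP(u.Hostname()) {
		return false
	}
	peer, _, err := net.SplitHostPort(remoteAddr)
	if err != nil {
		return false
	}
	return isPrivateIP(peer)
}
```

The point of the second check is that an Origin header alone is attacker-controlled context; tying it to the connection's peer address narrows which clients can use the private-origin allowance.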
2026-02-03 19:40:46 +00:00
rcourtman
2ebe65bbc5 security: add scope checks to AI Patrol and agent profile endpoints
- AI Patrol mutation endpoints (acknowledge, dismiss, suppress, snooze, resolve,
  findings/note, suppressions/*) now require ai:execute scope to prevent
  low-privilege tokens from blinding patrol by hiding/suppressing findings

- Agent profile admin endpoints (/api/admin/profiles/*) now require
  settings:write scope to prevent low-privilege tokens from modifying
  fleet-wide agent behavior
2026-02-03 19:29:56 +00:00
rcourtman
69e3286e5e security: fix AI OAuth scope bypass, approval replay attacks, and approval endpoint scope gating
- OAuth endpoints now require settings:write scope (not just admin)
- Approval endpoints now require ai:execute scope
- Added CommandHash to approvals for replay protection
- Approvals are now single-use (consumed on first use)
- consumeApprovalWithValidation validates command matches approval
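
A minimal sketch of single-use, command-bound approvals as described above. The `approvalStore` type, the error values, and the choice of SHA-256 for the command hash are assumptions for illustration; the commit message only states that a CommandHash exists and that approvals are consumed on first use.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"errors"
	"sync"
)

var (
	errUnknownApproval = errors.New("approval not found")
	errCommandMismatch = errors.New("command does not match approval")
	errAlreadyUsed     = errors.New("approval already consumed")
)

type approval struct {
	commandHash string
	consumed    bool
}

// approvalStore is a hypothetical in-memory store for this sketch.
type approvalStore struct {
	mu   sync.Mutex
	byID map[string]*approval
}

func newApprovalStore() *approvalStore {
	return &approvalStore{byID: make(map[string]*approval)}
}

// hashCommand binds an approval to the exact command text.
func hashCommand(cmd string) string {
	sum := sha256.Sum256([]byte(cmd))
	return hex.EncodeToString(sum[:])
}

// Approve records an approval for one specific command.
func (s *approvalStore) Approve(id, cmd string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.byID[id] = &approval{commandHash: hashCommand(cmd)}
}

// Consume validates that cmd matches the approved command and marks the
// approval as used, so replaying the same approval ID fails.
func (s *approvalStore) Consume(id, cmd string) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	a, ok := s.byID[id]
	if !ok {
		return errUnknownApproval
	}
	if a.consumed {
		return errAlreadyUsed
	}
	if a.commandHash != hashCommand(cmd) {
		return errCommandMismatch
	}
	a.consumed = true
	return nil
}
```

In this sketch a mismatched command leaves the approval unconsumed; whether a failed validation should also burn the approval is a policy choice the commit message does not specify.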
2026-02-03 19:15:15 +00:00
rcourtman
43c696896f security: fix high severity authz issues (AI chat, patrol autonomy, discovery, host config) 2026-02-03 19:00:56 +00:00
rcourtman
225da6eb39 security: strengthen public URL capture to enforce scope and admin checks 2026-02-03 18:49:42 +00:00
rcourtman
83382ee251 security: enforce scope checks on admin diagnostics endpoint 2026-02-03 18:44:55 +00:00
rcourtman
60f9e6f07f security: fix multiple vulnerabilities (SAML, SSRF, Auth)
Addressed several security findings:
- SAML: Sanitized RelayState to prevent open redirects
- SAML: Fixed logout to properly invalidate server-side sessions
- Auth: Added auth, rate limiting, and logout checks to password change endpoint
- AI: Added admin/scope gating (ai:execute) for command execution
- AI: Blocked private IP ranges in fetch_url to prevent SSRF
- Config: Enforced settings:read/write scopes for export/import
- Agent: Added agent:exec scope requirement for WebSockets
2026-02-03 18:39:15 +00:00
rcourtman
d716bbfdeb fix(security): add proper authorization to sensitive endpoints
- /api/agent-install-command: require admin + settings:write scope
  Previously only RequireAuth, allowing any authenticated user to mint
  high-privilege API tokens (host-agent:manage)

- /api/system/ssh-config: require settings:write scope
  Previously any authenticated token could modify ~/.ssh/config

- /api/system/verify-temperature-ssh: require settings:write scope
  Previously any authenticated token could trigger SSH connection
  attempts to arbitrary nodes (network scanning risk)

- /api/diagnostics: require admin privileges
  Previously exposed API token metadata (IDs, hints, usage mapping)
  to any authenticated token, enabling enumeration attacks
2026-02-03 17:47:40 +00:00
rcourtman
12a5a98117 fix: SSE race conditions, alert user spoofing, and security status oracle
SSE Broadcaster:
- Add per-client mutex to prevent concurrent writes to ResponseWriter
- Fix data race in cleanupLoop reading LastActive without synchronization
- Update LastActive in SendHeartbeat so clients aren't incorrectly pruned
  after 5 minutes of idle heartbeat traffic

Alert Acknowledgements:
- Extract authenticated user from X-Authenticated-User header instead of
  hardcoding 'admin' or trusting request body's User field
- Prevents audit log spoofing and ensures accurate user attribution

Security Status Endpoint:
- Remove ?token= query param validation from public /api/security/status
- Prevents endpoint from acting as a token validity oracle for attackers
- Authentication still works via session cookies and X-API-Token header
2026-02-03 17:40:58 +00:00
rcourtman
beae4c860c fix: address 6 security and reliability issues
Security fixes:
- Auto-register now requires settings:write scope for API tokens
- X-Forwarded-For in auto-register only trusted from verified proxies
- Public URL capture requires authentication (no loopback bypass)
- Lockout reset now uses RequireAdmin for session users

Reliability fixes:
- Docker stop command expiration clears PendingUninstall flag
- Cancelled notifications get completed_at set and are cleaned up
2026-02-03 17:32:44 +00:00
rcourtman
bd030c7c87 security: fix webhook SSRF, rate limit spoofing, metrics retention, and url poisoning
- Fix SSRF and rate limit bypass in SendEnhancedWebhook by validating the rendered URL.
- Fix rate limit spoofing in updates API by using secure IP extraction (trusted proxies).
- Fix memory leak in metrics history by correctly clearing fully stale data series.
- Fix public URL poisoning by preventing overwrites when explicitly configured.
2026-02-03 16:58:13 +00:00
rcourtman
4f40c3d751 fix: resolve critical stability and auth issues
- Fix data race in webhook notifications by removing shared state
- Fix duplicate monitors on config reload by stopping old instances
- Prevent metrics ID deletion on transient startup errors
- Support Bearer auth header for config export/import endpoints
2026-02-03 16:46:27 +00:00
rcourtman
bea3bbe5f6 Fix API token authentication and multi-tenancy logic
- Fix AuthContextMiddleware to use tenant-specific config for token validation
- Resolve data race in token LastUsedAt update
- Fix invalid org IDs returning 501/402 instead of 400
- Prevent unauthenticated organization directory creation (DoS protection)
2026-02-03 16:24:28 +00:00
rcourtman
88d95f40be feat: add Discovery Transparency & Trust features
- Add AI provider indicator showing local (Ollama) vs cloud (Anthropic/OpenAI) analysis
- Add "What Discovery Does" explanation section before first scan
- Show commands preview before scan so users know what will run
- Add scan details section showing raw command outputs for admins
- Filter sensitive Docker labels (passwords, secrets, tokens) before AI analysis
- Add comprehensive tests for label filtering

This improves sysadmin confidence by making discovery transparent about
what it does, what data it collects, and where that data goes.
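
The label filtering mentioned in the list above might look roughly like this. It is a sketch under assumptions: `filterLabels` and `sensitiveMarkers` are hypothetical names, and matching on substrings of the label key is a guess at the approach rather than the project's actual rule.

```go
package main

import "strings"

// sensitiveMarkers are substrings that mark a label key as sensitive.
// The list is illustrative, based on the categories named in the message.
var sensitiveMarkers = []string{"password", "secret", "token"}

// filterLabels returns a copy of labels without entries whose key looks
// sensitive. Matching is case-insensitive and the input map is not modified.
func filterLabels(labels map[string]string) map[string]string {
	out := make(map[string]string, len(labels))
	for key, value := range labels {
		lower := strings.ToLower(key)
		sensitive := false
		for _, marker := range sensitiveMarkers {
			if strings.Contains(lower, marker) {
				sensitive = true
				break
			}
		}
		if !sensitive {
			out[key] = value
		}
	}
	return out
}
```

Filtering on keys alone will miss secrets embedded in otherwise innocuous labels, so a real filter may also need to inspect values.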
2026-02-03 14:59:27 +00:00
rcourtman
c2ed6067f1 Fix: discovery routing, host identification, and UX feedback
- Fix routing for POST/PUT/DELETE on /api/discovery/host/ endpoints
  (Go's http.ServeMux was matching the longer prefix before method-specific routes)
- Add HOST-specific AI prompt that focuses on identifying the host OS
  rather than services/containers running on it
- Add success message UI after discovery completes
- Fix timing so success appears after data is visible (not during refetch)
- Add error handling and display for failed discoveries
2026-02-03 14:10:54 +00:00