Commit graph

279 commits

Author SHA1 Message Date
rcourtman
d5b4850715 Harden AI session storage paths 2026-03-28 13:50:55 +00:00
rcourtman
4b61746f3b Adapt Patrol retry budgets to provider context limits (#1370) 2026-03-27 10:57:14 +00:00
rcourtman
608f184666 Retry Patrol with reduced seed context on provider window errors (#1370) 2026-03-26 23:16:28 +00:00
rcourtman
c12394c17f Route patrol investigations through patrol model (#1360) 2026-03-26 09:16:38 +00:00
rcourtman
4ba888b450 Fix Pulse Assistant startup for legacy OpenAI-compatible configs (#1339) 2026-03-25 23:54:17 +00:00
rcourtman
1de1392c9b Preserve provider metadata in AI model lists (#1320) 2026-03-25 13:08:15 +00:00
rcourtman
5f372e257f Respect patrol model provider in quick analysis 2026-03-25 13:01:43 +00:00
rcourtman
73786a9e27 Skip patrol triggers when patrol is disabled (#1258) 2026-03-25 11:33:34 +00:00
rcourtman
f9bf42498f Fix Gemini cost estimation tiers (#1360) 2026-03-25 09:55:17 +00:00
rcourtman
ae2edbde20 fix(ai): complete wiring on first-time configure; guard Ollama fallback
Three follow-up fixes:

1. RestartAIChat() now performs the full post-start wiring (MCP providers,
   patrol adapter, investigation orchestrator) when the service starts for
   the first time via Restart(). Previously these were only wired via
   StartAIChat(), leaving first-time configure with a partially wired service.

2. The Ollama→OpenAI-compatible fallback in createProviderForModel is now
   guarded by !strings.HasPrefix(modelStr, "ollama:") so explicit
   "ollama:llama3" models are never silently rerouted to a different provider.

3. Windows install script registration check now uses the $Hostname override
   (if set) instead of always looking up $env:COMPUTERNAME, so post-install
   verification works correctly when a custom hostname is specified.
2026-03-13 12:06:08 +00:00
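The Ollama fallback guard in point 2 can be sketched as follows. This is a minimal illustration of the described prefix check, not Pulse's actual `createProviderForModel`; the function and parameter names here are assumptions.

```go
package main

import (
	"fmt"
	"strings"
)

// providerFor sketches the guarded fallback: bare model names may fall back to
// an OpenAI-compatible endpoint, but an explicit "ollama:" model must never be
// silently rerouted to a different provider.
func providerFor(modelStr string, ollamaConfigured, customOpenAIBase bool) string {
	if strings.HasPrefix(modelStr, "ollama:") {
		return "ollama" // explicit ollama: models always stay on Ollama
	}
	if !ollamaConfigured && customOpenAIBase {
		return "openai-compatible" // fallback only for bare/ambiguous names
	}
	return "ollama"
}

func main() {
	fmt.Println(providerFor("ollama:llama3", false, true)) // ollama
	fmt.Println(providerFor("qwen3-omni", false, true))    // openai-compatible
}
```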
rcourtman
e137f3fbf7 fix(ai): start chat service on first-time configure without restart
When Pulse starts before AI is configured, legacyService is nil.
Saving AI settings called Restart() which bailed immediately on the
nil check, leaving the service unstarted (503 on /api/ai/sessions)
until a full process restart.

Merged the nil and !IsRunning checks so first-time configure now
starts the service inline, same as the already-handled stopped case.

Also: bare model names that ParseModelString routes to Ollama (e.g.
"qwen3-omni") now fall back to a configured custom OpenAI base URL
when Ollama is not explicitly configured — handles manually-typed
model names on self-hosted OpenAI-compatible endpoints.

Fixes #1339, #1296
2026-03-13 11:13:27 +00:00
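The merged nil/`!IsRunning` check can be sketched as below. The `Service` type and `restart` helper are illustrative stand-ins; Pulse's real `Restart()` also wires MCP providers and the patrol adapter.

```go
package main

import "fmt"

type Service struct{ running bool }

// IsRunning is nil-safe so callers need not special-case an unconfigured service.
func (s *Service) IsRunning() bool { return s != nil && s.running }

// restart sketches the fix: a nil service (AI never configured) and a stopped
// service now take the same inline start path, instead of bailing on the nil
// check and leaving /api/ai/sessions returning 503 until a process restart.
func restart(s *Service) *Service {
	if s == nil || !s.IsRunning() {
		s = &Service{running: true} // start inline, first-time or stopped
	}
	return s
}

func main() {
	fmt.Println(restart(nil).IsRunning()) // true: first-time configure starts the service
}
```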
rcourtman
82c615b3b9 Filter virtual disks from SMART checks to prevent false positives (#1329)
ZFS zvols (zd*), device-mapper, virtio disks, and other virtual block
devices don't support SMART and were being reported as FAILED. Use lsblk
JSON metadata to filter by device prefix, transport, subsystem, and
vendor/model. Also treat missing smart_status as unknown rather than
failed, and ignore UNKNOWN health in Patrol/AI signals.
2026-03-08 22:16:24 +00:00
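A sketch of the lsblk-based filter, assuming a simplified subset of `lsblk -J` output; the exact prefixes and fields Pulse inspects may differ.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// lsblkDevice mirrors a subset of `lsblk -J -o NAME,TRAN,SUBSYSTEMS` output.
type lsblkDevice struct {
	Name       string `json:"name"`
	Tran       string `json:"tran"`
	Subsystems string `json:"subsystems"`
}

// isVirtualDisk sketches the filter: zvols (zd*), device-mapper, loop, and
// virtio devices don't support SMART, so reporting them as FAILED is a false
// positive. They should be skipped, and absent smart_status treated as unknown.
func isVirtualDisk(d lsblkDevice) bool {
	for _, p := range []string{"zd", "dm-", "loop", "md"} {
		if strings.HasPrefix(d.Name, p) {
			return true
		}
	}
	return d.Tran == "virtio" || strings.Contains(d.Subsystems, "virtio")
}

func main() {
	raw := `{"blockdevices":[
	  {"name":"zd0","tran":"","subsystems":"block"},
	  {"name":"nvme0n1","tran":"nvme","subsystems":"block:nvme:pci"}]}`
	var out struct {
		Blockdevices []lsblkDevice `json:"blockdevices"`
	}
	_ = json.Unmarshal([]byte(raw), &out)
	for _, d := range out.Blockdevices {
		fmt.Println(d.Name, isVirtualDisk(d))
	}
}
```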
rcourtman
499ab812e3 Fix post-release regressions and lock v5 to single-tenant runtime 2026-03-05 23:46:35 +00:00
rcourtman
5bd0563283 test(providers): update Ollama integration tests for timeout parameter 2026-03-01 23:28:16 +00:00
rcourtman
d46b5fc84b fix(ai): route OpenRouter slash-delimited models to OpenAI provider (#1296)
createProviderForModel() only handled "provider:model" colon format.
Models like "google/gemini-2.5-flash" or "google/gemini-2.0-flash:free"
(OpenRouter format) failed because the colon split produced invalid
provider names.

Now uses config.ParseModelString() which correctly detects slash-
delimited models as OpenRouter (routed via OpenAI-compatible API).
2026-03-01 22:29:45 +00:00
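The parsing distinction can be sketched as follows. This mirrors the described behavior of `config.ParseModelString`, not its actual code: slash-delimited ids must be detected before any colon split, because OpenRouter variants like `:free` put a colon inside the model id.

```go
package main

import (
	"fmt"
	"strings"
)

// parseModelString sketches the routing fix: slash-delimited ids are OpenRouter
// models (served via the OpenAI-compatible API) and are kept whole; only
// slash-free strings use the "provider:model" colon format.
func parseModelString(s string) (provider, model string) {
	if strings.Contains(s, "/") {
		return "openrouter", s // keep full id, including ":free" variants
	}
	if i := strings.Index(s, ":"); i > 0 {
		return s[:i], s[i+1:]
	}
	return "ollama", s // bare names route to Ollama by default
}

func main() {
	p, m := parseModelString("google/gemini-2.0-flash:free")
	fmt.Println(p, m) // openrouter google/gemini-2.0-flash:free
}
```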
rcourtman
2fcddecf80 feat(api): add POST /api/ai/patrol/undismiss endpoint to revert suppressed findings (#1300)
The Undismiss() method existed on FindingsStore but was never exposed
via the API. Users who dismissed findings as "not_an_issue" had no way
to revert them.

- Add HandleUndismissFinding handler and route
- Add Undismiss() to UnifiedStore for parity with FindingsStore
- Also remove matching explicit suppression rules on undismiss
2026-03-01 22:29:36 +00:00
rcourtman
d852964696 fix(ai): record patrol and QuickAnalysis token usage in cost store for budget enforcement
Patrol runs, evaluation passes, and QuickAnalysis calls were consuming
LLM tokens without recording them in the cost store. This made the
cost_budget_usd_30d budget setting ineffective since enforceBudget()
never saw patrol spend.

- Add RecordUsage() to ai.Service for thread-safe cost recording
- Add recordPatrolUsage() helper to PatrolService, called on both
  success and error paths for main patrol and evaluation pass
- Record QuickAnalysis token usage in cost store
- Return partial PatrolResponse (with token counts) on error instead
  of nil, so callers can always record consumed tokens
- Propagate partial response through chat_service_adapter on error
2026-03-01 19:19:47 +00:00
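A sketch of thread-safe cost recording under these assumptions: the type, method, and pricing parameters below are illustrative, not Pulse's actual `RecordUsage` signature.

```go
package main

import (
	"fmt"
	"sync"
)

// CostStore sketches the idea: every LLM call (patrol, evaluation, QuickAnalysis)
// records its token spend under a mutex so budget enforcement sees all of it.
type CostStore struct {
	mu       sync.Mutex
	spendUSD float64
}

// RecordUsage converts token counts to dollars at per-million-token rates.
func (c *CostStore) RecordUsage(inTok, outTok int, inPer1M, outPer1M float64) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.spendUSD += float64(inTok)/1e6*inPer1M + float64(outTok)/1e6*outPer1M
}

// OverBudget is what an enforceBudget-style check would consult.
func (c *CostStore) OverBudget(budgetUSD float64) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.spendUSD >= budgetUSD
}

func main() {
	s := &CostStore{}
	s.RecordUsage(1_000_000, 0, 3.0, 15.0) // 1M input tokens at $3/1M
	fmt.Println(s.OverBudget(2.5))         // true: $3.00 >= $2.50
}
```

Recording on error paths too (via a partial response carrying token counts) is what keeps the store honest: tokens are spent whether or not the call succeeds.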
rcourtman
c575c7e295 fix(patrol): rename wearout JSON field to ssd_life_remaining_pct (#1300)
The AI also receives disk data via tool calls (pulse_metrics type="disks"),
not just the patrol context table. The raw JSON field "wearout" was
ambiguous — rename to "ssd_life_remaining_pct" so the field name itself
communicates that 100 = healthy.
2026-02-27 23:12:27 +00:00
rcourtman
3006f51b60 fix(patrol): clarify wearout semantics so AI knows 100% = healthy (#1300)
The patrol context table header said "Wearout" and the tool returned a raw
"wearout" JSON field with no indication that 100 = full life remaining.
The AI interpreted "wearout: 100" as fully worn out and raised false
"100% Disk Wearout" findings on healthy NVMe drives.

Rename the patrol table column to "SSD Life Remaining (100%=new)" and
update the data type comment to clarify the semantics.
2026-02-27 23:05:02 +00:00
rcourtman
9aee8fa293 fix(ui): add Pro badge to Reporting tab and reduce patrol trigger log noise (#1285, #1258)
Show "Pro" badge on the Reporting settings tab so users know upfront
that advanced reporting requires a Pro license, rather than discovering
it after filling out the form.

Downgrade patrol trigger queue-full and rejection messages from Warn to
Debug — these are normal rate-limiting behavior, not actionable warnings.
2026-02-26 21:09:13 +00:00
rcourtman
24f5b1cb31 fix(patrol): cap per-run tokens and reset patrol session history 2026-02-24 11:29:47 +00:00
rcourtman
706502c22d fix(alerts): default NotifyOnResolve to true and prevent patrol queue spam (#1259, #1258)
Recovery notifications were silently disabled for users with pre-5.1.12
configs because the NotifyOnResolve bool field defaults to false when
absent from JSON. Use a *bool probe to detect missing field and default
to true.

Patrol trigger queue filled with warnings when the patrol loop wasn't
running. Gate TriggerPatrolForAlert on p.running and clear the flag
via defer when the loop exits.
2026-02-20 17:56:41 +00:00
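The `*bool` probe technique can be sketched as below: a plain `bool` field decodes a missing JSON key as `false`, which is indistinguishable from an explicit opt-out, while a pointer decodes to `nil` when the key is absent. Field and function names are illustrative.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// notifyOnResolve sketches the fix: pre-5.1.12 configs lack the field entirely,
// so a nil pointer means "absent" and defaults to enabled, while an explicit
// false is respected.
func notifyOnResolve(raw []byte) bool {
	var probe struct {
		NotifyOnResolve *bool `json:"notify_on_resolve"`
	}
	if err := json.Unmarshal(raw, &probe); err != nil || probe.NotifyOnResolve == nil {
		return true // field missing: default to enabled
	}
	return *probe.NotifyOnResolve
}

func main() {
	fmt.Println(notifyOnResolve([]byte(`{}`)))                          // true (legacy config)
	fmt.Println(notifyOnResolve([]byte(`{"notify_on_resolve":false}`))) // false (explicit opt-out)
}
```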
rcourtman
5666d6a9e8 fix(ai): fsync knowledge store temp file before rename to prevent empty reads
saveToDisk used os.WriteFile which doesn't sync to disk before the
atomic rename. On CI runners with aggressive filesystem caching this
can leave the destination file with zero bytes, causing
TestKnowledgeStore_SaveLoad to fail with "unexpected end of JSON input".
2026-02-18 13:27:47 +00:00
rcourtman
7efcec3120 fix(agents,ai): host URL field, AI Docker routing, Proxmox registration logging (#1197, #1210, #1267)
#1197: Add Custom URL input to the expanded host row in Settings → Agents.
Loads existing URL via HostMetadataAPI on row expand; saves on button click.
Only shown for host-type agent rows.

#1210: Fix agent_connected always false for Docker hosts on Proxmox VMs.
connectedAgentHostnames now also marks Docker host hostnames reachable when
their matching VM/LXC has a node with a connected Proxmox agent, mirroring
the routing logic already used in the control path.

#1267/#1269: Improve Proxmox auto-registration failure logging. Response body
is now included in the error message, and the warning directs users to delete
the state file to force re-registration rather than claiming the node exists.

(cherry picked from commit 305f6d3c94f0da4fc970450a6304da57d6d7fe80)
2026-02-18 12:57:09 +00:00
rcourtman
43af70ca1f fix(patrol): skip alert triggers when Patrol is disabled
TriggerPatrolForAlert was enqueuing into adHocTrigger regardless of
whether Patrol was enabled. With patrolLoop not running (disabled),
nothing drained the channel — it filled on the 10th alert and spammed
"Patrol trigger queue full, dropping trigger" on every subsequent alert.

Read p.config.Enabled in the same RLock as triggerManager and return
early when disabled.

Fixes #1258

(cherry picked from commit 69f399469538f0c9cd59084f6429fed8a793c042)
2026-02-18 12:53:12 +00:00
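The gating described above can be sketched as below, with illustrative names: read the enabled flag under the same `RLock`, return early when disabled, and use a non-blocking send so a full queue drops rather than blocks.

```go
package main

import (
	"fmt"
	"sync"
)

type PatrolService struct {
	mu           sync.RWMutex
	enabled      bool
	adHocTrigger chan string
}

// TriggerPatrolForAlert sketches the fix: when patrol is disabled the loop
// isn't running and nothing drains adHocTrigger, so enqueuing would only fill
// the channel and spam "queue full" warnings on every alert.
func (p *PatrolService) TriggerPatrolForAlert(alertID string) bool {
	p.mu.RLock()
	enabled := p.enabled
	p.mu.RUnlock()
	if !enabled {
		return false // no consumer: don't enqueue at all
	}
	select {
	case p.adHocTrigger <- alertID:
		return true
	default:
		return false // queue full: drop quietly
	}
}

func main() {
	p := &PatrolService{adHocTrigger: make(chan string, 1)}
	fmt.Println(p.TriggerPatrolForAlert("a1")) // false: patrol disabled
	p.enabled = true
	fmt.Println(p.TriggerPatrolForAlert("a1")) // true: enqueued
}
```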
rcourtman
42c01c1be5 fix: probe all guest IPs for reachability, not just first
Patrol only pinged the first IP address of each VM/container, causing
false "unreachable" reports for guests with multiple IPs (common with
Windows VMs that have IPv6 or multi-adapter setups). Now probes all
IPs and marks reachable if any responds.

Fixes #1215
2026-02-10 21:46:11 +00:00
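The any-IP-responds rule can be sketched as follows; the probe function stands in for Pulse's ICMP prober and the names are illustrative.

```go
package main

import "fmt"

// anyReachable sketches the fix: a guest counts as reachable if ANY of its
// addresses responds, instead of only probing the first one (which falsely
// flagged multi-IP Windows guests whose first address was an unpingable IPv6).
func anyReachable(ips []string, probe func(string) bool) bool {
	for _, ip := range ips {
		if probe(ip) {
			return true
		}
	}
	return false
}

func main() {
	// Multi-adapter guest: link-local IPv6 first, reachable IPv4 second.
	probe := func(ip string) bool { return ip == "192.168.1.50" }
	fmt.Println(anyReachable([]string{"fe80::1", "192.168.1.50"}, probe)) // true
}
```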
rcourtman
8bb89c4031 test: add memory regression coverage for AI stores 2026-02-04 19:56:12 +00:00
rcourtman
d2604a6859 test: add AI memory regression coverage 2026-02-04 19:46:20 +00:00
rcourtman
526fb21076 Add tests for guest intelligence and reachability signals
Cover gatherGuestIntelligence (discovery matching, instance fallback,
reachability via mock prober, edge cases), parsePingOutput parsing,
DetectReachabilitySignals, enriched seed context (Service/Reachable
columns, quiet mode variants, health issues fallback), and extend
signal helper tests for SignalGuestUnreachable.
2026-02-04 14:12:50 +00:00
rcourtman
34ca427458 Add unified guest intelligence to patrol seed context
Enrich the patrol seed context with service identity (from discovery
store) and network reachability (via ICMP ping through host agents).
The guest metrics table now includes Service and Reachable columns,
and a Service Health Issues section highlights running-but-unreachable
guests. A new SignalGuestUnreachable signal type creates deterministic
findings for unreachable guests.

New files:
- patrol_intelligence.go: GuestProber interface, GuestIntelligence
  type, gatherGuestIntelligence() with concurrent per-node probing
- patrol_prober.go: agentExecProber implementation using batch ping
  commands via connected host agents
2026-02-04 14:08:57 +00:00
rcourtman
098a722e03 Cover blocked AI fetch hosts 2026-02-04 13:54:32 +00:00
rcourtman
dd3e9fc4a8 Cover loopback override in AI fetch guard 2026-02-04 13:53:29 +00:00
rcourtman
2d29b3dcd7 Unify Proxmox discovery and integrate PMG Patrol
- Unified Proxmox VE discovery by redirecting Node requests to linked Host Agents.
- Added smart deduplication and legacy fallback for Proxmox discovery results.
- Integrated Proxmox Mail Gateway (PMG) into AI Patrol system.
- Added comprehensive tests for discovery redirection and deduplication.
2026-02-04 13:52:36 +00:00
rcourtman
634594a168 Unify Proxmox discovery results
- Redirect PVE node lookups to linked Host Agent ID when available.
- Implement deduplication in discovery lists to prefer Host Agent data over redundant Node entries.
- Add fallback mechanism to the original Node ID for discovery retrieval, ensuring compatibility with legacy data.
- Update data adapters and add comprehensive unit tests for redirection and deduplication logic.
2026-02-04 13:46:56 +00:00
rcourtman
a6f2a674eb fix: resolve test failures blocking release
- KnowledgeStore: use atomic write (temp+rename) to prevent file
  corruption from concurrent async saves
- Change password tests: add auth headers since endpoint now requires
  authentication
- ClearSession test: expect 2 cookies (pulse_session + pulse_csrf)
  matching updated clearSession behavior
- API token test: update to match current behavior where query-string
  tokens are accepted (needed for WebSocket connections)
- Host agent config: allow ScopeHostManage to resolve any host, not
  just token-bound hosts
2026-02-03 23:53:54 +00:00
rcourtman
2ebe65bbc5 security: add scope checks to AI Patrol and agent profile endpoints
- AI Patrol mutation endpoints (acknowledge, dismiss, suppress, snooze, resolve,
  findings/note, suppressions/*) now require ai:execute scope to prevent
  low-privilege tokens from blinding patrol by hiding/suppressing findings

- Agent profile admin endpoints (/api/admin/profiles/*) now require
  settings:write scope to prevent low-privilege tokens from modifying
  fleet-wide agent behavior
2026-02-03 19:29:56 +00:00
rcourtman
69e3286e5e security: fix AI OAuth scope bypass, approval replay attacks, and approval endpoint scope gating
- OAuth endpoints now require settings:write scope (not just admin)
- Approval endpoints now require ai:execute scope
- Added CommandHash to approvals for replay protection
- Approvals are now single-use (consumed on first use)
- consumeApprovalWithValidation validates command matches approval
2026-02-03 19:15:15 +00:00
rcourtman
60f9e6f07f security: fix multiple vulnerabilities (SAML, SSRF, Auth)
Addressed several security findings:
- SAML: Sanitized RelayState to prevent open redirects
- SAML: Fixed logout to properly invalidate server-side sessions
- Auth: Added auth, rate limiting, and logout checks to password change endpoint
- AI: Added admin/scope gating (ai:execute) for command execution
- AI: Blocked private IP ranges in fetch_url to prevent SSRF
- Config: Enforced settings:read/write scopes for export/import
- Agent: Added agent:exec scope requirement for WebSockets
2026-02-03 18:39:15 +00:00
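The fetch_url SSRF guard can be sketched as below (see also the loopback-override coverage added later). This is a minimal illustration, not Pulse's actual guard; a real implementation must also resolve hostnames and re-check every resulting IP, since DNS can point anywhere.

```go
package main

import (
	"fmt"
	"net"
)

// isBlockedFetchAddr sketches the policy: refuse private, link-local, and
// unspecified addresses, and loopback unless explicitly allowed (e.g. for
// self-hosted endpoints the operator has opted into).
func isBlockedFetchAddr(addr string, allowLoopback bool) bool {
	ip := net.ParseIP(addr)
	if ip == nil {
		return true // not a literal IP: block until resolved and re-checked
	}
	if ip.IsLoopback() {
		return !allowLoopback
	}
	return ip.IsPrivate() || ip.IsLinkLocalUnicast() || ip.IsUnspecified()
}

func main() {
	fmt.Println(isBlockedFetchAddr("10.0.0.5", false)) // true: RFC 1918
	fmt.Println(isBlockedFetchAddr("8.8.8.8", false))  // false: public
	fmt.Println(isBlockedFetchAddr("127.0.0.1", true)) // false: loopback override
}
```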
rcourtman
f8bb14977d fix(discovery): include IPAddresses in state adapter for URL suggestion
The discovery state adapter was not copying IPAddresses from the models
when converting VM/Container state. This caused getResourceExternalIP()
to return empty strings, preventing URL suggestion from working.
2026-02-03 17:05:01 +00:00
rcourtman
935326ebb7 fix(api/ai): resolve critical auth, agent download, and lifecycle issues
- Fix API-only mode to accept Bearer tokens and query params
- Fix data race in API token validation using fine-grained locking
- Fix unified agent download serving wrong binary for invalid arch
- Fix AI infra discovery running when AI disabled and missing stop mechanism
2026-02-03 16:35:12 +00:00
rcourtman
3d8374e527 Fix AI investigation context and UI settings
- Ensure correct org context is used for AI chat service resolution
- Fix AI adapter tests
- Update AI Intelligence page UI for advanced settings
2026-02-03 16:24:56 +00:00
rcourtman
8720708e70 fix: address AI patrol concurrency and streaming issues
- HIGH: Create per-request AgenticLoop instead of sharing one across
  concurrent sessions. This prevents race conditions where ExecuteStream
  calls would overwrite each other's FSM, knowledge accumulator, and
  other session-specific state.

- MEDIUM: TriggerManager.GetStatus now recomputes adaptive interval after
  pruning old events. Previously, currentInterval could remain stuck in
  busy/quiet mode after events aged out of the window.

- MEDIUM: Patrol stream phases are now broadcast to subscribers. Fixed
  setStreamPhase() to emit phase events and SubscribeToStream() to send
  phase events to late joiners. UI was stuck on 'Starting patrol...'
  because phase events were never emitted.

- LOW: Fixed TriggerStatus.CurrentInterval JSON serialization. Changed
  from time.Duration (serializes as nanoseconds) to int64 milliseconds
  to match the 'current_interval_ms' tag.
2026-02-03 14:39:00 +00:00
rcourtman
86a7c2283c Revert "Detect incompatible models that don't support function calling"
This reverts commit 11a72ee263.
2026-02-03 13:36:30 +00:00
rcourtman
c6318a8484 Revert "Simplify incompatible model error message"
This reverts commit c58fe81700.
2026-02-03 13:36:30 +00:00
rcourtman
c58fe81700 Simplify incompatible model error message 2026-02-03 13:30:54 +00:00
rcourtman
11a72ee263 Detect incompatible models that don't support function calling
When local LLM servers (LM Studio, llama.cpp) receive tool definitions
but the model doesn't support function calling, they output internal
control tokens like <|channel|>, <|im_start|>, etc. instead of proper
responses.

This change detects these control tokens during streaming and returns
a clear error message explaining that the model doesn't support function
calling and recommending compatible models (Llama 3.1+, Mistral, Qwen).

This is better than the previous approach of offering a "disable tools"
option, which would have crippled Pulse Assistant/Patrol functionality.
Users need to use compatible models for the AI features to work properly.

Related to #1154
2026-02-03 13:28:37 +00:00
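Although this change was later reverted, the detection it describes can be sketched as follows. The token list is an assumption about what LM Studio and llama.cpp emit; the real check ran against the accumulated stream.

```go
package main

import (
	"fmt"
	"strings"
)

// controlTokens are markers local servers emit when a model that lacks
// function calling receives tool definitions (illustrative list).
var controlTokens = []string{"<|channel|>", "<|im_start|>", "<|constrain|>", "<|message|>"}

// hasControlToken sketches the streaming check: scan accumulated output for
// any control token so the service can return a clear "model doesn't support
// function calling" error instead of streaming garbage to the user.
func hasControlToken(buf string) bool {
	for _, t := range controlTokens {
		if strings.Contains(buf, t) {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(hasControlToken("<|channel|>analysis<|message|>hi")) // true
	fmt.Println(hasControlToken("The disk looks healthy."))          // false
}
```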
rcourtman
a55ae78715 Revert "Add config option to disable tools for OpenAI-compatible endpoints"
This reverts commit 81229f206f.
2026-02-03 13:26:26 +00:00
rcourtman
81229f206f Add config option to disable tools for OpenAI-compatible endpoints
Some local LLM servers (LM Studio, llama.cpp) expose OpenAI-compatible
APIs but don't support function calling. When tools are sent to these
models, they output raw control tokens instead of proper responses.

This change adds:
- openai_tools_disabled config field in AIConfig
- AreToolsDisabledForProvider() method to check at runtime
- API support to get/set the new setting
- Tests for the new functionality

When enabled and using a custom OpenAI base URL, the chat service will
skip sending tools to the model, allowing basic chat functionality to
work even with models that don't support function calling.

Fixes #1154
2026-02-03 13:21:44 +00:00
rcourtman
e3556455c6 Revert "Sanitize LLM control tokens from OpenAI-compatible responses"
This reverts commit e5eb15918e.
2026-02-03 13:14:33 +00:00
rcourtman
e5eb15918e Sanitize LLM control tokens from OpenAI-compatible responses
Some local models (llama.cpp, LM Studio) output internal control tokens
like <|channel|>, <|constrain|>, <|message|> instead of using proper
function calling. These tokens leak into the UI creating a poor UX.

This adds sanitization to strip these control tokens from both streaming
and non-streaming responses before they reach the user.
2026-02-03 13:12:17 +00:00
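This change was reverted shortly after, but the sanitization it describes can be sketched as a single regex pass over each response chunk. The pattern is an assumption about which `<|...|>` tokens local servers emit; the reverted code may have matched differently.

```go
package main

import (
	"fmt"
	"regexp"
)

// controlToken matches <|...|> markers such as <|channel|>, <|constrain|>,
// and <|message|> that some local models leak into their output.
var controlToken = regexp.MustCompile(`<\|[A-Za-z0-9_]+\|>`)

// sanitize strips control tokens before the text reaches the UI, for both
// streaming and non-streaming responses.
func sanitize(s string) string {
	return controlToken.ReplaceAllString(s, "")
}

func main() {
	fmt.Println(sanitize("<|message|>All nodes are healthy.")) // All nodes are healthy.
}
```

A boundary-aware version would also have to buffer partial tokens split across stream chunks, which is one reason a simple strip was arguably the weaker approach compared with the error-on-detect change that followed.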