363 commits

Author SHA1 Message Date
rcourtman
d4463a615c Add fleet-level AI narrative for multi-resource reports
The single-resource AI narrative landed in b2bd9d114 but multi-resource
fleet reports stayed heuristic-only. That left a gap on the exact axis
where AI helps most: a 50-resource fleet PDF is where synthesis is the
difference between useful and unread.

Introduce FleetNarrator as a separate interface from Narrator. The
input shapes are different — single-resource takes one set of metric
stats with a prior window, fleet takes a denormalised cross-resource
view with per-resource summaries plus a fleet aggregate.
HeuristicFleetNarrator owns the deterministic fallback: ranks
resources by severity (critical alerts > unhealthy disks > storage
pressure > memory > CPU > non-critical alerts), picks up to 5
outliers, derives cross-cutting patterns by counting how many of N
resources share a hot signal, and emits fleet-scoped recommendations.
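
A minimal sketch of the ranking described above, assuming a
hypothetical ResourceSummary type. The field names, the maxOutliers
constant, and the helper names are illustrative and are not taken from
pkg/reporting; only the severity order and the cap of 5 come from the
message.

```go
package main

import (
	"fmt"
	"sort"
)

// ResourceSummary is a hypothetical per-resource view. The real fleet
// payload in pkg/reporting may look quite different.
type ResourceSummary struct {
	Name              string
	CriticalAlerts    int
	UnhealthyDisks    int
	StoragePressure   bool
	MemoryHot         bool
	CPUHot            bool
	NonCriticalAlerts int
}

const (
	rankNone    = 6 // no hot signal at all
	maxOutliers = 5 // "picks up to 5 outliers"
)

// severityRank returns a lower number for a more severe resource:
// critical alerts > unhealthy disks > storage pressure > memory > CPU >
// non-critical alerts.
func severityRank(r ResourceSummary) int {
	switch {
	case r.CriticalAlerts > 0:
		return 0
	case r.UnhealthyDisks > 0:
		return 1
	case r.StoragePressure:
		return 2
	case r.MemoryHot:
		return 3
	case r.CPUHot:
		return 4
	case r.NonCriticalAlerts > 0:
		return 5
	default:
		return rankNone
	}
}

// pickOutliers keeps resources with at least one hot signal, orders them
// by severity (stable, so input order breaks ties), and caps the result.
func pickOutliers(fleet []ResourceSummary) []ResourceSummary {
	var hot []ResourceSummary
	for _, r := range fleet {
		if severityRank(r) < rankNone {
			hot = append(hot, r)
		}
	}
	sort.SliceStable(hot, func(i, j int) bool {
		return severityRank(hot[i]) < severityRank(hot[j])
	})
	if len(hot) > maxOutliers {
		hot = hot[:maxOutliers]
	}
	return hot
}

// countShared reports how many of the N resources match a predicate,
// which is the raw material for a "k of N resources ..." pattern.
func countShared(fleet []ResourceSummary, match func(ResourceSummary) bool) int {
	n := 0
	for _, r := range fleet {
		if match(r) {
			n++
		}
	}
	return n
}

func main() {
	fleet := []ResourceSummary{
		{Name: "pve-a", CPUHot: true},
		{Name: "pve-b", CriticalAlerts: 2, CPUHot: true},
		{Name: "pve-c"},
	}
	for _, r := range pickOutliers(fleet) {
		fmt.Println(r.Name)
	}
	cpu := countShared(fleet, func(r ResourceSummary) bool { return r.CPUHot })
	fmt.Printf("%d of %d resources are CPU-hot\n", cpu, len(fleet))
}
```

A stable sort keeps the output deterministic for equal ranks, which
matters if the fallback has to produce the same report for the same
input.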

internal/ai.Service implements FleetNarrator through
report_fleet_narrator.go. It uses a distinct use-case label
(report_narrative_fleet) so fleet and single-resource spend are
separable in the cost ledger and budget gate. The fleet payload is
denormalised through buildReportFleetPayload so prompt cost scales
linearly with fleet size. The same fail-closed invariant applies: a nil
provider, parse failure, or context cancellation falls through to the
heuristic.
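
The fail-closed behaviour can be sketched as a small wrapper. This is
an illustration under assumptions, not the code in
report_fleet_narrator.go: the method signature, the FleetNarrative
fields, and every type below other than the FleetNarrator name are
hypothetical.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// FleetNarrative is a hypothetical result type; Source records provenance.
type FleetNarrative struct {
	Summary string
	Source  string // "ai" or "heuristic"
}

// FleetNarrator mirrors the interface name from the commit message; the
// method signature is an assumption made for this sketch.
type FleetNarrator interface {
	NarrateFleet(ctx context.Context, resources int) (FleetNarrative, error)
}

// heuristicNarrator stands in for the deterministic fallback.
type heuristicNarrator struct{}

func (heuristicNarrator) NarrateFleet(_ context.Context, n int) (FleetNarrative, error) {
	return FleetNarrative{Summary: fmt.Sprintf("%d resources reviewed", n), Source: "heuristic"}, nil
}

// okNarrator and brokenNarrator simulate an AI provider that succeeds or
// returns an unparseable response.
type okNarrator struct{}

func (okNarrator) NarrateFleet(_ context.Context, n int) (FleetNarrative, error) {
	return FleetNarrative{Summary: "synthesised prose", Source: "ai"}, nil
}

type brokenNarrator struct{}

func (brokenNarrator) NarrateFleet(context.Context, int) (FleetNarrative, error) {
	return FleetNarrative{}, errors.New("parse failure")
}

// failClosed tries the primary narrator and falls through to the fallback
// when the primary is nil, the context is already done, or the call fails.
type failClosed struct {
	primary  FleetNarrator // may be nil when no provider is configured
	fallback FleetNarrator
}

func (f failClosed) NarrateFleet(ctx context.Context, n int) (FleetNarrative, error) {
	if f.primary != nil && ctx.Err() == nil {
		if out, err := f.primary.NarrateFleet(ctx, n); err == nil {
			return out, nil
		}
	}
	return f.fallback.NarrateFleet(ctx, n)
}

// demoSource reports which narrator produced the result for a given primary.
func demoSource(primary FleetNarrator) string {
	n := failClosed{primary: primary, fallback: heuristicNarrator{}}
	out, err := n.NarrateFleet(context.Background(), 50)
	if err != nil {
		return "error"
	}
	return out.Source
}

func main() {
	fmt.Println(demoSource(okNarrator{}))     // AI path
	fmt.Println(demoSource(brokenNarrator{})) // parse failure -> fallback
	fmt.Println(demoSource(nil))              // nil provider -> fallback
}
```

The caller never sees the provider error in this sketch; whether the
real service logs or records it is not stated in the message.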

Single-resource Narrator is intentionally NOT propagated through
engine.GenerateMulti: a 50-resource fleet report performs one AI call
(fleet narrator), not 51. The router resolver returns the AI service
for all three roles (Narrator, FleetNarrator, FindingsProvider).

The fleet PDF renders the FleetNarrative in the fleet summary cover
when present: executive prose, named outliers with severity-coloured
bullets, cross-cutting patterns, recommendations, optional period
comparison, and an AI provenance footer. The deterministic resource
summary table is preserved above so every named outlier is verifiable
against the table immediately below it. Legacy "Highest CPU / Most
alerts" bullets remain as the fallback when no FleetNarrative is
attached.
2026-05-10 21:23:12 +01:00
rcourtman
b2bd9d1147 Replace heuristic report narrative with optional AI-generated layer
Performance reports rendered the Executive Summary, Observations, and
Recommendations sections from inline threshold rules in pdf.go. That
narrative looked intelligent but was static templating against alert
counts and metric percentiles, which felt off-brand alongside Patrol
and Pulse Assistant.

Introduce a Narrator interface in pkg/reporting and a FindingsProvider
counterpart that the engine consults at report time. The heuristic
rules are lifted into HeuristicNarrator unchanged so the deterministic
fallback still produces the same observations and recommendations.
The engine now also queries the comparable prior period and threads
its aggregate stats through the narrator so deltas can be expressed.

internal/ai.Service implements both interfaces via report_narrator.go
(single-turn JSON call grounded in the structured ReportData payload,
falling back to the heuristic on any error/timeout) and
report_findings.go (Patrol findings whose lifecycle overlaps the
report window). The reporting handler resolves the per-tenant AI
service when it is configured and supplies it in the request; absent
configuration, reports look identical to the prior heuristic output.

Charts, stats tables, alert lists, storage and disk sections stay
deterministic — sysadmins can verify every AI claim against the data
tables next to it. The PDF renders the AI prose between the health
card and Quick Stats, adds a Period-over-period section after
Recommendations, and prints a provenance footer when the narrative
came from the assistant.

ai-runtime.md and api-contracts.md updates land in a follow-up commit
on this branch; agent-lifecycle / performance-and-scalability /
storage-recovery have no contract delta from this change (router.go
is referenced in their Extension Points but their semantics are
unchanged).
2026-05-10 19:30:54 +01:00
rcourtman
51c5d344ce Plumb operator-state and operational memory into investigation findings
Closes the "has context vs uses context" gap that defines Pulse's
agent-paradigm differentiation. The orchestrator (in pulse-pro) used
to receive a Finding with no awareness of the operator's
commitments — Patrol could investigate a resource the operator had
marked never-auto-remediate and propose a restart fix that the
action broker would refuse downstream. The proposal shouldn't have
happened in the first place.

Adds two optional fields to aicontracts.Finding:

- OperatorContext: intentionally offline, never auto-remediate,
  maintenance window with computed active flag, criticality, note.
  Populated in MaybeInvestigateFinding from the same operator-state
  projection the suppression hot path consumes, so investigation
  reasoning and suppression behavior cannot drift apart.
- OperationalMemory: regression count, previous resolved fix
  summary, last regression timestamp, times raised. Populated in
  ToCoreFinding from fields the internal Finding already carries.

ResourceOperatorStateProjection grew a NeverAutoRemediate field —
the investigation read path needs it (so the orchestrator can avoid
proposing fixes the broker would refuse) even though the
suppression hot path doesn't. Same projection serves both reads.

Both fields are nil when there's no signal (fresh finding, no
operator state) so the orchestrator branches on absence rather
than parsing zero-valued structs. The pulse-pro orchestrator
consumes the fields in a separate slice; this slice ships the
in-repo half of the data path.
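
A sketch of what "branching on absence" can look like for a consumer.
The struct and field names below are assumptions loosely modelled on
the description above, and planFor is a hypothetical stand-in: the
pulse-pro orchestrator is not shown in this repository.

```go
package main

import "fmt"

// OperatorContext and OperationalMemory are hypothetical shapes; the real
// aicontracts types may carry more fields or different names.
type OperatorContext struct {
	IntentionallyOffline bool
	NeverAutoRemediate   bool
	Note                 string
}

type OperationalMemory struct {
	RegressionCount int
	TimesRaised     int
}

// Finding carries both extras as pointers so "no signal" is nil rather
// than a zero-valued struct.
type Finding struct {
	ID                string
	OperatorContext   *OperatorContext
	OperationalMemory *OperationalMemory
}

// planFor is an illustrative consumer: it checks for presence first and
// only then reads the fields.
func planFor(f Finding) string {
	if f.OperatorContext != nil && f.OperatorContext.NeverAutoRemediate {
		return "explain-only" // do not propose a fix the broker would refuse
	}
	if f.OperationalMemory != nil && f.OperationalMemory.RegressionCount > 0 {
		return "propose-fix-with-history"
	}
	return "propose-fix"
}

func main() {
	fresh := Finding{ID: "f1"}
	pinned := Finding{ID: "f2", OperatorContext: &OperatorContext{NeverAutoRemediate: true}}
	recurring := Finding{ID: "f3", OperationalMemory: &OperationalMemory{RegressionCount: 2, TimesRaised: 3}}
	for _, f := range []Finding{fresh, pinned, recurring} {
		fmt.Println(f.ID, planFor(f))
	}
}
```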
2026-05-09 21:03:15 +01:00
rcourtman
0dd3f8bedb Surface per-endpoint reasons in cluster "no healthy nodes" error
When every cluster endpoint failed health, getHealthyClient wrapped
the failure as `no healthy nodes available in cluster X (all N
endpoints unreachable: [...])`, dropping the per-endpoint reason from
cc.lastError. The connections aggregator's auth-error regex
(401/403/unauthorized/forbidden/authentication/...) only sees the
outer message, so a token rejected with 401 on every endpoint of a
clustered PVE connection surfaced as `state: "unreachable"` /
`adapterHealth: "blocked"` instead of `state: "unauthorized"` /
`credentialStatus: "invalid"` — the same Settings → Connections
brokenness the rest of today's commits set out to remove.

Single-node `pve:pi` already classified the same kind of failure
correctly because its error came straight from the per-instance
client; only the cluster wrapper masked it.

Surface each unhealthy endpoint's already-sanitized reason in the
outer error. The "no healthy nodes available" prefix is preserved so
existing callers that test for it (monitor_polling_storage.go,
internal cluster_client passthroughs, existing tests) keep working.

Add a regression test covering both shapes:
- all endpoints failed auth → wrapped error contains
  "Authentication failed" so the aggregator regex now matches.
- endpoint with no recorded reason → wrapped error includes the
  fallback "no recorded reason" text rather than a bare URL.
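
The two shapes above can be sketched as follows. This is an
illustration, not the getHealthyClient implementation: the function
names, the separator between endpoints, and the regex are assumptions
(the regex is a shortened stand-in for the aggregator's auth-error
pattern). Only the "no healthy nodes available" prefix and the "no
recorded reason" fallback text come from the message.

```go
package main

import (
	"fmt"
	"regexp"
	"sort"
	"strings"
)

const legacyPrefix = "no healthy nodes available"

// authErrorRe is a simplified stand-in for the aggregator's auth regex.
var authErrorRe = regexp.MustCompile(`(?i)\b(401|403|unauthorized|forbidden|authentication)\b`)

// noHealthyNodesError wraps per-endpoint reasons into the outer error.
// reasons maps endpoint URL to its already-sanitized failure reason.
func noHealthyNodesError(cluster string, reasons map[string]string) error {
	endpoints := make([]string, 0, len(reasons))
	for ep := range reasons {
		endpoints = append(endpoints, ep)
	}
	sort.Strings(endpoints) // map order is random; keep the message stable

	parts := make([]string, 0, len(endpoints))
	for _, ep := range endpoints {
		reason := reasons[ep]
		if reason == "" {
			reason = "no recorded reason"
		}
		parts = append(parts, fmt.Sprintf("%s: %s", ep, reason))
	}
	return fmt.Errorf("%s in cluster %s (all %d endpoints unreachable: [%s])",
		legacyPrefix, cluster, len(endpoints), strings.Join(parts, "; "))
}

// classify mimics a caller that only sees the outer error text.
func classify(err error) string {
	if authErrorRe.MatchString(err.Error()) {
		return "unauthorized"
	}
	return "unreachable"
}

// hasLegacyPrefix is what prefix-testing callers rely on.
func hasLegacyPrefix(err error) bool {
	return strings.HasPrefix(err.Error(), legacyPrefix)
}

func main() {
	authFail := noHealthyNodesError("lab", map[string]string{
		"https://pve1:8006": "Authentication failed",
		"https://pve2:8006": "Authentication failed",
	})
	fmt.Println(classify(authFail), hasLegacyPrefix(authFail))

	silent := noHealthyNodesError("lab", map[string]string{"https://pve1:8006": ""})
	fmt.Println(classify(silent), silent)
}
```

Sorting the endpoints keeps the error text deterministic, which makes
string-matching tests less brittle.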
2026-05-08 21:10:14 +01:00
rcourtman
e7b5650233 Add impact and rollback to investigation records
Promote the seven-field investigation-record shape so Patrol findings
can carry consequence-if-ignored context and a record-level rollback
plan alongside the existing verification array. The shared
aicontracts.InvestigationRecord struct gains top-level Impact and
Rollback fields with matching TS mirrors and normalizes Rollback to an
empty slice. The Patrol-owned investigation surface renders an
explicit "Impact not assessed" / "Rollback not specified" placeholder
so the gap is conspicuous to both the operator and Assistant when
Patrol has not populated them. The backend default leaves both empty
rather than fabricating analysis from severity/category.
Also closes the existing Trigger.cause drift between Go and TS so
frontend handoff context preserves backend-attributed failure cause,
and updates the api-contracts, ai-runtime, frontend-primitives, and
patrol-intelligence subsystem contracts to pin the new shape.
2026-05-08 16:47:55 +01:00
rcourtman
ea3e1b216a Persist Patrol approval requester identity
- store requester provenance on approval records
- carry requester metadata through approval APIs and Assistant handoffs
- document the safe Patrol approval provenance boundary
2026-05-08 00:12:09 +01:00
rcourtman
d2625c4dfb Persist Patrol settings with readiness handoff
Refs #1463
2026-05-07 19:26:00 +01:00
rcourtman
86244d8c13 Track runtime build in license activation 2026-05-06 23:45:37 +01:00
rcourtman
df71bcdf09 Restore commercial monitored-system admission hook contract 2026-05-06 18:04:59 +01:00
rcourtman
b84fc2301a Surface paid runtime mismatch in licensing 2026-05-06 17:18:35 +01:00
rcourtman
75e3cb76fd Add structured Patrol investigation records 2026-05-06 16:31:51 +01:00
rcourtman
edae6d1edc refactor: split alert config and callbacks
Extract alert config types, normalization, and identity helpers into internal/alerts/config while preserving the existing alerts package API through aliases and wrappers.

Move Manager callback lifecycle state into a same-package callbackBus, keeping public Set/Subscribe methods unchanged.

Harden metrics SQLite artifacts to owner-only permissions and cover permissive umask behavior.

Proof: go test -json ./internal/api -count=1; go test ./internal/alerts/... ./internal/monitoring ./internal/ai/... ./internal/websocket ./internal/config ./pkg/metrics; go test ./internal/alerts/... ./pkg/metrics
2026-05-06 13:01:32 +01:00
rcourtman
d6ca8b12e6 Add agentless availability targets
Refs #1460
2026-05-06 10:35:34 +01:00
rcourtman
0895916283 Fix self-hosted startup web listener fail-fast
Refs #1461
2026-05-06 09:16:54 +01:00
rcourtman
d7225a45a0 Fix Proxmox guest memory fallbacks
Also fixes Ceph pool threshold resource identity.

Refs #1341
2026-05-05 14:59:29 +01:00
rcourtman
81b31e4d3b Remove monitored-system volume caps
Retire runtime/API/UI monitored-system volume enforcement now that infrastructure monitoring is no longer capped.

Keep only legacy metadata scrubbing and purchase-start compatibility for old max_monitored_systems references.

Rename the remaining preview surface to monitored-system impact and make previews explanatory rather than save-blocking.

Update subsystem contracts and RA7 evidence for the caps-retired invariant.
2026-05-05 12:59:59 +01:00
rcourtman
632f0af7f3 Keep uncapped continuity from writing raw caps 2026-05-05 09:33:44 +01:00
rcourtman
82a2494ffa Add action execution safety contract 2026-05-04 23:19:58 +01:00
rcourtman
2040285085 Add action decision API 2026-05-04 22:56:55 +01:00
rcourtman
c436e1a2a2 Add CLI fleet connection reads 2026-05-04 08:40:34 +01:00
rcourtman
863f214c10 Add CLI action audit reads 2026-05-04 00:18:19 +01:00
rcourtman
f0bf88a89d Add CLI action capability discovery 2026-05-04 00:10:15 +01:00
rcourtman
5fbe723ad9 Add CLI action planning adapter 2026-05-04 00:05:21 +01:00
rcourtman
db97478566 Reduce metrics rollup write amplification
Refs #1124
2026-05-03 21:43:20 +01:00
rcourtman
82c54cc39b Make self-hosted SSO Community-tier
Treat OIDC, SAML, and multi-provider SSO as included Community capabilities while retaining advanced_sso as a compatibility key. Remove SAML-specific paywalls and paid-upgrade copy from runtime, settings UI, entitlement snapshots, docs, journey proof, and subsystem contracts.

Refs #1449
2026-05-03 12:48:01 +01:00
rcourtman
a3617b923a Fix remaining RC3 backend CI races 2026-05-01 22:03:22 +01:00
rcourtman
3146d83701 Count Ceph monitors from monitor arrays
Refs #1290
2026-05-01 20:28:11 +01:00
rcourtman
575f432183 Make metrics writes idempotent for duplicate samples
Refs #1442
2026-05-01 20:28:11 +01:00
rcourtman
1267a817c7 Gate cloud provisioning to hosted checkouts 2026-05-01 14:13:08 +01:00
rcourtman
af7d727d45 Gate RAID rebuild alerts on mdstat operation
Parse the /proc/mdstat operation keyword for mdadm arrays and propagate it through host reports, models, unified resources, monitoring views, alert metadata, and AI storage summaries.

Treat recovery and reshape as rebuild signals while silencing routine check and resync maintenance, with fallback rebuild detection only when no mdstat operation is available.
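
A rough sketch of that classification, assuming the operation keyword
is read from the progress line of an array's /proc/mdstat entry (for
example "recovery = 12.6% ..."). The function and type names are
hypothetical, and the regex only covers the four keywords named in
this message. The fallback detection used when no operation is
available is not described here, so the sketch only reports that case
as unknown.

```go
package main

import (
	"fmt"
	"regexp"
)

// mdstatOpRe matches the operation keyword that precedes "=" on an
// mdstat progress line. Other keywords are deliberately left out.
var mdstatOpRe = regexp.MustCompile(`\b(recovery|reshape|resync|check)\s*=`)

// mdstatOperation returns the operation keyword, or "" when none is found.
func mdstatOperation(progressLine string) string {
	m := mdstatOpRe.FindStringSubmatch(progressLine)
	if m == nil {
		return ""
	}
	return m[1]
}

type rebuildSignal int

const (
	signalUnknown     rebuildSignal = iota // no operation: caller applies its fallback
	signalRebuild                          // recovery or reshape: alert
	signalMaintenance                      // check or resync: stay silent
)

// classifyOperation maps an operation keyword to an alerting decision.
func classifyOperation(op string) rebuildSignal {
	switch op {
	case "recovery", "reshape":
		return signalRebuild
	case "check", "resync":
		return signalMaintenance
	default:
		return signalUnknown
	}
}

func main() {
	lines := []string{
		"      [==>..................]  recovery = 12.6% (1234/9876) finish=12.5min",
		"      [=>...................]  check =  4.0% (1234/9876) finish=99.0min",
		"      976630464 blocks super 1.2 [2/2] [UU]",
	}
	for _, l := range lines {
		op := mdstatOperation(l)
		fmt.Printf("%q -> %d\n", op, classifyOperation(op))
	}
}
```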

Tests cover mdstat operation parsing plus recovery, check, and resync alert behavior.

Fixes #1446
2026-04-30 14:31:14 +01:00
rcourtman
c7164c2906 Clarify Relay mobile handoff paid copy 2026-04-30 13:18:04 +01:00
rcourtman
99129d0c09 Retire product upgrade metrics runtime
Remove local upgrade-metrics API registration, settings payload wiring, startup store migration, and backend conversion recorder hooks from the normal product runtime.

Delete the retired conversion/funnel and metering packages from compiled licensing code, and extend diagnostics boundary audits and governance contracts so maintainer commercial analytics cannot return through Settings or diagnostics.
2026-04-30 12:24:22 +01:00
rcourtman
daf825dee6 Remove customer commercial analytics wrappers 2026-04-30 11:46:16 +01:00
rcourtman
48c8d26198 Add paid feature claim proof bundle 2026-04-29 14:18:43 +01:00
rcourtman
f060f261cd Present Relay as annual-first support tier 2026-04-29 12:49:20 +01:00
rcourtman
0dd3cd804e Hide MSP-only features from self-hosted Pro plans 2026-04-29 01:02:10 +01:00
rcourtman
5f0078b0d0 Keep synthetic modes out of entitlement payloads 2026-04-29 00:33:53 +01:00
rcourtman
08fef313eb Rename hosted capacity marker copy 2026-04-29 00:07:18 +01:00
rcourtman
c0ef2d44f3 Keep compatibility-only features out of upgrade URLs 2026-04-28 23:22:20 +01:00
rcourtman
937696508c Guard self-hosted feature metadata drift 2026-04-28 23:16:12 +01:00
rcourtman
a67845ada0 Retire self-hosted volume caps 2026-04-28 20:36:37 +01:00
rcourtman
c197f6a7a5 Move license test signers to testsupport 2026-04-28 19:12:21 +01:00
rcourtman
b29f398b9d Fix release-mode licensing test expectations 2026-04-28 18:58:35 +01:00
rcourtman
1d189d3343 Clarify hosted entitlement signing compatibility 2026-04-28 18:47:19 +01:00
rcourtman
2b1d82d965 Retire self-hosted trial posture prompts 2026-04-28 17:39:09 +01:00
rcourtman
7cc980ad1d Retire self-hosted trial signup control plane 2026-04-28 17:02:04 +01:00
rcourtman
ded190dcab Retire hosted AI quickstart runtime 2026-04-28 16:11:27 +01:00
rcourtman
b1e179479d Retire self-hosted AI quickstart surfaces 2026-04-28 15:49:18 +01:00
rcourtman
ecf8fd4299 Keep self-hosted Pro prompts opt-in 2026-04-28 11:23:49 +01:00
rcourtman
fab0e77800 Refine self-hosted Pro value copy 2026-04-28 09:56:03 +01:00