The single-resource AI narrative landed in b2bd9d114 but multi-resource
fleet reports stayed heuristic-only. That left a gap on the exact axis
where AI helps most: a 50-resource fleet PDF is where synthesis is the
difference between useful and unread.
Introduce FleetNarrator as a separate interface from Narrator. The
input shapes are different — single-resource takes one set of metric
stats with a prior window, fleet takes a denormalised cross-resource
view with per-resource summaries plus a fleet aggregate.
HeuristicFleetNarrator owns the deterministic fallback: ranks
resources by severity (critical alerts > unhealthy disks > storage
pressure > memory > CPU > non-critical alerts), picks up to 5
outliers, derives cross-cutting patterns by counting how many of N
resources share a hot signal, and emits fleet-scoped recommendations.
internal/ai.Service implements FleetNarrator through
report_fleet_narrator.go. Distinct use-case label
(report_narrative_fleet) so fleet vs single-resource spend is
separable in the cost ledger and budget gate. The fleet payload is
denormalised through buildReportFleetPayload so prompt cost scales
linearly with fleet size. Same fail-closed invariant — nil provider,
parse failure, or context cancellation falls through to the heuristic.
Single-resource Narrator is intentionally NOT propagated through
engine.GenerateMulti: a 50-resource fleet report performs one AI call
(fleet narrator), not 51. The router resolver returns the AI service
for all three roles (Narrator, FleetNarrator, FindingsProvider).
The fleet PDF renders the FleetNarrative in the fleet summary cover
when present: executive prose, named outliers with severity-coloured
bullets, cross-cutting patterns, recommendations, optional period
comparison, and an AI provenance footer. The deterministic resource
summary table is preserved above so every named outlier is verifiable
against the table immediately below it. Legacy "Highest CPU / Most
alerts" bullets remain as the fallback when no FleetNarrative is
attached.
Performance reports rendered the Executive Summary, Observations, and
Recommendations sections from inline threshold rules in pdf.go. That
narrative looked intelligent but was static templating against alert
counts and metric percentiles, which felt off-brand alongside Patrol
and Pulse Assistant.
Introduce a Narrator interface in pkg/reporting and a FindingsProvider
counterpart that the engine consults at report time. The heuristic
rules are lifted into HeuristicNarrator unchanged so the deterministic
fallback still produces the same observations and recommendations.
The engine now also queries the comparable prior period and threads
its aggregate stats through the narrator so deltas can be expressed.
internal/ai.Service implements both interfaces via report_narrator.go
(single-turn JSON call grounded in the structured ReportData payload,
falling back to the heuristic on any error/timeout) and
report_findings.go (Patrol findings whose lifecycle overlaps the
report window). The reporting handler resolves the per-tenant AI
service when it is configured and supplies it in the request; absent
configuration, reports look identical to the prior heuristic output.
Charts, stats tables, alert lists, storage and disk sections stay
deterministic — sysadmins can verify every AI claim against the data
tables next to it. The PDF renders the AI prose between the health
card and Quick Stats, adds a Period-over-period section after
Recommendations, and prints a provenance footer when the narrative
came from the assistant.
ai-runtime.md and api-contracts.md updates land in a follow-up commit
on this branch; agent-lifecycle / performance-and-scalability /
storage-recovery have no contract delta from this change (router.go
is referenced in their Extension Points but their semantics are
unchanged).
Closes the "has context vs uses context" gap that defines Pulse's
agent-paradigm differentiation. The orchestrator (in pulse-pro) used
to receive a Finding with no awareness of the operator's
commitments — Patrol could investigate a resource the operator had
marked never-auto-remediate and propose a restart fix that the
action broker would refuse downstream. The proposal shouldn't have
happened in the first place.
Adds two optional fields to aicontracts.Finding:
- OperatorContext: intentionally offline, never auto-remediate,
maintenance window with computed active flag, criticality, note.
Populated in MaybeInvestigateFinding from the same operator-state
projection the suppression hot path consumes, so investigation
reasoning and suppression behavior cannot drift apart.
- OperationalMemory: regression count, previous resolved fix
summary, last regression timestamp, times raised. Populated in
ToCoreFinding from fields the internal Finding already carries.
ResourceOperatorStateProjection grew a NeverAutoRemediate field —
the investigation read path needs it (so the orchestrator can avoid
proposing fixes the broker would refuse) even though the
suppression hot path doesn't. Same projection serves both reads.
Both fields are nil when there's no signal (fresh finding, no
operator state) so the orchestrator branches on absence rather
than parsing zero-valued structs. The pulse-pro orchestrator
consumes the fields in a separate slice; this slice ships the
in-repo half of the data path.
When every cluster endpoint failed health, getHealthyClient wrapped
the failure as `no healthy nodes available in cluster X (all N
endpoints unreachable: [...])`, dropping the per-endpoint reason from
cc.lastError. The connections aggregator's auth-error regex
(401/403/unauthorized/forbidden/authentication/...) only sees the
outer message, so a token rejected with 401 on every endpoint of a
clustered PVE connection surfaced as `state: "unreachable"` /
`adapterHealth: "blocked"` instead of `state: "unauthorized"` /
`credentialStatus: "invalid"` — the same Settings → Connections
brokenness the rest of today's commits set out to remove.
Single-node `pve:pi` already classified the same kind of failure
correctly because its error came straight from the per-instance
client; only the cluster wrapper masked it.
Surface each unhealthy endpoint's already-sanitized reason in the
outer error. The "no healthy nodes available" prefix is preserved so
existing callers that test for it (monitor_polling_storage.go,
internal cluster_client passthroughs, existing tests) keep working.
Add a regression test covering both shapes:
- all endpoints failed auth → wrapped error contains
"Authentication failed" so the aggregator regex now matches.
- endpoint with no recorded reason → wrapped error includes the
fallback "no recorded reason" text rather than a bare URL.
Promote the seven-field investigation-record shape so Patrol findings
can carry consequence-if-ignored context and a record-level rollback
plan alongside the existing verification array. The shared
aicontracts.InvestigationRecord struct gains top-level Impact and
Rollback fields with matching TS mirrors, normalizes Rollback to an
empty slice, and the Patrol-owned investigation surface renders an
explicit "Impact not assessed" / "Rollback not specified" placeholder
so the operator-visible gap is conspicuous to both the operator and
Assistant when Patrol has not populated them. Backend default leaves
both empty rather than fabricating analysis from severity/category.
Also closes the existing Trigger.cause drift between Go and TS so
frontend handoff context preserves backend-attributed failure cause,
and updates the api-contracts, ai-runtime, frontend-primitives, and
patrol-intelligence subsystem contracts to pin the new shape.
- store requester provenance on approval records
- carry requester metadata through approval APIs and Assistant handoffs
- document the safe Patrol approval provenance boundary
Extract alert config types, normalization, and identity helpers into internal/alerts/config while preserving the existing alerts package API through aliases and wrappers.
Move Manager callback lifecycle state into a same-package callbackBus, keeping public Set/Subscribe methods unchanged.
Harden metrics SQLite artifacts to owner-only permissions and cover permissive umask behavior.
Proof: go test -json ./internal/api -count=1; go test ./internal/alerts/... ./internal/monitoring ./internal/ai/... ./internal/websocket ./internal/config ./pkg/metrics; go test ./internal/alerts/... ./pkg/metrics
Retire runtime/API/UI monitored-system volume enforcement now that infrastructure monitoring is no longer capped.
Keep only legacy metadata scrubbing and purchase-start compatibility for old max_monitored_systems references.
Rename the remaining preview surface to monitored-system impact and make previews explanatory rather than save-blocking.
Update subsystem contracts and RA7 evidence for the caps-retired invariant.
Treat OIDC, SAML, and multi-provider SSO as included Community capabilities while retaining advanced_sso as a compatibility key. Remove SAML-specific paywalls and paid-upgrade copy from runtime, settings UI, entitlement snapshots, docs, journey proof, and subsystem contracts.
Refs #1449
Parse the /proc/mdstat operation keyword for mdadm arrays and propagate it through host reports, models, unified resources, monitoring views, alert metadata, and AI storage summaries.
Treat recovery and reshape as rebuild signals while silencing routine check and resync maintenance, with fallback rebuild detection only when no mdstat operation is available.
Tests cover mdstat operation parsing plus recovery, check, and resync alert behavior.
Fixes#1446
Remove local upgrade-metrics API registration, settings payload wiring, startup store migration, and backend conversion recorder hooks from the normal product runtime.
Delete the retired conversion/funnel and metering packages from compiled licensing code, and extend diagnostics boundary audits and governance contracts so maintainer commercial analytics cannot return through Settings or diagnostics.