Commit graph

6005 commits

Author SHA1 Message Date
rcourtman
f13893b63f Add Create rule from this button on Patrol findings
Last of the seven contextual entries from the captured Pulse
Intelligence rubric. "Remember as expected" handles one
instance; "Create rule from this" promotes the pattern: any
future finding on the same {resource, category} pair auto-
dismisses inside the backend's existing
FindingsStore.isSuppressedInternal /
MatchesSuppressionRule machinery rather than surfacing as a new
finding.

The backend endpoint already exists
(POST /api/ai/patrol/suppressions →
HandleAddSuppressionRule → FindingsStore.AddSuppressionRule).
The button + inline confirm panel is the missing surface:

- frontend-modern/src/api/patrol.ts gains
  createSuppressionRuleFromFinding(input) that POSTs to the
  existing endpoint with the finding's resource + category +
  operator-supplied reason.
- FindingsPanel adds a Create rule button at the end of the
  lifecycle action row, plus an inline confirmation that
  surfaces the rule scope (resource + category), requires a
  reason, and explains the future-auto-dismiss commitment.
  Submission goes through aiIntelligenceStore.loadDashboardData
  so the local view reflects the audit trail of record.
- Mirrors the visual pattern of the existing dismiss-confirmation
  panel but uses neutral surface styling because this isn't a
  dismissal — it's a permanent commitment, distinct from
  Remember as expected which dismisses just this instance.

No backend changes; the rule machinery and the API endpoint are
unchanged. This is the surface piece. Type-check clean.
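
The {resource, category} scoping described above can be sketched roughly as follows. This is a minimal illustration, not the project's code: the SuppressionRule and Finding types, their field names, and matchesSuppressionRule are hypothetical stand-ins for whatever FindingsStore actually uses.

```go
package main

import "fmt"

// SuppressionRule is a hypothetical stand-in for the rule created by the
// "Create rule from this" button: it pins one resource and one category.
type SuppressionRule struct {
	ResourceID string
	Category   string
	Reason     string // operator-supplied; the confirm panel requires it
}

// Finding carries only the two fields the rule is scoped to (assumed names).
type Finding struct {
	ResourceID string
	Category   string
}

// matchesSuppressionRule reports whether any rule covers the finding's
// {resource, category} pair. Both fields must match for suppression.
func matchesSuppressionRule(rules []SuppressionRule, f Finding) bool {
	for _, r := range rules {
		if r.ResourceID == f.ResourceID && r.Category == f.Category {
			return true
		}
	}
	return false
}

func main() {
	rules := []SuppressionRule{
		{ResourceID: "vm-101", Category: "backup", Reason: "scratch VM, never backed up"},
	}
	fmt.Println(matchesSuppressionRule(rules, Finding{"vm-101", "backup"}))      // true
	fmt.Println(matchesSuppressionRule(rules, Finding{"vm-101", "performance"})) // false
}
```

A rule keyed on both fields stays narrow: the same resource can still surface findings in other categories.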
2026-05-11 10:33:26 +01:00
rcourtman
5a7fde7b39 Refresh advanced_reporting paywall and guidance copy
The locked-state description advertised the v5 capability set ("PDF
and CSV performance reports plus current-state VM inventory
exports") and never caught up with what the v6 reporting feature
actually delivers behind the gate: AI-narrated executive summary,
fleet outlier detection with named resources, period-over-period
comparison, and Patrol findings rolled into the narrative.

Update both the LockedState (what non-Pro users see on the upsell)
and Guidance (what Pro users see on the enabled surface) copy so
they match what ships. The locked-state copy is honest about the
AI being optional — narration uses Pulse Assistant when configured
and falls back to a deterministic summary otherwise — so users who
haven't set up Assistant don't think the Pro feature is gated
behind a separate AI configuration.

No structural change to the catalog: same fields, same JSON shape,
same downstream consumers. Frontend renders these strings directly
from the catalog endpoint, so the copy update propagates without
any frontend code change.
2026-05-11 10:16:27 +01:00
rcourtman
4ce4459bc8 Rename Dismiss: Expected to Remember as expected
Fifth of the seven contextual entries from the captured Pulse
Intelligence direction. The captured rubric named this button
"Remember this is expected" — future-looking, "Pulse should know
this state is expected" — but the surface labelled it
"Dismiss: Expected", which reads past-looking and groups with
two unrelated dismissal intents (Not an issue, Later).

Renames the button to "Remember as expected" and updates the
confirmation panel:

- Header verb tracks intent: "Remembering as expected" for the
  expected_behavior reason, "Dismiss as: ..." for the other
  reasons.
- Confirmation copy now says Pulse will "remember that this state
  is expected on this resource" alongside the existing
  acknowledgement-and-no-renotify framing.
- Button tooltip explains the future-looking commitment so
  operators understand the intent before clicking.

No wire change — the dismiss reason is still expected_behavior,
and operator-state on the resource still drives auto-acknowledge
behaviour for future similar findings (already in place from
earlier work). This is the rubric-alignment piece; the
operator-memory machinery underneath is unchanged.
2026-05-11 09:25:26 +01:00
rcourtman
ac5f140802 Add conditional Verify fix button on Patrol findings
Fourth of the seven contextual Assistant entries. Verify fix is
the post-remediation confirmation step: after a fix has run, the
operator asks the Assistant whether the underlying condition
actually cleared, rather than trusting the fix command's exit
code or the LLM's prior self-verification.

- Widens PatrolAssistantFindingIntent to include 'verify_fix'.
- buildPatrolAssistantFindingPrompt gains a verify_fix branch
  that directs the LLM to check the current evidence against the
  original signal that fired the finding (metrics, resource
  state, recent alerts, service health), then synthesize: is the
  condition cleared, what evidence supports that judgment, how
  confident, and is there residual risk to monitor for. Tool
  calls are allowed; state-changing commands are explicitly
  forbidden — verification is read-only.
- FindingsPanel adds a Verify fix button after Why, gated by
  hasAppliedFix() which returns true for investigation outcomes
  fix_executed, fix_verified, fix_verification_failed, and
  fix_verification_unknown. For fix_queued (no fix has run yet)
  and fix_failed (fix didn't complete) the button is hidden
  because there is nothing applied to verify.
- autoSendInitialPrompt extends to verify_fix; Discuss with
  Assistant unchanged.

Test: new verify_fix-intent prompt-builder case asserts the
verification dimensions (condition cleared, evidence, confidence,
residual / monitor) and the read-only safety boundary, and that the
prompt uses neither Discuss nor Investigate phrasing.
2026-05-11 09:21:36 +01:00
rcourtman
dee757c927 Add Why button on Patrol findings (diagnostic / cause)
Third of the seven contextual Assistant entries from the captured
Pulse Intelligence direction. Where Explain says "tell me what we
know" and Investigate says "go find out what's true now," Why
says "what caused this":

- Widens PatrolAssistantFindingIntent to include 'why'.
- buildPatrolAssistantFindingPrompt gains a why branch that
  directs the LLM toward cause signals — recent changes around
  detection time, learned correlations, prior incident memory,
  regression history — rather than current state. The prompt
  asks for: what most likely caused this to fire now, what
  evidence in the attached context supports that cause, what
  would have to be true for the cause to recur. Tool calls for
  verification are allowed; state-changing commands still
  require operator approval.
- FindingsPanel adds a Why button between Investigate and Copy
  summary, with a handleWhyFinding handler routing through
  openFindingInAssistant.
- autoSendInitialPrompt now triggers for explain / investigate /
  why; Discuss with Assistant unchanged.

Test: new why-intent prompt-builder case asserts the prompt
mentions cause-focused signals (recent changes, correlations,
prior incidents, regressions), the synthesis dimensions (caused,
evidence, recur), and the operator-approval boundary, and that the
prompt uses neither Discuss nor Investigate phrasing.
2026-05-11 09:18:09 +01:00
rcourtman
3b6fc3ef31 Audit chat-package Chat/ChatStream callers for cost recording
The parent-package audit (TestCostRecordingCoverage in
internal/ai/cost_recording_audit_test.go) only walks
internal/ai/*.go — sub-packages were unguarded, which is the gap
that let the chat cost-recording bug land in the first place (the
agentic loop in internal/ai/chat/ called ChatStream without anyone
recording cost; fixed in a0b3bc7ed).

Adds a parallel audit at internal/ai/chat/cost_recording_audit_test.go
covering both .Chat() and .ChatStream() callers. Same shape as the
parent audit but extended to include the streaming variant, since
the chat package is built around the agentic loop's ChatStream
call site.

The chat package's orchestrator/loop split means recording lives
in chat.Service.recordChatTurnCost — the orchestrator that owns
the loop — rather than inside the loop methods that call
ChatStream. Two doc-comment exemptions cover that:
  - AgenticLoop.ExecuteWithTools / executeWithTools (agentic.go)
  - AgenticLoop.ensureFinalTextResponse (agentic_final.go)
Both name the orchestrator that records, so a future contributor
moving recording elsewhere has a clear pointer to update.

A new ChatStream caller added to the chat package without either
recording or an exemption marker will now fail this audit, the
same way the parent audit would have caught QuickAnalysis or
Narrate if they had been written without recording.
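
One way such an audit could work is a syntax-tree walk over the package source. The sketch below is an assumption about the shape, not the actual test: the exemption marker text, the choice of recordChatTurnCost as the only accepted recording call, and the sample source are all made up for illustration. It uses only the standard go/parser and go/ast packages.

```go
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
	"strings"
)

// Both names are assumptions for this sketch only.
const (
	exemptMarker = "cost-recording-exempt:"
	recordFunc   = "recordChatTurnCost"
)

// sample is invented source; the parser does not type-check it.
const sample = `package chat

func good(s *Service) { s.provider.ChatStream(); s.recordChatTurnCost() }

// loop streams one turn.
// cost-recording-exempt: chat.Service.recordChatTurnCost records.
func loop(p Provider) { p.ChatStream() }

func bad(p Provider) { p.Chat() }
`

// unrecordedCallers returns the functions in src that call .Chat() or
// .ChatStream() without also calling recordFunc or carrying exemptMarker
// in their doc comment.
func unrecordedCallers(src string) ([]string, error) {
	fset := token.NewFileSet()
	file, err := parser.ParseFile(fset, "audit.go", src, parser.ParseComments)
	if err != nil {
		return nil, err
	}
	var offenders []string
	for _, decl := range file.Decls {
		fn, ok := decl.(*ast.FuncDecl)
		if !ok || fn.Body == nil {
			continue
		}
		callsProvider, records := false, false
		ast.Inspect(fn.Body, func(n ast.Node) bool {
			call, ok := n.(*ast.CallExpr)
			if !ok {
				return true
			}
			if sel, ok := call.Fun.(*ast.SelectorExpr); ok {
				switch sel.Sel.Name {
				case "Chat", "ChatStream":
					callsProvider = true
				case recordFunc:
					records = true
				}
			}
			return true
		})
		exempt := fn.Doc != nil && strings.Contains(fn.Doc.Text(), exemptMarker)
		if callsProvider && !records && !exempt {
			offenders = append(offenders, fn.Name.Name)
		}
	}
	return offenders, nil
}

func main() {
	offenders, err := unrecordedCallers(sample)
	fmt.Println(offenders, err)
}
```

Matching on method names alone is deliberately coarse; it can over-report, which is the safer failure mode for an audit.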
2026-05-11 09:16:26 +01:00
rcourtman
362e5d37bd Add Investigate button on Patrol findings
The Pulse Intelligence direction named seven contextual Assistant entry
points: Explain, Investigate, Prepare fix, Verify fix, Why did
this happen, Remember this is expected, Create rule from this.
Explain was the only first-class one. Investigate was rolled into
"Discuss with Assistant" with a generic open-ended prompt.

Splits Investigate off as its own action. Where Explain says
"tell me what we already know," Investigate says "go find out
what's true right now":

- Widens PatrolAssistantFindingIntent to 'discuss' | 'explain' |
  'investigate'.
- buildPatrolAssistantFindingPrompt gains an investigate branch:
  the prompt explicitly instructs the LLM to use its Pulse tools
  (metrics, alerts, resource state, recent changes, correlations)
  to gather fresh evidence, then synthesize root cause +
  confidence + safe next step + whether the recommended action
  still holds. Any command-running step must route through
  governed approval, not the LLM's own judgment.
- FindingsPanel adds an Investigate button between Explain and
  Copy summary, and a handleInvestigateFinding handler that
  routes through openFindingInAssistant with the new intent.
- Both Explain and Investigate now set autoSendInitialPrompt:
  true (single condition update); Discuss stays false.

Test: a new investigate-intent prompt-builder case asserts the
prompt mentions active tool use, the synthesis dimensions
(root cause, confidence, safe next step), and the governed-approval
safety boundary, and that the prompt is not the Discuss seed.
2026-05-11 09:14:21 +01:00
rcourtman
8230275126 Auto-send on action-style Assistant entry points
Pulse Assistant has been reactive: clicking Explain on a Patrol
finding opened the drawer, pre-filled a substantive investigation
prompt, and then waited for the operator to press Enter. The
captured Pulse Intelligence direction calls for the opposite —
when the operator clicks an action button, the analysis should
already be in flight by the time they land on the drawer, not
parked on a textarea waiting for a second confirmation step.

Adds autoSendInitialPrompt to AIChatContext. When true, the chat
surface fires handleSubmit immediately after the initialPrompt is
written to the input (deferred via queueMicrotask so the input
signal has propagated and the drawer-open effects have settled).
handleSubmit's existing guards against empty prompts and concurrent
submissions make this safe. clearAutoSendFlag is the symmetric
clearer so the flag doesn't persist across opens.

Wires Explain to autoSendInitialPrompt: true; Discuss with
Assistant stays false because that entry is open-ended by
design. The same plumbing is what future action-style entries
(Investigate, Verify fix) will route through.

Tests: two new aiChat-store cases lock in the flag plumbing and
the default-undefined behaviour. All 15 aiChat-store + 70
AIChat-component tests pass. Verified live: clicking Explain
clears the textarea and surfaces the in-flight Thinking
indicator without the operator pressing Enter.
2026-05-11 09:09:38 +01:00
rcourtman
b388e79c73 Default Settings sidebar to expanded on Infrastructure tab
The Infrastructure workspace previously inverted the default by
initializing focusedNavigationExpanded=false, which meant landing on
Settings (which defaults to the Infrastructure tab) always opened with
the navigation collapsed to a 4rem rail. On a typical desktop the 18rem
sidebar isn't crowding the workspace, but the rail-by-default cost a
click on every settings visit.

Remove the focused-navigation special case and the now-trivial wrappers
around setSidebarCollapsed; let Infrastructure share the same
sidebar-defaults-expanded behavior as every other settings tab. Manual
collapse/expand via the chevron is unchanged.
2026-05-10 23:41:48 +01:00
rcourtman
9db3b53af7 Polish dash placeholders and cluster timestamp tooltip
Two small UX touches on Settings → Infrastructure.

1. The empty-cell placeholders for endpoint and coverage on the cluster
header row and on member rows were rendered with an ASCII hyphen while
elsewhere in the same table (the member actions cell, the 'unknown'
Source badge, and other tables across the app) the convention is an
em-dash. Unify on em-dash so the placeholder reads as a single visual
language.

2. The cluster header row's lastActivityText is an aggregation: the
oldest timestamp across the cluster API connection, member liveness,
and member agent connections. That biases toward surfacing the most
stale source, but the column header just says 'STATUS' so a viewer can
mistake it for a most-recent timestamp. Add a title attribute on the
cluster row's timestamp explaining the aggregation. Non-cluster rows
keep no tooltip since their timestamp is single-sourced.
2026-05-10 23:40:26 +01:00
rcourtman
9063e7ec32 Clarify cluster-member rows in the connected-systems table
Two adjacent UX wrinkles on Settings → Infrastructure:

1. The primary cluster member's subtitle read "API contact" while the
Source column rendered an "Agent" badge directly beside it. Two unrelated
axes (role within cluster vs how Pulse reads this row) shared visually
adjacent space and parsed as a contradiction. Rename the subtitle to
"Primary node" so it no longer collides with the Source badge.

2. A clustered PVE member without an attached agent fell back to
Source = 'unknown' (rendered as "— No source attached"). That row IS
read via the cluster API, so 'unknown' is incorrect. Default the
fallback to 'api'; agent-installed members keep 'agent'. Invisible in
the current single-cluster homelab setup but accurate for mixed-agent
clusters going forward.
2026-05-10 23:37:18 +01:00
rcourtman
0d386d32a2 Update cluster source test to match aggregated badge
Follow-up to 112d42801: the test asserted the old hardcoded 'api' badge
for a PVE cluster whose members both carry agents. With the row builder
now using sourceFor() aggregation, the expected source is 'both'.
2026-05-10 23:32:39 +01:00
rcourtman
112d42801a Aggregate cluster source badge across attached agents
The Infrastructure connected-systems table hardcoded the cluster header
row's source to 'api', so a PVE cluster whose nodes all carry agents
displayed as 'API' while a standalone PVE host with the same setup
correctly displayed as 'API + Agent'. The backend already folds member
agents into ConnectionSystem.Components as attachments
(connections_grouping.go), so the row builder can use the same
sourceFor() aggregation it uses for non-cluster rows.
2026-05-10 23:28:30 +01:00
rcourtman
9f43a22fb1 Bound patrol-main session at 200 messages to stop unbounded disk growth
The patrol-main chat session was reused across every scheduled
Patrol run with no upper bound. After a month of runs the file
had grown to 16 MB / 3,593 messages, and every AddMessage
rewrote the whole file to disk — so the I/O cost per Patrol
run was scaling with total session age, not with the run's own
output. Across all chat sessions on this dev instance, the
ai_sessions directory hit 676 MB / 1,629 files.

The stateless-Patrol-input fix (commit 43760fb0d) stopped
loading the session back into the agentic loop, but Patrol
still wrote each run's messages to the session for the Pulse
Assistant sidebar's forensic view. That write path is what
this commit bounds.

ExecutePatrolStream now calls SessionStore.TrimMessages(200)
after each run, keeping roughly the last two runs' worth of
messages — enough for the sidebar to show recent activity, far
short of unbounded growth. The next Patrol run on a bloated
session will drop the historical 3,000+ messages down to 200
on its first write, so existing storage debt clears on its
own without a separate migration.

User-driven chat sessions are unaffected: TrimMessages with
keepMostRecent <= 0 is a no-op, and callers that want full
history retention simply don't call it. Only Patrol's
forensic session is capped.

The canonical Patrol forensic log is the PatrolRunRecord
history surfaced at /api/ai/patrol/runs — that's the durable
record with structured fields. The chat-session-shaped file
is a sidebar convenience, not the source of truth.

Three tests guard the boundary:
  - TrimMessages keeps the most recent N (50 messages
    trimmed to 10 → messages 40-49 remain)
  - TrimMessages is a no-op below threshold (5 messages,
    cap 200 → 5 messages remain)
  - TrimMessages with non-positive keep is a no-op (3
    messages, cap 0 or -5 → 3 messages remain)

ai-runtime contract updated.
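
The trimming behaviour the three tests describe can be sketched like this. The Session and Message types are simplified stand-ins and the method signature is an assumption; the real SessionStore.TrimMessages also persists to disk, which is omitted here.

```go
package main

import "fmt"

// Message and Session are simplified stand-ins for the persisted chat types.
type Message struct{ ID int }

type Session struct{ Messages []Message }

// TrimMessages keeps only the keepMostRecent newest messages.
// A non-positive cap, or a session already within the cap, is left untouched.
// It reports whether anything was dropped.
func (s *Session) TrimMessages(keepMostRecent int) bool {
	if keepMostRecent <= 0 || len(s.Messages) <= keepMostRecent {
		return false
	}
	// Copy the tail so the dropped prefix is no longer referenced.
	kept := make([]Message, keepMostRecent)
	copy(kept, s.Messages[len(s.Messages)-keepMostRecent:])
	s.Messages = kept
	return true
}

// newSession builds a session with messages numbered 0..n-1, oldest first.
func newSession(n int) *Session {
	s := &Session{}
	for i := 0; i < n; i++ {
		s.Messages = append(s.Messages, Message{ID: i})
	}
	return s
}

func main() {
	s := newSession(50)
	s.TrimMessages(10)
	fmt.Println(len(s.Messages), s.Messages[0].ID, s.Messages[9].ID) // 10 40 49
}
```

Because a non-positive cap is a no-op, callers that want full history can simply never call it.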
2026-05-10 23:19:08 +01:00
rcourtman
5dcdbfabf0 Document chat-side cost recording in ai-runtime contract
Follow-up to a0b3bc7ed which closed the chat.Service cost-ledger
gap. ai-runtime.md gains a Current State paragraph documenting:
- The pre-fix bug (chat accumulated tokens via SSE done envelope
  but never recorded a cost.UsageEvent server-side; chat is the
  bulk of AI token spend so the dashboard was dramatically
  understating cost).
- The fix shape (recordChatTurnCost runs after every loop return,
  success or error since the operator was billed regardless).
- The threading path (chat.Config.CostStore wired by the router
  from AISettingsHandler.GetAIService.CostStore()).
- The double-recording invariant (ExecutePatrolStream is
  deliberately not changed; its caller patrol_ai.go records via
  its own helper).
- UseCase="chat" matches the canonical taxonomy noted on
  cost.UsageEvent.UseCase ("chat" or "patrol").
2026-05-10 23:16:47 +01:00
rcourtman
a0b3bc7ed3 Record user-chat token usage to the cost ledger
chat.Service.ExecuteStream was a long-standing cost-ledger gap: the
agentic loop accumulated token counts via stream callbacks (see
GetTotalInputTokens / GetTotalOutputTokens in agentic_control.go)
and surfaced them in the SSE done envelope to the frontend, but
nothing on the server side recorded a cost.UsageEvent. Patrol,
discovery, QuickAnalysis, and the report narrators all record; only
chat — the bulk of AI token spend — did not. The operator's AI
usage dashboard was therefore understating cost dramatically.

Found while extending the cost-recording mindset across subpackages
after fixing QuickAnalysis (08491b9f4). Initially spawned as a
separate task but the right shape and scope became clear, so landing
it directly here.

Pipeline:
- Service.CostStore() exposes the per-tenant cost store handle.
- chat.Config gains optional CostStore *cost.Store field, threaded
  into chat.Service.costStore at NewService time.
- chat.Service.recordChatTurnCost records a UsageEvent with
  UseCase="chat" after every loop.ExecuteWithTools return (success
  OR error — operator was billed regardless of clean response).
  Skips when costStore is nil or zero tokens accumulated.
- ai_handler.go's two chatCfg construction sites populate CostStore
  via h.resolveCostStore(ctx).
- router wires the resolver to AISettingsHandler.GetAIService(ctx).CostStore()
  with no Enabled gate — even brief chat usage while AI was being
  configured should appear in the dashboard.

ExecutePatrolStream is deliberately not changed. It creates a
separate tempLoop and its caller (patrol_ai.go) records cost via
its own helper at line 887. Recording in ExecuteStream only avoids
double-counting on the patrol-via-chat path.

Tests in chat/cost_recording_test.go cover: recording when store
configured, no-op when store nil, no-op on zero tokens (early
failures), graceful handling of model strings missing the
provider prefix.
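
A rough sketch of the recording rules listed above, under stated assumptions: the Store and UsageEvent types are simplified stand-ins for the cost package, the "provider:model" string format and splitModel helper are guesses made for illustration, and executeTurn stands in for the ExecuteStream call site.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// UsageEvent and Store are simplified stand-ins for the cost ledger types.
type UsageEvent struct {
	UseCase, Provider, Model  string
	InputTokens, OutputTokens int
}

type Store struct{ Events []UsageEvent }

func (s *Store) Record(e UsageEvent) { s.Events = append(s.Events, e) }

var errInterrupted = errors.New("stream interrupted")

// splitModel assumes a "provider:model" format. A string without the
// prefix is kept whole as the model name with an empty provider.
func splitModel(model string) (provider, name string) {
	if p, n, ok := strings.Cut(model, ":"); ok {
		return p, n
	}
	return "", model
}

// recordChatTurnCost records one chat turn. It skips when no store is
// configured or when no tokens were accumulated, and reports whether an
// event was written.
func recordChatTurnCost(store *Store, model string, in, out int) bool {
	if store == nil || (in == 0 && out == 0) {
		return false
	}
	provider, name := splitModel(model)
	store.Record(UsageEvent{
		UseCase: "chat", Provider: provider, Model: name,
		InputTokens: in, OutputTokens: out,
	})
	return true
}

// executeTurn records after the loop returns, on success and on error,
// because tokens consumed before a failure were still billed.
func executeTurn(store *Store, model string, loop func() (int, int, error)) error {
	in, out, err := loop()
	recordChatTurnCost(store, model, in, out)
	return err
}

func main() {
	store := &Store{}
	err := executeTurn(store, "exampleprovider:example-model", func() (int, int, error) {
		return 1200, 300, errInterrupted
	})
	fmt.Println(err, len(store.Events), store.Events[0].Provider)
}
```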
2026-05-10 23:15:53 +01:00
rcourtman
e8049e894f Hide decorative dashes across the rest of the tables
Extends 688d00a55 (Workloads guest rows) to the remaining tables and
shared cells: Infrastructure host/PMG/PBS rows, Alerts desktop and
mobile rows, Recovery history and protected inventory, AI cost
dashboard cards, ResponsiveMetricCell fallback, the infrastructure
summary table CPU-temp cell, the AGENT/CHECKER node columns under
Settings → Infrastructure, and the help-icon example bullets.

Each "—" or "-" that signals "no value here" now carries
aria-hidden="true" so screen readers skip the decoration and announce
just the cell label / surrounding context. Dashes with informational
title attributes (e.g. "Disk stats unavailable…") are left audible —
the title is the accessible name and should be read.

Verified live: 24 of 25 dash spans on the Workloads page are now hidden;
the only audible one is the title-bearing disk-status span.
2026-05-10 23:13:01 +01:00
rcourtman
113190a920 Skip agent-sourced alerts in node-presence cleanup
CleanupAlertsForNodes removes alerts whose Node isn't in
existingNodes. That map is built upstream only from state.Nodes
(Proxmox nodes) and state.PBSInstances (PBS instances) — no agent
resources. Agent-sourced alerts (Unraid, standalone Linux hosts,
TrueNAS, anything reached via Pulse Agent) typically have Node=""
or an agent UUID, so they fall into the cleanup branch every
cycle. Then the next poll re-creates the alert as new, calls
AddAlert, and appends a fresh history row. Observed in the wild:
3,980 alert history entries in 7 days, with the same canonical
alert ID (e.g. "Unraid array running without parity protection")
appearing every 30 seconds.

Adds a carve-out for ResourceIDs starting with "agent:",
matching the existing pattern for "docker-" / "docker:" and
"pbs-" / "pbs-offline". Locks in the behaviour with a new subtest
that mixes agent-sourced and Proxmox-sourced alerts and asserts
that only the legitimately stale Proxmox alert is removed.
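
The carve-out can be pictured as a prefix check that runs before the node-presence test. This is a simplified sketch: the Alert type, the function signature, and the exact prefix list are assumptions based on the prefixes named above, and the real CleanupAlertsForNodes may match them differently.

```go
package main

import (
	"fmt"
	"strings"
)

// Alert is a simplified stand-in with only the fields the cleanup inspects.
type Alert struct{ ID, ResourceID, Node string }

// nodeIndependentPrefixes lists ResourceID prefixes whose alerts are not
// tied to a Proxmox node or PBS instance (assumed list; "pbs-" also
// covers "pbs-offline").
var nodeIndependentPrefixes = []string{"agent:", "docker-", "docker:", "pbs-"}

func isNodeIndependent(resourceID string) bool {
	for _, p := range nodeIndependentPrefixes {
		if strings.HasPrefix(resourceID, p) {
			return true
		}
	}
	return false
}

// cleanupAlertsForNodes drops alerts whose Node is no longer known,
// except for alerts from node-independent sources.
func cleanupAlertsForNodes(alerts []Alert, existingNodes map[string]bool) []Alert {
	kept := make([]Alert, 0, len(alerts))
	for _, a := range alerts {
		if isNodeIndependent(a.ResourceID) || existingNodes[a.Node] {
			kept = append(kept, a)
		}
	}
	return kept
}

func main() {
	alerts := []Alert{
		{ID: "parity", ResourceID: "agent:unraid-1", Node: ""},
		{ID: "cpu", ResourceID: "qemu/100", Node: "pve-gone"},
		{ID: "mem", ResourceID: "qemu/101", Node: "pve-1"},
	}
	for _, a := range cleanupAlertsForNodes(alerts, map[string]bool{"pve-1": true}) {
		fmt.Println(a.ID)
	}
}
```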
2026-05-10 23:12:31 +01:00
rcourtman
99c499ade7 Repair orphan tool_calls in convertToProviderMessages
Defense-in-depth for the malformed-history bug pattern. The
Patrol fix made patrol-main runs stateless, but Assistant
chat sessions are inherently multi-turn and must keep their
history. Any chat session that ends mid-tool-call — network
drop, ctx timeout, browser crash, uncaught panic, any
interrupt that fires between "model emits tool_calls" and
"agentic loop appends all tool results" — leaves the
persisted session with orphan tool_call_ids. The next message
that loads this history is rejected with the same provider
error that flapped Patrol for 33 days:

  An assistant message with 'tool_calls' must be followed by
  tool messages responding to each 'tool_call_id'.

For Patrol this was fixable by ignoring the session. For
Assistant it isn't; the conversation context is the product.

convertToProviderMessages now ends with a repairOrphanToolCalls
pass that scans every assistant message with tool_calls and
inserts synthetic is_error tool result messages immediately
after the assistant turn for any tool_call_id that has no
matching downstream result. The synthetic content is marked
is_error=true and explains the interruption so the model can
retry the same call or proceed without that data — preserving
conversational continuity while satisfying the provider's
structural-validity check.

This guards every conversation that crosses
convertToProviderMessages, not just Assistant chat. If Patrol
ever changes back to loading session history, the same safety
net applies. If a new entry point appears for some other LLM
flow, it gets the repair for free.

Three tests guard the boundary:
  - Orphan injection (3 tool_calls, only 1 result → 2
    synthetic results, marked is_error with interrupted
    explanation, ordering preserved)
  - Clean no-op (all tool_calls fulfilled → no synthetic
    messages, no is_error pollution)
  - Existing truncation test still passes (assistant message
    with both tool_calls and own tool_result → no repair
    needed, tool_call_id matches in same message)

ai-runtime contract updated.
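
The repair pass might look roughly like the sketch below. The Message and ToolCall types and the synthetic message text are assumptions, and the case where a single message carries both tool_calls and its own result is not modelled.

```go
package main

import "fmt"

// ToolCall and Message are simplified stand-ins for the provider types.
type ToolCall struct{ ID, Name string }

type Message struct {
	Role       string // "user", "assistant", or "tool"
	Content    string
	ToolCalls  []ToolCall // set on assistant messages
	ToolCallID string     // set on tool result messages
	IsError    bool
}

// repairOrphanToolCalls returns a copy of msgs in which every assistant
// tool call with no matching later tool result gets a synthetic error
// result inserted directly after the assistant message.
func repairOrphanToolCalls(msgs []Message) []Message {
	out := make([]Message, 0, len(msgs))
	for i, m := range msgs {
		out = append(out, m)
		if m.Role != "assistant" || len(m.ToolCalls) == 0 {
			continue
		}
		// Collect the tool_call_ids answered downstream of this turn.
		answered := make(map[string]bool)
		for _, later := range msgs[i+1:] {
			if later.Role == "tool" {
				answered[later.ToolCallID] = true
			}
		}
		for _, tc := range m.ToolCalls {
			if answered[tc.ID] {
				continue
			}
			out = append(out, Message{
				Role:       "tool",
				ToolCallID: tc.ID,
				IsError:    true,
				Content:    "Tool call was interrupted before a result was recorded; retry it or continue without this data.",
			})
		}
	}
	return out
}

func main() {
	history := []Message{
		{Role: "user", Content: "check disk usage"},
		{Role: "assistant", ToolCalls: []ToolCall{{ID: "a"}, {ID: "b"}, {ID: "c"}}},
		{Role: "tool", ToolCallID: "b", Content: "ok"},
	}
	for _, m := range repairOrphanToolCalls(history) {
		fmt.Println(m.Role, m.ToolCallID, m.IsError)
	}
}
```

Building a new slice rather than editing in place leaves the persisted history untouched; only the payload sent to the provider is repaired.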
2026-05-10 23:10:13 +01:00
rcourtman
688d00a550 Hide decorative dash placeholders from screen readers
Workloads guest rows render — (or -) as a visual no-data signal in
every cell where a guest doesn't expose that metric (info, vmid,
disk, ip, uptime, node, image, namespace, context, backup). Each one
was a plain <span>, so screen readers narrated "dash" alongside every
cell label.

Mark every dash that conveys "no value" with aria-hidden="true" so SR
users hear the column label and skip the placeholder. Dashes that
carry an informational title attribute (e.g. "Disk stats
unavailable…") are intentionally left visible to assistive tech —
title is the accessible name and replacing it with aria-hidden would
drop real context.

Visual unchanged; tested live via DOM probe — 25 of 27 dash spans on
the Workloads page now carry aria-hidden, with the two title-bearing
dashes still announceable.
2026-05-10 23:08:34 +01:00
rcourtman
e657f6ace9 Suppress assessment error penalty after trailing-success recovery
The overall-health "Recent Patrol errors" coverage factor in
summarizeRecentPatrolCoverage was anchoring the score to a
stale ratio: it counted errors across the last 10 runs without
weighting recency. After Pulse fixed two compounding Patrol
bugs today, four consecutive successful runs (50+ tool calls
each) followed six earlier failures. The assessment kept
showing C/65 with the prediction "most recent Patrol runs
encountered errors (6 of 10)" — directly contradicting the
fact that *every* recent run had succeeded.

Operators reading that score would conclude Pulse Patrol is
still broken. It isn't. The pre-fix failures were dragging the grade down.

This commit adds a recovery-suppression check: count trailing
successful full Patrol runs from the most-recent end of the
window (GetAll returns newest-first), skipping non-full runs.
When three or more consecutive trailing successes exist —
roughly a 9-hour clean stretch at the default 3-hour cadence —
the error penalty drops entirely. The score reflects current
reality.

Three is conservative: a single recovery run could be a
transient win; three consecutive demonstrate the underlying
fix is sticking. Below the threshold, the existing ratio-tiered
penalty still applies so partially-recovered states still
register.

Two tests guard the boundary:
  - 6 historical errors + 3 trailing successes → no coverage
    factor (suppressed)
  - 6 historical errors + 2 trailing successes → coverage
    factor remains (recovery incomplete)

To be verified live once this commit lands: the assessment that has
been stuck at C/65 since the malformed-history fix should
recompute to an A/B grade as soon as the three trailing successful
runs are recognized by the same recent-runs query.

ai-runtime contract updated.
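
The trailing-success check can be sketched as below. The Run type and function names are hypothetical, and the ratio-tiered penalty is collapsed to a yes/no answer for brevity; only the threshold of three and the newest-first, skip-non-full rules come from the description above.

```go
package main

import "fmt"

// Run is a hypothetical, trimmed-down Patrol run record.
type Run struct {
	Full   bool // false for scoped or partial runs, which are skipped
	Failed bool
}

const recoveryThreshold = 3

// trailingSuccesses counts consecutive successful full runs starting at
// the newest run. runs must be ordered newest-first.
func trailingSuccesses(runs []Run) int {
	n := 0
	for _, r := range runs {
		if !r.Full {
			continue // non-full runs neither extend nor break the streak
		}
		if r.Failed {
			break
		}
		n++
	}
	return n
}

// errorPenaltyApplies reports whether the recent-errors factor should
// still count against the score.
func errorPenaltyApplies(runs []Run) bool {
	if trailingSuccesses(runs) >= recoveryThreshold {
		return false
	}
	for _, r := range runs {
		if r.Full && r.Failed {
			return true
		}
	}
	return false
}

// window builds a newest-first window of successes followed by failures.
func window(successes, failures int) []Run {
	var runs []Run
	for i := 0; i < successes; i++ {
		runs = append(runs, Run{Full: true})
	}
	for i := 0; i < failures; i++ {
		runs = append(runs, Run{Full: true, Failed: true})
	}
	return runs
}

func main() {
	fmt.Println(errorPenaltyApplies(window(3, 6))) // false
	fmt.Println(errorPenaltyApplies(window(2, 6))) // true
}
```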
2026-05-10 23:02:57 +01:00
rcourtman
68e2100955 Suppress empty metric values on state-alert Assistant handoffs
The Pulse Assistant briefing and prompt for a powered-off alert
rendered "Current value 0.0%; threshold 0.0%" because the backend
sends value=0 and threshold=0 for state alerts (which have no
metric semantics). That line is misleading to the operator and
gives the LLM no useful signal.

Adds isMetricAlertType / isStateAlertType helpers to
frontend-modern/src/utils/alerts.ts naming the state-alert set
(powered-off, unreachable, offline, host-offline, connectivity,
docker-container-state, docker-container-health,
docker-host-offline). State alerts represent binary or enumerated
conditions, not metric threshold crossings.

The alert handoff builder routes through that helper:
  - Briefing detailLines omit the value/threshold line when the
    alert is a state alert.
  - Prompt omits the **Current Value:** and **Threshold:** lines.
  - Prompt now includes **Message:** so the actual signal is
    surfaced (was previously dropped from the prompt).
  - Prompt step 2 swaps "Check related metrics" for "Check what
    changed recently for this resource (state events, recent
    commands, related alerts)" — the right question for a
    binary-state alert.

Two new tests cover the state-alert and metric-alert branches.
2026-05-10 23:01:21 +01:00
rcourtman
4dff26f728 Emit structured telemetry on reporting and summarize invocations
The reporting feature now ships across two surfaces (PDF/CSV export
and pulse_summarize chat tool) and three modes (single-resource,
fleet, summarize). Without usage telemetry we can't tell whether the
work earns its place — operator demand, AI-vs-heuristic adoption,
range/format preferences are all invisible. This telemetry stops
further feature investment from being pure speculation.

Three new info-level log events, structured so an agent can grep
transcripts and group by dimension without a separate metrics
pipeline (matches the "agent owns ops analysis, human gets outcomes"
posture in MEMORY.md):

  reporting.single.generated     — single-resource PDF/CSV
  reporting.fleet.generated      — multi-resource fleet PDF/CSV
  reporting.summarize.invoked    — pulse_summarize chat tool (both modes)

Common dimensions: org_id, format/action, range, ai_configured,
findings_configured, window_start/end. Single-resource adds
resource_type + metric_type + bytes; fleet adds resource_count +
bytes; summarize adds resource_type + resource_count (fleet mode) +
narrative_source (so we can audit AI-fallback rate).

Includes a rangeLabel() helper that maps a window to the canonical
catalog range token (24h/7d/30d) with a 1h tolerance, falling back
to "<hours>h" so non-standard windows still group. Tested.

TestReportingTelemetryEventNames pins the canonical event names as
a contract — an agent grepping logs depends on them being stable;
changing them silently would break audit tooling on the consumer
side.

The reporting engine already logs the resolved narrative source
(heuristic/ai) at debug level via the existing "Generating report"
line, useful for diagnosing why a specific report fell back. Kept
at debug; the new info-level events cover the operator surface.
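
A helper with the rangeLabel behaviour described above might look like this. The signature, the inclusive one-hour tolerance, and the rounding of the fallback are assumptions; hours is a convenience helper added for the sketch.

```go
package main

import (
	"fmt"
	"time"
)

// hours is a small convenience for building durations in this sketch.
func hours(n int) time.Duration { return time.Duration(n) * time.Hour }

// rangeLabel maps a report window to a canonical range token when the
// window is within one hour of 24h, 7d, or 30d. Anything else falls back
// to "<hours>h" so non-standard windows still group together.
func rangeLabel(window time.Duration) string {
	canonical := []struct {
		label string
		span  time.Duration
	}{
		{"24h", hours(24)},
		{"7d", hours(7 * 24)},
		{"30d", hours(30 * 24)},
	}
	for _, c := range canonical {
		diff := window - c.span
		if diff < 0 {
			diff = -diff
		}
		if diff <= time.Hour {
			return c.label
		}
	}
	// Round to the nearest whole hour for the fallback token.
	return fmt.Sprintf("%dh", int64(window.Round(time.Hour)/time.Hour))
}

func main() {
	fmt.Println(rangeLabel(hours(24)))                   // 24h
	fmt.Println(rangeLabel(hours(7*24) - 20*time.Minute)) // 7d
	fmt.Println(rangeLabel(hours(48)))                   // 48h
}
```

The tolerance absorbs small drift between the requested range and the computed window bounds, so log lines for the same catalog range share one token.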
2026-05-10 22:59:23 +01:00
rcourtman
15f6881d89 Document the chat-side narrator wiring in ai-runtime contract
Follow-up to 03463c1bf which threaded the per-tenant report
narrators through chat.Config -> tools.ExecutorConfig ->
PulseToolExecutor so pulse_summarize can produce AI-narrated
synthesis in chat instead of heuristic-only. ai-runtime.md's
Current State paragraph documents the wiring:
- chat.Config carries three optional fields (ReportNarrator,
  ReportFleetNarrator, ReportFindingsProvider) threaded through
  to the executor at session construction time.
- The router installs a SetReportNarratorResolver closure that
  mirrors the reporting handler's pattern, asking the
  AISettingsHandler for the per-tenant ai.Service and returning
  it as the implementation for all three roles when AI is
  enabled.
- Unconfigured tenants still get the heuristic fallback —
  matching the report PDF's graceful-degradation posture.
- AI-narrated chat synthesis uses the same provider, sanitizer,
  model selection, cost ledger (report_narrative /
  report_narrative_fleet use-cases), and budget gate the report
  PDF endpoint enforces, so there is exactly one canonical
  synthesis path for both surfaces.
2026-05-10 22:51:17 +01:00
rcourtman
03463c1bfe Thread per-tenant AI narrators into pulse_summarize via chat session
v1 of pulse_summarize (1fe5d6853) shipped with heuristic narrative
only. The follow-up wiring promised in that commit now lands: the
chat session carries optional report-narration providers that the
tool's handler reads when building requests, so AI-narrated synthesis
flows into chat using the same provider, sanitizer, model selection,
cost ledger, and budget gate the report PDF endpoint already uses.

Pipeline:
- pkg/reporting Narrator / FleetNarrator / FindingsProvider interfaces
  are already implemented by internal/ai.Service. No new
  implementations.
- tools.ExecutorConfig + PulseToolExecutor gain three optional fields
  (ReportNarrator, ReportFleetNarrator, ReportFindingsProvider).
  Clone() copies them so per-session executors inherit the wiring.
- chat.Config gains the same three fields; NewService threads them
  into ExecutorConfig.
- tools_summarize.go reads e.reportNarrator/FleetNarrator/
  FindingsProvider and populates MetricReportRequest /
  MultiReportRequest. The engine already accepts these on the request
  and falls back to heuristic when they are nil — no engine changes
  needed.
- AIHandler gains SetReportNarratorResolver(ctx -> narrators); both
  per-tenant and default chat.Config construction sites invoke the
  resolver. Router wires the resolver to AISettingsHandler.GetAIService
  with the same Enabled-gate the reporting handler uses.

Unconfigured tenants are unchanged: the resolver returns nil, the
tool returns heuristic narrative — identical to today. Configured
tenants get AI synthesis in chat that matches what their report PDF
already carries, billed and budget-gated the same way.
2026-05-10 22:50:17 +01:00
rcourtman
7a7b3c9d30 Gate LLM patrol_resolve_finding on deterministic verifier for event findings
Backup-failed was cycling detected → auto-resolved → re-detected
ten times in a single day. On each cycle the LLM saw "PBS backups
look healthy in my current snapshot" during a Patrol pass, called
patrol_resolve_finding(backup-failed), and the adapter at
patrol_findings.go:985 called Resolve(findingID, true) directly —
no category check, no evidence verification.

The contract docs at findings.go:52-67 explicitly say event /
persistent categories (backup, reliability, security, general)
"stay active until explicitly resolved — either by the LLM calling
patrol_resolve_finding with evidence, or by operator action." That
"with evidence" was never enforced.

This commit enforces it. The adapter now checks two conditions
before honoring an LLM resolve:

  - finding.Category does NOT support stale-auto-resolve (per the
    contract function CategorySupportsStaleAutoResolve), AND
  - a deterministic verifier exists for finding.Key (currently
    smart-failure and backup-failed)

When both are true, the adapter runs VerifyFixResolved on the
finding's resource. If the verifier still detects the failure
signal, the LLM gets an error explaining why the resolve was
rejected and that the underlying issue must be fixed first. If
the verifier confirms the signal has cleared, the resolve
proceeds with grounded evidence.

Categories that support stale-auto-resolve (performance, capacity)
bypass the gate entirely — the LLM can resolve them based on
absence per the existing contract. Keys without a verifier also
fall through to current behavior so we don't block resolves for
categories we haven't built verifiers for yet.

New PatrolService.hasDeterministicVerifierForKey() helper keeps
the gate's verifier list in lockstep with the switch in
verifyFixDeterministically.

Tests cover the three branches:
  - performance category → gate skipped, resolve proceeds
  - reliability + no verifier → gate falls through, resolve proceeds
  - hasDeterministicVerifierForKey for known and unknown keys

ai-runtime contract updated.
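
A minimal sketch of the gate order described above, not the adapter's actual code. The Finding struct, the stillFailing callback (standing in for VerifyFixResolved), and the error value are hypothetical; the category and key lists are the ones this message names.

```go
package main

import (
	"errors"
	"fmt"
)

// Finding is a pared-down, hypothetical stand-in for the real finding type.
type Finding struct {
	Key      string
	Category string
}

// categorySupportsStaleAutoResolve mirrors the rule stated in the message:
// only performance and capacity findings may resolve on absence.
func categorySupportsStaleAutoResolve(category string) bool {
	return category == "performance" || category == "capacity"
}

// hasDeterministicVerifierForKey lists the keys the message says have verifiers.
func hasDeterministicVerifierForKey(key string) bool {
	return key == "smart-failure" || key == "backup-failed"
}

// errStillFailing is a hypothetical error returned to the LLM on rejection.
var errStillFailing = errors.New("resolve rejected: failure signal still present; fix the underlying issue first")

// gateLLMResolve returns nil when an LLM-initiated resolve may proceed.
// stillFailing is an assumed callback that reports whether the deterministic
// verifier still sees the failure signal on the finding's resource.
func gateLLMResolve(f Finding, stillFailing func(Finding) bool) error {
	if categorySupportsStaleAutoResolve(f.Category) {
		return nil // absence-based resolution is allowed for these categories
	}
	if !hasDeterministicVerifierForKey(f.Key) {
		return nil // no verifier yet: fall through to the previous behavior
	}
	if stillFailing(f) {
		return errStillFailing
	}
	return nil // verifier confirms the signal has cleared
}

func main() {
	failing := func(Finding) bool { return true }
	fmt.Println(gateLLMResolve(Finding{Key: "backup-failed", Category: "backup"}, failing))
	fmt.Println(gateLLMResolve(Finding{Key: "high-cpu", Category: "performance"}, failing))
}
```

Only the combination of an event/persistent category and a known verifier key ever reaches the verifier, which is why the two fall-through branches come first.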
2026-05-10 22:48:09 +01:00
rcourtman
3b06b4b09d Document the non-rendering reporting engine entry points in api-contracts
Follow-up to e32d4ede4 (NarrativeFor + FleetNarrativeFor entrypoints)
and 1fe5d6853 (pulse_summarize tool). api-contracts.md gains a Current
State paragraph documenting that the Engine interface now exposes two
non-rendering entry points alongside Generate/GenerateMulti, with the
explicit invariant that test stubs implementing the interface must
implement these methods so the contract is honoured across the entire
surface, not just the export-shaped subset.

ai-runtime.md was updated in the parallel-agent commit ee2de2703
(which picked up the pulse_summarize paragraph when restating
auto-resolve gating), so no further edit is needed there.
2026-05-10 22:38:07 +01:00
rcourtman
1fe5d6853f Expose reporting synthesis to Assistant via pulse_summarize tool
The reporting synthesis layer (observations, recommendations,
outliers, period comparison) shipped trapped behind the PDF/CSV
export. Operators who chat with Assistant could not ask "what's been
happening with pve1 this week" — the data path existed but had no
non-PDF surface. This commit adds a single new tool, pulse_summarize,
that wraps the engine's non-rendering entry points (NarrativeFor /
FleetNarrativeFor) so that question gets answered in chat.

The tool takes an action parameter (resource | fleet) and routes
accordingly:
- resource mode requires resource_type + resource_id and returns the
  same Narrative the single-resource report carries (health status,
  observations, recommendations, period comparison).
- fleet mode requires resource_type + a comma-separated resource_ids
  string (PropertySchema does not currently support array items, and
  CSV is LLM-friendly enough) and returns the FleetNarrative
  (outliers, patterns, recommendations). Capped at the same
  multi-report ceiling (50) as the API endpoint.

The tool is read-only — no control level requirement, no approval
gate — and uses the global reporting engine the rest of the app
already shares. Returns a JSON envelope so chat can render it or
hand it back to the model for follow-up framing.

v1 ships with heuristic narrative only. The AI narrator wiring
through the chat session (Narrator/FleetNarrator/FindingsProvider
threaded via chat.Config -> tools.ExecutorConfig -> PulseToolExecutor)
is a focused follow-up; it lets the same tool inherit the per-tenant
AI service the report PDF endpoint already uses. The seam is
already in place because NarrativeFor/FleetNarrativeFor take an
optional narrator on the request — v1 passes nil, v2 populates it.
2026-05-10 22:36:49 +01:00
rcourtman
ee2de2703b Update ai-runtime contract to cover both absence-based auto-resolve paths
reconcileStaleFindings (commit b44d5892f) and the resource-absent
gate added in commit d6bb89a1c both use the same
CategorySupportsStaleAutoResolve helper, but the contract Current
State only documented the first path. Rewords the paragraph so
both are covered explicitly: stale-cleanup and resource-absent
both gate on the whitelist, both reject event/persistent
categories, and the bogus-absence examples extend to cover the
resource-absent failure mode (transient agent reconnect,
container churn, refresh gap).
2026-05-10 22:34:25 +01:00
rcourtman
9bb157f3f0 Move body-text muted callers to the semantic token
A handful of helper texts, descriptions, and side counts used
text-slate-500 directly, so they didn't pick up the contrast fix in
e4f38d5. Switch each body-text caller to text-muted: AI settings
dialog helper text, AI provider helper text and link rows, ResourcePicker
empty-state copy / resource IDs / "+N more tags" / "N selected" footer,
AIModelSelectionSection "(loading...)" tag, ConfiguredNodeTables cluster
node count, and the PatrolIntelligenceHeader plan-restriction note.

Live contrast measurement after the change: every visible muted-style
text on the Patrol page now reads between 6.92 and 7.58 on its
resolved background — well above the WCAG AA 4.5 floor it was missing
on bg-surface-alt.

Icon-tint usages of text-slate-500 (Lucide icons, chevron rotations,
hover-state controls) are left as-is — those are deliberate color
choices, not muted-text intent.
2026-05-10 22:32:48 +01:00
rcourtman
e4f38d5556 Bump light-mode muted text to meet WCAG AA on alt surfaces
text-muted = slate-500 against bg-surface-alt = slate-100 measured a
4.34 contrast ratio — below the WCAG AA threshold of 4.5 for normal
text. Move to slate-600. New ratios: 6.92 on bg-surface-alt and 7.58
on bg-surface — both pass comfortably. Dark mode already passed at
5.09 and stays untouched.
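
The ratios above can be re-derived with the WCAG 2.x contrast formula. This is a sketch under stated assumptions: the hex values are the default Tailwind slate palette, and bg-surface is assumed to resolve to white; neither is stated in this message.

```go
package main

import (
	"fmt"
	"math"
)

// RGB is an 8-bit sRGB color.
type RGB struct{ R, G, B uint8 }

// linearize converts one 8-bit sRGB channel to linear light (WCAG 2.x formula).
func linearize(c uint8) float64 {
	v := float64(c) / 255
	if v <= 0.03928 {
		return v / 12.92
	}
	return math.Pow((v+0.055)/1.055, 2.4)
}

// relativeLuminance implements the WCAG relative luminance definition.
func relativeLuminance(c RGB) float64 {
	return 0.2126*linearize(c.R) + 0.7152*linearize(c.G) + 0.0722*linearize(c.B)
}

// contrastRatio returns (lighter + 0.05) / (darker + 0.05).
func contrastRatio(a, b RGB) float64 {
	la, lb := relativeLuminance(a), relativeLuminance(b)
	if la < lb {
		la, lb = lb, la
	}
	return (la + 0.05) / (lb + 0.05)
}

// Assumed palette values (Tailwind defaults), not taken from the commit.
var (
	white    = RGB{0xff, 0xff, 0xff} // assumed bg-surface
	slate100 = RGB{0xf1, 0xf5, 0xf9} // assumed bg-surface-alt
	slate500 = RGB{0x64, 0x74, 0x8b} // old text-muted
	slate600 = RGB{0x47, 0x55, 0x69} // new text-muted
)

func main() {
	fmt.Printf("slate-500 on slate-100: %.2f\n", contrastRatio(slate500, slate100))
	fmt.Printf("slate-600 on slate-100: %.2f\n", contrastRatio(slate600, slate100))
	fmt.Printf("slate-600 on white:     %.2f\n", contrastRatio(slate600, white))
}
```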
2026-05-10 22:27:57 +01:00
rcourtman
7d4d270b44 Render briefing timestamps as relative time and drop noise
Two visible issues in the Pulse Assistant drawer when opened on a
Patrol finding via "Discuss with Assistant":

1. The Attention line rendered the last-regression timestamp as a
   raw ISO string with microseconds and timezone offset
   ("last regression 2026-05-10T22:02:11.519513+01:00"). The rest
   of the UI uses relative time and the briefing copy was the
   outlier. The LLM consuming the briefing handles "24 mins ago"
   just as well as a raw timestamp, and the structured handoff
   metadata still carries precise timestamps for any caller that
   needs them.

2. The Attention line ended with "loop detected" on every active
   finding because loop_state=detected is the default initial
   state for any active finding. Rendering "loop detected" added
   no information — only meaningful loop states (awaiting approval,
   remediation failed, timed out, etc.) need surfacing in the
   attention reason.

Adds a formatBriefingTimestamp helper that wraps formatRelativeTime
with sensible defaults and routes all three "last regression"
sites through it. Updates the briefing-test fixture to pin the
system clock so the relative-time assertion is deterministic
against the fixed regression timestamp.
2026-05-10 22:27:05 +01:00
rcourtman
20df3dcd2c Let a valid bootstrap token authorize initial setup from any origin
The loopback gate from 586473ee3 rejected non-loopback setup requests
before the bootstrap-token check could run, so a Proxmox-LXC install
(install script prints URL + token; user opens URL on workstation,
pastes token) hit "only available from localhost" even with the correct
token. The token is the security boundary — only callers with
filesystem access to the data dir can read it — so a valid token now
authorizes setup from any origin. No-token requests still require
direct loopback.
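
A sketch of the resulting decision order, assuming a hypothetical authorizeSetup helper rather than the real handler. It evaluates the token first and consults the peer address only when no token is presented. Rejecting a wrong token even from loopback is an assumption of this sketch; the message does not say how that case is handled. Proxy-forwarded headers are deliberately ignored here.

```go
package main

import (
	"crypto/subtle"
	"fmt"
	"net"
)

// authorizeSetup is a hypothetical helper illustrating the rule:
// a valid bootstrap token authorizes setup from any origin, and a request
// without a token is allowed only from a direct loopback peer.
func authorizeSetup(remoteAddr, presented, expected string) bool {
	if presented != "" {
		// Constant-time comparison avoids leaking token contents via timing.
		return expected != "" &&
			subtle.ConstantTimeCompare([]byte(presented), []byte(expected)) == 1
	}
	host, _, err := net.SplitHostPort(remoteAddr)
	if err != nil {
		host = remoteAddr // address carried no port
	}
	ip := net.ParseIP(host)
	return ip != nil && ip.IsLoopback()
}

func main() {
	fmt.Println(authorizeSetup("192.168.1.20:51000", "s3cret", "s3cret"))
	fmt.Println(authorizeSetup("192.168.1.20:51000", "", "s3cret"))
	fmt.Println(authorizeSetup("127.0.0.1:51000", "", "s3cret"))
}
```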

Updates the two contract/setup tests that pinned the old behavior.

Fixes discussion #1459.
2026-05-10 22:25:34 +01:00
rcourtman
9d4fdedf9a Promote guest drawer card labels to h3
Each card in the Workloads guest drawer (System, Guest Info, Memory,
Backup, Tags, Filesystems, Network) was a plain <div> with uppercase
styling. They are subsections of the drawer's existing <h2>, so make
them <h3>. Visual styling is identical — same Tailwind classes — only
the tag changes. Screen-reader users now get a navigable heading
outline inside the drawer.
2026-05-10 22:24:41 +01:00
rcourtman
e32d4ede44 Expose engine narrative entry points for non-rendering callers
The reporting engine's synthesis layer was reachable only through
Generate/GenerateMulti, which always rendered PDF or CSV. Pulse
Assistant needs the same retrospective synthesis (per-resource
summary, fleet outliers, period comparison) in a form it can present
in chat, not as a downloaded artifact.

Add two non-rendering entry points to the Engine interface:

  NarrativeFor(req MetricReportRequest) (*Narrative, error)
  FleetNarrativeFor(req MultiReportRequest) (*FleetNarrative, error)

Both run the same query path and the same narrator resolution as their
rendering counterparts (heuristic by default, AI when the request
supplies a narrator, fail-closed-to-heuristic on any narrator error)
and return the structured narrative without invoking the fpdf/csv
output stage. Test stubs in pkg/reporting and internal/api are
updated to implement the extended interface.
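
A sketch of the narrator resolution described above (heuristic by default, AI when supplied, fail-closed-to-heuristic on error). The types are simplified stand-ins that reuse the names from this message; their fields, the Narrate method signature, and the Source marker are assumptions for illustration only.

```go
package main

import (
	"errors"
	"fmt"
)

// Narrative is a simplified stand-in; Source is a hypothetical field used
// here only to show which path produced the result.
type Narrative struct {
	Summary string
	Source  string
}

// Narrator is an assumed shape for the optional narrator on the request.
type Narrator interface {
	Narrate(resourceID string) (*Narrative, error)
}

// MetricReportRequest is reduced to the two fields this sketch needs.
type MetricReportRequest struct {
	ResourceID string
	Narrator   Narrator // optional; nil means heuristic only
}

type heuristicNarrator struct{}

func (heuristicNarrator) Narrate(id string) (*Narrative, error) {
	return &Narrative{Summary: "heuristic summary for " + id, Source: "heuristic"}, nil
}

// staticNarrator stands in for an AI-backed narrator that succeeds.
type staticNarrator struct{ text string }

func (s staticNarrator) Narrate(string) (*Narrative, error) {
	return &Narrative{Summary: s.text, Source: "ai"}, nil
}

// failingNarrator stands in for an AI-backed narrator that errors.
type failingNarrator struct{}

func (failingNarrator) Narrate(string) (*Narrative, error) {
	return nil, errors.New("provider unavailable")
}

// narrativeFor returns structured output with no rendering stage. Any
// narrator error falls through to the deterministic heuristic.
func narrativeFor(req MetricReportRequest) (*Narrative, error) {
	if req.Narrator != nil {
		if n, err := req.Narrator.Narrate(req.ResourceID); err == nil && n != nil {
			return n, nil
		}
	}
	return heuristicNarrator{}.Narrate(req.ResourceID)
}

func main() {
	n, _ := narrativeFor(MetricReportRequest{ResourceID: "pve1", Narrator: failingNarrator{}})
	fmt.Println(n.Source, "-", n.Summary)
}
```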

These are the seams the upcoming pulse_summarize Assistant tools wrap
to answer questions like "what's hot on pve1 this week" or "where
should I look across my fleet" without round-tripping through report
generation. Same synthesis layer, no PDF involved.

Also fixes a pre-existing flake in TestEngineGenerate_UsesSuppliedNarrator
(metrics writes are async; the first Generate sometimes ran before
the raw tier flushed). Wrapped in the same eventually-pattern used by
the prior-period and findings-provider tests.
2026-05-10 22:23:09 +01:00
rcourtman
ac0f361372 Document the report-narrator detection-boundary invariant
Follow-up to the narrator prompt changes that forbid acting as a
parallel detector. ai-runtime.md gains a Current State paragraph
documenting that both report narrator system prompts encode an
explicit detection-boundary invariant: warning or critical severity
classifications must be backed by a Patrol finding, an alert, or a
hard-threshold breach in the structured input, not by metric
inference alone. Patterns the narrator notices without that backing
are constrained to info severity. This keeps the narrative
retrospective on Patrol's work and prevents silent
shadow-classification competing with Patrol's detection rules.

api-contracts.md gains an equivalent paragraph in the reporting
contract section. The same rule applies to outliers and patterns
on the fleet path, and to recommendations on both. The deterministic
heuristic narrators were already constrained to the same threshold
rules; this aligns the AI path with the same evidence surface the
fallback uses, so the report PDF cannot become a back-door
detection surface that diverges from the findings store.
2026-05-10 22:02:47 +01:00
rcourtman
41155a7968 Forbid the report narrators from acting as parallel detectors
The single-resource and fleet narrator prompts both grounded their
claims in structured data, but neither prevented the model from
classifying observations at warning or critical severity based on
metric inference alone. That left a subtle gap: an AI narrator
noticing memory creep across a window could promote it to warning
even when Patrol — the canonical detection layer — had not flagged
it. That competes with Patrol rather than summarizing its work,
and it lets the report PDF silently shadow-classify in a way that
diverges from the findings store.

Add an explicit detection-boundary instruction to both prompts:
warning or critical severity may only be assigned when backed by
a Patrol finding, an alert, or a hard-threshold breach visible in
the input (cpu max > 90, memory avg > 85, disk avg > 85, failed
or high-wear disks, storage pools at >= 90%). Patterns the model
sees in metric data without that backing are constrained to info
severity. Recommendations follow the same rule. The narrative
remains a retrospective summary of Patrol's classified state, not
a parallel classifier.

This is a prompt-only change. The deterministic data surface and
the heuristic fallback narrator are unaffected; the heuristic
narrators already classify only on the same threshold rules listed
above, so the AI narrator is now constrained to the same evidence
surface its fallback uses.
2026-05-10 22:02:10 +01:00
rcourtman
08491b9f48 Fix QuickAnalysis cost recording and audit the pattern
Service.Narrate (b84b87d8d) and Service.NarrateFleet (d4463a615)
fixed missing cost-record calls in the report-narrative path. Auditing
the rest of internal/ai for the same bug class found one more:
Service.QuickAnalysis. It is used for alert auto-resolve and similar
lightweight decisions, so production token spend on auto-resolve
analysis was invisible in the AI usage dashboard.

Mirror the same fix: capture costStore under the read lock alongside
provider/cfg, and after provider.Chat returns, record a UsageEvent
labelled with the request's UseCase (defaulting to "quick_analysis"
when the caller leaves it blank). Recording happens before the
empty-content guard so failed-but-billed calls are still visible.

Adds cost_recording_audit_test.go: an AST-level audit that walks
internal/ai/*.go (excluding _test.go and sub-packages), finds every
function calling .Chat() on a providers.Provider value, and asserts
each function body also references .Record() on a cost store.
Exemption is allowed via a //cost-recording-exempt: <reason> doc
comment. RunPatrolToolPreflight is annotated as exempt — it is a
connectivity self-test, not user workload, and should not pollute
the operator's cost dashboard.

The audit is intentionally local (function-scoped, not
interprocedural). A passthrough wrapper that calls a recording
function rather than calling Record itself would need an explicit
exemption naming the wrapped callee. Keeping the scan local makes
new Chat callers loud rather than letting silent gaps creep in via
indirection.

Future Chat callers must either record cost or carry the exemption
marker. The audit fails CI otherwise, so the regression that shipped
in b2bd9d114 (Narrate) and would have shipped again in d4463a615
(fleet) cannot recur silently.
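
A function-scoped audit of this kind can be sketched with go/parser and go/ast. This is an illustration, not cost_recording_audit_test.go: it matches any selector call named Chat or Record by name only (the real audit is described as checking a providers.Provider value), and the sample source is invented.

```go
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
	"strings"
)

const exemptMarker = "cost-recording-exempt:"

// callsMethod reports whether body contains a call of the form x.<name>(...).
func callsMethod(body *ast.BlockStmt, name string) bool {
	found := false
	ast.Inspect(body, func(n ast.Node) bool {
		call, ok := n.(*ast.CallExpr)
		if !ok {
			return true
		}
		if sel, ok := call.Fun.(*ast.SelectorExpr); ok && sel.Sel.Name == name {
			found = true
			return false
		}
		return true
	})
	return found
}

// isExempt checks the raw doc-comment text for the exemption marker.
func isExempt(fn *ast.FuncDecl) bool {
	if fn.Doc == nil {
		return false
	}
	for _, c := range fn.Doc.List {
		if strings.Contains(c.Text, exemptMarker) {
			return true
		}
	}
	return false
}

// auditSource returns the names of functions that call .Chat() without also
// calling .Record() in the same body and without an exemption comment.
func auditSource(src string) ([]string, error) {
	fset := token.NewFileSet()
	file, err := parser.ParseFile(fset, "audit.go", src, parser.ParseComments)
	if err != nil {
		return nil, err
	}
	var offenders []string
	for _, decl := range file.Decls {
		fn, ok := decl.(*ast.FuncDecl)
		if !ok || fn.Body == nil || isExempt(fn) {
			continue
		}
		if callsMethod(fn.Body, "Chat") && !callsMethod(fn.Body, "Record") {
			offenders = append(offenders, fn.Name.Name)
		}
	}
	return offenders, nil
}

// sample is invented input for the demonstration.
const sample = `package ai

func (s *Service) Narrate() {
	s.provider.Chat()
	s.costStore.Record()
}

func (s *Service) QuickAnalysis() {
	s.provider.Chat()
}

//cost-recording-exempt: connectivity self-test
func (s *Service) RunPatrolToolPreflight() {
	s.provider.Chat()
}
`

func main() {
	offenders, err := auditSource(sample)
	fmt.Println(offenders, err)
}
```

Because the scan never follows calls into other functions, a wrapper that delegates recording is reported as an offender until it carries the exemption comment, which is the "loud by default" behavior described above.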
2026-05-10 21:57:39 +01:00
rcourtman
c6a786a36d Broaden regression-counter reset to include LLM-driven cycles
Initial detector (commit 942f9ca0f) only matched on the two legacy
absence-signature reason strings — but the Backup failed finding
on the live preview showed 6 auto_resolved events all with empty
messages, produced by the LLM patrol_resolve_finding tool via
Resolve(_, true). Counter stayed at 6× after the previous
migration ran.

New detector: any active finding whose category is NOT eligible
for stale-auto-resolve (i.e. anything other than performance or
capacity) AND has any auto_resolved event on its lifecycle is
treated as having an inflated counter. The rationale is the same
rule the category gate already established: for event/persistent
categories there is no legitimate absence-driven resolution path,
so any auto_resolved event was either stamped by one of the removed
bogus paths or was an LLM judgment call that the next run reverted
as a regression. Either way the cumulative count carries no signal.

Performance/capacity findings retain their counter because the
metric-cleared resolution model is sound there.

The test is extended to cover four cases: LLM-driven cycle resets,
legacy-reason cycle resets, eligible-category preserves counter,
and non-eligible category without any auto_resolved preserves
counter, plus the existing idempotency case (already-reset finding
stays reset and is not re-applied).
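
The broadened detector can be sketched as follows. The struct shapes and function names are hypothetical; the event type strings and the category rule are taken from this message and its predecessor. Clearing LastRegressionAt is omitted for brevity.

```go
package main

import "fmt"

// LifecycleEvent and Finding are simplified, hypothetical stand-ins.
type LifecycleEvent struct{ Type string }

type Finding struct {
	Category        string
	RegressionCount int
	Lifecycle       []LifecycleEvent
}

// supportsStaleAutoResolve: only performance and capacity qualify.
func supportsStaleAutoResolve(category string) bool {
	return category == "performance" || category == "capacity"
}

// hasInflatedCounter reports whether the finding is in a non-eligible
// category, has at least one auto_resolved event, and was not already reset.
func hasInflatedCounter(f *Finding) bool {
	if supportsStaleAutoResolve(f.Category) {
		return false
	}
	sawAutoResolved := false
	for _, ev := range f.Lifecycle {
		switch ev.Type {
		case "regression_counter_reset":
			return false // already migrated: keep the pass idempotent
		case "auto_resolved":
			sawAutoResolved = true
		}
	}
	return sawAutoResolved
}

// resetIfInflated zeroes the counter and appends the marker event.
func resetIfInflated(f *Finding) bool {
	if !hasInflatedCounter(f) {
		return false
	}
	f.RegressionCount = 0
	f.Lifecycle = append(f.Lifecycle, LifecycleEvent{Type: "regression_counter_reset"})
	return true
}

func main() {
	backup := &Finding{
		Category:        "backup",
		RegressionCount: 6,
		Lifecycle:       []LifecycleEvent{{Type: "auto_resolved"}, {Type: "regressed"}},
	}
	fmt.Println(resetIfInflated(backup), backup.RegressionCount)
	fmt.Println(resetIfInflated(backup), backup.RegressionCount)
}
```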
2026-05-10 21:47:57 +01:00
rcourtman
942f9ca0f5 Reset regression counters polluted by bogus auto_resolve cycles
The Backup failed finding on the live preview showed "regressed 6×"
when at most 1 or 2 of those regressions were genuine recurrences.
The rest were the system fighting itself, driven by the
absence-based auto_resolve paths that were gated (category
whitelist) or removed (alert-mirror rip) earlier in this branch.
Counter stayed sticky after those fixes landed, so the trust strip
and finding badges still surfaced the inflated number.

FindingsStore.SetPersistence load pass now scans each active
finding's lifecycle for the two known bogus-signature auto_resolved
reasons ("No longer detected by patrol", "Resource no longer
exists in infrastructure"). If found, RegressionCount is reset to
0 and LastRegressionAt is cleared, and a regression_counter_reset
lifecycle event is appended so the migration is idempotent. A
finding that already has a regression_counter_reset event is left
alone; any regressed events that accrued after the reset are
genuine and stand.

findingHasBogusAutoResolveCycle returns true only when the
lifecycle contains a bogus auto_resolved and no prior reset event,
so the function is the single point of truth for the migration
decision and is straightforward to test. Test covers three cases:
finding with bogus signature gets reset, finding with empty-message
auto_resolved (LLM-driven, legitimate) keeps its counter, finding
already migrated is not re-reset.

Updates ai-runtime Current State to document the second migration
on top of the alert-mirror retirement.
2026-05-10 21:44:29 +01:00
rcourtman
590671ffbb Retire legacy "Active alert detected" findings on load
The previous commit removed the detectAlertSignals path so no NEW
alert-mirror findings are emitted, but the findings already
persisted from earlier builds stay in the store indefinitely.
Nothing cleans them up: reconcileStaleFindings is gated on
performance/capacity categories, and the LLM's resolves used to be
undone by re-detection on the next run. With the deterministic
emitter gone re-detection can no longer happen, but the existing
findings are still left sitting as active, draining the trust strip
and score.

FindingsStore.SetPersistence now runs a one-shot retirement pass
on load: any active finding with title "Active alert detected",
source ai-analysis, and category general is auto-resolved with
reason "Patrol no longer mirrors alerts; the Alerts page is the
canonical surface for currently-firing alerts." The pass appends
an auto_resolved lifecycle event so the retirement is auditable,
syncs the loop state to resolved, and schedules a save so the
cleanup persists.

Idempotent: after the first load with this code, no findings
match the signature so the pass is a no-op. Defensive: the
signature requires all three fields (title + source + category)
to match before retiring, so an operator-authored finding that
happens to share the title is left untouched. Test covers the
mirror case, the matching-title-but-foreign-source case (must
NOT retire), and an unrelated active finding (must NOT retire),
plus verifies the retired state persists back through the
persistence layer.
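
A sketch of the three-field signature match, with a hypothetical Finding shape and helper names. The title, source, category, and reason strings are the ones quoted above; loop-state sync and the scheduled save are left out.

```go
package main

import "fmt"

// Finding is a simplified, hypothetical stand-in for a persisted finding.
type Finding struct {
	Title     string
	Source    string
	Category  string
	Resolved  bool
	Lifecycle []string
}

const retireReason = "Patrol no longer mirrors alerts; the Alerts page is the canonical surface for currently-firing alerts."

// isLegacyAlertMirror requires all three fields to match, so a finding that
// merely shares the title is left untouched.
func isLegacyAlertMirror(f *Finding) bool {
	return !f.Resolved &&
		f.Title == "Active alert detected" &&
		f.Source == "ai-analysis" &&
		f.Category == "general"
}

// retireLegacyAlertMirrors resolves matching findings and returns how many
// it retired. A second call finds nothing active to match.
func retireLegacyAlertMirrors(findings []*Finding) int {
	retired := 0
	for _, f := range findings {
		if !isLegacyAlertMirror(f) {
			continue
		}
		f.Resolved = true
		f.Lifecycle = append(f.Lifecycle, "auto_resolved: "+retireReason)
		retired++
	}
	return retired
}

func main() {
	store := []*Finding{
		{Title: "Active alert detected", Source: "ai-analysis", Category: "general"},
		{Title: "Active alert detected", Source: "operator", Category: "general"},
		{Title: "Backup failed", Source: "ai-analysis", Category: "backup"},
	}
	fmt.Println(retireLegacyAlertMirrors(store))
	fmt.Println(retireLegacyAlertMirrors(store))
}
```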

Updates ai-runtime Current State to record the migration path.
2026-05-10 21:39:09 +01:00
rcourtman
271d12ecab Stop mirroring alerts into Patrol findings
The deterministic signal pipeline ran the pulse_alerts tool output
through detectAlertSignals and produced a SignalActiveAlert for
every firing alert, which Patrol then materialized as an
"Active alert detected" finding (source: ai-analysis, category:
general). The system prompt at the top of patrol_ai.go explicitly
tells the LLM not to duplicate alerts — but the deterministic
emitter was duplicating them anyway, behind the LLM's back.

Symptoms observed in the wild:
- 9 active "Active alert detected" findings in Patrol, every one a
  duplicate of an existing alert already on the Alerts page.
- The LLM, doing what the prompt told it, resolved each mirrored
  finding via patrol_resolve_finding. Next run the alert was still
  firing and Patrol re-emitted the signal → finding regressed.
  Lifecycle showed several auto_resolved → re-detected → regressed
  cycles per finding within hours.
- Health score dragged down by issues the operator already saw on
  the Alerts page, with no operator action possible from Patrol
  that wasn't already available from Alerts.

Rip detectAlertSignals entirely, remove the pulse_alerts case from
the signal-extraction switch, drop SignalActiveAlert plus its key
/ title / recommendation entries. Convert the prior
TestDetectSignals_ActiveAlert into a regression guard that locks
in the no-mirror behavior.

Updates the ai-runtime subsystem Current State to record the
decision: Patrol does not duplicate the Alerts surface; alerts
own their own lifecycle, surface, and acknowledgement model.
2026-05-10 21:33:41 +01:00
rcourtman
7bd596d378 Document the fleet-narrative surface in ai-runtime and api-contracts
Follow-up to the fleet-level AI narrative refactor: ai-runtime.md
gains a Current State paragraph documenting that internal/ai.Service
also implements pkg/reporting.FleetNarrator with its own use-case
label (report_narrative_fleet) so fleet vs single-resource spend is
distinguishable in the cost ledger and budget gate, and that the
single-resource narrator is intentionally not propagated through the
multi-report path. api-contracts.md gains a paragraph documenting
the new optional FleetNarrator field on MultiReportRequest, the new
FleetNarrative field on MultiReportData, the rendered fleet section
(executive prose, named outliers, cross-cutting patterns,
recommendations, optional period-comparison, AI provenance footer),
and the explicit invariant that the deterministic resource summary
table stays rendered from the same per-resource aggregates so every
named outlier is verifiable against the table below.

Dependent subsystems (agent-lifecycle, performance-and-scalability,
storage-recovery) remain unchanged: their Extension Points reference
internal/api/ broadly but agent lifecycle, perf scaling, and storage
recovery semantics have no delta from this change.
2026-05-10 21:24:04 +01:00
rcourtman
d4463a615c Add fleet-level AI narrative for multi-resource reports
The single-resource AI narrative landed in b2bd9d114 but multi-resource
fleet reports stayed heuristic-only. That left a gap on the exact axis
where AI helps most: a 50-resource fleet PDF is where synthesis is the
difference between useful and unread.

Introduce FleetNarrator as a separate interface from Narrator. The
input shapes are different — single-resource takes one set of metric
stats with a prior window, fleet takes a denormalised cross-resource
view with per-resource summaries plus a fleet aggregate.
HeuristicFleetNarrator owns the deterministic fallback: ranks
resources by severity (critical alerts > unhealthy disks > storage
pressure > memory > CPU > non-critical alerts), picks up to 5
outliers, derives cross-cutting patterns by counting how many of N
resources share a hot signal, and emits fleet-scoped recommendations.
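
The ranking order above can be sketched like this. ResourceSummary and its boolean "hot" fields are hypothetical simplifications (the real thresholds are not restated here); only the precedence order and the cap of 5 come from this message.

```go
package main

import (
	"fmt"
	"sort"
)

// ResourceSummary is a hypothetical per-resource rollup.
type ResourceSummary struct {
	Name            string
	CriticalAlerts  int
	UnhealthyDisks  int
	StoragePressure bool
	MemoryHot       bool
	CPUHot          bool
	OtherAlerts     int
}

const maxOutliers = 5

// severityRank maps a resource to its most severe signal class, 0 being the
// most severe. ok is false when the resource shows no signal at all.
func severityRank(r ResourceSummary) (rank int, ok bool) {
	switch {
	case r.CriticalAlerts > 0:
		return 0, true
	case r.UnhealthyDisks > 0:
		return 1, true
	case r.StoragePressure:
		return 2, true
	case r.MemoryHot:
		return 3, true
	case r.CPUHot:
		return 4, true
	case r.OtherAlerts > 0:
		return 5, true
	}
	return 0, false
}

// pickOutliers returns up to maxOutliers resources, most severe first.
// The stable sort keeps input order among resources of equal rank.
func pickOutliers(fleet []ResourceSummary) []ResourceSummary {
	type ranked struct {
		rank int
		res  ResourceSummary
	}
	var candidates []ranked
	for _, r := range fleet {
		if rank, ok := severityRank(r); ok {
			candidates = append(candidates, ranked{rank: rank, res: r})
		}
	}
	sort.SliceStable(candidates, func(i, j int) bool {
		return candidates[i].rank < candidates[j].rank
	})
	if len(candidates) > maxOutliers {
		candidates = candidates[:maxOutliers]
	}
	out := make([]ResourceSummary, 0, len(candidates))
	for _, c := range candidates {
		out = append(out, c.res)
	}
	return out
}

func main() {
	fleet := []ResourceSummary{
		{Name: "quiet"},
		{Name: "cpu1", CPUHot: true},
		{Name: "crit1", CriticalAlerts: 2},
		{Name: "disk1", UnhealthyDisks: 1},
	}
	for _, r := range pickOutliers(fleet) {
		fmt.Println(r.Name)
	}
}
```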

internal/ai.Service implements FleetNarrator through
report_fleet_narrator.go. Distinct use-case label
(report_narrative_fleet) so fleet vs single-resource spend is
separable in the cost ledger and budget gate. The fleet payload is
denormalised through buildReportFleetPayload so prompt cost scales
linearly with fleet size. Same fail-closed invariant — nil provider,
parse failure, or context cancellation falls through to the heuristic.

Single-resource Narrator is intentionally NOT propagated through
engine.GenerateMulti: a 50-resource fleet report performs one AI call
(fleet narrator), not 51. The router resolver returns the AI service
for all three roles (Narrator, FleetNarrator, FindingsProvider).

The fleet PDF renders the FleetNarrative in the fleet summary cover
when present: executive prose, named outliers with severity-coloured
bullets, cross-cutting patterns, recommendations, optional period
comparison, and an AI provenance footer. The deterministic resource
summary table is preserved above so every named outlier is verifiable
against the table immediately below it. Legacy "Highest CPU / Most
alerts" bullets remain as the fallback when no FleetNarrative is
attached.
2026-05-10 21:23:12 +01:00
rcourtman
f9289b7c81 Make toasts announce themselves to screen readers
The toast container had no live-region wiring, so screen readers
missed every notification — success or failure. Make the container a
named region landmark, and give each toast a role/aria-live derived
from its type: error and warning use role="alert" + assertive (they
interrupt), success and info use role="status" + polite. aria-atomic
ensures the whole toast (title plus message) is read as one unit.
2026-05-10 21:11:43 +01:00
rcourtman
aa9fb9d9aa Give resource drawers a proper region landmark
Every drawer (Workloads guest, PMG instance, K8s namespaces and
deployments, Swarm services, generic resource detail) wrapped its body
in a plain <div>, so screen-reader users had no landmark to jump to
and no announced name when a row expanded. Convert each wrapping
<div> to <section aria-labelledby> with a heading inside.

For the generic resource detail drawer, the existing visible name
becomes an <h2> (visual styling unchanged via m-0). For the other
drawers, a screen-reader-only <h2> carries the entity name (guest
name, PMG hostname, cluster name) so the landmark is named without
visible duplication.
2026-05-10 21:09:41 +01:00
rcourtman
b84b87d8d8 Record cost events for AI report narration
Service.Narrate consumed provider tokens without recording a
cost.UsageEvent, so AI-narrated reports were invisible in the operator
cost ledger. Every other Service call site in the AI runtime records
cost; the narrator omitted it.

Mirror the QuickAnalysis/chat pattern: capture the cost store under
the read lock alongside the provider/cfg snapshot, and after
provider.Chat returns, record a UsageEvent labelled
report_narrative with the resource type/id as the target. Recording
happens before parsing so a failed-but-billed call (e.g. provider
returned malformed JSON) still appears in the ledger — the operator
was billed regardless of whether we could use the response.

The use_case string lifts to a package-level constant so the budget
gate (enforceBudget), the cost label, and the dashboard taxonomy all
reference one identifier.
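
The record-before-parse ordering can be illustrated with a small sketch. CostLedger, ChatFunc, ChatResponse, and the Narrative shape are hypothetical stand-ins for the real cost store and provider; only the use-case label and the ordering come from this message.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// useCaseReportNarrative is the single shared identifier for the label.
const useCaseReportNarrative = "report_narrative"

// UsageEvent and CostLedger are simplified stand-ins for the cost store.
type UsageEvent struct {
	UseCase      string
	Target       string
	InputTokens  int
	OutputTokens int
}

type CostLedger struct{ Events []UsageEvent }

func (l *CostLedger) Record(e UsageEvent) { l.Events = append(l.Events, e) }

// ChatResponse and ChatFunc are an assumed shape for a provider call.
type ChatResponse struct {
	Content      string
	InputTokens  int
	OutputTokens int
}

type ChatFunc func(prompt string) (ChatResponse, error)

type Narrative struct {
	Summary string `json:"summary"`
}

// narrate records the usage event as soon as the provider returns, then
// parses. A malformed response still leaves a ledger entry behind.
func narrate(chat ChatFunc, costs *CostLedger, target, prompt string) (*Narrative, error) {
	resp, err := chat(prompt)
	if err != nil {
		return nil, err // no response, so no token counts to record here
	}
	costs.Record(UsageEvent{
		UseCase:      useCaseReportNarrative,
		Target:       target,
		InputTokens:  resp.InputTokens,
		OutputTokens: resp.OutputTokens,
	})
	var n Narrative
	if err := json.Unmarshal([]byte(resp.Content), &n); err != nil {
		return nil, fmt.Errorf("parse narrative: %w", err)
	}
	return &n, nil
}

func main() {
	ledger := &CostLedger{}
	malformed := func(string) (ChatResponse, error) {
		return ChatResponse{Content: "not json", InputTokens: 120, OutputTokens: 8}, nil
	}
	_, err := narrate(malformed, ledger, "node/pve1", "summarize")
	fmt.Println(err != nil, len(ledger.Events))
}
```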
2026-05-10 21:08:51 +01:00
rcourtman
9905383f70 Document the report-narrative surface in ai-runtime and api-contracts
Follow-up to the report narrative refactor: ai-runtime.md gains a
Current State paragraph documenting that internal/ai.Service now
implements pkg/reporting.Narrator and pkg/reporting.FindingsProvider,
and that the narrator/findings surfaces inherit the same provider,
sanitizer, model selection, budget enforcement, and fail-closed
governance as the rest of the canonical AI runtime. api-contracts.md
gains a paragraph documenting the new optional Narrator and
FindingsProvider fields on MetricReportRequest, the new Narrative,
PriorPeriod, and Findings fields on ReportData, the three new PDF
sections (executive prose, Period-over-period changes, AI provenance
footer), and the explicit invariant that the deterministic data
surface (charts, stats, alerts, storage, disks) stays rendered from
the same aggregates so every AI claim is verifiable against adjacent
data. Multi-resource fleet reports intentionally remain heuristic-only
at this transport layer.

The dependent subsystems flagged by the staged-shape guard
(agent-lifecycle, performance-and-scalability, storage-recovery)
genuinely have no contract delta from the prior commit: their
Extension Points reference internal/api/ broadly, but agent
lifecycle, performance scaling, and storage recovery semantics are
unchanged.
2026-05-10 21:00:39 +01:00
rcourtman
234d0bbe28 Stop clobbering typed input in alert Recipients textarea
The Recipients (one per line) textarea trimmed and filtered empty lines
on every keystroke. Pressing Enter at the end of a line produced a
trailing empty entry, the filter dropped it, and the controlled value
snapped back to the single line, so typing past line one was
impossible, and leading/trailing whitespace got eaten mid-type. Paste
still worked because it inserted multiple non-empty lines in one event.

Pass raw split lines up during edit and do the trim+filter inside
buildEmailConfigPayload at save time, so the textarea stops fighting
the cursor and the wire payload is still clean.
2026-05-10 20:54:29 +01:00
rcourtman
27bd31684a Log the underlying error on audit list 500s
HandleListAuditEvents dropped the Query/Count error before writing the
500, so a user hitting "Failed to fetch audit events" produced no
server-side log line — diagnosing the failure was impossible without a
local repro. Log the error with the org ID so the next instance is
findable. Doesn't change the user-facing response.
2026-05-10 20:44:18 +01:00