Commit graph

2420 commits

Author SHA1 Message Date
rcourtman
271d12ecab Stop mirroring alerts into Patrol findings
The deterministic signal pipeline ran the pulse_alerts tool output
through detectAlertSignals and produced a SignalActiveAlert for
every firing alert, which Patrol then materialized as an
"Active alert detected" finding (source: ai-analysis, category:
general). The system prompt at the top of patrol_ai.go explicitly
tells the LLM not to duplicate alerts — but the deterministic
emitter was duplicating them anyway, behind the LLM's back.

Symptoms observed in the wild:
- 9 active "Active alert detected" findings in Patrol, every one a
  duplicate of an existing alert already on the Alerts page.
- The LLM, doing what the prompt told it, resolved each mirrored
  finding via patrol_resolve_finding. Next run the alert was still
  firing and Patrol re-emitted the signal → finding regressed.
  Lifecycle showed several auto_resolved → re-detected → regressed
  cycles per finding within hours.
- Health score dragged down by issues the operator already saw on
  the Alerts page, with no operator action possible from Patrol
  that wasn't already available from Alerts.

Rip detectAlertSignals entirely, remove the pulse_alerts case from
the signal-extraction switch, drop SignalActiveAlert plus its key
/ title / recommendation entries. Convert the prior
TestDetectSignals_ActiveAlert into a regression guard that locks
in the no-mirror behavior.

Updates the ai-runtime subsystem Current State to record the
decision: Patrol does not duplicate the Alerts surface; alerts
own their own lifecycle, surface, and acknowledgement model.
2026-05-10 21:33:41 +01:00
rcourtman
7bd596d378 Document the fleet-narrative surface in ai-runtime and api-contracts
Follow-up to the fleet-level AI narrative refactor: ai-runtime.md
gains a Current State paragraph documenting that internal/ai.Service
also implements pkg/reporting.FleetNarrator with its own use-case
label (report_narrative_fleet) so fleet vs single-resource spend is
distinguishable in the cost ledger and budget gate, and that the
single-resource narrator is intentionally not propagated through the
multi-report path. api-contracts.md gains a paragraph documenting
the new optional FleetNarrator field on MultiReportRequest, the new
FleetNarrative field on MultiReportData, the rendered fleet section
(executive prose, named outliers, cross-cutting patterns,
recommendations, optional period-comparison, AI provenance footer),
and the explicit invariant that the deterministic resource summary
table stays rendered from the same per-resource aggregates so every
named outlier is verifiable against the table below.

Dependent subsystems (agent-lifecycle, performance-and-scalability,
storage-recovery) remain unchanged: their Extension Points reference
internal/api/ broadly but agent lifecycle, perf scaling, and storage
recovery semantics have no delta from this change.
2026-05-10 21:24:04 +01:00
rcourtman
9905383f70 Document the report-narrative surface in ai-runtime and api-contracts
Follow-up to the report narrative refactor: ai-runtime.md gains a
Current State paragraph documenting that internal/ai.Service now
implements pkg/reporting.Narrator and pkg/reporting.FindingsProvider,
and that the narrator/findings surfaces inherit the same provider,
sanitizer, model selection, budget enforcement, and fail-closed
governance as the rest of the canonical AI runtime. api-contracts.md
gains a paragraph documenting the new optional Narrator and
FindingsProvider fields on MetricReportRequest, the new Narrative,
PriorPeriod, and Findings fields on ReportData, the three new PDF
sections (executive prose, Period-over-period changes, AI provenance
footer), and the explicit invariant that the deterministic data
surface (charts, stats, alerts, storage, disks) stays rendered from
the same aggregates so every AI claim is verifiable against adjacent
data. Multi-resource fleet reports intentionally remain heuristic-only
at this transport layer.

The dependent subsystems flagged by the staged-shape guard
(agent-lifecycle, performance-and-scalability, storage-recovery)
genuinely have no contract delta from the prior commit: their
Extension Points reference internal/api/ broadly, but agent
lifecycle, performance scaling, and storage recovery semantics are
unchanged.
2026-05-10 21:00:39 +01:00
rcourtman
a343ba0126 Tier patrol-coverage health impact by recent-error ratio
The overall health score chip read "A · 95/100" while the same
assessment card said "Coverage incomplete · Recent Patrol runs
encountered errors · Verify full coverage." Those messages
contradicted because the "recent errors but had a successful run"
coverage factor used a flat -10 penalty regardless of the error
ratio. With one successful manual run among many failed startup
runs, the math stayed in the A band and the score directly
undermined the warning it sat next to.

The factor now tiers the impact by ratio of errored runs to
relevant runs in the scoring window:

  >50% errored → -30, "Most recent Patrol runs encountered errors
                       (N of M); the current health summary is not
                       reliable until coverage stabilizes."
  >25% errored → -20, "Recent Patrol runs encountered errors
                       (N of M), so the current health summary
                       may be incomplete."
  else         → -10, original light-tier description.

A single transient error stays in grade A; dominant-error periods
drop out of A so the grade matches the warning. Adds two tests
covering both ends of the new tiering and updates the ai-runtime
subsystem contract Current State section.
2026-05-10 19:33:18 +01:00
rcourtman
b44d5892f4 Gate stale-finding auto-resolve on category whitelist
reconcileStaleFindings was auto-resolving any seeded finding that the
LLM didn't re-report in a successful run. The function's own comment
acknowledges the LLM doesn't reliably use patrol_resolve_finding, so
this was built as a cleanup pass — but it cannot tell the difference
between "LLM correctly recognized this is fixed" and "LLM forgot to
re-mention it." For findings that represent discrete events or
persistent states (a backup task that failed, a service that
crashed, a security vulnerability that was found, a configuration
error), absence in a Patrol report is not evidence that the issue
has cleared. The result was bogus auto_resolved → re-detected →
regressed cycles, observed in the wild as "Backup failed" regressing
4× over 6 hours and "Provider analysis error" regressing 271×.
Those bogus auto-resolutions also inflated the trust strip with
fictional auto-resolved credit.

CategorySupportsStaleAutoResolve in findings.go gates the cleanup:
only `performance` and `capacity` findings — continuous current-state
metric thresholds — may be auto-resolved from absence. The other
four categories (reliability, backup, security, general) stay active
until explicitly resolved.

Updates the ai-runtime subsystem contract Current State section with
the whitelist and the adjacent lifecycle dedup rules already landed.

Adds TestReconcileStaleFindings_SkipsNonCurrentStateCategories with
table-driven subtests for all four event/persistent categories, and
TestCategorySupportsStaleAutoResolve to lock in the whitelist.
2026-05-10 19:27:36 +01:00
rcourtman
43760fb0d0 Patrol runs are now stateless — drop prior session history
Patrol's "patrol-main" session was reused across every
scheduled run, so ExecutePatrolStream loaded the full
session history into the agentic loop's input. When any
prior run ended after the model emitted tool_calls but
before all tool results landed (provider error, timeout,
context cancellation), the orphan tool_calls were persisted
and every subsequent run inherited them. The provider then
rejected the conversation with:

  An assistant message with 'tool_calls' must be followed
  by tool messages responding to each 'tool_call_id'.
  (insufficient tool messages following tool_calls message)

Patrol failed 271 consecutive runs with this error before
it was diagnosed. Each new run added another user prompt
on top of the broken structure, so the message slice grew
to 33+ messages with one assistant turn at position 23
holding 4 orphan tool_call_ids and 9 user prompts stacked
after it.

The "patrol runs need a clean slate" comment at line 1898
documents that the knowledge accumulator is freshened per
run; the conversation history was the matching gap.
ExecutePatrolStream now passes only this run's user prompt
to the agentic loop. The session is still written to for
audit/forensics, just no longer fed back into the model.

Live verified: Patrol now completes successfully (18 tool
calls, 3m23s) on a session that previously failed every
run with the malformed-history error. The runtime-failure
finding auto-resolved on this same successful run.

Adds a classifier bucket for "insufficient tool messages"
errors so any future regression in this area surfaces with
a meaningful diagnostic instead of the generic
"Provider analysis error" fallback. New
PatrolFailureCauseMalformedToolHistory cause; predicate
patrolMalformedToolHistory matches DeepSeek's exact phrasing
and OpenAI's similar variants.

ai-runtime contract updated.
2026-05-10 17:14:47 +01:00
rcourtman
3da835c5bc Publish a distribution path for pulse-mcp
The MCP adapter shipped in slice 51 with one install option:
clone the repo and go build. This slice integrates pulse-mcp
into Pulse's existing governed release pipeline so a Pulse
release publishes a pulse-mcp binary alongside the unified agent
and the install scripts that bring it home in one command.

What ships:

  - scripts/build-release.sh extended to build pulse-mcp for
    the same multi-OS matrix as the unified agent, package
    per-platform tarballs and zips, and copy bare binaries to
    RELEASE_DIR for /releases/latest/download/ redirect
    compatibility.
  - .github/workflows/create-release.yml extended to upload
    the bare pulse-mcp binaries plus install-mcp.sh and
    install-mcp.ps1 as release assets.
  - scripts/install-mcp.sh: bash one-line installer that
    detects platform/arch, downloads the matching binary from
    the configured release (latest by default), verifies SHA256
    against the published checksums.txt, places at
    ~/.local/bin/pulse-mcp (or /usr/local/bin if not writable).
    Honors PULSE_MCP_VERSION, PULSE_MCP_BIN_DIR, PULSE_MCP_REPO,
    PULSE_MCP_NO_VERIFY env vars; declines Windows shells with
    a pointer at the .ps1 sibling.
  - scripts/install-mcp.ps1: PowerShell installer for Windows,
    placing pulse-mcp.exe at $LOCALAPPDATA\pulse-mcp.

Documentation aligned:

  - cmd/pulse-mcp/README.md gains an Install section above
    Quick start with three options: one-line installer,
    GitHub Release download, go install. Documents the macOS
    Gatekeeper bypass since v1 is unnotarized by design.
  - The Settings -> API Access agent-integrations panel now
    surfaces the curl|bash command above the config snippet so
    operators see "install pulse-mcp" before "configure your
    MCP client."
  - docs/releases/AGENT_PARADIGM.md drops the "no published
    distribution path" item from "what it does not do yet" and
    documents the Gatekeeper / Homebrew gaps as next-tier
    follow-ups.

Trade-offs surfaced and chosen:

  - Same cadence as Pulse: pulse-mcp ships per Pulse release,
    not on its own track. The MCP server reads the manifest
    from the Pulse it talks to, so version alignment is the
    natural model.
  - No Homebrew tap or core formula in v1. Maintaining a tap
    is real ongoing work; foundation supports adding Homebrew
    later as a layer.
  - No Docker image. Stdio JSON-RPC fights Docker's stdin
    /stdout pattern.
  - No notarization in v1. SHA256 verification through the
    installer preserves the audit trail; README documents the
    Gatekeeper bypass.

Subsystem contract: deployment-installability.md gains
scripts/install-mcp.sh, scripts/install-mcp.ps1, and
cmd/pulse-mcp/ in canonical files (mid-list entries
renumbered) plus a paragraph documenting the new MCP entry
point alongside the existing installer family.

Verification artifacts:

  - scripts/installtests/build_release_assets_test.go gains
    TestBuildReleasePackagesPulseMcpForAllPlatforms which pins
    the build/package/copy wiring and the load-bearing
    install-mcp.sh helpers (platform detection, SHA256
    verification, install-dir resolution).
  - scripts/release_control/render_release_body_test.py gains
    test_agent_paradigm_release_notes_blurb_documents_-
    distribution_path which pins the AGENT_PARADIGM.md draft's
    install-mcp.sh reference and the four-axis frame so a
    future edit cannot regress the install story silently.

Smoke-tested install-mcp.sh locally on darwin-arm64: platform
detection, install-dir resolution, URL building, and 404 error
handling all correct. The full end-to-end install path becomes
live the moment a Pulse release ships pulse-mcp binaries; the
next RC cut will exercise it.
2026-05-10 17:04:49 +01:00
rcourtman
af422ddf2f Verify Patrol now tests the form's pending model, not the saved one
Two real UX bugs in the Verify Patrol panel:

1. The button silently tested the previously-saved model
   instead of the operator's pending dropdown selection.
   Clicking Verify after changing the model would re-run
   preflight against the OLD model and the operator would
   believe the new selection was verified.

2. The result panel rendered the cached green badge even
   when the form's current selection was a different model
   from the one in the cache, with no visual cue that the
   verified result was for something other than what they
   were looking at.

Fixes:

- runPatrolToolPreflight now passes form.patrolModel as the
  model override on the POST. Empty form value falls through
  to the configured shared default on the backend.
- PatrolPreflightControl computes isStaleAgainstFormSelection
  by comparing the bare model name in form.patrolModel to
  the cached result.model. When stale, the panel switches
  to amber tone with a headline that names both models and
  a detail line prompting the operator to click Verify Patrol.
- stripModelProvider helper handles "provider:model" prefix.

Architecture guardrail extended with three assertions
covering the form-aware verify path and the staleness
indicator wiring. frontend-primitives contract updated.

Live verified: changing the dropdown from deepseek-v4-flash
to deepseek-v4-pro before saving switches the panel from
green "Tool calling verified" to amber "Verified result is
for deepseek-v4-flash, your current selection is
deepseek/deepseek-v4-pro · Click Verify Patrol to test the
pending selection."
2026-05-10 16:51:17 +01:00
rcourtman
4b6e409747 Draft a release-notes blurb for the agent paradigm work
Sources cleanly into release announcements without forcing a
fresh draft when the next RC or GA cuts.

Sized as a focused source draft (under 130 lines): one-paragraph
headline, three operator-facing bullets, the four-axis frame
(discover, read, write, push) lifted and condensed from
docs/AGENT_SUBSTRATE.md, an integrators-pointing section that
names the two reference adapters and the canonical contract, an
honest "what it does not do yet" paragraph (no published
pulse-mcp distribution; no real-world consumer feedback yet),
and an audit-trail summary.

Style matches the existing docs/releases/RELEASE_NOTES_v6_*.md
draft pattern: technical, plainspoken, no marketing prose,
honest about scope. No em dashes per project rule.

Lives at docs/releases/AGENT_PARADIGM.md alongside the existing
release-notes drafts so a release author finds it where they
already look. The header explicitly frames it as a source
draft to drop into announcements, GitHub prerelease descriptions,
or a "What Changed" section in a versioned release notes file,
trimmed or expanded as the cut requires.

Contract-neutral commit: docs-only addition, no runtime code,
wire shape, manifest entry, error code, version artifact,
release workflow input, governed metadata, or canonical-files
contract changed. PULSE_ALLOW_CONTRACT_NEUTRAL_COMMIT used with
a documented reason since the deployment-installability
verification-artifact rule wants a Python release-promotion
test touched, which would be wrong scope for a markdown draft.
2026-05-10 16:44:26 +01:00
rcourtman
297556fb65 Surface cached preflight in Patrol tools readiness check
Previously the "Patrol tools" readiness check was static
model-name pattern matching: it told the operator whether
the selected model is on Pulse's tool-capable allowlist,
not whether tools have actually been verified to work.
After today's preflight cache, that's strictly less
information than what we already know.

resolvePatrolToolsCheck now consults
aiService.CachedPatrolPreflight() and grounds the check
in real evidence when a result for the configured
provider+model exists:

  - cached green (success + tool_call_observed) →
    "Tool calling verified <age> against <model>." (ready)
  - cached failure → classified summary + "(last preflight
    <age>)." (not_ready)
  - cached soft warning (model_tool_support_unverified) →
    same with warning status
  - no cache or model mismatch → static fallback

formatPatrolPreflightAge produces stable English ("just
now", "5m ago", "2h ago", "3d ago") with full unit
coverage (11 cases).

HandleGetAISettings now also includes patrol_readiness in
its response — previously only the PUT response carried it,
so the Patrol page only got augmented readiness after a
save. The frontend already had patrol_readiness typed and
read it from useAISettingsState.

ai-runtime, api-contracts, agent-lifecycle (dep), and
storage-recovery (dep) contracts updated.
2026-05-10 16:41:16 +01:00
rcourtman
d4efc08909 Surface agent integrations in Settings → API Access
The agent substrate has been entirely API-side until now. An
operator opening Pulse had no way to know it existed, no way to
see what an agent connected to their instance could do, and no
quick path to a working MCP integration. Slice 59 closes that
visibility gap.

A new AgentIntegrationsPanel renders below the existing
APITokenManager on the API Access tab. The panel:

- Fetches /api/agent/capabilities at mount and renders the
  declared capabilities grouped by category (Context, Operator
  state, Patrol findings, Action governance), each row showing
  name, method+path, scope chip, description, and the stable
  error codes the manifest declares. Adding a capability on the
  backend extends this list automatically; nothing in the panel
  is hardcoded against the substrate's surface.

- Generates an MCP config snippet using window.location.origin so
  the snippet is correct for whichever URL the operator is
  reading from. Includes a copy-to-clipboard button (matching
  the existing CopyCommandBlock pattern) and a brief explainer
  pointing at the right config path for Claude Desktop and
  Claude Code.

- Links to cmd/pulse-mcp/README.md, cmd/agent-probe, and
  docs/AGENT_SUBSTRATE.md so an integrator has the full setup
  story without leaving the panel.

Wiring is one import + one component placement in
APIAccessPanel.tsx. The api tab already concerns "what can be
done with API tokens"; adding the agent surface as a sibling
section under the same tab keeps the operator's mental model
coherent (one place for machine-driven access) and avoids
growing the Settings tab inventory or touching
settingsNavigationModel, the registry, the loaders, or the
routing tests.

Two architectural pins land alongside:

- settingsArchitecture.test.ts gains a guardrail that the
  AgentIntegrationsPanel sits as a sibling section inside
  APIAccessPanel rather than being lifted into its own tab.
  Drift would fragment the agent surface across navigation.
- The frontend-primitives contract documents the
  sibling-panel-over-new-tab pattern for additive operator
  surfaces closely related to an existing tab's intent.
- The security-privacy contract documents the new section's
  presence on the API Access tab and pins that token minting
  still flows through APITokenManager — the new section
  surfaces what tokens unlock, not a parallel auth path.

Verified against the running dev server: the page renders the
four category sections (Context, Operator state, Patrol
findings, Action governance), the live MCP config snippet
contains the deployment's own origin, and the manifest endpoint
serves all 14 capabilities (the 11 substrate capabilities + the
3 action capabilities from slice 58). TypeScript clean across
the frontend.
2026-05-10 16:35:40 +01:00
rcourtman
f74add8271 Auto-seed Patrol preflight cache on Pulse startup
Closes the cold-start gap in the preflight observability
layer: every Pulse restart blanked the cached "last verified"
indicator until the next save or manual click, which meant
operators saw "never verified" on every upgrade or process
restart even when the configured Patrol model was working
fine.

NewAISettingsHandler now reuses the existing
aiSettingsUpdateRequiresPatrolPreflight predicate with a nil
"prior config" — semantically "no in-memory cache yet, just
booted." When the loaded config has assistant enabled and a
Patrol model, the handler dispatches the same async
TriggerPatrolPreflightAsync the save path uses. Routine
boots where assistant is disabled (or no Patrol model is
selected) skip the dispatch so we never write a misleading
"Pulse Assistant is not enabled" entry into the cache.

Live verified: after Pulse restart, /api/settings/ai
surfaces a fresh patrol_preflight with success=true within
~6s of boot, no operator action required.

Predicate test extended with two named cases that document
the dual-purpose use (startup seed + skip-when-disabled).
ai-runtime, api-contracts, agent-lifecycle (dep), and
storage-recovery (dep) contracts updated.
2026-05-10 16:34:34 +01:00
rcourtman
33b6bf97d1 Auto-trigger Patrol preflight when settings save moves Patrol transport
Closes the resilience gap: today an operator picks a Patrol
model, saves, and Patrol silently fails on its first
scheduled run because nothing verified the model actually
calls tools. Now the save handler dispatches a one-shot
preflight in the background whenever the change actually
moved Patrol's transport, and the cached result surfaces on
the next /api/settings/ai poll via patrol_preflight.

Trigger conditions (aiSettingsUpdateRequiresPatrolPreflight):
  - new config has assistant enabled AND a Patrol model
  - AND any of:
      * no prior config (first save)
      * assistant was disabled, now enabled
      * Patrol model changed
      * API key for the new patrol model's provider changed

Routine saves that don't touch Patrol transport (theme,
control level, discovery toggles, unrelated provider keys)
skip preflight entirely so they don't burn provider tokens
or add 5-10s latency to every save.

Service-level TriggerPatrolPreflightAsync runs the call in a
goroutine with a 30s timeout. Detection helper has full
unit coverage including the negative paths.
2026-05-10 16:14:26 +01:00
rcourtman
728c42e47b Bring action endpoints onto the agent surface with the agent-stable envelope
Closes the last known gap in the agent substrate. The three
action endpoints (POST /api/actions/plan, /api/actions/{id}/decision,
/api/actions/{id}/execute) previously emitted the platform-wide
APIError shape (stable code under "code", human under "error").
The agent surface uses the inverted shape (stable code under
"error", human under "message"), so adding action capabilities
to the manifest as-is would have forced agents to remember which
envelope each capability uses.

The slice refactors actions.go to emit the agent-stable envelope
across all 42 writeErrorResponse call sites. writeJSONError gains
a writeJSONErrorWithDetails sibling so the 13 calls that pass
field-level reasons (validation failures) preserve that
information under a new optional `details` field. The action
endpoints' JSON shape becomes:

  {"error": "<stable_code>", "message": "<human>",
   "details"?: {"<field>": "<reason>"}}

Frontend impact: zero. Verified that no frontend code consumes
the three action endpoints; the refactor is API-only.

Three new manifest entries (plan_action, decide_action,
execute_action) under a new "action" category, with their
declared error codes pinned per capability. Internal-failure 5xx
codes (audit-store outages, encode failures) are not declared
per capability; agents branch on 5xx generically.
TestContract_AgentSurfaceErrorCodesMatchManifestDeclarations now
audits actions.go alongside the existing two handler files, with
a documented internal-only allowlist for the 5xx codes.

The TestAgentSubstrate_ActionEndpointsEmitAgentStableEnvelope e2e
test exercises one error path through each endpoint via the actual
HTTP boundary, asserting the agent-stable envelope reaches the
wire and the legacy APIError fields (code, status_code, timestamp)
do NOT — drift back would mean the refactor regressed.

The TestContract_ActionDryRunOnlyExecutionErrorJSONSnapshot pin
is updated to match the new envelope shape; the manifest's
category allowlist gains "action".

api-contracts.md documents the new envelope (with details map),
the action governance loop's place in the substrate, the
ai:execute scope distinction from monitoring:write, and the
"manifest projection has a footnote" trade-off: bringing an
existing endpoint into the agent surface may require migrating
its error envelope, but the substrate keeps a single envelope
contract rather than carrying a translation wrapper layer.
agent-lifecycle.md and storage-recovery.md document the action
surface joining the agent paradigm and its zero-new-persistence
posture respectively. AGENT_SUBSTRATE.md's "what it does not do
yet" no longer lists the action surface; it now reflects the
real outstanding items (consumer feedback, an in-Pulse agent
integrations panel, a distribution path for pulse-mcp).
2026-05-10 15:16:17 +01:00
rcourtman
add1096ec8 Cache Patrol preflight outcome and hydrate UI on settings load
The Verify Patrol button reset its result to empty on every
page load — the operator had to re-click to see the verified
state, even though nothing had changed. This commit adds the
observability layer of the auto-preflight plan: every
RunPatrolToolPreflight result is now cached on the AI Service
and surfaced through /api/settings/ai as patrol_preflight, so
the inline result panel rehydrates on page load with the
most-recent outcome and a "last verified Xs ago" indicator.

Backend: patrolPreflightCache (mutex-guarded) on Service with
defensive-copy CachedPatrolPreflight() accessor; every
RunPatrolToolPreflight branch (success, soft warning, classified
failure, validation early-return) records into the cache.
PatrolPreflightSnapshot projects the cached result onto the
AI settings response. Tests cover both success-then-failure
supersession and the defensive-copy invariant.

Frontend: PatrolPreflightSnapshot type mirrors the wire shape;
hydratePatrolPreflightFromSettings(data) projects the snapshot
into the same response shape the manual button writes;
loadSettings and updateSettings flows call it. The result
panel renders a "last verified Xs ago" line under the
provider/model row when recorded_at_unix is present.

End-to-end smoke verified against deepseek-v4-flash: panel
rehydrates as green "Tool calling verified · last verified
just now" after page reload.

Auto-preflight on save (the trigger half of the resilience
plan) follows in the next commit.

Contracts: ai-runtime, api-contracts, agent-lifecycle (dep),
storage-recovery (dep), frontend-primitives all updated to
reflect the new patrol_preflight surface and hydration
contract. Verification artifacts: settingsArchitecture +
patrolPreflight client tests.
2026-05-10 14:58:26 +01:00
rcourtman
c01282e8c5 Add Verify Patrol button to Assistant & Patrol settings
Wires the new POST /api/ai/patrol/preflight endpoint into
the Settings UI. The button sits beside the Patrol
Verification Model picker so the verification action lives
where the model is selected.

Three result tones rendered inline:
- green: provider call succeeded and the model emitted a
  tool call (the fully-verified state)
- amber: provider accepted the request but the model did
  not call the tool (cause=model_tool_support_unverified —
  Patrol may still work, recommend a real run)
- red: classified failure surface from the runtime classifier
  (tool_choice_rejected, no_tool_capable_endpoint, model
  unavailable, auth, billing, etc) with summary, recommendation,
  and the resolved provider/model

Distinct from the existing "Run Preflight" button on the
Provider Configuration card, which only fans out per-provider
TestConnection calls. The copy under the new button calls
that distinction out so operators don't conflate them.

Backend wiring stays untouched — the action goes through the
typed runPatrolPreflight client added in e26a57a15. Settings
architecture guardrail extended to assert the wiring.
2026-05-10 14:40:13 +01:00
rcourtman
e26a57a157 Add POST /api/ai/patrol/preflight tool-call verification
The existing per-provider /api/ai/test endpoints only call
ListModels — they pass for every provider that returns a
catalog, even when Patrol fails 100% of runs because tools
aren't actually wired up. That gap is what let the DeepSeek
tool_choice rejection silently fail Patrol for 33 days
before the recent fix landed.

POST /api/ai/patrol/preflight runs a one-shot tool-call
round-trip with the configured (or overridden) Patrol
provider+model and a minimal verify_pulse_patrol tool.
Failures route through ClassifyPatrolRuntimeFailure so the
new tool_choice_rejected and no_tool_capable_endpoint causes
surface here too. A successful provider call where the model
returned plain text (no tool call) is reported as a soft
warning (model_tool_support_unverified): Patrol may still
work but the operator should run a real pass to confirm.

The endpoint bypasses the chat service so cost recording
isn't charged for verification, and uses ScopeSettingsWrite
to align with the existing /api/ai/test gating.

Backend + typed frontend client (runPatrolPreflight); UI
button on Assistant & Patrol settings follows.

Contracts updated:
- ai-runtime: completion obligation extended to cover the
  new verification surface
- api-contracts: payload shape (tool_call_observed,
  duration_ms) noted in obligations
- agent-lifecycle, storage-recovery: dependent-extension
  acknowledgment that ai-runtime owns the new route despite
  it living under internal/api/
2026-05-10 14:30:41 +01:00
rcourtman
818221b457 Translate Pulse SSE events into MCP notifications when opted in
Closes the documented limitation in slice 51's pulse-mcp: MCP
clients that process server-initiated notifications can now
react to Pulse's push channel without holding a separate HTTP
connection to /api/agent/events.

The bridge is opt-in via --emit-notifications because not every
MCP client surfaces arbitrary notifications/* methods (Claude
Desktop, today, does not). Autonomous agents that consume the
JSON-RPC stream programmatically benefit; UI-mediated clients
should keep the flag off and use the SSE stream directly.

Implementation: a long-lived goroutine, started once after the
first initialize, that opens /api/agent/events, parses the
substrate's wire format, and emits a JSON-RPC notification
per non-transport event. Method names mirror the SSE event
kinds (notifications/finding.created, notifications/approval.
pending, notifications/action.completed). Params is the SSE data
payload verbatim so agents see the same wire shape an HTTP SSE
consumer would. stream.connected and heartbeat are filtered as
transport plumbing. The consumer reconnects with capped
exponential backoff on transient errors.

When --emit-notifications is on, initialize advertises the
supported event kinds under
capabilities.experimental.pulseNotifications.kinds. Clients that
don't understand the experimental block ignore it silently.

Three tests pin the behaviour: the initialize handshake's
capability block is correctly gated on the flag; the notification
filter rejects transport events and accepts the three substrate
kinds; an httptest.NewServer-backed end-to-end translates a
multi-event SSE stream into JSON-RPC notifications with the
substrate's payload preserved.

Also flagged in AGENT_SUBSTRATE.md "what it does not do yet": the
action-execution endpoints (/api/actions/plan, decision, execute)
emit a different error envelope from the agent surface (APIError
with stable code under "code") versus the agent-stable shape
(stable code under "error"). Adding them to the manifest
requires resolving that mismatch first; recorded as a focused
slice for whenever the substrate's reach extends to direct
agent-driven execution.
2026-05-10 14:19:44 +01:00
rcourtman
f2d9d2aba8 Split overgreedy "tools not supported" classifier into three causes
The Patrol runtime classifier collapsed three distinct upstream
conditions into one misleading "Selected model does not support
Patrol tools" message:

  1. Provider rejected the *value* Pulse sent for tool selection
     (e.g. DeepSeek's "deepseek-reasoner does not support this
     tool_choice" — the model accepts tools, just not the forced
     coercion). The DeepSeek fix in 46145df9 dodges the symptom by
     coercing to auto, but the original misclassification pointed
     operators at the wrong remediation for 33 days.
  2. Provider has no tool-capable endpoint available for the
     selected model (OpenRouter's "No endpoints found …" surfaces
     this when account-level provider/data filters exclude every
     tool-capable route).
  3. Model truly lacks tool calling (the literal "tools are not
     supported" / "tool calling" cases).

Each now has its own PatrolFailureCause, title, summary,
description, and recommendation. summarizePatrolRuntimeFailureDetail
mirrors the split. Helper predicates patrolToolChoiceValueRejected
and patrolNoToolCapableEndpoint encapsulate the substring matching.

The OpenRouter "No endpoints found" test fixture now correctly
classifies as no_tool_capable_endpoint instead of
model_unsupported_tools — fixture updates in
patrol_runtime_failure_test.go, patrol_assistant_handoff_test.go,
and ai_handler_test.go reflect the more accurate diagnostic.
New tests cover the tool_choice_rejected and generic
model_unsupported_tools paths explicitly.

The ai-runtime contract is updated to note the classifier-split
obligation alongside the existing transport-shape obligation.
2026-05-10 14:10:18 +01:00
rcourtman
46145df925 Coerce DeepSeek tool_choice to "auto" so Patrol stops failing
DeepSeek's API server-side aliases deepseek-v4-flash and
deepseek-v4-pro to deepseek-reasoner, which rejects forced
tool_choice with HTTP 400 ("deepseek-reasoner does not support
this tool_choice"). Pulse's classifier then surfaced this as
"Selected model does not support Patrol tools," misdirecting
diagnosis to the model rather than the request shape.

supportsForcedToolChoice now returns false for any DeepSeek
client, so every DeepSeek model falls back to tool_choice
"auto" regardless of how DeepSeek routes the requested ID.
The ai-runtime contract is updated to match: the
provider-transport boundary now coerces forced tool_choice for
every direct DeepSeek model ID, not only unknown ones.

Patrol verified end-to-end: 20 tool calls, 9 findings, prior
runtime failure auto-resolved.
2026-05-10 00:04:13 +01:00
rcourtman
6eb5bca06b Add docs/AGENT_SUBSTRATE.md as the arc's session marker
A single short file you can reach for in three weeks (or hand to
anyone wiring into Pulse from the outside) and orient yourself on
what shape the agent substrate took without re-reading the
contract subsystem. Plain English, four-axis frame (discovery,
depth, breadth, write, push), two-consumer summary
(agent-probe and pulse-mcp), an explicit "what it does not do
yet" section, and pointers into the formal contract docs and
implementation files.

Sized for release notes / GitHub announcement reuse, not as a
contract surface itself. The contract still lives in
docs/release-control/v6/internal/subsystems/api-contracts.md;
this file is the friendly door to it.
2026-05-09 23:16:16 +01:00
rcourtman
2f0468a87b Verify SSHSIG on in-app update artifacts
The unattended timer (scripts/pulse-auto-update.sh) and the public bootstrap
(scripts/install.sh, /install.sh) all verify the .sshsig sidecar against the
pinned pulse-installer ed25519 key before trusting a release artifact. The
in-app updater verified SHA256 only — same artifact, same root execution
context, lower trust bar. Closing the asymmetry: the in-app tarball download
in ApplyUpdate, adapter_installsh.go's install.sh download (piped into bash
as root), and the rollback binary download now fetch and verify the .sshsig
sidecar against the same pinned key, fail-closed.

The signing infrastructure (release_asset_common.sh, validate-release.sh,
backfill-release-assets.sh) already produces and validates these signatures
for every release; this teaches the Go updater to honor what the shell paths
have always required. ssh-keygen is shelled out to so the in-app updater
shares the exact trust path used by the unattended path, with a package-level
function variable for test injection so unit tests don't require ssh-keygen
on the build host.

Extends the deployment-installability contract's release-trust-fail-closed
invariant to cover the in-app updater paths.
2026-05-09 23:14:07 +01:00
rcourtman
3a502fefda Publish the cmd/pulse-mcp integration guide
Slice 51 added the MCP adapter as a worked example. This makes it
a published surface: an external maintainer who wants to wire
Pulse into their Claude Desktop or Claude Code can read one
README and have it working without spelunking through main.go.

The guide carries:
- Build and install instructions
- Canonical config snippets for Claude Desktop and Claude Code
- The env-var contract (PULSE_API_TOKEN, configurable name,
  always read from env so it stays out of process listings)
- The published tool list grouped by category (context,
  operator-state, finding) with what each does
- The stable error envelope shape and the difference between
  capability-specific codes and cross-cutting auth codes
- Documented limitations: no subscribe_events, manifest fetched
  once, tools-only (no resource URIs)
- Troubleshooting for the common failure modes (missing token,
  proxy gating discovery, missing write scope)

api-contracts.md now points readers at the README as the
canonical integration entry point so the contract doc keeps
its in-repo focus and the README owns the user-facing copy.
2026-05-09 23:11:54 +01:00
rcourtman
eeb2975d22 Stability sweep on the agent-substrate arc
Three things landed:

1. /api/agent/capabilities was missing from publicPathsAllowlist
   in router_public_paths_inventory_test.go. Slice 47 added the
   path to publicPaths in router.go and to publicRouteAllowlist
   in route_inventory_test.go but missed this second mirror,
   which scans publicPaths via go/ast. The test was failing on
   origin; this commit closes the gap.

2. The error-envelope paragraph in api-contracts.md now
   distinguishes capability-specific stable codes (the closed
   set declared per capability in the manifest) from
   cross-cutting codes the multi-tenant / auth middleware
   emits universally (invalid_org, org_suspended, access_denied).
   The previous wording implied all stable codes lived in
   per-capability errorCodes lists, which would have forced
   duplication on every capability or misled agents about which
   codes to expect.

3. New contract pin TestContract_AgentSurfaceErrorCodesMatch-
   ManifestDeclarations enforces the symmetry both directions:
   every code emitted by an agent-surface handler must be either
   declared in the matching capability or be one of the three
   cross-cutting codes; every manifest-declared code must have a
   matching emission. Drift either way is a contract regression.
   Pin verified clean against the current handler set.

Stale forward-reference fixed: the capabilities paragraph no
longer says "future MCP-server slices read the manifest" — slice
51 already shipped that adapter.

Sweep also surfaced two failures in internal/mock/ from
unrelated platform-support drift (unraid token set added in
ac82a2852 but the mock contract test wasn't updated). Those are
not part of the agent-substrate arc and not mine to fix; flagged
in the closing summary so they don't get lost.
2026-05-09 23:04:22 +01:00
rcourtman
d6a68f8044 Add cmd/pulse-mcp — MCP adapter wrapping the agent substrate
The whole point of slice 39's hand-authored manifest with
snake_case names and stable error codes was to make adapter
projection cheap. This slice is the test: a minimal MCP (Model
Context Protocol) server that turns Pulse's manifest into a tool
surface Claude Desktop, Claude Code, and other MCP-speaking
clients can drive natively.

Every MCP tool is a one-line projection of a manifest capability.
Input schemas are auto-derived from path placeholders ({name}
segments become required string properties) and method (non-
GET/DELETE tools accept a free-form body object). Adding a
capability to the manifest automatically extends the tool surface
— no MCP-side changes required.

The adapter is stdlib-only, runs over stdio with line-delimited
JSON-RPC 2.0 framing, preserves Pulse's stable error envelope
verbatim through MCP's content-and-isError result so agents on
the MCP side branch on the same codes they would on the wire,
and skips subscribe_events (SSE streaming doesn't fit the
request/response tool shape; future slices can layer it as MCP
notifications).

Eleven tests pin the projection rules and the JSON-RPC contract:
path-placeholder schema generation, body-property method gating,
substitution failures producing stable errors, the initialize
handshake advertising tools, tools/list filtering subscribe_events,
tools/call proxying with the bearer token and preserving the
substrate's error envelope, unknown methods producing JSON-RPC
method-not-found, and notifications producing no response.

The substrate is now wrapped in two adapters, each demonstrating a
different consumer profile: agent-probe walks the substrate as an
HTTP client (slice 49); pulse-mcp wraps it for stdio MCP clients.
Both depend only on the standard library and resolve paths from
the manifest, so the substrate is the single source of truth and
adapter additions stay cheap.
2026-05-09 22:51:37 +01:00
rcourtman
8aa22d0605 Surface action verification on the action.completed SSE payload
Closes the certainty loop for agents watching the substrate's push
channel. The action audit's read-after-write probe outcome was
already persisted on the audit record, but agents watching
action.completed only learned "the action ran" — they had to fetch
/api/actions/{id} to know whether the read-back probe confirmed
the intended state. That defeated the substrate's
push-notification guarantee for dispatch certainty.

The new agent-stable AgentResourceActionVerification projection
(ran, success, command, note, ranAt — output stays in the audit
record, deliberately omitted from events to keep payloads small)
is now carried on both:

  - the action.completed SSE payload, projected from
    record.Result.Verification by the router-side bridge in
    wireAIChatDependenciesForService, and
  - the resource-context bundle's recentActions surface, via the
    same shared projectAgentResourceVerification helper

so the bundle (depth) and the doorbell (push) speak the same
vocabulary. Refused-before-dispatch failures omit verification
(the probe never runs) so agents branch on field presence to
distinguish "no probe attempted" from "probe ran with empty
result". Three contract pins lock the symmetry: payload field
present, router bridge populates it, bundle parallels.

The capabilities manifest's subscribe_events description now
mentions the verification block so external agents discover the
field through the same path they already use to learn the rest
of the agent surface.
2026-05-09 22:45:15 +01:00
rcourtman
956646a5c1 Add cmd/agent-probe — worked example consuming the agent substrate
The substrate's read and write surfaces are end-to-end-tested
internally; this slice answers the harder question — "is the
substrate actually usable from the outside?" — by writing the
smallest standalone program that consumes it. agent-probe walks
the discovery → triage → depth → push flow against a running
Pulse instance using only the Go standard library, so it doubles
as a reference implementation for anyone building MCP servers,
Claude Code integrations, or custom agents on top of Pulse.

It resolves every path from the manifest rather than hardcoding
them — if discovery moves a path, the probe follows
automatically — and branches on the stable error envelope's
"error" code field, never on human-readable messages. The focus
rule (severity-lex-ordered) is intentionally simple so a reader
can predict what the probe will pick; real agents will have
richer policies.

This is documentation as code: the program is short enough to
read top-to-bottom and reads like the agent's own narration of
what it's doing. The unit test pins the focus rule's lex
ordering so a refactor that swaps it for a weighted score (which
allowed many warnings to outrank one critical) cannot regress
silently.
2026-05-09 22:28:00 +01:00
rcourtman
5156c03eed End-to-end test the operator-state write loop through HTTP
Closes the e2e contract proof on the write side. The only write
capability the manifest declares is the operator-state intent
loop (set / get / clear), and this test boots the full router
stack to walk every state of that loop through the actual HTTP
boundary — proving the manifest's declared error codes for
set_operator_state and get_operator_state reach the wire from
the handlers, the URL canonical id authoritatively wins over
body-supplied ids (no scope-confusion writes), and SetAt/SetBy
are server-populated so attribution cannot be spoofed.

The flow exercised:
  GET unset → 404 operator_state_not_set
  PUT valid → 200 with persisted state + server SetAt
  GET → round-trips
  PUT invalid criticality → 400 operator_state_invalid
  DELETE → 204
  GET → 404 operator_state_not_set (loop closed)
  DELETE again → 204 (idempotent)

Two contract pins lock the audit-honesty and error-token
contracts so a future refactor of the handler can't silently
regress either: SetAt/SetBy populated server-side, URL-id wins
over body-id, and the validator's domain error maps to the
stable wire token via errors.Is rather than message-matching.

Together with the read-side e2e (slice 47), the agent surface —
read, write, push — has now been exercised end-to-end as one
substrate.
2026-05-09 22:22:47 +01:00
rcourtman
8cf15fe639 End-to-end test the agent substrate's discovery → triage → depth flow
The unit tests cover each piece in isolation; this test boots the
full router stack and proves the discovery → triage → depth chain
works as one substrate through the actual HTTP boundary an
external agent would hit. It found two real bugs slice 40
introduced and slice 45/46 didn't surface:

- /api/agent/capabilities was documented as unauthenticated but
  was missing from the router's publicPaths list, so the global
  auth middleware was 401'ing the discovery manifest. Fixed by
  adding the path to publicPaths and pinning the contract so it
  cannot regress.

- The error-envelope shape across the agent surface is
  {"error": "<stable_code>", "message": "<human>"}, written via
  writeJSONError — not the {"code": ...} shape I had assumed in
  the docs. Pinned the wire shape on api-contracts.md so the
  documented error contract matches what writeJSONError actually
  writes.

The e2e test exercises capabilities discovery, triage via
fleet-context, and depth via resource-context with an unknown id
to confirm the resource_not_found stable error code reaches the
wire under the canonical "error" key. The subscribe_events SSE
path is probed unauthenticated to confirm it's gated (401) rather
than 404 — discovery's claim is honest.
2026-05-09 22:16:32 +01:00
rcourtman
a168215f6a Add /api/agent/fleet-context for org-wide triage in one read
The substrate had a per-resource bundle but no fleet view, so
"where do I focus?" forced agents to walk every resource id and
bundle each — O(N) round trips that scale with fleet size. The
fleet endpoint returns a thin per-resource rollup in a single
read: identity, operator-intent flags (intentionallyOffline,
neverAutoRemediate, maintenanceWindowActive), per-severity
finding counts, and pending-approval count.

Same auth scope and same provider wiring as the per-resource
bundle — operator-state via the canonical unified store, findings
via AgentFindingsProvider, approvals via AgentApprovalsProvider —
so the fleet sweep is the per-resource bundle's wiring multiplied
by N with no new dependencies. Audit reads are deliberately
omitted from the rollup; agents that want depth on a flagged
resource follow up via /api/agent/resource-context/{id}.

The capabilities manifest declares get_fleet_context with
AgentFleetContext as the response shape so external agents
discover the triage entry point through the same path they
already use to learn the rest of the agent surface.
2026-05-09 22:08:50 +01:00
rcourtman
d8f6b1e508 Bundle pending approvals into the agent resource-context endpoint
The substrate's "everything an agent needs in one read" guarantee
covered identity, operator state, findings, and recent actions but
forced a separate /api/approvals call for pending governance
state. AgentResourceContext now carries pendingApprovals as a
lightweight AgentResourceApprovalSummary projection — same
vocabulary as approval.pending SSE events, so the doorbell and
the bundle agree on shape. AgentApprovalsProvider is the parallel
seam to AgentFindingsProvider; the router wires a closure that
resolves approval.GetStore() at request time, scopes via
BelongsToOrg, and filters by CanonicalResourceID so cross-tenant
or cross-resource pending requests don't leak. Empty arrays
preserve the iteration-safe contract the existing sections
already follow.
2026-05-09 22:00:53 +01:00
rcourtman
7fe9b1c492 Use cursor-help on TagBadges hover-only +N indicator
The "+N" overflow indicator on TagBadges was styled with
`cursor-pointer`, which signals a clickable affordance — but the
element only listens for mouseenter/mouseleave to show a tooltip and
has no click handler. Switch to `cursor-help` so the cursor matches
the actual interaction (hover for more info), avoiding a phantom
click expectation.
2026-05-09 21:52:34 +01:00
rcourtman
52669128e6 Drop redundant policy gates in resource-link routing
Tail of the operator-local-UI redaction sweep (abdde303a, a17f879a1).

resolveKubernetesContextForResource gated on requiresGovernedResourceDisplay
to choose between getPreferredInfrastructureDisplayName and a manual
displayName-or-name fallback. Both branches produce a raw infra name
once we trust that displayName never carries a redacted summary in
local rendering, so the gate is dead complexity. Collapse to a single
call and drop the now-unused requiresGovernedResourceDisplay import.

problemResourcePresentation.getProblemResourceDisplayName has no
production consumers today, but it still routes through the governed
helper. Reclassify it now (same as every other operator-local helper)
so the rule is consistent across the codebase if the surface ever gets
adopted.
2026-05-09 21:31:45 +01:00
rcourtman
51c5d344ce Plumb operator-state and operational memory into investigation findings
Closes the "has context vs uses context" gap that defines Pulse's
agent-paradigm differentiation. The orchestrator (in pulse-pro) used
to receive a Finding with no awareness of the operator's
commitments — Patrol could investigate a resource the operator had
marked never-auto-remediate and propose a restart fix that the
action broker would refuse downstream. The proposal shouldn't have
happened in the first place.

Adds two optional fields to aicontracts.Finding:

- OperatorContext: intentionally offline, never auto-remediate,
  maintenance window with computed active flag, criticality, note.
  Populated in MaybeInvestigateFinding from the same operator-state
  projection the suppression hot path consumes, so investigation
  reasoning and suppression behavior cannot drift apart.
- OperationalMemory: regression count, previous resolved fix
  summary, last regression timestamp, times raised. Populated in
  ToCoreFinding from fields the internal Finding already carries.

ResourceOperatorStateProjection grew a NeverAutoRemediate field —
the investigation read path needs it (so the orchestrator can avoid
proposing fixes the broker would refuse) even though the
suppression hot path doesn't. Same projection serves both reads.

Both fields are nil when there's no signal (fresh finding, no
operator state) so the orchestrator branches on absence rather
than parsing zero-valued structs. The pulse-pro orchestrator
consumes the fields in a separate slice; this slice ships the
in-repo half of the data path.
2026-05-09 21:03:15 +01:00
rcourtman
94bfd48a9d Add /api/agent/events SSE stream for real-time agent notifications
Third slice on the agent-paradigm pivot, closing the substrate
triangle (discovery + bundled reads + push). Agents subscribe once
to a long-lived SSE connection and receive real-time events instead
of polling: finding.created when a new finding is raised, heartbeat
every 15 seconds for keepalive. Each event carries a monotonic ID so
agents can dedupe and reason about ordering across reconnects.

The broadcaster fan-outs to multiple subscribers and drops events
for slow consumers rather than blocking the publish path —
publishers cannot stall on consumer slowness. The findings-runtime
hook in router.go publishes finding.created when the finding is new
AND not auto-dismissed by operator-state suppression (operator
already said to stay quiet about that resource); patrol-cycle
re-detection of existing findings doesn't fire the event.

Capabilities manifest declares the stream under subscribe_events so
external agents discover it through the same channel as the REST
surface. SSE chosen over WebSocket because it's simpler, works
through every HTTP proxy without special-casing, and matches the
existing deploy_handlers pattern; agents that need bidirectional
comms call REST endpoints in parallel.

Tests pin the broadcaster's pub/sub semantics (fan-out, unsubscribe,
slow-consumer drop, monotonic IDs), the SSE handler's stream
contract (text/event-stream, no-cache, X-Accel-Buffering=no), and
the connected/published-event delivery via httptest.NewServer. A
contract test pins the publish-gate semantics so operator-state
suppression and stream notifications stay aligned.
2026-05-09 20:13:31 +01:00
rcourtman
71797f9b21 Add /api/agent/capabilities discovery manifest for agent integrations
Second slice on the agent-paradigm pivot: the discovery document any
external agent (Claude Code, custom integrations, future MCP servers)
needs to learn what Pulse exposes. Each capability declares its
agent-stable name (snake_case), description, category, REST surface,
required scope, response shape, and the closed set of stable error
codes the response may carry. Agents branch on the codes
(operator_state_invalid, resource_not_found, etc.) rather than
parsing human messages.

The manifest is hand-authored, not auto-generated, because the
contract decisions (what's agent-stable, which categories, which
error codes) are product-shaping and must not drift behind code
changes. Adding a capability is a deliberate "this is part of the
agent surface" commitment.

v1 surface includes: get_resource_context (substrate from slice 39),
get/set/clear_operator_state (slice 30), and the finding-lifecycle
actions (acknowledge, snooze, dismiss, resolve). Action-broker
capabilities are not in v1 because they go through approval flow,
not direct dispatch — those need their own contract design.

Tests pin: stable shape, version contract, unique-and-snake_case
names, every capability has method/path/scope, closed category set,
required error codes for the most consequential capabilities. The
manifest is unauthenticated and cacheable (5min); the underlying
capabilities keep their own auth scopes.
2026-05-09 19:52:17 +01:00
rcourtman
14f9270a5e Add /api/agent/resource-context/{id} substrate endpoint for agents
First slice on the agent-paradigm pivot: instead of building more
human-glance UI, expose substrate that any agent (in-process Patrol,
external Claude Code, future MCP-driven setups) can consume in one
read. The endpoint returns the full situated picture of a resource —
identity, operator-set state with server-computed
maintenanceWindowActive flag, active findings as a lightweight
seven-question-schema projection, and recent action audits with
refusal tokens (resource_remediation_locked:, plan_drift:) preserved
verbatim for agent branching.

Substrate is the right shape here: an agent reasoning about a
resource gets everything it needs without chaining four or five calls,
and the projection types decouple agent-stable wire shape from
internal type evolution. Active findings flow through an
AgentFindingsProvider adapter wired in router.go from the patrol
service, keeping the api package free of an internal/ai import.

Always-array fields (activeFindings, recentActions) and
omitempty-on-absent (operatorState) give agents stable iteration and
clean field-presence branching. AgentContextHandler owns the agent
surface as its own type so it evolves independently of resource CRUD.
Each test pins a specific contract: identity round-trip, operator-state
projection with computed flag, empty-state shape, refusal-token
preservation, 404 shape, method gating.
2026-05-09 19:46:16 +01:00
rcourtman
3f783a7d98 Distinguish auto-suppressed dismissals from manual operator dismissals on the row
Both serialize as DismissedReason="expected_behavior" but tell
different stories — the auto-dismiss is Pulse staying quiet because
the operator scheduled maintenance or marked the resource
intentionally offline, while the manual dismiss is the operator
deciding the finding is expected. The lifecycle metadata
(operator_state_cause) already distinguishes them; this slice surfaces
the distinction on the FindingsPanel row.

Adds getOperatorStateDismissCause and formatOperatorStateDismissCauseLabel
helpers (the TS mirror of findOperatorStateDismissCause from slice 37).
Renders an "auto: maintenance" / "auto: intentionally offline" badge
next to the existing dismissed-reason badge when the lifecycle scan
finds operator_state_cause on the most-recent dismissed event.

The newest-first scan order is the same as the Go side: a manual
dismissal that supersedes an earlier auto-dismiss reports as manual,
so a stale auto-dismiss cause does not falsely badge the row after
an operator override.
2026-05-09 19:10:27 +01:00
rcourtman
eae8ca2a68 Wake operator-state-suppressed findings when the suppression lifts
Real product gap exposed by closing the operator-state feature: a
finding auto-dismissed while a maintenance window covered `now` would
stay dismissed forever even after the window ended. Same for findings
auto-dismissed under IntentionallyOffline once the operator cleared
the flag. The time-bounded suppression silently became permanent.

Adds a third wake condition to the dismissed-branch in
FindingsStore.Add: when a finding's most recent dismissed lifecycle
event carries operator_state_cause metadata, AND the provider reports
no current suppression for the resource, clear the dismissal and emit
a suppression_lifted lifecycle event naming the previous cause.

Manual operator dismissals (no operator_state_cause) are unaffected —
the findOperatorStateDismissCause helper stops at the first dismissed
event when scanning newest first, so a manual dismissal that
supersedes an earlier auto-dismiss is not falsely re-awakened. Tests
cover both signal types, the manual-dismissal isolation, and the
helper's newest-first scan order.
2026-05-09 18:15:32 +01:00
rcourtman
b822ef17e6 Pin: operator-state-suppressed findings skip autonomous investigation
Cross-slice contract worth making explicit: when slices 31/32
auto-dismiss a finding because the operator's per-resource state
suppresses it, that finding must not also burn investigation budget.
The existing chain already delivers this — findings.Add sets
DismissedReason="expected_behavior", and ShouldInvestigate gates on
DismissedReason != "" — but the relationship was implicit. Without a
test, a future refactor of either branch could silently start
investigating operator-suppressed findings again.

Pins the contract with a table-driven test covering both signals
(intentionally_offline and maintenance_window) at every autonomy
level (approval/assisted/full), plus the lifecycle-cause metadata
attribution. No runtime change — only the test and a contract
paragraph naming the dependency.
2026-05-09 18:05:13 +01:00
rcourtman
bfa7ba3b58 Add maintenance-window scheduler to the operator-overrides section
Closes the per-resource operator-state feature: an operator can now
schedule a maintenance window directly from the resource detail
drawer, with HTML5 datetime-local inputs, 1h/4h/24h quick presets, an
optional reason field, and client-side validation that end > start.
The scheduled window flows through the existing
/api/resources/{id}/operator-state PUT path so the action broker and
findings runtime see it on the next dispatch and re-detection
respectively.

Distinguishes a future-scheduled window (start > now) from an active
one (now within [start, end)) so the operator sees "scheduled" before
the window opens, "active" while it covers now, and clean state once
it ends. Scheduler saves preserve the toggle state and toggle saves
preserve the window so the two facets stay decoupled — editing one
does not lose work on the other.

Also exposes Edit window and Cancel window controls in the compact
view so operators can adjust or clear an existing window without
re-entering its details.
2026-05-09 15:50:23 +01:00
rcourtman
7617795bdc Surface operator-set per-resource state on the resource detail drawer
Frontend wedge for the per-resource operator-state feature: an operator
working on a resource can now toggle Intentionally offline and Never
auto-remediate without curling the API. The section lives on the
overview tab next to the action audit history so the "what overrides
has the operator set" and "what actions has Pulse taken" stories read
together.

NeverAutoRemediate is a safety override — flipping it on requires an
explicit confirmation prompt naming what the lock means, while flipping
it back off is permissive (releasing a lock is the recoverable action).
Maintenance windows are surfaced read-only this slice; scheduling lives
in a follow-up that owns the date-picker UX.

Uses createNonSuspendingQuery rather than createResource so the
drawer's parent Suspense boundary does not flicker the page-level
"Loading view..." fallback while operator state is in flight. The save
path preserves any currently-persisted maintenance-window data so this
toggle slice does not clobber the future window-scheduler slice.

Adds a TS API client (getResourceOperatorState /
setResourceOperatorState / clearResourceOperatorState) that mirrors
the canonical Go shape from slice 30, with 404 -> null normalization
on the GET path so callers see "no state" as a clean default.
2026-05-09 15:20:53 +01:00
rcourtman
e1103bf119 Refuse action dispatch when resource is operator-locked against remediation
Third compounding slice on per-resource operator state, completing the
suppression triangle: maintenance window (slice 31), intentionally
offline (slice 32), and now NeverAutoRemediate at the action-broker
boundary. When the operator has set NeverAutoRemediate=true on a
resource, executeCommandWithAudit refuses the dispatch even with a
valid approval and matching plan hash — per-resource operator intent
outranks per-action approval.

The check sits next to the drift refusal in the same audit hot path:
fail-open on store-lookup error (logged but doesn't block), persist a
Failed audit record with `resource_remediation_locked:` prefix on the
ErrorMessage so audit-UI filters and alert rules can branch on the
stable token. Returns ErrResourceRemediationLocked sentinel for
caller-side branching.

Reuses the existing actionAuditStore handle (which is the unified
ResourceStore from slice 29), so no new provider plumbing needed —
the broker already has the data path it needs.
2026-05-09 15:06:13 +01:00
rcourtman
8a70a2c23c Auto-acknowledge findings on intentionally-offline resources
Compounding slice on the per-resource operator-state feature: when an
operator has marked a resource as IntentionallyOffline (via the API
surface from slice 30), new findings against that resource get
auto-dismissed as expected_behavior with the lifecycle event tagged
operator_state_cause=intentionally_offline. Same shape as the
maintenance-window suppression from slice 31 but indefinite — no
scheduled end, no maintenance_end_at metadata.

Restructures the ResourceOperatorStateProvider interface to return a
single ResourceOperatorStateProjection per call rather than a narrow
ActiveMaintenanceWindow method. Adding the second signal would
otherwise have meant a second method and another call per finding;
the projection shape carries every signal in one round-trip and gives
future signals (NeverAutoRemediate, criticality) a stable extension
point.

Maintenance windows take priority over intentionally_offline when
both are active because the time-bounded suppression is more honest
to surface to the operator — they'll see it auto-clear when the
window ends rather than wondering when the indefinite suppression
will lift.
2026-05-09 14:54:45 +01:00
rcourtman
cf2e61ea22 Auto-acknowledge new findings during operator-set maintenance windows
Third slice on the per-resource operator-state feature delivers the
behavioral payoff: when an operator has set a maintenance window on a
resource (via the API surface from slice 30), new findings against
that resource get auto-dismissed as expected_behavior at creation time
rather than firing notifications. The finding still lands in durable
history with a UserNote naming the window and a "dismissed" lifecycle
event tagged operator_state_cause=maintenance_window so the operator
can audit what tripped during the window.

Adds a narrow ResourceOperatorStateProvider interface to internal/ai
so the findings runtime stays free of an internal/unifiedresources
import. The API layer wires an adapter in router.go that projects
unified.ResourceOperatorState through state.IsInMaintenanceAt into
the ActiveMaintenanceWindow shape the findings runtime consumes.

Suppression is opt-in: stores without a provider wired keep the
original new-finding behavior bit-for-bit, so deployments that
haven't adopted the operator-state feature see no behavioral change.
IntentionallyOffline and NeverAutoRemediate land as compounding
slices on the same provider interface.
2026-05-09 14:43:22 +01:00
rcourtman
46646b4293 Add /api/resources/{id}/operator-state GET / PUT / DELETE handlers
Second slice on the per-resource operator-state feature: the API
surface that the storage foundation from slice 29 was designed to
support. A frontend, pulse-cli, or Assistant tool call can now read
the operator-set state for a resource, replace it with PUT, or clear
it with DELETE.

Contract decisions worth preserving:
- GET 404s with stable error code operator_state_not_set when no
  entry exists, distinct from a 200 with default (all-zero) fields.
- PUT replaces the entire record. URL canonicalId wins over body to
  prevent body-manipulation retargeting; server-side setAt/setBy
  populate from request time and authenticated identity, ignoring
  client values so the audit trail stays honest.
- Validation rejections surface 400 with operator_state_invalid so
  frontend can branch on the code without string-matching messages.
- DELETE is idempotent — 204 whether or not an entry was present.
- GET runs under monitoring:read; PUT and DELETE under
  monitoring:write because the state modulates Patrol's behavior.

Finding-suppression and action-broker integrations land in subsequent
slices that consume the same ResourceOperatorState shape.
2026-05-09 14:34:43 +01:00
rcourtman
ee21942f2a Add ResourceOperatorState foundation for per-resource operator intent
Foundational wedge for the operator-set per-resource state feature
(intentionally offline, never auto-remediate, maintenance windows,
criticality hint). Defines the canonical schema, validation, and
persistence contract; subsequent slices wire in the API surface and the
finding-suppression / action-broker integrations.

ResourceOperatorState carries four narrow operator-intent fields plus
attribution metadata. The shape is intentionally fixed (not a freeform
metadata bag) so consumers have a stable contract to honor.
ValidateResourceOperatorState rejects malformed records with a stable
sentinel (partial maintenance window, end <= start, unknown criticality);
NormalizeResourceOperatorState handles whitespace and case.

Adds GetResourceOperatorState / SetResourceOperatorState /
ClearResourceOperatorState to the ResourceStore interface, implements
both on SQLite (new resource_operator_state table) and Memory. Set is
upsert by canonical_id; Clear is idempotent. Tests cover IsEmpty,
IsInMaintenanceAt window semantics (half-open interval, partial windows
treated as no window), validation rejections, and round-trip
persistence on both stores.
2026-05-09 14:23:24 +01:00
rcourtman
f99bce7ee4 Author Proxmox VM/CT lifecycle preflight context for approval review
Pulse's primary monitored platform is Proxmox, but the per-command-class
preflight catalog only covered systemd services, Docker containers, and
Kubernetes deployments. Operators approving a Patrol-proposed
qm restart/qm stop/qm shutdown saw the generic preflight without the
operational nuance that distinguishes those verbs (qm stop is a hard halt,
qm shutdown is graceful with a 60s ACPI timeout, etc.).

Adds eight new classes covering qm and pct lifecycle (reboot/restart,
stop, start, shutdown). Each gets hand-authored safety and verification
copy that names the actual semantics — pct stop is destructive vs pct
shutdown's lxc-attach handoff — so the operator sees concrete context at
approval time.

Broker-level VerificationCommandForCommand intentionally does not derive
qm status / pct status from these classes — pulse_control's
verifyGuestAction already runs those checks at the tool layer, so adding
a parallel broker dispatch would double-run. The preflight copy still
names what the tool-layer verification will read.
2026-05-09 13:12:52 +01:00
rcourtman
06ebe55e13 Add Copy summary action to share findings as paste-ready Markdown
Operators frequently need to delegate a finding by pasting it into Slack,
email, or a ticket. Currently they have to manually copy each piece —
title, description, impact, recommendation — out of the expanded card.

Adds formatFindingForClipboard helper that produces Markdown mirroring the
seven-question schema's render order (severity + title + resource, then
description, impact, recommendation, plus trust signals). Wires a Copy
summary button between Explain and Discuss in the expanded card, routing
through the shared copyToClipboard helper with success/error notifications.

Investigation evidence and rollback plans are intentionally omitted —
those are conversation context for the Assistant flow, not "share this
finding" context for chat or tickets.
2026-05-09 13:00:04 +01:00
rcourtman
7593097e04 Surface verified-resource coverage on the Patrol recency line
Third wedge of the Patrol page IA reframe. The recency line "Last full
patrol: 3m ago" tells the operator when Pulse last ran but not what it
covered. Adds a coverage signal so the line reads "Last full patrol: 3m
ago — verified 47 resources" — operators see temporal AND coverage state
without scrolling.

Extends PatrolRecencyPresentation with an optional resourcesChecked
populated from PatrolRunRecord.resources_checked on the latest completed
run. Field is intentionally optional (omitted when zero) so a degenerate
zero-coverage run doesn't render "verified 0 resources" as an alarm.
2026-05-09 12:54:14 +01:00