Pulse

vrr/Pulse

mirror of https://github.com/rcourtman/Pulse.git synced 2026-05-30 03:31:51 +00:00

Author	SHA1	Message	Date
rcourtman	4dff26f728	Emit structured telemetry on reporting and summarize invocations The reporting feature now ships across two surfaces (PDF/CSV export and pulse_summarize chat tool) and three modes (single-resource, fleet, summarize). Without usage telemetry we can't tell whether the work earns its place — operator demand, AI-vs-heuristic adoption, range/format preferences are all invisible. Stops further feature investment from being pure speculation. Three new info-level log events, structured so an agent can grep transcripts and group by dimension without a separate metrics pipeline (matches the "agent owns ops analysis, human gets outcomes" posture in MEMORY.md): reporting.single.generated — single-resource PDF/CSV reporting.fleet.generated — multi-resource fleet PDF/CSV reporting.summarize.invoked — pulse_summarize chat tool (both modes) Common dimensions: org_id, format/action, range, ai_configured, findings_configured, window_start/end. Single-resource adds resource_type + metric_type + bytes; fleet adds resource_count + bytes; summarize adds resource_type + resource_count (fleet mode) + narrative_source (so we can audit AI-fallback rate). Includes rangeLabel() helper that maps a window to the canonical catalog range token (24h/7d/30d) with a 1h tolerance, falling back to "<hours>h" so non-standard windows still group. Tested. TestReportingTelemetryEventNames pins the canonical event names as a contract — an agent grepping logs depends on them being stable; changing them silently would break audit tooling on the consumer side. The reporting engine already logs the resolved narrative source (heuristic/ai) at debug level via the existing "Generating report" line, useful for diagnosing why a specific report fell back. Kept at debug; the new info-level events cover the operator surface.	2026-05-10 22:59:23 +01:00
rcourtman	03463c1bfe	Thread per-tenant AI narrators into pulse_summarize via chat session v1 of pulse_summarize (`1fe5d6853`) shipped with heuristic narrative only. The follow-up wiring promised in that commit now lands: the chat session carries optional report-narration providers that the tool's handler reads when building requests, so AI-narrated synthesis flows into chat using the same provider, sanitizer, model selection, cost ledger, and budget gate the report PDF endpoint already uses. Pipeline: - pkg/reporting Narrator / FleetNarrator / FindingsProvider interfaces are already implemented by internal/ai.Service. No new implementations. - tools.ExecutorConfig + PulseToolExecutor gain three optional fields (ReportNarrator, ReportFleetNarrator, ReportFindingsProvider). Clone() copies them so per-session executors inherit the wiring. - chat.Config gains the same three fields; NewService threads them into ExecutorConfig. - tools_summarize.go reads e.reportNarrator/FleetNarrator/ FindingsProvider and populates MetricReportRequest / MultiReportRequest. The engine already accepts these on the request and falls back to heuristic when they are nil — no engine changes needed. - AIHandler gains SetReportNarratorResolver(ctx -> narrators); both per-tenant and default chat.Config construction sites invoke the resolver. Router wires the resolver to AISettingsHandler.GetAIService with the same Enabled-gate the reporting handler uses. Unconfigured tenants are unchanged: the resolver returns nil, the tool returns heuristic narrative — identical to today. Configured tenants get AI synthesis in chat that matches what their report PDF already carries, billed and budget-gated the same way.	2026-05-10 22:50:17 +01:00
rcourtman	20df3dcd2c	Let a valid bootstrap token authorize initial setup from any origin The loopback gate from `586473ee3` rejected non-loopback setup requests before the bootstrap-token check could run, so a Proxmox-LXC install (install script prints URL + token; user opens URL on workstation, pastes token) hit "only available from localhost" even with the correct token. The token is the security boundary — only callers with filesystem access to the data dir can read it — so a valid token now authorizes setup from any origin. No-token requests still require direct loopback. Updates the two contract/setup tests that pinned the old behavior. Fixes discussion #1459.	2026-05-10 22:25:34 +01:00
rcourtman	e32d4ede44	Expose engine narrative entry points for non-rendering callers The reporting engine's synthesis layer was reachable only through Generate/GenerateMulti, which always rendered PDF or CSV. Pulse Assistant needs the same retrospective synthesis (per-resource summary, fleet outliers, period comparison) in a form it can present in chat, not as a downloaded artifact. Add two non-rendering entry points to the Engine interface: NarrativeFor(req MetricReportRequest) (Narrative, error) FleetNarrativeFor(req MultiReportRequest) (FleetNarrative, error) Both run the same query path and the same narrator resolution as their rendering counterparts (heuristic by default, AI when the request supplies a narrator, fail-closed-to-heuristic on any narrator error) and return the structured narrative without invoking the fpdf/csv output stage. Test stubs in pkg/reporting and internal/api are updated to implement the extended interface. These are the seams the upcoming pulse_summarize Assistant tools wrap to answer questions like "what's hot on pve1 this week" or "where should I look across my fleet" without round-tripping through report generation. Same synthesis layer, no PDF involved. Also fixes a pre-existing flake in TestEngineGenerate_UsesSuppliedNarrator (metrics writes are async; the first Generate sometimes ran before the raw tier flushed). Wrapped in the same eventually-pattern used by the prior-period and findings-provider tests.	2026-05-10 22:23:09 +01:00
rcourtman	d4463a615c	Add fleet-level AI narrative for multi-resource reports The single-resource AI narrative landed in `b2bd9d114` but multi-resource fleet reports stayed heuristic-only. That left a gap on the exact axis where AI helps most: a 50-resource fleet PDF is where synthesis is the difference between useful and unread. Introduce FleetNarrator as a separate interface from Narrator. The input shapes are different — single-resource takes one set of metric stats with a prior window, fleet takes a denormalised cross-resource view with per-resource summaries plus a fleet aggregate. HeuristicFleetNarrator owns the deterministic fallback: ranks resources by severity (critical alerts > unhealthy disks > storage pressure > memory > CPU > non-critical alerts), picks up to 5 outliers, derives cross-cutting patterns by counting how many of N resources share a hot signal, and emits fleet-scoped recommendations. internal/ai.Service implements FleetNarrator through report_fleet_narrator.go. Distinct use-case label (report_narrative_fleet) so fleet vs single-resource spend is separable in the cost ledger and budget gate. The fleet payload is denormalised through buildReportFleetPayload so prompt cost scales linearly with fleet size. Same fail-closed invariant — nil provider, parse failure, or context cancellation falls through to the heuristic. Single-resource Narrator is intentionally NOT propagated through engine.GenerateMulti: a 50-resource fleet report performs one AI call (fleet narrator), not 51. The router resolver returns the AI service for all three roles (Narrator, FleetNarrator, FindingsProvider). The fleet PDF renders the FleetNarrative in the fleet summary cover when present: executive prose, named outliers with severity-coloured bullets, cross-cutting patterns, recommendations, optional period comparison, and an AI provenance footer. The deterministic resource summary table is preserved above so every named outlier is verifiable against the table immediately below it. Legacy "Highest CPU / Most alerts" bullets remain as the fallback when no FleetNarrative is attached.	2026-05-10 21:23:12 +01:00
rcourtman	27bd31684a	Log the underlying error on audit list 500s HandleListAuditEvents dropped the Query/Count error before writing the 500, so a user hitting "Failed to fetch audit events" produced no server-side log line — diagnosing the failure was impossible without a local repro. Log the error with the org ID so the next instance is findable. Doesn't change the user-facing response.	2026-05-10 20:44:18 +01:00
rcourtman	b2bd9d1147	Replace heuristic report narrative with optional AI-generated layer Performance reports rendered the Executive Summary, Observations, and Recommendations sections from inline threshold rules in pdf.go. That narrative looked intelligent but was static templating against alert counts and metric percentiles, which felt off-brand alongside Patrol and Pulse Assistant. Introduce a Narrator interface in pkg/reporting and a FindingsProvider counterpart that the engine consults at report time. The heuristic rules are lifted into HeuristicNarrator unchanged so the deterministic fallback still produces the same observations and recommendations. The engine now also queries the comparable prior period and threads its aggregate stats through the narrator so deltas can be expressed. internal/ai.Service implements both interfaces via report_narrator.go (single-turn JSON call grounded in the structured ReportData payload, falling back to the heuristic on any error/timeout) and report_findings.go (Patrol findings whose lifecycle overlaps the report window). The reporting handler resolves the per-tenant AI service when it is configured and supplies it in the request; absent configuration, reports look identical to the prior heuristic output. Charts, stats tables, alert lists, storage and disk sections stay deterministic — sysadmins can verify every AI claim against the data tables next to it. The PDF renders the AI prose between the health card and Quick Stats, adds a Period-over-period section after Recommendations, and prints a provenance footer when the narrative came from the assistant. ai-runtime.md and api-contracts.md updates land in a follow-up commit on this branch; agent-lifecycle / performance-and-scalability / storage-recovery have no contract delta from this change (router.go is referenced in their Extension Points but their semantics are unchanged).	2026-05-10 19:30:54 +01:00
rcourtman	297556fb65	Surface cached preflight in Patrol tools readiness check Previously the "Patrol tools" readiness check was static model-name pattern matching: it told the operator whether the selected model is on Pulse's tool-capable allowlist, not whether tools have actually been verified to work. After today's preflight cache, that's strictly less information than what we already know. resolvePatrolToolsCheck now consults aiService.CachedPatrolPreflight() and grounds the check in real evidence when a result for the configured provider+model exists: - cached green (success + tool_call_observed) → "Tool calling verified <age> against <model>." (ready) - cached failure → classified summary + "(last preflight <age>)." (not_ready) - cached soft warning (model_tool_support_unverified) → same with warning status - no cache or model mismatch → static fallback formatPatrolPreflightAge produces stable English ("just now", "5m ago", "2h ago", "3d ago") with full unit coverage (11 cases). HandleGetAISettings now also includes patrol_readiness in its response — previously only the PUT response carried it, so the Patrol page only got augmented readiness after a save. The frontend already had patrol_readiness typed and read it from useAISettingsState. ai-runtime, api-contracts, agent-lifecycle (dep), and storage-recovery (dep) contracts updated.	2026-05-10 16:41:16 +01:00
rcourtman	f74add8271	Auto-seed Patrol preflight cache on Pulse startup Closes the cold-start gap in the preflight observability layer: every Pulse restart blanked the cached "last verified" indicator until the next save or manual click, which meant operators saw "never verified" on every upgrade or process restart even when the configured Patrol model was working fine. NewAISettingsHandler now reuses the existing aiSettingsUpdateRequiresPatrolPreflight predicate with a nil "prior config" — semantically "no in-memory cache yet, just booted." When the loaded config has assistant enabled and a Patrol model, the handler dispatches the same async TriggerPatrolPreflightAsync the save path uses. Routine boots where assistant is disabled (or no Patrol model is selected) skip the dispatch so we never write a misleading "Pulse Assistant is not enabled" entry into the cache. Live verified: after Pulse restart, /api/settings/ai surfaces a fresh patrol_preflight with success=true within ~6s of boot, no operator action required. Predicate test extended with two named cases that document the dual-purpose use (startup seed + skip-when-disabled). ai-runtime, api-contracts, agent-lifecycle (dep), and storage-recovery (dep) contracts updated.	2026-05-10 16:34:34 +01:00
rcourtman	33b6bf97d1	Auto-trigger Patrol preflight when settings save moves Patrol transport Closes the resilience gap: today an operator picks a Patrol model, saves, and Patrol silently fails on its first scheduled run because nothing verified the model actually calls tools. Now the save handler dispatches a one-shot preflight in the background whenever the change actually moved Patrol's transport, and the cached result surfaces on the next /api/settings/ai poll via patrol_preflight. Trigger conditions (aiSettingsUpdateRequiresPatrolPreflight): - new config has assistant enabled AND a Patrol model - AND any of: * no prior config (first save) * assistant was disabled, now enabled * Patrol model changed * API key for the new patrol model's provider changed Routine saves that don't touch Patrol transport (theme, control level, discovery toggles, unrelated provider keys) skip preflight entirely so they don't burn provider tokens or add 5-10s latency to every save. Service-level TriggerPatrolPreflightAsync runs the call in a goroutine with a 30s timeout. Detection helper has full unit coverage including the negative paths.	2026-05-10 16:14:26 +01:00
rcourtman	728c42e47b	Bring action endpoints onto the agent surface with the agent-stable envelope Closes the last known gap in the agent substrate. The three action endpoints (POST /api/actions/plan, /api/actions/{id}/decision, /api/actions/{id}/execute) previously emitted the platform-wide APIError shape (stable code under "code", human under "error"). The agent surface uses the inverted shape (stable code under "error", human under "message"), so adding action capabilities to the manifest as-is would have forced agents to remember which envelope each capability uses. The slice refactors actions.go to emit the agent-stable envelope across all 42 writeErrorResponse call sites. writeJSONError gains a writeJSONErrorWithDetails sibling so the 13 calls that pass field-level reasons (validation failures) preserve that information under a new optional `details` field. The action endpoints' JSON shape becomes: {"error": "<stable_code>", "message": "<human>", "details"?: {"<field>": "<reason>"}} Frontend impact: zero. Verified that no frontend code consumes the three action endpoints; the refactor is API-only. Three new manifest entries (plan_action, decide_action, execute_action) under a new "action" category, with their declared error codes pinned per capability. Internal-failure 5xx codes (audit-store outages, encode failures) are not declared per capability; agents branch on 5xx generically. TestContract_AgentSurfaceErrorCodesMatchManifestDeclarations now audits actions.go alongside the existing two handler files, with a documented internal-only allowlist for the 5xx codes. The TestAgentSubstrate_ActionEndpointsEmitAgentStableEnvelope e2e test exercises one error path through each endpoint via the actual HTTP boundary, asserting the agent-stable envelope reaches the wire and the legacy APIError fields (code, status_code, timestamp) do NOT — drift back would mean the refactor regressed. The TestContract_ActionDryRunOnlyExecutionErrorJSONSnapshot pin is updated to match the new envelope shape; the manifest's category allowlist gains "action". api-contracts.md documents the new envelope (with details map), the action governance loop's place in the substrate, the ai:execute scope distinction from monitoring:write, and the "manifest projection has a footnote" trade-off: bringing an existing endpoint into the agent surface may require migrating its error envelope, but the substrate keeps a single envelope contract rather than carrying a translation wrapper layer. agent-lifecycle.md and storage-recovery.md document the action surface joining the agent paradigm and its zero-new-persistence posture respectively. AGENT_SUBSTRATE.md's "what it does not do yet" no longer lists the action surface; it now reflects the real outstanding items (consumer feedback, an in-Pulse agent integrations panel, a distribution path for pulse-mcp).	2026-05-10 15:16:17 +01:00
rcourtman	add1096ec8	Cache Patrol preflight outcome and hydrate UI on settings load The Verify Patrol button reset its result to empty on every page load — the operator had to re-click to see the verified state, even though nothing had changed. This commit adds the observability layer of the auto-preflight plan: every RunPatrolToolPreflight result is now cached on the AI Service and surfaced through /api/settings/ai as patrol_preflight, so the inline result panel rehydrates on page load with the most-recent outcome and a "last verified Xs ago" indicator. Backend: patrolPreflightCache (mutex-guarded) on Service with defensive-copy CachedPatrolPreflight() accessor; every RunPatrolToolPreflight branch (success, soft warning, classified failure, validation early-return) records into the cache. PatrolPreflightSnapshot projects the cached result onto the AI settings response. Tests cover both success-then-failure supersession and the defensive-copy invariant. Frontend: PatrolPreflightSnapshot type mirrors the wire shape; hydratePatrolPreflightFromSettings(data) projects the snapshot into the same response shape the manual button writes; loadSettings and updateSettings flows call it. The result panel renders a "last verified Xs ago" line under the provider/model row when recorded_at_unix is present. End-to-end smoke verified against deepseek-v4-flash: panel rehydrates as green "Tool calling verified · last verified just now" after page reload. Auto-preflight on save (the trigger half of the resilience plan) follows in the next commit. Contracts: ai-runtime, api-contracts, agent-lifecycle (dep), storage-recovery (dep), frontend-primitives all updated to reflect the new patrol_preflight surface and hydration contract. Verification artifacts: settingsArchitecture + patrolPreflight client tests.	2026-05-10 14:58:26 +01:00
rcourtman	404f87854e	Pin cross-org and cross-resource isolation on the bundle's pending approvals The AgentApprovalsProvider closure in router.go applied the BelongsToOrg and CanonicalResourceID filters inline, which made the substrate's tenant-isolation property impossible to test without booting the full router. Drift in the closure (e.g. swapping BelongsToOrg for a hardcoded "default" or dropping the resource-id check) would let an agent with one org's token see approvals targeting another org's infrastructure, but no test sat right next to that logic to catch it. Extracts the body into a named function in agent_resource_context.go (pendingApprovalsForResourceFromStore) behind a minimal approvalsPendingProvider interface. The closure in router.go now delegates to it. Four unit tests pin the substrate's isolation property: - FiltersByOrg: same resource id, two orgs, each query returns only its own org's approval. - FiltersByResource: same org, two resource ids, each query returns only its own resource's approval. - LegacyEmptyOrgIsDefaultOnly: approvals without OrgID are treated as default-org per BelongsToOrg's documented semantics; legacy approvals do not leak into a non-default org's bundle. - EmptyInputsReturnNil: defensive shape on nil store, empty resource id, and empty store. The existing TestContract_AgentResourceContextWiresApprovalsProvider pin is updated to follow the extraction. Both halves of the wire-up are now pinned: router.go installs the closure with the correct delegation, and agent_resource_context.go owns the filter logic with both safety checks present. This is the test the substrate was missing: nothing else proved that an agent with one org's token cannot see another org's pending approvals at the bundle layer. Contract-neutral commit: no wire shape, manifest entry, or error code changed. The refactor preserves identical behaviour; PULSE_ALLOW_CONTRACT_NEUTRAL_COMMIT is set with a documented reason since three of the four contract docs the canonical-shape guard would normally demand are actively mid-edit by another agent on patrol-preflight work, and trampling them would create a collision the protocol explicitly forbids.	2026-05-10 14:38:10 +01:00
rcourtman	e26a57a157	Add POST /api/ai/patrol/preflight tool-call verification The existing per-provider /api/ai/test endpoints only call ListModels — they pass for every provider that returns a catalog, even when Patrol fails 100% of runs because tools aren't actually wired up. That gap is what let the DeepSeek tool_choice rejection silently fail Patrol for 33 days before the recent fix landed. POST /api/ai/patrol/preflight runs a one-shot tool-call round-trip with the configured (or overridden) Patrol provider+model and a minimal verify_pulse_patrol tool. Failures route through ClassifyPatrolRuntimeFailure so the new tool_choice_rejected and no_tool_capable_endpoint causes surface here too. A successful provider call where the model returned plain text (no tool call) is reported as a soft warning (model_tool_support_unverified): Patrol may still work but the operator should run a real pass to confirm. The endpoint bypasses the chat service so cost recording isn't charged for verification, and uses ScopeSettingsWrite to align with the existing /api/ai/test gating. Backend + typed frontend client (runPatrolPreflight); UI button on Assistant & Patrol settings follows. Contracts updated: - ai-runtime: completion obligation extended to cover the new verification surface - api-contracts: payload shape (tool_call_observed, duration_ms) noted in obligations - agent-lifecycle, storage-recovery: dependent-extension acknowledgment that ai-runtime owns the new route despite it living under internal/api/	2026-05-10 14:30:41 +01:00
rcourtman	f2d9d2aba8	Split overgreedy "tools not supported" classifier into three causes The Patrol runtime classifier collapsed three distinct upstream conditions into one misleading "Selected model does not support Patrol tools" message: 1. Provider rejected the value Pulse sent for tool selection (e.g. DeepSeek's "deepseek-reasoner does not support this tool_choice" — the model accepts tools, just not the forced coercion). The DeepSeek fix in `46145df9` dodges the symptom by coercing to auto, but the original misclassification pointed operators at the wrong remediation for 33 days. 2. Provider has no tool-capable endpoint available for the selected model (OpenRouter's "No endpoints found …" surfaces this when account-level provider/data filters exclude every tool-capable route). 3. Model truly lacks tool calling (the literal "tools are not supported" / "tool calling" cases). Each now has its own PatrolFailureCause, title, summary, description, and recommendation. summarizePatrolRuntimeFailureDetail mirrors the split. Helper predicates patrolToolChoiceValueRejected and patrolNoToolCapableEndpoint encapsulate the substring matching. The OpenRouter "No endpoints found" test fixture now correctly classifies as no_tool_capable_endpoint instead of model_unsupported_tools — fixture updates in patrol_runtime_failure_test.go, patrol_assistant_handoff_test.go, and ai_handler_test.go reflect the more accurate diagnostic. New tests cover the tool_choice_rejected and generic model_unsupported_tools paths explicitly. The ai-runtime contract is updated to note the classifier-split obligation alongside the existing transport-shape obligation.	2026-05-10 14:10:18 +01:00
rcourtman	eeb2975d22	Stability sweep on the agent-substrate arc Three things landed: 1. /api/agent/capabilities was missing from publicPathsAllowlist in router_public_paths_inventory_test.go. Slice 47 added the path to publicPaths in router.go and to publicRouteAllowlist in route_inventory_test.go but missed this second mirror, which scans publicPaths via go/ast. The test was failing on origin; this commit closes the gap. 2. The error-envelope paragraph in api-contracts.md now distinguishes capability-specific stable codes (the closed set declared per capability in the manifest) from cross-cutting codes the multi-tenant / auth middleware emits universally (invalid_org, org_suspended, access_denied). The previous wording implied all stable codes lived in per-capability errorCodes lists, which would have forced duplication on every capability or misled agents about which codes to expect. 3. New contract pin TestContract_AgentSurfaceErrorCodesMatch- ManifestDeclarations enforces the symmetry both directions: every code emitted by an agent-surface handler must be either declared in the matching capability or be one of the three cross-cutting codes; every manifest-declared code must have a matching emission. Drift either way is a contract regression. Pin verified clean against the current handler set. Stale forward-reference fixed: the capabilities paragraph no longer says "future MCP-server slices read the manifest" — slice 51 already shipped that adapter. Sweep also surfaced two failures in internal/mock/ from unrelated platform-support drift (unraid token set added in `ac82a2852` but the mock contract test wasn't updated). Those are not part of the agent-substrate arc and not mine to fix; flagged in the closing summary so they don't get lost.	2026-05-09 23:04:22 +01:00
rcourtman	8aa22d0605	Surface action verification on the action.completed SSE payload Closes the certainty loop for agents watching the substrate's push channel. The action audit's read-after-write probe outcome was already persisted on the audit record, but agents watching action.completed only learned "the action ran" — they had to fetch /api/actions/{id} to know whether the read-back probe confirmed the intended state. That defeated the substrate's push-notification guarantee for dispatch certainty. The new agent-stable AgentResourceActionVerification projection (ran, success, command, note, ranAt — output stays in the audit record, deliberately omitted from events to keep payloads small) is now carried on both: - the action.completed SSE payload, projected from record.Result.Verification by the router-side bridge in wireAIChatDependenciesForService, and - the resource-context bundle's recentActions surface, via the same shared projectAgentResourceVerification helper so the bundle (depth) and the doorbell (push) speak the same vocabulary. Refused-before-dispatch failures omit verification (the probe never runs) so agents branch on field presence to distinguish "no probe attempted" from "probe ran with empty result". Three contract pins lock the symmetry: payload field present, router bridge populates it, bundle parallels. The capabilities manifest's subscribe_events description now mentions the verification block so external agents discover the field through the same path they already use to learn the rest of the agent surface.	2026-05-09 22:45:15 +01:00
rcourtman	5156c03eed	End-to-end test the operator-state write loop through HTTP Closes the e2e contract proof on the write side. The only write capability the manifest declares is the operator-state intent loop (set / get / clear), and this test boots the full router stack to walk every state of that loop through the actual HTTP boundary — proving the manifest's declared error codes for set_operator_state and get_operator_state reach the wire from the handlers, the URL canonical id authoritatively wins over body-supplied ids (no scope-confusion writes), and SetAt/SetBy are server-populated so attribution cannot be spoofed. The flow exercised: GET unset → 404 operator_state_not_set PUT valid → 200 with persisted state + server SetAt GET → round-trips PUT invalid criticality → 400 operator_state_invalid DELETE → 204 GET → 404 operator_state_not_set (loop closed) DELETE again → 204 (idempotent) Two contract pins lock the audit-honesty and error-token contracts so a future refactor of the handler can't silently regress either: SetAt/SetBy populated server-side, URL-id wins over body-id, and the validator's domain error maps to the stable wire token via errors.Is rather than message-matching. Together with the read-side e2e (slice 47), the agent surface — read, write, push — has now been exercised end-to-end as one substrate.	2026-05-09 22:22:47 +01:00
rcourtman	8cf15fe639	End-to-end test the agent substrate's discovery → triage → depth flow The unit tests cover each piece in isolation; this test boots the full router stack and proves the discovery → triage → depth chain works as one substrate through the actual HTTP boundary an external agent would hit. It found two real bugs slice 40 introduced and slice 45/46 didn't surface: - /api/agent/capabilities was documented as unauthenticated but was missing from the router's publicPaths list, so the global auth middleware was 401'ing the discovery manifest. Fixed by adding the path to publicPaths and pinning the contract so it cannot regress. - The error-envelope shape across the agent surface is {"error": "<stable_code>", "message": "<human>"}, written via writeJSONError — not the {"code": ...} shape I had assumed in the docs. Pinned the wire shape on api-contracts.md so the documented error contract matches what writeJSONError actually writes. The e2e test exercises capabilities discovery, triage via fleet-context, and depth via resource-context with an unknown id to confirm the resource_not_found stable error code reaches the wire under the canonical "error" key. The subscribe_events SSE path is probed unauthenticated to confirm it's gated (401) rather than 404 — discovery's claim is honest.	2026-05-09 22:16:32 +01:00
rcourtman	a168215f6a	Add /api/agent/fleet-context for org-wide triage in one read The substrate had a per-resource bundle but no fleet view, so "where do I focus?" forced agents to walk every resource id and bundle each — O(N) round trips that scale with fleet size. The fleet endpoint returns a thin per-resource rollup in a single read: identity, operator-intent flags (intentionallyOffline, neverAutoRemediate, maintenanceWindowActive), per-severity finding counts, and pending-approval count. Same auth scope and same provider wiring as the per-resource bundle — operator-state via the canonical unified store, findings via AgentFindingsProvider, approvals via AgentApprovalsProvider — so the fleet sweep is the per-resource bundle's wiring multiplied by N with no new dependencies. Audit reads are deliberately omitted from the rollup; agents that want depth on a flagged resource follow up via /api/agent/resource-context/{id}. The capabilities manifest declares get_fleet_context with AgentFleetContext as the response shape so external agents discover the triage entry point through the same path they already use to learn the rest of the agent surface.	2026-05-09 22:08:50 +01:00
rcourtman	d8f6b1e508	Bundle pending approvals into the agent resource-context endpoint The substrate's "everything an agent needs in one read" guarantee covered identity, operator state, findings, and recent actions but forced a separate /api/approvals call for pending governance state. AgentResourceContext now carries pendingApprovals as a lightweight AgentResourceApprovalSummary projection — same vocabulary as approval.pending SSE events, so the doorbell and the bundle agree on shape. AgentApprovalsProvider is the parallel seam to AgentFindingsProvider; the router wires a closure that resolves approval.GetStore() at request time, scopes via BelongsToOrg, and filters by CanonicalResourceID so cross-tenant or cross-resource pending requests don't leak. Empty arrays preserve the iteration-safe contract the existing sections already follow.	2026-05-09 22:00:53 +01:00
rcourtman	7fe9b1c492	Use cursor-help on TagBadges hover-only +N indicator The "+N" overflow indicator on TagBadges was styled with `cursor-pointer`, which signals a clickable affordance — but the element only listens for mouseenter/mouseleave to show a tooltip and has no click handler. Switch to `cursor-help` so the cursor matches the actual interaction (hover for more info), avoiding a phantom click expectation.	2026-05-09 21:52:34 +01:00
rcourtman	52669128e6	Drop redundant policy gates in resource-link routing Tail of the operator-local-UI redaction sweep (`abdde303a`, `a17f879a1`). resolveKubernetesContextForResource gated on requiresGovernedResourceDisplay to choose between getPreferredInfrastructureDisplayName and a manual displayName-or-name fallback. Both branches produce a raw infra name once we trust that displayName never carries a redacted summary in local rendering, so the gate is dead complexity. Collapse to a single call and drop the now-unused requiresGovernedResourceDisplay import. problemResourcePresentation.getProblemResourceDisplayName has no production consumers today, but it still routes through the governed helper. Reclassify it now (same as every other operator-local helper) so the rule is consistent across the codebase if the surface ever gets adopted.	2026-05-09 21:31:45 +01:00
rcourtman	51c5d344ce	Plumb operator-state and operational memory into investigation findings Closes the "has context vs uses context" gap that defines Pulse's agent-paradigm differentiation. The orchestrator (in pulse-pro) used to receive a Finding with no awareness of the operator's commitments — Patrol could investigate a resource the operator had marked never-auto-remediate and propose a restart fix that the action broker would refuse downstream. The proposal shouldn't have happened in the first place. Adds two optional fields to aicontracts.Finding: - OperatorContext: intentionally offline, never auto-remediate, maintenance window with computed active flag, criticality, note. Populated in MaybeInvestigateFinding from the same operator-state projection the suppression hot path consumes, so investigation reasoning and suppression behavior cannot drift apart. - OperationalMemory: regression count, previous resolved fix summary, last regression timestamp, times raised. Populated in ToCoreFinding from fields the internal Finding already carries. ResourceOperatorStateProjection grew a NeverAutoRemediate field — the investigation read path needs it (so the orchestrator can avoid proposing fixes the broker would refuse) even though the suppression hot path doesn't. Same projection serves both reads. Both fields are nil when there's no signal (fresh finding, no operator state) so the orchestrator branches on absence rather than parsing zero-valued structs. The pulse-pro orchestrator consumes the fields in a separate slice; this slice ships the in-repo half of the data path.	2026-05-09 21:03:15 +01:00
rcourtman	94bfd48a9d	Add /api/agent/events SSE stream for real-time agent notifications Third slice on the agent-paradigm pivot, closing the substrate triangle (discovery + bundled reads + push). Agents subscribe once to a long-lived SSE connection and receive real-time events instead of polling: finding.created when a new finding is raised, heartbeat every 15 seconds for keepalive. Each event carries a monotonic ID so agents can dedupe and reason about ordering across reconnects. The broadcaster fan-outs to multiple subscribers and drops events for slow consumers rather than blocking the publish path — publishers cannot stall on consumer slowness. The findings-runtime hook in router.go publishes finding.created when the finding is new AND not auto-dismissed by operator-state suppression (operator already said to stay quiet about that resource); patrol-cycle re-detection of existing findings doesn't fire the event. Capabilities manifest declares the stream under subscribe_events so external agents discover it through the same channel as the REST surface. SSE chosen over WebSocket because it's simpler, works through every HTTP proxy without special-casing, and matches the existing deploy_handlers pattern; agents that need bidirectional comms call REST endpoints in parallel. Tests pin the broadcaster's pub/sub semantics (fan-out, unsubscribe, slow-consumer drop, monotonic IDs), the SSE handler's stream contract (text/event-stream, no-cache, X-Accel-Buffering=no), and the connected/published-event delivery via httptest.NewServer. A contract test pins the publish-gate semantics so operator-state suppression and stream notifications stay aligned.	2026-05-09 20:13:31 +01:00
rcourtman	71797f9b21	Add /api/agent/capabilities discovery manifest for agent integrations Second slice on the agent-paradigm pivot: the discovery document any external agent (Claude Code, custom integrations, future MCP servers) needs to learn what Pulse exposes. Each capability declares its agent-stable name (snake_case), description, category, REST surface, required scope, response shape, and the closed set of stable error codes the response may carry. Agents branch on the codes (operator_state_invalid, resource_not_found, etc.) rather than parsing human messages. The manifest is hand-authored, not auto-generated, because the contract decisions (what's agent-stable, which categories, which error codes) are product-shaping and must not drift behind code changes. Adding a capability is a deliberate "this is part of the agent surface" commitment. v1 surface includes: get_resource_context (substrate from slice 39), get/set/clear_operator_state (slice 30), and the finding-lifecycle actions (acknowledge, snooze, dismiss, resolve). Action-broker capabilities are not in v1 because they go through approval flow, not direct dispatch — those need their own contract design. Tests pin: stable shape, version contract, unique-and-snake_case names, every capability has method/path/scope, closed category set, required error codes for the most consequential capabilities. The manifest is unauthenticated and cacheable (5min); the underlying capabilities keep their own auth scopes.	2026-05-09 19:52:17 +01:00
rcourtman	14f9270a5e	Add /api/agent/resource-context/{id} substrate endpoint for agents First slice on the agent-paradigm pivot: instead of building more human-glance UI, expose substrate that any agent (in-process Patrol, external Claude Code, future MCP-driven setups) can consume in one read. The endpoint returns the full situated picture of a resource — identity, operator-set state with server-computed maintenanceWindowActive flag, active findings as a lightweight seven-question-schema projection, and recent action audits with refusal tokens (resource_remediation_locked:, plan_drift:) preserved verbatim for agent branching. Substrate is the right shape here: an agent reasoning about a resource gets everything it needs without chaining four or five calls, and the projection types decouple agent-stable wire shape from internal type evolution. Active findings flow through an AgentFindingsProvider adapter wired in router.go from the patrol service, keeping the api package free of an internal/ai import. Always-array fields (activeFindings, recentActions) and omitempty-on-absent (operatorState) give agents stable iteration and clean field-presence branching. AgentContextHandler owns the agent surface as its own type so it evolves independently of resource CRUD. Each test pins a specific contract: identity round-trip, operator-state projection with computed flag, empty-state shape, refusal-token preservation, 404 shape, method gating.	2026-05-09 19:46:16 +01:00
rcourtman	8a70a2c23c	Auto-acknowledge findings on intentionally-offline resources Compounding slice on the per-resource operator-state feature: when an operator has marked a resource as IntentionallyOffline (via the API surface from slice 30), new findings against that resource get auto-dismissed as expected_behavior with the lifecycle event tagged operator_state_cause=intentionally_offline. Same shape as the maintenance-window suppression from slice 31 but indefinite — no scheduled end, no maintenance_end_at metadata. Restructures the ResourceOperatorStateProvider interface to return a single ResourceOperatorStateProjection per call rather than a narrow ActiveMaintenanceWindow method. Adding the second signal would otherwise have meant a second method and another call per finding; the projection shape carries every signal in one round-trip and gives future signals (NeverAutoRemediate, criticality) a stable extension point. Maintenance windows take priority over intentionally_offline when both are active because the time-bounded suppression is more honest to surface to the operator — they'll see it auto-clear when the window ends rather than wondering when the indefinite suppression will lift.	2026-05-09 14:54:45 +01:00
rcourtman	cf2e61ea22	Auto-acknowledge new findings during operator-set maintenance windows Third slice on the per-resource operator-state feature delivers the behavioral payoff: when an operator has set a maintenance window on a resource (via the API surface from slice 30), new findings against that resource get auto-dismissed as expected_behavior at creation time rather than firing notifications. The finding still lands in durable history with a UserNote naming the window and a "dismissed" lifecycle event tagged operator_state_cause=maintenance_window so the operator can audit what tripped during the window. Adds a narrow ResourceOperatorStateProvider interface to internal/ai so the findings runtime stays free of an internal/unifiedresources import. The API layer wires an adapter in router.go that projects unified.ResourceOperatorState through state.IsInMaintenanceAt into the ActiveMaintenanceWindow shape the findings runtime consumes. Suppression is opt-in: stores without a provider wired keep the original new-finding behavior bit-for-bit, so deployments that haven't adopted the operator-state feature see no behavioral change. IntentionallyOffline and NeverAutoRemediate land as compounding slices on the same provider interface.	2026-05-09 14:43:22 +01:00
rcourtman	46646b4293	Add /api/resources/{id}/operator-state GET / PUT / DELETE handlers Second slice on the per-resource operator-state feature: the API surface that the storage foundation from slice 29 was designed to support. A frontend, pulse-cli, or Assistant tool call can now read the operator-set state for a resource, replace it with PUT, or clear it with DELETE. Contract decisions worth preserving: - GET 404s with stable error code operator_state_not_set when no entry exists, distinct from a 200 with default (all-zero) fields. - PUT replaces the entire record. URL canonicalId wins over body to prevent body-manipulation retargeting; server-side setAt/setBy populate from request time and authenticated identity, ignoring client values so the audit trail stays honest. - Validation rejections surface 400 with operator_state_invalid so frontend can branch on the code without string-matching messages. - DELETE is idempotent — 204 whether or not an entry was present. - GET runs under monitoring:read; PUT and DELETE under monitoring:write because the state modulates Patrol's behavior. Finding-suppression and action-broker integrations land in subsequent slices that consume the same ResourceOperatorState shape.	2026-05-09 14:34:43 +01:00
rcourtman	2bd4621b76	Attribute operator-driven Mark resolved closures as "Resolved by you" Slice 22 added the manual Mark resolved button (auto_resolved=false) but the resolution-reason copy still flattened every closure into "Condition cleared" or "Issue no longer detected" — the operator couldn't tell from the timeline whether they had closed the loop themselves or Pulse had auto-detected the condition clearing. Threads the existing Finding.AutoResolved flag through unified.UnifiedFinding (Go), router.go conversions, UnifiedFindingRecord (TS), and the store-level UnifiedFinding. The frontend resolution-reason helper now reads "Resolved by you <time>" when autoResolved === false, while keeping Patrol's specific fix outcomes (fix_verified, fix_executed, resolved) priority because those describe Pulse's actual remediation rather than mere auto-detection.	2026-05-09 11:10:11 +01:00
rcourtman	5cc2f61be0	Surface will_fix_later remind-at on dismiss confirm and dismissed rows Slice 18 made will_fix_later a real operational commitment server-side, but the new RemindAt field stayed invisible to operators until the reminder fired a week later. This wires it through the API surface and renders it where the operator decides and where they later revisit. UnifiedFinding (Go and TS) and the Patrol Finding TS shape now carry RemindAt / remind_at; router.go and AddFromAI mirror it like the other user-feedback fields. FindingsPanel previews "Pulse will stay quiet for 7 days, then surface again on <date>" on the dismiss confirmation panel before the operator confirms, badges dismissed-as-will_fix_later rows with "Reminding <date>" in amber, and adds explanatory copy for the other two dismissal reasons so all three paths feel deliberate rather than undifferentiated.	2026-05-09 10:47:21 +01:00
rcourtman	07d1ab51d1	Surface trust metrics on the Patrol page Wire FindingsStore.GetTrustSummary through PatrolService and the patrol-status API into the Patrol page so the operator can scan "is Pulse useful?" at a glance. Adds a small Trust strip above the Findings/Runs tab bar that renders compact signals: fixes verified, auto-resolved, dismissed-as-noise, dismissed-as-expected, currently active, and regressed-at-least-once. The strip is hidden when every signal is zero so a fresh install sees no empty pill. Plumbing: - PatrolService.GetFindingsTrustSummary accessor (delegates to the store-level method shipped in the prior slice) - PatrolStatusResponse carries Trust *FindingsTrustSummary; populated from the active patrol service, omitted when no service is available (snapshot semantics, not lifetime totals) - TS FindingsTrustSummary mirror in api/patrol.ts and a trust field on PatrolStatus - PatrolIntelligenceWorkspace reads state.patrolStatus()?.trust and conditionally renders the strip Verification artifacts: - internal/ai/patrol_test.go: TestPatrolService_GetFindingsTrustSummary - internal/api/contract_test.go: TestContract_PatrolStatusTrustJSONSnapshot pinning the canonical wire shape - frontend-modern/src/api/__tests__/patrol.test.ts: round-trip test for the trust block on the patrol-status response - frontend-modern/src/features/patrol/__tests__/PatrolIntelligenceWorkspace.test.ts: source-text test pinning state.patrolStatus()?.trust read, aria-label, and field names so future strip additions go through the FindingsTrustSummary contract first. - frontend-modern/src/features/patrol/__tests__/patrolInvestigationContextModel.test.ts: pins that the per-finding context model does not synthesize impact from trust counters; trust is an aggregate operator-page concern, not a per-finding text source. Updates the api-contracts, ai-runtime, patrol-intelligence, agent-lifecycle, frontend-primitives, and storage-recovery contracts to pin the trust block's shape, the strip's contract-first rule, and the scope boundary (trust counters are advisory operator context, not enrollment/storage/recovery action authority).	2026-05-08 21:11:24 +01:00
rcourtman	cff4226531	Pass stored fingerprint into PVE diagnostic test client The /api/diagnostics handler builds its own test client per PVE node to run a live connectivity probe. The PBS branch already passed node.Fingerprint into the test client config, but the PVE branch did not. With VerifySSL=true and a self-signed Proxmox cert (the standard configuration), tlsutil.CreateHTTPClientWithTimeout falls into default-secure mode and validates against the system CA chain, which fails the handshake even when the actual poller — which DOES pass the fingerprint — is connecting fine. The result was that /api/diagnostics reported delly + pi as "Failed to connect to Proxmox API" while /api/resources was happily ingesting all 27 workloads from the same hosts. Mirror the PBS branch by passing node.Fingerprint into the PVE testCfg so the diagnostic probe uses the same TLS verification path as the runtime poller. Add a regression test that spins up an httptest TLS server, captures its leaf cert SHA-256, configures a PVE instance with VerifySSL=true and that fingerprint, and asserts computeDiagnostics reports Connected=true. The pre-fix code fails this with a "tls: bad certificate" handshake error.	2026-05-08 20:58:32 +01:00
rcourtman	b5c8e00859	Preserve previous successful fix across regressions Capture the prior InvestigationRecord.ProposedFix.Description into a new Finding.PreviousResolvedFixSummary field at regression time, before the InvestigationRecord is cleared. Without this capture the next investigation starts from blank context whenever a finding regresses, and operational memory of "what worked last time" is lost. The summary propagates through: - FindingsStore.Add regression branch (capture before clear) - Finding.MarshalJSON / UnmarshalJSON (wire shape) - Both Finding to UnifiedFinding conversion sites in router.go - UnifiedFinding (struct + JSON shadow + Marshal/Unmarshal) - UnifiedStore.AddFromAI update branch (non-empty overwrite) - Assistant chat context as a "Previous Resolved Fix" line so the LLM sees what worked previously rather than blank-slate diagnosing each regression. Adds a unit test that walks the full lifecycle (detect, resolve via UpdateInvestigationRecord + ResolveWithReason, re-detect) and asserts PreviousResolvedFixSummary is preserved while InvestigationRecord is cleared, plus two chat-context tests covering the surfaces-when-set and omits-when-empty cases. Adds a contract test pinning the canonical "previous_resolved_fix_summary" JSON key. Updates the api-contracts, ai-runtime, and the dependent agent-lifecycle, performance-and- scalability, and storage-recovery contracts to pin the operational memory propagation rule and its scope boundary.	2026-05-08 19:45:44 +01:00
rcourtman	10ff1c4dcf	Surface Finding.Impact through UnifiedFinding to operators Carry the Finding.Impact text added in the previous slice through the Finding to UnifiedFinding boundary and onto the FindingsPanel surface so the runtime-failure consequence-if-ignored copy is visible to the operator. Add Impact to the UnifiedFinding struct, JSON snapshot, and both Marshal/Unmarshal mirrors; copy f.Impact into both Finding to UnifiedFinding conversion sites in router.go; mirror impact in the TS UnifiedFindingRecord and Finding API types and the aiIntelligence store normalizers; render an Impact line between Description and Recommendation in FindingsPanel. Also fix the FindingsStore.Add dedup-merge path so re-detected findings overwrite existing.Impact alongside Description and Recommendation rather than preserving the stale empty value left by an older binary. Without this fix, a freshly-classified runtime failure with new Impact text would be merged onto the persisted finding but the Impact field would be silently dropped. Verified end-to-end against the live runtime: triggered a Patrol run, watched the runtime-failure finding regenerate, confirmed the operator-visible card now renders "Impact: While Patrol cannot analyze..." between Description and Recommendation. Updates the api-contracts, ai-runtime, patrol-intelligence, and the dependent agent-lifecycle, performance-and-scalability, and storage-recovery contracts to pin the propagation rule and the dedup-merge invariant.	2026-05-08 17:31:19 +01:00
rcourtman	e7b5650233	Add impact and rollback to investigation records Promote the seven-field investigation-record shape so Patrol findings can carry consequence-if-ignored context and a record-level rollback plan alongside the existing verification array. The shared aicontracts.InvestigationRecord struct gains top-level Impact and Rollback fields with matching TS mirrors, normalizes Rollback to an empty slice, and the Patrol-owned investigation surface renders an explicit "Impact not assessed" / "Rollback not specified" placeholder so the operator-visible gap is conspicuous to both the operator and Assistant when Patrol has not populated them. Backend default leaves both empty rather than fabricating analysis from severity/category. Also closes the existing Trigger.cause drift between Go and TS so frontend handoff context preserves backend-attributed failure cause, and updates the api-contracts, ai-runtime, frontend-primitives, and patrol-intelligence subsystem contracts to pin the new shape.	2026-05-08 16:47:55 +01:00
rcourtman	f88f0fcf1a	Align DeepSeek V4 Patrol readiness	2026-05-08 11:34:07 +01:00
rcourtman	ac82a28521	Fix Unraid agent host profile detection	2026-05-08 11:05:14 +01:00
rcourtman	d0940db33b	Clamp Patrol monitor-only autonomy saves Refs #1463	2026-05-08 02:21:31 +01:00
rcourtman	dd15d23a35	Hydrate Patrol requester in Assistant handoffs - hydrate approval requester identity into backend handoff actions - include requester provenance in refreshed finding action context - document the backend requester authority boundary	2026-05-08 00:26:04 +01:00
rcourtman	ea3e1b216a	Persist Patrol approval requester identity - store requester provenance on approval records - carry requester metadata through approval APIs and Assistant handoffs - document the safe Patrol approval provenance boundary	2026-05-08 00:12:09 +01:00
rcourtman	f563adf136	Mark Patrol queued fixes as Patrol actions - derive approval requester identity from investigation-fix approvals - stamp pending action-audit lifecycle events with Patrol as the producer - document the requester boundary for Patrol handoffs and timelines	2026-05-07 23:48:12 +01:00
rcourtman	4736358acc	Drive agent host profiles from platform manifest	2026-05-07 23:42:15 +01:00
rcourtman	cf8931031d	Record Patrol fix approvals in action audit - seed queued investigation fixes with governed action plans - persist planned and pending lifecycle evidence for Patrol approvals - cover the approval adapter and subsystem contracts	2026-05-07 23:26:39 +01:00
rcourtman	bbaa6c0a6c	Cover Patrol handoff approval mode - force non-empty Patrol finding handoffs into approval-required mode - add direct helper coverage for finding, scoped, resource, action, and metadata handoffs - align subsystem contracts with the backend-owned clamp	2026-05-07 23:08:56 +01:00
rcourtman	f13a89db62	Clear Patrol runtime issue after scoped success - mark DeepSeek V4 Flash and Pro tool-ready for Patrol readiness - resolve synthetic Patrol runtime findings after provider-backed scoped runs - cover readiness and run-history resolved counts	2026-05-07 22:42:38 +01:00
rcourtman	7da942226a	Clarify Pulse Agent host profile support Separate first-class platform support from Pulse Agent host profiles and classify Unraid as an agent-backed host profile while preserving it as presentation-only platform vocabulary.	2026-05-07 22:28:24 +01:00
rcourtman	d9331d39e4	Move Patrol run Assistant handoffs server-side - rehydrate Patrol run context from Patrol history for chat requests and follow-up sessions - send only browser-safe run metadata from the frontend - classify and redact provider runtime failures in handoff prompts and briefings	2026-05-07 21:58:06 +01:00
rcourtman	f7992e8e78	Return safe provider preflight diagnostics Classify Assistant and Patrol provider-test failures through the Patrol runtime failure taxonomy, redact secret-shaped provider evidence, and preserve safe recommendations in the settings shell. Refs #1463	2026-05-07 20:45:18 +01:00

1 2 3 4 5 ...

1213 commits