Pulse

vrr/Pulse

mirror of https://github.com/rcourtman/Pulse.git synced 2026-05-20 17:48:47 +00:00

Author	SHA1	Message	Date
rcourtman	b2e437b198	Separate Patrol finding controls from disclosure	2026-05-12 22:52:36 +01:00
rcourtman	b213b59f7a	Prioritize Patrol findings in Assistant handoffs	2026-05-12 22:46:53 +01:00
rcourtman	1e33d3089c	Bound Patrol refresh state	2026-05-12 22:35:50 +01:00
rcourtman	bb04bf9042	Keep Patrol findings visible during refresh	2026-05-12 22:13:19 +01:00
rcourtman	5465852be0	Use Patrol source loading for findings panel	2026-05-12 22:06:09 +01:00
rcourtman	3cbd49cb54	Align Patrol coverage caveats with full runs	2026-05-12 21:33:44 +01:00
rcourtman	5b18438528	Use checked wording for limited Patrol recency	2026-05-12 17:53:53 +01:00
rcourtman	80c061cd45	Separate Patrol runtime issues from findings copy	2026-05-12 17:43:28 +01:00
rcourtman	450e7516f0	Keep Patrol findings from showing all-clear copy	2026-05-12 17:34:30 +01:00
rcourtman	22a94f47d9	Skip release publish downstreams for drafts	2026-05-12 17:32:11 +01:00
rcourtman	8b0f3564f6	Fail closed on stale API action plans	2026-05-12 17:32:11 +01:00
rcourtman	3566a4d61d	Drive promote-floating-tags via workflow_call from create-release promote-floating-tags.yml's `workflow_run` chain off publish-docker.yml silently stopped firing for rc.3 → rc.5 because publish-docker failed at the now-removed pulse-agent push step. Customers pulling rcourtman/pulse:latest, :6, or :6.0 stayed on whatever the previous successful release had tagged — there was no warning anywhere that the floating tags were stale. Same fix pattern as install-sh-smoke (commit `7c0f65425`) and publish-helm-chart (commit `14c79a28e`): add a workflow_call trigger to promote-floating-tags.yml and call it explicitly from create-release.yml after validate_release_assets succeeds. Gating on validate_release_assets is intentional: that workflow waits for the docker image to be pullable from the registry (with retry backoff), so by the time it succeeds the image manifest exists and re-tagging it to latest/major/minor cannot point at vapor. The legacy workflow_run trigger stays as the primary path; this just guarantees promotion even when the chain doesn't fire. Tag-resolver step now accepts inputs from workflow_call / workflow_dispatch and only falls back to the workflow_run derivation when inputs are absent, so all three entry paths converge on the same identity. Pinned in build_release_assets_test.go: - new TestPromoteFloatingTagsReachableViaWorkflowCall pins the trigger declaration and the input-priority resolver - existing TestCreateReleaseUploadsPowerShellInstaller extended to pin the promote_floating_tags job wiring (uses, tag, prerelease) Contract delta in deployment-installability.md Extension Point 7 documents the same explicit-workflow_call requirement that applies to publish-helm-chart, extended to promote-floating-tags.	2026-05-12 16:47:51 +01:00
rcourtman	14c79a28e7	Trigger publish-helm-chart via workflow_call from create-release v6 rc.1 → rc.5 published successfully but the Helm chart never landed on rcourtman.github.io/Pulse/index.yaml — the index still ends at v5.1.30. `helm install pulse pulse/pulse --version 6.0.0-rc.5` returns chart-not-found; without `--version` helm pulls the latest published chart (v5.1.30) into a customer's v6 cluster. Root cause: GitHub does not fire `release: published` for releases that were created as drafts and later PATCHed to draft=false. create-release.yml deliberately uses that path so it can upload assets and run validate-release-assets against the draft before promoting. Inspection of the workflow run history confirms: every gh-API `release: published` event since 2026-03-02 has been from manually-dispatched v5 stable cuts; zero fired for v6 RCs published through the create-release pipeline. Fix the same way install-sh-smoke was wired in commit `7c0f65425`: add a `workflow_call` trigger to publish-helm-chart.yml and call it explicitly from create-release.yml as a downstream of validate_release_assets. The chart-version resolver in publish-helm-chart now accepts inputs from either workflow_call or workflow_dispatch and only falls back to the release-event tag when no inputs are present, keeping the legacy release-event path working for forks / manual gh-CLI publishes that create with draft=false from the start. Pinned in build_release_assets_test.go: - create-release.yml wiring (publish_helm_chart job, version inputs) - publish-helm-chart.yml workflow_call trigger declaration - chart-version resolver's input-priority logic Contract delta in deployment-installability.md Extension Point 7 documents the workflow_call requirement and forbids relying on the release-published webhook for the create-release.yml draft-promotion path. The fix takes effect on the next release through the pipeline. Backfill of the v6.0.0-rc.5 chart needs a one-time manual dispatch of publish-helm-chart.yml against chart_version=6.0.0-rc.5.	2026-05-12 16:30:53 +01:00
rcourtman	c6d5c4590a	Keep agent heartbeats stream local	2026-05-12 16:14:28 +01:00
rcourtman	ab62b46c1f	Fix helm chart agent.enabled by routing through main pulse image The chart's agent.image.repository defaulted to ghcr.io/rcourtman/pulse-agent, an image that has never been published. publish-docker.yml only pushes rcourtman/pulse; the Dockerfile defines an agent_runtime stage that could be published but it isn't, and commit `da7969fb4` from earlier in this session removed the corresponding pulse-agent attestation expectations — a clear signal the separate agent image was intentionally dropped without updating the chart. Customers running `helm install pulse pulse/pulse --set agent.enabled=true` were silently hitting ImagePullBackOff on the agent DaemonSet. Route the chart through the main rcourtman/pulse image instead. To make that work without per-arch chart overrides, the runtime stage in the Dockerfile now creates an arch-resolved /usr/local/bin/pulse-agent symlink to the right /opt/pulse/bin/pulse-agent-linux-{amd64,arm64,armv7} binary. The chart's agent.command default is /usr/local/bin/pulse-agent, which overrides the server ENTRYPOINT and runs the pod as a unified agent on whichever arch the node provides. agent.yaml renders the command via toYaml so list values pass through cleanly. KUBERNETES.md's DaemonSet example switches from the arch-hardcoded /opt/pulse/bin/pulse-agent-linux-amd64 to the new arch-resolved path, restoring multi-arch portability of the docs example. validate-release.sh asserts the symlink exists, points at one of the three supported Linux arch binaries, and is executable in the published image. A new TestHelmAgentRuntimePointsAtRealImage pins the chart defaults, the template wiring, the Dockerfile symlink, and the validate-release.sh guard so the regression class can't quietly resurface. Governance: extend the helm-chart-release-runtime verification policy's exact_files to include scripts/installtests/build_release_assets_test.go (matching its existing pin set for related deployment-installability policies); update the subsystem_lookup_test.py fixture that pins the exact_files list; document the agent-image and pulse-agent symlink contract in deployment-installability.md Extension Point 7. Verified locally: `helm lint` passes; `helm template --set agent.enabled=true` renders a DaemonSet with image rcourtman/pulse:6.0.0, command ["/usr/local/bin/pulse-agent"], args ["--enable-docker", "--enable-host=false"]. End-to-end image build + agent DaemonSet smoke will run via helm_smoke on the next release once rcourtman/pulse:6.0.0 is published.	2026-05-12 16:11:56 +01:00
rcourtman	0b98cded45	Bulk count agent fleet approvals	2026-05-12 16:00:31 +01:00
rcourtman	9496d6f6d8	Fix four customer-facing doc drift findings (RBAC, OIDC, helm, webhooks) RBAC.md (alerts:read → monitoring:read): The example team-setup table told operators to issue API tokens with an "alerts:read" scope. That scope does not exist in pkg/auth/scopes.go; defined scopes are monitoring:read, settings:read, etc. /api/alerts/ is gated by RequireAuth (no specific scope required), so an integrator issuing a token would naturally pick the closest real scope — monitoring:read — and that is what the doc should have shown. OIDC.md (OIDC_GROUP_ROLE_MAPPINGS, OIDC_CA_BUNDLE): Both env vars were documented but zero code reads them. OIDC config is per-provider in internal/config/sso.go and OIDCProviderConfig in internal/config/oidc.go: groupRoleMappings is a map field; caBundle is a path field. Replace both env-var snippets with the actual UI/API path so operators following the secure-install flow don't silently get no group mapping or no custom CA trust. Same drift pattern as the earlier rc.1 → rc.5 PULSE_RELAY_* aspiration-without-implementation. WEBHOOKS.md (missing helpers): notifications.go's templateFuncMap registers jsonString and pathescape on every webhook template, but the helper list only documented title / upper / lower / printf / urlquery / urlencode / urlpath. Add both, with a short note that jsonString is the safe way to embed arbitrary string values inside a JSON payload — Pulse's shipped templates use it everywhere a value goes inside JSON, and operators writing custom templates were missing the canonical escape primitive. KUBERNETES.md (helm path + markdown fence): - "deployment.strategy.type=Recreate" was the wrong helm path. The chart's strategy block is at the top level (deploy/helm/pulse/values.yaml line 9), so `strategy.type=Recreate` is what operators must actually --set. Following the broken path produced no override and left RWO PVC deployments on the default RollingUpdate, the exact Multi-Attach failure mode the note was trying to warn against. - Trailing ```text on the helm-template code block closed the fence but tagged it as a language, breaking markdown rendering in some readers. Reduced to plain ```. All four are doc-only changes; no code reads the names they document.	2026-05-12 15:54:24 +01:00
rcourtman	f16aa8a65c	Parse stored subscription states for entitlement refresh	2026-05-12 15:17:53 +01:00
rcourtman	4cf16ec9cb	Stabilize summary chart SLOs	2026-05-12 14:40:55 +01:00
rcourtman	1726cf47b4	Harden Patrol and Assistant action boundaries	2026-05-12 12:06:27 +01:00
rcourtman	7c0f654253	Wire install.sh smoke gate into create-release.yml release pipeline The smoke gate workflow exists from commit `065ebdb27` but until it is called from create-release.yml it does not actually protect any release. That is exactly the regression class that let rc.1 → rc.5 ship with a broken install.sh: nothing in the release pipeline exercised the documented secure-install flow against the published GitHub Release URL. Wire install-sh-smoke.yml as a downstream workflow_call after validate_release_assets succeeds. Gated on historical_asset_backfill_only != 'true' since asset-backfill flows re-upload to an already-published release and the smoke would just re-confirm what hasn't changed. Pre-install structural checks were verified locally against rc.5 — the gate correctly fires the banner / agent-banner / --version handler assertions against the broken release. The end-to-end container portion (privileged systemd boot, install.sh execution, /api/health, /api/version match) will run for the first time on the next release that publishes through this workflow; existing retry loops on systemd readiness, service activation, and health endpoint absorb transient runner flakes. Add install-sh-smoke.yml to the deployment-installability canonical files and to the release-promotion proof policy's match_files, and add scripts/installtests/build_release_assets_test.go to that policy's exact_files (matching the existing pin set for related policies in the deployment-installability subsystem). Update subsystem_lookup_test.py fixtures that pinned the exact_files list literally. Pinned the create-release.yml wiring in build_release_assets_test.go alongside the validate-release-assets wiring so the smoke step cannot silently be unwired. Document the gate's contract responsibilities in deployment-installability Extension Point 2.	2026-05-12 11:44:04 +01:00
rcourtman	b69c8c8007	Wire PULSE_RELAY_ENABLED and PULSE_RELAY_SERVER as real env overrides These two env vars were documented as relay overrides in v6 docs since March 18 (CONFIGURATION.md, RELAY.md, and the frontend-served doc copy) but no code ever read them. Operators trying to bootstrap relay headlessly saw no effect. Implement them rather than remove the documentation. Headless and container deployments now have a real path to enable relay and point it at a private endpoint without going through Settings → Relay. internal/relay/config_env.go: - ApplyEnvOverrides(*Config) mutates relay.Config in place. - PULSE_RELAY_ENABLED accepts true/false/yes/no/1/0/on/off (case- insensitive). Unrecognized values log a warning and leave the file value untouched — important so "unset" reads differently from "explicit false." - PULSE_RELAY_SERVER goes through the existing validateRelayServerURL check; invalid URLs log a warning and fall through. internal/config/persistence_relay.go: LoadRelayConfig calls ApplyEnvOverrides after the file load and after the default-fallback when relay.enc is absent, so the env override applies on every load. Tests cover unset / true / false / garbage-bool / valid-URL / invalid-URL / both-together / nil-config paths in the relay package, plus two end-to-end tests in internal/config that prove the override flows through LoadRelayConfig against a real persisted file and against the missing-file default branch. Restore the env-var docs with the correct default URL (the full wss://relay.pulserelay.pro/ws/instance, not the bare hostname the original aspirational table claimed) and add an explicit precedence note: saving from the UI after an env override persists the env-effective state to disk, so clearing the env alone does not revert. Add internal/relay/config_env_test.go to the relay-runtime registry's desktop-relay-runtime exact_files so the new code surface is proof-tracked. Update the matching pin in subsystem_lookup_test.py. Extend the relay-runtime contract Extension Point 3 to document the override semantics LoadRelayConfig must satisfy.	2026-05-12 11:18:31 +01:00
rcourtman	7cc90a8969	Remove fictional PULSE_RELAY_ENABLED / PULSE_RELAY_SERVER env overrides CONFIGURATION.md and its frontend-served copy advertised an "Environment Overrides" table for relay with two env vars, but neither has ever existed in code. git log -S "PULSE_RELAY_ENABLED" -- 'internal/*.go' is empty; relay config is entirely file-driven (relay.enc) and configured from Settings → Relay. The documented default "relay.pulserelay.pro" was also incorrect — the actual code default is the full ws URL "wss://relay.pulserelay.pro/ws/instance". Replace the misleading table with a clarifying paragraph that states the truth: relay has no env-var overrides, it's UI-configured, and the actual default server URL. Operators trying to deploy headlessly with the documented env vars would silently get the default config because the runtime never reads them; better to say so explicitly. Found during a sweep of all 29 PULSE_ env vars documented in CONFIGURATION.md against the codebase. These two were the only doc-only drift; the other 27 all have real production references.	2026-05-12 11:05:07 +01:00
rcourtman	3b57c880ba	Fix manual systemd install snippet binary path The "Manual systemd install (advanced)" example in docs/INSTALL.md told users to run `sudo install -m 0755 pulse /usr/local/bin/pulse`, but the release tarballs extract to ./bin/pulse, not ./pulse. Following the documented command literally produced "install: cannot stat 'pulse'". Show the download/extract commands explicitly so users know the tarball shape, and correct the install path to `bin/pulse`. validate-release.sh already pins the ./bin/pulse tarball layout, so this is the matching documentation side of that contract.	2026-05-12 10:51:03 +01:00
rcourtman	ce7d7c1956	Fix stale README signature key and guard against future drift The README's secure-install snippet has pinned the wrong ed25519 key since commit `a60fa03d7` (April 22, 2026), so v6 rc.2 through rc.5 all shipped with a documented verification step that does not work. I downloaded the published rc.5 install.sh + install.sh.sshsig and ran ssh-keygen -Y verify with both candidate keys: Ds21c5... (README's pinned key) -> Could not verify signature MZd/DaH... (key embedded in install.sh and pulse-auto-update.sh) -> OK Customers who actually followed the README's secure-install path saw "Could not verify signature" and aborted. Most users curl-pipe the script unverified so the drift went unreported. Replace the stale key in README.md and docs/INSTALL.md with the actual pipeline signing key (MZd/...). Add a validate-release.sh smoke that extracts the README's pinned key and runs the exact ssh-keygen -Y verify command against the signed install.sh.sshsig. Any future drift between documented key and actual signing key fails the release before publish. Lock both the correct-key presence and the stale-key absence in build_release_assets_test.go for README and docs/INSTALL.md so a manual edit cannot regress the docs back to the broken state.	2026-05-12 10:30:42 +01:00
rcourtman	49412357af	Ship the Pulse server install.sh as the GitHub Release asset Every v6 RC (rc.1 through rc.5, ~30 days) shipped the wrong install.sh. build-release.sh was copying the rendered AGENT installer into release/install.sh, but adapter_installsh, scripts/pulse-auto-update.sh, the root install.sh's own --rc/--stable/--version flows, and the README quickstart all fetch that asset and run `bash install.sh --version vX.Y.Z`. Since the agent installer rejects --version with "Unknown argument", the LXC quickstart, the in-product "Update Pulse" button on systemd/proxmoxve deployments, and the pulse-auto-update.sh systemd timer were all broken for every RC. Fix the build-release.sh copy to publish the root server installer. The agent installer continues to ship inside tarballs at ./scripts/install.sh and inside Docker images at /opt/pulse/scripts/install.sh, and is served at the running Pulse server's /install.sh endpoint — none of those paths change. Only the top-level GitHub Releases asset moves from agent to server. Update build_release_assets_test.go to lock in the new publishing rule and ban the reverse drift, replacing the March 18 "legacy root install.sh" guard that was the original mistake. Add a validate-release.sh smoke that catches the regression mode this hid: the published install.sh must have the Pulse server banner, the --version) arg handler, must not contain the agent banner, and `bash install.sh --help` must print the server installer's version-pinning help line. These checks run as part of validate-release-assets.yml against the post-publish asset bundle so a future swap back cannot slip through. Document the asset identity rule and the validate-release.sh guard in the deployment-installability contract so any future change to the publishing pipeline has to update the contract or trip the shape guard.	2026-05-12 10:24:28 +01:00
rcourtman	e1a6dff2a1	Document Pulse Cloud launch in v6 release notes	2026-05-11 23:18:05 +01:00
rcourtman	7951da526b	Add release_cycle_artifact_globs so RC ceremony skips contract-update requirement	2026-05-11 22:55:29 +01:00
rcourtman	816b3985ba	Make blocked-record drift self-fixable via BLESS_GOVERNANCE_FIXTURES env var	2026-05-11 22:31:03 +01:00
rcourtman	d38f3d9217	Update release-control fixtures after pulse-agent Docker removal	2026-05-11 22:21:31 +01:00
rcourtman	8ff69daa43	Bump install pins to rc.5 and refresh test fixtures for Patrol readiness + Unraid host profile tokens	2026-05-11 18:02:52 +01:00
rcourtman	894ea89af9	Refresh RC5 packet validation range for plain-JSON tool-call sanitisation	2026-05-11 17:09:57 +01:00
rcourtman	366bf8d127	Prepare v6.0.0-rc.5 release packet	2026-05-11 16:52:31 +01:00
rcourtman	d02255907c	Fail closed when patrol_resolve_finding verifier is inconclusive ResolveFinding adapter previously logged a warning and allowed the LLM's resolve to proceed when the deterministic verifier returned an error (timeout, executor unavailable, etc.). That's fail-open: any verifier failure let the auto_resolved → re-detected cycle continue, exactly the pattern the rest of this branch's patrol_resolve_finding work spent commits closing. The "Backup failed" finding on the live preview still cycled once post- migration because of this path — verifier returned an ErrVerificationUnknown and resolve was permitted. Resolution of an event/persistent category finding is effectively permanent (next detection registers as a regression and inflates counters and pollutes the trust strip). When the deterministic verifier cannot confidently say the failure signal is gone, we don't have grounds to honor the LLM's judgment — the LLM's "current investigation didn't surface a fresh failure" is exactly the unreliable signal that produced bogus cycles. Switches the inconclusive-verifier branch from log-and-allow to log-and-reject, returning an error to the tool so the LLM can retry or escalate to the operator. The verifier-still-detects- signal path stays as-is (it was already fail-closed). Test: TestPatrolFindingCreatorAdapter_ResolveFinding_RejectsWhenVerifierIsInconclusive exercises the path by calling ResolveFinding on a backup-failed finding through a PatrolService with no chat service wired (getExecutorForVerification returns ErrVerificationUnknown). Asserts the error mentions 'inconclusive' and that ResolvedAt remains nil. Contract: extends the deterministic-resolve-gate clause in the ai-runtime canonical-files completion-obligations to name the fail-closed-on-inconclusive policy explicitly.	2026-05-11 10:53:06 +01:00
rcourtman	9f43a22fb1	Bound patrol-main session at 200 messages to stop unbounded disk growth The patrol-main chat session was reused across every scheduled Patrol run with no upper bound. After a month of runs the file had grown to 16 MB / 3,593 messages, and every AddMessage rewrote the whole file to disk — so the I/O cost per Patrol run was scaling with total session age, not with the run's own output. Across all chat sessions on this dev instance, the ai_sessions directory hit 676 MB / 1,629 files. The stateless-Patrol-input fix (commit `43760fb0d`) stopped loading the session back into the agentic loop, but Patrol still wrote each run's messages to the session for the Pulse Assistant sidebar's forensic view. That write path is what this commit bounds. ExecutePatrolStream now calls SessionStore.TrimMessages(200) after each run, keeping roughly the last two runs' worth of messages — enough for the sidebar to show recent activity, far short of unbounded growth. The next Patrol run on a bloated session will drop the historical 3,000+ messages down to 200 on its first write, so existing storage debt clears on its own without a separate migration. User-driven chat sessions are unaffected: TrimMessages with keepMostRecent <= 0 is a no-op, and callers that want full history retention simply don't call it. Only Patrol's forensic session is capped. The canonical Patrol forensic log is the PatrolRunRecord history surfaced at /api/ai/patrol/runs — that's the durable record with structured fields. The chat-session-shaped file is a sidebar convenience, not the source of truth. Three tests guard the boundary: - TrimMessages keeps the most recent N (50 messages trimmed to 10 → messages 40-49 remain) - TrimMessages is a no-op below threshold (5 messages, cap 200 → 5 messages remain) - TrimMessages with non-positive keep is a no-op (3 messages, cap 0 or -5 → 3 messages remain) ai-runtime contract updated.	2026-05-10 23:19:08 +01:00
rcourtman	5dcdbfabf0	Document chat-side cost recording in ai-runtime contract Follow-up to `a0b3bc7ed` which closed the chat.Service cost-ledger gap. ai-runtime.md gains a Current State paragraph documenting: - The pre-fix bug (chat accumulated tokens via SSE done envelope but never recorded a cost.UsageEvent server-side; chat is the bulk of AI token spend so the dashboard was dramatically understating cost). - The fix shape (recordChatTurnCost runs after every loop return, success or error since the operator was billed regardless). - The threading path (chat.Config.CostStore wired by the router from AISettingsHandler.GetAIService.CostStore()). - The double-recording invariant (ExecutePatrolStream is deliberately not changed; its caller patrol_ai.go records via its own helper). - UseCase="chat" matches the canonical taxonomy noted on cost.UsageEvent.UseCase ("chat" or "patrol").	2026-05-10 23:16:47 +01:00
rcourtman	99c499ade7	Repair orphan tool_calls in convertToProviderMessages Defense-in-depth for the malformed-history bug pattern. The Patrol fix made patrol-main runs stateless, but Assistant chat sessions are inherently multi-turn and must keep their history. Any chat session that ends mid-tool-call — network drop, ctx timeout, browser crash, uncaught panic, any interrupt that fires between "model emits tool_calls" and "agentic loop appends all tool results" — leaves the persisted session with orphan tool_call_ids. The next message that loads this history is rejected with the same provider error that flapped Patrol for 33 days: An assistant message with 'tool_calls' must be followed by tool messages responding to each 'tool_call_id'. For Patrol this was fixable by ignoring the session. For Assistant it isn't; the conversation context is the product. convertToProviderMessages now ends with a repairOrphanToolCalls pass that scans every assistant message with tool_calls and inserts synthetic is_error tool result messages immediately after the assistant turn for any tool_call_id that has no matching downstream result. The synthetic content is marked is_error=true and explains the interruption so the model can retry the same call or proceed without that data — preserving conversational continuity while satisfying the provider's structural-validity check. This guards every conversation that crosses convertToProviderMessages, not just Assistant chat. If Patrol ever changes back to loading session history, the same safety net applies. If a new entry point appears for some other LLM flow, it gets the repair for free. Three tests guard the boundary: - Orphan injection (3 tool_calls, only 1 result → 2 synthetic results, marked is_error with interrupted explanation, ordering preserved) - Clean no-op (all tool_calls fulfilled → no synthetic messages, no is_error pollution) - Existing truncation test still passes (assistant message with both tool_calls and own tool_result → no repair needed, tool_call_id matches in same message) ai-runtime contract updated.	2026-05-10 23:10:13 +01:00
rcourtman	e657f6ace9	Suppress assessment error penalty after trailing-success recovery The overall-health "Recent Patrol errors" coverage factor in summarizeRecentPatrolCoverage was anchoring the score to a stale ratio: it counted errors across the last 10 runs without weighting recency. After Pulse fixed two compounding Patrol bugs today, four consecutive successful runs (50+ tool calls each) followed six earlier failures. The assessment kept showing C/65 with the prediction "most recent Patrol runs encountered errors (6 of 10)" — directly contradicting the fact that every recent run had succeeded. Operators reading that score would conclude Pulse Patrol is still broken. It isn't. The fix dragged the grade. This commit adds a recovery-suppression check: count trailing successful full Patrol runs from the most-recent end of the window (GetAll returns newest-first), skipping non-full runs. When three or more consecutive trailing successes exist — roughly a 9-hour clean stretch at the default 3-hour cadence — the error penalty drops entirely. The score reflects current reality. Three is conservative: a single recovery run could be a transient win; three consecutive demonstrate the underlying fix is sticking. Below the threshold, the existing ratio-tiered penalty still applies so partially-recovered states still register. Two tests guard the boundary: - 6 historical errors + 3 trailing successes → no coverage factor (suppressed) - 6 historical errors + 2 trailing successes → coverage factor remains (recovery incomplete) Live verified after this commit lands: the assessment that's been stuck at C/65 since the malformed-history fix will recompute to A/B grade as soon as the trailing 3 successful runs are recognized by the same recent-runs query. ai-runtime contract updated.	2026-05-10 23:02:57 +01:00
rcourtman	15f6881d89	Document the chat-side narrator wiring in ai-runtime contract Follow-up to `03463c1bf` which threaded the per-tenant report narrators through chat.Config -> tools.ExecutorConfig -> PulseToolExecutor so pulse_summarize can produce AI-narrated synthesis in chat instead of heuristic-only. ai-runtime.md's Current State paragraph documents the wiring: - chat.Config carries three optional fields (ReportNarrator, ReportFleetNarrator, ReportFindingsProvider) threaded through to the executor at session construction time. - The router installs a SetReportNarratorResolver closure that mirrors the reporting handler's pattern, asking the AISettingsHandler for the per-tenant ai.Service and returning it as the implementation for all three roles when AI is enabled. - Unconfigured tenants still get the heuristic fallback — matching the report PDF's graceful-degradation posture. - AI-narrated chat synthesis uses the same provider, sanitizer, model selection, cost ledger (report_narrative / report_narrative_fleet use-cases), and budget gate the report PDF endpoint enforces, so there is exactly one canonical synthesis path for both surfaces.	2026-05-10 22:51:17 +01:00
rcourtman	7a7b3c9d30	Gate LLM patrol_resolve_finding on deterministic verifier for event findings Backup-failed was flapping detected → auto-resolved → re-detected ten times in a single day. Each cycle the LLM saw "PBS backups look healthy in my current snapshot" during a Patrol pass, called patrol_resolve_finding(backup-failed), and the adapter at patrol_findings.go:985 called Resolve(findingID, true) directly — no category check, no evidence verification. The contract docs at findings.go:52-67 explicitly say event / persistent categories (backup, reliability, security, general) "stay active until explicitly resolved — either by the LLM calling patrol_resolve_finding with evidence, or by operator action." That "with evidence" was never enforced. This commit enforces it. The adapter now checks two conditions before honoring an LLM resolve: - finding.Category does NOT support stale-auto-resolve (per the contract function CategorySupportsStaleAutoResolve), AND - a deterministic verifier exists for finding.Key (currently smart-failure and backup-failed) When both are true, the adapter runs VerifyFixResolved on the finding's resource. If the verifier still detects the failure signal, the LLM gets an error explaining why the resolve was rejected and that the underlying issue must be fixed first. If the verifier confirms the signal has cleared, the resolve proceeds with grounded evidence. Categories that support stale-auto-resolve (performance, capacity) bypass the gate entirely — the LLM can resolve them based on absence per the existing contract. Keys without a verifier also fall through to current behavior so we don't block resolves for categories we haven't built verifiers for yet. New PatrolService.hasDeterministicVerifierForKey() helper keeps the gate's verifier list in lockstep with the switch in verifyFixDeterministically. Tests cover the three branches: - performance category → gate skipped, resolve proceeds - reliability + no verifier → gate falls through, resolve proceeds - hasDeterministicVerifierForKey for known and unknown keys ai-runtime contract updated.	2026-05-10 22:48:09 +01:00
rcourtman	3b06b4b09d	Document the non-rendering reporting engine entry points in api-contracts Follow-up to `e32d4ede4` (NarrativeFor + FleetNarrativeFor entrypoints) and `1fe5d6853` (pulse_summarize tool). api-contracts.md gains a Current State paragraph documenting that the Engine interface now exposes two non-rendering entry points alongside Generate/GenerateMulti, with the explicit invariant that test stubs implementing the interface must implement these methods so the contract is honoured across the entire surface, not just the export-shaped subset. ai-runtime.md was updated in the parallel-agent commit `ee2de2703` (which picked up the pulse_summarize paragraph when restating auto-resolve gating), so no further edit is needed there.	2026-05-10 22:38:07 +01:00
rcourtman	ee2de2703b	Update ai-runtime contract to cover both absence-based auto-resolve paths reconcileStaleFindings (commit `b44d5892f`) and the resource-absent gate added in commit d6bb89a1c both use the same CategorySupportsStaleAutoResolve helper, but the contract Current State only documented the first path. Rewords the paragraph so both are covered explicitly: stale-cleanup and resource-absent both gate on the whitelist, both reject event/persistent categories, and the bogus-absence examples extend to cover the resource-absent failure mode (transient agent reconnect, container churn, refresh gap).	2026-05-10 22:34:25 +01:00
rcourtman	ac0f361372	Document the report-narrator detection-boundary invariant Follow-up to the narrator prompt changes that forbid acting as a parallel detector. ai-runtime.md gains a Current State paragraph documenting that both report narrator system prompts encode an explicit detection-boundary invariant: warning or critical severity classifications must be backed by a Patrol finding, an alert, or a hard-threshold breach in the structured input, not by metric inference alone. Patterns the narrator notices without that backing are constrained to info severity. This keeps the narrative retrospective on Patrol's work and prevents silent shadow-classification competing with Patrol's detection rules. api-contracts.md gains an equivalent paragraph in the reporting contract section. The same rule applies to outliers and patterns on the fleet path, and to recommendations on both. The deterministic heuristic narrators were already constrained to the same threshold rules; this aligns the AI path with the same evidence surface the fallback uses, so the report PDF cannot become a back-door detection surface that diverges from the findings store.	2026-05-10 22:02:47 +01:00
rcourtman	c6a786a36d	Broaden regression-counter reset to include LLM-driven cycles Initial detector (commit `942f9ca0f`) only matched on the two legacy absence-signature reason strings — but the Backup failed finding on the live preview showed 6 auto_resolved events all with empty messages, produced by the LLM patrol_resolve_finding tool via Resolve(_, true). Counter stayed at 6× after the previous migration ran. New detector: any active finding whose category is NOT eligible for stale-auto-resolve (i.e. anything other than performance or capacity) AND has any auto_resolved event on its lifecycle is treated as having an inflated counter. The rationale is the same rule the category gate already established — for event/persistent categories there is no legitimate absence-driven resolution path, so any auto_resolved was either a removed-bogus-path stamp or an LLM judgment call that repeatedly reverted through regressions on the next run. The cumulative count is no signal either way. Performance/capacity findings retain their counter because the metric-cleared resolution model is sound there. Test extended to cover four cases: LLM-driven cycle resets, legacy-reason cycle resets, eligible-category preserves counter, non-eligible category without any auto_resolved preserves counter. Plus the existing idempotency case (already-reset finding stays reset and is not re-applied).	2026-05-10 21:47:57 +01:00
rcourtman	942f9ca0f5	Reset regression counters polluted by bogus auto_resolve cycles The Backup failed finding on the live preview showed "regressed 6×" when the actual regression count of genuine recurrences was at most 1 or 2 — the rest were the system fighting itself, driven by the absence-based auto_resolve paths that were gated (category whitelist) or removed (alert-mirror rip) earlier in this branch. Counter stayed sticky after those fixes landed, so the trust strip and finding badges still surfaced the inflated number. FindingsStore.SetPersistence load pass now scans each active finding's lifecycle for the two known bogus-signature auto_resolved reasons ("No longer detected by patrol", "Resource no longer exists in infrastructure"). If found, RegressionCount is reset to 0 and LastRegressionAt is cleared, and a regression_counter_reset lifecycle event is appended so the migration is idempotent. A finding that already has a regression_counter_reset event is left alone; any regressed events that accrued after the reset are genuine and stand. findingHasBogusAutoResolveCycle returns true only when the lifecycle contains a bogus auto_resolved and no prior reset event, so the function is the single point of truth for the migration decision and is straightforward to test. Test covers three cases: finding with bogus signature gets reset, finding with empty-message auto_resolved (LLM-driven, legitimate) keeps its counter, finding already migrated is not re-reset. Updates ai-runtime Current State to document the second migration on top of the alert-mirror retirement.	2026-05-10 21:44:29 +01:00
rcourtman	590671ffbb	Retire legacy "Active alert detected" findings on load The previous commit removed the detectAlertSignals path so no NEW alert-mirror findings are emitted, but the findings already persisted from earlier builds stay in the store indefinitely — nothing cleans them up (reconcileStaleFindings is gated on performance/capacity categories, the LLM resolves them just to have them re-detected next run except now the deterministic emitter is gone so re-detection can't happen, but they're left sitting as active findings draining the trust strip and score). FindingsStore.SetPersistence now runs a one-shot retirement pass on load: any active finding with title "Active alert detected", source ai-analysis, and category general is auto-resolved with reason "Patrol no longer mirrors alerts; the Alerts page is the canonical surface for currently-firing alerts." The pass appends an auto_resolved lifecycle event so the retirement is auditable, syncs the loop state to resolved, and schedules a save so the cleanup persists. Idempotent: after the first load with this code, no findings match the signature so the pass is a no-op. Defensive: the signature requires all three fields (title + source + category) to match before retiring, so an operator-authored finding that happens to share the title is left untouched. Test covers the mirror case, the matching-title-but-foreign-source case (must NOT retire), and an unrelated active finding (must NOT retire), plus verifies the retired state persists back through the persistence layer. Updates ai-runtime Current State to record the migration path.	2026-05-10 21:39:09 +01:00
rcourtman	271d12ecab	Stop mirroring alerts into Patrol findings The deterministic signal pipeline ran the pulse_alerts tool output through detectAlertSignals and produced a SignalActiveAlert for every firing alert, which Patrol then materialized as an "Active alert detected" finding (source: ai-analysis, category: general). The system prompt at the top of patrol_ai.go explicitly tells the LLM not to duplicate alerts — but the deterministic emitter was duplicating them anyway, behind the LLM's back. Symptoms observed in the wild: - 9 active "Active alert detected" findings in Patrol, every one a duplicate of an existing alert already on the Alerts page. - The LLM, doing what the prompt told it, resolved each mirrored finding via patrol_resolve_finding. Next run the alert was still firing and Patrol re-emitted the signal → finding regressed. Lifecycle showed several auto_resolved → re-detected → regressed cycles per finding within hours. - Health score dragged down by issues the operator already saw on the Alerts page, with no operator action possible from Patrol that wasn't already available from Alerts. Rip detectAlertSignals entirely, remove the pulse_alerts case from the signal-extraction switch, drop SignalActiveAlert plus its key / title / recommendation entries. Convert the prior TestDetectSignals_ActiveAlert into a regression guard that locks in the no-mirror behavior. Updates the ai-runtime subsystem Current State to record the decision: Patrol does not duplicate the Alerts surface; alerts own their own lifecycle, surface, and acknowledgement model.	2026-05-10 21:33:41 +01:00
rcourtman	7bd596d378	Document the fleet-narrative surface in ai-runtime and api-contracts Follow-up to the fleet-level AI narrative refactor: ai-runtime.md gains a Current State paragraph documenting that internal/ai.Service also implements pkg/reporting.FleetNarrator with its own use-case label (report_narrative_fleet) so fleet vs single-resource spend is distinguishable in the cost ledger and budget gate, and that the single-resource narrator is intentionally not propagated through the multi-report path. api-contracts.md gains a paragraph documenting the new optional FleetNarrator field on MultiReportRequest, the new FleetNarrative field on MultiReportData, the rendered fleet section (executive prose, named outliers, cross-cutting patterns, recommendations, optional period-comparison, AI provenance footer), and the explicit invariant that the deterministic resource summary table stays rendered from the same per-resource aggregates so every named outlier is verifiable against the table below. Dependent subsystems (agent-lifecycle, performance-and-scalability, storage-recovery) remain unchanged: their Extension Points reference internal/api/ broadly but agent lifecycle, perf scaling, and storage recovery semantics have no delta from this change.	2026-05-10 21:24:04 +01:00
rcourtman	9905383f70	Document the report-narrative surface in ai-runtime and api-contracts Follow-up to the report narrative refactor: ai-runtime.md gains a Current State paragraph documenting that internal/ai.Service now implements pkg/reporting.Narrator and pkg/reporting.FindingsProvider, and that the narrator/findings surfaces inherit the same provider, sanitizer, model selection, budget enforcement, and fail-closed governance as the rest of the canonical AI runtime. api-contracts.md gains a paragraph documenting the new optional Narrator and FindingsProvider fields on MetricReportRequest, the new Narrative, PriorPeriod, and Findings fields on ReportData, the three new PDF sections (executive prose, Period-over-period changes, AI provenance footer), and the explicit invariant that the deterministic data surface (charts, stats, alerts, storage, disks) stays rendered from the same aggregates so every AI claim is verifiable against adjacent data. Multi-resource fleet reports intentionally remain heuristic-only at this transport layer. The dependent subsystems flagged by the staged-shape guard (agent-lifecycle, performance-and-scalability, storage-recovery) genuinely have no contract delta from the prior commit: their Extension Points reference internal/api/ broadly, but agent lifecycle, performance scaling, and storage recovery semantics are unchanged.	2026-05-10 21:00:39 +01:00
rcourtman	a343ba0126	Tier patrol-coverage health impact by recent-error ratio The overall health score chip read "A · 95/100" while the same assessment card said "Coverage incomplete · Recent Patrol runs encountered errors · Verify full coverage." Those messages contradicted because the "recent errors but had a successful run" coverage factor used a flat -10 penalty regardless of the error ratio. With one successful manual run among many failed startup runs, the math stayed in the A band and the score directly undermined the warning it sat next to. The factor now tiers the impact by ratio of errored runs to relevant runs in the scoring window: >50% errored → -30, "Most recent Patrol runs encountered errors (N of M); the current health summary is not reliable until coverage stabilizes." >25% errored → -20, "Recent Patrol runs encountered errors (N of M), so the current health summary may be incomplete." else → -10, original light-tier description. A single transient error stays in grade A; dominant-error periods drop out of A so the grade matches the warning. Adds two tests covering both ends of the new tiering and updates the ai-runtime subsystem contract Current State section.	2026-05-10 19:33:18 +01:00

1 2 3 4 5 ...

2466 commits