promote-floating-tags.yml's `workflow_run` chain off publish-docker.yml
silently stopped firing for rc.3 → rc.5 because publish-docker failed at
the now-removed pulse-agent push step. Customers pulling
rcourtman/pulse:latest, :6, or :6.0 stayed on whatever the previous
successful release had tagged — there was no warning anywhere that the
floating tags were stale.
Same fix pattern as install-sh-smoke (commit 7c0f65425) and
publish-helm-chart (commit 14c79a28e): add a workflow_call trigger to
promote-floating-tags.yml and call it explicitly from create-release.yml
after validate_release_assets succeeds.
Gating on validate_release_assets is intentional: that workflow waits
for the docker image to be pullable from the registry (with retry
backoff), so by the time it succeeds the image manifest exists and
re-tagging it to latest/major/minor cannot point at vapor.
The legacy workflow_run trigger stays as the primary path; this just
guarantees promotion even when the chain doesn't fire.
Tag-resolver step now accepts inputs from workflow_call / workflow_dispatch
and only falls back to the workflow_run derivation when inputs are absent,
so all three entry paths converge on the same identity.
Pinned in build_release_assets_test.go:
- new TestPromoteFloatingTagsReachableViaWorkflowCall pins the trigger
declaration and the input-priority resolver
- existing TestCreateReleaseUploadsPowerShellInstaller extended to pin
the promote_floating_tags job wiring (uses, tag, prerelease)
Contract delta in deployment-installability.md Extension Point 7
documents the same explicit-workflow_call requirement that applies to
publish-helm-chart, extended to promote-floating-tags.
v6 rc.1 → rc.5 published successfully but the Helm chart never landed on
rcourtman.github.io/Pulse/index.yaml — the index still ends at v5.1.30.
`helm install pulse pulse/pulse --version 6.0.0-rc.5` returns
chart-not-found; without `--version` helm pulls the latest published
chart (v5.1.30) into a customer's v6 cluster.
Root cause: GitHub does not fire `release: published` for releases that
were created as drafts and later PATCHed to draft=false. create-release.yml
deliberately uses that path so it can upload assets and run
validate-release-assets against the draft before promoting. Inspection of
the workflow run history confirms: every gh-API `release: published` event
since 2026-03-02 has been from manually-dispatched v5 stable cuts; zero
fired for v6 RCs published through the create-release pipeline.
Fix the same way install-sh-smoke was wired in commit 7c0f65425: add a
`workflow_call` trigger to publish-helm-chart.yml and call it explicitly
from create-release.yml as a downstream of validate_release_assets. The
chart-version resolver in publish-helm-chart now accepts inputs from
either workflow_call or workflow_dispatch and only falls back to the
release-event tag when no inputs are present, keeping the legacy
release-event path working for forks / manual gh-CLI publishes that
create with draft=false from the start.
Pinned in build_release_assets_test.go:
- create-release.yml wiring (publish_helm_chart job, version inputs)
- publish-helm-chart.yml workflow_call trigger declaration
- chart-version resolver's input-priority logic
Contract delta in deployment-installability.md Extension Point 7
documents the workflow_call requirement and forbids relying on the
release-published webhook for the create-release.yml draft-promotion
path.
The fix takes effect on the next release through the pipeline. Backfill
of the v6.0.0-rc.5 chart needs a one-time manual dispatch of
publish-helm-chart.yml against chart_version=6.0.0-rc.5.
The chart's agent.image.repository defaulted to ghcr.io/rcourtman/pulse-agent,
an image that has never been published. publish-docker.yml only pushes
rcourtman/pulse; the Dockerfile defines an agent_runtime stage that
*could* be published but it isn't, and commit da7969fb4 from earlier in
this session removed the corresponding pulse-agent attestation
expectations — a clear signal the separate agent image was intentionally
dropped without updating the chart. Customers running
`helm install pulse pulse/pulse --set agent.enabled=true` were silently
hitting ImagePullBackOff on the agent DaemonSet.
Route the chart through the main rcourtman/pulse image instead. To make
that work without per-arch chart overrides, the runtime stage in the
Dockerfile now creates an arch-resolved /usr/local/bin/pulse-agent
symlink to the right /opt/pulse/bin/pulse-agent-linux-{amd64,arm64,armv7}
binary. The chart's agent.command default is /usr/local/bin/pulse-agent,
which overrides the server ENTRYPOINT and runs the pod as a unified
agent on whichever arch the node provides. agent.yaml renders the
command via toYaml so list values pass through cleanly.
KUBERNETES.md's DaemonSet example switches from the arch-hardcoded
/opt/pulse/bin/pulse-agent-linux-amd64 to the new arch-resolved path,
restoring multi-arch portability of the docs example.
validate-release.sh asserts the symlink exists, points at one of the
three supported Linux arch binaries, and is executable in the published
image. A new TestHelmAgentRuntimePointsAtRealImage pins the chart
defaults, the template wiring, the Dockerfile symlink, and the
validate-release.sh guard so the regression class can't quietly
resurface.
Governance: extend the helm-chart-release-runtime verification policy's
exact_files to include scripts/installtests/build_release_assets_test.go
(matching its existing pin set for related deployment-installability
policies); update the subsystem_lookup_test.py fixture that pins the
exact_files list; document the agent-image and pulse-agent symlink
contract in deployment-installability.md Extension Point 7.
Verified locally: `helm lint` passes; `helm template --set agent.enabled=true`
renders a DaemonSet with image rcourtman/pulse:6.0.0,
command ["/usr/local/bin/pulse-agent"], args ["--enable-docker", "--enable-host=false"].
End-to-end image build + agent DaemonSet smoke will run via helm_smoke
on the next release once rcourtman/pulse:6.0.0 is published.
RBAC.md (alerts:read → monitoring:read):
The example team-setup table told operators to issue API tokens with an
"alerts:read" scope. That scope does not exist in pkg/auth/scopes.go;
defined scopes are monitoring:read, settings:read, etc. /api/alerts/ is
gated by RequireAuth (no specific scope required), so an integrator
issuing a token would naturally pick the closest real scope —
monitoring:read — and that is what the doc should have shown.
OIDC.md (OIDC_GROUP_ROLE_MAPPINGS, OIDC_CA_BUNDLE):
Both env vars were documented but zero code reads them. OIDC config is
per-provider in internal/config/sso.go and OIDCProviderConfig in
internal/config/oidc.go: groupRoleMappings is a map field; caBundle is a
path field. Replace both env-var snippets with the actual UI/API path so
operators following the secure-install flow don't silently get no group
mapping or no custom CA trust. Same drift pattern as the earlier rc.1 →
rc.5 PULSE_RELAY_* aspiration-without-implementation.
WEBHOOKS.md (missing helpers):
notifications.go's templateFuncMap registers jsonString and pathescape
on every webhook template, but the helper list only documented title /
upper / lower / printf / urlquery / urlencode / urlpath. Add both, with
a short note that jsonString is the safe way to embed arbitrary string
values inside a JSON payload — Pulse's shipped templates use it
everywhere a value goes inside JSON, and operators writing custom
templates were missing the canonical escape primitive.
KUBERNETES.md (helm path + markdown fence):
- "deployment.strategy.type=Recreate" was the wrong helm path. The
chart's strategy block is at the top level (deploy/helm/pulse/values.yaml
line 9), so `strategy.type=Recreate` is what operators must actually
--set. Following the broken path produced no override and left RWO
PVC deployments on the default RollingUpdate, the exact Multi-Attach
failure mode the note was trying to warn against.
- Trailing ```text on the helm-template code block closed the fence
but tagged it as a language, breaking markdown rendering in some
readers. Reduced to plain ```.
All four are doc-only changes; no code reads the names they document.
The smoke gate workflow exists from commit 065ebdb27 but until it is
called from create-release.yml it does not actually protect any release.
That is exactly the regression class that let rc.1 → rc.5 ship with a
broken install.sh: nothing in the release pipeline exercised the
documented secure-install flow against the published GitHub Release URL.
Wire install-sh-smoke.yml as a downstream workflow_call after
validate_release_assets succeeds. Gated on
historical_asset_backfill_only != 'true' since asset-backfill flows
re-upload to an already-published release and the smoke would just
re-confirm what hasn't changed.
Pre-install structural checks were verified locally against rc.5 — the
gate correctly fires the banner / agent-banner / --version handler
assertions against the broken release. The end-to-end container portion
(privileged systemd boot, install.sh execution, /api/health, /api/version
match) will run for the first time on the next release that publishes
through this workflow; existing retry loops on systemd readiness,
service activation, and health endpoint absorb transient runner flakes.
Add install-sh-smoke.yml to the deployment-installability canonical files
and to the release-promotion proof policy's match_files, and add
scripts/installtests/build_release_assets_test.go to that policy's
exact_files (matching the existing pin set for related policies in the
deployment-installability subsystem). Update subsystem_lookup_test.py
fixtures that pinned the exact_files list literally.
Pinned the create-release.yml wiring in build_release_assets_test.go
alongside the validate-release-assets wiring so the smoke step cannot
silently be unwired.
Document the gate's contract responsibilities in
deployment-installability Extension Point 2.
These two env vars were documented as relay overrides in v6 docs since
March 18 (CONFIGURATION.md, RELAY.md, and the frontend-served doc copy)
but no code ever read them. Operators trying to bootstrap relay headlessly
saw no effect.
Implement them rather than remove the documentation. Headless and
container deployments now have a real path to enable relay and point it
at a private endpoint without going through Settings → Relay.
internal/relay/config_env.go:
- ApplyEnvOverrides(*Config) mutates relay.Config in place.
- PULSE_RELAY_ENABLED accepts true/false/yes/no/1/0/on/off (case-
insensitive). Unrecognized values log a warning and leave the file
value untouched — important so "unset" reads differently from
"explicit false."
- PULSE_RELAY_SERVER goes through the existing validateRelayServerURL
check; invalid URLs log a warning and fall through.
internal/config/persistence_relay.go:
LoadRelayConfig calls ApplyEnvOverrides after the file load and after
the default-fallback when relay.enc is absent, so the env override
applies on every load.
Tests cover unset / true / false / garbage-bool / valid-URL / invalid-URL
/ both-together / nil-config paths in the relay package, plus two
end-to-end tests in internal/config that prove the override flows through
LoadRelayConfig against a real persisted file and against the
missing-file default branch.
Restore the env-var docs with the correct default URL (the full
wss://relay.pulserelay.pro/ws/instance, not the bare hostname the
original aspirational table claimed) and add an explicit precedence note:
saving from the UI after an env override persists the env-effective state
to disk, so clearing the env alone does not revert.
Add internal/relay/config_env_test.go to the relay-runtime registry's
desktop-relay-runtime exact_files so the new code surface is proof-tracked.
Update the matching pin in subsystem_lookup_test.py. Extend the
relay-runtime contract Extension Point 3 to document the override
semantics LoadRelayConfig must satisfy.
CONFIGURATION.md and its frontend-served copy advertised an "Environment
Overrides" table for relay with two env vars, but neither has ever existed
in code. git log -S "PULSE_RELAY_ENABLED" -- 'internal/**.go' is empty;
relay config is entirely file-driven (relay.enc) and configured from
Settings → Relay. The documented default "relay.pulserelay.pro" was also
incorrect — the actual code default is the full ws URL
"wss://relay.pulserelay.pro/ws/instance".
Replace the misleading table with a clarifying paragraph that states the
truth: relay has no env-var overrides, it's UI-configured, and the actual
default server URL. Operators trying to deploy headlessly with the
documented env vars would silently get the default config because the
runtime never reads them; better to say so explicitly.
Found during a sweep of all 29 PULSE_* env vars documented in
CONFIGURATION.md against the codebase. These two were the only doc-only
drift; the other 27 all have real production references.
The "Manual systemd install (advanced)" example in docs/INSTALL.md told
users to run `sudo install -m 0755 pulse /usr/local/bin/pulse`, but the
release tarballs extract to ./bin/pulse, not ./pulse. Following the
documented command literally produced "install: cannot stat 'pulse'".
Show the download/extract commands explicitly so users know the tarball
shape, and correct the install path to `bin/pulse`. validate-release.sh
already pins the ./bin/pulse tarball layout, so this is the matching
documentation side of that contract.
The README's secure-install snippet has pinned the wrong ed25519 key
since commit a60fa03d7 (April 22, 2026), so v6 rc.2 through rc.5 all
shipped with a documented verification step that does not work.
I downloaded the published rc.5 install.sh + install.sh.sshsig and
ran ssh-keygen -Y verify with both candidate keys:
Ds21c5... (README's pinned key) -> Could not verify signature
MZd/DaH... (key embedded in install.sh and pulse-auto-update.sh) -> OK
Customers who actually followed the README's secure-install path saw
"Could not verify signature" and aborted. Most users curl-pipe the
script unverified so the drift went unreported.
Replace the stale key in README.md and docs/INSTALL.md with the actual
pipeline signing key (MZd/...).
Add a validate-release.sh smoke that extracts the README's pinned key
and runs the exact ssh-keygen -Y verify command against the signed
install.sh.sshsig. Any future drift between documented key and actual
signing key fails the release before publish.
Lock both the correct-key presence and the stale-key absence in
build_release_assets_test.go for README and docs/INSTALL.md so a manual
edit cannot regress the docs back to the broken state.
Every v6 RC (rc.1 through rc.5, ~30 days) shipped the wrong install.sh.
build-release.sh was copying the rendered AGENT installer into
release/install.sh, but adapter_installsh, scripts/pulse-auto-update.sh,
the root install.sh's own --rc/--stable/--version flows, and the README
quickstart all fetch that asset and run `bash install.sh --version vX.Y.Z`.
Since the agent installer rejects --version with "Unknown argument", the
LXC quickstart, the in-product "Update Pulse" button on systemd/proxmoxve
deployments, and the pulse-auto-update.sh systemd timer were all broken
for every RC.
Fix the build-release.sh copy to publish the root server installer.
The agent installer continues to ship inside tarballs at ./scripts/install.sh
and inside Docker images at /opt/pulse/scripts/install.sh, and is served
at the running Pulse server's /install.sh endpoint — none of those paths
change. Only the top-level GitHub Releases asset moves from agent to server.
Update build_release_assets_test.go to lock in the new publishing rule
and ban the reverse drift, replacing the March 18 "legacy root install.sh"
guard that was the original mistake.
Add a validate-release.sh smoke that catches the regression mode this hid:
the published install.sh must have the Pulse server banner, the --version)
arg handler, must not contain the agent banner, and `bash install.sh --help`
must print the server installer's version-pinning help line. These checks
run as part of validate-release-assets.yml against the post-publish asset
bundle so a future swap back cannot slip through.
Document the asset identity rule and the validate-release.sh guard in the
deployment-installability contract so any future change to the publishing
pipeline has to update the contract or trip the shape guard.
ResolveFinding adapter previously logged a warning and allowed the
LLM's resolve to proceed when the deterministic verifier returned
an error (timeout, executor unavailable, etc.). That's fail-open:
any verifier failure let the auto_resolved → re-detected cycle
continue, exactly the pattern the rest of this branch's
patrol_resolve_finding work spent commits closing. The "Backup
failed" finding on the live preview still cycled once post-
migration because of this path — verifier returned an
ErrVerificationUnknown and resolve was permitted.
Resolution of an event/persistent category finding is effectively
permanent (next detection registers as a regression and inflates
counters and pollutes the trust strip). When the deterministic
verifier cannot confidently say the failure signal is gone, we
don't have grounds to honor the LLM's judgment — the LLM's
"current investigation didn't surface a fresh failure" is exactly
the unreliable signal that produced bogus cycles.
Switches the inconclusive-verifier branch from log-and-allow to
log-and-reject, returning an error to the tool so the LLM can
retry or escalate to the operator. The verifier-still-detects-
signal path stays as-is (it was already fail-closed).
Test: TestPatrolFindingCreatorAdapter_ResolveFinding_RejectsWhenVerifierIsInconclusive
exercises the path by calling ResolveFinding on a backup-failed
finding through a PatrolService with no chat service wired
(getExecutorForVerification returns ErrVerificationUnknown). Asserts
the error mentions 'inconclusive' and that ResolvedAt remains nil.
Contract: extends the deterministic-resolve-gate clause in the
ai-runtime canonical-files completion-obligations to name the
fail-closed-on-inconclusive policy explicitly.
The patrol-main chat session was reused across every scheduled
Patrol run with no upper bound. After a month of runs the file
had grown to 16 MB / 3,593 messages, and every AddMessage
rewrote the whole file to disk — so the I/O cost per Patrol
run was scaling with total session age, not with the run's own
output. Across all chat sessions on this dev instance, the
ai_sessions directory hit 676 MB / 1,629 files.
The stateless-Patrol-input fix (commit 43760fb0d) stopped
loading the session back into the agentic loop, but Patrol
still wrote each run's messages to the session for the Pulse
Assistant sidebar's forensic view. That write path is what
this commit bounds.
ExecutePatrolStream now calls SessionStore.TrimMessages(200)
after each run, keeping roughly the last two runs' worth of
messages — enough for the sidebar to show recent activity, far
short of unbounded growth. The next Patrol run on a bloated
session will drop the historical 3,000+ messages down to 200
on its first write, so existing storage debt clears on its
own without a separate migration.
User-driven chat sessions are unaffected: TrimMessages with
keepMostRecent <= 0 is a no-op, and callers that want full
history retention simply don't call it. Only Patrol's
forensic session is capped.
The canonical Patrol forensic log is the PatrolRunRecord
history surfaced at /api/ai/patrol/runs — that's the durable
record with structured fields. The chat-session-shaped file
is a sidebar convenience, not the source of truth.
Three tests guard the boundary:
- TrimMessages keeps the most recent N (50 messages
trimmed to 10 → messages 40-49 remain)
- TrimMessages is a no-op below threshold (5 messages,
cap 200 → 5 messages remain)
- TrimMessages with non-positive keep is a no-op (3
messages, cap 0 or -5 → 3 messages remain)
ai-runtime contract updated.
Follow-up to a0b3bc7ed which closed the chat.Service cost-ledger
gap. ai-runtime.md gains a Current State paragraph documenting:
- The pre-fix bug (chat accumulated tokens via SSE done envelope
but never recorded a cost.UsageEvent server-side; chat is the
bulk of AI token spend so the dashboard was dramatically
understating cost).
- The fix shape (recordChatTurnCost runs after every loop return,
success or error since the operator was billed regardless).
- The threading path (chat.Config.CostStore wired by the router
from AISettingsHandler.GetAIService.CostStore()).
- The double-recording invariant (ExecutePatrolStream is
deliberately not changed; its caller patrol_ai.go records via
its own helper).
- UseCase="chat" matches the canonical taxonomy noted on
cost.UsageEvent.UseCase ("chat" or "patrol").
Defense-in-depth for the malformed-history bug pattern. The
Patrol fix made patrol-main runs stateless, but Assistant
chat sessions are inherently multi-turn and must keep their
history. Any chat session that ends mid-tool-call — network
drop, ctx timeout, browser crash, uncaught panic, any
interrupt that fires between "model emits tool_calls" and
"agentic loop appends all tool results" — leaves the
persisted session with orphan tool_call_ids. The next message
that loads this history is rejected with the same provider
error that flapped Patrol for 33 days:
An assistant message with 'tool_calls' must be followed by
tool messages responding to each 'tool_call_id'.
For Patrol this was fixable by ignoring the session. For
Assistant it isn't; the conversation context is the product.
convertToProviderMessages now ends with a repairOrphanToolCalls
pass that scans every assistant message with tool_calls and
inserts synthetic is_error tool result messages immediately
after the assistant turn for any tool_call_id that has no
matching downstream result. The synthetic content is marked
is_error=true and explains the interruption so the model can
retry the same call or proceed without that data — preserving
conversational continuity while satisfying the provider's
structural-validity check.
This guards every conversation that crosses
convertToProviderMessages, not just Assistant chat. If Patrol
ever changes back to loading session history, the same safety
net applies. If a new entry point appears for some other LLM
flow, it gets the repair for free.
Three tests guard the boundary:
- Orphan injection (3 tool_calls, only 1 result → 2
synthetic results, marked is_error with interrupted
explanation, ordering preserved)
- Clean no-op (all tool_calls fulfilled → no synthetic
messages, no is_error pollution)
- Existing truncation test still passes (assistant message
with both tool_calls and own tool_result → no repair
needed, tool_call_id matches in same message)
ai-runtime contract updated.
The overall-health "Recent Patrol errors" coverage factor in
summarizeRecentPatrolCoverage was anchoring the score to a
stale ratio: it counted errors across the last 10 runs without
weighting recency. After Pulse fixed two compounding Patrol
bugs today, four consecutive successful runs (50+ tool calls
each) followed six earlier failures. The assessment kept
showing C/65 with the prediction "most recent Patrol runs
encountered errors (6 of 10)" — directly contradicting the
fact that *every* recent run had succeeded.
Operators reading that score would conclude Pulse Patrol is
still broken. It isn't. The fix dragged the grade.
This commit adds a recovery-suppression check: count trailing
successful full Patrol runs from the most-recent end of the
window (GetAll returns newest-first), skipping non-full runs.
When three or more consecutive trailing successes exist —
roughly a 9-hour clean stretch at the default 3-hour cadence —
the error penalty drops entirely. The score reflects current
reality.
Three is conservative: a single recovery run could be a
transient win; three consecutive demonstrate the underlying
fix is sticking. Below the threshold, the existing ratio-tiered
penalty still applies so partially-recovered states still
register.
Two tests guard the boundary:
- 6 historical errors + 3 trailing successes → no coverage
factor (suppressed)
- 6 historical errors + 2 trailing successes → coverage
factor remains (recovery incomplete)
Live verified after this commit lands: the assessment that's
been stuck at C/65 since the malformed-history fix will
recompute to A/B grade as soon as the trailing 3 successful
runs are recognized by the same recent-runs query.
ai-runtime contract updated.
Follow-up to 03463c1bf which threaded the per-tenant report
narrators through chat.Config -> tools.ExecutorConfig ->
PulseToolExecutor so pulse_summarize can produce AI-narrated
synthesis in chat instead of heuristic-only. ai-runtime.md's
Current State paragraph documents the wiring:
- chat.Config carries three optional fields (ReportNarrator,
ReportFleetNarrator, ReportFindingsProvider) threaded through
to the executor at session construction time.
- The router installs a SetReportNarratorResolver closure that
mirrors the reporting handler's pattern, asking the
AISettingsHandler for the per-tenant ai.Service and returning
it as the implementation for all three roles when AI is
enabled.
- Unconfigured tenants still get the heuristic fallback —
matching the report PDF's graceful-degradation posture.
- AI-narrated chat synthesis uses the same provider, sanitizer,
model selection, cost ledger (report_narrative /
report_narrative_fleet use-cases), and budget gate the report
PDF endpoint enforces, so there is exactly one canonical
synthesis path for both surfaces.
Backup-failed was flapping detected → auto-resolved → re-detected
ten times in a single day. Each cycle the LLM saw "PBS backups
look healthy in my current snapshot" during a Patrol pass, called
patrol_resolve_finding(backup-failed), and the adapter at
patrol_findings.go:985 called Resolve(findingID, true) directly —
no category check, no evidence verification.
The contract docs at findings.go:52-67 explicitly say event /
persistent categories (backup, reliability, security, general)
"stay active until explicitly resolved — either by the LLM calling
patrol_resolve_finding with evidence, or by operator action." That
"with evidence" was never enforced.
This commit enforces it. The adapter now checks two conditions
before honoring an LLM resolve:
- finding.Category does NOT support stale-auto-resolve (per the
contract function CategorySupportsStaleAutoResolve), AND
- a deterministic verifier exists for finding.Key (currently
smart-failure and backup-failed)
When both are true, the adapter runs VerifyFixResolved on the
finding's resource. If the verifier still detects the failure
signal, the LLM gets an error explaining why the resolve was
rejected and that the underlying issue must be fixed first. If
the verifier confirms the signal has cleared, the resolve
proceeds with grounded evidence.
Categories that support stale-auto-resolve (performance, capacity)
bypass the gate entirely — the LLM can resolve them based on
absence per the existing contract. Keys without a verifier also
fall through to current behavior so we don't block resolves for
categories we haven't built verifiers for yet.
New PatrolService.hasDeterministicVerifierForKey() helper keeps
the gate's verifier list in lockstep with the switch in
verifyFixDeterministically.
Tests cover the three branches:
- performance category → gate skipped, resolve proceeds
- reliability + no verifier → gate falls through, resolve proceeds
- hasDeterministicVerifierForKey for known and unknown keys
ai-runtime contract updated.
Follow-up to e32d4ede4 (NarrativeFor + FleetNarrativeFor entrypoints)
and 1fe5d6853 (pulse_summarize tool). api-contracts.md gains a Current
State paragraph documenting that the Engine interface now exposes two
non-rendering entry points alongside Generate/GenerateMulti, with the
explicit invariant that test stubs implementing the interface must
implement these methods so the contract is honoured across the entire
surface, not just the export-shaped subset.
ai-runtime.md was updated in the parallel-agent commit ee2de2703
(which picked up the pulse_summarize paragraph when restating
auto-resolve gating), so no further edit is needed there.
reconcileStaleFindings (commit b44d5892f) and the resource-absent
gate added in commit d6bb89a1c both use the same
CategorySupportsStaleAutoResolve helper, but the contract Current
State only documented the first path. Rewords the paragraph so
both are covered explicitly: stale-cleanup and resource-absent
both gate on the whitelist, both reject event/persistent
categories, and the bogus-absence examples extend to cover the
resource-absent failure mode (transient agent reconnect,
container churn, refresh gap).
Follow-up to the narrator prompt changes that forbid acting as a
parallel detector. ai-runtime.md gains a Current State paragraph
documenting that both report narrator system prompts encode an
explicit detection-boundary invariant: warning or critical severity
classifications must be backed by a Patrol finding, an alert, or a
hard-threshold breach in the structured input, not by metric
inference alone. Patterns the narrator notices without that backing
are constrained to info severity. This keeps the narrative
retrospective on Patrol's work and prevents silent
shadow-classification competing with Patrol's detection rules.
api-contracts.md gains an equivalent paragraph in the reporting
contract section. The same rule applies to outliers and patterns
on the fleet path, and to recommendations on both. The deterministic
heuristic narrators were already constrained to the same threshold
rules; this aligns the AI path with the same evidence surface the
fallback uses, so the report PDF cannot become a back-door
detection surface that diverges from the findings store.
Initial detector (commit 942f9ca0f) only matched on the two legacy
absence-signature reason strings — but the Backup failed finding
on the live preview showed 6 auto_resolved events all with empty
messages, produced by the LLM patrol_resolve_finding tool via
Resolve(_, true). Counter stayed at 6× after the previous
migration ran.
New detector: any active finding whose category is NOT eligible
for stale-auto-resolve (i.e. anything other than performance or
capacity) AND has any auto_resolved event on its lifecycle is
treated as having an inflated counter. The rationale is the same
rule the category gate already established — for event/persistent
categories there is no legitimate absence-driven resolution path,
so any auto_resolved was either a removed-bogus-path stamp or an
LLM judgment call that repeatedly reverted through regressions on
the next run. The cumulative count is no signal either way.
Performance/capacity findings retain their counter because the
metric-cleared resolution model is sound there.
Test extended to cover four cases: LLM-driven cycle resets,
legacy-reason cycle resets, eligible-category preserves counter,
non-eligible category without any auto_resolved preserves counter.
Plus the existing idempotency case (already-reset finding stays
reset and is not re-applied).
The Backup failed finding on the live preview showed "regressed 6×"
when the actual regression count of genuine recurrences was at
most 1 or 2 — the rest were the system fighting itself, driven by
the absence-based auto_resolve paths that were gated (category
whitelist) or removed (alert-mirror rip) earlier in this branch.
Counter stayed sticky after those fixes landed, so the trust strip
and finding badges still surfaced the inflated number.
FindingsStore.SetPersistence load pass now scans each active
finding's lifecycle for the two known bogus-signature auto_resolved
reasons ("No longer detected by patrol", "Resource no longer
exists in infrastructure"). If found, RegressionCount is reset to
0 and LastRegressionAt is cleared, and a regression_counter_reset
lifecycle event is appended so the migration is idempotent. A
finding that already has a regression_counter_reset event is left
alone; any regressed events that accrued after the reset are
genuine and stand.
findingHasBogusAutoResolveCycle returns true only when the
lifecycle contains a bogus auto_resolved and no prior reset event,
so the function is the single point of truth for the migration
decision and is straightforward to test. Test covers three cases:
finding with bogus signature gets reset, finding with empty-message
auto_resolved (LLM-driven, legitimate) keeps its counter, finding
already migrated is not re-reset.
Updates ai-runtime Current State to document the second migration
on top of the alert-mirror retirement.
The previous commit removed the detectAlertSignals path so no NEW
alert-mirror findings are emitted, but the findings already
persisted from earlier builds stay in the store indefinitely —
nothing cleans them up (reconcileStaleFindings is gated on
performance/capacity categories, the LLM resolves them just to
have them re-detected next run except now the deterministic
emitter is gone so re-detection can't happen, but they're left
sitting as active findings draining the trust strip and score).
FindingsStore.SetPersistence now runs a one-shot retirement pass
on load: any active finding with title "Active alert detected",
source ai-analysis, and category general is auto-resolved with
reason "Patrol no longer mirrors alerts; the Alerts page is the
canonical surface for currently-firing alerts." The pass appends
an auto_resolved lifecycle event so the retirement is auditable,
syncs the loop state to resolved, and schedules a save so the
cleanup persists.
Idempotent: after the first load with this code, no findings
match the signature so the pass is a no-op. Defensive: the
signature requires all three fields (title + source + category)
to match before retiring, so an operator-authored finding that
happens to share the title is left untouched. Test covers the
mirror case, the matching-title-but-foreign-source case (must
NOT retire), and an unrelated active finding (must NOT retire),
plus verifies the retired state persists back through the
persistence layer.
Updates ai-runtime Current State to record the migration path.
The deterministic signal pipeline ran the pulse_alerts tool output
through detectAlertSignals and produced a SignalActiveAlert for
every firing alert, which Patrol then materialized as an
"Active alert detected" finding (source: ai-analysis, category:
general). The system prompt at the top of patrol_ai.go explicitly
tells the LLM not to duplicate alerts — but the deterministic
emitter was duplicating them anyway, behind the LLM's back.
Symptoms observed in the wild:
- 9 active "Active alert detected" findings in Patrol, every one a
duplicate of an existing alert already on the Alerts page.
- The LLM, doing what the prompt told it, resolved each mirrored
finding via patrol_resolve_finding. Next run the alert was still
firing and Patrol re-emitted the signal → finding regressed.
Lifecycle showed several auto_resolved → re-detected → regressed
cycles per finding within hours.
- Health score dragged down by issues the operator already saw on
the Alerts page, with no operator action possible from Patrol
that wasn't already available from Alerts.
Rip detectAlertSignals entirely, remove the pulse_alerts case from
the signal-extraction switch, drop SignalActiveAlert plus its key
/ title / recommendation entries. Convert the prior
TestDetectSignals_ActiveAlert into a regression guard that locks
in the no-mirror behavior.
Updates the ai-runtime subsystem Current State to record the
decision: Patrol does not duplicate the Alerts surface; alerts
own their own lifecycle, surface, and acknowledgement model.
Follow-up to the fleet-level AI narrative refactor: ai-runtime.md
gains a Current State paragraph documenting that internal/ai.Service
also implements pkg/reporting.FleetNarrator with its own use-case
label (report_narrative_fleet) so fleet vs single-resource spend is
distinguishable in the cost ledger and budget gate, and that the
single-resource narrator is intentionally not propagated through the
multi-report path. api-contracts.md gains a paragraph documenting
the new optional FleetNarrator field on MultiReportRequest, the new
FleetNarrative field on MultiReportData, the rendered fleet section
(executive prose, named outliers, cross-cutting patterns,
recommendations, optional period-comparison, AI provenance footer),
and the explicit invariant that the deterministic resource summary
table stays rendered from the same per-resource aggregates so every
named outlier is verifiable against the table below.
Dependent subsystems (agent-lifecycle, performance-and-scalability,
storage-recovery) remain unchanged: their Extension Points reference
internal/api/ broadly but agent lifecycle, perf scaling, and storage
recovery semantics have no delta from this change.
Follow-up to the report narrative refactor: ai-runtime.md gains a
Current State paragraph documenting that internal/ai.Service now
implements pkg/reporting.Narrator and pkg/reporting.FindingsProvider,
and that the narrator/findings surfaces inherit the same provider,
sanitizer, model selection, budget enforcement, and fail-closed
governance as the rest of the canonical AI runtime. api-contracts.md
gains a paragraph documenting the new optional Narrator and
FindingsProvider fields on MetricReportRequest, the new Narrative,
PriorPeriod, and Findings fields on ReportData, the three new PDF
sections (executive prose, Period-over-period changes, AI provenance
footer), and the explicit invariant that the deterministic data
surface (charts, stats, alerts, storage, disks) stays rendered from
the same aggregates so every AI claim is verifiable against adjacent
data. Multi-resource fleet reports intentionally remain heuristic-only
at this transport layer.
The dependent subsystems flagged by the staged-shape guard
(agent-lifecycle, performance-and-scalability, storage-recovery)
genuinely have no contract delta from the prior commit: their
Extension Points reference internal/api/ broadly, but agent
lifecycle, performance scaling, and storage recovery semantics are
unchanged.
The overall health score chip read "A · 95/100" while the same
assessment card said "Coverage incomplete · Recent Patrol runs
encountered errors · Verify full coverage." Those messages
contradicted because the "recent errors but had a successful run"
coverage factor used a flat -10 penalty regardless of the error
ratio. With one successful manual run among many failed startup
runs, the math stayed in the A band and the score directly
undermined the warning it sat next to.
The factor now tiers the impact by ratio of errored runs to
relevant runs in the scoring window:
>50% errored → -30, "Most recent Patrol runs encountered errors
(N of M); the current health summary is not
reliable until coverage stabilizes."
>25% errored → -20, "Recent Patrol runs encountered errors
(N of M), so the current health summary
may be incomplete."
else → -10, original light-tier description.
A single transient error stays in grade A; dominant-error periods
drop out of A so the grade matches the warning. Adds two tests
covering both ends of the new tiering and updates the ai-runtime
subsystem contract Current State section.