Commit graph

6062 commits

Author SHA1 Message Date
rcourtman
4cf16ec9cb Stabilize summary chart SLOs 2026-05-12 14:40:55 +01:00
rcourtman
1726cf47b4 Harden Patrol and Assistant action boundaries 2026-05-12 12:06:27 +01:00
rcourtman
5ac0484e06 Drop install.sh-smoke push self-test (v5.1.30 fallback was unviable)
Some checks are pending
Build and Test / Secret Scan (push) Waiting to run
Build and Test / Frontend & Backend (push) Waiting to run
Commit 590818744 added a push trigger that re-ran the gate against
v5.1.30 on every workflow edit, aiming to register the workflow for
API dispatch and to validate it before the next release depended on it.
The registration goal was achieved: workflow ID 275278570 is now active
and the gate is dispatchable via `gh workflow run install-sh-smoke.yml`
and the REST API.

The self-test itself was unviable: v5.1.30 doesn't ship
`install.sh.sshsig` (v5 didn't sign installers), so the signature-verify
step 404s on every run. No current published release is a valid known-
good smoke target — rc.5's `install.sh` has the wrong banner / agent
installer (the regression this gate exists to catch), and rc.6 doesn't
exist yet. The push-triggered run would fail forever, drowning real
signals.

Drop the push trigger. The workflow is registered, dispatch is verified
working, and a dispatch against rc.5 just confirmed the gate correctly
fires the banner check ("install.sh banner is not the Pulse server
installer") against the broken release. The first time the gate's
container portion runs end-to-end will be on rc.6 through the
create-release.yml workflow_call. Make the resolve-inputs step a hard
fail if tag or version is empty so any future regression that drops
inputs surfaces explicitly instead of running against a silent fallback.
2026-05-12 11:55:03 +01:00
rcourtman
5908187445 Self-test install.sh smoke gate on every workflow edit against v5.1.30
Commit 7c0f65425 wired install-sh-smoke.yml into create-release.yml but
the workflow has never actually executed — the pre-install structural
checks were validated locally against rc.5, but the privileged systemd
container portion is unproven on GitHub's cgroup-v2 runners. The first
real release through the pipeline would be its trial run, and a bug at
the container layer would block the release.

Add a push-event self-test that re-runs the full gate against v5.1.30
(a known-good release with the same server-installer banner, the same
--version arg handler, and the same ed25519 signing key as v6 RCs)
whenever this workflow file changes on pulse/v6-release or main. This
both validates the gate continuously and registers the workflow with
GitHub's actions/workflows API so it becomes dispatchable via gh CLI
and the REST endpoint — workflows on non-default branches with only
workflow_call + workflow_dispatch never appear in the API until they
have been triggered by a non-dispatch event.

Replace direct `${{ inputs.* }}` references with a single resolve step
that falls back to v5.1.30 / 5.1.30 / github.repository when no inputs
are supplied (push trigger). Drop the now-redundant Resolve release
repository step. Behavior under workflow_call from create-release.yml
is unchanged: the create-release-supplied tag/version/repository win.
2026-05-12 11:50:19 +01:00
rcourtman
7c0f654253 Wire install.sh smoke gate into create-release.yml release pipeline
The smoke gate workflow exists from commit 065ebdb27 but until it is
called from create-release.yml it does not actually protect any release.
That is exactly the regression class that let rc.1 → rc.5 ship with a
broken install.sh: nothing in the release pipeline exercised the
documented secure-install flow against the published GitHub Release URL.

Wire install-sh-smoke.yml as a downstream workflow_call after
validate_release_assets succeeds. Gated on
historical_asset_backfill_only != 'true' since asset-backfill flows
re-upload to an already-published release and the smoke would just
re-confirm what hasn't changed.

Pre-install structural checks were verified locally against rc.5 — the
gate correctly fires the banner / agent-banner / --version handler
assertions against the broken release. The end-to-end container portion
(privileged systemd boot, install.sh execution, /api/health, /api/version
match) will run for the first time on the next release that publishes
through this workflow; existing retry loops on systemd readiness,
service activation, and health endpoint absorb transient runner flakes.

Add install-sh-smoke.yml to the deployment-installability canonical files
and to the release-promotion proof policy's match_files, and add
scripts/installtests/build_release_assets_test.go to that policy's
exact_files (matching the existing pin set for related policies in the
deployment-installability subsystem). Update subsystem_lookup_test.py
fixtures that pinned the exact_files list literally.

Pinned the create-release.yml wiring in build_release_assets_test.go
alongside the validate-release-assets wiring so the smoke step cannot
silently be unwired.

Document the gate's contract responsibilities in
deployment-installability Extension Point 2.
2026-05-12 11:44:04 +01:00
rcourtman
065ebdb276 Add install.sh end-to-end smoke gate against published release
Across v6 rc.1 → rc.5 the published install.sh asset was the agent
installer rather than the server installer, and the README's pinned
ed25519 key did not verify what the pipeline actually signed. The first
broke `bash install.sh --version` and the in-product Update button; the
second silently failed the README's secure-install ssh-keygen step.
Neither was caught by CI because every existing gate operated on the
local release/ build, the Docker image, or the helm chart — nothing
exercised the documented LXC/systemd install commands against the
published release URL.

scripts/validate-release.sh now catches asset-identity drift at build
time. This workflow catches the rest of the regression class — anything
that breaks the actual install at runtime — by running the documented
flow end-to-end against the published release.

What it does:
- Downloads install.sh, install.sh.sshsig, and the linux-amd64 tarball
  from releases/download/<tag>/.
- Extracts the README's pinned pulse-installer ed25519 key and runs the
  exact ssh-keygen -Y verify command from the README's secure-install
  snippet against the downloaded asset.
- Re-checks the server-installer banner, the --version) arg handler, and
  the absence of the agent banner — same pins as validate-release.sh, but
  now against what GitHub is actually serving (not just what was built
  locally).
- Boots jrei/systemd-debian:12 privileged, runs
  `bash install.sh --archive <tarball> --disable-auto-updates` from
  inside, waits for systemd pulse.service to become active, hits
  /api/health, and asserts /api/version reports the expected version.

--archive mode is used rather than --version so the workflow doesn't
depend on install.sh's self-refetch loop (the re-fetched bytes are the
ones we already validated). Auto-updates are disabled to avoid the timer
unit doing anything during the smoke run.

Triggers are workflow_dispatch + workflow_call only. Wire it into
create-release.yml after the next RC validates it green.

Pinned in build_release_assets_test.go so silent deletion or weakening
of any critical assertion (signature verify, banner check, /api/health
hit, version match) trips the test.
2026-05-12 11:25:46 +01:00
rcourtman
b69c8c8007 Wire PULSE_RELAY_ENABLED and PULSE_RELAY_SERVER as real env overrides
These two env vars were documented as relay overrides in v6 docs since
March 18 (CONFIGURATION.md, RELAY.md, and the frontend-served doc copy)
but no code ever read them. Operators trying to bootstrap relay headlessly
saw no effect.

Implement them rather than remove the documentation. Headless and
container deployments now have a real path to enable relay and point it
at a private endpoint without going through Settings → Relay.

internal/relay/config_env.go:
  - ApplyEnvOverrides(*Config) mutates relay.Config in place.
  - PULSE_RELAY_ENABLED accepts true/false/yes/no/1/0/on/off (case-
    insensitive). Unrecognized values log a warning and leave the file
    value untouched — important so "unset" reads differently from
    "explicit false."
  - PULSE_RELAY_SERVER goes through the existing validateRelayServerURL
    check; invalid URLs log a warning and fall through.

internal/config/persistence_relay.go:
  LoadRelayConfig calls ApplyEnvOverrides after the file load and after
  the default-fallback when relay.enc is absent, so the env override
  applies on every load.

Tests cover unset / true / false / garbage-bool / valid-URL / invalid-URL
/ both-together / nil-config paths in the relay package, plus two
end-to-end tests in internal/config that prove the override flows through
LoadRelayConfig against a real persisted file and against the
missing-file default branch.

Restore the env-var docs with the correct default URL (the full
wss://relay.pulserelay.pro/ws/instance, not the bare hostname the
original aspirational table claimed) and add an explicit precedence note:
saving from the UI after an env override persists the env-effective state
to disk, so clearing the env alone does not revert.

Add internal/relay/config_env_test.go to the relay-runtime registry's
desktop-relay-runtime exact_files so the new code surface is proof-tracked.
Update the matching pin in subsystem_lookup_test.py. Extend the
relay-runtime contract Extension Point 3 to document the override
semantics LoadRelayConfig must satisfy.
2026-05-12 11:18:31 +01:00
rcourtman
7cc90a8969 Remove fictional PULSE_RELAY_ENABLED / PULSE_RELAY_SERVER env overrides
CONFIGURATION.md and its frontend-served copy advertised an "Environment
Overrides" table for relay with two env vars, but neither has ever existed
in code. git log -S "PULSE_RELAY_ENABLED" -- 'internal/**.go' is empty;
relay config is entirely file-driven (relay.enc) and configured from
Settings → Relay. The documented default "relay.pulserelay.pro" was also
incorrect — the actual code default is the full ws URL
"wss://relay.pulserelay.pro/ws/instance".

Replace the misleading table with a clarifying paragraph that states the
truth: relay has no env-var overrides, it's UI-configured, and the actual
default server URL. Operators trying to deploy headlessly with the
documented env vars would silently get the default config because the
runtime never reads them; better to say so explicitly.

Found during a sweep of all 29 PULSE_* env vars documented in
CONFIGURATION.md against the codebase. These two were the only doc-only
drift; the other 27 all have real production references.
2026-05-12 11:05:07 +01:00
rcourtman
3b57c880ba Fix manual systemd install snippet binary path
The "Manual systemd install (advanced)" example in docs/INSTALL.md told
users to run `sudo install -m 0755 pulse /usr/local/bin/pulse`, but the
release tarballs extract to ./bin/pulse, not ./pulse. Following the
documented command literally produced "install: cannot stat 'pulse'".

Show the download/extract commands explicitly so users know the tarball
shape, and correct the install path to `bin/pulse`. validate-release.sh
already pins the ./bin/pulse tarball layout, so this is the matching
documentation side of that contract.
2026-05-12 10:51:03 +01:00
rcourtman
ce7d7c1956 Fix stale README signature key and guard against future drift
The README's secure-install snippet has pinned the wrong ed25519 key
since commit a60fa03d7 (April 22, 2026), so v6 rc.2 through rc.5 all
shipped with a documented verification step that does not work.

I downloaded the published rc.5 install.sh + install.sh.sshsig and
ran ssh-keygen -Y verify with both candidate keys:
  Ds21c5... (README's pinned key) -> Could not verify signature
  MZd/DaH... (key embedded in install.sh and pulse-auto-update.sh) -> OK

Customers who actually followed the README's secure-install path saw
"Could not verify signature" and aborted. Most users curl-pipe the
script unverified so the drift went unreported.

Replace the stale key in README.md and docs/INSTALL.md with the actual
pipeline signing key (MZd/...).

Add a validate-release.sh smoke that extracts the README's pinned key
and runs the exact ssh-keygen -Y verify command against the signed
install.sh.sshsig. Any future drift between documented key and actual
signing key fails the release before publish.

Lock both the correct-key presence and the stale-key absence in
build_release_assets_test.go for README and docs/INSTALL.md so a manual
edit cannot regress the docs back to the broken state.
2026-05-12 10:30:42 +01:00
rcourtman
49412357af Ship the Pulse server install.sh as the GitHub Release asset
Every v6 RC (rc.1 through rc.5, ~30 days) shipped the wrong install.sh.
build-release.sh was copying the rendered AGENT installer into
release/install.sh, but adapter_installsh, scripts/pulse-auto-update.sh,
the root install.sh's own --rc/--stable/--version flows, and the README
quickstart all fetch that asset and run `bash install.sh --version vX.Y.Z`.
Since the agent installer rejects --version with "Unknown argument", the
LXC quickstart, the in-product "Update Pulse" button on systemd/proxmoxve
deployments, and the pulse-auto-update.sh systemd timer were all broken
for every RC.

Fix the build-release.sh copy to publish the root server installer.
The agent installer continues to ship inside tarballs at ./scripts/install.sh
and inside Docker images at /opt/pulse/scripts/install.sh, and is served
at the running Pulse server's /install.sh endpoint — none of those paths
change. Only the top-level GitHub Releases asset moves from agent to server.

Update build_release_assets_test.go to lock in the new publishing rule
and ban the reverse drift, replacing the March 18 "legacy root install.sh"
guard that was the original mistake.

Add a validate-release.sh smoke that catches the regression mode this hid:
the published install.sh must have the Pulse server banner, the --version)
arg handler, must not contain the agent banner, and `bash install.sh --help`
must print the server installer's version-pinning help line. These checks
run as part of validate-release-assets.yml against the post-publish asset
bundle so a future swap back cannot slip through.

Document the asset identity rule and the validate-release.sh guard in the
deployment-installability contract so any future change to the publishing
pipeline has to update the contract or trip the shape guard.
2026-05-12 10:24:28 +01:00
rcourtman
89379c4b5c Use effectiveLoadP95Budget for metrics-history load test CI variance 2026-05-12 09:19:58 +01:00
rcourtman
4ccbbf4267 Lazy-load 4 modals behind Show gates to slim index bundle further 2026-05-12 09:16:00 +01:00
rcourtman
da7969fb48 Drop pulse-agent attestation expectations from TestReleaseWorkflowsUseSecretSafeAttestedImageBuilds
Some checks are pending
Build and Test / Secret Scan (push) Waiting to run
Build and Test / Frontend & Backend (push) Waiting to run
2026-05-12 02:01:45 +01:00
rcourtman
a5d8b43088 Let assertJSONSnapshot exclude dynamic top-level fields for patrol_preflight 2026-05-12 01:38:59 +01:00
rcourtman
ec72977d3e Skip runtime-defaults raw-node TS imports when integration node_modules absent 2026-05-12 01:14:41 +01:00
rcourtman
8e49e68393 Pre-check integration node_modules before root-playwright wrapper assertion 2026-05-12 01:05:54 +01:00
rcourtman
216b3cae38 Skip root-playwright wrapper check on any tsx/playwright eval failure 2026-05-12 00:57:24 +01:00
rcourtman
86f9159bae Skip root-playwright wrapper check when @playwright/test isn't installed 2026-05-12 00:49:21 +01:00
rcourtman
659018ed28 Skip acceptance-doc wording-pin when the working-draft doc isn't checked out 2026-05-12 00:40:05 +01:00
rcourtman
96660e7586 Switch script-reference integrity test from rg to git grep for portable CI 2026-05-12 00:30:43 +01:00
rcourtman
5fd05efa83 Add connection-degraded alert for wedged platform connections
A Proxmox host wedged on a ZFS deadlock yesterday took the cluster API poll
with it (context deadline exceeded). The unified connections aggregator
flipped the Connection from active to stale to unreachable, and the
Settings / Infrastructure page rendered the right badges, but no top-nav
alert ever fired because nothing was actively notifying off that derived
state. Patrol's deterministic triage flagged it every minute, but its LLM
investigation stage has been broken since 2026-02-26 so flags never
escalated into user-visible findings. Result: a 3 hour outage I only
noticed because I happened to open Settings.

This wires an active notification off the same connection state the
Settings badges already use:

- internal/alerts/connection.go: new CheckConnection +
  clearConnectionDegradedAlert that fire connection-degraded after three
  consecutive stale or unreachable observations. Severity scales: stale
  warning, unreachable / unauthorized critical. Clear runs through the
  same recovery-confirmation gate as clearNodeOfflineAlert so a single
  flap back to active doesn't silently resolve a real outage. Paused,
  disabled, and non-platform connections are no-ops.

- internal/api/connections_alerts.go: snapshot translator that turns
  api.Connection into the narrow alerts.ConnectionSnapshot view. Keeping
  the snapshot type inside the alerts package preserves the existing
  api -> monitoring import direction; the monitor would have cycled if
  it called back into api directly.

- internal/monitoring: new SetConnectionsSnapshotLister hook + a
  per-tick checkConnectionAlerts call in the main poll loop, alongside
  the existing evaluate*Agents passes.

- internal/api/router.go: register the lister closure on r.monitor so
  the alerts loop sees the same Connection rows the HTTP handler does.

- internal/alerts/specs/types.go: add "connection" to the migration
  bridge list of accepted ResourceTypes, alongside node / docker-host /
  proxmox-disk / etc. The connection concept doesn't have a canonical
  unified resource type yet; this matches the existing pattern for
  alert-keyed resources that aren't first-class canonical.

Test coverage in internal/alerts/connection_test.go covers active never
fires, three stale observations escalate from pending to warning,
unreachable escalates warning to critical, unauthorized fires critical
cold, paused / disabled / agent never fire, recovery confirmation gate,
and a stale flap during recovery resets the gate.
TestResourceAlertSpecValidateAllowsConnectionMigrationBridgeType mirrors
the existing migration-bridge proof tests for the new type.
2026-05-12 00:29:04 +01:00
rcourtman
3323cba053 Stop install-mcp scripts from linking to GitHub blob/main docs 2026-05-11 23:58:45 +01:00
rcourtman
5e2dfe48c7 Lazy-load AIChat to cut frontend entry bundle by 29 percent 2026-05-11 23:58:12 +01:00
rcourtman
62cd23b97b Refresh frontend bundle-size baseline for post-rc.5 chunk shape 2026-05-11 23:48:00 +01:00
rcourtman
edda19aadf Catch up frontend tests with collapsed Patrol details and refactored FindingsPanel 2026-05-11 23:37:21 +01:00
rcourtman
2725a325f2 Consolidate AgentIntegrationsPanel doc links into shipped AGENT_SUBSTRATE.md 2026-05-11 23:25:55 +01:00
rcourtman
e1a6dff2a1 Document Pulse Cloud launch in v6 release notes 2026-05-11 23:18:05 +01:00
rcourtman
16963e415c Drop t.Parallel from dismiss/snooze finding tests that race on global session store 2026-05-11 23:10:35 +01:00
rcourtman
ff65551a1a Fix expired agent_preflight test fixture by using now-relative claim window 2026-05-11 22:57:57 +01:00
rcourtman
7951da526b Add release_cycle_artifact_globs so RC ceremony skips contract-update requirement 2026-05-11 22:55:29 +01:00
rcourtman
9a20bbd0b2 Derive RC packet paths from VERSION + glob to eliminate per-RC test churn 2026-05-11 22:36:03 +01:00
rcourtman
816b3985ba Make blocked-record drift self-fixable via BLESS_GOVERNANCE_FIXTURES env var 2026-05-11 22:31:03 +01:00
rcourtman
41dd867037 Stop alerts manager in TestPollCephClusterChecksPoolStorageThresholds to fix TempDir cleanup flake 2026-05-11 22:30:30 +01:00
rcourtman
920e88ede9 Schedule weekly release-dry-run watchdog against pulse/v6-release
Some checks are pending
Build and Test / Secret Scan (push) Waiting to run
Build and Test / Frontend & Backend (push) Waiting to run
2026-05-11 22:21:59 +01:00
rcourtman
d38f3d9217 Update release-control fixtures after pulse-agent Docker removal 2026-05-11 22:21:31 +01:00
rcourtman
3b2eef4984 Run build-and-test on pulse/v6-release commits to catch fixture/contract drift continuously 2026-05-11 22:19:45 +01:00
rcourtman
e69da0069f Stop pushing pulse-agent Docker image; ship agent only as release-asset binaries 2026-05-11 22:19:09 +01:00
rcourtman
e32db04543 Recalibrate CI 500-node load floor after rc.5 operator-state and agent-substrate plumbing 2026-05-11 19:07:44 +01:00
rcourtman
8ff69daa43 Bump install pins to rc.5 and refresh test fixtures for Patrol readiness + Unraid host profile tokens 2026-05-11 18:02:52 +01:00
rcourtman
894ea89af9 Refresh RC5 packet validation range for plain-JSON tool-call sanitisation 2026-05-11 17:09:57 +01:00
rcourtman
e36945741e Sanitise plain-JSON tool-call leaks from weak local models
Small Ollama models (qwen2.5:11b, qwen2.5:14b, similar) frequently emit
Pulse tool invocations as plain JSON inside content instead of routing
through the structured tool_calls channel. Users saw raw payloads like
`{"name": "pulse_query", "parameters": {...}}` as the assistant's final
response.

Extend cleanToolCallArtifacts and containsToolCallMarker with a new pass
that detects this leak shape, gated on a closed allowlist of canonical
tool names sourced from the runtime registry. The allowlist auto-syncs
when registerTools() gains a tool, so no separate hand-maintained list.

Anchored on (?:^|\n) and a leading `"name"` key so prose containing JSON
fragments, unrelated objects (`{"foo":"bar"}`), or named resources
(`{"name":"my-vm","cpu":50}`) are left untouched.
2026-05-11 17:02:07 +01:00
rcourtman
366bf8d127 Prepare v6.0.0-rc.5 release packet 2026-05-11 16:52:31 +01:00
rcourtman
52416cec6f Reword cluster deploy banner from "unmonitored" to "ready for Pulse Agent"
Cluster peers without a Pulse Unified Agent are still monitored via the
cluster's PVE API token — the agent adds richer telemetry (hardware,
sensors, OS metadata) but isn't the only source of monitoring. The
previous "N node(s) unmonitored" copy misrepresented the state.

Reframes the banner as an action opportunity, matching the adjacent
"Review & Deploy" CTA. Gate logic unchanged.
2026-05-11 16:02:53 +01:00
rcourtman
07d73843f0 Hide cluster deploy banner for offline PVE nodes
The "N nodes unmonitored" banner gated purely on absence of a Pulse
Unified Agent (r.agent?.agentId). Offline cluster members (status
'offline') were counted as deploy candidates, which is wrong on two
fronts: those nodes are typically still covered by the cluster's PVE
API token (so they aren't truly unmonitored, just unreachable), and an
offline host is precisely the case where deploying an agent cannot
succeed.

Exclude status === 'offline' from the unmonitored count.
2026-05-11 15:59:43 +01:00
rcourtman
9329258f8b Correct the DeepSeek tool_choice coercion rationale
The previous comment claimed DeepSeek's API aliases v4-flash/v4-pro to
deepseek-reasoner, justifying the auto coercion via the legacy
reasoner's known 400 behavior. That had the alias direction inverted:
per DeepSeek's pricing page, deepseek-chat and deepseek-reasoner are
deprecated aliases for v4-flash's non-thinking and thinking modes
respectively, not the other way around.

The coercion itself is empirically correct, though. Live preflight
against deepseek-v4-flash with tool_choice=required produces a
deterministic HTTP 400 ("provider rejected forced tool selection") in
275ms. The behavior is consistent across the DeepSeek user community
- multiple downstream projects (pydantic-ai #5193, claude-code-router
#1378, opencode #24190, others) confirm V4 models reject forced tool
selection despite DeepSeek's chat-completion docs listing required as
a supported value. Server reality disagrees with documentation.

This commit updates only the comments in openai.go and openai_test.go
to point at the empirical evidence and the community confirmation.
The coercion behavior is unchanged.
2026-05-11 14:46:08 +01:00
rcourtman
3031c218cb Surface preflight diagnosis on Patrol readiness banner
The readiness banner previously rendered only the summary string
("Provider connection issue (last preflight 2h ago)"), even though
the underlying payload already carried the provider, model, preflight
duration, tool-call observation, and recommendation. Operators had to
open dev tools to find out which provider failed and what to check.

Now the banner inlines:

  - Provider and model that preflight tried to reach
  - Preflight duration and whether a tool call was observed
  - The preflight recommendation text

The state hook gains a patrolPreflight memo that reads from the same
AISettings payload the rest of the patrol surface already consumes.
2026-05-11 14:31:19 +01:00
rcourtman
2a9afb1112 Sanitise double-pipe DeepSeek DSML tool-call markers in chat
Found by exercising pulse_summarize in real chat: the user asked a
question, the model called the tool successfully (response came back
with narrative_source: ai), but the chat panel ended with raw DSML
text and "Assistant response is ready" — no actual prose answer.

Root cause was in agentic_sanitize.go. The fast-path string list
only checked single-pipe DSML variants ("<|DSML|...>"), but
deepseek-v4-flash emits the double-pipe form ("<||DSML||...>").
The opening sequence didn't match any marker, so cleanToolCallArtifacts
returned the content unchanged. The chat orchestrator then showed
the raw DSML to the user as if it were the assistant's answer.

Fix:
- Add double-pipe variants (Unicode and ASCII) to the fast-path
  marker list. Six new entries, mirroring the existing single-pipe
  entries.
- Add a backstop regex (dsmlRe) that matches any pipe-count variant
  via `</?[\||]+/?DSML[\||]*`. Future model behaviour with triple
  or higher pipe counts gets caught without another fast-path edit.
- containsToolCallMarker gets the same coverage so streaming
  detection stops forwarding content the moment the marker
  appears, regardless of pipe count.

Tests in agentic_sanitize_test.go gain three new cases for cleanup
(double-pipe Unicode, double-pipe ASCII, triple-pipe regex
backstop) and two for detection (double-pipe Unicode, double-pipe
ASCII). All passing.

Process note: this bug was invisible to existing tests because the
test suite only covered the single-pipe variants that were
documented in the marker list. The double-pipe form only appears
when an actual model emits it. Same pattern as the JSON-casing fix
earlier — exercise the surface against a real LLM, find what tests
in isolation can't see.
2026-05-11 13:42:11 +01:00
rcourtman
e22113230a Purge resolved legacy alert-mirror findings on load
The previous rip retired only active "Active alert detected" findings
from the now-removed detectAlertSignals -> SignalActiveAlert emitter.
Resolved instances were left in place, polluting the Resolved tab and
inflating the regressed total on the trust strip (a stale 8 resources
each marked "regressed 3x" with descriptions like "Active warning
alert: Container 'ollama' is powered off"). They have no canonical
operator value -- the Alerts surface is the source of truth for
currently-firing alerts -- so on load we now purge them entirely
rather than keeping them around as Resolved noise. Active mirrors are
still retired (auto-resolved with a clear reason) so operators see
why the finding closed; resolved mirrors disappear silently because
they were already in the terminal state. Idempotent.

Extends TestFindingsStore_SetPersistence_RetiresLegacyAlertMirrorFindings
with a fixture for the resolved-mirror case and asserts both the
in-memory purge and the persisted state no longer carries it.
2026-05-11 11:40:07 +01:00
rcourtman
c3319b6304 Make Narrative / FleetOutlier JSON shape consistent with prompt schema
Found by exercising pulse_summarize in a real chat session. The
chat-tool response surfaced:

  "observations": [{"Text": "...", "Severity": "info"}]

The AI narrator's system prompt (report_narrator.go) tells the
model to emit lowercase keys:

  {"text": "...", "severity": "..."}

The model was being taught one schema and shown a different one
in the tool response for the same shape. NarrativeBullet,
FleetOutlier, Narrative, and FleetNarrative had no JSON tags, so
embedded struct fields serialised with their Go names.

Add struct tags so the wire shape matches the prompt schema. Pure
marshaling change — JSON tags don't affect Go field access, so
PDF rendering (which reads fields directly) is unchanged. Tests
in narrative_json_test.go pin the shape so the inconsistency
can't reappear silently.

Process note: this is a class of bug that only appears when an
LLM actually consumes the output. No unit test caught it; no
review of the code showed it; the model running through the chat
path is what surfaced it. Another argument for "exercise the
artifact" — even the tool surface that looks correct in
isolation has hidden inconsistencies you only see when something
external reads it.
2026-05-11 11:29:35 +01:00