Commit 590818744 added a push trigger that re-ran the gate against
v5.1.30 on every workflow edit, aiming to register the workflow for
API dispatch and to validate it before the next release depended on it.
The registration goal was achieved: workflow ID 275278570 is now active
and the gate is dispatchable via `gh workflow run install-sh-smoke.yml`
and the REST API.
The self-test itself was unviable: v5.1.30 doesn't ship
`install.sh.sshsig` (v5 didn't sign installers), so the signature-verify
step 404s on every run. No current published release is a valid known-
good smoke target — rc.5's `install.sh` has the wrong banner / agent
installer (the regression this gate exists to catch), and rc.6 doesn't
exist yet. The push-triggered run would fail forever, drowning real
signals.
Drop the push trigger. The workflow is registered, dispatch is verified
working, and a dispatch against rc.5 just confirmed the gate correctly
fires the banner check ("install.sh banner is not the Pulse server
installer") against the broken release. The first time the gate's
container portion runs end-to-end will be on rc.6 through the
create-release.yml workflow_call. Make the resolve-inputs step a hard
fail if tag or version is empty so any future regression that drops
inputs surfaces explicitly instead of running against a silent fallback.
Commit 7c0f65425 wired install-sh-smoke.yml into create-release.yml but
the workflow has never actually executed — the pre-install structural
checks were validated locally against rc.5, but the privileged systemd
container portion is unproven on GitHub's cgroup-v2 runners. The first
real release through the pipeline would be its trial run, and a bug at
the container layer would block the release.
Add a push-event self-test that re-runs the full gate against v5.1.30
(a known-good release with the same server-installer banner, the same
--version arg handler, and the same ed25519 signing key as v6 RCs)
whenever this workflow file changes on pulse/v6-release or main. This
both validates the gate continuously and registers the workflow with
GitHub's actions/workflows API so it becomes dispatchable via gh CLI
and the REST endpoint — workflows on non-default branches with only
workflow_call + workflow_dispatch never appear in the API until they
have been triggered by a non-dispatch event.
Replace direct `${{ inputs.* }}` references with a single resolve step
that falls back to v5.1.30 / 5.1.30 / github.repository when no inputs
are supplied (push trigger). Drop the now-redundant Resolve release
repository step. Behavior under workflow_call from create-release.yml
is unchanged: the create-release-supplied tag/version/repository win.
The smoke gate workflow exists from commit 065ebdb27 but until it is
called from create-release.yml it does not actually protect any release.
That is exactly the regression class that let rc.1 → rc.5 ship with a
broken install.sh: nothing in the release pipeline exercised the
documented secure-install flow against the published GitHub Release URL.
Wire install-sh-smoke.yml as a downstream workflow_call after
validate_release_assets succeeds. Gated on
historical_asset_backfill_only != 'true' since asset-backfill flows
re-upload to an already-published release and the smoke would just
re-confirm what hasn't changed.
Pre-install structural checks were verified locally against rc.5 — the
gate correctly fires the banner / agent-banner / --version handler
assertions against the broken release. The end-to-end container portion
(privileged systemd boot, install.sh execution, /api/health, /api/version
match) will run for the first time on the next release that publishes
through this workflow; existing retry loops on systemd readiness,
service activation, and health endpoint absorb transient runner flakes.
Add install-sh-smoke.yml to the deployment-installability canonical files
and to the release-promotion proof policy's match_files, and add
scripts/installtests/build_release_assets_test.go to that policy's
exact_files (matching the existing pin set for related policies in the
deployment-installability subsystem). Update subsystem_lookup_test.py
fixtures that pinned the exact_files list literally.
Pinned the create-release.yml wiring in build_release_assets_test.go
alongside the validate-release-assets wiring so the smoke step cannot
silently be unwired.
Document the gate's contract responsibilities in
deployment-installability Extension Point 2.
Across v6 rc.1 → rc.5 the published install.sh asset was the agent
installer rather than the server installer, and the README's pinned
ed25519 key did not verify what the pipeline actually signed. The first
broke `bash install.sh --version` and the in-product Update button; the
second silently failed the README's secure-install ssh-keygen step.
Neither was caught by CI because every existing gate operated on the
local release/ build, the Docker image, or the helm chart — nothing
exercised the documented LXC/systemd install commands against the
published release URL.
scripts/validate-release.sh now catches asset-identity drift at build
time. This workflow catches the rest of the regression class — anything
that breaks the actual install at runtime — by running the documented
flow end-to-end against the published release.
What it does:
- Downloads install.sh, install.sh.sshsig, and the linux-amd64 tarball
from releases/download/<tag>/.
- Extracts the README's pinned pulse-installer ed25519 key and runs the
exact ssh-keygen -Y verify command from the README's secure-install
snippet against the downloaded asset.
- Re-checks the server-installer banner, the --version) arg handler, and
the absence of the agent banner — same pins as validate-release.sh, but
now against what GitHub is actually serving (not just what was built
locally).
- Boots jrei/systemd-debian:12 privileged, runs
`bash install.sh --archive <tarball> --disable-auto-updates` from
inside, waits for systemd pulse.service to become active, hits
/api/health, and asserts /api/version reports the expected version.
--archive mode is used rather than --version so the workflow doesn't
depend on install.sh's self-refetch loop (the re-fetched bytes are the
ones we already validated). Auto-updates are disabled to avoid the timer
unit doing anything during the smoke run.
Triggers are workflow_dispatch + workflow_call only. Wire it into
create-release.yml after the next RC validates it green.
Pinned in build_release_assets_test.go so silent deletion or weakening
of any critical assertion (signature verify, banner check, /api/health
hit, version match) trips the test.
These two env vars were documented as relay overrides in v6 docs since
March 18 (CONFIGURATION.md, RELAY.md, and the frontend-served doc copy)
but no code ever read them. Operators trying to bootstrap relay headlessly
saw no effect.
Implement them rather than remove the documentation. Headless and
container deployments now have a real path to enable relay and point it
at a private endpoint without going through Settings → Relay.
internal/relay/config_env.go:
- ApplyEnvOverrides(*Config) mutates relay.Config in place.
- PULSE_RELAY_ENABLED accepts true/false/yes/no/1/0/on/off (case-
insensitive). Unrecognized values log a warning and leave the file
value untouched — important so "unset" reads differently from
"explicit false."
- PULSE_RELAY_SERVER goes through the existing validateRelayServerURL
check; invalid URLs log a warning and fall through.
internal/config/persistence_relay.go:
LoadRelayConfig calls ApplyEnvOverrides after the file load and after
the default-fallback when relay.enc is absent, so the env override
applies on every load.
Tests cover unset / true / false / garbage-bool / valid-URL / invalid-URL
/ both-together / nil-config paths in the relay package, plus two
end-to-end tests in internal/config that prove the override flows through
LoadRelayConfig against a real persisted file and against the
missing-file default branch.
Restore the env-var docs with the correct default URL (the full
wss://relay.pulserelay.pro/ws/instance, not the bare hostname the
original aspirational table claimed) and add an explicit precedence note:
saving from the UI after an env override persists the env-effective state
to disk, so clearing the env alone does not revert.
Add internal/relay/config_env_test.go to the relay-runtime registry's
desktop-relay-runtime exact_files so the new code surface is proof-tracked.
Update the matching pin in subsystem_lookup_test.py. Extend the
relay-runtime contract Extension Point 3 to document the override
semantics LoadRelayConfig must satisfy.
CONFIGURATION.md and its frontend-served copy advertised an "Environment
Overrides" table for relay with two env vars, but neither has ever existed
in code. git log -S "PULSE_RELAY_ENABLED" -- 'internal/**.go' is empty;
relay config is entirely file-driven (relay.enc) and configured from
Settings → Relay. The documented default "relay.pulserelay.pro" was also
incorrect — the actual code default is the full ws URL
"wss://relay.pulserelay.pro/ws/instance".
Replace the misleading table with a clarifying paragraph that states the
truth: relay has no env-var overrides, it's UI-configured, and the actual
default server URL. Operators trying to deploy headlessly with the
documented env vars would silently get the default config because the
runtime never reads them; better to say so explicitly.
Found during a sweep of all 29 PULSE_* env vars documented in
CONFIGURATION.md against the codebase. These two were the only doc-only
drift; the other 27 all have real production references.
The "Manual systemd install (advanced)" example in docs/INSTALL.md told
users to run `sudo install -m 0755 pulse /usr/local/bin/pulse`, but the
release tarballs extract to ./bin/pulse, not ./pulse. Following the
documented command literally produced "install: cannot stat 'pulse'".
Show the download/extract commands explicitly so users know the tarball
shape, and correct the install path to `bin/pulse`. validate-release.sh
already pins the ./bin/pulse tarball layout, so this is the matching
documentation side of that contract.
The README's secure-install snippet has pinned the wrong ed25519 key
since commit a60fa03d7 (April 22, 2026), so v6 rc.2 through rc.5 all
shipped with a documented verification step that does not work.
I downloaded the published rc.5 install.sh + install.sh.sshsig and
ran ssh-keygen -Y verify with both candidate keys:
Ds21c5... (README's pinned key) -> Could not verify signature
MZd/DaH... (key embedded in install.sh and pulse-auto-update.sh) -> OK
Customers who actually followed the README's secure-install path saw
"Could not verify signature" and aborted. Most users curl-pipe the
script unverified so the drift went unreported.
Replace the stale key in README.md and docs/INSTALL.md with the actual
pipeline signing key (MZd/...).
Add a validate-release.sh smoke that extracts the README's pinned key
and runs the exact ssh-keygen -Y verify command against the signed
install.sh.sshsig. Any future drift between documented key and actual
signing key fails the release before publish.
Lock both the correct-key presence and the stale-key absence in
build_release_assets_test.go for README and docs/INSTALL.md so a manual
edit cannot regress the docs back to the broken state.
Every v6 RC (rc.1 through rc.5, ~30 days) shipped the wrong install.sh.
build-release.sh was copying the rendered AGENT installer into
release/install.sh, but adapter_installsh, scripts/pulse-auto-update.sh,
the root install.sh's own --rc/--stable/--version flows, and the README
quickstart all fetch that asset and run `bash install.sh --version vX.Y.Z`.
Since the agent installer rejects --version with "Unknown argument", the
LXC quickstart, the in-product "Update Pulse" button on systemd/proxmoxve
deployments, and the pulse-auto-update.sh systemd timer were all broken
for every RC.
Fix the build-release.sh copy to publish the root server installer.
The agent installer continues to ship inside tarballs at ./scripts/install.sh
and inside Docker images at /opt/pulse/scripts/install.sh, and is served
at the running Pulse server's /install.sh endpoint — none of those paths
change. Only the top-level GitHub Releases asset moves from agent to server.
Update build_release_assets_test.go to lock in the new publishing rule
and ban the reverse drift, replacing the March 18 "legacy root install.sh"
guard that was the original mistake.
Add a validate-release.sh smoke that catches the regression mode this hid:
the published install.sh must have the Pulse server banner, the --version)
arg handler, must not contain the agent banner, and `bash install.sh --help`
must print the server installer's version-pinning help line. These checks
run as part of validate-release-assets.yml against the post-publish asset
bundle so a future swap back cannot slip through.
Document the asset identity rule and the validate-release.sh guard in the
deployment-installability contract so any future change to the publishing
pipeline has to update the contract or trip the shape guard.
A Proxmox host wedged on a ZFS deadlock yesterday took the cluster API poll
with it (context deadline exceeded). The unified connections aggregator
flipped the Connection from active to stale to unreachable, and the
Settings / Infrastructure page rendered the right badges, but no top-nav
alert ever fired because nothing was actively notifying off that derived
state. Patrol's deterministic triage flagged it every minute, but its LLM
investigation stage has been broken since 2026-02-26 so flags never
escalated into user-visible findings. Result: a 3 hour outage I only
noticed because I happened to open Settings.
This wires an active notification off the same connection state the
Settings badges already use:
- internal/alerts/connection.go: new CheckConnection +
clearConnectionDegradedAlert that fire connection-degraded after three
consecutive stale or unreachable observations. Severity scales: stale
warning, unreachable / unauthorized critical. Clear runs through the
same recovery-confirmation gate as clearNodeOfflineAlert so a single
flap back to active doesn't silently resolve a real outage. Paused,
disabled, and non-platform connections are no-ops.
- internal/api/connections_alerts.go: snapshot translator that turns
api.Connection into the narrow alerts.ConnectionSnapshot view. Keeping
the snapshot type inside the alerts package preserves the existing
api -> monitoring import direction; the monitor would have cycled if
it called back into api directly.
- internal/monitoring: new SetConnectionsSnapshotLister hook + a
per-tick checkConnectionAlerts call in the main poll loop, alongside
the existing evaluate*Agents passes.
- internal/api/router.go: register the lister closure on r.monitor so
the alerts loop sees the same Connection rows the HTTP handler does.
- internal/alerts/specs/types.go: add "connection" to the migration
bridge list of accepted ResourceTypes, alongside node / docker-host /
proxmox-disk / etc. The connection concept doesn't have a canonical
unified resource type yet; this matches the existing pattern for
alert-keyed resources that aren't first-class canonical.
Test coverage in internal/alerts/connection_test.go covers active never
fires, three stale observations escalate from pending to warning,
unreachable escalates warning to critical, unauthorized fires critical
cold, paused / disabled / agent never fire, recovery confirmation gate,
and a stale flap during recovery resets the gate.
TestResourceAlertSpecValidateAllowsConnectionMigrationBridgeType mirrors
the existing migration-bridge proof tests for the new type.
Small Ollama models (qwen2.5:11b, qwen2.5:14b, similar) frequently emit
Pulse tool invocations as plain JSON inside content instead of routing
through the structured tool_calls channel. Users saw raw payloads like
`{"name": "pulse_query", "parameters": {...}}` as the assistant's final
response.
Extend cleanToolCallArtifacts and containsToolCallMarker with a new pass
that detects this leak shape, gated on a closed allowlist of canonical
tool names sourced from the runtime registry. The allowlist auto-syncs
when registerTools() gains a tool, so no separate hand-maintained list.
Anchored on (?:^|\n) and a leading `"name"` key so prose containing JSON
fragments, unrelated objects (`{"foo":"bar"}`), or named resources
(`{"name":"my-vm","cpu":50}`) are left untouched.
Cluster peers without a Pulse Unified Agent are still monitored via the
cluster's PVE API token — the agent adds richer telemetry (hardware,
sensors, OS metadata) but isn't the only source of monitoring. The
previous "N node(s) unmonitored" copy misrepresented the state.
Reframes the banner as an action opportunity, matching the adjacent
"Review & Deploy" CTA. Gate logic unchanged.
The "N nodes unmonitored" banner gated purely on absence of a Pulse
Unified Agent (r.agent?.agentId). Offline cluster members (status
'offline') were counted as deploy candidates, which is wrong on two
fronts: those nodes are typically still covered by the cluster's PVE
API token (so they aren't truly unmonitored, just unreachable), and an
offline host is precisely the case where deploying an agent cannot
succeed.
Exclude status === 'offline' from the unmonitored count.
The previous comment claimed DeepSeek's API aliases v4-flash/v4-pro to
deepseek-reasoner, justifying the auto coercion via the legacy
reasoner's known 400 behavior. That had the alias direction inverted:
per DeepSeek's pricing page, deepseek-chat and deepseek-reasoner are
deprecated aliases for v4-flash's non-thinking and thinking modes
respectively, not the other way around.
The coercion itself is empirically correct, though. Live preflight
against deepseek-v4-flash with tool_choice=required produces a
deterministic HTTP 400 ("provider rejected forced tool selection") in
275ms. The behavior is consistent across the DeepSeek user community
- multiple downstream projects (pydantic-ai #5193, claude-code-router
#1378, opencode #24190, others) confirm V4 models reject forced tool
selection despite DeepSeek's chat-completion docs listing required as
a supported value. Server reality disagrees with documentation.
This commit updates only the comments in openai.go and openai_test.go
to point at the empirical evidence and the community confirmation.
The coercion behavior is unchanged.
The readiness banner previously rendered only the summary string
("Provider connection issue (last preflight 2h ago)"), even though
the underlying payload already carried the provider, model, preflight
duration, tool-call observation, and recommendation. Operators had to
open dev tools to find out which provider failed and what to check.
Now the banner inlines:
- Provider and model that preflight tried to reach
- Preflight duration and whether a tool call was observed
- The preflight recommendation text
The state hook gains a patrolPreflight memo that reads from the same
AISettings payload the rest of the patrol surface already consumes.
Found by exercising pulse_summarize in real chat: the user asked a
question, the model called the tool successfully (response came back
with narrative_source: ai), but the chat panel ended with raw DSML
text and "Assistant response is ready" — no actual prose answer.
Root cause was in agentic_sanitize.go. The fast-path string list
only checked single-pipe DSML variants ("<|DSML|...>"), but
deepseek-v4-flash emits the double-pipe form ("<||DSML||...>").
The opening sequence didn't match any marker, so cleanToolCallArtifacts
returned the content unchanged. The chat orchestrator then showed
the raw DSML to the user as if it were the assistant's answer.
Fix:
- Add double-pipe variants (Unicode and ASCII) to the fast-path
marker list. Six new entries, mirroring the existing single-pipe
entries.
- Add a backstop regex (dsmlRe) that matches any pipe-count variant
via `</?[\||]+/?DSML[\||]*`. Future model behaviour with triple
or higher pipe counts gets caught without another fast-path edit.
- containsToolCallMarker gets the same coverage so streaming
detection stops forwarding content the moment the marker
appears, regardless of pipe count.
Tests in agentic_sanitize_test.go gain three new cases for cleanup
(double-pipe Unicode, double-pipe ASCII, triple-pipe regex
backstop) and two for detection (double-pipe Unicode, double-pipe
ASCII). All passing.
Process note: this bug was invisible to existing tests because the
test suite only covered the single-pipe variants that were
documented in the marker list. The double-pipe form only appears
when an actual model emits it. Same pattern as the JSON-casing fix
earlier — exercise the surface against a real LLM, find what tests
in isolation can't see.
The previous rip retired only active "Active alert detected" findings
from the now-removed detectAlertSignals -> SignalActiveAlert emitter.
Resolved instances were left in place, polluting the Resolved tab and
inflating the regressed total on the trust strip (a stale 8 resources
each marked "regressed 3x" with descriptions like "Active warning
alert: Container 'ollama' is powered off"). They have no canonical
operator value -- the Alerts surface is the source of truth for
currently-firing alerts -- so on load we now purge them entirely
rather than keeping them around as Resolved noise. Active mirrors are
still retired (auto-resolved with a clear reason) so operators see
why the finding closed; resolved mirrors disappear silently because
they were already in the terminal state. Idempotent.
Extends TestFindingsStore_SetPersistence_RetiresLegacyAlertMirrorFindings
with a fixture for the resolved-mirror case and asserts both the
in-memory purge and the persisted state no longer carries it.
Found by exercising pulse_summarize in a real chat session. The
chat-tool response surfaced:
"observations": [{"Text": "...", "Severity": "info"}]
The AI narrator's system prompt (report_narrator.go) tells the
model to emit lowercase keys:
{"text": "...", "severity": "..."}
The model was being taught one schema and shown a different one
in the tool response for the same shape. NarrativeBullet,
FleetOutlier, Narrative, and FleetNarrative had no JSON tags, so
embedded struct fields serialised with their Go names.
Add struct tags so the wire shape matches the prompt schema. Pure
marshaling change — JSON tags don't affect Go field access, so
PDF rendering (which reads fields directly) is unchanged. Tests
in narrative_json_test.go pin the shape so the inconsistency
can't reappear silently.
Process note: this is a class of bug that only appears when an
LLM actually consumes the output. No unit test caught it; no
review of the code showed it; the model running through the chat
path is what surfaced it. Another argument for "exercise the
artifact" — even the tool surface that looks correct in
isolation has hidden inconsistencies you only see when something
external reads it.