Record Pulse Cloud recurring audit monitor

This commit is contained in:
rcourtman 2026-04-24 15:15:54 +01:00
parent 0d90b3457c
commit 0d87110166
2 changed files with 65 additions and 14 deletions

View file

@ -159,9 +159,29 @@ tenant/account rows were removed, the managed runtime container and
`CP_PROOF_TENANT_MAX_AGE` was tightened to `1s` so future proof residue is
surfaced by `pulse-control-plane cloud audit` almost immediately.
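The effect of the tightened gate can be sketched as a staleness check: any proof tenant older than `CP_PROOF_TENANT_MAX_AGE` counts as stale. A minimal, hypothetical sketch (field names and record shape are assumptions, not the control plane's actual schema):

```python
from datetime import datetime, timedelta, timezone

# Hypothetical sketch of the CP_PROOF_TENANT_MAX_AGE gate; the field names
# and record shape are assumptions, not the control plane's actual schema.
MAX_AGE = timedelta(seconds=1)  # CP_PROOF_TENANT_MAX_AGE=1s

def count_stale_proof_tenants(tenants, now):
    # Only proof tenants are subject to the max-age gate; hosted customer
    # tenants are never counted as proof residue.
    return sum(
        1 for t in tenants
        if t["is_proof"] and now - t["created_at"] > MAX_AGE
    )

now = datetime(2026, 4, 24, 14, 0, tzinfo=timezone.utc)
tenants = [
    # Leftover proof workspace, hours old -> stale under a 1s max age.
    {"id": "t-KAQ6WKJX8V", "is_proof": True,
     "created_at": now - timedelta(hours=2)},
]
print(count_stale_proof_tenants(tenants, now))  # 1 -> proof_tenant_stale_count=1
```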
## Mobile Approval Proof Residue Recheck
A later same-day mobile approval proof left one additional disposable proof
workspace:
- Account: `a_ZTYQ41MVYG`
- Workspace: `t-KAQ6WKJX8V`
The account display name identified the state as
`Pulse Mobile GA Proof 20260424 Approval`. The recurring audit failed the
baseline with `proof_tenant_stale_count=1` and
`proof_account_stale_count=1`, proving the tightened proof-residue gate now
catches this class of drift.
The tenant registry was backed up to
`/root/tenants-pre-ga-mobile-proof-approval-cleanup-20260424T135425Z.db`, the
proof tenant/account rows were removed, the managed runtime container and
`/data/tenants/t-KAQ6WKJX8V` directory were removed, and the audit returned to
the empty hosted-customer baseline.
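The baseline gate that failed here reduces to two counters being zero. A sketch using the counter names reported in this record (the real check lives inside `pulse-control-plane cloud audit`):

```python
# Sketch of the proof-residue baseline gate, using the counter names
# reported in this record; the actual audit implementation is not shown here.
def proof_baseline_ok(counters):
    return (counters["proof_tenant_stale_count"] == 0
            and counters["proof_account_stale_count"] == 0)

before_cleanup = {"proof_tenant_stale_count": 1, "proof_account_stale_count": 1}
after_cleanup = {"proof_tenant_stale_count": 0, "proof_account_stale_count": 0}
print(proof_baseline_ok(before_cleanup), proof_baseline_ok(after_cleanup))  # False True
```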
## Final Audit
The final production audit after proof-residue cleanup passed:
```text
audit_ok=true
@ -174,13 +194,13 @@ docker_managed_unhealthy=0
storage_guardrails_enabled=true
storage_ok=true
storage_root_status=ok
storage_root_available_bytes=151212212224
storage_root_min_available_bytes=10737418240
storage_data_status=ok
storage_data_available_bytes=151212212224
storage_data_min_available_bytes=5368709120
storage_docker_status=ok
storage_docker_available_bytes=151212212224
storage_docker_min_available_bytes=10737418240
docker_build_cache_status=ok
docker_build_cache_total_bytes=0
@ -194,8 +214,8 @@ Host and Docker state at the same point:
```text
/dev/vda1 154G 14G 141G 9% /
Images 12 2 6.882GB 2.407GB reclaimable
Containers 2 2 30.98MB 0B reclaimable
Local Volumes 3 1 212.4MB 212.4MB reclaimable
Build Cache 0 0 0B 0B
```
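The guardrail fields in the audit output pair each mount's available bytes with a configured floor (10 GiB for root and Docker, 5 GiB for `/data`). A minimal sketch of that comparison, using the field names and values from the output above (the real check is inside `pulse-control-plane cloud audit`):

```python
# Assumed reduction of the storage guardrail check; values copied from the
# audit output above (151212212224 bytes is roughly 140.8 GiB available).
audit = {
    "storage_root_available_bytes": 151212212224,
    "storage_root_min_available_bytes": 10737418240,    # 10 GiB floor
    "storage_data_available_bytes": 151212212224,
    "storage_data_min_available_bytes": 5368709120,     # 5 GiB floor
    "storage_docker_available_bytes": 151212212224,
    "storage_docker_min_available_bytes": 10737418240,  # 10 GiB floor
}

def storage_ok(fields):
    # Every monitored mount must stay above its minimum-available floor.
    return all(
        fields[f"storage_{mount}_available_bytes"]
        >= fields[f"storage_{mount}_min_available_bytes"]
        for mount in ("root", "data", "docker")
    )

print(storage_ok(audit))  # True, matching storage_ok=true above
```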
@ -207,20 +227,46 @@ pulse-cloud-control-plane-1 pulse-control-plane:ga-audit-5fd645630 Up
pulse-cloud-traefik-1 a9890c898f37 Up
```
## Recurring Audit Monitor
The live host now has a recurring GA audit monitor installed:
- Script: `/opt/pulse-cloud/cloud-audit-monitor.sh`
- Unit: `/etc/systemd/system/pulse-cloud-audit.service`
- Timer: `/etc/systemd/system/pulse-cloud-audit.timer`
- Latest log: `/var/log/pulse-cloud-audit/latest.log`
- Status file: `/var/lib/pulse-cloud-audit/status.env`
- Prometheus textfile metric:
`/var/lib/node_exporter/textfile_collector/pulse_cloud_audit.prom`
- Managed source:
`pulse-pro:ops/pulse-cloud/audit/`
The timer fires five minutes after boot and every five minutes thereafter;
`pulse-cloud-audit.service` fails if `pulse-control-plane cloud audit` exits
nonzero or does not print `audit_ok=true`.
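A minimal sketch of what such a oneshot service and five-minute timer pair could look like; the shipped units live in the managed source bundle (`pulse-pro:ops/pulse-cloud/audit/`) and may differ:

```ini
# /etc/systemd/system/pulse-cloud-audit.service (sketch, not the shipped unit)
[Unit]
Description=Pulse Cloud recurring GA audit

[Service]
Type=oneshot
ExecStart=/opt/pulse-cloud/cloud-audit-monitor.sh

# /etc/systemd/system/pulse-cloud-audit.timer (sketch, not the shipped unit)
[Unit]
Description=Run the Pulse Cloud GA audit every five minutes

[Timer]
OnBootSec=5min
OnUnitActiveSec=5min

[Install]
WantedBy=timers.target
```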
Live verification after installing the managed monitor returned:
```text
Result=success
ExecMainStatus=0
PULSE_CLOUD_AUDIT_STATUS=ok
PULSE_CLOUD_AUDIT_EXIT_CODE=0
PULSE_CLOUD_AUDIT_AUDIT_OK=true
pulse_cloud_audit_success 1
pulse_cloud_audit_exit_code 0
```
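The status file and the Prometheus textfile metric carry the same result in two shapes. A hypothetical sketch of converting the `status.env` keys shown above into the textfile lines shown above (the real monitor script's implementation is not part of this record):

```python
# Hypothetical status.env -> textfile-metric conversion; key names are taken
# from the verification output above, the conversion logic is an assumption.
status_env = """PULSE_CLOUD_AUDIT_STATUS=ok
PULSE_CLOUD_AUDIT_EXIT_CODE=0
PULSE_CLOUD_AUDIT_AUDIT_OK=true"""

kv = dict(line.split("=", 1) for line in status_env.splitlines())
success = 1 if (kv["PULSE_CLOUD_AUDIT_STATUS"] == "ok"
                and kv["PULSE_CLOUD_AUDIT_AUDIT_OK"] == "true") else 0
metrics = [
    f"pulse_cloud_audit_success {success}",
    f"pulse_cloud_audit_exit_code {int(kv['PULSE_CLOUD_AUDIT_EXIT_CODE'])}",
]
print("\n".join(metrics))
# pulse_cloud_audit_success 1
# pulse_cloud_audit_exit_code 0
```

Node exporter's textfile collector picks such a file up from `/var/lib/node_exporter/textfile_collector/` on its next scrape.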
## Conclusion
`cloud-hosted-tier-runtime-readiness` can be treated as `passed` for the current
GA multi-tenant scope. The previous storage-exhaustion blocker is no longer
addressed by one-off manual cleanup: hosted tenant provisioning and runtime rollout now have
production admission guardrails, the recurring live audit monitor proves the
storage, stale-proof-tenant, stale-proof-account, and orphan-paid-entitlement
floor, and a fresh external MSP canary passed create, list, portal, detail,
delete, and cleanup verification on the live public control plane.
The correct GA baseline for Pulse Cloud today is empty of hosted customer
tenants. As of this record, that baseline is true in the registry, Docker, and
`/data/tenants`.

View file

@ -109,6 +109,7 @@ cloud-specific enforcement rules.
86. `internal/cloudcp/handoff/handler.go`, `internal/cloudcp/handoff/handoff.go`
87. `internal/cloudcp/stripe/grace_enforcer.go`, `internal/cloudcp/stripe/helpers.go`, `internal/cloudcp/stripe/reconciler.go`, `internal/cloudcp/stripe/webhook.go`
88. `internal/hosted/hosted_metrics.go`, `internal/hosted/reaper.go`
89. `pulse-pro:ops/pulse-cloud/audit/`
## Shared Boundaries
@ -148,6 +149,10 @@ cloud-specific enforcement rules.
The same cloud audit contract must fail on stale proof/canary account rows
and paid hosted entitlements whose tenant rows are missing, because either
residue can recreate or mask hosted runtime state after a cleanup.
The live production host must run that cloud audit through the private
Pulse Pro operations bundle on a recurring systemd timer, write durable
status/log output, and emit Prometheus textfile metrics so a clean GA
baseline is continuously monitored rather than manually rediscovered.
10. `internal/cloudcp/tenant_runtime_rollout.go` shared with `deployment-installability`: hosted tenant runtime rollout is both a Pulse Cloud runtime contract boundary and a deployment-installability release-rollout boundary.
Hosted tenant runtime reconciliation must treat a registered tenant with
preserved tenant data but no live Docker runtime as a recoverable managed