docs(release): finalize hotfix 5.1.3 checklist and version bump

rcourtman 2026-02-07 14:18:53 +00:00
parent f253ed2778
commit 839ed5cc1e
3 changed files with 290 additions and 1 deletion

@@ -1 +1 @@
-5.1.2
+5.1.3

@@ -0,0 +1,134 @@
# Hotfix 5.1.3 Execution Checklist
Last updated: 2026-02-07
Owner: Codex + maintainer
Branch: `pulse/hotfix-5.1.3`
Base tag: `v5.1.2` (`c949e9c9`)
## 1) Branch Start Verification
- [x] `git status` checked
- [x] `git log --oneline -n 3` checked
- [x] `git describe --tags --exact-match` equals `v5.1.2`
## 2) P0 Scope (Must Ship)
### 2.1 Proxmox stale/offline reliability (`#1094`, `#1204`, `#1192`, `#1199`)
- [x] Reproducer documented
- [x] Acceptance criteria defined
- [x] Fix implemented
- [x] Automated tests added/updated
- [x] Manual validation evidence captured
- [x] Release note entry prepared (factual only)
Acceptance criteria:
- [x] Fresh data does not become stale/false-offline during normal polling window
- [x] No stale-state carryover after temporary offline transition
Evidence links/notes:
- `internal/monitoring/monitor.go`: empty-node fallback now preserves recent nodes within grace window.
- `internal/monitoring/monitor_memory_test.go`:
  - `TestPollPVEInstancePreservesRecentNodesWhenGetNodesReturnsEmpty`
  - `TestPollPVEInstanceMarksStaleNodesOfflineWhenGetNodesReturnsEmpty`
### 2.2 Alerting stale evaluator / loop reliability (`#1096`, `#1179`, `#1159`, `#1043`)
- [x] Reproducer documented
- [x] Acceptance criteria defined
- [x] Fix implemented
- [x] Automated tests added/updated
- [x] Manual validation evidence captured
- [x] Release note entry prepared (factual only)
Acceptance criteria:
- [x] Evaluator resumes after offline -> online transitions
- [x] No deadlock/freeze under sustained alert checks
Evidence links/notes:
- `internal/alerts/alerts.go`: `checkMetric` re-notify path now dispatches asynchronously to reduce evaluator loop blocking risk.
- Covered by existing dispatch/checkMetric tests in `internal/alerts/alerts_test.go`.
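
As a rough sketch of why asynchronous dispatch reduces blocking risk: the evaluator hands the notification to a goroutine and returns immediately, so a slow sender cannot stall the loop. The `Alert` and `Dispatcher` types below are hypothetical and are not the types used in `internal/alerts/alerts.go`.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Alert is a hypothetical, minimal alert record.
type Alert struct{ ID string }

// Dispatcher runs each send in its own goroutine so the caller never waits
// on a slow notification channel.
type Dispatcher struct {
	send func(Alert)
	wg   sync.WaitGroup
}

// DispatchAsync returns immediately; the send happens in the background.
func (d *Dispatcher) DispatchAsync(a Alert) {
	d.wg.Add(1)
	go func() {
		defer d.wg.Done()
		d.send(a)
	}()
}

// Wait blocks until all in-flight sends have finished.
func (d *Dispatcher) Wait() { d.wg.Wait() }

func main() {
	d := &Dispatcher{send: func(a Alert) {
		time.Sleep(100 * time.Millisecond) // simulate a slow webhook or email sender
		fmt.Println("sent", a.ID)
	}}
	d.DispatchAsync(Alert{ID: "cpu-high"})
	fmt.Println("evaluator loop continues without waiting")
	d.Wait() // only so the demo does not exit before the send completes
}
```

A production version would likely also bound concurrency or queue depth; that is left out here to keep the sketch short.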
### 2.3 Swarm alert correctness (`#1202` + support thread symptoms)
- [x] Reproducer documented
- [x] Acceptance criteria defined
- [x] Fix implemented
- [x] Automated tests added/updated
- [x] Manual validation evidence captured
- [x] Release note entry prepared (factual only)
Acceptance criteria:
- [x] Healthy services do not trigger false warning spam
- [x] Alert messaging matches observed service state
Evidence links/notes:
- `internal/alerts/alerts.go`: Docker service alerts now notify on new alert and warning->critical escalation only; unchanged degraded state updates in-place without poll-cycle re-notify spam; rate-limit check added.
- `internal/alerts/alerts_test.go`:
  - `TestDockerServiceAlertDoesNotRenotifyWhenUnchanged`
  - `TestDockerServiceAlertRenotifiesOnEscalationToCritical`
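
The notify rule described above (notify on a new alert or on warning->critical escalation, stay quiet otherwise) can be expressed as a small decision function. This is a sketch under assumptions: `Level` and `shouldNotify` are hypothetical names, and the rate-limit check mentioned above is not modeled.

```go
package main

import "fmt"

// Level is a hypothetical alert severity.
type Level int

const (
	LevelWarning Level = iota + 1
	LevelCritical
)

// shouldNotify reports whether a notification should be sent for a service
// alert. prev is nil when no alert existed before this poll cycle.
func shouldNotify(prev *Level, current Level) bool {
	if prev == nil {
		return true // new alert
	}
	// Existing alert: notify only on warning -> critical escalation.
	return *prev == LevelWarning && current == LevelCritical
}

func main() {
	warning := LevelWarning
	fmt.Println(shouldNotify(nil, LevelWarning))       // new alert
	fmt.Println(shouldNotify(&warning, LevelWarning))  // unchanged degraded state
	fmt.Println(shouldNotify(&warning, LevelCritical)) // escalation
}
```

An unchanged degraded state falls through to `false`, so the alert can be updated in place without producing a notification on every poll cycle.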
### 2.4 License gate hardening (key/config mismatch regressions)
- [x] Reproducer documented
- [x] Acceptance criteria defined
- [x] Startup/assertion logging for active license verification key fingerprint
- [x] CI/release guard against wrong-key build silently passing
- [x] Automated tests added/updated
- [x] Manual validation evidence captured
- [x] Release note entry prepared (factual only)
Acceptance criteria:
- [x] Valid Pro key consistently unlocks Pro features after restart/update
- [x] Wrong-key/config mismatch is visible and blocks release path
Evidence links/notes:
- `internal/license/pubkey.go`: startup logs now include key source and `SHA256` fingerprint of active verification key.
- `scripts/build-release.sh`: release build now fails if `PULSE_LICENSE_PUBLIC_KEY` is missing (unless an explicit local bypass is set) and can assert the expected fingerprint via `PULSE_LICENSE_PUBLIC_KEY_FINGERPRINT`.
- `internal/license/pubkey_test.go`: added `TestPublicKeyFingerprint`.
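
For illustration, a key fingerprint check of the kind described above might look like the following. The hex encoding, the function names `fingerprint` and `verifyFingerprint`, and the placeholder key bytes are assumptions for this sketch; the actual fingerprint format in `internal/license/pubkey.go` may differ.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

// fingerprint returns the hex-encoded SHA-256 digest of the raw key bytes.
// (Hex encoding is an assumption made for this example.)
func fingerprint(key []byte) string {
	sum := sha256.Sum256(key)
	return hex.EncodeToString(sum[:])
}

// verifyFingerprint returns an error when the key does not match the expected
// fingerprint, so a wrong-key build fails loudly instead of passing silently.
func verifyFingerprint(key []byte, expected string) error {
	got := fingerprint(key)
	if !strings.EqualFold(got, strings.TrimSpace(expected)) {
		return fmt.Errorf("license key fingerprint mismatch: got %s, want %s", got, expected)
	}
	return nil
}

func main() {
	key := []byte("placeholder-public-key-bytes") // not a real key
	fp := fingerprint(key)
	fmt.Println("active key fingerprint:", fp)
	if err := verifyFingerprint(key, fp); err != nil {
		fmt.Println("guard failed:", err)
		return
	}
	fmt.Println("guard passed")
}
```

Logging the fingerprint at startup and comparing it at release time gives two independent places where a mismatch becomes visible.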
## 3) P1 Scope (Ship Only If Low Risk)
### 3.1 Host URL edit regression (`#1197`)
- [ ] Triaged
- [ ] Fixed (if low risk)
- [ ] Validated
### 3.2 Release notes link (`#1195`)
- [ ] Triaged
- [ ] Fixed (if low risk)
- [ ] Validated
### 3.3 Rootless Docker detection (`#1200`)
- [ ] Triaged
- [ ] Fixed (if low risk)
- [ ] Validated
### 3.4 Backup attribution duplicate VMID edge case (`#1177`)
- [ ] Triaged
- [ ] Fixed (if low risk)
- [ ] Validated
### 3.5 VM disk totalBytes inflation edge case (`#1158`)
- [ ] Triaged
- [ ] Fixed (if low risk)
- [ ] Validated
## 4) Verification Gate (Required Before Tag)
- [x] `make test`
- [x] `make lint-frontend`
- [x] `make frontend`
- [x] `make build`
- [ ] Manual smoke: Proxmox freshness over extended run
- [ ] Manual smoke: alerts survive offline -> online transitions
- [ ] Manual smoke: Swarm false warnings absent for healthy services
- [ ] Manual smoke: Pro license survives restart/update
- [ ] Manual smoke: support bundle captures diagnostic evidence
## 5) Release Steps
- [ ] Release notes updated with verified fixes only
- [ ] Version bumped to `5.1.3`
- [ ] Tag and publish release from `pulse/hotfix-5.1.3`
- [ ] Fixed issues updated with exact version + validation notes
- [ ] Hotfix commits back-merged/cherry-picked to forward branch
## 6) Execution Log
- 2026-02-07: Initialized checklist and validated branch starts from `v5.1.2`.
- 2026-02-07: Implemented P0 stabilization patches for Proxmox empty-node grace handling, alert loop async re-notify, Swarm service re-notify dedupe/escalation behavior, and license key fingerprint + release guard hardening.
- 2026-02-07: Addressed pre-ship findings: preserved `LastNotified` for rebuilt service alerts and added explicit escalation logging for Docker service alert escalations.
- 2026-02-07: Validation rerun complete: targeted monitoring/alerts/license tests passed, plus `make test`, `make lint-frontend`, `make frontend`, and `make build`.

@@ -0,0 +1,155 @@
# Hotfix 5.1.3 Start Here
Last updated: 2026-02-07
Branch: `pulse/hotfix-5.1.3`
Base: `v5.1.2` (`c949e9c9`)
## Why This Exists
`5.1.3` is a stabilization release.
Goal: restore trust and reliability quickly without mixing in large architectural changes.
This branch is intentionally isolated from the forward/unified-resource work.
## Guardrails (Non-Negotiable)
- Do not merge any unified-resource/navigation overhaul work into this branch.
- Keep fixes minimal, targeted, and low-risk.
- Every fix must have either:
  - a reproducer and a test, or
  - a reproducer and explicit manual validation evidence.
- Do not send customer follow-ups until behavior is verified locally or in known-good diagnostics.
## Known Customer Context (Cosmin)
Recent thread context (Feb 6-7, 2026):
- License appeared valid but Pro areas were locked (reported on 5.1.2).
- Docker/Swarm alert behavior looked incorrect to customer.
- Customer explicitly challenged prior explanation ("services are up, why 0.0 of 0?").
- Prior thread included accidental/incorrect outbound messages; trust is currently fragile.
Implication for 5.1.3:
- Prioritize correctness and confidence over breadth.
- Release should avoid speculative claims and include clear, verified behavior notes.
## Priority Scope
## P0 (Must Ship in 5.1.3)
1. Proxmox data freshness / false offline / stale state reliability
   Issues: `#1094`, `#1204`, `#1192`, `#1199`
2. Alerting loop reliability and stale-evaluator behavior
   Issues: `#1096`, `#1179`, `#1159`, `#1043`
3. Swarm service alert correctness (false warning patterns)
   Related customer complaint + issues: `#1202` (metrics gap), alert symptoms seen in support thread
4. License gate hardening against key/config mismatch regressions
   No open issue cleanly matches this latest incident, but the support thread shows high business impact.
   At minimum:
   - add startup/assertion logging around active license verification key fingerprint
   - add test/guard so wrong-key build cannot silently pass CI/release path
## P1 (Ship If Low Risk, Else Defer)
1. Host URL edit discoverability/regression
   Issue: `#1197`
2. Release notes "View details" broken link
   Issue: `#1195`
3. Rootless Docker detection
   Issue: `#1200`
4. Backup attribution correctness (duplicate VMID edge cases)
   Issue: `#1177`
5. VM disk totalBytes inflation edge cases
   Issue: `#1158`
## P2 (Explicitly Defer Unless Free/Fast)
- Mobile rendering regressions (`#1196`)
- Reporting engine initialization (`#1186`)
- Broader enhancement requests (for example partition exclusion)
## Start Checklist (Do This First)
1. Confirm branch and base:
   - `git status`
   - `git log --oneline -n 3`
   - `git describe --tags --exact-match` should be `v5.1.2` at branch start
2. Create a tracking checklist issue or local checklist from this doc.
3. Reproduce P0 items one by one with minimal fixtures/diagnostics.
4. Define acceptance criteria before coding each fix.
5. Implement smallest safe patch per item, with tests where possible.
## Suggested Execution Order
1. Proxmox stale/offline reliability (`#1094` cluster)
   Reason: highest customer pain + long-lived issue + high comment volume.
2. Alerting deadlock/stale evaluations (`#1096` cluster)
   Reason: can cause monitoring trust collapse across features.
3. Swarm alert correctness and messaging
   Reason: directly tied to active customer thread and confusion.
4. License verification hardening
   Reason: low frequency, high severity business impact.
5. Quick P1 regressions (`#1195`, `#1197`) if near-zero risk.
## Engineering Standards for This Hotfix
- One logical fix per commit.
- Commit message format:
  - `fix(<area>): <what changed> (#issue)`
- Add/adjust tests close to fix location.
- Prefer surgical changes over refactors.
- Keep public behavior notes precise (no guessing).
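
The commit message format above could be checked mechanically, for example in a local hook. The following is a minimal sketch; the regular expression, the allowed characters in `<area>`, and the `validCommitMessage` name are assumptions for illustration and are not part of this repository.

```go
package main

import (
	"fmt"
	"regexp"
)

// commitPattern matches subjects of the form: fix(<area>): <what changed> (#issue)
// The character set allowed in <area> is an assumption for this example.
var commitPattern = regexp.MustCompile(`^fix\(([a-z0-9_/-]+)\): (.+) \(#(\d+)\)$`)

// validCommitMessage reports whether a commit subject follows the hotfix format.
func validCommitMessage(subject string) bool {
	return commitPattern.MatchString(subject)
}

func main() {
	subjects := []string{
		"fix(alerts): dispatch re-notify asynchronously (#1096)",
		"fix alerts loop",
	}
	for _, s := range subjects {
		fmt.Printf("%q valid=%v\n", s, validCommitMessage(s))
	}
}
```

The first subject matches the format and the second does not, because it lacks both the `(<area>)` scope and the trailing issue reference.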
## Verification Matrix (Release Gate)
All items below must pass before tagging `v5.1.3`.
1. Backend tests:
   - `make test`
2. Frontend lint/build sanity:
   - `make lint-frontend`
   - `make frontend`
3. Full build:
   - `make build`
4. Manual smoke checks:
   - Proxmox: nodes remain fresh/online over extended run window
   - Alerts: no freeze/stale evaluator after offline->online transitions
   - Swarm: no false warning spam for healthy services
   - License: valid Pro key unlocks Pro features consistently after restart/update
   - "View details" link works (if patched)
   - Host URL editing path is clear and functional (if patched)
5. Support bundle check:
   - Confirm diagnostics/export contains enough evidence for future triage.
## Release Steps (End State)
1. Update release notes with only confirmed fixes.
2. Bump version to `5.1.3` where applicable.
3. Tag and publish release from this branch.
4. Post-release:
   - comment on fixed issues with exact version and validation notes
   - close only issues that are truly verified
5. Back-merge/cherry-pick hotfix commits into forward branch:
   - `pulse/unified-resource-pre-hotfix-2026-02-07` (or newer forward branch)
## Definition of Done
- `v5.1.3` shipped from hotfix branch.
- P0 reliability regressions fixed and validated.
- Release notes are factual and test-backed.
- Hotfix commits are propagated back to forward branch.
- Customer follow-up (including Cosmin) can be sent with confidence and concrete fixes.
## Notes for Customer Comms (When Ready)
- Lead with verified outcomes, not hypotheses.
- For each reported symptom:
  - what was wrong
  - what changed in `5.1.3`
  - what the customer should expect now
  - what to send if still reproducible (diagnostics bundle path)