# Postmortem and Regression Tests — OpsDeploy Guardrails

Turn incidents into permanent fixes. This page gives a short postmortem template, evidence you must capture, and a drop-in regression suite so the same class of failure does not ship again.

---

## Open these first

- Rollout and canary: [rollout_readiness_gate.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/OpsDeploy/rollout_readiness_gate.md), [staged_rollout_canary.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/OpsDeploy/staged_rollout_canary.md)
- Cutovers: [blue_green_switchovers.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/OpsDeploy/blue_green_switchovers.md), [vector_index_build_and_swap.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/OpsDeploy/vector_index_build_and_swap.md)
- Version and cache discipline: [version_pinning_and_model_lock.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/OpsDeploy/version_pinning_and_model_lock.md), [cache_warmup_invalidation.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/OpsDeploy/cache_warmup_invalidation.md)
- Stability under load: [rate_limit_backpressure.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/OpsDeploy/rate_limit_backpressure.md), [retry_backoff.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/OpsDeploy/retry_backoff.md)
- Side effects: [idempotency_dedupe.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/OpsDeploy/idempotency_dedupe.md)
- Rollback: [rollback_and_fast_recovery.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/OpsDeploy/rollback_and_fast_recovery.md)
- Live ops: [live_monitoring_rag.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/ops/live_monitoring_rag.md), [debug_playbook.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/ops/debug_playbook.md)

---

## When to run a postmortem

- ΔS drift p95 above 0.15 or coverage below 0.60 in any canary window.
- λ flips above 0.20 or tool loops detected.
- 5xx above 1 percent or sustained 429 storms.
- Mixed answers across versions or index mismatch.
- Duplicate side effects or data corruption risk.

---

## Exit targets after recovery

- Error rate below 0.5 percent within ten minutes.
- ΔS and coverage match the last pinned baseline window.
- p95 latency within plus 15 percent of baseline.
- Duplicate side effects equal zero after reconciliation.

---

## Evidence you must capture

- Version pins: `BUILD_ID`, `GIT_SHA`, `MODEL_VER`, `PROMPT_VER`, `EMBED_MODEL_VER`, `EMBED_DIM`, `NORM`, `metric`, `RERANK_CONF`, `TOK_VER`, `ANALYZER_CONF`, `CHUNK_SCHEMA_VER`, `INDEX_HASH`.
- Gold set window: ΔS(question,retrieved), ΔS(retrieved,anchor), coverage, λ states before and after.
- Traffic weights or flags during the event.
- Cache namespace keys that were active.
- Side effect receipts and idempotency decisions.
- Timeline of flips and breaker state.

---

## Root cause classifier

| Class | Typical sign | Primary fix page |
|---|---|---|
| Pin drift | answers change after silent provider update | [version_pinning_and_model_lock.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/OpsDeploy/version_pinning_and_model_lock.md) |
| Index swap error | mixed citations or stale blends | [vector_index_build_and_swap.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/OpsDeploy/vector_index_build_and_swap.md) |
| Cache namespace bug | cross-arm answers after cutover | [cache_warmup_invalidation.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/OpsDeploy/cache_warmup_invalidation.md) |
| Idempotency missing | duplicate writes or refunds | [idempotency_dedupe.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/OpsDeploy/idempotency_dedupe.md) |
| Load controls weak | 429 storms, tail spikes | [rate_limit_backpressure.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/OpsDeploy/rate_limit_backpressure.md), [retry_backoff.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/OpsDeploy/retry_backoff.md) |
| Boot ordering | first call fails after deploy | [bootstrap-ordering.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/bootstrap-ordering.md) |
| Prompt regression | fluent but wrong citations | [rollout_readiness_gate.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/OpsDeploy/rollout_readiness_gate.md) |

---

## 60-second postmortem checklist

1) Lock recovery window and freeze writes for reconciliation.
2) Capture all pins and the gold set metrics before and after the flip.
3) Identify the first bad change and the last known good set.
4) Classify the root cause using the table above.
5) Attach the exact fix page link that prevents this class.
6) Add a regression test that fails on the pre-fix build and passes on the fixed build.
7) File owners and due dates. Publish within 48 hours.

---

## Postmortem template (paste-ready)

```markdown
# Incident · severity

## Summary
One paragraph in plain language.

## Timeline
-