WFGY/ProblemMap/GlobalFixMap/Cloud_Serverless/secrets_rotation.md

237 lines
11 KiB
Markdown
Raw Blame History

This file contains invisible Unicode characters

This file contains invisible Unicode characters that are indistinguishable to humans but may be processed differently by a computer. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# Secrets Rotation and Key Fencing for Zero Downtime
<details>
<summary><strong>🧭 Quick Return to Map</strong></summary>
<br>
> You are in a sub-page of **Cloud_Serverless**.
> To reorient, go back here:
>
> - [**Cloud_Serverless** — scalable functions and event-driven pipelines](./README.md)
> - [**WFGY Global Fix Map** — main Emergency Room, 300+ structured fixes](../README.md)
> - [**WFGY Problem Map 1.0** — 16 reproducible failure modes](../../README.md)
>
> Think of this page as a desk within a ward.
> If you need the full triage and all prescriptions, return to the Emergency Room lobby.
</details>
A practical runbook to rotate API keys, signing secrets, and JWKS without breaking RAG pipelines or agent flows. Implements dual-accept windows, kid tagging, cache fences, and rollback so rotations are invisible to users and stable under load.
## When to use this page
* Incoming 401/403 spikes during deploys or scale-out.
* Webhook verification fails after a provider rotates its signing secret.
* Mixed keys across regions or functions after a partial rollout.
* Long-lived workers holding stale secrets in memory.
* SOC or policy requires periodic rotation with audit.
## Open these first
* Boot ordering and safe deploy windows: [Bootstrap Ordering](https://github.com/onestardao/WFGY/blob/main/ProblemMap/bootstrap-ordering.md)
* Avoid state loops during changeover: [Deployment Deadlock](https://github.com/onestardao/WFGY/blob/main/ProblemMap/deployment-deadlock.md)
* First-call crashes after release: [Pre-Deploy Collapse](https://github.com/onestardao/WFGY/blob/main/ProblemMap/predeploy-collapse.md)
* Lock tool schemas and reduce header bloat: [Prompt Injection](https://github.com/onestardao/WFGY/blob/main/ProblemMap/prompt-injection.md)
* Trace and prove what snippet was used: [Retrieval Traceability](https://github.com/onestardao/WFGY/blob/main/ProblemMap/retrieval-traceability.md) · Contract payloads: [Data Contracts](https://github.com/onestardao/WFGY/blob/main/ProblemMap/data-contracts.md)
* Live probes and rollback drills: [Live Monitoring for RAG](https://github.com/onestardao/WFGY/blob/main/ProblemMap/ops/live_monitoring_rag.md) · [Debug Playbook](https://github.com/onestardao/WFGY/blob/main/ProblemMap/ops/debug_playbook.md)
## Acceptance targets
* Zero user-visible errors during the rotation window. No p95 increase.
* 0 auth failures in synthetic probes across all regions and paths.
* Old secret fully revoked within the planned TTL and verified by logs.
* For RAG: no citation loss and ΔS(question, retrieved) ≤ 0.45 on the probe set after rotation.
---
## The two-keys-live pattern
Always rotate with two keys in play: the current key used for signing and a next key accepted for verification. Tag every credential with a key id so validation does not guess.
**Key rules**
1. Always include a stable kid on every token or signature.
2. Verify against `active_kids = {current, next}`. Sign with `current` only.
3. Flip in one step by promoting `next → current` when error rate is flat.
4. Revoke the old key after caches drain and long jobs finish.
**Minimal KV schema**
```json
{
"secrets_epoch": "2025-08-27T12:00:00Z",
"active": { "kid": "k_2025_08a", "secret_ref": "sm://wfgyp/prod/k_2025_08a" },
"next": { "kid": "k_2025_09a", "secret_ref": "sm://wfgyp/prod/k_2025_09a" },
"accept_kids": ["k_2025_08a", "k_2025_09a"],
"revoke_after_s": 259200,
"notes": "roll weekly; accept_kids must be length 2 max"
}
```
**HTTP header examples**
```
Authorization: Bearer <access_token_with_kid>
X-Key-Id: k_2025_08a
```
For JOSE tokens, put `kid` in the header. For HMAC signatures, include `X-Key-Id`.
---
## Zero-downtime rotation timeline
**T-24h**
* Create `next` in your secret manager.
* Update `accept_kids = {current, next}`.
* Push config to all regions and functions. Do not sign with next yet.
**T-0**
* Start signing with `current`. Keep accepting both.
* Warm all regions. Run probes for 10 minutes.
* Flip signer to `next` when probes are flat.
**T+1h to T+72h**
* Keep accepting both while streams drain and long jobs finish.
* Watch auth errors by `kid`.
* After TTL, remove `current` from `accept_kids` and revoke it in the secret manager.
**Rollback**
* If errors rise after the flip, restore signer to the previous `current`. Leave accept set unchanged.
---
## Long-lived workers and caches
* Serverless functions may cache secrets in memory. Read secrets on each invocation or attach an `If-Modified-Since` check with a short TTL.
* Edge and CDN caches can pin config. Add `secrets_epoch` to cache keys or use a config version header.
* For JWKS, publish both public keys and set short cache-control. Clients must refresh on `kid` miss.
Open: [Edge Cache Invalidation](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Cloud_Serverless/edge_cache_invalidation.md)
---
## Webhook secret rotation
* Accept two signatures. Verify first with `kid` if provided. If not, attempt both only inside the rotation window.
* Return a 2xx with a deprecation header to prompt senders to upgrade keys.
* Log the `kid` used for every valid request and alert when legacy `kid` drops below a threshold.
---
## Third-party provider keys
* Inject provider API keys from a secret manager at request time. Never bake into build artifacts.
* Keep a per-provider `accept_kids` and test with small traffic shadow before promoting.
* If a provider rotates without `kid`, schedule a short freeze window and fall back to a secondary provider if available.
Related ops: [Bootstrap Ordering](https://github.com/onestardao/WFGY/blob/main/ProblemMap/bootstrap-ordering.md) · [Pre-Deploy Collapse](https://github.com/onestardao/WFGY/blob/main/ProblemMap/predeploy-collapse.md)
---
## Observability you must add
* Counters of 401/403 by `route`, `region`, and `kid`.
* Ratio of requests verified by current vs next.
* Time to first successful verification with `next` in each region.
* Number of long-lived executions running past the flip time.
* JWKS cache hit and miss per `kid`.
---
## CI policy to prevent unsafe rotations
Fail the build when any of the following is true:
1. `accept_kids` does not contain exactly two values.
2. `next.secret_ref` is missing or unreadable.
3. `revoke_after_s` is not set or exceeds the policy maximum.
4. Routes that verify webhooks are not wired to read `X-Key-Id` or JOSE `kid`.
---
## Copy-paste verifier for serverless functions
```txt
On cold start:
1. Fetch LIMITS.json and SECRETS.json.
2. Pin {accept_kids, active.kid} to memory with TTL = 60s.
3. Expose verify(req):
- extract kid from header or token header.
- if kid in accept_kids → choose secret by kid and verify.
- else refresh secrets and retry once, then fail.
On each request:
- If now - secrets_cache_ts > 60s → refresh in the background.
- Emit metric: {route, region, kid_used, verified=true|false}.
```
---
## Typical failure patterns → exact fix
* **Only one key accepted during rollout**
Old workers keep signing while new verifiers reject. Add dual-accept and reduce cache TTL.
Open: [Deployment Deadlock](https://github.com/onestardao/WFGY/blob/main/ProblemMap/deployment-deadlock.md)
* **Webhook fails because sender changed secret before receivers**
Allow two secrets, prefer `kid` when present. Add retry with jitter and short backoff.
Open: [Bootstrap Ordering](https://github.com/onestardao/WFGY/blob/main/ProblemMap/bootstrap-ordering.md)
* **Large headers during rotation**
Multiple auth headers overflow 816 KB. Collapse to one `X-Key-Id` and one signature.
Open: [Prompt Injection](https://github.com/onestardao/WFGY/blob/main/ProblemMap/prompt-injection.md)
* **Silent partial revocation**
Some regions still accept old key due to pinned caches. Force refresh and invalidate edge caches by `secrets_epoch`.
Open: [Edge Cache Invalidation](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Cloud_Serverless/edge_cache_invalidation.md)
---
## Verification checklist
* Blue-green probe calls succeed with both keys before the flip.
* After flip, 95 percent of traffic verifies with `next` within 10 minutes.
* After revoke, zero successful verifications with old `kid`.
* RAG probe answers remain unchanged and pass ΔS ≤ 0.45.
---
### 🔗 Quick-Start Downloads (60 sec)
| Tool | Link | 3-Step Setup |
| -------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------ | ---------------------------------------------------------------------------------------- |
| **WFGY 1.0 PDF** | [Engine Paper](https://github.com/onestardao/WFGY/blob/main/I_am_not_lizardman/WFGY_All_Principles_Return_to_One_v1.0_PSBigBig_Public.pdf) | 1⃣ Download · 2⃣ Upload to your LLM · 3⃣ Ask “Answer using WFGY + \<your question>” |
| **TXT OS (plain-text OS)** | [TXTOS.txt](https://github.com/onestardao/WFGY/blob/main/OS/TXTOS.txt) | 1⃣ Download · 2⃣ Paste into any LLM chat · 3⃣ Type “hello world” — OS boots instantly |
---
<!-- WFGY_FOOTER_START -->
### Explore More
| Layer | Page | What its for |
| --- | --- | --- |
| ⭐ Proof | [WFGY Recognition Map](/recognition/README.md) | External citations, integrations, and ecosystem proof |
| ⚙️ Engine | [WFGY 1.0](/legacy/README.md) | Original PDF tension engine and early logic sketch (legacy reference) |
| ⚙️ Engine | [WFGY 2.0](/core/README.md) | Production tension kernel for RAG and agent systems |
| ⚙️ Engine | [WFGY 3.0](/TensionUniverse/EventHorizon/README.md) | TXT based Singularity tension engine (131 S class set) |
| 🗺️ Map | [Problem Map 1.0](/ProblemMap/README.md) | Flagship 16 problem RAG failure taxonomy and fix map |
| 🗺️ Map | [Problem Map 2.0](/ProblemMap/wfgy-rag-16-problem-map-global-debug-card.md) | Global Debug Card for RAG and agent pipeline diagnosis |
| 🗺️ Map | [Problem Map 3.0](/ProblemMap/wfgy-ai-problem-map-troubleshooting-atlas.md) | Global AI troubleshooting atlas and failure pattern map |
| 🧰 App | [TXT OS](/OS/README.md) | .txt semantic OS with fast bootstrap |
| 🧰 App | [Blah Blah Blah](/OS/BlahBlahBlah/README.md) | Abstract and paradox Q&A built on TXT OS |
| 🧰 App | [Blur Blur Blur](/OS/BlurBlurBlur/README.md) | Text to image generation with semantic control |
| 🏡 Onboarding | [Starter Village](/StarterVillage/README.md) | Guided entry point for new users |
If this repository helped, starring it improves discovery so more builders can find the docs and tools.
[![GitHub Repo stars](https://img.shields.io/github/stars/onestardao/WFGY?style=social)](https://github.com/onestardao/WFGY)
<!-- WFGY_FOOTER_END -->
**Next page to write**: `ProblemMap/GlobalFixMap/Cloud_Serverless/multi_region_routing.md`