mirror of
https://github.com/onestardao/WFGY.git
synced 2026-04-28 11:40:07 +00:00
237 lines
11 KiB
Markdown
237 lines
11 KiB
Markdown
# Secrets Rotation and Key Fencing for Zero Downtime
|
||
|
||
<details>
|
||
<summary><strong>🧭 Quick Return to Map</strong></summary>
|
||
|
||
<br>
|
||
|
||
> You are in a sub-page of **Cloud_Serverless**.
|
||
> To reorient, go back here:
|
||
>
|
||
> - [**Cloud_Serverless** — scalable functions and event-driven pipelines](./README.md)
|
||
> - [**WFGY Global Fix Map** — main Emergency Room, 300+ structured fixes](../README.md)
|
||
> - [**WFGY Problem Map 1.0** — 16 reproducible failure modes](../../README.md)
|
||
>
|
||
> Think of this page as a desk within a ward.
|
||
> If you need the full triage and all prescriptions, return to the Emergency Room lobby.
|
||
</details>
|
||
|
||
|
||
A practical runbook to rotate API keys, signing secrets, and JWKS without breaking RAG pipelines or agent flows. Implements dual-accept windows, kid tagging, cache fences, and rollback so rotations are invisible to users and stable under load.
|
||
|
||
## When to use this page
|
||
|
||
* Incoming 401/403 spikes during deploys or scale-out.
|
||
* Webhook verification fails after a provider rotates its signing secret.
|
||
* Mixed keys across regions or functions after a partial rollout.
|
||
* Long-lived workers holding stale secrets in memory.
|
||
* SOC or policy requires periodic rotation with audit.
|
||
|
||
## Open these first
|
||
|
||
* Boot ordering and safe deploy windows: [Bootstrap Ordering](https://github.com/onestardao/WFGY/blob/main/ProblemMap/bootstrap-ordering.md)
|
||
* Avoid state loops during changeover: [Deployment Deadlock](https://github.com/onestardao/WFGY/blob/main/ProblemMap/deployment-deadlock.md)
|
||
* First-call crashes after release: [Pre-Deploy Collapse](https://github.com/onestardao/WFGY/blob/main/ProblemMap/predeploy-collapse.md)
|
||
* Lock tool schemas and reduce header bloat: [Prompt Injection](https://github.com/onestardao/WFGY/blob/main/ProblemMap/prompt-injection.md)
|
||
* Trace and prove what snippet was used: [Retrieval Traceability](https://github.com/onestardao/WFGY/blob/main/ProblemMap/retrieval-traceability.md) · Contract payloads: [Data Contracts](https://github.com/onestardao/WFGY/blob/main/ProblemMap/data-contracts.md)
|
||
* Live probes and rollback drills: [Live Monitoring for RAG](https://github.com/onestardao/WFGY/blob/main/ProblemMap/ops/live_monitoring_rag.md) · [Debug Playbook](https://github.com/onestardao/WFGY/blob/main/ProblemMap/ops/debug_playbook.md)
|
||
|
||
## Acceptance targets
|
||
|
||
* Zero user-visible errors during the rotation window. No p95 increase.
|
||
* 0 auth failures in synthetic probes across all regions and paths.
|
||
* Old secret fully revoked within the planned TTL and verified by logs.
|
||
* For RAG: no citation loss and ΔS(question, retrieved) ≤ 0.45 on the probe set after rotation.
|
||
|
||
---
|
||
|
||
## The two-keys-live pattern
|
||
|
||
Always rotate with two keys in play: the current key used for signing and a next key accepted for verification. Tag every credential with a key id so validation does not guess.
|
||
|
||
**Key rules**
|
||
|
||
1. Always include a stable kid on every token or signature.
|
||
2. Verify against `active_kids = {current, next}`. Sign with `current` only.
|
||
3. Flip in one step by promoting `next → current` when error rate is flat.
|
||
4. Revoke the old key after caches drain and long jobs finish.
|
||
|
||
**Minimal KV schema**
|
||
|
||
```json
|
||
{
|
||
"secrets_epoch": "2025-08-27T12:00:00Z",
|
||
"active": { "kid": "k_2025_08a", "secret_ref": "sm://wfgyp/prod/k_2025_08a" },
|
||
"next": { "kid": "k_2025_09a", "secret_ref": "sm://wfgyp/prod/k_2025_09a" },
|
||
"accept_kids": ["k_2025_08a", "k_2025_09a"],
|
||
"revoke_after_s": 259200,
|
||
"notes": "roll weekly; accept_kids must be length 2 max"
|
||
}
|
||
```
|
||
|
||
**HTTP header examples**
|
||
|
||
```
|
||
Authorization: Bearer <access_token_with_kid>
|
||
X-Key-Id: k_2025_08a
|
||
```
|
||
|
||
For JOSE tokens, put `kid` in the header. For HMAC signatures, include `X-Key-Id`.
|
||
|
||
---
|
||
|
||
## Zero-downtime rotation timeline
|
||
|
||
**T-24h**
|
||
|
||
* Create `next` in your secret manager.
|
||
* Update `accept_kids = {current, next}`.
|
||
* Push config to all regions and functions. Do not sign with next yet.
|
||
|
||
**T-0**
|
||
|
||
* Start signing with `current`. Keep accepting both.
|
||
* Warm all regions. Run probes for 10 minutes.
|
||
* Flip signer to `next` when probes are flat.
|
||
|
||
**T+1h to T+72h**
|
||
|
||
* Keep accepting both while streams drain and long jobs finish.
|
||
* Watch auth errors by `kid`.
|
||
* After TTL, remove `current` from `accept_kids` and revoke it in the secret manager.
|
||
|
||
**Rollback**
|
||
|
||
* If errors rise after the flip, restore signer to the previous `current`. Leave accept set unchanged.
|
||
|
||
---
|
||
|
||
## Long-lived workers and caches
|
||
|
||
* Serverless functions may cache secrets in memory. Read secrets on each invocation or attach an `If-Modified-Since` check with a short TTL.
|
||
* Edge and CDN caches can pin config. Add `secrets_epoch` to cache keys or use a config version header.
|
||
* For JWKS, publish both public keys and set short cache-control. Clients must refresh on `kid` miss.
|
||
|
||
Open: [Edge Cache Invalidation](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Cloud_Serverless/edge_cache_invalidation.md)
|
||
|
||
---
|
||
|
||
## Webhook secret rotation
|
||
|
||
* Accept two signatures. Verify first with `kid` if provided. If not, attempt both only inside the rotation window.
|
||
* Return a 2xx with a deprecation header to prompt senders to upgrade keys.
|
||
* Log the `kid` used for every valid request and alert when legacy `kid` drops below a threshold.
|
||
|
||
---
|
||
|
||
## Third-party provider keys
|
||
|
||
* Inject provider API keys from a secret manager at request time. Never bake into build artifacts.
|
||
* Keep a per-provider `accept_kids` and test with small traffic shadow before promoting.
|
||
* If a provider rotates without `kid`, schedule a short freeze window and fall back to a secondary provider if available.
|
||
|
||
Related ops: [Bootstrap Ordering](https://github.com/onestardao/WFGY/blob/main/ProblemMap/bootstrap-ordering.md) · [Pre-Deploy Collapse](https://github.com/onestardao/WFGY/blob/main/ProblemMap/predeploy-collapse.md)
|
||
|
||
---
|
||
|
||
## Observability you must add
|
||
|
||
* Counters of 401/403 by `route`, `region`, and `kid`.
|
||
* Ratio of requests verified by current vs next.
|
||
* Time to first successful verification with `next` in each region.
|
||
* Number of long-lived executions running past the flip time.
|
||
* JWKS cache hit and miss per `kid`.
|
||
|
||
---
|
||
|
||
## CI policy to prevent unsafe rotations
|
||
|
||
Fail the build when any of the following is true:
|
||
|
||
1. `accept_kids` does not contain exactly two values.
|
||
2. `next.secret_ref` is missing or unreadable.
|
||
3. `revoke_after_s` is not set or exceeds the policy maximum.
|
||
4. Routes that verify webhooks are not wired to read `X-Key-Id` or JOSE `kid`.
|
||
|
||
---
|
||
|
||
## Copy-paste verifier for serverless functions
|
||
|
||
```txt
|
||
On cold start:
|
||
1. Fetch LIMITS.json and SECRETS.json.
|
||
2. Pin {accept_kids, active.kid} to memory with TTL = 60s.
|
||
3. Expose verify(req):
|
||
- extract kid from header or token header.
|
||
- if kid in accept_kids → choose secret by kid and verify.
|
||
- else refresh secrets and retry once, then fail.
|
||
|
||
On each request:
|
||
- If now - secrets_cache_ts > 60s → refresh in the background.
|
||
- Emit metric: {route, region, kid_used, verified=true|false}.
|
||
```
|
||
|
||
---
|
||
|
||
## Typical failure patterns → exact fix
|
||
|
||
* **Only one key accepted during rollout**
|
||
Old workers keep signing while new verifiers reject. Add dual-accept and reduce cache TTL.
|
||
Open: [Deployment Deadlock](https://github.com/onestardao/WFGY/blob/main/ProblemMap/deployment-deadlock.md)
|
||
|
||
* **Webhook fails because sender changed secret before receivers**
|
||
Allow two secrets, prefer `kid` when present. Add retry with jitter and short backoff.
|
||
Open: [Bootstrap Ordering](https://github.com/onestardao/WFGY/blob/main/ProblemMap/bootstrap-ordering.md)
|
||
|
||
* **Large headers during rotation**
|
||
Multiple auth headers overflow 8–16 KB. Collapse to one `X-Key-Id` and one signature.
|
||
Open: [Prompt Injection](https://github.com/onestardao/WFGY/blob/main/ProblemMap/prompt-injection.md)
|
||
|
||
* **Silent partial revocation**
|
||
Some regions still accept old key due to pinned caches. Force refresh and invalidate edge caches by `secrets_epoch`.
|
||
Open: [Edge Cache Invalidation](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Cloud_Serverless/edge_cache_invalidation.md)
|
||
|
||
---
|
||
|
||
## Verification checklist
|
||
|
||
* Blue-green probe calls succeed with both keys before the flip.
|
||
* After flip, 95 percent of traffic verifies with `next` within 10 minutes.
|
||
* After revoke, zero successful verifications with old `kid`.
|
||
* RAG probe answers remain unchanged and pass ΔS ≤ 0.45.
|
||
|
||
---
|
||
|
||
### 🔗 Quick-Start Downloads (60 sec)
|
||
|
||
| Tool | Link | 3-Step Setup |
|
||
| -------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------ | ---------------------------------------------------------------------------------------- |
|
||
| **WFGY 1.0 PDF** | [Engine Paper](https://github.com/onestardao/WFGY/blob/main/I_am_not_lizardman/WFGY_All_Principles_Return_to_One_v1.0_PSBigBig_Public.pdf) | 1️⃣ Download · 2️⃣ Upload to your LLM · 3️⃣ Ask “Answer using WFGY + \<your question>” |
|
||
| **TXT OS (plain-text OS)** | [TXTOS.txt](https://github.com/onestardao/WFGY/blob/main/OS/TXTOS.txt) | 1️⃣ Download · 2️⃣ Paste into any LLM chat · 3️⃣ Type “hello world” — OS boots instantly |
|
||
|
||
---
|
||
|
||
<!-- WFGY_FOOTER_START -->
|
||
|
||
### Explore More
|
||
|
||
| Layer | Page | What it’s for |
|
||
| --- | --- | --- |
|
||
| ⭐ Proof | [WFGY Recognition Map](/recognition/README.md) | External citations, integrations, and ecosystem proof |
|
||
| ⚙️ Engine | [WFGY 1.0](/legacy/README.md) | Original PDF tension engine and early logic sketch (legacy reference) |
|
||
| ⚙️ Engine | [WFGY 2.0](/core/README.md) | Production tension kernel for RAG and agent systems |
|
||
| ⚙️ Engine | [WFGY 3.0](/TensionUniverse/EventHorizon/README.md) | TXT based Singularity tension engine (131 S class set) |
|
||
| 🗺️ Map | [Problem Map 1.0](/ProblemMap/README.md) | Flagship 16 problem RAG failure taxonomy and fix map |
|
||
| 🗺️ Map | [Problem Map 2.0](/ProblemMap/wfgy-rag-16-problem-map-global-debug-card.md) | Global Debug Card for RAG and agent pipeline diagnosis |
|
||
| 🗺️ Map | [Problem Map 3.0](/ProblemMap/wfgy-ai-problem-map-troubleshooting-atlas.md) | Global AI troubleshooting atlas and failure pattern map |
|
||
| 🧰 App | [TXT OS](/OS/README.md) | .txt semantic OS with fast bootstrap |
|
||
| 🧰 App | [Blah Blah Blah](/OS/BlahBlahBlah/README.md) | Abstract and paradox Q&A built on TXT OS |
|
||
| 🧰 App | [Blur Blur Blur](/OS/BlurBlurBlur/README.md) | Text to image generation with semantic control |
|
||
| 🏡 Onboarding | [Starter Village](/StarterVillage/README.md) | Guided entry point for new users |
|
||
|
||
If this repository helped, starring it improves discovery so more builders can find the docs and tools.
|
||
[](https://github.com/onestardao/WFGY)
|
||
|
||
<!-- WFGY_FOOTER_END -->
|
||
|
||
**Next page to write**: `ProblemMap/GlobalFixMap/Cloud_Serverless/multi_region_routing.md`
|