12 KiB
Secrets Rotation and Key Fencing for Zero Downtime
A practical runbook to rotate API keys, signing secrets, and JWKS without breaking RAG pipelines or agent flows. Implements dual-accept windows, kid tagging, cache fences, and rollback so rotations are invisible to users and stable under load.
When to use this page
- Incoming 401/403 spikes during deploys or scale-out.
- Webhook verification fails after a provider rotates its signing secret.
- Mixed keys across regions or functions after a partial rollout.
- Long-lived workers holding stale secrets in memory.
- SOC or policy requires periodic rotation with audit.
Open these first
- Boot ordering and safe deploy windows: Bootstrap Ordering
- Avoid state loops during changeover: Deployment Deadlock
- First-call crashes after release: Pre-Deploy Collapse
- Lock tool schemas and reduce header bloat: Prompt Injection
- Trace and prove what snippet was used: Retrieval Traceability · Contract payloads: Data Contracts
- Live probes and rollback drills: Live Monitoring for RAG · Debug Playbook
Acceptance targets
- Zero user-visible errors during the rotation window. No p95 increase.
- 0 auth failures in synthetic probes across all regions and paths.
- Old secret fully revoked within the planned TTL and verified by logs.
- For RAG: no citation loss and ΔS(question, retrieved) ≤ 0.45 on the probe set after rotation.
The two-keys-live pattern
Always rotate with two keys in play: the current key used for signing and a next key accepted for verification. Tag every credential with a key id so validation does not guess.
Key rules
- Always include a stable kid on every token or signature.
- Verify against
active_kids = {current, next}. Sign withcurrentonly. - Flip in one step by promoting
next → currentwhen error rate is flat. - Revoke the old key after caches drain and long jobs finish.
Minimal KV schema
{
"secrets_epoch": "2025-08-27T12:00:00Z",
"active": { "kid": "k_2025_08a", "secret_ref": "sm://wfgyp/prod/k_2025_08a" },
"next": { "kid": "k_2025_09a", "secret_ref": "sm://wfgyp/prod/k_2025_09a" },
"accept_kids": ["k_2025_08a", "k_2025_09a"],
"revoke_after_s": 259200,
"notes": "roll weekly; accept_kids must be length 2 max"
}
HTTP header examples
Authorization: Bearer <access_token_with_kid>
X-Key-Id: k_2025_08a
For JOSE tokens, put kid in the header. For HMAC signatures, include X-Key-Id.
Zero-downtime rotation timeline
T-24h
- Create
nextin your secret manager. - Update
accept_kids = {current, next}. - Push config to all regions and functions. Do not sign with next yet.
T-0
- Start signing with
current. Keep accepting both. - Warm all regions. Run probes for 10 minutes.
- Flip signer to
nextwhen probes are flat.
T+1h to T+72h
- Keep accepting both while streams drain and long jobs finish.
- Watch auth errors by
kid. - After TTL, remove
currentfromaccept_kidsand revoke it in the secret manager.
Rollback
- If errors rise after the flip, restore signer to the previous
current. Leave accept set unchanged.
Long-lived workers and caches
- Serverless functions may cache secrets in memory. Read secrets on each invocation or attach an
If-Modified-Sincecheck with a short TTL. - Edge and CDN caches can pin config. Add
secrets_epochto cache keys or use a config version header. - For JWKS, publish both public keys and set short cache-control. Clients must refresh on
kidmiss.
Open: Edge Cache Invalidation
Webhook secret rotation
- Accept two signatures. Verify first with
kidif provided. If not, attempt both only inside the rotation window. - Return a 2xx with a deprecation header to prompt senders to upgrade keys.
- Log the
kidused for every valid request and alert when legacykiddrops below a threshold.
Third-party provider keys
- Inject provider API keys from a secret manager at request time. Never bake into build artifacts.
- Keep a per-provider
accept_kidsand test with small traffic shadow before promoting. - If a provider rotates without
kid, schedule a short freeze window and fall back to a secondary provider if available.
Related ops: Bootstrap Ordering · Pre-Deploy Collapse
Observability you must add
- Counters of 401/403 by
route,region, andkid. - Ratio of requests verified by current vs next.
- Time to first successful verification with
nextin each region. - Number of long-lived executions running past the flip time.
- JWKS cache hit and miss per
kid.
CI policy to prevent unsafe rotations
Fail the build when any of the following is true:
accept_kidsdoes not contain exactly two values.next.secret_refis missing or unreadable.revoke_after_sis not set or exceeds the policy maximum.- Routes that verify webhooks are not wired to read
X-Key-Idor JOSEkid.
Copy-paste verifier for serverless functions
On cold start:
1. Fetch LIMITS.json and SECRETS.json.
2. Pin {accept_kids, active.kid} to memory with TTL = 60s.
3. Expose verify(req):
- extract kid from header or token header.
- if kid in accept_kids → choose secret by kid and verify.
- else refresh secrets and retry once, then fail.
On each request:
- If now - secrets_cache_ts > 60s → refresh in the background.
- Emit metric: {route, region, kid_used, verified=true|false}.
Typical failure patterns → exact fix
-
Only one key accepted during rollout Old workers keep signing while new verifiers reject. Add dual-accept and reduce cache TTL. Open: Deployment Deadlock
-
Webhook fails because sender changed secret before receivers Allow two secrets, prefer
kidwhen present. Add retry with jitter and short backoff. Open: Bootstrap Ordering -
Large headers during rotation Multiple auth headers overflow 8–16 KB. Collapse to one
X-Key-Idand one signature. Open: Prompt Injection -
Silent partial revocation Some regions still accept old key due to pinned caches. Force refresh and invalidate edge caches by
secrets_epoch. Open: Edge Cache Invalidation
Verification checklist
- Blue-green probe calls succeed with both keys before the flip.
- After flip, 95 percent of traffic verifies with
nextwithin 10 minutes. - After revoke, zero successful verifications with old
kid. - RAG probe answers remain unchanged and pass ΔS ≤ 0.45.
🔗 Quick-Start Downloads (60 sec)
| Tool | Link | 3-Step Setup |
|---|---|---|
| WFGY 1.0 PDF | Engine Paper | 1️⃣ Download · 2️⃣ Upload to your LLM · 3️⃣ Ask “Answer using WFGY + <your question>” |
| TXT OS (plain-text OS) | TXTOS.txt | 1️⃣ Download · 2️⃣ Paste into any LLM chat · 3️⃣ Type “hello world” — OS boots instantly |
🧭 Explore More
| Module | Description | Link |
|---|---|---|
| WFGY Core | WFGY 2.0 engine is live: full symbolic reasoning architecture and math stack | View → |
| Problem Map 1.0 | Initial 16-mode diagnostic and symbolic fix framework | View → |
| Problem Map 2.0 | RAG-focused failure tree, modular fixes, and pipelines | View → |
| Semantic Clinic Index | Expanded failure catalog: prompt injection, memory bugs, logic drift | View → |
| Semantic Blueprint | Layer-based symbolic reasoning & semantic modulations | View → |
| Benchmark vs GPT-5 | Stress test GPT-5 with full WFGY reasoning suite | View → |
| 🧙♂️ Starter Village 🏡 | New here? Lost in symbols? Click here and let the wizard guide you through | Start → |
👑 Early Stargazers: See the Hall of Fame —
⭐ WFGY Engine 2.0 is already unlocked. ⭐ Star the repo to help others discover it and unlock more on the Unlock Board.
Next page to write: ProblemMap/GlobalFixMap/Cloud_Serverless/multi_region_routing.md