Update README.md

PSBigBig 2025-09-01 18:32:21 +08:00 committed by GitHub
parent 9d1332832d
commit ce07700588


@ -1,73 +1,96 @@
# Cloud & Serverless — Guardrails and Fix Patterns
A compact hub to harden serverless and edge workloads without touching your core infra. Targets Vercel, Cloudflare Workers, Lambda, Cloud Run, Azure Functions, Fly.io and similar stacks. Each symptom maps to an auditable WFGY fix page with measurable acceptance.
A compact hub to harden serverless and edge workloads without touching your core infra.
Targets Vercel, Cloudflare Workers, Lambda, Cloud Run, Azure Functions, Fly.io, and similar stacks.
Each symptom maps to an auditable WFGY fix page with measurable acceptance.
---
## Open these first
* Visual map and recovery: [rag-architecture-and-recovery.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/rag-architecture-and-recovery.md)
* Boot order and deployments: [bootstrap-ordering.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/bootstrap-ordering.md) · [deployment-deadlock.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/deployment-deadlock.md) · [predeploy-collapse.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/predeploy-collapse.md)
* Retrieval integrity and payloads: [retrieval-traceability.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/retrieval-traceability.md) · [data-contracts.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/data-contracts.md)
* Threats and schema locks: [prompt-injection.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/prompt-injection.md) · [bluffing.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/bluffing.md)
- Visual map and recovery → [rag-architecture-and-recovery.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/rag-architecture-and-recovery.md)
- Boot order and deployments → [bootstrap-ordering.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/bootstrap-ordering.md) · [deployment-deadlock.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/deployment-deadlock.md) · [predeploy-collapse.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/predeploy-collapse.md)
- Retrieval integrity and payloads → [retrieval-traceability.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/retrieval-traceability.md) · [data-contracts.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/data-contracts.md)
- Threats and schema locks → [prompt-injection.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/prompt-injection.md) · [bluffing.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/bluffing.md)
---
## Core acceptance
* p95 warm path latency ≤ 300 ms, cold path ≤ 1200 ms under nominal load.
* First-byte time on streaming APIs ≤ 500 ms when warm.
* Error budget respected: availability ≥ 99.9 percent, SLO tracked per route.
* Concurrency never exceeds configured caps. No throttled retries without jitter.
* Secrets rotated within policy. Zero PII in logs and vector payloads.
* ΔS(question, retrieved) ≤ 0.45 and coverage ≥ 0.70 for RAG routes after any infra change.
- p95 warm path latency ≤ **300 ms**, cold path ≤ **1200 ms** under nominal load
- First-byte time on streaming APIs ≤ **500 ms** when warm
- Availability ≥ **99.9%**, SLO tracked per route with error budget alerts
- Concurrency never exceeds configured caps; retries use **jittered backoff**
- Secrets rotated within policy; **zero PII** in logs and vector payloads
- RAG routes hold ΔS(question, retrieved) ≤ **0.45** and coverage ≥ **0.70** after infra changes
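
The latency targets above can be checked mechanically once you have per-route samples. The following is a minimal sketch, assuming latencies are collected in milliseconds and that a nearest-rank percentile is acceptable; the function names `p95` and `check_latency` are illustrative and not part of WFGY.

```python
# Thresholds copied from the acceptance list above.
WARM_P95_MS = 300
COLD_P95_MS = 1200

def p95(samples_ms):
    """Nearest-rank 95th percentile of a non-empty list of latencies (ms)."""
    ordered = sorted(samples_ms)
    # Integer ceiling of 0.95 * n, avoiding float rounding surprises.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]

def check_latency(warm_ms, cold_ms):
    """Return pass/fail flags for the warm and cold path targets."""
    return {
        "warm_ok": p95(warm_ms) <= WARM_P95_MS,
        "cold_ok": p95(cold_ms) <= COLD_P95_MS,
    }
```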
---
## Quick index — per-page guides
| Area | Page |
|---|---|
| Cold start and concurrency caps | [cold_start_concurrency.md](./cold_start_concurrency.md) |
| Streaming stalls, body cutoffs | [timeouts_streaming_body_limits.md](./timeouts_streaming_body_limits.md) |
| Stateless jobs, idempotency, dedupe | [stateless_kv_queue_patterns.md](./stateless_kv_queue_patterns.md) |
| Edge cache invalidation | [edge_cache_invalidation.md](./edge_cache_invalidation.md) |
| Egress rules and webhook storms | [egress_rules_and_webhooks.md](./egress_rules_and_webhooks.md) |
| CI/CD for serverless | [serverless_ci_cd.md](./serverless_ci_cd.md) |
| Bootstrap order and migrations | [env_bootstrap_and_migrations.md](./env_bootstrap_and_migrations.md) |
| Quotas, scaling, budget caps | [quotas_scaling_budget_caps.md](./quotas_scaling_budget_caps.md) |
| Secrets rotation | [secrets_rotation.md](./secrets_rotation.md) |
| Multi-region routing | [multi_region_routing.md](./multi_region_routing.md) |
| Region failover drills | [region_failover_drills.md](./region_failover_drills.md) |
| Observability and SLOs | [observability_slo.md](./observability_slo.md) |
| Canary releases | [canary_release_serverless.md](./canary_release_serverless.md) |
| Blue-green switchovers | [blue_green_switchovers.md](./blue_green_switchovers.md) |
| Disaster recovery table-top | [disaster_recovery_tabletop.md](./disaster_recovery_tabletop.md) |
| Data retention and backups | [data_retention_and_backups.md](./data_retention_and_backups.md) |
| Privacy and PII at edges | [privacy_and_pii_edges.md](./privacy_and_pii_edges.md) |
---
## Symptom → exact fix
| Symptom | Likely cause | Open this |
| -------------------------------- | ---------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| Spiky cold starts and timeouts | oversubscribed concurrency, missing provisioned capacity | [cold\_start\_concurrency.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Cloud_Serverless/cold_start_concurrency.md) |
| Streaming stalls or body cutoffs | proxy buffers, tiny read timeouts, chunked encoding quirks | [timeouts\_streaming\_body\_limits.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Cloud_Serverless/timeouts_streaming_body_limits.md) |
| Stateless bugs and lost work | in-memory state, duplicate triggers, missing idempotency | [stateless\_kv\_queue\_patterns.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Cloud_Serverless/stateless_kv_queue_patterns.md) |
| Users see stale results | cache keys drift, no purge on writes | [edge\_cache\_invalidation.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Cloud_Serverless/edge_cache_invalidation.md) |
| Webhook storms or data leaks | open egress, retry spirals, payload bloat | [egress\_rules\_and\_webhooks.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Cloud_Serverless/egress_rules_and_webhooks.md) |
| Drift between preview and prod | env mismatch, missing checks, unsafe deploys | [serverless\_ci\_cd.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Cloud_Serverless/serverless_ci_cd.md) |
| Boot fails after migration | schema not ready, wrong order, partial writes | [env\_bootstrap\_and\_migrations.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Cloud_Serverless/env_bootstrap_and_migrations.md) |
| Surprise bills and throttles | no quotas, bursty retries, N+1 calls | [quotas\_scaling\_budget\_caps.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Cloud_Serverless/quotas_scaling_budget_caps.md) |
| Token leaks and broken rotation | long-lived keys, missing overlap windows | [secrets\_rotation.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Cloud_Serverless/secrets_rotation.md) |
| Cross-region weirdness | sticky sessions, unsynced caches, DNS TTLs | [multi\_region\_routing.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Cloud_Serverless/multi_region_routing.md) |
| Failover works in theory only | untested runbooks, stale health checks | [region\_failover\_drills.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Cloud_Serverless/region_failover_drills.md) |
| SLOs feel random | no golden signals, no ΔS probes on RAG | [observability\_slo.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Cloud_Serverless/observability_slo.md) |
| Canary breaks users silently | uneven traffic splits, noisy metrics | [canary\_release\_serverless.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Cloud_Serverless/canary_release_serverless.md) |
| Blue-green stuck or unsafe | skewed env vars, missed DB switchover | [blue\_green\_switchovers.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Cloud_Serverless/blue_green_switchovers.md) |
| Disaster playbooks collapse | missing drills, restore paths untested | [disaster\_recovery\_tabletop.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Cloud_Serverless/disaster_recovery_tabletop.md) |
| Backups exist but useless | wrong cadence, missing manifests | [data\_retention\_and\_backups.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Cloud_Serverless/data_retention_and_backups.md) |
| PII shows up in logs or vectors | no DLP, loose schemas, unsafe webhooks | [privacy\_and\_pii\_edges.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Cloud_Serverless/privacy_and_pii_edges.md) |
| Symptom | Likely cause | Open this |
|---|---|---|
| Spiky cold starts and timeouts | Oversubscribed concurrency, no provisioned capacity | [cold_start_concurrency.md](./cold_start_concurrency.md) |
| Streaming stalls or body cutoffs | Proxy buffers, tiny read timeouts, chunked encoding quirks | [timeouts_streaming_body_limits.md](./timeouts_streaming_body_limits.md) |
| Stateless bugs and lost work | In-memory state, duplicate triggers, missing idempotency | [stateless_kv_queue_patterns.md](./stateless_kv_queue_patterns.md) |
| Users see stale results | Cache keys drift, no purge on writes | [edge_cache_invalidation.md](./edge_cache_invalidation.md) |
| Webhook storms or data leaks | Open egress, retry spirals, payload bloat | [egress_rules_and_webhooks.md](./egress_rules_and_webhooks.md) |
| Drift between preview and prod | Env mismatch, unsafe deploys, missing checks | [serverless_ci_cd.md](./serverless_ci_cd.md) |
| Boot fails after migration | Schema not ready, wrong order, partial writes | [env_bootstrap_and_migrations.md](./env_bootstrap_and_migrations.md) |
| Surprise bills and throttles | No quotas, bursty retries, N+1 calls | [quotas_scaling_budget_caps.md](./quotas_scaling_budget_caps.md) |
| Token leaks and broken rotation | Long-lived keys, missing overlap windows | [secrets_rotation.md](./secrets_rotation.md) |
| Cross-region weirdness | Sticky sessions, unsynced caches, DNS TTLs | [multi_region_routing.md](./multi_region_routing.md) |
| Failover works only on paper | Stale health checks, untested runbooks | [region_failover_drills.md](./region_failover_drills.md) |
| SLOs feel random | No golden signals, no ΔS probes on RAG | [observability_slo.md](./observability_slo.md) |
| Canary breaks users silently | Uneven traffic splits, noisy metrics | [canary_release_serverless.md](./canary_release_serverless.md) |
| Blue-green stuck or unsafe | Skewed env vars, missed DB switchover | [blue_green_switchovers.md](./blue_green_switchovers.md) |
| DR playbooks collapse | Missing drills, restore paths untested | [disaster_recovery_tabletop.md](./disaster_recovery_tabletop.md) |
| Backups exist but useless | Wrong cadence, missing manifests | [data_retention_and_backups.md](./data_retention_and_backups.md) |
| PII shows up in logs/vectors | No DLP, loose schemas, unsafe webhooks | [privacy_and_pii_edges.md](./privacy_and_pii_edges.md) |
---
## Fix in 60 seconds
1. Measure reality: cold vs warm p95, first byte, throttles, ΔS and coverage for RAG routes.
2. Fence the edges: cache keys, egress allowlist, redaction, idempotency, retries with jitter.
3. Lock boot order: env, schema, index and rerankers, then app.
4. Prove recovery: one canary, one blue-green, one failover drill with data restore.
1) **Measure reality**
Capture warm vs cold p95, TTFB for streaming, and throttle counts. For RAG routes, also log ΔS and coverage.
2) **Fence the edges**
Normalize cache keys, attach purge hooks, restrict egress, redact payloads, enforce idempotency, and use jittered backoff.
3) **Lock boot order**
Env and secrets first, then schema and indexes, then retrievers/rerankers, then app routes.
4) **Prove recovery**
One canary, one blue-green, one failover drill with restore. Keep artifacts.
Open: [bootstrap-ordering.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/bootstrap-ordering.md) · [retrieval-traceability.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/retrieval-traceability.md) · [data-contracts.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/data-contracts.md)
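
"Jittered backoff" in step 2 can be sketched as follows. This assumes the common "full jitter" variant, where each delay is drawn uniformly between zero and a capped exponential ceiling; the defaults `base_s` and `cap_s` are placeholders, not recommended values.

```python
import random

def backoff_delays(attempts, base_s=0.2, cap_s=10.0, rng=random.random):
    """Return one 'full jitter' delay per retry attempt, in seconds.

    Each delay is drawn uniformly from [0, min(cap_s, base_s * 2**attempt)).
    """
    delays = []
    for attempt in range(attempts):
        ceiling = min(cap_s, base_s * (2 ** attempt))
        delays.append(rng() * ceiling)
    return delays
```

A caller would sleep for each delay between retries and stop after `attempts` tries. Passing a fixed `rng` makes the schedule deterministic for tests.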
## Quick routes to per-page guides
* [cold\_start\_concurrency.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Cloud_Serverless/cold_start_concurrency.md)
* [timeouts\_streaming\_body\_limits.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Cloud_Serverless/timeouts_streaming_body_limits.md)
* [stateless\_kv\_queue\_patterns.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Cloud_Serverless/stateless_kv_queue_patterns.md)
* [edge\_cache\_invalidation.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Cloud_Serverless/edge_cache_invalidation.md)
* [egress\_rules\_and\_webhooks.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Cloud_Serverless/egress_rules_and_webhooks.md)
* [serverless\_ci\_cd.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Cloud_Serverless/serverless_ci_cd.md)
* [env\_bootstrap\_and\_migrations.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Cloud_Serverless/env_bootstrap_and_migrations.md)
* [quotas\_scaling\_budget\_caps.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Cloud_Serverless/quotas_scaling_budget_caps.md)
* [secrets\_rotation.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Cloud_Serverless/secrets_rotation.md)
* [multi\_region\_routing.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Cloud_Serverless/multi_region_routing.md)
* [region\_failover\_drills.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Cloud_Serverless/region_failover_drills.md)
* [observability\_slo.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Cloud_Serverless/observability_slo.md)
* [canary\_release\_serverless.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Cloud_Serverless/canary_release_serverless.md)
* [blue\_green\_switchovers.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Cloud_Serverless/blue_green_switchovers.md)
* [disaster\_recovery\_tabletop.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Cloud_Serverless/disaster_recovery_tabletop.md)
* [data\_retention\_and\_backups.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Cloud_Serverless/data_retention_and_backups.md)
* [privacy\_and\_pii\_edges.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/Cloud_Serverless/privacy_and_pii_edges.md)
---
## Copy-paste prompt for cloud incidents
@ -88,7 +111,47 @@ Tell me:
3) the minimal steps to restore SLO today,
4) a small regression suite to keep it fixed.
Return a short, auditable plan.
```
````
---
## FAQ
**Q1. Why does streaming feel slow even when average latency looks fine?**
Small proxy buffers or short read timeouts choke chunked responses. Fix with [timeouts\_streaming\_body\_limits.md](./timeouts_streaming_body_limits.md). Track TTFB and chunk cadence, not just average latency.
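
As a rough illustration of tracking TTFB and chunk cadence, the sketch below summarizes one streamed response from timestamps. It assumes you already record the request start time and the arrival time of each chunk in seconds; `stream_stats` is a hypothetical helper name.

```python
def stream_stats(request_start, chunk_times):
    """Summarize a streamed response from timestamps given in seconds.

    request_start: time the request was sent.
    chunk_times: arrival time of each chunk, in order.
    """
    if not chunk_times:
        return {"ttfb_ms": None, "max_gap_ms": None}
    ttfb_ms = (chunk_times[0] - request_start) * 1000
    # Gaps between consecutive chunks expose stalls that averages hide.
    gaps = [b - a for a, b in zip(chunk_times, chunk_times[1:])]
    max_gap_ms = max(gaps) * 1000 if gaps else 0.0
    return {"ttfb_ms": ttfb_ms, "max_gap_ms": max_gap_ms}
```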
**Q2. What is the fastest way to reduce cold starts?**
Cap concurrency, pre-warm critical routes, and keep dependencies slim. If the provider supports provisioned or minimum instances, enable them for RAG endpoints. See [cold\_start\_concurrency.md](./cold_start_concurrency.md).
**Q3. My retries create duplicate work and extra bills. How do I stop that?**
Use **idempotency keys** with a KV fence and jittered backoff. Reject replays within the window. Patterns in [stateless\_kv\_queue\_patterns.md](./stateless_kv_queue_patterns.md).
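
One way to picture the idempotency fence is shown below. This is a sketch only: an in-memory dict stands in for the KV store, and the class name and `window_s` default are assumptions. A real KV backend would need an atomic set-if-absent with TTL so that concurrent invocations cannot both pass the check.

```python
import time

class IdempotencyFence:
    """In-memory stand-in for a KV store with TTL (illustrative only)."""

    def __init__(self, window_s=300.0, clock=time.monotonic):
        self.window_s = window_s
        self.clock = clock
        self._seen = {}  # idempotency key -> expiry timestamp

    def first_time(self, key):
        """Return True if key is new within the window, False for a replay."""
        now = self.clock()
        expiry = self._seen.get(key)
        if expiry is not None and expiry > now:
            return False  # replay inside the window: reject
        self._seen[key] = now + self.window_s
        return True
```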
**Q4. Preview works, prod fails after a schema change. Why?**
You deployed app routes before index or schema were ready. Fix your boot order and add deployment checks. See [env\_bootstrap\_and\_migrations.md](./env_bootstrap_and_migrations.md) and [serverless\_ci\_cd.md](./serverless_ci_cd.md).
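
A deployment check for boot order could look like the sketch below. The stage names are illustrative labels for the order described in "Lock boot order" (env and secrets, then schema and indexes, then retrievers/rerankers, then app routes); it uses the standard-library `graphlib` module, available in Python 3.9 and later.

```python
from graphlib import TopologicalSorter

# Each stage maps to the stages it depends on (names are illustrative).
BOOT_DEPS = {
    "env_and_secrets": set(),
    "schema_and_indexes": {"env_and_secrets"},
    "retrievers_rerankers": {"schema_and_indexes"},
    "app_routes": {"retrievers_rerankers"},
}

def boot_order(deps):
    """Return stages in an order that respects every dependency."""
    return list(TopologicalSorter(deps).static_order())

def can_start(stage, ready, deps=BOOT_DEPS):
    """A stage may start only when all of its dependencies are ready."""
    return deps[stage] <= set(ready)
```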
**Q5. We did a canary, metrics looked noisy, then users complained.**
Your split wasn't even or your metrics were not route-scoped. Follow [canary\_release\_serverless.md](./canary_release_serverless.md) and attach a per-route SLO in [observability\_slo.md](./observability_slo.md).
**Q6. RAG started hallucinating after an infra tweak. Is that a coincidence?**
Likely not. If cache keys or analyzer versions changed, retrieved snippets can drift. Compare ΔS and coverage before and after the change. See [retrieval-traceability.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/retrieval-traceability.md) and [data-contracts.md](https://github.com/onestardao/WFGY/blob/main/ProblemMap/data-contracts.md).
**Q7. How do I stop webhook storms and data exfiltration at the edge?**
Enforce an egress allowlist, cap retries with backoff, and validate payload schemas. See [egress\_rules\_and\_webhooks.md](./egress_rules_and_webhooks.md) and [privacy\_and\_pii\_edges.md](./privacy_and_pii_edges.md).
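
An egress allowlist check can be as small as the sketch below. It assumes exact hostname matching over HTTPS only; the hosts in `ALLOWED_HOSTS` are placeholders, and real deployments would usually also enforce this at the network or platform layer.

```python
from urllib.parse import urlsplit

ALLOWED_HOSTS = {"api.example.com", "hooks.example.com"}  # placeholder hosts

def egress_allowed(url, allowed_hosts=ALLOWED_HOSTS):
    """Allow only HTTPS requests to exact hostnames on the allowlist."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    return parts.scheme == "https" and host in allowed_hosts
```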
**Q8. We cache aggressively but get “wrong user's data” bugs.**
Your cache key doesn't include the tenant or role. Normalize keys and purge on writes. See [edge\_cache\_invalidation.md](./edge_cache_invalidation.md).
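
A normalized, tenant-scoped cache key might be built like this. The sketch assumes string-valued query parameters and that tenant and role are the only identity dimensions that matter; `cache_key` and its argument names are hypothetical.

```python
import hashlib
from urllib.parse import urlencode

def cache_key(tenant_id, role, path, params):
    """Build a cache key scoped by tenant and role, with normalized params."""
    query = urlencode(sorted(params.items()))  # stable parameter order
    normalized_path = path.rstrip("/") or "/"  # treat /a and /a/ alike
    raw = "\n".join([tenant_id, role, normalized_path, query])
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()
```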
**Q9. Multi-region is enabled, yet performance is random.**
Check sticky sessions, unsynced caches, and DNS TTL. Pin read/write paths and align cache invalidation. See [multi\_region\_routing.md](./multi_region_routing.md).
**Q10. Secrets rotation broke production. What did we miss?**
Rotate with overlap windows and staged rollout. Validate before flipping traffic. See [secrets\_rotation.md](./secrets_rotation.md).
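
The overlap window can be sketched as a verifier that accepts both the new and the old secret until rotation completes. This is a simplified illustration with a hypothetical `token_valid` helper; it compares raw shared secrets, whereas real systems would more likely verify signatures or hashed tokens.

```python
import hmac

def token_valid(presented, current_secret, previous_secret=None):
    """Accept the current secret, and the previous one during the overlap window.

    Pass previous_secret=None once the overlap window has closed.
    """
    candidates = [current_secret]
    if previous_secret is not None:
        candidates.append(previous_secret)
    # compare_digest performs a constant-time comparison of each candidate.
    return any(
        hmac.compare_digest(presented.encode("utf-8"), c.encode("utf-8"))
        for c in candidates
    )
```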
**Q11. Our DR plan exists, but teams still panic.**
You never ran a realistic drill. Run the full table-top and restore from backups with manifests. See [disaster\_recovery\_tabletop.md](./disaster_recovery_tabletop.md) and [data\_retention\_and\_backups.md](./data_retention_and_backups.md).
**Q12. Which SLOs should I start with for LLM endpoints?**
Route-level p95 latency (warm and cold), TTFB for streaming, throttle rate, and for RAG add **ΔS** and **coverage**. Templates in [observability\_slo.md](./observability_slo.md).
---