Cloud & Serverless — Guardrails and Fix Patterns
A compact hub to harden serverless and edge workloads without touching your core infra.
Targets Vercel, Cloudflare Workers, Lambda, Cloud Run, Azure Functions, Fly.io, and similar stacks.
Each symptom maps to an auditable WFGY fix page with measurable acceptance.
Open these first
- Visual map and recovery → rag-architecture-and-recovery.md
- Boot order and deployments → bootstrap-ordering.md · deployment-deadlock.md · predeploy-collapse.md
- Retrieval integrity and payloads → retrieval-traceability.md · data-contracts.md
- Threats and schema locks → prompt-injection.md · bluffing.md
Core acceptance
- p95 warm path latency ≤ 300 ms, cold path ≤ 1200 ms under nominal load
- First-byte time on streaming APIs ≤ 500 ms when warm
- Availability ≥ 99.9%, SLO tracked per route with error budget alerts
- Concurrency never exceeds configured caps; retries use jittered backoff
- Secrets rotated within policy; zero PII in logs and vector payloads
- RAG routes hold ΔS(question, retrieved) ≤ 0.45 and coverage ≥ 0.70 after infra changes
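The thresholds above can be turned into an automated gate. The following is a minimal sketch, assuming per-route metrics are already collected elsewhere; the `RouteMetrics` type and `check_acceptance` function are hypothetical names used only for illustration, and ΔS and coverage are treated as precomputed numbers.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RouteMetrics:
    """Hypothetical per-route snapshot; field names are illustrative."""
    p95_warm_ms: float
    p95_cold_ms: float
    ttfb_warm_ms: float
    availability: float               # fraction, e.g. 0.9995
    delta_s: Optional[float] = None   # RAG routes only
    coverage: Optional[float] = None  # RAG routes only


def check_acceptance(m: RouteMetrics) -> List[str]:
    """Return a list of violated acceptance criteria (empty means pass)."""
    violations = []
    if m.p95_warm_ms > 300:
        violations.append(f"p95 warm {m.p95_warm_ms} ms > 300 ms")
    if m.p95_cold_ms > 1200:
        violations.append(f"p95 cold {m.p95_cold_ms} ms > 1200 ms")
    if m.ttfb_warm_ms > 500:
        violations.append(f"warm TTFB {m.ttfb_warm_ms} ms > 500 ms")
    if m.availability < 0.999:
        violations.append(f"availability {m.availability} < 0.999")
    if m.delta_s is not None and m.delta_s > 0.45:
        violations.append(f"delta_s {m.delta_s} > 0.45")
    if m.coverage is not None and m.coverage < 0.70:
        violations.append(f"coverage {m.coverage} < 0.70")
    return violations
```

Running this check before and after an infra change gives a simple pass/fail artifact to keep with the deploy record.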
Quick index — per-page guides
| Area | Page |
|---|---|
| Cold start and concurrency caps | cold_start_concurrency.md |
| Streaming stalls, body cutoffs | timeouts_streaming_body_limits.md |
| Stateless jobs, idempotency, dedupe | stateless_kv_queue_patterns.md |
| Edge cache invalidation | edge_cache_invalidation.md |
| Egress rules, VPC, webhooks | network_egress_and_vpc.md |
| Deploy traffic shaping / CI-CD | deploy_traffic_shaping.md |
| Quotas, scaling, budget caps | serverless_limits_matrix.md |
| Secrets rotation | secrets_rotation.md |
| Multi-region routing | multi_region_routing.md |
| Region failover drills | region_failover_drills.md |
| Observability and SLOs | observability_slo.md |
| Canary releases | canary_release_serverless.md |
| Blue-green switchovers | blue_green_switchovers.md |
| Disaster recovery table-top | disaster_recovery_tabletop.md |
| Data retention and backups | data_retention_and_backups.md |
| Privacy and PII at edges | privacy_and_pii_edges.md |
Symptom → exact fix
| Symptom | Likely cause | Open this |
|---|---|---|
| Spiky cold starts and timeouts | Oversubscribed concurrency, no provisioned capacity | cold_start_concurrency.md |
| Streaming stalls or body cutoffs | Proxy buffers, tiny read timeouts, chunked encoding quirks | timeouts_streaming_body_limits.md |
| Stateless bugs and lost work | In-memory state, duplicate triggers, missing idempotency | stateless_kv_queue_patterns.md |
| Users see stale results | Cache keys drift, no purge on writes | edge_cache_invalidation.md |
| Webhook storms or data leaks | Open egress, retry spirals, payload bloat | network_egress_and_vpc.md |
| Drift between preview and prod | Env mismatch, unsafe deploys, missing checks | deploy_traffic_shaping.md |
| Surprise bills and throttles | No quotas, bursty retries, N+1 calls | serverless_limits_matrix.md |
| Token leaks and broken rotation | Long-lived keys, missing overlap windows | secrets_rotation.md |
| Cross-region weirdness | Sticky sessions, unsynced caches, DNS TTLs | multi_region_routing.md |
| Failover works only on paper | Stale health checks, untested runbooks | region_failover_drills.md |
| SLOs feel random | No golden signals, no ΔS probes on RAG | observability_slo.md |
| Canary breaks users silently | Uneven traffic splits, noisy metrics | canary_release_serverless.md |
| Blue-green stuck or unsafe | Skewed env vars, missed DB switchover | blue_green_switchovers.md |
| DR playbooks collapse | Missing drills, restore paths untested | disaster_recovery_tabletop.md |
| Backups exist but useless | Wrong cadence, missing manifests | data_retention_and_backups.md |
| PII shows up in logs/vectors | No DLP, loose schemas, unsafe webhooks | privacy_and_pii_edges.md |
Fix in 60 seconds
- Measure reality: Capture warm vs cold p95, TTFB for streaming, throttles, and for RAG routes log ΔS and coverage.
- Fence the edges: Normalize cache keys, attach purge hooks, restrict egress, redact payloads, enforce idempotency, and use jittered backoff.
- Lock boot order: Env and secrets first, then schema and indexes, then retrievers/rerankers, then app routes.
- Prove recovery: One canary, one blue-green, one failover drill with restore. Keep artifacts.
Open: bootstrap-ordering.md · retrieval-traceability.md · data-contracts.md
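The boot order described above can be enforced with a small startup gate. This is a sketch under stated assumptions: the stage names and the `boot` function are hypothetical, and each readiness check is a caller-supplied callable that returns `True` when its layer is ready.

```python
from typing import Callable, Dict, List

# Stage names are illustrative; the order follows the guide:
# env and secrets, then schema and indexes, then retrievers/rerankers, then routes.
BOOT_ORDER = [
    "env_and_secrets",
    "schema_and_indexes",
    "retrievers_and_rerankers",
    "app_routes",
]


def boot(checks: Dict[str, Callable[[], bool]]) -> List[str]:
    """Run readiness checks in order and halt at the first failing stage."""
    started: List[str] = []
    for stage in BOOT_ORDER:
        if not checks[stage]():
            raise RuntimeError(
                f"boot halted at {stage}; stages ready so far: {started}"
            )
        started.append(stage)
    return started
```

Because later stages are never attempted after a failure, app routes cannot come up against a schema or index that is not ready.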
Copy-paste prompt for cloud incidents
You have TXT OS and the WFGY Problem Map loaded.
My serverless incident:
- route: [api path]
- env: [prod|staging|preview]
- metrics: { p95_warm_ms, p95_cold_ms, ttfb_ms, throttles, 5xx_rate }
- cache: { key_schema, ttl, purge_events }
- egress: { domains, retries, dlp_rules }
- RAG: { ΔS, coverage, λ states across 3 paraphrases }
Tell me:
1) failing layer and why,
2) the exact WFGY pages to open,
3) the minimal steps to restore SLO today,
4) a small regression suite to keep it fixed.
Return a short, auditable plan.
FAQ
Q1. Why does streaming feel slow even when average latency looks fine? Small proxy buffers or short read timeouts choke chunked responses. Fix with timeouts_streaming_body_limits.md. Track TTFB and chunk cadence, not just average latency.
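As a rough illustration of tracking TTFB and chunk cadence, the sketch below derives both from timestamps recorded on the client side. The function name `stream_stats` and its inputs are assumptions for this example; timestamps are in seconds from any monotonic clock.

```python
from typing import Dict, List


def stream_stats(request_sent: float, chunk_times: List[float]) -> Dict[str, float]:
    """Compute TTFB and inter-chunk gaps (ms) for one streamed response."""
    if not chunk_times:
        raise ValueError("no chunks received")
    ttfb_ms = (chunk_times[0] - request_sent) * 1000
    # Gap between each consecutive pair of chunks.
    gaps = [(b - a) * 1000 for a, b in zip(chunk_times, chunk_times[1:])]
    return {
        "ttfb_ms": ttfb_ms,
        "max_gap_ms": max(gaps) if gaps else 0.0,
        "mean_gap_ms": sum(gaps) / len(gaps) if gaps else 0.0,
    }
```

A large `max_gap_ms` with a healthy average latency is the typical signature of a buffering proxy or a short read timeout.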
Q2. What is the fastest way to reduce cold starts? Cap concurrency, pre-warm critical routes, and keep dependencies slim. If the provider supports provisioned or minimum instances, enable them for RAG endpoints. See cold_start_concurrency.md.
Q3. My retries create duplicate work and extra bills. How do I stop that? Use idempotency keys with a KV fence and jittered backoff. Reject replays within the window. Patterns in stateless_kv_queue_patterns.md.
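A minimal sketch of the idempotency fence and jittered backoff follows. It assumes an in-memory dict as a stand-in for the KV store; a real deployment would need an atomic set-if-absent with a TTL on the provider's KV. `IdempotencyFence` and `backoff_delay` are hypothetical names, and the backoff uses the "full jitter" variant.

```python
import random
import time
from typing import Dict, Optional


class IdempotencyFence:
    """In-memory stand-in for a KV fence; NOT safe across instances."""

    def __init__(self, window_s: float):
        self.window_s = window_s
        self._seen: Dict[str, float] = {}  # key -> expiry timestamp

    def claim(self, key: str, now: Optional[float] = None) -> bool:
        """Return True if this key is new within the window, else False."""
        now = time.time() if now is None else now
        expiry = self._seen.get(key)
        if expiry is not None and expiry > now:
            return False  # replay inside the window: reject
        self._seen[key] = now + self.window_s
        return True


def backoff_delay(attempt: int, base_s: float = 0.2, cap_s: float = 10.0) -> float:
    """Full jitter: uniform delay in [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0.0, min(cap_s, base_s * (2 ** attempt)))
```

The handler calls `claim` with the request's idempotency key before doing any work, and the caller sleeps for `backoff_delay(attempt)` between retries so that retries from many clients do not line up.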
Q4. Preview works, prod fails after a schema change. Why? You deployed app routes before the schema or index was ready. Fix your boot order and add deployment checks. See bootstrap-ordering.md and deploy_traffic_shaping.md.
Q5. We did a canary, metrics looked noisy, then users complained. Your split wasn’t even or your metrics were not route-scoped. Follow canary_release_serverless.md and attach a per-route SLO in observability_slo.md.
Q6. RAG started hallucinating after an infra tweak. Is that a coincidence? Likely not. Cache keys or analyzer versions changed, so snippets drift. Verify ΔS and coverage before and after. See retrieval-traceability.md and data-contracts.md.
Q7. How do I stop webhook storms and data exfiltration at the edge? Enforce an egress allowlist, cap retries with backoff, and validate payload schemas. See network_egress_and_vpc.md and privacy_and_pii_edges.md.
Q8. We cache aggressively but get “wrong user’s data” bugs. Your key doesn’t include the right tenants or roles. Normalize keys and purge on writes. See edge_cache_invalidation.md.
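One way to normalize keys so they always carry tenant and role is sketched below. The `cache_key` function and its parameter names are assumptions for illustration; the normalization rules (strip a trailing slash, sort query parameters) are one reasonable choice, not a prescribed scheme.

```python
import hashlib
from urllib.parse import urlencode


def cache_key(route: str, tenant_id: str, role: str, params: dict) -> str:
    """Build a deterministic cache key scoped to tenant and role."""
    # Sort parameters so that argument order never changes the key.
    normalized = urlencode(sorted((k, str(v)) for k, v in params.items()))
    raw = f"{route.rstrip('/')}|{tenant_id}|{role}|{normalized}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()
```

With tenant and role in the key, two users in different tenants can never share an entry, and a write handler can purge by recomputing the same key.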
Q9. Multi-region is enabled, yet performance is random. Check sticky sessions, unsynced caches, and DNS TTL. Pin read/write paths and align cache invalidation. See multi_region_routing.md.
Q10. Secrets rotation broke production. What did we miss? Rotate with overlap windows and staged rollout. Validate before flipping traffic. See secrets_rotation.md.
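The overlap window can be sketched as a verifier that accepts both the current and the previous secret until the overlap expires. `RotatingSecret` is a hypothetical class for illustration; a real setup would load secrets from the provider's secret manager rather than hold them in plain attributes.

```python
import hmac
from typing import Optional


class RotatingSecret:
    """Accepts the previous secret for a limited overlap after rotation."""

    def __init__(self, current: str):
        self.current = current
        self.previous: Optional[str] = None
        self.previous_expires_at = 0.0

    def rotate(self, new: str, now: float, overlap_s: float) -> None:
        """Promote a new secret; keep the old one valid for overlap_s seconds."""
        self.previous = self.current
        self.previous_expires_at = now + overlap_s
        self.current = new

    def accepts(self, presented: str, now: float) -> bool:
        """Constant-time check against current, then previous if still valid."""
        candidate = presented.encode("utf-8")
        if hmac.compare_digest(candidate, self.current.encode("utf-8")):
            return True
        if self.previous is not None and now < self.previous_expires_at:
            return hmac.compare_digest(candidate, self.previous.encode("utf-8"))
        return False
```

Clients that have not yet picked up the new secret keep working during the overlap, which gives the staged rollout time to complete before the old key stops being accepted.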
Q11. Our DR plan exists, but teams still panic. You never ran a realistic drill. Run the full table-top and restore from backups with manifests. See disaster_recovery_tabletop.md and data_retention_and_backups.md.
Q12. Which SLOs should I start with for LLM endpoints? Route-level p95 latency (warm and cold), TTFB for streaming, throttle rate, and for RAG add ΔS and coverage. Templates in observability_slo.md.
🔗 Quick-Start Downloads (60 sec)
| Tool | Link | 3-Step Setup |
|---|---|---|
| WFGY 1.0 PDF | Engine Paper | 1️⃣ Download · 2️⃣ Upload to your LLM · 3️⃣ Ask “Answer using WFGY + <your question>” |
| TXT OS (plain-text OS) | TXTOS.txt | 1️⃣ Download · 2️⃣ Paste into any LLM chat · 3️⃣ Type “hello world” — OS boots instantly |
🧭 Explore More
| Module | Description | Link |
|---|---|---|
| WFGY Core | WFGY 2.0 engine is live: full symbolic reasoning architecture and math stack | View → |
| Problem Map 1.0 | Initial 16-mode diagnostic and symbolic fix framework | View → |
| Problem Map 2.0 | RAG-focused failure tree, modular fixes, and pipelines | View → |
| Semantic Clinic Index | Expanded failure catalog: prompt injection, memory bugs, logic drift | View → |
| Semantic Blueprint | Layer-based symbolic reasoning & semantic modulations | View → |
| Benchmark vs GPT-5 | Stress test GPT-5 with full WFGY reasoning suite | View → |
| 🧙‍♂️ Starter Village 🏡 | New here? Lost in symbols? Click here and let the wizard guide you through | Start → |