

Network Egress and VPC — Guardrails

🧭 Quick Return to Map

You are in a sub-page of Cloud_Serverless.
To reorient, go back here:

Think of this page as a desk within a ward.
If you need the full triage and all prescriptions, return to the Emergency Room lobby.

Keep serverless outbound traffic predictable and cheap. This page fixes DNS flakiness, NAT bottlenecks, blocked endpoints, and cross-region surprises that break RAG calls, vector writes, and webhook posts.

When to use this page

  • Lambdas or Cloud Run randomly time out on first external call.
  • Only production fails because it runs inside a VPC or private subnet.
  • Costs jump after moving to NAT or a new region.
  • Vector DB reachable from dev, unreachable from prod.
  • Long-tail latency grows after scale-up or during deploys.

Open these first

Acceptance targets

  • p95 DNS lookup ≤ 30 ms and connection setup ≤ 120 ms per request.
  • NAT or egress gateway active connections ≤ 70 percent of max for steady state.
  • Cross region egress ratio ≤ 5 percent of total requests after pinning.
  • Socket error rate ≤ 0.1 percent with retries and idempotency keys enabled.
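The targets above can be encoded as a single clamp check over a metrics sample. A minimal sketch; the field names (`p95_dns_ms` and so on) are assumptions chosen to echo the metric names used later on this page, so map them to whatever your telemetry emits:

```javascript
// Hypothetical clamp check against the acceptance targets above.
// All field names are assumptions; adapt them to your own metrics pipeline.
function withinTargets(sample) {
  return (
    sample.p95_dns_ms <= 30 &&            // p95 DNS lookup
    sample.p95_connect_ms <= 120 &&       // p95 connection setup
    sample.nat_active_ratio <= 0.70 &&    // NAT active connections vs max
    sample.cross_region_ratio <= 0.05 &&  // cross-region egress share
    sample.socket_error_rate <= 0.001     // socket errors with retries on
  );
}
```

Run it on each aggregation window and alert on the first `false`.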

Fix in 60 seconds

  1. Pin the path

    • Resolve targets to regional endpoints. Prefer private endpoints or service networking where offered.
    • Add an allowlist and block outbound traffic to unknown hosts.
  2. Stabilize resolution and connect

    • Enable connection pooling and keep-alive.
    • Cache DNS for short TTLs in runtime, and rotate resolvers only on failure.
  3. Choose the right egress shape

    • Small constant QPS → serverless NAT or egress gateway.
    • Spiky or chat streaming → dedicated NAT with connection headroom.
  4. Retry only what is safe

    • Use idempotency keys for POST.
    • Exponential backoff with jitter. Cap total retry time below request timeout.
  5. Observe and clamp

    • Emit dns_ms, tcp_connect_ms, tls_ms, ttfb_ms, status, retry_count.
    • Trip a circuit when connect errors exceed threshold. Route to regional fallback.
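Step 4 above can be sketched as one small helper: full-jitter exponential backoff, capped so total retry time stays below the request timeout, with an idempotency key on every attempt. The function name, the `doPost` callback, and the default budget are assumptions, not a prescribed API:

```javascript
// Hypothetical retry helper for step 4; `doPost` sends the request and throws on failure.
async function retryPost(doPost, key, { tries = 3, baseMs = 200, budgetMs = 5000 } = {}) {
  const start = Date.now();
  for (let attempt = 0; attempt < tries; attempt++) {
    try {
      // The idempotency key makes a replayed POST safe on the server side
      return await doPost({ "Idempotency-Key": key });
    } catch (err) {
      const backoff = Math.random() * baseMs * 2 ** attempt; // full jitter
      // Give up on the last attempt, or when the next wait would blow the budget
      if (attempt === tries - 1 || Date.now() - start + backoff > budgetMs) throw err;
      await new Promise((resolve) => setTimeout(resolve, backoff));
    }
  }
}
```

Set `budgetMs` below your platform request timeout so retries never outlive the invocation.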

Typical breakpoints → exact fix

  • Dev works, but prod inside a VPC cannot reach the external API → missing route or NACL block. Add a NAT or egress gateway with an explicit route table. Verify with regional probes. Open: predeploy-collapse.md

  • Cold starts get slow after VPC attach → ENI allocation adds latency. Reduce subnets, enable provisioned concurrency for hot paths, and pool connections at handler scope.

  • DNS timeouts during traffic spikes → resolver throttling or a missing cache. Enable a runtime DNS cache and set a low negative TTL. Monitor dns_ms.

  • Cross-region vector writes → the endpoint is global but peered to a distant region. Replace it with a regional endpoint and pin it by environment variable.

  • NAT port exhaustion → too many simultaneous connects with short keep-alive. Raise keep-alive to reuse sockets and scale NAT capacity.


Minimal recipes you can copy

A) AWS Lambda to external API through NAT

VPC subnets: 2 private subnets across 2 AZs
Route table: 0.0.0.0/0 → NAT gateway
Security group: egress 443 allowlist to api.vendor.com
Runtime
- HTTP agent keepAlive=true, maxSockets=per-function target
- DNS cache with TTL of 30–60 s
- Retries: 3 with jitter, idempotency key on POST
Metrics
- dns_ms, connect_ms, tls_ms, ttfb_ms, bytes_out, bytes_in
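The runtime DNS cache in recipe A can be a small TTL map in front of the resolver. A minimal sketch assuming Node; the 45 s default and the injectable `resolver` parameter are assumptions (the latter purely for testability):

```javascript
// Sketch of a runtime DNS cache with a short TTL; `resolver` defaults to
// Node's dns.promises.lookup and is injectable as an assumption for testing.
import { lookup } from "node:dns/promises";

const cache = new Map();

async function cachedLookup(host, { ttlMs = 45000, resolver = lookup } = {}) {
  const hit = cache.get(host);
  if (hit && Date.now() - hit.at < ttlMs) return hit.address; // fresh entry, skip resolver
  const { address } = await resolver(host);
  cache.set(host, { address, at: Date.now() });
  return address;
}
```

Keep the TTL short so endpoint failover is still picked up, and rotate resolvers only on failure, as in step 2 above.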

B) GCP Cloud Run with Serverless VPC Connector

Connector: min instances ≥ 2, autoscale up to peak QPS
Routes: only private ranges through connector, public stays direct
Env pins: API_HOST=api-ap-southeast1.vendor.com
Timeouts: request 120 s, connect 5 s, read 60 s

C) Azure Functions with VNet integration

Regional private endpoint to storage and vector DB
NAT gateway attached to subnet with enough SNAT ports
Outbound allowlist to model provider and webhook targets

D) Private service endpoints for data stores

Prefer:
- AWS PrivateLink to vector store or search
- GCP Private Service Connect for managed endpoints
- Azure Private Endpoint to PaaS databases

Disable public ingress on the target service once private is verified.

E) Connection pooling pattern

// Create the agent once at module scope so warm invocations reuse sockets
import { Agent } from "undici"; // Node's built-in fetch routes through undici dispatchers
const agent = new Agent({ keepAliveTimeout: 30000, connections: 256 })
// Reuse per invocation; the idempotency key makes retried POSTs safe
await fetch(url, { dispatcher: agent, signal, headers: { "Idempotency-Key": key } })

Observability you must add

  • Percent of requests by region and by endpoint.
  • dns_ms, connect_ms, tls_ms, ttfb_ms histograms.
  • NAT utilization, active connections, allocated ports.
  • Cross region egress cost meter for early alerts.
  • Circuit breaker state and fallback hit counts.
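The circuit-breaker state in the last bullet can come from a small counter-based breaker. A minimal sketch; the failure threshold and cooldown are assumptions, and routing to the regional fallback while the breaker is open is left to the caller:

```javascript
// Hypothetical consecutive-failure circuit breaker; threshold and cooldown are assumptions.
class Breaker {
  constructor(threshold = 5, cooldownMs = 30000) {
    this.threshold = threshold;
    this.cooldownMs = cooldownMs;
    this.failures = 0;   // consecutive failures seen
    this.openedAt = 0;   // timestamp of the trip
  }
  // Open while the failure streak is at threshold and the cooldown has not elapsed
  get open() {
    return this.failures >= this.threshold && Date.now() - this.openedAt < this.cooldownMs;
  }
  record(ok) {
    if (ok) { this.failures = 0; return; } // any success closes the breaker
    if (++this.failures === this.threshold) this.openedAt = Date.now();
  }
}
```

Export `breaker.open` as a gauge so the state and fallback hit counts land in the same dashboard.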

Verification

  • Regional canary proves pinned endpoint is used.
  • p95 connect settles under the target after keep-alive and DNS cache.
  • No cross region traffic after endpoint pin.
  • Retries do not duplicate writes due to idempotency.

When to escalate

  • Persistent connect errors in one region → route reads to nearest healthy region, queue writes for the home region.
  • NAT saturation → split subnets, add more NAT, or move heavy flows to a fixed egress gateway.
  • Vendor rate limits → enable token bucket per host and raise backoff caps.
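The per-host token bucket mentioned for vendor rate limits fits in a few lines. A minimal sketch; the rate and burst values are assumptions to tune per vendor:

```javascript
// Hypothetical per-host token bucket; pick rate and burst from the vendor's limits.
class TokenBucket {
  constructor(ratePerSec, burst) {
    this.rate = ratePerSec;
    this.capacity = burst;
    this.tokens = burst;      // start full
    this.last = Date.now();
  }
  // Returns true if a request may be sent now, consuming one token
  take(now = Date.now()) {
    this.tokens = Math.min(this.capacity, this.tokens + ((now - this.last) / 1000) * this.rate);
    this.last = now;
    if (this.tokens < 1) return false;
    this.tokens -= 1;
    return true;
  }
}
```

Keep one bucket per destination host so a throttled vendor does not starve traffic to the others.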

🔗 Quick-Start Downloads (60 sec)

| Tool | Link | 3-Step Setup |
|------|------|--------------|
| WFGY 1.0 PDF | Engine Paper | 1 Download · 2 Upload to your LLM · 3 Ask “Answer using WFGY + <your question>” |
| TXT OS (plain-text OS) | TXTOS.txt | 1 Download · 2 Paste into any LLM chat · 3 Type “hello world” — OS boots instantly |

Explore More

| Layer | Page | What it's for |
|-------|------|---------------|
| Proof | WFGY Recognition Map | External citations, integrations, and ecosystem proof |
| Engine | WFGY 1.0 | Original PDF-based tension engine |
| Engine | WFGY 2.0 | Production tension kernel and math engine for RAG and agents |
| Engine | WFGY 3.0 | TXT-based Singularity tension engine, 131 S-class set |
| Map | Problem Map 1.0 | Flagship 16-problem RAG failure checklist and fix map |
| Map | Problem Map 2.0 | RAG-focused recovery pipeline |
| Map | Problem Map 3.0 | Global Debug Card, image as a debug protocol layer |
| Map | Semantic Clinic | Symptom to family to exact fix |
| Map | Grandmas Clinic | Plain-language stories mapped to Problem Map 1.0 |
| Onboarding | Starter Village | Guided tour for newcomers |
| App | TXT OS | TXT semantic OS, fast boot |
| App | Blah Blah Blah | Abstract and paradox Q&A built on TXT OS |
| App | Blur Blur Blur | Text-to-image with semantic control |
| App | Blow Blow Blow | Reasoning game engine and memory demo |

If this repository helped, starring it improves discovery so more builders can find the docs and tools.