WFGY/ProblemMap/patterns/pattern_bootstrap_deadlock.md
2025-08-13 18:25:05 +08:00

14 KiB
Raw Blame History

Pattern — Bootstrap Deadlock (No.14 Startup Ordering)

Scope
Service boots, logs look “healthy,” but readiness never flips. Components wait on each other in a cycle (model ⇄ retriever ⇄ policy ⇄ index), or a probe marks “ready” before dependencies are actually warm, causing stuck loops or flapping.

Why it matters
Deadlocks at startup burn deploy time, mask real regressions, and create ghost 500s. Fixing them requires explicit dependency graphs, single-owner warmup, and deterministic gates.

Quick nav: Patterns Index · Examples: Example 01 · Example 03 · Eval: Precision & CHR


1) Signals & fast triage

Likely symptoms

  • /readyz stays red forever, while /livez is green.
  • Warmup logs repeat: “waiting for model,” “waiting for index,” with no progress counter.
  • First request after rollout returns 503/timeout even though probes were green for a moment (flap).
  • Two components each wait on the others warm signal (classic cycle).

Deterministic checks (no LLM)

  • Build a dependency DAG (JSON/YAML). Reject boot if a cycle exists (topological sort fails).
  • Require a monotonic heartbeat per phase; if no delta in N seconds → flag STALL.
  • Enforce single warmup owner (mutex/lock) so concurrent warmups cant interleave.

2) Minimal reproducible case

Bad sequence (cycle)

  • Retriever waits for Model “ready” to run sentinel query.
  • Model waits for Retriever to provide a sample vector to warm the head.
    → Neither proceeds. /readyz hangs.

Bad probe

  • /readyz returns 200 once the HTTP server binds, even though FAISS/embeddings arent loaded.
    → First live request 500s; probe flaps.

3) Root causes

  • Implicit dependencies hidden in code paths (e.g., readiness calls the real query path which itself checks readiness).
  • Multiple warmup owners racing to initialize the same cache/index.
  • Circular waits across microservices (A waits for Bs /readyz, B waits for As /readyz).
  • Probe confusion: liveness vs readiness mixed; readiness not tied to real artifacts.

4) Standard fix (ordered, minimal, measurable)

Step 1 — Declare the DAG
A machine-readable dependency list. Example:

{
  "nodes": ["config","manifest","model","index","policy","sentinel"],
  "edges": [["config","manifest"],["manifest","model"],["model","index"],["index","policy"],["policy","sentinel"]]
}

Step 2 — Validate at boot

  • Toposort the DAG; abort if a cycle exists.
  • Persist the order and run warmup phases exactly in that order.

Step 3 — Single warmup owner

  • Acquire a process-wide lock before warmup. Other threads read READY=false and return 503.

Step 4 — Readiness = artifacts

  • Flip READY=true only after: config loaded, manifest validated, models reachable, index loaded, guard/policy loaded, sentinel query passes (same path as prod).

Step 5 — Timeouts + heartbeats

  • Each phase updates a heartbeat counter (ok_count) and a timestamp.
  • If no progress within T_deadline, log STALL and retry from phase 0 (not from mid-phase).

Step 6 — External waits are banned

  • Never wait on other services /readyz in your readiness. Use local mocks/sentinels.

5) Reference implementation — Python (DAG + deadlock detector)

Create ops/boot_guard.py.

# ops/boot_guard.py -- DAG validation, single-owner warmup, heartbeats, readiness flag
import json, os, time, threading, http.server, socketserver

READY = False
LOCK  = threading.Lock()
STATE = {"phase":"init","hb":{},"errors":[]}

def topo(nodes, edges):
    from collections import defaultdict, deque
    g = defaultdict(list); indeg = {n:0 for n in nodes}
    for a,b in edges: g[a].append(b); indeg[b]+=1
    q=deque([n for n in nodes if indeg[n]==0]); order=[]
    while q:
        u=q.popleft(); order.append(u)
        for v in g[u]:
            indeg[v]-=1
            if indeg[v]==0: q.append(v)
    if len(order)!=len(nodes): raise RuntimeError("CYCLE in bootstrap DAG")
    return order

def hb(phase): STATE["hb"][phase] = {"ts": time.time(), "ok_count": STATE["hb"].get(phase,{}).get("ok_count",0)+1}

def warm_phase(name, fn):
    STATE["phase"] = name
    fn(); hb(name)

def p_config():   time.sleep(0.1)            # load env, secrets
def p_manifest(): time.sleep(0.1)            # compare manifest vs runtime (see Example 05)
def p_model():    time.sleep(0.2)            # ping LLM: respond "ok"
def p_index():    time.sleep(0.2)            # map index files + ids
def p_policy():   time.sleep(0.1)            # load guard templates/policies
def p_sentinel(): time.sleep(0.2)            # run end-to-end question; check template/refusal

PHASE_FUN = {
    "config":p_config, "manifest":p_manifest, "model":p_model,
    "index":p_index, "policy":p_policy, "sentinel":p_sentinel
}

def warmup():
    global READY
    dag = json.loads(os.getenv("BOOT_DAG", '{"nodes":["config","manifest","model","index","policy","sentinel"],"edges":[["config","manifest"],["manifest","model"],["model","index"],["index","policy"],["policy","sentinel"]]}'))
    order = topo(dag["nodes"], dag["edges"])
    start = time.time()
    for name in order:
        warm_phase(name, PHASE_FUN[name])
        # heartbeat watchdog
        last = STATE["hb"][name]["ts"]
        if time.time() - last > float(os.getenv("PHASE_DEADLINE","10")):
            raise RuntimeError(f"STALL at {name}")
    READY = True; STATE["phase"]="ready"

class H(http.server.BaseHTTPRequestHandler):
    def _j(self,code,obj): self.send_response(code); self.send_header("Content-Type","application/json"); self.end_headers(); self.wfile.write(json.dumps(obj).encode())
    def log_message(self, *a, **kw): pass
    def do_GET(self):
        if self.path=="/livez": return self._j(200, {"live":True,"phase":STATE["phase"]})
        if self.path=="/readyz": return self._j(200 if READY else 503, {"ready":READY,"phase":STATE["phase"],"hb":STATE["hb"],"errors":STATE["errors"][-3:]})
        return self._j(404, {"error":"not found"})

def main():
    # single-owner warmup
    with LOCK:
        try: warmup()
        except Exception as e: STATE["errors"].append({"phase":STATE["phase"],"err":str(e)})
    with socketserver.TCPServer(("",8081), H) as s:
        s.serve_forever()

if __name__=="__main__": main()

Run:

python ops/boot_guard.py
curl -s localhost:8081/livez
curl -s localhost:8081/readyz

What it guarantees

  • Any cycle in your declared DAG fails fast with CYCLE in bootstrap DAG.
  • Readiness flips only after all phases heartbeat.
  • A stalled phase triggers an exception instead of hanging forever.

6) Reference implementation — Node (same contract)

Create ops/boot_guard.mjs.

// ops/boot_guard.mjs -- DAG validation + single-owner warmup + readiness
import http from "node:http";

let READY = false;
const STATE = { phase:"init", hb:{}, errors:[] };

function topo(nodes, edges){
  const indeg = Object.fromEntries(nodes.map(n=>[n,0]));
  const g = Object.fromEntries(nodes.map(n=>[n,[]]));
  for(const [a,b] of edges){ g[a].push(b); indeg[b]++; }
  const q = nodes.filter(n=>indeg[n]===0); const order=[];
  while(q.length){ const u=q.shift(); order.push(u); for(const v of g[u]){ if(--indeg[v]===0) q.push(v); } }
  if(order.length!==nodes.length) throw new Error("CYCLE in bootstrap DAG");
  return order;
}
function hb(name){ const prev=(STATE.hb[name]?.ok_count||0); STATE.hb[name]={ ts:Date.now(), ok_count:prev+1 }; }

async function warmPhase(name, fn){ STATE.phase=name; await fn(); hb(name); }

// mock phases; swap with real ones
const phases = {
  config:   async()=>{},
  manifest: async()=>{},
  model:    async()=>{},
  index:    async()=>{},
  policy:   async()=>{},
  sentinel: async()=>{}
};

async function warmup(){
  const dag = JSON.parse(process.env.BOOT_DAG || '{"nodes":["config","manifest","model","index","policy","sentinel"],"edges":[["config","manifest"],["manifest","model"],["model","index"],["index","policy"],["policy","sentinel"]]}');
  const order = topo(dag.nodes, dag.edges);
  for(const name of order){
    await warmPhase(name, phases[name]);
    const last = STATE.hb[name].ts;
    const deadline = Number(process.env.PHASE_DEADLINE || 10000);
    if(Date.now() - last > deadline) throw new Error(`STALL at ${name}`);
  }
  READY = true; STATE.phase="ready";
}

const server = http.createServer((req,res)=>{
  const json=(c,o)=>{ res.writeHead(c,{"Content-Type":"application/json"}); res.end(JSON.stringify(o)); };
  if(req.url==="/livez")  return json(200,{live:true,phase:STATE.phase});
  if(req.url==="/readyz") return json(READY?200:503,{ready:READY,phase:STATE.phase,hb:STATE.hb,errors:STATE.errors.slice(-3)});
  json(404,{error:"not found"});
});

server.listen(8081, async ()=>{
  try { await warmup(); } catch(e){ STATE.errors.push({phase:STATE.phase, err:String(e)}); }
});

7) Acceptance criteria (ship/no-ship)

A build may ship only if:

  1. DAG validated (no cycles).
  2. Single-owner warmup completed; READY=true.
  3. Sentinel passes using the same path as production query.
  4. Readiness stays green for N seconds (e.g., 30s) without flap.
  5. Example 07 readiness probe semantics are followed (liveness ≠ readiness).

8) Prevention (contracts & defaults)

  • No external /readyz waits inside readiness. Use local mocks/sentinels only.
  • Artifacts-first readiness: flip only after model ping, index map, guard/policy load, sentinel ok.
  • One lock around warmup; idempotent phases.
  • Config freeze during warmup; reject config reloads that would restart phases mid-flight.
  • Document the DAG in repo (ops/bootstrap.dag.json) and pin it in CI.

9) Debug workflow (10 minutes)

  1. Print the DAG and the computed toposort at boot.
  2. Tail /readyz; if stuck, check STATE.hb for the last completed phase.
  3. If CYCLE, fix edges and retry locally.
  4. If STALL, increase logging inside that phase (model ping, file open, network).
  5. After fixing, run a rolling restart; verify green readiness window.

10) Common traps & fixes

  • Using /readyz to warm another service → cycle risk. Replace with direct dependency probes (e.g., TCP/socket or mock call).
  • Probes that always 200 → meaningless. Tie readiness to the artifact checklist.
  • Multiple warmers (thread + health controller) → wrap with a mutex or leadership election.
  • Sentinel shortcut that skips guards → green probe, broken prod. Run the real template.

11) Minimal checklist (copy into PR)

  • ops/bootstrap.dag.json committed and validated at boot.
  • Single warmup owner; no concurrent initializers.
  • Readiness requires sentinel success on the prod path.
  • No cross-service /readyz dependencies.
  • Example 07 probes configured (live vs ready).

References to hands-on examples

  • Example 07 — Bootstrap Ordering & Readiness Gate
  • Example 05 — Manifest validation before model/index warm
  • Example 03 — Use real retrieval in sentinel to catch hidden deps
  • Example 08 — Optional: run a micro-eval before flipping ready

🧭 Explore More

Module Description Link
WFGY Core Standalone semantic reasoning engine for any LLM View →
Problem Map 1.0 Initial 16-mode diagnostic and symbolic fix framework View →
Problem Map 2.0 RAG-focused failure tree, modular fixes, and pipelines View →
Semantic Clinic Index Expanded failure catalog: prompt injection, memory bugs, logic drift View →
Semantic Blueprint Layer-based symbolic reasoning & semantic modulations View →
Benchmark vs GPT-5 Stress test GPT-5 with full WFGY reasoning suite View →

👑 Early Stargazers: See the Hall of Fame
Engineers, hackers, and open source builders who supported WFGY from day one.

GitHub stars Help reach 10,000 stars by 2025-09-01 to unlock Engine 2.0 for everyone Star WFGY on GitHub

WFGY Main   TXT OS   Blah   Blot   Bloc   Blur   Blow