From 66ffd7cc66bcb334468eb0235c0dfeec3683f502 Mon Sep 17 00:00:00 2001
From: wenshao
Date: Tue, 12 May 2026 07:35:23 +0800
Subject: [PATCH] =?UTF-8?q?feat(serve):=20close=203=20chiga0=20audit=20ite?=
 =?UTF-8?q?ms=20=E2=80=94=20ringSize=204000,=20--max-sessions,=20/health?=
 =?UTF-8?q?=3Fdeep=3D1=20(#3803)?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Three "30-minute" items from chiga0's external architecture audit
(2026-05-11). All actionable within Stage 1 scope; the remaining items in
chiga0's review (SaaS positioning, multi-token to Stage 1.5, acp-bridge
package extraction, reference orchestrator) are larger scoping decisions
deferred to Stage 1.5/2.

DEFAULT_RING_SIZE 1000 → 4000 (Risk 4):

- A single long turn can emit hundreds of frames (the test plan reports 13
  for a SHORT turn; real workloads can be 10× that). 1000 was exhausted by
  a moderate turn before a 5s reconnect window finished. 4000 gives ~30×
  headroom over a typical busy turn at the cost of a few hundred KB
  RAM/session. Updated user + protocol docs and the
  daemon-client-quickstart example.

--max-sessions (default 20) (Rec 3):

- New `ServeOptions.maxSessions` + matching `BridgeOptions`. Bridge throws
  `SessionLimitExceededError` when `byId.size + inFlightSpawns.size >= max`
  BEFORE issuing a fresh spawn. Attaches to existing sessions (single
  scope) bypass the cap so an idle daemon's reconnects keep working at
  capacity. `0` disables. Default of 20 sized below the design's N≈50
  cliff (per-session ~30–50 MB RSS + FD pressure). HTTP route maps to 503
  with `Retry-After: 5` and `code: session_limit_exceeded`. Tests cover:
  cap rejection under thread scope, attach-not-counted under single scope,
  `0` disables. Documented in CLI flags table + protocol Common-error
  section.

/health?deep=1 (Risk 3):

- Default `/health` stays cheap (no bridge access).
With `?deep=1` the response includes `sessions` and `pendingPermissions` from the bridge — touches state so a wedged bridge surfaces as 503 `{status: "degraded"}` instead of "200 ok" on a zombie daemon (the `k8s rolling deploy will see healthy` failure mode chiga0 flagged). Loopback-vs-non-loopback bearer-exempt logic from the earlier A8dZT fix is preserved via a shared handler. Tests cover: cheap default, deep response shape, throwing-getter → 503. --- .../examples/daemon-client-quickstart.md | 2 +- docs/developers/qwen-serve-protocol.md | 24 +++++- docs/users/qwen-serve.md | 15 ++-- packages/cli/src/commands/serve.ts | 9 ++ packages/cli/src/serve/eventBus.ts | 14 +++- packages/cli/src/serve/httpAcpBridge.test.ts | 83 +++++++++++++++++++ packages/cli/src/serve/httpAcpBridge.ts | 44 ++++++++++ packages/cli/src/serve/runQwenServe.ts | 3 +- packages/cli/src/serve/server.test.ts | 64 ++++++++++++++ packages/cli/src/serve/server.ts | 53 ++++++++++-- packages/cli/src/serve/types.ts | 11 +++ 11 files changed, 303 insertions(+), 19 deletions(-) diff --git a/docs/developers/examples/daemon-client-quickstart.md b/docs/developers/examples/daemon-client-quickstart.md index 3e1696069..a72069aca 100644 --- a/docs/developers/examples/daemon-client-quickstart.md +++ b/docs/developers/examples/daemon-client-quickstart.md @@ -113,7 +113,7 @@ for await (const event of client.subscribeEvents(session.sessionId, { } ``` -The daemon retains the last 1000 events per session in a ring buffer; gaps beyond that window won't be re-deliverable. +The daemon retains the last 4000 events per session in a ring buffer; gaps beyond that window won't be re-deliverable. ## Voting on permissions diff --git a/docs/developers/qwen-serve-protocol.md b/docs/developers/qwen-serve-protocol.md index 424c56b75..f072ac15a 100644 --- a/docs/developers/qwen-serve-protocol.md +++ b/docs/developers/qwen-serve-protocol.md @@ -40,6 +40,18 @@ with status `400`. with status `404`. 
+`POST /session` past the daemon's `--max-sessions` cap returns `503` with a `Retry-After: 5` header and: + +```json +{ + "error": "Session limit reached (20)", + "code": "session_limit_exceeded", + "limit": 20 +} +``` + +Attaches to existing sessions are NOT counted toward the cap, so an idle daemon's reconnects keep working even when at-capacity. + ## Capabilities Every Stage 1 daemon advertises 9 feature tags. Clients **must** gate UI off `features`, not off `mode` (per design §10). @@ -60,7 +72,15 @@ Every Stage 1 daemon advertises 9 feature tags. Clients **must** gate UI off `fe ### `GET /health` -Liveness probe. Returns `200 {"status":"ok"}` if the listener is up. +Liveness probe. Default form returns `200 {"status":"ok"}` if the listener is up — cheap, no bridge access, suitable for high-frequency k8s/Compose liveness probes. + +Pass `?deep=1` (also accepts `?deep=true` or bare `?deep`) for a probe that touches bridge state. The deep probe is what catches a "zombie daemon" — listener answering, bridge wedged: + +```json +{ "status": "ok", "sessions": 3, "pendingPermissions": 1 } +``` + +A 503 with `{"status":"degraded"}` indicates the bridge state access threw — pull the daemon out of rotation and investigate. **Auth:** required **only on non-loopback binds**. On loopback (`127.0.0.1`, `::1`, `[::1]`) `/health` is registered before the bearer middleware so k8s/Compose probes inside the pod don't need to carry the token. On non-loopback (`--hostname 0.0.0.0` etc.) the route is registered after the bearer middleware and returns 401 without a valid token — otherwise an unauthenticated caller could probe arbitrary addresses to confirm a `qwen serve` exists, a low-severity info leak that combines poorly with port scanning. CORS deny + Host allowlist still apply on the loopback exemption. 
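For an external health checker, the deep-probe decision rule reduces to "200 + `status: "ok"` → keep in rotation, anything else → pull and investigate". A minimal TypeScript sketch of that rule — the `classifyDeepHealth` helper and the `DeepHealthBody` type are illustrative names for this doc, not anything exported by qwen serve:

```typescript
// Interpret a /health?deep=1 result for a rotation decision.
// The body shape mirrors the documented response; names are illustrative.
interface DeepHealthBody {
  status: string;
  sessions?: number;
  pendingPermissions?: number;
}

function classifyDeepHealth(
  httpStatus: number,
  body: DeepHealthBody,
): 'healthy' | 'degraded' {
  // 503 + {status:"degraded"} (wedged bridge) and any other non-ok
  // answer are treated the same way: take the daemon out of rotation.
  return httpStatus === 200 && body.status === 'ok' ? 'healthy' : 'degraded';
}
```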
@@ -237,7 +257,7 @@ The SSE-level `id:` / `event:` lines duplicate `envelope.id` / `envelope.type` f Reconnect semantics: -- Send `Last-Event-ID: ` to replay events with `id > n` from the per-session ring (default depth 1000) +- Send `Last-Event-ID: ` to replay events with `id > n` from the per-session ring (default depth 4000) - **Gap detection (client-side):** if `` predates the oldest event still in the ring (e.g. you reconnect with `Last-Event-ID: 50` but the ring now holds 200–1199), the daemon replays from the oldest available event without raising. Compare the first replayed event's `id` against `n + 1`; any difference is the size of the lost window. Stage 2 will inject an explicit `stream_gap` synthetic frame on the daemon side; in Stage 1 detection is the client's responsibility. - IDs are monotonic per session, starting at 1 - Synthetic terminal frames (`client_evicted`, `stream_error`) intentionally omit `id` so they don't burn a sequence slot for other subscribers diff --git a/docs/users/qwen-serve.md b/docs/users/qwen-serve.md index bdd8eaade..23ae59a81 100644 --- a/docs/users/qwen-serve.md +++ b/docs/users/qwen-serve.md @@ -57,7 +57,7 @@ curl -N http://127.0.0.1:4170/session/$SESSION_ID/events The `data:` line is the **full event envelope** — `{id?, v, type, data, originatorClientId?}` — JSON-stringified on a single line. The ACP payload (the `sessionUpdate` block in this example) sits under `data` inside that envelope. The SSE-level `id:` / `event:` lines are convenience for EventSource clients; the same values appear inside the JSON envelope so raw-`fetch` consumers get them too. Open this **before** sending the prompt — the SSE replay buffer holds the -last 1000 events so a late subscriber can catch up via `Last-Event-ID`, +last 4000 events so a late subscriber can catch up via `Last-Event-ID`, but for the simple "watch a single prompt" case it's easiest to subscribe first and let it stream live. 
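The client-side gap check described in the protocol doc (compare the first replayed event's `id` against your last-seen `id + 1`) is small enough to sketch; `lostEventCount` is an illustrative helper name, not a shipped client API:

```typescript
// After reconnecting with `Last-Event-ID: lastSeenId`, compare the first
// replayed envelope's id against lastSeenId + 1. Returns 0 when the ring
// (default depth 4000) still held everything; otherwise the number of
// events that aged out of the ring while you were disconnected.
function lostEventCount(lastSeenId: number, firstReplayedId: number): number {
  return Math.max(0, firstReplayedId - (lastSeenId + 1));
}
```

E.g. reconnecting with `Last-Event-ID: 50` while the ring now starts at 200 reports a 149-event lost window (events 51–199).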
@@ -100,12 +100,13 @@ The token comparison is constant-time (SHA-256 + `crypto.timingSafeEqual`); 401 ## CLI flags -| Flag | Default | Purpose | -| ------------------- | ----------- | ----------------------------------------------------------------------------------------------------------------------------------- | -| `--port ` | `4170` | TCP port. `0` = OS-assigned ephemeral port. | -| `--hostname ` | `127.0.0.1` | Bind interface. Anything beyond loopback requires a token. | -| `--token ` | — | Bearer token. Falls back to `QWEN_SERVER_TOKEN` env var (with leading/trailing whitespace stripped — handy for `$(cat token.txt)`). | -| `--http-bridge` | `true` | Stage 1 mode: per-session `qwen --acp` child process. Stage 2 native in-process becomes available later. | +| Flag | Default | Purpose | +| -------------------- | ----------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `--port ` | `4170` | TCP port. `0` = OS-assigned ephemeral port. | +| `--hostname ` | `127.0.0.1` | Bind interface. Anything beyond loopback requires a token. | +| `--token ` | — | Bearer token. Falls back to `QWEN_SERVER_TOKEN` env var (with leading/trailing whitespace stripped — handy for `$(cat token.txt)`). | +| `--max-sessions ` | `20` | Cap on concurrent live sessions. New `POST /session` requests that would spawn a fresh child return `503` (with `Retry-After: 5`) when the cap is hit; attaches to existing sessions are NOT counted. Set to `0` to disable. Sized for single-user / small-team usage; raise it if your deployment has the RAM/FD headroom (~30–50 MB per session). | +| `--http-bridge` | `true` | Stage 1 mode: per-session `qwen --acp` child process. 
Stage 2 native in-process becomes available later. | ## Default deployment threat model diff --git a/packages/cli/src/commands/serve.ts b/packages/cli/src/commands/serve.ts index 0d9fa81b4..bc70f8828 100644 --- a/packages/cli/src/commands/serve.ts +++ b/packages/cli/src/commands/serve.ts @@ -16,6 +16,7 @@ interface ServeArgs { port: number; hostname: string; token?: string; + 'max-sessions': number; // Read from the kebab-case key only — the camelCase mirror that yargs // synthesizes is convenient for handlers but type-confusing here. The // handler reads `argv['http-bridge']` directly. @@ -45,6 +46,13 @@ export const serveCommand: CommandModule = { description: 'Bearer token required on every request. Falls back to the QWEN_SERVER_TOKEN env var.', }) + .option('max-sessions', { + type: 'number', + default: 20, + description: + 'Cap on concurrent live sessions. New spawn requests beyond this return 503; ' + + 'attach to existing sessions still works. Set to 0 to disable.', + }) .option('http-bridge', { type: 'boolean', default: true, @@ -79,6 +87,7 @@ export const serveCommand: CommandModule = { hostname: argv.hostname, token: argv.token, mode: 'http-bridge', + maxSessions: argv['max-sessions'], }); } catch (err) { writeStderrLine( diff --git a/packages/cli/src/serve/eventBus.ts b/packages/cli/src/serve/eventBus.ts index a917ae55b..1d099fb2a 100644 --- a/packages/cli/src/serve/eventBus.ts +++ b/packages/cli/src/serve/eventBus.ts @@ -61,7 +61,17 @@ export interface SubscribeOptions { } const DEFAULT_MAX_QUEUED = 256; -const DEFAULT_RING_SIZE = 1000; +/** + * Default replay-ring depth per session. Sized for a 5-second + * reconnect window over a chatty turn — a single long-running prompt + * can emit hundreds of frames (test plan reports 13 for a short + * turn, real workloads can be 10× that or more once tool-call / + * thought streams pile up). 
1000 was the original default and could + * be exhausted by a moderate turn before the client reconnected; + * 4000 gives ~30× headroom over a typical-but-busy turn at the cost + * of a few hundred KB of RAM per session. + */ +const DEFAULT_RING_SIZE = 4000; /** * Per-bus subscriber cap. With per-subscriber `maxQueued` defaulting to * 256 frames, 64 concurrent subscribers caps the per-session subscriber @@ -133,7 +143,7 @@ export class EventBus { ...input, }; this.ring.push(event); - // Eviction-by-shift is O(n) once the ring is full. With ringSize=1000 + // Eviction-by-shift is O(n) once the ring is full. With ringSize=4000 // and per-publish work measured in hundreds of microseconds even on // chatty sessions, this isn't a real hotspot today. A circular-buffer // refactor would push it to O(1) but adds index bookkeeping; deferred diff --git a/packages/cli/src/serve/httpAcpBridge.test.ts b/packages/cli/src/serve/httpAcpBridge.test.ts index 1ed70fbad..b5c68c017 100644 --- a/packages/cli/src/serve/httpAcpBridge.test.ts +++ b/packages/cli/src/serve/httpAcpBridge.test.ts @@ -1785,4 +1785,87 @@ describe('createHttpAcpBridge', () => { expect(events[0]?.type).toBe('session_died'); }); }); + + describe('maxSessions cap (chiga0 Rec 3)', () => { + it('refuses NEW spawns past the cap with SessionLimitExceededError', async () => { + const factory: ChannelFactory = async () => makeChannel().channel; + const bridge = createHttpAcpBridge({ + channelFactory: factory, + maxSessions: 2, + // `thread` so each call is a fresh session, not an attach. + sessionScope: 'thread', + }); + + // First two spawns succeed. + await bridge.spawnOrAttach({ workspaceCwd: WS_A }); + await bridge.spawnOrAttach({ workspaceCwd: WS_B }); + expect(bridge.sessionCount).toBe(2); + + // Third hits the cap. + await expect( + bridge.spawnOrAttach({ workspaceCwd: WS_A }), + ).rejects.toMatchObject({ + name: 'SessionLimitExceededError', + limit: 2, + }); + // Cap rejection must NOT register a new session. 
+ expect(bridge.sessionCount).toBe(2); + + await bridge.shutdown(); + }); + + it('attach to an existing session under single scope is NOT counted toward the cap', async () => { + const factory: ChannelFactory = async () => makeChannel().channel; + const bridge = createHttpAcpBridge({ + channelFactory: factory, + maxSessions: 1, + sessionScope: 'single', + }); + + // First call spawns. + const a = await bridge.spawnOrAttach({ workspaceCwd: WS_A }); + expect(a.attached).toBe(false); + expect(bridge.sessionCount).toBe(1); + + // Second call to the SAME workspace attaches — cap doesn't apply. + const b = await bridge.spawnOrAttach({ workspaceCwd: WS_A }); + expect(b.attached).toBe(true); + expect(b.sessionId).toBe(a.sessionId); + expect(bridge.sessionCount).toBe(1); + + // But a NEW workspace (would need a fresh spawn) is rejected. + await expect( + bridge.spawnOrAttach({ workspaceCwd: WS_B }), + ).rejects.toMatchObject({ + name: 'SessionLimitExceededError', + }); + + await bridge.shutdown(); + }); + + it('maxSessions: 0 disables the cap', async () => { + // Distinct sessionIdPrefix per spawn so each call gets a unique + // sessionId (otherwise they'd collide in `byId` and only the + // last would remain — making `sessionCount` stay at 1). + let n = 0; + const factory: ChannelFactory = async () => + makeChannel({ sessionIdPrefix: `s${n++}` }).channel; + const bridge = createHttpAcpBridge({ + channelFactory: factory, + maxSessions: 0, + sessionScope: 'thread', + }); + // 5 spawns stays well below the would-be default of 20, so on its + // own this can't distinguish "cap disabled" from "default cap not + // yet hit". What it does pin down is that `maxSessions: 0` maps to + // unlimited rather than to a literal cap of zero: a regression that + // treated 0 as "reject every spawn" would fail on the first + // iteration.
+ for (let i = 0; i < 5; i++) { + await bridge.spawnOrAttach({ workspaceCwd: WS_A }); + } + expect(bridge.sessionCount).toBe(5); + await bridge.shutdown(); + }); + }); }); diff --git a/packages/cli/src/serve/httpAcpBridge.ts b/packages/cli/src/serve/httpAcpBridge.ts index 8a667f0a2..88c9d9098 100644 --- a/packages/cli/src/serve/httpAcpBridge.ts +++ b/packages/cli/src/serve/httpAcpBridge.ts @@ -177,6 +177,22 @@ export class SessionNotFoundError extends Error { } } +/** + * Thrown by `spawnOrAttach` when a fresh-spawn would push `sessionCount` + * past `BridgeOptions.maxSessions`. The HTTP route maps this to 503 + * with a `Retry-After` hint. Attaches (same workspace under `single` + * scope) never trip this — only NEW children. Distinct error type so + * routes can branch without text-matching. + */ +export class SessionLimitExceededError extends Error { + readonly limit: number; + constructor(limit: number) { + super(`Session limit reached (${limit})`); + this.name = 'SessionLimitExceededError'; + this.limit = limit; + } +} + /** * One ACP NDJSON channel to a single agent. Tests inject a fake by replacing * the channel factory; production uses `defaultSpawnChannelFactory`. @@ -208,6 +224,14 @@ export interface BridgeOptions { channelFactory?: ChannelFactory; /** How long to wait for the child's `initialize` reply before giving up. */ initializeTimeoutMs?: number; + /** + * Cap on concurrent live sessions. `spawnOrAttach` calls that would + * cross this throw `SessionLimitExceededError`; attaches to an + * existing session (same workspace under `single` scope) are not + * counted. `0` / `Infinity` disable the cap. Defaults to 20 — see + * `ServeOptions.maxSessions` for the rationale. 
+ */ + maxSessions?: number; } interface SessionEntry { @@ -394,9 +418,21 @@ class BridgeClient implements Client { } const DEFAULT_INIT_TIMEOUT_MS = 10_000; +const DEFAULT_MAX_SESSIONS = 20; export function createHttpAcpBridge(opts: BridgeOptions = {}): HttpAcpBridge { const sessionScope = opts.sessionScope ?? 'single'; + // `undefined` → default 20 (intentionally tight per #3803 N≈50 cliff). + // `0` or non-finite (Infinity / NaN) → unlimited. + // Positive finite → use as-is. + let maxSessions: number; + if (opts.maxSessions === undefined) { + maxSessions = DEFAULT_MAX_SESSIONS; + } else if (opts.maxSessions <= 0 || !Number.isFinite(opts.maxSessions)) { + maxSessions = Infinity; + } else { + maxSessions = opts.maxSessions; + } if (sessionScope !== 'single' && sessionScope !== 'thread') { throw new TypeError( `Invalid sessionScope: ${JSON.stringify(sessionScope)}. ` + @@ -801,6 +837,14 @@ export function createHttpAcpBridge(opts: BridgeOptions = {}): HttpAcpBridge { } } + // Cap check: count both registered sessions and in-flight spawns + // (a fresh spawn that is still racing toward registration hasn't + // hit `byId` yet but should still count toward the limit). Attaches + // returned above bypass this — only NEW children are gated. + if (byId.size + inFlightSpawns.size >= maxSessions) { + throw new SessionLimitExceededError(maxSessions); + } + const promise = doSpawn(workspaceKey, req.modelServiceId); + // Track in-flight spawns regardless of scope. Under `single` + // this also serves the coalescing path above (a parallel diff --git a/packages/cli/src/serve/runQwenServe.ts b/packages/cli/src/serve/runQwenServe.ts index e5cf5c55c..c7e2cb4f6 100644 --- a/packages/cli/src/serve/runQwenServe.ts +++ b/packages/cli/src/serve/runQwenServe.ts @@ -74,7 +74,8 @@ export async function runQwenServe( ); } - const bridge = deps.bridge ?? createHttpAcpBridge(); + const bridge = + deps.bridge ??
createHttpAcpBridge({ maxSessions: opts.maxSessions }); let actualPort = opts.port; const app = createServeApp(opts, () => actualPort, { bridge }); diff --git a/packages/cli/src/serve/server.test.ts b/packages/cli/src/serve/server.test.ts index b6404916e..606737407 100644 --- a/packages/cli/src/serve/server.test.ts +++ b/packages/cli/src/serve/server.test.ts @@ -17,6 +17,7 @@ import type { SetSessionModelResponse, } from '@agentclientprotocol/sdk'; import { + SessionLimitExceededError, SessionNotFoundError, type BridgeSession, type BridgeSessionSummary, @@ -758,6 +759,69 @@ describe('createServeApp', () => { expect(bridge.calls).toHaveLength(0); }); }); + + describe('GET /health?deep=1 (chiga0 Risk 3)', () => { + it('default /health stays cheap (no bridge touch)', async () => { + const bridge = fakeBridge(); + const app = createServeApp(baseOpts, undefined, { bridge }); + const res = await request(app) + .get('/health') + .set('Host', `127.0.0.1:${baseOpts.port}`); + expect(res.status).toBe(200); + expect(res.body).toEqual({ status: 'ok' }); + }); + + it('deep=1 includes bridge state', async () => { + const bridge = fakeBridge(); + const app = createServeApp(baseOpts, undefined, { bridge }); + const res = await request(app) + .get('/health?deep=1') + .set('Host', `127.0.0.1:${baseOpts.port}`); + expect(res.status).toBe(200); + expect(res.body).toMatchObject({ + status: 'ok', + sessions: 0, + pendingPermissions: 0, + }); + }); + + it('deep=1 returns 503 when bridge state access throws', async () => { + // Simulate a wedged bridge by replacing the getter to throw. 
+ const bridge = fakeBridge(); + Object.defineProperty(bridge, 'sessionCount', { + get() { + throw new Error('bridge wedged'); + }, + }); + const app = createServeApp(baseOpts, undefined, { bridge }); + const res = await request(app) + .get('/health?deep=1') + .set('Host', `127.0.0.1:${baseOpts.port}`); + expect(res.status).toBe(503); + expect(res.body).toEqual({ status: 'degraded' }); + }); + }); + + describe('session limit (chiga0 Rec 3 — --max-sessions)', () => { + it('503 + Retry-After + structured error when bridge throws SessionLimitExceededError', async () => { + const bridge = fakeBridge({ + spawnImpl: async () => { + throw new SessionLimitExceededError(20); + }, + }); + const app = createServeApp(baseOpts, undefined, { bridge }); + const res = await request(app) + .post('/session') + .set('Host', `127.0.0.1:${baseOpts.port}`) + .send({ cwd: '/work/a' }); + expect(res.status).toBe(503); + expect(res.headers['retry-after']).toBe('5'); + expect(res.body).toMatchObject({ + code: 'session_limit_exceeded', + limit: 20, + }); + }); + }); }); describe('runQwenServe', () => { diff --git a/packages/cli/src/serve/server.ts b/packages/cli/src/serve/server.ts index a1b1a0e89..00c2f3345 100644 --- a/packages/cli/src/serve/server.ts +++ b/packages/cli/src/serve/server.ts @@ -12,6 +12,7 @@ import { bearerAuth, denyBrowserOriginCors, hostAllowlist } from './auth.js'; import { isLoopbackBind } from './loopbackBinds.js'; import { createHttpAcpBridge, + SessionLimitExceededError, SessionNotFoundError, type HttpAcpBridge, } from './httpAcpBridge.js'; @@ -75,11 +76,39 @@ export function createServeApp( // gate `/health` behind their token alongside everything else. // CORS deny + Host allowlist still apply to `/health` in both // cases. + // Shared handler so loopback (pre-auth) and non-loopback (post-auth) + // routes return the same shape. 
`?deep=1` opts into a probe that + // touches bridge state — if the bridge is wedged the property + // accesses throw and we surface 503 so k8s probes can distinguish + // a zombie daemon from a healthy one. Default (no query) stays + // cheap so high-frequency liveness probes don't load the bridge. + const healthHandler = ( + req: import('express').Request, + res: import('express').Response, + ): void => { + const deepQuery = req.query['deep']; + const deep = deepQuery === '1' || deepQuery === 'true' || deepQuery === ''; + if (!deep) { + res.status(200).json({ status: 'ok' }); + return; + } + try { + res.status(200).json({ + status: 'ok', + sessions: bridge.sessionCount, + pendingPermissions: bridge.pendingPermissionCount, + }); + } catch (err) { + writeStderrLine( + `qwen serve: /health deep probe failed: ${err instanceof Error ? err.message : String(err)}`, + ); + res.status(503).json({ status: 'degraded' }); + } + }; + const loopback = isLoopbackBind(opts.hostname); if (loopback) { - app.get('/health', (_req, res) => { - res.status(200).json({ status: 'ok' }); - }); + app.get('/health', healthHandler); } app.use(bearerAuth(opts.token)); @@ -89,9 +118,7 @@ export function createServeApp( // Non-loopback: register `/health` AFTER `bearerAuth` so probes // must carry the token. Otherwise unauthenticated callers can // ping any reachable address:port to confirm a daemon exists. - app.get('/health', (_req, res) => { - res.status(200).json({ status: 'ok' }); - }); + app.get('/health', healthHandler); } app.get('/capabilities', (_req, res) => { @@ -624,6 +651,20 @@ function sendBridgeError(res: import('express').Response, err: unknown): void { res.status(404).json({ error: err.message, sessionId: err.sessionId }); return; } + if (err instanceof SessionLimitExceededError) { + // 503 Service Unavailable + `Retry-After` is the canonical + // "we'd serve you, but we're full right now" shape. 
The hint + // is intentionally conservative (5s) because a session that + // finishes a prompt frees a slot quickly under normal load; + // a client that backs off too aggressively wastes capacity. + res.set('Retry-After', '5'); + res.status(503).json({ + error: err.message, + code: 'session_limit_exceeded', + limit: err.limit, + }); + return; + } // 5xx is the kind of error operators need to see in their daemon log // — bridge ENOMEM, agent stack trace, unexpected throw, etc. Without // logging here every 500 disappears once the caller consumes the diff --git a/packages/cli/src/serve/types.ts b/packages/cli/src/serve/types.ts index d55bcb3bd..0365bd2d0 100644 --- a/packages/cli/src/serve/types.ts +++ b/packages/cli/src/serve/types.ts @@ -25,6 +25,17 @@ export interface ServeOptions { */ token?: string; mode: ServeMode; + /** + * Cap on concurrent live sessions. Once `bridge.sessionCount` reaches + * this, new `POST /session` requests that would spawn fresh sessions + * return 503. Attaching to an existing session (same workspace under + * `sessionScope: 'single'`) still works — so an idle daemon doesn't + * block reconnects from existing users. Defaults to 20: comfortably + * above single-user usage, well below the design's N≈50 cliff where + * per-session RSS (~30–50 MB) and FD pressure start to bite. Set to + * `0` or `Infinity` to disable. + */ + maxSessions?: number; } /**