feat(core): PR-2.5 — post-promote stream redirect + natural-exit registry settle (#3831 follow-up) (#4102)

* feat(core): PR-2.5 — post-promote stream redirect + natural-exit registry settle

Closes the two limitations PR-2 (#3894) deferred for the Phase D part
(b) Ctrl+B promote flow (#3831):

1. **Post-promote stream redirect**: today the `bg_xxx.output` file
   is frozen at promote time because `ShellExecutionService` detaches
   its data listener as part of PR-1's ownership-transfer contract.
   PR-2.5 wires a caller-side `onPostPromoteData` callback so bytes
   from the still-running child append to the file via an
   `fs.createWriteStream` opened in `handlePromotedForeground`.
2. **Natural-exit registry settle**: today the registry entry stays
   `'running'` until `task_stop` / session-end `abortAll` fires its
   abort listener. PR-2.5 wires `onPostPromoteSettle` so natural
   child exit transitions the entry to `'completed'` / `'failed'`
   with the right exitCode / signal / error message.

- New exported types: `ShellExecuteOptions`, `ShellPostPromoteHandlers`,
  `ShellPostPromoteSettleInfo`.
- `execute()` options bag now accepts `postPromote?: { onData, onSettle }`.
  Threaded through to both `executeWithPty` and `childProcessFallback`.
- PTY's `performBackgroundPromote` (line ~1159): after disposing
  the foreground data + exit + error listeners, RE-ATTACH minimal
  forwarders that call `postPromote.onData` / `postPromote.onSettle`
  when the caller opted in. Backwards compat: when `postPromote` is
  unset the PR-2 detach-everything contract is preserved (the
  re-attach is gated on each callback being defined).
- `childProcessFallback`'s `performBackgroundPromote` (line ~706):
  same pattern — re-attach `stdout.on('data', ...)`, `stderr.on('data',
  ...)`, `child.once('exit', ...)`, `child.once('error', ...)` when
  the caller opted in. `error` listener routes through `onSettle`
  with `error` populated, so spawn-side errors after the foreground
  errorHandler detached don't crash the daemon via the default
  unhandled `'error'` event.
- Both paths wrap caller callbacks in try/catch so a thrown handler
  doesn't crash the child's data loop / unhandled-rejection the
  service.

- New `PromoteArtifacts` type — slots shared between the foreground
  `execute()` postPromote handlers (which fire on the service side
  as soon as promote happens) and the post-resolve
  `handlePromotedForeground` finalizer (which runs after
  `await resultPromise` returns). The two race; the buffer +
  settle-queue absorb that race so neither chunks nor the eventual
  exit info are lost.
- `executeForeground` wires `postPromote` handlers that route data
  to either `promoteArtifacts.stream` (if open) or
  `promoteArtifacts.buffer` (drained when the stream opens), and
  queue settle info if the wired handler isn't yet installed.
- `handlePromotedForeground` opens `fs.createWriteStream(outputPath,
  { flags: 'w' })`, writes the initial snapshot first, drains the
  buffer, then registers the entry and wires `onSettleWired` with
  the full registry decision table:
    - `error` set → `registry.fail(shellId, error.message, endTime)`
    - `exitCode === 0` → `registry.complete(shellId, 0, endTime)`
    - non-zero exitCode → `registry.fail(shellId, "Exited with code N", endTime)`
    - signal !== null → `registry.fail(shellId, "Terminated by signal N", endTime)`
    - all-null fallback → `registry.fail(shellId, "Exited with unknown status", endTime)`
- Fires queued settle synchronously after wiring so a fast command
  that exits between promote and finalizer doesn't get lost.
- Self-audit catch: closes the output stream on the
  `registry.register` throw path so the FD doesn't leak past the
  orphan-child kill.

- 3 new in `shellExecutionService.test.ts`:
  - `post-promote bytes route to postPromote.onData when callback provided`
  - `postPromote.onSettle fires on natural child exit after promote`
  - `backwards compat: without postPromote, listeners stay fully detached`
- 3 new in `shell.test.ts` under a `foreground → background promote
  PR-2.5` describe block:
  - `post-promote bytes APPEND to bg_xxx.output via write stream`
  - `natural child exit transitions registry entry to "completed"`
  - `non-zero exit / signal / error → "failed" with descriptive message`
- Bulk-replaced 50 prior `{},` (empty 6th-arg shellExecutionConfig)
  with `expect.objectContaining({}),` + added `expect.objectContaining({
  postPromote: expect.any(Object) }),` as the 7th-arg expectation for
  the foreground execute call.
- Updated the existing `registers a bg_xxx entry on result.promoted`
  test to assert on `fs.createWriteStream` + `stream.write` instead
  of the now-removed `fs.writeFileSync` snapshot path.

182/182 shell.test.ts pass + 73/73 shellExecutionService.test.ts pass
+ 111/111 coreToolScheduler.test.ts pass + 60/60 AppContainer.test.tsx
pass; tsc + ESLint clean.

Self-audit: 3 rounds (positive / reverse / cross-file) found one
issue — output stream FD leak on `registry.register` throw — and
fixed it before flagging complete. All flagged edge cases (stream
errors, child-exits-before-wire-up race, task_stop during natural-
exit window, promote-never-happens cleanup, backwards compat
without callbacks) have explicit handling and / or test pinning.

* fix(core): #4102 review wave — 3 Critical + UTF-8 + tests

3 Critical race/correctness issues + 1 multibyte-corruption suggestion
+ 3 test coverage gaps addressed:

**Critical 1 — child_process late-chunk drop (service)**
Settle was fired on 'exit', but stdout/stderr can emit buffered data
between 'exit' and 'close'. Late chunks landed in
`promoteArtifacts.buffer` after shell.ts had already closed the
stream + transitioned the registry → silently dropped → truncated
`bg_xxx.output`. Switched to listening on 'close' which guarantees
all stdio is fully drained. (code, signal) payload is identical to
'exit', just with proper ordering.

**Critical 2 — stream-flush wait before registry transition (shell)**
`stream.end()` is asynchronous; pending writes can still be in the
libuv queue when it returns. The old code transitioned the registry
immediately after `.end()`, so a /tasks consumer could observe a
`completed` entry and read the output file BEFORE the trailing
bytes were on disk. Fixed: wired settle now `stream.once('finish',
...)` BEFORE calling `registry.complete/fail`. `error` event also
short-circuits to the transition so a late ENOSPC doesn't hang the
settle path forever.

**Critical 3 — stream-open-fail buffer leak (shell)**
If `fs.createWriteStream` threw, the catch path set `stream = null`
but the foreground `onData` handler would still take the
`stream === null` branch and push chunks into `promoteArtifacts.buffer`
— unbounded growth under a sustained child whose output file
couldn't be opened. Added a `streamFailed: boolean` latch on
`PromoteArtifacts`. When set, `onData` drops chunks (with a debug
log) instead of buffering. The catch branch sets the latch.

**Suggestion — shared TextDecoder corrupts multibyte UTF-8 (service)**
child_process post-promote used ONE TextDecoder for both stdout AND
stderr. The decoder's continuation-byte state machine assumes one
byte source; interleaved multibyte chunks corrupted. Now uses
separate decoders + flushes both with `decode()` (no `stream: true`)
on settle so trailing bytes surface as their final characters.

**Suggestion — llmContent reflects already-settled status (shell)**
When the queued-settle drain transitions the registry synchronously
(fast-exit race), the model-facing copy was still saying "Status:
running. … task_stop({...})". Updated to branch on
`postPromoteAlreadySettled` / `postPromoteFinalStatus` — when the
process is already gone, the copy says "Status: completed/failed"
and replaces the `task_stop` suggestion with "Process has already
exited; no `task_stop` needed".

**Suggestion — test coverage gaps**
Added: (a) `queued-settle race: onSettle BEFORE
handlePromotedForeground completes` — custom service impl fires
onSettle synchronously before resolving the promote promise, pins
the drain path. (b) child_process post-promote tests for stdout/stderr
forwarding + 'close'-not-'exit' settle + spawn-error settle.

**Self-audit**: Round 1 + reverse audit. Stream.once mock added to
fire 'finish' synchronously so existing tests don't hang on the new
flush wait. 76/76 shellExecutionService.test.ts (+3) + 183/183
shell.test.ts (+1) pass; tsc + ESLint clean.

* fix(core): #4102 review wave-2 — 3 more

C1 (shell.ts:2227): the WriteStream `'error'` event handler only
logged. `fs.createWriteStream` reports common open failures
(ENOENT / EACCES / ENOSPC) asynchronously via that event rather
than throwing. Result: `promoteArtifacts.stream` kept pointing at
the failed stream; `onSettleWired` attached a `.once('finish')`
listener that would never fire → registry stuck on `running`
forever. Latch the failure (null the shared `stream` slot,
set `streamFailed`); `onSettleWired`'s existing `if (!stream)`
branch then transitions the registry immediately.

C2 (shellExecutionService.ts:1468): the promote handoff removes the
foreground `ptyErrorHandler` and only re-attaches data + exit
listeners. A subsequent PTY `error` event had no listener — Node
treats an unhandled `error` from an EventEmitter as a fatal
exception that takes the whole CLI down. Attach a post-promote
forwarder that ignores expected PTY read-exit codes (EIO / EAGAIN,
same filter the foreground handler uses) and routes unexpected
errors through `postPromote.onSettle` with `error` populated.
Single-fire latch shared with `onExit` so settle never fires twice.

C3 (shell.ts:2503): `onSettleWired` waits for the stream's
asynchronous `'finish'` event before flipping
`postPromoteAlreadySettled`, but the model-facing `statusLine` was
built immediately after invoking `onSettleWired` on the queued
settle. A fast-exited promoted command could therefore land
"Status: running" + a `task_stop` instruction in production even
though settle was already observed. Split into two flags:
`postPromoteSettleObserved` (set synchronously when settle is
classified) drives the model copy; the registry transition stays
behind the stream flush.

Tests: +1 PR-2.5 wave-2 PTY error-routing test; +2 shell.ts tests
(stream open async error → registry still transitions; async
`'finish'` after queued-settle drain → llmContent says 'completed'
before registry transition fires).

* fix(core): #4102 review wave-3 — 4 actionable

T2 (shell.ts:2456) — Critical buffer-leak race
`onSettleWired` previously set `promoteArtifacts.stream = null`
BEFORE calling `stream.end()`. Any `postPromote.onData` chunk that
landed between that null assignment and the actual flush completing
saw `stream === null && streamFailed === false` and pushed into
`promoteArtifacts.buffer` — a buffer that has no further drain path
(the foreground finalizer has already returned). Result: chunks
stranded indefinitely; PTY mode in particular hits this because
`onExit` can fire while kernel buffers still hold data. Fix drains
the pre-settle buffer to the stream BEFORE nulling AND latches
`streamFailed = true` so any subsequent chunk drops via the
existing `else if (streamFailed)` arm in `onData` instead of
leaking. Updates the `streamFailed` doc to cover both setters
(open-fail and settle-done) so the dual semantic is explicit.

T3 (shell.ts:2262) — silent chunk-drop in catch path
When `fs.createWriteStream` throws synchronously (rare: ENOENT on
a vanished tmpdir), chunks already in `promoteArtifacts.buffer`
were silently lost with no observability — oncall reading a
truncated `bg_xxx.output` had no way to distinguish "stream open
failed" from "child produced nothing." Logs the dropped chunk
count and empties the buffer.

T5 (shell.ts:2443) — opaque all-null fallback
The "Exited with unknown status" fallback fired the registry to
'failed' without any context about which fields were null. This
branch is meant to be unreachable; hitting it indicates the
service emitted a defective settle info object. Includes the
field values in both the fail message and a warn log so the
oncall engineer can tell this path apart from the other "failed"
branches.

T6 (shellExecutionService.ts:1452) — leaked PTY post-promote listeners
`ptyProcess.onData(...)` returns an `IDisposable` that was being
discarded; same for `onExit`. The `'error'` listener function was
also not captured (no way to `removeListener` it). EventEmitter
holds refs to listener closures, which transitively hold refs to
`onPostData` / `onPostSettle` / the caller's `promoteArtifacts`.
While bounded by the PTY's lifetime, the closures keep the
caller's state pinned for the post-settle delay window. Captures
all three handles into `postPromoteDataDisposable` /
`postPromoteExitDisposable` / `postPromoteErrorListener`, then
releases them via a shared `disposePostPromoteListeners()` call
from `firePostSettle` (idempotent — each slot null-checked and
nulled after disposal).

Tests: +1 service test for IDisposable + error-listener cleanup;
+2 shell.ts tests for buffer drain race and catch-path snapshot
fallback. Existing tests stay green (262 → 265 in the touched
suites; 7819 → 7822 across the core package).

* fix(core/test): drop unused 'registry' in wave-3 T2 test (TS6133)

CI build failed across all platforms with src/tools/shell.test.ts(4395,15): error TS6133. The variable was a leftover from copying the queued-settle test pattern; the wave-3 T2 test inspects writeStreamMock.write call history directly and never reads the registry, so the assignment is dead code. Drop it.

* fix(core): #4102 review wave-4 — 6 actionable

T1 (Critical, shellExecutionService.ts:860 child_process onSettle
exactly-once)
The PTY path used a `firePostSettle` latch but child_process wired
`close` and `error` independently to `onPostSettle`. A spawn-side
error followed by Node's auto-emitted `'close'` would call the
caller's settle TWICE, racing the registry transition. Added the
same single-fire latch on the child_process path.

T2 (Critical, shell.ts:2264 handoff race reorder)
Original order was `write(snapshot) -> drain buffer -> assign stream`.
Synchronous today (no race in current code), but assign-after-drain
leaves a hazard for any future refactor that adds an `await` inside
the drain loop — a chunk arriving in that window would land in
`promoteArtifacts.buffer`, then post-assign chunks would write to
the stream first, producing out-of-order bytes until the settle
drain. Reordered to `write(snapshot) -> assign stream -> drain
buffer`, which closes the hazard regardless of future async
additions.

T3 (Suggestion, shellExecutionService.ts:816 decoder flush gated
on onSettle)
The trailing-multibyte flush ran inside the `child.once('close', ...)`
handler, which was only installed when `onSettle` was set. An
`onData`-only caller (no onSettle) lost trailing continuation
bytes silently. Hoisted flush into `flushPostPromoteDecoders`
called from `firePostSettle`, and made `firePostSettle` available
on the `'close'` path independent of onSettle (T6 install).

T4 (Suggestion, shell.ts:1700 promoted ANSI passthrough)
The regular `executeBackground` path strips ANSI before writing to
`bg_xxx.output`; the promoted-foreground onData path appended raw
chunks. Reading `bg_xxx.output` after Ctrl+B showed plain text up
to the snapshot then raw `\x1b[31m` / cursor-move / clear-screen
sequences for the post-promote tail — unreadable. Apply
`stripAnsi(rawChunk)` before write/buffer, matching the
executeBackground contract.

T5 (Suggestion, shellExecutionService.ts:786 UTF-8 hardcoded)
The post-promote child_process decoders were hard-coded to
`new TextDecoder('utf-8')`, but the foreground decoder runs
encoding detection via `getCachedEncodingForBuffer`. On a non-UTF-8
child (e.g. GBK on a Chinese Windows shell), the snapshot decoded
correctly but the post-promote tail was mojibake. Capture the
foreground decoder's `.encoding` property and reuse it for
post-promote (with utf-8 fallback if foreground hadn't seen any
bytes yet, and a try/catch around `new TextDecoder` for the rare
unsupported-encoding case).

T6 (Suggestion, shellExecutionService.ts:1540 `error` listener
gated on onSettle)
The post-promote `error` listener was attached only when `onSettle`
was set. An `onData`-only caller still had the foreground
errorHandler detached; a post-promote spawn error would then crash
the CLI via Node's unhandled-error default. Hoisted the close +
error listeners into `if (postPromote)` so any caller opting into
post-promote gets crash protection; if `onSettle` is absent the
listeners log + drop instead of routing.

T7 (Suggestion, shellExecutionService.ts:791 onSettle-only
pipe-block deadlock)
Same root cause as T6: when only `onSettle` is set, the foreground
`stdout`/`stderr` 'data' listeners are detached and no post-promote
listener replaces them. The Readables stay paused, the OS pipe
buffer fills (~64KB on Linux), the child blocks on `stdout.write`,
'close' never fires, onSettle never fires. Added `child.stdout?.resume()`
and `child.stderr?.resume()` in the no-onData branch so the child
can drain its pipes and reach exit.

T8 (Suggestion, shell.ts:2614 dead inspectLine ternary)
`inspectLine`'s ternary returned the same string on both sides —
copy-paste leftover from when the other two adjacent ternaries
(statusLine / stopLine) were correctly varied. Collapsed to a
single string assignment.

Tests: +5 regression tests (4 child_process: T1 double-fire latch,
T3 onData-only flush, T6 onData-only error survives, T7 onSettle-
only resume; +1 shell.ts: T4 ANSI strip).

265 -> 270 in the touched suites; 7822 -> 7827 across the core
package; full suite green.

* fix(core/test): use ShellOutputEvent type in wave-4 onData callbacks (TS2345)

CI lint failed on the wave-4 (T3 / T6) tests with TS2345: pushing
ShellOutputEvent into Array<{type:string;chunk:unknown}> narrows
incompatibly. Switch to ShellOutputEvent[] (matches earlier helpers
at lines 758/966) and discriminate the union via .type === 'data'
when reading .chunk so the narrowed multibyte assertion still
type-checks.

* fix(core): address PR #4102 review — PTY error guard, flush timeout, diagnostic marker, failed-settle test

- Move PTY post-promote error listener from `if (postPromote?.onSettle)` to
  `if (postPromote)` to match child_process path and prevent unhandled error
  crashes for onData-only callers
- Add 10s flush timeout in onSettleWired so stalled streams don't leave
  registry entries stuck on 'running' forever
- Append diagnostic marker to output file on stream error so truncation
  is visible without debug logging
- Add queued-settle test with exitCode:1 asserting 'Status: failed.' in
  llmContent

* fix(core): address PR #4102 review — align PTY/child_process guards, add flush timeout, diagnostic marker, and tests

- Widen PTY post-promote onExit + error listener guard from
  `if (postPromote?.onSettle)` to `if (postPromote)` to match
  child_process path — prevents unhandled error crash and listener
  leak for onData-only callers
- Add 10s flush timeout in onSettleWired so stalled streams don't
  leave registry entries stuck on 'running' indefinitely
- Append diagnostic marker to output file on stream error so
  truncation is visible without debug logging
- Remove model name references from code comments
- Add tests: PTY onData-only error/exit, flush timeout fallback,
  appendFileSync diagnostic marker, queued-settle with failed exit code

* fix(core): address PR #4102 review round 2 — listener cleanup, rename, constant hoist

- Fix expect.objectContaining({}) misused as runtime arg in 2 execute() call sites
- Add child_process post-promote stdout/stderr listener cleanup in firePostSettle
- Rename streamFailed → streamClosed to reflect its overloaded semantics
- Hoist FLUSH_TIMEOUT_MS to module-level PROMOTE_FLUSH_TIMEOUT_MS constant
- Fix dangling FLUSH_TIMEOUT_MS reference (was undefined at runtime)
- Add Windows note to streams pause/resume comment
- Document PTY onData dispose-before-settle as known limitation
This commit is contained in:
Shaojin Wen 2026-05-17 17:57:08 +08:00 committed by GitHub
parent cc32ef2ff9
commit 0240c310fd
No known key found for this signature in database
GPG key ID: B5690EEEBB952194
4 changed files with 2490 additions and 81 deletions

View file

@ -19,7 +19,9 @@ import { type ChildProcess } from 'node:child_process';
import pkg from '@xterm/headless';
import type {
ShellAbortReason,
ShellExecuteOptions,
ShellOutputEvent,
ShellPostPromoteSettleInfo,
} from './shellExecutionService.js';
import {
getShellAbortReasonKind,
@ -266,6 +268,7 @@ describe('ShellExecutionService', () => {
ac: AbortController,
) => void,
config = shellExecutionConfig,
options: ShellExecuteOptions = {},
) => {
const abortController = new AbortController();
const handle = await ShellExecutionService.execute(
@ -275,6 +278,7 @@ describe('ShellExecutionService', () => {
abortController.signal,
true,
config,
options,
);
await new Promise((resolve) => process.nextTick(resolve));
@ -745,6 +749,283 @@ describe('ShellExecutionService', () => {
expect(exitDisposableStub.dispose).toHaveBeenCalled();
});
it('PR-2.5: post-promote bytes route to postPromote.onData when callback provided', async () => {
// Pin the new opt-in contract: when `postPromote.onData` is set,
// bytes the still-running PTY emits after promote go to the
// caller's handler instead of being lost. PR-2 fully detached
// listeners; PR-2.5 re-attaches a minimal forwarder when the
// caller opts in.
const onDataCalls: ShellOutputEvent[] = [];
const { result } = await simulateExecution(
'tail -f /tmp/never.log',
(pty, ac) => {
ac.abort({
kind: 'background',
shellId: 'bg_pr25_data',
} satisfies ShellAbortReason);
},
shellExecutionConfig,
{
postPromote: {
onData: (event) => onDataCalls.push(event),
},
},
);
expect(result.promoted).toBe(true);
// After promote, drive a fresh post-promote chunk through the
// PTY's onData. The service should have attached a NEW listener
// (the foreground one is disposed); look at the latest
// mock.calls entry — index 1 since PR-2.5 adds a second.
const onDataRegistrations = mockPtyProcess.onData.mock.calls;
expect(onDataRegistrations.length).toBeGreaterThanOrEqual(2);
const postPromoteHandler =
onDataRegistrations[onDataRegistrations.length - 1][0];
postPromoteHandler('post-promote-byte-stream');
expect(onDataCalls).toEqual([
{ type: 'data', chunk: 'post-promote-byte-stream' },
]);
});
it('PR-2.5: postPromote.onSettle fires on natural child exit after promote', async () => {
// Pin the natural-exit settle: when the child terminates AFTER
// promote, the caller's onSettle handler is invoked exactly
// once with the exit code (or signal / error). PR-2 detached
// the exit listener entirely; PR-2.5 re-attaches a forwarder
// when the caller opts in.
const settleCalls: ShellPostPromoteSettleInfo[] = [];
const { result } = await simulateExecution(
'long-running-command',
(pty, ac) => {
ac.abort({
kind: 'background',
shellId: 'bg_pr25_settle',
} satisfies ShellAbortReason);
},
shellExecutionConfig,
{
postPromote: {
onSettle: (info) => settleCalls.push(info),
},
},
);
expect(result.promoted).toBe(true);
// After promote, drive the PTY's onExit to simulate natural
// completion. The service attaches a new exit listener for
// post-promote settle — find the most-recently-registered.
const onExitRegistrations = mockPtyProcess.onExit.mock.calls;
expect(onExitRegistrations.length).toBeGreaterThanOrEqual(2);
const postPromoteExitHandler =
onExitRegistrations[onExitRegistrations.length - 1][0];
postPromoteExitHandler({ exitCode: 0, signal: undefined });
expect(settleCalls).toHaveLength(1);
expect(settleCalls[0].exitCode).toBe(0);
expect(settleCalls[0].signal).toBeNull();
expect(settleCalls[0].error).toBeUndefined();
expect(typeof settleCalls[0].endTime).toBe('number');
});
it('PR-2.5 wave-2 (C2): unexpected post-promote PTY error routes to onSettle as failure (does NOT crash the CLI)', async () => {
// Foreground PTY error handler removed at promote handoff. Before
// the wave-2 fix the post-promote path attached NO error listener,
// so an unhandled `error` event would take Node down. Now we
// attach a forwarder: unexpected errors flow through onSettle
// with `error` populated; expected PTY read-exit errors
// (EIO / EAGAIN) are filtered.
const settleCalls: ShellPostPromoteSettleInfo[] = [];
const { result } = await simulateExecution(
'long-running-with-error',
(pty, ac) => {
ac.abort({
kind: 'background',
shellId: 'bg_pr25_pty_err',
} satisfies ShellAbortReason);
},
shellExecutionConfig,
{
postPromote: {
onSettle: (info) => settleCalls.push(info),
},
},
);
expect(result.promoted).toBe(true);
// 1. An expected PTY read-exit error (EIO) is FILTERED — onSettle
// is NOT invoked yet (the upcoming onExit will carry status).
mockPtyProcess.emit(
'error',
Object.assign(new Error('read EIO'), { code: 'EIO' }),
);
expect(settleCalls).toHaveLength(0);
// 2. An UNEXPECTED error (EPIPE) routes to onSettle as a failure.
// Critically: emitting must NOT throw (no unhandled `error`).
const unexpectedErr = Object.assign(new Error('disk gone'), {
code: 'EPIPE',
});
expect(() => mockPtyProcess.emit('error', unexpectedErr)).not.toThrow();
expect(settleCalls).toHaveLength(1);
expect(settleCalls[0].error).toBe(unexpectedErr);
expect(settleCalls[0].exitCode).toBeNull();
expect(settleCalls[0].signal).toBeNull();
expect(typeof settleCalls[0].endTime).toBe('number');
// 3. A subsequent onExit MUST NOT fire onSettle again (single-fire
// latch): callers like the registry's `complete`/`fail`
// transitions are not idempotent across status types.
const onExitRegistrations = mockPtyProcess.onExit.mock.calls;
const postPromoteExitHandler =
onExitRegistrations[onExitRegistrations.length - 1][0];
postPromoteExitHandler({ exitCode: 0, signal: undefined });
expect(settleCalls).toHaveLength(1);
});
it('PR-2.5 wave-3 (T6): post-promote IDisposables and error listener are released on settle (no GC roots dangling)', async () => {
// Each promoted PTY child can sit dead for milliseconds while
// the caller's `cancelChild` finalizes. Node's EventEmitter
// holds refs to listener closures, which in turn hold refs to
// `onPostData` / `onPostSettle` / the caller's
// `promoteArtifacts`. Without disposal on settle, those refs
// dangle until the PTY itself is collected. The fix captures
// the IDisposables returned by `onData` / `onExit` AND the
// `'error'` listener function we registered on the EE, then
// releases them when `firePostSettle` fires (no matter which
// path triggers settle).
const removeListenerSpy = vi.spyOn(mockPtyProcess, 'removeListener');
const settleCalls: ShellPostPromoteSettleInfo[] = [];
const { result } = await simulateExecution(
'long-running-disposable',
(pty, ac) => {
ac.abort({
kind: 'background',
shellId: 'bg_pr25_dispose',
} satisfies ShellAbortReason);
},
shellExecutionConfig,
{
postPromote: {
onData: () => {},
onSettle: (info) => settleCalls.push(info),
},
},
);
expect(result.promoted).toBe(true);
// The mocked `mockReturnValue({ dispose: vi.fn() })` reuses the
// SAME disposable object across calls, so foreground +
// post-promote share the same dispose Mock. The foreground
// disposable was already disposed at promote handoff; clear
// the call history so we can assert ONLY on post-settle
// disposal.
const sharedDataDisposable = mockPtyProcess.onData.mock.results[0]
.value as { dispose: Mock };
const sharedExitDisposable = mockPtyProcess.onExit.mock.results[0]
.value as { dispose: Mock };
sharedDataDisposable.dispose.mockClear();
sharedExitDisposable.dispose.mockClear();
removeListenerSpy.mockClear();
// Drive onExit → firePostSettle runs disposePostPromoteListeners.
const onExitRegistrations = mockPtyProcess.onExit.mock.calls;
const postPromoteExitHandler =
onExitRegistrations[onExitRegistrations.length - 1][0];
postPromoteExitHandler({ exitCode: 0, signal: undefined });
expect(settleCalls).toHaveLength(1);
// Post-settle: BOTH disposables released, error listener removed.
expect(sharedDataDisposable.dispose).toHaveBeenCalledTimes(1);
expect(sharedExitDisposable.dispose).toHaveBeenCalledTimes(1);
// The post-promote error listener was attached via
// `ptyProcess.on('error', listener)` and is released via
// `removeListener('error', listener)`. Verify removeListener
// was called on the 'error' channel.
const errorRemoves = removeListenerSpy.mock.calls.filter(
(args: unknown[]) => args[0] === 'error',
);
expect(errorRemoves.length).toBeGreaterThanOrEqual(1);
// Re-driving onExit must NOT re-fire settle (latched) AND
// dispose calls must NOT double-count (idempotent disposal —
// disposePostPromoteListeners nulls the slots after first
// disposal).
postPromoteExitHandler({ exitCode: 0, signal: undefined });
expect(settleCalls).toHaveLength(1);
expect(sharedDataDisposable.dispose).toHaveBeenCalledTimes(1);
expect(sharedExitDisposable.dispose).toHaveBeenCalledTimes(1);
removeListenerSpy.mockRestore();
});
it('PR-2.5: onData-only PTY caller has post-promote error + exit listeners (no crash, listeners disposed on exit)', async () => {
const dataChunks: ShellOutputEvent[] = [];
const { result } = await simulateExecution(
'tail -f /dev/null',
(pty, ac) => {
ac.abort({
kind: 'background',
shellId: 'bg_pty_ondata_only',
} satisfies ShellAbortReason);
},
shellExecutionConfig,
{
postPromote: {
onData: (event) => dataChunks.push(event),
},
},
);
expect(result.promoted).toBe(true);
// Error listener must be installed even without onSettle —
// emitting 'error' on an EventEmitter with no listener throws.
expect(() =>
mockPtyProcess.emit('error', new Error('post-promote pty err')),
).not.toThrow();
// onExit must also be installed so disposePostPromoteListeners
// runs on natural exit (cleaning up data + error listeners).
const onExitRegistrations = mockPtyProcess.onExit.mock.calls;
expect(onExitRegistrations.length).toBeGreaterThanOrEqual(2);
const postPromoteExitHandler =
onExitRegistrations[onExitRegistrations.length - 1][0];
// Simulate natural exit — should dispose listeners without crash.
postPromoteExitHandler({ exitCode: 0 });
});
it('PR-2.5 backwards compat: without postPromote, listeners stay fully detached (no regression on PR-2 contract)', async () => {
// Pin that omitting `postPromote` preserves the PR-2 detach-
// everything contract. The pre-existing post-promote test at
// line ~680 already covers this for the data path; this one
// adds the symmetric guarantee for the exit path — natural
// post-promote exit must NOT invoke any callback the caller
// didn't provide.
const onDataCalls: ShellOutputEvent[] = [];
const onSettleCalls: ShellPostPromoteSettleInfo[] = [];
const { result } = await simulateExecution(
'no-post-promote-handlers',
(pty, ac) => {
ac.abort({
kind: 'background',
shellId: 'bg_pr25_compat',
} satisfies ShellAbortReason);
},
// No options arg → postPromote unset → PR-2 contract.
);
expect(result.promoted).toBe(true);
// Drive both PTY events post-promote.
const onDataRegistrations = mockPtyProcess.onData.mock.calls;
// PR-2 contract: only ONE onData registration (the foreground
// one, now disposed). PR-2.5's re-attach is gated on
// `postPromote.onData` being set, so without it the
// registration count stays at 1.
expect(onDataRegistrations.length).toBe(1);
const onExitRegistrations = mockPtyProcess.onExit.mock.calls;
expect(onExitRegistrations.length).toBe(1);
// Caller-provided handlers were never invoked.
expect(onDataCalls).toHaveLength(0);
expect(onSettleCalls).toHaveLength(0);
});
it('post-exit race: PTY background-promote refuses if process.kill(pid, 0) reports the pid is gone', async () => {
// Mirror of the child_process post-exit race test. The PTY may
// have already exited but our `exitDisposable` (onExit) handler
@ -1210,6 +1491,7 @@ describe('ShellExecutionService child_process fallback', () => {
const simulateExecution = async (
command: string,
simulation: (cp: typeof mockChildProcess, ac: AbortController) => void,
options: ShellExecuteOptions = {},
) => {
const abortController = new AbortController();
const handle = await ShellExecutionService.execute(
@ -1219,6 +1501,7 @@ describe('ShellExecutionService child_process fallback', () => {
abortController.signal,
true,
shellExecutionConfig,
options,
);
await new Promise((resolve) => process.nextTick(resolve));
@ -1556,6 +1839,257 @@ describe('ShellExecutionService child_process fallback', () => {
expect(result.signal).toBeNull();
});
it('PR-2.5 child_process: post-promote stdout/stderr forward to postPromote.onData with SEPARATE decoders', async () => {
// Pin: post-promote bytes from the still-running child route to
// the caller's onData handler. Separate decoders for stdout vs
// stderr — a single shared decoder would corrupt interleaved
// multibyte UTF-8 (the continuation-byte state machine assumes
// one byte source).
mockPlatform.mockReturnValue('linux');
const events: Array<{ type: string; chunk?: string | unknown }> = [];
const { result } = await simulateExecution(
'tail -f',
(cp, ac) => {
ac.abort({
kind: 'background',
shellId: 'bg_cp_data',
} satisfies ShellAbortReason);
// Drive post-promote chunks — should now flow to onData.
cp.stdout?.emit('data', Buffer.from('post-promote-stdout\n'));
cp.stderr?.emit('data', Buffer.from('post-promote-stderr\n'));
},
{
postPromote: {
onData: (event) => events.push(event),
},
},
);
expect(result.promoted).toBe(true);
// Both streams forwarded.
const dataChunks = events
.filter((e) => e.type === 'data')
.map((e) => e.chunk);
expect(dataChunks).toContain('post-promote-stdout\n');
expect(dataChunks).toContain('post-promote-stderr\n');
});
it('PR-2.5 child_process: onSettle fires on `close` (NOT `exit`) so late chunks land before the registry transitions', async () => {
// Pin the `close`-not-`exit` contract: child can emit buffered
// data AFTER 'exit' but BEFORE 'close'. If onSettle fired on
// 'exit' the caller would close the output stream + transition
// the registry while late chunks were still in flight — they'd
// hit a closed stream and be dropped, producing truncated logs.
mockPlatform.mockReturnValue('linux');
const events: Array<{ type: string; chunk?: string | unknown }> = [];
const settles: ShellPostPromoteSettleInfo[] = [];
const { result } = await simulateExecution(
'cmd',
(cp, ac) => {
ac.abort({
kind: 'background',
shellId: 'bg_cp_close',
} satisfies ShellAbortReason);
// Order matters: emit 'exit' first (this would have settled
// PR-1 of PR-2.5 too early), then a final stdout chunk, then
// 'close'. With the new contract, onSettle only fires on
// 'close' so the late chunk is captured.
cp.emit('exit', 0, null);
cp.stdout?.emit('data', Buffer.from('late-chunk\n'));
cp.emit('close', 0, null);
},
{
postPromote: {
onData: (event) => events.push(event),
onSettle: (info) => settles.push(info),
},
},
);
expect(result.promoted).toBe(true);
// Late chunk made it through.
const dataChunks = events
.filter((e) => e.type === 'data')
.map((e) => e.chunk);
expect(dataChunks).toContain('late-chunk\n');
// onSettle fired exactly once with exitCode 0.
expect(settles).toHaveLength(1);
expect(settles[0].exitCode).toBe(0);
expect(settles[0].signal).toBeNull();
});
it('PR-2.5 child_process: post-promote spawn error routes to onSettle with error populated', async () => {
mockPlatform.mockReturnValue('linux');
const settles: ShellPostPromoteSettleInfo[] = [];
const { result } = await simulateExecution(
'cmd',
(cp, ac) => {
ac.abort({
kind: 'background',
shellId: 'bg_cp_err',
} satisfies ShellAbortReason);
cp.emit('error', new Error('post-promote spawn boom'));
},
{
postPromote: {
onSettle: (info) => settles.push(info),
},
},
);
expect(result.promoted).toBe(true);
expect(settles).toHaveLength(1);
expect(settles[0].error?.message).toBe('post-promote spawn boom');
expect(settles[0].exitCode).toBeNull();
expect(settles[0].signal).toBeNull();
});
it('PR-2.5 wave-4 (T1): post-promote `error` followed by `close` fires onSettle EXACTLY ONCE', async () => {
// Regression for the double-fire bug: pre-fix, `child.once('close', ...)`
// and `child.once('error', ...)` were independent and each invoked
// `onPostSettle` directly. A spawn-side error followed by the
// child-process automatic 'close' event would call the caller's
// settle twice, violating the exactly-once contract and racing
// the caller's `transitionRegistry`. Fix wraps both branches in
// a `firePostSettle` latch (mirroring the PTY path).
mockPlatform.mockReturnValue('linux');
const settles: ShellPostPromoteSettleInfo[] = [];
const { result } = await simulateExecution(
'cmd',
(cp, ac) => {
ac.abort({
kind: 'background',
shellId: 'bg_cp_double',
} satisfies ShellAbortReason);
// First: error fires.
cp.emit('error', new Error('error first'));
// Then: close (Node child_process always emits 'close' even
// after an error). Pre-fix this would call onSettle a second
// time.
cp.emit('close', 1, null);
},
{
postPromote: {
onSettle: (info) => settles.push(info),
},
},
);
expect(result.promoted).toBe(true);
expect(settles).toHaveLength(1);
expect(settles[0].error?.message).toBe('error first');
});
it('PR-2.5 wave-4 (T3): onData-only caller still gets decoder flush on close (no trailing multibyte loss)', async () => {
// T3 regression: the close handler used to be installed only
// when `onSettle` was set, so an `onData`-only caller never got
// the trailing-multibyte flush — a UTF-8 character split across
// chunks could vanish. Fix installs close whenever ANY
// postPromote handler is set, and the flush helper runs whenever
// onData is set independent of onSettle.
mockPlatform.mockReturnValue('linux');
const dataChunks: ShellOutputEvent[] = [];
const { result } = await simulateExecution(
'cmd',
(cp, ac) => {
ac.abort({
kind: 'background',
shellId: 'bg_cp_t3',
} satisfies ShellAbortReason);
// Push the FIRST byte of a 3-byte UTF-8 char (€ = 0xE2 0x82 0xAC).
// Without flush, the trailing two bytes would be stuck in the
// decoder's continuation state and lost.
cp.stdout?.emit('data', Buffer.from([0xe2]));
cp.stdout?.emit('data', Buffer.from([0x82, 0xac]));
// Trigger close so the flush runs; no onSettle to gate on.
cp.emit('close', 0, null);
},
{
postPromote: {
onData: (event) => dataChunks.push(event),
// NO onSettle — close handler must still fire flush.
},
},
);
expect(result.promoted).toBe(true);
// The € character should appear once the second chunk completes
// the multibyte sequence; flush at close ensures any remainder
// is surfaced.
const joined = dataChunks
.map((d) =>
d.type === 'data' && typeof d.chunk === 'string' ? d.chunk : '',
)
.join('');
expect(joined).toContain('€');
});
it('PR-2.5 wave-4 (T6): onData-only caller has post-promote `error` listener (does not crash CLI)', async () => {
// T6 regression: `child.once('error', ...)` install was gated
// on `onSettle`, so an `onData`-only caller had the foreground
// errorHandler detached at promote with no replacement — a
// post-promote spawn error would surface as Node's default
// unhandled-error crash. Fix attaches an error listener
// whenever ANY postPromote handler is set.
mockPlatform.mockReturnValue('linux');
const dataChunks: ShellOutputEvent[] = [];
const { result } = await simulateExecution(
'cmd',
(cp, ac) => {
ac.abort({
kind: 'background',
shellId: 'bg_cp_t6',
} satisfies ShellAbortReason);
// Emitting 'error' on an EventEmitter with no listener throws
// synchronously. With the fix, our listener is attached so
// the emit does not throw.
expect(() =>
cp.emit('error', new Error('post-promote err')),
).not.toThrow();
// child_process auto-emits 'close' after 'error'.
cp.emit('close', null, null);
},
{
postPromote: {
onData: (event) => dataChunks.push(event),
// NO onSettle — but error must still be handled (no crash).
},
},
);
expect(result.promoted).toBe(true);
});
it('PR-2.5 wave-4 (T7): onSettle-only caller has stdout/stderr resumed (child does not block on full pipes)', async () => {
// T7 regression: when `onSettle` is set but `onData` is NOT, the
// post-promote path used to leave stdout/stderr without any data
// listener. The Readables stay paused; the OS pipe buffer fills
// (~64KB on Linux); the child blocks on stdout.write; 'close'
// never fires; onSettle never fires. Fix calls .resume() on
// both streams in the no-onData branch so the child can drain.
mockPlatform.mockReturnValue('linux');
const settles: ShellPostPromoteSettleInfo[] = [];
const stdoutResumeSpy = vi.fn();
const stderrResumeSpy = vi.fn();
const { result } = await simulateExecution(
'cmd',
(cp, ac) => {
// Patch resume() so we can verify the wire was driven.
if (cp.stdout) cp.stdout.resume = stdoutResumeSpy;
if (cp.stderr) cp.stderr.resume = stderrResumeSpy;
ac.abort({
kind: 'background',
shellId: 'bg_cp_t7',
} satisfies ShellAbortReason);
cp.emit('close', 0, null);
},
{
postPromote: {
// NO onData — but stdout/stderr must still be resumed.
onSettle: (info) => settles.push(info),
},
},
);
expect(result.promoted).toBe(true);
expect(stdoutResumeSpy).toHaveBeenCalled();
expect(stderrResumeSpy).toHaveBeenCalled();
expect(settles).toHaveLength(1);
});
it('should gracefully attempt SIGKILL on linux if SIGTERM fails', async () => {
mockPlatform.mockReturnValue('linux');
vi.useFakeTimers();

View file

@ -177,6 +177,63 @@ export interface ShellExecutionConfig {
disableDynamicLineTrimming?: boolean;
}
/**
* Optional caller-side handlers for the *post-promote* lifetime of a
* background-promoted child process. PR-2 (#3894) detached every
* service-side listener at promote time and froze `result.output` at
* the snapshot; without these hooks the still-running child's bytes
* are lost and the registry entry stays `'running'` until `task_stop`
* / session-end cleanup. PR-2.5 (#3831 follow-up) wires shell.ts to
* pass these so promoted shells behave like regular background shells:
* bytes append to `bg_xxx.output` and the entry transitions to
* `'completed'` / `'failed'` on natural child exit.
*
* Backwards compat: if `postPromote` is unset on the options bag the
* service falls back to the PR-2 detach-everything contract no
* regressions for callers that don't opt in.
*/
export interface ShellPostPromoteHandlers {
/**
* Fired for each output chunk the still-running child produces
* AFTER `result.promoted` resolves. Same `ShellOutputEvent` shape
* the foreground stream uses so callers can reuse rendering logic;
* `binary_detected` / `binary_progress` are NOT re-emitted (those
* decisions were made pre-promote against the same byte stream).
*/
onData?: (event: ShellOutputEvent) => void;
/**
* Fired exactly once when the post-promote child settles natural
* exit (`exitCode` set, `signal: null`), signal kill (`exitCode:
* null`, `signal` set), or spawn-side error (`error` set). NOT
* fired for the promote-time resolve itself (that's the
* `result.promoted` Promise resolution). Callers wire this to the
* registry's `complete` / `fail` transitions.
*/
onSettle?: (info: ShellPostPromoteSettleInfo) => void;
}
export interface ShellPostPromoteSettleInfo {
exitCode: number | null;
signal: number | NodeJS.Signals | null;
error?: Error;
/** `Date.now()` at the moment the service observed the exit/error. */
endTime: number;
}
/**
* Options bag for `ShellExecutionService.execute()`. Kept as an
* interface (rather than the prior inline shape) so future additions
* land without breaking signatures.
*/
export interface ShellExecuteOptions {
streamStdout?: boolean;
/**
* Post-promote callback hooks. See {@link ShellPostPromoteHandlers}.
* Optional; omit to preserve the PR-2 detach-everything contract.
*/
postPromote?: ShellPostPromoteHandlers;
}
/**
* Describes a structured event emitted during shell command execution.
*/
@ -430,7 +487,7 @@ export class ShellExecutionService {
abortSignal: AbortSignal,
shouldUseNodePty: boolean,
shellExecutionConfig: ShellExecutionConfig,
options: { streamStdout?: boolean } = {},
options: ShellExecuteOptions = {},
): Promise<ShellExecutionHandle> {
if (shouldUseNodePty) {
const ptyInfo = await getPty();
@ -443,6 +500,7 @@ export class ShellExecutionService {
abortSignal,
shellExecutionConfig,
ptyInfo,
options.postPromote,
);
} catch (_e) {
// Fallback to child_process
@ -456,6 +514,7 @@ export class ShellExecutionService {
onOutputEvent,
abortSignal,
options.streamStdout ?? false,
options.postPromote,
);
}
@ -465,6 +524,7 @@ export class ShellExecutionService {
onOutputEvent: (event: ShellOutputEvent) => void,
abortSignal: AbortSignal,
streamStdout: boolean,
postPromote?: ShellPostPromoteHandlers,
): ShellExecutionHandle {
try {
const isWindows = os.platform() === 'win32';
@ -702,6 +762,229 @@ export class ShellExecutionService {
const combined =
snapStdout +
(snapStderr ? (snapStdout ? separator : '') + snapStderr : '');
// PR-2.5: re-attach post-promote listeners that forward to the
// caller's handlers. Attach AFTER `detachServiceListeners()`
// so we don't double-up on stdout/stderr 'data' events with
// the foreground listeners that just got removed; attach
// BEFORE `resolve()` so a sub-millisecond data burst right
// after promote still lands on the caller. The new listeners
// are direct stdout/stderr listeners (not service-managed) —
// ownership is the caller's from this point. We also attach
// a fresh exit listener (the foreground exitHandler is also
// detached by detachServiceListeners) so the caller can
// settle the registry entry on natural child exit. When
// postPromote is undefined we fall back to the PR-2 detach-
// everything contract: no listeners re-attach.
// PR-2.5 wave-4: preserve the detected encoding
// from the foreground decoders so a non-UTF-8 child (e.g.
// GBK on a Chinese Windows shell) doesn't snapshot correctly
// and then mojibake the post-promote tail. The foreground
// `stdoutDecoder` / `stderrDecoder` are initialized in
// `handleOutput` from `getCachedEncodingForBuffer(data)` on
// the first chunk; if they're still null at promote time
// (no bytes yet), fall back to `'utf-8'`. Capture the
// detected encoding rather than the decoder instance — the
// foreground decoder has already seen pre-promote bytes
// (its multibyte state machine is at an arbitrary midpoint)
// and may have accumulated continuation-byte state that the
// post-promote stream shouldn't inherit; new instances with
// the same `encoding` start fresh.
const detectedEncoding = stdoutDecoder?.encoding ?? 'utf-8';
// SEPARATE decoders for stdout and stderr. A single shared
// decoder corrupts interleaved multibyte UTF-8 (the streaming
// state machine assumes one byte source); independent
// decoders preserve each stream's continuation-byte context.
// Both decoders are flushed (with `stream: false`) once the
// child has fully closed so any trailing multibyte bytes
// surface instead of being silently dropped.
//
// PR-2.5 wave-4: allocate decoders whenever
// `onData` is set (not gated on close-handler installation),
// because the close handler now ALWAYS installs when any
// postPromote handler is present (T6 + T7) and needs to
// flush these decoders if onData is set, regardless of
// whether onSettle is set.
const safeDecoder = (encoding: string): TextDecoder => {
try {
return new TextDecoder(encoding, { fatal: false });
} catch {
// Defensive: if the detected encoding string is somehow
// not supported by Node's ICU (extremely rare on modern
// Node), fall back to utf-8 rather than throwing inside
// the promote handoff path.
return new TextDecoder('utf-8', { fatal: false });
}
};
const postPromoteStdoutDecoder = postPromote?.onData
? safeDecoder(detectedEncoding)
: null;
const postPromoteStderrDecoder = postPromote?.onData
? safeDecoder(detectedEncoding)
: null;
let postPromoteStdoutHandler: ((chunk: Buffer) => void) | null = null;
let postPromoteStderrHandler: ((chunk: Buffer) => void) | null = null;
if (postPromote?.onData) {
const onPostData = postPromote.onData;
const safeData = (decoder: TextDecoder) => (chunk: Buffer) => {
try {
onPostData({
type: 'data',
chunk: decoder.decode(chunk, { stream: true }),
});
} catch (cbErr) {
debugLogger.warn(
`postPromote.onData threw: ${cbErr instanceof Error ? cbErr.message : String(cbErr)}`,
);
}
};
try {
if (postPromoteStdoutDecoder) {
postPromoteStdoutHandler = safeData(postPromoteStdoutDecoder);
child.stdout?.on('data', postPromoteStdoutHandler);
}
if (postPromoteStderrDecoder) {
postPromoteStderrHandler = safeData(postPromoteStderrDecoder);
child.stderr?.on('data', postPromoteStderrHandler);
}
} catch (e) {
debugLogger.warn(
`re-attaching post-promote data listeners threw: ${e instanceof Error ? e.message : String(e)}`,
);
}
} else if (postPromote) {
// PR-2.5 wave-4: caller asked for `onSettle`
// (or any other future postPromote handler) without
// `onData`. The foreground stdout/stderr listeners were
// detached above; without ANY data listener the Readable
// streams stay paused (on Windows they may already be
// flowing — `resume()` is a no-op in that case), the OS
// pipe buffer fills (~64KB on Linux), and
// `child.stdout.write` in the child blocks —
// potentially forever. `'close'` then never fires and
// `onSettle` is never called. `.resume()` puts the stream
// back in flowing mode (data arrives + is dropped) so the
// child can drain its pipes and exit normally.
try {
child.stdout?.resume();
child.stderr?.resume();
} catch (e) {
debugLogger.warn(
`post-promote stdout/stderr resume() threw: ${e instanceof Error ? e.message : String(e)}`,
);
}
}
// PR-2.5 wave-4: single-fire latch shared by
// 'close' and 'error' (both branches funnel through here).
// Without it the child_process path could fire onSettle
// twice — once from `error`, then again from the `close`
// that immediately follows — violating the exactly-once
// settle contract and racing the caller's `transitionRegistry`.
//
// PR-2.5 wave-4: the helper also performs the
// decoder flush so any caller with `onData` set gets the
// trailing multibyte bytes surfaced — independent of
// whether `onSettle` is also set.
let postPromoteSettleFired = false;
const flushPostPromoteDecoders = (): void => {
if (!postPromote?.onData) return;
try {
if (postPromoteStdoutDecoder) {
const trailing = postPromoteStdoutDecoder.decode();
if (trailing) {
postPromote.onData({
type: 'data',
chunk: trailing,
});
}
}
if (postPromoteStderrDecoder) {
const trailing = postPromoteStderrDecoder.decode();
if (trailing) {
postPromote.onData({
type: 'data',
chunk: trailing,
});
}
}
} catch (flushErr) {
debugLogger.warn(
`post-promote decoder flush threw: ${flushErr instanceof Error ? flushErr.message : String(flushErr)}`,
);
}
};
const firePostSettle = (info: ShellPostPromoteSettleInfo): void => {
if (postPromoteSettleFired) return;
postPromoteSettleFired = true;
flushPostPromoteDecoders();
if (postPromoteStdoutHandler) {
child.stdout?.off('data', postPromoteStdoutHandler);
postPromoteStdoutHandler = null;
}
if (postPromoteStderrHandler) {
child.stderr?.off('data', postPromoteStderrHandler);
postPromoteStderrHandler = null;
}
if (!postPromote?.onSettle) return;
try {
postPromote.onSettle(info);
} catch (cbErr) {
debugLogger.warn(
`postPromote.onSettle threw: ${cbErr instanceof Error ? cbErr.message : String(cbErr)}`,
);
}
};
// PR-2.5 wave-4: install 'close' and
// 'error' listeners whenever ANY postPromote handler is
// present, not just when `onSettle` is set. Two reasons:
//
// 1. T6: `onData`-only callers still had the foreground
// `errorHandler` detached; without a replacement
// listener a post-promote `'error'` would crash Node
// via the unhandled-error default. Even with no
// onSettle to route into, the listener prevents the
// crash (and triggers decoder flush on close).
//
// 2. T3 / T7: `onData`-only callers need the close handler
// to flush trailing decoder bytes; an `onSettle`-only
// caller needs `'close'` to fire onSettle — both share
// the same close hook now.
if (postPromote) {
try {
child.once(
'close',
(
exitCode: number | null,
signalCode: NodeJS.Signals | null,
) => {
// Listen on 'close' (all stdio fully drained) NOT
// 'exit' (which can fire while stdout/stderr still
// have buffered bytes pending). Without this, late
// chunks emitted between 'exit' and 'close' land in
// the caller's onData AFTER onSettle already closed
// the output stream and transitioned the registry —
// they'd be dropped silently and `/tasks` would
// show a truncated log.
firePostSettle({
exitCode,
signal: signalCode,
endTime: Date.now(),
});
},
);
child.once('error', (err: Error) => {
firePostSettle({
exitCode: null,
signal: null,
error: err,
endTime: Date.now(),
});
});
} catch (e) {
debugLogger.warn(
`re-attaching post-promote exit/error listeners threw: ${e instanceof Error ? e.message : String(e)}`,
);
}
}
resolve({
rawOutput: finalBuffer,
output: stripAnsi(combined).trim(),
@ -831,6 +1114,7 @@ export class ShellExecutionService {
abortSignal: AbortSignal,
shellExecutionConfig: ShellExecutionConfig,
ptyInfo: PtyImplementation,
postPromote?: ShellPostPromoteHandlers,
): ShellExecutionHandle {
if (!ptyInfo) {
// This should not happen, but as a safeguard...
@ -1176,11 +1460,11 @@ export class ShellExecutionService {
);
return;
}
// Skip kill, dispose all our listeners on the live PTY (so
// post-promote data/exit/error don't leak into our foreground
// onOutputEvent or crash via the error handler's `throw err`),
// set the listenersDetached guard so any already-enqueued
// processingChain callback's onOutputEvent emits are
// Skip kill, dispose all our foreground listeners on the live
// PTY (so post-promote data/exit/error don't leak into our
// foreground onOutputEvent or crash via the error handler's
// `throw err`), set the listenersDetached guard so any already-
// enqueued processingChain callback's onOutputEvent emits are
// suppressed (in-flight writes still LAND in headlessTerminal
// so the snapshot below reflects them), drain pending chain
// work, drop the PTY from the active set (so cleanup() won't
@ -1188,6 +1472,15 @@ export class ShellExecutionService {
// resolve immediately with `promoted: true` so the awaiting
// caller unblocks. The caller has attached its own listeners
// by this point and owns the PTY's lifecycle.
//
// PR-2.5: if `postPromote.onData` / `postPromote.onSettle` were
// provided, ATTACH NEW listeners after disposing the
// foreground ones — bytes from the still-running child route
// to the caller (typically shell.ts's append-to-bg_xxx.output
// path), and the eventual natural-exit transitions the
// registry entry to `'completed'` / `'failed'` instead of
// leaving it stuck on `'running'`. When postPromote is
// undefined the PR-2 detach-everything contract is preserved.
exited = true;
listenersDetached = true;
abortSignal.removeEventListener('abort', abortHandler);
@ -1228,6 +1521,154 @@ export class ShellExecutionService {
}
this.activePtys.delete(ptyProcess.pid);
// PR-2.5: re-attach minimal listeners that forward to the
// caller's post-promote handlers. Attach BEFORE the drain so
// late bytes the PTY emits during the drain window flow to
// the caller instead of falling on the floor — strictly an
// improvement; without this they'd be dropped on the way to
// the snapshot anyway.
//
// PR-2.5 wave-3: capture the IDisposable
// returned by `onData` / `onExit` and the listener function
// we register on `'error'`, then dispose them all when
// settle fires. node-pty's `ptyProcess` outlives the
// post-promote handlers (the child can sit dead for
// milliseconds before the caller's `cancelChild` finalizes
// it), and node's EventEmitter holds refs to listener
// closures (which in turn hold refs to `onPostData` /
// `onPostSettle` / `promoteArtifacts`). Disposing the
// listeners on settle releases those refs so they can be
// GC'd without waiting for the underlying PTY to be
// collected.
//
// Guard so `onSettle` fires AT MOST ONCE. Both `onExit` and
// the post-promote `error` listener below funnel through
// this latch — a PTY error during the read-exit race could
// otherwise fire onSettle twice (once for the error, once
// for the immediately-following exit) and the caller's
// `transitionRegistry` would race itself.
let postPromoteSettleFired = false;
let postPromoteDataDisposable: { dispose: () => void } | null = null;
let postPromoteExitDisposable: { dispose: () => void } | null = null;
let postPromoteErrorListener:
| ((err: NodeJS.ErrnoException) => void)
| null = null;
const disposePostPromoteListeners = () => {
if (postPromoteDataDisposable) {
try {
postPromoteDataDisposable.dispose();
} catch (e) {
debugLogger.warn(
`disposing post-promote data listener threw: ${e instanceof Error ? e.message : String(e)}`,
);
}
postPromoteDataDisposable = null;
}
if (postPromoteExitDisposable) {
try {
postPromoteExitDisposable.dispose();
} catch (e) {
debugLogger.warn(
`disposing post-promote exit listener threw: ${e instanceof Error ? e.message : String(e)}`,
);
}
postPromoteExitDisposable = null;
}
if (postPromoteErrorListener) {
try {
ptyProcess.removeListener('error', postPromoteErrorListener);
} catch (e) {
debugLogger.warn(
`removing post-promote error listener threw: ${e instanceof Error ? e.message : String(e)}`,
);
}
postPromoteErrorListener = null;
}
};
const firePostSettle = (info: ShellPostPromoteSettleInfo) => {
if (postPromoteSettleFired) return;
postPromoteSettleFired = true;
// Dispose BEFORE invoking the caller — even if the caller
// throws, the listeners are gone (and idempotent if we
// come back through the error path).
// Known limitation: node-pty may have queued onData
// callbacks not yet delivered when onExit fires; disposing
// the data listener here means those trailing bytes (<4KB)
// are lost. Bounded and low severity — a setImmediate
// delay could recover them but would complicate the
// single-fire latch.
disposePostPromoteListeners();
if (!postPromote?.onSettle) return;
try {
postPromote.onSettle(info);
} catch (cbErr) {
debugLogger.warn(
`postPromote.onSettle threw: ${cbErr instanceof Error ? cbErr.message : String(cbErr)}`,
);
}
};
if (postPromote?.onData) {
const onPostData = postPromote.onData;
try {
postPromoteDataDisposable = ptyProcess.onData((data: string) => {
try {
onPostData({ type: 'data', chunk: data });
} catch (cbErr) {
// Caller's handler threw — don't let it crash the
// child's data loop. Log + drop.
debugLogger.warn(
`postPromote.onData threw: ${cbErr instanceof Error ? cbErr.message : String(cbErr)}`,
);
}
});
} catch (e) {
debugLogger.warn(
`re-attaching post-promote data listener threw: ${e instanceof Error ? e.message : String(e)}`,
);
}
}
if (postPromote) {
try {
postPromoteExitDisposable = ptyProcess.onExit(
({
exitCode,
signal,
}: {
exitCode: number;
signal?: number;
}) => {
firePostSettle({
exitCode,
signal: signal ?? null,
endTime: Date.now(),
});
},
);
} catch (e) {
debugLogger.warn(
`re-attaching post-promote exit listener threw: ${e instanceof Error ? e.message : String(e)}`,
);
}
try {
postPromoteErrorListener = (err: NodeJS.ErrnoException) => {
if (isExpectedPtyReadExitError(err)) {
return;
}
firePostSettle({
error: err,
exitCode: null,
signal: null,
endTime: Date.now(),
});
};
ptyProcess.on('error', postPromoteErrorListener);
} catch (e) {
debugLogger.warn(
`re-attaching post-promote error listener threw: ${e instanceof Error ? e.message : String(e)}`,
);
}
}
// Drain in-flight chain work (already-enqueued
// headlessTerminal.write callbacks) so the snapshot reflects
// the last batch of bytes the PTY emitted before promote.

File diff suppressed because it is too large Load diff

View file

@ -34,6 +34,8 @@ import type {
ShellExecutionConfig,
ShellExecutionResult,
ShellOutputEvent,
ShellPostPromoteHandlers,
ShellPostPromoteSettleInfo,
} from '../services/shellExecutionService.js';
import { ShellExecutionService } from '../services/shellExecutionService.js';
import type { ShellTaskRegistration } from '../services/backgroundShellRegistry.js';
@ -905,6 +907,64 @@ const DEFAULT_FOREGROUND_TIMEOUT_MS = 120000;
*/
const PROMOTE_CANCEL_SIGKILL_TIMEOUT_MS = 200;
/** Maximum wait for the output stream flush before transitioning the registry. */
const PROMOTE_FLUSH_TIMEOUT_MS = 10_000;
/**
* PR-2.5 slots shared between the foreground `execute()` postPromote
* handlers and the post-resolve `handlePromotedForeground` finalizer.
* The handlers fire on the service side as soon as promote happens;
* the finalizer runs after `await resultPromise` returns. They race
* the buffer + settle-queue absorb the race so neither chunks nor the
* eventual exit info are lost. See `executeForeground` for the wiring
* and `handlePromotedForeground` for the drain logic.
*/
interface PromoteArtifacts {
/**
* Chunks observed by `postPromote.onData` BEFORE the stream is
* open. Drained into the stream once `handlePromotedForeground`
* opens it. After drain this stays empty for the rest of the run.
*/
buffer: string[];
/**
* Append-mode write stream to `bg_xxx.output`. Null until
* `handlePromotedForeground` opens it. Closed by `onSettleWired`.
*/
stream: fs.WriteStream | null;
/**
* Latched true when the output stream is no longer accepting writes.
* Two paths set it:
*
* 1. Stream open failed (`fs.createWriteStream` threw OR fired an
* async `'error'` event before bytes could land). The stream
* will never reopen; future `onData` chunks must drop.
* 2. Settle has fired and `onSettleWired` has drained the buffer
* and called `stream.end()`. The stream is closing; any chunk
* that arrives during the `.end()` flush window (rare but
* possible on PTY when kernel buffers deliver late) MUST drop
* rather than be pushed into the buffer at this point the
* buffer has no remaining drain path (the foreground finalizer
* has returned).
*
* Without this flag the buffer would grow without bound under a
* sustained child whose output file we can't open, OR strand
* late-arriving post-settle bytes in an undrainable buffer.
*/
streamClosed: boolean;
/**
* Settle handler installed by `handlePromotedForeground` once the
* registry entry exists. Null until then; `onSettle` calls below
* queue into `settleQueued` if this isn't yet set.
*/
onSettleWired: ((info: ShellPostPromoteSettleInfo) => void) | null;
/**
* Settle info captured by `postPromote.onSettle` before the wired
* handler was installed. `handlePromotedForeground` checks this and
* fires the wired handler synchronously after registering.
*/
settleQueued: ShellPostPromoteSettleInfo | null;
}
// Long-run advisory threshold: half the EFFECTIVE foreground timeout
// (not the default), computed per-invocation by `longRunThresholdFor`.
// Couples to whichever timeout actually governs THIS command — so a
@ -1619,6 +1679,77 @@ export class ShellToolInvocation extends BaseToolInvocation<
}
};
// Pre-allocate the promote artifacts (PR-2.5). Lazily created — no
// disk I/O unless the user actually fires Ctrl+B / promote signal.
// The handlers below close over these slots; once promote happens,
// `handlePromotedForeground` populates them (opens the stream, sets
// the shellId / onSettle wiring), and any onData chunks that the
// service forwarded BEFORE handlePromotedForeground caught up land
// in `postPromoteBuffer` and drain to the stream once it opens.
const promoteArtifacts: PromoteArtifacts = {
buffer: [],
stream: null,
streamClosed: false,
onSettleWired: null,
settleQueued: null,
};
const postPromote: ShellPostPromoteHandlers = {
onData: (event) => {
if (event.type !== 'data') return;
// ANSI structured chunks have no append semantics — coerce to
// string. The output file is plain text; live ANSI updates are
// owned by the foreground stream, which by promote-time has
// already terminated.
//
// PR-2.5 wave-4: strip ANSI before writing so
// the post-promote tail of `bg_xxx.output` matches the format
// of the snapshot above (which is rendered terminal text, not
// raw escape sequences) AND matches the regular
// `executeBackground` path's `outputStream.write(stripAnsi(chunk))`
// contract. Without this, an agent reading the file after a
// promote would see plain text up to the promote moment, then
// raw `\x1b[...m` color codes / cursor moves / clear-screen
// sequences for any post-promote output — which is unreadable
// and inconsistent.
const rawChunk =
typeof event.chunk === 'string'
? event.chunk
: event.chunk
.map((line) => line.map((tok) => tok.text).join(''))
.join('\n');
const chunk = stripAnsi(rawChunk);
if (promoteArtifacts.stream) {
try {
promoteArtifacts.stream.write(chunk);
} catch (err) {
debugLogger.warn(
`promote: postPromote stream.write failed: ${getErrorMessage(err)}`,
);
}
} else if (promoteArtifacts.streamClosed) {
// Stream-open already failed permanently — drop chunks
// rather than buffer them. Without this guard the buffer
// would grow without bound under a sustained child whose
// output file we couldn't open.
debugLogger.debug(
'promote: dropping post-promote chunk because output stream open failed',
);
} else {
promoteArtifacts.buffer.push(chunk);
}
},
onSettle: (info) => {
if (promoteArtifacts.onSettleWired) {
promoteArtifacts.onSettleWired(info);
} else {
// Service observed the child exit before handlePromotedForeground
// finished registering. Queue the settle info — handlePromotedForeground
// applies it as soon as the registry entry exists.
promoteArtifacts.settleQueued = info;
}
},
};
let executionHandle;
try {
executionHandle = await ShellExecutionService.execute(
@ -1628,6 +1759,7 @@ export class ShellToolInvocation extends BaseToolInvocation<
combinedSignal,
this.config.getShouldUseNodePtyShell(),
shellExecutionConfig ?? {},
{ postPromote },
);
} catch (err) {
// ShellExecutionService.execute() can throw before resolving (e.g.
@ -1725,6 +1857,7 @@ export class ShellToolInvocation extends BaseToolInvocation<
cwd,
commandToExecute,
promoteAbortController,
promoteArtifacts,
);
return promotedToolResult;
}
@ -2033,27 +2166,26 @@ export class ShellToolInvocation extends BaseToolInvocation<
/**
* Foreground background promote handler. Called when the foreground
* execute path observes `result.promoted: true` (the user pressed
* Ctrl+B mid-flight). Snapshots captured output to a `bg_xxx.output`
* file, registers a `BackgroundShellEntry` in the same registry the
* `is_background: true` path uses, and returns a model-facing
* `ToolResult` pointing at `/tasks` / the dialog / `task_stop` for
* follow-up.
* Ctrl+B mid-flight). Writes the initial snapshot + open the
* post-promote append stream so subsequent child bytes land in
* `bg_xxx.output`, registers a `BackgroundShellEntry` in the same
* registry the `is_background: true` path uses, wires settle so
* natural child exit transitions the entry to `'completed'` /
* `'failed'`, and returns a model-facing `ToolResult` pointing at
* `/tasks` / the dialog / `task_stop` for follow-up.
*
* Limitations (PR-2.5 follow-up):
* - The registry entry stays `'running'` until `task_stop bg_xxx`
* or session-end `abortAll` clears it; natural child exit does
* NOT auto-settle the entry today (no settle hook from the
* service after promote the listener was detached as part of
* PR-1's ownership-transfer contract).
* - The `outputPath` content is FROZEN at the promote moment; the
* service no longer streams post-promote bytes to the file.
* Caller-side stream redirect lands in PR-2.5.
* PR-2.5: post-promote stream redirect + natural-exit registry
* settle are now live via the `postPromote` callbacks wired in
* `executeForeground`. The `promoteArtifacts` parameter carries the
* pre-allocated buffer/stream slots that absorb the race between
* service-side promote-time data flush and this finalizer running.
*/
private async handlePromotedForeground(
result: ShellExecutionResult,
cwd: string,
commandToExecute: string,
abortController: AbortController,
promoteArtifacts: PromoteArtifacts,
): Promise<ToolResult> {
// Mirror executeBackground's outputPath layout so /tasks-on-disk and
// ReadFileTool's auto-allow rules treat foreground-promoted shells
@ -2110,15 +2242,108 @@ export class ShellToolInvocation extends BaseToolInvocation<
const shellId = `bg_${crypto.randomBytes(4).toString('hex')}`;
const outputPath = path.join(outputDir, `shell-${shellId}.output`);
// Best-effort initial snapshot write — if disk is full or
// permission flips, log + continue (the registry entry is still
// valuable on its own; the file is only the inspection surface).
// PR-2.5: open an append-mode write stream so the initial snapshot
// AND post-promote bytes from the still-running child both land in
// the same file. Synchronous open via `createWriteStream` with
// `flags: 'w'` (overwrite) — if a stale file is somehow there from
// a prior session with the same shellId (vanishingly unlikely
// given the randomBytes), start fresh. Stream errors (ENOSPC mid-
// stream, permission flip) are logged via 'error' listener; we
// never let them crash the daemon.
let outputStream: fs.WriteStream | null = null;
try {
fs.writeFileSync(outputPath, result.output);
outputStream = fs.createWriteStream(outputPath, { flags: 'w' });
// PR-2.5 wave-2: `createWriteStream` reports common
// failures (ENOENT / EACCES / ENOSPC during the async libuv
// `open`) via an `'error'` event AFTER this synchronous call
// returns — they do NOT throw. Without latching the failure
// here, `promoteArtifacts.stream` would still point at an
// already-broken stream, `postPromote.onData` would `write` into
// it (catching the throw via its own try/catch but never
// releasing the buffer), and `onSettleWired` would attach a
// `'finish'` listener that never fires → registry stuck on
// `running` forever. Latch the failure: null the stream,
// mark `streamClosed` so `onData` drops chunks, and let
// `onSettleWired` transition the registry immediately (its
// existing `if (!stream)` branch handles that case).
outputStream.on('error', (err) => {
debugLogger.warn(
`promote: output write stream error for ${outputPath}: ${getErrorMessage(err)}`,
);
const droppedChunks = promoteArtifacts.buffer.length;
promoteArtifacts.stream = null;
promoteArtifacts.streamClosed = true;
try {
fs.appendFileSync(
outputPath,
`\n[WARNING: post-promote output lost — stream error (${getErrorMessage(err)}). ${droppedChunks} buffered chunks dropped.]\n`,
);
} catch {
// Best-effort diagnostic — if the append itself fails
// (e.g. disk full), the debugLogger.warn above is the
// only trace left.
}
});
// Initial snapshot first, so it always precedes post-promote
// bytes in the file (write ordering is FIFO on a single stream).
outputStream.write(result.output);
// PR-2.5 wave-4: assign the stream BEFORE draining
// the buffer, not after. The drain + assign block is synchronous
// today (single-tick JS, so a service-side `onData` callback
// cannot fire between drain-end and assign), but the assign-
// after-drain order leaves a hazard for any future refactor
// that introduces an `await` inside the drain — a chunk arriving
// in that window would be pushed into `promoteArtifacts.buffer`
// (because `stream` is still null), then later chunks would write
// directly to the stream after assign, producing out-of-order
// bytes in `bg_xxx.output` until the settle drain caught the
// straggler. Assign-first eliminates the hazard entirely:
// concurrent `onData` writes go straight through after the
// queued snapshot + the queued drained chunks, in the correct
// FIFO order on the stream.
promoteArtifacts.stream = outputStream;
while (promoteArtifacts.buffer.length > 0) {
const chunk = promoteArtifacts.buffer.shift()!;
outputStream.write(chunk);
}
} catch (err) {
debugLogger.warn(
`promote: failed to write initial output snapshot to ${outputPath}: ${getErrorMessage(err)}`,
`promote: failed to open output stream for ${outputPath}: ${getErrorMessage(err)}`,
);
// Stream failure is recoverable — the registry entry is still
// valuable on its own; the file is the inspection surface only.
// Continue without a stream; future onData chunks are dropped
// (their warns will accumulate in the log, which is enough
// observability for a rare disk failure case).
promoteArtifacts.stream = null;
// Latch streamClosed so the foreground postPromote.onData
// handler stops buffering chunks that would never be drained
// (the drain path only runs when `stream` becomes non-null,
// which never happens after this branch).
promoteArtifacts.streamClosed = true;
// PR-2.5 wave-3: record how many pre-
// finalizer post-promote chunks are being dropped. Without
// this an oncall engineer reading a truncated `bg_xxx.output`
// has no signal that the truncation is due to stream-open
// failure rather than the child not producing more output.
// The chunks themselves are gone (no salvage path exists once
// the stream open has failed and the buffer drain depends on
// a non-null stream slot).
if (promoteArtifacts.buffer.length > 0) {
debugLogger.warn(
`promote: dropping ${promoteArtifacts.buffer.length} buffered post-promote chunks for ${outputPath} (stream open failed before drain)`,
);
promoteArtifacts.buffer.length = 0;
}
// Last-ditch: try a sync snapshot write so /tasks still has
// SOMETHING readable; the buffer chunks are lost in this branch.
try {
fs.writeFileSync(outputPath, result.output);
} catch (err2) {
debugLogger.warn(
`promote: snapshot fallback writeFileSync also failed for ${outputPath}: ${getErrorMessage(err2)}`,
);
}
}
const startTime = Date.now();
@ -2237,15 +2462,216 @@ export class ShellToolInvocation extends BaseToolInvocation<
} catch {
/* swallow — we're already in an error path */
}
// PR-2.5: close the output stream so the FD doesn't leak past
// the throw. Best-effort — if .end() itself throws we're
// already in an error path with the orphan-child kill already
// in flight.
try {
promoteArtifacts.stream?.end();
} catch {
/* swallow */
}
promoteArtifacts.stream = null;
throw e;
}
// PR-2.5: wire the post-promote settle so a natural child exit
// (or spawn-side error) transitions the registry entry from
// `'running'` to `'completed'` / `'failed'`. Without this the
// entry stays `'running'` until `task_stop` / session-end. The
// service's `postPromote.onSettle` fires AT MOST ONCE per
// promote, and `registry.complete` / `registry.fail` are
// idempotent (no-op when status !== 'running'), so a race with
// `entryAc.abort() → registry.cancel` (task_stop fired during the
// exit window) is safe: whichever lands first wins, the other
// becomes a no-op.
// Status flags consumed by the model-facing copy below.
//
// - `postPromoteSettleObserved`: SET SYNCHRONOUSLY inside
// `onSettleWired` the moment we know the child has exited (the
// service has called us with settle info). Independent of
// whether the registry transition has actually completed yet,
// because the transition may be deferred awaiting the output
// stream's `'finish'` event (libuv flush). This is the flag
// the model-facing copy branches on: once we know the child has
// exited, saying "Status: running" + suggesting `task_stop`
// would mislead the agent.
// - `postPromoteFinalStatus`: classified from the settle info at
// the same synchronous moment, so the status line can report
// the right terminal status even if the registry transition is
// still in flight.
//
// PR-2.5 wave-2: originally the model-facing copy
// checked a `postPromoteAlreadySettled` flag that was only flipped
// AFTER the registry transition fired (post-flush). A fast-exited
// promoted command could therefore land "Status: running" +
// `task_stop` instructions in the model copy even when settle was
// already queued, because the queued-settle drain returned before
// the stream's 'finish' event fired. The two flags decouple
// "child has exited" (what the agent cares about) from "registry
// transition has run" (which can lag behind libuv flush).
let postPromoteSettleObserved = false;
let postPromoteFinalStatus: 'completed' | 'failed' | null = null;
const classifySettle = (
info: ShellPostPromoteSettleInfo,
): { status: 'completed' | 'failed'; failMsg: string | null } => {
// Decision table: `error` → fail (spawn-side failure); `exitCode
// === 0` → complete; non-zero exitCode → fail; signal-killed
// (no exitCode, signal set) → fail with descriptive message;
// everything-null → fail with generic message.
if (info.error) return { status: 'failed', failMsg: info.error.message };
if (info.exitCode === 0) return { status: 'completed', failMsg: null };
if (info.exitCode !== null)
return {
status: 'failed',
failMsg: `Exited with code ${info.exitCode}`,
};
if (info.signal !== null)
return {
status: 'failed',
failMsg: `Terminated by signal ${info.signal}`,
};
// PR-2.5 wave-3: this branch is meant to
// be unreachable — the service always populates one of
// `error` / `exitCode` / `signal`. Hitting it means the
// service emitted a defective settle info object, which is a
// logic bug. Capture the actual field values in the failure
// message AND warn-log so the oncall engineer reading
// `/tasks` or the debug log can tell THIS path apart from the
// other "failed" branches. (`info.error` has been narrowed to
// `never` by the preceding `if (info.error) return`, so we
// can't read `.message` here — by construction it would be
// `undefined` at runtime anyway.)
debugLogger.warn(
`promote: classifySettle all-null fallback hit for ${shellId}` +
`exitCode=${info.exitCode}, signal=${info.signal}, error=undefined`,
);
return {
status: 'failed',
failMsg: `Exited with unknown status (exitCode=${info.exitCode}, signal=${info.signal}, error=undefined)`,
};
};
const transitionRegistry = (info: ShellPostPromoteSettleInfo) => {
const cls = classifySettle(info);
if (cls.status === 'completed') {
registry.complete(shellId, info.exitCode as number, info.endTime);
} else {
registry.fail(shellId, cls.failMsg as string, info.endTime);
}
};
promoteArtifacts.onSettleWired = (info) => {
// Synchronous observation — the child has exited; classify now
// so the model-facing copy can branch correctly even when the
// registry transition is deferred behind the stream's flush.
const cls = classifySettle(info);
postPromoteFinalStatus = cls.status;
postPromoteSettleObserved = true;
// Wait for the output stream to fully FLUSH before transitioning
// the registry. `stream.end()` is asynchronous — pending writes
// can still be in the libuv queue when it returns. Without the
// 'finish' wait, `/tasks` consumers can observe the entry as
// `completed`/`failed` and read the output file BEFORE the
// trailing bytes are on disk, producing truncated logs.
const stream = promoteArtifacts.stream;
// PR-2.5 wave-3: drain the pre-settle
// buffer to the stream BEFORE nulling the shared slot. Service-
// side `onData` callbacks that race the foreground finalizer
// can land chunks in the buffer between when the wire fires
// and when the buffer drain (during stream-open) sees them.
// Without this drain those chunks are stranded. AND latch
// `streamClosed` together with the null so that any
// chunk arriving AFTER `.end()` (during the flush window —
// unlikely once the service has emitted settle, but kernel
// buffers can deliver late on PTY) is DROPPED via the
// `else if (promoteArtifacts.streamClosed)` arm in `onData`
// instead of being pushed into the now-undrainable buffer.
if (stream) {
while (promoteArtifacts.buffer.length > 0) {
try {
stream.write(promoteArtifacts.buffer.shift()!);
} catch (writeErr) {
// Stream write failure during pre-end drain — log + drop,
// same recovery posture as the foreground `onData` write
// path. The error event will fire async if the stream is
// dead, latching `streamClosed` via the 'error' handler.
debugLogger.warn(
`promote: pre-end buffer drain write failed: ${getErrorMessage(writeErr)}`,
);
}
}
}
promoteArtifacts.stream = null;
promoteArtifacts.streamClosed = true;
if (!stream) {
// No stream (open failed or already ended) — transition right
// away, no flush to wait on.
transitionRegistry(info);
return;
}
try {
// `finish` fires after all queued writes have been flushed to
// the underlying fd. `error` covers a late EIO / ENOSPC that
// doesn't reach the existing `'error'` listener — race with
// `.end()` itself. Either way, run the transition once.
let transitioned = false;
const finalize = () => {
if (transitioned) return;
transitioned = true;
transitionRegistry(info);
};
const flushTimer = setTimeout(() => {
debugLogger.warn(
`promote: output stream flush timed out for ${shellId} after ${PROMOTE_FLUSH_TIMEOUT_MS}ms — transitioning registry without flush confirmation`,
);
finalize();
}, PROMOTE_FLUSH_TIMEOUT_MS);
flushTimer.unref();
stream.once('finish', () => {
clearTimeout(flushTimer);
finalize();
});
stream.once('error', () => {
clearTimeout(flushTimer);
finalize();
});
stream.end();
} catch (closeErr) {
debugLogger.warn(
`promote: closing output stream on settle threw: ${getErrorMessage(closeErr)}`,
);
transitionRegistry(info);
}
};
// Drain a settle that landed BEFORE the wire installed (fast
// commands can exit between `result.promoted` and this line).
// After this call returns, `postPromoteSettleObserved` is true
// if a settle was queued — that's the case the model-facing copy
// below branches on so the message doesn't say "Status: running"
// for a process that already finished during the registration
// window.
if (promoteArtifacts.settleQueued) {
const queued = promoteArtifacts.settleQueued;
promoteArtifacts.settleQueued = null;
promoteArtifacts.onSettleWired(queued);
}
// Build the model-facing status line based on whether the settle
// was observed synchronously (i.e. the child has exited). Branch
// on `postPromoteSettleObserved` rather than the post-flush latch
// — see the flag block above for the rationale.
const statusLine = postPromoteSettleObserved
? `Status: ${postPromoteFinalStatus ?? 'settled'}. PID: ${result.pid ?? '(unknown)'}.`
: `Status: running. PID: ${result.pid ?? '(unknown)'}.`;
const inspectLine = `To inspect: \`/tasks\` (text), the Background tasks dialog (↓ + Enter on the footer pill), or \`Read\` the output file directly.`;
const stopLine = postPromoteSettleObserved
? `Process has already exited; no \`task_stop\` needed (the entry is observable in \`/tasks\` for inspection).`
: `To stop the now-background process: \`task_stop({ task_id: '${shellId}' })\`.`;
const llmContent = [
`Foreground command "${commandToExecute}" promoted to background as ${shellId}.`,
`Status: running. PID: ${result.pid ?? '(unknown)'}.`,
statusLine,
`Output snapshot at promote time saved to: ${outputPath}`,
`To inspect: \`/tasks\` (text), the Background tasks dialog (↓ + Enter on the footer pill), or \`Read\` the output file directly.`,
`To stop the now-background process: \`task_stop({ task_id: '${shellId}' })\`.`,
inspectLine,
stopLine,
].join('\n');
debugLogger.debug(