qwen-code/packages
Shaojin Wen 2d222c4206
feat(core): foreground → background promote integration (#3831 PR-2 of 3) (#3894)
* feat(core): foreground → background promote integration (#3831 PR-2 of 3)

Builds on the \`signal.reason\` foundation merged in #3842 / #3886. Wires
the foreground \`shell\` tool to detect a background-promote abort, snapshot
the captured output to a \`bg_xxx.output\` file, register a
\`BackgroundShellEntry\` in the existing \`BackgroundShellRegistry\`, and
return a model-facing \`ToolResult\` pointing at \`/tasks\` / the dialog /
\`task_stop\`. Also resolves design question 7 from #3831 (raised by
@tanzhenxin in the PR-1 review): set \`result.aborted: false\` when
\`result.promoted: true\` so existing \`if (result.aborted)\` consumer
branches fall through naturally.

## Changes

**\`shellExecutionService.ts\`** — both \`executeWithPty\` and
\`childProcessFallback\` background-promote branches now resolve with
\`aborted: false, promoted: true\` (was \`aborted: true\`). The flag now
answers "should the caller emit a cancel/timeout message?" rather than
"did the abort signal fire?" — and a promoted shell is neither
cancelled nor timed out (the child is still running, ownership simply
transferred). \`ShellExecutionResult.promoted?\` JSDoc updated to
document this contract.

**\`shell.ts\`** — \`ShellToolInvocation.execute()\` gains a 5th optional
parameter \`setPromoteAbortControllerCallback?: (ac: AbortController)
=> void\`. The foreground path now creates an internal
\`promoteAbortController\` and combines its signal into the existing
\`signal + timeoutSignal\` AbortSignal.any() chain. Right after
\`setPidCallback\` fires, \`setPromoteAbortControllerCallback\` exposes
the controller to the scheduler so a UI surface (PR-3 Ctrl+B keybind)
can find it by callId and trigger \`abort({ kind: 'background',
shellId })\`.

When \`result.promoted\` is observed after \`await resultPromise\`, a new
\`handlePromotedForeground\` private method:
  1. Generates \`bg_xxx\` shellId + on-disk \`outputPath\` under the same
     project temp dir \`executeBackground\` uses.
  2. Writes \`result.output\` (the snapshot the service flushed at
     promote time) as the file's initial content (best-effort —
     ENOSPC / EACCES logged + swallowed; the registry entry is
     valuable on its own).
  3. Constructs a \`BackgroundShellEntry\` with the running pid + the
     same \`promoteAbortController\` already wired into the live child
     — \`task_stop bg_xxx\` and the dialog's \`x\` key both abort via
     \`entry.abortController\` and will land on the still-running
     process.
  4. Returns a model-facing \`ToolResult\` pointing at \`/tasks\` / the
     Background tasks dialog / \`task_stop\` for follow-up.

**\`coreToolScheduler.ts\`** — \`TrackedExecutingToolCall\` gains an
optional \`promoteAbortController?: AbortController\` field, populated
when the shell tool's \`setPromoteAbortControllerCallback\` fires. The
scheduler routes only the shell-tool branch to pass this callback,
matching the existing \`setPidCallback\` pattern.

## Limitations (deferred to PR-2.5)

Two follow-up items intentionally NOT in scope here. Scope discipline
keeps PR-2 reviewable while still delivering the user-facing promote
flow end-to-end (PR-3's Ctrl+B keybind can wire to this PR's
\`promoteAbortController\` to ship a working feature).

- **Post-promote stream redirect**: today the \`outputPath\` content is
  FROZEN at the promote moment. The service detached its data
  listener as part of PR-1's ownership-transfer contract, so
  post-promote bytes from the still-running child don't reach the
  file. \`Read\`-ing the output via \`/tasks\` shows what was captured
  before promote, not live updates. PR-2.5 will add caller-side
  \`onPostPromoteData\` callback (or equivalent) so post-promote bytes
  stream to the file like a normal background shell.

- **Natural-exit registry settle**: the registry entry stays
  \`'running'\` until \`task_stop bg_xxx\` or session-end \`abortAll\`
  clears it. The service's exit listener was disposed at promote, so
  there's no observation point for natural child exit. PR-2.5 will
  keep the exit listener attached post-promote (with a separate
  \`onPostPromoteSettle\` callback) so the entry transitions to
  \`completed\` / \`failed\` like a normal background shell.

These limitations are visible to users (output frozen, entry stays
running until task_stop/session end) but don't break the core promote
contract: the agent unblocks, the registry entry is observable, the
process stays alive, cancel via \`task_stop\` works.

## Tests

**\`shellExecutionService.test.ts\`** — two existing promote tests now
assert \`aborted: false\` (per design question 7) instead of \`true\`.
\`70 / 70 pass\`.

**\`shell.test.ts\`** — three new tests in a \`foreground → background
promote (#3831 PR-2)\` describe block:
  1. \`setPromoteAbortControllerCallback\` exposes a real
     \`AbortController\` after spawn.
  2. On \`result.promoted: true\`, the registry receives a \`bg_xxx\`
     entry with pid + abortController + outputPath, the snapshot is
     written via \`fs.writeFileSync\`, and the model-facing copy
     references \`/tasks\` + \`task_stop\` + the dialog.
  3. A snapshot-write failure (mocked ENOSPC) doesn't break promote —
     the registry entry still gets registered with the running pid.
\`96 / 96 pass\`.

**\`coreToolScheduler.test.ts\`** — \`98 / 98 pass\` (no new tests; the
new \`promoteAbortController\` field is exercised end-to-end via
shell.test.ts).

Total: \`264 / 264 affected tests pass\`; tsc + ESLint clean.

## Related

- #3831 (Phase D part b — design + 3-PR sequencing; question 7
  resolved here)
- #3842 (PR-1 — \`signal.reason\` foundation)
- #3886 (PR-1 follow-up — Proxy-trap fix + handoff test parity)
- #3634 (Background task management roadmap)

cc @tanzhenxin

* fix(core): give promoted shell entry a FRESH AbortController so task_stop kills the child

Real bug found in self-audit of #3894 PR-2: \`entry.abortController\`
was being set to the same \`promoteAbortController\` that triggered the
promote — which is **already aborted** by the time we reach
\`handlePromotedForeground\`. Two consequences:

1. \`task_stop bg_xxx\` calls \`entry.abortController.abort()\`. On an
   already-aborted controller this is a no-op (the abort event was
   dispatched once when the controller fired; the second \`abort()\`
   doesn't re-fire listeners per WHATWG spec).
2. \`ShellExecutionService\` has already detached its own abort
   listener as part of the PR-1 ownership-transfer contract, so even
   if the abort COULD re-fire, there's nobody left listening to
   translate the signal into an actual SIGTERM/SIGKILL on the still-
   running child.

Net effect: a promoted shell would survive \`task_stop\` forever — the
agent would think it cancelled, the registry entry would stay
\`'running'\`, and the OS process would keep running until the user
killed the CLI session.

Fix: \`handlePromotedForeground\` now creates a fresh \`AbortController\`
for the registry entry and wires its abort listener to:
1. Send SIGTERM → SIGKILL to the still-running child via
   \`process.kill(-pid, …)\` (Linux/Mac process group, mirroring the
   \`detached: !isWindows\` spawn the foreground path uses) or
   \`taskkill /pid /f /t\` (Windows). Reuses the same SIGTERM-then-
   timeout-then-SIGKILL pattern \`ShellExecutionService.execute()\`
   uses on the non-promote cancel path; new constant
   \`PROMOTE_CANCEL_SIGKILL_TIMEOUT_MS = 200ms\` (intentionally
   separate from the service's \`SIGKILL_TIMEOUT_MS\` so tuning one
   doesn't silently change the other).
2. Sync-mark the registry entry \`cancelled\` via \`registry.cancel()\`
   so \`/tasks\` and the dialog reflect the user intent immediately.

Added a regression test pinning \`entry.abortController.signal.aborted
=== false\` at registration time. Without the fix, this asserts
\`true\` and the test fails — which is the visible canary for the
silent-task_stop-failure mode.

97 / 97 shell.test.ts pass; tsc + ESLint clean.

* fix(core): add 'error' listener on Windows taskkill spawn (audit follow-up)

Reverse-audit found a Windows-specific crash mode: \`cpSpawn('taskkill',
…)\` returns a \`ChildProcess\` whose 'error' event (emitted when the
spawn fails — taskkill binary missing, EACCES, etc.) crashes Node by
default if no 'error' listener is attached. Same pattern as PR-1's
\`@lydell/node-pty\` IPty incident — Web/Node spec quirk easy to miss
without specifically thinking about Windows + spawn-failure.

Also wrapped the \`cpSpawn\` call itself in try/catch for the rarer
sync-throw mode (EMFILE / ENOMEM at spawn-time). Recovery in both
cases: log via debugLogger.warn + continue; \`registry.cancel\` below
still transitions the entry, and the still-running child becomes an
orphan that Windows reaps when the CLI session ends.

97 / 97 shell.test.ts pass; tsc + ESLint clean.

* test(core): close 3 test gaps from #3894 review

Three [Suggestion] threads from the @tanzhenxin-style review on PR-2,
all real test gaps that would have let silent regressions through:

1. **\`setPromoteAbortControllerCallback\` test was too weak.** The old
   test only asserted that the callback received an \`AbortController\`
   instance, not that the controller's signal was actually wired into
   the \`AbortSignal.any(...)\` chain handed to ShellExecutionService.
   If \`shell.ts\` exposed the controller but forgot to combine its
   signal, Ctrl+B promotion would never reach the service while the
   bare-instance test still passed. Strengthened: capture the
   AbortSignal handed to ShellExecutionService.execute (4th arg),
   abort the promote controller, and assert the captured signal goes
   from \`aborted: false\` → \`true\`.

2. **The post-promote cancellation kill path was unverified.** The
   prior commit added a real-bug fix (fresh \`entryAc\` + abort
   listener that sends SIGTERM/SIGKILL + sync-marks the registry
   entry cancelled) but the only test it had was "the controller is
   fresh, signal not aborted". Reviewer rightly noted that this is
   the **core operational guarantee** for promoted shells — \`task_stop
   bg_xxx\` must actually stop the child. Added a test that uses
   fake timers + a \`process.kill\` spy: register a promoted entry,
   abort \`entry.abortController\`, flush microtasks (SIGTERM dispatch),
   advance fake time past \`PROMOTE_CANCEL_SIGKILL_TIMEOUT_MS\`
   (SIGKILL dispatch + \`registry.cancel\` mark). Pins the entire
   kill chain.

3. **Scheduler-side wiring of \`promoteAbortController\` was untested.**
   PR-3's Ctrl+B keybind looks up the executing tool call by callId
   and aborts \`tc.promoteAbortController\` — if \`CoreToolScheduler\`
   stops populating that field, the keybind silently breaks. Added
   a test in \`coreToolScheduler.test.ts\` that uses a
   \`TestShellInvocation extends ShellToolInvocation\` (so the
   scheduler's \`instanceof ShellToolInvocation\` check still routes
   the call through the shell-specific branch that wires the
   callback) and asserts that an \`onToolCallsUpdate\` batch emitted
   during the executing window contains a tool call where
   \`tc.promoteAbortController\` matches the controller the test
   exposed.

98 / 98 shell.test.ts pass; 99 / 99 coreToolScheduler.test.ts pass; tsc + ESLint clean.

* fix(core): use commandToExecute in promoted entry + try/catch register

Resolves 3 #3894 review threads:

- **Critical**: `entry.command` and `llmContent` for the promoted
  foreground shell now use `commandToExecute` (post-co-author-rewrite
  form) instead of raw `this.params.command`. For `git commit -m`
  invocations that `addCoAuthorToGitCommit()` rewrote, the registry
  entry now mirrors what actually ran — matching `executeBackground`'s
  long-standing convention (line 1234).

- Defensive try/catch around `registry.register(entry)`: today the
  call is internally safe (Map.set + emit), but a future
  implementation that throws would leave a zombie child detached from
  service listeners with no kill path. Catch path logs, fires
  `entryAc.abort()` for best-effort kill via the wired listener, and
  re-throws so the scheduler surfaces the failure.

- Updates the misleading comment (line 748) that claimed the registry
  entry uses "the same `promoteAbortController`" — actual impl uses a
  fresh `entryAc` (the audit-fix from the previous push).

Tests:
- `entry.command` git-commit case pinning post-rewrite form
- register-throw rejection + SIGTERM/SIGKILL kill via fake timers
- 100/100 shell.test.ts pass; tsc + ESLint clean

* fix(core): close 2 #3894 review findings — promote refused-race + mkdir orphan

Resolves @tanzhenxin's CHANGES_REQUESTED review on #3894.

1. **Refused-promote race no longer reported as "Command timed out"**

The combined-abort signal folds in `signal | timeoutSignal |
promoteAbortController.signal`, but the timeout discriminator only
excluded the user-cancel signal — not the promote signal. When the
user fires Ctrl+B (PR-3's keybind) but the service's race guard
refuses promotion (the child terminated a beat earlier), the result
lands `aborted: true, promoted: false` and the foreground path
falsely reported `Command timed out after 120000ms`. Both the agent
and the user would see a timeout that didn't happen.

Fix: extend the discriminator to ALSO exclude
`promoteAbortController.signal.aborted`. Add a `wasPromoteRefused`
branch that surfaces the actual cause: "Command finished before the
background-promote request could be honoured (the child had already
exited)." Same fix applied to both the llmContent path and the
returnDisplay path so the model and the visible UI agree.

Latent in PR-2 itself (no in-tree caller fires the promote yet), but
PR-3's keybind would expose it on first ship.

2. **Unguarded mkdirSync orphans the promoted child**

After `result.promoted: true`, ownership of the still-running child
has transferred and the service's kill path is detached. The promote
handler creates the snapshot output directory next, but the original
`fs.mkdirSync(outputDir, { recursive: true })` had no guard — read-
only temp mounts, sandboxed perms, full disk on inode/metadata
exhaustion would reject the handler BEFORE the registry's kill
listener was wired. The still-running child became an orphan zombie
with no kill path until the OS reaped it on session end.

Fix: wrap mkdirSync in try/catch (matches the safety pattern around
`registry.register`). On failure, log + best-effort kill the child
(SIGTERM via process.kill(-pid) on POSIX, taskkill /f /t on Windows
with an `error` listener so a spawn failure doesn't crash Node) +
re-throw so the scheduler surfaces the failure to the agent.

Tests: 2 new regressions in `shell.test.ts`:
- `mkdirSync(outputDir) throws → child gets SIGTERM, error re-raised`
- `promote-refused race (aborted: true, promoted: false after promote
  signal) is NOT reported as "Command timed out"`

171/171 shell.test.ts pass; tsc + ESLint clean.
2026-05-08 19:56:50 +08:00
..
channels chore(release): v0.15.8 (#3928) [skip ci] 2026-05-08 00:46:00 +08:00
cli feat(session): add /branch to fork the current conversation (#3539) 2026-05-08 19:34:11 +08:00
core feat(core): foreground → background promote integration (#3831 PR-2 of 3) (#3894) 2026-05-08 19:56:50 +08:00
sdk-java fix(sdk-java): pass custom env to CLI process (#3543) 2026-04-24 10:37:52 +08:00
sdk-python feat: add commit attribution with per-file AI contribution tracking (#3115) 2026-05-08 09:55:58 +08:00
sdk-typescript refactor: extract shared release helper utilities (#3834) 2026-05-05 10:15:17 +08:00
vscode-ide-companion fix(vscode): mark Qwen OAuth coder-model as Discontinued in model picker (#3948) 2026-05-08 17:41:40 +08:00
web-templates chore(release): v0.15.8 (#3928) [skip ci] 2026-05-08 00:46:00 +08:00
webui chore(release): v0.15.8 (#3928) [skip ci] 2026-05-08 00:46:00 +08:00
zed-extension