mirror of
https://github.com/QwenLM/qwen-code.git
synced 2026-05-20 01:01:53 +00:00
* feat(core): foreground → background promote integration (#3831 PR-2 of 3) Builds on the \`signal.reason\` foundation merged in #3842 / #3886. Wires the foreground \`shell\` tool to detect a background-promote abort, snapshot the captured output to a \`bg_xxx.output\` file, register a \`BackgroundShellEntry\` in the existing \`BackgroundShellRegistry\`, and return a model-facing \`ToolResult\` pointing at \`/tasks\` / the dialog / \`task_stop\`. Also resolves design question 7 from #3831 (raised by @tanzhenxin in the PR-1 review): set \`result.aborted: false\` when \`result.promoted: true\` so existing \`if (result.aborted)\` consumer branches fall through naturally. ## Changes **\`shellExecutionService.ts\`** — both \`executeWithPty\` and \`childProcessFallback\` background-promote branches now resolve with \`aborted: false, promoted: true\` (was \`aborted: true\`). The flag now answers "should the caller emit a cancel/timeout message?" rather than "did the abort signal fire?" — and a promoted shell is neither cancelled nor timed out (the child is still running, ownership simply transferred). \`ShellExecutionResult.promoted?\` JSDoc updated to document this contract. **\`shell.ts\`** — \`ShellToolInvocation.execute()\` gains a 5th optional parameter \`setPromoteAbortControllerCallback?: (ac: AbortController) => void\`. The foreground path now creates an internal \`promoteAbortController\` and combines its signal into the existing \`signal + timeoutSignal\` AbortSignal.any() chain. Right after \`setPidCallback\` fires, \`setPromoteAbortControllerCallback\` exposes the controller to the scheduler so a UI surface (PR-3 Ctrl+B keybind) can find it by callId and trigger \`abort({ kind: 'background', shellId })\`. When \`result.promoted\` is observed after \`await resultPromise\`, a new \`handlePromotedForeground\` private method: 1. Generates \`bg_xxx\` shellId + on-disk \`outputPath\` under the same project temp dir \`executeBackground\` uses. 2. Writes \`result.output\` (the snapshot the service flushed at promote time) as the file's initial content (best-effort — ENOSPC / EACCES logged + swallowed; the registry entry is valuable on its own). 3. Constructs a \`BackgroundShellEntry\` with the running pid + the same \`promoteAbortController\` already wired into the live child — \`task_stop bg_xxx\` and the dialog's \`x\` key both abort via \`entry.abortController\` and will land on the still-running process. 4. Returns a model-facing \`ToolResult\` pointing at \`/tasks\` / the Background tasks dialog / \`task_stop\` for follow-up. **\`coreToolScheduler.ts\`** — \`TrackedExecutingToolCall\` gains an optional \`promoteAbortController?: AbortController\` field, populated when the shell tool's \`setPromoteAbortControllerCallback\` fires. The scheduler routes only the shell-tool branch to pass this callback, matching the existing \`setPidCallback\` pattern. ## Limitations (deferred to PR-2.5) Two follow-up items intentionally NOT in scope here. Scope discipline keeps PR-2 reviewable while still delivering the user-facing promote flow end-to-end (PR-3's Ctrl+B keybind can wire to this PR's \`promoteAbortController\` to ship a working feature). - **Post-promote stream redirect**: today the \`outputPath\` content is FROZEN at the promote moment. The service detached its data listener as part of PR-1's ownership-transfer contract, so post-promote bytes from the still-running child don't reach the file. \`Read\`-ing the output via \`/tasks\` shows what was captured before promote, not live updates. PR-2.5 will add caller-side \`onPostPromoteData\` callback (or equivalent) so post-promote bytes stream to the file like a normal background shell. - **Natural-exit registry settle**: the registry entry stays \`'running'\` until \`task_stop bg_xxx\` or session-end \`abortAll\` clears it. The service's exit listener was disposed at promote, so there's no observation point for natural child exit. PR-2.5 will keep the exit listener attached post-promote (with a separate \`onPostPromoteSettle\` callback) so the entry transitions to \`completed\` / \`failed\` like a normal background shell. These limitations are visible to users (output frozen, entry stays running until task_stop/session end) but don't break the core promote contract: the agent unblocks, the registry entry is observable, the process stays alive, cancel via \`task_stop\` works. ## Tests **\`shellExecutionService.test.ts\`** — two existing promote tests now assert \`aborted: false\` (per design question 7) instead of \`true\`. \`70 / 70 pass\`. **\`shell.test.ts\`** — three new tests in a \`foreground → background promote (#3831 PR-2)\` describe block: 1. \`setPromoteAbortControllerCallback\` exposes a real \`AbortController\` after spawn. 2. On \`result.promoted: true\`, the registry receives a \`bg_xxx\` entry with pid + abortController + outputPath, the snapshot is written via \`fs.writeFileSync\`, and the model-facing copy references \`/tasks\` + \`task_stop\` + the dialog. 3. A snapshot-write failure (mocked ENOSPC) doesn't break promote — the registry entry still gets registered with the running pid. \`96 / 96 pass\`. **\`coreToolScheduler.test.ts\`** — \`98 / 98 pass\` (no new tests; the new \`promoteAbortController\` field is exercised end-to-end via shell.test.ts). Total: \`264 / 264 affected tests pass\`; tsc + ESLint clean. ## Related - #3831 (Phase D part b — design + 3-PR sequencing; question 7 resolved here) - #3842 (PR-1 — \`signal.reason\` foundation) - #3886 (PR-1 follow-up — Proxy-trap fix + handoff test parity) - #3634 (Background task management roadmap) cc @tanzhenxin * fix(core): give promoted shell entry a FRESH AbortController so task_stop kills the child Real bug found in self-audit of #3894 PR-2: \`entry.abortController\` was being set to the same \`promoteAbortController\` that triggered the promote — which is **already aborted** by the time we reach \`handlePromotedForeground\`. Two consequences: 1. \`task_stop bg_xxx\` calls \`entry.abortController.abort()\`. On an already-aborted controller this is a no-op (the abort event was dispatched once when the controller fired; the second \`abort()\` doesn't re-fire listeners per WHATWG spec). 2. \`ShellExecutionService\` has already detached its own abort listener as part of the PR-1 ownership-transfer contract, so even if the abort COULD re-fire, there's nobody left listening to translate the signal into an actual SIGTERM/SIGKILL on the still- running child. Net effect: a promoted shell would survive \`task_stop\` forever — the agent would think it cancelled, the registry entry would stay \`'running'\`, and the OS process would keep running until the user killed the CLI session. Fix: \`handlePromotedForeground\` now creates a fresh \`AbortController\` for the registry entry and wires its abort listener to: 1. Send SIGTERM → SIGKILL to the still-running child via \`process.kill(-pid, …)\` (Linux/Mac process group, mirroring the \`detached: !isWindows\` spawn the foreground path uses) or \`taskkill /pid /f /t\` (Windows). Reuses the same SIGTERM-then- timeout-then-SIGKILL pattern \`ShellExecutionService.execute()\` uses on the non-promote cancel path; new constant \`PROMOTE_CANCEL_SIGKILL_TIMEOUT_MS = 200ms\` (intentionally separate from the service's \`SIGKILL_TIMEOUT_MS\` so tuning one doesn't silently change the other). 2. Sync-mark the registry entry \`cancelled\` via \`registry.cancel()\` so \`/tasks\` and the dialog reflect the user intent immediately. Added a regression test pinning \`entry.abortController.signal.aborted === false\` at registration time. Without the fix, this asserts \`true\` and the test fails — which is the visible canary for the silent-task_stop-failure mode. 97 / 97 shell.test.ts pass; tsc + ESLint clean. * fix(core): add 'error' listener on Windows taskkill spawn (audit follow-up) Reverse-audit found a Windows-specific crash mode: \`cpSpawn('taskkill', …)\` returns a \`ChildProcess\` whose 'error' event (emitted when the spawn fails — taskkill binary missing, EACCES, etc.) crashes Node by default if no 'error' listener is attached. Same pattern as PR-1's \`@lydell/node-pty\` IPty incident — Web/Node spec quirk easy to miss without specifically thinking about Windows + spawn-failure. Also wrapped the \`cpSpawn\` call itself in try/catch for the rarer sync-throw mode (EMFILE / ENOMEM at spawn-time). Recovery in both cases: log via debugLogger.warn + continue; \`registry.cancel\` below still transitions the entry, and the still-running child becomes an orphan that Windows reaps when the CLI session ends. 97 / 97 shell.test.ts pass; tsc + ESLint clean. * test(core): close 3 test gaps from #3894 review Three [Suggestion] threads from the @tanzhenxin-style review on PR-2, all real test gaps that would have let silent regressions through: 1. **\`setPromoteAbortControllerCallback\` test was too weak.** The old test only asserted that the callback received an \`AbortController\` instance, not that the controller's signal was actually wired into the \`AbortSignal.any(...)\` chain handed to ShellExecutionService. If \`shell.ts\` exposed the controller but forgot to combine its signal, Ctrl+B promotion would never reach the service while the bare-instance test still passed. Strengthened: capture the AbortSignal handed to ShellExecutionService.execute (4th arg), abort the promote controller, and assert the captured signal goes from \`aborted: false\` → \`true\`. 2. **The post-promote cancellation kill path was unverified.** The prior commit added a real-bug fix (fresh \`entryAc\` + abort listener that sends SIGTERM/SIGKILL + sync-marks the registry entry cancelled) but the only test it had was "the controller is fresh, signal not aborted". Reviewer rightly noted that this is the **core operational guarantee** for promoted shells — \`task_stop bg_xxx\` must actually stop the child. Added a test that uses fake timers + a \`process.kill\` spy: register a promoted entry, abort \`entry.abortController\`, flush microtasks (SIGTERM dispatch), advance fake time past \`PROMOTE_CANCEL_SIGKILL_TIMEOUT_MS\` (SIGKILL dispatch + \`registry.cancel\` mark). Pins the entire kill chain. 3. **Scheduler-side wiring of \`promoteAbortController\` was untested.** PR-3's Ctrl+B keybind looks up the executing tool call by callId and aborts \`tc.promoteAbortController\` — if \`CoreToolScheduler\` stops populating that field, the keybind silently breaks. Added a test in \`coreToolScheduler.test.ts\` that uses a \`TestShellInvocation extends ShellToolInvocation\` (so the scheduler's \`instanceof ShellToolInvocation\` check still routes the call through the shell-specific branch that wires the callback) and asserts that an \`onToolCallsUpdate\` batch emitted during the executing window contains a tool call where \`tc.promoteAbortController\` matches the controller the test exposed. 98 / 98 shell.test.ts pass; 99 / 99 coreToolScheduler.test.ts pass; tsc + ESLint clean. * fix(core): use commandToExecute in promoted entry + try/catch register Resolves 3 #3894 review threads: - **Critical**: `entry.command` and `llmContent` for the promoted foreground shell now use `commandToExecute` (post-co-author-rewrite form) instead of raw `this.params.command`. For `git commit -m` invocations that `addCoAuthorToGitCommit()` rewrote, the registry entry now mirrors what actually ran — matching `executeBackground`'s long-standing convention (line 1234). - Defensive try/catch around `registry.register(entry)`: today the call is internally safe (Map.set + emit), but a future implementation that throws would leave a zombie child detached from service listeners with no kill path. Catch path logs, fires `entryAc.abort()` for best-effort kill via the wired listener, and re-throws so the scheduler surfaces the failure. - Updates the misleading comment (line 748) that claimed the registry entry uses "the same `promoteAbortController`" — actual impl uses a fresh `entryAc` (the audit-fix from the previous push). Tests: - `entry.command` git-commit case pinning post-rewrite form - register-throw rejection + SIGTERM/SIGKILL kill via fake timers - 100/100 shell.test.ts pass; tsc + ESLint clean * fix(core): close 2 #3894 review findings — promote refused-race + mkdir orphan Resolves @tanzhenxin's CHANGES_REQUESTED review on #3894. 1. **Refused-promote race no longer reported as "Command timed out"** The combined-abort signal folds in `signal | timeoutSignal | promoteAbortController.signal`, but the timeout discriminator only excluded the user-cancel signal — not the promote signal. When the user fires Ctrl+B (PR-3's keybind) but the service's race guard refuses promotion (the child terminated a beat earlier), the result lands `aborted: true, promoted: false` and the foreground path falsely reported `Command timed out after 120000ms`. Both the agent and the user would see a timeout that didn't happen. Fix: extend the discriminator to ALSO exclude `promoteAbortController.signal.aborted`. Add a `wasPromoteRefused` branch that surfaces the actual cause: "Command finished before the background-promote request could be honoured (the child had already exited)." Same fix applied to both the llmContent path and the returnDisplay path so the model and the visible UI agree. Latent in PR-2 itself (no in-tree caller fires the promote yet), but PR-3's keybind would expose it on first ship. 2. **Unguarded mkdirSync orphans the promoted child** After `result.promoted: true`, ownership of the still-running child has transferred and the service's kill path is detached. The promote handler creates the snapshot output directory next, but the original `fs.mkdirSync(outputDir, { recursive: true })` had no guard — read- only temp mounts, sandboxed perms, full disk on inode/metadata exhaustion would reject the handler BEFORE the registry's kill listener was wired. The still-running child became an orphan zombie with no kill path until the OS reaped it on session end. Fix: wrap mkdirSync in try/catch (matches the safety pattern around `registry.register`). On failure, log + best-effort kill the child (SIGTERM via process.kill(-pid) on POSIX, taskkill /f /t on Windows with an `error` listener so a spawn failure doesn't crash Node) + re-throw so the scheduler surfaces the failure to the agent. Tests: 2 new regressions in `shell.test.ts`: - `mkdirSync(outputDir) throws → child gets SIGTERM, error re-raised` - `promote-refused race (aborted: true, promoted: false after promote signal) is NOT reported as "Command timed out"` 171/171 shell.test.ts pass; tsc + ESLint clean. |
||
|---|---|---|
| .. | ||
| channels | ||
| cli | ||
| core | ||
| sdk-java | ||
| sdk-python | ||
| sdk-typescript | ||
| vscode-ide-companion | ||
| web-templates | ||
| webui | ||
| zed-extension | ||