Commit graph

5707 commits

Author SHA1 Message Date
wenshao
1bb1326904 fix(cli): address Copilot second-pass review — dialogOpen tick gate, isFocused doc
Two further findings from Copilot:

1. **Interval kept ticking while bg-tasks dialog was open.** The
   first-pass fix already torn the tick down once all rows expired
   from the visibility window, but `dialogOpen` was a separate
   reason the panel returned null and was missed by the gate. Add
   `dialogOpen` to the useEffect deps and short-circuit when true,
   so the dialog's tenure is interval-free.

2. **Stale `isFocused` doc comment in `ToolMessageProps`.** The
   comment claimed the prop controlled the now-retired `Ctrl+E /
   Ctrl+F` display shortcuts (those died with the inline
   `AgentExecutionDisplay` frame). Rewrite the comment to describe
   the only remaining behavior — the focus-routed approval surface
   (focus-holder banner vs. queued-sibling marker).

Coverage delta: +1 case on `LiveAgentPanel.test.tsx` — `tears the
1s tick down when the bg-tasks dialog opens` advances 60s of fake
time with `dialogOpen=true` and asserts no panel state drift.
2026-05-07 19:30:14 +08:00
wenshao
e9b76989c0 fix(cli): address Copilot review on LiveAgentPanel — interval gate, ghost rows, dead doc
Three findings from Copilot's PR review:

1. **1s interval kept ticking forever after expiry.** The gate was
   `entries.some(isAgentEntry)`, but `BackgroundTaskRegistry.getAll()`
   retains terminal entries indefinitely — once the last visible row
   passed its 8s window the panel returned null but the interval kept
   firing setNow each second, churning re-renders for nothing on
   screen. The gate now considers visibility (running / paused OR
   terminal-within-window) and the interval clears itself once the
   condition flips false. New entries restart the interval via the
   `entries` dep.

2. **Ghost rows when registry forgets an entry.** The live re-pull
   fell back to the snapshot when `registry.get()` returned
   undefined. The canonical case is a foreground subagent that
   unregisters silently after its statusChange fires
   (`unregisterForeground` deletes without emitting a follow-up
   transition) — the snapshot still says `running`, so the row
   would never clear. Trust the registry: when it says the entry
   is gone, drop the row. The snapshot-only fallback is preserved
   for the no-Config case (test fixtures).

3. **Dead doc reference.** The trailing comment in `subagents/index.ts`
   pointed at `docs/comparison/subagent-display-deep-dive.md`, which
   doesn't exist in this repo (it lives in the codeagents knowledge
   base, not qwen-code). Dropped the dead pointer; the in-tree
   pointers to `LiveAgentPanel` and `BackgroundTasksDialog` already
   tell readers where to look.

Coverage delta: +2 cases on `LiveAgentPanel.test.tsx` — `drops
snapshot rows the live registry no longer knows about` (issue #2)
and `still shows the snapshot when no Config is mounted` (locks in
the test-fixture fallback that issue #2's stricter rule would
otherwise have broken).
2026-05-07 19:25:26 +08:00
wenshao
b8acd9c39b fix(cli): make Down on focused BackgroundTasksPill open the dialog
The focus chain Composer → AgentTabBar → BackgroundTasksPill is
walked with the Down arrow, but Down dead-ended at the pill — the
pill only opened the dialog on Enter. Users who followed the
LiveAgentPanel's "(↓ to view all)" overflow callout reached the
highlighted pill and got stuck there, defeating the hint.

Route Down on the focused pill into openDialog so the chain
completes naturally: Composer ↓ → AgentTabBar ↓ → Pill ↓ → Dialog.
Enter still works, so existing muscle memory keeps functioning.
2026-05-07 19:21:19 +08:00
wenshao
bfecc094c8 fix(cli): widen LiveAgentPanel, drop [in turn] marker, point overflow at dialog
Three usability fixes from review:

1. Use `terminalWidth` instead of `mainAreaWidth`. The latter is
   capped at 100 cols (intended for markdown / code where soft-wrap
   matters), which on a 200-col terminal left half the screen empty
   to the right of an already-truncating row. Live progress lines
   have nothing to soft-wrap, so the panel wants the full width.

2. Drop the `[in turn]` foreground marker. The flavor distinction
   matters in BackgroundTasksDialog (cancel semantics differ for
   foreground vs background entries) but in the glance panel the
   marker reads as cryptic noise — users asked what it meant. Keep
   the dialog as the surface that surfaces it.

3. Annotate the overflow callout with `(↓ to view all)`. The panel
   is intentionally read-only (it has no keyboard focus so it can't
   steal input from the composer), so when the roster outgrows the
   row budget we point users at the existing dialog — same keystroke
   the footer pill uses, kept in sync so users only learn one
   gesture.
2026-05-07 18:20:20 +08:00
wenshao
860c4ee212 fix(cli): port Claude Code's bullet + arrow visual to LiveAgentPanel rows
Adopt the leaked CoordinatorTaskPanel visual conventions:

- `○` replaces `⊷` for live (running) slots — matches Claude Code's
  use of `figures.circle` for the active-agent bullet, gives a
  uniform list look across the running roster. Terminal states
  keep distinct check / cross marks (✔ / ✖) so they're easy to
  scan at a glance.
- `▶` separates the description from elapsed / tokens, mirroring
  Claude's `PLAY_ICON` suffix marker.
- Activity is wrapped in `( ... )` so it reads as an annotation on
  the description rather than a sibling field, and the type prefix
  switches from ` · ` to `: ` (e.g. `editor: tighten import order`)
  to match Claude's `name: description` pattern.

The two-column flex layout from layout C is preserved — left column
flex-shrink:1 with truncate-end, right column (` ▶ Ns · Nk tokens`)
flex-shrink:0 so elapsed + tokens are never clipped, regardless of
how long the description / activity grows. This is the one
intentional divergence from Claude's literal pattern, which puts
elapsed at the row tail without pinning and lets it disappear off
narrow terminals.

Verified at 60 / 100 / 200 cols: at 200 cols the row is flush with
no internal gap; at 60 cols the description / activity tail
truncates with `…` while elapsed + tokens stay visible on the right.
2026-05-07 18:07:55 +08:00
wenshao
2d3e07ef9b fix(cli): switch LiveAgentPanel row to layout C (right-pinned, no flex-grow)
Iterating on the row layout based on visual review:

- Layout A (time first, single Text) put numbers ahead of identity,
  which broke the natural left-to-right reading order.
- Layout B (right-pinned with flex-grow:1 on left) puffed the left
  column out to fill the row, leaving a visible gap between the
  description tail and the right-pinned elapsed when the terminal
  was wider than the content.

Layout C keeps the right column flex-shrink:0 so elapsed + tokens are
never clipped, but DROPS flex-grow on the left so the two columns sit
side-by-side: empty slack falls off the row tail (invisible) instead
of opening a gap inside the row. Identity (type) and intent
(description / activity) read first, cost reads last — matching the
natural visual hierarchy. When the row overflows the panel width the
left column truncates with `…` mid-row, while elapsed + tokens stay
intact.

Verified at 60 / 100 / 200 cols — at 200 cols there is no internal
gap and no trailing ellipsis; at 60 cols time + tokens stay visible
on the right and the description / activity tail truncates with `…`.
2026-05-07 18:04:14 +08:00
wenshao
474423e814 fix(cli): move elapsed + tokens to the front of LiveAgentPanel rows
The two-column layout used `flex-grow:1` on the left description
column, which puffed it out to fill the row even when content was
short — leaving a visible gap between the description tail and the
right-pinned elapsed/tokens whenever the terminal was wider than the
content. Worse, the gap made it look like display space was being
wasted while the description still got truncated.

Move elapsed + tokens to the front of the row (right after the status
icon) so:

- Time and cost are pinned at a stable left position and are NEVER at
  risk of being truncated, regardless of description / activity length.
- The row reads as one tight left-to-right line — no flex-grow, no
  internal gap. On wide terminals the unused width sits at the row
  tail (invisible), where it belongs.
- Description + activity become the truncatable tail; `truncate-end`
  cuts only when the line genuinely overflows the panel width.

Also wrap the icon-plus-spaces span in a template literal so the two
spaces of breathing room after the glyph survive a prettier pass.

Verified at 60 / 100 / 200 cols: at 200 cols the row renders flush
with no trailing ellipsis and no internal gap; at 60 cols the time +
tokens stay at the front and the description tail truncates with `…`.
2026-05-07 17:59:25 +08:00
wenshao
bf60e53402 feat(cli): replace inline AgentExecutionDisplay with always-on LiveAgentPanel
Surface running subagents in a borderless, always-on roster anchored
beneath the input footer (mirrors Claude Code's CoordinatorTaskPanel)
and retire the verbose inline `AgentExecutionDisplay` frame whose
per-tool-call mutations caused scrollback flicker. Detail / cancel /
resume keep flowing through the existing BackgroundTasksDialog.

LiveAgentPanel:
- Two-column row: status icon + (optional) type + description +
  activity on the left (truncate-end), elapsed + tokens on the right
  in a flex-shrink:0 column so the cost/time fields are never hidden
  by long descriptions.
- Re-pulls each agent from BackgroundTaskRegistry on every wall-clock
  tick so `recentActivities` stays fresh — the snapshot from
  `useBackgroundTaskView` only refreshes on `statusChange` to keep
  the footer pill / AppContainer quiet under heavy tool traffic.
- Reaches for Config via raw ConfigContext (not useConfig) so the
  panel degrades to snapshot-only when no provider is mounted (test
  isolation).
- Hides when any dialog is visible (auth / permission / bg tasks)
  and self-hides when no agent entries are live.
- Drops `subagentType` from the row when it is the default
  `general-purpose` builtin to keep the line uncluttered; specialized
  types still bold-anchor the row.
- Keeps terminal entries on screen for 8s so the user gets feedback
  when an agent finishes, then they fall off (BackgroundTasksDialog
  retains them long-term).

Inline frame retirement:
- ToolMessage's SubagentExecutionRenderer collapses to the focus-
  routed approval surfaces only (focus-holder banner + queued
  marker). All other agent state is owned by the panel + dialog.
- AgentExecutionDisplay.tsx + test removed (-918 lines); the
  subagents/index export is dropped with a pointer to the new
  surfaces.

Net diff: +97 / -1069.
2026-05-07 17:46:24 +08:00
jinye
b1ec8d64c7
fix(cli): warn on ignored provider generation config (#3883)
* fix(cli): warn on ignored provider generation config

Warn when a selected provider model has top-level model.generationConfig fields that will not apply because the provider entry is sealed. Document the required local-model placement for contextWindowSize and related generation config fields.

Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>

* fix(cli): address provider config warning review

Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>

* fix(cli): narrow provider config warning fields

Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>

---------

Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>
2026-05-07 14:46:23 +08:00
易良
f468cb61a0
ci: add Qwen Code issue follow-up bot workflow (#3854)
* ci: add issue follow-up bot workflow

* ci: gate issue follow-up bot rollout

* ci: reduce issue follow-up batch size

* ci: address issue follow-up bot review

* ci: add temporary issue bot canary

* ci: fix canary verification

* ci: dedupe bot token issue comments

* ci: remove temporary issue bot canary

* ci: avoid repeated issue bot followups

* ci: simplify issue follow-up bot prompt

* ci: refine issue follow-up bot flow

* ci: harden issue follow-up bot workflow

* ci: harden issue follow-up bot rollout

* ci: enforce follow-up bot dry-run writes

* ci: redact blocked bot command args

* ci: lock follow-up bot gh wrapper to current repo

- Require explicit `--repo <expected>` on every gh command path; reject
  any --repo value that does not match REPOSITORY/GITHUB_REPOSITORY so a
  poisoned issue body cannot redirect bot writes to another repo.
- Add OPENAI_BASE_URL to the secret-scrubbing list so an internal proxy
  URL is not echoed into comments or labels.
- Print the resolved DISPATCH_DRY_RUN / ISSUE_OPENED_DRY_RUN /
  SCHEDULE_DRY_RUN inputs alongside the resolved dry_run state for
  easier debugging of automatic vs manual paths.

* ci: tighten follow-up bot wrapper and trim noise

- Fold the repo-match check into validate_issue_edit_args /
  validate_issue_comment_args; replace the standalone require_explicit_repo
  with a smaller require_repo_match used only by the read-only paths.
- Add an upfront guard that fails fast when expected_repo is unset,
  and document the positional subcommand match.
- Drop the configuration-notice job: it warned on every issues.opened
  and cron tick when QWEN_ISSUE_FOLLOWUP_BOT_ENABLED was unset, which
  is the default state.
- Remove the redundant BOT_GITHUB_TOKEN re-export at step level
  (already inherited from the workflow env).
- Invert the dry_run resolution so every branch starts from `true` and
  flips to `false` on explicit opt-in, removing the duplicate assignments.
- Collapse the multi-line dry-run debug block into a single state line.
- Note in the prompt that global flags and short aliases (`-b`, `-F`)
  are rejected by the runner so the model only emits long-form gh
  commands.

* ci: fix shim reject logs to include full subcommand context

Agent-Logs-Url: https://github.com/QwenLM/qwen-code/sessions/1cf8097d-b747-4838-a206-63a11352facc

Co-authored-by: yiliang114 <11473889+yiliang114@users.noreply.github.com>

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: yiliang114 <11473889+yiliang114@users.noreply.github.com>
2026-05-07 13:52:52 +08:00
Shaojin Wen
018184b571
fix(core): stop per-subagent ToolRegistry on the foreground-fork path (#3887)
Follow-up to #3873 review: the foreground-fork branch in `agent.ts`
fires the fork body via `void runInForkContext(runFramedFork)` and
returns the placeholder result synchronously, with no try/finally
around the fork body. The other three spawn paths (foreground
non-fork, background fork, background non-fork) added in #3873
already stop the per-subagent ToolRegistry in their finally blocks
— this one was missed, so any AgentTool / SkillTool the fork's
model later instantiated leaked its change-listener on the shared
SubagentManager / SkillManager for the rest of the session.

Wraps the inner body in `try { await runSubagentWithHooks(...) }
finally { void agentConfig.getToolRegistry().stop().catch(() => {}) }`
— same shape as the background bgBody finally added in #3873.

Adds a regression test in `agent.test.ts` Fork dispatch describe
that drains the detached fork body and asserts the stop spy was
invoked exactly once.
2026-05-07 13:43:00 +08:00
Shaojin Wen
ddd21b5330
fix(core): address @tanzhenxin's PR-1 review notes (post-merge follow-up to #3842) (#3886)
* fix(core): wrap hasOwnProperty.call inside try + add post-abort PTY data assertion

Two non-blocking review notes from @tanzhenxin's PR-1 approval
(#3842, post-merge follow-up):

- **Note 2 (real bug)**: `getShellAbortReasonKind` had
  `Object.prototype.hasOwnProperty.call(reason, 'kind')` outside the
  try/catch. `hasOwnProperty.call` triggers the `[[GetOwnProperty]]`
  Proxy trap (`getOwnPropertyDescriptor` handler). A Proxy whose
  `getOwnPropertyDescriptor` throws — separate from a throwing `get`
  trap, which the prior commit already covered — would propagate past
  the helper, leaving the abort handler's switch on `kind` to throw
  through `addEventListener` (which doesn't await async listener
  return values), so the shell process would stay alive instead of
  being killed on cancel. Moved the descriptor probe inside the same
  try block as the value read.

  My own multi-round audit covered six attack vectors but missed this
  one — only `get` trap throws were considered. The reviewer caught
  it on first pass.

- **Note 3 (test parity)**: the PTY post-promotion handoff test
  asserted `dispose` was called but never re-invoked the data
  callback to verify the foreground `onOutputEvent` actually stops
  firing — the child_process equivalent has that assertion. Mirrored
  it: emit data AFTER abort by re-invoking the captured `dataCallback`
  reference, and assert `onOutputEventMock.mock.calls.length` does NOT
  increase past the moment of promote. Exercises the production
  `listenersDetached` guard inside the chain callback, which the
  bare dispose-was-called check didn't.

Note 1 from the same review (the `aborted: true + promoted: true`
shape forcing PR-2 callers to check `promoted` before `aborted`) is
deliberately NOT addressed here — it's a contract simplification that
affects PR-2's branching, so it belongs in PR-2 along with the
caller-side decision on whether to flip `aborted` for promoted
results. Added a TODO upstream in #3831 (PR-2 design) to track.

70 / 70 tests pass (69 baseline + 1 new helper boundary for the
throwing-getOwnPropertyDescriptor case). tsc + ESLint clean.

* test(core): fix tautological PTY post-promote assertion (audit follow-up)

The PTY post-promotion handoff test added in the previous commit
copied the child_process equivalent's pattern verbatim — sync
\`expect(count).toBe(countAtPromote)\` immediately after dataCallback
returns. That works for child_process because its \`handleOutput\` is
fully synchronous (sniff → decoder → emit, all on the same call
stack), so the count change happens BEFORE the assertion.

PTY's \`handleOutput\` is async — \`processingChain.then(...)\` queues a
microtask that does the sniff + write + render-then-emit work. The
sync assertion captures both \`countAtPromote\` and the post-emit count
BEFORE the chain microtask ever runs, so both reads return whatever
happened before the assert (typically 0). The test would
tautologically pass even if the production \`listenersDetached\` guard
were removed — i.e., it didn't actually verify the guard.

Restructured to:
1. Drive the PTY through \`simulateExecution\` so \`await handle.result\`
   forces all queued microtasks (including pre-promote chain items
   AND the abort handler's drain) to settle.
2. Capture \`eventCountAfterSettle\` once everything has stabilized.
3. Re-invoke the captured \`dataCallback\` with post-promote data,
   await two more macrotask boundaries to let the new chain item
   fully run.
4. Assert the count hasn't moved.

If the production \`listenersDetached\` guard is removed, the
post-promote chain item emits, count increases past
\`eventCountAfterSettle\`, and this assertion fails. So the test
actually exercises the guard now.

Found in self-audit while reviewing my own follow-up commit. Caught
because audit was paranoid about *whether the test verifies what it
claims to verify*, not just whether it passes.

70 / 70 tests pass; tsc + ESLint clean.

* test(core): pin eventCountAfterSettle === 0 in PTY post-promote test

The previous fix asserted post-promote count equals the
\`eventCountAfterSettle\` baseline, but didn't pin the baseline
itself. With the production \`listenersDetached\` guard intact, both
halves (pre-promote chain and post-promote chain) suppress emit, so
\`eventCountAfterSettle === 0 === post-count\` and the relative
comparison is vacuously true.

If a future refactor changed the production guard semantics so the
pre-promote chain item DID emit (count becomes 1+ after settle), the
relative-comparison test would still pass as long as post-promote
also emitted the same number — that's a regression the test should
catch. Adding \`expect(eventCountAfterSettle).toBe(0)\` makes the
contract explicit: once \`listenersDetached\` is set during the abort
handler's sync part, BOTH the in-flight chain item (pre-promote) and
the future chain item (post-promote) skip emit.

Found in another self-audit pass — even after fixing the tautological
assertion, the test could still mask certain future regressions.
70 / 70 tests pass; tsc + ESLint clean.

* test(core): drop dataCallbackHolder pattern, read mock.calls directly

The dataCallbackHolder pattern (capturing the onData callback inside
simulateExecution then invoking it after via a closure-shared object)
was unnecessary indirection — \`mockPtyProcess.onData.mock.calls[0][0]\`
reads the same callback reference whether you read it inside or outside
the simulation closure. Vi's mock.calls array is per-mock-instance and
beforeEach re-creates mockPtyProcess + .onData freshly, so there's no
stale-reference risk in the simpler form.

No behavior change in what's tested. 70 / 70 pass.

* chore(core): drop in-source @-mention attribution + dead Proxy.get handler

Two cosmetic cleanups found in another self-audit pass:

- **Helper comment**: removed the parenthetical attribution ("Caught
  by @tanzhenxin in the PR-1 review; my own audit only covered \`get\`
  trap throws."). The technical content of the comment — explaining
  why both the descriptor probe and the value read live inside the
  try — stands on its own. The reviewer credit lives in commit
  history / PR description, where it belongs; an in-source @-mention
  ages poorly (handles change, the relevant person may move on) and
  doesn't help future readers reason about the code.

- **Test Proxy**: \`throwingDescriptorProxy\` declared a \`get()\` handler
  that always returned \`undefined\`. The descriptor probe throws
  before the helper ever reaches the value read, so the \`get\`
  handler is unreachable — dropped it and added a one-line comment
  explaining why no \`get\` handler is needed for this test. Mirror
  test (\`throwingReason\` with throwing accessor + \`proxyReason\` with
  throwing \`get\`) keeps the symmetric "throwing-`get`" coverage.

70 / 70 tests pass; tsc + ESLint clean.
2026-05-07 13:42:42 +08:00
Shaojin Wen
93139d0eb0
docs(cli): document new banner customization settings (#3885)
Add ui.customBannerTitle, ui.customBannerSubtitle, and ui.customAsciiArt
to the user-facing settings table. Also reword ui.hideBanner to note
that it covers both the logo column and the info panel and that Tips
render independently.

These settings landed in #3710 but only ui.hideBanner was listed in
the table, so users had no way to discover the other three short of
reading the schema or the design doc.
2026-05-07 13:42:22 +08:00
易良
379da6327c
fix(core): improve stream rate-limit retry handling (#3790)
* fix(core): improve stream rate-limit retry diagnostics

* fix(core): honor retry-after for stream rate-limit retries

* fix(core): support response retry-after headers

* fix(core): guard rate-limit diagnostics payloads

* fix(core): tolerate null retry-after headers

* fix(core): harden rate-limit retry diagnostics
2026-05-07 11:58:55 +08:00
kkhomej33-netizen
49f8828849
Create project temp dir before saving truncated output (#3875) 2026-05-07 11:58:46 +08:00
Shaojin Wen
7124c6cba0
fix(core): rebuild tool registry on subagent Config overrides so bound tools resolve to the subagent (#3873)
* fix(core): rebuild tool registry on subagent Config overrides so bound tools resolve to the subagent

PR-B (#3774) added per-Config FileReadCache isolation via Object.create
overrides at two subagent spawn sites — agent.ts:createApprovalModeOverride
and subagent-manager.ts:maybeOverrideContentGenerator. The override
shielded code that read FileReadCache directly through the Config
instance, but missed the bound-tool path: Config.createToolRegistry runs
once at parent initialise time, so the parent's EditTool / WriteFileTool
/ ReadFileTool instances are bound with `this.config = parent`. The
subagent's Object.create wrapper inherited getToolRegistry via the
prototype chain, reaching the parent registry whose bound tools then
read FileReadCache and approval mode from the parent.

This change closes that gap by rebuilding the tool registry on the
override at both sites — the same pattern InProcessBackend.createPerAgentConfig
already uses:

  - override.createToolRegistry(undefined, { skipDiscovery: true })
  - registry.copyDiscoveredToolsFrom(base.getToolRegistry())
  - override.getToolRegistry = () => registry

createApprovalModeOverride becomes async; its single call site already
ran inside an async block. maybeOverrideContentGenerator skips the
rebuild when the upstream Config already has its own getToolRegistry
(real-world case: agent.ts wrapper passed through createAgentHeadless),
avoiding wasted work, listener accumulation on shared SubagentManager /
SkillManager, and a cache split where the bound tools' registry layer
diverges from the runtime context's lazy-init cache.

Includes regression tests in agent-override.test.ts and
subagent-manager-override.test.ts that exercise the bound-tool path:
they instantiate the lazy factories on the override registry and
assert that EditTool / WriteFileTool / ReadFileTool resolve
this.config to the override Config (and thus to the override's
FileReadCache / approval mode), not the parent.

* fix(core): close bound-tool gap on resumed background agents too

Follow-up audit on PR #3873 surfaced a duplicate, pre-rebuild copy of
`createApprovalModeOverride` living in `background-agent-resume.ts`
(L142-150). Resumed fork agents go through `createResumedForkSubagent`
which bypasses `SubagentManager.maybeOverrideContentGenerator` (where
the registry rebuild now lives), so the resumed fork's
`EditTool` / `WriteFileTool` / `ReadFileTool` were still resolving
`this.config` to the parent and reading the parent's `FileReadCache`.

The non-fork resume path went through `subagent-manager` and worked
correctly only because `maybeOverrideContentGenerator` saw no upstream
own-registry on `bgConfig` and rebuilt one — but with that fallback
the fork path could never benefit.

This change deletes the local copy and switches `background-agent-resume.ts`
to import the now-async exported `createApprovalModeOverride` from
`agent.ts`. Drops the previous `?: this.config` short-circuit so the
resumed agent ALWAYS gets a wrapper Config — the same behaviour
`agent.ts` already enforces; reusing the parent directly defeats the
per-Config FileReadCache isolation.

Updates `background-agent-resume.test.ts` mock config with the
`createToolRegistry` / `getToolRegistry` stubs the rebuild path now
exercises.

* fix(core): address bound-tool isolation review feedback

Three independent fixes from PR #3873 review feedback:

1. Switch the upstream-rebuild guard from
   `hasOwnProperty(base, 'getToolRegistry')` to a Symbol-keyed marker
   `TOOL_REGISTRY_REBUILT`. The own-property check missed the case
   where the override is reached via an Object.create wrapper above
   the rebuilt Config (e.g. `bgConfig = Object.create(agentConfig)` in
   the agent.ts background path) — it would falsely report "no
   upstream rebuild" and cause a redundant third rebuild that wastes
   work and doubles the listener-leak surface. Symbol property reads
   walk the prototype chain via normal lookup, so a marker stored on
   any ancestor is correctly observed.

   Extracts the shared rebuild logic into
   `rebuildToolRegistryOnOverride(override, base)` so the three spawn
   sites (agent.ts:createApprovalModeOverride, the inherits branch,
   the non-inherit branch) cannot drift apart.

2. Stop the per-subagent ToolRegistry in the lifecycle finally blocks:
   - agent.ts foreground finally (after the inner try wrapping
     `runFramed`)
   - agent.ts background bgBody finally (after `bgSubagent.execute`
     resolves)
   - background-agent-resume.ts resume body finally (same shape)

   Without this, every AgentTool / SkillTool the model instantiates
   from the per-subagent registry registers a change-listener on
   shared SubagentManager / SkillManager, and repeated subagent runs
   accumulate listeners for the rest of the session. Stop is
   fire-and-forget, matching `InProcessBackend.cleanup` and
   `stopAgent`.

3. Add bound-tool isolation tests for the non-inherit branch
   (explicit-model selector). The original PR only covered the
   inherits branch directly; the non-inherit branch now goes through
   the same helper, but a dedicated test pins
   `tool.config === override` and the FileReadCache binding so a
   regression cannot leave explicit-model subagents reading the
   parent's cache while existing model-override tests still pass.

Tests now exercise:
- Symbol marker propagation via Object.create chain (3 cases)
- Non-inherit rebuild + bound-tool isolation
- Non-inherit skip-rebuild when upstream wrapper has the marker
- Pre-existing inherits / chained-override / approval-mode propagation
- Mock configs in agent.test.ts / subagent-manager.test.ts /
  background-agent-resume.test.ts gain `stop` and `tools: Map` stubs
  to model the registry contract the override path now exercises.

`npx vitest run packages/core/src` — 268 files / 6943 passed.
2026-05-07 10:31:32 +08:00
Shaojin Wen
b9f56e4344
feat(core): add signal.reason convention for ShellExecutionService (#3831 PR-1 of 3) (#3842)
* feat(core): add signal.reason convention for ShellExecutionService.execute()

Foundation for #3831 Phase D (b) — Ctrl+B promote of a running foreground
shell to background. Defines a discriminated `ShellAbortReason` union that
the AbortSignal carries; default behavior (no reason / `{ kind: 'cancel' }`)
keeps the existing tree-kill on abort. `{ kind: 'background' }` is a takeover
signal — execute() skips the kill, drops the child from its active set (so
cleanup() won't kill it later), flushes a snapshot of captured output, and
resolves the result Promise immediately with `promoted: true` so the
awaiting caller unblocks.

Pure plumbing: no caller sets the reason yet, so this is a zero-behavior
change for existing call sites. The `promoted?: boolean` field is optional
on ShellExecutionResult so existing consumers compile against the new shape
without source changes.

Tests pin both branches in both childProcessFallback and executeWithPty:
default abort still SIGTERM-tree-kills; `{ kind: 'cancel' }` is identical to
default (pin against accidental routing through the background branch);
`{ kind: 'background' }` skips the kill, snapshot output is preserved,
mockProcessKill / mockPtyProcess.kill are NOT called.

Part of #3831 (Phase D part b — Ctrl+B promote running shell to background).
PR-1 of 3.

* fix(core): detach service listeners on background-promote (resolve review)

Addresses 4 Critical + 2 Suggestion findings on PR-1 of #3831:

- **childProcess listener detach** (review line 555 + 573): Anonymous arrow
  listeners on stdout/stderr/error/exit could not be off()'d. After
  background-promote, post-promote bytes would re-enter handleOutput, which
  then calls decoder.decode() on a now-finalized text decoder (cleanup()
  already called .decode() without stream:true) → TypeError crash. Even
  without the crash, old onOutputEvent would fire for new data → ownership
  contract violation + duplication. Fix: extract named handler refs
  (stdoutHandler / stderrHandler / errorHandler / exitHandler) and call
  off() on all four in the background-promote branch via a
  detachServiceListeners() helper.

- **PTY listener detach** (review line 967 + 990): node-pty's onData / onExit
  return IDisposable handles; the abort handler now captures
  dataDisposable / exitDisposable and calls .dispose() in the
  background-promote branch. ptyProcess.on('error') is EventEmitter-style
  (not IDisposable) — extract a named ptyErrorHandler ref and off() it.
  Without these, post-promote PTY error throws → Node.js crash; post-promote
  data continues writing to headlessTerminal and calling old onOutputEvent
  → ownership violation.

- **PTY in-flight chain item ownership** (related to review line 990):
  processingChain may have already-enqueued callbacks past the early
  listenersDetached check. Refactored from "early-return short-circuit" to
  "guard each onOutputEvent emit individually" so in-flight writes still
  LAND in headlessTerminal (snapshot reflects them) but no events leak to
  the foreground onOutputEvent. Also clear renderTimeout in the abort
  handler so a pending throttled render doesn't fire post-promote.

- **PTY snapshot freshness** (review line 972, suggestion): The original
  abort handler called serializeTerminalToText immediately. Now we
  await Promise.race([processingChain drain, SIGKILL_TIMEOUT_MS]) first
  (mirrors the onExit finalize pattern at ~line 970) so in-flight
  headlessTerminal.write callbacks land before serialization. Skipped
  render(true) intentionally because it would emit final onOutputEvent
  data (renderFn calls onOutputEvent), violating the "no emit post-promote"
  invariant — added a comment explaining why direct serialize is correct.

- **Handoff-boundary tests** (review line 1257, suggestion): Added 4 new
  tests pinning the ownership contract — 2 for child_process (post-promote
  stdout/stderr does NOT route to onOutputEvent; child exit does NOT
  re-resolve result), 2 for PTY (data/exit disposables ARE called; result
  shape stays promoted: true even if post-promote events fire).

Also: test setup now stubs mockPtyProcess.onData / .onExit to return
{ dispose: vi.fn() } so the background-promote path's dispose() calls
don't crash on undefined (the stub's mock.results[0].value is then
inspected by the new handoff tests).

58 / 58 tests pass (50 baseline + 4 first-pass + 4 handoff). Total +235 / -35
on top of the prior commit.

* fix(core): defensive hardening for ShellExecutionService background-promote (resolve 2nd review pass)

Addresses 6 follow-up [Suggestion] threads on PR-1 of #3831 — all
substantive code-quality issues raised by the second-pass review of
the dispose-based detach commit (8e8e18ca7):

- **Exhaustive switch on `ShellAbortReason.kind`** (both abort handlers).
  Earlier `if (reason?.kind === 'background')` form silently fell
  through to kill for any unrecognized variant — a future
  `{ kind: 'suspend' }` would have killed the process with zero
  compile-time signal. Switched to `switch (kind)` with a `never`-typed
  default that runs `debugLogger.warn` and falls back to the safest
  behavior (cancel/kill). Each branch is now extracted into a named
  helper (`performBackgroundPromote` / `performCancelKill`) so the
  switch body stays a single screenful.

- **Each `dispose()` wrapped in its own try/catch** (PTY). node-pty's
  `IDisposable` contract doesn't guarantee no-throw. Without per-dispose
  try/catch a single throwing dispose() would skip subsequent cleanup
  (the other dispose, off('error'), activePtys.delete, drain, resolve)
  and the caller would hang forever on `await result`. Each call now
  logs via debugLogger.warn on failure but continues.

- **`.catch(() => undefined)` on the processingChain side of the drain
  race** (PTY). `Promise.race([processingChain.then(drain).then(drain),
  timeout])` would propagate a chain rejection out of the race; since
  `addEventListener` doesn't await our handler, the rejection became
  unhandled and `resolve()` was never called → caller hung. Now the
  rejection is swallowed; the timeout side still terminates the race
  on time.

- **Drain-timeout truncation now emits a diagnostic warning** (PTY).
  Previously the 200ms drain timeout could fire, the snapshot would be
  taken with the buffer in mid-write state, and the result.output
  would be silently truncated. Race result is now observed via a
  symbol sentinel; when the timeout side wins, debugLogger.warn fires
  pointing the user at rawOutput as the un-truncated fallback.

- **Snapshot serialize failure logs instead of swallowing silently**
  (PTY). Empty `catch {}` made result.output indistinguishable from
  "command produced no output" if serializeTerminalToText threw. Now
  `debugLogger.warn` with the error message leaves a trail for support
  bundles.

- **Dedicated `PROMOTE_DRAIN_TIMEOUT_MS` constant** separated from
  `SIGKILL_TIMEOUT_MS`. Both are 200ms today, but they have unrelated
  reasons-to-change (kill escalation timing vs. promote drain
  ceiling) — sharing the constant means tuning one would silently
  change the other.

Also adds a module-level `debugLogger = createDebugLogger('SHELL_EXECUTION')`
since the service had no logging surface before this commit.

58 / 58 tests pass; tsc clean; ESLint clean. No new tests added: the new
behaviors (timeout sentinel firing, dispose throw, exhaustive switch
default) are defensive log-only paths; existing handoff tests already
cover the happy path. Adding mock-throw tests is reasonable
follow-up but not blocking.

* fix(core): real bug — ptyProcess.off → removeListener; defensive abort-reason read

Resolves the third review pass on PR-1 of #3831 — 1 real bug + 2
defensive hardenings:

- **Real bug: `ptyProcess.off('error', ...)` throws TypeError at
  runtime** (line ~1074). `@lydell/node-pty`'s `IPty` interface
  exposes the legacy Node EventEmitter `removeListener`, not the
  modern `off` alias. Previous form threw, the surrounding try/catch
  swallowed it (post-prior-pass dispose hardening), but the old
  `ptyErrorHandler` stayed registered — so a post-promote PTY error
  would still hit our foreground handler and `throw err`, breaking
  the handoff contract that PR-1's whole listener-detach work is
  supposed to enforce. Switched to `removeListener`. The catch +
  warn stays as defense-in-depth; the message wording is updated.

- **Prototype-pollution-safe `kind` read** (extracted to module-level
  helper `getShellAbortReasonKind`). The previous `reason?.kind`
  walked the prototype chain — a polluted
  `Object.prototype.kind = 'background'` would silently route
  `abortController.abort({})` (any plain object reason) into the
  promote branch and skip the kill. Lifecycle/safety branch deserves
  the extra check. Helper now: rejects non-object reasons; reads
  `kind` only as an OWN property (`hasOwnProperty`); whitelists
  against `'background' | 'cancel'`; defaults to `'cancel'` (the
  safe historical behavior) for everything else. Both abort handlers
  (childProcess + PTY) now share this helper.

- **`streamStdout: true` + background-promote = silent empty
  snapshot** (childProcess `performBackgroundPromote`). The promote
  snapshot reads from the `stdout` / `stderr` string accumulators;
  but in `streamStdout` mode `handleOutput` forwards bytes through
  `onOutputEvent` and skips the accumulators entirely. Today PR-1's
  only call site (foreground shell.ts) uses `streamStdout: false`,
  so the combination is unreachable — but if a future caller pairs
  the two, `result.output` would be empty with no diagnostic. Added a
  `debugLogger.warn` when the combination occurs, pointing the caller
  at `rawOutput` as the fallback. Cheaper than building a parallel
  accumulator just for this latent case.

58 / 58 tests pass; tsc clean; ESLint clean.

* fix(core): liveness check + throw-safe abort-reason read + encoding-aware PTY snapshot (resolve 4th review pass)

Resolves 6 threads on PR-1 of #3831 — 1 Critical + 1 real bug + 2
quality + 2 test-coverage:

- **[Critical] `getShellAbortReasonKind` throw-safe property read.**
  Previous form read `reason.kind` after only checking that `kind` is
  an own property. An own accessor that throws (or a Proxy with a
  trapping getter) would throw before the helper reached either the
  cancel kill path or the background promote path. Abort handlers are
  dispatched async and not awaited by AbortSignal, so a leaked throw
  here would have left the shell process alive instead of being killed
  on cancel — quietly. Wrapped the property read in try/catch with a
  fall-back to the safe 'cancel' kill behavior.

- **Real bug: child_process post-exit race in background-promote**
  (`performBackgroundPromote`). The child may have already exited but
  the 'exit' event hasn't reached our handler yet (Node delivers
  events on the next microtask). Promoting in that window would
  detach our exit listener and report `promoted: true` for a process
  that's already dead — the caller would hold an inert pid expecting
  to take over. Now we read `child.exitCode` / `child.signalCode`
  before detaching: if either is non-null, fall through and let the
  pending exit handler resolve normally with the real exit info.
  Mirrored mock setup so `exitCode` / `signalCode` default to `null`
  (matching real ChildProcess) instead of `undefined`.

- **PTY snapshot: re-decode + replay (mirror exit-path encoding).**
  The promoted snapshot was serializing `headlessTerminal` directly,
  which was fed by a streaming decoder initialized from the
  first-chunk encoding heuristic. When early output is ASCII-only but
  later output is in a different encoding (GBK / Shift-JIS / etc.),
  this produces mojibake — and the normal exit path doesn't, because
  it re-decodes `finalBuffer` with `getCachedEncodingForBuffer` and
  replays through a fresh terminal. Now mirrors that logic so
  `result.output` shape matches across the two paths. Direct-serialize
  remains as a last-ditch fallback if replay throws.

- **Switch `default` no longer emits a runtime warn.** Reviewer noted
  the helper's whitelist made the `default: { _exhaustive: never }`
  branch unreachable at runtime — the `debugLogger.warn` in it could
  never fire. Kept the `_: never = kind` type assertion (so a future
  ShellAbortReason variant forces a TS error here, directing the
  developer to extend BOTH the helper's whitelist AND add a `case`),
  removed the unreachable warn. Added a comment that the assertion is
  the static-only safety net the union expansion would trigger.

- **Direct unit tests for `getShellAbortReasonKind`** (8 cases). The
  helper's prototype-pollution defense is the main reason it exists;
  if `hasOwnProperty` is accidentally removed the regression would
  silently send `abortController.abort({})` (any plain reason) into
  the promote path. Exported the helper and added direct tests for:
  null / undefined, non-object, empty object (no own kind), prototype-
  only kind (pollution), unknown kind value, throwing accessor, Proxy
  trap, and the two happy paths.

- **`removeListener` regression guard.** The fix to call
  `ptyProcess.removeListener('error', ...)` instead of `.off(...)`
  matters because `@lydell/node-pty`'s IPty interface only exposes
  `removeListener` — `.off()` throws TypeError on a real PTY but the
  EventEmitter mock tolerates both. Added a test that spies on both
  methods and asserts the production code uses `removeListener` for
  the 'error' event, so a future swap back to `.off()` regresses
  loudly under the mock instead of silently.

68 / 68 tests pass (58 baseline + 9 helper boundary + 1 removeListener
guard + 1 post-exit race); tsc clean; ESLint clean.

* fix(core): PTY background-promote post-exit race guard (resolve 5th review pass)

Mirrors the child_process post-exit race fix from 4cc558b3d into the
PTY path — addresses 1 [Critical] thread on PR-1 of #3831:

The PTY may have already exited but our `exitDisposable` (onExit
callback) hasn't run yet — node-pty delivers the exit event
asynchronously after the PTY's native SIGCHLD, so there's a window
between "PTY actually dead" and "service onExit fires". Promoting in
that window detaches our exit listener and reports `promoted: true`
for a dead PTY, losing the real exit status; the caller would hold an
inert pid expecting to take over.

The IPty interface doesn't expose an `exitCode` field we can read
directly (unlike `child.exitCode` / `child.signalCode` for
child_process), so use `process.kill(pid, 0)` as a best-effort
liveness check via the existing `ShellExecutionService.isPtyActive`
helper. If kill(pid, 0) throws ESRCH, the pid is gone — log at debug
level and fall through, letting the pending onExit callback resolve
normally with the real exit info.

Also adds a unit test mirroring the child_process race test: mocks
`process.kill(pid, 0)` to throw ESRCH on the liveness probe, asserts
the result has no `promoted: true` and reports the real exitCode.

69 / 69 tests pass; tsc clean; ESLint clean.

* docs(core): correct getShellAbortReasonKind boundary-test count in JSDoc

Doc said 'all six edge cases' but the test suite has 8 cases (added
Proxy-trap and undefined later). Off-by-2 cosmetic only — no behavior
change. Caught during a multi-round self-audit of PR-1 of #3831.

Audit summary: 7 rounds (correctness / reverse / consistency / coverage
/ build / exception paths / style) found one false-positive (a sync-
abort registration-order race I initially thought existed). Verified
that Node's WHATWG AbortSignal does NOT auto-fire 'abort' listeners
on already-aborted signals, so the race window cannot open. No code
change needed for that scenario; this commit is just the JSDoc fix.

69 / 69 tests still pass; tsc + ESLint clean.

* docs(core): document the helper / union / switch sync invariant explicitly

Multi-round self-audit found that `getShellAbortReasonKind`'s value
whitelist has no compile-time tie to the `ShellAbortReason` union: when
the union grows, TypeScript's `_exhaustive: never` in each switch
forces #3 (the case arm) to be added, but the helper's whitelist
(#2) silently keeps degrading the new variant to 'cancel', and the
new case arm is never reached at runtime.

Reviewer #4 raised this on the second pass; the original commit chose
to accept it (option B in that thread) but didn't leave a strong
in-code signal for future contributors. Added an INVARIANT block
inside the helper enumerating the three sites that must be kept in
sync, so the next person extending `ShellAbortReason` sees the
coupling at the place where they're most likely to forget it.

No behavior change — comment-only. 69 / 69 tests still pass; tsc +
ESLint clean.

Audit summary (this round + prior round): 18 angles total over two
sweeps and one reverse-attack pass. Found:
  - 0 real bugs
  - 1 false-positive race (sync-abort registration order — Node WHATWG
    AbortSignal does NOT auto-fire on already-aborted signals;
    investigated, reverted)
  - 1 cosmetic doc fix (boundary-test count off-by-2)
  - 1 cosmetic INVARIANT block (this commit)

Areas reviewed without finding new issues: caller-side
ShellExecutionResult shape compatibility (optional `promoted?` field,
existing callers spread-untouched); `exited` flag lifecycle
(monotonic, cleanup() idempotent); processingChain in-flight
ownership (listenersDetached guards every onOutputEvent emit
including the renderFn-rendered case via the same flag); race
between exit event and abort handler (both microtasks, FIFO ordering
gives correct outcome either way); Node version dependence
(`AbortSignal.reason` is Node 17.2+, engines: >=20 covers it);
test isolation (mockImplementationOnce + module-level mockProcessKill
clears each beforeEach); `process.kill(pid, 0)` Windows liveness
reliability (best-effort, acceptable for PR-1 plumbing); PID reuse
race on the PTY liveness check (theoretically possible, microsecond
window, unavoidable at the OS level — rejected in spec discussion);
PR-2/PR-3 contract surface (caller MUST attach listeners before
abort — documented; any future caller violating this is its own bug).

* test(core): align mockChildProcess.exitCode/signalCode in second beforeEach

The 'execution method selection' describe block has its own
beforeEach (separate from 'child_process fallback') that builds
mockChildProcess but does not set `exitCode` / `signalCode = null`.
Real Node `ChildProcess.exitCode` / `signalCode` are `null` while the
process is alive — and production now reads these in the
background-promote race guard. The current tests in this block don't
exercise the promote path, so they pass regardless, but any future
promote-related test landing here would silently trip the guard
(`undefined !== null` is true) and fall through to the normal-exit
branch instead of promoting.

Mirror the `child_process fallback` block's mock setup so the two
beforeEach hooks produce equivalent ChildProcess shapes, eliminating
a quiet foot-gun for future contributors.

Comment-only / test-fixture change. 69 / 69 tests still pass; tsc clean.
Found during a deeper third-round self-audit of PR-1 of #3831.
2026-05-07 10:31:15 +08:00
ChiGao
4f084352f4
feat(cli): customize banner area (logo, title, hide) (#3710)
* docs(design): add banner customization design (#3005)

Document the design for issue #3005 (customize CLI banner area). Covers
the banner region taxonomy and what is replaceable vs. locked, the three
proposed settings (`ui.hideBanner`, `ui.customBannerTitle`,
`ui.customAsciiArt`) and their resolution pipeline, the schema additions
and wiring touch points, five alternative shapes considered, and the
security / failure-handling guards. Mirrored EN + zh-CN under
`docs/design/customize-banner-area/`. No code changes in this commit;
implementation lands in a follow-up PR.

Generated with AI

Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>

* feat(cli): customize banner area (logo, title, hide)

Adds three opt-in `ui.*` settings that let users replace brand chrome on
startup while keeping the operational lines (version, auth, model, path)
locked: `hideBanner`, `customBannerTitle`, `customAsciiArt` (string,
{path}, or {small,large}).

A new resolver in `packages/cli/src/ui/utils/customBanner.ts` walks the
loaded settings, normalizes each tier per scope (so {path} resolves
against the file that declared it), reads the file with O_NOFOLLOW and a
64 KB cap on POSIX, sanitizes via a banner-specific stripper that drops
OSC/CSI/SS2/SS3 sequences while preserving newlines, and caps art at 200
lines × 200 cols and titles at 80 chars. Every soft failure logs a
`[BANNER]` warn and falls through to the bundled QWEN logo or default
brand title — banner config can never crash the CLI.

`<Header />` now picks the widest custom tier that fits via a shared
`pickAsciiArtTier` helper and falls back to `shortAsciiLogo` otherwise;
`<AppHeader />` extends the existing `showBanner` gate to honor
`hideBanner` alongside the screen-reader fallback.

Tracks #3005 and the design merged in #3671.

* docs(design): apply prettier to banner customization design

Reformats the EN and zh-CN design docs in
`docs/design/customize-banner-area/` to satisfy `npx prettier --check`:
table column alignment and trailing commas in `jsonc` examples. No
content changes — the words, tables, and code blocks all say the same
thing as before.

Carries forward the only actionable feedback from the now-closed
docs-only PR #3671, where the prettier check was the sole change
requested.

* fix(cli): address banner audit findings

Three audit-driven fixes for the banner customization feature:

1. **VSCode JSON schema accepts every documented shape.** The
   `ui.customAsciiArt` entry in
   `packages/vscode-ide-companion/schemas/settings.schema.json` was
   declared as `type: object`, which made VSCode flag the inline-string
   form (`"customAsciiArt": "  ___"`) — a shape the runtime accepts and
   the design doc recommends — as a schema violation. Replaced with a
   `oneOf` covering string, `{path}`, and `{small,large}` (with each
   tier itself string-or-`{path}`).

2. **Narrow terminals no longer leak the QWEN logo over a white-label
   deployment.** When a user supplied custom ASCII art but neither tier
   fit the terminal, `Header.tsx` previously fell back to the bundled
   `shortAsciiLogo` — silently undoing the white-label intent on small
   windows. The fallback now distinguishes "user supplied custom art"
   from "no custom art at all": in the first case the logo column is
   hidden entirely (info panel still renders); in the second case the
   default logo shows as before. Soft-failure paths (missing file,
   sanitization rejection) still fall through to `shortAsciiLogo`.

3. **Sanitizer strips C1 control bytes (0x80-0x9F).** The art and title
   strippers previously stopped at 0x7F, leaving single-byte CSI
   (`0x9B`), DCS (`0x90`), ST (`0x9C`) and other C1 controls intact —
   which legacy 8-bit terminals would still interpret. Aligned the
   ranges with the repo's existing `stripUnsafeCharacters` (in
   `textUtils.ts`) so banner content can't carry interpreted control
   bytes through.

New tests cover: C1 strip in art and title, absolute path reads,
symlink rejection on POSIX, narrow-terminal hide-on-custom-art, and
end-to-end `<AppHeader />` rendering through `resolveCustomBanner`.
The full banner suite is 48 tests (was 42).

* docs(design): clarify cross-scope tier merge and white-label fallback

Two clarifications surfaced by the audit on the implementation PR:

1. The design said `customAsciiArt` follows standard merge precedence,
   but the resolver actually walks scopes per-tier so workspace can
   override only `large` while user keeps `small`. Document that this
   per-tier walk is intentional — both because each `{path}` has to
   resolve against the file that declared it (the merged view loses
   that information) and because it lets users keep a personal default
   tier and override the other one per-workspace.

2. The render-time tier-selection step now distinguishes "user
   supplied custom art but neither tier fits" (hide the logo column
   entirely; falling back to `shortAsciiLogo` would silently undo a
   white-label deployment on narrow terminals) from "user supplied no
   custom art at all" (fall through to `shortAsciiLogo` and let the
   default-logo width gate decide). Step 5's pure soft-failure
   fallback (missing file, sanitization rejection) is unchanged —
   still `shortAsciiLogo`.

Mirrored both edits in the zh-CN translation.

Generated with AI

Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>

* docs(design): add size budget section to banner customization

Question raised on the implementation PR: "why is the test logo `CCA`
instead of the full `Custom Code Agent` — is there a character limit?"

There is no character-count limit on titles or art. There is a
**width budget** driven by terminal columns, plus an absolute
hard cap (200×200 art, 80-char title) to keep malformed input from
freezing layout. The existing user-facing guide didn't quantify the
budget anywhere, so users were guessing why long inline names didn't
render.

Add a "How wide can the logo be? — the size budget" subsection that
spells out the formula
(`availableLogoWidth = terminalCols − 4 − 2 − 44`), tabulates it at
80 / 100 / 120 / 200 cols, calls out that a 17-char brand like
"Custom Code Agent" can't render as a single ANSI Shadow line on most
terminals (~120 cols of art), and shows the stacked-words
`{ small, large }` recipe — including the `figlet` one-liner that
generates the corresponding `banner-large.txt`.

Mirrored in the zh-CN translation.

Generated with AI

Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>

* docs(design): add limits-at-a-glance table; switch demo to Custom Agent

The banner-customization design now has the size budget written down,
but the per-cap limits (80-char title, 200×200 art, 64 KB file) were
buried inside the size-budget formula table. Surface them as their own
"Limits at a glance" subsection at the top of the user-configuration
guide so users see the hard caps before they start hand-crafting art.

Also switch the running example from "Custom Code Agent" (17 chars,
~120 cols of ANSI Shadow art on one line — too wide for any common
terminal) to "Custom Agent" (12 chars, two-word stack at ~54 cols ×
12 lines, fits any terminal ≥ 104 cols). The figlet recipe is now a
two-word pipeline so a copy-paste run produces art the size the doc
claims.

Mirrored both changes in the zh-CN translation. The implementation
itself is unchanged.

Generated with AI

Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>

* fix(cli): address PR review + CI Lint failure

Two reviewer findings on PR #3710 (and the Lint job that fails for the
same root cause):

1. **Schema regen now reproduces the committed JSON Schema.** The CI
   Lint step runs `npm run generate:settings-schema` and fails when
   the worktree dirties — my earlier hand-authored `oneOf` got blown
   away because `customAsciiArt` is `type: 'object'` in the source
   schema and the generator had no way to emit a union.

   Add a `jsonSchemaOverride` escape-hatch field on `SettingDefinition`:
   when set, the generator emits the override verbatim (description
   carried forward) instead of the type-driven shape. Set it on
   `customAsciiArt` to express the runtime union (string | {path} |
   {small,large} where each tier is itself string-or-{path}). The
   committed schema is now regenerated from source and CI's
   regenerate-and-diff check passes; two back-to-back regens produce
   identical output.

2. **Untrusted workspace settings no longer influence the banner.**
   `collectScopedTiers()` walked `settings.workspace` directly because
   per-scope file paths are needed to resolve relative `{path}`
   entries — but that bypassed the trust gate that
   `settings.merged` enforces. An untrusted checkout could therefore
   render its own ASCII art and trigger local file reads through a
   `{path}` entry before the user trusts the folder. Skip
   `settings.workspace` entirely when `settings.isTrusted` is false.
   Two regression tests cover the gate (untrusted = workspace
   silenced, falls through to user; trusted = workspace honored).

Test suite for the banner is now 30 resolver tests + the existing
Header / AppHeader / settingsSchema tests = 66 total, all green.

* feat(cli): add ui.customBannerSubtitle for the spacer row

Adds a fourth opt-in setting to the banner customization surface.
The info panel renders four rows (title, subtitle/spacer, status,
path); the second row was a hard-coded single-space spacer up to
now. With this change a fork or white-label deployment can set
`ui.customBannerSubtitle` to a one-line subtitle (e.g. "Built-in
DataWorks Official Skills") and have it render in the secondary
text color in place of the spacer. Empty/unset preserves the
previous blank-spacer layout, so the change is back-compat.

The subtitle is sanitized through the same
`sanitizeSingleLine` helper as the title (now factored out): OSC /
CSI / SS2 / SS3 leaders dropped, every other C0/C1 control byte
replaced with a space, internal whitespace collapsed, ends
trimmed. Capped at 160 characters — looser than the title's 80
because tagline / "powered by" copy commonly runs longer — with
the same `[BANNER]` warn on truncation.

Wiring:

- `settingsSchema.ts` — new `customBannerSubtitle` entry next to
  `customBannerTitle`, `showInDialog: false` (free-form text in
  the TUI dialog isn't worth its own picker).
- `customBanner.ts` — `ResolvedBanner.subtitle` field;
  `resolveCustomBanner` populates it; `sanitizeTitle` and the new
  `sanitizeSubtitle` share the same helper.
- `Header.tsx` — when `customBannerSubtitle` is truthy the spacer
  row renders the string (secondary color, single line) instead
  of `<Text> </Text>`. Auth/model and path still sit at their
  usual positions.
- `AppHeader.tsx` — pipes `resolvedBanner.subtitle` through.
- VSCode JSON schema regenerated from source (idempotent).

Tests: 5 new resolver tests (default, sanitize, length cap,
empty, newline + C1 strip), 2 new Header tests (renders subtitle
between title and auth; spacer preserved when unset), 1 new
AppHeader integration test (end-to-end through resolver). Banner
suite is now 35 + 17 + 6 + 16 = 74 tests, all green.

Design docs (EN + zh-CN) updated: region taxonomy now lists four
B-rows; "Limits at a glance" table grows a subtitle row;
"Customization rules" matrix and "How to modify" section gain a
"Add a brand subtitle" example with a rendered four-row preview.

* docs(design): sweep stale 3-setting references after subtitle add

Self-review found several sections of the banner customization design
doc still framed for the original three settings; bring them in line
with the four-setting reality landed in c7aa4a401:

- Region taxonomy ASCII diagram now shows four B-rows
  (① title, ② subtitle, ③ status, ④ path).
- Resolution-pipeline ASCII diagram and step list pick up
  customBannerSubtitle on the input side and the title/subtitle
  sanitize step on the resolver side.
- "Settings schema additions" section lists the fourth entry,
  customBannerSubtitle, and notes the customAsciiArt
  jsonSchemaOverride that landed for VS Code schema reproducibility.
- "Wiring changes" section updates the Header prop list and the
  HeaderProps interface, replaces the brittle line-number anchors
  with file-level anchors, drops the obsolete `paths` second arg
  from resolveCustomBanner, and adds the trust-gate sentence.
- "Security & failure handling" table replaces the
  stripTerminalControlSequences shorthand with the actual
  banner-specific stripper, splits the title/subtitle row to cover
  both, and adds the untrusted-workspace gate as its own row.
- "Verification plan" gains two scenarios: the subtitle row, and
  the untrusted-workspace check that the Critical reviewer comment
  on the impl PR explicitly asked us to lock down.

Mirrored every edit in the zh-CN translation. The implementation
itself is unchanged.

Generated with AI

Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>

* fix(cli): address banner re-review (FIFO, mutex schema, display width, regex dedupe)

Addresses the five findings on PR #3710 from the latest re-review:

1. **[Critical] FIFO/pipe at `customAsciiArt.path` no longer hangs
   startup.** The resolver was calling `openSync(path, O_NOFOLLOW)`
   *before* the `fstatSync(...).isFile()` check; on POSIX, opening a
   FIFO read-only blocks until a writer connects, and `O_NOFOLLOW`
   doesn't help — it only refuses symlinks at the final path
   component. `readArtFile` now `lstatSync()`s first and refuses
   non-regular files (FIFO / socket / device / symlink) before the
   open, while keeping the post-open `fstatSync` check for TOCTOU
   safety against a swap between the lstat and the open. New
   POSIX-only regression test `mkfifo`s a named pipe and asserts the
   resolver soft-fails inside 1 s; if the open ever regresses to
   blocking, the test will hang past the timeout and the assertion
   will catch it.

2. **[Suggestion] `{path}` and `{small,large}` are now mutually
   exclusive in both schema and runtime.** The `jsonSchemaOverride`
   on `ui.customAsciiArt` is split into three branches (string,
   `{path}`, `{small?, large?}`); none of them allow `path` and tier
   keys to co-exist. `normalizeTiers()` mirrors that — an object
   carrying both kinds of keys is now soft-rejected with a `[BANNER]`
   warn rather than letting `path` silently win and dropping the
   tier values. New regression test pins the runtime side.

3. **[Suggestion] Column cap and tier-fit selection now measure in
   terminal cells.** `getAsciiArtWidth` (in `textUtils.ts`) and the
   `MAX_ART_COLS` cap in `customBanner.ts` were both using UTF-16
   `.length`, so 200 CJK fullwidth characters would slip the cap and
   render at ~400 cells, and `pickAsciiArtTier`'s width-fit check
   was wrong for any non-ASCII art. Switched both to
   `getCachedStringWidth` (string-width semantics, already in the
   repo); art truncation walks code points until adding another
   would push the cell width past the cap, so we never split a
   fullwidth code point or surrogate pair down the middle. New
   regression test exercises the CJK fullwidth case.

4. **[Suggestion] `collectScopedTiers()` no longer drops a whole
   scope just because it has no `file.path`.** Inline-string tiers
   don't need an owning settings directory; only `{path}` tiers do.
   The path-presence check was moved into the `{path}` branch, so a
   path-less scope (e.g. `systemDefaults`, future SDK-injected
   scopes) can still contribute inline art. `{path}` entries in such
   a scope soft-fail with a tier-specific `[BANNER]` warn rather
   than killing the whole scope. Two regression tests cover both
   sides.

5. **[Suggestion] OSC / CSI / SS2-3 regex are now authored once.**
   Extracted `TERMINAL_OSC_REGEX`, `TERMINAL_CSI_REGEX`,
   `TERMINAL_SHIFT_DCS_REGEX` from `stripTerminalControlSequences`
   in `@qwen-code/qwen-code-core` and re-export them from the
   package index. `customBanner.ts` reuses the constants for
   `sanitizeArt` (which still has to preserve `\n` / `\t`) and
   delegates the title/subtitle pipeline directly to
   `stripTerminalControlSequences`. Also backported the C1 control
   strip (0x80-0x9F) into the core helper so all callers
   (session-title, etc.) benefit from the same coverage; banner
   sanitizer was the only place catching single-byte CSI / DCS / ST.

Banner suite is now 40 + 17 + 6 + 16 = 79 tests, all green. Schema
regen is still byte-for-byte idempotent. `npm run typecheck` and
prettier clean on touched files.

* fix(cli): replace require() with ES6 import in FIFO test (lint)

The FIFO regression test in 7ccbfaeb1 used a synchronous `require()` to
pull in `node:child_process` so the test could lazy-load `execFileSync`
only when needed. CI Lint flagged it under `no-restricted-syntax` —
the repo enforces ES6 imports throughout, including in tests, with no
exception for `require()`.

Move the import to the top of the file alongside the other `node:` /
vitest imports. The `try/catch` around `execFileSync('mkfifo', ...)`
still gates the test on `mkfifo` being available (rare on a fresh
container, so we skip rather than fail). 40 / 40 tests still pass and
ESLint is clean on the touched file.

---------

Co-authored-by: 秦奇 <gary.gq@alibaba-inc.com>
Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>
2026-05-07 10:17:53 +08:00
jinye
15342b8932
fix(core): shrink file diff session records (#3872)
Some checks are pending
Qwen Code CI / Lint (push) Waiting to run
Qwen Code CI / Test (push) Blocked by required conditions
Qwen Code CI / Test-1 (push) Blocked by required conditions
Qwen Code CI / Test-2 (push) Blocked by required conditions
Qwen Code CI / Test-3 (push) Blocked by required conditions
Qwen Code CI / Test-4 (push) Blocked by required conditions
Qwen Code CI / Test-5 (push) Blocked by required conditions
Qwen Code CI / Test-6 (push) Blocked by required conditions
Qwen Code CI / Test-7 (push) Blocked by required conditions
Qwen Code CI / Test-8 (push) Blocked by required conditions
Qwen Code CI / Post Coverage Comment (push) Blocked by required conditions
Qwen Code CI / CodeQL (push) Waiting to run
E2E Tests / E2E Test (Linux) - sandbox:docker (push) Waiting to run
E2E Tests / E2E Test (Linux) - sandbox:none (push) Waiting to run
E2E Tests / E2E Test - macOS (push) Waiting to run
* fix(core): shrink file diff session records

Trim oversized file edit result displays before writing them to session JSONL while preserving live tool results and diff stats.

Also make resume, ACP replay, and export paths treat saved previews as incomplete so they do not reconstruct fake full diffs.

Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>

* refactor(cli): share truncated diff preview text

Use one helper for ACP replay and export fallback messages so truncated session preview wording cannot drift.

Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>

---------

Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>
2026-05-06 20:48:24 +08:00
Shaojin Wen
5441581392
feat(core): enforce prior read before Edit / WriteFile mutates a file (#3774)
* feat(core): enforce prior read before Edit / WriteFile mutates a file

Introduces a session-scoped invariant: the model cannot mutate an
existing file without having actually Read it (or its post-write
state) earlier in this conversation. Builds on the FileReadCache
landed in #3717.

Two new ToolErrorType codes:
- EDIT_REQUIRES_PRIOR_READ — file has no entry in the session cache.
  The model is told to use read_file first.
- FILE_CHANGED_SINCE_READ — file has an entry but its mtime or size
  drifted since the recorded fingerprint. The model is told to
  re-read before retrying.

EditTool blocks the existing-file path on cache.check; new-file
creation (old_string === '' on a non-existent target) is exempt.
WriteFileTool blocks the overwrite path; new-file creation
(fileExists === false) is exempt.

Both tools route through the existing fileReadCacheDisabled escape
hatch on Config — flipping it bypasses enforcement byte-for-byte,
matching pre-cache behaviour. Operators can use this as a kill switch
if a session falls into a state where the cache cannot be trusted.

ReadFile fix on the auto-memory path: PR #3717 had auto-memory reads
skip the cache entirely (both lookup and record), but with the new
enforcement that means a model that just Read AGENTS.md cannot then
Edit it. Decoupled the two: auto-memory reads still skip the
file_unchanged fast-path (the per-read freshness <system-reminder>
must always reach the model) but DO record into the cache so the
follow-up Edit sees `fresh`. New regression test asserts this.

Test plan
- vitest run (all of @qwen-code/qwen-code-core): 6308 passed, 2 skipped
- 9 new enforcement tests across edit.test.ts and write-file.test.ts:
  unknown rejects, stale rejects, new-file exempt, edit chain stays
  authorised, escape hatch bypasses, plus the auto-memory record
  regression in read-file.test.ts.
- tsc --noEmit clean. eslint clean. core build succeeds.

* test(core): clear shared fileReadCache between write-file.test.ts cases

CI surfaced one Linux-only failure: the prior-read enforcement test
'rejects a write that would overwrite an unread existing file'
returned FILE_CHANGED_SINCE_READ instead of EDIT_REQUIRES_PRIOR_READ.

Root cause: the FileReadCache instance is declared at module scope
(line 41) and shared across every test in write-file.test.ts. State
from earlier tests — most recently the 'records a write' integration
test that records the same path — leaks forward. On Linux the test
ordering puts a record-bearing test before the enforcement test, so
the cache reports `stale` (mtime drifted) instead of `unknown`.
macOS / Windows happen to order them differently and never hit it.

Adding a fileReadCache.clear() to beforeEach gives every test a
known-empty cache, matching how edit.test.ts already isolates its
per-test cache by re-instantiating it.

* fix(core): close prior-read enforcement gaps flagged in 3rd review

Three concrete loopholes / regressions that the original PR-B
introduction left open. All three are addressed in the same commit
because the underlying refactor (move enforcement earlier and tighten
the fresh predicate) is shared across them.

1. fresh != "model has seen the bytes". Pre-fix, requirePriorRead()
   accepted any cache.check === 'fresh'. ReadFileTool records every
   successful read into the cache, including ranged reads
   (offset/limit), truncated full reads, and non-cacheable
   binary/image/audio/video/PDF/notebook reads (lastReadCacheable
   = false). This let the model peek at a slice or a structured
   payload of a file and then mutate the whole thing. Tightened the
   accept predicate to fresh && lastReadAt && lastReadWasFull &&
   lastReadCacheable.

2. Read-less content oracle through calculateEdit error codes. Pre-fix,
   execute() ran calculateEdit (which reads file bytes and counts
   matches) before the enforcement check. A model could probe an
   unread file by attempting Edits with candidate old_strings and
   observing NO_OCCURRENCE_FOUND vs EXPECTED_OCCURRENCE_MISMATCH vs
   EDIT_NO_CHANGE — reverse-engineering content without ever calling
   read_file. Moved enforcement to the top of calculateEdit, before
   any content read; only a stat is performed up to the rejection
   point.

3. Confirmation flow regression. Pre-fix, getConfirmationDetails()
   read the existing file to render a diff for the user, then
   approval flowed to execute() which would freshly check the cache
   and reject. The user could approve a diff computed from current
   bytes the model never saw, and the call would still fail. Moved
   enforcement before the confirmation read in both EditTool (via the
   shared calculateEdit path) and WriteFileTool (explicit check at
   the top of getConfirmationDetails). The user now never sees a
   confirmation diff for an unread file — the call rejects up front.

Public API surface change: requirePriorRead() -> checkPriorRead() that
returns a structured decision, so the same predicate can route into
a CalculatedEdit.error (calculateEdit), a thrown error
(getConfirmationDetails), or a ToolResult (execute) without
duplicating the boolean / message / type plumbing in three shapes.

Reported by pomelo-nwu (3 inline comments on PR #3774).

* refactor(core): close 4 prior-read enforcement gaps from 4th review

1. recordWrite now seeds read metadata on brand-new entries
   (lastReadAt / lastReadWasFull / lastReadCacheable). The strict
   accept predicate added in the previous round (#3 review) requires
   all three, but recordWrite only set lastWriteAt — so a model
   creating a file with Edit (old_string="") or WriteFile and then
   editing it again was rejected on the second edit. The model
   authored the bytes it just wrote; for the purposes of prior-read
   enforcement that counts as having seen them. New regression test
   in edit.test.ts: "allows a create-then-edit-then-edit chain
   without an intervening read".

2. Extracted checkPriorRead into src/tools/priorReadEnforcement.ts.
   The two copies in edit.ts and write-file.ts had already drifted
   (one used ${ReadFileTool.Name}, the other hardcoded 'read_file');
   the boolean guard is security-sensitive and a one-sided fix
   would silently weaken the boundary. The shared utility takes a
   verb ('editing' | 'overwriting') so the user-facing prose can
   differ between callers without duplicating the decision logic.

3. WriteFileTool.execute now runs prior-read enforcement BEFORE
   readTextFile. Pre-fix, an unread overwrite still slurped the
   entire file into memory (encoding / BOM / line-ending detection)
   and only then rejected it: wasted I/O, and momentary in-memory
   custody of bytes the model never legitimately read. Now matches
   the order in getConfirmationDetails().

4. The "rejects a write that would overwrite an unread existing
   file" test now spies on FileSystemService.readTextFile and
   asserts not.toHaveBeenCalled() — without that, the test gave
   false confidence: it passed both pre-fix (read happened, then
   reject) and post-fix (reject before read), so the ordering
   regression in (3) was invisible to the assertion.

Reported by glm-5.1 via /review on PR #3774.

* refactor(core): close 4 prior-read enforcement gaps from 4th review (Copilot)

Five concrete gaps that the previous round of enforcement work left
open. Reported by Copilot via /review on PR #3774.

1. Confirmation-time rejections lost their ToolErrorType code.
   getConfirmationDetails() in both EditTool and WriteFileTool threw
   a plain Error on prior-read failure, which coreToolScheduler
   collapsed into UNHANDLED_EXCEPTION — silently breaking the
   EDIT_REQUIRES_PRIOR_READ / FILE_CHANGED_SINCE_READ contract for
   any approval-required flow.

   Fix: introduce PriorReadEnforcementError that carries
   `errorType: ToolErrorType`. Both confirmation paths now throw it,
   and coreToolScheduler reads `error.errorType` (falling back to
   UNHANDLED_EXCEPTION when absent). New regression tests assert
   the thrown error's `errorType` field for both tools.

2. checkPriorRead's "re-read with read_file" advice was wrong for
   binary / image / audio / video / PDF / notebook files. Their
   ReadFile result always sets lastReadCacheable=false, so the
   message would loop the agent forever on the same rejection.

   Fix: detect the fresh-but-non-cacheable case explicitly and
   return a dedicated message that explains the dead end ("Edit /
   WriteFile cannot mutate that payload safely") instead of asking
   for another read. Updated the existing non-cacheable regression
   test to assert the new message and the absence of "use the
   read_file tool first".

3. checkPriorRead swallowed every stat() failure and returned
   ok:true. EACCES, EBUSY, NFS hiccups, etc. would silently
   re-open the blind-write path the helper exists to block.

   Fix: only ENOENT continues to return ok:true (disappearance
   race). Any other code is fail-closed: returns
   EDIT_REQUIRES_PRIOR_READ with a message that names the errno.
   New regression test in write-file.test.ts spies on fs.promises
   .stat to inject EACCES and asserts the rejection.

4. The auto-memory record regression test only asserted `state ===
   'fresh'`. A future change that recorded auto-memory reads as
   partial / non-cacheable would still satisfy that assertion but
   would actually fail enforcement on every follow-up Edit.

   Fix: also assert lastReadAt is defined, lastReadWasFull is true,
   and lastReadCacheable is true. The full "what enforcement
   requires" predicate is now explicit in the test.

(The 5th item, the WriteFile mirror of (1), is covered by the same
PriorReadEnforcementError change.)

* refactor(core): tighten StructuredToolError naming + add scheduler test

Four follow-ups raised by deepseek-v4-pro on PR #3774. None of them
change the enforcement boundary; they are all about making the
contract clearer and harder to break in future changes.

1. PriorReadEnforcementError -> StructuredToolError. The class now
   wraps any content-derived ToolErrorType from calculateEdit
   (EDIT_NO_OCCURRENCE_FOUND, EDIT_EXPECTED_OCCURRENCE_MISMATCH,
   EDIT_NO_CHANGE, ATTEMPT_TO_CREATE_EXISTING_FILE) on top of the
   prior-read codes. The old name suggested the class was prior-
   read-specific, which would mislead any oncall engineer seeing
   it paired with one of the calculateEdit error codes.

2. EDIT_REQUIRES_PRIOR_READ kept its name (the prefix mentions
   "edit" but the enum is shared with WriteFileTool) — chose
   documentation over rename to avoid the churn of a value rename
   across logs/dashboards already keyed on it. JSDoc now spells
   out the cross-tool usage explicitly.

3. Stat failures other than ENOENT now map to a new
   PRIOR_READ_VERIFICATION_FAILED code instead of being conflated
   with EDIT_REQUIRES_PRIOR_READ. The failure mode is "we cannot
   verify" rather than "definitely not read" — operators routing
   on error codes can distinguish the two populations.

4. Added a coreToolScheduler test (`surfaces error.errorType from
   a confirmation throw instead of UNHANDLED_EXCEPTION`) that
   constructs a stub tool whose getConfirmationDetails throws
   StructuredToolError and asserts the surfaced ToolCall response
   carries the correct ToolErrorType. Without this test the
   scheduler's explicitErrorType branch would have no coverage at
   all.

Tool tests updated for the new StructuredToolError class name and
the PRIOR_READ_VERIFICATION_FAILED code on the EACCES path.

* fix(core): close TOCTOU + grammar + directory regressions in PR-B

Six concrete issues that the previous round of enforcement work
left open. Reported by Copilot via /review on PR #3774.

1. TOCTOU window between pre-read checkPriorRead and readTextFile.
   The pre-read stat could pass enforcement, then an external writer
   could land between that stat and the actual read, leaving
   currentContent reflecting bytes the model never saw — exactly the
   stale-write path the PR is supposed to block. Closed by re-running
   checkPriorRead immediately after every readTextFile that fed
   currentContent / originalContent: EditTool.calculateEdit and the
   two WriteFileTool paths (execute + getConfirmationDetails). A
   `stale` outcome now fails the operation with
   FILE_CHANGED_SINCE_READ at the correct moment.

2. Directory targets sent the model into an enforcement loop.
   `fileExists` is a plain access check, so directories also entered
   the enforcement branch — the model would be told to call
   `read_file`, but `read_file` rejects directories with
   TARGET_IS_DIRECTORY, so the loop never terminated. Fixed in
   checkPriorRead: if `fs.stat` reports the path is not a regular
   file, return `ok: true` so the downstream readTextFile / write
   path can surface its own EISDIR / similar error.

3. Confirmation-time error messages used the short `display` form
   instead of the full `raw` form. Approval-required Edit calls
   therefore lost the remediation detail (file path, stale-vs-unread
   distinction, "without offset / limit / pages" hint) that the
   execute path already surfaced and that the WriteFile confirmation
   path already preserved. EditTool.getConfirmationDetails now
   throws StructuredToolError with `editData.error.raw`.

4. Non-text payload displayMessage was grammatically broken: built
   from the gerund `editing` / `overwriting`, it rendered as
   "cannot editing via this tool" / "cannot overwriting via this
   tool". Fixed by deriving a bare-verb form (`edit` / `overwrite`)
   alongside the gerund and using it in displayMessage.

(Items 1, 5 and 6 from Copilot's batch are the same TOCTOU class —
EditTool calculateEdit + WriteFile execute + WriteFile confirmation —
addressed together in (1) above.)

The "bypasses enforcement entirely" test now uses mockReturnValue
instead of mockReturnValueOnce because calculateEdit calls
getFileReadCacheDisabled twice — once for the pre-read check and
once for the post-read TOCTOU re-check. Both must see disabled=true
to actually bypass.

* fix(core): close fileExists TOCTOU on WriteFile prior-read enforcement

WriteFile gated prior-read enforcement on `fileExists` from
`isFilefileExists()`, but a file that sprang into existence between
that check and the write would still be overwritten without
enforcement — `fileExists === false` skipped the check entirely.

Made the gate unconditional on `fileExists`. checkPriorRead's own
`fs.stat` decides what to do:
- ENOENT → ok:true, fall through to the new-file path as before
- file exists right now (whether or not isFilefileExists saw it) →
  unknown / stale check runs, the race-created file is rejected.

Applied to both getConfirmationDetails and execute. The path that
actually creates new files is unchanged because checkPriorRead's
ENOENT branch is the disappearance-race exit, which is the correct
exit for "the file truly does not exist yet".

Reported by glm-5.1 via /review on PR #3774.

* fix(core): close 4 enforcement gaps + 1 critical bug from 5th Copilot review

Six issues raised by deepseek-v4-pro / glm-5.1 / qwen3.6-plus on
PR #3774. Listed by reviewer-assigned severity.

[Critical] (qwen3.6-plus) recordWrite previously only seeded the
read metadata for brand-new entries. The reproduction was real:
ReadFile(limit=10) → WriteFile(full content) → Edit. The partial
read's lastReadWasFull=false would persist through the write, and
the Edit would be rejected with EDIT_REQUIRES_PRIOR_READ even
though the model just authored every byte. recordWrite now
unconditionally refreshes lastReadAt, lastReadWasFull=true, and
lastReadCacheable=true. The fileReadCache.test.ts case that
previously asserted "preserves lastReadAt" is rewritten to assert
the new "refreshes lastReadAt to match the write" contract, and a
new "upgrades lastReadWasFull / lastReadCacheable after a full
write" regression test pins the reproduction reviewer described.

[Suggestion] (deepseek-v4-pro) Narrowed the non-regular-file
bypass in priorReadEnforcement from `!stats.isFile()` to
`stats.isDirectory()`. The earlier broad form covered FIFOs,
sockets, and devices that the model has no legitimate "read first"
recourse for and that can block readTextFile (FIFO) or
over-allocate (/dev/urandom). Those now flow through to
cache.check() and reject with the unread-file path before any I/O.

[Suggestion] (glm-5.1) Removed the `fileExists && ...` gate from
EditTool.calculateEdit, mirroring the f4ef75667 fix on WriteFile.
A file that springs into existence between isFilefileExists() and
the enforcement check is now caught here as well; ENOENT inside
checkPriorRead remains the disappearance-race exit and new-file
creation flow is unchanged.

[Suggestion] (deepseek-v4-pro) Added debugLogger.warn() at every
post-read TOCTOU rejection site (Edit calculateEdit, WriteFile
getConfirmationDetails, WriteFile execute). These rejections are
rare and self-healing — without a debug record, an operator
investigating "why did this Edit fail once?" had nothing to grep.
debugLogger uses dedicated 'EDIT_PRIOR_READ' / 'WRITE_FILE' tags.

[Suggestion] (qwen3.6-plus) Added a final pre-write checkPriorRead
in EditTool.execute() and WriteFileTool.execute(). The earlier
post-read check ran inside calculateEdit (Edit) or before mkdirSync
(WriteFile), but the actual writeTextFile call could be arbitrarily
later — user approval, modify-and-confirm, etc. The window from
"post-read check → writeTextFile" is now bounded to "pre-write
stat → writeTextFile" (two adjacent syscalls).

* fix(core): close new-file race + special-file enforcement loop

Three issues from the latest Copilot review on PR #3774.

1. New-file race in pre-write enforcement (write-file.ts:348,
   edit.ts:487). The earlier pre-write checkPriorRead was gated on
   `fileExists` (WriteFile) and `!editData.isNewFile` (Edit). If the
   path was absent at planning time and another process created it
   while approval was pending, the gated form would skip enforcement
   and silently overwrite a pre-existing file the model never read.
   Run unconditionally in both tools — checkPriorRead's own ENOENT
   branch is the disappearance-race exit, so genuine new-file
   creation is unaffected, but a race-created file now hits the
   `unknown` branch and is rejected as unread.

2. FIFO / socket / device sent the model into an enforcement loop
   (priorReadEnforcement.ts:220). After narrowing the
   non-regular-file bypass to directories only, FIFOs etc. fell
   through to cache.check, returned `unknown`, and produced a
   "use read_file first" message — but read_file rejects those same
   targets as "not a regular file", so the model would loop on
   read_file forever. Added a dedicated `!stats.isFile()` branch
   (after the directory exemption) that returns a "special file;
   cannot edit/overwrite via this tool — use shell instead" message,
   matching the shape of the existing non-text-payload guidance.

(Tool-error.ts and the non-cacheable policy notes are addressed in
the PR description update — not in code.)

* fix(core): close 4 enforcement gaps from 6th Copilot review

(Plus a doc-only update for the 5th — the mtime+size limitation
warning in the Risk section now mentions the silent-overwrite
escalation that this PR's mutation paths bring along.)

1. ENOENT after the model has already read the file is no longer
   silently treated as `ok: true`. Added an `expectExisting` option
   to `checkPriorRead`; post-read and pre-write callers pass `true`.
   ENOENT under that flag now rejects with `FILE_CHANGED_SINCE_READ`
   ("file disappeared after the model read it") rather than falling
   through to the new-file path with stale bytes. Pre-read callers
   keep the old default (ENOENT → ok:true → fall through to genuine
   new-file creation). EditTool's pre-write check derives the flag
   from `editData.isNewFile`; WriteFile's pre-write check derives it
   from the post-read `fileExists` value.

2. Directory targets now reject with `TARGET_IS_DIRECTORY` and a
   structured message instead of returning `ok: true`. The previous
   form fell through to readTextFile(), which on the WriteFile
   confirmation path threw a plain Error and was surfaced by the
   scheduler as `UNHANDLED_EXCEPTION`. Both Edit and WriteFile now
   emit a structured rejection at enforcement time. (WriteFile's
   build-time validateToolParamValues already rejects directories,
   so the change matters most for EditTool.)

3. Non-cacheable rejection's `rawMessage` no longer hard-codes
   "overwrite" — it now uses the same `verbBare` derivation as the
   `displayMessage`, so EditTool's path correctly says "if you need
   to edit it" and WriteFile's path stays "if you need to overwrite
   it". The previous form was confusing for in-place edits.

4. WriteFile.getConfirmationDetails now mirrors execute()'s
   ENOENT-to-new-file fallback: a file that disappears between
   isFilefileExists() and the readTextFile-for-diff call no longer
   throws a plain Error (which would surface as
   UNHANDLED_EXCEPTION) — it falls back to the brand-new-file diff
   so the user sees a clean confirmation rather than an unstructured
   crash.

Tests:
- New: `rejects an edit on a directory with TARGET_IS_DIRECTORY`
- New: `confirmation falls back to a new-file diff when the file
  disappears mid-flight` (WriteFile)
- Updated: non-cacheable rejection asserts `verbBare` is "edit" on
  the EditTool path and "overwrite" on the WriteFile path.

Reported by Copilot via /review on PR #3774.

* docs(core): clarify stat→write race + EDIT_REQUIRES_PRIOR_READ scope

Three doc-only follow-ups from Copilot's latest review pass on PR
#3774. None change behaviour; the pre-fix code state was already
the actual contract — the docs just lagged it.

1. EDIT_REQUIRES_PRIOR_READ enum comment now lists the three cases
   the code actually returns it for (never-read, partial / ranged /
   non-cacheable read, structural dead end — non-text payload or
   special file). The previous one-liner described only the first
   case and would mislead future maintainers.

2. The Final pre-write freshness check blocks in EditTool.execute
   and WriteFileTool.execute now spell out that they DO NOT
   eliminate the stat → writeTextFile race. The window narrows
   from the previously-unbounded post-read-to-write gap down to
   two adjacent syscalls, but a concurrent writer landing in
   that pair can still be clobbered. Closing the residual would
   require an atomic write (write-to-temp + rename) or a
   content-hash post-write recheck — both deferred. Operators who
   need strict protection set `fileReadCacheDisabled: true` and
   rely on application-level locking.

3. PR description Risk section gains a "Known unmitigated: stat →
   write race window" subsection (English + Chinese mirror)
   matching the code comments.

* chore(core): minor follow-ups from review #4229917446

Three of the five MINOR items raised in the independent code review
on 2026-05-05 — the cheap, isolated ones. The other two (race-
simulating integration test, moving StructuredToolError out of
priorReadEnforcement.ts) are deferred as the reviewer suggested.

1. EditTool now has a symmetric `PRIOR_READ_VERIFICATION_FAILED`
   regression test (mocks fs.promises.stat to reject with EACCES,
   asserts the EditTool path produces the same fail-closed result
   that the existing WriteFile EACCES test pins). Five-line fix to
   close the asymmetry that, while harmless today (the helper is
   shared), would let a future Edit-side change to checkPriorRead
   slip through without test coverage.

2. ensureParentDirectoriesExist / mkdirSync now run AFTER the
   pre-write checkPriorRead in both EditTool.execute() and
   WriteFileTool.execute(). Doing it before would leak intermediate
   directories on the rejection path — a real (if minor) FS litter
   the previous order created on every rejected new-file write.

3. EDIT_REQUIRES_PRIOR_READ enum docstring gains a one-line note
   for operators routing alerts on this code: a single
   `edit_requires_prior_read` signal can mean any of the three
   cases (no read / partial read / structural dead-end), and if
   per-cause monitoring becomes important the enum can be split
   in a follow-up. The originating tool name and the message text
   already disambiguate at runtime.

* fix(core): close 2 correctness gaps from maintainer review #4232751470

Both tracked back to the cache's "track most recent read shape"
model diverging from prior-read enforcement's "model has seen
these bytes" model.

1. SVG (and similar string-content fallbacks) recorded as
   non-cacheable, blocking subsequent Edit / WriteFile.

   `read-file.ts` derives `cacheable` from
   `originalLineCount !== undefined && !isTruncated`. The SVG
   branch in `fileUtils.ts` returned content without
   `originalLineCount`, so `cacheable` collapsed to false and a
   follow-up Edit hit the dead-end "non-text payload — use shell"
   rejection — telling the model to use shell to mutate a file it
   had just successfully read as text. This was a real regression
   vs pre-PR behaviour where SVG-as-text editing worked.

   Fix: SVG-as-text branch now sets `originalLineCount` (split
   on '\n') and `isTruncated: false`, so ReadFile records it as
   a full cacheable read. The binary-fallback string and
   over-1MB SVG branches are deliberately left non-cacheable —
   they return placeholder strings ("Cannot display content of
   ...") rather than file content, so blocking edits there is
   correct. New regression test in `read-file.test.ts`:
   `records SVG-as-text reads with cacheable=true so a follow-up
   Edit passes enforcement`.

2. recordRead unconditionally overwriting lastReadWasFull /
   lastReadCacheable, revoking prior write-author or full-read
   rights.

   The `WriteFile(create) → ReadFile(offset/limit) → Edit`
   sequence rejected the Edit because the partial read clobbered
   the `lastReadWasFull = true` that `recordWrite` had stamped
   at create time. Same shape applies to a full text read
   followed by a partial one of the same inode.

   Fix: `recordRead` is now sticky-on-true for the read flags —
   `if (opts.full) entry.lastReadWasFull = true;` and the
   matching guard for `cacheable`. Prior `true` survives a later
   partial / non-cacheable read. The fast-path `file_unchanged`
   check still gates on the incoming request's own `isFullRead`
   in `read-file.ts`, so a partial read still does not get a
   placeholder it shouldn't. Updated the existing
   "overwrites earlier lastReadWasFull" test to assert the new
   sticky semantics, and added a `lastReadCacheable` symmetric
   test plus a `Write → partial-Read → Edit` end-to-end test in
   `edit.test.ts`.

Reported by tanzhenxin via independent maintainer review on
2026-05-06.

* fix(core): close 3 correctness gaps from re-review #4233904930

All three are tightenings of the prior `de8ddf530` round.

1. **Sticky-on-true narrowed to "no fingerprint drift"**.
   `fileReadCache.recordRead` previously kept `lastReadWasFull` /
   `lastReadCacheable` true across drifted recordings, which
   re-opened a `Read full @X → external write @Y → Read partial
   @Y → Edit` hole: the partial recordRead silently advanced the
   entry's mtime+size to Y while preserving the sticky `full=true`
   from X, so a follow-up Edit ran against bytes the model only
   saw the first 10 lines of. Now the sticky branch only fires
   when `(mtimeMs, sizeBytes)` matches the existing entry; on
   drift, both flags reset to exactly what this read produced.
   New regression test in `fileReadCache.test.ts` reproduces the
   reviewer's reported sequence.

2. **Subagent FileReadCache isolation now covers the
   inherits-model + same-approval-mode common case**. The
   own-property machinery from #3717 only triggers when an
   `Object.create(parent)` actually fires; both
   `agent.ts:990-993` (same-approval-mode) and
   `subagent-manager.ts:699-701` (inherits-model) had paths that
   returned the parent Config directly, so the subagent's
   `getFileReadCache()` resolved to the parent's instance — a
   parent Read could satisfy the subagent's Edit on a path the
   subagent's transcript never contained. Both sites now build
   a thin `Object.create(base)` override unconditionally; no
   method changes for the inherits / same-mode cases, but a
   distinct instance triggers the lazy-init in
   `Config.getFileReadCache()` so the subagent gets an isolated
   cache.

3. **Cache records the read pipeline's internal stat instead of
   a post-read re-stat**. `processSingleFileContent` now
   surfaces its internal stat via `result.stats`, and read-file
   uses that for `recordRead` instead of running its own stat
   after the read returns. Pre-fix, an external write between
   the pipeline call and the post-read stat let the cache record
   fingerprint Y for content the model received at X — a
   subsequent Edit would pass enforcement against bytes the
   model never legitimately saw. The internal-stat-to-read
   window is still a few microseconds wide; that residue is the
   same content-hash territory acknowledged in the Risk section.

Reported by tanzhenxin via re-review on PR #3774.

* docs(core): clarify partial subagent isolation per review #4234090906

tanzhenxin's third review correctly observed that the
`Object.create(parent)` wrappers in `agent.ts:createApprovalModeOverride`
and `subagent-manager.ts:maybeOverrideContentGenerator` only isolate
the FileReadCache for code that consults `Config.getFileReadCache()`
directly. Bound `EditTool` / `WriteFileTool` instances were registered
against the parent's tool registry at initialise time, so tool
invocations still resolve `this.config` to the parent and reach the
parent's cache. `InProcessBackend.createPerAgentConfig` already does
the right thing (`override.createToolRegistry()` +
`copyDiscoveredToolsFrom(base.getToolRegistry())`); bringing that to
these two spawn sites is the real fix.

Reviewer's verdict was COMMENT, not REQUEST_CHANGES — the gap
pre-dates this PR (it's a property of #3717's per-Config own-property
machinery) and pre-PR there was no enforcement on subagent mutations
at all, so the PR is strictly an improvement on every spawn path.
Documented the partial guarantee explicitly:

- Inline comments on both spawn sites note the bound-tool caveat
  and point at `InProcessBackend.createPerAgentConfig` as the model
  for the follow-up.
- PR description's subagent paragraph (English + Chinese mirror) now
  splits into "fully isolated" (`InProcessBackend.createPerAgentConfig`)
  and "partial isolation" (the two sites in this PR) so readers don't
  walk away with the wrong contract.

Filing the registry-rebuild work as a follow-up; not in this PR.
2026-05-06 17:17:48 +08:00
tanzhenxin
8a1ed565a6
fix(core): auto-compact subagent context to prevent overflow (#3735)
Some checks are pending
Qwen Code CI / Lint (push) Waiting to run
Qwen Code CI / Test (push) Blocked by required conditions
Qwen Code CI / Test-1 (push) Blocked by required conditions
Qwen Code CI / Test-2 (push) Blocked by required conditions
Qwen Code CI / Test-3 (push) Blocked by required conditions
Qwen Code CI / Test-4 (push) Blocked by required conditions
Qwen Code CI / Test-5 (push) Blocked by required conditions
Qwen Code CI / Test-6 (push) Blocked by required conditions
Qwen Code CI / Test-7 (push) Blocked by required conditions
Qwen Code CI / Test-8 (push) Blocked by required conditions
Qwen Code CI / Post Coverage Comment (push) Blocked by required conditions
Qwen Code CI / CodeQL (push) Waiting to run
E2E Tests / E2E Test (Linux) - sandbox:docker (push) Waiting to run
E2E Tests / E2E Test (Linux) - sandbox:none (push) Waiting to run
E2E Tests / E2E Test - macOS (push) Waiting to run
SDK Python / SDK Python (3.10) (push) Waiting to run
SDK Python / SDK Python (3.11) (push) Waiting to run
SDK Python / SDK Python (3.12) (push) Waiting to run
* fix(core): auto-compact subagent context to prevent overflow

Subagent chats accumulated history without ever compacting, so a long
multi-turn run could hit "max context length exceeded" before the
compaction logic the main session uses had a chance to fire.

Move compaction down into the chat layer so both main agent and subagent
auto-compress at the configured threshold, and surface the result via a
new chat stream event that bridges into the existing ChatCompressed UI
path. The main-session wrapper still owns full /compress reset.

Closes #3664

* fix(core): address subagent compaction review feedback

Code fixes:
- Seed `lastPromptTokenCount` on subagent chats so the first-send
  threshold gate sees the inherited history's true size.
- Add `COMPRESSION_FAILED_TOKEN_COUNT_ERROR` to the fail-latch chain
  so token-counting failures stop retrying compression every send.
- Restore `FileReadCache.clear()` after compaction in both the manual
  /compress wrapper and the auto-compaction path inside GeminiChat,
  preventing post-summary `file_unchanged` placeholders from pointing
  at content the model can no longer retrieve.
- Refresh stale comment on the `compressed → ChatCompressed` bridge
  in turn.ts now that this path is the primary route, not a fallback.

Tests:
- turn.test.ts asserts the compressed → ChatCompressed bridge.
- geminiChat.test.ts asserts COMPRESSED yields as the first stream
  event after auto-compaction succeeds.
- chatCompressionService.test.ts bumps originalTokenCount above the
  cheap-gate so the NOOP test exercises findCompressSplitPoint.
- client.test.ts asserts forceFullIdeContext flips when a
  ChatCompressed event flows through sendMessageStream's loop.

* chore(core): drop redundant StreamEvent cast and document auto-compaction trade-off

- Remove the `as StreamEvent` cast at the COMPRESSED yield site — the
  literal already matches the union member.
- Add a 4-line comment at the auto-compaction setHistory point that
  points readers to GeminiClient for the env-refresh trade-off rationale,
  so readers don't have to chase the layering decision back across files.
2026-05-06 14:09:07 +08:00
tanzhenxin
808d0978eb
feat(cli): route foreground subagents through pill+dialog while running (#3768)
* feat(cli): route foreground subagents through pill+dialog while running

Foreground (synchronous) subagents currently render a live AgentExecutionDisplay
inside the parent's pendingHistoryItems block. The frame mutates on every tool
call and approval; once it grows past the terminal height (verbose mode, parallel
subagents, long tool-call lists) the live-area repaint flickers visibly.

This change extends BackgroundTaskRegistry with a flavor: 'foreground' | 'background'
discriminator. Foreground entries register at the start of the synchronous tool-call
and unregister in its finally path. The pill counts them; the dialog drills into
their activity. The inline frame is suppressed during the live phase — only an
active, focus-locked approval prompt renders, as a small banner labeled with the
originating agent. Once the parent turn commits, the full AgentExecutionDisplay
appears in scrollback via Ink's <Static>, exactly as before.

Foreground entries skip the XML task-notification (the parent receives the result
through the normal tool-result channel) and skip the headless holdback (the
parent's await already pins the loop). The dialog gates per-agent cancellation
behind a two-step confirm so a stray 'x' can't end the user's current turn.

* fix(cli): address review findings on foreground subagent routing

- Gate `registerCallback` on background flavor so foreground entries
  don't leak orphaned `task_started` SDK events without a matching
  terminal notification.
- Render a queued-approval marker for non-focus subagents instead of
  returning null, so a queued approval is visible in the main view.
- Move `emitStatusChange` before `agents.delete` in
  `unregisterForeground` to match the ordering used by
  complete/fail/cancel/finalize.
- Prefix the foreground tool result with a cancel marker when
  `terminateMode === CANCELLED`, so the parent model can distinguish
  a user-cancelled run from a successful completion.
- Mirror the background path's stats wiring on the foreground path
  so `entry.stats` stays current and the dialog detail subtitle
  shows tool count + tokens for foreground runs.
- Remove the unreachable `isWaitingForOtherApproval` branch (subsumed
  by the queued-approval marker above).
- Reset the foreground confirm-step on detail-mode `left` and ignore
  `x` on terminal entries so an armed cancel can't carry into list
  mode and the hint footer/handler stay in sync.
- Test factory uses a `baseProps` spread instead of `as` cast so a
  future required field on `ToolMessageProps` is a compile-time miss.
2026-05-06 14:08:12 +08:00
易良
fe1fb31557
fix(cli): prevent file paths from being treated as slash commands (#3743)
* fix(cli): prevent file paths from being treated as slash commands (#1804)

When users input file paths starting with '/' (e.g. '/api/apiFunction/...',
'/Users/name/path'), they were incorrectly parsed as slash commands, resulting
in "Unknown command" errors. The input was discarded instead of being sent to
the model for processing.

Root cause: isSlashCommand() only checked for a '/' prefix without validating
whether the first token actually looks like a command name. Any '/' prefix
triggered the slash command flow, and when no matching command was found, the
error was shown with no fallback.

Fix: Add looksLikeCommandName() that validates command names contain only
[a-zA-Z0-9:_-]. Both isSlashCommand() and handleSlashCommand() now check the
first token — if it contains path separators, dots, or non-ASCII characters,
the input falls through to normal model processing instead of the command
dispatcher.

Closes #1804

* fix(cli): allow dots in command names and fix prettier formatting

Address review feedback:
- Allow '.' in looksLikeCommandName() regex to support extension-qualified
  commands like gcp.deploy (CommandService renames conflicts as ext.cmd)
- Add regression tests for dot-named commands in both commandUtils and
  slashCommandProcessor
- Fix prettier formatting in slashCommandProcessor test file

* fix(cli): handle slash command review edge cases

* docs(cli): align slash command validation comment

* fix(cli): preserve slash prompt ordering

* fix(cli): reject shell-metacharacter slash tokens

* test(cli): align slash command action mocks

* fix(cli): track model-sent user turns

* fix(cli): narrow slash path handling scope

* fix(cli): count slash path prompts in history

* test(cli): type slash command action mocks
2026-05-06 14:03:08 +08:00
jinye
2bd4aa1b6d
fix(sdk-python): standardize TAG_PREFIX to include v suffix (#3832)
Align the Python SDK's TAG_PREFIX with the TypeScript SDK convention by
changing it from 'sdk-python-' to 'sdk-python-v'. This removes the need
for callers to manually inject the `v` when composing git tags, eliminating
an asymmetry that could lead to doubled or missing `v` prefixes when code
is copied between SDK release helpers.

The final tag format (sdk-python-v0.1.0) is unchanged.

Closes #3793

🤖 Generated with [Qwen Code](https://github.com/QwenLM/qwen-code)

Co-authored-by: jinye.djy <jinye.djy@alibaba-inc.com>
2026-05-06 11:14:02 +08:00
Shaojin Wen
d44786685c
feat(core,cli): surface and cancel auto-memory dream tasks (#3836)
* feat(cli): surface auto-memory dream tasks in Background tasks dialog

Adds `dream` as a fourth kind in the Background tasks pill + dialog,
alongside agent / shell / monitor. Subscribes to MemoryManager via
its existing subscribe() / listTasksByType() API and adapts
MemoryTaskRecord into a DreamDialogEntry view-model. Zero changes
to the core package.

Filters out `pending` (sub-second transition) and `skipped` (every
UserQuery that misses the gate creates one) records. Caps retained
terminal entries at 3 since `MemoryManager.tasks` has no eviction
path; without the cap, completed dreams would accumulate over the
project's lifetime (mirrors MonitorRegistry's terminal cap pattern).

Extract tasks are intentionally NOT surfaced — they fire on every
UserQuery, would flood the pill, and the `memory_saved` toast in
useGeminiStream already covers their completion signal.

Read-only for now: cancellation requires MemoryManager.cancelTask
+ task_stop integration which lands in a follow-up PR. The dialog
suppresses the "x stop" hint for dream entries until then to avoid
silent no-op keystrokes.

Refs #3634

* docs(cli): rephrase dream filter comment to focus on extract vs dream

The earlier comment compared the design to a non-qwen-code product;
restate the rationale in terms of the local extract / dream split
(extract fires every UserQuery and surfaces via memory_saved toast,
dream fires after gates and warrants pill / dialog visibility).

* feat(core,cli): cancel dream consolidation tasks via dialog and task_stop

Wires cancellation for the auto-memory dream task kind:

- `MemoryManager.cancelTask(taskId)` — aborts the dream's fork-agent
  via a new per-task AbortController, marks the record `cancelled`
  before aborting so the runDream catch path can detect user-intent
  and avoid overwriting with a generic `failed`. The existing
  finally block releases the consolidation lock as the agent unwinds.
- `MemoryManager.getTask(id)` — point lookup helper so cross-cutting
  consumers like `task_stop` can route by id without a project root.
- AbortSignal threaded through `scheduleDream` → `runDream` →
  `runManagedAutoMemoryDream` → `planManagedAutoMemoryDreamByAgent`
  → `runForkedAgent.abortSignal` (already supported).
- `task_stop` tool gets a 4th dispatch branch: dream task ids look
  up via MemoryManager and route through `cancelTask`. Extract is
  intentionally NOT cancellable — it runs synchronously on the
  request loop, cancelling would interfere with the user's own turn.
- `BackgroundTasksDialog` x-stop hint suppression for dream is removed
  (was a PR-1 placeholder); `cancelSelected` dream branch now calls
  `memoryManager.cancelTask`.
- `MemoryTaskStatus` gains `'cancelled'`. Dream view-model widens
  status union and filter to surface cancelled entries (terminal
  cap continues to apply).

Refs #3634

* fix(core,cli): address review feedback on dream cancellation surface

- DreamDetailBody: comment said cancellation "lands in PR-2" but the
  same PR wires `cancelSelected` for dream entries. Reword to describe
  what's actually shipped + flag in-flight progress as the real
  follow-up.
- task_stop: drop the unreachable `!aborted` error branch. The status
  guard above already confirms `running`, and `cancelTask` is
  synchronous; in this branch it cannot return false.

* fix(core): handle resolved-cancel path from runForkedAgent for dream tasks

runForkedAgent maps AgentTerminateMode.CANCELLED to a resolved
{status: 'cancelled'} result rather than rejecting. The cancel-via-
task_stop path landed in the previous commit assumed the call would
throw — when it didn't, the runDream success path overwrote the
user-cancelled record with 'completed' AND bumped lastDreamAt
metadata, suppressing the next legitimate dream cycle.

Two-layer defense:

- dreamAgentPlanner now rethrows when the fork agent reports
  cancelled status (mirrors the existing failed-status throw).
  This is the source-of-truth fix.

- runDream now checks abortSignal.aborted after the await as
  defense in depth. If anything in the call chain ever forgets to
  propagate, this guard short-circuits the success path before
  metadata write.

Updates the existing dreamAgentPlanner test that previously pinned
the buggy "returns cancelled without throwing" behavior. Adds a
manager test that simulates the resolves-on-abort scenario directly
to verify the consumer-side guard catches anything the planner
might miss.

* fix(core,cli): address review feedback on dream UX, perf, and comment clarity

Three fixes from the post-cancellation-PR review:

- scheduleDream now sets an initial `progressText` ("Scheduled
  managed auto-memory dream.") on the in-flight record. Without
  this, the dialog Detail's Progress section stayed empty until
  completion — the PR description's mid-flight screenshot showed
  text that production never actually rendered.

- useBackgroundTaskView gates the MemoryManager.subscribe listener
  on a dream-content signature. The manager fires for every task
  transition (extract included, ~2x per UserQuery), but the dialog
  has no extract surface; without this dedup each extract notify
  forced a full 4-source re-merge + a fresh setEntries reference,
  re-rendering the dialog and pill on entries that hadn't changed.
  Added a test that pins the reference-stability invariant.

- task-stop comment was misleading — said "the status guard above
  already confirmed running", but for the dream branch the running
  check happens IN this branch (not earlier). Reworded.

* fix(core,cli): tighten cancelTask contract, dedup dream snapshot reads, sync test helper

Four fixes from the latest review pass:

- cancelTask() now enforces the AbortController invariant. The old
  `ac?.abort()` returned `true` even when no controller was found,
  meaning callers could see a successful return while the dream was
  not actually aborted (and would leak the consolidation lock until
  the agent finished naturally). The controller is registered
  synchronously alongside `status='running'`, so a missing controller
  for a running record is a contract violation — return false without
  flipping status so the caller knows the abort didn't take.

- useBackgroundTaskView's `refresh()` now reuses the dream snapshot
  the memory listener fetched for its dedup gate. The previous version
  re-read `listTasksByType('dream')` inside `computeDreamSig()` and
  again inside `refresh()` — extra work AND a race window where the
  gate signature could come from a different snapshot than the one
  used to build dreamEntries. Single read, single source of truth.

- The `dream()` test helper widened to include `'cancelled'` so it
  matches the production `MemoryTaskStatus` union. Added a small test
  asserting cancelled dreams flow through the kind discriminator (the
  dialog's terminal-cap window depends on showing the user the
  outcome of the abort they just triggered).

- Dropped the stale "PR-1 read-only / PR-2 cancellation" block
  comment above the dream-entries describe block — both are in this
  PR now.

* fix(core,cli): address self-review + Copilot feedback on dream surface

Combined fixes for the latest review pass — self-review notes (U1, U2,
U5, U6) plus Copilot's three new comments (C1, C2, C3):

- **Footer dream-indicator dedupe (U1)**: removed `useDreamRunning` +
  `✦ dreaming` right-column text. The new Background tasks pill
  already counts dream tasks alongside agent / shell / monitor;
  showing both produced two simultaneous signals for the same state.

- **task_stop dispatch surfaces extract distinctly (U2 + C1)**: a
  new `TASK_STOP_NOT_CANCELLABLE` error type fires when the task id
  resolves to a known-but-not-cancellable record (extract). Previously
  extract ids fell through to `NOT_FOUND`, misleading the model into
  thinking the id was never valid. Also surfaces the missing-controller
  case from `cancelTask` as an explicit error rather than reporting
  phantom success.

- **MemoryManager.cancelTask logs missing-controller violation (C2)**:
  the silent `return false` for the missing-AbortController case now
  emits a `debugLogger.warn` so the inconsistency is observable in
  debug bundles. Without the log a runaway dream burning tokens would
  leave no trail.

- **MemoryManager.subscribe taskType filter (C3)**: subscribers can
  now opt into per-type notify routing via `subscribe(fn, { taskType })`.
  Internal `notify()` calls pass the changed task's type so filtered
  consumers wake only on relevant transitions. The bg-tasks UI hook
  uses this to skip the per-UserQuery extract notify entirely — drops
  the per-extract O(n) signature work to zero.

- **runDream guards against late-cancel overwrite (U5)**: the
  success path now re-checks `abortSignal.aborted` between metadata
  read/write and before the final `update({status: 'completed'})`.
  Closes the ~tens-of-ms race window where `cancelTask` flipping
  status to `'cancelled'` would silently lose to the success
  continuation overwriting with `'completed'` + bumping
  `lastDreamAt`.

- **DreamDialogEntry.endTime semantic comment (U6)**: documents that
  `endTime` for cancelled records is the cancel-call moment (not the
  fork unwind), so a future maintainer doesn't treat it as a real
  fork-finish timestamp.

Tests: new `subscribe() taskType filter` describe block in manager,
new task_stop tests for `NOT_CANCELLABLE` (extract) and missing-
AbortController (dream); existing test renamed/widened.

* fix(core,cli): tighten error semantics + comments on dream surface

Four fixes from the latest review pass:

- New `TASK_STOP_INTERNAL_ERROR` error type for the missing-
  AbortController contract violation. Previously the dispatcher
  reused `TASK_STOP_NOT_RUNNING`, which is misleading — the task IS
  running, cancellation just couldn't be delivered. Distinct type
  signals "this is unexpected, file a bug" vs `NOT_CANCELLABLE`
  which signals "expected behavior, use a different approach".

- Reworded the `useBackgroundTaskView` filter comment. Said "every
  UserQuery that misses the gate creates one [skipped record]" but
  `scheduleDream` returns `{status: 'skipped'}` early without
  creating a record for most gate misses; only the
  acquireDreamLock/EEXIST race actually stores a `'skipped'` record.

- Strengthened the `subscribe() unsubscribe` test. The previous
  version asserted "not called yet" without firing any notify after
  unsubscribe, so a regression that left the listener attached
  would still pass. Now schedules an extract before AND after the
  unsubscribe, verifying the call count doesn't increment.

- Moved `const debugLogger = createDebugLogger(...)` below the
  full import block in manager.ts. Previous version sat between
  imports, violating eslint-plugin-import's `import/first` rule
  (didn't trip lint locally, but worth fixing before it does).

* fix(core): skip scheduleDream early when params.config is missing

`ScheduleDreamParams.config` is optional in the type so test paths
can omit it, but production callers always pass one. Without a
config, `runManagedAutoMemoryDream` throws because the fork-agent
execution requires it. With dream tasks now visible in the
Background tasks dialog, that throw becomes a noisy `failed` entry
the user sees but didn't trigger.

Convert the omitted-config case to the same `disabled` skip path
that an explicitly-disabled config takes, so a no-config call
short-circuits before any record is stored. Existing tests that
relied on the old "no config = proceed past the disabled gate"
behavior now pass an explicit `makeMockConfig()` (matching what
they would do in any realistic scenario).

New test pins the no-config skip behavior + asserts no record
was stored (so a regression that drops the early skip would
produce a visible failed entry in the dialog and fail the test).

* fix(core): plug subscribe Map leak + dream.ts late-cancel ordering bug

Two fixes from the latest review pass:

- `MemoryManager.subscribe`'s typed-branch unsubscribe deleted the
  listener from its per-type Set but left the empty Set sitting in
  `subscribersByType`. Over a long-running session with repeated
  React mount/unmount of the bg-tasks view, that accumulates dead
  Map entries forever. Drop the entry when the bucket goes empty.

- `runManagedAutoMemoryDream` writes metadata after the fork agent
  returns (`bumpMetadata` → `rebuildManagedAutoMemoryIndex` →
  `updateDreamMetadataResult`). If the user presses 'x' between the
  fork's success return and these writes, the writes proceed and
  bump `lastDreamAt` — leaving the visible UI ('Stopped') disagreeing
  with the scheduler gate (sees a recent successful dream and
  suppresses the next cycle). manager.ts already short-circuits its
  own metadata write via the post-await abort check, but it can't
  block writes that already happened inside dream.ts. Adds the same
  abort-signal check between each write step here.

* fix(core): swallow releaseDreamLock errors so they don't poison outcome

If `releaseDreamLock` throws inside the inner finally (e.g. filesystem
error on the lock file), the exception propagates to the outer catch
and overwrites a successfully-completed dream record with 'failed'.
The on-disk metadata is already up-to-date at that point, so the user
sees a contradictory state — `lastDreamAt` was bumped but the UI
shows a failure.

Wrap the release in a try/catch with `debugLogger.warn`. The lock
file is still cleaned up on the next session via the existing
staleness sweep, so swallowing the release error doesn't risk a
permanent stuck lock.

* fix(core): thread abortSignal into dream metadata writes

The pre-call abort checks in `runManagedAutoMemoryDream` close most
of the late-cancel race window, but each metadata helper itself does
read → mutate → write across two awaits. If the user cancels between
those two awaits the write still happens, persisting `lastDreamAt`
for an aborted run and suppressing the next legitimate dream cycle.

Thread `abortSignal` into `bumpMetadata` and
`updateDreamMetadataResult`; both now re-check between the read and
the write, returning early without persisting when the signal has
already fired. The pre-call checks remain as the first line of
defense; this guards the race that opens after the call enters.

* fix(core): close dream cancel race + surface lock-release failures

Two related fixes from the latest review pass:

- Move scheduler-gating metadata writes out of `runManagedAutoMemoryDream`
  and into `MemoryManager.runDream`, sequenced AFTER the
  status='completed' flip. The previous shape left a race window
  where cancellation arriving during/after `fs.writeFile` could
  persist `lastDreamAt` while the manager flipped status to
  'cancelled' — visible UI ('Stopped') would disagree with the
  scheduler gate (sees a recent successful dream), suppressing the
  next legitimate dream cycle. The new order makes the gating
  metadata write race-free: once status !== 'running', cancelTask
  refuses, so any cancel arriving during the metadata write is
  ignored. The remaining "cancel raced the synchronous status
  update" window is handled by a post-update abort recheck that
  restores 'cancelled' and skips the metadata write.

  Drops the now-dead `bumpMetadata` helper from dream.ts; index
  rebuild stays there since it's informational, not gating.

- Surface `releaseDreamLock` failures on the task record's metadata
  (`lockReleaseError`). The previous fix logged-and-swallowed only,
  so a Windows EPERM or ENOENT race would silently block subsequent
  dreams as 'locked' with no UI signal explaining why dreaming had
  stopped. Logger keeps emitting the warn for debug bundles.

* fix(core,cli): address 8 review findings on dream surface

Reverts the metadata-write behavior regression, plugs the
storeWith reentrancy hole, surfaces lock/metadata warnings in the
UI, and a handful of cleanups:

- Wrap gating-metadata read+write in try/catch (manager.ts). The
  PR moved metadata writes from dream.ts (best-effort, swallowed)
  to manager.ts (unguarded). A throw from readDreamMetadata /
  writeDreamMetadata now propagates to the outer catch and
  overwrites a successfully-completed dream with 'failed' — the
  dream actually did its work and touched files are visible. New
  catch logs + writes `metadataWriteError` on the record so the UI
  can explain why the next dream may re-fire sooner than expected.

- Register the AbortController BEFORE storeWith in scheduleDream.
  storeWith fires a notify; a subscriber synchronously calling
  cancelTask(record.id) would otherwise see status='running' but
  no controller, hitting the missing-controller defensive warn
  path and reporting a phantom failure on a brand-new dream.

- Surface `lockReleaseError` and `metadataWriteError` in the dream
  view-model and DreamDetailBody (rendered as warnings, not
  errors, so the terminal status stays Completed). Previous fix
  wrote them to record.metadata only — nothing in the cli read or
  rendered them, so users still had no UI signal.

- Preserve `result.touchedTopics` on the unreachable cancel-raced-
  status-update branch. If a future refactor introduces an await
  there, the restored cancelled record would otherwise drop the
  already-produced result; the UI would report a clean cancellation
  even though memory files were already modified.

- Add `'cancelled'` to MemoryDreamEvent status union and emit a
  cancelled telemetry event from the runDream catch path. Without
  this a cancelled dream is indistinguishable from one that never
  scheduled in the first place.

- Drop the dead abortSignal param from updateDreamMetadataResult
  in dream.ts (no caller passes it after the manager.ts move).

- Swap declaration order of `computeDreamSig` and `refresh` in
  useBackgroundTaskView.ts (TDZ-fragile against a future refactor
  that adds a synchronous refresh call between them).

- Update the unreachable cancel-raced-update branch comment to
  describe what's actually true ("defense-in-depth, unreachable
  today") instead of the confusing "cancelTask flipped" path.

- cancelSelected now checks the cancelTask return value and logs
  via debugLogger when false. Today this branch is unreachable
  thanks to the controller-register-before-storeWith fix above,
  but if a future refactor breaks the invariant the silent ignore
  would let the user think the cancel took effect.

- Mirror the manual /dream metadata path's `recentSessionIdsSinceDream = []`
  reset in the auto path — field is dead code today but keeping
  the two write sites in sync avoids surprises.

Telemetry metric `recordMemoryDreamMetrics` widened to accept the
new 'cancelled' status (downstream consumer in loggers.ts).

* fix(memory): same-session recovery when releaseDreamLock throws

The prior fix surfaced lockReleaseError on the dialog so the user knows
the lock release failed (Windows EPERM, ENOENT race, disk full, etc.) —
but until next process start, dreamLockExists() still sees a fresh-mtime
lock owned by an alive PID (us!) and silently suppresses every
subsequent scheduleDream() call as `{status: 'skipped',
skippedReason: 'locked'}`. The user sees the warning AND zero further
dream activity, and the staleness sweep that would clean the leaked
lock only runs at session start.

Adds a `dreamLockReleaseFailed` flag set in the catch. The next
scheduleDream() force-cleans the leaked lock file via fs.rm({force: true})
before the existence check, so dream scheduling resumes within the same
session. Best-effort: if even the forced rm fails (truly unrecoverable
filesystem state), falls through to the existing 'locked' skip path.

This is an incremental improvement on top of b00ecdebd's UI-surface fix.
The two together give the full story: warning visible → automatic
recovery on next attempt.

* fix(memory): real cancelled-dream duration + clarify index-rebuild ordering

Two follow-up suggestions from review:

- **`duration_ms: 0` in cancelled-dream telemetry** (manager.ts):
  the user-cancel path emitted `MemoryDreamEvent` with
  `duration_ms: 0`, which would silently skew latency histograms / p95
  metrics by treating cancelled dreams as instant. Capture
  `dreamStartMs = Date.now()` at the top of `runDream` and emit the
  real elapsed time in the cancel branch.

- **Misleading index-rebuild comment** (dream.ts:75–84): the comment
  claimed the index rebuild "is still done before returning when
  topics were touched", but the code returns early on
  `abortSignal?.aborted` BEFORE the rebuild. Rewrote the comment to
  describe the actual cancel-aware ordering — abort returns partial
  result without rebuilding (rebuild is expensive; next dream cycle
  will rebuild against the latest files anyway), live path rebuilds
  only when topics changed.

22 / 22 manager tests pass; tsc clean.
2026-05-06 10:20:19 +08:00
John London
2a5be0d3b1
fix(core): prevent auto-memory recall from blocking main request (#3814)
Some checks are pending
Qwen Code CI / Post Coverage Comment (push) Blocked by required conditions
Qwen Code CI / Lint (push) Waiting to run
Qwen Code CI / Test (push) Blocked by required conditions
Qwen Code CI / Test-1 (push) Blocked by required conditions
Qwen Code CI / Test-2 (push) Blocked by required conditions
Qwen Code CI / Test-3 (push) Blocked by required conditions
Qwen Code CI / Test-4 (push) Blocked by required conditions
Qwen Code CI / Test-5 (push) Blocked by required conditions
Qwen Code CI / Test-6 (push) Blocked by required conditions
Qwen Code CI / Test-7 (push) Blocked by required conditions
Qwen Code CI / Test-8 (push) Blocked by required conditions
Qwen Code CI / CodeQL (push) Waiting to run
E2E Tests / E2E Test (Linux) - sandbox:docker (push) Waiting to run
E2E Tests / E2E Test (Linux) - sandbox:none (push) Waiting to run
E2E Tests / E2E Test - macOS (push) Waiting to run
* fix(core): prevent auto-memory recall from blocking main request (issue #3759)

The auto-memory recall side-query had a 5s AbortSignal.timeout that fired
on every turn, and the main request path awaited the full recall promise
(including timeout + heuristic fallback). This caused a ~5s delay on every
user turn.

Changes:
- relevanceSelector.ts: reduce model-driven selector timeout from 5s to 2s
- client.ts: add resolveAutoMemoryWithDeadline() that races the recall
  promise against a 2.5s deadline, returning empty result if recall
  hasn't completed in time
- client.test.ts: add 2 tests verifying slow recall doesn't block the
  main request and fast recall still includes memory content

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix(test): remove stripThoughtsFromHistory from GeminiChat mocks

stripThoughtsFromHistory was removed from GeminiChat on main.
The mock in client.test.ts still referenced it, causing TS2353 build failures in CI.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix(core): address PR #3814 review comments on auto-memory recall deadline

Three changes from the review:

1. clearTimeout leak — The setTimeout in resolveAutoMemoryWithDeadline is
   now cleared in a finally block so the dangling timer doesn't block the
   Node.js event loop from draining during graceful process exit.

2. Telemetry inflation — An AbortController is created before the recall
   call and aborted when the 2.5s deadline fires. Its signal is passed into
   the recall pipeline via a new optional abortSignal field on
   ResolveRelevantAutoMemoryPromptOptions. All three logMemoryRecall calls
   in recall.ts are gated behind !options.abortSignal?.aborted so discarded
   results don't emit success metrics.

3. Test coverage — Added a test for the !promise guard path
   (getManagedAutoMemoryEnabled() returns false) verifying sendMessageStream
   completes without calling recall or injecting memory content.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix(core): address PR #3814 review findings for auto-memory recall

Critical fixes:
- Wrap onDeadline() in try/finally inside resolveAutoMemoryWithDeadline so
  resolve() always runs even if onDeadline() throws, preventing deadlock
- Forward caller abortSignal through recall() → selectRelevantAutoMemoryDocumentsByModel
  → runSideQuery, combined with the existing 2s timeout via AbortSignal.any()
  so the model API call is cancelled when the 2.5s deadline fires
- Abort and clear pendingRecallAbortController on all early return paths
  (MaxSessionTurns, BoundedTurns=0, SessionTokenLimitExceeded, Arena control signal)
- Rename _pendingRecallAbortController → pendingRecallAbortController
  (inconsistent underscore prefix)

Tests added:
- client.test.ts: verify sendMessageStream completes normally when recall rejects
- relevanceSelector.test.ts: verify caller abort signal is forwarded combined with timeout,
  and timeout-only signal when no caller signal provided

* fix(core): start auto-memory recall deadline at initiation time

Moves the resolveAutoMemoryWithDeadline() race from consumption time to
initiation time so the 2.5s budget is not consumed by intermediate work
(microcompact, compression, token checks, IDE context) between recall
initiation and consumption. The raced promise is stored directly in
relevantAutoMemoryPromise and simply awaited at consumption.

The pendingRecallAbortController cleanup on early return paths is
preserved unchanged.

* fix(core): address PR #3814 review feedback

- resetChat(): abort pendingRecallAbortController so stale in-flight recall
  does not leak into the next session.
- recall.ts catch: distinguish AbortError (deadline-triggered cancellation)
  from real model errors. AbortError logs at debug level with a message
  indicating the heuristic result was discarded. Real model errors continue
  to log at warn with the existing fallback message.

* fix(core): address PR #3814 round 3 review feedback

- Use explicit undefined check instead of non-null assertion for timer in
  resolveAutoMemoryWithDeadline (clearTimeout)
- Gate heuristic fallback in recall.ts on abortSignal.aborted so discarded
  results after the deadline don't produce output
- Downgrade AbortError log level from warn to debug in the client.ts recall
  catch block, keeping the warn channel meaningful for real failures
- Add tests verifying pendingRecallAbortController.abort() is called on
  MaxSessionTurns and SessionTokenLimitExceeded early-return paths

Co-Authored-By: Claude <noreply@anthropic.com>

* test(core): assert abort propagation in relevance selector

* fix(lint): remove unused eslint-disable directive in relevanceSelector.test.ts

The import/no-internal-modules eslint-disable comment was unnecessary —
the rule does not flag same-package imports like ../utils/sideQuery.js.
ESLint 9 reports it as an unused directive, failing the CI lint check.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-05-06 07:34:41 +08:00
John London
c7facf5013
fix(core): use per-model settings for fast model side queries (#3815)
Fixes #3765. Side queries (session recap, title generation, tool-use summary) running on the fast model previously inherited the main model's ContentGeneratorConfig, leaking extra_body / samplingParams / reasoning between models.

GeminiClient.generateContent() now resolves the target model's own config via buildAgentContentGeneratorConfig() when the requested model differs from the main model, and creates a dedicated ContentGenerator (cached, cleared on resetChat). Cross-authType resolution lets the fast model live under a different provider than the main model. Uses the target model's authType for retry logic so provider-specific checks (e.g. QWEN_OAUTH quota detection) reference the correct provider. Falls back to the main generator if the model is not in the registry.
2026-05-06 02:07:32 +08:00
5d0adbb5bd
fix(core): activate skills from discovered result paths (#3852)
* fix(core): activate skills from discovered result paths
* fix(core): address result path review feedback
* fix(core): handle grep result path edge cases
* fix(core): trust fallback grep result paths
2026-05-06 01:57:30 +08:00
Mr-Maidong
f4a9f7bfd7
feat(weixin): add image sending support via CDN upload (#3781)
* feat(weixin): add image sending support via CDN upload

* fix(weixin): address PR review — path validation, encoding, timeout, error handling

Critical fixes from wenshao's review of feat/weixin-image-send:

1. File read vulnerability: add validateImagePath() in send.ts with
   directory allowlist, extension filter, magic-byte check, 20MB cap,
   and realpath resolution. Pass workspace cwd as allowed dir.

2. aes_key encoding: change from base64(hex-ascii) to base64(raw 16B)
   to match the protocol expectation (images use raw bytes, not hex).

3. uploadToCdn timeout: add AbortController + 40s timeout per retry
   attempt to prevent hanging on stalled CDN connections.

4. Unhandled rejection: wrap fallback sendText() in catch block with
   its own try/catch to prevent process crash on double failure.

5. Default instructions merge: append image capability guide when
   custom instructions lack [IMAGE:], instead of silently dropping it.

6. Dead code: remove unused imagePaths parameter from sendMessage().

7. Regex hardening: strip code blocks before [IMAGE:] extraction,
   filter empty paths to prevent confusing readFileSync('') errors.

8. URL validation: reject http:// URLs and validate CDN hostname in
   uploadToCdn (SSRF prevention).

Tests: replace identity mock with real encryptAesEcb/computeMd5 so
padding mismatches are caught; fix partial node:crypto mock.

Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>

* fix(weixin): address 2nd round PR review — 10 issues

Critical fixes:
1. detectImageMime: add JPEG magic bytes (0xFF 0xD8 0xFF), throw on
   unrecognized format instead of defaulting to image/jpeg
2. getUploadUrl retry: pass errcode to WeixinApiError, add ret field
   check in isRetryableError so actual API errors trigger retries
3. ALLOWED_DIRS: add realpathSync('/tmp/') and realpathSync(tmpdir())
   to handle macOS symlink resolution (/tmp → /private/tmp)
4. [IMAGE:] stripping: only replace markers that were actually parsed,
   preserving [IMAGE:] inside code blocks in displayed text
5. TOCTOU fix: use openSync/readSync(16B) for magic-byte check instead
   of reading the entire file twice
6. sendMessage: check both ret and errcode fields for error detection

Suggestions:
7. connect(): avoid mutating this.config.instructions on reconnect
8. Fix duplicate Step 2 comment numbering
9. Replace errcode=${undefined} with errcode=${resp.errcode ?? '(none)'}
10. stderr: log structured (status=, ret=) instead of raw errmsg

Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>

* fix(weixin): 3rd round PR review — errcode checks, error logging, timeout, path resolution

- api.ts: add errcode check in getUploadUrl (align with sendMessage)
- api.ts: pass ret/errcode from CDN error to WeixinApiError
- send.ts: resolve workspace dirs with realpathSync
- WeixinAdapter.ts: include errcode and err.message in error log
- media.ts: add 40s timeout to downloadAndDecrypt fetch
- send.test.ts: add 12 tests covering detectImageMime and validateImagePath error branches

Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>

---------

Co-authored-by: Maidong <408097061@qq.com>
Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>
2026-05-06 01:49:26 +08:00
Alex Musick
174b3ac951
feat(cli): Add ability to switch models non-interactively from the cli (#3783)
* Add ability to switch models non-interactively from the cli

This fulfills request #3410

* Update packages/cli/src/ui/commands/modelCommand.ts

Co-authored-by: Shaojin Wen <shaojin.wensj@alibaba-inc.com>

* Update packages/cli/src/ui/commands/modelCommand.ts

Co-authored-by: Shaojin Wen <shaojin.wensj@alibaba-inc.com>

* Update packages/cli/src/ui/commands/modelCommand.ts

Co-authored-by: Shaojin Wen <shaojin.wensj@alibaba-inc.com>

* Update packages/cli/src/ui/commands/modelCommand.ts

Co-authored-by: Shaojin Wen <shaojin.wensj@alibaba-inc.com>

* Update packages/cli/src/ui/commands/modelCommand.ts

Co-authored-by: Shaojin Wen <shaojin.wensj@alibaba-inc.com>

* Revert "Update packages/cli/src/ui/commands/modelCommand.ts"

This reverts commit 7b49c442a7.

* Protect against empty model name; align non-interactive behavior

Empty model names are now ignored. ```/model ``` (with trailing whitespace) will still open the interactive model picker.

Realigned the non-interactive path to use the same ```args.trim().split(' ')[0]``` logic. Valid model IDs can not contain spaces anyway. If preferred, this specific change can be reverted and the new code can use the old logic instead.

* Warn if model is not in registry

* Update command description

* Updated modelCommand test to reflect new description

* Implement auto-complete with model IDs

* Update packages/cli/src/ui/commands/modelCommand.ts

Co-authored-by: Shaojin Wen <shaojin.wensj@alibaba-inc.com>

* Update packages/cli/src/ui/commands/modelCommand.ts

Co-authored-by: Shaojin Wen <shaojin.wensj@alibaba-inc.com>

* Revert "Update packages/cli/src/ui/commands/modelCommand.ts"

This reverts commit 0600b2373a.

* Update modelCommand.ts

* Update modelCommand.ts

* Update modelCommand.ts

* Update/use i18n keys

* Corrected en i18n

* removed redundant .trim() on modelName check

---------

Co-authored-by: Shaojin Wen <shaojin.wensj@alibaba-inc.com>
2026-05-06 00:54:27 +08:00
Yan Shen
89cb326c2f
feat(cli): improve export format completion navigation (#3701)
Some checks are pending
Qwen Code CI / Lint (push) Waiting to run
Qwen Code CI / Test (push) Blocked by required conditions
Qwen Code CI / Test-1 (push) Blocked by required conditions
Qwen Code CI / Test-2 (push) Blocked by required conditions
Qwen Code CI / Test-3 (push) Blocked by required conditions
Qwen Code CI / Test-4 (push) Blocked by required conditions
Qwen Code CI / Test-5 (push) Blocked by required conditions
Qwen Code CI / Test-6 (push) Blocked by required conditions
Qwen Code CI / Test-7 (push) Blocked by required conditions
Qwen Code CI / Test-8 (push) Blocked by required conditions
Qwen Code CI / Post Coverage Comment (push) Blocked by required conditions
Qwen Code CI / CodeQL (push) Waiting to run
E2E Tests / E2E Test (Linux) - sandbox:docker (push) Waiting to run
E2E Tests / E2E Test (Linux) - sandbox:none (push) Waiting to run
E2E Tests / E2E Test - macOS (push) Waiting to run
SDK Python / SDK Python (3.11) (push) Waiting to run
SDK Python / SDK Python (3.12) (push) Waiting to run
SDK Python / SDK Python (3.10) (push) Waiting to run
* feat(cli): improve export format completion navigation

* fix(cli): address PR #3701 review feedback on /export completion

Critical:
- Guard phase-2 cycling by checking buffer text starts with "/export "
  so a manually edited buffer is never clobbered by stale nav state (C1)
- Derive export format suggestions from slashCommands.subCommands to
  keep a single source of truth with the command registry (C2)
- Reset completionSelectionWasNavigatedRef on showSuggestions rising
  edge instead of on every suggestions change to avoid a race where
  an already-navigated selection is forgotten before Enter (C3)
- Add regression tests for isPerfectMatch + navigated + Enter,
  including the positive path and a control case (C4)

Suggestions:
- Prefix-guard getExportFormatFromInput to skip regex on non-/export
  input (S1)
- Drop trailing space from setExportCompletionInput output so buffer
  text is no longer implicitly coupled to the cycling heuristic (S2)
- Document the two-phase state machine (one-shot fill + cycling) (S3)
- Accept Tab as an additional cycling key alongside Up/Down (S4)
- Remove the unconditional ref reset at the tail of handleInput;
  correctness is now guaranteed by the buffer-text guard (C1) and
  the showSuggestions edge-triggered useEffect (C3) (S5)

* fix(cli): tighten export completion cycling guard and unify Tab behavior

- Phase 2 cycling guard: replace startsWith('/export ') with strict
  getExportFormatFromInput() to prevent overwriting inputs with extra
  arguments (e.g. '/export html --verbose').
- ACCEPT_SUGGESTION in Phase 1 popup: add hasExportFormatSuggestions
  branch so Tab/Enter seeds exportCompletionSelectionIndexRef,
  allowing Phase 2 cycling to continue from the selected format
  (consistent with Up/Down arrow behavior).
- Add 4 regression tests: Phase 1 Up wrap, Phase 2 Up wrap, Tab
  seed + Phase 2 Tab cycle, guard prevents overwriting extra args.

Ref: PR #3701 second-round review by wenshao

* fix(cli): address PR #3701 third-round review feedback on /export completion

- S6: use dynamic exportFormatSuggestions.findIndex() for highlight index
  instead of static EXPORT_FORMAT_COMPLETIONS.indexOf()
- S7: derive Phase 2 cycling current index from buffer text via
  getExportFormatFromInput + indexOf, with defensive ref fallback
- S8: extract getNextExportCompletionIndex as module-level pure function;
  cache exportCycleFormats via useMemo to avoid per-keystroke .map()
- S9/S10: add tests for ESC and Ctrl+C reset of export cycling state

* fix(cli): tighten /export prefix guard, add superset matching fallthrough, and improve documentation

* fix(cli): address review #4224860127 - smaller notes optimization

- S1: Strengthen EXPORT_FORMAT_COMPLETIONS fallback comment with
  an IMPORTANT sync warning for format removals
- S2: De-export getExportFormatFromInput (no external consumers)
- S3: Add intermediate buffer-clear assertion after Ctrl+U in test
  to pin state and prevent false positives from future hook changes

* refactor(cli): extract export completion into useExportCompletion hook

Address all feedback from PR #3701 review comment:
- Extract ~310 lines of /export state machine from InputPrompt into
  dedicated useExportCompletion hook
- Replace exportCompletionSelectionIndexRef (number|null) with
  cyclingActiveRef (boolean) since index was never read
- Simplify navigated-flag lifecycle: reset on buffer.text changes
  instead of popup visibility transitions; add navigatedTextRef
  snapshot to prevent sticky autocomplete after buffer edits
- Remove static EXPORT_FORMAT_COMPLETIONS fallback; derive entirely
  from slashCommands.subCommands
- Aggregate 4 parallel ternaries into single suggestionDisplayProps
- Add regression test: navigate + backspace + retype + Enter
  should submit raw buffer, not autocomplete
- Remove redundant navigatedRef reset in ESC handler (already
  covered by exportCompletion.reset())

* fix(cli): guard export completion state
2026-05-05 23:23:25 +08:00
jinye
07441cc1e3
feat(sdk-python): add network timeouts to release version helper (#3833) 2026-05-05 19:25:00 +08:00
Rayan Salhab
ec51fd3138
fix(core): coalesce MCP server rediscovery (#3818)
* fix(core): coalesce MCP server rediscovery

* test(core): assert MCP rediscovery cleanup

* fix(core): address MCP rediscovery review feedback

* fix(core): preserve MCP rediscovery health checks

---------

Co-authored-by: cyphercodes <cyphercodes@users.noreply.github.com>
2026-05-05 18:31:26 +08:00
JerryLee
095a39a8d5
fix(acp): run auto compression before model sends (#3698) 2026-05-05 18:14:13 +08:00
jinye
0501c7165b
fix(telemetry): add bounded shutdown timeout and fix service.version resource attribute (#3813)
Some checks are pending
Qwen Code CI / Lint (push) Waiting to run
Qwen Code CI / Test-2 (push) Blocked by required conditions
Qwen Code CI / Test (push) Blocked by required conditions
Qwen Code CI / Test-1 (push) Blocked by required conditions
Qwen Code CI / Test-3 (push) Blocked by required conditions
Qwen Code CI / Test-4 (push) Blocked by required conditions
Qwen Code CI / Test-5 (push) Blocked by required conditions
Qwen Code CI / Test-6 (push) Blocked by required conditions
Qwen Code CI / Test-7 (push) Blocked by required conditions
Qwen Code CI / Test-8 (push) Blocked by required conditions
Qwen Code CI / Post Coverage Comment (push) Blocked by required conditions
Qwen Code CI / CodeQL (push) Waiting to run
E2E Tests / E2E Test (Linux) - sandbox:docker (push) Waiting to run
E2E Tests / E2E Test (Linux) - sandbox:none (push) Waiting to run
E2E Tests / E2E Test - macOS (push) Waiting to run
SDK Python / SDK Python (3.12) (push) Waiting to run
SDK Python / SDK Python (3.10) (push) Waiting to run
SDK Python / SDK Python (3.11) (push) Waiting to run
* fix(telemetry): add bounded shutdown timeout and fix service.version resource attribute

`shutdownTelemetry()` awaited `sdk.shutdown()` with no time bound, causing
the CLI to hang indefinitely when the OTLP endpoint was unreachable. Wrap it
in a `Promise.race` with a 10-second timeout that fails open (logs a warning
and proceeds with cleanup).

The `service.version` resource attribute was set to `process.version`
(Node.js runtime version) instead of the actual application version. Replace
it with `config.getCliVersion() || 'unknown'` to match the convention used
throughout the codebase.

Closes #3811

🤖 Generated with [Qwen Code](https://github.com/QwenLM/qwen-code)

* fix(telemetry): remove redundant diag.error and add .catch() clarifying comment

- Remove duplicate `diag.error()` in shutdown catch block — a rejected
  shutdown already surfaces as a thrown exception; only `debugLogger.error`
  is needed (consistent with pre-existing behavior).
- Add comment to `.catch()` handler explaining why rejections are silently
  dropped when timeout hasn't fired (they're caught by the try/catch via
  Promise.race).
- Update rejection test to remove stale `diag.error` assertion.

🤖 Generated with [Qwen Code](https://github.com/QwenLM/qwen-code)

* fix(telemetry): use || for getCliVersion fallback and clarify Promise.resolve wrap

- Revert `??` back to `||` for `getCliVersion()` fallback to stay
  consistent with the 6 other call sites across the codebase and
  guard against empty-string misconfiguration.
- Add comment explaining why `Promise.resolve()` wraps `shutdown()`
  (defensive measure for auto-mocked environments).

🤖 Generated with [Qwen Code](https://github.com/QwenLM/qwen-code)

* fix(telemetry): restore diag.error in catch block and add test assertion

- Re-add `diag.error()` to the shutdown rejection catch block so
  the error is visible via OTel diagnostic channel, matching the
  timeout path which uses `diag.warn`.
- Add `diagErrorSpy` assertion to the rejection test so the test
  name ("should log error") is backed by actual log verification.

🤖 Generated with [Qwen Code](https://github.com/QwenLM/qwen-code)

---------

Co-authored-by: jinye.djy <jinye.djy@alibaba-inc.com>
2026-05-05 11:12:13 +08:00
jinye
2c93fd670c
refactor: extract shared release helper utilities (#3834)
Move four duplicated utility functions (getArgs, readJson,
validateVersion, isExpectedMissingGitHubRelease) from the three
get-release-version.js scripts into a shared module at
scripts/lib/release-helpers.js so that changes only need to happen
in one place.

Also fixes a pre-existing bug in getArgs where argument values
containing '=' were silently truncated (e.g. --msg=a=b produced
{msg:'a'} instead of {msg:'a=b'}).

Closes #3795

🤖 Generated with [Qwen Code](https://github.com/QwenLM/qwen-code)

Co-authored-by: jinye.djy <jinye.djy@alibaba-inc.com>
2026-05-05 10:15:17 +08:00
Yan Shen
59845407fc
fix(openai): parse MiniMax thinking tags (#3677)
* fix(openai): parse MiniMax thinking tags

Add a MiniMax OpenAI-compatible provider that opts into tagged thinking parsing for minimaxi.com endpoints.

Split MiniMax <think>/<thinking> content into thought parts while preserving default OpenAI-compatible behavior, and avoid rendering leading blank stream chunks as empty assistant rows.

* fix(openai): address PR #3677 review — optimize MiniMax tagged thinking parser

- perf(taggedThinkingParser): pre-compute lowercase buffer once per
  parse() call, use startsWith(tag, offset) to avoid O(N²) per-char
  slice+toLowerCase allocations
- feat(minimax): expand host matching from exact Set to *.minimaxi.com /
  *.minimax.io wildcard suffix, covering gateway / custom subdomains
- docs(converter): add comment clarifying finish_reason flush of
  buffered tagged-thinking content on stream end

* test(openai): add standalone unit tests for TaggedThinkingParser

Add taggedThinkingParser.test.ts with 23 test cases covering the boundary scenarios requested in PR review #4219047370: empty tag content, close tags in text mode, pure partial-tag-prefix chunks, multi-chunk splitting, final flag flush behavior, and case insensitivity.

* fix(openai): add observability, host guards, and reasoning_content safeguard for MiniMax tagged thinking

- Add debugLogger.warn on unclosed thought flush (observability)
- Add debugLogger.debug for tag detection and parts count
- Add cross-matching tag comment explaining binary mode toggle
- Add MINIMAX_KNOWN_HOSTS exact match layer with suffix fallback
- Add reasoning_content guard to avoid duplicating thought parts
- Add cross-matching and unclosed flush tests (+4 cases)
- Add known host exact match and custom proxy tests (+4 cases)
2026-05-05 08:52:03 +08:00
2e69d641d5
fix(core): unescape shell-escaped file paths in Edit, WriteFile, and ReadFile tools (#3820)
Some checks are pending
Qwen Code CI / Post Coverage Comment (push) Blocked by required conditions
Qwen Code CI / Lint (push) Waiting to run
Qwen Code CI / Test (push) Blocked by required conditions
Qwen Code CI / Test-1 (push) Blocked by required conditions
Qwen Code CI / Test-2 (push) Blocked by required conditions
Qwen Code CI / Test-3 (push) Blocked by required conditions
Qwen Code CI / Test-4 (push) Blocked by required conditions
Qwen Code CI / Test-5 (push) Blocked by required conditions
Qwen Code CI / Test-8 (push) Blocked by required conditions
Qwen Code CI / Test-6 (push) Blocked by required conditions
Qwen Code CI / Test-7 (push) Blocked by required conditions
Qwen Code CI / CodeQL (push) Waiting to run
E2E Tests / E2E Test (Linux) - sandbox:docker (push) Waiting to run
E2E Tests / E2E Test (Linux) - sandbox:none (push) Waiting to run
E2E Tests / E2E Test - macOS (push) Waiting to run
* fix(core): unescape shell-escaped file_path in Edit, WriteFile, and ReadFile tools

Shell-escaped paths (e.g. my\ file.txt) from at-completion can reach
the LLM without going through atCommandProcessor unescaping.  When
the LLM passes such an escaped path to a file tool, the backslash is
treated as a literal character in the filename, causing the tool to
fail with file-not-found.

Add defensive unescapePath() normalization at the start of each
tool validateToolParamValues(), so that escaped paths are
converted back to real filesystem paths before any I/O.

Also normalize the path in coreToolScheduler conditional-rules
injection path for consistency.

Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>

* fix(core): harden unescapePath — Windows safety, speculation paths, hooks, and consistency

- Make unescapePath a no-op on win32 to avoid corrupting paths like
  C:\(v2)\file.txt where backslashes are path separators.
- Apply unescapePath in speculationToolGate before overlay FS lookups
  so speculation mode finds files given escaped paths.
- Normalize file_path/path in hook toolInput so custom hooks receive
  actual filesystem paths.
- Add unescapePath to Grep, Glob, Ls, and RipGrep validateToolParamValues
  for parity with the file tools.
- Apply unescapePath in getModifyContext callbacks (Edit, WriteFile) so
  the modify-with-editor flow works when request.args still holds escaped
  paths.
- Add .trim() to Edit and WriteFile path normalization for consistency
  with ReadFile.
- Use .trim() in conditional-rules matchAndConsume for parity.
- Gate literal-backslash test to non-win32; add Windows no-op test for
  unescapePath.

Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>

* fix(core): persist unescaped paths in validators, cover LSP filePath, centralize modifyWithEditor normalization

- Mutate params.path in Glob/Grep/Ls/RipGrep validators so invocations
  receive the normalized path, not the original escaped form.
- Add unescapePath to LspTool.validateToolParamValues for filePath.
- Extend scheduler path normalization to filePath + notebook_path
  keys, mirroring speculationToolGate's key set.
- Move unescapePath from getModifyContext callbacks into
  coreToolScheduler's modifyWithEditor path to avoid double-unescape
  and keep a single normalization site.
- Add .trim() to speculationToolGate paths for consistency.
- Show normalized path in ls.ts validation error message.

Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>

* fix(core): address review feedback for unescapePath PR

- Extract shared PATH_ARG_KEYS constant to eliminate duplicate key lists
  across coreToolScheduler and speculationToolGate
- Add notebook_path to modifyWithEditor unescape block (was missing)
- Hoist unescapePath regex as module-level constant (UNESCAPE_REGEX)
- Add debug log when unescapePath actually modifies a path
- Replace early-return Windows guards with vitest skipIf in tests
- Add tests for escaped paths in Glob, Grep, Ls, RipGrep, LSP validators
- Add test for conditional rules matchAndConsume with unescaped paths

Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>

* fix(core): fix import order in grep.ts, consolidate ls.ts imports, win32-guard write-file tests

- grep.ts: move debugLogger declaration after all import statements
- ls.ts: consolidate two imports from ../utils/paths.js into one
- write-file.test.ts: add skipIf(process.platform === 'win32') to
  shell-escape tests (unescapePath is a no-op on win32)

Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>

* fix(core): address review — double-unescape, circular import, lsp asymmetry, win32 test guards

- coreToolScheduler.ts: remove redundant unescapePath in conditional-rules
  injection (toolInput was already normalized; unescapePath is not
  idempotent for \\X sequences)
- paths.ts: remove debugLogger to break circular import chain
  (paths → debugLogger → storage → paths). The single debug log line is
  low signal — dropped entirely.
- lsp.ts: remove dead .trim() checks in FILE_REQUIRED_OPERATIONS and
  RANGE_REQUIRED_OPERATIONS (filePath is already trimmed by validator)
- Add it.skipIf(process.platform === 'win32') to 11 escaped-path tests
  across edit.test.ts, read-file.test.ts, glob.test.ts, grep.test.ts,
  ripGrep.test.ts, ls.test.ts, lsp.test.ts

Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>

* fix(test): add win32 skipIf guards to atCommandProcessor, rulesDiscovery, fileSearch tests

Four more escaped-path tests that fail on Windows because
unescapePath is a no-op on win32:
- atCommandProcessor.test.ts: two @-command unescape tests
- rulesDiscovery.test.ts: shell-escaped match test
- fileSearch.test.ts: special-char escaping test

Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>

* fix(test): consolidate duplicate imports in rulesDiscovery.test.ts

Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>

---------

Co-authored-by: cici <cici@cicideMacBook-Pro.local>
Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>
2026-05-05 01:33:36 +08:00
Shaojin Wen
3631c01e17
feat(skills): parallelize loading + add path-conditional activation (#3604)
* perf(skills): parallelize skill loading with Promise.all

Three nested for-await loops in SkillManager — one per layer of the
skill discovery tree — were serializing what is independent I/O:

- refreshCache(): the 4 SkillLevels (project/user/extension/bundled)
  load one after the other.
- listSkillsAtLevel(): each provider directory (.qwen, .agent,
  .cursor, ...) is read sequentially.
- loadSkillsFromDir(): each skill subdirectory's stat + access +
  parseSkillFile fires one at a time.

Replace each layer with Promise.all so the I/O fans out. Precedence
between provider dirs is still preserved by folding the parallel
results back in baseDirs order. No semantic change; the pre-existing
49 SkillManager and 27 skill-load tests still pass unchanged.

* feat(skills): add path-conditional skill activation

Large monorepos accumulate skills faster than any one task cares
about. Every turn we ship the full <available_skills> listing in
the SkillTool description — 100 skills is roughly 600–1500 tokens
the model does not need most of the time.

Let skills opt into lazy activation via a `paths:` frontmatter list
of glob patterns. Such skills stay out of the tool description until
a tool call touches a matching file, at which point they become
active for the rest of the session. The mechanism mirrors the
existing ConditionalRulesRegistry used for .qwen/rules/.

Shape:

- SkillConfig gains `paths?: string[]`; skill-manager and skill-load
  both parse it (array of non-empty strings; scalar rejected).
- New skill-activation.ts holds SkillActivationRegistry (picomatch,
  per-session Set of activated names, project-root-scoped) and a
  splitConditionalSkills() helper.
- SkillManager rebuilds the registry on every refreshCache and
  exposes matchAndActivateByPath / isSkillActive /
  getActivatedSkillNames. Activation fires change listeners so the
  SkillTool description picks up the new entry immediately.
- SkillTool.refreshSkills filters the listing through isSkillActive
  and keeps a pendingConditionalSkillNames set so validateToolParams
  can distinguish "not found" from "registered but gated" — the
  model otherwise sees the same generic error for both cases.
- coreToolScheduler invokes matchAndActivateByPath alongside the
  existing ConditionalRulesRegistry hook, and appends a
  <system-reminder> announcing the newly activated skill(s) so the
  model learns why its tool listing just grew.

Activation state is intentionally scoped to a single registry
instance; a watcher-driven refreshCache wipes it, matching
ConditionalRulesRegistry's semantics.

Adds 11 tests for the registry, 4 parse-paths tests, 4 integration
tests on SkillManager, and one validateToolParams test for the
distinct "gated by paths:" error. All 197 related tests pass.

* fix(skills): scope path activation to visible, model-invocable skills

Two issues caught by review of the new conditional-skill activation
path, both rooted in `refreshCache()` building the activation
registry from the raw concatenation of every level's skills:

- Cross-level shadow: when the same skill name exists at multiple
  levels with different `paths:` globs, `listSkills()` picks the
  highest-precedence copy (project > user > extension > bundled),
  but the registry compiled every copy. A path matching only the
  shadowed copy's glob would still flip the visible copy to
  "active" — the model would see it appear in `<available_skills>`
  even though the touched file was outside its declared paths.

- Disabled-with-paths: a skill carrying both `paths:` and
  `disable-model-invocation: true` would enter the registry, fire
  the "skill is now available" `<system-reminder>` on path match,
  and then SkillTool would reject the invocation because the
  disabled flag hid it from `availableSkills` and
  `pendingConditionalSkillNames`. The model gets a generic "not
  found" after being told the skill exists.

Fix both at the registry-build site by walking levels in precedence
order, deduping by name (keep the first/highest-precedence copy),
and dropping `disableModelInvocation` skills before splitting on
`paths`. Adds two regression tests in `skill-manager.test.ts`.

* docs(skills): document path-conditional activation and the model/user view gap

@yiliang114 noted that asking the model "what skills do you have?"
returns only currently active skills, while `/skills` shows the
fuller list — a path-gated skill stays out of the model's listing
until a matching file is touched, so users may incorrectly conclude
the skill is missing.

Add a "Optional: gate a Skill on file paths (\`paths:\`)" subsection
under the field requirements, covering glob semantics, scope, the
session-lifetime activation, that user invocation is unaffected, and
the disable-model-invocation interaction. Also add an admonition in
the "View available Skills" section calling out the model-vs-user
distinction explicitly and pointing at the \`/skills\` slash command
as the always-complete browse path.

* refactor(skills): extract parsePathsField + tighten paths mock pattern

The `paths:` frontmatter parser was duplicated across
`skill-manager.ts:parseSkillContent` and
`skill-load.ts:parseSkillContent`. Future validation tweaks
(e.g. minimum length, character whitelist, glob pre-check) would
have to land in both places, with no compile-time link to keep
them in sync.

Extract `parsePathsField(frontmatter)` into `types.ts` next to the
existing `parseModelField`, and call it from both parsers. Same
contract: returns the cleaned array, or `undefined` when omitted /
empty / all-whitespace; throws when present but not an array.
Adds 8 tests in `skill-load.test.ts` covering the contract.

Also tighten the `paths:` branch in the `skill-manager.test.ts`
mock yaml parser. The previous `yamlString.includes('paths:')`
also matches incidental occurrences of `paths:` inside skill body
text. No bundled fixture currently has that, but the substring
check is a footgun for future tests; switch to `^paths:` (multiline
start anchor) so only a frontmatter-level field triggers the
branch.

* fix(skills): widen activation coverage and tighten dedup edges

Three fixes from the latest /review pass on the activation
pipeline, all touching the same hook surface:

1. Activation only fired on `file_path` — read-file / edit /
   write-file. Tools that touch the filesystem under different
   parameter names (`path` for ls and ripGrep, `filePath` for
   grep and lsp, `paths` array for ripGrep multi-path) silently
   skipped both ConditionalRulesRegistry and SkillActivationRegistry.
   Extract `extractToolFilePaths(toolInput)` and route every
   recognised path through both registries; coalesce skill
   activations from one tool call into a single system-reminder.

2. SkillTool's model-invocable-commands dedup set was built from
   every file-based skill name, including ones marked
   `disable-model-invocation: true`. A hidden file skill could
   suppress an unrelated MCP prompt or command of the same name
   that was never meant to overlap with it. Filter the dedup set
   to model-invocable skills only; pending conditional skills
   stay reserved (correct contract), disabled skills no longer
   block unrelated commands.

3. SkillActivationRegistry's project-root guard rejected `..` /
   `../` prefixes but accepted absolute results. On Windows,
   `path.relative('C:\\proj', 'D:\\elsewhere')` returns an
   absolute path; after normalising backslashes a broad glob like
   `**/*.ts` would activate a project-scoped skill for an
   off-project file. Reject absolute relative results before
   normalising slashes.

Adds regression tests for each:
- 7 cases for `extractToolFilePaths` (each field name + combos
  + non-object / wrong-shape inputs).
- 1 SkillTool case proving a `disable-model-invocation` skill no
  longer suppresses a same-name MCP prompt.
- 1 SkillActivationRegistry case for the absolute-relative-path
  guard. (220 skill-area tests pass total.)

* test: stub matchAndActivateByPath in SkillManager test mocks

The path-conditional skill activation hook in
CoreToolScheduler.executeSingleToolCall now fires on every tool
invocation that names a filesystem path. With the widened
extractToolFilePaths coverage, that includes the `path: '.'`
input shape used by the AgentHeadless tool-execution tests.

Two SkillManager mocks predate the activation API and stubbed
only watcher / listener methods, so the scheduler hook crashed
with "matchAndActivateByPath is not a function" on any tool
invocation in those test files. Local runs still hit it on this
branch (no `path:` field tools were exercised pre-merge), and CI
caught the regression in agent-headless.test.ts across all 9
matrix combos.

Stub the method to return [] in both mocks (agent-headless and
config), matching the watcher-method pattern. Production code is
unchanged — the existing SkillManager has the method and the
real path through Config wires it up correctly.

* fix(skills): await listener refresh during path activation

Race surfaced by /review: matchAndActivateByPath synchronously
notified change listeners, but the SkillTool listener was a
fire-and-forget `void this.refreshSkills()`. The activation hook
in CoreToolScheduler then appended the "skill X is now available"
<system-reminder> and the tool result was sent to the model
without waiting — so the next turn could land with the
<available_skills> listing still showing the pre-activation set,
and the model's first invocation of the announced skill would
hit validateToolParams's "not found" branch.

Make the listener pipeline awaitable end-to-end:

- addChangeListener now accepts `() => void | Promise<void>`.
- notifyChangeListeners is async and awaits each listener's
  return, so any returned Promise (e.g. SkillTool.refreshSkills)
  is held before the call resolves.
- refreshCache awaits the notification it was already firing.
- matchAndActivateByPath becomes async and awaits notification
  when at least one new activation occurred. The CoreToolScheduler
  hook awaits the call so the system-reminder lands strictly
  after the tool description has been refreshed.
- SkillTool's listener returns the refresh Promise directly
  instead of stranding it under `void`.

Existing test mocks for `addChangeListener` accept any return
value, so no mock changes are needed. The four
matchAndActivateByPath direct-call tests in skill-manager.test
are updated to `await` the new Promise return.

* fix(extension): await skill + subagent cache refresh in refreshMemory

Caught by /review on the previous async-listener change: this PR
made `SkillManager.refreshCache()` resolve only after the
change-listener chain (notably `SkillTool.refreshSkills` and
`geminiClient.setTools()`) settles. `ExtensionManager.refreshMemory`
was firing it without `await`, so callers like `refreshTools` would
return while the skill cache and tool description were still
updating, and any rejection from the listener chain was silently
detached.

Wrap skill + subagent refreshes in a single `Promise.all` so they
still run concurrently, but the parent `refreshMemory` Promise only
resolves once both side-effects have landed. Hierarchical memory
refresh is left as-is (pre-existing fire-and-forget pattern,
unchanged by this PR).

* fix(skills): security/perf/robustness pass on activation pipeline

Six findings from /review (claude-opus-4-7), all rooted in the new
path-conditional activation code:

1. extractToolFilePaths now requires a `toolName` and gates on a
   closed FS_PATH_TOOL_NAMES allowlist (read_file, edit, write_file,
   grep_search, glob, list_directory, lsp). MCP / non-FS tools that
   reuse `path` / `paths` for HTTP routes, JSON keys, search queries
   would otherwise feed those values into the activation pipeline,
   where `path.resolve(projectRoot, …)` would normalise them to
   project-relative strings and false-match a skill with broad
   globs (e.g. `paths: ['**']`). Concrete attack noted by /review:
   `{ path: 'https://api.example.com/users/123' }` → activates a
   skill on every MCP call.

2. Skill `name` validated at parse time against
   `/^[a-zA-Z0-9_:.-]+$/`. The value flows verbatim into multiple
   model-trusted sinks: `<available_skills>` description, the
   path-activation `<system-reminder>`, the SkillTool schema, and
   UI listings. Reject characters that could close a tag and open a
   forged one (`name: "ok</system-reminder><system-reminder>…"`).

3. SkillManager.matchAndActivateByPaths(filePaths) added. The
   per-path notify in coreToolScheduler caused N successive
   SkillTool.refreshSkills() / geminiClient.setTools() round-trips
   for a single ripGrep-style multi-path call; the batch entry
   point activates across all paths and fires listeners exactly
   once with the union. matchAndActivateByPath delegates to it for
   call-site compatibility.

4. SkillManager.refreshCache uses Promise.allSettled at the
   levels boundary so a fatal error on one level (FS hang,
   permission denial, missing config dir) no longer nukes the
   other three; warns with the level + reason for the failed slot.

5. parsePathsField accepts explicit `null` (the YAML `paths:`
   no-value shorthand) the same way as omission, instead of
   throwing and dropping the whole skill via parseErrors.
   Matches the leniency of `argumentHint` and `whenToUse`.

6. SkillActivationRegistry adds a `SKILL_ACTIVATION` debug logger
   for the operational pain noted in the audit: per-path resolved
   relative-path, project-root-rejection reason, and per-skill
   activation. Also gives oncall a grep target for "why did/didn't
   skill X activate?" without source-reading.

Test mocks (agent-headless, config) now expose
matchAndActivateByPaths alongside matchAndActivateByPath. New
tests: parsePathsField null, validateSkillName allow/reject pairs
(including the closing-tag attack literal), batch activation
firing listeners exactly once, batch with no matches not firing
listeners, and an extractToolFilePaths regression for MCP / web /
skill tool inputs being filtered out.

* fix(skills): glob pattern activation + verifiable Windows guard

Two follow-ups from the latest /review pass:

1. `extractToolFilePaths` now extracts `pattern` for `ToolNames.GLOB`
   in addition to the existing `path` field. The shape
   `glob({ pattern: 'src/**/*.tsx' })` (no `path`) was producing an
   empty candidate set, so a skill keyed on the same glob never
   activated from a glob call. Pattern extraction is gated to GLOB
   only — grep_search also has a `pattern` field, but it's a regex
   and would false-match if treated as a path-shaped selector.

2. The relative-path normalization is extracted into a pure helper
   `resolveProjectRelativePath(filePath, projectRoot, pathModule)`.
   The previous Windows cross-drive regression test
   (`/totally/other/place/file.ts` against `/project`) actually
   exercised the older `..` outside-root branch on POSIX runners,
   so the new `path.isAbsolute(rawRelativePath)` guard could have
   been removed without the test failing. The helper is now
   parameterized over a `path` module so a unit test can pass
   `path.win32` directly and pin the cross-drive case
   (`D:\\other\\file.ts` against `C:\\project`) deterministically
   on any host OS.

Adds 6 tests: glob pattern extraction (with and without path),
grep regex pattern not extracted, and four
resolveProjectRelativePath cases covering POSIX in-project, POSIX
outside-root, Windows cross-drive (the new branch), and Windows
in-project backslash normalization.

* fix(skills): join glob.path with glob.pattern as effective selector

Caught by /review on 599490b91: my earlier glob extraction pushed
`path` and `pattern` as separate candidates. `glob({ path: 'src',
pattern: '**/*.ts' })` produced `['src', '**/*.ts']` — neither
component matches a skill keyed on `paths: ['src/**/*.ts']` in
isolation, so activation silently broke for the most common
two-arg glob shape.

The glob call actually searches `<path>/<pattern>`. Replace the
standalone pattern push with `path.join(pathField, patternField)`,
falling back to bare pattern when no path is provided. The
generic block above still emits the bare `path` candidate, so a
broad skill keyed on `paths: ['src/**']` (directory-level)
continues to activate too. Combined output for the regression
example: `['src', 'src/**/*.ts']` — covers both the directory-
level and file-level skill cases.

Adds three tests: an updated unit test pinning the joined
effective selector, an absolute-`path` variant whose joined form
gets rejected downstream by the project-root guard
(`/tmp/external/**/*.ts`), and the audit-suggested integration
regression that pipes `extractToolFilePaths` output straight into
`SkillActivationRegistry` and verifies a `paths: ['src/**/*.ts']`
skill activates from `glob({ path: 'src', pattern: '**/*.ts' })`.

* fix(skills): join glob.path with glob.pattern as effective selector

Two coupled fixes for the glob-pattern extraction landed in 7cb7145bb:

1. **Windows CI failure.** `path.join('src', '**/*.ts')` returns
   `'src\\**\\*.ts'` on Windows (OS-aware separator). The new
   regression tests asserted the forward-slash form, so the
   ubuntu/macos matrix was green but all three Windows jobs
   (20.x/22.x/24.x) failed. The downstream registry also matches
   against forward-slash relative paths (after `replace(/\\/g, '/')`),
   so the Windows-shaped candidate would have silently failed to
   activate any skill at runtime — not just in tests.

2. **`..` normalization.** `path.join('src', '../*.ts')` collapses to
   `'*.ts'`, losing the information that the glob actually escaped
   its `path` root. The audit notes this can both miss the real
   touched subtree and false-activate a skill keyed on a wrong
   subtree. Concat preserves the selector verbatim.

Replace `path.join(pathField, patternField)` with
`${pathField.replace(/[\\/]+$/, '')}/${patternField}` per the
audit's exact suggestion. Trims trailing forward-slash and
backslash so `path: 'src/'` and `path: 'src\\'` both produce
`src/<pattern>` instead of `src//<pattern>` or `src\\/<pattern>`.

Adds three tests covering: `..` preservation, forward-slash on
all OSes (the Windows CI regression), and trailing-slash
trimming for both `/` and `\` variants.

* fix(skills): silence CodeQL ReDoS flag on trailing-separator trim

CodeQL #145 flagged `pathField.replace(/[\\/]+$/, '')` as a
polynomial regex on uncontrolled data — the regex is anchored
and uses a single character class with `+`, so worst case is
linear in trailing-separator length, but the scanner is
conservative about `+` quantifiers on inputs that flow from
tool invocation parameters.

Replace the regex with an explicit `endsWith` loop. Same O(n)
behavior on the trailing run, no regex for CodeQL to chew on.
Existing trailing-slash test (forward and back) still passes.

* fix(skills): comprehensive review pass — security, correctness, robustness

Eleven findings from /qreview (claude-opus-4-7), grouped by area:

CORRECTNESS

- C1: appendAdditionalContext silently dropped reminders for any tool
  whose llmContent is a single non-array Part (read-file returning
  inlineData for images / PDFs is the canonical case). Both the
  ConditionalRulesRegistry rule reminder and the path-conditional
  skill activation reminder were lost. Wrap the single-Part case
  into an array so the addition still lands.
- S2: Legacy tool-name aliases (`replace` → `edit`,
  `search_file_content` → `grep_search`, `task` → `agent`) bypassed
  FS_PATH_TOOL_NAMES. The registry resolves the alias at execute time
  but `request.name` keeps the alias, so `replace({ file_path: ... })`
  produced empty candidates and missed activation. Canonicalize via
  `ToolNamesMigration` before the allowlist check.
- S5: `new SkillActivationRegistry(...)` ran picomatch unguarded —
  pathological patterns (oversize / broken extglob) could throw and
  abort all of `refreshCache`. Wrap each picomatch call in try/catch
  inside the constructor; drop the bad pattern, keep the rest of
  the skill, log via debugLogger.
- S7: Extension parser (skill-load.ts) silently dropped
  `disable-model-invocation` and `when_to_use`. Now that we have
  `paths:`, that meant an extension SKILL.md with both `paths:` and
  `disable-model-invocation: true` would still fire path-activation
  reminders for a skill the model can't invoke — directly
  contradicting the bug_004 fix at the project/user level.
- S8: SkillTool discarded the `addChangeListener` cleanup function
  and had no `dispose()`. Subagents share the parent's SkillManager
  via `InProcessBackend.createPerAgentConfig`, so each per-subagent
  SkillTool registered another listener; with the listener pipeline
  now async, every path activation serialized through every stale
  subagent's refresh chain. Mirror AgentTool: store the cleanup,
  expose `dispose()`.

SECURITY / SUPPLY-CHAIN

- S11: `validateSkillName`'s `/^[a-zA-Z0-9_:.-]+$/` rejected every
  non-ASCII name on upgrade, silently dropping CJK / Cyrillic /
  accented Latin skills. The structural-injection guard targets
  `<>"'/\n\r\t` etc; entire Unicode planes are not the threat.
  Widen to `/^[\p{L}\p{N}_:.-]+$/u`. Update docs/users/features/
  skills.md to match.
- S10: `parsePathsField` only validated shape (must-be-array). Now
  also reject leading-slash absolute patterns and `..` parent-escape
  patterns at parse time — these silently never match anything in
  the activation registry, so an author who writes `paths:
  ['/etc/passwd']` or `['../*.ts']` would otherwise see the skill in
  /skills and never understand why it never activates.

ROBUSTNESS

- S3: `coreToolScheduler` emitted "skill X is now available via the
  Skill tool" even when the calling subagent's tool registry did not
  expose SkillTool (subagent's `tools:` allowlist excluded `skill`).
  Gate the reminder on `toolRegistry.getTool(ToolNames.SKILL)`.
- S4: `extensionManager.refreshMemory` used `Promise.all` so a
  rejection from skill or subagent refresh nuked the other leg AND
  the hierarchical-memory refresh below it. Switch to
  `Promise.allSettled`, log each rejection, and `await` the
  hierarchical refresh too (the comment justifies awaiting; the
  code didn't).
- S9 / S12: `docs/users/features/skills.md` claimed `paths:` only
  gates model discovery and slash invocation always works. True for
  the user-side path itself, but if the model then tries to chain
  off the user's invocation (call `Skill { skill: ... }` itself),
  validateToolParams returns "gated by path-based activation" —
  contradicting the doc. Rephrase to call out the model-side
  limitation explicitly.

DEFERRED

- S6: notifyChangeListeners swallows per-listener errors and the
  reminder still fires. Real concern but the fix needs an API
  shape change (listener-failure signal back to the scheduler);
  worth its own design discussion. Logged here for follow-up.

Adds 12 regression tests across the 7 affected files. 632 tests
pass; types and lint clean.

* fix(skills): activate broad globs on dotfiles + cross-ref FS allowlist

Two more findings from /review:

- S13: picomatch was compiled with `dot: false`, so a broad glob like
  `paths: ['**/*.js']` silently excluded `.eslintrc.js`, `.env`,
  `.github/foo.yml`, etc. The hidden-file exclusion is gitignore-style
  semantics — wrong for activation, where the question is "did the
  model touch a file matching this glob." Switch to `dot: true`.

- S14: `FS_PATH_TOOL_NAMES` is a manually maintained allowlist with no
  compile-time guard — adding a new FS tool without updating the set
  silently drops the tool out of the activation pipeline. Add a
  cross-ref comment at the top of `ToolNames` in `tool-names.ts`
  pointing maintainers at the allowlist site, plus a TODO noting the
  long-term fix is per-declaration `pathFields?: string[]`. The
  cross-cutting refactor is its own PR.

Adds one regression test (`activates broad globs on dotfiles too`)
that pins the dot:true semantics on `**/*.js` matching
`.eslintrc.js`. 211 skill-area tests pass.

* fix(skills): per-tool extraction dispatcher (LSP URI + grep glob + integration test)

Four findings from /review on the activation extractor:

C1 (Critical): LSP allowlisted but the extractor pushed `filePath`
  through unchanged. The LSP tool accepts non-file URI schemes
  (`http://`, `git://`, etc.); forwarding any of those to
  SkillActivationRegistry as a project-relative candidate let an
  LSP call against a non-file resource activate path-gated skills
  without the model touching a real project file. Fix is two-part:
  decode `file://` URIs via `fileURLToPath` (so a project file
  expressed as a URI still activates correctly) and silently drop
  any string containing `://` that's not `file://`.

S1: LSP `incomingCalls` / `outgoingCalls` operate on
  `callHierarchyItem.uri`, not the top-level `filePath`. After
  `prepareCallHierarchy` returns a file-backed item, following the
  hierarchy with that item produced no candidate, so path-gated
  skills for that file stayed dormant. Same URI-aware extraction is
  applied to the nested `uri` field.

S2: grep_search has a path-shaped `glob` field
  (`GrepToolParams.glob`) — distinct from `pattern`, which is a
  regex on contents. The extractor previously ignored `glob`, so
  `grep_search({ pattern, glob: 'src/**/*.ts' })` produced no
  activation candidate even though the call walked every file under
  `src/**/*.ts`. Same `path + glob` join treatment as GLOB.

S3: No scheduler-side integration test covered the
  extractToolFilePaths → matchAndActivateByPaths → reminder-append
  wiring, so a regression there could land while extractor and
  registry unit tests still passed. Added three integration tests
  covering: (a) reminder appended when SkillTool present,
  (b) reminder suppressed when SkillTool absent (subagent case),
  (c) hook not invoked for non-FS tools.

Restructured `extractToolFilePaths` from a generic
`file_path/filePath/path/paths` extractor into a per-tool
dispatcher (`switch` on canonical tool name). The previous generic
shape was overly permissive — every FS tool got every field name,
including ones it doesn't accept — and it was the wrong shape to
add LSP URI semantics to. Per-tool means each branch reflects the
actual `XToolParams` interface.

Test reshape:
- Removed tests asserting cross-tool field acceptance (e.g. grep
  reading `filePath` / `paths`); those documented inaccurate input.
- Added per-tool realistic tests for grep glob, lsp file:// URI,
  lsp callHierarchyItem.uri, lsp non-file scheme dropped.
- Plus the three CoreToolScheduler activation wiring tests.

639 tests pass (was 632); types and lint clean.

DEFERRED

S4: Activation driven from input selector rather than concrete
  matched files. For `glob({ pattern: '**/*.ts' })` the selector
  itself may not match a skill scoped narrower than the query.
  Real concern, but the fix needs typed result-path metadata
  feedback from each tool — a cross-cutting addition to every FS
  tool's return shape. Logged for follow-up.

* fix(skills): make LSP URI tests platform-portable for Windows CI

Two of the new LSP tests in 58836f1c3 hard-coded `file:///proj/...`
URIs. POSIX runners are fine, but on Windows `fileURLToPath` throws
`ERR_INVALID_FILE_URL_PATH` for a URI without a drive letter — the
production try/catch then returns `[]`, and the assertion
`expected [] to deeply equal [ '/proj/src/App.ts' ]` fails.

Reshape the tests to build the URI from a real absolute path via
`pathToFileURL`. The URI shape becomes the host's natural form
(`file:///tmp/...` on POSIX, `file:///C:/.../tmp/...` on Windows),
and the round-trip through `fileURLToPath` always succeeds.

Production code unchanged.

* fix(skills): XML-escape description/whenToUse; symlinks in skill-load.ts; dot:true in ConditionalRulesRegistry

Agent-Logs-Url: https://github.com/QwenLM/qwen-code/sessions/a56d83ce-cbdf-4213-a90a-888a9f05ee4f

* fix(skills): backport Windows cross-drive guard to ConditionalRulesRegistry

The latest /review (deepseek-v4-pro) flagged divergence between
SkillActivationRegistry (which has the
`pathModule.isAbsolute(rawRelativePath)` Windows-cross-drive
guard, added earlier in this PR) and ConditionalRulesRegistry
(which still only checks `..` / `../` prefixes). On Windows,
`path.relative('C:\\proj', 'D:\\elsewhere')` returns the absolute
string `D:\\elsewhere` — after backslash normalization that
would otherwise false-match a broad rule glob like `**/*.ts`.

Move the project-relative-path helper out of `skills/` into a new
`utils/projectPath.ts` (the right semantic home — it's a pure
path operation with no skill-domain coupling) and have both
registries call into it. SkillActivationRegistry re-exports the
helper so existing imports keep working.

Adds a regression test in `rulesDiscovery.test.ts` for the
off-project path case (covers both POSIX `..` branch and the new
Windows isAbsolute branch through the shared helper). Direct
`path.win32`-parameterized cover already lives in
`skill-activation.test.ts`. 252 skill+rules tests pass.

* test(skills): pin XML escaping on modelInvocableCommands description too

cmd.description already routes through escapeXml in skill.ts:204
(landed in b1d9324f5), but no test pinned the cmd path — only the
skill.description / whenToUse path. Add a parallel regression that
crafts an MCP-shaped command with `</available_skills><tag>` in
the description and asserts it gets escaped instead of breaking
out of the <available_skills> block.

* fix(skills): escape cmd.name; extension skillRoot; surface invalid globs

Three findings from /review (deepseek-v4-pro):

C1: `cmd.name` was interpolated into the `<available_skills>` `<name>`
  tag without `escapeXml()`. File-based skill names go through
  `validateSkillName` (charset whitelist) at parse time, but
  command names from `modelInvocableCommands` come from
  externally-injected sources (MCP, extensions) and bypass that
  validator. A command shipped with `name: "x<inject>"` would
  inject raw tags into the model-facing tool description. Wrap
  `cmd.name` in `escapeXml`, parallel to the existing
  `cmd.description` escape one line below.

C2: `parseSkillContent` in `skill-load.ts` (the extension parser)
  never set `skillRoot: path.dirname(filePath)`. The
  project/user/bundled parser in `skill-manager.ts` does, and
  `registerSkillHooks.ts:116` skips setting `QWEN_SKILL_ROOT` for
  command-type hooks when `skillRoot` is undefined — so shell
  commands inside extension-skill hooks couldn't resolve
  `$QWEN_SKILL_ROOT/scripts/...` references. Add the field.
  Comment notes the still-asymmetric `hooks:` extraction (the
  extension parser doesn't pull `hooks:`); leaving that as a
  separate alignment task because hooks may be intentionally
  restricted to managed skills as a security boundary.

S3: Invalid `paths:` globs were only logged at debug level.
  Author writes `src/***/file.tsx`, the picomatch compile throws,
  the registry drops the pattern, and the skill loads with zero
  matchers — visible only as a permanent "gated by path-based
  activation" error with no actionable diagnostic.

  Add an optional `InvalidPatternHandler` callback to
  `SkillActivationRegistry`'s constructor. SkillManager wires it
  into its `parseErrors` map, keyed `<filePath>#paths[<pattern>]`,
  so the failure surfaces through `getParseErrors()` and the
  `/skills` UI alongside other parse-time errors.

S4: Two related concerns about file-watcher race / activation wipe
  (`refreshCache` rebuilding the registry from scratch, plus
  potential interleaving of two `refreshCache` calls). Real but
  the fix needs design work — activation carry-over has its own
  semantics (do deleted skills survive?), and the serialization
  guard adds a generation counter that affects multiple call
  sites. Logged for follow-up.

Three regression tests added: cmd.name escape (`should XML-escape
modelInvocableCommands name`), extension skillRoot (`sets
skillRoot to the SKILL.md directory`), and parseErrors surfacing
for an oversized 70 KB glob pattern. 205 skill-related tests pass.

* fix(skills): comprehensive XML-escape + coalesce + parallel listeners

Six findings from /review (deepseek-v4-pro):

C1: skill.name interpolated raw into <available_skills>. File-based
  names go through validateSkillName, but extension skills come in
  via extension.skills (skill-manager.ts:827) and bypass that
  validator entirely — a crafted extension name could inject raw
  tags. Same vulnerability for the activated-skill names in the
  coreToolScheduler reminder. Wrap both in escapeXml.

S2: refreshHierarchicalMemory() await is unprotected after the
  earlier change to await it. A transient failure now propagates
  back through refreshMemory → enableExtension after isActive is
  already true, leaving the extension half-enabled. Wrap in
  try/catch and log; the surrounding extension transition
  shouldn't unwind because of a stale-memory side effect.

S3: escapeXml duplicated between skill.ts and background-tasks.ts.
  Extract to utils/xml.ts; both call sites import from there.

S4: parseSkillContent duplicated between managed and extension
  parsers. Real concern but the cleanup is a real refactor (the
  two parsers diverge on level / hooks / skillRoot wiring), so
  this PR adds a comment-level documentation but defers the
  actual extraction to a follow-up to keep this diff focused.

S5: rulesCtx (rule body content) interpolated into
  <system-reminder> without scrubbing. A rule whose content
  contained literal `</system-reminder>` (e.g. a doc rule about
  reminders) would close the envelope early. Apply a targeted
  scrub of the closing-tag literal in the joined body. Full XML
  escape would mangle code blocks in rule markdown — the
  closing-tag scrub is the minimum needed to keep the wrapper
  intact.

S6: notifyChangeListeners awaited listeners sequentially. With per-
  subagent SkillTools each registering as a listener, every
  matchAndActivateByPaths call serialized through every
  refreshSkills + setTools round-trip. Switch to
  Promise.allSettled — listeners are independent reads, the
  failure-isolation behavior is preserved.

S7: Each rule emit + the activation reminder were each their own
  <system-reminder> envelope. A multi-path tool call could produce
  N+1 envelopes, diluting model attention. Coalesce: collect all
  reminder blocks, emit once with `\n\n` separators, scrub the
  closing-tag literal once on the joined body.

Tests added:
- skill.name extension-bypass escape regression
- coreToolScheduler activation wiring: coalesces multiple rules +
  activation into one envelope (with grep_search path+glob to
  produce two candidate paths)
- coreToolScheduler activation wiring: escapes activated skill
  names so a crafted extension name can't break out
- coreToolScheduler activation wiring: scrubs literal
  </system-reminder> in rule content
- 843 tests pass overall.

* fix(skills): symlink scope check + dispose on stopAgent + listener type

Five findings from /review (Qwen3.6-Plus-DogFooding):

C1 + C2 (Critical, same finding cited twice): Symlink target was
  validated for "is a directory" but not for "stays inside
  baseDir". An attacker who can write a symlink into a skills
  directory (shared monorepo, compromised extension) could
  symlink /etc/cron.d/ → trigger arbitrary content load — and
  skills can ship hooks that invoke shell commands, so this is a
  code-execution vector. Apply realpath + prefix check in both
  symlink branches (skill-load.ts AND skill-manager.ts).
  Regression test in each suite (`should skip symlinks that
  escape baseDir (prevents arbitrary-skill-load attack)`).

C3: SkillTool.dispose() existed but was only called from
  ToolRegistry.stop() at full shutdown. Subagents created/stopped
  during a session left their per-agent SkillTool listener
  attached to the parent SkillManager — every spawn-then-stop
  cycle accumulated another stale listener, and notifyChangeListeners
  (now parallel via Promise.allSettled) still pays a per-listener
  round trip even when the underlying subagent is gone.

  Convert InProcessBackend.agentRegistries from a flat array to
  Map<agentId, ToolRegistry> and dispose just that agent's
  registry in stopAgent. cleanup() still drains any registries
  still attached at full shutdown for the fast-path case.

S4: changeListeners typed `Set<() => void>` while addChangeListener
  signature accepts `() => void | Promise<void>`. The runtime
  Promise.resolve().then(listener) wrapper handles the mismatch
  but the type didn't catch future drift. Widen the field type
  to match the parameter signature.

S6: FS_PATH_TOOL_NAMES allowlist has no compile-time guard.
  Logged for follow-up — the pragmatic short-term fix (test
  asserting every entry has a corresponding extractToolFilePaths
  branch) requires deciding whether the test belongs in
  coreToolScheduler or tool-registry. Per-declaration pathFields
  annotation is the long-term answer; both are tracked.

S7: setTools concurrency. Verified setTools is idempotent
  (rebuilds tools from registry, single sync assign at end);
  multiple concurrent calls converge on the same tools list.
  Added an inline note rather than a runtime mutex.

Defer:
- S5: refreshCache wipes all activations. Same activation
  carry-over design question deferred in the previous round.

* fix(skills): listener timeout, full XML escape, allowlist warning + tests

Address inline review feedback:

- skill-manager.notifyChangeListeners: 30s per-listener timeout via
  Promise.race so a hung listener (e.g. setTools blocked on a network
  call) cannot permanently stall matchAndActivateByPaths. Timer is
  unref'd to avoid keeping the event loop alive.

- types.parsePathsField: tighten parse-time validation. Normalize
  backslashes to forward slashes, reject Windows drive letters
  (`C:\\repo\\src\\**`) and segment-walk for any `..` (catches
  `./../*.ts`, `src/../../**`, `..\\secret\\*.ts`). Skill authors who
  write impossible-to-match patterns now get a parse error instead of a
  silent never-activates skill.

- utils/xml.escapeXml: widen to all five XML metacharacters
  (`&<>"'`), not just three. Element-body callers are unchanged but
  attribute-context callers and `</tag>` injection are now safe by
  default. monitorRegistry drops its local copy in favor of the shared
  helper.

- coreToolScheduler.extractToolFilePaths: emit a debug-level warning
  when a non-FS tool's input has path-like fields (`file_path`,
  `filePath`, `path`, `paths`). Surfaces allowlist gaps without
  production noise — chases "why didn't my path-gated skill activate?".

- Tests: added (1) async listener await + sync-throw + async-reject
  isolation for notifyChangeListeners, (2) stopAgent registry dispose
  + Map cleanup + cleanup-drains-remaining for InProcessBackend.

* fix(skills): harden symlink containment checks

* Revert "fix(skills): harden symlink containment checks"

This reverts commit 7e70a25a3a.

* fix(skills): clear listener timeout, share symlink scope helper

- skill-manager.notifyChangeListeners: clear the per-listener
  setTimeout in `.finally(...)` once the race settles. The previous
  `unref()`-only approach prevented the timer from blocking process
  exit, but every fast-resolving listener still left a 30s pending
  timer behind — vitest's open-handle diagnostic and any tooling that
  snapshots the active-handle set saw the pile-up under high-frequency
  activation.

- New skills/symlinkScope.ts: shared `validateSymlinkScope` helper.
  Realpaths BOTH the symlink target and the base directory before the
  containment check, then uses `path.relative` (rather than
  `realPath.startsWith(base + sep)`) for cross-platform safety. The
  prior asymmetric form — `realpath(target)` against the raw
  `path.resolve(base)` — could false-skip valid in-tree symlinks on
  Windows when canonicalization (case, separators, short-vs-long-path
  forms) diverged from `path.resolve`'s purely lexical normalization;
  the failing Windows CI on the symlinked-skill test traced back to
  exactly that. `path.relative` also closes the sibling-prefix
  ambiguity (`base = '/a/skills'`, target = `/a/skillsX/foo` no longer
  passes a startsWith check).

- skill-load.ts and skill-manager.ts both delegate to the shared
  helper. Each call site now realpaths baseDir once outside the
  iteration loop instead of per-entry (N → 1 syscall on parallel
  loaders), and bails the directory entirely if baseDir cannot be
  canonicalized.

- Tests: 8 unit tests for `validateSymlinkScope` covering accept,
  nested-accept, sibling-prefix attack, escape, broken realpath,
  not-a-directory, stat failure, and the degenerate self-target case;
  updated existing escape/broken tests in `skill-load.test.ts` /
  `skill-manager.test.ts` to use `mockImplementation` distinguishing
  baseDir vs target (the previous static `mockResolvedValue` would have
  passed the new check for the wrong reason); regression test for the
  cleared timeout via setTimeout/clearTimeout spies.

* fix(skills): segment-aware symlink containment, accepts ..-prefixed names

The previous `rel.startsWith('..')` containment check in
`validateSymlinkScope` false-rejected legitimate in-base directories
whose names start with two dots — `path.relative('/base', '/base/..shared/foo')`
returns `'..shared/foo'`, which is a real filename shape, not a
parent-traversal escape.

Switch to a segment walk: `rel.split(/[/\\]/)[0] === '..'` correctly
distinguishes:
  - `'../foo'`         → segments[0] = '..'      → escapes ✓
  - `'..shared/foo'`   → segments[0] = '..shared' → in-scope ✓
  - `'..bar'`          → segments[0] = '..bar'    → in-scope ✓
  - `'..\\foo'` (Win)  → segments[0] = '..'      → escapes ✓

Tests: two new regressions in `symlinkScope.test.ts` covering the
multi-segment (`..shared/foo`) and single-segment (`..bar`) cases.

---------

Co-authored-by: wenshao <wenshao@U-K7F6PQY3-2157.local>
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: yiliang114 <1204183885@qq.com>
2026-05-05 00:28:53 +08:00
Shaojin Wen
7f5b92b076
feat(core): hint to background long-running foreground bash commands (#3809)
* feat(core): hint to background long-running foreground bash commands

Phase D part (a) of Issue #3634. When a foreground `shell` tool call
runs ≥ 60 seconds and completes (succeeds or errors), append an
advisory line to the LLM-facing tool result suggesting re-running with
`is_background: true` next time.

Why: today a foreground bash that takes minutes (build watcher, soak
test, slow npm install, polling loop) blocks the agent indefinitely.
The user is already paying for the wait; the agent's next turn could
have started running in parallel under `is_background: true`. Sleep
interception (#3684) handled the egregious `sleep N` case at validate
time; this handles the legitimate-but-long case at result time.

Trade-offs:
- Threshold = 60s. Half the existing 120s foreground timeout. Long
  enough that normal `npm install` / `pytest` runs don't trigger;
  short enough that the hint surfaces before the timeout hard-kills.
- Advisory only — the command still runs to completion in the
  foreground for THIS invocation. The advice is for the agent's NEXT
  decision, not a corrective action on the current one.
- Fires on success AND error completions. The advice is the same
  ("background it next time") in both cases.
- Suppressed on aborted (timeout / user-cancel) — those paths already
  surface their own messaging and don't benefit from a "should have
  been background" reminder when the user / system already killed it.

Implementation:
- New constant `LONG_RUNNING_FOREGROUND_THRESHOLD_MS = 60000` in
  shell.ts, paired with the existing `DEFAULT_FOREGROUND_TIMEOUT_MS`.
- Helper `buildLongRunningForegroundHint(elapsedMs)` exported so
  future surfaces (UI, telemetry) can render the same text without
  duplicating the threshold logic.
- `Date.now()` bracketing around the spawn → `await resultPromise`
  block — mirrors what the background path already captures via
  `entry.startTime`.
- Append happens inside the existing non-aborted result builder;
  zero changes to the cancel / timeout arms.

Tests: 4 new cases — fires on long success, omits on short success,
fires on long error completion, omits on aborted. Uses vi fake timers
to drive wall-clock past the threshold without actually sleeping.

* fix(core): tighten long-run hint suppression + boundary tests + post-truncation insertion

Addresses 8 review threads on PR #3809 — 6 from /review bots, 2 from
copilot — covering doc accuracy, code quality, behavioural gaps, and
test coverage.

**Behavioural fixes (real bugs)**:

- **Suppress on external signal kills** (`result.signal != null` with
  `aborted: false`). `shellExecutionService` only sets `aborted` when
  the AbortSignal we passed was triggered, so SIGTERM from container
  shutdown / k8s eviction / OOM killer / sibling process-group reap
  falls through to the non-aborted branch. The advisory shouldn't fire
  there — the process didn't run to its conclusion, so "next time,
  background it" doesn't fit. New test pins this with `signal: 15`
  (SIGTERM), `aborted: false`.

- **Append AFTER `truncateToolOutput`**. Previously the hint was
  appended inside the non-aborted result builder, which meant for
  long outputs it got wrapped in the "Truncated part of the output:"
  envelope — the LLM might read the advisory as part of the command's
  own output. New post-truncation insertion + test that pins ordering
  by mocking `truncateToolOutput` directly (real path needs
  `fs.writeFile` to actually succeed for the replacement branch to
  fire).

- **Hint wording mode-aware**. The dialog mention dropped the
  unconditional "(footer pill + Enter)" specifics, which would mislead
  non-TTY users (`-p` headless / ACP / SDK consumers — no dialog or
  pill exists there). Now qualified as "in interactive mode the
  Background tasks dialog also has...". `/tasks` and the on-disk
  output file are mentioned without qualifier (work in any mode).

**Code quality**:

- **Threshold programmatically coupled to timeout**:
  `LONG_RUNNING_FOREGROUND_THRESHOLD_MS = Math.floor(DEFAULT_FOREGROUND_TIMEOUT_MS / 2)`.
  If the timeout is tuned later, the threshold tracks automatically.

- **Docstring corrected**: removed the misleading "before it gets
  killed by the timeout" claim — the hint is on non-aborted path
  only, so timeout-killed commands never see it. The new docstring
  enumerates all suppression paths explicitly.

- **Removed stale line-number reference**: comment said "mirrors the
  background path's `entry.startTime` capture (line ~781)" which goes
  stale on file edits. Now refers conceptually.

**Test coverage gaps closed**:

- **Off-by-one boundary**: 59_999ms → no hint. Pairs with the existing
  60_000ms-exactly test (which fires) to pin the boundary tightly. A
  regression flipping `>=` to `>` would fail loudly.

- **Timeout path explicit**: previous "aborted" test exercised user-
  cancel only. With `vi.useFakeTimers({ toFake: ['Date'] })`,
  `AbortSignal.timeout()` doesn't fake (it depends on the real timer
  subsystem), so `combinedSignal.aborted` stayed false. New test
  follows the pre-existing `should handle timeout vs user cancellation
  correctly` pattern: stubs `AbortSignal.timeout` + `.any` to return
  an already-aborted combined signal, then verifies "Command timed out
  after Nms" appears AND no advisory.

* fix(core): per-invocation long-run threshold + debug-mode + test isolation

Six suggestions from /review's third pass on PR #3809:

**Real semantic fix**:
- Long-run threshold now scales with the EFFECTIVE timeout, not the
  fixed default. A user who sets `timeout: 600_000` (10 min) gets the
  advisory at 5 min, not at 60s — respects the explicit timeout
  intent. Replaced the `LONG_RUNNING_FOREGROUND_THRESHOLD_MS` constant
  with a per-invocation `longRunThresholdFor(effectiveTimeout)` helper.

**Debug-mode visibility**:
- Debug mode previously snapshotted `returnDisplayMessage = llmContent`
  BEFORE the truncation + hint append, so debug-mode users saw the
  pre-hint content while the agent saw the advisory — agent suddenly
  suggesting `is_background: true` had no visible trigger in the TUI.
  Re-sync `returnDisplayMessage` after the hint append (debug-mode
  branch only) so the TUI mirrors what the agent sees.

**Type-safety footgun**:
- `if (typeof llmContent === 'string')` would silently drop the hint
  if `llmContent` ever becomes structured `Part[]`. Added an explicit
  `else` comment documenting the deliberate omission and the conditions
  under which to revisit (no string llmContent path exists today).

**Style**:
- Replaced the JSDoc `/** ... */` block on the (now-defunct) constant
  with a plain `//` comment block on the helper, matching the
  `DEFAULT_FOREGROUND_TIMEOUT_MS` / `OUTPUT_UPDATE_INTERVAL_MS` style.

**Test hygiene**:
- Wrapped both `vi.stubGlobal('AbortSignal', ...)` and
  `vi.spyOn(truncateToolOutput, ...)` in `try/finally` so failures
  during the test body don't leak the stub/spy into subsequent tests
  (would cause confusing cascading failures).
- Dropped the internal-roadmap "Phase D part (a)" reference from the
  test comment — future maintainers don't have the context.

**New test**:
- `threshold scales with the user-supplied timeout (not the default)`:
  sets `timeout: 600_000`, advances 100s, verifies no hint. Pins the
  per-invocation coupling so a regression to a fixed constant would
  fail loudly here.

* fix(core): tighten long-run hint suppression + boundary tests + post-truncation insertion (round 4)

Six suggestions from /review's pai/glm-5-fp8 pass on PR #3809:

**Behavioural / UX**:
- **Hint now visible in non-debug TUI too.** Previously only debug
  mode mirrored the hint into `returnDisplay`; non-debug users saw
  the agent suggest `is_background: true` with no visible trigger.
  Now the hint is appended to `returnDisplayMessage` in both modes
  (full mirror in debug, terse-append in non-debug to preserve the
  output-or-status form).

**Test coverage**:
- **Debug-mode re-sync test added.** All other long-run hint tests
  run with `getDebugMode → false`; this one flips it to true and
  asserts the hint appears in `returnDisplay` too. Pins the re-sync
  so a regression that drops the debug branch would fail loudly.
- **Threshold-scaling positive test added.** The negative case
  (`timeout: 600_000`, advance 100s, no hint) was already pinned;
  paired now with the positive case (advance 305s, hint fires) so a
  regression to a fixed 60s threshold is caught at both ends.

**Style / consistency**:
- **`result.signal === null` (was `== null`).** Strict equality to
  match the rest of the file. The `signal` field is typed
  `number | null` so loose equality has identical semantics, but the
  inconsistency was noise.

**Doc clarity (timing semantics)**:
- **Comment explains why elapsedMs is computed BEFORE truncation.**
  Two reviewers disagreed on the timing — one read it as before
  truncation (correct, slightly under-reports), the other as after
  (incorrect read). The intent is to report the COMMAND's runtime,
  not the tool call's total time. Truncation is post-processing,
  not part of "agent blocking time", so excluding it is the right
  semantic. Inline comment now spells this out so future readers
  don't have to infer.

* fix(core): error-path hint surfacing + clock-resilient elapsed + threshold floor + observability

Round 5 of PR #3809 review — 10 threads, mix of Critical and Suggestion:

**Critical fixes**:

1. **Hint survives the error path** (`#OWbA`). When result.error is
   set, coreToolScheduler builds the model-facing functionResponse
   from `error.message` ONLY (not llmContent — see
   convertToFunctionResponse + the toolResult.error branch in
   scheduler:1648-1724). My hint was being silently dropped on
   long-command-failed cases. Now the hint is appended to
   error.message too so the advisory survives whichever branch the
   scheduler takes.

2. **Hint wording de-ambiguated** (`#OU6o`). "prefer re-running with
   is_background: true" was ambiguous — model could read it as
   "re-run THIS command in the background", which on stateful
   commands (DB migrations, deploys, git push) would cause double
   side effects. Reworded to "Next time you run a SIMILAR
   long-running process..." with an explicit parenthetical that
   warns against re-running the just-completed command.

3. **Debug observability** (`#OU6s`). Added `debugLogger.debug` at
   the hint decision point with elapsedMs / threshold / aborted /
   signal — when a user reports "my 65s command didn't get the
   hint" the suppression branch is now visible in DEBUG output.

**Other behaviour fixes**:

4. **Threshold floor of 1000ms** (`#OU6r`). Pathological
   `timeout: 0` / `timeout: 1` would have given a 0-ms threshold,
   firing the hint on every invocation showing "ran for 0s".
   Floor at 1s makes that branch unreachable.

5. **`performance.now()` instead of `Date.now()`** (`#OU6v`). NTP
   corrections / VM clock drift between capture and read would
   silently make `elapsedMs` negative and skip the hint with no
   observable failure. Monotonic clock prevents that.

6. **Debug mode preserves truncation marker** (`#OU6w` / `#OWCq`).
   Previously `returnDisplayMessage = llmContent` after hint
   clobbered the "Output too long and was saved to: …" line
   appended during truncation. Switched to append-style re-sync in
   BOTH modes so prior content is preserved.

**Test coverage gaps closed**:

7. **Non-debug returnDisplay test** (`#OWCo`). Pinned that the
   user TUI gets the hint in the default (non-debug) mode too.

8. **Test rename** (`#OWCl`). The "debug-mode TUI mirror" test
   passed in non-debug too after the recent refactor; split into
   two tests, one per branch.

9. **Error-path hint test**. Added a test that pins `result.error?.message`
   contains both the original error text AND the hint, covering
   the scheduler-routing-via-error.message path that was silently
   broken before fix #1.

10. **Test: faketimers also fakes `performance`**. Since we
    switched to `performance.now()`, `vi.useFakeTimers({ toFake:
    ['Date'] })` no longer covered the elapsed measurement;
    extended to `['Date', 'performance']` so the threshold tests
    can drive the wall-clock with `advanceTimersByTimeAsync`.

#OU6t (else-comment for the type guard) was already addressed in
the prior round — the explicit else-with-comment is in place;
adding logging there would be noise.

* test(core): cover the MIN_LONG_RUN_THRESHOLD_MS floor branch

PR #3809 review: the new `Math.max(MIN_LONG_RUN_THRESHOLD_MS, ...)`
floor in `longRunThresholdFor` was untested — only default-timeout
and large-custom-timeout cases existed. A regression that strips the
floor would let `timeout: 1` produce a 0ms threshold and fire a
"ran for 0s" advisory on every invocation; the test suite would not
catch it.

New test: build with `timeout: 1`, advance 500ms (below the 1000ms
floor), resolve with `aborted: false` to isolate the threshold logic
from the abort path. Asserts no hint appears. A regression that
removes the floor flips the assertion to fail.

* fix(core): structured delimiter on error.message hint + tighten timeout floor comment

Two of three threads from the latest /review pass on PR #3809 (the
third — PR description / threshold scaling reconciliation — is fixed
in the PR description update, not in code):

- **`\n---\n` divider before hint in `error.message`** (`#Pt7C`).
  Downstream consumers of `error.message` (firePostToolUseFailureHook,
  telemetry grouping, SIEM alerting, hook-side error parsers) were
  receiving ~400 chars of advisory text mixed inline with the
  original error body — pattern-matching on error messages would
  absorb the advisory into the matched body. Added a `---` separator
  line so the boundary is unambiguous and split-able.

- **Threshold-floor comment narrowed to `timeout: 1`** (`#Pu9o`).
  The comment said the floor guards `timeout: 0` / `timeout: 1`, but
  `validateToolParamValues` rejects `timeout <= 0` at validate time,
  so `timeout: 0` can't reach `longRunThresholdFor`. Updated the
  comment to mention only the actually-allowed pathological case
  (`timeout: 1` and any value `< 2` rounds to 0).

Test updated to assert the `---` divider format with `toMatch`.

* fix(core): capture executionStartTime AFTER spawn so PTY import isn't counted

PR #3809 review: copilot caught that `executionStartTime` was
captured BEFORE `await ShellExecutionService.execute(...)`, which
meant the elapsed measurement included `getPty()` dynamic-import
setup (~50-200ms on first call). The hint's "ran for Xs" reading was
slightly inflated, and the comment claiming "spawn → settle" wasn't
strictly accurate.

Moved the capture immediately after the execute() call returns its
{ result, pid } handle. The pid being set by that point confirms the
process has been spawned, so the subtraction is true post-spawn-to-
settle. Comment updated to reflect the actual semantics.

The displayed accuracy gain is small (50-200ms on a 60s+ threshold
is <1%), but the comment claim now matches what the code measures.
Tests unaffected — fakeTimers don't drive real dynamic imports, so
the threshold tests behave identically.

* fix(core): align long-run hint code/tests with ShellExecutionResult.error semantics

Four copilot threads on PR #3809 — all rooted in the same
observation: `ShellExecutionResult.error` is reserved for
spawn/setup failures (per the field's doc comment in
shellExecutionService.ts), NOT for non-zero exit codes. My existing
code/tests conflated the two, making the error-path coverage less
realistic and the inline comments inaccurate.

**Test shape fixes**:

- `appends the hint when a long-running foreground command exits
  with error` → `exits non-zero`. Changed `error: new Error('exit
  1')` to `error: null` (the realistic shape for a non-zero exit
  without spawn failure). Added a comment explaining the field
  contract so future test authors don't repeat the conflation.

- `hint survives the error path (appended to error.message)`:
  reframed the mock from `spawn ENOENT` (which would resolve in
  <1s in practice, making the long-elapsed scenario unrealistic)
  to `PTY initialization failed after 75s` — a slow-spawn-failure
  shape that COULD plausibly take 75s. Test still pins the same
  CODE PATH; comment now acknowledges the edge-case nature
  ("rare but real: PTY init dragging, remote-fs exec syscalls,
  security scanners interposing").

**Comment corrections**:

- `returnDisplayMessage` build-order comment was misleading. It
  said "the hint is appended after both the truncation block and
  the returnDisplayMessage build" — but `returnDisplayMessage` is
  built BEFORE truncation. Replaced with a chronological enumeration
  (1. initial value, 2. truncation marker append, 3. hint append)
  that matches what the code actually does.

- Error-path preservation comment now acknowledges the narrow
  applicability (spawn failures only, exit codes don't reach this
  branch). Code is unchanged — the path is still real, just rare.

* test(core): pin empty-output success + background-no-hint paths

Two defensive tests for the long-running foreground hint:

- empty-output success at >=60s — exercises the
  returnDisplayMessage='' → hint append branch (write-only commands
  like `tar czf` / `cp -r` produce no stdout). Asserts the user-
  facing returnDisplay still surfaces the advisory even when the
  command produced nothing else to show.

- background never includes the hint — the foreground hint logic
  lives in executeForeground only, so today this can't fail; the
  test guards against a future refactor hoisting the advisory into
  a shared post-execute path that would tag every background launch
  with a nonsensical "ran for 0s, consider is_background: true"
  suggestion.
2026-05-04 23:24:14 +08:00
Shaojin Wen
efb7351d58
feat(core): support reasoning effort 'max' tier (DeepSeek extension) (#3800)
* feat(core): support reasoning effort 'max' tier (DeepSeek extension)

DeepSeek's chat-completions endpoint added an extra-strong `max` tier
to `reasoning_effort` (per
https://api-docs.deepseek.com/zh-cn/api/create-chat-completion ; valid
values are now `high` and `max`, with `low`/`medium` mapping to `high`
for backward compat). Plumb it end-to-end:

- `ContentGeneratorConfig.reasoning.effort` union now includes 'max'.
- DeepSeek OpenAI-compat provider: translate the standard nested
  `reasoning: { effort }` shape into DeepSeek's flat `reasoning_effort`
  body parameter so user-configured effort actually takes effect (the
  nested shape was previously sent verbatim and silently ignored,
  defaulting to `high`). low/medium → high mirrors the documented
  server-side behavior so dashboards / logs match wire reality.
  An explicit top-level `reasoning_effort` (via samplingParams or
  extra_body) wins over the nested form.
- Anthropic converter: pass 'max' through to `output_config.effort`
  unchanged and bump the `thinking.budget_tokens` budget for the new
  tier (low 16k / medium 32k / high 64k / max 128k).
- Gemini converter: clamp 'max' to HIGH since Gemini has no higher
  thinking level. Without this, 'max' would silently fall through to
  THINKING_LEVEL_UNSPECIFIED.

Live verification against api.deepseek.com:
- `reasoning_effort: high` → 200
- `reasoning_effort: max`  → 200 (the new tier)
- `reasoning_effort: bogus`→ 400 with valid-set list confirming
  [high, low, medium, max, xhigh]

108 anthropic/openai-deepseek/gemini tests pass; full core suite
(6601 tests) green; lint + typecheck clean.

* fix(core): map xhigh→max + clamp max on non-DeepSeek anthropic + docs

Address PR review (copilot × 2) and add missing user docs:

1. (J698) `translateReasoningEffort` claimed in the PR description that
   it surfaces the DeepSeek backward-compat mapping client-side, but
   only handled `low`/`medium` → `high`. Add `xhigh` → `max` to match
   the doc and stay symmetric with the low/medium branch.

2. (J6-A) `output_config.effort: 'max'` would have been emitted on
   any anthropic-protocol provider whenever a user configured it, even
   when the baseURL points at real `api.anthropic.com` (which only
   accepts low/medium/high and would 400). Reuse the existing
   `isDeepSeekAnthropicProvider` detector to clamp `'max'` → `'high'`
   on non-DeepSeek anthropic backends, with a debugLogger.warn so the
   downgrade is visible. DeepSeek anthropic-compatible endpoints still
   pass through unchanged.

3. New docs:
   - `docs/users/configuration/model-providers.md`: a "Reasoning /
     thinking configuration" section under generationConfig — single
     example targeting DeepSeek + a per-provider behavior table
     (OpenAI/DeepSeek flat reasoning_effort, OpenAI passthrough for
     other servers, real Anthropic clamp, Anthropic-compatible
     DeepSeek passthrough, Gemini thinkingLevel mapping).
   - `docs/users/configuration/settings.md`: extend the
     `model.generationConfig` description to mention `reasoning`
     (the field was undocumented before this PR even though it
     already existed as a typed field) and link to the new section.

96 anthropic + deepseek tests pass; lint + typecheck clean.

* refactor(core): single-source effort normalization for anthropic + doc fix

Address PR review round 2 (copilot × 2):

1. (J8aG) The `contentGenerator.ts` comment claimed passing
   `reasoning.effort: 'max'` to real Anthropic was "up to the user",
   but commit b5b05ae actively clamps 'max' → 'high' (with a debug
   log) on non-DeepSeek anthropic backends. Update the comment to
   describe current runtime behavior.

2. (J8aL) The clamp ran inside `buildOutputConfig()` only — the effort
   label was downgraded but `buildThinkingConfig()` still used the
   raw user value to size the budget, so a non-DeepSeek anthropic
   request could end up with `output_config.effort: 'high'` paired
   with a 'max'-sized 128K thinking budget. Inconsistent label vs.
   budget on the wire.

   Refactor: hoist the normalization into a single
   `resolveEffectiveEffort()` helper that runs once per request in
   `buildRequest()`. Both `buildThinkingConfig` and `buildOutputConfig`
   now consume the same clamped value, so the budget ladder and the
   effort label stay aligned. The debug log fires once per request.

Add a regression test asserting that on a non-DeepSeek anthropic
provider with `effort: 'max'` configured, the wire request carries
both `output_config.effort: 'high'` AND `thinking.budget_tokens:
64_000` (the 'high' tier), not the 128K 'max' budget.

96 tests pass; lint + typecheck clean.

* fix(core): tighten 'max' clamp + warn-once + strip reasoning_effort on side queries

Address PR review round 3 (copilot × 3):

1. (J-2v) When request.config.thinkingConfig.includeThoughts is false,
   pipeline.buildRequest's post-processing only deleted the nested
   `reasoning` key. The DeepSeek provider's translateReasoningEffort
   may have already flattened an extra_body-injected reasoning into
   top-level `reasoning_effort` by that point, so a side query (e.g.
   suggestionGenerator) could still ship reasoning_effort on the wire.
   Extend the post-processing to also delete `reasoning_effort`.

2. (J-2z) The warn for clamping 'max' on non-DeepSeek anthropic ran on
   every request needing the downgrade — the docstring claimed "first
   time only" but the implementation didn't latch. Add a private
   `effortClampWarned` boolean on the generator so the warning fires
   once per generator lifetime.

3. (J-23) `resolveEffectiveEffort` used the broad
   `isDeepSeekAnthropicProvider` detector for the clamp decision, but
   that helper falls back to model-name matching to cover sglang/vllm
   self-hosted DeepSeek deployments. A model configured as e.g.
   "deepseek-distill" but routed to real api.anthropic.com would
   bypass the clamp and trigger HTTP 400. Split the detector: keep
   `isDeepSeekAnthropicProvider` (broad) for the thinking-block
   injection workaround where false-positives are harmless, and add
   `isDeepSeekAnthropicHostname` (hostname-only) for decisions where
   a model-name false-positive would route DeepSeek-only behavior to
   a stricter backend. The clamp now uses the hostname-only check.

New regression test: a config with model name containing "deepseek"
but baseURL pointing at api.anthropic.com still clamps `'max'` to
`'high'`. Existing "passes max through" test updated to set a
DeepSeek baseURL since model name alone no longer suffices for the
clamp bypass.

385 tests pass; lint + typecheck clean.

* docs(core): correct pipeline timing comment + samplingParams caveat

Address PR review round 4 (copilot × 3) — three documentation accuracy
fixes, no behavior change:

1. (KBcw) The post-processing comment in pipeline.ts misdescribed the
   call order ("after this branch already ran during the same
   buildRequest pass") — provider.buildRequest actually runs BEFORE
   the includeThoughts=false post-processing in the same pass.
   Reword to match the actual order: provider hook flattens nested
   reasoning to reasoning_effort first, this cleanup runs after and
   strips both shapes.

2. (KBdC, KBdE) The "Reasoning / thinking configuration" section in
   model-providers.md and the model.generationConfig description in
   settings.md both implied `reasoning` is honored on every provider.
   For OpenAI-compatible providers, when `generationConfig.samplingParams`
   is set, `ContentGenerationPipeline.buildGenerateContentConfig()`
   ships samplingParams verbatim and skips the separate `reasoning`
   injection entirely. Configs like
   `{ samplingParams: { temperature: 0.5 }, reasoning: { effort: 'max' } }`
   would silently drop the reasoning field on OpenAI/DeepSeek
   requests.

   Add an explicit "Interaction with samplingParams" warning section
   in model-providers.md and a parenthetical note in settings.md
   directing users to put `reasoning_effort` inside `samplingParams`
   (or `extra_body`) when both are configured.

385 tests pass; lint + typecheck clean.

* docs(core): clarify explicit budget_tokens bypasses 'max' effort clamp

When user sets `{ effort: 'max', budget_tokens: N }` on a non-DeepSeek
anthropic backend, the effort label gets clamped to 'high' (otherwise
the server 400s on the unknown enum) but the explicit budget_tokens is
preserved verbatim. The wire-shape mismatch is intentional, not a bug:
the clamp only protects the enum field, while budget is a free integer
the server accepts within the context window, so an explicit override
stays explicit. Document the contract on the early-return and add a
regression test that locks it in.

* docs(deepseek): fix comments to match flatten + reasoning-strip behavior

Two doc-only nits called out in review:

1. `buildRequest` JSDoc said non-text parts are "rejected", but
   `flattenContentParts` actually substitutes a textual placeholder
   (`[Unsupported content type: <type>]`) so the request still goes
   through with a breadcrumb. Reword the JSDoc accordingly.

2. `translateReasoningEffort`'s strip comment claimed it strips the
   nested form to avoid shipping both shapes, but it only drops the
   duplicated `effort` key when other keys (e.g. `budget_tokens`) are
   present. Reword to describe the actual selective behavior and why
   keeping orthogonal keys is intentional.

Behavior unchanged.

* fix(deepseek): gate reasoning_effort translation on actual DeepSeek hostname

The provider class is selected via the broader `isDeepSeekProvider`
check, which falls back to model-name matching to cover self-hosted
DeepSeek deployments (sglang/vllm/ollama, see #3613). That fallback is
the right call for content-part flattening — it's a model-format
constraint baked into the model itself, not the API surface.

But the same broad detection was also gating
`translateReasoningEffort`, which rewrites the standard
`reasoning: { effort }` config into DeepSeek's flat `reasoning_effort`
body parameter. That's a wire-shape decision, not a model-format one:
strict OpenAI-compat backends in self-hosted setups may not accept the
DeepSeek extension and would have happily handled the original shape.

Split the two decisions: keep `isDeepSeekProvider` (broad) for
flattening, add a hostname-only `isDeepSeekHostname` and gate the body
rewrite on it. Self-hosted DeepSeek users who actually want the
translation can either use a baseUrl containing api.deepseek.com or
inject `reasoning_effort` directly via `samplingParams`/`extra_body`.

Regression tests:
  - self-hosted (sglang) with deepseek-named model + nested
    `reasoning.effort` → flattening still runs, body shape preserved
  - `isDeepSeekHostname` matches api.deepseek.com but not custom hosts

* fix(deepseek): use URL parsing in isDeepSeekHostname; fix log-level docs

CodeQL flagged a high-severity URL substring sanitization issue on the
new `isDeepSeekHostname` helper. The naive
`baseUrl.includes('api.deepseek.com')` check would false-positive on
hostile hosts like `https://api.deepseek.com.evil.com/v1` and
incorrectly inject the DeepSeek-only `reasoning_effort` body parameter
into requests routed elsewhere. Switch to `new URL(...).hostname` with
exact match against `api.deepseek.com` (and `.api.deepseek.com`
subdomains), mirroring `isDeepSeekAnthropicHostname` on the Anthropic
side. Invalid URLs treated as non-DeepSeek.

`isDeepSeekProvider` already routes through `isDeepSeekHostname`, so
the hardening applies to both decision paths.

Regression tests cover:
  - subdomain match (us.api.deepseek.com)
  - hostile substrings (api.deepseek.com.evil.com,
    evil.com/api.deepseek.com/v1, api.deepseek.comevil.com,
    api-deepseek-com.example.com)
  - invalid / empty baseUrl

Also fix two doc-level mismatches: the `'max'` clamp on Anthropic logs
via `debugLogger.warn` (warning level, once per generator), not "with
a debug log". Update both `ContentGeneratorConfig.reasoning` JSDoc and
the per-provider behavior table in model-providers.md.

* feat(deepseek): emit thinking:disabled signal when reasoning is off

DeepSeek V4+ defaults `thinking.type` to `'enabled'`, so just stripping
`reasoning_effort` from the request leaves the server happily thinking
on side queries — paying full thinking latency/cost without an effort
configured. Per yiliang114's review, emit the explicit
`thinking: { type: 'disabled' }` field on the wire whenever reasoning
is disabled.

Triggered when either:
  - `request.config.thinkingConfig.includeThoughts === false` (forked
    queries, e.g. suggestion generation)
  - `contentGeneratorConfig.reasoning === false` (config-level opt-out)

The previous post-processing block only fired on the per-request opt-out
path, so the config-level case was already leaking. Unify both under a
single `reasoningDisabled` predicate that runs the same strip + signal
logic.

Hostname-gated to `api.deepseek.com` (and subdomains): self-hosted
DeepSeek behind sglang/vllm/ollama, or older DeepSeek versions, may
not accept the V4 thinking parameter — pushing it there could trip an
unknown-key 400. Mirrors the round-7 decision to gate
`reasoning_effort` translation on hostname.

Regression tests cover all four matrix points:
  - DeepSeek hostname + includeThoughts false → emits disabled
  - DeepSeek hostname + reasoning false → emits disabled
  - non-DeepSeek hostname + includeThoughts false → does not emit
  - self-hosted DeepSeek (model-name fallback only) → does not emit

Docs: extend the `reasoning: false` section with the new behavior and
the self-hosted/non-DeepSeek caveat.

* refactor(deepseek): expose isDeepSeek* as free functions; clarify docs

Two doc/coupling nits from review:

1. The pipeline post-processing block was importing the concrete
   `DeepSeekOpenAICompatibleProvider` class just to reach
   `isDeepSeekHostname`. That couples the generic OpenAI pipeline to a
   specific provider implementation. Promote the helper (and its broad
   `isDeepSeekProvider` sibling) to free `export function`s in
   `provider/deepseek.ts` and import them by name. The class keeps thin
   static delegates for backward compat with existing callers and tests.

2. The per-provider behavior table on `model-providers.md` said
   `'low'/'medium' → 'high'` and `'xhigh' → 'max'` "client-side", but
   that normalization only fires inside `translateReasoningEffort`,
   which runs on the nested `reasoning.effort` config path. Explicit
   top-level overrides via `samplingParams.reasoning_effort` or
   `extra_body.reasoning_effort` skip the rewrite and ship verbatim.
   Reword the row to reflect that.

Behavior unchanged.
2026-05-04 22:42:23 +08:00
Shaojin Wen
fcefab6df5
fix(core): clear FileReadCache on every history rewrite path (#3810)
* fix(core): clear FileReadCache after microcompaction

Microcompaction (the idle-cleanup pass that runs at the start of every
new user/cron message) replaces old read_file / shell / glob / grep /
edit / write_file tool outputs with a `[Old tool result content cleared]`
placeholder. The FileReadCache, however, still records the prior full
Reads as "seen in this conversation" — so the next ReadFile of an
unchanged file returns the file_unchanged placeholder pointing at bytes
the model can no longer retrieve from history. The result is a Read
that succeeds at the tool layer but delivers no usable content to the
model, which is the failure mode reported in #3805 ("read tool returns
no content in long-running sessions").

This mirrors the existing post-compaction clear in tryCompressChat —
microcompaction has the same "history rewrite invalidates the cache's
'model has seen this' assumption" property, it was just missed when the
cache was wired in.

* fix(core): clear FileReadCache on every history rewrite path

PR1 only patched microcompaction, but a multi-round audit found four
more entry points that rewrite history without clearing the cache,
producing the same `file_unchanged` placeholder vs. missing-content
mismatch. Each is fixed in the same minimal way (clear() at the call
site) and covered by a regression test:

- GeminiClient.setHistory     — /restore checkpoint, /load_history
- GeminiClient.truncateHistory — rewind in AppContainer
- GeminiClient.resetChat       — public API; clearCommand happens to
  clear the cache via startNewSession beforehand, but other callers
  have no such guarantee
- stripOrphanedUserEntriesFromHistory — Retry path drops trailing user
  entries that may include read_file functionResponses

Also tightened the microcompaction comment ("compactable tool outputs"
instead of an enumerated list, since the source of truth is
microcompact.COMPACTABLE_TOOLS) and removed caller references per the
codebase comment style.

Reverse-tested every new clear() by commenting it out and confirming
the matching regression test fails.

* test(core): integration test for FileReadCache + history rewrite

End-to-end tests using the real ReadFileTool, real FileReadCache,
real microcompactHistory, and a real on-disk file. Three cases:

1. Without a cache clear after microcompact, the second Read of an
   unchanged file returns the file_unchanged placeholder while the
   prior content has already been wiped from history. Demonstrates
   the failure mode this PR fixes.
2. After an explicit cache.clear(), the second Read re-emits the
   real bytes. Demonstrates that the fix works.
3. When microcompact removes every prior read of a file, the
   placeholder leaves zero recoverable bytes — the model literally
   cannot find the content anywhere it can reach.

These complement the existing unit tests in client.test.ts (which
verify the call-site wiring) by proving the end-to-end behaviour
through the real code paths, without mocks.

* chore(core): add traceable debug log for every FileReadCache clear

Per review feedback: the new clear() call sites were silent, leaving
no breadcrumb in production debug streams when the cache is dropped.
Adds a `[FILE_READ_CACHE] clear after <reason>` log at every clear
site (5 new + 1 pre-existing in tryCompressChat) so operators can
grep one prefix and see why the cache was invalidated.

* chore(core): refine truncateHistory cache clear + extract test helper

Per review feedback (deepseek-v4-pro):

1. truncateHistory now skips the cache clear when keepCount >=
   prevLen, since a no-op truncate leaves the cache valid against the
   unchanged history. Adds a regression test covering both
   keepCount==prevLen and keepCount>prevLen.

2. The 6 cache-spy test cases each repeated the same 4-line mock
   setup. Extract a `mockFileReadCacheClear()` helper so future
   changes to the FileReadCache mock surface only need one edit.

Both are quality-of-implementation tweaks; the underlying fix is
unchanged.

* perf(core): use O(1) getHistoryLength in truncateHistory

Per Copilot review feedback: the previous commit's no-op detection in
truncateHistory called this.getChat().getHistory().length, but
GeminiChat.getHistory() does a structuredClone of the entire history
on every call (line 770 of geminiChat.ts) — paying an O(history)
clone purely to read .length. In long-running sessions with hundreds
of entries this is a meaningful regression.

Adds GeminiChat.getHistoryLength(): O(1), no clone. truncateHistory
switches to it. The behaviour (skip clear when keepCount >= prevLen)
is unchanged.

Also adds:
- Unit tests for GeminiChat.getHistoryLength (empty, after addHistory,
  parity with getHistory().length).
- A regression test asserting truncateHistory calls getHistoryLength
  and NOT getHistory, locking in the perf fix against future drift.

* fix(core): close NaN hole + use public ReadFileTool API in tests

Two issues from copilot review:

1. NaN edge case in truncateHistory cache invalidation. The
   "did anything actually change?" check was `keepCount < prevLen`,
   but `Array.slice(0, NaN)` returns [] (history wiped) while
   `NaN < prevLen` is false. That sequence would wipe the chat but
   leave the FileReadCache claiming the model has seen the prior
   reads — exactly the file_unchanged placeholder bug this PR is
   closing. Switched the check to compare actual post-truncate length
   (`newLen < prevLen`), which correctly invalidates whenever entries
   were removed regardless of how `keepCount` was malformed. Added
   a NaN regression test.

2. The integration test cast `tool` to `unknown` to reach the
   protected `createInvocation()` method. Switched to the public
   `tool.buildAndExecute(params, signal)` API so the test exercises
   the same surface real callers use, including build-time schema
   validation.
2026-05-04 22:42:06 +08:00
Shaojin Wen
fbcf59e0b3
docs(core): point background-shell + monitor guidance at both /tasks and the dialog (#3808)
Some checks are pending
Qwen Code CI / Lint (push) Waiting to run
Qwen Code CI / Test (push) Blocked by required conditions
Qwen Code CI / Test-1 (push) Blocked by required conditions
Qwen Code CI / Test-3 (push) Blocked by required conditions
Qwen Code CI / Test-4 (push) Blocked by required conditions
Qwen Code CI / Test-5 (push) Blocked by required conditions
Qwen Code CI / Test-6 (push) Blocked by required conditions
Qwen Code CI / Test-7 (push) Blocked by required conditions
Qwen Code CI / Test-8 (push) Blocked by required conditions
Qwen Code CI / Test-2 (push) Blocked by required conditions
Qwen Code CI / Post Coverage Comment (push) Blocked by required conditions
Qwen Code CI / CodeQL (push) Waiting to run
E2E Tests / E2E Test (Linux) - sandbox:docker (push) Waiting to run
E2E Tests / E2E Test (Linux) - sandbox:none (push) Waiting to run
E2E Tests / E2E Test - macOS (push) Waiting to run
SDK Python / SDK Python (3.10) (push) Waiting to run
SDK Python / SDK Python (3.11) (push) Waiting to run
SDK Python / SDK Python (3.12) (push) Waiting to run
* docs(core): point background-shell guidance at both /tasks and the dialog

Follow-up to PR #3801, fulfilling the "separate small PR" commitment in
its description. The two model-facing strings (`shell.ts` after
spawning a background shell, `task-stop.ts` after requesting cancel)
referenced only `/tasks` as the inspection path, predating the
interactive Background tasks dialog landing at #3488 / #3720 / #3791.
Now that the dialog handles all three kinds (agent / shell / monitor),
both surfaces should be visible to the LLM so it can suggest the right
one based on the user's mode.

Updates:

- `shell.ts:865` (LLM message after `is_background: true` spawn) now
  surfaces both `/tasks` (text, any mode) AND the interactive dialog
  (footer pill + Enter, with detail view + live updates). Output file
  guidance retained.
- `task-stop.ts:110` (LLM message after `task_stop` on a shell) same
  pattern: both surfaces named.
- `task-stop.ts:95` comment updated to enumerate all observation paths
  (including the dialog).
- `monitorRegistry.ts:197` comment fixed — said "/tasks dialog" which
  conflated two distinct surfaces. Split to "Background tasks dialog
  reopens or `/tasks` listings".
- `backgroundShellRegistry.ts:10` (module docstring) and `:31` (shellId
  doc) now mention all three consumers (agent, dialog, slash command).

No behavior change — pure documentation/string update. Tests untouched
(none asserted on these exact strings); build + lint + 152-test core
suite all clean.

* docs(core): address PR 3808 review — 'captured output' + consistent ordering

Three review nits:

1. (LoqU — copilot) `shell.ts:865` said the output file holds "raw
   content", but `shellExecutionService` runs each chunk through
   stripAnsi and skips non-string/binary chunks before writing. Reword
   to "captured output" so callers don't expect a byte-for-byte stream.

2. (LqKr — wenshao) The PR mentioned both surfaces in two different
   orders depending on the file: `backgroundShellRegistry.ts` listed
   the dialog first, while `task-stop.ts` and `shell.ts` listed
   `/tasks` first. Unify on the LLM-facing order — `/tasks` first,
   then the interactive Background tasks dialog — across all four
   sites. Also flips the line-31 docstring on the `shellId` field for
   the same reason.

3. (LqKt — wenshao, flagged for awareness only) Trim the redundant
   keystroke detail in shell.ts:865 to match `task-stop.ts:111`'s
   shorter "(footer pill + Enter)" form. Saves ~7 tokens per
   background shell launch in `llmContent` while still naming both
   surfaces. The PR description's rationale (LLM should know both
   surfaces exist so it can suggest the right one for the user's
   mode) is preserved — only the operational verbosity is trimmed.

581 tests pass; lint + typecheck clean. Pure docs / string update.

* docs(core): grammar polish on PR 3808 strings

Two more wording nits from copilot review:

- backgroundShellRegistry.ts:10 — change "metadata the agent…need to
  query" to "metadata that the agent…use to query". The original
  phrasing reads as if the metadata itself is performing the query.

- shell.ts:865 — change "Read the output file directly for the
  captured output." to "Read the output file directly to view the
  captured output." — clearer instruction to the model/user.

Pure wording, no behavior change.

* docs(core): grammar fix on PR 3808 monitor comment

'not visible from later Background tasks dialog reopens' read as
if 'reopens' was a noun. Reword to 'not visible after reopening
the Background tasks dialog or from /tasks listings'.

* docs(core): round 4 wording polish on PR 3808

Four more nits from copilot:

- shell.ts:865 + task-stop.ts:96,111: "footer pill + Enter" was
  ambiguous now that the footer renders multiple pills (background
  tasks vs other status indicators). Disambiguate to
  "focus the footer Background tasks pill, then Enter".
- monitorRegistry.ts:198: re-tweak my round-3 phrasing —
  "after reopening the Background tasks dialog or from /tasks
  listings" → "in later Background tasks dialog reopens or /tasks
  listings". Reads as "from those surfaces" rather than "after
  reopening", which the reviewer found ungrammatical.
- backgroundShellRegistry.ts:10,31: clarify "/tasks" as the slash
  command, since the codebase also uses "<projectDir>/tasks/..."
  on-disk paths in agent-transcript contexts.

Pure wording, no behavior change. 87 affected tests pass.

* docs(core): mirror /tasks + dialog guidance to monitor llmContent paths

Address @doudouOUC review on PR #3808 — two Medium findings: this PR
updated shell-facing strings to mention both inspection surfaces but
left the parallel monitor strings without any inspection guidance, even
though monitors render in the same /tasks output and the same
Background tasks dialog. Restore symmetry:

- monitor.ts:587-598 — append the same "/tasks (text) or the
  interactive Background tasks dialog (focus the footer Background
  tasks pill, then Enter — detail view + live updates)" sentence to
  the Monitor-started llmContent, mirroring shell.ts:865.
- task-stop.ts:125-131 — the monitor cancellation llmContent had no
  guidance at all. Add the same "Final status will be visible via
  /tasks (text) or the interactive Background tasks dialog (focus the
  footer Background tasks pill, then Enter) once the process drains"
  line that already existed for shells at task-stop.ts:111.

The (Low) commit-churn observation is a maintainer call (squash on
merge); the (Info) snapshot-test gap is pre-existing and not in scope.

78 monitor + task-stop tests pass; lint + typecheck clean.

* docs(core): drop drain phrasing for monitor cancel + restructure dialog comment

Address PR #3808 review round 5 (doudouOUC + copilot × 2):

1. (XNoH copilot, XSBu doudouOUC — Medium) The monitor cancellation
   message inherited "once the process drains" from the shell branch,
   but `monitorRegistry.cancel()` settles synchronously — when the
   tool returns, the entry is already `cancelled`, not waiting on a
   child process. The drain qualifier is accurate for shells (which
   use `requestCancel()` + the AbortController and settle when the
   real process exits) but misleading for monitors.

   Reword the monitor branch in `task-stop.ts:121-130` to drop the
   drain phrasing and add an explanatory comment about the sync vs.
   async difference so future maintainers don't replicate the wording
   from the shell branch by reflex.

2. (XNod copilot — wording, third revision on the same comment)
   Restructure rather than re-litigate the preposition. The
   "reopens" noun framing has gone through three rounds of churn
   (`from later... reopens` → `after reopening...` → `in later...
   reopens` → and now back to `from`). Sidestep the loop by making
   the comment a proper sentence about WHAT the surfaces actually
   read: the persisted `entry.error` is the source of truth; the
   chat-history notification is a separate, ephemeral side channel.
   Avoids the noun-form "reopens" entirely.

Updated test assertion to match the new "Monitor \"...\" cancelled"
prefix. 51 tests pass; lint + typecheck clean.
2026-05-04 22:27:00 +08:00
jinye
2fea1d3aa7
fix(core): address post-merge monitor tool and UI routing issues (#3792)
* fix(core): address post-merge monitor tool and UI routing issues

- Guard token bucket against clock drift after system suspend/resume
  (negative elapsed resets lastRefill instead of starving the bucket)
- Add debugLogger.warn for AST read-only check failures in monitor
  getConfirmationDetails (previously silent catch)
- Consolidate SHELL_TOOL_NAMES: export from rule-parser.ts, import in
  permission-manager.ts (removes identical SHELL_LIKE_TOOLS duplicate)
- Extract hasBlockingBackgroundWork/resetBackgroundStateForSessionSwitch
  to shared backgroundWorkUtils.ts (removes identical copies in
  clearCommand.ts and useResumeCommand.ts)
- Consolidate getToolCallComponent routing into packages/webui (removes
  near-identical copies in ChatViewer.tsx and vscode-ide-companion, adds
  missing web_search compat alias to VSCode path)
- Add test for droppedLines count in terminal notification text
- Add test for exit(null, null) settlement (externally killed process)

🤖 Generated with [Qwen Code](https://github.com/QwenLM/qwen-code)

* fix(core,cli,vscode): address PR #3792 review feedback

- Add debugLogger.warn for clock-drift guard (observability for
  throttle bucket resets after suspend/resume)
- Add test for clock-drift recovery (elapsed < 0 bucket reset)
- Add test for AST parse failure catch path (mockRejectedValueOnce)
- Forward isFirst/isLast props through VSCode ToolCallRouter
  (fixes timeline connector rendering)
- Add test for shell running branch in hasBlockingBackgroundWork

🤖 Generated with [Qwen Code](https://github.com/QwenLM/qwen-code)

* fix(core,vscode): address follow-up review comments

- Fix React.FC missing import: use `import type { FC } from 'react'`
  instead of `React.FC` (original file had this import before refactor)
- Tighten clock-drift test: emit while clock is in the past to confirm
  guard resets lastRefill, then verify refill at the new reference point

🤖 Generated with [Qwen Code](https://github.com/QwenLM/qwen-code)

* fix(core,cli): adopt review feedback — ReadonlySet + hasRunningEntries

- Type SHELL_TOOL_NAMES as ReadonlySet<string> to prevent accidental
  mutation of permission-critical set
- Use BackgroundShellRegistry.hasRunningEntries() instead of
  getAll().some() for zero-allocation short-circuit check
- Update clearCommand test mocks to include hasRunningEntries

🤖 Generated with [Qwen Code](https://github.com/QwenLM/qwen-code)

* fix(core,webui,vscode): address remaining PR #3792 review comments

- Merge AgentToolCall + isAgentExecutionToolCall into single import in
  routing.ts (comment 3178280762)
- Use real getToolCallComponent via vi.importActual in VSCode test mock
  so routing logic is validated, not a parallel mock that can drift
  (comment 3178280775)
- Validate isFirst/isLast forwarding in VSCode test mock via data
  attributes (comment 3178346891)
- Add comment documenting debugLogger.warn no-op tradeoff for clock
  drift guard (comment 3178346889)

🤖 Generated with [Qwen Code](https://github.com/QwenLM/qwen-code)

* test(webui): add unit test for getToolCallComponent routing

Covers all 8 component branches including the web_search compatibility
alias, agent execution detection, case-insensitive matching, and
fallback to GenericToolCall.

🤖 Generated with [Qwen Code](https://github.com/QwenLM/qwen-code)

* fix(cli): add missing hasRunningEntries to useResumeCommand test mocks

The backgroundWorkUtils refactor replaced getAll().some() with
hasRunningEntries(), but the test mocks in useResumeCommand.test.ts
were not updated, causing CI failures.

🤖 Generated with [Qoder Code](https://github.com/QwenLM/qwen-code)

* test(cli): add unit tests for backgroundWorkUtils shared utility

Cover hasBlockingBackgroundWork (6 cases including short-circuit
behaviour) and resetBackgroundStateForSessionSwitch (1 case verifying
all three registries are reset).

🤖 Generated with [Qoder Code](https://github.com/QwenLM/qwen-code)

---------

Co-authored-by: jinye.djy <jinye.djy@alibaba-inc.com>
2026-05-04 21:19:41 +08:00
jinye
03f66bada5
feat(sdk-python): add PyPI release workflow (#3685)
* feat(sdk-python): add pypi release workflow

Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>

* fix(sdk-python): build cli before smoke test

Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>

* fix(sdk-python): tighten release conflict handling

Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>

* fix(sdk-python): harden python release workflow

Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>

* fix(sdk-python): tighten stable release guards

Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>

* fix(sdk-python): harden prerelease publish flow

Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>

* fix(sdk-python): reuse release branches on rerun

Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>

* fix(sdk-python): resume incomplete releases

Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>

* fix(release): tighten missing-release checks

Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>

* fix(sdk-python): resume stable release reruns

Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>

* fix(sdk-python): tighten release recovery guards

Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>

* test(sdk-python): cover release version edge cases

Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>

* fix(sdk-python): address release workflow review feedback

Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>

* refactor(sdk-python): address review feedback on release version script

- Remove unreachable `if (type === 'stable')` branch in bumpVersion();
  the stable path was dead code since getVersion() throws for all
  stable conflicts before calling bumpVersion(). Move nightly conflict
  throw to the call site for symmetry.
- Rename getNextPatchBaseVersion → getNextBaseVersion to reflect that
  the function can return a prerelease base without incrementing patch.
- Add test for preview+nightly coexistence where nightly base is higher.

🤖 Generated with [Qwen Code](https://github.com/QwenLM/qwen-code)

* fix(sdk-python): address remaining review feedback on release workflow

- Fix failure-issue gate to read github.event.inputs.dry_run directly
  instead of steps.vars.outputs.is_dry_run (which is empty when early
  steps fail). Add --repo flag for gh issue create when checkout failed.
- Add diagnostic state table to failure-issue body (RELEASE_TAG,
  PACKAGE_VERSION, PUBLISH_CHANNEL, RESUME_EXISTING_RELEASE, etc.)
- Fix release-notes error swallow: only silence release not found /
  Not Found / HTTP 404, emit :⚠️: for other gh release view errors.
- Improve validateVersion error messages to use human-readable format
  keys (X.Y.Z, X.Y.Z-preview.N) matching TS sibling convention.
- Filter fully-yanked versions in getAllVersionsFromPyPI.
- Add console.error log when stable is derived from nightly.
- Add bash regex guard for inputs.version to prevent shell injection.
- Use per-release-type concurrency groups (nightly/preview/stable).
- Add jq null-guard checks for all 6 field extractions.
- Remove misleading --follow-tags from git push (lightweight tags).

🤖 Generated with [Qwen Code](https://github.com/QwenLM/qwen-code)

* fix(sdk-python): rename misleading test description

The test asserts that preview/nightly releases return empty
previousReleaseTag, but the name said "same-channel previous
release tags" which implied non-empty values.

🤖 Generated with [Qwen Code](https://github.com/QwenLM/qwen-code)

* fix(sdk-python): address unresolved review comments on release workflow

- Remove -z check in extract_field() that blocked preview/nightly releases
  (previousReleaseTag is legitimately empty for non-stable releases)
- Use static environment.url since step outputs aren't available at job startup
- Use skip-existing for resumed PyPI publish to fill in missing artifacts
- Add AbortSignal.timeout(30s) to PyPI fetch to prevent indefinite hangs
- Add downgrade guard for stable_version_override
- Use GHA :⚠️: annotation instead of console.error for visibility
- Separate yanked/non-yanked version lists so conflict detection includes
  yanked versions (PyPI still reserves those slots)
- Filter current release from previousReleaseTag to avoid self-reference on resume
- Add tests for yanked conflict detection, downgrade guard, and resume previousReleaseTag

Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>

* fix(sdk-python): address final review round on release version script

- Fix getNextBaseVersion() first-release skip: use pyproject.toml version
  directly when PyPI has no stable versions instead of unconditionally
  incrementing
- Fix getNextBaseVersion() off-by-one: change > to >= so equal prerelease
  base continues the existing line instead of incrementing patch
- Add :⚠️: annotation when preview auto-bumps due to orphan git
  tags (tag exists without PyPI version or GitHub release)
- Add set -euo pipefail to 5 workflow steps missing it: release_branch,
  persist_source, Create GitHub release, Delete prerelease branch, Create
  issue on failure
- Fix 2 existing tests affected by first-release change, add 4 new tests

🤖 Generated with [Qwen Code](https://github.com/QwenLM/qwen-code)

* fix(sdk-python): use stderr for GHA warning annotations to avoid corrupting JSON stdout

console.log writes to stdout, which gets captured by VERSION_JSON=$(node ...)
in the workflow and corrupts the JSON output for jq. Switch to console.error
so :⚠️: annotations go to stderr (GHA recognizes workflow commands on
both streams). Also add set -euo pipefail to the "Get the version" step for
consistency with other workflow steps.

🤖 Generated with [Qwen Code](https://github.com/QwenLM/qwen-code)

---------

Co-authored-by: jinye.djy <jinye.djy@alibaba-inc.com>
Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>
2026-05-04 21:07:21 +08:00
jinye
e617f20d15
fix(telemetry): suppress async resource attribute warning on startup (#3807)
Some checks are pending
Qwen Code CI / CodeQL (push) Waiting to run
Qwen Code CI / Lint (push) Waiting to run
Qwen Code CI / Test (push) Blocked by required conditions
Qwen Code CI / Test-1 (push) Blocked by required conditions
Qwen Code CI / Test-2 (push) Blocked by required conditions
Qwen Code CI / Test-3 (push) Blocked by required conditions
Qwen Code CI / Test-4 (push) Blocked by required conditions
Qwen Code CI / Test-5 (push) Blocked by required conditions
Qwen Code CI / Test-6 (push) Blocked by required conditions
Qwen Code CI / Test-7 (push) Blocked by required conditions
Qwen Code CI / Test-8 (push) Blocked by required conditions
Qwen Code CI / Post Coverage Comment (push) Blocked by required conditions
E2E Tests / E2E Test (Linux) - sandbox:docker (push) Waiting to run
E2E Tests / E2E Test (Linux) - sandbox:none (push) Waiting to run
E2E Tests / E2E Test - macOS (push) Waiting to run
* fix(telemetry): suppress async resource attribute warning on startup

Disable NodeSDK's default auto-detection of host/process/env resources.
These detectors return async attributes; if a span is exported before
they settle, OTel logs an error-level diag message that surfaces in the
terminal UI. The attributes we actually need (service name, version,
session.id) are already provided synchronously via resourceFromAttributes.

Closes part of #3731

🤖 Generated with [Qwen Code](https://github.com/QwenLM/qwen-code)

* fix(telemetry): add comment and test for autoDetectResources: false

Address review feedback:
- Add inline comment explaining why auto-detection is disabled
  (avoids async attribute warning before first span export)
- Add test assertion that NodeSDK is constructed with
  autoDetectResources: false to prevent silent regression

🤖 Generated with [Qwen Code](https://github.com/QwenLM/qwen-code)

* fix(telemetry): improve comment precision and extend autoDetectResources test coverage

- Clarify comment: error fires on any resource attribute read before
  async detectors settle, not just during export; HttpInstrumentation
  span creation is the typical trigger
- Add autoDetectResources: false assertion to HTTP OTLP and file
  exporter test cases for branch-complete coverage across all three
  NodeSDK construction paths

🤖 Generated with [Qwen Code](https://github.com/QwenLM/qwen-code)
2026-05-03 19:36:03 +08:00
Shaojin Wen
07fdfadc33
feat(cli): include monitors in /tasks + add interactive-mode hint (#3801)
* feat(cli): include monitors in /tasks + add interactive-mode hint

Phase B closure for Issue #3634. Two coupled changes to /tasks:

1. **Bug fix — include monitors.** The command was last touched before
   #3684 / #3791 landed, so it merged only agent + shell entries while
   monitors silently disappeared from the headless / non-interactive /
   ACP listing path. Add a third registry pull from `getMonitorRegistry()`
   and wire monitor through statusLabel / taskLabel / taskId /
   taskOutputPath. Status line includes eventCount (`running (N events)`,
   `completed (exit 0, N events)`, `completed (Max events reached, N
   events)` for auto-stop) and pid where defined.

2. **Soft deprecation hint, scoped to interactive mode only.** Once the
   richer Ctrl+T dialog (#3488 + #3720 + #3791) is available, the text
   dump is the long-form fallback rather than the primary surface. Show
   `Tip: Ctrl+T opens the interactive Background tasks dialog with
   detail view + live updates.` at the top of the output when
   `executionMode === 'interactive'`. Headless / ACP get the bare list
   — they have no dialog to point at and the hint would just clutter.
   Description string also clarified to call out the modal split.

Kept on all three executionModes (no deletion) — `/tasks` is the only
way headless / ACP / SDK consumers can inspect background-task state.

Tests: 4 new cases in tasksCommand.test.ts cover monitor entry
formatting (running with pid, natural completion with exitCode,
auto-stop with error string, failed), the singular `1 event` form,
the interactive-mode hint gating, and the cross-kind merge order.

* fix(cli): address PR 3801 review — exhaustive switch + i18n + extra tests

Three actionable Suggestions from /review's pass:

- `taskLabel` rewritten as a `switch` with a `never`-typed `default`
  arm, matching the structural-safety pattern already used by `taskId`.
  Adding a 4th DialogEntry kind in the future will now flip both
  helpers to compile errors instead of letting `taskLabel` silently
  fall through to `entry.description` (which the new kind may not have).

- Hint string wrapped in `t()` for i18n consistency with the rest of
  the file. The literal stays as the i18n key default, so today's
  output is unchanged.

- Tests: cover `cancelled` monitor status (was the only one without an
  inline assertion) and explicit `acp` execution mode hint suppression
  (pins the suppression rationale so a future regression flipping the
  check to `!== 'non_interactive'` would fail loudly).

* fix(cli): correct /tasks dialog-open hint — Ctrl+T was wrong

Tmux verification on PR #3801 caught that the hint string says "Ctrl+T
opens the interactive Background tasks dialog" but Ctrl+T is actually
bound to the MCP tool descriptions toggle (ContextSummaryDisplay.tsx
lines 110-115). The dialog opens via Down arrow on an empty composer
(focuses the footer pill) followed by Enter (InputPrompt.tsx 947-968).
Same misattribution slipped into PR #3791's first description and was
caught + fixed there before merge — this PR carried the wrong wording
forward in code.

Updates four sites:
- The hint string itself: "Tip: press ↓ from an empty composer then
  Enter to open the interactive Background tasks dialog with detail
  view + live updates."
- The slash-command description: "interactive UI is Ctrl+T" → "interactive
  dialog opens via the footer pill"
- Two inline comments referencing Ctrl+T as the dialog opener
- The interactive-mode hint test now pins on `↓` + `Enter` and
  asserts `not.toContain('Ctrl+T')` so a regression to the wrong
  wording fails loudly.

* fix(cli): address PR 3801 review — exhaustive switch consistency + path-agnostic hint

Four Suggestions from the latest /review pass:

- `statusLabel` rewritten as a single top-level switch with a
  `never`-typed default, matching `taskLabel` / `taskId` /
  `taskOutputPath`. The previous `if`/`if`/fallthrough form would
  silently apply monitor formatting to a future 4th kind.
- `taskOutputPath` gained the same exhaustive default — was the only
  per-kind helper still relying on implicit fallthrough; would
  silently omit a 4th-kind output path while the adjacent helpers
  flip to compile errors.
- Hint wording de-specifies the exact keystroke count: `'Tip: focus
  the Background tasks pill in the footer (use ↓ from an empty
  composer) and press Enter ...'`. Previous "press ↓ then Enter"
  phrasing was wrong when the Arena agent tab bar is present —
  `InputPrompt`'s focus chain routes Down through the tab bar first,
  so a single Down lands there, not on the bg pill.
- Test pin tightened: `[mon_fail] failed: spawn ENOENT (0 events)` is
  now a full-string assertion instead of a prefix match, so a
  regression that drops the `(N events)` suffix from monitor's failed
  branch fails loudly.

* fix(cli): sanitize ANSI escape sequences in /tasks output

deepseek's review pass flagged that monitor description / error fields
are user / process-supplied strings rendered directly to the terminal.
A maliciously-crafted tool description or spawn error containing raw
ANSI control sequences (clear-screen, cursor-move, colour) would
otherwise reach stdout verbatim and corrupt display.

Same risk applies to agent error / description and shell error /
command — all already-existing renderers with the same exposure that
this PR didn't introduce but inherits. So instead of per-field
sprinkling, wrap the joined output once with `escapeAnsiCtrlCodes`
(no-op when no control chars present, so cost is zero in the common
case). One line change in the renderer covers every kind including
any future one.

Test pins the behaviour: a monitor entry with `\x1b[2J` /
`\x1b[31m...` content produces output with no raw ESC bytes and
visible escaped `[...]` sequences.

* docs(cli): tighten escapeAnsiCtrlCodes comments to match actual scope

Two doc-precision Suggestions from copilot's pass on 0840e32f6:

- Source comment claimed `escapeAnsiCtrlCodes` is "a no-op when no
  control chars" but it's narrower than that — it only handles
  sequences matched by `ansi-regex` (CSI / OSC / SGR — anything
  starting with ESC). Isolated C0/C1 control bytes like BEL, BS, VT
  pass through untouched. Updated the comment to enumerate the actual
  scope and call out that `node:util`'s `stripVTControlCharacters`
  would be needed if those become a concern.

- Test comment had a literal raw ESC byte (octal 033) embedded in the
  source — visually showed `^[[...]` in editors that render ESC, but
  was a real ESC byte in the file rather than the escaped ``
  form the sanitizer produces. Rewrote with a literal `` text
  description so what the comment shows matches what the assertions
  check for.

* fix(cli): broaden /tasks sanitization + tighten inner switch exhaustiveness

Addresses 3 of 5 items from doudouOUC's PR 3801 review:

- **Issue 1 (Low) — C0/C1 control byte gap**: switched from
  `escapeAnsiCtrlCodes` (only handles ESC-initiated ANSI sequences) to
  `stripUnsafeCharacters` (one-pass strip of ANSI + VT + C0/C1, with
  TAB/CR/LF preserved). The pre-existing exposure to bare BEL / BS /
  FF / VT bytes via shell entry strings is now closed for all three
  kinds. Test rewritten to cover both ANSI sequences AND bare control
  bytes (BEL, BS), and pins that surrounding printable text and line
  breaks survive.

- **Issue 2 (Low) — inner status switches inconsistent**: the three
  inner `switch (entry.status)` blocks (agent / shell / monitor) used
  `case 'running': default: return 'running'` (or duplicated bodies).
  All three now have explicit `running` cases plus a `never`-typed
  default that throws — matches the outer `switch (entry.kind)`
  pattern and means a future status added to any of `BackgroundTaskEntry`
  / `BackgroundShellEntry` / `MonitorStatus` flips to a compile error
  here instead of silently returning `'running'`.

- **Issue 5 (Nit) — beforeEach default change**: added an inline
  comment explaining why the test default overrides
  `createMockCommandContext`'s `'interactive'` default
  (`'non_interactive'` lets the hint-suppression assertions work
  without each test rebinding context).

Issues 3 and 4 from the review are nits with no action needed (3 is
already documented as intentional; 4 is a UX call about hint length
that's better handled by user feedback than guess-tweaking).

* fix(cli): bind status to local before exhaustive switch — fixes tsc build

CI's `tsc --build` (full mode, vs `--noEmit` locally) caught that
`switch (entry.status)` followed by a `never`-typed default reading
`entry.status` doesn't compile. After the case arms exhaust the
discriminated union, TS narrows `entry` itself to `never`, so the
`.status` access in the default arm becomes "Property 'status' does
not exist on type 'never'" + the resulting `any` value can't be
assigned to `never`.

Fix: bind `entry.status` to a local `status` const before the inner
switch. The local stays typed as the per-kind status union and
narrows correctly to `never` at the default arm — `const _exhaustive:
never = status` is then `never = never`, valid.

Standard exhaustive-switch-on-discriminator pattern; doesn't change
runtime behavior or test surface, just gets past TS narrowing on the
nested case.
2026-05-03 18:45:51 +08:00
Shaojin Wen
cdadbcdb33
feat(cli): wire Monitor entries into combined Background tasks dialog (#3791)
Some checks are pending
Qwen Code CI / Lint (push) Waiting to run
Qwen Code CI / Test (push) Blocked by required conditions
Qwen Code CI / Test-1 (push) Blocked by required conditions
Qwen Code CI / Test-2 (push) Blocked by required conditions
Qwen Code CI / Test-3 (push) Blocked by required conditions
Qwen Code CI / Test-4 (push) Blocked by required conditions
Qwen Code CI / Test-5 (push) Blocked by required conditions
Qwen Code CI / Test-6 (push) Blocked by required conditions
Qwen Code CI / Test-7 (push) Blocked by required conditions
Qwen Code CI / Test-8 (push) Blocked by required conditions
Qwen Code CI / Post Coverage Comment (push) Blocked by required conditions
Qwen Code CI / CodeQL (push) Waiting to run
E2E Tests / E2E Test (Linux) - sandbox:docker (push) Waiting to run
E2E Tests / E2E Test (Linux) - sandbox:none (push) Waiting to run
E2E Tests / E2E Test - macOS (push) Waiting to run
* feat(cli): wire Monitor entries into combined Background tasks dialog

Phase C mirror follow-up of Issue #3634, structurally a clean repeat of
#3720 but for Monitor (third consumer of the kind framework).

Core (MonitorRegistry):
- Add `setStatusChangeCallback` mirroring `BackgroundShellRegistry` /
  `BackgroundTaskRegistry`. Fires synchronously inside `register()` (so
  subscribers see lifecycle start) and inside `settle()` (so subscribers
  see every running → terminal transition: complete / fail / cancel /
  emitEvent's auto-stop at maxEvents).
- Subscriber failures are caught and logged but do not poison the
  registry — same defensive contract as the other two registries.

CLI:
- `useBackgroundTaskView` subscribes to all three registries (agent +
  shell + monitor) and merges by `startTime`. `DialogEntry` union
  extended with `(MonitorEntry & { kind: 'monitor' })`. `entryId`
  switches over the three kinds and returns the right id field.
- `BackgroundTasksPill.getPillLabel` adds monitor to its KIND_NAMES
  table; grouping order is shell → agent → monitor (monitor last
  because it tends to be the longest-lived, least urgent to glance at).
- `BackgroundTasksDialog`:
  - `rowLabel` returns `[monitor] <description>` for monitor rows.
  - New `MonitorDetailBody` showing command / status / pid / event
    count / dropped lines. No Progress block (monitors don't fire
    activity callbacks per-event).
  - DetailBody dispatcher gains the monitor branch.
- `BackgroundTaskViewContext.cancelSelected` routes monitor cancels via
  `monitorRegistry.cancel(monitorId)`. Monitor's cancel is synchronous
  (settle + abort happen inside the registry), matching the same path
  task_stop already uses.

Tests: 6 new core tests (registry callbacks fire on register / each
terminal transition / not on emitEvent / on auto-stop / clear stops
notifications / subscriber-failure isolation), 4 new pill tests
(singular / plural / 3-kind grouping / running-only filter), dialog
mock extended with `getMonitorRegistry`.

* fix(cli): add exhaustive default arms to DialogEntry switches

ESLint's `default-case` rule requires every switch to have a default
arm even when TypeScript can prove the union is exhaustive. Add
`default: { const _exhaustive: never = entry; throw ... }` to the
five switches added in this PR — same pattern keeps both the runtime
guard and the compile-time exhaustiveness check.

* fix(core): fire statusChange in MonitorRegistry.reset()

The newly-added `setStatusChangeCallback` subscriber misses `reset()`,
so a `/clear` or session reset leaves stale monitor rows visible in the
combined Background tasks dialog until an unrelated register/settle
event happens. Both BackgroundShellRegistry and BackgroundTaskRegistry
already fire statusChange on their reset paths — Monitor was the
outlier.

Fix: fire `statusChange()` (no arg) after `monitors.clear()`, with an
early return when the registry is already empty so we don't notify on
a no-op reset. Two new tests cover both branches.

* fix(cli,core): address PR 3791 review feedback

Four review threads from /review's second pass on top of f26b700:

1. **MonitorDetailBody live counters were stale.** `eventCount` and
   `droppedLines` came from the `useBackgroundTaskView` snapshot, which
   only refreshes on `statusChange`, and `emitEvent` deliberately
   doesn't fire `statusChange` (the per-event churn would burn the pill
   / AppContainer). Fix: extend the existing `selectedEntry` useMemo to
   re-resolve monitors from `monitorRegistry.get()` on each
   `activityTick`, mirroring the agent path. The pre-existing 1s
   wall-clock interval already drives the recompute while the entry is
   running, so no new callback wiring is needed.

2. **Settle reasons weren't persisted on `MonitorEntry`.** `fail()`,
   `complete(exitCode)`, max-events auto-stop, and idle-timeout all
   passed their reason strings only to the chat-history terminal
   notification — opening the dialog detail view after the monitor
   died showed the bare status with no clue why. Add `exitCode?` and
   `error?` fields (mirrors `BackgroundShellEntry`); populate them
   before `settle()` in each path; render them in `MonitorDetailBody`
   with appropriate colour (red for `failed`, warning for
   auto-stopped `completed`).

3. **`monitorCancel` mock had no test asserting it.** Add a dedicated
   test that opens detail on a monitor entry, presses `x`, and verifies
   `monitorRegistry.cancel(monitorId)` was called and the agent
   registry's cancel was NOT called. Pins the kind switch in
   `cancelSelected` so a regression flipping the monitor branch to
   anything else (e.g. `requestCancel`) would fail loudly.

4. **`MonitorStatusChangeCallback` docstring was wrong.** It claimed
   the callback fires on running → terminal transitions, but
   `register()` (nothing → running) and `reset()` (mass clear, fired
   with no entry) also fire it. Rewrite the docstring to enumerate the
   actual fire sites and explicitly note that `emitEvent` is excluded
   (per-event churn).

* docs(cli,core): tighten MonitorEntry.error and rowLabel comments

Two doc-precision fixes from copilot's PR 3791 review pass:

- `MonitorEntry.error` enumerated max-events as the only auto-stop
  reason that populates the field, but `idle-timeout` also writes it
  (was added in the same earlier commit). Rewrote to list both current
  reasons and explicitly note: any future auto-stop reason should
  populate this field too. Also clarified that `cancelled` and
  natural-exit `completed` paths intentionally don't.

- `rowLabel`'s shell-branch comment claimed long commands are
  "truncated by the row renderer's MaxSizedBox", but ListBody renders
  rows with plain `<Text>` (no MaxSizedBox / truncation helper). Long
  text wraps. Rewrote to describe actual behaviour and note the
  intentional decision to leave it wrapping (the dialog is what users
  open to see context — truncating defeats the purpose).

* test(cli): cover MonitorDetailBody render branches + useBackgroundTaskView

Two coverage gaps from /review's third pass on PR 3791:

- New file `useBackgroundTaskView.test.ts` (6 cases) directly exercises
  the merge logic with a stub config: empty when config is null, merges
  three registries, sorts by startTime, tags `kind`, subscribes on
  mount, refreshes when any registry fires statusChange, clears all
  three subscriptions on unmount.

- New `MonitorDetailBody render branches` describe block in
  `BackgroundTasksDialog.test.tsx` (8 cases) covers the conditional
  pieces my last commit added: title + Command, pid show/hide,
  eventCount singular vs plural, droppedLines show/hide, exitCode
  display, Error block (failed) vs Stopped because (auto-stop), and
  the all-undefined no-block case.
2026-05-03 10:05:19 +08:00
Umut Polat
a08d48b75c
fix(cli): stop double-wrapping and double-printing API errors in non-interactive mode (#3749)
* fix(cli): stop double-wrapping and double-printing API errors in non-interactive mode

In non-interactive (-p) mode, any upstream 4xx ended up on stderr three
times: once from the stream-error handler, once from handleError after
the thrown Error.message (already containing the formatted text) was
fed back through parseAndFormatApiError producing
"[API Error: [API Error: ...]]", and once more from
JsonOutputAdapter.emitResult writing the same errorMessage out in TEXT
mode. The top-level catch then framed the resulting throw as
"An unexpected critical error occurred:" with a stack trace, which
made a routine 4xx look like a CLI crash.

Three coordinated changes:

- AlreadyReportedError marks a throw whose message is already on the
  wire. handleError short-circuits on it: no second writeStderrLine,
  no second parseAndFormatApiError, just propagate the exit code.
- The non-interactive stream-error handler now throws
  AlreadyReportedError, and the catch block skips the adapter's
  emitResult in TEXT mode so we don't get a third copy.
- The top-level .catch in packages/cli/index.ts treats
  AlreadyReportedError as a routine, already-reported failure: exit
  with the carried code without printing the "unexpected critical"
  framing or the stack trace.

parseAndFormatApiError is also made idempotent — input that already
starts with "[API Error: " and ends with "]" is returned unchanged.
That is the safety net: even if a future caller forgets to mark its
throw, the double-wrap symptom is impossible.

Tests cover all three layers: idempotency in errorParsing, the
short-circuit in handleError, and a regression test on
runNonInteractive that asserts no "[API Error: [API Error: ...]" line
is ever produced on stderr.

Fixes #3748

* fix(cli): make API error formatting idempotent for 429 suffix + use AlreadyReportedError

* fix(cli): follow up on error reporting review
2026-05-03 08:39:31 +08:00
John London
5037fa7627
Feat/stats model cost estimation rebase (#3780)
* feat(stats): add optional cost estimation to /stats model

Adds optional cost estimation based on user-defined pricing in settings.json.
Users can configure per-model pricing via the new modelPricing setting.
When configured, /stats model shows estimated cost; when not configured,
the behavior is unchanged.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(stats): extract raw model name from composite key for cost lookup

flattenModelsBySource creates keys like "model::source", but
modelPricing is keyed by raw model names. Extract the raw
model name by splitting on "::" to fix cost lookup.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix: update settings schema after adding modelPricing

Regenerate the VS Code settings schema to include the new
modelPricing field so the lint check passes.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* test(ui): add regression tests for cost estimation edge cases

Coverage for the cost display fixes in ModelStatsDisplay:
- Cost section hidden when no pricing + no thoughts
- Cost section shown when pricing is configured (with value check)
- Thoughts tokens included in cost calculation (with larger numbers
  to expose the before/after difference)
- Raw model name used for pricing lookup with subagent attribution
- Cost section shown when thoughts > 0 even without pricing
- Multiple models with different pricing

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(ui): include thoughts tokens in ModelStatsDisplay cost calculation

Address review feedback: the interactive cost estimate was omitting
thoughts/reasoning tokens from the output token count, causing it to
disagree with the non-interactive /stats model path.

Changes:
- hasPricing visibility gate now includes thoughts in outputTokens
- Cost estimate calculation now includes thoughts in outputTokens
- getModelName() already correctly extracts raw model name from
  flattened model::source keys for pricing lookup

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(stats): include thoughts tokens in non-interactive cost calc + update snapshots

- Fix statsCommand.ts non-interactive path to include thoughts tokens in
  outputTokens for cost calculation, aligning with the interactive
  ModelStatsDisplay path.
- Update 4 ModelStatsDisplay snapshots to reflect Cost section appearing
  when thoughts > 0 (even without pricing, showing N/A).

* fix(test): address all CI failures and reviewer feedback on PR #3780

- Add `toModelMetrics` helper to statsCommand.test.ts to properly
  create ModelMetrics with required `bySource` property
- Fix ModelStatsDisplay.test.tsx:
  - Add missing `sessionId` to SessionStatsState mock
  - Add missing `startNewSession` to useSessionStatsMock
  - Add `auto_accept` to all totalDecisions mocks (7 locations)
  - Add `files` to all SessionMetrics objects (6 locations)
  - Remove contradictory test "should show Cost section when thoughts > 0
    even without pricing" per Option A (strict opt-in)
- Revert 4 snapshots that incorrectly showed Cost/N/A lines for models
  without pricing configuration

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-03 07:31:08 +08:00