Commit graph

5743 commits

wenshao
cc916b528e docs(attribution): correct legacyTypes / EXCLUDED_DIRECTORY_SEGMENTS comments
9Ta_ (Copilot): the JSDoc on legacyTypes claimed JSON Schema's
`type` keyword does not accept `'object'` — that's wrong; `'object'`
IS a valid JSON Schema type. Reword to reflect the actual rationale:
`'enum'` is not a valid JSON Schema `type` value at all (enum
constraints use the `enum` keyword), and a bare `{type: 'object'}`
would accept any object regardless of what the field's pre-expansion
shape actually allowed. The narrowed `boolean | string | number |
array` set is exactly what the one-liner generator can faithfully
emit; richer legacy shapes belong in their own branch of
convertSettingToJsonSchema.

9Tbs (Copilot): the comment in generatedFiles.ts referenced
`EXCLUDED_DIRECTORIES`, but the constant is `EXCLUDED_DIRECTORY_SEGMENTS`
(renamed during the segment-boundary refactor). Update the
reference so a future maintainer scanning for the rule doesn't
chase a non-existent identifier.
2026-05-06 20:47:43 +08:00
wenshao
a66a21d81d fix(attribution): GIT_DIR repo-shift bail, snapshot envelope validation, narrow legacyTypes
80ME (gpt-5.5 /review, [Critical]): tokeniseSegment unconditionally
stripped every leading KEY=value token. `GIT_DIR=elsewhere/.git git
commit ...` was therefore treated as an in-cwd commit, picked up the
Co-authored-by trailer, and produced a per-file note that landed
against our cwd's HEAD even though the actual commit went to a
different repo. Define a GIT_ENV_SHIFTS_REPO set (GIT_DIR,
GIT_WORK_TREE, GIT_COMMON_DIR, GIT_INDEX_FILE, GIT_NAMESPACE) and
have tokeniseSegment refuse to parse any segment whose leading env
block (including the env-wrapper's KEY=VALUE block) carries one of
these. Identity / date variables (GIT_AUTHOR_*, GIT_COMMITTER_*) are
deliberately NOT in the set — they tweak metadata but don't relocate
the repo. Tests cover plain prefix, env-wrapped prefix, and a
GIT_COMMITTER_DATE positive control that should still get the trailer.
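The bail described above can be sketched roughly as follows. `GIT_ENV_SHIFTS_REPO` and its members come from the commit text; `segmentShiftsRepo` and its whitespace tokenisation are illustrative stand-ins for the real `tokeniseSegment` logic, not the actual implementation:

```typescript
// Env vars that relocate the repository a `git commit` lands in.
const GIT_ENV_SHIFTS_REPO = new Set([
  'GIT_DIR',
  'GIT_WORK_TREE',
  'GIT_COMMON_DIR',
  'GIT_INDEX_FILE',
  'GIT_NAMESPACE',
]);

// True when the segment's leading KEY=VALUE block (including an
// env-wrapper's block) carries a repo-shifting variable, in which
// case the caller should refuse to treat it as an in-cwd commit.
function segmentShiftsRepo(segment: string): boolean {
  const tokens = segment.trim().split(/\s+/);
  for (const token of tokens) {
    if (token === 'env') continue; // keep scanning the env wrapper's block
    const match = /^([A-Za-z_][A-Za-z0-9_]*)=/.exec(token);
    if (!match) break; // first non-assignment token: the command itself
    if (GIT_ENV_SHIFTS_REPO.has(match[1])) return true;
  }
  return false;
}
```

Identity and date prefixes (GIT_AUTHOR_*, GIT_COMMITTER_*) fall through and still parse, matching the positive control above.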

8EeQ (Copilot): restoreFromSnapshot received `snapshot as
AttributionSnapshot` from a structural cast off `unknown` (the
resume path), so its TS-typed shape was only a hint. A corrupted
JSONL line (non-object / array / wrong type discriminator / missing
type) would skip past the version check straight into
Object.entries(snapshot.fileStates) — and a non-object fileStates
(an array, say) seeded fileAttributions with numeric-string keys.
Add envelope-level shape gates (isPlainObject + type discriminator)
and a fileStates plain-object check before iterating; both bail to a
clean reset rather than poisoning the singleton. Tests added.
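A minimal sketch of the envelope gates; `AttributionSnapshot`, the discriminator value, and `validateSnapshot` are illustrative names, and the real resume path bails to a clean reset rather than returning null:

```typescript
type AttributionSnapshot = {
  type: 'attribution_snapshot';
  fileStates: Record<string, unknown>;
};

// Arrays are objects too, so a bare typeof check is not enough.
function isPlainObject(value: unknown): value is Record<string, unknown> {
  return typeof value === 'object' && value !== null && !Array.isArray(value);
}

// Envelope-level shape gates before any Object.entries() iteration:
// non-object, wrong/missing discriminator, or non-object fileStates
// all signal "reset cleanly" instead of poisoning the state.
function validateSnapshot(snapshot: unknown): AttributionSnapshot | null {
  if (!isPlainObject(snapshot)) return null;
  if (snapshot['type'] !== 'attribution_snapshot') return null;
  if (!isPlainObject(snapshot['fileStates'])) return null;
  return snapshot as AttributionSnapshot;
}
```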

8Eej (Copilot): SettingDefinition.legacyTypes was typed as
SettingsType[] which includes 'enum' and 'object' — JSON Schema's
`type` keyword doesn't accept those values. Adding
`legacyTypes: ['enum']` would silently produce an invalid
settings.schema.json. Narrow the field's type to
ReadonlyArray<'boolean' | 'string' | 'number' | 'array'> (the
JSON-Schema-primitive subset). Future complex-shape legacy support
should land its own branch in convertSettingToJsonSchema.
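The narrowing can be illustrated as follows; `SettingDefinition` here and `jsonSchemaType` are simplified stand-ins for the real types and for `convertSettingToJsonSchema`'s one-liner branch, not the actual code:

```typescript
// The JSON-Schema-primitive subset the one-liner can faithfully emit.
type LegacyPrimitive = 'boolean' | 'string' | 'number' | 'array';

interface SettingDefinition {
  type: string;
  // Narrowed: 'enum' and 'object' are no longer accepted here.
  legacyTypes?: ReadonlyArray<LegacyPrimitive>;
}

// A legacy-aware `type` value: the current type plus any legacy
// primitives, deduplicated; a single entry stays a bare string.
function jsonSchemaType(def: SettingDefinition): string | string[] {
  const unique = [...new Set([def.type, ...(def.legacyTypes ?? [])])];
  return unique.length === 1 ? unique[0] : unique;
}
```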
2026-05-06 20:34:41 +08:00
wenshao
c0ed9092b0 fix(attribution): restore fire-and-forget appendRecord, route rollback via callback
6OcJ (Copilot): refactor in 715c258fb returned a Promise from
appendRecord so the snapshot dedup-key path could chain rollback —
but recordUserMessage / recordAssistantTurn / recordAtCommand /
recordSlashCommand / rewindRecording all call appendRecord without
await or .catch(). A transient jsonl.writeLine rejection on any of
those would surface as an unhandled-promise-rejection (warning, or
crash on --unhandled-rejections=throw).

Restore the original fire-and-forget semantics: appendRecord again
returns void and internally swallows async failures (logging via
debugLogger). Per-record failure reactions are routed through an
optional onError callback — recordAttributionSnapshot uses this to
roll back lastAttributionSnapshotJson when the write that set it
ends up rejecting.

Tests: add a fire-and-forget regression that mocks writeLine to
reject and asserts no unhandledRejection events fire while the
existing snapshot rollback tests (sync + async) still pass via the
new callback path.
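A hedged sketch of the restored semantics; `writeLine` and the record shape are assumptions, not the real jsonl API:

```typescript
type AttributionRecord = { kind: string; payload: unknown };

function makeAppendRecord(
  writeLine: (line: string) => Promise<void>,
  log: (msg: string) => void,
) {
  // Returns void: callers never await or .catch, so a transient write
  // rejection can no longer surface as an unhandled promise rejection.
  return function appendRecord(
    record: AttributionRecord,
    onError?: (err: unknown) => void,
  ): void {
    void writeLine(JSON.stringify(record)).catch((err) => {
      log(`appendRecord failed: ${String(err)}`); // swallowed, only logged
      onError?.(err); // per-record reaction, e.g. snapshot rollback
    });
  };
}
```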
2026-05-06 19:17:08 +08:00
wenshao
8f9c7e405a Merge remote-tracking branch 'origin/main' into feat/commit-attribution
# Conflicts:
#	packages/core/src/tools/write-file.test.ts
2026-05-06 17:23:07 +08:00
Shaojin Wen
5441581392 feat(core): enforce prior read before Edit / WriteFile mutates a file (#3774)
* feat(core): enforce prior read before Edit / WriteFile mutates a file

Introduces a session-scoped invariant: the model cannot mutate an
existing file without having actually Read it (or its post-write
state) earlier in this conversation. Builds on the FileReadCache
landed in #3717.

Two new ToolErrorType codes:
- EDIT_REQUIRES_PRIOR_READ — file has no entry in the session cache.
  The model is told to use read_file first.
- FILE_CHANGED_SINCE_READ — file has an entry but its mtime or size
  drifted since the recorded fingerprint. The model is told to
  re-read before retrying.

EditTool blocks the existing-file path on cache.check; new-file
creation (old_string === '' on a non-existent target) is exempt.
WriteFileTool blocks the overwrite path; new-file creation
(fileExists === false) is exempt.

Both tools route through the existing fileReadCacheDisabled escape
hatch on Config — flipping it bypasses enforcement byte-for-byte,
matching pre-cache behaviour. Operators can use this as a kill switch
if a session falls into a state where the cache cannot be trusted.

ReadFile fix on the auto-memory path: PR #3717 had auto-memory reads
skip the cache entirely (both lookup and record), but with the new
enforcement that means a model that just Read AGENTS.md cannot then
Edit it. Decoupled the two: auto-memory reads still skip the
file_unchanged fast-path (the per-read freshness <system-reminder>
must always reach the model) but DO record into the cache so the
follow-up Edit sees `fresh`. New regression test asserts this.

Test plan
- vitest run (all of @qwen-code/qwen-code-core): 6308 passed, 2 skipped
- 9 new enforcement tests across edit.test.ts and write-file.test.ts:
  unknown rejects, stale rejects, new-file exempt, edit chain stays
  authorised, escape hatch bypasses, plus the auto-memory record
  regression in read-file.test.ts.
- tsc --noEmit clean. eslint clean. core build succeeds.

* test(core): clear shared fileReadCache between write-file.test.ts cases

CI surfaced one Linux-only failure: the prior-read enforcement test
'rejects a write that would overwrite an unread existing file'
returned FILE_CHANGED_SINCE_READ instead of EDIT_REQUIRES_PRIOR_READ.

Root cause: the FileReadCache instance is declared at module scope
(line 41) and shared across every test in write-file.test.ts. State
from earlier tests — most recently the 'records a write' integration
test that records the same path — leaks forward. On Linux the test
ordering puts a record-bearing test before the enforcement test, so
the cache reports `stale` (mtime drifted) instead of `unknown`.
macOS / Windows happen to order them differently and never hit it.

Adding a fileReadCache.clear() to beforeEach gives every test a
known-empty cache, matching how edit.test.ts already isolates its
per-test cache by re-instantiating it.

* fix(core): close prior-read enforcement gaps flagged in 3rd review

Three concrete loopholes / regressions that the original PR-B
introduction left open. All three are addressed in the same commit
because the underlying refactor (move enforcement earlier and tighten
the fresh predicate) is shared across them.

1. fresh != "model has seen the bytes". Pre-fix, requirePriorRead()
   accepted any cache.check === 'fresh'. ReadFileTool records every
   successful read into the cache, including ranged reads
   (offset/limit), truncated full reads, and non-cacheable
   binary/image/audio/video/PDF/notebook reads (lastReadCacheable
   = false). This let the model peek at a slice or a structured
   payload of a file and then mutate the whole thing. Tightened the
   accept predicate to fresh && lastReadAt && lastReadWasFull &&
   lastReadCacheable.

2. Read-less content oracle through calculateEdit error codes. Pre-fix,
   execute() ran calculateEdit (which reads file bytes and counts
   matches) before the enforcement check. A model could probe an
   unread file by attempting Edits with candidate old_strings and
   observing NO_OCCURRENCE_FOUND vs EXPECTED_OCCURRENCE_MISMATCH vs
   EDIT_NO_CHANGE — reverse-engineering content without ever calling
   read_file. Moved enforcement to the top of calculateEdit, before
   any content read; only a stat is performed up to the rejection
   point.

3. Confirmation flow regression. Pre-fix, getConfirmationDetails()
   read the existing file to render a diff for the user, then
   approval flowed to execute() which would freshly check the cache
   and reject. The user could approve a diff computed from current
   bytes the model never saw, and the call would still fail. Moved
   enforcement before the confirmation read in both EditTool (via the
   shared calculateEdit path) and WriteFileTool (explicit check at
   the top of getConfirmationDetails). The user now never sees a
   confirmation diff for an unread file — the call rejects up front.

Public API surface change: requirePriorRead() -> checkPriorRead() that
returns a structured decision, so the same predicate can route into
a CalculatedEdit.error (calculateEdit), a thrown error
(getConfirmationDetails), or a ToolResult (execute) without
duplicating the boolean / message / type plumbing in three shapes.

Reported by pomelo-nwu (3 inline comments on PR #3774).

* refactor(core): close 4 prior-read enforcement gaps from 4th review

1. recordWrite now seeds read metadata on brand-new entries
   (lastReadAt / lastReadWasFull / lastReadCacheable). The strict
   accept predicate added in the previous round (#3 review) requires
   all three, but recordWrite only set lastWriteAt — so a model
   creating a file with Edit (old_string="") or WriteFile and then
   editing it again was rejected on the second edit. The model
   authored the bytes it just wrote; for the purposes of prior-read
   enforcement that counts as having seen them. New regression test
   in edit.test.ts: "allows a create-then-edit-then-edit chain
   without an intervening read".

2. Extracted checkPriorRead into src/tools/priorReadEnforcement.ts.
   The two copies in edit.ts and write-file.ts had already drifted
   (one used ${ReadFileTool.Name}, the other hardcoded 'read_file');
   the boolean guard is security-sensitive and a one-sided fix
   would silently weaken the boundary. The shared utility takes a
   verb ('editing' | 'overwriting') so the user-facing prose can
   differ between callers without duplicating the decision logic.

3. WriteFileTool.execute now runs prior-read enforcement BEFORE
   readTextFile. Pre-fix, an unread overwrite still slurped the
   entire file into memory (encoding / BOM / line-ending detection)
   and only then rejected it: wasted I/O, and momentary in-memory
   custody of bytes the model never legitimately read. Now matches
   the order in getConfirmationDetails().

4. The "rejects a write that would overwrite an unread existing
   file" test now spies on FileSystemService.readTextFile and
   asserts not.toHaveBeenCalled() — without that, the test gave
   false confidence: it passed both pre-fix (read happened, then
   reject) and post-fix (reject before read), so the ordering
   regression in (3) was invisible to the assertion.

Reported by glm-5.1 via /review on PR #3774.

* refactor(core): close 4 prior-read enforcement gaps from 4th review (Copilot)

Five concrete gaps that the previous round of enforcement work left
open. Reported by Copilot via /review on PR #3774.

1. Confirmation-time rejections lost their ToolErrorType code.
   getConfirmationDetails() in both EditTool and WriteFileTool threw
   a plain Error on prior-read failure, which coreToolScheduler
   collapsed into UNHANDLED_EXCEPTION — silently breaking the
   EDIT_REQUIRES_PRIOR_READ / FILE_CHANGED_SINCE_READ contract for
   any approval-required flow.

   Fix: introduce PriorReadEnforcementError that carries
   `errorType: ToolErrorType`. Both confirmation paths now throw it,
   and coreToolScheduler reads `error.errorType` (falling back to
   UNHANDLED_EXCEPTION when absent). New regression tests assert
   the thrown error's `errorType` field for both tools.

2. checkPriorRead's "re-read with read_file" advice was wrong for
   binary / image / audio / video / PDF / notebook files. Their
   ReadFile result always sets lastReadCacheable=false, so the
   message would loop the agent forever on the same rejection.

   Fix: detect the fresh-but-non-cacheable case explicitly and
   return a dedicated message that explains the dead end ("Edit /
   WriteFile cannot mutate that payload safely") instead of asking
   for another read. Updated the existing non-cacheable regression
   test to assert the new message and the absence of "use the
   read_file tool first".

3. checkPriorRead swallowed every stat() failure and returned
   ok:true. EACCES, EBUSY, NFS hiccups, etc. would silently
   re-open the blind-write path the helper exists to block.

   Fix: only ENOENT continues to return ok:true (disappearance
   race). Any other code is fail-closed: returns
   EDIT_REQUIRES_PRIOR_READ with a message that names the errno.
   New regression test in write-file.test.ts spies on fs.promises
   .stat to inject EACCES and asserts the rejection.

4. The auto-memory record regression test only asserted `state ===
   'fresh'`. A future change that recorded auto-memory reads as
   partial / non-cacheable would still satisfy that assertion but
   would actually fail enforcement on every follow-up Edit.

   Fix: also assert lastReadAt is defined, lastReadWasFull is true,
   and lastReadCacheable is true. The full "what enforcement
   requires" predicate is now explicit in the test.

(The 5th item, the WriteFile mirror of (1), is covered by the same
PriorReadEnforcementError change.)

* refactor(core): tighten StructuredToolError naming + add scheduler test

Four follow-ups raised by deepseek-v4-pro on PR #3774. None of them
change the enforcement boundary; they are all about making the
contract clearer and harder to break in future changes.

1. PriorReadEnforcementError -> StructuredToolError. The class now
   wraps any content-derived ToolErrorType from calculateEdit
   (EDIT_NO_OCCURRENCE_FOUND, EDIT_EXPECTED_OCCURRENCE_MISMATCH,
   EDIT_NO_CHANGE, ATTEMPT_TO_CREATE_EXISTING_FILE) on top of the
   prior-read codes. The old name suggested the class was prior-
   read-specific, which would mislead any oncall engineer seeing
   it paired with one of the calculateEdit error codes.

2. EDIT_REQUIRES_PRIOR_READ kept its name (the prefix mentions
   "edit" but the enum is shared with WriteFileTool) — chose
   documentation over rename to avoid the churn of a value rename
   across logs/dashboards already keyed on it. JSDoc now spells
   out the cross-tool usage explicitly.

3. Stat failures other than ENOENT now map to a new
   PRIOR_READ_VERIFICATION_FAILED code instead of being conflated
   with EDIT_REQUIRES_PRIOR_READ. The failure mode is "we cannot
   verify" rather than "definitely not read" — operators routing
   on error codes can distinguish the two populations.

4. Added a coreToolScheduler test (`surfaces error.errorType from
   a confirmation throw instead of UNHANDLED_EXCEPTION`) that
   constructs a stub tool whose getConfirmationDetails throws
   StructuredToolError and asserts the surfaced ToolCall response
   carries the correct ToolErrorType. Without this test the
   scheduler's explicitErrorType branch would have no coverage at
   all.

Tool tests updated for the new StructuredToolError class name and
the PRIOR_READ_VERIFICATION_FAILED code on the EACCES path.

* fix(core): close TOCTOU + grammar + directory regressions in PR-B

Six concrete issues that the previous round of enforcement work
left open. Reported by Copilot via /review on PR #3774.

1. TOCTOU window between pre-read checkPriorRead and readTextFile.
   The pre-read stat could pass enforcement, then an external writer
   could land between that stat and the actual read, leaving
   currentContent reflecting bytes the model never saw — exactly the
   stale-write path the PR is supposed to block. Closed by re-running
   checkPriorRead immediately after every readTextFile that fed
   currentContent / originalContent: EditTool.calculateEdit and the
   two WriteFileTool paths (execute + getConfirmationDetails). A
   `stale` outcome now fails the operation with
   FILE_CHANGED_SINCE_READ at the correct moment.

2. Directory targets sent the model into an enforcement loop.
   `fileExists` is a plain access check, so directories also entered
   the enforcement branch — the model would be told to call
   `read_file`, but `read_file` rejects directories with
   TARGET_IS_DIRECTORY, so the loop never terminated. Fixed in
   checkPriorRead: if `fs.stat` reports the path is not a regular
   file, return `ok: true` so the downstream readTextFile / write
   path can surface its own EISDIR / similar error.

3. Confirmation-time error messages used the short `display` form
   instead of the full `raw` form. Approval-required Edit calls
   therefore lost the remediation detail (file path, stale-vs-unread
   distinction, "without offset / limit / pages" hint) that the
   execute path already surfaced and that the WriteFile confirmation
   path already preserved. EditTool.getConfirmationDetails now
   throws StructuredToolError with `editData.error.raw`.

4. Non-text payload displayMessage was grammatically broken: built
   from the gerund `editing` / `overwriting`, it rendered as
   "cannot editing via this tool" / "cannot overwriting via this
   tool". Fixed by deriving a bare-verb form (`edit` / `overwrite`)
   alongside the gerund and using it in displayMessage.

(Items 1, 5 and 6 from Copilot's batch are the same TOCTOU class —
EditTool calculateEdit + WriteFile execute + WriteFile confirmation —
addressed together in (1) above.)

The "bypasses enforcement entirely" test now uses mockReturnValue
instead of mockReturnValueOnce because calculateEdit calls
getFileReadCacheDisabled twice — once for the pre-read check and
once for the post-read TOCTOU re-check. Both must see disabled=true
to actually bypass.
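The double-check pattern from (1) reduces to the following shape (all names illustrative). Because the guard consults enforcement twice, a once-only stub sees only the pre-read call — which is exactly why the bypass test needed `mockReturnValue`:

```typescript
type Outcome = 'ok' | 'stale' | 'unread';

function mutateWithToctouGuard(
  checkPriorRead: () => Outcome,
  readTextFile: () => string,
): { content: string } | { error: string } {
  const pre = checkPriorRead();
  if (pre !== 'ok') {
    return { error: pre === 'stale' ? 'FILE_CHANGED_SINCE_READ' : 'EDIT_REQUIRES_PRIOR_READ' };
  }
  const content = readTextFile(); // an external writer may land here
  // Re-run enforcement against the post-read fingerprint: a write that
  // slipped between the pre-read stat and the read is caught now.
  if (checkPriorRead() !== 'ok') {
    return { error: 'FILE_CHANGED_SINCE_READ' };
  }
  return { content };
}
```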

* fix(core): close fileExists TOCTOU on WriteFile prior-read enforcement

WriteFile gated prior-read enforcement on `fileExists` from
`isFilefileExists()`, but a file that sprang into existence between
that check and the write would still be overwritten without
enforcement — `fileExists === false` skipped the check entirely.

Made the gate unconditional on `fileExists`. checkPriorRead's own
`fs.stat` decides what to do:
- ENOENT → ok:true, fall through to the new-file path as before
- file exists right now (whether or not isFilefileExists saw it) →
  unknown / stale check runs, the race-created file is rejected.

Applied to both getConfirmationDetails and execute. The path that
actually creates new files is unchanged because checkPriorRead's
ENOENT branch is the disappearance-race exit, which is the correct
exit for "the file truly does not exist yet".

Reported by glm-5.1 via /review on PR #3774.

* fix(core): close 4 enforcement gaps + 1 critical bug from 5th Copilot review

Five issues raised by deepseek-v4-pro / glm-5.1 / qwen3.6-plus on
PR #3774. Listed by reviewer-assigned severity.

[Critical] (qwen3.6-plus) recordWrite previously only seeded the
read metadata for brand-new entries. The reproduction was real:
ReadFile(limit=10) → WriteFile(full content) → Edit. The partial
read's lastReadWasFull=false would persist through the write, and
the Edit would be rejected with EDIT_REQUIRES_PRIOR_READ even
though the model just authored every byte. recordWrite now
unconditionally refreshes lastReadAt, lastReadWasFull=true, and
lastReadCacheable=true. The fileReadCache.test.ts case that
previously asserted "preserves lastReadAt" is rewritten to assert
the new "refreshes lastReadAt to match the write" contract, and a
new "upgrades lastReadWasFull / lastReadCacheable after a full
write" regression test pins the reproduction reviewer described.

[Suggestion] (deepseek-v4-pro) Narrowed the non-regular-file
bypass in priorReadEnforcement from `!stats.isFile()` to
`stats.isDirectory()`. The earlier broad form covered FIFOs,
sockets, and devices that the model has no legitimate "read first"
recourse for and that can block readTextFile (FIFO) or
over-allocate (/dev/urandom). Those now flow through to
cache.check() and reject with the unread-file path before any I/O.

[Suggestion] (glm-5.1) Removed the `fileExists && ...` gate from
EditTool.calculateEdit, mirroring the f4ef75667 fix on WriteFile.
A file that springs into existence between isFilefileExists() and
the enforcement check is now caught here as well; ENOENT inside
checkPriorRead remains the disappearance-race exit and new-file
creation flow is unchanged.

[Suggestion] (deepseek-v4-pro) Added debugLogger.warn() at every
post-read TOCTOU rejection site (Edit calculateEdit, WriteFile
getConfirmationDetails, WriteFile execute). These rejections are
rare and self-healing — without a debug record, an operator
investigating "why did this Edit fail once?" had nothing to grep.
debugLogger uses dedicated 'EDIT_PRIOR_READ' / 'WRITE_FILE' tags.

[Suggestion] (qwen3.6-plus) Added a final pre-write checkPriorRead
in EditTool.execute() and WriteFileTool.execute(). The earlier
post-read check ran inside calculateEdit (Edit) or before mkdirSync
(WriteFile), but the actual writeTextFile call could be arbitrarily
later — user approval, modify-and-confirm, etc. The window from
"post-read check → writeTextFile" is now bounded to "pre-write
stat → writeTextFile" (two adjacent syscalls).

* fix(core): close new-file race + special-file enforcement loop

Three issues from the latest Copilot review on PR #3774.

1. New-file race in pre-write enforcement (write-file.ts:348,
   edit.ts:487). The earlier pre-write checkPriorRead was gated on
   `fileExists` (WriteFile) and `!editData.isNewFile` (Edit). If the
   path was absent at planning time and another process created it
   while approval was pending, the gated form would skip enforcement
   and silently overwrite a pre-existing file the model never read.
   Run unconditionally in both tools — checkPriorRead's own ENOENT
   branch is the disappearance-race exit, so genuine new-file
   creation is unaffected, but a race-created file now hits the
   `unknown` branch and is rejected as unread.

2. FIFO / socket / device sent the model into an enforcement loop
   (priorReadEnforcement.ts:220). After narrowing the
   non-regular-file bypass to directories only, FIFOs etc. fell
   through to cache.check, returned `unknown`, and produced a
   "use read_file first" message — but read_file rejects those same
   targets as "not a regular file", so the model would loop on
   read_file forever. Added a dedicated `!stats.isFile()` branch
   (after the directory exemption) that returns a "special file;
   cannot edit/overwrite via this tool — use shell instead" message,
   matching the shape of the existing non-text-payload guidance.

(Tool-error.ts and the non-cacheable policy notes are addressed in
the PR description update — not in code.)

* fix(core): close 4 enforcement gaps from 6th Copilot review

(Plus a doc-only update for the 5th — the mtime+size limitation
warning in the Risk section now mentions the silent-overwrite
escalation that this PR's mutation paths bring along.)

1. ENOENT after the model has already read the file is no longer
   silently treated as `ok: true`. Added an `expectExisting` option
   to `checkPriorRead`; post-read and pre-write callers pass `true`.
   ENOENT under that flag now rejects with `FILE_CHANGED_SINCE_READ`
   ("file disappeared after the model read it") rather than falling
   through to the new-file path with stale bytes. Pre-read callers
   keep the old default (ENOENT → ok:true → fall through to genuine
   new-file creation). EditTool's pre-write check derives the flag
   from `editData.isNewFile`; WriteFile's pre-write check derives it
   from the post-read `fileExists` value.

2. Directory targets now reject with `TARGET_IS_DIRECTORY` and a
   structured message instead of returning `ok: true`. The previous
   form fell through to readTextFile(), which on the WriteFile
   confirmation path threw a plain Error and was surfaced by the
   scheduler as `UNHANDLED_EXCEPTION`. Both Edit and WriteFile now
   emit a structured rejection at enforcement time. (WriteFile's
   build-time validateToolParamValues already rejects directories,
   so the change matters most for EditTool.)

3. Non-cacheable rejection's `rawMessage` no longer hard-codes
   "overwrite" — it now uses the same `verbBare` derivation as the
   `displayMessage`, so EditTool's path correctly says "if you need
   to edit it" and WriteFile's path stays "if you need to overwrite
   it". The previous form was confusing for in-place edits.

4. WriteFile.getConfirmationDetails now mirrors execute()'s
   ENOENT-to-new-file fallback: a file that disappears between
   isFilefileExists() and the readTextFile-for-diff call no longer
   throws a plain Error (which would surface as
   UNHANDLED_EXCEPTION) — it falls back to the brand-new-file diff
   so the user sees a clean confirmation rather than an unstructured
   crash.

Tests:
- New: `rejects an edit on a directory with TARGET_IS_DIRECTORY`
- New: `confirmation falls back to a new-file diff when the file
  disappears mid-flight` (WriteFile)
- Updated: non-cacheable rejection asserts `verbBare` is "edit" on
  the EditTool path and "overwrite" on the WriteFile path.

Reported by Copilot via /review on PR #3774.

* docs(core): clarify stat→write race + EDIT_REQUIRES_PRIOR_READ scope

Three doc-only follow-ups from Copilot's latest review pass on PR
#3774. None change behaviour; the pre-fix code state was already
the actual contract — the docs just lagged it.

1. EDIT_REQUIRES_PRIOR_READ enum comment now lists the three cases
   the code actually returns it for (never-read, partial / ranged /
   non-cacheable read, structural dead end — non-text payload or
   special file). The previous one-liner described only the first
   case and would mislead future maintainers.

2. The Final pre-write freshness check blocks in EditTool.execute
   and WriteFileTool.execute now spell out that they DO NOT
   eliminate the stat → writeTextFile race. The window narrows
   from the previously-unbounded post-read-to-write gap down to
   two adjacent syscalls, but a concurrent writer landing in
   that pair can still be clobbered. Closing the residual would
   require an atomic write (write-to-temp + rename) or a
   content-hash post-write recheck — both deferred. Operators who
   need strict protection set `fileReadCacheDisabled: true` and
   rely on application-level locking.

3. PR description Risk section gains a "Known unmitigated: stat →
   write race window" subsection (English + Chinese mirror)
   matching the code comments.
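The deferred atomic-write option mentioned in (2) would look roughly like this sketch. `atomicWrite` is hypothetical, not code from the PR; POSIX `rename(2)` is atomic within one filesystem, which removes torn writes but still does not by itself detect a concurrent writer:

```typescript
import { promises as fs } from 'node:fs';
import { randomUUID } from 'node:crypto';
import * as path from 'node:path';

// Write to a temp file in the same directory, then rename over the
// target. Same-directory placement keeps the rename on one filesystem.
async function atomicWrite(target: string, content: string): Promise<void> {
  const tmp = path.join(path.dirname(target), `.${randomUUID()}.tmp`);
  await fs.writeFile(tmp, content, 'utf8');
  try {
    await fs.rename(tmp, target);
  } catch (err) {
    await fs.unlink(tmp).catch(() => {}); // best-effort cleanup
    throw err;
  }
}
```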

* chore(core): minor follow-ups from review #4229917446

Three of the five MINOR items raised in the independent code review
on 2026-05-05 — the cheap, isolated ones. The other two (race-
simulating integration test, moving StructuredToolError out of
priorReadEnforcement.ts) are deferred as the reviewer suggested.

1. EditTool now has a symmetric `PRIOR_READ_VERIFICATION_FAILED`
   regression test (mocks fs.promises.stat to reject with EACCES,
   asserts the EditTool path produces the same fail-closed result
   that the existing WriteFile EACCES test pins). Five-line fix to
   close the asymmetry that, while harmless today (the helper is
   shared), would let a future Edit-side change to checkPriorRead
   slip through without test coverage.

2. ensureParentDirectoriesExist / mkdirSync now run AFTER the
   pre-write checkPriorRead in both EditTool.execute() and
   WriteFileTool.execute(). Doing it before would leak intermediate
   directories on the rejection path — a real (if minor) FS litter
   the previous order created on every rejected new-file write.

3. EDIT_REQUIRES_PRIOR_READ enum docstring gains a one-line note
   for operators routing alerts on this code: a single
   `edit_requires_prior_read` signal can mean any of the three
   cases (no read / partial read / structural dead-end), and if
   per-cause monitoring becomes important the enum can be split
   in a follow-up. The originating tool name and the message text
   already disambiguate at runtime.

* fix(core): close 2 correctness gaps from maintainer review #4232751470

Both trace back to the cache's "track most recent read shape"
model diverging from prior-read enforcement's "model has seen
these bytes" model.

1. SVG (and similar string-content fallbacks) recorded as
   non-cacheable, blocking subsequent Edit / WriteFile.

   `read-file.ts` derives `cacheable` from
   `originalLineCount !== undefined && !isTruncated`. The SVG
   branch in `fileUtils.ts` returned content without
   `originalLineCount`, so `cacheable` collapsed to false and a
   follow-up Edit hit the dead-end "non-text payload — use shell"
   rejection — telling the model to use shell to mutate a file it
   had just successfully read as text. This was a real regression
   vs pre-PR behaviour where SVG-as-text editing worked.

   Fix: SVG-as-text branch now sets `originalLineCount` (split
   on '\n') and `isTruncated: false`, so ReadFile records it as
   a full cacheable read. The binary-fallback string and
   over-1MB SVG branches are deliberately left non-cacheable —
   they return placeholder strings ("Cannot display content of
   ...") rather than file content, so blocking edits there is
   correct. New regression test in `read-file.test.ts`:
   `records SVG-as-text reads with cacheable=true so a follow-up
   Edit passes enforcement`.

2. recordRead unconditionally overwriting lastReadWasFull /
   lastReadCacheable, revoking prior write-author or full-read
   rights.

   The `WriteFile(create) → ReadFile(offset/limit) → Edit`
   sequence rejected the Edit because the partial read clobbered
   the `lastReadWasFull = true` that `recordWrite` had stamped
   at create time. Same shape applies to a full text read
   followed by a partial one of the same inode.

   Fix: `recordRead` is now sticky-on-true for the read flags —
   `if (opts.full) entry.lastReadWasFull = true;` and the
   matching guard for `cacheable`. Prior `true` survives a later
   partial / non-cacheable read. The fast-path `file_unchanged`
   check still gates on the incoming request's own `isFullRead`
   in `read-file.ts`, so a partial read still does not get a
   placeholder it shouldn't. Updated the existing
   "overwrites earlier lastReadWasFull" test to assert the new
   sticky semantics, and added a `lastReadCacheable` symmetric
   test plus a `Write → partial-Read → Edit` end-to-end test in
   `edit.test.ts`.

Reported by tanzhenxin via independent maintainer review on
2026-05-06.
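Fix (2)'s sticky-on-true flags, as a minimal sketch over an illustrative entry shape (the next round narrows this to the no-drift case):

```typescript
interface ReadEntry {
  lastReadWasFull: boolean;
  lastReadCacheable: boolean;
}

function recordRead(
  entry: ReadEntry,
  opts: { full: boolean; cacheable: boolean },
): void {
  // false never overwrites true: a later partial or non-cacheable
  // read does not revoke rights stamped by an earlier full read or
  // by write authorship.
  if (opts.full) entry.lastReadWasFull = true;
  if (opts.cacheable) entry.lastReadCacheable = true;
}
```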

* fix(core): close 3 correctness gaps from re-review #4233904930

All three are tightenings of the prior `de8ddf530` round.

1. **Sticky-on-true narrowed to "no fingerprint drift"**.
   `fileReadCache.recordRead` previously kept `lastReadWasFull` /
   `lastReadCacheable` true across drifted recordings, which
   re-opened a `Read full @X → external write @Y → Read partial
   @Y → Edit` hole: the partial recordRead silently advanced the
   entry's mtime+size to Y while preserving the sticky `full=true`
   from X, so a follow-up Edit ran against bytes the model only
   saw the first 10 lines of. Now the sticky branch only fires
   when `(mtimeMs, sizeBytes)` matches the existing entry; on
   drift, both flags reset to exactly what this read produced.
   New regression test in `fileReadCache.test.ts` reproduces the
   reviewer's reported sequence.
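The narrowed rule — sticky only when the fingerprint matches — can be sketched like this. Entry shape and the `(mtimeMs, sizeBytes)` fingerprint follow the commit message; the function body is an assumption:

```typescript
// Illustrative sketch: sticky flags survive only when the
// (mtimeMs, sizeBytes) fingerprint matches; on drift, the flags reset
// to exactly what this read produced.
interface Entry {
  mtimeMs: number;
  sizeBytes: number;
  lastReadWasFull: boolean;
}

const entries = new Map<string, Entry>();

function recordRead(
  path: string,
  stat: { mtimeMs: number; sizeBytes: number },
  full: boolean,
): Entry {
  const prev = entries.get(path);
  const sameFingerprint =
    prev !== undefined &&
    prev.mtimeMs === stat.mtimeMs &&
    prev.sizeBytes === stat.sizeBytes;
  const entry: Entry = {
    mtimeMs: stat.mtimeMs,
    sizeBytes: stat.sizeBytes,
    // A partial read of changed bytes must not inherit full-read rights
    // granted against the old content.
    lastReadWasFull:
      full || (sameFingerprint && (prev?.lastReadWasFull ?? false)),
  };
  entries.set(path, entry);
  return entry;
}
```

Under this rule, `Read full @X → external write @Y → Read partial @Y` yields `lastReadWasFull = false`, closing the hole.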

2. **Subagent FileReadCache isolation now covers the
   inherits-model + same-approval-mode common case**. The
   own-property machinery from #3717 only triggers when an
   `Object.create(parent)` actually fires; both
   `agent.ts:990-993` (same-approval-mode) and
   `subagent-manager.ts:699-701` (inherits-model) had paths that
   returned the parent Config directly, so the subagent's
   `getFileReadCache()` resolved to the parent's instance — a
   parent Read could satisfy the subagent's Edit on a path the
   subagent's transcript never contained. Both sites now build
   a thin `Object.create(base)` override unconditionally; no
   method changes for the inherits / same-mode cases, but a
   distinct instance triggers the lazy-init in
   `Config.getFileReadCache()` so the subagent gets an isolated
   cache.
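The own-property isolation mechanism above can be sketched with plain objects. This is not the real `Config` API — `ConfigLike` and the free-function accessor are hypothetical stand-ins for the lazy-init pattern the commit describes:

```typescript
// The cache is lazily initialised as an OWN property, so an
// Object.create() child gets its own instance while still inheriting
// every other parent property through the prototype chain.
type Cache = Set<string>;

interface ConfigLike {
  model: string;
  cache?: Cache;
}

function getFileReadCache(cfg: ConfigLike): Cache {
  if (!Object.prototype.hasOwnProperty.call(cfg, 'cache')) {
    cfg.cache = new Set(); // own-property write, never on the prototype
  }
  return cfg.cache!;
}

const parent: ConfigLike = { model: 'qwen' };
getFileReadCache(parent).add('parent-read.ts');

// The fix: always build a thin Object.create(base) override instead of
// returning the parent directly, even when nothing else changes.
const subagent: ConfigLike = Object.create(parent);
```

Returning `parent` directly (the pre-fix same-approval-mode / inherits-model paths) would make both accessors resolve to the same Set.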

3. **Cache records the read pipeline's internal stat instead of
   a post-read re-stat**. `processSingleFileContent` now
   surfaces its internal stat via `result.stats`, and read-file
   uses that for `recordRead` instead of running its own stat
   after the read returns. Pre-fix, an external write between
   the pipeline call and the post-read stat let the cache record
   fingerprint Y for content the model received at X — a
   subsequent Edit would pass enforcement against bytes the
   model never legitimately saw. The internal-stat-to-read
   window is still a few microseconds wide; that residue is the
   same content-hash territory acknowledged in the Risk section.
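The stat-timing fix can be sketched with an in-memory stand-in for the filesystem. `processSingleFileContent` and `result.stats` follow the commit message; `disk`, `statFile`, and `readBytes` are illustrative fakes:

```typescript
interface Stats { mtimeMs: number; sizeBytes: number }
interface ReadResult { content: string; stats: Stats }

// In-memory stand-in for the filesystem, for illustration only.
const disk = new Map<string, { content: string; mtimeMs: number }>();
const statFile = (p: string): Stats => {
  const f = disk.get(p)!;
  return { mtimeMs: f.mtimeMs, sizeBytes: f.content.length };
};
const readBytes = (p: string): string => disk.get(p)!.content;

// The fix: the pipeline surfaces the stat it took for the bytes it
// actually read, and recordRead consumes result.stats instead of
// running a fresh stat after the read returns.
function processSingleFileContent(path: string): ReadResult {
  const stats = statFile(path); // fingerprint X ...
  const content = readBytes(path); // ... for these bytes
  return { content, stats };
}
```

An external write landing after the call returns can no longer pair fingerprint Y with content the model received at X.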

Reported by tanzhenxin via re-review on PR #3774.

* docs(core): clarify partial subagent isolation per review #4234090906

tanzhenxin's third review correctly observed that the
`Object.create(parent)` wrappers in `agent.ts:createApprovalModeOverride`
and `subagent-manager.ts:maybeOverrideContentGenerator` only isolate
the FileReadCache for code that consults `Config.getFileReadCache()`
directly. Bound `EditTool` / `WriteFileTool` instances were registered
against the parent's tool registry at initialise time, so tool
invocations still resolve `this.config` to the parent and reach the
parent's cache. `InProcessBackend.createPerAgentConfig` already does
the right thing (`override.createToolRegistry()` +
`copyDiscoveredToolsFrom(base.getToolRegistry())`); bringing that to
these two spawn sites is the real fix.

Reviewer's verdict was COMMENT, not REQUEST_CHANGES — the gap
pre-dates this PR (it's a property of #3717's per-Config own-property
machinery) and pre-PR there was no enforcement on subagent mutations
at all, so the PR is strictly an improvement on every spawn path.
Documented the partial guarantee explicitly:

- Inline comments on both spawn sites note the bound-tool caveat
  and point at `InProcessBackend.createPerAgentConfig` as the model
  for the follow-up.
- PR description's subagent paragraph (English + Chinese mirror) now
  splits into "fully isolated" (`InProcessBackend.createPerAgentConfig`)
  and "partial isolation" (the two sites in this PR) so readers don't
  walk away with the wrong contract.

Filing the registry-rebuild work as a follow-up; not in this PR.
2026-05-06 17:17:48 +08:00
wenshao
ee460de97b docs(attribution): align cleanup-branch comments with noteCommitWithoutClearing
Three doc/test-fixture stale-after-refactor cleanups (Copilot
4MDx / 4MEI / 4MEa):

- shell.ts:1944 (around the stagedInfo === null branch): the comment
  still claimed the finally block "falls back to a full clear", but
  1ece87438 switched analysis-failure cleanup to
  noteCommitWithoutClearing(). Update the comment so the reasoning
  matches what the code actually does (and so a future reader doesn't
  reintroduce the wholesale clear thinking it's already there).

- shell.ts: getCommittedFileInfo docstring carried the same stale
  "full clear" claim for the `null` return value. Update to describe
  the noteCommitWithoutClearing() fallback and the smaller-evil
  trade-off for the just-committed file.

- chatRecordingService.test.ts: baseSnapshot fixture for the
  recordAttributionSnapshot tests still carried `baselines: {}`,
  even though that field was removed from AttributionSnapshot in
  296fb55ae's dead-code purge. Structural typing let it compile,
  but the fixture didn't reflect the production shape — drop it.
2026-05-06 15:40:36 +08:00
wenshao
acd06e3461 Merge remote-tracking branch 'origin/main' into feat/commit-attribution
# Conflicts:
#	packages/core/src/core/client.ts
2026-05-06 15:02:52 +08:00
tanzhenxin
8a1ed565a6
fix(core): auto-compact subagent context to prevent overflow (#3735)
* fix(core): auto-compact subagent context to prevent overflow

Subagent chats accumulated history without ever compacting, so a long
multi-turn run could hit "max context length exceeded" before the
compaction logic the main session uses had a chance to fire.

Move compaction down into the chat layer so both main agent and subagent
auto-compress at the configured threshold, and surface the result via a
new chat stream event that bridges into the existing ChatCompressed UI
path. The main-session wrapper still owns full /compress reset.

Closes #3664

* fix(core): address subagent compaction review feedback

Code fixes:
- Seed `lastPromptTokenCount` on subagent chats so the first-send
  threshold gate sees the inherited history's true size.
- Add `COMPRESSION_FAILED_TOKEN_COUNT_ERROR` to the fail-latch chain
  so token-counting failures stop retrying compression every send.
- Restore `FileReadCache.clear()` after compaction in both the manual
  /compress wrapper and the auto-compaction path inside GeminiChat,
  preventing post-summary `file_unchanged` placeholders from pointing
  at content the model can no longer retrieve.
- Refresh stale comment on the `compressed → ChatCompressed` bridge
  in turn.ts now that this path is the primary route, not a fallback.

Tests:
- turn.test.ts asserts the compressed → ChatCompressed bridge.
- geminiChat.test.ts asserts COMPRESSED yields as the first stream
  event after auto-compaction succeeds.
- chatCompressionService.test.ts bumps originalTokenCount above the
  cheap-gate so the NOOP test exercises findCompressSplitPoint.
- client.test.ts asserts forceFullIdeContext flips when a
  ChatCompressed event flows through sendMessageStream's loop.

* chore(core): drop redundant StreamEvent cast and document auto-compaction trade-off

- Remove the `as StreamEvent` cast at the COMPRESSED yield site — the
  literal already matches the union member.
- Add a 4-line comment at the auto-compaction setHistory point that
  points readers to GeminiClient for the env-refresh trade-off rationale,
  so readers don't have to chase the layering decision back across files.
2026-05-06 14:09:07 +08:00
tanzhenxin
808d0978eb
feat(cli): route foreground subagents through pill+dialog while running (#3768)
* feat(cli): route foreground subagents through pill+dialog while running

Foreground (synchronous) subagents currently render a live AgentExecutionDisplay
inside the parent's pendingHistoryItems block. The frame mutates on every tool
call and approval; once it grows past the terminal height (verbose mode, parallel
subagents, long tool-call lists) the live-area repaint flickers visibly.

This change extends BackgroundTaskRegistry with a flavor: 'foreground' | 'background'
discriminator. Foreground entries register at the start of the synchronous tool-call
and unregister in its finally path. The pill counts them; the dialog drills into
their activity. The inline frame is suppressed during the live phase — only an
active, focus-locked approval prompt renders, as a small banner labeled with the
originating agent. Once the parent turn commits, the full AgentExecutionDisplay
appears in scrollback via Ink's <Static>, exactly as before.

Foreground entries skip the XML task-notification (the parent receives the result
through the normal tool-result channel) and skip the headless holdback (the
parent's await already pins the loop). The dialog gates per-agent cancellation
behind a two-step confirm so a stray 'x' can't end the user's current turn.

* fix(cli): address review findings on foreground subagent routing

- Gate `registerCallback` on background flavor so foreground entries
  don't leak orphaned `task_started` SDK events without a matching
  terminal notification.
- Render a queued-approval marker for non-focus subagents instead of
  returning null, so a queued approval is visible in the main view.
- Move `emitStatusChange` before `agents.delete` in
  `unregisterForeground` to match the ordering used by
  complete/fail/cancel/finalize.
- Prefix the foreground tool result with a cancel marker when
  `terminateMode === CANCELLED`, so the parent model can distinguish
  a user-cancelled run from a successful completion.
- Mirror the background path's stats wiring on the foreground path
  so `entry.stats` stays current and the dialog detail subtitle
  shows tool count + tokens for foreground runs.
- Remove the unreachable `isWaitingForOtherApproval` branch (subsumed
  by the queued-approval marker above).
- Reset the foreground confirm-step on detail-mode `left` and ignore
  `x` on terminal entries so an armed cancel can't carry into list
  mode and the hint footer/handler stay in sync.
- Test factory uses a `baseProps` spread instead of `as` cast so a
  future required field on `ToolMessageProps` is a compile-time miss.
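The last bullet's spread-vs-cast distinction can be shown in a few lines. `ToolMessageProps` here is a reduced illustrative shape, not the real component props:

```typescript
// With a typed baseProps spread, adding a required field to the props
// type breaks the factory at compile time; `{} as ToolMessageProps`
// would silently satisfy the checker and fail only at runtime.
interface ToolMessageProps {
  name: string;
  status: 'pending' | 'done';
  // a future required field added here is a compile-time miss below
}

const baseProps: ToolMessageProps = { name: 'read_file', status: 'pending' };

function makeProps(
  overrides: Partial<ToolMessageProps> = {},
): ToolMessageProps {
  return { ...baseProps, ...overrides };
}
```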
2026-05-06 14:08:12 +08:00
易良
fe1fb31557
fix(cli): prevent file paths from being treated as slash commands (#3743)
* fix(cli): prevent file paths from being treated as slash commands (#1804)

When users input file paths starting with '/' (e.g. '/api/apiFunction/...',
'/Users/name/path'), they were incorrectly parsed as slash commands, resulting
in "Unknown command" errors. The input was discarded instead of being sent to
the model for processing.

Root cause: isSlashCommand() only checked for a '/' prefix without validating
whether the first token actually looks like a command name. Any '/' prefix
triggered the slash command flow, and when no matching command was found, the
error was shown with no fallback.

Fix: Add looksLikeCommandName() that validates command names contain only
[a-zA-Z0-9:_-]. Both isSlashCommand() and handleSlashCommand() now check the
first token — if it contains path separators, dots, or non-ASCII characters,
the input falls through to normal model processing instead of the command
dispatcher.

Closes #1804

* fix(cli): allow dots in command names and fix prettier formatting

Address review feedback:
- Allow '.' in looksLikeCommandName() regex to support extension-qualified
  commands like gcp.deploy (CommandService renames conflicts as ext.cmd)
- Add regression tests for dot-named commands in both commandUtils and
  slashCommandProcessor
- Fix prettier formatting in slashCommandProcessor test file
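The validation described above can be sketched as follows. The character class reflects the post-review state (dots allowed for extension-qualified names); later commits in this PR tighten it further (shell metacharacters), so treat this as a snapshot, not the shipped regex:

```typescript
// Command names contain only [a-zA-Z0-9:_.-]; anything with path
// separators or non-ASCII falls through to normal model processing.
function looksLikeCommandName(token: string): boolean {
  return /^[a-zA-Z0-9:_.-]+$/.test(token);
}

function isSlashCommand(input: string): boolean {
  if (!input.startsWith('/')) return false;
  const firstToken = input.slice(1).split(/\s+/, 1)[0];
  // '/Users/name/path' keeps its '/' in the first token and is
  // rejected here, so it reaches the model instead of the dispatcher.
  return looksLikeCommandName(firstToken);
}
```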

* fix(cli): handle slash command review edge cases

* docs(cli): align slash command validation comment

* fix(cli): preserve slash prompt ordering

* fix(cli): reject shell-metacharacter slash tokens

* test(cli): align slash command action mocks

* fix(cli): track model-sent user turns

* fix(cli): narrow slash path handling scope

* fix(cli): count slash path prompts in history

* test(cli): type slash command action mocks
2026-05-06 14:03:08 +08:00
jinye
2bd4aa1b6d
fix(sdk-python): standardize TAG_PREFIX to include v suffix (#3832)
Align the Python SDK's TAG_PREFIX with the TypeScript SDK convention by
changing it from 'sdk-python-' to 'sdk-python-v'. This removes the need
for callers to manually inject the `v` when composing git tags, eliminating
an asymmetry that could lead to doubled or missing `v` prefixes when code
is copied between SDK release helpers.

The final tag format (sdk-python-v0.1.0) is unchanged.

Closes #3793

🤖 Generated with [Qwen Code](https://github.com/QwenLM/qwen-code)

Co-authored-by: jinye.djy <jinye.djy@alibaba-inc.com>
2026-05-06 11:14:02 +08:00
wenshao
715c258fb2 fix(attribution): roll back snapshot dedup key on sync appendRecord failure
1UMh (Copilot): appendRecord can throw synchronously before returning
a promise — e.g. when ensureConversationFile() rethrows a non-EEXIST
writeFileSync error. The async .catch() handler attached to the
promise never runs in that case, so the optimistic dedup-key set
sticks on a write that never landed and permanently suppresses
identical retries. Roll back lastAttributionSnapshotJson in the outer
catch too. Regression test forces writeFileSync to throw EACCES on
the first invocation, then asserts the second identical snapshot
attempt fires a fresh write rather than getting deduped.
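The two rollback paths — async rejection and synchronous throw — can be sketched like this. `lastAttributionSnapshotJson` and `appendRecord` follow the commit message; the wrapper body is illustrative:

```typescript
let lastAttributionSnapshotJson: string | null = null;

function recordSnapshot(
  json: string,
  appendRecord: (json: string) => Promise<void>,
): void {
  if (json === lastAttributionSnapshotJson) return; // dedup
  const prev = lastAttributionSnapshotJson;
  lastAttributionSnapshotJson = json; // optimistic set
  try {
    appendRecord(json).catch(() => {
      lastAttributionSnapshotJson = prev; // async-failure rollback
    });
  } catch {
    // appendRecord threw before a promise existed — the .catch()
    // above never runs, so roll back here too (the fix).
    lastAttributionSnapshotJson = prev;
  }
}
```

Without the outer catch, a synchronous EACCES would leave the dedup key set and suppress every identical retry.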
2026-05-06 10:53:43 +08:00
Shaojin Wen
d44786685c
feat(core,cli): surface and cancel auto-memory dream tasks (#3836)
* feat(cli): surface auto-memory dream tasks in Background tasks dialog

Adds `dream` as a fourth kind in the Background tasks pill + dialog,
alongside agent / shell / monitor. Subscribes to MemoryManager via
its existing subscribe() / listTasksByType() API and adapts
MemoryTaskRecord into a DreamDialogEntry view-model. Zero changes
to the core package.

Filters out `pending` (sub-second transition) and `skipped` (every
UserQuery that misses the gate creates one) records. Caps retained
terminal entries at 3 since `MemoryManager.tasks` has no eviction
path; without the cap, completed dreams would accumulate over the
project's lifetime (mirrors MonitorRegistry's terminal cap pattern).

Extract tasks are intentionally NOT surfaced — they fire on every
UserQuery, would flood the pill, and the `memory_saved` toast in
useGeminiStream already covers their completion signal.

Read-only for now: cancellation requires MemoryManager.cancelTask
+ task_stop integration which lands in a follow-up PR. The dialog
suppresses the "x stop" hint for dream entries until then to avoid
silent no-op keystrokes.

Refs #3634

* docs(cli): rephrase dream filter comment to focus on extract vs dream

The earlier comment compared the design to a non-qwen-code product;
restate the rationale in terms of the local extract / dream split
(extract fires every UserQuery and surfaces via memory_saved toast,
dream fires after gates and warrants pill / dialog visibility).

* feat(core,cli): cancel dream consolidation tasks via dialog and task_stop

Wires cancellation for the auto-memory dream task kind:

- `MemoryManager.cancelTask(taskId)` — aborts the dream's fork-agent
  via a new per-task AbortController, marks the record `cancelled`
  before aborting so the runDream catch path can detect user-intent
  and avoid overwriting with a generic `failed`. The existing
  finally block releases the consolidation lock as the agent unwinds.
- `MemoryManager.getTask(id)` — point lookup helper so cross-cutting
  consumers like `task_stop` can route by id without a project root.
- AbortSignal threaded through `scheduleDream` → `runDream` →
  `runManagedAutoMemoryDream` → `planManagedAutoMemoryDreamByAgent`
  → `runForkedAgent.abortSignal` (already supported).
- `task_stop` tool gets a 4th dispatch branch: dream task ids look
  up via MemoryManager and route through `cancelTask`. Extract is
  intentionally NOT cancellable — it runs synchronously on the
  request loop, cancelling would interfere with the user's own turn.
- `BackgroundTasksDialog` x-stop hint suppression for dream is removed
  (was a PR-1 placeholder); `cancelSelected` dream branch now calls
  `memoryManager.cancelTask`.
- `MemoryTaskStatus` gains `'cancelled'`. Dream view-model widens
  status union and filter to surface cancelled entries (terminal
  cap continues to apply).

Refs #3634

* fix(core,cli): address review feedback on dream cancellation surface

- DreamDetailBody: comment said cancellation "lands in PR-2" but the
  same PR wires `cancelSelected` for dream entries. Reword to describe
  what's actually shipped + flag in-flight progress as the real
  follow-up.
- task_stop: drop the unreachable `!aborted` error branch. The status
  guard above already confirms `running`, and `cancelTask` is
  synchronous; in this branch it cannot return false.

* fix(core): handle resolved-cancel path from runForkedAgent for dream tasks

runForkedAgent maps AgentTerminateMode.CANCELLED to a resolved
{status: 'cancelled'} result rather than rejecting. The cancel-via-
task_stop path landed in the previous commit assumed the call would
throw — when it didn't, the runDream success path overwrote the
user-cancelled record with 'completed' AND bumped lastDreamAt
metadata, suppressing the next legitimate dream cycle.

Two-layer defense:

- dreamAgentPlanner now rethrows when the fork agent reports
  cancelled status (mirrors the existing failed-status throw).
  This is the source-of-truth fix.

- runDream now checks abortSignal.aborted after the await as
  defense in depth. If anything in the call chain ever forgets to
  propagate, this guard short-circuits the success path before
  metadata write.

Updates the existing dreamAgentPlanner test that previously pinned
the buggy "returns cancelled without throwing" behavior. Adds a
manager test that simulates the resolves-on-abort scenario directly
to verify the consumer-side guard catches anything the planner
might miss.

* fix(core,cli): address review feedback on dream UX, perf, and comment clarity

Three fixes from the post-cancellation-PR review:

- scheduleDream now sets an initial `progressText` ("Scheduled
  managed auto-memory dream.") on the in-flight record. Without
  this, the dialog Detail's Progress section stayed empty until
  completion — the PR description's mid-flight screenshot showed
  text that production never actually rendered.

- useBackgroundTaskView gates the MemoryManager.subscribe listener
  on a dream-content signature. The manager fires for every task
  transition (extract included, ~2x per UserQuery), but the dialog
  has no extract surface; without this dedup each extract notify
  forced a full 4-source re-merge + a fresh setEntries reference,
  re-rendering the dialog and pill on entries that hadn't changed.
  Added a test that pins the reference-stability invariant.

- task_stop comment was misleading — it said "the status guard above
  already confirmed running", but for the dream branch the running
  check happens IN this branch (not earlier). Reworded.

* fix(core,cli): tighten cancelTask contract, dedup dream snapshot reads, sync test helper

Four fixes from the latest review pass:

- cancelTask() now enforces the AbortController invariant. The old
  `ac?.abort()` returned `true` even when no controller was found,
  meaning callers could see a successful return while the dream was
  not actually aborted (and would leak the consolidation lock until
  the agent finished naturally). The controller is registered
  synchronously alongside `status='running'`, so a missing controller
  for a running record is a contract violation — return false without
  flipping status so the caller knows the abort didn't take.

- useBackgroundTaskView's `refresh()` now reuses the dream snapshot
  the memory listener fetched for its dedup gate. The previous version
  re-read `listTasksByType('dream')` inside `computeDreamSig()` and
  again inside `refresh()` — extra work AND a race window where the
  gate signature could come from a different snapshot than the one
  used to build dreamEntries. Single read, single source of truth.

- The `dream()` test helper widened to include `'cancelled'` so it
  matches the production `MemoryTaskStatus` union. Added a small test
  asserting cancelled dreams flow through the kind discriminator (the
  dialog's terminal-cap window depends on showing the user the
  outcome of the abort they just triggered).

- Dropped the stale "PR-1 read-only / PR-2 cancellation" block
  comment above the dream-entries describe block — both are in this
  PR now.
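The tightened `cancelTask` contract from the first bullet can be sketched as follows. Record and registry shapes are illustrative, not the real `MemoryManager` internals:

```typescript
type Status = 'running' | 'completed' | 'cancelled' | 'failed';
interface TaskRecord { id: string; status: Status }

const tasks = new Map<string, TaskRecord>();
const controllers = new Map<string, AbortController>();

function cancelTask(id: string): boolean {
  const rec = tasks.get(id);
  if (!rec || rec.status !== 'running') return false;
  const ac = controllers.get(id);
  if (!ac) {
    // Contract violation: the controller is registered synchronously
    // alongside status='running', so a missing one means the abort
    // cannot be delivered. Don't flip status — the caller must know
    // the abort didn't take.
    return false;
  }
  // Mark cancelled BEFORE aborting so the runDream catch path can
  // detect user-intent and avoid overwriting with a generic 'failed'.
  rec.status = 'cancelled';
  ac.abort();
  return true;
}
```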

* fix(core,cli): address self-review + Copilot feedback on dream surface

Combined fixes for the latest review pass — self-review notes (U1, U2,
U5, U6) plus Copilot's three new comments (C1, C2, C3):

- **Footer dream-indicator dedupe (U1)**: removed `useDreamRunning` +
  `✦ dreaming` right-column text. The new Background tasks pill
  already counts dream tasks alongside agent / shell / monitor;
  showing both produced two simultaneous signals for the same state.

- **task_stop dispatch surfaces extract distinctly (U2 + C1)**: a
  new `TASK_STOP_NOT_CANCELLABLE` error type fires when the task id
  resolves to a known-but-not-cancellable record (extract). Previously
  extract ids fell through to `NOT_FOUND`, misleading the model into
  thinking the id was never valid. Also surfaces the missing-controller
  case from `cancelTask` as an explicit error rather than reporting
  phantom success.

- **MemoryManager.cancelTask logs missing-controller violation (C2)**:
  the silent `return false` for the missing-AbortController case now
  emits a `debugLogger.warn` so the inconsistency is observable in
  debug bundles. Without the log a runaway dream burning tokens would
  leave no trail.

- **MemoryManager.subscribe taskType filter (C3)**: subscribers can
  now opt into per-type notify routing via `subscribe(fn, { taskType })`.
  Internal `notify()` calls pass the changed task's type so filtered
  consumers wake only on relevant transitions. The bg-tasks UI hook
  uses this to skip the per-UserQuery extract notify entirely — drops
  the per-extract O(n) signature work to zero.

- **runDream guards against late-cancel overwrite (U5)**: the
  success path now re-checks `abortSignal.aborted` between metadata
  read/write and before the final `update({status: 'completed'})`.
  Closes the ~tens-of-ms race window where `cancelTask` flipping
  status to `'cancelled'` would silently lose to the success
  continuation overwriting with `'completed'` + bumping
  `lastDreamAt`.

- **DreamDialogEntry.endTime semantic comment (U6)**: documents that
  `endTime` for cancelled records is the cancel-call moment (not the
  fork unwind), so a future maintainer doesn't treat it as a real
  fork-finish timestamp.

Tests: new `subscribe() taskType filter` describe block in manager,
new task_stop tests for `NOT_CANCELLABLE` (extract) and missing-
AbortController (dream); existing test renamed/widened.

* fix(core,cli): tighten error semantics + comments on dream surface

Four fixes from the latest review pass:

- New `TASK_STOP_INTERNAL_ERROR` error type for the missing-
  AbortController contract violation. Previously the dispatcher
  reused `TASK_STOP_NOT_RUNNING`, which is misleading — the task IS
  running, cancellation just couldn't be delivered. Distinct type
  signals "this is unexpected, file a bug" vs `NOT_CANCELLABLE`
  which signals "expected behavior, use a different approach".

- Reworded the `useBackgroundTaskView` filter comment. It said "every
  UserQuery that misses the gate creates one [skipped record]", but
  `scheduleDream` returns `{status: 'skipped'}` early without
  creating a record for most gate misses; only the
  acquireDreamLock/EEXIST race actually stores a `'skipped'` record.

- Strengthened the `subscribe() unsubscribe` test. The previous
  version asserted "not called yet" without firing any notify after
  unsubscribe, so a regression that left the listener attached
  would still pass. Now schedules an extract before AND after the
  unsubscribe, verifying the call count doesn't increment.

- Moved `const debugLogger = createDebugLogger(...)` below the
  full import block in manager.ts. Previous version sat between
  imports, violating eslint-plugin-import's `import/first` rule
  (didn't trip lint locally, but worth fixing before it does).

* fix(core): skip scheduleDream early when params.config is missing

`ScheduleDreamParams.config` is optional in the type so test paths
can omit it, but production callers always pass one. Without a
config, `runManagedAutoMemoryDream` throws because the fork-agent
execution requires it. With dream tasks now visible in the
Background tasks dialog, that throw becomes a noisy `failed` entry
the user sees but didn't trigger.

Convert the omitted-config case to the same `disabled` skip path
that an explicitly-disabled config takes, so a no-config call
short-circuits before any record is stored. Existing tests that
relied on the old "no config = proceed past the disabled gate"
behavior now pass an explicit `makeMockConfig()` (matching what
they would do in any realistic scenario).

New test pins the no-config skip behavior + asserts no record
was stored (so a regression that drops the early skip would
produce a visible failed entry in the dialog and fail the test).

* fix(core): plug subscribe Map leak + dream.ts late-cancel ordering bug

Two fixes from the latest review pass:

- `MemoryManager.subscribe`'s typed-branch unsubscribe deleted the
  listener from its per-type Set but left the empty Set sitting in
  `subscribersByType`. Over a long-running session with repeated
  React mount/unmount of the bg-tasks view, that accumulates dead
  Map entries forever. Drop the entry when the bucket goes empty.

- `runManagedAutoMemoryDream` writes metadata after the fork agent
  returns (`bumpMetadata` → `rebuildManagedAutoMemoryIndex` →
  `updateDreamMetadataResult`). If the user presses 'x' between the
  fork's success return and these writes, the writes proceed and
  bump `lastDreamAt` — leaving the visible UI ('Stopped') disagreeing
  with the scheduler gate (sees a recent successful dream and
  suppresses the next cycle). manager.ts already short-circuits its
  own metadata write via the post-await abort check, but it can't
  block writes that already happened inside dream.ts. Adds the same
  abort-signal check between each write step here.

* fix(core): swallow releaseDreamLock errors so they don't poison outcome

If `releaseDreamLock` throws inside the inner finally (e.g. filesystem
error on the lock file), the exception propagates to the outer catch
and overwrites a successfully-completed dream record with 'failed'.
The on-disk metadata is already up-to-date at that point, so the user
sees a contradictory state — `lastDreamAt` was bumped but the UI
shows a failure.

Wrap the release in a try/catch with `debugLogger.warn`. The lock
file is still cleaned up on the next session via the existing
staleness sweep, so swallowing the release error doesn't risk a
permanent stuck lock.
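The guarded-finally shape can be sketched synchronously; `runDream` / `releaseDreamLock` names follow the commit, the bodies are illustrative:

```typescript
// A throw from the lock release in the finally path must not override
// the outcome of a successfully-completed dream.
function runDream(
  work: () => void,
  releaseDreamLock: () => void,
  warn: (msg: string) => void,
): 'completed' | 'failed' {
  try {
    work();
    return 'completed';
  } catch {
    return 'failed';
  } finally {
    try {
      releaseDreamLock();
    } catch (e) {
      // Best-effort: the staleness sweep at next session start cleans
      // up the leaked lock file, so swallowing here is safe.
      warn(String(e));
    }
  }
}
```

Without the inner try/catch, a release failure propagates out of the finally block and replaces the 'completed' return value.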

* fix(core): thread abortSignal into dream metadata writes

The pre-call abort checks in `runManagedAutoMemoryDream` close most
of the late-cancel race window, but each metadata helper itself does
read → mutate → write across two awaits. If the user cancels between
those two awaits the write still happens, persisting `lastDreamAt`
for an aborted run and suppressing the next legitimate dream cycle.

Thread `abortSignal` into `bumpMetadata` and
`updateDreamMetadataResult`; both now re-check between the read and
the write, returning early without persisting when the signal has
already fired. The pre-call checks remain as the first line of
defense; this guards the race that opens after the call enters.

* fix(core): close dream cancel race + surface lock-release failures

Two related fixes from the latest review pass:

- Move scheduler-gating metadata writes out of `runManagedAutoMemoryDream`
  and into `MemoryManager.runDream`, sequenced AFTER the
  status='completed' flip. The previous shape left a race window
  where cancellation arriving during/after `fs.writeFile` could
  persist `lastDreamAt` while the manager flipped status to
  'cancelled' — visible UI ('Stopped') would disagree with the
  scheduler gate (sees a recent successful dream), suppressing the
  next legitimate dream cycle. The new order makes the gating
  metadata write race-free: once status !== 'running', cancelTask
  refuses, so any cancel arriving during the metadata write is
  ignored. The remaining "cancel raced the synchronous status
  update" window is handled by a post-update abort recheck that
  restores 'cancelled' and skips the metadata write.

  Drops the now-dead `bumpMetadata` helper from dream.ts; index
  rebuild stays there since it's informational, not gating.

- Surface `releaseDreamLock` failures on the task record's metadata
  (`lockReleaseError`). The previous fix logged-and-swallowed only,
  so a Windows EPERM or ENOENT race would silently block subsequent
  dreams as 'locked' with no UI signal explaining why dreaming had
  stopped. Logger keeps emitting the warn for debug bundles.

* fix(core,cli): address 8 review findings on dream surface

Reverts the metadata-write behavior regression, plugs the
storeWith reentrancy hole, surfaces lock/metadata warnings in the
UI, and a handful of cleanups:

- Wrap gating-metadata read+write in try/catch (manager.ts). The
  PR moved metadata writes from dream.ts (best-effort, swallowed)
  to manager.ts (unguarded). A throw from readDreamMetadata /
  writeDreamMetadata now propagates to the outer catch and
  overwrites a successfully-completed dream with 'failed' — the
  dream actually did its work and touched files are visible. New
  catch logs + writes `metadataWriteError` on the record so the UI
  can explain why the next dream may re-fire sooner than expected.

- Register the AbortController BEFORE storeWith in scheduleDream.
  storeWith fires a notify; a subscriber synchronously calling
  cancelTask(record.id) would otherwise see status='running' but
  no controller, hitting the missing-controller defensive warn
  path and reporting a phantom failure on a brand-new dream.

- Surface `lockReleaseError` and `metadataWriteError` in the dream
  view-model and DreamDetailBody (rendered as warnings, not
  errors, so the terminal status stays Completed). Previous fix
  wrote them to record.metadata only — nothing in the cli read or
  rendered them, so users still had no UI signal.

- Preserve `result.touchedTopics` on the unreachable cancel-raced-
  status-update branch. If a future refactor introduces an await
  there, the restored cancelled record would otherwise drop the
  already-produced result; the UI would report a clean cancellation
  even though memory files were already modified.

- Add `'cancelled'` to MemoryDreamEvent status union and emit a
  cancelled telemetry event from the runDream catch path. Without
  this a cancelled dream is indistinguishable from one that never
  scheduled in the first place.

- Drop the dead abortSignal param from updateDreamMetadataResult
  in dream.ts (no caller passes it after the manager.ts move).

- Swap declaration order of `computeDreamSig` and `refresh` in
  useBackgroundTaskView.ts (TDZ-fragile against a future refactor
  that adds a synchronous refresh call between them).

- Update the unreachable cancel-raced-update branch comment to
  describe what's actually true ("defense-in-depth, unreachable
  today") instead of the confusing "cancelTask flipped" path.

- cancelSelected now checks the cancelTask return value and logs
  via debugLogger when false. Today this branch is unreachable
  thanks to the controller-register-before-storeWith fix above,
  but if a future refactor breaks the invariant the silent ignore
  would let the user think the cancel took effect.

- Mirror the manual /dream metadata path's `recentSessionIdsSinceDream = []`
  reset in the auto path — field is dead code today but keeping
  the two write sites in sync avoids surprises.

Telemetry metric `recordMemoryDreamMetrics` widened to accept the
new 'cancelled' status (downstream consumer in loggers.ts).
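The controller-register-before-storeWith ordering above can be sketched as follows. This is a minimal illustration with hypothetical names (`TaskStore`, `DreamRecord`), not the actual manager.ts code: the point is that the controller must be registered before the store's notify fires, because a subscriber may synchronously call `cancelTask` on the brand-new record.

```typescript
type DreamRecord = { id: string; status: 'running' | 'cancelled' };

class TaskStore {
  private controllers = new Map<string, AbortController>();
  private subscribers: Array<(r: DreamRecord) => void> = [];

  subscribe(fn: (r: DreamRecord) => void): void {
    this.subscribers.push(fn);
  }

  /** Register the controller FIRST, then store + notify. */
  scheduleDream(record: DreamRecord): AbortController {
    const controller = new AbortController();
    this.controllers.set(record.id, controller); // must precede notify
    for (const fn of this.subscribers) fn(record); // storeWith's notify
    return controller;
  }

  /** Returns false only when no controller was registered (the phantom-failure path). */
  cancelTask(id: string): boolean {
    const controller = this.controllers.get(id);
    if (!controller) return false;
    controller.abort();
    return true;
  }
}
```

With registration first, a subscriber cancelling synchronously from the notify callback finds the controller and succeeds instead of hitting the missing-controller warn path.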

* fix(memory): same-session recovery when releaseDreamLock throws

The prior fix surfaced lockReleaseError on the dialog so the user knows
the lock release failed (Windows EPERM, ENOENT race, disk full, etc.) —
but until next process start, dreamLockExists() still sees a fresh-mtime
lock owned by an alive PID (us!) and silently suppresses every
subsequent scheduleDream() call as `{status: 'skipped',
skippedReason: 'locked'}`. The user sees the warning AND zero further
dream activity, and the staleness sweep that would clean the leaked
lock only runs at session start.

Adds a `dreamLockReleaseFailed` flag set in the catch. The next
scheduleDream() force-cleans the leaked lock file via fs.rm({force: true})
before the existence check, so dream scheduling resumes within the same
session. Best-effort: if even the forced rm fails (truly unrecoverable
filesystem state), falls through to the existing 'locked' skip path.

This is an incremental improvement on top of b00ecdebd's UI-surface fix.
The two together give the full story: warning visible → automatic
recovery on next attempt.

* fix(memory): real cancelled-dream duration + clarify index-rebuild ordering

Two follow-up suggestions from review:

- **`duration_ms: 0` in cancelled-dream telemetry** (manager.ts):
  the user-cancel path emitted `MemoryDreamEvent` with
  `duration_ms: 0`, which would silently skew latency histograms / p95
  metrics by treating cancelled dreams as instant. Capture
  `dreamStartMs = Date.now()` at the top of `runDream` and emit the
  real elapsed time in the cancel branch.

- **Misleading index-rebuild comment** (dream.ts:75–84): the comment
  claimed the index rebuild "is still done before returning when
  topics were touched", but the code returns early on
  `abortSignal?.aborted` BEFORE the rebuild. Rewrote the comment to
  describe the actual cancel-aware ordering — abort returns partial
  result without rebuilding (rebuild is expensive; next dream cycle
  will rebuild against the latest files anyway), live path rebuilds
  only when topics changed.

22 / 22 manager tests pass; tsc clean.
2026-05-06 10:20:19 +08:00
wenshao
a53c7508d2 fix(attribution): harden restoreFromSnapshot against corrupt payloads
1KMY (Copilot): snapshot.surface was copied without type validation.
A corrupted/partially-written snapshot with a non-string surface
(e.g. {}, 42, null) would later be serialized into the git note as
"[object Object]" and used as a Map key downstream, breaking the
expected payload shape. Type-check and fall back to the current
client surface for any non-string (or empty-string) value.

1KLq (Copilot): per-field sanitiseCount enforced
`promptCount >= 0` and `promptCountAtLastCommit >= 0` independently,
but never the cross-field invariant. A snapshot with
promptCountAtLastCommit > promptCount would surface a negative
getPromptsSinceLastCommit() and propagate as a "(-N)-shotted"
trailer into PR text. Clamp atLastCommit to total on restore.

1KL_ (Copilot): when a snapshot carried both the symlinked and
canonical paths for the same file (a session straddling the
canonicalisation fix), `set(realpathOrSelf(k), ...)` overwrote the
first entry with the second, silently dropping the AI contribution
the first form had accumulated. Merge instead: sum aiContribution
and OR aiCreated when collapsing duplicate keys.

Tests cover all three branches: non-string surface fallback,
promptCount clamp, and duplicate-key merge.
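The three restore guards can be illustrated with the following sketch (field and helper names approximate, not the actual service code): duplicate canonical keys merge instead of overwrite, the at-last-commit counter is clamped to the total, and a non-string surface falls back to the live client surface.

```typescript
interface FileAttribution { aiContribution: number; aiCreated: boolean; }

function restoreFiles(
  entries: Array<[string, FileAttribution]>,
  realpathOrSelf: (p: string) => string,
): Map<string, FileAttribution> {
  const out = new Map<string, FileAttribution>();
  for (const [key, value] of entries) {
    const canonical = realpathOrSelf(key);
    const prior = out.get(canonical);
    // Merge when the symlinked and canonical forms collide, instead of
    // letting the second entry silently drop the first's contribution.
    out.set(canonical, prior
      ? { aiContribution: prior.aiContribution + value.aiContribution,
          aiCreated: prior.aiCreated || value.aiCreated }
      : value);
  }
  return out;
}

function restoreAtLastCommit(promptCount: number, atLastCommit: number): number {
  // Clamp so getPromptsSinceLastCommit() can never go negative.
  return Math.min(atLastCommit, promptCount);
}

function restoreSurface(surface: unknown, currentSurface: string): string {
  // Non-string (or empty-string) surfaces fall back to the current client surface.
  return typeof surface === 'string' && surface.length > 0 ? surface : currentSurface;
}
```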
2026-05-06 10:16:28 +08:00
wenshao
d429d90331 fix(attribution): toggle-off partial clear, normalizeGitCoAuthor type-check, terraform lockfile
0oAK (Copilot): the gitCoAuthor.commit toggle-off branch returned
before computing the committed file set, leaving the just-committed
files' tracked AI work in the singleton. Re-enabling the toggle and
committing the same file again would re-attribute earlier (already-
committed) AI edits to the new commit. Move the toggle gate AFTER
matchCommittedFiles so the finally block does a proper partial clear
of the just-committed files even when the note write is skipped.

0oAg (Copilot): normalizeGitCoAuthor copied value?.commit / value?.pr
without type-checking. settings.json is hand-editable; a stored
`{ commit: "false" }` reached runtime as a truthy string and behaved
as if attribution were enabled. Add a per-field bool coercion that
falls back to the schema default (true) for any non-boolean,
matching what the dialog and IDE schema already imply. Tests cover
the string / number / null cases.

0oAo (Copilot): v3→v4 shouldMigrate only special-cased versionless
legacy booleans — versionless files with invalid gitCoAuthor values
(`"off"`, `[]`, etc.) skipped the migration and the loader stamped
`$version: 4` over the bad value. Runtime normalization then
silently re-enabled attribution. Extend shouldMigrate to fire on ANY
versionless non-object value at general.gitCoAuthor; the existing
migrate() body's drop-and-warn path resets it. Already-object
shapes (hand-edited to v4) still skip cleanly. Tests added.

0oAt (Copilot): `.terraform.lock.hcl` got dropped from generated-file
exclusion when `.lock` was removed from the blanket extension list
in 3c0e3293b. It's a generated provider lockfile in the same class
as `package-lock.json` and dominates Terraform-repo commits. Re-add
to EXCLUDED_FILENAMES and add a regression test covering both
repo-root and module-nested locations.
2026-05-06 10:01:26 +08:00
wenshao
dd45e17201 fix(attribution): runGit null-on-failure, versionless v3→v4 migration
z54M (Copilot): runGit returned '' on both successful-empty-output
and silent failure, so a `--name-only` that errored mid-way through
the diff fan-out aliased to a real `--allow-empty` commit. The
empty-commit branch then preserved pending attributions, leaving
the just-committed file's tracked AI edit alive to re-attribute on
the next commit. Switch runGit to `Promise<string | null>`,
distinguishing exit code 0 (any output, including '') from non-zero
(null). The diff-stage fan-out and ancillary probes now treat null
as analysis failure and bail with `return null` instead of falling
into the empty-commit path.

z539 (Copilot): the v3→v4 `shouldMigrate` only fired on
`$version === 3`. A versionless settings file carrying the legacy
`general.gitCoAuthor: false` boolean would skip every migration
(gitCoAuthor isn't in V1_INDICATOR_KEYS — it post-dates V2), get
its `$version` normalized to 4 by the loader, and leave the
boolean in place. The settings dialog then reads the V4
`{commit, pr}` shape, sees missing keys, defaults both to true, and
silently overwrites the user's opt-out on the next save. Also fire
when `$version` is absent AND the value at `general.gitCoAuthor`
is a boolean. Tests cover the new path and confirm the existing
versioned/object-shape paths are untouched.
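The `Promise<string | null>` contract for runGit can be sketched like this (simplified — the real helper presumably threads cwd/env through differently): exit code 0 resolves to stdout even when it is `''`, any failure resolves to `null`, and callers treat `null` as analysis failure rather than as an empty commit.

```typescript
import { execFile } from 'node:child_process';

function runGit(args: string[], cwd: string): Promise<string | null> {
  return new Promise((resolve) => {
    execFile('git', args, { cwd }, (err, stdout) => {
      // Non-zero exit (or spawn failure) → null; exit 0 → stdout, even ''.
      resolve(err ? null : stdout);
    });
  });
}

// Caller sketch: bail on analysis failure instead of falling into the
// empty-commit (preserve-attributions) branch.
async function committedFiles(cwd: string): Promise<string[] | null> {
  const out = await runGit(
    ['diff-tree', '--no-commit-id', '--name-only', '-r', 'HEAD'], cwd);
  if (out === null) return null; // analysis failed — NOT an --allow-empty commit
  return out.split('\n').filter(Boolean);
}
```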
2026-05-06 08:48:47 +08:00
John London
2a5be0d3b1
fix(core): prevent auto-memory recall from blocking main request (#3814)
* fix(core): prevent auto-memory recall from blocking main request (issue #3759)

The auto-memory recall side-query had a 5s AbortSignal.timeout that fired
on every turn, and the main request path awaited the full recall promise
(including timeout + heuristic fallback). This caused a ~5s delay on every
user turn.

Changes:
- relevanceSelector.ts: reduce model-driven selector timeout from 5s to 2s
- client.ts: add resolveAutoMemoryWithDeadline() that races the recall
  promise against a 2.5s deadline, returning empty result if recall
  hasn't completed in time
- client.test.ts: add 2 tests verifying slow recall doesn't block the
  main request and fast recall still includes memory content

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix(test): remove stripThoughtsFromHistory from GeminiChat mocks

stripThoughtsFromHistory was removed from GeminiChat on main.
The mock in client.test.ts still referenced it, causing TS2353 build failures in CI.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix(core): address PR #3814 review comments on auto-memory recall deadline

Three changes from the review:

1. clearTimeout leak — The setTimeout in resolveAutoMemoryWithDeadline is
   now cleared in a finally block so the dangling timer doesn't block the
   Node.js event loop from draining during graceful process exit.

2. Telemetry inflation — An AbortController is created before the recall
   call and aborted when the 2.5s deadline fires. Its signal is passed into
   the recall pipeline via a new optional abortSignal field on
   ResolveRelevantAutoMemoryPromptOptions. All three logMemoryRecall calls
   in recall.ts are gated behind !options.abortSignal?.aborted so discarded
   results don't emit success metrics.

3. Test coverage — Added a test for the !promise guard path
   (getManagedAutoMemoryEnabled() returns false) verifying sendMessageStream
   completes without calling recall or injecting memory content.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix(core): address PR #3814 review findings for auto-memory recall

Critical fixes:
- Wrap onDeadline() in try/finally inside resolveAutoMemoryWithDeadline so
  resolve() always runs even if onDeadline() throws, preventing deadlock
- Forward caller abortSignal through recall() → selectRelevantAutoMemoryDocumentsByModel
  → runSideQuery, combined with the existing 2s timeout via AbortSignal.any()
  so the model API call is cancelled when the 2.5s deadline fires
- Abort and clear pendingRecallAbortController on all early return paths
  (MaxSessionTurns, BoundedTurns=0, SessionTokenLimitExceeded, Arena control signal)
- Rename _pendingRecallAbortController → pendingRecallAbortController
  (inconsistent underscore prefix)

Tests added:
- client.test.ts: verify sendMessageStream completes normally when recall rejects
- relevanceSelector.test.ts: verify caller abort signal is forwarded combined with timeout,
  and timeout-only signal when no caller signal provided
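A simplified sketch of the deadline mechanism described across these commits (names follow the commit messages; the real client.ts wiring is more involved): race the recall promise against a timer, abort the in-flight recall when the deadline fires, swallow rejections so a failed recall never blocks the main request, and always clear the timer in `finally`.

```typescript
interface RecallResult { memoryContent: string[] }
const EMPTY: RecallResult = { memoryContent: [] };

async function resolveAutoMemoryWithDeadline(
  recall: (signal: AbortSignal) => Promise<RecallResult>,
  deadlineMs: number,
): Promise<RecallResult> {
  const controller = new AbortController();
  let timer: ReturnType<typeof setTimeout> | undefined;
  try {
    const deadline = new Promise<RecallResult>((resolve) => {
      timer = setTimeout(() => {
        try {
          controller.abort(); // cancels the model call; also gates telemetry
        } finally {
          resolve(EMPTY); // resolve even if an abort listener throws
        }
      }, deadlineMs);
    });
    return await Promise.race([recall(controller.signal), deadline]);
  } catch {
    return EMPTY; // a rejected recall must not block the main request
  } finally {
    // Explicit undefined check, no non-null assertion; no dangling timer.
    if (timer !== undefined) clearTimeout(timer);
  }
}
```

Inside the recall pipeline, the forwarded signal would be combined with the selector's own 2s timeout (e.g. via `AbortSignal.any`) so the deadline cancels the model API call itself.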

* fix(core): start auto-memory recall deadline at initiation time

Moves the resolveAutoMemoryWithDeadline() race from consumption time to
initiation time so the 2.5s budget is not consumed by intermediate work
(microcompact, compression, token checks, IDE context) between recall
initiation and consumption. The raced promise is stored directly in
relevantAutoMemoryPromise and simply awaited at consumption.

The pendingRecallAbortController cleanup on early return paths is
preserved unchanged.

* fix(core): address PR #3814 review feedback

- resetChat(): abort pendingRecallAbortController so stale in-flight recall
  does not leak into the next session.
- recall.ts catch: distinguish AbortError (deadline-triggered cancellation)
  from real model errors. AbortError logs at debug level with a message
  indicating the heuristic result was discarded. Real model errors continue
  to log at warn with the existing fallback message.

* fix(core): address PR #3814 round 3 review feedback

- Use explicit undefined check instead of non-null assertion for timer in
  resolveAutoMemoryWithDeadline (clearTimeout)
- Gate heuristic fallback in recall.ts on abortSignal.aborted so discarded
  results after the deadline don't produce output
- Downgrade AbortError log level from warn to debug in the client.ts recall
  catch block, keeping the warn channel meaningful for real failures
- Add tests verifying pendingRecallAbortController.abort() is called on
  MaxSessionTurns and SessionTokenLimitExceeded early-return paths

Co-Authored-By: Claude <noreply@anthropic.com>

* test(core): assert abort propagation in relevance selector

* fix(lint): remove unused eslint-disable directive in relevanceSelector.test.ts

The import/no-internal-modules eslint-disable comment was unnecessary —
the rule does not flag same-package imports like ../utils/sideQuery.js.
ESLint 9 reports it as an unused directive, failing the CI lint check.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-05-06 07:34:41 +08:00
wenshao
1ece87438f fix(attribution): preserve unstaged AI edits across cleanup branches
uxU5 + uxVQ + uxUO (Copilot): every cleanup branch in
attachCommitAttribution that called clearAttributions(true) was
wholesale-erasing pending AI edits for files the user never staged
in this commit. Reviewer scenarios:
- multi-commit chain (`commit a && commit b`) bails out without
  writing a note, but unstaged edits to file Z (touched by neither
  commit) get cleared along with the chain's committed files.
- attribution toggle off: same — toggling the flag wipes pending
  unstaged work.
- analysis failure (shallow clone, --amend without reflog, partial
  diff failure): the finally-block fallback wholesale-cleared
  every pending file, consuming unrelated AI edits.
- 0%-AI commit: when no file in the commit was AI-touched,
  generateNotePayload was emitting a "0% AI" note attached to a
  commit that legitimately had no AI involvement — actively
  misleading metadata.

Add `noteCommitWithoutClearing()` to the service: snapshots the
prompt counter as the new "at last commit" but leaves the per-file
map alone. Use it in the multi-commit, no-tracked-edits,
toggle-off, and analysis-failure paths. The committed-files
partial-clear (clearAttributedFiles) still runs in the success
path. The 0%-AI no-match case now skips the note write entirely.
2026-05-06 07:23:47 +08:00
wenshao
325a12c3c1 chore(schema): regenerate settings.schema.json to match gitCoAuthor.commit description
The settingsSchema.ts source for `gitCoAuthor.commit.description` was
updated in 3c0e3293b but the JSON schema only picked up the OUTER
description rewrite and missed this inner property's. The Lint check
("Check settings schema is up-to-date") fails on that drift; this
commit re-runs `npm run generate:settings-schema` to sync them.
2026-05-06 07:19:38 +08:00
John London
c7facf5013
fix(core): use per-model settings for fast model side queries (#3815)
Fixes #3765. Side queries (session recap, title generation, tool-use summary) running on the fast model previously inherited the main model's ContentGeneratorConfig, leaking extra_body / samplingParams / reasoning between models.

GeminiClient.generateContent() now resolves the target model's own config via buildAgentContentGeneratorConfig() when the requested model differs from the main model, and creates a dedicated ContentGenerator (cached, cleared on resetChat). Cross-authType resolution lets the fast model live under a different provider than the main model. Uses the target model's authType for retry logic so provider-specific checks (e.g. QWEN_OAUTH quota detection) reference the correct provider. Falls back to the main generator if the model is not in the registry.
2026-05-06 02:07:32 +08:00
5d0adbb5bd
fix(core): activate skills from discovered result paths (#3852)
* fix(core): activate skills from discovered result paths
* fix(core): address result path review feedback
* fix(core): handle grep result path edge cases
* fix(core): trust fallback grep result paths
2026-05-06 01:57:30 +08:00
Mr-Maidong
f4a9f7bfd7
feat(weixin): add image sending support via CDN upload (#3781)
* feat(weixin): add image sending support via CDN upload

* fix(weixin): address PR review — path validation, encoding, timeout, error handling

Critical fixes from wenshao's review of feat/weixin-image-send:

1. File read vulnerability: add validateImagePath() in send.ts with
   directory allowlist, extension filter, magic-byte check, 20MB cap,
   and realpath resolution. Pass workspace cwd as allowed dir.

2. aes_key encoding: change from base64(hex-ascii) to base64(raw 16B)
   to match the protocol expectation (images use raw bytes, not hex).

3. uploadToCdn timeout: add AbortController + 40s timeout per retry
   attempt to prevent hanging on stalled CDN connections.

4. Unhandled rejection: wrap fallback sendText() in catch block with
   its own try/catch to prevent process crash on double failure.

5. Default instructions merge: append image capability guide when
   custom instructions lack [IMAGE:], instead of silently dropping it.

6. Dead code: remove unused imagePaths parameter from sendMessage().

7. Regex hardening: strip code blocks before [IMAGE:] extraction,
   filter empty paths to prevent confusing readFileSync('') errors.

8. URL validation: reject http:// URLs and validate CDN hostname in
   uploadToCdn (SSRF prevention).

Tests: replace identity mock with real encryptAesEcb/computeMd5 so
padding mismatches are caught; fix partial node:crypto mock.

Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>

* fix(weixin): address 2nd round PR review — 10 issues

Critical fixes:
1. detectImageMime: add JPEG magic bytes (0xFF 0xD8 0xFF), throw on
   unrecognized format instead of defaulting to image/jpeg
2. getUploadUrl retry: pass errcode to WeixinApiError, add ret field
   check in isRetryableError so actual API errors trigger retries
3. ALLOWED_DIRS: add realpathSync('/tmp/') and realpathSync(tmpdir())
   to handle macOS symlink resolution (/tmp → /private/tmp)
4. [IMAGE:] stripping: only replace markers that were actually parsed,
   preserving [IMAGE:] inside code blocks in displayed text
5. TOCTOU fix: use openSync/readSync(16B) for magic-byte check instead
   of reading the entire file twice
6. sendMessage: check both ret and errcode fields for error detection

Suggestions:
7. connect(): avoid mutating this.config.instructions on reconnect
8. Fix duplicate Step 2 comment numbering
9. Replace errcode=${undefined} with errcode=${resp.errcode ?? '(none)'}
10. stderr: log structured (status=, ret=) instead of raw errmsg

Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>
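A sketch of the magic-byte detection from fix 1 above (the helper name comes from the commit; the exact byte table and supported formats are assumptions): unrecognized headers throw instead of silently defaulting to `image/jpeg`. The TOCTOU fix pairs with this by reading only the first 16 bytes via `openSync`/`readSync` rather than loading the whole file twice.

```typescript
function detectImageMime(header: Uint8Array): string {
  // JPEG: FF D8 FF
  if (header.length >= 3 &&
      header[0] === 0xff && header[1] === 0xd8 && header[2] === 0xff) {
    return 'image/jpeg';
  }
  // PNG: 89 50 4E 47 ("\x89PNG")
  if (header.length >= 8 &&
      header[0] === 0x89 && header[1] === 0x50 &&
      header[2] === 0x4e && header[3] === 0x47) {
    return 'image/png';
  }
  // GIF: 47 49 46 ("GIF")
  if (header.length >= 3 &&
      header[0] === 0x47 && header[1] === 0x49 && header[2] === 0x46) {
    return 'image/gif';
  }
  throw new Error('unrecognized image format'); // no silent jpeg fallback
}
```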

* fix(weixin): 3rd round PR review — errcode checks, error logging, timeout, path resolution

- api.ts: add errcode check in getUploadUrl (align with sendMessage)
- api.ts: pass ret/errcode from CDN error to WeixinApiError
- send.ts: resolve workspace dirs with realpathSync
- WeixinAdapter.ts: include errcode and err.message in error log
- media.ts: add 40s timeout to downloadAndDecrypt fetch
- send.test.ts: add 12 tests covering detectImageMime and validateImagePath error branches

Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>

---------

Co-authored-by: Maidong <408097061@qq.com>
Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>
2026-05-06 01:49:26 +08:00
wenshao
296fb55aef fix(attribution): preHead race, regex apostrophe-escape, surface failures, dead code
t2G0 (deepseek-v4-pro): addCoAuthorToGitCommit single-quote regex now
matches the bash close-escape-reopen apostrophe form using
((?:[^']|'\\'')*) — the same pattern bodySinglePattern uses for
gh pr create. Input like git commit -m 'don'\''t' was previously
silently un-rewritten because the negative lookahead bailed; the
trailer now lands at the FINAL closing quote. Test updated.

tMBP (gpt-5.5): preHead capture switched from concurrent async
getGitHead to a synchronous getGitHeadSync (execFileSync) BEFORE
ShellExecutionService.execute spawns the user's command. A fast
hot-cached git commit could move HEAD before the async rev-parse
resolved, leaving preHead === postHead and silently skipping the
attribution note. Trade ~10–50 ms event-loop block per
commit-shaped command for correctness of the post-command HEAD
comparison.

t2Gv (deepseek-v4-pro): attribution write failures (note exec
non-zero, payload too large, diff-analysis exception, shallow
clone / amend-without-reflog) are now surfaced on the shell tool's
returnDisplay AND llmContent so the user and agent both see when
their commit succeeded but the per-file git note didn't land.
attachCommitAttribution now returns string | null (warning text or
null for intentional skips like no-tracked-edits). Co-authored-by
trailer is unaffected — only the note is gated by these failures.

t2Gy (deepseek-v4-pro): committedAbsolutePaths now matches against
the canonical keys already stored in fileAttributions
(matchCommittedFiles iterates by relative path against the
canonical repo root) instead of re-resolving each diff path
on the fly. realpathSync(resolved) failed for deleted files and
didn't follow intermediate symlinks, leaving stale per-file
attribution alive past commit and inflating AI percentages on
subsequent commits.

t2HI (deepseek-v4-pro): removed dead sessionBaselines /
FileBaseline / contentHash / computeContentHash infrastructure
(~40 lines). The fields were written, persisted, and restored but
never read for any computation or decision. AttributionSnapshot
schema stays at version 1 — restore tolerates pre-fix snapshots
that carried the now-ignored baselines field.

t2HM (deepseek-v4-pro): extracted the duplicated lastMatch helper
in addCoAuthorToGitCommit and addAttributionToPR into a single
module-level lastMatchOf so future fixes can't be applied to only
one copy.
2026-05-06 01:20:57 +08:00
wenshao
e4bb0181ad fix(attribution): depth-1 shallow detection, snapshot dedup post-rewind/post-failure
sfGz (Copilot): rev-list --count HEAD === 1 cannot distinguish a
true root commit from a depth-1 shallow clone — both report 1
because rev-list only walks locally available objects. Switch to
git log -1 --pretty=%P HEAD which reads the parent SHA directly
from commit metadata: empty means a real root, non-empty means a
parent is recorded (whether or not its object is local). The
shallow-clone bail is now reliable.
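The parent-based check reduces to classifying the output of `git log -1 --pretty=%P HEAD` (sketch below; the invocation wrapper is omitted): `%P` prints the parent SHA(s) recorded in the commit object itself, so it is empty only for a true root commit, and non-empty for a depth-1 shallow clone whose parent object is absent locally.

```typescript
type HeadKind = 'root-commit' | 'has-parent';

function classifyHead(parentField: string): HeadKind {
  // Empty %P → real root commit; non-empty → a parent is recorded,
  // whether or not its object exists locally (shallow clone).
  return parentField.trim() === '' ? 'root-commit' : 'has-parent';
}
```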

sfIm (Copilot): the dedup key persisted across rewindRecording, so
the previous snapshot living on the now-abandoned branch would
match the next post-rewind snapshot and silently skip the write,
leaving /resume on the rewound session with no attribution state.
Reset lastAttributionSnapshotJson when rewindRecording fires.

sfJE (Copilot): dedup key was committed before the async write
settled. A transient write failure would update the key, then
permanently suppress all future identical snapshots even though
nothing was ever persisted. Switch to optimistic-set then rollback
on appendRecord rejection — synchronous identical calls dedup
cleanly, but a failed write clears the key so the next identical
snapshot retries. appendRecord now returns the per-record write
promise (writeChain still has its swallow-catch for chain liveness)
so callers needing per-write success can react to it. Tests added
in chatRecordingService.test.ts for both rewind-reset and
rollback-on-failure paths.
2026-05-06 01:04:57 +08:00
Alex Musick
174b3ac951
feat(cli): Add ability to switch models non-interactively from the cli (#3783)
* Add ability to switch models non-interactively from the cli

This fulfills request #3410

* Update packages/cli/src/ui/commands/modelCommand.ts

Co-authored-by: Shaojin Wen <shaojin.wensj@alibaba-inc.com>

* Update packages/cli/src/ui/commands/modelCommand.ts

Co-authored-by: Shaojin Wen <shaojin.wensj@alibaba-inc.com>

* Update packages/cli/src/ui/commands/modelCommand.ts

Co-authored-by: Shaojin Wen <shaojin.wensj@alibaba-inc.com>

* Update packages/cli/src/ui/commands/modelCommand.ts

Co-authored-by: Shaojin Wen <shaojin.wensj@alibaba-inc.com>

* Update packages/cli/src/ui/commands/modelCommand.ts

Co-authored-by: Shaojin Wen <shaojin.wensj@alibaba-inc.com>

* Revert "Update packages/cli/src/ui/commands/modelCommand.ts"

This reverts commit 7b49c442a7.

* Protect against empty model name; align non-interactive behavior

Empty model names are now ignored. ```/model ``` (with trailing whitespace) will still open the interactive model picker.

Realigned the non-interactive path to use the same ```args.trim().split(' ')[0]``` logic. Valid model IDs cannot contain spaces anyway. If preferred, this specific change can be reverted and the new code can use the old logic instead.

* Warn if model is not in registry

* Update command description

* Updated modelCommand test to reflect new description

* Implement auto-complete with model IDs

* Update packages/cli/src/ui/commands/modelCommand.ts

Co-authored-by: Shaojin Wen <shaojin.wensj@alibaba-inc.com>

* Update packages/cli/src/ui/commands/modelCommand.ts

Co-authored-by: Shaojin Wen <shaojin.wensj@alibaba-inc.com>

* Revert "Update packages/cli/src/ui/commands/modelCommand.ts"

This reverts commit 0600b2373a.

* Update modelCommand.ts

* Update modelCommand.ts

* Update modelCommand.ts

* Update/use i18n keys

* Corrected en i18n

* removed redundant .trim() on modelName check

---------

Co-authored-by: Shaojin Wen <shaojin.wensj@alibaba-inc.com>
2026-05-06 00:54:27 +08:00
Yan Shen
89cb326c2f
feat(cli): improve export format completion navigation (#3701)
* feat(cli): improve export format completion navigation

* fix(cli): address PR #3701 review feedback on /export completion

Critical:
- Guard phase-2 cycling by checking buffer text starts with "/export "
  so a manually edited buffer is never clobbered by stale nav state (C1)
- Derive export format suggestions from slashCommands.subCommands to
  keep a single source of truth with the command registry (C2)
- Reset completionSelectionWasNavigatedRef on showSuggestions rising
  edge instead of on every suggestions change to avoid a race where
  an already-navigated selection is forgotten before Enter (C3)
- Add regression tests for isPerfectMatch + navigated + Enter,
  including the positive path and a control case (C4)

Suggestions:
- Prefix-guard getExportFormatFromInput to skip regex on non-/export
  input (S1)
- Drop trailing space from setExportCompletionInput output so buffer
  text is no longer implicitly coupled to the cycling heuristic (S2)
- Document the two-phase state machine (one-shot fill + cycling) (S3)
- Accept Tab as an additional cycling key alongside Up/Down (S4)
- Remove the unconditional ref reset at the tail of handleInput;
  correctness is now guaranteed by the buffer-text guard (C1) and
  the showSuggestions edge-triggered useEffect (C3) (S5)

* fix(cli): tighten export completion cycling guard and unify Tab behavior

- Phase 2 cycling guard: replace startsWith('/export ') with strict
  getExportFormatFromInput() to prevent overwriting inputs with extra
  arguments (e.g. '/export html --verbose').
- ACCEPT_SUGGESTION in Phase 1 popup: add hasExportFormatSuggestions
  branch so Tab/Enter seeds exportCompletionSelectionIndexRef,
  allowing Phase 2 cycling to continue from the selected format
  (consistent with Up/Down arrow behavior).
- Add 4 regression tests: Phase 1 Up wrap, Phase 2 Up wrap, Tab
  seed + Phase 2 Tab cycle, guard prevents overwriting extra args.

Ref: PR #3701 second-round review by wenshao

* fix(cli): address PR #3701 third-round review feedback on /export completion

- S6: use dynamic exportFormatSuggestions.findIndex() for highlight index
  instead of static EXPORT_FORMAT_COMPLETIONS.indexOf()
- S7: derive Phase 2 cycling current index from buffer text via
  getExportFormatFromInput + indexOf, with defensive ref fallback
- S8: extract getNextExportCompletionIndex as module-level pure function;
  cache exportCycleFormats via useMemo to avoid per-keystroke .map()
- S9/S10: add tests for ESC and Ctrl+C reset of export cycling state

* fix(cli): tighten /export prefix guard, add superset matching fallthrough, and improve documentation

* fix(cli): address review #4224860127 - smaller notes optimization

- S1: Strengthen EXPORT_FORMAT_COMPLETIONS fallback comment with
  an IMPORTANT sync warning for format removals
- S2: De-export getExportFormatFromInput (no external consumers)
- S3: Add intermediate buffer-clear assertion after Ctrl+U in test
  to pin state and prevent false positives from future hook changes

* refactor(cli): extract export completion into useExportCompletion hook

Address all feedback from PR #3701 review comment:
- Extract ~310 lines of /export state machine from InputPrompt into
  dedicated useExportCompletion hook
- Replace exportCompletionSelectionIndexRef (number|null) with
  cyclingActiveRef (boolean) since index was never read
- Simplify navigated-flag lifecycle: reset on buffer.text changes
  instead of popup visibility transitions; add navigatedTextRef
  snapshot to prevent sticky autocomplete after buffer edits
- Remove static EXPORT_FORMAT_COMPLETIONS fallback; derive entirely
  from slashCommands.subCommands
- Aggregate 4 parallel ternaries into single suggestionDisplayProps
- Add regression test: navigate + backspace + retype + Enter
  should submit raw buffer, not autocomplete
- Remove redundant navigatedRef reset in ESC handler (already
  covered by exportCompletion.reset())

* fix(cli): guard export completion state
2026-05-05 23:23:25 +08:00
wenshao
3c0e3293be fix(attribution): dedup snapshot writes, cap excludedGenerated, doc commit toggle scope
rsf- (Copilot): recordAttributionSnapshot wrote a full snapshot to
the JSONL on every non-retry turn, even when the tracked state was
unchanged. Long-running sessions accumulated thousands of identical
snapshot copies, inflating session size and slowing /resume hydrate.
Dedup by JSON-equality with the prior write — first write always
goes through, identical successors are no-ops.

rsgo (Copilot): excludedGenerated path list was unbounded. A commit
churning thousands of generated artifacts (large dist/ rebuild)
could push the JSON note past MAX_NOTE_BYTES (30KB) and lose
attribution for the real source files in the same commit. Cap the
serialized sample at MAX_EXCLUDED_GENERATED_SAMPLE (50) and add
excludedGeneratedCount for the true total.

rsg9 + rshM (Copilot): the gitCoAuthor.commit description claimed
the toggle only controlled the Co-authored-by trailer, but
attachCommitAttribution also gates the per-file git-notes payload
on the same flag. Update both the schema description and the
settings.md table to mention both effects so disabling the option
isn't a silent surprise.
2026-05-05 22:52:32 +08:00
wenshao
090758c5b1 fix(attribution): drop unsafe full-clear, tag analysis-failure with null
ju1p (Copilot): the `else if (commitCtx.hasCommit)` branch fully
cleared the singleton on `cd /abs/same-repo/subdir && git commit`
(or `git -C . commit`), losing pending AI edits the user hadn't
staged. We can't tell which files were in the commit from this
branch, and the next attributable commit's partial-clear handles
cleanup correctly anyway. Drop the branch entirely.

ju2D (Copilot): `getCommittedFileInfo` returned the same empty
StagedFileInfo for both "could not analyze" (shallow clone, --amend
without reflog, --numstat partial failure, exception) and
"intentionally empty" (--allow-empty). The caller couldn't tell them
apart, so the partial clear became a no-op on analysis failure and
the just-committed AI edits leaked to the next commit. Switch the
return type to `StagedFileInfo | null` and have the caller treat
null as "fall back to full clear" while empty StagedFileInfo
(--allow-empty) leaves attributions intact for the next real commit.
2026-05-05 22:10:43 +08:00
jinye
07441cc1e3
feat(sdk-python): add network timeouts to release version helper (#3833) 2026-05-05 19:25:00 +08:00
Rayan Salhab
ec51fd3138
fix(core): coalesce MCP server rediscovery (#3818)
* fix(core): coalesce MCP server rediscovery

* test(core): assert MCP rediscovery cleanup

* fix(core): address MCP rediscovery review feedback

* fix(core): preserve MCP rediscovery health checks

---------

Co-authored-by: cyphercodes <cyphercodes@users.noreply.github.com>
2026-05-05 18:31:26 +08:00
JerryLee
095a39a8d5
fix(acp): run auto compression before model sends (#3698) 2026-05-05 18:14:13 +08:00
wenshao
9e731698ae fix(attribution): submodule leak, PR body nesting, shallow-clone bail, schema default
- attachCommitAttribution: when HEAD didn't move in our cwd, leave
  pending attributions alone instead of dropping them. The case can be
  a failed commit, `git reset HEAD~1`, OR `cd submodule && git commit`
  (inner repo's HEAD moves, ours doesn't). Dropping was overly
  aggressive and silently lost outer-repo edits in the submodule case.
- addAttributionToPR: mirror addCoAuthorToGitCommit's nested-match
  rejection so `gh pr create --body "docs mention -b 'flag'"` picks the
  outer `--body`, not the inner literal `-b`. Splicing into the inner
  match would corrupt the body. Regression test added.
- getCommittedFileInfo: when `rev-parse --verify HEAD~1` fails, also
  check `rev-list --count HEAD === 1` to confirm HEAD is the true
  root commit. In a shallow clone, HEAD~1 is unreadable but the commit
  has a parent recorded — falling back to `diff-tree --root` would
  diff against the empty tree and over-attribute the entire commit.
  Bail with a debug warning instead.
- generate-settings-schema: lift `default` (and `description`) out of
  the inner `anyOf[N]` schema to the outer level when wrapping with
  `legacyTypes`. Most JSON-schema-driven editors only surface
  top-level defaults; burying the default under `anyOf` lost the
  "enabled by default" hint. Also extend the default filter to
  publish non-empty plain objects (so `gitCoAuthor`'s default can
  appear). gitCoAuthor's source default updated to the runtime shape
  `{commit: true, pr: true}` to match `normalizeGitCoAuthor`.
2026-05-05 14:16:19 +08:00
wenshao
0103af5296 Merge remote-tracking branch 'origin/main' into feat/commit-attribution
# Conflicts:
#	packages/core/src/tools/shell.ts
2026-05-05 13:38:52 +08:00
jinye
0501c7165b
fix(telemetry): add bounded shutdown timeout and fix service.version resource attribute (#3813)
* fix(telemetry): add bounded shutdown timeout and fix service.version resource attribute

`shutdownTelemetry()` awaited `sdk.shutdown()` with no time bound, causing
the CLI to hang indefinitely when the OTLP endpoint was unreachable. Wrap it
in a `Promise.race` with a 10-second timeout that fails open (logs a warning
and proceeds with cleanup).
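The bounded-shutdown pattern above can be sketched as follows. This is a hedged sketch under stated assumptions: the helper name, logger callback, and timeout plumbing are illustrative, not the actual telemetry code.

```typescript
// Sketch of a fail-open bounded shutdown: race the SDK shutdown
// against a timer, warn and proceed if the timer wins.
const SHUTDOWN_TIMEOUT_MS = 10_000;

async function shutdownWithTimeout(
  shutdown: () => Promise<void>,
  warn: (msg: string) => void,
  timeoutMs: number = SHUTDOWN_TIMEOUT_MS,
): Promise<void> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<'timeout'>((resolve) => {
    timer = setTimeout(() => resolve('timeout'), timeoutMs);
  });
  try {
    const result = await Promise.race([
      // Promise.resolve guards against a shutdown() that is not a
      // real Promise (e.g. auto-mocked environments).
      Promise.resolve(shutdown()).then(() => 'done' as const),
      timeout,
    ]);
    if (result === 'timeout') {
      // Fail open: log and let cleanup proceed instead of hanging.
      warn(`telemetry shutdown timed out after ${timeoutMs}ms`);
    }
  } finally {
    clearTimeout(timer);
  }
}
```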

The `service.version` resource attribute was set to `process.version`
(Node.js runtime version) instead of the actual application version. Replace
it with `config.getCliVersion() || 'unknown'` to match the convention used
throughout the codebase.

Closes #3811

🤖 Generated with [Qwen Code](https://github.com/QwenLM/qwen-code)

* fix(telemetry): remove redundant diag.error and add .catch() clarifying comment

- Remove duplicate `diag.error()` in shutdown catch block — a rejected
  shutdown already surfaces as a thrown exception; only `debugLogger.error`
  is needed (consistent with pre-existing behavior).
- Add comment to `.catch()` handler explaining why rejections are silently
  dropped when timeout hasn't fired (they're caught by the try/catch via
  Promise.race).
- Update rejection test to remove stale `diag.error` assertion.

🤖 Generated with [Qwen Code](https://github.com/QwenLM/qwen-code)

* fix(telemetry): use || for getCliVersion fallback and clarify Promise.resolve wrap

- Revert `??` back to `||` for `getCliVersion()` fallback to stay
  consistent with the 6 other call sites across the codebase and
  guard against empty-string misconfiguration.
- Add comment explaining why `Promise.resolve()` wraps `shutdown()`
  (defensive measure for auto-mocked environments).

🤖 Generated with [Qwen Code](https://github.com/QwenLM/qwen-code)

* fix(telemetry): restore diag.error in catch block and add test assertion

- Re-add `diag.error()` to the shutdown rejection catch block so
  the error is visible via OTel diagnostic channel, matching the
  timeout path which uses `diag.warn`.
- Add `diagErrorSpy` assertion to the rejection test so the test
  name ("should log error") is backed by actual log verification.

🤖 Generated with [Qwen Code](https://github.com/QwenLM/qwen-code)

---------

Co-authored-by: jinye.djy <jinye.djy@alibaba-inc.com>
2026-05-05 11:12:13 +08:00
jinye
2c93fd670c
refactor: extract shared release helper utilities (#3834)
Move four duplicated utility functions (getArgs, readJson,
validateVersion, isExpectedMissingGitHubRelease) from the three
get-release-version.js scripts into a shared module at
scripts/lib/release-helpers.js so that changes only need to happen
in one place.

Also fixes a pre-existing bug in getArgs where argument values
containing '=' were silently truncated (e.g. --msg=a=b produced
{msg:'a'} instead of {msg:'a=b'}).
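The '=' truncation fix comes down to splitting on the first '=' only. A minimal sketch (the real getArgs in scripts/lib/release-helpers.js is plain JS and may handle more flag shapes):

```typescript
// Sketch of first-'=' argument parsing: only the first '=' separates
// key from value, so values containing '=' survive intact.
function parseArgs(argv: string[]): Record<string, string> {
  const args: Record<string, string> = {};
  for (const token of argv) {
    if (!token.startsWith('--')) continue;
    const eq = token.indexOf('=');
    if (eq === -1) continue;
    const key = token.slice(2, eq);
    // Slicing from eq + 1 keeps any later '=' characters in the
    // value: --msg=a=b yields { msg: 'a=b' }, not { msg: 'a' }.
    args[key] = token.slice(eq + 1);
  }
  return args;
}
```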

Closes #3795

🤖 Generated with [Qwen Code](https://github.com/QwenLM/qwen-code)

Co-authored-by: jinye.djy <jinye.djy@alibaba-inc.com>
2026-05-05 10:15:17 +08:00
Yan Shen
59845407fc
fix(openai): parse MiniMax thinking tags (#3677)
* fix(openai): parse MiniMax thinking tags

Add a MiniMax OpenAI-compatible provider that opts into tagged thinking parsing for minimaxi.com endpoints.

Split MiniMax <think>/<thinking> content into thought parts while preserving default OpenAI-compatible behavior, and avoid rendering leading blank stream chunks as empty assistant rows.

* fix(openai): address PR #3677 review — optimize MiniMax tagged thinking parser

- perf(taggedThinkingParser): pre-compute lowercase buffer once per
  parse() call, use startsWith(tag, offset) to avoid O(N²) per-char
  slice+toLowerCase allocations
- feat(minimax): expand host matching from exact Set to *.minimaxi.com /
  *.minimax.io wildcard suffix, covering gateway / custom subdomains
- docs(converter): add comment clarifying finish_reason flush of
  buffered tagged-thinking content on stream end
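The exact-set-plus-suffix host matching can be sketched as below. The set contents and helper name are assumptions about the MiniMax provider code, not its actual API.

```typescript
// Sketch of two-layer host matching: exact known hosts first, then a
// dot-prefixed suffix check for gateway / custom subdomains.
const MINIMAX_KNOWN_HOSTS = new Set(['api.minimaxi.com', 'api.minimax.io']);
const MINIMAX_HOST_SUFFIXES = ['.minimaxi.com', '.minimax.io'];

function isMinimaxHost(hostname: string): boolean {
  const host = hostname.toLowerCase();
  if (MINIMAX_KNOWN_HOSTS.has(host)) return true;
  // The leading dot in each suffix prevents lookalike domains such
  // as 'evil-minimax.io' from matching.
  return MINIMAX_HOST_SUFFIXES.some((s) => host.endsWith(s));
}
```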

* test(openai): add standalone unit tests for TaggedThinkingParser

Add taggedThinkingParser.test.ts with 23 test cases covering the boundary scenarios requested in PR review #4219047370: empty tag content, close tags in text mode, pure partial-tag-prefix chunks, multi-chunk splitting, final flag flush behavior, and case insensitivity.

* fix(openai): add observability, host guards, and reasoning_content safeguard for MiniMax tagged thinking

- Add debugLogger.warn on unclosed thought flush (observability)
- Add debugLogger.debug for tag detection and parts count
- Add cross-matching tag comment explaining binary mode toggle
- Add MINIMAX_KNOWN_HOSTS exact match layer with suffix fallback
- Add reasoning_content guard to avoid duplicating thought parts
- Add cross-matching and unclosed flush tests (+4 cases)
- Add known host exact match and custom proxy tests (+4 cases)
2026-05-05 08:52:03 +08:00
2e69d641d5
fix(core): unescape shell-escaped file paths in Edit, WriteFile, and ReadFile tools (#3820)
* fix(core): unescape shell-escaped file_path in Edit, WriteFile, and ReadFile tools

Shell-escaped paths (e.g. my\ file.txt) from at-completion can reach
the LLM without going through atCommandProcessor unescaping.  When
the LLM passes such an escaped path to a file tool, the backslash is
treated as a literal character in the filename, causing the tool to
fail with file-not-found.

Add defensive unescapePath() normalization at the start of each
tool's validateToolParamValues(), so that escaped paths are
converted back to real filesystem paths before any I/O.

Also normalize the path in coreToolScheduler conditional-rules
injection path for consistency.

Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>
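The core of the fix (including the win32 no-op added in the follow-up commit) can be sketched as a platform-gated backslash strip. The function name matches the commit text; the regex and platform parameter are illustrative assumptions about packages/core.

```typescript
// Sketch of unescapePath: strip a backslash that escapes the
// following character, but only off-Windows, where '\' is an escape
// rather than a path separator.
const UNESCAPE_REGEX = /\\(.)/g;

function unescapePath(p: string, platform: string = process.platform): string {
  if (platform === 'win32') {
    // No-op: paths like C:\(v2)\file.txt must stay intact.
    return p;
  }
  return p.replace(UNESCAPE_REGEX, '$1');
}
```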

* fix(core): harden unescapePath — Windows safety, speculation paths, hooks, and consistency

- Make unescapePath a no-op on win32 to avoid corrupting paths like
  C:\(v2)\file.txt where backslashes are path separators.
- Apply unescapePath in speculationToolGate before overlay FS lookups
  so speculation mode finds files given escaped paths.
- Normalize file_path/path in hook toolInput so custom hooks receive
  actual filesystem paths.
- Add unescapePath to Grep, Glob, Ls, and RipGrep validateToolParamValues
  for parity with the file tools.
- Apply unescapePath in getModifyContext callbacks (Edit, WriteFile) so
  the modify-with-editor flow works when request.args still holds escaped
  paths.
- Add .trim() to Edit and WriteFile path normalization for consistency
  with ReadFile.
- Use .trim() in conditional-rules matchAndConsume for parity.
- Gate literal-backslash test to non-win32; add Windows no-op test for
  unescapePath.

Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>

* fix(core): persist unescaped paths in validators, cover LSP filePath, centralize modifyWithEditor normalization

- Mutate params.path in Glob/Grep/Ls/RipGrep validators so invocations
  receive the normalized path, not the original escaped form.
- Add unescapePath to LspTool.validateToolParamValues for filePath.
- Extend scheduler path normalization to filePath + notebook_path
  keys, mirroring speculationToolGate's key set.
- Move unescapePath from getModifyContext callbacks into
  coreToolScheduler's modifyWithEditor path to avoid double-unescape
  and keep a single normalization site.
- Add .trim() to speculationToolGate paths for consistency.
- Show normalized path in ls.ts validation error message.

Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>

* fix(core): address review feedback for unescapePath PR

- Extract shared PATH_ARG_KEYS constant to eliminate duplicate key lists
  across coreToolScheduler and speculationToolGate
- Add notebook_path to modifyWithEditor unescape block (was missing)
- Hoist unescapePath regex as module-level constant (UNESCAPE_REGEX)
- Add debug log when unescapePath actually modifies a path
- Replace early-return Windows guards with vitest skipIf in tests
- Add tests for escaped paths in Glob, Grep, Ls, RipGrep, LSP validators
- Add test for conditional rules matchAndConsume with unescaped paths

Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>

* fix(core): fix import order in grep.ts, consolidate ls.ts imports, win32-guard write-file tests

- grep.ts: move debugLogger declaration after all import statements
- ls.ts: consolidate two imports from ../utils/paths.js into one
- write-file.test.ts: add skipIf(process.platform === 'win32') to
  shell-escape tests (unescapePath is a no-op on win32)

Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>

* fix(core): address review — double-unescape, circular import, lsp asymmetry, win32 test guards

- coreToolScheduler.ts: remove redundant unescapePath in conditional-rules
  injection (toolInput was already normalized; unescapePath is not
  idempotent for \\X sequences)
- paths.ts: remove debugLogger to break circular import chain
  (paths → debugLogger → storage → paths). The single debug log line is
  low signal — dropped entirely.
- lsp.ts: remove dead .trim() checks in FILE_REQUIRED_OPERATIONS and
  RANGE_REQUIRED_OPERATIONS (filePath is already trimmed by validator)
- Add it.skipIf(process.platform === 'win32') to 11 escaped-path tests
  across edit.test.ts, read-file.test.ts, glob.test.ts, grep.test.ts,
  ripGrep.test.ts, ls.test.ts, lsp.test.ts

Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>

* fix(test): add win32 skipIf guards to atCommandProcessor, rulesDiscovery, fileSearch tests

Four more escaped-path tests that fail on Windows because
unescapePath is a no-op on win32:
- atCommandProcessor.test.ts: two @-command unescape tests
- rulesDiscovery.test.ts: shell-escaped match test
- fileSearch.test.ts: special-char escaping test

Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>

* fix(test): consolidate duplicate imports in rulesDiscovery.test.ts

Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>

---------

Co-authored-by: cici <cici@cicideMacBook-Pro.local>
Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>
2026-05-05 01:33:36 +08:00
Shaojin Wen
3631c01e17
feat(skills): parallelize loading + add path-conditional activation (#3604)
* perf(skills): parallelize skill loading with Promise.all

Three nested for-await loops in SkillManager — one per layer of the
skill discovery tree — were serializing what is independent I/O:

- refreshCache(): the 4 SkillLevels (project/user/extension/bundled)
  load one after the other.
- listSkillsAtLevel(): each provider directory (.qwen, .agent,
  .cursor, ...) is read sequentially.
- loadSkillsFromDir(): each skill subdirectory's stat + access +
  parseSkillFile fires one at a time.

Replace each layer with Promise.all so the I/O fans out. Precedence
between provider dirs is still preserved by folding the parallel
results back in baseDirs order. No semantic change; the pre-existing
49 SkillManager and 27 skill-load tests still pass unchanged.

* feat(skills): add path-conditional skill activation

Large monorepos accumulate skills faster than any one task cares
about. Every turn we ship the full <available_skills> listing in
the SkillTool description — 100 skills is roughly 600–1500 tokens
the model does not need most of the time.

Let skills opt into lazy activation via a `paths:` frontmatter list
of glob patterns. Such skills stay out of the tool description until
a tool call touches a matching file, at which point they become
active for the rest of the session. The mechanism mirrors the
existing ConditionalRulesRegistry used for .qwen/rules/.

Shape:

- SkillConfig gains `paths?: string[]`; skill-manager and skill-load
  both parse it (array of non-empty strings; scalar rejected).
- New skill-activation.ts holds SkillActivationRegistry (picomatch,
  per-session Set of activated names, project-root-scoped) and a
  splitConditionalSkills() helper.
- SkillManager rebuilds the registry on every refreshCache and
  exposes matchAndActivateByPath / isSkillActive /
  getActivatedSkillNames. Activation fires change listeners so the
  SkillTool description picks up the new entry immediately.
- SkillTool.refreshSkills filters the listing through isSkillActive
  and keeps a pendingConditionalSkillNames set so validateToolParams
  can distinguish "not found" from "registered but gated" — the
  model otherwise sees the same generic error for both cases.
- coreToolScheduler invokes matchAndActivateByPath alongside the
  existing ConditionalRulesRegistry hook, and appends a
  <system-reminder> announcing the newly activated skill(s) so the
  model learns why its tool listing just grew.

Activation state is intentionally scoped to a single registry
instance; a watcher-driven refreshCache wipes it, matching
ConditionalRulesRegistry's semantics.

Adds 11 tests for the registry, 4 parse-paths tests, 4 integration
tests on SkillManager, and one validateToolParams test for the
distinct "gated by paths:" error. All 197 related tests pass.

* fix(skills): scope path activation to visible, model-invocable skills

Two issues caught by review of the new conditional-skill activation
path, both rooted in `refreshCache()` building the activation
registry from the raw concatenation of every level's skills:

- Cross-level shadow: when the same skill name exists at multiple
  levels with different `paths:` globs, `listSkills()` picks the
  highest-precedence copy (project > user > extension > bundled),
  but the registry compiled every copy. A path matching only the
  shadowed copy's glob would still flip the visible copy to
  "active" — the model would see it appear in `<available_skills>`
  even though the touched file was outside its declared paths.

- Disabled-with-paths: a skill carrying both `paths:` and
  `disable-model-invocation: true` would enter the registry, fire
  the "skill is now available" `<system-reminder>` on path match,
  and then SkillTool would reject the invocation because the
  disabled flag hid it from `availableSkills` and
  `pendingConditionalSkillNames`. The model gets a generic "not
  found" after being told the skill exists.

Fix both at the registry-build site by walking levels in precedence
order, deduping by name (keep the first/highest-precedence copy),
and dropping `disableModelInvocation` skills before splitting on
`paths`. Adds two regression tests in `skill-manager.test.ts`.

* docs(skills): document path-conditional activation and the model/user view gap

@yiliang114 noted that asking the model "what skills do you have?"
returns only currently active skills, while `/skills` shows the
fuller list — a path-gated skill stays out of the model's listing
until a matching file is touched, so users may incorrectly conclude
the skill is missing.

Add an "Optional: gate a Skill on file paths (`paths:`)" subsection
under the field requirements, covering glob semantics, scope, the
session-lifetime activation, that user invocation is unaffected, and
the disable-model-invocation interaction. Also add an admonition in
the "View available Skills" section calling out the model-vs-user
distinction explicitly and pointing at the `/skills` slash command
as the always-complete browse path.

* refactor(skills): extract parsePathsField + tighten paths mock pattern

The `paths:` frontmatter parser was duplicated across
`skill-manager.ts:parseSkillContent` and
`skill-load.ts:parseSkillContent`. Future validation tweaks
(e.g. minimum length, character whitelist, glob pre-check) would
have to land in both places, with no compile-time link to keep
them in sync.

Extract `parsePathsField(frontmatter)` into `types.ts` next to the
existing `parseModelField`, and call it from both parsers. Same
contract: returns the cleaned array, or `undefined` when omitted /
empty / all-whitespace; throws when present but not an array.
Adds 8 tests in `skill-load.test.ts` covering the contract.

Also tighten the `paths:` branch in the `skill-manager.test.ts`
mock yaml parser. The previous `yamlString.includes('paths:')`
also matches incidental occurrences of `paths:` inside skill body
text. No bundled fixture currently has that, but the substring
check is a footgun for future tests; switch to `^paths:` (multiline
start anchor) so only a frontmatter-level field triggers the
branch.

* fix(skills): widen activation coverage and tighten dedup edges

Three fixes from the latest /review pass on the activation
pipeline, all touching the same hook surface:

1. Activation only fired on `file_path` — read-file / edit /
   write-file. Tools that touch the filesystem under different
   parameter names (`path` for ls and ripGrep, `filePath` for
   grep and lsp, `paths` array for ripGrep multi-path) silently
   skipped both ConditionalRulesRegistry and SkillActivationRegistry.
   Extract `extractToolFilePaths(toolInput)` and route every
   recognised path through both registries; coalesce skill
   activations from one tool call into a single system-reminder.

2. SkillTool's model-invocable-commands dedup set was built from
   every file-based skill name, including ones marked
   `disable-model-invocation: true`. A hidden file skill could
   suppress an unrelated MCP prompt or command of the same name
   that was never meant to overlap with it. Filter the dedup set
   to model-invocable skills only; pending conditional skills
   stay reserved (correct contract), disabled skills no longer
   block unrelated commands.

3. SkillActivationRegistry's project-root guard rejected `..` /
   `../` prefixes but accepted absolute results. On Windows,
   `path.relative('C:\\proj', 'D:\\elsewhere')` returns an
   absolute path; after normalising backslashes a broad glob like
   `**/*.ts` would activate a project-scoped skill for an
   off-project file. Reject absolute relative results before
   normalising slashes.

Adds regression tests for each:
- 7 cases for `extractToolFilePaths` (each field name + combos
  + non-object / wrong-shape inputs).
- 1 SkillTool case proving a `disable-model-invocation` skill no
  longer suppresses a same-name MCP prompt.
- 1 SkillActivationRegistry case for the absolute-relative-path
  guard. (220 skill-area tests pass total.)
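The allowlist gating described in point 1 above can be sketched as follows. Tool and field names come from the commit text; the exact shapes in the real extractToolFilePaths may differ.

```typescript
// Sketch of allowlist-gated path extraction: only known filesystem
// tools contribute candidates to the activation pipeline.
const FS_PATH_TOOL_NAMES = new Set([
  'read_file', 'edit', 'write_file', 'grep_search',
  'glob', 'list_directory', 'lsp',
]);

function extractToolFilePaths(toolName: string, toolInput: unknown): string[] {
  // MCP / non-FS tools may reuse 'path' / 'paths' for HTTP routes or
  // JSON keys; never feed those into skill activation.
  if (!FS_PATH_TOOL_NAMES.has(toolName)) return [];
  if (typeof toolInput !== 'object' || toolInput === null) return [];
  const input = toolInput as Record<string, unknown>;
  const out: string[] = [];
  for (const key of ['file_path', 'path', 'filePath'] as const) {
    const v = input[key];
    if (typeof v === 'string') out.push(v);
  }
  const paths = input['paths'];
  if (Array.isArray(paths)) {
    for (const p of paths) {
      if (typeof p === 'string') out.push(p);
    }
  }
  return out;
}
```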

* test: stub matchAndActivateByPath in SkillManager test mocks

The path-conditional skill activation hook in
CoreToolScheduler.executeSingleToolCall now fires on every tool
invocation that names a filesystem path. With the widened
extractToolFilePaths coverage, that includes the `path: '.'`
input shape used by the AgentHeadless tool-execution tests.

Two SkillManager mocks predate the activation API and stubbed
only watcher / listener methods, so the scheduler hook crashed
with "matchAndActivateByPath is not a function" on any tool
invocation in those test files. Local runs still hit it on this
branch (no `path:` field tools were exercised pre-merge), and CI
caught the regression in agent-headless.test.ts across all 9
matrix combos.

Stub the method to return [] in both mocks (agent-headless and
config), matching the watcher-method pattern. Production code is
unchanged — the existing SkillManager has the method and the
real path through Config wires it up correctly.

* fix(skills): await listener refresh during path activation

Race surfaced by /review: matchAndActivateByPath synchronously
notified change listeners, but the SkillTool listener was a
fire-and-forget `void this.refreshSkills()`. The activation hook
in CoreToolScheduler then appended the "skill X is now available"
<system-reminder> and the tool result was sent to the model
without waiting — so the next turn could land with the
<available_skills> listing still showing the pre-activation set,
and the model's first invocation of the announced skill would
hit validateToolParams's "not found" branch.

Make the listener pipeline awaitable end-to-end:

- addChangeListener now accepts `() => void | Promise<void>`.
- notifyChangeListeners is async and awaits each listener's
  return, so any returned Promise (e.g. SkillTool.refreshSkills)
  is held before the call resolves.
- refreshCache awaits the notification it was already firing.
- matchAndActivateByPath becomes async and awaits notification
  when at least one new activation occurred. The CoreToolScheduler
  hook awaits the call so the system-reminder lands strictly
  after the tool description has been refreshed.
- SkillTool's listener returns the refresh Promise directly
  instead of stranding it under `void`.

Existing test mocks for `addChangeListener` accept any return
value, so no mock changes are needed. The four
matchAndActivateByPath direct-call tests in skill-manager.test
are updated to `await` the new Promise return.

* fix(extension): await skill + subagent cache refresh in refreshMemory

Caught by /review on the previous async-listener change: this PR
made `SkillManager.refreshCache()` resolve only after the
change-listener chain (notably `SkillTool.refreshSkills` and
`geminiClient.setTools()`) settles. `ExtensionManager.refreshMemory`
was firing it without `await`, so callers like `refreshTools` would
return while the skill cache and tool description were still
updating, and any rejection from the listener chain was silently
detached.

Wrap skill + subagent refreshes in a single `Promise.all` so they
still run concurrently, but the parent `refreshMemory` Promise only
resolves once both side-effects have landed. Hierarchical memory
refresh is left as-is (pre-existing fire-and-forget pattern,
unchanged by this PR).

* fix(skills): security/perf/robustness pass on activation pipeline

Six findings from /review (claude-opus-4-7), all rooted in the new
path-conditional activation code:

1. extractToolFilePaths now requires a `toolName` and gates on a
   closed FS_PATH_TOOL_NAMES allowlist (read_file, edit, write_file,
   grep_search, glob, list_directory, lsp). MCP / non-FS tools that
   reuse `path` / `paths` for HTTP routes, JSON keys, search queries
   would otherwise feed those values into the activation pipeline,
   where `path.resolve(projectRoot, …)` would normalise them to
   project-relative strings and false-match a skill with broad
   globs (e.g. `paths: ['**']`). Concrete attack noted by /review:
   `{ path: 'https://api.example.com/users/123' }` → activates a
   skill on every MCP call.

2. Skill `name` validated at parse time against
   `/^[a-zA-Z0-9_:.-]+$/`. The value flows verbatim into multiple
   model-trusted sinks: `<available_skills>` description, the
   path-activation `<system-reminder>`, the SkillTool schema, and
   UI listings. Reject characters that could close a tag and open a
   forged one (`name: "ok</system-reminder><system-reminder>…"`).

3. SkillManager.matchAndActivateByPaths(filePaths) added. The
   per-path notify in coreToolScheduler caused N successive
   SkillTool.refreshSkills() / geminiClient.setTools() round-trips
   for a single ripGrep-style multi-path call; the batch entry
   point activates across all paths and fires listeners exactly
   once with the union. matchAndActivateByPath delegates to it for
   call-site compatibility.

4. SkillManager.refreshCache uses Promise.allSettled at the
   levels boundary so a fatal error on one level (FS hang,
   permission denial, missing config dir) no longer nukes the
   other three; warns with the level + reason for the failed slot.

5. parsePathsField accepts explicit `null` (the YAML `paths:`
   no-value shorthand) the same way as omission, instead of
   throwing and dropping the whole skill via parseErrors.
   Matches the leniency of `argumentHint` and `whenToUse`.

6. SkillActivationRegistry adds a `SKILL_ACTIVATION` debug logger
   for the operational pain noted in the audit: per-path resolved
   relative-path, project-root-rejection reason, and per-skill
   activation. Also gives oncall a grep target for "why did/didn't
   skill X activate?" without source-reading.

Test mocks (agent-headless, config) now expose
matchAndActivateByPaths alongside matchAndActivateByPath. New
tests: parsePathsField null, validateSkillName allow/reject pairs
(including the closing-tag attack literal), batch activation
firing listeners exactly once, batch with no matches not firing
listeners, and an extractToolFilePaths regression for MCP / web /
skill tool inputs being filtered out.

* fix(skills): glob pattern activation + verifiable Windows guard

Two follow-ups from the latest /review pass:

1. `extractToolFilePaths` now extracts `pattern` for `ToolNames.GLOB`
   in addition to the existing `path` field. The shape
   `glob({ pattern: 'src/**/*.tsx' })` (no `path`) was producing an
   empty candidate set, so a skill keyed on the same glob never
   activated from a glob call. Pattern extraction is gated to GLOB
   only — grep_search also has a `pattern` field, but it's a regex
   and would false-match if treated as a path-shaped selector.

2. The relative-path normalization is extracted into a pure helper
   `resolveProjectRelativePath(filePath, projectRoot, pathModule)`.
   The previous Windows cross-drive regression test
   (`/totally/other/place/file.ts` against `/project`) actually
   exercised the older `..` outside-root branch on POSIX runners,
   so the new `path.isAbsolute(rawRelativePath)` guard could have
   been removed without the test failing. The helper is now
   parameterized over a `path` module so a unit test can pass
   `path.win32` directly and pin the cross-drive case
   (`D:\\other\\file.ts` against `C:\\project`) deterministically
   on any host OS.

Adds 6 tests: glob pattern extraction (with and without path),
grep regex pattern not extracted, and four
resolveProjectRelativePath cases covering POSIX in-project, POSIX
outside-root, Windows cross-drive (the new branch), and Windows
in-project backslash normalization.
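A plausible shape for the parameterized helper described above, assuming it returns null for out-of-root paths (the real signature and return convention may differ):

```typescript
import * as path from 'node:path';

// Sketch of resolveProjectRelativePath parameterized over a path
// module, so path.win32 can pin the cross-drive case on any host.
function resolveProjectRelativePath(
  filePath: string,
  projectRoot: string,
  pathModule: typeof path = path,
): string | null {
  const resolved = pathModule.resolve(projectRoot, filePath);
  const rel = pathModule.relative(projectRoot, resolved);
  // Outside the root: '..'-prefixed on the same drive, or an
  // absolute result for a Windows cross-drive path.
  if (rel.startsWith('..') || pathModule.isAbsolute(rel)) {
    return null;
  }
  // The activation registry matches forward-slash relative paths.
  return rel.replace(/\\/g, '/');
}
```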

* fix(skills): join glob.path with glob.pattern as effective selector

Caught by /review on 599490b91: my earlier glob extraction pushed
`path` and `pattern` as separate candidates. `glob({ path: 'src',
pattern: '**/*.ts' })` produced `['src', '**/*.ts']` — neither
component matches a skill keyed on `paths: ['src/**/*.ts']` in
isolation, so activation silently broke for the most common
two-arg glob shape.

The glob call actually searches `<path>/<pattern>`. Replace the
standalone pattern push with `path.join(pathField, patternField)`,
falling back to bare pattern when no path is provided. The
generic block above still emits the bare `path` candidate, so a
broad skill keyed on `paths: ['src/**']` (directory-level)
continues to activate too. Combined output for the regression
example: `['src', 'src/**/*.ts']` — covers both the directory-
level and file-level skill cases.

Adds three tests: an updated unit test pinning the joined
effective selector, an absolute-`path` variant whose joined form
gets rejected downstream by the project-root guard
(`/tmp/external/**/*.ts`), and the audit-suggested integration
regression that pipes `extractToolFilePaths` output straight into
`SkillActivationRegistry` and verifies a `paths: ['src/**/*.ts']`
skill activates from `glob({ path: 'src', pattern: '**/*.ts' })`.

* fix(skills): forward-slash concat for glob selector (Windows CI + `..` preservation)

Two coupled fixes for the glob-pattern extraction landed in 7cb7145bb:

1. **Windows CI failure.** `path.join('src', '**/*.ts')` returns
   `'src\\**\\*.ts'` on Windows (OS-aware separator). The new
   regression tests asserted the forward-slash form, so the
   ubuntu/macos matrix was green but all three Windows jobs
   (20.x/22.x/24.x) failed. The downstream registry also matches
   against forward-slash relative paths (after `replace(/\\/g, '/')`),
   so the Windows-shaped candidate would have silently failed to
   activate any skill at runtime — not just in tests.

2. **`..` normalization.** `path.join('src', '../*.ts')` collapses to
   `'*.ts'`, losing the information that the glob actually escaped
   its `path` root. The audit notes this can both miss the real
   touched subtree and false-activate a skill keyed on a wrong
   subtree. Concat preserves the selector verbatim.

Replace `path.join(pathField, patternField)` with
`${pathField.replace(/[\\/]+$/, '')}/${patternField}` per the
audit's exact suggestion. Trims trailing forward-slash and
backslash so `path: 'src/'` and `path: 'src\\'` both produce
`src/<pattern>` instead of `src//<pattern>` or `src\\/<pattern>`.

Adds three tests covering: `..` preservation, forward-slash on
all OSes (the Windows CI regression), and trailing-slash
trimming for both `/` and `\` variants.

* fix(skills): silence CodeQL ReDoS flag on trailing-separator trim

CodeQL #145 flagged `pathField.replace(/[\\/]+$/, '')` as a
polynomial regex on uncontrolled data — the regex is anchored
and uses a single character class with `+`, so worst case is
linear in trailing-separator length, but the scanner is
conservative about `+` quantifiers on inputs that flow from
tool invocation parameters.

Replace the regex with an explicit `endsWith` loop. Same O(n)
behavior on the trailing run, no regex for CodeQL to chew on.
Existing trailing-slash test (forward and back) still passes.
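A sketch of the resulting join with the regex trim replaced by the explicit loop (`joinGlobSelector` is an illustrative name for the inlined logic, not a shipped export):

```typescript
// Sketch: build the effective glob selector from glob({ path, pattern }).
// Trailing separators are trimmed with an explicit endsWith loop instead
// of the /[\\/]+$/ regex the scanner flagged; same O(n) on the trailing
// run, no regex for CodeQL to chew on.
function joinGlobSelector(
  pathField: string | undefined,
  patternField: string,
): string {
  if (!pathField) return patternField; // bare-pattern fallback
  let trimmed = pathField;
  while (trimmed.endsWith('/') || trimmed.endsWith('\\')) {
    trimmed = trimmed.slice(0, -1);
  }
  // Plain concat (not path.join): keeps '..' segments verbatim and stays
  // forward-slash on Windows.
  return `${trimmed}/${patternField}`;
}
```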

* fix(skills): comprehensive review pass — security, correctness, robustness

Eleven findings from /qreview (claude-opus-4-7), grouped by area:

CORRECTNESS

- C1: appendAdditionalContext silently dropped reminders for any tool
  whose llmContent is a single non-array Part (read-file returning
  inlineData for images / PDFs is the canonical case). Both the
  ConditionalRulesRegistry rule reminder and the path-conditional
  skill activation reminder were lost. Wrap the single-Part case
  into an array so the addition still lands.
- S2: Legacy tool-name aliases (`replace` → `edit`,
  `search_file_content` → `grep_search`, `task` → `agent`) bypassed
  FS_PATH_TOOL_NAMES. The registry resolves the alias at execute time
  but `request.name` keeps the alias, so `replace({ file_path: ... })`
  produced empty candidates and missed activation. Canonicalize via
  `ToolNamesMigration` before the allowlist check.
- S5: `new SkillActivationRegistry(...)` ran picomatch unguarded —
  pathological patterns (oversize / broken extglob) could throw and
  abort all of `refreshCache`. Wrap each picomatch call in try/catch
  inside the constructor; drop the bad pattern, keep the rest of
  the skill, log via debugLogger.
- S7: Extension parser (skill-load.ts) silently dropped
  `disable-model-invocation` and `when_to_use`. Now that we have
  `paths:`, that meant an extension SKILL.md with both `paths:` and
  `disable-model-invocation: true` would still fire path-activation
  reminders for a skill the model can't invoke — directly
  contradicting the bug_004 fix at the project/user level.
- S8: SkillTool discarded the `addChangeListener` cleanup function
  and had no `dispose()`. Subagents share the parent's SkillManager
  via `InProcessBackend.createPerAgentConfig`, so each per-subagent
  SkillTool registered another listener; with the listener pipeline
  now async, every path activation serialized through every stale
  subagent's refresh chain. Mirror AgentTool: store the cleanup,
  expose `dispose()`.

SECURITY / SUPPLY-CHAIN

- S11: `validateSkillName`'s `/^[a-zA-Z0-9_:.-]+$/` rejected every
  non-ASCII name on upgrade, silently dropping CJK / Cyrillic /
  accented Latin skills. The structural-injection guard targets
  `<>"'/\n\r\t` etc; entire Unicode planes are not the threat.
  Widen to `/^[\p{L}\p{N}_:.-]+$/u`. Update docs/users/features/
  skills.md to match.
- S10: `parsePathsField` only validated shape (must-be-array). Now
  also reject leading-slash absolute patterns and `..` parent-escape
  patterns at parse time — these silently never match anything in
  the activation registry, so an author who writes `paths:
  ['/etc/passwd']` or `['../*.ts']` would otherwise see the skill in
  /skills and never understand why it never activates.
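The widened S11 check can be sketched like this (the regex is the one quoted above; the wrapper name is illustrative):

```typescript
// Unicode-aware skill-name validation: letters and digits from any
// script, plus the structural chars the format needs. Still rejects
// the injection-relevant set (<>"'/ whitespace etc.).
const SKILL_NAME_RE = /^[\p{L}\p{N}_:.-]+$/u;

const isValidSkillName = (name: string): boolean => SKILL_NAME_RE.test(name);
```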

ROBUSTNESS

- S3: `coreToolScheduler` emitted "skill X is now available via the
  Skill tool" even when the calling subagent's tool registry did not
  expose SkillTool (subagent's `tools:` allowlist excluded `skill`).
  Gate the reminder on `toolRegistry.getTool(ToolNames.SKILL)`.
- S4: `extensionManager.refreshMemory` used `Promise.all` so a
  rejection from skill or subagent refresh nuked the other leg AND
  the hierarchical-memory refresh below it. Switch to
  `Promise.allSettled`, log each rejection, and `await` the
  hierarchical refresh too (the existing comment already justified
  awaiting it; the code didn't).
- S9 / S12: `docs/users/features/skills.md` claimed `paths:` only
  gates model discovery and slash invocation always works. True for
  the user-side path itself, but if the model then tries to chain
  off the user's invocation (call `Skill { skill: ... }` itself),
  validateToolParams returns "gated by path-based activation" —
  contradicting the doc. Rephrase to call out the model-side
  limitation explicitly.

DEFERRED

- S6: notifyChangeListeners swallows per-listener errors and the
  reminder still fires. Real concern but the fix needs an API
  shape change (listener-failure signal back to the scheduler);
  worth its own design discussion. Logged here for follow-up.

Adds 12 regression tests across the 7 affected files. 632 tests
pass; types and lint clean.

* fix(skills): activate broad globs on dotfiles + cross-ref FS allowlist

Two more findings from /review:

- S13: picomatch was compiled with `dot: false`, so a broad glob like
  `paths: ['**/*.js']` silently excluded `.eslintrc.js`, `.env`,
  `.github/foo.yml`, etc. The hidden-file exclusion is gitignore-style
  semantics — wrong for activation, where the question is "did the
  model touch a file matching this glob." Switch to `dot: true`.

- S14: `FS_PATH_TOOL_NAMES` is a manually maintained allowlist with no
  compile-time guard — adding a new FS tool without updating the set
  silently drops the tool out of the activation pipeline. Add a
  cross-ref comment at the top of `ToolNames` in `tool-names.ts`
  pointing maintainers at the allowlist site, plus a TODO noting the
  long-term fix is per-declaration `pathFields?: string[]`. The
  cross-cutting refactor is its own PR.

Adds one regression test (`activates broad globs on dotfiles too`)
that pins the dot:true semantics on `**/*.js` matching
`.eslintrc.js`. 211 skill-area tests pass.

* fix(skills): per-tool extraction dispatcher (LSP URI + grep glob + integration test)

Four findings from /review on the activation extractor:

C1 (Critical): LSP allowlisted but the extractor pushed `filePath`
  through unchanged. The LSP tool accepts non-file URI schemes
  (`http://`, `git://`, etc.); forwarding any of those to
  SkillActivationRegistry as a project-relative candidate let an
  LSP call against a non-file resource activate path-gated skills
  without the model touching a real project file. Fix is two-part:
  decode `file://` URIs via `fileURLToPath` (so a project file
  expressed as a URI still activates correctly) and silently drop
  any string containing `://` that's not `file://`.
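The two-part rule can be sketched as follows (`uriToCandidate` is an illustrative name; the shipped code lives inside the LSP branch of the dispatcher):

```typescript
import { fileURLToPath, pathToFileURL } from 'node:url';

// Sketch: decode file:// URIs to paths, silently drop every other
// scheme, and pass ordinary paths through unchanged.
function uriToCandidate(raw: string): string | null {
  if (raw.includes('://')) {
    if (!raw.startsWith('file://')) return null; // http://, git://, ...
    try {
      return fileURLToPath(raw);
    } catch {
      return null; // malformed or non-portable file URL
    }
  }
  return raw; // an ordinary path, forwarded unchanged
}
```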

S1: LSP `incomingCalls` / `outgoingCalls` operate on
  `callHierarchyItem.uri`, not the top-level `filePath`. After
  `prepareCallHierarchy` returns a file-backed item, following the
  hierarchy with that item produced no candidate, so path-gated
  skills for that file stayed dormant. Same URI-aware extraction is
  applied to the nested `uri` field.

S2: grep_search has a path-shaped `glob` field
  (`GrepToolParams.glob`) — distinct from `pattern`, which is a
  regex on contents. The extractor previously ignored `glob`, so
  `grep_search({ pattern, glob: 'src/**/*.ts' })` produced no
  activation candidate even though the call walked every file under
  `src/**/*.ts`. Same `path + glob` join treatment as GLOB.

S3: No scheduler-side integration test covered the
  extractToolFilePaths → matchAndActivateByPaths → reminder-append
  wiring, so a regression there could land while extractor and
  registry unit tests still passed. Added three integration tests
  covering: (a) reminder appended when SkillTool present,
  (b) reminder suppressed when SkillTool absent (subagent case),
  (c) hook not invoked for non-FS tools.

Restructured `extractToolFilePaths` from a generic
`file_path/filePath/path/paths` extractor into a per-tool
dispatcher (`switch` on canonical tool name). The previous generic
shape was overly permissive — every FS tool got every field name,
including ones it doesn't accept — and it was the wrong shape to
add LSP URI semantics to. Per-tool means each branch reflects the
actual `XToolParams` interface.

Test reshape:
- Removed tests asserting cross-tool field acceptance (e.g. grep
  reading `filePath` / `paths`); those documented inaccurate input.
- Added per-tool realistic tests for grep glob, lsp file:// URI,
  lsp callHierarchyItem.uri, lsp non-file scheme dropped.
- Plus the three CoreToolScheduler activation wiring tests.

639 tests pass (was 632); types and lint clean.

DEFERRED

S4: Activation driven from input selector rather than concrete
  matched files. For `glob({ pattern: '**/*.ts' })` the selector
  itself may not match a skill scoped narrower than the query.
  Real concern, but the fix needs typed result-path metadata
  feedback from each tool — a cross-cutting addition to every FS
  tool's return shape. Logged for follow-up.

* fix(skills): make LSP URI tests platform-portable for Windows CI

Two of the new LSP tests in 58836f1c3 hard-coded `file:///proj/...`
URIs. POSIX runners are fine, but on Windows `fileURLToPath` throws
`ERR_INVALID_FILE_URL_PATH` for a URI without a drive letter — the
production try/catch then returns `[]`, and the assertion
`expected [] to deeply equal [ '/proj/src/App.ts' ]` fails.

Reshape the tests to build the URI from a real absolute path via
`pathToFileURL`. The URI shape becomes the host's natural form
(`file:///tmp/...` on POSIX, `file:///C:/.../tmp/...` on Windows),
and the round-trip through `fileURLToPath` always succeeds.

Production code unchanged.
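The portable test shape looks roughly like this (paths are illustrative):

```typescript
import { pathToFileURL, fileURLToPath } from 'node:url';
import * as path from 'node:path';

// Build the URI from a real absolute path instead of hard-coding
// file:///proj/...; the shape is then the host's natural one
// (file:///tmp/... on POSIX, file:///C:/... on Windows), and the
// fileURLToPath round-trip never throws ERR_INVALID_FILE_URL_PATH.
const abs = path.resolve('proj', 'src', 'App.ts');
const uri = pathToFileURL(abs).href;
```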

* fix(skills): XML-escape description/whenToUse; symlinks in skill-load.ts; dot:true in ConditionalRulesRegistry

Agent-Logs-Url: https://github.com/QwenLM/qwen-code/sessions/a56d83ce-cbdf-4213-a90a-888a9f05ee4f

* fix(skills): backport Windows cross-drive guard to ConditionalRulesRegistry

The latest /review (deepseek-v4-pro) flagged divergence between
SkillActivationRegistry (which has the
`pathModule.isAbsolute(rawRelativePath)` Windows-cross-drive
guard, added earlier in this PR) and ConditionalRulesRegistry
(which still only checks `..` / `../` prefixes). On Windows,
`path.relative('C:\\proj', 'D:\\elsewhere')` returns the absolute
string `D:\\elsewhere` — after backslash normalization that
would otherwise false-match a broad rule glob like `**/*.ts`.

Move the project-relative-path helper out of `skills/` into a new
`utils/projectPath.ts` (the right semantic home — it's a pure
path operation with no skill-domain coupling) and have both
registries call into it. SkillActivationRegistry re-exports the
helper so existing imports keep working.

Adds a regression test in `rulesDiscovery.test.ts` for the
off-project path case (covers both POSIX `..` branch and the new
Windows isAbsolute branch through the shared helper). Direct
`path.win32`-parameterized cover already lives in
`skill-activation.test.ts`. 252 skill+rules tests pass.

* test(skills): pin XML escaping on modelInvocableCommands description too

cmd.description already routes through escapeXml in skill.ts:204
(landed in b1d9324f5), but no test pinned the cmd path — only the
skill.description / whenToUse path. Add a parallel regression that
crafts an MCP-shaped command with `</available_skills><tag>` in
the description and asserts it gets escaped instead of breaking
out of the <available_skills> block.

* fix(skills): escape cmd.name; extension skillRoot; surface invalid globs

Three findings from /review (deepseek-v4-pro):

C1: `cmd.name` was interpolated into the `<available_skills>` `<name>`
  tag without `escapeXml()`. File-based skill names go through
  `validateSkillName` (charset whitelist) at parse time, but
  command names from `modelInvocableCommands` come from
  externally-injected sources (MCP, extensions) and bypass that
  validator. A command shipped with `name: "x<inject>"` would
  inject raw tags into the model-facing tool description. Wrap
  `cmd.name` in `escapeXml`, parallel to the existing
  `cmd.description` escape one line below.

C2: `parseSkillContent` in `skill-load.ts` (the extension parser)
  never set `skillRoot: path.dirname(filePath)`. The
  project/user/bundled parser in `skill-manager.ts` does, and
  `registerSkillHooks.ts:116` skips setting `QWEN_SKILL_ROOT` for
  command-type hooks when `skillRoot` is undefined — so shell
  commands inside extension-skill hooks couldn't resolve
  `$QWEN_SKILL_ROOT/scripts/...` references. Add the field.
  Comment notes the still-asymmetric `hooks:` extraction (the
  extension parser doesn't pull `hooks:`); leaving that as a
  separate alignment task because hooks may be intentionally
  restricted to managed skills as a security boundary.

S3: Invalid `paths:` globs were only logged at debug level.
  Author writes `src/***/file.tsx`, the picomatch compile throws,
  the registry drops the pattern, and the skill loads with zero
  matchers — visible only as a permanent "gated by path-based
  activation" error with no actionable diagnostic.

  Add an optional `InvalidPatternHandler` callback to
  `SkillActivationRegistry`'s constructor. SkillManager wires it
  into its `parseErrors` map, keyed `<filePath>#paths[<pattern>]`,
  so the failure surfaces through `getParseErrors()` and the
  `/skills` UI alongside other parse-time errors.

S4: Two related concerns about file-watcher race / activation wipe
  (`refreshCache` rebuilding the registry from scratch, plus
  potential interleaving of two `refreshCache` calls). Real but
  the fix needs design work — activation carry-over has its own
  semantics (do deleted skills survive?), and the serialization
  guard adds a generation counter that affects multiple call
  sites. Logged for follow-up.

Three regression tests added: cmd.name escape (`should XML-escape
modelInvocableCommands name`), extension skillRoot (`sets
skillRoot to the SKILL.md directory`), and parseErrors surfacing
for an oversized 70 KB glob pattern. 205 skill-related tests pass.

* fix(skills): comprehensive XML-escape + coalesce + parallel listeners

Six findings from /review (deepseek-v4-pro):

C1: skill.name interpolated raw into <available_skills>. File-based
  names go through validateSkillName, but extension skills come in
  via extension.skills (skill-manager.ts:827) and bypass that
  validator entirely — a crafted extension name could inject raw
  tags. Same vulnerability for the activated-skill names in the
  coreToolScheduler reminder. Wrap both in escapeXml.

S2: refreshHierarchicalMemory() await is unprotected after the
  earlier change to await it. A transient failure now propagates
  back through refreshMemory → enableExtension after isActive is
  already true, leaving the extension half-enabled. Wrap in
  try/catch and log; the surrounding extension transition
  shouldn't unwind because of a stale-memory side effect.

S3: escapeXml duplicated between skill.ts and background-tasks.ts.
  Extract to utils/xml.ts; both call sites import from there.

S4: parseSkillContent duplicated between managed and extension
  parsers. Real concern but the cleanup is a real refactor (the
  two parsers diverge on level / hooks / skillRoot wiring), so
  this PR documents the duplication in comments but defers the
  actual extraction to a follow-up to keep this diff focused.

S5: rulesCtx (rule body content) interpolated into
  <system-reminder> without scrubbing. A rule whose content
  contained literal `</system-reminder>` (e.g. a doc rule about
  reminders) would close the envelope early. Apply a targeted
  scrub of the closing-tag literal in the joined body. Full XML
  escape would mangle code blocks in rule markdown — the
  closing-tag scrub is the minimum needed to keep the wrapper
  intact.

S6: notifyChangeListeners awaited listeners sequentially. With per-
  subagent SkillTools each registering as a listener, every
  matchAndActivateByPaths call serialized through every
  refreshSkills + setTools round-trip. Switch to
  Promise.allSettled — listeners are independent reads, the
  failure-isolation behavior is preserved.

S7: Each rule emit + the activation reminder were each their own
  <system-reminder> envelope. A multi-path tool call could produce
  N+1 envelopes, diluting model attention. Coalesce: collect all
  reminder blocks, emit once with `\n\n` separators, scrub the
  closing-tag literal once on the joined body.

Tests added:
- skill.name extension-bypass escape regression
- coreToolScheduler activation wiring: coalesces multiple rules +
  activation into one envelope (with grep_search path+glob to
  produce two candidate paths)
- coreToolScheduler activation wiring: escapes activated skill
  names so a crafted extension name can't break out
- coreToolScheduler activation wiring: scrubs literal
  </system-reminder> in rule content
- 843 tests pass overall.

* fix(skills): symlink scope check + dispose on stopAgent + listener type

Five findings from /review (Qwen3.6-Plus-DogFooding):

C1 + C2 (Critical, same finding cited twice): Symlink target was
  validated for "is a directory" but not for "stays inside
  baseDir". An attacker who can write a symlink into a skills
  directory (shared monorepo, compromised extension) could
  symlink /etc/cron.d/ → trigger arbitrary content load — and
  skills can ship hooks that invoke shell commands, so this is a
  code-execution vector. Apply realpath + prefix check in both
  symlink branches (skill-load.ts AND skill-manager.ts).
  Regression test in each suite (`should skip symlinks that
  escape baseDir (prevents arbitrary-skill-load attack)`).

C3: SkillTool.dispose() existed but was only called from
  ToolRegistry.stop() at full shutdown. Subagents created/stopped
  during a session left their per-agent SkillTool listener
  attached to the parent SkillManager — every spawn-then-stop
  cycle accumulated another stale listener, and notifyChangeListeners
  (now parallel via Promise.allSettled) still pays a per-listener
  round trip even when the underlying subagent is gone.

  Convert InProcessBackend.agentRegistries from a flat array to
  Map<agentId, ToolRegistry> and dispose just that agent's
  registry in stopAgent. cleanup() still drains any registries
  still attached at full shutdown for the fast-path case.

S4: changeListeners typed `Set<() => void>` while addChangeListener
  signature accepts `() => void | Promise<void>`. The runtime
  Promise.resolve().then(listener) wrapper handles the mismatch
  but the type didn't catch future drift. Widen the field type
  to match the parameter signature.

S6: FS_PATH_TOOL_NAMES allowlist has no compile-time guard.
  Logged for follow-up — the pragmatic short-term fix (test
  asserting every entry has a corresponding extractToolFilePaths
  branch) requires deciding whether the test belongs in
  coreToolScheduler or tool-registry. Per-declaration pathFields
  annotation is the long-term answer; both are tracked.

S7: setTools concurrency. Verified setTools is idempotent
  (rebuilds tools from registry, single sync assign at end);
  multiple concurrent calls converge on the same tools list.
  Added an inline note rather than a runtime mutex.

Defer:
- S5: refreshCache wipes all activations. Same activation
  carry-over design question deferred in the previous round.

* fix(skills): listener timeout, full XML escape, allowlist warning + tests

Address inline review feedback:

- skill-manager.notifyChangeListeners: 30s per-listener timeout via
  Promise.race so a hung listener (e.g. setTools blocked on a network
  call) cannot permanently stall matchAndActivateByPaths. Timer is
  unref'd to avoid keeping the event loop alive.

- types.parsePathsField: tighten parse-time validation. Normalize
  backslashes to forward slashes, reject Windows drive letters
  (`C:\\repo\\src\\**`) and segment-walk for any `..` (catches
  `./../*.ts`, `src/../../**`, `..\\secret\\*.ts`). Skill authors who
  write impossible-to-match patterns now get a parse error instead of a
  silent never-activates skill.

- utils/xml.escapeXml: widen to all five XML metacharacters
  (`&<>"'`), not just three. Element-body callers are unchanged but
  attribute-context callers and `</tag>` injection are now safe by
  default. monitorRegistry drops its local copy in favor of the shared
  helper.

- coreToolScheduler.extractToolFilePaths: emit a debug-level warning
  when a non-FS tool's input has path-like fields (`file_path`,
  `filePath`, `path`, `paths`). Surfaces allowlist gaps without
  production noise and helps answer the recurring "why didn't my
  path-gated skill activate?" question.

- Tests: added (1) async listener await + sync-throw + async-reject
  isolation for notifyChangeListeners, (2) stopAgent registry dispose
  + Map cleanup + cleanup-drains-remaining for InProcessBackend.
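The tightened `parsePathsField` checks can be sketched as a standalone predicate (illustrative name; the real code raises a parse error rather than returning a boolean):

```typescript
// Sketch of the parse-time pattern validation: normalize backslashes,
// reject absolute patterns and Windows drive letters, and segment-walk
// for any '..' (catches './../*.ts', 'src/../../**', '..\\secret\\*.ts').
function isValidPathsPattern(pattern: string): boolean {
  const normalized = pattern.replace(/\\/g, '/');
  if (normalized.startsWith('/')) return false; // absolute (POSIX)
  if (/^[a-zA-Z]:/.test(normalized)) return false; // Windows drive letter
  return !normalized.split('/').some((seg) => seg === '..');
}
```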

* fix(skills): harden symlink containment checks

* Revert "fix(skills): harden symlink containment checks"

This reverts commit 7e70a25a3a.

* fix(skills): clear listener timeout, share symlink scope helper

- skill-manager.notifyChangeListeners: clear the per-listener
  setTimeout in `.finally(...)` once the race settles. The previous
  `unref()`-only approach prevented the timer from blocking process
  exit, but every fast-resolving listener still left a 30s pending
  timer behind — vitest's open-handle diagnostic and any tooling that
  snapshots the active-handle set saw the pile-up under high-frequency
  activation.

- New skills/symlinkScope.ts: shared `validateSymlinkScope` helper.
  Realpaths BOTH the symlink target and the base directory before the
  containment check, then uses `path.relative` (rather than
  `realPath.startsWith(base + sep)`) for cross-platform safety. The
  prior asymmetric form — `realpath(target)` against the raw
  `path.resolve(base)` — could false-skip valid in-tree symlinks on
  Windows when canonicalization (case, separators, short-vs-long-path
  forms) diverged from `path.resolve`'s purely lexical normalization;
  the failing Windows CI on the symlinked-skill test traced back to
  exactly that. `path.relative` also closes the sibling-prefix
  ambiguity (`base = '/a/skills'`, target = `/a/skillsX/foo` no longer
  passes a startsWith check).

- skill-load.ts and skill-manager.ts both delegate to the shared
  helper. Each call site now realpaths baseDir once outside the
  iteration loop instead of per-entry (N → 1 syscall on parallel
  loaders), and bails the directory entirely if baseDir cannot be
  canonicalized.

- Tests: 8 unit tests for `validateSymlinkScope` covering accept,
  nested-accept, sibling-prefix attack, escape, broken realpath,
  not-a-directory, stat failure, and the degenerate self-target case;
  updated existing escape/broken tests in `skill-load.test.ts` /
  `skill-manager.test.ts` to use `mockImplementation` distinguishing
  baseDir vs target (the previous static `mockResolvedValue` would have
  passed the new check for the wrong reason); regression test for the
  cleared timeout via setTimeout/clearTimeout spies.
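A sketch of the per-listener guard (`runWithTimeout` is an illustrative name; production uses 30_000 ms and additionally swallows listener errors):

```typescript
// Sketch: race each listener against an unref'd timer, and ALWAYS clear
// the timer once the race settles so fast listeners don't pile up
// pending 30s handles.
async function runWithTimeout(
  listener: () => void | Promise<void>,
  ms: number,
): Promise<void> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timedOut = new Promise<void>((resolve) => {
    timer = setTimeout(resolve, ms);
    // unref'd so a genuinely-waiting timer never blocks process exit
    (timer as { unref?: () => void }).unref?.();
  });
  try {
    // Promise.resolve().then(...) also converts sync throws to rejections
    await Promise.race([Promise.resolve().then(listener), timedOut]);
  } finally {
    clearTimeout(timer); // the fix: no pending handle left behind
  }
}
```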

* fix(skills): segment-aware symlink containment, accepts ..-prefixed names

The previous `rel.startsWith('..')` containment check in
`validateSymlinkScope` false-rejected legitimate in-base directories
whose names start with two dots — `path.relative('/base', '/base/..shared/foo')`
returns `'..shared/foo'`, which is a real filename shape, not a
parent-traversal escape.

Switch to a segment walk: `rel.split(/[/\\]/)[0] === '..'` correctly
distinguishes:
  - `'../foo'`         → segments[0] = '..'      → escapes ✓
  - `'..shared/foo'`   → segments[0] = '..shared' → in-scope ✓
  - `'..bar'`          → segments[0] = '..bar'    → in-scope ✓
  - `'..\\foo'` (Win)  → segments[0] = '..'      → escapes ✓

Tests: two new regressions in `symlinkScope.test.ts` covering the
multi-segment (`..shared/foo`) and single-segment (`..bar`) cases.
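The containment decision, applied after both sides have been realpath'd, can be sketched as a pure function (`relStaysInside` is an illustrative name; the shipped helper also stats the target and handles realpath failures):

```typescript
import * as path from 'node:path';

// Sketch: segment-aware containment on path.relative output.
// - path.relative (not startsWith) closes the sibling-prefix hole
//   (/a/skillsX for base /a/skills) and Windows canonicalization drift.
// - A segment walk (not rel.startsWith('..')) accepts legitimate
//   '..'-prefixed names like '..shared' while still rejecting escapes.
function relStaysInside(
  realBase: string,
  realTarget: string,
  p: path.PlatformPath = path,
): boolean {
  const rel = p.relative(realBase, realTarget);
  if (rel === '') return true; // degenerate self-target
  if (p.isAbsolute(rel)) return false; // e.g. Windows cross-drive
  return rel.split(/[/\\]/)[0] !== '..';
}
```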

---------

Co-authored-by: wenshao <wenshao@U-K7F6PQY3-2157.local>
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: yiliang114 <1204183885@qq.com>
2026-05-05 00:28:53 +08:00
Shaojin Wen
7f5b92b076
feat(core): hint to background long-running foreground bash commands (#3809)
* feat(core): hint to background long-running foreground bash commands

Phase D part (a) of Issue #3634. When a foreground `shell` tool call
runs ≥ 60 seconds and completes (succeeds or errors), append an
advisory line to the LLM-facing tool result suggesting re-running with
`is_background: true` next time.

Why: today a foreground bash that takes minutes (build watcher, soak
test, slow npm install, polling loop) blocks the agent indefinitely.
The user is already paying for the wait; the agent's next turn could
have started running in parallel under `is_background: true`. Sleep
interception (#3684) handled the egregious `sleep N` case at validate
time; this handles the legitimate-but-long case at result time.

Trade-offs:
- Threshold = 60s. Half the existing 120s foreground timeout. Long
  enough that normal `npm install` / `pytest` runs don't trigger;
  short enough that the hint surfaces before the timeout hard-kills.
- Advisory only — the command still runs to completion in the
  foreground for THIS invocation. The advice is for the agent's NEXT
  decision, not a corrective action on the current one.
- Fires on success AND error completions. The advice is the same
  ("background it next time") in both cases.
- Suppressed on aborted (timeout / user-cancel) — those paths already
  surface their own messaging and don't benefit from a "should have
  been background" reminder when the user / system already killed it.

Implementation:
- New constant `LONG_RUNNING_FOREGROUND_THRESHOLD_MS = 60000` in
  shell.ts, paired with the existing `DEFAULT_FOREGROUND_TIMEOUT_MS`.
- Helper `buildLongRunningForegroundHint(elapsedMs)` exported so
  future surfaces (UI, telemetry) can render the same text without
  duplicating the threshold logic.
- `Date.now()` bracketing around the spawn → `await resultPromise`
  block — mirrors what the background path already captures via
  `entry.startTime`.
- Append happens inside the existing non-aborted result builder;
  zero changes to the cancel / timeout arms.

Tests: 4 new cases — fires on long success, omits on short success,
fires on long error completion, omits on aborted. Uses vi fake timers
to drive wall-clock past the threshold without actually sleeping.

* fix(core): tighten long-run hint suppression + boundary tests + post-truncation insertion

Addresses 8 review threads on PR #3809 — 6 from /review bots, 2 from
copilot — covering doc accuracy, code quality, behavioural gaps, and
test coverage.

**Behavioural fixes (real bugs)**:

- **Suppress on external signal kills** (`result.signal != null` with
  `aborted: false`). `shellExecutionService` only sets `aborted` when
  the AbortSignal we passed was triggered, so SIGTERM from container
  shutdown / k8s eviction / OOM killer / sibling process-group reap
  falls through to the non-aborted branch. The advisory shouldn't fire
  there — the process didn't run to its conclusion, so "next time,
  background it" doesn't fit. New test pins this with `signal: 15`
  (SIGTERM), `aborted: false`.

- **Append AFTER `truncateToolOutput`**. Previously the hint was
  appended inside the non-aborted result builder, which meant for
  long outputs it got wrapped in the "Truncated part of the output:"
  envelope — the LLM might read the advisory as part of the command's
  own output. New post-truncation insertion + test that pins ordering
  by mocking `truncateToolOutput` directly (real path needs
  `fs.writeFile` to actually succeed for the replacement branch to
  fire).

- **Hint wording mode-aware**. The dialog mention dropped the
  unconditional "(footer pill + Enter)" specifics, which would mislead
  non-TTY users (`-p` headless / ACP / SDK consumers — no dialog or
  pill exists there). Now qualified as "in interactive mode the
  Background tasks dialog also has...". `/tasks` and the on-disk
  output file are mentioned without qualifier (work in any mode).

**Code quality**:

- **Threshold programmatically coupled to timeout**:
  `LONG_RUNNING_FOREGROUND_THRESHOLD_MS = Math.floor(DEFAULT_FOREGROUND_TIMEOUT_MS / 2)`.
  If the timeout is tuned later, the threshold tracks automatically.

- **Docstring corrected**: removed the misleading "before it gets
  killed by the timeout" claim — the hint is on non-aborted path
  only, so timeout-killed commands never see it. The new docstring
  enumerates all suppression paths explicitly.

- **Removed stale line-number reference**: comment said "mirrors the
  background path's `entry.startTime` capture (line ~781)" which goes
  stale on file edits. Now refers conceptually.

**Test coverage gaps closed**:

- **Off-by-one boundary**: 59_999ms → no hint. Pairs with the existing
  60_000ms-exactly test (which fires) to pin the boundary tightly. A
  regression flipping `>=` to `>` would fail loudly.

- **Timeout path explicit**: previous "aborted" test exercised user-
cancel only. With `vi.useFakeTimers({ toFake: ['Date'] })`,
`AbortSignal.timeout()` isn't faked (it depends on the real timer
subsystem), so `combinedSignal.aborted` stayed false. New test
  follows the pre-existing `should handle timeout vs user cancellation
  correctly` pattern: stubs `AbortSignal.timeout` + `.any` to return
  an already-aborted combined signal, then verifies "Command timed out
  after Nms" appears AND no advisory.

* fix(core): per-invocation long-run threshold + debug-mode + test isolation

Six suggestions from /review's third pass on PR #3809:

**Real semantic fix**:
- Long-run threshold now scales with the EFFECTIVE timeout, not the
  fixed default. A user who sets `timeout: 600_000` (10 min) gets the
  advisory at 5 min, not at 60s — respects the explicit timeout
  intent. Replaced the `LONG_RUNNING_FOREGROUND_THRESHOLD_MS` constant
  with a per-invocation `longRunThresholdFor(effectiveTimeout)` helper.

**Debug-mode visibility**:
- Debug mode previously snapshotted `returnDisplayMessage = llmContent`
  BEFORE the truncation + hint append, so debug-mode users saw the
  pre-hint content while the agent saw the advisory — agent suddenly
  suggesting `is_background: true` had no visible trigger in the TUI.
  Re-sync `returnDisplayMessage` after the hint append (debug-mode
  branch only) so the TUI mirrors what the agent sees.

**Type-safety footgun**:
- `if (typeof llmContent === 'string')` would silently drop the hint
  if `llmContent` ever becomes structured `Part[]`. Added an explicit
  `else` comment documenting the deliberate omission and the conditions
  under which to revisit (no non-string llmContent path exists today).

**Style**:
- Replaced the JSDoc `/** ... */` block on the (now-defunct) constant
  with a plain `//` comment block on the helper, matching the
  `DEFAULT_FOREGROUND_TIMEOUT_MS` / `OUTPUT_UPDATE_INTERVAL_MS` style.

**Test hygiene**:
- Wrapped both `vi.stubGlobal('AbortSignal', ...)` and
  `vi.spyOn(truncateToolOutput, ...)` in `try/finally` so failures
  during the test body don't leak the stub/spy into subsequent tests
  (would cause confusing cascading failures).
- Dropped the internal-roadmap "Phase D part (a)" reference from the
  test comment — future maintainers don't have the context.

**New test**:
- `threshold scales with the user-supplied timeout (not the default)`:
  sets `timeout: 600_000`, advances 100s, verifies no hint. Pins the
  per-invocation coupling so a regression to a fixed constant would
  fail loudly here.

* fix(core): tighten long-run hint suppression + boundary tests + post-truncation insertion (round 4)

Six suggestions from /review's pai/glm-5-fp8 pass on PR #3809:

**Behavioural / UX**:
- **Hint now visible in non-debug TUI too.** Previously only debug
  mode mirrored the hint into `returnDisplay`; non-debug users saw
  the agent suggest `is_background: true` with no visible trigger.
  Now the hint is appended to `returnDisplayMessage` in both modes
  (full mirror in debug, terse-append in non-debug to preserve the
  output-or-status form).

**Test coverage**:
- **Debug-mode re-sync test added.** All other long-run hint tests
  run with `getDebugMode → false`; this one flips it to true and
  asserts the hint appears in `returnDisplay` too. Pins the re-sync
  so a regression that drops the debug branch would fail loudly.
- **Threshold-scaling positive test added.** The negative case
  (`timeout: 600_000`, advance 100s, no hint) was already pinned;
  paired now with the positive case (advance 305s, hint fires) so a
  regression to a fixed 60s threshold is caught at both ends.

**Style / consistency**:
- **`result.signal === null` (was `== null`).** Strict equality to
  match the rest of the file. The `signal` field is typed
  `number | null` so loose equality has identical semantics, but the
  inconsistency was noise.

**Doc clarity (timing semantics)**:
- **Comment explains why elapsedMs is computed BEFORE truncation.**
  Two reviewers disagreed on the timing — one read it as before
  truncation (correct, slightly under-reports), the other as after
  (incorrect read). The intent is to report the COMMAND's runtime,
  not the tool call's total time. Truncation is post-processing,
  not part of "agent blocking time", so excluding it is the right
  semantic. Inline comment now spells this out so future readers
  don't have to infer.

* fix(core): error-path hint surfacing + clock-resilient elapsed + threshold floor + observability

Round 5 of PR #3809 review — 10 threads, mix of Critical and Suggestion:

**Critical fixes**:

1. **Hint survives the error path** (`#OWbA`). When result.error is
   set, coreToolScheduler builds the model-facing functionResponse
   from `error.message` ONLY (not llmContent — see
   convertToFunctionResponse + the toolResult.error branch in
   scheduler:1648-1724). My hint was being silently dropped on
   long-command-failed cases. Now the hint is appended to
   error.message too so the advisory survives whichever branch the
   scheduler takes.

2. **Hint wording de-ambiguated** (`#OU6o`). "prefer re-running with
   is_background: true" was ambiguous — model could read it as
   "re-run THIS command in the background", which on stateful
   commands (DB migrations, deploys, git push) would cause double
   side effects. Reworded to "Next time you run a SIMILAR
   long-running process..." with an explicit parenthetical that
   warns against re-running the just-completed command.

3. **Debug observability** (`#OU6s`). Added `debugLogger.debug` at
   the hint decision point with elapsedMs / threshold / aborted /
   signal — when a user reports "my 65s command didn't get the
   hint" the suppression branch is now visible in DEBUG output.

**Other behaviour fixes**:

4. **Threshold floor of 1000ms** (`#OU6r`). Pathological
   `timeout: 0` / `timeout: 1` would have given a 0-ms threshold,
   firing the hint on every invocation showing "ran for 0s".
   Floor at 1s makes that branch unreachable.
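   The floor folds into the helper roughly like this (a sketch; the
   1000 ms constant is from this commit, the halving from the earlier
   round in this series):

   ```typescript
   // Sketch: Math.max keeps pathological timeouts (e.g. `timeout: 1`)
   // from producing a 0 ms threshold that fires on every invocation.
   const MIN_LONG_RUN_THRESHOLD_MS = 1000;

   function longRunThresholdFor(effectiveTimeoutMs: number): number {
     return Math.max(MIN_LONG_RUN_THRESHOLD_MS, Math.floor(effectiveTimeoutMs / 2));
   }
   ```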

5. **`performance.now()` instead of `Date.now()`** (`#OU6v`). NTP
   corrections / VM clock drift between capture and read would
   silently make `elapsedMs` negative and skip the hint with no
   observable failure. Monotonic clock prevents that.

6. **Debug mode preserves truncation marker** (`#OU6w` / `#OWCq`).
   Previously `returnDisplayMessage = llmContent` after hint
   clobbered the "Output too long and was saved to: …" line
   appended during truncation. Switched to append-style re-sync in
   BOTH modes so prior content is preserved.

**Test coverage gaps closed**:

7. **Non-debug returnDisplay test** (`#OWCo`). Pinned that the
   user TUI gets the hint in the default (non-debug) mode too.

8. **Test rename** (`#OWCl`). The "debug-mode TUI mirror" test
   passed in non-debug too after the recent refactor; split into
   two tests, one per branch.

9. **Error-path hint test**. Added a test that pins `result.error?.message`
   contains both the original error text AND the hint, covering
   the scheduler-routing-via-error.message path that was silently
   broken before fix #1.

10. **Test: faketimers also fakes `performance`**. Since we
    switched to `performance.now()`, `vi.useFakeTimers({ toFake:
    ['Date'] })` no longer covered the elapsed measurement;
    extended to `['Date', 'performance']` so the threshold tests
    can drive the wall-clock with `advanceTimersByTimeAsync`.

#OU6t (else-comment for the type guard) was already addressed in
the prior round — the explicit else-with-comment is in place;
adding logging there would be noise.

* test(core): cover the MIN_LONG_RUN_THRESHOLD_MS floor branch

PR #3809 review: the new `Math.max(MIN_LONG_RUN_THRESHOLD_MS, ...)`
floor in `longRunThresholdFor` was untested — only default-timeout
and large-custom-timeout cases existed. A regression that strips the
floor would let `timeout: 1` produce a 0ms threshold and fire a
"ran for 0s" advisory on every invocation; the test suite would not
catch it.

New test: build with `timeout: 1`, advance 500ms (below the 1000ms
floor), resolve with `aborted: false` to isolate the threshold logic
from the abort path. Asserts no hint appears. A regression that
removes the floor flips the assertion to fail.

* fix(core): structured delimiter on error.message hint + tighten timeout floor comment

Two of three threads from the latest /review pass on PR #3809 (the
third — PR description / threshold scaling reconciliation — is fixed
in the PR description update, not in code):

- **`\n---\n` divider before hint in `error.message`** (`#Pt7C`).
  Downstream consumers of `error.message` (firePostToolUseFailureHook,
  telemetry grouping, SIEM alerting, hook-side error parsers) were
  receiving ~400 chars of advisory text mixed inline with the
  original error body — pattern-matching on error messages would
  absorb the advisory into the matched body. Added a `---` separator
  line so the boundary is unambiguous and split-able.
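  The split-ability the divider buys can be sketched as follows (the
  advisory text in the test is a stand-in, not the real hint wording):

  ```typescript
  // Sketch: a downstream error-message consumer recovering the original
  // error body. The '---' line is the unambiguous boundary this commit adds.
  const HINT_DIVIDER = '\n---\n';

  function stripAdvisory(errorMessage: string): string {
    const idx = errorMessage.indexOf(HINT_DIVIDER);
    return idx === -1 ? errorMessage : errorMessage.slice(0, idx);
  }
  ```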

- **Threshold-floor comment narrowed to `timeout: 1`** (`#Pu9o`).
  The comment said the floor guards `timeout: 0` / `timeout: 1`, but
  `validateToolParamValues` rejects `timeout <= 0` at validate time,
  so `timeout: 0` can't reach `longRunThresholdFor`. Updated the
   comment to mention only the actually-allowed pathological case
   (`timeout: 1`; any allowed value `< 2` halves to a 0ms threshold).

Test updated to assert the `---` divider format with `toMatch`.

* fix(core): capture executionStartTime AFTER spawn so PTY import isn't counted

PR #3809 review: copilot caught that `executionStartTime` was
captured BEFORE `await ShellExecutionService.execute(...)`, which
meant the elapsed measurement included `getPty()` dynamic-import
setup (~50-200ms on first call). The hint's "ran for Xs" reading was
slightly inflated, and the comment claiming "spawn → settle" wasn't
strictly accurate.

Moved the capture to immediately after the execute() call returns its
{ result, pid } handle. The pid being set by that point confirms the
process has been spawned, so the subtraction truly measures spawn-to-
settle. Comment updated to reflect the actual semantics.

The displayed accuracy gain is small (50-200ms on a 60s+ threshold
is <1%), but the comment claim now matches what the code measures.
Tests unaffected — fakeTimers don't drive real dynamic imports, so
the threshold tests behave identically.

* fix(core): align long-run hint code/tests with ShellExecutionResult.error semantics

Four copilot threads on PR #3809 — all rooted in the same
observation: `ShellExecutionResult.error` is reserved for
spawn/setup failures (per the field's doc comment in
shellExecutionService.ts), NOT for non-zero exit codes. My existing
code/tests conflated the two, making the error-path coverage less
realistic and the inline comments inaccurate.

**Test shape fixes**:

- `appends the hint when a long-running foreground command exits
  with error` → `exits non-zero`. Changed `error: new Error('exit
  1')` to `error: null` (the realistic shape for a non-zero exit
  without spawn failure). Added a comment explaining the field
  contract so future test authors don't repeat the conflation.

- `hint survives the error path (appended to error.message)`:
  reframed the mock from `spawn ENOENT` (which would resolve in
  <1s in practice, making the long-elapsed scenario unrealistic)
  to `PTY initialization failed after 75s` — a slow-spawn-failure
  shape that COULD plausibly take 75s. Test still pins the same
  CODE PATH; comment now acknowledges the edge-case nature
  ("rare but real: PTY init dragging, remote-fs exec syscalls,
  security scanners interposing").

**Comment corrections**:

- `returnDisplayMessage` build-order comment was misleading. It
  said "the hint is appended after both the truncation block and
  the returnDisplayMessage build" — but `returnDisplayMessage` is
  built BEFORE truncation. Replaced with a chronological enumeration
  (1. initial value, 2. truncation marker append, 3. hint append)
  that matches what the code actually does.

- Error-path preservation comment now acknowledges the narrow
  applicability (spawn failures only, exit codes don't reach this
  branch). Code is unchanged — the path is still real, just rare.

* test(core): pin empty-output success + background-no-hint paths

Two defensive tests for the long-running foreground hint:

- empty-output success at >=60s — exercises the
  returnDisplayMessage='' → hint append branch (write-only commands
  like `tar czf` / `cp -r` produce no stdout). Asserts the user-
  facing returnDisplay still surfaces the advisory even when the
  command produced nothing else to show.

- background never includes the hint — the foreground hint logic
  lives in executeForeground only, so today this can't fail; the
  test guards against a future refactor hoisting the advisory into
  a shared post-execute path that would tag every background launch
  with a nonsensical "ran for 0s, consider is_background: true"
  suggestion.
2026-05-04 23:24:14 +08:00
Shaojin Wen
efb7351d58
feat(core): support reasoning effort 'max' tier (DeepSeek extension) (#3800)
* feat(core): support reasoning effort 'max' tier (DeepSeek extension)

DeepSeek's chat-completions endpoint added an extra-strong `max` tier
to `reasoning_effort` (per
https://api-docs.deepseek.com/zh-cn/api/create-chat-completion ; valid
values are now `high` and `max`, with `low`/`medium` mapping to `high`
for backward compat). Plumb it end-to-end:

- `ContentGeneratorConfig.reasoning.effort` union now includes 'max'.
- DeepSeek OpenAI-compat provider: translate the standard nested
  `reasoning: { effort }` shape into DeepSeek's flat `reasoning_effort`
  body parameter so user-configured effort actually takes effect (the
  nested shape was previously sent verbatim and silently ignored,
  defaulting to `high`). low/medium → high mirrors the documented
  server-side behavior so dashboards / logs match wire reality.
  An explicit top-level `reasoning_effort` (via samplingParams or
  extra_body) wins over the nested form.
- Anthropic converter: pass 'max' through to `output_config.effort`
  unchanged and bump the `thinking.budget_tokens` budget for the new
  tier (low 16k / medium 32k / high 64k / max 128k).
- Gemini converter: clamp 'max' to HIGH since Gemini has no higher
  thinking level. Without this, 'max' would silently fall through to
  THINKING_LEVEL_UNSPECIFIED.
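
A rough sketch of the nested-to-flat rewrite from the second bullet
(request shapes simplified; the real logic lives in the DeepSeek
OpenAI-compat provider):

```typescript
// Sketch only: shapes simplified to the two fields involved.
type Effort = 'low' | 'medium' | 'high' | 'max';

interface CompatBody {
  reasoning?: { effort?: Effort };
  reasoning_effort?: string;
}

function translateReasoningEffort(body: CompatBody): CompatBody {
  const effort = body.reasoning?.effort;
  // An explicit top-level reasoning_effort wins over the nested form.
  if (!effort || body.reasoning_effort !== undefined) return body;
  // low/medium map to high, mirroring the documented server-side
  // behavior so client logs match wire reality.
  const flat: Effort = effort === 'low' || effort === 'medium' ? 'high' : effort;
  const out: CompatBody = { ...body, reasoning_effort: flat };
  delete out.reasoning;
  return out;
}
```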

Live verification against api.deepseek.com:
- `reasoning_effort: high` → 200
- `reasoning_effort: max` → 200 (the new tier)
- `reasoning_effort: bogus` → 400 with valid-set list confirming
  [high, low, medium, max, xhigh]

108 anthropic/openai-deepseek/gemini tests pass; full core suite
(6601 tests) green; lint + typecheck clean.

* fix(core): map xhigh→max + clamp max on non-DeepSeek anthropic + docs

Address PR review (copilot × 2) and add missing user docs:

1. (J698) `translateReasoningEffort` claimed in the PR description that
   it surfaces the DeepSeek backward-compat mapping client-side, but
   only handled `low`/`medium` → `high`. Add `xhigh` → `max` to match
   the doc and stay symmetric with the low/medium branch.

2. (J6-A) `output_config.effort: 'max'` would have been emitted on
   any anthropic-protocol provider whenever a user configured it, even
   when the baseURL points at real `api.anthropic.com` (which only
   accepts low/medium/high and would 400). Reuse the existing
   `isDeepSeekAnthropicProvider` detector to clamp `'max'` → `'high'`
   on non-DeepSeek anthropic backends, with a debugLogger.warn so the
   downgrade is visible. DeepSeek anthropic-compatible endpoints still
   pass through unchanged.

3. New docs:
   - `docs/users/configuration/model-providers.md`: a "Reasoning /
     thinking configuration" section under generationConfig — single
     example targeting DeepSeek + a per-provider behavior table
     (OpenAI/DeepSeek flat reasoning_effort, OpenAI passthrough for
     other servers, real Anthropic clamp, Anthropic-compatible
     DeepSeek passthrough, Gemini thinkingLevel mapping).
   - `docs/users/configuration/settings.md`: extend the
     `model.generationConfig` description to mention `reasoning`
     (the field was undocumented before this PR even though it
     already existed as a typed field) and link to the new section.

96 anthropic + deepseek tests pass; lint + typecheck clean.

* refactor(core): single-source effort normalization for anthropic + doc fix

Address PR review round 2 (copilot × 2):

1. (J8aG) The `contentGenerator.ts` comment claimed passing
   `reasoning.effort: 'max'` to real Anthropic was "up to the user",
   but commit b5b05ae actively clamps 'max' → 'high' (with a debug
   log) on non-DeepSeek anthropic backends. Update the comment to
   describe current runtime behavior.

2. (J8aL) The clamp ran inside `buildOutputConfig()` only — the effort
   label was downgraded but `buildThinkingConfig()` still used the
   raw user value to size the budget, so a non-DeepSeek anthropic
   request could end up with `output_config.effort: 'high'` paired
   with a 'max'-sized 128K thinking budget. Inconsistent label vs.
   budget on the wire.

   Refactor: hoist the normalization into a single
   `resolveEffectiveEffort()` helper that runs once per request in
   `buildRequest()`. Both `buildThinkingConfig` and `buildOutputConfig`
   now consume the same clamped value, so the budget ladder and the
   effort label stay aligned. The debug log fires once per request.

Add a regression test asserting that on a non-DeepSeek anthropic
provider with `effort: 'max'` configured, the wire request carries
both `output_config.effort: 'high'` AND `thinking.budget_tokens:
64_000` (the 'high' tier), not the 128K 'max' budget.
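
The single-source shape can be sketched as below. The budget ladder is
from the earlier commit in this series; `isDeepSeekBackend` is a
stand-in for the real hostname detector:

```typescript
// Sketch: one clamp site feeds both the effort label and the thinking
// budget, so the two can never disagree on the wire.
type Effort = 'low' | 'medium' | 'high' | 'max';

const BUDGET_TOKENS: Record<Effort, number> = {
  low: 16_000,
  medium: 32_000,
  high: 64_000,
  max: 128_000,
};

function resolveEffectiveEffort(effort: Effort, isDeepSeekBackend: boolean): Effort {
  // Real api.anthropic.com only accepts low/medium/high; 'max' would 400.
  return effort === 'max' && !isDeepSeekBackend ? 'high' : effort;
}

function buildWireFields(effort: Effort, isDeepSeekBackend: boolean) {
  const effective = resolveEffectiveEffort(effort, isDeepSeekBackend);
  return { effort: effective, budget_tokens: BUDGET_TOKENS[effective] };
}
```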

96 tests pass; lint + typecheck clean.

* fix(core): tighten 'max' clamp + warn-once + strip reasoning_effort on side queries

Address PR review round 3 (copilot × 3):

1. (J-2v) When request.config.thinkingConfig.includeThoughts is false,
   pipeline.buildRequest's post-processing only deleted the nested
   `reasoning` key. The DeepSeek provider's translateReasoningEffort
   may have already flattened an extra_body-injected reasoning into
   top-level `reasoning_effort` by that point, so a side query (e.g.
   suggestionGenerator) could still ship reasoning_effort on the wire.
   Extend the post-processing to also delete `reasoning_effort`.

2. (J-2z) The warn for clamping 'max' on non-DeepSeek anthropic ran on
   every request needing the downgrade — the docstring claimed "first
   time only" but the implementation didn't latch. Add a private
   `effortClampWarned` boolean on the generator so the warning fires
   once per generator lifetime.

3. (J-23) `resolveEffectiveEffort` used the broad
   `isDeepSeekAnthropicProvider` detector for the clamp decision, but
   that helper falls back to model-name matching to cover sglang/vllm
   self-hosted DeepSeek deployments. A model configured as e.g.
   "deepseek-distill" but routed to real api.anthropic.com would
   bypass the clamp and trigger HTTP 400. Split the detector: keep
   `isDeepSeekAnthropicProvider` (broad) for the thinking-block
   injection workaround where false-positives are harmless, and add
   `isDeepSeekAnthropicHostname` (hostname-only) for decisions where
   a model-name false-positive would route DeepSeek-only behavior to
   a stricter backend. The clamp now uses the hostname-only check.

New regression test: a config with model name containing "deepseek"
but baseURL pointing at api.anthropic.com still clamps `'max'` to
`'high'`. Existing "passes max through" test updated to set a
DeepSeek baseURL since model name alone no longer suffices for the
clamp bypass.

385 tests pass; lint + typecheck clean.

* docs(core): correct pipeline timing comment + samplingParams caveat

Address PR review round 4 (copilot × 3) — three documentation accuracy
fixes, no behavior change:

1. (KBcw) The post-processing comment in pipeline.ts misdescribed the
   call order ("after this branch already ran during the same
   buildRequest pass") — provider.buildRequest actually runs BEFORE
   the includeThoughts=false post-processing in the same pass.
   Reword to match the actual order: provider hook flattens nested
   reasoning to reasoning_effort first, this cleanup runs after and
   strips both shapes.

2. (KBdC, KBdE) The "Reasoning / thinking configuration" section in
   model-providers.md and the model.generationConfig description in
   settings.md both implied `reasoning` is honored on every provider.
   For OpenAI-compatible providers, when `generationConfig.samplingParams`
   is set, `ContentGenerationPipeline.buildGenerateContentConfig()`
   ships samplingParams verbatim and skips the separate `reasoning`
   injection entirely. Configs like
   `{ samplingParams: { temperature: 0.5 }, reasoning: { effort: 'max' } }`
   would silently drop the reasoning field on OpenAI/DeepSeek
   requests.

   Add an explicit "Interaction with samplingParams" warning section
   in model-providers.md and a parenthetical note in settings.md
   directing users to put `reasoning_effort` inside `samplingParams`
   (or `extra_body`) when both are configured.

385 tests pass; lint + typecheck clean.

* docs(core): clarify explicit budget_tokens bypasses 'max' effort clamp

When user sets `{ effort: 'max', budget_tokens: N }` on a non-DeepSeek
anthropic backend, the effort label gets clamped to 'high' (otherwise
the server 400s on the unknown enum) but the explicit budget_tokens is
preserved verbatim. The wire-shape mismatch is intentional, not a bug:
the clamp only protects the enum field, while budget is a free integer
the server accepts within the context window, so an explicit override
stays explicit. Document the contract on the early-return and add a
regression test that locks it in.

* docs(deepseek): fix comments to match flatten + reasoning-strip behavior

Two doc-only nits called out in review:

1. `buildRequest` JSDoc said non-text parts are "rejected", but
   `flattenContentParts` actually substitutes a textual placeholder
   (`[Unsupported content type: <type>]`) so the request still goes
   through with a breadcrumb. Reword the JSDoc accordingly.

2. `translateReasoningEffort`'s strip comment claimed it strips the
   nested form to avoid shipping both shapes, but it only drops the
   duplicated `effort` key when other keys (e.g. `budget_tokens`) are
   present. Reword to describe the actual selective behavior and why
   keeping orthogonal keys is intentional.

Behavior unchanged.

* fix(deepseek): gate reasoning_effort translation on actual DeepSeek hostname

The provider class is selected via the broader `isDeepSeekProvider`
check, which falls back to model-name matching to cover self-hosted
DeepSeek deployments (sglang/vllm/ollama, see #3613). That fallback is
the right call for content-part flattening — it's a model-format
constraint baked into the model itself, not the API surface.

But the same broad detection was also gating
`translateReasoningEffort`, which rewrites the standard
`reasoning: { effort }` config into DeepSeek's flat `reasoning_effort`
body parameter. That's a wire-shape decision, not a model-format one:
strict OpenAI-compat backends in self-hosted setups may not accept the
DeepSeek extension and would have happily handled the original shape.

Split the two decisions: keep `isDeepSeekProvider` (broad) for
flattening, add a hostname-only `isDeepSeekHostname` and gate the body
rewrite on it. Self-hosted DeepSeek users who actually want the
translation can either use a baseUrl containing api.deepseek.com or
inject `reasoning_effort` directly via `samplingParams`/`extra_body`.

Regression tests:
  - self-hosted (sglang) with deepseek-named model + nested
    `reasoning.effort` → flattening still runs, body shape preserved
  - `isDeepSeekHostname` matches api.deepseek.com but not custom hosts

* fix(deepseek): use URL parsing in isDeepSeekHostname; fix log-level docs

CodeQL flagged a high-severity URL substring sanitization issue on the
new `isDeepSeekHostname` helper. The naive
`baseUrl.includes('api.deepseek.com')` check would false-positive on
hostile hosts like `https://api.deepseek.com.evil.com/v1` and
incorrectly inject the DeepSeek-only `reasoning_effort` body parameter
into requests routed elsewhere. Switch to `new URL(...).hostname` with
exact match against `api.deepseek.com` (and `.api.deepseek.com`
subdomains), mirroring `isDeepSeekAnthropicHostname` on the Anthropic
side. Invalid URLs treated as non-DeepSeek.

`isDeepSeekProvider` already routes through `isDeepSeekHostname`, so
the hardening applies to both decision paths.

Regression tests cover:
  - subdomain match (us.api.deepseek.com)
  - hostile substrings (api.deepseek.com.evil.com,
    evil.com/api.deepseek.com/v1, api.deepseek.comevil.com,
    api-deepseek-com.example.com)
  - invalid / empty baseUrl
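
A sketch of the hardened check (hostname constant from the commit;
helper shape assumed):

```typescript
// Parse the URL and match the hostname exactly (plus real subdomains)
// instead of substring-matching the raw baseUrl string.
function isDeepSeekHostname(baseUrl: string): boolean {
  try {
    const { hostname } = new URL(baseUrl);
    return hostname === 'api.deepseek.com' || hostname.endsWith('.api.deepseek.com');
  } catch {
    return false; // invalid / empty baseUrl treated as non-DeepSeek
  }
}
```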

Also fix two doc-level mismatches: the `'max'` clamp on Anthropic logs
via `debugLogger.warn` (warning level, once per generator), not "with
a debug log". Update both `ContentGeneratorConfig.reasoning` JSDoc and
the per-provider behavior table in model-providers.md.

* feat(deepseek): emit thinking:disabled signal when reasoning is off

DeepSeek V4+ defaults `thinking.type` to `'enabled'`, so just stripping
`reasoning_effort` from the request leaves the server happily thinking
on side queries — paying full thinking latency/cost without an effort
configured. Per yiliang114's review, emit the explicit
`thinking: { type: 'disabled' }` field on the wire whenever reasoning
is disabled.

Triggered when either:
  - `request.config.thinkingConfig.includeThoughts === false` (forked
    queries, e.g. suggestion generation)
  - `contentGeneratorConfig.reasoning === false` (config-level opt-out)

The previous post-processing block only fired on the per-request opt-out
path, so the config-level case was already leaking. Unify both under a
single `reasoningDisabled` predicate that runs the same strip + signal
logic.
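
A sketch of the unified opt-out (shapes simplified; field names follow
the commit, the function name is hypothetical):

```typescript
// Both disable paths strip the effort fields; the explicit V4
// thinking:'disabled' signal is only emitted for api.deepseek.com,
// since self-hosted or older servers may 400 on the unknown key.
interface Body {
  reasoning?: unknown;
  reasoning_effort?: string;
  thinking?: { type: 'disabled' };
}

function applyReasoningOptOut(
  body: Body,
  opts: { includeThoughts: boolean; reasoningConfig: unknown; isDeepSeekHost: boolean },
): Body {
  const reasoningDisabled =
    opts.includeThoughts === false || opts.reasoningConfig === false;
  if (!reasoningDisabled) return body;
  const out: Body = { ...body };
  delete out.reasoning; // strip both shapes on any opt-out path
  delete out.reasoning_effort;
  if (opts.isDeepSeekHost) {
    out.thinking = { type: 'disabled' };
  }
  return out;
}
```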

Hostname-gated to `api.deepseek.com` (and subdomains): self-hosted
DeepSeek behind sglang/vllm/ollama, or older DeepSeek versions, may
not accept the V4 thinking parameter — pushing it there could trip an
unknown-key 400. Mirrors the round-7 decision to gate
`reasoning_effort` translation on hostname.

Regression tests cover all four matrix points:
  - DeepSeek hostname + includeThoughts false → emits disabled
  - DeepSeek hostname + reasoning false → emits disabled
  - non-DeepSeek hostname + includeThoughts false → does not emit
  - self-hosted DeepSeek (model-name fallback only) → does not emit

Docs: extend the `reasoning: false` section with the new behavior and
the self-hosted/non-DeepSeek caveat.

* refactor(deepseek): expose isDeepSeek* as free functions; clarify docs

Two doc/coupling nits from review:

1. The pipeline post-processing block was importing the concrete
   `DeepSeekOpenAICompatibleProvider` class just to reach
   `isDeepSeekHostname`. That couples the generic OpenAI pipeline to a
   specific provider implementation. Promote the helper (and its broad
   `isDeepSeekProvider` sibling) to free `export function`s in
   `provider/deepseek.ts` and import them by name. The class keeps thin
   static delegates for backward compat with existing callers and tests.

2. The per-provider behavior table on `model-providers.md` said
   `'low'/'medium' → 'high'` and `'xhigh' → 'max'` "client-side", but
   that normalization only fires inside `translateReasoningEffort`,
   which runs on the nested `reasoning.effort` config path. Explicit
   top-level overrides via `samplingParams.reasoning_effort` or
   `extra_body.reasoning_effort` skip the rewrite and ship verbatim.
   Reword the row to reflect that.

Behavior unchanged.
2026-05-04 22:42:23 +08:00
Shaojin Wen
fcefab6df5
fix(core): clear FileReadCache on every history rewrite path (#3810)
* fix(core): clear FileReadCache after microcompaction

Microcompaction (the idle-cleanup pass that runs at the start of every
new user/cron message) replaces old read_file / shell / glob / grep /
edit / write_file tool outputs with a `[Old tool result content cleared]`
placeholder. The FileReadCache, however, still records the prior full
Reads as "seen in this conversation" — so the next ReadFile of an
unchanged file returns the file_unchanged placeholder pointing at bytes
the model can no longer retrieve from history. The result is a Read
that succeeds at the tool layer but delivers no usable content to the
model, which is the failure mode reported in #3805 ("read tool returns
no content in long-running sessions").

This mirrors the existing post-compaction clear in tryCompressChat —
microcompaction has the same "history rewrite invalidates the cache's
'model has seen this' assumption" property, it was just missed when the
cache was wired in.
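
The invalidation contract can be sketched as below (shapes simplified;
the class name and method shapes are illustrative, not the real API):

```typescript
// The cache's "model has already seen these bytes" assumption only
// holds while the bytes remain reachable in history, so every history
// rewrite must clear() it, or a later read of an unchanged file returns
// a placeholder pointing at content the model can no longer retrieve.
class FileReadCacheSketch {
  private seen = new Map<string, string>();

  read(path: string, content: string): string {
    if (this.seen.get(path) === content) {
      // Unchanged since the last read: point the model at prior history.
      return '[file_unchanged placeholder]';
    }
    this.seen.set(path, content);
    return content;
  }

  clear(): void {
    this.seen.clear();
  }
}
```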

* fix(core): clear FileReadCache on every history rewrite path

PR1 only patched microcompaction, but a multi-round audit found four
more entry points that rewrite history without clearing the cache,
producing the same `file_unchanged` placeholder vs. missing-content
mismatch. Each is fixed in the same minimal way (clear() at the call
site) and covered by a regression test:

- GeminiClient.setHistory     — /restore checkpoint, /load_history
- GeminiClient.truncateHistory — rewind in AppContainer
- GeminiClient.resetChat       — public API; clearCommand happens to
  clear the cache via startNewSession beforehand, but other callers
  have no such guarantee
- stripOrphanedUserEntriesFromHistory — Retry path drops trailing user
  entries that may include read_file functionResponses

Also tightened the microcompaction comment ("compactable tool outputs"
instead of an enumerated list, since the source of truth is
microcompact.COMPACTABLE_TOOLS) and removed caller references per the
codebase comment style.

Reverse-tested every new clear() by commenting it out and confirming
the matching regression test fails.

* test(core): integration test for FileReadCache + history rewrite

End-to-end tests using the real ReadFileTool, real FileReadCache,
real microcompactHistory, and a real on-disk file. Three cases:

1. Without a cache clear after microcompact, the second Read of an
   unchanged file returns the file_unchanged placeholder while the
   prior content has already been wiped from history. Demonstrates
   the failure mode this PR fixes.
2. After an explicit cache.clear(), the second Read re-emits the
   real bytes. Demonstrates that the fix works.
3. When microcompact removes every prior read of a file, the
   placeholder leaves zero recoverable bytes — the model literally
   cannot find the content anywhere it can reach.

These complement the existing unit tests in client.test.ts (which
verify the call-site wiring) by proving the end-to-end behaviour
through the real code paths, without mocks.

* chore(core): add traceable debug log for every FileReadCache clear

Per review feedback: the new clear() call sites were silent, leaving
no breadcrumb in production debug streams when the cache is dropped.
Adds a `[FILE_READ_CACHE] clear after <reason>` log at every clear
site (5 new + 1 pre-existing in tryCompressChat) so operators can
grep one prefix and see why the cache was invalidated.

* chore(core): refine truncateHistory cache clear + extract test helper

Per review feedback (deepseek-v4-pro):

1. truncateHistory now skips the cache clear when keepCount >=
   prevLen, since a no-op truncate leaves the cache valid against the
   unchanged history. Adds a regression test covering both
   keepCount==prevLen and keepCount>prevLen.

2. The 6 cache-spy test cases each repeated the same 4-line mock
   setup. Extract a `mockFileReadCacheClear()` helper so future
   changes to the FileReadCache mock surface only need one edit.

Both are quality-of-implementation tweaks; the underlying fix is
unchanged.

* perf(core): use O(1) getHistoryLength in truncateHistory

Per Copilot review feedback: the previous commit's no-op detection in
truncateHistory called this.getChat().getHistory().length, but
GeminiChat.getHistory() does a structuredClone of the entire history
on every call (line 770 of geminiChat.ts) — paying an O(history)
clone purely to read .length. In long-running sessions with hundreds
of entries this is a meaningful regression.

Adds GeminiChat.getHistoryLength(): O(1), no clone. truncateHistory
switches to it. The behaviour (skip clear when keepCount >= prevLen)
is unchanged.

Also adds:
- Unit tests for GeminiChat.getHistoryLength (empty, after addHistory,
  parity with getHistory().length).
- A regression test asserting truncateHistory calls getHistoryLength
  and NOT getHistory, locking in the perf fix against future drift.
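
A sketch of the accessor pair (shapes simplified to show the cost
difference):

```typescript
// getHistory() structuredClones every entry for caller safety, so
// reading .length through it pays an O(history) clone; the new
// accessor reads the count directly.
class ChatHistorySketch {
  private history: Array<Record<string, unknown>> = [];

  addHistory(entry: Record<string, unknown>): void {
    this.history.push(entry);
  }

  // O(history): defensive clone so callers can't mutate internals.
  getHistory(): Array<Record<string, unknown>> {
    return structuredClone(this.history);
  }

  // O(1): no clone; use when only the count is needed.
  getHistoryLength(): number {
    return this.history.length;
  }
}
```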

* fix(core): close NaN hole + use public ReadFileTool API in tests

Two issues from copilot review:

1. NaN edge case in truncateHistory cache invalidation. The
   "did anything actually change?" check was `keepCount < prevLen`,
   but `Array.slice(0, NaN)` returns [] (history wiped) while
   `NaN < prevLen` is false. That sequence would wipe the chat but
   leave the FileReadCache claiming the model has seen the prior
   reads — exactly the file_unchanged placeholder bug this PR is
   closing. Switched the check to compare actual post-truncate length
   (`newLen < prevLen`), which correctly invalidates whenever entries
   were removed regardless of how `keepCount` was malformed. Added
   a NaN regression test.

2. The integration test cast `tool` to `unknown` to reach the
   protected `createInvocation()` method. Switched to the public
   `tool.buildAndExecute(params, signal)` API so the test exercises
   the same surface real callers use, including build-time schema
   validation.
2026-05-04 22:42:06 +08:00
Shaojin Wen
fbcf59e0b3
docs(core): point background-shell + monitor guidance at both /tasks and the dialog (#3808)
* docs(core): point background-shell guidance at both /tasks and the dialog

Follow-up to PR #3801, fulfilling the "separate small PR" commitment in
its description. The two model-facing strings (`shell.ts` after
spawning a background shell, `task-stop.ts` after requesting cancel)
referenced only `/tasks` as the inspection path, predating the
interactive Background tasks dialog landing at #3488 / #3720 / #3791.
Now that the dialog handles all three kinds (agent / shell / monitor),
both surfaces should be visible to the LLM so it can suggest the right
one based on the user's mode.

Updates:

- `shell.ts:865` (LLM message after `is_background: true` spawn) now
  surfaces both `/tasks` (text, any mode) AND the interactive dialog
  (footer pill + Enter, with detail view + live updates). Output file
  guidance retained.
- `task-stop.ts:110` (LLM message after `task_stop` on a shell) same
  pattern: both surfaces named.
- `task-stop.ts:95` comment updated to enumerate all observation paths
  (including the dialog).
- `monitorRegistry.ts:197` comment fixed — said "/tasks dialog" which
  conflated two distinct surfaces. Split to "Background tasks dialog
  reopens or `/tasks` listings".
- `backgroundShellRegistry.ts:10` (module docstring) and `:31` (shellId
  doc) now mention all three consumers (agent, dialog, slash command).

No behavior change — pure documentation/string update. Tests untouched
(none asserted on these exact strings); build + lint + 152-test core
suite all clean.

* docs(core): address PR 3808 review — 'captured output' + consistent ordering

Three review nits:

1. (LoqU — copilot) `shell.ts:865` said the output file holds "raw
   content", but `shellExecutionService` runs each chunk through
   stripAnsi and skips non-string/binary chunks before writing. Reword
   to "captured output" so callers don't expect a byte-for-byte stream.

2. (LqKr — wenshao) The PR mentioned both surfaces in two different
   orders depending on the file: `backgroundShellRegistry.ts` listed
   the dialog first, while `task-stop.ts` and `shell.ts` listed
   `/tasks` first. Unify on the LLM-facing order — `/tasks` first,
   then the interactive Background tasks dialog — across all four
   sites. Also flips the line-31 docstring on the `shellId` field for
   the same reason.

3. (LqKt — wenshao, flagged for awareness only) Trim the redundant
   keystroke detail in shell.ts:865 to match `task-stop.ts:111`'s
   shorter "(footer pill + Enter)" form. Saves ~7 tokens per
   background shell launch in `llmContent` while still naming both
   surfaces. The PR description's rationale (LLM should know both
   surfaces exist so it can suggest the right one for the user's
   mode) is preserved — only the operational verbosity is trimmed.

581 tests pass; lint + typecheck clean. Pure docs / string update.

* docs(core): grammar polish on PR 3808 strings

Two more wording nits from copilot review:

- backgroundShellRegistry.ts:10 — change "metadata the agent…need to
  query" to "metadata that the agent…use to query". The original
  phrasing reads as if the metadata itself is performing the query.

- shell.ts:865 — change "Read the output file directly for the
  captured output." to "Read the output file directly to view the
  captured output." — clearer instruction to the model/user.

Pure wording, no behavior change.

* docs(core): grammar fix on PR 3808 monitor comment

'not visible from later Background tasks dialog reopens' read as
if 'reopens' was a noun. Reword to 'not visible after reopening
the Background tasks dialog or from /tasks listings'.

* docs(core): round 4 wording polish on PR 3808

Four more nits from copilot:

- shell.ts:865 + task-stop.ts:96,111: "footer pill + Enter" was
  ambiguous now that the footer renders multiple pills (background
  tasks vs other status indicators). Disambiguate to
  "focus the footer Background tasks pill, then Enter".
- monitorRegistry.ts:198: re-tweak my round-3 phrasing —
  "after reopening the Background tasks dialog or from /tasks
  listings" → "in later Background tasks dialog reopens or /tasks
  listings". Reads as "from those surfaces" rather than "after
  reopening", which the reviewer found ungrammatical.
- backgroundShellRegistry.ts:10,31: clarify "/tasks" as the slash
  command, since the codebase also uses "<projectDir>/tasks/..."
  on-disk paths in agent-transcript contexts.

Pure wording, no behavior change. 87 affected tests pass.

* docs(core): mirror /tasks + dialog guidance to monitor llmContent paths

Address @doudouOUC review on PR #3808 — two Medium findings: this PR
updated shell-facing strings to mention both inspection surfaces but
left the parallel monitor strings without any inspection guidance, even
though monitors render in the same /tasks output and the same
Background tasks dialog. Restore symmetry:

- monitor.ts:587-598 — append the same "/tasks (text) or the
  interactive Background tasks dialog (focus the footer Background
  tasks pill, then Enter — detail view + live updates)" sentence to
  the Monitor-started llmContent, mirroring shell.ts:865.
- task-stop.ts:125-131 — the monitor cancellation llmContent had no
  guidance at all. Add the same "Final status will be visible via
  /tasks (text) or the interactive Background tasks dialog (focus the
  footer Background tasks pill, then Enter) once the process drains"
  line that already existed for shells at task-stop.ts:111.

The (Low) commit-churn observation is a maintainer call (squash on
merge); the (Info) snapshot-test gap is pre-existing and not in scope.

78 monitor + task-stop tests pass; lint + typecheck clean.

* docs(core): drop drain phrasing for monitor cancel + restructure dialog comment

Address PR #3808 review round 5 (doudouOUC + copilot × 2):

1. (XNoH copilot, XSBu doudouOUC — Medium) The monitor cancellation
   message inherited "once the process drains" from the shell branch,
   but `monitorRegistry.cancel()` settles synchronously — when the
   tool returns, the entry is already `cancelled`, not waiting on a
   child process. The drain qualifier is accurate for shells (which
   use `requestCancel()` + the AbortController and settle when the
   real process exits) but misleading for monitors.

   Reword the monitor branch in `task-stop.ts:121-130` to drop the
   drain phrasing and add an explanatory comment about the sync vs.
   async difference so future maintainers don't replicate the wording
   from the shell branch by reflex.

2. (XNod copilot — wording, third revision on the same comment)
   Restructure rather than re-litigate the preposition. The
   "reopens" noun framing has gone through three rounds of churn
   (`from later... reopens` → `after reopening...` → `in later...
   reopens` → and now back to `from`). Sidestep the loop by making
   the comment a proper sentence about WHAT the surfaces actually
   read: the persisted `entry.error` is the source of truth; the
   chat-history notification is a separate, ephemeral side channel.
   Avoids the noun-form "reopens" entirely.

Updated test assertion to match the new "Monitor \"...\" cancelled"
prefix. 51 tests pass; lint + typecheck clean.
2026-05-04 22:27:00 +08:00
jinye
2fea1d3aa7
fix(core): address post-merge monitor tool and UI routing issues (#3792)
* fix(core): address post-merge monitor tool and UI routing issues

- Guard token bucket against clock drift after system suspend/resume
  (negative elapsed resets lastRefill instead of starving the bucket)
- Add debugLogger.warn for AST read-only check failures in monitor
  getConfirmationDetails (previously silent catch)
- Consolidate SHELL_TOOL_NAMES: export from rule-parser.ts, import in
  permission-manager.ts (removes identical SHELL_LIKE_TOOLS duplicate)
- Extract hasBlockingBackgroundWork/resetBackgroundStateForSessionSwitch
  to shared backgroundWorkUtils.ts (removes identical copies in
  clearCommand.ts and useResumeCommand.ts)
- Consolidate getToolCallComponent routing into packages/webui (removes
  near-identical copies in ChatViewer.tsx and vscode-ide-companion, adds
  missing web_search compat alias to VSCode path)
- Add test for droppedLines count in terminal notification text
- Add test for exit(null, null) settlement (externally killed process)

🤖 Generated with [Qwen Code](https://github.com/QwenLM/qwen-code)

* fix(core,cli,vscode): address PR #3792 review feedback

- Add debugLogger.warn for clock-drift guard (observability for
  throttle bucket resets after suspend/resume)
- Add test for clock-drift recovery (elapsed < 0 bucket reset)
- Add test for AST parse failure catch path (mockRejectedValueOnce)
- Forward isFirst/isLast props through VSCode ToolCallRouter
  (fixes timeline connector rendering)
- Add test for shell running branch in hasBlockingBackgroundWork

🤖 Generated with [Qwen Code](https://github.com/QwenLM/qwen-code)

* fix(core,vscode): address follow-up review comments

- Fix React.FC missing import: use `import type { FC } from 'react'`
  instead of `React.FC` (original file had this import before refactor)
- Tighten clock-drift test: emit while clock is in the past to confirm
  guard resets lastRefill, then verify refill at the new reference point

🤖 Generated with [Qwen Code](https://github.com/QwenLM/qwen-code)

* fix(core,cli): adopt review feedback — ReadonlySet + hasRunningEntries

- Type SHELL_TOOL_NAMES as ReadonlySet<string> to prevent accidental
  mutation of permission-critical set
- Use BackgroundShellRegistry.hasRunningEntries() instead of
  getAll().some() for zero-allocation short-circuit check
- Update clearCommand test mocks to include hasRunningEntries

🤖 Generated with [Qwen Code](https://github.com/QwenLM/qwen-code)

* fix(core,webui,vscode): address remaining PR #3792 review comments

- Merge AgentToolCall + isAgentExecutionToolCall into single import in
  routing.ts (comment 3178280762)
- Use real getToolCallComponent via vi.importActual in VSCode test mock
  so routing logic is validated, not a parallel mock that can drift
  (comment 3178280775)
- Validate isFirst/isLast forwarding in VSCode test mock via data
  attributes (comment 3178346891)
- Add comment documenting debugLogger.warn no-op tradeoff for clock
  drift guard (comment 3178346889)

🤖 Generated with [Qwen Code](https://github.com/QwenLM/qwen-code)

* test(webui): add unit test for getToolCallComponent routing

Covers all 8 component branches including the web_search compatibility
alias, agent execution detection, case-insensitive matching, and
fallback to GenericToolCall.

🤖 Generated with [Qwen Code](https://github.com/QwenLM/qwen-code)

* fix(cli): add missing hasRunningEntries to useResumeCommand test mocks

The backgroundWorkUtils refactor replaced getAll().some() with
hasRunningEntries(), but the test mocks in useResumeCommand.test.ts
were not updated, causing CI failures.

🤖 Generated with [Qoder Code](https://github.com/QwenLM/qwen-code)

* test(cli): add unit tests for backgroundWorkUtils shared utility

Cover hasBlockingBackgroundWork (6 cases including short-circuit
behaviour) and resetBackgroundStateForSessionSwitch (1 case verifying
all three registries are reset).

🤖 Generated with [Qoder Code](https://github.com/QwenLM/qwen-code)

---------

Co-authored-by: jinye.djy <jinye.djy@alibaba-inc.com>
2026-05-04 21:19:41 +08:00
jinye
03f66bada5
feat(sdk-python): add PyPI release workflow (#3685)
* feat(sdk-python): add pypi release workflow

Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>

* fix(sdk-python): build cli before smoke test

Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>

* fix(sdk-python): tighten release conflict handling

Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>

* fix(sdk-python): harden python release workflow

Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>

* fix(sdk-python): tighten stable release guards

Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>

* fix(sdk-python): harden prerelease publish flow

Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>

* fix(sdk-python): reuse release branches on rerun

Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>

* fix(sdk-python): resume incomplete releases

Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>

* fix(release): tighten missing-release checks

Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>

* fix(sdk-python): resume stable release reruns

Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>

* fix(sdk-python): tighten release recovery guards

Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>

* test(sdk-python): cover release version edge cases

Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>

* fix(sdk-python): address release workflow review feedback

Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>

* refactor(sdk-python): address review feedback on release version script

- Remove unreachable `if (type === 'stable')` branch in bumpVersion();
  the stable path was dead code since getVersion() throws for all
  stable conflicts before calling bumpVersion(). Move nightly conflict
  throw to the call site for symmetry.
- Rename getNextPatchBaseVersion → getNextBaseVersion to reflect that
  the function can return a prerelease base without incrementing patch.
- Add test for preview+nightly coexistence where nightly base is higher.

🤖 Generated with [Qwen Code](https://github.com/QwenLM/qwen-code)

* fix(sdk-python): address remaining review feedback on release workflow

- Fix failure-issue gate to read github.event.inputs.dry_run directly
  instead of steps.vars.outputs.is_dry_run (which is empty when early
  steps fail). Add --repo flag for gh issue create when checkout failed.
- Add diagnostic state table to failure-issue body (RELEASE_TAG,
  PACKAGE_VERSION, PUBLISH_CHANNEL, RESUME_EXISTING_RELEASE, etc.)
- Fix release-notes error swallow: only silence release not found /
  Not Found / HTTP 404, emit a `::warning::` annotation for other gh release view errors.
- Improve validateVersion error messages to use human-readable format
  keys (X.Y.Z, X.Y.Z-preview.N) matching TS sibling convention.
- Filter fully-yanked versions in getAllVersionsFromPyPI.
- Add console.error log when stable is derived from nightly.
- Add bash regex guard for inputs.version to prevent shell injection.
- Use per-release-type concurrency groups (nightly/preview/stable).
- Add jq null-guard checks for all 6 field extractions.
- Remove misleading --follow-tags from git push (lightweight tags).

🤖 Generated with [Qwen Code](https://github.com/QwenLM/qwen-code)

* fix(sdk-python): rename misleading test description

The test asserts that preview/nightly releases return empty
previousReleaseTag, but the name said "same-channel previous
release tags" which implied non-empty values.

🤖 Generated with [Qwen Code](https://github.com/QwenLM/qwen-code)

* fix(sdk-python): address unresolved review comments on release workflow

- Remove -z check in extract_field() that blocked preview/nightly releases
  (previousReleaseTag is legitimately empty for non-stable releases)
- Use static environment.url since step outputs aren't available at job startup
- Use skip-existing for resumed PyPI publish to fill in missing artifacts
- Add AbortSignal.timeout(30s) to PyPI fetch to prevent indefinite hangs
- Add downgrade guard for stable_version_override
- Use a GHA `::warning::` annotation instead of console.error for visibility
- Separate yanked/non-yanked version lists so conflict detection includes
  yanked versions (PyPI still reserves those slots)
- Filter current release from previousReleaseTag to avoid self-reference on resume
- Add tests for yanked conflict detection, downgrade guard, and resume previousReleaseTag

Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>

* fix(sdk-python): address final review round on release version script

- Fix getNextBaseVersion() first-release skip: use pyproject.toml version
  directly when PyPI has no stable versions instead of unconditionally
  incrementing
- Fix getNextBaseVersion() off-by-one: change > to >= so equal prerelease
  base continues the existing line instead of incrementing patch
- Add a `::warning::` annotation when preview auto-bumps due to orphan git
  tags (tag exists without PyPI version or GitHub release)
- Add set -euo pipefail to 5 workflow steps missing it: release_branch,
  persist_source, Create GitHub release, Delete prerelease branch, Create
  issue on failure
- Fix 2 existing tests affected by first-release change, add 4 new tests

🤖 Generated with [Qwen Code](https://github.com/QwenLM/qwen-code)

* fix(sdk-python): use stderr for GHA warning annotations to avoid corrupting JSON stdout

console.log writes to stdout, which gets captured by VERSION_JSON=$(node ...)
in the workflow and corrupts the JSON output for jq. Switch to console.error
so `::warning::` annotations go to stderr (GHA recognizes workflow commands on
both streams). Also add set -euo pipefail to the "Get the version" step for
consistency with other workflow steps.

🤖 Generated with [Qwen Code](https://github.com/QwenLM/qwen-code)

---------

Co-authored-by: jinye.djy <jinye.djy@alibaba-inc.com>
Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>
2026-05-04 21:07:21 +08:00
wenshao
eecb0cfa16 fix(attribution): skip values for env -u NAME and -S string
`env`'s value-taking flags (`-u`/`--unset`, `-S`/`--split-string`) were
not in the wrapper's flag-skip allowlist, so `env -u FOO git commit ...`
left FOO as the next token and the parser treated it as the program —
masking the real `git commit` from attribution detection. Add an
ENV_FLAGS_WITH_VALUE table mirroring the sudo allowlist. Regression
test added.
2026-05-04 06:45:40 +08:00
wenshao
8891b83b85 fix(attribution): last-match --body, symlink leaf canonicalisation, scoped prompt count
- addAttributionToPR: use matchAll/last-match for `--body`/`-b` so the
  trailer lands in the gh-honoured (final) body when multiple flags are
  present. Mirrors addCoAuthorToGitCommit. Adds regression test.
- attachCommitAttribution: also fs.realpathSync the per-file resolved
  path (not just the repo root) so files behind intermediate symlinks
  are matched against canonical keys recordEdit stored, instead of
  silently zeroing attribution and leaking entries past commit.
- incrementPromptCount: scope to SendMessageType.UserQuery — ToolResult,
  Retry, Hook, Cron, Notification are model/background re-entries of
  the same logical turn. Tracking them all inflated the "N-shotted"
  trailer (one user message could become 10-shotted with 10 tool calls).
- AttributionSnapshot: add `version: 1` field; restoreFromSnapshot now
  refuses incompatible versions and validates per-field types so a
  partially-written snapshot can't seed `Math.min(undefined, n) === NaN`
  into git-notes payloads.
- Drop unused permission/escape counters (declared, persisted, never
  read or incremented) — fields, snapshot tolerance, and clear-method
  bookkeeping all removed; AttributionSnapshot interface simplifies.
- isGeneratedFile: switch directory rule from substring `.includes('/dist/')`
  to segment-boundary check (split on `/`) so project dirs like
  `my-dist/` or `xbuild/` don't match. `.lock` removed from the blanket
  extension exclusion — well-known lockfiles already covered by
  EXCLUDED_FILENAMES; hand-authored `.lock` files (e.g. `.terraform.lock.hcl`)
  now stay attributable.
- getClientSurface: document `QWEN_CODE_ENTRYPOINT` as the embedder
  override hook so the always-`'cli'` default is intentional.
2026-05-04 06:42:26 +08:00
wenshao
d78d4cfe22 Merge remote-tracking branch 'origin/main' into feat/commit-attribution
# Conflicts:
#	packages/core/src/tools/shell.test.ts
#	packages/core/src/tools/shell.ts
2026-05-04 06:36:11 +08:00
jinye
e617f20d15
fix(telemetry): suppress async resource attribute warning on startup (#3807)
* fix(telemetry): suppress async resource attribute warning on startup

Disable NodeSDK's default auto-detection of host/process/env resources.
These detectors return async attributes; if a span is exported before
they settle, OTel logs an error-level diag message that surfaces in the
terminal UI. The attributes we actually need (service name, version,
session.id) are already provided synchronously via resourceFromAttributes.

Closes part of #3731

🤖 Generated with [Qwen Code](https://github.com/QwenLM/qwen-code)

* fix(telemetry): add comment and test for autoDetectResources: false

Address review feedback:
- Add inline comment explaining why auto-detection is disabled
  (avoids async attribute warning before first span export)
- Add test assertion that NodeSDK is constructed with
  autoDetectResources: false to prevent silent regression

🤖 Generated with [Qwen Code](https://github.com/QwenLM/qwen-code)

* fix(telemetry): improve comment precision and extend autoDetectResources test coverage

- Clarify comment: error fires on any resource attribute read before
  async detectors settle, not just during export; HttpInstrumentation
  span creation is the typical trigger
- Add autoDetectResources: false assertion to HTTP OTLP and file
  exporter test cases for branch-complete coverage across all three
  NodeSDK construction paths

🤖 Generated with [Qwen Code](https://github.com/QwenLM/qwen-code)
2026-05-03 19:36:03 +08:00
Shaojin Wen
07fdfadc33
feat(cli): include monitors in /tasks + add interactive-mode hint (#3801)
* feat(cli): include monitors in /tasks + add interactive-mode hint

Phase B closure for Issue #3634. Two coupled changes to /tasks:

1. **Bug fix — include monitors.** The command was last touched before
   #3684 / #3791 landed, so it merged only agent + shell entries while
   monitors silently disappeared from the headless / non-interactive /
   ACP listing path. Add a third registry pull from `getMonitorRegistry()`
   and wire monitor through statusLabel / taskLabel / taskId /
   taskOutputPath. Status line includes eventCount (`running (N events)`,
   `completed (exit 0, N events)`, `completed (Max events reached, N
   events)` for auto-stop) and pid where defined.

2. **Soft deprecation hint, scoped to interactive mode only.** Once the
   richer Ctrl+T dialog (#3488 + #3720 + #3791) is available, the text
   dump is the long-form fallback rather than the primary surface. Show
   `Tip: Ctrl+T opens the interactive Background tasks dialog with
   detail view + live updates.` at the top of the output when
   `executionMode === 'interactive'`. Headless / ACP get the bare list
   — they have no dialog to point at and the hint would just clutter.
   Description string also clarified to call out the modal split.

Kept on all three executionModes (no deletion) — `/tasks` is the only
way headless / ACP / SDK consumers can inspect background-task state.

Tests: 4 new cases in tasksCommand.test.ts cover monitor entry
formatting (running with pid, natural completion with exitCode,
auto-stop with error string, failed), the singular `1 event` form,
the interactive-mode hint gating, and the cross-kind merge order.

* fix(cli): address PR 3801 review — exhaustive switch + i18n + extra tests

Three actionable Suggestions from /review's pass:

- `taskLabel` rewritten as a `switch` with a `never`-typed `default`
  arm, matching the structural-safety pattern already used by `taskId`.
  Adding a 4th DialogEntry kind in the future will now flip both
  helpers to compile errors instead of letting `taskLabel` silently
  fall through to `entry.description` (which the new kind may not have).

- Hint string wrapped in `t()` for i18n consistency with the rest of
  the file. The literal stays as the i18n key default, so today's
  output is unchanged.

- Tests: cover `cancelled` monitor status (was the only one without an
  inline assertion) and explicit `acp` execution mode hint suppression
  (pins the suppression rationale so a future regression flipping the
  check to `!== 'non_interactive'` would fail loudly).

* fix(cli): correct /tasks dialog-open hint — Ctrl+T was wrong

Tmux verification on PR #3801 caught that the hint string says "Ctrl+T
opens the interactive Background tasks dialog" but Ctrl+T is actually
bound to the MCP tool descriptions toggle (ContextSummaryDisplay.tsx
lines 110-115). The dialog opens via Down arrow on an empty composer
(focuses the footer pill) followed by Enter (InputPrompt.tsx 947-968).
Same misattribution slipped into PR #3791's first description and was
caught + fixed there before merge — this PR carried the wrong wording
forward in code.

Updates four sites:
- The hint string itself: "Tip: press ↓ from an empty composer then
  Enter to open the interactive Background tasks dialog with detail
  view + live updates."
- The slash-command description: "interactive UI is Ctrl+T" → "interactive
  dialog opens via the footer pill"
- Two inline comments referencing Ctrl+T as the dialog opener
- The interactive-mode hint test now pins on `↓` + `Enter` and
  asserts `not.toContain('Ctrl+T')` so a regression to the wrong
  wording fails loudly.

* fix(cli): address PR 3801 review — exhaustive switch consistency + path-agnostic hint

Four Suggestions from the latest /review pass:

- `statusLabel` rewritten as a single top-level switch with a
  `never`-typed default, matching `taskLabel` / `taskId` /
  `taskOutputPath`. The previous `if`/`if`/fallthrough form would
  silently apply monitor formatting to a future 4th kind.
- `taskOutputPath` gained the same exhaustive default — was the only
  per-kind helper still relying on implicit fallthrough; would
  silently omit a 4th-kind output path while the adjacent helpers
  flip to compile errors.
- Hint wording de-specifies the exact keystroke count: `'Tip: focus
  the Background tasks pill in the footer (use ↓ from an empty
  composer) and press Enter ...'`. Previous "press ↓ then Enter"
  phrasing was wrong when the Arena agent tab bar is present —
  `InputPrompt`'s focus chain routes Down through the tab bar first,
  so a single Down lands there, not on the bg pill.
- Test pin tightened: `[mon_fail] failed: spawn ENOENT (0 events)` is
  now a full-string assertion instead of a prefix match, so a
  regression that drops the `(N events)` suffix from monitor's failed
  branch fails loudly.

* fix(cli): sanitize ANSI escape sequences in /tasks output

deepseek's review pass flagged that monitor description / error fields
are user / process-supplied strings rendered directly to the terminal.
A maliciously-crafted tool description or spawn error containing raw
ANSI control sequences (clear-screen, cursor-move, colour) would
otherwise reach stdout verbatim and corrupt display.

Same risk applies to agent error / description and shell error /
command — all already-existing renderers with the same exposure that
this PR didn't introduce but inherits. So instead of per-field
sprinkling, wrap the joined output once with `escapeAnsiCtrlCodes`
(no-op when no control chars present, so cost is zero in the common
case). One line change in the renderer covers every kind including
any future one.

Test pins the behaviour: a monitor entry with `\x1b[2J` /
`\x1b[31m...` content produces output with no raw ESC bytes and
visible escaped `[...]` sequences.

* docs(cli): tighten escapeAnsiCtrlCodes comments to match actual scope

Two doc-precision Suggestions from copilot's pass on 0840e32f6:

- Source comment claimed `escapeAnsiCtrlCodes` is "a no-op when no
  control chars" but it's narrower than that — it only handles
  sequences matched by `ansi-regex` (CSI / OSC / SGR — anything
  starting with ESC). Isolated C0/C1 control bytes like BEL, BS, VT
  pass through untouched. Updated the comment to enumerate the actual
  scope and call out that `node:util`'s `stripVTControlCharacters`
  would be needed if those become a concern.

- Test comment had a literal raw ESC byte (octal 033) embedded in the
  source — visually showed `^[[...]` in editors that render ESC, but
  was a real ESC byte in the file rather than the escaped ``
  form the sanitizer produces. Rewrote with a literal `` text
  description so what the comment shows matches what the assertions
  check for.

* fix(cli): broaden /tasks sanitization + tighten inner switch exhaustiveness

Addresses 3 of 5 items from doudouOUC's PR 3801 review:

- **Issue 1 (Low) — C0/C1 control byte gap**: switched from
  `escapeAnsiCtrlCodes` (only handles ESC-initiated ANSI sequences) to
  `stripUnsafeCharacters` (one-pass strip of ANSI + VT + C0/C1, with
  TAB/CR/LF preserved). The pre-existing exposure to bare BEL / BS /
  FF / VT bytes via shell entry strings is now closed for all three
  kinds. Test rewritten to cover both ANSI sequences AND bare control
  bytes (BEL, BS), and pins that surrounding printable text and line
  breaks survive.

- **Issue 2 (Low) — inner status switches inconsistent**: the three
  inner `switch (entry.status)` blocks (agent / shell / monitor) used
  `case 'running': default: return 'running'` (or duplicated bodies).
  All three now have explicit `running` cases plus a `never`-typed
  default that throws — matches the outer `switch (entry.kind)`
  pattern and means a future status added to any of `BackgroundTaskEntry`
  / `BackgroundShellEntry` / `MonitorStatus` flips to a compile error
  here instead of silently returning `'running'`.

- **Issue 5 (Nit) — beforeEach default change**: added an inline
  comment explaining why the test default overrides
  `createMockCommandContext`'s `'interactive'` default
  (`'non_interactive'` lets the hint-suppression assertions work
  without each test rebinding context).

Issues 3 and 4 from the review are nits with no action needed (3 is
already documented as intentional; 4 is a UX call about hint length
that's better handled by user feedback than guess-tweaking).
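The one-pass strip described under Issue 1 can be sketched roughly as follows. The name `stripUnsafeCharactersSketch` and both regexes are illustrative assumptions, not the actual implementation — only the contract (strip ESC-initiated ANSI sequences plus bare C0/C1 bytes, preserve TAB/CR/LF) comes from the commit:

```typescript
// Sketch of a one-pass sanitizer in the spirit of stripUnsafeCharacters.
// ANSI_SEQUENCE covers ESC-initiated CSI and OSC sequences; BARE_CONTROL
// covers isolated C0/C1 bytes (BEL, BS, FF, VT, ...) while deliberately
// excluding TAB (\u0009), LF (\u000A), and CR (\u000D).
const ANSI_SEQUENCE =
  /\u001B(?:\[[0-?]*[ -/]*[@-~]|\][^\u0007\u001B]*(?:\u0007|\u001B\\))/g;
const BARE_CONTROL = /[\u0000-\u0008\u000B\u000C\u000E-\u001F\u007F-\u009F]/g;

export function stripUnsafeCharactersSketch(input: string): string {
  // Strip full escape sequences first, then any leftover bare control bytes.
  return input.replace(ANSI_SEQUENCE, '').replace(BARE_CONTROL, '');
}
```

A function limited to the first regex alone would reproduce the original gap: a bare BEL or BS byte with no leading ESC passes straight through.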

* fix(cli): bind status to local before exhaustive switch — fixes tsc build

CI's `tsc --build` (full mode, vs `--noEmit` locally) caught that
`switch (entry.status)` followed by a `never`-typed default reading
`entry.status` doesn't compile. After the case arms exhaust the
discriminated union, TS narrows `entry` itself to `never`, so the
`.status` access in the default arm becomes "Property 'status' does
not exist on type 'never'" + the resulting `any` value can't be
assigned to `never`.

Fix: bind `entry.status` to a local `status` const before the inner
switch. The local stays typed as the per-kind status union and
narrows correctly to `never` at the default arm — `const _exhaustive:
never = status` is then `never = never`, valid.

Standard exhaustive-switch-on-discriminator pattern; doesn't change
runtime behavior or test surface, just gets past TS narrowing on the
nested case.
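A minimal reproduction of the pattern, with illustrative entry types standing in for the real ones:

```typescript
// Illustrative shapes — not the actual BackgroundTaskEntry / ShellEntry types.
type AgentEntry = { kind: 'agent'; status: 'running' | 'completed' };
type ShellEntry = { kind: 'shell'; status: 'running' | 'exited' };
type Entry = AgentEntry | ShellEntry;

function label(entry: Entry): string {
  // The key move: bind the discriminant to a local BEFORE the switch.
  // Switching on entry.status directly would narrow `entry` itself to
  // `never` once the arms are exhausted, so reading entry.status in the
  // default arm fails under full `tsc --build`. The local stays typed as
  // the status union and narrows cleanly to `never` at the default arm.
  const status = entry.status;
  switch (status) {
    case 'running':
      return 'running';
    case 'completed':
    case 'exited':
      return 'done';
    default: {
      const _exhaustive: never = status; // compile error if a status is added
      throw new Error(`Unhandled status: ${_exhaustive}`);
    }
  }
}
```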
2026-05-03 18:45:51 +08:00
Shaojin Wen
cdadbcdb33
feat(cli): wire Monitor entries into combined Background tasks dialog (#3791)

* feat(cli): wire Monitor entries into combined Background tasks dialog

Phase C mirror follow-up of Issue #3634, structurally a clean repeat of
#3720 but for Monitor (third consumer of the kind framework).

Core (MonitorRegistry):
- Add `setStatusChangeCallback` mirroring `BackgroundShellRegistry` /
  `BackgroundTaskRegistry`. Fires synchronously inside `register()` (so
  subscribers see lifecycle start) and inside `settle()` (so subscribers
  see every running → terminal transition: complete / fail / cancel /
  emitEvent's auto-stop at maxEvents).
- Subscriber failures are caught and logged but do not poison the
  registry — same defensive contract as the other two registries.
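The callback wiring described above can be sketched as follows. The class internals here are assumptions for illustration; only the contract (fire synchronously in `register()` and `settle()`, isolate subscriber failures) follows the commit message:

```typescript
type MonitorStatus = 'running' | 'completed' | 'failed' | 'cancelled';

// Illustrative entry shape, not the real MonitorEntry.
interface MonitorEntrySketch {
  monitorId: string;
  status: MonitorStatus;
}

type StatusChangeCallback = (entry?: MonitorEntrySketch) => void;

class MonitorRegistrySketch {
  private monitors = new Map<string, MonitorEntrySketch>();
  private statusChange?: StatusChangeCallback;

  setStatusChangeCallback(cb: StatusChangeCallback): void {
    this.statusChange = cb;
  }

  register(entry: MonitorEntrySketch): void {
    this.monitors.set(entry.monitorId, entry);
    this.notify(entry); // synchronous: subscribers see lifecycle start
  }

  settle(monitorId: string, status: Exclude<MonitorStatus, 'running'>): void {
    const entry = this.monitors.get(monitorId);
    if (!entry || entry.status !== 'running') return;
    entry.status = status;
    this.notify(entry); // every running -> terminal transition notifies
  }

  private notify(entry?: MonitorEntrySketch): void {
    try {
      this.statusChange?.(entry);
    } catch (err) {
      // Subscriber failures are logged but never poison the registry.
      console.error('statusChange subscriber failed', err);
    }
  }
}
```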

CLI:
- `useBackgroundTaskView` subscribes to all three registries (agent +
  shell + monitor) and merges by `startTime`. `DialogEntry` union
  extended with `(MonitorEntry & { kind: 'monitor' })`. `entryId`
  switches over the three kinds and returns the right id field.
- `BackgroundTasksPill.getPillLabel` adds monitor to its KIND_NAMES
  table; grouping order is shell → agent → monitor (monitor last
  because it tends to be the longest-lived, least urgent to glance at).
- `BackgroundTasksDialog`:
  - `rowLabel` returns `[monitor] <description>` for monitor rows.
  - New `MonitorDetailBody` showing command / status / pid / event
    count / dropped lines. No Progress block (monitors don't fire
    activity callbacks per-event).
  - DetailBody dispatcher gains the monitor branch.
- `BackgroundTaskViewContext.cancelSelected` routes monitor cancels via
  `monitorRegistry.cancel(monitorId)`. Monitor's cancel is synchronous
  (settle + abort happen inside the registry), matching the same path
  task_stop already uses.

Tests: 6 new core tests (registry callbacks fire on register / each
terminal transition / not on emitEvent / on auto-stop / clear stops
notifications / subscriber-failure isolation), 4 new pill tests
(singular / plural / 3-kind grouping / running-only filter), dialog
mock extended with `getMonitorRegistry`.

* fix(cli): add exhaustive default arms to DialogEntry switches

ESLint's `default-case` rule requires every switch to have a default
arm even when TypeScript can prove the union is exhaustive. Add
`default: { const _exhaustive: never = entry; throw ... }` to the
five switches added in this PR — same pattern keeps both the runtime
guard and the compile-time exhaustiveness check.

* fix(core): fire statusChange in MonitorRegistry.reset()

The newly-added `setStatusChangeCallback` subscriber misses `reset()`,
so a `/clear` or session reset leaves stale monitor rows visible in the
combined Background tasks dialog until an unrelated register/settle
event happens. Both BackgroundShellRegistry and BackgroundTaskRegistry
already fire statusChange on their reset paths — Monitor was the
outlier.

Fix: fire `statusChange()` (no arg) after `monitors.clear()`, with an
early return when the registry is already empty so we don't notify on
a no-op reset. Two new tests cover both branches.

* fix(cli,core): address PR 3791 review feedback

Four review threads from /review's second pass on top of f26b700:

1. **MonitorDetailBody live counters were stale.** `eventCount` and
   `droppedLines` came from the `useBackgroundTaskView` snapshot, which
   only refreshes on `statusChange`, and `emitEvent` deliberately
   doesn't fire `statusChange` (the per-event churn would burn the pill
   / AppContainer). Fix: extend the existing `selectedEntry` useMemo to
   re-resolve monitors from `monitorRegistry.get()` on each
   `activityTick`, mirroring the agent path. The pre-existing 1s
   wall-clock interval already drives the recompute while the entry is
   running, so no new callback wiring is needed.

2. **Settle reasons weren't persisted on `MonitorEntry`.** `fail()`,
   `complete(exitCode)`, max-events auto-stop, and idle-timeout all
   passed their reason strings only to the chat-history terminal
   notification — opening the dialog detail view after the monitor
   died showed the bare status with no clue why. Add `exitCode?` and
   `error?` fields (mirrors `BackgroundShellEntry`); populate them
   before `settle()` in each path; render them in `MonitorDetailBody`
   with appropriate colour (red for `failed`, warning for
   auto-stopped `completed`).

3. **`monitorCancel` mock had no test asserting it.** Add a dedicated
   test that opens detail on a monitor entry, presses `x`, and verifies
   `monitorRegistry.cancel(monitorId)` was called and the agent
   registry's cancel was NOT called. Pins the kind switch in
   `cancelSelected` so a regression flipping the monitor branch to
   anything else (e.g. `requestCancel`) would fail loudly.

4. **`MonitorStatusChangeCallback` docstring was wrong.** It claimed
   the callback fires on running → terminal transitions, but
   `register()` (nothing → running) and `reset()` (mass clear, fired
   with no entry) also fire it. Rewrite the docstring to enumerate the
   actual fire sites and explicitly note that `emitEvent` is excluded
   (per-event churn).

* docs(cli,core): tighten MonitorEntry.error and rowLabel comments

Two doc-precision fixes from copilot's PR 3791 review pass:

- `MonitorEntry.error` enumerated max-events as the only auto-stop
  reason that populates the field, but `idle-timeout` also writes it
  (was added in the same earlier commit). Rewrote to list both current
  reasons and explicitly note: any future auto-stop reason should
  populate this field too. Also clarified that `cancelled` and
  natural-exit `completed` paths intentionally don't.

- `rowLabel`'s shell-branch comment claimed long commands are
  "truncated by the row renderer's MaxSizedBox", but ListBody renders
  rows with plain `<Text>` (no MaxSizedBox / truncation helper). Long
  text wraps. Rewrote to describe actual behaviour and note the
  intentional decision to leave it wrapping (the dialog is what users
  open to see context — truncating defeats the purpose).

* test(cli): cover MonitorDetailBody render branches + useBackgroundTaskView

Two coverage gaps from /review's third pass on PR 3791:

- New file `useBackgroundTaskView.test.ts` (6 cases) directly exercises
  the merge logic with a stub config: empty when config is null, merges
  three registries, sorts by startTime, tags `kind`, subscribes on
  mount, refreshes when any registry fires statusChange, clears all
  three subscriptions on unmount.

- New `MonitorDetailBody render branches` describe block in
  `BackgroundTasksDialog.test.tsx` (8 cases) covers the conditional
  pieces my last commit added: title + Command, pid show/hide,
  eventCount singular vs plural, droppedLines show/hide, exitCode
  display, Error block (failed) vs Stopped because (auto-stop), and
  the all-undefined no-block case.
2026-05-03 10:05:19 +08:00