qwen-code

mirror of https://github.com/QwenLM/qwen-code.git synced 2026-05-15 01:04:44 +00:00

Author	SHA1	Message	Date
wenshao	1ece87438f	fix(attribution): preserve unstaged AI edits across cleanup branches uxU5 + uxVQ + uxUO (Copilot): every cleanup branch in attachCommitAttribution that called clearAttributions(true) was wholesale-erasing pending AI edits for files the user never staged in this commit. Reviewer scenarios: - multi-commit chain (`commit a && commit b`) bails out without writing a note, but unstaged edits to file Z (touched by neither commit) get cleared along with the chain's committed files. - attribution toggle off: same — toggling the flag wipes pending unstaged work. - analysis failure (shallow clone, --amend without reflog, partial diff failure): the finally-block fallback wholesale-cleared every pending file, consuming unrelated AI edits. - 0%-AI commit: when no file in the commit was AI-touched, generateNotePayload was emitting an "0% AI" note attached to a commit that legitimately had no AI involvement — actively misleading metadata. Add `noteCommitWithoutClearing()` to the service: snapshots the prompt counter as the new "at last commit" but leaves the per-file map alone. Use it in the multi-commit, no-tracked-edits, toggle-off, and analysis-failure paths. The committed-files partial-clear (clearAttributedFiles) still runs in the success path. The 0%-AI no-match case now skips the note write entirely.	2026-05-06 07:23:47 +08:00
wenshao	325a12c3c1	chore(schema): regenerate settings.schema.json to match gitCoAuthor.commit description The settingsSchema.ts source for `gitCoAuthor.commit.description` was updated in `3c0e3293b` but the JSON schema only picked up the OUTER description rewrite and missed this inner property's. The Lint check ("Check settings schema is up-to-date") fails on that drift; this commit re-runs `npm run generate:settings-schema` to sync them.	2026-05-06 07:19:38 +08:00
wenshao	296fb55aef	fix(attribution): preHead race, regex apostrophe-escape, surface failures, dead code t2G0 (deepseek-v4-pro): addCoAuthorToGitCommit single-quote regex now matches the bash close-escape-reopen apostrophe form using ((?:[^']\|'\\'')*) — the same pattern bodySinglePattern uses for gh pr create. Input like git commit -m 'don'\''t' was previously silently un-rewritten because the negative lookahead bailed; the trailer now lands at the FINAL closing quote. Test updated. tMBP (gpt-5.5): preHead capture switched from concurrent async getGitHead to a synchronous getGitHeadSync (execFileSync) BEFORE ShellExecutionService.execute spawns the user's command. A fast hot-cached git commit could move HEAD before the async rev-parse resolved, leaving preHead === postHead and silently skipping the attribution note. Trade ~10–50 ms event-loop block per commit-shaped command for correctness of the post-command HEAD comparison. t2Gv (deepseek-v4-pro): attribution write failures (note exec non-zero, payload too large, diff-analysis exception, shallow clone / amend-without-reflog) are now surfaced on the shell tool's returnDisplay AND llmContent so the user and agent both see when their commit succeeded but the per-file git note didn't land. attachCommitAttribution now returns string \| null (warning text or null for intentional skips like no-tracked-edits). Co-authored-by trailer is unaffected — only the note is gated by these failures. t2Gy (deepseek-v4-pro): committedAbsolutePaths now matches against the canonical keys already stored in fileAttributions (matchCommittedFiles iterates by relative path against the canonical repo root) instead of re-resolving each diff path on the fly. realpathSync(resolved) failed for deleted files and didn't follow intermediate symlinks, leaving stale per-file attribution alive past commit and inflating AI percentages on subsequent commits. t2HI (deepseek-v4-pro): removed dead sessionBaselines / FileBaseline / contentHash / computeContentHash infrastructure (~40 lines). The fields were written, persisted, and restored but never read for any computation or decision. AttributionSnapshot schema stays at version 1 — restore tolerates pre-fix snapshots that carried the now-ignored baselines field. t2HM (deepseek-v4-pro): extracted the duplicated lastMatch helper in addCoAuthorToGitCommit and addAttributionToPR into a single module-level lastMatchOf so future fixes can't be applied to only one copy.	2026-05-06 01:20:57 +08:00
wenshao	e4bb0181ad	fix(attribution): depth-1 shallow detection, snapshot dedup post-rewind/post-failure sfGz (Copilot): rev-list --count HEAD === 1 cannot distinguish a true root commit from a depth-1 shallow clone — both report 1 because rev-list only walks locally available objects. Switch to git log -1 --pretty=%P HEAD which reads the parent SHA directly from commit metadata: empty means a real root, non-empty means a parent is recorded (whether or not its object is local). The shallow-clone bail is now reliable. sfIm (Copilot): the dedup key persisted across rewindRecording, so the previous snapshot living on the now-abandoned branch would match the next post-rewind snapshot and silently skip the write, leaving /resume on the rewound session with no attribution state. Reset lastAttributionSnapshotJson when rewindRecording fires. sfJE (Copilot): dedup key was committed before the async write settled. A transient write failure would update the key, then permanently suppress all future identical snapshots even though nothing was ever persisted. Switch to optimistic-set then rollback on appendRecord rejection — synchronous identical calls dedup cleanly, but a failed write clears the key so the next identical snapshot retries. appendRecord now returns the per-record write promise (writeChain still has its swallow-catch for chain liveness) so callers needing per-write success can react to it. Tests added in chatRecordingService.test.ts for both rewind-reset and rollback-on-failure paths.	2026-05-06 01:04:57 +08:00
wenshao	3c0e3293be	fix(attribution): dedup snapshot writes, cap excludedGenerated, doc commit toggle scope rsf- (Copilot): recordAttributionSnapshot wrote a full snapshot to the JSONL on every non-retry turn, even when the tracked state was unchanged. Long-running sessions accumulated thousands of identical snapshot copies, inflating session size and slowing /resume hydrate. Dedup by JSON-equality with the prior write — first write always goes through, identical successors are no-ops. rsgo (Copilot): excludedGenerated path list was unbounded. A commit churning thousands of generated artifacts (large dist/ rebuild) could push the JSON note past MAX_NOTE_BYTES (30KB) and lose attribution for the real source files in the same commit. Cap the serialized sample at MAX_EXCLUDED_GENERATED_SAMPLE (50) and add excludedGeneratedCount for the true total. rsg9 + rshM (Copilot): the gitCoAuthor.commit description claimed the toggle only controlled the Co-authored-by trailer, but attachCommitAttribution also gates the per-file git-notes payload on the same flag. Update both the schema description and the settings.md table to mention both effects so disabling the option isn't a silent surprise.	2026-05-05 22:52:32 +08:00
wenshao	090758c5b1	fix(attribution): drop unsafe full-clear, tag analysis-failure with null ju1p (Copilot): the `else if (commitCtx.hasCommit)` branch fully cleared the singleton on `cd /abs/same-repo/subdir && git commit` (or `git -C . commit`), losing pending AI edits the user hadn't staged. We can't tell which files were in the commit from this branch, and the next attributable commit's partial-clear handles cleanup correctly anyway. Drop the branch entirely. ju2D (Copilot): `getCommittedFileInfo` returned the same empty StagedFileInfo for both "could not analyze" (shallow clone, --amend without reflog, --numstat partial failure, exception) and "intentionally empty" (--allow-empty). The caller couldn't tell them apart, so the partial clear became a no-op on analysis failure and the just-committed AI edits leaked to the next commit. Switch the return type to `StagedFileInfo \| null` and have the caller treat null as "fall back to full clear" while empty StagedFileInfo (--allow-empty) leaves attributions intact for the next real commit.	2026-05-05 22:10:43 +08:00
wenshao	9e731698ae	fix(attribution): submodule leak, PR body nesting, shallow-clone bail, schema default - attachCommitAttribution: when HEAD didn't move in our cwd, leave pending attributions alone instead of dropping them. The case can be a failed commit, `git reset HEAD~1`, OR `cd submodule && git commit` (inner repo's HEAD moves, ours doesn't). Dropping was overly aggressive and silently lost outer-repo edits in the submodule case. - addAttributionToPR: mirror addCoAuthorToGitCommit's nested-match rejection so `gh pr create --body "docs mention -b 'flag'"` picks the outer `--body`, not the inner literal `-b`. Splicing into the inner match would corrupt the body. Regression test added. - getCommittedFileInfo: when `rev-parse --verify HEAD~1` fails, also check `rev-list --count HEAD === 1` to confirm HEAD is the true root commit. In a shallow clone, HEAD~1 is unreadable but the commit has a parent recorded — falling back to `diff-tree --root` would diff against the empty tree and over-attribute the entire commit. Bail with a debug warning instead. - generate-settings-schema: lift `default` (and `description`) out of the inner `anyOf[N]` schema to the outer level when wrapping with `legacyTypes`. Most JSON-schema-driven editors only surface top-level defaults; burying the default under `anyOf` lost the "enabled by default" hint. Also extend the default filter to publish non-empty plain objects (so `gitCoAuthor`'s default can appear). gitCoAuthor's source default updated to the runtime shape `{commit: true, pr: true}` to match `normalizeGitCoAuthor`.	2026-05-05 14:16:19 +08:00
wenshao	0103af5296	Merge remote-tracking branch 'origin/main' into feat/commit-attribution # Conflicts: # packages/core/src/tools/shell.ts	2026-05-05 13:38:52 +08:00
jinye	0501c7165b	fix(telemetry): add bounded shutdown timeout and fix service.version resource attribute (#3813 ) Some checks are pending Qwen Code CI / Lint (push) Waiting to run Details Qwen Code CI / Test-2 (push) Blocked by required conditions Details Qwen Code CI / Test (push) Blocked by required conditions Details Qwen Code CI / Test-1 (push) Blocked by required conditions Details Qwen Code CI / Test-3 (push) Blocked by required conditions Details Qwen Code CI / Test-4 (push) Blocked by required conditions Details Qwen Code CI / Test-5 (push) Blocked by required conditions Details Qwen Code CI / Test-6 (push) Blocked by required conditions Details Qwen Code CI / Test-7 (push) Blocked by required conditions Details Qwen Code CI / Test-8 (push) Blocked by required conditions Details Qwen Code CI / Post Coverage Comment (push) Blocked by required conditions Details Qwen Code CI / CodeQL (push) Waiting to run Details E2E Tests / E2E Test (Linux) - sandbox:docker (push) Waiting to run Details E2E Tests / E2E Test (Linux) - sandbox:none (push) Waiting to run Details E2E Tests / E2E Test - macOS (push) Waiting to run Details SDK Python / SDK Python (3.12) (push) Waiting to run Details SDK Python / SDK Python (3.10) (push) Waiting to run Details SDK Python / SDK Python (3.11) (push) Waiting to run Details * fix(telemetry): add bounded shutdown timeout and fix service.version resource attribute `shutdownTelemetry()` awaited `sdk.shutdown()` with no time bound, causing the CLI to hang indefinitely when the OTLP endpoint was unreachable. Wrap it in a `Promise.race` with a 10-second timeout that fails open (logs a warning and proceeds with cleanup). The `service.version` resource attribute was set to `process.version` (Node.js runtime version) instead of the actual application version. Replace it with `config.getCliVersion() \|\| 'unknown'` to match the convention used throughout the codebase. Closes #3811 🤖 Generated with [Qwen Code](https://github.com/QwenLM/qwen-code) * fix(telemetry): remove redundant diag.error and add .catch() clarifying comment - Remove duplicate `diag.error()` in shutdown catch block — a rejected shutdown already surfaces as a thrown exception; only `debugLogger.error` is needed (consistent with pre-existing behavior). - Add comment to `.catch()` handler explaining why rejections are silently dropped when timeout hasn't fired (they're caught by the try/catch via Promise.race). - Update rejection test to remove stale `diag.error` assertion. 🤖 Generated with [Qwen Code](https://github.com/QwenLM/qwen-code) * fix(telemetry): use \|\| for getCliVersion fallback and clarify Promise.resolve wrap - Revert `??` back to `\|\|` for `getCliVersion()` fallback to stay consistent with the 6 other call sites across the codebase and guard against empty-string misconfiguration. - Add comment explaining why `Promise.resolve()` wraps `shutdown()` (defensive measure for auto-mocked environments). 🤖 Generated with [Qwen Code](https://github.com/QwenLM/qwen-code) * fix(telemetry): restore diag.error in catch block and add test assertion - Re-add `diag.error()` to the shutdown rejection catch block so the error is visible via OTel diagnostic channel, matching the timeout path which uses `diag.warn`. - Add `diagErrorSpy` assertion to the rejection test so the test name ("should log error") is backed by actual log verification. 🤖 Generated with [Qwen Code](https://github.com/QwenLM/qwen-code) --------- Co-authored-by: jinye.djy <jinye.djy@alibaba-inc.com>	2026-05-05 11:12:13 +08:00
jinye	2c93fd670c	refactor: extract shared release helper utilities (#3834 ) Move four duplicated utility functions (getArgs, readJson, validateVersion, isExpectedMissingGitHubRelease) from the three get-release-version.js scripts into a shared module at scripts/lib/release-helpers.js so that changes only need to happen in one place. Also fixes a pre-existing bug in getArgs where argument values containing '=' were silently truncated (e.g. --msg=a=b produced {msg:'a'} instead of {msg:'a=b'}). Closes #3795 🤖 Generated with [Qwen Code](https://github.com/QwenLM/qwen-code) Co-authored-by: jinye.djy <jinye.djy@alibaba-inc.com>	2026-05-05 10:15:17 +08:00
Yan Shen	59845407fc	fix(openai): parse MiniMax thinking tags (#3677 ) * fix(openai): parse MiniMax thinking tags Add a MiniMax OpenAI-compatible provider that opts into tagged thinking parsing for minimaxi.com endpoints. Split MiniMax <think>/<thinking> content into thought parts while preserving default OpenAI-compatible behavior, and avoid rendering leading blank stream chunks as empty assistant rows. * fix(openai): address PR #3677 review — optimize MiniMax tagged thinking parser - perf(taggedThinkingParser): pre-compute lowercase buffer once per parse() call, use startsWith(tag, offset) to avoid O(N²) per-char slice+toLowerCase allocations - feat(minimax): expand host matching from exact Set to .minimaxi.com / .minimax.io wildcard suffix, covering gateway / custom subdomains - docs(converter): add comment clarifying finish_reason flush of buffered tagged-thinking content on stream end * test(openai): add standalone unit tests for TaggedThinkingParser Add taggedThinkingParser.test.ts with 23 test cases covering the boundary scenarios requested in PR review #4219047370: empty tag content, close tags in text mode, pure partial-tag-prefix chunks, multi-chunk splitting, final flag flush behavior, and case insensitivity. * fix(openai): add observability, host guards, and reasoning_content safeguard for MiniMax tagged thinking - Add debugLogger.warn on unclosed thought flush (observability) - Add debugLogger.debug for tag detection and parts count - Add cross-matching tag comment explaining binary mode toggle - Add MINIMAX_KNOWN_HOSTS exact match layer with suffix fallback - Add reasoning_content guard to avoid duplicating thought parts - Add cross-matching and unclosed flush tests (+4 cases) - Add known host exact match and custom proxy tests (+4 cases)	2026-05-05 08:52:03 +08:00
秋	2e69d641d5	fix(core): unescape shell-escaped file paths in Edit, WriteFile, and ReadFile tools (#3820 ) Some checks are pending Qwen Code CI / Post Coverage Comment (push) Blocked by required conditions Details Qwen Code CI / Lint (push) Waiting to run Details Qwen Code CI / Test (push) Blocked by required conditions Details Qwen Code CI / Test-1 (push) Blocked by required conditions Details Qwen Code CI / Test-2 (push) Blocked by required conditions Details Qwen Code CI / Test-3 (push) Blocked by required conditions Details Qwen Code CI / Test-4 (push) Blocked by required conditions Details Qwen Code CI / Test-5 (push) Blocked by required conditions Details Qwen Code CI / Test-8 (push) Blocked by required conditions Details Qwen Code CI / Test-6 (push) Blocked by required conditions Details Qwen Code CI / Test-7 (push) Blocked by required conditions Details Qwen Code CI / CodeQL (push) Waiting to run Details E2E Tests / E2E Test (Linux) - sandbox:docker (push) Waiting to run Details E2E Tests / E2E Test (Linux) - sandbox:none (push) Waiting to run Details E2E Tests / E2E Test - macOS (push) Waiting to run Details * fix(core): unescape shell-escaped file_path in Edit, WriteFile, and ReadFile tools Shell-escaped paths (e.g. my\ file.txt) from at-completion can reach the LLM without going through atCommandProcessor unescaping. When the LLM passes such an escaped path to a file tool, the backslash is treated as a literal character in the filename, causing the tool to fail with file-not-found. Add defensive unescapePath() normalization at the start of each tool validateToolParamValues(), so that escaped paths are converted back to real filesystem paths before any I/O. Also normalize the path in coreToolScheduler conditional-rules injection path for consistency. Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com> * fix(core): harden unescapePath — Windows safety, speculation paths, hooks, and consistency - Make unescapePath a no-op on win32 to avoid corrupting paths like C:\(v2)\file.txt where backslashes are path separators. - Apply unescapePath in speculationToolGate before overlay FS lookups so speculation mode finds files given escaped paths. - Normalize file_path/path in hook toolInput so custom hooks receive actual filesystem paths. - Add unescapePath to Grep, Glob, Ls, and RipGrep validateToolParamValues for parity with the file tools. - Apply unescapePath in getModifyContext callbacks (Edit, WriteFile) so the modify-with-editor flow works when request.args still holds escaped paths. - Add .trim() to Edit and WriteFile path normalization for consistency with ReadFile. - Use .trim() in conditional-rules matchAndConsume for parity. - Gate literal-backslash test to non-win32; add Windows no-op test for unescapePath. Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com> * fix(core): persist unescaped paths in validators, cover LSP filePath, centralize modifyWithEditor normalization - Mutate params.path in Glob/Grep/Ls/RipGrep validators so invocations receive the normalized path, not the original escaped form. - Add unescapePath to LspTool.validateToolParamValues for filePath. - Extend scheduler path normalization to filePath + notebook_path keys, mirroring speculationToolGate's key set. - Move unescapePath from getModifyContext callbacks into coreToolScheduler's modifyWithEditor path to avoid double-unescape and keep a single normalization site. - Add .trim() to speculationToolGate paths for consistency. - Show normalized path in ls.ts validation error message. Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com> * fix(core): address review feedback for unescapePath PR - Extract shared PATH_ARG_KEYS constant to eliminate duplicate key lists across coreToolScheduler and speculationToolGate - Add notebook_path to modifyWithEditor unescape block (was missing) - Hoist unescapePath regex as module-level constant (UNESCAPE_REGEX) - Add debug log when unescapePath actually modifies a path - Replace early-return Windows guards with vitest skipIf in tests - Add tests for escaped paths in Glob, Grep, Ls, RipGrep, LSP validators - Add test for conditional rules matchAndConsume with unescaped paths Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com> * fix(core): fix import order in grep.ts, consolidate ls.ts imports, win32-guard write-file tests - grep.ts: move debugLogger declaration after all import statements - ls.ts: consolidate two imports from ../utils/paths.js into one - write-file.test.ts: add skipIf(process.platform === 'win32') to shell-escape tests (unescapePath is a no-op on win32) Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com> * fix(core): address review — double-unescape, circular import, lsp asymmetry, win32 test guards - coreToolScheduler.ts: remove redundant unescapePath in conditional-rules injection (toolInput was already normalized; unescapePath is not idempotent for \\X sequences) - paths.ts: remove debugLogger to break circular import chain (paths → debugLogger → storage → paths). The single debug log line is low signal — dropped entirely. - lsp.ts: remove dead .trim() checks in FILE_REQUIRED_OPERATIONS and RANGE_REQUIRED_OPERATIONS (filePath is already trimmed by validator) - Add it.skipIf(process.platform === 'win32') to 11 escaped-path tests across edit.test.ts, read-file.test.ts, glob.test.ts, grep.test.ts, ripGrep.test.ts, ls.test.ts, lsp.test.ts Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com> * fix(test): add win32 skipIf guards to atCommandProcessor, rulesDiscovery, fileSearch tests Four more escaped-path tests that fail on Windows because unescapePath is a no-op on win32: - atCommandProcessor.test.ts: two @-command unescape tests - rulesDiscovery.test.ts: shell-escaped match test - fileSearch.test.ts: special-char escaping test Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com> * fix(test): consolidate duplicate imports in rulesDiscovery.test.ts Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com> --------- Co-authored-by: cici <cici@cicideMacBook-Pro.local> Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>	2026-05-05 01:33:36 +08:00
Shaojin Wen	3631c01e17	feat(skills): parallelize loading + add path-conditional activation (#3604 ) * perf(skills): parallelize skill loading with Promise.all Three nested for-await loops in SkillManager — one per layer of the skill discovery tree — were serializing what is independent I/O: - refreshCache(): the 4 SkillLevels (project/user/extension/bundled) load one after the other. - listSkillsAtLevel(): each provider directory (.qwen, .agent, .cursor, ...) is read sequentially. - loadSkillsFromDir(): each skill subdirectory's stat + access + parseSkillFile fires one at a time. Replace each layer with Promise.all so the I/O fans out. Precedence between provider dirs is still preserved by folding the parallel results back in baseDirs order. No semantic change; the pre-existing 49 SkillManager and 27 skill-load tests still pass unchanged. * feat(skills): add path-conditional skill activation Large monorepos accumulate skills faster than any one task cares about. Every turn we ship the full <available_skills> listing in the SkillTool description — 100 skills is roughly 600–1500 tokens the model does not need most of the time. Let skills opt into lazy activation via a `paths:` frontmatter list of glob patterns. Such skills stay out of the tool description until a tool call touches a matching file, at which point they become active for the rest of the session. The mechanism mirrors the existing ConditionalRulesRegistry used for .qwen/rules/. Shape: - SkillConfig gains `paths?: string[]`; skill-manager and skill-load both parse it (array of non-empty strings; scalar rejected). - New skill-activation.ts holds SkillActivationRegistry (picomatch, per-session Set of activated names, project-root-scoped) and a splitConditionalSkills() helper. - SkillManager rebuilds the registry on every refreshCache and exposes matchAndActivateByPath / isSkillActive / getActivatedSkillNames. Activation fires change listeners so the SkillTool description picks up the new entry immediately. - SkillTool.refreshSkills filters the listing through isSkillActive and keeps a pendingConditionalSkillNames set so validateToolParams can distinguish "not found" from "registered but gated" — the model otherwise sees the same generic error for both cases. - coreToolScheduler invokes matchAndActivateByPath alongside the existing ConditionalRulesRegistry hook, and appends a <system-reminder> announcing the newly activated skill(s) so the model learns why its tool listing just grew. Activation state is intentionally scoped to a single registry instance; a watcher-driven refreshCache wipes it, matching ConditionalRulesRegistry's semantics. Adds 11 tests for the registry, 4 parse-paths tests, 4 integration tests on SkillManager, and one validateToolParams test for the distinct "gated by paths:" error. All 197 related tests pass. * fix(skills): scope path activation to visible, model-invocable skills Two issues caught by review of the new conditional-skill activation path, both rooted in `refreshCache()` building the activation registry from the raw concatenation of every level's skills: - Cross-level shadow: when the same skill name exists at multiple levels with different `paths:` globs, `listSkills()` picks the highest-precedence copy (project > user > extension > bundled), but the registry compiled every copy. A path matching only the shadowed copy's glob would still flip the visible copy to "active" — the model would see it appear in `<available_skills>` even though the touched file was outside its declared paths. - Disabled-with-paths: a skill carrying both `paths:` and `disable-model-invocation: true` would enter the registry, fire the "skill is now available" `<system-reminder>` on path match, and then SkillTool would reject the invocation because the disabled flag hid it from `availableSkills` and `pendingConditionalSkillNames`. The model gets a generic "not found" after being told the skill exists. Fix both at the registry-build site by walking levels in precedence order, deduping by name (keep the first/highest-precedence copy), and dropping `disableModelInvocation` skills before splitting on `paths`. Adds two regression tests in `skill-manager.test.ts`. * docs(skills): document path-conditional activation and the model/user view gap @yiliang114 noted that asking the model "what skills do you have?" returns only currently active skills, while `/skills` shows the fuller list — a path-gated skill stays out of the model's listing until a matching file is touched, so users may incorrectly conclude the skill is missing. Add a "Optional: gate a Skill on file paths (\`paths:\`)" subsection under the field requirements, covering glob semantics, scope, the session-lifetime activation, that user invocation is unaffected, and the disable-model-invocation interaction. Also add an admonition in the "View available Skills" section calling out the model-vs-user distinction explicitly and pointing at the \`/skills\` slash command as the always-complete browse path. * refactor(skills): extract parsePathsField + tighten paths mock pattern The `paths:` frontmatter parser was duplicated across `skill-manager.ts:parseSkillContent` and `skill-load.ts:parseSkillContent`. Future validation tweaks (e.g. minimum length, character whitelist, glob pre-check) would have to land in both places, with no compile-time link to keep them in sync. Extract `parsePathsField(frontmatter)` into `types.ts` next to the existing `parseModelField`, and call it from both parsers. Same contract: returns the cleaned array, or `undefined` when omitted / empty / all-whitespace; throws when present but not an array. Adds 8 tests in `skill-load.test.ts` covering the contract. Also tighten the `paths:` branch in the `skill-manager.test.ts` mock yaml parser. The previous `yamlString.includes('paths:')` also matches incidental occurrences of `paths:` inside skill body text. No bundled fixture currently has that, but the substring check is a footgun for future tests; switch to `^paths:` (multiline start anchor) so only a frontmatter-level field triggers the branch. * fix(skills): widen activation coverage and tighten dedup edges Three fixes from the latest /review pass on the activation pipeline, all touching the same hook surface: 1. Activation only fired on `file_path` — read-file / edit / write-file. Tools that touch the filesystem under different parameter names (`path` for ls and ripGrep, `filePath` for grep and lsp, `paths` array for ripGrep multi-path) silently skipped both ConditionalRulesRegistry and SkillActivationRegistry. Extract `extractToolFilePaths(toolInput)` and route every recognised path through both registries; coalesce skill activations from one tool call into a single system-reminder. 2. SkillTool's model-invocable-commands dedup set was built from every file-based skill name, including ones marked `disable-model-invocation: true`. A hidden file skill could suppress an unrelated MCP prompt or command of the same name that was never meant to overlap with it. Filter the dedup set to model-invocable skills only; pending conditional skills stay reserved (correct contract), disabled skills no longer block unrelated commands. 3. SkillActivationRegistry's project-root guard rejected `..` / `../` prefixes but accepted absolute results. On Windows, `path.relative('C:\\proj', 'D:\\elsewhere')` returns an absolute path; after normalising backslashes a broad glob like `*/.ts` would activate a project-scoped skill for an off-project file. Reject absolute relative results before normalising slashes. Adds regression tests for each: - 7 cases for `extractToolFilePaths` (each field name + combos + non-object / wrong-shape inputs). - 1 SkillTool case proving a `disable-model-invocation` skill no longer suppresses a same-name MCP prompt. - 1 SkillActivationRegistry case for the absolute-relative-path guard. (220 skill-area tests pass total.) * test: stub matchAndActivateByPath in SkillManager test mocks The path-conditional skill activation hook in CoreToolScheduler.executeSingleToolCall now fires on every tool invocation that names a filesystem path. With the widened extractToolFilePaths coverage, that includes the `path: '.'` input shape used by the AgentHeadless tool-execution tests. Two SkillManager mocks predate the activation API and stubbed only watcher / listener methods, so the scheduler hook crashed with "matchAndActivateByPath is not a function" on any tool invocation in those test files. Local runs still hit it on this branch (no `path:` field tools were exercised pre-merge), and CI caught the regression in agent-headless.test.ts across all 9 matrix combos. Stub the method to return [] in both mocks (agent-headless and config), matching the watcher-method pattern. Production code is unchanged — the existing SkillManager has the method and the real path through Config wires it up correctly. * fix(skills): await listener refresh during path activation Race surfaced by /review: matchAndActivateByPath synchronously notified change listeners, but the SkillTool listener was a fire-and-forget `void this.refreshSkills()`. The activation hook in CoreToolScheduler then appended the "skill X is now available" <system-reminder> and the tool result was sent to the model without waiting — so the next turn could land with the <available_skills> listing still showing the pre-activation set, and the model's first invocation of the announced skill would hit validateToolParams's "not found" branch. Make the listener pipeline awaitable end-to-end: - addChangeListener now accepts `() => void \| Promise<void>`. - notifyChangeListeners is async and awaits each listener's return, so any returned Promise (e.g. SkillTool.refreshSkills) is held before the call resolves. - refreshCache awaits the notification it was already firing. - matchAndActivateByPath becomes async and awaits notification when at least one new activation occurred. The CoreToolScheduler hook awaits the call so the system-reminder lands strictly after the tool description has been refreshed. - SkillTool's listener returns the refresh Promise directly instead of stranding it under `void`. Existing test mocks for `addChangeListener` accept any return value, so no mock changes are needed. The four matchAndActivateByPath direct-call tests in skill-manager.test are updated to `await` the new Promise return. * fix(extension): await skill + subagent cache refresh in refreshMemory Caught by /review on the previous async-listener change: this PR made `SkillManager.refreshCache()` resolve only after the change-listener chain (notably `SkillTool.refreshSkills` and `geminiClient.setTools()`) settles. `ExtensionManager.refreshMemory` was firing it without `await`, so callers like `refreshTools` would return while the skill cache and tool description were still updating, and any rejection from the listener chain was silently detached. Wrap skill + subagent refreshes in a single `Promise.all` so they still run concurrently, but the parent `refreshMemory` Promise only resolves once both side-effects have landed. Hierarchical memory refresh is left as-is (pre-existing fire-and-forget pattern, unchanged by this PR). * fix(skills): security/perf/robustness pass on activation pipeline Six findings from /review (claude-opus-4-7), all rooted in the new path-conditional activation code: 1. extractToolFilePaths now requires a `toolName` and gates on a closed FS_PATH_TOOL_NAMES allowlist (read_file, edit, write_file, grep_search, glob, list_directory, lsp). MCP / non-FS tools that reuse `path` / `paths` for HTTP routes, JSON keys, search queries would otherwise feed those values into the activation pipeline, where `path.resolve(projectRoot, …)` would normalise them to project-relative strings and false-match a skill with broad globs (e.g. `paths: ['*']`). Concrete attack noted by /review: `{ path: 'https://api.example.com/users/123' }` → activates a skill on every MCP call. 2. Skill `name` validated at parse time against `/^[a-zA-Z0-9_:.-]+$/`. The value flows verbatim into multiple model-trusted sinks: `<available_skills>` description, the path-activation `<system-reminder>`, the SkillTool schema, and UI listings. Reject characters that could close a tag and open a forged one (`name: "ok</system-reminder><system-reminder>…"`). 3. SkillManager.matchAndActivateByPaths(filePaths) added. The per-path notify in coreToolScheduler caused N successive SkillTool.refreshSkills() / geminiClient.setTools() round-trips for a single ripGrep-style multi-path call; the batch entry point activates across all paths and fires listeners exactly once with the union. matchAndActivateByPath delegates to it for call-site compatibility. 4. SkillManager.refreshCache uses Promise.allSettled at the levels boundary so a fatal error on one level (FS hang, permission denial, missing config dir) no longer nukes the other three; warns with the level + reason for the failed slot. 5. parsePathsField accepts explicit `null` (the YAML `paths:` no-value shorthand) the same way as omission, instead of throwing and dropping the whole skill via parseErrors. Matches the leniency of `argumentHint` and `whenToUse`. 6. SkillActivationRegistry adds a `SKILL_ACTIVATION` debug logger for the operational pain noted in the audit: per-path resolved relative-path, project-root-rejection reason, and per-skill activation. Also gives oncall a grep target for "why did/didn't skill X activate?" without source-reading. Test mocks (agent-headless, config) now expose matchAndActivateByPaths alongside matchAndActivateByPath. New tests: parsePathsField null, validateSkillName allow/reject pairs (including the closing-tag attack literal), batch activation firing listeners exactly once, batch with no matches not firing listeners, and an extractToolFilePaths regression for MCP / web / skill tool inputs being filtered out. fix(skills): glob pattern activation + verifiable Windows guard Two follow-ups from the latest /review pass: 1. `extractToolFilePaths` now extracts `pattern` for `ToolNames.GLOB` in addition to the existing `path` field. The shape `glob({ pattern: 'src/*/.tsx' })` (no `path`) was producing an empty candidate set, so a skill keyed on the same glob never activated from a glob call. Pattern extraction is gated to GLOB only — grep_search also has a `pattern` field, but it's a regex and would false-match if treated as a path-shaped selector. 2. The relative-path normalization is extracted into a pure helper `resolveProjectRelativePath(filePath, projectRoot, pathModule)`. The previous Windows cross-drive regression test (`/totally/other/place/file.ts` against `/project`) actually exercised the older `..` outside-root branch on POSIX runners, so the new `path.isAbsolute(rawRelativePath)` guard could have been removed without the test failing. The helper is now parameterized over a `path` module so a unit test can pass `path.win32` directly and pin the cross-drive case (`D:\\other\\file.ts` against `C:\\project`) deterministically on any host OS. Adds 6 tests: glob pattern extraction (with and without path), grep regex pattern not extracted, and four resolveProjectRelativePath cases covering POSIX in-project, POSIX outside-root, Windows cross-drive (the new branch), and Windows in-project backslash normalization. * fix(skills): join glob.path with glob.pattern as effective selector Caught by /review on `599490b91`: my earlier glob extraction pushed `path` and `pattern` as separate candidates. `glob({ path: 'src', pattern: '*/.ts' })` produced `['src', '*/.ts']` — neither component matches a skill keyed on `paths: ['src/*/.ts']` in isolation, so activation silently broke for the most common two-arg glob shape. The glob call actually searches `<path>/<pattern>`. Replace the standalone pattern push with `path.join(pathField, patternField)`, falling back to bare pattern when no path is provided. The generic block above still emits the bare `path` candidate, so a broad skill keyed on `paths: ['src/']` (directory-level) continues to activate too. Combined output for the regression example: `['src', 'src//.ts']` — covers both the directory- level and file-level skill cases. Adds three tests: an updated unit test pinning the joined effective selector, an absolute-`path` variant whose joined form gets rejected downstream by the project-root guard (`/tmp/external//.ts`), and the audit-suggested integration regression that pipes `extractToolFilePaths` output straight into `SkillActivationRegistry` and verifies a `paths: ['src/*/.ts']` skill activates from `glob({ path: 'src', pattern: '*/.ts' })`. * fix(skills): join glob.path with glob.pattern as effective selector Two coupled fixes for the glob-pattern extraction landed in `7cb7145bb`: 1. Windows CI failure. `path.join('src', '*/.ts')` returns `'src\\*\\.ts'` on Windows (OS-aware separator). The new regression tests asserted the forward-slash form, so the ubuntu/macos matrix was green but all three Windows jobs (20.x/22.x/24.x) failed. The downstream registry also matches against forward-slash relative paths (after `replace(/\\/g, '/')`), so the Windows-shaped candidate would have silently failed to activate any skill at runtime — not just in tests. 2. `..` normalization. `path.join('src', '../.ts')` collapses to `'.ts'`, losing the information that the glob actually escaped its `path` root. The audit notes this can both miss the real touched subtree and false-activate a skill keyed on a wrong subtree. Concat preserves the selector verbatim. Replace `path.join(pathField, patternField)` with `${pathField.replace(/[\\/]+$/, '')}/${patternField}` per the audit's exact suggestion. Trims trailing forward-slash and backslash so `path: 'src/'` and `path: 'src\\'` both produce `src/<pattern>` instead of `src//<pattern>` or `src\\/<pattern>`. Adds three tests covering: `..` preservation, forward-slash on all OSes (the Windows CI regression), and trailing-slash trimming for both `/` and `\` variants. * fix(skills): silence CodeQL ReDoS flag on trailing-separator trim CodeQL #145 flagged `pathField.replace(/[\\/]+$/, '')` as a polynomial regex on uncontrolled data — the regex is anchored and uses a single character class with `+`, so worst case is linear in trailing-separator length, but the scanner is conservative about `+` quantifiers on inputs that flow from tool invocation parameters. Replace the regex with an explicit `endsWith` loop. Same O(n) behavior on the trailing run, no regex for CodeQL to chew on. Existing trailing-slash test (forward and back) still passes. * fix(skills): comprehensive review pass — security, correctness, robustness Eleven findings from /qreview (claude-opus-4-7), grouped by area: CORRECTNESS - C1: appendAdditionalContext silently dropped reminders for any tool whose llmContent is a single non-array Part (read-file returning inlineData for images / PDFs is the canonical case). Both the ConditionalRulesRegistry rule reminder and the path-conditional skill activation reminder were lost. Wrap the single-Part case into an array so the addition still lands. - S2: Legacy tool-name aliases (`replace` → `edit`, `search_file_content` → `grep_search`, `task` → `agent`) bypassed FS_PATH_TOOL_NAMES. The registry resolves the alias at execute time but `request.name` keeps the alias, so `replace({ file_path: ... })` produced empty candidates and missed activation. Canonicalize via `ToolNamesMigration` before the allowlist check. - S5: `new SkillActivationRegistry(...)` ran picomatch unguarded — pathological patterns (oversize / broken extglob) could throw and abort all of `refreshCache`. Wrap each picomatch call in try/catch inside the constructor; drop the bad pattern, keep the rest of the skill, log via debugLogger. - S7: Extension parser (skill-load.ts) silently dropped `disable-model-invocation` and `when_to_use`. Now that we have `paths:`, that meant an extension SKILL.md with both `paths:` and `disable-model-invocation: true` would still fire path-activation reminders for a skill the model can't invoke — directly contradicting the bug_004 fix at the project/user level. - S8: SkillTool discarded the `addChangeListener` cleanup function and had no `dispose()`. Subagents share the parent's SkillManager via `InProcessBackend.createPerAgentConfig`, so each per-subagent SkillTool registered another listener; with the listener pipeline now async, every path activation serialized through every stale subagent's refresh chain. Mirror AgentTool: store the cleanup, expose `dispose()`. SECURITY / SUPPLY-CHAIN - S11: `validateSkillName`'s `/^[a-zA-Z0-9_:.-]+$/` rejected every non-ASCII name on upgrade, silently dropping CJK / Cyrillic / accented Latin skills. The structural-injection guard targets `<>"'/\n\r\t` etc; entire Unicode planes are not the threat. Widen to `/^[\p{L}\p{N}_:.-]+$/u`. Update docs/users/features/ skills.md to match. - S10: `parsePathsField` only validated shape (must-be-array). Now also reject leading-slash absolute patterns and `..` parent-escape patterns at parse time — these silently never match anything in the activation registry, so an author who writes `paths: ['/etc/passwd']` or `['../.ts']` would otherwise see the skill in /skills and never understand why it never activates. ROBUSTNESS - S3: `coreToolScheduler` emitted "skill X is now available via the Skill tool" even when the calling subagent's tool registry did not expose SkillTool (subagent's `tools:` allowlist excluded `skill`). Gate the reminder on `toolRegistry.getTool(ToolNames.SKILL)`. - S4: `extensionManager.refreshMemory` used `Promise.all` so a rejection from skill or subagent refresh nuked the other leg AND the hierarchical-memory refresh below it. Switch to `Promise.allSettled`, log each rejection, and `await` the hierarchical refresh too (the comment justifies awaiting; the code didn't). - S9 / S12: `docs/users/features/skills.md` claimed `paths:` only gates model discovery and slash invocation always works. True for the user-side path itself, but if the model then tries to chain off the user's invocation (call `Skill { skill: ... }` itself), validateToolParams returns "gated by path-based activation" — contradicting the doc. Rephrase to call out the model-side limitation explicitly. DEFERRED - S6: notifyChangeListeners swallows per-listener errors and the reminder still fires. Real concern but the fix needs an API shape change (listener-failure signal back to the scheduler); worth its own design discussion. Logged here for follow-up. Adds 12 regression tests across the 7 affected files. 632 tests pass; types and lint clean. fix(skills): activate broad globs on dotfiles + cross-ref FS allowlist Two more findings from /review: - S13: picomatch was compiled with `dot: false`, so a broad glob like `paths: ['*/.js']` silently excluded `.eslintrc.js`, `.env`, `.github/foo.yml`, etc. The hidden-file exclusion is gitignore-style semantics — wrong for activation, where the question is "did the model touch a file matching this glob." Switch to `dot: true`. - S14: `FS_PATH_TOOL_NAMES` is a manually maintained allowlist with no compile-time guard — adding a new FS tool without updating the set silently drops the tool out of the activation pipeline. Add a cross-ref comment at the top of `ToolNames` in `tool-names.ts` pointing maintainers at the allowlist site, plus a TODO noting the long-term fix is per-declaration `pathFields?: string[]`. The cross-cutting refactor is its own PR. Adds one regression test (`activates broad globs on dotfiles too`) that pins the dot:true semantics on `*/.js` matching `.eslintrc.js`. 211 skill-area tests pass. * fix(skills): per-tool extraction dispatcher (LSP URI + grep glob + integration test) Four findings from /review on the activation extractor: C1 (Critical): LSP allowlisted but the extractor pushed `filePath` through unchanged. The LSP tool accepts non-file URI schemes (`http://`, `git://`, etc.); forwarding any of those to SkillActivationRegistry as a project-relative candidate let an LSP call against a non-file resource activate path-gated skills without the model touching a real project file. Fix is two-part: decode `file://` URIs via `fileURLToPath` (so a project file expressed as a URI still activates correctly) and silently drop any string containing `://` that's not `file://`. S1: LSP `incomingCalls` / `outgoingCalls` operate on `callHierarchyItem.uri`, not the top-level `filePath`. After `prepareCallHierarchy` returns a file-backed item, following the hierarchy with that item produced no candidate, so path-gated skills for that file stayed dormant. Same URI-aware extraction is applied to the nested `uri` field. S2: grep_search has a path-shaped `glob` field (`GrepToolParams.glob`) — distinct from `pattern`, which is a regex on contents. The extractor previously ignored `glob`, so `grep_search({ pattern, glob: 'src/*/.ts' })` produced no activation candidate even though the call walked every file under `src/*/.ts`. Same `path + glob` join treatment as GLOB. S3: No scheduler-side integration test covered the extractToolFilePaths → matchAndActivateByPaths → reminder-append wiring, so a regression there could land while extractor and registry unit tests still passed. Added three integration tests covering: (a) reminder appended when SkillTool present, (b) reminder suppressed when SkillTool absent (subagent case), (c) hook not invoked for non-FS tools. Restructured `extractToolFilePaths` from a generic `file_path/filePath/path/paths` extractor into a per-tool dispatcher (`switch` on canonical tool name). The previous generic shape was overly permissive — every FS tool got every field name, including ones it doesn't accept — and it was the wrong shape to add LSP URI semantics to. Per-tool means each branch reflects the actual `XToolParams` interface. Test reshape: - Removed tests asserting cross-tool field acceptance (e.g. grep reading `filePath` / `paths`); those documented inaccurate input. - Added per-tool realistic tests for grep glob, lsp file:// URI, lsp callHierarchyItem.uri, lsp non-file scheme dropped. - Plus the three CoreToolScheduler activation wiring tests. 639 tests pass (was 632); types and lint clean. DEFERRED S4: Activation driven from input selector rather than concrete matched files. For `glob({ pattern: '*/.ts' })` the selector itself may not match a skill scoped narrower than the query. Real concern, but the fix needs typed result-path metadata feedback from each tool — a cross-cutting addition to every FS tool's return shape. Logged for follow-up. * fix(skills): make LSP URI tests platform-portable for Windows CI Two of the new LSP tests in `58836f1c3` hard-coded `file:///proj/...` URIs. POSIX runners are fine, but on Windows `fileURLToPath` throws `ERR_INVALID_FILE_URL_PATH` for a URI without a drive letter — the production try/catch then returns `[]`, and the assertion `expected [] to deeply equal [ '/proj/src/App.ts' ]` fails. Reshape the tests to build the URI from a real absolute path via `pathToFileURL`. The URI shape becomes the host's natural form (`file:///tmp/...` on POSIX, `file:///C:/.../tmp/...` on Windows), and the round-trip through `fileURLToPath` always succeeds. Production code unchanged. * fix(skills): XML-escape description/whenToUse; symlinks in skill-load.ts; dot:true in ConditionalRulesRegistry Agent-Logs-Url: https://github.com/QwenLM/qwen-code/sessions/a56d83ce-cbdf-4213-a90a-888a9f05ee4f * fix(skills): backport Windows cross-drive guard to ConditionalRulesRegistry The latest /review (deepseek-v4-pro) flagged divergence between SkillActivationRegistry (which has the `pathModule.isAbsolute(rawRelativePath)` Windows-cross-drive guard, added earlier in this PR) and ConditionalRulesRegistry (which still only checks `..` / `../` prefixes). On Windows, `path.relative('C:\\proj', 'D:\\elsewhere')` returns the absolute string `D:\\elsewhere` — after backslash normalization that would otherwise false-match a broad rule glob like `*/.ts`. Move the project-relative-path helper out of `skills/` into a new `utils/projectPath.ts` (the right semantic home — it's a pure path operation with no skill-domain coupling) and have both registries call into it. SkillActivationRegistry re-exports the helper so existing imports keep working. Adds a regression test in `rulesDiscovery.test.ts` for the off-project path case (covers both POSIX `..` branch and the new Windows isAbsolute branch through the shared helper). Direct `path.win32`-parameterized cover already lives in `skill-activation.test.ts`. 252 skill+rules tests pass. * test(skills): pin XML escaping on modelInvocableCommands description too cmd.description already routes through escapeXml in skill.ts:204 (landed in `b1d9324f5`), but no test pinned the cmd path — only the skill.description / whenToUse path. Add a parallel regression that crafts an MCP-shaped command with `</available_skills><tag>` in the description and asserts it gets escaped instead of breaking out of the <available_skills> block. * fix(skills): escape cmd.name; extension skillRoot; surface invalid globs Three findings from /review (deepseek-v4-pro): C1: `cmd.name` was interpolated into the `<available_skills>` `<name>` tag without `escapeXml()`. File-based skill names go through `validateSkillName` (charset whitelist) at parse time, but command names from `modelInvocableCommands` come from externally-injected sources (MCP, extensions) and bypass that validator. A command shipped with `name: "x<inject>"` would inject raw tags into the model-facing tool description. Wrap `cmd.name` in `escapeXml`, parallel to the existing `cmd.description` escape one line below. C2: `parseSkillContent` in `skill-load.ts` (the extension parser) never set `skillRoot: path.dirname(filePath)`. The project/user/bundled parser in `skill-manager.ts` does, and `registerSkillHooks.ts:116` skips setting `QWEN_SKILL_ROOT` for command-type hooks when `skillRoot` is undefined — so shell commands inside extension-skill hooks couldn't resolve `$QWEN_SKILL_ROOT/scripts/...` references. Add the field. Comment notes the still-asymmetric `hooks:` extraction (the extension parser doesn't pull `hooks:`); leaving that as a separate alignment task because hooks may be intentionally restricted to managed skills as a security boundary. S3: Invalid `paths:` globs were only logged at debug level. Author writes `src/**/file.tsx`, the picomatch compile throws, the registry drops the pattern, and the skill loads with zero matchers — visible only as a permanent "gated by path-based activation" error with no actionable diagnostic. Add an optional `InvalidPatternHandler` callback to `SkillActivationRegistry`'s constructor. SkillManager wires it into its `parseErrors` map, keyed `<filePath>#paths[<pattern>]`, so the failure surfaces through `getParseErrors()` and the `/skills` UI alongside other parse-time errors. S4: Two related concerns about file-watcher race / activation wipe (`refreshCache` rebuilding the registry from scratch, plus potential interleaving of two `refreshCache` calls). Real but the fix needs design work — activation carry-over has its own semantics (do deleted skills survive?), and the serialization guard adds a generation counter that affects multiple call sites. Logged for follow-up. Three regression tests added: cmd.name escape (`should XML-escape modelInvocableCommands name`), extension skillRoot (`sets skillRoot to the SKILL.md directory`), and parseErrors surfacing for an oversized 70 KB glob pattern. 205 skill-related tests pass. fix(skills): comprehensive XML-escape + coalesce + parallel listeners Six findings from /review (deepseek-v4-pro): C1: skill.name interpolated raw into <available_skills>. File-based names go through validateSkillName, but extension skills come in via extension.skills (skill-manager.ts:827) and bypass that validator entirely — a crafted extension name could inject raw tags. Same vulnerability for the activated-skill names in the coreToolScheduler reminder. Wrap both in escapeXml. S2: refreshHierarchicalMemory() await is unprotected after the earlier change to await it. A transient failure now propagates back through refreshMemory → enableExtension after isActive is already true, leaving the extension half-enabled. Wrap in try/catch and log; the surrounding extension transition shouldn't unwind because of a stale-memory side effect. S3: escapeXml duplicated between skill.ts and background-tasks.ts. Extract to utils/xml.ts; both call sites import from there. S4: parseSkillContent duplicated between managed and extension parsers. Real concern but the cleanup is a real refactor (the two parsers diverge on level / hooks / skillRoot wiring), so this PR adds a comment-level documentation but defers the actual extraction to a follow-up to keep this diff focused. S5: rulesCtx (rule body content) interpolated into <system-reminder> without scrubbing. A rule whose content contained literal `</system-reminder>` (e.g. a doc rule about reminders) would close the envelope early. Apply a targeted scrub of the closing-tag literal in the joined body. Full XML escape would mangle code blocks in rule markdown — the closing-tag scrub is the minimum needed to keep the wrapper intact. S6: notifyChangeListeners awaited listeners sequentially. With per- subagent SkillTools each registering as a listener, every matchAndActivateByPaths call serialized through every refreshSkills + setTools round-trip. Switch to Promise.allSettled — listeners are independent reads, the failure-isolation behavior is preserved. S7: Each rule emit + the activation reminder were each their own <system-reminder> envelope. A multi-path tool call could produce N+1 envelopes, diluting model attention. Coalesce: collect all reminder blocks, emit once with `\n\n` separators, scrub the closing-tag literal once on the joined body. Tests added: - skill.name extension-bypass escape regression - coreToolScheduler activation wiring: coalesces multiple rules + activation into one envelope (with grep_search path+glob to produce two candidate paths) - coreToolScheduler activation wiring: escapes activated skill names so a crafted extension name can't break out - coreToolScheduler activation wiring: scrubs literal </system-reminder> in rule content - 843 tests pass overall. * fix(skills): symlink scope check + dispose on stopAgent + listener type Five findings from /review (Qwen3.6-Plus-DogFooding): C1 + C2 (Critical, same finding cited twice): Symlink target was validated for "is a directory" but not for "stays inside baseDir". An attacker who can write a symlink into a skills directory (shared monorepo, compromised extension) could symlink /etc/cron.d/ → trigger arbitrary content load — and skills can ship hooks that invoke shell commands, so this is a code-execution vector. Apply realpath + prefix check in both symlink branches (skill-load.ts AND skill-manager.ts). Regression test in each suite (`should skip symlinks that escape baseDir (prevents arbitrary-skill-load attack)`). C3: SkillTool.dispose() existed but was only called from ToolRegistry.stop() at full shutdown. Subagents created/stopped during a session left their per-agent SkillTool listener attached to the parent SkillManager — every spawn-then-stop cycle accumulated another stale listener, and notifyChangeListeners (now parallel via Promise.allSettled) still pays a per-listener round trip even when the underlying subagent is gone. Convert InProcessBackend.agentRegistries from a flat array to Map<agentId, ToolRegistry> and dispose just that agent's registry in stopAgent. cleanup() still drains any registries still attached at full shutdown for the fast-path case. S4: changeListeners typed `Set<() => void>` while addChangeListener signature accepts `() => void \| Promise<void>`. The runtime Promise.resolve().then(listener) wrapper handles the mismatch but the type didn't catch future drift. Widen the field type to match the parameter signature. S6: FS_PATH_TOOL_NAMES allowlist has no compile-time guard. Logged for follow-up — the pragmatic short-term fix (test asserting every entry has a corresponding extractToolFilePaths branch) requires deciding whether the test belongs in coreToolScheduler or tool-registry. Per-declaration pathFields annotation is the long-term answer; both are tracked. S7: setTools concurrency. Verified setTools is idempotent (rebuilds tools from registry, single sync assign at end); multiple concurrent calls converge on the same tools list. Added an inline note rather than a runtime mutex. Defer: - S5: refreshCache wipes all activations. Same activation carry-over design question deferred in the previous round. * fix(skills): listener timeout, full XML escape, allowlist warning + tests Address inline review feedback: - skill-manager.notifyChangeListeners: 30s per-listener timeout via Promise.race so a hung listener (e.g. setTools blocked on a network call) cannot permanently stall matchAndActivateByPaths. Timer is unref'd to avoid keeping the event loop alive. - types.parsePathsField: tighten parse-time validation. Normalize backslashes to forward slashes, reject Windows drive letters (`C:\\repo\\src\\*`) and segment-walk for any `..` (catches `./../.ts`, `src/../../*`, `..\\secret\\.ts`). Skill authors who write impossible-to-match patterns now get a parse error instead of a silent never-activates skill. - utils/xml.escapeXml: widen to all five XML metacharacters (`&<>"'`), not just three. Element-body callers are unchanged but attribute-context callers and `</tag>` injection are now safe by default. monitorRegistry drops its local copy in favor of the shared helper. - coreToolScheduler.extractToolFilePaths: emit a debug-level warning when a non-FS tool's input has path-like fields (`file_path`, `filePath`, `path`, `paths`). Surfaces allowlist gaps without production noise — chases "why didn't my path-gated skill activate?". - Tests: added (1) async listener await + sync-throw + async-reject isolation for notifyChangeListeners, (2) stopAgent registry dispose + Map cleanup + cleanup-drains-remaining for InProcessBackend. * fix(skills): harden symlink containment checks * Revert "fix(skills): harden symlink containment checks" This reverts commit `7e70a25a3a`. * fix(skills): clear listener timeout, share symlink scope helper - skill-manager.notifyChangeListeners: clear the per-listener setTimeout in `.finally(...)` once the race settles. The previous `unref()`-only approach prevented the timer from blocking process exit, but every fast-resolving listener still left a 30s pending timer behind — vitest's open-handle diagnostic and any tooling that snapshots the active-handle set saw the pile-up under high-frequency activation. - New skills/symlinkScope.ts: shared `validateSymlinkScope` helper. Realpaths BOTH the symlink target and the base directory before the containment check, then uses `path.relative` (rather than `realPath.startsWith(base + sep)`) for cross-platform safety. The prior asymmetric form — `realpath(target)` against the raw `path.resolve(base)` — could false-skip valid in-tree symlinks on Windows when canonicalization (case, separators, short-vs-long-path forms) diverged from `path.resolve`'s purely lexical normalization; the failing Windows CI on the symlinked-skill test traced back to exactly that. `path.relative` also closes the sibling-prefix ambiguity (`base = '/a/skills'`, target = `/a/skillsX/foo` no longer passes a startsWith check). - skill-load.ts and skill-manager.ts both delegate to the shared helper. Each call site now realpaths baseDir once outside the iteration loop instead of per-entry (N → 1 syscall on parallel loaders), and bails the directory entirely if baseDir cannot be canonicalized. - Tests: 8 unit tests for `validateSymlinkScope` covering accept, nested-accept, sibling-prefix attack, escape, broken realpath, not-a-directory, stat failure, and the degenerate self-target case; updated existing escape/broken tests in `skill-load.test.ts` / `skill-manager.test.ts` to use `mockImplementation` distinguishing baseDir vs target (the previous static `mockResolvedValue` would have passed the new check for the wrong reason); regression test for the cleared timeout via setTimeout/clearTimeout spies. * fix(skills): segment-aware symlink containment, accepts ..-prefixed names The previous `rel.startsWith('..')` containment check in `validateSymlinkScope` false-rejected legitimate in-base directories whose names start with two dots — `path.relative('/base', '/base/..shared/foo')` returns `'..shared/foo'`, which is a real filename shape, not a parent-traversal escape. Switch to a segment walk: `rel.split(/[/\\]/)[0] === '..'` correctly distinguishes: - `'../foo'` → segments[0] = '..' → escapes ✓ - `'..shared/foo'` → segments[0] = '..shared' → in-scope ✓ - `'..bar'` → segments[0] = '..bar' → in-scope ✓ - `'..\\foo'` (Win) → segments[0] = '..' → escapes ✓ Tests: two new regressions in `symlinkScope.test.ts` covering the multi-segment (`..shared/foo`) and single-segment (`..bar`) cases. --------- Co-authored-by: wenshao <wenshao@U-K7F6PQY3-2157.local> Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: yiliang114 <1204183885@qq.com>	2026-05-05 00:28:53 +08:00
Shaojin Wen	7f5b92b076	feat(core): hint to background long-running foreground bash commands (#3809 ) * feat(core): hint to background long-running foreground bash commands Phase D part (a) of Issue #3634. When a foreground `shell` tool call runs ≥ 60 seconds and completes (succeeds or errors), append an advisory line to the LLM-facing tool result suggesting re-running with `is_background: true` next time. Why: today a foreground bash that takes minutes (build watcher, soak test, slow npm install, polling loop) blocks the agent indefinitely. The user is already paying for the wait; the agent's next turn could have started running in parallel under `is_background: true`. Sleep interception (#3684) handled the egregious `sleep N` case at validate time; this handles the legitimate-but-long case at result time. Trade-offs: - Threshold = 60s. Half the existing 120s foreground timeout. Long enough that normal `npm install` / `pytest` runs don't trigger; short enough that the hint surfaces before the timeout hard-kills. - Advisory only — the command still runs to completion in the foreground for THIS invocation. The advice is for the agent's NEXT decision, not a corrective action on the current one. - Fires on success AND error completions. The advice is the same ("background it next time") in both cases. - Suppressed on aborted (timeout / user-cancel) — those paths already surface their own messaging and don't benefit from a "should have been background" reminder when the user / system already killed it. Implementation: - New constant `LONG_RUNNING_FOREGROUND_THRESHOLD_MS = 60000` in shell.ts, paired with the existing `DEFAULT_FOREGROUND_TIMEOUT_MS`. - Helper `buildLongRunningForegroundHint(elapsedMs)` exported so future surfaces (UI, telemetry) can render the same text without duplicating the threshold logic. - `Date.now()` bracketing around the spawn → `await resultPromise` block — mirrors what the background path already captures via `entry.startTime`. - Append happens inside the existing non-aborted result builder; zero changes to the cancel / timeout arms. Tests: 4 new cases — fires on long success, omits on short success, fires on long error completion, omits on aborted. Uses vi fake timers to drive wall-clock past the threshold without actually sleeping. * fix(core): tighten long-run hint suppression + boundary tests + post-truncation insertion Addresses 8 review threads on PR #3809 — 6 from /review bots, 2 from copilot — covering doc accuracy, code quality, behavioural gaps, and test coverage. Behavioural fixes (real bugs): - Suppress on external signal kills (`result.signal != null` with `aborted: false`). `shellExecutionService` only sets `aborted` when the AbortSignal we passed was triggered, so SIGTERM from container shutdown / k8s eviction / OOM killer / sibling process-group reap falls through to the non-aborted branch. The advisory shouldn't fire there — the process didn't run to its conclusion, so "next time, background it" doesn't fit. New test pins this with `signal: 15` (SIGTERM), `aborted: false`. - Append AFTER `truncateToolOutput`. Previously the hint was appended inside the non-aborted result builder, which meant for long outputs it got wrapped in the "Truncated part of the output:" envelope — the LLM might read the advisory as part of the command's own output. New post-truncation insertion + test that pins ordering by mocking `truncateToolOutput` directly (real path needs `fs.writeFile` to actually succeed for the replacement branch to fire). - Hint wording mode-aware. The dialog mention dropped the unconditional "(footer pill + Enter)" specifics, which would mislead non-TTY users (`-p` headless / ACP / SDK consumers — no dialog or pill exists there). Now qualified as "in interactive mode the Background tasks dialog also has...". `/tasks` and the on-disk output file are mentioned without qualifier (work in any mode). Code quality: - Threshold programmatically coupled to timeout: `LONG_RUNNING_FOREGROUND_THRESHOLD_MS = Math.floor(DEFAULT_FOREGROUND_TIMEOUT_MS / 2)`. If the timeout is tuned later, the threshold tracks automatically. - Docstring corrected: removed the misleading "before it gets killed by the timeout" claim — the hint is on non-aborted path only, so timeout-killed commands never see it. The new docstring enumerates all suppression paths explicitly. - Removed stale line-number reference: comment said "mirrors the background path's `entry.startTime` capture (line ~781)" which goes stale on file edits. Now refers conceptually. Test coverage gaps closed: - Off-by-one boundary: 59_999ms → no hint. Pairs with the existing 60_000ms-exactly test (which fires) to pin the boundary tightly. A regression flipping `>=` to `>` would fail loudly. - Timeout path explicit: previous "aborted" test exercised user- cancel only. With `vi.useFakeTimers({ toFake: ['Date'] })`, `AbortSignal.timeout()` doesn't fake (it depends on the real timer subsystem), so `combinedSignal.aborted` stayed false. New test follows the pre-existing `should handle timeout vs user cancellation correctly` pattern: stubs `AbortSignal.timeout` + `.any` to return an already-aborted combined signal, then verifies "Command timed out after Nms" appears AND no advisory. * fix(core): per-invocation long-run threshold + debug-mode + test isolation Six suggestions from /review's third pass on PR #3809: Real semantic fix: - Long-run threshold now scales with the EFFECTIVE timeout, not the fixed default. A user who sets `timeout: 600_000` (10 min) gets the advisory at 5 min, not at 60s — respects the explicit timeout intent. Replaced the `LONG_RUNNING_FOREGROUND_THRESHOLD_MS` constant with a per-invocation `longRunThresholdFor(effectiveTimeout)` helper. Debug-mode visibility: - Debug mode previously snapshotted `returnDisplayMessage = llmContent` BEFORE the truncation + hint append, so debug-mode users saw the pre-hint content while the agent saw the advisory — agent suddenly suggesting `is_background: true` had no visible trigger in the TUI. Re-sync `returnDisplayMessage` after the hint append (debug-mode branch only) so the TUI mirrors what the agent sees. Type-safety footgun: - `if (typeof llmContent === 'string')` would silently drop the hint if `llmContent` ever becomes structured `Part[]`. Added an explicit `else` comment documenting the deliberate omission and the conditions under which to revisit (no string llmContent path exists today). Style: - Replaced the JSDoc `/** ... /` block on the (now-defunct) constant with a plain `//` comment block on the helper, matching the `DEFAULT_FOREGROUND_TIMEOUT_MS` / `OUTPUT_UPDATE_INTERVAL_MS` style. Test hygiene: - Wrapped both `vi.stubGlobal('AbortSignal', ...)` and `vi.spyOn(truncateToolOutput, ...)` in `try/finally` so failures during the test body don't leak the stub/spy into subsequent tests (would cause confusing cascading failures). - Dropped the internal-roadmap "Phase D part (a)" reference from the test comment — future maintainers don't have the context. New test: - `threshold scales with the user-supplied timeout (not the default)`: sets `timeout: 600_000`, advances 100s, verifies no hint. Pins the per-invocation coupling so a regression to a fixed constant would fail loudly here. fix(core): tighten long-run hint suppression + boundary tests + post-truncation insertion (round 4) Six suggestions from /review's pai/glm-5-fp8 pass on PR #3809: Behavioural / UX: - Hint now visible in non-debug TUI too. Previously only debug mode mirrored the hint into `returnDisplay`; non-debug users saw the agent suggest `is_background: true` with no visible trigger. Now the hint is appended to `returnDisplayMessage` in both modes (full mirror in debug, terse-append in non-debug to preserve the output-or-status form). Test coverage: - Debug-mode re-sync test added. All other long-run hint tests run with `getDebugMode → false`; this one flips it to true and asserts the hint appears in `returnDisplay` too. Pins the re-sync so a regression that drops the debug branch would fail loudly. - Threshold-scaling positive test added. The negative case (`timeout: 600_000`, advance 100s, no hint) was already pinned; paired now with the positive case (advance 305s, hint fires) so a regression to a fixed 60s threshold is caught at both ends. Style / consistency: - `result.signal === null` (was `== null`). Strict equality to match the rest of the file. The `signal` field is typed `number \| null` so loose equality has identical semantics, but the inconsistency was noise. Doc clarity (timing semantics): - Comment explains why elapsedMs is computed BEFORE truncation. Two reviewers disagreed on the timing — one read it as before truncation (correct, slightly under-reports), the other as after (incorrect read). The intent is to report the COMMAND's runtime, not the tool call's total time. Truncation is post-processing, not part of "agent blocking time", so excluding it is the right semantic. Inline comment now spells this out so future readers don't have to infer. * fix(core): error-path hint surfacing + clock-resilient elapsed + threshold floor + observability Round 5 of PR #3809 review — 10 threads, mix of Critical and Suggestion: Critical fixes: 1. Hint survives the error path (`#OWbA`). When result.error is set, coreToolScheduler builds the model-facing functionResponse from `error.message` ONLY (not llmContent — see convertToFunctionResponse + the toolResult.error branch in scheduler:1648-1724). My hint was being silently dropped on long-command-failed cases. Now the hint is appended to error.message too so the advisory survives whichever branch the scheduler takes. 2. Hint wording de-ambiguated (`#OU6o`). "prefer re-running with is_background: true" was ambiguous — model could read it as "re-run THIS command in the background", which on stateful commands (DB migrations, deploys, git push) would cause double side effects. Reworded to "Next time you run a SIMILAR long-running process..." with an explicit parenthetical that warns against re-running the just-completed command. 3. Debug observability (`#OU6s`). Added `debugLogger.debug` at the hint decision point with elapsedMs / threshold / aborted / signal — when a user reports "my 65s command didn't get the hint" the suppression branch is now visible in DEBUG output. Other behaviour fixes: 4. Threshold floor of 1000ms (`#OU6r`). Pathological `timeout: 0` / `timeout: 1` would have given a 0-ms threshold, firing the hint on every invocation showing "ran for 0s". Floor at 1s makes that branch unreachable. 5. `performance.now()` instead of `Date.now()` (`#OU6v`). NTP corrections / VM clock drift between capture and read would silently make `elapsedMs` negative and skip the hint with no observable failure. Monotonic clock prevents that. 6. Debug mode preserves truncation marker (`#OU6w` / `#OWCq`). Previously `returnDisplayMessage = llmContent` after hint clobbered the "Output too long and was saved to: …" line appended during truncation. Switched to append-style re-sync in BOTH modes so prior content is preserved. Test coverage gaps closed: 7. Non-debug returnDisplay test (`#OWCo`). Pinned that the user TUI gets the hint in the default (non-debug) mode too. 8. Test rename (`#OWCl`). The "debug-mode TUI mirror" test passed in non-debug too after the recent refactor; split into two tests, one per branch. 9. Error-path hint test. Added a test that pins `result.error?.message` contains both the original error text AND the hint, covering the scheduler-routing-via-error.message path that was silently broken before fix #1. 10. Test: faketimers also fakes `performance`. Since we switched to `performance.now()`, `vi.useFakeTimers({ toFake: ['Date'] })` no longer covered the elapsed measurement; extended to `['Date', 'performance']` so the threshold tests can drive the wall-clock with `advanceTimersByTimeAsync`. #OU6t (else-comment for the type guard) was already addressed in the prior round — the explicit else-with-comment is in place; adding logging there would be noise. * test(core): cover the MIN_LONG_RUN_THRESHOLD_MS floor branch PR #3809 review: the new `Math.max(MIN_LONG_RUN_THRESHOLD_MS, ...)` floor in `longRunThresholdFor` was untested — only default-timeout and large-custom-timeout cases existed. A regression that strips the floor would let `timeout: 1` produce a 0ms threshold and fire a "ran for 0s" advisory on every invocation; the test suite would not catch it. New test: build with `timeout: 1`, advance 500ms (below the 1000ms floor), resolve with `aborted: false` to isolate the threshold logic from the abort path. Asserts no hint appears. A regression that removes the floor flips the assertion to fail. * fix(core): structured delimiter on error.message hint + tighten timeout floor comment Two of three threads from the latest /review pass on PR #3809 (the third — PR description / threshold scaling reconciliation — is fixed in the PR description update, not in code): - `\n---\n` divider before hint in `error.message` (`#Pt7C`). Downstream consumers of `error.message` (firePostToolUseFailureHook, telemetry grouping, SIEM alerting, hook-side error parsers) were receiving ~400 chars of advisory text mixed inline with the original error body — pattern-matching on error messages would absorb the advisory into the matched body. Added a `---` separator line so the boundary is unambiguous and split-able. - Threshold-floor comment narrowed to `timeout: 1` (`#Pu9o`). The comment said the floor guards `timeout: 0` / `timeout: 1`, but `validateToolParamValues` rejects `timeout <= 0` at validate time, so `timeout: 0` can't reach `longRunThresholdFor`. Updated the comment to mention only the actually-allowed pathological case (`timeout: 1` and any value `< 2` rounds to 0). Test updated to assert the `---` divider format with `toMatch`. * fix(core): capture executionStartTime AFTER spawn so PTY import isn't counted PR #3809 review: copilot caught that `executionStartTime` was captured BEFORE `await ShellExecutionService.execute(...)`, which meant the elapsed measurement included `getPty()` dynamic-import setup (~50-200ms on first call). The hint's "ran for Xs" reading was slightly inflated, and the comment claiming "spawn → settle" wasn't strictly accurate. Moved the capture immediately after the execute() call returns its { result, pid } handle. The pid being set by that point confirms the process has been spawned, so the subtraction is true post-spawn-to- settle. Comment updated to reflect the actual semantics. The displayed accuracy gain is small (50-200ms on a 60s+ threshold is <1%), but the comment claim now matches what the code measures. Tests unaffected — fakeTimers don't drive real dynamic imports, so the threshold tests behave identically. * fix(core): align long-run hint code/tests with ShellExecutionResult.error semantics Four copilot threads on PR #3809 — all rooted in the same observation: `ShellExecutionResult.error` is reserved for spawn/setup failures (per the field's doc comment in shellExecutionService.ts), NOT for non-zero exit codes. My existing code/tests conflated the two, making the error-path coverage less realistic and the inline comments inaccurate. Test shape fixes: - `appends the hint when a long-running foreground command exits with error` → `exits non-zero`. Changed `error: new Error('exit 1')` to `error: null` (the realistic shape for a non-zero exit without spawn failure). Added a comment explaining the field contract so future test authors don't repeat the conflation. - `hint survives the error path (appended to error.message)`: reframed the mock from `spawn ENOENT` (which would resolve in <1s in practice, making the long-elapsed scenario unrealistic) to `PTY initialization failed after 75s` — a slow-spawn-failure shape that COULD plausibly take 75s. Test still pins the same CODE PATH; comment now acknowledges the edge-case nature ("rare but real: PTY init dragging, remote-fs exec syscalls, security scanners interposing"). Comment corrections: - `returnDisplayMessage` build-order comment was misleading. It said "the hint is appended after both the truncation block and the returnDisplayMessage build" — but `returnDisplayMessage` is built BEFORE truncation. Replaced with a chronological enumeration (1. initial value, 2. truncation marker append, 3. hint append) that matches what the code actually does. - Error-path preservation comment now acknowledges the narrow applicability (spawn failures only, exit codes don't reach this branch). Code is unchanged — the path is still real, just rare. * test(core): pin empty-output success + background-no-hint paths Two defensive tests for the long-running foreground hint: - empty-output success at >=60s — exercises the returnDisplayMessage='' → hint append branch (write-only commands like `tar czf` / `cp -r` produce no stdout). Asserts the user- facing returnDisplay still surfaces the advisory even when the command produced nothing else to show. - background never includes the hint — the foreground hint logic lives in executeForeground only, so today this can't fail; the test guards against a future refactor hoisting the advisory into a shared post-execute path that would tag every background launch with a nonsensical "ran for 0s, consider is_background: true" suggestion.	2026-05-04 23:24:14 +08:00
Shaojin Wen	efb7351d58	feat(core): support reasoning effort 'max' tier (DeepSeek extension) (#3800 ) * feat(core): support reasoning effort 'max' tier (DeepSeek extension) DeepSeek's chat-completions endpoint added an extra-strong `max` tier to `reasoning_effort` (per https://api-docs.deepseek.com/zh-cn/api/create-chat-completion ; valid values are now `high` and `max`, with `low`/`medium` mapping to `high` for backward compat). Plumb it end-to-end: - `ContentGeneratorConfig.reasoning.effort` union now includes 'max'. - DeepSeek OpenAI-compat provider: translate the standard nested `reasoning: { effort }` shape into DeepSeek's flat `reasoning_effort` body parameter so user-configured effort actually takes effect (the nested shape was previously sent verbatim and silently ignored, defaulting to `high`). low/medium → high mirrors the documented server-side behavior so dashboards / logs match wire reality. An explicit top-level `reasoning_effort` (via samplingParams or extra_body) wins over the nested form. - Anthropic converter: pass 'max' through to `output_config.effort` unchanged and bump the `thinking.budget_tokens` budget for the new tier (low 16k / medium 32k / high 64k / max 128k). - Gemini converter: clamp 'max' to HIGH since Gemini has no higher thinking level. Without this, 'max' would silently fall through to THINKING_LEVEL_UNSPECIFIED. Live verification against api.deepseek.com: - `reasoning_effort: high` → 200 - `reasoning_effort: max` → 200 (the new tier) - `reasoning_effort: bogus`→ 400 with valid-set list confirming [high, low, medium, max, xhigh] 108 anthropic/openai-deepseek/gemini tests pass; full core suite (6601 tests) green; lint + typecheck clean. * fix(core): map xhigh→max + clamp max on non-DeepSeek anthropic + docs Address PR review (copilot × 2) and add missing user docs: 1. (J698) `translateReasoningEffort` claimed in the PR description that it surfaces the DeepSeek backward-compat mapping client-side, but only handled `low`/`medium` → `high`. Add `xhigh` → `max` to match the doc and stay symmetric with the low/medium branch. 2. (J6-A) `output_config.effort: 'max'` would have been emitted on any anthropic-protocol provider whenever a user configured it, even when the baseURL points at real `api.anthropic.com` (which only accepts low/medium/high and would 400). Reuse the existing `isDeepSeekAnthropicProvider` detector to clamp `'max'` → `'high'` on non-DeepSeek anthropic backends, with a debugLogger.warn so the downgrade is visible. DeepSeek anthropic-compatible endpoints still pass through unchanged. 3. New docs: - `docs/users/configuration/model-providers.md`: a "Reasoning / thinking configuration" section under generationConfig — single example targeting DeepSeek + a per-provider behavior table (OpenAI/DeepSeek flat reasoning_effort, OpenAI passthrough for other servers, real Anthropic clamp, Anthropic-compatible DeepSeek passthrough, Gemini thinkingLevel mapping). - `docs/users/configuration/settings.md`: extend the `model.generationConfig` description to mention `reasoning` (the field was undocumented before this PR even though it already existed as a typed field) and link to the new section. 96 anthropic + deepseek tests pass; lint + typecheck clean. * refactor(core): single-source effort normalization for anthropic + doc fix Address PR review round 2 (copilot × 2): 1. (J8aG) The `contentGenerator.ts` comment claimed passing `reasoning.effort: 'max'` to real Anthropic was "up to the user", but commit `b5b05ae` actively clamps 'max' → 'high' (with a debug log) on non-DeepSeek anthropic backends. Update the comment to describe current runtime behavior. 2. (J8aL) The clamp ran inside `buildOutputConfig()` only — the effort label was downgraded but `buildThinkingConfig()` still used the raw user value to size the budget, so a non-DeepSeek anthropic request could end up with `output_config.effort: 'high'` paired with a 'max'-sized 128K thinking budget. Inconsistent label vs. budget on the wire. Refactor: hoist the normalization into a single `resolveEffectiveEffort()` helper that runs once per request in `buildRequest()`. Both `buildThinkingConfig` and `buildOutputConfig` now consume the same clamped value, so the budget ladder and the effort label stay aligned. The debug log fires once per request. Add a regression test asserting that on a non-DeepSeek anthropic provider with `effort: 'max'` configured, the wire request carries both `output_config.effort: 'high'` AND `thinking.budget_tokens: 64_000` (the 'high' tier), not the 128K 'max' budget. 96 tests pass; lint + typecheck clean. * fix(core): tighten 'max' clamp + warn-once + strip reasoning_effort on side queries Address PR review round 3 (copilot × 3): 1. (J-2v) When request.config.thinkingConfig.includeThoughts is false, pipeline.buildRequest's post-processing only deleted the nested `reasoning` key. The DeepSeek provider's translateReasoningEffort may have already flattened an extra_body-injected reasoning into top-level `reasoning_effort` by that point, so a side query (e.g. suggestionGenerator) could still ship reasoning_effort on the wire. Extend the post-processing to also delete `reasoning_effort`. 2. (J-2z) The warn for clamping 'max' on non-DeepSeek anthropic ran on every request needing the downgrade — the docstring claimed "first time only" but the implementation didn't latch. Add a private `effortClampWarned` boolean on the generator so the warning fires once per generator lifetime. 3. (J-23) `resolveEffectiveEffort` used the broad `isDeepSeekAnthropicProvider` detector for the clamp decision, but that helper falls back to model-name matching to cover sglang/vllm self-hosted DeepSeek deployments. A model configured as e.g. "deepseek-distill" but routed to real api.anthropic.com would bypass the clamp and trigger HTTP 400. Split the detector: keep `isDeepSeekAnthropicProvider` (broad) for the thinking-block injection workaround where false-positives are harmless, and add `isDeepSeekAnthropicHostname` (hostname-only) for decisions where a model-name false-positive would route DeepSeek-only behavior to a stricter backend. The clamp now uses the hostname-only check. New regression test: a config with model name containing "deepseek" but baseURL pointing at api.anthropic.com still clamps `'max'` to `'high'`. Existing "passes max through" test updated to set a DeepSeek baseURL since model name alone no longer suffices for the clamp bypass. 385 tests pass; lint + typecheck clean. * docs(core): correct pipeline timing comment + samplingParams caveat Address PR review round 4 (copilot × 3) — three documentation accuracy fixes, no behavior change: 1. (KBcw) The post-processing comment in pipeline.ts misdescribed the call order ("after this branch already ran during the same buildRequest pass") — provider.buildRequest actually runs BEFORE the includeThoughts=false post-processing in the same pass. Reword to match the actual order: provider hook flattens nested reasoning to reasoning_effort first, this cleanup runs after and strips both shapes. 2. (KBdC, KBdE) The "Reasoning / thinking configuration" section in model-providers.md and the model.generationConfig description in settings.md both implied `reasoning` is honored on every provider. For OpenAI-compatible providers, when `generationConfig.samplingParams` is set, `ContentGenerationPipeline.buildGenerateContentConfig()` ships samplingParams verbatim and skips the separate `reasoning` injection entirely. Configs like `{ samplingParams: { temperature: 0.5 }, reasoning: { effort: 'max' } }` would silently drop the reasoning field on OpenAI/DeepSeek requests. Add an explicit "Interaction with samplingParams" warning section in model-providers.md and a parenthetical note in settings.md directing users to put `reasoning_effort` inside `samplingParams` (or `extra_body`) when both are configured. 385 tests pass; lint + typecheck clean. * docs(core): clarify explicit budget_tokens bypasses 'max' effort clamp When user sets `{ effort: 'max', budget_tokens: N }` on a non-DeepSeek anthropic backend, the effort label gets clamped to 'high' (otherwise the server 400s on the unknown enum) but the explicit budget_tokens is preserved verbatim. The wire-shape mismatch is intentional, not a bug: the clamp only protects the enum field, while budget is a free integer the server accepts within the context window, so an explicit override stays explicit. Document the contract on the early-return and add a regression test that locks it in. * docs(deepseek): fix comments to match flatten + reasoning-strip behavior Two doc-only nits called out in review: 1. `buildRequest` JSDoc said non-text parts are "rejected", but `flattenContentParts` actually substitutes a textual placeholder (`[Unsupported content type: <type>]`) so the request still goes through with a breadcrumb. Reword the JSDoc accordingly. 2. `translateReasoningEffort`'s strip comment claimed it strips the nested form to avoid shipping both shapes, but it only drops the duplicated `effort` key when other keys (e.g. `budget_tokens`) are present. Reword to describe the actual selective behavior and why keeping orthogonal keys is intentional. Behavior unchanged. * fix(deepseek): gate reasoning_effort translation on actual DeepSeek hostname The provider class is selected via the broader `isDeepSeekProvider` check, which falls back to model-name matching to cover self-hosted DeepSeek deployments (sglang/vllm/ollama, see #3613). That fallback is the right call for content-part flattening — it's a model-format constraint baked into the model itself, not the API surface. But the same broad detection was also gating `translateReasoningEffort`, which rewrites the standard `reasoning: { effort }` config into DeepSeek's flat `reasoning_effort` body parameter. That's a wire-shape decision, not a model-format one: strict OpenAI-compat backends in self-hosted setups may not accept the DeepSeek extension and would have happily handled the original shape. Split the two decisions: keep `isDeepSeekProvider` (broad) for flattening, add a hostname-only `isDeepSeekHostname` and gate the body rewrite on it. Self-hosted DeepSeek users who actually want the translation can either use a baseUrl containing api.deepseek.com or inject `reasoning_effort` directly via `samplingParams`/`extra_body`. Regression tests: - self-hosted (sglang) with deepseek-named model + nested `reasoning.effort` → flattening still runs, body shape preserved - `isDeepSeekHostname` matches api.deepseek.com but not custom hosts * fix(deepseek): use URL parsing in isDeepSeekHostname; fix log-level docs CodeQL flagged a high-severity URL substring sanitization issue on the new `isDeepSeekHostname` helper. The naive `baseUrl.includes('api.deepseek.com')` check would false-positive on hostile hosts like `https://api.deepseek.com.evil.com/v1` and incorrectly inject the DeepSeek-only `reasoning_effort` body parameter into requests routed elsewhere. Switch to `new URL(...).hostname` with exact match against `api.deepseek.com` (and `.api.deepseek.com` subdomains), mirroring `isDeepSeekAnthropicHostname` on the Anthropic side. Invalid URLs treated as non-DeepSeek. `isDeepSeekProvider` already routes through `isDeepSeekHostname`, so the hardening applies to both decision paths. Regression tests cover: - subdomain match (us.api.deepseek.com) - hostile substrings (api.deepseek.com.evil.com, evil.com/api.deepseek.com/v1, api.deepseek.comevil.com, api-deepseek-com.example.com) - invalid / empty baseUrl Also fix two doc-level mismatches: the `'max'` clamp on Anthropic logs via `debugLogger.warn` (warning level, once per generator), not "with a debug log". Update both `ContentGeneratorConfig.reasoning` JSDoc and the per-provider behavior table in model-providers.md. * feat(deepseek): emit thinking:disabled signal when reasoning is off DeepSeek V4+ defaults `thinking.type` to `'enabled'`, so just stripping `reasoning_effort` from the request leaves the server happily thinking on side queries — paying full thinking latency/cost without an effort configured. Per yiliang114's review, emit the explicit `thinking: { type: 'disabled' }` field on the wire whenever reasoning is disabled. Triggered when either: - `request.config.thinkingConfig.includeThoughts === false` (forked queries, e.g. suggestion generation) - `contentGeneratorConfig.reasoning === false` (config-level opt-out) The previous post-processing block only fired on the per-request opt-out path, so the config-level case was already leaking. Unify both under a single `reasoningDisabled` predicate that runs the same strip + signal logic. Hostname-gated to `api.deepseek.com` (and subdomains): self-hosted DeepSeek behind sglang/vllm/ollama, or older DeepSeek versions, may not accept the V4 thinking parameter — pushing it there could trip an unknown-key 400. Mirrors the round-7 decision to gate `reasoning_effort` translation on hostname. Regression tests cover all four matrix points: - DeepSeek hostname + includeThoughts false → emits disabled - DeepSeek hostname + reasoning false → emits disabled - non-DeepSeek hostname + includeThoughts false → does not emit - self-hosted DeepSeek (model-name fallback only) → does not emit Docs: extend the `reasoning: false` section with the new behavior and the self-hosted/non-DeepSeek caveat. * refactor(deepseek): expose isDeepSeek* as free functions; clarify docs Two doc/coupling nits from review: 1. The pipeline post-processing block was importing the concrete `DeepSeekOpenAICompatibleProvider` class just to reach `isDeepSeekHostname`. That couples the generic OpenAI pipeline to a specific provider implementation. Promote the helper (and its broad `isDeepSeekProvider` sibling) to free `export function`s in `provider/deepseek.ts` and import them by name. The class keeps thin static delegates for backward compat with existing callers and tests. 2. The per-provider behavior table on `model-providers.md` said `'low'/'medium' → 'high'` and `'xhigh' → 'max'` "client-side", but that normalization only fires inside `translateReasoningEffort`, which runs on the nested `reasoning.effort` config path. Explicit top-level overrides via `samplingParams.reasoning_effort` or `extra_body.reasoning_effort` skip the rewrite and ship verbatim. Reword the row to reflect that. Behavior unchanged.	2026-05-04 22:42:23 +08:00
Shaojin Wen	fcefab6df5	fix(core): clear FileReadCache on every history rewrite path (#3810 ) * fix(core): clear FileReadCache after microcompaction Microcompaction (the idle-cleanup pass that runs at the start of every new user/cron message) replaces old read_file / shell / glob / grep / edit / write_file tool outputs with a `[Old tool result content cleared]` placeholder. The FileReadCache, however, still records the prior full Reads as "seen in this conversation" — so the next ReadFile of an unchanged file returns the file_unchanged placeholder pointing at bytes the model can no longer retrieve from history. The result is a Read that succeeds at the tool layer but delivers no usable content to the model, which is the failure mode reported in #3805 ("read tool returns no content in long-running sessions"). This mirrors the existing post-compaction clear in tryCompressChat — microcompaction has the same "history rewrite invalidates the cache's 'model has seen this' assumption" property, it was just missed when the cache was wired in. * fix(core): clear FileReadCache on every history rewrite path PR1 only patched microcompaction, but a multi-round audit found four more entry points that rewrite history without clearing the cache, producing the same `file_unchanged` placeholder vs. missing-content mismatch. Each is fixed in the same minimal way (clear() at the call site) and covered by a regression test: - GeminiClient.setHistory — /restore checkpoint, /load_history - GeminiClient.truncateHistory — rewind in AppContainer - GeminiClient.resetChat — public API; clearCommand happens to clear the cache via startNewSession beforehand, but other callers have no such guarantee - stripOrphanedUserEntriesFromHistory — Retry path drops trailing user entries that may include read_file functionResponses Also tightened the microcompaction comment ("compactable tool outputs" instead of an enumerated list, since the source of truth is microcompact.COMPACTABLE_TOOLS) and removed caller references per the codebase comment style. Reverse-tested every new clear() by commenting it out and confirming the matching regression test fails. * test(core): integration test for FileReadCache + history rewrite End-to-end tests using the real ReadFileTool, real FileReadCache, real microcompactHistory, and a real on-disk file. Three cases: 1. Without a cache clear after microcompact, the second Read of an unchanged file returns the file_unchanged placeholder while the prior content has already been wiped from history. Demonstrates the failure mode this PR fixes. 2. After an explicit cache.clear(), the second Read re-emits the real bytes. Demonstrates that the fix works. 3. When microcompact removes every prior read of a file, the placeholder leaves zero recoverable bytes — the model literally cannot find the content anywhere it can reach. These complement the existing unit tests in client.test.ts (which verify the call-site wiring) by proving the end-to-end behaviour through the real code paths, without mocks. * chore(core): add traceable debug log for every FileReadCache clear Per review feedback: the new clear() call sites were silent, leaving no breadcrumb in production debug streams when the cache is dropped. Adds a `[FILE_READ_CACHE] clear after <reason>` log at every clear site (5 new + 1 pre-existing in tryCompressChat) so operators can grep one prefix and see why the cache was invalidated. * chore(core): refine truncateHistory cache clear + extract test helper Per review feedback (deepseek-v4-pro): 1. truncateHistory now skips the cache clear when keepCount >= prevLen, since a no-op truncate leaves the cache valid against the unchanged history. Adds a regression test covering both keepCount==prevLen and keepCount>prevLen. 2. The 6 cache-spy test cases each repeated the same 4-line mock setup. Extract a `mockFileReadCacheClear()` helper so future changes to the FileReadCache mock surface only need one edit. Both are quality-of-implementation tweaks; the underlying fix is unchanged. * perf(core): use O(1) getHistoryLength in truncateHistory Per Copilot review feedback: the previous commit's no-op detection in truncateHistory called this.getChat().getHistory().length, but GeminiChat.getHistory() does a structuredClone of the entire history on every call (line 770 of geminiChat.ts) — paying an O(history) clone purely to read .length. In long-running sessions with hundreds of entries this is a meaningful regression. Adds GeminiChat.getHistoryLength(): O(1), no clone. truncateHistory switches to it. The behaviour (skip clear when keepCount >= prevLen) is unchanged. Also adds: - Unit tests for GeminiChat.getHistoryLength (empty, after addHistory, parity with getHistory().length). - A regression test asserting truncateHistory calls getHistoryLength and NOT getHistory, locking in the perf fix against future drift. * fix(core): close NaN hole + use public ReadFileTool API in tests Two issues from copilot review: 1. NaN edge case in truncateHistory cache invalidation. The "did anything actually change?" check was `keepCount < prevLen`, but `Array.slice(0, NaN)` returns [] (history wiped) while `NaN < prevLen` is false. That sequence would wipe the chat but leave the FileReadCache claiming the model has seen the prior reads — exactly the file_unchanged placeholder bug this PR is closing. Switched the check to compare actual post-truncate length (`newLen < prevLen`), which correctly invalidates whenever entries were removed regardless of how `keepCount` was malformed. Added a NaN regression test. 2. The integration test cast `tool` to `unknown` to reach the protected `createInvocation()` method. Switched to the public `tool.buildAndExecute(params, signal)` API so the test exercises the same surface real callers use, including build-time schema validation.	2026-05-04 22:42:06 +08:00
Shaojin Wen	fbcf59e0b3	docs(core): point background-shell + monitor guidance at both /tasks and the dialog (#3808 ) Some checks are pending Qwen Code CI / Lint (push) Waiting to run Details Qwen Code CI / Test (push) Blocked by required conditions Details Qwen Code CI / Test-1 (push) Blocked by required conditions Details Qwen Code CI / Test-3 (push) Blocked by required conditions Details Qwen Code CI / Test-4 (push) Blocked by required conditions Details Qwen Code CI / Test-5 (push) Blocked by required conditions Details Qwen Code CI / Test-6 (push) Blocked by required conditions Details Qwen Code CI / Test-7 (push) Blocked by required conditions Details Qwen Code CI / Test-8 (push) Blocked by required conditions Details Qwen Code CI / Test-2 (push) Blocked by required conditions Details Qwen Code CI / Post Coverage Comment (push) Blocked by required conditions Details Qwen Code CI / CodeQL (push) Waiting to run Details E2E Tests / E2E Test (Linux) - sandbox:docker (push) Waiting to run Details E2E Tests / E2E Test (Linux) - sandbox:none (push) Waiting to run Details E2E Tests / E2E Test - macOS (push) Waiting to run Details SDK Python / SDK Python (3.10) (push) Waiting to run Details SDK Python / SDK Python (3.11) (push) Waiting to run Details SDK Python / SDK Python (3.12) (push) Waiting to run Details * docs(core): point background-shell guidance at both /tasks and the dialog Follow-up to PR #3801, fulfilling the "separate small PR" commitment in its description. The two model-facing strings (`shell.ts` after spawning a background shell, `task-stop.ts` after requesting cancel) referenced only `/tasks` as the inspection path, predating the interactive Background tasks dialog landing at #3488 / #3720 / #3791. Now that the dialog handles all three kinds (agent / shell / monitor), both surfaces should be visible to the LLM so it can suggest the right one based on the user's mode. Updates: - `shell.ts:865` (LLM message after `is_background: true` spawn) now surfaces both `/tasks` (text, any mode) AND the interactive dialog (footer pill + Enter, with detail view + live updates). Output file guidance retained. - `task-stop.ts:110` (LLM message after `task_stop` on a shell) same pattern: both surfaces named. - `task-stop.ts:95` comment updated to enumerate all observation paths (including the dialog). - `monitorRegistry.ts:197` comment fixed — said "/tasks dialog" which conflated two distinct surfaces. Split to "Background tasks dialog reopens or `/tasks` listings". - `backgroundShellRegistry.ts:10` (module docstring) and `:31` (shellId doc) now mention all three consumers (agent, dialog, slash command). No behavior change — pure documentation/string update. Tests untouched (none asserted on these exact strings); build + lint + 152-test core suite all clean. * docs(core): address PR 3808 review — 'captured output' + consistent ordering Three review nits: 1. (LoqU — copilot) `shell.ts:865` said the output file holds "raw content", but `shellExecutionService` runs each chunk through stripAnsi and skips non-string/binary chunks before writing. Reword to "captured output" so callers don't expect a byte-for-byte stream. 2. (LqKr — wenshao) The PR mentioned both surfaces in two different orders depending on the file: `backgroundShellRegistry.ts` listed the dialog first, while `task-stop.ts` and `shell.ts` listed `/tasks` first. Unify on the LLM-facing order — `/tasks` first, then the interactive Background tasks dialog — across all four sites. Also flips the line-31 docstring on the `shellId` field for the same reason. 3. (LqKt — wenshao, flagged for awareness only) Trim the redundant keystroke detail in shell.ts:865 to match `task-stop.ts:111`'s shorter "(footer pill + Enter)" form. Saves ~7 tokens per background shell launch in `llmContent` while still naming both surfaces. The PR description's rationale (LLM should know both surfaces exist so it can suggest the right one for the user's mode) is preserved — only the operational verbosity is trimmed. 581 tests pass; lint + typecheck clean. Pure docs / string update. * docs(core): grammar polish on PR 3808 strings Two more wording nits from copilot review: - backgroundShellRegistry.ts:10 — change "metadata the agent…need to query" to "metadata that the agent…use to query". The original phrasing reads as if the metadata itself is performing the query. - shell.ts:865 — change "Read the output file directly for the captured output." to "Read the output file directly to view the captured output." — clearer instruction to the model/user. Pure wording, no behavior change. * docs(core): grammar fix on PR 3808 monitor comment 'not visible from later Background tasks dialog reopens' read as if 'reopens' was a noun. Reword to 'not visible after reopening the Background tasks dialog or from /tasks listings'. * docs(core): round 4 wording polish on PR 3808 Four more nits from copilot: - shell.ts:865 + task-stop.ts:96,111: "footer pill + Enter" was ambiguous now that the footer renders multiple pills (background tasks vs other status indicators). Disambiguate to "focus the footer Background tasks pill, then Enter". - monitorRegistry.ts:198: re-tweak my round-3 phrasing — "after reopening the Background tasks dialog or from /tasks listings" → "in later Background tasks dialog reopens or /tasks listings". Reads as "from those surfaces" rather than "after reopening", which the reviewer found ungrammatical. - backgroundShellRegistry.ts:10,31: clarify "/tasks" as the slash command, since the codebase also uses "<projectDir>/tasks/..." on-disk paths in agent-transcript contexts. Pure wording, no behavior change. 87 affected tests pass. * docs(core): mirror /tasks + dialog guidance to monitor llmContent paths Address @doudouOUC review on PR #3808 — two Medium findings: this PR updated shell-facing strings to mention both inspection surfaces but left the parallel monitor strings without any inspection guidance, even though monitors render in the same /tasks output and the same Background tasks dialog. Restore symmetry: - monitor.ts:587-598 — append the same "/tasks (text) or the interactive Background tasks dialog (focus the footer Background tasks pill, then Enter — detail view + live updates)" sentence to the Monitor-started llmContent, mirroring shell.ts:865. - task-stop.ts:125-131 — the monitor cancellation llmContent had no guidance at all. Add the same "Final status will be visible via /tasks (text) or the interactive Background tasks dialog (focus the footer Background tasks pill, then Enter) once the process drains" line that already existed for shells at task-stop.ts:111. The (Low) commit-churn observation is a maintainer call (squash on merge); the (Info) snapshot-test gap is pre-existing and not in scope. 78 monitor + task-stop tests pass; lint + typecheck clean. * docs(core): drop drain phrasing for monitor cancel + restructure dialog comment Address PR #3808 review round 5 (doudouOUC + copilot × 2): 1. (XNoH copilot, XSBu doudouOUC — Medium) The monitor cancellation message inherited "once the process drains" from the shell branch, but `monitorRegistry.cancel()` settles synchronously — when the tool returns, the entry is already `cancelled`, not waiting on a child process. The drain qualifier is accurate for shells (which use `requestCancel()` + the AbortController and settle when the real process exits) but misleading for monitors. Reword the monitor branch in `task-stop.ts:121-130` to drop the drain phrasing and add an explanatory comment about the sync vs. async difference so future maintainers don't replicate the wording from the shell branch by reflex. 2. (XNod copilot — wording, third revision on the same comment) Restructure rather than re-litigate the preposition. The "reopens" noun framing has gone through three rounds of churn (`from later... reopens` → `after reopening...` → `in later... reopens` → and now back to `from`). Sidestep the loop by making the comment a proper sentence about WHAT the surfaces actually read: the persisted `entry.error` is the source of truth; the chat-history notification is a separate, ephemeral side channel. Avoids the noun-form "reopens" entirely. Updated test assertion to match the new "Monitor \"...\" cancelled" prefix. 51 tests pass; lint + typecheck clean.	2026-05-04 22:27:00 +08:00
jinye	2fea1d3aa7	fix(core): address post-merge monitor tool and UI routing issues (#3792 ) * fix(core): address post-merge monitor tool and UI routing issues - Guard token bucket against clock drift after system suspend/resume (negative elapsed resets lastRefill instead of starving the bucket) - Add debugLogger.warn for AST read-only check failures in monitor getConfirmationDetails (previously silent catch) - Consolidate SHELL_TOOL_NAMES: export from rule-parser.ts, import in permission-manager.ts (removes identical SHELL_LIKE_TOOLS duplicate) - Extract hasBlockingBackgroundWork/resetBackgroundStateForSessionSwitch to shared backgroundWorkUtils.ts (removes identical copies in clearCommand.ts and useResumeCommand.ts) - Consolidate getToolCallComponent routing into packages/webui (removes near-identical copies in ChatViewer.tsx and vscode-ide-companion, adds missing web_search compat alias to VSCode path) - Add test for droppedLines count in terminal notification text - Add test for exit(null, null) settlement (externally killed process) 🤖 Generated with [Qwen Code](https://github.com/QwenLM/qwen-code) * fix(core,cli,vscode): address PR #3792 review feedback - Add debugLogger.warn for clock-drift guard (observability for throttle bucket resets after suspend/resume) - Add test for clock-drift recovery (elapsed < 0 bucket reset) - Add test for AST parse failure catch path (mockRejectedValueOnce) - Forward isFirst/isLast props through VSCode ToolCallRouter (fixes timeline connector rendering) - Add test for shell running branch in hasBlockingBackgroundWork 🤖 Generated with [Qwen Code](https://github.com/QwenLM/qwen-code) * fix(core,vscode): address follow-up review comments - Fix React.FC missing import: use `import type { FC } from 'react'` instead of `React.FC` (original file had this import before refactor) - Tighten clock-drift test: emit while clock is in the past to confirm guard resets lastRefill, then verify refill at the new reference point 🤖 Generated with [Qwen Code](https://github.com/QwenLM/qwen-code) * fix(core,cli): adopt review feedback — ReadonlySet + hasRunningEntries - Type SHELL_TOOL_NAMES as ReadonlySet<string> to prevent accidental mutation of permission-critical set - Use BackgroundShellRegistry.hasRunningEntries() instead of getAll().some() for zero-allocation short-circuit check - Update clearCommand test mocks to include hasRunningEntries 🤖 Generated with [Qwen Code](https://github.com/QwenLM/qwen-code) * fix(core,webui,vscode): address remaining PR #3792 review comments - Merge AgentToolCall + isAgentExecutionToolCall into single import in routing.ts (comment 3178280762) - Use real getToolCallComponent via vi.importActual in VSCode test mock so routing logic is validated, not a parallel mock that can drift (comment 3178280775) - Validate isFirst/isLast forwarding in VSCode test mock via data attributes (comment 3178346891) - Add comment documenting debugLogger.warn no-op tradeoff for clock drift guard (comment 3178346889) 🤖 Generated with [Qwen Code](https://github.com/QwenLM/qwen-code) * test(webui): add unit test for getToolCallComponent routing Covers all 8 component branches including the web_search compatibility alias, agent execution detection, case-insensitive matching, and fallback to GenericToolCall. 🤖 Generated with [Qwen Code](https://github.com/QwenLM/qwen-code) * fix(cli): add missing hasRunningEntries to useResumeCommand test mocks The backgroundWorkUtils refactor replaced getAll().some() with hasRunningEntries(), but the test mocks in useResumeCommand.test.ts were not updated, causing CI failures. 🤖 Generated with [Qoder Code](https://github.com/QwenLM/qwen-code) * test(cli): add unit tests for backgroundWorkUtils shared utility Cover hasBlockingBackgroundWork (6 cases including short-circuit behaviour) and resetBackgroundStateForSessionSwitch (1 case verifying all three registries are reset). 🤖 Generated with [Qoder Code](https://github.com/QwenLM/qwen-code) --------- Co-authored-by: jinye.djy <jinye.djy@alibaba-inc.com>	2026-05-04 21:19:41 +08:00
jinye	03f66bada5	feat(sdk-python): add PyPI release workflow (#3685 ) * feat(sdk-python): add pypi release workflow Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com> * fix(sdk-python): build cli before smoke test Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com> * fix(sdk-python): tighten release conflict handling Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com> * fix(sdk-python): harden python release workflow Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com> * fix(sdk-python): tighten stable release guards Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com> * fix(sdk-python): harden prerelease publish flow Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com> * fix(sdk-python): reuse release branches on rerun Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com> * fix(sdk-python): resume incomplete releases Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com> * fix(release): tighten missing-release checks Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com> * fix(sdk-python): resume stable release reruns Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com> * fix(sdk-python): tighten release recovery guards Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com> * test(sdk-python): cover release version edge cases Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com> * fix(sdk-python): address release workflow review feedback Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com> * refactor(sdk-python): address review feedback on release version script - Remove unreachable `if (type === 'stable')` branch in bumpVersion(); the stable path was dead code since getVersion() throws for all stable conflicts before calling bumpVersion(). Move nightly conflict throw to the call site for symmetry. - Rename getNextPatchBaseVersion → getNextBaseVersion to reflect that the function can return a prerelease base without incrementing patch. - Add test for preview+nightly coexistence where nightly base is higher. 🤖 Generated with [Qwen Code](https://github.com/QwenLM/qwen-code) * fix(sdk-python): address remaining review feedback on release workflow - Fix failure-issue gate to read github.event.inputs.dry_run directly instead of steps.vars.outputs.is_dry_run (which is empty when early steps fail). Add --repo flag for gh issue create when checkout failed. - Add diagnostic state table to failure-issue body (RELEASE_TAG, PACKAGE_VERSION, PUBLISH_CHANNEL, RESUME_EXISTING_RELEASE, etc.) - Fix release-notes error swallow: only silence release not found / Not Found / HTTP 404, emit :⚠️: for other gh release view errors. - Improve validateVersion error messages to use human-readable format keys (X.Y.Z, X.Y.Z-preview.N) matching TS sibling convention. - Filter fully-yanked versions in getAllVersionsFromPyPI. - Add console.error log when stable is derived from nightly. - Add bash regex guard for inputs.version to prevent shell injection. - Use per-release-type concurrency groups (nightly/preview/stable). - Add jq null-guard checks for all 6 field extractions. - Remove misleading --follow-tags from git push (lightweight tags). 🤖 Generated with [Qwen Code](https://github.com/QwenLM/qwen-code) * fix(sdk-python): rename misleading test description The test asserts that preview/nightly releases return empty previousReleaseTag, but the name said "same-channel previous release tags" which implied non-empty values. 🤖 Generated with [Qwen Code](https://github.com/QwenLM/qwen-code) * fix(sdk-python): address unresolved review comments on release workflow - Remove -z check in extract_field() that blocked preview/nightly releases (previousReleaseTag is legitimately empty for non-stable releases) - Use static environment.url since step outputs aren't available at job startup - Use skip-existing for resumed PyPI publish to fill in missing artifacts - Add AbortSignal.timeout(30s) to PyPI fetch to prevent indefinite hangs - Add downgrade guard for stable_version_override - Use GHA :⚠️: annotation instead of console.error for visibility - Separate yanked/non-yanked version lists so conflict detection includes yanked versions (PyPI still reserves those slots) - Filter current release from previousReleaseTag to avoid self-reference on resume - Add tests for yanked conflict detection, downgrade guard, and resume previousReleaseTag Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com> * fix(sdk-python): address final review round on release version script - Fix getNextBaseVersion() first-release skip: use pyproject.toml version directly when PyPI has no stable versions instead of unconditionally incrementing - Fix getNextBaseVersion() off-by-one: change > to >= so equal prerelease base continues the existing line instead of incrementing patch - Add :⚠️: annotation when preview auto-bumps due to orphan git tags (tag exists without PyPI version or GitHub release) - Add set -euo pipefail to 5 workflow steps missing it: release_branch, persist_source, Create GitHub release, Delete prerelease branch, Create issue on failure - Fix 2 existing tests affected by first-release change, add 4 new tests 🤖 Generated with [Qwen Code](https://github.com/QwenLM/qwen-code) * fix(sdk-python): use stderr for GHA warning annotations to avoid corrupting JSON stdout console.log writes to stdout, which gets captured by VERSION_JSON=$(node ...) in the workflow and corrupts the JSON output for jq. Switch to console.error so :⚠️: annotations go to stderr (GHA recognizes workflow commands on both streams). Also add set -euo pipefail to the "Get the version" step for consistency with other workflow steps. 🤖 Generated with [Qwen Code](https://github.com/QwenLM/qwen-code) --------- Co-authored-by: jinye.djy <jinye.djy@alibaba-inc.com> Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>	2026-05-04 21:07:21 +08:00
wenshao	eecb0cfa16	fix(attribution): skip values for env -u NAME and -S string `env`'s value-taking flags (`-u`/`--unset`, `-S`/`--split-string`) were not in the wrapper's flag-skip allowlist, so `env -u FOO git commit ...` left FOO as the next token and the parser treated it as the program — masking the real `git commit` from attribution detection. Add an ENV_FLAGS_WITH_VALUE table mirroring the sudo allowlist. Regression test added.	2026-05-04 06:45:40 +08:00
wenshao	8891b83b85	fix(attribution): last-match --body, symlink leaf canonicalisation, scoped prompt count - addAttributionToPR: use matchAll/last-match for `--body`/`-b` so the trailer lands in the gh-honoured (final) body when multiple flags are present. Mirrors addCoAuthorToGitCommit. Adds regression test. - attachCommitAttribution: also fs.realpathSync the per-file resolved path (not just the repo root) so files behind intermediate symlinks are matched against canonical keys recordEdit stored, instead of silently zeroing attribution and leaking entries past commit. - incrementPromptCount: scope to SendMessageType.UserQuery — ToolResult, Retry, Hook, Cron, Notification are model/background re-entries of the same logical turn. Tracking them all inflated the "N-shotted" trailer (one user message could become 10-shotted with 10 tool calls). - AttributionSnapshot: add `version: 1` field; restoreFromSnapshot now refuses incompatible versions and validates per-field types so a partially-written snapshot can't seed `Math.min(undefined, n) === NaN` into git-notes payloads. - Drop unused permission/escape counters (declared, persisted, never read or incremented) — fields, snapshot tolerance, and clear-method bookkeeping all removed; AttributionSnapshot interface simplifies. - isGeneratedFile: switch directory rule from substring `.includes('/dist/')` to segment-boundary check (split on `/`) so project dirs like `my-dist/` or `xbuild/` don't match. `.lock` removed from the blanket extension exclusion — well-known lockfiles already covered by EXCLUDED_FILENAMES; hand-authored `.lock` files (e.g. `.terraform.lock.hcl`) now stay attributable. - getClientSurface: document `QWEN_CODE_ENTRYPOINT` as the embedder override hook so the always-`'cli'` default is intentional.	2026-05-04 06:42:26 +08:00
wenshao	d78d4cfe22	Merge remote-tracking branch 'origin/main' into feat/commit-attribution # Conflicts: # packages/core/src/tools/shell.test.ts # packages/core/src/tools/shell.ts	2026-05-04 06:36:11 +08:00
jinye	e617f20d15	fix(telemetry): suppress async resource attribute warning on startup (#3807 ) Some checks are pending Qwen Code CI / CodeQL (push) Waiting to run Details Qwen Code CI / Lint (push) Waiting to run Details Qwen Code CI / Test (push) Blocked by required conditions Details Qwen Code CI / Test-1 (push) Blocked by required conditions Details Qwen Code CI / Test-2 (push) Blocked by required conditions Details Qwen Code CI / Test-3 (push) Blocked by required conditions Details Qwen Code CI / Test-4 (push) Blocked by required conditions Details Qwen Code CI / Test-5 (push) Blocked by required conditions Details Qwen Code CI / Test-6 (push) Blocked by required conditions Details Qwen Code CI / Test-7 (push) Blocked by required conditions Details Qwen Code CI / Test-8 (push) Blocked by required conditions Details Qwen Code CI / Post Coverage Comment (push) Blocked by required conditions Details E2E Tests / E2E Test (Linux) - sandbox:docker (push) Waiting to run Details E2E Tests / E2E Test (Linux) - sandbox:none (push) Waiting to run Details E2E Tests / E2E Test - macOS (push) Waiting to run Details * fix(telemetry): suppress async resource attribute warning on startup Disable NodeSDK's default auto-detection of host/process/env resources. These detectors return async attributes; if a span is exported before they settle, OTel logs an error-level diag message that surfaces in the terminal UI. The attributes we actually need (service name, version, session.id) are already provided synchronously via resourceFromAttributes. Closes part of #3731 🤖 Generated with [Qwen Code](https://github.com/QwenLM/qwen-code) * fix(telemetry): add comment and test for autoDetectResources: false Address review feedback: - Add inline comment explaining why auto-detection is disabled (avoids async attribute warning before first span export) - Add test assertion that NodeSDK is constructed with autoDetectResources: false to prevent silent regression 🤖 Generated with [Qwen Code](https://github.com/QwenLM/qwen-code) * fix(telemetry): improve comment precision and extend autoDetectResources test coverage - Clarify comment: error fires on any resource attribute read before async detectors settle, not just during export; HttpInstrumentation span creation is the typical trigger - Add autoDetectResources: false assertion to HTTP OTLP and file exporter test cases for branch-complete coverage across all three NodeSDK construction paths 🤖 Generated with [Qwen Code](https://github.com/QwenLM/qwen-code)	2026-05-03 19:36:03 +08:00
Shaojin Wen	07fdfadc33	feat(cli): include monitors in /tasks + add interactive-mode hint (#3801 ) * feat(cli): include monitors in /tasks + add interactive-mode hint Phase B closure for Issue #3634. Two coupled changes to /tasks: 1. Bug fix — include monitors. The command was last touched before #3684 / #3791 landed, so it merged only agent + shell entries while monitors silently disappeared from the headless / non-interactive / ACP listing path. Add a third registry pull from `getMonitorRegistry()` and wire monitor through statusLabel / taskLabel / taskId / taskOutputPath. Status line includes eventCount (`running (N events)`, `completed (exit 0, N events)`, `completed (Max events reached, N events)` for auto-stop) and pid where defined. 2. Soft deprecation hint, scoped to interactive mode only. Once the richer Ctrl+T dialog (#3488 + #3720 + #3791) is available, the text dump is the long-form fallback rather than the primary surface. Show `Tip: Ctrl+T opens the interactive Background tasks dialog with detail view + live updates.` at the top of the output when `executionMode === 'interactive'`. Headless / ACP get the bare list — they have no dialog to point at and the hint would just clutter. Description string also clarified to call out the modal split. Kept on all three executionModes (no deletion) — `/tasks` is the only way headless / ACP / SDK consumers can inspect background-task state. Tests: 4 new cases in tasksCommand.test.ts cover monitor entry formatting (running with pid, natural completion with exitCode, auto-stop with error string, failed), the singular `1 event` form, the interactive-mode hint gating, and the cross-kind merge order. * fix(cli): address PR 3801 review — exhaustive switch + i18n + extra tests Three actionable Suggestions from /review's pass: - `taskLabel` rewritten as a `switch` with a `never`-typed `default` arm, matching the structural-safety pattern already used by `taskId`. Adding a 4th DialogEntry kind in the future will now flip both helpers to compile errors instead of letting `taskLabel` silently fall through to `entry.description` (which the new kind may not have). - Hint string wrapped in `t()` for i18n consistency with the rest of the file. The literal stays as the i18n key default, so today's output is unchanged. - Tests: cover `cancelled` monitor status (was the only one without an inline assertion) and explicit `acp` execution mode hint suppression (pins the suppression rationale so a future regression flipping the check to `!== 'non_interactive'` would fail loudly). * fix(cli): correct /tasks dialog-open hint — Ctrl+T was wrong Tmux verification on PR #3801 caught that the hint string says "Ctrl+T opens the interactive Background tasks dialog" but Ctrl+T is actually bound to the MCP tool descriptions toggle (ContextSummaryDisplay.tsx lines 110-115). The dialog opens via Down arrow on an empty composer (focuses the footer pill) followed by Enter (InputPrompt.tsx 947-968). Same misattribution slipped into PR #3791's first description and was caught + fixed there before merge — this PR carried the wrong wording forward in code. Updates four sites: - The hint string itself: "Tip: press ↓ from an empty composer then Enter to open the interactive Background tasks dialog with detail view + live updates." - The slash-command description: "interactive UI is Ctrl+T" → "interactive dialog opens via the footer pill" - Two inline comments referencing Ctrl+T as the dialog opener - The interactive-mode hint test now pins on `↓` + `Enter` and asserts `not.toContain('Ctrl+T')` so a regression to the wrong wording fails loudly. * fix(cli): address PR 3801 review — exhaustive switch consistency + path-agnostic hint Four Suggestions from the latest /review pass: - `statusLabel` rewritten as a single top-level switch with a `never`-typed default, matching `taskLabel` / `taskId` / `taskOutputPath`. The previous `if`/`if`/fallthrough form would silently apply monitor formatting to a future 4th kind. - `taskOutputPath` gained the same exhaustive default — was the only per-kind helper still relying on implicit fallthrough; would silently omit a 4th-kind output path while the adjacent helpers flip to compile errors. - Hint wording de-specifies the exact keystroke count: `'Tip: focus the Background tasks pill in the footer (use ↓ from an empty composer) and press Enter ...'`. Previous "press ↓ then Enter" phrasing was wrong when the Arena agent tab bar is present — `InputPrompt`'s focus chain routes Down through the tab bar first, so a single Down lands there, not on the bg pill. - Test pin tightened: `[mon_fail] failed: spawn ENOENT (0 events)` is now a full-string assertion instead of a prefix match, so a regression that drops the `(N events)` suffix from monitor's failed branch fails loudly. * fix(cli): sanitize ANSI escape sequences in /tasks output deepseek's review pass flagged that monitor description / error fields are user / process-supplied strings rendered directly to the terminal. A maliciously-crafted tool description or spawn error containing raw ANSI control sequences (clear-screen, cursor-move, colour) would otherwise reach stdout verbatim and corrupt display. Same risk applies to agent error / description and shell error / command — all already-existing renderers with the same exposure that this PR didn't introduce but inherits. So instead of per-field sprinkling, wrap the joined output once with `escapeAnsiCtrlCodes` (no-op when no control chars present, so cost is zero in the common case). One line change in the renderer covers every kind including any future one. Test pins the behaviour: a monitor entry with `\x1b[2J` / `\x1b[31m...` content produces output with no raw ESC bytes and visible escaped `[...]` sequences. * docs(cli): tighten escapeAnsiCtrlCodes comments to match actual scope Two doc-precision Suggestions from copilot's pass on `0840e32f6`: - Source comment claimed `escapeAnsiCtrlCodes` is "a no-op when no control chars" but it's narrower than that — it only handles sequences matched by `ansi-regex` (CSI / OSC / SGR — anything starting with ESC). Isolated C0/C1 control bytes like BEL, BS, VT pass through untouched. Updated the comment to enumerate the actual scope and call out that `node:util`'s `stripVTControlCharacters` would be needed if those become a concern. - Test comment had a literal raw ESC byte (octal 033) embedded in the source — visually showed `^[[...]` in editors that render ESC, but was a real ESC byte in the file rather than the escaped `` form the sanitizer produces. Rewrote with a literal `` text description so what the comment shows matches what the assertions check for. * fix(cli): broaden /tasks sanitization + tighten inner switch exhaustiveness Addresses 3 of 5 items from doudouOUC's PR 3801 review: - Issue 1 (Low) — C0/C1 control byte gap: switched from `escapeAnsiCtrlCodes` (only handles ESC-initiated ANSI sequences) to `stripUnsafeCharacters` (one-pass strip of ANSI + VT + C0/C1, with TAB/CR/LF preserved). The pre-existing exposure to bare BEL / BS / FF / VT bytes via shell entry strings is now closed for all three kinds. Test rewritten to cover both ANSI sequences AND bare control bytes (BEL, BS), and pins that surrounding printable text and line breaks survive. - Issue 2 (Low) — inner status switches inconsistent: the three inner `switch (entry.status)` blocks (agent / shell / monitor) used `case 'running': default: return 'running'` (or duplicated bodies). All three now have explicit `running` cases plus a `never`-typed default that throws — matches the outer `switch (entry.kind)` pattern and means a future status added to any of `BackgroundTaskEntry` / `BackgroundShellEntry` / `MonitorStatus` flips to a compile error here instead of silently returning `'running'`. - Issue 5 (Nit) — beforeEach default change: added an inline comment explaining why the test default overrides `createMockCommandContext`'s `'interactive'` default (`'non_interactive'` lets the hint-suppression assertions work without each test rebinding context). Issues 3 and 4 from the review are nits with no action needed (3 is already documented as intentional; 4 is a UX call about hint length that's better handled by user feedback than guess-tweaking). * fix(cli): bind status to local before exhaustive switch — fixes tsc build CI's `tsc --build` (full mode, vs `--noEmit` locally) caught that `switch (entry.status)` followed by a `never`-typed default reading `entry.status` doesn't compile. After the case arms exhaust the discriminated union, TS narrows `entry` itself to `never`, so the `.status` access in the default arm becomes "Property 'status' does not exist on type 'never'" + the resulting `any` value can't be assigned to `never`. Fix: bind `entry.status` to a local `status` const before the inner switch. The local stays typed as the per-kind status union and narrows correctly to `never` at the default arm — `const _exhaustive: never = status` is then `never = never`, valid. Standard exhaustive-switch-on-discriminator pattern; doesn't change runtime behavior or test surface, just gets past TS narrowing on the nested case.	2026-05-03 18:45:51 +08:00
Shaojin Wen	cdadbcdb33	feat(cli): wire Monitor entries into combined Background tasks dialog (#3791 ) Some checks are pending Qwen Code CI / Lint (push) Waiting to run Details Qwen Code CI / Test (push) Blocked by required conditions Details Qwen Code CI / Test-1 (push) Blocked by required conditions Details Qwen Code CI / Test-2 (push) Blocked by required conditions Details Qwen Code CI / Test-3 (push) Blocked by required conditions Details Qwen Code CI / Test-4 (push) Blocked by required conditions Details Qwen Code CI / Test-5 (push) Blocked by required conditions Details Qwen Code CI / Test-6 (push) Blocked by required conditions Details Qwen Code CI / Test-7 (push) Blocked by required conditions Details Qwen Code CI / Test-8 (push) Blocked by required conditions Details Qwen Code CI / Post Coverage Comment (push) Blocked by required conditions Details Qwen Code CI / CodeQL (push) Waiting to run Details E2E Tests / E2E Test (Linux) - sandbox:docker (push) Waiting to run Details E2E Tests / E2E Test (Linux) - sandbox:none (push) Waiting to run Details E2E Tests / E2E Test - macOS (push) Waiting to run Details * feat(cli): wire Monitor entries into combined Background tasks dialog Phase C mirror follow-up of Issue #3634, structurally a clean repeat of #3720 but for Monitor (third consumer of the kind framework). Core (MonitorRegistry): - Add `setStatusChangeCallback` mirroring `BackgroundShellRegistry` / `BackgroundTaskRegistry`. Fires synchronously inside `register()` (so subscribers see lifecycle start) and inside `settle()` (so subscribers see every running → terminal transition: complete / fail / cancel / emitEvent's auto-stop at maxEvents). - Subscriber failures are caught and logged but do not poison the registry — same defensive contract as the other two registries. CLI: - `useBackgroundTaskView` subscribes to all three registries (agent + shell + monitor) and merges by `startTime`. `DialogEntry` union extended with `(MonitorEntry & { kind: 'monitor' })`. `entryId` switches over the three kinds and returns the right id field. - `BackgroundTasksPill.getPillLabel` adds monitor to its KIND_NAMES table; grouping order is shell → agent → monitor (monitor last because it tends to be the longest-lived, least urgent to glance at). - `BackgroundTasksDialog`: - `rowLabel` returns `[monitor] <description>` for monitor rows. - New `MonitorDetailBody` showing command / status / pid / event count / dropped lines. No Progress block (monitors don't fire activity callbacks per-event). - DetailBody dispatcher gains the monitor branch. - `BackgroundTaskViewContext.cancelSelected` routes monitor cancels via `monitorRegistry.cancel(monitorId)`. Monitor's cancel is synchronous (settle + abort happen inside the registry), matching the same path task_stop already uses. Tests: 6 new core tests (registry callbacks fire on register / each terminal transition / not on emitEvent / on auto-stop / clear stops notifications / subscriber-failure isolation), 4 new pill tests (singular / plural / 3-kind grouping / running-only filter), dialog mock extended with `getMonitorRegistry`. * fix(cli): add exhaustive default arms to DialogEntry switches ESLint's `default-case` rule requires every switch to have a default arm even when TypeScript can prove the union is exhaustive. Add `default: { const _exhaustive: never = entry; throw ... }` to the five switches added in this PR — same pattern keeps both the runtime guard and the compile-time exhaustiveness check. * fix(core): fire statusChange in MonitorRegistry.reset() The newly-added `setStatusChangeCallback` subscriber misses `reset()`, so a `/clear` or session reset leaves stale monitor rows visible in the combined Background tasks dialog until an unrelated register/settle event happens. Both BackgroundShellRegistry and BackgroundTaskRegistry already fire statusChange on their reset paths — Monitor was the outlier. Fix: fire `statusChange()` (no arg) after `monitors.clear()`, with an early return when the registry is already empty so we don't notify on a no-op reset. Two new tests cover both branches. * fix(cli,core): address PR 3791 review feedback Four review threads from /review's second pass on top of `f26b700`: 1. MonitorDetailBody live counters were stale. `eventCount` and `droppedLines` came from the `useBackgroundTaskView` snapshot, which only refreshes on `statusChange`, and `emitEvent` deliberately doesn't fire `statusChange` (the per-event churn would burn the pill / AppContainer). Fix: extend the existing `selectedEntry` useMemo to re-resolve monitors from `monitorRegistry.get()` on each `activityTick`, mirroring the agent path. The pre-existing 1s wall-clock interval already drives the recompute while the entry is running, so no new callback wiring is needed. 2. Settle reasons weren't persisted on `MonitorEntry`. `fail()`, `complete(exitCode)`, max-events auto-stop, and idle-timeout all passed their reason strings only to the chat-history terminal notification — opening the dialog detail view after the monitor died showed the bare status with no clue why. Add `exitCode?` and `error?` fields (mirrors `BackgroundShellEntry`); populate them before `settle()` in each path; render them in `MonitorDetailBody` with appropriate colour (red for `failed`, warning for auto-stopped `completed`). 3. `monitorCancel` mock had no test asserting it. Add a dedicated test that opens detail on a monitor entry, presses `x`, and verifies `monitorRegistry.cancel(monitorId)` was called and the agent registry's cancel was NOT called. Pins the kind switch in `cancelSelected` so a regression flipping the monitor branch to anything else (e.g. `requestCancel`) would fail loudly. 4. `MonitorStatusChangeCallback` docstring was wrong. It claimed the callback fires on running → terminal transitions, but `register()` (nothing → running) and `reset()` (mass clear, fired with no entry) also fire it. Rewrite the docstring to enumerate the actual fire sites and explicitly note that `emitEvent` is excluded (per-event churn). * docs(cli,core): tighten MonitorEntry.error and rowLabel comments Two doc-precision fixes from copilot's PR 3791 review pass: - `MonitorEntry.error` enumerated max-events as the only auto-stop reason that populates the field, but `idle-timeout` also writes it (was added in the same earlier commit). Rewrote to list both current reasons and explicitly note: any future auto-stop reason should populate this field too. Also clarified that `cancelled` and natural-exit `completed` paths intentionally don't. - `rowLabel`'s shell-branch comment claimed long commands are "truncated by the row renderer's MaxSizedBox", but ListBody renders rows with plain `<Text>` (no MaxSizedBox / truncation helper). Long text wraps. Rewrote to describe actual behaviour and note the intentional decision to leave it wrapping (the dialog is what users open to see context — truncating defeats the purpose). * test(cli): cover MonitorDetailBody render branches + useBackgroundTaskView Two coverage gaps from /review's third pass on PR 3791: - New file `useBackgroundTaskView.test.ts` (6 cases) directly exercises the merge logic with a stub config: empty when config is null, merges three registries, sorts by startTime, tags `kind`, subscribes on mount, refreshes when any registry fires statusChange, clears all three subscriptions on unmount. - New `MonitorDetailBody render branches` describe block in `BackgroundTasksDialog.test.tsx` (8 cases) covers the conditional pieces my last commit added: title + Command, pid show/hide, eventCount singular vs plural, droppedLines show/hide, exitCode display, Error block (failed) vs Stopped because (auto-stop), and the all-undefined no-block case.	2026-05-03 10:05:19 +08:00
Umut Polat	a08d48b75c	fix(cli): stop double-wrapping and double-printing API errors in non-interactive mode (#3749 ) * fix(cli): stop double-wrapping and double-printing API errors in non-interactive mode In non-interactive (-p) mode, any upstream 4xx ended up on stderr three times: once from the stream-error handler, once from handleError after the thrown Error.message (already containing the formatted text) was fed back through parseAndFormatApiError producing "[API Error: [API Error: ...]]", and once more from JsonOutputAdapter.emitResult writing the same errorMessage out in TEXT mode. The top-level catch then framed the resulting throw as "An unexpected critical error occurred:" with a stack trace, which made a routine 4xx look like a CLI crash. Three coordinated changes: - AlreadyReportedError marks a throw whose message is already on the wire. handleError short-circuits on it: no second writeStderrLine, no second parseAndFormatApiError, just propagate the exit code. - The non-interactive stream-error handler now throws AlreadyReportedError, and the catch block skips the adapter's emitResult in TEXT mode so we don't get a third copy. - The top-level .catch in packages/cli/index.ts treats AlreadyReportedError as a routine, already-reported failure: exit with the carried code without printing the "unexpected critical" framing or the stack trace. parseAndFormatApiError is also made idempotent — input that already starts with "[API Error: " and ends with "]" is returned unchanged. That is the safety net: even if a future caller forgets to mark its throw, the double-wrap symptom is impossible. Tests cover all three layers: idempotency in errorParsing, the short-circuit in handleError, and a regression test on runNonInteractive that asserts no "[API Error: [API Error: ...]" line is ever produced on stderr. Fixes #3748 * fix(cli): make API error formatting idempotent for 429 suffix + use AlreadyReportedError * fix(cli): follow up on error reporting review	2026-05-03 08:39:31 +08:00
John London	5037fa7627	Feat/stats model cost estimation rebase (#3780 ) * feat(stats): add optional cost estimation to /stats model Adds optional cost estimation based on user-defined pricing in settings.json. Users can configure per-model pricing via the new modelPricing setting. When configured, /stats model shows estimated cost; when not configured, the behavior is unchanged. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * fix(stats): extract raw model name from composite key for cost lookup flattenModelsBySource creates keys like "model::source", but modelPricing is keyed by raw model names. Extract the raw model name by splitting on "::" to fix cost lookup. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * fix: update settings schema after adding modelPricing Regenerate the VS Code settings schema to include the new modelPricing field so the lint check passes. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * test(ui): add regression tests for cost estimation edge cases Coverage for the cost display fixes in ModelStatsDisplay: - Cost section hidden when no pricing + no thoughts - Cost section shown when pricing is configured (with value check) - Thoughts tokens included in cost calculation (with larger numbers to expose the before/after difference) - Raw model name used for pricing lookup with subagent attribution - Cost section shown when thoughts > 0 even without pricing - Multiple models with different pricing Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * fix(ui): include thoughts tokens in ModelStatsDisplay cost calculation Address review feedback: the interactive cost estimate was omitting thoughts/reasoning tokens from the output token count, causing it to disagree with the non-interactive /stats model path. Changes: - hasPricing visibility gate now includes thoughts in outputTokens - Cost estimate calculation now includes thoughts in outputTokens - getModelName() already correctly extracts raw model name from flattened model::source keys for pricing lookup Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * fix(stats): include thoughts tokens in non-interactive cost calc + update snapshots - Fix statsCommand.ts non-interactive path to include thoughts tokens in outputTokens for cost calculation, aligning with the interactive ModelStatsDisplay path. - Update 4 ModelStatsDisplay snapshots to reflect Cost section appearing when thoughts > 0 (even without pricing, showing N/A). * fix(test): address all CI failures and reviewer feedback on PR #3780 - Add `toModelMetrics` helper to statsCommand.test.ts to properly create ModelMetrics with required `bySource` property - Fix ModelStatsDisplay.test.tsx: - Add missing `sessionId` to SessionStatsState mock - Add missing `startNewSession` to useSessionStatsMock - Add `auto_accept` to all totalDecisions mocks (7 locations) - Add `files` to all SessionMetrics objects (6 locations) - Remove contradictory test "should show Cost section when thoughts > 0 even without pricing" per Option A (strict opt-in) - Revert 4 snapshots that incorrectly showed Cost/N/A lines for models without pricing configuration Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-03 07:31:08 +08:00
Shaojin Wen	c1b4f9eb4b	fix(core): inject thinking blocks for DeepSeek anthropic-compatible provider (#3788 ) Some checks are pending Qwen Code CI / Lint (push) Waiting to run Details Qwen Code CI / Test (push) Blocked by required conditions Details Qwen Code CI / Test-1 (push) Blocked by required conditions Details Qwen Code CI / Test-2 (push) Blocked by required conditions Details Qwen Code CI / Test-3 (push) Blocked by required conditions Details Qwen Code CI / Test-4 (push) Blocked by required conditions Details Qwen Code CI / Test-5 (push) Blocked by required conditions Details Qwen Code CI / Test-6 (push) Blocked by required conditions Details Qwen Code CI / Test-7 (push) Blocked by required conditions Details Qwen Code CI / Test-8 (push) Blocked by required conditions Details Qwen Code CI / Post Coverage Comment (push) Blocked by required conditions Details Qwen Code CI / CodeQL (push) Waiting to run Details E2E Tests / E2E Test (Linux) - sandbox:docker (push) Waiting to run Details E2E Tests / E2E Test (Linux) - sandbox:none (push) Waiting to run Details E2E Tests / E2E Test - macOS (push) Waiting to run Details * fix(core): inject thinking blocks for DeepSeek anthropic-compatible provider DeepSeek's anthropic-compatible endpoint (https://api.deepseek.com/anthropic) rejects follow-up requests with HTTP 400 ("The content[].thinking in the thinking mode must be passed back to the API.") whenever a prior assistant turn carrying tool_use omits a thinking block. The model can legitimately return a tool round without thinking text, so qwen-code stored no thought parts and rebuilt the next request with no thinking block, tripping the API's check. Mirroring the existing OpenAI-side fix (#3729, #3747), the converter now detects DeepSeek by base URL or model name and prepends an empty { type: 'thinking', thinking: '', signature: '' } block to assistant turns missing one. Other anthropic-protocol providers are unaffected. Verified against the live api.deepseek.com/anthropic endpoint: - assistant with tool_use, no thinking → 400 (reproduces #3786) - assistant with tool_use, empty thinking injected → 200 OK Refs #3786 * fix(core): gate DeepSeek thinking-block injection on thinking mode Address PR review feedback: 1. (Critical) Gate empty-thinking injection on the same per-request condition that emits the top-level `thinking` parameter. The previous implementation injected unconditionally on DeepSeek providers, but `buildThinkingConfig()` may omit `thinking` when reasoning=false or `thinkingConfig.includeThoughts=false` — which is exactly what suggestionGenerator / ArenaManager / forkedAgent do. Shipping thinking blocks without enabling thinking mode is a protocol violation that DeepSeek may reject. Move the option from converter constructor to a per-request `convertGeminiRequestToAnthropic` parameter so the generator can compute the gate correctly. 2. (CodeQL) Replace `baseUrl.includes('api.deepseek.com')` with `new URL(baseUrl).hostname` exact-match. The substring check would accept spoofed hosts like `api.deepseek.com.evil.com`. 3. Document the empty-signature workaround inline. 4. Rename the misleading "redacted_thinking" test case. * fix(core): narrow DeepSeek thinking injection to tool_use turns + subdomain test Address PR review round 2: 1. Narrow injection scope to assistant turns containing tool_use. Live verification against api.deepseek.com/anthropic showed plain-text assistant turns without thinking are accepted unchanged — only tool_use turns trigger the HTTP 400. Injecting on every assistant turn unnecessarily bloats replay history with synthetic blocks the API does not require. Existing thinking blocks on any turn are still preserved untouched. 2. Add test coverage for the subdomain hostname branch (us.api.deepseek.com → matches), addressing the gap noted in review. 3. Update existing negative-case tests (non-deepseek / spoofed / reasoning=false / includeThoughts=false) to use tool_use scenarios so they actually exercise the gating logic instead of trivially passing under the narrowed scope. * docs(core): align DeepSeek thinking-injection comments with narrowed scope Address PR review round 3 (copilot-pull-request-reviewer × 3): comments in three locations still described the constraint as applying to "any prior assistant turn", which was true before commit `8721b41` but no longer matches the implementation. Update the doc comment on isDeepSeekAnthropicProvider and the two test-suite header comments to state the actual narrower contract: the API rejects only tool-use turns that omit thinking blocks; plain-text assistant turns are accepted unchanged. Comment-only change; 58 tests still pass. * fix(core): per-request DeepSeek detection + strip thinking when off Address PR review round 4 (copilot-pull-request-reviewer × 2): 1. Stale provider-detection cache (HCia). The constructor cached isDeepSeekProvider once, but Config.setModel() mutates contentGeneratorConfig.model in place. After a runtime /model switch from a non-DeepSeek model to a DeepSeek one on the same auth config, buildRequest() would keep using the stale flag. Move the detection into buildRequest so each call sees the current model. The detector is cheap (URL parse + string compare). 2. Real thought parts leak through when thinking is disabled (HCib). The previous gate only blocked synthetic injection — but the converter still replayed any existing `thought: true` parts in request.contents as thinking blocks. Code paths that disable thinking against a session whose history was built with thinking on (suggestionGenerator / ArenaManager / forkedAgent) would still emit thinking blocks alongside an absent top-level `thinking` config — the same protocol mismatch the gate was meant to avoid. Add a `stripAssistantThinking` converter option, set in buildRequest to `isDeepSeek && !thinking`. The converter strips thinking and redacted_thinking blocks from assistant messages before message construction completes. Mirror behavior is already proven safe by live verification (DeepSeek currently tolerates either shape, but stripping makes the request body internally consistent and robust to future validation tightening). 3 new tests: - converter strips thinking from assistant turns when option set - generator strips real thought parts when reasoning=false - generator reflects runtime model changes (no stale cache) 61 tests pass; lint + typecheck clean. * fix(core): preserve thinking-only assistant turns instead of emitting empty content Address PR review round 5 (copilot-pull-request-reviewer × 2 — code + test): stripThinkingFromAssistantMessages previously replaced message.content with the filtered array unconditionally. For an assistant turn whose only blocks are thinking/redacted_thinking (e.g. a round cut off by max_tokens before any text or tool_use was emitted), this left `content: []` — which Anthropic API rejects. Dropping the message entirely was considered but would break the required user/assistant alternation. Instead, fall back to leaving the original blocks in place when stripping would empty the message. DeepSeek empirically tolerates the residual `thinking-block + no-thinking-config` shape (verified against api.deepseek.com/anthropic in the V2/X scenarios), so leaving the message untouched is the safer choice than emitting invalid structure. Add regression test for the thinking-only turn shape. 62 tests pass; lint + typecheck clean. * fix(core): validate thinking-block signature, rename option, gate output_config Address PR review round 6 — five substantive items: 1. (Critical) Drop non-compliant thinking blocks lacking a `signature` field and replace them with a synthetic one. A `redacted_thinking` block round-tripped through Gemini Part format becomes `{ text: '', thought: true }` (no thoughtSignature) and converts back to `{ type: 'thinking', thinking: '' }` without `signature` — not spec-compliant. The previous `hasThinking` check accepted these as already-satisfying, leaving non-compliant blocks in the wire message. Tighten the check so they're filtered out and the synthetic injection runs. Live verification: DeepSeek currently tolerates both shapes (lenient), but normalizing is defensively correct against future tightening. 2. Rename converter option `ensureAssistantThinking` → `ensureThinkingOnToolUseTurns`. The new name reflects the actual contract (tool-use turns only, not every assistant turn). 3. Honor `thinkingConfig.includeThoughts: false` in `buildOutputConfig`. Previously a per-request opt-out dropped the top-level `thinking` parameter but still emitted `output_config.effort`, leaking a reasoning-shaped field into side queries that don't want it. 4. Add regression test for mixed text + tool_use assistant turns (common shape: model says something, then calls a tool). 5. Add explicit test for the signature-validation path: an existing compliant thinking block (with signature) is preserved untouched. 64 tests pass; lint + typecheck clean. * fix(core): clean up non-compliant thinking blocks on plain-text turns + assert output_config gating Address PR review round 7 (copilot-pull-request-reviewer × 2): 1. (Hp-x) Round-tripped redacted_thinking blocks were left malformed on assistant turns lacking tool_use. The previous structure only ran the cleanup pass when a tool_use block was present (early return on `!hasToolUse`), so plain-text turns kept the non-compliant `{ type: 'thinking', thinking: '' }` shape. Restructure into two sequential steps: a. Drop non-compliant thinking blocks (no `signature`) on every assistant turn — same fallback that avoids `content: []` if the message is thinking-only. b. Inject the synthetic empty thinking block on tool_use turns that still lack a compliant thinking block after step (a). 2. (Hp-7) The includeThoughts=false test asserted that the top-level `thinking` field is suppressed but didn't cover `output_config`, leaving regressions in the new `buildOutputConfig` gate uncaught. Tighten the assertion to also verify `output_config` is absent. 3. New converter test: cleanup runs on plain-text assistant turns too. 65 tests pass; lint + typecheck clean. * test(core): add explicit redacted_thinking injection-path coverage Address PR review round 8 (#30 — copilot reviewer). The converter treats `redacted_thinking` as already satisfying the thinking-block requirement (no synthetic injected), distinguished from a signature-less `thinking` block which is non-compliant and gets dropped/replaced. Existing tests covered the latter path; this adds explicit coverage of the former. processContent doesn't synthesize redacted_thinking from Gemini parts, so the test reaches into the private helper directly. (#31 — subdomain hostname coverage — already exists at line 602.) 66 tests pass; lint + typecheck clean. * fix(core): per-request anthropic-beta + normalize thinking-only turns Address PR review round 9 (copilot-pull-request-reviewer × 2): 1. (Hydz) thinking-only assistant turns (e.g. max_tokens cutoff or round-tripped redacted_thinking) hit the cleanup-empties fallback and kept the original non-compliant `{ type: 'thinking', thinking: '' }` block. The fallback now replaces the message with a synthetic empty thinking block (`signature: ''` included), which keeps the message non-empty AND spec-compliant. 2. (Hyd4) `anthropic-beta` was set once at construction from the global `reasoning` config, so requests with per-request `thinkingConfig.includeThoughts=false` still advertised interleaved-thinking / effort even though the body had dropped the matching fields. Move beta computation to a new `buildPerRequestHeaders` that derives the header from the actual `thinking` / `output_config` fields present in the request body, and pass it via `messages.create(..., { headers })`. The wire shape is now internally consistent. Test updates: - Drop the three constructor-time beta assertions; they no longer apply. - Add four per-request header tests covering: both betas present, only interleaved-thinking, reasoning=false (no betas), and per-request includeThoughts=false (no betas). 67 tests pass; lint + typecheck clean. * fix(core): preserve thinking text by normalizing in place + merge user beta flags Address PR review round 10 (copilot-pull-request-reviewer × 2): 1. (H0oF) The previous cleanup filtered out every thinking block missing a `signature` field. But that shape is the normal output from OpenAI/Gemini/agent-runtime generators, which only set `thought: true` without a signature. Users switching providers mid-session would silently lose preserved thinking text on the first DeepSeek request. Change Step 1 to NORMALIZE in place: when a thinking block has no signature, set `signature: ''` rather than dropping the block. The original `thinking` text is preserved; DeepSeek empirically accepts empty signatures so the wire shape stays valid. 2. (H0oL) `buildPerRequestHeaders()` overwrote `customHeaders['anthropic-beta']` whenever the per-request override fired, regressing the customHeaders escape hatch for unrelated Anthropic beta features. Merge the user's flags into the computed list (deduped) so users can stack their own betas alongside interleaved-thinking / effort. Test changes: - Renamed and rewrote "drops non-compliant... plain-text" test to assert in-place normalization that preserves thinking text. - Updated "replaces a non-compliant thinking block" comment + name to describe the normalization (the assertion was already correct because the test happened to use empty thinking text). - The empty-content fallback in Step 1 is no longer reachable under the new logic, so the dedicated thinking-only-turn test now exercises only the strip path (where it remains relevant). - Added 3 customHeaders[anthropic-beta] tests: merge with computed, passthrough when no thinking/effort, dedupe. 70 tests pass; lint + typecheck clean. * docs(core): align thinking-injection comments with normalize semantics + add stream test Address PR review round 11 (copilot-pull-request-reviewer × 3): 1. (H3Iw) Update the `ensureThinkingOnToolUseTurns` option docstring to describe in-place normalization (preserving thinking text by filling in `signature: ''`) instead of the old drop-and-replace semantics. 2. (H3I9) Same update on the `applyEmptyThinkingToToolUseTurns` helper JSDoc — clarify that signature-less thinking blocks are normalized in place (preserving original text), not dropped. Mention the common case of cross-provider history where non-Anthropic generators only set `thought: true`. 3. (H3I3) Add a streaming test asserting that `generateContentStream()` also attaches the per-request `anthropic-beta` header. The previous coverage only exercised `generateContent()`, leaving the streaming path's separate code path (line 144 in anthropicContentGenerator.ts) unverified. 71 tests pass; lint + typecheck clean. * refactor(core): split DeepSeek thinking option in two + add header coexistence test Address PR review round 12 (copilot-pull-request-reviewer × 2): 1. (H6ws) The single `ensureThinkingOnToolUseTurns` option was misleadingly narrow: the implementation also rewrote non-tool-use turns by normalizing malformed thinking blocks. Future callers could enable it expecting only the tool-use behavior. Split into two precisely-named options: - normalizeAssistantThinkingSignature: fill missing `signature` on every assistant `thinking` block (cross-provider history compat). - injectThinkingOnToolUseTurns: prepend synthetic empty thinking on tool_use turns missing one (issue #3786 trigger). The generator wires both together for DeepSeek when thinking mode is on; either can be used independently if a future caller needs only one pass. 2. (H6w4) Add a test asserting that the per-request `headers` path coexists correctly with `customHeaders`: User-Agent and unrelated customHeaders entries stay in `defaultHeaders` while only the computed `anthropic-beta` rides on the per-request path. Defends against a future regression where header config might be routed through a code path that wipes the constructor defaults. 72 tests pass; lint + typecheck clean. * fix(core): case-insensitive customHeaders[anthropic-beta] merge Address yiliang114 review feedback (#3788). HTTP header names are case-insensitive by spec, and the Anthropic SDK lower-cases them during merge. Previously buildPerRequestHeaders only read the lower-case `anthropic-beta` key from customHeaders, so a user-configured `Anthropic-Beta` or `ANTHROPIC-BETA` would be silently overwritten by the per-request computed value. Replace the direct dict lookup with collectCustomBetaFlags() which walks all customHeaders entries and matches the key case-insensitively. Multiple matching entries (unlikely but possible) are concatenated; the existing dedupe pass handles any duplicates. Add a regression test for both `Anthropic-Beta` and `ANTHROPIC-BETA` key shapes. 73 tests pass; lint + typecheck clean. * docs(core): align thinking-injection docs with normalize-in-place semantics + redacted_thinking strip test Address PR review round 14 (copilot-pull-request-reviewer × 4): 1. (IAl4) PR description still described "dropped here so synthetic injection takes over" but the implementation now normalizes signature-less thinking blocks in place (preserving text). PR description rewritten to describe the two-pass model: normalize-in-place + injection-when-truly-missing. 2. (IAl7) `injectThinkingOnToolUseTurns` option docstring claimed signature-less blocks would be "seen as missing" so the synthetic replaces them. Updated to describe the actual flow: the normalization pass runs first, blocks become compliant in place, the injector then sees them as already-satisfying and prepends nothing. Helper JSDoc on `injectEmptyThinkingOnToolUseTurns` fixed the same way. 3. (IAl8) Strip-path coverage missed `redacted_thinking` blocks. Added regression test that verifies both thinking and redacted_thinking blocks are removed when `stripAssistantThinking` is set. 4. (IAl-) Renamed the converter test suite from "thinking-mode injection + normalization (DeepSeek thinking on)" to "DeepSeek thinking-mode normalization, injection, and stripping" so the title accurately covers all behavior the block exercises (including `stripAssistantThinking` cases later in the same describe). 74 tests pass; lint + typecheck clean. * fix(core): exclude anthropic-beta variants from defaultHeaders to avoid wire duplication Address PR review round 15 (copilot-pull-request-reviewer #1). `buildHeaders()` previously spread the entire `customHeaders` map into the SDK's `defaultHeaders`. After moving anthropic-beta computation to the per-request path, a user-configured mixed-case `Anthropic-Beta` key would survive in defaultHeaders verbatim, while the per-request override added a lowercase `anthropic-beta`. The wire then carried two physical headers for the same logical name — SDK behavior on duplicate headers with different casings is undefined. `buildPerRequestHeaders()` already merges those user flags case-insensitively (commit `0d8b5de`), so dropping the entry from defaultHeaders is the right boundary: the per-request path owns the header end-to-end. Other customHeaders entries continue to pass through. Add a regression test asserting no `Anthropic-Beta` (any casing) lands in defaultHeaders while unrelated customHeaders are kept. 75 tests pass; lint + typecheck clean.	2026-05-03 00:31:24 +08:00
Shaojin Wen	9e8f826397	feat(cli): add MCP health pill to footer (#3741 ) Surfaces MCP servers stuck in DISCONNECTED next to the Background tasks pill, so a failed-to-connect or dropped MCP isn't invisible to the user until they happen to run /mcp. v1 is a visual indicator only — Down-arrow focus chain into the pill (Enter to open /mcp) is deferred to a follow-up to keep this PR small and validate the right scope first. - New `useMCPHealth` hook subscribes to mcp-client's listener API and exposes raw counts (total / disconnected / connecting / connected) rather than a pre-formatted label, so future surfaces (boot screen, tooltips) can derive their own presentation. - `MCPHealthPill` component renders ` · N MCPs offline` (warning color) when `disconnectedCount > 0`, hidden otherwise. Connecting is intentionally suppressed to avoid flicker during boot/reconnect — the state that matters is the one that doesn't recover on its own. - Footer renders MCPHealthPill after BackgroundTasksPill, sharing the same left-bottom region. This is a deliberate parallel-pill pattern, not a `kind` extension of the Background tasks dialog: MCP connections have no terminal status (disconnected ↔ connecting ↔ connected, with a 30s health-check auto-reconnect loop), so they don't fit the task contract that the combined dialog is built around. Co-authored-by: wenshao <wenshao@U-K7F6PQY3-2157.local>	2026-05-02 22:57:05 +08:00
Shaojin Wen	d40f3e975e	fix(test): restore abort-and-lifecycle stdin-close test to pre-#3723 version (#3777 ) Some checks are pending Qwen Code CI / Lint (push) Waiting to run Details Qwen Code CI / Test (push) Blocked by required conditions Details Qwen Code CI / Test-1 (push) Blocked by required conditions Details Qwen Code CI / Test-2 (push) Blocked by required conditions Details Qwen Code CI / Test-3 (push) Blocked by required conditions Details Qwen Code CI / Test-4 (push) Blocked by required conditions Details Qwen Code CI / Test-5 (push) Blocked by required conditions Details Qwen Code CI / Test-6 (push) Blocked by required conditions Details Qwen Code CI / Test-7 (push) Blocked by required conditions Details Qwen Code CI / Test-8 (push) Blocked by required conditions Details Qwen Code CI / Post Coverage Comment (push) Blocked by required conditions Details Qwen Code CI / CodeQL (push) Waiting to run Details E2E Tests / E2E Test (Linux) - sandbox:docker (push) Waiting to run Details E2E Tests / E2E Test (Linux) - sandbox:none (push) Waiting to run Details E2E Tests / E2E Test - macOS (push) Waiting to run Details * fix(test): restore abort-and-lifecycle stdin-close test to pre-#3723 version #3723 rewrote `should handle control responses when stdin closes before replies` in a way that flipped its semantics: - Old: canUseTool sleeps 1s before allowing; asyncGenerator awaits `inputStreamDonePromise` so stdin closes WHILE the control reply is still in flight; expects `original content` (the in-flight tool must NOT execute). Tests CLI robustness when stdin closes before replies — matching the test name. - New: canUseTool returns `allow` immediately; stdin stays open until the second result arrives; expects `updated`. Requires the LLM to actually call write_file → receive tool result → reply 'done'. The test name still says "stdin closes before replies", but it no longer tests that. The new version times out (testTimeout 5min, retry x2 = 900s) on both macOS and Linux on every push since #3723, because it depends on LLM tool-calling behavior that isn't deterministic on the CI endpoint. CI history shows the pre-#3723 version was stable across 30+ runs. This restores only the test file. The shared permissionFlow, coreToolScheduler/Session wiring, and e2e workflow `npm run bundle` step from #3723 are kept intact. * test(integration): add timeout and unify loop into race chain Address review feedback on the restored test: - firstResultPromise / secondResultPromise now have a 30s setTimeout reject path, matching the pattern used by canUseToolCalledPromise and inputStreamDonePromise (15s). Without these, a hang in the result stream falls back to the global Vitest testTimeout (5min) with no useful diagnostic. - loop() is now retained as `loopPromise` and joined into the await chain via `Promise.race`. If the iterator throws or the consumer exits unexpectedly, the failure surfaces directly to the test instead of becoming an unhandled rejection while the test waits on side-channel promises. * test(integration): close pseudo-pass paths in stdin-close lifecycle test Address review feedback. Each change maps to a specific finding: - Guard canUseTool by toolName === 'write_file' AND file_path against the target absolute path. The model may issue read_file or call write_file with an unexpected path; those must not satisfy the permission-control timing harness, otherwise the test could pass without exercising the intended path. - Capture the second SDK result and assert it's defined, so the Promise.race below can no longer short-circuit silently. - Replace `Promise.race([..., loopPromise])` with a rejection-only loopError partner. Loop completion alone (e.g. iterator ends before canUseTool is invoked) must not short-circuit the awaited milestones; only loop errors should fail the test. - Restore absolute path via `helper.getPath('test.txt')` and embed it in the prompt, so the file the test asserts on is unambiguously the same one the model is asked to write. - Wrap timing promises in a `boundedPromise` helper that clears its timeout on resolve, eliminating dangling timers on success runs. - Drop the unconditional `console.log(JSON.stringify(...))` in the consumer loop to reduce CI retry noise. Out of scope (acknowledged but deferred): the test still requires the model to actually emit a write_file tool call; with the new 15s/30s bounded timeouts, an LLM that fails to call write_file now fails fast with a labeled error ("canUseTool callback not called timeout after 15000ms") instead of hanging to the global 5-min testTimeout. Making the test fully model-independent would require a control-only path that doesn't go through tool dispatch — out of scope for this regression fix. * test(integration): defer phase timers in stdin-close lifecycle test Address review suggestion: the 15s budgets on canUseToolCalled and inputStreamDone started counting at promise creation, but those phases only begin after firstResult (30s budget) resolves. On a slow CI run where the first LLM round-trip exceeds 15s, those timers would reject before their phase even starts, surfacing a misleading "canUseTool callback not called" error when the actual cause was first-result latency. Add an explicit `startTimer()` to boundedPromise and arm each timer only when its phase actually begins: - firstResult: armed immediately (begins with the query). - canUseToolCalled / inputStreamDone / secondResult: armed inside createPrompt right after firstResult resolves, so first-turn latency cannot eat into their budgets. This also makes timeout errors point at the correct phase if any of them does fire.	2026-05-02 21:39:43 +08:00
wenshao	f531924167	fix(attribution): scope isAmendCommit to attributable segment only `git -C ../other commit --amend && git commit -m x` would previously flag the second (fresh) commit as an amend, causing attachCommitAttribution to diff `HEAD@{1}..HEAD` against an unrelated reflog entry. Mirror findAttributableCommitSegment's cd/cwd tracking so only the first commit segment that runs in the original cwd determines amend status.	2026-05-02 21:09:24 +08:00
wenshao	1498133a73	fix(attribution): cd embedded .., env wrapper, Windows ARG_MAX, segment-locator warn Four review items, all small but real: - cdTargetMayChangeRepo missed embedded `..` traversal — `cd foo/../../escape` and similar would slip past the leading-`..` check and be treated as in-repo. Added an `includes('/..')` / `includes('\\..')` check (catches POSIX and Windows separators without false-positiving on `..` chars inside ordinary names, which only escape when followed by a separator). - tokeniseSegment now recognises `env` as a safe wrapper alongside `sudo`/`command`, so `env GIT_COMMITTER_DATE=now git commit ...` resolves to `git`. After the wrapper detection we also skip any `KEY=VALUE` argv entries (env's own argument syntax for setting vars before the program). - buildGitNotesCommand's MAX_NOTE_BYTES dropped from 128 KB to 30 KB. Windows' CreateProcess lpCommandLine is capped around 32,768 UTF-16 chars including the executable path and other argv entries; a 128 KB note would still fail to spawn even though the function returned a command instead of null. 30 KB leaves ~2 KB of headroom for the rest of the argv on Windows and is larger than any real commit's metadata in practice. - findAttributableCommitSegment / findGhPrCreateSegment now log a debugLogger.warn when `command.indexOf(sub, cursor)` returns -1 — splitCommands strips line continuations (`\<newline>`), so a multi-line command can have the trimmed segment text fail to match its source. Previously the segment was silently skipped with no signal; the warn makes the failure observable when QWEN_DEBUG_LOG_FILE is set. Two regression tests added: `cd foo/../../escape && git commit` gets no trailer (embedded-`..` heuristic catches it), and `env GIT_COMMITTER_DATE=now git commit` does (env wrapper skipped).	2026-05-02 20:59:10 +08:00
jinye	df594f75fe	feat(core): event monitor tool with throttled stdout streaming (Phase C) (#3684 ) * feat(core): event monitor tool with throttled stdout streaming (Phase C) Add a new Monitor tool that spawns a long-running shell command and streams its stdout lines back to the agent as event notifications. This is Phase C from the background task management roadmap (#3634, #3666). What changes: - New MonitorRegistry (services/monitorRegistry.ts): per-monitor entry with lifecycle (running/completed/failed/cancelled), idle timeout auto-stop, max events auto-stop, AbortController-based cancellation. Follows the same structural pattern as BackgroundTaskRegistry. - New Monitor tool (tools/monitor.ts): spawns via child_process.spawn with independent AbortController (Ctrl+C won't kill monitors), separate stdout/stderr line buffers, token-bucket throttling (burst=5, sustain=1/s). Returns immediately with monitor ID; events stream as notifications. - Sleep interception in shell.ts: detectBlockedSleepPattern() blocks foreground `sleep N` (N>=2) and guides model to use Monitor or is_background instead. - Config integration: MonitorRegistry instantiation, accessor, shutdown cleanup (abortAll), lazy tool registration. - CLI wiring: notification callbacks in useGeminiStream.ts (interactive) and nonInteractiveCli.ts (headless), including hold-back loop abort on exit and SIGINT cleanup. What this PR doesn't do (gated on #3471/#3488): - Footer pill / dialog integration - task_stop / send_message integration Test plan: - 21 MonitorRegistry unit tests (lifecycle, idle timeout, max events, XML escaping, nonexistent ID guard, callback clearing) - 20 Monitor tool unit tests (validation, spawn, line buffering, separate stdout/stderr buffers, throttling, signal-killed path, turn isolation) - 7 detectBlockedSleepPattern unit tests - 2 E2E tests (monitor invocation, sleep interception) - Full core suite: 248 files / 6151 passed Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com> * fix(cli): hold-back loop waits for monitors + emit task_started for SDK Two fixes from Codex review: P1: The non-interactive hold-back loop now includes monitorRegistry.getRunning() in its wait condition, so monitors can stream events before the CLI exits. Previously monitors were aborted immediately after the agent's first reply. P2: MonitorRegistry gains setRegisterCallback(), and nonInteractiveCli wires it to emit task_started system messages. Stream-json/SDK consumers now see a task_started for each monitor, matching the backgroundTaskRegistry contract. Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com> * fix(core): Windows process kill + pipeline sleep false-positive Two fixes from Codex review: P1: Monitor abort handler now uses `taskkill /f /t` on Windows instead of POSIX-only `process.kill(-pid)`. Follows the existing pattern in ShellExecutionService.childProcessFallback. P2: detectBlockedSleepPattern no longer uses splitCommands (which splits on `\|` pipes). Replaced with a regex that only matches sleep followed by sequential separators (&&, \|\|, ;, &, newline), not pipes. `sleep 5 \| cat` is now correctly allowed. Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com> * fix(test): resolve TS errors in monitor.test.ts mock types Use Object.defineProperty for readonly ChildProcess.pid and proper Readable type for stdout/stderr mocks to satisfy strict tsc builds. Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com> * fix(core): remove false notification promise + add early-abort guard P1: Sleep interception guidance no longer promises "completion notification" for is_background — that wiring doesn't exist yet (follow-up from #3642). P2: Monitor.execute() now checks _signal.aborted before spawning, preventing a race where cancellation during tool scheduling still launches a monitor. Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com> * fix(test): add getMonitorRegistry mock to useGeminiStream tests The useGeminiStream hook now calls config.getMonitorRegistry() to wire up monitor notification callbacks. The test mock config was missing this method, causing 64 test failures with "config.getMonitorRegistry is not a function". Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com> * fix(test): add getMonitorRegistry mock to nonInteractiveCli tests Same fix as useGeminiStream.test.tsx — the mock config needs getMonitorRegistry to avoid "is not a function" errors (29 failures). Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com> * fix(core): address PR review — CORE_TOOLS, directory param, test fix 1. Add 'monitor' to PermissionManager.CORE_TOOLS so coreTools allowlist correctly gates the monitor tool (same as run_shell_command). 2. Add optional 'directory' parameter to MonitorTool with workspace validation, mirroring ShellTool's directory support for multi-root workspaces. 3. Fix sleep-interception E2E test: readToolLogs() doesn't expose toolResult, so the old assertion was dead code. Now verifies via the model's output text instead. Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com> * fix(core): address MonitorTool review #4186888042 Addresses three [Critical] review comments on packages/core/src/tools/monitor.ts: 1. Partial-line buffer unbounded growth (processLines) MAX_LINE_LENGTH was only enforced after a newline, so a command emitting a long stream without newlines would grow buffer.value without bound and re-split the entire accumulated string on every chunk. Now, when the buffer has no newline and exceeds MAX_LINE_LENGTH, we force-emit a single truncated event through the throttled path and reset the buffer. 2. Missing type guard on params.command validateToolParamValues called params.command.trim() without a typeof check. Schema validation normally catches this, but SDK/direct callers could bypass it and hit an uncaught TypeError. Added typeof === 'string' guard, matching the pattern used for max_events / idle_timeout_ms. 3. Workspace check bypass via raw startsWith The directory validator used workspaceDirs.some(d => params.directory .startsWith(d)), which allowed prefix collisions (e.g. /tmp/project-evil against a /tmp/project workspace) and skipped canonicalisation / symlink resolution. Switched to WorkspaceContext.isPathWithinWorkspace, which already does fullyResolvedPath + segment-aware isPathWithinRoot matching and is the standard used elsewhere in the codebase. Test coverage: added 6 unit tests covering non-string command guard, non-absolute directory rejection, prefix-collision rejection, traversal rejection, workspace acceptance, and partial-line cap behaviour (including buffer reset). All 26 monitor.test.ts cases pass. The same startsWith pattern also exists in ShellTool and is tracked as a separate follow-up to keep this PR focused on Phase C scope. * fix(core): scope monitor always-allow permissions Populate Monitor confirmation permissionRules using the same command-rule extraction path as ShellTool, so ProceedAlways persists command-scoped Bash(...) rules instead of a broad monitor-level allow. Also add unit coverage for command-scoped rules, filtering already-allowed subcommands, and extractor fallback behavior. Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com> * fix(core): decouple monitor permission scope from Bash rules Remove pm.isCommandAllowed() from MonitorToolInvocation.getConfirmationDetails() to prevent existing Bash(...) allow rules from shrinking the monitor confirmation scope. Monitor is a long-running background process with a different risk profile than one-shot shell execution and should maintain its own permission boundary. Only AST-based read-only filtering is retained. Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com> * fix(core): unify monitor error/exit cleanup to prevent resource leaks Extract a shared cleanup() helper called from both the `exit` and `error` event handlers. Previously the `error` handler did not flush buffers, clear buffer values, remove the abort listener, or log dropped-line stats — causing potential memory leaks when `error` fires without a subsequent `exit` (e.g. ENOENT for missing commands). Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com> * fix(core): add user-skills-directory guard to monitor directory validation Mirror ShellTool's getUserSkillsDirs() check in MonitorTool's validateToolParamValues() to prevent monitor commands from running inside user skills directories. Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com> * feat(core): add Monitor(...) permission namespace for monitor tool (#3726) Introduce a dedicated Monitor(...) permission namespace so monitor and shell tools have independent permission boundaries. Previously monitor emitted Bash(...) rules, causing "Always Allow" to fail for future monitor invocations while unintentionally granting run_shell_command. Changes: - rule-parser.ts: add Monitor alias, SHELL_TOOL_NAMES entry, CANONICAL_TO_RULE_DISPLAY, DISPLAY_NAME_TO_VERB - permission-manager.ts: extract SHELL_LIKE_TOOLS set so evaluate(), evaluateSingle(), hasRelevantRules(), hasMatchingAskRule() handle both run_shell_command and monitor - monitor.ts: emit Monitor(...) instead of Bash(...) in permissionRules - Tests: parseRule, matchesRule, cross-tool isolation regression, buildPermissionRules, buildHumanReadableRuleLabel for Monitor Co-authored-by: jinye.djy <jinye.djy@alibaba-inc.com> Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com> * fix(cli): decouple headless monitor lifetime from final result Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com> * fix(cli): stabilize stream-json monitor session shutdown Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com> * fix(cli): deny monitor in headless approval defaults Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com> * fix(cli): honor tool aliases in headless allow checks Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com> * fix(core): address opus review — sleep regex, monitor cap, non-interactive cleanup - Fix sleep interception false positive for backgrounded sleep (`sleep 5 & echo done`). Remove bare `&` from separator character class so the background operator is not treated as a sequential separator. - Add MAX_CONCURRENT_MONITORS (16) check in MonitorRegistry.register() and early rejection in MonitorTool.execute() to prevent unbounded process spawning. - Widen monitorId from 8 to 16 hex chars to reduce birthday collision risk. - Abort all running monitors in nonInteractiveCli.ts success-path finally so piped stdio refs don't keep the Node event loop alive after result emission in one-shot (--print) mode. Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com> * fix(cli): abort monitors and background shells on /clear Without this, long-running monitors from a previous session survive /clear and continue pushing events into the new session's notification queue. This enables cross-session prompt injection where a malicious monitor persists across the user's escape hatch. Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com> * fix(cli): abort monitors on stream-json session shutdown Call monitorRegistry.abortAll() in both shutdown() and drainAndShutdown() so detached monitor child processes don't survive session termination. Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com> * test(cli): use content event type in stream tests Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com> * fix(cli): isolate session cleanup on clear and shutdown Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com> * fix(cli): finalize session cleanup after drain Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com> * fix(core): close remaining monitor review gaps Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com> * fix(core): preserve shell cwd in virtual permission checks Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com> * fix(core): normalize trailing background ampersands Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com> * fix(core): align monitor permission and wrapper handling Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com> * test(core): make monitor CI assertions cross-platform Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com> * fix(core): align monitor wrapper normalization Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com> * fix(core): normalize wrapped monitor commands Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com> * fix(core): harden monitor headless edge cases Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com> * fix(core): preserve monitor spawn errors Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com> * fix(core): harden monitor register cleanup Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com> * fix(core): parse monitor wrapper script token Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com> * fix(core): address PR review comments for monitor tool - Make Bash(...) permission rules cover monitor via toolMatchesRuleToolName, so deny rules like Bash(rm ) also block monitor({command: "rm ..."}) - Remove dead `normalizeRuleToolName` mock reference in config.test.ts - Fix tool description to mention stdout/stderr instead of just stdout - Export MAX_CONCURRENT_MONITORS from monitorRegistry and use it in monitor.ts instead of hardcoded 16 - Rename ambiguous MAX_LINE_LENGTH constants: PARTIAL_LINE_BUFFER_CAP (4096, monitor.ts) and EVENT_LINE_TRUNCATE (2000, monitorRegistry.ts) - Fix schema description text: "Max 80 characters" → "Truncated to 80 characters in display" - Add .unref() to SIGTERM→SIGKILL escalation timer to prevent 200ms exit delay Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com> fix(cli): resolve clear command typecheck issues Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com> * fix(core): preserve background tasks across shutdown abort Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com> * fix(core): close monitor review gaps Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com> * fix(core): address latest monitor review comments Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com> * fix(cli): handle monitors across session switches Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com> * test(core): cover aborted monitor startup Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com> * fix(core): address remaining monitor PR review comments Adopts four unresolved review threads on PR #3684: * shell: trim top-level trailing comments before validating sleep separator so 'sleep 5 # wait' no longer bypasses detectBlockedSleepPattern. * monitor: add sanitizeMonitorLine to strip C0/C1 control chars (except tab) and defang structural envelope tag names with U+200B before forwarding output to the model, blocking prompt-injection attempts hidden in monitored stdout/stderr. * monitor: declare line buffers and throttledEmit before abortHandler to avoid TDZ on synchronous abort paths, and add flushPartialLineBuffers called from both abortHandler (before kill) and cleanup (natural exit/error) so partial-line data is no longer silently dropped on cancel. * permissions: document that normalizePermissionContext relies on buildPermissionCheckContext to forward monitor's directory as cwd, and add regression tests proving relative-path Read(./...) allow and deny rules resolve against the monitor's explicit cwd. * fix(core): abort running monitors in MonitorRegistry.reset() reset() previously only cleared idle timers and emptied the map without aborting running monitors' AbortControllers. This could orphan child processes when reset() was called without a prior abortAll(), e.g. via useResumeCommand → resetBackgroundStateForSessionSwitch. 🤖 Generated with [Qwen Code](https://github.com/QwenLM/qwen-code) * fix(core): harden monitor notification XML and displayText - Extend escapeXml to escape " and ' as defense-in-depth: safe to reuse the helper in any future XML attribute context without re-auditing. - Strip C0 (except tab) and C1 control characters from the displayText surface before interpolation, so untrusted child-process output cannot leak ANSI escapes / NUL bytes into the operator's terminal even if a direct caller of MonitorRegistry.emitEvent skips sanitization. Adds unit tests for both hardening paths. * test(core): cover token-bucket throttling and commented-sleep bypass - Add 4 unit tests for the monitor token-bucket throttle (burst=5, 1 token/sec refill): burst cap, refill release, long-idle bucket cap, and whitespace lines not consuming budget. Uses vi.setSystemTime to exercise Date.now() without advancing pending setTimeouts. - Add an E2E case that feeds 'sleep 5 # wait for db' through the shell tool to lock in trimTrailingShellComment behavior end-to-end; the unit-level coverage in shell.test.ts remains authoritative but the E2E anchor prevents a regression from silently passing unit tests. * fix(core): address 3 remaining copilot review comments 1. shell.ts sleep interception: strip shell wrapper before detecting the blocked sleep pattern so `bash -c 'sleep 5'` / `sh -c ...` cannot route around the block. Mirrors every other sensitive check in shell.ts, which already normalizes through stripShellWrapper. 2. monitorRegistry.ts emitEvent auto-stop: settle the entry BEFORE aborting its controller so that any synchronous abort listener that flushes buffered output back through registry.emitEvent() (e.g. the Monitor tool's flushPartialLineBuffers) finds status !== 'running' and short-circuits instead of overshooting maxEvents and emitting a duplicate 'Max events reached' terminal notification. 3. monitorRegistry.ts truncateDescription: cap output at exactly MAX_DESCRIPTION_LENGTH by counting the ellipsis against the budget, instead of returning MAX_DESCRIPTION_LENGTH + 3 characters. Each fix is covered by a new unit test. * fix(core): address review comments — sanitize, notify, kill logging, throttle observability - Remove double normalize in buildPermissionCheckContext (PM is single source) - Add {notify:false} to Config.shutdown() and abortTaskRegistries() abortAll - Swap settle-before-abort in cancel() and resetIdleTimer() to prevent races - Add stripDisplayControlChars to emitTerminalNotification - Sanitize monitor description at entry creation via sanitizeMonitorLine - Surface throttle-dropped line count in terminal notification - Add .unref() to idle timer to allow clean process exit - Add error handler + stdio:ignore to Windows taskkill spawn - Log SIGTERM/SIGKILL kill failures via debugLogger.warn - Attach early child error handler to cover spawn-to-register window - Destroy child stdio on register failure to prevent handle leaks - Improve stripShellWrapper to handle absolute paths, combined flags, env prefix - Improve SHELL_TOOL_NAMES documentation and toolMatchesRuleToolName clarity Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com> * fix(core): resolve monitor tool typecheck errors - Cast child.stdout/stderr to a minimal { destroy?: () => void } shape so the optional destroy() call compiles and still works with test mocks. - Initialize droppedLines: 0 in MonitorEntry test fixtures that predate the field becoming required. * fix(monitor): add missing stdio option in taskkill test assertions (#3784) * fix(core): address monitor review feedback * fix(core): harden monitor command lifecycle --------- Co-authored-by: jinye.djy <jinye.djy@alibaba-inc.com> Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com> Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com>	2026-05-02 20:57:26 +08:00
wenshao	80bed8f58e	fix(attribution): cd-subdir, scope --body, multi-commit count guard, /clear reset Four bugs flagged this round: - gitCommitContext / findAttributableCommitSegment used a blanket "any cd shifts cwd" gate, breaking the very common `cd subdir && git commit -m "..."` flow even though the commit lands in the same repo. New `cdTargetMayChangeRepo` heuristic: treat relative paths that don't escape upward (no leading `..`, no absolute path, no `~`/`$VAR` expansion, no bare `cd`/`cd -`) as in-repo and let attribution proceed. Conservative on anything it can't statically verify. - addAttributionToPR was running the `--body`/`-b` regex against the FULL compound command string. In `curl -b "session=abc" && gh pr create --body "summary"` the regex would match curl's `-b` cookie flag and inject attribution into the cookie value, corrupting the curl call. Added `findGhPrCreateSegment` (analog of `findAttributableCommitSegment`) and scoped the body regex to that segment, splicing back into the original command via offsetting the in-segment match index. - The multi-commit guard treated `runGitCount === 0` as "single commit" and bypassed itself. After `commitCreated === true`, a count of 0 is impossible in normal operation — it means rev-list errored or timed out. Now we bail on `commitCount !== 1` with a tailored message: anything other than exactly 1 commit is suspicious and refuses the note. - The CommitAttributionService singleton survives across `Config.startNewSession()` (the `/clear` and resume paths). New `CommitAttributionService.resetInstance()` call alongside the existing chat-recording / file-cache resets in startNewSession prevents pending attributions from a prior session attaching to a commit in the new one. Three regression tests added: `cd src && git commit` produces a trailer (in-repo cd), `cd .. && git commit` does not (could escape repo root), and `curl -b "..." && gh pr create --body "..."` leaves curl's cookie value untouched while attribution lands in gh's body.	2026-05-02 17:03:09 +08:00
wenshao	66cef3bb5f	fix(attribution): --amend, --message/-b aliases, .d.ts over-exclusion Four Copilot follow-ups, three of them user-visible coverage gaps: - `git commit --amend` was diffing `HEAD~1..HEAD` for attribution, which spans the entire amended commit (parent → amended) rather than the actual amend delta. A message-only amend would emit a note attributing every file in the original commit to this amend. New `isAmendCommit` helper detects the flag and getCommittedFileInfo switches to `HEAD@{1}..HEAD` (the pre-amend HEAD vs the amended HEAD); if the reflog is GC'd we bail with a warning rather than over-attribute. - `git commit --message "..."` and `--message="..."` were silently skipped because the regex only recognised the short `-m` form. The flag prefix now matches both alternatives via `(?:-[a-zA-Z]m\|--message)\s=?\s` (non-capturing inner group so the existing `[full, prefix, body]` destructure still works). - `gh pr create -b "..."` (the short alias for `--body`) was the same gap on the PR side; `(?:--body\|-b)[\s=]+` now covers both forms. - `.d.ts` was an over-broad blanket exclusion in EXCLUDED_EXTENSIONS — declaration files are commonly authored (ambient declarations, asset shims like `.d.ts` for `import './x.svg'`); the repo even contains `packages/vscode-ide-companion/src/assets.d.ts`. Removed `.d.ts` from the extensions Set and adjusted the test to assert the new behavior. Auto-generated `.d.ts` (e.g. `tsc --declaration` output) still gets caught by the build-directory rules. Tests added: `--amend` plumbing covered by the new branch in getCommittedFileInfo (no targeted unit test — the diff invocation goes through ShellExecutionService and is exercised by the existing post-command path); `--message`/`--message="..."`/-b/-b="..."` all have positive trailer-injection assertions; `.d.ts` test split into "hand-authored" (negative) and "in dist" (positive).	2026-05-02 15:07:18 +08:00
wenshao	84d764bc03	fix(attribution): cd-leak, numstat partial failure, $() bailout, gh pr new alias Five Critical/Suggestion items: - `cd subdir && git commit` (or any non-attributable commit chain whose HEAD movement still happens in our cwd, e.g. cd into a subdirectory of the same repo) used to skip attribution AND fail to clear pending per-file entries. Those entries then leaked into the next foreground commit, inflating its AI percentage. New `else if (commitCtx.hasCommit)` branch in execute() compares pre- and post-HEAD; if HEAD moved we drop the per-file state. preHead is now snapshotted whenever ANY commit was attempted, not only attributable ones. - getCommittedFileInfo's three diff calls run in `Promise.all`. If `--numstat` failed while `--name-only` succeeded, every file's diffSize would be 0 and generateNotePayload would clamp aiChars to 0 — emitting a structurally valid note with all-zero AI percentages. Detect the partial-failure shape (files non-empty, diffSizes empty) and return empty so no note is written. - addCoAuthorToGitCommit and addAttributionToPR now bail when the captured `-m`/`--body` value contains `$(`. The tool description recommends `git commit -m "$(cat <<'EOF' ... EOF)"` for multi-line messages, but the regex's `(?:[^"\\]\|\\.)*` body group stops at the first interior `"` from a nested shell token — splicing the trailer there breaks the command before it reaches the executor. - looksLikeGhPrCreate now accepts `gh pr new` as well — it's a documented alias for `gh pr create` and was silently skipped. - Removed `incrementPermissionPromptCount` / `incrementEscapeCount` and their getters: they had no production callers, so the backing fields just round-tripped through snapshots as 0. The four snapshot fields are now optional so pre-fix snapshots that carry non-zero values still load cleanly and just get ignored. Three regression tests added: heredoc-style `-m "$(cat <<EOF...)"` preserved literally, heredoc-style `--body` likewise, `gh pr new --body "..."` rewritten with attribution.	2026-05-02 14:06:57 +08:00
John London	ad12bf84cd	fix(vscode-companion): align package eslint config with root and style cleanup (#3782 ) Some checks are pending Qwen Code CI / Lint (push) Waiting to run Details Qwen Code CI / Test (push) Blocked by required conditions Details Qwen Code CI / Test-1 (push) Blocked by required conditions Details Qwen Code CI / Test-2 (push) Blocked by required conditions Details Qwen Code CI / Test-3 (push) Blocked by required conditions Details Qwen Code CI / Test-4 (push) Blocked by required conditions Details Qwen Code CI / Test-5 (push) Blocked by required conditions Details Qwen Code CI / Test-6 (push) Blocked by required conditions Details Qwen Code CI / Test-7 (push) Blocked by required conditions Details Qwen Code CI / Test-8 (push) Blocked by required conditions Details Qwen Code CI / Post Coverage Comment (push) Blocked by required conditions Details Qwen Code CI / CodeQL (push) Waiting to run Details E2E Tests / E2E Test (Linux) - sandbox:docker (push) Waiting to run Details E2E Tests / E2E Test (Linux) - sandbox:none (push) Waiting to run Details E2E Tests / E2E Test - macOS (push) Waiting to run Details * fix(vscode-companion): fix ESLint curly and eqeqeq violations Fix pre-existing lint errors in vscode-ide-companion that cause all 8 CI test matrix jobs to fail (ubuntu/macos/windows × Node 20/22/24). - extension.ts:221 — add braces to single-line if (curly rule) - WebViewProvider.ts:1740,1742 — add braces to single-line if (curly) - App.tsx:179 — replace == with === (eqeqeq) and add braces (curly) - App.tsx:215 — add braces to single-line if (curly rule) - App.tsx:1210 — add braces to single-line if return (curly rule) - App.tsx:1241 — add braces to single-line if continue (curly rule) - App.tsx:1258 — add braces to single-line if continue (curly rule) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * fix(vscode-companion): revert === null to == null and align package eslint config - Revert `child === null` back to `child == null` in App.tsx:179. The == null idiom catches both null and undefined, which is the defensive pattern needed here. A future case that forgets to assign `child` would let undefined slip through with === null, desyncing childIndexMap from the rendered DOM and breaking findMessageIndex. - Align package-level eslint eqeqeq with root config by adding { null: 'ignore' }, so == null is allowed as a warning-level rule matching the project-wide ESLint policy. The curly brace additions from the prior commit are retained as style improvements (curly: 'warn' in the package config), though they were not CI-blocking violations — CI runs from root which uses curly: ['error', 'multi-line'] and permits single-line if statements. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-02 12:35:24 +08:00
wenshao	222b1884dc	fix(shell): scope -m rewrite to commit segment, reject nested matches Two Critical findings on addCoAuthorToGitCommit, plus a Copilot maintainability nit: - The `-m` regex used to scan the whole compound command, so `git commit -m "fix" && git tag -a v1 -m "release"` would target the LATER tag annotation (last -m wins) and splice the trailer there instead of the commit message. The rewrite now scopes to the actual `git commit` segment via a new findAttributableCommitSegment(): same shell-aware walk gitCommitContext does, but returning the segment's character range so the regex can be run on a slice and spliced back into the original command. - Within the segment, a literal `-m '...'` inside a quoted body was treated as a real later -m. For `git commit -m "docs mention -m 'flag' for completeness"`, the inner single-quoted -m sits at a higher index than the real outer -m, and the previous index comparison would have it win — splicing the trailer mid-message and corrupting the quoting. The new code checks whether the candidate is nested inside the other quote-style's range (start/end containment) and prefers the outer match when so. - Hoisted three constant Sets (sudo flag list, git global flags taking values, git global flags shifting cwd, gh global flags) out of the per-call scope to module constants. Functional no-op, but keeps the parsing helpers easier to read and avoids re-allocating the Sets on every command. Two regression tests added for the cases above: - inner `-m '...'` inside the outer message body is preserved literally and the trailer lands after the body - `git tag -a v1 -m "release notes"` after a real `git commit -m "fix"` is left untouched, with the trailer appended to "fix" only	2026-05-02 11:04:31 +08:00
wenshao	2dd792c87e	fix(attribution): canonicalize-from-root cleanup; fix mixed-quote -m / gh -R= Five review items, one Critical: - attachCommitAttribution now canonicalises via the repo root (one realpath call) and resolves committed paths against that canonical root, rather than per-leaf realpath inside clearAttributedFiles. At cleanup time the leaf for a just-deleted file no longer exists, so per-leaf fs.realpathSync would fail and silently fall back to a non-canonical path that misses the stored canonical key — leaving stale attributions for deleted files. clearAttributedFiles drops its internal realpath and now documents the canonical-paths-required precondition explicitly. - addCoAuthorToGitCommit picks the LAST `-m` regardless of quote style. Previously `doubleMatch ?? singleMatch` always preferred the last double-quoted match, so `git commit -m "Title" -m 'Body'` injected the trailer into the title where git interpret-trailers would silently ignore it. Now compares match indices, and the escape helper follows the actually-selected match's quote style. - parseGhInvocation handles `-R=value` (the equals form of the short `--repo` alias). `--repo=...` and `--hostname=...` were already covered; `-R=...` previously fell through to the generic flag branch and skipped the value. - New tests for the symlink-aware canonicalisation: macOS-style `/var` ↔ `/private/var` mapping is mocked via vi.mock on node:fs, with cases for record-then-look-up under either form, generateNotePayload with a symlinked baseDir, partial clear via the canonical-root-derived path (deleted leaf), and snapshot restore canonicalisation. - Doc-only: integration-test header comments updated from "V1 -> V2 -> V3" / "migration to V3" to reflect the actual V4 end state (assertions already used the literal `4`).	2026-05-02 09:22:55 +08:00
wenshao	89ee3e38b0	fix(attribution): canonicalize file paths centrally in CommitAttributionService Two related Copilot follow-ups: - recordEdit/getFileAttribution/clearAttributedFiles now run input paths through fs.realpathSync before storing/looking up, so a symlinked path (e.g. macOS /var ↔ /private/var) resolves to the same key regardless of which form the caller passes. Previously edit.ts/write-file.ts handed in non-realpath'd absolute paths while generateNotePayload tried to realpath only inside its lookup loop, leaving partial-clear and clear-on-finally paths unable to find entries when the forms diverged. - restoreFromSnapshot also canonicalises on the way in so a session resumed from a pre-fix snapshot (where keys may not have been canonical) ends up with the same shape as newly recorded entries — otherwise a single file could end up with two parallel records. - generateNotePayload's lookup loop dropped its per-entry realpath call (now redundant since keys are canonical at write time), keeping only the realpath of `baseDir` (which still comes from `git rev-parse --show-toplevel` and may be a symlink). - Updated `clearAttributedFiles` doc to describe the new semantics: callers can pass either the resolved repo-relative path or an already-canonical absolute path, and either will match.	2026-05-02 05:39:00 +08:00
wenshao	be81c6f08d	fix(attribution): partial-commit clear, symlink baseDir, gh/git flag handling Two Critical items, two Copilot, and five wenshao Suggestions: - attachCommitAttribution's `finally` block used to call `clearAttributions()` unconditionally, wiping per-file tracking for files the AI had edited but the user excluded from this commit. Added `clearAttributedFiles(committedAbsolutePaths)` to the service and the call site now passes only the paths that actually landed in this commit; entries for un-`add`ed files stay pending for a later commit. - generateNotePayload now runs both `baseDir` and each tracked absolute path through `fs.realpathSync` before `path.relative`. On macOS in particular `/var` symlinks to `/private/var`, so the toplevel from `git rev-parse --show-toplevel` and the absolute path captured by edit/write-file tools could diverge — producing `../../actual/path` keys in the lookup that never matched and silently zeroed all per-file AI attribution. - tokeniseSegment now consumes value-taking sudo flags (`-u`, `-g`, `-h`, `-D`, `-r`, `-t`, `-C`, plus the long forms). Without this, `sudo -u other git commit` left `other` standing in for the program name and skipped the trailer entirely. - A duplicate JSDoc block above `countCommitsAfter` (a leftover from the earlier extraction of `getGitHead`) was removed; both helpers now have one accurate comment each. - attachCommitAttribution's multi-commit guard now also runs when `preHead === null` (brand-new repo), via `git rev-list --count HEAD`. A compound `git init && git commit -m a && git commit -m b` no longer slips through and mis-attributes combined data to the last commit. - addCoAuthorToGitCommit's `-m` matching switched to `matchAll` and takes the LAST match. `git commit -m "title" -m "body"` puts the trailer at the end of the body so `git interpret-trailers` recognises it; the previous first-match behaviour stuffed the trailer in the title where git treats it as plain message text. - addAttributionToPR's `--body` regex accepts both space and `=` separators (`--body "..."` and `--body="..."`); the `=` form is common with gh. - New `parseGhInvocation` walks past gh's global flags (`--repo`, `-R`, `--hostname`) so `gh --repo owner/repo pr create ...` is detected. The earlier fixed-position check at tokens[1]/tokens[2] missed any command with a global flag. - getCommittedFileInfo now fans out the two `rev-parse` calls and the three diff calls with `Promise.all`. They're independent and serialising them was paying spawn latency 5× per commit. Tests: sudo with `-u user`, multi `-m`, `gh --repo owner/repo`, `--body="..."`, plus the existing 84 shell tests still pass.	2026-05-02 04:50:07 +08:00
jinye	5d1052a358	feat(telemetry): define HTTP OTLP endpoint behavior and signal routing (#3779 ) Some checks are pending Qwen Code CI / Lint (push) Waiting to run Details Qwen Code CI / Test (push) Blocked by required conditions Details Qwen Code CI / Test-1 (push) Blocked by required conditions Details Qwen Code CI / Test-2 (push) Blocked by required conditions Details Qwen Code CI / Test-3 (push) Blocked by required conditions Details Qwen Code CI / Test-4 (push) Blocked by required conditions Details Qwen Code CI / Test-5 (push) Blocked by required conditions Details Qwen Code CI / Test-6 (push) Blocked by required conditions Details Qwen Code CI / Test-7 (push) Blocked by required conditions Details Qwen Code CI / Test-8 (push) Blocked by required conditions Details Qwen Code CI / Post Coverage Comment (push) Blocked by required conditions Details Qwen Code CI / CodeQL (push) Waiting to run Details E2E Tests / E2E Test (Linux) - sandbox:docker (push) Waiting to run Details E2E Tests / E2E Test (Linux) - sandbox:none (push) Waiting to run Details E2E Tests / E2E Test - macOS (push) Waiting to run Details * feat(telemetry): define HTTP OTLP endpoint behavior and signal routing - Add resolveHttpOtlpUrl() that appends /v1/traces, /v1/logs, /v1/metrics to base HTTP OTLP endpoints per the OpenTelemetry specification - Add per-signal endpoint overrides (otlpTracesEndpoint, otlpLogsEndpoint, otlpMetricsEndpoint) for backends with non-standard paths (e.g. Alibaba Cloud) - Add LogToSpanProcessor that bridges OTel log records to spans for traces-only backends, with session-based traceId correlation and error status propagation - Auto-wire LogToSpanProcessor when traces URL exists but logs URL doesn't - Validate per-signal URLs gracefully (log error + skip, don't crash) - Preserve query strings when appending signal paths to URLs - Guard gRPC branch against missing base endpoint with per-signal config - Update telemetry documentation with signal routing semantics and Alibaba Cloud HTTP per-signal endpoint examples Closes #3734 Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com> * fix(telemetry): fix TS noPropertyAccessFromIndexSignature errors in tests Use typed ExportedSpan interface and bracket notation for index signature properties to satisfy strict TypeScript checks in CI. Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com> * fix(telemetry): replace MD5 with SHA-256 for traceId derivation CodeQL flagged MD5 as a weak cryptographic algorithm when used with session.id (considered sensitive data). Switch to SHA-256 truncated to 32 hex chars to satisfy CodeQL while maintaining the same traceId format required by the OTel specification. Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com> * fix(telemetry): address review feedback for LogToSpanProcessor robustness - Wrap JSON.stringify in try/catch to handle circular refs and BigInt - Add export timeout (30s) and try/catch to prevent hung shutdown - Track in-flight exports to avoid interval-vs-shutdown race condition - Fix deriveSpanStatus: use truthy checks (!!), drop success===false heuristic since declined tool calls are normal, not errors - Enforce http(s) scheme in validateUrl to reject file:/javascript: URLs - Change DiagLogLevel from ERROR to WARN to preserve operational diagnostics - Preserve logRecord.instrumentationScope instead of hardcoding - Forward severityNumber/severityText as span attributes - Add tests for circular refs, error status edge cases, severity Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com> * fix(telemetry): flush sdk shutdown through cleanup Remove async process exit handlers from telemetry initialization and route SDK shutdown through Config cleanup so normal CLI exit paths await pending telemetry exports. Keep shutdown idempotent while an SDK shutdown is in flight. Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com> * fix(telemetry): harden bridged log shutdown Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com> * fix(telemetry): address review follow-ups Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com> --------- Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>	2026-05-01 22:47:01 +08:00
wenshao	f97ba2010f	fix(shell): refuse multi-commit attribution; misc review follow-ups Five follow-ups from the latest review pass: - attachCommitAttribution now refuses to write a single git note for shell commands that produce more than one commit (e.g. `git commit -m a && git commit -m b`). The singleton's per-file attribution map can't be partitioned across the individual commits, so attaching the combined note to HEAD would mis-attribute earlier commits' changes to the last one. Walks `preHead..HEAD` via `git rev-list --count`; on multi-commit detection it snapshots the prompt counters and bails with a debug warning instead of writing a misleading note. - parseGitInvocation now recognises the attached `-C/path` form (e.g. `git -C/path commit -m x`). shell-quote tokenises that as a single `-C/path` token which previously fell to the generic flag branch with `changesCwd = false`, leaving an out-of-cwd commit classified as attributable. - attachCommitAttribution dropped its unused `command` parameter (the caller already gates on `commitCtx.attributableInCwd`, so re-parsing was removed earlier; the parameter became dead). - Added wiring guards in edit.test.ts and write-file.test.ts: AI-originated edits/writes hit `CommitAttributionService.recordEdit`, `modified_by_user: true` skips, and write-file's distinction between a true new file and an overwritten empty file (`null` vs `''` old content) is now pinned by `aiCreated` assertions.	2026-05-01 22:44:08 +08:00
wenshao	2af4e1d2fb	fix(shell): position-independent git subcommand detection + bash-shell guard Six review items, two of them critical: - gitCommitContext was checking fixed-position tokens (`arg1`, `arg3`) and missed every git invocation that puts a global flag between `git` and the subcommand: `git -c user.email=x@y commit`, `git --no-pager commit`, `git -C /p -c k=v commit`, etc. In background mode these would slip past the refusal guard; in foreground they got no co-author trailer, no git note, and no prompt-counter snapshot. New `parseGitInvocation` walks past git's global flags (with their values) before reading the subcommand, and reports `changesCwd` for `-C` / `--git-dir` / `--work-tree`. - The Windows guard on addCoAuthorToGitCommit and addAttributionToPR used `os.platform() === 'win32'`, which incorrectly skipped Windows + Git Bash (`getShellConfiguration().shell === 'bash'`). Switched both to gate on `getShellConfiguration().shell !== 'bash'` so Git Bash users keep the feature. - attachCommitAttribution was re-parsing `gitCommitContext(command)` even though `execute()` already gates on `commitCtx.attributableInCwd`. Removed the redundant re-parse — drift between the two checks would silently diverge trailer injection from git-notes writes. - tokeniseSegment (formerly tokeniseProgram) now logs via debugLogger on parse failure instead of swallowing silently. Easier to debug if shell-quote ever throws on something unusual. - Added a comment on `cwdShifted` documenting that it's a one-way latch — `cd src && cd ..` will still skip attribution. The trade-off matches the wrong-repo guard's "better miss than corrupt unrelated repos" intent. - Stale `--stat` reference in the aiChars-clamp comment updated to `--numstat` to match the actual git command in ShellToolInvocation.getCommittedFileInfo. Tests: `git -c key=val commit` and `git --no-pager commit` now produce a trailer; existing 82 shell tests still pass.	2026-05-01 19:34:51 +08:00
Shaojin Wen	35fe97e0f6	feat(review): expand review pipeline + qwen review CLI subcommands (#3754 ) Some checks are pending Qwen Code CI / Lint (push) Waiting to run Details Qwen Code CI / Test (push) Blocked by required conditions Details Qwen Code CI / Test-1 (push) Blocked by required conditions Details Qwen Code CI / Test-2 (push) Blocked by required conditions Details Qwen Code CI / Test-3 (push) Blocked by required conditions Details Qwen Code CI / Test-4 (push) Blocked by required conditions Details Qwen Code CI / Test-5 (push) Blocked by required conditions Details Qwen Code CI / Test-6 (push) Blocked by required conditions Details Qwen Code CI / Test-7 (push) Blocked by required conditions Details Qwen Code CI / Test-8 (push) Blocked by required conditions Details Qwen Code CI / Post Coverage Comment (push) Blocked by required conditions Details Qwen Code CI / CodeQL (push) Waiting to run Details E2E Tests / E2E Test (Linux) - sandbox:docker (push) Waiting to run Details E2E Tests / E2E Test (Linux) - sandbox:none (push) Waiting to run Details E2E Tests / E2E Test - macOS (push) Waiting to run Details * feat(review): expand review pipeline + add `qwen review` CLI subcommands Review skill (SKILL.md) changes: - Step 4: 5 → 9 parallel agents (split Correctness/Security, add Test Coverage, 3 undirected personas: attacker / 3am-oncall / maintainer) - Step 5: verification "uncertain → reject" → "uncertain → low-confidence" (terminal-only "Needs Human Review" bucket; never posted as PR comments) - Step 6: single reverse audit → iterative (terminate on no-new-findings, hard cap 3 rounds) - Step 9: self-PR detection (downgrade APPROVE/REQUEST_CHANGES → COMMENT when GitHub forbids self-review with HTTP 422); CI status check (downgrade APPROVE → COMMENT on red/pending CI); existing-Qwen-comment classification with priority order Stale > Resolved > Overlap > NoConflict (only Overlap blocks for confirmation) `qwen review` CLI subcommands (packages/cli/src/commands/review/): - fetch-pr — clean stale + fetch PR ref + create worktree + metadata - pr-context — emit Markdown context file with security preamble + already-discussed dedup section - load-rules — read review rules from base branch (4 source files) - deterministic— run tsc, eslint, ruff, cargo-clippy, go-vet, golangci-lint on changed files; filtered + structured findings JSON (TypeScript/JavaScript, Python, Rust, Go) - presubmit — self-PR + CI status + existing-comment classification in a single JSON report - cleanup — worktree + branch ref + per-target temp files (idempotent) Cross-platform: execFileSync (no shell), path.join, CRLF normalization, which/where for tool detection. Replaces bash-style inline commands in SKILL.md; works identically on macOS/Linux/Windows. Path consistency: SKILL.md temp files moved from /tmp/qwen-review-* to .qwen/tmp/qwen-review-* — matches what os.tmpdir() resolves to across platforms (macOS returns /var/folders/... not /tmp). DESIGN.md gains five "Why ..." sections explaining each design decision; docs/users/features/code-review.md synced for user-visible changes. * feat(review): expose full reply chains in pr-context output `qwen review pr-context` now renders each replied-to inline-comment thread as the original reviewer comment + chronological reply chain, instead of only listing the root-comment snippet. This lets review agents see at a glance whether a topic has been addressed (e.g. a "Fixed in <commit>" reply closes the thread) and avoids re-reporting already-resolved concerns without forcing the LLM driver to manually summarise each reply chain in agent prompts. - Walk `in_reply_to_id` chain to group replies under their root comment - Sort replies chronologically (by id, monotonic on GitHub) - Render thread block: root snippet as a quote + bulleted reply list - Sort threads by `(path, line)` for deterministic output - SKILL.md note updated to point agents at the new chain format * feat(review): include review-level summaries in pr-context output `qwen review pr-context` now also fetches `gh api repos/{owner}/{repo}/pulls/{n}/reviews` and renders a "Review summaries" section listing each reviewer's overall body (the comment they typed alongside an APPROVED / CHANGES_REQUESTED / COMMENTED submission). Closes a real gap found during the PR #3684 review: > "@wenshao [CHANGES_REQUESTED]: The previously identified exported > type rename issue no longer maps to the current PR diff, so this > review only includes the remaining high-confidence blocker." Without this section, the LLM driver's review agents would have missed that integration note from the prior reviewer. - New `RawReview` type + extra `ghApi` call - Filter: skip empty bodies + the canonical "No issues found. LGTM!" template the qwen-review pipeline auto-emits — those carry no agent-actionable content beyond the review state itself - Sort meaningful reviews by `submitted_at` for chronological output - Stdout summary now reports `M/N review summaries` (M = kept after filter) Smoke-tested on PR #3684: 30 inline, 3 issue, 1/30 review summaries correctly surfaces the @wenshao CHANGES_REQUESTED body and filters the 29 LGTM templates. * fix(review): paginate gh API calls to capture comments past page 1 `gh api <path>` defaults to per_page=30. Busy PRs cross that limit on inline comments, issue comments, and reviews — the latest entries (the ones most likely to contain new reviewer feedback or in-flight reply chains) end up on page 2+ and were silently truncated. Concrete bug found while re-reviewing PR #3684: Before: `30 inline, 3 issue comments, 1/30 review summaries` After: `97 inline, 3 issue comments, 6/67 review summaries` 5 additional reviewer-level summaries surfaced — including the @wenshao 2026-04-30 "Multi-agent re-review (Phase C)" body with the explicit verification notes that this PR's pipeline is supposed to chain forward into the next review. Changes: - `lib/gh.ts`: new `ghApiAll(path)` helper using `gh api --paginate`, which walks every `next` link and concatenates each page's array. - `pr-context.ts`: 3 fetches (inline / issue / reviews) → `ghApiAll`. - `presubmit.ts`: PR comments fetch → `ghApiAll` too (existing-comment classification was equally susceptible to dropping page 2+ overlap candidates). `check-runs` and `commits/<sha>/status` calls retain `ghApi` — those return objects (with embedded arrays) and rarely cross 30 entries. --------- Co-authored-by: wenshao <wenshao@U-K7F6PQY3-2157.local>	2026-05-01 18:30:35 +08:00
wenshao	6e1028d9df	fix(shell): unified git-commit detection split by intent Six items called out across CodeQL, Copilot, and wenshao: - The earlier `looksLikeGitCommit`/`stripCommandPrefix` returned a single yes/no and rejected ANY `cd` in the chain. That fixed the wrong-repo case but also disabled attribution for `git commit -m "x" && cd ..` (commit already landed safely in our cwd; the cd came after). It also conflated three distinct decisions onto one predicate. New `gitCommitContext` returns both `hasCommit` and `attributableInCwd`, walking segments in order so that a `cd` AFTER the commit doesn't invalidate it. Callers now pick the right arm: - background-mode refusal uses `hasCommit` (refuses even `cd /elsewhere && git commit` since we can't attribute it afterward either way) - HEAD snapshot, addCoAuthorToGitCommit, and the attachCommitAttribution gate use `attributableInCwd` - Tokenisation switches from a regex while-loop to `shell-quote`'s `parse`. Quoted env values like `FOO="a b" git commit` now skip correctly (the old `\S\s+` form would cut after the opening quote). Eliminates the CodeQL polynomial-regex alert at the same time since the `\S\s+` pattern is gone. - attachCommitAttribution now snapshots prompt counters via `clearAttributions(true)` whenever a commit lands, even if no per-file attributions were tracked. Previously the early-return on `hasAttributions() === false` meant `promptCountAtLastCommit` never advanced, so a later `gh pr create` reported an inflated N-shotted count spanning multiple commits. Tests: env-var and sudo prefixes still produce trailers; quoted "git commit" / "gh pr create" leave commands unchanged; cd BEFORE commit suppresses the rewrite while cd AFTER commit does not; `git -C <path> commit` is treated as a commit (refused in background) but not as attributable.	2026-05-01 17:29:36 +08:00
wenshao	31bddbcc12	fix(shell): broaden git-commit detection, gate background, drop dead helpers Five Copilot follow-ups: - looksLikeGitCommit now strips leading env-var assignments (`GIT_COMMITTER_DATE=now git commit ...`) and a small allowlist of safe wrappers (`sudo`, `command`) before matching. The previous exact-prefix match silently skipped trailer injection on common real-world commit forms. - A new looksLikeGhPrCreate (same shell-aware shape) replaces the raw `\bgh\s+pr\s+create\b` regex in addAttributionToPR, so quoted text like `echo "gh pr create --body \"x\""` no longer triggers a command-string rewrite. - executeBackground refuses to run `git commit` and tells the user to re-run foreground. The BackgroundShellRegistry lifecycle has no hook for the post-command pre/post-HEAD comparison or git-notes write, so allowing the commit through would create the new commit without notes and leak stale per-file attribution into the next foreground commit. - recordDeletion was unused outside its own test — removed (and the test). When AI-driven deletions need tracking we'll add it with an actual integration point rather than carrying dead API surface. - generatePRAttribution was likewise unused; addAttributionToPR builds the trailer string inline. The two formats had already diverged. Removed the helper and its tests; reviving from git history is straightforward if a future caller needs it. Tests: env-var and sudo prefixes now produce trailers; quoted "gh pr create" leaves the command unchanged; existing 81 shell tests still pass alongside the trimmed 25 commitAttribution tests.	2026-05-01 17:06:42 +08:00
wenshao	10676dcebf	fix(attribution): drop magic 100 fallback for empty deletions Deleted files with no AI tracking now use diffSize directly. With numstat as the input source, diffSize is an exact count, and an empty-file deletion legitimately reports zero — a magic fallback would only inflate totals.	2026-05-01 16:52:43 +08:00
wenshao	9def2b62cb	fix(shell): shell-aware git-commit detection and apostrophe-escape handling Two more Critical items called out by wenshao plus the matching Copilot quote-handling notes: - attachCommitAttribution and addCoAuthorToGitCommit now go through a shell-aware `looksLikeGitCommit` helper instead of a raw `\bgit\s+commit\b` regex. The helper splits the command on shell separators (`splitCommands`) and checks each segment, so `echo "git commit"` no longer triggers attribution clearing or trailer injection. The same helper bails on any segment that contains `cd` or `git -C <path>`, since either could redirect the commit into a different repo than our cwd — writing notes or capturing HEAD there would corrupt unrelated state. - The post-command attribution call now runs regardless of whether the shell wrapper aborted. `git commit -m "x" && sleep 999` could move HEAD and then time out, leaving the new commit without its attribution note while the stale per-file attribution stayed around for a later unrelated commit. attachCommitAttribution still gates on HEAD movement, so it's a no-op when no commit was actually created. - The `-m '...'` and `--body '...'` regexes used to match only the first quote segment, so a command like `git commit -m 'don'\''t'` (bash's standard apostrophe-escape form) would have the trailer spliced mid-message and break the command's quoting. The single- quote patterns now use a negative lookahead / inner alternation to either skip those messages entirely (commit path) or match the whole escape-aware body (PR path). Tests cover the new behavior: quoted "git commit" is left alone, the `cd && git commit` and `git -C` patterns get no trailer, and the apostrophe-escape form passes through unchanged for both `-m` and `--body`.	2026-05-01 16:51:34 +08:00
wenshao	ae68a4fe8a	fix(attribution): repo-root baseDir, escape co-author trailer, switch to numstat Three Critical items called out by wenshao: - attachCommitAttribution was passing config.getTargetDir() as `baseDir` to generateNotePayload, but getCommittedFileInfo returns paths relative to `git rev-parse --show-toplevel`. When the working directory was a subdirectory of the repo, path.relative produced `../...` keys that never matched in the AI-attribution lookup, silently zeroing out attribution for every file outside getTargetDir. StagedFileInfo now carries an optional `repoRoot` (filled in by getCommittedFileInfo via `git rev-parse --show-toplevel`) and the caller prefers it over the target dir. - addCoAuthorToGitCommit interpolated `gitCoAuthorSettings.name` and `.email` into the rewritten command without escaping. A name containing `$()`, backticks, or `"` could be evaluated as command substitution under double quotes, or break the user-approved `git commit -m "..."` quoting. Now escapes per the surrounding quote style with the same helpers addAttributionToPR uses, gates on non-Windows for the same shell-quoting reason, and fixes the regex to accept `-m"msg"` shorthand (no space) so users who type the bash-shorthand form aren't silently denied a trailer. - parseDiffStat used `git diff --stat` output and approximated each line as ~40 chars by parsing a graphical text bar. Replaced with `git diff --numstat` which gives unambiguous integer additions+deletions per file; the heuristic remains but the parser is no longer fooled by the visual `++--` markers. Binary entries fall back to a fixed estimate so they still land in the map (rather than dropping out as diffSize=0). Suggestions also addressed: stale duplicate JSDoc on addCoAuthorToGitCommit removed, misleading `clearAttributions` comments rewritten to describe what the boolean argument actually does. Tests cover the new shorthand path, escape behavior, and numstat parsing (text/binary/rename/malformed).	2026-05-01 14:14:54 +08:00

1 2 3 4 5 ...

5718 commits