qwen-code/.qwen/skills/structured-debugging/SKILL.md
tanzhenxin 0da1182b74
feat(cli): headless support and SDK task events for background agents (#3379)
* feat(cli): unify notification queue for cron and background agents

Migrate cron from its own queue (cronQueueRef / cronQueue) to the shared
notification queue used by background agents. Both producers now push the
same item shape { displayText, modelText, sendMessageType } and a single
drain effect / helper processes them in FIFO order.
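
A minimal sketch of that shared shape, assuming hypothetical helper names around the three fields quoted above:

```ts
// Both producers (cron and background agents) push the same item shape;
// everything except the three fields below is an illustrative stand-in.
interface QueuedNotification {
  displayText: string;     // what the UI renders (e.g. "Cron: <prompt>")
  modelText: string;       // what is sent to the model
  sendMessageType: string; // how the drain helper submits the turn
}

const queue: QueuedNotification[] = [];

function enqueueNotification(item: QueuedNotification): void {
  queue.push(item);
}

// A single drain helper processes queued items in FIFO order.
async function drainNotifications(
  process: (item: QueuedNotification) => Promise<void>,
): Promise<void> {
  while (queue.length > 0) {
    const item = queue.shift()!;
    await process(item);
  }
}
```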

Cron fires render as HistoryItemNotification (● prefix) instead of
HistoryItemUser (> prefix), with a "Cron: <prompt>" display label.
Records use subtype 'cron' for clean resume and analytics separation.

Lift the non-interactive rejection for background agents. Register a
notification callback in nonInteractiveCli.ts with a terminal hold-back
phase (100ms poll) that keeps the process alive until all background
agents complete and their notifications are processed.
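
A rough sketch of that hold-back phase, assuming a registry with a `getRunning()` accessor (named in a later commit) and an illustrative queue check:

```ts
// Keep the non-interactive process alive until every background agent has
// finished and every queued notification has been drained.
async function holdBackUntilIdle(
  registry: { getRunning(): unknown[] },
  queueIsEmpty: () => boolean,
): Promise<void> {
  while (registry.getRunning().length > 0 || !queueIsEmpty()) {
    // 100ms poll, as described above.
    await new Promise((resolve) => setTimeout(resolve, 100));
  }
}
```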

* feat(cli): emit SDK task events for background subagents

Emit `task_started` when a background agent registers and
`task_notification` when it completes, fails, or is cancelled, so
headless/SDK consumers can track lifecycle without parsing display
text. Model-facing text is now structured XML with status, summary,
truncated result, and usage stats. Completion stats (tokens, tool
uses, duration) are captured from the subagent and included in both
the SDK payload and the model XML.
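
The event payloads might look roughly like the sketch below; every field name beyond `tool_use_id`, `status`, `description`, and the listed usage stats is an assumption for illustration, not the actual SDK schema:

```ts
type TaskStatus = 'completed' | 'failed' | 'cancelled';

interface TaskStartedEvent {
  type: 'task_started';
  tool_use_id?: string;  // added by a later commit in this PR
  description: string;
}

interface TaskNotificationEvent {
  type: 'task_notification';
  tool_use_id?: string;
  status: TaskStatus;
  summary: string;
  // Completion stats captured from the subagent.
  usage?: { tokens: number; toolUses: number; durationMs: number };
}
```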

* fix: address codex review issues for background subagents

- Background subagents now inherit the resolved approval mode from
  agentConfig instead of the raw session config, so a subagent with
  `approvalMode: auto-edit` (or execution in a trusted folder) keeps
  that override when it runs asynchronously.
- Non-interactive cron drains are single-flight: concurrent cron fires
  now await the same in-flight drain, and the cron-done check gates
  on it, preventing the final result from being emitted while a cron
  turn is still streaming.
- Background forks go through createForkSubagent so they retain the
  parent's rendered system prompt and inherited history instead of
  degrading to a plain FORK_AGENT.

* fix(cli): restore cancellation, approval, and error paths in queued drain

- Hold-back loop now reacts to SIGINT/SIGTERM: when the main abort
  signal fires it calls registry.abortAll() so background agents with
  their own AbortControllers stop promptly instead of pinning the
  process open.
- Queued-turn tool execution forwards the stream-json approval-update
  callback (onToolCallsUpdate), so permission-gated tools inside a
  background-notification follow-up turn emit can_use_tool requests.
- Queued-turn stream loop mirrors the main loop's text-mode handling
  of GeminiEventType.Error, writing to stderr and throwing so provider
  errors produce a non-zero exit code instead of silently succeeding.
- Interactive cron prompts go through the normal slash/@-command/shell
  preprocessing again; only Notification messages skip that path.
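
A sketch of the error-path mirroring from the third bullet, with stand-in event types (the real `GeminiEventType` lives in the core package):

```ts
// Stand-ins for the real stream event types.
enum GeminiEventType { Content = 'content', Error = 'error' }
type StreamEvent =
  | { type: GeminiEventType.Content; value: string }
  | { type: GeminiEventType.Error; value: { error: { message: string } } };

async function consumeQueuedTurn(stream: AsyncIterable<StreamEvent>): Promise<void> {
  for await (const event of stream) {
    if (event.type === GeminiEventType.Error) {
      // Mirror the main loop's text-mode handling: surface the provider
      // error and throw so the process exits non-zero.
      process.stderr.write(`${event.value.error.message}\n`);
      throw new Error(event.value.error.message);
    }
    // ...other event types handled as in the main turn loop
  }
}
```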

* fix(cli): skip duplicate user-message item for cron prompts

Cron prompts already render as a `● Cron: …` notification via the queue
drain, so adding them again as a `USER` history item produced a
duplicate `> …` line.

* fix(cli): honor SIGINT/SIGTERM during cron scheduler wait

The non-interactive cron phase awaits a Promise that resolves only when
scheduler.size reaches 0 and no drain is in flight. Recurring cron jobs
never drop the scheduler size to 0 on their own, so the previous abort
handling (added to the hold-back loop) was unreachable — the process
hung indefinitely after SIGINT/SIGTERM. Attach an abort listener inside
the promise so abort stops the scheduler and resolves immediately,
allowing the hold-back loop to run and the process to exit cleanly.
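
Sketched below with illustrative names; only `scheduler.size` and the abort behavior come from the description above:

```ts
function waitForCronPhase(
  scheduler: { size: number; stop(): void },
  isDrainInFlight: () => boolean,
  signal: AbortSignal,
): Promise<void> {
  return new Promise((resolve) => {
    const maybeFinish = () => {
      if (scheduler.size === 0 && !isDrainInFlight()) resolve();
    };
    // Recurring jobs never drop scheduler.size to 0 on their own, so abort
    // must short-circuit the wait: stop the scheduler and resolve now.
    signal.addEventListener(
      'abort',
      () => {
        scheduler.stop();
        resolve();
      },
      { once: true },
    );
    // In the real code maybeFinish() is wired to scheduler/drain events.
    maybeFinish();
  });
}
```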

* feat(core): propagate tool-use id through background agent notifications

Plumb the scheduler's callId into AgentToolInvocation via an optional
setCallId hook on the invocation, detected structurally in
buildInvocation. The agent tool forwards it as toolUseId on the
BackgroundTaskRegistry entry so completion notifications can carry a
<tool-use-id> tag and SDK task_started / task_notification events can
emit tool_use_id — letting consumers correlate background completions
back to the original Agent tool-use that spawned them.
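
The structural detection might look like this; apart from `setCallId`, `callId`, and `toolUseId`, the names are hypothetical:

```ts
interface SupportsCallId {
  setCallId(callId: string): void;
}

// Structural check at invocation-build time: any invocation exposing a
// setCallId function opts in to receiving the scheduler's callId.
function supportsCallId(invocation: unknown): invocation is SupportsCallId {
  return (
    typeof invocation === 'object' &&
    invocation !== null &&
    typeof (invocation as SupportsCallId).setCallId === 'function'
  );
}

function attachCallId(invocation: unknown, callId: string): void {
  if (supportsCallId(invocation)) {
    invocation.setCallId(callId); // later surfaced as toolUseId on the registry entry
  }
}
```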

* fix(cli): drain single-flight race kept task_notification from emitting

drainLocalQueue wrapped its body in an async IIFE and cleared the
promise reference via finally. When the queue is empty the IIFE has
no awaits, so its finally runs synchronously as part of the RHS of
the assignment `drainPromise = (async () => {...})()` — clearing
drainPromise BEFORE the outer assignment overwrites it with the
resolved promise. The reference then stayed stuck on that fulfilled
promise forever, so later calls short-circuited through
`if (drainPromise) return drainPromise` and never processed
queued notifications.

Symptom: in headless `--output-format json` (and `stream-json`),
task_started emitted but task_notification never did, even after
the background agent completed. The process sat in the hold-back
loop until SIGTERM.

Fix: move the null-clearing out of the async body into an outer
`.finally()` on the returned promise. `.finally()` runs as a
microtask after the current synchronous block, so it clears the
latest drainPromise reference instead of the pre-assignment null.
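
In sketch form (the drain body is elided; only the promise bookkeeping matters here):

```ts
let drainPromise: Promise<void> | null = null;

// Buggy shape: with an empty queue the async body has no awaits, so the
// inner finally runs synchronously and clears drainPromise BEFORE the
// assignment overwrites it with the now-settled promise.
function drainLocalQueueBuggy(): Promise<void> {
  if (drainPromise) return drainPromise;
  drainPromise = (async () => {
    try {
      // ...process queued notifications
    } finally {
      drainPromise = null; // runs too early when there were no awaits
    }
  })();
  return drainPromise;
}

// Fixed shape: .finally() on the returned promise runs as a microtask
// after the synchronous assignment, so it clears the latest reference.
function drainLocalQueue(): Promise<void> {
  if (drainPromise) return drainPromise;
  drainPromise = (async () => {
    // ...process queued notifications
  })().finally(() => {
    drainPromise = null;
  });
  return drainPromise;
}
```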

* fix(cli): append newline to text-mode emitResult so zsh PROMPT_SP doesn't erase the line

Headless text mode wrote `resultMessage.result` without a trailing newline.
In a TTY, zsh themes that use PROMPT_SP (powerlevel10k, agnoster, …) detect
the missing `\n` and emit `\r\033[K` before drawing the next prompt, which
wipes the final line off the screen. Pipe-captured output was unaffected,
so the bug only surfaced for interactive shell users — most visibly in the
background-agent flow where the drain-loop's final assistant message is
the *only* stdout write in text mode.

Append `\n` to both the success (stdout) and error (stderr) writes.

* docs(skill): tighten worked-example blurb in structured-debugging

Mirror the simplified blurb from .claude/skills/structured-debugging/SKILL.md
(knowledge repo). Drops the round-by-round narrative; keeps the contradiction
+ two lessons.

* docs(skill): mirror SKILL.md improvements (reframing failure mode, generalized path, value-logging guidance)

Mirror of knowledge repo commit 38eb28d into the qwen-code .qwen/skills
copy.

* docs(skill): mirror worked example into .qwen/skills/structured-debugging/

Mirrors knowledge/.claude/skills/structured-debugging/examples/
headless-bg-agent-empty-stdout.md so the .qwen copy of the skill links
resolve.

* docs(skill): mirror generalized side-note path guidance

* fix(cli): harden headless cron and background-agent failure paths

Three regressions surfaced by Codex review of feat/background-subagent:

- Cron drain rejections were dropped by a bare `void`, so a failing
  queued turn left the outer Promise unresolved and hung the run. Route
  drain failures through the Promise's reject so they propagate to the
  outer catch.
- The background-agent registry entry was inserted before
  `createForkSubagent()` / `createAgentHeadless()` was awaited. Failed
  init returned an error from the tool call but left a phantom `running`
  entry, and the headless hold-back loop (`registry.getRunning()`) waited
  forever. Register only after init succeeds.
- SIGINT/SIGTERM during the hold-back phase aborted background tasks,
  then fell through to `emitResult({ isError: false })`, so a cancelled
  `qwen -p ...` exited 0 with the prior assistant text. Route through
  `handleCancellationError()` so cancellation exits non-zero, matching
  the main turn loop.
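
A sketch of the first fix, with illustrative wiring around the `void`-to-reject change:

```ts
// Before: `void drain();` swallowed rejections, so a failing queued turn
// never settled the phase promise and the run hung.
// After: failures are routed to reject so they reach the outer catch.
function runCronPhase(
  onCronFire: (handler: () => void) => void,
  drain: () => Promise<void>,
): Promise<void> {
  return new Promise<void>((resolve, reject) => {
    onCronFire(() => {
      drain().catch(reject);
    });
    // resolve() is wired to "scheduler empty and no drain in flight".
  });
}
```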

* test(cli): update stdout/stderr assertions for trailing newline

`feadf052f` appended `\n` to text-mode `emitResult` output, but the
nonInteractiveCli tests still asserted the pre-change strings. Update
the 11 affected assertions to expect the trailing newline.

* fix: address review comments on background-agent notifications

Four additional issues from the PR review that the prior regression-fix
commit didn't cover:

- Escape XML metacharacters when interpolating `description`, `result`,
  `error`, `agentId`, `toolUseId`, and `status` into the task-notification
  envelope. Subagent output (which itself may carry untrusted tool output,
  fetched HTML, or another agent's notification) could contain
  `</result>` or `</task-notification>` and forge sibling tags the parent
  model would treat as trusted metadata. Truncate result text *before*
  escaping so the truncation never slices through an entity like `&amp;`.
- Emit the terminal notification from `cancel()` and `abortAll()`. The
  fire-and-forget `complete()`/`fail()` from the subagent task is guarded
  by `status !== 'running'` and was no-op'd after cancellation, so SDK
  consumers saw `task_started` with no matching `task_notification`,
  breaking the contract this PR establishes. Updated two race-guard
  tests that asserted the old behavior.
- Call `adapter.finalizeAssistantMessage()` before the abort-triggered
  early return inside `drainOneItem`'s stream loop. Without it,
  `startAssistantMessage()` had already been called, so stream-json mode
  left `message_start` unpaired.
- Enforce `config.getMaxSessionTurns()` in `drainOneItem` for symmetry
  with the main turn loop. Cron fires and notification replies otherwise
  bypass the budget cap in headless runs.
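
A sketch of the escaping rule from the first bullet; the helper names are assumptions, and the point is the entity handling plus the truncate-before-escape ordering:

```ts
function escapeXml(text: string): string {
  return text
    .replace(/&/g, '&amp;') // must run first so later entities aren't re-escaped
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&apos;');
}

// Truncate BEFORE escaping so the cut can never land inside an entity
// such as `&amp;` that escaping introduced.
function safeResultText(result: string, maxLength: number): string {
  const truncated =
    result.length > maxLength ? `${result.slice(0, maxLength)}…` : result;
  return escapeXml(truncated);
}
```
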
2026-04-17 14:42:44 +08:00


name: structured-debugging
description: Hypothesis-driven debugging methodology for hard bugs. Use this skill whenever you're investigating non-trivial bugs, unexpected behavior, flaky tests, or tracing issues through complex systems. Activate proactively when debugging requires more than a quick glance — especially when the first attempt at a fix didn't work, when behavior seems "impossible", or when you're tempted to blame an external system (model, API, library) without evidence.

Structured Debugging

When debugging hard issues, the natural instinct is to form a theory and immediately apply a fix. This fails more often than it works. The fix addresses the wrong cause, adds complexity, creates false confidence, and obscures the real issue. Worse, after several failed attempts you lose track of what's been tried and start guessing randomly.

This methodology replaces guessing with a disciplined cycle that converges on the root cause. Each iteration narrows the search space. It's slower per attempt but dramatically faster overall because you stop wasting runs on wrong theories.

The Cycle

1. Hypothesize

Before touching code, write down what you think is happening and why. Be specific about the expected state at each step in the execution path.

Bad: "Something is wrong with the wait loop." Good: "The leader hangs because hasActiveTeammates() returns true after all agents have reported completed, likely because terminal status isn't being set on the agent object after the backend process exits."

For bugs you expect to take more than one round, create a side note file for the investigation in whichever location the project uses for such notes.

Write your hypothesis there. This file persists across conversation turns and even across sessions — it's your investigation journal.

2. Design Instrumentation

Add targeted debug logs or assertions at the exact decision points that would confirm or reject your hypothesis. Think about what data you need to see.

Don't scatter console.log everywhere. Identify the 2-3 places where your hypothesis makes a testable prediction, and instrument those.

Prefer logging values (return codes, payload contents, stream types, message bodies, env state) over presence checks ("was this function called?", "was this branch taken?"). Code-path traces tell you what ran; data traces tell you what it ran on. Most non-trivial bugs are correct code processing wrong data.
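
For example (hypothetical names; the point is logging the data, not the fact of execution):

```ts
type Message = { body: string; createdAt: string };
const inbox: { messages: Message[] } = { messages: [] }; // stand-in

// Presence check: proves the code path ran, says nothing about the data.
console.log('[debug] deliverInbox() called');

// Value log: shows what the code actually processed. Stale or empty
// inboxes, wrong stream types, and malformed payloads show up here.
console.log('[debug] deliverInbox()', {
  messageCount: inbox.messages.length,
  oldestTimestamp: inbox.messages[0]?.createdAt,
});
```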

Ask yourself: "If my hypothesis is correct, what will I see at point X? If it's wrong, what will I see instead?"

3. Verify Data Collection

Before running, confirm that your instrumentation output will actually be captured and accessible.

Common traps:

  • stderr discarded by 2>/dev/null in the test command
  • Process killed before flush (logs lost)
  • Logging to a file in a directory that doesn't exist
  • Output piped through something that truncates it
  • Looking at log files from a previous run, not the current one

A test run that produces no data is wasted.

4. Run and Observe

Execute the test. Read the actual output — every line of it. Don't assume what it says.

When the data contradicts your hypothesis, believe the data. Don't rationalize it away. The whole point of this step is to let reality override your theory.

5. Document Findings

Update the side note with:

  • What the data showed (quote specific log lines)
  • What was confirmed vs. disproved
  • Updated hypothesis for the next iteration

This is critical for not losing context across attempts. Hard bugs typically take 3-5 rounds. Without notes, you'll forget what you ruled out and waste runs re-checking things.

6. Iterate

Update the hypothesis based on the new evidence. Go back to step 2. Each round should narrow the search space.

If you're not making progress after 3 rounds, step back and question your assumptions. The bug might be in a layer you haven't considered.

Failure Modes to Avoid

These are the specific traps this methodology is designed to prevent. When you notice yourself drifting toward any of them, stop and return to the cycle.

Jumping to fixes without evidence

The most common failure. You have a plausible theory, so you "fix" it and run again. If the theory was wrong, you've added complexity, wasted a test run, and possibly introduced a new bug. The side note should always show "hypothesis verified by [specific data]" before any fix is applied.

Blaming external systems

"The model is hallucinating." "The API is flaky." "The library has a bug." These conclusions feel satisfying because they put the problem outside your control. They're also usually wrong.

Before blaming an external system, inspect what it actually received. A model that appears to hallucinate may be responding rationally to stale data you didn't know was there. An API that appears flaky may be receiving malformed requests. Look at the inputs, not just the outputs.

Inspecting code paths but not data

You instrument the code and prove it executes correctly — the right functions are called, in the right order, with no errors. But the bug persists. Why?

Because the code can work perfectly while processing garbage input. A function that correctly reads an inbox, correctly delivers messages, and correctly formats output is still broken if the inbox contains stale messages from a previous run.

Always inspect the content flowing through the code, not just whether the code runs. Check payloads, message contents, file data, and database state.

Reframing the user's report instead of investigating it

When the user reports a symptom your own run doesn't reproduce, the contradiction is the evidence — the two environments differ in some way you haven't identified yet. The wrong move is to reframe their report ("they must be on a stale SHA", "they must be confused about what they saw", "must be a flake") so that your run becomes the ground truth. Once you do that, every later piece of evidence gets bent to defend the reframing, and the actual bug stays hidden.

The right move: catalogue what differs between their environment and yours (TTY vs pipe, terminal emulator, shell, locale, env vars, prior state, build artifacts) before forming any hypothesis. For ambiguous symptoms ("no output", "it's slow", "it's wrong") ask one disambiguating question first — e.g., "does it hang or exit cleanly?" — that prunes the hypothesis space cheaply before any test run.

Losing context across attempts

After several debugging rounds, you start forgetting what you already tried and what you ruled out. You re-check things, go in circles, or abandon a promising line of investigation because you lost track of where it was heading.

This is why the side note file exists. Update it after every run. When you start a new round, re-read it first.

Persistent State: A Special Category

Features that persist data across runs — caches, session recordings, message queues, temp files, database rows — are a frequent source of "impossible" bugs. The current run's behavior is contaminated by leftover state from previous runs.

When behavior seems irrational, always check:

  • Is there persistent state that carries across runs?
  • Was it cleared before this run?
  • Is the system responding to stale data rather than current data?

This is easy to miss because the code is correct — it's the data that's wrong.

When to Exit the Cycle

Apply the fix when — and only when — you can point to specific data from your instrumentation that confirms the root cause. Write in the side note:

Root cause: [specific mechanism]
Evidence: [specific log lines / data that confirm it]
Fix: [what you're changing and why it addresses the root cause]

Then apply the fix, remove instrumentation, and verify with a clean run.

Worked examples

  • examples/headless-bg-agent-empty-stdout.md — pipe-captured runs all passed; the user's TTY printed nothing. The contradiction was the bug. Illustrates "a reproduction contradiction is data" and "instrument data, not code paths".