qwen-code/docs/design/workflow-tracing-gaps.md
jinye 8dfbdaa5d4
feat(telemetry): unify span creation paths for hierarchical trace tree (#4126)
* feat(telemetry): unify span creation paths for hierarchical trace tree (#3731 P3 Phase 1)

Replace disconnected withSpan/startSpanWithContext calls in runtime with
session-tracing typed helpers so LLM and tool spans become children of
the interaction span instead of siblings under the session root.

- Add toolContext ALS with runInToolSpanContext() for concurrent-safe
  tool span scoping (uses AsyncLocalStorage.run, not enterWith)
- Wire startLLMRequestSpan/endLLMRequestSpan in loggingContentGenerator
  for both streaming and non-streaming paths
- Wire startToolSpan/endToolSpan + startToolExecutionSpan/endToolExecutionSpan
  in coreToolScheduler with proper try/finally lifecycle
- Remove redundant withSpan('client.generateContent') wrapper from client.ts
- Fix endToolSpan to not override pre-set status when metadata is omitted
- Change startToolExecutionSpan to read parent from toolContext ALS
- Update tests for new span creation APIs and remove dead test infrastructure

* fix(telemetry): address CI build errors in session-tracing tests

- Remove unused _toolSpan variable (TS6133)
- Use bracket notation for index signature property access (TS4111)

* fix(telemetry): update coreToolScheduler and loggingContentGenerator test mocks

- coreToolScheduler.test.ts: mock startToolSpan/endToolSpan/runInToolSpanContext
  instead of withSpan; update cancellation tests for restored safeSetStatus call
- loggingContentGenerator.test.ts: fix attribute keys in mock, add try/catch in
  endLLMRequestSpan mock to match production best-effort behavior

* fix(telemetry): address review feedback from wenshao

- Add debugLogger.warn in catch blocks of endLLMRequestSpan/endToolSpan/
  endToolExecutionSpan instead of silent swallowing
- Add JSDoc on endToolSpan documenting intentional no-metadata-no-status
  contract with setToolSpanFailure/setToolSpanCancelled
- Add warning in startToolExecutionSpan when called outside
  runInToolSpanContext (no active toolContext)
- Sanitize error message in endToolExecutionSpan: use constant
  TOOL_SPAN_STATUS_TOOL_EXCEPTION instead of raw error message

* fix(telemetry): use partial mock for telemetry/index.js in coreToolScheduler tests

The full mock shadowed all re-exports (logToolCall, etc.) causing 49 test
failures. Use importActual to preserve other exports, only override span
functions.

* fix(telemetry): getLastToolSpan must skip tool.execution sub-spans

startToolExecutionSpan mock also pushes to toolSpanRecords, so at(-1)
returns the execution sub-span instead of the tool span. Use findLast
to filter by name.

* fix(telemetry): address second round review feedback

- Remove redundant safeSetStatus(span, OK) on success path — endToolSpan
  in finally already sets OK via metadata
- Add llm_request.stream attribute (true/false) to distinguish streaming
  vs non-streaming LLM requests in trace backends

* fix(telemetry): endToolSpan mock writes to record directly

Bypass span.setStatus() in mock to avoid potential interference from
vitest module resolution. Write to statusCalls/ended directly on the
ToolSpanRecord.

* fix(telemetry): mock session-tracing.js directly instead of telemetry/index.js

Mocking the barrel re-export (telemetry/index.js) with importActual was
unreliable — vitest's module resolution could bind production code to the
real endToolSpan before the mock override took effect. Mock the source
module (session-tracing.js) directly to guarantee interception.

* fix(telemetry): fix endToolSpan status on success — toolCalls is empty in finally

Root cause: checkAndNotifyCompletion clears this.toolCalls before the
finally block in executeSingleToolCall runs, so the tc lookup always
returns undefined.

Fix: set OK status explicitly in _executeToolCallBody's success path
via safeSetStatus(span, OK), and call endToolSpan() without metadata
in finally (just ends the span, preserves pre-set status from any path).

* fix(telemetry): address Codex review — activate OTel context, end span on failure

- Wrap non-stream generateContent API call + logging in context.with(spanContext)
  so nested OTel spans (HTTP instrumentation, log-bridge spans) parent to
  qwen-code.llm_request instead of session root (matches streaming path).
- runInToolSpanContext now also activates OTel context via otelContext.with,
  not just the custom toolContext ALS. Hooks/HTTP/IO during tool execution
  now correctly parent to qwen-code.tool span.
- Split end*Span helpers: span.end() runs in its own try/catch so a throwing
  setAttributes/setStatus can't leak unended spans.

* fix(telemetry): address Codex review v2 — session-root fallback + execution span timing

- start{LLMRequest,Tool,ToolExecution}Span now fall back to getSessionContext()
  when no parent context, instead of otelContext.active(). Side-query LLM calls
  (auto-title, recap) now stay in the session trace instead of starting a new
  detached trace.
- Move startToolExecutionSpan() to BEFORE invocation.execute(), matching
  claude-code. Previously the synchronous setup inside execute (shell command
  preprocessing, child_process.spawn) ran outside the execution span.

* fix(telemetry): address Codex review v3 — sync throw, idle timeout race, test coverage

- coreToolScheduler.executeSingleToolCall: move try-block to wrap
  invocation.execute() so synchronous throws (e.g. shell setup failure)
  flow into the same catch path as async rejections. Previously a sync
  throw would leak the execution span and skip failure hooks.

- loggingStreamWrapper: track spanEndedByTimeout flag so a stream that
  resumes after the 5-min idle timeout does not run the final
  endLLMRequestSpan (which would no-op anyway, but the flag also stops
  resetSpanTimeout from queuing further timer callbacks).

- coreToolScheduler.test: add execution sub-span assertions for success,
  ToolResult.error, thrown invocation exceptions, and pre-hook denial.

- loggingContentGenerator.test: capture setAttribute calls into the mock
  span attributes record; assert llm_request.stream is false for non-stream
  and true for stream paths.

* fix(telemetry): address Codex review v4 — consistency + test coverage gaps

- endLLMRequestSpan now uses spanCtx.span for mutations (matches
  endToolSpan/endToolExecutionSpan pattern). Same object, but consistent
  lookup pattern prevents future drift.

- Mocks capture endLLMRequestSpan and endToolSpan/endToolExecutionSpan
  metadata so tests can assert token counts, durationMs, success, error
  are forwarded correctly. Add assertions on:
    * Non-stream LLM: inputTokens, outputTokens, success on response path
    * Non-stream LLM: success: false + sanitized error on rejection
    * Stream LLM: final lastUsageMetadata reaches endLLMRequestSpan
    * Tool execution sub-span: success: true on happy path
    * Tool execution sub-span: success: false on ToolResult.error
    * Tool execution sub-span: success: false + sanitized error on throw

- Add OTel error resilience tests: when setAttributes or setStatus throws,
  span.end() must still run and the span must be removed from activeSpans.
  Covered for endLLMRequestSpan, endToolSpan, endToolExecutionSpan.

* fix(telemetry): address Codex review v5 — abort distinction + API symmetry

- session-tracing.ts SpanContext.type: comment 'tool.blocked_on_user' |
  'hook' as Phase 2 forward-declarations (no helpers wired yet).

- endToolExecutionSpan: align no-metadata-no-status behavior with
  endToolSpan. Currently no caller omits metadata, but the asymmetric
  default (OK vs preserve-pre-set) was a maintenance trap.

- loggingContentGenerator generateContent (non-stream) catch block:
  call endLLMRequestSpan BEFORE the logging block, mirroring the
  streaming path. Defense-in-depth against logging-side rejections.

- loggingContentGenerator: restore abort-specific span status message.
  All three LLM error paths (non-stream catch, stream eager-error catch,
  stream loggingStreamWrapper finally) now use
  API_CALL_ABORTED_SPAN_STATUS_MESSAGE when req.config.abortSignal.aborted,
  matching the original withSpan('client.generateContent') behavior.
  Trace backends can now distinguish cancellations from real failures.

- coreToolScheduler _executeToolCallBody catch: distinguish abort vs
  exception in execSpan error message. New constant
  TOOL_SPAN_STATUS_TOOL_CANCELLED prevents operators filtering exec
  spans for errors from seeing cancellation false positives.

- New test asserting exec span uses cancelled-by-user message when the
  invocation throws after abort.

* fix(telemetry): always write 'success' attribute on tool spans

E2E review found qwen-code.tool spans never carry the `success` boolean
attribute (the helper only writes it when metadata is passed, and the
finally block calls endToolSpan(toolSpan) without metadata). This breaks
the most common observability query — filtering tool failures with
`success = false` — because tool spans don't have that field at all.

Fix: setToolSpanFailure / setToolSpanCancelled now also call
span.setAttribute('success', false); the success path in
_executeToolCallBody adds span.setAttribute('success', true) after
safeSetStatus(span, OK). Mirrors the unconditional `success` attribute
on llm_request spans, so backends can use one query for both span types.

Add 4 scheduler-level tests asserting the success attribute on:
- success path
- ToolResult.error path
- thrown invocation path
- cancellation path
2026-05-16 22:29:55 +08:00

20 KiB
Raw Permalink Blame History

Workflow 级 Span 粒度不足分析 (P1)

基于 2026-05-13 对 qwen-code origin/main 的复核

现状

qwen-code 已具备 tracing 基础设施:

组件 位置 说明
Span 类型定义 packages/core/src/telemetry/session-tracing.ts interactionllm_requesttooltool.execution
Tracer 工具 packages/core/src/telemetry/tracer.ts session root context、withSpanstartSpanWithContext
交互入口 packages/core/src/core/client.ts 顶层交互显式启动 interaction span
生命周期管理 AsyncLocalStorage + WeakRef + TTL cleanup

当前 runtime 中稳定接入的主要是两类 generic spans

  • api.generateContent / api.generateContentStream
  • tool.<toolName>

结论:已进入"有 tracing 主干"阶段,但尚未把 agent workflow 的阶段边界完整编码进 trace 树。

对比claude-code 已实现的 span 类型

参考 claude-code/src/utils/telemetry/sessionTracing.ts (line 49)

  • interaction
  • llm_request
  • tool
  • tool.blocked_on_user
  • tool.execution
  • hook

缺失项

缺失 span / 机制 影响
permission_wait / blocked_on_user span 无法区分审批等待 vs 工具执行耗时
hook span hook 耗时被折叠进 tool span定位边界不清
subagent root span subagent 内部 llm/tool 调用无法形成 trace 子树
tool.execution 真实接线 helper 已定义但主链路未调用
稳定的 parent-child wiring spans 多为 session root 下的 sibling 而非层级树

逐项分析

1. 用户审批等待不在 trace 中

工具调用等待审批时,状态迁移路径为 awaiting_approvalscheduled → 执行。

  • "等待用户确认"只是状态迁移,不是 trace 节点
  • trace 上看不到审批等待耗时
  • 工具慢时无法区分是"卡在等用户"还是"工具本身执行慢"

2. Hook 有事件记录但没有独立 span

Pre/Post hook 执行后产出 HookCallEvent,走 logHookCall(),但不建立独立 OTel span。

  • hook 变慢时表现为外层 tool span 变慢
  • hook 失败时表现为 "tool 失败"
  • trace 无法回答"时间花在 hook 还是 tool.execution 上"

3. Subagent 是 log/metric 而非 trace subtree

subagent 启动/完成时记录 SubagentExecutionEvent 并进入 log/metric但没有形成显式 span 子树。

  • 能统计"哪个 subagent 跑过"
  • 不能顺着 trace 看"这个 subagent 触发了哪些 llm/tool 调用"
  • 并发 subagent 场景下因果链不清

4. tool.execution helper 已定义但未接入主链路

session-tracing.ts 中已有 startToolExecutionSpan() / endToolExecutionSpan(),但非测试代码中未见调用点。

当前实际 trace 树:

session-root
  interaction
    api.generateContent
    tool.Bash
  subagent_execution        (log/metric)
  hook_call                 (event/QwenLogger)

理想 trace 树:

interaction
  llm_request
    tool
      tool.blocked_on_user
      hook(pre)
      tool.execution
      hook(post)
  subagent
    interaction
      llm_request
        tool

5. Parent-child wiring 不够稳定

interaction span 已存在,但很多运行中的 spans 挂在 session root 下作为 sibling而不是 interaction 的子节点。

  • 调用树偏平
  • 节点间因果关系不直观
  • 从一个用户轮次追到内部 llm/tool/hook/subagent 的体验不连续

影响

  • traces 有基础价值,但不足以支撑 workflow 级排障
  • 无法直接回答"这轮慢在等用户、hook还是 tool 真执行"
  • 无法把 subagent 运行过程还原为可阅读的 trace 子树
  • hook 问题被折叠进 tool span定位边界不清
  • 在 Jaeger / Tempo / ARMS 上的树比 claude-code 更平、更难读

claude-code 方案复用分析

基于 2026-05-13 对 claude-code 源码的深度对比

claude-code 的 tracing 架构

claude-code 在 src/utils/telemetry/sessionTracing.ts 中实现了一个统一的、基于双 ALS 的 span 管理系统

                    interactionContext (ALS)          toolContext (ALS)
                          │                                │
                          ▼                                ▼
              ┌─────────────────────┐           ┌─────────────────────┐
              │  interaction span   │           │    tool span        │
              │  (session root)     │           │  (child of intxn)   │
              └─────────────────────┘           └─────────────────────┘
                   ▲ parent of                       ▲ parent of
                   │                                 │
           ┌───────┴───────┐              ┌──────────┼──────────┐
           │               │              │          │          │
      llm_request      tool          blocked    execution    hook
                                     _on_user

核心机制:

机制 实现
双 ALS interactionContext 存当前 interaction spantoolContext 存当前 tool span
parent 解析 每种 span 类型硬编码从哪个 ALS 取 parentllm_request/toolinteractionContextblocked_on_user/execution/hooktoolContexthook 有 fallback 到 interactionContext
生命周期 enterWith 注入 → span 运行 → enterWith(undefined) 清除
查找 span 非 ALS 存储的 span如 blocked_on_user通过 activeSpans Map 按 span.type 反查
内存管理 ALS 持有的 span 用 WeakRef非 ALS 持有的 span 用 strongRef 防 GCTTL 30min 自动清理

claude-code tool span 完整生命周期 (toolExecution.ts):

startToolSpan(name, attrs)                    // → toolContext.enterWith(spanCtx)
  startToolBlockedOnUserSpan()                // → parent = toolContext.getStore()
    [permission resolution / user prompt]
  endToolBlockedOnUserSpan(decision, source)
  startToolExecutionSpan()                    // → parent = toolContext.getStore()
    [tool.call()]
  endToolExecutionSpan({ success })
endToolSpan(result)                           // → toolContext.enterWith(undefined)

claude-code hook span (hooks.ts):

startHookSpan(event, name, count, defs)       // → parent = toolContext ?? interactionContext
  [parallel hook execution]
endHookSpan(span, { success, blocking, ... })

qwen-code 现有架构 vs claude-code

根本差异:两套断裂的 span 创建路径

这是 qwen-code 当前最关键的架构问题:

文件 用法 parent 解析
session-tracing 层 session-tracing.ts startInteractionSpan / startLLMRequestSpan / startToolSpan / startToolExecutionSpan 显式从 interactionContext ALS 取 parent
tracer 层 tracer.ts withSpan / startSpanWithContext context.active() 取 parentfallback 到 session root

runtime 实际调用情况:

  • startInteractionSpan已接入 (client.ts line 956),写入 interactionContext ALS
  • startLLMRequestSpan / endLLMRequestSpan未接入runtime 用的是 withSpan('api.generateContent', ...) (在 loggingContentGenerator.ts)
  • startToolSpan / endToolSpan未接入runtime 用的是 withSpan('tool.${name}', ...) (在 coreToolScheduler.ts)
  • startToolExecutionSpan / endToolExecutionSpan未接入

后果:

withSpangetParentContext() 先检查 context.active()OTel 原生 context找不到活跃 span 时回退到 session root context。它完全不读取 interactionContext ALS

因此 interaction span 和 LLM/tool spans 变成了 session root 下的平级 sibling,而不是 parent-child 树:

session-root
  ├── interaction         (来自 session-tracing, 写入了 interactionContext ALS)
  ├── api.generateContent (来自 withSpan, 不读 interactionContext → 挂到 session root)
  ├── tool.Bash           (来自 withSpan, 同上)
  └── tool.Read           (来自 withSpan, 同上)

而 claude-code 中,只有一套 span 创建路径sessionTracing.ts所有 span 都走同一套 ALS → OTel context 转换逻辑,所以树是完整的。

逐项复用评估

1. 双 ALS + 显式 parent 解析 — 可复用,是核心修复
维度 claude-code qwen-code
ALS 数量 2 (interactionContext + toolContext) 1 (interactionContext,无 toolContext)
parent 解析 每种 span 类型显式指定从哪个 ALS 取 parent withSpan 统一走 context.active()
context 注入 trace.setSpan(otelContext.active(), parentCtx.span) withSpan 内部由 startActiveSpan 隐式注入

复用方案:

qwen-code 的 session-tracing.ts 已经实现了与 claude-code 几乎相同的 parent 解析模式

// qwen-code session-tracing.ts (已有但未用)
export function startLLMRequestSpan(model, promptId): Span {
  const parentCtx = interactionContext.getStore();
  const ctx = parentCtx
    ? trace.setSpan(otelContext.active(), parentCtx.span)
    : otelContext.active();
  // ...
}

这段代码与 claude-code 的 startLLMRequestSpan 逻辑完全一致

核心修复路径:废弃 runtime 中的 withSpan('api.*') / withSpan('tool.*') 调用,改为调用 session-tracing 的 typed helpers。 不需要重写 session-tracing 层——它的 API 已经就绪。

需要新增的只有:

  • 增加 toolContext ALS仿 claude-code
  • 增加 blocked_on_userhook span 类型及 helper 函数
2. tool.blocked_on_user — 需要适配审批流差异
维度 claude-code qwen-code
审批位置 toolExecution.tstool span 内部 coreToolScheduler._schedule()tool span 之前
审批模式 同步等待 resolveHookPermissionDecision() 状态机驱动:validatingawaiting_approvalscheduledexecuting
span 覆盖范围 tool span 包含 blocked + execution tool span(withSpan) 只包含 executionexecuteSingleToolCall 开始)

关键差异: qwen-code 的 executeSingleToolCall 入口检查 toolCall.status !== 'scheduled' 才继续——也就是说调用到这里时审批已经完成。Tool span 的 withSpan 包不住审批等待。

适配方案(两种):

方案 A — 前移 tool span 起点(推荐):

startToolSpan 调用从 executeSingleToolCall 移到 _schedule 中审批检查之前,使 tool span 覆盖完整生命周期。在进入 awaiting_approval 状态时 startToolBlockedOnUserSpan,在审批完成(scheduled)时 endToolBlockedOnUserSpan

_schedule():
  startToolSpan(name)                         // ← 新增
    startToolBlockedOnUserSpan()              // ← 新增,进入 awaiting_approval 时
      [状态机等待]
    endToolBlockedOnUserSpan(decision)        // ← 新增,进入 scheduled 时
executeSingleToolCall():
    startToolExecutionSpan()                  // ← 接入已有 helper
      [hook + execute]
    endToolExecutionSpan()
  endToolSpan()                               // ← 需要在 finally 中

方案 B — 保持 tool span 位置不变,单独追踪审批:

_schedule 中独立创建 approval_wait span不作为 tool 的 child挂到 interaction 下。好处是改动更小,坏处是与 claude-code 模型不一致、trace 树可读性差。

建议采用方案 A,因为:

  • 与 claude-code 的 trace 树结构一致
  • trace 上一个 tool 节点就能看到"等了多久 + 执行了多久"
  • 状态机驱动的特性只影响 span start/end 的触发时机,不影响 parent-child 建模
3. hook span — 可直接复用
维度 claude-code qwen-code
hook 执行入口 executeHooks() in hooks.ts firePreToolUseHook/firePostToolUseHook via hookEventHandler.ts
现有记录方式 OTel span + Perfetto span HookCallEventQwenLogger (无 OTel)
parent toolContext ?? interactionContext

复用方案:

  1. session-tracing.ts 新增 startHookSpan / endHookSpanparent = toolContext ?? interactionContext,与 claude-code 一致)
  2. coreToolScheduler.tsexecuteSingleToolCallpre/post hook 调用前后分别 start/end hook span
  3. 保留现有 logHookCall 事件记录(两套并行,不互斥)

改动量低,不影响现有 hook 逻辑。

4. tool.execution — 已有 helper只需接线

qwen-code 的 startToolExecutionSpan(parentToolSpan) / endToolExecutionSpan(span, metadata) 已经完整实现,只需在 executeSingleToolCall 中调用:

// coreToolScheduler.ts executeSingleToolCall 内部
const toolSpan = startToolSpan(toolName, attrs);
// ... hook pre ...
const execSpan = startToolExecutionSpan(toolSpan);
try {
  // ... invocation.execute() ...
  endToolExecutionSpan(execSpan, { success: true });
} catch (e) {
  endToolExecutionSpan(execSpan, { success: false, error: e.message });
}
// ... hook post ...
endToolSpan(toolSpan);

注意qwen-code 的 startToolExecutionSpan 接收显式 parentToolSpan 参数,而 claude-code 的是从 toolContext ALS 隐式获取。这不影响功能,只是风格差异。如果引入 toolContext ALS可以统一改为隐式获取。

5. subagent trace tree — 双方都不完整,不建议直接复用
维度 claude-code qwen-code
OTel trace 传播 — subagent 的 interaction 是新 root — subagent 无显式 trace 传播
身份关联 Perfetto metadataagent process/thread+ teammateContextStorage ALS subagentNameContext ALS + SubagentExecutionEvent
并发隔离 OTel ALS 有泄漏风险(enterWith 是进程级,并发 subagent 会互覆盖) 同样的风险

claude-code 在 subagent OTel tracing 上自己也没解决好

  • interactionContext.enterWith() 是进程级的,并发 subagent 会覆盖彼此的 ALS 值
  • 真正的 agent 层级树只存在于 Perfetto一个 Anthropic 内部 feature-flagged 的系统),不在 OTel 中

建议:

  • 短期:沿用 qwen-code 现有的 subagentNameContext + 事件日志方案
  • 中期:在 subagent 启动时创建一个 subagent spanparent = 当前 toolContext并用 context.with() 而非 enterWith() 来隔离并发 subagent 的 OTel context
  • 这是需要独立设计的工作项,不建议直接照搬 claude-code
6. LLM request span — 路径明确

qwen-code 当前在 loggingContentGenerator.ts 中用 withSpan('api.generateContent', ...)startSpanWithContext('api.generateContentStream', ...)

改为调用 startLLMRequestSpan / endLLMRequestSpansession-tracing 层已有实现即可。streaming 场景需要注意:

  • startLLMRequestSpan 返回 Span 对象
  • 需要手动传入 endLLMRequestSpan(span, metadata) 终结
  • 这与 startSpanWithContext 的手动管理模式兼容

复用总结

改造项 可复用程度 改动量 优先级
统一 span 创建路径(废弃 runtime withSpan,用 session-tracing helpers 核心修复 — 解决 parent-child 断裂 中(~5 个调用点) P0
新增 toolContext ALS 直接照搬 claude-code 模式 session-tracing.ts 内部) P0
tool.blocked_on_user span 方案 A 需适配状态机 _schedule + executeSingleToolCall 协调) P1
tool.execution 接线 helper 已有,只需调用 executeSingleToolCall 内 3 行) P1
hook span 新增 helper + 调用点 P1
LLM request span 切换 替换 withSpan 为 typed helper 2 个调用点) P1
subagent trace tree 不建议直接复用 — 需独立设计 P2

推荐实施顺序

Phase 1 — 修复 trace 树结构 (P0)
├── 1a. session-tracing.ts 新增 toolContext ALS + blocked_on_user / hook span helpers
├── 1b. loggingContentGenerator.ts: withSpan → startLLMRequestSpan/endLLMRequestSpan
└── 1c. coreToolScheduler.ts: withSpan → startToolSpan/endToolSpan

Phase 2 — 补齐 workflow span (P1)
├── 2a. coreToolScheduler._schedule: blocked_on_user span 接入
├── 2b. coreToolScheduler.executeSingleToolCall: tool.execution span 接入
└── 2c. hook pre/post 调用处: hook span 接入

Phase 3 — Subagent trace tree (P2)
├── 3a. 设计 context.with() 隔离方案(替代 enterWith
├── 3b. subagent 启动时创建 subagent root span
└── 3c. 并发 subagent 场景验证