feat(telemetry): Phase 4a — TTFT capture + GenAI semconv dual-emit (#3731) (#4417)
Some checks are pending
Qwen Code CI / Classify PR (push) Waiting to run
Qwen Code CI / Lint (push) Blocked by required conditions
Qwen Code CI / Test (macos-latest, Node 22.x) (push) Blocked by required conditions
Qwen Code CI / Test (ubuntu-latest, Node 22.x) (push) Blocked by required conditions
Qwen Code CI / Test (windows-latest, Node 22.x) (push) Blocked by required conditions
Qwen Code CI / Post Coverage Comment (push) Blocked by required conditions
Qwen Code CI / CodeQL (push) Blocked by required conditions
E2E Tests / E2E Test (Linux) - sandbox:docker (push) Waiting to run
E2E Tests / E2E Test (Linux) - sandbox:none (push) Waiting to run
E2E Tests / E2E Test - macOS (push) Waiting to run

This commit is contained in:
jinye 2026-05-22 10:54:11 +08:00 committed by GitHub
parent 48b0a8bfce
commit fd75f77e19
No known key found for this signature in database
GPG key ID: B5690EEEBB952194
7 changed files with 1190 additions and 2 deletions

View file

@ -0,0 +1,538 @@
# LLM Request Timing Decomposition Design (P3 Phase 4)
> Issue #3731 — Phase 4 of hierarchical session tracing. Adds time-to-first-token, request-setup duration, sampling duration, and per-attempt retry telemetry to the `qwen-code.llm_request` span so operators can answer "why was this LLM call slow?" without guessing.
>
> Builds on Phase 1 (#4126), Phase 1.5 (#4302), Phase 2 (#4321). Independent of Phase 3 (#4410, in review) — recommended to land Phase 3 first so Phase 4's per-attempt fields aggregate cleanly under subagent subtrees.
## Problem
`qwen-code.llm_request` spans today carry only `model`, `prompt_id`, `input_tokens`, `output_tokens`, `success`, `error`, `duration_ms`. Operators reading a single trace cannot tell:
1. **How much of `duration_ms` was the model thinking vs the network setup.** A 12-second `duration_ms` could be 11s of retries followed by 1s of fast generation, or 100ms of setup followed by 12s of slow streaming — the trace doesn't say.
2. **When the user saw the first token.** TTFT (time-to-first-token) is the standard latency SLO for chat UIs. We can't compute it; we don't capture it.
3. **What happened during retries.** `retryWithBackoff` (`utils/retry.ts:285`) only calls `debugLogger.warn` — no OTel event, no span attribute. The 4 LLM call sites that go through it (`client.ts:1540`, `baseLlmClient.ts:193,282`, `geminiChat.ts:1039`) have zero retry visibility in traces or metrics. `ContentRetryEvent` exists for content-recovery retries inside `geminiChat.ts:806,830` but not for the more common rate-limit / 5xx retries.
4. **That `api.request.breakdown` is dead code.** The metric is defined at `metrics.ts:242-251` with 4 `ApiRequestPhase` values, exported from `index.ts:117`, tested in `metrics.test.ts:646-675` — but `recordApiRequestBreakdown()` has zero callers in production code. The metric infrastructure is paid for; the data flow was never connected.
These gaps make `qwen-code.llm_request` the least informative span in the trace tree. Tool spans (#4126/#4321) and subagent spans (#4410) both surface lifecycle phases; LLM spans collapse the entire request into one opaque duration.
## Existing surface (no change)
| Component | Location | Why we don't touch it |
| ------------------------------------------------------------ | ---------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| LLM request span lifecycle | `session-tracing.ts` `startLLMRequestSpan` / `endLLMRequestSpan` | Phase 1 (#4126) established the helpers. We extend the metadata interface, don't restructure |
| Active span propagation into provider generators | `loggingContentGenerator.ts:213,287` | Phase 1 (#4126) replaced `withSpan('api.*')` with native helpers; the active context already reaches the stream wrapper |
| `ContentRetryEvent` schema + consumers | `types.ts:626`, `qwen-logger.ts:947`, `loggers.ts:717` | Existing event keeps its shape and downstreams; we add a sibling event class for the `retryWithBackoff` path |
| `LogToSpanProcessor` log-bridge spans | `log-to-span-processor.ts` | ContentRetryEvent's existing bridge continues to nest under the active LLM span. Phase 4 does not change this |
| `ApiRequestPhase` enum | `metrics.ts:330-334` | Public surface (4 values). We populate 3 of the 4 from production code; leave the enum unchanged for backward compatibility |
| Per-provider chunk normalization → `GenerateContentResponse` | `loggingContentGenerator.ts:286-393` | Each provider already normalizes to Google's `GenerateContentResponse` shape before LoggingContentGenerator sees the stream. TTFT detection runs centrally over this normalized shape; no per-provider code |
| `retryWithBackoff` general-purpose retry | `utils/retry.ts:140` | Used by both LLM callers and non-LLM (`channels/weixin/src/api.ts`). We extend with an opt-in `onRetry` callback rather than hard-coupling to LLM telemetry |
| Non-streaming `generateContent` | `loggingContentGenerator.ts:212` | TTFT is not meaningful for non-streaming; the new fields stay `undefined`. Span lifecycle and existing attrs unchanged |
## Out-of-scope (deferred)
- **SDK-level retries** (openai SDK `maxRetries=3`, google-genai SDK internal retries). These happen entirely inside the third-party SDK; observing them requires disabling SDK retries and reimplementing in `retryWithBackoff`. Separate decision, not Phase 4.
- **Per-token streaming metrics** (inter-token latency, per-chunk size). Useful for inference-engine perf debugging, not for the user-perceived latency questions Phase 4 targets.
- **Separate TTFT for reasoning/thinking blocks.** "First token" includes thinking content (see D1). A future enhancement could split `ttft_to_reasoning_ms` vs `ttft_to_answer_ms`, but only after we know there's demand.
- **Sampling phase as a dedicated child span.** Computable from `duration_ms - ttft_ms - request_setup_ms`; child span adds nothing for OTel-only backends (claude-code uses one for Perfetto only). Stored as a span attribute instead — see D6.
- **Persistent retry mode (`QWEN_CODE_UNATTENDED_RETRY`) event-level rate limiting.** A single LLM request can produce 50+ `ContentRetryEvent` / `ApiRetryEvent` records under persistent retry. Capping emission is a follow-up — Phase 4 emits all events; if production volumes prove unbearable, add a per-span emission cap with a "+N more attempts (truncated)" summary event in a follow-up PR.
- **`TOKEN_PROCESSING` breakdown phase.** Enum value exists but qwen-code has no real post-stream local processing worth measuring (<10ms typical). Skipped in production callers; enum value retained for future use or for callers we don't control.
- **Migrating `ContentRetryEvent` onto LLM span as span events.** Same reasoning as Phase 3's `subagent_execution` LogRecord: existing consumers (qwen-logger RUM, future metrics) are tightly coupled to the LogRecord. Bridge-span coverage is good enough.
## References (decision evidence)
| Source | Key takeaway |
| --------------------------------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| claude-code (Anthropic) `claude.ts:1762, 1789, 1982, 2882` | TTFT captured as `Date.now() - start` on `message_start` SSE event; `start` reset per retry attempt. `requestSetupMs = start - startIncludingRetries`. `attemptStartTimes` array preserved per attempt. Confirms feasibility of the approach; their TTFT semantic is "first stream event" (we diverge to "first content" — see D1) |
| claude-code `perfettoTracing.ts:549-671` | Renders Request Setup → Attempt N (retry) → First Token → Sampling as nested B/E pairs. Demonstrates the visual decomposition; qwen-code does the same decomposition with OTel attributes since we have no Perfetto |
| claude-code `sessionTracing.ts:447` | Only `ttft_ms` makes it onto the OTel span (not `requestSetupMs`, not `samplingMs`, not per-attempt timing). We deliberately put more on the span — claude-code has Perfetto for visualization; we don't |
| opencode (sst/opencode) `session/llm.ts`, `route/client.ts` | No TTFT measurement. Single `LLM.run` Effect span covers everything. Validates that the gap exists across competing tools; not a reference for what to do |
| [OTel GenAI Semantic Conventions](https://opentelemetry.io/docs/specs/semconv/gen-ai/) (status: Development / Experimental) | `gen_ai.usage.input_tokens` (Stable), `gen_ai.usage.output_tokens` (Stable), `gen_ai.usage.cached_tokens` (Experimental), `gen_ai.request.model` (Stable), `gen_ai.server.time_to_first_token` (Experimental, seconds as double). Dual-emit pattern follows #4410 precedent |
| [OTel Trace Spec — Span Events](https://opentelemetry.io/docs/specs/otel/trace/api/#add-events) | "Events SHOULD NOT be used to record information that's better captured as Span Attributes." Confirms per-attempt info belongs on the LLM span attributes + log-bridge spans, not as Span Events on the parent |
| Phase 3 design doc (`telemetry-subagent-spans-design.md`) | Established the dual-emit pattern (`qwen-code.subagent.id` + `gen_ai.agent.id`) and the "private name is authoritative" rule. Phase 4 follows the same convention for TTFT and token fields |
## Design — seven decisions, each justified
### D1 — TTFT semantic: "first chunk containing user-visible content"
TTFT measures wall-clock from the **successful attempt's** request dispatch to the **first stream chunk that contains user-visible output**. A chunk is "user-visible" if any normalized `Part` in `candidates[0].content.parts` is one of:
- `text` with non-empty string
- `functionCall` (tool use)
- `inlineData` (image, binary)
- `executableCode`
- `thought` / reasoning content (whatever the provider surfaces — Gemini's `thought`, Anthropic's `<thinking>` block, OpenAI o1 reasoning chunk)
Chunks containing only `role` metadata or only `usageMetadata` (final usage-summary chunk) do not trigger TTFT.
**Why not "first stream event of any kind" (claude-code's choice)**: claude-code measures TTFT at `message_start`, an Anthropic-specific metadata event that fires 50300ms before any actual content. Their internal `headlessProfiler.ts` already separates `time_to_first_response_ms` for the "user saw something" semantic, acknowledging the distinction. qwen-code spans multiple providers (Anthropic, OpenAI, Gemini, Qwen) — picking the metadata-event semantic means TTFT for Anthropic is fundamentally different from TTFT for OpenAI (which has no analogous metadata-only first event). The user-visible-content semantic is uniform across all 4 providers and matches "time-to-first-token" literally.
**Why include `thought` / reasoning**: from the operator's perspective, reasoning chunks are still "the model produced output." Excluding them would understate TTFT for reasoning-heavy models (o1, Qwen thinking variants). Future split into `ttft_to_reasoning_ms` vs `ttft_to_answer_ms` is possible; not Phase 4.
**Why include tool-call-only chunks**: agent tool-decision LLM calls (one `tool_use`, no text) are common in qwen-code's workflow. Excluding them means TTFT is undefined for these requests. The `functionCall` Part is meaningful output.
**Cross-product comparison note**: design doc explicitly states `qwen-code.ttft_ms ≈ claude-code.time_to_first_response_ms ≠ claude-code.ttft_ms`. Operators comparing across products should align on the user-visible-content semantic.
### D2 — TTFT measurement site: method-local variables in `LoggingContentGenerator.generateContentStream`
The first-chunk detection runs inside the existing stream wrapper at `loggingContentGenerator.ts:393` (`async function* processStreamGenerator`). Per-call variables (`start`, `ttftMs`) live in the method's closure; **never as instance fields**.
**Why never instance fields**: `LoggingContentGenerator` is instantiated **once per `ContentGenerator`** (`contentGenerator.ts:377`) and shared across all concurrent `generateContentStream` calls — subagent fan-out, warmup queries, side-queries from `geminiChat`. An instance field would be overwritten across concurrent calls, producing nonsense TTFT for one of every two interleaved requests.
**Why not AsyncLocalStorage**: ALS would work but adds a context-management layer for a piece of state that doesn't need to escape the method. Method-local is simpler, zero overhead, zero risk of leakage.
```ts
// loggingContentGenerator.ts — inside generateContentStream
const attemptStart = Date.now(); // per-call local
const requestEntryTime = Date.now(); // also per-call local — see D3
let ttftMs: number | undefined;
const attemptStartTimes: number[] = [attemptStart];
let retryTotalDelayMs = 0;
let finalAttempt = 1;
// stream wrapper inspects each chunk; first one matching hasUserVisibleContent:
// ttftMs = Date.now() - attemptStart;
```
`hasUserVisibleContent(chunk)` is a small standalone helper colocated with the wrapper, exported for tests:
```ts
function hasUserVisibleContent(chunk: GenerateContentResponse): boolean {
const parts = chunk.candidates?.[0]?.content?.parts;
if (!parts?.length) return false;
return parts.some(
(p) =>
(typeof p.text === 'string' && p.text.length > 0) ||
p.functionCall !== undefined ||
p.inlineData !== undefined ||
p.executableCode !== undefined ||
// @ts-expect-error — `thought` is not on all SDK versions but providers emit it
p.thought !== undefined,
);
}
```
### D3 — `request_setup_ms` computation: entry-time vs successful-attempt-start
`request_setup_ms` measures wall-clock from `generateContentStream`/`generateContent` entry to the **start of the successful attempt** — including all failed retries, backoff sleeps, and any pre-retry preparation work.
```ts
request_setup_ms = attemptStart_of_successful_attempt - requestEntryTime;
```
When `attempt === 1` and no retries happened, `request_setup_ms` is small (just SDK setup). When retries occurred, it captures the entire retry-budget overhead.
**Putting it on the OTel span (diverges from claude-code, which puts it only on Perfetto)**: rationale at three levels:
1. **No Perfetto** — qwen-code has no out-of-band visualization layer. OTel attributes are the only channel.
2. **Single-trace debug** — operator sees `duration_ms=12000, request_setup_ms=11500, ttft_ms=200, sampling_ms=300` → instantly diagnoses "retries ate 11.5s, model itself was fast." Computing `request_setup_ms` from other fields requires also exposing `sampling_ms`, which we do anyway (D6).
3. **Negligible cost** — 1 INT64 attribute. Same order of magnitude as the existing `input_tokens`, `output_tokens` attributes. Backend ingest cost is not material.
### D4 — Retry telemetry: `onRetry` callback option on `retryWithBackoff` + new `ApiRetryEvent`
`retryWithBackoff` currently calls `logRetryAttempt` (`retry.ts:343`) which only writes to `debugLogger.warn`. We extend the `RetryOptions` interface with an opt-in callback:
```ts
// utils/retry.ts
interface RetryOptions<T> {
// ... existing fields ...
/**
* Optional. Called once per failed attempt, before the backoff sleep.
* Receives the attempt number (1-based), the error, and the delay before
* the next attempt. Use this to emit telemetry events for LLM call sites;
* leave undefined for non-LLM callers (e.g., channels/weixin) so they
* stay silent in LLM-specific telemetry channels.
*/
onRetry?: (info: RetryAttemptInfo) => void;
}
interface RetryAttemptInfo {
attempt: number; // 1-based, matches debugLogger output
error: unknown;
errorStatus?: number;
delayMs: number; // backoff delay before next attempt
}
```
The 4 LLM call sites (`client.ts:1540`, `baseLlmClient.ts:193,282`, `geminiChat.ts:1039`) register a callback that emits a new `ApiRetryEvent`:
```ts
// types.ts — new event class, sibling to ContentRetryEvent
export class ApiRetryEvent implements BaseTelemetryEvent {
'event.name': typeof EVENT_API_RETRY;
'event.timestamp': string;
model: string;
prompt_id?: string;
attempt_number: number; // 1-based
error_type: string;
error_message: string; // truncated to 256 chars
status_code?: number;
retry_delay_ms: number;
// ... duration_ms set to retry_delay_ms so LogToSpanProcessor renders
// a bridge span of meaningful width
duration_ms: number;
}
```
**Why a new event class, not extending `ContentRetryEvent`**:
- `ContentRetryEvent` has 2 downstream consumers (qwen-logger, log-record export). Changing its payload risks breaking them.
- The naming "content retry" semantically refers to content-recovery retries (invalid stream, schema repair) — extending it to cover rate-limit retries would muddy the schema.
- New event is additive; no consumer surprise.
**Why not embed callback IN `retry.ts`**: `retry.ts` is called by `channels/weixin/src/api.ts` too (microsoft messaging API retries). Hard-coupling LLM telemetry inside retry.ts would emit `ApiRetryEvent` for non-LLM retries. The `onRetry` callback is opt-in per caller — LLM callers opt in, weixin caller doesn't.
**ContentRetryEvent coexistence**: ContentRetryEvent stays as-is for content-recovery retries inside `geminiChat.ts:806,830`. ApiRetryEvent covers the rate-limit / 5xx retries from `retryWithBackoff`. The two events fire from different layers and never duplicate. Existing log-bridge behavior for both events is preserved via `LogToSpanProcessor` — both events nest under the active LLM span automatically (Phase 1 wiring ensures the LLM span is active during retries).
**Persistent retry mode (`QWEN_CODE_UNATTENDED_RETRY`)**: a single 429-loop request may emit 50+ events. Out of scope to rate-limit emission in Phase 4 — if production volumes prove unbearable, add a per-span cap with summary event in a follow-up PR. The aggregated `attempt` and `retry_total_delay_ms` on the parent LLM span (D5) remain accurate regardless of event cap.
### D5 — Parent LLM span aggregation: scalar attributes only (no map-typed attrs)
OTel span attributes are scalars (`string | number | boolean | array of these`). Map-typed attributes (like `retry_count_by_status: {429:2, 503:1}`) require JSON serialization and are awkward to query. Skip them.
| Attribute | Type | Semantic |
| -------------------------- | ------ | ----------------------------------------------------------------------------------- |
| `attempt` | int | 1-based final attempt count (`attemptStartTimes.length`) |
| `retry_total_delay_ms` | int | Sum of all `delayMs` reported by `onRetry`; 0 if no retries |
| `ttft_ms` | int | TTFT per D1; undefined for non-streaming or aborted-before-first-chunk requests |
| `request_setup_ms` | int | Per D3 |
| `sampling_ms` | int | Per D6 |
| `output_tokens_per_second` | double | Derived; `output_tokens / (sampling_ms / 1000)`; undefined when `sampling_ms === 0` |
Per-attempt status-code distribution (e.g., "2 of the 3 attempts were 429s") is queryable from log-bridge spans of `ApiRetryEvent` records. No need to duplicate it as a flattened attribute on the parent.
**Why `sampling_ms` and `output_tokens_per_second` on the span**: derivable but cumbersome to compute in backend queries when summing across many spans. Same cost-benefit as `request_setup_ms` (D3).
### D6 — Activate `recordApiRequestBreakdown()` for 3 of 4 phases
In `endLLMRequestSpan` (or the wrapper that calls it), after computing TTFT/setup/sampling, emit:
```ts
recordApiRequestBreakdown(config, model, [
{ phase: ApiRequestPhase.REQUEST_PREPARATION, durationMs: requestSetupMs },
{ phase: ApiRequestPhase.NETWORK_LATENCY, durationMs: ttftMs }, // ttftMs = network + first-token-generation
{ phase: ApiRequestPhase.RESPONSE_PROCESSING, durationMs: samplingMs },
]);
```
**Why skip `TOKEN_PROCESSING`**: qwen-code does stream chunk processing inline (consolidation happens in the wrapper at `loggingContentGenerator.ts:644`); the post-stream wrap-up phase is <10ms and not architecturally distinct. Filling it with a meaningless value pollutes the histogram. Leaving the enum value unused is safe `apiRequestBreakdownHistogram.record(value, {model, phase})` is just a histogram with `phase` as a label; missing labels are simply absent in queries.
**Why not redefine `NETWORK_LATENCY`**: the spec name is slightly misleading (it's network + first-token-generation, not pure network latency), but:
- The enum is part of `metrics.ts:330-334` which is exported from `index.ts:117` and tested.
- Backend dashboards may already reference these phase names.
- Renaming or adding a new phase would be a breaking change for trivially marginal accuracy improvement.
Document the semantic in the design doc; leave the enum unchanged.
**Why on the span path, not parallel**: keeps `recordApiRequestBreakdown` colocated with span attribute writes — single gated emission point (see D7 idempotency), single ordering invariant.
### D7 — `endLLMRequestSpan` idempotency: metric recording gated on existing double-end guard
Phase 1.5 (#4302) established that `endLLMRequestSpan` may be called twice (abort path + error path collision). The existing guard at `session-tracing.ts:~470` (`if (!activeSpans.has(...)) return;`) prevents double `span.end()`. Phase 4 metric recording (D6) **must sit inside the same guarded block**, before `span.end()`:
```ts
// session-tracing.ts — endLLMRequestSpan
const llmCtx = activeSpans.get(spanRef);
if (!llmCtx) return; // already ended — double-end guard
activeSpans.delete(spanRef); // claim the end
// ... compute duration, set attributes ...
if (metadata) {
recordApiRequestBreakdown(config, llmCtx.attributes.model, [...]); // NEW — gated
recordTokenUsageMetrics(...); // existing
}
span.end();
```
This guarantees metric is recorded **exactly once** per LLM request, matching the span lifecycle.
**Why not record in `loggingContentGenerator`**: it doesn't see the abort path. Recording at the span lifecycle layer ensures every LLM request that opens a span produces exactly one breakdown sample, regardless of success/failure/abort.
### D8 — GenAI semantic conventions dual-emit (private name authoritative)
Each Phase 4 attribute that corresponds to an OTel GenAI semconv attribute is written twice on the span:
| qwen-code private (authoritative) | GenAI semconv (compat layer) | Unit conversion | Spec status |
| ------------------------------------------ | ----------------------------------------------- | --------------- | ------------ |
| `ttft_ms` (ms, int) | `gen_ai.server.time_to_first_token` (s, double) | `ttftMs / 1000` | Experimental |
| `input_tokens` (int) | `gen_ai.usage.input_tokens` (int) | identical | Stable |
| `output_tokens` (int) | `gen_ai.usage.output_tokens` (int) | identical | Stable |
| `cached_input_tokens` (int) (when present) | `gen_ai.usage.cached_tokens` (int) | identical | Experimental |
| `qwen-code.model` (string) | `gen_ai.request.model` (string) | identical | Stable |
**Existing token attribute names** on the LLM span (set in `endLLMRequestSpan` before Phase 4): qwen-code uses bare `input_tokens` and `output_tokens` already. Phase 4 adds the `gen_ai.usage.*` siblings to match #4410's pattern. The bare names stay; **don't rename**.
Fields with no GenAI semconv equivalent — `request_setup_ms`, `sampling_ms`, `retry_total_delay_ms`, `attempt`, `output_tokens_per_second` — are emitted only under the qwen-code namespace.
**Why "private authoritative, semconv as compat"**:
- Internal dashboards, SLOs, debugLogger output, qwen-logger RUM, ARMS queries — all reference `ttft_ms` etc. Treating those as canonical avoids a flag-day migration.
- The Experimental GenAI semconv may rename `gen_ai.server.time_to_first_token` before reaching Stable. If/when it does, we update the semconv emission; the qwen-code names don't move.
- Future spec-aware backends (Datadog AI views, Honeycomb AI, ARMS GenAI dashboards) auto-pick up the `gen_ai.*` attributes without our involvement.
**Why dual-emit unit conversion** (ms ↔ seconds): GenAI semconv chose seconds-as-double for latency; qwen-code chose ms-as-int (matches `duration_ms` already on the span). Both representations have value; the conversion is cheap.
## Helper API (additive to `session-tracing.ts`)
```ts
// session-tracing.ts — LLMRequestMetadata interface extended (additive)
export interface LLMRequestMetadata {
// ... existing fields: inputTokens, outputTokens, cachedInputTokens, success, error, ...
/** Time from successful attempt start to first user-visible content chunk (ms). Undefined for non-streaming or aborted-before-first-chunk requests. */
ttftMs?: number;
/** Time from generateContent entry to start of successful attempt (ms). Includes all failed retries + backoff. */
requestSetupMs?: number;
/** Final attempt number (1-based). 1 = no retries. */
attempt?: number;
/** Sum of all backoff delays before the successful attempt (ms). */
retryTotalDelayMs?: number;
}
// No new exported helpers — Phase 4 reuses startLLMRequestSpan / endLLMRequestSpan with extended metadata.
```
```ts
// types.ts — new event class
export class ApiRetryEvent implements BaseTelemetryEvent {
'event.name': typeof EVENT_API_RETRY = EVENT_API_RETRY;
'event.timestamp': string;
model: string;
prompt_id?: string;
attempt_number: number;
error_type: string;
error_message: string;
status_code?: number;
retry_delay_ms: number;
duration_ms: number; // = retry_delay_ms, drives LogToSpanProcessor bridge span width
constructor(opts: { model: string; promptId?: string; attemptNumber: number; error: unknown; statusCode?: number; retryDelayMs: number }) { ... }
}
// constants.ts
export const EVENT_API_RETRY = 'qwen-code.api_retry';
// loggers.ts
export function logApiRetry(config: Config, event: ApiRetryEvent): void { ... }
```
```ts
// utils/retry.ts — RetryOptions extension
interface RetryOptions<T> {
// ... existing ...
onRetry?: (info: RetryAttemptInfo) => void;
}
interface RetryAttemptInfo {
attempt: number;
error: unknown;
errorStatus?: number;
delayMs: number;
}
// Inside retryWithBackoff, where logRetryAttempt is called today:
options.onRetry?.({ attempt, error, errorStatus, delayMs: actualDelay });
logRetryAttempt(attempt, error, errorStatus); // existing debugLogger call unchanged
```
## Lifecycle wiring
### Streaming path (the common case)
```ts
// loggingContentGenerator.ts:283 — generateContentStream
async generateContentStream(req, userPromptId): Promise<AsyncGenerator<GenerateContentResponse>> {
const requestEntryTime = Date.now();
let attemptStart = requestEntryTime;
const attemptStartTimes: number[] = [attemptStart];
let retryTotalDelayMs = 0;
let finalAttempt = 1;
// Use existing startLLMRequestSpan (Phase 1)
// Pass onRetry callback to whatever retry layer is in use:
const onRetry: RetryAttemptInfo & { invoke: ... } = (info) => {
finalAttempt = info.attempt + 1; // we're about to start attempt N+1
retryTotalDelayMs += info.delayMs;
attemptStart = Date.now() + info.delayMs; // approximate; actual reset is at top of next attempt
attemptStartTimes.push(attemptStart);
// emit ApiRetryEvent
logApiRetry(this.config, new ApiRetryEvent({
model: req.model,
promptId: userPromptId,
attemptNumber: info.attempt,
error: info.error,
statusCode: info.errorStatus,
retryDelayMs: info.delayMs,
}));
};
// stream wrapper detects first user-visible chunk:
return this.processStreamGenerator(stream, ..., {
onFirstUserVisibleChunk: (now) => {
ttftMs = now - attemptStart;
},
});
}
```
At span end (already in Phase 1's `endLLMRequestSpan` flow), include the new fields in `LLMRequestMetadata`:
```ts
endLLMRequestSpan(llmSpan, {
success: true,
inputTokens,
outputTokens,
cachedInputTokens,
ttftMs,
requestSetupMs: attemptStart - requestEntryTime,
attempt: finalAttempt,
retryTotalDelayMs,
});
```
### Non-streaming path
`generateContent` (`loggingContentGenerator.ts:212`) does not produce streaming chunks. TTFT is `undefined`; `request_setup_ms` is still meaningful (captures retry overhead). The breakdown metric records 2 phases (REQUEST_PREPARATION + RESPONSE_PROCESSING where `RESPONSE_PROCESSING = duration_ms - request_setup_ms`), not 3.
### Retry layer integration (4 sites)
Each of the 4 LLM `retryWithBackoff` call sites adds `onRetry`:
```ts
// client.ts:1540 (similar at baseLlmClient.ts:193, 282, geminiChat.ts:1039)
const result = await retryWithBackoff(apiCall, {
...existingOptions,
onRetry: (info) => {
logApiRetry(
this.config,
new ApiRetryEvent({
model,
promptId: userPromptId,
attemptNumber: info.attempt,
error: info.error,
statusCode: info.errorStatus,
retryDelayMs: info.delayMs,
}),
);
// also feed back into LoggingContentGenerator's local retry accumulator
// (when in scope — for callers that don't go through LoggingContentGenerator,
// the LLM span still gets `attempt` and `retry_total_delay_ms` via the
// metadata path because endLLMRequestSpan is called at the LLM layer)
},
});
```
The non-LLM caller (`channels/weixin/src/api.ts`) **does not register `onRetry`** — no `ApiRetryEvent` is emitted for its retries, matching today's behavior.
## Concurrent safety — the headline guarantee
`LoggingContentGenerator` instance is shared (one per `ContentGenerator`, `contentGenerator.ts:377`). Three concurrent `generateContentStream` calls (e.g., 3 subagents fan out via `coreToolScheduler.runConcurrently`) execute three independent closures of `generateContentStream`:
```
call_A: attemptStart_A, ttftMs_A, ... (closure)
call_B: attemptStart_B, ttftMs_B, ... (closure)
call_C: attemptStart_C, ttftMs_C, ... (closure)
```
Per-call locals never overlap. Stream chunks are detected against the local `attemptStart` of each call. Span attributes are set at each call's own `endLLMRequestSpan`.
`AsyncLocalStorageContextManager` (registered by NodeSDK at `sdk.ts:273`) already ensures the active OTel context — and thus the parent span passed to `startLLMRequestSpan` — is correct per fiber.
## Files to change
| File | Change | LOC est |
| -------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------- |
| `packages/core/src/telemetry/constants.ts` | Add `EVENT_API_RETRY` constant | +2 |
| `packages/core/src/telemetry/types.ts` | Add `ApiRetryEvent` class + union member | +40 |
| `packages/core/src/telemetry/loggers.ts` | Add `logApiRetry()` function | +20 |
| `packages/core/src/telemetry/qwen-logger/qwen-logger.ts` | Add `logApiRetryEvent()` for RUM downstream consistency | +20 |
| `packages/core/src/telemetry/session-tracing.ts` | Extend `LLMRequestMetadata` (ttftMs, requestSetupMs, attempt, retryTotalDelayMs); extend `endLLMRequestSpan` to set new attrs + breakdown metric + dual-emit gen_ai.\* | +60 |
| `packages/core/src/telemetry/metrics.ts` | Wire `recordApiRequestBreakdown` callsite inside `endLLMRequestSpan` (no change to the existing recorder) | 0 |
| `packages/core/src/utils/retry.ts` | Add `onRetry?: (info: RetryAttemptInfo) => void` to RetryOptions; export `RetryAttemptInfo`; invoke callback in the existing logRetryAttempt site | +25 |
| `packages/core/src/core/loggingContentGenerator/loggingContentGenerator.ts` | TTFT capture: method-local accumulators + `hasUserVisibleContent` helper + first-chunk detection in stream wrapper; pass new metadata to `endLLMRequestSpan` | +80 |
| `packages/core/src/core/client.ts` | Wire `onRetry` callback at `retryWithBackoff` call site (`client.ts:1540`) | +15 |
| `packages/core/src/core/baseLlmClient.ts` | Wire `onRetry` callback at 2 `retryWithBackoff` call sites | +25 |
| `packages/core/src/core/geminiChat.ts` | Wire `onRetry` callback at `retryWithBackoff` call site (`geminiChat.ts:1039`) | +15 |
| `packages/core/src/telemetry/session-tracing.test.ts` | `endLLMRequestSpan` sets ttft_ms / request_setup_ms / attempt / retry_total_delay_ms / sampling_ms / output_tokens_per_second + gen_ai dual-emit + breakdown metric (each phase) + idempotent end | +120 |
| `packages/core/src/core/loggingContentGenerator/loggingContentGenerator.test.ts` | `hasUserVisibleContent` (text / functionCall / inlineData / executableCode / thought / role-only / usage-only); concurrent calls don't cross-contaminate; TTFT undefined when aborted before first chunk; TTFT undefined on non-streaming | +100 |
| `packages/core/src/utils/retry.test.ts` | `onRetry` invoked per failed attempt with correct `attempt`, `delayMs`, `error`, `errorStatus`; absence of `onRetry` is silent (no telemetry emitted) | +50 |
| `packages/core/src/telemetry/loggers.test.ts` | `logApiRetry` emits LogRecord with expected payload; bridges through LogToSpanProcessor to nested span under active LLM span | +40 |
Total: 14 files, ~610 LOC. Larger than Phase 2 (#4321) but comparable to Phase 3 (#4410) and justified by the breadth of integration (4 retry sites + telemetry plumbing + streaming wrapper).
If review pushes back on size: split into **Phase 4a + 4b + 4c**:
- **4a** (~200 LOC): TTFT capture + extended `LLMRequestMetadata` + dual-emit. Self-contained value (TTFT visibility from day one).
- **4b** (~250 LOC): `onRetry` callback + `ApiRetryEvent` + 4 caller wiring. **Independently a bug fix** for the `retryWithBackoff` telemetry gap.
- **4c** (~160 LOC): `recordApiRequestBreakdown` activation + parent span aggregation attrs (`attempt`, `retry_total_delay_ms`, `sampling_ms`, `output_tokens_per_second`). Depends on 4a + 4b.
## Testing strategy
| Test | What it proves |
| -------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------- |
| `hasUserVisibleContent` returns true for text/functionCall/inlineData/executableCode/thought | D1 semantics across part types |
| `hasUserVisibleContent` returns false for role-only and usage-only chunks | D1 negative cases |
| streaming: TTFT measured from attempt start to first user-visible chunk | End-to-end TTFT detection |
| streaming: TTFT undefined if stream aborts before any user-visible chunk | Edge case |
| streaming: TTFT computed from final attempt's start (not first attempt) | D3 — TTFT reset on retry |
| non-streaming: TTFT remains undefined | S3 decision |
| concurrent `generateContentStream` calls don't cross-contaminate TTFT | D2 — method-local guarantee |
| `endLLMRequestSpan` sets all Phase 4 attrs (ttft_ms, request_setup_ms, sampling_ms, attempt, retry_total_delay_ms, output_tokens_per_second) | Attribute presence |
| `endLLMRequestSpan` dual-emits gen_ai.server.time_to_first_token + gen_ai.usage.\* + gen_ai.request.model | D8 dual-emit |
| `endLLMRequestSpan` records breakdown metric with 3 phases for streaming, 2 for non-streaming | D6 |
| `endLLMRequestSpan` called twice: metric recorded exactly once, attrs not re-set | D7 idempotency |
| `retryWithBackoff` with `onRetry`: callback invoked per failed attempt with correct args | D4 callback contract |
| `retryWithBackoff` without `onRetry`: no telemetry emitted (silent for non-LLM callers) | P2 — channels/weixin scope protection |
| `client.ts` / `baseLlmClient.ts` / `geminiChat.ts` retry callsites emit `ApiRetryEvent` on retry | Integration of D4 at 4 sites |
| `ApiRetryEvent` LogRecord bridges via LogToSpanProcessor to a child span under active LLM span | Trace tree correctness |
| LLM span `attempt` field correctly reflects final attempt number under retries | D5 aggregation |
| LLM span `retry_total_delay_ms` correctly sums onRetry delays | D5 aggregation |
| `output_tokens_per_second` undefined when `sampling_ms === 0` (no streaming) | Avoid divide-by-zero |
## Edge cases
| Case | Handling |
| ----------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| Stream aborts before any chunk arrives | `ttftMs = undefined`, `sampling_ms = undefined`, `output_tokens_per_second = undefined`. `attempt`, `request_setup_ms` still set. `success = false` |
| Stream aborts after first chunk | `ttftMs` set; `sampling_ms` = `duration_ms - ttftMs - request_setup_ms`; reflects partial response time. `success = false` |
| Retry succeeds on attempt 1 (no retries) | `attempt = 1`, `retry_total_delay_ms = 0`, no `ApiRetryEvent` emitted, breakdown metric records `request_setup_ms` close to 0 |
| Persistent retry mode 50+ attempts | 50+ `ApiRetryEvent` records emitted (out-of-scope cap deferred); LLM span `attempt = 51`, `retry_total_delay_ms = sum of all delays`. Operator sees aggregated view on span; full per-attempt detail in log-bridge spans |
| Non-LLM `retryWithBackoff` caller (channels/weixin) | No `onRetry` registered; only existing `debugLogger.warn` fires. No `ApiRetryEvent`; no breakdown metric (caller isn't an LLM site) |
| `endLLMRequestSpan` called twice (abort + error race) | Phase 1.5 guard at `activeSpans.delete()` returns early on second call; `recordApiRequestBreakdown` is inside the guard, recorded exactly once |
| Anthropic `message_start` chunk arrives before content | `hasUserVisibleContent` returns false for it (no parts with text/functionCall/etc.); TTFT not triggered until subsequent `content_block_delta` chunk |
| OpenAI first chunk with empty `delta.content` but `role` only | `hasUserVisibleContent` returns false; TTFT not triggered until first chunk with non-empty delta |
| Tool-call-only response (no text) | First chunk with `functionCall` Part triggers TTFT; `output_tokens_per_second` computed against tool-call token count |
| Concurrent subagents (3 calls in flight) | Each call's closure has its own `attemptStart`, `ttftMs`, `attemptStartTimes`. Per-call span receives its own metadata at `endLLMRequestSpan`. No interleaving (D2) |
| SDK-level retries inside openai-sdk (`maxRetries=3`) | Invisible to qwen-code telemetry — happens entirely inside SDK before retryWithBackoff sees the request. `attempt` reflects retryWithBackoff attempts only. Out of scope (see Out-of-scope) |
| `gen_ai.server.time_to_first_token` spec renames before reaching Stable | Single-file update: `session-tracing.ts:endLLMRequestSpan`. The qwen-code-native `ttft_ms` stays authoritative — no downstream impact |
| Subagent's LLM request | Parent is the subagent span (Phase 3). Phase 4 fields nest correctly. Aggregations grouped by `qwen-code.subagent.id` give per-subagent LLM perf — design-doc-future, easy follow-up |
| Reasoning model with long thought blocks | First `thought` Part triggers TTFT; `sampling_ms` includes both thinking + answer phases. Split into separate metrics deferred |
## Rollback
The change is additive at the OTel and metric level — every new attribute is optional, every new event is a new class. Existing dashboards that don't filter on the new fields keep working unchanged.
Behavior-affecting changes:
- New `ApiRetryEvent` LogRecord starts flowing → log volume increases proportional to retry rate (typically <1% of requests retry). Mitigate by sampling LogRecord at the SDK layer if needed.
- New breakdown metric `qwen-code.api.request.breakdown` starts producing time series → mild Prometheus cardinality bump (`{model, phase}` — bounded).
- `output_tokens_per_second` derived attribute may appear unusual on dashboards filtering "all attributes" — document.
Rollback path: revert the single PR (or each of 4a/4b/4c independently). All new fields use defensive defaults (undefined / 0) and don't change span structure.
## Sequencing
- **After Phase 3 (#4410, in review)**: not a hard dependency. Phase 4 attributes attach to `qwen-code.llm_request` spans regardless of whether they're under a `qwen-code.subagent` (Phase 3) or `qwen-code.interaction` (Phase 1) parent. Recommend Phase 3 land first so per-attempt aggregation under subagent subtrees works naturally.
- **Independent of #4384** (`traceparent` + `X-Qwen-Code-Session-Id` outbound propagation). They touch the HTTP layer; Phase 4 touches the stream/retry/metric layer.
- **Independent of `clearDetailedSpanState` chat-compression follow-up** (#4097 follow-up). Different surface.
## Open questions
1. **`onRetry` callback firing semantics**: invoked **before** backoff sleep (current proposal) or **after** (when the next attempt is about to start)? Before is simpler — callback has all the info immediately; after would require capturing the just-completed delay separately. Pre-sleep is the recommendation; document in callback contract.
2. **Per-attempt timing on the LLM span**: should we add `attempt_durations_ms: number[]` array? OTel supports array-of-primitive attributes. Useful for "which attempt of N was slow" diagnostics. Defer until production data shows demand — log-bridge spans already carry the equivalent.
3. **Persistent retry mode emission cap**: at what `attempt > N` threshold should we start sampling? `N = 5` then 1-in-10? `N = 10` then summary-only? Defer until we have production volume data.
4. **`TOKEN_PROCESSING` phase**: keep enum value dormant or wire it to something (e.g., consolidation time)? Defer — wait for a real use case.
5. **Subagent-level LLM rollups**: trivial follow-up once Phase 4 lands — sum `ttft_ms`/`output_tokens`/`input_tokens` per subagent subtree. Not Phase 4 scope but the data flow enables it.

View file

@ -165,6 +165,11 @@ vi.mock('../../telemetry/index.js', () => {
success: boolean;
inputTokens?: number;
outputTokens?: number;
cachedInputTokens?: number;
ttftMs?: number;
requestSetupMs?: number;
attempt?: number;
retryTotalDelayMs?: number;
durationMs?: number;
error?: string;
},
@ -616,6 +621,150 @@ describe('LoggingContentGenerator', () => {
});
});
it('captures ttftMs on the first user-visible stream chunk (Phase 4a)', async () => {
// Two chunks: first has text (user-visible), second has only usage.
// ttftMs must be set on the first chunk and not overwritten by the second.
const streamFn = vi.fn().mockResolvedValue(
(async function* () {
yield createResponse('r1', 'test-model', [{ text: 'hi' }]);
yield createResponse('r2', 'test-model', [], {
promptTokenCount: 10,
candidatesTokenCount: 2,
totalTokenCount: 12,
});
})(),
);
const wrapped = createWrappedGenerator(vi.fn(), streamFn);
const generator = new LoggingContentGenerator(wrapped, createConfig(), {
model: 'test-model',
authType: AuthType.USE_OPENAI,
enableOpenAILogging: false,
});
const request = {
model: 'test-model',
contents: 'Hello',
} as unknown as GenerateContentParameters;
const stream = await generator.generateContentStream(
request,
'prompt-ttft',
);
for await (const _ of stream) {
// consume
}
const spanRecord = getStreamSpanRecord();
const meta = spanRecord.endMetadata as { ttftMs?: number } | undefined;
expect(meta).toBeDefined();
expect(typeof meta!.ttftMs).toBe('number');
expect(meta!.ttftMs!).toBeGreaterThanOrEqual(0);
});
it('forwards cachedInputTokens from usageMetadata to endLLMRequestSpan (Phase 4a)', async () => {
const streamFn = vi.fn().mockResolvedValue(
(async function* () {
yield createResponse('r1', 'test-model', [{ text: 'ok' }], {
promptTokenCount: 100,
candidatesTokenCount: 20,
cachedContentTokenCount: 40,
totalTokenCount: 160,
});
})(),
);
const wrapped = createWrappedGenerator(vi.fn(), streamFn);
const generator = new LoggingContentGenerator(wrapped, createConfig(), {
model: 'test-model',
authType: AuthType.USE_OPENAI,
enableOpenAILogging: false,
});
const request = {
model: 'test-model',
contents: 'Hello',
} as unknown as GenerateContentParameters;
const stream = await generator.generateContentStream(
request,
'prompt-cache',
);
for await (const _ of stream) {
// consume
}
const spanRecord = getStreamSpanRecord();
expect(spanRecord.endMetadata).toMatchObject({
success: true,
inputTokens: 100,
cachedInputTokens: 40,
});
});
it('leaves ttftMs undefined when stream yields no user-visible chunks (Phase 4a)', async () => {
// Stream emits only usage-metadata chunks (no text/functionCall/etc).
// ttftMs must stay undefined — TTFT is only meaningful when content arrives.
const streamFn = vi.fn().mockResolvedValue(
(async function* () {
yield createResponse('r1', 'test-model', [], {
promptTokenCount: 5,
candidatesTokenCount: 0,
totalTokenCount: 5,
});
})(),
);
const wrapped = createWrappedGenerator(vi.fn(), streamFn);
const generator = new LoggingContentGenerator(wrapped, createConfig(), {
model: 'test-model',
authType: AuthType.USE_OPENAI,
enableOpenAILogging: false,
});
const request = {
model: 'test-model',
contents: 'Hello',
} as unknown as GenerateContentParameters;
const stream = await generator.generateContentStream(
request,
'prompt-no-content',
);
for await (const _ of stream) {
// consume
}
const spanRecord = getStreamSpanRecord();
const meta = spanRecord.endMetadata as { ttftMs?: number } | undefined;
expect(meta!.ttftMs).toBeUndefined();
});
it('forwards cachedInputTokens to endLLMRequestSpan on non-stream success (Phase 4a)', async () => {
const generateFn = vi.fn().mockResolvedValue(
createResponse('resp-cache', 'test-model', [{ text: 'ok' }], {
promptTokenCount: 100,
candidatesTokenCount: 30,
cachedContentTokenCount: 60,
totalTokenCount: 190,
}),
);
const wrapped = createWrappedGenerator(generateFn, vi.fn());
const generator = new LoggingContentGenerator(wrapped, createConfig(), {
model: 'test-model',
authType: AuthType.USE_OPENAI,
enableOpenAILogging: false,
});
const request = {
model: 'test-model',
contents: 'Hi',
} as unknown as GenerateContentParameters;
await generator.generateContent(request, 'prompt-cache-non-stream');
const spanRecord = getGenerateContentSpanRecord();
expect(spanRecord.endMetadata).toMatchObject({
success: true,
inputTokens: 100,
outputTokens: 30,
cachedInputTokens: 60,
});
});
it('preserves non-stream success when response and OpenAI logging fail', async () => {
vi.mocked(logApiResponse).mockImplementationOnce(() => {
throw new Error('response-log-fail');

View file

@ -61,6 +61,7 @@ import {
API_CALL_ABORTED_SPAN_STATUS_MESSAGE,
API_CALL_FAILED_SPAN_STATUS_MESSAGE,
} from '../../telemetry/tracer.js';
import { hasUserVisibleContent } from './streamContentDetection.js';
const debugLogger = createDebugLogger('LOGGING_CONTENT_GENERATOR');
@ -285,6 +286,7 @@ export class LoggingContentGenerator implements ContentGenerator {
success: true,
inputTokens: response.usageMetadata?.promptTokenCount,
outputTokens: response.usageMetadata?.candidatesTokenCount,
cachedInputTokens: response.usageMetadata?.cachedContentTokenCount,
durationMs: Date.now() - startTime,
});
return response;
@ -462,6 +464,14 @@ export class LoggingContentGenerator implements ContentGenerator {
let firstModelVersion = '';
let lastUsageMetadata: GenerateContentResponseUsageMetadata | undefined;
let errorOccurred = false;
// TTFT (time to first token): wall-clock from generateContentStream
// dispatch to the first stream chunk containing user-visible content.
// Method-local closure variable — NEVER an instance field — because
// LoggingContentGenerator is shared across concurrent generateContentStream
// calls (one per ContentGenerator, see contentGenerator.ts:createContentGenerator).
// See docs/design/telemetry-llm-request-timing-design.md (D1, D2).
let ttftMs: number | undefined;
// Tracks whether the idle timeout fired and ended the span. If so,
// a resumed-after-timeout consumer must not call endLLMRequestSpan
// again (the helper would no-op, but more importantly we skip the
@ -516,6 +526,13 @@ export class LoggingContentGenerator implements ContentGenerator {
if (response.usageMetadata) {
lastUsageMetadata = response.usageMetadata;
}
// Capture TTFT on the first stream chunk that contains user-visible
// content. hasUserVisibleContent skips role-only / usageMetadata-only
// chunks, so TTFT reflects "model produced something the operator can
// attribute to user-perceived latency."
if (ttftMs === undefined && hasUserVisibleContent(response)) {
ttftMs = Date.now() - startTime;
}
resetSpanTimeout?.();
yield response;
}
@ -601,6 +618,8 @@ export class LoggingContentGenerator implements ContentGenerator {
success: !errorOccurred,
inputTokens: lastUsageMetadata?.promptTokenCount,
outputTokens: lastUsageMetadata?.candidatesTokenCount,
cachedInputTokens: lastUsageMetadata?.cachedContentTokenCount,
ttftMs,
durationMs: Date.now() - startTime,
error: errorOccurred
? aborted

View file

@ -0,0 +1,131 @@
/**
* @license
* Copyright 2026 Qwen Team
* SPDX-License-Identifier: Apache-2.0
*/
import { describe, expect, it } from 'vitest';
import { GenerateContentResponse } from '@google/genai';
import { hasUserVisibleContent } from './streamContentDetection.js';
function chunkWithParts(parts: unknown[]): GenerateContentResponse {
const r = new GenerateContentResponse();
r.candidates = [
{
content: { role: 'model', parts: parts as never },
},
];
return r;
}
describe('hasUserVisibleContent', () => {
it('returns true for non-empty text part', () => {
expect(hasUserVisibleContent(chunkWithParts([{ text: 'hi' }]))).toBe(true);
});
it('returns false for empty text part', () => {
expect(hasUserVisibleContent(chunkWithParts([{ text: '' }]))).toBe(false);
});
it('returns true for functionCall part', () => {
expect(
hasUserVisibleContent(
chunkWithParts([{ functionCall: { name: 'read', args: {} } }]),
),
).toBe(true);
});
it('returns true for inlineData part', () => {
expect(
hasUserVisibleContent(
chunkWithParts([
{ inlineData: { mimeType: 'image/png', data: 'abc' } },
]),
),
).toBe(true);
});
it('returns true for executableCode part', () => {
expect(
hasUserVisibleContent(
chunkWithParts([
{ executableCode: { language: 'PYTHON', code: 'print(1)' } },
]),
),
).toBe(true);
});
it('returns true for thought / reasoning part with thought: true', () => {
expect(hasUserVisibleContent(chunkWithParts([{ thought: true }]))).toBe(
true,
);
});
it('returns false for thought: false (explicit non-thought part)', () => {
// Codebase convention: `thought` is a boolean flag where false means
// "explicitly not a thought." A part with only `thought: false` and no
// other content must not trigger TTFT.
expect(hasUserVisibleContent(chunkWithParts([{ thought: false }]))).toBe(
false,
);
});
it('returns false for thought: undefined / missing (default non-thought)', () => {
// A bare object without the `thought` key is the common case for non-thinking
// chunks; must not match the thought branch.
expect(hasUserVisibleContent(chunkWithParts([{}]))).toBe(false);
});
it('returns true when thought: true coexists with empty text', () => {
// First Anthropic <thinking> chunk often arrives as { text: '', thought: true }.
// Per design doc D1, "thought / reasoning content" is user-visible — TTFT fires.
expect(
hasUserVisibleContent(chunkWithParts([{ text: '', thought: true }])),
).toBe(true);
});
it('returns true when any part is user-visible (mixed)', () => {
expect(
hasUserVisibleContent(chunkWithParts([{ text: '' }, { text: 'hi' }])),
).toBe(true);
});
it('returns false for empty parts array', () => {
expect(hasUserVisibleContent(chunkWithParts([]))).toBe(false);
});
it('returns false when candidates is missing', () => {
const r = new GenerateContentResponse();
expect(hasUserVisibleContent(r)).toBe(false);
});
it('returns false when content is missing', () => {
const r = new GenerateContentResponse();
r.candidates = [{}];
expect(hasUserVisibleContent(r)).toBe(false);
});
it('returns false when parts is undefined', () => {
const r = new GenerateContentResponse();
r.candidates = [{ content: { role: 'model' } }];
expect(hasUserVisibleContent(r)).toBe(false);
});
it('returns false for usage-only / role-only chunks', () => {
const r = new GenerateContentResponse();
r.candidates = [{ content: { role: 'model', parts: [] } }];
r.usageMetadata = { totalTokenCount: 42 };
expect(hasUserVisibleContent(r)).toBe(false);
});
it('handles parts that are non-objects defensively', () => {
expect(
hasUserVisibleContent(
chunkWithParts([null, undefined, 'string', 42, { text: 'real' }]),
),
).toBe(true);
expect(hasUserVisibleContent(chunkWithParts([null, undefined, 'x']))).toBe(
false,
);
});
});

View file

@ -0,0 +1,57 @@
/**
* @license
* Copyright 2026 Qwen Team
* SPDX-License-Identifier: Apache-2.0
*/
import type { GenerateContentResponse } from '@google/genai';
/**
* Detects whether a streaming chunk contains user-visible model output.
*
* Used by the LoggingContentGenerator stream wrapper to identify the first
* chunk that should trigger TTFT (time-to-first-token) measurement.
*
* A chunk is "user-visible" if any normalized Part in candidates[0].content.parts
* is one of:
* - text with a non-empty string
* - functionCall (tool use even tool-call-only responses count)
* - inlineData (image, binary blob)
* - executableCode (sandbox / code-execution responses)
* - thought / reasoning content (provider-dependent; o1, qwen thinking, Anthropic <thinking>)
*
* Chunks containing only role metadata, only usageMetadata (final summary
* chunk), or empty parts are NOT user-visible TTFT should not fire on these.
*
* Centralised here (single predicate over the normalized GenerateContentResponse
* shape) so the four provider generators (Anthropic / OpenAI / Gemini / Qwen)
* don't each need their own first-token logic. Each provider already normalizes
* its native chunk shape to GenerateContentResponse before LoggingContentGenerator
* sees it (see loggingContentGenerator.ts generateContentStream).
*/
export function hasUserVisibleContent(chunk: GenerateContentResponse): boolean {
const parts = chunk.candidates?.[0]?.content?.parts;
if (!parts || parts.length === 0) return false;
return parts.some(isUserVisiblePart);
}
function isUserVisiblePart(part: unknown): boolean {
if (part === null || typeof part !== 'object') return false;
const p = part as {
text?: unknown;
functionCall?: unknown;
inlineData?: unknown;
executableCode?: unknown;
thought?: unknown;
};
if (typeof p.text === 'string' && p.text.length > 0) return true;
if (p.functionCall !== undefined) return true;
if (p.inlineData !== undefined) return true;
if (p.executableCode !== undefined) return true;
// `thought` is a boolean flag in this codebase — `true` means the part
// carries reasoning content, false / absent means none (see loggingContentGenerator.ts
// where `part.thought ? {...} : {}` is the canonical pattern). Match strict `=== true`
// rather than checking presence — `thought: false` parts are explicitly NOT user-visible.
if (p.thought === true) return true;
return false;
}

View file

@ -378,6 +378,214 @@ describe('session-tracing', () => {
});
});
describe('LLM request spans — Phase 4a (timing decomposition + GenAI dual-emit)', () => {
it('startLLMRequestSpan dual-emits gen_ai.request.model alongside qwen-code.model', () => {
const span = startLLMRequestSpan('test-model', 'p');
endLLMRequestSpan(span, { success: true });
const attrs = mockSpans[0]!.attributes;
expect(attrs['qwen-code.model']).toBe('test-model');
expect(attrs['gen_ai.request.model']).toBe('test-model');
});
it('endLLMRequestSpan dual-emits gen_ai.usage.input_tokens / output_tokens', () => {
const span = startLLMRequestSpan('m', 'p');
endLLMRequestSpan(span, {
success: true,
inputTokens: 100,
outputTokens: 50,
});
const attrs = mockSpans[0]!.attributes;
expect(attrs['input_tokens']).toBe(100);
expect(attrs['gen_ai.usage.input_tokens']).toBe(100);
expect(attrs['output_tokens']).toBe(50);
expect(attrs['gen_ai.usage.output_tokens']).toBe(50);
});
it('endLLMRequestSpan dual-emits gen_ai.usage.cached_tokens when present', () => {
const span = startLLMRequestSpan('m', 'p');
endLLMRequestSpan(span, {
success: true,
inputTokens: 100,
cachedInputTokens: 40,
});
const attrs = mockSpans[0]!.attributes;
expect(attrs['cached_input_tokens']).toBe(40);
expect(attrs['gen_ai.usage.cached_tokens']).toBe(40);
});
it('endLLMRequestSpan omits cached_input_tokens when undefined', () => {
const span = startLLMRequestSpan('m', 'p');
endLLMRequestSpan(span, { success: true, inputTokens: 100 });
const attrs = mockSpans[0]!.attributes;
expect(attrs['cached_input_tokens']).toBeUndefined();
expect(attrs['gen_ai.usage.cached_tokens']).toBeUndefined();
});
it('endLLMRequestSpan emits cached_input_tokens === 0 (cache miss is meaningful info, not undefined)', () => {
// Providers that report 0 cached tokens are signaling an explicit cache
// miss. Distinct from undefined ("we don't know"). Both attribute names
// must propagate the literal 0.
const span = startLLMRequestSpan('m', 'p');
endLLMRequestSpan(span, {
success: true,
inputTokens: 100,
cachedInputTokens: 0,
});
const attrs = mockSpans[0]!.attributes;
expect(attrs['cached_input_tokens']).toBe(0);
expect(attrs['gen_ai.usage.cached_tokens']).toBe(0);
});
it('endLLMRequestSpan writes ttft_ms and dual-emits gen_ai.server.time_to_first_token (in seconds)', () => {
const span = startLLMRequestSpan('m', 'p');
endLLMRequestSpan(span, {
success: true,
ttftMs: 234,
durationMs: 1000,
});
const attrs = mockSpans[0]!.attributes;
expect(attrs['ttft_ms']).toBe(234);
// Spec uses seconds as double — 234ms → 0.234s
expect(attrs['gen_ai.server.time_to_first_token']).toBeCloseTo(0.234, 6);
});
it('endLLMRequestSpan omits ttft_ms when undefined (non-streaming or aborted before first chunk)', () => {
const span = startLLMRequestSpan('m', 'p');
endLLMRequestSpan(span, { success: true, durationMs: 500 });
const attrs = mockSpans[0]!.attributes;
expect(attrs['ttft_ms']).toBeUndefined();
expect(attrs['gen_ai.server.time_to_first_token']).toBeUndefined();
expect(attrs['sampling_ms']).toBeUndefined();
expect(attrs['output_tokens_per_second']).toBeUndefined();
});
it('endLLMRequestSpan derives sampling_ms when ttftMs is set (no requestSetup)', () => {
const span = startLLMRequestSpan('m', 'p');
endLLMRequestSpan(span, {
success: true,
ttftMs: 200,
durationMs: 1000,
});
// sampling_ms = duration - ttft - (requestSetup ?? 0) = 1000 - 200 - 0
expect(mockSpans[0]!.attributes['sampling_ms']).toBe(800);
});
it('endLLMRequestSpan derives sampling_ms accounting for requestSetupMs (Phase 4b populates this)', () => {
const span = startLLMRequestSpan('m', 'p');
endLLMRequestSpan(span, {
success: true,
ttftMs: 200,
requestSetupMs: 300,
durationMs: 1000,
});
// sampling_ms = 1000 - 200 - 300
expect(mockSpans[0]!.attributes['sampling_ms']).toBe(500);
});
it('endLLMRequestSpan clamps sampling_ms to 0 when ttft + setup exceed duration (clock skew)', () => {
const span = startLLMRequestSpan('m', 'p');
endLLMRequestSpan(span, {
success: true,
ttftMs: 800,
requestSetupMs: 500,
durationMs: 1000,
});
// Math.max(0, 1000 - 800 - 500) = 0
expect(mockSpans[0]!.attributes['sampling_ms']).toBe(0);
});
it('endLLMRequestSpan derives output_tokens_per_second from sampling_ms + outputTokens', () => {
const span = startLLMRequestSpan('m', 'p');
endLLMRequestSpan(span, {
success: true,
ttftMs: 200,
durationMs: 1200,
outputTokens: 500,
});
// sampling_ms = 1000ms = 1s; otps = 500 / 1.0 = 500
expect(mockSpans[0]!.attributes['sampling_ms']).toBe(1000);
expect(mockSpans[0]!.attributes['output_tokens_per_second']).toBe(500);
});
it('endLLMRequestSpan rounds output_tokens_per_second to 2 decimals', () => {
const span = startLLMRequestSpan('m', 'p');
endLLMRequestSpan(span, {
success: true,
ttftMs: 200,
durationMs: 1325, // sampling_ms = 1125
outputTokens: 100, // otps = 100 / 1.125 = 88.888…
});
expect(mockSpans[0]!.attributes['output_tokens_per_second']).toBe(88.89);
});
it('endLLMRequestSpan omits output_tokens_per_second when sampling_ms == 0', () => {
const span = startLLMRequestSpan('m', 'p');
endLLMRequestSpan(span, {
success: true,
ttftMs: 1000,
durationMs: 1000,
outputTokens: 50,
});
// sampling_ms = 0 → otps would be Infinity, must be omitted
expect(mockSpans[0]!.attributes['sampling_ms']).toBe(0);
expect(
mockSpans[0]!.attributes['output_tokens_per_second'],
).toBeUndefined();
});
it('endLLMRequestSpan omits output_tokens_per_second when outputTokens missing', () => {
const span = startLLMRequestSpan('m', 'p');
endLLMRequestSpan(span, {
success: true,
ttftMs: 200,
durationMs: 1000,
});
expect(
mockSpans[0]!.attributes['output_tokens_per_second'],
).toBeUndefined();
});
it('endLLMRequestSpan writes Phase 4b retry placeholders when caller provides them', () => {
const span = startLLMRequestSpan('m', 'p');
endLLMRequestSpan(span, {
success: true,
attempt: 3,
requestSetupMs: 4500,
retryTotalDelayMs: 4200,
durationMs: 5000,
});
const attrs = mockSpans[0]!.attributes;
expect(attrs['attempt']).toBe(3);
expect(attrs['request_setup_ms']).toBe(4500);
expect(attrs['retry_total_delay_ms']).toBe(4200);
});
it('endLLMRequestSpan omits Phase 4b fields when caller does not provide them (Phase 4a default)', () => {
const span = startLLMRequestSpan('m', 'p');
endLLMRequestSpan(span, { success: true, durationMs: 500 });
const attrs = mockSpans[0]!.attributes;
expect(attrs['attempt']).toBeUndefined();
expect(attrs['request_setup_ms']).toBeUndefined();
expect(attrs['retry_total_delay_ms']).toBeUndefined();
});
});
describe('tool spans', () => {
it('creates and ends a tool span', () => {
const span = startToolSpan('ReadFile', { 'tool.call_id': 'call-1' });

View file

@ -46,9 +46,46 @@ export interface EndInteractionOptions {
export interface LLMRequestMetadata {
inputTokens?: number;
outputTokens?: number;
/**
* Tokens served from the provider's prompt cache (Anthropic
* cache_read_input_tokens, OpenAI prompt_tokens_details.cached_tokens, etc).
* Normalized to GenerateContentResponseUsageMetadata.cachedContentTokenCount
* by each provider generator before reaching LoggingContentGenerator.
*/
cachedInputTokens?: number;
success: boolean;
durationMs?: number;
error?: string;
/**
* Time from the successful attempt's request dispatch to the first stream
* chunk containing user-visible content (text / functionCall / inlineData /
* executableCode / thought). Undefined for non-streaming requests, requests
* aborted before the first user-visible chunk, and any path that does not
* pass through LoggingContentGenerator's stream wrapper.
*
* Semantics: diverges from claude-code's ttftMs (which fires on
* Anthropic's message_start metadata event). Matches the "time to first
* actual token" intent of the industry-standard TTFT name.
* See docs/design/telemetry-llm-request-timing-design.md (D1).
*/
ttftMs?: number;
/**
* Time from generateContent/generateContentStream entry to the start of the
* successful attempt (ms). Includes all failed retries + backoff sleeps.
* Populated by the retry layer in Phase 4b; undefined in Phase 4a.
*/
requestSetupMs?: number;
/**
* Final attempt number (1-based). 1 = no retries. Populated by the retry
* layer in Phase 4b; undefined in Phase 4a.
*/
attempt?: number;
/**
* Sum of all backoff delays before the successful attempt (ms). 0 if no
* retries occurred. Populated by the retry layer in Phase 4b; undefined
* in Phase 4a.
*/
retryTotalDelayMs?: number;
}
export interface ToolSpanMetadata {
@ -321,6 +358,10 @@ export function startLLMRequestSpan(model: string, promptId: string): Span {
'qwen-code.model': model,
'qwen-code.prompt_id': promptId,
'llm_request.context': parentCtx ? 'interaction' : 'standalone',
// Dual-emit OTel GenAI semantic convention (Stable). Private name
// (qwen-code.model) remains authoritative; gen_ai.* is a compat layer
// for spec-aware backends. See docs/design/telemetry-llm-request-timing-design.md (D8).
'gen_ai.request.model': model,
};
const span = getTracer().startSpan(
@ -362,10 +403,55 @@ export function endLLMRequestSpan(
const endAttributes: Attributes = { duration_ms: duration };
if (metadata) {
if (metadata.inputTokens !== undefined)
if (metadata.inputTokens !== undefined) {
endAttributes['input_tokens'] = metadata.inputTokens;
if (metadata.outputTokens !== undefined)
// Dual-emit OTel GenAI semconv (Stable). Private name stays authoritative.
endAttributes['gen_ai.usage.input_tokens'] = metadata.inputTokens;
}
if (metadata.outputTokens !== undefined) {
endAttributes['output_tokens'] = metadata.outputTokens;
endAttributes['gen_ai.usage.output_tokens'] = metadata.outputTokens;
}
if (metadata.cachedInputTokens !== undefined) {
endAttributes['cached_input_tokens'] = metadata.cachedInputTokens;
// Dual-emit OTel GenAI semconv (Experimental — may rename before Stable).
endAttributes['gen_ai.usage.cached_tokens'] =
metadata.cachedInputTokens;
}
if (metadata.ttftMs !== undefined) {
endAttributes['ttft_ms'] = metadata.ttftMs;
// Dual-emit OTel GenAI semconv (Experimental). Spec unit is seconds-as-double.
endAttributes['gen_ai.server.time_to_first_token'] =
metadata.ttftMs / 1000;
}
if (metadata.requestSetupMs !== undefined) {
endAttributes['request_setup_ms'] = metadata.requestSetupMs;
}
if (metadata.attempt !== undefined) {
endAttributes['attempt'] = metadata.attempt;
}
if (metadata.retryTotalDelayMs !== undefined) {
endAttributes['retry_total_delay_ms'] = metadata.retryTotalDelayMs;
}
// Derived: sampling_ms = time from first user-visible chunk to end
// (== output generation time, excluding setup + first-token delay).
// Computable only when ttftMs is set. requestSetupMs defaults to 0
// when undefined (no retries) — this gives the correct sampling
// duration in both Phase 4a (no retry data) and Phase 4b (with).
if (metadata.ttftMs !== undefined) {
const samplingMs = Math.max(
0,
duration - metadata.ttftMs - (metadata.requestSetupMs ?? 0),
);
endAttributes['sampling_ms'] = samplingMs;
// Derived: output tokens per second during sampling. Undefined when
// sampling_ms is 0 (avoid divide-by-zero) or when outputTokens missing.
if (samplingMs > 0 && metadata.outputTokens !== undefined) {
endAttributes['output_tokens_per_second'] =
Math.round((metadata.outputTokens / (samplingMs / 1000)) * 100) /
100;
}
}
endAttributes['success'] = metadata.success;
if (metadata.error !== undefined)
endAttributes['error'] = truncateSpanError(metadata.error);