feat(telemetry): Phase 4a — TTFT capture + GenAI semconv dual-emit (#3731) (#4417)

2026-05-22 11:13:51 +00:00 · 2026-05-22 10:54:11 +08:00 · 2026-05-22 10:54:11 +08:00 · fd75f77e19
commit fd75f77e19
parent 48b0a8bfce
7 changed files with 1190 additions and 2 deletions
--- a/docs/design/telemetry-llm-request-timing-design.md
+++ b/docs/design/telemetry-llm-request-timing-design.md
@ -0,0 +1,538 @@
+# LLM Request Timing Decomposition Design (P3 Phase 4)
+
+> Issue #3731 — Phase 4 of hierarchical session tracing. Adds time-to-first-token, request-setup duration, sampling duration, and per-attempt retry telemetry to the `qwen-code.llm_request` span so operators can answer "why was this LLM call slow?" without guessing.
+>
+> Builds on Phase 1 (#4126), Phase 1.5 (#4302), Phase 2 (#4321). Independent of Phase 3 (#4410, in review) — recommended to land Phase 3 first so Phase 4's per-attempt fields aggregate cleanly under subagent subtrees.
+
+## Problem
+
+`qwen-code.llm_request` spans today carry only `model`, `prompt_id`, `input_tokens`, `output_tokens`, `success`, `error`, `duration_ms`. Operators reading a single trace cannot tell:
+
+1. **How much of `duration_ms` was the model thinking vs the network setup.** A 12-second `duration_ms` could be 11s of retries followed by 1s of fast generation, or 100ms of setup followed by 12s of slow streaming — the trace doesn't say.
+2. **When the user saw the first token.** TTFT (time-to-first-token) is the standard latency SLO for chat UIs. We can't compute it; we don't capture it.
+3. **What happened during retries.** `retryWithBackoff` (`utils/retry.ts:285`) only calls `debugLogger.warn` — no OTel event, no span attribute. The 4 LLM call sites that go through it (`client.ts:1540`, `baseLlmClient.ts:193,282`, `geminiChat.ts:1039`) have zero retry visibility in traces or metrics. `ContentRetryEvent` exists for content-recovery retries inside `geminiChat.ts:806,830` but not for the more common rate-limit / 5xx retries.
+4. **That `api.request.breakdown` is dead code.** The metric is defined at `metrics.ts:242-251` with 4 `ApiRequestPhase` values, exported from `index.ts:117`, tested in `metrics.test.ts:646-675` — but `recordApiRequestBreakdown()` has zero callers in production code. The metric infrastructure is paid for; the data flow was never connected.
+
+These gaps make `qwen-code.llm_request` the least informative span in the trace tree. Tool spans (#4126/#4321) and subagent spans (#4410) both surface lifecycle phases; LLM spans collapse the entire request into one opaque duration.
+
+## Existing surface (no change)
+
+| Component                                                    | Location                                                         | Why we don't touch it                                                                                                                                                                                       |
+| ------------------------------------------------------------ | ---------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| LLM request span lifecycle                                   | `session-tracing.ts` `startLLMRequestSpan` / `endLLMRequestSpan` | Phase 1 (#4126) established the helpers. We extend the metadata interface, don't restructure                                                                                                                |
+| Active span propagation into provider generators             | `loggingContentGenerator.ts:213,287`                             | Phase 1 (#4126) replaced `withSpan('api.*')` with native helpers; the active context already reaches the stream wrapper                                                                                     |
+| `ContentRetryEvent` schema + consumers                       | `types.ts:626`, `qwen-logger.ts:947`, `loggers.ts:717`           | Existing event keeps its shape and downstreams; we add a sibling event class for the `retryWithBackoff` path                                                                                                |
+| `LogToSpanProcessor` log-bridge spans                        | `log-to-span-processor.ts`                                       | ContentRetryEvent's existing bridge continues to nest under the active LLM span. Phase 4 does not change this                                                                                               |
+| `ApiRequestPhase` enum                                       | `metrics.ts:330-334`                                             | Public surface (4 values). We populate 3 of the 4 from production code; leave the enum unchanged for backward compatibility                                                                                 |
+| Per-provider chunk normalization → `GenerateContentResponse` | `loggingContentGenerator.ts:286-393`                             | Each provider already normalizes to Google's `GenerateContentResponse` shape before LoggingContentGenerator sees the stream. TTFT detection runs centrally over this normalized shape; no per-provider code |
+| `retryWithBackoff` general-purpose retry                     | `utils/retry.ts:140`                                             | Used by both LLM callers and non-LLM (`channels/weixin/src/api.ts`). We extend with an opt-in `onRetry` callback rather than hard-coupling to LLM telemetry                                                 |
+| Non-streaming `generateContent`                              | `loggingContentGenerator.ts:212`                                 | TTFT is not meaningful for non-streaming; the new fields stay `undefined`. Span lifecycle and existing attrs unchanged                                                                                      |
+
+## Out-of-scope (deferred)
+
+- **SDK-level retries** (openai SDK `maxRetries=3`, google-genai SDK internal retries). These happen entirely inside the third-party SDK; observing them requires disabling SDK retries and reimplementing in `retryWithBackoff`. Separate decision, not Phase 4.
+- **Per-token streaming metrics** (inter-token latency, per-chunk size). Useful for inference-engine perf debugging, not for the user-perceived latency questions Phase 4 targets.
+- **Separate TTFT for reasoning/thinking blocks.** "First token" includes thinking content (see D1). A future enhancement could split `ttft_to_reasoning_ms` vs `ttft_to_answer_ms`, but only after we know there's demand.
+- **Sampling phase as a dedicated child span.** Computable from `duration_ms - ttft_ms - request_setup_ms`; child span adds nothing for OTel-only backends (claude-code uses one for Perfetto only). Stored as a span attribute instead — see D6.
+- **Persistent retry mode (`QWEN_CODE_UNATTENDED_RETRY`) event-level rate limiting.** A single LLM request can produce 50+ `ContentRetryEvent` / `ApiRetryEvent` records under persistent retry. Capping emission is a follow-up — Phase 4 emits all events; if production volumes prove unbearable, add a per-span emission cap with a "+N more attempts (truncated)" summary event in a follow-up PR.
+- **`TOKEN_PROCESSING` breakdown phase.** Enum value exists but qwen-code has no real post-stream local processing worth measuring (<10ms typical). Skipped in production callers; enum value retained for future use or for callers we don't control.
+- **Migrating `ContentRetryEvent` onto LLM span as span events.** Same reasoning as Phase 3's `subagent_execution` LogRecord: existing consumers (qwen-logger RUM, future metrics) are tightly coupled to the LogRecord. Bridge-span coverage is good enough.
+
+## References (decision evidence)
+
+| Source                                                                                                                      | Key takeaway                                                                                                                                                                                                                                                                                                                       |
+| --------------------------------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| claude-code (Anthropic) `claude.ts:1762, 1789, 1982, 2882`                                                                  | TTFT captured as `Date.now() - start` on `message_start` SSE event; `start` reset per retry attempt. `requestSetupMs = start - startIncludingRetries`. `attemptStartTimes` array preserved per attempt. Confirms feasibility of the approach; their TTFT semantic is "first stream event" (we diverge to "first content" — see D1) |
+| claude-code `perfettoTracing.ts:549-671`                                                                                    | Renders Request Setup → Attempt N (retry) → First Token → Sampling as nested B/E pairs. Demonstrates the visual decomposition; qwen-code does the same decomposition with OTel attributes since we have no Perfetto                                                                                                                |
+| claude-code `sessionTracing.ts:447`                                                                                         | Only `ttft_ms` makes it onto the OTel span (not `requestSetupMs`, not `samplingMs`, not per-attempt timing). We deliberately put more on the span — claude-code has Perfetto for visualization; we don't                                                                                                                           |
+| opencode (sst/opencode) `session/llm.ts`, `route/client.ts`                                                                 | No TTFT measurement. Single `LLM.run` Effect span covers everything. Validates that the gap exists across competing tools; not a reference for what to do                                                                                                                                                                          |
+| [OTel GenAI Semantic Conventions](https://opentelemetry.io/docs/specs/semconv/gen-ai/) (status: Development / Experimental) | `gen_ai.usage.input_tokens` (Stable), `gen_ai.usage.output_tokens` (Stable), `gen_ai.usage.cached_tokens` (Experimental), `gen_ai.request.model` (Stable), `gen_ai.server.time_to_first_token` (Experimental, seconds as double). Dual-emit pattern follows #4410 precedent                                                        |
+| [OTel Trace Spec — Span Events](https://opentelemetry.io/docs/specs/otel/trace/api/#add-events)                             | "Events SHOULD NOT be used to record information that's better captured as Span Attributes." Confirms per-attempt info belongs on the LLM span attributes + log-bridge spans, not as Span Events on the parent                                                                                                                     |
+| Phase 3 design doc (`telemetry-subagent-spans-design.md`)                                                                   | Established the dual-emit pattern (`qwen-code.subagent.id` + `gen_ai.agent.id`) and the "private name is authoritative" rule. Phase 4 follows the same convention for TTFT and token fields                                                                                                                                        |
+
+## Design — seven decisions, each justified
+
+### D1 — TTFT semantic: "first chunk containing user-visible content"
+
+TTFT measures wall-clock from the **successful attempt's** request dispatch to the **first stream chunk that contains user-visible output**. A chunk is "user-visible" if any normalized `Part` in `candidates[0].content.parts` is one of:
+
+- `text` with non-empty string
+- `functionCall` (tool use)
+- `inlineData` (image, binary)
+- `executableCode`
+- `thought` / reasoning content (whatever the provider surfaces — Gemini's `thought`, Anthropic's `<thinking>` block, OpenAI o1 reasoning chunk)
+
+Chunks containing only `role` metadata or only `usageMetadata` (final usage-summary chunk) do not trigger TTFT.
+
+**Why not "first stream event of any kind" (claude-code's choice)**: claude-code measures TTFT at `message_start`, an Anthropic-specific metadata event that fires 50–300ms before any actual content. Their internal `headlessProfiler.ts` already separates `time_to_first_response_ms` for the "user saw something" semantic, acknowledging the distinction. qwen-code spans multiple providers (Anthropic, OpenAI, Gemini, Qwen) — picking the metadata-event semantic means TTFT for Anthropic is fundamentally different from TTFT for OpenAI (which has no analogous metadata-only first event). The user-visible-content semantic is uniform across all 4 providers and matches "time-to-first-token" literally.
+
+**Why include `thought` / reasoning**: from the operator's perspective, reasoning chunks are still "the model produced output." Excluding them would understate TTFT for reasoning-heavy models (o1, Qwen thinking variants). Future split into `ttft_to_reasoning_ms` vs `ttft_to_answer_ms` is possible; not Phase 4.
+
+**Why include tool-call-only chunks**: agent tool-decision LLM calls (one `tool_use`, no text) are common in qwen-code's workflow. Excluding them means TTFT is undefined for these requests. The `functionCall` Part is meaningful output.
+
+**Cross-product comparison note**: design doc explicitly states `qwen-code.ttft_ms ≈ claude-code.time_to_first_response_ms ≠ claude-code.ttft_ms`. Operators comparing across products should align on the user-visible-content semantic.
+
+### D2 — TTFT measurement site: method-local variables in `LoggingContentGenerator.generateContentStream`
+
+The first-chunk detection runs inside the existing stream wrapper at `loggingContentGenerator.ts:393` (`async function* processStreamGenerator`). Per-call variables (`start`, `ttftMs`) live in the method's closure; **never as instance fields**.
+
+**Why never instance fields**: `LoggingContentGenerator` is instantiated **once per `ContentGenerator`** (`contentGenerator.ts:377`) and shared across all concurrent `generateContentStream` calls — subagent fan-out, warmup queries, side-queries from `geminiChat`. An instance field would be overwritten across concurrent calls, producing nonsense TTFT for one of every two interleaved requests.
+
+**Why not AsyncLocalStorage**: ALS would work but adds a context-management layer for a piece of state that doesn't need to escape the method. Method-local is simpler, zero overhead, zero risk of leakage.
+
+```ts
+// loggingContentGenerator.ts — inside generateContentStream
+const attemptStart = Date.now(); // per-call local
+const requestEntryTime = Date.now(); // also per-call local — see D3
+let ttftMs: number | undefined;
+const attemptStartTimes: number[] = [attemptStart];
+let retryTotalDelayMs = 0;
+let finalAttempt = 1;
+// stream wrapper inspects each chunk; first one matching hasUserVisibleContent:
+//   ttftMs = Date.now() - attemptStart;
+```
+
+`hasUserVisibleContent(chunk)` is a small standalone helper colocated with the wrapper, exported for tests:
+
+```ts
+function hasUserVisibleContent(chunk: GenerateContentResponse): boolean {
+  const parts = chunk.candidates?.[0]?.content?.parts;
+  if (!parts?.length) return false;
+  return parts.some(
+    (p) =>
+      (typeof p.text === 'string' && p.text.length > 0) ||
+      p.functionCall !== undefined ||
+      p.inlineData !== undefined ||
+      p.executableCode !== undefined ||
+      // @ts-expect-error — `thought` is not on all SDK versions but providers emit it
+      p.thought !== undefined,
+  );
+}
+```
+
+### D3 — `request_setup_ms` computation: entry-time vs successful-attempt-start
+
+`request_setup_ms` measures wall-clock from `generateContentStream`/`generateContent` entry to the **start of the successful attempt** — including all failed retries, backoff sleeps, and any pre-retry preparation work.
+
+```ts
+request_setup_ms = attemptStart_of_successful_attempt - requestEntryTime;
+```
+
+When `attempt === 1` and no retries happened, `request_setup_ms` is small (just SDK setup). When retries occurred, it captures the entire retry-budget overhead.
+
+**Putting it on the OTel span (diverges from claude-code, which puts it only on Perfetto)**: rationale at three levels:
+
+1. **No Perfetto** — qwen-code has no out-of-band visualization layer. OTel attributes are the only channel.
+2. **Single-trace debug** — operator sees `duration_ms=12000, request_setup_ms=11500, ttft_ms=200, sampling_ms=300` → instantly diagnoses "retries ate 11.5s, model itself was fast." Computing `request_setup_ms` from other fields requires also exposing `sampling_ms`, which we do anyway (D6).
+3. **Negligible cost** — 1 INT64 attribute. Same order of magnitude as the existing `input_tokens`, `output_tokens` attributes. Backend ingest cost is not material.
+
+### D4 — Retry telemetry: `onRetry` callback option on `retryWithBackoff` + new `ApiRetryEvent`
+
+`retryWithBackoff` currently calls `logRetryAttempt` (`retry.ts:343`) which only writes to `debugLogger.warn`. We extend the `RetryOptions` interface with an opt-in callback:
+
+```ts
+// utils/retry.ts
+interface RetryOptions<T> {
+  // ... existing fields ...
+  /**
+   * Optional. Called once per failed attempt, before the backoff sleep.
+   * Receives the attempt number (1-based), the error, and the delay before
+   * the next attempt. Use this to emit telemetry events for LLM call sites;
+   * leave undefined for non-LLM callers (e.g., channels/weixin) so they
+   * stay silent in LLM-specific telemetry channels.
+   */
+  onRetry?: (info: RetryAttemptInfo) => void;
+}
+
+interface RetryAttemptInfo {
+  attempt: number; // 1-based, matches debugLogger output
+  error: unknown;
+  errorStatus?: number;
+  delayMs: number; // backoff delay before next attempt
+}
+```
+
+The 4 LLM call sites (`client.ts:1540`, `baseLlmClient.ts:193,282`, `geminiChat.ts:1039`) register a callback that emits a new `ApiRetryEvent`:
+
+```ts
+// types.ts — new event class, sibling to ContentRetryEvent
+export class ApiRetryEvent implements BaseTelemetryEvent {
+  'event.name': typeof EVENT_API_RETRY;
+  'event.timestamp': string;
+  model: string;
+  prompt_id?: string;
+  attempt_number: number; // 1-based
+  error_type: string;
+  error_message: string; // truncated to 256 chars
+  status_code?: number;
+  retry_delay_ms: number;
+  // ... duration_ms set to retry_delay_ms so LogToSpanProcessor renders
+  // a bridge span of meaningful width
+  duration_ms: number;
+}
+```
+
+**Why a new event class, not extending `ContentRetryEvent`**:
+
+- `ContentRetryEvent` has 2 downstream consumers (qwen-logger, log-record export). Changing its payload risks breaking them.
+- The naming "content retry" semantically refers to content-recovery retries (invalid stream, schema repair) — extending it to cover rate-limit retries would muddy the schema.
+- New event is additive; no consumer surprise.
+
+**Why not embed callback IN `retry.ts`**: `retry.ts` is called by `channels/weixin/src/api.ts` too (microsoft messaging API retries). Hard-coupling LLM telemetry inside retry.ts would emit `ApiRetryEvent` for non-LLM retries. The `onRetry` callback is opt-in per caller — LLM callers opt in, weixin caller doesn't.
+
+**ContentRetryEvent coexistence**: ContentRetryEvent stays as-is for content-recovery retries inside `geminiChat.ts:806,830`. ApiRetryEvent covers the rate-limit / 5xx retries from `retryWithBackoff`. The two events fire from different layers and never duplicate. Existing log-bridge behavior for both events is preserved via `LogToSpanProcessor` — both events nest under the active LLM span automatically (Phase 1 wiring ensures the LLM span is active during retries).
+
+**Persistent retry mode (`QWEN_CODE_UNATTENDED_RETRY`)**: a single 429-loop request may emit 50+ events. Out of scope to rate-limit emission in Phase 4 — if production volumes prove unbearable, add a per-span cap with summary event in a follow-up PR. The aggregated `attempt` and `retry_total_delay_ms` on the parent LLM span (D5) remain accurate regardless of event cap.
+
+### D5 — Parent LLM span aggregation: scalar attributes only (no map-typed attrs)
+
+OTel span attributes are scalars (`string | number | boolean | array of these`). Map-typed attributes (like `retry_count_by_status: {429:2, 503:1}`) require JSON serialization and are awkward to query. Skip them.
+
+| Attribute                  | Type   | Semantic                                                                            |
+| -------------------------- | ------ | ----------------------------------------------------------------------------------- |
+| `attempt`                  | int    | 1-based final attempt count (`attemptStartTimes.length`)                            |
+| `retry_total_delay_ms`     | int    | Sum of all `delayMs` reported by `onRetry`; 0 if no retries                         |
+| `ttft_ms`                  | int    | TTFT per D1; undefined for non-streaming or aborted-before-first-chunk requests     |
+| `request_setup_ms`         | int    | Per D3                                                                              |
+| `sampling_ms`              | int    | Per D6                                                                              |
+| `output_tokens_per_second` | double | Derived; `output_tokens / (sampling_ms / 1000)`; undefined when `sampling_ms === 0` |
+
+Per-attempt status-code distribution (e.g., "2 of the 3 attempts were 429s") is queryable from log-bridge spans of `ApiRetryEvent` records. No need to duplicate it as a flattened attribute on the parent.
+
+**Why `sampling_ms` and `output_tokens_per_second` on the span**: derivable but cumbersome to compute in backend queries when summing across many spans. Same cost-benefit as `request_setup_ms` (D3).
+
+### D6 — Activate `recordApiRequestBreakdown()` for 3 of 4 phases
+
+In `endLLMRequestSpan` (or the wrapper that calls it), after computing TTFT/setup/sampling, emit:
+
+```ts
+recordApiRequestBreakdown(config, model, [
+  { phase: ApiRequestPhase.REQUEST_PREPARATION, durationMs: requestSetupMs },
+  { phase: ApiRequestPhase.NETWORK_LATENCY, durationMs: ttftMs }, // ttftMs = network + first-token-generation
+  { phase: ApiRequestPhase.RESPONSE_PROCESSING, durationMs: samplingMs },
+]);
+```
+
+**Why skip `TOKEN_PROCESSING`**: qwen-code does stream chunk processing inline (consolidation happens in the wrapper at `loggingContentGenerator.ts:644`); the post-stream wrap-up phase is <10ms and not architecturally distinct. Filling it with a meaningless value pollutes the histogram. Leaving the enum value unused is safe — `apiRequestBreakdownHistogram.record(value, {model, phase})` is just a histogram with `phase` as a label; missing labels are simply absent in queries.
+
+**Why not redefine `NETWORK_LATENCY`**: the spec name is slightly misleading (it's network + first-token-generation, not pure network latency), but:
+
+- The enum is part of `metrics.ts:330-334` which is exported from `index.ts:117` and tested.
+- Backend dashboards may already reference these phase names.
+- Renaming or adding a new phase would be a breaking change for trivially marginal accuracy improvement.
+
+Document the semantic in the design doc; leave the enum unchanged.
+
+**Why on the span path, not parallel**: keeps `recordApiRequestBreakdown` colocated with span attribute writes — single gated emission point (see D7 idempotency), single ordering invariant.
+
+### D7 — `endLLMRequestSpan` idempotency: metric recording gated on existing double-end guard
+
+Phase 1.5 (#4302) established that `endLLMRequestSpan` may be called twice (abort path + error path collision). The existing guard at `session-tracing.ts:~470` (`if (!activeSpans.has(...)) return;`) prevents double `span.end()`. Phase 4 metric recording (D6) **must sit inside the same guarded block**, before `span.end()`:
+
+```ts
+// session-tracing.ts — endLLMRequestSpan
+const llmCtx = activeSpans.get(spanRef);
+if (!llmCtx) return;            // already ended — double-end guard
+activeSpans.delete(spanRef);    // claim the end
+
+// ... compute duration, set attributes ...
+if (metadata) {
+  recordApiRequestBreakdown(config, llmCtx.attributes.model, [...]);   // NEW — gated
+  recordTokenUsageMetrics(...); // existing
+}
+
+span.end();
+```
+
+This guarantees metric is recorded **exactly once** per LLM request, matching the span lifecycle.
+
+**Why not record in `loggingContentGenerator`**: it doesn't see the abort path. Recording at the span lifecycle layer ensures every LLM request that opens a span produces exactly one breakdown sample, regardless of success/failure/abort.
+
+### D8 — GenAI semantic conventions dual-emit (private name authoritative)
+
+Each Phase 4 attribute that corresponds to an OTel GenAI semconv attribute is written twice on the span:
+
+| qwen-code private (authoritative)          | GenAI semconv (compat layer)                    | Unit conversion | Spec status  |
+| ------------------------------------------ | ----------------------------------------------- | --------------- | ------------ |
+| `ttft_ms` (ms, int)                        | `gen_ai.server.time_to_first_token` (s, double) | `ttftMs / 1000` | Experimental |
+| `input_tokens` (int)                       | `gen_ai.usage.input_tokens` (int)               | identical       | Stable       |
+| `output_tokens` (int)                      | `gen_ai.usage.output_tokens` (int)              | identical       | Stable       |
+| `cached_input_tokens` (int) (when present) | `gen_ai.usage.cached_tokens` (int)              | identical       | Experimental |
+| `qwen-code.model` (string)                 | `gen_ai.request.model` (string)                 | identical       | Stable       |
+
+**Existing token attribute names** on the LLM span (set in `endLLMRequestSpan` before Phase 4): qwen-code uses bare `input_tokens` and `output_tokens` already. Phase 4 adds the `gen_ai.usage.*` siblings to match #4410's pattern. The bare names stay; **don't rename**.
+
+Fields with no GenAI semconv equivalent — `request_setup_ms`, `sampling_ms`, `retry_total_delay_ms`, `attempt`, `output_tokens_per_second` — are emitted only under the qwen-code namespace.
+
+**Why "private authoritative, semconv as compat"**:
+
+- Internal dashboards, SLOs, debugLogger output, qwen-logger RUM, ARMS queries — all reference `ttft_ms` etc. Treating those as canonical avoids a flag-day migration.
+- The Experimental GenAI semconv may rename `gen_ai.server.time_to_first_token` before reaching Stable. If/when it does, we update the semconv emission; the qwen-code names don't move.
+- Future spec-aware backends (Datadog AI views, Honeycomb AI, ARMS GenAI dashboards) auto-pick up the `gen_ai.*` attributes without our involvement.
+
+**Why dual-emit unit conversion** (ms ↔ seconds): GenAI semconv chose seconds-as-double for latency; qwen-code chose ms-as-int (matches `duration_ms` already on the span). Both representations have value; the conversion is cheap.
+
+## Helper API (additive to `session-tracing.ts`)
+
+```ts
+// session-tracing.ts — LLMRequestMetadata interface extended (additive)
+export interface LLMRequestMetadata {
+  // ... existing fields: inputTokens, outputTokens, cachedInputTokens, success, error, ...
+
+  /** Time from successful attempt start to first user-visible content chunk (ms). Undefined for non-streaming or aborted-before-first-chunk requests. */
+  ttftMs?: number;
+
+  /** Time from generateContent entry to start of successful attempt (ms). Includes all failed retries + backoff. */
+  requestSetupMs?: number;
+
+  /** Final attempt number (1-based). 1 = no retries. */
+  attempt?: number;
+
+  /** Sum of all backoff delays before the successful attempt (ms). */
+  retryTotalDelayMs?: number;
+}
+
+// No new exported helpers — Phase 4 reuses startLLMRequestSpan / endLLMRequestSpan with extended metadata.
+```
+
+```ts
+// types.ts — new event class
+export class ApiRetryEvent implements BaseTelemetryEvent {
+  'event.name': typeof EVENT_API_RETRY = EVENT_API_RETRY;
+  'event.timestamp': string;
+  model: string;
+  prompt_id?: string;
+  attempt_number: number;
+  error_type: string;
+  error_message: string;
+  status_code?: number;
+  retry_delay_ms: number;
+  duration_ms: number;  // = retry_delay_ms, drives LogToSpanProcessor bridge span width
+
+  constructor(opts: { model: string; promptId?: string; attemptNumber: number; error: unknown; statusCode?: number; retryDelayMs: number }) { ... }
+}
+
+// constants.ts
+export const EVENT_API_RETRY = 'qwen-code.api_retry';
+
+// loggers.ts
+export function logApiRetry(config: Config, event: ApiRetryEvent): void { ... }
+```
+
+```ts
+// utils/retry.ts — RetryOptions extension
+interface RetryOptions<T> {
+  // ... existing ...
+  onRetry?: (info: RetryAttemptInfo) => void;
+}
+
+interface RetryAttemptInfo {
+  attempt: number;
+  error: unknown;
+  errorStatus?: number;
+  delayMs: number;
+}
+
+// Inside retryWithBackoff, where logRetryAttempt is called today:
+options.onRetry?.({ attempt, error, errorStatus, delayMs: actualDelay });
+logRetryAttempt(attempt, error, errorStatus); // existing debugLogger call unchanged
+```
+
+## Lifecycle wiring
+
+### Streaming path (the common case)
+
+```ts
+// loggingContentGenerator.ts:283 — generateContentStream
+async generateContentStream(req, userPromptId): Promise<AsyncGenerator<GenerateContentResponse>> {
+  const requestEntryTime = Date.now();
+  let attemptStart = requestEntryTime;
+  const attemptStartTimes: number[] = [attemptStart];
+  let retryTotalDelayMs = 0;
+  let finalAttempt = 1;
+
+  // Use existing startLLMRequestSpan (Phase 1)
+  // Pass onRetry callback to whatever retry layer is in use:
+  const onRetry: RetryAttemptInfo & { invoke: ... } = (info) => {
+    finalAttempt = info.attempt + 1;        // we're about to start attempt N+1
+    retryTotalDelayMs += info.delayMs;
+    attemptStart = Date.now() + info.delayMs; // approximate; actual reset is at top of next attempt
+    attemptStartTimes.push(attemptStart);
+    // emit ApiRetryEvent
+    logApiRetry(this.config, new ApiRetryEvent({
+      model: req.model,
+      promptId: userPromptId,
+      attemptNumber: info.attempt,
+      error: info.error,
+      statusCode: info.errorStatus,
+      retryDelayMs: info.delayMs,
+    }));
+  };
+
+  // stream wrapper detects first user-visible chunk:
+  return this.processStreamGenerator(stream, ..., {
+    onFirstUserVisibleChunk: (now) => {
+      ttftMs = now - attemptStart;
+    },
+  });
+}
+```
+
+At span end (already in Phase 1's `endLLMRequestSpan` flow), include the new fields in `LLMRequestMetadata`:
+
+```ts
+endLLMRequestSpan(llmSpan, {
+  success: true,
+  inputTokens,
+  outputTokens,
+  cachedInputTokens,
+  ttftMs,
+  requestSetupMs: attemptStart - requestEntryTime,
+  attempt: finalAttempt,
+  retryTotalDelayMs,
+});
+```
+
+### Non-streaming path
+
+`generateContent` (`loggingContentGenerator.ts:212`) does not produce streaming chunks. TTFT is `undefined`; `request_setup_ms` is still meaningful (captures retry overhead). The breakdown metric records 2 phases (REQUEST_PREPARATION + RESPONSE_PROCESSING where `RESPONSE_PROCESSING = duration_ms - request_setup_ms`), not 3.
+
+### Retry layer integration (4 sites)
+
+Each of the 4 LLM `retryWithBackoff` call sites adds `onRetry`:
+
+```ts
+// client.ts:1540 (similar at baseLlmClient.ts:193, 282, geminiChat.ts:1039)
+const result = await retryWithBackoff(apiCall, {
+  ...existingOptions,
+  onRetry: (info) => {
+    logApiRetry(
+      this.config,
+      new ApiRetryEvent({
+        model,
+        promptId: userPromptId,
+        attemptNumber: info.attempt,
+        error: info.error,
+        statusCode: info.errorStatus,
+        retryDelayMs: info.delayMs,
+      }),
+    );
+    // also feed back into LoggingContentGenerator's local retry accumulator
+    // (when in scope — for callers that don't go through LoggingContentGenerator,
+    // the LLM span still gets `attempt` and `retry_total_delay_ms` via the
+    // metadata path because endLLMRequestSpan is called at the LLM layer)
+  },
+});
+```
+
+The non-LLM caller (`channels/weixin/src/api.ts`) **does not register `onRetry`** — no `ApiRetryEvent` is emitted for its retries, matching today's behavior.
+
+## Concurrent safety — the headline guarantee
+
+`LoggingContentGenerator` instance is shared (one per `ContentGenerator`, `contentGenerator.ts:377`). Three concurrent `generateContentStream` calls (e.g., 3 subagents fan out via `coreToolScheduler.runConcurrently`) execute three independent closures of `generateContentStream`:
+
+```
+call_A: attemptStart_A, ttftMs_A, ... (closure)
+call_B: attemptStart_B, ttftMs_B, ... (closure)
+call_C: attemptStart_C, ttftMs_C, ... (closure)
+```
+
+Per-call locals never overlap. Stream chunks are detected against the local `attemptStart` of each call. Span attributes are set at each call's own `endLLMRequestSpan`.
+
+`AsyncLocalStorageContextManager` (registered by NodeSDK at `sdk.ts:273`) already ensures the active OTel context — and thus the parent span passed to `startLLMRequestSpan` — is correct per fiber.
+
+## Files to change
+
+| File                                                                             | Change                                                                                                                                                                                                                                    | LOC est |
+| -------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------- |
+| `packages/core/src/telemetry/constants.ts`                                       | Add `EVENT_API_RETRY` constant                                                                                                                                                                                                            | +2      |
+| `packages/core/src/telemetry/types.ts`                                           | Add `ApiRetryEvent` class + union member                                                                                                                                                                                                  | +40     |
+| `packages/core/src/telemetry/loggers.ts`                                         | Add `logApiRetry()` function                                                                                                                                                                                                              | +20     |
+| `packages/core/src/telemetry/qwen-logger/qwen-logger.ts`                         | Add `logApiRetryEvent()` for RUM downstream consistency                                                                                                                                                                                   | +20     |
+| `packages/core/src/telemetry/session-tracing.ts`                                 | Extend `LLMRequestMetadata` (ttftMs, requestSetupMs, attempt, retryTotalDelayMs); extend `endLLMRequestSpan` to set new attrs + breakdown metric + dual-emit gen_ai.\*                                                                    | +60     |
+| `packages/core/src/telemetry/metrics.ts`                                         | Wire `recordApiRequestBreakdown` callsite inside `endLLMRequestSpan` (no change to the existing recorder)                                                                                                                                 | 0       |
+| `packages/core/src/utils/retry.ts`                                               | Add `onRetry?: (info: RetryAttemptInfo) => void` to RetryOptions; export `RetryAttemptInfo`; invoke callback in the existing logRetryAttempt site                                                                                         | +25     |
+| `packages/core/src/core/loggingContentGenerator/loggingContentGenerator.ts`      | TTFT capture: method-local accumulators + `hasUserVisibleContent` helper + first-chunk detection in stream wrapper; pass new metadata to `endLLMRequestSpan`                                                                              | +80     |
+| `packages/core/src/core/client.ts`                                               | Wire `onRetry` callback at `retryWithBackoff` call site (`client.ts:1540`)                                                                                                                                                                | +15     |
+| `packages/core/src/core/baseLlmClient.ts`                                        | Wire `onRetry` callback at 2 `retryWithBackoff` call sites                                                                                                                                                                                | +25     |
+| `packages/core/src/core/geminiChat.ts`                                           | Wire `onRetry` callback at `retryWithBackoff` call site (`geminiChat.ts:1039`)                                                                                                                                                            | +15     |
+| `packages/core/src/telemetry/session-tracing.test.ts`                            | `endLLMRequestSpan` sets ttft_ms / request_setup_ms / attempt / retry_total_delay_ms / sampling_ms / output_tokens_per_second + gen_ai dual-emit + breakdown metric (each phase) + idempotent end                                         | +120    |
+| `packages/core/src/core/loggingContentGenerator/loggingContentGenerator.test.ts` | `hasUserVisibleContent` (text / functionCall / inlineData / executableCode / thought / role-only / usage-only); concurrent calls don't cross-contaminate; TTFT undefined when aborted before first chunk; TTFT undefined on non-streaming | +100    |
+| `packages/core/src/utils/retry.test.ts`                                          | `onRetry` invoked per failed attempt with correct `attempt`, `delayMs`, `error`, `errorStatus`; absence of `onRetry` is silent (no telemetry emitted)                                                                                     | +50     |
+| `packages/core/src/telemetry/loggers.test.ts`                                    | `logApiRetry` emits LogRecord with expected payload; bridges through LogToSpanProcessor to nested span under active LLM span                                                                                                              | +40     |
+
+Total: 14 files, ~610 LOC. Larger than Phase 2 (#4321) but comparable to Phase 3 (#4410) and justified by the breadth of integration (4 retry sites + telemetry plumbing + streaming wrapper).
+
+If review pushes back on size: split into **Phase 4a + 4b + 4c**:
+
+- **4a** (~200 LOC): TTFT capture + extended `LLMRequestMetadata` + dual-emit. Self-contained value (TTFT visibility from day one).
+- **4b** (~250 LOC): `onRetry` callback + `ApiRetryEvent` + 4 caller wiring. **Independently a bug fix** for the `retryWithBackoff` telemetry gap.
+- **4c** (~160 LOC): `recordApiRequestBreakdown` activation + parent span aggregation attrs (`attempt`, `retry_total_delay_ms`, `sampling_ms`, `output_tokens_per_second`). Depends on 4a + 4b.
+
+## Testing strategy
+
+| Test                                                                                                                                         | What it proves                        |
+| -------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------- |
+| `hasUserVisibleContent` returns true for text/functionCall/inlineData/executableCode/thought                                                 | D1 semantics across part types        |
+| `hasUserVisibleContent` returns false for role-only and usage-only chunks                                                                    | D1 negative cases                     |
+| streaming: TTFT measured from attempt start to first user-visible chunk                                                                      | End-to-end TTFT detection             |
+| streaming: TTFT undefined if stream aborts before any user-visible chunk                                                                     | Edge case                             |
+| streaming: TTFT computed from final attempt's start (not first attempt)                                                                      | D3 — TTFT reset on retry              |
+| non-streaming: TTFT remains undefined                                                                                                        | S3 decision                           |
+| concurrent `generateContentStream` calls don't cross-contaminate TTFT                                                                        | D2 — method-local guarantee           |
+| `endLLMRequestSpan` sets all Phase 4 attrs (ttft_ms, request_setup_ms, sampling_ms, attempt, retry_total_delay_ms, output_tokens_per_second) | Attribute presence                    |
+| `endLLMRequestSpan` dual-emits gen_ai.server.time_to_first_token + gen_ai.usage.\* + gen_ai.request.model                                    | D8 dual-emit                          |
+| `endLLMRequestSpan` records breakdown metric with 3 phases for streaming, 2 for non-streaming                                                | D6                                    |
+| `endLLMRequestSpan` called twice: metric recorded exactly once, attrs not re-set                                                             | D7 idempotency                        |
+| `retryWithBackoff` with `onRetry`: callback invoked per failed attempt with correct args                                                     | D4 callback contract                  |
+| `retryWithBackoff` without `onRetry`: no telemetry emitted (silent for non-LLM callers)                                                      | P2 — channels/weixin scope protection |
+| `client.ts` / `baseLlmClient.ts` / `geminiChat.ts` retry callsites emit `ApiRetryEvent` on retry                                             | Integration of D4 at 4 sites          |
+| `ApiRetryEvent` LogRecord bridges via LogToSpanProcessor to a child span under active LLM span                                               | Trace tree correctness                |
+| LLM span `attempt` field correctly reflects final attempt number under retries                                                               | D5 aggregation                        |
+| LLM span `retry_total_delay_ms` correctly sums onRetry delays                                                                                | D5 aggregation                        |
+| `output_tokens_per_second` undefined when `sampling_ms === 0` (no streaming)                                                                 | Avoid divide-by-zero                  |
+
+## Edge cases
+
+| Case                                                                    | Handling                                                                                                                                                                                                                 |
+| ----------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
+| Stream aborts before any chunk arrives                                  | `ttftMs = undefined`, `sampling_ms = undefined`, `output_tokens_per_second = undefined`. `attempt`, `request_setup_ms` still set. `success = false`                                                                      |
+| Stream aborts after first chunk                                         | `ttftMs` set; `sampling_ms` = `duration_ms - ttftMs - request_setup_ms`; reflects partial response time. `success = false`                                                                                               |
+| Retry succeeds on attempt 1 (no retries)                                | `attempt = 1`, `retry_total_delay_ms = 0`, no `ApiRetryEvent` emitted, breakdown metric records `request_setup_ms` close to 0                                                                                            |
+| Persistent retry mode 50+ attempts                                      | 50+ `ApiRetryEvent` records emitted (out-of-scope cap deferred); LLM span `attempt = 51`, `retry_total_delay_ms = sum of all delays`. Operator sees aggregated view on span; full per-attempt detail in log-bridge spans |
+| Non-LLM `retryWithBackoff` caller (channels/weixin)                     | No `onRetry` registered; only existing `debugLogger.warn` fires. No `ApiRetryEvent`; no breakdown metric (caller isn't an LLM site)                                                                                      |
+| `endLLMRequestSpan` called twice (abort + error race)                   | Phase 1.5 guard at `activeSpans.delete()` returns early on second call; `recordApiRequestBreakdown` is inside the guard, recorded exactly once                                                                           |
+| Anthropic `message_start` chunk arrives before content                  | `hasUserVisibleContent` returns false for it (no parts with text/functionCall/etc.); TTFT not triggered until subsequent `content_block_delta` chunk                                                                     |
+| OpenAI first chunk with empty `delta.content` but `role` only           | `hasUserVisibleContent` returns false; TTFT not triggered until first chunk with non-empty delta                                                                                                                         |
+| Tool-call-only response (no text)                                       | First chunk with `functionCall` Part triggers TTFT; `output_tokens_per_second` computed against tool-call token count                                                                                                    |
+| Concurrent subagents (3 calls in flight)                                | Each call's closure has its own `attemptStart`, `ttftMs`, `attemptStartTimes`. Per-call span receives its own metadata at `endLLMRequestSpan`. No interleaving (D2)                                                      |
+| SDK-level retries inside openai-sdk (`maxRetries=3`)                    | Invisible to qwen-code telemetry — happens entirely inside SDK before retryWithBackoff sees the request. `attempt` reflects retryWithBackoff attempts only. Out of scope (see Out-of-scope)                              |
+| `gen_ai.server.time_to_first_token` spec renames before reaching Stable | Single-file update: `session-tracing.ts:endLLMRequestSpan`. The qwen-code-native `ttft_ms` stays authoritative — no downstream impact                                                                                    |
+| Subagent's LLM request                                                  | Parent is the subagent span (Phase 3). Phase 4 fields nest correctly. Aggregations grouped by `qwen-code.subagent.id` give per-subagent LLM perf — design-doc-future, easy follow-up                                     |
+| Reasoning model with long thought blocks                                | First `thought` Part triggers TTFT; `sampling_ms` includes both thinking + answer phases. Split into separate metrics deferred                                                                                           |
+
+## Rollback
+
+The change is additive at the OTel and metric level — every new attribute is optional, every new event is a new class. Existing dashboards that don't filter on the new fields keep working unchanged.
+
+Behavior-affecting changes:
+
+- New `ApiRetryEvent` LogRecord starts flowing → log volume increases proportional to retry rate (typically <1% of requests retry). Mitigate by sampling LogRecord at the SDK layer if needed.
+- New breakdown metric `qwen-code.api.request.breakdown` starts producing time series → mild Prometheus cardinality bump (`{model, phase}` — bounded).
+- `output_tokens_per_second` derived attribute may appear unusual on dashboards filtering "all attributes" — document.
+
+Rollback path: revert the single PR (or each of 4a/4b/4c independently). All new fields use defensive defaults (undefined / 0) and don't change span structure.
+
+## Sequencing
+
+- **After Phase 3 (#4410, in review)**: not a hard dependency. Phase 4 attributes attach to `qwen-code.llm_request` spans regardless of whether they're under a `qwen-code.subagent` (Phase 3) or `qwen-code.interaction` (Phase 1) parent. Recommend Phase 3 land first so per-attempt aggregation under subagent subtrees works naturally.
+- **Independent of #4384** (`traceparent` + `X-Qwen-Code-Session-Id` outbound propagation). They touch the HTTP layer; Phase 4 touches the stream/retry/metric layer.
+- **Independent of `clearDetailedSpanState` chat-compression follow-up** (#4097 follow-up). Different surface.
+
+## Open questions
+
+1. **`onRetry` callback firing semantics**: invoked **before** backoff sleep (current proposal) or **after** (when the next attempt is about to start)? Before is simpler — callback has all the info immediately; after would require capturing the just-completed delay separately. Pre-sleep is the recommendation; document in callback contract.
+2. **Per-attempt timing on the LLM span**: should we add `attempt_durations_ms: number[]` array? OTel supports array-of-primitive attributes. Useful for "which attempt of N was slow" diagnostics. Defer until production data shows demand — log-bridge spans already carry the equivalent.
+3. **Persistent retry mode emission cap**: at what `attempt > N` threshold should we start sampling? `N = 5` then 1-in-10? `N = 10` then summary-only? Defer until we have production volume data.
+4. **`TOKEN_PROCESSING` phase**: keep enum value dormant or wire it to something (e.g., consolidation time)? Defer — wait for a real use case.
+5. **Subagent-level LLM rollups**: trivial follow-up once Phase 4 lands — sum `ttft_ms`/`output_tokens`/`input_tokens` per subagent subtree. Not Phase 4 scope but the data flow enables it.
--- a/packages/core/src/core/loggingContentGenerator/loggingContentGenerator.test.ts
+++ b/packages/core/src/core/loggingContentGenerator/loggingContentGenerator.test.ts
@ -165,6 +165,11 @@ vi.mock('../../telemetry/index.js', () => {
          success: boolean;
          inputTokens?: number;
          outputTokens?: number;
+          cachedInputTokens?: number;
+          ttftMs?: number;
+          requestSetupMs?: number;
+          attempt?: number;
+          retryTotalDelayMs?: number;
          durationMs?: number;
          error?: string;
        },
@ -616,6 +621,150 @@ describe('LoggingContentGenerator', () => {
    });
  });

+  it('captures ttftMs on the first user-visible stream chunk (Phase 4a)', async () => {
+    // Two chunks: first has text (user-visible), second has only usage.
+    // ttftMs must be set on the first chunk and not overwritten by the second.
+    const streamFn = vi.fn().mockResolvedValue(
+      (async function* () {
+        yield createResponse('r1', 'test-model', [{ text: 'hi' }]);
+        yield createResponse('r2', 'test-model', [], {
+          promptTokenCount: 10,
+          candidatesTokenCount: 2,
+          totalTokenCount: 12,
+        });
+      })(),
+    );
+    const wrapped = createWrappedGenerator(vi.fn(), streamFn);
+    const generator = new LoggingContentGenerator(wrapped, createConfig(), {
+      model: 'test-model',
+      authType: AuthType.USE_OPENAI,
+      enableOpenAILogging: false,
+    });
+    const request = {
+      model: 'test-model',
+      contents: 'Hello',
+    } as unknown as GenerateContentParameters;
+
+    const stream = await generator.generateContentStream(
+      request,
+      'prompt-ttft',
+    );
+    for await (const _ of stream) {
+      // consume
+    }
+
+    const spanRecord = getStreamSpanRecord();
+    const meta = spanRecord.endMetadata as { ttftMs?: number } | undefined;
+    expect(meta).toBeDefined();
+    expect(typeof meta!.ttftMs).toBe('number');
+    expect(meta!.ttftMs!).toBeGreaterThanOrEqual(0);
+  });
+
+  it('forwards cachedInputTokens from usageMetadata to endLLMRequestSpan (Phase 4a)', async () => {
+    const streamFn = vi.fn().mockResolvedValue(
+      (async function* () {
+        yield createResponse('r1', 'test-model', [{ text: 'ok' }], {
+          promptTokenCount: 100,
+          candidatesTokenCount: 20,
+          cachedContentTokenCount: 40,
+          totalTokenCount: 160,
+        });
+      })(),
+    );
+    const wrapped = createWrappedGenerator(vi.fn(), streamFn);
+    const generator = new LoggingContentGenerator(wrapped, createConfig(), {
+      model: 'test-model',
+      authType: AuthType.USE_OPENAI,
+      enableOpenAILogging: false,
+    });
+    const request = {
+      model: 'test-model',
+      contents: 'Hello',
+    } as unknown as GenerateContentParameters;
+
+    const stream = await generator.generateContentStream(
+      request,
+      'prompt-cache',
+    );
+    for await (const _ of stream) {
+      // consume
+    }
+
+    const spanRecord = getStreamSpanRecord();
+    expect(spanRecord.endMetadata).toMatchObject({
+      success: true,
+      inputTokens: 100,
+      cachedInputTokens: 40,
+    });
+  });
+
+  it('leaves ttftMs undefined when stream yields no user-visible chunks (Phase 4a)', async () => {
+    // Stream emits only usage-metadata chunks (no text/functionCall/etc).
+    // ttftMs must stay undefined — TTFT is only meaningful when content arrives.
+    const streamFn = vi.fn().mockResolvedValue(
+      (async function* () {
+        yield createResponse('r1', 'test-model', [], {
+          promptTokenCount: 5,
+          candidatesTokenCount: 0,
+          totalTokenCount: 5,
+        });
+      })(),
+    );
+    const wrapped = createWrappedGenerator(vi.fn(), streamFn);
+    const generator = new LoggingContentGenerator(wrapped, createConfig(), {
+      model: 'test-model',
+      authType: AuthType.USE_OPENAI,
+      enableOpenAILogging: false,
+    });
+    const request = {
+      model: 'test-model',
+      contents: 'Hello',
+    } as unknown as GenerateContentParameters;
+
+    const stream = await generator.generateContentStream(
+      request,
+      'prompt-no-content',
+    );
+    for await (const _ of stream) {
+      // consume
+    }
+
+    const spanRecord = getStreamSpanRecord();
+    const meta = spanRecord.endMetadata as { ttftMs?: number } | undefined;
+    expect(meta!.ttftMs).toBeUndefined();
+  });
+
+  it('forwards cachedInputTokens to endLLMRequestSpan on non-stream success (Phase 4a)', async () => {
+    const generateFn = vi.fn().mockResolvedValue(
+      createResponse('resp-cache', 'test-model', [{ text: 'ok' }], {
+        promptTokenCount: 100,
+        candidatesTokenCount: 30,
+        cachedContentTokenCount: 60,
+        totalTokenCount: 190,
+      }),
+    );
+    const wrapped = createWrappedGenerator(generateFn, vi.fn());
+    const generator = new LoggingContentGenerator(wrapped, createConfig(), {
+      model: 'test-model',
+      authType: AuthType.USE_OPENAI,
+      enableOpenAILogging: false,
+    });
+    const request = {
+      model: 'test-model',
+      contents: 'Hi',
+    } as unknown as GenerateContentParameters;
+
+    await generator.generateContent(request, 'prompt-cache-non-stream');
+
+    const spanRecord = getGenerateContentSpanRecord();
+    expect(spanRecord.endMetadata).toMatchObject({
+      success: true,
+      inputTokens: 100,
+      outputTokens: 30,
+      cachedInputTokens: 60,
+    });
+  });
+
  it('preserves non-stream success when response and OpenAI logging fail', async () => {
    vi.mocked(logApiResponse).mockImplementationOnce(() => {
      throw new Error('response-log-fail');
--- a/packages/core/src/core/loggingContentGenerator/loggingContentGenerator.ts
+++ b/packages/core/src/core/loggingContentGenerator/loggingContentGenerator.ts
@ -61,6 +61,7 @@ import {
  API_CALL_ABORTED_SPAN_STATUS_MESSAGE,
  API_CALL_FAILED_SPAN_STATUS_MESSAGE,
 } from '../../telemetry/tracer.js';
+import { hasUserVisibleContent } from './streamContentDetection.js';

 const debugLogger = createDebugLogger('LOGGING_CONTENT_GENERATOR');

@ -285,6 +286,7 @@ export class LoggingContentGenerator implements ContentGenerator {
        success: true,
        inputTokens: response.usageMetadata?.promptTokenCount,
        outputTokens: response.usageMetadata?.candidatesTokenCount,
+        cachedInputTokens: response.usageMetadata?.cachedContentTokenCount,
        durationMs: Date.now() - startTime,
      });
      return response;
@ -462,6 +464,14 @@ export class LoggingContentGenerator implements ContentGenerator {
    let firstModelVersion = '';
    let lastUsageMetadata: GenerateContentResponseUsageMetadata | undefined;
    let errorOccurred = false;
+
+    // TTFT (time to first token): wall-clock from generateContentStream
+    // dispatch to the first stream chunk containing user-visible content.
+    // Method-local closure variable — NEVER an instance field — because
+    // LoggingContentGenerator is shared across concurrent generateContentStream
+    // calls (one per ContentGenerator, see contentGenerator.ts:createContentGenerator).
+    // See docs/design/telemetry-llm-request-timing-design.md (D1, D2).
+    let ttftMs: number | undefined;
    // Tracks whether the idle timeout fired and ended the span. If so,
    // a resumed-after-timeout consumer must not call endLLMRequestSpan
    // again (the helper would no-op, but more importantly we skip the
@ -516,6 +526,13 @@ export class LoggingContentGenerator implements ContentGenerator {
        if (response.usageMetadata) {
          lastUsageMetadata = response.usageMetadata;
        }
+        // Capture TTFT on the first stream chunk that contains user-visible
+        // content. hasUserVisibleContent skips role-only / usageMetadata-only
+        // chunks, so TTFT reflects "model produced something the operator can
+        // attribute to user-perceived latency."
+        if (ttftMs === undefined && hasUserVisibleContent(response)) {
+          ttftMs = Date.now() - startTime;
+        }
        resetSpanTimeout?.();
        yield response;
      }
@ -601,6 +618,8 @@ export class LoggingContentGenerator implements ContentGenerator {
          success: !errorOccurred,
          inputTokens: lastUsageMetadata?.promptTokenCount,
          outputTokens: lastUsageMetadata?.candidatesTokenCount,
+          cachedInputTokens: lastUsageMetadata?.cachedContentTokenCount,
+          ttftMs,
          durationMs: Date.now() - startTime,
          error: errorOccurred
            ? aborted
--- a/packages/core/src/core/loggingContentGenerator/streamContentDetection.test.ts
+++ b/packages/core/src/core/loggingContentGenerator/streamContentDetection.test.ts
@ -0,0 +1,131 @@
+/**
+ * @license
+ * Copyright 2026 Qwen Team
+ * SPDX-License-Identifier: Apache-2.0
+ */
+
+import { describe, expect, it } from 'vitest';
+import { GenerateContentResponse } from '@google/genai';
+import { hasUserVisibleContent } from './streamContentDetection.js';
+
+function chunkWithParts(parts: unknown[]): GenerateContentResponse {
+  const r = new GenerateContentResponse();
+  r.candidates = [
+    {
+      content: { role: 'model', parts: parts as never },
+    },
+  ];
+  return r;
+}
+
+describe('hasUserVisibleContent', () => {
+  it('returns true for non-empty text part', () => {
+    expect(hasUserVisibleContent(chunkWithParts([{ text: 'hi' }]))).toBe(true);
+  });
+
+  it('returns false for empty text part', () => {
+    expect(hasUserVisibleContent(chunkWithParts([{ text: '' }]))).toBe(false);
+  });
+
+  it('returns true for functionCall part', () => {
+    expect(
+      hasUserVisibleContent(
+        chunkWithParts([{ functionCall: { name: 'read', args: {} } }]),
+      ),
+    ).toBe(true);
+  });
+
+  it('returns true for inlineData part', () => {
+    expect(
+      hasUserVisibleContent(
+        chunkWithParts([
+          { inlineData: { mimeType: 'image/png', data: 'abc' } },
+        ]),
+      ),
+    ).toBe(true);
+  });
+
+  it('returns true for executableCode part', () => {
+    expect(
+      hasUserVisibleContent(
+        chunkWithParts([
+          { executableCode: { language: 'PYTHON', code: 'print(1)' } },
+        ]),
+      ),
+    ).toBe(true);
+  });
+
+  it('returns true for thought / reasoning part with thought: true', () => {
+    expect(hasUserVisibleContent(chunkWithParts([{ thought: true }]))).toBe(
+      true,
+    );
+  });
+
+  it('returns false for thought: false (explicit non-thought part)', () => {
+    // Codebase convention: `thought` is a boolean flag where false means
+    // "explicitly not a thought." A part with only `thought: false` and no
+    // other content must not trigger TTFT.
+    expect(hasUserVisibleContent(chunkWithParts([{ thought: false }]))).toBe(
+      false,
+    );
+  });
+
+  it('returns false for thought: undefined / missing (default non-thought)', () => {
+    // A bare object without the `thought` key is the common case for non-thinking
+    // chunks; must not match the thought branch.
+    expect(hasUserVisibleContent(chunkWithParts([{}]))).toBe(false);
+  });
+
+  it('returns true when thought: true coexists with empty text', () => {
+    // First Anthropic <thinking> chunk often arrives as { text: '', thought: true }.
+    // Per design doc D1, "thought / reasoning content" is user-visible — TTFT fires.
+    expect(
+      hasUserVisibleContent(chunkWithParts([{ text: '', thought: true }])),
+    ).toBe(true);
+  });
+
+  it('returns true when any part is user-visible (mixed)', () => {
+    expect(
+      hasUserVisibleContent(chunkWithParts([{ text: '' }, { text: 'hi' }])),
+    ).toBe(true);
+  });
+
+  it('returns false for empty parts array', () => {
+    expect(hasUserVisibleContent(chunkWithParts([]))).toBe(false);
+  });
+
+  it('returns false when candidates is missing', () => {
+    const r = new GenerateContentResponse();
+    expect(hasUserVisibleContent(r)).toBe(false);
+  });
+
+  it('returns false when content is missing', () => {
+    const r = new GenerateContentResponse();
+    r.candidates = [{}];
+    expect(hasUserVisibleContent(r)).toBe(false);
+  });
+
+  it('returns false when parts is undefined', () => {
+    const r = new GenerateContentResponse();
+    r.candidates = [{ content: { role: 'model' } }];
+    expect(hasUserVisibleContent(r)).toBe(false);
+  });
+
+  it('returns false for usage-only / role-only chunks', () => {
+    const r = new GenerateContentResponse();
+    r.candidates = [{ content: { role: 'model', parts: [] } }];
+    r.usageMetadata = { totalTokenCount: 42 };
+    expect(hasUserVisibleContent(r)).toBe(false);
+  });
+
+  it('handles parts that are non-objects defensively', () => {
+    expect(
+      hasUserVisibleContent(
+        chunkWithParts([null, undefined, 'string', 42, { text: 'real' }]),
+      ),
+    ).toBe(true);
+    expect(hasUserVisibleContent(chunkWithParts([null, undefined, 'x']))).toBe(
+      false,
+    );
+  });
+});
--- a/packages/core/src/core/loggingContentGenerator/streamContentDetection.ts
+++ b/packages/core/src/core/loggingContentGenerator/streamContentDetection.ts
@ -0,0 +1,57 @@
+/**
+ * @license
+ * Copyright 2026 Qwen Team
+ * SPDX-License-Identifier: Apache-2.0
+ */
+
+import type { GenerateContentResponse } from '@google/genai';
+
+/**
+ * Detects whether a streaming chunk contains user-visible model output.
+ *
+ * Used by the LoggingContentGenerator stream wrapper to identify the first
+ * chunk that should trigger TTFT (time-to-first-token) measurement.
+ *
+ * A chunk is "user-visible" if any normalized Part in candidates[0].content.parts
+ * is one of:
+ *   - text with a non-empty string
+ *   - functionCall (tool use — even tool-call-only responses count)
+ *   - inlineData (image, binary blob)
+ *   - executableCode (sandbox / code-execution responses)
+ *   - thought / reasoning content (provider-dependent; o1, qwen thinking, Anthropic <thinking>)
+ *
+ * Chunks containing only role metadata, only usageMetadata (final summary
+ * chunk), or empty parts are NOT user-visible — TTFT should not fire on these.
+ *
+ * Centralised here (single predicate over the normalized GenerateContentResponse
+ * shape) so the four provider generators (Anthropic / OpenAI / Gemini / Qwen)
+ * don't each need their own first-token logic. Each provider already normalizes
+ * its native chunk shape to GenerateContentResponse before LoggingContentGenerator
+ * sees it (see loggingContentGenerator.ts generateContentStream).
+ */
+export function hasUserVisibleContent(chunk: GenerateContentResponse): boolean {
+  const parts = chunk.candidates?.[0]?.content?.parts;
+  if (!parts || parts.length === 0) return false;
+  return parts.some(isUserVisiblePart);
+}
+
+function isUserVisiblePart(part: unknown): boolean {
+  if (part === null || typeof part !== 'object') return false;
+  const p = part as {
+    text?: unknown;
+    functionCall?: unknown;
+    inlineData?: unknown;
+    executableCode?: unknown;
+    thought?: unknown;
+  };
+  if (typeof p.text === 'string' && p.text.length > 0) return true;
+  if (p.functionCall !== undefined) return true;
+  if (p.inlineData !== undefined) return true;
+  if (p.executableCode !== undefined) return true;
+  // `thought` is a boolean flag in this codebase — `true` means the part
+  // carries reasoning content, false / absent means none (see loggingContentGenerator.ts
+  // where `part.thought ? {...} : {}` is the canonical pattern). Match strict `=== true`
+  // rather than checking presence — `thought: false` parts are explicitly NOT user-visible.
+  if (p.thought === true) return true;
+  return false;
+}
--- a/packages/core/src/telemetry/session-tracing.test.ts
+++ b/packages/core/src/telemetry/session-tracing.test.ts
@ -378,6 +378,214 @@ describe('session-tracing', () => {
    });
  });

+  describe('LLM request spans — Phase 4a (timing decomposition + GenAI dual-emit)', () => {
+    it('startLLMRequestSpan dual-emits gen_ai.request.model alongside qwen-code.model', () => {
+      const span = startLLMRequestSpan('test-model', 'p');
+      endLLMRequestSpan(span, { success: true });
+
+      const attrs = mockSpans[0]!.attributes;
+      expect(attrs['qwen-code.model']).toBe('test-model');
+      expect(attrs['gen_ai.request.model']).toBe('test-model');
+    });
+
+    it('endLLMRequestSpan dual-emits gen_ai.usage.input_tokens / output_tokens', () => {
+      const span = startLLMRequestSpan('m', 'p');
+      endLLMRequestSpan(span, {
+        success: true,
+        inputTokens: 100,
+        outputTokens: 50,
+      });
+
+      const attrs = mockSpans[0]!.attributes;
+      expect(attrs['input_tokens']).toBe(100);
+      expect(attrs['gen_ai.usage.input_tokens']).toBe(100);
+      expect(attrs['output_tokens']).toBe(50);
+      expect(attrs['gen_ai.usage.output_tokens']).toBe(50);
+    });
+
+    it('endLLMRequestSpan dual-emits gen_ai.usage.cached_tokens when present', () => {
+      const span = startLLMRequestSpan('m', 'p');
+      endLLMRequestSpan(span, {
+        success: true,
+        inputTokens: 100,
+        cachedInputTokens: 40,
+      });
+
+      const attrs = mockSpans[0]!.attributes;
+      expect(attrs['cached_input_tokens']).toBe(40);
+      expect(attrs['gen_ai.usage.cached_tokens']).toBe(40);
+    });
+
+    it('endLLMRequestSpan omits cached_input_tokens when undefined', () => {
+      const span = startLLMRequestSpan('m', 'p');
+      endLLMRequestSpan(span, { success: true, inputTokens: 100 });
+
+      const attrs = mockSpans[0]!.attributes;
+      expect(attrs['cached_input_tokens']).toBeUndefined();
+      expect(attrs['gen_ai.usage.cached_tokens']).toBeUndefined();
+    });
+
+    it('endLLMRequestSpan emits cached_input_tokens === 0 (cache miss is meaningful info, not undefined)', () => {
+      // Providers that report 0 cached tokens are signaling an explicit cache
+      // miss. Distinct from undefined ("we don't know"). Both attribute names
+      // must propagate the literal 0.
+      const span = startLLMRequestSpan('m', 'p');
+      endLLMRequestSpan(span, {
+        success: true,
+        inputTokens: 100,
+        cachedInputTokens: 0,
+      });
+
+      const attrs = mockSpans[0]!.attributes;
+      expect(attrs['cached_input_tokens']).toBe(0);
+      expect(attrs['gen_ai.usage.cached_tokens']).toBe(0);
+    });
+
+    it('endLLMRequestSpan writes ttft_ms and dual-emits gen_ai.server.time_to_first_token (in seconds)', () => {
+      const span = startLLMRequestSpan('m', 'p');
+      endLLMRequestSpan(span, {
+        success: true,
+        ttftMs: 234,
+        durationMs: 1000,
+      });
+
+      const attrs = mockSpans[0]!.attributes;
+      expect(attrs['ttft_ms']).toBe(234);
+      // Spec uses seconds as double — 234ms → 0.234s
+      expect(attrs['gen_ai.server.time_to_first_token']).toBeCloseTo(0.234, 6);
+    });
+
+    it('endLLMRequestSpan omits ttft_ms when undefined (non-streaming or aborted before first chunk)', () => {
+      const span = startLLMRequestSpan('m', 'p');
+      endLLMRequestSpan(span, { success: true, durationMs: 500 });
+
+      const attrs = mockSpans[0]!.attributes;
+      expect(attrs['ttft_ms']).toBeUndefined();
+      expect(attrs['gen_ai.server.time_to_first_token']).toBeUndefined();
+      expect(attrs['sampling_ms']).toBeUndefined();
+      expect(attrs['output_tokens_per_second']).toBeUndefined();
+    });
+
+    it('endLLMRequestSpan derives sampling_ms when ttftMs is set (no requestSetup)', () => {
+      const span = startLLMRequestSpan('m', 'p');
+      endLLMRequestSpan(span, {
+        success: true,
+        ttftMs: 200,
+        durationMs: 1000,
+      });
+
+      // sampling_ms = duration - ttft - (requestSetup ?? 0) = 1000 - 200 - 0
+      expect(mockSpans[0]!.attributes['sampling_ms']).toBe(800);
+    });
+
+    it('endLLMRequestSpan derives sampling_ms accounting for requestSetupMs (Phase 4b populates this)', () => {
+      const span = startLLMRequestSpan('m', 'p');
+      endLLMRequestSpan(span, {
+        success: true,
+        ttftMs: 200,
+        requestSetupMs: 300,
+        durationMs: 1000,
+      });
+
+      // sampling_ms = 1000 - 200 - 300
+      expect(mockSpans[0]!.attributes['sampling_ms']).toBe(500);
+    });
+
+    it('endLLMRequestSpan clamps sampling_ms to 0 when ttft + setup exceed duration (clock skew)', () => {
+      const span = startLLMRequestSpan('m', 'p');
+      endLLMRequestSpan(span, {
+        success: true,
+        ttftMs: 800,
+        requestSetupMs: 500,
+        durationMs: 1000,
+      });
+
+      // Math.max(0, 1000 - 800 - 500) = 0
+      expect(mockSpans[0]!.attributes['sampling_ms']).toBe(0);
+    });
+
+    it('endLLMRequestSpan derives output_tokens_per_second from sampling_ms + outputTokens', () => {
+      const span = startLLMRequestSpan('m', 'p');
+      endLLMRequestSpan(span, {
+        success: true,
+        ttftMs: 200,
+        durationMs: 1200,
+        outputTokens: 500,
+      });
+
+      // sampling_ms = 1000ms = 1s; otps = 500 / 1.0 = 500
+      expect(mockSpans[0]!.attributes['sampling_ms']).toBe(1000);
+      expect(mockSpans[0]!.attributes['output_tokens_per_second']).toBe(500);
+    });
+
+    it('endLLMRequestSpan rounds output_tokens_per_second to 2 decimals', () => {
+      const span = startLLMRequestSpan('m', 'p');
+      endLLMRequestSpan(span, {
+        success: true,
+        ttftMs: 200,
+        durationMs: 1325, // sampling_ms = 1125
+        outputTokens: 100, // otps = 100 / 1.125 = 88.888…
+      });
+
+      expect(mockSpans[0]!.attributes['output_tokens_per_second']).toBe(88.89);
+    });
+
+    it('endLLMRequestSpan omits output_tokens_per_second when sampling_ms == 0', () => {
+      const span = startLLMRequestSpan('m', 'p');
+      endLLMRequestSpan(span, {
+        success: true,
+        ttftMs: 1000,
+        durationMs: 1000,
+        outputTokens: 50,
+      });
+
+      // sampling_ms = 0 → otps would be Infinity, must be omitted
+      expect(mockSpans[0]!.attributes['sampling_ms']).toBe(0);
+      expect(
+        mockSpans[0]!.attributes['output_tokens_per_second'],
+      ).toBeUndefined();
+    });
+
+    it('endLLMRequestSpan omits output_tokens_per_second when outputTokens missing', () => {
+      const span = startLLMRequestSpan('m', 'p');
+      endLLMRequestSpan(span, {
+        success: true,
+        ttftMs: 200,
+        durationMs: 1000,
+      });
+
+      expect(
+        mockSpans[0]!.attributes['output_tokens_per_second'],
+      ).toBeUndefined();
+    });
+
+    it('endLLMRequestSpan writes Phase 4b retry placeholders when caller provides them', () => {
+      const span = startLLMRequestSpan('m', 'p');
+      endLLMRequestSpan(span, {
+        success: true,
+        attempt: 3,
+        requestSetupMs: 4500,
+        retryTotalDelayMs: 4200,
+        durationMs: 5000,
+      });
+
+      const attrs = mockSpans[0]!.attributes;
+      expect(attrs['attempt']).toBe(3);
+      expect(attrs['request_setup_ms']).toBe(4500);
+      expect(attrs['retry_total_delay_ms']).toBe(4200);
+    });
+
+    it('endLLMRequestSpan omits Phase 4b fields when caller does not provide them (Phase 4a default)', () => {
+      const span = startLLMRequestSpan('m', 'p');
+      endLLMRequestSpan(span, { success: true, durationMs: 500 });
+
+      const attrs = mockSpans[0]!.attributes;
+      expect(attrs['attempt']).toBeUndefined();
+      expect(attrs['request_setup_ms']).toBeUndefined();
+      expect(attrs['retry_total_delay_ms']).toBeUndefined();
+    });
+  });
+
  describe('tool spans', () => {
    it('creates and ends a tool span', () => {
      const span = startToolSpan('ReadFile', { 'tool.call_id': 'call-1' });
--- a/packages/core/src/telemetry/session-tracing.ts
+++ b/packages/core/src/telemetry/session-tracing.ts
@ -46,9 +46,46 @@ export interface EndInteractionOptions {
 export interface LLMRequestMetadata {
  inputTokens?: number;
  outputTokens?: number;
+  /**
+   * Tokens served from the provider's prompt cache (Anthropic
+   * cache_read_input_tokens, OpenAI prompt_tokens_details.cached_tokens, etc).
+   * Normalized to GenerateContentResponseUsageMetadata.cachedContentTokenCount
+   * by each provider generator before reaching LoggingContentGenerator.
+   */
+  cachedInputTokens?: number;
  success: boolean;
  durationMs?: number;
  error?: string;
+  /**
+   * Time from the successful attempt's request dispatch to the first stream
+   * chunk containing user-visible content (text / functionCall / inlineData /
+   * executableCode / thought). Undefined for non-streaming requests, requests
+   * aborted before the first user-visible chunk, and any path that does not
+   * pass through LoggingContentGenerator's stream wrapper.
+   *
+   * Semantics: diverges from claude-code's ttftMs (which fires on
+   * Anthropic's message_start metadata event). Matches the "time to first
+   * actual token" intent of the industry-standard TTFT name.
+   * See docs/design/telemetry-llm-request-timing-design.md (D1).
+   */
+  ttftMs?: number;
+  /**
+   * Time from generateContent/generateContentStream entry to the start of the
+   * successful attempt (ms). Includes all failed retries + backoff sleeps.
+   * Populated by the retry layer in Phase 4b; undefined in Phase 4a.
+   */
+  requestSetupMs?: number;
+  /**
+   * Final attempt number (1-based). 1 = no retries. Populated by the retry
+   * layer in Phase 4b; undefined in Phase 4a.
+   */
+  attempt?: number;
+  /**
+   * Sum of all backoff delays before the successful attempt (ms). 0 if no
+   * retries occurred. Populated by the retry layer in Phase 4b; undefined
+   * in Phase 4a.
+   */
+  retryTotalDelayMs?: number;
 }

 export interface ToolSpanMetadata {
@ -321,6 +358,10 @@ export function startLLMRequestSpan(model: string, promptId: string): Span {
    'qwen-code.model': model,
    'qwen-code.prompt_id': promptId,
    'llm_request.context': parentCtx ? 'interaction' : 'standalone',
+    // Dual-emit OTel GenAI semantic convention (Stable). Private name
+    // (qwen-code.model) remains authoritative; gen_ai.* is a compat layer
+    // for spec-aware backends. See docs/design/telemetry-llm-request-timing-design.md (D8).
+    'gen_ai.request.model': model,
  };

  const span = getTracer().startSpan(
@ -362,10 +403,55 @@ export function endLLMRequestSpan(
    const endAttributes: Attributes = { duration_ms: duration };

    if (metadata) {
-      if (metadata.inputTokens !== undefined)
+      if (metadata.inputTokens !== undefined) {
        endAttributes['input_tokens'] = metadata.inputTokens;
-      if (metadata.outputTokens !== undefined)
+        // Dual-emit OTel GenAI semconv (Stable). Private name stays authoritative.
+        endAttributes['gen_ai.usage.input_tokens'] = metadata.inputTokens;
+      }
+      if (metadata.outputTokens !== undefined) {
        endAttributes['output_tokens'] = metadata.outputTokens;
+        endAttributes['gen_ai.usage.output_tokens'] = metadata.outputTokens;
+      }
+      if (metadata.cachedInputTokens !== undefined) {
+        endAttributes['cached_input_tokens'] = metadata.cachedInputTokens;
+        // Dual-emit OTel GenAI semconv (Experimental — may rename before Stable).
+        endAttributes['gen_ai.usage.cached_tokens'] =
+          metadata.cachedInputTokens;
+      }
+      if (metadata.ttftMs !== undefined) {
+        endAttributes['ttft_ms'] = metadata.ttftMs;
+        // Dual-emit OTel GenAI semconv (Experimental). Spec unit is seconds-as-double.
+        endAttributes['gen_ai.server.time_to_first_token'] =
+          metadata.ttftMs / 1000;
+      }
+      if (metadata.requestSetupMs !== undefined) {
+        endAttributes['request_setup_ms'] = metadata.requestSetupMs;
+      }
+      if (metadata.attempt !== undefined) {
+        endAttributes['attempt'] = metadata.attempt;
+      }
+      if (metadata.retryTotalDelayMs !== undefined) {
+        endAttributes['retry_total_delay_ms'] = metadata.retryTotalDelayMs;
+      }
+      // Derived: sampling_ms = time from first user-visible chunk to end
+      // (== output generation time, excluding setup + first-token delay).
+      // Computable only when ttftMs is set. requestSetupMs defaults to 0
+      // when undefined (no retries) — this gives the correct sampling
+      // duration in both Phase 4a (no retry data) and Phase 4b (with).
+      if (metadata.ttftMs !== undefined) {
+        const samplingMs = Math.max(
+          0,
+          duration - metadata.ttftMs - (metadata.requestSetupMs ?? 0),
+        );
+        endAttributes['sampling_ms'] = samplingMs;
+        // Derived: output tokens per second during sampling. Undefined when
+        // sampling_ms is 0 (avoid divide-by-zero) or when outputTokens missing.
+        if (samplingMs > 0 && metadata.outputTokens !== undefined) {
+          endAttributes['output_tokens_per_second'] =
+            Math.round((metadata.outputTokens / (samplingMs / 1000)) * 100) /
+            100;
+        }
+      }
      endAttributes['success'] = metadata.success;
      if (metadata.error !== undefined)
        endAttributes['error'] = truncateSpanError(metadata.error);