diff --git a/docs/design/telemetry-llm-request-timing-design.md b/docs/design/telemetry-llm-request-timing-design.md new file mode 100644 index 000000000..4a41b082d --- /dev/null +++ b/docs/design/telemetry-llm-request-timing-design.md @@ -0,0 +1,538 @@ +# LLM Request Timing Decomposition Design (P3 Phase 4) + +> Issue #3731 — Phase 4 of hierarchical session tracing. Adds time-to-first-token, request-setup duration, sampling duration, and per-attempt retry telemetry to the `qwen-code.llm_request` span so operators can answer "why was this LLM call slow?" without guessing. +> +> Builds on Phase 1 (#4126), Phase 1.5 (#4302), Phase 2 (#4321). Independent of Phase 3 (#4410, in review) — recommended to land Phase 3 first so Phase 4's per-attempt fields aggregate cleanly under subagent subtrees. + +## Problem + +`qwen-code.llm_request` spans today carry only `model`, `prompt_id`, `input_tokens`, `output_tokens`, `success`, `error`, `duration_ms`. Operators reading a single trace cannot tell: + +1. **How much of `duration_ms` was the model thinking vs the network setup.** A 12-second `duration_ms` could be 11s of retries followed by 1s of fast generation, or 100ms of setup followed by 12s of slow streaming — the trace doesn't say. +2. **When the user saw the first token.** TTFT (time-to-first-token) is the standard latency SLO for chat UIs. We can't compute it; we don't capture it. +3. **What happened during retries.** `retryWithBackoff` (`utils/retry.ts:285`) only calls `debugLogger.warn` — no OTel event, no span attribute. The 4 LLM call sites that go through it (`client.ts:1540`, `baseLlmClient.ts:193,282`, `geminiChat.ts:1039`) have zero retry visibility in traces or metrics. `ContentRetryEvent` exists for content-recovery retries inside `geminiChat.ts:806,830` but not for the more common rate-limit / 5xx retries. +4. **That `api.request.breakdown` is dead code.** The metric is defined at `metrics.ts:242-251` with 4 `ApiRequestPhase` values, exported from `index.ts:117`, tested in `metrics.test.ts:646-675` — but `recordApiRequestBreakdown()` has zero callers in production code. The metric infrastructure is paid for; the data flow was never connected. + +These gaps make `qwen-code.llm_request` the least informative span in the trace tree. Tool spans (#4126/#4321) and subagent spans (#4410) both surface lifecycle phases; LLM spans collapse the entire request into one opaque duration. + +## Existing surface (no change) + +| Component | Location | Why we don't touch it | +| ------------------------------------------------------------ | ---------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| LLM request span lifecycle | `session-tracing.ts` `startLLMRequestSpan` / `endLLMRequestSpan` | Phase 1 (#4126) established the helpers. We extend the metadata interface, don't restructure | +| Active span propagation into provider generators | `loggingContentGenerator.ts:213,287` | Phase 1 (#4126) replaced `withSpan('api.*')` with native helpers; the active context already reaches the stream wrapper | +| `ContentRetryEvent` schema + consumers | `types.ts:626`, `qwen-logger.ts:947`, `loggers.ts:717` | Existing event keeps its shape and downstreams; we add a sibling event class for the `retryWithBackoff` path | +| `LogToSpanProcessor` log-bridge spans | `log-to-span-processor.ts` | ContentRetryEvent's existing bridge continues to nest under the active LLM span. Phase 4 does not change this | +| `ApiRequestPhase` enum | `metrics.ts:330-334` | Public surface (4 values). We populate 3 of the 4 from production code; leave the enum unchanged for backward compatibility | +| Per-provider chunk normalization → `GenerateContentResponse` | `loggingContentGenerator.ts:286-393` | Each provider already normalizes to Google's `GenerateContentResponse` shape before LoggingContentGenerator sees the stream. TTFT detection runs centrally over this normalized shape; no per-provider code | +| `retryWithBackoff` general-purpose retry | `utils/retry.ts:140` | Used by both LLM callers and non-LLM (`channels/weixin/src/api.ts`). We extend with an opt-in `onRetry` callback rather than hard-coupling to LLM telemetry | +| Non-streaming `generateContent` | `loggingContentGenerator.ts:212` | TTFT is not meaningful for non-streaming; the new fields stay `undefined`. Span lifecycle and existing attrs unchanged | + +## Out-of-scope (deferred) + +- **SDK-level retries** (openai SDK `maxRetries=3`, google-genai SDK internal retries). These happen entirely inside the third-party SDK; observing them requires disabling SDK retries and reimplementing in `retryWithBackoff`. Separate decision, not Phase 4. +- **Per-token streaming metrics** (inter-token latency, per-chunk size). Useful for inference-engine perf debugging, not for the user-perceived latency questions Phase 4 targets. +- **Separate TTFT for reasoning/thinking blocks.** "First token" includes thinking content (see D1). A future enhancement could split `ttft_to_reasoning_ms` vs `ttft_to_answer_ms`, but only after we know there's demand. +- **Sampling phase as a dedicated child span.** Computable from `duration_ms - ttft_ms - request_setup_ms`; child span adds nothing for OTel-only backends (claude-code uses one for Perfetto only). Stored as a span attribute instead — see D6. +- **Persistent retry mode (`QWEN_CODE_UNATTENDED_RETRY`) event-level rate limiting.** A single LLM request can produce 50+ `ContentRetryEvent` / `ApiRetryEvent` records under persistent retry. Capping emission is a follow-up — Phase 4 emits all events; if production volumes prove unbearable, add a per-span emission cap with a "+N more attempts (truncated)" summary event in a follow-up PR. +- **`TOKEN_PROCESSING` breakdown phase.** Enum value exists but qwen-code has no real post-stream local processing worth measuring (<10ms typical). Skipped in production callers; enum value retained for future use or for callers we don't control. +- **Migrating `ContentRetryEvent` onto LLM span as span events.** Same reasoning as Phase 3's `subagent_execution` LogRecord: existing consumers (qwen-logger RUM, future metrics) are tightly coupled to the LogRecord. Bridge-span coverage is good enough. + +## References (decision evidence) + +| Source | Key takeaway | +| --------------------------------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| claude-code (Anthropic) `claude.ts:1762, 1789, 1982, 2882` | TTFT captured as `Date.now() - start` on `message_start` SSE event; `start` reset per retry attempt. `requestSetupMs = start - startIncludingRetries`. `attemptStartTimes` array preserved per attempt. Confirms feasibility of the approach; their TTFT semantic is "first stream event" (we diverge to "first content" — see D1) | +| claude-code `perfettoTracing.ts:549-671` | Renders Request Setup → Attempt N (retry) → First Token → Sampling as nested B/E pairs. Demonstrates the visual decomposition; qwen-code does the same decomposition with OTel attributes since we have no Perfetto | +| claude-code `sessionTracing.ts:447` | Only `ttft_ms` makes it onto the OTel span (not `requestSetupMs`, not `samplingMs`, not per-attempt timing). We deliberately put more on the span — claude-code has Perfetto for visualization; we don't | +| opencode (sst/opencode) `session/llm.ts`, `route/client.ts` | No TTFT measurement. Single `LLM.run` Effect span covers everything. Validates that the gap exists across competing tools; not a reference for what to do | +| [OTel GenAI Semantic Conventions](https://opentelemetry.io/docs/specs/semconv/gen-ai/) (status: Development / Experimental) | `gen_ai.usage.input_tokens` (Stable), `gen_ai.usage.output_tokens` (Stable), `gen_ai.usage.cached_tokens` (Experimental), `gen_ai.request.model` (Stable), `gen_ai.server.time_to_first_token` (Experimental, seconds as double). Dual-emit pattern follows #4410 precedent | +| [OTel Trace Spec — Span Events](https://opentelemetry.io/docs/specs/otel/trace/api/#add-events) | "Events SHOULD NOT be used to record information that's better captured as Span Attributes." Confirms per-attempt info belongs on the LLM span attributes + log-bridge spans, not as Span Events on the parent | +| Phase 3 design doc (`telemetry-subagent-spans-design.md`) | Established the dual-emit pattern (`qwen-code.subagent.id` + `gen_ai.agent.id`) and the "private name is authoritative" rule. Phase 4 follows the same convention for TTFT and token fields | + +## Design — seven decisions, each justified + +### D1 — TTFT semantic: "first chunk containing user-visible content" + +TTFT measures wall-clock from the **successful attempt's** request dispatch to the **first stream chunk that contains user-visible output**. A chunk is "user-visible" if any normalized `Part` in `candidates[0].content.parts` is one of: + +- `text` with non-empty string +- `functionCall` (tool use) +- `inlineData` (image, binary) +- `executableCode` +- `thought` / reasoning content (whatever the provider surfaces — Gemini's `thought`, Anthropic's `` block, OpenAI o1 reasoning chunk) + +Chunks containing only `role` metadata or only `usageMetadata` (final usage-summary chunk) do not trigger TTFT. + +**Why not "first stream event of any kind" (claude-code's choice)**: claude-code measures TTFT at `message_start`, an Anthropic-specific metadata event that fires 50–300ms before any actual content. Their internal `headlessProfiler.ts` already separates `time_to_first_response_ms` for the "user saw something" semantic, acknowledging the distinction. qwen-code spans multiple providers (Anthropic, OpenAI, Gemini, Qwen) — picking the metadata-event semantic means TTFT for Anthropic is fundamentally different from TTFT for OpenAI (which has no analogous metadata-only first event). The user-visible-content semantic is uniform across all 4 providers and matches "time-to-first-token" literally. + +**Why include `thought` / reasoning**: from the operator's perspective, reasoning chunks are still "the model produced output." Excluding them would understate TTFT for reasoning-heavy models (o1, Qwen thinking variants). Future split into `ttft_to_reasoning_ms` vs `ttft_to_answer_ms` is possible; not Phase 4. + +**Why include tool-call-only chunks**: agent tool-decision LLM calls (one `tool_use`, no text) are common in qwen-code's workflow. Excluding them means TTFT is undefined for these requests. The `functionCall` Part is meaningful output. + +**Cross-product comparison note**: design doc explicitly states `qwen-code.ttft_ms ≈ claude-code.time_to_first_response_ms ≠ claude-code.ttft_ms`. Operators comparing across products should align on the user-visible-content semantic. + +### D2 — TTFT measurement site: method-local variables in `LoggingContentGenerator.generateContentStream` + +The first-chunk detection runs inside the existing stream wrapper at `loggingContentGenerator.ts:393` (`async function* processStreamGenerator`). Per-call variables (`start`, `ttftMs`) live in the method's closure; **never as instance fields**. + +**Why never instance fields**: `LoggingContentGenerator` is instantiated **once per `ContentGenerator`** (`contentGenerator.ts:377`) and shared across all concurrent `generateContentStream` calls — subagent fan-out, warmup queries, side-queries from `geminiChat`. An instance field would be overwritten across concurrent calls, producing nonsense TTFT for one of every two interleaved requests. + +**Why not AsyncLocalStorage**: ALS would work but adds a context-management layer for a piece of state that doesn't need to escape the method. Method-local is simpler, zero overhead, zero risk of leakage. + +```ts +// loggingContentGenerator.ts — inside generateContentStream +const attemptStart = Date.now(); // per-call local +const requestEntryTime = Date.now(); // also per-call local — see D3 +let ttftMs: number | undefined; +const attemptStartTimes: number[] = [attemptStart]; +let retryTotalDelayMs = 0; +let finalAttempt = 1; +// stream wrapper inspects each chunk; first one matching hasUserVisibleContent: +// ttftMs = Date.now() - attemptStart; +``` + +`hasUserVisibleContent(chunk)` is a small standalone helper colocated with the wrapper, exported for tests: + +```ts +function hasUserVisibleContent(chunk: GenerateContentResponse): boolean { + const parts = chunk.candidates?.[0]?.content?.parts; + if (!parts?.length) return false; + return parts.some( + (p) => + (typeof p.text === 'string' && p.text.length > 0) || + p.functionCall !== undefined || + p.inlineData !== undefined || + p.executableCode !== undefined || + // @ts-expect-error — `thought` is not on all SDK versions but providers emit it + p.thought !== undefined, + ); +} +``` + +### D3 — `request_setup_ms` computation: entry-time vs successful-attempt-start + +`request_setup_ms` measures wall-clock from `generateContentStream`/`generateContent` entry to the **start of the successful attempt** — including all failed retries, backoff sleeps, and any pre-retry preparation work. + +```ts +request_setup_ms = attemptStart_of_successful_attempt - requestEntryTime; +``` + +When `attempt === 1` and no retries happened, `request_setup_ms` is small (just SDK setup). When retries occurred, it captures the entire retry-budget overhead. + +**Putting it on the OTel span (diverges from claude-code, which puts it only on Perfetto)**: rationale at three levels: + +1. **No Perfetto** — qwen-code has no out-of-band visualization layer. OTel attributes are the only channel. +2. **Single-trace debug** — operator sees `duration_ms=12000, request_setup_ms=11500, ttft_ms=200, sampling_ms=300` → instantly diagnoses "retries ate 11.5s, model itself was fast." Computing `request_setup_ms` from other fields requires also exposing `sampling_ms`, which we do anyway (D6). +3. **Negligible cost** — 1 INT64 attribute. Same order of magnitude as the existing `input_tokens`, `output_tokens` attributes. Backend ingest cost is not material. + +### D4 — Retry telemetry: `onRetry` callback option on `retryWithBackoff` + new `ApiRetryEvent` + +`retryWithBackoff` currently calls `logRetryAttempt` (`retry.ts:343`) which only writes to `debugLogger.warn`. We extend the `RetryOptions` interface with an opt-in callback: + +```ts +// utils/retry.ts +interface RetryOptions { + // ... existing fields ... + /** + * Optional. Called once per failed attempt, before the backoff sleep. + * Receives the attempt number (1-based), the error, and the delay before + * the next attempt. Use this to emit telemetry events for LLM call sites; + * leave undefined for non-LLM callers (e.g., channels/weixin) so they + * stay silent in LLM-specific telemetry channels. + */ + onRetry?: (info: RetryAttemptInfo) => void; +} + +interface RetryAttemptInfo { + attempt: number; // 1-based, matches debugLogger output + error: unknown; + errorStatus?: number; + delayMs: number; // backoff delay before next attempt +} +``` + +The 4 LLM call sites (`client.ts:1540`, `baseLlmClient.ts:193,282`, `geminiChat.ts:1039`) register a callback that emits a new `ApiRetryEvent`: + +```ts +// types.ts — new event class, sibling to ContentRetryEvent +export class ApiRetryEvent implements BaseTelemetryEvent { + 'event.name': typeof EVENT_API_RETRY; + 'event.timestamp': string; + model: string; + prompt_id?: string; + attempt_number: number; // 1-based + error_type: string; + error_message: string; // truncated to 256 chars + status_code?: number; + retry_delay_ms: number; + // ... duration_ms set to retry_delay_ms so LogToSpanProcessor renders + // a bridge span of meaningful width + duration_ms: number; +} +``` + +**Why a new event class, not extending `ContentRetryEvent`**: + +- `ContentRetryEvent` has 2 downstream consumers (qwen-logger, log-record export). Changing its payload risks breaking them. +- The naming "content retry" semantically refers to content-recovery retries (invalid stream, schema repair) — extending it to cover rate-limit retries would muddy the schema. +- New event is additive; no consumer surprise. + +**Why not embed callback IN `retry.ts`**: `retry.ts` is called by `channels/weixin/src/api.ts` too (microsoft messaging API retries). Hard-coupling LLM telemetry inside retry.ts would emit `ApiRetryEvent` for non-LLM retries. The `onRetry` callback is opt-in per caller — LLM callers opt in, weixin caller doesn't. + +**ContentRetryEvent coexistence**: ContentRetryEvent stays as-is for content-recovery retries inside `geminiChat.ts:806,830`. ApiRetryEvent covers the rate-limit / 5xx retries from `retryWithBackoff`. The two events fire from different layers and never duplicate. Existing log-bridge behavior for both events is preserved via `LogToSpanProcessor` — both events nest under the active LLM span automatically (Phase 1 wiring ensures the LLM span is active during retries). + +**Persistent retry mode (`QWEN_CODE_UNATTENDED_RETRY`)**: a single 429-loop request may emit 50+ events. Out of scope to rate-limit emission in Phase 4 — if production volumes prove unbearable, add a per-span cap with summary event in a follow-up PR. The aggregated `attempt` and `retry_total_delay_ms` on the parent LLM span (D5) remain accurate regardless of event cap. + +### D5 — Parent LLM span aggregation: scalar attributes only (no map-typed attrs) + +OTel span attributes are scalars (`string | number | boolean | array of these`). Map-typed attributes (like `retry_count_by_status: {429:2, 503:1}`) require JSON serialization and are awkward to query. Skip them. + +| Attribute | Type | Semantic | +| -------------------------- | ------ | ----------------------------------------------------------------------------------- | +| `attempt` | int | 1-based final attempt count (`attemptStartTimes.length`) | +| `retry_total_delay_ms` | int | Sum of all `delayMs` reported by `onRetry`; 0 if no retries | +| `ttft_ms` | int | TTFT per D1; undefined for non-streaming or aborted-before-first-chunk requests | +| `request_setup_ms` | int | Per D3 | +| `sampling_ms` | int | Per D6 | +| `output_tokens_per_second` | double | Derived; `output_tokens / (sampling_ms / 1000)`; undefined when `sampling_ms === 0` | + +Per-attempt status-code distribution (e.g., "2 of the 3 attempts were 429s") is queryable from log-bridge spans of `ApiRetryEvent` records. No need to duplicate it as a flattened attribute on the parent. + +**Why `sampling_ms` and `output_tokens_per_second` on the span**: derivable but cumbersome to compute in backend queries when summing across many spans. Same cost-benefit as `request_setup_ms` (D3). + +### D6 — Activate `recordApiRequestBreakdown()` for 3 of 4 phases + +In `endLLMRequestSpan` (or the wrapper that calls it), after computing TTFT/setup/sampling, emit: + +```ts +recordApiRequestBreakdown(config, model, [ + { phase: ApiRequestPhase.REQUEST_PREPARATION, durationMs: requestSetupMs }, + { phase: ApiRequestPhase.NETWORK_LATENCY, durationMs: ttftMs }, // ttftMs = network + first-token-generation + { phase: ApiRequestPhase.RESPONSE_PROCESSING, durationMs: samplingMs }, +]); +``` + +**Why skip `TOKEN_PROCESSING`**: qwen-code does stream chunk processing inline (consolidation happens in the wrapper at `loggingContentGenerator.ts:644`); the post-stream wrap-up phase is <10ms and not architecturally distinct. Filling it with a meaningless value pollutes the histogram. Leaving the enum value unused is safe — `apiRequestBreakdownHistogram.record(value, {model, phase})` is just a histogram with `phase` as a label; missing labels are simply absent in queries. + +**Why not redefine `NETWORK_LATENCY`**: the spec name is slightly misleading (it's network + first-token-generation, not pure network latency), but: + +- The enum is part of `metrics.ts:330-334` which is exported from `index.ts:117` and tested. +- Backend dashboards may already reference these phase names. +- Renaming or adding a new phase would be a breaking change for trivially marginal accuracy improvement. + +Document the semantic in the design doc; leave the enum unchanged. + +**Why on the span path, not parallel**: keeps `recordApiRequestBreakdown` colocated with span attribute writes — single gated emission point (see D7 idempotency), single ordering invariant. + +### D7 — `endLLMRequestSpan` idempotency: metric recording gated on existing double-end guard + +Phase 1.5 (#4302) established that `endLLMRequestSpan` may be called twice (abort path + error path collision). The existing guard at `session-tracing.ts:~470` (`if (!activeSpans.has(...)) return;`) prevents double `span.end()`. Phase 4 metric recording (D6) **must sit inside the same guarded block**, before `span.end()`: + +```ts +// session-tracing.ts — endLLMRequestSpan +const llmCtx = activeSpans.get(spanRef); +if (!llmCtx) return; // already ended — double-end guard +activeSpans.delete(spanRef); // claim the end + +// ... compute duration, set attributes ... +if (metadata) { + recordApiRequestBreakdown(config, llmCtx.attributes.model, [...]); // NEW — gated + recordTokenUsageMetrics(...); // existing +} + +span.end(); +``` + +This guarantees metric is recorded **exactly once** per LLM request, matching the span lifecycle. + +**Why not record in `loggingContentGenerator`**: it doesn't see the abort path. Recording at the span lifecycle layer ensures every LLM request that opens a span produces exactly one breakdown sample, regardless of success/failure/abort. + +### D8 — GenAI semantic conventions dual-emit (private name authoritative) + +Each Phase 4 attribute that corresponds to an OTel GenAI semconv attribute is written twice on the span: + +| qwen-code private (authoritative) | GenAI semconv (compat layer) | Unit conversion | Spec status | +| ------------------------------------------ | ----------------------------------------------- | --------------- | ------------ | +| `ttft_ms` (ms, int) | `gen_ai.server.time_to_first_token` (s, double) | `ttftMs / 1000` | Experimental | +| `input_tokens` (int) | `gen_ai.usage.input_tokens` (int) | identical | Stable | +| `output_tokens` (int) | `gen_ai.usage.output_tokens` (int) | identical | Stable | +| `cached_input_tokens` (int) (when present) | `gen_ai.usage.cached_tokens` (int) | identical | Experimental | +| `qwen-code.model` (string) | `gen_ai.request.model` (string) | identical | Stable | + +**Existing token attribute names** on the LLM span (set in `endLLMRequestSpan` before Phase 4): qwen-code uses bare `input_tokens` and `output_tokens` already. Phase 4 adds the `gen_ai.usage.*` siblings to match #4410's pattern. The bare names stay; **don't rename**. + +Fields with no GenAI semconv equivalent — `request_setup_ms`, `sampling_ms`, `retry_total_delay_ms`, `attempt`, `output_tokens_per_second` — are emitted only under the qwen-code namespace. + +**Why "private authoritative, semconv as compat"**: + +- Internal dashboards, SLOs, debugLogger output, qwen-logger RUM, ARMS queries — all reference `ttft_ms` etc. Treating those as canonical avoids a flag-day migration. +- The Experimental GenAI semconv may rename `gen_ai.server.time_to_first_token` before reaching Stable. If/when it does, we update the semconv emission; the qwen-code names don't move. +- Future spec-aware backends (Datadog AI views, Honeycomb AI, ARMS GenAI dashboards) auto-pick up the `gen_ai.*` attributes without our involvement. + +**Why dual-emit unit conversion** (ms ↔ seconds): GenAI semconv chose seconds-as-double for latency; qwen-code chose ms-as-int (matches `duration_ms` already on the span). Both representations have value; the conversion is cheap. + +## Helper API (additive to `session-tracing.ts`) + +```ts +// session-tracing.ts — LLMRequestMetadata interface extended (additive) +export interface LLMRequestMetadata { + // ... existing fields: inputTokens, outputTokens, cachedInputTokens, success, error, ... + + /** Time from successful attempt start to first user-visible content chunk (ms). Undefined for non-streaming or aborted-before-first-chunk requests. */ + ttftMs?: number; + + /** Time from generateContent entry to start of successful attempt (ms). Includes all failed retries + backoff. */ + requestSetupMs?: number; + + /** Final attempt number (1-based). 1 = no retries. */ + attempt?: number; + + /** Sum of all backoff delays before the successful attempt (ms). */ + retryTotalDelayMs?: number; +} + +// No new exported helpers — Phase 4 reuses startLLMRequestSpan / endLLMRequestSpan with extended metadata. +``` + +```ts +// types.ts — new event class +export class ApiRetryEvent implements BaseTelemetryEvent { + 'event.name': typeof EVENT_API_RETRY = EVENT_API_RETRY; + 'event.timestamp': string; + model: string; + prompt_id?: string; + attempt_number: number; + error_type: string; + error_message: string; + status_code?: number; + retry_delay_ms: number; + duration_ms: number; // = retry_delay_ms, drives LogToSpanProcessor bridge span width + + constructor(opts: { model: string; promptId?: string; attemptNumber: number; error: unknown; statusCode?: number; retryDelayMs: number }) { ... } +} + +// constants.ts +export const EVENT_API_RETRY = 'qwen-code.api_retry'; + +// loggers.ts +export function logApiRetry(config: Config, event: ApiRetryEvent): void { ... } +``` + +```ts +// utils/retry.ts — RetryOptions extension +interface RetryOptions { + // ... existing ... + onRetry?: (info: RetryAttemptInfo) => void; +} + +interface RetryAttemptInfo { + attempt: number; + error: unknown; + errorStatus?: number; + delayMs: number; +} + +// Inside retryWithBackoff, where logRetryAttempt is called today: +options.onRetry?.({ attempt, error, errorStatus, delayMs: actualDelay }); +logRetryAttempt(attempt, error, errorStatus); // existing debugLogger call unchanged +``` + +## Lifecycle wiring + +### Streaming path (the common case) + +```ts +// loggingContentGenerator.ts:283 — generateContentStream +async generateContentStream(req, userPromptId): Promise> { + const requestEntryTime = Date.now(); + let attemptStart = requestEntryTime; + const attemptStartTimes: number[] = [attemptStart]; + let retryTotalDelayMs = 0; + let finalAttempt = 1; + + // Use existing startLLMRequestSpan (Phase 1) + // Pass onRetry callback to whatever retry layer is in use: + const onRetry: RetryAttemptInfo & { invoke: ... } = (info) => { + finalAttempt = info.attempt + 1; // we're about to start attempt N+1 + retryTotalDelayMs += info.delayMs; + attemptStart = Date.now() + info.delayMs; // approximate; actual reset is at top of next attempt + attemptStartTimes.push(attemptStart); + // emit ApiRetryEvent + logApiRetry(this.config, new ApiRetryEvent({ + model: req.model, + promptId: userPromptId, + attemptNumber: info.attempt, + error: info.error, + statusCode: info.errorStatus, + retryDelayMs: info.delayMs, + })); + }; + + // stream wrapper detects first user-visible chunk: + return this.processStreamGenerator(stream, ..., { + onFirstUserVisibleChunk: (now) => { + ttftMs = now - attemptStart; + }, + }); +} +``` + +At span end (already in Phase 1's `endLLMRequestSpan` flow), include the new fields in `LLMRequestMetadata`: + +```ts +endLLMRequestSpan(llmSpan, { + success: true, + inputTokens, + outputTokens, + cachedInputTokens, + ttftMs, + requestSetupMs: attemptStart - requestEntryTime, + attempt: finalAttempt, + retryTotalDelayMs, +}); +``` + +### Non-streaming path + +`generateContent` (`loggingContentGenerator.ts:212`) does not produce streaming chunks. TTFT is `undefined`; `request_setup_ms` is still meaningful (captures retry overhead). The breakdown metric records 2 phases (REQUEST_PREPARATION + RESPONSE_PROCESSING where `RESPONSE_PROCESSING = duration_ms - request_setup_ms`), not 3. + +### Retry layer integration (4 sites) + +Each of the 4 LLM `retryWithBackoff` call sites adds `onRetry`: + +```ts +// client.ts:1540 (similar at baseLlmClient.ts:193, 282, geminiChat.ts:1039) +const result = await retryWithBackoff(apiCall, { + ...existingOptions, + onRetry: (info) => { + logApiRetry( + this.config, + new ApiRetryEvent({ + model, + promptId: userPromptId, + attemptNumber: info.attempt, + error: info.error, + statusCode: info.errorStatus, + retryDelayMs: info.delayMs, + }), + ); + // also feed back into LoggingContentGenerator's local retry accumulator + // (when in scope — for callers that don't go through LoggingContentGenerator, + // the LLM span still gets `attempt` and `retry_total_delay_ms` via the + // metadata path because endLLMRequestSpan is called at the LLM layer) + }, +}); +``` + +The non-LLM caller (`channels/weixin/src/api.ts`) **does not register `onRetry`** — no `ApiRetryEvent` is emitted for its retries, matching today's behavior. + +## Concurrent safety — the headline guarantee + +`LoggingContentGenerator` instance is shared (one per `ContentGenerator`, `contentGenerator.ts:377`). Three concurrent `generateContentStream` calls (e.g., 3 subagents fan out via `coreToolScheduler.runConcurrently`) execute three independent closures of `generateContentStream`: + +``` +call_A: attemptStart_A, ttftMs_A, ... (closure) +call_B: attemptStart_B, ttftMs_B, ... (closure) +call_C: attemptStart_C, ttftMs_C, ... (closure) +``` + +Per-call locals never overlap. Stream chunks are detected against the local `attemptStart` of each call. Span attributes are set at each call's own `endLLMRequestSpan`. + +`AsyncLocalStorageContextManager` (registered by NodeSDK at `sdk.ts:273`) already ensures the active OTel context — and thus the parent span passed to `startLLMRequestSpan` — is correct per fiber. + +## Files to change + +| File | Change | LOC est | +| -------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------- | +| `packages/core/src/telemetry/constants.ts` | Add `EVENT_API_RETRY` constant | +2 | +| `packages/core/src/telemetry/types.ts` | Add `ApiRetryEvent` class + union member | +40 | +| `packages/core/src/telemetry/loggers.ts` | Add `logApiRetry()` function | +20 | +| `packages/core/src/telemetry/qwen-logger/qwen-logger.ts` | Add `logApiRetryEvent()` for RUM downstream consistency | +20 | +| `packages/core/src/telemetry/session-tracing.ts` | Extend `LLMRequestMetadata` (ttftMs, requestSetupMs, attempt, retryTotalDelayMs); extend `endLLMRequestSpan` to set new attrs + breakdown metric + dual-emit gen_ai.\* | +60 | +| `packages/core/src/telemetry/metrics.ts` | Wire `recordApiRequestBreakdown` callsite inside `endLLMRequestSpan` (no change to the existing recorder) | 0 | +| `packages/core/src/utils/retry.ts` | Add `onRetry?: (info: RetryAttemptInfo) => void` to RetryOptions; export `RetryAttemptInfo`; invoke callback in the existing logRetryAttempt site | +25 | +| `packages/core/src/core/loggingContentGenerator/loggingContentGenerator.ts` | TTFT capture: method-local accumulators + `hasUserVisibleContent` helper + first-chunk detection in stream wrapper; pass new metadata to `endLLMRequestSpan` | +80 | +| `packages/core/src/core/client.ts` | Wire `onRetry` callback at `retryWithBackoff` call site (`client.ts:1540`) | +15 | +| `packages/core/src/core/baseLlmClient.ts` | Wire `onRetry` callback at 2 `retryWithBackoff` call sites | +25 | +| `packages/core/src/core/geminiChat.ts` | Wire `onRetry` callback at `retryWithBackoff` call site (`geminiChat.ts:1039`) | +15 | +| `packages/core/src/telemetry/session-tracing.test.ts` | `endLLMRequestSpan` sets ttft_ms / request_setup_ms / attempt / retry_total_delay_ms / sampling_ms / output_tokens_per_second + gen_ai dual-emit + breakdown metric (each phase) + idempotent end | +120 | +| `packages/core/src/core/loggingContentGenerator/loggingContentGenerator.test.ts` | `hasUserVisibleContent` (text / functionCall / inlineData / executableCode / thought / role-only / usage-only); concurrent calls don't cross-contaminate; TTFT undefined when aborted before first chunk; TTFT undefined on non-streaming | +100 | +| `packages/core/src/utils/retry.test.ts` | `onRetry` invoked per failed attempt with correct `attempt`, `delayMs`, `error`, `errorStatus`; absence of `onRetry` is silent (no telemetry emitted) | +50 | +| `packages/core/src/telemetry/loggers.test.ts` | `logApiRetry` emits LogRecord with expected payload; bridges through LogToSpanProcessor to nested span under active LLM span | +40 | + +Total: 14 files, ~610 LOC. Larger than Phase 2 (#4321) but comparable to Phase 3 (#4410) and justified by the breadth of integration (4 retry sites + telemetry plumbing + streaming wrapper). + +If review pushes back on size: split into **Phase 4a + 4b + 4c**: + +- **4a** (~200 LOC): TTFT capture + extended `LLMRequestMetadata` + dual-emit. Self-contained value (TTFT visibility from day one). +- **4b** (~250 LOC): `onRetry` callback + `ApiRetryEvent` + 4 caller wiring. **Independently a bug fix** for the `retryWithBackoff` telemetry gap. +- **4c** (~160 LOC): `recordApiRequestBreakdown` activation + parent span aggregation attrs (`attempt`, `retry_total_delay_ms`, `sampling_ms`, `output_tokens_per_second`). Depends on 4a + 4b. + +## Testing strategy + +| Test | What it proves | +| -------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------- | +| `hasUserVisibleContent` returns true for text/functionCall/inlineData/executableCode/thought | D1 semantics across part types | +| `hasUserVisibleContent` returns false for role-only and usage-only chunks | D1 negative cases | +| streaming: TTFT measured from attempt start to first user-visible chunk | End-to-end TTFT detection | +| streaming: TTFT undefined if stream aborts before any user-visible chunk | Edge case | +| streaming: TTFT computed from final attempt's start (not first attempt) | D3 — TTFT reset on retry | +| non-streaming: TTFT remains undefined | S3 decision | +| concurrent `generateContentStream` calls don't cross-contaminate TTFT | D2 — method-local guarantee | +| `endLLMRequestSpan` sets all Phase 4 attrs (ttft_ms, request_setup_ms, sampling_ms, attempt, retry_total_delay_ms, output_tokens_per_second) | Attribute presence | +| `endLLMRequestSpan` dual-emits gen_ai.server.time_to_first_token + gen_ai.usage.\* + gen_ai.request.model | D8 dual-emit | +| `endLLMRequestSpan` records breakdown metric with 3 phases for streaming, 2 for non-streaming | D6 | +| `endLLMRequestSpan` called twice: metric recorded exactly once, attrs not re-set | D7 idempotency | +| `retryWithBackoff` with `onRetry`: callback invoked per failed attempt with correct args | D4 callback contract | +| `retryWithBackoff` without `onRetry`: no telemetry emitted (silent for non-LLM callers) | P2 — channels/weixin scope protection | +| `client.ts` / `baseLlmClient.ts` / `geminiChat.ts` retry callsites emit `ApiRetryEvent` on retry | Integration of D4 at 4 sites | +| `ApiRetryEvent` LogRecord bridges via LogToSpanProcessor to a child span under active LLM span | Trace tree correctness | +| LLM span `attempt` field correctly reflects final attempt number under retries | D5 aggregation | +| LLM span `retry_total_delay_ms` correctly sums onRetry delays | D5 aggregation | +| `output_tokens_per_second` undefined when `sampling_ms === 0` (no streaming) | Avoid divide-by-zero | + +## Edge cases + +| Case | Handling | +| ----------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | +| Stream aborts before any chunk arrives | `ttftMs = undefined`, `sampling_ms = undefined`, `output_tokens_per_second = undefined`. `attempt`, `request_setup_ms` still set. `success = false` | +| Stream aborts after first chunk | `ttftMs` set; `sampling_ms` = `duration_ms - ttftMs - request_setup_ms`; reflects partial response time. `success = false` | +| Retry succeeds on attempt 1 (no retries) | `attempt = 1`, `retry_total_delay_ms = 0`, no `ApiRetryEvent` emitted, breakdown metric records `request_setup_ms` close to 0 | +| Persistent retry mode 50+ attempts | 50+ `ApiRetryEvent` records emitted (out-of-scope cap deferred); LLM span `attempt = 51`, `retry_total_delay_ms = sum of all delays`. Operator sees aggregated view on span; full per-attempt detail in log-bridge spans | +| Non-LLM `retryWithBackoff` caller (channels/weixin) | No `onRetry` registered; only existing `debugLogger.warn` fires. No `ApiRetryEvent`; no breakdown metric (caller isn't an LLM site) | +| `endLLMRequestSpan` called twice (abort + error race) | Phase 1.5 guard at `activeSpans.delete()` returns early on second call; `recordApiRequestBreakdown` is inside the guard, recorded exactly once | +| Anthropic `message_start` chunk arrives before content | `hasUserVisibleContent` returns false for it (no parts with text/functionCall/etc.); TTFT not triggered until subsequent `content_block_delta` chunk | +| OpenAI first chunk with empty `delta.content` but `role` only | `hasUserVisibleContent` returns false; TTFT not triggered until first chunk with non-empty delta | +| Tool-call-only response (no text) | First chunk with `functionCall` Part triggers TTFT; `output_tokens_per_second` computed against tool-call token count | +| Concurrent subagents (3 calls in flight) | Each call's closure has its own `attemptStart`, `ttftMs`, `attemptStartTimes`. Per-call span receives its own metadata at `endLLMRequestSpan`. No interleaving (D2) | +| SDK-level retries inside openai-sdk (`maxRetries=3`) | Invisible to qwen-code telemetry — happens entirely inside SDK before retryWithBackoff sees the request. `attempt` reflects retryWithBackoff attempts only. Out of scope (see Out-of-scope) | +| `gen_ai.server.time_to_first_token` spec renames before reaching Stable | Single-file update: `session-tracing.ts:endLLMRequestSpan`. The qwen-code-native `ttft_ms` stays authoritative — no downstream impact | +| Subagent's LLM request | Parent is the subagent span (Phase 3). Phase 4 fields nest correctly. Aggregations grouped by `qwen-code.subagent.id` give per-subagent LLM perf — design-doc-future, easy follow-up | +| Reasoning model with long thought blocks | First `thought` Part triggers TTFT; `sampling_ms` includes both thinking + answer phases. Split into separate metrics deferred | + +## Rollback + +The change is additive at the OTel and metric level — every new attribute is optional, every new event is a new class. Existing dashboards that don't filter on the new fields keep working unchanged. + +Behavior-affecting changes: + +- New `ApiRetryEvent` LogRecord starts flowing → log volume increases proportional to retry rate (typically <1% of requests retry). Mitigate by sampling LogRecord at the SDK layer if needed. +- New breakdown metric `qwen-code.api.request.breakdown` starts producing time series → mild Prometheus cardinality bump (`{model, phase}` — bounded). +- `output_tokens_per_second` derived attribute may appear unusual on dashboards filtering "all attributes" — document. + +Rollback path: revert the single PR (or each of 4a/4b/4c independently). All new fields use defensive defaults (undefined / 0) and don't change span structure. + +## Sequencing + +- **After Phase 3 (#4410, in review)**: not a hard dependency. Phase 4 attributes attach to `qwen-code.llm_request` spans regardless of whether they're under a `qwen-code.subagent` (Phase 3) or `qwen-code.interaction` (Phase 1) parent. Recommend Phase 3 land first so per-attempt aggregation under subagent subtrees works naturally. +- **Independent of #4384** (`traceparent` + `X-Qwen-Code-Session-Id` outbound propagation). They touch the HTTP layer; Phase 4 touches the stream/retry/metric layer. +- **Independent of `clearDetailedSpanState` chat-compression follow-up** (#4097 follow-up). Different surface. + +## Open questions + +1. **`onRetry` callback firing semantics**: invoked **before** backoff sleep (current proposal) or **after** (when the next attempt is about to start)? Before is simpler — callback has all the info immediately; after would require capturing the just-completed delay separately. Pre-sleep is the recommendation; document in callback contract. +2. **Per-attempt timing on the LLM span**: should we add `attempt_durations_ms: number[]` array? OTel supports array-of-primitive attributes. Useful for "which attempt of N was slow" diagnostics. Defer until production data shows demand — log-bridge spans already carry the equivalent. +3. **Persistent retry mode emission cap**: at what `attempt > N` threshold should we start sampling? `N = 5` then 1-in-10? `N = 10` then summary-only? Defer until we have production volume data. +4. **`TOKEN_PROCESSING` phase**: keep enum value dormant or wire it to something (e.g., consolidation time)? Defer — wait for a real use case. +5. **Subagent-level LLM rollups**: trivial follow-up once Phase 4 lands — sum `ttft_ms`/`output_tokens`/`input_tokens` per subagent subtree. Not Phase 4 scope but the data flow enables it. diff --git a/packages/core/src/core/loggingContentGenerator/loggingContentGenerator.test.ts b/packages/core/src/core/loggingContentGenerator/loggingContentGenerator.test.ts index 36571bf86..570cc9ae5 100644 --- a/packages/core/src/core/loggingContentGenerator/loggingContentGenerator.test.ts +++ b/packages/core/src/core/loggingContentGenerator/loggingContentGenerator.test.ts @@ -165,6 +165,11 @@ vi.mock('../../telemetry/index.js', () => { success: boolean; inputTokens?: number; outputTokens?: number; + cachedInputTokens?: number; + ttftMs?: number; + requestSetupMs?: number; + attempt?: number; + retryTotalDelayMs?: number; durationMs?: number; error?: string; }, @@ -616,6 +621,150 @@ describe('LoggingContentGenerator', () => { }); }); + it('captures ttftMs on the first user-visible stream chunk (Phase 4a)', async () => { + // Two chunks: first has text (user-visible), second has only usage. + // ttftMs must be set on the first chunk and not overwritten by the second. + const streamFn = vi.fn().mockResolvedValue( + (async function* () { + yield createResponse('r1', 'test-model', [{ text: 'hi' }]); + yield createResponse('r2', 'test-model', [], { + promptTokenCount: 10, + candidatesTokenCount: 2, + totalTokenCount: 12, + }); + })(), + ); + const wrapped = createWrappedGenerator(vi.fn(), streamFn); + const generator = new LoggingContentGenerator(wrapped, createConfig(), { + model: 'test-model', + authType: AuthType.USE_OPENAI, + enableOpenAILogging: false, + }); + const request = { + model: 'test-model', + contents: 'Hello', + } as unknown as GenerateContentParameters; + + const stream = await generator.generateContentStream( + request, + 'prompt-ttft', + ); + for await (const _ of stream) { + // consume + } + + const spanRecord = getStreamSpanRecord(); + const meta = spanRecord.endMetadata as { ttftMs?: number } | undefined; + expect(meta).toBeDefined(); + expect(typeof meta!.ttftMs).toBe('number'); + expect(meta!.ttftMs!).toBeGreaterThanOrEqual(0); + }); + + it('forwards cachedInputTokens from usageMetadata to endLLMRequestSpan (Phase 4a)', async () => { + const streamFn = vi.fn().mockResolvedValue( + (async function* () { + yield createResponse('r1', 'test-model', [{ text: 'ok' }], { + promptTokenCount: 100, + candidatesTokenCount: 20, + cachedContentTokenCount: 40, + totalTokenCount: 160, + }); + })(), + ); + const wrapped = createWrappedGenerator(vi.fn(), streamFn); + const generator = new LoggingContentGenerator(wrapped, createConfig(), { + model: 'test-model', + authType: AuthType.USE_OPENAI, + enableOpenAILogging: false, + }); + const request = { + model: 'test-model', + contents: 'Hello', + } as unknown as GenerateContentParameters; + + const stream = await generator.generateContentStream( + request, + 'prompt-cache', + ); + for await (const _ of stream) { + // consume + } + + const spanRecord = getStreamSpanRecord(); + expect(spanRecord.endMetadata).toMatchObject({ + success: true, + inputTokens: 100, + cachedInputTokens: 40, + }); + }); + + it('leaves ttftMs undefined when stream yields no user-visible chunks (Phase 4a)', async () => { + // Stream emits only usage-metadata chunks (no text/functionCall/etc). + // ttftMs must stay undefined — TTFT is only meaningful when content arrives. + const streamFn = vi.fn().mockResolvedValue( + (async function* () { + yield createResponse('r1', 'test-model', [], { + promptTokenCount: 5, + candidatesTokenCount: 0, + totalTokenCount: 5, + }); + })(), + ); + const wrapped = createWrappedGenerator(vi.fn(), streamFn); + const generator = new LoggingContentGenerator(wrapped, createConfig(), { + model: 'test-model', + authType: AuthType.USE_OPENAI, + enableOpenAILogging: false, + }); + const request = { + model: 'test-model', + contents: 'Hello', + } as unknown as GenerateContentParameters; + + const stream = await generator.generateContentStream( + request, + 'prompt-no-content', + ); + for await (const _ of stream) { + // consume + } + + const spanRecord = getStreamSpanRecord(); + const meta = spanRecord.endMetadata as { ttftMs?: number } | undefined; + expect(meta!.ttftMs).toBeUndefined(); + }); + + it('forwards cachedInputTokens to endLLMRequestSpan on non-stream success (Phase 4a)', async () => { + const generateFn = vi.fn().mockResolvedValue( + createResponse('resp-cache', 'test-model', [{ text: 'ok' }], { + promptTokenCount: 100, + candidatesTokenCount: 30, + cachedContentTokenCount: 60, + totalTokenCount: 190, + }), + ); + const wrapped = createWrappedGenerator(generateFn, vi.fn()); + const generator = new LoggingContentGenerator(wrapped, createConfig(), { + model: 'test-model', + authType: AuthType.USE_OPENAI, + enableOpenAILogging: false, + }); + const request = { + model: 'test-model', + contents: 'Hi', + } as unknown as GenerateContentParameters; + + await generator.generateContent(request, 'prompt-cache-non-stream'); + + const spanRecord = getGenerateContentSpanRecord(); + expect(spanRecord.endMetadata).toMatchObject({ + success: true, + inputTokens: 100, + outputTokens: 30, + cachedInputTokens: 60, + }); + }); + it('preserves non-stream success when response and OpenAI logging fail', async () => { vi.mocked(logApiResponse).mockImplementationOnce(() => { throw new Error('response-log-fail'); diff --git a/packages/core/src/core/loggingContentGenerator/loggingContentGenerator.ts b/packages/core/src/core/loggingContentGenerator/loggingContentGenerator.ts index cfd16be7f..9a20db8f2 100644 --- a/packages/core/src/core/loggingContentGenerator/loggingContentGenerator.ts +++ b/packages/core/src/core/loggingContentGenerator/loggingContentGenerator.ts @@ -61,6 +61,7 @@ import { API_CALL_ABORTED_SPAN_STATUS_MESSAGE, API_CALL_FAILED_SPAN_STATUS_MESSAGE, } from '../../telemetry/tracer.js'; +import { hasUserVisibleContent } from './streamContentDetection.js'; const debugLogger = createDebugLogger('LOGGING_CONTENT_GENERATOR'); @@ -285,6 +286,7 @@ export class LoggingContentGenerator implements ContentGenerator { success: true, inputTokens: response.usageMetadata?.promptTokenCount, outputTokens: response.usageMetadata?.candidatesTokenCount, + cachedInputTokens: response.usageMetadata?.cachedContentTokenCount, durationMs: Date.now() - startTime, }); return response; @@ -462,6 +464,14 @@ export class LoggingContentGenerator implements ContentGenerator { let firstModelVersion = ''; let lastUsageMetadata: GenerateContentResponseUsageMetadata | undefined; let errorOccurred = false; + + // TTFT (time to first token): wall-clock from generateContentStream + // dispatch to the first stream chunk containing user-visible content. + // Method-local closure variable — NEVER an instance field — because + // LoggingContentGenerator is shared across concurrent generateContentStream + // calls (one per ContentGenerator, see contentGenerator.ts:createContentGenerator). + // See docs/design/telemetry-llm-request-timing-design.md (D1, D2). + let ttftMs: number | undefined; // Tracks whether the idle timeout fired and ended the span. If so, // a resumed-after-timeout consumer must not call endLLMRequestSpan // again (the helper would no-op, but more importantly we skip the @@ -516,6 +526,13 @@ export class LoggingContentGenerator implements ContentGenerator { if (response.usageMetadata) { lastUsageMetadata = response.usageMetadata; } + // Capture TTFT on the first stream chunk that contains user-visible + // content. hasUserVisibleContent skips role-only / usageMetadata-only + // chunks, so TTFT reflects "model produced something the operator can + // attribute to user-perceived latency." + if (ttftMs === undefined && hasUserVisibleContent(response)) { + ttftMs = Date.now() - startTime; + } resetSpanTimeout?.(); yield response; } @@ -601,6 +618,8 @@ export class LoggingContentGenerator implements ContentGenerator { success: !errorOccurred, inputTokens: lastUsageMetadata?.promptTokenCount, outputTokens: lastUsageMetadata?.candidatesTokenCount, + cachedInputTokens: lastUsageMetadata?.cachedContentTokenCount, + ttftMs, durationMs: Date.now() - startTime, error: errorOccurred ? aborted diff --git a/packages/core/src/core/loggingContentGenerator/streamContentDetection.test.ts b/packages/core/src/core/loggingContentGenerator/streamContentDetection.test.ts new file mode 100644 index 000000000..04dd0fcd2 --- /dev/null +++ b/packages/core/src/core/loggingContentGenerator/streamContentDetection.test.ts @@ -0,0 +1,131 @@ +/** + * @license + * Copyright 2026 Qwen Team + * SPDX-License-Identifier: Apache-2.0 + */ + +import { describe, expect, it } from 'vitest'; +import { GenerateContentResponse } from '@google/genai'; +import { hasUserVisibleContent } from './streamContentDetection.js'; + +function chunkWithParts(parts: unknown[]): GenerateContentResponse { + const r = new GenerateContentResponse(); + r.candidates = [ + { + content: { role: 'model', parts: parts as never }, + }, + ]; + return r; +} + +describe('hasUserVisibleContent', () => { + it('returns true for non-empty text part', () => { + expect(hasUserVisibleContent(chunkWithParts([{ text: 'hi' }]))).toBe(true); + }); + + it('returns false for empty text part', () => { + expect(hasUserVisibleContent(chunkWithParts([{ text: '' }]))).toBe(false); + }); + + it('returns true for functionCall part', () => { + expect( + hasUserVisibleContent( + chunkWithParts([{ functionCall: { name: 'read', args: {} } }]), + ), + ).toBe(true); + }); + + it('returns true for inlineData part', () => { + expect( + hasUserVisibleContent( + chunkWithParts([ + { inlineData: { mimeType: 'image/png', data: 'abc' } }, + ]), + ), + ).toBe(true); + }); + + it('returns true for executableCode part', () => { + expect( + hasUserVisibleContent( + chunkWithParts([ + { executableCode: { language: 'PYTHON', code: 'print(1)' } }, + ]), + ), + ).toBe(true); + }); + + it('returns true for thought / reasoning part with thought: true', () => { + expect(hasUserVisibleContent(chunkWithParts([{ thought: true }]))).toBe( + true, + ); + }); + + it('returns false for thought: false (explicit non-thought part)', () => { + // Codebase convention: `thought` is a boolean flag where false means + // "explicitly not a thought." A part with only `thought: false` and no + // other content must not trigger TTFT. + expect(hasUserVisibleContent(chunkWithParts([{ thought: false }]))).toBe( + false, + ); + }); + + it('returns false for thought: undefined / missing (default non-thought)', () => { + // A bare object without the `thought` key is the common case for non-thinking + // chunks; must not match the thought branch. + expect(hasUserVisibleContent(chunkWithParts([{}]))).toBe(false); + }); + + it('returns true when thought: true coexists with empty text', () => { + // First Anthropic chunk often arrives as { text: '', thought: true }. + // Per design doc D1, "thought / reasoning content" is user-visible — TTFT fires. + expect( + hasUserVisibleContent(chunkWithParts([{ text: '', thought: true }])), + ).toBe(true); + }); + + it('returns true when any part is user-visible (mixed)', () => { + expect( + hasUserVisibleContent(chunkWithParts([{ text: '' }, { text: 'hi' }])), + ).toBe(true); + }); + + it('returns false for empty parts array', () => { + expect(hasUserVisibleContent(chunkWithParts([]))).toBe(false); + }); + + it('returns false when candidates is missing', () => { + const r = new GenerateContentResponse(); + expect(hasUserVisibleContent(r)).toBe(false); + }); + + it('returns false when content is missing', () => { + const r = new GenerateContentResponse(); + r.candidates = [{}]; + expect(hasUserVisibleContent(r)).toBe(false); + }); + + it('returns false when parts is undefined', () => { + const r = new GenerateContentResponse(); + r.candidates = [{ content: { role: 'model' } }]; + expect(hasUserVisibleContent(r)).toBe(false); + }); + + it('returns false for usage-only / role-only chunks', () => { + const r = new GenerateContentResponse(); + r.candidates = [{ content: { role: 'model', parts: [] } }]; + r.usageMetadata = { totalTokenCount: 42 }; + expect(hasUserVisibleContent(r)).toBe(false); + }); + + it('handles parts that are non-objects defensively', () => { + expect( + hasUserVisibleContent( + chunkWithParts([null, undefined, 'string', 42, { text: 'real' }]), + ), + ).toBe(true); + expect(hasUserVisibleContent(chunkWithParts([null, undefined, 'x']))).toBe( + false, + ); + }); +}); diff --git a/packages/core/src/core/loggingContentGenerator/streamContentDetection.ts b/packages/core/src/core/loggingContentGenerator/streamContentDetection.ts new file mode 100644 index 000000000..972926431 --- /dev/null +++ b/packages/core/src/core/loggingContentGenerator/streamContentDetection.ts @@ -0,0 +1,57 @@ +/** + * @license + * Copyright 2026 Qwen Team + * SPDX-License-Identifier: Apache-2.0 + */ + +import type { GenerateContentResponse } from '@google/genai'; + +/** + * Detects whether a streaming chunk contains user-visible model output. + * + * Used by the LoggingContentGenerator stream wrapper to identify the first + * chunk that should trigger TTFT (time-to-first-token) measurement. + * + * A chunk is "user-visible" if any normalized Part in candidates[0].content.parts + * is one of: + * - text with a non-empty string + * - functionCall (tool use — even tool-call-only responses count) + * - inlineData (image, binary blob) + * - executableCode (sandbox / code-execution responses) + * - thought / reasoning content (provider-dependent; o1, qwen thinking, Anthropic ) + * + * Chunks containing only role metadata, only usageMetadata (final summary + * chunk), or empty parts are NOT user-visible — TTFT should not fire on these. + * + * Centralised here (single predicate over the normalized GenerateContentResponse + * shape) so the four provider generators (Anthropic / OpenAI / Gemini / Qwen) + * don't each need their own first-token logic. Each provider already normalizes + * its native chunk shape to GenerateContentResponse before LoggingContentGenerator + * sees it (see loggingContentGenerator.ts generateContentStream). + */ +export function hasUserVisibleContent(chunk: GenerateContentResponse): boolean { + const parts = chunk.candidates?.[0]?.content?.parts; + if (!parts || parts.length === 0) return false; + return parts.some(isUserVisiblePart); +} + +function isUserVisiblePart(part: unknown): boolean { + if (part === null || typeof part !== 'object') return false; + const p = part as { + text?: unknown; + functionCall?: unknown; + inlineData?: unknown; + executableCode?: unknown; + thought?: unknown; + }; + if (typeof p.text === 'string' && p.text.length > 0) return true; + if (p.functionCall !== undefined) return true; + if (p.inlineData !== undefined) return true; + if (p.executableCode !== undefined) return true; + // `thought` is a boolean flag in this codebase — `true` means the part + // carries reasoning content, false / absent means none (see loggingContentGenerator.ts + // where `part.thought ? {...} : {}` is the canonical pattern). Match strict `=== true` + // rather than checking presence — `thought: false` parts are explicitly NOT user-visible. + if (p.thought === true) return true; + return false; +} diff --git a/packages/core/src/telemetry/session-tracing.test.ts b/packages/core/src/telemetry/session-tracing.test.ts index 5e7114d58..75cb6b977 100644 --- a/packages/core/src/telemetry/session-tracing.test.ts +++ b/packages/core/src/telemetry/session-tracing.test.ts @@ -378,6 +378,214 @@ describe('session-tracing', () => { }); }); + describe('LLM request spans — Phase 4a (timing decomposition + GenAI dual-emit)', () => { + it('startLLMRequestSpan dual-emits gen_ai.request.model alongside qwen-code.model', () => { + const span = startLLMRequestSpan('test-model', 'p'); + endLLMRequestSpan(span, { success: true }); + + const attrs = mockSpans[0]!.attributes; + expect(attrs['qwen-code.model']).toBe('test-model'); + expect(attrs['gen_ai.request.model']).toBe('test-model'); + }); + + it('endLLMRequestSpan dual-emits gen_ai.usage.input_tokens / output_tokens', () => { + const span = startLLMRequestSpan('m', 'p'); + endLLMRequestSpan(span, { + success: true, + inputTokens: 100, + outputTokens: 50, + }); + + const attrs = mockSpans[0]!.attributes; + expect(attrs['input_tokens']).toBe(100); + expect(attrs['gen_ai.usage.input_tokens']).toBe(100); + expect(attrs['output_tokens']).toBe(50); + expect(attrs['gen_ai.usage.output_tokens']).toBe(50); + }); + + it('endLLMRequestSpan dual-emits gen_ai.usage.cached_tokens when present', () => { + const span = startLLMRequestSpan('m', 'p'); + endLLMRequestSpan(span, { + success: true, + inputTokens: 100, + cachedInputTokens: 40, + }); + + const attrs = mockSpans[0]!.attributes; + expect(attrs['cached_input_tokens']).toBe(40); + expect(attrs['gen_ai.usage.cached_tokens']).toBe(40); + }); + + it('endLLMRequestSpan omits cached_input_tokens when undefined', () => { + const span = startLLMRequestSpan('m', 'p'); + endLLMRequestSpan(span, { success: true, inputTokens: 100 }); + + const attrs = mockSpans[0]!.attributes; + expect(attrs['cached_input_tokens']).toBeUndefined(); + expect(attrs['gen_ai.usage.cached_tokens']).toBeUndefined(); + }); + + it('endLLMRequestSpan emits cached_input_tokens === 0 (cache miss is meaningful info, not undefined)', () => { + // Providers that report 0 cached tokens are signaling an explicit cache + // miss. Distinct from undefined ("we don't know"). Both attribute names + // must propagate the literal 0. + const span = startLLMRequestSpan('m', 'p'); + endLLMRequestSpan(span, { + success: true, + inputTokens: 100, + cachedInputTokens: 0, + }); + + const attrs = mockSpans[0]!.attributes; + expect(attrs['cached_input_tokens']).toBe(0); + expect(attrs['gen_ai.usage.cached_tokens']).toBe(0); + }); + + it('endLLMRequestSpan writes ttft_ms and dual-emits gen_ai.server.time_to_first_token (in seconds)', () => { + const span = startLLMRequestSpan('m', 'p'); + endLLMRequestSpan(span, { + success: true, + ttftMs: 234, + durationMs: 1000, + }); + + const attrs = mockSpans[0]!.attributes; + expect(attrs['ttft_ms']).toBe(234); + // Spec uses seconds as double — 234ms → 0.234s + expect(attrs['gen_ai.server.time_to_first_token']).toBeCloseTo(0.234, 6); + }); + + it('endLLMRequestSpan omits ttft_ms when undefined (non-streaming or aborted before first chunk)', () => { + const span = startLLMRequestSpan('m', 'p'); + endLLMRequestSpan(span, { success: true, durationMs: 500 }); + + const attrs = mockSpans[0]!.attributes; + expect(attrs['ttft_ms']).toBeUndefined(); + expect(attrs['gen_ai.server.time_to_first_token']).toBeUndefined(); + expect(attrs['sampling_ms']).toBeUndefined(); + expect(attrs['output_tokens_per_second']).toBeUndefined(); + }); + + it('endLLMRequestSpan derives sampling_ms when ttftMs is set (no requestSetup)', () => { + const span = startLLMRequestSpan('m', 'p'); + endLLMRequestSpan(span, { + success: true, + ttftMs: 200, + durationMs: 1000, + }); + + // sampling_ms = duration - ttft - (requestSetup ?? 0) = 1000 - 200 - 0 + expect(mockSpans[0]!.attributes['sampling_ms']).toBe(800); + }); + + it('endLLMRequestSpan derives sampling_ms accounting for requestSetupMs (Phase 4b populates this)', () => { + const span = startLLMRequestSpan('m', 'p'); + endLLMRequestSpan(span, { + success: true, + ttftMs: 200, + requestSetupMs: 300, + durationMs: 1000, + }); + + // sampling_ms = 1000 - 200 - 300 + expect(mockSpans[0]!.attributes['sampling_ms']).toBe(500); + }); + + it('endLLMRequestSpan clamps sampling_ms to 0 when ttft + setup exceed duration (clock skew)', () => { + const span = startLLMRequestSpan('m', 'p'); + endLLMRequestSpan(span, { + success: true, + ttftMs: 800, + requestSetupMs: 500, + durationMs: 1000, + }); + + // Math.max(0, 1000 - 800 - 500) = 0 + expect(mockSpans[0]!.attributes['sampling_ms']).toBe(0); + }); + + it('endLLMRequestSpan derives output_tokens_per_second from sampling_ms + outputTokens', () => { + const span = startLLMRequestSpan('m', 'p'); + endLLMRequestSpan(span, { + success: true, + ttftMs: 200, + durationMs: 1200, + outputTokens: 500, + }); + + // sampling_ms = 1000ms = 1s; otps = 500 / 1.0 = 500 + expect(mockSpans[0]!.attributes['sampling_ms']).toBe(1000); + expect(mockSpans[0]!.attributes['output_tokens_per_second']).toBe(500); + }); + + it('endLLMRequestSpan rounds output_tokens_per_second to 2 decimals', () => { + const span = startLLMRequestSpan('m', 'p'); + endLLMRequestSpan(span, { + success: true, + ttftMs: 200, + durationMs: 1325, // sampling_ms = 1125 + outputTokens: 100, // otps = 100 / 1.125 = 88.888… + }); + + expect(mockSpans[0]!.attributes['output_tokens_per_second']).toBe(88.89); + }); + + it('endLLMRequestSpan omits output_tokens_per_second when sampling_ms == 0', () => { + const span = startLLMRequestSpan('m', 'p'); + endLLMRequestSpan(span, { + success: true, + ttftMs: 1000, + durationMs: 1000, + outputTokens: 50, + }); + + // sampling_ms = 0 → otps would be Infinity, must be omitted + expect(mockSpans[0]!.attributes['sampling_ms']).toBe(0); + expect( + mockSpans[0]!.attributes['output_tokens_per_second'], + ).toBeUndefined(); + }); + + it('endLLMRequestSpan omits output_tokens_per_second when outputTokens missing', () => { + const span = startLLMRequestSpan('m', 'p'); + endLLMRequestSpan(span, { + success: true, + ttftMs: 200, + durationMs: 1000, + }); + + expect( + mockSpans[0]!.attributes['output_tokens_per_second'], + ).toBeUndefined(); + }); + + it('endLLMRequestSpan writes Phase 4b retry placeholders when caller provides them', () => { + const span = startLLMRequestSpan('m', 'p'); + endLLMRequestSpan(span, { + success: true, + attempt: 3, + requestSetupMs: 4500, + retryTotalDelayMs: 4200, + durationMs: 5000, + }); + + const attrs = mockSpans[0]!.attributes; + expect(attrs['attempt']).toBe(3); + expect(attrs['request_setup_ms']).toBe(4500); + expect(attrs['retry_total_delay_ms']).toBe(4200); + }); + + it('endLLMRequestSpan omits Phase 4b fields when caller does not provide them (Phase 4a default)', () => { + const span = startLLMRequestSpan('m', 'p'); + endLLMRequestSpan(span, { success: true, durationMs: 500 }); + + const attrs = mockSpans[0]!.attributes; + expect(attrs['attempt']).toBeUndefined(); + expect(attrs['request_setup_ms']).toBeUndefined(); + expect(attrs['retry_total_delay_ms']).toBeUndefined(); + }); + }); + describe('tool spans', () => { it('creates and ends a tool span', () => { const span = startToolSpan('ReadFile', { 'tool.call_id': 'call-1' }); diff --git a/packages/core/src/telemetry/session-tracing.ts b/packages/core/src/telemetry/session-tracing.ts index e4d5b14a2..97089df54 100644 --- a/packages/core/src/telemetry/session-tracing.ts +++ b/packages/core/src/telemetry/session-tracing.ts @@ -46,9 +46,46 @@ export interface EndInteractionOptions { export interface LLMRequestMetadata { inputTokens?: number; outputTokens?: number; + /** + * Tokens served from the provider's prompt cache (Anthropic + * cache_read_input_tokens, OpenAI prompt_tokens_details.cached_tokens, etc). + * Normalized to GenerateContentResponseUsageMetadata.cachedContentTokenCount + * by each provider generator before reaching LoggingContentGenerator. + */ + cachedInputTokens?: number; success: boolean; durationMs?: number; error?: string; + /** + * Time from the successful attempt's request dispatch to the first stream + * chunk containing user-visible content (text / functionCall / inlineData / + * executableCode / thought). Undefined for non-streaming requests, requests + * aborted before the first user-visible chunk, and any path that does not + * pass through LoggingContentGenerator's stream wrapper. + * + * Semantics: diverges from claude-code's ttftMs (which fires on + * Anthropic's message_start metadata event). Matches the "time to first + * actual token" intent of the industry-standard TTFT name. + * See docs/design/telemetry-llm-request-timing-design.md (D1). + */ + ttftMs?: number; + /** + * Time from generateContent/generateContentStream entry to the start of the + * successful attempt (ms). Includes all failed retries + backoff sleeps. + * Populated by the retry layer in Phase 4b; undefined in Phase 4a. + */ + requestSetupMs?: number; + /** + * Final attempt number (1-based). 1 = no retries. Populated by the retry + * layer in Phase 4b; undefined in Phase 4a. + */ + attempt?: number; + /** + * Sum of all backoff delays before the successful attempt (ms). 0 if no + * retries occurred. Populated by the retry layer in Phase 4b; undefined + * in Phase 4a. + */ + retryTotalDelayMs?: number; } export interface ToolSpanMetadata { @@ -321,6 +358,10 @@ export function startLLMRequestSpan(model: string, promptId: string): Span { 'qwen-code.model': model, 'qwen-code.prompt_id': promptId, 'llm_request.context': parentCtx ? 'interaction' : 'standalone', + // Dual-emit OTel GenAI semantic convention (Stable). Private name + // (qwen-code.model) remains authoritative; gen_ai.* is a compat layer + // for spec-aware backends. See docs/design/telemetry-llm-request-timing-design.md (D8). + 'gen_ai.request.model': model, }; const span = getTracer().startSpan( @@ -362,10 +403,55 @@ export function endLLMRequestSpan( const endAttributes: Attributes = { duration_ms: duration }; if (metadata) { - if (metadata.inputTokens !== undefined) + if (metadata.inputTokens !== undefined) { endAttributes['input_tokens'] = metadata.inputTokens; - if (metadata.outputTokens !== undefined) + // Dual-emit OTel GenAI semconv (Stable). Private name stays authoritative. + endAttributes['gen_ai.usage.input_tokens'] = metadata.inputTokens; + } + if (metadata.outputTokens !== undefined) { endAttributes['output_tokens'] = metadata.outputTokens; + endAttributes['gen_ai.usage.output_tokens'] = metadata.outputTokens; + } + if (metadata.cachedInputTokens !== undefined) { + endAttributes['cached_input_tokens'] = metadata.cachedInputTokens; + // Dual-emit OTel GenAI semconv (Experimental — may rename before Stable). + endAttributes['gen_ai.usage.cached_tokens'] = + metadata.cachedInputTokens; + } + if (metadata.ttftMs !== undefined) { + endAttributes['ttft_ms'] = metadata.ttftMs; + // Dual-emit OTel GenAI semconv (Experimental). Spec unit is seconds-as-double. + endAttributes['gen_ai.server.time_to_first_token'] = + metadata.ttftMs / 1000; + } + if (metadata.requestSetupMs !== undefined) { + endAttributes['request_setup_ms'] = metadata.requestSetupMs; + } + if (metadata.attempt !== undefined) { + endAttributes['attempt'] = metadata.attempt; + } + if (metadata.retryTotalDelayMs !== undefined) { + endAttributes['retry_total_delay_ms'] = metadata.retryTotalDelayMs; + } + // Derived: sampling_ms = time from first user-visible chunk to end + // (== output generation time, excluding setup + first-token delay). + // Computable only when ttftMs is set. requestSetupMs defaults to 0 + // when undefined (no retries) — this gives the correct sampling + // duration in both Phase 4a (no retry data) and Phase 4b (with). + if (metadata.ttftMs !== undefined) { + const samplingMs = Math.max( + 0, + duration - metadata.ttftMs - (metadata.requestSetupMs ?? 0), + ); + endAttributes['sampling_ms'] = samplingMs; + // Derived: output tokens per second during sampling. Undefined when + // sampling_ms is 0 (avoid divide-by-zero) or when outputTokens missing. + if (samplingMs > 0 && metadata.outputTokens !== undefined) { + endAttributes['output_tokens_per_second'] = + Math.round((metadata.outputTokens / (samplingMs / 1000)) * 100) / + 100; + } + } endAttributes['success'] = metadata.success; if (metadata.error !== undefined) endAttributes['error'] = truncateSpanError(metadata.error);