mirror of
https://github.com/QwenLM/qwen-code.git
synced 2026-05-22 11:13:51 +00:00
feat(telemetry): Phase 4a — TTFT capture + GenAI semconv dual-emit (#3731) (#4417)
Some checks are pending
Qwen Code CI / Classify PR (push) Waiting to run
Qwen Code CI / Lint (push) Blocked by required conditions
Qwen Code CI / Test (macos-latest, Node 22.x) (push) Blocked by required conditions
Qwen Code CI / Test (ubuntu-latest, Node 22.x) (push) Blocked by required conditions
Qwen Code CI / Test (windows-latest, Node 22.x) (push) Blocked by required conditions
Qwen Code CI / Post Coverage Comment (push) Blocked by required conditions
Qwen Code CI / CodeQL (push) Blocked by required conditions
E2E Tests / E2E Test (Linux) - sandbox:docker (push) Waiting to run
E2E Tests / E2E Test (Linux) - sandbox:none (push) Waiting to run
E2E Tests / E2E Test - macOS (push) Waiting to run
Some checks are pending
Qwen Code CI / Classify PR (push) Waiting to run
Qwen Code CI / Lint (push) Blocked by required conditions
Qwen Code CI / Test (macos-latest, Node 22.x) (push) Blocked by required conditions
Qwen Code CI / Test (ubuntu-latest, Node 22.x) (push) Blocked by required conditions
Qwen Code CI / Test (windows-latest, Node 22.x) (push) Blocked by required conditions
Qwen Code CI / Post Coverage Comment (push) Blocked by required conditions
Qwen Code CI / CodeQL (push) Blocked by required conditions
E2E Tests / E2E Test (Linux) - sandbox:docker (push) Waiting to run
E2E Tests / E2E Test (Linux) - sandbox:none (push) Waiting to run
E2E Tests / E2E Test - macOS (push) Waiting to run
This commit is contained in:
parent
48b0a8bfce
commit
fd75f77e19
7 changed files with 1190 additions and 2 deletions
538
docs/design/telemetry-llm-request-timing-design.md
Normal file
538
docs/design/telemetry-llm-request-timing-design.md
Normal file
|
|
@ -0,0 +1,538 @@
|
|||
# LLM Request Timing Decomposition Design (P3 Phase 4)
|
||||
|
||||
> Issue #3731 — Phase 4 of hierarchical session tracing. Adds time-to-first-token, request-setup duration, sampling duration, and per-attempt retry telemetry to the `qwen-code.llm_request` span so operators can answer "why was this LLM call slow?" without guessing.
|
||||
>
|
||||
> Builds on Phase 1 (#4126), Phase 1.5 (#4302), Phase 2 (#4321). Independent of Phase 3 (#4410, in review) — recommended to land Phase 3 first so Phase 4's per-attempt fields aggregate cleanly under subagent subtrees.
|
||||
|
||||
## Problem
|
||||
|
||||
`qwen-code.llm_request` spans today carry only `model`, `prompt_id`, `input_tokens`, `output_tokens`, `success`, `error`, `duration_ms`. Operators reading a single trace cannot tell:
|
||||
|
||||
1. **How much of `duration_ms` was the model thinking vs the network setup.** A 12-second `duration_ms` could be 11s of retries followed by 1s of fast generation, or 100ms of setup followed by 12s of slow streaming — the trace doesn't say.
|
||||
2. **When the user saw the first token.** TTFT (time-to-first-token) is the standard latency SLO for chat UIs. We can't compute it; we don't capture it.
|
||||
3. **What happened during retries.** `retryWithBackoff` (`utils/retry.ts:285`) only calls `debugLogger.warn` — no OTel event, no span attribute. The 4 LLM call sites that go through it (`client.ts:1540`, `baseLlmClient.ts:193,282`, `geminiChat.ts:1039`) have zero retry visibility in traces or metrics. `ContentRetryEvent` exists for content-recovery retries inside `geminiChat.ts:806,830` but not for the more common rate-limit / 5xx retries.
|
||||
4. **That `api.request.breakdown` is dead code.** The metric is defined at `metrics.ts:242-251` with 4 `ApiRequestPhase` values, exported from `index.ts:117`, tested in `metrics.test.ts:646-675` — but `recordApiRequestBreakdown()` has zero callers in production code. The metric infrastructure is paid for; the data flow was never connected.
|
||||
|
||||
These gaps make `qwen-code.llm_request` the least informative span in the trace tree. Tool spans (#4126/#4321) and subagent spans (#4410) both surface lifecycle phases; LLM spans collapse the entire request into one opaque duration.
|
||||
|
||||
## Existing surface (no change)
|
||||
|
||||
| Component | Location | Why we don't touch it |
|
||||
| ------------------------------------------------------------ | ---------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||
| LLM request span lifecycle | `session-tracing.ts` `startLLMRequestSpan` / `endLLMRequestSpan` | Phase 1 (#4126) established the helpers. We extend the metadata interface, don't restructure |
|
||||
| Active span propagation into provider generators | `loggingContentGenerator.ts:213,287` | Phase 1 (#4126) replaced `withSpan('api.*')` with native helpers; the active context already reaches the stream wrapper |
|
||||
| `ContentRetryEvent` schema + consumers | `types.ts:626`, `qwen-logger.ts:947`, `loggers.ts:717` | Existing event keeps its shape and downstreams; we add a sibling event class for the `retryWithBackoff` path |
|
||||
| `LogToSpanProcessor` log-bridge spans | `log-to-span-processor.ts` | ContentRetryEvent's existing bridge continues to nest under the active LLM span. Phase 4 does not change this |
|
||||
| `ApiRequestPhase` enum | `metrics.ts:330-334` | Public surface (4 values). We populate 3 of the 4 from production code; leave the enum unchanged for backward compatibility |
|
||||
| Per-provider chunk normalization → `GenerateContentResponse` | `loggingContentGenerator.ts:286-393` | Each provider already normalizes to Google's `GenerateContentResponse` shape before LoggingContentGenerator sees the stream. TTFT detection runs centrally over this normalized shape; no per-provider code |
|
||||
| `retryWithBackoff` general-purpose retry | `utils/retry.ts:140` | Used by both LLM callers and non-LLM (`channels/weixin/src/api.ts`). We extend with an opt-in `onRetry` callback rather than hard-coupling to LLM telemetry |
|
||||
| Non-streaming `generateContent` | `loggingContentGenerator.ts:212` | TTFT is not meaningful for non-streaming; the new fields stay `undefined`. Span lifecycle and existing attrs unchanged |
|
||||
|
||||
## Out-of-scope (deferred)
|
||||
|
||||
- **SDK-level retries** (openai SDK `maxRetries=3`, google-genai SDK internal retries). These happen entirely inside the third-party SDK; observing them requires disabling SDK retries and reimplementing in `retryWithBackoff`. Separate decision, not Phase 4.
|
||||
- **Per-token streaming metrics** (inter-token latency, per-chunk size). Useful for inference-engine perf debugging, not for the user-perceived latency questions Phase 4 targets.
|
||||
- **Separate TTFT for reasoning/thinking blocks.** "First token" includes thinking content (see D1). A future enhancement could split `ttft_to_reasoning_ms` vs `ttft_to_answer_ms`, but only after we know there's demand.
|
||||
- **Sampling phase as a dedicated child span.** Computable from `duration_ms - ttft_ms - request_setup_ms`; child span adds nothing for OTel-only backends (claude-code uses one for Perfetto only). Stored as a span attribute instead — see D6.
|
||||
- **Persistent retry mode (`QWEN_CODE_UNATTENDED_RETRY`) event-level rate limiting.** A single LLM request can produce 50+ `ContentRetryEvent` / `ApiRetryEvent` records under persistent retry. Capping emission is a follow-up — Phase 4 emits all events; if production volumes prove unbearable, add a per-span emission cap with a "+N more attempts (truncated)" summary event in a follow-up PR.
|
||||
- **`TOKEN_PROCESSING` breakdown phase.** Enum value exists but qwen-code has no real post-stream local processing worth measuring (<10ms typical). Skipped in production callers; enum value retained for future use or for callers we don't control.
|
||||
- **Migrating `ContentRetryEvent` onto LLM span as span events.** Same reasoning as Phase 3's `subagent_execution` LogRecord: existing consumers (qwen-logger RUM, future metrics) are tightly coupled to the LogRecord. Bridge-span coverage is good enough.
|
||||
|
||||
## References (decision evidence)
|
||||
|
||||
| Source | Key takeaway |
|
||||
| --------------------------------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||
| claude-code (Anthropic) `claude.ts:1762, 1789, 1982, 2882` | TTFT captured as `Date.now() - start` on `message_start` SSE event; `start` reset per retry attempt. `requestSetupMs = start - startIncludingRetries`. `attemptStartTimes` array preserved per attempt. Confirms feasibility of the approach; their TTFT semantic is "first stream event" (we diverge to "first content" — see D1) |
|
||||
| claude-code `perfettoTracing.ts:549-671` | Renders Request Setup → Attempt N (retry) → First Token → Sampling as nested B/E pairs. Demonstrates the visual decomposition; qwen-code does the same decomposition with OTel attributes since we have no Perfetto |
|
||||
| claude-code `sessionTracing.ts:447` | Only `ttft_ms` makes it onto the OTel span (not `requestSetupMs`, not `samplingMs`, not per-attempt timing). We deliberately put more on the span — claude-code has Perfetto for visualization; we don't |
|
||||
| opencode (sst/opencode) `session/llm.ts`, `route/client.ts` | No TTFT measurement. Single `LLM.run` Effect span covers everything. Validates that the gap exists across competing tools; not a reference for what to do |
|
||||
| [OTel GenAI Semantic Conventions](https://opentelemetry.io/docs/specs/semconv/gen-ai/) (status: Development / Experimental) | `gen_ai.usage.input_tokens` (Stable), `gen_ai.usage.output_tokens` (Stable), `gen_ai.usage.cached_tokens` (Experimental), `gen_ai.request.model` (Stable), `gen_ai.server.time_to_first_token` (Experimental, seconds as double). Dual-emit pattern follows #4410 precedent |
|
||||
| [OTel Trace Spec — Span Events](https://opentelemetry.io/docs/specs/otel/trace/api/#add-events) | "Events SHOULD NOT be used to record information that's better captured as Span Attributes." Confirms per-attempt info belongs on the LLM span attributes + log-bridge spans, not as Span Events on the parent |
|
||||
| Phase 3 design doc (`telemetry-subagent-spans-design.md`) | Established the dual-emit pattern (`qwen-code.subagent.id` + `gen_ai.agent.id`) and the "private name is authoritative" rule. Phase 4 follows the same convention for TTFT and token fields |
|
||||
|
||||
## Design — seven decisions, each justified
|
||||
|
||||
### D1 — TTFT semantic: "first chunk containing user-visible content"
|
||||
|
||||
TTFT measures wall-clock from the **successful attempt's** request dispatch to the **first stream chunk that contains user-visible output**. A chunk is "user-visible" if any normalized `Part` in `candidates[0].content.parts` is one of:
|
||||
|
||||
- `text` with non-empty string
|
||||
- `functionCall` (tool use)
|
||||
- `inlineData` (image, binary)
|
||||
- `executableCode`
|
||||
- `thought` / reasoning content (whatever the provider surfaces — Gemini's `thought`, Anthropic's `<thinking>` block, OpenAI o1 reasoning chunk)
|
||||
|
||||
Chunks containing only `role` metadata or only `usageMetadata` (final usage-summary chunk) do not trigger TTFT.
|
||||
|
||||
**Why not "first stream event of any kind" (claude-code's choice)**: claude-code measures TTFT at `message_start`, an Anthropic-specific metadata event that fires 50–300ms before any actual content. Their internal `headlessProfiler.ts` already separates `time_to_first_response_ms` for the "user saw something" semantic, acknowledging the distinction. qwen-code spans multiple providers (Anthropic, OpenAI, Gemini, Qwen) — picking the metadata-event semantic means TTFT for Anthropic is fundamentally different from TTFT for OpenAI (which has no analogous metadata-only first event). The user-visible-content semantic is uniform across all 4 providers and matches "time-to-first-token" literally.
|
||||
|
||||
**Why include `thought` / reasoning**: from the operator's perspective, reasoning chunks are still "the model produced output." Excluding them would understate TTFT for reasoning-heavy models (o1, Qwen thinking variants). Future split into `ttft_to_reasoning_ms` vs `ttft_to_answer_ms` is possible; not Phase 4.
|
||||
|
||||
**Why include tool-call-only chunks**: agent tool-decision LLM calls (one `tool_use`, no text) are common in qwen-code's workflow. Excluding them means TTFT is undefined for these requests. The `functionCall` Part is meaningful output.
|
||||
|
||||
**Cross-product comparison note**: design doc explicitly states `qwen-code.ttft_ms ≈ claude-code.time_to_first_response_ms ≠ claude-code.ttft_ms`. Operators comparing across products should align on the user-visible-content semantic.
|
||||
|
||||
### D2 — TTFT measurement site: method-local variables in `LoggingContentGenerator.generateContentStream`
|
||||
|
||||
The first-chunk detection runs inside the existing stream wrapper at `loggingContentGenerator.ts:393` (`async function* processStreamGenerator`). Per-call variables (`start`, `ttftMs`) live in the method's closure; **never as instance fields**.
|
||||
|
||||
**Why never instance fields**: `LoggingContentGenerator` is instantiated **once per `ContentGenerator`** (`contentGenerator.ts:377`) and shared across all concurrent `generateContentStream` calls — subagent fan-out, warmup queries, side-queries from `geminiChat`. An instance field would be overwritten across concurrent calls, producing nonsense TTFT for one of every two interleaved requests.
|
||||
|
||||
**Why not AsyncLocalStorage**: ALS would work but adds a context-management layer for a piece of state that doesn't need to escape the method. Method-local is simpler, zero overhead, zero risk of leakage.
|
||||
|
||||
```ts
|
||||
// loggingContentGenerator.ts — inside generateContentStream
|
||||
const attemptStart = Date.now(); // per-call local
|
||||
const requestEntryTime = Date.now(); // also per-call local — see D3
|
||||
let ttftMs: number | undefined;
|
||||
const attemptStartTimes: number[] = [attemptStart];
|
||||
let retryTotalDelayMs = 0;
|
||||
let finalAttempt = 1;
|
||||
// stream wrapper inspects each chunk; first one matching hasUserVisibleContent:
|
||||
// ttftMs = Date.now() - attemptStart;
|
||||
```
|
||||
|
||||
`hasUserVisibleContent(chunk)` is a small standalone helper colocated with the wrapper, exported for tests:
|
||||
|
||||
```ts
|
||||
function hasUserVisibleContent(chunk: GenerateContentResponse): boolean {
|
||||
const parts = chunk.candidates?.[0]?.content?.parts;
|
||||
if (!parts?.length) return false;
|
||||
return parts.some(
|
||||
(p) =>
|
||||
(typeof p.text === 'string' && p.text.length > 0) ||
|
||||
p.functionCall !== undefined ||
|
||||
p.inlineData !== undefined ||
|
||||
p.executableCode !== undefined ||
|
||||
// @ts-expect-error — `thought` is not on all SDK versions but providers emit it
|
||||
p.thought !== undefined,
|
||||
);
|
||||
}
|
||||
```
|
||||
|
||||
### D3 — `request_setup_ms` computation: entry-time vs successful-attempt-start
|
||||
|
||||
`request_setup_ms` measures wall-clock from `generateContentStream`/`generateContent` entry to the **start of the successful attempt** — including all failed retries, backoff sleeps, and any pre-retry preparation work.
|
||||
|
||||
```ts
|
||||
request_setup_ms = attemptStart_of_successful_attempt - requestEntryTime;
|
||||
```
|
||||
|
||||
When `attempt === 1` and no retries happened, `request_setup_ms` is small (just SDK setup). When retries occurred, it captures the entire retry-budget overhead.
|
||||
|
||||
**Putting it on the OTel span (diverges from claude-code, which puts it only on Perfetto)**: rationale at three levels:
|
||||
|
||||
1. **No Perfetto** — qwen-code has no out-of-band visualization layer. OTel attributes are the only channel.
|
||||
2. **Single-trace debug** — operator sees `duration_ms=12000, request_setup_ms=11500, ttft_ms=200, sampling_ms=300` → instantly diagnoses "retries ate 11.5s, model itself was fast." Computing `request_setup_ms` from other fields requires also exposing `sampling_ms`, which we do anyway (D6).
|
||||
3. **Negligible cost** — 1 INT64 attribute. Same order of magnitude as the existing `input_tokens`, `output_tokens` attributes. Backend ingest cost is not material.
|
||||
|
||||
### D4 — Retry telemetry: `onRetry` callback option on `retryWithBackoff` + new `ApiRetryEvent`
|
||||
|
||||
`retryWithBackoff` currently calls `logRetryAttempt` (`retry.ts:343`) which only writes to `debugLogger.warn`. We extend the `RetryOptions` interface with an opt-in callback:
|
||||
|
||||
```ts
|
||||
// utils/retry.ts
|
||||
interface RetryOptions<T> {
|
||||
// ... existing fields ...
|
||||
/**
|
||||
* Optional. Called once per failed attempt, before the backoff sleep.
|
||||
* Receives the attempt number (1-based), the error, and the delay before
|
||||
* the next attempt. Use this to emit telemetry events for LLM call sites;
|
||||
* leave undefined for non-LLM callers (e.g., channels/weixin) so they
|
||||
* stay silent in LLM-specific telemetry channels.
|
||||
*/
|
||||
onRetry?: (info: RetryAttemptInfo) => void;
|
||||
}
|
||||
|
||||
interface RetryAttemptInfo {
|
||||
attempt: number; // 1-based, matches debugLogger output
|
||||
error: unknown;
|
||||
errorStatus?: number;
|
||||
delayMs: number; // backoff delay before next attempt
|
||||
}
|
||||
```
|
||||
|
||||
The 4 LLM call sites (`client.ts:1540`, `baseLlmClient.ts:193,282`, `geminiChat.ts:1039`) register a callback that emits a new `ApiRetryEvent`:
|
||||
|
||||
```ts
|
||||
// types.ts — new event class, sibling to ContentRetryEvent
|
||||
export class ApiRetryEvent implements BaseTelemetryEvent {
|
||||
'event.name': typeof EVENT_API_RETRY;
|
||||
'event.timestamp': string;
|
||||
model: string;
|
||||
prompt_id?: string;
|
||||
attempt_number: number; // 1-based
|
||||
error_type: string;
|
||||
error_message: string; // truncated to 256 chars
|
||||
status_code?: number;
|
||||
retry_delay_ms: number;
|
||||
// ... duration_ms set to retry_delay_ms so LogToSpanProcessor renders
|
||||
// a bridge span of meaningful width
|
||||
duration_ms: number;
|
||||
}
|
||||
```
|
||||
|
||||
**Why a new event class, not extending `ContentRetryEvent`**:
|
||||
|
||||
- `ContentRetryEvent` has 2 downstream consumers (qwen-logger, log-record export). Changing its payload risks breaking them.
|
||||
- The naming "content retry" semantically refers to content-recovery retries (invalid stream, schema repair) — extending it to cover rate-limit retries would muddy the schema.
|
||||
- New event is additive; no consumer surprise.
|
||||
|
||||
**Why not embed callback IN `retry.ts`**: `retry.ts` is called by `channels/weixin/src/api.ts` too (microsoft messaging API retries). Hard-coupling LLM telemetry inside retry.ts would emit `ApiRetryEvent` for non-LLM retries. The `onRetry` callback is opt-in per caller — LLM callers opt in, weixin caller doesn't.
|
||||
|
||||
**ContentRetryEvent coexistence**: ContentRetryEvent stays as-is for content-recovery retries inside `geminiChat.ts:806,830`. ApiRetryEvent covers the rate-limit / 5xx retries from `retryWithBackoff`. The two events fire from different layers and never duplicate. Existing log-bridge behavior for both events is preserved via `LogToSpanProcessor` — both events nest under the active LLM span automatically (Phase 1 wiring ensures the LLM span is active during retries).
|
||||
|
||||
**Persistent retry mode (`QWEN_CODE_UNATTENDED_RETRY`)**: a single 429-loop request may emit 50+ events. Out of scope to rate-limit emission in Phase 4 — if production volumes prove unbearable, add a per-span cap with summary event in a follow-up PR. The aggregated `attempt` and `retry_total_delay_ms` on the parent LLM span (D5) remain accurate regardless of event cap.
|
||||
|
||||
### D5 — Parent LLM span aggregation: scalar attributes only (no map-typed attrs)
|
||||
|
||||
OTel span attributes are scalars (`string | number | boolean | array of these`). Map-typed attributes (like `retry_count_by_status: {429:2, 503:1}`) require JSON serialization and are awkward to query. Skip them.
|
||||
|
||||
| Attribute | Type | Semantic |
|
||||
| -------------------------- | ------ | ----------------------------------------------------------------------------------- |
|
||||
| `attempt` | int | 1-based final attempt count (`attemptStartTimes.length`) |
|
||||
| `retry_total_delay_ms` | int | Sum of all `delayMs` reported by `onRetry`; 0 if no retries |
|
||||
| `ttft_ms` | int | TTFT per D1; undefined for non-streaming or aborted-before-first-chunk requests |
|
||||
| `request_setup_ms` | int | Per D3 |
|
||||
| `sampling_ms` | int | Per D6 |
|
||||
| `output_tokens_per_second` | double | Derived; `output_tokens / (sampling_ms / 1000)`; undefined when `sampling_ms === 0` |
|
||||
|
||||
Per-attempt status-code distribution (e.g., "2 of the 3 attempts were 429s") is queryable from log-bridge spans of `ApiRetryEvent` records. No need to duplicate it as a flattened attribute on the parent.
|
||||
|
||||
**Why `sampling_ms` and `output_tokens_per_second` on the span**: derivable but cumbersome to compute in backend queries when summing across many spans. Same cost-benefit as `request_setup_ms` (D3).
|
||||
|
||||
### D6 — Activate `recordApiRequestBreakdown()` for 3 of 4 phases
|
||||
|
||||
In `endLLMRequestSpan` (or the wrapper that calls it), after computing TTFT/setup/sampling, emit:
|
||||
|
||||
```ts
|
||||
recordApiRequestBreakdown(config, model, [
|
||||
{ phase: ApiRequestPhase.REQUEST_PREPARATION, durationMs: requestSetupMs },
|
||||
{ phase: ApiRequestPhase.NETWORK_LATENCY, durationMs: ttftMs }, // ttftMs = network + first-token-generation
|
||||
{ phase: ApiRequestPhase.RESPONSE_PROCESSING, durationMs: samplingMs },
|
||||
]);
|
||||
```
|
||||
|
||||
**Why skip `TOKEN_PROCESSING`**: qwen-code does stream chunk processing inline (consolidation happens in the wrapper at `loggingContentGenerator.ts:644`); the post-stream wrap-up phase is <10ms and not architecturally distinct. Filling it with a meaningless value pollutes the histogram. Leaving the enum value unused is safe — `apiRequestBreakdownHistogram.record(value, {model, phase})` is just a histogram with `phase` as a label; missing labels are simply absent in queries.
|
||||
|
||||
**Why not redefine `NETWORK_LATENCY`**: the spec name is slightly misleading (it's network + first-token-generation, not pure network latency), but:
|
||||
|
||||
- The enum is part of `metrics.ts:330-334` which is exported from `index.ts:117` and tested.
|
||||
- Backend dashboards may already reference these phase names.
|
||||
- Renaming or adding a new phase would be a breaking change for trivially marginal accuracy improvement.
|
||||
|
||||
Document the semantic in the design doc; leave the enum unchanged.
|
||||
|
||||
**Why on the span path, not parallel**: keeps `recordApiRequestBreakdown` colocated with span attribute writes — single gated emission point (see D7 idempotency), single ordering invariant.
|
||||
|
||||
### D7 — `endLLMRequestSpan` idempotency: metric recording gated on existing double-end guard
|
||||
|
||||
Phase 1.5 (#4302) established that `endLLMRequestSpan` may be called twice (abort path + error path collision). The existing guard at `session-tracing.ts:~470` (`if (!activeSpans.has(...)) return;`) prevents double `span.end()`. Phase 4 metric recording (D6) **must sit inside the same guarded block**, before `span.end()`:
|
||||
|
||||
```ts
|
||||
// session-tracing.ts — endLLMRequestSpan
|
||||
const llmCtx = activeSpans.get(spanRef);
|
||||
if (!llmCtx) return; // already ended — double-end guard
|
||||
activeSpans.delete(spanRef); // claim the end
|
||||
|
||||
// ... compute duration, set attributes ...
|
||||
if (metadata) {
|
||||
recordApiRequestBreakdown(config, llmCtx.attributes.model, [...]); // NEW — gated
|
||||
recordTokenUsageMetrics(...); // existing
|
||||
}
|
||||
|
||||
span.end();
|
||||
```
|
||||
|
||||
This guarantees metric is recorded **exactly once** per LLM request, matching the span lifecycle.
|
||||
|
||||
**Why not record in `loggingContentGenerator`**: it doesn't see the abort path. Recording at the span lifecycle layer ensures every LLM request that opens a span produces exactly one breakdown sample, regardless of success/failure/abort.
|
||||
|
||||
### D8 — GenAI semantic conventions dual-emit (private name authoritative)
|
||||
|
||||
Each Phase 4 attribute that corresponds to an OTel GenAI semconv attribute is written twice on the span:
|
||||
|
||||
| qwen-code private (authoritative) | GenAI semconv (compat layer) | Unit conversion | Spec status |
|
||||
| ------------------------------------------ | ----------------------------------------------- | --------------- | ------------ |
|
||||
| `ttft_ms` (ms, int) | `gen_ai.server.time_to_first_token` (s, double) | `ttftMs / 1000` | Experimental |
|
||||
| `input_tokens` (int) | `gen_ai.usage.input_tokens` (int) | identical | Stable |
|
||||
| `output_tokens` (int) | `gen_ai.usage.output_tokens` (int) | identical | Stable |
|
||||
| `cached_input_tokens` (int) (when present) | `gen_ai.usage.cached_tokens` (int) | identical | Experimental |
|
||||
| `qwen-code.model` (string) | `gen_ai.request.model` (string) | identical | Stable |
|
||||
|
||||
**Existing token attribute names** on the LLM span (set in `endLLMRequestSpan` before Phase 4): qwen-code uses bare `input_tokens` and `output_tokens` already. Phase 4 adds the `gen_ai.usage.*` siblings to match #4410's pattern. The bare names stay; **don't rename**.
|
||||
|
||||
Fields with no GenAI semconv equivalent — `request_setup_ms`, `sampling_ms`, `retry_total_delay_ms`, `attempt`, `output_tokens_per_second` — are emitted only under the qwen-code namespace.
|
||||
|
||||
**Why "private authoritative, semconv as compat"**:
|
||||
|
||||
- Internal dashboards, SLOs, debugLogger output, qwen-logger RUM, ARMS queries — all reference `ttft_ms` etc. Treating those as canonical avoids a flag-day migration.
|
||||
- The Experimental GenAI semconv may rename `gen_ai.server.time_to_first_token` before reaching Stable. If/when it does, we update the semconv emission; the qwen-code names don't move.
|
||||
- Future spec-aware backends (Datadog AI views, Honeycomb AI, ARMS GenAI dashboards) auto-pick up the `gen_ai.*` attributes without our involvement.
|
||||
|
||||
**Why dual-emit unit conversion** (ms ↔ seconds): GenAI semconv chose seconds-as-double for latency; qwen-code chose ms-as-int (matches `duration_ms` already on the span). Both representations have value; the conversion is cheap.
|
||||
|
||||
## Helper API (additive to `session-tracing.ts`)
|
||||
|
||||
```ts
|
||||
// session-tracing.ts — LLMRequestMetadata interface extended (additive)
|
||||
export interface LLMRequestMetadata {
|
||||
// ... existing fields: inputTokens, outputTokens, cachedInputTokens, success, error, ...
|
||||
|
||||
/** Time from successful attempt start to first user-visible content chunk (ms). Undefined for non-streaming or aborted-before-first-chunk requests. */
|
||||
ttftMs?: number;
|
||||
|
||||
/** Time from generateContent entry to start of successful attempt (ms). Includes all failed retries + backoff. */
|
||||
requestSetupMs?: number;
|
||||
|
||||
/** Final attempt number (1-based). 1 = no retries. */
|
||||
attempt?: number;
|
||||
|
||||
/** Sum of all backoff delays before the successful attempt (ms). */
|
||||
retryTotalDelayMs?: number;
|
||||
}
|
||||
|
||||
// No new exported helpers — Phase 4 reuses startLLMRequestSpan / endLLMRequestSpan with extended metadata.
|
||||
```
|
||||
|
||||
```ts
|
||||
// types.ts — new event class
|
||||
export class ApiRetryEvent implements BaseTelemetryEvent {
|
||||
'event.name': typeof EVENT_API_RETRY = EVENT_API_RETRY;
|
||||
'event.timestamp': string;
|
||||
model: string;
|
||||
prompt_id?: string;
|
||||
attempt_number: number;
|
||||
error_type: string;
|
||||
error_message: string;
|
||||
status_code?: number;
|
||||
retry_delay_ms: number;
|
||||
duration_ms: number; // = retry_delay_ms, drives LogToSpanProcessor bridge span width
|
||||
|
||||
constructor(opts: { model: string; promptId?: string; attemptNumber: number; error: unknown; statusCode?: number; retryDelayMs: number }) { ... }
|
||||
}
|
||||
|
||||
// constants.ts
|
||||
export const EVENT_API_RETRY = 'qwen-code.api_retry';
|
||||
|
||||
// loggers.ts
|
||||
export function logApiRetry(config: Config, event: ApiRetryEvent): void { ... }
|
||||
```
|
||||
|
||||
```ts
|
||||
// utils/retry.ts — RetryOptions extension
|
||||
interface RetryOptions<T> {
|
||||
// ... existing ...
|
||||
onRetry?: (info: RetryAttemptInfo) => void;
|
||||
}
|
||||
|
||||
interface RetryAttemptInfo {
|
||||
attempt: number;
|
||||
error: unknown;
|
||||
errorStatus?: number;
|
||||
delayMs: number;
|
||||
}
|
||||
|
||||
// Inside retryWithBackoff, where logRetryAttempt is called today:
|
||||
options.onRetry?.({ attempt, error, errorStatus, delayMs: actualDelay });
|
||||
logRetryAttempt(attempt, error, errorStatus); // existing debugLogger call unchanged
|
||||
```
|
||||
|
||||
## Lifecycle wiring
|
||||
|
||||
### Streaming path (the common case)
|
||||
|
||||
```ts
|
||||
// loggingContentGenerator.ts:283 — generateContentStream
|
||||
async generateContentStream(req, userPromptId): Promise<AsyncGenerator<GenerateContentResponse>> {
|
||||
const requestEntryTime = Date.now();
|
||||
let attemptStart = requestEntryTime;
|
||||
const attemptStartTimes: number[] = [attemptStart];
|
||||
let retryTotalDelayMs = 0;
|
||||
let finalAttempt = 1;
|
||||
|
||||
// Use existing startLLMRequestSpan (Phase 1)
|
||||
// Pass onRetry callback to whatever retry layer is in use:
|
||||
const onRetry: RetryAttemptInfo & { invoke: ... } = (info) => {
|
||||
finalAttempt = info.attempt + 1; // we're about to start attempt N+1
|
||||
retryTotalDelayMs += info.delayMs;
|
||||
attemptStart = Date.now() + info.delayMs; // approximate; actual reset is at top of next attempt
|
||||
attemptStartTimes.push(attemptStart);
|
||||
// emit ApiRetryEvent
|
||||
logApiRetry(this.config, new ApiRetryEvent({
|
||||
model: req.model,
|
||||
promptId: userPromptId,
|
||||
attemptNumber: info.attempt,
|
||||
error: info.error,
|
||||
statusCode: info.errorStatus,
|
||||
retryDelayMs: info.delayMs,
|
||||
}));
|
||||
};
|
||||
|
||||
// stream wrapper detects first user-visible chunk:
|
||||
return this.processStreamGenerator(stream, ..., {
|
||||
onFirstUserVisibleChunk: (now) => {
|
||||
ttftMs = now - attemptStart;
|
||||
},
|
||||
});
|
||||
}
|
||||
```
|
||||
|
||||
At span end (already in Phase 1's `endLLMRequestSpan` flow), include the new fields in `LLMRequestMetadata`:
|
||||
|
||||
```ts
|
||||
endLLMRequestSpan(llmSpan, {
|
||||
success: true,
|
||||
inputTokens,
|
||||
outputTokens,
|
||||
cachedInputTokens,
|
||||
ttftMs,
|
||||
requestSetupMs: attemptStart - requestEntryTime,
|
||||
attempt: finalAttempt,
|
||||
retryTotalDelayMs,
|
||||
});
|
||||
```
|
||||
|
||||
### Non-streaming path
|
||||
|
||||
`generateContent` (`loggingContentGenerator.ts:212`) does not produce streaming chunks. TTFT is `undefined`; `request_setup_ms` is still meaningful (captures retry overhead). The breakdown metric records 2 phases (REQUEST_PREPARATION + RESPONSE_PROCESSING where `RESPONSE_PROCESSING = duration_ms - request_setup_ms`), not 3.
|
||||
|
||||
### Retry layer integration (4 sites)
|
||||
|
||||
Each of the 4 LLM `retryWithBackoff` call sites adds `onRetry`:
|
||||
|
||||
```ts
|
||||
// client.ts:1540 (similar at baseLlmClient.ts:193, 282, geminiChat.ts:1039)
|
||||
const result = await retryWithBackoff(apiCall, {
|
||||
...existingOptions,
|
||||
onRetry: (info) => {
|
||||
logApiRetry(
|
||||
this.config,
|
||||
new ApiRetryEvent({
|
||||
model,
|
||||
promptId: userPromptId,
|
||||
attemptNumber: info.attempt,
|
||||
error: info.error,
|
||||
statusCode: info.errorStatus,
|
||||
retryDelayMs: info.delayMs,
|
||||
}),
|
||||
);
|
||||
// also feed back into LoggingContentGenerator's local retry accumulator
|
||||
// (when in scope — for callers that don't go through LoggingContentGenerator,
|
||||
// the LLM span still gets `attempt` and `retry_total_delay_ms` via the
|
||||
// metadata path because endLLMRequestSpan is called at the LLM layer)
|
||||
},
|
||||
});
|
||||
```
|
||||
|
||||
The non-LLM caller (`channels/weixin/src/api.ts`) **does not register `onRetry`** — no `ApiRetryEvent` is emitted for its retries, matching today's behavior.
|
||||
|
||||
## Concurrent safety — the headline guarantee
|
||||
|
||||
`LoggingContentGenerator` instance is shared (one per `ContentGenerator`, `contentGenerator.ts:377`). Three concurrent `generateContentStream` calls (e.g., 3 subagents fan out via `coreToolScheduler.runConcurrently`) execute three independent closures of `generateContentStream`:
|
||||
|
||||
```
|
||||
call_A: attemptStart_A, ttftMs_A, ... (closure)
|
||||
call_B: attemptStart_B, ttftMs_B, ... (closure)
|
||||
call_C: attemptStart_C, ttftMs_C, ... (closure)
|
||||
```
|
||||
|
||||
Per-call locals never overlap. Stream chunks are detected against the local `attemptStart` of each call. Span attributes are set at each call's own `endLLMRequestSpan`.
|
||||
|
||||
`AsyncLocalStorageContextManager` (registered by NodeSDK at `sdk.ts:273`) already ensures the active OTel context — and thus the parent span passed to `startLLMRequestSpan` — is correct per fiber.
|
||||
|
||||
## Files to change
|
||||
|
||||
| File | Change | LOC est |
|
||||
| -------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------- |
|
||||
| `packages/core/src/telemetry/constants.ts` | Add `EVENT_API_RETRY` constant | +2 |
|
||||
| `packages/core/src/telemetry/types.ts` | Add `ApiRetryEvent` class + union member | +40 |
|
||||
| `packages/core/src/telemetry/loggers.ts` | Add `logApiRetry()` function | +20 |
|
||||
| `packages/core/src/telemetry/qwen-logger/qwen-logger.ts` | Add `logApiRetryEvent()` for RUM downstream consistency | +20 |
|
||||
| `packages/core/src/telemetry/session-tracing.ts` | Extend `LLMRequestMetadata` (ttftMs, requestSetupMs, attempt, retryTotalDelayMs); extend `endLLMRequestSpan` to set new attrs + breakdown metric + dual-emit gen_ai.\* | +60 |
|
||||
| `packages/core/src/telemetry/metrics.ts` | Wire `recordApiRequestBreakdown` callsite inside `endLLMRequestSpan` (no change to the existing recorder) | 0 |
|
||||
| `packages/core/src/utils/retry.ts` | Add `onRetry?: (info: RetryAttemptInfo) => void` to RetryOptions; export `RetryAttemptInfo`; invoke callback in the existing logRetryAttempt site | +25 |
|
||||
| `packages/core/src/core/loggingContentGenerator/loggingContentGenerator.ts` | TTFT capture: method-local accumulators + `hasUserVisibleContent` helper + first-chunk detection in stream wrapper; pass new metadata to `endLLMRequestSpan` | +80 |
|
||||
| `packages/core/src/core/client.ts` | Wire `onRetry` callback at `retryWithBackoff` call site (`client.ts:1540`) | +15 |
|
||||
| `packages/core/src/core/baseLlmClient.ts` | Wire `onRetry` callback at 2 `retryWithBackoff` call sites | +25 |
|
||||
| `packages/core/src/core/geminiChat.ts` | Wire `onRetry` callback at `retryWithBackoff` call site (`geminiChat.ts:1039`) | +15 |
|
||||
| `packages/core/src/telemetry/session-tracing.test.ts` | `endLLMRequestSpan` sets ttft_ms / request_setup_ms / attempt / retry_total_delay_ms / sampling_ms / output_tokens_per_second + gen_ai dual-emit + breakdown metric (each phase) + idempotent end | +120 |
|
||||
| `packages/core/src/core/loggingContentGenerator/loggingContentGenerator.test.ts` | `hasUserVisibleContent` (text / functionCall / inlineData / executableCode / thought / role-only / usage-only); concurrent calls don't cross-contaminate; TTFT undefined when aborted before first chunk; TTFT undefined on non-streaming | +100 |
|
||||
| `packages/core/src/utils/retry.test.ts` | `onRetry` invoked per failed attempt with correct `attempt`, `delayMs`, `error`, `errorStatus`; absence of `onRetry` is silent (no telemetry emitted) | +50 |
|
||||
| `packages/core/src/telemetry/loggers.test.ts` | `logApiRetry` emits LogRecord with expected payload; bridges through LogToSpanProcessor to nested span under active LLM span | +40 |
|
||||
|
||||
Total: 14 files, ~610 LOC. Larger than Phase 2 (#4321) but comparable to Phase 3 (#4410) and justified by the breadth of integration (4 retry sites + telemetry plumbing + streaming wrapper).
|
||||
|
||||
If review pushes back on size: split into **Phase 4a + 4b + 4c**:
|
||||
|
||||
- **4a** (~200 LOC): TTFT capture + extended `LLMRequestMetadata` + dual-emit. Self-contained value (TTFT visibility from day one).
|
||||
- **4b** (~250 LOC): `onRetry` callback + `ApiRetryEvent` + 4 caller wiring. **Independently a bug fix** for the `retryWithBackoff` telemetry gap.
|
||||
- **4c** (~160 LOC): `recordApiRequestBreakdown` activation + parent span aggregation attrs (`attempt`, `retry_total_delay_ms`, `sampling_ms`, `output_tokens_per_second`). Depends on 4a + 4b.
|
||||
|
||||
## Testing strategy
|
||||
|
||||
| Test | What it proves |
|
||||
| -------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------- |
|
||||
| `hasUserVisibleContent` returns true for text/functionCall/inlineData/executableCode/thought | D1 semantics across part types |
|
||||
| `hasUserVisibleContent` returns false for role-only and usage-only chunks | D1 negative cases |
|
||||
| streaming: TTFT measured from attempt start to first user-visible chunk | End-to-end TTFT detection |
|
||||
| streaming: TTFT undefined if stream aborts before any user-visible chunk | Edge case |
|
||||
| streaming: TTFT computed from final attempt's start (not first attempt) | D3 — TTFT reset on retry |
|
||||
| non-streaming: TTFT remains undefined | S3 decision |
|
||||
| concurrent `generateContentStream` calls don't cross-contaminate TTFT | D2 — method-local guarantee |
|
||||
| `endLLMRequestSpan` sets all Phase 4 attrs (ttft_ms, request_setup_ms, sampling_ms, attempt, retry_total_delay_ms, output_tokens_per_second) | Attribute presence |
|
||||
| `endLLMRequestSpan` dual-emits gen_ai.server.time_to_first_token + gen_ai.usage.\* + gen_ai.request.model | D8 dual-emit |
|
||||
| `endLLMRequestSpan` records breakdown metric with 3 phases for streaming, 2 for non-streaming | D6 |
|
||||
| `endLLMRequestSpan` called twice: metric recorded exactly once, attrs not re-set | D7 idempotency |
|
||||
| `retryWithBackoff` with `onRetry`: callback invoked per failed attempt with correct args | D4 callback contract |
|
||||
| `retryWithBackoff` without `onRetry`: no telemetry emitted (silent for non-LLM callers) | P2 — channels/weixin scope protection |
|
||||
| `client.ts` / `baseLlmClient.ts` / `geminiChat.ts` retry callsites emit `ApiRetryEvent` on retry | Integration of D4 at 4 sites |
|
||||
| `ApiRetryEvent` LogRecord bridges via LogToSpanProcessor to a child span under active LLM span | Trace tree correctness |
|
||||
| LLM span `attempt` field correctly reflects final attempt number under retries | D5 aggregation |
|
||||
| LLM span `retry_total_delay_ms` correctly sums onRetry delays | D5 aggregation |
|
||||
| `output_tokens_per_second` undefined when `sampling_ms === 0` (no streaming) | Avoid divide-by-zero |
|
||||
|
||||
## Edge cases
|
||||
|
||||
| Case | Handling |
|
||||
| ----------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
|
||||
| Stream aborts before any chunk arrives | `ttftMs = undefined`, `sampling_ms = undefined`, `output_tokens_per_second = undefined`. `attempt`, `request_setup_ms` still set. `success = false` |
|
||||
| Stream aborts after first chunk | `ttftMs` set; `sampling_ms` = `duration_ms - ttftMs - request_setup_ms`; reflects partial response time. `success = false` |
|
||||
| Retry succeeds on attempt 1 (no retries) | `attempt = 1`, `retry_total_delay_ms = 0`, no `ApiRetryEvent` emitted, breakdown metric records `request_setup_ms` close to 0 |
|
||||
| Persistent retry mode 50+ attempts | 50+ `ApiRetryEvent` records emitted (out-of-scope cap deferred); LLM span `attempt = 51`, `retry_total_delay_ms = sum of all delays`. Operator sees aggregated view on span; full per-attempt detail in log-bridge spans |
|
||||
| Non-LLM `retryWithBackoff` caller (channels/weixin) | No `onRetry` registered; only existing `debugLogger.warn` fires. No `ApiRetryEvent`; no breakdown metric (caller isn't an LLM site) |
|
||||
| `endLLMRequestSpan` called twice (abort + error race) | Phase 1.5 guard at `activeSpans.delete()` returns early on second call; `recordApiRequestBreakdown` is inside the guard, recorded exactly once |
|
||||
| Anthropic `message_start` chunk arrives before content | `hasUserVisibleContent` returns false for it (no parts with text/functionCall/etc.); TTFT not triggered until subsequent `content_block_delta` chunk |
|
||||
| OpenAI first chunk with empty `delta.content` but `role` only | `hasUserVisibleContent` returns false; TTFT not triggered until first chunk with non-empty delta |
|
||||
| Tool-call-only response (no text) | First chunk with `functionCall` Part triggers TTFT; `output_tokens_per_second` computed against tool-call token count |
|
||||
| Concurrent subagents (3 calls in flight) | Each call's closure has its own `attemptStart`, `ttftMs`, `attemptStartTimes`. Per-call span receives its own metadata at `endLLMRequestSpan`. No interleaving (D2) |
|
||||
| SDK-level retries inside openai-sdk (`maxRetries=3`) | Invisible to qwen-code telemetry — happens entirely inside SDK before retryWithBackoff sees the request. `attempt` reflects retryWithBackoff attempts only. Out of scope (see Out-of-scope) |
|
||||
| `gen_ai.server.time_to_first_token` spec renames before reaching Stable | Single-file update: `session-tracing.ts:endLLMRequestSpan`. The qwen-code-native `ttft_ms` stays authoritative — no downstream impact |
|
||||
| Subagent's LLM request | Parent is the subagent span (Phase 3). Phase 4 fields nest correctly. Aggregations grouped by `qwen-code.subagent.id` give per-subagent LLM perf — design-doc-future, easy follow-up |
|
||||
| Reasoning model with long thought blocks | First `thought` Part triggers TTFT; `sampling_ms` includes both thinking + answer phases. Split into separate metrics deferred |
|
||||
|
||||
## Rollback
|
||||
|
||||
The change is additive at the OTel and metric level — every new attribute is optional, every new event is a new class. Existing dashboards that don't filter on the new fields keep working unchanged.
|
||||
|
||||
Behavior-affecting changes:
|
||||
|
||||
- New `ApiRetryEvent` LogRecord starts flowing → log volume increases proportional to retry rate (typically <1% of requests retry). Mitigate by sampling LogRecord at the SDK layer if needed.
|
||||
- New breakdown metric `qwen-code.api.request.breakdown` starts producing time series → mild Prometheus cardinality bump (`{model, phase}` — bounded).
|
||||
- `output_tokens_per_second` derived attribute may appear unusual on dashboards filtering "all attributes" — document.
|
||||
|
||||
Rollback path: revert the single PR (or each of 4a/4b/4c independently). All new fields use defensive defaults (undefined / 0) and don't change span structure.
|
||||
|
||||
## Sequencing
|
||||
|
||||
- **After Phase 3 (#4410, in review)**: not a hard dependency. Phase 4 attributes attach to `qwen-code.llm_request` spans regardless of whether they're under a `qwen-code.subagent` (Phase 3) or `qwen-code.interaction` (Phase 1) parent. Recommend Phase 3 land first so per-attempt aggregation under subagent subtrees works naturally.
|
||||
- **Independent of #4384** (`traceparent` + `X-Qwen-Code-Session-Id` outbound propagation). They touch the HTTP layer; Phase 4 touches the stream/retry/metric layer.
|
||||
- **Independent of `clearDetailedSpanState` chat-compression follow-up** (#4097 follow-up). Different surface.
|
||||
|
||||
## Open questions
|
||||
|
||||
1. **`onRetry` callback firing semantics**: invoked **before** backoff sleep (current proposal) or **after** (when the next attempt is about to start)? Before is simpler — callback has all the info immediately; after would require capturing the just-completed delay separately. Pre-sleep is the recommendation; document in callback contract.
|
||||
2. **Per-attempt timing on the LLM span**: should we add `attempt_durations_ms: number[]` array? OTel supports array-of-primitive attributes. Useful for "which attempt of N was slow" diagnostics. Defer until production data shows demand — log-bridge spans already carry the equivalent.
|
||||
3. **Persistent retry mode emission cap**: at what `attempt > N` threshold should we start sampling? `N = 5` then 1-in-10? `N = 10` then summary-only? Defer until we have production volume data.
|
||||
4. **`TOKEN_PROCESSING` phase**: keep enum value dormant or wire it to something (e.g., consolidation time)? Defer — wait for a real use case.
|
||||
5. **Subagent-level LLM rollups**: trivial follow-up once Phase 4 lands — sum `ttft_ms`/`output_tokens`/`input_tokens` per subagent subtree. Not Phase 4 scope but the data flow enables it.
|
||||
|
|
@ -165,6 +165,11 @@ vi.mock('../../telemetry/index.js', () => {
|
|||
success: boolean;
|
||||
inputTokens?: number;
|
||||
outputTokens?: number;
|
||||
cachedInputTokens?: number;
|
||||
ttftMs?: number;
|
||||
requestSetupMs?: number;
|
||||
attempt?: number;
|
||||
retryTotalDelayMs?: number;
|
||||
durationMs?: number;
|
||||
error?: string;
|
||||
},
|
||||
|
|
@ -616,6 +621,150 @@ describe('LoggingContentGenerator', () => {
|
|||
});
|
||||
});
|
||||
|
||||
it('captures ttftMs on the first user-visible stream chunk (Phase 4a)', async () => {
|
||||
// Two chunks: first has text (user-visible), second has only usage.
|
||||
// ttftMs must be set on the first chunk and not overwritten by the second.
|
||||
const streamFn = vi.fn().mockResolvedValue(
|
||||
(async function* () {
|
||||
yield createResponse('r1', 'test-model', [{ text: 'hi' }]);
|
||||
yield createResponse('r2', 'test-model', [], {
|
||||
promptTokenCount: 10,
|
||||
candidatesTokenCount: 2,
|
||||
totalTokenCount: 12,
|
||||
});
|
||||
})(),
|
||||
);
|
||||
const wrapped = createWrappedGenerator(vi.fn(), streamFn);
|
||||
const generator = new LoggingContentGenerator(wrapped, createConfig(), {
|
||||
model: 'test-model',
|
||||
authType: AuthType.USE_OPENAI,
|
||||
enableOpenAILogging: false,
|
||||
});
|
||||
const request = {
|
||||
model: 'test-model',
|
||||
contents: 'Hello',
|
||||
} as unknown as GenerateContentParameters;
|
||||
|
||||
const stream = await generator.generateContentStream(
|
||||
request,
|
||||
'prompt-ttft',
|
||||
);
|
||||
for await (const _ of stream) {
|
||||
// consume
|
||||
}
|
||||
|
||||
const spanRecord = getStreamSpanRecord();
|
||||
const meta = spanRecord.endMetadata as { ttftMs?: number } | undefined;
|
||||
expect(meta).toBeDefined();
|
||||
expect(typeof meta!.ttftMs).toBe('number');
|
||||
expect(meta!.ttftMs!).toBeGreaterThanOrEqual(0);
|
||||
});
|
||||
|
||||
it('forwards cachedInputTokens from usageMetadata to endLLMRequestSpan (Phase 4a)', async () => {
|
||||
const streamFn = vi.fn().mockResolvedValue(
|
||||
(async function* () {
|
||||
yield createResponse('r1', 'test-model', [{ text: 'ok' }], {
|
||||
promptTokenCount: 100,
|
||||
candidatesTokenCount: 20,
|
||||
cachedContentTokenCount: 40,
|
||||
totalTokenCount: 160,
|
||||
});
|
||||
})(),
|
||||
);
|
||||
const wrapped = createWrappedGenerator(vi.fn(), streamFn);
|
||||
const generator = new LoggingContentGenerator(wrapped, createConfig(), {
|
||||
model: 'test-model',
|
||||
authType: AuthType.USE_OPENAI,
|
||||
enableOpenAILogging: false,
|
||||
});
|
||||
const request = {
|
||||
model: 'test-model',
|
||||
contents: 'Hello',
|
||||
} as unknown as GenerateContentParameters;
|
||||
|
||||
const stream = await generator.generateContentStream(
|
||||
request,
|
||||
'prompt-cache',
|
||||
);
|
||||
for await (const _ of stream) {
|
||||
// consume
|
||||
}
|
||||
|
||||
const spanRecord = getStreamSpanRecord();
|
||||
expect(spanRecord.endMetadata).toMatchObject({
|
||||
success: true,
|
||||
inputTokens: 100,
|
||||
cachedInputTokens: 40,
|
||||
});
|
||||
});
|
||||
|
||||
it('leaves ttftMs undefined when stream yields no user-visible chunks (Phase 4a)', async () => {
|
||||
// Stream emits only usage-metadata chunks (no text/functionCall/etc).
|
||||
// ttftMs must stay undefined — TTFT is only meaningful when content arrives.
|
||||
const streamFn = vi.fn().mockResolvedValue(
|
||||
(async function* () {
|
||||
yield createResponse('r1', 'test-model', [], {
|
||||
promptTokenCount: 5,
|
||||
candidatesTokenCount: 0,
|
||||
totalTokenCount: 5,
|
||||
});
|
||||
})(),
|
||||
);
|
||||
const wrapped = createWrappedGenerator(vi.fn(), streamFn);
|
||||
const generator = new LoggingContentGenerator(wrapped, createConfig(), {
|
||||
model: 'test-model',
|
||||
authType: AuthType.USE_OPENAI,
|
||||
enableOpenAILogging: false,
|
||||
});
|
||||
const request = {
|
||||
model: 'test-model',
|
||||
contents: 'Hello',
|
||||
} as unknown as GenerateContentParameters;
|
||||
|
||||
const stream = await generator.generateContentStream(
|
||||
request,
|
||||
'prompt-no-content',
|
||||
);
|
||||
for await (const _ of stream) {
|
||||
// consume
|
||||
}
|
||||
|
||||
const spanRecord = getStreamSpanRecord();
|
||||
const meta = spanRecord.endMetadata as { ttftMs?: number } | undefined;
|
||||
expect(meta!.ttftMs).toBeUndefined();
|
||||
});
|
||||
|
||||
it('forwards cachedInputTokens to endLLMRequestSpan on non-stream success (Phase 4a)', async () => {
|
||||
const generateFn = vi.fn().mockResolvedValue(
|
||||
createResponse('resp-cache', 'test-model', [{ text: 'ok' }], {
|
||||
promptTokenCount: 100,
|
||||
candidatesTokenCount: 30,
|
||||
cachedContentTokenCount: 60,
|
||||
totalTokenCount: 190,
|
||||
}),
|
||||
);
|
||||
const wrapped = createWrappedGenerator(generateFn, vi.fn());
|
||||
const generator = new LoggingContentGenerator(wrapped, createConfig(), {
|
||||
model: 'test-model',
|
||||
authType: AuthType.USE_OPENAI,
|
||||
enableOpenAILogging: false,
|
||||
});
|
||||
const request = {
|
||||
model: 'test-model',
|
||||
contents: 'Hi',
|
||||
} as unknown as GenerateContentParameters;
|
||||
|
||||
await generator.generateContent(request, 'prompt-cache-non-stream');
|
||||
|
||||
const spanRecord = getGenerateContentSpanRecord();
|
||||
expect(spanRecord.endMetadata).toMatchObject({
|
||||
success: true,
|
||||
inputTokens: 100,
|
||||
outputTokens: 30,
|
||||
cachedInputTokens: 60,
|
||||
});
|
||||
});
|
||||
|
||||
it('preserves non-stream success when response and OpenAI logging fail', async () => {
|
||||
vi.mocked(logApiResponse).mockImplementationOnce(() => {
|
||||
throw new Error('response-log-fail');
|
||||
|
|
|
|||
|
|
@ -61,6 +61,7 @@ import {
|
|||
API_CALL_ABORTED_SPAN_STATUS_MESSAGE,
|
||||
API_CALL_FAILED_SPAN_STATUS_MESSAGE,
|
||||
} from '../../telemetry/tracer.js';
|
||||
import { hasUserVisibleContent } from './streamContentDetection.js';
|
||||
|
||||
const debugLogger = createDebugLogger('LOGGING_CONTENT_GENERATOR');
|
||||
|
||||
|
|
@ -285,6 +286,7 @@ export class LoggingContentGenerator implements ContentGenerator {
|
|||
success: true,
|
||||
inputTokens: response.usageMetadata?.promptTokenCount,
|
||||
outputTokens: response.usageMetadata?.candidatesTokenCount,
|
||||
cachedInputTokens: response.usageMetadata?.cachedContentTokenCount,
|
||||
durationMs: Date.now() - startTime,
|
||||
});
|
||||
return response;
|
||||
|
|
@ -462,6 +464,14 @@ export class LoggingContentGenerator implements ContentGenerator {
|
|||
let firstModelVersion = '';
|
||||
let lastUsageMetadata: GenerateContentResponseUsageMetadata | undefined;
|
||||
let errorOccurred = false;
|
||||
|
||||
// TTFT (time to first token): wall-clock from generateContentStream
|
||||
// dispatch to the first stream chunk containing user-visible content.
|
||||
// Method-local closure variable — NEVER an instance field — because
|
||||
// LoggingContentGenerator is shared across concurrent generateContentStream
|
||||
// calls (one per ContentGenerator, see contentGenerator.ts:createContentGenerator).
|
||||
// See docs/design/telemetry-llm-request-timing-design.md (D1, D2).
|
||||
let ttftMs: number | undefined;
|
||||
// Tracks whether the idle timeout fired and ended the span. If so,
|
||||
// a resumed-after-timeout consumer must not call endLLMRequestSpan
|
||||
// again (the helper would no-op, but more importantly we skip the
|
||||
|
|
@ -516,6 +526,13 @@ export class LoggingContentGenerator implements ContentGenerator {
|
|||
if (response.usageMetadata) {
|
||||
lastUsageMetadata = response.usageMetadata;
|
||||
}
|
||||
// Capture TTFT on the first stream chunk that contains user-visible
|
||||
// content. hasUserVisibleContent skips role-only / usageMetadata-only
|
||||
// chunks, so TTFT reflects "model produced something the operator can
|
||||
// attribute to user-perceived latency."
|
||||
if (ttftMs === undefined && hasUserVisibleContent(response)) {
|
||||
ttftMs = Date.now() - startTime;
|
||||
}
|
||||
resetSpanTimeout?.();
|
||||
yield response;
|
||||
}
|
||||
|
|
@ -601,6 +618,8 @@ export class LoggingContentGenerator implements ContentGenerator {
|
|||
success: !errorOccurred,
|
||||
inputTokens: lastUsageMetadata?.promptTokenCount,
|
||||
outputTokens: lastUsageMetadata?.candidatesTokenCount,
|
||||
cachedInputTokens: lastUsageMetadata?.cachedContentTokenCount,
|
||||
ttftMs,
|
||||
durationMs: Date.now() - startTime,
|
||||
error: errorOccurred
|
||||
? aborted
|
||||
|
|
|
|||
|
|
@ -0,0 +1,131 @@
|
|||
/**
|
||||
* @license
|
||||
* Copyright 2026 Qwen Team
|
||||
* SPDX-License-Identifier: Apache-2.0
|
||||
*/
|
||||
|
||||
import { describe, expect, it } from 'vitest';
|
||||
import { GenerateContentResponse } from '@google/genai';
|
||||
import { hasUserVisibleContent } from './streamContentDetection.js';
|
||||
|
||||
function chunkWithParts(parts: unknown[]): GenerateContentResponse {
|
||||
const r = new GenerateContentResponse();
|
||||
r.candidates = [
|
||||
{
|
||||
content: { role: 'model', parts: parts as never },
|
||||
},
|
||||
];
|
||||
return r;
|
||||
}
|
||||
|
||||
describe('hasUserVisibleContent', () => {
|
||||
it('returns true for non-empty text part', () => {
|
||||
expect(hasUserVisibleContent(chunkWithParts([{ text: 'hi' }]))).toBe(true);
|
||||
});
|
||||
|
||||
it('returns false for empty text part', () => {
|
||||
expect(hasUserVisibleContent(chunkWithParts([{ text: '' }]))).toBe(false);
|
||||
});
|
||||
|
||||
it('returns true for functionCall part', () => {
|
||||
expect(
|
||||
hasUserVisibleContent(
|
||||
chunkWithParts([{ functionCall: { name: 'read', args: {} } }]),
|
||||
),
|
||||
).toBe(true);
|
||||
});
|
||||
|
||||
it('returns true for inlineData part', () => {
|
||||
expect(
|
||||
hasUserVisibleContent(
|
||||
chunkWithParts([
|
||||
{ inlineData: { mimeType: 'image/png', data: 'abc' } },
|
||||
]),
|
||||
),
|
||||
).toBe(true);
|
||||
});
|
||||
|
||||
it('returns true for executableCode part', () => {
|
||||
expect(
|
||||
hasUserVisibleContent(
|
||||
chunkWithParts([
|
||||
{ executableCode: { language: 'PYTHON', code: 'print(1)' } },
|
||||
]),
|
||||
),
|
||||
).toBe(true);
|
||||
});
|
||||
|
||||
it('returns true for thought / reasoning part with thought: true', () => {
|
||||
expect(hasUserVisibleContent(chunkWithParts([{ thought: true }]))).toBe(
|
||||
true,
|
||||
);
|
||||
});
|
||||
|
||||
it('returns false for thought: false (explicit non-thought part)', () => {
|
||||
// Codebase convention: `thought` is a boolean flag where false means
|
||||
// "explicitly not a thought." A part with only `thought: false` and no
|
||||
// other content must not trigger TTFT.
|
||||
expect(hasUserVisibleContent(chunkWithParts([{ thought: false }]))).toBe(
|
||||
false,
|
||||
);
|
||||
});
|
||||
|
||||
it('returns false for thought: undefined / missing (default non-thought)', () => {
|
||||
// A bare object without the `thought` key is the common case for non-thinking
|
||||
// chunks; must not match the thought branch.
|
||||
expect(hasUserVisibleContent(chunkWithParts([{}]))).toBe(false);
|
||||
});
|
||||
|
||||
it('returns true when thought: true coexists with empty text', () => {
|
||||
// First Anthropic <thinking> chunk often arrives as { text: '', thought: true }.
|
||||
// Per design doc D1, "thought / reasoning content" is user-visible — TTFT fires.
|
||||
expect(
|
||||
hasUserVisibleContent(chunkWithParts([{ text: '', thought: true }])),
|
||||
).toBe(true);
|
||||
});
|
||||
|
||||
it('returns true when any part is user-visible (mixed)', () => {
|
||||
expect(
|
||||
hasUserVisibleContent(chunkWithParts([{ text: '' }, { text: 'hi' }])),
|
||||
).toBe(true);
|
||||
});
|
||||
|
||||
it('returns false for empty parts array', () => {
|
||||
expect(hasUserVisibleContent(chunkWithParts([]))).toBe(false);
|
||||
});
|
||||
|
||||
it('returns false when candidates is missing', () => {
|
||||
const r = new GenerateContentResponse();
|
||||
expect(hasUserVisibleContent(r)).toBe(false);
|
||||
});
|
||||
|
||||
it('returns false when content is missing', () => {
|
||||
const r = new GenerateContentResponse();
|
||||
r.candidates = [{}];
|
||||
expect(hasUserVisibleContent(r)).toBe(false);
|
||||
});
|
||||
|
||||
it('returns false when parts is undefined', () => {
|
||||
const r = new GenerateContentResponse();
|
||||
r.candidates = [{ content: { role: 'model' } }];
|
||||
expect(hasUserVisibleContent(r)).toBe(false);
|
||||
});
|
||||
|
||||
it('returns false for usage-only / role-only chunks', () => {
|
||||
const r = new GenerateContentResponse();
|
||||
r.candidates = [{ content: { role: 'model', parts: [] } }];
|
||||
r.usageMetadata = { totalTokenCount: 42 };
|
||||
expect(hasUserVisibleContent(r)).toBe(false);
|
||||
});
|
||||
|
||||
it('handles parts that are non-objects defensively', () => {
|
||||
expect(
|
||||
hasUserVisibleContent(
|
||||
chunkWithParts([null, undefined, 'string', 42, { text: 'real' }]),
|
||||
),
|
||||
).toBe(true);
|
||||
expect(hasUserVisibleContent(chunkWithParts([null, undefined, 'x']))).toBe(
|
||||
false,
|
||||
);
|
||||
});
|
||||
});
|
||||
|
|
@ -0,0 +1,57 @@
|
|||
/**
|
||||
* @license
|
||||
* Copyright 2026 Qwen Team
|
||||
* SPDX-License-Identifier: Apache-2.0
|
||||
*/
|
||||
|
||||
import type { GenerateContentResponse } from '@google/genai';
|
||||
|
||||
/**
|
||||
* Detects whether a streaming chunk contains user-visible model output.
|
||||
*
|
||||
* Used by the LoggingContentGenerator stream wrapper to identify the first
|
||||
* chunk that should trigger TTFT (time-to-first-token) measurement.
|
||||
*
|
||||
* A chunk is "user-visible" if any normalized Part in candidates[0].content.parts
|
||||
* is one of:
|
||||
* - text with a non-empty string
|
||||
* - functionCall (tool use — even tool-call-only responses count)
|
||||
* - inlineData (image, binary blob)
|
||||
* - executableCode (sandbox / code-execution responses)
|
||||
* - thought / reasoning content (provider-dependent; o1, qwen thinking, Anthropic <thinking>)
|
||||
*
|
||||
* Chunks containing only role metadata, only usageMetadata (final summary
|
||||
* chunk), or empty parts are NOT user-visible — TTFT should not fire on these.
|
||||
*
|
||||
* Centralised here (single predicate over the normalized GenerateContentResponse
|
||||
* shape) so the four provider generators (Anthropic / OpenAI / Gemini / Qwen)
|
||||
* don't each need their own first-token logic. Each provider already normalizes
|
||||
* its native chunk shape to GenerateContentResponse before LoggingContentGenerator
|
||||
* sees it (see loggingContentGenerator.ts generateContentStream).
|
||||
*/
|
||||
export function hasUserVisibleContent(chunk: GenerateContentResponse): boolean {
|
||||
const parts = chunk.candidates?.[0]?.content?.parts;
|
||||
if (!parts || parts.length === 0) return false;
|
||||
return parts.some(isUserVisiblePart);
|
||||
}
|
||||
|
||||
function isUserVisiblePart(part: unknown): boolean {
|
||||
if (part === null || typeof part !== 'object') return false;
|
||||
const p = part as {
|
||||
text?: unknown;
|
||||
functionCall?: unknown;
|
||||
inlineData?: unknown;
|
||||
executableCode?: unknown;
|
||||
thought?: unknown;
|
||||
};
|
||||
if (typeof p.text === 'string' && p.text.length > 0) return true;
|
||||
if (p.functionCall !== undefined) return true;
|
||||
if (p.inlineData !== undefined) return true;
|
||||
if (p.executableCode !== undefined) return true;
|
||||
// `thought` is a boolean flag in this codebase — `true` means the part
|
||||
// carries reasoning content, false / absent means none (see loggingContentGenerator.ts
|
||||
// where `part.thought ? {...} : {}` is the canonical pattern). Match strict `=== true`
|
||||
// rather than checking presence — `thought: false` parts are explicitly NOT user-visible.
|
||||
if (p.thought === true) return true;
|
||||
return false;
|
||||
}
|
||||
|
|
@ -378,6 +378,214 @@ describe('session-tracing', () => {
|
|||
});
|
||||
});
|
||||
|
||||
describe('LLM request spans — Phase 4a (timing decomposition + GenAI dual-emit)', () => {
|
||||
it('startLLMRequestSpan dual-emits gen_ai.request.model alongside qwen-code.model', () => {
|
||||
const span = startLLMRequestSpan('test-model', 'p');
|
||||
endLLMRequestSpan(span, { success: true });
|
||||
|
||||
const attrs = mockSpans[0]!.attributes;
|
||||
expect(attrs['qwen-code.model']).toBe('test-model');
|
||||
expect(attrs['gen_ai.request.model']).toBe('test-model');
|
||||
});
|
||||
|
||||
it('endLLMRequestSpan dual-emits gen_ai.usage.input_tokens / output_tokens', () => {
|
||||
const span = startLLMRequestSpan('m', 'p');
|
||||
endLLMRequestSpan(span, {
|
||||
success: true,
|
||||
inputTokens: 100,
|
||||
outputTokens: 50,
|
||||
});
|
||||
|
||||
const attrs = mockSpans[0]!.attributes;
|
||||
expect(attrs['input_tokens']).toBe(100);
|
||||
expect(attrs['gen_ai.usage.input_tokens']).toBe(100);
|
||||
expect(attrs['output_tokens']).toBe(50);
|
||||
expect(attrs['gen_ai.usage.output_tokens']).toBe(50);
|
||||
});
|
||||
|
||||
it('endLLMRequestSpan dual-emits gen_ai.usage.cached_tokens when present', () => {
|
||||
const span = startLLMRequestSpan('m', 'p');
|
||||
endLLMRequestSpan(span, {
|
||||
success: true,
|
||||
inputTokens: 100,
|
||||
cachedInputTokens: 40,
|
||||
});
|
||||
|
||||
const attrs = mockSpans[0]!.attributes;
|
||||
expect(attrs['cached_input_tokens']).toBe(40);
|
||||
expect(attrs['gen_ai.usage.cached_tokens']).toBe(40);
|
||||
});
|
||||
|
||||
it('endLLMRequestSpan omits cached_input_tokens when undefined', () => {
|
||||
const span = startLLMRequestSpan('m', 'p');
|
||||
endLLMRequestSpan(span, { success: true, inputTokens: 100 });
|
||||
|
||||
const attrs = mockSpans[0]!.attributes;
|
||||
expect(attrs['cached_input_tokens']).toBeUndefined();
|
||||
expect(attrs['gen_ai.usage.cached_tokens']).toBeUndefined();
|
||||
});
|
||||
|
||||
it('endLLMRequestSpan emits cached_input_tokens === 0 (cache miss is meaningful info, not undefined)', () => {
|
||||
// Providers that report 0 cached tokens are signaling an explicit cache
|
||||
// miss. Distinct from undefined ("we don't know"). Both attribute names
|
||||
// must propagate the literal 0.
|
||||
const span = startLLMRequestSpan('m', 'p');
|
||||
endLLMRequestSpan(span, {
|
||||
success: true,
|
||||
inputTokens: 100,
|
||||
cachedInputTokens: 0,
|
||||
});
|
||||
|
||||
const attrs = mockSpans[0]!.attributes;
|
||||
expect(attrs['cached_input_tokens']).toBe(0);
|
||||
expect(attrs['gen_ai.usage.cached_tokens']).toBe(0);
|
||||
});
|
||||
|
||||
it('endLLMRequestSpan writes ttft_ms and dual-emits gen_ai.server.time_to_first_token (in seconds)', () => {
|
||||
const span = startLLMRequestSpan('m', 'p');
|
||||
endLLMRequestSpan(span, {
|
||||
success: true,
|
||||
ttftMs: 234,
|
||||
durationMs: 1000,
|
||||
});
|
||||
|
||||
const attrs = mockSpans[0]!.attributes;
|
||||
expect(attrs['ttft_ms']).toBe(234);
|
||||
// Spec uses seconds as double — 234ms → 0.234s
|
||||
expect(attrs['gen_ai.server.time_to_first_token']).toBeCloseTo(0.234, 6);
|
||||
});
|
||||
|
||||
it('endLLMRequestSpan omits ttft_ms when undefined (non-streaming or aborted before first chunk)', () => {
|
||||
const span = startLLMRequestSpan('m', 'p');
|
||||
endLLMRequestSpan(span, { success: true, durationMs: 500 });
|
||||
|
||||
const attrs = mockSpans[0]!.attributes;
|
||||
expect(attrs['ttft_ms']).toBeUndefined();
|
||||
expect(attrs['gen_ai.server.time_to_first_token']).toBeUndefined();
|
||||
expect(attrs['sampling_ms']).toBeUndefined();
|
||||
expect(attrs['output_tokens_per_second']).toBeUndefined();
|
||||
});
|
||||
|
||||
it('endLLMRequestSpan derives sampling_ms when ttftMs is set (no requestSetup)', () => {
|
||||
const span = startLLMRequestSpan('m', 'p');
|
||||
endLLMRequestSpan(span, {
|
||||
success: true,
|
||||
ttftMs: 200,
|
||||
durationMs: 1000,
|
||||
});
|
||||
|
||||
// sampling_ms = duration - ttft - (requestSetup ?? 0) = 1000 - 200 - 0
|
||||
expect(mockSpans[0]!.attributes['sampling_ms']).toBe(800);
|
||||
});
|
||||
|
||||
it('endLLMRequestSpan derives sampling_ms accounting for requestSetupMs (Phase 4b populates this)', () => {
|
||||
const span = startLLMRequestSpan('m', 'p');
|
||||
endLLMRequestSpan(span, {
|
||||
success: true,
|
||||
ttftMs: 200,
|
||||
requestSetupMs: 300,
|
||||
durationMs: 1000,
|
||||
});
|
||||
|
||||
// sampling_ms = 1000 - 200 - 300
|
||||
expect(mockSpans[0]!.attributes['sampling_ms']).toBe(500);
|
||||
});
|
||||
|
||||
it('endLLMRequestSpan clamps sampling_ms to 0 when ttft + setup exceed duration (clock skew)', () => {
|
||||
const span = startLLMRequestSpan('m', 'p');
|
||||
endLLMRequestSpan(span, {
|
||||
success: true,
|
||||
ttftMs: 800,
|
||||
requestSetupMs: 500,
|
||||
durationMs: 1000,
|
||||
});
|
||||
|
||||
// Math.max(0, 1000 - 800 - 500) = 0
|
||||
expect(mockSpans[0]!.attributes['sampling_ms']).toBe(0);
|
||||
});
|
||||
|
||||
it('endLLMRequestSpan derives output_tokens_per_second from sampling_ms + outputTokens', () => {
|
||||
const span = startLLMRequestSpan('m', 'p');
|
||||
endLLMRequestSpan(span, {
|
||||
success: true,
|
||||
ttftMs: 200,
|
||||
durationMs: 1200,
|
||||
outputTokens: 500,
|
||||
});
|
||||
|
||||
// sampling_ms = 1000ms = 1s; otps = 500 / 1.0 = 500
|
||||
expect(mockSpans[0]!.attributes['sampling_ms']).toBe(1000);
|
||||
expect(mockSpans[0]!.attributes['output_tokens_per_second']).toBe(500);
|
||||
});
|
||||
|
||||
it('endLLMRequestSpan rounds output_tokens_per_second to 2 decimals', () => {
|
||||
const span = startLLMRequestSpan('m', 'p');
|
||||
endLLMRequestSpan(span, {
|
||||
success: true,
|
||||
ttftMs: 200,
|
||||
durationMs: 1325, // sampling_ms = 1125
|
||||
outputTokens: 100, // otps = 100 / 1.125 = 88.888…
|
||||
});
|
||||
|
||||
expect(mockSpans[0]!.attributes['output_tokens_per_second']).toBe(88.89);
|
||||
});
|
||||
|
||||
it('endLLMRequestSpan omits output_tokens_per_second when sampling_ms == 0', () => {
|
||||
const span = startLLMRequestSpan('m', 'p');
|
||||
endLLMRequestSpan(span, {
|
||||
success: true,
|
||||
ttftMs: 1000,
|
||||
durationMs: 1000,
|
||||
outputTokens: 50,
|
||||
});
|
||||
|
||||
// sampling_ms = 0 → otps would be Infinity, must be omitted
|
||||
expect(mockSpans[0]!.attributes['sampling_ms']).toBe(0);
|
||||
expect(
|
||||
mockSpans[0]!.attributes['output_tokens_per_second'],
|
||||
).toBeUndefined();
|
||||
});
|
||||
|
||||
it('endLLMRequestSpan omits output_tokens_per_second when outputTokens missing', () => {
|
||||
const span = startLLMRequestSpan('m', 'p');
|
||||
endLLMRequestSpan(span, {
|
||||
success: true,
|
||||
ttftMs: 200,
|
||||
durationMs: 1000,
|
||||
});
|
||||
|
||||
expect(
|
||||
mockSpans[0]!.attributes['output_tokens_per_second'],
|
||||
).toBeUndefined();
|
||||
});
|
||||
|
||||
it('endLLMRequestSpan writes Phase 4b retry placeholders when caller provides them', () => {
|
||||
const span = startLLMRequestSpan('m', 'p');
|
||||
endLLMRequestSpan(span, {
|
||||
success: true,
|
||||
attempt: 3,
|
||||
requestSetupMs: 4500,
|
||||
retryTotalDelayMs: 4200,
|
||||
durationMs: 5000,
|
||||
});
|
||||
|
||||
const attrs = mockSpans[0]!.attributes;
|
||||
expect(attrs['attempt']).toBe(3);
|
||||
expect(attrs['request_setup_ms']).toBe(4500);
|
||||
expect(attrs['retry_total_delay_ms']).toBe(4200);
|
||||
});
|
||||
|
||||
it('endLLMRequestSpan omits Phase 4b fields when caller does not provide them (Phase 4a default)', () => {
|
||||
const span = startLLMRequestSpan('m', 'p');
|
||||
endLLMRequestSpan(span, { success: true, durationMs: 500 });
|
||||
|
||||
const attrs = mockSpans[0]!.attributes;
|
||||
expect(attrs['attempt']).toBeUndefined();
|
||||
expect(attrs['request_setup_ms']).toBeUndefined();
|
||||
expect(attrs['retry_total_delay_ms']).toBeUndefined();
|
||||
});
|
||||
});
|
||||
|
||||
describe('tool spans', () => {
|
||||
it('creates and ends a tool span', () => {
|
||||
const span = startToolSpan('ReadFile', { 'tool.call_id': 'call-1' });
|
||||
|
|
|
|||
|
|
@ -46,9 +46,46 @@ export interface EndInteractionOptions {
|
|||
export interface LLMRequestMetadata {
|
||||
inputTokens?: number;
|
||||
outputTokens?: number;
|
||||
/**
|
||||
* Tokens served from the provider's prompt cache (Anthropic
|
||||
* cache_read_input_tokens, OpenAI prompt_tokens_details.cached_tokens, etc).
|
||||
* Normalized to GenerateContentResponseUsageMetadata.cachedContentTokenCount
|
||||
* by each provider generator before reaching LoggingContentGenerator.
|
||||
*/
|
||||
cachedInputTokens?: number;
|
||||
success: boolean;
|
||||
durationMs?: number;
|
||||
error?: string;
|
||||
/**
|
||||
* Time from the successful attempt's request dispatch to the first stream
|
||||
* chunk containing user-visible content (text / functionCall / inlineData /
|
||||
* executableCode / thought). Undefined for non-streaming requests, requests
|
||||
* aborted before the first user-visible chunk, and any path that does not
|
||||
* pass through LoggingContentGenerator's stream wrapper.
|
||||
*
|
||||
* Semantics: diverges from claude-code's ttftMs (which fires on
|
||||
* Anthropic's message_start metadata event). Matches the "time to first
|
||||
* actual token" intent of the industry-standard TTFT name.
|
||||
* See docs/design/telemetry-llm-request-timing-design.md (D1).
|
||||
*/
|
||||
ttftMs?: number;
|
||||
/**
|
||||
* Time from generateContent/generateContentStream entry to the start of the
|
||||
* successful attempt (ms). Includes all failed retries + backoff sleeps.
|
||||
* Populated by the retry layer in Phase 4b; undefined in Phase 4a.
|
||||
*/
|
||||
requestSetupMs?: number;
|
||||
/**
|
||||
* Final attempt number (1-based). 1 = no retries. Populated by the retry
|
||||
* layer in Phase 4b; undefined in Phase 4a.
|
||||
*/
|
||||
attempt?: number;
|
||||
/**
|
||||
* Sum of all backoff delays before the successful attempt (ms). 0 if no
|
||||
* retries occurred. Populated by the retry layer in Phase 4b; undefined
|
||||
* in Phase 4a.
|
||||
*/
|
||||
retryTotalDelayMs?: number;
|
||||
}
|
||||
|
||||
export interface ToolSpanMetadata {
|
||||
|
|
@ -321,6 +358,10 @@ export function startLLMRequestSpan(model: string, promptId: string): Span {
|
|||
'qwen-code.model': model,
|
||||
'qwen-code.prompt_id': promptId,
|
||||
'llm_request.context': parentCtx ? 'interaction' : 'standalone',
|
||||
// Dual-emit OTel GenAI semantic convention (Stable). Private name
|
||||
// (qwen-code.model) remains authoritative; gen_ai.* is a compat layer
|
||||
// for spec-aware backends. See docs/design/telemetry-llm-request-timing-design.md (D8).
|
||||
'gen_ai.request.model': model,
|
||||
};
|
||||
|
||||
const span = getTracer().startSpan(
|
||||
|
|
@ -362,10 +403,55 @@ export function endLLMRequestSpan(
|
|||
const endAttributes: Attributes = { duration_ms: duration };
|
||||
|
||||
if (metadata) {
|
||||
if (metadata.inputTokens !== undefined)
|
||||
if (metadata.inputTokens !== undefined) {
|
||||
endAttributes['input_tokens'] = metadata.inputTokens;
|
||||
if (metadata.outputTokens !== undefined)
|
||||
// Dual-emit OTel GenAI semconv (Stable). Private name stays authoritative.
|
||||
endAttributes['gen_ai.usage.input_tokens'] = metadata.inputTokens;
|
||||
}
|
||||
if (metadata.outputTokens !== undefined) {
|
||||
endAttributes['output_tokens'] = metadata.outputTokens;
|
||||
endAttributes['gen_ai.usage.output_tokens'] = metadata.outputTokens;
|
||||
}
|
||||
if (metadata.cachedInputTokens !== undefined) {
|
||||
endAttributes['cached_input_tokens'] = metadata.cachedInputTokens;
|
||||
// Dual-emit OTel GenAI semconv (Experimental — may rename before Stable).
|
||||
endAttributes['gen_ai.usage.cached_tokens'] =
|
||||
metadata.cachedInputTokens;
|
||||
}
|
||||
if (metadata.ttftMs !== undefined) {
|
||||
endAttributes['ttft_ms'] = metadata.ttftMs;
|
||||
// Dual-emit OTel GenAI semconv (Experimental). Spec unit is seconds-as-double.
|
||||
endAttributes['gen_ai.server.time_to_first_token'] =
|
||||
metadata.ttftMs / 1000;
|
||||
}
|
||||
if (metadata.requestSetupMs !== undefined) {
|
||||
endAttributes['request_setup_ms'] = metadata.requestSetupMs;
|
||||
}
|
||||
if (metadata.attempt !== undefined) {
|
||||
endAttributes['attempt'] = metadata.attempt;
|
||||
}
|
||||
if (metadata.retryTotalDelayMs !== undefined) {
|
||||
endAttributes['retry_total_delay_ms'] = metadata.retryTotalDelayMs;
|
||||
}
|
||||
// Derived: sampling_ms = time from first user-visible chunk to end
|
||||
// (== output generation time, excluding setup + first-token delay).
|
||||
// Computable only when ttftMs is set. requestSetupMs defaults to 0
|
||||
// when undefined (no retries) — this gives the correct sampling
|
||||
// duration in both Phase 4a (no retry data) and Phase 4b (with).
|
||||
if (metadata.ttftMs !== undefined) {
|
||||
const samplingMs = Math.max(
|
||||
0,
|
||||
duration - metadata.ttftMs - (metadata.requestSetupMs ?? 0),
|
||||
);
|
||||
endAttributes['sampling_ms'] = samplingMs;
|
||||
// Derived: output tokens per second during sampling. Undefined when
|
||||
// sampling_ms is 0 (avoid divide-by-zero) or when outputTokens missing.
|
||||
if (samplingMs > 0 && metadata.outputTokens !== undefined) {
|
||||
endAttributes['output_tokens_per_second'] =
|
||||
Math.round((metadata.outputTokens / (samplingMs / 1000)) * 100) /
|
||||
100;
|
||||
}
|
||||
}
|
||||
endAttributes['success'] = metadata.success;
|
||||
if (metadata.error !== undefined)
|
||||
endAttributes['error'] = truncateSpanError(metadata.error);
|
||||
|
|
|
|||
Loading…
Add table
Add a link
Reference in a new issue