airi/apps/server/docs/ai-context/observability-metrics.md
RainbowBird c627bce9c9
refactor(server): split services into domain/adapter layers, drop dead code
2026-05-18 23:36:45 +08:00


Metrics Catalog

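A complete catalog of every metric the server currently emits, grouped by business domain.

For naming rules, the `airi.*` namespace boundary, and attribute selection, see `observability-conventions.md`. This document only covers which metrics exist and how to query them.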

Mapping metric names to Prometheus series

When exporting to Prometheus, the OTel SDK applies the following transformations:

  1. `.` becomes `_`: `airi.billing.flux.consumed` → `airi_billing_flux_consumed`
  2. Counters get a `_total` suffix: `auth.attempts` → `auth_attempts_total`
  3. Histograms split into three series: `http.server.request.duration` →
    • `http_server_request_duration_seconds_bucket` (with the `le` label)
    • `http_server_request_duration_seconds_count`
    • `http_server_request_duration_seconds_sum`
  4. UpDownCounter / ObservableGauge do not get `_total`: `ws.connections.active` → `ws_connections_active`, `user.active_sessions` → `user_active_sessions`
  5. For instruments that declare a unit, the exporter inserts the unit into the name: `airi.stripe.revenue` (unit `minor_unit`) → `airi_stripe_revenue_minor_unit_total`

If you are unsure of the suffix when writing a dashboard query, probe first with a regex selector such as `{__name__=~"airi_billing_flux.*"}`.
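The renaming rules above can be sketched as a small function. This is an illustrative helper, not part of the server or of the OTel SDK: `toPrometheusSeries` and `InstrumentKind` are hypothetical names, and only the cases listed above are modeled (the real exporter handles more units and edge cases).

```typescript
// Instrument kinds covered by the rules above.
type InstrumentKind = "counter" | "histogram" | "updowncounter" | "gauge";

// Hypothetical helper: derive the Prometheus series names for an OTel
// instrument name, following only the five rules listed in this section.
function toPrometheusSeries(name: string, kind: InstrumentKind, unit?: string): string[] {
  // Rule 1: dots become underscores.
  let base = name.replace(/\./g, "_");
  // Rule 5: a declared unit is appended to the name ("s" is spelled out as "seconds").
  if (unit !== undefined && unit !== "") {
    base += "_" + (unit === "s" ? "seconds" : unit);
  }
  // Rule 2: counters get the _total suffix.
  if (kind === "counter") return [base + "_total"];
  // Rule 3: histograms expand into bucket / count / sum.
  if (kind === "histogram") return [base + "_bucket", base + "_count", base + "_sum"];
  // Rule 4: UpDownCounter and gauges keep the bare name.
  return [base];
}
```

For example, `toPrometheusSeries("airi.stripe.revenue", "counter", "minor_unit")` yields the single series `airi_stripe_revenue_minor_unit_total`, matching rule 5.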

HTTP (from instrumentation-http)

| Metric | Type | Unit | Source | Key attributes |
| --- | --- | --- | --- | --- |
| `http.server.request.duration` | Histogram | s | `@hono/otel` `httpInstrumentationMiddleware` in `app.ts` | `http.request.method`, `http.route`, `http.response.status_code` |
| `http.server.active_requests` | UpDownCounter | | same as above | `http.request.method` |

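Inbound traffic goes through `@hono/otel`; outbound traffic goes through the auto `HttpInstrumentation`. When auto instrumentation captures data at the Node `http` layer, Hono has not matched a route yet, so the `http.route` label is always empty. `@hono/otel` runs inside the Hono middleware chain and can read the matched route pattern (`/api/v1/users/:id` rather than the concrete URL), so it produces the inbound metrics. In `instrumentation.ts`, the auto `HttpInstrumentation` is configured with `ignoreIncomingRequestHook: () => true`, which keeps only the outbound side (LLM gateway, Stripe, Resend); that traffic still needs it for tracking.

STABLE-only: `instrumentation.ts` injects `OTEL_SEMCONV_STABILITY_OPT_IN=http` early. The OLD series (`http.server.duration`, in ms) is no longer emitted. See the SemconvStability section of `observability-conventions.md`.

`/livez` and `/readyz` are explicitly skipped in the `@hono/otel` wrapper in `app.ts`, so K8s-style probes do not produce metrics.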

Auth & Users

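All of these are triggered by Better Auth hooks in `libs/auth.ts`, except `user.active_sessions`, which is registered in `app.ts`.

| Metric | Type | Where (hook) | Labels |
| --- | --- | --- | --- |
| `auth.attempts` | Counter | before hook, when the path contains `/sign-in` or `/sign-up` | `auth.method` (last path segment) |
| `auth.failures` | Counter | after hook, when `ctx.context.returned` is an error | `auth.method` |
| `user.registered` | Counter | `databaseHooks.user.create.after` | |
| `user.login` | Counter | `databaseHooks.session.create.after` | |
| `user.active_sessions` | ObservableGauge | `app.ts` `registerActiveSessionsGauge`; at scrape time it runs `SELECT COUNT(*) FROM session WHERE expires_at > NOW()` (10 s in-memory cache) | |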

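`user.active_sessions` is a cluster-wide gauge: dashboards must aggregate it with `max()` / `avg()`, never `sum()`. Every replica reads the same DB and reports the same value, so `sum` multiplies it by the replica count. See the Multi-Replica section of `observability-conventions.md`.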
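The 10 s in-memory cache in front of the `COUNT(*)` query could look roughly like the sketch below. It is an assumption about the shape, not the code in `app.ts`: `cachedFor` is a hypothetical helper, and the clock is injectable only so the behavior can be checked without waiting.

```typescript
// Hypothetical helper: memoize a loader for ttlMs milliseconds.
// `now` defaults to Date.now and can be replaced by a fake clock.
function cachedFor<T>(ttlMs: number, load: () => T, now: () => number = Date.now): () => T {
  let entry: { value: T; loadedAt: number } | undefined;
  return () => {
    const t = now();
    // Reload on first use or once the cached entry is older than the TTL.
    if (entry === undefined || t - entry.loadedAt >= ttlMs) {
      entry = { value: load(), loadedAt: t };
    }
    return entry.value;
  };
}
```

Because `T` is generic, the loader may return a `Promise<number>` for the real asynchronous query; caching the promise means concurrent scrapes within the TTL share one DB round trip.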

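History: this used to be an UpDownCounter (+1 on login, -1 on logout), but Better Auth does not call the delete hook when a session's TTL expires, so the counter drifts even on a single instance. With multiple replicas, a login on A and a logout on B tears the per-replica values apart into positive and negative numbers. It was therefore replaced with a DB-backed gauge.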

Engagement

| Metric | Type | Where | Labels |
| --- | --- | --- | --- |
| `chat.messages` | Counter | `services/domain/chats.ts` `pushMessages` | |
| `character.created` | Counter | `services/domain/characters.ts` | |
| `character.deleted` | Counter | same as above | |
| `character.engagement` | Counter | same as above (like/bookmark) | `action` (like / unlike / bookmark / unbookmark) |
| `ws.connections.active` | ObservableGauge | `routes/chat-ws/index.ts`; `addCallback` walks the `userConnections` Map | |
| `ws.messages.sent` | Counter | same as above | |
| `ws.messages.received` | Counter | `services/domain/chats.ts` | |

Revenue & Billing

Stripe lifecycle

| Metric | Type | Where | Labels |
| --- | --- | --- | --- |
| `stripe.checkout.created` | Counter | `routes/stripe/index.ts` `/checkout` POST | |
| `stripe.checkout.completed` | Counter | webhook `checkout.session.completed` | |
| `stripe.payment.failed` | Counter | webhook `invoice.payment_failed` | |
| `stripe.subscription.event` | Counter | webhook `customer.subscription.*` | `event_type` (created/updated/deleted) |
| `stripe.events` | Counter | any webhook | `event_type` (the full `event.type`, e.g. `invoice.paid`) |
| `airi.stripe.revenue` | Counter (`minor_unit`) | webhook `checkout.session.completed` + `invoice.paid` | `currency`, `source` (checkout/invoice) |

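Amount unit: `airi.stripe.revenue` is recorded in the smallest currency unit (cents and so on). Summing across currencies is meaningless, so always use `sum by (currency)`. To convert to the major unit (dollars and so on), divide by 100, provided the currency does not use a different minor-unit ratio.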
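The caveat about minor-unit ratios can be made concrete with a small conversion sketch. The exponent table below is illustrative and deliberately incomplete (ISO 4217 defines 2 decimals for USD and EUR, 0 for JPY, 3 for KWD); `toMajorUnits` and `sumByCurrency` are hypothetical helpers, not server code.

```typescript
// Decimal places per currency (ISO 4217 minor-unit exponent).
// Illustrative subset only; extend it for every currency actually in use.
const MINOR_UNIT_EXPONENT: Record<string, number> = { usd: 2, eur: 2, jpy: 0, kwd: 3 };

// Convert an amount in minor units to major units for one currency.
function toMajorUnits(amountMinor: number, currency: string): number {
  const exponent = MINOR_UNIT_EXPONENT[currency.toLowerCase()];
  if (exponent === undefined) throw new Error("unknown currency: " + currency);
  return amountMinor / Math.pow(10, exponent);
}

// Client-side mirror of `sum by (currency)`: amounts of different
// currencies are never added together.
function sumByCurrency(samples: { currency: string; amountMinor: number }[]): Map<string, number> {
  const totals = new Map<string, number>();
  for (const s of samples) {
    totals.set(s.currency, (totals.get(s.currency) ?? 0) + s.amountMinor);
  }
  return totals;
}
```

Dividing by 100 is only correct for the two-decimal currencies; a JPY amount of 500 minor units is already 500 yen.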

Flux ledger

| Metric | Type | Where | Labels |
| --- | --- | --- | --- |
| `airi.billing.flux.consumed` | Counter | `routes/openai/v1/index.ts` `recordMetrics` (chat / tts) | `gen_ai.request.model`, `gen_ai.operation.name` / `airi.gen_ai.operation.kind`, `http.response.status_code` |
| `airi.billing.flux.credited` | Counter | `services/domain/billing/billing-service.ts`, the three crediting paths | `source` (stripe.checkout/stripe.invoice/promo/admin_grant/...), `type` (credit/promo) |
| `airi.billing.flux.unbilled` | Counter | `routes/openai/v1/index.ts`, the catch around a failed `consumeFluxForLLM` in the streaming path | `gen_ai.request.model`, `reason` (debit_failed), `stage` (streaming) |
| `flux.insufficient_balance` | Counter | `services/domain/billing/billing-service.ts` `debitFlux` | |
| `airi.billing.tts.chars` | Counter | `services/domain/billing/flux-meter.ts` `accumulate` | `meter` (tts), `model` |
| `airi.billing.tts.preflight_rejections` | Counter | `flux-meter.ts` `assertCanAfford` | `meter`, `reason` (insufficient_balance) |

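`airi.billing.flux.unbilled` is the golden P0 alerting signal: the streaming response has already been delivered to the user (HTTP 200, tokens already streamed out), but the post-stream debit throws. The response path does not turn into a 5xx because of it, and DB latency is only noticeable at the instant of the catch, so HTTP and DB alerts cannot cover this silent revenue leak. Recommended alert: `increase(airi_billing_flux_unbilled_total[5m]) > 0`; page immediately whenever it stays above 0.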
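The failure mode described above follows a simple pattern: the debit runs after the stream has finished, so its failure can only be counted, not returned to the client. A minimal sketch, assuming a counter object with an `add(value, attributes)` method; `settleAfterStream` and `CounterLike` are hypothetical names, and `debit` stands in for `consumeFluxForLLM`.

```typescript
type Attributes = Record<string, string>;

// Minimal counter shape used by this sketch.
interface CounterLike {
  add(value: number, attributes?: Attributes): void;
}

// Hypothetical post-stream settlement step.
// Resolves to true when the debit succeeded, false when usage went unbilled.
async function settleAfterStream(
  debit: () => Promise<void>,
  unbilled: CounterLike,
  model: string,
): Promise<boolean> {
  try {
    await debit();
    return true;
  } catch {
    // The client already received HTTP 200 and the whole stream, so this
    // failure cannot surface as a 5xx. Count it instead of rethrowing.
    unbilled.add(1, { "gen_ai.request.model": model, reason: "debit_failed", stage: "streaming" });
    return false;
  }
}
```

Since nothing else in the request path observes this branch, the counter is the only signal, which is why the alert fires on any increase rather than on a rate threshold.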

GenAI

| Metric | Type | Unit | Where | Labels |
| --- | --- | --- | --- | --- |
| `gen_ai.client.operation.duration` | Histogram | s | `routes/openai/v1/index.ts` `recordMetrics` | `gen_ai.request.model`, `gen_ai.operation.name` / `airi.gen_ai.operation.kind`, `http.response.status_code` |
| `gen_ai.client.operation.count` | Counter | | same as above | same as above |
| `gen_ai.client.token.usage.input` | Counter | | same as above | same as above |
| `gen_ai.client.token.usage.output` | Counter | | same as above | same as above |
| `gen_ai.client.first_token.duration` | Histogram | s | when the first non-empty chunk reaches the streaming reader | `gen_ai.request.model`, `gen_ai.operation.name` |
| `airi.gen_ai.stream.interrupted` | Counter | | catch in the streaming reader | `gen_ai.request.model`, `stage` (before_first_chunk/mid_stream) |
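The last two rows hinge on one piece of state: whether a non-empty chunk has been seen yet. The sketch below shows that logic over a synchronous iterable to stay short; the real reader is asynchronous, and `consumeStream`, `StreamMetrics`, and the decision to rethrow after counting are assumptions made for illustration.

```typescript
interface StreamMetrics {
  recordFirstToken(seconds: number): void;
  countInterrupted(stage: "before_first_chunk" | "mid_stream"): void;
}

// Hypothetical reader loop over already-decoded text chunks.
function consumeStream(
  chunks: Iterable<string>,
  metrics: StreamMetrics,
  now: () => number = Date.now,
): string {
  const startedAt = now();
  let sawFirstChunk = false;
  let text = "";
  try {
    for (const chunk of chunks) {
      // Empty chunks do not count as a first token.
      if (!sawFirstChunk && chunk.length > 0) {
        sawFirstChunk = true;
        metrics.recordFirstToken((now() - startedAt) / 1000);
      }
      text += chunk;
    }
  } catch (err) {
    // The stage label depends only on whether any content arrived.
    metrics.countInterrupted(sawFirstChunk ? "mid_stream" : "before_first_chunk");
    throw err;
  }
  return text;
}
```

Splitting the stage this way separates upstream failures that produced nothing from ones that cut off a partially delivered answer.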

Email (Resend)

Source: `services/adapters/email.ts`, the try/catch inside `send()`.

| Metric | Type | Labels |
| --- | --- | --- |
| `airi.email.send` | Counter | `template` (verification/password_reset/magic_link/change_email/delete_account/unknown) |
| `airi.email.failures` | Counter | `template`, `error_name` (Resend `error.name`, or `unhandled`) |
| `airi.email.duration` | Histogram (s) | `template`, `outcome` (ok/error) |
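How the three email instruments relate can be sketched as a wrapper around the delivery call. This is a guess at the structure, not the contents of `email.ts`: `sendWithMetrics` and `EmailMetrics` are hypothetical, `deliver` stands in for the provider call and is assumed to resolve to an object whose `error` is either `null` or carries a `name`, and counting `airi.email.send` on every attempt is an assumption.

```typescript
type Attrs = Record<string, string>;

interface EmailMetrics {
  send: { add(value: number, attrs: Attrs): void };
  failures: { add(value: number, attrs: Attrs): void };
  duration: { record(value: number, attrs: Attrs): void };
}

// Hypothetical wrapper: resolves to true when the email was accepted.
async function sendWithMetrics(
  template: string,
  deliver: () => Promise<{ error: { name: string } | null }>,
  metrics: EmailMetrics,
  now: () => number = Date.now,
): Promise<boolean> {
  const startedAt = now();
  metrics.send.add(1, { template }); // assumption: counted per attempt
  let errorName: string | undefined;
  try {
    const result = await deliver();
    // Provider-reported error: keep its name as the label value.
    if (result.error !== null) errorName = result.error.name;
  } catch {
    // Anything thrown (network failure, bug) is grouped under one value.
    errorName = "unhandled";
  }
  if (errorName !== undefined) {
    metrics.failures.add(1, { template, error_name: errorName });
  }
  const outcome = errorName === undefined ? "ok" : "error";
  metrics.duration.record((now() - startedAt) / 1000, { template, outcome });
  return errorName === undefined;
}
```

Recording the duration on both outcomes keeps slow failures (for example timeouts) visible in the latency panel instead of hiding them behind the failure counter.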

Rate limiting

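Source: the handler in `middlewares/rate-limit.ts`.

| Metric | Type | Labels |
| --- | --- | --- |
| `airi.rate_limit.blocked` | Counter | `route` (provided by the callsite, e.g. auth.api / openai.completions / stripe.checkout), `key_type` (user/ip), `limit` (maximum number of requests within the window) |

Note: `route` is a stable label provided explicitly by the callsite, not the raw URL path. URL paths are high-cardinality and would blow up the series count. When adding a new rate limiter, remember to pass `routeLabel`.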
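A sketch of how a limiter might attach these labels when it blocks a request. The fixed-window algorithm, `createRateLimiter`, and the option names other than `routeLabel` are assumptions chosen for illustration; the source only specifies the label set.

```typescript
type BlockedAttrs = { route: string; key_type: "user" | "ip"; limit: number };

// Hypothetical fixed-window limiter that reports blocked requests through
// `onBlocked` with the label set listed in the table above.
function createRateLimiter(
  opts: { routeLabel: string; limit: number; windowMs: number },
  onBlocked: (attrs: BlockedAttrs) => void,
  now: () => number = Date.now,
): (key: { userId?: string; ip: string }) => boolean {
  const hits = new Map<string, { windowStart: number; count: number }>();
  return (key) => {
    // Prefer the user id when the caller is authenticated, else fall back to IP.
    const keyType: "user" | "ip" = key.userId !== undefined ? "user" : "ip";
    const id = keyType + ":" + (key.userId ?? key.ip);
    const t = now();
    let entry = hits.get(id);
    if (entry === undefined || t - entry.windowStart >= opts.windowMs) {
      entry = { windowStart: t, count: 0 };
      hits.set(id, entry);
    }
    entry.count += 1;
    if (entry.count <= opts.limit) return true;
    // The label is the stable routeLabel from the callsite, never the URL path.
    onBlocked({ route: opts.routeLabel, key_type: keyType, limit: opts.limit });
    return false;
  };
}
```

Note that the user id and IP are used only as the bucket key; neither appears as a label, which keeps the series count bounded by the number of callsites.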

Node.js Runtime

These come from `@opentelemetry/instrumentation-runtime-node`. The list below is the subset used on dashboards, not the full set:

  • v8js.memory.heap.{used,limit,space.physical_size,space.available_size} Gauge / bytes
  • nodejs.eventloop.delay.{p50,p99,mean,...} Gauge / s
  • nodejs.eventloop.utilization Gauge / ratio
  • v8js.gc.duration Histogram / s

Dashboard row mapping (shipped)

`airi-server-overview-cloud.json` is generated by `build.ts`. Direct edits to the JSON are overwritten on the next regeneration, so edit `build.ts` instead and run `pnpm -F @proj-airi/server otel:dashboards` to regenerate. Rows from top to bottom:

| Row | Viz | Key metrics |
| --- | --- | --- |
| Service Health | stat / gauge | `user.active_sessions` (`max()`), `ws.connections.active` (`sum()`), `http.server.request.duration_count` (req/s + 5xx%), `gen_ai.client.operation.count`, `airi.email.{send,failures}` failure rate |
| Distribution (now) | donut | HTTP methods / LLM models / HTTP status codes, `increase([5m])` |
| Traffic Trends | timeseries | same data as Distribution, over time |
| Latency | timeseries | `http.server.request.duration_bucket` (P95 by route), `gen_ai.client.first_token.duration_bucket` (P95 by model) |
| Errors / Quality | mix | 4xx/5xx stacked area, `airi.gen_ai.stream.interrupted`, `airi.rate_limit.blocked` |
| Business | stat / gauge / donut | `airi.stripe.revenue` (by currency), checkout conversion %, `stripe.events` distribution |
| Infrastructure (collapsed, by `service_instance_id`) | timeseries | `db_client_operation_duration` P95 (cluster), `db_client_connection_count`, `v8js_memory_heap_used_bytes` %, `nodejs_eventloop_delay_p99_seconds` |
| Logs | logs | Loki, not Prometheus |

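Multi-replica aggregation: every panel in `build.ts` already uses the correct aggregator from the replica-safety table in `observability-conventions.md` (Counters use `sum(rate)`, cluster-wide gauges use `max()`, per-process gauges use `sum()`, and infra troubleshooting panels group `by (service_instance_id)`). When adding a new panel, check it against that table.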
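The aggregation choices above can be written down as a tiny query builder. This is a sketch, not an excerpt from `build.ts`: `promqlFor`, `p95By`, and the `PanelKind` names are hypothetical, and using `max` as the aggregator for the per-instance infra case is an assumption (the source only says those panels group by `service_instance_id`).

```typescript
type PanelKind = "counter" | "cluster_gauge" | "process_gauge" | "infra";

// Hypothetical builder: pick a replica-safe PromQL expression per panel kind.
function promqlFor(kind: PanelKind, series: string, range: string = "5m"): string {
  switch (kind) {
    case "counter":
      // Counters: rate per replica, then add replicas together.
      return `sum(rate(${series}[${range}]))`;
    case "cluster_gauge":
      // Every replica reports the same value, so take one of them.
      return `max(${series})`;
    case "process_gauge":
      // Each replica reports its own share, so add them up.
      return `sum(${series})`;
    case "infra":
      // Keep one line per replica for troubleshooting (aggregator assumed).
      return `max by (service_instance_id) (${series})`;
  }
}

// P95 from histogram buckets, grouped by one label (e.g. http_route).
function p95By(bucketSeries: string, label: string, range: string = "5m"): string {
  return `histogram_quantile(0.95, sum by (le, ${label}) (rate(${bucketSeries}[${range}])))`;
}
```

For the Service Health row, `promqlFor("cluster_gauge", "user_active_sessions")` gives `max(user_active_sessions)`, while `ws_connections_active` would use the `process_gauge` form.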

Verifying that a metric is registered

Run `src/scripts/otel/smoke.ts` once:

pnpm -F @proj-airi/server exec node --import tsx ./src/scripts/otel/smoke.ts

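It prints the names of all instruments that are exported immediately at SDK startup. Counters show up here once they have been primed with `.add(0)` (`primeCounter` in `otel/index.ts`). Histograms do not; they only appear after a real `.record()` call.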
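Why priming matters can be shown with a toy model of a meter that only exports instruments holding at least one data point. `ToyMeter` is a hypothetical stand-in written for this sketch, not the OTel SDK, and `primeCounter` here only mirrors the idea of the helper in `otel/index.ts`.

```typescript
// Toy model: an instrument is exported only after it has received a data point.
class ToyMeter {
  private readonly withData = new Set<string>();

  createCounter(name: string): { add(value: number): void } {
    return { add: (_value: number) => { this.withData.add(name); } };
  }

  createHistogram(name: string): { record(value: number): void } {
    return { record: (_value: number) => { this.withData.add(name); } };
  }

  exportedNames(): string[] {
    return Array.from(this.withData).sort();
  }
}

// Adding 0 creates a data point without changing the counter's value.
function primeCounter(counter: { add(value: number): void }): void {
  counter.add(0);
}
```

A histogram has no equivalent no-op: recording any value, even 0, would add a sample to its count, so it stays absent until real traffic arrives.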

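Checklist for adding a new metric

  1. Decide the namespace: if the metric maps to OTel semconv, use the standard name; otherwise put it under `airi.*` (do not invent a new top-level prefix)
  2. Add a constant in `utils/observability.ts`
  3. Add a field to the matching metric group interface in `otel/index.ts` (`HttpMetrics`/`AuthMetrics`/...) and create the instrument with `meter.create*` in `initOtel`
  4. If it is a Counter, add a line to the `primeCounter` call list; otherwise the panel looks like it has "no data" under low traffic
  5. At the callsite, obtain the metrics object via DI and call `.add()` / `.record()`
  6. Run `pnpm -F @proj-airi/server exec node --import tsx ./src/scripts/otel/smoke.ts` to confirm registration
  7. Update the corresponding section of this document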