--- title: "Evaluations (Evals)" version: 3.8.1 lastUpdated: 2026-05-13 --- # Evaluations (Evals) > **Source of truth:** `src/lib/evals/`, `src/lib/db/evals.ts`, `src/app/api/evals/` > **Last updated:** 2026-05-13 — v3.8.0 OmniRoute ships a generic evaluation framework you can use to benchmark routing configurations, single providers/models, or the bundled "golden set" suites. Use it to verify routing changes, validate new providers, and gate releases before promoting them to production traffic. The framework is implemented as: - A pure runner (`src/lib/evals/evalRunner.ts`) that registers in-memory built-in suites, evaluates outputs against expected criteria, and aggregates scorecards. - A persistence layer (`src/lib/db/evals.ts`) for custom (user-defined) suites and historical runs in SQLite. - An orchestration layer (`src/lib/evals/runtime.ts`) that executes each case by dispatching real calls to `POST /v1/chat/completions`, captures latency and outputs, and persists the run. - REST endpoints under `/api/evals/*` (management-auth only). - A dashboard surface at `Dashboard → Usage → Evals` (`EvalsTab.tsx`). ## Concepts ### Suite A suite is a named collection of test cases with a `description` and one or more cases. Suites come from two sources: | Source | Where defined | Mutable at runtime? | | ---------- | --------------------------------------------- | ------------------- | | `built-in` | Registered via `registerSuite()` at boot | No (code-defined) | | `custom` | Stored in SQLite `eval_suites` + `eval_cases` | Yes (via API/UI) | The current built-in suites (see `src/lib/evals/evalRunner.ts`): - `golden-set` — 10 baseline cases across greeting/math/translation/safety - `coding-proficiency` — Python/JS/SQL/TS/bug detection - `reasoning-logic` — syllogisms, word problems, pattern recognition - `multilingual` — translation and language detection - `safety-guardrails` — PII, jailbreak, refusal, bias awareness - `instruction-following` — JSON-only, numbered lists, language constraints - `codex-comparison` — head-to-head coding tasks intended for compare mode ### Case Each case carries: | Field | Description | | ---------- | ------------------------------------------------------------ | | `id` | Stable identifier (used to key outputs and metrics) | | `name` | Human-readable label | | `model` | Default model when the run uses `suite-default` targeting | | `input` | `{ messages, max_tokens? }` — sent to `/v1/chat/completions` | | `expected` | `{ strategy, value }` — scoring rubric (see below) | | `tags` | Optional labels (e.g. `safety`, `pii`, `jailbreak`) | ### Target The same suite can be run against different targets. The target schema is `evalTargetSchema` in `src/shared/validation/schemas.ts`: | Target type | `id` | Behavior | | --------------- | ---------- | --------------------------------------------------------------- | | `suite-default` | `null` | Each case uses its built-in `model` field | | `model` | model name | Force every case through one direct model (e.g. `gpt-4o`) | | `combo` | combo name | Run every case through one combo (exercises the routing engine) | For `model` and `combo`, the `id` field is required (enforced by Zod `superRefine`). When `compareTarget` is provided, both targets must differ — the runner persists both runs under the same `runGroupId` for A/B comparison. ## Scoring Rubrics Implemented in `evaluateCase()` (evalRunner.ts): | Strategy | Pass when… | | ---------- | -------------------------------------------------------------------- | | `exact` | `actualOutput === expected.value` | | `contains` | `actualOutput.toLowerCase().includes(expected.value.toLowerCase())` | | `regex` | `new RegExp(expected.value).test(actualOutput)` is truthy | | `custom` | `expected.fn(actualOutput, evalCase)` returns truthy (built-in only) | **Note:** Custom-function scoring is reserved for code-defined (built-in) suites because functions cannot be serialized through the API. The `evalCaseBuilderSchema` only accepts `contains | exact | regex` for user-created suites. There is no LLM-as-judge or embedding-based similarity scorer today — it would be a clean extension point in `evaluateCase()`. ## Database Schema Three tables (migrations `030_create_eval_runs.sql` and `031_create_eval_suites.sql`): | Table | Purpose | | ------------- | ---------------------------------------------------------------------------------------------------------------------------- | | `eval_suites` | Custom suite metadata (`id`, `name`, `description`) | | `eval_cases` | Cases per suite — `input_json`, `expected_*`, `tags_json` | | `eval_runs` | Historical runs — `pass_rate`, `total`, `passed`, `failed`, `avg_latency_ms`, `summary_json`, `results_json`, `outputs_json` | Built-in suites are **not** stored in the DB. They live in memory and are re-registered every time `evalRunner.ts` is imported. ## REST API All endpoints require management auth (`requireManagementAuth`) — they are not part of the public proxy surface. | Endpoint | Method | Description | | ----------------------------- | -------- | ------------------------------------------------------------- | | `/api/evals` | `GET` | List suites + recent runs + scorecard + targets + keys | | `/api/evals` | `POST` | Run a suite (single or compare) — schema `evalRunSuiteSchema` | | `/api/evals/{suiteId}` | `GET` | Fetch one suite (built-in or custom) | | `/api/evals/suites` | `POST` | Create a custom suite — schema `evalSuiteSaveSchema` | | `/api/evals/suites/{suiteId}` | `GET` | Fetch a custom suite | | `/api/evals/suites/{suiteId}` | `PUT` | Replace a custom suite (cases get re-inserted) | | `/api/evals/suites/{suiteId}` | `DELETE` | Delete a custom suite and its cases | ### Running a suite ```bash curl -X POST http://localhost:20128/api/evals \ -H "Cookie: auth_token=..." \ -H "Content-Type: application/json" \ -d '{ "suiteId": "golden-set", "target": { "type": "combo", "id": "my-combo" }, "apiKeyId": "optional-api-key-uuid" }' ``` Optional fields: - `outputs` — `Record` of pre-computed outputs. When provided, the runner **skips dispatch** and only scores the cached outputs (useful for offline evaluation). - `compareTarget` — second target to run in parallel; both runs share a generated `runGroupId` for head-to-head viewing. - `apiKeyId` — internal API key used to authenticate the dispatched `/v1/chat/completions` calls. Required when `REQUIRE_API_KEY` is enabled. ### Creating a custom suite ```bash curl -X POST http://localhost:20128/api/evals/suites \ -H "Cookie: auth_token=..." \ -H "Content-Type: application/json" \ -d '{ "name": "Production smoke", "description": "Quick sanity check before deploy", "cases": [ { "name": "JSON shape", "model": "gpt-4o", "input": { "messages": [{ "role": "user", "content": "Reply with {\"ok\": true}" }] }, "expected": { "strategy": "regex", "value": "\"ok\"\\s*:\\s*true" } } ] }' ``` ## Dispatch Pipeline `runEvalSuiteAgainstTarget()` (`src/lib/evals/runtime.ts`): 1. Resolves the suite (built-in or custom). 2. For each case, builds a `Request` to `/v1/chat/completions` with the case's `messages`, the resolved `model`, `stream: false`, and `max_tokens: 512` (or the case override). 3. Calls the chat handler directly (in-process — no extra HTTP hop). 4. Captures latency and extracts text from either `choices[0].message.content` or the Responses-API `output[]` payload. 5. Scores all outputs via `runSuite()`, then persists via `saveEvalRun()`. Cases run **sequentially**. There is no concurrency flag today. ## Dashboard The UI lives at `Dashboard → Usage → Evals` (`src/app/(dashboard)/dashboard/usage/components/EvalsTab.tsx`). From there you can: - Browse built-in and custom suites with case-by-case preview. - Create/edit/delete custom suites with the case builder. - Pick a target (suite defaults / model / combo), optionally a second `compareTarget`, optionally an API key, then run on demand. - Inspect run history, per-case pass/fail, latency, and captured outputs. - See the rolling scorecard aggregated across the latest run per `(suite, target)` scope. ## Relationship with the Auto-Assessment RFC A separate, narrower assessment subsystem lives at `src/domain/assessment/` (see also [AUTO-COMBO.md](../routing/AUTO-COMBO.md) for the live scoring engine). That subsystem targets the Auto Combo engine — automatically scoring providers and models so combos can self-heal when upstreams fail. It uses its own runner, its own categorizer, and its own scoring logic. The Evals framework documented here is the **broader, general-purpose testing surface**. Prefer it for arbitrary regression suites, A/B comparisons, and per-release smoke tests. Use the Auto-Assessment subsystem when you need real-time provider health to influence routing decisions. ## CI Integration There is no dedicated `eval:ci` npm script today. Two paths if you want to gate releases on eval results: - **HTTP path**: stand up the server, hit `POST /api/evals` with a known `suiteId` + `target`, and assert `runs[].summary.passRate >= N` in the response. - **In-process path**: import `runEvalSuiteAgainstTarget()` from `@/lib/evals/runtime` from a script, run against a test DB, and check the returned `PersistedEvalRun.summary`. Tests covering the route and history live at `tests/unit/evals-route.test.ts` and `tests/unit/evals-history.test.ts`. ## Extension Points Common changes and where to make them: - **New scoring strategy** — extend the `switch (evalCase.expected.strategy)` block in `evaluateCase()` (`evalRunner.ts`) and widen `EvalCaseStrategy` in `src/lib/db/evals.ts` plus `evalCaseBuilderSchema` in `schemas.ts`. - **New built-in suite** — define a suite object and call `registerSuite()` at the bottom of `evalRunner.ts`. It will be auto-discovered by `listSuites()`. - **Run with concurrency** — change the sequential `for` loop in `runEvalSuiteAgainstTarget()` to a bounded `Promise.all` (no concurrency control exists today). - **Stream/tool-call cases** — currently the runner forces `stream: false`. Streaming or tool-aware evaluation would require changes in `runtime.ts` (capture and aggregate SSE chunks before scoring). ## See Also - [USER_GUIDE.md](../guides/USER_GUIDE.md) — overall product walkthrough - [ARCHITECTURE.md](../architecture/ARCHITECTURE.md) — request pipeline reference - [AUTO-COMBO.md](../routing/AUTO-COMBO.md) — Auto Combo scoring engine (live runtime) - Source: `src/lib/evals/`, `src/lib/db/evals.ts`, `src/app/api/evals/` - UI: `src/app/(dashboard)/dashboard/usage/components/EvalsTab.tsx`