Diego Rodrigues de Sa e Souza 91b6983564
Release v3.8.1 (#2441)
Release v3.8.1 — feature flags settings page, bracketed combo names, security hardening, multi-driver SQLite
2026-05-21 01:29:12 -03:00

title: Evaluations (Evals)
version: 3.8.1
lastUpdated: 2026-05-13

Evaluations (Evals)

Source of truth: src/lib/evals/, src/lib/db/evals.ts, src/app/api/evals/
Last updated: 2026-05-13 (v3.8.0)

OmniRoute ships a generic evaluation framework you can use to benchmark routing configurations, single providers/models, or the bundled "golden set" suites. Use it to verify routing changes, validate new providers, and gate releases before promoting them to production traffic.

The framework is implemented as:

  • A pure runner (src/lib/evals/evalRunner.ts) that registers in-memory built-in suites, evaluates outputs against expected criteria, and aggregates scorecards.
  • A persistence layer (src/lib/db/evals.ts) for custom (user-defined) suites and historical runs in SQLite.
  • An orchestration layer (src/lib/evals/runtime.ts) that executes each case by dispatching real calls to POST /v1/chat/completions, captures latency and outputs, and persists the run.
  • REST endpoints under /api/evals/* (management-auth only).
  • A dashboard surface at Dashboard → Usage → Evals (EvalsTab.tsx).

Concepts

Suite

A suite is a named collection of test cases with a description and one or more cases. Suites come from two sources:

| Source | Where defined | Mutable at runtime? |
| --- | --- | --- |
| built-in | Registered via registerSuite() at boot | No (code-defined) |
| custom | Stored in SQLite eval_suites + eval_cases | Yes (via API/UI) |

The current built-in suites (see src/lib/evals/evalRunner.ts):

  • golden-set — 10 baseline cases across greeting/math/translation/safety
  • coding-proficiency — Python/JS/SQL/TS/bug detection
  • reasoning-logic — syllogisms, word problems, pattern recognition
  • multilingual — translation and language detection
  • safety-guardrails — PII, jailbreak, refusal, bias awareness
  • instruction-following — JSON-only, numbered lists, language constraints
  • codex-comparison — head-to-head coding tasks intended for compare mode

Case

Each case carries:

| Field | Description |
| --- | --- |
| id | Stable identifier (used to key outputs and metrics) |
| name | Human-readable label |
| model | Default model when the run uses suite-default targeting |
| input | { messages, max_tokens? }, sent to /v1/chat/completions |
| expected | { strategy, value }, the scoring rubric (see below) |
| tags | Optional labels (e.g. safety, pii, jailbreak) |
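
As a rough illustration, the fields above could be typed as follows. The type names (ChatMessage, EvalStrategy, EvalCaseSketch) and the sample values are hypothetical, inferred from the table; the real definitions live in src/lib/evals/ and src/lib/db/evals.ts and may differ.

```typescript
// Hypothetical shapes inferred from the field table, not copied from the source.
type ChatMessage = { role: "system" | "user" | "assistant"; content: string };

// Strategies accepted for user-created suites (custom functions are built-in only).
type EvalStrategy = "exact" | "contains" | "regex";

interface EvalCaseSketch {
  id: string; // stable key for outputs and metrics
  name: string; // human-readable label
  model: string; // used when the run targets suite-default
  input: { messages: ChatMessage[]; max_tokens?: number };
  expected: { strategy: EvalStrategy; value: string };
  tags?: string[];
}

// Sample case with made-up values.
const greetingCase: EvalCaseSketch = {
  id: "greeting-01",
  name: "Simple greeting",
  model: "gpt-4o",
  input: { messages: [{ role: "user", content: "Say hello." }] },
  expected: { strategy: "contains", value: "hello" },
  tags: ["greeting"],
};
```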

Target

The same suite can be run against different targets. The target schema is evalTargetSchema in src/shared/validation/schemas.ts:

| Target type | id | Behavior |
| --- | --- | --- |
| suite-default | null | Each case uses its built-in model field |
| model | model name | Force every case through one direct model (e.g. gpt-4o) |
| combo | combo name | Run every case through one combo (exercises the routing engine) |

For model and combo, the id field is required (enforced by Zod superRefine). When compareTarget is provided, it must differ from target; the runner persists both runs under the same runGroupId for A/B comparison.
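
The two rules above can be sketched in plain TypeScript. The source enforces them with a Zod superRefine on evalTargetSchema; this dependency-free version only mirrors the described behavior. TargetInput, validateTargets, and the reading of "differ" as "not the same type and id" are assumptions.

```typescript
// Hypothetical input shape; the real schema is evalTargetSchema.
type TargetInput = {
  type: "suite-default" | "model" | "combo";
  id?: string | null;
};

// Returns a list of human-readable problems; an empty list means valid.
function validateTargets(target: TargetInput, compareTarget?: TargetInput): string[] {
  const errors: string[] = [];

  const checkId = (t: TargetInput, label: string): void => {
    // model and combo targets must name what they point at.
    if (t.type !== "suite-default" && !t.id) {
      errors.push(`${label}: id is required for type "${t.type}"`);
    }
  };

  checkId(target, "target");
  if (compareTarget) {
    checkId(compareTarget, "compareTarget");
    const sameType = target.type === compareTarget.type;
    const sameId = (target.id ?? null) === (compareTarget.id ?? null);
    // Assumption: two targets are "the same" when type and id both match.
    if (sameType && sameId) errors.push("compareTarget must differ from target");
  }
  return errors;
}
```

Returning a list of messages rather than throwing keeps the sketch close to how schema validators usually report several issues at once.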

Scoring Rubrics

Implemented in evaluateCase() (evalRunner.ts):

| Strategy | Pass when… |
| --- | --- |
| exact | actualOutput === expected.value |
| contains | actualOutput.toLowerCase().includes(expected.value.toLowerCase()) |
| regex | new RegExp(expected.value).test(actualOutput) is truthy |
| custom | expected.fn(actualOutput, evalCase) returns truthy (built-in only) |

Note: Custom-function scoring is reserved for code-defined (built-in) suites because functions cannot be serialized through the API. The evalCaseBuilderSchema only accepts contains | exact | regex for user-created suites.

There is no LLM-as-judge or embedding-based similarity scorer today; adding one would be a natural extension of evaluateCase().
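
A minimal sketch of the four strategies, written directly from the table above. The names ExpectedSketch and scoreOutput are hypothetical, and the custom branch is simplified to pass only the output (the table shows the real function also receives evalCase); the actual evaluateCase() in evalRunner.ts may be structured differently.

```typescript
// Hypothetical rubric shape: string-valued strategies plus a code-only custom function.
type ExpectedSketch =
  | { strategy: "exact" | "contains" | "regex"; value: string }
  | { strategy: "custom"; fn: (actualOutput: string) => unknown };

function scoreOutput(actualOutput: string, expected: ExpectedSketch): boolean {
  switch (expected.strategy) {
    case "exact":
      // Strict equality: whitespace and case both matter.
      return actualOutput === expected.value;
    case "contains":
      // Case-insensitive substring match.
      return actualOutput.toLowerCase().includes(expected.value.toLowerCase());
    case "regex":
      // The stored value is a pattern source string, compiled without flags.
      return new RegExp(expected.value).test(actualOutput);
    case "custom":
      // Any truthy return counts as a pass.
      return Boolean(expected.fn(actualOutput));
  }
}
```

Note that exact is whitespace-sensitive, so a model that appends a trailing newline fails an exact case that a contains case would pass.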

Database Schema

Three tables (migrations 030_create_eval_runs.sql and 031_create_eval_suites.sql):

| Table | Purpose |
| --- | --- |
| eval_suites | Custom suite metadata (id, name, description) |
| eval_cases | Cases per suite (input_json, expected_*, tags_json) |
| eval_runs | Historical runs (pass_rate, total, passed, failed, avg_latency_ms, summary_json, results_json, outputs_json) |

Built-in suites are not stored in the DB. They live in memory and are re-registered every time evalRunner.ts is imported.
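
One way to picture an in-memory registry of this kind is a module-level map keyed by suite id. The function names registerSuite and listSuites come from this document, but the signatures, the BuiltInSuiteSketch type, and the overwrite-on-duplicate behavior shown here are assumptions, not the source implementation.

```typescript
// Hypothetical suite shape for illustration only.
interface BuiltInSuiteSketch {
  id: string;
  name: string;
  description: string;
  cases: { id: string; name: string }[];
}

// Module-level state: lives only as long as the process does.
const builtInSuites = new Map<string, BuiltInSuiteSketch>();

function registerSuite(suite: BuiltInSuiteSketch): void {
  // Keyed by id, so registering the same suite again replaces it instead of duplicating it.
  builtInSuites.set(suite.id, suite);
}

function listSuites(): BuiltInSuiteSketch[] {
  return Array.from(builtInSuites.values());
}

// Registering twice leaves exactly one entry.
const smokeSuite: BuiltInSuiteSketch = {
  id: "smoke",
  name: "Smoke",
  description: "Tiny example suite",
  cases: [{ id: "c1", name: "Greeting" }],
};
registerSuite(smokeSuite);
registerSuite(smokeSuite);
```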

REST API

All endpoints require management auth (requireManagementAuth) — they are not part of the public proxy surface.

| Endpoint | Method | Description |
| --- | --- | --- |
| /api/evals | GET | List suites + recent runs + scorecard + targets + keys |
| /api/evals | POST | Run a suite (single or compare); schema: evalRunSuiteSchema |
| /api/evals/{suiteId} | GET | Fetch one suite (built-in or custom) |
| /api/evals/suites | POST | Create a custom suite; schema: evalSuiteSaveSchema |
| /api/evals/suites/{suiteId} | GET | Fetch a custom suite |
| /api/evals/suites/{suiteId} | PUT | Replace a custom suite (cases get re-inserted) |
| /api/evals/suites/{suiteId} | DELETE | Delete a custom suite and its cases |

Running a suite

curl -X POST http://localhost:20128/api/evals \
  -H "Cookie: auth_token=..." \
  -H "Content-Type: application/json" \
  -d '{
    "suiteId": "golden-set",
    "target": { "type": "combo", "id": "my-combo" },
    "apiKeyId": "optional-api-key-uuid"
  }'

Optional fields:

  • outputs: Record<caseId, string> of pre-computed outputs. When provided, the runner skips dispatch and only scores the cached outputs (useful for offline evaluation).
  • compareTarget: second target to run in parallel; both runs share a generated runGroupId for head-to-head viewing.
  • apiKeyId: internal API key used to authenticate the dispatched /v1/chat/completions calls. Required when REQUIRE_API_KEY is enabled.
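
To make the optional fields concrete, here are two request bodies typed in TypeScript. The field names come from this section; the RunSuiteRequestSketch type, the case ids, and the API key value are hypothetical, and the authoritative shape is evalRunSuiteSchema.

```typescript
// Hypothetical typing of the POST /api/evals body.
type RunTarget =
  | { type: "suite-default"; id: null }
  | { type: "model" | "combo"; id: string };

interface RunSuiteRequestSketch {
  suiteId: string;
  target: RunTarget;
  compareTarget?: RunTarget;
  apiKeyId?: string;
  outputs?: Record<string, string>; // caseId -> pre-computed output
}

// Offline scoring: outputs are supplied, so nothing is dispatched.
const offlineRun: RunSuiteRequestSketch = {
  suiteId: "golden-set",
  target: { type: "suite-default", id: null },
  outputs: { "case-1": "Hello!", "case-2": "4" }, // made-up case ids
};

// A/B comparison: one direct model against one combo.
const compareRun: RunSuiteRequestSketch = {
  suiteId: "golden-set",
  target: { type: "model", id: "gpt-4o" },
  compareTarget: { type: "combo", id: "my-combo" },
  apiKeyId: "optional-api-key-uuid",
};

// Either object can be serialized and sent as the request body.
const offlineBody = JSON.stringify(offlineRun);
const compareBody = JSON.stringify(compareRun);
```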

Creating a custom suite

curl -X POST http://localhost:20128/api/evals/suites \
  -H "Cookie: auth_token=..." \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Production smoke",
    "description": "Quick sanity check before deploy",
    "cases": [
      {
        "name": "JSON shape",
        "model": "gpt-4o",
        "input": { "messages": [{ "role": "user", "content": "Reply with {\"ok\": true}" }] },
        "expected": { "strategy": "regex", "value": "\"ok\"\\s*:\\s*true" }
      }
    ]
  }'

Dispatch Pipeline

runEvalSuiteAgainstTarget() (src/lib/evals/runtime.ts):

  1. Resolves the suite (built-in or custom).
  2. For each case, builds a Request to /v1/chat/completions with the case's messages, the resolved model, stream: false, and max_tokens: 512 (or the case override).
  3. Calls the chat handler directly (in-process — no extra HTTP hop).
  4. Captures latency and extracts text from either choices[0].message.content or the Responses-API output[] payload.
  5. Scores all outputs via runSuite(), then persists via saveEvalRun().

Cases run sequentially. There is no concurrency flag today.
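
The steps above can be sketched as a small loop. This is not the code in runtime.ts: the dispatch function is injected so the sketch stays self-contained, and extractText, runCasesSequentially, and the assumed Responses-API layout (output[] items holding content[] parts with a text field) are illustrative guesses.

```typescript
type CaseInputSketch = {
  id: string;
  model: string;
  input: { messages: { role: string; content: string }[]; max_tokens?: number };
};

// Stand-in for the in-process chat handler call.
type DispatchFn = (requestBody: object) => Promise<unknown>;

// Pull text from a Chat Completions body, else from a Responses-style output[] array.
function extractText(body: any): string {
  const chatContent = body?.choices?.[0]?.message?.content;
  if (typeof chatContent === "string") return chatContent;
  let text = "";
  if (Array.isArray(body?.output)) {
    for (const item of body.output) {
      if (!Array.isArray(item?.content)) continue;
      for (const part of item.content) {
        if (typeof part?.text === "string") text += part.text;
      }
    }
  }
  return text;
}

async function runCasesSequentially(
  cases: CaseInputSketch[],
  dispatch: DispatchFn,
  modelOverride?: string,
): Promise<{ caseId: string; output: string; latencyMs: number }[]> {
  const results: { caseId: string; output: string; latencyMs: number }[] = [];
  for (const evalCase of cases) {
    const requestBody = {
      model: modelOverride ?? evalCase.model,
      messages: evalCase.input.messages,
      stream: false,
      max_tokens: evalCase.input.max_tokens ?? 512, // default unless the case overrides it
    };
    const startedAt = Date.now();
    const responseBody = await dispatch(requestBody); // one case at a time
    results.push({
      caseId: evalCase.id,
      output: extractText(responseBody),
      latencyMs: Date.now() - startedAt,
    });
  }
  return results;
}
```

The collected outputs would then be handed to the scorer and persisted, which corresponds to step 5.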

Dashboard

The UI lives at Dashboard → Usage → Evals (src/app/(dashboard)/dashboard/usage/components/EvalsTab.tsx). From there you can:

  • Browse built-in and custom suites with case-by-case preview.
  • Create/edit/delete custom suites with the case builder.
  • Pick a target (suite defaults / model / combo), optionally a second compareTarget, optionally an API key, then run on demand.
  • Inspect run history, per-case pass/fail, latency, and captured outputs.
  • See the rolling scorecard aggregated across the latest run per (suite, target) scope.
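
The "latest run per (suite, target)" idea behind the scorecard can be illustrated as follows. How the real scorecard is computed is not described here, so RunRecordSketch, the targetKey string, and the choice to sum passed/total across scopes are all assumptions.

```typescript
// Hypothetical, trimmed-down view of a persisted run.
interface RunRecordSketch {
  suiteId: string;
  targetKey: string; // e.g. "model:gpt-4o" or "combo:my-combo" (made-up format)
  createdAt: number; // epoch milliseconds
  total: number;
  passed: number;
}

function buildScorecard(runs: RunRecordSketch[]) {
  // Keep only the newest run for each (suite, target) scope.
  const latestByScope = new Map<string, RunRecordSketch>();
  for (const run of runs) {
    const scope = `${run.suiteId}::${run.targetKey}`;
    const seen = latestByScope.get(scope);
    if (!seen || run.createdAt > seen.createdAt) latestByScope.set(scope, run);
  }

  // Aggregate counts across the surviving runs.
  let total = 0;
  let passed = 0;
  latestByScope.forEach((run) => {
    total += run.total;
    passed += run.passed;
  });

  return {
    scopes: latestByScope.size,
    total,
    passed,
    passRate: total === 0 ? 0 : passed / total,
  };
}
```

Because older runs in the same scope are dropped, rerunning a suite after a fix replaces the earlier failing result instead of averaging with it.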

Relationship with the Auto-Assessment RFC

A separate, narrower assessment subsystem lives at src/domain/assessment/ (see also AUTO-COMBO.md for the live scoring engine). That subsystem targets the Auto Combo engine — automatically scoring providers and models so combos can self-heal when upstreams fail. It uses its own runner, its own categorizer, and its own scoring logic.

The Evals framework documented here is the broader, general-purpose testing surface. Prefer it for arbitrary regression suites, A/B comparisons, and per-release smoke tests. Use the Auto-Assessment subsystem when you need real-time provider health to influence routing decisions.

CI Integration

There is no dedicated eval:ci npm script today. There are two ways to gate releases on eval results:

  • HTTP path: stand up the server, hit POST /api/evals with a known suiteId + target, and assert runs[].summary.passRate >= N in the response.
  • In-process path: from a script, import runEvalSuiteAgainstTarget() from @/lib/evals/runtime, run it against a test DB, and check the returned PersistedEvalRun.summary.
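
A possible shape for the HTTP path, as a sketch only. It assumes the response exposes runs[].summary.passRate as stated above; the unit of passRate (fraction or percentage) is not specified here, so the threshold must be given in whatever unit the server returns. failingRunIndexes and gateOnEvals are hypothetical names, and the global fetch requires Node 18 or later.

```typescript
// Only the part of the response this gate needs.
interface EvalRunResponseSketch {
  runs: { summary: { passRate: number } }[];
}

// Returns the indexes of runs whose pass rate is below the threshold.
function failingRunIndexes(response: EvalRunResponseSketch, minPassRate: number): number[] {
  const failing: number[] = [];
  response.runs.forEach((run, index) => {
    if (run.summary.passRate < minPassRate) failing.push(index);
  });
  return failing;
}

// Calls the endpoint and throws when any run is under the threshold.
async function gateOnEvals(baseUrl: string, cookie: string, minPassRate: number): Promise<void> {
  const res = await fetch(`${baseUrl}/api/evals`, {
    method: "POST",
    headers: { "Content-Type": "application/json", Cookie: cookie },
    body: JSON.stringify({
      suiteId: "golden-set",
      target: { type: "combo", id: "my-combo" },
    }),
  });
  if (!res.ok) throw new Error(`eval request failed: HTTP ${res.status}`);
  const payload = (await res.json()) as EvalRunResponseSketch;
  const failing = failingRunIndexes(payload, minPassRate);
  if (failing.length > 0) {
    throw new Error(`pass rate below ${minPassRate} for run(s): ${failing.join(", ")}`);
  }
}
```

In a CI job, a rejected promise from gateOnEvals would be turned into a non-zero exit code so the pipeline stops.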

Tests covering the route and history live at tests/unit/evals-route.test.ts and tests/unit/evals-history.test.ts.

Extension Points

Common changes and where to make them:

  • New scoring strategy: extend the switch (evalCase.expected.strategy) block in evaluateCase() (evalRunner.ts), then widen EvalCaseStrategy in src/lib/db/evals.ts and evalCaseBuilderSchema in schemas.ts.
  • New built-in suite: define a suite object and call registerSuite() at the bottom of evalRunner.ts. listSuites() will then discover it automatically.
  • Run with concurrency: replace the sequential for loop in runEvalSuiteAgainstTarget() with a bounded worker pool (Promise.all on its own starts every case at once; no concurrency control exists today).
  • Stream/tool-call cases: the runner currently forces stream: false. Streaming or tool-aware evaluation would require changes in runtime.ts (capture and aggregate SSE chunks before scoring).
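
For the concurrency item, one common pattern is a fixed-size worker pool built on Promise.all. The helper below is a generic sketch, not part of OmniRoute; mapWithConcurrency is a hypothetical name, and wiring it into runEvalSuiteAgainstTarget() would still need care around ordering and persistence.

```typescript
// Runs worker over items with at most `limit` calls in flight; results keep input order.
async function mapWithConcurrency<T, R>(
  items: T[],
  limit: number,
  worker: (item: T, index: number) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let nextIndex = 0;

  // Each pool member pulls the next unclaimed index until none remain.
  async function runWorker(): Promise<void> {
    while (nextIndex < items.length) {
      const index = nextIndex++;
      results[index] = await worker(items[index], index);
    }
  }

  const workerCount = Math.max(1, Math.min(limit, items.length));
  const pool: Promise<void>[] = [];
  for (let i = 0; i < workerCount; i++) pool.push(runWorker());
  await Promise.all(pool); // rejects as soon as any worker call rejects
  return results;
}
```

With limit set to 1 this behaves like the current sequential loop, which makes it easy to roll out behind a setting. Measured latency per case may rise under higher limits if the upstream provider throttles parallel requests.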

See Also

  • USER_GUIDE.md — overall product walkthrough
  • ARCHITECTURE.md — request pipeline reference
  • AUTO-COMBO.md — Auto Combo scoring engine (live runtime)
  • Source: src/lib/evals/, src/lib/db/evals.ts, src/app/api/evals/
  • UI: src/app/(dashboard)/dashboard/usage/components/EvalsTab.tsx