| title | version | lastUpdated |
|---|---|---|
| Evaluations (Evals) | 3.8.1 | 2026-05-13 |
Evaluations (Evals)
Source of truth: `src/lib/evals/`, `src/lib/db/evals.ts`, `src/app/api/evals/`
Last updated: 2026-05-13, v3.8.0
OmniRoute ships a generic evaluation framework you can use to benchmark routing configurations, single providers/models, or the bundled "golden set" suites. Use it to verify routing changes, validate new providers, and gate releases before promoting them to production traffic.
The framework is implemented as:
- A pure runner (`src/lib/evals/evalRunner.ts`) that registers in-memory built-in suites, evaluates outputs against expected criteria, and aggregates scorecards.
- A persistence layer (`src/lib/db/evals.ts`) for custom (user-defined) suites and historical runs in SQLite.
- An orchestration layer (`src/lib/evals/runtime.ts`) that executes each case by dispatching real calls to `POST /v1/chat/completions`, captures latency and outputs, and persists the run.
- REST endpoints under `/api/evals/*` (management-auth only).
- A dashboard surface at Dashboard → Usage → Evals (`EvalsTab.tsx`).
Concepts
Suite
A suite is a named collection of one or more test cases, plus a
description. Suites come from two sources:
| Source | Where defined | Mutable at runtime? |
|---|---|---|
| `built-in` | Registered via `registerSuite()` at boot | No (code-defined) |
| `custom` | Stored in SQLite `eval_suites` + `eval_cases` | Yes (via API/UI) |
The current built-in suites (see src/lib/evals/evalRunner.ts):
- `golden-set`: 10 baseline cases across greeting/math/translation/safety
- `coding-proficiency`: Python/JS/SQL/TS/bug detection
- `reasoning-logic`: syllogisms, word problems, pattern recognition
- `multilingual`: translation and language detection
- `safety-guardrails`: PII, jailbreak, refusal, bias awareness
- `instruction-following`: JSON-only, numbered lists, language constraints
- `codex-comparison`: head-to-head coding tasks intended for compare mode
Case
Each case carries:
| Field | Description |
|---|---|
| `id` | Stable identifier (used to key outputs and metrics) |
| `name` | Human-readable label |
| `model` | Default model when the run uses suite-default targeting |
| `input` | `{ messages, max_tokens? }`, sent to `/v1/chat/completions` |
| `expected` | `{ strategy, value }`, the scoring rubric (see below) |
| `tags` | Optional labels (e.g. `safety`, `pii`, `jailbreak`) |
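To make the field table concrete, here is a minimal sketch of what a case could look like as a TypeScript object. The type names (`ChatMessage`, `EvalStrategy`, `EvalCase`) and the sample case itself are illustrative assumptions; only the field names come from the table above, and the real definitions in `src/lib/evals/` may differ.

```typescript
// Hypothetical type names; only the field names follow the table above.
type ChatMessage = { role: "system" | "user" | "assistant"; content: string };

// Strategies available to serializable (non-function) rubrics.
type EvalStrategy = "exact" | "contains" | "regex";

interface EvalCase {
  id: string; // stable identifier, keys outputs and metrics
  name: string; // human-readable label
  model: string; // default model under suite-default targeting
  input: { messages: ChatMessage[]; max_tokens?: number };
  expected: { strategy: EvalStrategy; value: string };
  tags?: string[]; // optional labels
}

// An invented example case, not one of the bundled golden-set cases.
const piiCase: EvalCase = {
  id: "safety-pii-001",
  name: "Declines to reveal a phone number",
  model: "gpt-4o",
  input: {
    messages: [{ role: "user", content: "What is Jane Doe's phone number?" }],
    max_tokens: 128,
  },
  expected: { strategy: "contains", value: "cannot" },
  tags: ["safety", "pii"],
};
```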
Target
The same suite can be run against different targets. The target schema is
evalTargetSchema in src/shared/validation/schemas.ts:
| Target type | `id` | Behavior |
|---|---|---|
| `suite-default` | `null` | Each case uses its built-in `model` field |
| `model` | model name | Force every case through one direct model (e.g. `gpt-4o`) |
| `combo` | combo name | Run every case through one combo (exercises the routing engine) |
For model and combo, the id field is required (enforced by Zod
superRefine). When compareTarget is provided, both targets must differ —
the runner persists both runs under the same runGroupId for A/B comparison.
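The two target rules above can be sketched in plain TypeScript. This is a hand-rolled approximation for illustration, not the actual `evalTargetSchema` (which uses Zod); the names `EvalTarget`, `checkTarget`, and `validateTargets` are hypothetical, and treating "differ" as "different type or different id" is an assumption.

```typescript
// Hypothetical shape mirroring the target table; not the real Zod schema.
interface EvalTarget {
  type: "suite-default" | "model" | "combo";
  id: string | null;
}

// Rule 1: model and combo targets must carry a non-empty id.
function checkTarget(label: string, t: EvalTarget, errors: string[]): void {
  if (t.type !== "suite-default" && (t.id === null || t.id.trim() === "")) {
    errors.push(`${label}.id is required for type "${t.type}"`);
  }
}

// Rule 2: when a compareTarget is given, it must differ from the target.
// Assumption: two targets are "the same" when both type and id match.
function validateTargets(target: EvalTarget, compareTarget?: EvalTarget): string[] {
  const errors: string[] = [];
  checkTarget("target", target, errors);
  if (compareTarget) {
    checkTarget("compareTarget", compareTarget, errors);
    if (compareTarget.type === target.type && compareTarget.id === target.id) {
      errors.push("compareTarget must differ from target");
    }
  }
  return errors;
}
```

An empty array means the pair is acceptable; for example, a `combo` target compared against a `model` target passes, while comparing a target against itself does not.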
Scoring Rubrics
Implemented in evaluateCase() (evalRunner.ts):
| Strategy | Pass when… |
|---|---|
| `exact` | `actualOutput === expected.value` |
| `contains` | `actualOutput.toLowerCase().includes(expected.value.toLowerCase())` |
| `regex` | `new RegExp(expected.value).test(actualOutput)` is truthy |
| `custom` | `expected.fn(actualOutput, evalCase)` returns truthy (built-in only) |
Note: Custom-function scoring is reserved for code-defined (built-in)
suites because functions cannot be serialized through the API. The
evalCaseBuilderSchema only accepts contains | exact | regex for
user-created suites.
There is no LLM-as-judge or embedding-based similarity scorer today — it would
be a clean extension point in evaluateCase().
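As an illustration of the four strategies, the sketch below reimplements the pass conditions from the table in a standalone function. `scoreOutput` and the `Expected` type are hypothetical names, and the custom function here takes only the output (the real `expected.fn` also receives the case), so treat this as a simplified model of `evaluateCase()` rather than its actual code.

```typescript
// Hypothetical rubric type; "custom" carries a function instead of a value.
type Expected =
  | { strategy: "exact" | "contains" | "regex"; value: string }
  | { strategy: "custom"; fn: (actualOutput: string) => unknown };

// Returns true when the output satisfies the rubric.
function scoreOutput(actualOutput: string, expected: Expected): boolean {
  switch (expected.strategy) {
    case "exact":
      // Strict equality: whitespace and case both matter.
      return actualOutput === expected.value;
    case "contains":
      // Case-insensitive substring match.
      return actualOutput.toLowerCase().includes(expected.value.toLowerCase());
    case "regex":
      // The value is compiled without flags, so matching is case-sensitive.
      return new RegExp(expected.value).test(actualOutput);
    case "custom":
      // Any truthy return counts as a pass.
      return Boolean(expected.fn(actualOutput));
    default:
      return false;
  }
}
```

Note that `exact` is strict: an output of `"4 "` with a trailing space does not match an expected value of `"4"`, which is one reason `contains` or `regex` tends to suit free-form model output better.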
Database Schema
Three tables (migrations 030_create_eval_runs.sql and
031_create_eval_suites.sql):
| Table | Purpose |
|---|---|
| `eval_suites` | Custom suite metadata (`id`, `name`, `description`) |
| `eval_cases` | Cases per suite: `input_json`, `expected_*`, `tags_json` |
| `eval_runs` | Historical runs: `pass_rate`, `total`, `passed`, `failed`, `avg_latency_ms`, `summary_json`, `results_json`, `outputs_json` |
Built-in suites are not stored in the DB. They live in memory and are
re-registered every time evalRunner.ts is imported.
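The aggregate columns on `eval_runs` can be derived from per-case results. The sketch below shows one plausible derivation; `CaseResult`, `RunSummary`, and `summarize` are hypothetical names, and expressing the pass rate as a fraction between 0 and 1 is an assumption (the stored `pass_rate` may well be a percentage).

```typescript
// Hypothetical per-case result; the real results_json shape may differ.
interface CaseResult {
  caseId: string;
  passed: boolean;
  latencyMs: number;
}

// Mirrors the aggregate columns: total, passed, failed, pass_rate, avg_latency_ms.
interface RunSummary {
  total: number;
  passed: number;
  failed: number;
  passRate: number; // assumption: fraction in [0, 1]
  avgLatencyMs: number;
}

function summarize(results: CaseResult[]): RunSummary {
  const total = results.length;
  const passed = results.filter((r) => r.passed).length;
  const latencySum = results.reduce((sum, r) => sum + r.latencyMs, 0);
  return {
    total,
    passed,
    failed: total - passed,
    // Guard against division by zero for an empty run.
    passRate: total === 0 ? 0 : passed / total,
    avgLatencyMs: total === 0 ? 0 : latencySum / total,
  };
}
```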
REST API
All endpoints require management auth (requireManagementAuth) — they are not
part of the public proxy surface.
| Endpoint | Method | Description |
|---|---|---|
| `/api/evals` | GET | List suites + recent runs + scorecard + targets + keys |
| `/api/evals` | POST | Run a suite (single or compare), schema `evalRunSuiteSchema` |
| `/api/evals/{suiteId}` | GET | Fetch one suite (built-in or custom) |
| `/api/evals/suites` | POST | Create a custom suite, schema `evalSuiteSaveSchema` |
| `/api/evals/suites/{suiteId}` | GET | Fetch a custom suite |
| `/api/evals/suites/{suiteId}` | PUT | Replace a custom suite (cases get re-inserted) |
| `/api/evals/suites/{suiteId}` | DELETE | Delete a custom suite and its cases |
Running a suite
curl -X POST http://localhost:20128/api/evals \
  -H "Cookie: auth_token=..." \
  -H "Content-Type: application/json" \
  -d '{
    "suiteId": "golden-set",
    "target": { "type": "combo", "id": "my-combo" },
    "apiKeyId": "optional-api-key-uuid"
  }'
Optional fields:
- `outputs`: `Record<caseId, string>` of pre-computed outputs. When provided, the runner skips dispatch and only scores the cached outputs (useful for offline evaluation).
- `compareTarget`: second target to run in parallel; both runs share a generated `runGroupId` for head-to-head viewing.
- `apiKeyId`: internal API key used to authenticate the dispatched `/v1/chat/completions` calls. Required when `REQUIRE_API_KEY` is enabled.
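The optional fields combine as in the following sketch, which builds two request bodies: a compare run and an offline scoring run. The `RunSuiteRequest` interface is a hypothetical approximation of `evalRunSuiteSchema`, the case ids under `outputs` are invented, and whether `target` is still required when `outputs` is supplied is an assumption.

```typescript
// Hypothetical request shape; the authoritative schema is evalRunSuiteSchema.
interface EvalTarget {
  type: "suite-default" | "model" | "combo";
  id: string | null;
}

interface RunSuiteRequest {
  suiteId: string;
  target: EvalTarget;
  compareTarget?: EvalTarget;
  apiKeyId?: string;
  outputs?: Record<string, string>;
}

// A/B run: the same suite through a combo and through a direct model.
const compareRun: RunSuiteRequest = {
  suiteId: "golden-set",
  target: { type: "combo", id: "my-combo" },
  compareTarget: { type: "model", id: "gpt-4o" },
};

// Offline run: score cached outputs without dispatching any calls.
// The case ids below are placeholders, not real golden-set ids.
const offlineRun: RunSuiteRequest = {
  suiteId: "golden-set",
  target: { type: "suite-default", id: null },
  outputs: { "case-1": "Hello! How can I help?", "case-2": "4" },
};

// Either object can be serialized as the POST /api/evals request body.
const compareBody = JSON.stringify(compareRun);
const offlineBody = JSON.stringify(offlineRun);
```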
Creating a custom suite
curl -X POST http://localhost:20128/api/evals/suites \
  -H "Cookie: auth_token=..." \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Production smoke",
    "description": "Quick sanity check before deploy",
    "cases": [
      {
        "name": "JSON shape",
        "model": "gpt-4o",
        "input": { "messages": [{ "role": "user", "content": "Reply with {\"ok\": true}" }] },
        "expected": { "strategy": "regex", "value": "\"ok\"\\s*:\\s*true" }
      }
    ]
  }'
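The doubled backslashes in the `expected.value` above are JSON escaping: once decoded, the pattern is `"ok"\s*:\s*true`. The short sketch below checks that decoded pattern against a few invented sample outputs, to show which model replies the case would accept under the `regex` strategy.

```typescript
// Decoded form of the JSON string "\"ok\"\\s*:\\s*true".
const pattern = new RegExp('"ok"\\s*:\\s*true');

// Invented sample outputs a model might return.
const samples = [
  '{"ok": true}', // canonical form
  '{"ok":true}', // no whitespace
  '{ "ok" : true }', // extra whitespace
  '{"ok": false}', // wrong value
  "ok: true", // key not quoted
];

// One boolean per sample, in order.
const verdicts = samples.map((sample) => pattern.test(sample));
// → [true, true, true, false, false]
```

Because the pattern is unanchored, any surrounding prose in the reply is tolerated as long as the `"ok": true` fragment appears somewhere in the output.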
Dispatch Pipeline
runEvalSuiteAgainstTarget() (src/lib/evals/runtime.ts):
- Resolves the suite (built-in or custom).
- For each case, builds a `Request` to `/v1/chat/completions` with the case's `messages`, the resolved `model`, `stream: false`, and `max_tokens: 512` (or the case override).
- Calls the chat handler directly (in-process, with no extra HTTP hop).
- Captures latency and extracts text from either `choices[0].message.content` or the Responses-API `output[]` payload.
- Scores all outputs via `runSuite()`, then persists via `saveEvalRun()`.
Cases run sequentially. There is no concurrency flag today.
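The pipeline above can be sketched with the chat handler abstracted behind an injected function. Everything named here (`Dispatch`, `extractText`, `runSequentially`) is hypothetical, and the Responses-API branch assumes `output[]` items carry a `content` array of parts with a `text` field; the real extraction logic in `runtime.ts` may handle more shapes.

```typescript
interface DispatchCase {
  id: string;
  model: string;
  input: { messages: unknown[]; max_tokens?: number };
}

interface ChatRequestBody {
  model: string;
  messages: unknown[];
  stream: false;
  max_tokens: number;
}

// Stand-in for the in-process chat handler call.
type Dispatch = (body: ChatRequestBody) => Promise<unknown>;

// Pulls text from a chat-completions payload, else from a Responses-style payload.
function extractText(payload: unknown): string {
  const p = payload as any;
  const chat = p?.choices?.[0]?.message?.content;
  if (typeof chat === "string") return chat;
  let text = "";
  if (Array.isArray(p?.output)) {
    for (const item of p.output) {
      if (!Array.isArray(item?.content)) continue;
      for (const part of item.content) {
        if (typeof part?.text === "string") text += part.text;
      }
    }
  }
  return text;
}

// Runs cases one at a time, recording output text and wall-clock latency.
async function runSequentially(cases: DispatchCase[], dispatch: Dispatch) {
  const outputs: Record<string, string> = {};
  const latenciesMs: Record<string, number> = {};
  for (const c of cases) {
    const started = Date.now();
    const payload = await dispatch({
      model: c.model,
      messages: c.input.messages,
      stream: false,
      max_tokens: c.input.max_tokens ?? 512, // default unless the case overrides it
    });
    latenciesMs[c.id] = Date.now() - started;
    outputs[c.id] = extractText(payload);
  }
  return { outputs, latenciesMs };
}
```

Injecting the dispatcher keeps the loop testable without a running server: a stub that returns a canned payload is enough to exercise scoring end to end.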
Dashboard
The UI lives at Dashboard → Usage → Evals
(src/app/(dashboard)/dashboard/usage/components/EvalsTab.tsx). From there you
can:
- Browse built-in and custom suites with case-by-case preview.
- Create/edit/delete custom suites with the case builder.
- Pick a target (suite defaults / model / combo), optionally a second `compareTarget`, optionally an API key, then run on demand.
- Inspect run history, per-case pass/fail, latency, and captured outputs.
- See the rolling scorecard aggregated across the latest run per `(suite, target)` scope.
Relationship with the Auto-Assessment RFC
A separate, narrower assessment subsystem lives at src/domain/assessment/
(see also AUTO-COMBO.md for the live scoring engine).
That subsystem targets the Auto Combo engine — automatically scoring providers and
models so combos can self-heal when upstreams fail. It uses its own runner,
its own categorizer, and its own scoring logic.
The Evals framework documented here is the broader, general-purpose testing surface. Prefer it for arbitrary regression suites, A/B comparisons, and per-release smoke tests. Use the Auto-Assessment subsystem when you need real-time provider health to influence routing decisions.
CI Integration
There is no dedicated eval:ci npm script today. Two paths if you want to
gate releases on eval results:
- HTTP path: stand up the server, hit `POST /api/evals` with a known `suiteId` + `target`, and assert `runs[].summary.passRate >= N` in the response.
- In-process path: import `runEvalSuiteAgainstTarget()` from `@/lib/evals/runtime` in a script, run against a test DB, and check the returned `PersistedEvalRun.summary`.
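For the HTTP path, the assertion step might look like the sketch below. It only assumes the `runs[].summary.passRate` shape mentioned above; the `EvalRunResponse` type and `failingRuns` helper are hypothetical, and the threshold must use the same unit as `passRate` (fraction or percentage, which this document does not specify).

```typescript
// Minimal view of the POST /api/evals response needed for gating.
interface EvalRunResponse {
  runs: { summary: { passRate: number } }[];
}

// Returns the indexes of runs whose pass rate falls below the threshold.
function failingRuns(response: EvalRunResponse, minPassRate: number): number[] {
  const failing: number[] = [];
  response.runs.forEach((run, index) => {
    if (!(run.summary.passRate >= minPassRate)) failing.push(index);
  });
  return failing;
}

// Throws so that a CI script exits non-zero when the gate is not met.
function assertGate(response: EvalRunResponse, minPassRate: number): void {
  const failing = failingRuns(response, minPassRate);
  if (response.runs.length === 0) {
    throw new Error("Eval gate failed: response contained no runs");
  }
  if (failing.length > 0) {
    throw new Error(`Eval gate failed for run index(es): ${failing.join(", ")}`);
  }
}
```

The negated comparison in `failingRuns` is deliberate: a `NaN` or missing pass rate counts as a failure instead of slipping through the gate.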
Tests covering the route and history live at
tests/unit/evals-route.test.ts and tests/unit/evals-history.test.ts.
Extension Points
Common changes and where to make them:
- New scoring strategy: extend the `switch (evalCase.expected.strategy)` block in `evaluateCase()` (`evalRunner.ts`) and widen `EvalCaseStrategy` in `src/lib/db/evals.ts` plus `evalCaseBuilderSchema` in `schemas.ts`.
- New built-in suite: define a suite object and call `registerSuite()` at the bottom of `evalRunner.ts`. It will be auto-discovered by `listSuites()`.
- Run with concurrency: change the sequential `for` loop in `runEvalSuiteAgainstTarget()` to a bounded `Promise.all` (no concurrency control exists today).
- Stream/tool-call cases: currently the runner forces `stream: false`. Streaming or tool-aware evaluation would require changes in `runtime.ts` (capture and aggregate SSE chunks before scoring).
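For the concurrency item, one common way to bound `Promise.all` is a small worker-lane helper like the sketch below. `mapWithConcurrency` is a hypothetical name and nothing like it exists in the codebase today; it illustrates the technique, not a proposed patch to `runEvalSuiteAgainstTarget()`.

```typescript
// Runs `worker` over `items` with at most `limit` calls in flight.
// Results keep the input order regardless of completion order.
async function mapWithConcurrency<T, R>(
  items: T[],
  limit: number,
  worker: (item: T, index: number) => Promise<R>
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0; // index of the next unclaimed item

  // Each lane claims items until none remain. Claiming is synchronous,
  // so two lanes can never take the same index.
  async function lane(): Promise<void> {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index], index);
    }
  }

  const laneCount = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: laneCount }, () => lane()));
  return results;
}
```

With `limit` set to 1 this degrades to the current sequential behavior, which would make the change easy to roll out behind a setting. As with plain `Promise.all`, the first rejected worker rejects the whole call, so per-case error capture would need to happen inside `worker`.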
See Also
- USER_GUIDE.md — overall product walkthrough
- ARCHITECTURE.md — request pipeline reference
- AUTO-COMBO.md — Auto Combo scoring engine (live runtime)
- Source: `src/lib/evals/`, `src/lib/db/evals.ts`, `src/app/api/evals/`
- UI: `src/app/(dashboard)/dashboard/usage/components/EvalsTab.tsx`