mirror of https://github.com/diegosouzapw/OmniRoute.git synced 2026-05-22 19:57:07 +00:00

Diego Rodrigues de Sa e Souza 91b6983564

Release v3.8.1 — feature flags settings page, bracketed combo names, security hardening, multi-driver SQLite

2026-05-21 01:29:12 -03:00

title	version	lastUpdated
Evaluations (Evals)	3.8.1	2026-05-13

Evaluations (Evals)

Source of truth: src/lib/evals/, src/lib/db/evals.ts, src/app/api/evals/
Last updated: 2026-05-13 (v3.8.0)

OmniRoute ships a generic evaluation framework you can use to benchmark routing configurations, single providers/models, or the bundled "golden set" suites. Use it to verify routing changes, validate new providers, and gate releases before promoting them to production traffic.

The framework is implemented as:

A pure runner (src/lib/evals/evalRunner.ts) that registers in-memory built-in suites, evaluates outputs against expected criteria, and aggregates scorecards.
A persistence layer (src/lib/db/evals.ts) for custom (user-defined) suites and historical runs in SQLite.
An orchestration layer (src/lib/evals/runtime.ts) that executes each case by dispatching real calls to POST /v1/chat/completions, captures latency and outputs, and persists the run.
REST endpoints under /api/evals/* (management-auth only).
A dashboard surface at Dashboard → Usage → Evals (EvalsTab.tsx).

Concepts

Suite

A suite is a named collection of one or more test cases, plus a description. Suites come from two sources:

Source	Where defined	Mutable at runtime?
`built-in`	Registered via `registerSuite()` at boot	No (code-defined)
`custom`	Stored in SQLite `eval_suites` + `eval_cases`	Yes (via API/UI)

The current built-in suites (see src/lib/evals/evalRunner.ts):

golden-set — 10 baseline cases across greeting/math/translation/safety
coding-proficiency — Python/JS/SQL/TS/bug detection
reasoning-logic — syllogisms, word problems, pattern recognition
multilingual — translation and language detection
safety-guardrails — PII, jailbreak, refusal, bias awareness
instruction-following — JSON-only, numbered lists, language constraints
codex-comparison — head-to-head coding tasks intended for compare mode

Case

Each case carries:

Field	Description
`id`	Stable identifier (used to key outputs and metrics)
`name`	Human-readable label
`model`	Default model when the run uses `suite-default` targeting
`input`	`{ messages, max_tokens? }` — sent to `/v1/chat/completions`
`expected`	`{ strategy, value }` — scoring rubric (see below)
`tags`	Optional labels (e.g. `safety`, `pii`, `jailbreak`)
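
As a rough illustration, the fields in the table above could be modeled in TypeScript as follows. The type names (`ChatMessage`, `ExpectedSketch`, `EvalCaseSketch`), the allowed `role` values, and the sample case are hypothetical. Only the field names come from the table.

```typescript
// Hypothetical type names; only the field names mirror the table above.
type ChatMessage = { role: "system" | "user" | "assistant"; content: string };

// User-created suites accept these three strategies (custom functions are built-in only).
type ExpectedSketch = {
  strategy: "exact" | "contains" | "regex";
  value: string;
};

interface EvalCaseSketch {
  id: string; // stable identifier, used to key outputs and metrics
  name: string; // human-readable label
  model: string; // used when the run targets `suite-default`
  input: { messages: ChatMessage[]; max_tokens?: number };
  expected: ExpectedSketch;
  tags?: string[];
}

// An invented sample case, for illustration only.
const exampleCase: EvalCaseSketch = {
  id: "greeting-1",
  name: "Simple greeting",
  model: "gpt-4o",
  input: { messages: [{ role: "user", content: "Say hello." }] },
  expected: { strategy: "contains", value: "hello" },
  tags: ["greeting"],
};
```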

Target

The same suite can be run against different targets. The target schema is evalTargetSchema in src/shared/validation/schemas.ts:

Target type	`id`	Behavior
`suite-default`	`null`	Each case uses its built-in `model` field
`model`	model name	Force every case through one direct model (e.g. `gpt-4o`)
`combo`	combo name	Run every case through one combo (exercises the routing engine)

For model and combo targets, the id field is required (enforced by a Zod superRefine check). When compareTarget is provided, it must differ from target; the runner persists both runs under the same runGroupId for A/B comparison.
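
The two rules above (an id is required for model and combo targets, and the compare target must differ) can be restated as a small dependency-free check. This is only a sketch of the rules, not the actual evalTargetSchema, which is written with Zod. The names `EvalTargetSketch`, `targetKey`, and `validateTargets`, and the error messages, are hypothetical.

```typescript
// Hypothetical restatement of the target rules; the real schema uses Zod.
type EvalTargetSketch =
  | { type: "suite-default"; id: null }
  | { type: "model"; id: string }
  | { type: "combo"; id: string };

// Two targets are considered identical when type and id both match.
function targetKey(t: EvalTargetSketch): string {
  return `${t.type}:${t.id ?? ""}`;
}

function validateTargets(
  target: EvalTargetSketch,
  compareTarget?: EvalTargetSketch,
): string[] {
  const errors: string[] = [];
  for (const t of [target, compareTarget]) {
    if (!t) continue;
    // `model` and `combo` targets need a non-empty id.
    if (t.type !== "suite-default" && t.id.trim() === "") {
      errors.push(`${t.type} target requires an id`);
    }
  }
  if (compareTarget && targetKey(target) === targetKey(compareTarget)) {
    errors.push("target and compareTarget must differ");
  }
  return errors;
}
```

An empty array means the pair is acceptable under these assumed rules.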

Scoring Rubrics

Implemented in evaluateCase() (evalRunner.ts):

Strategy	Pass when…
`exact`	`actualOutput === expected.value`
`contains`	`actualOutput.toLowerCase().includes(expected.value.toLowerCase())`
`regex`	`new RegExp(expected.value).test(actualOutput)` is truthy
`custom`	`expected.fn(actualOutput, evalCase)` returns truthy (built-in only)

Note: Custom-function scoring is reserved for code-defined (built-in) suites because functions cannot be serialized through the API. The evalCaseBuilderSchema only accepts contains | exact | regex for user-created suites.

There is no LLM-as-judge or embedding-based similarity scorer today; either could be added as a new strategy in evaluateCase().
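
The three serializable strategies in the table above can be sketched as a single function. This is a simplified stand-in for evaluateCase(), not its actual code: the name `scoreOutput` and the `ExpectedRule` type are hypothetical, and the `custom` strategy is omitted because it relies on a code-defined function.

```typescript
// Hypothetical stand-in for the scoring step; covers exact | contains | regex only.
type ExpectedRule = { strategy: "exact" | "contains" | "regex"; value: string };

function scoreOutput(actualOutput: string, expected: ExpectedRule): boolean {
  switch (expected.strategy) {
    case "exact":
      // Strict equality: whitespace and case must match.
      return actualOutput === expected.value;
    case "contains":
      // Case-insensitive substring match.
      return actualOutput.toLowerCase().includes(expected.value.toLowerCase());
    case "regex":
      // The value is compiled as a pattern with no flags.
      return new RegExp(expected.value).test(actualOutput);
  }
}
```

Note that `exact` is sensitive to trailing whitespace, so `contains` or `regex` is usually the safer choice for free-form model output.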

Database Schema

The framework uses three tables, created by migrations 030_create_eval_runs.sql and 031_create_eval_suites.sql:

Table	Purpose
`eval_suites`	Custom suite metadata (`id`, `name`, `description`)
`eval_cases`	Cases per suite — `input_json`, `expected_*`, `tags_json`
`eval_runs`	Historical runs — `pass_rate`, `total`, `passed`, `failed`, `avg_latency_ms`, `summary_json`, `results_json`, `outputs_json`

Built-in suites are not stored in the DB. They live in memory and are re-registered every time evalRunner.ts is imported.

REST API

All endpoints require management auth (requireManagementAuth) — they are not part of the public proxy surface.

Endpoint	Method	Description
`/api/evals`	`GET`	List suites + recent runs + scorecard + targets + keys
`/api/evals`	`POST`	Run a suite (single or compare) — schema `evalRunSuiteSchema`
`/api/evals/{suiteId}`	`GET`	Fetch one suite (built-in or custom)
`/api/evals/suites`	`POST`	Create a custom suite — schema `evalSuiteSaveSchema`
`/api/evals/suites/{suiteId}`	`GET`	Fetch a custom suite
`/api/evals/suites/{suiteId}`	`PUT`	Replace a custom suite (cases get re-inserted)
`/api/evals/suites/{suiteId}`	`DELETE`	Delete a custom suite and its cases

Running a suite

curl -X POST http://localhost:20128/api/evals \
  -H "Cookie: auth_token=..." \
  -H "Content-Type: application/json" \
  -d '{
    "suiteId": "golden-set",
    "target": { "type": "combo", "id": "my-combo" },
    "apiKeyId": "optional-api-key-uuid"
  }'

Optional fields:

outputs: Record<caseId, string> of pre-computed outputs. When provided, the runner skips dispatch and scores the supplied outputs only (useful for offline evaluation).
compareTarget: a second target to run in parallel; both runs share a generated runGroupId for head-to-head viewing.
apiKeyId: internal API key used to authenticate the dispatched /v1/chat/completions calls. Required when REQUIRE_API_KEY is enabled.
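
For scripted use, the same request can be assembled in TypeScript. The sketch below only builds the request and does not send it. The helper name `buildRunRequest` and the `RunSuiteBody` type are hypothetical; the URL, cookie name, and body fields are taken from the curl example and the optional fields listed above.

```typescript
// Hypothetical helper; field names follow the curl example and optional fields above.
type RunTarget =
  | { type: "suite-default"; id: null }
  | { type: "model" | "combo"; id: string };

interface RunSuiteBody {
  suiteId: string;
  target: RunTarget;
  compareTarget?: RunTarget; // second target for A/B comparison
  apiKeyId?: string; // needed when REQUIRE_API_KEY is enabled
  outputs?: Record<string, string>; // pre-computed outputs keyed by case id
}

interface BuiltRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

function buildRunRequest(
  body: RunSuiteBody,
  authToken: string,
  baseUrl: string = "http://localhost:20128",
): BuiltRequest {
  return {
    url: `${baseUrl}/api/evals`,
    method: "POST",
    headers: {
      Cookie: `auth_token=${authToken}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify(body),
  };
}
```

The result can be passed to `fetch(request.url, request)` in any runtime that provides `fetch`.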

Creating a custom suite

curl -X POST http://localhost:20128/api/evals/suites \
  -H "Cookie: auth_token=..." \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Production smoke",
    "description": "Quick sanity check before deploy",
    "cases": [
      {
        "name": "JSON shape",
        "model": "gpt-4o",
        "input": { "messages": [{ "role": "user", "content": "Reply with {\"ok\": true}" }] },
        "expected": { "strategy": "regex", "value": "\"ok\"\\s*:\\s*true" }
      }
    ]
  }'

Dispatch Pipeline

runEvalSuiteAgainstTarget() (src/lib/evals/runtime.ts) performs these steps in order:

1. Resolves the suite (built-in or custom).
2. For each case, builds a Request to /v1/chat/completions with the case's messages, the resolved model, stream: false, and max_tokens: 512 (or the case override).
3. Calls the chat handler directly (in-process, with no extra HTTP hop).
4. Captures latency and extracts text from either choices[0].message.content or the Responses-API output[] payload.
5. Scores all outputs via runSuite(), then persists the run via saveEvalRun().

Cases run sequentially. There is no concurrency flag today.
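
The text-extraction and latency-capture steps above might look roughly like the following. This is a sketch, not the code in runtime.ts: the names `extractText` and `timed` are hypothetical, and the Responses-API shape (message items whose content parts have type `output_text`) is an assumption about the payload.

```typescript
// Assumed payload shapes: Chat Completions and Responses-API style bodies.
type ChatCompletionLike = {
  choices?: { message?: { content?: string | null } }[];
};
type ResponsesLike = {
  output?: { type?: string; content?: { type?: string; text?: string }[] }[];
};

// Prefer choices[0].message.content; otherwise concatenate output_text parts.
function extractText(payload: ChatCompletionLike & ResponsesLike): string {
  const chatText = payload.choices?.[0]?.message?.content;
  if (typeof chatText === "string") return chatText;

  let text = "";
  for (const item of payload.output ?? []) {
    for (const part of item.content ?? []) {
      if (part.type === "output_text" && typeof part.text === "string") {
        text += part.text;
      }
    }
  }
  return text;
}

// Wraps one dispatch and records wall-clock latency in milliseconds.
// `dispatch` stands in for the in-process chat handler call.
async function timed<T>(
  dispatch: () => Promise<T>,
): Promise<{ result: T; latencyMs: number }> {
  const start = Date.now();
  const result = await dispatch();
  return { result, latencyMs: Date.now() - start };
}
```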

Dashboard

The UI lives at Dashboard → Usage → Evals (src/app/(dashboard)/dashboard/usage/components/EvalsTab.tsx). From there you can:

Browse built-in and custom suites with case-by-case preview.
Create/edit/delete custom suites with the case builder.
Pick a target (suite defaults / model / combo), optionally a second compareTarget, optionally an API key, then run on demand.
Inspect run history, per-case pass/fail, latency, and captured outputs.
See the rolling scorecard aggregated across the latest run per (suite, target) scope.
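
One way to picture the rolling scorecard is to keep only the latest run for each (suite, target) pair and then sum the counts. The sketch below illustrates that idea and is not the dashboard's code. The `RunRecord` shape, the function names, and the use of ISO-8601 UTC timestamps (which sort correctly as strings) are assumptions.

```typescript
// Hypothetical run record; timestamps are assumed to be ISO-8601 UTC strings.
interface RunRecord {
  suiteId: string;
  targetKey: string; // e.g. "combo:my-combo"
  createdAt: string;
  passed: number;
  total: number;
}

// Keep only the most recent run for each (suite, target) scope.
function latestPerScope(runs: RunRecord[]): RunRecord[] {
  const latest = new Map<string, RunRecord>();
  for (const run of runs) {
    const key = `${run.suiteId}|${run.targetKey}`;
    const seen = latest.get(key);
    if (!seen || run.createdAt > seen.createdAt) latest.set(key, run);
  }
  return Array.from(latest.values());
}

// Aggregate pass counts across the latest run of every scope.
function scorecard(runs: RunRecord[]): { passed: number; total: number; passRate: number } {
  let passed = 0;
  let total = 0;
  for (const run of latestPerScope(runs)) {
    passed += run.passed;
    total += run.total;
  }
  return { passed, total, passRate: total === 0 ? 0 : passed / total };
}
```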

Relationship with the Auto-Assessment RFC

A separate, narrower assessment subsystem lives at src/domain/assessment/ (see also AUTO-COMBO.md for the live scoring engine). That subsystem targets the Auto Combo engine — automatically scoring providers and models so combos can self-heal when upstreams fail. It uses its own runner, its own categorizer, and its own scoring logic.

The Evals framework documented here is the broader, general-purpose testing surface. Prefer it for arbitrary regression suites, A/B comparisons, and per-release smoke tests. Use the Auto-Assessment subsystem when you need real-time provider health to influence routing decisions.

CI Integration

There is no dedicated eval:ci npm script today. To gate releases on eval results, use one of two paths:

HTTP path: stand up the server, hit POST /api/evals with a known suiteId + target, and assert runs[].summary.passRate >= N in the response.
In-process path: in a script, import runEvalSuiteAgainstTarget() from @/lib/evals/runtime, run it against a test DB, and check the returned PersistedEvalRun.summary.
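
For the HTTP path, the assertion itself is a few lines. The sketch below checks a parsed response body against a threshold. The function names and the decision to fail when no runs are returned are hypothetical choices. Whether passRate is a fraction or a percentage is not stated here, so the threshold must use the same unit as the response.

```typescript
// Minimal view of the response; only runs[].summary.passRate is relied on.
interface EvalRunLike {
  summary: { passRate: number };
}
interface EvalResponseLike {
  runs: EvalRunLike[];
}

// Runs whose pass rate is below the threshold (same unit as the API value).
function failingRuns(response: EvalResponseLike, minPassRate: number): EvalRunLike[] {
  return response.runs.filter((run) => run.summary.passRate < minPassRate);
}

// The gate passes only if at least one run exists and none falls short.
function passesGate(response: EvalResponseLike, minPassRate: number): boolean {
  return response.runs.length > 0 && failingRuns(response, minPassRate).length === 0;
}
```

In a CI script, a failed gate would typically translate into a non-zero exit code so the job stops before deployment.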

Tests covering the route and history live at tests/unit/evals-route.test.ts and tests/unit/evals-history.test.ts.

Extension Points

Common changes and where to make them:

New scoring strategy: extend the switch (evalCase.expected.strategy) block in evaluateCase() (evalRunner.ts), widen EvalCaseStrategy in src/lib/db/evals.ts, and update evalCaseBuilderSchema in schemas.ts.
New built-in suite: define a suite object and call registerSuite() at the bottom of evalRunner.ts. listSuites() discovers it automatically.
Run with concurrency: replace the sequential for loop in runEvalSuiteAgainstTarget() with a bounded worker pool. Promise.all alone does not limit parallelism, and no concurrency control exists today.
Stream/tool-call cases: the runner currently forces stream: false. Streaming or tool-aware evaluation would require changes in runtime.ts (capture and aggregate SSE chunks before scoring).
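
For the concurrency item, one possible shape is a small worker pool that keeps at most `limit` cases in flight and preserves result order. This is a generic sketch and not code from runtime.ts. The name `mapWithConcurrency` is hypothetical, and wiring it into runEvalSuiteAgainstTarget() is left as an assumption.

```typescript
// Generic bounded-concurrency map: at most `limit` workers run at once.
async function mapWithConcurrency<T, R>(
  items: readonly T[],
  limit: number,
  worker: (item: T, index: number) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0; // index of the next unclaimed item

  // Each lane repeatedly claims the next item until none remain.
  async function runLane(): Promise<void> {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index], index);
    }
  }

  const laneCount = Math.max(1, Math.min(limit, items.length));
  const lanes: Promise<void>[] = [];
  for (let i = 0; i < laneCount; i++) lanes.push(runLane());
  await Promise.all(lanes);
  return results;
}
```

Because results are written by index, the output order matches the input order even when cases finish out of order. A limit of 1 reproduces the current sequential behavior.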
