mirror of
https://github.com/diegosouzapw/OmniRoute.git
synced 2026-05-23 04:28:06 +00:00
Release v3.8.1 — feature flags settings page, bracketed combo names, security hardening, multi-driver SQLite
250 lines
12 KiB
Markdown
250 lines
12 KiB
Markdown
---
|
|
title: "Evaluations (Evals)"
|
|
version: 3.8.1
|
|
lastUpdated: 2026-05-13
|
|
---
|
|
|
|
# Evaluations (Evals)
|
|
|
|
> **Source of truth:** `src/lib/evals/`, `src/lib/db/evals.ts`, `src/app/api/evals/`
|
|
> **Last updated:** 2026-05-13 — v3.8.0
|
|
|
|
OmniRoute ships a generic evaluation framework you can use to benchmark routing
|
|
configurations, single providers/models, or the bundled "golden set" suites.
|
|
Use it to verify routing changes, validate new providers, and gate releases
|
|
before promoting them to production traffic.
|
|
|
|
The framework is implemented as:
|
|
|
|
- A pure runner (`src/lib/evals/evalRunner.ts`) that registers in-memory
|
|
built-in suites, evaluates outputs against expected criteria, and aggregates
|
|
scorecards.
|
|
- A persistence layer (`src/lib/db/evals.ts`) for custom (user-defined) suites
|
|
and historical runs in SQLite.
|
|
- An orchestration layer (`src/lib/evals/runtime.ts`) that executes each case
|
|
by dispatching real calls to `POST /v1/chat/completions`, captures latency
|
|
and outputs, and persists the run.
|
|
- REST endpoints under `/api/evals/*` (management-auth only).
|
|
- A dashboard surface at `Dashboard → Usage → Evals` (`EvalsTab.tsx`).
|
|
|
|
## Concepts
|
|
|
|
### Suite
|
|
|
|
A suite is a named collection of test cases with a `description` and one or
|
|
more cases. Suites come from two sources:
|
|
|
|
| Source | Where defined | Mutable at runtime? |
|
|
| ---------- | --------------------------------------------- | ------------------- |
|
|
| `built-in` | Registered via `registerSuite()` at boot | No (code-defined) |
|
|
| `custom` | Stored in SQLite `eval_suites` + `eval_cases` | Yes (via API/UI) |
|
|
|
|
The current built-in suites (see `src/lib/evals/evalRunner.ts`):
|
|
|
|
- `golden-set` — 10 baseline cases across greeting/math/translation/safety
|
|
- `coding-proficiency` — Python/JS/SQL/TS/bug detection
|
|
- `reasoning-logic` — syllogisms, word problems, pattern recognition
|
|
- `multilingual` — translation and language detection
|
|
- `safety-guardrails` — PII, jailbreak, refusal, bias awareness
|
|
- `instruction-following` — JSON-only, numbered lists, language constraints
|
|
- `codex-comparison` — head-to-head coding tasks intended for compare mode
|
|
|
|
### Case
|
|
|
|
Each case carries:
|
|
|
|
| Field | Description |
|
|
| ---------- | ------------------------------------------------------------ |
|
|
| `id` | Stable identifier (used to key outputs and metrics) |
|
|
| `name` | Human-readable label |
|
|
| `model` | Default model when the run uses `suite-default` targeting |
|
|
| `input` | `{ messages, max_tokens? }` — sent to `/v1/chat/completions` |
|
|
| `expected` | `{ strategy, value }` — scoring rubric (see below) |
|
|
| `tags` | Optional labels (e.g. `safety`, `pii`, `jailbreak`) |
|
|
|
|
### Target
|
|
|
|
The same suite can be run against different targets. The target schema is
|
|
`evalTargetSchema` in `src/shared/validation/schemas.ts`:
|
|
|
|
| Target type | `id` | Behavior |
|
|
| --------------- | ---------- | --------------------------------------------------------------- |
|
|
| `suite-default` | `null` | Each case uses its built-in `model` field |
|
|
| `model` | model name | Force every case through one direct model (e.g. `gpt-4o`) |
|
|
| `combo` | combo name | Run every case through one combo (exercises the routing engine) |
|
|
|
|
For `model` and `combo`, the `id` field is required (enforced by Zod
|
|
`superRefine`). When `compareTarget` is provided, both targets must differ —
|
|
the runner persists both runs under the same `runGroupId` for A/B comparison.
|
|
|
|
## Scoring Rubrics
|
|
|
|
Implemented in `evaluateCase()` (evalRunner.ts):
|
|
|
|
| Strategy | Pass when… |
|
|
| ---------- | -------------------------------------------------------------------- |
|
|
| `exact` | `actualOutput === expected.value` |
|
|
| `contains` | `actualOutput.toLowerCase().includes(expected.value.toLowerCase())` |
|
|
| `regex` | `new RegExp(expected.value).test(actualOutput)` is truthy |
|
|
| `custom` | `expected.fn(actualOutput, evalCase)` returns truthy (built-in only) |
|
|
|
|
**Note:** Custom-function scoring is reserved for code-defined (built-in)
|
|
suites because functions cannot be serialized through the API. The
|
|
`evalCaseBuilderSchema` only accepts `contains | exact | regex` for
|
|
user-created suites.
|
|
|
|
There is no LLM-as-judge or embedding-based similarity scorer today — it would
|
|
be a clean extension point in `evaluateCase()`.
|
|
|
|
## Database Schema
|
|
|
|
Three tables (migrations `030_create_eval_runs.sql` and
|
|
`031_create_eval_suites.sql`):
|
|
|
|
| Table | Purpose |
|
|
| ------------- | ---------------------------------------------------------------------------------------------------------------------------- |
|
|
| `eval_suites` | Custom suite metadata (`id`, `name`, `description`) |
|
|
| `eval_cases` | Cases per suite — `input_json`, `expected_*`, `tags_json` |
|
|
| `eval_runs` | Historical runs — `pass_rate`, `total`, `passed`, `failed`, `avg_latency_ms`, `summary_json`, `results_json`, `outputs_json` |
|
|
|
|
Built-in suites are **not** stored in the DB. They live in memory and are
|
|
re-registered every time `evalRunner.ts` is imported.
|
|
|
|
## REST API
|
|
|
|
All endpoints require management auth (`requireManagementAuth`) — they are not
|
|
part of the public proxy surface.
|
|
|
|
| Endpoint | Method | Description |
|
|
| ----------------------------- | -------- | ------------------------------------------------------------- |
|
|
| `/api/evals` | `GET` | List suites + recent runs + scorecard + targets + keys |
|
|
| `/api/evals` | `POST` | Run a suite (single or compare) — schema `evalRunSuiteSchema` |
|
|
| `/api/evals/{suiteId}` | `GET` | Fetch one suite (built-in or custom) |
|
|
| `/api/evals/suites` | `POST` | Create a custom suite — schema `evalSuiteSaveSchema` |
|
|
| `/api/evals/suites/{suiteId}` | `GET` | Fetch a custom suite |
|
|
| `/api/evals/suites/{suiteId}` | `PUT` | Replace a custom suite (cases get re-inserted) |
|
|
| `/api/evals/suites/{suiteId}` | `DELETE` | Delete a custom suite and its cases |
|
|
|
|
### Running a suite
|
|
|
|
```bash
|
|
curl -X POST http://localhost:20128/api/evals \
|
|
-H "Cookie: auth_token=..." \
|
|
-H "Content-Type: application/json" \
|
|
-d '{
|
|
"suiteId": "golden-set",
|
|
"target": { "type": "combo", "id": "my-combo" },
|
|
"apiKeyId": "optional-api-key-uuid"
|
|
}'
|
|
```
|
|
|
|
Optional fields:
|
|
|
|
- `outputs` — `Record<caseId, string>` of pre-computed outputs. When provided,
|
|
the runner **skips dispatch** and only scores the cached outputs (useful for
|
|
offline evaluation).
|
|
- `compareTarget` — second target to run in parallel; both runs share a
|
|
generated `runGroupId` for head-to-head viewing.
|
|
- `apiKeyId` — internal API key used to authenticate the dispatched
|
|
`/v1/chat/completions` calls. Required when `REQUIRE_API_KEY` is enabled.
|
|
|
|
### Creating a custom suite
|
|
|
|
```bash
|
|
curl -X POST http://localhost:20128/api/evals/suites \
|
|
-H "Cookie: auth_token=..." \
|
|
-H "Content-Type: application/json" \
|
|
-d '{
|
|
"name": "Production smoke",
|
|
"description": "Quick sanity check before deploy",
|
|
"cases": [
|
|
{
|
|
"name": "JSON shape",
|
|
"model": "gpt-4o",
|
|
"input": { "messages": [{ "role": "user", "content": "Reply with {\"ok\": true}" }] },
|
|
"expected": { "strategy": "regex", "value": "\"ok\"\\s*:\\s*true" }
|
|
}
|
|
]
|
|
}'
|
|
```
|
|
|
|
## Dispatch Pipeline
|
|
|
|
`runEvalSuiteAgainstTarget()` (`src/lib/evals/runtime.ts`):
|
|
|
|
1. Resolves the suite (built-in or custom).
|
|
2. For each case, builds a `Request` to `/v1/chat/completions` with the case's
|
|
`messages`, the resolved `model`, `stream: false`, and `max_tokens: 512`
|
|
(or the case override).
|
|
3. Calls the chat handler directly (in-process — no extra HTTP hop).
|
|
4. Captures latency and extracts text from either `choices[0].message.content`
|
|
or the Responses-API `output[]` payload.
|
|
5. Scores all outputs via `runSuite()`, then persists via `saveEvalRun()`.
|
|
|
|
Cases run **sequentially**. There is no concurrency flag today.
|
|
|
|
## Dashboard
|
|
|
|
The UI lives at `Dashboard → Usage → Evals`
|
|
(`src/app/(dashboard)/dashboard/usage/components/EvalsTab.tsx`). From there you
|
|
can:
|
|
|
|
- Browse built-in and custom suites with case-by-case preview.
|
|
- Create/edit/delete custom suites with the case builder.
|
|
- Pick a target (suite defaults / model / combo), optionally a second
|
|
`compareTarget`, optionally an API key, then run on demand.
|
|
- Inspect run history, per-case pass/fail, latency, and captured outputs.
|
|
- See the rolling scorecard aggregated across the latest run per
|
|
`(suite, target)` scope.
|
|
|
|
## Relationship with the Auto-Assessment RFC
|
|
|
|
A separate, narrower assessment subsystem lives at `src/domain/assessment/`
|
|
(see also [AUTO-COMBO.md](../routing/AUTO-COMBO.md) for the live scoring engine).
|
|
That subsystem targets the Auto Combo engine — automatically scoring providers and
|
|
models so combos can self-heal when upstreams fail. It uses its own runner,
|
|
its own categorizer, and its own scoring logic.
|
|
|
|
The Evals framework documented here is the **broader, general-purpose
|
|
testing surface**. Prefer it for arbitrary regression suites, A/B comparisons,
|
|
and per-release smoke tests. Use the Auto-Assessment subsystem when you need
|
|
real-time provider health to influence routing decisions.
|
|
|
|
## CI Integration
|
|
|
|
There is no dedicated `eval:ci` npm script today. Two paths if you want to
|
|
gate releases on eval results:
|
|
|
|
- **HTTP path**: stand up the server, hit `POST /api/evals` with a known
|
|
`suiteId` + `target`, and assert `runs[].summary.passRate >= N` in the
|
|
response.
|
|
- **In-process path**: import `runEvalSuiteAgainstTarget()` from
|
|
`@/lib/evals/runtime` from a script, run against a test DB, and check the
|
|
returned `PersistedEvalRun.summary`.
|
|
|
|
Tests covering the route and history live at
|
|
`tests/unit/evals-route.test.ts` and `tests/unit/evals-history.test.ts`.
|
|
|
|
## Extension Points
|
|
|
|
Common changes and where to make them:
|
|
|
|
- **New scoring strategy** — extend the `switch (evalCase.expected.strategy)`
|
|
block in `evaluateCase()` (`evalRunner.ts`) and widen `EvalCaseStrategy` in
|
|
`src/lib/db/evals.ts` plus `evalCaseBuilderSchema` in `schemas.ts`.
|
|
- **New built-in suite** — define a suite object and call `registerSuite()` at
|
|
the bottom of `evalRunner.ts`. It will be auto-discovered by `listSuites()`.
|
|
- **Run with concurrency** — change the sequential `for` loop in
|
|
`runEvalSuiteAgainstTarget()` to a bounded `Promise.all` (no concurrency
|
|
control exists today).
|
|
- **Stream/tool-call cases** — currently the runner forces `stream: false`.
|
|
Streaming or tool-aware evaluation would require changes in `runtime.ts`
|
|
(capture and aggregate SSE chunks before scoring).
|
|
|
|
## See Also
|
|
|
|
- [USER_GUIDE.md](../guides/USER_GUIDE.md) — overall product walkthrough
|
|
- [ARCHITECTURE.md](../architecture/ARCHITECTURE.md) — request pipeline reference
|
|
- [AUTO-COMBO.md](../routing/AUTO-COMBO.md) — Auto Combo scoring engine (live runtime)
|
|
- Source: `src/lib/evals/`, `src/lib/db/evals.ts`, `src/app/api/evals/`
|
|
- UI: `src/app/(dashboard)/dashboard/usage/components/EvalsTab.tsx`
|