eigent/backend/benchmark/README.md

Benchmark

Run workforce benchmarks against the Eigent API and grade results.

Setup

  1. Start the backend and frontend in the main project directory:
npm run dev
  2. Set your API key:
export OPENAI_API_KEY=sk-...

Usage

From the backend/ directory:

# Run all benchmarks
python3 -m benchmark.main

# Run a specific benchmark
python3 -m benchmark.main benchmark/dataset/0.json

Structure

benchmark/
  main.py           # Entry point
  client.py         # API client (SSE streaming, auto task start, auto human reply)
  environment.py    # BenchmarkConfig, BenchmarkData, Env, Tests models
  dataset/          # Benchmark JSON configs
    0.json
  checker/          # Checker scripts (pass/fail per benchmark)
    0.py
  grader/           # Grading scripts (milestone completeness per benchmark)
    0.py

Adding a benchmark

  1. Create benchmark/dataset/<n>.json:
{
  "data": {
    "name": "<n>",
    "question": "Your task description",
    "env": {}
  },
  "metadata": {
    "difficulty": "easy|medium|hard",
    "description": "Brief description of what this benchmark tests",
    "tags": ["tag1", "tag2"]
  },
  "model_kwargs": {
    "model_platform": "openai",
    "model_type": "gpt-4o"
  },
  "tests": {
    "grader": ["benchmark/grader/<n>.py"],
    "checker": ["benchmark/checker/<n>.py"]
  }
}

The metadata field (optional) provides information about the benchmark:

  • difficulty: Complexity level, one of "easy", "medium", or "hard"
  • description: Brief explanation of what skills or capabilities the benchmark tests
  • tags: Array of keywords for filtering and organization

The model_kwargs field is optional. Defaults come from BENCHMARK_* environment variables (see below), falling back to openai / gpt-5.2 / $OPENAI_API_KEY. Per-benchmark JSON values override the environment defaults.

Custom model providers

You can override the model for all benchmarks via environment variables (see .env.example):

export BENCHMARK_MODEL_PLATFORM="openai-compatible-model"
export BENCHMARK_MODEL_TYPE=""
export BENCHMARK_API_KEY=""
export BENCHMARK_API_URL=""
| Variable | Default | Description |
| --- | --- | --- |
| BENCHMARK_MODEL_PLATFORM | openai | Provider name. Use openai-compatible-model for any OpenAI-compatible API. |
| BENCHMARK_MODEL_TYPE | gpt-5.2 | Model identifier passed to the provider. |
| BENCHMARK_API_KEY | $OPENAI_API_KEY | API key for the provider. |
| BENCHMARK_API_URL | https://api.openai.com/v1 | Base URL for the provider's API. |

Important: If the model is served through an OpenAI-compatible API (e.g. DeepSeek, MiniMax, Ollama, vLLM, LiteLLM, or any other non-OpenAI provider), set BENCHMARK_MODEL_PLATFORM to openai-compatible-model, not openai. The openai platform value is reserved for the official OpenAI API only.

To override a single benchmark, add model_kwargs to its JSON config — these take priority over environment variables.
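The precedence above can be sketched as a simple dictionary merge. This is an illustrative helper, not the project's actual resolution code, which lives elsewhere in the backend:

```python
import os

def resolve_model_kwargs(benchmark_kwargs=None):
    """Sketch of precedence: per-benchmark JSON overrides BENCHMARK_*
    environment variables, which override the hard-coded defaults
    (openai / gpt-5.2 / $OPENAI_API_KEY)."""
    defaults = {
        "model_platform": os.getenv("BENCHMARK_MODEL_PLATFORM", "openai"),
        "model_type": os.getenv("BENCHMARK_MODEL_TYPE", "gpt-5.2"),
        "api_key": os.getenv("BENCHMARK_API_KEY", os.getenv("OPENAI_API_KEY", "")),
        "api_url": os.getenv("BENCHMARK_API_URL", "https://api.openai.com/v1"),
    }
    # JSON values win over environment defaults.
    return {**defaults, **(benchmark_kwargs or {})}
```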

  2. Create benchmark/checker/<n>.py with a check(working_directory: str) -> bool function.

  3. Create benchmark/grader/<n>.py with a grade(working_directory: str) -> tuple[int, int] function.

Checker

Each checker is a Python script that checks whether a benchmark task succeeded. It must export a check(working_directory: str) -> bool function that:

  • Returns True and prints PASS if the task succeeded.
  • Returns False and prints FAIL: <reason> if the task failed.
def check(working_directory: str) -> bool:
    # check files, run scripts, inspect output, etc.
    if success:
        print("PASS")
        return True
    else:
        print("FAIL: expected X, got Y")
        return False
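For instance, a checker for a hypothetical task that must produce a non-empty report.txt (the file name is just an example) could look like:

```python
import os

def check(working_directory: str) -> bool:
    # Hypothetical task: the workforce must write a non-empty report.txt.
    path = os.path.join(working_directory, "report.txt")
    if not os.path.exists(path):
        print("FAIL: report.txt not found")
        return False
    if os.path.getsize(path) == 0:
        print("FAIL: report.txt is empty")
        return False
    print("PASS")
    return True
```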

Grader

Each grader is a Python script that evaluates milestone completeness. It must export a grade(working_directory: str) -> tuple[int, int] function that returns (completed, total):

def grade(working_directory: str) -> tuple[int, int]:
    total = 3
    completed = 0
    if milestone_1_done:
        completed += 1
    if milestone_2_done:
        completed += 1
    if milestone_3_done:
        completed += 1
    return completed, total  # e.g. 2/3
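As a concrete (hypothetical) example, a grader whose milestones are three files the task should have created:

```python
import os

def grade(working_directory: str) -> tuple[int, int]:
    # Hypothetical milestones: one per expected output file.
    milestones = ["data.csv", "summary.md", "plot.png"]
    completed = sum(
        1 for name in milestones
        if os.path.exists(os.path.join(working_directory, name))
    )
    return completed, len(milestones)
```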

Results

After all benchmarks finish, results are saved to benchmark/{timestamp}_results.csv:

benchmark,model,type,script,result
0,openai/gpt-5.2,checker,benchmark/checker/0.py,FAIL
0,openai/gpt-5.2,grader,benchmark/grader/0.py,7/7

Result CSV files are gitignored.
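Because the CSV columns are fixed, a quick pass-rate summary can be computed with a few lines of stdlib Python. This is a convenience sketch, not part of the benchmark suite; point it at your own results file:

```python
import csv

def summarize(results_path: str) -> dict:
    """Count checker PASS/FAIL rows and accumulate grader milestone scores."""
    passed = failed = 0
    milestones_done = milestones_total = 0
    with open(results_path, newline="") as f:
        for row in csv.DictReader(f):
            if row["type"] == "checker":
                if row["result"] == "PASS":
                    passed += 1
                else:
                    failed += 1
            elif row["type"] == "grader":
                done, total = row["result"].split("/")
                milestones_done += int(done)
                milestones_total += int(total)
    return {
        "checker_passed": passed,
        "checker_failed": failed,
        "milestones": f"{milestones_done}/{milestones_total}",
    }
```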

TODO: With MCP servers

To provide MCP servers to the workforce, add installed_mcp to env. Use env_file to store credentials separately (.env files are gitignored). The credentials are injected into each MCP server's env before sending the payload to the backend.

# benchmark/envs/1.env
NOTION_API_KEY=ntn_xxxxx
{
  "data": {
    "name": "1",
    "question": "List all Notion pages",
    "env": {
      "env_file": "benchmark/envs/1.env",
      "installed_mcp": {
        "mcpServers": {
          "notion": {
            "command": "npx",
            "args": ["@modelcontextprotocol/server-notion"]
          }
        }
      }
    }
  },
  "tests": {
    "checker": ["benchmark/checker/1.py"]
  }
}
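A minimal sketch of the injection step described above (this assumes the behavior of client.py; the real implementation may differ): read KEY=VALUE pairs from the env_file and copy them into each MCP server's env block before posting the payload.

```python
def load_env_file(path: str) -> dict:
    # Parse simple KEY=VALUE lines, skipping blanks and comments.
    env = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            env[key.strip()] = value.strip()
    return env

def inject_credentials(benchmark_env: dict) -> dict:
    # Merge env_file credentials into every MCP server's "env" mapping;
    # values already present on a server are left untouched.
    creds = load_env_file(benchmark_env["env_file"])
    servers = benchmark_env["installed_mcp"]["mcpServers"]
    for server in servers.values():
        server["env"] = {**creds, **server.get("env", {})}
    return benchmark_env
```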