eigent/backend/benchmark/README.md

Benchmark

Run workforce benchmarks against the Eigent API and grade results.

Setup

  1. Start the backend and frontend in the main project directory:
npm run dev
  2. Set your API key:
export OPENAI_API_KEY=sk-...

Usage

From the backend/ directory:

# Run all benchmarks
python3 -m benchmark.main

# Run a specific benchmark
python3 -m benchmark.main benchmark/dataset/0.json

Structure

benchmark/
  main.py           # Entry point
  client.py         # API client (SSE streaming, auto task start, auto human reply)
  environment.py    # BenchmarkConfig, BenchmarkData, Env, Tests models
  dataset/          # Benchmark JSON configs
    0.json
  checker/          # Checker scripts (pass/fail per benchmark)
    0.py
  grader/           # Grading scripts (milestone completeness per benchmark)
    0.py

Adding a benchmark

  1. Create benchmark/dataset/<n>.json:
{
  "data": {
    "name": "<n>",
    "question": "Your task description",
    "env": {}
  },
  "metadata": {
    "difficulty": "easy|medium|hard",
    "description": "Brief description of what this benchmark tests",
    "tags": ["tag1", "tag2"]
  },
  "model_kwargs": {
    "model_platform": "openai",
    "model_type": "gpt-4o"
  },
  "tests": {
    "grader": ["benchmark/grader/<n>.py"],
    "checker": ["benchmark/checker/<n>.py"]
  }
}

The metadata field (optional) provides information about the benchmark:

  • difficulty: Complexity level, one of "easy", "medium", or "hard"
  • description: Brief explanation of what skills or capabilities the benchmark tests
  • tags: Array of keywords for filtering and organization

The model_kwargs field is optional. Defaults come from BENCHMARK_* environment variables (see below), falling back to openai / gpt-5.2 / $OPENAI_API_KEY. Per-benchmark JSON values override the environment defaults.

Custom model providers

You can override the model for all benchmarks via environment variables (see .env.example):

export BENCHMARK_MODEL_PLATFORM="openai-compatible-model"
export BENCHMARK_MODEL_TYPE=""
export BENCHMARK_API_KEY=""
export BENCHMARK_API_URL=""
| Variable | Default | Description |
| --- | --- | --- |
| BENCHMARK_MODEL_PLATFORM | openai | Provider name. Use openai-compatible-model for any OpenAI-compatible API. |
| BENCHMARK_MODEL_TYPE | gpt-5.2 | Model identifier passed to the provider. |
| BENCHMARK_API_KEY | $OPENAI_API_KEY | API key for the provider. |
| BENCHMARK_API_URL | https://api.openai.com/v1 | Base URL for the provider's API. |

Important: If the model is served through an OpenAI-compatible API (e.g. DeepSeek, MiniMax, Ollama, vLLM, LiteLLM, or any other non-OpenAI provider), set BENCHMARK_MODEL_PLATFORM to openai-compatible-model, not openai. The openai platform value is reserved for the official OpenAI API only.

To override a single benchmark, add model_kwargs to its JSON config — these take priority over environment variables.
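The precedence above can be sketched as a simple dictionary merge. This is an illustrative helper, not the project's actual resolution code, which lives elsewhere in the backend:

```python
import os

def resolve_model_kwargs(benchmark_kwargs=None):
    """Sketch of precedence: per-benchmark JSON overrides BENCHMARK_*
    environment variables, which override the hard-coded defaults
    (openai / gpt-5.2 / $OPENAI_API_KEY)."""
    defaults = {
        "model_platform": os.getenv("BENCHMARK_MODEL_PLATFORM", "openai"),
        "model_type": os.getenv("BENCHMARK_MODEL_TYPE", "gpt-5.2"),
        "api_key": os.getenv("BENCHMARK_API_KEY", os.getenv("OPENAI_API_KEY", "")),
        "api_url": os.getenv("BENCHMARK_API_URL", "https://api.openai.com/v1"),
    }
    # JSON values win over environment defaults.
    return {**defaults, **(benchmark_kwargs or {})}
```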

  2. Create benchmark/checker/<n>.py with a check(working_directory: str) -> bool function.

  3. Create benchmark/grader/<n>.py with a grade(working_directory: str) -> tuple[int, int] function.

Checker

Each checker is a Python script that checks whether a benchmark task succeeded. It must export a check(working_directory: str) -> bool function that:

  • Returns True and prints PASS if the task succeeded.
  • Returns False and prints FAIL: <reason> if the task failed.
def check(working_directory: str) -> bool:
    # check files, run scripts, inspect output, etc.
    if success:
        print("PASS")
        return True
    else:
        print("FAIL: expected X, got Y")
        return False
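For instance, a checker for a hypothetical task that must produce a non-empty report.txt (the file name is just an example) could look like:

```python
import os

def check(working_directory: str) -> bool:
    # Hypothetical task: the workforce must write a non-empty report.txt.
    path = os.path.join(working_directory, "report.txt")
    if not os.path.exists(path):
        print("FAIL: report.txt not found")
        return False
    if os.path.getsize(path) == 0:
        print("FAIL: report.txt is empty")
        return False
    print("PASS")
    return True
```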

Grader

Each grader is a Python script that evaluates milestone completeness. It must export a grade(working_directory: str) -> tuple[int, int] function that returns (completed, total):

def grade(working_directory: str) -> tuple[int, int]:
    total = 3
    completed = 0
    if milestone_1_done:
        completed += 1
    if milestone_2_done:
        completed += 1
    if milestone_3_done:
        completed += 1
    return completed, total  # e.g. 2/3
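As a concrete (hypothetical) example, a grader whose milestones are three files the task should have created:

```python
import os

def grade(working_directory: str) -> tuple[int, int]:
    # Hypothetical milestones: one per expected output file.
    milestones = ["data.csv", "summary.md", "plot.png"]
    completed = sum(
        1 for name in milestones
        if os.path.exists(os.path.join(working_directory, name))
    )
    return completed, len(milestones)
```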

Results

After all benchmarks finish, results are saved to benchmark/{timestamp}_results.csv:

benchmark,model,type,script,result
0,openai/gpt-5.2,checker,benchmark/checker/0.py,FAIL
0,openai/gpt-5.2,grader,benchmark/grader/0.py,7/7

Result CSV files are gitignored.
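Because the CSV columns are fixed, a quick pass-rate summary can be computed with a few lines of stdlib Python. This is a convenience sketch, not part of the benchmark suite; point it at your own results file:

```python
import csv

def summarize(results_path: str) -> dict:
    """Count checker PASS/FAIL rows and accumulate grader milestone scores."""
    passed = failed = 0
    milestones_done = milestones_total = 0
    with open(results_path, newline="") as f:
        for row in csv.DictReader(f):
            if row["type"] == "checker":
                if row["result"] == "PASS":
                    passed += 1
                else:
                    failed += 1
            elif row["type"] == "grader":
                done, total = row["result"].split("/")
                milestones_done += int(done)
                milestones_total += int(total)
    return {
        "checker_passed": passed,
        "checker_failed": failed,
        "milestones": f"{milestones_done}/{milestones_total}",
    }
```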

TODO: With MCP servers

To provide MCP servers to the workforce, add installed_mcp to env. Use env_file to store credentials separately (.env files are gitignored). The credentials are injected into each MCP server's env before sending the payload to the backend.

# benchmark/envs/1.env
NOTION_API_KEY=ntn_xxxxx
{
  "data": {
    "name": "1",
    "question": "List all Notion pages",
    "env": {
      "env_file": "benchmark/envs/1.env",
      "installed_mcp": {
        "mcpServers": {
          "notion": {
            "command": "npx",
            "args": ["@modelcontextprotocol/server-notion"]
          }
        }
      }
    }
  },
  "tests": {
    "checker": ["benchmark/checker/1.py"]
  }
}
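A minimal sketch of the injection step described above (this assumes the behavior of client.py; the real implementation may differ): read KEY=VALUE pairs from the env_file and copy them into each MCP server's env block before posting the payload.

```python
def load_env_file(path: str) -> dict:
    # Parse simple KEY=VALUE lines, skipping blanks and comments.
    env = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            env[key.strip()] = value.strip()
    return env

def inject_credentials(benchmark_env: dict) -> dict:
    # Merge env_file credentials into every MCP server's "env" mapping;
    # values already present on a server are left untouched.
    creds = load_env_file(benchmark_env["env_file"])
    servers = benchmark_env["installed_mcp"]["mcpServers"]
    for server in servers.values():
        server["env"] = {**creds, **server.get("env", {})}
    return benchmark_env
```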