Benchmark

Run workforce benchmarks against the Eigent API and grade results.

Setup

  1. Start the backend and frontend in the main project directory:
npm run dev
  2. Set your API key:
export OPENAI_API_KEY=sk-...

Usage

From the backend/ directory:

# Run all benchmarks
python3 -m benchmark.main

# Run a specific benchmark
python3 -m benchmark.main benchmark/dataset/0.json

Structure

benchmark/
  main.py           # Entry point
  client.py         # API client (SSE streaming, auto task start, auto human reply)
  environment.py    # BenchmarkConfig, BenchmarkData, Env, Tests models
  dataset/          # Benchmark JSON configs
    0.json
  checker/          # Checker scripts (pass/fail per benchmark)
    0.py
  grader/           # Grading scripts (milestone completeness per benchmark)
    0.py

Adding a benchmark

  1. Create benchmark/dataset/<n>.json:
{
  "data": {
    "name": "<n>",
    "question": "Your task description",
    "env": {}
  },
  "metadata": {
    "difficulty": "easy|medium|hard",
    "description": "Brief description of what this benchmark tests",
    "tags": ["tag1", "tag2"]
  },
  "model_kwargs": {
    "model_platform": "openai",
    "model_type": "gpt-4o"
  },
  "tests": {
    "grader": ["benchmark/grader/<n>.py"],
    "checker": ["benchmark/checker/<n>.py"]
  }
}

The metadata field (optional) provides information about the benchmark:

  • difficulty: Complexity level, one of "easy", "medium", or "hard"
  • description: Brief explanation of what skills or capabilities the benchmark tests
  • tags: Array of keywords for filtering and organization

model_platform and model_type default to "openai" and "gpt-4o". api_key defaults to $OPENAI_API_KEY. Set api_url for custom endpoints.
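
For a custom OpenAI-compatible endpoint, the api_url override presumably sits alongside the other model settings in model_kwargs (the URL below is illustrative):

"model_kwargs": {
  "model_platform": "openai",
  "model_type": "gpt-4o",
  "api_url": "http://localhost:8000/v1"
}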

  2. Create benchmark/checker/<n>.py with a check(working_directory: str) -> bool function.

  3. Create benchmark/grader/<n>.py with a grade(working_directory: str) -> tuple[int, int] function.

Checker

Each checker is a Python script that checks whether a benchmark task succeeded. It must export a check(working_directory: str) -> bool function that:

  • Returns True and prints PASS if the task succeeded.
  • Returns False and prints FAIL: <reason> if the task failed.
import os

def check(working_directory: str) -> bool:
    # check files, run scripts, inspect output, etc.
    # (output.txt is a placeholder; replace with whatever your task should produce)
    success = os.path.exists(os.path.join(working_directory, "output.txt"))
    if success:
        print("PASS")
        return True
    else:
        print("FAIL: expected output.txt in working directory")
        return False

Grader

Each grader is a Python script that evaluates milestone completeness. It must export a grade(working_directory: str) -> tuple[int, int] function that returns (completed, total):

import os

def grade(working_directory: str) -> tuple[int, int]:
    # each milestone below is a placeholder check; replace with your own
    milestone_1_done = os.path.exists(os.path.join(working_directory, "data.csv"))
    milestone_2_done = os.path.exists(os.path.join(working_directory, "report.md"))
    milestone_3_done = os.path.exists(os.path.join(working_directory, "summary.txt"))
    total = 3
    completed = 0
    if milestone_1_done:
        completed += 1
    if milestone_2_done:
        completed += 1
    if milestone_3_done:
        completed += 1
    return completed, total  # e.g. 2/3

Results

After all benchmarks finish, results are saved to benchmark/{timestamp}_results.csv:

benchmark,model,type,script,result
0,openai/gpt-5.2,checker,benchmark/checker/0.py,FAIL
0,openai/gpt-5.2,grader,benchmark/grader/0.py,7/7

Result CSV files are gitignored.
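
For a quick pass/fail summary, the CSV can be tallied with the standard library (the path is a placeholder for the actual results file):

import csv
from collections import Counter

results_path = "benchmark/<timestamp>_results.csv"  # replace with the actual results file
with open(results_path) as f:
    checker_rows = [row for row in csv.DictReader(f) if row["type"] == "checker"]

print(Counter(row["result"] for row in checker_rows))  # e.g. Counter({'PASS': 3, 'FAIL': 1})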

TODO: With MCP servers

To provide MCP servers to the workforce, add installed_mcp to env. Use env_file to store credentials separately (.env files are gitignored). The credentials are injected into each MCP server's env before sending the payload to the backend.

# benchmark/envs/1.env
NOTION_API_KEY=ntn_xxxxx

# benchmark/dataset/1.json
{
  "data": {
    "name": "1",
    "question": "List all Notion pages",
    "env": {
      "env_file": "benchmark/envs/1.env",
      "installed_mcp": {
        "mcpServers": {
          "notion": {
            "command": "npx",
            "args": ["@modelcontextprotocol/server-notion"]
          }
        }
      }
    }
  },
  "tests": {
    "checker": ["benchmark/checker/1.py"]
  }
}
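
The credential injection described above might look roughly like the sketch below; the real logic lives in client.py, and the helper name here is purely illustrative.

import json

def inject_credentials(env: dict) -> dict:
    # Sketch of the env_file -> MCP env injection (illustrative, not the actual client.py code)
    env_file = env.get("env_file")
    if not env_file:
        return env
    # Parse simple KEY=VALUE lines from the gitignored .env file
    credentials = {}
    with open(env_file) as f:
        for line in f:
            line = line.strip()
            if line and not line.startswith("#") and "=" in line:
                key, value = line.split("=", 1)
                credentials[key] = value
    # Merge the credentials into every configured MCP server's env
    for server in env.get("installed_mcp", {}).get("mcpServers", {}).values():
        server.setdefault("env", {}).update(credentials)
    return env

# Example: inject credentials into a dataset config's env before building the payload
with open("benchmark/dataset/1.json") as f:
    data = json.load(f)["data"]
data["env"] = inject_credentials(data["env"])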