# Benchmark
Run workforce benchmarks against the Eigent API and grade results.
## Setup

- Start the backend and frontend in the main project directory:

  ```bash
  npm run dev
  ```

- Set your API key:

  ```bash
  export OPENAI_API_KEY=sk-...
  ```
## Usage

From the `backend/` directory:

```bash
# Run all benchmarks
python3 -m benchmark.main

# Run a specific benchmark
python3 -m benchmark.main benchmark/dataset/0.json
```
## Structure

```
benchmark/
  main.py          # Entry point
  client.py        # API client (SSE streaming, auto task start, auto human reply)
  environment.py   # BenchmarkConfig, BenchmarkData, Env, Tests models
  dataset/         # Benchmark JSON configs
    0.json
  checker/         # Checker scripts (pass/fail per benchmark)
    0.py
  grader/          # Grading scripts (milestone completeness per benchmark)
    0.py
```
## Adding a benchmark

- Create `benchmark/dataset/<n>.json`:
```json
{
  "data": {
    "name": "<n>",
    "question": "Your task description",
    "env": {}
  },
  "metadata": {
    "difficulty": "easy|medium|hard",
    "description": "Brief description of what this benchmark tests",
    "tags": ["tag1", "tag2"]
  },
  "model_kwargs": {
    "model_platform": "openai",
    "model_type": "gpt-4o"
  },
  "tests": {
    "grader": ["benchmark/grader/<n>.py"],
    "checker": ["benchmark/checker/<n>.py"]
  }
}
```
The optional `metadata` field provides information about the benchmark:

- `difficulty`: complexity level, one of `"easy"`, `"medium"`, or `"hard"`
- `description`: brief explanation of what skills or capabilities the benchmark tests
- `tags`: array of keywords for filtering and organization
The `model_kwargs` field is also optional. Defaults come from `BENCHMARK_*` environment variables (see below), falling back to `openai` / `gpt-5.2` / `$OPENAI_API_KEY`. Per-benchmark JSON values override the environment defaults.
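This precedence amounts to a dictionary merge. The sketch below is illustrative only; the function name `resolve_model_kwargs` and the exact key names are assumptions, and the real resolution lives in the benchmark code:

```python
import os

# Illustrative sketch: per-benchmark JSON values override BENCHMARK_*
# environment variables, which override the hard-coded fallbacks.
def resolve_model_kwargs(benchmark_kwargs: dict | None) -> dict:
    defaults = {
        "model_platform": os.getenv("BENCHMARK_MODEL_PLATFORM", "openai"),
        "model_type": os.getenv("BENCHMARK_MODEL_TYPE", "gpt-5.2"),
        "api_key": os.getenv("BENCHMARK_API_KEY") or os.getenv("OPENAI_API_KEY", ""),
        "url": os.getenv("BENCHMARK_API_URL", "https://api.openai.com/v1"),
    }
    return {**defaults, **(benchmark_kwargs or {})}
```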
### Custom model providers
You can override the model for all benchmarks via environment variables (see `.env.example`):

```bash
export BENCHMARK_MODEL_PLATFORM="openai-compatible-model"
export BENCHMARK_MODEL_TYPE=""
export BENCHMARK_API_KEY=""
export BENCHMARK_API_URL=""
```
| Variable | Default | Description |
|---|---|---|
| `BENCHMARK_MODEL_PLATFORM` | `openai` | Provider name. Use `openai-compatible-model` for any OpenAI-compatible API. |
| `BENCHMARK_MODEL_TYPE` | `gpt-5.2` | Model identifier passed to the provider. |
| `BENCHMARK_API_KEY` | `$OPENAI_API_KEY` | API key for the provider. |
| `BENCHMARK_API_URL` | `https://api.openai.com/v1` | Base URL for the provider's API. |
**Important:** If the model is served through an OpenAI-compatible API (e.g. DeepSeek, MiniMax, Ollama, vLLM, LiteLLM, or any other non-OpenAI provider), set `BENCHMARK_MODEL_PLATFORM` to `openai-compatible-model`, not `openai`. The `openai` platform value is reserved for the official OpenAI API only.
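For example, a configuration for an OpenAI-compatible endpoint might look like this (the model name and URL are illustrative values for DeepSeek's public API; substitute your own provider's):

```bash
# Example only: any OpenAI-compatible endpoint works the same way.
export BENCHMARK_MODEL_PLATFORM="openai-compatible-model"
export BENCHMARK_MODEL_TYPE="deepseek-chat"
export BENCHMARK_API_KEY="sk-..."
export BENCHMARK_API_URL="https://api.deepseek.com/v1"
```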
To override a single benchmark, add `model_kwargs` to its JSON config; these take priority over environment variables.
- Create `benchmark/checker/<n>.py` with a `check(working_directory: str) -> bool` function.
- Create `benchmark/grader/<n>.py` with a `grade(working_directory: str) -> tuple[int, int]` function.
## Checker
Each checker is a Python script that checks whether a benchmark task succeeded. It must export a `check(working_directory: str) -> bool` function that:

- Returns `True` and prints `PASS` if the task succeeded.
- Returns `False` and prints `FAIL: <reason>` if the task failed.
```python
import os

def check(working_directory: str) -> bool:
    # Check files, run scripts, inspect output, etc.
    # Example condition: replace with your benchmark's real success criterion.
    success = os.path.exists(os.path.join(working_directory, "output.txt"))
    if success:
        print("PASS")
        return True
    else:
        print("FAIL: expected X, got Y")
        return False
```
## Grader
Each grader is a Python script that evaluates milestone completeness. It must export a `grade(working_directory: str) -> tuple[int, int]` function that returns `(completed, total)`:
```python
import os

def grade(working_directory: str) -> tuple[int, int]:
    total = 3
    completed = 0
    # Placeholder milestone checks: replace with your benchmark's real conditions.
    milestone_1_done = os.path.exists(os.path.join(working_directory, "step1.txt"))
    milestone_2_done = os.path.exists(os.path.join(working_directory, "step2.txt"))
    milestone_3_done = os.path.exists(os.path.join(working_directory, "step3.txt"))
    if milestone_1_done:
        completed += 1
    if milestone_2_done:
        completed += 1
    if milestone_3_done:
        completed += 1
    return completed, total  # e.g. 2/3
```
## Results
After all benchmarks finish, results are saved to `benchmark/{timestamp}_results.csv`:
```
benchmark,model,type,script,result
0,openai/gpt-5.2,checker,benchmark/checker/0.py,FAIL
0,openai/gpt-5.2,grader,benchmark/grader/0.py,7/7
```
Result CSV files are gitignored.
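Since the results file is plain CSV, a quick aggregate view takes only a few lines of Python (the file name below is an example; use your actual timestamped file):

```python
import csv
from collections import Counter

# Example file name; substitute your actual timestamped results file.
with open("benchmark/20250101_120000_results.csv") as f:
    rows = list(csv.DictReader(f))

# Summarize checker pass/fail counts and per-benchmark milestone scores.
checker = Counter(r["result"] for r in rows if r["type"] == "checker")
print(f"checker: {checker.get('PASS', 0)}/{sum(checker.values())} passed")
for r in rows:
    if r["type"] == "grader":
        print(f"benchmark {r['benchmark']}: milestones {r['result']}")
```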
## TODO: With MCP servers
To provide MCP servers to the workforce, add `installed_mcp` to `env`. Use `env_file` to store credentials separately (`.env` files are gitignored). The credentials are injected into each MCP server's `env` before the payload is sent to the backend; a sketch of that injection step follows the example below.
```bash
# benchmark/envs/1.env
NOTION_API_KEY=ntn_xxxxx
```
```json
{
  "data": {
    "name": "1",
    "question": "List all Notion pages",
    "env": {
      "env_file": "benchmark/envs/1.env",
      "installed_mcp": {
        "mcpServers": {
          "notion": {
            "command": "npx",
            "args": ["@modelcontextprotocol/server-notion"]
          }
        }
      }
    }
  },
  "tests": {
    "checker": ["benchmark/checker/1.py"]
  }
}
```
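The injection step can be pictured as below. This is a minimal sketch, assuming simple `KEY=VALUE` lines in the `.env` file; the function name `inject_credentials` is illustrative, and the real logic lives in the benchmark client:

```python
from pathlib import Path

def inject_credentials(env: dict) -> dict:
    # Illustrative sketch: merge env_file credentials into each MCP
    # server's env before the payload is sent to the backend.
    env_path = env.get("env_file")
    if not env_path:
        return env
    # Parse simple KEY=VALUE lines, skipping blanks and comments.
    creds = {}
    for line in Path(env_path).read_text().splitlines():
        line = line.strip()
        if line and not line.startswith("#") and "=" in line:
            key, _, value = line.partition("=")
            creds[key.strip()] = value.strip()
    # Existing per-server env values take priority over injected credentials.
    for server in env.get("installed_mcp", {}).get("mcpServers", {}).values():
        server["env"] = {**creds, **server.get("env", {})}
    return env
```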