mirror of
https://github.com/eigent-ai/eigent.git
synced 2026-05-17 04:01:18 +00:00
Co-authored-by: bytecii <bytecii@users.noreply.github.com>
Co-authored-by: Wendong-Fan <133094783+Wendong-Fan@users.noreply.github.com>
Eigent Benchmark — Harbor Adapter
Converts eigent benchmark tasks into Harbor task format for standardized agent evaluation.
Prerequisites
# Install Harbor CLI
uv tool install harbor
# Set your API key
export ANTHROPIC_API_KEY=your-key-here
Quick Start
cd backend/benchmark/harbor
# 1. Generate Harbor tasks
python run_adapter.py
# 2. Verify with oracle (should score 1.0)
harbor run \
-p datasets/eigent-bench \
-a oracle \
--env docker
# 3. Run with an agent
harbor run \
-p datasets/eigent-bench \
-a claude-code \
-m anthropic/claude-sonnet-4-20250514 \
--env docker
Generating Tasks
cd backend/benchmark/harbor
# Default (all tasks, 50/50 weights)
python run_adapter.py
# Custom weights (70% checker, 30% grader)
python run_adapter.py --checker-weight 0.7 --grader-weight 0.3
# Limit number of tasks
python run_adapter.py --limit 2
Running Benchmarks
Harbor builds a Docker image from each task's environment/Dockerfile automatically on first run. Subsequent runs reuse the cached image.
Run a specific task
# Run only task 0 (hello world)
harbor run \
-p datasets/eigent-bench \
-a claude-code \
-m anthropic/claude-haiku-4-5-20251001 \
-t "eigent-bench-0000" \
--env docker
Run all tasks
harbor run \
-p datasets/eigent-bench \
-a claude-code \
-m anthropic/claude-sonnet-4-20250514 \
--env docker
Run with different models
# Claude Haiku
harbor run -p datasets/eigent-bench -a claude-code \
-m anthropic/claude-haiku-4-5-20251001 --env docker
Run with concurrency
harbor run \
-p datasets/eigent-bench \
-a claude-code \
-m anthropic/claude-sonnet-4-20250514 \
--env docker \
--n-concurrent 3
Rebuild Docker Image
Harbor caches Docker images. To force a rebuild (e.g., after changing the Dockerfile):
# Clear Docker's build cache (shared by every build on this machine, not only Harbor's)
docker builder prune -f
# Then rerun — Harbor will rebuild the image
harbor run -p datasets/eigent-bench -a oracle --env docker
Reward Computation
reward = checker_weight × checker_score + grader_weight × grader_score
- checker_score: 1.0 if the checker passes, 0.0 if it fails
- grader_score: completed_milestones / total_milestones (0.0–1.0)
- Default weights: 0.5 / 0.5 (configurable via CLI flags)
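The formula above can be sketched in a few lines of Python. This is an illustration only, not the code in evaluate.py: the function name `compute_reward` and its parameter names are hypothetical, and returning a grader score of 0.0 when a task has no milestones is an assumption the source does not state.

```python
def compute_reward(checker_passed: bool,
                   completed_milestones: int,
                   total_milestones: int,
                   checker_weight: float = 0.5,
                   grader_weight: float = 0.5) -> float:
    """Weighted sum of the checker and grader scores (hypothetical helper)."""
    # checker_score is binary: 1.0 on pass, 0.0 on failure.
    checker_score = 1.0 if checker_passed else 0.0
    # grader_score is the fraction of milestones completed.
    # Assumption: a task with zero milestones yields a grader score of 0.0.
    if total_milestones > 0:
        grader_score = completed_milestones / total_milestones
    else:
        grader_score = 0.0
    return checker_weight * checker_score + grader_weight * grader_score


if __name__ == "__main__":
    # Checker passes, 3 of 4 milestones done, default 0.5 / 0.5 weights.
    print(compute_reward(True, 3, 4))
```

With the default weights, a passing checker and 3 of 4 milestones give 0.5 × 1.0 + 0.5 × 0.75 = 0.875.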
Results
Results are written to jobs/<timestamp>/result.json. Each trial directory contains:
jobs/<timestamp>/
├── result.json                  # Overall results
├── job.log                      # Build + run logs
└── eigent-bench-NNNN__<id>/
    ├── agent/                   # Agent logs and trajectory
    │   ├── claude-code.txt      # Raw agent output
    │   └── trajectory.json      # ATIF trajectory
    └── verifier/
        ├── metrics.json         # Detailed checker/grader results
        └── test-stdout.txt      # Verifier output
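To inspect the per-trial verifier output across a whole job, a small script can walk the layout above. The sketch below is an assumption-laden convenience, not part of the adapter: `collect_metrics` is a hypothetical helper, and because the schema of metrics.json is not documented here, it only loads each file as generic JSON without interpreting any fields.

```python
import json
from pathlib import Path


def collect_metrics(job_dir):
    """Map each trial directory name to the parsed verifier/metrics.json inside it."""
    metrics = {}
    # Each trial lives in <job_dir>/<trial>/verifier/metrics.json.
    for path in sorted(Path(job_dir).glob("*/verifier/metrics.json")):
        trial_name = path.parent.parent.name
        metrics[trial_name] = json.loads(path.read_text(encoding="utf-8"))
    return metrics
```

Calling `collect_metrics("jobs/<timestamp>")` with a real job directory returns a dictionary keyed by trial name; trials without a metrics.json are simply absent from the result.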
Generated Task Structure
eigent-bench-NNNN/
├── task.toml                    # Harbor metadata
├── instruction.md               # Task question
├── environment/
│   ├── Dockerfile               # Eigent stack + evaluation deps
│   └── workspace/
│       └── .env.development
├── tests/
│   ├── test.sh                  # Entry point → evaluate.py
│   ├── evaluate.py              # Checker+grader → reward.txt
│   ├── checker_N.py             # Copied from benchmark/checker/
│   ├── grader_N.py              # Copied from benchmark/grader/
│   └── config.json              # Task config + weights
└── solution/
    └── solve.sh                 # Oracle solution
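After running run_adapter.py, it can be useful to confirm that a generated task directory matches the layout above before handing it to Harbor. The following is a minimal sketch under stated assumptions: `EXPECTED_FILES` and `missing_files` are hypothetical names, and the per-task checker_N.py and grader_N.py files are left out because their numbering depends on the task.

```python
from pathlib import Path

# Fixed-name files from the layout above; checker_N.py and grader_N.py
# are omitted because N varies per task.
EXPECTED_FILES = [
    "task.toml",
    "instruction.md",
    "environment/Dockerfile",
    "environment/workspace/.env.development",
    "tests/test.sh",
    "tests/evaluate.py",
    "tests/config.json",
    "solution/solve.sh",
]


def missing_files(task_dir):
    """Return the expected relative paths that are not regular files under task_dir."""
    root = Path(task_dir)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]
```

An empty list from `missing_files("datasets/eigent-bench/eigent-bench-0000")` would indicate that every fixed-name file is present; any returned paths point to what the generator did not produce.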