Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> |
||
|---|---|---|
| .. | ||
| mcp-harness | ||
| suite | ||
| .gitignore | ||
| config.yaml | ||
| gym.png | ||
| Justfile | ||
| README.md | ||
Open Model Gym
Run agent tests across a matrix of models × runners × scenarios.
It isn't hard for any agent to do ok with opus, but lets scale things in the other direction. What do we have to break things down to.
Quick Start
just install # one-time setup
just run # run full matrix (3 reps each)
just report # view results
How It Works
The test harness runs every combination of models, runners, and scenarios defined in your matrix. Each test runs multiple times (default 3) and keeps the worst result — if a test fails even once, it's marked failed. This catches flaky passes.
Configuration
Edit config.yaml to define your test matrix:
Models
LLMs to test against. Supports any provider (Anthropic, OpenAI, Ollama, etc.):
models:
- name: opus
provider: anthropic
model: claude-opus-4-5-20251101
- name: qwen3-coder
provider: ollama
model: qwen3-coder:64k
- name: gpt4
provider: openai
model: gpt-4-turbo
Runners
Agent frameworks that execute the tests. Each runner has its own binary, type, and configuration:
runners:
# Goose agent with extensions
- name: goose-full
type: goose
bin: goose # path to binary (can be absolute)
extensions: [developer, todo, skills]
stdio:
- node mcp-harness/dist/index.js
# OpenCode agent
- name: opencode
type: opencode
bin: opencode # path to binary
stdio:
- node mcp-harness/dist/index.js
# Custom goose binary path
- name: goose-dev
type: goose
bin: /path/to/my/goose-dev
extensions: [developer]
Supported runner types:
Runner Details
Each runner has different setup requirements, MCP integration methods, and session handling.
Goose
Goose is an open-source coding agent with built-in MCP support.
Setup: Install via brew install goose or from source.
MCP Integration: Native support. The harness writes a config.yaml to an isolated .goose-root/ directory with extensions and MCP servers:
extensions:
developer:
enabled: true
mcp_harness:
type: stdio
enabled: true
cmd: node
args: [mcp-harness/dist/index.js]
Session Handling: Uses --name <session> for named sessions, --resume to continue:
- Turn 1:
goose run -i <prompt> --name <session> - Turn 2+:
goose run -i <prompt> --name <session> --resume - Single-turn:
goose run -i <prompt> --no-session
OpenCode
OpenCode is a terminal-based coding agent.
Setup: Install via their website or package manager.
MCP Integration: Native support. The harness writes an opencode.json config to the workdir:
{
"mcp": {
"harness": {
"type": "local",
"command": ["node", "mcp-harness/dist/index.js"],
"enabled": true
}
},
"model": "anthropic/claude-opus-4-5-20251101"
}
Session Handling: Uses --continue to resume the last session in the working directory:
- Turn 1:
opencode run "<prompt>" - Turn 2+:
opencode run --continue "<prompt>"
⚠️ OpenCode doesn't support named sessions, so multi-turn scenarios exclude it.
Pi
Pi is a lightweight coding agent that requires an adapter for MCP support.
Setup:
# Install Pi
npm install -g @anthropic/pi # or from source
# Install the MCP adapter (required for MCP tools)
pi install npm:pi-mcp-adapter
The just install recipe auto-installs pi-mcp-adapter if missing.
MCP Integration: Via pi-mcp-adapter. The harness dynamically writes a .pi-mcp.json config to the workdir:
{
"mcpServers": {
"harness": {
"command": "node",
"args": ["mcp-harness/dist/index.js"],
"lifecycle": "eager",
"env": { "MCP_HARNESS_LOG": "<workdir>/tool-calls.log" }
}
},
"settings": { "directTools": true }
}
Key settings:
directTools: true— Registers MCP tools directly in Pi's tool list (no wrapper)lifecycle: "eager"— Connects to MCP servers at startup
Model Configuration: Pi requires custom models (like Ollama) to be defined in models.json. The harness automatically generates this config in an isolated .pi-root/ directory and sets PI_CODING_AGENT_DIR to use it:
{
"providers": {
"ollama": {
"baseUrl": "http://localhost:11434/v1",
"api": "openai-completions",
"apiKey": "ollama",
"models": [{ "id": "model-name", "name": "Model Name", ... }]
}
}
}
The harness copies auth.json from your real Pi config (~/.pi/agent/) so API keys work.
Session Handling: Uses --session <path> for file-based sessions, --continue to resume:
- Turn 1:
pi -p --session <path> "<prompt>" - Turn 2+:
pi -p --continue --session <path> "<prompt>" - Single-turn:
pi -p --no-session "<prompt>"
The -p flag runs Pi in non-interactive "print" mode for automation
Matrix
Define which scenarios run against which models/runners:
matrix:
- scenario: file-editing
models: [opus, qwen3-coder] # omit to run all models
runners: [goose-full, opencode] # omit to run all runners
- scenario: everyday-app-automation
# runs against ALL models and ALL runners
Scenarios
Scenarios live in suite/scenarios/ as YAML files:
name: file-editing
description: Create and edit files
prompt: |
1. Create joke.md containing a short joke
2. Edit hello.rs to add a debug function
setup:
hello.rs: |
fn main() { println!("Hello!"); }
validate:
- type: file_exists
path: joke.md
- type: file_matches
path: hello.rs
regex: "fn\\s+debug"
Validation Rules
| Rule | Description |
|---|---|
file_exists |
File exists at path |
file_not_empty |
File exists and has content |
file_contains |
File contains literal string |
file_matches |
File matches regex pattern |
command_succeeds |
Shell command exits 0 |
tool_called |
MCP tool was called with matching args (regex supported) |
Tool call validation example:
validate:
- type: tool_called
tool: slack_search_messages
args:
query: /quarterly.?review/ # regex pattern
- type: tool_called
tool: jira_create_issue
args:
summary: /Q1.*Review/
description: /David Brown/
MCP Harness
Mock MCP server providing simulated tools for testing agent tool-use without hitting real APIs.
cd mcp-harness && npm install && npm run build
Available tools: gdrive, sheets, salesforce, slack, calendar, gmail, jira, github
Each tool returns realistic mock data. Tool calls are logged to tool-calls.log in the workdir for validation.
Commands
| Command | Description |
|---|---|
just run |
Full test run (3 reps each, worst kept) |
just test |
Quick run (1 rep each) |
just scenario <name> |
Run specific scenario |
just agent <name> |
Run specific agent |
just report |
Open HTML results |
CLI Flags
# Filter by scenario, model, or runner
npx tsx src/runner.ts --scenario=file-editing --model=opus --runner=goose
# Control repetition count
npx tsx src/runner.ts --run-count=5
# Don't auto-open browser
npx tsx src/runner.ts --no-open
Output
report.html— Live-updating HTML matrix showing pass/fail status, duration, and validation detailslogs/— Full agent output logs for each run