goose/evals/open-model-gym/README.md
Jack Amadeo 583acd4335
chore(aaif): rename a bunch of repository references (#8152)
Signed-off-by: Michael Neale <michael.neale@gmail.com>
Co-authored-by: Michael Neale <michael.neale@gmail.com>

continuing migration to aaif
2026-04-07 15:34:48 +10:00

291 lines
7.7 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# Open Model Gym
Run agent tests across a matrix of **models × runners × scenarios**.
It isn't hard for any agent to do ok with opus, but lets scale things in the other direction. What do we have to break things down to.
<img width="1768" height="1133" alt="image" src="https://github.com/user-attachments/assets/29915659-ee6b-4a8b-ba5e-58420b168b43" />
## Quick Start
```bash
just install # one-time setup
just run # run full matrix (3 reps each)
just report # view results
```
## How It Works
The test harness runs every combination of models, runners, and scenarios defined in your matrix. Each test runs multiple times (default 3) and keeps the **worst result** — if a test fails even once, it's marked failed. This catches flaky passes.
## Configuration
Edit `config.yaml` to define your test matrix:
### Models
LLMs to test against. Supports any provider (Anthropic, OpenAI, Ollama, etc.):
```yaml
models:
- name: opus
provider: anthropic
model: claude-opus-4-5-20251101
- name: qwen3-coder
provider: ollama
model: qwen3-coder:64k
- name: gpt4
provider: openai
model: gpt-4-turbo
```
### Runners
Agent frameworks that execute the tests. Each runner has its own binary, type, and configuration:
```yaml
runners:
# Goose agent with extensions
- name: goose-full
type: goose
bin: goose # path to binary (can be absolute)
extensions: [developer, todo, skills]
stdio:
- node mcp-harness/dist/index.js
# OpenCode agent
- name: opencode
type: opencode
bin: opencode # path to binary
stdio:
- node mcp-harness/dist/index.js
# Custom goose binary path
- name: goose-dev
type: goose
bin: /path/to/my/goose-dev
extensions: [developer]
```
**Supported runner types:**
- `goose` — [Goose](https://github.com/aaif-goose/goose) agent framework
- `opencode` — [OpenCode](https://opencode.ai) agent framework
- `pi` — [Pi](https://github.com/badlogic/pi-mono) coding agent
## Runner Details
Each runner has different setup requirements, MCP integration methods, and session handling.
### Goose
[Goose](https://github.com/aaif-goose/goose) is an open-source coding agent with built-in MCP support.
**Setup:** Install via `brew install goose` or from source.
**MCP Integration:** Native support. The harness writes a `config.yaml` to an isolated `.goose-root/` directory with extensions and MCP servers:
```yaml
extensions:
developer:
enabled: true
mcp_harness:
type: stdio
enabled: true
cmd: node
args: [mcp-harness/dist/index.js]
```
**Session Handling:** Uses `--name <session>` for named sessions, `--resume` to continue:
- Turn 1: `goose run -i <prompt> --name <session>`
- Turn 2+: `goose run -i <prompt> --name <session> --resume`
- Single-turn: `goose run -i <prompt> --no-session`
### OpenCode
[OpenCode](https://opencode.ai) is a terminal-based coding agent.
**Setup:** Install via their website or package manager.
**MCP Integration:** Native support. The harness writes an `opencode.json` config to the workdir:
```json
{
"mcp": {
"harness": {
"type": "local",
"command": ["node", "mcp-harness/dist/index.js"],
"enabled": true
}
},
"model": "anthropic/claude-opus-4-5-20251101"
}
```
**Session Handling:** Uses `--continue` to resume the last session in the working directory:
- Turn 1: `opencode run "<prompt>"`
- Turn 2+: `opencode run --continue "<prompt>"`
⚠️ OpenCode doesn't support named sessions, so multi-turn scenarios exclude it.
### Pi
[Pi](https://github.com/badlogic/pi-mono) is a lightweight coding agent that requires an adapter for MCP support.
**Setup:**
```bash
# Install Pi
npm install -g @anthropic/pi # or from source
# Install the MCP adapter (required for MCP tools)
pi install npm:pi-mcp-adapter
```
The `just install` recipe auto-installs pi-mcp-adapter if missing.
**MCP Integration:** Via [pi-mcp-adapter](https://github.com/nicobailon/pi-mcp-adapter). The harness dynamically writes a `.pi-mcp.json` config to the workdir:
```json
{
"mcpServers": {
"harness": {
"command": "node",
"args": ["mcp-harness/dist/index.js"],
"lifecycle": "eager",
"env": { "MCP_HARNESS_LOG": "<workdir>/tool-calls.log" }
}
},
"settings": { "directTools": true }
}
```
Key settings:
- `directTools: true` — Registers MCP tools directly in Pi's tool list (no wrapper)
- `lifecycle: "eager"` — Connects to MCP servers at startup
**Model Configuration:** Pi requires custom models (like Ollama) to be defined in `models.json`. The harness automatically generates this config in an isolated `.pi-root/` directory and sets `PI_CODING_AGENT_DIR` to use it:
```json
{
"providers": {
"ollama": {
"baseUrl": "http://localhost:11434/v1",
"api": "openai-completions",
"apiKey": "ollama",
"models": [{ "id": "model-name", "name": "Model Name", ... }]
}
}
}
```
The harness copies `auth.json` from your real Pi config (`~/.pi/agent/`) so API keys work.
**Session Handling:** Uses `--session <path>` for file-based sessions, `--continue` to resume:
- Turn 1: `pi -p --session <path> "<prompt>"`
- Turn 2+: `pi -p --continue --session <path> "<prompt>"`
- Single-turn: `pi -p --no-session "<prompt>"`
The `-p` flag runs Pi in non-interactive "print" mode for automation
### Matrix
Define which scenarios run against which models/runners:
```yaml
matrix:
- scenario: file-editing
models: [opus, qwen3-coder] # omit to run all models
runners: [goose-full, opencode] # omit to run all runners
- scenario: everyday-app-automation
# runs against ALL models and ALL runners
```
## Scenarios
Scenarios live in `suite/scenarios/` as YAML files:
```yaml
name: file-editing
description: Create and edit files
prompt: |
1. Create joke.md containing a short joke
2. Edit hello.rs to add a debug function
setup:
hello.rs: |
fn main() { println!("Hello!"); }
validate:
- type: file_exists
path: joke.md
- type: file_matches
path: hello.rs
regex: "fn\\s+debug"
```
### Validation Rules
| Rule | Description |
|------|-------------|
| `file_exists` | File exists at path |
| `file_not_empty` | File exists and has content |
| `file_contains` | File contains literal string |
| `file_matches` | File matches regex pattern |
| `command_succeeds` | Shell command exits 0 |
| `tool_called` | MCP tool was called with matching args (regex supported) |
**Tool call validation example:**
```yaml
validate:
- type: tool_called
tool: slack_search_messages
args:
query: /quarterly.?review/ # regex pattern
- type: tool_called
tool: jira_create_issue
args:
summary: /Q1.*Review/
description: /David Brown/
```
## MCP Harness
Mock MCP server providing simulated tools for testing agent tool-use without hitting real APIs.
```bash
cd mcp-harness && npm install && npm run build
```
**Available tools:** gdrive, sheets, salesforce, slack, calendar, gmail, jira, github
Each tool returns realistic mock data. Tool calls are logged to `tool-calls.log` in the workdir for validation.
## Commands
| Command | Description |
|---------|-------------|
| `just run` | Full test run (3 reps each, worst kept) |
| `just test` | Quick run (1 rep each) |
| `just scenario <name>` | Run specific scenario |
| `just agent <name>` | Run specific agent |
| `just report` | Open HTML results |
### CLI Flags
```bash
# Filter by scenario, model, or runner
npx tsx src/runner.ts --scenario=file-editing --model=opus --runner=goose
# Control repetition count
npx tsx src/runner.ts --run-count=5
# Don't auto-open browser
npx tsx src/runner.ts --no-open
```
## Output
- `report.html` — Live-updating HTML matrix showing pass/fail status, duration, and validation details
- `logs/` — Full agent output logs for each run