mirror of
https://github.com/block/goose.git
synced 2026-05-02 21:40:58 +00:00
Signed-off-by: Michael Neale <michael.neale@gmail.com> Co-authored-by: Michael Neale <michael.neale@gmail.com> continuing migration to aaif
# Open Model Gym

Run agent tests across a matrix of **models × runners × scenarios**.

It isn't hard for any agent to do OK with Opus, but let's scale things in the other direction: what do we have to break things down to?

<img width="1768" height="1133" alt="image" src="https://github.com/user-attachments/assets/29915659-ee6b-4a8b-ba5e-58420b168b43" />

## Quick Start

```bash
just install    # one-time setup
just run        # run full matrix (3 reps each)
just report     # view results
```

## How It Works

The test harness runs every combination of models, runners, and scenarios defined in your matrix. Each test runs multiple times (default 3) and keeps the **worst result**: if a test fails even once, it's marked failed. This catches flaky passes.

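The worst-of-N rule can be sketched roughly as follows. This is an illustration, not the harness's actual code: the `RunResult` shape and the `worstResult` helper are hypothetical, and breaking ties by keeping the slower run is an assumption.

```typescript
// Hypothetical result shape; the real harness may record more fields.
type Status = "pass" | "fail";

interface RunResult {
  status: Status;
  durationMs: number;
}

// Reduce several repetitions of the same test to a single result.
// Any failure beats any pass, so one flaky failure marks the cell failed.
function worstResult(results: RunResult[]): RunResult {
  if (results.length === 0) {
    throw new Error("need at least one run");
  }
  return results.reduce((worst, current) => {
    // A failure always replaces a pass.
    if (current.status === "fail" && worst.status === "pass") return current;
    // Assumption: among runs with the same status, keep the slower one.
    if (current.status === worst.status && current.durationMs > worst.durationMs) {
      return current;
    }
    return worst;
  });
}
```

With three repetitions where only the second one fails, `worstResult` returns that failing run, so the matrix cell is reported as failed.
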
## Configuration

Edit `config.yaml` to define your test matrix:

### Models

LLMs to test against. Supports any provider (Anthropic, OpenAI, Ollama, etc.):

```yaml
models:
  - name: opus
    provider: anthropic
    model: claude-opus-4-5-20251101

  - name: qwen3-coder
    provider: ollama
    model: qwen3-coder:64k

  - name: gpt4
    provider: openai
    model: gpt-4-turbo
```

### Runners

Agent frameworks that execute the tests. Each runner has its own binary, type, and configuration:

```yaml
runners:
  # Goose agent with extensions
  - name: goose-full
    type: goose
    bin: goose  # path to binary (can be absolute)
    extensions: [developer, todo, skills]
    stdio:
      - node mcp-harness/dist/index.js

  # OpenCode agent
  - name: opencode
    type: opencode
    bin: opencode  # path to binary
    stdio:
      - node mcp-harness/dist/index.js

  # Custom goose binary path
  - name: goose-dev
    type: goose
    bin: /path/to/my/goose-dev
    extensions: [developer]
```

**Supported runner types:**
- `goose` — [Goose](https://github.com/aaif-goose/goose) agent framework
- `opencode` — [OpenCode](https://opencode.ai) agent framework
- `pi` — [Pi](https://github.com/badlogic/pi-mono) coding agent

## Runner Details

Each runner has different setup requirements, MCP integration methods, and session handling.

### Goose

[Goose](https://github.com/aaif-goose/goose) is an open-source coding agent with built-in MCP support.

**Setup:** Install via `brew install goose` or from source.

**MCP Integration:** Native support. The harness writes a `config.yaml` to an isolated `.goose-root/` directory with extensions and MCP servers:

```yaml
extensions:
  developer:
    enabled: true
  mcp_harness:
    type: stdio
    enabled: true
    cmd: node
    args: [mcp-harness/dist/index.js]
```

**Session Handling:** Uses `--name <session>` for named sessions, `--resume` to continue:
- Turn 1: `goose run -i <prompt> --name <session>`
- Turn 2+: `goose run -i <prompt> --name <session> --resume`
- Single-turn: `goose run -i <prompt> --no-session`

### OpenCode

[OpenCode](https://opencode.ai) is a terminal-based coding agent.

**Setup:** Install via their website or package manager.

**MCP Integration:** Native support. The harness writes an `opencode.json` config to the workdir:

```json
{
  "mcp": {
    "harness": {
      "type": "local",
      "command": ["node", "mcp-harness/dist/index.js"],
      "enabled": true
    }
  },
  "model": "anthropic/claude-opus-4-5-20251101"
}
```

**Session Handling:** Uses `--continue` to resume the last session in the working directory:
- Turn 1: `opencode run "<prompt>"`
- Turn 2+: `opencode run --continue "<prompt>"`

⚠️ OpenCode doesn't support named sessions, so multi-turn scenarios exclude it.

### Pi

[Pi](https://github.com/badlogic/pi-mono) is a lightweight coding agent that requires an adapter for MCP support.

**Setup:**
```bash
# Install Pi
npm install -g @mariozechner/pi-coding-agent  # or from source

# Install the MCP adapter (required for MCP tools)
pi install npm:pi-mcp-adapter
```

The `just install` recipe auto-installs pi-mcp-adapter if missing.

**MCP Integration:** Via [pi-mcp-adapter](https://github.com/nicobailon/pi-mcp-adapter). The harness dynamically writes a `.pi-mcp.json` config to the workdir:

```json
{
  "mcpServers": {
    "harness": {
      "command": "node",
      "args": ["mcp-harness/dist/index.js"],
      "lifecycle": "eager",
      "env": { "MCP_HARNESS_LOG": "<workdir>/tool-calls.log" }
    }
  },
  "settings": { "directTools": true }
}
```

Key settings:
- `directTools: true` — Registers MCP tools directly in Pi's tool list (no wrapper)
- `lifecycle: "eager"` — Connects to MCP servers at startup

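As a rough sketch, the `.pi-mcp.json` content shown above could be produced per workdir like this. The `buildPiMcpConfig` function and its parameters are hypothetical names for illustration, not the harness's real API, and the log path is joined with a plain `/` for simplicity.

```typescript
// Shape of the config shown above; only the fields from the README are modeled.
interface PiMcpConfig {
  mcpServers: Record<
    string,
    { command: string; args: string[]; lifecycle: string; env: Record<string, string> }
  >;
  settings: { directTools: boolean };
}

// Hypothetical helper: build the per-workdir adapter config as a JSON string.
function buildPiMcpConfig(
  workdir: string,
  harnessEntry = "mcp-harness/dist/index.js",
): string {
  const config: PiMcpConfig = {
    mcpServers: {
      harness: {
        command: "node",
        args: [harnessEntry],
        lifecycle: "eager", // connect at startup
        // Each workdir gets its own tool-call log for later validation.
        env: { MCP_HARNESS_LOG: `${workdir}/tool-calls.log` },
      },
    },
    settings: { directTools: true }, // expose MCP tools without a wrapper
  };
  return JSON.stringify(config, null, 2);
}
```

The returned string would then be written to `.pi-mcp.json` inside the workdir before Pi is launched.
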
**Model Configuration:** Pi requires custom models (like Ollama) to be defined in `models.json`. The harness automatically generates this config in an isolated `.pi-root/` directory and sets `PI_CODING_AGENT_DIR` to use it:

```json
{
  "providers": {
    "ollama": {
      "baseUrl": "http://localhost:11434/v1",
      "api": "openai-completions",
      "apiKey": "ollama",
      "models": [{ "id": "model-name", "name": "Model Name", ... }]
    }
  }
}
```

The harness copies `auth.json` from your real Pi config (`~/.pi/agent/`) so API keys work.

**Session Handling:** Uses `--session <path>` for file-based sessions, `--continue` to resume:
- Turn 1: `pi -p --session <path> "<prompt>"`
- Turn 2+: `pi -p --continue --session <path> "<prompt>"`
- Single-turn: `pi -p --no-session "<prompt>"`

The `-p` flag runs Pi in non-interactive "print" mode for automation.

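The session flags listed for the three runners can be summarized in one place. The sketch below is hypothetical (the `Turn` type and `buildArgs` function are illustrative names, not the harness's code) and only mirrors the command lines documented above.

```typescript
type RunnerType = "goose" | "opencode" | "pi";

interface Turn {
  prompt: string;     // passed exactly as the documented commands show
  turn: number;       // 1-based turn index
  multiTurn: boolean; // false for single-turn scenarios
  session: string;    // session name (Goose) or session file path (Pi)
}

// Build the argument list for one turn, following the documented commands.
function buildArgs(runner: RunnerType, t: Turn): string[] {
  const resuming = t.multiTurn && t.turn > 1;
  switch (runner) {
    case "goose": {
      const base = ["run", "-i", t.prompt];
      if (!t.multiTurn) return [...base, "--no-session"];
      return resuming
        ? [...base, "--name", t.session, "--resume"]
        : [...base, "--name", t.session];
    }
    case "opencode":
      // No named sessions: --continue resumes the last session in the workdir.
      return resuming ? ["run", "--continue", t.prompt] : ["run", t.prompt];
    case "pi": {
      if (!t.multiTurn) return ["-p", "--no-session", t.prompt];
      return resuming
        ? ["-p", "--continue", "--session", t.session, t.prompt]
        : ["-p", "--session", t.session, t.prompt];
    }
  }
}
```

Returning an argument array rather than a shell string avoids quoting problems when the prompt contains spaces or quotes.
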
### Matrix

The `matrix` section of `config.yaml` defines which scenarios run against which models and runners:

```yaml
matrix:
  - scenario: file-editing
    models: [opus, qwen3-coder]      # omit to run all models
    runners: [goose-full, opencode]  # omit to run all runners

  - scenario: everyday-app-automation
    # runs against ALL models and ALL runners
```

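To make the "omit to run all" rule concrete, here is a minimal sketch of how a matrix could be expanded into individual test cells. The `MatrixEntry` and `Cell` types and the `expandMatrix` function are hypothetical and only illustrate the rule described above.

```typescript
interface MatrixEntry {
  scenario: string;
  models?: string[];  // omitted means all configured models
  runners?: string[]; // omitted means all configured runners
}

interface Cell {
  scenario: string;
  model: string;
  runner: string;
}

// Expand each matrix entry into the cross product of its models and runners.
function expandMatrix(
  matrix: MatrixEntry[],
  allModels: string[],
  allRunners: string[],
): Cell[] {
  const cells: Cell[] = [];
  for (const entry of matrix) {
    const models = entry.models ?? allModels;
    const runners = entry.runners ?? allRunners;
    for (const model of models) {
      for (const runner of runners) {
        cells.push({ scenario: entry.scenario, model, runner });
      }
    }
  }
  return cells;
}
```

With the three models and three runners from the configuration examples, the matrix above would expand to 2 × 2 = 4 cells for `file-editing` plus 3 × 3 = 9 cells for `everyday-app-automation`, 13 in total.
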
## Scenarios

Scenarios live in `suite/scenarios/` as YAML files:

```yaml
name: file-editing
description: Create and edit files
prompt: |
  1. Create joke.md containing a short joke
  2. Edit hello.rs to add a debug function

setup:
  hello.rs: |
    fn main() { println!("Hello!"); }

validate:
  - type: file_exists
    path: joke.md
  - type: file_matches
    path: hello.rs
    regex: "fn\\s+debug"
```

### Validation Rules

| Rule | Description |
|------|-------------|
| `file_exists` | File exists at path |
| `file_not_empty` | File exists and has content |
| `file_contains` | File contains literal string |
| `file_matches` | File matches regex pattern |
| `command_succeeds` | Shell command exits 0 |
| `tool_called` | MCP tool was called with matching args (regex supported) |

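The file-based rules in the table could be evaluated along these lines. This is a hedged sketch: it checks an in-memory `Map` of path to content as a stand-in for reading the workdir, the `FileRule` type and `checkRule` function are hypothetical, and `file_contains` is left out because its field name is not shown here.

```typescript
// Rule shapes follow the scenario example (type, path, regex).
type FileRule =
  | { type: "file_exists"; path: string }
  | { type: "file_not_empty"; path: string }
  | { type: "file_matches"; path: string; regex: string };

// Stand-in for the workdir: file path mapped to file content.
type Files = Map<string, string>;

function checkRule(rule: FileRule, files: Files): boolean {
  const content = files.get(rule.path); // undefined if the file is missing
  switch (rule.type) {
    case "file_exists":
      return content !== undefined;
    case "file_not_empty":
      return content !== undefined && content.length > 0;
    case "file_matches":
      return content !== undefined && new RegExp(rule.regex).test(content);
  }
}

// Example: the two checks from the file-editing scenario.
const workdir: Files = new Map([
  ["joke.md", "Why did the goose cross the road?"],
  ["hello.rs", "fn main() { debug(); }\nfn debug() { println!(\"dbg\"); }"],
]);
const passed =
  checkRule({ type: "file_exists", path: "joke.md" }, workdir) &&
  checkRule({ type: "file_matches", path: "hello.rs", regex: "fn\\s+debug" }, workdir);
```

A scenario passes only if every rule in its `validate` list holds, so the results are combined with a logical AND.
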
**Tool call validation example:**
```yaml
validate:
  - type: tool_called
    tool: slack_search_messages
    args:
      query: /quarterly.?review/  # regex pattern
  - type: tool_called
    tool: jira_create_issue
    args:
      summary: /Q1.*Review/
      description: /David Brown/
```

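One plausible way to evaluate a `tool_called` rule is sketched below. It assumes the logged calls have already been parsed into objects (the log format is not described here), that an expected value wrapped in slashes is a regex, and that anything else is compared literally; `ToolCall`, `ToolCalledRule`, `matchesArg`, and `toolCalled` are all hypothetical names.

```typescript
interface ToolCall {
  tool: string;
  args: Record<string, unknown>;
}

interface ToolCalledRule {
  type: "tool_called";
  tool: string;
  args?: Record<string, string>;
}

// Compare one expected value against the actual argument.
// "/pattern/" is treated as a regex; other strings must match exactly.
function matchesArg(expected: string, actual: unknown): boolean {
  if (actual === undefined) return false;
  const text = typeof actual === "string" ? actual : JSON.stringify(actual);
  const regexBody = /^\/(.+)\/$/.exec(expected);
  return regexBody ? new RegExp(regexBody[1]).test(text) : text === expected;
}

// The rule passes if at least one logged call matches the tool and every expected arg.
function toolCalled(rule: ToolCalledRule, calls: ToolCall[]): boolean {
  const expectedArgs = rule.args ?? {};
  return calls.some(
    (call) =>
      call.tool === rule.tool &&
      Object.keys(expectedArgs).every((key) =>
        matchesArg(expectedArgs[key], call.args[key]),
      ),
  );
}
```

Under these assumptions, a logged `slack_search_messages` call with `query: "quarterly review notes"` satisfies the first rule in the example, while a `jira_create_issue` rule fails if no such call was logged.
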
## MCP Harness

Mock MCP server providing simulated tools for testing agent tool-use without hitting real APIs.

```bash
cd mcp-harness && npm install && npm run build
```

**Available tools:** gdrive, sheets, salesforce, slack, calendar, gmail, jira, github

Each tool returns realistic mock data. Tool calls are logged to `tool-calls.log` in the workdir for validation.

## Commands

| Command | Description |
|---------|-------------|
| `just run` | Full test run (3 reps each, worst kept) |
| `just test` | Quick run (1 rep each) |
| `just scenario <name>` | Run specific scenario |
| `just agent <name>` | Run specific agent |
| `just report` | Open HTML results |

### CLI Flags

```bash
# Filter by scenario, model, or runner
npx tsx src/runner.ts --scenario=file-editing --model=opus --runner=goose

# Control repetition count
npx tsx src/runner.ts --run-count=5

# Don't auto-open browser
npx tsx src/runner.ts --no-open
```

## Output

- `report.html` — Live-updating HTML matrix showing pass/fail status, duration, and validation details
- `logs/` — Full agent output logs for each run