Mirror of https://github.com/block/goose.git (synced 2026-04-26 10:40:45 +00:00)
Directory contents:
- bench-postprocess-scripts/
- provider-error-proxy/
- test-subrecipes-examples/
- build-windows.ps1
- check-openapi-schema.sh
- clean-gh-pages.sh
- diagnostics-viewer.py
- goose-db-helper.sh
- parse-benchmark-results.sh
- pre-release.sh
- README.md
- run-benchmarks.sh
- test_compaction.sh
- test_mcp.sh
- test_subrecipes.sh
Goose Benchmark Scripts
This directory contains scripts for running and analyzing Goose benchmarks.
run-benchmarks.sh
This script runs Goose benchmarks across multiple provider:model pairs and analyzes the results.
Prerequisites
- Goose CLI must be built or installed
- jq: command-line tool for JSON processing (optional, but recommended for result analysis)
Usage
./scripts/run-benchmarks.sh [options]
Options
- -p, --provider-models: Comma-separated list of provider:model pairs (e.g., 'openai:gpt-4o,anthropic:claude-sonnet-4')
- -s, --suites: Comma-separated list of benchmark suites to run (e.g., 'core,small_models')
- -o, --output-dir: Directory to store benchmark results (default: './benchmark-results')
- -d, --debug: Use debug build instead of release build
- -h, --help: Show help message
Examples
# Run with release build (default)
./scripts/run-benchmarks.sh --provider-models 'openai:gpt-4o,anthropic:claude-sonnet-4' --suites 'core,small_models'
# Run with debug build
./scripts/run-benchmarks.sh --provider-models 'openai:gpt-4o' --suites 'core' --debug
How It Works
The script:
- Parses the provider:model pairs and benchmark suites
- Determines whether to use the debug or release binary
- For each provider:model pair:
  - Sets the GOOSE_PROVIDER and GOOSE_MODEL environment variables
  - Runs the benchmark with the specified suites
  - Analyzes the results for failures
- Generates a summary of all benchmark runs
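The per-pair loop described above can be sketched roughly as follows. This is a minimal illustration, not the script's actual internals: the variable names are made up, and run_goose_bench is a stub standing in for the real goose binary.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical inputs; the real script parses these from its command-line options.
PROVIDER_MODELS='openai:gpt-4o,anthropic:claude-sonnet-4'
OUTPUT_DIR='./benchmark-results'
mkdir -p "$OUTPUT_DIR"

# Stub standing in for the goose binary so this sketch is self-contained.
run_goose_bench() {
  printf '{"provider": "%s", "model": "%s"}\n' "$GOOSE_PROVIDER" "$GOOSE_MODEL"
}

# Split the comma-separated pairs, then split each pair on the first colon.
IFS=',' read -ra pairs <<< "$PROVIDER_MODELS"
for pair in "${pairs[@]}"; do
  provider="${pair%%:*}"
  model="${pair#*:}"
  # The real script sets these so goose picks up the provider and model.
  export GOOSE_PROVIDER="$provider" GOOSE_MODEL="$model"
  run_goose_bench > "$OUTPUT_DIR/${provider}-${model}.json"
done
ls "$OUTPUT_DIR"
```

The analysis and summary steps are omitted here; the sketch only shows how one raw JSON file per provider:model pair ends up in the output directory.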
Output
The script creates the following files in the output directory:
- summary.md: A summary of all benchmark results
- {provider}-{model}.json: Raw JSON output from each benchmark run
- {provider}-{model}-analysis.txt: Analysis of each benchmark run
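The raw JSON files can be inspected directly with jq. The JSON shape below is a made-up illustration, not the actual schema the benchmark emits:

```shell
# Hypothetical result file; the real schema comes from the goose bench run.
cat > openai-gpt-4o.json <<'EOF'
{"provider": "openai", "model": "gpt-4o", "suites": ["core"]}
EOF

# Print provider:model using jq's string interpolation (-r drops the quotes).
jq -r '"\(.provider):\(.model)"' openai-gpt-4o.json
```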
Exit Codes
- 0: All benchmarks completed successfully
- 1: One or more benchmarks failed
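A caller such as a CI job can branch on that exit code. In this sketch, run_benchmarks is a stub standing in for ./scripts/run-benchmarks.sh so the example is self-contained:

```shell
#!/usr/bin/env bash

# Stub for ./scripts/run-benchmarks.sh; pretend one benchmark failed.
run_benchmarks() { return 1; }

if run_benchmarks; then
  echo "all benchmarks passed"
else
  status=$?   # exit code of run_benchmarks, per the contract above
  echo "benchmarks failed (exit $status)"
fi
```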
parse-benchmark-results.sh
This script analyzes a single benchmark JSON result file and identifies any failures.
Usage
./scripts/parse-benchmark-results.sh path/to/benchmark-results.json
Output
The script outputs an analysis of the benchmark results to stdout, including:
- Basic information about the benchmark run
- Results for each evaluation in each suite
- Summary of passed and failed metrics
Exit Codes
- 0: All metrics passed successfully
- 1: One or more metrics failed