goose/scripts
Jack Amadeo c6755d3259
Port provider tests to typescript (#8237)
Signed-off-by: Douwe Osinga <douwe@squareup.com>
Co-authored-by: Douwe Osinga <douwe@squareup.com>
2026-04-24 17:31:27 +00:00
bench-postprocess-scripts Spelling (#7137) 2026-02-11 14:35:24 +00:00
provider-error-proxy chore(deps): bump aiohttp from 3.13.3 to 3.13.4 in /scripts/provider-error-proxy (#8245) 2026-04-02 00:30:50 +00:00
test-subrecipes-examples feat: replace subagent and skills with unified summon extension (#6964) 2026-02-10 19:13:38 +00:00
build-windows.ps1 fix: VMware Tanzu Platform provider - bug fixes, streaming, UI improvements (#8126) 2026-03-26 18:16:01 +00:00
check-openapi-schema.sh bump openapi version directly (#5674) 2025-11-11 10:15:42 -05:00
clean-gh-pages.sh Clean PR preview sites from gh-pages branch history (#6161) 2025-12-18 16:22:57 -05:00
diagnostics-viewer.py Diagnostic files copying (#7209) 2026-02-13 13:50:42 +00:00
goose-db-helper.sh (re)Standardize Session Name Attribute (#5279) 2025-10-24 13:34:08 -04:00
parse-benchmark-results.sh feat: goose bench framework for functional and regression testing 2025-03-05 21:23:00 -05:00
pre-release.sh gh fall back (#7695) 2026-03-06 16:21:30 +00:00
README.md Remove deprecated Claude 3.5 models (#4590) 2025-09-10 14:41:02 -05:00
run-benchmarks.sh Remove deprecated Claude 3.5 models (#4590) 2025-09-10 14:41:02 -05:00
test_compaction.sh Flip on developer extension in compaction smoke test (#7514) 2026-02-25 21:27:26 +00:00
test_mcp.sh nit: show dir in title, and less... jank (#7138) 2026-02-13 04:16:46 +00:00
test_subrecipes.sh nit: show dir in title, and less... jank (#7138) 2026-02-13 04:16:46 +00:00

Goose Benchmark Scripts

This directory contains scripts for running and analyzing Goose benchmarks.

run-benchmarks.sh

This script runs Goose benchmarks across multiple provider:model pairs and analyzes the results.

Prerequisites

  • Goose CLI must be built or installed
  • jq command-line tool for JSON processing (optional, but recommended for result analysis)
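These prerequisites can be checked before a run with a small sketch like the one below. The helper name `check_tool` is hypothetical and not part of this directory. It relies only on the shell builtin `command -v`, and it assumes the CLI binary is named `goose` and is on `PATH`, which may not hold for a local build.

```shell
# Hypothetical helper (not in the repository): report whether a tool is on PATH.
check_tool() {
  local tool="$1"
  if command -v "$tool" >/dev/null 2>&1; then
    echo "found: $tool"
  else
    echo "missing: $tool"
    return 1
  fi
}

# Assumption: the Goose CLI is installed under the name "goose".
check_tool goose || echo "build or install the Goose CLI first"
# jq is optional, so a missing jq is only worth a note.
check_tool jq || echo "jq is optional, but recommended for result analysis"
```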

Usage

./scripts/run-benchmarks.sh [options]

Options

  • -p, --provider-models: Comma-separated list of provider:model pairs (e.g., 'openai:gpt-4o,anthropic:claude-sonnet-4')
  • -s, --suites: Comma-separated list of benchmark suites to run (e.g., 'core,small_models')
  • -o, --output-dir: Directory to store benchmark results (default: './benchmark-results')
  • -d, --debug: Use debug build instead of release build
  • -h, --help: Show help message

Examples

# Run with release build (default)
./scripts/run-benchmarks.sh --provider-models 'openai:gpt-4o,anthropic:claude-sonnet-4' --suites 'core,small_models'

# Run with debug build
./scripts/run-benchmarks.sh --provider-models 'openai:gpt-4o' --suites 'core' --debug

How It Works

The script:

  1. Parses the provider:model pairs and benchmark suites
  2. Determines whether to use the debug or release binary
  3. For each provider:model pair:
    • Sets the GOOSE_PROVIDER and GOOSE_MODEL environment variables
    • Runs the benchmark with the specified suites
    • Analyzes the results for failures
  4. Generates a summary of all benchmark runs
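The per-pair loop in step 3 can be sketched roughly as follows. This is an illustration of the described behavior, not the actual contents of run-benchmarks.sh: `run_all_pairs` and `run_one_benchmark` are hypothetical names, and the placeholder only echoes the environment instead of invoking Goose.

```shell
# Hypothetical placeholder for the real benchmark invocation.
run_one_benchmark() {
  echo "provider=$GOOSE_PROVIDER model=$GOOSE_MODEL suites=$1"
}

# Sketch: split 'provider:model,provider:model' and run each pair in turn.
run_all_pairs() {
  local provider_models="$1" suites="$2" pair
  # Turn the comma-separated list into one pair per line.
  printf '%s\n' "$provider_models" | tr ',' '\n' | while IFS= read -r pair; do
    # Text before the first ':' is the provider, the rest is the model.
    export GOOSE_PROVIDER="${pair%%:*}"
    export GOOSE_MODEL="${pair#*:}"
    # Redirect stdin so the command cannot consume the remaining pairs.
    run_one_benchmark "$suites" </dev/null
  done
}

run_all_pairs 'openai:gpt-4o,anthropic:claude-sonnet-4' 'core,small_models'
```

Splitting on the first colon only means a model name that itself contains a colon would stay intact in `GOOSE_MODEL`.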

Output

The script creates the following files in the output directory:

  • summary.md: A summary of all benchmark results
  • {provider}-{model}.json: Raw JSON output from each benchmark run
  • {provider}-{model}-analysis.txt: Analysis of each benchmark run

Exit Codes

  • 0: All benchmarks completed successfully
  • 1: One or more benchmarks failed

parse-benchmark-results.sh

This script analyzes a single benchmark JSON result file and identifies any failures.

Usage

./scripts/parse-benchmark-results.sh path/to/benchmark-results.json

Output

The script outputs an analysis of the benchmark results to stdout, including:

  • Basic information about the benchmark run
  • Results for each evaluation in each suite
  • Summary of passed and failed metrics

Exit Codes

  • 0: All metrics passed successfully
  • 1: One or more metrics failed
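Because the exit code signals failure, the script can be applied to a whole results directory. The sketch below assumes each `.json` file in that directory is a benchmark result. `count_failed_results` is a hypothetical helper, and it treats any non-zero exit code as a failed run, including errors other than failed metrics.

```shell
# Hypothetical wrapper: run a parser command on every .json file in a
# directory and print how many files it reported as failed (non-zero exit).
count_failed_results() {
  local results_dir="$1"
  shift
  local failed=0 file
  for file in "$results_dir"/*.json; do
    # If no JSON files exist, the glob stays literal; skip it.
    [ -e "$file" ] || continue
    if ! "$@" "$file" >/dev/null; then
      failed=$((failed + 1))
    fi
  done
  echo "$failed"
}

# Usage, assuming the default output directory of run-benchmarks.sh:
# count_failed_results ./benchmark-results ./scripts/parse-benchmark-results.sh
```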