The cursor installer changed its binary install location from
~/.cursor/bin/agent to ~/.local/bin/agent (as of 2026-03-25 release).
Updates:
- agent-setup.ts: fix PATH in install, launchCmd, updateCmd, and
the pathScript written to ~/.bashrc/~/.zshrc
- verify.sh: fix E2E binary check to look in ~/.local/bin first
- Bump CLI to 0.27.3
-- qa/e2e-tester
Co-authored-by: spawn-qa-bot <qa@openrouter.ai>
Adds cursor.Dockerfile and includes cursor in the docker.yml matrix
so nightly builds produce ghcr.io/openrouterteam/spawn-cursor:latest.
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When AGENT_TIMEOUT_hermes is non-numeric, get_agent_timeout() skips the
env var and uses the built-in _AGENT_TIMEOUT_hermes=3600, NOT the global
AGENT_TIMEOUT=1800. The test expected ${AGENT_TIMEOUT} (1800) but the
function correctly returns 3600 (hermes built-in default). This test was
failing silently, masking the correct behavior.
Also filed OpenRouterTeam/spawn#3042 for cursor missing from e2e framework.
Co-authored-by: spawn-qa-bot <qa@openrouter.ai>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: L <6723574+louisgv@users.noreply.github.com>
Add cursor to ALL_AGENTS, verify_cursor, input_test_cursor, and their
dispatch cases so e2e sweeps cover the cursor agent.
Fixes#3042
Agent: issue-fixer
Co-authored-by: B <6723574+louisgv@users.noreply.github.com>
Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
Replace StrictHostKeyChecking=no with accept-new across all E2E cloud
drivers (aws, gcp, digitalocean, hetzner), the shared SSH_BASE_OPTS
constant, and pull-history.ts. accept-new trusts new hosts on first
connection (needed for freshly provisioned VMs) but verifies on
subsequent connections, preventing MITM attacks on reconnect.
Fixes#3031
Agent: style-reviewer
Co-authored-by: B <6723574+louisgv@users.noreply.github.com>
Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
* fix(e2e): ensure agent binary available after spawnrc fallback
When the provision timeout kills the CLI before agent install completes
(common in --fast mode on Sprite), the manual .spawnrc fallback creates
credentials but does not verify the agent binary is present. This causes
"openclaw not found" failures in E2E verification.
Add _ensure_agent_binary() that runs after the manual .spawnrc fallback:
1. Checks if the agent binary exists on the remote VM
2. If missing, runs the agent's install command directly
3. Verifies the binary is available after install
Also adds cursor agent to the env vars fallback and binary check.
Fixes#3028
Agent: ux-engineer
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
* fix(security): add --proto '=https' to cursor install curl command
Agent: pr-maintainer
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
---------
Co-authored-by: B <6723574+louisgv@users.noreply.github.com>
Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
Before this change, gh auth login wrote the token file with default
permissions, and chmod 600 was applied afterward — leaving a window
where the file could be read by other users on multi-user systems.
Now the credential directory is created with 700 permissions and umask
is set to 077 before the write, so the token file is created with
restrictive permissions from the start.
Agent: complexity-hunter
Fixes#3030
Co-authored-by: B <6723574+louisgv@users.noreply.github.com>
Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
Replace shell interpolation of base64-encoded commands in SSH invocations
with stdin piping. Previously the encoded command was interpolated into the
remote shell string; now it is passed via stdin to `base64 -d | bash`,
making the approach structurally immune to command injection regardless
of the encoded content.
Fixes#3029Fixes#3022
Agent: code-health
Co-authored-by: B <6723574+louisgv@users.noreply.github.com>
Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
* feat: add Cursor CLI agent across all clouds
Adds Cursor's terminal-based AI coding agent (the `agent` command from
cursor.com/cli) to the spawn matrix. Routes LLM requests through
OpenRouter via --endpoint flag and CURSOR_API_KEY env var.
- manifest.json: new cursor agent entry + all 6 cloud matrix entries
- agent-setup.ts: install, configure, launch, and update definitions
- Shell scripts for all 6 clouds (local, hetzner, aws, do, gcp, sprite)
- Config: writes ~/.cursor/cli-config.json with full permissions
- Icon: cursor.png from cursor.com/apple-touch-icon.png
- All cloud READMEs updated with cursor.sh usage
- CLI version bumped to 0.26.0
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* feat: add spawn skill injection for Cursor CLI
Writes a .cursor/rules/spawn.mdc rule file with alwaysApply: true
during setup, teaching the Cursor agent how to use the spawn CLI
to provision child cloud VMs. Uses the same base64 upload pattern
as other agent config files.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
---------
Signed-off-by: Ahmed Abushagur <ahmed@abushagur.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: A <258483684+la14-1@users.noreply.github.com>
Fixes#3019
Replace `grep -qx` with `grep -qxF` in the `ensure_in_path` function
to prevent regex pattern injection. Without -F, attacker-controlled
SPAWN_INSTALL_DIR or BUN_INSTALL env vars containing regex metacharacters
(e.g. `/.*`) could cause false positive/negative PATH matches, potentially
bypassing the symlink creation logic.
Agent: issue-fixer
Co-authored-by: B <6723574+louisgv@users.noreply.github.com>
Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
The GCP E2E cloud driver defaulted to us-central1-a when GCP_ZONE was
not set in the environment. The QA VM stores zone config in
~/.config/spawn/gcp.json (alongside GCP_PROJECT) but _gcp_validate_env
only read GCP_PROJECT from the environment — it never loaded GCP_ZONE.
This caused E2E failures when us-central1-a had insufficient resources:
3 agents (openclaw, opencode, kilocode) failed with "SSH port never
opened" because GCP couldn't provision instances in that zone.
Fix: load both GCP_PROJECT and GCP_ZONE from the config file in
_gcp_validate_env when they are not already set in the environment,
matching how key-request.sh loads GCP_PROJECT for provisioning.
Verified: all 3 previously failing agents now pass on europe-west1-b.
Co-authored-by: spawn-qa-bot <qa@openrouter.ai>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Add allowlist validation for the bun binary path resolved via `command -v bun`
before using it in symlink operations that may run with sudo privileges. If bun
is found at an unexpected location, skip the symlink and warn the user. This
prevents a privilege escalation attack where a malicious binary on PATH could be
symlinked to /usr/local/bin/bun with elevated privileges.
Agent: security-auditor
Co-authored-by: B <6723574+louisgv@users.noreply.github.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
ai-review.sh is sourced by e2e.sh but was missing from the bash -n
syntax check loop in sh/test/e2e-lib.sh. This means syntax errors in
ai-review.sh would not be caught by the test harness.
Co-authored-by: spawn-qa-bot <qa@openrouter.ai>
Co-authored-by: L <6723574+louisgv@users.noreply.github.com>
Adds a non-empty check after mktemp and guards the EXIT trap so rm -rf
only fires when tmpdir is non-empty and still a directory. This is a
defense-in-depth hardening — the current code is safe due to set -e,
but explicit validation is best practice for rm -rf operations.
Fixes#2998
Agent: code-health
Co-authored-by: B <6723574+louisgv@users.noreply.github.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Sprite has a bun shim at /.sprite/bin/bun that delegates to
$HOME/.bun/bin/bun, but that binary doesn't exist on fresh VMs.
`command -v bun` returns true (finds the shim) so the install script
skips bun installation, then bun fails when actually invoked.
Fixed in two places:
- installSpawnCli: source shell profiles, test `bun --version` (not
just existence), and install bun fresh if it doesn't work
- install.sh: replace `command -v bun` with `bun --version` to detect
broken shims
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
On interactive provision failure, save the harness log to a persistent
path (/tmp/spawn-interactive-harness-last.log) for post-mortem inspection,
and filter output to only show [harness] prefixed lines (30 lines) instead
of dumping 50 raw lines of mixed output.
Co-authored-by: spawn-qa-bot <qa@openrouter.ai>
Co-authored-by: L <6723574+louisgv@users.noreply.github.com>
Co-authored-by: Ahmed Abushagur <ahmed@abushagur.com>
Hermes installs a Python virtualenv which takes 20+ min on fresh VMs.
The previous 300s install timeout caused the CLI to give up before
writing .spawnrc, leading to 30-min E2E timeouts on Hetzner, DigitalOcean,
and GCP (but not Sprite, which has a manual .spawnrc fallback).
Changes:
- agent-setup.ts: hermes installAgent timeout 300s → 600s
- common.sh: add hermes per-agent overrides (_PROVISION_TIMEOUT_hermes=720,
_AGENT_TIMEOUT_hermes=3600) to give the install enough headroom
- package.json: bump CLI version 0.25.26 → 0.25.27
-- qa/e2e-tester
Co-authored-by: spawn-qa-bot <qa@openrouter.ai>
Co-authored-by: L <6723574+louisgv@users.noreply.github.com>
When the quality cycle e2e-tester re-runs only failed agents
(e.g. `e2e.sh --cloud hetzner zeroclaw codex`), e2e.sh was firing
a matrix email showing only those 2 agents — both PASS if the retry
succeeded. This looked like "2 tests ran, all passed" when in reality
32 tests ran with 2 failures.
- Add SPAWN_E2E_SKIP_EMAIL=1 env var check at the top of send_matrix_email
- Update qa-quality-prompt.md to set SPAWN_E2E_SKIP_EMAIL=1 on re-runs
Co-authored-by: spawn-qa-bot <qa@openrouter.ai>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
The QA account's primary IP limit is ~3, so running 5 agents in parallel
exhausted the quota, causing codex and zeroclaw to fail with
resource_limit_exceeded. Reducing _hetzner_max_parallel to 3 keeps
provisioning within quota while still running agents concurrently.
Verified: zeroclaw and codex both PASS on Hetzner after this fix.
-- qa/e2e-tester
Co-authored-by: spawn-qa-bot <qa@openrouter.ai>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: L <6723574+louisgv@users.noreply.github.com>
- hetzner.sh: Pipe base64-encoded command via stdin to SSH instead of
embedding it in the SSH command string via variable expansion. The
remote bash reads stdin, base64-decodes, and executes.
- verify.sh: Add remote-side re-validation of base64 and timeout values
in _stage_prompt_remotely and _stage_timeout_remotely. Values are
assigned to remote shell variables and validated before writing to
temp files, providing defense-in-depth against injection.
- provision.sh: Add explicit early rejection of dangerous shell chars
($, `, \) in env var values from cloud_headless_env, and add
remote-side re-validation of base64 payload before writing.
Fixes#2937Fixes#2938Fixes#2939
Agent: security-auditor
Co-authored-by: B <6723574+louisgv@users.noreply.github.com>
Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
- Stash uncommitted changes before git pull --rebase so the pull
never aborts with "You have unstaged changes"
- Pull --rebase before pushing star count commit to avoid
non-fast-forward rejection (was failing every single cycle)
- Remove --yes flag from claude update (flag was removed upstream)
- Fix interactive harness AI prompt: update success marker text from
"is ready" or "Starting agent" to match code check
("Starting agent..." or "setup completed successfully")
Co-authored-by: spawn-qa-bot <qa@openrouter.ai>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
- fix misplaced interactive_provision comment block in interactive.sh:
the comment was positioned before _report_ux_issues but described the
interactive_provision function; moved it to be adjacent to its function
- apply interactive E2E improvements already in main working tree:
e2e.sh: add verify_agent call after interactive_provision to wait for
.spawnrc before running input tests (aligns interactive with headless flow)
-- qa/code-quality
Co-authored-by: spawn-qa-bot <qa@openrouter.ai>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Three fixes for Sprite E2E failures in long-running batches (73+ min):
1. Retry `_sprite_provision_verify`: list failures now retry 3x with
exponential backoff (5s, 10s, 20s) instead of failing immediately.
Fixes kilocode batch 6 "Could not list Sprite instances" errors.
2. Increase `CREATE_TIMEOUT_SECS` default from 300s to 600s and add
`Client.Timeout`, `request canceled`, and `authentication failed`
to the transient error retry pattern in `spriteRetry`. Also uses
linear backoff (3s * attempt) instead of fixed 3s delay.
Fixes hermes batch 7 HTTP timeout errors.
3. Add `_sprite_refresh_auth` + `cloud_refresh_auth` interface. The
E2E orchestrator calls `cloud_refresh_auth` before each provisioning
batch. For Sprite, this re-validates the token via `sprite org list`
and attempts `sprite auth refresh` if expired.
Fixes junie batch 8 "authentication failed" errors.
Fixes#2934
Agent: ux-engineer
Co-authored-by: B <6723574+louisgv@users.noreply.github.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Hetzner E2E runs fail with `resource_limit_exceeded` when stale primary
IPs from previous test runs consume the account quota. This adds proactive
cleanup at two levels:
1. E2E shell driver: `_hetzner_cleanup_orphaned_ips()` deletes unattached
primary IPs during pre-batch stale cleanup, freeing quota before any
new servers are provisioned.
2. TypeScript CLI: `hetzner/main.ts` calls `cleanupOrphanedPrimaryIps()`
before `createServer()` in headless/non-interactive mode, ensuring
each agent provisioning attempt starts with a clean IP quota.
The existing reactive cleanup (retry after failure) in `hetzner.ts`
remains as a fallback.
Fixes#2933
Agent: code-health
Co-authored-by: B <6723574+louisgv@users.noreply.github.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(e2e): harden pkill regex escaping against all metacharacters (#2911)
The sed character class `[.[\*^$]` was malformed and missed several
extended regex metacharacters (+, ?, (, ), {, }, |). Replace with a
correct bracket expression that escapes all POSIX ERE metacharacters.
Although app_name is already validated to [A-Za-z0-9._-], fixing the
escaping is defense-in-depth against future changes to the validation.
Agent: security-auditor
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(e2e): correct sed bracket expression to escape ] character
Place ] first in character class so it's treated as literal.
Use \\ to match literal backslash.
Agent: pr-maintainer
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
---------
Co-authored-by: B <6723574+louisgv@users.noreply.github.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(e2e): pass SPAWN_NAME + SPAWN_ENABLED_STEPS to interactive harness
Without SPAWN_NAME, cmdRun prompts 'Name your spawn' interactively.
The AI driver (Claude Haiku) can't respond because ANTHROPIC_AUTH_TOKEN
is an OpenRouter key — every Anthropic API call returns 401, so the harness
returns <wait> indefinitely until the 20-min SESSION_TIMEOUT_MS fires.
SPAWN_ENABLED_STEPS=auto-update bypasses the setup options multiselect,
ensuring the harness only tests the provisioning/installation UX.
* fix(e2e): fix _stage_timeout_remotely stdin pipe issue on Hetzner
Same root cause as _stage_prompt_remotely: _hetzner_exec runs commands via
"printf | base64 -d | bash", which makes bash's stdin the decode pipe.
So piped data from the outer SSH call never reaches subcommands.
"printf '%s' 'VALUE' | cloud_exec APP 'cat > /tmp/.e2e-timeout'" always
creates an empty file, causing "timeout: invalid time interval ''" when
the input test runs.
Fix: embed the validated numeric timeout value directly in the printf
command string (safe — _validate_timeout ensures only [0-9] digits).
* test(e2e): add claude PATH diagnostics to input_test_claude
Temporary debug output to trace where claude is installed
after interactive provision completes.
* test(e2e): save harness transcript JSON on success for debugging
* fix(e2e): remove 'is ready' from harness success pattern
'SSH is ready' (emitted ~15s into provision when SSH connects but before
any agent installation) matched the /is ready/ pattern, triggering false
success detection. The harness killed the spawn CLI during cloud-init wait,
leaving a VM with no agent installed.
Fix: use the same precise patterns as the main repo's harness:
/Starting agent\.\.\.|setup completed successfully/i
Both only fire after orchestrate.ts completes the full setup.
* chore(e2e): remove temporary debug instrumentation
* feat(e2e): add ai-powered ux review after interactive provision
After each successful interactive E2E run, the harness sends the full
terminal transcript to Claude (via OpenRouter) with a UX reviewer prompt.
It looks for confusing messages, noisy output, missing context in spinners,
and unhelpful errors that don't explain next steps.
Findings are returned as uxIssues[] in the harness JSON result.
interactive.sh then files a GitHub issue per run listing each problem
with a verbatim example and concrete suggestion.
Uses OPENROUTER_API_KEY (already in env) so it works on the QA VM
where ANTHROPIC_API_KEY is an OpenRouter key.
* refactor(e2e): throttle ux issue filing — 33% chance, 3+ issues required
- Random 33% gate: UX review runs on ~1 in 3 successful interactive
provisions, not every run
- Minimum bar: only surface findings when AI found 3+ clear issues
(filters one-off nits)
- Tighter system prompt: only flag obvious problems (repeated messages,
debug leaks, cryptic errors), not minor style preferences
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* refactor(e2e): replace random throttle with stricter ux review prompt
Instead of Math.random() to suppress issues, make the AI self-regulate:
the system prompt now instructs it to only flag genuinely bad problems
(repeated messages, raw stack traces, no-feedback waits) and treat
zero findings as a good outcome, not a failure.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
---------
Co-authored-by: spawn-qa-bot <qa@openrouter.ai>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
The stdin piping approach was broken: _hetzner_exec runs remote commands via
"printf '%s' 'ENCODED_CMD' | base64 -d | bash", which connects bash's stdin to
the base64 pipe rather than SSH's outer stdin. So `cat > /tmp/.e2e-prompt` read
from EOF — the encoded prompt was never written to the remote file.
Fix: embed the validated base64 prompt directly in the command string using
printf. This is safe because _validate_base64 ensures the prompt contains only
[A-Za-z0-9+/=] — no characters that can break out of single quotes or inject
shell metacharacters.
Co-authored-by: spawn-qa-bot <qa@openrouter.ai>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: L <6723574+louisgv@users.noreply.github.com>
* chore: update agent GitHub star counts
* fix(qa): load ANTHROPIC_AUTH_TOKEN as ANTHROPIC_API_KEY for interactive E2E
QA VMs store the Anthropic key as ANTHROPIC_AUTH_TOKEN in
/etc/spawn-qa-auth.env, but the e2e-interactive handler only looked for
ANTHROPIC_API_KEY — causing the 6am cron to fail immediately with
"ANTHROPIC_API_KEY not set". Accept either name when loading from the
auth env file.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(e2e): bump interactive harness timeout to 20min, fix zombie VM teardown
- SESSION_TIMEOUT_MS: 10min → 20min — provisioning a VM takes 3-4 min
before onboarding even starts; 10min wasn't enough headroom
- interactive.sh: call cloud_provision_verify even on harness failure so
teardown can find and delete any VM that was partially created (e.g.
on timeout mid-provision) — previously left zombie VMs with no .meta file
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
---------
Co-authored-by: spawn-qa-bot <qa@openrouter.ai>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: L <6723574+louisgv@users.noreply.github.com>
AI log review now includes the git diff since the last fully passing
E2E run, enabling causal analysis like "this 404 likely caused by
commit abc123 which deleted file Y". After a fully green run, the
e2e-last-green tag advances to HEAD as the new baseline.
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix(security): harden remote command construction in provision.sh
Split the .spawnrc upload fallback into two separate cloud_exec calls
to separate data from commands. Step 1 writes the validated base64
payload to a remote temp file. Step 2 decodes from that file and
sets up shell rc sourcing using a static command string with no
interpolated variables.
This eliminates command injection risk in the control-flow portion
of the remote command (for loop, grep, etc.) even if the base64
validation were ever bypassed, since user-controlled data never
appears in the same command string as shell control flow.
Fixes#2882
Agent: complexity-hunter
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix: correct error handling + use mktemp for temp file
- Return 1 (not 0) when step 1 fails to avoid masking provisioning failures
- Use mktemp -t spawnrc.b64 to avoid race conditions on concurrent provisions
Agent: pr-maintainer
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix: propagate step 2 failure in provision.sh (return 1)
The else branch for step 2 (decode + shell rc setup) logged an error
but the function still returned 0, masking the failure. Now returns 1
so provisioning failures are correctly propagated.
Agent: pr-maintainer
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
---------
Co-authored-by: B <6723574+louisgv@users.noreply.github.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
The env value whitelist allowed @, %, +, =, :, and , characters that
are unnecessary for cloud resource names (server names, regions, sizes)
and could be used as shell metacharacters in certain contexts. Restrict
to only [A-Za-z0-9._/-] which matches all legitimate cloud resource
identifiers.
Fixes#2883
Agent: code-health
Co-authored-by: B <6723574+louisgv@users.noreply.github.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Prevent shell metacharacter interpretation in test prompt handling
by staging INPUT_TEST_TIMEOUT and attempt number to remote temp files
instead of interpolating them into remote command strings.
Previously, _TIMEOUT='${INPUT_TEST_TIMEOUT}' and --session-id
e2e-test-${attempt} were interpolated directly into double-quoted
remote command strings. While _validate_timeout enforces digits-only,
the structural pattern of local-to-remote variable interpolation is
inherently risky. Now all dynamic values (prompt, timeout, attempt)
are piped to remote temp files via stdin and read back on the remote
side, eliminating the injection surface entirely.
Fixes#2884
Agent: test-engineer
Co-authored-by: B <6723574+louisgv@users.noreply.github.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(e2e): update input tests for latest agent CLI interfaces + auto-load email creds
claude: add --dangerously-skip-permissions --no-session-persistence to bypass
trust dialog when running in /tmp/e2e-test (not in ~/.claude.json trusted
projects list written during install)
codex: replace `codex exec --full-auto` (removed in new @openai/codex) with
`codex -q -a full-auto` — quiet mode + full-auto approval, no exec subcommand
email: auto-load RESEND_API_KEY + KEY_REQUEST_EMAIL from
/etc/spawn-key-server-auth.env (QA VM) or ~/.config/spawn/resend.env (local)
so send_matrix_email fires on every e2e run, not just QA-cycle runs
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(e2e): correct claude and codex input test commands
- claude: pass prompt as positional arg to claude -p instead of piping
via stdin (stdin pipe breaks through SSH exec chain, causing
"Input must be provided either through stdin or as a prompt argument"
error)
- codex: revert to `codex exec --full-auto` subcommand (correct for
v0.116.0 — previous -q -a full-auto flags don't exist)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
---------
Co-authored-by: spawn-qa-bot <qa@openrouter.ai>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
* feat(e2e): add AI-powered log review after provisioning
Feeds provision stderr/stdout logs to an LLM after each agent deploys.
Catches non-fatal issues that binary pass/fail checks miss: silent 404s,
failed component installs, connection instability, swallowed warnings.
This would have caught the keep-alive 404 and the sprite idle shutdown
that the existing E2E tests missed because installSpriteKeepAlive() is
non-fatal and the binary checks only verify final state.
- Uses gemini-flash-lite-2.0 via OpenRouter (cheap, fast)
- Advisory only — never fails the test, reports findings as warnings
- Truncates logs to last 200 lines to stay within token limits
- Skips gracefully if OPENROUTER_API_KEY is missing or API fails
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* feat(e2e): add AI log review and --fast mode testing
AI log review:
- After each agent provisions, feeds stderr/stdout to gemini-flash-lite
to catch non-fatal issues binary checks miss (404s, failed installs,
connection drops, swallowed warnings)
- Advisory only — never fails the test, surfaces findings as warnings
- Would have caught the keep-alive 404 and sprite idle shutdown
--fast mode E2E:
- Add --fast flag to e2e.sh, passed through to spawn CLI during provision
- Update QA e2e-tester protocol to run both normal and --fast passes
- --fast enables images + tarballs + parallel boot
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: L <6723574+louisgv@users.noreply.github.com>
* fix: destroy orphaned Packer builder instances on workflow cancel
When a Packer Snapshots workflow is cancelled mid-build, Packer's process
is killed before it can clean up its temporary builder droplet/server.
This leaves orphaned packer-* instances running and costing money.
Add `if: cancelled()` cleanup steps for both DigitalOcean and Hetzner
that destroy any packer-* prefixed instances after cancellation.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* chore: remove Hetzner cleanup step — only DO needed
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: remove Hetzner from Packer snapshots, add cancel cleanup
Remove Hetzner from the Packer workflow entirely — only DigitalOcean
snapshots are built. Deletes packer/hetzner.pkr.hcl and simplifies the
workflow by removing all Hetzner-specific steps and cloud conditionals.
Also adds a cancelled() cleanup step that destroys orphaned packer-*
builder droplets when a workflow run is cancelled mid-build.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: add missing sprite-keep-running.sh script
The keep-alive install was 404ing because sh/shared/sprite-keep-running.sh
never existed in the repo. The TypeScript code downloaded it from the CDN
(which maps to sh/shared/) but the file was never created.
The script wraps a command and pings the sprite's own public URL every 30s
to prevent inactivity shutdown. It resolves the URL via sprite-env info
(available on all sprites) and falls back to exec without keep-alive if
the URL can't be determined.
Also removes Hetzner from the Packer snapshots workflow entirely — only
DigitalOcean snapshots are built.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: address security review — scope cleanup filter, fix JSON injection
1. Add `spawn-packer` tag to DO builder droplets in Packer template and
filter cleanup by tag instead of broad `packer-` name prefix. Prevents
accidentally destroying builder instances from other concurrent builds.
2. Use `jq --arg` for SINGLE_AGENT_INPUT instead of string interpolation
to prevent JSON injection via crafted agent names.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The manual .spawnrc fallback in provision.sh was using `printf '%s' "${env_b64}" | cloud_exec ...`,
which works for SSH-based clouds (Hetzner, GCP, AWS) where stdin is passed through the SSH
connection. However, Sprite's exec driver replaces stdin with the command pipe:
`printf '%s' "${cmd}" | sprite exec -s NAME -- bash`
This causes the outer env_b64 pipe to be lost — `base64 -d` receives no input and writes an
empty .spawnrc, which then fails the OPENROUTER_API_KEY and openrouter.ai verification checks.
Fix: embed the base64 data directly in the command string using `printf '%s' '${env_b64}'`.
This is safe because env_b64 is validated to contain only [A-Za-z0-9+/=] — the standard
base64 alphabet — which cannot break out of single quotes or cause shell injection.
Confirmed by E2E run where sprite/claude and sprite/openclaw both failed with:
[FAIL] OPENROUTER_API_KEY not found in .spawnrc
[FAIL] Failed to create manual .spawnrc
Co-authored-by: spawn-qa-bot <qa@openrouter.ai>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
GCP's default 10 GB boot disk is insufficient for coding agents — node_modules,
apt packages, and build caches easily exceed it. Default to 40 GB and allow
override via GCP_DISK_SIZE env var.
Closes#2866
Co-authored-by: Claude <claude@anthropic.com>
Add defense-in-depth validation of INPUT_TEST_TIMEOUT directly in verify.sh
(not just relying on common.sh). Each input test function now calls
_validate_timeout() to ensure the value contains only digits before use.
Additionally, instead of interpolating INPUT_TEST_TIMEOUT directly into
remote command strings passed to cloud_exec, the timeout value is now
assigned to a single-quoted remote variable (_TIMEOUT) and referenced via
"$_TIMEOUT" on the remote side. This eliminates the injection surface even
if validation were somehow bypassed.
Affected functions: input_test_claude(), input_test_codex(),
input_test_openclaw(), input_test_zeroclaw().
Fixes#2849
Agent: security-auditor
Co-authored-by: B <6723574+louisgv@users.noreply.github.com>
Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
Replaces command string interpolation with stdin piping for the base64
prompt in verify.sh. Also anchors the _validate_base64 regex.
Fixes#2833
Agent: code-health
Co-authored-by: B <6723574+louisgv@users.noreply.github.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Fixes#2823: npm installs kilocode to /usr/local/bin when running as
root on GCP, but the E2E binary verify step didn't include /usr/local/bin
in PATH, causing false "binary not found" failures.
The .spawnrc PATH (generated by generateEnvConfig) already includes
/usr/local/bin, but verify_kilocode used a hardcoded PATH that omitted
it. This aligns kilocode and codex verify checks with openclaw and junie
which already include /usr/local/bin.
Also fixes the same latent issue in verify_codex.
Agent: code-health
Co-authored-by: B <6723574+louisgv@users.noreply.github.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
* feat: never-give-up resilience layer — retry every failure instead of exiting
Add retryOrQuit() helper to shared/ui.ts that prompts "Try again? (Y/n)"
after any recoverable failure. Wrap all fatal exit points with retry loops:
- Cloud auth (Hetzner, DigitalOcean, AWS, GCP): retry after 3 failed tokens
- API key acquisition: retry after 3 failed OAuth+manual attempts
- Server creation: retry on any createServer failure (both fast & sequential)
- SSH readiness: retry on waitForReady timeout
- Agent install: retry on install failure
- Pre-launch hooks: retry on preLaunch failure
Non-interactive mode (SPAWN_NON_INTERACTIVE=1) still throws immediately.
Ctrl+C at any retry prompt exits cleanly.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* feat(e2e): add AI-driven interactive test harness
Add --interactive mode to the E2E test framework. Instead of running spawn
in headless mode (SPAWN_NON_INTERACTIVE=1), this spawns the CLI in a real
PTY and uses Claude Haiku to respond to prompts like a human user would.
New files:
- sh/e2e/interactive-harness.ts — Bun script that drives the PTY + AI loop
- sh/e2e/lib/interactive.sh — Bash integration with the E2E framework
Usage:
e2e.sh --cloud hetzner claude --interactive
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* feat(qa): wire interactive E2E into scheduled QA pipeline
- Add `e2e-interactive` option to workflow_dispatch in qa.yml
- Add `e2e-interactive` run mode to qa.sh (loads cloud creds + ANTHROPIC_API_KEY)
- Runs `e2e.sh --cloud hetzner claude --interactive` directly (no Claude Code needed)
- Defaults to hetzner (cheapest), overridable via E2E_INTERACTIVE_CLOUD/AGENT env vars
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* feat(qa): schedule interactive E2E daily at 6am UTC
Runs one agent (claude) on one cloud (hetzner) with AI-driven prompts.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix(qa): offset soak cron to avoid GitHub Actions schedule dedup
GitHub Actions deduplicates overlapping cron schedules into one run,
making `github.event.schedule` unpredictable. The soak test at `0 3 * * 1`
was getting absorbed by the `0 */4 * * *` quality sweep and never firing
as reason=soak.
Move soak to `30 1 * * 1` (Monday 1:30am UTC) — safely between the
0am and 4am quality sweep slots. Interactive E2E at `0 6 * * *` is
already safe (between the 4am and 8am slots).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix(qa): add e2e-interactive to trigger server valid reasons
The trigger server validates reason query params against an allowlist.
Without this, the `e2e-interactive` dispatch returns 400.
Also note: `soak` is already in VALID_REASONS in the repo but the running
service on the QA VM is stale — needs a restart to pick up both soak and
e2e-interactive reasons.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The _run_with_restart wrapper in all 8 DigitalOcean agent scripts catches
SIGTERM/SIGKILL exit codes (143/137) and retries the orchestration process.
In headless mode (E2E tests), when the provision timeout kills the process,
this restart loop would re-run main.ts, creating duplicate droplets and
exhausting the account's droplet quota — causing ALL subsequent DO agents
to fail provisioning.
Skip the restart loop entirely when SPAWN_HEADLESS=1 (set by runScriptHeadless
in the CLI). The restart behavior is only useful for interactive sessions
where the user's SSH connection drops.
Fixes#2794
Agent: code-health
Co-authored-by: B <6723574+louisgv@users.noreply.github.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Apply the same base64 encoding mitigation used by all other cloud
drivers (aws, hetzner, digitalocean, gcp). The command is encoded
locally, validated for safe characters, then decoded and executed
on the remote side via `base64 -d | bash`.
Fixes#2800
Agent: security-auditor
Co-authored-by: B <6723574+louisgv@users.noreply.github.com>
Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
The pre-run stale cleanup (added in #2789) used the same 30-minute max_age
as the post-run cleanup. Orphaned instances from recently-failed runs (< 30 min
old) were not cleaned, causing quota exhaustion on DigitalOcean and other clouds.
Pre-run cleanup now uses _CLEANUP_MAX_AGE=300 (5 min) to aggressively reclaim
orphaned e2e instances before provisioning new ones. Post-run cleanup retains
the 30-minute default. All 5 cloud drivers respect the override.
Fixes#2793
Agent: code-health
Co-authored-by: B <6723574+louisgv@users.noreply.github.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Fixes#2797. The _stage_prompt_remotely() function was interpolating
${encoded_prompt} directly into the remote command string passed to
cloud_exec. While _validate_base64() ensures only [A-Za-z0-9+/=]
characters are present, defense-in-depth requires eliminating the
interpolation entirely.
The fix uses printf %s format substitution to build the remote command,
placing the encoded prompt into a single-quoted shell variable assignment
(_EP='...') on the remote side. Single quotes prevent all shell expansion,
and base64 charset cannot contain single quotes, making injection
structurally impossible.
Agent: security-auditor
Co-authored-by: B <6723574+louisgv@users.noreply.github.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Orphaned e2e instances from previously interrupted test runs (e.g. killed
by timeout) remain under the 30-minute max_age threshold and continue to
consume account capacity. This caused DigitalOcean "droplet limit exceeded"
422 errors when re-running the suite within 30 minutes of a failed run.
Add a pre-run stale cleanup call at the start of run_agents_for_cloud (after
credentials are validated, before agents start). This clears leftover e2e-*
instances immediately so they don't block provisioning in the new run.
-- qa/e2e-tester
Co-authored-by: spawn-qa-bot <qa@openrouter.ai>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Replaces the pattern of embedding base64-encoded prompts directly into
remote command strings via shell variable interpolation with a two-step
approach: stage the encoded prompt to a remote temp file first, then
read from that file in the agent command. This eliminates RCE risk if
the prompt source ever becomes user-controlled.
Changes:
- Add _stage_prompt_remotely() helper that writes encoded prompt to
/tmp/.e2e-prompt on the remote host via an isolated cloud_exec call
- input_test_claude(): read prompt from temp file instead of _ENCODED_PROMPT var
- input_test_codex(): same
- input_test_openclaw(): same
- input_test_zeroclaw(): same
- Update _validate_base64() comment to reflect defense-in-depth role
Closes#2788
Agent: security-auditor
Co-authored-by: B <6723574+louisgv@users.noreply.github.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Add explicit validation that encoded_prompt only contains safe base64
characters ([A-Za-z0-9+/=]) in all input_test_* functions in verify.sh.
This makes the safety assumption explicit in code rather than relying
on documentation — if the base64 output ever contains unexpected chars,
the test aborts immediately instead of injecting them into a remote
command string.
Fixes#2775
Agent: security-auditor
Co-authored-by: B <6723574+louisgv@users.noreply.github.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Validates LOG_DIR is within /tmp/spawn-e2e.* before deleting it,
preventing catastrophic data loss if LOG_DIR is somehow set to an
unexpected path via TMPDIR manipulation or future refactors.
Fixes#2777
Agent: issue-fixer
Co-authored-by: B <6723574+louisgv@users.noreply.github.com>
Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>