spawn

vrr/spawn

mirror of https://github.com/OpenRouterTeam/spawn.git synced 2026-05-12 06:00:25 +00:00

Author	SHA1	Message	Date
A	e2d6aa1444	fix: use json_escape in save_vm_connection to prevent malformed JSON (#1470 ) save_vm_connection built JSON via direct string interpolation, which produces malformed output if any value contains quotes, backslashes, or other JSON-special characters. This breaks spawn list/delete/history. Changes: - Use json_escape for all string fields in save_vm_connection - Use json_escape for GCP zone/project metadata values - Switch AWS, GCP, Daytona get_server_name to get_validated_server_name for consistency with Hetzner, DigitalOcean, Fly, OVH Agent: code-health Co-authored-by: B <6723574+louisgv@users.noreply.github.com> Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-02-19 16:23:27 +00:00
Ahmed Abushagur	8ee54d01a8	fix: harden agent reliability + security across all clouds (#1468 ) * docs: add spawn delete command to README Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix: harden openclaw across all clouds — validation, reliability, performance Fixes multiple issues causing openclaw to break on most clouds: Bugs fixed: - Double-prefixed model ID (openrouter/openrouter/auto) in config generation - AWS gateway starting without env vars (missing .zshrc source) - DigitalOcean sourcing .spawnrc instead of .zshrc for gateway - Destructive rm -rf ~/.openclaw on re-runs (now mkdir -p) Validation added: - API key checked against OpenRouter /auth/key endpoint with re-prompt on failure - Model ID verified against OpenRouter model list with re-prompt loop - openrouter/auto and openrouter/free bypass model check Reliability improvements: - Standardized gateway launch with </dev/null & disown across all 9 clouds - Gateway log auto-displayed on startup timeout for diagnostics - 2GB swap added to cloud-init to prevent OOM on small VMs - Portable install timeout (10 min) with macOS gtimeout fallback Performance: - Reordered spawn_agent: OAuth runs while VM provisions (saves 30-60s) - Fly.io: bumped to 2GB RAM + 2 shared CPUs for openclaw - Fly.io: tries bun first (faster), falls back to npm Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix: skip sudo in gh install when running as root (Fly.io containers) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix: address PR review — skip validation in tests, quote escaped cmd, escape model_id - verify_openrouter_key and verify_openrouter_model skip network calls when SPAWN_SKIP_API_VALIDATION, BUN_ENV=test, or NODE_ENV=test is set - install_agent timeout wrapper now quotes the escaped command for defense in depth - model_id in openclaw JSON now uses json_escape() for consistency Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix: remove double-escaping in install_agent that broke shell operators install_agent() was wrapping commands with printf '%q' + bash -c before passing them to the run callback. But run callbacks (run_server, run_sprite, ssh_run_server) already handle escaping for remote transport. The double- escaping turned && \|\| > \| into literal characters, causing 'source' to treat the entire command as a single filename. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix: use local github-auth.sh instead of curling from main When running from a local checkout, base64-encode the local github-auth.sh and send it inline to the remote machine. This ensures fixes (like the sudo skip for root) take effect immediately without waiting for a merge to main. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix: handle github-auth errors gracefully instead of terminating GitHub CLI setup is optional — failures should not abort the spawn session. Guard both run_callback calls in offer_github_auth with \|\| log_warn so the script continues even if gh install fails. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix: use GOOGLE_GEMINI_BASE_URL to route Gemini CLI through OpenRouter Gemini CLI ignores OPENAI_BASE_URL — it uses GEMINI_API_KEY to talk directly to Google's API. The OpenRouter key is not a valid Google API key, so all requests fail with "API key not valid". Use GOOGLE_GEMINI_BASE_URL to redirect Gemini CLI to OpenRouter's endpoint. Fixes all 9 cloud gemini scripts + manifest.json. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix: guard optional spawn_agent hooks so failures don't kill the session With set -eo pipefail, any unguarded failure terminates the script. Several optional operations in spawn_agent were unguarded: - agent_configure: config file uploads (agent works with defaults) - agent_save_connection: convenience JSON for spawn list - agent_pre_launch: gateway daemons, startup hooks - agent_pre_provision: pre-provision prompts - .spawnrc shell hooks: hooking env vars into .bashrc/.zshrc These now log warnings and continue instead of aborting. Critical steps (cloud_authenticate, agent_install, cloud_provision) still exit on failure. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix: audit and fix env vars, escaping, and error handling across all agents Audit findings from 3 parallel agents, fixes applied: Env vars (4 agents fixed across 9 clouds each = 36 scripts): - Amazon Q: remove fake OPENAI_* vars (Q uses AWS auth, can't use OpenRouter) - Cline: replace OPENAI_* env vars with `cline auth -p openrouter` command - Open Interpreter: drop OPENAI_* vars, use only OPENROUTER_API_KEY (native support via --model flag) - NanoClaw: add ANTHROPIC_BASE_URL to .env file (was missing, requests went to Anthropic directly) Escaping: - execute_agent_non_interactive: replace printf '%q' with single-quote wrapping to avoid double-escaping on Fly.io Manifest updated for amazonq, cline, interpreter entries. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix: use setsid to detach openclaw gateway daemon from SSH sessions The gateway daemon launch (`nohup openclaw gateway ... & disown`) hangs on all clouds because SSH/exec channels wait for child FDs to close. setsid creates a new session, fully detaching the daemon so the channel can close immediately. Falls back to nohup where setsid is unavailable. Consolidates the daemon launch into a shared start_openclaw_gateway() function used by all 9 cloud scripts. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix: configure npm global prefix for non-root clouds (AWS, GCP, OVH) AWS Lightsail, GCP, and OVH SSH as non-root users (ubuntu/login user), so `npm install -g` fails with EACCES on /usr/local/lib/node_modules/. Fix: configure npm prefix to ~/.npm-global during cloud-init/setup and add ~/.npm-global/bin to the SSH PATH prefix so agent install commands find globally-installed npm binaries without sudo. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix: remove broken OpenRouter routing from Gemini CLI scripts Gemini CLI uses Google's native API format (/v1beta/models/:streamGenerateContent), not the OpenAI-compatible format (/v1/chat/completions). No base URL override can bridge this — the request formats are fundamentally incompatible. Same situation as Amazon Q (uses vendor-specific auth/API). Removed GEMINI_API_KEY and GOOGLE_GEMINI_BASE_URL from all 9 scripts + manifest. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix: auto-install AWS CLI and gcloud SDK when missing Instead of printing manual install instructions and exiting, both CLIs now auto-install: - AWS: downloads official .pkg (macOS) or .zip (Linux) installer - GCP: uses brew cask on macOS, Google's tarball installer on Linux Falls back to manual instructions if auto-install fails. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix: nanoclaw — install Docker on Linux, fix hardcoded /root/ path Two issues broke NanoClaw on all clouds: 1. .env upload hardcoded /root/nanoclaw/.env — fails on non-root clouds (AWS=ubuntu, GCP=user, OVH=ubuntu). Now uses upload_config_file with $HOME which expands on the remote side. 2. NanoClaw requires a container runtime. On Linux it uses Docker, but Docker was never installed. Added Docker install via get.docker.com to all cloud scripts (with sudo where SSH user is non-root). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix: address security review findings from PR #1463 - Reject symlinked github-auth.sh before base64-encoding (falls back to remote URL) - Hide API key from process list using curl -K - instead of -H in verify_openrouter_key Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix: quote OPENROUTER_API_KEY in cline auth to prevent command injection Unquoted variable in `cline auth -p openrouter -k ${OPENROUTER_API_KEY}` allows shell metacharacters in the key to execute arbitrary commands on the remote server. Wrapping in escaped double quotes prevents expansion. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-19 08:36:24 -05:00
Ahmed Abushagur	f2795a6d84	fix: Node.js v22 upgrade, aider uv install, SSH & cloud reliability (#1440 ) * fix: use uv --upgrade to ensure Python 3.13-compatible Pillow across all clouds aider-chat on Python 3.13 fails with `ImportError: cannot import name '_imaging' from 'PIL'` when an old Pillow version (pre-10.4) is resolved — those releases have no Python 3.13 binary wheels, so the C extension is missing at runtime. Replace `--with 'Pillow>=10.2.0'` (which was silently broken — the `>` and single quotes get mangled by `printf '%q'` in run_server before the command reaches the remote machine) with `--upgrade`, which forces all transitive deps including Pillow to their latest compatible versions. Also adds a plain-text echo before the install so users see progress instead of a silent hang during the 2-4 minute install. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * test: update aider/gptme/interpreter assertions from pip to uv The install method for aider, gptme, and open-interpreter was changed from pip to `uv tool install` across all clouds. The mock test assertions still checked for the old `pip.install.` patterns, causing 9 failures (3 agents × 3 clouds). Update patterns to match the actual `uv tool install` commands now used in all cloud scripts. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * ci: trigger test run for uv assertion fix * fix: prevent SSH hangs, restore stderr, fix command escaping across clouds - Add < /dev/null to ssh_run_server and generic_ssh_wait to prevent SSH stdin theft causing sequential install/verify/configure steps to hang - Add ServerAliveInterval, ServerAliveCountMax, ConnectTimeout to default SSH_OPTS so long-running installs don't silently drop on flaky networks - Remove 2>/dev/null from Fly.io run_server so remote command errors are no longer silently swallowed (--quiet flag still suppresses flyctl noise) - Fix Fly.io printf '%q' double-quoting: remove extra quotes around $escaped_cmd that prevented the remote shell from consuming escapes, breaking && \|\| \| operators in commands - Remove broken printf '%q' from Daytona run_server and interactive_session where it escaped shell operators into literal characters since daytona exec has no intermediate shell layer - Pin aider to --python 3.12 instead of --with audioop-lts across all clouds Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix: add --pty to fly ssh console for interactive sessions fly ssh console -C does not allocate a pseudo-terminal by default, causing interactive TUI agents (aider, claude) to fail with "Input is not a terminal (fd=0)" or completely unresponsive input. Adding --pty forces PTY allocation, matching how other clouds handle interactive sessions (SSH uses -t, Sprite uses -tty). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix: prepend ~/.local/bin to PATH in ssh_run_server After uv installs to ~/.local/bin, the current shell session doesn't have it in PATH, causing "uv: command not found" on DigitalOcean and all other SSH-based clouds (Hetzner, AWS, GCP, OVH). Fly.io's run_server already prepends this PATH — now the shared ssh_run_server does the same, fixing all SSH-based clouds at once. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix: add Node.js to cloud-init for all cloud providers npm-based agents (codex, kilocode, etc.) fail with "npm: command not found" because Node.js isn't installed during cloud-init. Fly.io was the only provider installing Node.js (in wait_for_cloud_init). Now all cloud-init scripts install Node.js v22 LTS from nodesource, matching Fly.io's setup. Also adds ~/.local/bin to PATH in AWS and GCP cloud-init (was already in shared/DigitalOcean/Hetzner). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix: use apt packages for nodejs/npm instead of nodesource The nodesource setup script (setup_22.x) runs its own apt-get update and repository configuration, nearly doubling cloud-init time and causing hangs on DigitalOcean. Ubuntu 24.04 includes nodejs and npm in its default repos — just add them to the packages list. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix: add timeouts and better error handling to Daytona CLI commands Daytona CLI commands (login, list, create) can hang indefinitely when the API is slow or unreachable. This causes: - "Failed to create sandbox: timeout" with no recovery - Token validation timeouts misreported as "invalid token" - Users re-entering valid tokens that also timeout Fixes: - Wrap all daytona CLI calls with timeout (30s for auth, 120s for create) - Detect timeout errors separately from auth errors - Show actionable "try again / check status" messages for timeouts - Add nodejs/npm to Daytona wait_for_cloud_init Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix: set DAYTONA_API_URL to Daytona Cloud by default The Daytona CLI may default to connecting to a local self-hosted server instead of Daytona Cloud. Without DAYTONA_API_URL set to https://app.daytona.io/api, every CLI command (login, list, create) hangs trying to reach a non-existent local server and times out. The SDK documents this as the default, but the CLI doesn't always pick it up — now we export it explicitly. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix: symlink n-installed Node.js v22 over apt v18 to prevent shadowing n installs Node.js v22 to /usr/local/bin/node but apt's v18 at /usr/bin/node can shadow it in non-interactive SSH sessions. After n 22, symlink the new binaries over the apt ones so v22 is always resolved. Also fix hcloud CLI token extraction for new TOML format. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix: address security review, add curl timeouts to trigger workflows - Fix ssh_run_server command injection concern: use single-quoted path_prefix so $HOME/$PATH expand remotely, not locally - Add --connect-timeout 15 --max-time 30 to trigger workflows to prevent 5-min hangs when server streams responses - Handle 409 (dedup) as success — expected when cron fires every 15min but cycles take 35min - Reduce workflow timeout-minutes from 5 to 2 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> --------- Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-02-18 06:54:07 -05:00
Ahmed Abushagur	633ce8eaac	feat: upgrade default server sizes, fix Fly.io agent installs, improve E2E tests (#1428 ) - Upgrade default VM sizes across clouds for better agent performance: - Hetzner: cpx11 → cx23 (with cx22 fallback support for deprecated types) - DigitalOcean: s-2vcpu-2gb → s-2vcpu-4gb - Daytona: 2048MB → 4096MB memory - Oracle: VM.Standard.E2.1.Micro → VM.Standard.A1.Flex - OVH: d2-2 → d2-4 - Fix Fly.io agent failures: - Add Node.js + build-essential to wait_for_cloud_init (fixes npm-based agents) - Prepend PATH in interactive_session (fixes "source not found" errors) - Fix openclaw installs across clouds: use explicit PATH export instead of source - Fix DigitalOcean token validation (check "uuid" not "id") - Fix AWS cloud-init: chown .bashrc/.zshrc to ubuntu user - Improve Hetzner fallback: add "cheapest available" as last-resort fallback - Upgrade E2E tests: per-combo auto-fix, credential collection, robustness fixes Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-17 22:17:08 -08:00
Ahmed Abushagur	22b6a402f4	feat: E2E test harness, QA pipeline integration, macOS compat linter (#1425 ) * feat: add QA upgrade — macOS compat linter, per-agent mock assertions Layer 1: macOS compat linter (test/macos-compat.sh) - 12 rules (MC001–MC012) catching bash 3.2 incompatibilities - Detects: base64 -w0 file args, non-portable echo flags, source <(), ((var++)), read -d, nounset flag, sed -i, date %N, local -n, declare -A, ${var,,}, and \|& - Added to CI lint.yml in warn-only mode for burn-in - Integrated as Phase 0.5 in qa-dry-run.sh Layer 2: Per-agent mock assertions - test/fixtures/_shared_agent_assertions.sh with install checks for all 15 agents (claude, openclaw, aider, goose, etc.) - Integrated into test/mock.sh via _run_agent_assertions() Also includes branch fixes: - Fix base64 -w0 to use stdin redirect (aws, daytona, fly) - Fix fly/openclaw to use npm install instead of broken curl\|bash Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * feat: add E2E test harness and integrate into QA pipeline Add test/e2e.sh — a full E2E test harness that provisions real servers, installs agents, and verifies setup across all clouds. Features: - Smoke test (one canary agent per cloud) and full matrix modes - Credential auto-detection for 8 clouds - Per-cloud preflight validation (sequential) then parallel agent tests - Stale server cleanup, timing history, cross-cloud comparison - Auto-fix and optimization phases via Claude agents - macOS bash 3.2 compatible Integrate E2E as Phase 5 in both qa-cycle.sh and qa-dry-run.sh: - Runs after mock tests pass, gated on cloud credentials - Phase 5b auto-fixes failures using per-agent worktree branches - Parses results and includes in QA summary Also fixes: - shared/common.sh: honour SPAWN_NON_INTERACTIVE=1 in safe_read() - aws/lib/common.sh: fix SSH key import (use cat instead of base64, handle race condition on concurrent imports) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-17 20:41:07 -05:00
A	07ff397ee5	security: add SSH key path validation to aws/lib/common.sh (#1414 ) Add validation in ensure_ssh_key() to prevent path traversal and arbitrary file upload attacks: - Validate public key file exists and is a regular file - Reject symlinks to prevent reading sensitive system files - Enforce 10KB size limit (SSH pubkeys are ~100-600 bytes) Fixes #1407 Agent: complexity-hunter Co-authored-by: B <6723574+louisgv@users.noreply.github.com> Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>	2026-02-17 12:54:09 -05:00
A	c4eccbd72f	feat: prioritize clouds with CLI installed + hcloud CLI integration (#1375 ) * fix: auto-run gcloud auth login on expired GCP tokens Instead of telling users to run `gcloud auth login` manually, just run it automatically when auth check fails or instance creation hits a reauthentication error, then retry. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: prioritize clouds with CLI installed + hcloud CLI integration When selecting a cloud provider, clouds are now sorted in 3 tiers: 1. Credentials detected (env vars set) — top priority 2. CLI installed (e.g., gcloud, hcloud, aws) — middle priority 3. Neither — default order Also adds hcloud CLI-first support for Hetzner operations (server create/delete/list, SSH key management, auth) with automatic fallback to the existing REST API when hcloud is not available. Closes #1370 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * refactor: rename aws-lightsail to aws across the project Simplifies the cloud key from "aws-lightsail" to "aws" — AWS should have a single entry regardless of the underlying service used. Renames the directory, updates manifest.json matrix keys, CLI map, test fixtures, README, and all agent scripts. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> --------- Co-authored-by: lab <6723574+louisgv@users.noreply.github.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-02-16 20:12:35 -08:00

7 commits