spawn/aws/lib/common.sh
Ahmed Abushagur f2795a6d84
fix: Node.js v22 upgrade, aider uv install, SSH & cloud reliability (#1440)
* fix: use uv --upgrade to ensure Python 3.13-compatible Pillow across all clouds

aider-chat on Python 3.13 fails with `ImportError: cannot import name
'_imaging' from 'PIL'` when an old Pillow version (pre-10.4) is resolved
— those releases have no Python 3.13 binary wheels, so the C extension
is missing at runtime.

Replace `--with 'Pillow>=10.2.0'` (which was silently broken — the `>`
and single quotes get mangled by `printf '%q'` in run_server before the
command reaches the remote machine) with `--upgrade`, which forces all
transitive deps including Pillow to their latest compatible versions.
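
A minimal illustration of the quoting problem and the replacement (the exact install invocation in the cloud scripts may differ slightly):

```bash
# What printf '%q' does to the old flag before it travels over SSH:
printf '%q\n' "--with 'Pillow>=10.2.0'"
#   roughly: --with\ \'Pillow\>=10.2.0\'   <- the remote shell sees a mangled spec

# The replacement needs no fragile quoting and resolves current wheels
# for every transitive dependency, including Pillow:
uv tool install --upgrade aider-chat
```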

Also adds a plain-text echo before the install so users see progress
instead of a silent hang during the 2-4 minute install.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* test: update aider/gptme/interpreter assertions from pip to uv

The install method for aider, gptme, and open-interpreter was changed
from pip to `uv tool install` across all clouds. The mock test
assertions still checked for the old `pip.*install.*` patterns, causing
9 failures (3 agents × 3 clouds).

Update patterns to match the actual `uv tool install` commands now used
in all cloud scripts.
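
Roughly the shape of the change (the `$captured_cmds` variable and `fail` helper here are placeholders, not the harness's real API):

```bash
# before: expected a pip-based install somewhere in the captured commands
grep -Eq 'pip.*install.*aider' "$captured_cmds" || fail "aider install not found"
# after: expect the uv tool install now emitted by every cloud script
grep -Eq 'uv tool install.*aider' "$captured_cmds" || fail "aider install not found"
```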

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* ci: trigger test run for uv assertion fix

* fix: prevent SSH hangs, restore stderr, fix command escaping across clouds

- Add < /dev/null to ssh_run_server and generic_ssh_wait so SSH cannot
  consume the caller's stdin, which made sequential install/verify/configure
  steps hang (see the sketch after this list)
- Add ServerAliveInterval, ServerAliveCountMax, ConnectTimeout to default
  SSH_OPTS so long-running installs don't silently drop on flaky networks
- Remove 2>/dev/null from Fly.io run_server so remote command errors are
  no longer silently swallowed (--quiet flag still suppresses flyctl noise)
- Fix Fly.io printf '%q' double-quoting: remove extra quotes around
  $escaped_cmd that prevented the remote shell from consuming escapes,
  breaking && || | operators in commands
- Remove broken printf '%q' from Daytona run_server and interactive_session
  where it escaped shell operators into literal characters since daytona exec
  has no intermediate shell layer
- Pin aider to --python 3.12 instead of --with audioop-lts across all clouds
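
A sketch of the two SSH changes (option values and variable names are illustrative; the real defaults live in shared/common.sh):

```bash
# keepalives plus a connect timeout so a dead link fails fast instead of hanging
SSH_OPTS="${SSH_OPTS} -o ServerAliveInterval=15 -o ServerAliveCountMax=3 -o ConnectTimeout=10"

# redirect stdin so the remote command cannot swallow the caller's stdin,
# which previously made the next sequential step appear to hang
ssh ${SSH_OPTS} "${SSH_USER}@${ip}" "${cmd}" < /dev/null
```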

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: add --pty to fly ssh console for interactive sessions

fly ssh console -C does not allocate a pseudo-terminal by default,
so interactive TUI agents (aider, claude) either fail with
"Input is not a terminal (fd=0)" or hang with completely unresponsive input.

Adding --pty forces PTY allocation, matching how other clouds handle
interactive sessions (SSH uses -t, Sprite uses -tty).
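
Illustrative invocation (the agent command and app name are placeholders):

```bash
# --pty forces pseudo-terminal allocation so TUI agents get a real terminal on fd 0
fly ssh console --pty -C "${agent_cmd}" -a "${app_name}"
```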

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: prepend ~/.local/bin to PATH in ssh_run_server

After uv installs to ~/.local/bin, the current shell session doesn't
have it in PATH, causing "uv: command not found" on DigitalOcean and
all other SSH-based clouds (Hetzner, AWS, GCP, OVH).

Fly.io's run_server already prepends this PATH — now the shared
ssh_run_server does the same, fixing all SSH-based clouds at once.
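
Sketch of the prefix now prepended by ssh_run_server (single-quoted so $HOME and $PATH expand on the remote host, per the security-review fix further below):

```bash
# single quotes keep $HOME/$PATH from expanding locally; the remote shell expands them
path_prefix='export PATH="$HOME/.local/bin:$PATH"; '
ssh ${SSH_OPTS} "${SSH_USER}@${ip}" "${path_prefix}${cmd}" < /dev/null
```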

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: add Node.js to cloud-init for all cloud providers

npm-based agents (codex, kilocode, etc.) fail with "npm: command not
found" because Node.js isn't installed during cloud-init. Fly.io was
the only provider installing Node.js (in wait_for_cloud_init).

Now all cloud-init scripts install Node.js v22 LTS from nodesource,
matching Fly.io's setup. Also adds ~/.local/bin to PATH in AWS and
GCP cloud-init (was already in shared/DigitalOcean/Hetzner).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: use apt packages for nodejs/npm instead of nodesource

The nodesource setup script (setup_22.x) runs its own apt-get update
and repository configuration, nearly doubling cloud-init time and
causing hangs on DigitalOcean. Ubuntu 24.04 includes nodejs and npm
in its default repos — just add them to the packages list.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: add timeouts and better error handling to Daytona CLI commands

Daytona CLI commands (login, list, create) can hang indefinitely when
the API is slow or unreachable. This causes:
- "Failed to create sandbox: timeout" with no recovery
- Token validation timeouts misreported as "invalid token"
- Users re-entering valid tokens that also timeout

Fixes:
- Wrap all daytona CLI calls with timeout (30s for auth, 120s for create);
  see the sketch after this list
- Detect timeout errors separately from auth errors
- Show actionable "try again / check status" messages for timeouts
- Add nodejs/npm to Daytona wait_for_cloud_init
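
A minimal sketch of the wrapper pattern, as it would sit inside the auth helper (messages and structure are illustrative):

```bash
# GNU timeout exits 124 when the wrapped command ran out of time
status=0
timeout 30 daytona list >/dev/null 2>&1 || status=$?
if [[ ${status} -eq 124 ]]; then
    log_error "Daytona API did not respond within 30s; try again or check the Daytona status page"
    return 1
elif [[ ${status} -ne 0 ]]; then
    log_error "Daytona CLI failed (the token may be invalid)"
    return 1
fi
```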

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: set DAYTONA_API_URL to Daytona Cloud by default

The Daytona CLI may default to connecting to a local self-hosted
server instead of Daytona Cloud. Without DAYTONA_API_URL set to
https://app.daytona.io/api, every CLI command (login, list, create)
hangs trying to reach a non-existent local server until it times out.

The SDK documents this as the default, but the CLI doesn't always
pick it up — now we export it explicitly.
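
The export, roughly as added (respecting any endpoint the user already set):

```bash
export DAYTONA_API_URL="${DAYTONA_API_URL:-https://app.daytona.io/api}"
```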

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: symlink n-installed Node.js v22 over apt v18 to prevent shadowing

n installs Node.js v22 to /usr/local/bin/node but apt's v18 at
/usr/bin/node can shadow it in non-interactive SSH sessions. After
n 22, symlink the new binaries over the apt ones so v22 is always
resolved. Also fix hcloud CLI token extraction for new TOML format.
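
The resulting cloud-init sequence, which mirrors the AWS user-data further down:

```bash
apt-get install -y nodejs npm                 # Ubuntu 24.04 repo build (v18)
npm install -g n && n 22                      # install v22 into /usr/local/bin
ln -sf /usr/local/bin/node /usr/bin/node      # make sure v22 wins over apt's v18
ln -sf /usr/local/bin/npm  /usr/bin/npm
ln -sf /usr/local/bin/npx  /usr/bin/npx
```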

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: address security review, add curl timeouts to trigger workflows

- Fix ssh_run_server command injection concern: use single-quoted
  path_prefix so $HOME/$PATH expand remotely, not locally
- Add --connect-timeout 15 --max-time 30 to trigger workflows to
  prevent 5-minute hangs when the server streams responses (sketch below)
- Handle 409 (dedup) as success: expected when cron fires every 15 min
  but cycles take 35 min
- Reduce workflow timeout-minutes from 5 to 2
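
Sketch of the hardened trigger call (the URL and payload variables are placeholders):

```bash
http_code=$(curl -sS -o /dev/null -w '%{http_code}' \
    --connect-timeout 15 --max-time 30 \
    -X POST "${TRIGGER_URL}" -d "${payload}")
# 409 means the server deduplicated a cycle that is still running; treat it as success
case "${http_code}" in
    2??|409) exit 0 ;;
    *)       echo "trigger failed with HTTP ${http_code}" >&2; exit 1 ;;
esac
```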

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-18 06:54:07 -05:00

248 lines
9.4 KiB
Bash

#!/bin/bash
# Common bash functions for AWS Lightsail spawn scripts
# Uses AWS CLI (aws lightsail) — requires `aws` CLI configured with credentials

# Bash safety flags
set -eo pipefail

# ============================================================
# Provider-agnostic functions
# ============================================================

# Source shared provider-agnostic functions (local or remote fallback)
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" 2>/dev/null && pwd)"
if [[ -n "${SCRIPT_DIR}" && -f "${SCRIPT_DIR}/../../shared/common.sh" ]]; then
    source "${SCRIPT_DIR}/../../shared/common.sh"
else
    eval "$(curl -fsSL https://raw.githubusercontent.com/OpenRouterTeam/spawn/main/shared/common.sh)"
fi

# Note: Provider-agnostic functions (logging, OAuth, browser, nc_listen) are now in shared/common.sh

# ============================================================
# AWS Lightsail specific functions
# ============================================================

SPAWN_DASHBOARD_URL="https://lightsail.aws.amazon.com/"

# SSH_OPTS is now defined in shared/common.sh

# Configurable timeout/delay constants
INSTANCE_STATUS_POLL_DELAY=${INSTANCE_STATUS_POLL_DELAY:-5} # Delay between instance status checks

ensure_aws_cli() {
    if ! command -v aws &>/dev/null; then
        _log_diagnostic \
            "AWS CLI is required but not installed" \
            "aws command not found in PATH" \
            --- \
            "Install the AWS CLI: https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html" \
            "Or on macOS: brew install awscli"
        return 1
    fi

    # Verify credentials are configured
    if ! aws sts get-caller-identity &>/dev/null; then
        _log_diagnostic \
            "AWS CLI is not configured with valid credentials" \
            "No AWS credentials found or credentials have expired" \
            --- \
            "Run: aws configure" \
            "Or set environment variables: export AWS_ACCESS_KEY_ID=... AWS_SECRET_ACCESS_KEY=..."
        return 1
    fi

    local region="${AWS_DEFAULT_REGION:-${LIGHTSAIL_REGION:-us-east-1}}"
    export AWS_DEFAULT_REGION="${region}"
    log_info "Using AWS region: ${region}"
}

ensure_ssh_key() {
    local key_path="${HOME}/.ssh/id_ed25519"
    local pub_path="${key_path}.pub"

    # Generate key if needed
    generate_ssh_key_if_missing "${key_path}"

    # Validate SSH public key path before upload
    if [[ ! -f "${pub_path}" ]]; then
        log_error "SSH public key not found: ${pub_path}"
        return 1
    fi
    if [[ -L "${pub_path}" ]]; then
        log_error "SSH public key cannot be a symlink: ${pub_path}"
        return 1
    fi

    # SSH public keys are typically 100-600 bytes (ed25519/RSA)
    # Reject suspiciously large files to prevent arbitrary file upload
    local size
    size=$(wc -c <"${pub_path}")
    if [[ ${size} -gt 10000 ]]; then
        log_error "SSH public key file too large: ${size} bytes (max 10000)"
        return 1
    fi

    local key_name="spawn-key"

    # Check if already registered
    if aws lightsail get-key-pair --key-pair-name "${key_name}" &>/dev/null; then
        log_info "SSH key already registered with Lightsail"
        return 0
    fi

    log_step "Importing SSH key to Lightsail..."
    # --public-key-base64 accepts the OpenSSH key directly (not base64-wrapped)
    aws lightsail import-key-pair \
        --key-pair-name "${key_name}" \
        --public-key-base64 "$(cat "${pub_path}")" \
        >/dev/null 2>&1 || {
        # Race condition: another process may have imported it
        if aws lightsail get-key-pair --key-pair-name "${key_name}" &>/dev/null; then
            log_info "SSH key already registered with Lightsail"
            return 0
        fi
        log_error "Failed to import SSH key to Lightsail"
        return 1
    }
    log_info "SSH key imported to Lightsail"
}

get_server_name() {
    get_resource_name "LIGHTSAIL_SERVER_NAME" "Enter Lightsail instance name: "
}

get_cloud_init_userdata() {
    cat << 'CLOUD_INIT_EOF'
#!/bin/bash
apt-get update -y
apt-get install -y curl unzip git zsh nodejs npm
# Upgrade Node.js to v22 LTS (apt has v18, agents like Cline need v20+)
# n installs to /usr/local/bin but apt's v18 at /usr/bin can shadow it, so symlink over
npm install -g n && n 22 && ln -sf /usr/local/bin/node /usr/bin/node && ln -sf /usr/local/bin/npm /usr/bin/npm && ln -sf /usr/local/bin/npx /usr/bin/npx
# Install Bun
su - ubuntu -c 'curl -fsSL https://bun.sh/install | bash'
# Install Claude Code
su - ubuntu -c 'curl -fsSL https://claude.ai/install.sh | bash'
# Configure PATH
echo 'export PATH="${HOME}/.claude/local/bin:${HOME}/.local/bin:${HOME}/.bun/bin:${PATH}"' >> /home/ubuntu/.bashrc
echo 'export PATH="${HOME}/.claude/local/bin:${HOME}/.local/bin:${HOME}/.bun/bin:${PATH}"' >> /home/ubuntu/.zshrc
chown ubuntu:ubuntu /home/ubuntu/.bashrc /home/ubuntu/.zshrc
touch /home/ubuntu/.cloud-init-complete
chown ubuntu:ubuntu /home/ubuntu/.cloud-init-complete
CLOUD_INIT_EOF
}
# Wait for Lightsail instance to become running and get its public IP
# Sets: LIGHTSAIL_SERVER_IP
# Usage: _wait_for_lightsail_instance NAME [MAX_ATTEMPTS]
_wait_for_lightsail_instance() {
    local name="${1}"
    local max_attempts=${2:-60}
    local attempt=1

    log_step "Waiting for instance to become running..."
    while [[ ${attempt} -le ${max_attempts} ]]; do
        local state
        state=$(aws lightsail get-instance --instance-name "${name}" \
            --query 'instance.state.name' --output text 2>/dev/null)
        if [[ "${state}" == "running" ]]; then
            LIGHTSAIL_SERVER_IP=$(aws lightsail get-instance --instance-name "${name}" \
                --query 'instance.publicIpAddress' --output text)
            export LIGHTSAIL_SERVER_IP
            log_info "Instance running: IP=${LIGHTSAIL_SERVER_IP}"
            return 0
        fi
        log_step "Instance state: ${state} (${attempt}/${max_attempts})"
        sleep "${INSTANCE_STATUS_POLL_DELAY}"
        attempt=$((attempt + 1))
    done

    log_error "Instance did not become running after ${max_attempts} checks"
    log_warn "The instance may still be provisioning. You can:"
    log_warn " 1. Re-run the command to try again"
    log_warn " 2. Check the instance status: aws lightsail get-instance --instance-name '${name}'"
    log_warn " 3. Check the Lightsail console: https://lightsail.aws.amazon.com/"
    return 1
}

create_server() {
    local name="${1}"
    local bundle="${LIGHTSAIL_BUNDLE:-medium_3_0}"
    local region="${AWS_DEFAULT_REGION:-us-east-1}"
    local az="${region}a"
    local blueprint="ubuntu_24_04"

    # Validate env var inputs to prevent command injection
    validate_resource_name "${bundle}" || { log_error "Invalid LIGHTSAIL_BUNDLE"; return 1; }
    validate_region_name "${region}" || { log_error "Invalid AWS_DEFAULT_REGION"; return 1; }

    log_step "Creating Lightsail instance '${name}' (bundle: ${bundle}, AZ: ${az})..."

    local userdata
    userdata=$(get_cloud_init_userdata)

    if ! aws lightsail create-instances \
        --instance-names "${name}" \
        --availability-zone "${az}" \
        --blueprint-id "${blueprint}" \
        --bundle-id "${bundle}" \
        --key-pair-name "spawn-key" \
        --user-data "${userdata}" \
        >/dev/null; then
        log_error "Failed to create Lightsail instance"
        log_warn "Common issues:"
        log_warn " - Instance limit reached for your account"
        log_warn " - Bundle unavailable in region (try different LIGHTSAIL_BUNDLE or LIGHTSAIL_REGION)"
        log_warn " - AWS credentials lack Lightsail permissions (check IAM policy)"
        log_warn " - Instance name '${name}' already in use"
        return 1
    fi

    export LIGHTSAIL_INSTANCE_NAME="${name}"
    log_info "Instance creation initiated: ${name}"

    _wait_for_lightsail_instance "${name}"
    save_vm_connection "${LIGHTSAIL_SERVER_IP}" "ubuntu" "" "$name" "aws"
}

# Lightsail uses 'ubuntu' user, not 'root'
SSH_USER="ubuntu"
# SSH operations — delegates to shared helpers
verify_server_connectivity() { ssh_verify_connectivity "$@"; }
run_server() { ssh_run_server "$@"; }
upload_file() { ssh_upload_file "$@"; }
interactive_session() { ssh_interactive_session "$@"; }

wait_for_cloud_init() {
    local ip="${1}"
    local max_attempts=${2:-60}

    # First ensure SSH connectivity is established
    ssh_verify_connectivity "${ip}" 30 5 || return 1

    # Then wait for cloud-init completion marker
    generic_ssh_wait "ubuntu" "${ip}" "${SSH_OPTS}" "test -f /home/ubuntu/.cloud-init-complete" "cloud-init" "${max_attempts}" 5
}

destroy_server() {
    local name="${1}"
    log_step "Destroying Lightsail instance ${name}..."
    aws lightsail delete-instance --instance-name "${name}" >/dev/null
    log_info "Instance ${name} destroyed"
}

list_servers() {
    aws lightsail get-instances --query 'instances[].{Name:name,State:state.name,IP:publicIpAddress,Bundle:bundleId}' --output table
}

# ============================================================
# Cloud adapter interface
# ============================================================
cloud_authenticate() { ensure_aws_cli; ensure_ssh_key; }
cloud_provision() { create_server "$1"; }
cloud_wait_ready() { verify_server_connectivity "${LIGHTSAIL_SERVER_IP}"; wait_for_cloud_init "${LIGHTSAIL_SERVER_IP}" 60; }
cloud_run() { run_server "${LIGHTSAIL_SERVER_IP}" "$1"; }
cloud_upload() { upload_file "${LIGHTSAIL_SERVER_IP}" "$1" "$2"; }
cloud_interactive() { interactive_session "${LIGHTSAIL_SERVER_IP}" "$1"; }
cloud_label() { echo "Lightsail instance"; }