"spawn connect" is not a valid top-level CLI command — users following
this guidance after SSH reconnect failure would see "Unknown agent or
cloud: connect". Replace with "spawn last" which correctly reconnects
to the most recent spawn.
Agent: ux-engineer
Co-authored-by: B <6723574+louisgv@users.noreply.github.com>
Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
* feat(telemetry): funnel + lifecycle events for onboarding drop-off
Adds low-volume, high-signal product events on top of the existing
errors/warnings telemetry (shared/telemetry.ts). Answers "where do users
bail before reaching a running agent" at the fleet level.
Funnel events (in orchestrate.ts, both fast and sequential paths):
  funnel_started              pipeline begins
  funnel_cloud_authed         cloud.authenticate() ok
  funnel_credentials_ready    OpenRouter key + preProvision resolved
  funnel_vm_ready             VM booted and SSH-reachable
  funnel_install_completed    agent install succeeded (tarball or live)
  funnel_configure_completed  agent.configure() ran
  funnel_prelaunch_completed  gateway / dashboard / preLaunch hooks done
  funnel_handoff              about to launch TUI (final step)
Every event carries elapsed_ms since funnel_started, plus agent and cloud
via telemetry context. Per-step counts reveal the drop-off funnel in
PostHog without touching any PII.
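
A minimal sketch of the elapsed_ms bookkeeping described above. The name `createFunnel` and the injected `capture` and `now` parameters are illustrative assumptions, not the actual code in orchestrate.ts:

```typescript
// Hypothetical sketch: names and signatures are illustrative only.
type Capture = (name: string, props: Record<string, unknown>) => void;

function createFunnel(capture: Capture, now: () => number = Date.now) {
  let startedAt: number | undefined;
  // Each call emits one funnel event with the time elapsed since the first call.
  return function step(name: string): void {
    const t = now();
    if (startedAt === undefined) {
      startedAt = t; // the first event (funnel_started) defines time zero
    }
    capture(name, { elapsed_ms: t - startedAt });
  };
}
```

Injecting the clock keeps the helper deterministic under test; agent and cloud are left out here because the message says they come from telemetry context.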
Lifecycle events (new shared/lifecycle-telemetry.ts):
spawn_connected { spawn_id, agent, cloud, connect_count, date }
fired from list.ts when the user reconnects via the interactive picker.
Increments connection.metadata.connect_count and writes last_connected_at
so subsequent events and the eventual spawn_deleted have the total.
spawn_deleted { spawn_id, agent, cloud, lifetime_hours, connect_count, date }
fired from delete.ts (both interactive confirmAndDelete and headless
cmdDelete loop) after a successful cloud destroy. lifetime_hours is
computed from SpawnRecord.timestamp to now. Clamped at 0 for corrupt
clocks. connect_count is read from metadata.
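
The lifetime_hours clamp could look roughly like the sketch below. It assumes SpawnRecord.timestamp is an ISO-8601 string; the helper name and the handling of unparsable timestamps are hypothetical:

```typescript
// Hypothetical helper; assumes the record timestamp is an ISO-8601 string.
function lifetimeHours(createdAtIso: string, nowMs: number = Date.now()): number {
  const createdMs = Date.parse(createdAtIso);
  if (Number.isNaN(createdMs)) {
    return 0; // assumption: an unparsable timestamp is reported as 0 hours
  }
  const hours = (nowMs - createdMs) / 3_600_000;
  // Clamp at 0 so a skewed or corrupt clock never reports a negative lifetime.
  return Math.max(0, hours);
}
```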
New captureEvent(name, properties) helper in telemetry.ts:
- Respects SPAWN_TELEMETRY=0 opt-out (no new flag)
- Runs every string property through the existing scrubber (API keys,
GitHub tokens, bearer, emails, IPs, base64 blobs, home paths)
- Non-string values pass through untouched
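
A sketch of the property handling listed above. The `scrub` function is a stand-in that only redacts email-like substrings (the real scrubber covers many more patterns), and `captureEventSketch` with its injected `env` and `send` parameters is hypothetical:

```typescript
type Props = Record<string, unknown>;

// Stand-in scrubber (assumption): redacts only email-like substrings.
function scrub(value: string): string {
  return value.replace(/[^\s@]+@[^\s@]+\.[^\s@]+/g, "[redacted]");
}

// Strings go through the scrubber; every other value passes through untouched.
function sanitizeProperties(props: Props): Props {
  const out: Props = {};
  for (const [key, value] of Object.entries(props)) {
    out[key] = typeof value === "string" ? scrub(value) : value;
  }
  return out;
}

// Returns false without sending anything when the user has opted out.
function captureEventSketch(
  name: string,
  props: Props,
  env: Record<string, string | undefined>,
  send: (name: string, props: Props) => void,
): boolean {
  if (env.SPAWN_TELEMETRY === "0") {
    return false;
  }
  send(name, sanitizeProperties(props));
  return true;
}
```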
Tests: 20 new (15 lifecycle-telemetry + 2 captureEvent + 3 assertion
additions to disabled-telemetry). Full suite: 2129/2129 pass.
Bumps 1.0.10 -> 1.0.11. Patch bump — auto-propagates under #3296 policy.
* fix(test): replace mock.module with spyOn in lifecycle-telemetry tests
mock.module contaminates the global module registry when running under
--coverage, causing telemetry.test.ts and history-cov.test.ts to receive
mocked implementations instead of the real modules. Switch to spyOn with
mockRestore in afterEach so the real modules are preserved across files.
Agent: pr-maintainer
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
---------
Co-authored-by: L <6723574+louisgv@users.noreply.github.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(security): validate env var keys in skill injection (orchestrate.ts)
Fixes #3269
Agent: security-auditor
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(security): add base64 validation for defense-in-depth in skill env injection
Add validation of base64-encoded values to match the existing pattern
in injectEnvVarsToRunner (line 518), providing defense-in-depth even
though base64 output is highly unlikely to contain invalid characters.
Agent: security-auditor
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(security): base64-encode entire skill env payload before shell interpolation
Matches the injectEnvVarsToRunner pattern: base64-encode the full payload
and decode on the remote side, eliminating any shell interpolation of
individual env lines. Addresses review feedback on double-evaluation risk.
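
One way to picture the combined result of the three security fixes above is the sketch below. The function name, the key regex, the single-quote escaping, and the exact remote command are assumptions for illustration; only the base64 character check mirrors the pattern quoted elsewhere in this log:

```typescript
import { Buffer } from "node:buffer";

const ENV_KEY_RE = /^[A-Za-z_][A-Za-z0-9_]*$/; // assumption: POSIX-style variable names
const BASE64_RE = /^[A-Za-z0-9+\/=]+$/;

// Hypothetical sketch: build one remote command that appends env lines to a file.
function buildEnvAppendCommand(env: Record<string, string>, target = "~/.spawnrc"): string {
  const lines: string[] = [];
  for (const [key, value] of Object.entries(env)) {
    if (!ENV_KEY_RE.test(key)) {
      throw new Error(`Invalid env var key: ${JSON.stringify(key)}`);
    }
    // Single-quote the value so it is inert when the rc file is later sourced.
    const quoted = `'${value.replace(/'/g, `'\\''`)}'`;
    lines.push(`export ${key}=${quoted}`);
  }
  // Encode the whole payload so no individual env line is shell-interpolated.
  const payload = Buffer.from(lines.join("\n") + "\n", "utf8").toString("base64");
  if (!BASE64_RE.test(payload)) {
    throw new Error("Unexpected characters in base64 payload");
  }
  return `printf '%s' '${payload}' | base64 -d >> ${target}`;
}
```

Because the only text interpolated into the command is the validated base64 string, the values never pass through the local or remote shell parser.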
Agent: pr-maintainer
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
---------
Co-authored-by: B <6723574+louisgv@users.noreply.github.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
CLI plumbing for the skills feature. The skills catalog in manifest.json
is populated by the discovery scout (#3252), not manually curated.
Flow:
1. User runs `spawn claude hetzner --beta skills`
2. Skills picker shows available skills for that agent (from manifest.json)
3. User selects skills, enters required env vars (GITHUB_TOKEN, etc.)
4. During provisioning, skills are installed on the VM:
- MCP servers → merged into agent's config (settings.json, mcp.json)
- Instruction skills → SKILL.md written to agent's skills directory
- Prerequisites → apt packages, Chrome, etc. installed first
5. Env vars appended to .spawnrc for MCP server runtime access
Headless: SPAWN_SELECTED_SKILLS=github-mcp,context7 spawn claude hetzner
Supports: Claude Code, Cursor (native MCP config), all other agents
(generic mcp.json fallback).
Signed-off-by: Ahmed Abushagur <ahmed@abushagur.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
pullChildHistory was awaited after the interactive session, blocking
process.exit() for up to 5+ minutes while it SSHed back into the VM.
This is a convenience feature for `spawn tree` — it should never make
the user wait.
Changed to fire-and-forget: process.exit() fires immediately,
killing any in-flight SSH calls. Headless mode still awaits it
since there's no user waiting.
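
The change can be sketched as follows. `finishSession` and its injected parameters are hypothetical, and swallowing errors from the history pull is an assumption:

```typescript
// Hypothetical sketch of the exit path after a session ends.
async function finishSession(
  pullChildHistory: () => Promise<void>,
  headless: boolean,
  exit: (code: number) => void,
): Promise<void> {
  if (headless) {
    // Nobody is waiting at a prompt, so it is fine to block on the pull.
    await pullChildHistory().catch(() => {});
  } else {
    // Fire-and-forget: start the pull but do not wait for it.
    void pullChildHistory().catch(() => {});
  }
  exit(0);
}
```

In the interactive branch `exit` runs before the pull settles, which is the point: with process.exit as the exit function, any in-flight SSH call dies with the process.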
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Installs a cron job (every 6h) that checks for SSH key anomalies,
failed login attempts (brute-force), suspicious software (attack tools,
crypto miners), unexpected processes, rogue cron entries, and unusual
listening ports. Findings are written to /var/log/spawn-security-alerts.log
and displayed as warnings when users reconnect via `spawn connect`.
Signed-off-by: Ahmed Abushagur <ahmed@abushagur.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: complete VM recovery rewrite for spawn fix command
Fixes #3173
Rewrites spawn fix to use CloudRunner interface for full VM recovery
instead of a flat bash script piped over SSH. Now runs the same
install(), configure(), preLaunch() functions as initial provisioning.
- Added generic SSH CloudRunner (ssh-runner.ts) reusable by other commands
- Exported injectEnvVarsToRunner() from orchestrate.ts for shared use
- Fixed command injection vulnerability via validateIdentifier(binaryName)
- Updated dependency injection: runScript → makeRunner (CloudRunner)
- Updated tests to use CloudRunner-based DI pattern
Agent: code-health
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* test(ssh-runner): add coverage for validation paths
Tests cover the early-exit branches in makeSshRunner methods
(runServer invalid command, uploadFile/downloadFile path traversal)
that throw before any subprocess is spawned.
Agent: team-lead
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
---------
Co-authored-by: B <6723574+louisgv@users.noreply.github.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
The same 12-line saveSpawnRecord block was duplicated 3 times in
runOrchestration() (fast-mode boot, fast-mode retry, sequential path).
A bug fixed in one copy could easily be missed in another. Extracted
a shared recordSpawn() helper that all 3 sites now call.
Agent: complexity-hunter
Co-authored-by: B <6723574+louisgv@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
- Replace repeated 'SSH port closed (N/36)' with periodic updates every 5 attempts
- Add clear 'Provisioning complete. Connecting...' line before agent attach
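
A sketch of the throttling rule for the SSH polling message. The helper name is hypothetical, and also logging the first and last attempt is an assumption beyond what the message states:

```typescript
// Hypothetical helper: decide whether a polling attempt should print a status line.
function shouldLogAttempt(attempt: number, maxAttempts: number, every = 5): boolean {
  // Assumption: the first and last attempts are logged in addition to every Nth.
  return attempt === 1 || attempt === maxAttempts || attempt % every === 0;
}
```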
Fixes #3053
Agent: ux-engineer
Co-authored-by: B <6723574+louisgv@users.noreply.github.com>
Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
* feat: pull child spawn history back to parent for `spawn tree`
When the interactive session ends (or headless mode completes), the
parent downloads the child VM's history.json and merges records into
local history. Before downloading, it runs `spawn pull-history` on the
child, which recursively pulls from all grandchildren — so the full
tree collapses up to the root regardless of depth.
Changes:
- Add getParentFields() — sets parent_id/depth on saveSpawnRecord calls
- Add pullChildHistory() — downloads + merges child history after session
- Add `spawn pull-history` command for recursive SSH-based history pull
- Add 11 tests for parseAndMergeChildHistory
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* chore: trigger CI recompute
Agent: pr-maintainer
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(security): validate user/ip params before SSH exec in pull-history
Agent: pr-maintainer
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(security): use shared validators for SSH params in pull-history and delete
Replace inline regex checks in pull-history.ts with validateUsername()
and validateConnectionIP() from security.ts, matching the pattern used
across connect.ts, fix.ts, and link.ts. Also add the same validation
to delete.ts:pullChildHistory which had no SSH parameter validation.
orchestrate.ts uses the runner abstraction (not raw user@ip), so its
SSH params come from the cloud provider, not untrusted history records.
Agent: pr-maintainer
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
---------
Co-authored-by: Ahmed Abushagur <ahmed@abushagur.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: B <6723574+louisgv@users.noreply.github.com>
Shows a non-intrusive "⭐ Enjoying Spawn? Star us on GitHub!" message
to returning users (2+ successful spawns) after a successful spawn
session completes. Shown at most once per 30 days.
- New `maybeShowStarPrompt()` in `shared/star-prompt.ts`
- Tracks `starPromptShownAt` in `~/.config/spawn/preferences.json`
- Called after `execScript()` returns success in cmdRun, cmdInteractive,
and cmdAgentInteractive (skipped in headless mode)
- The `execScript()` return type changed from `void` to `boolean`
to indicate whether the script ran successfully
- Added 7 unit tests covering all gate conditions
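
The gate conditions above can be sketched as a pure function. The name `shouldShowStarPrompt`, the option names other than `starPromptShownAt`, and the treatment of an unparsable stored date are assumptions:

```typescript
const THIRTY_DAYS_MS = 30 * 24 * 60 * 60 * 1000;

// Hypothetical sketch of the gating logic; not the actual maybeShowStarPrompt().
function shouldShowStarPrompt(opts: {
  successfulSpawns: number;
  starPromptShownAt?: string; // ISO timestamp from preferences.json (assumed format)
  headless: boolean;
  nowMs: number;
}): boolean {
  if (opts.headless) return false; // skipped in headless mode
  if (opts.successfulSpawns < 2) return false; // returning users only
  if (!opts.starPromptShownAt) return true; // never shown before
  const last = Date.parse(opts.starPromptShownAt);
  if (Number.isNaN(last)) return true; // assumption: ignore a corrupt stored date
  return opts.nowMs - last >= THIRTY_DAYS_MS; // at most once per 30 days
}
```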
Fixes #3020
Agent: issue-fixer
Co-authored-by: B <6723574+louisgv@users.noreply.github.com>
Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
Replace hand-constructed openrouter.json path with getSpawnCloudConfigPath("openrouter")
for single-source-of-truth path resolution. Remove unused _cloudName parameter since
the function delegates ALL cloud credentials unconditionally.
Agent: ux-engineer
Co-authored-by: B <6723574+louisgv@users.noreply.github.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Add /^[A-Za-z0-9+/=]+$/ validation after each .toString("base64") call
in delegateCloudCredentials() and injectEnvVars(), consistent with the
pattern established in agent-setup.ts by #2988.
Agent: security-auditor
Co-authored-by: B <6723574+louisgv@users.noreply.github.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(ux): replace download spinner with stderr logging, reset terminal before SSH handoff
Fixes two UX issues from live E2E session (#3001):
1. Download spinner (p.spinner from @clack/prompts) wrote ANSI escape codes
to stdout. When stdout is captured (E2E harness, piped output), these
sequences appeared as raw text rather than rendered colors. Replace
p.spinner() in downloadScriptWithFallback and downloadBundle with
logStep/logInfo/logError from shared/ui.ts, which write to stderr and
correctly check isTTY before emitting ANSI codes.
2. Garbled output at start of interactive session (overlapping status lines
from the remote agent's TUI) may be caused by residual ANSI state from
@clack/prompts (hidden cursor, active color attributes). Emit
ESC[?25h ESC[0m to stderr before prepareStdinForHandoff() to explicitly
restore cursor visibility and reset all attributes before the SSH session
takes over.
Agent: issue-fixer
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix: resolve ANSI spinner corruption and garbled output in interactive mode (#3001)
Three root causes fixed:
1. Spinner wrote to stdout while all other CLI status output goes to stderr,
causing ANSI escape sequence interleaving and corruption when both streams
are merged on a terminal. Redirected all p.spinner() calls to process.stderr.
2. unicode-detect.ts (which sets TERM=linux for SSH sessions to force ASCII
fallback) was only imported in commands/shared.ts but not in shared/ui.ts.
Cloud module entry points (hetzner/main.ts, etc.) that import shared/ui.ts
loaded @clack/prompts without the TERM override, causing Unicode spinner
frames in environments that can't render them.
3. After an interactive SSH session ends, the remote agent's TUI (e.g. Claude
Code) may leave the terminal in raw mode with altered attributes. Added
terminal reset (ANSI attribute reset + stty sane) after spawnInteractive()
returns to prevent garbled post-session output.
Agent: ux-engineer
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
---------
Co-authored-by: B <6723574+louisgv@users.noreply.github.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
delegateCloudCredentials only copied the current cloud's config file
(e.g. sprite.json when spawning on Sprite). Child VMs couldn't spawn
on other clouds because their tokens weren't forwarded.
Now iterates all known clouds and copies every credential file that
exists locally, so the agent can spawn children on any cloud.
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Remove `export` from `getTerminalWidth` in commands/info.ts — only
used internally, not exported from commands/index.ts barrel
- Remove `export` from `makeDockerExec` in shared/orchestrate.ts — only
used internally by `makeDockerRunner`, no external callers
- Bump CLI version to 0.26.6
Co-authored-by: spawn-qa-bot <qa@openrouter.ai>
Sprite has a bun shim at /.sprite/bin/bun that delegates to
$HOME/.bun/bin/bun, but that binary doesn't exist on fresh VMs.
`command -v bun` returns true (finds the shim) so the install script
skips bun installation, then bun fails when actually invoked.
Fixed in two places:
- installSpawnCli: source shell profiles, test `bun --version` (not
just existence), and install bun fresh if it doesn't work
- install.sh: replace `command -v bun` with `bun --version` to detect
broken shims
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: spawn step skipped when no explicit --steps passed
The spawn skill injection condition used `enabledSteps?.has("spawn")`
which is falsy when enabledSteps is undefined (no --steps flag). Now
checks the recursive beta flag directly and falls through when no
explicit steps are selected, matching how auto-update works.
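
A sketch of the corrected condition. The function name and parameter shapes are hypothetical; the point is that `enabledSteps?.has("spawn")` evaluates to undefined (falsy) when no --steps flag was given:

```typescript
// Hypothetical sketch of the fixed gate for the spawn skill step.
function shouldInjectSpawnSkill(recursiveBeta: boolean, enabledSteps?: Set<string>): boolean {
  if (!recursiveBeta) return false;
  // No explicit --steps: fall through and inject, like auto-update does.
  if (enabledSteps === undefined) return true;
  return enabledSteps.has("spawn");
}
```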
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: embed skill content in spawn-skill.ts instead of reading from disk
The skills/ directory exists in the repo but isn't bundled when the CLI
is installed via npm. readSkillContent() couldn't find the files at
runtime, causing "No spawn skill file for agent" on every deploy.
Fixed by embedding all skill content directly as string constants in the
module. Removed fs-based getSkillsDir/readSkillContent/getSpawnSkillSourceFile
in favor of a single AGENT_SKILLS config map with inline content.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When `--beta recursive` is active, a new "Spawn CLI" setup step injects
agent-native instruction files teaching each agent how to use the `spawn`
CLI to create child VMs. Skill files live in `skills/` at the repo root
and use each agent's native format (YAML frontmatter for Claude/Codex/
OpenClaw, plain markdown for others, append mode for Hermes).
- Add `skills/` directory with 8 agent-specific skill files
- Add `spawn-skill.ts` module with path mapping, file reading, and injection
- Register "spawn" as a conditional setup step gated by `--beta recursive`
- Wire `injectSpawnSkill()` into orchestrate.ts postInstall flow
- Add 52 tests covering path mapping, append mode, file existence, injection
- Bump CLI version to 0.26.0 (minor: new feature)
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds non-empty guard to makeDockerExec to make the security boundary
explicit and prevent silent misuse with empty commands.
Fixes #2985
Agent: code-health
Co-authored-by: B <6723574+louisgv@users.noreply.github.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
* feat: add recursive spawn (--beta recursive)
Enables VMs to spawn child VMs. When --beta recursive is active:
- Injects SPAWN_PARENT_ID, SPAWN_DEPTH, SPAWN_BETA=recursive into .spawnrc
- Installs spawn CLI on the VM via install.sh
- Delegates cloud + OpenRouter credentials to the VM
- Tracks parent_id and depth on SpawnRecord for tree relationships
- Adds `spawn tree` command for full recursive tree view
- Adds `spawn history export` for pulling child history via SSH
- Adds `spawn list --json` and `spawn list --flat` flags
- Adds tree rendering in `spawn list` when parent-child relationships exist
- Adds cascade delete support in delete.ts
- Adds mergeChildHistory() for backward-pass history sync
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* docs: add recursive spawn to README
Add --beta recursive to beta features table, new commands
(spawn tree, spawn history export, spawn list --flat/--json)
to commands table, and a dedicated Recursive Spawn section
with usage examples for tree view and cascade delete.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* test: add cmdTree coverage tests to fix mock test CI
The CI coverage threshold (90% functions, 80% lines) was failing
because tree.ts had 0% coverage. Added tests that exercise cmdTree
with empty history, tree rendering, JSON output, flat records,
and deleted/depth labels. tree.ts now has 100% coverage.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix(security): validate cloudName and use valibot in pullChildHistory
- Add cloudName validation against ^[a-z0-9-]+$ to prevent
command injection in delegateCloudCredentials
- Export SpawnRecordSchema from history.ts and replace loose
type guard with valibot schema validation in pullChildHistory
- Resolve merge conflicts with main (include both docker and
recursive beta features)
Agent: pr-maintainer
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
* test: add installSpawnCli and delegateCloudCredentials coverage
Export and test installSpawnCli (success + timeout failure paths)
and delegateCloudCredentials (no creds, with creds, write failure,
mkdir failure paths) to improve orchestrate.ts function coverage.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: gritQL rule false positives and delete.ts coverage
- use TsAsExpression() AST node instead of backtick pattern to avoid
matching import aliases as type assertions
- export and test findDescendants() and pullChildHistory() to bring
delete.ts line coverage above the 35% threshold
- add 8 new tests for descendant finding and history pull edge cases
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: B <6723574+louisgv@users.noreply.github.com>
Co-authored-by: A <258483684+la14-1@users.noreply.github.com>
* fix: remove docker from --fast and fix docker cp into container
Two fixes for --beta docker:
1. Remove "docker" from --fast beta features — --fast was auto-enabling
--beta docker, pulling ghcr images that hang the session.
Users must now opt in explicitly with --beta docker.
2. Fix uploadFile in docker mode — .spawnrc was uploaded to the host
but never copied into the container. Add docker cp after SCP upload
so env vars and configs reach the agent inside the container.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: keep docker in --fast beta features
The docker cp fix resolves the hang — no need to remove docker from
--fast. The issue was missing file copy into the container, not the
docker mode itself.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: extract makeDockerRunner helper, fix uploadFile into container
Add makeDockerRunner() that wraps a CloudRunner so all commands and
file uploads target the Docker container. Replaces inline lambdas in
hetzner/main.ts and gcp/main.ts with a clean one-liner.
The key fix: uploadFile now docker cp's files into the container after
SCP — previously .spawnrc (API keys, env vars) only landed on the host,
so the agent inside the container had no config and hung.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix(security): shellQuote remotePath in docker cp command
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: remove local tarball download, use remote-only tarball install
The local-download-then-SCP-upload path was unnecessary complexity —
downloading a tarball to the user's machine just to re-upload it to the
VM is wasteful. The VM downloads directly from GitHub instead.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: force zeroclaw native runtime to prevent Docker container hang
ZeroClaw auto-detects Docker and launches in a container (pulling
ghcr.io/openrouterteam/spawn-zeroclaw), which hangs the interactive
session. Force native mode via ZEROCLAW_RUNTIME=native env var and
adapter = "native" in config.toml.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: disable openclaw Docker sandbox to prevent container hang
Same issue as zeroclaw — openclaw auto-detects Docker and runs agents
in containers, hanging the interactive session. Disable via
agents.defaults.sandbox.mode = off in config and fallback JSON.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: disable codex Docker sandbox to prevent container hang
Codex CLI also auto-detects Docker for sandboxing. Set
sandbox_mode = "danger-full-access" in config.toml — the VM itself
provides isolation, Docker sandboxing just causes hangs.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Extract duplicate dockerExec helper from gcp/main.ts and hetzner/main.ts
into shared makeDockerExec() in orchestrate.ts. Both local functions were
identical — wrapping commands with docker exec using DOCKER_CONTAINER_NAME
and shellQuote.
Co-authored-by: spawn-qa-bot <qa@openrouter.ai>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Add 180s timeout to uploadFileSprite to prevent indefinite hangs during
tarball uploads. Without a timeout, large tarballs or stalled Sprite
connections block the entire provisioning pipeline past the 720s E2E
provision timeout, causing agent binary not-found failures for openclaw,
zeroclaw, and codex.
Also skip the redundant remote tarball download fallback when a local
tarball was already downloaded but its upload/extract failed -- the
remote download would face the same extraction issues. This saves ~150s
in the fallback chain, leaving enough time for the live install to
complete within the provision timeout.
Fixes #2960
Agent: code-health
Co-authored-by: B <6723574+louisgv@users.noreply.github.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
- Suppress stdout+stderr from `claude install --force` to prevent duplicate
"successfully installed" messages (was printed up to 4x)
- Make logStepInline fall back to newline-separated output when stderr is not
a TTY, so SSH port polling status is readable in piped/captured contexts
- Consolidate post-install completion messages into a single clear milestone:
"Agent setup complete -- {agent} is ready on {cloud}"
- Bump CLI version to 0.25.16
Fixes #2899
Agent: ux-engineer
Co-authored-by: B <6723574+louisgv@users.noreply.github.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix: skip interactive session in headless mode (#2892)
When SPAWN_HEADLESS=1, the orchestrator now exits with code 0 after
provisioning completes instead of attempting to launch the agent
interactively. This fixes Claude Code (and other agents) failing with
"Input must be provided through stdin or --prompt" when spawned via
`--headless --output json` without a prompt.
The VM is fully provisioned and ready — callers can SSH in or use
`spawn connect` to start the agent manually.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: clean up SPAWN_HEADLESS env in test afterEach to prevent leaks
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude <claude@anthropic.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: L <6723574+louisgv@users.noreply.github.com>
Consolidate DOCKER_CONTAINER_NAME and DOCKER_REGISTRY constants from
gcp/main.ts and hetzner/main.ts into shared/orchestrate.ts. Both files
defined identical values ("spawn-agent" and "ghcr.io/openrouterteam"); they
now import the shared exports instead.
Bumps CLI patch version to 0.25.11.
Co-authored-by: spawn-qa-bot <qa@openrouter.ai>
Co-authored-by: L <6723574+louisgv@users.noreply.github.com>
Sprite CLI exits with code 1 on "connection closed" (not 255 like SSH).
The reconnect loop now treats exit code 1 on Sprite as a connection
drop, retrying up to 5 times with a 3s delay between attempts.
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
In fast mode, Promise.allSettled runs server boot, OAuth, and tarball
download concurrently. When all operations complete — especially after
Bun.serve.stop(true) in the OAuth flow removes its event loop handle —
the event loop can appear empty before the await continuation starts
new I/O operations. This causes Bun to exit silently with code 0,
dropping the user back to their shell after "Successfully obtained
OpenRouter API key via OAuth!" with no error.
Fix: keep a dummy setInterval handle alive during the fast-mode
concurrent section so the event loop never drains prematurely.
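
The keep-alive idea can be sketched like this. `withKeepAlive` and the 60-second interval are assumptions; any pending timer handle would serve:

```typescript
// Hypothetical sketch: hold a timer handle open while concurrent work runs.
async function withKeepAlive<T>(work: () => Promise<T>): Promise<T> {
  // The no-op interval gives the event loop a live handle for the whole section.
  const handle = setInterval(() => {}, 60_000);
  try {
    return await work();
  } finally {
    clearInterval(handle); // release the handle so the process can exit normally
  }
}
```

A caller would wrap the concurrent section, for example `await withKeepAlive(() => Promise.allSettled(tasks))`, where `tasks` stands for the boot, OAuth, and download promises.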
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add .js extensions to 124 relative imports that were missing them.
The codebase is "type": "module" (ESM) and the dominant pattern already
used .js extensions, but 35 files had a mix of extensionless and .js
imports — sometimes within the same file. Standardize to .js everywhere.
Agent: code-health
Co-authored-by: B <6723574+louisgv@users.noreply.github.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Fix 24 TypeScript strict mode errors across 7 production files:
- interactive.ts: guard against undefined `val` in validate callback
- list.ts: use already-narrowed `conn` variable instead of `selected.connection`
- run.ts: widen `buildCloudLines` defaults param to `Record<string, unknown>`
- digitalocean.ts: use `toRecord()` to safely drill into nested API responses;
capture narrowed `oauthCode` in const for async closure
- history.ts: backfill missing record IDs via `backfillRecordIds()` helper;
use `v.safeParse` output directly to get properly typed records
- index.ts: use `Manifest` type for `showUnknownCommandError` parameter
- orchestrate.ts: capture narrowed `tunnel` and `getConnectionInfo` in const
variables before async closures
Fixes #2821
Agent: code-health
Co-authored-by: B <6723574+louisgv@users.noreply.github.com>
When SSH exits with code 255 (connection dropped/timed out), retry up
to 5 times with 3s delay between attempts. Clean exits (0), Ctrl+C
(130), and agent crashes exit immediately without retrying.
Only applies to remote clouds — local sessions skip reconnect logic.
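
A sketch of the reconnect loop. The function and option names are hypothetical, and "up to 5 times" is read here as five retries after the initial session:

```typescript
// Hypothetical sketch: rerun the session while the exit code signals a dropped connection.
async function runWithReconnect(
  runSession: () => Promise<number>,
  opts: {
    maxRetries?: number;
    delayMs?: number;
    sleep?: (ms: number) => Promise<void>;
    isDropped?: (code: number) => boolean;
  } = {},
): Promise<number> {
  const maxRetries = opts.maxRetries ?? 5;
  const delayMs = opts.delayMs ?? 3000;
  const sleep = opts.sleep ?? ((ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms)));
  // SSH reports 255 for a dropped connection; 0, 130, and other codes are final.
  const isDropped = opts.isDropped ?? ((code: number) => code === 255);

  let code = await runSession();
  for (let retry = 1; retry <= maxRetries && isDropped(code); retry++) {
    await sleep(delayMs);
    code = await runSession();
  }
  return code;
}
```

Making `isDropped` injectable is one way a cloud whose CLI uses a different exit code for connection loss could reuse the same loop.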
Signed-off-by: L <6723574+louisgv@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: L <6723574+louisgv@users.noreply.github.com>
* feat: never-give-up resilience layer — retry every failure instead of exiting
Add retryOrQuit() helper to shared/ui.ts that prompts "Try again? (Y/n)"
after any recoverable failure. Wrap all fatal exit points with retry loops:
- Cloud auth (Hetzner, DigitalOcean, AWS, GCP): retry after 3 failed tokens
- API key acquisition: retry after 3 failed OAuth+manual attempts
- Server creation: retry on any createServer failure (both fast & sequential)
- SSH readiness: retry on waitForReady timeout
- Agent install: retry on install failure
- Pre-launch hooks: retry on preLaunch failure
Non-interactive mode (SPAWN_NON_INTERACTIVE=1) still throws immediately.
Ctrl+C at any retry prompt exits cleanly.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* feat(e2e): add AI-driven interactive test harness
Add --interactive mode to the E2E test framework. Instead of running spawn
in headless mode (SPAWN_NON_INTERACTIVE=1), this spawns the CLI in a real
PTY and uses Claude Haiku to respond to prompts like a human user would.
New files:
- sh/e2e/interactive-harness.ts — Bun script that drives the PTY + AI loop
- sh/e2e/lib/interactive.sh — Bash integration with the E2E framework
Usage:
e2e.sh --cloud hetzner claude --interactive
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* feat(qa): wire interactive E2E into scheduled QA pipeline
- Add `e2e-interactive` option to workflow_dispatch in qa.yml
- Add `e2e-interactive` run mode to qa.sh (loads cloud creds + ANTHROPIC_API_KEY)
- Runs `e2e.sh --cloud hetzner claude --interactive` directly (no Claude Code needed)
- Defaults to hetzner (cheapest), overridable via E2E_INTERACTIVE_CLOUD/AGENT env vars
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* feat(qa): schedule interactive E2E daily at 6am UTC
Runs one agent (claude) on one cloud (hetzner) with AI-driven prompts.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix(qa): offset soak cron to avoid GitHub Actions schedule dedup
GitHub Actions deduplicates overlapping cron schedules into one run,
making `github.event.schedule` unpredictable. The soak test at `0 3 * * 1`
was getting absorbed by the `0 */4 * * *` quality sweep and never firing
as reason=soak.
Move soak to `30 1 * * 1` (Monday 1:30am UTC) — safely between the
0am and 4am quality sweep slots. Interactive E2E at `0 6 * * *` is
already safe (between the 4am and 8am slots).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix(qa): add e2e-interactive to trigger server valid reasons
The trigger server validates reason query params against an allowlist.
Without this, the `e2e-interactive` dispatch returns 400.
Also note: `soak` is already in VALID_REASONS in the repo but the running
service on the QA VM is stale — needs a restart to pick up both soak and
e2e-interactive reasons.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* perf: skip cloud-init for minimal-tier agents with tarballs/snapshots
Ubuntu 24.04 base images already have curl + git, so minimal-tier
agents (claude, opencode, zeroclaw, hermes) don't need the cloud-init
package install step when using tarballs or snapshots.
Adds skipCloudInit flag to CloudOrchestrator — set automatically when
(tarball || snapshot) && tier === "minimal". Each cloud's waitForReady
checks this flag and calls waitForSshOnly instead of waitForCloudInit.
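
The flag derivation stated above, written out as a small sketch. The function name and option shape are hypothetical; the boolean expression is the one from the message:

```typescript
// Hypothetical sketch of how skipCloudInit could be derived.
function shouldSkipCloudInit(opts: { tarball: boolean; snapshot: boolean; tier: string }): boolean {
  // Skip only when a prebuilt artifact is in use and the agent needs no extra packages.
  return (opts.tarball || opts.snapshot) && opts.tier === "minimal";
}
```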
Saves ~30-60s on minimal-tier agent deploys with --fast or --beta tarball.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* docs: add --fast mode and updated beta features to README
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* docs: remove timing table from README
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: L <6723574+louisgv@users.noreply.github.com>
* feat: add --fast flag for parallel server boot + setup
Adds `--fast` flag that runs server creation concurrently with API key
prompt, account check, pre-provision hooks, tarball download, and env
config generation. Once SSH is up, uploads tarball and applies config.
--fast implies --beta tarball and --beta images, enabling snapshots
and pre-built tarballs automatically.
Flow without --fast (sequential):
auth → API key → preProvision → size → create → boot → install → configure
Flow with --fast (parallel):
auth → size → [create+boot | API key | preProvision | tarball download | accountCheck]
→ upload tarball → inject env → configure
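The parallel flow above can be sketched with `Promise.all`. This is an illustration under assumptions, not the orchestrate.ts implementation: the `FastSteps` interface and `runFastPipeline` are hypothetical names, and each step is reduced to an opaque async function.

```typescript
type Step = () => Promise<void>;

// Hypothetical grouping of the steps named in the flow diagram.
interface FastSteps {
  auth: Step;
  size: Step;
  // create+boot, API key, preProvision, tarball download, accountCheck
  parallel: Step[];
  uploadTarball: Step;
  injectEnv: Step;
  configure: Step;
}

async function runFastPipeline(steps: FastSteps): Promise<void> {
  // auth and size stay sequential because later steps depend on them.
  await steps.auth();
  await steps.size();
  // Start every independent step at once and wait for all of them.
  await Promise.all(steps.parallel.map((step) => step()));
  // Remaining steps need SSH and the downloaded tarball, so they run in order.
  await steps.uploadTarball();
  await steps.injectEnv();
  await steps.configure();
}
```

Note that `Promise.all` rejects as soon as any parallel step fails; a real pipeline would likely need cleanup for a server that was already created.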
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* feat: add --beta parallel as standalone opt-in for parallel setup
--beta parallel enables the parallel orchestration without implying
tarball/images. --fast still implies all three (tarball + images +
parallel).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
tryCatchIf(isFileError) only catches filesystem errors (ENOENT, EACCES),
but JSON.parse throws SyntaxError on corrupted preferences.json. This
was the same bug fixed in 16a2f180 across 4 files, but orchestrate.ts
was missed. A corrupted ~/.spawn/preferences.json would crash the CLI
instead of gracefully falling back to no preferred model.
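A minimal sketch of the intended fallback, assuming the preferences file holds a JSON object with a string `model` field (the field name and `parsePreferredModel` are hypothetical). It shows why a filesystem-only error filter is not enough: `JSON.parse` throws `SyntaxError`, which has to be handled separately.

```typescript
// Returns the preferred model, or undefined when the content is corrupt or
// has an unexpected shape. Any other error type is rethrown.
function parsePreferredModel(raw: string): string | undefined {
  try {
    const parsed: unknown = JSON.parse(raw);
    if (parsed !== null && typeof parsed === "object" && !Array.isArray(parsed)) {
      const model = (parsed as Record<string, unknown>)["model"];
      return typeof model === "string" ? model : undefined;
    }
    return undefined;
  } catch (err) {
    // Corrupted JSON surfaces as SyntaxError, not as ENOENT/EACCES.
    if (err instanceof SyntaxError) {
      return undefined;
    }
    throw err;
  }
}
```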
Agent: code-health
Co-authored-by: B <6723574+louisgv@users.noreply.github.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Installs a systemd timer + oneshot service that updates the agent binary
and system packages every 6 hours without disrupting running instances.
Agent update safety:
- Binary agents (Go, Rust): Linux keeps old inode in memory; safe to replace
- npm agents: Node.js caches modules at startup; running processes unaffected
- New version takes effect on next restart via the existing restart loop
System update safety:
- Disables Ubuntu's unattended-upgrades to prevent dpkg lock contention
- Uses flock -w 300 on /var/lib/dpkg/lock-frontend before apt operations
- DEBIAN_FRONTEND=noninteractive with --force-confdef/--force-confold
User-facing:
- "Auto-update" option in setup multiselect (default on, user can uncheck)
- Skipped for local cloud and non-systemd systems
- Non-fatal: setup failure doesn't block agent launch
- Logs to /var/log/spawn-auto-update.log
Timer: 15min after boot, then every 6h with 30min random jitter.
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace hardcoded "bash" shell references with platform-aware utilities so
spawn works natively from PowerShell on Windows without WSL or Git Bash.
- New shared/shell.ts: isWindows(), getLocalShell(), getInstallScriptUrl(),
getInstallCmd(), getWhichCommand() with platform override for testability
- local/local.ts: use getLocalShell() for runLocal() and interactiveSession()
- commands/run.ts: spawnScript/runScriptHeadless use getLocalShell()
- commands/update.ts: Windows downloads install.ps1, runs via PowerShell
- update-check.ts: Windows auto-update uses install.ps1; "where" replaces "which"
- shared/orchestrate.ts: PowerShell-compatible .spawnrc setup for local Windows
- Remote SSH commands unchanged — remote servers are always Linux
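A rough sketch of the platform-aware helpers listed above, with the platform passed in explicitly so tests can override it. The signatures are assumptions and may not match shared/shell.ts; the `"powershell"` executable name in particular is a guess (it could equally be `pwsh`). Only the `where`/`which` substitution is stated in this message.

```typescript
// Platform strings follow Node's process.platform values ("win32", "linux", "darwin").
function isWindows(platform: string): boolean {
  return platform === "win32";
}

// Assumed executable names; the real helper may resolve a full path instead.
function getLocalShell(platform: string): string {
  return isWindows(platform) ? "powershell" : "bash";
}

// "where" replaces "which" on Windows, as noted for update-check.ts.
function getWhichCommand(platform: string): string {
  return isWindows(platform) ? "where" : "which";
}
```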
Closes #2726
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: L <6723574+louisgv@users.noreply.github.com>
When the user selects the GitHub CLI step in setup options (interactive
prompt or --steps github), offerGithubAuth() was silently returning early
if no local gh token was found by detectGithubAuth(). This made the step
unreachable for users without gh installed locally — exactly the ones who
need remote setup most.
Fix: accept an `explicitlyRequested` parameter in offerGithubAuth(). When
true, skip the githubAuthRequested guard and always run the remote install.
The orchestrator passes enabledSteps?.has("github") as this flag.
detectGithubAuth() still auto-enables the step when a local token exists
(convenience forwarding), but can no longer block a user-explicit request.
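The decision described above reduces to a small predicate. This is a sketch only: `isExplicitlyRequested` and `shouldRunGithubSetup` are hypothetical helpers, not the body of offerGithubAuth().

```typescript
// The orchestrator derives the flag from the selected setup steps.
function isExplicitlyRequested(enabledSteps?: Set<string>): boolean {
  return enabledSteps?.has("github") ?? false;
}

// An explicit request always runs the remote install; a detected local
// token only auto-enables the step and can no longer veto it.
function shouldRunGithubSetup(explicitlyRequested: boolean, localTokenDetected: boolean): boolean {
  return explicitlyRequested || localTokenDetected;
}
```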
Fixes #2672
Agent: issue-fixer
Co-authored-by: B <6723574+louisgv@users.noreply.github.com>
Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
Move "Custom model" from OpenClaw-specific to common setup steps so
every agent shows it in the setup menu. Add modelEnvVar to agents that
support model override via environment variable:
- Kilo Code: KILOCODE_MODEL
- ZeroClaw: ZEROCLAW_MODEL
- Hermes: LLM_MODEL
- Junie: JUNIE_MODEL
When a custom model is selected, the env var is injected into .spawnrc
alongside the other agent env vars. OpenClaw continues to use its
existing configure() path. Claude and Codex don't have modelEnvVar
since they handle model routing differently.
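A minimal sketch of the env injection described above. The `AgentDef` shape and `buildAgentEnv` are hypothetical; only the `modelEnvVar` field name and the env var names listed in this message come from the source.

```typescript
// Hypothetical agent definition; modelEnvVar is absent for agents that
// route models differently.
interface AgentDef {
  name: string;
  modelEnvVar?: string;
}

// Returns the env vars to write into .spawnrc, adding the model override
// only when a custom model was chosen and the agent declares a modelEnvVar.
function buildAgentEnv(
  agent: AgentDef,
  baseEnv: Record<string, string>,
  customModel?: string,
): Record<string, string> {
  const env: Record<string, string> = { ...baseEnv };
  if (customModel && agent.modelEnvVar) {
    env[agent.modelEnvVar] = customModel;
  }
  return env;
}
```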
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
* feat: add downloadFile to CloudRunner + local OpenClaw config merge
Add `downloadFile(remotePath, localPath)` to the CloudRunner interface
and implement it across all 6 cloud providers (Hetzner, AWS, GCP,
DigitalOcean, Sprite, Local) — mirroring the existing `uploadFile` with
reversed SCP direction.
Replace the OpenClaw config write with a download → deep-merge → upload
flow so config merging happens in our own linted TypeScript instead of
a remote script.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* refactor: move isPlainObject and deepMerge to shared utils
Extract `isPlainObject` to `shared/type-guards.ts` and `deepMerge` to
`shared/parse.ts` so they're reusable across the codebase.
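For illustration, the two helpers might look roughly like this. The merge semantics are an assumption (nested plain objects merge recursively, everything else including arrays is replaced by the override) and may differ from the version in shared/parse.ts.

```typescript
// Type guard for non-null, non-array objects.
function isPlainObject(val: unknown): val is Record<string, unknown> {
  return val !== null && typeof val === "object" && !Array.isArray(val);
}

// Recursively merges override into a copy of base without mutating either.
function deepMerge(
  base: Record<string, unknown>,
  override: Record<string, unknown>,
): Record<string, unknown> {
  const out: Record<string, unknown> = { ...base };
  for (const [key, value] of Object.entries(override)) {
    const existing = out[key];
    out[key] =
      isPlainObject(existing) && isPlainObject(value)
        ? deepMerge(existing, value)
        : value;
  }
  return out;
}
```

In the download → deep-merge → upload flow, `base` would be the config fetched from the server and `override` the keys spawn wants to set, so unrelated user settings survive.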
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* refactor: promote isPlainObject to shared package, use across codebase
Move `isPlainObject` from cli/type-guards.ts into
@openrouter/spawn-shared so it can be used everywhere. Replace
inline `val !== null && typeof val === "object" && !Array.isArray(val)`
checks in:
- shared/type-guards.ts (toRecord, toObjectArray)
- shared/parse.ts (parseJsonObj)
- cli/manifest.ts (isValidManifest)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* refactor: remove type-guards re-export, import directly from spawn-shared
Delete `packages/cli/src/shared/type-guards.ts` (was just a re-export
barrel). All 35 consuming files now import `getErrorMessage`, `isString`,
`isNumber`, `isPlainObject`, `toRecord`, etc. directly from
`@openrouter/spawn-shared`.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude <claude@anthropic.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
WhatsApp setup is too complex for normal users (QR scan + separate
device + pairing). Remove it from the setup options entirely.
Also change multiselect defaults to nothing pre-selected — let users
opt in to what they want instead of pre-selecting for them.
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
When an agent has an SSH tunnel (e.g., OpenClaw dashboard), store the
tunnel remote port and browser URL template in connection.metadata at
spawn time. On reconnect via `spawn ls` → "Enter agent", re-establish
the SSH tunnel and open the dashboard automatically.
- Add saveMetadata() to history.ts for merging key-value pairs into records
- Store tunnel_remote_port and tunnel_browser_url_template in orchestrate.ts
- Re-establish tunnel in cmdEnterAgent (connect.ts) when metadata is present
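A sketch of the metadata merge described above, assuming records carry string key-value pairs under `connection.metadata`. The `SpawnRecord` shape shown here and the `mergeMetadata` name are hypothetical; the real saveMetadata() also persists to history, which is omitted.

```typescript
// Hypothetical, trimmed-down record shape.
interface SpawnRecord {
  id: string;
  connection: { metadata?: Record<string, string> };
}

// Returns a copy of the record with the new entries merged over any
// existing metadata; existing keys not named in entries are kept.
function mergeMetadata(record: SpawnRecord, entries: Record<string, string>): SpawnRecord {
  return {
    ...record,
    connection: {
      ...record.connection,
      metadata: { ...(record.connection.metadata ?? {}), ...entries },
    },
  };
}
```

On reconnect, the presence of `tunnel_remote_port` in the stored metadata is what would signal that a tunnel needs to be re-established.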
Co-authored-by: Claude <claude@anthropic.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: messaging UX — silence doctor, fix groupPolicy, remove early WhatsApp pairing
- Set groupPolicy to "open" for both Telegram and WhatsApp (was
"allowlist" with empty allowFrom, causing doctor warnings)
- Suppress doctor warning spam by redirecting openclaw config set
stdout to /dev/null
- Remove WhatsApp pairing prompt (appeared immediately after QR scan
before user could message the bot — now just tells them the command)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* fix: improve Telegram/WhatsApp pairing instructions
Add step-by-step instructions for Telegram pairing so users know to
search for their bot in Telegram and message it. Improve WhatsApp
post-link instructions to explain how contacts pair.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* feat: pre-select Telegram in setup options as recommended channel
Telegram has the smoothest setup UX (bot token + pairing code) compared
to WhatsApp (QR scan + separate device). Pre-select it alongside Chrome
in the multiselect and label it as "recommended" in the hint.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Telegram is a built-in channel, not a plugin. Replace broken
`openclaw plugins enable telegram` (OOM) and `openclaw channels add`
(doesn't exist) with proper setup:
- Write channel config (botToken, dmPolicy: pairing, groups) directly
into the atomic JSON config file during setup
- After gateway starts, prompt user to pair via
`openclaw pairing approve <channel> <CODE>`
- WhatsApp: QR scan via `openclaw channels login`, then pairing
- Bump version to 0.17.16
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
* fix: set telegram groupPolicy to open during channel setup
OpenClaw defaults groupPolicy to "allowlist" with an empty groupAllowFrom,
which silently drops all group messages. Set it to "open" after adding the
Telegram channel so group messages work out of the box.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* fix: use OpenClaw config file for Telegram setup instead of broken CLI commands
Telegram is a built-in channel in OpenClaw, not a plugin. The previous
approach used `openclaw plugins enable telegram` (caused OOM on 2GB) and
`openclaw channels add --channel telegram` (command doesn't exist).
Now writes Telegram config (botToken, enabled, groupPolicy) directly into
the atomic JSON config file during setup. Also sets groupPolicy to "open"
so group messages work out of the box instead of being silently dropped.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* fix: use openclaw onboard for channel setup instead of manual config
OpenClaw has a built-in `openclaw onboard` command that interactively
guides users through Telegram/WhatsApp channel setup. Use that instead
of manually prompting for tokens and writing config ourselves.
- Remove custom Telegram token prompt from agent-setup.ts
- Remove broken `openclaw channels add` and `openclaw plugins enable`
- Run `openclaw onboard` after gateway starts for channel setup
- Base config (API key, gateway, model) still written atomically
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>