Commit graph

104 commits

Author SHA1 Message Date
A
262d081756
refactor: move fly TS into cli/src/fly/, add build-clouds.sh (#1604)
Move all fly TypeScript files from fly/lib/*.ts and fly/main.ts into
cli/src/fly/. This gives them access to cli/node_modules (@clack/prompts),
biome linting, and the existing bun:test infrastructure — no symlinks or
NODE_PATH hacks needed.

The org picker now uses @clack/prompts select() directly (static import,
bundled at build time).

New: cli/build-clouds.sh — auto-discovers cli/src/*/main.ts and bundles
each into {cloud}.js. Scalable to future cloud TS migrations:
  bash cli/build-clouds.sh        # build all
  bash cli/build-clouds.sh fly    # build one

Shims now check for cli/src/fly/main.ts (local) or download fly.js from
GitHub releases (remote curl|bash).

Co-authored-by: lab <6723574+louisgv@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-02-21 12:34:09 -08:00
Ahmed Abushagur
76cb8ead3b
fix: unpin Codex version, restore wire_api=responses (#1608)
Reverts the 0.94.0 pin — install latest Codex and use the required
wire_api="responses" format.

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-21 20:12:23 +00:00
A
d7ff0739a2
fix: fly auth token deprecated + org picker + macaroon tokens (#1603)
* fix: fly auth token deprecated + org picker + macaroon discharge tokens

Three fixes for the fly/ TypeScript provider:

1. `fly auth token` is deprecated — newer flyctl outputs a message, not
   a token. Now tries `fly tokens create org --expiry 24h` first, with
   `fly auth token` as fallback. Uses org tokens (not deploy) since
   spawn needs to create new apps.

2. Token sanitization stripped macaroon discharge tokens at commas
   (`fm2_[^ ,]*` → `fm2_\S+`). The full composite token
   `fm2_xxx,fm2_yyy,fo1_zzz` is now preserved.

3. Org picker upgraded from numbered 1/2 input to arrow-key interactive
   selector with cursor navigation, scroll windowing, and fallback to
   numbered list when TTY is unavailable.

Also fixes: testFlyToken fallback sent `Bearer FlyV1 ...` (double prefix)
for macaroon tokens — now dispatches FlyV1 vs Bearer correctly.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: never run test/mock.sh locally — opens browser, CI only

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: lab <6723574+louisgv@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-02-21 11:06:19 -08:00
A
2ef621cc69
refactor: convert fly/ cloud provider from bash to TypeScript (#1601) (#1602)
Replace fly/lib/common.sh (741 lines of bash) with a TypeScript
implementation using Bun runtime. The fly/ provider was the most
complex bash code in the project — recent fixes (#1597, #1599, #1600)
highlight the pain of debugging HTTP calls, JSON parsing, and multi-step
auth flows in shell.

New TypeScript modules:
- fly/lib/ui.ts — logging, prompts, validation (zero deps)
- fly/lib/fly.ts — API client (fetch), auth chain, org listing, provisioning
- fly/lib/oauth.ts — OpenRouter OAuth via Bun.serve(), key management
- fly/lib/agents.ts — typed agent configs for all 6 agents
- fly/main.ts — orchestrator entry point

Agent .sh files become thin shims (~30 lines) that install bun if needed,
download TS sources for curl|bash execution, and delegate to main.ts.

Test coverage:
- 44 TypeScript unit tests (bun test) for pure logic
- 4 fly failure-mode tests (mock.sh) for error scenarios
- All existing test suites pass (110 run.sh, 76 mock.sh)

Co-authored-by: lab <6723574+louisgv@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-02-21 10:41:34 -08:00
A
119e0f3660
fix: use fly auth login (OAuth) instead of manual token paste (#1600)
* fix: use fly auth login (OAuth) instead of manual token paste

The fly auth flow was falling back to ensure_api_token_with_provider
which prompts users to manually paste a token from the dashboard.
This is bad UX when `fly auth login` exists and handles browser-based
OAuth automatically.

New auth chain:
1. FLY_API_TOKEN env var (if set and valid)
2. Saved config (~/.config/spawn/fly.json)
3. Existing fly CLI session (fly auth token)
4. fly auth login — browser OAuth flow (NEW)

Removes the manual token paste fallback entirely. If fly CLI isn't
installed, fails with a clear install instruction.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: add manual token paste as final fallback after OAuth

Auth chain is now:
1. FLY_API_TOKEN env var
2. Saved config
3. fly auth token (existing session)
4. fly auth login (OAuth)
5. Manual token paste (last resort)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: lab <6723574+louisgv@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-02-21 17:43:34 +00:00
A
fae0d764fd
fix: fail loudly with root cause when fly org fetch fails (#1599)
Previously _fly_list_orgs silently swallowed all errors (2>/dev/null
everywhere) and _fly_prompt_org fell back to manual input with no
diagnostic info. Now both paths (fly CLI + GraphQL) surface specific
failure reasons — missing CLI, empty output, parse errors with raw
JSON, GraphQL errors — and _fly_prompt_org fails hard with actionable
debug hints instead of silently defaulting.

Also always show the org picker when fetch succeeds (no silent default).

Co-authored-by: lab <6723574+louisgv@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-02-21 09:14:40 -08:00
A
c25566d810
fix: fall back to GraphQL API for Fly.io org listing when flyctl fails (#1597)
_fly_list_orgs previously relied solely on `flyctl orgs list --json`.
When flyctl is absent or its output is unexpected, the user gets dumped
into a manual "Enter Fly.io org slug" prompt — even though we already
have a valid API token.

Now tries flyctl first, then falls back to the Fly.io GraphQL API
(`api.fly.io/graphql`) using the saved FLY_API_TOKEN. Works with
both Bearer and FlyV1 macaroon tokens.

Co-authored-by: Claude <claude@anthropic.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-02-21 09:03:21 -08:00
A
a1a1ae85dc
refactor: replace python3 with bun+TS in fly/lib/common.sh (#1596)
* refactor: replace python3 with bun+TS in fly/lib/common.sh, fix token validation

Three targeted fixes to the Fly.io library:

1. Replace all python3 with bun+TypeScript:
   - _fly_json: stdin-piped field extractor via bun -e (no eval, no env var
     size limits — handles arbitrarily large API responses)
   - _fly_json_ids: dedicated machine ID extractor for destroy_server
   - _fly_list_orgs: bun -e with flat dict + nodes/organizations support
   - list_servers: bun -e formatted table output
   Zero python3 invocations remain in the file.

2. Dual-endpoint _test_fly_token: tries Machines API first (deploy tokens),
   falls back to api.fly.io/v1/user (OAuth/personal tokens). Prevents
   rejecting valid personal tokens that lack Machines API access.

3. No more eval(): _fly_json uses direct property access (d[field]) instead
   of python3 eval(expr), eliminating the code injection surface entirely.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: always prompt user for Fly.io org, never silently default

_fly_prompt_org now asks the user directly when the org list can't be
fetched, instead of silently falling back to "personal".

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: lab <6723574+louisgv@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-02-21 08:56:24 -08:00
A
279eddd689
fix: verify Node.js install after nodesource setup, add fallback (#1581) (#1592)
The nodesource setup_22.x script can run successfully but leave nodejs
uninstalled on Fly.io machines. Add post-install verification with
`which node && node --version`, fall back to default Debian nodejs
package if nodesource fails, increase timeout from 120s to 180s, and
report a clear error if node is unavailable after all attempts.

Agent: code-health

Co-authored-by: B <6723574+louisgv@users.noreply.github.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-21 11:46:25 -05:00
A
51493ccc75
refactor: rewrite fly/lib/common.sh — eliminate redundancy and fragility (#1594)
- Replace _fly_json_get (stdin pipe) + _fly_parse_error with unified _fly_json
  that takes JSON string, python expression, and default as arguments
- Collapse 5-step auth chain into 2 steps: fly auth token → ensure_api_token_with_provider
- Replace _validate_fly_token (dual-endpoint fallback) with _test_fly_token (single Machines API call)
- Replace python3-based _fly_build_machine_body with printf + json_escape
- Remove fly ssh console -C fallback from run_server (FLY_MACHINE_ID always set)
- Remove base64 fallback from upload_file (stdin pipe via fly machine exec only)
- Remove collision re-prompt loop from create_server, fail with actionable message
- Add _fly_cleanup_on_failure to delete app when machine creation fails
- Lighten wait_for_cloud_init: install curl/unzip/git only (drop build-essential, python3-pip, zsh)
- Delete unused functions: _try_flyctl_auth, _try_fly_browser_auth, get_fly_org, _fly_json_get, _fly_parse_error

730 → 536 lines. All 112 mock tests pass, zero regressions.

Co-authored-by: lab <6723574+louisgv@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-02-21 08:43:00 -08:00
A
3db19d90ac
fix: accept comma in Fly.io macaroon tokens & handle flat org dict (#1593)
Real `fly auth token` returns comma-separated multi-segment macaroon
tokens (fm2_...,fm2_...,fo1_...). The token validation regex rejected
commas, forcing re-auth on every run. Add comma to the allowed charset.

`fly orgs list --json` returns a flat dict ({"slug": "Name"}) on some
flyctl versions, not the list/nodes format the parser expected. Detect
and handle both formats so the org picker works correctly.

Co-authored-by: lab <6723574+louisgv@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-02-21 08:20:48 -08:00
A
ed9501235b
fix: bash 3.2 compat — sed for pattern sub + split local var=$(cmd) (#1572, #1571) (#1587)
Issue #1572: Replace bash 4+ ${//} pattern substitution in generate_env_config
with sed for macOS bash 3.2 compatibility.

Issue #1571: Split local var=$(cmd) declarations in fly/lib/common.sh so
exit codes propagate correctly with set -e on macOS bash 3.2.

Agent: code-health

Co-authored-by: B <6723574+louisgv@users.noreply.github.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-21 07:49:17 -08:00
A
aa4174db9e
fix: add retry logic to wait_for_cloud_init for error recovery (#1575) (#1588)
Add _fly_run_with_retry helper that wraps run_server with configurable
retry count, sleep interval, and timeout. Apply it to package manager
and installer commands in wait_for_cloud_init so transient failures
(network timeouts, apt lock contention) no longer abort the entire
cloud-init sequence.

Agent: complexity-hunter

Co-authored-by: B <6723574+louisgv@users.noreply.github.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-21 10:45:32 -05:00
A
282803a9bb
fix: add debug logging to ensure_fly_token auth chain (#1574) (#1589)
Add log_info/log_warn messages at each step of the 5-step auth chain
so users can see which auth method is being tried and why fallbacks occur.

Agent: ux-engineer

Co-authored-by: B <6723574+louisgv@users.noreply.github.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-21 10:42:46 -05:00
L
3eca4221c6
fix: address architectural brittleness in Fly.io integration (issue #1581) (#1585)
Resolves sub-issues #1569, #1570, #1576, #1577, #1578, #1580.

#1569 — /wait endpoint replaces polling loop:
  _fly_wait_for_machine_start now uses GET /apps/{app}/machines/{id}/wait
  ?state=started&timeout=90. One blocking API call instead of 30 polls.

#1570 — fly machine exec replaces fly ssh console for run_server:
  run_server uses 'fly machine exec MACHINE_ID --app APP -- bash -c cmd'
  (direct API, no WireGuard tunnel) when FLY_MACHINE_ID is set. Falls
  back to 'fly ssh console -C' for environments without a machine ID.

#1576 — App name collision loop capped at 5 retries:
  Prevents infinite re-prompt. Suggests FLY_APP_NAME env var after 5
  failed attempts.

#1577 — destroy_server errors are now reported:
  All fly_api calls check for error responses. Reports failed machine
  deletions and exits non-zero on app deletion failure instead of
  always logging "destroyed" regardless of outcome.

#1578 — bun replaced with python3 for all JSON parsing:
  _fly_json_get, _fly_build_machine_body, _fly_list_orgs, destroy_server,
  list_servers all use python3 -c now. python3 is universally available;
  bun was only available after cloud-init completed on the target machine.

#1580 — upload_file uses stdin pipe instead of base64 string injection:
  'fly machine exec ... -- bash -c "cat > path" < local_file' streams
  file content directly. Eliminates the command-length/injection risk of
  embedding base64 content in a shell argument string.

test/mock.sh: add 'fly machine exec' case to the fly CLI mock.
test/fixtures/fly/_env.sh: add FLY_MACHINE_ID to test env.

Co-authored-by: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
2026-02-21 07:19:23 -08:00
L
b42d1a52d6
fix: don't bail on non-zero exit from fly orgs list + pass JSON as arg (#1582)
Some flyctl versions exit non-zero even on success. Removed '|| return 1'
so the output is always captured. Empty output is still a failure.

Also pass JSON as a bun argument (process.argv[1]) instead of piping via
stdin — avoids any Bun.stdin buffering issue in the _fly_list_orgs context.

Co-authored-by: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
2026-02-21 07:03:56 -08:00
L
07ce24b710
fix: capture interactive_pick output and export FLY_ORG in _fly_prompt_org (#1568)
interactive_pick() echoes the selected value to stdout — it does NOT
export the env var. _fly_prompt_org was calling it without capturing
the output, so FLY_ORG was never set and the echo printed the org
slug as a raw string to the terminal.

Fix: org=$(interactive_pick ...) && export FLY_ORG.
Also guard with the standard FLY_ORG / SPAWN_NON_INTERACTIVE early-exit.

Co-authored-by: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
2026-02-21 06:40:23 -08:00
L
6fab9b6ae5
fix: use fly orgs list for org picker + strip ANSI from token capture (#1567)
1. _fly_list_orgs: use 'fly orgs list --json' (flyctl) instead of the
   non-existent api.fly.io/v1/organizations REST endpoint. Pipe through
   interactive_pick (same pattern as Hetzner/GCP pickers) so org
   selection uses the shared arrow-key / fzf / numbered-list picker.

2. fly auth token captures: add 'sed s/\x1b...//g' to strip ANSI color
   escape codes. flyctl may output the token with terminal colors even
   when stdout is piped; the ESC character (\033) fails the security
   character check (^[a-zA-Z0-9._/@:+=\ -]+$) causing the token to be
   marked malformed and cleared on the next run.

Co-authored-by: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
2026-02-21 06:35:44 -08:00
L
0f63596a12
feat: use Fly.io API to list orgs in picker for _fly_prompt_org (#1566)
Replace flyctl-based org listing with a direct API call to
api.fly.io/v1/organizations, feeding results into _display_and_select
(the shared arrow-key / fzf / numbered-list picker).

_fly_list_orgs():
  - Calls GET /v1/organizations with Bearer auth
  - Emits pipe-delimited "slug|name (type)" lines for _display_and_select

_fly_prompt_org():
  - Single org: auto-selects silently
  - Multiple orgs: shows arrow-key picker via _display_and_select
    (defaults to "personal" if that slug is in the list)
  - API unavailable: falls back to safe_read prompt with "personal" default

Co-authored-by: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
2026-02-21 06:31:57 -08:00
L
7f6d99f90f
fix: clear corrupt saved token + capture only first line of fly auth token (#1565)
Two fixes for persistent Fly.io auth failures:

1. shared/common.sh — _load_token_from_config():
   When the saved token fails the security character check, auto-delete
   the corrupt config file instead of silently returning 1. This prevents
   the user from being stuck in a loop where every run loads a malformed
   token (from a previous failed auth attempt) and immediately fails.
   Message changed from error to warn: "Saved token is malformed —
   clearing cached credentials."

2. fly/lib/common.sh — _try_flyctl_auth() and _try_fly_browser_auth():
   Pipe 'fly auth token' output through 'head -1' to capture only the
   first line. Newer flyctl versions may print warnings/metadata after
   the token on subsequent lines; previously these got concatenated into
   the token string via $() and could introduce characters that fail
   the security validator (newlines stripped by _sanitize_fly_token, but
   concatenated text from warning lines could contain unusual chars).

Co-authored-by: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
2026-02-21 06:23:35 -08:00
L
4a7c0fab81
fix: skip token validation after flyctl browser auth + fix validation endpoint (#1564)
1. Skip _validate_fly_token after 'fly auth login':
   Token from flyctl is definitionally valid — calling the Machines API
   (api.machines.dev) with a user OAuth token causes a false failure
   because that API only accepts deploy tokens, not OAuth user tokens.

2. Fix _validate_fly_token endpoint:
   Now tries api.fly.io/v1/user (Bearer, accepts OAuth tokens) first,
   then falls back to the Machines API for deploy tokens. Prevents
   'no tokens found in header' false failures for env/config tokens.

Co-authored-by: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
2026-02-21 06:16:19 -08:00
L
e8eb836534
fix: use fly auth login for browser auth + add org selection (#1563)
Root cause of persistent 'no tokens found in header':
The CLI Sessions API returns a user-level OAuth code that requires
flyctl's internal token exchange step to become a valid API token.
We were using the raw access_token directly, bypassing that step.

_try_fly_browser_auth() — now delegates to flyctl:
  - Calls 'fly auth login' directly (flyctl handles browser open,
    polling, and token exchange internally)
  - Gets the final token via 'fly auth token' (always correct format)
  - Falls back to manual token entry if flyctl unavailable

_fly_prompt_org() — new function:
  - Called after successful auth (flyctl, browser, or manual)
  - Lists orgs via 'fly orgs list --json' if multiple exist
  - Shows picker or simple prompt; defaults to "personal"
  - Exports FLY_ORG for use in app creation / list_servers
  - Skipped when FLY_ORG is already set or SPAWN_NON_INTERACTIVE=1

Co-authored-by: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
2026-02-21 06:11:32 -08:00
L
f1ca9cbce1
fix: smart bun mock + restore Bun JSON parsing in fly/lib (reverts #1553) (#1556)
* Revert "fix: handle raw m2. macaroon tokens from Fly.io CLI Sessions API (#1552)"

This reverts commit 9fc59ded1c.

* Revert "fix: replace bun -e with python3 in fly/lib/common.sh to fix 18 mock test failures (#1553)"

This reverts commit 328e6a6da4.

* fix: bun passthrough mock + restore Bun JSON parsing in fly/lib

Reverts PR #1553 (which reverted Bun in favour of Python to fix tests)
and instead fixes the root cause: the test/mock.sh bun mock was a dumb
no-op that discarded all output, causing _fly_json_get() to return empty
string and every fly script to fail with "Failed to extract machine ID".

test/mock.sh — smart bun mock:
- `bun -e "..."` (inline eval, used for JSON processing) → delegates to
  the real bun binary so _fly_json_get() / _fly_build_machine_body()
  actually produce correct output during tests
- All other bun invocations (install, run, etc.) → logged no-op as before

fly/lib/common.sh:
- Restores Bun-based _fly_json_get(), _fly_build_machine_body(),
  destroy_server machine-ID extraction, and list_servers table formatter
- Re-applies m2. macaroon token fix from #1552 (which was lost when
  #1553 reverted the whole file):
  _sanitize_fly_token now wraps raw m2.* tokens as "FlyV1 m2.*" so
  CLI Sessions OAuth tokens are sent with the correct auth header

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

* test: add node fallback to bun mock for CI environments

CI (GitHub Actions ubuntu-latest) has node but not bun, so the bun
passthrough mock silently returns empty string, causing _fly_json_get
to fail and 18 Fly.io tests to break. Add a fallback chain:
real bun -> node (with Bun.stdin.text() polyfill) -> exit 0.

Agent: test-engineer
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

---------

Co-authored-by: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
2026-02-21 06:01:58 -08:00
A
c1f730c69a
fix: replace eval with declare and add base64 validation (#1557)
* fix: replace eval with declare and add base64 validation (issues #1554, #1555)

- shared/key-request.sh: replace eval with declare for defense-in-depth
  (eval avoided when safer declare alternative exists; validated vars stay safe)
- fly/lib/common.sh: add base64 output alphabet validation before shell
  interpolation, matching daytona/lib/common.sh proven-safe pattern

Fixes #1554
Fixes #1555

Agent: team-lead
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: use printf -v instead of declare for safe variable assignment in key-request.sh

Addresses security review feedback on PR #1557. The declare approach
created a local variable whose export had no effect outside the function.
printf -v assigns directly in the current scope without eval or command
substitution.

Agent: pr-maintainer
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: B <6723574+louisgv@users.noreply.github.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-21 04:47:33 -05:00
L
9fc59ded1c
fix: handle raw m2. macaroon tokens from Fly.io CLI Sessions API (#1552)
Root cause of 'no tokens found in header' after browser OAuth:

The Fly.io CLI Sessions API returns raw macaroon tokens (e.g. m2.XXXX)
WITHOUT the 'FlyV1 ' prefix. _sanitize_fly_token only handled fm2_
tokens, so m2. tokens fell through unchanged and were sent as:
  Authorization: Bearer m2.XXXX
Fly.io's Machines API expects FlyV1 macaroon format, not Bearer.

Fixes:
- _sanitize_fly_token: add m2.* case that wraps as 'FlyV1 m2.XXX'
- _try_fly_browser_auth polling: eagerly wrap any non-FlyV1 token with
  'FlyV1 ' prefix at the source, before it's echoed back to the caller

Token format handling after fix:
  m2.XXXX         → FlyV1 m2.XXXX      ← CLI Sessions API (was broken)
  fm2_XXXX        → FlyV1 fm2_XXXX     ← still handled (unchanged)
  FlyV1 fm2_XXXX  → FlyV1 fm2_XXXX    ← already correct (unchanged)
  eyJhbGci...     → Bearer eyJ...      ← legacy JWT (fallback to manual)

Co-authored-by: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
2026-02-20 23:54:34 -08:00
A
328e6a6da4
fix: replace bun -e with python3 in fly/lib/common.sh to fix 18 mock test failures (#1553)
bun is not installed in the mock test environment (CI or local test runs).
The mock harness stubs bun as a no-op logger, so _fly_json_get() always
returned empty string, causing "Failed to extract machine ID" and 18 fly
script test failures in bash test/mock.sh.

Replace all 4 bun -e invocations with equivalent python3 code:
- _fly_json_get: extract top-level JSON field from stdin
- _fly_build_machine_body: build machine creation JSON body
- _fly_destroy_app: extract machine IDs array
- list_servers: format apps table

python3 is always available and already has a pass-through mock in
test/mock.sh (like /usr/bin/python3). No behavior change for real runs.

Before: bash test/mock.sh fly → 18 passed, 18 failed
After:  bash test/mock.sh fly → 36 passed, 0 failed

Agent: code-health

Co-authored-by: B <6723574+louisgv@users.noreply.github.com>
Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-02-21 02:19:46 -05:00
L
fe2c0b024b
fix: prevent premature exit when Fly.io CLI session access_token is null (#1551)
The polling loop in _try_fly_browser_auth() was returning immediately
on the first poll (t=2s) because:

  access_token=$(... "d.get('access_token','')")

When the JSON has "access_token": null (before the user completes
browser auth), Python's print(None) outputs the string "None".
Bash $() captures "None" as non-empty, passes [[ -n "$access_token" ]],
and returns it as the token — before the user even sees the browser.

Then _validate_fly_token(FLY_API_TOKEN="None") sends:
  Authorization: Bearer None
which Fly.io rejects with:
  verify: invalid token: no tokens found in header

Fix:
  d.get('access_token') or ''   →  None or '' = ''  (empty, keeps polling)
  + explicit != "None" guard for belt-and-suspenders

Co-authored-by: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
2026-02-20 21:37:34 -08:00
L
4ae781d2a8
fix: remove 2>/dev/null from token validation calls in auth flow (#1549)
Token validation functions (test_hcloud_token, test_do_token,
test_daytona_token, _validate_fly_token) contain rich diagnostic
log_error/log_warn messages with error details and fix instructions.
Calling them with 2>/dev/null silently discarded all that output,
leaving users with no explanation when their token was rejected.

shared/common.sh — ensure_api_token_with_provider():
  Remove 2>/dev/null from "${test_func}" in both the env-var and
  config-file validation branches, so callers like test_hcloud_token
  can print API error details and remediation steps.

fly/lib/common.sh — ensure_fly_token():
  Remove 2>/dev/null from both _validate_fly_token calls (config-file
  path and post-browser-OAuth path) so users see why validation failed.

Note: Issue 1 (API polling in _poll_instance_once) is intentionally
left with 2>/dev/null — suppressing curl errors during a 60-iteration
polling loop prevents terminal flooding and is handled by '|| true'.

Co-authored-by: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
2026-02-20 21:27:42 -08:00
L
8c437435eb
fix: show Fly.io login URL immediately by removing 2>/dev/null suppression (#1548)
2>/dev/null on _try_fly_browser_auth() was swallowing all stderr,
including the auth URL printf and log_step messages that the user
needs to see for sandbox/headless environments.

Also add a 'Fetching Fly.io login URL...' log_step before the API
call so the user gets immediate feedback while the session is created
(the curl call can take 1-2 seconds before the URL is available).

Co-authored-by: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
2026-02-20 21:12:17 -08:00
A
d6c53d838f
fix: source .spawnrc directly in agent launch commands for reliable env loading (#1546)
24 agent scripts (codex, opencode, kilocode, openclaw across 6 clouds) used
`source ~/.zshrc && <agent>` which loads env vars indirectly via a hook.
This fails silently when .zshrc has errors or the hook install was non-fatal,
causing agents to launch without OPENROUTER_API_KEY.

Change to `source ~/.spawnrc 2>/dev/null; source ~/.zshrc 2>/dev/null; <agent>`
which loads env vars directly (matching claude/zeroclaw pattern) and tolerates
.zshrc failures without blocking the agent.

Agent: code-health

Co-authored-by: B <6723574+louisgv@users.noreply.github.com>
Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-02-20 23:37:03 -05:00
L
be176e4cdb
fix: confirm kebab resource name + improve Fly.io sandbox auth (#1525)
shared/common.sh — prompt_spawn_name():
  Replace log_info with safe_read so user confirms (or overrides) the
  derived kebab-case resource name before it's used for any cloud resource:
    Spawn name (e.g. "My Dev Box"): My Claude Box
    Resource name [my-claude-box]: ⏎   ← press Enter to accept

fly/lib/common.sh — _try_fly_browser_auth():
  - Print auth URL prominently on its own line (not just as a warning)
    so sandbox users can copy-paste it into their local browser
  - Suppress open_browser errors (|| true) so the script doesn't abort
    if no browser is available
  - Add explicit sandbox hint while polling
  - After 120s timeout: offer manual API token entry as a last resort
    with a direct link to fly.io/dashboard → Tokens

Co-authored-by: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
2026-02-20 07:12:49 -08:00
Ahmed Abushagur
b5d174a472
fix: pin Codex to 0.94.0 + wire_api=chat for multi-turn stability (#1518)
* fix: switch Codex wire_api from "responses" to "chat" for multi-turn stability

The Responses API format causes "Invalid Responses API request" errors on
the second turn and beyond — conversation history items round-trip through
OpenRouter with null content fields and missing IDs that fail validation.
Chat Completions format is fully supported and avoids this issue.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: pin Codex to 0.94.0 + wire_api=chat for multi-turn stability

OpenRouter's Responses API proxy drops required fields (id, content) from
conversation-history items on multi-turn requests, causing "Invalid
Responses API request" at input[6]+. Codex >=0.97.0 removed wire_api=chat
support (openai/codex#10157), so we pin to 0.94.0 — the last release where
Chat Completions format still works.

Tracking: https://github.com/openai/codex/issues/12114
TODO: unpin once OpenRouter /responses handles round-trip correctly.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-20 04:49:35 -05:00
A
50dd2f26ed
fix: repair Fly.io saved token loading (_load_token_from_config misuse) (#1513)
ensure_fly_token() called _load_token_from_config with only 1 argument
(config file path) but the function requires 3 (config_file, env_var_name,
provider_name). The empty env_var_name fails the security validation regex,
so the function always returns 1 silently. Users with saved Fly.io tokens
in ~/.config/spawn/fly.json were forced to re-authenticate every session.

Agent: code-health

Co-authored-by: B <6723574+louisgv@users.noreply.github.com>
Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-02-20 03:54:41 -05:00
A
3280a44c45
feat: add browser-based OAuth login for Fly.io + token sanitizer (#1506)
Replace the prompt-first auth flow with a browser-based CLI session
flow (same as `fly auth login`). The new auth chain is:

1. Environment variable (FLY_API_TOKEN)
2. Saved config file (~/.config/spawn/fly.json)
3. flyctl CLI (`fly auth token`)
4. Browser OAuth via Fly.io CLI Sessions API (NEW)
5. Manual token prompt (last resort fallback)

The browser flow creates a CLI session via POST /api/v1/cli_sessions,
opens the auth URL in the user's browser, then polls for the access
token. This is the same mechanism flyctl uses internally.

Also add _sanitize_fly_token() to handle the Fly dashboard copy button
which includes the display name before the token (e.g. "Deploy Token
FlyV1 fm2_..."). The sanitizer strips everything before "FlyV1" or
extracts bare "fm2_" tokens, and trims whitespace/newlines. Applied
at every token entry point (env var, config, manual prompt).

Co-authored-by: lab <6723574+louisgv@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-02-19 22:50:19 -08:00
L
d5690a8b11
feat: spawn name prompt + kebab resource naming across all clouds (#1507)
* feat: add spawn name prompt and project confirmation to GCP flow

Ask for spawn name upfront (before auth), derive kebab-case default for
VM naming, and confirm the current GCP project before using it.

New interaction order:
  1. Spawn name: "My Dev Box" → kebab "my-dev-box" exported as
     GCP_INSTANCE_NAME_KEBAB
  2. gcloud auth + project confirm: "Current project: X  Keep? [Y/n]"
     If no → project picker shown
  3. SSH key
  4. Machine type picker (existing)
  5. Zone picker (existing)
  6. Instance name prompt: "Instance name [my-dev-box]: "
     User can press Enter to accept or type a custom name

New functions:
  _to_kebab_case()         — lowercases, replaces non-alnum with hyphens
  _gcp_prompt_spawn_name() — prompts for display name, exports kebab default;
                             honours SPAWN_NAME env var set by CLI (--name flag)

Modified:
  _gcp_resolve_project()  — adds Y/n confirmation when project already set
  get_server_name()       — shows kebab default in prompt, accepts Enter
  cloud_authenticate()    — calls _gcp_prompt_spawn_name first

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

* feat: add spawn name prompt to all clouds via shared/common.sh

Move _to_kebab_case() and prompt_spawn_name() to shared/common.sh so all
clouds get upfront spawn name prompting and kebab-based resource naming.

shared/common.sh:
  + _to_kebab_case()    — "My Dev Box" → "my-dev-box"
  + prompt_spawn_name() — asks for display name, exports SPAWN_NAME_DISPLAY
                          and SPAWN_NAME_KEBAB; skips if already set;
                          honours SPAWN_NAME env var from CLI --name flag
  ~ get_resource_name() — replaces silent SPAWN_NAME fallback with a visible
                          prefilled default: "Enter server name [my-dev-box]: "

Per-cloud changes (cloud_authenticate gains prompt_spawn_name first):
  hetzner, fly, aws, daytona, digitalocean, sprite — one-line change each

gcp/lib/common.sh:
  - Remove _to_kebab_case()        (now in shared)
  - Remove _gcp_prompt_spawn_name() (now in shared as prompt_spawn_name)
  ~ cloud_authenticate: _gcp_prompt_spawn_name → prompt_spawn_name
  ~ get_server_name: simplified back to get_validated_server_name
    (shared get_resource_name now shows the kebab default in the prompt)

Result — every cloud shows this flow upfront:
  Spawn name (e.g. "My Dev Box"): My Claude Box
  ℹ Resource name: my-claude-box
  ...
  Enter server name [my-claude-box]: ⏎

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

* fix: use "Use project '...'?" instead of "Keep this project?" in GCP prompt

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
2026-02-19 22:22:59 -08:00
Ahmed Abushagur
9e2f84adf0
fix: use native OpenRouter model_provider for Codex CLI config (#1490)
Codex CLI's OPENAI_BASE_URL env var approach causes "Invalid Responses
API request" errors because OpenRouter doesn't fully support the
Responses API wire format via base URL override. Switch all 8 codex
scripts to use ~/.codex/config.toml with model_provider="openrouter"
which uses the native OpenRouter integration.

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-19 18:47:40 -05:00
A
b29cf4a75d
fix: sync cloud READMEs with current agent list (#1486)
READMEs across all 8 clouds still referenced 5 removed agents
(NanoClaw, Cline, gptme, Plandex, Continue) and were missing
ZeroClaw. Users following these docs got 404 errors.

Agent: ux-engineer

Co-authored-by: B <6723574+louisgv@users.noreply.github.com>
Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-02-19 17:47:57 -05:00
L
a67d83ed38
feat: reorder agents and remove NanoClaw (#1477)
* feat: add ZeroClaw agent (14.9k stars, native OpenRouter support)

Add ZeroClaw — a Rust-based autonomous AI assistant framework by
Harvard/MIT/Sundai.Club communities — across all 8 clouds.

Scripts: local, hetzner, digitalocean, fly, aws, gcp, daytona, sprite
Install: bootstrap.sh with --install-rust + --install-system-deps
Config:  zeroclaw onboard --provider openrouter (via agent_configure)
Env:     OPENROUTER_API_KEY + ZEROCLAW_PROVIDER=openrouter (native support)
Launch:  zeroclaw agent

Note: ZeroClaw compiles from Rust source (~5-10 min build time).
A build-time warning is shown to set expectations.

Also update test/mock-curl-script.sh to stub zeroclaw install URLs and
add zeroclaw to mock agent binaries in test/mock.sh.

Bump CLI version 0.5.8 → 0.5.9.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

* feat: reorder agents and remove NanoClaw

New agent order: claude → openclaw → zeroclaw → codex → opencode → kilocode

- Remove NanoClaw (8 scripts + manifest entry + matrix entries + README row)
- Reorder manifest.json agents section to match new order
- Reorder matrix entries by cloud (local/hetzner/fly/aws/daytona/digitalocean/gcp/sprite)
  with agents in new order within each cloud block
- Update README matrix table row order
- Update test/mock.sh mock agent binary list to match
- Bump CLI version 0.5.9 → 0.5.10

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
2026-02-19 11:39:03 -08:00
L
f7458952b0
feat: remove Cline, gptme, Plandex, and Continue agents (#1475)
Delete 32 agent scripts ({cloud}/{cline,gptme,plandex,continue}.sh across
8 clouds), remove the 4 agents from manifest.json with all their matrix
entries, update README matrix rows, remove stale mock agent binaries and
plandex.ai URL patterns from test harness, update CLI help examples to use
remaining agents, and bump version 0.5.7 → 0.5.8.

Co-authored-by: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
2026-02-19 11:12:46 -08:00
A
5612cda40b
feat: remove Aider, Goose, Open Interpreter, Gemini CLI, Amazon Q from matrix (#1472)
These 5 agents are being dropped from the Spawn matrix. This removes
45 agent scripts across 9 clouds, cleans the manifest, test fixtures,
READMEs, CLI source, and shared library comments.

Co-authored-by: lab <6723574+louisgv@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-02-19 12:31:00 -05:00
A
d8785c3d0b
security: fix command injection in cline auth via remote env var expansion (#1473)
All 9 cline.sh scripts embedded OPENROUTER_API_KEY directly into the
cloud_run command string, allowing shell metacharacter injection on the
remote server. Fix by escaping the dollar sign (\${OPENROUTER_API_KEY})
so the variable is expanded on the remote machine where it's already
set via agent_env_vars()/generate_env_config, not locally before being
passed to cloud_run.

Agent: security-auditor

Co-authored-by: B <6723574+louisgv@users.noreply.github.com>
Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-02-19 12:25:16 -05:00
Ahmed Abushagur
8ee54d01a8
fix: harden agent reliability + security across all clouds (#1468)
* docs: add spawn delete command to README

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: harden openclaw across all clouds — validation, reliability, performance

Fixes multiple issues causing openclaw to break on most clouds:

Bugs fixed:
- Double-prefixed model ID (openrouter/openrouter/auto) in config generation
- AWS gateway starting without env vars (missing .zshrc source)
- DigitalOcean sourcing .spawnrc instead of .zshrc for gateway
- Destructive rm -rf ~/.openclaw on re-runs (now mkdir -p)

Validation added:
- API key checked against OpenRouter /auth/key endpoint with re-prompt on failure
- Model ID verified against OpenRouter model list with re-prompt loop
- openrouter/auto and openrouter/free bypass model check

Reliability improvements:
- Standardized gateway launch with </dev/null & disown across all 9 clouds
- Gateway log auto-displayed on startup timeout for diagnostics
- 2GB swap added to cloud-init to prevent OOM on small VMs
- Portable install timeout (10 min) with macOS gtimeout fallback

Performance:
- Reordered spawn_agent: OAuth runs while VM provisions (saves 30-60s)
- Fly.io: bumped to 2GB RAM + 2 shared CPUs for openclaw
- Fly.io: tries bun first (faster), falls back to npm

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: skip sudo in gh install when running as root (Fly.io containers)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: address PR review — skip validation in tests, quote escaped cmd, escape model_id

- verify_openrouter_key and verify_openrouter_model skip network calls when
  SPAWN_SKIP_API_VALIDATION, BUN_ENV=test, or NODE_ENV=test is set
- install_agent timeout wrapper now quotes the escaped command for defense in depth
- model_id in openclaw JSON now uses json_escape() for consistency

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: remove double-escaping in install_agent that broke shell operators

install_agent() was wrapping commands with printf '%q' + bash -c before
passing them to the run callback. But run callbacks (run_server, run_sprite,
ssh_run_server) already handle escaping for remote transport. The double-
escaping turned && || > | into literal characters, causing 'source' to
treat the entire command as a single filename.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: use local github-auth.sh instead of curling from main

When running from a local checkout, base64-encode the local
github-auth.sh and send it inline to the remote machine. This
ensures fixes (like the sudo skip for root) take effect immediately
without waiting for a merge to main.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: handle github-auth errors gracefully instead of terminating

GitHub CLI setup is optional — failures should not abort the spawn
session. Guard both run_callback calls in offer_github_auth with
|| log_warn so the script continues even if gh install fails.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: use GOOGLE_GEMINI_BASE_URL to route Gemini CLI through OpenRouter

Gemini CLI ignores OPENAI_BASE_URL — it uses GEMINI_API_KEY to talk
directly to Google's API. The OpenRouter key is not a valid Google
API key, so all requests fail with "API key not valid".

Use GOOGLE_GEMINI_BASE_URL to redirect Gemini CLI to OpenRouter's
endpoint. Fixes all 9 cloud gemini scripts + manifest.json.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: guard optional spawn_agent hooks so failures don't kill the session

With set -eo pipefail, any unguarded failure terminates the script.
Several optional operations in spawn_agent were unguarded:

- agent_configure: config file uploads (agent works with defaults)
- agent_save_connection: convenience JSON for spawn list
- agent_pre_launch: gateway daemons, startup hooks
- agent_pre_provision: pre-provision prompts
- .spawnrc shell hooks: hooking env vars into .bashrc/.zshrc

These now log warnings and continue instead of aborting. Critical
steps (cloud_authenticate, agent_install, cloud_provision) still
exit on failure.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: audit and fix env vars, escaping, and error handling across all agents

Audit findings from 3 parallel agents, fixes applied:

**Env vars (4 agents fixed across 9 clouds each = 36 scripts):**
- Amazon Q: remove fake OPENAI_* vars (Q uses AWS auth, can't use OpenRouter)
- Cline: replace OPENAI_* env vars with `cline auth -p openrouter` command
- Open Interpreter: drop OPENAI_* vars, use only OPENROUTER_API_KEY (native support via --model flag)
- NanoClaw: add ANTHROPIC_BASE_URL to .env file (was missing, requests went to Anthropic directly)

**Escaping:**
- execute_agent_non_interactive: replace printf '%q' with single-quote wrapping to avoid double-escaping on Fly.io

**Manifest updated** for amazonq, cline, interpreter entries.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: use setsid to detach openclaw gateway daemon from SSH sessions

The gateway daemon launch (`nohup openclaw gateway ... & disown`) hangs
on all clouds because SSH/exec channels wait for child FDs to close.
setsid creates a new session, fully detaching the daemon so the channel
can close immediately. Falls back to nohup where setsid is unavailable.

Consolidates the daemon launch into a shared start_openclaw_gateway()
function used by all 9 cloud scripts.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: configure npm global prefix for non-root clouds (AWS, GCP, OVH)

AWS Lightsail, GCP, and OVH SSH as non-root users (ubuntu/login user),
so `npm install -g` fails with EACCES on /usr/local/lib/node_modules/.

Fix: configure npm prefix to ~/.npm-global during cloud-init/setup and
add ~/.npm-global/bin to the SSH PATH prefix so agent install commands
find globally-installed npm binaries without sudo.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: remove broken OpenRouter routing from Gemini CLI scripts

Gemini CLI uses Google's native API format (/v1beta/models/:streamGenerateContent),
not the OpenAI-compatible format (/v1/chat/completions). No base URL override can
bridge this — the request formats are fundamentally incompatible. Same situation
as Amazon Q (uses vendor-specific auth/API).

Removed GEMINI_API_KEY and GOOGLE_GEMINI_BASE_URL from all 9 scripts + manifest.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: auto-install AWS CLI and gcloud SDK when missing

Instead of printing manual install instructions and exiting, both CLIs
now auto-install:

- AWS: downloads official .pkg (macOS) or .zip (Linux) installer
- GCP: uses brew cask on macOS, Google's tarball installer on Linux

Falls back to manual instructions if auto-install fails.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: nanoclaw — install Docker on Linux, fix hardcoded /root/ path

Two issues broke NanoClaw on all clouds:

1. .env upload hardcoded /root/nanoclaw/.env — fails on non-root clouds
   (AWS=ubuntu, GCP=user, OVH=ubuntu). Now uses upload_config_file with
   $HOME which expands on the remote side.

2. NanoClaw requires a container runtime. On Linux it uses Docker, but
   Docker was never installed. Added Docker install via get.docker.com
   to all cloud scripts (with sudo where SSH user is non-root).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: address security review findings from PR #1463

- Reject symlinked github-auth.sh before base64-encoding (falls back to remote URL)
- Hide API key from process list using curl -K - instead of -H in verify_openrouter_key

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: quote OPENROUTER_API_KEY in cline auth to prevent command injection

Unquoted variable in `cline auth -p openrouter -k ${OPENROUTER_API_KEY}`
allows shell metacharacters in the key to execute arbitrary commands on
the remote server. Wrapping in escaped double quotes prevents expansion.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-19 08:36:24 -05:00
Ahmed Abushagur
be904cbe1c
fix: install_agent double-escaping + github-auth reliability (#1460)
* docs: add spawn delete command to README

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: harden openclaw across all clouds — validation, reliability, performance

Fixes multiple issues causing openclaw to break on most clouds:

Bugs fixed:
- Double-prefixed model ID (openrouter/openrouter/auto) in config generation
- AWS gateway starting without env vars (missing .zshrc source)
- DigitalOcean sourcing .spawnrc instead of .zshrc for gateway
- Destructive rm -rf ~/.openclaw on re-runs (now mkdir -p)

Validation added:
- API key checked against OpenRouter /auth/key endpoint with re-prompt on failure
- Model ID verified against OpenRouter model list with re-prompt loop
- openrouter/auto and openrouter/free bypass model check

Reliability improvements:
- Standardized gateway launch with </dev/null & disown across all 9 clouds
- Gateway log auto-displayed on startup timeout for diagnostics
- 2GB swap added to cloud-init to prevent OOM on small VMs
- Portable install timeout (10 min) with macOS gtimeout fallback

Performance:
- Reordered spawn_agent: OAuth runs while VM provisions (saves 30-60s)
- Fly.io: bumped to 2GB RAM + 2 shared CPUs for openclaw
- Fly.io: tries bun first (faster), falls back to npm

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: skip sudo in gh install when running as root (Fly.io containers)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: address PR review — skip validation in tests, quote escaped cmd, escape model_id

- verify_openrouter_key and verify_openrouter_model skip network calls when
  SPAWN_SKIP_API_VALIDATION, BUN_ENV=test, or NODE_ENV=test is set
- install_agent timeout wrapper now quotes the escaped command for defense in depth
- model_id in openclaw JSON now uses json_escape() for consistency

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: remove double-escaping in install_agent that broke shell operators

install_agent() was wrapping commands with printf '%q' + bash -c before
passing them to the run callback. But run callbacks (run_server, run_sprite,
ssh_run_server) already handle escaping for remote transport. The double-
escaping turned && || > | into literal characters, causing 'source' to
treat the entire command as a single filename.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: use local github-auth.sh instead of curling from main

When running from a local checkout, base64-encode the local
github-auth.sh and send it inline to the remote machine. This
ensures fixes (like the sudo skip for root) take effect immediately
without waiting for a merge to main.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: handle github-auth errors gracefully instead of terminating

GitHub CLI setup is optional — failures should not abort the spawn
session. Guard both run_callback calls in offer_github_auth with
|| log_warn so the script continues even if gh install fails.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: use GOOGLE_GEMINI_BASE_URL to route Gemini CLI through OpenRouter

Gemini CLI ignores OPENAI_BASE_URL — it uses GEMINI_API_KEY to talk
directly to Google's API. The OpenRouter key is not a valid Google
API key, so all requests fail with "API key not valid".

Use GOOGLE_GEMINI_BASE_URL to redirect Gemini CLI to OpenRouter's
endpoint. Fixes all 9 cloud gemini scripts + manifest.json.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: guard optional spawn_agent hooks so failures don't kill the session

With set -eo pipefail, any unguarded failure terminates the script.
Several optional operations in spawn_agent were unguarded:

- agent_configure: config file uploads (agent works with defaults)
- agent_save_connection: convenience JSON for spawn list
- agent_pre_launch: gateway daemons, startup hooks
- agent_pre_provision: pre-provision prompts
- .spawnrc shell hooks: hooking env vars into .bashrc/.zshrc

These now log warnings and continue instead of aborting. Critical
steps (cloud_authenticate, agent_install, cloud_provision) still
exit on failure.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-19 05:21:55 -05:00
Ahmed Abushagur
159ad49fec
fix: harden openclaw across all clouds (#1456)
* docs: add spawn delete command to README

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: harden openclaw across all clouds — validation, reliability, performance

Fixes multiple issues causing openclaw to break on most clouds:

Bugs fixed:
- Double-prefixed model ID (openrouter/openrouter/auto) in config generation
- AWS gateway starting without env vars (missing .zshrc source)
- DigitalOcean sourcing .spawnrc instead of .zshrc for gateway
- Destructive rm -rf ~/.openclaw on re-runs (now mkdir -p)

Validation added:
- API key checked against OpenRouter /auth/key endpoint with re-prompt on failure
- Model ID verified against OpenRouter model list with re-prompt loop
- openrouter/auto and openrouter/free bypass model check

Reliability improvements:
- Standardized gateway launch with </dev/null & disown across all 9 clouds
- Gateway log auto-displayed on startup timeout for diagnostics
- 2GB swap added to cloud-init to prevent OOM on small VMs
- Portable install timeout (10 min) with macOS gtimeout fallback

Performance:
- Reordered spawn_agent: OAuth runs while VM provisions (saves 30-60s)
- Fly.io: bumped to 2GB RAM + 2 shared CPUs for openclaw
- Fly.io: tries bun first (faster), falls back to npm

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: skip sudo in gh install when running as root (Fly.io containers)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: address PR review — skip validation in tests, quote escaped cmd, escape model_id

- verify_openrouter_key and verify_openrouter_model skip network calls when
  SPAWN_SKIP_API_VALIDATION, BUN_ENV=test, or NODE_ENV=test is set
- install_agent timeout wrapper now quotes the escaped command for defense in depth
- model_id in openclaw JSON now uses json_escape() for consistency

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: remove double-escaping in install_agent that broke shell operators

install_agent() was wrapping commands with printf '%q' + bash -c before
passing them to the run callback. But run callbacks (run_server, run_sprite,
ssh_run_server) already handle escaping for remote transport. The double-
escaping turned && || > | into literal characters, causing 'source' to
treat the entire command as a single filename.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-19 09:25:48 +00:00
A
f3ffb6caed
fix: broken error message in multi-creds validation, predictable temp path (#1442)
1. _multi_creds_validate referenced undefined help_url variable, causing
   empty "Get new credentials from: " error messages when OVH credential
   validation fails. Added help_url as parameter and pass it from caller.

2. _spawn_inject_env_vars (used by 130+ agent scripts via spawn_agent)
   uploaded credentials to static /tmp/env_config path. The older
   inject_env_vars_ssh/inject_env_vars_cb functions document this as a
   symlink attack vector and use randomized paths. Fixed to match.

3. Removed dead inject_env_vars_fly and inject_env_vars_sprite functions
   (all agent scripts now use spawn_agent -> _spawn_inject_env_vars).

Agent: code-health

Co-authored-by: B <6723574+louisgv@users.noreply.github.com>
Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-02-18 07:51:28 -05:00
Ahmed Abushagur
db4aaa0c73
fix: prevent SSH hangs, fix command escaping, pin Python 3.12 for aider (#1439)
* fix: use uv --upgrade to ensure Python 3.13-compatible Pillow across all clouds

aider-chat on Python 3.13 fails with `ImportError: cannot import name
'_imaging' from 'PIL'` when an old Pillow version (pre-10.4) is resolved
— those releases have no Python 3.13 binary wheels, so the C extension
is missing at runtime.

Replace `--with 'Pillow>=10.2.0'` (which was silently broken — the `>`
and single quotes get mangled by `printf '%q'` in run_server before the
command reaches the remote machine) with `--upgrade`, which forces all
transitive deps including Pillow to their latest compatible versions.

Also adds a plain-text echo before the install so users see progress
instead of a silent hang during the 2-4 minute install.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* test: update aider/gptme/interpreter assertions from pip to uv

The install method for aider, gptme, and open-interpreter was changed
from pip to `uv tool install` across all clouds. The mock test
assertions still checked for the old `pip.*install.*` patterns, causing
9 failures (3 agents × 3 clouds).

Update patterns to match the actual `uv tool install` commands now used
in all cloud scripts.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* ci: trigger test run for uv assertion fix

* fix: prevent SSH hangs, restore stderr, fix command escaping across clouds

- Add < /dev/null to ssh_run_server and generic_ssh_wait to prevent SSH
  stdin theft causing sequential install/verify/configure steps to hang
- Add ServerAliveInterval, ServerAliveCountMax, ConnectTimeout to default
  SSH_OPTS so long-running installs don't silently drop on flaky networks
- Remove 2>/dev/null from Fly.io run_server so remote command errors are
  no longer silently swallowed (--quiet flag still suppresses flyctl noise)
- Fix Fly.io printf '%q' double-quoting: remove extra quotes around
  $escaped_cmd that prevented the remote shell from consuming escapes,
  breaking && || | operators in commands
- Remove broken printf '%q' from Daytona run_server and interactive_session
  where it escaped shell operators into literal characters since daytona exec
  has no intermediate shell layer
- Pin aider to --python 3.12 instead of --with audioop-lts across all clouds

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: add --pty to fly ssh console for interactive sessions

fly ssh console -C does not allocate a pseudo-terminal by default,
causing interactive TUI agents (aider, claude) to fail with
"Input is not a terminal (fd=0)" or completely unresponsive input.

Adding --pty forces PTY allocation, matching how other clouds handle
interactive sessions (SSH uses -t, Sprite uses -tty).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-18 04:23:15 -05:00
Ahmed Abushagur
d9e6d058e0
fix: use uv --upgrade to ensure Python 3.13-compatible Pillow across all clouds (#1436)
aider-chat on Python 3.13 fails with `ImportError: cannot import name
'_imaging' from 'PIL'` when an old Pillow version (pre-10.4) is resolved
— those releases have no Python 3.13 binary wheels, so the C extension
is missing at runtime.

Replace `--with 'Pillow>=10.2.0'` (which was silently broken — the `>`
and single quotes get mangled by `printf '%q'` in run_server before the
command reaches the remote machine) with `--upgrade`, which forces all
transitive deps including Pillow to their latest compatible versions.

Also adds a plain-text echo before the install so users see progress
instead of a silent hang during the 2-4 minute install.

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-18 03:21:59 -05:00
Ahmed Abushagur
633ce8eaac
feat: upgrade default server sizes, fix Fly.io agent installs, improve E2E tests (#1428)
- Upgrade default VM sizes across clouds for better agent performance:
  - Hetzner: cpx11 → cx23 (with cx22 fallback support for deprecated types)
  - DigitalOcean: s-2vcpu-2gb → s-2vcpu-4gb
  - Daytona: 2048MB → 4096MB memory
  - Oracle: VM.Standard.E2.1.Micro → VM.Standard.A1.Flex
  - OVH: d2-2 → d2-4
- Fix Fly.io agent failures:
  - Add Node.js + build-essential to wait_for_cloud_init (fixes npm-based agents)
  - Prepend PATH in interactive_session (fixes "source not found" errors)
- Fix openclaw installs across clouds: use explicit PATH export instead of source
- Fix DigitalOcean token validation (check "uuid" not "id")
- Fix AWS cloud-init: chown .bashrc/.zshrc to ubuntu user
- Improve Hetzner fallback: add "cheapest available" as last-resort fallback
- Upgrade E2E tests: per-combo auto-fix, credential collection, robustness fixes

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-17 22:17:08 -08:00
Ahmed Abushagur
963144ecbd
fix: use pipx to install Aider across all clouds (#1429)
Ubuntu 24.04 blocks system-wide pip installs (PEP 668 externally-managed-
environment). Switch all aider.sh scripts from `pip install aider-chat`
to `python3 -m pip install pipx && pipx install aider-chat`, which
installs into an isolated virtualenv and works on all target distros.

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-17 22:16:39 -08:00
Ahmed Abushagur
22b6a402f4
feat: E2E test harness, QA pipeline integration, macOS compat linter (#1425)
* feat: add QA upgrade — macOS compat linter, per-agent mock assertions

Layer 1: macOS compat linter (test/macos-compat.sh)
- 12 rules (MC001–MC012) catching bash 3.2 incompatibilities
- Detects: base64 -w0 file args, non-portable echo flags, source <(),
  ((var++)), read -d, nounset flag, sed -i, date %N, local -n,
  declare -A, ${var,,}, and |&
- Added to CI lint.yml in warn-only mode for burn-in
- Integrated as Phase 0.5 in qa-dry-run.sh

Layer 2: Per-agent mock assertions
- test/fixtures/_shared_agent_assertions.sh with install checks
  for all 15 agents (claude, openclaw, aider, goose, etc.)
- Integrated into test/mock.sh via _run_agent_assertions()

Also includes branch fixes:
- Fix base64 -w0 to use stdin redirect (aws, daytona, fly)
- Fix fly/openclaw to use npm install instead of broken curl|bash

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat: add E2E test harness and integrate into QA pipeline

Add test/e2e.sh — a full E2E test harness that provisions real servers,
installs agents, and verifies setup across all clouds. Features:
- Smoke test (one canary agent per cloud) and full matrix modes
- Credential auto-detection for 8 clouds
- Per-cloud preflight validation (sequential) then parallel agent tests
- Stale server cleanup, timing history, cross-cloud comparison
- Auto-fix and optimization phases via Claude agents
- macOS bash 3.2 compatible

Integrate E2E as Phase 5 in both qa-cycle.sh and qa-dry-run.sh:
- Runs after mock tests pass, gated on cloud credentials
- Phase 5b auto-fixes failures using per-agent worktree branches
- Parses results and includes in QA summary

Also fixes:
- shared/common.sh: honour SPAWN_NON_INTERACTIVE=1 in safe_read()
- aws/lib/common.sh: fix SSH key import (use cat instead of base64,
  handle race condition on concurrent imports)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-17 20:41:07 -05:00