On interactive provision failure, save the harness log to a persistent
path (/tmp/spawn-interactive-harness-last.log) for post-mortem inspection,
and filter output to only show [harness] prefixed lines (30 lines) instead
of dumping 50 raw lines of mixed output.
Co-authored-by: spawn-qa-bot <qa@openrouter.ai>
Co-authored-by: L <6723574+louisgv@users.noreply.github.com>
Co-authored-by: Ahmed Abushagur <ahmed@abushagur.com>
- fix misplaced interactive_provision comment block in interactive.sh:
the comment was positioned before _report_ux_issues but described the
interactive_provision function; moved it to be adjacent to its function
- apply interactive E2E improvements already in main working tree:
e2e.sh: add verify_agent call after interactive_provision to wait for
.spawnrc before running input tests (aligns interactive with headless flow)
-- qa/code-quality
Co-authored-by: spawn-qa-bot <qa@openrouter.ai>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(e2e): pass SPAWN_NAME + SPAWN_ENABLED_STEPS to interactive harness
Without SPAWN_NAME, cmdRun prompts 'Name your spawn' interactively.
The AI driver (Claude Haiku) can't respond because ANTHROPIC_AUTH_TOKEN
is an OpenRouter key — every Anthropic API call returns 401, so the harness
returns <wait> indefinitely until the 20-min SESSION_TIMEOUT_MS fires.
SPAWN_ENABLED_STEPS=auto-update bypasses the setup options multiselect,
ensuring the harness only tests the provisioning/installation UX.
* fix(e2e): fix _stage_timeout_remotely stdin pipe issue on Hetzner
Same root cause as _stage_prompt_remotely: _hetzner_exec runs commands via
"printf | base64 -d | bash", which makes bash's stdin the decode pipe.
So piped data from the outer SSH call never reaches subcommands.
"printf '%s' 'VALUE' | cloud_exec APP 'cat > /tmp/.e2e-timeout'" always
creates an empty file, causing "timeout: invalid time interval ''" when
the input test runs.
Fix: embed the validated numeric timeout value directly in the printf
command string (safe — _validate_timeout ensures only [0-9] digits).
* test(e2e): add claude PATH diagnostics to input_test_claude
Temporary debug output to trace where claude is installed
after interactive provision completes.
* test(e2e): save harness transcript JSON on success for debugging
* fix(e2e): remove 'is ready' from harness success pattern
'SSH is ready' (emitted ~15s into provision when SSH connects but before
any agent installation) matched the /is ready/ pattern, triggering false
success detection. The harness killed the spawn CLI during cloud-init wait,
leaving a VM with no agent installed.
Fix: use the same precise patterns as the main repo's harness:
/Starting agent\.\.\.|setup completed successfully/i
Both only fire after orchestrate.ts completes the full setup.
* chore(e2e): remove temporary debug instrumentation
* feat(e2e): add ai-powered ux review after interactive provision
After each successful interactive E2E run, the harness sends the full
terminal transcript to Claude (via OpenRouter) with a UX reviewer prompt.
It looks for confusing messages, noisy output, missing context in spinners,
and unhelpful errors that don't explain next steps.
Findings are returned as uxIssues[] in the harness JSON result.
interactive.sh then files a GitHub issue per run listing each problem
with a verbatim example and concrete suggestion.
Uses OPENROUTER_API_KEY (already in env) so it works on the QA VM
where ANTHROPIC_API_KEY is an OpenRouter key.
* refactor(e2e): throttle ux issue filing — 33% chance, 3+ issues required
- Random 33% gate: UX review runs on ~1 in 3 successful interactive
provisions, not every run
- Minimum bar: only surface findings when AI found 3+ clear issues
(filters one-off nits)
- Tighter system prompt: only flag obvious problems (repeated messages,
debug leaks, cryptic errors), not minor style preferences
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* refactor(e2e): replace random throttle with stricter ux review prompt
Instead of Math.random() to suppress issues, make the AI self-regulate:
the system prompt now instructs it to only flag genuinely bad problems
(repeated messages, raw stack traces, no-feedback waits) and treat
zero findings as a good outcome, not a failure.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
---------
Co-authored-by: spawn-qa-bot <qa@openrouter.ai>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
* chore: update agent GitHub star counts
* fix(qa): load ANTHROPIC_AUTH_TOKEN as ANTHROPIC_API_KEY for interactive E2E
QA VMs store the Anthropic key as ANTHROPIC_AUTH_TOKEN in
/etc/spawn-qa-auth.env, but the e2e-interactive handler only looked for
ANTHROPIC_API_KEY — causing the 6am cron to fail immediately with
"ANTHROPIC_API_KEY not set". Accept either name when loading from the
auth env file.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(e2e): bump interactive harness timeout to 20min, fix zombie VM teardown
- SESSION_TIMEOUT_MS: 10min → 20min — provisioning a VM takes 3-4 min
before onboarding even starts; 10min wasn't enough headroom
- interactive.sh: call cloud_provision_verify even on harness failure so
teardown can find and delete any VM that was partially created (e.g.
on timeout mid-provision) — previously left zombie VMs with no .meta file
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
---------
Co-authored-by: spawn-qa-bot <qa@openrouter.ai>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: L <6723574+louisgv@users.noreply.github.com>
* feat: never-give-up resilience layer — retry every failure instead of exiting
Add retryOrQuit() helper to shared/ui.ts that prompts "Try again? (Y/n)"
after any recoverable failure. Wrap all fatal exit points with retry loops:
- Cloud auth (Hetzner, DigitalOcean, AWS, GCP): retry after 3 failed tokens
- API key acquisition: retry after 3 failed OAuth+manual attempts
- Server creation: retry on any createServer failure (both fast & sequential)
- SSH readiness: retry on waitForReady timeout
- Agent install: retry on install failure
- Pre-launch hooks: retry on preLaunch failure
Non-interactive mode (SPAWN_NON_INTERACTIVE=1) still throws immediately.
Ctrl+C at any retry prompt exits cleanly.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* feat(e2e): add AI-driven interactive test harness
Add --interactive mode to the E2E test framework. Instead of running spawn
in headless mode (SPAWN_NON_INTERACTIVE=1), this spawns the CLI in a real
PTY and uses Claude Haiku to respond to prompts like a human user would.
New files:
- sh/e2e/interactive-harness.ts — Bun script that drives the PTY + AI loop
- sh/e2e/lib/interactive.sh — Bash integration with the E2E framework
Usage:
e2e.sh --cloud hetzner claude --interactive
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* feat(qa): wire interactive E2E into scheduled QA pipeline
- Add `e2e-interactive` option to workflow_dispatch in qa.yml
- Add `e2e-interactive` run mode to qa.sh (loads cloud creds + ANTHROPIC_API_KEY)
- Runs `e2e.sh --cloud hetzner claude --interactive` directly (no Claude Code needed)
- Defaults to hetzner (cheapest), overridable via E2E_INTERACTIVE_CLOUD/AGENT env vars
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* feat(qa): schedule interactive E2E daily at 6am UTC
Runs one agent (claude) on one cloud (hetzner) with AI-driven prompts.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix(qa): offset soak cron to avoid GitHub Actions schedule dedup
GitHub Actions deduplicates overlapping cron schedules into one run,
making `github.event.schedule` unpredictable. The soak test at `0 3 * * 1`
was getting absorbed by the `0 */4 * * *` quality sweep and never firing
as reason=soak.
Move soak to `30 1 * * 1` (Monday 1:30am UTC) — safely between the
0am and 4am quality sweep slots. Interactive E2E at `0 6 * * *` is
already safe (between the 4am and 8am slots).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix(qa): add e2e-interactive to trigger server valid reasons
The trigger server validates reason query params against an allowlist.
Without this, the `e2e-interactive` dispatch returns 400.
Also note: `soak` is already in VALID_REASONS in the repo but the running
service on the QA VM is stale — needs a restart to pick up both soak and
e2e-interactive reasons.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>