Commit graph

27 commits

Ahmed Abushagur
f2795a6d84
fix: Node.js v22 upgrade, aider uv install, SSH & cloud reliability (#1440)
* fix: use uv --upgrade to ensure Python 3.13-compatible Pillow across all clouds

aider-chat on Python 3.13 fails with `ImportError: cannot import name
'_imaging' from 'PIL'` when an old Pillow version (pre-10.4) is resolved
— those releases have no Python 3.13 binary wheels, so the C extension
is missing at runtime.

Replace `--with 'Pillow>=10.2.0'` (which was silently broken — the `>`
and single quotes get mangled by `printf '%q'` in run_server before the
command reaches the remote machine) with `--upgrade`, which forces all
transitive deps including Pillow to their latest compatible versions.

Also adds a plain-text echo before the install so users see progress
instead of a silent hang during the 2-4 minute install.
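
The mangling is easy to reproduce locally; a minimal sketch, assuming bash's printf '%q' behavior (the escaped form only survives if the remote side applies a shell eval layer):

```shell
cmd="uv tool install --with 'Pillow>=10.2.0' aider-chat"
# printf '%q' backslash-escapes spaces, quotes, and '>' for shell reuse.
escaped=$(printf '%q' "$cmd")
echo "$escaped"
# Without an eval layer on the remote side, these backslashes arrive
# literally and the quoted version constraint is never seen by uv.
```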

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* test: update aider/gptme/interpreter assertions from pip to uv

The install method for aider, gptme, and open-interpreter was changed
from pip to `uv tool install` across all clouds. The mock test
assertions still checked for the old `pip.*install.*` patterns, causing
9 failures (3 agents × 3 clouds).

Update patterns to match the actual `uv tool install` commands now used
in all cloud scripts.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* ci: trigger test run for uv assertion fix

* fix: prevent SSH hangs, restore stderr, fix command escaping across clouds

- Add < /dev/null to ssh_run_server and generic_ssh_wait to prevent SSH
  stdin theft causing sequential install/verify/configure steps to hang
- Add ServerAliveInterval, ServerAliveCountMax, ConnectTimeout to default
  SSH_OPTS so long-running installs don't silently drop on flaky networks
- Remove 2>/dev/null from Fly.io run_server so remote command errors are
  no longer silently swallowed (--quiet flag still suppresses flyctl noise)
- Fix Fly.io printf '%q' double-quoting: remove extra quotes around
  $escaped_cmd that prevented the remote shell from consuming escapes,
  breaking `&&`, `||`, and `|` operators in commands
- Remove broken printf '%q' from Daytona run_server and interactive_session
  where it escaped shell operators into literal characters since daytona exec
  has no intermediate shell layer
- Pin aider to --python 3.12 instead of --with audioop-lts across all clouds
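
The stdin-theft bug in the first bullet can be reproduced without SSH at all; here `cat` stands in for an ssh invocation that does not redirect its stdin:

```shell
steps='install
verify
configure'
# Broken: the inner command (stand-in for ssh) eats the loop's remaining input.
broken=$(printf '%s\n' "$steps" | while read -r step; do
  cat > /dev/null
  echo "ran: $step"
done)
# Fixed: `< /dev/null` keeps the inner command off the loop's stdin.
fixed=$(printf '%s\n' "$steps" | while read -r step; do
  cat < /dev/null > /dev/null
  echo "ran: $step"
done)
echo "broken: $broken"
printf 'fixed:\n%s\n' "$fixed"
```

The broken loop runs only the first step; the fixed loop runs all three.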

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: add --pty to fly ssh console for interactive sessions

fly ssh console -C does not allocate a pseudo-terminal by default,
causing interactive TUI agents (aider, claude) to fail with
"Input is not a terminal (fd=0)" or to hang with unresponsive input.

Adding --pty forces PTY allocation, matching how other clouds handle
interactive sessions (SSH uses -t, Sprite uses -tty).
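
The failure mode can be sketched with a hypothetical guard on fd 0 (stdin is forced to a non-terminal here via /dev/null, mimicking the no-PTY case):

```shell
tui_guard() {
  # A TUI agent checks whether fd 0 is a terminal before starting.
  if [ -t 0 ]; then
    echo "interactive"
  else
    echo "Input is not a terminal (fd=0)"
  fi
}
msg=$(tui_guard < /dev/null)   # without a PTY, fd 0 is not a terminal
echo "$msg"
```

--pty forces PTY allocation on the remote side, like ssh -t, so the guard sees a terminal.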

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: prepend ~/.local/bin to PATH in ssh_run_server

After uv installs to ~/.local/bin, the current shell session doesn't
have it in PATH, causing "uv: command not found" on DigitalOcean and
all other SSH-based clouds (Hetzner, AWS, GCP, OVH).

Fly.io's run_server already prepends this PATH — now the shared
ssh_run_server does the same, fixing all SSH-based clouds at once.
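
A minimal sketch of the prefixing (variable names hypothetical): single quotes keep $HOME and $PATH unexpanded until the remote shell sees them:

```shell
# Single-quoted locally, so expansion happens on the remote host.
path_prefix='export PATH="$HOME/.local/bin:$PATH"; '
remote_cmd="${path_prefix}uv tool install aider-chat"
echo "$remote_cmd"
```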

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: add Node.js to cloud-init for all cloud providers

npm-based agents (codex, kilocode, etc.) fail with "npm: command not
found" because Node.js isn't installed during cloud-init. Fly.io was
the only provider installing Node.js (in wait_for_cloud_init).

Now all cloud-init scripts install Node.js v22 LTS from nodesource,
matching Fly.io's setup. Also adds ~/.local/bin to PATH in AWS and
GCP cloud-init (was already in shared/DigitalOcean/Hetzner).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: use apt packages for nodejs/npm instead of nodesource

The nodesource setup script (setup_22.x) runs its own apt-get update
and repository configuration, nearly doubling cloud-init time and
causing hangs on DigitalOcean. Ubuntu 24.04 includes nodejs and npm
in its default repos — just add them to the packages list.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: add timeouts and better error handling to Daytona CLI commands

Daytona CLI commands (login, list, create) can hang indefinitely when
the API is slow or unreachable. This causes:
- "Failed to create sandbox: timeout" with no recovery
- Token validation timeouts misreported as "invalid token"
- Users re-entering valid tokens that also timeout

Fixes:
- Wrap all daytona CLI calls with timeout (30s for auth, 120s for create)
- Detect timeout errors separately from auth errors
- Show actionable "try again / check status" messages for timeouts
- Add nodejs/npm to Daytona wait_for_cloud_init
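
The wrapper pattern can be sketched with GNU coreutils timeout, which exits 124 on expiry (sleep stands in for a hanging daytona call; the 1-second deadline is illustrative):

```shell
rc=0
timeout 1 sleep 60 || rc=$?   # stand-in for: timeout 30 daytona list
if [ "$rc" -eq 124 ]; then
  echo "timed out: Daytona API unreachable; try again or check status"
elif [ "$rc" -ne 0 ]; then
  echo "failed (exit $rc): likely an auth error, not a timeout"
else
  echo "ok"
fi
```

Branching on 124 is what lets timeout errors be reported separately from auth errors.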

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: set DAYTONA_API_URL to Daytona Cloud by default

The Daytona CLI may default to connecting to a local self-hosted
server instead of Daytona Cloud. Without DAYTONA_API_URL set to
https://app.daytona.io/api, every CLI command (login, list, create)
hangs trying to reach a non-existent local server and times out.

The SDK documents this as the default, but the CLI doesn't always
pick it up — now we export it explicitly.
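
The export with a keep-existing default might look like this (URL taken from the commit message):

```shell
unset DAYTONA_API_URL
# Respect an operator-provided value; otherwise default to Daytona Cloud.
export DAYTONA_API_URL="${DAYTONA_API_URL:-https://app.daytona.io/api}"
echo "$DAYTONA_API_URL"
```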

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: symlink n-installed Node.js v22 over apt v18 to prevent shadowing

n installs Node.js v22 to /usr/local/bin/node but apt's v18 at
/usr/bin/node can shadow it in non-interactive SSH sessions. After
n 22, symlink the new binaries over the apt ones so v22 is always
resolved. Also fix hcloud CLI token extraction for the new TOML format.
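
The shadowing is reproducible locally with two fake node scripts (hypothetical temp dirs stand in for /usr/bin and /usr/local/bin):

```shell
aptdir=$(mktemp -d)     # stands in for /usr/bin (apt's node)
localdir=$(mktemp -d)   # stands in for /usr/local/bin (n's node)
printf '#!/bin/sh\necho v18\n' > "$aptdir/node"
printf '#!/bin/sh\necho v22\n' > "$localdir/node"
chmod +x "$aptdir/node" "$localdir/node"
before=$(PATH="$aptdir:$localdir"; node)   # apt dir first: v18 shadows v22
ln -sf "$localdir/node" "$aptdir/node"     # symlink v22 over the apt binary
after=$(PATH="$aptdir:$localdir"; node)
echo "before: $before, after: $after"
```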

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: address security review, add curl timeouts to trigger workflows

- Fix ssh_run_server command injection concern: use a single-quoted
  path_prefix so $HOME and $PATH expand remotely, not locally
- Add --connect-timeout 15 --max-time 30 to trigger workflows to
  prevent 5-minute hangs when the server streams responses
- Handle 409 (dedup) as success — expected when cron fires every 15min
  but cycles take 35min
- Reduce workflow timeout-minutes from 5 to 2
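
The curl side is --connect-timeout 15 --max-time 30; the status handling in the workflow might be sketched as (function name hypothetical):

```shell
handle_status() {
  case "$1" in
    2??) echo "ok ($1)" ;;
    409) echo "ok ($1): duplicate trigger, a cycle is already running" ;;
    *)   echo "trigger failed ($1)"; return 1 ;;
  esac
}
handle_status 200
handle_status 409        # expected when the 15-min cron fires mid-cycle
handle_status 500 || true
```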

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-18 06:54:07 -05:00
L
6e13256d96
refactor: simplify claude launch — no streaming, no output monitoring (#1412)
Replace the complex claude launch pattern (subshell + PID file + tee
pipe + stream-json + 50-line watchdog monitoring log file growth +
session-end detection) with a simple direct launch:

  claude -p "..." >> "${LOG_FILE}" 2>&1 &

The watchdog is now just a wall-clock timeout. The idle-output detection,
stream-json result parsing, and tee piping are all removed.
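
A minimal sketch of the resulting shape, with sleep standing in for the claude process and a 1-second deadline for illustration:

```shell
LOG_FILE=$(mktemp)
sleep 60 >> "$LOG_FILE" 2>&1 &           # stand-in for: claude -p "..." >> "${LOG_FILE}" 2>&1 &
pid=$!
( sleep 1; kill "$pid" 2>/dev/null ) &   # wall-clock watchdog: no log monitoring
if wait "$pid" 2>/dev/null; then
  result="finished"
else
  result="hit wall-clock timeout"
fi
echo "agent $result"
```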

Also remove GitHub Actions concurrency groups — the trigger server
already handles dedup (409 for same issue, 409 for same reason), so the
GH Actions concurrency groups were just a redundant queuing layer.

Changes:
- refactor.sh: simple launch + wall-clock-only watchdog
- security.sh: same simplification
- discovery.sh: same (refactored _kill_claude_process and
  _run_watchdog_loop to simpler signatures)
- All 4 workflows: remove concurrency groups

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-02-17 09:02:47 -08:00
L
f3cfe890f7
refactor: simplify trigger server to fire-and-forget + fix monitoring loop prompts (#1384)
The trigger server streamed script stdout back to GitHub Actions via a
long-lived HTTP response, requiring --http1.1, heartbeat injection,
server.timeout(req, 0), createEnqueuer, drainStreamOutput, and 90-min
GH Actions timeouts. In practice GitHub Actions is just a dumb trigger
— the real state lives on the VM (log files, journalctl). Simplify to
fire-and-forget: spawn script, return 200 JSON immediately.
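
The server itself is TypeScript, but the fire-and-forget shape is easy to sketch in shell (names hypothetical; sleep stands in for the spawned script):

```shell
fire_and_forget() {
  "$@" > /dev/null 2>&1 &              # spawn detached; state lives in VM-side logs
  echo "{\"status\":\"started\",\"pid\":$!}"
}
response=$(fire_and_forget sleep 0.2)  # returns immediately, child keeps running
echo "$response"
```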

Also fix the refactor and discovery team lead monitoring loops. The
prompts buried the loop in a single compressed line that the model
ignored (doing Bash("sleep 10") repeatedly without calling TaskList).
Replace with a dedicated "Monitor Loop (CRITICAL)" section with numbered
steps, matching the security.sh pattern that actually works.

Changes:
- trigger-server.ts: remove ~150 lines of streaming code (createEnqueuer,
  drainStreamOutput, startStreamingRun, heartbeat, ReadableStream),
  replace with startFireAndForgetRun (stdout: "inherit", immediate JSON)
- All 4 workflows: simple curl POST, timeout-minutes 90→5, remove
  --http1.1/-N/--max-time/exit-code handling
- refactor.sh: add Monitor Loop (CRITICAL) section with numbered steps
- discovery-team-prompt.txt: same Monitor Loop fix
- SKILL.md: update architecture docs, remove streaming sections

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-02-17 10:47:52 -05:00
A
99a9badf62
ci: increase refactor team frequency to every 15 minutes (#1378)
Co-authored-by: lab <6723574+louisgv@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-02-16 20:50:03 -08:00
Ahmed Abushagur
3fbdf56c4c
fix: add guardrails to prevent bots from inventing unnecessary work (#1347)
- Add team lead pre-approval gate: teammates spawn in plan mode and must
  get approval before creating any PR (hard gate, not just prompt rules)
- Add diminishing returns rule: default posture is "code is good, shut down"
- Add dedup rule: check for existing open/closed PRs before creating new ones
- Require concrete PR justification (what breaks without this change)
- Add off-limits files list (.github/workflows, .claude/skills, CLAUDE.md)
- Use git pathspec exclusions in refactor.sh to never stage protected files
- Constrain pr-maintainer to only act on approved or feedback PRs
- Reduce refactor cron from every 5 minutes to every 2 hours

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-16 20:24:25 -05:00
A
d589b0d74e
fix: tilde expansion in upload_config_file + bump refactor frequency (#1131)
Fix #1114 — `mv` failed because `~/.claude/settings.json` was
single-quoted on the remote shell, preventing tilde expansion.
Remove the single quotes around remote_path and add a mkdir -p
safety net.
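
The expansion difference is easy to demonstrate with a nested shell standing in for the remote side:

```shell
quoted=$(bash -c "echo '~/.claude/settings.json'")   # single-quoted: tilde stays literal
unquoted=$(bash -c "echo ~/.claude/settings.json")   # unquoted: expands under $HOME
echo "quoted:   $quoted"
echo "unquoted: $unquoted"
```

With the quotes in place, mv was handed a path starting with a literal `~` directory that does not exist.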

Also bump the refactor team cron from hourly to every 5 minutes.

Co-authored-by: lab <6723574+louisgv@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-02-14 17:08:36 -05:00
L
0a0512652a
chore: reduce workflow cron frequencies (#1046)
- discovery: every 30 min → every 3 days
- refactor: every 5 min → hourly
- security: every 5 min → every 30 min

Co-authored-by: Security Reviewer <security-reviewer@spawn.dev>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-02-13 18:55:40 -08:00
L
f7c6e07867
feat: security triage applies full label taxonomy (#766)
* feat: security triage now applies full label taxonomy

Triage mode now applies:
- Safety label (safe-to-work / malicious / needs-human-review)
- Content-type label (bug, enhancement, security, question, etc.)
- Lifecycle label (Pending Review) so downstream teams can pick up

Team-building mode now transitions lifecycle labels:
- Adds "In Progress" at start, removes it on close

Added an "Available Labels Reference" section to the triage prompt
documenting all label categories for the agent.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: all security-filed issues get safe-to-work + Pending Review

Issues filed by the security team (scan findings, drift/anomaly
reports, follow-up issues from closed PRs) now automatically get
`safe-to-work` and `Pending Review` labels so downstream teams
can immediately pick them up without waiting for another triage.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: remove Pending Review from safe-to-work issues

safe-to-work already means triage is complete — adding Pending Review
is redundant and confusing. Now only UNCLEAR issues get Pending Review
(they still need human attention). SAFE issues and security-filed
issues skip straight to actionable with just safe-to-work.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* refactor: normalize all labels to kebab-case

Renamed on GitHub:
- "In Progress" → "in-progress"
- "Pending Review" → "pending-review"
- "Under Review" → "under-review"
- "good first issue" → "good-first-issue"
- "help wanted" → "help-wanted"

Updated all references in security.sh and refactor.sh to match.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: align issue templates and workflows with actual repo labels

Created missing labels: cloud-request, agent-request, cli.
Replaced nonexistent needs-triage with pending-review in all templates.

Templates updated:
- bug_report: bug + pending-review
- cli_feature_request: cli + enhancement + pending-review
- cloud_request: cloud-request + enhancement + pending-review
- agent_request: agent-request + enhancement + pending-review

Workflows updated:
- refactor.yml: trigger on safe-to-work AND (bug|cli|enhancement|maintenance)
- discovery.yml: already correct (safe-to-work AND cloud-request|agent-request)
- security.yml: already correct (team-building label check)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Sprite <noreply@sprites.dev>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-02-12 16:20:07 -08:00
L
4924a7d5db
feat: add security triage gate for issue safety before agent processing (#734)
New issues are triaged by the security team before other workflows can
act on them. The triage agent checks for prompt injection, social
engineering, spam, and unsafe payloads — marking safe issues with
`safe-to-work`, closing malicious ones, or flagging unclear ones for
human review. Discovery and refactor workflows now require the
`safe-to-work` label in addition to their existing label requirements.

Co-authored-by: Sprite <noreply@sprites.dev>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-02-12 14:23:33 -08:00
L
4d175ae6c7
feat: add Team Building issue template + route workflows by label (#733)
- New issue template: Team Building (team-building label) — 2 fields:
  which agent team to improve + what to change
- Security team gets a new team_building mode: reads the issue, spawns
  implementer + reviewer (both Opus), creates PR, reviews, merges, closes issue
- Discovery workflow: only triggers on cloud-request / agent-request issues
- Refactor workflow: only triggers on bug / cli issues
- Security workflow: only triggers on team-building issues (+ PR/schedule)
- All workflows still run on schedule and workflow_dispatch as before

Co-authored-by: Sprite <noreply@sprites.dev>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-02-12 14:17:57 -08:00
B
200b6dc5b2 fix: Force HTTP/1.1 for streaming to avoid HTTP/2 stream errors
HTTP/2 has strict stream lifecycle management that doesn't play well
with long-lived chunked responses — curl exits with error 92
(stream not closed cleanly: INTERNAL_ERROR). HTTP/1.1 handles
persistent streaming connections natively.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-02-10 22:51:35 +00:00
B
874b9c95f4 feat: Stream script output back to GH Actions instead of keep-alive
Replace the broken keep-alive ping loop with a fundamentally better
approach: the trigger server now streams the script's stdout/stderr
back as the HTTP response body in chunks. The GH Action holds the
curl connection open for the entire cycle duration (~90 min timeout).

This works because Sprite keeps VMs alive while "actively servicing
HTTP requests." A single long-lived streaming response satisfies
this naturally — no synthetic pings needed.

Key changes:

trigger-server.ts:
- /trigger now returns a streaming text/plain Response
- stdout/stderr piped through ReadableStream with chunked output
- 30s heartbeat lines injected during silent periods
- Client disconnect handled gracefully (process keeps running)
- X-Accel-Buffering: no header to prevent proxy buffering

discovery.yml / refactor.yml:
- curl -sSN --fail-with-body streams output in real-time
- timeout-minutes: 90 to hold the connection for full cycles
- Error responses (429/409/401) still print body and exit cleanly

discovery.sh / refactor.sh:
- Removed all keep-alive logic (start_keepalive/stop_keepalive)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-02-10 18:09:26 +00:00
A
6f47c852c8 Increase refactor workflow frequency from 30min to 5min
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-02-10 16:24:12 +00:00
A
7ace2695e6
feat: Run issue-fix cycles concurrently with refactor cycles (#145)
Issue triggers now spawn lightweight 2-agent runs (15-min timeout) in
isolated worktrees, while refactor cycles continue independently with
the full 6-agent team (30-min timeout). Duplicate issue runs are
rejected with 409.

- trigger-server.ts: pass SPAWN_ISSUE/SPAWN_REASON env vars to script,
  add issue dedup (409), include issue in health/trigger responses
- refactor.sh: dual-mode (issue vs refactor) with isolated worktrees,
  mode-specific prompts and timeouts, scoped cleanup
- start-refactor.sh: set MAX_CONCURRENT=3 (gitignored, local only)
- refactor.yml: handle 409 alongside existing 429

Co-authored-by: A <6723574+louisgv@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-02-09 22:15:19 -08:00
B
6b5a547e2d fix: Treat 429 (cycle already running) as success in workflows
When MAX_CONCURRENT=1 and a cycle is in progress, the trigger server
returns 429. This is expected behavior, not an error — the previous
curl -f treated it as failure (exit code 22).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-02-10 03:43:32 +00:00
A
ab343d26a2
fix: Prevent duplicate work, add graceful shutdown, and enforce team lifecycle (#86)
- Change trigger-server MAX_CONCURRENT default from 3 to 1 to prevent
  overlapping cycles that duplicate GitHub issue comments
- Add SIGTERM/SIGINT handling to trigger-server so running scripts finish
  gracefully on service restart instead of being killed mid-flight
- Add cleanup trap to refactor.sh for worktree/tempfile cleanup on exit
- Add pre-cycle cleanup of stale worktrees, merged branches, and
  abandoned PRs from previously interrupted cycles
- Add mandatory Lifecycle Management section to team lead prompt requiring
  shutdown_request to all teammates before exiting
- Add dedup checks to community-coordinator: check existing comments
  before posting to prevent duplicate acknowledgments/resolutions
- Pass issue number in workflow trigger reason for better logging
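
The cleanup trap in the third bullet follows the standard EXIT-trap pattern; a self-contained sketch (the marker file is hypothetical, only there to observe the trap firing):

```shell
marker=$(mktemp -u)   # path only; the trap will create it
bash -c "
  worktree=\$(mktemp -d)
  # Clean up the worktree whether the cycle exits normally or is interrupted.
  trap 'rm -rf \"\$worktree\"; touch $marker' EXIT INT TERM
  : # ... cycle work would happen here ...
"
[ -f "$marker" ] && echo "cleanup trap ran"
```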

Co-authored-by: A <6723574+louisgv@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-02-09 09:10:56 -08:00
B
4c456df091 fix: Switch to direct sprite URL with bearer auth
The Sprite start service API (/services/{name}/start) returns
"service name required" for all service names — appears to be an API
bug. Switched to hitting the sprite's public URL directly with
TRIGGER_SECRET bearer auth instead.

- Re-added TRIGGER_SECRET auth to trigger-server.ts
- Set sprite url_settings.auth to "public"
- Updated both workflows to use SPRITE_URL + TRIGGER_SECRET pattern
- Aligned workflow structure (both use same env vars and curl format)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-02-09 09:07:49 +00:00
B
9eb9e74295 debug: Print secret lengths and hash to verify values
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-02-09 08:35:23 +00:00
B
87e5790880 debug: Echo SVC_NAME in refactor workflow
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-02-09 08:16:52 +00:00
Sprite
a361d92e13 fix: Pass env vars correctly in refactor workflow
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-02-09 08:13:09 +00:00
Sprite
758e79bb59 fix: Inline secret refs in curl URL to avoid env var issues
SERVICE_NAME env var may conflict with GitHub Actions internals.
Inline the secrets directly in the URL template instead.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-02-09 01:04:26 +00:00
Sprite
57cf080c39 chore: Run refactor workflow every 30 minutes
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-02-09 00:09:04 +00:00
Sprite
66221dac80 fix: Use duration=0s to fire-and-forget on start service API
The Sprite start service API returns streaming NDJSON, causing curl -f
to fail with exit code 22. Use duration=0s to return immediately and
drop the -f flag, since the response is streaming.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-02-08 23:40:50 +00:00
Sprite
b7b102a352 fix: Remove curl timeout on trigger workflows
Sprite may take time to wake from pause, causing --max-time 30 to fail
with curl exit code 28 (operation timed out).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-02-08 21:33:03 +00:00
Sprite
38ffd7ebd6 feat: Update trigger workflows to use Sprite start service API
- Replace SPRITE_URL/SPRITE_SECRET pattern with SPRITE_NAME/SERVICE_NAME
- Use Sprite start service API endpoint (api.sprites.dev)
- Share SPRITE_TOKEN across all services
- Update skill documentation to reflect new approach
- Delete deprecated URL/SECRET based secrets

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-02-08 20:29:19 +00:00
Sprite
286609c1ed feat: Add concurrency limits to trigger workflows
Add max 3 concurrent run limits:
- GitHub Actions: concurrency groups prevent workflow queue buildup
- trigger-server: tracks concurrent runs, rejects with 429 if at max
- Configurable via MAX_CONCURRENT env var (defaults to 3)
- Returns running count and max in trigger response

This prevents resource exhaustion when workflows trigger frequently.
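
The server logic is TypeScript, but the counting can be sketched in shell (all names hypothetical; the real server decrements when a run finishes):

```shell
MAX_CONCURRENT=3   # configurable via env var in the real server
running=0
try_trigger() {
  if [ "$running" -ge "$MAX_CONCURRENT" ]; then
    echo "429 at capacity ($running/$MAX_CONCURRENT)"
    return 1
  fi
  running=$((running + 1))
  echo "200 started ($running/$MAX_CONCURRENT)"
}
try_trigger; try_trigger; try_trigger
try_trigger || true   # fourth concurrent request is rejected
```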

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-02-08 19:34:52 +00:00
L
4a05b32897
Add GitHub Actions triggers for Sprite services (#53)
* refactor: Automated improvements

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* chore: Remove __pycache__ and add to .gitignore

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Sprite <noreply@sprite.dev>
Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-02-08 10:29:18 -08:00