* fix: use uv --upgrade to ensure Python 3.13-compatible Pillow across all clouds
aider-chat on Python 3.13 fails with `ImportError: cannot import name
'_imaging' from 'PIL'` when an old Pillow version (pre-10.4) is resolved
— those releases have no Python 3.13 binary wheels, so the C extension
is missing at runtime.
Replace `--with 'Pillow>=10.2.0'` (which was silently broken — the `>`
and single quotes get mangled by `printf '%q'` in run_server before the
command reaches the remote machine) with `--upgrade`, which forces all
transitive deps including Pillow to their latest compatible versions.
Also add a plain-text echo before the install so users see progress
instead of what looks like a silent hang during the 2-4 minute install.
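A minimal sketch of the intended install step (the echo text and package
extras are illustrative, not the literal script contents):
  echo "Installing aider with uv (this may take a few minutes)..."
  uv tool install --upgrade aider-chat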
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* test: update aider/gptme/interpreter assertions from pip to uv
The install method for aider, gptme, and open-interpreter was changed
from pip to `uv tool install` across all clouds. The mock test
assertions still checked for the old `pip.*install.*` patterns, causing
9 failures (3 agents × 3 clouds).
Update patterns to match the actual `uv tool install` commands now used
in all cloud scripts.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* ci: trigger test run for uv assertion fix
* fix: prevent SSH hangs, restore stderr, fix command escaping across clouds
- Add < /dev/null to ssh_run_server and generic_ssh_wait so ssh stops
  consuming the caller's stdin, which was hanging the sequential
  install/verify/configure steps (see the sketch after this list)
- Add ServerAliveInterval, ServerAliveCountMax, ConnectTimeout to default
SSH_OPTS so long-running installs don't silently drop on flaky networks
- Remove 2>/dev/null from Fly.io run_server so remote command errors are
no longer silently swallowed (--quiet flag still suppresses flyctl noise)
- Fix Fly.io printf '%q' double-quoting: remove extra quotes around
$escaped_cmd that prevented the remote shell from consuming escapes,
breaking && || | operators in commands
- Remove broken printf '%q' from Daytona run_server and interactive_session
where it escaped shell operators into literal characters since daytona exec
has no intermediate shell layer
- Pin aider to --python 3.12 instead of --with audioop-lts across all clouds
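A rough sketch of the hardened SSH call, assuming the SSH_OPTS variable
described above (option values and names like SERVER_IP / remote_cmd are
illustrative):
  SSH_OPTS="-o ServerAliveInterval=15 -o ServerAliveCountMax=8 -o ConnectTimeout=30"
  # < /dev/null keeps ssh from consuming the caller's stdin and hanging later steps
  ssh $SSH_OPTS "root@${SERVER_IP}" "$remote_cmd" < /dev/null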
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* fix: add --pty to fly ssh console for interactive sessions
fly ssh console -C does not allocate a pseudo-terminal by default, so
interactive TUI agents (aider, claude) either fail with
"Input is not a terminal (fd=0)" or become completely unresponsive to input.
Adding --pty forces PTY allocation, matching how other clouds handle
interactive sessions (SSH uses -t, Sprite uses -tty).
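Sketch of the interactive launch (app name and agent command are
placeholders):
  fly ssh console --pty -a "$APP_NAME" -C "aider"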
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* fix: prepend ~/.local/bin to PATH in ssh_run_server
After uv installs to ~/.local/bin, the current shell session doesn't
have it in PATH, causing "uv: command not found" on DigitalOcean and
all other SSH-based clouds (Hetzner, AWS, GCP, OVH).
Fly.io's run_server already prepends this PATH — now the shared
ssh_run_server does the same, fixing all SSH-based clouds at once.
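Roughly, the shared helper now builds the remote command like this
(variable names are illustrative):
  # prepend uv's install dir so freshly installed tools resolve in this session
  remote_cmd='export PATH="$HOME/.local/bin:$PATH"; '"$cmd"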
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* fix: add Node.js to cloud-init for all cloud providers
npm-based agents (codex, kilocode, etc.) fail with "npm: command not
found" because Node.js isn't installed during cloud-init. Fly.io was
the only provider installing Node.js (in wait_for_cloud_init).
Now all cloud-init scripts install Node.js v22 LTS from nodesource,
matching Fly.io's setup. Also adds ~/.local/bin to PATH in AWS and
GCP cloud-init (was already in shared/DigitalOcean/Hetzner).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* fix: use apt packages for nodejs/npm instead of nodesource
The nodesource setup script (setup_22.x) runs its own apt-get update
and repository configuration, nearly doubling cloud-init time and
causing hangs on DigitalOcean. Ubuntu 24.04 includes nodejs and npm
in its default repos — just add them to the packages list.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* fix: add timeouts and better error handling to Daytona CLI commands
Daytona CLI commands (login, list, create) can hang indefinitely when
the API is slow or unreachable. This causes:
- "Failed to create sandbox: timeout" with no recovery
- Token validation timeouts misreported as "invalid token"
- Users re-entering valid tokens that also time out
Fixes:
- Wrap all daytona CLI calls with timeout (30s for auth, 120s for create);
  see the sketch after this list
- Detect timeout errors separately from auth errors
- Show actionable "try again / check status" messages for timeouts
- Add nodejs/npm to Daytona wait_for_cloud_init
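A minimal sketch of the timeout wrapping (subcommand and messages are
illustrative; GNU timeout exits 124 when the limit is hit):
  output=$(timeout 30 daytona list 2>&1); status=$?
  if [ "$status" -eq 124 ]; then
    echo "Daytona API timed out - try again or check the Daytona status page"
  elif [ "$status" -ne 0 ]; then
    echo "Daytona CLI error: $output"
  fi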
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* fix: set DAYTONA_API_URL to Daytona Cloud by default
The Daytona CLI may default to connecting to a local self-hosted
server instead of Daytona Cloud. Without DAYTONA_API_URL set to
https://app.daytona.io/api, every CLI command (login, list, create)
hangs trying to reach a non-existent local server and times out.
The SDK documents this as the default, but the CLI doesn't always
pick it up — now we export it explicitly.
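The fix itself is a one-line export (URL as quoted above):
  export DAYTONA_API_URL="https://app.daytona.io/api"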
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* fix: symlink n-installed Node.js v22 over apt v18 to prevent shadowing
n installs Node.js v22 to /usr/local/bin/node but apt's v18 at
/usr/bin/node can shadow it in non-interactive SSH sessions. After
running `n 22`, symlink the new binaries over the apt ones so v22 is always
resolved. Also fix hcloud CLI token extraction for new TOML format.
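Sketch of the Node.js symlink step described above (the binary list is
illustrative):
  n 22
  for bin in node npm npx; do
    ln -sf "/usr/local/bin/$bin" "/usr/bin/$bin"   # make n's v22 win over apt's v18
  done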
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* fix: address security review, add curl timeouts to trigger workflows
- Fix ssh_run_server command injection concern: use single-quoted
path_prefix so $HOME/$PATH expand remotely, not locally
- Add --connect-timeout 15 --max-time 30 to trigger workflows to
  prevent 5-min hangs when the server streams responses (curl sketch below)
- Handle 409 (dedup) as success — expected when cron fires every 15min
but cycles take 35min
- Reduce workflow timeout-minutes from 5 to 2
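A rough sketch of the workflow-side curl, assuming the SPRITE_URL and
TRIGGER_SECRET names used elsewhere in this log:
  http_code=$(curl -sS -o response.txt -w '%{http_code}' \
    --connect-timeout 15 --max-time 30 \
    -X POST -H "Authorization: Bearer ${TRIGGER_SECRET}" \
    "${SPRITE_URL}/trigger")
  # 200 = started, 409 = deduped by the server - both fine for a cron trigger
  if [ "$http_code" != "200" ] && [ "$http_code" != "409" ]; then
    cat response.txt; exit 1
  fi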
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
---------
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace the complex claude launch pattern (subshell + PID file + tee
pipe + stream-json + 50-line watchdog monitoring log file growth +
session-end detection) with a simple direct launch:
claude -p "..." >> "${LOG_FILE}" 2>&1 &
The watchdog is now just a wall-clock timeout. The idle-output detection,
stream-json result parsing, and tee piping are all removed.
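A minimal sketch of the simplified launch plus wall-clock watchdog
(MAX_RUNTIME_SECONDS and PROMPT are placeholders):
  claude -p "$PROMPT" >> "${LOG_FILE}" 2>&1 &
  claude_pid=$!
  # wall-clock watchdog: kill the agent only if it outlives its budget
  ( sleep "$MAX_RUNTIME_SECONDS"; kill "$claude_pid" 2>/dev/null ) &
  watchdog_pid=$!
  wait "$claude_pid"
  kill "$watchdog_pid" 2>/dev/null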
Also remove GitHub Actions concurrency groups: the trigger server
already handles dedup (409 for same issue, 409 for same reason), so the
GH Actions concurrency groups were just redundant queuing.
Changes:
- refactor.sh: simple launch + wall-clock-only watchdog
- security.sh: same simplification
- discovery.sh: same (refactored _kill_claude_process and
_run_watchdog_loop to simpler signatures)
- All 4 workflows: remove concurrency groups
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The trigger server streamed script stdout back to GitHub Actions via a
long-lived HTTP response, requiring --http1.1, heartbeat injection,
server.timeout(req, 0), createEnqueuer, drainStreamOutput, and 90-min
GH Actions timeouts. In practice GitHub Actions is just a dumb trigger
— the real state lives on the VM (log files, journalctl). Simplify to
fire-and-forget: spawn script, return 200 JSON immediately.
Also fix the refactor and discovery team lead monitoring loops. The
prompts buried the loop in a single compressed line that the model
ignored (doing Bash("sleep 10") repeatedly without calling TaskList).
Replace with a dedicated "Monitor Loop (CRITICAL)" section with numbered
steps, matching the security.sh pattern that actually works.
Changes:
- trigger-server.ts: remove ~150 lines of streaming code (createEnqueuer,
drainStreamOutput, startStreamingRun, heartbeat, ReadableStream),
replace with startFireAndForgetRun (stdout: "inherit", immediate JSON)
- All 4 workflows: simple curl POST, timeout-minutes 90→5, remove
--http1.1/-N/--max-time/exit-code handling
- refactor.sh: add Monitor Loop (CRITICAL) section with numbered steps
- discovery-team-prompt.txt: same Monitor Loop fix
- SKILL.md: update architecture docs, remove streaming sections
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add team lead pre-approval gate: teammates spawn in plan mode and must
get approval before creating any PR (hard gate, not just prompt rules)
- Add diminishing returns rule: default posture is "code is good, shut down"
- Add dedup rule: check for existing open/closed PRs before creating new ones
- Require concrete PR justification (what breaks without this change)
- Add off-limits files list (.github/workflows, .claude/skills, CLAUDE.md)
- Use git pathspec exclusions in refactor.sh to never stage protected files
  (example after this list)
- Constrain pr-maintainer to only act on approved or feedback PRs
- Reduce refactor cron from every 5 minutes to every 2 hours
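Sketch of the pathspec exclusions, using the protected paths listed above:
  git add -- . ':(exclude).github/workflows' ':(exclude).claude/skills' ':(exclude)CLAUDE.md'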
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Fix #1114: `mv` failed because `~/.claude/settings.json` was
single-quoted when passed to the remote shell, preventing tilde
expansion. Remove the single quotes around remote_path and add a
mkdir -p safety net.
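Roughly, the fixed remote step (helper name as used elsewhere in this
log; paths are illustrative):
  # remote path left unquoted so the remote shell expands the tilde
  ssh_run_server "mkdir -p ~/.claude && mv /tmp/settings.json ~/.claude/settings.json"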
Also bump the refactor team cron from hourly to every 5 minutes.
Co-authored-by: lab <6723574+louisgv@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- discovery: every 30 min → every 3 days
- refactor: every 5 min → hourly
- security: every 5 min → every 30 min
Co-authored-by: Security Reviewer <security-reviewer@spawn.dev>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* feat: security triage now applies full label taxonomy
Triage mode now applies:
- Safety label (safe-to-work / malicious / needs-human-review)
- Content-type label (bug, enhancement, security, question, etc.)
- Lifecycle label (Pending Review) so downstream teams can pick up
Team-building mode now transitions lifecycle labels:
- Adds "In Progress" at start, removes it on close
Added a "Available Labels Reference" section to the triage prompt
documenting all label categories for the agent.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* feat: all security-filed issues get safe-to-work + Pending Review
Issues filed by the security team (scan findings, drift/anomaly
reports, follow-up issues from closed PRs) now automatically get
`safe-to-work` and `Pending Review` labels so downstream teams
can immediately pick them up without waiting for another triage.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: remove Pending Review from safe-to-work issues
safe-to-work already means triage is complete — adding Pending Review
is redundant and confusing. Now only UNCLEAR issues get Pending Review
(they still need human attention). SAFE issues and security-filed
issues skip straight to actionable with just safe-to-work.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* refactor: normalize all labels to kebab-case
Renamed on GitHub:
- "In Progress" → "in-progress"
- "Pending Review" → "pending-review"
- "Under Review" → "under-review"
- "good first issue" → "good-first-issue"
- "help wanted" → "help-wanted"
Updated all references in security.sh and refactor.sh to match.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: align issue templates and workflows with actual repo labels
Created missing labels: cloud-request, agent-request, cli.
Replaced nonexistent needs-triage with pending-review in all templates.
Templates updated:
- bug_report: bug + pending-review
- cli_feature_request: cli + enhancement + pending-review
- cloud_request: cloud-request + enhancement + pending-review
- agent_request: agent-request + enhancement + pending-review
Workflows updated:
- refactor.yml: trigger on safe-to-work AND (bug|cli|enhancement|maintenance)
- discovery.yml: already correct (safe-to-work AND cloud-request|agent-request)
- security.yml: already correct (team-building label check)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Sprite <noreply@sprites.dev>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
New issues are triaged by the security team before other workflows can
act on them. The triage agent checks for prompt injection, social
engineering, spam, and unsafe payloads — marking safe issues with
`safe-to-work`, closing malicious ones, or flagging unclear ones for
human review. Discovery and refactor workflows now require the
`safe-to-work` label in addition to their existing label requirements.
Co-authored-by: Sprite <noreply@sprites.dev>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- New issue template: Team Building (team-building label) — 2 fields:
which agent team to improve + what to change
- Security team gets a new team_building mode: reads the issue, spawns
implementer + reviewer (both Opus), creates PR, reviews, merges, closes issue
- Discovery workflow: only triggers on cloud-request / agent-request issues
- Refactor workflow: only triggers on bug / cli issues
- Security workflow: only triggers on team-building issues (+ PR/schedule)
- All workflows still run on schedule and workflow_dispatch as before
Co-authored-by: Sprite <noreply@sprites.dev>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
HTTP/2 has strict stream lifecycle management that doesn't play well
with long-lived chunked responses — curl exits with error 92
(stream not closed cleanly: INTERNAL_ERROR). HTTP/1.1 handles
persistent streaming connections natively.
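The workflow-side change is just forcing the HTTP version on the
streaming request (URL and auth as elsewhere in this log):
  curl --http1.1 -sSN --fail-with-body \
    -H "Authorization: Bearer ${TRIGGER_SECRET}" "${SPRITE_URL}/trigger"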
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace the broken keep-alive ping loop with a fundamentally better
approach: the trigger server now streams the script's stdout/stderr
back as the HTTP response body in chunks. The GH Action holds the
curl connection open for the entire cycle duration (~90 min timeout).
This works because Sprite keeps VMs alive while "actively servicing
HTTP requests." A single long-lived streaming response satisfies
this naturally — no synthetic pings needed.
Key changes:
trigger-server.ts:
- /trigger now returns a streaming text/plain Response
- stdout/stderr piped through ReadableStream with chunked output
- 30s heartbeat lines injected during silent periods
- Client disconnect handled gracefully (process keeps running)
- X-Accel-Buffering: no header to prevent proxy buffering
discovery.yml / refactor.yml:
- curl -sSN --fail-with-body streams output in real-time
- timeout-minutes: 90 to hold the connection for full cycles
- Error responses (429/409/401) still print body and exit cleanly
discovery.sh / refactor.sh:
- Removed all keep-alive logic (start_keepalive/stop_keepalive)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Issue triggers now spawn lightweight 2-agent runs (15-min timeout) in
isolated worktrees, while refactor cycles continue independently with
the full 6-agent team (30-min timeout). Duplicate issue runs are
rejected with 409.
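A minimal sketch of the issue-mode isolation (branch and directory names
are illustrative):
  worktree_dir="/tmp/spawn-issue-${SPAWN_ISSUE}"
  git worktree add "$worktree_dir" -b "issue-${SPAWN_ISSUE}" origin/main
  cd "$worktree_dir"   # the lightweight 2-agent run works here, away from refactor cycles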
- trigger-server.ts: pass SPAWN_ISSUE/SPAWN_REASON env vars to script,
add issue dedup (409), include issue in health/trigger responses
- refactor.sh: dual-mode (issue vs refactor) with isolated worktrees,
mode-specific prompts and timeouts, scoped cleanup
- start-refactor.sh: set MAX_CONCURRENT=3 (gitignored, local only)
- refactor.yml: handle 409 alongside existing 429
Co-authored-by: A <6723574+louisgv@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When MAX_CONCURRENT=1 and a cycle is in progress, the trigger server
returns 429. This is expected behavior, not an error — the previous
curl -f treated it as failure (exit code 22).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Change trigger-server MAX_CONCURRENT default from 3 to 1 to prevent
overlapping cycles that duplicate GitHub issue comments
- Add SIGTERM/SIGINT handling to trigger-server so running scripts finish
gracefully on service restart instead of being killed mid-flight
- Add cleanup trap to refactor.sh for worktree/tempfile cleanup on exit
  (trap sketch after this list)
- Add pre-cycle cleanup of stale worktrees, merged branches, and
abandoned PRs from previously interrupted cycles
- Add mandatory Lifecycle Management section to team lead prompt requiring
shutdown_request to all teammates before exiting
- Add dedup checks to community-coordinator: check existing comments
before posting to prevent duplicate acknowledgments/resolutions
- Pass issue number in workflow trigger reason for better logging
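Sketch of the exit trap (the cleanup body is illustrative):
  cleanup() {
    git worktree remove --force "$worktree_dir" 2>/dev/null || true
    rm -f "${tmp_files[@]:-}" 2>/dev/null || true
  }
  trap cleanup EXIT INT TERM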
Co-authored-by: A <6723574+louisgv@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The Sprite start service API (/services/{name}/start) returns
"service name required" for all service names — appears to be an API
bug. Switched to hitting the sprite's public URL directly with
TRIGGER_SECRET bearer auth instead.
- Re-added TRIGGER_SECRET auth to trigger-server.ts
- Set sprite url_settings.auth to "public"
- Updated both workflows to use SPRITE_URL + TRIGGER_SECRET pattern
- Aligned workflow structure (both use same env vars and curl format)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
SERVICE_NAME env var may conflict with GitHub Actions internals.
Inline the secrets directly in the URL template instead.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The Sprite start service API returns streaming NDJSON, causing curl -f
to fail with exit code 22. Use duration=0s to return immediately and
drop -f flag since the response is streaming.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Sprite may take time to wake from pause, causing --max-time 30 to fail
with exit code 22.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Replace SPRITE_URL/SPRITE_SECRET pattern with SPRITE_NAME/SERVICE_NAME
- Use Sprite start service API endpoint (api.sprites.dev)
- Share SPRITE_TOKEN across all services
- Update skill documentation to reflect new approach
- Delete deprecated URL/SECRET based secrets
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add max 3 concurrent run limits:
- GitHub Actions: concurrency groups prevent workflow queue buildup
- trigger-server: tracks concurrent runs, rejects with 429 if at max
- Configurable via MAX_CONCURRENT env var (defaults to 3)
- Returns running count and max in trigger response
This prevents resource exhaustion when workflows trigger frequently.
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>