feat: Add RUN_TIMEOUT_MS tuning guide and set default to 2 hours

- Default RUN_TIMEOUT_MS increased to 7200000 (2h) based on observed
  team cycle durations of 1-2 hours
- SKILL.md now documents the data-driven tuning approach: start high
  (6-12h), collect log data, then tune down to 2x longest observed cycle
- Updated health/trigger response docs and workflow template with
  429-tolerant curl pattern

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
B 2026-02-10 04:05:18 +00:00
parent 1451d51784
commit c2e83a4453
2 changed files with 69 additions and 13 deletions


@ -47,18 +47,25 @@ GitHub Actions (cron / events / manual)
The trigger server lives at:
`/home/sprite/spawn/.claude/skills/setup-trigger-service/trigger-server.ts`
It reads two required env vars:
- `TRIGGER_SECRET` — Bearer token for authenticating requests
- `TARGET_SCRIPT` — Absolute path to the script to run on trigger
It reads env vars:
- `TRIGGER_SECRET` (required) — Bearer token for authenticating requests
- `TARGET_SCRIPT` (required) — Absolute path to the script to run on trigger
- `REPO_ROOT` (optional) — Working directory for the script (defaults to script's parent dir)
- `MAX_CONCURRENT` (optional) — Max parallel runs (default: `1`)
- `RUN_TIMEOUT_MS` (optional) — Kill runs older than this in milliseconds (default: `7200000` = 2 hours)
**Stale run detection:**
Before accepting a trigger, the server checks if tracked processes are still alive (`kill -0`). Dead processes are reaped automatically. Runs exceeding `RUN_TIMEOUT_MS` are force-killed to free the slot.
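The reaping pass described above can be sketched roughly as follows. This is a minimal illustration, not the code in `trigger-server.ts`: the `TrackedRun` shape, the `reapStaleRuns` name, and the injected `isAlive` / `forceKill` callbacks are all hypothetical, chosen so the logic can be read and tested without sending real signals. In a real server, `isAlive` would presumably wrap a signal-0 liveness check.

```typescript
// Hypothetical shape of a tracked run; the real server may store more fields.
interface TrackedRun {
  pid: number;
  startedAt: number; // epoch milliseconds
}

// Returns the runs that still occupy a slot after reaping.
// isAlive and forceKill are injected (assumptions) so no real signals are sent here.
function reapStaleRuns(
  runs: TrackedRun[],
  now: number,
  timeoutMs: number,
  isAlive: (pid: number) => boolean,
  forceKill: (pid: number) => void,
): TrackedRun[] {
  const survivors: TrackedRun[] = [];
  for (const run of runs) {
    if (!isAlive(run.pid)) {
      continue; // process already exited: drop it and free the slot
    }
    if (now - run.startedAt > timeoutMs) {
      forceKill(run.pid); // run exceeded the timeout: kill it and free the slot
      continue;
    }
    survivors.push(run);
  }
  return survivors;
}
```

Keeping the liveness check and the kill behind callbacks is only a design choice for this sketch; it makes the dead-process and timed-out cases easy to exercise separately.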
**Endpoints:**
- `GET /health` → `{"status":"ok"}` (no auth, for health checks)
- `POST /trigger` → validates `Authorization: Bearer <secret>`, runs target script in background
- `GET /health` → `{"status":"ok","running":N,"max":N,"timeoutSec":N,"runs":[...]}` (no auth, shows per-run pid/age)
- `POST /trigger` → validates `Authorization: Bearer <secret>`, reaps stale runs, then runs target script in background
**Responses:**
- `200` → `{"triggered":true,"reason":"...","running":N,"max":N}` on success
- `401` → `{"error":"unauthorized"}` if bearer token is wrong
- `429` → `{"error":"max concurrent runs reached"}` if at limit (default 3, configurable via `MAX_CONCURRENT` env var)
- `429` → `{"error":"max concurrent runs reached","oldestAgeSec":N}` if at limit
- `503` → `{"error":"server is shutting down"}` during graceful shutdown
## Step 2: Generate a trigger secret
@ -78,6 +85,7 @@ Create `start-<service-name>.sh` in the skill directory:
#!/bin/bash
export TRIGGER_SECRET="<secret-from-step-2>"
export TARGET_SCRIPT="/home/sprite/spawn/.claude/skills/setup-trigger-service/<target-script>.sh"
export REPO_ROOT="/home/sprite/spawn"
exec bun run /home/sprite/spawn/.claude/skills/setup-trigger-service/trigger-server.ts
```
@ -191,8 +199,18 @@ jobs:
SPRITE_URL: ${{ secrets.<SERVICE_NAME>_SPRITE_URL }}
TRIGGER_SECRET: ${{ secrets.<SERVICE_NAME>_TRIGGER_SECRET }}
run: |
curl -sf -X POST "${SPRITE_URL}/trigger?reason=${{ github.event_name }}" \
-H "Authorization: Bearer ${TRIGGER_SECRET}"
HTTP_CODE=$(curl -s -o /tmp/response.json -w '%{http_code}' -X POST \
"${SPRITE_URL}/trigger?reason=${{ github.event_name }}" \
-H "Authorization: Bearer ${TRIGGER_SECRET}")
cat /tmp/response.json
if [ "$HTTP_CODE" = "429" ]; then
echo "Cycle already running, skipping"
elif [ "$HTTP_CODE" -ge 200 ] && [ "$HTTP_CODE" -lt 300 ]; then
echo "Triggered successfully"
else
echo "Failed with HTTP $HTTP_CODE"
exit 1
fi
```
**Cron examples:**
@ -222,7 +240,45 @@ printf '<secret-from-step-2>' | gh secret set <SERVICE_NAME>_TRIGGER_SECRET --re
| `<SERVICE>_SPRITE_URL` | `DISCOVERY_SPRITE_URL` | Public URL of the Sprite |
| `<SERVICE>_TRIGGER_SECRET` | `DISCOVERY_TRIGGER_SECRET` | Bearer token for the trigger server |
## Step 8: Ensure the target script is single-cycle
## Step 8: Tune RUN_TIMEOUT_MS
`RUN_TIMEOUT_MS` controls how long a run can execute before the trigger server force-kills it and frees the slot. **Start high, then tune down based on real data.**
### Recommended approach
1. **Start with a high timeout (6-12 hours).** You don't know how long cycles take yet. A too-short timeout kills legitimate runs mid-work, leaving orphaned branches, half-merged PRs, and dirty worktrees.
2. **Run several cycles and collect data.** Check the trigger server logs for actual run durations:
```bash
# Look for "finished" lines with duration
grep 'finished' /.sprite/logs/services/<service-name>.log
```
3. **Set the timeout to 2x your longest observed cycle.** For example, if cycles take 30-90 minutes, set `RUN_TIMEOUT_MS` to `10800000` (3 hours). This gives headroom for slow cycles without letting truly hung processes block the slot forever.
4. **Re-evaluate after changes.** Adding more agents to a team, increasing the scope of work, or hitting API rate limits can all increase cycle time. Check logs periodically.
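The 2x rule from step 3 is plain arithmetic. Here is a small sketch of it, assuming you have already read the run durations (in minutes) out of the logs by hand. The function name `suggestTimeoutMs` is hypothetical and is not part of the trigger server.

```typescript
// Suggest RUN_TIMEOUT_MS as 2x the longest observed cycle.
// durationsMin: observed cycle durations in minutes (collected manually from logs).
function suggestTimeoutMs(durationsMin: number[]): number {
  if (durationsMin.length === 0) {
    throw new Error("need at least one observed cycle duration");
  }
  const longestMin = Math.max(...durationsMin);
  return longestMin * 2 * 60 * 1000; // minutes -> milliseconds, doubled for headroom
}

// Cycles of 30-90 minutes, as in the example in step 3.
console.log(suggestTimeoutMs([30, 45, 90])); // 10800000 (3 hours)
```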
### Current values (based on observed data)
| Service | Observed cycle time | RUN_TIMEOUT_MS | Rationale |
|---------|-------------------|----------------|-----------|
| Discovery (improve.sh) | 1-2 hours | `7200000` (2h) | Team cycles with 5+ agents, worktrees, PRs |
| Refactor (refactor.sh) | TBD | `7200000` (2h) | Start high, tune after data |
To override, add to the wrapper script:
```bash
export RUN_TIMEOUT_MS=14400000 # 4 hours
```
Or set it to a very high value initially:
```bash
export RUN_TIMEOUT_MS=43200000 # 12 hours (safe starting point)
```
## Step 9: Ensure the target script is single-cycle
The target script (e.g., `refactor.sh`, `improve.sh`) MUST:
@ -295,7 +351,7 @@ rm -rf /tmp/spawn-worktrees
These conventions are already embedded in the prompts of `improve.sh` and `refactor.sh`. When adding new service scripts, copy the same patterns.
## Step 9: Commit and push
## Step 10: Commit and push
Commit the workflow file and .gitignore changes (but NOT the wrapper script):
@ -305,7 +361,7 @@ git commit -m "feat: Add GitHub Actions trigger for <service-name>"
git push origin main
```
## Step 10: Test end-to-end
## Step 11: Test end-to-end
```bash
# Trigger manually via GitHub Actions
@ -348,7 +404,7 @@ To add a new automation script (beyond improve.sh and refactor.sh):
| curl exits with code 22 | The sprite URL may require auth — run Step 5 to set `auth: "public"` |
| Script runs but nothing happens | Check the target script works standalone: `bash /path/to/script.sh` |
| Sprite doesn't wake | Verify `<SERVICE>_SPRITE_URL` secret matches the Sprite's public URL |
| `{"error":"max concurrent runs reached"}` | Max concurrent limit reached (default 3) — wait for runs to finish or increase `MAX_CONCURRENT` env var in wrapper script |
| `{"error":"max concurrent runs reached"}` | Max concurrent limit reached (default 1) — wait for runs to finish or increase `MAX_CONCURRENT` env var in wrapper script |
| env vars not passed | Use the wrapper script pattern (not `--env` flag with commas in values) |
| GitHub Actions secret is empty | Check `gh secret list --repo <owner>/<repo>` and re-set with `printf` (not `echo`, to avoid trailing newline) |


@ -22,7 +22,7 @@ const TRIGGER_SECRET = process.env.TRIGGER_SECRET ?? "";
const TARGET_SCRIPT = process.env.TARGET_SCRIPT ?? "";
const MAX_CONCURRENT = parseInt(process.env.MAX_CONCURRENT ?? "1", 10);
const RUN_TIMEOUT_MS = parseInt(
process.env.RUN_TIMEOUT_MS ?? String(30 * 60 * 1000),
process.env.RUN_TIMEOUT_MS ?? String(2 * 60 * 60 * 1000),
10
);
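
One thing to keep in mind with the `parseInt` pattern above: `parseInt` returns `NaN` for a non-numeric string, and every comparison against `NaN` is false, so a mistyped `RUN_TIMEOUT_MS` could silently disable the timeout, depending on how the age check is written. A hedged sketch of a parser that falls back to the default follows. `parsePositiveInt` is a hypothetical helper, not something that exists in `trigger-server.ts`.

```typescript
// Parse a positive integer from an env-style string, falling back when unset or invalid.
function parsePositiveInt(raw: string | undefined, fallback: number): number {
  const parsed = parseInt(raw ?? "", 10);
  return Number.isFinite(parsed) && parsed > 0 ? parsed : fallback;
}

const TWO_HOURS_MS = 2 * 60 * 60 * 1000;
console.log(parsePositiveInt("14400000", TWO_HOURS_MS)); // 14400000
console.log(parsePositiveInt("two hours", TWO_HOURS_MS)); // 7200000 (fallback)
console.log(parsePositiveInt(undefined, TWO_HOURS_MS)); // 7200000 (fallback)
```

Note that `parseInt` still accepts strings with a numeric prefix (for example `"4h"` parses as `4`), so stricter validation would need an explicit format check.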