fix(sprite): add retry for list failures, increase timeout, refresh auth on expiry (#2936)

Three fixes for Sprite E2E failures in long-running batches (73+ min):

1. Retry `_sprite_provision_verify`: list failures now retry 3x with
   exponential backoff (5s, 10s, 20s) instead of failing immediately.
   Fixes kilocode batch 6 "Could not list Sprite instances" errors.

2. Increase `CREATE_TIMEOUT_SECS` default from 300s to 600s and add
   `Client.Timeout`, `request canceled`, and `authentication failed`
   to the transient error retry pattern in `spriteRetry`, which now
   uses linear backoff (3s * attempt) instead of a fixed 3s delay.
   Fixes hermes batch 7 HTTP timeout errors.

3. Add `_sprite_refresh_auth` + `cloud_refresh_auth` interface. The
   E2E orchestrator calls `cloud_refresh_auth` before each provisioning
   batch. For Sprite, this re-validates the token via `sprite org list`
   and attempts `sprite auth refresh` if expired.
   Fixes junie batch 8 "authentication failed" errors.
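
The retry behavior described in item 1 can be sketched roughly as follows. This is a minimal illustration, not the code from this commit: `retry_with_backoff` and `SLEEP_CMD` are hypothetical names, and it assumes "retry 3x" means three retries after the initial attempt, with the delay doubling each time (5s, 10s, 20s).

```shell
# Hypothetical helper (not from the commit): retry a command with
# exponential backoff. With retries=3 and delay=5 the waits are 5s, 10s, 20s.
SLEEP_CMD="${SLEEP_CMD:-sleep}"   # overridable so the sketch can be dry-run

retry_with_backoff() {
  local retries="$1" delay="$2"
  shift 2
  local attempt=0
  while true; do
    # Run the wrapped command; stop as soon as it succeeds.
    if "$@"; then
      return 0
    fi
    # Give up once all retries are used.
    if [ "$attempt" -ge "$retries" ]; then
      return 1
    fi
    attempt=$((attempt + 1))
    printf 'attempt failed; retry %d/%d in %ss\n' "$attempt" "$retries" "$delay" >&2
    "$SLEEP_CMD" "$delay"
    delay=$((delay * 2))          # exponential: double the wait each time
  done
}
```

A call such as `retry_with_backoff 3 5 some_list_command` would then wrap whatever command lists instances; `some_list_command` is a placeholder here. For the linear variant mentioned in item 2, the doubling line would instead compute the wait as a base delay times the attempt number.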

Fixes #2934

Agent: ux-engineer

Co-authored-by: B <6723574+louisgv@users.noreply.github.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
A 2026-03-23 21:47:58 -07:00 committed by GitHub
parent 50319e0d39
commit e9cbab5b7f
4 changed files with 116 additions and 23 deletions


@@ -133,6 +133,15 @@ cloud_install_wait() {
fi
}
# Refresh auth token if the cloud driver supports it (e.g. Sprite tokens
# expire after ~60 min). Called before each provisioning batch to prevent
# auth expiry failures in long-running E2E suites. See #2934.
cloud_refresh_auth() {
  if type "_${ACTIVE_CLOUD}_refresh_auth" >/dev/null 2>&1; then
    "_${ACTIVE_CLOUD}_refresh_auth" "$@"
  fi
}
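
The optional-hook dispatch above could be exercised along these lines. This is only a sketch under stated assumptions: `run_batches`, `REFRESH_CALLS`, and the stub body of `_sprite_refresh_auth` are hypothetical stand-ins, not the orchestrator or driver code from this commit.

```shell
# Hypothetical usage sketch; the stub hook and batch loop are illustrative.
ACTIVE_CLOUD="sprite"
REFRESH_CALLS=0

# Stand-in for the real driver hook. The real one re-validates the token;
# this stub only counts how often it is invoked.
_sprite_refresh_auth() {
  REFRESH_CALLS=$((REFRESH_CALLS + 1))
}

# Same dispatch as in the diff: call the driver hook only if it is defined.
cloud_refresh_auth() {
  if type "_${ACTIVE_CLOUD}_refresh_auth" >/dev/null 2>&1; then
    "_${ACTIVE_CLOUD}_refresh_auth" "$@"
  fi
}

# Hypothetical orchestrator loop: refresh auth before each batch.
run_batches() {
  local batch
  for batch in "$@"; do
    cloud_refresh_auth            # silently does nothing without a hook
    printf 'provisioning %s\n' "$batch"
  done
}
```

Because the hook is looked up by name at call time, a driver that defines no `_<cloud>_refresh_auth` function needs no changes, and the call returns success without doing anything.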
# ---------------------------------------------------------------------------
# Per-agent provision timeout overrides
#