Retry Configuration Guide

Open Notebook includes automatic retry capabilities for background commands to handle transient failures gracefully. This guide explains how retry works and how to configure it for your deployment.

Overview

The retry system (powered by surreal-commands v1.2.0+) automatically retries failed commands when they encounter transient errors like:

  • Database transaction conflicts during concurrent operations
  • Network failures when calling external APIs (embedding providers, LLMs)
  • Request timeouts to external services
  • Rate limits from third-party APIs

Permanent errors (invalid input, authentication failures, etc.) are not retried and fail immediately.
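
Conceptually, the distinction looks like this (a minimal sketch of the transient/permanent split; the helper below is illustrative, not part of surreal-commands):

# Illustrative only - not the actual surreal-commands implementation.
# Transient errors are worth retrying; anything else fails immediately.
TRANSIENT_ERRORS = (RuntimeError, ConnectionError, TimeoutError)

def is_transient(exc):
    """True if a later attempt has a reasonable chance of succeeding."""
    return isinstance(exc, TRANSIENT_ERRORS)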

How It Works

Architecture

Command Execution
    ↓
Try to execute
    ↓
Success? → yes → Done
    ↓ no
Transient error? (RuntimeError, ConnectionError, TimeoutError)
    ↓ no → permanent error, fail immediately
    ↓ yes
Retry with backoff and try again
    ↓
Max attempts reached?
    ↓ yes
Final failure → Report error

Retry Strategies

Exponential Jitter (default, recommended):

  • Waits: 1s → ~2s → ~4s → ~8s → ~16s (with randomization)
  • Prevents "thundering herd" when many workers retry simultaneously
  • Best for: Database conflicts, concurrent operations

Exponential:

  • Waits: 1s → 2s → 4s → 8s → 16s (predictable)
  • Good for: API rate limits (predictable backoff helps with quota reset)

Fixed:

  • Waits: 2s → 2s → 2s → 2s → 2s (constant)
  • Best for: Quick recovery scenarios

Random:

  • Waits: Random between min and max
  • Use when: You want unpredictable retry timing
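
To make the wait sequences above concrete, here is a minimal sketch of how the delays can be computed (illustrative only; the actual surreal-commands implementation may differ in details such as the jitter amount):

import random

def wait_seconds(attempt, strategy, wait_min=1.0, wait_max=30.0):
    """Delay before retry number `attempt` (1-based). Illustrative sketch."""
    if strategy == "fixed":
        return wait_min
    if strategy == "random":
        return random.uniform(wait_min, wait_max)
    # exponential: wait_min * 2^(attempt - 1), capped at wait_max
    delay = min(wait_min * (2 ** (attempt - 1)), wait_max)
    if strategy == "exponential_jitter":
        # up to 1s of extra randomness so concurrent workers
        # don't retry in lockstep (the "thundering herd")
        delay = min(delay + random.uniform(0, 1), wait_max)
    return delay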

Global Configuration

Configure default retry behavior for all commands via environment variables in your .env file:

# Enable/disable retry globally (default: true)
SURREAL_COMMANDS_RETRY_ENABLED=true

# Maximum retry attempts before giving up (default: 3)
SURREAL_COMMANDS_RETRY_MAX_ATTEMPTS=3

# Wait strategy between retry attempts (default: exponential_jitter)
# Options: exponential_jitter, exponential, fixed, random
SURREAL_COMMANDS_RETRY_WAIT_STRATEGY=exponential_jitter

# Minimum wait time between retries in seconds (default: 1)
SURREAL_COMMANDS_RETRY_WAIT_MIN=1

# Maximum wait time between retries in seconds (default: 30)
SURREAL_COMMANDS_RETRY_WAIT_MAX=30

# Worker concurrency (affects likelihood of DB conflicts)
# Higher concurrency = more conflicts but faster processing
# Lower concurrency = fewer conflicts but slower processing
SURREAL_COMMANDS_MAX_TASKS=5
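
For reference, this is roughly how a worker could resolve these variables into a retry policy (a hedged sketch using the variable names and defaults above; the parsing helper itself is illustrative, not the actual surreal-commands loader):

import os

def env_retry_defaults():
    """Illustrative sketch: resolve global retry defaults from the environment."""
    return {
        "enabled": os.getenv("SURREAL_COMMANDS_RETRY_ENABLED", "true").lower() == "true",
        "max_attempts": int(os.getenv("SURREAL_COMMANDS_RETRY_MAX_ATTEMPTS", "3")),
        "wait_strategy": os.getenv("SURREAL_COMMANDS_RETRY_WAIT_STRATEGY", "exponential_jitter"),
        "wait_min": float(os.getenv("SURREAL_COMMANDS_RETRY_WAIT_MIN", "1")),
        "wait_max": float(os.getenv("SURREAL_COMMANDS_RETRY_WAIT_MAX", "30")),
    }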

Tuning Global Defaults

For resource-constrained deployments (low CPU/memory):

SURREAL_COMMANDS_MAX_TASKS=2
SURREAL_COMMANDS_RETRY_MAX_ATTEMPTS=3
SURREAL_COMMANDS_RETRY_WAIT_MAX=20

  • Fewer concurrent tasks reduce conflict likelihood
  • Lower max wait since conflicts are rare

For high-performance deployments (powerful servers):

SURREAL_COMMANDS_MAX_TASKS=10
SURREAL_COMMANDS_RETRY_MAX_ATTEMPTS=5
SURREAL_COMMANDS_RETRY_WAIT_MAX=30

  • More concurrent tasks for faster processing
  • More retries to handle increased conflicts

For debugging (disable retries to see immediate errors):

SURREAL_COMMANDS_RETRY_ENABLED=false

Per-Command Configuration

Individual commands can override global defaults. Open Notebook uses custom retry strategies for specific operations:

embed_chunk (Database Operations)

Handles concurrent chunk embedding with retry for transaction conflicts:

@command(
    "embed_chunk",
    app="open_notebook",
    retry={
        "max_attempts": 5,
        "wait_strategy": "exponential_jitter",
        "wait_min": 1,
        "wait_max": 30,
        "retry_on": [RuntimeError, ConnectionError, TimeoutError],
    },
)

What it retries:

  • SurrealDB transaction conflicts (RuntimeError)
  • Network failures to embedding provider (ConnectionError)
  • Request timeouts (TimeoutError)

What it doesn't retry:

  • Invalid input (ValueError)
  • Authentication errors
  • Missing embedding model

Why 5 attempts? Database conflicts are cheap to retry (local operation), so we retry more aggressively.
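
As a rough worked example: with wait_min=1 and exponential jitter, the waits between the five attempts grow approximately 1s → 2s → 4s → 8s, so even a chunk that only succeeds on the final attempt adds only about 15 seconds of delay plus jitter.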

vectorize_source & rebuild_embeddings (Orchestration)

Orchestration commands that coordinate other jobs disable retries to fail fast:

@command("vectorize_source", app="open_notebook", retry=None)

Why no retries?

  • Job submission failures should be immediately visible
  • Allows quick debugging of orchestration issues
  • Individual child jobs (embed_chunk) have their own retry logic

Common Scenarios

Issue: Vectorization fails with "transaction conflict" errors

Symptoms:

RuntimeError: Failed to commit transaction due to a read or write conflict

Solution 1 - Reduce concurrency (fewer conflicts):

SURREAL_COMMANDS_MAX_TASKS=3

Solution 2 - Increase retry attempts:

SURREAL_COMMANDS_RETRY_MAX_ATTEMPTS=7

Solution 3 - Longer backoff (give more time between retries):

SURREAL_COMMANDS_RETRY_WAIT_MAX=60

Issue: Embedding provider rate limits (429 errors)

Symptoms:

HTTP 429: Rate limit exceeded

Solution - Configure longer waits:

SURREAL_COMMANDS_RETRY_WAIT_MIN=5
SURREAL_COMMANDS_RETRY_WAIT_MAX=120
SURREAL_COMMANDS_RETRY_WAIT_STRATEGY=exponential

This gives the API quota time to reset between retries.
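
As a rough worked example: with these settings the waits grow approximately 5s → 10s → 20s → 40s and so on, capped at 120s, so a per-minute quota typically has time to reset before the retries are exhausted.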

Issue: Slow/unstable network to embedding provider

Symptoms:

TimeoutError: Request timed out
ConnectionError: Failed to establish connection

Solution - More retries with longer waits:

SURREAL_COMMANDS_RETRY_MAX_ATTEMPTS=5
SURREAL_COMMANDS_RETRY_WAIT_MAX=60

Issue: Want to see errors immediately (debugging)

Solution - Disable retries temporarily:

SURREAL_COMMANDS_RETRY_ENABLED=false

Remember to re-enable after debugging!

Monitoring Retry Behavior

Check Worker Logs

Retry attempts are logged automatically:

Transaction conflict for chunk 42 - will be retried by retry mechanism
[Retry] Attempt 2/5 for embed_chunk, waiting 2.3s
[Retry] Attempt 3/5 for embed_chunk, waiting 5.1s
Successfully embedded chunk 42

Look for Retry Patterns

High retry rate (many retries happening):

  • Consider reducing SURREAL_COMMANDS_MAX_TASKS
  • Check if external services are slow/unstable
  • May need to increase SURREAL_COMMANDS_RETRY_WAIT_MAX

Retries exhausted (commands failing after all retries):

  • Check if issue is actually permanent (auth error, invalid config)
  • May need to increase SURREAL_COMMANDS_RETRY_MAX_ATTEMPTS
  • Check external service status

No retries (operations always succeed first try):

  • Your retry configuration is working well!
  • Could potentially increase SURREAL_COMMANDS_MAX_TASKS for better performance
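
If you want to quantify retry rates, a small script along these lines can help (a sketch that assumes the [Retry] log format shown above and a hypothetical worker.log file; adjust the pattern and path to your setup):

import re
import sys
from collections import Counter

# Count "[Retry] Attempt X/Y for <command>" lines per command.
pattern = re.compile(r"\[Retry\] Attempt \d+/\d+ for (\S+),")
counts = Counter()
with open(sys.argv[1] if len(sys.argv) > 1 else "worker.log") as log:
    for line in log:
        match = pattern.search(line)
        if match:
            counts[match.group(1)] += 1

for command_name, count in counts.most_common():
    print(f"{command_name}: {count} retries")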

Best Practices

Do

  • Use exponential_jitter for concurrent operations (prevents thundering herd)
  • Set reasonable max_attempts (3-5 for most operations)
  • Monitor retry rates to tune configuration
  • Test retry behavior with large documents after config changes
  • Document custom retry strategies in your deployment notes

Don't

  • Don't set max_attempts too high (>10) - may mask real issues
  • Don't use fixed strategy for concurrent operations - causes thundering herd
  • Don't disable retries in production unless debugging
  • Don't set wait_max too low (<5s) - may exhaust retries too quickly
  • Don't forget to re-enable retries after debugging

Advanced: Custom Retry Logic

If you're developing custom commands, you can configure retry behavior:

from surreal_commands import command

@command(
    "my_custom_command",
    app="my_app",
    retry={
        "max_attempts": 3,
        "wait_strategy": "exponential_jitter",
        "wait_min": 1,
        "wait_max": 30,
        "retry_on": [RuntimeError, ConnectionError, TimeoutError],
    },
)
async def my_custom_command(input_data):
    try:
        # Your command logic
        result = await some_operation()
        return result

    except RuntimeError:
        # Re-raise to trigger retry
        raise

    except ValueError as e:
        # Don't retry - permanent error
        return {"success": False, "error": str(e)}

Key points:

  • Exceptions in retry_on must be re-raised to trigger retries
  • Other exceptions should be caught and returned as failures
  • Transient errors: RuntimeError, ConnectionError, TimeoutError
  • Permanent errors: ValueError, AuthenticationError, etc.
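
One way to exercise this behavior locally is a command that fails transiently a fixed number of times before succeeding (a hypothetical example; flaky_command and its call counter are purely illustrative):

from surreal_commands import command

_calls = {"n": 0}

@command(
    "flaky_command",
    app="my_app",
    retry={
        "max_attempts": 5,
        "wait_strategy": "fixed",
        "wait_min": 1,
        "wait_max": 1,
        "retry_on": [RuntimeError],
    },
)
async def flaky_command(input_data):
    # Fail the first two attempts with a transient error, then succeed.
    _calls["n"] += 1
    if _calls["n"] < 3:
        raise RuntimeError("simulated transient failure")
    return {"success": True, "attempts": _calls["n"]}

Watching the worker logs while this runs should show two retry attempts followed by a success.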

Troubleshooting

Retries not working

Check 1: Is retry enabled?

grep SURREAL_COMMANDS_RETRY_ENABLED .env
# Should show: SURREAL_COMMANDS_RETRY_ENABLED=true

Check 2: Is the exception being re-raised? Check your command code - exceptions must be re-raised to trigger retries.

Check 3: Is the exception in retry_on list? Only exceptions listed in retry_on are retried.
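
A common mistake is catching a transient error and returning a failure result, which means the retry mechanism never sees the exception (a fragment inside an async command body, using some_operation as in the example above):

# Wrong - swallowing a transient error means no retry will happen:
try:
    result = await some_operation()
except ConnectionError as e:
    return {"success": False, "error": str(e)}

# Right - re-raise transient errors so they can trigger a retry:
try:
    result = await some_operation()
except ConnectionError:
    raise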

Worker crashing on errors

Issue: Worker crashes instead of retrying

Cause: Exception is not being caught by retry mechanism

Solution: Check that the exception type is in the retry_on list and is being re-raised in the command.

Retries taking too long

Issue: Commands spend a long time retrying before they finally fail

Cause: max_attempts and/or wait_max are set too high

Solution: Reduce retry parameters:

SURREAL_COMMANDS_RETRY_MAX_ATTEMPTS=3
SURREAL_COMMANDS_RETRY_WAIT_MAX=30
