From dfa217ab513afc8beb28a9e3194e6a373ee23a64 Mon Sep 17 00:00:00 2001 From: Deluan Date: Fri, 25 Apr 2025 12:54:29 -0400 Subject: [PATCH] docs(scanner): add overview README document Signed-off-by: Deluan --- scanner/README.md | 466 ++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 466 insertions(+) create mode 100644 scanner/README.md diff --git a/scanner/README.md b/scanner/README.md new file mode 100644 index 000000000..b2c682381 --- /dev/null +++ b/scanner/README.md @@ -0,0 +1,466 @@ +# Navidrome Scanner: Technical Overview + +This document provides a comprehensive technical explanation of Navidrome's music library scanner system. + +## Architecture Overview + +The Navidrome scanner is built on a multi-phase pipeline architecture designed for efficient processing of music files. It systematically traverses file system directories, processes metadata, and maintains a database representation of the music library. A key performance feature is that some phases run sequentially while others execute in parallel. + +```mermaid +flowchart TD + subgraph "Scanner Execution Flow" + Controller[Scanner Controller] --> Scanner[Scanner Implementation] + + Scanner --> Phase1[Phase 1: Folders Scan] + Phase1 --> Phase2[Phase 2: Missing Tracks] + + Phase2 --> ParallelPhases + + subgraph ParallelPhases["Parallel Execution"] + Phase3[Phase 3: Refresh Albums] + Phase4[Phase 4: Playlist Import] + end + + ParallelPhases --> FinalSteps[Final Steps: GC + Stats] + end + + %% Triggers that can initiate a scan + FileChanges[File System Changes] -->|Detected by| Watcher[Filesystem Watcher] + Watcher -->|Triggers| Controller + + ScheduledJob[Scheduled Job] -->|Based on Scanner.Schedule| Controller + ServerStartup[Server Startup] -->|If Scanner.ScanOnStartup=true| Controller + ManualTrigger[Manual Scan via UI/API] -->|Admin user action| Controller + CLICommand[Command Line: navidrome scan] -->|Direct invocation| Controller + PIDChange[PID Configuration Change] -->|Forces full scan| Controller + DBMigration[Database Migration] -->|May require full scan| Controller + + Scanner -.->|Alternative| External[External Scanner Process] +``` + +The execution flow shows that Phases 1 and 2 run sequentially, while Phases 3 and 4 execute in parallel to maximize performance before the final processing steps. + +## Core Components + +### Scanner Controller (`controller.go`) + +This is the entry point for all scanning operations. It provides: + +- Public API for initiating scans and checking scan status +- Event broadcasting to notify clients about scan progress +- Serialization of scan operations (prevents concurrent scans) +- Progress tracking and monitoring +- Error collection and reporting + +```go +type Scanner interface { + // ScanAll starts a full scan of the music library. This is a blocking operation. + ScanAll(ctx context.Context, fullScan bool) (warnings []string, err error) + Status(context.Context) (*StatusInfo, error) +} +``` + +### Scanner Implementation (`scanner.go`) + +The primary implementation that orchestrates the four-phase scanning pipeline. Each phase follows the Phase interface pattern: + +```go +type phase[T any] interface { + producer() ppl.Producer[T] + stages() []ppl.Stage[T] + finalize(error) error + description() string +} +``` + +This design enables: +- Type-safe pipeline construction with generics +- Modular phase implementation +- Separation of concerns +- Easy measurement of performance + +### External Scanner (`external.go`) + +The External Scanner is a specialized implementation that offloads the scanning process to a separate subprocess. This is specifically designed to address memory management challenges in long-running Navidrome instances. + +```go +// scannerExternal is a scanner that runs an external process to do the scanning. It is used to avoid +// memory leaks or retention in the main process, as the scanner can consume a lot of memory. The +// external process will be spawned with the same executable as the current process, and will run +// the "scan" command with the "--subprocess" flag. +// +// The external process will send progress updates to the main process through its STDOUT, and the main +// process will forward them to the caller. +``` + +```mermaid +sequenceDiagram + participant MP as Main Process + participant ES as External Scanner + participant SP as Subprocess (navidrome scan --subprocess) + participant FS as File System + participant DB as Database + + Note over MP: DevExternalScanner=true + MP->>ES: ScanAll(ctx, fullScan) + activate ES + + ES->>ES: Locate executable path + ES->>SP: Start subprocess with args:
scan --subprocess --configfile ... etc. + activate SP + + Note over ES,SP: Create pipe for communication + + par Subprocess executes scan + SP->>FS: Read files & metadata + SP->>DB: Update database + and Main process monitors progress + loop For each progress update + SP->>ES: Send encoded progress info via stdout pipe + ES->>MP: Forward progress info + end + end + + SP-->>ES: Subprocess completes (success/error) + deactivate SP + ES-->>MP: Return aggregated warnings/errors + deactivate ES +``` + +Technical details: + +1. **Process Isolation** + - Spawns a separate process using the same executable + - Uses the `--subprocess` flag to indicate it's running as a child process + - Preserves configuration by passing required flags (`--configfile`, `--datafolder`, etc.) + +2. **Inter-Process Communication** + - Uses a pipe for bidirectional communication + - Encodes/decodes progress updates using Go's `gob` encoding for efficient binary transfer + - Properly handles process termination and error propagation + +3. **Memory Management Benefits** + - Scanning operations can be memory-intensive, especially with large music libraries + - Memory leaks or excessive allocations are automatically cleaned up when the process terminates + - Main Navidrome process remains stable even if scanner encounters memory-related issues + +4. **Error Handling** + - Detects non-zero exit codes from the subprocess + - Propagates error messages back to the main process + - Ensures resources are properly cleaned up, even in error conditions + +## Scanning Process Flow + +### Phase 1: Folder Scan (`phase_1_folders.go`) + +This phase handles the initial traversal and media file processing. + +```mermaid +flowchart TD + A[Start Phase 1] --> B{Full Scan?} + B -- Yes --> C[Scan All Folders] + B -- No --> D[Scan Modified Folders] + C --> E[Read File Metadata] + D --> E + E --> F[Create Artists] + E --> G[Create Albums] + F --> H[Save to Database] + G --> H + H --> I[Mark Missing Folders] + I --> J[End Phase 1] +``` + +**Technical implementation details:** + +1. **Folder Traversal** + - Uses `walkDirTree` to traverse the directory structure + - Handles symbolic links and hidden files + - Processes `.ndignore` files for exclusions + - Maps files to appropriate types (audio, image, playlist) + +2. **Metadata Extraction** + - Processes files in batches (defined by `filesBatchSize = 200`) + - Extracts metadata using the configured storage backend + - Converts raw metadata to `MediaFile` objects + - Collects and normalizes tag information + +3. **Album and Artist Creation** + - Groups tracks by album ID + - Creates album records from track metadata + - Handles album ID changes by tracking previous IDs + - Creates artist records from track participants + +4. **Database Persistence** + - Uses transactions for atomic updates + - Preserves album annotations across ID changes + - Updates library-artist mappings + - Marks missing tracks for later processing + - Pre-caches artwork for performance + +### Phase 2: Missing Tracks Processing (`phase_2_missing_tracks.go`) + +This phase identifies tracks that have moved or been deleted. + +```mermaid +flowchart TD + A[Start Phase 2] --> B[Load Libraries] + B --> C[Get Missing and Matching Tracks] + C --> D[Group by PID] + D --> E{Match Type?} + E -- Exact --> F[Update Path] + E -- Same PID --> G[Update If Only One] + E -- Equivalent --> H[Update If No Better Match] + F --> I[End Phase 2] + G --> I + H --> I +``` + +**Technical implementation details:** + +1. **Track Identification Strategy** + - Uses persistent identifiers (PIDs) to track tracks across scans + - Loads missing tracks and potential matches from the database + - Groups tracks by PID to limit comparison scope + +2. **Match Analysis** + - Applies three levels of matching criteria: + - Exact match (full metadata equivalence) + - Single match for a PID + - Equivalent match (same base path or similar metadata) + - Prioritizes matches in order of confidence + +3. **Database Update Strategy** + - Preserves the original track ID + - Updates the path to the new location + - Deletes the duplicate entry + - Uses transactions to ensure atomicity + +### Phase 3: Album Refresh (`phase_3_refresh_albums.go`) + +This phase updates album information based on the latest track metadata. + +```mermaid +flowchart TD + A[Start Phase 3] --> B[Load Touched Albums] + B --> C[Filter Unmodified] + C --> D{Changes Detected?} + D -- Yes --> E[Refresh Album Data] + D -- No --> F[Skip] + E --> G[Update Database] + F --> H[End Phase 3] + G --> H + H --> I[Refresh Statistics] +``` + +**Technical implementation details:** + +1. **Album Selection Logic** + - Loads albums that have been "touched" in previous phases + - Uses a producer-consumer pattern for efficient processing + - Retrieves all media files for each album for completeness + +2. **Change Detection** + - Rebuilds album metadata from associated tracks + - Compares album attributes for changes + - Skips albums with no media files + - Avoids unnecessary database updates + +3. **Statistics Refreshing** + - Updates album play counts + - Updates artist play counts + - Maintains consistency between related entities + +### Phase 4: Playlist Import (`phase_4_playlists.go`) + +This phase imports and updates playlists from the file system. + +```mermaid +flowchart TD + A[Start Phase 4] --> B{AutoImportPlaylists?} + B -- No --> C[Skip] + B -- Yes --> D{Admin User Exists?} + D -- No --> E[Log Warning & Skip] + D -- Yes --> F[Load Folders with Playlists] + F --> G{For Each Folder} + G --> H[Read Directory] + H --> I{For Each Playlist} + I --> J[Import Playlist] + J --> K[Pre-cache Artwork] + K --> L[End Phase 4] + C --> L + E --> L +``` + +**Technical implementation details:** + +1. **Playlist Discovery** + - Loads folders known to contain playlists + - Focuses on folders that have been touched in previous phases + - Handles both playlist formats (M3U, NSP) + +2. **Import Process** + - Uses the core.Playlists service for import + - Handles both regular and smart playlists + - Updates existing playlists when changed + - Pre-caches playlist cover art + +3. **Configuration Awareness** + - Respects the AutoImportPlaylists setting + - Requires an admin user for playlist import + - Logs appropriate messages for configuration issues + +## Final Processing Steps + +After the four main phases, several finalization steps occur: + +1. **Garbage Collection** + - Removes dangling tracks with no files + - Cleans up empty albums + - Removes orphaned artists + - Deletes orphaned annotations + +2. **Statistics Refresh** + - Updates artist song and album counts + - Refreshes tag usage statistics + - Updates aggregate metrics + +3. **Library Status Update** + - Marks scan as completed + - Updates last scan timestamp + - Stores persistent ID configuration + +4. **Database Optimization** + - Performs database maintenance + - Optimizes tables and indexes + - Reclaims space from deleted records + +## File System Watching + +The watcher system (`watcher.go`) provides real-time monitoring of file system changes: + +```mermaid +flowchart TD + A[Start Watcher] --> B[For Each Library] + B --> C[Start Library Watcher] + C --> D[Monitor File Events] + D --> E{Change Detected?} + E -- Yes --> F[Wait for More Changes] + F --> G{Time Elapsed?} + G -- Yes --> H[Trigger Scan] + G -- No --> F + H --> I[Wait for Scan Completion] + I --> D +``` + +**Technical implementation details:** + +1. **Event Throttling** + - Uses a timer to batch changes + - Prevents excessive rescanning + - Configurable wait period + +2. **Library-specific Watching** + - Each library has its own watcher goroutine + - Translates paths to library-relative paths + - Filters irrelevant changes + +3. **Platform Adaptability** + - Uses storage-provided watcher implementation + - Supports different notification mechanisms per platform + - Graceful fallback when watching is not supported + +## Edge Cases and Optimizations + +### Handling Album ID Changes + +The scanner carefully manages album identity across scans: +- Tracks previous album IDs to handle ID generation changes +- Preserves annotations when IDs change +- Maintains creation timestamps for consistent sorting + +### Detecting Moved Files + +A sophisticated algorithm identifies moved files: +1. Groups missing and new files by their Persistent ID +2. Applies multiple matching strategies in priority order +3. Updates paths rather than creating duplicate entries + +### Resuming Interrupted Scans + +If a scan is interrupted: +- The next scan detects this condition +- Forces a full scan if the previous one was a full scan +- Continues from where it left off for incremental scans + +### Memory Efficiency + +Several strategies minimize memory usage: +- Batched file processing (200 files at a time) +- External scanner process option +- Database-side filtering where possible +- Stream processing with pipelines + +### Concurrency Control + +The scanner implements a sophisticated concurrency model to optimize performance: + +1. **Phase-Level Parallelism**: + - Phases 1 and 2 run sequentially due to their dependencies + - Phases 3 and 4 run in parallel using the `chain.RunParallel()` function + - Final steps run sequentially to ensure data consistency + +2. **Within-Phase Concurrency**: + - Each phase has configurable concurrency for its stages + - For example, `phase_1_folders.go` processes folders concurrently: `ppl.NewStage(p.processFolder, ppl.Name("process folder"), ppl.Concurrency(conf.Server.DevScannerThreads))` + - Multiple stages can exist within a phase, each with its own concurrency level + +3. **Pipeline Architecture Benefits**: + - Producer-consumer pattern minimizes memory usage + - Work is streamed through stages rather than accumulated + - Back-pressure is automatically managed + +4. **Thread Safety Mechanisms**: + - Atomic counters for statistics gathering + - Mutex protection for shared resources + - Transactional database operations + +## Configuration Options + +The scanner's behavior can be customized through several configuration settings that directly affect its operation: + +### Core Scanner Options + +| Setting | Description | Default | +|-------------------------|------------------------------------------------------------------|----------------| +| `Scanner.Enabled` | Whether the automatic scanner is enabled | true | +| `Scanner.Schedule` | Cron expression or duration for scheduled scans (e.g., "@daily") | "0" (disabled) | +| `Scanner.ScanOnStartup` | Whether to scan when the server starts | true | +| `Scanner.WatcherWait` | Delay before triggering scan after file changes detected | 5s | +| `Scanner.ArtistJoiner` | String used to join multiple artists in track metadata | " • " | + +### Playlist Processing + +| Setting | Description | Default | +|-----------------------------|----------------------------------------------------------|---------| +| `PlaylistsPath` | Path(s) to search for playlists (supports glob patterns) | "" | +| `AutoImportPlaylists` | Whether to import playlists during scanning | true | + +### Performance Options + +| Setting | Description | Default | +|----------------------|-----------------------------------------------------------|---------| +| `DevExternalScanner` | Use external process for scanning (reduces memory issues) | true | +| `DevScannerThreads` | Number of concurrent processing threads during scanning | 5 | + +### Persistent ID Options + +| Setting | Description | Default | +|-------------|---------------------------------------------------------------------|---------------------------------------------------------------------| +| `PID.Track` | Format for track persistent IDs (critical for tracking moved files) | "musicbrainz_trackid\|albumid,discnumber,tracknumber,title" | +| `PID.Album` | Format for album persistent IDs (affects album grouping) | "musicbrainz_albumid\|albumartistid,album,albumversion,releasedate" | + +These options can be set in the Navidrome configuration file (e.g., `navidrome.toml`) or via environment variables with the `ND_` prefix (e.g., `ND_SCANNER_ENABLED=false`). For environment variables, dots in option names are replaced with underscores. + +## Conclusion + +The Navidrome scanner represents a sophisticated system for efficiently managing music libraries. Its phase-based pipeline architecture, careful handling of edge cases, and performance optimizations allow it to handle libraries of significant size while maintaining data integrity and providing a responsive user experience. \ No newline at end of file