From dfa217ab513afc8beb28a9e3194e6a373ee23a64 Mon Sep 17 00:00:00 2001
From: Deluan <deluan@navidrome.org>
Date: Fri, 25 Apr 2025 12:54:29 -0400
Subject: [PATCH] docs(scanner): add overview README document

Signed-off-by: Deluan <deluan@navidrome.org>
---
 scanner/README.md | 466 ++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 466 insertions(+)
 create mode 100644 scanner/README.md

diff --git a/scanner/README.md b/scanner/README.md
new file mode 100644
index 000000000..b2c682381
--- /dev/null
+++ b/scanner/README.md
@@ -0,0 +1,466 @@
+# Navidrome Scanner: Technical Overview
+
+This document provides a comprehensive technical explanation of Navidrome's music library scanner system.
+
+## Architecture Overview
+
+The Navidrome scanner is built on a multi-phase pipeline architecture designed for efficient processing of music files. It systematically traverses file system directories, processes metadata, and maintains a database representation of the music library. A key performance feature is that some phases run sequentially while others execute in parallel.
+
+```mermaid
+flowchart TD
+    subgraph "Scanner Execution Flow"
+        Controller[Scanner Controller] --> Scanner[Scanner Implementation]
+        
+        Scanner --> Phase1[Phase 1: Folders Scan]
+        Phase1 --> Phase2[Phase 2: Missing Tracks]
+        
+        Phase2 --> ParallelPhases
+        
+        subgraph ParallelPhases["Parallel Execution"]
+            Phase3[Phase 3: Refresh Albums]
+            Phase4[Phase 4: Playlist Import]
+        end
+        
+        ParallelPhases --> FinalSteps[Final Steps: GC + Stats]
+    end
+    
+    %% Triggers that can initiate a scan
+    FileChanges[File System Changes] -->|Detected by| Watcher[Filesystem Watcher]
+    Watcher -->|Triggers| Controller
+    
+    ScheduledJob[Scheduled Job] -->|Based on Scanner.Schedule| Controller
+    ServerStartup[Server Startup] -->|If Scanner.ScanOnStartup=true| Controller
+    ManualTrigger[Manual Scan via UI/API] -->|Admin user action| Controller
+    CLICommand[Command Line: navidrome scan] -->|Direct invocation| Controller
+    PIDChange[PID Configuration Change] -->|Forces full scan| Controller
+    DBMigration[Database Migration] -->|May require full scan| Controller
+    
+    Scanner -.->|Alternative| External[External Scanner Process]
+```
+
+The execution flow shows that Phases 1 and 2 run sequentially, while Phases 3 and 4 execute in parallel to maximize performance before the final processing steps.
+
+## Core Components
+
+### Scanner Controller (`controller.go`)
+
+This is the entry point for all scanning operations. It provides:
+
+- Public API for initiating scans and checking scan status
+- Event broadcasting to notify clients about scan progress
+- Serialization of scan operations (prevents concurrent scans)
+- Progress tracking and monitoring
+- Error collection and reporting
+
+```go
+type Scanner interface {
+    // ScanAll starts a full scan of the music library. This is a blocking operation.
+    ScanAll(ctx context.Context, fullScan bool) (warnings []string, err error)
+    Status(context.Context) (*StatusInfo, error)
+}
+```
+
+### Scanner Implementation (`scanner.go`)
+
+The primary implementation that orchestrates the four-phase scanning pipeline. Each phase follows the Phase interface pattern:
+
+```go
+type phase[T any] interface {
+    producer() ppl.Producer[T]
+    stages() []ppl.Stage[T]
+    finalize(error) error
+    description() string
+}
+```
+
+This design enables:
+- Type-safe pipeline construction with generics
+- Modular phase implementation
+- Separation of concerns
+- Easy measurement of performance
+
+### External Scanner (`external.go`)
+
+The External Scanner is a specialized implementation that offloads the scanning process to a separate subprocess. This is specifically designed to address memory management challenges in long-running Navidrome instances.
+
+```go
+// scannerExternal is a scanner that runs an external process to do the scanning. It is used to avoid
+// memory leaks or retention in the main process, as the scanner can consume a lot of memory. The
+// external process will be spawned with the same executable as the current process, and will run
+// the "scan" command with the "--subprocess" flag.
+//
+// The external process will send progress updates to the main process through its STDOUT, and the main
+// process will forward them to the caller.
+```
+
+```mermaid
+sequenceDiagram
+    participant MP as Main Process
+    participant ES as External Scanner
+    participant SP as Subprocess (navidrome scan --subprocess)
+    participant FS as File System
+    participant DB as Database
+    
+    Note over MP: DevExternalScanner=true
+    MP->>ES: ScanAll(ctx, fullScan)
+    activate ES
+    
+    ES->>ES: Locate executable path
+    ES->>SP: Start subprocess with args:<br>scan --subprocess --configfile ... etc.
+    activate SP
+    
+    Note over ES,SP: Create pipe for communication
+    
+    par Subprocess executes scan
+        SP->>FS: Read files & metadata
+        SP->>DB: Update database
+    and Main process monitors progress
+        loop For each progress update
+            SP->>ES: Send encoded progress info via stdout pipe
+            ES->>MP: Forward progress info
+        end
+    end
+    
+    SP-->>ES: Subprocess completes (success/error)
+    deactivate SP
+    ES-->>MP: Return aggregated warnings/errors
+    deactivate ES
+```
+
+Technical details:
+
+1. **Process Isolation**
+    - Spawns a separate process using the same executable
+    - Uses the `--subprocess` flag to indicate it's running as a child process
+    - Preserves configuration by passing required flags (`--configfile`, `--datafolder`, etc.)
+
+2. **Inter-Process Communication**
+    - Uses a pipe for bidirectional communication
+    - Encodes/decodes progress updates using Go's `gob` encoding for efficient binary transfer
+    - Properly handles process termination and error propagation
+
+3. **Memory Management Benefits**
+    - Scanning operations can be memory-intensive, especially with large music libraries
+    - Memory leaks or excessive allocations are automatically cleaned up when the process terminates
+    - Main Navidrome process remains stable even if scanner encounters memory-related issues
+
+4. **Error Handling**
+    - Detects non-zero exit codes from the subprocess
+    - Propagates error messages back to the main process
+    - Ensures resources are properly cleaned up, even in error conditions
+
+## Scanning Process Flow
+
+### Phase 1: Folder Scan (`phase_1_folders.go`)
+
+This phase handles the initial traversal and media file processing.
+
+```mermaid
+flowchart TD
+    A[Start Phase 1] --> B{Full Scan?}
+    B -- Yes --> C[Scan All Folders]
+    B -- No --> D[Scan Modified Folders]
+    C --> E[Read File Metadata]
+    D --> E
+    E --> F[Create Artists]
+    E --> G[Create Albums]
+    F --> H[Save to Database]
+    G --> H
+    H --> I[Mark Missing Folders]
+    I --> J[End Phase 1]
+```
+
+**Technical implementation details:**
+
+1. **Folder Traversal**
+    - Uses `walkDirTree` to traverse the directory structure
+    - Handles symbolic links and hidden files
+    - Processes `.ndignore` files for exclusions
+    - Maps files to appropriate types (audio, image, playlist)
+
+2. **Metadata Extraction**
+    - Processes files in batches (defined by `filesBatchSize = 200`)
+    - Extracts metadata using the configured storage backend
+    - Converts raw metadata to `MediaFile` objects
+    - Collects and normalizes tag information
+
+3. **Album and Artist Creation**
+    - Groups tracks by album ID
+    - Creates album records from track metadata
+    - Handles album ID changes by tracking previous IDs
+    - Creates artist records from track participants
+
+4. **Database Persistence**
+    - Uses transactions for atomic updates
+    - Preserves album annotations across ID changes
+    - Updates library-artist mappings
+    - Marks missing tracks for later processing
+    - Pre-caches artwork for performance
+
+### Phase 2: Missing Tracks Processing (`phase_2_missing_tracks.go`)
+
+This phase identifies tracks that have moved or been deleted.
+
+```mermaid
+flowchart TD
+    A[Start Phase 2] --> B[Load Libraries]
+    B --> C[Get Missing and Matching Tracks]
+    C --> D[Group by PID]
+    D --> E{Match Type?}
+    E -- Exact --> F[Update Path]
+    E -- Same PID --> G[Update If Only One]
+    E -- Equivalent --> H[Update If No Better Match]
+    F --> I[End Phase 2]
+    G --> I
+    H --> I
+```
+
+**Technical implementation details:**
+
+1. **Track Identification Strategy**
+    - Uses persistent identifiers (PIDs) to track tracks across scans
+    - Loads missing tracks and potential matches from the database
+    - Groups tracks by PID to limit comparison scope
+
+2. **Match Analysis**
+    - Applies three levels of matching criteria:
+        - Exact match (full metadata equivalence)
+        - Single match for a PID
+        - Equivalent match (same base path or similar metadata)
+    - Prioritizes matches in order of confidence
+
+3. **Database Update Strategy**
+    - Preserves the original track ID
+    - Updates the path to the new location
+    - Deletes the duplicate entry
+    - Uses transactions to ensure atomicity
+
+### Phase 3: Album Refresh (`phase_3_refresh_albums.go`)
+
+This phase updates album information based on the latest track metadata.
+
+```mermaid
+flowchart TD
+    A[Start Phase 3] --> B[Load Touched Albums]
+    B --> C[Filter Unmodified]
+    C --> D{Changes Detected?}
+    D -- Yes --> E[Refresh Album Data]
+    D -- No --> F[Skip]
+    E --> G[Update Database]
+    F --> H[End Phase 3]
+    G --> H
+    H --> I[Refresh Statistics]
+```
+
+**Technical implementation details:**
+
+1. **Album Selection Logic**
+    - Loads albums that have been "touched" in previous phases
+    - Uses a producer-consumer pattern for efficient processing
+    - Retrieves all media files for each album for completeness
+
+2. **Change Detection**
+    - Rebuilds album metadata from associated tracks
+    - Compares album attributes for changes
+    - Skips albums with no media files
+    - Avoids unnecessary database updates
+
+3. **Statistics Refreshing**
+    - Updates album play counts
+    - Updates artist play counts
+    - Maintains consistency between related entities
+
+### Phase 4: Playlist Import (`phase_4_playlists.go`)
+
+This phase imports and updates playlists from the file system.
+
+```mermaid
+flowchart TD
+    A[Start Phase 4] --> B{AutoImportPlaylists?}
+    B -- No --> C[Skip]
+    B -- Yes --> D{Admin User Exists?}
+    D -- No --> E[Log Warning & Skip]
+    D -- Yes --> F[Load Folders with Playlists]
+    F --> G{For Each Folder}
+    G --> H[Read Directory]
+    H --> I{For Each Playlist}
+    I --> J[Import Playlist]
+    J --> K[Pre-cache Artwork]
+    K --> L[End Phase 4]
+    C --> L
+    E --> L
+```
+
+**Technical implementation details:**
+
+1. **Playlist Discovery**
+    - Loads folders known to contain playlists
+    - Focuses on folders that have been touched in previous phases
+    - Handles both playlist formats (M3U, NSP)
+
+2. **Import Process**
+    - Uses the core.Playlists service for import
+    - Handles both regular and smart playlists
+    - Updates existing playlists when changed
+    - Pre-caches playlist cover art
+
+3. **Configuration Awareness**
+    - Respects the AutoImportPlaylists setting
+    - Requires an admin user for playlist import
+    - Logs appropriate messages for configuration issues
+
+## Final Processing Steps
+
+After the four main phases, several finalization steps occur:
+
+1. **Garbage Collection**
+    - Removes dangling tracks with no files
+    - Cleans up empty albums
+    - Removes orphaned artists
+    - Deletes orphaned annotations
+
+2. **Statistics Refresh**
+    - Updates artist song and album counts
+    - Refreshes tag usage statistics
+    - Updates aggregate metrics
+
+3. **Library Status Update**
+    - Marks scan as completed
+    - Updates last scan timestamp
+    - Stores persistent ID configuration
+
+4. **Database Optimization**
+    - Performs database maintenance
+    - Optimizes tables and indexes
+    - Reclaims space from deleted records
+
+## File System Watching
+
+The watcher system (`watcher.go`) provides real-time monitoring of file system changes:
+
+```mermaid
+flowchart TD
+    A[Start Watcher] --> B[For Each Library]
+    B --> C[Start Library Watcher]
+    C --> D[Monitor File Events]
+    D --> E{Change Detected?}
+    E -- Yes --> F[Wait for More Changes]
+    F --> G{Time Elapsed?}
+    G -- Yes --> H[Trigger Scan]
+    G -- No --> F
+    H --> I[Wait for Scan Completion]
+    I --> D
+```
+
+**Technical implementation details:**
+
+1. **Event Throttling**
+    - Uses a timer to batch changes
+    - Prevents excessive rescanning
+    - Configurable wait period
+
+2. **Library-specific Watching**
+    - Each library has its own watcher goroutine
+    - Translates paths to library-relative paths
+    - Filters irrelevant changes
+
+3. **Platform Adaptability**
+    - Uses storage-provided watcher implementation
+    - Supports different notification mechanisms per platform
+    - Graceful fallback when watching is not supported
+
+## Edge Cases and Optimizations
+
+### Handling Album ID Changes
+
+The scanner carefully manages album identity across scans:
+- Tracks previous album IDs to handle ID generation changes
+- Preserves annotations when IDs change
+- Maintains creation timestamps for consistent sorting
+
+### Detecting Moved Files
+
+A sophisticated algorithm identifies moved files:
+1. Groups missing and new files by their Persistent ID
+2. Applies multiple matching strategies in priority order
+3. Updates paths rather than creating duplicate entries
+
+### Resuming Interrupted Scans
+
+If a scan is interrupted:
+- The next scan detects this condition
+- Forces a full scan if the previous one was a full scan
+- Continues from where it left off for incremental scans
+
+### Memory Efficiency
+
+Several strategies minimize memory usage:
+- Batched file processing (200 files at a time)
+- External scanner process option
+- Database-side filtering where possible
+- Stream processing with pipelines
+
+### Concurrency Control
+
+The scanner implements a sophisticated concurrency model to optimize performance:
+
+1. **Phase-Level Parallelism**:
+    - Phases 1 and 2 run sequentially due to their dependencies
+    - Phases 3 and 4 run in parallel using the `chain.RunParallel()` function
+    - Final steps run sequentially to ensure data consistency
+
+2. **Within-Phase Concurrency**:
+    - Each phase has configurable concurrency for its stages
+    - For example, `phase_1_folders.go` processes folders concurrently: `ppl.NewStage(p.processFolder, ppl.Name("process folder"), ppl.Concurrency(conf.Server.DevScannerThreads))`
+    - Multiple stages can exist within a phase, each with its own concurrency level
+
+3. **Pipeline Architecture Benefits**:
+    - Producer-consumer pattern minimizes memory usage
+    - Work is streamed through stages rather than accumulated
+    - Back-pressure is automatically managed
+
+4. **Thread Safety Mechanisms**:
+    - Atomic counters for statistics gathering
+    - Mutex protection for shared resources
+    - Transactional database operations
+
+## Configuration Options
+
+The scanner's behavior can be customized through several configuration settings that directly affect its operation:
+
+### Core Scanner Options
+
+| Setting                 | Description                                                      | Default        | 
+|-------------------------|------------------------------------------------------------------|----------------|
+| `Scanner.Enabled`       | Whether the automatic scanner is enabled                         | true           |
+| `Scanner.Schedule`      | Cron expression or duration for scheduled scans (e.g., "@daily") | "0" (disabled) |
+| `Scanner.ScanOnStartup` | Whether to scan when the server starts                           | true           |
+| `Scanner.WatcherWait`   | Delay before triggering scan after file changes detected         | 5s             |
+| `Scanner.ArtistJoiner`  | String used to join multiple artists in track metadata           | " • "          |
+
+### Playlist Processing
+
+| Setting                     | Description                                              | Default |
+|-----------------------------|----------------------------------------------------------|---------|
+| `PlaylistsPath`             | Path(s) to search for playlists (supports glob patterns) | ""      |
+| `AutoImportPlaylists`       | Whether to import playlists during scanning              | true    |
+
+### Performance Options
+
+| Setting              | Description                                               | Default |
+|----------------------|-----------------------------------------------------------|---------|
+| `DevExternalScanner` | Use external process for scanning (reduces memory issues) | true    |
+| `DevScannerThreads`  | Number of concurrent processing threads during scanning   | 5       |
+
+### Persistent ID Options
+
+| Setting     | Description                                                         | Default                                                             |
+|-------------|---------------------------------------------------------------------|---------------------------------------------------------------------|
+| `PID.Track` | Format for track persistent IDs (critical for tracking moved files) | "musicbrainz_trackid\|albumid,discnumber,tracknumber,title"         |
+| `PID.Album` | Format for album persistent IDs (affects album grouping)            | "musicbrainz_albumid\|albumartistid,album,albumversion,releasedate" |
+
+These options can be set in the Navidrome configuration file (e.g., `navidrome.toml`) or via environment variables with the `ND_` prefix (e.g., `ND_SCANNER_ENABLED=false`). For environment variables, dots in option names are replaced with underscores.
+
+## Conclusion
+
+The Navidrome scanner represents a sophisticated system for efficiently managing music libraries. Its phase-based pipeline architecture, careful handling of edge cases, and performance optimizations allow it to handle libraries of significant size while maintaining data integrity and providing a responsive user experience.
\ No newline at end of file