# Parallel ReplayGain analysis — `-j` / `--threads`

mp3rgain 2.4 introduces multi-file parallelism for ReplayGain analysis, addressing issues [#125] and [#126]. This document records the design decisions and the real-corpus benchmark numbers that motivated the default-on choice.

## TL;DR

- **Parallel by default.** `mp3rgain *.mp3`, `-r`, `-a`, recursive `-R`, and the default info command all use `std::thread::available_parallelism()` worker threads via rayon.
- **`-j 1` is the legacy fallback** for behavioral parity with mp3gain or for debugging. `-j 0` and `MP3RGAIN_THREADS=0` mean "auto".
- **Output is byte-identical regardless of `-j`** — TSV/Text/JSON line order and album-summary numbers all match the serial path exactly.
- On a 124-track / 2.1 GB corpus, an Apple M3 (4 performance + 4 efficiency cores) sees **~3× wall-clock speedup** for the default recursive analysis (`mp3rgain -R -o tsv .`), going from 110.2s → 35.5s with `-j 8`. Per-track gain (`-r -n`) and album gain (`-a -n`) both go from ~44s → ~14–16s in the same setup.

## CLI surface

| Form                    | Meaning                                              |
|-------------------------|------------------------------------------------------|
| _omitted_               | Use `available_parallelism()` (typically num cores)  |
| `-j 0` / `--threads 0`  | Same as omitted (auto)                               |
| `-j 1` / `--threads 1`  | Serial — matches legacy mp3gain behavior             |
| `-j N`                  | Use exactly N rayon worker threads                   |
| `MP3RGAIN_THREADS=N`    | Same effect as `-j N`; the explicit flag wins        |

`-j` does not collide with any short flag in mp3gain's reference set (`-? -h -v -g -l -d -r -a -k -m -p -s -c -e -x -t -T -q -f -o`) or in aacgain.

## Benchmark methodology

- **Corpus**: Crossfader Music Pack (2019 + 2021 editions) — 124 actual audio files (236 `.mp3` + 12 `.m4a` entries minus 124 macOS resource-fork shadows), 2.1 GB on disk.
- **Hardware**: Apple MacBook Air (M3, 2024) — 8 logical cores (4 performance + 4 efficiency), 24 GB RAM, internal SSD.
- **Build**: `cargo build --release` from `perf/parallel-replaygain-126`, with `[profile.release] lto = "thin", codegen-units = 1, strip = "debuginfo"`.
- **Tool**: [hyperfine](https://github.com/sharkdp/hyperfine) 1.20 with `--warmup 1 --runs 3`. The corpus is pre-staged on the internal SSD (not the original USB volume) to remove disk-bandwidth variance.
- **Workloads** (run from the corpus root):
  - `mp3rgain -j N -R -q -o tsv .` (default info — analyze + album summary)
  - `mp3rgain -j N -R -r -n -q -o tsv .` (track gain dry-run)
  - `mp3rgain -j N -R -a -n -q -o tsv .` (album gain dry-run)

### Reproducing

```sh
cargo build --release
mkdir -p /tmp/mp3rgain-bench
cp -R "/path/to/Music Library" /tmp/mp3rgain-bench/
scripts/bench-parallel.sh "/tmp/mp3rgain-bench/"
# Outputs: /tmp/bench-info.md /tmp/bench-track.md /tmp/bench-album.md
```

## Results — default info (`mp3rgain -R -q -o tsv .`)

Workload: per-file ReplayGain analysis **plus** the trailing album-summary `analyze_album_parallel` pass (decodes every file twice).

| `-j` | Mean (s)        | Speedup vs `-j 1` | Aggregate CPU usage |
|-----:|----------------:|------------------:|--------------------:|
| 1    | 110.151 ± 0.085 | 1.00× (baseline)  | 99% × 1 core        |
| 2    | 66.297 ± 9.119  | 1.66×             | 97% × 1.94 cores    |
| 4    | 51.625 ± 1.512  | 2.13×             | 97% × 3.86 cores    |
| 8    | 35.513 ± 0.380  | **3.10×**         | 87% × 7.00 cores    |

Source: `/tmp/bench-info.md` from the bench script run.

## Results — track gain dry-run (`mp3rgain -r -n -R -q -o tsv .`)

Workload: a single ReplayGain analysis pass per file. No second pass, no file modification (dry-run).

| `-j` | Mean (s)       | Speedup vs `-j 1` | Aggregate CPU usage |
|-----:|---------------:|------------------:|--------------------:|
| 1    | 44.550 ± 0.113 | 1.00× (baseline)  | 99% × 1 core        |
| 2    | 24.433 ± 0.060 | 1.82×             | 97% × 1.94 cores    |
| 4    | 16.115 ± 0.562 | **2.76×**         | 95% × 3.80 cores    |
| 8    | 19.309 ± 3.221 | 2.31×             | 76% × 6.12 cores    |

Note: `-j 8` is **slower** than `-j 4` here.
The track-gain workload fits in ~16s, and on a 4P+4E hybrid CPU the overhead of pushing four extra threads onto efficiency cores outweighs their throughput contribution for jobs of this size. The same workload on a homogeneous 8-core CPU is expected to scale further.

## Results — album gain dry-run (`mp3rgain -a -n -R -q -o tsv .`)

Workload: `analyze_album_parallel` decodes every track once and folds histograms.

| `-j` | Mean (s)       | Speedup vs `-j 1` | Aggregate CPU usage |
|-----:|---------------:|------------------:|--------------------:|
| 1    | 44.159 ± 0.015 | 1.00× (baseline)  | 99% × 1 core        |
| 2    | 24.570 ± 0.707 | 1.80×             | 98% × 1.95 cores    |
| 4    | 22.143 ± 0.750 | 1.99×             | 96% × 3.84 cores    |
| 8    | 14.584 ± 0.505 | **3.03×**         | 88% × 7.00 cores    |

## Acceptance-criteria check (issue #126)

> `mp3rgain *.mp3 -r` is ≥ N×/2 faster on N cores for a corpus
> large enough to amortize startup (e.g. 50+ tracks).

| Cores | Bar (N/2) | Achieved (track-gain) | Achieved (album-gain) | Achieved (info) |
|------:|----------:|----------------------:|----------------------:|----------------:|
| 2     | 1.0×      | 1.82×                 | 1.80×                 | 1.66×           |
| 4     | 2.0×      | 2.76×                 | 1.99×                 | 2.13×           |
| 8     | 4.0×      | 2.31×                 | 3.03×                 | 3.10×           |

The 2-core and 4-core bars are met across all three workloads (album gain's 1.99× at 4 cores sits at the bar, within measurement noise). The 8-core bar is missed because the M3 has 4 performance + 4 efficiency cores, so "8 cores" overstates the available compute throughput. On a homogeneous 8-core CPU (Ryzen 7, Xeon E-23xx, etc.) we expect the 8-core bar to be cleared too.

## Output identity

Across every `-j` value tested on this corpus, the TSV/Text/JSON output and the modified MP3 byte stream (after `-r` apply) are **byte-identical** to the serial `-j 1` path. The album fold is associative and rayon's `par_iter().collect::<Vec<_>>()` preserves input order, so `album_peak`, `album_loudness_db`, and `album_gain_db` all match `-j 1` exactly.

```sh
# Verification (run during PR validation):
mp3rgain -j 1 -R -q -o tsv . > /tmp/serial.tsv
mp3rgain -j 8 -R -q -o tsv . > /tmp/parallel.tsv
diff /tmp/serial.tsv /tmp/parallel.tsv  # exits 0
```

## What gets parallelized

The two hot loops called out in [#126]:

1. `cmd_info`'s per-file ReplayGain analysis loop.
2. `analyze_album_internal`'s per-track decode + filter loop, exposed via two new public APIs in `src/replaygain.rs`:
   - `analyze_album_parallel(files, track_index, threads)`
   - `analyze_album_parallel_with_completion(files, track_index, threads, on_complete)`

   Both fall back to the existing serial implementation for `threads <= 1` or `files.len() <= 1`.

Plus, in this PR's scope:

3. `cmd_info`'s second album-summary pass — switched from `analyze_album` (serial) to `analyze_album_parallel` when `-j > 1`. Without this, the album-summary pass becomes the wall-clock bottleneck and limits the overall speedup to ~2× even with 8 cores.
4. `cmd_track_gain`'s per-file analyze + apply loop.
5. `cmd_album_gain`'s per-file apply loop (after the parallel `analyze_album_parallel_with_completion` analysis pass).

## What is *not* parallelized

- **Per-sample DSP inside a single track.** The equal-loudness IIR filter has a tight inter-sample data dependency, so it doesn't parallelize without changing the algorithm. SIMD packing of L+R samples is the right answer there — see [#125] for follow-up.
- **`cmd_apply` / `cmd_apply_channel` / `cmd_undo` / `cmd_max_amplitude` / `cmd_check_tags` / `cmd_delete_tags`.** These are I/O-bound per file (read tag, modify `global_gain` bytes, write file). They benefit much less from parallelism, and parallelizing them would require the same `(JsonFileResult, String)` output-buffer refactor applied to the ReplayGain processors. They remain serial in this PR; open a follow-up if a real workload shows them as a bottleneck.

## Concurrency safety

- Each track gets its own Symphonia decoder, format reader, `EqualLoudnessFilter` array, and `LoudnessHistogram` — no shared mutable state.
- The album histogram fold is associative (`LoudnessHistogram::accumulate` is a bin-wise sum), so reordering is safe; we still iterate in input order to keep the result bit-identical.
- Stdout output is buffered into per-file `String` instances inside the `process_*` functions and replayed by the cmd layer in input order. This guarantees deterministic line ordering regardless of completion order.
- Stderr (warnings/errors) stays on `eprintln!`; OS-level per-line atomicity is sufficient for diagnostics. Order across files may differ between runs.

[#125]: https://github.com/M-Igashi/mp3rgain/issues/125
[#126]: https://github.com/M-Igashi/mp3rgain/issues/126
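The deterministic-ordering guarantee rests on the buffer-then-replay pattern described above. The following is a minimal sketch of that pattern using std scoped threads rather than rayon, with `process_file` as a hypothetical stand-in for the per-file analysis (the real code returns `(JsonFileResult, String)` pairs):

```rust
use std::thread;

// Hypothetical stand-in for the per-file ReplayGain pass: instead of
// printing as it goes, it returns the file's output line as a String.
fn process_file(path: &str) -> String {
    format!("{path}\t-6.2 dB")
}

fn main() {
    let files = ["01.mp3", "02.mp3", "03.mp3"];

    // Spawn one scoped worker per file. `handles` is built in input
    // order, so joining in that same order yields the buffers in input
    // order no matter which worker finishes first.
    let outputs: Vec<String> = thread::scope(|scope| {
        let handles: Vec<_> = files
            .iter()
            .copied()
            .map(|path| scope.spawn(move || process_file(path)))
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    });

    // Replay the buffers in input order: line order is deterministic
    // regardless of thread count or completion order.
    for line in &outputs {
        println!("{line}");
    }
}
```

rayon's ordered `par_iter().map(...).collect()` gives the same property with a bounded pool; the ordering argument is identical — collection order follows input order, not completion order.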