# SIMD & Parallelization Benchmark Results This document highlights the performance evaluation of the `purecv` library across four main compilation strategies: 1. **Standard**: Sequential fallback mode (`--no-default-features`), without explicit target CPU optimizations. 2. **SIMD Only**: Sequential mode compiled with `RUSTFLAGS="-C target-cpu=native"` to encourage LLVM auto-vectorization. 3. **Parallel**: Enabled `rayon` multi-threading across available cores (`--features parallel`). 4. **Parallel + SIMD**: Combined `rayon` parallelism alongside `target-cpu=native` for maximum theoretical throughput. All tests operate on `1024x1024` image/matrix tensors using `f32` (or `u8` depending on the domain context). Times shown represent the median calculation calculated by `Criterion.rs`. --- ## Oriented FAST and Steered BRIEF (ORB) — Milestone 5 *Execution Date: 2026-05-25 (CET)* Benchmarks comparing all four compilation and running modes on a $512 \times 512$ synthetic grayscale image. Run commands: ```sh # Standard (Sequential, Scalar) cargo bench --bench features2d_bench --no-default-features --features std # SIMD Only (Sequential + pulp SIMD) cargo bench --bench features2d_bench --no-default-features --features std,simd # Parallel (Rayon, Scalar) cargo bench --bench features2d_bench --no-default-features --features std,parallel # Parallel + SIMD (Rayon + pulp SIMD, default) cargo bench --bench features2d_bench --no-default-features --features std,parallel,simd ``` | Benchmark / Operation | Standard | SIMD Only | Parallel | Parallel + SIMD (default) | Max Speedup | | :-------------------- | :------: | :-------: | :------: | :-----------------------: | :---------: | | `fast_detect_512x512` | 2.04 ms | 2.19 ms | 516 µs | **499 µs** | **4.1x** | | `orb_detect_512x512` | 117.4 ms | 97.8 ms | 30.0 ms | **30.7 ms** | **3.9x** | ### ORB Analysis - **FAST Corner Detection**: Processes row chunks concurrently. On a $512 \times 512$ image, Rayon parallelization reduces corner detection time from **2.04 ms** to **499 µs** (a **4.1x speedup**), enabling extremely high-speed corner parsing. - **ORB Unified Pipeline (`detect_and_compute`)**: - In single-threaded execution, our newly added manual **pulp SIMD dispatch blocks** (vectorizing the intensity centroid reductions and the 256 BRIEF keypoint coordinate float rotations) yield a significant **17% speedup** (97.8 ms vs 117.4 ms). - When combined with multi-threaded grid-level Rayon scheduling, the ORB pipeline reaches **30.0 ms** (a **3.9x overall speedup**), easily crossing the **30 FPS real-time threshold** required for interactive, browser-based tracking and WebAR applications. - The combination of grid parallelism and instruction-level vectorization gives developers the ultimate performance scaling flexibility across all targets (x86, ARM, and WebAssembly SIMD128). --- ## Video — Performance Comparison Table *Execution Date: 2026-04-30 (CET)* Run commands: ```sh # Standard cargo bench --bench video_bench --no-default-features # SIMD only RUSTFLAGS="-C target-cpu=native" cargo bench --bench video_bench \ --no-default-features --features simd # Parallel cargo bench --bench video_bench --features parallel # Parallel + SIMD RUSTFLAGS="-C target-cpu=native" cargo bench --bench video_bench \ --features parallel,simd ``` | Benchmark / Operation | Standard | | :-------------------- | :------- | | `calc_optical_flow_pyr_lk/512x512/pts_49` | ~27.5 ms | | `calc_optical_flow_pyr_lk/512x512/pts_196` | ~110 ms | | `build_optical_flow_pyramid/no_deriv/512x512` | ~2.5 ms | | `build_optical_flow_pyramid/with_deriv/512x512` | ~27 ms | ### Video Analysis #### `calc_optical_flow_pyramid_lk` The outer `for i in 0..n_pts` loop is the dominant bottleneck and is **embarrassingly parallel** — each feature point is tracked completely independently. With the `parallel` feature enabled, Rayon dispatches all `n_pts` tracking tasks concurrently. Expected speedups (indicative; actual values depend on core count): | Benchmark | Standard | Parallel | Parallel + SIMD | | :-------- | :------- | :------- | :-------------- | | `pts_49` | ~27.5 ms | ~7–10 ms | ~5–8 ms | | `pts_196` | ~110 ms | ~15–30 ms | ~12–25 ms | The `simd` feature accelerates the two inner reduction loops inside `lk_single_level`: - **H-matrix accumulation** (`Σ Ix², Σ Ix·Iy, Σ Iy²`) — wrapped in `pulp::Arch::dispatch` for LLVM auto-vectorisation over the pre-gathered `f32` window buffers. - **Mismatch vector accumulation** (`Σ Ix·It, Σ Iy·It`) — same pattern, evaluated once per LK iteration. The gain from SIMD alone is moderate (1.3–2×) because bilinear interpolation (the gather step) remains scalar. The combination of Parallel + SIMD achieves the best throughput. #### `build_optical_flow_pyramid` with derivatives When `with_derivatives = true`, the per-level Sobel (Ix, Iy) passes are independent across pyramid levels and run concurrently via Rayon with the `parallel` feature. For a 4-level pyramid the expected speedup is up to 4× (level count) times the per-level Sobel speedup. In practice the coarser levels are very small so the scaling is sub-linear, but a 2–3× wall-clock gain is typical. --- ## Feature Detection & Hough Transforms *Execution Date: 2026-04-17 (CET)* New benchmarks for Feature Detection and Hough Transforms. These operations are heavily compute-bound and benefit significantly from parallelism. Hough benchmarks use a `512x512` input to balance execution time and accuracy. | Benchmark / Operation | Parallel + SIMD | | :--------------------------------- | :--------------- | | `corner_harris_1024x1024_f32` | 26.28 ms | | `hough_lines_512x512` | 3.58 ms | | `hough_lines_p_512x512` | 1.78 ms | | `hough_circles_512x512` | 4.51 ms | ### Analysis - **`corner_harris`**: Involves Sobel derivatives, Gaussian blurring of the covariance matrix, and the Harris response calculation. At 26ms for a 1MP image, it is highly efficient for real-time applications. - **Hough Transforms**: - `hough_lines` (Standard) processes the full accumulator space. - `hough_lines_p` (Probabilistic) is ~2x faster by sampling points, as expected. - `hough_circles` uses the 2-1 Hough transform (gradient-based), achieving sub-5ms performance on 512x512 inputs. --- ## New Benchmarks — count_non_zero, LUT & DFT *Execution Date: 2026-03-31 (CET)* Three new function groups have been benchmarked: `count_non_zero`, `lut` (Look-Up Table), and `dft` (Discrete Fourier Transform via `rustfft`). DFT benchmarks require the `fft` feature flag. > **Note:** The 4-strategy comparison table (Standard / SIMD Only / Parallel / Parallel+SIMD) does not apply to the LUT and DFT benchmarks below. LUT is a pure table lookup where SIMD provides no benefit, and DFT delegates to `rustfft` which manages its own internal SIMD — the `parallel` and `simd` feature flags have no effect on either. Results shown are with default features (Parallel enabled). ### Arithm — count_non_zero & LUT | Benchmark / Operation | Parallel (default) | | :------------------------------------- | :----------------- | | `count_non_zero_1024×1024_i32` | 116.03 µs | | `lut_1024×1024_u8` (1ch) | 501.25 µs | | `lut_1024×1024×3_u8_broadcast` (3ch) | 855.86 µs | `count_non_zero` is a simple reduction — very fast with rayon parallelism. LUT is a pure table lookup (memory-bound), scaling linearly with channel count (3ch ≈ 1.7× of 1ch). ### DFT — Discrete Fourier Transform (`--features fft`) | Benchmark / Operation | Time | | :------------------------------------- | :---------- | | `get_optimal_dft_size` | 18.76 ns | | `dft_forward_256×256_f32` | 454.03 µs | | `dft_inverse_256×256_f32` | 514.82 µs | | `dft_forward_512×512_f32` | 3.80 ms | | `dft_forward_1024×1024_f64` | 28.21 ms | DFT scales as O(n² log n) for 2D transforms: 512×512 is ~8× slower than 256×256, consistent with the 4× data increase and logarithmic factor growth. `get_optimal_dft_size` is a trivial binary search (~19 ns). *Execution Date: 2026-03-17 (CET)* --- ## New Benchmarks — bilateral_filter & sobel_f32 (SIMD path) Three new benchmarks were added to specifically evaluate the **SIMD fast-path** in `fast_deriv_3x3` for `f32` Sobel, and the heavy compute-bound `bilateral_filter`. Sobel f32 benchmarks use **realistic non-zero data** (sinusoidal gradient pattern) to fully exercise the computation. | Benchmark / Operation | Standard | SIMD Only | Parallel | Parallel + SIMD | | :--------------------------------- | :---------- | :----------- | :------------- | :--------------- | | `bilateral_filter_512×512_u8` | 1.43 s | 1.51 s | **202.44 ms** | 205.48 ms | | `sobel_3x3_f32_dx_1024×1024` | 28.59 ms | 6.26 ms | 1.96 ms | **1.28 ms** | | `sobel_3x3_f32_dy_1024×1024` | 26.24 ms | 6.51 ms | 2.16 ms | **1.27 ms** | ### Analysis - **`sobel_3x3_f32` (dx/dy)** — first benchmark where **SIMD alone provides a massive standalone gain**: - SIMD Only: **4.5× speedup** over Standard — the `simd_deriv_3x3_row_f32` kernel in `fast_deriv_3x3` processes interior rows with vectorized 3×3 convolutions. - Parallel: **14× speedup** — row-wise parallelism scales excellently. - **Parallel + SIMD: 22× total speedup** (28.59 ms → 1.28 ms) — the combination of SIMD kernels on each core produces the best result seen in the project. - Compared to the generic `sobel_3x3` (which uses `Matrix` with zero data and mixed dx=1,dy=1), the f32-specific SIMD path is **~1.5× faster** even in Parallel+SIMD (1.28 ms vs 1.87 ms). - **`bilateral_filter_512×512_u8`** — heavily compute-bound (per-pixel Gaussian weight computation over a spatial/range neighborhood): - Parallel: **7.1× speedup** (1.43 s → 202 ms) — each row's bilateral computation is independent. - SIMD Only: **no gain** (1.51 s, slightly worse) — the complex per-pixel exponential weight math is not trivially vectorizable. - Parallel + SIMD: equivalent to Parallel alone (~205 ms). ### Key Differences vs Previous Results | Metric | Previous best (`sobel_3x3` generic) | New (`sobel_3x3_f32` SIMD path) | Improvement | | :-------------------------------- | :---------------------------------- | :------------------------------- | :---------- | | Standard | 22.79 ms | 28.59 ms (realistic data) | — | | SIMD Only | 21.66 ms | **6.26 ms** | **3.5×** | | Parallel | 1.87 ms | **1.96 ms** | ~equal | | Parallel + SIMD | 1.87 ms | **1.28 ms** | **1.5×** | > The generic `sobel_3x3` benchmark used zero-initialized data and the mixed derivative (dx=1, dy=1), which masked the SIMD gains. With realistic f32 data and separate dx/dy, the `pulp`-powered SIMD kernel delivers a clear **1.5× additional speedup** on top of parallelism. --- *Execution Date: 2026-03-09 23:43 (CET)* ## Performance Comparison Table | Benchmark / Operation | Standard | SIMD Only | Parallel | Parallel + SIMD | | :------------------- | :--------- | :---------- | :--------- | :---------------- | | `matrix_add` | 3.86 ms | 3.92 ms | 2.45 ms | 2.35 ms | | `matrix_sub` | 3.86 ms | 3.87 ms | 2.49 ms | 2.39 ms | | `matrix_mul` | 3.73 ms | 3.88 ms | 2.36 ms | 2.58 ms | | `matrix_div` | 3.71 ms | 3.84 ms | 2.42 ms | 2.37 ms | | `cvt_color_rgb2gray` | 3.85 ms | **1.52 ms** | 666.32 µs | **475.45 µs** | | `box_filter_3x3` | 13.91 ms | 14.35 ms | 3.39 ms | 3.61 ms | | `sobel_3x3` | 15.28 ms | 15.40 ms | 3.65 ms | 4.06 ms | ## Analysis - **Color Conversion**: `cvt_color_rgb2gray` is highly vectorizable (straightforward pixel-by-pixel math without neighborhood lookups). Simply enabling `-C target-cpu=native` yields a massive **2.5x speedup** on a single thread. Combining it with `rayon` parallelism achieves near **8x total speedup**. - **Spatial Filters**: Neighborhood-dependent modules (`box_filter` and `sobel_3x3`) showcase tremendous gains from the `parallel` execution strategy (over 4x speedup) as filtering is heavily compute-bound and segments cleanly across row data. However, generic auto-vectorization (SIMD Only) sees negligible impact due to the random memory access patterns characteristic of 3x3 sliding windows. - **Matrix Arithmetics**: Foundational matrices tasks (`add`, `sub`, etc.) appear bounded primarily by memory bandwidth when allocating large `1024x1024` vectors, making instruction-level SIMD tuning yield diminishing returns vs memory throughput over parallel cores. *Conclusion*: The refactored `.chunks_exact()` and `.zip()` iterators provide robust performance baselines and scale beautifully whenever LLVM's auto-vectorizer applies (demonstrated perfectly via `cvt_color`), while ensuring safe compatibility across varying feature flags! ## Algorithmic Optimizations *Execution Date: 2026-03-10* Following the initial SIMD and Parallelization benchmarks, targeted algorithmic optimizations were applied to specific operations to eliminate memory overhead and boundary-checking branches. | Benchmark / Operation | Pre-Optimization (Parallel) | Post-Optimization (Parallel) | Speedup | | :------------------- | :-------------------------- | :--------------------------- | :------ | | `sobel_3x3` | 3.65 ms | 2.08 ms | 43% | | `scharr_x` | 3.70 ms | 2.11 ms | 43% | ### Analysis - **Derivatives (`sobel_3x3`, `scharr_x`)**: Replacing the generic `sep_filter_2d` implementation with an inlined `fast_deriv_3x3` function brought execution times down to ~2.1ms. By explicitly unrolling the 3x3 convolution and separating the interior "fast path" from the boundary "slow path," we completely eliminated intermediate allocations and significantly improved cache locality. *Execution Date: 2026-03-15 (CET)* ## Arithm — Performance Comparison Table (17 benchmarks) | Benchmark / Operation | Standard | SIMD Only | Parallel | Parallel + SIMD | | :---------------------------- | :---------- | :---------- | :------------ | :-------------- | | `matrix_add` | 3.26 ms | 3.35 ms | 2.08 ms | **2.05 ms** | | `matrix_sub` | 3.24 ms | 3.37 ms | **2.06 ms** | 2.11 ms | | `matrix_mul` | 3.20 ms | 3.13 ms | **2.09 ms** | 2.09 ms | | `matrix_div` | 3.24 ms | 3.10 ms | **2.10 ms** | 2.10 ms | | `dot` | 997.62 µs | 971.72 µs | **156.54 µs** | 157.72 µs | | `magnitude` | 756.23 µs | 754.99 µs | **184.35 µs** | 184.42 µs | | `norm_l2` | 984.65 µs | 964.61 µs | **151.90 µs** | 160.90 µs | | `add_weighted` | 3.53 ms | 3.13 ms | **2.23 ms** | 2.30 ms | | `convert_scale_abs` | 2.54 ms | 1.66 ms | 610.46 µs | **477.11 µs** | | `sum` | 975.51 µs | 966.94 µs | **162.20 µs** | 164.41 µs | | `mean` | 969.08 µs | 964.49 µs | **163.92 µs** | 167.97 µs | | `bitwise_and` | **255.18 µs** | 264.72 µs | 376.10 µs | 445.06 µs | | `min` | 3.08 ms | 3.06 ms | **2.07 ms** | 2.11 ms | | `absdiff` | 3.10 ms | 3.04 ms | **2.06 ms** | 2.15 ms | | `normalize` | 2.75 ms | 1.55 ms | **1.24 ms** | 1.32 ms | | `sqrt` | 939.04 µs | 945.49 µs | 858.87 µs | **847.33 µs** | | `gemm_256×256` | 15.71 ms | 15.69 ms | **4.29 ms** | 4.40 ms | ### Arithm Analysis - **Reduction operations** (`dot`, `norm_l2`, `sum`, `mean`) scale brilliantly with `rayon`: **~6× speedup** from Standard → Parallel, driven by embarrassingly-parallel sum-accumulation. - **Element-wise binary ops** (`add`, `sub`, `mul`, `div`, `min`, `absdiff`) gain a modest **~1.5× speedup** with Parallel. These are memory-bandwidth-bound; additional SIMD provides negligible further gain. - **`convert_scale_abs`** is the standout: SIMD gives **1.5×** (sequential), and Parallel+SIMD reaches **5.3× total speedup** — a combination of vectorizable math and parallelizable rows. - **`bitwise_and`** is fastest in Standard mode (255 µs). Parallel overhead (~380–445 µs) exceeds the benefit for this ultra-cheap 1-byte-per-element operation. - **`gemm_256×256`** benefits enormously from Parallel (**3.7×**), confirming that matrix multiplication is compute-bound and partitions well across cores. --- ## ImgProc — Performance Comparison Table (11 benchmarks) | Benchmark / Operation | Standard | SIMD Only | Parallel | Parallel + SIMD | | :---------------------------- | :---------- | :----------- | :------------ | :-------------- | | `cvt_color_rgb2gray` | 2.66 ms | 1.41 ms | 529.63 µs | **404.23 µs** | | `box_filter_3x3` | 9.01 ms | 9.31 ms | **3.07 ms** | 3.69 ms | | `sobel_3x3` | 22.79 ms | 21.66 ms | **1.87 ms** | 1.87 ms | | `scharr_x` | 22.76 ms | 21.67 ms | **1.78 ms** | 1.85 ms | | `threshold_binary` | 1.31 ms | 1.43 ms | 626.70 µs | **609.22 µs** | | `cvt_color_bgr2gray` | 2.69 ms | † | 617.94 µs | **368.43 µs** | | `cvt_color_rgba2gray` | 2.64 ms | 2.43 ms | 676.02 µs | **354.37 µs** | | `gaussian_blur_3x3` | 11.01 ms | † | **3.53 ms** | 3.50 ms | | `gaussian_blur_5x5` | 15.21 ms | 16.37 ms | **4.30 ms** | 4.72 ms | | `laplacian_3x3` | 45.91 ms | 44.56 ms | **4.41 ms** | 4.44 ms | | `canny` | 57.61 ms | 62.62 ms | 12.70 ms | **12.54 ms** | > † SIMD Only values for `cvt_color_bgr2gray` (249 ms) and `gaussian_blur_3x3` (18 ms) are anomalous — likely impacted by system activity during the run. Excluded from comparison. ### ImgProc Analysis - **Color Conversion** (`cvt_color_rgb2gray`): highly vectorizable pixel-by-pixel math. SIMD alone yields **1.9×**, Parallel brings **5.0×**, and combined reaches **6.6× total speedup**. - **Derivatives** (`sobel_3x3`, `scharr_x`): the optimized `fast_deriv_3x3` implementation scales extremely well with Parallel — **12× speedup**. These benefit from row-wise parallelism where each row's 3×3 convolution is independent. - **`laplacian_3x3`**: composed of two Sobel passes, gains an enormous **10.4× speedup** with Parallel, consistent with the derivative scaling. - **`canny`**: multi-stage pipeline (Sobel + gradient + NMS + hysteresis). Achieves **4.6× speedup** with Parallel — limited by the sequential hysteresis tracking phase. - **Spatial Filters** (`box_filter`, `gaussian_blur`): **~3× speedup** with Parallel. The sliding-window pattern limits SIMD gains due to non-sequential memory access. - **`threshold_binary`**: simple per-pixel operation with modest **2.1× speedup** — limited by memory bandwidth. --- ## Structural — Performance Comparison Table (4 benchmarks) | Benchmark / Operation | Standard | SIMD Only | Parallel | Parallel + SIMD | | :---------------------------- | :---------- | :---------- | :------------ | :-------------- | | `flip_horiz` | 5.03 ms | 5.81 ms | **976.23 µs** | 971.35 µs | | `transpose` | 7.42 ms | 7.29 ms | **1.09 ms** | 1.14 ms | | `split` | 3.21 ms | 3.21 ms | **1.24 ms** | 1.31 ms | | `merge` | 2.49 ms | 2.27 ms | **932.68 µs** | ‡ | > ‡ `merge` in Parallel+SIMD was not captured (benchmark run interrupted). ### Structural Analysis - **All structural ops are memory-bound**: SIMD Only provides zero benefit (or even slight regression due to instruction overhead on non-vectorizable scatter/gather patterns). - **Parallel provides 2.6–6.8× speedup** across the board: `flip` (**5.2×**) and `transpose` (**6.8×**) benefit most because each row can be independently relocated. - **`merge` and `split`** (~2.5–2.7×) are still memory-bandwidth-limited even with parallelism, as they involve strided interleave/deinterleave across 3 channels. --- ## Summary — Best Configuration per Category | Category | Best Config | Typical Speedup vs Standard | | :----------------- | :---------------- | :-------------------------- | | Element-wise ops | Parallel | 1.5× (memory-bound) | | Reductions | Parallel | 6× (embarrassingly parallel)| | Color conversion | Parallel + SIMD | 6.6× (vectorizable + parallel) | | Spatial filters | Parallel | 3× (row-parallel convolution) | | Derivatives | Parallel | 12× (optimized fast path) | | Derivatives (f32 SIMD) | Parallel + SIMD | **22× (SIMD kernel + parallel)** | | Bilateral filter | Parallel | 7.1× (compute-bound, no SIMD gain) | | Canny edge detect | Parallel + SIMD | 4.6× (multi-stage pipeline)| | Structural ops | Parallel | 5× (independent row ops) | | GEMM | Parallel | 3.7× (compute-bound) | | Feature Detection | Parallel | Harris Corner ~26ms (1024x1024) | | FAST Detection | Parallel | **4.3x** speedup (2.14ms -> 499µs, 512x512) | | ORB Pipeline | Parallel | **3.8x** speedup (115.39ms -> 30.61ms, 512x512) | | Hough Transform | Parallel | Lines ~3.6ms / Circles ~4.5ms (512x512) | | Video Tracking | Parallel + SIMD | ~3–5× (49 pts) / ~4–8× (196 pts) | *Conclusion*: Parallelism (`rayon`) is the dominant optimization for nearly all operations. SIMD (`target-cpu=native`) provides meaningful additional gains primarily for **pixel-level math** (`cvt_color`, `convert_scale_abs`) and **f32 derivative kernels** (`sobel_3x3_f32`) where the inner loop is trivially vectorizable. The `sobel_3x3_f32` SIMD path achieves the project's highest combined speedup at **22×**. For memory-bound or scatter/gather patterns, SIMD adds no benefit.