## Slow loops

Pure-Python loops are slow for numeric operations: every iteration runs through the interpreter and works on boxed Python objects.

In [None]:
import numpy as np

Here's a function that computes the sum of the log of all non-zero values.

In [None]:
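def sum_log_nz(ary):
    # One output slot per input element; zeros stay 0 and do not affect the sum.
    res = np.zeros(ary.shape[0])
    for i in range(ary.shape[0]):
        v = ary[i]
        # Skip zeros, because log(0) is -inf.
        if v != 0:
            res[i] = np.log(v)
    return res.sum()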

Test the function on five million random values drawn from `[0, 1)`.

In [None]:
a = np.random.random(5_000_000)

In [None]:
a

In [None]:
sum_log_nz(a)
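As a sanity check, the loop can be compared against a vectorized NumPy expression on a small input. This is only a sketch: the names `sample` and `expected` and the forced zeros are illustrative assumptions, not part of the notebook.

```python
import numpy as np


def sum_log_nz(ary):
    res = np.zeros(ary.shape[0])
    for i in range(ary.shape[0]):
        v = ary[i]
        if v != 0:
            res[i] = np.log(v)
    return res.sum()


# Small seeded input so the pure-Python loop finishes quickly.
rng = np.random.default_rng(0)
sample = rng.random(1_000)
sample[::100] = 0.0  # force a few zeros to exercise the `v != 0` branch

# Reference value: mask out the zeros, take the log, then sum.
expected = np.log(sample[sample != 0]).sum()
result = sum_log_nz(sample)

print(result, expected)
```

The two values should agree up to floating-point rounding, since both sum the logs of the same non-zero entries.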

Time the pure-Python function.

In [None]:
%%time 
sum_log_nz(a)

## SIMD Loops

Numba can compile the inefficient pure-Python loop into a SIMD-vectorized native loop.

In [None]:
import numba

Try compiling the function with Numba.

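Notice how the `fastmath=` setting affects the result and the timing: `fastmath=True` lets the compiler relax strict IEEE 754 floating-point rules, for example by reordering operations.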

In [None]:
fast_sum_log_nz = numba.njit(fastmath=True)(sum_log_nz)
fast_sum_log_nz

In [None]:
fast_sum_log_nz(a)

Notice the improved performance. The previous call already triggered compilation, so this timing covers only the compiled code.

In [None]:
%%time

fast_sum_log_nz(a)

Display the control-flow graph of the compiled function.

In [None]:
fast_sum_log_nz.inspect_cfg(fast_sum_log_nz.signatures[0]).display()

## Parallel Loops

Numba can auto-parallelize the function to leverage multiple threads.

In [None]:
par_sum_log_nz = numba.njit(parallel=True)(sum_log_nz)

In [None]:
par_sum_log_nz(a)

Use `.parallel_diagnostics()` to inspect what the compiler has done to optimize the function.

Note:
* The manually written `for` loop is not recognized as a parallel loop.

In [None]:
par_sum_log_nz.parallel_diagnostics()

Use `numba.prange` to mark a loop for parallelization.

In [None]:
@numba.njit(parallel=True, fastmath=True)
def par_sum_log_nz(ary):
    res = np.zeros(ary.shape[0])
    for i in numba.prange(ary.shape[0]):
        v = ary[i]
        if v != 0:
            res[i] = np.log(v)
    return res.sum()

In [None]:
par_sum_log_nz(a)

In [None]:
%%time
par_sum_log_nz(a)

Compare the output of `.parallel_diagnostics()` with the previous version.

Note:
* Three parallel loops are recognized.
* The loops are fused because they iterate over the same domain.

In [None]:
par_sum_log_nz.parallel_diagnostics()