{ "cells": [ { "cell_type": "markdown", "metadata": { "toc": "true" }, "source": [ "# Table of Contents\n", "

1  Numerical linear algebra: introduction
1.1  BLAS
1.2  Examples
1.3  Memory hierarchy and level-3 fraction
1.4  Effect of data layout
1.5  BLAS in R
1.6  Avoid memory allocation: some examples
" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Julia Version 1.1.0\n", "Commit 80516ca202 (2019-01-21 21:24 UTC)\n", "Platform Info:\n", " OS: macOS (x86_64-apple-darwin14.5.0)\n", " CPU: Intel(R) Core(TM) i7-6920HQ CPU @ 2.90GHz\n", " WORD_SIZE: 64\n", " LIBM: libopenlibm\n", " LLVM: libLLVM-6.0.1 (ORCJIT, skylake)\n", "Environment:\n", " JULIA_EDITOR = code\n" ] } ], "source": [ "versioninfo()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Numerical linear algebra: introduction" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* Topics in numerical algebra: \n", " - BLAS \n", " - solve linear equations $\\mathbf{A} \\mathbf{x} = \\mathbf{b}$\n", " - regression computations $\\mathbf{X}^T \\mathbf{X} \\beta = \\mathbf{X}^T \\mathbf{y}$ \n", " - eigen-problems $\\mathbf{A} \\mathbf{x} = \\lambda \\mathbf{x}$ \n", " - generalized eigen-problems $\\mathbf{A} \\mathbf{x} = \\lambda \\mathbf{B} \\mathbf{x}$ \n", " - singular value decompositions $\\mathbf{A} = \\mathbf{U} \\Sigma \\mathbf{V}^T$ \n", " - iterative methods for numerical linear algebra \n", "\n", "* Except for the iterative methods, most of these numerical linear algebra tasks are implemented in the BLAS and LAPACK libraries. They form the **building blocks** of most statistical computing tasks (optimization, MCMC).\n", "\n", "* Our major **goal** (or learning objectives) is to \n", " 1. know the complexity (flop count) of each task\n", " 2. be familiar with the BLAS and LAPACK functions (what they do) \n", " 3. do **not** re-invent wheels by implementing these dense linear algebra subroutines by yourself \n", " 4. understand the need for iterative methods \n", " 5. apply appropriate numerical algebra tools to various statistical problems \n", "\n", "* All high-level languages (R, Matlab, Julia) call BLAS and LAPACK for numerical linear algebra. \n", " - Julia offers more flexibility by exposing interfaces to many BLAS/LAPACK subroutines directly. See [documentation](https://docs.julialang.org/en/v1.1/stdlib/LinearAlgebra/#BLAS-Functions-1)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## BLAS\n", "\n", "* BLAS stands for _basic linear algebra subprograms_. \n", "\n", "* See [netlib](http://www.netlib.org/blas/) for a complete list of standardized BLAS functions.\n", "\n", "* There are many implementations of BLAS. \n", " - [Netlib](http://www.netlib.org/blas/) provides a reference implementation. \n", " - Matlab uses Intel's [MKL](https://software.intel.com/en-us/node/520724) (mathematical kernel libaries). **MKL implementation is the gold standard on market.** It is not open source but the compiled library is free for Linux and MacOS. \n", " - Julia uses [OpenBLAS](https://github.com/xianyi/OpenBLAS). **OpenBLAS is the best open source implementation**. \n", "\n", "* There are 3 levels of BLAS functions.\n", " - [Level 1](http://www.netlib.org/blas/#_level_1): vector-vector operation\n", " - [Level 2](http://www.netlib.org/blas/#_level_2): matrix-vector operation\n", " - [Level 3](http://www.netlib.org/blas/#_level_3): matrix-matrix operation\n", "\n", "| Level | Example Operation | Name | Dimension | Flops | \n", "|-------|----------------------------------------|-------------|-------------------------------------------|-------|\n", "| 1 | $\\alpha \\gets \\mathbf{x}^T \\mathbf{y}$ | dot product | $\\mathbf{x}, \\mathbf{y} \\in \\mathbb{R}^n$ | $2n$ | \n", "| 1 | $\\mathbf{y} \\gets \\mathbf{y} + \\alpha \\mathbf{x}$ | axpy | $\\alpha \\in \\mathbb{R}$, $\\mathbf{x}, \\mathbf{y} \\in \\mathbb{R}^n$ | $2n$ | \n", "| 2 | $\\mathbf{y} \\gets \\mathbf{y} + \\mathbf{A} \\mathbf{x}$ | gaxpy | $\\mathbf{A} \\in \\mathbb{R}^{m \\times n}$, $\\mathbf{x} \\in \\mathbb{R}^n$, $\\mathbf{y} \\in \\mathbb{R}^m$ | $2mn$ |\n", "| 2 | $\\mathbf{A} \\gets \\mathbf{A} + \\mathbf{y} \\mathbf{x}^T$ | rank one update | $\\mathbf{A} \\in \\mathbb{R}^{m \\times n}$, $\\mathbf{x} \\in \\mathbb{R}^n$, $\\mathbf{y} \\in \\mathbb{R}^m$ | $2mn$ |\n", "| 3 | $\\mathbf{C} \\gets \\mathbf{C} + \\mathbf{A} \\mathbf{B}$ | matrix multiplication | $\\mathbf{A} \\in \\mathbb{R}^{m \\times p}$, $\\mathbf{B} \\in \\mathbb{R}^{p \\times n}$, $\\mathbf{C} \\in \\mathbb{R}^{m \\times n}$ | $2mnp$ |\n", "\n", "* Typical BLAS functions support single precision (S), double precision (D), complex (C), and double complex (Z). " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Examples\n", "\n", "> **The form of a mathematical expression and the way the expression should be evaluated in actual practice may be quite different.**\n", "\n", "Some operations _appear_ as level-3 but indeed are level-2. \n", "\n", "**Example 1**. A common operation in statistics is column scaling or row scaling\n", "$$\n", "\\begin{eqnarray*}\n", " \\mathbf{A} &=& \\mathbf{A} \\mathbf{D} \\quad \\text{(column scaling)} \\\\\n", " \\mathbf{A} &=& \\mathbf{D} \\mathbf{A} \\quad \\text{(row scaling)},\n", "\\end{eqnarray*}\n", "$$\n", "where $\\mathbf{D}$ is diagonal. For example, in generalized linear models (GLMs), the Fisher information matrix takes the form \n", "$$\n", "\\mathbf{X}^T \\mathbf{W} \\mathbf{X},\n", "$$\n", "where $\\mathbf{W}$ is a diagonal matrix with observation weights on diagonal. \n", "\n", " Column and row scalings are essentially level-2 operations!" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "2000×2000 Diagonal{Float64,Array{Float64,1}}:\n", " 0.140972 ⋅ ⋅ ⋅ … ⋅ ⋅ ⋅ \n", " ⋅ 0.143596 ⋅ ⋅ ⋅ ⋅ ⋅ \n", " ⋅ ⋅ 0.612494 ⋅ ⋅ ⋅ ⋅ \n", " ⋅ ⋅ ⋅ 0.0480573 ⋅ ⋅ ⋅ \n", " ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ \n", " ⋅ ⋅ ⋅ ⋅ … ⋅ ⋅ ⋅ \n", " ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ \n", " ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ \n", " ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ \n", " ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ \n", " ⋅ ⋅ ⋅ ⋅ … ⋅ ⋅ ⋅ \n", " ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ \n", " ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ \n", " ⋮ ⋱ \n", " ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ \n", " ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ \n", " ⋅ ⋅ ⋅ ⋅ … ⋅ ⋅ ⋅ \n", " ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ \n", " ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ \n", " ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ \n", " ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ \n", " ⋅ ⋅ ⋅ ⋅ … ⋅ ⋅ ⋅ \n", " ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ \n", " ⋅ ⋅ ⋅ ⋅ 0.882415 ⋅ ⋅ \n", " ⋅ ⋅ ⋅ ⋅ ⋅ 0.450904 ⋅ \n", " ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 0.814614" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "using BenchmarkTools, LinearAlgebra, Random\n", "\n", "Random.seed!(123) # seed\n", "n = 2000\n", "A = rand(n, n) # n-by-n matrix\n", "d = rand(n) # n vector\n", "D = Diagonal(d) # diagonal matrix with d as diagonal" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "2000×2000 Array{Float64,2}:\n", " 0.140972 0.0 0.0 0.0 … 0.0 0.0 0.0 \n", " 0.0 0.143596 0.0 0.0 0.0 0.0 0.0 \n", " 0.0 0.0 0.612494 0.0 0.0 0.0 0.0 \n", " 0.0 0.0 0.0 0.0480573 0.0 0.0 0.0 \n", " 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", " 0.0 0.0 0.0 0.0 … 0.0 0.0 0.0 \n", " 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", " 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", " 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", " 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", " 0.0 0.0 0.0 0.0 … 0.0 0.0 0.0 \n", " 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", " 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", " ⋮ ⋱ \n", " 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", " 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", " 0.0 0.0 0.0 0.0 … 0.0 0.0 0.0 \n", " 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", " 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", " 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", " 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", " 0.0 0.0 0.0 0.0 … 0.0 0.0 0.0 \n", " 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", " 0.0 0.0 0.0 0.0 0.882415 0.0 0.0 \n", " 0.0 0.0 0.0 0.0 0.0 0.450904 0.0 \n", " 0.0 0.0 0.0 0.0 0.0 0.0 0.814614" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "Dfull = convert(Matrix, D) # convert to full matrix" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "BenchmarkTools.Trial: \n", " memory estimate: 30.52 MiB\n", " allocs estimate: 2\n", " --------------\n", " minimum time: 100.961 ms (2.61% GC)\n", " median time: 102.280 ms (2.60% GC)\n", " mean time: 103.678 ms (3.29% GC)\n", " maximum time: 138.528 ms (26.71% GC)\n", " --------------\n", " samples: 49\n", " evals/sample: 1" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# this is calling BLAS routine for matrix multiplication: O(n^3) flops\n", "# this is SLOW!\n", "@benchmark $A * $Dfull" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "BenchmarkTools.Trial: \n", " memory estimate: 30.52 MiB\n", " allocs estimate: 2\n", " --------------\n", " minimum time: 9.051 ms (3.33% GC)\n", " median time: 11.632 ms (24.18% GC)\n", " mean time: 11.911 ms (24.67% GC)\n", " maximum time: 53.310 ms (83.51% GC)\n", " --------------\n", " samples: 420\n", " evals/sample: 1" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# dispatch to special method for diagonal matrix multiplication.\n", "# columnwise scaling: O(n^2) flops\n", "@benchmark $A * $D" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "BenchmarkTools.Trial: \n", " memory estimate: 0 bytes\n", " allocs estimate: 0\n", " --------------\n", " minimum time: 5.276 ms (0.00% GC)\n", " median time: 10.364 ms (0.00% GC)\n", " mean time: 11.183 ms (0.00% GC)\n", " maximum time: 19.343 ms (0.00% GC)\n", " --------------\n", " samples: 447\n", " evals/sample: 1" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# in-place: avoid allocate space for result\n", "# rmul!: compute matrix-matrix product AB, overwriting A, and return the result.\n", "@benchmark rmul!(A, D)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Note:** In R or Matlab, `diag(d)` will create a full matrix. Be cautious using `diag` function: do we really need a full diagonal matrix?" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "┌ Info: Recompiling stale cache file /Users/huazhou/.julia/compiled/v1.1/RCall/8GFyb.ji for RCall [6f49c342-dc21-5d91-9882-a32aef131414]\n", "└ @ Base loading.jl:1184\n" ] }, { "data": { "text/plain": [ "RObject{RealSxp}\n", " [,1] [,2] [,3] [,4] [,5]\n", "[1,] 0.1639009 0.00000000 0.0000000 0.0000000 0.000000\n", "[2,] 0.0000000 0.03372306 0.0000000 0.0000000 0.000000\n", "[3,] 0.0000000 0.00000000 0.7544716 0.0000000 0.000000\n", "[4,] 0.0000000 0.00000000 0.0000000 0.3005411 0.000000\n", "[5,] 0.0000000 0.00000000 0.0000000 0.0000000 0.144251\n" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "using RCall\n", "\n", "R\"\"\"\n", "d <- runif(5)\n", "diag(d)\n", "\"\"\"" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ ">> >> >> \n", "d =\n", "\n", " 0.8147\n", " 0.9058\n", " 0.1270\n", " 0.9134\n", " 0.6324\n", "\n" ] }, { "data": { "text/plain": [ "5×5 Array{Float64,2}:\n", " 0.814724 0.0 0.0 0.0 0.0 \n", " 0.0 0.905792 0.0 0.0 0.0 \n", " 0.0 0.0 0.126987 0.0 0.0 \n", " 0.0 0.0 0.0 0.913376 0.0 \n", " 0.0 0.0 0.0 0.0 0.632359" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "using MATLAB\n", "\n", "mat\"\"\"\n", "d = rand(5, 1)\n", "diag(d)\n", "\"\"\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Example 2**. Innter product between two matrices $\\mathbf{A}, \\mathbf{B} \\in \\mathbb{R}^{m \\times n}$ is often written as \n", "$$\n", " \\text{trace}(\\mathbf{A}^T \\mathbf{B}), \\text{trace}(\\mathbf{B} \\mathbf{A}^T), \\text{trace}(\\mathbf{A} \\mathbf{B}^T), \\text{ or } \\text{trace}(\\mathbf{B}^T \\mathbf{A}).\n", "$$\n", "They appear as level-3 operation (matrix multiplication with $O(m^2n)$ or $O(mn^2)$ flops)." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "BenchmarkTools.Trial: \n", " memory estimate: 30.52 MiB\n", " allocs estimate: 2\n", " --------------\n", " minimum time: 104.447 ms (0.43% GC)\n", " median time: 110.943 ms (2.73% GC)\n", " mean time: 110.926 ms (2.70% GC)\n", " maximum time: 117.164 ms (2.65% GC)\n", " --------------\n", " samples: 46\n", " evals/sample: 1" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "Random.seed!(123)\n", "n = 2000\n", "A, B = randn(n, n), randn(n, n)\n", "\n", "# slow way to evaluate this thing\n", "@benchmark tr(transpose($A) * $B)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "But $\\text{trace}(\\mathbf{A}^T \\mathbf{B}) = <\\text{vec}(\\mathbf{A}), \\text{vec}(\\mathbf{B})>$. The latter is level-2 operation with $O(mn)$ flops." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "BenchmarkTools.Trial: \n", " memory estimate: 0 bytes\n", " allocs estimate: 0\n", " --------------\n", " minimum time: 2.390 ms (0.00% GC)\n", " median time: 3.613 ms (0.00% GC)\n", " mean time: 3.618 ms (0.00% GC)\n", " maximum time: 5.299 ms (0.00% GC)\n", " --------------\n", " samples: 1375\n", " evals/sample: 1" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "@benchmark dot($A, $B)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Example 3**. Similarly $\\text{diag}(\\mathbf{A}^T \\mathbf{B})$ can be calculated in $O(mn)$ flops." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "BenchmarkTools.Trial: \n", " memory estimate: 30.53 MiB\n", " allocs estimate: 3\n", " --------------\n", " minimum time: 105.656 ms (2.72% GC)\n", " median time: 110.821 ms (2.75% GC)\n", " mean time: 111.252 ms (2.69% GC)\n", " maximum time: 119.092 ms (2.59% GC)\n", " --------------\n", " samples: 45\n", " evals/sample: 1" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# slow way to evaluate this thing\n", "@benchmark diag(transpose($A) * $B)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "BenchmarkTools.Trial: \n", " memory estimate: 30.53 MiB\n", " allocs estimate: 6\n", " --------------\n", " minimum time: 9.309 ms (5.34% GC)\n", " median time: 11.412 ms (22.53% GC)\n", " mean time: 11.468 ms (22.75% GC)\n", " maximum time: 13.552 ms (34.13% GC)\n", " --------------\n", " samples: 436\n", " evals/sample: 1" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# smarter\n", "@benchmark Diagonal(vec(sum($A .* $B, dims=1)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To get rid of allocation of intermediate array at all, we can just write a double loop or use `dot` function." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "BenchmarkTools.Trial: \n", " memory estimate: 203.33 KiB\n", " allocs estimate: 4004\n", " --------------\n", " minimum time: 3.277 ms (0.00% GC)\n", " median time: 3.503 ms (0.00% GC)\n", " mean time: 3.598 ms (0.39% GC)\n", " maximum time: 6.018 ms (29.67% GC)\n", " --------------\n", " samples: 1386\n", " evals/sample: 1" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "function diag_matmul(A, B)\n", " m, n = size(A)\n", " @assert size(B) == (m, n) \"A and B should have same size\"\n", " @views d = [dot(A[:, j], B[:, j]) for j in 1:n]\n", "# d = zeros(eltype(A), n)\n", "# for j in 1:n, i in 1:m\n", "# d[j] += A[i, j] * B[i, j]\n", "# end\n", " Diagonal(d)\n", "end\n", "\n", "@benchmark diag_matmul($A, $B)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Memory hierarchy and level-3 fraction\n", "\n", "> **Key to high performance is effective use of memory hierarchy. True on all architectures.**\n", "\n", "* Flop count is not the sole determinant of algorithm efficiency. Another important factor is data movement through the memory hierarchy.\n", "\n", "\n", "\n", " \n", "\n", "\n", "\n", "* Numbers everyone should know\n", "\n", "| Operation | Time |\n", "|-------------------------------------|----------------|\n", "| L1 cache reference | 0.5 ns |\n", "| L2 cache reference | 7 ns |\n", "| Main memory reference | 100 ns |\n", "| Read 1 MB sequentially from memory | 250,000 ns |\n", "| Read 1 MB sequentially from SSD | 1,000,000 ns | \n", "| Read 1 MB sequentially from disk | 20,000,000 ns |\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", " Source: \n", "\n", "* For example, Xeon X5650 CPU has a theoretical throughput of 128 DP GFLOPS but a max memory bandwidth of 32GB/s. \n", "\n", "* Can we keep CPU cores busy with enough deliveries of matrix data and ship the results to memory fast enough to avoid backlog? \n", "Answer: use **high-level BLAS** as much as possible.\n", "\n", "| BLAS | Dimension | Mem. Refs. | Flops | Ratio |\n", "|--------------------------------|------------------------------------------------------------|------------|--------|-------|\n", "| Level 1: $\\mathbf{y} \\gets \\mathbf{y} + \\alpha \\mathbf{x}$ | $\\mathbf{x}, \\mathbf{y} \\in \\mathbb{R}^n$ | $3n$ | $2n$ | 3:2 |\n", "| Level 2: $\\mathbf{y} \\gets \\mathbf{y} + \\mathbf{A} \\mathbf{x}$ | $\\mathbf{x}, \\mathbf{y} \\in \\mathbb{R}^n$, $\\mathbf{A} \\in \\mathbb{R}^{n \\times n}$ | $n^2$ | $2n^2$ | 1:2 |\n", "| Level 3: $\\mathbf{C} \\gets \\mathbf{C} + \\mathbf{A} \\mathbf{B}$ | $\\mathbf{A}, \\mathbf{B}, \\mathbf{C} \\in\\mathbb{R}^{n \\times n}$ | $4n^2$ | $2n^3$ | 2:n | \n", "\n", "* Higher level BLAS (3 or 2) make more effective use of arithmetic logic units (ALU) by keeping them busy. **Surface-to-volume** effect. \n", "See [Dongarra slides](https://www.samsi.info/wp-content/uploads/2017/02/SAMSI-0217_Dongarra.pdf).\n", "\n", "\n", "\n", "* A distinction between LAPACK and LINPACK (older version of R uses LINPACK) is that LAPACK makes use of higher level BLAS as much as possible (usually by smart partitioning) to increase the so-called **level-3 fraction**.\n", "\n", "* To appreciate the efforts in an optimized BLAS implementation such as OpenBLAS (evolved from GotoBLAS), see the [Quora question](https://www.quora.com/What-algorithm-does-BLAS-use-for-matrix-multiplication-Of-all-the-considerations-e-g-cache-popular-instruction-sets-Big-O-etc-which-one-turned-out-to-be-the-primary-bottleneck), especially the [video](https://youtu.be/JzNpKDW07rw). Bottomline is \n", "\n", "> **Get familiar with (good implementations of) BLAS/LAPACK and use them as much as possible.**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Effect of data layout\n", "\n", "* Data layout in memory affects algorithmic efficiency too. It is much faster to move chunks of data in memory than retrieving/writing scattered data.\n", "\n", "* Storage mode: **column-major** (Fortran, Matlab, R, Julia) vs **row-major** (C/C++).\n", "\n", "* **Cache line** is the minimum amount of cache which can be loaded and stored to memory.\n", " - x86 CPUs: 64 bytes \n", " - ARM CPUs: 32 bytes\n", "\n", "\n", "\n", "* Accessing column-major stored matrix by rows ($ij$ looping) causes lots of **cache misses**.\n", "\n", "* Take matrix multiplication as an example \n", "$$ \n", "\\mathbf{C} \\gets \\mathbf{C} + \\mathbf{A} \\mathbf{B}, \\quad \\mathbf{A} \\in \\mathbb{R}^{m \\times p}, \\mathbf{B} \\in \\mathbb{R}^{p \\times n}, \\mathbf{C} \\in \\mathbb{R}^{m \\times n}.\n", "$$\n", "Assume the storage is column-major, such as in Julia. There are 6 variants of the algorithms according to the order in the triple loops. \n", " - `jki` or `kji` looping:\n", " ```julia\n", " # inner most loop\n", " for i = 1:m\n", " C[i, j] = C[i, j] + A[i, k] * B[k, j]\n", " end\n", " ``` \n", " - `ikj` or `kij` looping:\n", " ```julia\n", " # inner most loop \n", " for j = 1:n\n", " C[i, j] = C[i, j] + A[i, k] * B[k, j]\n", " end\n", " ``` \n", " - `ijk` or `jik` looping:\n", " ```julia\n", " # inner most loop \n", " for k = 1:p\n", " C[i, j] = C[i, j] + A[i, k] * B[k, j]\n", " end\n", " ```\n", "* We pay attention to the innermost loop, where the vector calculation occurs. The associated **stride** when accessing the three matrices in memory (assuming column-major storage) is \n", "\n", "| Variant | A Stride | B Stride | C Stride |\n", "|----------------|----------|----------|----------|\n", "| $jki$ or $kji$ | Unit | 0 | Unit |\n", "| $ikj$ or $kij$ | 0 | Non-Unit | Non-Unit |\n", "| $ijk$ or $jik$ | Non-Unit | Unit | 0 | \n", "Apparently the variants $jki$ or $kji$ are preferred." ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "\"\"\"\n", " matmul_by_loop!(A, B, C, order)\n", "\n", "Overwrite `C` by `A * B`. `order` indicates the looping order for triple loop.\n", "\"\"\"\n", "function matmul_by_loop!(A::Matrix, B::Matrix, C::Matrix, order::String)\n", " \n", " m = size(A, 1)\n", " p = size(A, 2)\n", " n = size(B, 2)\n", " fill!(C, 0)\n", " \n", " if order == \"jki\"\n", " for j = 1:n, k = 1:p, i = 1:m\n", " C[i, j] += A[i, k] * B[k, j]\n", " end\n", " end\n", "\n", " if order == \"kji\"\n", " for k = 1:p, j = 1:n, i = 1:m\n", " C[i, j] += A[i, k] * B[k, j]\n", " end\n", " end\n", " \n", " if order == \"ikj\"\n", " for i = 1:m, k = 1:p, j = 1:n\n", " C[i, j] += A[i, k] * B[k, j]\n", " end\n", " end\n", "\n", " if order == \"kij\"\n", " for k = 1:p, i = 1:m, j = 1:n\n", " C[i, j] += A[i, k] * B[k, j]\n", " end\n", " end\n", " \n", " if order == \"ijk\"\n", " for i = 1:m, j = 1:n, k = 1:p\n", " C[i, j] += A[i, k] * B[k, j]\n", " end\n", " end\n", " \n", " if order == \"jik\"\n", " for j = 1:n, i = 1:m, k = 1:p\n", " C[i, j] += A[i, k] * B[k, j]\n", " end\n", " end\n", " \n", "end\n", "\n", "using Random\n", "\n", "Random.seed!(123)\n", "m, p, n = 2000, 100, 2000\n", "A = rand(m, p)\n", "B = rand(p, n)\n", "C = zeros(m, n);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* $jki$ and $kji$ looping:" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "BenchmarkTools.Trial: \n", " memory estimate: 0 bytes\n", " allocs estimate: 0\n", " --------------\n", " minimum time: 276.404 ms (0.00% GC)\n", " median time: 282.157 ms (0.00% GC)\n", " mean time: 283.423 ms (0.00% GC)\n", " maximum time: 291.927 ms (0.00% GC)\n", " --------------\n", " samples: 18\n", " evals/sample: 1" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "using BenchmarkTools\n", "\n", "@benchmark matmul_by_loop!($A, $B, $C, \"jki\")" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "BenchmarkTools.Trial: \n", " memory estimate: 0 bytes\n", " allocs estimate: 0\n", " --------------\n", " minimum time: 335.573 ms (0.00% GC)\n", " median time: 355.841 ms (0.00% GC)\n", " mean time: 362.742 ms (0.00% GC)\n", " maximum time: 409.550 ms (0.00% GC)\n", " --------------\n", " samples: 14\n", " evals/sample: 1" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "@benchmark matmul_by_loop!($A, $B, $C, \"kji\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* $ikj$ and $kij$ looping:" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "BenchmarkTools.Trial: \n", " memory estimate: 0 bytes\n", " allocs estimate: 0\n", " --------------\n", " minimum time: 2.405 s (0.00% GC)\n", " median time: 2.406 s (0.00% GC)\n", " mean time: 2.431 s (0.00% GC)\n", " maximum time: 2.480 s (0.00% GC)\n", " --------------\n", " samples: 3\n", " evals/sample: 1" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "@benchmark matmul_by_loop!($A, $B, $C, \"ikj\")" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "BenchmarkTools.Trial: \n", " memory estimate: 0 bytes\n", " allocs estimate: 0\n", " --------------\n", " minimum time: 2.714 s (0.00% GC)\n", " median time: 2.726 s (0.00% GC)\n", " mean time: 2.726 s (0.00% GC)\n", " maximum time: 2.738 s (0.00% GC)\n", " --------------\n", " samples: 2\n", " evals/sample: 1" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "@benchmark matmul_by_loop!($A, $B, $C, \"kij\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* $ijk$ and $jik$ looping:" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "BenchmarkTools.Trial: \n", " memory estimate: 0 bytes\n", " allocs estimate: 0\n", " --------------\n", " minimum time: 960.266 ms (0.00% GC)\n", " median time: 975.456 ms (0.00% GC)\n", " mean time: 979.488 ms (0.00% GC)\n", " maximum time: 1.001 s (0.00% GC)\n", " --------------\n", " samples: 6\n", " evals/sample: 1" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "@benchmark matmul_by_loop!($A, $B, $C, \"ijk\")" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "BenchmarkTools.Trial: \n", " memory estimate: 0 bytes\n", " allocs estimate: 0\n", " --------------\n", " minimum time: 1.017 s (0.00% GC)\n", " median time: 1.046 s (0.00% GC)\n", " mean time: 1.040 s (0.00% GC)\n", " maximum time: 1.058 s (0.00% GC)\n", " --------------\n", " samples: 5\n", " evals/sample: 1" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "@benchmark matmul_by_loop!($A, $B, $C, \"ijk\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* Julia wraps BLAS library for matrix multiplication. We see BLAS library wins hands down (multi-threading, Strassen algorithm, higher level-3 fraction by block outer product)." ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "BenchmarkTools.Trial: \n", " memory estimate: 0 bytes\n", " allocs estimate: 0\n", " --------------\n", " minimum time: 7.078 ms (0.00% GC)\n", " median time: 7.229 ms (0.00% GC)\n", " mean time: 7.805 ms (0.00% GC)\n", " maximum time: 10.858 ms (0.00% GC)\n", " --------------\n", " samples: 640\n", " evals/sample: 1" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "@benchmark mul!($C, $A, $B)" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "BenchmarkTools.Trial: \n", " memory estimate: 0 bytes\n", " allocs estimate: 0\n", " --------------\n", " minimum time: 7.209 ms (0.00% GC)\n", " median time: 8.820 ms (0.00% GC)\n", " mean time: 8.798 ms (0.00% GC)\n", " maximum time: 11.884 ms (0.00% GC)\n", " --------------\n", " samples: 568\n", " evals/sample: 1" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# direct call of BLAS wrapper function\n", "@benchmark LinearAlgebra.BLAS.gemm!('N', 'N', 1.0, $A, $B, 0.0, $C)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## BLAS in R\n", "\n", "* **Tip for R user**. Standard R distribution from CRAN uses a very out-dated BLAS/LAPACK library." ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "RObject{VecSxp}\n", "Unit: milliseconds\n", " expr min lq mean median uq max\n", " `#JL`$A %*% `#JL`$B 226.5638 234.8593 242.5344 242.8393 250.3439 259.4928\n", " neval\n", " 100\n" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "using RCall\n", "\n", "R\"\"\"\n", "library(\"microbenchmark\")\n", "microbenchmark($A %*% $B)\n", "\"\"\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* Re-build R from source using OpenBLAS or MKL will immediately boost linear algebra performance in R. Google `build R using MKL` to get started. Similarly we can build Julia using MKL.\n", "\n", "* Matlab uses MKL. Usually it's very hard to beat Matlab in terms of linear algebra." ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.0099658959695" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "using MATLAB\n", "\n", "mat\"\"\"\n", "f = @() $A * $B;\n", "timeit(f)\n", "\"\"\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Avoid memory allocation: some examples\n", "\n", "1. Transposing matrix is an expensive memory operation. \n", " - In R, the command \n", " ```R\n", " t(A) %*% x\n", " ```\n", " will first transpose `A` then perform matrix multiplication, causing unnecessary memory allocation\n", " - Julia is smart to avoid transposing matrix if possible." ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [], "source": [ "using Random, LinearAlgebra, BenchmarkTools\n", "Random.seed!(123)\n", "\n", "n = 1000\n", "A = rand(n, n)\n", "x = rand(n);" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Transpose{Float64,Array{Float64,2}}" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "typeof(transpose(A))" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(:parent,)" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fieldnames(typeof(transpose(A)))" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(Ptr{Float64} @0x0000000142a9c000, Ptr{Float64} @0x0000000142a9c000)" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# same data in tranpose(A) and original matrix A\n", "pointer(transpose(A).parent), pointer(A)" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "BenchmarkTools.Trial: \n", " memory estimate: 7.94 KiB\n", " allocs estimate: 1\n", " --------------\n", " minimum time: 98.703 μs (0.00% GC)\n", " median time: 141.025 μs (0.00% GC)\n", " mean time: 156.860 μs (0.00% GC)\n", " maximum time: 369.885 μs (0.00% GC)\n", " --------------\n", " samples: 10000\n", " evals/sample: 1" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# dispatch to BLAS\n", "# does *not* actually transpose the matrix\n", "@benchmark transpose($A) * $x" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "BenchmarkTools.Trial: \n", " memory estimate: 0 bytes\n", " allocs estimate: 0\n", " --------------\n", " minimum time: 103.759 μs (0.00% GC)\n", " median time: 139.496 μs (0.00% GC)\n", " mean time: 157.336 μs (0.00% GC)\n", " maximum time: 399.354 μs (0.00% GC)\n", " --------------\n", " samples: 10000\n", " evals/sample: 1" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# pre-allocate result\n", "out = zeros(size(A, 2))\n", "@benchmark mul!($out, transpose($A), $x)" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "BenchmarkTools.Trial: \n", " memory estimate: 0 bytes\n", " allocs estimate: 0\n", " --------------\n", " minimum time: 106.017 μs (0.00% GC)\n", " median time: 137.537 μs (0.00% GC)\n", " mean time: 146.369 μs (0.00% GC)\n", " maximum time: 736.091 μs (0.00% GC)\n", " --------------\n", " samples: 10000\n", " evals/sample: 1" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# or call BLAS wrapper directly\n", "@benchmark LinearAlgebra.BLAS.gemv!('T', 1.0, $A, $x, 0.0, $out)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "2. [Broadcasting](https://docs.julialang.org/en/v1/base/arrays/#Broadcast-and-vectorization-1) in Julia achieves vectorized code without creating intermediate arrays.\n", "\n", " Suppose we want to calculate elementsize maximum of absolute values of two large arrays. In R or Matlab, the command\n", "```r\n", "max(abs(X), abs(Y))\n", "```\n", "will create two intermediate arrays and then one result array." ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "RObject{VecSxp}\n", "Unit: milliseconds\n", " expr min lq mean median uq\n", " max(abs(`#JL`$X), abs(`#JL`$Y)) 5.752931 6.047952 7.97932 6.821782 8.24881\n", " max neval\n", " 26.40112 100\n" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "using RCall\n", "Random.seed!(123)\n", "X, Y = rand(1000, 1000), rand(1000, 1000)\n", "\n", "R\"\"\"\n", "library(microbenchmark)\n", "microbenchmark(max(abs($X), abs($Y)))\n", "\"\"\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In Julia, dot operations are fused so no intermediate arrays are created." ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "BenchmarkTools.Trial: \n", " memory estimate: 7.63 MiB\n", " allocs estimate: 2\n", " --------------\n", " minimum time: 1.675 ms (0.00% GC)\n", " median time: 1.903 ms (0.00% GC)\n", " mean time: 2.529 ms (27.30% GC)\n", " maximum time: 5.326 ms (54.28% GC)\n", " --------------\n", " samples: 1972\n", " evals/sample: 1" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# no intermediate arrays created, only result array created\n", "@benchmark max.(abs.($X), abs.($Y))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Pre-allocating result array gets rid of memory allocation at all." ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "BenchmarkTools.Trial: \n", " memory estimate: 0 bytes\n", " allocs estimate: 0\n", " --------------\n", " minimum time: 1.135 ms (0.00% GC)\n", " median time: 1.197 ms (0.00% GC)\n", " mean time: 1.282 ms (0.00% GC)\n", " maximum time: 1.985 ms (0.00% GC)\n", " --------------\n", " samples: 3878\n", " evals/sample: 1" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# no memory allocation at all!\n", "Z = zeros(size(X)) # zero matrix of same size as X\n", "@benchmark $Z .= max.(abs.($X), abs.($Y)) # .= (vs =) is important!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "3. [View](https://docs.julialang.org/en/v1/base/arrays/#Views-(SubArrays-and-other-view-types)-1) avoids creating extra copy of matrix data." ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "BenchmarkTools.Trial: \n", " memory estimate: 488.39 KiB\n", " allocs estimate: 2\n", " --------------\n", " minimum time: 70.431 μs (0.00% GC)\n", " median time: 304.043 μs (0.00% GC)\n", " mean time: 287.967 μs (14.85% GC)\n", " maximum time: 2.843 ms (84.42% GC)\n", " --------------\n", " samples: 10000\n", " evals/sample: 1" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "Random.seed!(123)\n", "A = randn(1000, 1000)\n", "\n", "# sum entries in a sub-matrix\n", "@benchmark sum($A[1:2:500, 1:2:500])" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "BenchmarkTools.Trial: \n", " memory estimate: 208 bytes\n", " allocs estimate: 7\n", " --------------\n", " minimum time: 118.440 μs (0.00% GC)\n", " median time: 128.581 μs (0.00% GC)\n", " mean time: 128.805 μs (0.00% GC)\n", " maximum time: 484.048 μs (0.00% GC)\n", " --------------\n", " samples: 10000\n", " evals/sample: 1" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# view avoids creating a separate sub-matrix\n", "@benchmark sum(@view $A[1:2:500, 1:2:500])" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "The [`@views`](https://docs.julialang.org/en/v1/base/arrays/#Base.@views) macro, which can be useful in [some operations](https://discourse.julialang.org/t/why-is-a-manual-in-place-addition-so-much-faster-than-and-on-range-indexed-arrays/3302)." ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "BenchmarkTools.Trial: \n", " memory estimate: 208 bytes\n", " allocs estimate: 7\n", " --------------\n", " minimum time: 118.367 μs (0.00% GC)\n", " median time: 120.747 μs (0.00% GC)\n", " mean time: 124.625 μs (0.00% GC)\n", " maximum time: 342.429 μs (0.00% GC)\n", " --------------\n", " samples: 10000\n", " evals/sample: 1" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "@benchmark @views sum($A[1:2:500, 1:2:500])" ] } ], "metadata": { "@webio": { "lastCommId": null, "lastKernelId": null }, "kernelspec": { "display_name": "Julia 1.1.0", "language": "julia", "name": "julia-1.1" }, "language_info": { "file_extension": ".jl", "mimetype": "application/julia", "name": "julia", "version": "1.1.0" }, "toc": { "colors": { "hover_highlight": "#DAA520", "running_highlight": "#FF0000", "selected_highlight": "#FFD700" }, "moveMenuLeft": true, "nav_menu": { "height": "103px", "width": "252px" }, "navigate_menu": true, "number_sections": true, "sideBar": true, "skip_h1_title": true, "threshold": 4, "toc_cell": true, "toc_position": { "height": "689.8912963867188px", "left": "0px", "right": "619.4000244140625px", "top": "76.10054779052734px", "width": "224.36141967773438px" }, "toc_section_display": "block", "toc_window_display": true, "widenNotebook": false } }, "nbformat": 4, "nbformat_minor": 2 }