{ "cells": [ { "cell_type": "markdown", "metadata": { "toc": "true" }, "source": [ "# Table of Contents\n", "
" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Julia Version 1.1.0\n", "Commit 80516ca202 (2019-01-21 21:24 UTC)\n", "Platform Info:\n", " OS: macOS (x86_64-apple-darwin14.5.0)\n", " CPU: Intel(R) Core(TM) i7-6920HQ CPU @ 2.90GHz\n", " WORD_SIZE: 64\n", " LIBM: libopenlibm\n", " LLVM: libLLVM-6.0.1 (ORCJIT, skylake)\n", "Environment:\n", " JULIA_EDITOR = code\n" ] } ], "source": [ "versioninfo()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Numerical linear algebra: introduction" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* Topics in numerical algebra: \n", " - BLAS \n", " - solve linear equations $\\mathbf{A} \\mathbf{x} = \\mathbf{b}$\n", " - regression computations $\\mathbf{X}^T \\mathbf{X} \\beta = \\mathbf{X}^T \\mathbf{y}$ \n", " - eigen-problems $\\mathbf{A} \\mathbf{x} = \\lambda \\mathbf{x}$ \n", " - generalized eigen-problems $\\mathbf{A} \\mathbf{x} = \\lambda \\mathbf{B} \\mathbf{x}$ \n", " - singular value decompositions $\\mathbf{A} = \\mathbf{U} \\Sigma \\mathbf{V}^T$ \n", " - iterative methods for numerical linear algebra \n", "\n", "* Except for the iterative methods, most of these numerical linear algebra tasks are implemented in the BLAS and LAPACK libraries. They form the **building blocks** of most statistical computing tasks (optimization, MCMC).\n", "\n", "* Our major **goal** (or learning objectives) is to \n", " 1. know the complexity (flop count) of each task\n", " 2. be familiar with the BLAS and LAPACK functions (what they do) \n", " 3. do **not** re-invent wheels by implementing these dense linear algebra subroutines by yourself \n", " 4. understand the need for iterative methods \n", " 5. apply appropriate numerical algebra tools to various statistical problems \n", "\n", "* All high-level languages (R, Matlab, Julia) call BLAS and LAPACK for numerical linear algebra. 
\n", " - Julia offers more flexibility by exposing interfaces to many BLAS/LAPACK subroutines directly. See [documentation](https://docs.julialang.org/en/v1.1/stdlib/LinearAlgebra/#BLAS-Functions-1)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## BLAS\n", "\n", "* BLAS stands for _basic linear algebra subprograms_. \n", "\n", "* See [netlib](http://www.netlib.org/blas/) for a complete list of standardized BLAS functions.\n", "\n", "* There are many implementations of BLAS. \n", " - [Netlib](http://www.netlib.org/blas/) provides a reference implementation. \n", " - Matlab uses Intel's [MKL](https://software.intel.com/en-us/node/520724) (mathematical kernel libaries). **MKL implementation is the gold standard on market.** It is not open source but the compiled library is free for Linux and MacOS. \n", " - Julia uses [OpenBLAS](https://github.com/xianyi/OpenBLAS). **OpenBLAS is the best open source implementation**. \n", "\n", "* There are 3 levels of BLAS functions.\n", " - [Level 1](http://www.netlib.org/blas/#_level_1): vector-vector operation\n", " - [Level 2](http://www.netlib.org/blas/#_level_2): matrix-vector operation\n", " - [Level 3](http://www.netlib.org/blas/#_level_3): matrix-matrix operation\n", "\n", "| Level | Example Operation | Name | Dimension | Flops | \n", "|-------|----------------------------------------|-------------|-------------------------------------------|-------|\n", "| 1 | $\\alpha \\gets \\mathbf{x}^T \\mathbf{y}$ | dot product | $\\mathbf{x}, \\mathbf{y} \\in \\mathbb{R}^n$ | $2n$ | \n", "| 1 | $\\mathbf{y} \\gets \\mathbf{y} + \\alpha \\mathbf{x}$ | axpy | $\\alpha \\in \\mathbb{R}$, $\\mathbf{x}, \\mathbf{y} \\in \\mathbb{R}^n$ | $2n$ | \n", "| 2 | $\\mathbf{y} \\gets \\mathbf{y} + \\mathbf{A} \\mathbf{x}$ | gaxpy | $\\mathbf{A} \\in \\mathbb{R}^{m \\times n}$, $\\mathbf{x} \\in \\mathbb{R}^n$, $\\mathbf{y} \\in \\mathbb{R}^m$ | $2mn$ |\n", "| 2 | $\\mathbf{A} \\gets \\mathbf{A} + \\mathbf{y} \\mathbf{x}^T$ | rank 
one update | $\\mathbf{A} \\in \\mathbb{R}^{m \\times n}$, $\\mathbf{x} \\in \\mathbb{R}^n$, $\\mathbf{y} \\in \\mathbb{R}^m$ | $2mn$ |\n", "| 3 | $\\mathbf{C} \\gets \\mathbf{C} + \\mathbf{A} \\mathbf{B}$ | matrix multiplication | $\\mathbf{A} \\in \\mathbb{R}^{m \\times p}$, $\\mathbf{B} \\in \\mathbb{R}^{p \\times n}$, $\\mathbf{C} \\in \\mathbb{R}^{m \\times n}$ | $2mnp$ |\n", "\n", "* Typical BLAS functions support single precision (S), double precision (D), complex (C), and double complex (Z). " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Examples\n", "\n", "> **The form of a mathematical expression and the way the expression should be evaluated in actual practice may be quite different.**\n", "\n", "Some operations _appear_ as level-3 but indeed are level-2. \n", "\n", "**Example 1**. A common operation in statistics is column scaling or row scaling\n", "$$\n", "\\begin{eqnarray*}\n", " \\mathbf{A} &=& \\mathbf{A} \\mathbf{D} \\quad \\text{(column scaling)} \\\\\n", " \\mathbf{A} &=& \\mathbf{D} \\mathbf{A} \\quad \\text{(row scaling)},\n", "\\end{eqnarray*}\n", "$$\n", "where $\\mathbf{D}$ is diagonal. For example, in generalized linear models (GLMs), the Fisher information matrix takes the form \n", "$$\n", "\\mathbf{X}^T \\mathbf{W} \\mathbf{X},\n", "$$\n", "where $\\mathbf{W}$ is a diagonal matrix with observation weights on diagonal. \n", "\n", " Column and row scalings are essentially level-2 operations!" 
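] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To make the GLM case concrete, here is a minimal sketch (variable names `X` and `w` are hypothetical, not from this notebook) of computing the Fisher information $\\mathbf{X}^T \\mathbf{W} \\mathbf{X}$ with only a level-2-style row scaling, never forming $\\mathbf{W}$ as a full $n \\times n$ matrix:\n", "```julia\n", "using LinearAlgebra\n", "n, p = 1000, 10\n", "X = randn(n, p)     # design matrix (hypothetical)\n", "w = rand(n)         # observation weights (hypothetical)\n", "# w .* X scales row i of X by w[i]: O(np) flops, no n-by-n matrix\n", "FIM = X' * (w .* X) # same result as X' * Diagonal(w) * X\n", "```\n", "Multiplying by a dense $n \\times n$ diagonal matrix instead would cost $O(n^2 p)$ flops and $O(n^2)$ storage." 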
] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "2000×2000 Diagonal{Float64,Array{Float64,1}}:\n", " 0.140972 ⋅ ⋅ ⋅ … ⋅ ⋅ ⋅ \n", " ⋅ 0.143596 ⋅ ⋅ ⋅ ⋅ ⋅ \n", " ⋅ ⋅ 0.612494 ⋅ ⋅ ⋅ ⋅ \n", " ⋅ ⋅ ⋅ 0.0480573 ⋅ ⋅ ⋅ \n", " ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ \n", " ⋅ ⋅ ⋅ ⋅ … ⋅ ⋅ ⋅ \n", " ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ \n", " ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ \n", " ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ \n", " ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ \n", " ⋅ ⋅ ⋅ ⋅ … ⋅ ⋅ ⋅ \n", " ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ \n", " ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ \n", " ⋮ ⋱ \n", " ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ \n", " ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ \n", " ⋅ ⋅ ⋅ ⋅ … ⋅ ⋅ ⋅ \n", " ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ \n", " ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ \n", " ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ \n", " ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ \n", " ⋅ ⋅ ⋅ ⋅ … ⋅ ⋅ ⋅ \n", " ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ \n", " ⋅ ⋅ ⋅ ⋅ 0.882415 ⋅ ⋅ \n", " ⋅ ⋅ ⋅ ⋅ ⋅ 0.450904 ⋅ \n", " ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 0.814614" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "using BenchmarkTools, LinearAlgebra, Random\n", "\n", "Random.seed!(123) # seed\n", "n = 2000\n", "A = rand(n, n) # n-by-n matrix\n", "d = rand(n) # n vector\n", "D = Diagonal(d) # diagonal matrix with d as diagonal" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "2000×2000 Array{Float64,2}:\n", " 0.140972 0.0 0.0 0.0 … 0.0 0.0 0.0 \n", " 0.0 0.143596 0.0 0.0 0.0 0.0 0.0 \n", " 0.0 0.0 0.612494 0.0 0.0 0.0 0.0 \n", " 0.0 0.0 0.0 0.0480573 0.0 0.0 0.0 \n", " 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", " 0.0 0.0 0.0 0.0 … 0.0 0.0 0.0 \n", " 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", " 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", " 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", " 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", " 0.0 0.0 0.0 0.0 … 0.0 0.0 0.0 \n", " 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", " 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", " ⋮ ⋱ \n", " 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", " 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", " 0.0 0.0 0.0 0.0 … 0.0 0.0 0.0 \n", " 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", " 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", " 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", " 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", " 0.0 0.0 0.0 0.0 … 0.0 0.0 0.0 \n", " 0.0 0.0 0.0 0.0 0.0 
0.0 0.0 \n", " 0.0 0.0 0.0 0.0 0.882415 0.0 0.0 \n", " 0.0 0.0 0.0 0.0 0.0 0.450904 0.0 \n", " 0.0 0.0 0.0 0.0 0.0 0.0 0.814614" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "Dfull = convert(Matrix, D) # convert to full matrix" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "BenchmarkTools.Trial: \n", " memory estimate: 30.52 MiB\n", " allocs estimate: 2\n", " --------------\n", " minimum time: 100.961 ms (2.61% GC)\n", " median time: 102.280 ms (2.60% GC)\n", " mean time: 103.678 ms (3.29% GC)\n", " maximum time: 138.528 ms (26.71% GC)\n", " --------------\n", " samples: 49\n", " evals/sample: 1" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# this is calling BLAS routine for matrix multiplication: O(n^3) flops\n", "# this is SLOW!\n", "@benchmark $A * $Dfull" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "BenchmarkTools.Trial: \n", " memory estimate: 30.52 MiB\n", " allocs estimate: 2\n", " --------------\n", " minimum time: 9.051 ms (3.33% GC)\n", " median time: 11.632 ms (24.18% GC)\n", " mean time: 11.911 ms (24.67% GC)\n", " maximum time: 53.310 ms (83.51% GC)\n", " --------------\n", " samples: 420\n", " evals/sample: 1" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# dispatch to special method for diagonal matrix multiplication.\n", "# columnwise scaling: O(n^2) flops\n", "@benchmark $A * $D" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "BenchmarkTools.Trial: \n", " memory estimate: 0 bytes\n", " allocs estimate: 0\n", " --------------\n", " minimum time: 5.276 ms (0.00% GC)\n", " median time: 10.364 ms (0.00% GC)\n", " mean time: 11.183 ms (0.00% GC)\n", " maximum time: 19.343 ms (0.00% GC)\n", " --------------\n", " samples: 447\n", " 
evals/sample: 1" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# in-place: avoids allocating space for the result\n", "# rmul!: compute matrix-matrix product AB, overwriting A, and return the result.\n", "@benchmark rmul!(A, D)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Note:** In R or Matlab, `diag(d)` will create a full matrix. Be cautious when using the `diag` function: do we really need a full diagonal matrix?" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "┌ Info: Recompiling stale cache file /Users/huazhou/.julia/compiled/v1.1/RCall/8GFyb.ji for RCall [6f49c342-dc21-5d91-9882-a32aef131414]\n", "└ @ Base loading.jl:1184\n" ] }, { "data": { "text/plain": [ "RObject{RealSxp}\n", " [,1] [,2] [,3] [,4] [,5]\n", "[1,] 0.1639009 0.00000000 0.0000000 0.0000000 0.000000\n", "[2,] 0.0000000 0.03372306 0.0000000 0.0000000 0.000000\n", "[3,] 0.0000000 0.00000000 0.7544716 0.0000000 0.000000\n", "[4,] 0.0000000 0.00000000 0.0000000 0.3005411 0.000000\n", "[5,] 0.0000000 0.00000000 0.0000000 0.0000000 0.144251\n" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "using RCall\n", "\n", "R\"\"\"\n", "d <- runif(5)\n", "diag(d)\n", "\"\"\"" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ ">> >> >> \n", "d =\n", "\n", " 0.8147\n", " 0.9058\n", " 0.1270\n", " 0.9134\n", " 0.6324\n", "\n" ] }, { "data": { "text/plain": [ "5×5 Array{Float64,2}:\n", " 0.814724 0.0 0.0 0.0 0.0 \n", " 0.0 0.905792 0.0 0.0 0.0 \n", " 0.0 0.0 0.126987 0.0 0.0 \n", " 0.0 0.0 0.0 0.913376 0.0 \n", " 0.0 0.0 0.0 0.0 0.632359" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "using MATLAB\n", "\n", "mat\"\"\"\n", "d = rand(5, 1)\n", "diag(d)\n", "\"\"\"" ] }, { "cell_type": "markdown", "metadata": 
{}, "source": [ "**Example 2**. Innter product between two matrices $\\mathbf{A}, \\mathbf{B} \\in \\mathbb{R}^{m \\times n}$ is often written as \n", "$$\n", " \\text{trace}(\\mathbf{A}^T \\mathbf{B}), \\text{trace}(\\mathbf{B} \\mathbf{A}^T), \\text{trace}(\\mathbf{A} \\mathbf{B}^T), \\text{ or } \\text{trace}(\\mathbf{B}^T \\mathbf{A}).\n", "$$\n", "They appear as level-3 operation (matrix multiplication with $O(m^2n)$ or $O(mn^2)$ flops)." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "BenchmarkTools.Trial: \n", " memory estimate: 30.52 MiB\n", " allocs estimate: 2\n", " --------------\n", " minimum time: 104.447 ms (0.43% GC)\n", " median time: 110.943 ms (2.73% GC)\n", " mean time: 110.926 ms (2.70% GC)\n", " maximum time: 117.164 ms (2.65% GC)\n", " --------------\n", " samples: 46\n", " evals/sample: 1" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "Random.seed!(123)\n", "n = 2000\n", "A, B = randn(n, n), randn(n, n)\n", "\n", "# slow way to evaluate this thing\n", "@benchmark tr(transpose($A) * $B)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "But $\\text{trace}(\\mathbf{A}^T \\mathbf{B}) = <\\text{vec}(\\mathbf{A}), \\text{vec}(\\mathbf{B})>$. The latter is level-2 operation with $O(mn)$ flops." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "BenchmarkTools.Trial: \n", " memory estimate: 0 bytes\n", " allocs estimate: 0\n", " --------------\n", " minimum time: 2.390 ms (0.00% GC)\n", " median time: 3.613 ms (0.00% GC)\n", " mean time: 3.618 ms (0.00% GC)\n", " maximum time: 5.299 ms (0.00% GC)\n", " --------------\n", " samples: 1375\n", " evals/sample: 1" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "@benchmark dot($A, $B)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Example 3**. 
Similarly, $\\text{diag}(\\mathbf{A}^T \\mathbf{B})$ can be calculated in $O(mn)$ flops." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "BenchmarkTools.Trial: \n", " memory estimate: 30.53 MiB\n", " allocs estimate: 3\n", " --------------\n", " minimum time: 105.656 ms (2.72% GC)\n", " median time: 110.821 ms (2.75% GC)\n", " mean time: 111.252 ms (2.69% GC)\n", " maximum time: 119.092 ms (2.59% GC)\n", " --------------\n", " samples: 45\n", " evals/sample: 1" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# slow way to evaluate this thing\n", "@benchmark diag(transpose($A) * $B)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "BenchmarkTools.Trial: \n", " memory estimate: 30.53 MiB\n", " allocs estimate: 6\n", " --------------\n", " minimum time: 9.309 ms (5.34% GC)\n", " median time: 11.412 ms (22.53% GC)\n", " mean time: 11.468 ms (22.75% GC)\n", " maximum time: 13.552 ms (34.13% GC)\n", " --------------\n", " samples: 436\n", " evals/sample: 1" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# smarter\n", "@benchmark Diagonal(vec(sum($A .* $B, dims=1)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To avoid allocating intermediate arrays altogether, we can write a double loop or apply the `dot` function column by column." 
] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "BenchmarkTools.Trial: \n", " memory estimate: 203.33 KiB\n", " allocs estimate: 4004\n", " --------------\n", " minimum time: 3.277 ms (0.00% GC)\n", " median time: 3.503 ms (0.00% GC)\n", " mean time: 3.598 ms (0.39% GC)\n", " maximum time: 6.018 ms (29.67% GC)\n", " --------------\n", " samples: 1386\n", " evals/sample: 1" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "function diag_matmul(A, B)\n", " m, n = size(A)\n", " @assert size(B) == (m, n) \"A and B should have same size\"\n", " @views d = [dot(A[:, j], B[:, j]) for j in 1:n]\n", "# d = zeros(eltype(A), n)\n", "# for j in 1:n, i in 1:m\n", "# d[j] += A[i, j] * B[i, j]\n", "# end\n", " Diagonal(d)\n", "end\n", "\n", "@benchmark diag_matmul($A, $B)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Memory hierarchy and level-3 fraction\n", "\n", "> **Key to high performance is effective use of memory hierarchy. True on all architectures.**\n", "\n", "* Flop count is not the sole determinant of algorithm efficiency. Another important factor is data movement through the memory hierarchy.\n", "\n", "\n", "\n", " \n", "\n", "\n", "\n", "* Numbers everyone should know\n", "\n", "| Operation | Time |\n", "|-------------------------------------|----------------|\n", "| L1 cache reference | 0.5 ns |\n", "| L2 cache reference | 7 ns |\n", "| Main memory reference | 100 ns |\n", "| Read 1 MB sequentially from memory | 250,000 ns |\n", "| Read 1 MB sequentially from SSD | 1,000,000 ns | \n", "| Read 1 MB sequentially from disk | 20,000,000 ns |\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", " Source: