{ "cells": [ { "cell_type": "markdown", "metadata": { "toc": "true" }, "source": [ "# Table of Contents\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Numerical linear algebra: introduction" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* Topics in numerical algebra: \n", " - BLAS \n", " - solve linear equations $\\mathbf{A} \\mathbf{x} = \\mathbf{b}$\n", " - regression computations $\\mathbf{X}^T \\mathbf{X} \\beta = \\mathbf{X}^T \\mathbf{y}$ \n", " - eigen-problems $\\mathbf{A} \\mathbf{x} = \\lambda \\mathbf{x}$ \n", " - generalized eigen-problems $\\mathbf{A} \\mathbf{x} = \\lambda \\mathbf{B} \\mathbf{x}$ \n", " - singular value decompositions $\\mathbf{A} = \\mathbf{U} \\Sigma \\mathbf{V}^T$ \n", " - iterative methods for numerical linear algebra \n", "\n", "* Except for the iterative methods, most of these numerical linear algebra tasks are implemented in the BLAS and LAPACK libraries. They form the **building blocks** of most statistical computing tasks (optimization, MCMC).\n", "\n", "* Our major **goal** (or learning objectives) is to \n", " 0. know the computational cost (flop count) of each task\n", " 0. be familiar with the BLAS and LAPACK functions (what they do) \n", " 0. do **not** re-invent wheels by implementing these dense linear algebra subroutines by yourself \n", " 0. understand the need for iterative methods \n", " 0. apply appropriate numerical algebra tools to various statistical problems \n", "\n", "* All high-level languages (R, Matlab, Julia) call BLAS and LAPACK for numerical linear algebra. \n", " - Julia offers more flexibility by exposing interfaces to many BLAS/LAPACK subroutines directly. See [documentation](https://docs.julialang.org/en/stable/stdlib/linalg/#BLAS-Functions-1)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## BLAS\n", "\n", "* BLAS stands for _basic linear algebra subprograms_. \n", "\n", "* See [netlib](http://www.netlib.org/blas/) for a complete list of standardized BLAS functions.\n", "\n", "* There are many implementations of BLAS. 
\n", " - [Netlib](http://www.netlib.org/blas/) provides a reference implementation \n", " - Matlab uses Intel's [MKL](https://software.intel.com/en-us/node/520724) (mathematical kernel libaries) \n", " - Julia uses [OpenBLAS](https://github.com/xianyi/OpenBLAS) \n", " - JuliaPro offers the option of using MKL\n", "\n", "* There are 3 levels of BLAS functions.\n", " - [Level 1](http://www.netlib.org/blas/#_level_1): vector-vector operation\n", " - [Level 2](http://www.netlib.org/blas/#_level_2): matrix-vector operation\n", " - [Level 3](http://www.netlib.org/blas/#_level_3): matrix-matrix operation\n", "\n", "| Level | Example Operation | Name | Dimension | Flops |\n", "|-------|----------------------------------------|-------------|-------------------------------------------|-------|\n", "| 1 | $\\alpha \\gets \\mathbf{x}^T \\mathbf{y}$ | dot product | $\\mathbf{x}, \\mathbf{y} \\in \\mathbb{R}^n$ | $2n$ |\n", "| | $\\mathbf{y} \\gets \\mathbf{y} + \\alpha \\mathbf{x}$ | axpy | $\\alpha \\in \\mathbb{R}$, $\\mathbf{x}, \\mathbf{y} \\in \\mathbb{R}^n$ | $2n$ |\n", "| 2 | $\\mathbf{y} \\gets \\mathbf{y} + \\mathbf{A} \\mathbf{x}$ | gaxpy | $\\mathbf{A} \\in \\mathbb{R}^{m \\times n}$, $\\mathbf{x} \\in \\mathbb{R}^n$, $\\mathbf{y} \\in \\mathbb{R}^m$ | $2mn$ |\n", "| | $\\mathbf{A} \\gets \\mathbf{A} + \\mathbf{y} \\mathbf{x}^T$ | rank one update | $\\mathbf{A} \\in \\mathbb{R}^{m \\times n}$, $\\mathbf{x} \\in \\mathbb{R}^n$, $\\mathbf{y} \\in \\mathbb{R}^m$ | $2mn$ |\n", "| 3 | $\\mathbf{C} \\gets \\mathbf{C} + \\mathbf{A} \\mathbf{B}$ | matrix multiplication | $\\mathbf{A} \\in \\mathbb{R}^{m \\times p}$, $\\mathbf{B} \\in \\mathbb{R}^{p \\times n}$, $\\mathbf{C} \\in \\mathbb{R}^{m \\times n}$ | $2mnp$ |\n", "\n", "* Typical BLAS functions support single precision (S), double precision (D), complex (C), and double complex (Z). \n", "\n", "* Some operations _appear_ as level-3 but indeed are level-2. 
\n", " - A common operation in statistics is column scaling or row scaling\n", " $$\n", " \\begin{eqnarray*}\n", " \\mathbf{A} &=& \\mathbf{A} \\mathbf{D} \\quad \\text{(column scaling)} \\\\\n", " \\mathbf{A} &=& \\mathbf{D} \\mathbf{A} \\quad \\text{(row scaling)},\n", " \\end{eqnarray*}\n", " $$\n", " where $\\mathbf{D}$ is diagonal. \n", " - These are essentially level-2 operations!" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "2000-element Array{Float64,1}:\n", " 0.763192\n", " 0.668759\n", " 0.709337\n", " 0.088416\n", " 0.289151\n", " 0.375468\n", " 0.437356\n", " 0.179474\n", " 0.122238\n", " 0.895783\n", " 0.332146\n", " 0.206425\n", " 0.747789\n", " ⋮ \n", " 0.484487\n", " 0.567484\n", " 0.75455 \n", " 0.703483\n", " 0.166205\n", " 0.754612\n", " 0.231834\n", " 0.769243\n", " 0.805681\n", " 0.553389\n", " 0.450904\n", " 0.814614" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "using BenchmarkTools\n", "\n", "srand(123) # seed\n", "n = 2000\n", "A = rand(n, n)\n", "d = rand(n) # d vector" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "2000×2000 Array{Float64,2}:\n", " 0.763192 0.0 0.0 0.0 … 0.0 0.0 0.0 \n", " 0.0 0.668759 0.0 0.0 0.0 0.0 0.0 \n", " 0.0 0.0 0.709337 0.0 0.0 0.0 0.0 \n", " 0.0 0.0 0.0 0.088416 0.0 0.0 0.0 \n", " 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", " 0.0 0.0 0.0 0.0 … 0.0 0.0 0.0 \n", " 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", " 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", " 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", " 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", " 0.0 0.0 0.0 0.0 … 0.0 0.0 0.0 \n", " 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", " 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", " ⋮ ⋱ \n", " 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", " 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", " 0.0 0.0 0.0 0.0 … 0.0 0.0 0.0 \n", " 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", " 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", " 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", " 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", 
" 0.0 0.0 0.0 0.0 … 0.0 0.0 0.0 \n", " 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", " 0.0 0.0 0.0 0.0 0.553389 0.0 0.0 \n", " 0.0 0.0 0.0 0.0 0.0 0.450904 0.0 \n", " 0.0 0.0 0.0 0.0 0.0 0.0 0.814614" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "D = diagm(d) # diagonal matrix with d as diagonal" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "BenchmarkTools.Trial: \n", " memory estimate: 30.52 MiB\n", " allocs estimate: 2\n", " --------------\n", " minimum time: 100.553 ms (0.35% GC)\n", " median time: 109.386 ms (2.32% GC)\n", " mean time: 115.684 ms (3.36% GC)\n", " maximum time: 226.789 ms (1.78% GC)\n", " --------------\n", " samples: 44\n", " evals/sample: 1" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# this is calling BLAS routine for matrix multiplication: O(n^3) flops\n", "@benchmark A * D" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "2000×2000 Diagonal{Float64}:\n", " 0.763192 ⋅ ⋅ ⋅ … ⋅ ⋅ ⋅ \n", " ⋅ 0.668759 ⋅ ⋅ ⋅ ⋅ ⋅ \n", " ⋅ ⋅ 0.709337 ⋅ ⋅ ⋅ ⋅ \n", " ⋅ ⋅ ⋅ 0.088416 ⋅ ⋅ ⋅ \n", " ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ \n", " ⋅ ⋅ ⋅ ⋅ … ⋅ ⋅ ⋅ \n", " ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ \n", " ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ \n", " ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ \n", " ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ \n", " ⋅ ⋅ ⋅ ⋅ … ⋅ ⋅ ⋅ \n", " ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ \n", " ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ \n", " ⋮ ⋱ \n", " ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ \n", " ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ \n", " ⋅ ⋅ ⋅ ⋅ … ⋅ ⋅ ⋅ \n", " ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ \n", " ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ \n", " ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ \n", " ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ \n", " ⋅ ⋅ ⋅ ⋅ … ⋅ ⋅ ⋅ \n", " ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ \n", " ⋅ ⋅ ⋅ ⋅ 0.553389 ⋅ ⋅ \n", " ⋅ ⋅ ⋅ ⋅ ⋅ 0.450904 ⋅ \n", " ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 0.814614" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "Diagonal(d)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "BenchmarkTools.Trial: \n", " memory estimate: 30.52 MiB\n", " allocs estimate: 3\n", " 
--------------\n", " minimum time: 7.149 ms (2.63% GC)\n", " median time: 8.212 ms (24.38% GC)\n", " mean time: 8.453 ms (25.21% GC)\n", " maximum time: 65.978 ms (88.55% GC)\n", " --------------\n", " samples: 591\n", " evals/sample: 1" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# correct way for columnwise scaling: O(n^2) flops\n", "@benchmark A * Diagonal(d)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "*(A::AbstractArray{T,2} where T, D::Diagonal) at linalg/diagonal.jl:152" ], "text/plain": [ "*(A::AbstractArray{T,2} where T, D::Diagonal) in Base.LinAlg at linalg/diagonal.jl:152" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "@which A * Diagonal(d)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "BenchmarkTools.Trial: \n", " memory estimate: 0 bytes\n", " allocs estimate: 0\n", " --------------\n", " minimum time: 4.479 ms (0.00% GC)\n", " median time: 11.387 ms (0.00% GC)\n", " mean time: 11.209 ms (0.00% GC)\n", " maximum time: 20.322 ms (0.00% GC)\n", " --------------\n", " samples: 447\n", " evals/sample: 1" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# in-place: avoids allocating space for the result (note: scale!(A, d) overwrites A)\n", "@benchmark scale!(A, d)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Memory hierarchy and level-3 fraction\n", "\n", "> **Key to high performance is effective use of memory hierarchy. True on all architectures.**\n", "\n", "* Flop count is not the sole determinant of algorithm efficiency. 
Another important factor is data movement through the memory hierarchy.\n", "\n", "* Numbers everyone should know\n", "\n", "| Operation | Time |\n", "|-------------------------------------|----------------|\n", "| L1 cache reference | 0.5 ns |\n", "| L2 cache reference | 7 ns |\n", "| Main memory reference | 100 ns |\n", "| Read 1 MB sequentially from memory | 250,000 ns |\n", "| Read 1 MB sequentially from SSD | 1,000,000 ns |\n", "| Read 1 MB sequentially from disk | 20,000,000 ns |\n", "\n", " Source: