{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Julia Version 1.6.2\n", "Commit 1b93d53fc4 (2021-07-14 15:36 UTC)\n", "Platform Info:\n", " OS: macOS (x86_64-apple-darwin18.7.0)\n", " CPU: Intel(R) Core(TM) i5-8279U CPU @ 2.40GHz\n", " WORD_SIZE: 64\n", " LIBM: libopenlibm\n", " LLVM: libLLVM-11.0.1 (ORCJIT, skylake)\n" ] } ], "source": [ "versioninfo()" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32m\u001b[1m Activating\u001b[22m\u001b[39m environment at `~/Dropbox/class/M1399.000200/2021/M1399_000200-2021fall/Project.toml`\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\u001b[32m\u001b[1m Status\u001b[22m\u001b[39m `~/Dropbox/class/M1399.000200/2021/M1399_000200-2021fall/Project.toml`\n", " \u001b[90m [7d9fca2a] \u001b[39mArpack v0.5.3\n", " \u001b[90m [6e4b80f9] \u001b[39mBenchmarkTools v1.1.4\n", " \u001b[90m [1e616198] \u001b[39mCOSMO v0.8.1\n", " \u001b[90m [f65535da] \u001b[39mConvex v0.14.13\n", " \u001b[90m [a93c6f00] \u001b[39mDataFrames v1.2.2\n", " \u001b[90m [31a5f54b] \u001b[39mDebugger v0.6.8\n", " \u001b[90m [31c24e10] \u001b[39mDistributions v0.24.18\n", " \u001b[90m [e2685f51] \u001b[39mECOS v0.12.3\n", " \u001b[90m [f6369f11] \u001b[39mForwardDiff v0.10.19\n", " \u001b[90m [28b8d3ca] \u001b[39mGR v0.58.1\n", " \u001b[90m [c91e804a] \u001b[39mGadfly v1.3.3\n", " \u001b[90m [bd48cda9] \u001b[39mGraphRecipes v0.5.7\n", " \u001b[90m [82e4d734] \u001b[39mImageIO v0.5.8\n", " \u001b[90m [6218d12a] \u001b[39mImageMagick v1.2.1\n", " \u001b[90m [916415d5] \u001b[39mImages v0.24.1\n", " \u001b[90m [b6b21f68] \u001b[39mIpopt v0.7.0\n", " \u001b[90m [42fd0dbc] \u001b[39mIterativeSolvers v0.9.1\n", " \u001b[90m [4076af6c] \u001b[39mJuMP v0.21.9\n", " \u001b[90m [b51810bb] \u001b[39mMatrixDepot v1.0.4\n", " \u001b[90m [1ec41992] \u001b[39mMosekTools v0.9.4\n", " \u001b[90m [76087f3c] \u001b[39mNLopt v0.6.3\n", " \u001b[90m [47be7bcc] \u001b[39mORCA v0.5.0\n", " \u001b[90m [a03496cd] \u001b[39mPlotlyBase v0.4.3\n", " \u001b[90m [f0f68f2c] \u001b[39mPlotlyJS v0.15.0\n", " \u001b[90m [91a5bcdd] \u001b[39mPlots v1.21.2\n", " \u001b[90m [438e738f] \u001b[39mPyCall v1.92.3\n", " \u001b[90m [d330b81b] \u001b[39mPyPlot v2.9.0\n", " \u001b[90m [dca85d43] \u001b[39mQuartzImageIO v0.7.3\n", " \u001b[90m [6f49c342] \u001b[39mRCall v0.13.12\n", " \u001b[90m [ce6b1742] \u001b[39mRDatasets v0.7.5\n", " \u001b[90m [c946c3f1] \u001b[39mSCS v0.7.1\n", " \u001b[90m [276daf66] \u001b[39mSpecialFunctions v1.6.1\n", " \u001b[90m [2913bbd2] \u001b[39mStatsBase v0.33.10\n", " \u001b[90m [b8865327] \u001b[39mUnicodePlots v2.0.1\n", " \u001b[90m [0f1e0344] \u001b[39mWebIO v0.8.15\n", " \u001b[90m [8f399da3] \u001b[39mLibdl\n", " \u001b[90m [2f01184e] \u001b[39mSparseArrays\n", " \u001b[90m [10745b16] \u001b[39mStatistics\n" ] } ], "source": [ "using Pkg\n", "Pkg.activate(\"../..\")\n", "Pkg.status()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Numerical linear algebra: introduction" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* Topics in numerical linear algebra: \n", " - BLAS \n", " - solve linear equations $\\mathbf{A} \\mathbf{x} = \\mathbf{b}$\n", " - regression computations $\\mathbf{X}^T \\mathbf{X} \\beta = \\mathbf{X}^T \\mathbf{y}$ \n", " - eigen-problems $\\mathbf{A} \\mathbf{x} = \\lambda \\mathbf{x}$ \n", " - generalized eigen-problems $\\mathbf{A} \\mathbf{x} = \\lambda \\mathbf{B} \\mathbf{x}$ \n", " - singular value decompositions $\\mathbf{A} = \\mathbf{U} \\Sigma \\mathbf{V}^T$ \n", " - iterative methods for numerical linear algebra \n", "\n", "* Except for the iterative methods, most of these numerical linear algebra tasks are implemented in the BLAS and LAPACK libraries. They form the **building blocks** of most statistical computing tasks (optimization, MCMC).\n", "\n", "* Our major **goal** (or learning objectives) is to \n", " 1. know the complexity (flop count) of each task\n", " 2. be familiar with the BLAS and LAPACK functions (what they do) \n", " 3. do **not** re-invent the wheel by implementing these dense linear algebra subroutines by yourself \n", " 4. understand the need for iterative methods \n", " 5. apply appropriate numerical algebra tools to various statistical problems \n", "\n", "* All high-level languages (R, Matlab, Julia) call BLAS and LAPACK for numerical linear algebra. \n", " - Julia offers more flexibility by exposing interfaces to many BLAS/LAPACK subroutines directly. See [documentation](https://docs.julialang.org/en/v1.1/stdlib/LinearAlgebra/#BLAS-Functions-1)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## BLAS\n", "\n", "* BLAS stands for _basic linear algebra subprograms_. \n", "\n", "* See [netlib](http://www.netlib.org/blas/) for a complete list of standardized BLAS functions.\n", "\n", "* There are many implementations of BLAS. \n", " - [Netlib](http://www.netlib.org/blas/) provides a reference implementation. \n", " - Matlab uses Intel's [Math Kernel Library (MKL)](https://software.intel.com/en-us/node/520724). **MKL implementation is the gold standard on market.** It is not open source but the compiled library is free for Linux and MacOS. \n", " - Julia uses [OpenBLAS](http://www.openblas.net). OpenBLAS is a fast, actively maintained fork of [GotoBLAS](https://www.tacc.utexas.edu/research-development/tacc-software/gotoblas2), which has an [interesting history](https://www.nytimes.com/2005/11/28/technology/writing-the-fastest-code-by-hand-for-fun-a-human-computer-keeps.html).\n", " - Julia can be compiled to use the MKL from the source.\n", "\n", "* There are 3 levels of BLAS functions.\n", " - [Level 1](http://www.netlib.org/blas/#_level_1): vector-vector operation\n", " - [Level 2](http://www.netlib.org/blas/#_level_2): matrix-vector operation\n", " - [Level 3](http://www.netlib.org/blas/#_level_3): matrix-matrix operation\n", "\n", "| Level | Example Operation | Name | Dimension | Flops | \n", "|-------|----------------------------------------|-------------|-------------------------------------------|-------|\n", "| 1 | $\\alpha \\gets \\mathbf{x}^T \\mathbf{y}$ | dot product | $\\mathbf{x}, \\mathbf{y} \\in \\mathbb{R}^n$ | $2n$ | \n", "| 1 | $\\mathbf{y} \\gets \\mathbf{y} + \\alpha \\mathbf{x}$ | axpy | $\\alpha \\in \\mathbb{R}$, $\\mathbf{x}, \\mathbf{y} \\in \\mathbb{R}^n$ | $2n$ | \n", "| 2 | $\\mathbf{y} \\gets \\mathbf{y} + \\mathbf{A} \\mathbf{x}$ | gaxpy | $\\mathbf{A} \\in \\mathbb{R}^{m \\times n}$, $\\mathbf{x} \\in \\mathbb{R}^n$, $\\mathbf{y} \\in \\mathbb{R}^m$ | $2mn$ |\n", "| 2 | $\\mathbf{A} \\gets \\mathbf{A} + \\mathbf{y} \\mathbf{x}^T$ | rank one update | $\\mathbf{A} \\in \\mathbb{R}^{m \\times n}$, $\\mathbf{x} \\in \\mathbb{R}^n$, $\\mathbf{y} \\in \\mathbb{R}^m$ | $2mn$ |\n", "| 3 | $\\mathbf{C} \\gets \\mathbf{C} + \\mathbf{A} \\mathbf{B}$ | matrix multiplication | $\\mathbf{A} \\in \\mathbb{R}^{m \\times p}$, $\\mathbf{B} \\in \\mathbb{R}^{p \\times n}$, $\\mathbf{C} \\in \\mathbb{R}^{m \\times n}$ | $2mnp$ |\n", "\n", "* Typical BLAS functions support single precision (S), double precision (D), complex (C), and double complex (Z). " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Examples\n", "\n", "> **The form of a mathematical expression and the way the expression should be evaluated in actual practice may be quite different.**\n", "\\\n", " -- James Gentle, *Matrix Algebra*, Springer, New York (2007).\n", "\n", "Some operations _appear_ as level-3 but indeed are level-2. \n", "\n", "### Example 1\n", "\n", "A common operation in statistics is column scaling or row scaling\n", "$$\n", "\\begin{eqnarray*}\n", " \\mathbf{A} &=& \\mathbf{A} \\mathbf{D} \\quad \\text{(column scaling)} \\\\\n", " \\mathbf{A} &=& \\mathbf{D} \\mathbf{A} \\quad \\text{(row scaling)},\n", "\\end{eqnarray*}\n", "$$\n", "where $\\mathbf{D}$ is diagonal. For example, in generalized linear models (GLMs), the Fisher information matrix takes the form \n", "$$\n", "\\mathbf{X}^T \\mathbf{W} \\mathbf{X},\n", "$$\n", "where $\\mathbf{W}$ is a diagonal matrix with observation weights on diagonal. \n", "\n", " Column and row scalings are essentially level-2 operations!" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "2000×2000 Diagonal{Float64, Vector{Float64}}:\n", " 0.140972 ⋅ ⋅ ⋅ … ⋅ ⋅ ⋅ \n", " ⋅ 0.143596 ⋅ ⋅ ⋅ ⋅ ⋅ \n", " ⋅ ⋅ 0.612494 ⋅ ⋅ ⋅ ⋅ \n", " ⋅ ⋅ ⋅ 0.0480573 ⋅ ⋅ ⋅ \n", " ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ \n", " ⋅ ⋅ ⋅ ⋅ … ⋅ ⋅ ⋅ \n", " ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ \n", " ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ \n", " ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ \n", " ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ \n", " ⋅ ⋅ ⋅ ⋅ … ⋅ ⋅ ⋅ \n", " ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ \n", " ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ \n", " ⋮ ⋱ \n", " ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ \n", " ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ \n", " ⋅ ⋅ ⋅ ⋅ … ⋅ ⋅ ⋅ \n", " ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ \n", " ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ \n", " ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ \n", " ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ \n", " ⋅ ⋅ ⋅ ⋅ … ⋅ ⋅ ⋅ \n", " ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ \n", " ⋅ ⋅ ⋅ ⋅ 0.882415 ⋅ ⋅ \n", " ⋅ ⋅ ⋅ ⋅ ⋅ 0.450904 ⋅ \n", " ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 0.814614" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "using BenchmarkTools, LinearAlgebra, Random\n", "\n", "Random.seed!(123) # seed\n", "n = 2000\n", "A = rand(n, n) # n-by-n matrix\n", "d = rand(n) # n vector\n", "D = Diagonal(d) # diagonal matrix with d as diagonal" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "2000×2000 Matrix{Float64}:\n", " 0.140972 0.0 0.0 0.0 … 0.0 0.0 0.0\n", " 0.0 0.143596 0.0 0.0 0.0 0.0 0.0\n", " 0.0 0.0 0.612494 0.0 0.0 0.0 0.0\n", " 0.0 0.0 0.0 0.0480573 0.0 0.0 0.0\n", " 0.0 0.0 0.0 0.0 0.0 0.0 0.0\n", " 0.0 0.0 0.0 0.0 … 0.0 0.0 0.0\n", " 0.0 0.0 0.0 0.0 0.0 0.0 0.0\n", " 0.0 0.0 0.0 0.0 0.0 0.0 0.0\n", " 0.0 0.0 0.0 0.0 0.0 0.0 0.0\n", " 0.0 0.0 0.0 0.0 0.0 0.0 0.0\n", " 0.0 0.0 0.0 0.0 … 0.0 0.0 0.0\n", " 0.0 0.0 0.0 0.0 0.0 0.0 0.0\n", " 0.0 0.0 0.0 0.0 0.0 0.0 0.0\n", " ⋮ ⋱ \n", " 0.0 0.0 0.0 0.0 0.0 0.0 0.0\n", " 0.0 0.0 0.0 0.0 0.0 0.0 0.0\n", " 0.0 0.0 0.0 0.0 … 0.0 0.0 0.0\n", " 0.0 0.0 0.0 0.0 0.0 0.0 0.0\n", " 0.0 0.0 0.0 0.0 0.0 0.0 0.0\n", " 0.0 0.0 0.0 0.0 0.0 0.0 0.0\n", " 0.0 0.0 0.0 0.0 0.0 0.0 0.0\n", " 0.0 0.0 0.0 0.0 … 0.0 0.0 0.0\n", " 0.0 0.0 0.0 0.0 0.0 0.0 0.0\n", " 0.0 0.0 0.0 0.0 0.882415 0.0 0.0\n", " 0.0 0.0 0.0 0.0 0.0 0.450904 0.0\n", " 0.0 0.0 0.0 0.0 0.0 0.0 0.814614" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "Dfull = convert(Matrix, D) # convert to full matrix" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "BenchmarkTools.Trial: 39 samples with 1 evaluation.\n", " Range \u001b[90m(\u001b[39m\u001b[36m\u001b[1mmin\u001b[22m\u001b[39m … \u001b[35mmax\u001b[39m\u001b[90m): \u001b[39m\u001b[36m\u001b[1m 99.756 ms\u001b[22m\u001b[39m … \u001b[35m183.883 ms\u001b[39m \u001b[90m┊\u001b[39m GC \u001b[90m(\u001b[39mmin … max\u001b[90m): \u001b[39m0.00% … 4.41%\n", " Time \u001b[90m(\u001b[39m\u001b[34m\u001b[1mmedian\u001b[22m\u001b[39m\u001b[90m): \u001b[39m\u001b[34m\u001b[1m121.387 ms \u001b[22m\u001b[39m\u001b[90m┊\u001b[39m GC \u001b[90m(\u001b[39mmedian\u001b[90m): \u001b[39m0.00%\n", " Time \u001b[90m(\u001b[39m\u001b[32m\u001b[1mmean\u001b[22m\u001b[39m ± \u001b[32mσ\u001b[39m\u001b[90m): \u001b[39m\u001b[32m\u001b[1m131.600 ms\u001b[22m\u001b[39m ± \u001b[32m 25.298 ms\u001b[39m \u001b[90m┊\u001b[39m GC \u001b[90m(\u001b[39mmean ± σ\u001b[90m): \u001b[39m1.90% ± 2.80%\n", "\n", " \u001b[39m \u001b[39m \u001b[39m▁\u001b[39m \u001b[39m \u001b[39m▁\u001b[39m▁\u001b[39m▄\u001b[39m \u001b[39m \u001b[39m▁\u001b[39m \u001b[39m \u001b[39m \u001b[39m▁\u001b[34m \u001b[39m\u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[32m \u001b[39m\u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m▁\u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m█\u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \n", " \u001b[39m▆\u001b[39m▆\u001b[39m█\u001b[39m▆\u001b[39m▆\u001b[39m█\u001b[39m█\u001b[39m█\u001b[39m▁\u001b[39m▆\u001b[39m█\u001b[39m▁\u001b[39m▁\u001b[39m▆\u001b[39m█\u001b[34m▆\u001b[39m\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▆\u001b[39m▁\u001b[39m▁\u001b[39m▆\u001b[32m▁\u001b[39m\u001b[39m▁\u001b[39m▁\u001b[39m▆\u001b[39m▆\u001b[39m▁\u001b[39m▆\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▆\u001b[39m█\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m█\u001b[39m▆\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▆\u001b[39m▁\u001b[39m▆\u001b[39m▆\u001b[39m▆\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▆\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▆\u001b[39m \u001b[39m▁\n", " 99.8 ms\u001b[90m Histogram: frequency by time\u001b[39m 184 ms \u001b[0m\u001b[1m<\u001b[22m\n", "\n", " Memory estimate\u001b[90m: \u001b[39m\u001b[33m30.52 MiB\u001b[39m, allocs estimate\u001b[90m: \u001b[39m\u001b[33m2\u001b[39m." ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# this is calling BLAS routine for matrix multiplication: O(n^3) flops\n", "# this is SLOW!\n", "@benchmark $A * $Dfull" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "BenchmarkTools.Trial: 497 samples with 1 evaluation.\n", " Range \u001b[90m(\u001b[39m\u001b[36m\u001b[1mmin\u001b[22m\u001b[39m … \u001b[35mmax\u001b[39m\u001b[90m): \u001b[39m\u001b[36m\u001b[1m 6.324 ms\u001b[22m\u001b[39m … \u001b[35m18.680 ms\u001b[39m \u001b[90m┊\u001b[39m GC \u001b[90m(\u001b[39mmin … max\u001b[90m): \u001b[39m 0.00% … 41.11%\n", " Time \u001b[90m(\u001b[39m\u001b[34m\u001b[1mmedian\u001b[22m\u001b[39m\u001b[90m): \u001b[39m\u001b[34m\u001b[1m 8.060 ms \u001b[22m\u001b[39m\u001b[90m┊\u001b[39m GC \u001b[90m(\u001b[39mmedian\u001b[90m): \u001b[39m 0.00%\n", " Time \u001b[90m(\u001b[39m\u001b[32m\u001b[1mmean\u001b[22m\u001b[39m ± \u001b[32mσ\u001b[39m\u001b[90m): \u001b[39m\u001b[32m\u001b[1m10.049 ms\u001b[22m\u001b[39m ± \u001b[32m 3.227 ms\u001b[39m \u001b[90m┊\u001b[39m GC \u001b[90m(\u001b[39mmean ± σ\u001b[90m): \u001b[39m20.63% ± 20.55%\n", "\n", " \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m▄\u001b[39m█\u001b[39m▄\u001b[39m▂\u001b[34m▁\u001b[39m\u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[32m \u001b[39m\u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \n", " \u001b[39m▄\u001b[39m▄\u001b[39m▄\u001b[39m▃\u001b[39m▇\u001b[39m█\u001b[39m█\u001b[39m█\u001b[39m█\u001b[34m█\u001b[39m\u001b[39m▅\u001b[39m▄\u001b[39m▃\u001b[39m▃\u001b[39m▁\u001b[39m▂\u001b[39m▃\u001b[39m▂\u001b[39m▂\u001b[39m▃\u001b[32m▃\u001b[39m\u001b[39m▃\u001b[39m▃\u001b[39m▃\u001b[39m▂\u001b[39m▃\u001b[39m▃\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▂\u001b[39m▁\u001b[39m▂\u001b[39m▃\u001b[39m▃\u001b[39m▂\u001b[39m▂\u001b[39m▂\u001b[39m▃\u001b[39m█\u001b[39m█\u001b[39m█\u001b[39m▅\u001b[39m▄\u001b[39m▃\u001b[39m▄\u001b[39m▃\u001b[39m▃\u001b[39m▂\u001b[39m▁\u001b[39m▂\u001b[39m▂\u001b[39m▂\u001b[39m▂\u001b[39m▃\u001b[39m▃\u001b[39m▃\u001b[39m▃\u001b[39m▂\u001b[39m \u001b[39m▃\n", " 6.32 ms\u001b[90m Histogram: frequency by time\u001b[39m 17.2 ms \u001b[0m\u001b[1m<\u001b[22m\n", "\n", " Memory estimate\u001b[90m: \u001b[39m\u001b[33m30.52 MiB\u001b[39m, allocs estimate\u001b[90m: \u001b[39m\u001b[33m4\u001b[39m." ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# dispatch to special method for diagonal matrix multiplication.\n", "# columnwise scaling: O(n^2) flops\n", "@benchmark $A * $D" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "BenchmarkTools.Trial: 372 samples with 1 evaluation.\n", " Range \u001b[90m(\u001b[39m\u001b[36m\u001b[1mmin\u001b[22m\u001b[39m … \u001b[35mmax\u001b[39m\u001b[90m): \u001b[39m\u001b[36m\u001b[1m 6.869 ms\u001b[22m\u001b[39m … \u001b[35m21.703 ms\u001b[39m \u001b[90m┊\u001b[39m GC \u001b[90m(\u001b[39mmin … max\u001b[90m): \u001b[39m0.00% … 0.00%\n", " Time \u001b[90m(\u001b[39m\u001b[34m\u001b[1mmedian\u001b[22m\u001b[39m\u001b[90m): \u001b[39m\u001b[34m\u001b[1m13.772 ms \u001b[22m\u001b[39m\u001b[90m┊\u001b[39m GC \u001b[90m(\u001b[39mmedian\u001b[90m): \u001b[39m0.00%\n", " Time \u001b[90m(\u001b[39m\u001b[32m\u001b[1mmean\u001b[22m\u001b[39m ± \u001b[32mσ\u001b[39m\u001b[90m): \u001b[39m\u001b[32m\u001b[1m13.450 ms\u001b[22m\u001b[39m ± \u001b[32m 4.026 ms\u001b[39m \u001b[90m┊\u001b[39m GC \u001b[90m(\u001b[39mmean ± σ\u001b[90m): \u001b[39m0.00% ± 0.00%\n", "\n", " \u001b[39m \u001b[39m▁\u001b[39m \u001b[39m▃\u001b[39m▂\u001b[39m▃\u001b[39m▆\u001b[39m▅\u001b[39m▂\u001b[39m \u001b[39m▅\u001b[39m \u001b[39m▄\u001b[39m \u001b[39m \u001b[39m▁\u001b[39m▁\u001b[39m \u001b[39m \u001b[39m▂\u001b[39m▂\u001b[39m \u001b[39m \u001b[39m▂\u001b[39m▁\u001b[39m \u001b[39m \u001b[39m \u001b[32m \u001b[39m\u001b[34m \u001b[39m\u001b[39m \u001b[39m▂\u001b[39m▃\u001b[39m \u001b[39m▃\u001b[39m█\u001b[39m▅\u001b[39m▃\u001b[39m \u001b[39m▄\u001b[39m▄\u001b[39m▃\u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m▃\u001b[39m▅\u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \n", " \u001b[39m▆\u001b[39m█\u001b[39m▄\u001b[39m█\u001b[39m█\u001b[39m█\u001b[39m█\u001b[39m█\u001b[39m█\u001b[39m▆\u001b[39m█\u001b[39m▅\u001b[39m█\u001b[39m▇\u001b[39m▅\u001b[39m█\u001b[39m█\u001b[39m▂\u001b[39m▆\u001b[39m█\u001b[39m█\u001b[39m▇\u001b[39m▇\u001b[39m█\u001b[39m█\u001b[39m▆\u001b[39m▃\u001b[39m▄\u001b[32m▂\u001b[39m\u001b[34m▄\u001b[39m\u001b[39m▆\u001b[39m█\u001b[39m█\u001b[39m▇\u001b[39m█\u001b[39m█\u001b[39m█\u001b[39m█\u001b[39m▆\u001b[39m█\u001b[39m█\u001b[39m█\u001b[39m▅\u001b[39m▄\u001b[39m▇\u001b[39m▆\u001b[39m▃\u001b[39m▆\u001b[39m▃\u001b[39m▅\u001b[39m▂\u001b[39m▅\u001b[39m▄\u001b[39m█\u001b[39m█\u001b[39m▆\u001b[39m▃\u001b[39m▆\u001b[39m▄\u001b[39m \u001b[39m▄\n", " 6.87 ms\u001b[90m Histogram: frequency by time\u001b[39m 20.9 ms \u001b[0m\u001b[1m<\u001b[22m\n", "\n", " Memory estimate\u001b[90m: \u001b[39m\u001b[33m96 bytes\u001b[39m, allocs estimate\u001b[90m: \u001b[39m\u001b[33m2\u001b[39m." ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# in-place: avoid allocating space for result\n", "# rmul!: compute matrix-matrix product AB, overwriting A, and return the result.\n", "@benchmark rmul!(A, D)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Note:** In R, `diag(d)` will create a full (dense) matrix. Be cautious using `diag` function: do we really need a full diagonal matrix?" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "RObject{RealSxp}\n", " [,1] [,2] [,3] [,4] [,5]\n", "[1,] 0.9028847 0.0000000 0.0000000 0.0000000 0.00000000\n", "[2,] 0.0000000 0.2051186 0.0000000 0.0000000 0.00000000\n", "[3,] 0.0000000 0.0000000 0.8915648 0.0000000 0.00000000\n", "[4,] 0.0000000 0.0000000 0.0000000 0.4390209 0.00000000\n", "[5,] 0.0000000 0.0000000 0.0000000 0.0000000 0.05749282\n" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "using RCall\n", "\n", "R\"\"\"\n", "d <- runif(5)\n", "diag(d)\n", "\"\"\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Example 2\n", "\n", "Inner product between two matrices $\\mathbf{A}, \\mathbf{B} \\in \\mathbb{R}^{m \\times n}$ is often written as \n", "\n", "$$\n", " \\langle \\mathbf{A}, \\mathbf{B} \\rangle \n", " = \\sum_{i,j} \\mathbf{A}_{ij}\\mathbf{B}_{ij}\n", " = \\text{trace}(\\mathbf{A}^T \\mathbf{B}) \n", " = \\text{trace}(\\mathbf{B} \\mathbf{A}^T) \n", " = \\text{trace}(\\mathbf{A} \\mathbf{B}^T)\n", " = \\text{trace}(\\mathbf{B}^T \\mathbf{A}).\n", "$$\n", "\n", "They appear as level-3 operation (matrix multiplication with $O(m^2n)$ or $O(mn^2)$ flops)." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "BenchmarkTools.Trial: 32 samples with 1 evaluation.\n", " Range \u001b[90m(\u001b[39m\u001b[36m\u001b[1mmin\u001b[22m\u001b[39m … \u001b[35mmax\u001b[39m\u001b[90m): \u001b[39m\u001b[36m\u001b[1m104.761 ms\u001b[22m\u001b[39m … \u001b[35m210.264 ms\u001b[39m \u001b[90m┊\u001b[39m GC \u001b[90m(\u001b[39mmin … max\u001b[90m): \u001b[39m0.00% … 6.70%\n", " Time \u001b[90m(\u001b[39m\u001b[34m\u001b[1mmedian\u001b[22m\u001b[39m\u001b[90m): \u001b[39m\u001b[34m\u001b[1m174.214 ms \u001b[22m\u001b[39m\u001b[90m┊\u001b[39m GC \u001b[90m(\u001b[39mmedian\u001b[90m): \u001b[39m0.00%\n", " Time \u001b[90m(\u001b[39m\u001b[32m\u001b[1mmean\u001b[22m\u001b[39m ± \u001b[32mσ\u001b[39m\u001b[90m): \u001b[39m\u001b[32m\u001b[1m158.338 ms\u001b[22m\u001b[39m ± \u001b[32m 36.217 ms\u001b[39m \u001b[90m┊\u001b[39m GC \u001b[90m(\u001b[39mmean ± σ\u001b[90m): \u001b[39m1.80% ± 3.03%\n", "\n", " \u001b[39m▄\u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m▁\u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[32m \u001b[39m\u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m▁\u001b[34m \u001b[39m\u001b[39m█\u001b[39m \u001b[39m \u001b[39m \u001b[39m▁\u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m▁\u001b[39m▁\u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \n", " \u001b[39m█\u001b[39m▆\u001b[39m▆\u001b[39m▁\u001b[39m▆\u001b[39m▆\u001b[39m▁\u001b[39m▆\u001b[39m█\u001b[39m▆\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▆\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[32m▁\u001b[39m\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▆\u001b[39m▁\u001b[39m▁\u001b[39m▆\u001b[39m█\u001b[34m▆\u001b[39m\u001b[39m█\u001b[39m▁\u001b[39m▁\u001b[39m▆\u001b[39m█\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m█\u001b[39m█\u001b[39m▆\u001b[39m▁\u001b[39m▆\u001b[39m▁\u001b[39m▁\u001b[39m▆\u001b[39m▁\u001b[39m▁\u001b[39m▆\u001b[39m \u001b[39m▁\n", " 105 ms\u001b[90m Histogram: frequency by time\u001b[39m 210 ms \u001b[0m\u001b[1m<\u001b[22m\n", "\n", " Memory estimate\u001b[90m: \u001b[39m\u001b[33m30.52 MiB\u001b[39m, allocs estimate\u001b[90m: \u001b[39m\u001b[33m2\u001b[39m." ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "Random.seed!(123)\n", "n = 2000\n", "A, B = randn(n, n), randn(n, n)\n", "\n", "# slow way to evaluate this thing\n", "@benchmark tr(transpose($A) * $B)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "But $\\langle \\mathbf{A}, \\mathbf{B} \\rangle = \\langle \\text{vec}(\\mathbf{A}), \\text{vec}(\\mathbf{B}) \\rangle$. The latter is level-2 operation with $O(mn)$ flops." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "BenchmarkTools.Trial: 3891 samples with 1 evaluation.\n", " Range \u001b[90m(\u001b[39m\u001b[36m\u001b[1mmin\u001b[22m\u001b[39m … \u001b[35mmax\u001b[39m\u001b[90m): \u001b[39m\u001b[36m\u001b[1m1.213 ms\u001b[22m\u001b[39m … \u001b[35m 3.049 ms\u001b[39m \u001b[90m┊\u001b[39m GC \u001b[90m(\u001b[39mmin … max\u001b[90m): \u001b[39m0.00% … 0.00%\n", " Time \u001b[90m(\u001b[39m\u001b[34m\u001b[1mmedian\u001b[22m\u001b[39m\u001b[90m): \u001b[39m\u001b[34m\u001b[1m1.239 ms \u001b[22m\u001b[39m\u001b[90m┊\u001b[39m GC \u001b[90m(\u001b[39mmedian\u001b[90m): \u001b[39m0.00%\n", " Time \u001b[90m(\u001b[39m\u001b[32m\u001b[1mmean\u001b[22m\u001b[39m ± \u001b[32mσ\u001b[39m\u001b[90m): \u001b[39m\u001b[32m\u001b[1m1.271 ms\u001b[22m\u001b[39m ± \u001b[32m78.007 μs\u001b[39m \u001b[90m┊\u001b[39m GC \u001b[90m(\u001b[39mmean ± σ\u001b[90m): \u001b[39m0.00% ± 0.00%\n", "\n", " \u001b[39m▅\u001b[39m█\u001b[39m█\u001b[39m▇\u001b[39m▆\u001b[34m▅\u001b[39m\u001b[39m▅\u001b[39m▄\u001b[39m▄\u001b[39m▄\u001b[39m▄\u001b[32m▄\u001b[39m\u001b[39m▃\u001b[39m▃\u001b[39m▃\u001b[39m▂\u001b[39m▂\u001b[39m▂\u001b[39m▂\u001b[39m▁\u001b[39m▂\u001b[39m▁\u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m▂\n", " \u001b[39m█\u001b[39m█\u001b[39m█\u001b[39m█\u001b[39m█\u001b[34m█\u001b[39m\u001b[39m█\u001b[39m█\u001b[39m█\u001b[39m█\u001b[39m█\u001b[32m█\u001b[39m\u001b[39m█\u001b[39m█\u001b[39m█\u001b[39m█\u001b[39m█\u001b[39m█\u001b[39m█\u001b[39m█\u001b[39m█\u001b[39m█\u001b[39m█\u001b[39m█\u001b[39m▆\u001b[39m█\u001b[39m▇\u001b[39m▆\u001b[39m▇\u001b[39m▆\u001b[39m█\u001b[39m▇\u001b[39m▆\u001b[39m▇\u001b[39m▆\u001b[39m▇\u001b[39m▇\u001b[39m█\u001b[39m█\u001b[39m█\u001b[39m█\u001b[39m█\u001b[39m█\u001b[39m█\u001b[39m█\u001b[39m▆\u001b[39m█\u001b[39m█\u001b[39m▇\u001b[39m▆\u001b[39m█\u001b[39m▇\u001b[39m▇\u001b[39m▇\u001b[39m▆\u001b[39m▆\u001b[39m▇\u001b[39m▆\u001b[39m \u001b[39m█\n", " 1.21 ms\u001b[90m \u001b[39m\u001b[90mHistogram: \u001b[39m\u001b[90m\u001b[1mlog(\u001b[22m\u001b[39m\u001b[90mfrequency\u001b[39m\u001b[90m\u001b[1m)\u001b[22m\u001b[39m\u001b[90m by time\u001b[39m 1.52 ms \u001b[0m\u001b[1m<\u001b[22m\n", "\n", " Memory estimate\u001b[90m: \u001b[39m\u001b[33m0 bytes\u001b[39m, allocs estimate\u001b[90m: \u001b[39m\u001b[33m0\u001b[39m." ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "@benchmark dot($A, $B)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Example 3\n", "\n", "Similarly $\\text{diag}(\\mathbf{A}^T \\mathbf{B})=\\text{diag}(\\sum_{j}\\mathbf{A}_{ij}\\mathbf{B}_{ij})$ can be calculated in $O(mn)$ flops." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "BenchmarkTools.Trial: 38 samples with 1 evaluation.\n", " Range \u001b[90m(\u001b[39m\u001b[36m\u001b[1mmin\u001b[22m\u001b[39m … \u001b[35mmax\u001b[39m\u001b[90m): \u001b[39m\u001b[36m\u001b[1m105.100 ms\u001b[22m\u001b[39m … \u001b[35m205.494 ms\u001b[39m \u001b[90m┊\u001b[39m GC \u001b[90m(\u001b[39mmin … max\u001b[90m): \u001b[39m0.00% … 6.87%\n", " Time \u001b[90m(\u001b[39m\u001b[34m\u001b[1mmedian\u001b[22m\u001b[39m\u001b[90m): \u001b[39m\u001b[34m\u001b[1m127.099 ms \u001b[22m\u001b[39m\u001b[90m┊\u001b[39m GC \u001b[90m(\u001b[39mmedian\u001b[90m): \u001b[39m0.00%\n", " Time \u001b[90m(\u001b[39m\u001b[32m\u001b[1mmean\u001b[22m\u001b[39m ± \u001b[32mσ\u001b[39m\u001b[90m): \u001b[39m\u001b[32m\u001b[1m133.460 ms\u001b[22m\u001b[39m ± \u001b[32m 26.864 ms\u001b[39m \u001b[90m┊\u001b[39m GC \u001b[90m(\u001b[39mmean ± σ\u001b[90m): \u001b[39m2.20% ± 3.58%\n", "\n", " \u001b[39m█\u001b[39m \u001b[39m \u001b[39m \u001b[39m█\u001b[39m▄\u001b[39m▁\u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m▁\u001b[34m \u001b[39m\u001b[39m \u001b[39m▄\u001b[39m \u001b[32m \u001b[39m\u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m▁\u001b[39m \u001b[39m \u001b[39m \u001b[39m▁\u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \n", " \u001b[39m█\u001b[39m▆\u001b[39m▆\u001b[39m▆\u001b[39m█\u001b[39m█\u001b[39m█\u001b[39m▁\u001b[39m▆\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m█\u001b[34m▆\u001b[39m\u001b[39m▆\u001b[39m█\u001b[39m▁\u001b[32m▁\u001b[39m\u001b[39m▁\u001b[39m▁\u001b[39m▆\u001b[39m▆\u001b[39m▆\u001b[39m▆\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m█\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m█\u001b[39m▆\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▆\u001b[39m▆\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▆\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▆\u001b[39m▁\u001b[39m▁\u001b[39m▆\u001b[39m \u001b[39m▁\n", " 105 ms\u001b[90m Histogram: frequency by time\u001b[39m 205 ms \u001b[0m\u001b[1m<\u001b[22m\n", "\n", " Memory estimate\u001b[90m: \u001b[39m\u001b[33m30.53 MiB\u001b[39m, allocs estimate\u001b[90m: \u001b[39m\u001b[33m3\u001b[39m." ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# slow way to evaluate this thing\n", "@benchmark diag(transpose($A) * $B)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "BenchmarkTools.Trial: 399 samples with 1 evaluation.\n", " Range \u001b[90m(\u001b[39m\u001b[36m\u001b[1mmin\u001b[22m\u001b[39m … \u001b[35mmax\u001b[39m\u001b[90m): \u001b[39m\u001b[36m\u001b[1m 7.757 ms\u001b[22m\u001b[39m … \u001b[35m24.467 ms\u001b[39m \u001b[90m┊\u001b[39m GC \u001b[90m(\u001b[39mmin … max\u001b[90m): \u001b[39m 0.00% … 49.34%\n", " Time \u001b[90m(\u001b[39m\u001b[34m\u001b[1mmedian\u001b[22m\u001b[39m\u001b[90m): \u001b[39m\u001b[34m\u001b[1m10.987 ms \u001b[22m\u001b[39m\u001b[90m┊\u001b[39m GC \u001b[90m(\u001b[39mmedian\u001b[90m): \u001b[39m 0.00%\n", " Time \u001b[90m(\u001b[39m\u001b[32m\u001b[1mmean\u001b[22m\u001b[39m ± \u001b[32mσ\u001b[39m\u001b[90m): \u001b[39m\u001b[32m\u001b[1m12.525 ms\u001b[22m\u001b[39m ± \u001b[32m 4.776 ms\u001b[39m \u001b[90m┊\u001b[39m GC \u001b[90m(\u001b[39mmean ± σ\u001b[90m): \u001b[39m19.39% ± 21.20%\n", "\n", " \u001b[39m \u001b[39m█\u001b[39m▄\u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[34m \u001b[39m\u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[32m \u001b[39m\u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \n", " \u001b[39m▅\u001b[39m█\u001b[39m█\u001b[39m▄\u001b[39m▄\u001b[39m▄\u001b[39m▃\u001b[39m▃\u001b[39m▃\u001b[39m▄\u001b[39m▃\u001b[39m▄\u001b[34m▄\u001b[39m\u001b[39m▃\u001b[39m▃\u001b[39m▄\u001b[39m▂\u001b[39m▄\u001b[32m▆\u001b[39m\u001b[39m▄\u001b[39m▄\u001b[39m▄\u001b[39m▂\u001b[39m▃\u001b[39m▂\u001b[39m▃\u001b[39m▂\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▂\u001b[39m▄\u001b[39m▅\u001b[39m▄\u001b[39m▃\u001b[39m▂\u001b[39m▂\u001b[39m▂\u001b[39m▃\u001b[39m▁\u001b[39m▁\u001b[39m▂\u001b[39m▃\u001b[39m▃\u001b[39m▂\u001b[39m▃\u001b[39m▃\u001b[39m▂\u001b[39m▁\u001b[39m▃\u001b[39m▂\u001b[39m▃\u001b[39m▃\u001b[39m▃\u001b[39m \u001b[39m▃\n", " 7.76 ms\u001b[90m Histogram: frequency by time\u001b[39m 23.7 ms \u001b[0m\u001b[1m<\u001b[22m\n", "\n", " Memory estimate\u001b[90m: \u001b[39m\u001b[33m30.53 MiB\u001b[39m, allocs estimate\u001b[90m: \u001b[39m\u001b[33m5\u001b[39m." ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# smarter\n", "@benchmark Diagonal(vec(sum($A .* $B, dims=1)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To get rid of allocation of intermediate array at all, we can just write a double loop or use `dot` function." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "BenchmarkTools.Trial: 1892 samples with 1 evaluation.\n", " Range \u001b[90m(\u001b[39m\u001b[36m\u001b[1mmin\u001b[22m\u001b[39m … \u001b[35mmax\u001b[39m\u001b[90m): \u001b[39m\u001b[36m\u001b[1m2.146 ms\u001b[22m\u001b[39m … \u001b[35m 5.498 ms\u001b[39m \u001b[90m┊\u001b[39m GC \u001b[90m(\u001b[39mmin … max\u001b[90m): \u001b[39m0.00% … 0.00%\n", " Time \u001b[90m(\u001b[39m\u001b[34m\u001b[1mmedian\u001b[22m\u001b[39m\u001b[90m): \u001b[39m\u001b[34m\u001b[1m2.453 ms \u001b[22m\u001b[39m\u001b[90m┊\u001b[39m GC \u001b[90m(\u001b[39mmedian\u001b[90m): \u001b[39m0.00%\n", " Time \u001b[90m(\u001b[39m\u001b[32m\u001b[1mmean\u001b[22m\u001b[39m ± \u001b[32mσ\u001b[39m\u001b[90m): \u001b[39m\u001b[32m\u001b[1m2.629 ms\u001b[22m\u001b[39m ± \u001b[32m447.989 μs\u001b[39m \u001b[90m┊\u001b[39m GC \u001b[90m(\u001b[39mmean ± σ\u001b[90m): \u001b[39m0.00% ± 0.00%\n", "\n", " \u001b[39m \u001b[39m▁\u001b[39m▃\u001b[39m▃\u001b[39m▃\u001b[39m▃\u001b[39m█\u001b[39m█\u001b[34m▆\u001b[39m\u001b[39m▅\u001b[39m▂\u001b[39m▂\u001b[32m▂\u001b[39m\u001b[39m▂\u001b[39m▁\u001b[39m▂\u001b[39m▁\u001b[39m▁\u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m▁\u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \n", " \u001b[39m▆\u001b[39m█\u001b[39m█\u001b[39m█\u001b[39m█\u001b[39m█\u001b[39m█\u001b[39m█\u001b[34m█\u001b[39m\u001b[39m█\u001b[39m█\u001b[39m█\u001b[32m█\u001b[39m\u001b[39m█\u001b[39m█\u001b[39m█\u001b[39m█\u001b[39m█\u001b[39m█\u001b[39m█\u001b[39m▇\u001b[39m█\u001b[39m█\u001b[39m█\u001b[39m█\u001b[39m▇\u001b[39m█\u001b[39m▆\u001b[39m▇\u001b[39m▇\u001b[39m▇\u001b[39m█\u001b[39m▇\u001b[39m▆\u001b[39m▅\u001b[39m▆\u001b[39m▅\u001b[39m▇\u001b[39m▆\u001b[39m▆\u001b[39m▆\u001b[39m▃\u001b[39m▄\u001b[39m▆\u001b[39m▄\u001b[39m▅\u001b[39m▄\u001b[39m▃\u001b[39m▅\u001b[39m▃\u001b[39m▄\u001b[39m▆\u001b[39m▃\u001b[39m▄\u001b[39m▃\u001b[39m▄\u001b[39m▃\u001b[39m▃\u001b[39m▅\u001b[39m \u001b[39m█\n", " 2.15 ms\u001b[90m \u001b[39m\u001b[90mHistogram: \u001b[39m\u001b[90m\u001b[1mlog(\u001b[22m\u001b[39m\u001b[90mfrequency\u001b[39m\u001b[90m\u001b[1m)\u001b[22m\u001b[39m\u001b[90m by time\u001b[39m 4.5 ms \u001b[0m\u001b[1m<\u001b[22m\n", "\n", " Memory estimate\u001b[90m: \u001b[39m\u001b[33m15.75 KiB\u001b[39m, allocs estimate\u001b[90m: \u001b[39m\u001b[33m1\u001b[39m." ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "function diag_matmul(A, B)\n", " m, n = size(A)\n", " @assert size(B) == (m, n) \"A and B should have same size\"\n", " # @views: Convert every array-slicing operation in the given expression to return a view.\n", " # See below for avoiding memory allocation\n", " @views d = [dot(A[:, j], B[:, j]) for j in 1:n]\n", "# d = zeros(eltype(A), n)\n", "# for j in 1:n, i in 1:m\n", "# d[j] += A[i, j] * B[i, j]\n", "# end\n", " Diagonal(d)\n", "end\n", "\n", "@benchmark diag_matmul($A, $B)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Memory hierarchy and level-3 fraction\n", "\n", "> **Key to high performance is effective use of memory hierarchy. True on all architectures.**\n", "\n", "* Flop count is not the sole determinant of algorithm efficiency. Another important factor is data movement through the memory hierarchy.\n", "\n", "\n", "\n", " \n", "\n", "\n", "\n", "* Numbers everyone should know\n", "\n", "| Operation | Time |\n", "|-------------------------------------|----------------|\n", "| L1 cache reference | 0.5 ns |\n", "| L2 cache reference | 7 ns |\n", "| Main memory reference | 100 ns |\n", "| Read 1 MB sequentially from memory | 250,000 ns |\n", "| Read 1 MB sequentially from SSD | 1,000,000 ns | \n", "| Read 1 MB sequentially from disk | 20,000,000 ns |\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", " Source: \n", "\n", "* For example, Xeon X5650 CPU has a theoretical throughput of 128 DP GFLOPS but a max memory bandwidth of 32GB/s. \n", "\n", "* Can we keep CPU cores busy with enough deliveries of matrix data and ship the results to memory fast enough to avoid backlog? \n", "Answer: use **high-level BLAS** as much as possible.\n", "\n", "| BLAS | Dimension | Mem. Refs. | Flops | Ratio |\n", "|--------------------------------|------------------------------------------------------------|------------|--------|-------|\n", "| Level 1: $\\mathbf{y} \\gets \\mathbf{y} + \\alpha \\mathbf{x}$ | $\\mathbf{x}, \\mathbf{y} \\in \\mathbb{R}^n$ | $3n$ | $2n$ | 3:2 |\n", "| Level 2: $\\mathbf{y} \\gets \\mathbf{y} + \\mathbf{A} \\mathbf{x}$ | $\\mathbf{x}, \\mathbf{y} \\in \\mathbb{R}^n$, $\\mathbf{A} \\in \\mathbb{R}^{n \\times n}$ | $n^2$ | $2n^2$ | 1:2 |\n", "| Level 3: $\\mathbf{C} \\gets \\mathbf{C} + \\mathbf{A} \\mathbf{B}$ | $\\mathbf{A}, \\mathbf{B}, \\mathbf{C} \\in\\mathbb{R}^{n \\times n}$ | $4n^2$ | $2n^3$ | 2:n | \n", "\n", "* Higher level BLAS (3 or 2) make more effective use of arithmetic logic units (ALU) by keeping them busy. \n", "See [Dongarra slides](https://www.samsi.info/wp-content/uploads/2017/02/SAMSI-0217_Dongarra.pdf).\n", "\n", "\n", "\n", "* A distinction between LAPACK and LINPACK (older version of R uses LINPACK) is that LAPACK makes use of higher level BLAS as much as possible (usually by smart partitioning) to increase the so-called **level-3 fraction**.\n", "\n", "* To appreciate the efforts in an optimized BLAS implementation such as OpenBLAS (evolved from GotoBLAS), see this [video](https://youtu.be/JzNpKDW07rw). The bottom line is \n", "\n", "> **Get familiar with (good implementations of) BLAS/LAPACK and use them as much as possible.**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Effect of data layout\n", "\n", "* Data layout in memory affects algorithmic efficiency too. It is much faster to move chunks of data in memory than retrieving/writing scattered data.\n", "\n", "* Storage mode: **column-major** (Fortran, Matlab, R, Julia) vs **row-major** (C/C++).\n", "\n", "* **Cache line** is the minimum amount of cache which can be loaded and stored to memory.\n", " - x86 CPUs: 64 bytes \n", " - ARM CPUs: 32 bytes\n", "\n", "\n", "\n", "Source: \n", "\n", "* Accessing *column-major* stored matrix by *rows* ($ij$ looping) causes lots of **cache misses**.\n", "\n", "* Take matrix multiplication as an example \n", "$$ \n", "\\mathbf{C} \\gets \\mathbf{C} + \\mathbf{A} \\mathbf{B}, \\quad \\mathbf{A} \\in \\mathbb{R}^{m \\times p}, \\mathbf{B} \\in \\mathbb{R}^{p \\times n}, \\mathbf{C} \\in \\mathbb{R}^{m \\times n}.\n", "$$\n", "Assume the storage is column-major, such as in Julia. There are 6 variants of the algorithms according to the order in the triple loops. \n", " - `jki` or `kji` looping:\n", " ```julia\n", " # inner most loop\n", " for i = 1:m\n", " C[i, j] = C[i, j] + A[i, k] * B[k, j]\n", " end\n", " ``` \n", " - `ikj` or `kij` looping:\n", " ```julia\n", " # inner most loop \n", " for j = 1:n\n", " C[i, j] = C[i, j] + A[i, k] * B[k, j]\n", " end\n", " ``` \n", " - `ijk` or `jik` looping:\n", " ```julia\n", " # inner most loop \n", " for k = 1:p\n", " C[i, j] = C[i, j] + A[i, k] * B[k, j]\n", " end\n", " ```\n", "* We pay attention to the innermost loop, where the vector calculation occurs. The associated **stride** when accessing the three matrices in memory (assuming column-major storage) is \n", "\n", "| Variant | A Stride | B Stride | C Stride |\n", "|----------------|----------|----------|----------|\n", "| `jki` or `kji` | Unit | 0 | Unit |\n", "| `ikj` or `kij` | 0 | Non-Unit | Non-Unit |\n", "| `ijk` or `jik` | Non-Unit | Unit | 0 | \n", "\n", "* Apparently the variants `jki` or `kji` are preferred." ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "\"\"\"\n", " matmul_by_loop!(A, B, C, order)\n", "\n", "Overwrite `C` by `A * B`. `order` indicates the looping order for triple loop.\n", "\"\"\"\n", "function matmul_by_loop!(A::Matrix, B::Matrix, C::Matrix, order::String)\n", " \n", " m = size(A, 1)\n", " p = size(A, 2)\n", " n = size(B, 2)\n", " fill!(C, 0)\n", " \n", " if order == \"jki\"\n", " for j = 1:n, k = 1:p, i = 1:m\n", " C[i, j] += A[i, k] * B[k, j]\n", " end\n", " end\n", "\n", " if order == \"kji\"\n", " for k = 1:p, j = 1:n, i = 1:m\n", " C[i, j] += A[i, k] * B[k, j]\n", " end\n", " end\n", " \n", " if order == \"ikj\"\n", " for i = 1:m, k = 1:p, j = 1:n\n", " C[i, j] += A[i, k] * B[k, j]\n", " end\n", " end\n", "\n", " if order == \"kij\"\n", " for k = 1:p, i = 1:m, j = 1:n\n", " C[i, j] += A[i, k] * B[k, j]\n", " end\n", " end\n", " \n", " if order == \"ijk\"\n", " for i = 1:m, j = 1:n, k = 1:p\n", " C[i, j] += A[i, k] * B[k, j]\n", " end\n", " end\n", " \n", " if order == \"jik\"\n", " for j = 1:n, i = 1:m, k = 1:p\n", " C[i, j] += A[i, k] * B[k, j]\n", " end\n", " end\n", " \n", "end\n", "\n", "using Random\n", "\n", "Random.seed!(123)\n", "m, p, n = 2000, 100, 2000\n", "A = rand(m, p)\n", "B = rand(p, n)\n", "C = zeros(m, n);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* `jki` and `kji` looping:" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "BenchmarkTools.Trial: 12 samples with 1 evaluation.\n", " Range \u001b[90m(\u001b[39m\u001b[36m\u001b[1mmin\u001b[22m\u001b[39m … \u001b[35mmax\u001b[39m\u001b[90m): \u001b[39m\u001b[36m\u001b[1m409.931 ms\u001b[22m\u001b[39m … \u001b[35m548.997 ms\u001b[39m \u001b[90m┊\u001b[39m GC \u001b[90m(\u001b[39mmin … max\u001b[90m): \u001b[39m0.00% … 0.00%\n", " Time \u001b[90m(\u001b[39m\u001b[34m\u001b[1mmedian\u001b[22m\u001b[39m\u001b[90m): \u001b[39m\u001b[34m\u001b[1m425.141 ms \u001b[22m\u001b[39m\u001b[90m┊\u001b[39m GC \u001b[90m(\u001b[39mmedian\u001b[90m): \u001b[39m0.00%\n", " Time \u001b[90m(\u001b[39m\u001b[32m\u001b[1mmean\u001b[22m\u001b[39m ± \u001b[32mσ\u001b[39m\u001b[90m): \u001b[39m\u001b[32m\u001b[1m444.185 ms\u001b[22m\u001b[39m ± \u001b[32m 46.687 ms\u001b[39m \u001b[90m┊\u001b[39m GC \u001b[90m(\u001b[39mmean ± σ\u001b[90m): \u001b[39m0.00% ± 0.00%\n", "\n", " \u001b[39m█\u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[34m \u001b[39m\u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[32m \u001b[39m\u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \n", " \u001b[39m█\u001b[39m▆\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▆\u001b[34m▁\u001b[39m\u001b[39m▆\u001b[39m▆\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[32m▁\u001b[39m\u001b[39m▁\u001b[39m▁\u001b[39m▆\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▆\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▆\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▆\u001b[39m \u001b[39m▁\n", " 410 ms\u001b[90m Histogram: frequency by time\u001b[39m 549 ms \u001b[0m\u001b[1m<\u001b[22m\n", "\n", " Memory estimate\u001b[90m: \u001b[39m\u001b[33m0 bytes\u001b[39m, allocs estimate\u001b[90m: \u001b[39m\u001b[33m0\u001b[39m." ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "using BenchmarkTools\n", "\n", "@benchmark matmul_by_loop!($A, $B, $C, \"jki\")" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "BenchmarkTools.Trial: 8 samples with 1 evaluation.\n", " Range \u001b[90m(\u001b[39m\u001b[36m\u001b[1mmin\u001b[22m\u001b[39m … \u001b[35mmax\u001b[39m\u001b[90m): \u001b[39m\u001b[36m\u001b[1m674.248 ms\u001b[22m\u001b[39m … \u001b[35m729.114 ms\u001b[39m \u001b[90m┊\u001b[39m GC \u001b[90m(\u001b[39mmin … max\u001b[90m): \u001b[39m0.00% … 0.00%\n", " Time \u001b[90m(\u001b[39m\u001b[34m\u001b[1mmedian\u001b[22m\u001b[39m\u001b[90m): \u001b[39m\u001b[34m\u001b[1m714.277 ms \u001b[22m\u001b[39m\u001b[90m┊\u001b[39m GC \u001b[90m(\u001b[39mmedian\u001b[90m): \u001b[39m0.00%\n", " Time \u001b[90m(\u001b[39m\u001b[32m\u001b[1mmean\u001b[22m\u001b[39m ± \u001b[32mσ\u001b[39m\u001b[90m): \u001b[39m\u001b[32m\u001b[1m712.280 ms\u001b[22m\u001b[39m ± \u001b[32m 16.689 ms\u001b[39m \u001b[90m┊\u001b[39m GC \u001b[90m(\u001b[39mmean ± σ\u001b[90m): \u001b[39m0.00% ± 0.00%\n", "\n", " \u001b[39m▁\u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m▁\u001b[39m \u001b[32m▁\u001b[39m\u001b[39m \u001b[34m█\u001b[39m\u001b[39m \u001b[39m \u001b[39m▁\u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m▁\u001b[39m \u001b[39m \u001b[39m▁\u001b[39m \u001b[39m \n", " \u001b[39m█\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m█\u001b[39m▁\u001b[32m█\u001b[39m\u001b[39m▁\u001b[34m█\u001b[39m\u001b[39m▁\u001b[39m▁\u001b[39m█\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m█\u001b[39m▁\u001b[39m▁\u001b[39m█\u001b[39m \u001b[39m▁\n", " 674 ms\u001b[90m Histogram: frequency by time\u001b[39m 729 ms \u001b[0m\u001b[1m<\u001b[22m\n", "\n", " Memory estimate\u001b[90m: \u001b[39m\u001b[33m0 bytes\u001b[39m, allocs estimate\u001b[90m: \u001b[39m\u001b[33m0\u001b[39m." ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "@benchmark matmul_by_loop!($A, $B, $C, \"kji\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* `ikj` and `kij` looping:" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "BenchmarkTools.Trial: 3 samples with 1 evaluation.\n", " Range \u001b[90m(\u001b[39m\u001b[36m\u001b[1mmin\u001b[22m\u001b[39m … \u001b[35mmax\u001b[39m\u001b[90m): \u001b[39m\u001b[36m\u001b[1m2.321 s\u001b[22m\u001b[39m … \u001b[35m 2.386 s\u001b[39m \u001b[90m┊\u001b[39m GC \u001b[90m(\u001b[39mmin … max\u001b[90m): \u001b[39m0.00% … 0.00%\n", " Time \u001b[90m(\u001b[39m\u001b[34m\u001b[1mmedian\u001b[22m\u001b[39m\u001b[90m): \u001b[39m\u001b[34m\u001b[1m2.375 s \u001b[22m\u001b[39m\u001b[90m┊\u001b[39m GC \u001b[90m(\u001b[39mmedian\u001b[90m): \u001b[39m0.00%\n", " Time \u001b[90m(\u001b[39m\u001b[32m\u001b[1mmean\u001b[22m\u001b[39m ± \u001b[32mσ\u001b[39m\u001b[90m): \u001b[39m\u001b[32m\u001b[1m2.361 s\u001b[22m\u001b[39m ± \u001b[32m34.822 ms\u001b[39m \u001b[90m┊\u001b[39m GC \u001b[90m(\u001b[39mmean ± σ\u001b[90m): \u001b[39m0.00% ± 0.00%\n", "\n", " \u001b[34m█\u001b[39m\u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[32m \u001b[39m\u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m█\u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m█\u001b[39m \u001b[39m \n", " \u001b[34m█\u001b[39m\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[32m▁\u001b[39m\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m█\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m█\u001b[39m \u001b[39m▁\n", " 2.32 s\u001b[90m Histogram: frequency by time\u001b[39m 2.39 s \u001b[0m\u001b[1m<\u001b[22m\n", "\n", " Memory estimate\u001b[90m: \u001b[39m\u001b[33m0 bytes\u001b[39m, allocs estimate\u001b[90m: \u001b[39m\u001b[33m0\u001b[39m." ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "@benchmark matmul_by_loop!($A, $B, $C, \"ikj\")" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "BenchmarkTools.Trial: 3 samples with 1 evaluation.\n", " Range \u001b[90m(\u001b[39m\u001b[36m\u001b[1mmin\u001b[22m\u001b[39m … \u001b[35mmax\u001b[39m\u001b[90m): \u001b[39m\u001b[36m\u001b[1m2.333 s\u001b[22m\u001b[39m … \u001b[35m 2.348 s\u001b[39m \u001b[90m┊\u001b[39m GC \u001b[90m(\u001b[39mmin … max\u001b[90m): \u001b[39m0.00% … 0.00%\n", " Time \u001b[90m(\u001b[39m\u001b[34m\u001b[1mmedian\u001b[22m\u001b[39m\u001b[90m): \u001b[39m\u001b[34m\u001b[1m2.344 s \u001b[22m\u001b[39m\u001b[90m┊\u001b[39m GC \u001b[90m(\u001b[39mmedian\u001b[90m): \u001b[39m0.00%\n", " Time \u001b[90m(\u001b[39m\u001b[32m\u001b[1mmean\u001b[22m\u001b[39m ± \u001b[32mσ\u001b[39m\u001b[90m): \u001b[39m\u001b[32m\u001b[1m2.342 s\u001b[22m\u001b[39m ± \u001b[32m7.773 ms\u001b[39m \u001b[90m┊\u001b[39m GC \u001b[90m(\u001b[39mmean ± σ\u001b[90m): \u001b[39m0.00% ± 0.00%\n", "\n", " \u001b[34m█\u001b[39m\u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[32m \u001b[39m\u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m█\u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m█\u001b[39m \u001b[39m \n", " \u001b[34m█\u001b[39m\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[32m▁\u001b[39m\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m█\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m█\u001b[39m \u001b[39m▁\n", " 2.33 s\u001b[90m Histogram: frequency by time\u001b[39m 2.35 s \u001b[0m\u001b[1m<\u001b[22m\n", "\n", " Memory estimate\u001b[90m: \u001b[39m\u001b[33m0 bytes\u001b[39m, allocs estimate\u001b[90m: \u001b[39m\u001b[33m0\u001b[39m." ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "@benchmark matmul_by_loop!($A, $B, $C, \"kij\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* `ijk` and `jik` looping:" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "BenchmarkTools.Trial: 6 samples with 1 evaluation.\n", " Range \u001b[90m(\u001b[39m\u001b[36m\u001b[1mmin\u001b[22m\u001b[39m … \u001b[35mmax\u001b[39m\u001b[90m): \u001b[39m\u001b[36m\u001b[1m889.913 ms\u001b[22m\u001b[39m … \u001b[35m 1.017 s\u001b[39m \u001b[90m┊\u001b[39m GC \u001b[90m(\u001b[39mmin … max\u001b[90m): \u001b[39m0.00% … 0.00%\n", " Time \u001b[90m(\u001b[39m\u001b[34m\u001b[1mmedian\u001b[22m\u001b[39m\u001b[90m): \u001b[39m\u001b[34m\u001b[1m930.412 ms \u001b[22m\u001b[39m\u001b[90m┊\u001b[39m GC \u001b[90m(\u001b[39mmedian\u001b[90m): \u001b[39m0.00%\n", " Time \u001b[90m(\u001b[39m\u001b[32m\u001b[1mmean\u001b[22m\u001b[39m ± \u001b[32mσ\u001b[39m\u001b[90m): \u001b[39m\u001b[32m\u001b[1m940.485 ms\u001b[22m\u001b[39m ± \u001b[32m42.680 ms\u001b[39m \u001b[90m┊\u001b[39m GC \u001b[90m(\u001b[39mmean ± σ\u001b[90m): \u001b[39m0.00% ± 0.00%\n", "\n", " \u001b[39m▁\u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m▁\u001b[39m \u001b[39m \u001b[39m \u001b[34m█\u001b[39m\u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[32m \u001b[39m\u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m▁\u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m▁\u001b[39m \u001b[39m \n", " \u001b[39m█\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m█\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[34m█\u001b[39m\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[32m▁\u001b[39m\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m█\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m█\u001b[39m \u001b[39m▁\n", " 890 ms\u001b[90m Histogram: frequency by time\u001b[39m 1.02 s \u001b[0m\u001b[1m<\u001b[22m\n", "\n", " Memory estimate\u001b[90m: \u001b[39m\u001b[33m0 bytes\u001b[39m, allocs estimate\u001b[90m: \u001b[39m\u001b[33m0\u001b[39m." ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "@benchmark matmul_by_loop!($A, $B, $C, \"ijk\")" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "BenchmarkTools.Trial: 6 samples with 1 evaluation.\n", " Range \u001b[90m(\u001b[39m\u001b[36m\u001b[1mmin\u001b[22m\u001b[39m … \u001b[35mmax\u001b[39m\u001b[90m): \u001b[39m\u001b[36m\u001b[1m886.608 ms\u001b[22m\u001b[39m … \u001b[35m922.833 ms\u001b[39m \u001b[90m┊\u001b[39m GC \u001b[90m(\u001b[39mmin … max\u001b[90m): \u001b[39m0.00% … 0.00%\n", " Time \u001b[90m(\u001b[39m\u001b[34m\u001b[1mmedian\u001b[22m\u001b[39m\u001b[90m): \u001b[39m\u001b[34m\u001b[1m903.909 ms \u001b[22m\u001b[39m\u001b[90m┊\u001b[39m GC \u001b[90m(\u001b[39mmedian\u001b[90m): \u001b[39m0.00%\n", " Time \u001b[90m(\u001b[39m\u001b[32m\u001b[1mmean\u001b[22m\u001b[39m ± \u001b[32mσ\u001b[39m\u001b[90m): \u001b[39m\u001b[32m\u001b[1m903.385 ms\u001b[22m\u001b[39m ± \u001b[32m 13.468 ms\u001b[39m \u001b[90m┊\u001b[39m GC \u001b[90m(\u001b[39mmean ± σ\u001b[90m): \u001b[39m0.00% ± 0.00%\n", "\n", " \u001b[39m█\u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m█\u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m█\u001b[34m \u001b[39m\u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[32m \u001b[39m\u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m█\u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m█\u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m█\u001b[39m \u001b[39m \n", " \u001b[39m█\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m█\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m█\u001b[34m▁\u001b[39m\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[32m▁\u001b[39m\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m█\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m█\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m█\u001b[39m \u001b[39m▁\n", " 887 ms\u001b[90m Histogram: frequency by time\u001b[39m 923 ms \u001b[0m\u001b[1m<\u001b[22m\n", "\n", " Memory estimate\u001b[90m: \u001b[39m\u001b[33m0 bytes\u001b[39m, allocs estimate\u001b[90m: \u001b[39m\u001b[33m0\u001b[39m." ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "@benchmark matmul_by_loop!($A, $B, $C, \"ijk\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* Julia wraps BLAS library for matrix multiplication. We see BLAS library wins the `jki` and `kji` looping by a large margin (multi-threading, Strassen algorithm, higher level-3 fraction by block outer product)." ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "BenchmarkTools.Trial: 793 samples with 1 evaluation.\n", " Range \u001b[90m(\u001b[39m\u001b[36m\u001b[1mmin\u001b[22m\u001b[39m … \u001b[35mmax\u001b[39m\u001b[90m): \u001b[39m\u001b[36m\u001b[1m4.850 ms\u001b[22m\u001b[39m … \u001b[35m11.389 ms\u001b[39m \u001b[90m┊\u001b[39m GC \u001b[90m(\u001b[39mmin … max\u001b[90m): \u001b[39m0.00% … 0.00%\n", " Time \u001b[90m(\u001b[39m\u001b[34m\u001b[1mmedian\u001b[22m\u001b[39m\u001b[90m): \u001b[39m\u001b[34m\u001b[1m6.007 ms \u001b[22m\u001b[39m\u001b[90m┊\u001b[39m GC \u001b[90m(\u001b[39mmedian\u001b[90m): \u001b[39m0.00%\n", " Time \u001b[90m(\u001b[39m\u001b[32m\u001b[1mmean\u001b[22m\u001b[39m ± \u001b[32mσ\u001b[39m\u001b[90m): \u001b[39m\u001b[32m\u001b[1m6.291 ms\u001b[22m\u001b[39m ± \u001b[32m 1.009 ms\u001b[39m \u001b[90m┊\u001b[39m GC \u001b[90m(\u001b[39mmean ± σ\u001b[90m): \u001b[39m0.00% ± 0.00%\n", "\n", " \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m▁\u001b[39m▂\u001b[39m▇\u001b[39m█\u001b[39m▆\u001b[39m▅\u001b[39m▄\u001b[34m▂\u001b[39m\u001b[39m▆\u001b[39m▇\u001b[39m▅\u001b[32m▁\u001b[39m\u001b[39m▃\u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \n", " \u001b[39m▄\u001b[39m▇\u001b[39m▄\u001b[39m▄\u001b[39m▇\u001b[39m▅\u001b[39m█\u001b[39m█\u001b[39m█\u001b[39m█\u001b[39m█\u001b[39m█\u001b[39m█\u001b[34m█\u001b[39m\u001b[39m█\u001b[39m█\u001b[39m█\u001b[32m█\u001b[39m\u001b[39m█\u001b[39m▇\u001b[39m▆\u001b[39m█\u001b[39m█\u001b[39m▇\u001b[39m▆\u001b[39m▇\u001b[39m█\u001b[39m█\u001b[39m▇\u001b[39m▇\u001b[39m▇\u001b[39m▇\u001b[39m▄\u001b[39m▇\u001b[39m▅\u001b[39m█\u001b[39m▇\u001b[39m▇\u001b[39m▇\u001b[39m▄\u001b[39m▅\u001b[39m▇\u001b[39m▇\u001b[39m▇\u001b[39m▅\u001b[39m▅\u001b[39m▅\u001b[39m▁\u001b[39m▇\u001b[39m▆\u001b[39m▅\u001b[39m▆\u001b[39m▇\u001b[39m▆\u001b[39m▅\u001b[39m▆\u001b[39m▄\u001b[39m▆\u001b[39m \u001b[39m█\n", " 4.85 ms\u001b[90m \u001b[39m\u001b[90mHistogram: \u001b[39m\u001b[90m\u001b[1mlog(\u001b[22m\u001b[39m\u001b[90mfrequency\u001b[39m\u001b[90m\u001b[1m)\u001b[22m\u001b[39m\u001b[90m by time\u001b[39m 9.9 ms \u001b[0m\u001b[1m<\u001b[22m\n", "\n", " Memory estimate\u001b[90m: \u001b[39m\u001b[33m0 bytes\u001b[39m, allocs estimate\u001b[90m: \u001b[39m\u001b[33m0\u001b[39m." ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "@benchmark mul!($C, $A, $B)" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "BenchmarkTools.Trial: 790 samples with 1 evaluation.\n", " Range \u001b[90m(\u001b[39m\u001b[36m\u001b[1mmin\u001b[22m\u001b[39m … \u001b[35mmax\u001b[39m\u001b[90m): \u001b[39m\u001b[36m\u001b[1m5.715 ms\u001b[22m\u001b[39m … \u001b[35m 12.214 ms\u001b[39m \u001b[90m┊\u001b[39m GC \u001b[90m(\u001b[39mmin … max\u001b[90m): \u001b[39m0.00% … 0.00%\n", " Time \u001b[90m(\u001b[39m\u001b[34m\u001b[1mmedian\u001b[22m\u001b[39m\u001b[90m): \u001b[39m\u001b[34m\u001b[1m5.976 ms \u001b[22m\u001b[39m\u001b[90m┊\u001b[39m GC \u001b[90m(\u001b[39mmedian\u001b[90m): \u001b[39m0.00%\n", " Time \u001b[90m(\u001b[39m\u001b[32m\u001b[1mmean\u001b[22m\u001b[39m ± \u001b[32mσ\u001b[39m\u001b[90m): \u001b[39m\u001b[32m\u001b[1m6.312 ms\u001b[22m\u001b[39m ± \u001b[32m952.165 μs\u001b[39m \u001b[90m┊\u001b[39m GC \u001b[90m(\u001b[39mmean ± σ\u001b[90m): \u001b[39m0.00% ± 0.00%\n", "\n", " \u001b[39m█\u001b[39m▇\u001b[39m▆\u001b[39m█\u001b[34m▆\u001b[39m\u001b[39m▄\u001b[39m▂\u001b[39m▃\u001b[32m▂\u001b[39m\u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \n", " \u001b[39m█\u001b[39m█\u001b[39m█\u001b[39m█\u001b[34m█\u001b[39m\u001b[39m█\u001b[39m█\u001b[39m█\u001b[32m█\u001b[39m\u001b[39m█\u001b[39m█\u001b[39m▇\u001b[39m▇\u001b[39m▄\u001b[39m▇\u001b[39m▅\u001b[39m▇\u001b[39m█\u001b[39m▇\u001b[39m▅\u001b[39m▇\u001b[39m▇\u001b[39m▇\u001b[39m▆\u001b[39m▇\u001b[39m▆\u001b[39m▇\u001b[39m▅\u001b[39m▄\u001b[39m▅\u001b[39m▆\u001b[39m▄\u001b[39m▄\u001b[39m▁\u001b[39m▇\u001b[39m▇\u001b[39m▇\u001b[39m▆\u001b[39m▆\u001b[39m▇\u001b[39m▄\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▇\u001b[39m▄\u001b[39m▄\u001b[39m▆\u001b[39m▄\u001b[39m▇\u001b[39m▅\u001b[39m▆\u001b[39m▅\u001b[39m▄\u001b[39m▅\u001b[39m▄\u001b[39m▁\u001b[39m▁\u001b[39m▄\u001b[39m \u001b[39m█\n", " 5.71 ms\u001b[90m \u001b[39m\u001b[90mHistogram: \u001b[39m\u001b[90m\u001b[1mlog(\u001b[22m\u001b[39m\u001b[90mfrequency\u001b[39m\u001b[90m\u001b[1m)\u001b[22m\u001b[39m\u001b[90m by time\u001b[39m 10.1 ms \u001b[0m\u001b[1m<\u001b[22m\n", "\n", " Memory estimate\u001b[90m: \u001b[39m\u001b[33m0 bytes\u001b[39m, allocs estimate\u001b[90m: \u001b[39m\u001b[33m0\u001b[39m." ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# direct call of BLAS wrapper function\n", "@benchmark LinearAlgebra.BLAS.gemm!('N', 'N', 1.0, $A, $B, 0.0, $C)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## BLAS in R\n", "\n", "* **Tip for R usesr**. Standard R distribution from CRAN uses a very out-dated BLAS/LAPACK library." ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "RObject{VecSxp}\n", "Unit: milliseconds\n", " expr min lq mean median uq max neval\n", " `#JL`$A %*% `#JL`$B 909.2172 916.2797 940.2293 931.8024 947.0514 1230.23 100\n" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "using RCall\n", "\n", "R\"\"\"\n", "library(\"microbenchmark\")\n", "microbenchmark($A %*% $B)\n", "\"\"\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* Re-building R from scratch using OpenBLAS or MKL will immediately boost linear algebra performance in R. Google `build R using MKL` to get started. Similarly we can build Julia using MKL.\n", "\n", "* Matlab uses MKL. Usually it's very hard to beat Matlab in terms of linear algebra (and it's expensive!)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Avoid memory allocation: some examples\n", "\n", "1. **Transposing** matrix is an expensive memory operation. \n", " - In R, the command \n", " ```R\n", " t(A) %*% x\n", " ```\n", " will first transpose `A` then perform matrix multiplication, causing unnecessary memory allocation\n", " - Julia is smart to avoid transposing matrix if possible." ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [], "source": [ "using Random, LinearAlgebra, BenchmarkTools\n", "Random.seed!(123)\n", "\n", "n = 1000\n", "A = rand(n, n)\n", "x = rand(n);" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Transpose{Float64, Matrix{Float64}}" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "typeof(transpose(A))" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(:parent,)" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fieldnames(typeof(transpose(A)))" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(Ptr{Float64} @0x00007fe7a7800000, Ptr{Float64} @0x00007fe7a7800000)" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# same data in tranpose(A) and original matrix A\n", "pointer(transpose(A).parent), pointer(A)" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "BenchmarkTools.Trial: 10000 samples with 1 evaluation.\n", " Range \u001b[90m(\u001b[39m\u001b[36m\u001b[1mmin\u001b[22m\u001b[39m … \u001b[35mmax\u001b[39m\u001b[90m): \u001b[39m\u001b[36m\u001b[1m71.486 μs\u001b[22m\u001b[39m … \u001b[35m692.121 μs\u001b[39m \u001b[90m┊\u001b[39m GC \u001b[90m(\u001b[39mmin … max\u001b[90m): \u001b[39m0.00% … 0.00%\n", " Time \u001b[90m(\u001b[39m\u001b[34m\u001b[1mmedian\u001b[22m\u001b[39m\u001b[90m): \u001b[39m\u001b[34m\u001b[1m87.752 μs \u001b[22m\u001b[39m\u001b[90m┊\u001b[39m GC \u001b[90m(\u001b[39mmedian\u001b[90m): \u001b[39m0.00%\n", " Time \u001b[90m(\u001b[39m\u001b[32m\u001b[1mmean\u001b[22m\u001b[39m ± \u001b[32mσ\u001b[39m\u001b[90m): \u001b[39m\u001b[32m\u001b[1m92.042 μs\u001b[22m\u001b[39m ± \u001b[32m 17.339 μs\u001b[39m \u001b[90m┊\u001b[39m GC \u001b[90m(\u001b[39mmean ± σ\u001b[90m): \u001b[39m0.00% ± 0.00%\n", "\n", " \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m▁\u001b[39m▃\u001b[39m▅\u001b[39m▇\u001b[39m▇\u001b[39m█\u001b[39m▇\u001b[34m▆\u001b[39m\u001b[39m▅\u001b[39m▂\u001b[32m▁\u001b[39m\u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \n", " \u001b[39m▁\u001b[39m▁\u001b[39m▂\u001b[39m▃\u001b[39m▄\u001b[39m▆\u001b[39m█\u001b[39m█\u001b[39m█\u001b[39m█\u001b[39m█\u001b[39m█\u001b[39m█\u001b[34m█\u001b[39m\u001b[39m█\u001b[39m█\u001b[32m█\u001b[39m\u001b[39m▇\u001b[39m▅\u001b[39m▅\u001b[39m▄\u001b[39m▄\u001b[39m▃\u001b[39m▃\u001b[39m▃\u001b[39m▃\u001b[39m▃\u001b[39m▃\u001b[39m▃\u001b[39m▃\u001b[39m▂\u001b[39m▂\u001b[39m▂\u001b[39m▂\u001b[39m▂\u001b[39m▂\u001b[39m▂\u001b[39m▂\u001b[39m▂\u001b[39m▂\u001b[39m▂\u001b[39m▁\u001b[39m▂\u001b[39m▂\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▂\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m \u001b[39m▃\n", " 71.5 μs\u001b[90m Histogram: frequency by time\u001b[39m 147 μs \u001b[0m\u001b[1m<\u001b[22m\n", "\n", " Memory estimate\u001b[90m: \u001b[39m\u001b[33m7.94 KiB\u001b[39m, allocs estimate\u001b[90m: \u001b[39m\u001b[33m1\u001b[39m." ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# dispatch to BLAS\n", "# does *not* actually transpose the matrix\n", "@benchmark transpose($A) * $x" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "BenchmarkTools.Trial: 10000 samples with 1 evaluation.\n", " Range \u001b[90m(\u001b[39m\u001b[36m\u001b[1mmin\u001b[22m\u001b[39m … \u001b[35mmax\u001b[39m\u001b[90m): \u001b[39m\u001b[36m\u001b[1m69.739 μs\u001b[22m\u001b[39m … \u001b[35m475.971 μs\u001b[39m \u001b[90m┊\u001b[39m GC \u001b[90m(\u001b[39mmin … max\u001b[90m): \u001b[39m0.00% … 0.00%\n", " Time \u001b[90m(\u001b[39m\u001b[34m\u001b[1mmedian\u001b[22m\u001b[39m\u001b[90m): \u001b[39m\u001b[34m\u001b[1m83.509 μs \u001b[22m\u001b[39m\u001b[90m┊\u001b[39m GC \u001b[90m(\u001b[39mmedian\u001b[90m): \u001b[39m0.00%\n", " Time \u001b[90m(\u001b[39m\u001b[32m\u001b[1mmean\u001b[22m\u001b[39m ± \u001b[32mσ\u001b[39m\u001b[90m): \u001b[39m\u001b[32m\u001b[1m87.852 μs\u001b[22m\u001b[39m ± \u001b[32m 14.373 μs\u001b[39m \u001b[90m┊\u001b[39m GC \u001b[90m(\u001b[39mmean ± σ\u001b[90m): \u001b[39m0.00% ± 0.00%\n", "\n", " \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m▃\u001b[39m▅\u001b[39m█\u001b[39m▇\u001b[39m█\u001b[39m▇\u001b[34m▅\u001b[39m\u001b[39m▄\u001b[39m▂\u001b[32m \u001b[39m\u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \n", " \u001b[39m▁\u001b[39m▂\u001b[39m▃\u001b[39m▄\u001b[39m▆\u001b[39m█\u001b[39m█\u001b[39m█\u001b[39m█\u001b[39m█\u001b[39m█\u001b[39m█\u001b[34m█\u001b[39m\u001b[39m█\u001b[39m█\u001b[32m█\u001b[39m\u001b[39m▆\u001b[39m▅\u001b[39m▄\u001b[39m▄\u001b[39m▃\u001b[39m▃\u001b[39m▃\u001b[39m▃\u001b[39m▃\u001b[39m▃\u001b[39m▂\u001b[39m▃\u001b[39m▃\u001b[39m▃\u001b[39m▃\u001b[39m▃\u001b[39m▂\u001b[39m▂\u001b[39m▂\u001b[39m▂\u001b[39m▂\u001b[39m▂\u001b[39m▂\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▂\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▂\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▂\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▂\u001b[39m▁\u001b[39m \u001b[39m▃\n", " 69.7 μs\u001b[90m Histogram: frequency by time\u001b[39m 140 μs \u001b[0m\u001b[1m<\u001b[22m\n", "\n", " Memory estimate\u001b[90m: \u001b[39m\u001b[33m0 bytes\u001b[39m, allocs estimate\u001b[90m: \u001b[39m\u001b[33m0\u001b[39m." ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# pre-allocate result\n", "out = zeros(size(A, 2))\n", "@benchmark mul!($out, transpose($A), $x)" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "BenchmarkTools.Trial: 10000 samples with 1 evaluation.\n", " Range \u001b[90m(\u001b[39m\u001b[36m\u001b[1mmin\u001b[22m\u001b[39m … \u001b[35mmax\u001b[39m\u001b[90m): \u001b[39m\u001b[36m\u001b[1m68.930 μs\u001b[22m\u001b[39m … \u001b[35m459.098 μs\u001b[39m \u001b[90m┊\u001b[39m GC \u001b[90m(\u001b[39mmin … max\u001b[90m): \u001b[39m0.00% … 0.00%\n", " Time \u001b[90m(\u001b[39m\u001b[34m\u001b[1mmedian\u001b[22m\u001b[39m\u001b[90m): \u001b[39m\u001b[34m\u001b[1m85.314 μs \u001b[22m\u001b[39m\u001b[90m┊\u001b[39m GC \u001b[90m(\u001b[39mmedian\u001b[90m): \u001b[39m0.00%\n", " Time \u001b[90m(\u001b[39m\u001b[32m\u001b[1mmean\u001b[22m\u001b[39m ± \u001b[32mσ\u001b[39m\u001b[90m): \u001b[39m\u001b[32m\u001b[1m92.348 μs\u001b[22m\u001b[39m ± \u001b[32m 18.167 μs\u001b[39m \u001b[90m┊\u001b[39m GC \u001b[90m(\u001b[39mmean ± σ\u001b[90m): \u001b[39m0.00% ± 0.00%\n", "\n", " \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m▁\u001b[39m▃\u001b[39m▆\u001b[39m▇\u001b[39m█\u001b[39m▇\u001b[39m▇\u001b[39m▅\u001b[34m▃\u001b[39m\u001b[39m▁\u001b[39m \u001b[39m \u001b[39m \u001b[32m \u001b[39m\u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \n", " \u001b[39m▁\u001b[39m▁\u001b[39m▂\u001b[39m▃\u001b[39m▄\u001b[39m▆\u001b[39m█\u001b[39m█\u001b[39m█\u001b[39m█\u001b[39m█\u001b[39m█\u001b[39m█\u001b[39m█\u001b[34m█\u001b[39m\u001b[39m█\u001b[39m▇\u001b[39m▆\u001b[39m▅\u001b[32m▃\u001b[39m\u001b[39m▃\u001b[39m▄\u001b[39m▃\u001b[39m▄\u001b[39m▃\u001b[39m▄\u001b[39m▄\u001b[39m▄\u001b[39m▄\u001b[39m▃\u001b[39m▄\u001b[39m▃\u001b[39m▃\u001b[39m▃\u001b[39m▃\u001b[39m▃\u001b[39m▃\u001b[39m▂\u001b[39m▂\u001b[39m▃\u001b[39m▃\u001b[39m▃\u001b[39m▃\u001b[39m▃\u001b[39m▃\u001b[39m▂\u001b[39m▃\u001b[39m▃\u001b[39m▃\u001b[39m▂\u001b[39m▂\u001b[39m▂\u001b[39m▂\u001b[39m▂\u001b[39m▂\u001b[39m▂\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m \u001b[39m▃\n", " 68.9 μs\u001b[90m Histogram: frequency by time\u001b[39m 141 μs \u001b[0m\u001b[1m<\u001b[22m\n", "\n", " Memory estimate\u001b[90m: \u001b[39m\u001b[33m0 bytes\u001b[39m, allocs estimate\u001b[90m: \u001b[39m\u001b[33m0\u001b[39m." ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# or call BLAS wrapper directly\n", "@benchmark LinearAlgebra.BLAS.gemv!('T', 1.0, $A, $x, 0.0, $out)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "2. [Broadcasting](https://docs.julialang.org/en/v1/base/arrays/#Broadcast-and-vectorization-1) in Julia achieves vectorized code without creating intermediate arrays.\n", "\n", " Suppose we want to calculate elementsize maximum of absolute values of two large arrays. In R or Matlab, the command\n", "```r\n", "max(abs(X), abs(Y))\n", "```\n", "will create two intermediate arrays and then one result array." ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "RObject{VecSxp}\n", "Unit: milliseconds\n", " expr min lq mean median uq\n", " max(abs(`#JL`$X), abs(`#JL`$Y)) 5.418381 6.065276 9.559118 7.078557 8.298674\n", " max neval\n", " 83.00264 100\n" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "using RCall\n", "Random.seed!(123)\n", "X, Y = rand(1000, 1000), rand(1000, 1000)\n", "\n", "R\"\"\"\n", "library(microbenchmark)\n", "microbenchmark(max(abs($X), abs($Y)))\n", "\"\"\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In Julia, dot operations are fused so no intermediate arrays are created." ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "BenchmarkTools.Trial: 2582 samples with 1 evaluation.\n", " Range \u001b[90m(\u001b[39m\u001b[36m\u001b[1mmin\u001b[22m\u001b[39m … \u001b[35mmax\u001b[39m\u001b[90m): \u001b[39m\u001b[36m\u001b[1m892.227 μs\u001b[22m\u001b[39m … \u001b[35m13.614 ms\u001b[39m \u001b[90m┊\u001b[39m GC \u001b[90m(\u001b[39mmin … max\u001b[90m): \u001b[39m 0.00% … 66.26%\n", " Time \u001b[90m(\u001b[39m\u001b[34m\u001b[1mmedian\u001b[22m\u001b[39m\u001b[90m): \u001b[39m\u001b[34m\u001b[1m 1.299 ms \u001b[22m\u001b[39m\u001b[90m┊\u001b[39m GC \u001b[90m(\u001b[39mmedian\u001b[90m): \u001b[39m 0.00%\n", " Time \u001b[90m(\u001b[39m\u001b[32m\u001b[1mmean\u001b[22m\u001b[39m ± \u001b[32mσ\u001b[39m\u001b[90m): \u001b[39m\u001b[32m\u001b[1m 1.928 ms\u001b[22m\u001b[39m ± \u001b[32m 2.262 ms\u001b[39m \u001b[90m┊\u001b[39m GC \u001b[90m(\u001b[39mmean ± σ\u001b[90m): \u001b[39m21.39% ± 15.21%\n", "\n", " \u001b[39m▃\u001b[39m▇\u001b[34m█\u001b[39m\u001b[39m▆\u001b[39m▃\u001b[39m▂\u001b[32m \u001b[39m\u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m▁\u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \n", " \u001b[39m█\u001b[39m█\u001b[34m█\u001b[39m\u001b[39m█\u001b[39m█\u001b[39m█\u001b[32m▇\u001b[39m\u001b[39m▇\u001b[39m▇\u001b[39m▇\u001b[39m▇\u001b[39m▆\u001b[39m▇\u001b[39m▆\u001b[39m▅\u001b[39m▅\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▃\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▃\u001b[39m▇\u001b[39m█\u001b[39m█\u001b[39m▇\u001b[39m█\u001b[39m▇\u001b[39m▆\u001b[39m▅\u001b[39m▇\u001b[39m \u001b[39m█\n", " 892 μs\u001b[90m \u001b[39m\u001b[90mHistogram: \u001b[39m\u001b[90m\u001b[1mlog(\u001b[22m\u001b[39m\u001b[90mfrequency\u001b[39m\u001b[90m\u001b[1m)\u001b[22m\u001b[39m\u001b[90m by time\u001b[39m 11.5 ms \u001b[0m\u001b[1m<\u001b[22m\n", "\n", " Memory estimate\u001b[90m: \u001b[39m\u001b[33m7.63 MiB\u001b[39m, allocs estimate\u001b[90m: \u001b[39m\u001b[33m2\u001b[39m." ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# no intermediate arrays created, only result array created\n", "@benchmark max.(abs.($X), abs.($Y))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Pre-allocating result array gets rid of memory allocation at all." ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "BenchmarkTools.Trial: 6054 samples with 1 evaluation.\n", " Range \u001b[90m(\u001b[39m\u001b[36m\u001b[1mmin\u001b[22m\u001b[39m … \u001b[35mmax\u001b[39m\u001b[90m): \u001b[39m\u001b[36m\u001b[1m740.775 μs\u001b[22m\u001b[39m … \u001b[35m 2.007 ms\u001b[39m \u001b[90m┊\u001b[39m GC \u001b[90m(\u001b[39mmin … max\u001b[90m): \u001b[39m0.00% … 0.00%\n", " Time \u001b[90m(\u001b[39m\u001b[34m\u001b[1mmedian\u001b[22m\u001b[39m\u001b[90m): \u001b[39m\u001b[34m\u001b[1m771.994 μs \u001b[22m\u001b[39m\u001b[90m┊\u001b[39m GC \u001b[90m(\u001b[39mmedian\u001b[90m): \u001b[39m0.00%\n", " Time \u001b[90m(\u001b[39m\u001b[32m\u001b[1mmean\u001b[22m\u001b[39m ± \u001b[32mσ\u001b[39m\u001b[90m): \u001b[39m\u001b[32m\u001b[1m816.781 μs\u001b[22m\u001b[39m ± \u001b[32m102.622 μs\u001b[39m \u001b[90m┊\u001b[39m GC \u001b[90m(\u001b[39mmean ± σ\u001b[90m): \u001b[39m0.00% ± 0.00%\n", "\n", " \u001b[39m▅\u001b[39m█\u001b[39m▇\u001b[39m▆\u001b[34m▅\u001b[39m\u001b[39m▄\u001b[39m▄\u001b[39m▃\u001b[39m▃\u001b[32m▃\u001b[39m\u001b[39m▃\u001b[39m▄\u001b[39m▄\u001b[39m▃\u001b[39m▂\u001b[39m▂\u001b[39m▃\u001b[39m▂\u001b[39m▁\u001b[39m▁\u001b[39m▂\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m \u001b[39m \u001b[39m▁\u001b[39m \u001b[39m▁\u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m▂\n", " \u001b[39m█\u001b[39m█\u001b[39m█\u001b[39m█\u001b[34m█\u001b[39m\u001b[39m█\u001b[39m█\u001b[39m█\u001b[39m█\u001b[32m█\u001b[39m\u001b[39m█\u001b[39m█\u001b[39m█\u001b[39m█\u001b[39m█\u001b[39m█\u001b[39m█\u001b[39m█\u001b[39m█\u001b[39m█\u001b[39m█\u001b[39m█\u001b[39m█\u001b[39m█\u001b[39m█\u001b[39m█\u001b[39m█\u001b[39m█\u001b[39m█\u001b[39m█\u001b[39m█\u001b[39m▇\u001b[39m▇\u001b[39m▇\u001b[39m▆\u001b[39m█\u001b[39m█\u001b[39m▇\u001b[39m▅\u001b[39m▇\u001b[39m▆\u001b[39m▇\u001b[39m▆\u001b[39m▆\u001b[39m▆\u001b[39m▇\u001b[39m▆\u001b[39m▅\u001b[39m▅\u001b[39m▆\u001b[39m▅\u001b[39m▅\u001b[39m▅\u001b[39m▆\u001b[39m▅\u001b[39m▅\u001b[39m▄\u001b[39m▄\u001b[39m▄\u001b[39m▅\u001b[39m▅\u001b[39m \u001b[39m█\n", " 741 μs\u001b[90m \u001b[39m\u001b[90mHistogram: \u001b[39m\u001b[90m\u001b[1mlog(\u001b[22m\u001b[39m\u001b[90mfrequency\u001b[39m\u001b[90m\u001b[1m)\u001b[22m\u001b[39m\u001b[90m by time\u001b[39m 1.23 ms \u001b[0m\u001b[1m<\u001b[22m\n", "\n", " Memory estimate\u001b[90m: \u001b[39m\u001b[33m0 bytes\u001b[39m, allocs estimate\u001b[90m: \u001b[39m\u001b[33m0\u001b[39m." ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# no memory allocation at all!\n", "Z = zeros(size(X)) # zero matrix of same size as X\n", "@benchmark $Z .= max.(abs.($X), abs.($Y)) # .= (vs =) is important!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "3. [View](https://docs.julialang.org/en/v1/base/arrays/#Views-(SubArrays-and-other-view-types)-1) avoids creating extra copy of matrix data." ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "BenchmarkTools.Trial: 10000 samples with 1 evaluation.\n", " Range \u001b[90m(\u001b[39m\u001b[36m\u001b[1mmin\u001b[22m\u001b[39m … \u001b[35mmax\u001b[39m\u001b[90m): \u001b[39m\u001b[36m\u001b[1m 79.961 μs\u001b[22m\u001b[39m … \u001b[35m 9.766 ms\u001b[39m \u001b[90m┊\u001b[39m GC \u001b[90m(\u001b[39mmin … max\u001b[90m): \u001b[39m 0.00% … 96.49%\n", " Time \u001b[90m(\u001b[39m\u001b[34m\u001b[1mmedian\u001b[22m\u001b[39m\u001b[90m): \u001b[39m\u001b[34m\u001b[1m268.438 μs \u001b[22m\u001b[39m\u001b[90m┊\u001b[39m GC \u001b[90m(\u001b[39mmedian\u001b[90m): \u001b[39m 0.00%\n", " Time \u001b[90m(\u001b[39m\u001b[32m\u001b[1mmean\u001b[22m\u001b[39m ± \u001b[32mσ\u001b[39m\u001b[90m): \u001b[39m\u001b[32m\u001b[1m296.359 μs\u001b[22m\u001b[39m ± \u001b[32m507.203 μs\u001b[39m \u001b[90m┊\u001b[39m GC \u001b[90m(\u001b[39mmean ± σ\u001b[90m): \u001b[39m10.42% ± 5.92%\n", "\n", " \u001b[39m▆\u001b[39m▃\u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m▁\u001b[39m▅\u001b[39m▇\u001b[39m▇\u001b[39m█\u001b[39m█\u001b[34m█\u001b[39m\u001b[39m▇\u001b[39m▇\u001b[39m▆\u001b[39m▅\u001b[32m▃\u001b[39m\u001b[39m▃\u001b[39m▂\u001b[39m▂\u001b[39m▂\u001b[39m▂\u001b[39m▂\u001b[39m▁\u001b[39m▂\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m \u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m▃\n", " \u001b[39m█\u001b[39m█\u001b[39m▅\u001b[39m█\u001b[39m▆\u001b[39m▅\u001b[39m▅\u001b[39m█\u001b[39m▇\u001b[39m▄\u001b[39m▃\u001b[39m▃\u001b[39m▃\u001b[39m▅\u001b[39m▁\u001b[39m▃\u001b[39m▄\u001b[39m▅\u001b[39m▁\u001b[39m▄\u001b[39m▃\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▃\u001b[39m▇\u001b[39m█\u001b[39m█\u001b[39m█\u001b[39m█\u001b[39m█\u001b[39m█\u001b[34m█\u001b[39m\u001b[39m█\u001b[39m█\u001b[39m█\u001b[39m█\u001b[32m█\u001b[39m\u001b[39m█\u001b[39m█\u001b[39m█\u001b[39m█\u001b[39m█\u001b[39m█\u001b[39m█\u001b[39m█\u001b[39m█\u001b[39m█\u001b[39m█\u001b[39m█\u001b[39m█\u001b[39m█\u001b[39m█\u001b[39m█\u001b[39m█\u001b[39m█\u001b[39m█\u001b[39m▇\u001b[39m▇\u001b[39m▇\u001b[39m▇\u001b[39m \u001b[39m█\n", " 80 μs\u001b[90m \u001b[39m\u001b[90mHistogram: \u001b[39m\u001b[90m\u001b[1mlog(\u001b[22m\u001b[39m\u001b[90mfrequency\u001b[39m\u001b[90m\u001b[1m)\u001b[22m\u001b[39m\u001b[90m by time\u001b[39m 440 μs \u001b[0m\u001b[1m<\u001b[22m\n", "\n", " Memory estimate\u001b[90m: \u001b[39m\u001b[33m488.39 KiB\u001b[39m, allocs estimate\u001b[90m: \u001b[39m\u001b[33m2\u001b[39m." ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "Random.seed!(123)\n", "A = randn(1000, 1000)\n", "\n", "# sum entries in a sub-matrix\n", "@benchmark sum($A[1:2:500, 1:2:500])" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "BenchmarkTools.Trial: 10000 samples with 1 evaluation.\n", " Range \u001b[90m(\u001b[39m\u001b[36m\u001b[1mmin\u001b[22m\u001b[39m … \u001b[35mmax\u001b[39m\u001b[90m): \u001b[39m\u001b[36m\u001b[1m102.922 μs\u001b[22m\u001b[39m … \u001b[35m303.586 μs\u001b[39m \u001b[90m┊\u001b[39m GC \u001b[90m(\u001b[39mmin … max\u001b[90m): \u001b[39m0.00% … 0.00%\n", " Time \u001b[90m(\u001b[39m\u001b[34m\u001b[1mmedian\u001b[22m\u001b[39m\u001b[90m): \u001b[39m\u001b[34m\u001b[1m103.595 μs \u001b[22m\u001b[39m\u001b[90m┊\u001b[39m GC \u001b[90m(\u001b[39mmedian\u001b[90m): \u001b[39m0.00%\n", " Time \u001b[90m(\u001b[39m\u001b[32m\u001b[1mmean\u001b[22m\u001b[39m ± \u001b[32mσ\u001b[39m\u001b[90m): \u001b[39m\u001b[32m\u001b[1m110.998 μs\u001b[22m\u001b[39m ± \u001b[32m 20.398 μs\u001b[39m \u001b[90m┊\u001b[39m GC \u001b[90m(\u001b[39mmean ± σ\u001b[90m): \u001b[39m0.00% ± 0.00%\n", "\n", " \u001b[34m█\u001b[39m\u001b[39m▆\u001b[39m▄\u001b[39m▅\u001b[39m▃\u001b[32m▂\u001b[39m\u001b[39m▁\u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m▁\n", " \u001b[34m█\u001b[39m\u001b[39m█\u001b[39m█\u001b[39m█\u001b[39m█\u001b[32m█\u001b[39m\u001b[39m█\u001b[39m█\u001b[39m█\u001b[39m█\u001b[39m█\u001b[39m▇\u001b[39m▇\u001b[39m▇\u001b[39m▇\u001b[39m▇\u001b[39m▇\u001b[39m▇\u001b[39m▆\u001b[39m▇\u001b[39m█\u001b[39m▇\u001b[39m▇\u001b[39m▇\u001b[39m▆\u001b[39m▆\u001b[39m▇\u001b[39m▇\u001b[39m▇\u001b[39m▆\u001b[39m▇\u001b[39m▇\u001b[39m▆\u001b[39m▅\u001b[39m▅\u001b[39m▅\u001b[39m▆\u001b[39m▅\u001b[39m▅\u001b[39m▆\u001b[39m▇\u001b[39m▆\u001b[39m▆\u001b[39m▆\u001b[39m▆\u001b[39m▆\u001b[39m▅\u001b[39m▅\u001b[39m▅\u001b[39m▅\u001b[39m▅\u001b[39m▅\u001b[39m▅\u001b[39m▆\u001b[39m▅\u001b[39m▅\u001b[39m▄\u001b[39m▅\u001b[39m▄\u001b[39m▄\u001b[39m▅\u001b[39m \u001b[39m█\n", " 103 μs\u001b[90m \u001b[39m\u001b[90mHistogram: \u001b[39m\u001b[90m\u001b[1mlog(\u001b[22m\u001b[39m\u001b[90mfrequency\u001b[39m\u001b[90m\u001b[1m)\u001b[22m\u001b[39m\u001b[90m by time\u001b[39m 209 μs \u001b[0m\u001b[1m<\u001b[22m\n", "\n", " Memory estimate\u001b[90m: \u001b[39m\u001b[33m0 bytes\u001b[39m, allocs estimate\u001b[90m: \u001b[39m\u001b[33m0\u001b[39m." ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# view avoids creating a separate sub-matrix\n", "@benchmark sum(@view $A[1:2:500, 1:2:500])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The [`@views`](https://docs.julialang.org/en/v1/base/arrays/#Base.@views) macro, which can be useful in [some operations](https://discourse.julialang.org/t/why-is-a-manual-in-place-addition-so-much-faster-than-and-on-range-indexed-arrays/3302). (See also Example 3 above.)" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "BenchmarkTools.Trial: 10000 samples with 1 evaluation.\n", " Range \u001b[90m(\u001b[39m\u001b[36m\u001b[1mmin\u001b[22m\u001b[39m … \u001b[35mmax\u001b[39m\u001b[90m): \u001b[39m\u001b[36m\u001b[1m102.900 μs\u001b[22m\u001b[39m … \u001b[35m305.959 μs\u001b[39m \u001b[90m┊\u001b[39m GC \u001b[90m(\u001b[39mmin … max\u001b[90m): \u001b[39m0.00% … 0.00%\n", " Time \u001b[90m(\u001b[39m\u001b[34m\u001b[1mmedian\u001b[22m\u001b[39m\u001b[90m): \u001b[39m\u001b[34m\u001b[1m103.132 μs \u001b[22m\u001b[39m\u001b[90m┊\u001b[39m GC \u001b[90m(\u001b[39mmedian\u001b[90m): \u001b[39m0.00%\n", " Time \u001b[90m(\u001b[39m\u001b[32m\u001b[1mmean\u001b[22m\u001b[39m ± \u001b[32mσ\u001b[39m\u001b[90m): \u001b[39m\u001b[32m\u001b[1m108.654 μs\u001b[22m\u001b[39m ± \u001b[32m 18.127 μs\u001b[39m \u001b[90m┊\u001b[39m GC \u001b[90m(\u001b[39mmean ± σ\u001b[90m): \u001b[39m0.00% ± 0.00%\n", "\n", " \u001b[34m█\u001b[39m\u001b[39m▄\u001b[39m▃\u001b[39m▂\u001b[32m▁\u001b[39m\u001b[39m▁\u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m▁\n", " \u001b[34m█\u001b[39m\u001b[39m█\u001b[39m█\u001b[39m█\u001b[32m█\u001b[39m\u001b[39m█\u001b[39m█\u001b[39m█\u001b[39m▇\u001b[39m▇\u001b[39m▆\u001b[39m▇\u001b[39m█\u001b[39m▇\u001b[39m▆\u001b[39m▅\u001b[39m▆\u001b[39m▆\u001b[39m▅\u001b[39m▄\u001b[39m▄\u001b[39m▅\u001b[39m▆\u001b[39m▇\u001b[39m▆\u001b[39m▆\u001b[39m▆\u001b[39m▅\u001b[39m▅\u001b[39m▅\u001b[39m▅\u001b[39m▅\u001b[39m▅\u001b[39m▅\u001b[39m▅\u001b[39m▅\u001b[39m▆\u001b[39m▅\u001b[39m▅\u001b[39m▄\u001b[39m▅\u001b[39m▄\u001b[39m▆\u001b[39m▄\u001b[39m▆\u001b[39m▆\u001b[39m▆\u001b[39m▆\u001b[39m▆\u001b[39m▆\u001b[39m▅\u001b[39m▅\u001b[39m▅\u001b[39m▅\u001b[39m▅\u001b[39m▄\u001b[39m▅\u001b[39m▅\u001b[39m▄\u001b[39m▅\u001b[39m▄\u001b[39m \u001b[39m█\n", " 103 μs\u001b[90m \u001b[39m\u001b[90mHistogram: \u001b[39m\u001b[90m\u001b[1mlog(\u001b[22m\u001b[39m\u001b[90mfrequency\u001b[39m\u001b[90m\u001b[1m)\u001b[22m\u001b[39m\u001b[90m by time\u001b[39m 197 μs \u001b[0m\u001b[1m<\u001b[22m\n", "\n", " Memory estimate\u001b[90m: \u001b[39m\u001b[33m0 bytes\u001b[39m, allocs estimate\u001b[90m: \u001b[39m\u001b[33m0\u001b[39m." ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "@benchmark @views sum($A[1:2:500, 1:2:500])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Acknowledgment\n", "\n", "This lecture note has evolved from [Dr. Hua Zhou](http://hua-zhou.github.io)'s 2019 Spring Statistical Computing course notes available at ." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "@webio": { "lastCommId": null, "lastKernelId": null }, "kernelspec": { "display_name": "Julia 1.6.2", "language": "julia", "name": "julia-1.6" }, "language_info": { "file_extension": ".jl", "mimetype": "application/julia", "name": "julia", "version": "1.6.2" }, "toc": { "colors": { "hover_highlight": "#DAA520", "running_highlight": "#FF0000", "selected_highlight": "#FFD700" }, "moveMenuLeft": true, "nav_menu": { "height": "103px", "width": "252px" }, "navigate_menu": true, "number_sections": true, "sideBar": true, "skip_h1_title": true, "threshold": 4, "toc_cell": true, "toc_position": { "height": "689.8912963867188px", "left": "0px", "right": "619.4000244140625px", "top": "76.10054779052734px", "width": "224.36141967773438px" }, "toc_section_display": "block", "toc_window_display": true, "widenNotebook": false } }, "nbformat": 4, "nbformat_minor": 4 }