{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "toc": "true"
   },
   "source": [
    "# Table of Contents\n",
    " <p><div class=\"lev1 toc-item\"><a href=\"#Numerical-linear-algebra:-introduction\" data-toc-modified-id=\"Numerical-linear-algebra:-introduction-1\"><span class=\"toc-item-num\">1&nbsp;&nbsp;</span>Numerical linear algebra: introduction</a></div><div class=\"lev2 toc-item\"><a href=\"#BLAS\" data-toc-modified-id=\"BLAS-11\"><span class=\"toc-item-num\">1.1&nbsp;&nbsp;</span>BLAS</a></div><div class=\"lev2 toc-item\"><a href=\"#Examples\" data-toc-modified-id=\"Examples-12\"><span class=\"toc-item-num\">1.2&nbsp;&nbsp;</span>Examples</a></div><div class=\"lev2 toc-item\"><a href=\"#Memory-hierarchy-and-level-3-fraction\" data-toc-modified-id=\"Memory-hierarchy-and-level-3-fraction-13\"><span class=\"toc-item-num\">1.3&nbsp;&nbsp;</span>Memory hierarchy and level-3 fraction</a></div><div class=\"lev2 toc-item\"><a href=\"#Effect-of-data-layout\" data-toc-modified-id=\"Effect-of-data-layout-14\"><span class=\"toc-item-num\">1.4&nbsp;&nbsp;</span>Effect of data layout</a></div><div class=\"lev2 toc-item\"><a href=\"#BLAS-in-R\" data-toc-modified-id=\"BLAS-in-R-15\"><span class=\"toc-item-num\">1.5&nbsp;&nbsp;</span>BLAS in R</a></div><div class=\"lev2 toc-item\"><a href=\"#Avoid-memory-allocation:-some-examples\" data-toc-modified-id=\"Avoid-memory-allocation:-some-examples-16\"><span class=\"toc-item-num\">1.6&nbsp;&nbsp;</span>Avoid memory allocation: some examples</a></div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Julia Version 1.1.0\n",
      "Commit 80516ca202 (2019-01-21 21:24 UTC)\n",
      "Platform Info:\n",
      "  OS: macOS (x86_64-apple-darwin14.5.0)\n",
      "  CPU: Intel(R) Core(TM) i7-6920HQ CPU @ 2.90GHz\n",
      "  WORD_SIZE: 64\n",
      "  LIBM: libopenlibm\n",
      "  LLVM: libLLVM-6.0.1 (ORCJIT, skylake)\n",
      "Environment:\n",
      "  JULIA_EDITOR = code\n"
     ]
    }
   ],
   "source": [
    "versioninfo()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Numerical linear algebra: introduction"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* Topics in numerical algebra: \n",
    "    - BLAS  \n",
    "    - solve linear equations $\\mathbf{A} \\mathbf{x} = \\mathbf{b}$\n",
    "    - regression computations $\\mathbf{X}^T \\mathbf{X} \\beta = \\mathbf{X}^T \\mathbf{y}$  \n",
    "    - eigen-problems $\\mathbf{A} \\mathbf{x} = \\lambda \\mathbf{x}$  \n",
    "    - generalized eigen-problems $\\mathbf{A} \\mathbf{x} = \\lambda \\mathbf{B} \\mathbf{x}$  \n",
    "    - singular value decompositions $\\mathbf{A} = \\mathbf{U} \\Sigma \\mathbf{V}^T$  \n",
    "    - iterative methods for numerical linear algebra    \n",
    "\n",
    "* Except for the iterative methods, most of these numerical linear algebra tasks are implemented in the BLAS and LAPACK libraries. They form the **building blocks** of most statistical computing tasks (optimization, MCMC).\n",
    "\n",
    "* Our major **goal** (or learning objectives) is to  \n",
    "    1. know the complexity (flop count) of each task\n",
    "    2. be familiar with the BLAS and LAPACK functions (what they do)  \n",
    "    3. do **not** re-invent wheels by implementing these dense linear algebra subroutines by yourself  \n",
    "    4. understand the need for iterative methods  \n",
    "    5. apply appropriate numerical algebra tools to various statistical problems \n",
    "\n",
    "* All high-level languages (R, Matlab, Julia) call BLAS and LAPACK for numerical linear algebra. \n",
    "    - Julia offers more flexibility by exposing interfaces to many BLAS/LAPACK subroutines directly. See [documentation](https://docs.julialang.org/en/v1.1/stdlib/LinearAlgebra/#BLAS-Functions-1)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## BLAS\n",
    "\n",
    "* BLAS stands for _basic linear algebra subprograms_. \n",
    "\n",
    "* See [netlib](http://www.netlib.org/blas/) for a complete list of standardized BLAS functions.\n",
    "\n",
    "* There are many implementations of BLAS. \n",
    "    - [Netlib](http://www.netlib.org/blas/) provides a reference implementation.  \n",
    "    - Matlab uses Intel's [MKL](https://software.intel.com/en-us/node/520724) (mathematical kernel libaries). **MKL implementation is the gold standard on market.** It is not open source but the compiled library is free for Linux and MacOS.    \n",
    "    - Julia uses [OpenBLAS](https://github.com/xianyi/OpenBLAS). **OpenBLAS is the best open source implementation**.    \n",
    "\n",
    "* There are 3 levels of BLAS functions.\n",
    "    - [Level 1](http://www.netlib.org/blas/#_level_1): vector-vector operation\n",
    "    - [Level 2](http://www.netlib.org/blas/#_level_2): matrix-vector operation\n",
    "    - [Level 3](http://www.netlib.org/blas/#_level_3): matrix-matrix operation\n",
    "\n",
    "| Level | Example Operation                      | Name        | Dimension                                 | Flops |  \n",
    "|-------|----------------------------------------|-------------|-------------------------------------------|-------|\n",
    "| 1     | $\\alpha \\gets \\mathbf{x}^T \\mathbf{y}$ | dot product | $\\mathbf{x}, \\mathbf{y} \\in \\mathbb{R}^n$ | $2n$  |  \n",
    "| 1 | $\\mathbf{y} \\gets \\mathbf{y} + \\alpha \\mathbf{x}$ |  axpy           |  $\\alpha \\in \\mathbb{R}$, $\\mathbf{x}, \\mathbf{y} \\in \\mathbb{R}^n$ |  $2n$    |  \n",
    "| 2     | $\\mathbf{y} \\gets \\mathbf{y} + \\mathbf{A} \\mathbf{x}$ |  gaxpy           |  $\\mathbf{A} \\in \\mathbb{R}^{m \\times n}$, $\\mathbf{x} \\in \\mathbb{R}^n$, $\\mathbf{y} \\in \\mathbb{R}^m$                                     |  $2mn$     |\n",
    "| 2 | $\\mathbf{A} \\gets \\mathbf{A} + \\mathbf{y} \\mathbf{x}^T$ | rank one update            |    $\\mathbf{A} \\in \\mathbb{R}^{m \\times n}$, $\\mathbf{x} \\in \\mathbb{R}^n$, $\\mathbf{y} \\in \\mathbb{R}^m$                                       | $2mn$      |\n",
    "| 3     | $\\mathbf{C} \\gets \\mathbf{C} + \\mathbf{A} \\mathbf{B}$                                       |  matrix multiplication           |  $\\mathbf{A} \\in \\mathbb{R}^{m \\times p}$, $\\mathbf{B} \\in \\mathbb{R}^{p \\times n}$, $\\mathbf{C} \\in \\mathbb{R}^{m \\times n}$                                         | $2mnp$      |\n",
    "\n",
    "* Typical BLAS functions support single precision (S), double precision (D), complex (C), and double complex (Z). "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Examples\n",
    "\n",
    "> **The form of a mathematical expression and the way the expression should be evaluated in actual practice may be quite different.**\n",
    "\n",
    "Some operations _appear_ as level-3 but indeed are level-2.  \n",
    "\n",
    "**Example 1**. A common operation in statistics is column scaling or row scaling\n",
    "$$\n",
    "\\begin{eqnarray*}\n",
    "    \\mathbf{A} &=& \\mathbf{A} \\mathbf{D} \\quad \\text{(column scaling)} \\\\\n",
    "    \\mathbf{A} &=& \\mathbf{D} \\mathbf{A} \\quad \\text{(row scaling)},\n",
    "\\end{eqnarray*}\n",
    "$$\n",
    "where $\\mathbf{D}$ is diagonal. For example, in generalized linear models (GLMs), the Fisher information matrix takes the form  \n",
    "$$\n",
    "\\mathbf{X}^T \\mathbf{W} \\mathbf{X},\n",
    "$$\n",
    "where $\\mathbf{W}$ is a diagonal matrix with observation weights on diagonal.  \n",
    "\n",
    "  Column and row scalings are essentially level-2 operations!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "2000×2000 Diagonal{Float64,Array{Float64,1}}:\n",
       " 0.140972   ⋅         ⋅         ⋅         …   ⋅         ⋅         ⋅      \n",
       "  ⋅        0.143596   ⋅         ⋅             ⋅         ⋅         ⋅      \n",
       "  ⋅         ⋅        0.612494   ⋅             ⋅         ⋅         ⋅      \n",
       "  ⋅         ⋅         ⋅        0.0480573      ⋅         ⋅         ⋅      \n",
       "  ⋅         ⋅         ⋅         ⋅             ⋅         ⋅         ⋅      \n",
       "  ⋅         ⋅         ⋅         ⋅         …   ⋅         ⋅         ⋅      \n",
       "  ⋅         ⋅         ⋅         ⋅             ⋅         ⋅         ⋅      \n",
       "  ⋅         ⋅         ⋅         ⋅             ⋅         ⋅         ⋅      \n",
       "  ⋅         ⋅         ⋅         ⋅             ⋅         ⋅         ⋅      \n",
       "  ⋅         ⋅         ⋅         ⋅             ⋅         ⋅         ⋅      \n",
       "  ⋅         ⋅         ⋅         ⋅         …   ⋅         ⋅         ⋅      \n",
       "  ⋅         ⋅         ⋅         ⋅             ⋅         ⋅         ⋅      \n",
       "  ⋅         ⋅         ⋅         ⋅             ⋅         ⋅         ⋅      \n",
       " ⋮                                        ⋱                              \n",
       "  ⋅         ⋅         ⋅         ⋅             ⋅         ⋅         ⋅      \n",
       "  ⋅         ⋅         ⋅         ⋅             ⋅         ⋅         ⋅      \n",
       "  ⋅         ⋅         ⋅         ⋅         …   ⋅         ⋅         ⋅      \n",
       "  ⋅         ⋅         ⋅         ⋅             ⋅         ⋅         ⋅      \n",
       "  ⋅         ⋅         ⋅         ⋅             ⋅         ⋅         ⋅      \n",
       "  ⋅         ⋅         ⋅         ⋅             ⋅         ⋅         ⋅      \n",
       "  ⋅         ⋅         ⋅         ⋅             ⋅         ⋅         ⋅      \n",
       "  ⋅         ⋅         ⋅         ⋅         …   ⋅         ⋅         ⋅      \n",
       "  ⋅         ⋅         ⋅         ⋅             ⋅         ⋅         ⋅      \n",
       "  ⋅         ⋅         ⋅         ⋅            0.882415   ⋅         ⋅      \n",
       "  ⋅         ⋅         ⋅         ⋅             ⋅        0.450904   ⋅      \n",
       "  ⋅         ⋅         ⋅         ⋅             ⋅         ⋅        0.814614"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "using BenchmarkTools, LinearAlgebra, Random\n",
    "\n",
    "Random.seed!(123) # seed\n",
    "n = 2000\n",
    "A = rand(n, n) # n-by-n matrix\n",
    "d = rand(n)  # n vector\n",
    "D = Diagonal(d) # diagonal matrix with d as diagonal"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "2000×2000 Array{Float64,2}:\n",
       " 0.140972  0.0       0.0       0.0        …  0.0       0.0       0.0     \n",
       " 0.0       0.143596  0.0       0.0           0.0       0.0       0.0     \n",
       " 0.0       0.0       0.612494  0.0           0.0       0.0       0.0     \n",
       " 0.0       0.0       0.0       0.0480573     0.0       0.0       0.0     \n",
       " 0.0       0.0       0.0       0.0           0.0       0.0       0.0     \n",
       " 0.0       0.0       0.0       0.0        …  0.0       0.0       0.0     \n",
       " 0.0       0.0       0.0       0.0           0.0       0.0       0.0     \n",
       " 0.0       0.0       0.0       0.0           0.0       0.0       0.0     \n",
       " 0.0       0.0       0.0       0.0           0.0       0.0       0.0     \n",
       " 0.0       0.0       0.0       0.0           0.0       0.0       0.0     \n",
       " 0.0       0.0       0.0       0.0        …  0.0       0.0       0.0     \n",
       " 0.0       0.0       0.0       0.0           0.0       0.0       0.0     \n",
       " 0.0       0.0       0.0       0.0           0.0       0.0       0.0     \n",
       " ⋮                                        ⋱                              \n",
       " 0.0       0.0       0.0       0.0           0.0       0.0       0.0     \n",
       " 0.0       0.0       0.0       0.0           0.0       0.0       0.0     \n",
       " 0.0       0.0       0.0       0.0        …  0.0       0.0       0.0     \n",
       " 0.0       0.0       0.0       0.0           0.0       0.0       0.0     \n",
       " 0.0       0.0       0.0       0.0           0.0       0.0       0.0     \n",
       " 0.0       0.0       0.0       0.0           0.0       0.0       0.0     \n",
       " 0.0       0.0       0.0       0.0           0.0       0.0       0.0     \n",
       " 0.0       0.0       0.0       0.0        …  0.0       0.0       0.0     \n",
       " 0.0       0.0       0.0       0.0           0.0       0.0       0.0     \n",
       " 0.0       0.0       0.0       0.0           0.882415  0.0       0.0     \n",
       " 0.0       0.0       0.0       0.0           0.0       0.450904  0.0     \n",
       " 0.0       0.0       0.0       0.0           0.0       0.0       0.814614"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "Dfull = convert(Matrix, D) # convert to full matrix"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "BenchmarkTools.Trial: \n",
       "  memory estimate:  30.52 MiB\n",
       "  allocs estimate:  2\n",
       "  --------------\n",
       "  minimum time:     100.961 ms (2.61% GC)\n",
       "  median time:      102.280 ms (2.60% GC)\n",
       "  mean time:        103.678 ms (3.29% GC)\n",
       "  maximum time:     138.528 ms (26.71% GC)\n",
       "  --------------\n",
       "  samples:          49\n",
       "  evals/sample:     1"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# this is calling BLAS routine for matrix multiplication: O(n^3) flops\n",
    "# this is SLOW!\n",
    "@benchmark $A * $Dfull"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "BenchmarkTools.Trial: \n",
       "  memory estimate:  30.52 MiB\n",
       "  allocs estimate:  2\n",
       "  --------------\n",
       "  minimum time:     9.051 ms (3.33% GC)\n",
       "  median time:      11.632 ms (24.18% GC)\n",
       "  mean time:        11.911 ms (24.67% GC)\n",
       "  maximum time:     53.310 ms (83.51% GC)\n",
       "  --------------\n",
       "  samples:          420\n",
       "  evals/sample:     1"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# dispatch to special method for diagonal matrix multiplication.\n",
    "# columnwise scaling: O(n^2) flops\n",
    "@benchmark $A * $D"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "BenchmarkTools.Trial: \n",
       "  memory estimate:  0 bytes\n",
       "  allocs estimate:  0\n",
       "  --------------\n",
       "  minimum time:     5.276 ms (0.00% GC)\n",
       "  median time:      10.364 ms (0.00% GC)\n",
       "  mean time:        11.183 ms (0.00% GC)\n",
       "  maximum time:     19.343 ms (0.00% GC)\n",
       "  --------------\n",
       "  samples:          447\n",
       "  evals/sample:     1"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# in-place: avoid allocate space for result\n",
    "# rmul!: compute matrix-matrix product AB, overwriting A, and return the result.\n",
    "@benchmark rmul!(A, D)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Note:** In R or Matlab, `diag(d)` will create a full matrix. Be cautious using `diag` function: do we really need a full diagonal matrix?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "┌ Info: Recompiling stale cache file /Users/huazhou/.julia/compiled/v1.1/RCall/8GFyb.ji for RCall [6f49c342-dc21-5d91-9882-a32aef131414]\n",
      "└ @ Base loading.jl:1184\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "RObject{RealSxp}\n",
       "          [,1]       [,2]      [,3]      [,4]     [,5]\n",
       "[1,] 0.1639009 0.00000000 0.0000000 0.0000000 0.000000\n",
       "[2,] 0.0000000 0.03372306 0.0000000 0.0000000 0.000000\n",
       "[3,] 0.0000000 0.00000000 0.7544716 0.0000000 0.000000\n",
       "[4,] 0.0000000 0.00000000 0.0000000 0.3005411 0.000000\n",
       "[5,] 0.0000000 0.00000000 0.0000000 0.0000000 0.144251\n"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "using RCall\n",
    "\n",
    "R\"\"\"\n",
    "d <- runif(5)\n",
    "diag(d)\n",
    "\"\"\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      ">> >> >> \n",
      "d =\n",
      "\n",
      "    0.8147\n",
      "    0.9058\n",
      "    0.1270\n",
      "    0.9134\n",
      "    0.6324\n",
      "\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "5×5 Array{Float64,2}:\n",
       " 0.814724  0.0       0.0       0.0       0.0     \n",
       " 0.0       0.905792  0.0       0.0       0.0     \n",
       " 0.0       0.0       0.126987  0.0       0.0     \n",
       " 0.0       0.0       0.0       0.913376  0.0     \n",
       " 0.0       0.0       0.0       0.0       0.632359"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "using MATLAB\n",
    "\n",
    "mat\"\"\"\n",
    "d = rand(5, 1)\n",
    "diag(d)\n",
    "\"\"\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Example 2**. Innter product between two matrices $\\mathbf{A}, \\mathbf{B} \\in \\mathbb{R}^{m \\times n}$ is often written as \n",
    "$$\n",
    "    \\text{trace}(\\mathbf{A}^T \\mathbf{B}), \\text{trace}(\\mathbf{B} \\mathbf{A}^T), \\text{trace}(\\mathbf{A} \\mathbf{B}^T), \\text{ or } \\text{trace}(\\mathbf{B}^T \\mathbf{A}).\n",
    "$$\n",
    "They appear as level-3 operation (matrix multiplication with $O(m^2n)$ or $O(mn^2)$ flops)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "BenchmarkTools.Trial: \n",
       "  memory estimate:  30.52 MiB\n",
       "  allocs estimate:  2\n",
       "  --------------\n",
       "  minimum time:     104.447 ms (0.43% GC)\n",
       "  median time:      110.943 ms (2.73% GC)\n",
       "  mean time:        110.926 ms (2.70% GC)\n",
       "  maximum time:     117.164 ms (2.65% GC)\n",
       "  --------------\n",
       "  samples:          46\n",
       "  evals/sample:     1"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "Random.seed!(123)\n",
    "n = 2000\n",
    "A, B = randn(n, n), randn(n, n)\n",
    "\n",
    "# slow way to evaluate this thing\n",
    "@benchmark tr(transpose($A) * $B)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "But $\\text{trace}(\\mathbf{A}^T \\mathbf{B}) = <\\text{vec}(\\mathbf{A}), \\text{vec}(\\mathbf{B})>$. The latter is level-2 operation with $O(mn)$ flops."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "BenchmarkTools.Trial: \n",
       "  memory estimate:  0 bytes\n",
       "  allocs estimate:  0\n",
       "  --------------\n",
       "  minimum time:     2.390 ms (0.00% GC)\n",
       "  median time:      3.613 ms (0.00% GC)\n",
       "  mean time:        3.618 ms (0.00% GC)\n",
       "  maximum time:     5.299 ms (0.00% GC)\n",
       "  --------------\n",
       "  samples:          1375\n",
       "  evals/sample:     1"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "@benchmark dot($A, $B)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Example 3**. Similarly $\\text{diag}(\\mathbf{A}^T \\mathbf{B})$ can be calculated in $O(mn)$ flops."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "BenchmarkTools.Trial: \n",
       "  memory estimate:  30.53 MiB\n",
       "  allocs estimate:  3\n",
       "  --------------\n",
       "  minimum time:     105.656 ms (2.72% GC)\n",
       "  median time:      110.821 ms (2.75% GC)\n",
       "  mean time:        111.252 ms (2.69% GC)\n",
       "  maximum time:     119.092 ms (2.59% GC)\n",
       "  --------------\n",
       "  samples:          45\n",
       "  evals/sample:     1"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# slow way to evaluate this thing\n",
    "@benchmark diag(transpose($A) * $B)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "BenchmarkTools.Trial: \n",
       "  memory estimate:  30.53 MiB\n",
       "  allocs estimate:  6\n",
       "  --------------\n",
       "  minimum time:     9.309 ms (5.34% GC)\n",
       "  median time:      11.412 ms (22.53% GC)\n",
       "  mean time:        11.468 ms (22.75% GC)\n",
       "  maximum time:     13.552 ms (34.13% GC)\n",
       "  --------------\n",
       "  samples:          436\n",
       "  evals/sample:     1"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# smarter\n",
    "@benchmark Diagonal(vec(sum($A .* $B, dims=1)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To get rid of allocation of intermediate array at all, we can just write a double loop or use `dot` function."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "BenchmarkTools.Trial: \n",
       "  memory estimate:  203.33 KiB\n",
       "  allocs estimate:  4004\n",
       "  --------------\n",
       "  minimum time:     3.277 ms (0.00% GC)\n",
       "  median time:      3.503 ms (0.00% GC)\n",
       "  mean time:        3.598 ms (0.39% GC)\n",
       "  maximum time:     6.018 ms (29.67% GC)\n",
       "  --------------\n",
       "  samples:          1386\n",
       "  evals/sample:     1"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "function diag_matmul(A, B)\n",
    "    m, n = size(A)\n",
    "    @assert size(B) == (m, n) \"A and B should have same size\"\n",
    "    @views d = [dot(A[:, j], B[:, j]) for j in 1:n]\n",
    "#    d = zeros(eltype(A), n)\n",
    "#    for j in 1:n, i in 1:m\n",
    "#        d[j] += A[i, j] * B[i, j]\n",
    "#    end\n",
    "    Diagonal(d)\n",
    "end\n",
    "\n",
    "@benchmark diag_matmul($A, $B)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Memory hierarchy and level-3 fraction\n",
    "\n",
    "> **Key to high performance is effective use of memory hierarchy. True on all architectures.**\n",
    "\n",
    "* Flop count is not the sole determinant of algorithm efficiency. Another important factor is data movement through the memory hierarchy.\n",
    "\n",
    "<img src=\"./macpro_inside.png\" width=\"400\" align=\"center\">\n",
    "\n",
    "<img src=\"./cpu_die.png\" width=\"400\" align=\"center\">  \n",
    "\n",
    "<img src=\"http://images.bit-tech.net/content_images/2007/11/the_secrets_of_pc_memory_part_1/hei.png\" width=\"400\" align=\"center\">\n",
    "\n",
    "* Numbers everyone should know\n",
    "\n",
    "| Operation                           | Time           |\n",
    "|-------------------------------------|----------------|\n",
    "| L1 cache reference                  | 0.5 ns         |\n",
    "| L2 cache reference                  | 7 ns           |\n",
    "| Main memory reference               | 100 ns         |\n",
    "| Read 1 MB sequentially from memory  | 250,000 ns     |\n",
    "| Read 1 MB sequentially from SSD     | 1,000,000 ns   |  \n",
    "| Read 1 MB sequentially from disk    | 20,000,000 ns  |\n",
    "\n",
    "\n",
    "<!-- | Operation                           | Time           | -->\n",
    "<!-- |-------------------------------------|----------------| -->\n",
    "<!-- | L1 cache reference                  | 0.5 ns         | -->\n",
    "<!-- | Branch mispredict                   | 5 ns           | -->\n",
    "<!-- | L2 cache reference                  | 7 ns           | -->\n",
    "<!-- | Mutex lock/unlock                   | 100 ns         | -->\n",
    "<!-- | Main memory reference               | 100 ns         | -->\n",
    "<!-- | Compress 1K bytes with Zippy        | 10,000 ns      | -->\n",
    "<!-- | Send 2K bytes over 1 Gbps network   | 20,000 ns      | -->\n",
    "<!-- | Read 1 MB sequentially from memory  | 250,000 ns     | -->\n",
    "<!-- | Round trip within same datacenter   | 500,000 ns     | -->\n",
    "<!-- | Disk seek                           | 10,000,000 ns  | -->\n",
    "<!-- | Read 1 MB sequentially from network | 10,000,000 ns  | -->\n",
    "<!-- | Read 1 MB sequentially from disk    | 30,000,000 ns  | -->\n",
    "<!-- | Send packet CA->Netherlands->CA     | 150,000,000 ns | -->\n",
    "\n",
    "   Source: <https://gist.github.com/jboner/2841832>  \n",
    "\n",
    "* For example, Xeon X5650 CPU has a theoretical throughput of 128 DP GFLOPS but a max memory bandwidth of 32GB/s.  \n",
    "\n",
    "* Can we keep CPU cores busy with enough deliveries of matrix data and ship the results to memory fast enough to avoid backlog?  \n",
    "Answer: use **high-level BLAS** as much as possible.\n",
    "\n",
    "| BLAS | Dimension | Mem. Refs. | Flops  | Ratio |\n",
    "|--------------------------------|------------------------------------------------------------|------------|--------|-------|\n",
    "| Level 1: $\\mathbf{y} \\gets \\mathbf{y} + \\alpha \\mathbf{x}$     | $\\mathbf{x}, \\mathbf{y} \\in \\mathbb{R}^n$                                           | $3n$       | $2n$   | 3:2   |\n",
    "| Level 2: $\\mathbf{y} \\gets \\mathbf{y} + \\mathbf{A} \\mathbf{x}$ | $\\mathbf{x}, \\mathbf{y} \\in \\mathbb{R}^n$, $\\mathbf{A} \\in \\mathbb{R}^{n \\times n}$ | $n^2$      | $2n^2$ | 1:2   |\n",
    "| Level 3: $\\mathbf{C} \\gets \\mathbf{C} + \\mathbf{A} \\mathbf{B}$ | $\\mathbf{A}, \\mathbf{B}, \\mathbf{C} \\in\\mathbb{R}^{n \\times n}$                    | $4n^2$     | $2n^3$ | 2:n |  \n",
    "\n",
    "* Higher level BLAS (3 or 2) make more effective use of arithmetic logic units (ALU) by keeping them busy. **Surface-to-volume** effect.  \n",
    "See [Dongarra slides](https://www.samsi.info/wp-content/uploads/2017/02/SAMSI-0217_Dongarra.pdf).\n",
    "\n",
    "<img src=\"./blas_throughput.png\" width=\"500\" align=\"center\"/>\n",
    "\n",
    "* A distinction between LAPACK and LINPACK (older version of R uses LINPACK) is that LAPACK makes use of higher level BLAS as much as possible (usually by smart partitioning) to increase the so-called **level-3 fraction**.\n",
    "\n",
    "* To appreciate the efforts in an optimized BLAS implementation such as OpenBLAS (evolved from GotoBLAS), see the [Quora question](https://www.quora.com/What-algorithm-does-BLAS-use-for-matrix-multiplication-Of-all-the-considerations-e-g-cache-popular-instruction-sets-Big-O-etc-which-one-turned-out-to-be-the-primary-bottleneck), especially the [video](https://youtu.be/JzNpKDW07rw). Bottomline is \n",
    "\n",
    "> **Get familiar with (good implementations of) BLAS/LAPACK and use them as much as possible.**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Effect of data layout\n",
    "\n",
    "* Data layout in memory affects algorithmic efficiency too. It is much faster to move chunks of data in memory than retrieving/writing scattered data.\n",
    "\n",
    "* Storage mode: **column-major** (Fortran, Matlab, R, Julia) vs **row-major** (C/C++).\n",
    "\n",
    "* **Cache line** is the minimum amount of cache which can be loaded and stored to memory.\n",
    "    - x86 CPUs: 64 bytes  \n",
    "    - ARM CPUs: 32 bytes\n",
    "\n",
    "<img src=\"https://patterns.eecs.berkeley.edu/wordpress/wp-content/uploads/2013/04/dense02.png\" width=\"500\" align=\"center\"/>\n",
    "\n",
    "* Accessing column-major stored matrix by rows ($ij$ looping) causes lots of **cache misses**.\n",
    "\n",
    "* Take matrix multiplication as an example \n",
    "$$ \n",
    "\\mathbf{C} \\gets \\mathbf{C} + \\mathbf{A} \\mathbf{B}, \\quad \\mathbf{A} \\in \\mathbb{R}^{m \\times p}, \\mathbf{B} \\in \\mathbb{R}^{p \\times n}, \\mathbf{C} \\in \\mathbb{R}^{m \\times n}.\n",
    "$$\n",
    "Assume the storage is column-major, such as in Julia. There are 6 variants of the algorithms according to the order in the triple loops. \n",
    "    - `jki` or `kji` looping:\n",
    "        ```julia\n",
    "        # inner most loop\n",
    "        for i = 1:m\n",
    "            C[i, j] = C[i, j] + A[i, k] * B[k, j]\n",
    "        end\n",
    "        ```  \n",
    "    - `ikj` or `kij` looping:\n",
    "        ```julia\n",
    "        # inner most loop        \n",
    "        for j = 1:n\n",
    "            C[i, j] = C[i, j] + A[i, k] * B[k, j]\n",
    "        end\n",
    "        ```  \n",
    "    - `ijk` or `jik` looping:\n",
    "        ```julia\n",
    "        # inner most loop        \n",
    "        for k = 1:p\n",
    "            C[i, j] = C[i, j] + A[i, k] * B[k, j]\n",
    "        end\n",
    "        ```\n",
    "* We pay attention to the innermost loop, where the vector calculation occurs. The associated **stride** when accessing the three matrices in memory (assuming column-major storage) is  \n",
    "\n",
    "| Variant        | A Stride | B Stride | C Stride |\n",
    "|----------------|----------|----------|----------|\n",
    "| $jki$ or $kji$ | Unit     | 0        | Unit     |\n",
    "| $ikj$ or $kij$ | 0        | Non-Unit | Non-Unit |\n",
    "| $ijk$ or $jik$ | Non-Unit | Unit     | 0        |       \n",
    "Apparently the variants $jki$ or $kji$ are preferred."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [],
   "source": [
    "\"\"\"\n",
    "    matmul_by_loop!(A, B, C, order)\n",
    "\n",
    "Overwrite `C` by `A * B`. `order` indicates the looping order for triple loop.\n",
    "\"\"\"\n",
    "function matmul_by_loop!(A::Matrix, B::Matrix, C::Matrix, order::String)\n",
    "    \n",
    "    m = size(A, 1)\n",
    "    p = size(A, 2)\n",
    "    n = size(B, 2)\n",
    "    fill!(C, 0)\n",
    "    \n",
    "    if order == \"jki\"\n",
    "        for j = 1:n, k = 1:p, i = 1:m\n",
    "            C[i, j] += A[i, k] * B[k, j]\n",
    "        end\n",
    "    end\n",
    "\n",
    "    if order == \"kji\"\n",
    "        for k = 1:p, j = 1:n, i = 1:m\n",
    "            C[i, j] += A[i, k] * B[k, j]\n",
    "        end\n",
    "    end\n",
    "    \n",
    "    if order == \"ikj\"\n",
    "        for i = 1:m, k = 1:p, j = 1:n\n",
    "            C[i, j] += A[i, k] * B[k, j]\n",
    "        end\n",
    "    end\n",
    "\n",
    "    if order == \"kij\"\n",
    "        for k = 1:p, i = 1:m, j = 1:n\n",
    "            C[i, j] += A[i, k] * B[k, j]\n",
    "        end\n",
    "    end\n",
    "    \n",
    "    if order == \"ijk\"\n",
    "        for i = 1:m, j = 1:n, k = 1:p\n",
    "            C[i, j] += A[i, k] * B[k, j]\n",
    "        end\n",
    "    end\n",
    "    \n",
    "    if order == \"jik\"\n",
    "        for j = 1:n, i = 1:m, k = 1:p\n",
    "            C[i, j] += A[i, k] * B[k, j]\n",
    "        end\n",
    "    end\n",
    "    \n",
    "end\n",
    "\n",
    "using Random\n",
    "\n",
    "Random.seed!(123)\n",
    "m, p, n = 2000, 100, 2000\n",
    "A = rand(m, p)\n",
    "B = rand(p, n)\n",
    "C = zeros(m, n);"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* $jki$ and $kji$ looping:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "BenchmarkTools.Trial: \n",
       "  memory estimate:  0 bytes\n",
       "  allocs estimate:  0\n",
       "  --------------\n",
       "  minimum time:     276.404 ms (0.00% GC)\n",
       "  median time:      282.157 ms (0.00% GC)\n",
       "  mean time:        283.423 ms (0.00% GC)\n",
       "  maximum time:     291.927 ms (0.00% GC)\n",
       "  --------------\n",
       "  samples:          18\n",
       "  evals/sample:     1"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "using BenchmarkTools\n",
    "\n",
    "@benchmark matmul_by_loop!($A, $B, $C, \"jki\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "BenchmarkTools.Trial: \n",
       "  memory estimate:  0 bytes\n",
       "  allocs estimate:  0\n",
       "  --------------\n",
       "  minimum time:     335.573 ms (0.00% GC)\n",
       "  median time:      355.841 ms (0.00% GC)\n",
       "  mean time:        362.742 ms (0.00% GC)\n",
       "  maximum time:     409.550 ms (0.00% GC)\n",
       "  --------------\n",
       "  samples:          14\n",
       "  evals/sample:     1"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "@benchmark matmul_by_loop!($A, $B, $C, \"kji\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* $ikj$ and $kij$ looping:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "BenchmarkTools.Trial: \n",
       "  memory estimate:  0 bytes\n",
       "  allocs estimate:  0\n",
       "  --------------\n",
       "  minimum time:     2.405 s (0.00% GC)\n",
       "  median time:      2.406 s (0.00% GC)\n",
       "  mean time:        2.431 s (0.00% GC)\n",
       "  maximum time:     2.480 s (0.00% GC)\n",
       "  --------------\n",
       "  samples:          3\n",
       "  evals/sample:     1"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "@benchmark matmul_by_loop!($A, $B, $C, \"ikj\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "BenchmarkTools.Trial: \n",
       "  memory estimate:  0 bytes\n",
       "  allocs estimate:  0\n",
       "  --------------\n",
       "  minimum time:     2.714 s (0.00% GC)\n",
       "  median time:      2.726 s (0.00% GC)\n",
       "  mean time:        2.726 s (0.00% GC)\n",
       "  maximum time:     2.738 s (0.00% GC)\n",
       "  --------------\n",
       "  samples:          2\n",
       "  evals/sample:     1"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "@benchmark matmul_by_loop!($A, $B, $C, \"kij\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* $ijk$ and $jik$ looping:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "BenchmarkTools.Trial: \n",
       "  memory estimate:  0 bytes\n",
       "  allocs estimate:  0\n",
       "  --------------\n",
       "  minimum time:     960.266 ms (0.00% GC)\n",
       "  median time:      975.456 ms (0.00% GC)\n",
       "  mean time:        979.488 ms (0.00% GC)\n",
       "  maximum time:     1.001 s (0.00% GC)\n",
       "  --------------\n",
       "  samples:          6\n",
       "  evals/sample:     1"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "@benchmark matmul_by_loop!($A, $B, $C, \"ijk\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "BenchmarkTools.Trial: \n",
       "  memory estimate:  0 bytes\n",
       "  allocs estimate:  0\n",
       "  --------------\n",
       "  minimum time:     1.017 s (0.00% GC)\n",
       "  median time:      1.046 s (0.00% GC)\n",
       "  mean time:        1.040 s (0.00% GC)\n",
       "  maximum time:     1.058 s (0.00% GC)\n",
       "  --------------\n",
       "  samples:          5\n",
       "  evals/sample:     1"
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "@benchmark matmul_by_loop!($A, $B, $C, \"ijk\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* Julia wraps BLAS library for matrix multiplication. We see BLAS library wins hands down (multi-threading, Strassen algorithm, higher level-3 fraction by block outer product)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "BenchmarkTools.Trial: \n",
       "  memory estimate:  0 bytes\n",
       "  allocs estimate:  0\n",
       "  --------------\n",
       "  minimum time:     7.078 ms (0.00% GC)\n",
       "  median time:      7.229 ms (0.00% GC)\n",
       "  mean time:        7.805 ms (0.00% GC)\n",
       "  maximum time:     10.858 ms (0.00% GC)\n",
       "  --------------\n",
       "  samples:          640\n",
       "  evals/sample:     1"
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "@benchmark mul!($C, $A, $B)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "BenchmarkTools.Trial: \n",
       "  memory estimate:  0 bytes\n",
       "  allocs estimate:  0\n",
       "  --------------\n",
       "  minimum time:     7.209 ms (0.00% GC)\n",
       "  median time:      8.820 ms (0.00% GC)\n",
       "  mean time:        8.798 ms (0.00% GC)\n",
       "  maximum time:     11.884 ms (0.00% GC)\n",
       "  --------------\n",
       "  samples:          568\n",
       "  evals/sample:     1"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# direct call of BLAS wrapper function\n",
    "@benchmark LinearAlgebra.BLAS.gemm!('N', 'N', 1.0, $A, $B, 0.0, $C)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## BLAS in R\n",
    "\n",
    "* **Tip for R user**. Standard R distribution from CRAN uses a very out-dated BLAS/LAPACK library."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "RObject{VecSxp}\n",
       "Unit: milliseconds\n",
       "                expr      min       lq     mean   median       uq      max\n",
       " `#JL`$A %*% `#JL`$B 226.5638 234.8593 242.5344 242.8393 250.3439 259.4928\n",
       " neval\n",
       "   100\n"
      ]
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "using RCall\n",
    "\n",
    "R\"\"\"\n",
    "library(\"microbenchmark\")\n",
    "microbenchmark($A %*% $B)\n",
    "\"\"\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* Re-build R from source using OpenBLAS or MKL will immediately boost linear algebra performance in R. Google `build R using MKL` to get started. Similarly we can build Julia using MKL.\n",
    "\n",
    "* Matlab uses MKL. Usually it's very hard to beat Matlab in terms of linear algebra."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.0099658959695"
      ]
     },
     "execution_count": 24,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "using MATLAB\n",
    "\n",
    "mat\"\"\"\n",
    "f = @() $A * $B;\n",
    "timeit(f)\n",
    "\"\"\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Avoid memory allocation: some examples\n",
    "\n",
    "1. Transposing matrix is an expensive memory operation.  \n",
    "    - In R, the command \n",
    "        ```R\n",
    "        t(A) %*% x\n",
    "        ```\n",
    "    will first transpose `A` then perform matrix multiplication, causing unnecessary memory allocation\n",
    "    - Julia is smart to avoid transposing matrix if possible."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [],
   "source": [
    "using Random, LinearAlgebra, BenchmarkTools\n",
    "Random.seed!(123)\n",
    "\n",
    "n = 1000\n",
    "A = rand(n, n)\n",
    "x = rand(n);"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Transpose{Float64,Array{Float64,2}}"
      ]
     },
     "execution_count": 26,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "typeof(transpose(A))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(:parent,)"
      ]
     },
     "execution_count": 27,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "fieldnames(typeof(transpose(A)))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(Ptr{Float64} @0x0000000142a9c000, Ptr{Float64} @0x0000000142a9c000)"
      ]
     },
     "execution_count": 28,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# same data in tranpose(A) and original matrix A\n",
    "pointer(transpose(A).parent), pointer(A)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "BenchmarkTools.Trial: \n",
       "  memory estimate:  7.94 KiB\n",
       "  allocs estimate:  1\n",
       "  --------------\n",
       "  minimum time:     98.703 μs (0.00% GC)\n",
       "  median time:      141.025 μs (0.00% GC)\n",
       "  mean time:        156.860 μs (0.00% GC)\n",
       "  maximum time:     369.885 μs (0.00% GC)\n",
       "  --------------\n",
       "  samples:          10000\n",
       "  evals/sample:     1"
      ]
     },
     "execution_count": 29,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# dispatch to BLAS\n",
    "# does *not* actually transpose the matrix\n",
    "@benchmark transpose($A) * $x"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "BenchmarkTools.Trial: \n",
       "  memory estimate:  0 bytes\n",
       "  allocs estimate:  0\n",
       "  --------------\n",
       "  minimum time:     103.759 μs (0.00% GC)\n",
       "  median time:      139.496 μs (0.00% GC)\n",
       "  mean time:        157.336 μs (0.00% GC)\n",
       "  maximum time:     399.354 μs (0.00% GC)\n",
       "  --------------\n",
       "  samples:          10000\n",
       "  evals/sample:     1"
      ]
     },
     "execution_count": 30,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# pre-allocate result\n",
    "out = zeros(size(A, 2))\n",
    "@benchmark mul!($out, transpose($A), $x)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "BenchmarkTools.Trial: \n",
       "  memory estimate:  0 bytes\n",
       "  allocs estimate:  0\n",
       "  --------------\n",
       "  minimum time:     106.017 μs (0.00% GC)\n",
       "  median time:      137.537 μs (0.00% GC)\n",
       "  mean time:        146.369 μs (0.00% GC)\n",
       "  maximum time:     736.091 μs (0.00% GC)\n",
       "  --------------\n",
       "  samples:          10000\n",
       "  evals/sample:     1"
      ]
     },
     "execution_count": 31,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# or call BLAS wrapper directly\n",
    "@benchmark LinearAlgebra.BLAS.gemv!('T', 1.0, $A, $x, 0.0, $out)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "2. [Broadcasting](https://docs.julialang.org/en/v1/base/arrays/#Broadcast-and-vectorization-1) in Julia achieves vectorized code without creating intermediate arrays.\n",
    "\n",
    "    Suppose we want to calculate elementsize maximum of absolute values of two large arrays. In R or Matlab, the command\n",
    "```r\n",
    "max(abs(X), abs(Y))\n",
    "```\n",
    "will create two intermediate arrays and then one result array."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "RObject{VecSxp}\n",
       "Unit: milliseconds\n",
       "                            expr      min       lq    mean   median      uq\n",
       " max(abs(`#JL`$X), abs(`#JL`$Y)) 5.752931 6.047952 7.97932 6.821782 8.24881\n",
       "      max neval\n",
       " 26.40112   100\n"
      ]
     },
     "execution_count": 32,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "using RCall\n",
    "Random.seed!(123)\n",
    "X, Y = rand(1000, 1000), rand(1000, 1000)\n",
    "\n",
    "R\"\"\"\n",
    "library(microbenchmark)\n",
    "microbenchmark(max(abs($X), abs($Y)))\n",
    "\"\"\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In Julia, dot operations are fused so no intermediate arrays are created."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "BenchmarkTools.Trial: \n",
       "  memory estimate:  7.63 MiB\n",
       "  allocs estimate:  2\n",
       "  --------------\n",
       "  minimum time:     1.675 ms (0.00% GC)\n",
       "  median time:      1.903 ms (0.00% GC)\n",
       "  mean time:        2.529 ms (27.30% GC)\n",
       "  maximum time:     5.326 ms (54.28% GC)\n",
       "  --------------\n",
       "  samples:          1972\n",
       "  evals/sample:     1"
      ]
     },
     "execution_count": 33,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# no intermediate arrays created, only result array created\n",
    "@benchmark max.(abs.($X), abs.($Y))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Pre-allocating result array gets rid of memory allocation at all."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "BenchmarkTools.Trial: \n",
       "  memory estimate:  0 bytes\n",
       "  allocs estimate:  0\n",
       "  --------------\n",
       "  minimum time:     1.135 ms (0.00% GC)\n",
       "  median time:      1.197 ms (0.00% GC)\n",
       "  mean time:        1.282 ms (0.00% GC)\n",
       "  maximum time:     1.985 ms (0.00% GC)\n",
       "  --------------\n",
       "  samples:          3878\n",
       "  evals/sample:     1"
      ]
     },
     "execution_count": 34,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# no memory allocation at all!\n",
    "Z = zeros(size(X)) # zero matrix of same size as X\n",
    "@benchmark $Z .= max.(abs.($X), abs.($Y)) # .= (vs =) is important!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "3. [View](https://docs.julialang.org/en/v1/base/arrays/#Views-(SubArrays-and-other-view-types)-1) avoids creating extra copy of matrix data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "BenchmarkTools.Trial: \n",
       "  memory estimate:  488.39 KiB\n",
       "  allocs estimate:  2\n",
       "  --------------\n",
       "  minimum time:     70.431 μs (0.00% GC)\n",
       "  median time:      304.043 μs (0.00% GC)\n",
       "  mean time:        287.967 μs (14.85% GC)\n",
       "  maximum time:     2.843 ms (84.42% GC)\n",
       "  --------------\n",
       "  samples:          10000\n",
       "  evals/sample:     1"
      ]
     },
     "execution_count": 35,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "Random.seed!(123)\n",
    "A = randn(1000, 1000)\n",
    "\n",
    "# sum entries in a sub-matrix\n",
    "@benchmark sum($A[1:2:500, 1:2:500])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "BenchmarkTools.Trial: \n",
       "  memory estimate:  208 bytes\n",
       "  allocs estimate:  7\n",
       "  --------------\n",
       "  minimum time:     118.440 μs (0.00% GC)\n",
       "  median time:      128.581 μs (0.00% GC)\n",
       "  mean time:        128.805 μs (0.00% GC)\n",
       "  maximum time:     484.048 μs (0.00% GC)\n",
       "  --------------\n",
       "  samples:          10000\n",
       "  evals/sample:     1"
      ]
     },
     "execution_count": 36,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# view avoids creating a separate sub-matrix\n",
    "@benchmark sum(@view $A[1:2:500, 1:2:500])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "The [`@views`](https://docs.julialang.org/en/v1/base/arrays/#Base.@views) macro, which can be useful in [some operations](https://discourse.julialang.org/t/why-is-a-manual-in-place-addition-so-much-faster-than-and-on-range-indexed-arrays/3302)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "BenchmarkTools.Trial: \n",
       "  memory estimate:  208 bytes\n",
       "  allocs estimate:  7\n",
       "  --------------\n",
       "  minimum time:     118.367 μs (0.00% GC)\n",
       "  median time:      120.747 μs (0.00% GC)\n",
       "  mean time:        124.625 μs (0.00% GC)\n",
       "  maximum time:     342.429 μs (0.00% GC)\n",
       "  --------------\n",
       "  samples:          10000\n",
       "  evals/sample:     1"
      ]
     },
     "execution_count": 37,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "@benchmark @views sum($A[1:2:500, 1:2:500])"
   ]
  }
 ],
 "metadata": {
  "@webio": {
   "lastCommId": null,
   "lastKernelId": null
  },
  "kernelspec": {
   "display_name": "Julia 1.1.0",
   "language": "julia",
   "name": "julia-1.1"
  },
  "language_info": {
   "file_extension": ".jl",
   "mimetype": "application/julia",
   "name": "julia",
   "version": "1.1.0"
  },
  "toc": {
   "colors": {
    "hover_highlight": "#DAA520",
    "running_highlight": "#FF0000",
    "selected_highlight": "#FFD700"
   },
   "moveMenuLeft": true,
   "nav_menu": {
    "height": "103px",
    "width": "252px"
   },
   "navigate_menu": true,
   "number_sections": true,
   "sideBar": true,
   "skip_h1_title": true,
   "threshold": 4,
   "toc_cell": true,
   "toc_position": {
    "height": "689.8912963867188px",
    "left": "0px",
    "right": "619.4000244140625px",
    "top": "76.10054779052734px",
    "width": "224.36141967773438px"
   },
   "toc_section_display": "block",
   "toc_window_display": true,
   "widenNotebook": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}