{ "cells": [ { "cell_type": "markdown", "metadata": { "toc": "true" }, "source": [ "# Table of Contents\n", "

0.1  Numerical linear algebra: introduction
0.2  BLAS
0.3  Memory hierarchy and level-3 fraction
0.4  Effect of data layout
0.5  Avoid memory allocation: some examples
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Numerical linear algebra: introduction" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* Topics in numerical algebra: \n", " - BLAS \n", " - solve linear equations $\\mathbf{A} \\mathbf{x} = \\mathbf{b}$\n", " - regression computations $\\mathbf{X}^T \\mathbf{X} \\beta = \\mathbf{X}^T \\mathbf{y}$ \n", " - eigen-problems $\\mathbf{A} \\mathbf{x} = \\lambda \\mathbf{x}$ \n", " - generalized eigen-problems $\\mathbf{A} \\mathbf{x} = \\lambda \\mathbf{B} \\mathbf{x}$ \n", " - singular value decompositions $\\mathbf{A} = \\mathbf{U} \\Sigma \\mathbf{V}^T$ \n", " - iterative methods for numerical linear algebra \n", "\n", "* Except for the iterative methods, most of these numerical linear algebra tasks are implemented in the BLAS and LAPACK libraries. They form the **building blocks** of most statistical computing tasks (optimization, MCMC).\n", "\n", "* Our major **goal** (or learning objectives) is to \n", " 0. know the computational cost (flop count) of each task\n", " 0. be familiar with the BLAS and LAPACK functions (what they do) \n", " 0. do **not** re-invent wheels by implementing these dense linear algebra subroutines by yourself \n", " 0. understand the need for iterative methods \n", " 0. apply appropriate numerical algebra tools to various statistical problems \n", "\n", "* All high-level languages (R, Matlab, Julia) call BLAS and LAPACK for numerical linear algebra. \n", " - Julia offers more flexibility by exposing interfaces to many BLAS/LAPACK subroutines directly. See [documentation](https://docs.julialang.org/en/stable/stdlib/linalg/#BLAS-Functions-1)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## BLAS\n", "\n", "* BLAS stands for _basic linear algebra subprograms_. \n", "\n", "* See [netlib](http://www.netlib.org/blas/) for a complete list of standardized BLAS functions.\n", "\n", "* There are many implementations of BLAS. 
\n", " - [Netlib](http://www.netlib.org/blas/) provides a reference implementation \n", " - Matlab uses Intel's [MKL](https://software.intel.com/en-us/node/520724) (mathematical kernel libaries) \n", " - Julia uses [OpenBLAS](https://github.com/xianyi/OpenBLAS) \n", " - JuliaPro offers the option of using MKL\n", "\n", "* There are 3 levels of BLAS functions.\n", " - [Level 1](http://www.netlib.org/blas/#_level_1): vector-vector operation\n", " - [Level 2](http://www.netlib.org/blas/#_level_2): matrix-vector operation\n", " - [Level 3](http://www.netlib.org/blas/#_level_3): matrix-matrix operation\n", "\n", "| Level | Example Operation | Name | Dimension | Flops |\n", "|-------|----------------------------------------|-------------|-------------------------------------------|-------|\n", "| 1 | $\\alpha \\gets \\mathbf{x}^T \\mathbf{y}$ | dot product | $\\mathbf{x}, \\mathbf{y} \\in \\mathbb{R}^n$ | $2n$ |\n", "| | $\\mathbf{y} \\gets \\mathbf{y} + \\alpha \\mathbf{x}$ | axpy | $\\alpha \\in \\mathbb{R}$, $\\mathbf{x}, \\mathbf{y} \\in \\mathbb{R}^n$ | $2n$ |\n", "| 2 | $\\mathbf{y} \\gets \\mathbf{y} + \\mathbf{A} \\mathbf{x}$ | gaxpy | $\\mathbf{A} \\in \\mathbb{R}^{m \\times n}$, $\\mathbf{x} \\in \\mathbb{R}^n$, $\\mathbf{y} \\in \\mathbb{R}^m$ | $2mn$ |\n", "| | $\\mathbf{A} \\gets \\mathbf{A} + \\mathbf{y} \\mathbf{x}^T$ | rank one update | $\\mathbf{A} \\in \\mathbb{R}^{m \\times n}$, $\\mathbf{x} \\in \\mathbb{R}^n$, $\\mathbf{y} \\in \\mathbb{R}^m$ | $2mn$ |\n", "| 3 | $\\mathbf{C} \\gets \\mathbf{C} + \\mathbf{A} \\mathbf{B}$ | matrix multiplication | $\\mathbf{A} \\in \\mathbb{R}^{m \\times p}$, $\\mathbf{B} \\in \\mathbb{R}^{p \\times n}$, $\\mathbf{C} \\in \\mathbb{R}^{m \\times n}$ | $2mnp$ |\n", "\n", "* Typical BLAS functions support single precision (S), double precision (D), complex (C), and double complex (Z). \n", "\n", "* Some operations _appear_ as level-3 but indeed are level-2. \n", " - A common operation in statistics is column scaling or row scaling\n", " $$\n", " \\begin{eqnarray*}\n", " \\mathbf{A} &=& \\mathbf{A} \\mathbf{D} \\quad \\text{(column scaling)} \\\\\n", " \\mathbf{A} &=& \\mathbf{D} \\mathbf{A} \\quad \\text{(row scaling)},\n", " \\end{eqnarray*}\n", " $$\n", " where $\\mathbf{D}$ is diagonal. \n", " - These are essentially level-2 operations!" 
] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "2000-element Array{Float64,1}:\n", " 0.763192\n", " 0.668759\n", " 0.709337\n", " 0.088416\n", " 0.289151\n", " 0.375468\n", " 0.437356\n", " 0.179474\n", " 0.122238\n", " 0.895783\n", " 0.332146\n", " 0.206425\n", " 0.747789\n", " ⋮ \n", " 0.484487\n", " 0.567484\n", " 0.75455 \n", " 0.703483\n", " 0.166205\n", " 0.754612\n", " 0.231834\n", " 0.769243\n", " 0.805681\n", " 0.553389\n", " 0.450904\n", " 0.814614" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "using BenchmarkTools\n", "\n", "srand(123) # seed\n", "n = 2000\n", "A = rand(n, n)\n", "d = rand(n) # d vector" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "2000×2000 Array{Float64,2}:\n", " 0.763192 0.0 0.0 0.0 … 0.0 0.0 0.0 \n", " 0.0 0.668759 0.0 0.0 0.0 0.0 0.0 \n", " 0.0 0.0 0.709337 0.0 0.0 0.0 0.0 \n", " 0.0 0.0 0.0 0.088416 0.0 0.0 0.0 \n", " 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", " 0.0 0.0 0.0 0.0 … 0.0 0.0 0.0 \n", " 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", " 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", " 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", " 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", " 0.0 0.0 0.0 0.0 … 0.0 0.0 0.0 \n", " 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", " 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", " ⋮ ⋱ \n", " 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", " 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", " 0.0 0.0 0.0 0.0 … 0.0 0.0 0.0 \n", " 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", " 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", " 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", " 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", " 0.0 0.0 0.0 0.0 … 0.0 0.0 0.0 \n", " 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", " 0.0 0.0 0.0 0.0 0.553389 0.0 0.0 \n", " 0.0 0.0 0.0 0.0 0.0 0.450904 0.0 \n", " 0.0 0.0 0.0 0.0 0.0 0.0 0.814614" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "D = diagm(d) # diagonal matrix with d as diagonal" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "BenchmarkTools.Trial: \n", " memory estimate: 30.52 MiB\n", " allocs estimate: 2\n", " --------------\n", " minimum time: 100.553 ms (0.35% GC)\n", " median time: 109.386 ms (2.32% GC)\n", " mean time: 115.684 ms (3.36% GC)\n", " maximum time: 226.789 ms (1.78% GC)\n", " --------------\n", " samples: 44\n", " evals/sample: 1" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# this is calling BLAS routine for matrix multiplication: O(n^3) flops\n", "@benchmark A * D" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "2000×2000 Diagonal{Float64}:\n", " 0.763192 ⋅ ⋅ ⋅ … ⋅ ⋅ ⋅ \n", " ⋅ 0.668759 ⋅ ⋅ ⋅ ⋅ ⋅ \n", " ⋅ ⋅ 0.709337 ⋅ ⋅ ⋅ ⋅ \n", " ⋅ ⋅ ⋅ 0.088416 ⋅ ⋅ ⋅ \n", " ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ \n", " ⋅ ⋅ ⋅ ⋅ … ⋅ ⋅ ⋅ \n", " ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ \n", " ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ \n", " ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ \n", " ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ \n", " ⋅ ⋅ ⋅ ⋅ … ⋅ ⋅ ⋅ \n", " ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ \n", " ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ \n", " ⋮ ⋱ \n", " ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ \n", " ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ \n", " ⋅ ⋅ ⋅ ⋅ … ⋅ ⋅ ⋅ \n", " ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ \n", " ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ \n", " ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ \n", " ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ \n", " ⋅ ⋅ ⋅ ⋅ … ⋅ ⋅ ⋅ \n", " ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ \n", " ⋅ ⋅ ⋅ ⋅ 0.553389 ⋅ ⋅ \n", " ⋅ ⋅ ⋅ ⋅ ⋅ 0.450904 ⋅ \n", " ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 0.814614" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "Diagonal(d)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "BenchmarkTools.Trial: \n", " memory 
estimate: 30.52 MiB\n", " allocs estimate: 3\n", " --------------\n", " minimum time: 7.149 ms (2.63% GC)\n", " median time: 8.212 ms (24.38% GC)\n", " mean time: 8.453 ms (25.21% GC)\n", " maximum time: 65.978 ms (88.55% GC)\n", " --------------\n", " samples: 591\n", " evals/sample: 1" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# current way for columnwise scaling: O(n^2) flops\n", "@benchmark A * Diagonal(d)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "*(A::AbstractArray{T,2} where T, D::Diagonal) at linalg/diagonal.jl:152" ], "text/plain": [ "*(A::AbstractArray{T,2} where T, D::Diagonal) in Base.LinAlg at linalg/diagonal.jl:152" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "@which A * Diagonal(d)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "BenchmarkTools.Trial: \n", " memory estimate: 0 bytes\n", " allocs estimate: 0\n", " --------------\n", " minimum time: 4.479 ms (0.00% GC)\n", " median time: 11.387 ms (0.00% GC)\n", " mean time: 11.209 ms (0.00% GC)\n", " maximum time: 20.322 ms (0.00% GC)\n", " --------------\n", " samples: 447\n", " evals/sample: 1" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# in-place: avoid allocate space for result\n", "@benchmark scale!(A, d)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Memory hierarchy and level-3 fraction\n", "\n", "> **Key to high performance is effective use of memory hierarchy. True on all architectures.**\n", "\n", "* Flop count is not the sole determinant of algorithm efficiency. Another important factor is data movement through the memory hierarchy.\n", "\n", "\n", "\n", " \n", "\n", "\n", "\n", "* Numbers everyone should know\n", "\n", "| Operation | Time |\n", "|-------------------------------------|----------------|\n", "| L1 cache reference | 0.5 ns |\n", "| L2 cache reference | 7 ns |\n", "| Main memory reference | 100 ns |\n", "| Read 1 MB sequentially from memory | 250,000 ns |\n", "| Read 1 MB sequentially from SSD | 1,000,000 ns | \n", "| Read 1 MB sequentially from disk | 20,000,000 ns |\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", " Source: \n", "\n", "* For example, Xeon X5650 CPU has a theoretical throughput of 128 DP GFLOPS but a max memory bandwidth of 32GB/s. \n", "\n", "* Can we keep CPU cores busy with enough deliveries of matrix data and ship the results to memory fast enough to avoid backlog? \n", "Answer: use **high-level BLAS** as much as possible.\n", "\n", "| BLAS | Dimension | Mem. Refs. 
| Flops | Ratio |\n", "|----------------------------------------------------------------|-------------------------------------------------------------------------------------|------------|--------|-------|\n", "| Level 1: $\\mathbf{y} \\gets \\mathbf{y} + \\alpha \\mathbf{x}$ | $\\mathbf{x}, \\mathbf{y} \\in \\mathbb{R}^n$ | $3n$ | $2n$ | 3:2 |\n", "| Level 2: $\\mathbf{y} \\gets \\mathbf{y} + \\mathbf{A} \\mathbf{x}$ | $\\mathbf{x}, \\mathbf{y} \\in \\mathbb{R}^n$, $\\mathbf{A} \\in \\mathbb{R}^{n \\times n}$ | $n^2$ | $2n^2$ | 1:2 |\n", "| Level 3: $\\mathbf{C} \\gets \\mathbf{C} + \\mathbf{A} \\mathbf{B}$ | $\\mathbf{A}, \\mathbf{B}, \\mathbf{C} \\in \\mathbb{R}^{n \\times n}$ | $4n^2$ | $2n^3$ | 2:n |\n", "\n", "* Higher level BLAS (level 3 or 2) makes more effective use of the arithmetic logic units (ALUs) by keeping them busy; this is the **surface-to-volume** effect. \n", "See [Dongarra slides](https://www.samsi.info/wp-content/uploads/2017/02/SAMSI-0217_Dongarra.pdf).\n", "\n", "* A distinction between LAPACK and LINPACK (older versions of R use LINPACK) is that LAPACK makes use of higher level BLAS as much as possible (usually by smart partitioning) to increase the so-called **level-3 fraction**." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Effect of data layout\n", "\n", "* Data layout in memory affects algorithmic efficiency too. It is much faster to move contiguous chunks of data in memory than to retrieve/write scattered data.\n", "\n", "* Storage mode: **column-major** (Fortran, Matlab, R, Julia) vs **row-major** (C/C++).\n", "\n", "* A **cache line** is the smallest unit of data that can be transferred between cache and main memory.\n", "    - x86 CPUs: 64 bytes \n", "    - ARM CPUs: 32 bytes\n", "\n", "* Accessing a column-major stored matrix by rows causes lots of **cache misses**.\n", "\n", "* Take matrix multiplication as an example \n", "$$ \n", "\\mathbf{C} \\gets \\mathbf{C} + \\mathbf{A} \\mathbf{B}, \\quad \\mathbf{A} \\in \\mathbb{R}^{m \\times p}, \\mathbf{B} \\in \\mathbb{R}^{p \\times n}, \\mathbf{C} \\in \\mathbb{R}^{m \\times n}.\n", "$$\n", "Assume the storage is column-major, as in Julia. There are 6 variants of the algorithm according to the ordering of the triple loop. \n", "    - `jki` or `kji` looping:\n", "    ```julia\n", "    # innermost loop\n", "    for i = 1:m\n", "        C[i, j] = C[i, j] + A[i, k] * B[k, j]\n", "    end\n", "    ``` \n", "    - `ikj` or `kij` looping:\n", "    ```julia\n", "    # innermost loop \n", "    for j = 1:n\n", "        C[i, j] = C[i, j] + A[i, k] * B[k, j]\n", "    end\n", "    ``` \n", "    - `ijk` or `jik` looping:\n", "    ```julia\n", "    # innermost loop \n", "    for k = 1:p\n", "        C[i, j] = C[i, j] + A[i, k] * B[k, j]\n", "    end\n", "    ```\n", "* We pay attention to the innermost loop, where the vector calculation occurs. The associated **stride** when accessing the three matrices in memory (assuming column-major storage) is \n", "\n", "| Variant | A Stride | B Stride | C Stride |\n", "|----------------|----------|----------|----------|\n", "| $jki$ or $kji$ | Unit | 0 | Unit |\n", "| $ikj$ or $kij$ | 0 | Non-Unit | Non-Unit |\n", "| $ijk$ or $jik$ | Non-Unit | Unit | 0 | \n", "Clearly the variants $jki$ or $kji$, which access memory with unit (or zero) stride, are preferred." ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": true }, "outputs": [], "source": [ "\"\"\"\n", "    matmul_by_loop!(A, B, C, order)\n", "\n", "Overwrite `C` by `A * B`. 
`order` indicates the looping order for triple loop.\n", "\"\"\"\n", "function matmul_by_loop!(A::Matrix, B::Matrix, C::Matrix, order::String)\n", " \n", " m = size(A, 1)\n", " p = size(A, 2)\n", " n = size(B, 2)\n", " fill!(C, 0)\n", " \n", " if order == \"jki\"\n", " for j = 1:n, k = 1:p, i = 1:m\n", " C[i, j] += A[i, k] * B[k, j]\n", " end\n", " end\n", "\n", " if order == \"kji\"\n", " for k = 1:p, j = 1:n, i = 1:m\n", " C[i, j] += A[i, k] * B[k, j]\n", " end\n", " end\n", " \n", " if order == \"ikj\"\n", " for i = 1:m, k = 1:p, j = 1:n\n", " C[i, j] += A[i, k] * B[k, j]\n", " end\n", " end\n", "\n", " if order == \"kij\"\n", " for k = 1:p, i = 1:m, j = 1:n\n", " C[i, j] += A[i, k] * B[k, j]\n", " end\n", " end\n", " \n", " if order == \"ijk\"\n", " for i = 1:m, j = 1:n, k = 1:p\n", " C[i, j] += A[i, k] * B[k, j]\n", " end\n", " end\n", " \n", " if order == \"jik\"\n", " for j = 1:n, i = 1:m, k = 1:p\n", " C[i, j] += A[i, k] * B[k, j]\n", " end\n", " end\n", " \n", "end\n", "\n", "srand(123)\n", "m, n, p = 2000, 100, 2000\n", "A = rand(m, n)\n", "B = rand(n, p)\n", "C = zeros(m, p);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* $jki$ and $kji$ looping:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "BenchmarkTools.Trial: \n", " memory estimate: 0 bytes\n", " allocs estimate: 0\n", " --------------\n", " minimum time: 430.149 ms (0.00% GC)\n", " median time: 444.670 ms (0.00% GC)\n", " mean time: 469.967 ms (0.00% GC)\n", " maximum time: 563.758 ms (0.00% GC)\n", " --------------\n", " samples: 11\n", " evals/sample: 1" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "using BenchmarkTools\n", "\n", "@benchmark matmul_by_loop!(A, B, C, \"jki\")" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "BenchmarkTools.Trial: \n", " memory estimate: 0 bytes\n", " allocs estimate: 0\n", " --------------\n", " minimum time: 473.348 ms (0.00% GC)\n", " median time: 483.225 ms (0.00% GC)\n", " mean time: 486.455 ms (0.00% GC)\n", " maximum time: 509.115 ms (0.00% GC)\n", " --------------\n", " samples: 11\n", " evals/sample: 1" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "@benchmark matmul_by_loop!(A, B, C, \"kji\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* $ikj$ and $kij$ looping:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "BenchmarkTools.Trial: \n", " memory estimate: 0 bytes\n", " allocs estimate: 0\n", " --------------\n", " minimum time: 2.383 s (0.00% GC)\n", " median time: 2.643 s (0.00% GC)\n", " mean time: 2.643 s (0.00% GC)\n", " maximum time: 2.903 s (0.00% GC)\n", " --------------\n", " samples: 2\n", " evals/sample: 1" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "@benchmark matmul_by_loop!(A, B, C, \"ikj\")" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "BenchmarkTools.Trial: \n", " memory estimate: 0 bytes\n", " allocs estimate: 0\n", " --------------\n", " minimum time: 2.403 s (0.00% GC)\n", " median time: 2.411 s (0.00% GC)\n", " mean time: 2.411 s (0.00% GC)\n", " maximum time: 2.419 s (0.00% GC)\n", " --------------\n", " samples: 3\n", " evals/sample: 1" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "@benchmark 
matmul_by_loop!(A, B, C, \"kij\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* $ijk$ and $jik$ looping:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "BenchmarkTools.Trial: \n", " memory estimate: 0 bytes\n", " allocs estimate: 0\n", " --------------\n", " minimum time: 952.224 ms (0.00% GC)\n", " median time: 958.913 ms (0.00% GC)\n", " mean time: 1.042 s (0.00% GC)\n", " maximum time: 1.344 s (0.00% GC)\n", " --------------\n", " samples: 5\n", " evals/sample: 1" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "@benchmark matmul_by_loop!(A, B, C, \"ijk\")" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "BenchmarkTools.Trial: \n", " memory estimate: 0 bytes\n", " allocs estimate: 0\n", " --------------\n", " minimum time: 951.335 ms (0.00% GC)\n", " median time: 994.415 ms (0.00% GC)\n", " mean time: 1.033 s (0.00% GC)\n", " maximum time: 1.253 s (0.00% GC)\n", " --------------\n", " samples: 5\n", " evals/sample: 1" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "@benchmark matmul_by_loop!(A, B, C, \"ijk\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* Julia wraps BLAS library for matrix multiplication. We see BLAS library wins hands down (multi-threading, Strassen algorithm, higher level-3 fraction by block outer product)." ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/html": [ "A_mul_B!{T<:Union{Complex{Float32}, Complex{Float64}, Float32, Float64}}(C::Union{Base.ReshapedArray{T,2,A,MI} where MI<:Tuple{Vararg{Base.MultiplicativeInverses.SignedMultiplicativeInverse{Int64},N} where N} where A<:Union{DenseArray, SubArray{T,N,P,I,true} where I<:Tuple{Union{Base.Slice, UnitRange},Vararg{Any,N} where N} where P where N where T}, DenseArray{T,2}, SubArray{T,2,A,I,L} where L} where I<:Tuple{Vararg{Union{Base.AbstractCartesianIndex, Int64, Range{Int64}},N} where N} where A<:Union{Base.ReshapedArray{T,N,A,MI} where MI<:Tuple{Vararg{Base.MultiplicativeInverses.SignedMultiplicativeInverse{Int64},N} where N} where A<:Union{DenseArray, SubArray{T,N,P,I,true} where I<:Tuple{Union{Base.Slice, UnitRange},Vararg{Any,N} where N} where P where N where T} where N where T, DenseArray}, A::Union{Union{Base.ReshapedArray{T,1,A,MI} where MI<:Tuple{Vararg{Base.MultiplicativeInverses.SignedMultiplicativeInverse{Int64},N} where N} where A<:Union{DenseArray, SubArray{T,N,P,I,true} where I<:Tuple{Union{Base.Slice, UnitRange},Vararg{Any,N} where N} where P where N where T}, DenseArray{T,1}, SubArray{T,1,A,I,L} where L} where I<:Tuple{Vararg{Union{Base.AbstractCartesianIndex, Int64, Range{Int64}},N} where N} where A<:Union{Base.ReshapedArray{T,N,A,MI} where MI<:Tuple{Vararg{Base.MultiplicativeInverses.SignedMultiplicativeInverse{Int64},N} where N} where A<:Union{DenseArray, SubArray{T,N,P,I,true} where I<:Tuple{Union{Base.Slice, UnitRange},Vararg{Any,N} where N} where P where N where T} where N where T, DenseArray}, Union{Base.ReshapedArray{T,2,A,MI} where MI<:Tuple{Vararg{Base.MultiplicativeInverses.SignedMultiplicativeInverse{Int64},N} where N} where A<:Union{DenseArray, SubArray{T,N,P,I,true} where I<:Tuple{Union{Base.Slice, UnitRange},Vararg{Any,N} where N} where P where N where T}, DenseArray{T,2}, SubArray{T,2,A,I,L} where L} where I<:Tuple{Vararg{Union{Base.AbstractCartesianIndex, Int64, Range{Int64}},N} where 
N} where A<:Union{Base.ReshapedArray{T,N,A,MI} where MI<:Tuple{Vararg{Base.MultiplicativeInverses.SignedMultiplicativeInverse{Int64},N} where N} where A<:Union{DenseArray, SubArray{T,N,P,I,true} where I<:Tuple{Union{Base.Slice, UnitRange},Vararg{Any,N} where N} where P where N where T} where N where T, DenseArray}}, B::Union{Union{Base.ReshapedArray{T,1,A,MI} where MI<:Tuple{Vararg{Base.MultiplicativeInverses.SignedMultiplicativeInverse{Int64},N} where N} where A<:Union{DenseArray, SubArray{T,N,P,I,true} where I<:Tuple{Union{Base.Slice, UnitRange},Vararg{Any,N} where N} where P where N where T}, DenseArray{T,1}, SubArray{T,1,A,I,L} where L} where I<:Tuple{Vararg{Union{Base.AbstractCartesianIndex, Int64, Range{Int64}},N} where N} where A<:Union{Base.ReshapedArray{T,N,A,MI} where MI<:Tuple{Vararg{Base.MultiplicativeInverses.SignedMultiplicativeInverse{Int64},N} where N} where A<:Union{DenseArray, SubArray{T,N,P,I,true} where I<:Tuple{Union{Base.Slice, UnitRange},Vararg{Any,N} where N} where P where N where T} where N where T, DenseArray}, Union{Base.ReshapedArray{T,2,A,MI} where MI<:Tuple{Vararg{Base.MultiplicativeInverses.SignedMultiplicativeInverse{Int64},N} where N} where A<:Union{DenseArray, SubArray{T,N,P,I,true} where I<:Tuple{Union{Base.Slice, UnitRange},Vararg{Any,N} where N} where P where N where T}, DenseArray{T,2}, SubArray{T,2,A,I,L} where L} where I<:Tuple{Vararg{Union{Base.AbstractCartesianIndex, Int64, Range{Int64}},N} where N} where A<:Union{Base.ReshapedArray{T,N,A,MI} where MI<:Tuple{Vararg{Base.MultiplicativeInverses.SignedMultiplicativeInverse{Int64},N} where N} where A<:Union{DenseArray, SubArray{T,N,P,I,true} where I<:Tuple{Union{Base.Slice, UnitRange},Vararg{Any,N} where N} where P where N where T} where N where T, DenseArray}}) at linalg/matmul.jl:148" ], "text/plain": [ "A_mul_B!(C::Union{Base.ReshapedArray{T,2,A,MI} where MI<:Tuple{Vararg{Base.MultiplicativeInverses.SignedMultiplicativeInverse{Int64},N} where N} where A<:Union{DenseArray, SubArray{T,N,P,I,true} where I<:Tuple{Union{Base.Slice, UnitRange},Vararg{Any,N} where N} where P where N where T}, DenseArray{T,2}, SubArray{T,2,A,I,L} where L} where I<:Tuple{Vararg{Union{Base.AbstractCartesianIndex, Int64, Range{Int64}},N} where N} where A<:Union{Base.ReshapedArray{T,N,A,MI} where MI<:Tuple{Vararg{Base.MultiplicativeInverses.SignedMultiplicativeInverse{Int64},N} where N} where A<:Union{DenseArray, SubArray{T,N,P,I,true} where I<:Tuple{Union{Base.Slice, UnitRange},Vararg{Any,N} where N} where P where N where T} where N where T, DenseArray}, A::Union{Union{Base.ReshapedArray{T,1,A,MI} where MI<:Tuple{Vararg{Base.MultiplicativeInverses.SignedMultiplicativeInverse{Int64},N} where N} where A<:Union{DenseArray, SubArray{T,N,P,I,true} where I<:Tuple{Union{Base.Slice, UnitRange},Vararg{Any,N} where N} where P where N where T}, DenseArray{T,1}, SubArray{T,1,A,I,L} where L} where I<:Tuple{Vararg{Union{Base.AbstractCartesianIndex, Int64, Range{Int64}},N} where N} where A<:Union{Base.ReshapedArray{T,N,A,MI} where MI<:Tuple{Vararg{Base.MultiplicativeInverses.SignedMultiplicativeInverse{Int64},N} where N} where A<:Union{DenseArray, SubArray{T,N,P,I,true} where I<:Tuple{Union{Base.Slice, UnitRange},Vararg{Any,N} where N} where P where N where T} where N where T, DenseArray}, Union{Base.ReshapedArray{T,2,A,MI} where MI<:Tuple{Vararg{Base.MultiplicativeInverses.SignedMultiplicativeInverse{Int64},N} where N} where A<:Union{DenseArray, SubArray{T,N,P,I,true} where I<:Tuple{Union{Base.Slice, UnitRange},Vararg{Any,N} where N} where 
P where N where T}, DenseArray{T,2}, SubArray{T,2,A,I,L} where L} where I<:Tuple{Vararg{Union{Base.AbstractCartesianIndex, Int64, Range{Int64}},N} where N} where A<:Union{Base.ReshapedArray{T,N,A,MI} where MI<:Tuple{Vararg{Base.MultiplicativeInverses.SignedMultiplicativeInverse{Int64},N} where N} where A<:Union{DenseArray, SubArray{T,N,P,I,true} where I<:Tuple{Union{Base.Slice, UnitRange},Vararg{Any,N} where N} where P where N where T} where N where T, DenseArray}}, B::Union{Union{Base.ReshapedArray{T,1,A,MI} where MI<:Tuple{Vararg{Base.MultiplicativeInverses.SignedMultiplicativeInverse{Int64},N} where N} where A<:Union{DenseArray, SubArray{T,N,P,I,true} where I<:Tuple{Union{Base.Slice, UnitRange},Vararg{Any,N} where N} where P where N where T}, DenseArray{T,1}, SubArray{T,1,A,I,L} where L} where I<:Tuple{Vararg{Union{Base.AbstractCartesianIndex, Int64, Range{Int64}},N} where N} where A<:Union{Base.ReshapedArray{T,N,A,MI} where MI<:Tuple{Vararg{Base.MultiplicativeInverses.SignedMultiplicativeInverse{Int64},N} where N} where A<:Union{DenseArray, SubArray{T,N,P,I,true} where I<:Tuple{Union{Base.Slice, UnitRange},Vararg{Any,N} where N} where P where N where T} where N where T, DenseArray}, Union{Base.ReshapedArray{T,2,A,MI} where MI<:Tuple{Vararg{Base.MultiplicativeInverses.SignedMultiplicativeInverse{Int64},N} where N} where A<:Union{DenseArray, SubArray{T,N,P,I,true} where I<:Tuple{Union{Base.Slice, UnitRange},Vararg{Any,N} where N} where P where N where T}, DenseArray{T,2}, SubArray{T,2,A,I,L} where L} where I<:Tuple{Vararg{Union{Base.AbstractCartesianIndex, Int64, Range{Int64}},N} where N} where A<:Union{Base.ReshapedArray{T,N,A,MI} where MI<:Tuple{Vararg{Base.MultiplicativeInverses.SignedMultiplicativeInverse{Int64},N} where N} where A<:Union{DenseArray, SubArray{T,N,P,I,true} where I<:Tuple{Union{Base.Slice, UnitRange},Vararg{Any,N} where N} where P where N where T} where N where T, DenseArray}}) where T<:Union{Complex{Float32}, Complex{Float64}, Float32, Float64} in Base.LinAlg at linalg/matmul.jl:148" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "@which A_mul_B!(C, A, B)" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "BenchmarkTools.Trial: \n", " memory estimate: 0 bytes\n", " allocs estimate: 0\n", " --------------\n", " minimum time: 7.096 ms (0.00% GC)\n", " median time: 7.280 ms (0.00% GC)\n", " mean time: 7.448 ms (0.00% GC)\n", " maximum time: 10.894 ms (0.00% GC)\n", " --------------\n", " samples: 670\n", " evals/sample: 1" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "@benchmark A_mul_B!(C, A, B)" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "BenchmarkTools.Trial: \n", " memory estimate: 0 bytes\n", " allocs estimate: 0\n", " --------------\n", " minimum time: 4.842 ms (0.00% GC)\n", " median time: 5.061 ms (0.00% GC)\n", " mean time: 5.705 ms (0.00% GC)\n", " maximum time: 16.833 ms (0.00% GC)\n", " --------------\n", " samples: 874\n", " evals/sample: 1" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "@benchmark Base.LinAlg.BLAS.gemm!('N', 'N', 1.0, A, B, 1.0, C)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To appreciate the efforts in an optimized BLAS implementation such as OpenBLAS (evolved from GotoBLAS), see the [Quora 
question](https://www.quora.com/What-algorithm-does-BLAS-use-for-matrix-multiplication-Of-all-the-considerations-e-g-cache-popular-instruction-sets-Big-O-etc-which-one-turned-out-to-be-the-primary-bottleneck), especially the [video](https://youtu.be/JzNpKDW07rw). Bottomline is \n", "\n", "> **Get familiar with (good implementations of) BLAS/LAPACK and use them as much as possible.**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Avoid memory allocation: some examples\n", "\n", "* Transposing matrix is an expensive memory operation. \n", " - In R, the command \n", " ```R\n", " t(A) %*% x\n", " ```\n", " will first transpose `A` then perform matrix multiplication, causing unnecessary memory allocation\n", " - Julia is smart to avoid transposing matrix if possible." ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "BenchmarkTools.Trial: \n", " memory estimate: 7.94 KiB\n", " allocs estimate: 1\n", " --------------\n", " minimum time: 108.709 μs (0.00% GC)\n", " median time: 136.560 μs (0.00% GC)\n", " mean time: 141.652 μs (0.00% GC)\n", " maximum time: 1.009 ms (0.00% GC)\n", " --------------\n", " samples: 10000\n", " evals/sample: 1" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "srand(123)\n", "\n", "n = 1000\n", "A = rand(n, n)\n", "x = rand(n)\n", "\n", "# dispatch to At_mul_B (and then to BLAS)\n", "# does *not* actually transpose the matrix\n", "@benchmark A' * x" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/html": [ "*{T<:Union{Complex{Float32}, Complex{Float64}, Float32, Float64}, S}(A::Union{Base.ReshapedArray{T,2,A,MI} where MI<:Tuple{Vararg{Base.MultiplicativeInverses.SignedMultiplicativeInverse{Int64},N} where N} where A<:Union{DenseArray, SubArray{T,N,P,I,true} where I<:Tuple{Union{Base.Slice, UnitRange},Vararg{Any,N} where N} where P where N where T}, DenseArray{T,2}, SubArray{T,2,A,I,L} where L} where I<:Tuple{Vararg{Union{Base.AbstractCartesianIndex, Int64, Range{Int64}},N} where N} where A<:Union{Base.ReshapedArray{T,N,A,MI} where MI<:Tuple{Vararg{Base.MultiplicativeInverses.SignedMultiplicativeInverse{Int64},N} where N} where A<:Union{DenseArray, SubArray{T,N,P,I,true} where I<:Tuple{Union{Base.Slice, UnitRange},Vararg{Any,N} where N} where P where N where T} where N where T, DenseArray}, x::Union{Base.ReshapedArray{S,1,A,MI} where MI<:Tuple{Vararg{Base.MultiplicativeInverses.SignedMultiplicativeInverse{Int64},N} where N} where A<:Union{DenseArray, SubArray{T,N,P,I,true} where I<:Tuple{Union{Base.Slice, UnitRange},Vararg{Any,N} where N} where P where N where T}, DenseArray{S,1}, SubArray{S,1,A,I,L} where L} where I<:Tuple{Vararg{Union{Base.AbstractCartesianIndex, Int64, Range{Int64}},N} where N} where A<:Union{Base.ReshapedArray{T,N,A,MI} where MI<:Tuple{Vararg{Base.MultiplicativeInverses.SignedMultiplicativeInverse{Int64},N} where N} where A<:Union{DenseArray, SubArray{T,N,P,I,true} where I<:Tuple{Union{Base.Slice, UnitRange},Vararg{Any,N} where N} where P where N where T} where N where T, DenseArray}) at linalg/matmul.jl:74" ], "text/plain": [ "*(A::Union{Base.ReshapedArray{T,2,A,MI} where MI<:Tuple{Vararg{Base.MultiplicativeInverses.SignedMultiplicativeInverse{Int64},N} where N} where A<:Union{DenseArray, SubArray{T,N,P,I,true} where I<:Tuple{Union{Base.Slice, UnitRange},Vararg{Any,N} where N} where P where N where T}, DenseArray{T,2}, SubArray{T,2,A,I,L} where L} where 
I<:Tuple{Vararg{Union{Base.AbstractCartesianIndex, Int64, Range{Int64}},N} where N} where A<:Union{Base.ReshapedArray{T,N,A,MI} where MI<:Tuple{Vararg{Base.MultiplicativeInverses.SignedMultiplicativeInverse{Int64},N} where N} where A<:Union{DenseArray, SubArray{T,N,P,I,true} where I<:Tuple{Union{Base.Slice, UnitRange},Vararg{Any,N} where N} where P where N where T} where N where T, DenseArray}, x::Union{Base.ReshapedArray{S,1,A,MI} where MI<:Tuple{Vararg{Base.MultiplicativeInverses.SignedMultiplicativeInverse{Int64},N} where N} where A<:Union{DenseArray, SubArray{T,N,P,I,true} where I<:Tuple{Union{Base.Slice, UnitRange},Vararg{Any,N} where N} where P where N where T}, DenseArray{S,1}, SubArray{S,1,A,I,L} where L} where I<:Tuple{Vararg{Union{Base.AbstractCartesianIndex, Int64, Range{Int64}},N} where N} where A<:Union{Base.ReshapedArray{T,N,A,MI} where MI<:Tuple{Vararg{Base.MultiplicativeInverses.SignedMultiplicativeInverse{Int64},N} where N} where A<:Union{DenseArray, SubArray{T,N,P,I,true} where I<:Tuple{Union{Base.Slice, UnitRange},Vararg{Any,N} where N} where P where N where T} where N where T, DenseArray}) where {T<:Union{Complex{Float32}, Complex{Float64}, Float32, Float64}, S} in Base.LinAlg at linalg/matmul.jl:74" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "@which A' * x" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "BenchmarkTools.Trial: \n", " memory estimate: 7.94 KiB\n", " allocs estimate: 1\n", " --------------\n", " minimum time: 110.028 μs (0.00% GC)\n", " median time: 150.350 μs (0.00% GC)\n", " mean time: 203.158 μs (0.00% GC)\n", " maximum time: 2.248 ms (0.00% GC)\n", " --------------\n", " samples: 10000\n", " evals/sample: 1" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# dispatch to BLAS\n", "@benchmark At_mul_B(A, x)" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "BenchmarkTools.Trial: \n", " memory estimate: 7.64 MiB\n", " allocs estimate: 3\n", " --------------\n", " minimum time: 2.909 ms (0.00% GC)\n", " median time: 3.582 ms (0.00% GC)\n", " mean time: 3.956 ms (16.06% GC)\n", " maximum time: 13.355 ms (28.46% GC)\n", " --------------\n", " samples: 1259\n", " evals/sample: 1" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# let's force transpose\n", "@benchmark transpose(A) * x" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "BenchmarkTools.Trial: \n", " memory estimate: 0 bytes\n", " allocs estimate: 0\n", " --------------\n", " minimum time: 101.399 μs (0.00% GC)\n", " median time: 147.288 μs (0.00% GC)\n", " mean time: 175.533 μs (0.00% GC)\n", " maximum time: 954.276 μs (0.00% GC)\n", " --------------\n", " samples: 10000\n", " evals/sample: 1" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# pre-allocate result\n", "out = zeros(size(A, 2))\n", "@benchmark At_mul_B!(out, A, x)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* [Broadcasting](https://docs.julialang.org/en/stable/manual/functions/#man-vectorized-1) in Julia achieves vectorized code without creating intermediate arrays." 
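, "\n", "* A minimal sketch of the idea (the vector `x` and the standardization below are hypothetical, just for illustration): the fused dot syntax compiles the whole right-hand side into a single loop that writes into the pre-allocated `z`, so no intermediate arrays are allocated.\n", "    ```julia\n", "    x = rand(10^6)\n", "    z = similar(x)            # pre-allocated output\n", "    μ, σ = mean(x), std(x)\n", "    # one fused loop over x; no temporary array for (x .- μ)\n", "    z .= (x .- μ) ./ σ\n", "    ```"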
] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "\u001b[1m\u001b[33mWARNING: \u001b[39m\u001b[22m\u001b[33mabs(x::AbstractArray{T}) where T <: Number is deprecated, use abs.(x) instead.\u001b[39m\n", "Stacktrace:\n", " [1] \u001b[1mdepwarn\u001b[22m\u001b[22m\u001b[1m(\u001b[22m\u001b[22m::String, ::Symbol\u001b[1m)\u001b[22m\u001b[22m at \u001b[1m./deprecated.jl:70\u001b[22m\u001b[22m\n", " [2] \u001b[1mabs\u001b[22m\u001b[22m\u001b[1m(\u001b[22m\u001b[22m::Array{Float64,2}\u001b[1m)\u001b[22m\u001b[22m at \u001b[1m./deprecated.jl:57\u001b[22m\u001b[22m\n", " [3] \u001b[1m##core#727\u001b[22m\u001b[22m\u001b[1m(\u001b[22m\u001b[22m\u001b[1m)\u001b[22m\u001b[22m at \u001b[1m/Users/huazhou/.julia/v0.6/BenchmarkTools/src/execution.jl:316\u001b[22m\u001b[22m\n", " [4] \u001b[1m##sample#728\u001b[22m\u001b[22m\u001b[1m(\u001b[22m\u001b[22m::BenchmarkTools.Parameters\u001b[1m)\u001b[22m\u001b[22m at \u001b[1m/Users/huazhou/.julia/v0.6/BenchmarkTools/src/execution.jl:322\u001b[22m\u001b[22m\n", " [5] \u001b[1m#_run#16\u001b[22m\u001b[22m\u001b[1m(\u001b[22m\u001b[22m::Bool, ::String, ::Array{Any,1}, ::Function, ::BenchmarkTools.Benchmark{Symbol(\"##benchmark#726\")}, ::BenchmarkTools.Parameters\u001b[1m)\u001b[22m\u001b[22m at \u001b[1m/Users/huazhou/.julia/v0.6/BenchmarkTools/src/execution.jl:350\u001b[22m\u001b[22m\n", " [6] \u001b[1m(::BenchmarkTools.#kw##_run)\u001b[22m\u001b[22m\u001b[1m(\u001b[22m\u001b[22m::Array{Any,1}, ::BenchmarkTools.#_run, ::BenchmarkTools.Benchmark{Symbol(\"##benchmark#726\")}, ::BenchmarkTools.Parameters\u001b[1m)\u001b[22m\u001b[22m at \u001b[1m./:0\u001b[22m\u001b[22m\n", " [7] \u001b[1manonymous\u001b[22m\u001b[22m at \u001b[1m./:?\u001b[22m\u001b[22m\n", " [8] \u001b[1m#run_result#19\u001b[22m\u001b[22m\u001b[1m(\u001b[22m\u001b[22m::Array{Any,1}, ::Function, ::BenchmarkTools.Benchmark{Symbol(\"##benchmark#726\")}, ::BenchmarkTools.Parameters\u001b[1m)\u001b[22m\u001b[22m at \u001b[1m/Users/huazhou/.julia/v0.6/BenchmarkTools/src/execution.jl:44\u001b[22m\u001b[22m\n", " [9] \u001b[1m(::BenchmarkTools.#kw##run_result)\u001b[22m\u001b[22m\u001b[1m(\u001b[22m\u001b[22m::Array{Any,1}, ::BenchmarkTools.#run_result, ::BenchmarkTools.Benchmark{Symbol(\"##benchmark#726\")}, ::BenchmarkTools.Parameters\u001b[1m)\u001b[22m\u001b[22m at \u001b[1m./:0\u001b[22m\u001b[22m\n", " [10] \u001b[1m#run#21\u001b[22m\u001b[22m\u001b[1m(\u001b[22m\u001b[22m::Array{Any,1}, ::Function, ::BenchmarkTools.Benchmark{Symbol(\"##benchmark#726\")}, ::BenchmarkTools.Parameters\u001b[1m)\u001b[22m\u001b[22m at \u001b[1m/Users/huazhou/.julia/v0.6/BenchmarkTools/src/execution.jl:67\u001b[22m\u001b[22m\n", " [11] \u001b[1m(::Base.#kw##run)\u001b[22m\u001b[22m\u001b[1m(\u001b[22m\u001b[22m::Array{Any,1}, ::Base.#run, ::BenchmarkTools.Benchmark{Symbol(\"##benchmark#726\")}, ::BenchmarkTools.Parameters\u001b[1m)\u001b[22m\u001b[22m at \u001b[1m./:0\u001b[22m\u001b[22m\n", " [12] \u001b[1mwarmup\u001b[22m\u001b[22m\u001b[1m(\u001b[22m\u001b[22m::BenchmarkTools.Benchmark{Symbol(\"##benchmark#726\")}\u001b[1m)\u001b[22m\u001b[22m at \u001b[1m/Users/huazhou/.julia/v0.6/BenchmarkTools/src/execution.jl:100\u001b[22m\u001b[22m\n", " [13] \u001b[1minclude_string\u001b[22m\u001b[22m\u001b[1m(\u001b[22m\u001b[22m::String, ::String\u001b[1m)\u001b[22m\u001b[22m at \u001b[1m./loading.jl:522\u001b[22m\u001b[22m\n", " [14] \u001b[1mexecute_request\u001b[22m\u001b[22m\u001b[1m(\u001b[22m\u001b[22m::ZMQ.Socket, 
::IJulia.Msg\u001b[1m)\u001b[22m\u001b[22m at \u001b[1m/Users/huazhou/.julia/v0.6/IJulia/src/execute_request.jl:158\u001b[22m\u001b[22m\n", " [15] \u001b[1m(::Compat.#inner#17{Array{Any,1},IJulia.#execute_request,Tuple{ZMQ.Socket,IJulia.Msg}})\u001b[22m\u001b[22m\u001b[1m(\u001b[22m\u001b[22m\u001b[1m)\u001b[22m\u001b[22m at \u001b[1m/Users/huazhou/.julia/v0.6/Compat/src/Compat.jl:385\u001b[22m\u001b[22m\n", " [16] \u001b[1meventloop\u001b[22m\u001b[22m\u001b[1m(\u001b[22m\u001b[22m::ZMQ.Socket\u001b[1m)\u001b[22m\u001b[22m at \u001b[1m/Users/huazhou/.julia/v0.6/IJulia/src/eventloop.jl:8\u001b[22m\u001b[22m\n", " [17] \u001b[1m(::IJulia.##14#17)\u001b[22m\u001b[22m\u001b[1m(\u001b[22m\u001b[22m\u001b[1m)\u001b[22m\u001b[22m at \u001b[1m./task.jl:335\u001b[22m\u001b[22m\n", "while loading In[23], in expression starting on line 234\n", "\u001b[1m\u001b[33mWARNING: \u001b[39m\u001b[22m\u001b[33mabs(x::AbstractArray{T}) where T <: Number is deprecated, use abs.(x) instead.\u001b[39m\n", "Stacktrace:\n", " [1] \u001b[1mdepwarn\u001b[22m\u001b[22m\u001b[1m(\u001b[22m\u001b[22m::String, ::Symbol\u001b[1m)\u001b[22m\u001b[22m at \u001b[1m./deprecated.jl:70\u001b[22m\u001b[22m\n", " [2] \u001b[1mabs\u001b[22m\u001b[22m\u001b[1m(\u001b[22m\u001b[22m::Array{Float64,2}\u001b[1m)\u001b[22m\u001b[22m at \u001b[1m./deprecated.jl:57\u001b[22m\u001b[22m\n", " [3] \u001b[1m##core#727\u001b[22m\u001b[22m\u001b[1m(\u001b[22m\u001b[22m\u001b[1m)\u001b[22m\u001b[22m at \u001b[1m/Users/huazhou/.julia/v0.6/BenchmarkTools/src/execution.jl:316\u001b[22m\u001b[22m\n", " [4] \u001b[1m##sample#728\u001b[22m\u001b[22m\u001b[1m(\u001b[22m\u001b[22m::BenchmarkTools.Parameters\u001b[1m)\u001b[22m\u001b[22m at \u001b[1m/Users/huazhou/.julia/v0.6/BenchmarkTools/src/execution.jl:322\u001b[22m\u001b[22m\n", " [5] \u001b[1m#_run#16\u001b[22m\u001b[22m\u001b[1m(\u001b[22m\u001b[22m::Bool, ::String, ::Array{Any,1}, ::Function, ::BenchmarkTools.Benchmark{Symbol(\"##benchmark#726\")}, ::BenchmarkTools.Parameters\u001b[1m)\u001b[22m\u001b[22m at \u001b[1m/Users/huazhou/.julia/v0.6/BenchmarkTools/src/execution.jl:350\u001b[22m\u001b[22m\n", " [6] \u001b[1m(::BenchmarkTools.#kw##_run)\u001b[22m\u001b[22m\u001b[1m(\u001b[22m\u001b[22m::Array{Any,1}, ::BenchmarkTools.#_run, ::BenchmarkTools.Benchmark{Symbol(\"##benchmark#726\")}, ::BenchmarkTools.Parameters\u001b[1m)\u001b[22m\u001b[22m at \u001b[1m./:0\u001b[22m\u001b[22m\n", " [7] \u001b[1manonymous\u001b[22m\u001b[22m at \u001b[1m./:?\u001b[22m\u001b[22m\n", " [8] \u001b[1m#run_result#19\u001b[22m\u001b[22m\u001b[1m(\u001b[22m\u001b[22m::Array{Any,1}, ::Function, ::BenchmarkTools.Benchmark{Symbol(\"##benchmark#726\")}, ::BenchmarkTools.Parameters\u001b[1m)\u001b[22m\u001b[22m at \u001b[1m/Users/huazhou/.julia/v0.6/BenchmarkTools/src/execution.jl:44\u001b[22m\u001b[22m\n", " [9] \u001b[1m(::BenchmarkTools.#kw##run_result)\u001b[22m\u001b[22m\u001b[1m(\u001b[22m\u001b[22m::Array{Any,1}, ::BenchmarkTools.#run_result, ::BenchmarkTools.Benchmark{Symbol(\"##benchmark#726\")}, ::BenchmarkTools.Parameters\u001b[1m)\u001b[22m\u001b[22m at \u001b[1m./:0\u001b[22m\u001b[22m\n", " [10] \u001b[1m#run#21\u001b[22m\u001b[22m\u001b[1m(\u001b[22m\u001b[22m::Array{Any,1}, ::Function, ::BenchmarkTools.Benchmark{Symbol(\"##benchmark#726\")}, ::BenchmarkTools.Parameters\u001b[1m)\u001b[22m\u001b[22m at \u001b[1m/Users/huazhou/.julia/v0.6/BenchmarkTools/src/execution.jl:67\u001b[22m\u001b[22m\n", " [11] \u001b[1m(::Base.#kw##run)\u001b[22m\u001b[22m\u001b[1m(\u001b[22m\u001b[22m::Array{Any,1}, 
::Base.#run, ::BenchmarkTools.Benchmark{Symbol(\"##benchmark#726\")}, ::BenchmarkTools.Parameters\u001b[1m)\u001b[22m\u001b[22m at \u001b[1m./:0\u001b[22m\u001b[22m\n", " [12] \u001b[1m(::BenchmarkTools.#kw##warmup)\u001b[22m\u001b[22m\u001b[1m(\u001b[22m\u001b[22m::Array{Any,1}, ::BenchmarkTools.#warmup, ::BenchmarkTools.Benchmark{Symbol(\"##benchmark#726\")}\u001b[1m)\u001b[22m\u001b[22m at \u001b[1m./:0\u001b[22m\u001b[22m\n", " [13] \u001b[1m#tune!#26\u001b[22m\u001b[22m\u001b[1m(\u001b[22m\u001b[22m::Bool, ::String, ::Array{Any,1}, ::Function, ::BenchmarkTools.Benchmark{Symbol(\"##benchmark#726\")}, ::BenchmarkTools.Parameters\u001b[1m)\u001b[22m\u001b[22m at \u001b[1m/Users/huazhou/.julia/v0.6/BenchmarkTools/src/execution.jl:155\u001b[22m\u001b[22m\n", " [14] \u001b[1mtune!\u001b[22m\u001b[22m\u001b[1m(\u001b[22m\u001b[22m::BenchmarkTools.Benchmark{Symbol(\"##benchmark#726\")}\u001b[1m)\u001b[22m\u001b[22m at \u001b[1m/Users/huazhou/.julia/v0.6/BenchmarkTools/src/execution.jl:155\u001b[22m\u001b[22m\n", " [15] \u001b[1minclude_string\u001b[22m\u001b[22m\u001b[1m(\u001b[22m\u001b[22m::String, ::String\u001b[1m)\u001b[22m\u001b[22m at \u001b[1m./loading.jl:522\u001b[22m\u001b[22m\n", " [16] \u001b[1mexecute_request\u001b[22m\u001b[22m\u001b[1m(\u001b[22m\u001b[22m::ZMQ.Socket, ::IJulia.Msg\u001b[1m)\u001b[22m\u001b[22m at \u001b[1m/Users/huazhou/.julia/v0.6/IJulia/src/execute_request.jl:158\u001b[22m\u001b[22m\n", " [17] \u001b[1m(::Compat.#inner#17{Array{Any,1},IJulia.#execute_request,Tuple{ZMQ.Socket,IJulia.Msg}})\u001b[22m\u001b[22m\u001b[1m(\u001b[22m\u001b[22m\u001b[1m)\u001b[22m\u001b[22m at \u001b[1m/Users/huazhou/.julia/v0.6/Compat/src/Compat.jl:385\u001b[22m\u001b[22m\n", " [18] \u001b[1meventloop\u001b[22m\u001b[22m\u001b[1m(\u001b[22m\u001b[22m::ZMQ.Socket\u001b[1m)\u001b[22m\u001b[22m at \u001b[1m/Users/huazhou/.julia/v0.6/IJulia/src/eventloop.jl:8\u001b[22m\u001b[22m\n", " [19] \u001b[1m(::IJulia.##14#17)\u001b[22m\u001b[22m\u001b[1m(\u001b[22m\u001b[22m\u001b[1m)\u001b[22m\u001b[22m at \u001b[1m./task.jl:335\u001b[22m\u001b[22m\n", "while loading In[23], in expression starting on line 235\n", "\u001b[1m\u001b[33mWARNING: \u001b[39m\u001b[22m\u001b[33mabs(x::AbstractArray{T}) where T <: Number is deprecated, use abs.(x) instead.\u001b[39m\n", "Stacktrace:\n", " [1] \u001b[1mdepwarn\u001b[22m\u001b[22m\u001b[1m(\u001b[22m\u001b[22m::String, ::Symbol\u001b[1m)\u001b[22m\u001b[22m at \u001b[1m./deprecated.jl:70\u001b[22m\u001b[22m\n", " [2] \u001b[1mabs\u001b[22m\u001b[22m\u001b[1m(\u001b[22m\u001b[22m::Array{Float64,2}\u001b[1m)\u001b[22m\u001b[22m at \u001b[1m./deprecated.jl:57\u001b[22m\u001b[22m\n", " [3] \u001b[1m##core#727\u001b[22m\u001b[22m\u001b[1m(\u001b[22m\u001b[22m\u001b[1m)\u001b[22m\u001b[22m at \u001b[1m/Users/huazhou/.julia/v0.6/BenchmarkTools/src/execution.jl:316\u001b[22m\u001b[22m\n", " [4] \u001b[1m##sample#728\u001b[22m\u001b[22m\u001b[1m(\u001b[22m\u001b[22m::BenchmarkTools.Parameters\u001b[1m)\u001b[22m\u001b[22m at \u001b[1m/Users/huazhou/.julia/v0.6/BenchmarkTools/src/execution.jl:322\u001b[22m\u001b[22m\n", " [5] \u001b[1m#_run#16\u001b[22m\u001b[22m\u001b[1m(\u001b[22m\u001b[22m::Bool, ::String, ::Array{Any,1}, ::Function, ::BenchmarkTools.Benchmark{Symbol(\"##benchmark#726\")}, ::BenchmarkTools.Parameters\u001b[1m)\u001b[22m\u001b[22m at \u001b[1m/Users/huazhou/.julia/v0.6/BenchmarkTools/src/execution.jl:350\u001b[22m\u001b[22m\n", " [6] 
\u001b[1m_run\u001b[22m\u001b[22m\u001b[1m(\u001b[22m\u001b[22m::BenchmarkTools.Benchmark{Symbol(\"##benchmark#726\")}, ::BenchmarkTools.Parameters\u001b[1m)\u001b[22m\u001b[22m at \u001b[1m/Users/huazhou/.julia/v0.6/BenchmarkTools/src/execution.jl:344\u001b[22m\u001b[22m\n", " [7] \u001b[1m#run_result#19\u001b[22m\u001b[22m\u001b[1m(\u001b[22m\u001b[22m::Array{Any,1}, ::Function, ::BenchmarkTools.Benchmark{Symbol(\"##benchmark#726\")}, ::BenchmarkTools.Parameters\u001b[1m)\u001b[22m\u001b[22m at \u001b[1m/Users/huazhou/.julia/v0.6/BenchmarkTools/src/execution.jl:44\u001b[22m\u001b[22m\n", " [8] \u001b[1m#run#21\u001b[22m\u001b[22m\u001b[1m(\u001b[22m\u001b[22m::Array{Any,1}, ::Function, ::BenchmarkTools.Benchmark{Symbol(\"##benchmark#726\")}, ::BenchmarkTools.Parameters\u001b[1m)\u001b[22m\u001b[22m at \u001b[1m/Users/huazhou/.julia/v0.6/BenchmarkTools/src/execution.jl:67\u001b[22m\u001b[22m\n", " [9] \u001b[1mrun\u001b[22m\u001b[22m\u001b[1m(\u001b[22m\u001b[22m::BenchmarkTools.Benchmark{Symbol(\"##benchmark#726\")}\u001b[1m)\u001b[22m\u001b[22m at \u001b[1m/Users/huazhou/.julia/v0.6/BenchmarkTools/src/execution.jl:67\u001b[22m\u001b[22m\n", " [10] \u001b[1minclude_string\u001b[22m\u001b[22m\u001b[1m(\u001b[22m\u001b[22m::String, ::String\u001b[1m)\u001b[22m\u001b[22m at \u001b[1m./loading.jl:522\u001b[22m\u001b[22m\n", " [11] \u001b[1mexecute_request\u001b[22m\u001b[22m\u001b[1m(\u001b[22m\u001b[22m::ZMQ.Socket, ::IJulia.Msg\u001b[1m)\u001b[22m\u001b[22m at \u001b[1m/Users/huazhou/.julia/v0.6/IJulia/src/execute_request.jl:158\u001b[22m\u001b[22m\n", " [12] \u001b[1m(::Compat.#inner#17{Array{Any,1},IJulia.#execute_request,Tuple{ZMQ.Socket,IJulia.Msg}})\u001b[22m\u001b[22m\u001b[1m(\u001b[22m\u001b[22m\u001b[1m)\u001b[22m\u001b[22m at \u001b[1m/Users/huazhou/.julia/v0.6/Compat/src/Compat.jl:385\u001b[22m\u001b[22m\n", " [13] \u001b[1meventloop\u001b[22m\u001b[22m\u001b[1m(\u001b[22m\u001b[22m::ZMQ.Socket\u001b[1m)\u001b[22m\u001b[22m at \u001b[1m/Users/huazhou/.julia/v0.6/IJulia/src/eventloop.jl:8\u001b[22m\u001b[22m\n", " [14] \u001b[1m(::IJulia.##14#17)\u001b[22m\u001b[22m\u001b[1m(\u001b[22m\u001b[22m\u001b[1m)\u001b[22m\u001b[22m at \u001b[1m./task.jl:335\u001b[22m\u001b[22m\n", "while loading In[23], in expression starting on line 236\n" ] }, { "data": { "text/plain": [ "BenchmarkTools.Trial: \n", " memory estimate: 22.92 MiB\n", " allocs estimate: 240\n", " --------------\n", " minimum time: 7.062 ms (18.87% GC)\n", " median time: 7.582 ms (19.88% GC)\n", " mean time: 9.103 ms (19.75% GC)\n", " maximum time: 26.794 ms (17.30% GC)\n", " --------------\n", " samples: 551\n", " evals/sample: 1" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "srand(123)\n", "X, Y = rand(1000,1000), rand(1000,1000)\n", "\n", "# two temporary arrays are created\n", "@benchmark max(abs(X), abs(Y))" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "BenchmarkTools.Trial: \n", " memory estimate: 7.63 MiB\n", " allocs estimate: 27\n", " --------------\n", " minimum time: 2.589 ms (0.00% GC)\n", " median time: 3.110 ms (0.00% GC)\n", " mean time: 3.657 ms (16.53% GC)\n", " maximum time: 11.023 ms (33.54% GC)\n", " --------------\n", " samples: 1365\n", " evals/sample: 1" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# no temporary arrays created\n", "@benchmark max.(abs.(X), abs.(Y))" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": 
[ { "data": { "text/plain": [ "BenchmarkTools.Trial: \n", " memory estimate: 128 bytes\n", " allocs estimate: 4\n", " --------------\n", " minimum time: 1.961 ms (0.00% GC)\n", " median time: 2.232 ms (0.00% GC)\n", " mean time: 2.350 ms (0.00% GC)\n", " maximum time: 8.337 ms (0.00% GC)\n", " --------------\n", " samples: 2116\n", " evals/sample: 1" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# no memory allocation at all!\n", "Z = zeros(X)\n", "@benchmark Z .= max.(abs.(X), abs.(Y))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* [View](https://docs.julialang.org/en/stable/stdlib/arrays/#Base.view) avoids creating extra copy of matrix data." ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "BenchmarkTools.Trial: \n", " memory estimate: 488.47 KiB\n", " allocs estimate: 5\n", " --------------\n", " minimum time: 64.237 μs (0.00% GC)\n", " median time: 266.307 μs (0.00% GC)\n", " mean time: 237.880 μs (12.22% GC)\n", " maximum time: 2.320 ms (77.69% GC)\n", " --------------\n", " samples: 10000\n", " evals/sample: 1" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "srand(123)\n", "A = randn(1000, 1000)\n", "\n", "# sum entries in a sub-matrix\n", "@benchmark sum(A[1:2:500, 1:2:500])" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "BenchmarkTools.Trial: \n", " memory estimate: 1.20 KiB\n", " allocs estimate: 36\n", " --------------\n", " minimum time: 98.406 μs (0.00% GC)\n", " median time: 107.452 μs (0.00% GC)\n", " mean time: 121.689 μs (0.00% GC)\n", " maximum time: 1.269 ms (0.00% GC)\n", " --------------\n", " samples: 10000\n", " evals/sample: 1" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# view avoids creating a separate sub-matrix\n", "@benchmark sum(@view A[1:2:500, 1:2:500])" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "Julia 0.6 adds the [`@views`](https://docs.julialang.org/en/latest/manual/performance-tips.html#Consider-using-views-for-slices-1) macro, which can be useful in [some operations](https://discourse.julialang.org/t/why-is-a-manual-in-place-addition-so-much-faster-than-and-on-range-indexed-arrays/3302)." 
] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "BenchmarkTools.Trial: \n", " memory estimate: 1.34 KiB\n", " allocs estimate: 40\n", " --------------\n", " minimum time: 98.835 μs (0.00% GC)\n", " median time: 100.799 μs (0.00% GC)\n", " mean time: 114.490 μs (0.00% GC)\n", " maximum time: 812.846 μs (0.00% GC)\n", " --------------\n", " samples: 10000\n", " evals/sample: 1" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "@benchmark @views sum(A[1:2:500, 1:2:500])" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Julia Version 0.6.2\n", "Commit d386e40c17 (2017-12-13 18:08 UTC)\n", "Platform Info:\n", " OS: macOS (x86_64-apple-darwin14.5.0)\n", " CPU: Intel(R) Core(TM) i7-6920HQ CPU @ 2.90GHz\n", " WORD_SIZE: 64\n", " BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)\n", " LAPACK: libopenblas64_\n", " LIBM: libopenlibm\n", " LLVM: libLLVM-3.9.1 (ORCJIT, skylake)\n" ] } ], "source": [ "versioninfo()" ] } ], "metadata": { "kernelspec": { "display_name": "Julia 0.6.2", "language": "julia", "name": "julia-0.6" }, "language_info": { "file_extension": ".jl", "mimetype": "application/julia", "name": "julia", "version": "0.6.2" }, "toc": { "colors": { "hover_highlight": "#DAA520", "running_highlight": "#FF0000", "selected_highlight": "#FFD700" }, "moveMenuLeft": true, "nav_menu": { "height": "103px", "width": "252px" }, "navigate_menu": true, "number_sections": true, "sideBar": true, "threshold": 4, "toc_cell": true, "toc_section_display": "block", "toc_window_display": true, "widenNotebook": false } }, "nbformat": 4, "nbformat_minor": 2 }