{ "cells": [ { "cell_type": "markdown", "metadata": { "toc": "true" }, "source": [ "# Table of Contents\n", "

0.1  Numerical linear algebra: introduction
0.2  BLAS
0.3  Memory hierarchy and level-3 fraction
0.4  Effect of data layout
0.5  Avoid memory allocation: some examples
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Numerical linear algebra: introduction" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* Topics in numerical algebra: \n", " - BLAS \n", " - solve linear equations $\\mathbf{A} \\mathbf{x} = \\mathbf{b}$\n", " - regression computations $\\mathbf{X}^T \\mathbf{X} \\beta = \\mathbf{X}^T \\mathbf{y}$ \n", " - eigen-problems $\\mathbf{A} \\mathbf{x} = \\lambda \\mathbf{x}$ \n", " - generalized eigen-problems $\\mathbf{A} \\mathbf{x} = \\lambda \\mathbf{B} \\mathbf{x}$ \n", " - singular value decompositions $\\mathbf{A} = \\mathbf{U} \\Sigma \\mathbf{V}^T$ \n", " - iterative methods for numerical linear algebra \n", "\n", "* Except for the iterative methods, most of these numerical linear algebra tasks are implemented in the BLAS and LAPACK libraries. They form the **building blocks** of most statistical computing tasks (optimization, MCMC).\n", "\n", "* Our major **goal** (or learning objectives) is to \n", " 0. know the computational cost (flop count) of each task\n", " 0. be familiar with the BLAS and LAPACK functions (what they do) \n", " 0. do **not** re-invent wheels by implementing these dense linear algebra subroutines by yourself \n", " 0. understand the need for iterative methods \n", " 0. apply appropriate numerical algebra tools to various statistical problems \n", "\n", "* All high-level languages (R, Matlab, Julia) call BLAS and LAPACK for numerical linear algebra. \n", " - Julia offers more flexibility by exposing interfaces to many BLAS/LAPACK subroutines directly. See [documentation](https://docs.julialang.org/en/stable/stdlib/linalg/#BLAS-Functions-1)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## BLAS\n", "\n", "* BLAS stands for _basic linear algebra subprograms_. \n", "\n", "* See [netlib](http://www.netlib.org/blas/) for a complete list of standardized BLAS functions.\n", "\n", "* There are many implementations of BLAS. 
\n", " - [Netlib](http://www.netlib.org/blas/) provides a reference implementation \n", " - Matlab uses Intel's [MKL](https://software.intel.com/en-us/node/520724) (mathematical kernel libaries) \n", " - Julia uses [OpenBLAS](https://github.com/xianyi/OpenBLAS) \n", " - JuliaPro offers the option of using MKL\n", "\n", "* There are 3 levels of BLAS functions.\n", " - [Level 1](http://www.netlib.org/blas/#_level_1): vector-vector operation\n", " - [Level 2](http://www.netlib.org/blas/#_level_2): matrix-vector operation\n", " - [Level 3](http://www.netlib.org/blas/#_level_3): matrix-matrix operation\n", "\n", "| Level | Example Operation | Name | Dimension | Flops |\n", "|-------|----------------------------------------|-------------|-------------------------------------------|-------|\n", "| 1 | $\\alpha \\gets \\mathbf{x}^T \\mathbf{y}$ | dot product | $\\mathbf{x}, \\mathbf{y} \\in \\mathbb{R}^n$ | $2n$ |\n", "| | $\\mathbf{y} \\gets \\mathbf{y} + \\alpha \\mathbf{x}$ | axpy | $\\alpha \\in \\mathbb{R}$, $\\mathbf{x}, \\mathbf{y} \\in \\mathbb{R}^n$ | $2n$ |\n", "| 2 | $\\mathbf{y} \\gets \\mathbf{y} + \\mathbf{A} \\mathbf{x}$ | gaxpy | $\\mathbf{A} \\in \\mathbb{R}^{m \\times n}$, $\\mathbf{x} \\in \\mathbb{R}^n$, $\\mathbf{y} \\in \\mathbb{R}^m$ | $2mn$ |\n", "| | $\\mathbf{A} \\gets \\mathbf{A} + \\mathbf{y} \\mathbf{x}^T$ | rank one update | $\\mathbf{A} \\in \\mathbb{R}^{m \\times n}$, $\\mathbf{x} \\in \\mathbb{R}^n$, $\\mathbf{y} \\in \\mathbb{R}^m$ | $2mn$ |\n", "| 3 | $\\mathbf{C} \\gets \\mathbf{C} + \\mathbf{A} \\mathbf{B}$ | matrix multiplication | $\\mathbf{A} \\in \\mathbb{R}^{m \\times p}$, $\\mathbf{B} \\in \\mathbb{R}^{p \\times n}$, $\\mathbf{C} \\in \\mathbb{R}^{m \\times n}$ | $2mnp$ |\n", "\n", "* Typical BLAS functions support single precision (S), double precision (D), complex (C), and double complex (Z). \n", "\n", "* Some operations _appear_ as level-3 but indeed are level-2. \n", " - A common operation in statistics is column scaling or row scaling\n", " $$\n", " \\begin{eqnarray*}\n", " \\mathbf{A} &=& \\mathbf{A} \\mathbf{D} \\quad \\text{(column scaling)} \\\\\n", " \\mathbf{A} &=& \\mathbf{D} \\mathbf{A} \\quad \\text{(row scaling)},\n", " \\end{eqnarray*}\n", " $$\n", " where $\\mathbf{D}$ is diagonal. \n", " - These are essentially level-2 operations!" 
] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "2000-element Array{Float64,1}:\n", " 0.763192\n", " 0.668759\n", " 0.709337\n", " 0.088416\n", " 0.289151\n", " 0.375468\n", " 0.437356\n", " 0.179474\n", " 0.122238\n", " 0.895783\n", " 0.332146\n", " 0.206425\n", " 0.747789\n", " ⋮ \n", " 0.484487\n", " 0.567484\n", " 0.75455 \n", " 0.703483\n", " 0.166205\n", " 0.754612\n", " 0.231834\n", " 0.769243\n", " 0.805681\n", " 0.553389\n", " 0.450904\n", " 0.814614" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "using BenchmarkTools\n", "\n", "srand(123) # seed\n", "n = 2000\n", "A = rand(n, n)\n", "d = rand(n) # d vector" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "2000×2000 Array{Float64,2}:\n", " 0.763192 0.0 0.0 0.0 … 0.0 0.0 0.0 \n", " 0.0 0.668759 0.0 0.0 0.0 0.0 0.0 \n", " 0.0 0.0 0.709337 0.0 0.0 0.0 0.0 \n", " 0.0 0.0 0.0 0.088416 0.0 0.0 0.0 \n", " 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", " 0.0 0.0 0.0 0.0 … 0.0 0.0 0.0 \n", " 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", " 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", " 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", " 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", " 0.0 0.0 0.0 0.0 … 0.0 0.0 0.0 \n", " 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", " 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", " ⋮ ⋱ \n", " 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", " 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", " 0.0 0.0 0.0 0.0 … 0.0 0.0 0.0 \n", " 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", " 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", " 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", " 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", " 0.0 0.0 0.0 0.0 … 0.0 0.0 0.0 \n", " 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", " 0.0 0.0 0.0 0.0 0.553389 0.0 0.0 \n", " 0.0 0.0 0.0 0.0 0.0 0.450904 0.0 \n", " 0.0 0.0 0.0 0.0 0.0 0.0 0.814614" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "D = diagm(d) # diagonal matrix with d as diagonal" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "BenchmarkTools.Trial: \n", " memory estimate: 30.52 MiB\n", " allocs estimate: 2\n", " --------------\n", " minimum time: 100.553 ms (0.35% GC)\n", " median time: 109.386 ms (2.32% GC)\n", " mean time: 115.684 ms (3.36% GC)\n", " maximum time: 226.789 ms (1.78% GC)\n", " --------------\n", " samples: 44\n", " evals/sample: 1" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# this is calling BLAS routine for matrix multiplication: O(n^3) flops\n", "@benchmark A * D" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "2000×2000 Diagonal{Float64}:\n", " 0.763192 ⋅ ⋅ ⋅ … ⋅ ⋅ ⋅ \n", " ⋅ 0.668759 ⋅ ⋅ ⋅ ⋅ ⋅ \n", " ⋅ ⋅ 0.709337 ⋅ ⋅ ⋅ ⋅ \n", " ⋅ ⋅ ⋅ 0.088416 ⋅ ⋅ ⋅ \n", " ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ \n", " ⋅ ⋅ ⋅ ⋅ … ⋅ ⋅ ⋅ \n", " ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ \n", " ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ \n", " ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ \n", " ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ \n", " ⋅ ⋅ ⋅ ⋅ … ⋅ ⋅ ⋅ \n", " ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ \n", " ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ \n", " ⋮ ⋱ \n", " ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ \n", " ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ \n", " ⋅ ⋅ ⋅ ⋅ … ⋅ ⋅ ⋅ \n", " ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ \n", " ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ \n", " ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ \n", " ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ \n", " ⋅ ⋅ ⋅ ⋅ … ⋅ ⋅ ⋅ \n", " ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ \n", " ⋅ ⋅ ⋅ ⋅ 0.553389 ⋅ ⋅ \n", " ⋅ ⋅ ⋅ ⋅ ⋅ 0.450904 ⋅ \n", " ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 0.814614" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "Diagonal(d)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "BenchmarkTools.Trial: \n", " memory 
estimate: 30.52 MiB\n", " allocs estimate: 3\n", " --------------\n", " minimum time: 7.149 ms (2.63% GC)\n", " median time: 8.212 ms (24.38% GC)\n", " mean time: 8.453 ms (25.21% GC)\n", " maximum time: 65.978 ms (88.55% GC)\n", " --------------\n", " samples: 591\n", " evals/sample: 1" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# current way for columnwise scaling: O(n^2) flops\n", "@benchmark A * Diagonal(d)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "*(A::AbstractArray{T,2} where T, D::Diagonal) at linalg/diagonal.jl:152" ], "text/plain": [ "*(A::AbstractArray{T,2} where T, D::Diagonal) in Base.LinAlg at linalg/diagonal.jl:152" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "@which A * Diagonal(d)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "BenchmarkTools.Trial: \n", " memory estimate: 0 bytes\n", " allocs estimate: 0\n", " --------------\n", " minimum time: 4.479 ms (0.00% GC)\n", " median time: 11.387 ms (0.00% GC)\n", " mean time: 11.209 ms (0.00% GC)\n", " maximum time: 20.322 ms (0.00% GC)\n", " --------------\n", " samples: 447\n", " evals/sample: 1" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# in-place: avoid allocate space for result\n", "@benchmark scale!(A, d)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Memory hierarchy and level-3 fraction\n", "\n", "> **Key to high performance is effective use of memory hierarchy. True on all architectures.**\n", "\n", "* Flop count is not the sole determinant of algorithm efficiency. Another important factor is data movement through the memory hierarchy.\n", "\n", "\n", "\n", " \n", "\n", "\n", "\n", "* Numbers everyone should know\n", "\n", "| Operation | Time |\n", "|-------------------------------------|----------------|\n", "| L1 cache reference | 0.5 ns |\n", "| L2 cache reference | 7 ns |\n", "| Main memory reference | 100 ns |\n", "| Read 1 MB sequentially from memory | 250,000 ns |\n", "| Read 1 MB sequentially from SSD | 1,000,000 ns | \n", "| Read 1 MB sequentially from disk | 20,000,000 ns |\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", " Source: \n", "\n", "* For example, Xeon X5650 CPU has a theoretical throughput of 128 DP GFLOPS but a max memory bandwidth of 32GB/s. \n", "\n", "* Can we keep CPU cores busy with enough deliveries of matrix data and ship the results to memory fast enough to avoid backlog? \n", "Answer: use **high-level BLAS** as much as possible.\n", "\n", "| BLAS | Dimension | Mem. Refs. 
| Flops | Ratio |\n", "|----------------------------------------------------------------|-------------------------------------------------------------------------------------|------------|--------|-------|\n", "| Level 1: $\\mathbf{y} \\gets \\mathbf{y} + \\alpha \\mathbf{x}$ | $\\mathbf{x}, \\mathbf{y} \\in \\mathbb{R}^n$ | $3n$ | $2n$ | 3:2 |\n", "| Level 2: $\\mathbf{y} \\gets \\mathbf{y} + \\mathbf{A} \\mathbf{x}$ | $\\mathbf{x}, \\mathbf{y} \\in \\mathbb{R}^n$, $\\mathbf{A} \\in \\mathbb{R}^{n \\times n}$ | $n^2$ | $2n^2$ | 1:2 |\n", "| Level 3: $\\mathbf{C} \\gets \\mathbf{C} + \\mathbf{A} \\mathbf{B}$ | $\\mathbf{A}, \\mathbf{B}, \\mathbf{C} \\in \\mathbb{R}^{n \\times n}$ | $4n^2$ | $2n^3$ | 2:n |\n", "\n", "* Higher level BLAS (level 3 or 2) makes more effective use of the arithmetic logic units (ALUs) by keeping them busy; this is the **surface-to-volume** effect. \n", "See [Dongarra slides](https://www.samsi.info/wp-content/uploads/2017/02/SAMSI-0217_Dongarra.pdf).\n", "\n", "* A distinction between LAPACK and LINPACK (older versions of R use LINPACK) is that LAPACK makes use of higher level BLAS as much as possible (usually by smart partitioning) to increase the so-called **level-3 fraction**." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Effect of data layout\n", "\n", "* Data layout in memory affects algorithmic efficiency too. It is much faster to move contiguous chunks of data in memory than to retrieve/write scattered data.\n", "\n", "* Storage mode: **column-major** (Fortran, Matlab, R, Julia) vs **row-major** (C/C++).\n", "\n", "* A **cache line** is the smallest unit of data that can be transferred between cache and main memory.\n", "    - x86 CPUs: 64 bytes \n", "    - ARM CPUs: 32 bytes\n", "\n", "* Accessing a column-major stored matrix by rows causes lots of **cache misses**.\n", "\n", "* Take matrix multiplication as an example \n", "$$ \n", "\\mathbf{C} \\gets \\mathbf{C} + \\mathbf{A} \\mathbf{B}, \\quad \\mathbf{A} \\in \\mathbb{R}^{m \\times p}, \\mathbf{B} \\in \\mathbb{R}^{p \\times n}, \\mathbf{C} \\in \\mathbb{R}^{m \\times n}.\n", "$$\n", "Assume the storage is column-major, as in Julia. There are 6 variants of the algorithm according to the ordering of the triple loop. \n", "    - `jki` or `kji` looping:\n", "    ```julia\n", "    # innermost loop\n", "    for i = 1:m\n", "        C[i, j] = C[i, j] + A[i, k] * B[k, j]\n", "    end\n", "    ``` \n", "    - `ikj` or `kij` looping:\n", "    ```julia\n", "    # innermost loop \n", "    for j = 1:n\n", "        C[i, j] = C[i, j] + A[i, k] * B[k, j]\n", "    end\n", "    ``` \n", "    - `ijk` or `jik` looping:\n", "    ```julia\n", "    # innermost loop \n", "    for k = 1:p\n", "        C[i, j] = C[i, j] + A[i, k] * B[k, j]\n", "    end\n", "    ```\n", "* We pay attention to the innermost loop, where the vector calculation occurs. The associated **stride** when accessing the three matrices in memory (assuming column-major storage) is \n", "\n", "| Variant | A Stride | B Stride | C Stride |\n", "|----------------|----------|----------|----------|\n", "| $jki$ or $kji$ | Unit | 0 | Unit |\n", "| $ikj$ or $kij$ | 0 | Non-Unit | Non-Unit |\n", "| $ijk$ or $jik$ | Non-Unit | Unit | 0 | \n", "Clearly the variants $jki$ or $kji$, which access memory with unit (or zero) stride, are preferred." ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": true }, "outputs": [], "source": [ "\"\"\"\n", "    matmul_by_loop!(A, B, C, order)\n", "\n", "Overwrite `C` by `A * B`. 
`order` indicates the looping order for triple loop.\n", "\"\"\"\n", "function matmul_by_loop!(A::Matrix, B::Matrix, C::Matrix, order::String)\n", " \n", " m = size(A, 1)\n", " p = size(A, 2)\n", " n = size(B, 2)\n", " fill!(C, 0)\n", " \n", " if order == \"jki\"\n", " for j = 1:n, k = 1:p, i = 1:m\n", " C[i, j] += A[i, k] * B[k, j]\n", " end\n", " end\n", "\n", " if order == \"kji\"\n", " for k = 1:p, j = 1:n, i = 1:m\n", " C[i, j] += A[i, k] * B[k, j]\n", " end\n", " end\n", " \n", " if order == \"ikj\"\n", " for i = 1:m, k = 1:p, j = 1:n\n", " C[i, j] += A[i, k] * B[k, j]\n", " end\n", " end\n", "\n", " if order == \"kij\"\n", " for k = 1:p, i = 1:m, j = 1:n\n", " C[i, j] += A[i, k] * B[k, j]\n", " end\n", " end\n", " \n", " if order == \"ijk\"\n", " for i = 1:m, j = 1:n, k = 1:p\n", " C[i, j] += A[i, k] * B[k, j]\n", " end\n", " end\n", " \n", " if order == \"jik\"\n", " for j = 1:n, i = 1:m, k = 1:p\n", " C[i, j] += A[i, k] * B[k, j]\n", " end\n", " end\n", " \n", "end\n", "\n", "srand(123)\n", "m, n, p = 2000, 100, 2000\n", "A = rand(m, n)\n", "B = rand(n, p)\n", "C = zeros(m, p);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* $jki$ and $kji$ looping:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "BenchmarkTools.Trial: \n", " memory estimate: 0 bytes\n", " allocs estimate: 0\n", " --------------\n", " minimum time: 430.149 ms (0.00% GC)\n", " median time: 444.670 ms (0.00% GC)\n", " mean time: 469.967 ms (0.00% GC)\n", " maximum time: 563.758 ms (0.00% GC)\n", " --------------\n", " samples: 11\n", " evals/sample: 1" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "using BenchmarkTools\n", "\n", "@benchmark matmul_by_loop!(A, B, C, \"jki\")" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "BenchmarkTools.Trial: \n", " memory estimate: 0 bytes\n", " allocs estimate: 0\n", " --------------\n", " minimum time: 473.348 ms (0.00% GC)\n", " median time: 483.225 ms (0.00% GC)\n", " mean time: 486.455 ms (0.00% GC)\n", " maximum time: 509.115 ms (0.00% GC)\n", " --------------\n", " samples: 11\n", " evals/sample: 1" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "@benchmark matmul_by_loop!(A, B, C, \"kji\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* $ikj$ and $kij$ looping:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "BenchmarkTools.Trial: \n", " memory estimate: 0 bytes\n", " allocs estimate: 0\n", " --------------\n", " minimum time: 2.383 s (0.00% GC)\n", " median time: 2.643 s (0.00% GC)\n", " mean time: 2.643 s (0.00% GC)\n", " maximum time: 2.903 s (0.00% GC)\n", " --------------\n", " samples: 2\n", " evals/sample: 1" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "@benchmark matmul_by_loop!(A, B, C, \"ikj\")" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "BenchmarkTools.Trial: \n", " memory estimate: 0 bytes\n", " allocs estimate: 0\n", " --------------\n", " minimum time: 2.403 s (0.00% GC)\n", " median time: 2.411 s (0.00% GC)\n", " mean time: 2.411 s (0.00% GC)\n", " maximum time: 2.419 s (0.00% GC)\n", " --------------\n", " samples: 3\n", " evals/sample: 1" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "@benchmark 
matmul_by_loop!(A, B, C, \"kij\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* $ijk$ and $jik$ looping:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "BenchmarkTools.Trial: \n", " memory estimate: 0 bytes\n", " allocs estimate: 0\n", " --------------\n", " minimum time: 952.224 ms (0.00% GC)\n", " median time: 958.913 ms (0.00% GC)\n", " mean time: 1.042 s (0.00% GC)\n", " maximum time: 1.344 s (0.00% GC)\n", " --------------\n", " samples: 5\n", " evals/sample: 1" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "@benchmark matmul_by_loop!(A, B, C, \"ijk\")" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "BenchmarkTools.Trial: \n", " memory estimate: 0 bytes\n", " allocs estimate: 0\n", " --------------\n", " minimum time: 951.335 ms (0.00% GC)\n", " median time: 994.415 ms (0.00% GC)\n", " mean time: 1.033 s (0.00% GC)\n", " maximum time: 1.253 s (0.00% GC)\n", " --------------\n", " samples: 5\n", " evals/sample: 1" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "@benchmark matmul_by_loop!(A, B, C, \"ijk\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* Julia wraps BLAS library for matrix multiplication. We see BLAS library wins hands down (multi-threading, Strassen algorithm, higher level-3 fraction by block outer product)." ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/html": [ "A_mul_B!{T<:Union{Complex{Float32}, Complex{Float64}, Float32, Float64}}(C::Union{Base.ReshapedArray{T,2,A,MI} where MI<:Tuple{Vararg{Base.MultiplicativeInverses.SignedMultiplicativeInverse{Int64},N} where N} where A<:Union{DenseArray, SubArray{T,N,P,I,true} where I<:Tuple{Union{Base.Slice, UnitRange},Vararg{Any,N} where N} where P where N where T}, DenseArray{T,2}, SubArray{T,2,A,I,L} where L} where I<:Tuple{Vararg{Union{Base.AbstractCartesianIndex, Int64, Range{Int64}},N} where N} where A<:Union{Base.ReshapedArray{T,N,A,MI} where MI<:Tuple{Vararg{Base.MultiplicativeInverses.SignedMultiplicativeInverse{Int64},N} where N} where A<:Union{DenseArray, SubArray{T,N,P,I,true} where I<:Tuple{Union{Base.Slice, UnitRange},Vararg{Any,N} where N} where P where N where T} where N where T, DenseArray}, A::Union{Union{Base.ReshapedArray{T,1,A,MI} where MI<:Tuple{Vararg{Base.MultiplicativeInverses.SignedMultiplicativeInverse{Int64},N} where N} where A<:Union{DenseArray, SubArray{T,N,P,I,true} where I<:Tuple{Union{Base.Slice, UnitRange},Vararg{Any,N} where N} where P where N where T}, DenseArray{T,1}, SubArray{T,1,A,I,L} where L} where I<:Tuple{Vararg{Union{Base.AbstractCartesianIndex, Int64, Range{Int64}},N} where N} where A<:Union{Base.ReshapedArray{T,N,A,MI} where MI<:Tuple{Vararg{Base.MultiplicativeInverses.SignedMultiplicativeInverse{Int64},N} where N} where A<:Union{DenseArray, SubArray{T,N,P,I,true} where I<:Tuple{Union{Base.Slice, UnitRange},Vararg{Any,N} where N} where P where N where T} where N where T, DenseArray}, Union{Base.ReshapedArray{T,2,A,MI} where MI<:Tuple{Vararg{Base.MultiplicativeInverses.SignedMultiplicativeInverse{Int64},N} where N} where A<:Union{DenseArray, SubArray{T,N,P,I,true} where I<:Tuple{Union{Base.Slice, UnitRange},Vararg{Any,N} where N} where P where N where T}, DenseArray{T,2}, SubArray{T,2,A,I,L} where L} where I<:Tuple{Vararg{Union{Base.AbstractCartesianIndex, Int64, Range{Int64}},N} where 
N} where A<:Union{Base.ReshapedArray{T,N,A,MI} where MI<:Tuple{Vararg{Base.MultiplicativeInverses.SignedMultiplicativeInverse{Int64},N} where N} where A<:Union{DenseArray, SubArray{T,N,P,I,true} where I<:Tuple{Union{Base.Slice, UnitRange},Vararg{Any,N} where N} where P where N where T} where N where T, DenseArray}}, B::Union{Union{Base.ReshapedArray{T,1,A,MI} where MI<:Tuple{Vararg{Base.MultiplicativeInverses.SignedMultiplicativeInverse{Int64},N} where N} where A<:Union{DenseArray, SubArray{T,N,P,I,true} where I<:Tuple{Union{Base.Slice, UnitRange},Vararg{Any,N} where N} where P where N where T}, DenseArray{T,1}, SubArray{T,1,A,I,L} where L} where I<:Tuple{Vararg{Union{Base.AbstractCartesianIndex, Int64, Range{Int64}},N} where N} where A<:Union{Base.ReshapedArray{T,N,A,MI} where MI<:Tuple{Vararg{Base.MultiplicativeInverses.SignedMultiplicativeInverse{Int64},N} where N} where A<:Union{DenseArray, SubArray{T,N,P,I,true} where I<:Tuple{Union{Base.Slice, UnitRange},Vararg{Any,N} where N} where P where N where T} where N where T, DenseArray}, Union{Base.ReshapedArray{T,2,A,MI} where MI<:Tuple{Vararg{Base.MultiplicativeInverses.SignedMultiplicativeInverse{Int64},N} where N} where A<:Union{DenseArray, SubArray{T,N,P,I,true} where I<:Tuple{Union{Base.Slice, UnitRange},Vararg{Any,N} where N} where P where N where T}, DenseArray{T,2}, SubArray{T,2,A,I,L} where L} where I<:Tuple{Vararg{Union{Base.AbstractCartesianIndex, Int64, Range{Int64}},N} where N} where A<:Union{Base.ReshapedArray{T,N,A,MI} where MI<:Tuple{Vararg{Base.MultiplicativeInverses.SignedMultiplicativeInverse{Int64},N} where N} where A<:Union{DenseArray, SubArray{T,N,P,I,true} where I<:Tuple{Union{Base.Slice, UnitRange},Vararg{Any,N} where N} where P where N where T} where N where T, DenseArray}}) at linalg/matmul.jl:148" ], "text/plain": [ "A_mul_B!(C::Union{Base.ReshapedArray{T,2,A,MI} where MI<:Tuple{Vararg{Base.MultiplicativeInverses.SignedMultiplicativeInverse{Int64},N} where N} where A<:Union{DenseArray, SubArray{T,N,P,I,true} where I<:Tuple{Union{Base.Slice, UnitRange},Vararg{Any,N} where N} where P where N where T}, DenseArray{T,2}, SubArray{T,2,A,I,L} where L} where I<:Tuple{Vararg{Union{Base.AbstractCartesianIndex, Int64, Range{Int64}},N} where N} where A<:Union{Base.ReshapedArray{T,N,A,MI} where MI<:Tuple{Vararg{Base.MultiplicativeInverses.SignedMultiplicativeInverse{Int64},N} where N} where A<:Union{DenseArray, SubArray{T,N,P,I,true} where I<:Tuple{Union{Base.Slice, UnitRange},Vararg{Any,N} where N} where P where N where T} where N where T, DenseArray}, A::Union{Union{Base.ReshapedArray{T,1,A,MI} where MI<:Tuple{Vararg{Base.MultiplicativeInverses.SignedMultiplicativeInverse{Int64},N} where N} where A<:Union{DenseArray, SubArray{T,N,P,I,true} where I<:Tuple{Union{Base.Slice, UnitRange},Vararg{Any,N} where N} where P where N where T}, DenseArray{T,1}, SubArray{T,1,A,I,L} where L} where I<:Tuple{Vararg{Union{Base.AbstractCartesianIndex, Int64, Range{Int64}},N} where N} where A<:Union{Base.ReshapedArray{T,N,A,MI} where MI<:Tuple{Vararg{Base.MultiplicativeInverses.SignedMultiplicativeInverse{Int64},N} where N} where A<:Union{DenseArray, SubArray{T,N,P,I,true} where I<:Tuple{Union{Base.Slice, UnitRange},Vararg{Any,N} where N} where P where N where T} where N where T, DenseArray}, Union{Base.ReshapedArray{T,2,A,MI} where MI<:Tuple{Vararg{Base.MultiplicativeInverses.SignedMultiplicativeInverse{Int64},N} where N} where A<:Union{DenseArray, SubArray{T,N,P,I,true} where I<:Tuple{Union{Base.Slice, UnitRange},Vararg{Any,N} where N} where 
P where N where T}, DenseArray{T,2}, SubArray{T,2,A,I,L} where L} where I<:Tuple{Vararg{Union{Base.AbstractCartesianIndex, Int64, Range{Int64}},N} where N} where A<:Union{Base.ReshapedArray{T,N,A,MI} where MI<:Tuple{Vararg{Base.MultiplicativeInverses.SignedMultiplicativeInverse{Int64},N} where N} where A<:Union{DenseArray, SubArray{T,N,P,I,true} where I<:Tuple{Union{Base.Slice, UnitRange},Vararg{Any,N} where N} where P where N where T} where N where T, DenseArray}}, B::Union{Union{Base.ReshapedArray{T,1,A,MI} where MI<:Tuple{Vararg{Base.MultiplicativeInverses.SignedMultiplicativeInverse{Int64},N} where N} where A<:Union{DenseArray, SubArray{T,N,P,I,true} where I<:Tuple{Union{Base.Slice, UnitRange},Vararg{Any,N} where N} where P where N where T}, DenseArray{T,1}, SubArray{T,1,A,I,L} where L} where I<:Tuple{Vararg{Union{Base.AbstractCartesianIndex, Int64, Range{Int64}},N} where N} where A<:Union{Base.ReshapedArray{T,N,A,MI} where MI<:Tuple{Vararg{Base.MultiplicativeInverses.SignedMultiplicativeInverse{Int64},N} where N} where A<:Union{DenseArray, SubArray{T,N,P,I,true} where I<:Tuple{Union{Base.Slice, UnitRange},Vararg{Any,N} where N} where P where N where T} where N where T, DenseArray}, Union{Base.ReshapedArray{T,2,A,MI} where MI<:Tuple{Vararg{Base.MultiplicativeInverses.SignedMultiplicativeInverse{Int64},N} where N} where A<:Union{DenseArray, SubArray{T,N,P,I,true} where I<:Tuple{Union{Base.Slice, UnitRange},Vararg{Any,N} where N} where P where N where T}, DenseArray{T,2}, SubArray{T,2,A,I,L} where L} where I<:Tuple{Vararg{Union{Base.AbstractCartesianIndex, Int64, Range{Int64}},N} where N} where A<:Union{Base.ReshapedArray{T,N,A,MI} where MI<:Tuple{Vararg{Base.MultiplicativeInverses.SignedMultiplicativeInverse{Int64},N} where N} where A<:Union{DenseArray, SubArray{T,N,P,I,true} where I<:Tuple{Union{Base.Slice, UnitRange},Vararg{Any,N} where N} where P where N where T} where N where T, DenseArray}}) where T<:Union{Complex{Float32}, Complex{Float64}, Float32, Float64} in Base.LinAlg at linalg/matmul.jl:148" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "@which A_mul_B!(C, A, B)" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "BenchmarkTools.Trial: \n", " memory estimate: 0 bytes\n", " allocs estimate: 0\n", " --------------\n", " minimum time: 7.096 ms (0.00% GC)\n", " median time: 7.280 ms (0.00% GC)\n", " mean time: 7.448 ms (0.00% GC)\n", " maximum time: 10.894 ms (0.00% GC)\n", " --------------\n", " samples: 670\n", " evals/sample: 1" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "@benchmark A_mul_B!(C, A, B)" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "BenchmarkTools.Trial: \n", " memory estimate: 0 bytes\n", " allocs estimate: 0\n", " --------------\n", " minimum time: 4.842 ms (0.00% GC)\n", " median time: 5.061 ms (0.00% GC)\n", " mean time: 5.705 ms (0.00% GC)\n", " maximum time: 16.833 ms (0.00% GC)\n", " --------------\n", " samples: 874\n", " evals/sample: 1" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "@benchmark Base.LinAlg.BLAS.gemm!('N', 'N', 1.0, A, B, 1.0, C)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To appreciate the efforts in an optimized BLAS implementation such as OpenBLAS (evolved from GotoBLAS), see the [Quora 
question](https://www.quora.com/What-algorithm-does-BLAS-use-for-matrix-multiplication-Of-all-the-considerations-e-g-cache-popular-instruction-sets-Big-O-etc-which-one-turned-out-to-be-the-primary-bottleneck), especially the [video](https://youtu.be/JzNpKDW07rw). Bottomline is \n", "\n", "> **Get familiar with (good implementations of) BLAS/LAPACK and use them as much as possible.**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Avoid memory allocation: some examples\n", "\n", "* Transposing matrix is an expensive memory operation. \n", " - In R, the command \n", " ```R\n", " t(A) %*% x\n", " ```\n", " will first transpose `A` then perform matrix multiplication, causing unnecessary memory allocation\n", " - Julia is smart to avoid transposing matrix if possible." ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "BenchmarkTools.Trial: \n", " memory estimate: 7.94 KiB\n", " allocs estimate: 1\n", " --------------\n", " minimum time: 108.709 μs (0.00% GC)\n", " median time: 136.560 μs (0.00% GC)\n", " mean time: 141.652 μs (0.00% GC)\n", " maximum time: 1.009 ms (0.00% GC)\n", " --------------\n", " samples: 10000\n", " evals/sample: 1" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "srand(123)\n", "\n", "n = 1000\n", "A = rand(n, n)\n", "x = rand(n)\n", "\n", "# dispatch to At_mul_B (and then to BLAS)\n", "# does *not* actually transpose the matrix\n", "@benchmark A' * x" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/html": [ "*{T<:Union{Complex{Float32}, Complex{Float64}, Float32, Float64}, S}(A::Union{Base.ReshapedArray{T,2,A,MI} where MI<:Tuple{Vararg{Base.MultiplicativeInverses.SignedMultiplicativeInverse{Int64},N} where N} where A<:Union{DenseArray, SubArray{T,N,P,I,true} where I<:Tuple{Union{Base.Slice, UnitRange},Vararg{Any,N} where N} where P where N where T}, DenseArray{T,2}, SubArray{T,2,A,I,L} where L} where I<:Tuple{Vararg{Union{Base.AbstractCartesianIndex, Int64, Range{Int64}},N} where N} where A<:Union{Base.ReshapedArray{T,N,A,MI} where MI<:Tuple{Vararg{Base.MultiplicativeInverses.SignedMultiplicativeInverse{Int64},N} where N} where A<:Union{DenseArray, SubArray{T,N,P,I,true} where I<:Tuple{Union{Base.Slice, UnitRange},Vararg{Any,N} where N} where P where N where T} where N where T, DenseArray}, x::Union{Base.ReshapedArray{S,1,A,MI} where MI<:Tuple{Vararg{Base.MultiplicativeInverses.SignedMultiplicativeInverse{Int64},N} where N} where A<:Union{DenseArray, SubArray{T,N,P,I,true} where I<:Tuple{Union{Base.Slice, UnitRange},Vararg{Any,N} where N} where P where N where T}, DenseArray{S,1}, SubArray{S,1,A,I,L} where L} where I<:Tuple{Vararg{Union{Base.AbstractCartesianIndex, Int64, Range{Int64}},N} where N} where A<:Union{Base.ReshapedArray{T,N,A,MI} where MI<:Tuple{Vararg{Base.MultiplicativeInverses.SignedMultiplicativeInverse{Int64},N} where N} where A<:Union{DenseArray, SubArray{T,N,P,I,true} where I<:Tuple{Union{Base.Slice, UnitRange},Vararg{Any,N} where N} where P where N where T} where N where T, DenseArray}) at linalg/matmul.jl:74" ], "text/plain": [ "*(A::Union{Base.ReshapedArray{T,2,A,MI} where MI<:Tuple{Vararg{Base.MultiplicativeInverses.SignedMultiplicativeInverse{Int64},N} where N} where A<:Union{DenseArray, SubArray{T,N,P,I,true} where I<:Tuple{Union{Base.Slice, UnitRange},Vararg{Any,N} where N} where P where N where T}, DenseArray{T,2}, SubArray{T,2,A,I,L} where L} where 
I<:Tuple{Vararg{Union{Base.AbstractCartesianIndex, Int64, Range{Int64}},N} where N} where A<:Union{Base.ReshapedArray{T,N,A,MI} where MI<:Tuple{Vararg{Base.MultiplicativeInverses.SignedMultiplicativeInverse{Int64},N} where N} where A<:Union{DenseArray, SubArray{T,N,P,I,true} where I<:Tuple{Union{Base.Slice, UnitRange},Vararg{Any,N} where N} where P where N where T} where N where T, DenseArray}, x::Union{Base.ReshapedArray{S,1,A,MI} where MI<:Tuple{Vararg{Base.MultiplicativeInverses.SignedMultiplicativeInverse{Int64},N} where N} where A<:Union{DenseArray, SubArray{T,N,P,I,true} where I<:Tuple{Union{Base.Slice, UnitRange},Vararg{Any,N} where N} where P where N where T}, DenseArray{S,1}, SubArray{S,1,A,I,L} where L} where I<:Tuple{Vararg{Union{Base.AbstractCartesianIndex, Int64, Range{Int64}},N} where N} where A<:Union{Base.ReshapedArray{T,N,A,MI} where MI<:Tuple{Vararg{Base.MultiplicativeInverses.SignedMultiplicativeInverse{Int64},N} where N} where A<:Union{DenseArray, SubArray{T,N,P,I,true} where I<:Tuple{Union{Base.Slice, UnitRange},Vararg{Any,N} where N} where P where N where T} where N where T, DenseArray}) where {T<:Union{Complex{Float32}, Complex{Float64}, Float32, Float64}, S} in Base.LinAlg at linalg/matmul.jl:74" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "@which A' * x" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "BenchmarkTools.Trial: \n", " memory estimate: 7.94 KiB\n", " allocs estimate: 1\n", " --------------\n", " minimum time: 110.028 μs (0.00% GC)\n", " median time: 150.350 μs (0.00% GC)\n", " mean time: 203.158 μs (0.00% GC)\n", " maximum time: 2.248 ms (0.00% GC)\n", " --------------\n", " samples: 10000\n", " evals/sample: 1" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# dispatch to BLAS\n", "@benchmark At_mul_B(A, x)" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "BenchmarkTools.Trial: \n", " memory estimate: 7.64 MiB\n", " allocs estimate: 3\n", " --------------\n", " minimum time: 2.909 ms (0.00% GC)\n", " median time: 3.582 ms (0.00% GC)\n", " mean time: 3.956 ms (16.06% GC)\n", " maximum time: 13.355 ms (28.46% GC)\n", " --------------\n", " samples: 1259\n", " evals/sample: 1" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# let's force transpose\n", "@benchmark transpose(A) * x" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "BenchmarkTools.Trial: \n", " memory estimate: 0 bytes\n", " allocs estimate: 0\n", " --------------\n", " minimum time: 101.399 μs (0.00% GC)\n", " median time: 147.288 μs (0.00% GC)\n", " mean time: 175.533 μs (0.00% GC)\n", " maximum time: 954.276 μs (0.00% GC)\n", " --------------\n", " samples: 10000\n", " evals/sample: 1" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# pre-allocate result\n", "out = zeros(size(A, 2))\n", "@benchmark At_mul_B!(out, A, x)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* [Broadcasting](https://docs.julialang.org/en/stable/manual/functions/#man-vectorized-1) in Julia achieves vectorized code without creating intermediate arrays." 
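, "\n", "* A minimal sketch of the idea (the vector `x` and the standardization below are hypothetical, just for illustration): the fused dot syntax compiles the whole right-hand side into a single loop that writes into the pre-allocated `z`, so no intermediate arrays are allocated.\n", "    ```julia\n", "    x = rand(10^6)\n", "    z = similar(x)            # pre-allocated output\n", "    μ, σ = mean(x), std(x)\n", "    # one fused loop over x; no temporary array for (x .- μ)\n", "    z .= (x .- μ) ./ σ\n", "    ```"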
] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "\u001b[1m\u001b[33mWARNING: \u001b[39m\u001b[22m\u001b[33mabs(x::AbstractArray{T}) where T <: Number is deprecated, use abs.(x) instead.\u001b[39m\n", "Stacktrace:\n", " [1] \u001b[1mdepwarn\u001b[22m\u001b[22m\u001b[1m(\u001b[22m\u001b[22m::String, ::Symbol\u001b[1m)\u001b[22m\u001b[22m at \u001b[1m./deprecated.jl:70\u001b[22m\u001b[22m\n", " [2] \u001b[1mabs\u001b[22m\u001b[22m\u001b[1m(\u001b[22m\u001b[22m::Array{Float64,2}\u001b[1m)\u001b[22m\u001b[22m at \u001b[1m./deprecated.jl:57\u001b[22m\u001b[22m\n", " [3] \u001b[1m##core#727\u001b[22m\u001b[22m\u001b[1m(\u001b[22m\u001b[22m\u001b[1m)\u001b[22m\u001b[22m at \u001b[1m/Users/huazhou/.julia/v0.6/BenchmarkTools/src/execution.jl:316\u001b[22m\u001b[22m\n", " [4] \u001b[1m##sample#728\u001b[22m\u001b[22m\u001b[1m(\u001b[22m\u001b[22m::BenchmarkTools.Parameters\u001b[1m)\u001b[22m\u001b[22m at \u001b[1m/Users/huazhou/.julia/v0.6/BenchmarkTools/src/execution.jl:322\u001b[22m\u001b[22m\n", " [5] \u001b[1m#_run#16\u001b[22m\u001b[22m\u001b[1m(\u001b[22m\u001b[22m::Bool, ::String, ::Array{Any,1}, ::Function, ::BenchmarkTools.Benchmark{Symbol(\"##benchmark#726\")}, ::BenchmarkTools.Parameters\u001b[1m)\u001b[22m\u001b[22m at \u001b[1m/Users/huazhou/.julia/v0.6/BenchmarkTools/src/execution.jl:350\u001b[22m\u001b[22m\n", " [6] \u001b[1m(::BenchmarkTools.#kw##_run)\u001b[22m\u001b[22m\u001b[1m(\u001b[22m\u001b[22m::Array{Any,1}, ::BenchmarkTools.#_run, ::BenchmarkTools.Benchmark{Symbol(\"##benchmark#726\")}, ::BenchmarkTools.Parameters\u001b[1m)\u001b[22m\u001b[22m at \u001b[1m./:0\u001b[22m\u001b[22m\n", " [7] \u001b[1manonymous\u001b[22m\u001b[22m at \u001b[1m./:?\u001b[22m\u001b[22m\n", " [8] \u001b[1m#run_result#19\u001b[22m\u001b[22m\u001b[1m(\u001b[22m\u001b[22m::Array{Any,1}, ::Function, ::BenchmarkTools.Benchmark{Symbol(\"##benchmark#726\")}, ::BenchmarkTools.Parameters\u001b[1m)\u001b[22m\u001b[22m at \u001b[1m/Users/huazhou/.julia/v0.6/BenchmarkTools/src/execution.jl:44\u001b[22m\u001b[22m\n", " [9] \u001b[1m(::BenchmarkTools.#kw##run_result)\u001b[22m\u001b[22m\u001b[1m(\u001b[22m\u001b[22m::Array{Any,1}, ::BenchmarkTools.#run_result, ::BenchmarkTools.Benchmark{Symbol(\"##benchmark#726\")}, ::BenchmarkTools.Parameters\u001b[1m)\u001b[22m\u001b[22m at \u001b[1m./:0\u001b[22m\u001b[22m\n", " [10] \u001b[1m#run#21\u001b[22m\u001b[22m\u001b[1m(\u001b[22m\u001b[22m::Array{Any,1}, ::Function, ::BenchmarkTools.Benchmark{Symbol(\"##benchmark#726\")}, ::BenchmarkTools.Parameters\u001b[1m)\u001b[22m\u001b[22m at \u001b[1m/Users/huazhou/.julia/v0.6/BenchmarkTools/src/execution.jl:67\u001b[22m\u001b[22m\n", " [11] \u001b[1m(::Base.#kw##run)\u001b[22m\u001b[22m\u001b[1m(\u001b[22m\u001b[22m::Array{Any,1}, ::Base.#run, ::BenchmarkTools.Benchmark{Symbol(\"##benchmark#726\")}, ::BenchmarkTools.Parameters\u001b[1m)\u001b[22m\u001b[22m at \u001b[1m./:0\u001b[22m\u001b[22m\n", " [12] \u001b[1mwarmup\u001b[22m\u001b[22m\u001b[1m(\u001b[22m\u001b[22m::BenchmarkTools.Benchmark{Symbol(\"##benchmark#726\")}\u001b[1m)\u001b[22m\u001b[22m at \u001b[1m/Users/huazhou/.julia/v0.6/BenchmarkTools/src/execution.jl:100\u001b[22m\u001b[22m\n", " [13] \u001b[1minclude_string\u001b[22m\u001b[22m\u001b[1m(\u001b[22m\u001b[22m::String, ::String\u001b[1m)\u001b[22m\u001b[22m at \u001b[1m./loading.jl:522\u001b[22m\u001b[22m\n", " [14] \u001b[1mexecute_request\u001b[22m\u001b[22m\u001b[1m(\u001b[22m\u001b[22m::ZMQ.Socket, 
::IJulia.Msg\u001b[1m)\u001b[22m\u001b[22m at \u001b[1m/Users/huazhou/.julia/v0.6/IJulia/src/execute_request.jl:158\u001b[22m\u001b[22m\n", " [15] \u001b[1m(::Compat.#inner#17{Array{Any,1},IJulia.#execute_request,Tuple{ZMQ.Socket,IJulia.Msg}})\u001b[22m\u001b[22m\u001b[1m(\u001b[22m\u001b[22m\u001b[1m)\u001b[22m\u001b[22m at \u001b[1m/Users/huazhou/.julia/v0.6/Compat/src/Compat.jl:385\u001b[22m\u001b[22m\n", " [16] \u001b[1meventloop\u001b[22m\u001b[22m\u001b[1m(\u001b[22m\u001b[22m::ZMQ.Socket\u001b[1m)\u001b[22m\u001b[22m at \u001b[1m/Users/huazhou/.julia/v0.6/IJulia/src/eventloop.jl:8\u001b[22m\u001b[22m\n", " [17] \u001b[1m(::IJulia.##14#17)\u001b[22m\u001b[22m\u001b[1m(\u001b[22m\u001b[22m\u001b[1m)\u001b[22m\u001b[22m at \u001b[1m./task.jl:335\u001b[22m\u001b[22m\n", "while loading In[23], in expression starting on line 234\n", "\u001b[1m\u001b[33mWARNING: \u001b[39m\u001b[22m\u001b[33mabs(x::AbstractArray{T}) where T <: Number is deprecated, use abs.(x) instead.\u001b[39m\n", "Stacktrace:\n", " [1] \u001b[1mdepwarn\u001b[22m\u001b[22m\u001b[1m(\u001b[22m\u001b[22m::String, ::Symbol\u001b[1m)\u001b[22m\u001b[22m at \u001b[1m./deprecated.jl:70\u001b[22m\u001b[22m\n", " [2] \u001b[1mabs\u001b[22m\u001b[22m\u001b[1m(\u001b[22m\u001b[22m::Array{Float64,2}\u001b[1m)\u001b[22m\u001b[22m at \u001b[1m./deprecated.jl:57\u001b[22m\u001b[22m\n", " [3] \u001b[1m##core#727\u001b[22m\u001b[22m\u001b[1m(\u001b[22m\u001b[22m\u001b[1m)\u001b[22m\u001b[22m at \u001b[1m/Users/huazhou/.julia/v0.6/BenchmarkTools/src/execution.jl:316\u001b[22m\u001b[22m\n", " [4] \u001b[1m##sample#728\u001b[22m\u001b[22m\u001b[1m(\u001b[22m\u001b[22m::BenchmarkTools.Parameters\u001b[1m)\u001b[22m\u001b[22m at \u001b[1m/Users/huazhou/.julia/v0.6/BenchmarkTools/src/execution.jl:322\u001b[22m\u001b[22m\n", " [5] \u001b[1m#_run#16\u001b[22m\u001b[22m\u001b[1m(\u001b[22m\u001b[22m::Bool, ::String, ::Array{Any,1}, ::Function, ::BenchmarkTools.Benchmark{Symbol(\"##benchmark#726\")}, ::BenchmarkTools.Parameters\u001b[1m)\u001b[22m\u001b[22m at \u001b[1m/Users/huazhou/.julia/v0.6/BenchmarkTools/src/execution.jl:350\u001b[22m\u001b[22m\n", " [6] \u001b[1m(::BenchmarkTools.#kw##_run)\u001b[22m\u001b[22m\u001b[1m(\u001b[22m\u001b[22m::Array{Any,1}, ::BenchmarkTools.#_run, ::BenchmarkTools.Benchmark{Symbol(\"##benchmark#726\")}, ::BenchmarkTools.Parameters\u001b[1m)\u001b[22m\u001b[22m at \u001b[1m./:0\u001b[22m\u001b[22m\n", " [7] \u001b[1manonymous\u001b[22m\u001b[22m at \u001b[1m./:?\u001b[22m\u001b[22m\n", " [8] \u001b[1m#run_result#19\u001b[22m\u001b[22m\u001b[1m(\u001b[22m\u001b[22m::Array{Any,1}, ::Function, ::BenchmarkTools.Benchmark{Symbol(\"##benchmark#726\")}, ::BenchmarkTools.Parameters\u001b[1m)\u001b[22m\u001b[22m at \u001b[1m/Users/huazhou/.julia/v0.6/BenchmarkTools/src/execution.jl:44\u001b[22m\u001b[22m\n", " [9] \u001b[1m(::BenchmarkTools.#kw##run_result)\u001b[22m\u001b[22m\u001b[1m(\u001b[22m\u001b[22m::Array{Any,1}, ::BenchmarkTools.#run_result, ::BenchmarkTools.Benchmark{Symbol(\"##benchmark#726\")}, ::BenchmarkTools.Parameters\u001b[1m)\u001b[22m\u001b[22m at \u001b[1m./:0\u001b[22m\u001b[22m\n", " [10] \u001b[1m#run#21\u001b[22m\u001b[22m\u001b[1m(\u001b[22m\u001b[22m::Array{Any,1}, ::Function, ::BenchmarkTools.Benchmark{Symbol(\"##benchmark#726\")}, ::BenchmarkTools.Parameters\u001b[1m)\u001b[22m\u001b[22m at \u001b[1m/Users/huazhou/.julia/v0.6/BenchmarkTools/src/execution.jl:67\u001b[22m\u001b[22m\n", " [11] \u001b[1m(::Base.#kw##run)\u001b[22m\u001b[22m\u001b[1m(\u001b[22m\u001b[22m::Array{Any,1}, 
::Base.#run, ::BenchmarkTools.Benchmark{Symbol(\"##benchmark#726\")}, ::BenchmarkTools.Parameters\u001b[1m)\u001b[22m\u001b[22m at \u001b[1m./:0\u001b[22m\u001b[22m\n", " [12] \u001b[1m(::BenchmarkTools.#kw##warmup)\u001b[22m\u001b[22m\u001b[1m(\u001b[22m\u001b[22m::Array{Any,1}, ::BenchmarkTools.#warmup, ::BenchmarkTools.Benchmark{Symbol(\"##benchmark#726\")}\u001b[1m)\u001b[22m\u001b[22m at \u001b[1m./:0\u001b[22m\u001b[22m\n", " [13] \u001b[1m#tune!#26\u001b[22m\u001b[22m\u001b[1m(\u001b[22m\u001b[22m::Bool, ::String, ::Array{Any,1}, ::Function, ::BenchmarkTools.Benchmark{Symbol(\"##benchmark#726\")}, ::BenchmarkTools.Parameters\u001b[1m)\u001b[22m\u001b[22m at \u001b[1m/Users/huazhou/.julia/v0.6/BenchmarkTools/src/execution.jl:155\u001b[22m\u001b[22m\n", " [14] \u001b[1mtune!\u001b[22m\u001b[22m\u001b[1m(\u001b[22m\u001b[22m::BenchmarkTools.Benchmark{Symbol(\"##benchmark#726\")}\u001b[1m)\u001b[22m\u001b[22m at \u001b[1m/Users/huazhou/.julia/v0.6/BenchmarkTools/src/execution.jl:155\u001b[22m\u001b[22m\n", " [15] \u001b[1minclude_string\u001b[22m\u001b[22m\u001b[1m(\u001b[22m\u001b[22m::String, ::String\u001b[1m)\u001b[22m\u001b[22m at \u001b[1m./loading.jl:522\u001b[22m\u001b[22m\n", " [16] \u001b[1mexecute_request\u001b[22m\u001b[22m\u001b[1m(\u001b[22m\u001b[22m::ZMQ.Socket, ::IJulia.Msg\u001b[1m)\u001b[22m\u001b[22m at \u001b[1m/Users/huazhou/.julia/v0.6/IJulia/src/execute_request.jl:158\u001b[22m\u001b[22m\n", " [17] \u001b[1m(::Compat.#inner#17{Array{Any,1},IJulia.#execute_request,Tuple{ZMQ.Socket,IJulia.Msg}})\u001b[22m\u001b[22m\u001b[1m(\u001b[22m\u001b[22m\u001b[1m)\u001b[22m\u001b[22m at \u001b[1m/Users/huazhou/.julia/v0.6/Compat/src/Compat.jl:385\u001b[22m\u001b[22m\n", " [18] \u001b[1meventloop\u001b[22m\u001b[22m\u001b[1m(\u001b[22m\u001b[22m::ZMQ.Socket\u001b[1m)\u001b[22m\u001b[22m at \u001b[1m/Users/huazhou/.julia/v0.6/IJulia/src/eventloop.jl:8\u001b[22m\u001b[22m\n", " [19] \u001b[1m(::IJulia.##14#17)\u001b[22m\u001b[22m\u001b[1m(\u001b[22m\u001b[22m\u001b[1m)\u001b[22m\u001b[22m at \u001b[1m./task.jl:335\u001b[22m\u001b[22m\n", "while loading In[23], in expression starting on line 235\n", "\u001b[1m\u001b[33mWARNING: \u001b[39m\u001b[22m\u001b[33mabs(x::AbstractArray{T}) where T <: Number is deprecated, use abs.(x) instead.\u001b[39m\n", "Stacktrace:\n", " [1] \u001b[1mdepwarn\u001b[22m\u001b[22m\u001b[1m(\u001b[22m\u001b[22m::String, ::Symbol\u001b[1m)\u001b[22m\u001b[22m at \u001b[1m./deprecated.jl:70\u001b[22m\u001b[22m\n", " [2] \u001b[1mabs\u001b[22m\u001b[22m\u001b[1m(\u001b[22m\u001b[22m::Array{Float64,2}\u001b[1m)\u001b[22m\u001b[22m at \u001b[1m./deprecated.jl:57\u001b[22m\u001b[22m\n", " [3] \u001b[1m##core#727\u001b[22m\u001b[22m\u001b[1m(\u001b[22m\u001b[22m\u001b[1m)\u001b[22m\u001b[22m at \u001b[1m/Users/huazhou/.julia/v0.6/BenchmarkTools/src/execution.jl:316\u001b[22m\u001b[22m\n", " [4] \u001b[1m##sample#728\u001b[22m\u001b[22m\u001b[1m(\u001b[22m\u001b[22m::BenchmarkTools.Parameters\u001b[1m)\u001b[22m\u001b[22m at \u001b[1m/Users/huazhou/.julia/v0.6/BenchmarkTools/src/execution.jl:322\u001b[22m\u001b[22m\n", " [5] \u001b[1m#_run#16\u001b[22m\u001b[22m\u001b[1m(\u001b[22m\u001b[22m::Bool, ::String, ::Array{Any,1}, ::Function, ::BenchmarkTools.Benchmark{Symbol(\"##benchmark#726\")}, ::BenchmarkTools.Parameters\u001b[1m)\u001b[22m\u001b[22m at \u001b[1m/Users/huazhou/.julia/v0.6/BenchmarkTools/src/execution.jl:350\u001b[22m\u001b[22m\n", " [6] 
\u001b[1m_run\u001b[22m\u001b[22m\u001b[1m(\u001b[22m\u001b[22m::BenchmarkTools.Benchmark{Symbol(\"##benchmark#726\")}, ::BenchmarkTools.Parameters\u001b[1m)\u001b[22m\u001b[22m at \u001b[1m/Users/huazhou/.julia/v0.6/BenchmarkTools/src/execution.jl:344\u001b[22m\u001b[22m\n", " [7] \u001b[1m#run_result#19\u001b[22m\u001b[22m\u001b[1m(\u001b[22m\u001b[22m::Array{Any,1}, ::Function, ::BenchmarkTools.Benchmark{Symbol(\"##benchmark#726\")}, ::BenchmarkTools.Parameters\u001b[1m)\u001b[22m\u001b[22m at \u001b[1m/Users/huazhou/.julia/v0.6/BenchmarkTools/src/execution.jl:44\u001b[22m\u001b[22m\n", " [8] \u001b[1m#run#21\u001b[22m\u001b[22m\u001b[1m(\u001b[22m\u001b[22m::Array{Any,1}, ::Function, ::BenchmarkTools.Benchmark{Symbol(\"##benchmark#726\")}, ::BenchmarkTools.Parameters\u001b[1m)\u001b[22m\u001b[22m at \u001b[1m/Users/huazhou/.julia/v0.6/BenchmarkTools/src/execution.jl:67\u001b[22m\u001b[22m\n", " [9] \u001b[1mrun\u001b[22m\u001b[22m\u001b[1m(\u001b[22m\u001b[22m::BenchmarkTools.Benchmark{Symbol(\"##benchmark#726\")}\u001b[1m)\u001b[22m\u001b[22m at \u001b[1m/Users/huazhou/.julia/v0.6/BenchmarkTools/src/execution.jl:67\u001b[22m\u001b[22m\n", " [10] \u001b[1minclude_string\u001b[22m\u001b[22m\u001b[1m(\u001b[22m\u001b[22m::String, ::String\u001b[1m)\u001b[22m\u001b[22m at \u001b[1m./loading.jl:522\u001b[22m\u001b[22m\n", " [11] \u001b[1mexecute_request\u001b[22m\u001b[22m\u001b[1m(\u001b[22m\u001b[22m::ZMQ.Socket, ::IJulia.Msg\u001b[1m)\u001b[22m\u001b[22m at \u001b[1m/Users/huazhou/.julia/v0.6/IJulia/src/execute_request.jl:158\u001b[22m\u001b[22m\n", " [12] \u001b[1m(::Compat.#inner#17{Array{Any,1},IJulia.#execute_request,Tuple{ZMQ.Socket,IJulia.Msg}})\u001b[22m\u001b[22m\u001b[1m(\u001b[22m\u001b[22m\u001b[1m)\u001b[22m\u001b[22m at \u001b[1m/Users/huazhou/.julia/v0.6/Compat/src/Compat.jl:385\u001b[22m\u001b[22m\n", " [13] \u001b[1meventloop\u001b[22m\u001b[22m\u001b[1m(\u001b[22m\u001b[22m::ZMQ.Socket\u001b[1m)\u001b[22m\u001b[22m at \u001b[1m/Users/huazhou/.julia/v0.6/IJulia/src/eventloop.jl:8\u001b[22m\u001b[22m\n", " [14] \u001b[1m(::IJulia.##14#17)\u001b[22m\u001b[22m\u001b[1m(\u001b[22m\u001b[22m\u001b[1m)\u001b[22m\u001b[22m at \u001b[1m./task.jl:335\u001b[22m\u001b[22m\n", "while loading In[23], in expression starting on line 236\n" ] }, { "data": { "text/plain": [ "BenchmarkTools.Trial: \n", " memory estimate: 22.92 MiB\n", " allocs estimate: 240\n", " --------------\n", " minimum time: 7.062 ms (18.87% GC)\n", " median time: 7.582 ms (19.88% GC)\n", " mean time: 9.103 ms (19.75% GC)\n", " maximum time: 26.794 ms (17.30% GC)\n", " --------------\n", " samples: 551\n", " evals/sample: 1" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "srand(123)\n", "X, Y = rand(1000,1000), rand(1000,1000)\n", "\n", "# two temporary arrays are created\n", "@benchmark max(abs(X), abs(Y))" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "BenchmarkTools.Trial: \n", " memory estimate: 7.63 MiB\n", " allocs estimate: 27\n", " --------------\n", " minimum time: 2.589 ms (0.00% GC)\n", " median time: 3.110 ms (0.00% GC)\n", " mean time: 3.657 ms (16.53% GC)\n", " maximum time: 11.023 ms (33.54% GC)\n", " --------------\n", " samples: 1365\n", " evals/sample: 1" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# no temporary arrays created\n", "@benchmark max.(abs.(X), abs.(Y))" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": 
[ { "data": { "text/plain": [ "BenchmarkTools.Trial: \n", " memory estimate: 128 bytes\n", " allocs estimate: 4\n", " --------------\n", " minimum time: 1.961 ms (0.00% GC)\n", " median time: 2.232 ms (0.00% GC)\n", " mean time: 2.350 ms (0.00% GC)\n", " maximum time: 8.337 ms (0.00% GC)\n", " --------------\n", " samples: 2116\n", " evals/sample: 1" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# no memory allocation at all!\n", "Z = zeros(X)\n", "@benchmark Z .= max.(abs.(X), abs.(Y))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* [View](https://docs.julialang.org/en/stable/stdlib/arrays/#Base.view) avoids creating extra copy of matrix data." ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "BenchmarkTools.Trial: \n", " memory estimate: 488.47 KiB\n", " allocs estimate: 5\n", " --------------\n", " minimum time: 64.237 μs (0.00% GC)\n", " median time: 266.307 μs (0.00% GC)\n", " mean time: 237.880 μs (12.22% GC)\n", " maximum time: 2.320 ms (77.69% GC)\n", " --------------\n", " samples: 10000\n", " evals/sample: 1" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "srand(123)\n", "A = randn(1000, 1000)\n", "\n", "# sum entries in a sub-matrix\n", "@benchmark sum(A[1:2:500, 1:2:500])" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "BenchmarkTools.Trial: \n", " memory estimate: 1.20 KiB\n", " allocs estimate: 36\n", " --------------\n", " minimum time: 98.406 μs (0.00% GC)\n", " median time: 107.452 μs (0.00% GC)\n", " mean time: 121.689 μs (0.00% GC)\n", " maximum time: 1.269 ms (0.00% GC)\n", " --------------\n", " samples: 10000\n", " evals/sample: 1" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# view avoids creating a separate sub-matrix\n", "@benchmark sum(@view A[1:2:500, 1:2:500])" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "Julia 0.6 adds the [`@views`](https://docs.julialang.org/en/latest/manual/performance-tips.html#Consider-using-views-for-slices-1) macro, which can be useful in [some operations](https://discourse.julialang.org/t/why-is-a-manual-in-place-addition-so-much-faster-than-and-on-range-indexed-arrays/3302)." 
] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "BenchmarkTools.Trial: \n", " memory estimate: 1.34 KiB\n", " allocs estimate: 40\n", " --------------\n", " minimum time: 98.835 μs (0.00% GC)\n", " median time: 100.799 μs (0.00% GC)\n", " mean time: 114.490 μs (0.00% GC)\n", " maximum time: 812.846 μs (0.00% GC)\n", " --------------\n", " samples: 10000\n", " evals/sample: 1" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "@benchmark @views sum(A[1:2:500, 1:2:500])" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Julia Version 0.6.2\n", "Commit d386e40c17 (2017-12-13 18:08 UTC)\n", "Platform Info:\n", " OS: macOS (x86_64-apple-darwin14.5.0)\n", " CPU: Intel(R) Core(TM) i7-6920HQ CPU @ 2.90GHz\n", " WORD_SIZE: 64\n", " BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)\n", " LAPACK: libopenblas64_\n", " LIBM: libopenlibm\n", " LLVM: libLLVM-3.9.1 (ORCJIT, skylake)\n" ] } ], "source": [ "versioninfo()" ] } ], "metadata": { "kernelspec": { "display_name": "Julia 0.6.2", "language": "julia", "name": "julia-0.6" }, "language_info": { "file_extension": ".jl", "mimetype": "application/julia", "name": "julia", "version": "0.6.2" }, "toc": { "colors": { "hover_highlight": "#DAA520", "running_highlight": "#FF0000", "selected_highlight": "#FFD700" }, "moveMenuLeft": true, "nav_menu": { "height": "103px", "width": "252px" }, "navigate_menu": true, "number_sections": true, "sideBar": true, "threshold": 4, "toc_cell": true, "toc_section_display": "block", "toc_window_display": true, "widenNotebook": false } }, "nbformat": 4, "nbformat_minor": 2 }