{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# More Dots: Performance and Vectorization in Julia\n",
    "\n",
    "This [IJulia/Jupyter notebook](https://github.com/JuliaLang/IJulia.jl) presents some performance experiments with vectorization in Julia, to accompany the [blog post on syntactic loop fusion](http://julialang.org/blog/) in Julia 0.6.\n",
    "\n",
    "We use the following example problem:\n",
    "evaluating `f(2x^2 + 6x^3 - sqrt(x))`, for the function `f(x) = 3x^2 + 5x + 2`, elementwise for `x` in an array `X`, storing the results in-place in `X`.   We implement this in three different styles: \"traditional\" vectorized style `vec!(X)` ala Julia 0.4 or Matlab/Numpy, the devectorized style (explicit loops) `devec!(X)`, and new-style vectorization `newvec!` with syntactic loop fusion:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "devec! (generic function with 1 method)"
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "f(x) = 3x.^2 + 5x + 2\n",
    "\n",
    "# traditional-style vectorization:\n",
    "vec!(X) = X .= f(2X.^2 + 6X.^3 - sqrt.(X))\n",
    "\n",
    "# new-style vectorization (dot operations = syntactic loop fusion):\n",
    "newvec!(X) = X .= f.(2 .* X.^2 .+ 6 .* X.^3 .- sqrt.(X))\n",
    "\n",
    "# devectorized (explicit loops):\n",
    "function devec!(X)\n",
    "    for i in eachindex(X)\n",
    "        x = X[i]\n",
    "        X[i] = f(2x^2 + 6x^3 - sqrt(x))\n",
    "    end\n",
    "    return X\n",
    "end"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## A quick benchmark\n",
    "\n",
    "Let's run a simple benchmark, comparing the performance of the three functions for a vector of $10^6$ `Float64` values.  We will use the [BenchmarkTools package](https://github.com/JuliaCI/BenchmarkTools.jl) to collect timing statistics for us."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "10"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "using BenchmarkTools\n",
    "BenchmarkTools.DEFAULT_PARAMETERS.seconds = 10 # use 10s benchmarks to reduce timing noise"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "BenchmarkTools.Trial: \n",
       "  memory estimate:  91.55 mb\n",
       "  allocs estimate:  24\n",
       "  --------------\n",
       "  minimum time:     39.018 ms (30.62% GC)\n",
       "  median time:      46.892 ms (41.53% GC)\n",
       "  mean time:        77.637 ms (63.53% GC)\n",
       "  maximum time:     126.879 ms (75.92% GC)\n",
       "  --------------\n",
       "  samples:          129\n",
       "  evals/sample:     1\n",
       "  time tolerance:   5.00%\n",
       "  memory tolerance: 1.00%"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "X = zeros(10^6)\n",
    "\n",
    "t_vec = @benchmark vec!($X)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "BenchmarkTools.Trial: \n",
       "  memory estimate:  0.00 bytes\n",
       "  allocs estimate:  0\n",
       "  --------------\n",
       "  minimum time:     3.325 ms (0.00% GC)\n",
       "  median time:      3.577 ms (0.00% GC)\n",
       "  mean time:        3.728 ms (0.00% GC)\n",
       "  maximum time:     6.484 ms (0.00% GC)\n",
       "  --------------\n",
       "  samples:          2680\n",
       "  evals/sample:     1\n",
       "  time tolerance:   5.00%\n",
       "  memory tolerance: 1.00%"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "t_devec = @benchmark devec!($X)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "BenchmarkTools.Trial: \n",
       "  memory estimate:  0.00 bytes\n",
       "  allocs estimate:  0\n",
       "  --------------\n",
       "  minimum time:     3.615 ms (0.00% GC)\n",
       "  median time:      3.902 ms (0.00% GC)\n",
       "  mean time:        4.104 ms (0.00% GC)\n",
       "  maximum time:     9.670 ms (0.00% GC)\n",
       "  --------------\n",
       "  samples:          2434\n",
       "  evals/sample:     1\n",
       "  time tolerance:   5.00%\n",
       "  memory tolerance: 1.00%"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "t_newvec = @benchmark newvec!($X)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "traditional vectorization slowdown = 11.73377323838746×\n",
      "new-style vectorization slowdown = 1.0871549107544276×\n",
      "\n",
      "traditional vectorization memory overhead = 12.00012×\n"
     ]
    }
   ],
   "source": [
    "println(\"traditional vectorization slowdown = \", time(minimum(t_vec)) / time(minimum(t_devec)), \"×\")\n",
    "println(\"new-style vectorization slowdown = \", time(minimum(t_newvec)) / time(minimum(t_devec)), \"×\")\n",
    "println(\"\\ntraditional vectorization memory overhead = \", memory(minimum(t_vec)) / sizeof(X), \"×\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As can be seen from the preceding timing ratios, the traditional vectorized version is **12× slower** than the devectorized version, but the new-style \"dot-vectorized\" version is **only 10% slower.**\n",
    "\n",
    "Also, the traditional vectorized version wastes a **factor of 12 in memory** compared to the size of `X` itself, due to the 12 temporary arrays that are allocated while evaluating `vec!`, whereas `newvec!` allocates **nothing**."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## More detailed benchmarks\n",
    "\n",
    "To help us understand the source of the 12× slowdown of the traditional vectorized version, we want to separate two potential performance problems: the cost of *allocating* all of the temporary arrays vs. the cost of *multiple loops* where only one loop is required.  To do this, I'll write a version of `vec!` in which all of the *12 temporary arrays are pre-allocated*:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "vec_prealloc! (generic function with 2 methods)"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "function vec_prealloc!(X, TWELVE_TEMP_ARRAYS)\n",
    "    T1,T2,T3,T4,T5,T6,T7,T8,T9,T10,T11,T12 = TWELVE_TEMP_ARRAYS\n",
    "    \n",
    "    # evaluate T7 = 2X.^2 + 6X.^3 - sqrt.(X):\n",
    "    T1 .= X.^2\n",
    "    T2 .= 2 .* T1\n",
    "    T3 .= X.^3\n",
    "    T4 .= 6 .* T3\n",
    "    T5 .= T2 .+ T4\n",
    "    T6 .= sqrt.(X)\n",
    "    T7 .= T5 .- T6\n",
    "    \n",
    "    # evaluate T12 = f(T7):\n",
    "    T8 .= T7.^2\n",
    "    T9 .= 3 .* T8\n",
    "    T10 .= 5 .* T7\n",
    "    T11 .= T9 .+ T10\n",
    "    T12 .= T11 .+ 2\n",
    "    \n",
    "    # store result in X\n",
    "    X .= T12\n",
    "    return X\n",
    "end\n",
    "vec_prealloc!(X) = vec_prealloc!(X, ntuple(i -> similar(X), 12))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Before we do anything else, let's make sure that all of these functions are computing the same thing, by verifying that they give the same answer on a random vectory `Y`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "true"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "Y = rand(100)\n",
    "vec!(copy(Y)) == vec_prealloc!(copy(Y)) == devec!(copy(Y)) == newvec!(copy(Y))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Great, it works!  Now, let's try our quick benchmark from above:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "BenchmarkTools.Trial: \n",
       "  memory estimate:  0.00 bytes\n",
       "  allocs estimate:  0\n",
       "  --------------\n",
       "  minimum time:     13.342 ms (0.00% GC)\n",
       "  median time:      13.821 ms (0.00% GC)\n",
       "  mean time:        14.224 ms (0.00% GC)\n",
       "  maximum time:     17.571 ms (0.00% GC)\n",
       "  --------------\n",
       "  samples:          703\n",
       "  evals/sample:     1\n",
       "  time tolerance:   5.00%\n",
       "  memory tolerance: 1.00%"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "t_vec_prealloc = @benchmark vec_prealloc!($X, $(ntuple(i -> similar(X), 12)))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "preallocated traditional vectorization slowdown = 4.012209861295182×\n"
     ]
    }
   ],
   "source": [
    "println(\"preallocated traditional vectorization slowdown = \", time(minimum(t_vec_prealloc)) / time(minimum(t_devec)), \"×\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Even with pre-allocation, executing 12 separate loops (as `vec!` does implicitly) is a factor-of-four slowdown.\n",
    "\n",
    "To get further insight, let's benchmark as a function of the length of array."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "12-element Array{Int64,1}:\n",
       "        1\n",
       "        5\n",
       "       10\n",
       "       20\n",
       "       50\n",
       "      100\n",
       "     1000\n",
       "     5000\n",
       "    10000\n",
       "   100000\n",
       "  1000000\n",
       " 10000000"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "n = [1,5,10,20,50,100,1000,5000,10^4,10^5,10^6,10^7] # array sizes to benchmark"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "benchmarking n = 1\n",
      "benchmarking n = 5\n",
      "benchmarking n = 10\n",
      "benchmarking n = 20\n",
      "benchmarking n = 50\n",
      "benchmarking n = 100\n",
      "benchmarking n = 1000\n",
      "benchmarking n = 5000\n",
      "benchmarking n = 10000\n",
      "benchmarking n = 100000\n",
      "benchmarking n = 1000000\n",
      "benchmarking n = 10000000\n"
     ]
    }
   ],
   "source": [
    "t = Array{Float64}(length(n), 4)\n",
    "for i = 1:length(n)\n",
    "    X = zeros(n[i])\n",
    "    println(\"benchmarking n = \", n[i])\n",
    "    t[i,1] = time(minimum(@benchmark devec!($X)))\n",
    "    t[i,2] = time(minimum(@benchmark vec!($X)))\n",
    "    t[i,3] = time(minimum(@benchmark vec_prealloc!($X, $(ntuple(i -> similar(X), 12)))))\n",
    "    t[i,4] = time(minimum(@benchmark newvec!($X)))\n",
    "end"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "13×5 Array{Any,2}:\n",
       "       \"n\"    \"devec\"    \"vec\"     \"vec_pre\"   \"newvec\"\n",
       "      1.0    1.0       33.0302   15.369       1.46505  \n",
       "      5.0    1.0       23.6401   11.1489      1.68799  \n",
       "     10.0    1.0       12.907     6.66253     1.63485  \n",
       "     20.0    1.0        8.28721   4.24866     1.35354  \n",
       "     50.0    1.0        4.89418   2.3063      1.16262  \n",
       "    100.0    1.0        3.30924   1.88955     1.1252   \n",
       "   1000.0    1.0        2.78468   1.65965     1.09363  \n",
       "   5000.0    1.0        2.6733    2.30583     1.08739  \n",
       "  10000.0    1.0        2.69954   2.39681     1.08704  \n",
       " 100000.0    1.0        3.30592   3.23626     1.08734  \n",
       "      1.0e6  1.0        8.13306   4.01938     1.08778  \n",
       "      1.0e7  1.0       22.818     5.96294     1.08505  "
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "[\"n\" \"devec\" \"vec\" \"vec_pre\" \"newvec\"; n (t ./ t[:,1]) ]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The above table shows the execution times normalized by the time for the devectorized version.  We can see the following:\n",
    "\n",
    "* The traditionally vectorized code (`vec`) has a huge overhead for tiny arrays (where the relative cost of allocation is much bigger than arithmetic) and for very large arrays (where it spends a lot of time in the garbage collector).  At best, for intermediate-sized arrays, it is a factor of 2–3× slower than the devectorized loop.\n",
    "\n",
    "* Even if you eliminate the cost of allocation and garbage collection, in the pre-allocated version of the traditionally vectorized code (`vec_pre`), it is still much slower than the devectorized code.  This is especially true for very small vectors (due to the cost of setting up all of the little loops) and very large vectors (which don't fit into cache, and hence the cache misses induced by the non-fused loops are expensive).\n",
    "\n",
    "* The new-style \"dot-vectorized code\" (`newvec`) is mostly within 10% of the cost of the devectorized loop.  The worst case is that of length-1 arrays, where the overhead of the `broadcast` bookkeeping is noticable, but even then it is only a 30% penalty."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Comparison to C code\n",
    "\n",
    "Experienced Julia programmers know that the `devec!` code above should have performance comparable to code in a low-level language, like C, but it is good to check this.\n",
    "\n",
    "The following code implements the same function in C.  To run it, we invoke the C compiler, link it into a shared library, load it from Julia, and call it with the `ccall` syntax from Julia.   (This won't work unless you have a C compiler installed, obviously.)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [],
   "source": [
    "C_code = \"\"\"\n",
    "#include <stddef.h>\n",
    "#include <math.h>\n",
    "void devec(size_t n, double *X) {\n",
    "    for (size_t i = 0; i < n; ++i) {\n",
    "        double x = X[i];\n",
    "        double x2 = x*x;\n",
    "        double y = 2*x2 + 6*x*x2 - sqrt(x);\n",
    "        X[i] = 3*y*y + 5*y + 2;\n",
    "    }\n",
    "}\n",
    "\"\"\"\n",
    "# compile to a shared library by piping C_code to gcc:\n",
    "# (only works if you have gcc installed)\n",
    "const Clib = tempname()\n",
    "open(`gcc -fPIC -O3 -xc -shared -o $(Clib * \".\" * Libdl.dlext) -`, \"w\") do f\n",
    "    print(f, C_code)\n",
    "end"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "c_devec! (generic function with 1 method)"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "function c_devec!(X::Array{Float64})\n",
    "    ccall((\"devec\", Clib), Void, (Csize_t, Ptr{Float64}), length(X), X)\n",
    "    return X\n",
    "end"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's check that this gives the correct answer.  (We use `≈` rather than `==` because there are some slight differences in floating-point rounding errors between the C and Julia code.)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "true"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "devec!(copy(Y)) ≈ c_devec!(copy(Y))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Hooray, it works!   Now let's benchmark the C code and compare it to the Julia `devec!`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "benchmarking n = 1\n",
      "benchmarking n = 5\n",
      "benchmarking n = 10\n",
      "benchmarking n = 20\n",
      "benchmarking n = 50\n",
      "benchmarking n = 100\n",
      "benchmarking n = 1000\n",
      "benchmarking n = 5000\n",
      "benchmarking n = 10000\n",
      "benchmarking n = 100000\n",
      "benchmarking n = 1000000\n",
      "benchmarking n = 10000000\n",
      "\n",
      "Julia devec time / C time: [1.12107,1.16494,1.17804,1.05198,1.05288,1.02803,1.00759,1.00788,1.0071,1.00482,1.00442,0.997434]\n"
     ]
    }
   ],
   "source": [
    "tc = Array{Float64}(length(n))\n",
    "for i = 1:length(n)\n",
    "    X = zeros(n[i])\n",
    "    println(\"benchmarking n = \", n[i])\n",
    "    tc[i] = time(minimum(@benchmark c_devec!($X)))\n",
    "end\n",
    "println(\"\\nJulia devec time / C time: \", t[:,1] ./ tc)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As expected, the **Julia `devec!` code is within a few percent of the speed of the C code** (sometimes even faster, although that's probably timing noise).  Unlike the C code, which\n",
    "works only for `Float64` (C `double`), however, the Julia `devec!` code is **type-generic**\n",
    "(it works for any type where `+`, `*`, `^`, and `sqrt` are defined)."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Julia 0.6.0",
   "language": "julia",
   "name": "julia-0.6"
  },
  "language_info": {
   "file_extension": ".jl",
   "mimetype": "application/julia",
   "name": "julia",
   "version": "0.6.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}