{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Benchmarking Julia on a PDE: the Kuramoto-Sivashinksy equation\n", "\n", "John F. Gibson, Department of Mathematics and Statistics, University of New Hampshire
\n", "Last update 2019-10-30 using Julia-1.2.0.\n", "\n", "## The benchmark algorithm: KS-CNAB2\n", "\n", "The Kuramoto-Sivashinsky (KS) equation is a nonlinear time-evolving partial differential equation (PDE) on a 1d spatial domain.\n", "\n", "\\begin{equation*}\n", "u_t = -u u_{x} - u_{xx} - u_{xxxx}\n", "\\end{equation*}\n", "\n", "where $x$ is space, $t$ is time, and subscripts indicate differentiation. We assume a spatial domain $x \\in [0, L_x]$ with periodic boundary conditions.\n", "\n", "The KS-CNAB2 benchmark algorithm is a simple numerical integration scheme for the KS equation that uses Fourier expansion in space, collocation calculation of the nonlinear term $u u_x$, and finite-differencing in time, specifically 2nd-order Crank-Nicolson Adams-Bashforth (CNAB2) timestepping. Mathematical details of KS-CNAB2 algorithm are provided [below](#Mathematics-of-the-CNAB2-algorithm).\n", "\n", "As PDE solvers go, this one is super-simple, about twenty lines of code, and so comparable in scale to the Fibonnaci, pi-sum, etc. benchmarks at https://julialang.org/benchmarks/.\n", "However this benchmark is a little different from those in that the dominant cost of the algorithm should be the fast Fourier transforms (FFTs), which in all languages are performed with calls to the same external C library, [FFTW](http://fftw.org/). So what I'm testing here is the overhead each language imposes over FFTW for things like function calls, index bounds checking, allocation of temporary arrays, and pointer dereferencing. \n", "\n", "This benchmark is meant as a preliminary investigation towards using Julia for classic high-performance computations (HPC) for PDEs from engineering and physics.|\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Results: \n", "\n", "### Execution time versus simulation size\n" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "image/svg+xml": [ "\n", "\n", "\n", " \n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " \n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "10\n", "\n", "\n", "1 \n", "\n", "\n", "10\n", "\n", "\n", "2 \n", "\n", "\n", "10\n", "\n", "\n", "3 \n", "\n", "\n", "10\n", "\n", "\n", "4 \n", "\n", "\n", "10\n", "\n", "\n", "5 \n", "\n", "\n", "10\n", "\n", "\n", "-\n", "\n", "\n", "3 \n", "\n", "\n", "10\n", "\n", "\n", "-\n", "\n", "\n", "2 \n", "\n", "\n", "10\n", "\n", "\n", "-\n", "\n", "\n", "1 \n", "\n", "\n", "10\n", "\n", "\n", "0 \n", "\n", "\n", "10\n", "\n", "\n", "1 \n", "\n", "\n", "10\n", "\n", "\n", "2 \n", "\n", "\n", "execution time, Kuramoto-Sivashinky simulation\n", "\n", "\n", "Nx\n", "\n", "\n", "cpu time\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "Python\n", "\n", "\n", "\n", "\n", "\n", "Matlab\n", "\n", "\n", "\n", "\n", "\n", "Julia naive\n", "\n", "\n", "\n", "\n", "\n", "Julia in-place\n", "\n", "\n", "\n", "\n", "\n", "Julia unrolled\n", "\n", "\n", "\n", "\n", "\n", "C\n", "\n", "\n", "\n", "\n", "\n", "C++\n", "\n", "\n", "\n", "Fortran\n", "\n", "\n", "\n", "Nx\n", "\n", "\n", "\n", "Nx log Nx\n", "\n", "\n" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "using Plots\n", "using DelimitedFiles\n", "gr()\n", "\n", "d = readdlm(\"benchmark-data/cputime.asc\", comments=true)\n", "\n", "Nx = d[:,1]\n", "plot(Nx, d[:,7], label=\"Python\", marker=:circ, markersize=2, color=\"magenta\")\n", "plot!(Nx, d[:,5], label=\"Matlab\", marker=:circ, markersize=2, color=\"green\")\n", "plot!(Nx, d[:,9], label=\"Julia naive\", marker=:circ, markersize=2, color=\"orange\")\n", "plot!(Nx, d[:,10], label=\"Julia in-place\", marker=:circ, markersize=2, color=\"yellow\")\n", "plot!(Nx, d[:,11], label=\"Julia unrolled\", marker=:circ, markersize=2, color=\"red\")\n", "plot!(Nx, d[:,3], label=\"C\", marker=:circ, markersize=2, color=\"cyan\")\n", "plot!(Nx, d[:,4], label=\"C++\", marker=:circ, markersize=2, color=\"blue\")\n", "plot!(Nx, d[:,2], label=\"Fortran\", linestyle=:solid, color=\"black\")\n", "plot!(Nx, 2.5e-05*Nx, label=\"Nx\", linestyle=:dot, color=\"black\")\n", "plot!(Nx, 1.4e-05*Nx .* log10.(Nx), label=\"Nx log Nx\", linestyle=:dash, color=\"black\")\n", "plot!(xlabel=\"Nx\", ylabel=\"cpu time\", title=\"execution time, Kuramoto-Sivashinky simulation\")\n", "plot!(yscale=:log10, xscale=:log10,xlim=(10,2e05),ylim=(1e-03,2e02), legend=:topleft)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Comments:**\n", "\n", "**Julia unrolled and Fortran results are practically identical.** Look closely for the black line (Fortran) passing through red dots (Julia unrolled).\n", "\n", "**Expectation of $N_x \\log N_X$ scaling**. Execution time for this algorithm should ideally be dominated by the $N_x \\log N_x$ cost of the FFTs. In the above plot, all the codes do appear to scale as $N_x \\log N_x$ at large $N_x$ with different prefactors. However it's hard to understand why there would be different prefactors for the FFT cost when all they're all calls to the same FFTW C library. Perhaps linear-cost language differences are actually dominating the FFTs. See the linear-linear plot below.\n", "\n", "**Different Julia implementation:** I wrote three different Julia codes using three different levels of Julia knowledge. \n", "\n", " * **Julia naive** is a straight translation of a Matlab code, all vectorized and paying no attention to allocation of temporary arrays inside the time-stepping loop. \n", " \n", " * **Julia in-place** eliminates temporaries and allocations by using in-place FFTs and julia-0.6's loop fusion capability.\n", " \n", " * **Julia unrolled** uses in-place FFTs and unrolls all the vector time-stepping operations into explicit for loops. \n", " \n", "**Who beats whom?** Julia naive beats Python by a factor of 1.38 and is slightly better than Matab. Julia in-place and Julia unrolled are close to C and (factors of 1.06 and 1.04), and Julia unrolled is close to C (factor of 1.04). Execution times of Julia in-place, Julia unrolled, C, C++, and Fortran are all pretty close, about a 15% spread. The benchmarks were averaged over thousands of runs for $N_x = 32$ scaling down to 8 runs for $N_x = 2^{17}$; even so the timing averages varied a few percent. \n", "\n", "**CPU, OS, and compiler:** All benchmarks were run single-threaded on a six-core Intel Core i7-3960X CPU @ 3.30GHz with 64 GB memory running openSUSE Leap 42.2. C and C++ were compiled with clang 3.8.0, Fortran with gfortran 4.8.3, Julia was julia-1.2.0 (downloaded from [julialang.org](http://www.julialang.org)), and all used optimization ``-O3``. For more details see [benchmark-data/cputime.asc](benchmark-data/cputime.asc). \n" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "image/svg+xml": [ "\n", "\n", "\n", " \n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " \n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "5.0×10\n", "\n", "\n", "4\n", "\n", "\n", "1.0×10\n", "\n", "\n", "5\n", "\n", "\n", "1.5×10\n", "\n", "\n", "5\n", "\n", "\n", "2.0×10\n", "\n", "\n", "5\n", "\n", "\n", "0\n", "\n", "\n", "10\n", "\n", "\n", "20\n", "\n", "\n", "30\n", "\n", "\n", "40\n", "\n", "\n", "execution time, Kuramoto-Sivashinky simulation\n", "\n", "\n", "Nx\n", "\n", "\n", "cpu time\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "Python\n", "\n", "\n", "\n", "\n", "\n", "Matlab\n", "\n", "\n", "\n", "\n", "\n", "Julia naive\n", "\n", "\n", "\n", "\n", "\n", "Julia in-place\n", "\n", "\n", "\n", "\n", "\n", "Julia unrolled\n", "\n", "\n", "\n", "\n", "\n", "C\n", "\n", "\n", "\n", "\n", "\n", "C++\n", "\n", "\n", "\n", "Fortran\n", "\n", "\n", "\n", "Nx\n", "\n", "\n", "\n", "Nx log Nx\n", "\n", "\n" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "plot!(xlim=(1e02,2e05), ylim=(0,40), yscale=:linear, xscale=:linear)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Timings for the last datapoint, $N_x = 2^{17} = 131072$. Cputime is in seconds. The last column shows cputime normalized so that C = 1. \n", "\n", "language | cputime | ratio to C \n", "---------|---------|------\n", "Python | 35.7 | 2.45 \n", "Matlab | 26.8 | 1.83\n", "Julia naive | 25.9 | 1.77\n", "Julia in-place | 15.5 | 1.06\n", "Julia unrolled | 15.2 | 1.04 \n", "C | 14.6 | 1.00\n", "C++ | 14.4 | 0.99\n", "Fortran | 13.0 | 0.90 " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Execution time versus lines of code" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "image/svg+xml": [ "\n", "\n", "\n", " \n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " \n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "0\n", "\n", "\n", "20\n", "\n", "\n", "40\n", "\n", "\n", "60\n", "\n", "\n", "80\n", "\n", "\n", "0\n", "\n", "\n", "10\n", "\n", "\n", "20\n", "\n", "\n", "30\n", "\n", "\n", "40\n", "\n", "\n", "lines of code\n", "\n", "\n", "cpu time, seconds\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "Python\n", "\n", "\n", "\n", "\n", "\n", "Matlab\n", "\n", "\n", "\n", "\n", "\n", "Julia naive\n", "\n", "\n", "\n", "\n", "\n", "Julia in-place\n", "\n", "\n", "\n", "\n", "\n", "Julia unrolled\n", "\n", "\n", "\n", "\n", "\n", "Fortran\n", "\n", "\n", "\n", "\n", "\n", "C++\n", "\n", "\n", "\n", "\n", "\n", "C\n", "\n", "\n" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "d = readdlm(\"benchmark-data/linecount.asc\",comments=true)\n", "Nx = d[:,1]\n", "plot([d[1,2]], [d[1,1]], label=\"Python\", marker=:circ, color=\"magenta\")\n", "plot!([d[2,2]], [d[2,1]], label=\"Matlab\", marker=:circ, color=\"green\" )\n", "plot!([d[3,2]], [d[3,1]], label=\"Julia naive\", marker=:circ, color=\"orange\")\n", "plot!([d[4,2]], [d[4,1]], label=\"Julia in-place\", marker=:circ, color=\"yellow\")\n", "plot!([d[5,2]], [d[5,1]], label=\"Julia unrolled\", marker=:circ, color=\"red\")\n", "plot!([d[6,2]], [d[6,1]], label=\"Fortran\", marker=:circ, color=\"black\")\n", "plot!([d[7,2]], [d[7,1]], label=\"C++\", marker=:circ, color=\"blue\")\n", "plot!([d[8,2]], [d[8,1]], label=\"C\", marker=:circ, color=\"cyan\")\n", "plot!(xlabel=\"lines of code\", ylabel=\"cpu time, seconds\", xlim=(0,80), ylim=(0,40))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Comments**\n", "\n", "Julia in-place and Julia unrolled clearly hit the sweet spot of low execution time and low line count. \n", "\n", "I included the ``KS module`` definition in the linecount for the Fortran ``ksintegrate`` function, since the module serves as part of the function interface. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Ratio of execution speed to linecount\n", "\n", "Define execution speed as 1/cputime, so speed/linecount = 1/(cputime\\*linecount). Normalize so C = 1." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "image/svg+xml": [ "\n", "\n", "\n", " \n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " \n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "0\n", "\n", "\n", "2\n", "\n", "\n", "4\n", "\n", "\n", "6\n", "\n", "\n", "8\n", "\n", "\n", "10\n", "\n", "\n", "0\n", "\n", "\n", "1\n", "\n", "\n", "2\n", "\n", "\n", "3\n", "\n", "\n", "language\n", "\n", "\n", "speed/linecount (C++ = 1)\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "Python\n", "\n", "\n", "\n", "\n", "\n", "Matlab\n", "\n", "\n", "\n", "\n", "\n", "Julia naive\n", "\n", "\n", "\n", "\n", "\n", "Julia in-place\n", "\n", "\n", "\n", "\n", "\n", "Julia unrolled\n", "\n", "\n", "\n", "\n", "\n", "Fortran\n", "\n", "\n", "\n", "\n", "\n", "C++\n", "\n", "\n", "\n", "\n", "\n", "C\n", "\n", "\n" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "c = d[8,2]*d[8,1] # normalization constant making C = 1\n", "plot([1], [c/(d[1,2]*d[1,1])], label=\"Python\", marker=:circ, color=\"magenta\")\n", "plot!([2], [c/(d[2,2]*d[2,1])], label=\"Matlab\", marker=:circ, color=\"green\" )\n", "plot!([3], [c/(d[3,2]*d[3,1])], label=\"Julia naive\", marker=:circ, color=\"orange\")\n", "plot!([4], [c/(d[4,2]*d[4,1])], label=\"Julia in-place\", marker=:circ, color=\"yellow\")\n", "plot!([5], [c/(d[5,2]*d[5,1])], label=\"Julia unrolled\", marker=:circ, color=\"red\")\n", "plot!([6], [c/(d[6,2]*d[6,1])], label=\"Fortran\", marker=:circ, color=\"black\")\n", "plot!([7], [c/(d[7,2]*d[7,1])], label=\"C++\", marker=:circ, color=\"blue\")\n", "plot!([8], [c/(d[8,2]*d[8,1])], label=\"C\", marker=:circ, color=\"cyan\")\n", "plot!(xlabel=\"language\", ylabel=\"speed/linecount (C++ = 1)\",ylim=(0,3), xlim=(0,10))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Comments**\n", "\n", "Julia in-place is the winner in the speed/linecount metric. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Mathematics of the CNAB2 algorithm\n", "\n", "Start from the Kuramoto-Sivashinsky equation $[0,L]$ with periodic boundary conditions\n", "\n", "\\begin{equation*}\n", "u_t = - u_{xx} - u_{xxxx} - u u_{x}\n", "\\end{equation*}\n", "\n", "on the domain $[0,L_x]$ with periodic boundary conditions and initial condition $u(x,0) = u_0(x)$. We will use a finite Fourier expansion to discretize space and finite-differencing to discretize time, specifically the 2nd-order rank-Nicolson/Adams-Bashforth (CNAB2) timestepping formula. CNAB2 is low-order but straightforward to describe and easy to implement for this simple benchmark.\n", "\n", "Write the KS equation as \n", "\n", "\\begin{equation*}\n", "u_t = Lu + N(u)\n", "\\end{equation*}\n", "\n", "where $Lu = - u_{xx} - u_{xxxx}$ is the linear terms and $N(u) = -u u_{x}$ is the nonlinear term. In practice we'll calculate the $N(u)$ in the equivalent form $N(u) = - 1/2 \\, d/dx \\, u^2$. \n", "\n", "Discretize time by letting $u^n(x) = u(x, n\\Delta t)$ for some small $\\Delta t$. The CNAB2 timestepping forumale approximates $u_t = Lu + N(u)$ at time $t = (n+1/2) \\, dt$ as \n", "\n", "\\begin{equation*}\n", "\\frac{u^{n+1} - u^n}{\\Delta t} = L\\left(u^{n+1} + u^n\\right) + \\frac{3}{2} N(u^n) - \\frac{1}{2} N(u^{n-1})\n", "\\end{equation*}\n", "\n", "\n", "Put the unknown future $u^{n+1}$'s on the left-hand side of the equation and the present $u^{n}$ and past $u^{n+1}$ on the right.\n", "\n", "\\begin{equation*}\n", "\\left(I - \\frac{\\Delta t}{2} L \\right) u^{n+1} = \\left(I + \\frac{\\Delta t}{2}L \\right) u^{n} + \\frac{3 \\Delta t}{2} N(u^n) - \\frac{\\Delta t}{2} N(u^{n-1})\n", "\\end{equation*}\n", "Note that the linear operator $L$ applies to the unknown $u^{n+1}$ on the LHS, but that the nonlinear operator $N$ applies only to the knowns $u^n$ and $u^{n-1}$ on the RHS. This is an *implicit* treatment of the linear terms, which keeps the algorithm stable for large time steps, and an *explicit* treament of the nonlinear term, which makes the timestepping equation linear in the unknown $u^{n+1}$.\n", "\n", "Now we discretize space with a finite Fourier expansion, so that $u$ now represents a vector of Fourier coefficients and $L$ turns into matrix (and a diagonal matrix, since Fourier modes are eigenfunctions of the linear operator). Let matrix $A = (I - \\Delta t/2 \\; L)$, matrix $B = (I + \\Delta t/2 \\; L)$, and let vector $N^n$ be the Fourier transform of a collocation calculation of $N(u^n)$. That is, $N^n$ is the Fourier transform of $- u u_x = - 1/2 \\, d/dx \\, u^2$ calculated at $N_x$ uniformly spaced gridpoints on the domain $[0, L_x]$. \n", "\n", "With the spatial discretization, then the CNAB2 timestepping formula becomes \n", "\n", "\\begin{equation*}\n", "A \\, u^{n+1} = B \\, u^n + \\frac{3 \\Delta t}{2} N^n - \\frac{\\Delta t}{2}N^{n-1}\n", "\\end{equation*}\n", "\n", "This is a simple $Ax=b$ linear algebra problem whose iteration approximates the time-evolution of the Kuramoto-Sivashinksy PDE. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Julia implementations of CNAB2 algorithm\n", "\n", "### Julia naive\n", "\n", "The naive Julia code is pretty much a line-by-line translation of the same thing in Matlab, about 30 lines of code excluding comments and whitespace. Here's a slight modification of the benchmarked algorithm which saves and plots $u(x,t)$ data." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "ksintegrateNaive (generic function with 1 method)" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "using FFTW \n", "\n", "function ksintegrateNaive(u, Lx, dt, Nt, nsave)\n", " \n", " Nx = length(u) # number of gridpoints\n", " x = collect(0:(Nx-1)/Nx)*Lx\n", " kx = vcat(0:Nx/2-1, 0, -Nx/2+1:-1) # integer wavenumbers: exp(2*pi*kx*x/L)\n", " alpha = 2*pi*kx/Lx # real wavenumbers: exp(alpha*x)\n", " D = 1im*alpha; # D = d/dx operator in Fourier space\n", " L = alpha.^2 - alpha.^4 # linear operator -D^2 - D^4 in Fourier space\n", " G = -0.5*D # -1/2 D operator in Fourier space\n", " \n", " Nsave = div(Nt, nsave)+1 # number of saved time steps, including t=0\n", " t = (0:Nsave)*(dt*nsave) # t timesteps\n", " U = zeros(Nsave, Nx) # matrix of u(xⱼ, tᵢ) values\n", " U[1,:] = u # assign initial condition to U\n", " s = 2 # counter for saved data\n", " \n", " # Express PDE as u_t = Lu + N(u), L is linear part, N nonlinear part.\n", " # Then Crank-Nicolson Adams-Bashforth discretization is \n", " # \n", " # (I - dt/2 L) u^{n+1} = (I + dt/2 L) u^n + 3dt/2 N^n - dt/2 N^{n-1}\n", " #\n", " # let A = (I - dt/2 L) \n", " # B = (I + dt/2 L), then the CNAB timestep formula is\n", " # \n", " # u^{n+1} = A^{-1} (B u^n + 3dt/2 N^n - dt/2 N^{n-1}) \n", "\n", " # convenience variables\n", " dt2 = dt/2\n", " dt32 = 3*dt/2;\n", " A_inv = (ones(Nx) - dt2*L).^(-1)\n", " B = ones(Nx) + dt2*L\n", "\n", " Nn = G.*fft(u.*u) # -u u_x (spectral), notation Nn = N^n = N(u(n dt))\n", " Nn1 = copy(Nn) # notation Nn1 = N^{n-1} = N(u((n-1) dt))\n", " u = fft(u) # transform u to spectral\n", "\n", " # timestepping loop\n", " for n = 1:Nt\n", " Nn1 = copy(Nn) # shift nonlinear term in time: N^{n-1} <- N^n\n", " Nn = G.*fft(real(ifft(u)).^2) # compute Nn = -u u_x\n", "\n", " u = A_inv .* (B .* u + dt32*Nn - dt2*Nn1)\n", " \n", " if mod(n, nsave) == 0\n", " U[s,:] = real(ifft(u))\n", " s += 1 \n", " end\n", " end\n", "\n", " t,U\n", "end" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Run the Julia code and plot results" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "image/svg+xml": [ "\n", "\n", "\n", " \n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " \n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "0\n", "\n", "\n", "50\n", "\n", "\n", "100\n", "\n", "\n", "150\n", "\n", "\n", "200\n", "\n", "\n", "0\n", "\n", "\n", "50\n", "\n", "\n", "100\n", "\n", "\n", "150\n", "\n", "\n", "200\n", "\n", "\n", "Kuramoto-Sivashinsky dynamics\n", "\n", "\n", "x\n", "\n", "\n", "t\n", "\n", "\n", "\n", "\n", "\n", " \n", " \n", " \n", "\n", "\n", "\n", "\n", "\n", "-\n", "\n", "\n", "3\n", "\n", "\n", "-\n", "\n", "\n", "2\n", "\n", "\n", "-\n", "\n", "\n", "1\n", "\n", "\n", "0\n", "\n", "\n", "1\n", "\n", "\n", "2\n", "\n", "\n", "3\n", "\n", "\n", "\n" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "Lx = 64*pi\n", "Nx = 1024\n", "dt = 1/16\n", "nsave = 8\n", "Nt = 3200\n", "\n", "x = Lx*(0:Nx-1)/Nx\n", "u = cos.(x) + 0.1*sin.(x/8) + 0.01*cos.((2*pi/Lx)*x);\n", "t,U = ksintegrateNaive(u, Lx, dt, Nt, nsave)\n", "\n", "heatmap(x,t,U, xlim=(x[1], x[end]), ylim=(t[1], t[end]), fillcolor=:bluesreds)\n", "plot!(xlabel=\"x\", ylabel=\"t\", title=\"Kuramoto-Sivashinsky dynamics\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Julia in-place\n", "\n", "The naive Julia code (straight Matlab translation) is slightly slower than Matlab. Can tune the Julia code by \n", " * doing FFTs in-place \n", " * removing temporary vectors in time-stepping loop\n", " \n", "These very simple improvements get the Julia performance slightly faster than C++ within 10% of Fortran.\n" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "ksintegrateInplace (generic function with 1 method)" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "function ksintegrateInplace(u, Lx, dt, Nt, nsave);\n", " u = (1+0im)*u # force u to be complex\n", " Nx = length(u) # number of gridpoints\n", " kx = vcat(0:Nx/2-1, 0:0, -Nx/2+1:-1)# integer wavenumbers: exp(2*pi*kx*x/L)\n", " alpha = 2*pi*kx/Lx # real wavenumbers: exp(alpha*x)\n", " D = 1im*alpha # spectral D = d/dx operator \n", " L = alpha.^2 - alpha.^4 # spectral L = -D^2 - D^4 operator\n", " G = -0.5*D # spectral -1/2 D operator, to eval -u u_x = 1/2 d/dx u^2\n", "\n", " Nsave = div(Nt, nsave)+1 # number of saved time steps, including t=0\n", " t = (0:Nsave)*(dt*nsave) # t timesteps\n", " U = zeros(Nsave, Nx) # matrix of u(xⱼ, tᵢ) values\n", " U[1,:] = u # assign initial condition to U\n", " s = 2 # counter for saved data\n", " \n", " # convenience variables\n", " dt2 = dt/2\n", " dt32 = 3*dt/2\n", " A_inv = (ones(Nx) - dt2*L).^(-1)\n", " B = ones(Nx) + dt2*L\n", " \n", " # compute in-place FFTW plans\n", " FFT! = plan_fft!(u, flags=FFTW.ESTIMATE)\n", " IFFT! = plan_ifft!(u, flags=FFTW.ESTIMATE)\n", "\n", " # compute nonlinear term Nn == -u u_x \n", " Nn = G.*fft(u.^2); # Nn == -1/2 d/dx (u^2) = -u u_x\n", " Nn1 = copy(Nn); # Nn1 = Nn at first time step\n", " FFT!*u;\n", " \n", " # timestepping loop\n", " for n = 1:Nt\n", "\n", " Nn1 .= Nn # shift nonlinear term in time\n", " Nn .= u # put u into Nn in prep for comp of nonlinear term\n", " \n", " IFFT!*Nn; # transform Nn to gridpt values, in place\n", " Nn .= Nn.*Nn; # collocation calculation of u^2\n", " FFT!*Nn; # transform Nn back to spectral coeffs, in place\n", "\n", " Nn .= G.*Nn; # compute Nn == -1/2 d/dx (u^2) = -u u_x\n", "\n", " # loop fusion! Julia translates the folling line of code to a single for loop. \n", " u .= A_inv .* (B .* u .+ dt32.*Nn .- dt2.*Nn1); \n", " \n", " if mod(n, nsave) == 0\n", " U[s,:] = real(ifft(u))\n", " s += 1 \n", " end\n", " end\n", " \n", " t,U\n", "end" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "image/svg+xml": [ "\n", "\n", "\n", " \n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " \n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "0\n", "\n", "\n", "50\n", "\n", "\n", "100\n", "\n", "\n", "150\n", "\n", "\n", "200\n", "\n", "\n", "0\n", "\n", "\n", "50\n", "\n", "\n", "100\n", "\n", "\n", "150\n", "\n", "\n", "200\n", "\n", "\n", "Kuramoto-Sivashinsky dynamics\n", "\n", "\n", "x\n", "\n", "\n", "t\n", "\n", "\n", "\n", "\n", "\n", " \n", " \n", " \n", "\n", "\n", "\n", "\n", "\n", "-\n", "\n", "\n", "3\n", "\n", "\n", "-\n", "\n", "\n", "2\n", "\n", "\n", "-\n", "\n", "\n", "1\n", "\n", "\n", "0\n", "\n", "\n", "1\n", "\n", "\n", "2\n", "\n", "\n", "3\n", "\n", "\n", "\n" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "t,U = ksintegrateInplace(u, Lx, dt, Nt, nsave)\n", "heatmap(x,t,U, xlim=(x[1], x[end]), ylim=(t[1], t[end]), xlabel=\"x\", ylabel=\"t\", \n", "title=\"Kuramoto-Sivashinsky dynamics\", fillcolor=:bluesreds)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Julia unrolled\n", "\n", "The [Julia unrolled code](codes/ksintegrateUnrolled.jl) uses in-place FFTs and unrolls several time-stepping vector operations into for loops." ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "## Benchmark codes, all languages\n", "\n", "Here are the benchmark codes, which include both the integration algorithm and a driver program to run and time the algorithm at a given $N_x$. I haven't bothered to automate the running of the benchmark codes or the production of the benchmark data files. I run them manually as follows, from within the ``codes`` directory. \n", "\n", "**Python:** [ksbenchmark.py](codes/ksbenchmark.py) From an interactive python shell run \n", "\n", "```\n", "Python 2.7.13 (default, Mar 22 2017, 12:31:17) [GCC] \n", "IPython 3.2.2 -- An enhanced Interactive Python.\n", "In [1]: execfile(\"ksbenchmark.py\")\n", "\n", "In [2]: ksbenchmark(512)\n", "```\n", "\n", "**Matlab:** [ksbenchmark.m](codes/ksbenchmark.m) From a Matlab prompt \n", "\n", "```\n", ">> ksbenchmark(512)\n", "```\n", "\n", "\n", "**Julia:** [ksbenchmark.jl](codes/ksbenchmark.jl) At the Julia REPL run \n", "\n", "```\n", "julia> include(\"ksbenchmark.jl\")\n", "julia> ksbenchmark(512, ksintegrateNaive)\n", "```\n", "\n", "etc. for ``ksintegrateInplace`` and ``ksintegrateUnrolled``. Or, better, use Julia's benchmarking tools, which compute a more reliable timing that the simple average-of-Nruns tic-toc approach used in my ``ksbenchmark`` function. \n", "\n", "```\n", "julia> using BenchmarkTools\n", "julia> @benchmark ksintegrateNaive(ksinitconds(512)...)\n", "```\n", "\n", "**C:** [ksbenchmark.c](codes/ksbenchmark.c) At Unix prompt \n", "\n", "```\n", "bash$ clang -O3 -o ksbenchmark-c ksbenchmark.c -lfftw3 -lm\n", "bash$ ksbenchmark-c 512\n", "```\n", "\n", "**C++:** [ksbenchmark.cpp](codes/ksbenchmark.cpp) At Unix prompt\n", "\n", "```\n", "bash$ clang -O3 -o ksbenchmark-c++ -lfftw3 -lm\n", "bash$ ksbenchmark-c++ 512\n", "```\n", "\n", "**Fortran:** [ksbenchmark.f90](codes/ksbenchmark.f90) Edit the file to set $N_x$, then at Unix prompt\n", "\n", "```\n", "bash$ gfortran -O3 -o ksbenchmark-f90 -lfftw3\n", "bash$ ksbenchmark-f90\n", "```\n", "\n", "Details about optimization flags and compiler, language, library versions are listed in the [cputime.asc](benchmark-data/cputime.asc) data file. " ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "\n", "## Room for improvement\n", "\n", " * Automate the build and benchmarking process\n", " * Implement more highly-tuned/sophisticated approaches\n", " * Numba and Dedalus codes for Python\n", " * ``DifferentialEquations.jl + ApproxFun.jl`` code for Julia\n", " * real-to-complex versions wherever possible\n", " * multithreaded versions\n", " * Analysis of the overhead costs.\n", " * Do something similar with a 2d or 3d PDE and distributed-memory computations. \n", "\n", "\n", "## Acknowledgements\n", "\n", "Thanks to \n", "\n", " * Mahtab Lak, University of New Hampshire Mathematics. Her Ph.D. Minor Project in Applied Math formed the basis for this work. \n", " * Ashley Willis, University of Sheffield, wrote the Fortran 90 code. \n", " * David Sanders, Universidad Nacional Autónoma de México, for encouraging me to put this notebook together.\n", " * Chris Rackauckas, University of California Irvine, for Julia performance tips.\n", " * Sheehan Olver, University of Sydney, and Chris Rackauckas, for Julia Discourse discussions on strategies for implementing HPC PDE codes in Julia. \n", " * traktofon@github for fixing the Fortran code in numerous ways\n", " \n", "John F. Gibson,\n", "Dept. Mathematics and Statistics,\n", "University of New Hampshire" ] } ], "metadata": { "kernelspec": { "display_name": "Julia 1.2.0", "language": "julia", "name": "julia-1.2" }, "language_info": { "file_extension": ".jl", "mimetype": "application/julia", "name": "julia", "version": "1.2.0" } }, "nbformat": 4, "nbformat_minor": 2 }