{
 "metadata": {
  "name": "",
  "signature": "sha256:2ef7f69dc3a748f5ef6e65a754eb56be463b690a83224a1b48ce16fdf8adfc21"
 },
 "nbformat": 3,
 "nbformat_minor": 0,
 "worksheets": [
  {
   "cells": [
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "# Performance\n",
      "\n",
      "How small do my tasks need to be (aka how fast is IPython)?\n",
      "\n",
      "In parallel computing, an important relationship to keep in mind is the\n",
      "ratio of computation to communication. In order for your simulation to\n",
      "perform reasonably, you must keep this ratio high. When testing out a\n",
      "new tool like IPython, it is important to examine the limit of\n",
      "granularity that is appropriate. If it takes half a second of overhead\n",
      "to run each task, then breaking your work up into millisecond chunks\n",
      "isn't going to make sense.\n",
      "\n",
      "Basic imports to use later, create a Client, and a LoadBalancedView of all the engines."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "%matplotlib inline"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 1
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import time\n",
      "import numpy as np\n",
      "\n",
      "from IPython.parallel import Client\n",
      "\n",
      "rc = Client()\n",
      "view = rc.load_balanced_view()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 2
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "## Latency\n",
      "\n",
      "Sending and receiving tiny messages gives us a sense of the minimum time\n",
      "IPython must spend building and sending messages around. This should\n",
      "give us a sense of the *minimum* overhead of the communication system.\n",
      "\n",
      "This should give us a sense of the lower limit on available granularity."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": true,
     "input": [
      "def test_latency(v, n):\n",
      "    tic = time.time()\n",
      "    echo = lambda x: x\n",
      "    tic = time.time()\n",
      "    for i in xrange(n):\n",
      "        v.apply_async(echo, '')\n",
      "    toc = time.time()\n",
      "    v.wait()\n",
      "    tac = time.time()\n",
      "    sent = toc-tic\n",
      "    roundtrip = tac-tic\n",
      "    return sent, roundtrip"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 3
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "for n in [8,16,32,64,128,256,512,1024]:\n",
      "    # short rest between tests\n",
      "    time.sleep(0.5)\n",
      "    s,rt = test_latency(view, n)\n",
      "    print \"%4i %6.1f %6.1f\" % (n,n/s,n/rt)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "   8 1336.0  233.0\n",
        "  16 1182.2  202.0"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\n",
        "  32 1378.5  271.1"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\n",
        "  64 1322.6  277.6"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\n",
        " 128  661.4  220.1"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\n",
        " 256 1218.5  278.8"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\n",
        " 512 1248.1  277.6"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\n",
        "1024 1249.6  288.3"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\n"
       ]
      }
     ],
     "prompt_number": 4
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "for n in [8,16,32,64,128,256,512,1024]:\n",
      "    # short rest between tests\n",
      "    time.sleep(0.5)\n",
      "    s,rt = test_latency(view, n)\n",
      "    print \"%4i %6.1f %6.1f\" % (n,n/s,n/rt)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "   8  969.3   97.8\n",
        "  16 1084.1  198.7"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\n",
        "  32 1251.3  261.6"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\n",
        "  64  795.2  208.2"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\n",
        " 128  831.3  282.1"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\n",
        " 256  667.4  172.6"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\n",
        " 512  438.0  165.9"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\n",
        "1024  645.9  175.9"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\n"
       ]
      }
     ],
     "prompt_number": 5
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "<img src=\"files/figs/latency.png\" width=\"600px\"/>\n",
      "\n",
      "These tests were run on the loopback interface on a fast 8-core machine\n",
      "with 4 engines and slightly tuned non-default config (msgpack for serialization, TaskScheduler.hwm=0).\n",
      "\n",
      "The hwm optimization is the most important for performance of these benchmarks.\n",
      "\n",
      "\n",
      "The tests were done with the Python scheduler and pure-zmq scheduler,\n",
      "and with/without an SSH tunnel. We can see that the Python scheduler can\n",
      "do about 800 tasks/sec, while the pure-zmq scheduler gets an extra\n",
      "factor of two, at around 1.5k tasks/sec roundtrip. Purely outgoing - the\n",
      "time before the Client code can go on working, is closer to 4k msgs/sec\n",
      "sent. Using an SSH tunnel does not significantly impact performance, as\n",
      "long as you have a few tasks to line up.\n",
      "\n",
      "Running the same test on a dedicated cluster with up to 128 CPUs shows\n",
      "that IPython does scale reasonably well.\n",
      "\n",
      "<img src=\"files/figs/latency2.png\"/>"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "## Throughput\n",
      "\n",
      "Echoing numpy arrays is similar to the latency test, but scaling the\n",
      "array size instead of the number of messages tests the limits when there\n",
      "is data to be transferred."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": true,
     "input": [
      "def test_throughput(v, n, s):\n",
      "    A = np.random.random(s/8) # doubles are 8B\n",
      "    tic = time.time()\n",
      "    echo = lambda x: x\n",
      "    tic = time.time()\n",
      "    for i in xrange(n):\n",
      "        v.apply_async(echo, A)\n",
      "    toc = time.time()\n",
      "    v.wait()\n",
      "    tac = time.time()\n",
      "    sent = toc-tic\n",
      "    roundtrip = tac-tic\n",
      "    return sent, roundtrip"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 6
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "n = 128\n",
      "for sz in [1e1,1e2,1e3,1e4,1e5,5e5,1e6,2e6]:\n",
      "    # short rest between tests\n",
      "    time.sleep(1)\n",
      "    s,rt = test_throughput(view, n, int(sz))\n",
      "    print \"%8i  %6.1f t/s  %6.1f t/s %9.3f Mbps\" % (sz,n/s,n/rt, 1e-6*sz*n/rt)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "      10  1125.6 t/s   285.6 t/s     0.003 Mbps\n",
        "     100   967.2 t/s   281.8 t/s     0.028 Mbps"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\n",
        "    1000  1246.3 t/s   281.6 t/s     0.282 Mbps"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\n",
        "   10000  1285.6 t/s   265.3 t/s     2.653 Mbps"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\n",
        "  100000   404.8 t/s   206.1 t/s    20.606 Mbps"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\n",
        "  500000   294.2 t/s   159.6 t/s    79.822 Mbps"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\n",
        " 1000000   328.4 t/s   108.8 t/s   108.802 Mbps"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\n",
        " 2000000   425.9 t/s    83.4 t/s   166.755 Mbps"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\n"
       ]
      }
     ],
     "prompt_number": 7
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "n = 128\n",
      "for sz in [1e1,1e2,1e3,1e4,1e5,5e5,1e6,2e6]:\n",
      "    # short rest between tests\n",
      "    time.sleep(1)\n",
      "    s,rt = test_throughput(view, n, int(sz))\n",
      "    print \"%8i  %6.1f t/s  %6.1f t/s %9.3f Mbps\" % (sz,n/s,n/rt, 1e-6*sz*n/rt)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "      10  1278.9 t/s   303.4 t/s     0.003 Mbps\n",
        "     100  1339.5 t/s   301.8 t/s     0.030 Mbps"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\n",
        "    1000   784.0 t/s   265.0 t/s     0.265 Mbps"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\n",
        "   10000  1083.7 t/s   274.0 t/s     2.740 Mbps"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\n",
        "  100000   336.1 t/s   143.6 t/s    14.357 Mbps"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\n",
        "  500000   289.4 t/s   150.9 t/s    75.453 Mbps"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\n",
        " 1000000   226.7 t/s   102.6 t/s   102.639 Mbps"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\n",
        " 2000000   368.9 t/s    70.2 t/s   140.495 Mbps"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\n"
       ]
      }
     ],
     "prompt_number": 8
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "<img src=\"files/figs/throughput1.png\" width=\"600px\"/>\n",
      "\n",
      "Note that the dotted lines, which measure the time it took to *send* the\n",
      "arrays is *not* a function of the message size. This is again thanks to\n",
      "pyzmq's non-copying sends. Locally, we can send 100 4MB arrays in ~50\n",
      "ms, and libzmq will take care of actually transmitting the data while we\n",
      "can go on working.\n",
      "\n",
      "Plotting the same data, scaled by message size shows that we are\n",
      "saturating the connection at ~1Gbps with ~10kB messages when using\n",
      "SSH, and ~10Gbps with ~50kB messages when not using SSH.\n",
      "\n",
      "<img src=\"files/figs/throughput2.png\" width=\"600px\"/>"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "## Map\n",
      "\n",
      "Another useful test is seeing how fast \\`view.map\\` is, for various\n",
      "numbers of tasks and for tasks of varying size.\n",
      "\n",
      "These tests were done on [AWS](http://aws.amazon.com/) extra-large\n",
      "instances with the help of\n",
      "[StarCluster](http://web.mit.edu/stardev/cluster/), so the IO and CPU\n",
      "performance are quite low compared to a physical cluster."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": true,
     "input": [
      "def test_map(v,dt,n):\n",
      "    ts = [dt]*n\n",
      "    tic = time.time()\n",
      "    amr = v.map_async(time.sleep, ts)\n",
      "    toc = time.time()\n",
      "    amr.get()\n",
      "    tac = time.time()\n",
      "    sent = toc-tic\n",
      "    roundtrip = tac-tic\n",
      "    return sent, roundtrip"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 9
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "n = len(rc.ids) * 16\n",
      "for dt in np.logspace(-3,0,7):\n",
      "    time.sleep(0.5)\n",
      "    s,rt = test_map(view, dt, n)\n",
      "    print \"%4ims %5.1f%%\" % (1000*dt, 1600*dt / rt)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "   1ms   7.6%\n",
        "   3ms  17.5%"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\n",
        "  10ms  63.1%"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\n",
        "  31ms  85.5%"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\n",
        " 100ms  95.8%"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\n",
        " 316ms  98.7%"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\n",
        "1000ms  99.6%"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\n"
       ]
      }
     ],
     "prompt_number": 10
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "<img src=\"files/figs/map.png\">\n",
      "\n",
      "This shows runs for jobs ranging from 1 to 128 ms, on 4,31,and 63\n",
      "engines. On this system, millisecond jobs are clearly too small, but by\n",
      "the time individual tasks are \\> 100 ms, IPython overhead is negligible.\n",
      "\n",
      "Now let's see how we use it for remote execution."
     ]
    }
   ],
   "metadata": {}
  }
 ]
}