{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Updating a dictionary\n", "\n", "Often, we want to create a copy of a dictionary and update it. For example, we have a default set of parameters. We want to update them, but without disturbing the original parameter list.\n", "\n", "There are 2 ways of doing this.\n", "\n", "1. Make a copy of the dictionary and update it with the new dictionary. **This is twice as fast**\n", "2. Add the `.items()` list if both dictionaries, and make a new `dict` out of it.\n", "\n", "Here is the benchmark:\n", "\n", " update append\n", " 1. make a copy and update 1.58µs 1.66µs\n", " 2. add .items() and dictify 2.81µs 3.36µs" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1000000 loops, best of 3: 1.58 µs per loop\n", "1000000 loops, best of 3: 1.66 µs per loop\n", "100000 loops, best of 3: 2.81 µs per loop\n", "100000 loops, best of 3: 3.36 µs per loop\n" ] } ], "source": [ "base = {x:x for x in range(20)}\n", "same = {x:x for x in range(20)}\n", "incr = {x:x for x in range(20, 40)}\n", "\n", "%timeit y=dict(base); y.update(same)\n", "%timeit y=dict(base); y.update(incr)\n", "\n", "base = base.items()\n", "same = same.items()\n", "incr = incr.items()\n", "\n", "%timeit dict(base + same)\n", "%timeit dict(base + incr)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Large string creation\n", "\n", "Array joins are faster than successive appending\n", "\n", " 1,000 10,000 100,000 1,000,000\n", " appending 226µs 2.87ms 15.8ms 484ms\n", " array joins 116µs 1.11ms 11.2ms 146ms" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1000 concatenations\n", "1000 loops, best of 3: 226 µs per loop\n", "10000 loops, best of 3: 116 µs per loop\n", "10000 concatenations\n", "1000 loops, best of 3: 2.87 ms per loop\n", "1000 loops, best of 3: 1.11 ms per loop\n", "100000 concatenations\n", "100 loops, best of 3: 15.8 ms per loop\n", "100 loops, best of 3: 11.2 ms per loop\n", "1000000 concatenations\n", "1 loops, best of 3: 484 ms per loop\n", "10 loops, best of 3: 146 ms per loop\n" ] } ], "source": [ "def string_append(s, count):\n", " result = ''\n", " for x in range(count):\n", " result += s\n", "\n", "def array_join(s, count):\n", " result = []\n", " for x in range(count):\n", " result.append(s)\n", " result = ''.join(result)\n", "\n", "for count in [1000, 10000, 100000, 1000000]:\n", " print count, 'concatenations'\n", " %timeit string_append('abc', count)\n", " %timeit array_join('abc', count)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Date parsing\n", "\n", "Date parsing in Python is quite slow, especially for large arrays.\n", "Here's a benchmark of various approaches.\n", "\n", " to_datetime: 7740 ms\n", " dateutil: 6970 ms\n", " strptime: 1660 ms\n", " manual: 253 ms\n", " lookup: 9 ms\n", "\n", "Manual string-array-based parsing of dates is significantly faster.\n", "If there aren't many dates, lookups are *MUCH* faster." 
] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import time\n", "import datetime\n", "import dateutil.parser\n", "import pandas as pd\n", "\n", "s = pd.Series(['01-31-2012']*100000)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1 loops, best of 3: 7.74 s per loop\n" ] } ], "source": [ "# Use Pandas' built-in to_datetime\n", "%timeit pd.to_datetime(s)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1 loops, best of 3: 6.97 s per loop\n" ] } ], "source": [ "# Use dateutil.parser\n", "%timeit s.apply(dateutil.parser.parse)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1 loops, best of 3: 1.66 s per loop\n" ] } ], "source": [ "# Parse using datetime.strptime\n", "%timeit s.apply(lambda v: datetime.datetime.strptime(v, '%m-%d-%Y'))" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1 loops, best of 3: 253 ms per loop\n" ] } ], "source": [ "# Manually parse the date\n", "%timeit s.apply(lambda v: datetime.datetime(int(v[6:10]), int(v[0:2]), int(v[3:5])))" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "100 loops, best of 3: 9.23 ms per loop\n" ] } ], "source": [ "def lookup(s):\n", " \"\"\"\n", " This is an extremely fast approach to datetime parsing.\n", " For large data, the same dates are often repeated. Rather than\n", " re-parse these, we store all unique dates, parse them, and\n", " use a lookup to convert all dates.\n", " \"\"\"\n", " return s.map({date:pd.to_datetime(date) for date in s.unique()})\n", "\n", "%timeit lookup(s)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Mean vs Median\n", "\n", "Mean is *much (~100 times) faster* to calculate than median." 
] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import time\n", "import numpy\n", "\n", "data = numpy.random.rand(50000000)" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "10 loops, best of 3: 61.7 ms per loop\n" ] } ], "source": [ "timeit numpy.mean(data)" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1 loops, best of 3: 6.13 s per loop\n" ] } ], "source": [ "timeit numpy.median(data)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Reading data\n", "\n", "HDF5 is the fastest way of reading tabular data.\n", "\n", " csv.DictReader: 2.78 s\n", " pickle: 2.41 s\n", " json: 2.39 s\n", " json-array: 799 ms\n", " csv.reader: 478 ms\n", " pd.read_csv 355 ms\n", " pd.read_pickle: 319 ms\n", " pd.read_hdf (table) 169 ms\n", " pd.read_hdf (stored) 123 ms" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# First, create a set of data files\n", "words = 'ad adipisicing aliqua aliquip amet anim aute cillum commodo consectetur consequat culpa cupidatat deserunt do dolor dolore duis ea eiusmod elit enim esse est et eu ex excepteur exercitation fugiat id in incididunt ipsum irure labore laboris laborum lorem magna minim mollit nisi non nostrud nulla occaecat officia pariatur proident qui quis reprehenderit sed sint sit sunt tempor ullamco ut velit veniam voluptate'.split()\n", "\n", "# Create the data in memory\n", "data = []\n", "for row in range(0, 1000000):\n", " data.append({\n", " 'A': words[row % len(words)],\n", " 'B': chr(64 + (row % 62)),\n", " 'C': row,\n", " 'D': row + 1,\n", " 'E': row + 2,\n", " 'F': row + 3,\n", " })\n", "\n", "# Save CSV\n", "import csv\n", "keys = sorted(data[0].keys())\n", "out = csv.DictWriter(open('sample.data.csv', 'w'),\n", " fieldnames=keys,\n", " lineterminator='\\n')\n", "out.writerow(dict(zip(keys, keys)))\n", "out.writerows(data)\n", "\n", "# Save JSON\n", "import json\n", "json.dump(data, open('sample.data.json', 'w'), separators= (',', ':'))\n", "\n", "# Save JSON-array\n", "import json\n", "json.dump([data[0].keys()] + [row.values() for row in data],\n", " open('sample.data-array.json', 'w'),\n", " separators= (',', ':'))\n", "\n", "# Save pickle\n", "import cPickle as pickle\n", "pickle.dump(data, open('sample.data.pickle', 'wb'), pickle.HIGHEST_PROTOCOL)\n", "\n", "# Save pandas pickle\n", "import pandas as pd\n", "df = pd.DataFrame(data, columns=data[0].keys())\n", "df.to_pickle('sample.data.pandas')\n", "\n", "# Save HDF5\n", "df.to_hdf('sample.data.h5', 'stored')\n", "df.to_hdf('sample.data.h5', 'table', table=True)" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1 loops, best of 3: 2.78 s per loop\n", "1 loops, best of 3: 2.41 s per loop\n", "1 loops, best of 3: 2.39 s per loop\n", "1 loops, best of 3: 799 ms per loop\n", "1 loops, best of 3: 478 ms per loop\n", "1 loops, best of 3: 355 ms per loop\n", "1 loops, best of 3: 319 ms per loop\n", "10 loops, best of 3: 169 ms per loop\n", "10 loops, best of 3: 123 ms per loop\n" ] } ], "source": [ "import time\n", "import csv\n", "import json\n", "import cPickle as pickle\n", "import pandas as pd\n", "\n", "%timeit 
list(csv.DictReader(open('sample.data.csv')))\n", "%timeit pickle.load(open('sample.data.pickle', 'rb'))\n", "%timeit json.load(open('sample.data.json'))\n", "%timeit json.load(open('sample.data-array.json'))\n", "%timeit list(csv.reader(open('sample.data.csv')))\n", "%timeit pd.read_csv('sample.data.csv')\n", "%timeit pd.read_pickle('sample.data.pandas')\n", "%timeit pd.read_hdf('sample.data.h5', 'table')\n", "%timeit pd.read_hdf('sample.data.h5', 'stored')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Templates vs lxml vs cElementTree\n", "\n", "This is the time taken to generate a bar chart, in µs. The output can be either xml (an etree) or text.\n", "\n", " xml text\n", " template 68 35\n", " lxml 73 87\n", " cElementTree 23 247\n", "\n", "For string output, tornado templates are extremely fast. Even for etree output, parsing the template result with `lxml.fromstring` is faster than building the tree with lxml directly. However, if you only want etree output and not a string, cElementTree is fastest.\n", "\n", "To me, the template approach with lxml.fromstring appears optimal." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "\n" ], "text/plain": [ "" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from IPython.display import HTML\n", "from tornado import template\n", "\n", "using_template = template.Template('''\n", "<svg width=\"100\" height=\"50\">\n", " {% for i, x in enumerate(series) %}\n", " <rect x=\"{{ 10 * i }}\" width=\"10\" y=\"{{ 50 - 10 * x }}\" height=\"{{ 10 * x }}\" fill=\"#88f\" stroke=\"#fff\"/>\n", " {% end %}\n", "</svg>\n", "''', autoescape=None).generate\n", "\n", "HTML(using_template(series=[1,2,3,4,3,2,1]))" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from lxml import etree\n", "\n", "def using_lxml(series):\n", " root = etree.Element('svg', width=\"100\", height=\"50\")\n", " for i, x in enumerate(series):\n", " rect = etree.SubElement(root, 'rect',\n", " x = '%d' % (10 * i),\n", " width = '10',\n", " y = '%d' % (50 - 10 * x),\n", " height = '%d' % (10 * x),\n", " fill = '#88f',\n", " stroke = '#fff')\n", " return root\n", "\n", "HTML(etree.tostring(using_lxml(series=[1,2,3,4,3,2,1])))" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import xml.etree.cElementTree as cElementTree\n", "\n", "def using_cElementTree(series):\n", " root = cElementTree.Element('svg', width=\"100\", height=\"50\")\n", " for i, x in enumerate(series):\n", " rect = cElementTree.SubElement(root, 'rect',\n", " x = '%d' % (10 * i),\n", " width = '10',\n", " y = '%d' % (50 - 10 * x),\n", " height = '%d' % (10 * x),\n", " fill = '#88f',\n", " stroke = '#fff')\n", " return root\n", "\n", "HTML(cElementTree.tostring(using_cElementTree(series=[1,2,3,4,3,2,1])))" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "10000 loops, best of 3: 67.5 µs per loop\n", "10000 loops, best of 3: 72.8 µs per loop\n", "10000 loops, best of 3: 22.6 µs per loop\n", "10000 loops, best of 3: 34.6 µs per loop\n", "10000 loops, best of 3: 87.2 µs per loop\n", "1000 loops, best of 3: 247 µs per loop\n"
] } ], "source": [ "# Create etree output\n", "%timeit etree.fromstring(using_template(series=[1,2,3,4,3,2,1]))\n", "%timeit using_lxml(series=[1,2,3,4,3,2,1])\n", "%timeit using_cElementTree(series=[1,2,3,4,3,2,1])\n", "\n", "# Create string output\n", "%timeit using_template(series=[1,2,3,4,3,2,1])\n", "%timeit etree.tostring(using_lxml(series=[1,2,3,4,3,2,1]))\n", "%timeit cElementTree.tostring(using_cElementTree(series=[1,2,3,4,3,2,1]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Range search\n", "\n", "Here, we're trying to find where a value fits in a list of numbers. For example, in the list [1, 3, 7, 9], the number 4 would be just after the 2nd element 3.\n", "\n", "The summary is: use `numpy.searchsorted()` -- it's blazingly fast.\n", "\n", " 37,000 µs For loop \n", " 5,790 µs Numpy filtering\n", " 2,270 µs Numpy filtering on sorted values\n", " 1,850 µs Numpy index search on sorted values\n", " 1 µs numpy.searchsorted()\n", "\n", "Having read [this post](http://blog.clifreeder.com/blog/2013/04/21/ruby-is-too-slow-for-programming-competitions/) on Ruby being slow, I thought I'd check the same with Python. I got it running fairly fast, but there was one piece that was taking a fair bit of time: *counting numbers in a range*. Here's the slow version:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "10 loops, best of 3: 37 ms per loop\n" ] } ], "source": [ "values = range(1000000)\n", "def count(values, a, b):\n", " count = 0\n", " for x in values:\n", " if a <= x <= b:\n", " count += 1\n", " return count\n", "\n", "%timeit count(values, 250000, 750000)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Of course, running a loop in Python for numbers is never a good idea. Let's move this to NumPy." ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "100 loops, best of 3: 5.79 ms per loop\n" ] } ], "source": [ "values = numpy.random.rand(1000000)\n", "%timeit ((.25 <= values) & (values <= .75)).sum()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "That's not bad, but it could get a lot better. First, let's sort the values and try it." ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "100 loops, best of 3: 2.27 ms per loop\n" ] } ], "source": [ "values.sort()\n", "%timeit ((.25 <= values) & (values <= .75)).sum()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Just like that, it's faster. But we can do much better. Given that it's already sorted, what if we just found the index?" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1000 loops, best of 3: 1.85 ms per loop\n" ] } ], "source": [ "%timeit (values <= .75).argmin() - (.25 <= values).argmax()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A bit faster. It's wasteful of memory, though -- having to create 2 new arrays just to find the position of these two numbers. What if we searched for these?" 
] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1000000 loops, best of 3: 1.45 µs per loop\n" ] } ], "source": [ "%timeit numpy.searchsorted(values, .75) - numpy.searchsorted(values, .25)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "That's 1.45 *micro*seconds. It's *25 thousand* times faster than the original code, and *four thousand* times faster than the original NumPy code.\n", "\n", "If there's one thing I keep re-learning, it's that there's always a faster way of doing it, and if you really want to, you'll probably find it." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Next power of 10\n", "\n", "The next power of 10 for 4 is 10^1. For 40, it's 10^2. For 400, it's 10^3. For 0.04, it's 10^-1. And so on.\n", "\n", "Most methods of calculating it are fast enough." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [], "source": [ "data = pd.Series(10 ** (6 * np.random.rand(10000) - 3))" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "100 loops, best of 3: 16.3 ms per loop\n" ] } ], "source": [ "def iterative(v):\n", " i = 1\n", " if v > 1:\n", " n = 0\n", " while i < v:\n", " i, n = i * 10, n + 1\n", " else:\n", " n = 1\n", " while i > v:\n", " i, n = i / 10., n - 1\n", " return n\n", "\n", "%timeit data.apply(iterative)" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "10000 loops, best of 3: 124 µs per loop\n" ] } ], "source": [ "%timeit numpy.ceil(numpy.log10(data))" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "100 loops, best of 3: 19.2 ms per loop\n" ] } ], "source": [ "%timeit data.apply(lambda v: numpy.ceil(numpy.log10(v)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Hierarchical subtotals\n", "\n", "Given a DataFrame like this:\n", "\n", " A B C val\n", " X X X 0\n", " X X Y 1\n", " X Y X 2\n", " X Y Y 3\n", " Y X X 4\n", " Y X Y 5\n", " Y Y X 6\n", " Y Y Y 7\n", "\n", "... 
create a DataFrame like this, with subtotals.\n", "\n", " A B C val level\n", " na na na 28 0\n", " X na na 6 1\n", " X X na 1 2\n", " X X X 0 3\n", " X X Y 1 3\n", " X Y na 5 2\n", " X Y X 2 3\n", " X Y Y 3 3\n", " Y na na 22 1\n", " Y X na 9 2\n", " Y X X 4 3\n", " Y X Y 5 3\n", " Y Y na 13 2\n", " Y Y X 6 3\n", " Y Y Y 7 3" ] }, { "cell_type": "code", "execution_count": 57, "metadata": { "collapsed": false }, "outputs": [], "source": [ "data = pd.DataFrame({'A': list('XXXXYYYY'), 'B': list('XXYYXXYY'), 'C': list('XYXYXYXY'), 'val': range(8)})" ] }, { "cell_type": "code", "execution_count": 85, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " level val\n", "A B C \n", "X 1 6\n", " X 2 1\n", " X 3 0\n", " Y 3 1\n", " Y 2 5\n", " X 3 2\n", " Y 3 3\n", "Y 1 22\n", " X 2 9\n", " X 3 4\n", " Y 3 5\n", " Y 2 13\n", " X 3 6\n", " Y 3 7\n" ] } ], "source": [ "groups = ['A', 'B', 'C']\n", "\n", "def subtotal(data, groups, agg):\n", " frames = []\n", " for level in range(1, 1 + len(groups)):\n", " frame = data.groupby(groups[:level], sort=False, as_index=False).agg(agg)\n", " frame['level'] = level\n", " frames.append(frame)\n", " df = pd.concat(frames)\n", " for group in groups:\n", " df[group].fillna('', inplace=True)\n", " return df.sort(groups).set_index(groups)\n", "\n", "print subtotal(data, groups=groups, agg={'val': 'sum'})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is faster than the existing `layout.hierarchy`" ] }, { "cell_type": "code", "execution_count": 91, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "10 loops, best of 3: 23.9 ms per loop\n", "1 loops, best of 3: 294 ms per loop\n" ] } ], "source": [ "import layout\n", "odi = pd.read_csv('d:/site/gramener.com/viz/autolyse/data/odi-batting.csv', dtype={'Runs':float})\n", "groups = ['Weekday', 'Country', 'Player']\n", "agg = {'Runs': 'sum'}\n", "%timeit subtotal(odi, groups, agg)\n", "%timeit list(layout.hierarchy(odi, groups, agg=agg, size=lambda df: df['Runs'].sum()))" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# stack(series, groupby)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Numba\n", "\n", "I'm trying to see how fast numba is. `autojit(fn)` makes `fn` faster. `numpy.sum` and `@autojit` take about the same time. Python loops are much slower.\n", "\n", "Looks like @autojit is a decent replacement for `numpy.vectorize`." 
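, "\n", "\n", "(`autojit` also works as a decorator; the next cell is a small sketch of that form with a hypothetical `decorated_sum`, equivalent to the explicit `autojit(fn)` call in the benchmark below.)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# A minimal sketch, not part of the original benchmark: the decorator form\n", "# is equivalent to wrapping the function explicitly with autojit(fn), as the\n", "# next cell does. Assumes the same old-style numba autojit API.\n", "from numba import autojit\n", "\n", "@autojit\n", "def decorated_sum(arr):\n", "    M, N = arr.shape\n", "    result = 0.0\n", "    for i in range(M):\n", "        for j in range(N):\n", "            result += arr[i, j]\n", "    return result"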
] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from numba import autojit\n", "\n", "def slow_sum(arr):\n", " M, N = arr.shape\n", " result = 0.0\n", " for i in range(M):\n", " for j in range(N):\n", " result += arr[i,j]\n", " return result\n", "\n", "fast_sum = autojit(slow_sum)" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "10 loops, best of 3: 20.5 ms per loop\n", "1 loops, best of 3: 25 ms per loop\n", "1 loops, best of 3: 426 ms per loop\n" ] } ], "source": [ "%timeit numpy.sum(numpy.random.rand(1000,1000))\n", "%timeit fast_sum(numpy.random.rand(1000,1000))\n", "%timeit slow_sum(numpy.random.rand(1000,1000))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Shifting a Numpy array\n", "\n", "If you have an array like this: `[3,4,5,6,7]` and you want to move it to the right dropping the last, filling left with nans: `[nan,3,4,5,6]`, what's the fastest way?\n", "\n", "Answer:\n", "\n", " result = numpy.roll(array, 1)\n", " result[0] = numpy.nan" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "10000 loops, best of 3: 127 µs per loop\n", "100000 loops, best of 3: 14.3 µs per loop\n" ] } ], "source": [ "data = numpy.random.rand(1001)\n", "\n", "%timeit result = numpy.insert(data, 0, numpy.nan)[:-2]\n", "%timeit result = numpy.roll(data, 1); result[:1] = numpy.nan" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Voronoi diagrams\n", "\n", "[Voronoi diagrams](http://en.wikipedia.org/wiki/Voronoi_diagram) takes a set of points, and creates polygons enclosing the space closer to each point than any other. This is the dual of [Delaunay triangulation](http://en.wikipedia.org/wiki/Delaunay_triangulation), which matplotlib and scipy provide by default, and can also be created directly on NumPy.\n", "\n", "Here's the speed generating via various methods:\n", "\n", " Method Time (10K) Time (100K)\n", " matplotlib.delaunay.triangulate.Triangulation 16.5ms 222ms\n", " voronoi() using the above 41.9ms 793ms\n", " scipy.spatial.Delaunay 51.4ms 797ms" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import numpy" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [], "source": [ "scale = .9\n", "small = (1 - scale)/2 + scale * numpy.random.rand(2, 10000)\n", "large = (1 - scale)/2 + scale * numpy.random.rand(2, 100000)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "D:\\anaconda\\2.7\\lib\\site-packages\\matplotlib\\cbook.py:137: MatplotlibDeprecationWarning: The matplotlib.delaunay module was deprecated in version 1.4. 
Use matplotlib.tri.Triangulation instead.\n", " warnings.warn(message, mplDeprecation, stacklevel=1)\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "100 loops, best of 3: 19.5 ms per loop\n", "1 loop, best of 3: 239 ms per loop\n" ] } ], "source": [ "import matplotlib.delaunay.triangulate as tri\n", "%timeit tri.Triangulation(*small)\n", "%timeit tri.Triangulation(*large)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The slowest run took 5.87 times longer than the fastest. This could mean that an intermediate result is being cached.\n", "1 loop, best of 3: 39.6 ms per loop\n", "1 loop, best of 3: 701 ms per loop\n" ] } ], "source": [ "def voronoi(X, Y):\n", " ''' Return line segments describing the voronoi diagram of X and Y '''\n", "\n", " # Get the points X, Y into a matrix P.\n", " P = numpy.zeros((X.size+4, 2))\n", " P[:X.size, 0], P[:Y.size, 1] = X, Y\n", "\n", " # Add four points at (pseudo) \"infinity\"\n", " m = max(numpy.abs(X).max(), numpy.abs(Y).max()) * 1e5\n", " P[X.size:, 0] = -m, -m, +m, +m\n", " P[Y.size:, 1] = -m, +m, -m, +m\n", "\n", " # Delaunay triangulate, and get the circumcenters\n", " D = tri.Triangulation(P[:, 0], P[:, 1])\n", " C = D.circumcenters\n", "\n", " # D.triangle_neighbours = 3 neighbours.\n", " # Each neighbourhood represents a line.\n", "\n", " n = len(C)\n", " tgt = D.triangle_neighbors\n", " src = (numpy.zeros_like(tgt).T + numpy.arange(n)).T\n", "\n", " # Remove all -1s\n", " positives = tgt >= 0\n", " n = positives.sum()\n", " src = src[positives].reshape(n)\n", " tgt = tgt[positives].reshape(n)\n", "\n", " # TODO: Clip to get polygons\n", " # --------------------------\n", "\n", " # Get areas\n", " # ---------\n", " # http://www.mathopenref.com/coordpolygonarea.html\n", " csrc = C[src]\n", " ctgt = C[tgt]\n", " areas = csrc[:,0] * ctgt[:,1] - csrc[:,1] * ctgt[:,0]\n", " # print areas\n", " # Now add up the areas by the indices given in src\n", "\n", " # Get the circumcenters\n", " return numpy.concatenate((C[tgt].reshape(n, 1, 2), C[src].reshape(n, 1, 2)), axis=1)\n", "\n", "%timeit voronoi(small[0,:], small[1,:])\n", "%timeit voronoi(large[0,:], large[1,:])" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "10 loops, best of 3: 90 ms per loop\n", "1 loop, best of 3: 1.29 s per loop\n" ] } ], "source": [ "from scipy.spatial import Voronoi\n", "%timeit Voronoi(small.T)\n", "%timeit Voronoi(large.T)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# HDF5 vs SQLite3 vs PostgreSQL\n", "\n", "Which has the faster insert performance? Which has the faster read performance? This is specifically on a key-value index." 
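, "\n", "\n", "Only inserts are timed below. As a sketch of the read half, the next cell defines a hypothetical `read_sqlite3()` that times random key lookups against the primary key; run it only after the insert benchmark has populated `.test.sqlite3`." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# A hedged sketch, not run as part of the original benchmark: time random key\n", "# lookups against the primary-key index that insert_sqlite3() (defined below)\n", "# creates in .test.sqlite3. Run this after the insert cells.\n", "import sqlite3\n", "import time\n", "\n", "def read_sqlite3(keys):\n", "    conn = sqlite3.connect('.test.sqlite3')\n", "    try:\n", "        cur = conn.cursor()\n", "        start = time.time()\n", "        for key in keys:\n", "            cur.execute('SELECT n, v FROM test WHERE k=?', (key,))\n", "            cur.fetchone()\n", "        return time.time() - start\n", "    finally:\n", "        conn.close()"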
] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import random\n", "\n", "words = 'ad adipisicing aliqua aliquip amet anim aute cillum commodo consectetur consequat culpa cupidatat deserunt do dolor dolore duis ea eiusmod elit enim esse est et eu ex excepteur exercitation fugiat id in incididunt ipsum irure labore laboris laborum lorem magna minim mollit nisi non nostrud nulla occaecat officia pariatur proident qui quis reprehenderit sed sint sit sunt tempor ullamco ut velit veniam voluptate'.split()\n", "\n", "def get_data(size, keylen=5, vallen=10):\n", " 'Return an array of random key, int, str combinations'\n", " result = []\n", " hi = len(words) - 1\n", " keys = set()\n", " for index in range(size):\n", " while True:\n", " key = ' '.join(words[random.randint(0, hi)] for i in range(keylen))\n", " if key not in keys:\n", " break\n", " keys.add(key)\n", " num = random.randint(0, 10000000)\n", " val = ' '.join(words[random.randint(0, hi)] for i in range(vallen))\n", " result.append([key, num, val])\n", " return result" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import time\n", "import sqlite3\n", "\n", "def insert_sqlite3(data, drop=True):\n", " conn = sqlite3.connect('.test.sqlite3')\n", " try:\n", " if drop:\n", " conn.execute('DROP TABLE IF EXISTS test')\n", " conn.execute('CREATE TABLE IF NOT EXISTS test (k TEXT, n INTEGER, v TEXT, PRIMARY KEY(k))')\n", " start = time.time()\n", " conn.executemany('INSERT INTO test VALUES (?, ?, ?)', data)\n", " conn.commit()\n", " return time.time() - start\n", " finally:\n", " conn.close()" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import time\n", "import psycopg2\n", "\n", "def insert_postgres(data, drop=True):\n", " conn = psycopg2.connect('host=localhost dbname=test user=postgres')\n", " try:\n", " cur = conn.cursor()\n", " if drop:\n", " cur.execute('DROP TABLE IF EXISTS test')\n", " cur.execute('CREATE TABLE IF NOT EXISTS test (k VARCHAR(70), n INTEGER, v VARCHAR(300), PRIMARY KEY(k))')\n", " conn.commit()\n", " start = time.time()\n", " cur = conn.cursor()\n", " cur.executemany('INSERT INTO test VALUES (%s, %s, %s)', data)\n", " conn.commit()\n", " return time.time() - start\n", " finally:\n", " cur.close()\n", " conn.close()" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import time\n", "import tables\n", "\n", "class Test(tables.IsDescription):\n", " k = tables.StringCol(itemsize=70, pos=0)\n", " n = tables.Int16Col(pos=1)\n", " v = tables.StringCol(itemsize=300, pos=2)\n", " \n", "def insert_hdf5(data, drop=True):\n", " handle = tables.open_file('.test.h5', mode='w')\n", " try:\n", " root = handle.root\n", " table = handle.create_table(root, 'test', Test)\n", " insert = table.row\n", " start = time.time()\n", " for row in data:\n", " insert['k'], insert['n'], insert['v'] = row\n", " insert.append()\n", " insert.append()\n", " table.flush()\n", " return time.time() - start\n", " finally:\n", " handle.close()" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " size sqlite3 postgres hdf5\n", " 1 289.3ms 0.0ms 0.0ms\n", " 10 192.9ms 44.3ms 0.0ms\n", " 100 200.7ms 31.3ms 0.0ms\n", " 1000 278.5ms 169.1ms 1.5ms\n", " 10000 435.5ms 1265.0ms 10.5ms\n", " 25000 648.7ms 3539.8ms 
59.1ms\n" ] } ], "source": [ "print(' size sqlite3 postgres hdf5')\n", "for size in (1, 10, 100, 1000, 10000, 25000):\n", " print(\n", " '% 6d' % size,\n", " '% 7.1fms' % (insert_sqlite3(get_data(size), drop=True) * 1000),\n", " '% 7.1fms' % (insert_postgres(get_data(size), drop=True) * 1000),\n", " '% 7.1fms' % (insert_hdf5(get_data(size), drop=True) * 1000)\n", " )" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "# How fast is data access?\n", "\n", "When it comes to data, the performance of the CPU/Memory, disk and database are key. This script measures how fast your system performs on these parameters.\n", "\n", "The whole script should run under a minute on most reasonably fast systems." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from __future__ import print_function\n", "\n", "import io\n", "import time\n", "import sqlalchemy\n", "import numpy as np\n", "import pandas as pd\n", "from pathlib import Path\n", "\n", "class Timer:\n", " def __init__(self, msg):\n", " self.msg = msg\n", "\n", " def __enter__(self):\n", " self.start = time.clock()\n", "\n", " def __exit__(self, *args):\n", " self.end = time.clock()\n", " print('{:0.3f}s {:s}'.format(self.end - self.start, self.msg))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### CPU / RAM\n", "\n", "This is a pure numerical computation on values in memory that computes the [eigenvalues](http://docs.scipy.org/doc/numpy-1.10.0/reference/generated/numpy.linalg.eig.html) of a random dataset." ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1.751s computation\n" ] } ], "source": [ "# Initialise the same data every time\n", "np.random.seed(0)\n", "data = np.random.random((1000, 1000))\n", "\n", "# Time the computation\n", "with Timer('computation'):\n", " np.linalg.eig(data)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Disk\n", "\n", "Let's time sequential writes and reads on the disk.\n", "\n", "The best way to do this is via [disktt](https://www.google.com/search?q=disktt) on Windows and [dd on Linux](https://www.thomas-krenn.com/en/wiki/Linux_I/O_Performance_Tests_using_dd).\n", "\n", "Below is a crude approximation in Python. Note: this is heavily influenced by OS disk caching." ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Change this to any folder in the drive you want to test\n", "folder = Path('D:/')" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "12.596s sequential disk write\n", "0.695s sequential disk read\n" ] } ], "source": [ "# Test the speed of the hard disk at this folder\n", "# ... with this string data\n", "data = bytes('0123456789') * 100000000\n", "\n", "# Run the test\n", "path = folder / 'tempfile'\n", "with path.open(mode='wb', buffering=0) as handle:\n", " with Timer('sequential disk write'):\n", " handle.write(data)\n", " \n", "with path.open(mode='rb', buffering=0) as handle:\n", " with Timer('sequential disk read'):\n", " handle.read()\n", " \n", "path.unlink()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Database\n", "\n", "This script tests the speed of MySQL. 
It assumes that a MySQL instance is running on localhost, with a database called `test` accessible to user `root` with no password. You can [change the connection string](http://docs.sqlalchemy.org/en/latest/core/engines.html#mysql) based on your configuration.\n" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# This is for a local MySQL database called test that you can connect to as root with no password\n", "engine = sqlalchemy.create_engine('mysql+pymysql://root@localhost/dbtest')\n", "\n", "# Test the connection\n", "connection = engine.connect()" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Setup the data structures\n", "data = pd.DataFrame(np.random.randint(0, 1000, (1000000, 3)))\n", "\n", "metadata = sqlalchemy.MetaData(bind=engine)\n", "metadata.reflect()\n", "\n", "# Drop benchmark table\n", "if 'benchmark' in metadata.tables:\n", " metadata.tables['benchmark'].drop()\n", "\n", "# Create benchmark table again as MyISAM\n", "table = sqlalchemy.Table(\n", " 'benchmark', metadata,\n", " sqlalchemy.Column('0', sqlalchemy.Integer),\n", " sqlalchemy.Column('1', sqlalchemy.Integer),\n", " sqlalchemy.Column('2', sqlalchemy.Integer),\n", " extend_existing=True,\n", " mysql_engine='MyISAM',\n", ")\n", "metadata.create_all()" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "39.400s database write\n", "19.366s database read\n" ] } ], "source": [ "with Timer('database write'):\n", " data.to_sql('benchmark', con=engine, if_exists='append', index=False)\n", " \n", "with Timer('database read'):\n", " data = pd.read_sql('benchmark', con=engine)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Group by categories\n", "\n", "Pandas processes categories much faster than text. 
For this sample:\n", "\n", " text categories\n", " .groupby() time 591ms 74ms Categories are ~8X faster in this case\n", " Memory usage 512MB 86MB Categories are much smaller (based on text length)" ] }, { "cell_type": "code", "execution_count": 56, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import pandas as pd\n", "\n", "cat1 = ['Alpha', 'Beta', 'Gamma', 'Delta', 'Epsilon', 'Zeta', 'Eta']" ] }, { "cell_type": "code", "execution_count": 57, "metadata": { "collapsed": false }, "outputs": [], "source": [ "n = 10000000\n", "text = pd.DataFrame({\n", " 'cat1': pd.Series(pd.np.random.randint(0, len(cat1), n)).map(dict(enumerate(cat1))),\n", " 'val': pd.np.random.rand(n)\n", "})\n", "cats = text.copy()\n", "cats['cat1'] = cats['cat1'].astype('category')" ] }, { "cell_type": "code", "execution_count": 58, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1 loop, best of 3: 591 ms per loop\n", "10 loops, best of 3: 73.8 ms per loop\n" ] } ], "source": [ "%timeit text.groupby('cat1')['val'].sum()\n", "%timeit cats.groupby('cat1')['val'].sum()" ] }, { "cell_type": "code", "execution_count": 59, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "<class 'pandas.core.frame.DataFrame'>\n", "RangeIndex: 10000000 entries, 0 to 9999999\n", "Data columns (total 2 columns):\n", "cat1 object\n", "val float64\n", "dtypes: float64(1), object(1)\n", "memory usage: 512.3 MB\n", "<class 'pandas.core.frame.DataFrame'>\n", "RangeIndex: 10000000 entries, 0 to 9999999\n", "Data columns (total 2 columns):\n", "cat1 category\n", "val float64\n", "dtypes: category(1), float64(1)\n", "memory usage: 85.8 MB\n" ] } ], "source": [ "text.info(memory_usage='deep')\n", "cats.info(memory_usage='deep')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Database groupby\n", "\n", "Observations on the performance of a `SELECT category, COUNT(id) FROM table GROUP BY category` query:\n", "\n", "### MySQL\n", "\n", "- Creating a hash / btree index on `category` does not speed it up (1.5s)\n", "- Using `VARCHAR` instead of `TEXT` does not speed it up\n", "- In-memory tables improve performance by just 20-40% (1.2s)\n", "- InnoDB tables are ~60X slower than MyISAM without an index (90s)\n", "- MySQL intelligently caches queries (which is good). Use `reset query cache` to reset the cache\n", "\n", "To move a large table into an in-memory table:\n", "\n", " SET GLOBAL tmp_table_size = 1024 * 1024 * 1024 * 2;\n", " SET GLOBAL max_heap_table_size = 1024 * 1024 * 1024 * 2;\n", "\n", " # Now disconnect and reconnect\n", "\n", " CREATE TABLE mem LIKE inr_import;\n", " ALTER TABLE mem ENGINE=MEMORY;\n", " INSERT INTO mem SELECT * FROM inr_import;\n", "\n", "\n", "### PostgreSQL\n", "\n", "- Creating a hash index on `category` does not speed it up" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python [Root]", "language": "python", "name": "Python [Root]" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.12" } }, "nbformat": 4, "nbformat_minor": 0 }