{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "This notebook explores the speed of searching for values in sets and lists.\n", "\n", "After reading this notebook, watch Brandon Rhodes' videos [All Your Ducks In A Row: Data Structures in the Standard Library and Beyond](http://pyvideo.org/video/2571/all-your-ducks-in-a-row-data-structures-in-the-s) and [The Mighty Dictionary](http://pyvideo.org/video/276/the-mighty-dictionary-55)." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def make_list(n):\n", " if True:\n", " return list(range(n))\n", " else:\n", " return list(str(i) for i in range(n))" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [], "source": [ "n = int(25e6)\n", "m = int(n*.9)\n", "m = n // 2\n", "a_list = make_list(n)\n", "a_set = set(a_list)" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The slowest run took 38.41 times longer than the fastest. This could mean that an intermediate result is being cached.\n", "10000000 loops, best of 3: 122 ns per loop\n", "The slowest run took 31.83 times longer than the fastest. This could mean that an intermediate result is being cached.\n", "10000000 loops, best of 3: 145 ns per loop\n" ] } ], "source": [ "# Finding something that is in a set is fast.\n", "%timeit m in a_set\n", "# Finding something that is _not_ in a set is also fast.\n", "%timeit n+1 in a_set" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1 loop, best of 3: 326 ms per loop\n", "1 loop, best of 3: 645 ms per loop\n" ] } ], "source": [ "# Finding something in a list can be slow.\n", "# It starts at the beginning and compares each value.\n", "# The search time depends on where the value is in the list.\n", "%timeit m in a_list\n", "# Finding something that is not is a list is the worst case.\n", "# It has to be compared to all values of the list.\n", "%timeit n+1 in a_list" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": true }, "outputs": [], "source": [ "max_exponent = 6" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "10:\n", "The slowest run took 51.62 times longer than the fastest. This could mean that an intermediate result is being cached.\n", "10000000 loops, best of 3: 92 ns per loop\n", "100:\n", "The slowest run took 49.91 times longer than the fastest. This could mean that an intermediate result is being cached.\n", "10000000 loops, best of 3: 91 ns per loop\n", "1000:\n", "The slowest run took 39.22 times longer than the fastest. This could mean that an intermediate result is being cached.\n", "10000000 loops, best of 3: 118 ns per loop\n", "10000:\n", "The slowest run took 40.17 times longer than the fastest. This could mean that an intermediate result is being cached.\n", "10000000 loops, best of 3: 118 ns per loop\n", "100000:\n", "The slowest run took 36.67 times longer than the fastest. This could mean that an intermediate result is being cached.\n", "10000000 loops, best of 3: 124 ns per loop\n", "1000000:\n", "The slowest run took 40.67 times longer than the fastest. This could mean that an intermediate result is being cached.\n", "10000000 loops, best of 3: 122 ns per loop\n" ] } ], "source": [ "for n in (10 ** i for i in range(1, max_exponent+1)):\n", " m = n // 2\n", " print('%d:' % n)\n", " a_list = make_list(n)\n", " a_set = set(a_list)\n", "\n", " %timeit m in a_set" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice that the difference between searching small sets and large sets in not large. This is the magic of Python sets and dictionaries. Read the [hash table](https://en.wikipedia.org/wiki/Hash_table) Wikipedia article for an explanation of how this works." ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "10:\n", "The slowest run took 23.84 times longer than the fastest. This could mean that an intermediate result is being cached.\n", "1000000 loops, best of 3: 202 ns per loop\n", "100:\n", "1000000 loops, best of 3: 1.32 µs per loop\n", "1000:\n", "100000 loops, best of 3: 12.7 µs per loop\n", "10000:\n", "10000 loops, best of 3: 126 µs per loop\n", "100000:\n", "1000 loops, best of 3: 1.21 ms per loop\n", "1000000:\n", "100 loops, best of 3: 12.8 ms per loop\n" ] } ], "source": [ "for n in (10 ** i for i in range(1, max_exponent+1)):\n", " m = n // 2\n", " print('%d:' % n)\n", " a_list = make_list(n)\n", " a_set = set(a_list)\n", "\n", " %timeit m in a_list" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice that the search time for finding something in the middle of a list is proportional to the size of the list." ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false }, "outputs": [], "source": [ "def set_foo(n):\n", " for i in range(n):\n", " i in a_set\n", "\n", "def list_foo(n):\n", " for i in range(n):\n", " i in a_list" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "10:\n", "1000000 loops, best of 3: 1.91 µs per loop\n", "100:\n", "100000 loops, best of 3: 9.23 µs per loop\n", "1000:\n", "10000 loops, best of 3: 119 µs per loop\n", "10000:\n", "1000 loops, best of 3: 1.27 ms per loop\n", "100000:\n", "100 loops, best of 3: 13.2 ms per loop\n", "1000000:\n", "10 loops, best of 3: 133 ms per loop\n" ] } ], "source": [ "for n in (10 ** i for i in range(1, max_exponent+1)):\n", " print('%d:' % n)\n", " a_list = list(range(n))\n", " a_set = set(a_list)\n", "\n", " %timeit set_foo(n)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "10:\n", "100000 loops, best of 3: 2.62 µs per loop\n", "100:\n", "10000 loops, best of 3: 132 µs per loop\n", "1000:\n", "100 loops, best of 3: 12.6 ms per loop\n", "10000:\n", "1 loop, best of 3: 1.26 s per loop\n", "100000:\n", "1 loop, best of 3: 1min 59s per loop\n", "1000000:\n" ] } ], "source": [ "for n in (10 ** i for i in range(1, max_exponent+1)):\n", " print('%d:' % n)\n", " a_list = list(range(n))\n", " a_set = set(a_list)\n", "\n", " %timeit list_foo(n)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The above cell does not finish overnight,\n", "so what you see above is what was saved before it finished." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.4.3" } }, "nbformat": 4, "nbformat_minor": 0 }