{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This notebook explores the speed of searching for values in sets and lists.\n",
    "\n",
    "After reading this notebook, watch Brandon Rhodes' videos [All Your Ducks In A Row: Data Structures in the Standard Library and Beyond](http://pyvideo.org/video/2571/all-your-ducks-in-a-row-data-structures-in-the-s) and [The Mighty Dictionary](http://pyvideo.org/video/276/the-mighty-dictionary-55)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def make_list(n):\n",
    "    if True:\n",
    "        return list(range(n))\n",
    "    else:\n",
    "        return list(str(i) for i in range(n))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "n = int(25e6)\n",
    "m = int(n*.9)\n",
    "m = n // 2\n",
    "a_list = make_list(n)\n",
    "a_set = set(a_list)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The slowest run took 38.41 times longer than the fastest. This could mean that an intermediate result is being cached.\n",
      "10000000 loops, best of 3: 122 ns per loop\n",
      "The slowest run took 31.83 times longer than the fastest. This could mean that an intermediate result is being cached.\n",
      "10000000 loops, best of 3: 145 ns per loop\n"
     ]
    }
   ],
   "source": [
    "# Finding something that is in a set is fast.\n",
    "%timeit m in a_set\n",
    "# Finding something that is _not_ in a set is also fast.\n",
    "%timeit n+1 in a_set"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1 loop, best of 3: 326 ms per loop\n",
      "1 loop, best of 3: 645 ms per loop\n"
     ]
    }
   ],
   "source": [
    "# Finding something in a list can be slow.\n",
    "# It starts at the beginning and compares each value.\n",
    "# The search time depends on where the value is in the list.\n",
    "%timeit m in a_list\n",
    "# Finding something that is not is a list is the worst case.\n",
    "# It has to be compared to all values of the list.\n",
    "%timeit n+1 in a_list"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "max_exponent = 6"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "10:\n",
      "The slowest run took 51.62 times longer than the fastest. This could mean that an intermediate result is being cached.\n",
      "10000000 loops, best of 3: 92 ns per loop\n",
      "100:\n",
      "The slowest run took 49.91 times longer than the fastest. This could mean that an intermediate result is being cached.\n",
      "10000000 loops, best of 3: 91 ns per loop\n",
      "1000:\n",
      "The slowest run took 39.22 times longer than the fastest. This could mean that an intermediate result is being cached.\n",
      "10000000 loops, best of 3: 118 ns per loop\n",
      "10000:\n",
      "The slowest run took 40.17 times longer than the fastest. This could mean that an intermediate result is being cached.\n",
      "10000000 loops, best of 3: 118 ns per loop\n",
      "100000:\n",
      "The slowest run took 36.67 times longer than the fastest. This could mean that an intermediate result is being cached.\n",
      "10000000 loops, best of 3: 124 ns per loop\n",
      "1000000:\n",
      "The slowest run took 40.67 times longer than the fastest. This could mean that an intermediate result is being cached.\n",
      "10000000 loops, best of 3: 122 ns per loop\n"
     ]
    }
   ],
   "source": [
    "for n in (10 ** i for i in range(1, max_exponent+1)):\n",
    "    m = n // 2\n",
    "    print('%d:' % n)\n",
    "    a_list = make_list(n)\n",
    "    a_set = set(a_list)\n",
    "\n",
    "    %timeit m in a_set"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Notice that the difference between searching small sets and large sets in not large. This is the magic of Python sets and dictionaries. Read the [hash table](https://en.wikipedia.org/wiki/Hash_table) Wikipedia article for an explanation of how this works."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "10:\n",
      "The slowest run took 23.84 times longer than the fastest. This could mean that an intermediate result is being cached.\n",
      "1000000 loops, best of 3: 202 ns per loop\n",
      "100:\n",
      "1000000 loops, best of 3: 1.32 µs per loop\n",
      "1000:\n",
      "100000 loops, best of 3: 12.7 µs per loop\n",
      "10000:\n",
      "10000 loops, best of 3: 126 µs per loop\n",
      "100000:\n",
      "1000 loops, best of 3: 1.21 ms per loop\n",
      "1000000:\n",
      "100 loops, best of 3: 12.8 ms per loop\n"
     ]
    }
   ],
   "source": [
    "for n in (10 ** i for i in range(1, max_exponent+1)):\n",
    "    m = n // 2\n",
    "    print('%d:' % n)\n",
    "    a_list = make_list(n)\n",
    "    a_set = set(a_list)\n",
    "\n",
    "    %timeit m in a_list"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Notice that the search time for finding something in the middle of a list is proportional to the size of the list."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "def set_foo(n):\n",
    "    for i in range(n):\n",
    "        i in a_set\n",
    "\n",
    "def list_foo(n):\n",
    "    for i in range(n):\n",
    "        i in a_list"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "10:\n",
      "1000000 loops, best of 3: 1.91 µs per loop\n",
      "100:\n",
      "100000 loops, best of 3: 9.23 µs per loop\n",
      "1000:\n",
      "10000 loops, best of 3: 119 µs per loop\n",
      "10000:\n",
      "1000 loops, best of 3: 1.27 ms per loop\n",
      "100000:\n",
      "100 loops, best of 3: 13.2 ms per loop\n",
      "1000000:\n",
      "10 loops, best of 3: 133 ms per loop\n"
     ]
    }
   ],
   "source": [
    "for n in (10 ** i for i in range(1, max_exponent+1)):\n",
    "    print('%d:' % n)\n",
    "    a_list = list(range(n))\n",
    "    a_set = set(a_list)\n",
    "\n",
    "    %timeit set_foo(n)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "10:\n",
      "100000 loops, best of 3: 2.62 µs per loop\n",
      "100:\n",
      "10000 loops, best of 3: 132 µs per loop\n",
      "1000:\n",
      "100 loops, best of 3: 12.6 ms per loop\n",
      "10000:\n",
      "1 loop, best of 3: 1.26 s per loop\n",
      "100000:\n",
      "1 loop, best of 3: 1min 59s per loop\n",
      "1000000:\n"
     ]
    }
   ],
   "source": [
    "for n in (10 ** i for i in range(1, max_exponent+1)):\n",
    "    print('%d:' % n)\n",
    "    a_list = list(range(n))\n",
    "    a_set = set(a_list)\n",
    "\n",
    "    %timeit list_foo(n)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The above cell does not finish overnight,\n",
    "so what you see above is what was saved before it finished."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.4.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}