{ "metadata": { "name": "choose k from n benchmark" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Choose k from n benchmark\n", "I recently learned that NumPy 1.7/1.8 will have a `choice` function that randomly chooses k of n elements, with or without replacement, with or without weighted probabilities. I was interested in the simplest case of choosing without replacement and without weights (uniformly).\n", "\n", "I took a look at [the code](https://github.com/numpy/numpy/blob/master/numpy/random/mtrand/mtrand.pyx) and saw the implementation for this case is basically:\n", "\n", "```\n", "return np.random.permutation(n)[:k]\n", "```\n", "\n", "which surprised me because if `n` is much larger than `k` than it seems you are doing a lot of work for nothing permuting that big array.\n", "\n", "I stumbled upon a different suggestion on *stackoverflow* (see the comment by *Sa\u0161a \u0160ijak* on [this question](http://stackoverflow.com/questions/306400/how-do-i-randomly-select-an-item-from-a-list-using-python) which suggested using `random.sample` - that is built-in random, **not** *NumPy*'s random.\n", "\n", "So I made a quick benchmark for **my specific case**:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "import random\n", "import numpy as np\n", "def choose_no_rep_python(n,k):\n", " return random.sample(xrange(n), k) \n", "def choose_no_rep_numpy(n, k):\n", " return np.random.permutation(n)[:k]\n", "def choose_no_rep_numpy_take1(n, k):\n", " return np.random.permutation(n).take(range(k))\n", "def choose_no_rep_numpy_take2(n, k):\n", " return np.random.permutation(n).take(arange(k))\n", "def choose_no_rep_numpy_take3(n, k):\n", " return np.random.permutation(n).take(xrange(k))" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 4 }, { "cell_type": "code", "collapsed": false, "input": [ "%timeit -n 10000 choose_no_rep_python(1000, 4)\n", "%timeit -n 10000 choose_no_rep_numpy(1000, 4)\n", "%timeit -n 10000 choose_no_rep_numpy_take1(1000, 4)\n", "%timeit -n 10000 choose_no_rep_numpy_take2(1000, 4)\n", "%timeit -n 10000 choose_no_rep_numpy_take3(1000, 4)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "10000 loops, best of 3: 6.73 us per loop\n", "10000 loops, best of 3: 320 us per loop" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\n", "10000 loops, best of 3: 453 us per loop" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\n", "10000 loops, best of 3: 439 us per loop" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\n", "10000 loops, best of 3: 456 us per loop" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\n" ] } ], "prompt_number": 3 }, { "cell_type": "markdown", "metadata": {}, "source": [ "It seems that the naive python implementation is ~50-fold faster than the NumPy implementation, which surprised me...\n", "\n", "## Technical details" ] }, { "cell_type": "code", "collapsed": false, "input": [ "import os, sys, platform\n", "print platform.system(), platform.release()\n", "print \"python version\", sys.version\n", "fin = os.popen(\"ipython -V\")\n", "print \"ipython version\", fin.read(),\n", "fin.close()\n", "print \"numpy version\", np.version.version" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "Windows 7\n", "python version 2.7.3 |EPD 7.3-2 (64-bit)| (default, Apr 12 2012, 15:20:16) [MSC v.1500 64 bit (AMD64)]\n", "ipython version " ] }, { "output_type": "stream", "stream": "stdout", "text": [ "0.13\n", "numpy version 1.6.1\n" ] } ], "prompt_number": 29 } ], "metadata": {} } ] }