{ "metadata": { "name": "" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "#CR\u00dcCIAL P\u0178THON Week 3: Numpy Stride Tricks" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from IPython.core.display import Image \n", "Image(url='http://labrosa.ee.columbia.edu/crucialpython/logo.png', width=600) " ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "" ], "metadata": {}, "output_type": "pyout", "prompt_number": 1, "text": [ "" ] } ], "prompt_number": 1 }, { "cell_type": "code", "collapsed": false, "input": [ "# Get our pylab mise en place going\n", "import numpy as np\n", "import matplotlib.pyplot as plt" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 2 }, { "cell_type": "code", "collapsed": false, "input": [ "# I'll make up a time series of sequential data\n", "# \n", "x = np.arange(8 * 16)\n", "\n", "# And print it\n", "print x" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "[ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17\n", " 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35\n", " 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53\n", " 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71\n", " 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89\n", " 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107\n", " 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125\n", " 126 127]\n" ] } ], "prompt_number": 3 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's say `x` holds sequential data, and we want to do some sliding window analysis. This is common in time series modeling and signal processing.\n", "\n", "If the windows are disjoint (i.e., no overlap) then we can get away with simply reshaping `x` into a matrix like so:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "frame_length = 8\n", "\n", "x_framed = np.reshape(x, (-1, frame_length))\n", "\n", "print x_framed" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "[[ 0 1 2 3 4 5 6 7]\n", " [ 8 9 10 11 12 13 14 15]\n", " [ 16 17 18 19 20 21 22 23]\n", " [ 24 25 26 27 28 29 30 31]\n", " [ 32 33 34 35 36 37 38 39]\n", " [ 40 41 42 43 44 45 46 47]\n", " [ 48 49 50 51 52 53 54 55]\n", " [ 56 57 58 59 60 61 62 63]\n", " [ 64 65 66 67 68 69 70 71]\n", " [ 72 73 74 75 76 77 78 79]\n", " [ 80 81 82 83 84 85 86 87]\n", " [ 88 89 90 91 92 93 94 95]\n", " [ 96 97 98 99 100 101 102 103]\n", " [104 105 106 107 108 109 110 111]\n", " [112 113 114 115 116 117 118 119]\n", " [120 121 122 123 124 125 126 127]]\n" ] } ], "prompt_number": 4 }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the above example, `x_framed[i]` is a single window's worth of data.\n", "\n", "But what if we want the windows to overlap? Reshaping will not work, since we need multiple copies of most entries in the matrix.\n", "\n", "The simple thing to do is allocate a new array, and iteratively populate each row with a properly advanced window, like so:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "hop_length = frame_length / 2\n", "\n", "num_frames = 1 + (len(x) - frame_length) / hop_length\n", "\n", "x_framed_overlap = np.zeros( (num_frames, frame_length), dtype=int )\n", "\n", "for t in range(num_frames):\n", " x_framed_overlap[t, :] = x[t*hop_length:t*hop_length + frame_length]\n", " \n", "print x_framed_overlap" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "[[ 0 1 2 3 4 5 6 7]\n", " [ 4 5 6 7 8 9 10 11]\n", " [ 8 9 10 11 12 13 14 15]\n", " [ 12 13 14 15 16 17 18 19]\n", " [ 16 17 18 19 20 21 22 23]\n", " [ 20 21 22 23 24 25 26 27]\n", " [ 24 25 26 27 28 29 30 31]\n", " [ 28 29 30 31 32 33 34 35]\n", " [ 32 33 34 35 36 37 38 39]\n", " [ 36 37 38 39 40 41 42 43]\n", " [ 40 41 42 43 44 45 46 47]\n", " [ 44 45 46 47 48 49 50 51]\n", " [ 48 49 50 51 52 53 54 55]\n", " [ 52 53 54 55 56 57 58 59]\n", " [ 56 57 58 59 60 61 62 63]\n", " [ 60 61 62 63 64 65 66 67]\n", " [ 64 65 66 67 68 69 70 71]\n", " [ 68 69 70 71 72 73 74 75]\n", " [ 72 73 74 75 76 77 78 79]\n", " [ 76 77 78 79 80 81 82 83]\n", " [ 80 81 82 83 84 85 86 87]\n", " [ 84 85 86 87 88 89 90 91]\n", " [ 88 89 90 91 92 93 94 95]\n", " [ 92 93 94 95 96 97 98 99]\n", " [ 96 97 98 99 100 101 102 103]\n", " [100 101 102 103 104 105 106 107]\n", " [104 105 106 107 108 109 110 111]\n", " [108 109 110 111 112 113 114 115]\n", " [112 113 114 115 116 117 118 119]\n", " [116 117 118 119 120 121 122 123]\n", " [120 121 122 123 124 125 126 127]]\n" ] } ], "prompt_number": 5 }, { "cell_type": "markdown", "metadata": {}, "source": [ "What's wrong with this? A few of things:\n", "\n", "- It's *slow*, and requires lots of memory copies\n", "- The new array is much larger than the original time series\n", "- The smaller `hop_length` gets, the bigger the memory usage!\n", "\n", "If we only need read-only access to the data, then this is a lot of overhead.\n", "\n", "Of course, we could just use direct indexing, e.g., ``x[t * hop_length : t * hop_length + frame_length]``, but that gets ugly quickly." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Stride tricks\n", "\n", "All we really want is a clear way to index windowed samples from `x` without incurring a lot of computational overhead. \n", "\n", "NumPy's *stride tricks* submodule provides exactly this. We'll need to know a little about how numpy arrays are organized though.\n", "\n", "An `ndarray` contains numerical data of some type, say `int8` or `float64`, and this `dtype` determines how many bytes in ram each entry occupies. \n", "\n", "Although `ndarray`s expose multi-dimensional indexing, their contents are actually stored in a linear index. The default ordering of the dimensions is row-major (`order='C'`, as in *The C Programming Language*) or column-major (`order='F'`, as in *Fortran*). Yes, it's confusing.\n", "\n", "A two-dimensional row-major array `a` will be organized as follows:\n", "\n", " - `a[0, 0], a[0, 1], a[0, 2], ... a[0, n], a[1, 0], a[1, 1], a[1, 2], ...`\n", " \n", "while a column-major array will look like\n", "\n", " - `a[0, 0], a[1, 0], a[2, 0], ... a[m, 0], a[0, 1], a[1, 1], a[2, 1], ...`\n", " \n", "### Why does this matter?\n", "\n", "When you index an `ndarray` in python, say `M[i, j]`, it needs to translate the two-dimensional index `(i, j)` into a linear index:\n", "\n", " - `i * row_stride + j * column_stride`\n", " \n", "So column- or row-major array indexing is merely an issue of setting up the strides appropriately. Of course, this idea generalizes to `n` dimensional arrays by tacking on more indices and strides.\n", "\n", "We're going to use stride tricks to fool numpy into thinking it has much more data than it actually does, while keeping the same underlying data in place. How? By manually setting the row and column strides!\n", "\n", "The key idea here is that a row stride need not span the full shape of the array.\n", "\n", "### WARNING\n", "\n", "Stride tricks can be dangerous. Because we're digging into the guts of ndarrays, it's possible to violate memory protection if we're not careful. \n", "\n", "Ok, you've been warned." ] }, { "cell_type": "code", "collapsed": false, "input": [ "from numpy.lib import stride_tricks" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 6 }, { "cell_type": "code", "collapsed": false, "input": [ "# Ok, it's a linear index, so it's both F- and C-contiguous\n", "# Next, let's see how big each item in x is. This is provided by the `itemsize` field\n", "print x.itemsize, x.dtype" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "8 int64\n" ] } ], "prompt_number": 7 }, { "cell_type": "code", "collapsed": false, "input": [ "# Now, we're ready to set up our fake frmaed data\n", "\n", "# Each row advances by `hop_length` entries, times the number of bytes in each entry\n", "row_stride = x.itemsize * hop_length\n", "\n", "# Columns are still contiguous, and only advance by one entry\n", "col_stride = x.itemsize\n", "\n", "# as_strided will construct a new view of the data in x, using the shape and stride parameters we supply\n", "x_framed_strided = stride_tricks.as_strided(x, \n", " shape=(num_frames, frame_length), \n", " strides=(row_stride, col_stride))\n", "\n", "print x_framed_strided" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "[[ 0 1 2 3 4 5 6 7]\n", " [ 4 5 6 7 8 9 10 11]\n", " [ 8 9 10 11 12 13 14 15]\n", " [ 12 13 14 15 16 17 18 19]\n", " [ 16 17 18 19 20 21 22 23]\n", " [ 20 21 22 23 24 25 26 27]\n", " [ 24 25 26 27 28 29 30 31]\n", " [ 28 29 30 31 32 33 34 35]\n", " [ 32 33 34 35 36 37 38 39]\n", " [ 36 37 38 39 40 41 42 43]\n", " [ 40 41 42 43 44 45 46 47]\n", " [ 44 45 46 47 48 49 50 51]\n", " [ 48 49 50 51 52 53 54 55]\n", " [ 52 53 54 55 56 57 58 59]\n", " [ 56 57 58 59 60 61 62 63]\n", " [ 60 61 62 63 64 65 66 67]\n", " [ 64 65 66 67 68 69 70 71]\n", " [ 68 69 70 71 72 73 74 75]\n", " [ 72 73 74 75 76 77 78 79]\n", " [ 76 77 78 79 80 81 82 83]\n", " [ 80 81 82 83 84 85 86 87]\n", " [ 84 85 86 87 88 89 90 91]\n", " [ 88 89 90 91 92 93 94 95]\n", " [ 92 93 94 95 96 97 98 99]\n", " [ 96 97 98 99 100 101 102 103]\n", " [100 101 102 103 104 105 106 107]\n", " [104 105 106 107 108 109 110 111]\n", " [108 109 110 111 112 113 114 115]\n", " [112 113 114 115 116 117 118 119]\n", " [116 117 118 119 120 121 122 123]\n", " [120 121 122 123 124 125 126 127]]\n" ] } ], "prompt_number": 8 }, { "cell_type": "code", "collapsed": false, "input": [ "# What if we want our frames vertical instead? Just swap the row and column strides\n", "\n", "x_framed_strided_column = stride_tricks.as_strided(x, \n", " shape=(num_frames, frame_length), \n", " strides=(col_stride, row_stride))\n", "\n", "print x_framed_strided_column\n" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "[[ 0 4 8 12 16 20 24 28]\n", " [ 1 5 9 13 17 21 25 29]\n", " [ 2 6 10 14 18 22 26 30]\n", " [ 3 7 11 15 19 23 27 31]\n", " [ 4 8 12 16 20 24 28 32]\n", " [ 5 9 13 17 21 25 29 33]\n", " [ 6 10 14 18 22 26 30 34]\n", " [ 7 11 15 19 23 27 31 35]\n", " [ 8 12 16 20 24 28 32 36]\n", " [ 9 13 17 21 25 29 33 37]\n", " [10 14 18 22 26 30 34 38]\n", " [11 15 19 23 27 31 35 39]\n", " [12 16 20 24 28 32 36 40]\n", " [13 17 21 25 29 33 37 41]\n", " [14 18 22 26 30 34 38 42]\n", " [15 19 23 27 31 35 39 43]\n", " [16 20 24 28 32 36 40 44]\n", " [17 21 25 29 33 37 41 45]\n", " [18 22 26 30 34 38 42 46]\n", " [19 23 27 31 35 39 43 47]\n", " [20 24 28 32 36 40 44 48]\n", " [21 25 29 33 37 41 45 49]\n", " [22 26 30 34 38 42 46 50]\n", " [23 27 31 35 39 43 47 51]\n", " [24 28 32 36 40 44 48 52]\n", " [25 29 33 37 41 45 49 53]\n", " [26 30 34 38 42 46 50 54]\n", " [27 31 35 39 43 47 51 55]\n", " [28 32 36 40 44 48 52 56]\n", " [29 33 37 41 45 49 53 57]\n", " [30 34 38 42 46 50 54 58]]\n" ] } ], "prompt_number": 9 } ], "metadata": {} } ] }