{ "metadata": { "name": "" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "code", "collapsed": false, "input": [ "%autosave 10" ], "language": "python", "metadata": {}, "outputs": [ { "javascript": [ "IPython.notebook.set_autosave_interval(10000)" ], "metadata": {}, "output_type": "display_data" }, { "output_type": "stream", "stream": "stdout", "text": [ "Autosaving every 10 seconds\n" ] } ], "prompt_number": 1 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Module: https://github.com/SciTools/biggus/wiki  \n", "Talk: [http://nbviewer.ipython.org/gist/pelson/9171602](http://nbviewer.ipython.org/gist/pelson/9171602)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Problem\n", "\n", "- Need\n", "    - Very large NumPy arrays that don't fit into memory.\n", "    - Lazy loading.\n", "    - Indexing and combination with other arrays.\n", "- Don't need\n", "    - Ragged arrays (the data fills regular NumPy arrays).\n", "    - Heterogeneous dtypes.\n", "    - Better-than-NumPy performance.\n", "\n", "## Prior art\n", "\n", "- BLAZE, [https://code.google.com/p/blaze-lib/](https://code.google.com/p/blaze-lib/)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Their work\n", "\n", "Biggus. 
[https://github.com/SciTools/biggus/wiki](https://github.com/SciTools/biggus/wiki)" ] }, { "cell_type": "code", "collapsed": false, "input": [ "import numpy as np\n", "import biggus\n", "\n", "np_array = np.empty((700, 200), dtype=np.int32)\n", "arr = biggus.NumpyArrayAdapter(np_array)\n", "print arr" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "\n" ] } ], "prompt_number": 5 }, { "cell_type": "code", "collapsed": false, "input": [ "# lazy counterpart of np.concatenate\n", "bigger_arr = biggus.LinearMosaic([arr, arr], axis=0)\n", "print bigger_arr" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "\n" ] } ], "prompt_number": 6 }, { "cell_type": "markdown", "metadata": {}, "source": [ "- `np.concatenate` allocates a new array, then copies the data into it.\n", "    - NumPy arrays are hard to subclass.\n", "- Biggus does no copying." ] }, { "cell_type": "code", "collapsed": false, "input": [ "# no memory copying, even with 40 pieces\n", "print biggus.LinearMosaic([arr, arr] * 20, axis=0)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "\n" ] } ], "prompt_number": 7 }, { "cell_type": "code", "collapsed": false, "input": [ "# stack along a new dimension\n", "biggus.ArrayStack(np.array([arr, arr]))" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 8, "text": [ "" ] } ], "prompt_number": 8 }, { "cell_type": "code", "collapsed": false, "input": [ "import h5py\n", "\n", "hdf_dataset = h5py.File('data.hdf5')['arange']\n", "\n", "# this is lazy; no data is loaded\n", "arr_hdf = biggus.NumpyArrayAdapter(hdf_dataset)\n", "\n", "print arr_hdf" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- HDF5 (via h5py) can already load arrays lazily, but Biggus is more generic.\n", "- Can combine HDF5 and regular arrays with `LinearMosaic`."
] }, { "cell_type": "code", "collapsed": false, "input": [ "bigger_arr = biggus.LinearMosaic([bigger_arr, arr_hdf], axis=0)\n", "print bigger_arr" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `ndarray` method realizes an array, bringing it into memory." ] }, { "cell_type": "code", "collapsed": false, "input": [ "type(bigger_arr.ndarray()), bigger_arr.ndarray().shape" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 11, "text": [ "(numpy.ndarray, (1400, 200))" ] } ], "prompt_number": 11 }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can do basic processing on massive arrays in chunks." ] }, { "cell_type": "code", "collapsed": false, "input": [ "# Lazy: these operations are not run at this point\n", "mean = biggus.mean(bigger_arr, axis=0)\n", "std = biggus.std(bigger_arr, axis=0)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 12 }, { "cell_type": "code", "collapsed": false, "input": [ "print mean" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "<_Aggregation shape=(200,) dtype=dtype('float64')>\n" ] } ], "prompt_number": 13 }, { "cell_type": "code", "collapsed": false, "input": [ "# _now_ we realize it; mean.ndarray() calculates the mean\n", "# in chunks, so the data is never all in memory at once\n", "print np.all(mean.ndarray() == bigger_arr.ndarray().mean(axis=0))" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "True\n" ] } ], "prompt_number": 14 }, { "cell_type": "markdown", "metadata": {}, "source": [ "In practice, as you go chunk by chunk, you want to compute several results in the same pass." ] }, { "cell_type": "code", "collapsed": false, "input": [ "# this realizes the result. 
it really is chunking the array\n", "# into sub-arrays, aggregating results\n", "mean_np, std_np = biggus.ndarrays([mean, std])\n", "print type(mean_np)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "\n" ] } ], "prompt_number": 15 }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Current limitations\n", "\n", "- Aggregation is limited to axis 0; a pull request is in the works.\n", "- Threading is coming, but tasks are I/O-bound, so it won't help much." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Array results\n", "\n", "- What if the result is itself a big array?\n", "- Then the full result can never be realized in memory.\n", "- Instead, the result can be written chunk by chunk directly into an HDF5 dataset." ] }, { "cell_type": "code", "collapsed": false, "input": [ "import h5py\n", "with h5py.File('result.hdf5', mode='w') as f_out:\n", "    # target dataset with the same shape and dtype as the lazy result\n", "    df = f_out.create_dataset('my_result', mean.shape, mean.dtype)\n", "    biggus.save([mean], [df])" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 18 }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Toy example of taking OpenStreetMap tiles\n", "\n", "!!AI see video.\n", "\n", "- Ends up handling a 48 GB uncompressed array with ease on a MacBook Air.\n", "\n", "## Summary\n", "\n", "- Simple (1200 LOC).\n", "- Array-like class.\n", "- Array-like indexing, aggregation, and concatenation over any indexable object (NumPy, HDF5).\n", "- Supports streaming.\n", "- Conceptually no size limit on the arrays you can manipulate.\n", "- TODO\n", "    - Efficient chunking for complex chains of evaluations." ] } ], "metadata": {} } ] }