{
 "metadata": {
  "name": ""
 },
 "nbformat": 3,
 "nbformat_minor": 0,
 "worksheets": [
  {
   "cells": [
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "%autosave 10"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "javascript": [
        "IPython.notebook.set_autosave_interval(10000)"
       ],
       "metadata": {},
       "output_type": "display_data"
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Autosaving every 10 seconds\n"
       ]
      }
     ],
     "prompt_number": 1
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Module: https://github.com/SciTools/biggus/wiki  \n",
      "Talk: [http://nbviewer.ipython.org/gist/pelson/9171602](http://nbviewer.ipython.org/gist/pelson/9171602)"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "##\u00a0Problem\n",
      "\n",
      "- Need\n",
      "    - Very large NumPy arrays that don't fit into memory.\n",
      "    - Lazy loading\n",
      "    - Allows indexing and combination with other arrays.\n",
      "- Don't need\n",
      "    - ragged arrays (fills regular numpy arrays)\n",
      "    - heterogenous dtype\n",
      "    - better-than-numpy performance\n",
      "\n",
      "##\u00a0Prior art\n",
      "\n",
      "- BLAZE, [https://code.google.com/p/blaze-lib/](https://code.google.com/p/blaze-lib/)"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "## Their work\n",
      "\n",
      "Biggus. [https://github.com/SciTools/biggus/wiki](https://github.com/SciTools/biggus/wiki)"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import numpy as np\n",
      "import biggus\n",
      "\n",
      "np_array = np.empty((700, 200), dtype=int32)\n",
      "arr = biggus.NumpyArrayAdapter(np_array)\n",
      "print arr"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "<NumpyArrayAdapter shape=(700, 200) dtype=dtype('int32')>\n"
       ]
      }
     ],
     "prompt_number": 5
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "#\u00a0np.concatenate\n",
      "bigger_arr = biggus.LinearMosaic([arr, arr], axis=0)\n",
      "print bigger_arr"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "<LinearMosaic shape=(1400, 200) dtype=dtype('int32')>\n"
       ]
      }
     ],
     "prompt_number": 6
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "- But NumPy is creating a new space, then copying.\n",
      "    - NumPY arrays are hard to subclass\n",
      "- Biggus does no copying."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "#\u00a0no memory copying\n",
      "print biggus.LinearMosaic([arr, arr] * 20, axis=0)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "<LinearMosaic shape=(28000, 200) dtype=dtype('int32')>\n"
       ]
      }
     ],
     "prompt_number": 7
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "#\u00a0new dimension\n",
      "biggus.ArrayStack(np.array([arr, arr]))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 8,
       "text": [
        "<ArrayStack shape=(2, 700, 200) dtype=dtype('int32')>"
       ]
      }
     ],
     "prompt_number": 8
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import h5py\n",
      "\n",
      "hdf_dataset = h5py.File('data.hdf5')['arange']\n",
      "\n",
      "# this is lazy; no data is loaded\n",
      "arr_hdf = biggus.NumpyArrayAdapter(hdf_dataset)\n",
      "\n",
      "print arr"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "- HDF5 already has the ability to load arrays, but this is more generic.\n",
      "- Can combine (`LinearMosaic`) HDF5 and regular arrays."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "bigger_arr = biggus.LinearMosaic([bigger_arr, arr_hdf], axis=0)\n",
      "print bigger_arr"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "The `ndarray` method realizes arrays, and brings it into memory."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "type(bigger_arr.ndarray()), bigger_arr.ndarray().shape"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 11,
       "text": [
        "(numpy.ndarray, (1400, 200))"
       ]
      }
     ],
     "prompt_number": 11
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "You can do basic processing on massive arrays in chunks."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# These operations don't run when you do this\n",
      "mean = biggus.mean(bigger_arr, axis=0)\n",
      "std = biggus.std(bigger_arr, axis=0)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 12
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "print mean"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "<_Aggregation shape=(200,) dtype=dtype('float64')>\n"
       ]
      }
     ],
     "prompt_number": 13
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# _now_ we realize it, calculate mean in mean.ndarray()\n",
      "# done is chunks, data never all in-memory\n",
      "print np.all(mean.ndarray() == bigger_arr.ndarray().mean(axis=0))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "True\n"
       ]
      }
     ],
     "prompt_number": 14
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Really though as you go chunk-by-chunk you want to do many operations at the same time."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# this realizes the result. it really is chunking the array\n",
      "#\u00a0into sub-arrays, aggregating results\n",
      "mean_np, std_np = biggus.ndarrays([mean, std])\n",
      "print type(mean_np)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "<type 'numpy.ndarray'>\n"
       ]
      }
     ],
     "prompt_number": 15
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "##\u00a0Current limitations\n",
      "\n",
      "- Limited to axis 0 - pull request in the works\n",
      "- Threading is coming, but tasks are I/O bound so not too helpful."
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "## Array results\n",
      "\n",
      "- What if the result is in itself a big array?\n",
      "- Can never realize the full result.\n",
      "- Can chunk result directly into an HDF5 variable."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import h5py\n",
      "with h5py.File('result.hdf5', mode='w') as f_out:\n",
      "    df = f_out.create_dataset('my_result', mean.shape, mean.dtype)\n",
      "    biggus.save([mean], [df])"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 18
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "##\u00a0Toy example of taking OpenStreetMap tiles\n",
      "\n",
      "!!AI see video.\n",
      "\n",
      "- Eventually dealing with a 48GB uncompressed array, easily on a MacBook Air.\n",
      "\n",
      "##\u00a0Summary\n",
      "\n",
      "- Simple (1200 LOC)\n",
      "- Array-like class.\n",
      "- Array-like indexing, aggregation, concatenation, with any index-like object (NumPy, HDF5).\n",
      "- Supports streaming.\n",
      "- Conceptually no size limit on arrays you can manipulate.\n",
      "- TODO\n",
      "    - Efficient chunking for complex chains of evaluations"
     ]
    }
   ],
   "metadata": {}
  }
 ]
}