{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "# Importing Data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'0.17.0'"
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "pd.options.display.max_rows = 6\n",
    "pd.options.display.max_columns = 8\n",
    "pd.__version__ "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "We often times have a variety of input data.\n",
    "\n",
    "- CSV\n",
    "- Excel\n",
    "- SQL\n",
    "- JSON\n",
    "- HDF5\n",
    "- pickle\n",
    "- msgpack\n",
    "- Stata\n",
    "- BigQuery"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "This is subset of the data from beeradvocate.com, via [Standford](https://snap.stanford.edu/data/web-RateBeer.html). It's strangely formatted.\n",
    "\n",
    "This dataset is no longer available!\n",
    "\n",
    "<p style=\"font-size:20px\"; style=font-family:Courier>\n",
    "beer/name: Sausa Weizen<br>\n",
    "beer/beerI: 47986<br>\n",
    "beer/brewerId: 10325<br>\n",
    "beer/ABV: 5.00<br>\n",
    "beer/style: Hefeweizen<br>\n",
    "review/appearance: 2.5<br>\n",
    "review/aroma: 2<br>\n",
    "review/time: 1234817823<br>\n",
    "review/profileName: stcules<br>\n",
    "review/text: A lot of foam. But a lot.\tIn the smell some banana, and then lactic and tart. Not a good start.\tQuite dark orange in color, with a lively carbonation (now visible, under the foam).\tAgain tending to lactic sourness.\tSame for the taste. With some yeast and banana.<br>\n",
    "<br>\n",
    "beer/name: Red Moon<br>\n",
    "beer/beerId: 48213<br>\n",
    "beer/brewerId: 10325<br>\n",
    "beer/ABV: 6.20<br>\n",
    " ...<br>\n",
    "</p>\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "# CSV"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "http://pandas.pydata.org/pandas-docs/stable/io.html#csv-text-files"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": true,
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [],
   "source": [
    "df = pd.read_csv('data/beer2.csv.gz', \n",
    "                 index_col=0,\n",
    "                 parse_dates=['time'],\n",
    "                 encoding='utf-8')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": false,
    "scrolled": true,
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>abv</th>\n",
       "      <th>beer_id</th>\n",
       "      <th>brewer_id</th>\n",
       "      <th>beer_name</th>\n",
       "      <th>...</th>\n",
       "      <th>profile_name</th>\n",
       "      <th>review_taste</th>\n",
       "      <th>text</th>\n",
       "      <th>time</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>7.0</td>\n",
       "      <td>2511</td>\n",
       "      <td>287</td>\n",
       "      <td>Bell's Cherry Stout</td>\n",
       "      <td>...</td>\n",
       "      <td>blaheath</td>\n",
       "      <td>4.5</td>\n",
       "      <td>Batch 8144\\tPitch black in color with a 1/2 f...</td>\n",
       "      <td>2009-10-05 21:31:48</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>5.7</td>\n",
       "      <td>19736</td>\n",
       "      <td>9790</td>\n",
       "      <td>Duck-Rabbit Porter</td>\n",
       "      <td>...</td>\n",
       "      <td>GJ40</td>\n",
       "      <td>4.0</td>\n",
       "      <td>Sampled from a 12oz bottle in a standard pint...</td>\n",
       "      <td>2009-10-05 21:32:09</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>4.8</td>\n",
       "      <td>11098</td>\n",
       "      <td>3182</td>\n",
       "      <td>Fürstenberg Premium Pilsener</td>\n",
       "      <td>...</td>\n",
       "      <td>biegaman</td>\n",
       "      <td>3.5</td>\n",
       "      <td>Haystack yellow with an energetic group of bu...</td>\n",
       "      <td>2009-10-05 21:32:13</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>49997</th>\n",
       "      <td>8.1</td>\n",
       "      <td>21950</td>\n",
       "      <td>2372</td>\n",
       "      <td>Terrapin Coffee Oatmeal Imperial Stout</td>\n",
       "      <td>...</td>\n",
       "      <td>ugaterrapin</td>\n",
       "      <td>4.5</td>\n",
       "      <td>Poured a light sucking crude oil beckoning bl...</td>\n",
       "      <td>2009-12-25 17:23:52</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>49998</th>\n",
       "      <td>4.6</td>\n",
       "      <td>5453</td>\n",
       "      <td>1306</td>\n",
       "      <td>Badger Original Ale</td>\n",
       "      <td>...</td>\n",
       "      <td>MrHurmateeowish</td>\n",
       "      <td>3.5</td>\n",
       "      <td>500ml brown bottle, 4.0% ABV. Pours a crystal...</td>\n",
       "      <td>2009-12-25 17:25:06</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>49999</th>\n",
       "      <td>9.4</td>\n",
       "      <td>47695</td>\n",
       "      <td>14879</td>\n",
       "      <td>Barrel Aged B.O.R.I.S. Oatmeal Imperial Stout</td>\n",
       "      <td>...</td>\n",
       "      <td>strictly4DK</td>\n",
       "      <td>4.5</td>\n",
       "      <td>22 oz bottle poured into a flute glass, share...</td>\n",
       "      <td>2009-12-25 17:26:06</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>50000 rows × 13 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "       abv  beer_id  brewer_id                                      beer_name         ...             profile_name  review_taste                                               text  \\\n",
       "0      7.0     2511        287                            Bell's Cherry Stout         ...                 blaheath           4.5   Batch 8144\\tPitch black in color with a 1/2 f...   \n",
       "1      5.7    19736       9790                             Duck-Rabbit Porter         ...                     GJ40           4.0   Sampled from a 12oz bottle in a standard pint...   \n",
       "2      4.8    11098       3182                   Fürstenberg Premium Pilsener         ...                 biegaman           3.5   Haystack yellow with an energetic group of bu...   \n",
       "...    ...      ...        ...                                            ...         ...                      ...           ...                                                ...   \n",
       "49997  8.1    21950       2372         Terrapin Coffee Oatmeal Imperial Stout         ...              ugaterrapin           4.5   Poured a light sucking crude oil beckoning bl...   \n",
       "49998  4.6     5453       1306                            Badger Original Ale         ...          MrHurmateeowish           3.5   500ml brown bottle, 4.0% ABV. Pours a crystal...   \n",
       "49999  9.4    47695      14879  Barrel Aged B.O.R.I.S. Oatmeal Imperial Stout         ...              strictly4DK           4.5   22 oz bottle poured into a flute glass, share...   \n",
       "\n",
       "                     time  \n",
       "0     2009-10-05 21:31:48  \n",
       "1     2009-10-05 21:32:09  \n",
       "2     2009-10-05 21:32:13  \n",
       "...                   ...  \n",
       "49997 2009-12-25 17:23:52  \n",
       "49998 2009-12-25 17:25:06  \n",
       "49999 2009-12-25 17:26:06  \n",
       "\n",
       "[50000 rows x 13 columns]"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<class 'pandas.core.frame.DataFrame'>\n",
      "Int64Index: 50000 entries, 0 to 49999\n",
      "Data columns (total 13 columns):\n",
      "abv                  48389 non-null float64\n",
      "beer_id              50000 non-null int64\n",
      "brewer_id            50000 non-null int64\n",
      "beer_name            50000 non-null object\n",
      "beer_style           50000 non-null object\n",
      "review_appearance    50000 non-null float64\n",
      "review_aroma         50000 non-null float64\n",
      "review_overall       50000 non-null float64\n",
      "review_palate        50000 non-null float64\n",
      "profile_name         50000 non-null object\n",
      "review_taste         50000 non-null float64\n",
      "text                 49991 non-null object\n",
      "time                 50000 non-null datetime64[ns]\n",
      "dtypes: datetime64[ns](1), float64(6), int64(2), object(4)\n",
      "memory usage: 5.3+ MB\n"
     ]
    }
   ],
   "source": [
    "df.info()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'Fürstenberg Premium Pilsener'"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# we have some unicode\n",
    "df.loc[2,'beer_name']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "df.to_csv('data/beer.csv', index=False, encoding='utf-8')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "# Excel"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "-"
    }
   },
   "source": [
    "http://pandas.pydata.org/pandas-docs/stable/io.html#excel-files"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "-"
    }
   },
   "outputs": [],
   "source": [
    "df.to_excel('data/beer.xls', index=False, encoding='utf-8')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "-"
    }
   },
   "outputs": [],
   "source": [
    "data = pd.read_excel('data/beer.xls', sheetnames=[0], encoding='utf-8')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "# SQL"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "http://pandas.pydata.org/pandas-docs/stable/io.html#sql-queries"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "from sqlalchemy import create_engine\n",
    "!rm -f data/beer.sqlite\n",
    "engine = create_engine('sqlite:///data/beer.sqlite')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "df.to_sql('table', engine)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "data = pd.read_sql('table', engine)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "# JSON"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "http://pandas.pydata.org/pandas-docs/stable/io.html#json"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "df.to_json('data/beer.json')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "data = pd.read_json('data/beer.json')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "# HDF"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "http://pandas.pydata.org/pandas-docs/stable/io.html#hdf5-pytables"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/Users/jreback/miniconda/envs/python3/lib/python3.4/site-packages/pandas/core/generic.py:938: PerformanceWarning: \n",
      "your performance may suffer as PyTables will pickle object types that it cannot\n",
      "map directly to c-types [inferred_type->mixed,key->block3_values] [items->['beer_name', 'beer_style', 'profile_name', 'text']]\n",
      "\n",
      "  return pytables.to_hdf(path_or_buf, key, self, **kwargs)\n"
     ]
    }
   ],
   "source": [
    "# fixed format\n",
    "df.to_hdf('data/beer_mixed.hdf',\n",
    "          'df',\n",
    "           mode='w',\n",
    "           format='fixed',\n",
    "           encoding='utf-8')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>abv</th>\n",
       "      <th>beer_id</th>\n",
       "      <th>brewer_id</th>\n",
       "      <th>beer_name</th>\n",
       "      <th>...</th>\n",
       "      <th>profile_name</th>\n",
       "      <th>review_taste</th>\n",
       "      <th>text</th>\n",
       "      <th>time</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>6663</th>\n",
       "      <td>4.5</td>\n",
       "      <td>47562</td>\n",
       "      <td>13307</td>\n",
       "      <td>It's Alright!</td>\n",
       "      <td>...</td>\n",
       "      <td>Frogzilla</td>\n",
       "      <td>1.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>2009-10-16 00:51:28</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>13012</th>\n",
       "      <td>8.0</td>\n",
       "      <td>2508</td>\n",
       "      <td>222</td>\n",
       "      <td>Maredsous 8 - Dubbel</td>\n",
       "      <td>...</td>\n",
       "      <td>Flightoficarus</td>\n",
       "      <td>4.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>2009-10-28 08:09:25</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>21034</th>\n",
       "      <td>6.5</td>\n",
       "      <td>14378</td>\n",
       "      <td>2593</td>\n",
       "      <td>Cherry Porter</td>\n",
       "      <td>...</td>\n",
       "      <td>CurtisFagan</td>\n",
       "      <td>4.5</td>\n",
       "      <td>NaN</td>\n",
       "      <td>2009-11-09 23:57:11</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>40357</th>\n",
       "      <td>9.5</td>\n",
       "      <td>47785</td>\n",
       "      <td>35</td>\n",
       "      <td>Samuel Adams Double Bock (Imperial Series)</td>\n",
       "      <td>...</td>\n",
       "      <td>zeapo</td>\n",
       "      <td>4.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>2009-12-10 01:42:08</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>40968</th>\n",
       "      <td>9.2</td>\n",
       "      <td>47360</td>\n",
       "      <td>35</td>\n",
       "      <td>Samuel Adams Imperial Stout</td>\n",
       "      <td>...</td>\n",
       "      <td>zeapo</td>\n",
       "      <td>3.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>2009-12-11 02:14:19</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>48669</th>\n",
       "      <td>5.0</td>\n",
       "      <td>924</td>\n",
       "      <td>142</td>\n",
       "      <td>Franziskaner Hefe-Weisse Dunkel</td>\n",
       "      <td>...</td>\n",
       "      <td>VTBobcat</td>\n",
       "      <td>5.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>2009-12-23 13:54:37</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>9 rows × 13 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "       abv  beer_id  brewer_id                                   beer_name         ...            profile_name  review_taste  text                time\n",
       "6663   4.5    47562      13307                               It's Alright!         ...               Frogzilla           1.0   NaN 2009-10-16 00:51:28\n",
       "13012  8.0     2508        222                        Maredsous 8 - Dubbel         ...          Flightoficarus           4.0   NaN 2009-10-28 08:09:25\n",
       "21034  6.5    14378       2593                               Cherry Porter         ...             CurtisFagan           4.5   NaN 2009-11-09 23:57:11\n",
       "...    ...      ...        ...                                         ...         ...                     ...           ...   ...                 ...\n",
       "40357  9.5    47785         35  Samuel Adams Double Bock (Imperial Series)         ...                   zeapo           4.0   NaN 2009-12-10 01:42:08\n",
       "40968  9.2    47360         35                 Samuel Adams Imperial Stout         ...                   zeapo           3.0   NaN 2009-12-11 02:14:19\n",
       "48669  5.0      924        142             Franziskaner Hefe-Weisse Dunkel         ...                VTBobcat           5.0   NaN 2009-12-23 13:54:37\n",
       "\n",
       "[9 rows x 13 columns]"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df[df.text.isnull()]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {
    "collapsed": true,
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [],
   "source": [
    "data = pd.read_hdf('data/beer.hdf','df',encoding='utf-8')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "count    49991.000000\n",
       "mean       733.792003\n",
       "std        392.219226\n",
       "             ...     \n",
       "50%        642.000000\n",
       "75%        900.000000\n",
       "max       4902.000000\n",
       "Name: text, dtype: float64"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# wildly varying strings\n",
    "df.text.str.len().describe()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "# Timings"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1 loops, best of 3: 2.85 s per loop\n"
     ]
    }
   ],
   "source": [
    "%timeit pd.read_excel('data/beer.xls', sheetnames=[0])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1 loops, best of 3: 662 ms per loop\n"
     ]
    }
   ],
   "source": [
    "%timeit pd.read_sql('table', engine)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1 loops, best of 3: 1.15 s per loop\n"
     ]
    }
   ],
   "source": [
    "%timeit pd.read_json('data/beer.json')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1 loops, best of 3: 552 ms per loop\n"
     ]
    }
   ],
   "source": [
    "%timeit pd.read_csv('data/beer.csv', parse_dates=['time'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "10 loops, best of 3: 123 ms per loop\n"
     ]
    }
   ],
   "source": [
    "%timeit pd.read_hdf('data/beer.hdf','df')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [],
   "source": [
    "df.to_pickle('data/beer.pkl')\n",
    "df.to_msgpack('data/beer.msgpack',encoding='utf-8')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "10 loops, best of 3: 40.2 ms per loop\n"
     ]
    }
   ],
   "source": [
    "%timeit pd.read_pickle('data/beer.pkl')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "10 loops, best of 3: 60.7 ms per loop\n"
     ]
    }
   ],
   "source": [
    "%timeit pd.read_msgpack('data/beer.msgpack', encoding='utf-8')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "# Storing Text vs Data\n",
    "http://matthewrocklin.com/blog/work/2015/03/16/Fast-Serialization/"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "# Operating on Large Data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0 -> 10000\n",
      "1 -> 10000\n",
      "2 -> 10000\n",
      "3 -> 10000\n",
      "4 -> 10000\n"
     ]
    }
   ],
   "source": [
    "chunks = pd.read_csv('data/beer2.csv.gz', \n",
    "                      index_col=0,\n",
    "                      parse_dates=['time'],\n",
    "                      chunksize=10000)\n",
    "for i, chunk in enumerate(chunks):\n",
    "    print(\"%d -> %d\" % (i, len(chunk)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "# Using Odo\n",
    "http://odo.readthedocs.org/"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "# Questions\n",
    "\n",
    "- which formats provide good fidelity\n",
    "  - hdf5, pickle, msgpack\n",
    "  \n",
    "- which formats can you query\n",
    "  - hdf5, sql\n",
    "  \n",
    "- which formats can you iterate\n",
    "  - csv, hdf5, sql\n",
    "  \n",
    "- which formats provide better interoprability\n",
    "  - csv, json, excel\n",
    "  \n",
    "- which formats can you transmit over the wire\n",
    "  - json, msgpack\n",
    "  \n",
    "- which formats have better compression\n",
    "  - hdf5, pickle, msgpack\n",
    "  \n",
    "- which formats allow multiple datasets in the same file\n",
    "  - hdf5, msgpack"
   ]
  }
 ],
 "metadata": {
  "celltoolbar": "Slideshow",
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.4.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}