{ "cells": [ { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Importing Data" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "'0.17.0'" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import numpy as np\n", "import pandas as pd\n", "pd.options.display.max_rows = 6\n", "pd.options.display.max_columns = 8\n", "pd.__version__ " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "We often times have a variety of input data.\n", "\n", "- CSV\n", "- Excel\n", "- SQL\n", "- JSON\n", "- HDF5\n", "- pickle\n", "- msgpack\n", "- Stata\n", "- BigQuery" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "This is subset of the data from beeradvocate.com, via [Standford](https://snap.stanford.edu/data/web-RateBeer.html). It's strangely formatted.\n", "\n", "This dataset is no longer available!\n", "\n", "

\n", "beer/name: Sausa Weizen
\n", "beer/beerI: 47986
\n", "beer/brewerId: 10325
\n", "beer/ABV: 5.00
\n", "beer/style: Hefeweizen
\n", "review/appearance: 2.5
\n", "review/aroma: 2
\n", "review/time: 1234817823
\n", "review/profileName: stcules
\n", "review/text: A lot of foam. But a lot.\tIn the smell some banana, and then lactic and tart. Not a good start.\tQuite dark orange in color, with a lively carbonation (now visible, under the foam).\tAgain tending to lactic sourness.\tSame for the taste. With some yeast and banana.
\n", "
\n", "beer/name: Red Moon
\n", "beer/beerId: 48213
\n", "beer/brewerId: 10325
\n", "beer/ABV: 6.20
\n", " ...
\n", "

\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# CSV" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "http://pandas.pydata.org/pandas-docs/stable/io.html#csv-text-files" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": true, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "df = pd.read_csv('data/beer2.csv.gz', \n", " index_col=0,\n", " parse_dates=['time'],\n", " encoding='utf-8')" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false, "scrolled": true, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
abvbeer_idbrewer_idbeer_name...profile_namereview_tastetexttime
07.02511287Bell's Cherry Stout...blaheath4.5Batch 8144\\tPitch black in color with a 1/2 f...2009-10-05 21:31:48
15.7197369790Duck-Rabbit Porter...GJ404.0Sampled from a 12oz bottle in a standard pint...2009-10-05 21:32:09
24.8110983182Fürstenberg Premium Pilsener...biegaman3.5Haystack yellow with an energetic group of bu...2009-10-05 21:32:13
..............................
499978.1219502372Terrapin Coffee Oatmeal Imperial Stout...ugaterrapin4.5Poured a light sucking crude oil beckoning bl...2009-12-25 17:23:52
499984.654531306Badger Original Ale...MrHurmateeowish3.5500ml brown bottle, 4.0% ABV. Pours a crystal...2009-12-25 17:25:06
499999.44769514879Barrel Aged B.O.R.I.S. Oatmeal Imperial Stout...strictly4DK4.522 oz bottle poured into a flute glass, share...2009-12-25 17:26:06
\n", "

50000 rows × 13 columns

\n", "
" ], "text/plain": [ " abv beer_id brewer_id beer_name ... profile_name review_taste text \\\n", "0 7.0 2511 287 Bell's Cherry Stout ... blaheath 4.5 Batch 8144\\tPitch black in color with a 1/2 f... \n", "1 5.7 19736 9790 Duck-Rabbit Porter ... GJ40 4.0 Sampled from a 12oz bottle in a standard pint... \n", "2 4.8 11098 3182 Fürstenberg Premium Pilsener ... biegaman 3.5 Haystack yellow with an energetic group of bu... \n", "... ... ... ... ... ... ... ... ... \n", "49997 8.1 21950 2372 Terrapin Coffee Oatmeal Imperial Stout ... ugaterrapin 4.5 Poured a light sucking crude oil beckoning bl... \n", "49998 4.6 5453 1306 Badger Original Ale ... MrHurmateeowish 3.5 500ml brown bottle, 4.0% ABV. Pours a crystal... \n", "49999 9.4 47695 14879 Barrel Aged B.O.R.I.S. Oatmeal Imperial Stout ... strictly4DK 4.5 22 oz bottle poured into a flute glass, share... \n", "\n", " time \n", "0 2009-10-05 21:31:48 \n", "1 2009-10-05 21:32:09 \n", "2 2009-10-05 21:32:13 \n", "... ... \n", "49997 2009-12-25 17:23:52 \n", "49998 2009-12-25 17:25:06 \n", "49999 2009-12-25 17:26:06 \n", "\n", "[50000 rows x 13 columns]" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Int64Index: 50000 entries, 0 to 49999\n", "Data columns (total 13 columns):\n", "abv 48389 non-null float64\n", "beer_id 50000 non-null int64\n", "brewer_id 50000 non-null int64\n", "beer_name 50000 non-null object\n", "beer_style 50000 non-null object\n", "review_appearance 50000 non-null float64\n", "review_aroma 50000 non-null float64\n", "review_overall 50000 non-null float64\n", "review_palate 50000 non-null float64\n", "profile_name 50000 non-null object\n", "review_taste 50000 non-null float64\n", "text 49991 non-null object\n", "time 50000 non-null datetime64[ns]\n", "dtypes: datetime64[ns](1), float64(6), int64(2), object(4)\n", "memory usage: 5.3+ MB\n" ] } ], "source": [ "df.info()" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "'Fürstenberg Premium Pilsener'" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# we have some unicode\n", "df.loc[2,'beer_name']" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "df.to_csv('data/beer.csv', index=False, encoding='utf-8')" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Excel" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "-" } }, "source": [ "http://pandas.pydata.org/pandas-docs/stable/io.html#excel-files" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false, "slideshow": { "slide_type": "-" } }, "outputs": [], "source": [ "df.to_excel('data/beer.xls', index=False, encoding='utf-8')" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false, "slideshow": { "slide_type": "-" } }, "outputs": [], "source": [ "data = pd.read_excel('data/beer.xls', sheetnames=[0], encoding='utf-8')" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# SQL" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "http://pandas.pydata.org/pandas-docs/stable/io.html#sql-queries" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "from sqlalchemy import create_engine\n", "!rm -f data/beer.sqlite\n", "engine = create_engine('sqlite:///data/beer.sqlite')" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "df.to_sql('table', engine)" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": false }, "outputs": [], "source": [ "data = pd.read_sql('table', engine)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# JSON" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "http://pandas.pydata.org/pandas-docs/stable/io.html#json" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": true }, "outputs": [], "source": [ "df.to_json('data/beer.json')" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": true }, "outputs": [], "source": [ "data = pd.read_json('data/beer.json')" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# HDF" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "http://pandas.pydata.org/pandas-docs/stable/io.html#hdf5-pytables" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": false, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/Users/jreback/miniconda/envs/python3/lib/python3.4/site-packages/pandas/core/generic.py:938: PerformanceWarning: \n", "your performance may suffer as PyTables will pickle object types that it cannot\n", "map directly to c-types [inferred_type->mixed,key->block3_values] [items->['beer_name', 'beer_style', 'profile_name', 'text']]\n", "\n", " return pytables.to_hdf(path_or_buf, key, self, **kwargs)\n" ] } ], "source": [ "# fixed format\n", "df.to_hdf('data/beer_mixed.hdf',\n", " 'df',\n", " mode='w',\n", " format='fixed',\n", " encoding='utf-8')" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": false, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
abvbeer_idbrewer_idbeer_name...profile_namereview_tastetexttime
66634.54756213307It's Alright!...Frogzilla1.0NaN2009-10-16 00:51:28
130128.02508222Maredsous 8 - Dubbel...Flightoficarus4.0NaN2009-10-28 08:09:25
210346.5143782593Cherry Porter...CurtisFagan4.5NaN2009-11-09 23:57:11
..............................
403579.54778535Samuel Adams Double Bock (Imperial Series)...zeapo4.0NaN2009-12-10 01:42:08
409689.24736035Samuel Adams Imperial Stout...zeapo3.0NaN2009-12-11 02:14:19
486695.0924142Franziskaner Hefe-Weisse Dunkel...VTBobcat5.0NaN2009-12-23 13:54:37
\n", "

9 rows × 13 columns

\n", "
" ], "text/plain": [ " abv beer_id brewer_id beer_name ... profile_name review_taste text time\n", "6663 4.5 47562 13307 It's Alright! ... Frogzilla 1.0 NaN 2009-10-16 00:51:28\n", "13012 8.0 2508 222 Maredsous 8 - Dubbel ... Flightoficarus 4.0 NaN 2009-10-28 08:09:25\n", "21034 6.5 14378 2593 Cherry Porter ... CurtisFagan 4.5 NaN 2009-11-09 23:57:11\n", "... ... ... ... ... ... ... ... ... ...\n", "40357 9.5 47785 35 Samuel Adams Double Bock (Imperial Series) ... zeapo 4.0 NaN 2009-12-10 01:42:08\n", "40968 9.2 47360 35 Samuel Adams Imperial Stout ... zeapo 3.0 NaN 2009-12-11 02:14:19\n", "48669 5.0 924 142 Franziskaner Hefe-Weisse Dunkel ... VTBobcat 5.0 NaN 2009-12-23 13:54:37\n", "\n", "[9 rows x 13 columns]" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[df.text.isnull()]" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": true, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "data = pd.read_hdf('data/beer.hdf','df',encoding='utf-8')" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "collapsed": false, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "count 49991.000000\n", "mean 733.792003\n", "std 392.219226\n", " ... \n", "50% 642.000000\n", "75% 900.000000\n", "max 4902.000000\n", "Name: text, dtype: float64" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# wildly varying strings\n", "df.text.str.len().describe()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Timings" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "collapsed": false, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1 loops, best of 3: 2.85 s per loop\n" ] } ], "source": [ "%timeit pd.read_excel('data/beer.xls', sheetnames=[0])" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "collapsed": false, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1 loops, best of 3: 662 ms per loop\n" ] } ], "source": [ "%timeit pd.read_sql('table', engine)" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "collapsed": false, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1 loops, best of 3: 1.15 s per loop\n" ] } ], "source": [ "%timeit pd.read_json('data/beer.json')" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "collapsed": false, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1 loops, best of 3: 552 ms per loop\n" ] } ], "source": [ "%timeit pd.read_csv('data/beer.csv', parse_dates=['time'])" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "collapsed": false, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "10 loops, best of 3: 123 ms per loop\n" ] } ], "source": [ "%timeit pd.read_hdf('data/beer.hdf','df')" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "collapsed": false, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "df.to_pickle('data/beer.pkl')\n", "df.to_msgpack('data/beer.msgpack',encoding='utf-8')" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "collapsed": false, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "10 loops, best of 3: 40.2 ms per loop\n" ] } ], "source": [ "%timeit pd.read_pickle('data/beer.pkl')" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "collapsed": false, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "10 loops, best of 3: 60.7 ms per loop\n" ] } ], "source": [ "%timeit pd.read_msgpack('data/beer.msgpack', encoding='utf-8')" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Storing Text vs Data\n", "http://matthewrocklin.com/blog/work/2015/03/16/Fast-Serialization/" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Operating on Large Data" ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "collapsed": false, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0 -> 10000\n", "1 -> 10000\n", "2 -> 10000\n", "3 -> 10000\n", "4 -> 10000\n" ] } ], "source": [ "chunks = pd.read_csv('data/beer2.csv.gz', \n", " index_col=0,\n", " parse_dates=['time'],\n", " chunksize=10000)\n", "for i, chunk in enumerate(chunks):\n", " print(\"%d -> %d\" % (i, len(chunk)))" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "# Using Odo\n", "http://odo.readthedocs.org/" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "# Questions\n", "\n", "- which formats provide good fidelity\n", " - hdf5, pickle, msgpack\n", " \n", "- which formats can you query\n", " - hdf5, sql\n", " \n", "- which formats can you iterate\n", " - csv, hdf5, sql\n", " \n", "- which formats provide better interoprability\n", " - csv, json, excel\n", " \n", "- which formats can you transmit over the wire\n", " - json, msgpack\n", " \n", "- which formats have better compression\n", " - hdf5, pickle, msgpack\n", " \n", "- which formats allow multiple datasets in the same file\n", " - hdf5, msgpack" ] } ], "metadata": { "celltoolbar": "Slideshow", "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.4.3" } }, "nbformat": 4, "nbformat_minor": 0 }