{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\"Continuum,\n", "\n", "Introduction to Blaze\n", "=====================\n", "\n", "In this tutorial we'll learn how to use Blaze to discover, migrate, and query data living in other databases. This tutorial generally follows this format:\n", "\n", "1. `odo` - Move data to database\n", "2. `blaze` - Query data in database\n", "\n", "\n", "Install\n", "-------\n", "\n", "This tutorial uses many different libraries that are all available with the [Anaconda Distribution](http://continuum.io/downloads). Once you have Anaconda installed, run these commands from a terminal:\n", "\n", "```\n", "$ conda install -y blaze\n", "$ conda install -y bokeh\n", "$ conda install -y odo\n", "```\n", "\n", "nbviewer: http://nbviewer.ipython.org/github/ContinuumIO/pydata-apps/blob/master/Section-1_blaze.ipynb\n", "\n", "github: https://github.com/ContinuumIO/pydata-apps\n", "\n", "
\n", "\n", "\n", "Goal: Accessible, Interactive, Analytic Queries\n", "-----------------------------------------------\n", "\n", "NumPy and Pandas provide accessible, interactive, analytic queries; this is valuable." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Unnamed: 0SepalLengthSepalWidthPetalLengthPetalWidthSpecies
015.13.51.40.2setosa
124.93.01.40.2setosa
234.73.21.30.2setosa
344.63.11.50.2setosa
455.03.61.40.2setosa
\n", "
" ], "text/plain": [ " Unnamed: 0 SepalLength SepalWidth PetalLength PetalWidth Species\n", "0 1 5.1 3.5 1.4 0.2 setosa\n", "1 2 4.9 3.0 1.4 0.2 setosa\n", "2 3 4.7 3.2 1.3 0.2 setosa\n", "3 4 4.6 3.1 1.5 0.2 setosa\n", "4 5 5.0 3.6 1.4 0.2 setosa" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "df = pd.read_csv('iris.csv')\n", "df.head()" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "Species\n", "setosa 1.462\n", "versicolor 4.260\n", "virginica 5.552\n", "Name: PetalLength, dtype: float64" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.groupby(df.Species).PetalLength.mean() # Average petal length per species" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "But as data grows and systems become more complex, moving data and querying data become more difficult. Python already has excellent tools for data that fits in memory, but we want to hook up to data that is inconvenient.\n", "\n", "From now on, we're going to assume one of the following:\n", "\n", "1. You have an inconvenient amount of data\n", "2. That data should live someplace other than your computer\n", "\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Databases and Python\n", "--------------------\n", "\n", "When in-memory arrays/dataframes cease to be an option, we turn to databases. These live outside of the Python process and so might be less convenient. The open source Python ecosystem includes libraries to interact with these databases and with foreign data in general. \n", "\n", "Examples:\n", "\n", "* SQL - [`sqlalchemy`](http://sqlalchemy.org) \n", " * Hive/Cassandra - [`pyhive`](https://github.com/dropbox/PyHive)\n", " * Impala - [`impyla`](https://github.com/cloudera/impyla)\n", " * RedShift - [`redshift-sqlalchemy`](https://pypi.python.org/pypi/redshift-sqlalchemy)\n", " * ...\n", "* MongoDB - [`pymongo`](http://api.mongodb.org/python/current/)\n", "* HBase - [`happybase`](http://happybase.readthedocs.org/en/latest/)\n", "* Spark - [`pyspark`](http://spark.apache.org/docs/latest/api/python/)\n", "* SSH - [`paramiko`](http://www.paramiko.org/)\n", "* HDFS - [`pywebhdfs`](https://pypi.python.org/pypi/pywebhdfs)\n", "* Amazon S3 - [`boto`](https://boto.readthedocs.org/en/latest/)\n", "\n", "Today we're going to use some of these indirectly with `odo` (was `into`) and Blaze. We'll try to point out these libraries as we automate them so that, if you'd like, you can use them independently.\n", "\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\"Continuum,\n", "\n", "`odo` (formerly `into`)\n", "=======================\n", "\n", "Odo migrates data between formats and locations.\n", "\n", "Before we can use a database we need to move data into it. The `odo` project provides a single consistent interface to move data between formats and between locations.\n", "\n", "We'll start with local data and eventually move out to remote data.\n", "\n", "[*odo docs*](http://odo.readthedocs.org/en/latest/index.html)\n", "\n", "\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Examples\n", "\n", "\n", "Odo moves data into a target from a source:\n", "\n", "```python\n", ">>> odo(source, target)\n", "```\n", "\n", "The target and source can each be either a Python object or a string URI. The following are all valid calls to `odo`:\n", "\n", "```python\n", ">>> odo('iris.csv', pd.DataFrame) # Load CSV file into new DataFrame\n", ">>> odo(my_df, 'iris.json') # Write DataFrame into JSON file\n", ">>> odo('iris.csv', 'iris.json') # Migrate data from CSV to JSON\n", "```\n", "\n", "
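The `odo(source, target)` pattern can be illustrated in miniature with plain Python. The sketch below (`toy_odo` is a made-up name; this is only an illustration of the interface, nothing like odo's real dispatch machinery) converts CSV text into whichever container type you ask for:

```python
import csv
import io

def toy_odo(source_csv_text, target):
    """Toy illustration of the odo(source, target) interface:
    convert CSV text into the requested Python container type."""
    rows = list(csv.reader(io.StringIO(source_csv_text)))
    if target is list:
        return rows                      # list of row-lists, header included
    if target is dict:
        header, *data = rows
        return [dict(zip(header, row)) for row in data]
    raise NotImplementedError("no conversion to %r" % target)

text = "SepalLength,Species\n5.1,setosa\n4.9,setosa\n"
print(toy_odo(text, list)[1])   # ['5.1', 'setosa']
print(toy_odo(text, dict)[0])   # {'SepalLength': '5.1', 'Species': 'setosa'}
```

The real `odo` generalizes this idea: the target picks the conversion, and the caller never deals with the mechanics.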
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise\n", "\n", "Use `odo` to load the `iris.csv` file into a Python `list`, a `np.ndarray`, and a `pd.DataFrame`" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from odo import odo\n", "import numpy as np\n", "import pandas as pd" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", 
" \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Unnamed: 0SepalLengthSepalWidthPetalLengthPetalWidthSpecies
015.13.51.40.2setosa
124.93.01.40.2setosa
234.73.21.30.2setosa
344.63.11.50.2setosa
455.03.61.40.2setosa
565.43.91.70.4setosa
674.63.41.40.3setosa
785.03.41.50.2setosa
894.42.91.40.2setosa
9104.93.11.50.1setosa
10115.43.71.50.2setosa
11124.83.41.60.2setosa
12134.83.01.40.1setosa
13144.33.01.10.1setosa
14155.84.01.20.2setosa
15165.74.41.50.4setosa
16175.43.91.30.4setosa
17185.13.51.40.3setosa
18195.73.81.70.3setosa
19205.13.81.50.3setosa
20215.43.41.70.2setosa
21225.13.71.50.4setosa
22234.63.61.00.2setosa
23245.13.31.70.5setosa
24254.83.41.90.2setosa
25265.03.01.60.2setosa
26275.03.41.60.4setosa
27285.23.51.50.2setosa
28295.23.41.40.2setosa
29304.73.21.60.2setosa
.....................
1201216.93.25.72.3virginica
1211225.62.84.92.0virginica
1221237.72.86.72.0virginica
1231246.32.74.91.8virginica
1241256.73.35.72.1virginica
1251267.23.26.01.8virginica
1261276.22.84.81.8virginica
1271286.13.04.91.8virginica
1281296.42.85.62.1virginica
1291307.23.05.81.6virginica
1301317.42.86.11.9virginica
1311327.93.86.42.0virginica
1321336.42.85.62.2virginica
1331346.32.85.11.5virginica
1341356.12.65.61.4virginica
1351367.73.06.12.3virginica
1361376.33.45.62.4virginica
1371386.43.15.51.8virginica
1381396.03.04.81.8virginica
1391406.93.15.42.1virginica
1401416.73.15.62.4virginica
1411426.93.15.12.3virginica
1421435.82.75.11.9virginica
1431446.83.25.92.3virginica
1441456.73.35.72.5virginica
1451466.73.05.22.3virginica
1461476.32.55.01.9virginica
1471486.53.05.22.0virginica
1481496.23.45.42.3virginica
1491505.93.05.11.8virginica
\n", "

150 rows × 6 columns

\n", "
" ], "text/plain": [ " Unnamed: 0 SepalLength SepalWidth PetalLength PetalWidth Species\n", "0 1 5.1 3.5 1.4 0.2 setosa\n", "1 2 4.9 3.0 1.4 0.2 setosa\n", "2 3 4.7 3.2 1.3 0.2 setosa\n", "3 4 4.6 3.1 1.5 0.2 setosa\n", "4 5 5.0 3.6 1.4 0.2 setosa\n", "5 6 5.4 3.9 1.7 0.4 setosa\n", "6 7 4.6 3.4 1.4 0.3 setosa\n", "7 8 5.0 3.4 1.5 0.2 setosa\n", "8 9 4.4 2.9 1.4 0.2 setosa\n", "9 10 4.9 3.1 1.5 0.1 setosa\n", "10 11 5.4 3.7 1.5 0.2 setosa\n", "11 12 4.8 3.4 1.6 0.2 setosa\n", "12 13 4.8 3.0 1.4 0.1 setosa\n", "13 14 4.3 3.0 1.1 0.1 setosa\n", "14 15 5.8 4.0 1.2 0.2 setosa\n", "15 16 5.7 4.4 1.5 0.4 setosa\n", "16 17 5.4 3.9 1.3 0.4 setosa\n", "17 18 5.1 3.5 1.4 0.3 setosa\n", "18 19 5.7 3.8 1.7 0.3 setosa\n", "19 20 5.1 3.8 1.5 0.3 setosa\n", "20 21 5.4 3.4 1.7 0.2 setosa\n", "21 22 5.1 3.7 1.5 0.4 setosa\n", "22 23 4.6 3.6 1.0 0.2 setosa\n", "23 24 5.1 3.3 1.7 0.5 setosa\n", "24 25 4.8 3.4 1.9 0.2 setosa\n", "25 26 5.0 3.0 1.6 0.2 setosa\n", "26 27 5.0 3.4 1.6 0.4 setosa\n", "27 28 5.2 3.5 1.5 0.2 setosa\n", "28 29 5.2 3.4 1.4 0.2 setosa\n", "29 30 4.7 3.2 1.6 0.2 setosa\n", ".. ... ... ... ... ... 
...\n", "120 121 6.9 3.2 5.7 2.3 virginica\n", "121 122 5.6 2.8 4.9 2.0 virginica\n", "122 123 7.7 2.8 6.7 2.0 virginica\n", "123 124 6.3 2.7 4.9 1.8 virginica\n", "124 125 6.7 3.3 5.7 2.1 virginica\n", "125 126 7.2 3.2 6.0 1.8 virginica\n", "126 127 6.2 2.8 4.8 1.8 virginica\n", "127 128 6.1 3.0 4.9 1.8 virginica\n", "128 129 6.4 2.8 5.6 2.1 virginica\n", "129 130 7.2 3.0 5.8 1.6 virginica\n", "130 131 7.4 2.8 6.1 1.9 virginica\n", "131 132 7.9 3.8 6.4 2.0 virginica\n", "132 133 6.4 2.8 5.6 2.2 virginica\n", "133 134 6.3 2.8 5.1 1.5 virginica\n", "134 135 6.1 2.6 5.6 1.4 virginica\n", "135 136 7.7 3.0 6.1 2.3 virginica\n", "136 137 6.3 3.4 5.6 2.4 virginica\n", "137 138 6.4 3.1 5.5 1.8 virginica\n", "138 139 6.0 3.0 4.8 1.8 virginica\n", "139 140 6.9 3.1 5.4 2.1 virginica\n", "140 141 6.7 3.1 5.6 2.4 virginica\n", "141 142 6.9 3.1 5.1 2.3 virginica\n", "142 143 5.8 2.7 5.1 1.9 virginica\n", "143 144 6.8 3.2 5.9 2.3 virginica\n", "144 145 6.7 3.3 5.7 2.5 virginica\n", "145 146 6.7 3.0 5.2 2.3 virginica\n", "146 147 6.3 2.5 5.0 1.9 virginica\n", "147 148 6.5 3.0 5.2 2.0 virginica\n", "148 149 6.2 3.4 5.4 2.3 virginica\n", "149 150 5.9 3.0 5.1 1.8 virginica\n", "\n", "[150 rows x 6 columns]" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "odo(\"iris.csv\", pd.DataFrame)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "\n", "URI Strings\n", "-----------\n", "\n", "Odo refers to foreign data either with a Python object, like a `sqlalchemy.Table` for a SQL table, or with a string URI, like `postgresql://hostname::tablename`.\n", "\n", "URIs often take the following form\n", "\n", "    protocol://path-to-resource::path-within-resource\n", "    \n", "Here `path-to-resource` might point to a file or a database hostname, while `path-within-resource` might refer to a datapath or table name. Note the two main separators:\n", "\n", "* `://` separates the protocol on the left (`sqlite`, `mongodb`, `ssh`, `hdfs`, `hive`, ...)\n", "* `::` separates the path within the database on the right (e.g. a table name)\n", "\n", "[*odo docs on uri strings*](http://odo.readthedocs.org/en/latest/uri.html)\n", "\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Examples\n", "\n", "Here are some example URIs:\n", "\n", "```\n", "myfile.json\n", "myfiles.*.csv\n", "postgresql://hostname::tablename\n", "mongodb://hostname/db::collection\n", "ssh://user@host:/path/to/myfile.csv\n", "hdfs://user@host:/path/to/*.csv\n", "```\n", "\n", "
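The two separators make these URIs easy to pull apart mechanically. As a sketch of the anatomy described above (this is an illustration, not odo's actual parser):

```python
def split_uri(uri):
    """Split 'protocol://path-to-resource::path-within-resource'
    into its three pieces.  Illustrative only -- not odo's parser."""
    protocol, found, rest = uri.partition('://')
    if not found:                    # no '://' present: a bare local path
        protocol, rest = None, uri
    path, sep, subpath = rest.partition('::')
    return protocol, path, (subpath if sep else None)

print(split_uri('postgresql://hostname::tablename'))
# ('postgresql', 'hostname', 'tablename')
print(split_uri('myfile.json'))
# (None, 'myfile.json', None)
```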
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise\n", "\n", "Migrate your CSV file into a table named `iris` in a new SQLite database at `sqlite:///my.db`. Remember to use the `::` separator between your database name and your table name.\n", "\n", "[*odo docs on SQL*](http://odo.readthedocs.org/en/latest/sql.html)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "Table('iris', MetaData(bind=Engine(sqlite:///my.db)), Column('Unnamed: 0', BIGINT(), table=<iris>, nullable=False), Column('SepalLength', FLOAT(), table=<iris>), Column('SepalWidth', FLOAT(), table=<iris>), Column('PetalLength', FLOAT(), table=<iris>), Column('PetalWidth', FLOAT(), table=<iris>), Column('Species', TEXT(), table=<iris>), schema=None)" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "odo(\"iris.csv\", \"sqlite:///my.db::iris\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What kind of object did you receive as output? Call `type` on your result." ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "sqlalchemy.sql.schema.Table" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "type(_)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "How it works\n", "------------\n", "\n", "Odo maintains a network of fast pairwise conversions between formats. When we migrate between two formats we traverse a path of pairwise conversions through that network.\n", "\n", "We visualize that network below:\n", "\n", "![](images/conversions.png)\n", "\n", "Each node represents a data format. Each directed edge represents a function to transform data between two formats. A single call to `odo` may traverse multiple edges and multiple intermediate formats. Red nodes support larger-than-memory data.\n", "\n", "A single call to `odo` may call on several conversion functions along the way. For example, when we migrate a CSV file to a Mongo database we might take the following route:\n", "\n", "* Load into a `DataFrame` (`pandas.read_csv`)\n", "* Convert to `np.recarray` (`DataFrame.to_records`)\n", "* Then to a Python `Iterator` (`np.ndarray.tolist`)\n", "* Finally to Mongo (`pymongo.Collection.insert`)\n", "\n", "Alternatively we could write a special function that uses MongoDB's native CSV\n", "loader and shortcut this entire process with a direct edge `CSV -> Mongo`.\n", "\n", "These functions are chosen because they are fast, often far faster than converting through a central serialization format.\n", "\n", "This picture is actually from an older version of `odo`, when the graph was still small enough to visualize pleasantly. See the [*odo docs*](http://odo.readthedocs.org/en/latest/overview.html) for a more recent version.\n", "\n", "
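The route-finding step above can be sketched as a small graph search. Here the network is hard-coded to the CSV-to-Mongo example (odo's real network is far larger and weights its edges by speed; this is only a sketch with labels standing in for conversion functions):

```python
from collections import deque

# Toy stand-in for odo's conversion network: each edge is a
# pairwise conversion (labels only; real edges are functions).
edges = {
    'CSV': ['DataFrame'],
    'DataFrame': ['recarray', 'CSV'],
    'recarray': ['Iterator'],
    'Iterator': ['Mongo'],
}

def conversion_path(source, target):
    """Breadth-first search for a chain of pairwise conversions."""
    queue = deque([[source]])
    seen = {source}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for step in edges.get(path[-1], []):
            if step not in seen:
                seen.add(step)
                queue.append(path + [step])
    return None

print(conversion_path('CSV', 'Mongo'))
# ['CSV', 'DataFrame', 'recarray', 'Iterator', 'Mongo']
```

Adding a direct `'CSV': ['Mongo']` edge to the dictionary would shortcut the search to a two-node path, which is exactly what registering a special-cased conversion does in odo.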
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Remote Data\n", "-----------\n", "\n", "We can interact with remote data in three locations:\n", "\n", "1. On Amazon's S3 (this will be quick)\n", "2. On a remote machine via `ssh`\n", "3. On the Hadoop File System (HDFS)\n", "\n", "We'll return to most of these after we've seen Blaze; for now we'll briefly use S3.\n", "\n", "### S3\n", "\n", "Here we quickly grab a file from Amazon's S3.\n", "\n", "This example depends on [`boto`](https://boto.readthedocs.org/en/latest/) to interact with S3.\n", "\n", "    conda install boto\n", "\n", "[*odo docs on aws*](http://odo.readthedocs.org/en/latest/aws.html)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", 
" \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " 
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
total_billtipsexsmokerdaytimesize
016.991.01FemaleNoSunDinner2
110.341.66MaleNoSunDinner3
221.013.50MaleNoSunDinner3
323.683.31MaleNoSunDinner2
424.593.61FemaleNoSunDinner4
525.294.71MaleNoSunDinner4
68.772.00MaleNoSunDinner2
726.883.12MaleNoSunDinner4
815.041.96MaleNoSunDinner2
914.783.23MaleNoSunDinner2
1010.271.71MaleNoSunDinner2
1135.265.00FemaleNoSunDinner4
1215.421.57MaleNoSunDinner2
1318.433.00MaleNoSunDinner4
1414.833.02FemaleNoSunDinner2
1521.583.92MaleNoSunDinner2
1610.331.67FemaleNoSunDinner3
1716.293.71MaleNoSunDinner3
1816.973.50FemaleNoSunDinner3
1920.653.35MaleNoSatDinner3
2017.924.08MaleNoSatDinner2
2120.292.75FemaleNoSatDinner2
2215.772.23FemaleNoSatDinner2
2339.427.58MaleNoSatDinner4
2419.823.18MaleNoSatDinner2
2517.812.34MaleNoSatDinner4
2613.372.00MaleNoSatDinner2
2712.692.00MaleNoSatDinner2
2821.704.30MaleNoSatDinner2
2919.653.00FemaleNoSatDinner2
........................
21428.176.50FemaleYesSatDinner3
21512.901.10FemaleYesSatDinner2
21628.153.00MaleYesSatDinner5
21711.591.50MaleYesSatDinner2
2187.741.44MaleYesSatDinner2
21930.143.09FemaleYesSatDinner4
22012.162.20MaleYesFriLunch2
22113.423.48FemaleYesFriLunch2
2228.581.92MaleYesFriLunch1
22315.983.00FemaleNoFriLunch3
22413.421.58MaleYesFriLunch2
22516.272.50FemaleYesFriLunch2
22610.092.00FemaleYesFriLunch2
22720.453.00MaleNoSatDinner4
22813.282.72MaleNoSatDinner2
22922.122.88FemaleYesSatDinner2
23024.012.00MaleYesSatDinner4
23115.693.00MaleYesSatDinner3
23211.613.39MaleNoSatDinner2
23310.771.47MaleNoSatDinner2
23415.533.00MaleYesSatDinner2
23510.071.25MaleNoSatDinner2
23612.601.00MaleYesSatDinner2
23732.831.17MaleYesSatDinner2
23835.834.67FemaleNoSatDinner3
23929.035.92MaleNoSatDinner3
24027.182.00FemaleYesSatDinner2
24122.672.00MaleYesSatDinner2
24217.821.75MaleNoSatDinner2
24318.783.00FemaleNoThurDinner2
\n", "

244 rows × 7 columns

\n", "
" ], "text/plain": [ " total_bill tip sex smoker day time size\n", "0 16.99 1.01 Female No Sun Dinner 2\n", "1 10.34 1.66 Male No Sun Dinner 3\n", "2 21.01 3.50 Male No Sun Dinner 3\n", "3 23.68 3.31 Male No Sun Dinner 2\n", "4 24.59 3.61 Female No Sun Dinner 4\n", "5 25.29 4.71 Male No Sun Dinner 4\n", "6 8.77 2.00 Male No Sun Dinner 2\n", "7 26.88 3.12 Male No Sun Dinner 4\n", "8 15.04 1.96 Male No Sun Dinner 2\n", "9 14.78 3.23 Male No Sun Dinner 2\n", "10 10.27 1.71 Male No Sun Dinner 2\n", "11 35.26 5.00 Female No Sun Dinner 4\n", "12 15.42 1.57 Male No Sun Dinner 2\n", "13 18.43 3.00 Male No Sun Dinner 4\n", "14 14.83 3.02 Female No Sun Dinner 2\n", "15 21.58 3.92 Male No Sun Dinner 2\n", "16 10.33 1.67 Female No Sun Dinner 3\n", "17 16.29 3.71 Male No Sun Dinner 3\n", "18 16.97 3.50 Female No Sun Dinner 3\n", "19 20.65 3.35 Male No Sat Dinner 3\n", "20 17.92 4.08 Male No Sat Dinner 2\n", "21 20.29 2.75 Female No Sat Dinner 2\n", "22 15.77 2.23 Female No Sat Dinner 2\n", "23 39.42 7.58 Male No Sat Dinner 4\n", "24 19.82 3.18 Male No Sat Dinner 2\n", "25 17.81 2.34 Male No Sat Dinner 4\n", "26 13.37 2.00 Male No Sat Dinner 2\n", "27 12.69 2.00 Male No Sat Dinner 2\n", "28 21.70 4.30 Male No Sat Dinner 2\n", "29 19.65 3.00 Female No Sat Dinner 2\n", ".. ... ... ... ... ... ... 
...\n", "214 28.17 6.50 Female Yes Sat Dinner 3\n", "215 12.90 1.10 Female Yes Sat Dinner 2\n", "216 28.15 3.00 Male Yes Sat Dinner 5\n", "217 11.59 1.50 Male Yes Sat Dinner 2\n", "218 7.74 1.44 Male Yes Sat Dinner 2\n", "219 30.14 3.09 Female Yes Sat Dinner 4\n", "220 12.16 2.20 Male Yes Fri Lunch 2\n", "221 13.42 3.48 Female Yes Fri Lunch 2\n", "222 8.58 1.92 Male Yes Fri Lunch 1\n", "223 15.98 3.00 Female No Fri Lunch 3\n", "224 13.42 1.58 Male Yes Fri Lunch 2\n", "225 16.27 2.50 Female Yes Fri Lunch 2\n", "226 10.09 2.00 Female Yes Fri Lunch 2\n", "227 20.45 3.00 Male No Sat Dinner 4\n", "228 13.28 2.72 Male No Sat Dinner 2\n", "229 22.12 2.88 Female Yes Sat Dinner 2\n", "230 24.01 2.00 Male Yes Sat Dinner 4\n", "231 15.69 3.00 Male Yes Sat Dinner 3\n", "232 11.61 3.39 Male No Sat Dinner 2\n", "233 10.77 1.47 Male No Sat Dinner 2\n", "234 15.53 3.00 Male Yes Sat Dinner 2\n", "235 10.07 1.25 Male No Sat Dinner 2\n", "236 12.60 1.00 Male Yes Sat Dinner 2\n", "237 32.83 1.17 Male Yes Sat Dinner 2\n", "238 35.83 4.67 Female No Sat Dinner 3\n", "239 29.03 5.92 Male No Sat Dinner 3\n", "240 27.18 2.00 Female Yes Sat Dinner 2\n", "241 22.67 2.00 Male Yes Sat Dinner 2\n", "242 17.82 1.75 Male No Sat Dinner 2\n", "243 18.78 3.00 Female No Thur Dinner 2\n", "\n", "[244 rows x 7 columns]" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "odo('s3://nyqpug/tips.csv', pd.DataFrame)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "\"Continuum,\n", "\n", "Blaze\n", "=====\n", "\n", "Blaze translates a subset of numpy/pandas syntax into database queries. It hides away the database.\n", "\n", "On simple datasets, like CSV files, Blaze acts like Pandas with slightly different syntax. In this case Blaze is just using Pandas." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "### Pandas example" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Unnamed: 0SepalLengthSepalWidthPetalLengthPetalWidthSpecies
015.13.51.40.2setosa
124.93.01.40.2setosa
234.73.21.30.2setosa
344.63.11.50.2setosa
455.03.61.40.2setosa
\n", "
" ], "text/plain": [ " Unnamed: 0 SepalLength SepalWidth PetalLength PetalWidth Species\n", "0 1 5.1 3.5 1.4 0.2 setosa\n", "1 2 4.9 3.0 1.4 0.2 setosa\n", "2 3 4.7 3.2 1.3 0.2 setosa\n", "3 4 4.6 3.1 1.5 0.2 setosa\n", "4 5 5.0 3.6 1.4 0.2 setosa" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "\n", "df = pd.read_csv('iris.csv')\n", "df.head(5)" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array(['setosa', 'versicolor', 'virginica'], dtype=object)" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.Species.unique()" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 setosa\n", "50 versicolor\n", "100 virginica\n", "Name: Species, dtype: object" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.Species.drop_duplicates()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "### Blaze example" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Unnamed: 0SepalLengthSepalWidthPetalLengthPetalWidthSpecies
015.13.51.40.2setosa
124.93.01.40.2setosa
234.73.21.30.2setosa
344.63.11.50.2setosa
455.03.61.40.2setosa
" ], "text/plain": [ " Unnamed: 0 SepalLength SepalWidth PetalLength PetalWidth Species\n", "0 1 5.1 3.5 1.4 0.2 setosa\n", "1 2 4.9 3.0 1.4 0.2 setosa\n", "2 3 4.7 3.2 1.3 0.2 setosa\n", "3 4 4.6 3.1 1.5 0.2 setosa\n", "4 5 5.0 3.6 1.4 0.2 setosa" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import blaze as bz\n", "\n", "d = bz.Data('iris.csv')\n", "d.head(5)" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Species
0setosa
1versicolor
2virginica
" ], "text/plain": [ " Species\n", "0 setosa\n", "1 versicolor\n", "2 virginica" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "d.Species.distinct()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "Foreign Data\n", "------------\n", "\n", "Blaze does different things under-the-hood on different kinds of data\n", "\n", "* CSV files: Pandas DataFrames (or iterators of DataFrames)\n", "* SQL tables: [SQLAlchemy](http://sqlalchemy.org).\n", "* Mongo collections: [PyMongo](http://api.mongodb.org/python/current/)\n", "* ...\n", "\n", "SQL\n", "---\n", "\n", "We'll play with SQL a lot during this tutorial. Blaze translates your query to SQLAlchemy. SQLAlchemy then translates to the SQL dialect of your database, your database then executes that query intelligently.\n", "\n", "* Blaze $\\rightarrow$ SQLAlchemy $\\rightarrow$ SQL $\\rightarrow$ Database computation\n", "\n", "This translation process lets analysts interact with a familiar interface while leveraging a potentially powerful database.\n", "\n", "To keep things local we'll use SQLite, but this works with any database with a SQLAlchemy dialect. Examples in this section use the iris dataset. Exercises use the Lahman Baseball statistics database, year 2013.\n", "\n", "If you have not downloaded this dataset you could do so here - https://github.com/jknecht/baseball-archive-sqlite/raw/master/lahman2013.sqlite. \n", "\n", "
" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\u001b[1m\u001b[34m06-simple-apps\u001b[m\u001b[m Section3-2_bokeh.ipynb\r\n", "\u001b[1m\u001b[34m07-final-app\u001b[m\u001b[m \u001b[1m\u001b[34mimages\u001b[m\u001b[m\r\n", "LICENSE iris.csv\r\n", "Section-1_blaze-solutions.ipynb lahman2013.sqlite\r\n", "Section3-1_blaze.ipynb my.db\r\n" ] } ], "source": [ "!ls " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Examples\n", "\n", "Lets dive into Blaze Syntax. For simple queries it looks and feels similar to Pandas" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": false }, "outputs": [], "source": [ "db = bz.Data('sqlite:///my.db')\n", "#db.iris\n", "#db.iris.head()" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Species
0Species
1setosa
2versicolor
3virginica
" ], "text/plain": [ " Species\n", "0 Species\n", "1 setosa\n", "2 versicolor\n", "3 virginica" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "db.iris.Species.distinct()" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
SpeciesSepalLength
0versicolor7.0
1versicolor6.4
2versicolor6.9
3versicolor5.5
4versicolor6.5
5versicolor5.7
6versicolor6.3
7versicolor4.9
8versicolor6.6
9versicolor5.2
10versicolor5.0
" ], "text/plain": [ " Species SepalLength\n", "0 versicolor 7.0\n", "1 versicolor 6.4\n", "2 versicolor 6.9\n", "3 versicolor 5.5\n", "4 versicolor 6.5\n", "5 versicolor 5.7\n", "6 versicolor 6.3\n", "7 versicolor 4.9\n", "8 versicolor 6.6\n", "9 versicolor 5.2\n", "..." ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "db.iris[db.iris.Species == 'versicolor'][['Species', 'SepalLength']]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "### Work happens on the database\n", "\n", "If we were using pandas we would read the table into pandas, then use pandas' fast in-memory algorithms for computation. Here we translate your query into SQL and then send that query to the database to do the work.\n", "\n", "* Pandas $\\leftarrow_\\textrm{data}$ SQL, then Pandas computes\n", "* Blaze $\\rightarrow_\\textrm{query}$ SQL, then database computes\n", "\n", "If we want to dive into the internal API we can inspect the query that Blaze transmits.\n", "\n", "
" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "SELECT iris.\"Species\", iris.\"SepalLength\" \n", "FROM iris \n", "WHERE iris.\"Species\" = ?\n" ] } ], "source": [ "# Inspect SQL query\n", "query = db.iris[db.iris.Species == 'versicolor'][['Species', 'SepalLength']]\n", "print bz.compute(query)" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "SELECT iris.\"Species\", max(iris.\"PetalLength\") AS longest, min(iris.\"PetalLength\") AS shortest \n", "FROM iris GROUP BY iris.\"Species\"\n" ] } ], "source": [ "query = bz.by(db.iris.Species, longest=db.iris.PetalLength.max(),\n", " shortest=db.iris.PetalLength.min())\n", "print bz.compute(query)" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[(u'Species', u'PetalLength', u'PetalLength'),\n", " (u'setosa', 1.9, 1.0),\n", " (u'versicolor', 5.1, 3.0),\n", " (u'virginica', 6.9, 4.5)]" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "odo(query, list)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "### Exercises\n", "\n", "Now we load the Lahman baseball database and perform similar queries" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "dshape(\"\"\"{\n", " AllstarFull: var * {\n", " playerID: ?string,\n", " yearID: ?int32,\n", " gameNum: ?int32,\n", " gameID: ?string,\n", " teamID: ?string,\n", " lgID: ?string,\n", " GP: ?int32,\n", " startingPos: ?int32\n", " },\n", " Appearances: var * {\n", " yearID: ?int32,\n", " teamID: ?string,\n", " lgID: ?string,\n", " playerID: ?string,\n", " G_all: ?int32,\n", " GS: ?int32,\n", " G_batting: ?int32,\n", " G_defense: ?int32,\n", " G_p: ?int32,\n", " G_c: ?int32,\n", " G_1b: ?int32,\n", " G_2b: ?int32,\n", " G_3b: ?int32,\n", " G_ss: ?int32,\n", " G_lf: ?int32,\n", " G_cf: ?int32,\n", " G_rf: ?int32,\n", " G_of: ?int32,\n", " G_dh: ?int32,\n", " G_ph: ?int32,\n", " G_pr: ?int32\n", " },\n", " AwardsManagers: var * {\n", " playerID: ?string,\n", " awardID: ?string,\n", " yearID: ?int32,\n", " lgID: ?string,\n", " tie: ?string,\n", " notes: ?string\n", " },\n", " AwardsPlayers: var * {\n", " playerID: ?string,\n", " awardID: ?string,\n", " yearID: ?int32,\n", " lgID: ?string,\n", " tie: ?string,\n", " notes: ?string\n", " },\n", " AwardsShareManagers: var * {\n", " awardID: ?string,\n", " yearID: ?int32,\n", " lgID: ?string,\n", " playerID: ?string,\n", " pointsWon: ?int32,\n", " pointsMax: ?int32,\n", " votesFirst: ?int32\n", " },\n", " AwardsSharePlayers: var * {\n", " awardID: ?string,\n", " yearID: ?int32,\n", " lgID: ?string,\n", " playerID: ?string,\n", " pointsWon: ?float64,\n", " pointsMax: ?int32,\n", " votesFirst: ?float64\n", " },\n", " Batting: var * {\n", " playerID: ?string,\n", " yearID: ?int32,\n", " stint: ?int32,\n", " teamID: ?string,\n", " lgID: ?string,\n", " G: ?int32,\n", " G_batting: ?int32,\n", " AB: ?int32,\n", " R: ?int32,\n", " H: ?int32,\n", " 2B: ?int32,\n", " 3B: ?int32,\n", " HR: ?int32,\n", " 
RBI: ?int32,\n", " SB: ?int32,\n", " CS: ?int32,\n", " BB: ?int32,\n", " SO: ?int32,\n", " IBB: ?int32,\n", " HBP: ?int32,\n", " SH: ?int32,\n", " SF: ?int32,\n", " GIDP: ?int32,\n", " G_old: ?int32\n", " },\n", " BattingPost: var * {\n", " yearID: ?int32,\n", " round: ?string,\n", " playerID: ?string,\n", " teamID: ?string,\n", " lgID: ?string,\n", " G: ?int32,\n", " AB: ?int32,\n", " R: ?int32,\n", " H: ?int32,\n", " 2B: ?int32,\n", " 3B: ?int32,\n", " HR: ?int32,\n", " RBI: ?int32,\n", " SB: ?int32,\n", " CS: ?int32,\n", " BB: ?int32,\n", " SO: ?int32,\n", " IBB: ?int32,\n", " HBP: ?int32,\n", " SH: ?int32,\n", " SF: ?int32,\n", " GIDP: ?int32\n", " },\n", " Fielding: var * {\n", " playerID: ?string,\n", " yearID: ?int32,\n", " stint: ?int32,\n", " teamID: ?string,\n", " lgID: ?string,\n", " POS: ?string,\n", " G: ?int32,\n", " GS: ?int32,\n", " InnOuts: ?int32,\n", " PO: ?int32,\n", " A: ?int32,\n", " E: ?int32,\n", " DP: ?int32,\n", " PB: ?int32,\n", " WP: ?int32,\n", " SB: ?int32,\n", " CS: ?int32,\n", " ZR: ?float64\n", " },\n", " FieldingOF: var * {\n", " playerID: ?string,\n", " yearID: ?int32,\n", " stint: ?int32,\n", " Glf: ?int32,\n", " Gcf: ?int32,\n", " Grf: ?int32\n", " },\n", " FieldingPost: var * {\n", " playerID: ?string,\n", " yearID: ?int32,\n", " teamID: ?string,\n", " lgID: ?string,\n", " round: ?string,\n", " POS: ?string,\n", " G: ?int32,\n", " GS: ?int32,\n", " InnOuts: ?int32,\n", " PO: ?int32,\n", " A: ?int32,\n", " E: ?int32,\n", " DP: ?int32,\n", " TP: ?int32,\n", " PB: ?int32,\n", " SB: ?int32,\n", " CS: ?int32\n", " },\n", " HallOfFame: var * {\n", " playerID: ?string,\n", " yearid: ?int32,\n", " votedBy: ?string,\n", " ballots: ?int32,\n", " needed: ?int32,\n", " votes: ?int32,\n", " inducted: ?string,\n", " category: ?string,\n", " needed_note: ?string\n", " },\n", " Managers: var * {\n", " playerID: ?string,\n", " yearID: ?int32,\n", " teamID: ?string,\n", " lgID: ?string,\n", " inseason: ?int32,\n", " G: ?int32,\n", " W: 
?int32,\n", " L: ?int32,\n", " rank: ?int32,\n", " plyrMgr: ?string\n", " },\n", " ManagersHalf: var * {\n", " playerID: ?string,\n", " yearID: ?int32,\n", " teamID: ?string,\n", " lgID: ?string,\n", " inseason: ?int32,\n", " half: ?int32,\n", " G: ?int32,\n", " W: ?int32,\n", " L: ?int32,\n", " rank: ?int32\n", " },\n", " Master: var * {\n", " playerID: ?string,\n", " birthYear: ?int32,\n", " birthMonth: ?int32,\n", " birthDay: ?int32,\n", " birthCountry: ?string,\n", " birthState: ?string,\n", " birthCity: ?string,\n", " deathYear: ?int32,\n", " deathMonth: ?int32,\n", " deathDay: ?int32,\n", " deathCountry: ?string,\n", " deathState: ?string,\n", " deathCity: ?string,\n", " nameFirst: ?string,\n", " nameLast: ?string,\n", " nameGiven: ?string,\n", " weight: ?int32,\n", " height: ?float64,\n", " bats: ?string,\n", " throws: ?string,\n", " debut: ?float64,\n", " finalGame: ?float64,\n", " retroID: ?string,\n", " bbrefID: ?string\n", " },\n", " Pitching: var * {\n", " playerID: ?string,\n", " yearID: ?int32,\n", " stint: ?int32,\n", " teamID: ?string,\n", " lgID: ?string,\n", " W: ?int32,\n", " L: ?int32,\n", " G: ?int32,\n", " GS: ?int32,\n", " CG: ?int32,\n", " SHO: ?int32,\n", " SV: ?int32,\n", " IPouts: ?int32,\n", " H: ?int32,\n", " ER: ?int32,\n", " HR: ?int32,\n", " BB: ?int32,\n", " SO: ?int32,\n", " BAOpp: ?float64,\n", " ERA: ?float64,\n", " IBB: ?int32,\n", " WP: ?int32,\n", " HBP: ?int32,\n", " BK: ?int32,\n", " BFP: ?int32,\n", " GF: ?int32,\n", " R: ?int32,\n", " SH: ?int32,\n", " SF: ?int32,\n", " GIDP: ?int32\n", " },\n", " PitchingPost: var * {\n", " playerID: ?string,\n", " yearID: ?int32,\n", " round: ?string,\n", " teamID: ?string,\n", " lgID: ?string,\n", " W: ?int32,\n", " L: ?int32,\n", " G: ?int32,\n", " GS: ?int32,\n", " CG: ?int32,\n", " SHO: ?int32,\n", " SV: ?int32,\n", " IPouts: ?int32,\n", " H: ?int32,\n", " ER: ?int32,\n", " HR: ?int32,\n", " BB: ?int32,\n", " SO: ?int32,\n", " BAOpp: ?float64,\n", " ERA: ?float64,\n", " IBB: 
?int32,\n", " WP: ?int32,\n", " HBP: ?int32,\n", " BK: ?int32,\n", " BFP: ?int32,\n", " GF: ?int32,\n", " R: ?int32,\n", " SH: ?int32,\n", " SF: ?int32,\n", " GIDP: ?int32\n", " },\n", " Salaries: var * {\n", " yearID: ?int32,\n", " teamID: ?string,\n", " lgID: ?string,\n", " playerID: ?string,\n", " salary: ?float64\n", " },\n", " Schools: var * {\n", " schoolID: ?string,\n", " schoolName: ?string,\n", " schoolCity: ?string,\n", " schoolState: ?string,\n", " schoolNick: ?string\n", " },\n", " SchoolsPlayers: var * {\n", " playerID: ?string,\n", " schoolID: ?string,\n", " yearMin: ?int32,\n", " yearMax: ?int32\n", " },\n", " SeriesPost: var * {\n", " yearID: ?int32,\n", " round: ?string,\n", " teamIDwinner: ?string,\n", " lgIDwinner: ?string,\n", " teamIDloser: ?string,\n", " lgIDloser: ?string,\n", " wins: ?int32,\n", " losses: ?int32,\n", " ties: ?int32\n", " },\n", " Teams: var * {\n", " yearID: ?int32,\n", " lgID: ?string,\n", " teamID: ?string,\n", " franchID: ?string,\n", " divID: ?string,\n", " Rank: ?int32,\n", " G: ?int32,\n", " Ghome: ?int32,\n", " W: ?int32,\n", " L: ?int32,\n", " DivWin: ?string,\n", " WCWin: ?string,\n", " LgWin: ?string,\n", " WSWin: ?string,\n", " R: ?int32,\n", " AB: ?int32,\n", " H: ?int32,\n", " 2B: ?int32,\n", " 3B: ?int32,\n", " HR: ?int32,\n", " BB: ?int32,\n", " SO: ?int32,\n", " SB: ?int32,\n", " CS: ?int32,\n", " HBP: ?int32,\n", " SF: ?int32,\n", " RA: ?int32,\n", " ER: ?int32,\n", " ERA: ?float64,\n", " CG: ?int32,\n", " SHO: ?int32,\n", " SV: ?int32,\n", " IPouts: ?int32,\n", " HA: ?int32,\n", " HRA: ?int32,\n", " BBA: ?int32,\n", " SOA: ?int32,\n", " E: ?int32,\n", " DP: ?int32,\n", " FP: ?float64,\n", " name: ?string,\n", " park: ?string,\n", " attendance: ?int32,\n", " BPF: ?int32,\n", " PPF: ?int32,\n", " teamIDBR: ?string,\n", " teamIDlahman45: ?string,\n", " teamIDretro: ?string\n", " },\n", " TeamsFranchises: var * {\n", " franchID: ?string,\n", " franchName: ?string,\n", " active: ?string,\n", " NAassoc: 
?string\n", " },\n", " TeamsHalf: var * {\n", " yearID: ?int32,\n", " lgID: ?string,\n", " teamID: ?string,\n", " Half: ?string,\n", " divID: ?string,\n", " DivWin: ?string,\n", " Rank: ?int32,\n", " G: ?int32,\n", " W: ?int32,\n", " L: ?int32\n", " },\n", " temp: var * {ID: ?int32, namefull: ?string, born: ?float64}\n", " }\"\"\")" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# db = bz.Data('postgresql://postgres:postgres@ec2-54-159-160-163.compute-1.amazonaws.com') # Use Postgres if you don't have the sqlite file\n", "db = bz.Data('sqlite:///lahman2013.sqlite')\n", "db.dshape" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# View the Salaries table\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# What are the distinct teamIDs in the Salaries table?\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# What is the minimum and maximum yearID in the Sarlaries table? \n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# For the Oakland Athletics (teamID OAK), pick out the playerID, salary, and yearID columns\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Sort that result by salary. \n", "# Use the ascending=False keyword argument to the sort function to find the highest paid players\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "### Example: Split-apply-combine\n", "\n", "In Pandas we perform computations on a *per-group* basis with the `groupby` operator. In Blaze our syntax is slightly different, using instead the `by` function." ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "Species\n", "setosa 1.0\n", "versicolor 3.0\n", "virginica 4.5\n", "Name: PetalLength, dtype: float64" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "iris = pd.read_csv('iris.csv')\n", "iris.groupby('Species').PetalLength.min()" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Species\n", "setosa 1.0\n", "versicolor 3.0\n", "virginica 4.5\n", "Name: PetalLength, dtype: float64\n" ] } ], "source": [ "iris = bz.Data('sqlite:///my.db::iris')\n", "bz.by(iris.Species, largest=iris.PetalLength.max(), \n", " smallest=iris.PetalLength.min())\n", "print(_)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "Store Results\n", "-------------\n", "\n", "By default Blaze only shows us the first ten lines of a result. This provides a more interactive feel and stops us from accidentally crushing our system. Sometimes we do want to compute all of the results and store them someplace.\n", "\n", "Blaze expressions are valid sources for `odo`. So we can store our results in any format." ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[(u'Species', u'PetalLength', u'PetalLength'),\n", " (u'setosa', 1.9, 1.0),\n", " (u'versicolor', 5.1, 3.0),\n", " (u'virginica', 6.9, 4.5)]" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "iris = bz.Data('sqlite:///my.db::iris')\n", "query = bz.by(iris.Species, largest=iris.PetalLength.max(), # A lazily evaluated result\n", " smallest=iris.PetalLength.min()) \n", "\n", "odo(query, list) # A concrete result" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "### Exercise: Storage\n", "\n", "The solution to the first split-apply-combine problem is below. Store that result in a list, a CSV file, and in a new SQL table in our database (use a uri like `sqlite://...` to specify the SQL table.)" ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "collapsed": false }, "outputs": [], "source": [ "result = bz.by(db.Salaries.teamID, avg=db.Salaries.salary.mean(), \n", " max=db.Salaries.salary.max(), \n", " ratio=db.Salaries.salary.max() / db.Salaries.salary.min()\n", " ).sort('ratio', ascending=False)" ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[(u'PHI', 2092230.9326364691, 25000000.0, 416.6666666666667),\n", " (u'LAN', 2346982.6980255516, 23854494.0, 397.5749),\n", " (u'NYN', 2317349.977245509, 23145011.0, 385.7501833333333),\n", " (u'DET', 1980834.9902080784, 23000000.0, 383.3333333333333),\n", " (u'MIN', 1525031.650385604, 23000000.0, 383.3333333333333),\n", " (u'BOS', 2692113.8568075118, 22500000.0, 375.0),\n", " (u'SFN', 2044198.6841463414, 22250000.0, 370.8333333333333),\n", " (u'TEX', 1874651.551143201, 22000000.0, 366.6666666666667),\n", " (u'SEA', 1932288.913202934, 20557143.0, 342.61905),\n", " (u'CHA', 1992653.5012315272, 17000000.0, 340.0)]" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "odo(result, list)[:10]" ] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.9" } }, "nbformat": 4, "nbformat_minor": 0 }