{
 "metadata": {
  "name": "",
  "signature": "sha256:d825a277965c84a256612bb6ff63f606b5bba263c90ac5407506f08370ba339b"
 },
 "nbformat": 3,
 "nbformat_minor": 0,
 "worksheets": [
  {
   "cells": [
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "<img src=\"images/blaze_med.png\" alt=\"Blaze Logo\"\n",
      "                                align=\"right\"\n",
      "                                width=\"30%\">\n",
      "\n",
      "# Getting Started with Blaze\n",
      "\n",
      "* Full tutorial available at http://github.com/ContinuumIO/blaze-tutorial\n",
      "* Install software with `conda install -c blaze blaze`"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "## 1.  Basic Queries\n",
      "\n",
      "For basic tabular queries, Blaze shares the same syntax as Pandas."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from blaze import Data, by, join, transform"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 1
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "bank = Data([[1, 'Alice',   100],\n",
      "             [2, 'Bob',    -200],\n",
      "             [3, 'Charlie', 300],\n",
      "             [4, 'Dennis',  400],\n",
      "             [5, 'Edith',  -500]], columns=['id', 'name', 'amount'])"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 2
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### Arithmetic and Reductions"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "bank.amount"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 3,
       "text": [
        "   amount\n",
        "0     100\n",
        "1    -200\n",
        "2     300\n",
        "3     400\n",
        "4    -500"
       ]
      }
     ],
     "prompt_number": 3
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "bank.amount / 100"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 4,
       "text": [
        "   amount\n",
        "0       1\n",
        "1      -2\n",
        "2       3\n",
        "3       4\n",
        "4      -5"
       ]
      }
     ],
     "prompt_number": 4
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "(bank.amount / 100).mean()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 5,
       "text": [
        "0.2"
       ]
      }
     ],
     "prompt_number": 5
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### Multiple columns and sorting"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "bank[['name', 'amount']].sort('amount')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### Selections\n",
      "\n",
      "We select subsets of data by indexing one expression with another"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "bank[bank.amount < 0]"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 6,
       "text": [
        "   id   name  amount\n",
        "0   2    Bob    -200\n",
        "1   5  Edith    -500"
       ]
      }
     ],
     "prompt_number": 6
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### Combining Operations\n",
      "\n",
      "We can combine these sorts of operations with each other"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "bank[bank.amount < 0].amount / 100"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 7,
       "text": [
        "   amount\n",
        "0      -2\n",
        "1      -5"
       ]
      }
     ],
     "prompt_number": 7
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "bank[bank.amount < 0].name"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 8,
       "text": [
        "    name\n",
        "0    Bob\n",
        "1  Edith"
       ]
      }
     ],
     "prompt_number": 8
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "#### Exercises\n",
      "\n",
      "Write expressions to answer the following questions"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# What are the IDs of everyone with a positive amount?\n"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 9
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# What is the name of the person with amount 400?\n"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 10
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# What is the difference between the minimum and maximum amounts?\n"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 11
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "## 2. More complex queries\n",
      "\n"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "First, we need a more interesting dataset.  We open the standard *iris* dataset, a table of 150 measurements of flowers in the iris genus.  We find this dataset in a CSV file in the `data/` directory.  "
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "iris = Data('data/iris.csv')\n",
      "iris"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "html": [
        "<table border=\"1\" class=\"dataframe\">\n",
        "  <thead>\n",
        "    <tr style=\"text-align: right;\">\n",
        "      <th></th>\n",
        "      <th>sepal_length</th>\n",
        "      <th>sepal_width</th>\n",
        "      <th>petal_length</th>\n",
        "      <th>petal_width</th>\n",
        "      <th>species</th>\n",
        "    </tr>\n",
        "  </thead>\n",
        "  <tbody>\n",
        "    <tr>\n",
        "      <th>0 </th>\n",
        "      <td> 5.1</td>\n",
        "      <td> 3.5</td>\n",
        "      <td> 1.4</td>\n",
        "      <td> 0.2</td>\n",
        "      <td> Iris-setosa</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>1 </th>\n",
        "      <td> 4.9</td>\n",
        "      <td> 3.0</td>\n",
        "      <td> 1.4</td>\n",
        "      <td> 0.2</td>\n",
        "      <td> Iris-setosa</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>2 </th>\n",
        "      <td> 4.7</td>\n",
        "      <td> 3.2</td>\n",
        "      <td> 1.3</td>\n",
        "      <td> 0.2</td>\n",
        "      <td> Iris-setosa</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>3 </th>\n",
        "      <td> 4.6</td>\n",
        "      <td> 3.1</td>\n",
        "      <td> 1.5</td>\n",
        "      <td> 0.2</td>\n",
        "      <td> Iris-setosa</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>4 </th>\n",
        "      <td> 5.0</td>\n",
        "      <td> 3.6</td>\n",
        "      <td> 1.4</td>\n",
        "      <td> 0.2</td>\n",
        "      <td> Iris-setosa</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>5 </th>\n",
        "      <td> 5.4</td>\n",
        "      <td> 3.9</td>\n",
        "      <td> 1.7</td>\n",
        "      <td> 0.4</td>\n",
        "      <td> Iris-setosa</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>6 </th>\n",
        "      <td> 4.6</td>\n",
        "      <td> 3.4</td>\n",
        "      <td> 1.4</td>\n",
        "      <td> 0.3</td>\n",
        "      <td> Iris-setosa</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>7 </th>\n",
        "      <td> 5.0</td>\n",
        "      <td> 3.4</td>\n",
        "      <td> 1.5</td>\n",
        "      <td> 0.2</td>\n",
        "      <td> Iris-setosa</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>8 </th>\n",
        "      <td> 4.4</td>\n",
        "      <td> 2.9</td>\n",
        "      <td> 1.4</td>\n",
        "      <td> 0.2</td>\n",
        "      <td> Iris-setosa</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>9 </th>\n",
        "      <td> 4.9</td>\n",
        "      <td> 3.1</td>\n",
        "      <td> 1.5</td>\n",
        "      <td> 0.1</td>\n",
        "      <td> Iris-setosa</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>10</th>\n",
        "      <td> 5.4</td>\n",
        "      <td> 3.7</td>\n",
        "      <td> 1.5</td>\n",
        "      <td> 0.2</td>\n",
        "      <td> Iris-setosa</td>\n",
        "    </tr>\n",
        "  </tbody>\n",
        "</table>"
       ],
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 12,
       "text": [
        "    sepal_length  sepal_width  petal_length  petal_width      species\n",
        "0            5.1          3.5           1.4          0.2  Iris-setosa\n",
        "1            4.9          3.0           1.4          0.2  Iris-setosa\n",
        "2            4.7          3.2           1.3          0.2  Iris-setosa\n",
        "3            4.6          3.1           1.5          0.2  Iris-setosa\n",
        "4            5.0          3.6           1.4          0.2  Iris-setosa\n",
        "5            5.4          3.9           1.7          0.4  Iris-setosa\n",
        "6            4.6          3.4           1.4          0.3  Iris-setosa\n",
        "7            5.0          3.4           1.5          0.2  Iris-setosa\n",
        "8            4.4          2.9           1.4          0.2  Iris-setosa\n",
        "9            4.9          3.1           1.5          0.1  Iris-setosa\n",
        "..."
       ]
      }
     ],
     "prompt_number": 12
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "The `blaze.Data` function operates on all of the file types that we saw in the previous sections on `into`.  Blaze expressions use functions like `discover` to get datashapes that help them interact with you."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "iris.dshape"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 13,
       "text": [
        "dshape(\"\"\"var * {\n",
        "  sepal_length: float64,\n",
        "  sepal_width: float64,\n",
        "  petal_length: float64,\n",
        "  petal_width: float64,\n",
        "  species: string\n",
        "  }\"\"\")"
       ]
      }
     ],
     "prompt_number": 13
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### Distinct\n",
      "\n",
      "Now some more queries.  Distinct finds unique entries"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "iris.species.distinct()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 14,
       "text": [
        "           species\n",
        "0      Iris-setosa\n",
        "1  Iris-versicolor\n",
        "2   Iris-virginica"
       ]
      }
     ],
     "prompt_number": 14
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Or count the number of distinct entries"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "iris.species.nunique()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 15,
       "text": [
        "3"
       ]
      }
     ],
     "prompt_number": 15
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "iris.sepal_length.nunique()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 16,
       "text": [
        "35"
       ]
      }
     ],
     "prompt_number": 16
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### Transform\n",
      "\n",
      "Transform adds new columns based on old ones"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "transform(iris, sepal_ratio=iris.sepal_length / iris.sepal_width,\n",
      "                petal_ratio=iris.petal_length / iris.petal_width)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 17,
       "text": [
        "    sepal_length  sepal_width  petal_length  petal_width      species  \\\n",
        "0            5.1          3.5           1.4          0.2  Iris-setosa   \n",
        "1            4.9          3.0           1.4          0.2  Iris-setosa   \n",
        "2            4.7          3.2           1.3          0.2  Iris-setosa   \n",
        "3            4.6          3.1           1.5          0.2  Iris-setosa   \n",
        "4            5.0          3.6           1.4          0.2  Iris-setosa   \n",
        "5            5.4          3.9           1.7          0.4  Iris-setosa   \n",
        "6            4.6          3.4           1.4          0.3  Iris-setosa   \n",
        "7            5.0          3.4           1.5          0.2  Iris-setosa   \n",
        "8            4.4          2.9           1.4          0.2  Iris-setosa   \n",
        "9            4.9          3.1           1.5          0.1  Iris-setosa   \n",
        "10           5.4          3.7           1.5          0.2  Iris-setosa   \n",
        "\n",
        "    sepal_ratio  petal_ratio  \n",
        "0      1.457143     7.000000  \n",
        "1      1.633333     7.000000  \n",
        "2      1.468750     6.500000  \n",
        "3      1.483871     7.500000  \n",
        "4      1.388889     7.000000  \n",
        "5      1.384615     4.250000  \n",
        "6      1.352941     4.666667  \n",
        "7      1.470588     7.500000  \n",
        "8      1.517241     7.000000  \n",
        "9      1.580645    15.000000  \n",
        "..."
       ]
      }
     ],
     "prompt_number": 17
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### Split-apply-combine -- `by`\n",
      "\n",
      "Split-apply-combine queries, also known as Group-By, split the table into many groups and then do a reduction on each group.  We express these queries in blaze with the `by` operator\n",
      "\n",
      "    by(column-on-which-to-split, result_name=reduction_on_group())"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "How many measurements do we have per species?"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "by(iris.species, count=iris.species.count())"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 18,
       "text": [
        "           species  count\n",
        "0      Iris-setosa     50\n",
        "1  Iris-versicolor     50\n",
        "2   Iris-virginica     50"
       ]
      }
     ],
     "prompt_number": 18
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "How many measurements do we have per species and what is the longest petal length per species?"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "by(iris.species, count=iris.species.count(), \n",
      "                 longest_petal=iris.petal_length.max())"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 19,
       "text": [
        "           species  count  longest_petal\n",
        "0      Iris-setosa     50            1.9\n",
        "1  Iris-versicolor     50            5.1\n",
        "2   Iris-virginica     50            6.9"
       ]
      }
     ],
     "prompt_number": 19
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "#### Exercise\n",
      "\n",
      "Write queries to answer the following questions"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# What are the longest and shortest sepal_lengths per species?\n"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 20
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# What is the difference of longest to shortest sepal length per species\n"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 21
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### This is similar to how we solve these problems in Pandas\n",
      "\n",
      "So far, everything we've seen is similar to solving problems in Pandas"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import pandas as pd\n",
      "df = pd.read_csv('data/iris.csv')\n",
      "df.groupby(df.species).sepal_length.min()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 22,
       "text": [
        "species\n",
        "Iris-setosa        4.3\n",
        "Iris-versicolor    4.9\n",
        "Iris-virginica     4.9\n",
        "Name: sepal_length, dtype: float64"
       ]
      }
     ],
     "prompt_number": 22
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "In fact, for small CSV files like this, Blaze *uses Pandas*, so one might consider just using Pandas directly.\n",
      "\n",
      "Blaze becomes more useful when we interact with data stored in different systems like SQL databases in the next section."
     ]
    }
   ],
   "metadata": {}
  }
 ]
}