{ "metadata": { "name": "", "signature": "sha256:d825a277965c84a256612bb6ff63f606b5bba263c90ac5407506f08370ba339b" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\"Blaze\n", "\n", "# Getting Started with Blaze\n", "\n", "* Full tutorial available at http://github.com/ContinuumIO/blaze-tutorial\n", "* Install software with `conda install -c blaze blaze`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Basic Queries\n", "\n", "For basic tabular queries, Blaze shares the same syntax as Pandas." ] }, { "cell_type": "code", "collapsed": false, "input": [ "from blaze import Data, by, join, transform" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 1 }, { "cell_type": "code", "collapsed": false, "input": [ "bank = Data([[1, 'Alice', 100],\n", " [2, 'Bob', -200],\n", " [3, 'Charlie', 300],\n", " [4, 'Dennis', 400],\n", " [5, 'Edith', -500]], columns=['id', 'name', 'amount'])" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 2 }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Arithmetic and Reductions" ] }, { "cell_type": "code", "collapsed": false, "input": [ "bank.amount" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 3, "text": [ " amount\n", "0 100\n", "1 -200\n", "2 300\n", "3 400\n", "4 -500" ] } ], "prompt_number": 3 }, { "cell_type": "code", "collapsed": false, "input": [ "bank.amount / 100" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 4, "text": [ " amount\n", "0 1\n", "1 -2\n", "2 3\n", "3 4\n", "4 -5" ] } ], "prompt_number": 4 }, { "cell_type": "code", "collapsed": false, "input": [ "(bank.amount / 100).mean()" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 5, "text": [ "0.2" ] } ], "prompt_number": 5 }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Multiple columns and sorting" ] }, { "cell_type": "code", "collapsed": false, "input": [ "bank[['name', 'amount']].sort('amount')" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Selections\n", "\n", "We select subsets of data by indexing one expression with another" ] }, { "cell_type": "code", "collapsed": false, "input": [ "bank[bank.amount < 0]" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 6, "text": [ " id name amount\n", "0 2 Bob -200\n", "1 5 Edith -500" ] } ], "prompt_number": 6 }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Combining Operations\n", "\n", "We can combine these sorts of operations with each other" ] }, { "cell_type": "code", "collapsed": false, "input": [ "bank[bank.amount < 0].amount / 100" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 7, "text": [ " amount\n", "0 -2\n", "1 -5" ] } ], "prompt_number": 7 }, { "cell_type": "code", "collapsed": false, "input": [ "bank[bank.amount < 0].name" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 8, "text": [ " name\n", "0 Bob\n", "1 Edith" ] } ], "prompt_number": 8 }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Exercises\n", "\n", "Write expressions to answer the following questions" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# What are the IDs of everyone with a positive amount?\n" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 9 }, { "cell_type": "code", "collapsed": false, "input": [ "# What is the name of the person with amount 400?\n" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 10 }, { "cell_type": "code", "collapsed": false, "input": [ "# What is the difference between the minimum and maximum amounts?\n" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 11 }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. More complex queries\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, we need a more interesting dataset. We open the standard *iris* dataset, a table of 150 measurements of flowers in the iris genus. We find this dataset in a CSV file in the `data/` directory. " ] }, { "cell_type": "code", "collapsed": false, "input": [ "iris = Data('data/iris.csv')\n", "iris" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
sepal_lengthsepal_widthpetal_lengthpetal_widthspecies
0 5.1 3.5 1.4 0.2 Iris-setosa
1 4.9 3.0 1.4 0.2 Iris-setosa
2 4.7 3.2 1.3 0.2 Iris-setosa
3 4.6 3.1 1.5 0.2 Iris-setosa
4 5.0 3.6 1.4 0.2 Iris-setosa
5 5.4 3.9 1.7 0.4 Iris-setosa
6 4.6 3.4 1.4 0.3 Iris-setosa
7 5.0 3.4 1.5 0.2 Iris-setosa
8 4.4 2.9 1.4 0.2 Iris-setosa
9 4.9 3.1 1.5 0.1 Iris-setosa
10 5.4 3.7 1.5 0.2 Iris-setosa
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 12, "text": [ " sepal_length sepal_width petal_length petal_width species\n", "0 5.1 3.5 1.4 0.2 Iris-setosa\n", "1 4.9 3.0 1.4 0.2 Iris-setosa\n", "2 4.7 3.2 1.3 0.2 Iris-setosa\n", "3 4.6 3.1 1.5 0.2 Iris-setosa\n", "4 5.0 3.6 1.4 0.2 Iris-setosa\n", "5 5.4 3.9 1.7 0.4 Iris-setosa\n", "6 4.6 3.4 1.4 0.3 Iris-setosa\n", "7 5.0 3.4 1.5 0.2 Iris-setosa\n", "8 4.4 2.9 1.4 0.2 Iris-setosa\n", "9 4.9 3.1 1.5 0.1 Iris-setosa\n", "..." ] } ], "prompt_number": 12 }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `blaze.Data` function operates on all of the file types that we saw in the previous sections on `into`. Blaze expressions use functions like `discover` to get datashapes that help them interact with you." ] }, { "cell_type": "code", "collapsed": false, "input": [ "iris.dshape" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 13, "text": [ "dshape(\"\"\"var * {\n", " sepal_length: float64,\n", " sepal_width: float64,\n", " petal_length: float64,\n", " petal_width: float64,\n", " species: string\n", " }\"\"\")" ] } ], "prompt_number": 13 }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Distinct\n", "\n", "Now some more queries. Distinct finds unique entries" ] }, { "cell_type": "code", "collapsed": false, "input": [ "iris.species.distinct()" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 14, "text": [ " species\n", "0 Iris-setosa\n", "1 Iris-versicolor\n", "2 Iris-virginica" ] } ], "prompt_number": 14 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Or count the number of distinct entries" ] }, { "cell_type": "code", "collapsed": false, "input": [ "iris.species.nunique()" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 15, "text": [ "3" ] } ], "prompt_number": 15 }, { "cell_type": "code", "collapsed": false, "input": [ "iris.sepal_length.nunique()" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 16, "text": [ "35" ] } ], "prompt_number": 16 }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Transform\n", "\n", "Transform adds new columns based on old ones" ] }, { "cell_type": "code", "collapsed": false, "input": [ "transform(iris, sepal_ratio=iris.sepal_length / iris.sepal_width,\n", " petal_ratio=iris.petal_length / iris.petal_width)" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 17, "text": [ " sepal_length sepal_width petal_length petal_width species \\\n", "0 5.1 3.5 1.4 0.2 Iris-setosa \n", "1 4.9 3.0 1.4 0.2 Iris-setosa \n", "2 4.7 3.2 1.3 0.2 Iris-setosa \n", "3 4.6 3.1 1.5 0.2 Iris-setosa \n", "4 5.0 3.6 1.4 0.2 Iris-setosa \n", "5 5.4 3.9 1.7 0.4 Iris-setosa \n", "6 4.6 3.4 1.4 0.3 Iris-setosa \n", "7 5.0 3.4 1.5 0.2 Iris-setosa \n", "8 4.4 2.9 1.4 0.2 Iris-setosa \n", "9 4.9 3.1 1.5 0.1 Iris-setosa \n", "10 5.4 3.7 1.5 0.2 Iris-setosa \n", "\n", " sepal_ratio petal_ratio \n", "0 1.457143 7.000000 \n", "1 1.633333 7.000000 \n", "2 1.468750 6.500000 \n", "3 1.483871 7.500000 \n", "4 1.388889 7.000000 \n", "5 1.384615 4.250000 \n", "6 1.352941 4.666667 \n", "7 1.470588 7.500000 \n", "8 1.517241 7.000000 \n", "9 1.580645 15.000000 \n", "..." ] } ], "prompt_number": 17 }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Split-apply-combine -- `by`\n", "\n", "Split-apply-combine queries, also known as Group-By, split the table into many groups and then do a reduction on each group. We express these queries in blaze with the `by` operator\n", "\n", " by(column-on-which-to-split, result_name=reduction_on_group())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "How many measurements do we have per species?" ] }, { "cell_type": "code", "collapsed": false, "input": [ "by(iris.species, count=iris.species.count())" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 18, "text": [ " species count\n", "0 Iris-setosa 50\n", "1 Iris-versicolor 50\n", "2 Iris-virginica 50" ] } ], "prompt_number": 18 }, { "cell_type": "markdown", "metadata": {}, "source": [ "How many measurements do we have per species and what is the longest petal length per species?" ] }, { "cell_type": "code", "collapsed": false, "input": [ "by(iris.species, count=iris.species.count(), \n", " longest_petal=iris.petal_length.max())" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 19, "text": [ " species count longest_petal\n", "0 Iris-setosa 50 1.9\n", "1 Iris-versicolor 50 5.1\n", "2 Iris-virginica 50 6.9" ] } ], "prompt_number": 19 }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Exercise\n", "\n", "Write queries to answer the following questions" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# What are the longest and shortest sepal_lengths per species?\n" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 20 }, { "cell_type": "code", "collapsed": false, "input": [ "# What is the difference of longest to shortest sepal length per species\n" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 21 }, { "cell_type": "markdown", "metadata": {}, "source": [ "### This is similar to how we solve these problems in Pandas\n", "\n", "So far, everything we've seen is similar to solving problems in Pandas" ] }, { "cell_type": "code", "collapsed": false, "input": [ "import pandas as pd\n", "df = pd.read_csv('data/iris.csv')\n", "df.groupby(df.species).sepal_length.min()" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 22, "text": [ "species\n", "Iris-setosa 4.3\n", "Iris-versicolor 4.9\n", "Iris-virginica 4.9\n", "Name: sepal_length, dtype: float64" ] } ], "prompt_number": 22 }, { "cell_type": "markdown", "metadata": {}, "source": [ "In fact, for small CSV files like this, Blaze *uses Pandas*, so one might consider just using Pandas directly.\n", "\n", "Blaze becomes more useful when we interact with data stored in different systems like SQL databases in the next section." ] } ], "metadata": {} } ] }