{ "metadata": { "name": "", "signature": "sha256:05ae0cb667432cc3cca0cfc359be4b95108d55951367bf9229e536a93c4eba5c" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\"Blaze\n", "\n", "# Getting started with `into`\n", "\n", "* Full tutorial available at http://github.com/ContinuumIO/blaze-tutorial\n", "* Install software with `conda install -c blaze blaze`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 0. Introduction\n", "\n", "## Describing Data\n", "\n", "We describe \"data\" with a few attributes\n", "\n", "1. Where that data lives (in a file on your laptop, in the cloud, etc..)\n", "2. What data-types we use (we represent names as `strings` and balances as `float32`s)\n", "3. What storage format we use (we store in CSV, in a PostgreSQL database, in JSON)\n", "4. The semantic values that those bits represent (Barack Obama, or the number 3)\n", "\n", "As analysts we care only about point 4, the values that our data represent. Points 1-3 are incidental to how we use computers; these points only get in the way of analysis.\n", "\n", "As computationalists though we care very deeply about points 1-3. The choice of format, location, and datatype *strongly* impact the efficiency and correctness of our computations. Good choices here can mean the difference between waiting overnight and using our data *interactively*.\n", "\n", "Unfortunately points 1-3 encompass a lot of complexity and change more quickly than most analysts care to manage.\n", "\n", "The `into` project alleviates the pain of dealing with the first three points by providing intuitive description and transfer between data formats and storage systems. This allows analysts to quickly reason about and migrate their data to efficient, correct, and resilient formats." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Motivating Example\n", "\n", "Before we start small with the tutorial we give a more comprehensive example.\n", "\n", "We have a small CSV file holding the iris data" ] }, { "cell_type": "code", "collapsed": false, "input": [ "!head -5 data/iris.csv" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "sepal_length,sepal_width,petal_length,petal_width,species\r\n", "5.1,3.5,1.4,0.2,Iris-setosa\r\n", "4.9,3.0,1.4,0.2,Iris-setosa\r\n", "4.7,3.2,1.3,0.2,Iris-setosa\r\n", "4.6,3.1,1.5,0.2,Iris-setosa\r\n" ] } ], "prompt_number": 1 }, { "cell_type": "markdown", "metadata": {}, "source": [ "We put this data into a list, a NumPy array, and a SQLite database. We move data to three very different technologies with the same abstraction." ] }, { "cell_type": "code", "collapsed": false, "input": [ "from into import into\n", "into(list, 'data/iris.csv')[:5]" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 2, "text": [ "[(5.1, 3.5, 1.4, 0.2, u'Iris-setosa'),\n", " (4.9, 3.0, 1.4, 0.2, u'Iris-setosa'),\n", " (4.7, 3.2, 1.3, 0.2, u'Iris-setosa'),\n", " (4.6, 3.1, 1.5, 0.2, u'Iris-setosa'),\n", " (5.0, 3.6, 1.4, 0.2, u'Iris-setosa')]" ] } ], "prompt_number": 2 }, { "cell_type": "code", "collapsed": false, "input": [ "import numpy as np\n", "into(np.ndarray, 'data/iris.csv')[:5]" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 3, "text": [ "rec.array([(5.1, 3.5, 1.4, 0.2, u'Iris-setosa'),\n", " (4.9, 3.0, 1.4, 0.2, u'Iris-setosa'),\n", " (4.7, 3.2, 1.3, 0.2, u'Iris-setosa'),\n", " (4.6, 3.1, 1.5, 0.2, u'Iris-setosa'),\n", " (5.0, 3.6, 1.4, 0.2, u'Iris-setosa')], \n", " dtype=[('sepal_length', ', nullable=False), Column('sepal_width', FLOAT(), table=, nullable=False), Column('petal_length', FLOAT(), table=, nullable=False), Column('petal_width', FLOAT(), table=, nullable=False), Column('species', TEXT(), table=, nullable=False), schema=None)" ] } ], "prompt_number": 4 }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Outline\n", "\n", "Into moves data between formats intuitively.\n", "\n", "We structure this tutorial as follows:\n", "\n", "1. **Basic How-to**: on how to use this library effectively\n", "2. **Datatypes**: to enhance performance\n", "3. **Internals**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`into` is available on conda\n", "\n", " conda install into\n", " or \n", " conda install into -c blaze # Up-to-date version\n", " \n", "or on PyPI\n", "\n", " pip install into\n", " or \n", " pip install git+http://github.com/ContinuumIO/into.git # Up-to-date version\n", " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 1. Basic Usage" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1.1 `into`" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from into import into" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 5 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Into takes two arguments, a target and a source\n", "\n", " into(target, source)\n", " \n", "And it turns the source into something like the target" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### The `target` can be a type\n", "\n", "In which case it makes a new object of that type" ] }, { "cell_type": "code", "collapsed": false, "input": [ "import numpy as np\n", "into(np.ndarray, [1, 2, 3])" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 6, "text": [ "array([1, 2, 3])" ] } ], "prompt_number": 6 }, { "cell_type": "code", "collapsed": false, "input": [ "into(set, [1, 2, 3])" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 7, "text": [ "{1, 2, 3}" ] } ], "prompt_number": 7 }, { "cell_type": "code", "collapsed": false, "input": [ "import pandas as pd\n", "into(pd.Series, (10, 20, 30))" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 8, "text": [ "0 10\n", "1 20\n", "2 30\n", "dtype: int64" ] } ], "prompt_number": 8 }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Exercise\n", "\n", "Use into to turn the following DataFrame into an `np.ndarray` and a `list`" ] }, { "cell_type": "code", "collapsed": false, "input": [ "df = pd.DataFrame([['Alice', 100],\n", " ['Bob', 200],\n", " ['Charlie', 300]], columns=['name', 'balance'])" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 9 }, { "cell_type": "code", "collapsed": false, "input": [ "# Turn df into an np.ndarray\n", "# into(..., ...)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 10 }, { "cell_type": "code", "collapsed": false, "input": [ "# Turn df into a list\n", "# into(..., ...)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 11 }, { "cell_type": "code", "collapsed": false, "input": [], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 11 }, { "cell_type": "markdown", "metadata": {}, "source": [ "### The `target` can be an object\n", "\n", "In which we append the source to that object." ] }, { "cell_type": "code", "collapsed": false, "input": [ "target = []\n", "into(target, (1, 2, 3))\n", "into(target, (1, 2, 3))\n", "into(target, (1, 2, 3))\n", "target" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 12, "text": [ "[1, 2, 3, 1, 2, 3, 1, 2, 3]" ] } ], "prompt_number": 12 }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Exercise\n", "\n", "Use `into` to make a set holding all the data in the following list of DataFrames." ] }, { "cell_type": "code", "collapsed": false, "input": [ "L = [pd.DataFrame({'name': ['Alice', 'Bob'], 'balance': [100, 200]}),\n", " pd.DataFrame({'name': ['Charlie', 'Dan'], 'balance': [300, 400]}),\n", " pd.DataFrame({'name': ['Edith', 'Frank'], 'balance': [500, 600]})]" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 13 }, { "cell_type": "code", "collapsed": false, "input": [ "s = set()\n", "# Use into and some kind of for loop to put all of the data in L into the set s\n" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 14 }, { "cell_type": "code", "collapsed": false, "input": [], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 14 }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Exercise\n", "\n", "Repeat the last exercise but append all of the data onto a `tuple`. What do you expect to happen?" ] }, { "cell_type": "code", "collapsed": false, "input": [ "t = tuple()\n" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 15 }, { "cell_type": "code", "collapsed": false, "input": [], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 15 }, { "cell_type": "markdown", "metadata": {}, "source": [ "It's important to know that `into` has limitations. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### The target can be a string\n", "\n", "Many data sources external to Python (like a CSV file) don't have a Python object that we can put as the source or target. In these cases we use string URIs. Examples of strings include\n", "\n", " myfile.csv\n", " myfile.json\n", " myfile.hdf5\n", " myfile.hdf5::/data\n", " sqlite:///myfile.db::table-name\n", " postgresql:///user:password@host:port::table-name\n", " ...\n", " \n", "These can go either in the source or target inputs.\n", "\n", "Here we write our dataframe to a CSV file" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# Write DataFrame to CSV file\n", "into('accounts.csv', df)" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 16, "text": [ "" ] } ], "prompt_number": 16 }, { "cell_type": "code", "collapsed": false, "input": [ "# print out text in accounts.csv\n", "!head accounts.csv" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "name,balance\r\n", "Alice,100\r\n", "Bob,200\r\n", "Charlie,300\r\n" ] } ], "prompt_number": 17 }, { "cell_type": "code", "collapsed": false, "input": [ "# Read CSV file into memory as list\n", "into(list, 'accounts.csv')" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 18, "text": [ "[(u'Alice', 100), (u'Bob', 200), (u'Charlie', 300)]" ] } ], "prompt_number": 18 }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Exercise\n", "\n", "Read the contents of the file `'data/iris.csv'` into a `pd.DataFrame`. Then write that dataframe to `'data/iris.json'`. Inspect the json data to ensure that it came out correctly." ] }, { "cell_type": "code", "collapsed": false, "input": [], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 18 }, { "cell_type": "code", "collapsed": false, "input": [], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 18 }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Exercise\n", "\n", "Write the contents of your json file to a SQLite database using the following URI as target\n", "\n", " `sqlite:///data/my.db::iris'\n", " \n", "Then read data from that SQLite database into Python to make sure that it arrived safely." ] }, { "cell_type": "code", "collapsed": false, "input": [], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 18 }, { "cell_type": "code", "collapsed": false, "input": [], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 18 }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1.2 `resource`\n", "\n", "Much interesting data lives *outside* of Python. As we just saw, we often use URIs to specify this kind of data.\n", "\n", "Here we load a bit of a SQL database on baseball statistics ([download here](https://github.com/jknecht/baseball-archive-sqlite/raw/master/lahman2013.sqlite)) into memory as a list" ] }, { "cell_type": "code", "collapsed": false, "input": [ "into(list, 'sqlite:///data/lahman2013.sqlite::BattingPost')[:5]" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 19, "text": [ "[(1884, u'WS', u'becanbu01', u'NY4', u'AA', 1, 2, 0, 1, 0, 0, 0, 0, 0, None, 0, 0, 0, None, None, None, None),\n", " (1884, u'WS', u'bradyst01', u'NY4', u'AA', 3, 10, 1, 0, 0, 0, 0, 0, 0, None, 0, 1, 0, None, None, None, None),\n", " (1884, u'WS', u'carrocl01', u'PRO', u'NL', 3, 10, 2, 1, 0, 0, 0, 1, 0, None, 1, 1, 0, None, None, None, None),\n", " (1884, u'WS', u'dennyje01', u'PRO', u'NL', 3, 9, 3, 4, 0, 1, 1, 2, 0, None, 0, 3, 0, None, None, None, None),\n", " (1884, u'WS', u'esterdu01', u'NY4', u'AA', 3, 10, 0, 3, 1, 0, 0, 0, 1, None, 0, 3, 0, None, None, None, None)]" ] } ], "prompt_number": 19 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we learn how these strings work so that we can specify many types of external data.\n", "\n", "Internally `into` uses the function `resource` to turn a string into a Python proxy object. Usually these objects don't hold the data themselves. They just serve as useful pointers to where the data lives. In most cases we use other Python projects for proxy objects.\n", "\n", "In the case of SQL tables, resource returns a `sqlalchemy.Table` object." ] }, { "cell_type": "code", "collapsed": false, "input": [ "from into import resource\n", "t = resource('sqlite:///data/lahman2013.sqlite::BattingPost')" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 20 }, { "cell_type": "code", "collapsed": false, "input": [ "type(t)" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 21, "text": [ "sqlalchemy.sql.schema.Table" ] } ], "prompt_number": 21 }, { "cell_type": "code", "collapsed": false, "input": [ "t" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 22, "text": [ "Table('BattingPost', MetaData(bind=Engine(sqlite:///data/lahman2013.sqlite)), Column('yearID', INTEGER(), table=), Column('round', TEXT(), table=), Column('playerID', TEXT(), table=), Column('teamID', TEXT(), table=), Column('lgID', TEXT(), table=), Column('G', INTEGER(), table=), Column('AB', INTEGER(), table=), Column('R', INTEGER(), table=), Column('H', INTEGER(), table=), Column('2B', INTEGER(), table=), Column('3B', INTEGER(), table=), Column('HR', INTEGER(), table=), Column('RBI', INTEGER(), table=), Column('SB', INTEGER(), table=), Column('CS', INTEGER(), table=), Column('BB', INTEGER(), table=), Column('SO', INTEGER(), table=), Column('IBB', INTEGER(), table=), Column('HBP', INTEGER(), table=), Column('SH', INTEGER(), table=), Column('SF', INTEGER(), table=), Column('GIDP', INTEGER(), table=), schema=None)" ] } ], "prompt_number": 22 }, { "cell_type": "markdown", "metadata": {}, "source": [ "We use *this* object as the `into` source." ] }, { "cell_type": "code", "collapsed": false, "input": [ "into(list, t)[:5]" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 23, "text": [ "[(1884, u'WS', u'becanbu01', u'NY4', u'AA', 1, 2, 0, 1, 0, 0, 0, 0, 0, None, 0, 0, 0, None, None, None, None),\n", " (1884, u'WS', u'bradyst01', u'NY4', u'AA', 3, 10, 1, 0, 0, 0, 0, 0, 0, None, 0, 1, 0, None, None, None, None),\n", " (1884, u'WS', u'carrocl01', u'PRO', u'NL', 3, 10, 2, 1, 0, 0, 0, 1, 0, None, 1, 1, 0, None, None, None, None),\n", " (1884, u'WS', u'dennyje01', u'PRO', u'NL', 3, 9, 3, 4, 0, 1, 1, 2, 0, None, 0, 3, 0, None, None, None, None),\n", " (1884, u'WS', u'esterdu01', u'NY4', u'AA', 3, 10, 0, 3, 1, 0, 0, 0, 1, None, 0, 3, 0, None, None, None, None)]" ] } ], "prompt_number": 23 }, { "cell_type": "markdown", "metadata": {}, "source": [ "So when you write\n", "\n", "```python\n", "into(list, 'some-string')\n", "```\n", "\n", "It is actually just shorthand for\n", "\n", "```python\n", "into(list, resource('some-string'))\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Exercise\n", "\n", "We have some data in the `data/` directory. Use `resource` on each of the following strings to see what it returns.\n", "\n", " sqlite:///data/lahman2013.sqlite::BattingPost\n", " sqlite:///data/lahman2013.sqlite::Salaries\n", " sqlite:///data/lahman2013.sqlite\n", " data/sample.hdf5::/points\n", " data/sample.hdf5\n", " data/iris.csv" ] }, { "cell_type": "code", "collapsed": false, "input": [], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### The `::` separator\n", "\n", "In the last exercise we saw the following URI for a table in a SQLite database\n", "\n", " sqlite:///data/my.db::iris\n", " \n", "We deconstruct this URI to make it more clear. First we separate the URI on `::` to separate out the database from the table name\n", "\n", " Database: sqlite:///data/my.db\n", " Table name: iris\n", " \n", "We use the `::` separator whenever datasets live within some nested structure like a database or HDF5 file." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Specifying protocols with `://`\n", "\n", "The database string `sqlite:///data/my.db` is specific to SQLAlchemy, but follows a common format, notably\n", "\n", " Protocol: sqlite://\n", " Filename: data/my.db\n", " \n", "Into also uses protocols in many cases to give extra hints on how to handle your data. For example Python has a few different libraries to handle HDF5 files (`h5py`, `pytables`, `pandas.HDFStore`). By default when we see a URI like `myfile.hdf5` we currently use `h5py`. To override this behavior you can specify a protocol string like `hdfstore://myfile.hdf5` to specify that you want to use the special `pandas.HDFStore` format.\n", "\n", "*Note:* SQLAlchemy strings are a little odd in that they use three slashes by default (e.g. `sqlite:///my.db`) and *four* slashes when using absolute paths (e.g. `sqlite:////Users/Alice/data/my.db`)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise\n", "\n", "People use the `.json` extension in two ways. \n", "\n", "1. The entire file is one JSON blob, often a list of dictionaries. (Traditional JSON)\n", "2. Each line of the file is one JSON blob (Line-delimited JSON or JSON-lines.)\n", "\n", "Parsers have a hard time figuring out which case is which. When reading an existing file `into` can usually figure out if the file is line-delimited or not. When createing a file however we don't know what your intention is. You can specify your intention by adding either a `json://` or `jsonlines://` protocol." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here we write our dataframe to a JSON file." ] }, { "cell_type": "code", "collapsed": true, "input": [ "into('accounts.json', df)\n", "!head accounts.json" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "[{\"balance\": 100, \"name\": \"Alice\"}, {\"balance\": 200, \"name\": \"Bob\"}, {\"balance\": 300, \"name\": \"Charlie\"}]" ] } ], "prompt_number": 24 }, { "cell_type": "code", "collapsed": false, "input": [ "!rm accounts.json # Remove old file" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 25 }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is the traditional single-JSON-blob-per-file format. \n", "\n", "Instead write our DataFrame in the line-delimited format by adding a `jsonlines://` protocol to the target string. Inspect the result to make sure that each line is a separate valid JSON blob." ] }, { "cell_type": "code", "collapsed": false, "input": [], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [], "language": "python", "metadata": {}, "outputs": [] } ], "metadata": {} } ] }