{ "metadata": { "name": "", "signature": "sha256:ecd1889f30dd086fa191899741e3d2e3f75ef8a3fb15f463c88c0def7e579e3f" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\"Blaze\n", "\n", "# Getting started with `into`\n", "\n", "* Full tutorial available at http://github.com/ContinuumIO/blaze-tutorial\n", "* Install software with `conda install -c blaze blaze`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3 Design\n", "\n", "Into is a network of pair-wise conversions\n", "\n", "\"into\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Nodes represent data formats. Arrows represent functions that migrate data from one format to another. Red nodes are possibly larger than memory.\n", "\n", "This differs from most data migration systems, which always migrate data through a common format.\n", "\n", "![](images/star.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This common strategy is simpler in design and easier to get right, so why does `into` use a more complex design?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Some edges are very fast\n", "\n", "For example transforming an `np.ndarray` to a `pd.Series` or a `list` to `Iterator` only requires us to manipulate metadata which can be done quickly; the bytes of data remain untouched. In many cases transfering data through a common format can be much slower.\n", "\n", "For example consider `CSV -> SQL` migration. Using Python iterators as a common central format we're bound by SQLAlchemy's insertion code which maxes out at around 2000 records per second. Using CSV loaders native to the database (e.g. PostgreSQL Copy) we achieve more than 50000 records per second, turning an overnight task into 20 minutes.\n", "\n", "Efficient data migration *is intrinsically messy* in practice. The into graph reflects this mess." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### How does `into` use this graph?\n", "\n", "When you convert one dataset into another, Into walks the graph above and finds the minimum path. Each edge corresponds to a single function and that edge is weighted according to relative cost. E.g. if we transform a CSV file into a set we can see the stages through which our data moves." ] }, { "cell_type": "code", "collapsed": false, "input": [ "from into import into, convert, append, CSV\n", "path = convert.path(CSV, set)\n", "\n", "for source, target, func in path:\n", " print '%25s -> %-25s via %s()' % (source.__name__, target.__name__, func.__name__)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ " CSV -> Chunks_pandas_DataFrame via CSV_to_chunks_of_dataframes()\n", " Chunks_pandas_DataFrame -> Iterator via numpy_chunks_to_iterator()\n", " Iterator -> list via iterator_to_list()\n", " list -> set via iterable_to_set()\n" ] } ], "prompt_number": 1 }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Chunks of in-memory data\n", "\n", "The red nodes in our graph represent data that might be larger than memory. We we migrate between two red nodes we restrict ourselves to the subgraph of only red nodes so that we never blow up.\n", "\n", "Yet we still want to use NumPy and Pandas in these migrations (they're very helpful intermediaries) so we partition our data into a sequence of chunks such that each chunks fit in memory. We describe this data with parametrized types like `chunks(np.ndarray)` or `chunks(pd.DataFrame)`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### `into` = `convert` + `append`\n", "\n", "Recall the two modes of into" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# Given type: Convert source to new object\n", "into(list, (1, 2, 3))" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 2, "text": [ "[1, 2, 3]" ] } ], "prompt_number": 2 }, { "cell_type": "code", "collapsed": false, "input": [ "# Given object: Append source to that object\n", "L = [1, 2, 3]\n", "into(L, (4, 5, 6))\n", "L" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 3, "text": [ "[1, 2, 3, 4, 5, 6]" ] } ], "prompt_number": 3 }, { "cell_type": "markdown", "metadata": {}, "source": [ "These two modes are actually separate functions, `convert` and `append`. Into uses both depending on the situation. A single `into` call may engage both functions.\n", "\n", "You should use `into` by default, it does some other checks. For the sake of being explicit however, here are examples using `convert` and `append` directly." ] }, { "cell_type": "code", "collapsed": false, "input": [ "from into import convert, append\n", "\n", "convert(list, (1, 2, 3))" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 1, "text": [ "[1, 2, 3]" ] } ], "prompt_number": 1 }, { "cell_type": "code", "collapsed": false, "input": [ "L = [1, 2, 3]\n", "append(L, (4, 5, 6))" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 3, "text": [ "[1, 2, 3, 4, 5, 6]" ] } ], "prompt_number": 3 }, { "cell_type": "markdown", "metadata": {}, "source": [ "### How to extend `into`\n", "\n", "When we encounter new data formats we may wish to connect them to the `into` graph. We do this by implementing new versions of `discover`, `convert`, and `append` (if we support appending).\n", "\n", "We register new implementations of an operation like convert by creating a new Python function and decorating it with types and a cost.\n", "\n", "#### Example\n", "\n", "Here we define how to convert from a DataFrame to a NumPy array" ] }, { "cell_type": "code", "collapsed": false, "input": [ "import numpy as np\n", "import pandas as pd\n", "\n", "# target type, source data, cost\n", "@convert.register(np.ndarray, pd.DataFrame, cost=1.0)\n", "def dataframe_to_numpy(df, **kwargs):\n", " return df.to_records(index=False)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 4 } ], "metadata": {} } ] }