{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Dask Dataframe" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This notebook gives a quick demo of using `dask.dataframe`. It is not intended to be a full tutorial on using dask, or a full demonstration of its capabilities. For more information see the docs [here](http://dask.pydata.org/en/latest/), or the tutorial [here](https://github.com/dask/dask-tutorial)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Create some artificial data\n", "\n", "First we create some artificial data to work with. This creates a couple csvs in a directory, a common situation." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from utils import accounts_csvs\n", "accounts_csvs(3, 1000000, 500)\n", "\n", "import os, glob\n", "filenames = os.path.join('data', 'accounts.*.csv')\n", "print('Files created:\\n%s' % '\\n'.join(glob.glob(filenames)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Creating Dask Dataframes\n", "\n", "Dask dataframes can be created in many different methods. For more information, see the docs [here](http://dask.pydata.org/en/latest/dataframe-create.html).\n", "\n", "Here we'll be using `dd.read_csv`. This looks *almost* exactly like the `pandas.read_csv` function, with a few differences:\n", "\n", "- Can pass in a globstring (e.g. `accounts.*.csv`) instead of just a single filename\n", "- Takes in a few (optional) parameters to control the partitioning\n", "- Also works well with HDFS and S3" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import dask.dataframe as dd\n", "\n", "df = dd.read_csv(filenames)\n", "df" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "df.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "%time len(df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Dask Dataframes look like Pandas Dataframes" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Pretty print out the Dataframe methods\n", "import textwrap\n", "print('\\n'.join(textwrap.wrap(str([f for f in dir(df) if not f.startswith('_')]))))" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "df.dtypes" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "df.divisions" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "df.npartitions" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "df.visualize()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Example computations" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Mean amount" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "df.amount.mean().compute()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Mean amount per account" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "df.groupby(df.names).amount.mean().compute()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Divisions" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In Pandas, the index associates a value to each record/row of your data. Operations that align with the index, like loc can be a bit faster as a result.\n", "\n", "In dask.dataframe this index becomes even more important. A Dask DataFrame consists of several Pandas DataFrames. These dataframes are separated along the index by value. For example, when working with time series we may partition our large dataset by month.\n", "\n", "By partitioning our data semantically (e.g. by Month) rather than fixed sizes (as in `dask.array`), we can be more efficient in operations that select along the index. For example `loc` along a partitioned index will only need to look at the single partition that contains the requested data, as dask can infer which partition contains the value from the divisions. Without divisions, all partitions need to be inspected, as dask has no idea which partition contains the value.\n", "\n", "If the divisions are unknown, all the values in `.divisions` will be None." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "df.divisions" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "df.known_divisions" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "However if we set the index to some new column then dask will divide our data roughly evenly along that column and create new divisions for us. Warning, set_index triggers immediate computation." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "%time df2 = df.set_index('names')" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "df2.divisions" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "df2.known_divisions" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Operations like loc only need to load the relevant partitions" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "df2.loc['Edith'].amount.mean().compute()" ] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.11" } }, "nbformat": 4, "nbformat_minor": 0 }