# Dask Dataframe

This notebook gives a quick demo of using `dask.dataframe`. It is not intended to be a full tutorial on using dask, or a full demonstration of its capabilities. For more information see the docs [here](http://dask.pydata.org/en/latest/), or the tutorial [here](https://github.com/dask/dask-tutorial).

### Create some artificial data

First we create some artificial data to work with. This writes several CSV files to a `data` directory, a common layout for real datasets.

In [None]:
from utils import accounts_csvs
accounts_csvs(3, 1000000, 500)

import os, glob
filenames = os.path.join('data', 'accounts.*.csv')
print('Files created:\n%s' % '\n'.join(glob.glob(filenames)))

## Creating Dask Dataframes

Dask dataframes can be created in many different ways. For more information, see the docs [here](http://dask.pydata.org/en/latest/dataframe-create.html).

Here we'll be using `dd.read_csv`. This looks *almost* exactly like the `pandas.read_csv` function, with a few differences:

- It accepts a globstring (e.g. `accounts.*.csv`) instead of just a single filename
- It takes a few optional parameters that control the partitioning
- It also works well with HDFS and S3

In [None]:
import dask.dataframe as dd

df = dd.read_csv(filenames)
df

In [None]:
df.head()

In [None]:
%time len(df)

## Dask Dataframes look like Pandas Dataframes

In [None]:
# Pretty print out the Dataframe methods
import textwrap
print('\n'.join(textwrap.wrap(str([f for f in dir(df) if not f.startswith('_')]))))

In [None]:
df.dtypes

In [None]:
df.divisions

In [None]:
df.npartitions

In [None]:
df.visualize()

## Example computations

### Mean amount

In [None]:
df.amount.mean().compute()

### Mean amount per account

In [None]:
df.groupby(df.names).amount.mean().compute()

## Divisions

In Pandas, the index associates a value to each record/row of your data. Operations that align with the index, like `loc`, can be a bit faster as a result.

In `dask.dataframe` this index becomes even more important. A Dask DataFrame consists of several Pandas DataFrames. These dataframes are separated along the index by value. For example, when working with time series we may partition our large dataset by month.

By partitioning our data semantically (e.g. by month) rather than by fixed sizes (as in `dask.array`), we can be more efficient in operations that select along the index. For example, `loc` along a partitioned index only needs to look at the single partition that contains the requested data, because dask can infer which partition contains the value from the divisions. Without divisions, all partitions need to be inspected, as dask has no idea which partition contains the value.
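The lookup described above can be sketched in plain Python. This is a simplified model, not dask's actual code: the helper `partition_for` and the example division values are made up for illustration. It assumes the divisions are sorted, that there is one more division than there are partitions, and that each partition covers the range from its division up to (but not including) the next one, with the last partition also including the final value.

```python
import bisect

def partition_for(divisions, key):
    """Return the index of the partition whose range contains `key`.

    Hypothetical helper: `divisions` is a sorted tuple with one more
    entry than there are partitions.
    """
    if divisions[0] is None:
        raise ValueError("divisions are unknown; every partition must be searched")
    npartitions = len(divisions) - 1
    # Binary search: count how many boundaries are <= key.
    i = bisect.bisect_right(divisions, key) - 1
    # Clamp so the final boundary value falls in the last partition.
    return min(npartitions - 1, max(0, i))

# Made-up divisions for three partitions of an index of names.
divisions = ("Alice", "Edith", "Norbert", "Zelda")

print(partition_for(divisions, "Bob"))     # 0
print(partition_for(divisions, "Edith"))   # 1
print(partition_for(divisions, "Zelda"))   # 2
```

Because the search only touches the small tuple of divisions, the cost of finding the right partition does not grow with the number of rows.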

If the divisions are unknown, all the values in `.divisions` will be `None`.

In [None]:
df.divisions

In [None]:
df.known_divisions

However, if we set the index to a new column, dask will divide our data roughly evenly along that column and create new divisions for us. Warning: `set_index` triggers immediate computation.
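One way to picture "roughly evenly" is to choose boundaries at evenly spaced positions in the sorted column. The toy sketch below does exactly that by sorting a small list in memory. The helper `choose_divisions` and the names are hypothetical, and this is not how dask is implemented; for large data dask is assumed to estimate such boundaries rather than fully sort the column.

```python
def choose_divisions(values, npartitions):
    """Pick npartitions + 1 boundaries that split the sorted values roughly evenly.

    Hypothetical helper for illustration only.
    """
    ordered = sorted(values)
    last = len(ordered) - 1
    # Evenly spaced positions from the first to the last sorted value.
    return tuple(ordered[(i * last) // npartitions] for i in range(npartitions + 1))

# Made-up column of names, in arbitrary order.
names = ["Edith", "Bob", "Alice", "Zelda", "Frank",
         "Norbert", "Oliver", "Ursula", "Charlie"]

toy_divisions = choose_divisions(names, 3)
print(toy_divisions)  # ('Alice', 'Charlie', 'Norbert', 'Zelda')
```

The first and last entries are the smallest and largest values, and the entries in between mark where one partition ends and the next begins.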

In [None]:
%time df2 = df.set_index('names')

In [None]:
df2.divisions

In [None]:
df2.known_divisions

Operations like `loc` only need to load the relevant partitions.

In [None]:
df2.loc['Edith'].amount.mean().compute()