{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Analyzing flight data with `dask.dataframe`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this notebook we do a brief demo of using `dask.dataframe` on the airline flight data from [here](http://stat-computing.org/dataexpo/2009/the-data.html). This contains data on 22 years of all flights inside the USA, including information such as origin and destination, delays, and cancellations. The data has been extracted and decompressed into a folder of csv files, totalling around 11 GB, and 121 million rows." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "121M\tairline_data/1987.csv\n", "478M\tairline_data/1988.csv\n", "464M\tairline_data/1989.csv\n", "486M\tairline_data/1990.csv\n", "468M\tairline_data/1991.csv\n", "470M\tairline_data/1992.csv\n", "468M\tairline_data/1993.csv\n", "478M\tairline_data/1994.csv\n", "506M\tairline_data/1995.csv\n", "509M\tairline_data/1996.csv\n", "515M\tairline_data/1997.csv\n", "513M\tairline_data/1998.csv\n", "527M\tairline_data/1999.csv\n", "544M\tairline_data/2000.csv\n", "573M\tairline_data/2001.csv\n", "506M\tairline_data/2002.csv\n", "598M\tairline_data/2003.csv\n", "639M\tairline_data/2004.csv\n", "640M\tairline_data/2005.csv\n", "641M\tairline_data/2006.csv\n", "670M\tairline_data/2007.csv\n", "657M\tairline_data/2008.csv\n", " 11G\ttotal\n" ] } ], "source": [ "%%bash\n", "du -hc airline_data/*.csv" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Open with dask.dataframe" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can load all the files into a `dask.dataframe` using a clone of the `read_csv` method from pandas. \n", "\n", "Note that dask determines the datatype of each column by reading the first 1000 rows of the first file. If a column has integers for 1000 rows, and then a `NaN`, it will error as the dtype was improperly guessed. As such, we need to pass in some specified datatypes for a few columns. We'll also rename the columns from the original `CamelCase` to `snake_case`, as the `snake_case` is more pythonic.\n", "\n", "``dask.dataframe.read_csv`` can take a glob of filenames, we'll only glob the first 3 years for now (the reason will become apparent later)." ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import dask.dataframe as dd" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": true }, "outputs": [], "source": [ "cols = ['year', 'month', 'day_of_month', 'day_of_week', 'deptime', 'crs_deptime', 'arrtime', \n", " 'crs_arrtime', 'unique_carrier', 'flight_num', 'tail_num', 'actual_elapsed_time',\n", " 'crs_elapsed_time', 'air_time', 'arrdelay', 'depdelay', 'origin', 'dest', 'distance', \n", " 'taxi_in', 'taxi_out', 'cancelled', 'cancellation_code', 'diverted', 'carrier_delay',\n", " 'weather_delay', 'nas_delay', 'security_delay', 'late_aircraft_delay']\n", "\n", "dtypes = {'cancellation_code': object, 'taxi_in': float, 'taxi_out': float, 'cancelled': bool,\n", " 'diverted': bool, 'carrier_delay': float, 'weather_delay': float, 'nas_delay': float,\n", " 'security_delay': float, 'late_aircraft_delay': float, 'tail_num': object,\n", " 'crs_deptime': float, 'crs_arrtime': float, 'flight_num': float, 'crs_elapsed_time': float,\n", " 'distance': float}\n", "\n", "df = dd.read_csv('airline_data/198*.csv', header=0, names=cols, dtype=dtypes)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We now have a `dask.dataframe.DataFrame` object. This looks a lot like a pandas `DataFrame`, and has many of the same methods:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", " | year | \n", "month | \n", "day_of_month | \n", "day_of_week | \n", "deptime | \n", "crs_deptime | \n", "arrtime | \n", "crs_arrtime | \n", "unique_carrier | \n", "flight_num | \n", "... | \n", "taxi_in | \n", "taxi_out | \n", "cancelled | \n", "cancellation_code | \n", "diverted | \n", "carrier_delay | \n", "weather_delay | \n", "nas_delay | \n", "security_delay | \n", "late_aircraft_delay | \n", "
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | \n", "1987 | \n", "10 | \n", "14 | \n", "3 | \n", "741 | \n", "730 | \n", "912 | \n", "849 | \n", "PS | \n", "1451 | \n", "... | \n", "NaN | \n", "NaN | \n", "False | \n", "NaN | \n", "False | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "
1 | \n", "1987 | \n", "10 | \n", "15 | \n", "4 | \n", "729 | \n", "730 | \n", "903 | \n", "849 | \n", "PS | \n", "1451 | \n", "... | \n", "NaN | \n", "NaN | \n", "False | \n", "NaN | \n", "False | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "
2 | \n", "1987 | \n", "10 | \n", "17 | \n", "6 | \n", "741 | \n", "730 | \n", "918 | \n", "849 | \n", "PS | \n", "1451 | \n", "... | \n", "NaN | \n", "NaN | \n", "False | \n", "NaN | \n", "False | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "
3 | \n", "1987 | \n", "10 | \n", "18 | \n", "7 | \n", "729 | \n", "730 | \n", "847 | \n", "849 | \n", "PS | \n", "1451 | \n", "... | \n", "NaN | \n", "NaN | \n", "False | \n", "NaN | \n", "False | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "
4 | \n", "1987 | \n", "10 | \n", "19 | \n", "1 | \n", "749 | \n", "730 | \n", "922 | \n", "849 | \n", "PS | \n", "1451 | \n", "... | \n", "NaN | \n", "NaN | \n", "False | \n", "NaN | \n", "False | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "
5 rows × 29 columns
\n", "