{ "metadata": { "name": "", "signature": "sha256:ff6a8b6143d1b1597e06f9db6999659e88eb7089b846632db83ec8a8bc35521f" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "Data Science with Hadoop - predicting airline delays - part 1: PIG and Python" ] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Introduction" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With the rapid adoption of Apache Hadoop in the enterprise, machine learning is becoming a key technology used by enterprises to extract tangible business value from their massive data assets. This derivation of business value is possible because Apache Hadoop YARN as the architectural center of Modern Data Architecture (MDA) allows purpose-built data engines such as Apache Tez and Apache Spark to process and iterate over multiple datasets for data science techniques within the same cluster.\n", "\n", "It is a common misconception that the way we apply predictive learning algorithms like Linear Regression, Random Forest or Neural Networks to large datasets requires a dramatic change in approach, in tooling, or dedicated, siloed clusters. In fact, the big change is in what is known as \u201cfeature engineering\u201d \u2013 the process by which very big raw data is transformed into a \u201cfeature matrix\u201d. 
Enabled by Hadoop with YARN as an ideal platform, this transformation of large raw datasets (terabytes or petabytes) into a feature matrix is now scalable and not limited by the RAM or compute power of a single node.\n", "\n", "Since the output of the feature engineering step (the \"feature matrix\") tends to be relatively small in size (typically in the 2-20GB range), a common choice is to run the learning algorithm on a single machine (often with multiple cores and a large amount of RAM), allowing us to utilize a plethora of existing robust tools and algorithms from R packages, Python's Scikit-learn, or SAS.\n", "\n", "In this multi-part blog post we will demonstrate, via an example, a step-by-step solution to a supervised learning problem. Our focus will be to show how to solve this problem with various tools and libraries, and how these integrate with Hadoop. In part 1 we focus on [Apache PIG](http://pig.apache.org/), Python and [Scikit-learn](http://scikit-learn.org/stable/). Later on we will look at other alternatives such as [R](http://www.r-project.org/) or [Spark/ML-Lib](http://spark.apache.org/docs/1.1.0/mllib-guide.html)." ] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Pig and Python Can\u2019t Fly But Can Predict Flight Delays" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Every year approximately 20% of airline flights are delayed or cancelled, resulting in significant costs to both travellers and airlines. As our example use case, we will build a supervised learning model that predicts airline delay from historical flight data and weather information.\n", "\n", "\n", "\n", "Let's begin by exploring the airline delay dataset available here: http://stat-computing.org/dataexpo/2009/the-data.html\n", "This dataset includes details about flights in the US from the years 1987-2008. Every row in the dataset includes 29 variables:\n", "
\n", "| # | Name | Description |\n", "
|---|---|---|\n", "
| 1 | Year | 1987-2008 |\n", "
| 2 | Month | 1-12 |\n", "
| 3 | DayofMonth | 1-31 |\n", "
| 4 | DayOfWeek | 1 (Monday) - 7 (Sunday) |\n", "
| 5 | DepTime | actual departure time (local, hhmm) |\n", "
| 6 | CRSDepTime | scheduled departure time (local, hhmm) |\n", "
| 7 | ArrTime | actual arrival time (local, hhmm) |\n", "
| 8 | CRSArrTime | scheduled arrival time (local, hhmm) |\n", "
| 9 | UniqueCarrier | unique carrier code |\n", "
| 10 | FlightNum | flight number |\n", "
| 11 | TailNum | plane tail number |\n", "
| 12 | ActualElapsedTime | in minutes |\n", "
| 13 | CRSElapsedTime | in minutes |\n", "
| 14 | AirTime | in minutes |\n", "
| 15 | ArrDelay | arrival delay, in minutes |\n", "
| 16 | DepDelay | departure delay, in minutes |\n", "
| 17 | Origin | origin - IATA airport code |\n", "
| 18 | Dest | destination - IATA airport code |\n", "
| 19 | Distance | in miles |\n", "
| 20 | TaxiIn | taxi in time, in minutes |\n", "
| 21 | TaxiOut | taxi out time, in minutes |\n", "
| 22 | Cancelled | was the flight cancelled? |\n", "
| 23 | CancellationCode | reason for cancellation (A = carrier, B = weather, C = NAS, D = security) |\n", "
| 24 | Diverted | 1 = yes, 0 = no |\n", "
| 25 | CarrierDelay | in minutes |\n", "
| 26 | WeatherDelay | in minutes |\n", "
| 27 | NASDelay | in minutes |\n", "
| 28 | SecurityDelay | in minutes |\n", "
| 29 | LateAircraftDelay | in minutes |\n", "