{
 "metadata": {
  "name": "",
  "signature": "sha256:ff6a8b6143d1b1597e06f9db6999659e88eb7089b846632db83ec8a8bc35521f"
 },
 "nbformat": 3,
 "nbformat_minor": 0,
 "worksheets": [
  {
   "cells": [
    {
     "cell_type": "heading",
     "level": 1,
     "metadata": {},
     "source": [
      "Data Science with Hadoop - predicting airline delays - part 1: PIG and Python"
     ]
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "Introduction"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "With the rapid adoption of Apache Hadoop in the enterprise, machine learning is becoming a key technology for extracting tangible business value from massive data assets. This derivation of business value is possible because Apache Hadoop YARN, as the architectural center of the Modern Data Architecture (MDA), allows purpose-built data engines such as Apache Tez and Apache Spark to process and iterate over multiple datasets for data science techniques within the same cluster.\n",
      "\n",
      "It is a common misconception that applying predictive learning algorithms like Linear Regression, Random Forest or Neural Networks to large datasets requires a dramatic change in approach or tooling, or dedicated, siloed clusters. In fact, the big change is in what is known as \u201cfeature engineering\u201d \u2013 the process by which very large raw data is transformed into a \u201cfeature matrix\u201d. With Hadoop and YARN as the platform, this transformation of large raw datasets (terabytes or petabytes) into a feature matrix is now scalable and not limited by the RAM or compute power of a single node.\n",
      "\n",
      "Since the output of the feature engineering step (the \"feature matrix\") tends to be relatively small (typically in the 2-20 GB range), a common choice is to run the learning algorithm on a single machine (often with multiple cores and a large amount of RAM), allowing us to use a plethora of existing robust tools and algorithms from R packages, Python's Scikit-learn, or SAS.\n",
      "\n",
      "In this multi-part blog post we demonstrate, via an example, a step-by-step solution to a supervised learning problem. Our focus is to show how to solve this problem with various tools and libraries, and how these integrate with Hadoop. In part 1 we focus on [Apache Pig](http://pig.apache.org/), Python and [Scikit-learn](http://scikit-learn.org/stable/). Later on we will look at other alternatives such as [R](http://www.r-project.org/) or [Spark/MLlib](http://spark.apache.org/docs/1.1.0/mllib-guide.html)."
     ]
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "Pig and Python Can\u2019t Fly But Can Predict Flight Delays"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Every year approximately 20% of airline flights are delayed or cancelled, resulting in significant costs to both travellers and airlines. As our example use case, we will build a supervised learning model that predicts airline delays from historical flight data and weather information.\n",
      "\n",
      "<img src=http://venturevillage.eu/wp-content/uploads/2013/02/Flight-delays.jpeg>\n",
      "\n",
      "Let's begin by exploring the [airline delay dataset](http://stat-computing.org/dataexpo/2009/the-data.html).\n",
      "This dataset includes details about flights in the US from the years 1987-2008. Every row in the dataset includes 29 variables:\n",
      "<table width=\"100%\">\n",
      "<tr>\n",
      "  <th></th>\n",
      "  <th>Name</th>\n",
      "  <th>Description</th>\n",
      "</tr>\n",
      "<tr>\n",
      " <td>1  </td><td> Year              </td><td>1987-2008</td>\n",
      "</tr><tr>\n",
      " <td>2  </td><td> Month             </td><td>1-12</td>\n",
      "</tr><tr>\n",
      " <td>3  </td><td> DayofMonth        </td><td>1-31</td>\n",
      "</tr><tr>\n",
      " <td>4  </td><td> DayOfWeek         </td><td>1 (Monday) - 7 (Sunday)</td>\n",
      "</tr><tr>\n",
      " <td>5  </td><td> DepTime           </td><td>actual departure time (local, hhmm)</td>\n",
      "</tr><tr>\n",
      " <td>6  </td><td> CRSDepTime        </td><td>scheduled departure time (local, hhmm)</td>\n",
      "</tr><tr>\n",
      " <td>7  </td><td> ArrTime           </td><td>actual arrival time (local, hhmm)</td>\n",
      "</tr><tr>\n",
      " <td>8  </td><td> CRSArrTime        </td><td>scheduled arrival time (local, hhmm)</td>\n",
      "</tr><tr>\n",
      " <td>9  </td><td> UniqueCarrier     </td><td>unique carrier code</td>\n",
      "</tr><tr>\n",
      " <td>10 </td><td> FlightNum         </td><td>flight number</td>\n",
      "</tr><tr>\n",
      " <td>11 </td><td> TailNum           </td><td>plane tail number</td>\n",
      "</tr><tr>\n",
      " <td>12 </td><td> ActualElapsedTime </td><td>in minutes</td>\n",
      "</tr><tr>\n",
      " <td>13 </td><td> CRSElapsedTime    </td><td>in minutes</td>\n",
      "</tr><tr>\n",
      " <td>14 </td><td> AirTime           </td><td>in minutes</td>\n",
      "</tr><tr>\n",
      " <td>15 </td><td> ArrDelay          </td><td>arrival delay, in minutes</td>\n",
      "</tr><tr>\n",
      " <td>16 </td><td> DepDelay          </td><td>departure delay, in minutes</td>\n",
      "</tr><tr>\n",
      " <td>17 </td><td> Origin            </td><td>origin - IATA airport code</td>\n",
      "</tr><tr>\n",
      " <td>18 </td><td> Dest              </td><td>destination - IATA airport code</td>\n",
      "</tr><tr>\n",
      " <td>19 </td><td> Distance          </td><td>in miles</td>\n",
      "</tr><tr>\n",
      " <td>20 </td><td> TaxiIn            </td><td>taxi in time, in minutes</td>\n",
      "</tr><tr>\n",
      " <td>21 </td><td> TaxiOut           </td><td>taxi out time in minutes</td>\n",
      "</tr><tr>\n",
      " <td>22 </td><td> Cancelled           </td><td>was the flight cancelled?</td>\n",
      "</tr><tr>\n",
      " <td>23 </td><td> CancellationCode  </td><td>reason for cancellation (A = carrier, B = weather, C = NAS, D = security)</td>\n",
      "</tr><tr>\n",
      " <td>24 </td><td> Diverted          </td><td>1 = yes, 0 = no</td>\n",
      "</tr><tr>\n",
      " <td>25 </td><td> CarrierDelay      </td><td>in minutes</td>\n",
      "</tr><tr>\n",
      " <td>26 </td><td> WeatherDelay      </td><td>in minutes</td>\n",
      "</tr><tr>\n",
      " <td>27 </td><td> NASDelay          </td><td>in minutes</td>\n",
      "</tr><tr>\n",
      " <td>28 </td><td> SecurityDelay     </td><td>in minutes</td>\n",
      "</tr><tr>\n",
      " <td>29 </td><td> LateAircraftDelay </td><td>in minutes</td>\n",
      "</tr>\n",
      "</table>\n",
      "\n",
      "To simplify, we will build a supervised learning model to predict flight delays for flights leaving O'Hare International airport (ORD), where we \"learn\" the model using data from 2007, and evaluate its performance using data from 2008.\n",
      "\n",
      "But first, let's do some exploration of this dataset.\n",
      "\n",
      "We start by importing some useful python libraries that we will need later like Pandas, Numpy, Scikit-learn and MatplotLib."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# Python library imports: numpy, random, sklearn, pandas, etc\n",
      "\n",
      "import warnings\n",
      "warnings.filterwarnings('ignore')\n",
      "\n",
      "import sys\n",
      "import random\n",
      "import numpy as np\n",
      "import scipy as sp\n",
      "\n",
      "from sklearn import linear_model, cross_validation, metrics, svm\n",
      "from sklearn.metrics import confusion_matrix, precision_recall_fscore_support, accuracy_score\n",
      "from sklearn.ensemble import RandomForestClassifier\n",
      "from sklearn.preprocessing import StandardScaler\n",
      "\n",
      "import pandas as pd\n",
      "import matplotlib.pyplot as plt\n",
      "%matplotlib inline"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 1
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "We now define a utility function to read an HDFS file into a Pandas dataframe using Pydoop. Pydoop is a package that provides a Python API for Hadoop MapReduce and HDFS.\n",
      "\n",
      "Pydoop's *hdfs.open()* function opens a single file in HDFS. However, many HDFS outputs are actually multi-part files, so our *read_csv_from_hdfs()* function uses *hdfs.ls()* to list all the needed file names and then reads each one separately. Finally, it concatenates the per-file Pandas dataframes into a single Pandas dataframe."
     ]
    },
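    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# Utility function to read an HDFS file into a dataframe using Pydoop\n",
      "import pydoop.hdfs as hdfs\n",
      "\n",
      "def read_csv_from_hdfs(path, cols, col_types=None):\n",
      "  # hdfs.ls() returns one entry per part file for multi-part outputs\n",
      "  files = hdfs.ls(path)\n",
      "  pieces = []\n",
      "  for f in files:\n",
      "    fhandle = hdfs.open(f)\n",
      "    try:\n",
      "      # column names are supplied by the caller; malformed rows are skipped\n",
      "      pieces.append(pd.read_csv(fhandle, names=cols, dtype=col_types, error_bad_lines=False))\n",
      "    finally:\n",
      "      # release the handle even if parsing fails\n",
      "      fhandle.close()\n",
      "  # stack the per-file dataframes into a single dataframe with a fresh index\n",
      "  return pd.concat(pieces, ignore_index=True)"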
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 2
    },
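    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "As a minimal sketch of the same read-and-concatenate pattern without a cluster, the cell below swaps the HDFS file handles for in-memory buffers. The helper name *read_csv_parts()* and the two toy part files are hypothetical, made up purely for illustration; only *pd.read_csv()* and *pd.concat()* are taken from the function above."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# Illustrative sketch only: in-memory stand-ins for HDFS part files (made-up data)\n",
      "import io\n",
      "import pandas as pd\n",
      "\n",
      "def read_csv_parts(handles, cols):\n",
      "    # hypothetical helper: read each headerless part, then stack the pieces\n",
      "    pieces = [pd.read_csv(h, names=cols) for h in handles]\n",
      "    return pd.concat(pieces, ignore_index=True)\n",
      "\n",
      "part_0 = io.StringIO(u\"2007,1,ORD\\n2007,2,ORD\\n\")\n",
      "part_1 = io.StringIO(u\"2007,3,SFO\\n\")\n",
      "toy_parts = read_csv_parts([part_0, part_1], ['year', 'month', 'Origin'])\n",
      "toy_parts.shape"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },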
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Great. Now that we have the logistics out of the way, let's explore this dataset further.\n",
      "\n",
      "First, let's read the raw data for 2007 from HDFS into a Pandas dataframe. We use our utility function *read_csv_from_hdfs()* and provide it with column names, since this is a raw file and not a Hive table with metadata. Let's see how it works:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# read 2007 year file\n",
      "cols = ['year', 'month', 'day', 'dow', 'DepTime', 'CRSDepTime', 'ArrTime', 'CRSArrTime', 'Carrier', 'FlightNum', \n",
      "        'TailNum', 'ActualElapsedTime', 'CRSElapsedTime', 'AirTime', 'ArrDelay', 'DepDelay', 'Origin', 'Dest', \n",
      "        'Distance', 'TaxiIn', 'TaxiOut', 'Cancelled', 'CancellationCode', 'Diverted', 'CarrierDelay', \n",
      "        'WeatherDelay', 'NASDelay', 'SecurityDelay', 'LateAircraftDelay']\n",
      "flt_2007 = read_csv_from_hdfs('airline/delay/2007.csv', cols)\n",
      "\n",
      "flt_2007.shape"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 3,
       "text": [
        "(7453216, 29)"
       ]
      }
     ],
     "prompt_number": 3
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "We see 7.4M+ flights in 2007 and 29 variables.\n",
      "\n",
      "Our \"target\" variable will be *DepDelay* (departure delay relative to the scheduled time, in minutes). To build a classifier, we refine the target into a binary variable by defining a \"delay\" as a departure delay of 15 minutes or more, and a \"non-delay\" otherwise. We thus create a new binary variable that we name *'DepDelayed'*.\n",
      "\n",
      "Let's look at some basic statistics, after limiting ourselves to flights originating from ORD:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "df = flt_2007[flt_2007['Origin']=='ORD'].dropna(subset=['DepDelay'])\n",
      "# flag flights that departed 15 or more minutes late\n",
      "df['DepDelayed'] = df['DepDelay'] >= 15\n",
      "print(\"total flights: \" + str(df.shape[0]))\n",
      "print(\"total delays: \" + str(df['DepDelayed'].sum()))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "total flights: 359169\n",
        "total delays: 109346\n"
       ]
      }
     ],
     "prompt_number": 4
    },
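    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "To make the thresholding rule concrete, here is a small sketch on made-up delay values (the toy dataframe is an assumption for illustration, not part of the airline data). Rows with a missing *DepDelay* are dropped first, and a delay of exactly 15 minutes counts as delayed:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# Illustrative sketch only: made-up delays in minutes, including one missing value\n",
      "import numpy as np\n",
      "import pandas as pd\n",
      "\n",
      "toy_delays = pd.DataFrame({'DepDelay': [-3.0, 0.0, 14.0, 15.0, 42.0, np.nan]})\n",
      "toy_delays = toy_delays.dropna(subset=['DepDelay'])\n",
      "# same rule as above: 15 minutes or more counts as a delay\n",
      "toy_delays['DepDelayed'] = toy_delays['DepDelay'] >= 15\n",
      "toy_delays['DepDelayed'].tolist()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },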
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Let's see how delayed flights are distributed by month:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# df already holds only the flights originating from ORD\n",
      "\n",
      "# Compute the fraction of delayed flights per month (mean of the boolean flag)\n",
      "grouped = df[['DepDelayed', 'month']].groupby('month').mean()\n",
      "\n",
      "# plot the fraction of delayed flights by month\n",
      "grouped.plot(kind='bar')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 5,
       "text": [
        "<matplotlib.axes._subplots.AxesSubplot at 0x7f0a688ca490>"
       ]
      },
      {
       "metadata": {},
       "output_type": "display_data",
       "png": "iVBORw0KGgoAAAANSUhEUgAAAW8AAAEQCAYAAAB/SPUAAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAHcRJREFUeJzt3XmUVPWZ//H3QyMKsnQDGo0srWZR8tODiQouTNqT4LQR\nlTGJCUa0jSPugzmaA3HMQeMx6Cznl/m5DUa2gRAyyYzRSIxGf1RkYoLEUcbIoo0iixsggoAYkGf+\nqNtNUd1V1cut5Xvr8zqnDnXr3vp+7reqear6qdu3zN0REZGw9Cj3DoiISOepeIuIBEjFW0QkQCre\nIiIBUvEWEQmQireISIAKFm8zazSzVWb2qplNaWd9g5ltM7MXosutxdlVERFp0TPfSjOrAe4Fvgxs\nBJaZ2aPuvjJr09+5+/lF2kcREclS6J33qUCzu6919z3AQuCCdraz2PdMRERyKlS8jwLWZyxviG7L\n5MDpZrbczH5tZiPi3EEREWkrb9uEdGEu5L+Boe6+y8zOAX4JfKbbeyYiIjkVKt4bgaEZy0NJv/tu\n5e4fZFx/3MzuN7OB7v5e5nZmppOoiIh0gbu3aU0Xapv8Cfi0mdWbWS/gG8CjmRuY2SfMzKLrpwKW\nXbgzdqDTl2nTpnXpfl29KC/MLOUpL6l5ueR95+3ue83seuAJoAaY6e4rzeyqaP0M4GvANWa2F9gF\nfLPAC0KnrF27Ns7hlFfCvCTPTXnKK3deobYJ7v448HjWbTMyrt8H3BfrXomISF4V/xeWTU1Nygs0\nL8lzU57yyp1n+XoqsQaZeamyRESSwszwLnxgWXRmpkuCLplSqVRJf5aUp7xqyivY8y4FvSNPhuzi\nLSLFU/a2SfQrQUn2QYpLz6VI/Cq2bSIiIp2n4i1FE3pPUXnKq+Q8Fe9A9ejRg9dee62kmXPmzGHM\nmDElzRSR9lVkz7sUH3wVmnd9fT3vvvsuPXv2pKamhhEjRnDppZcyadKkbu9fU1MTP/3pTzn44IMB\nGD58OOeddx5Tp06lf//+HRqjR48eNDc3c8wxx3RrXzpjzpw5zJw5kyVLlrS7Xj1vkfgF2PP2Il4K\nMzMee+wxtm/fzrp165g6dSp33303V1xxRbdnZmZMmTKF7du3s3nzZmbPns0f//hHzjjjDHbt2tXt\n8UWkPOI6zLYjKrh4V45+/fpx3nnn8bOf/Yy5c+eyYsUKPvroI26++WaGDx/OEUccwTXXXMPu3buB\ndG9ryJAhTJ8+ncMOO4yjjz6aBQsWHDBmyzvUXr16cfLJJ/Poo4+yZcsWZs+e3brNrFmzGDFiBAMH\nDqSxsZF169a1u3+LFi3ipJNOYsCAAQwbNozbb7+9dd25557Lvffee8D2J554Io888ggAq1atYuzY\nsQwaNIjjjjuOn//8563bbdmyhfPPP58BAwYwatQo1qxZ06nHLfSeovKU1zW53jQuzrOu81S8O+GU\nU05hyJAhPPPMM0ydOpXm5maWL19Oc3MzGzdu5Ac/+EHrtu+88w5btmzhzTffZO7cuUyaNIlXX301\n59h9+/Zl7NixrS2JRx55hOnTp/Pwww+zefNmxowZw4QJE3Led/78+Wzbto1FixbxwAMPtBbnpqYm\n5s+f37rt8uXLefPNNzn33HPZuXMnY8eO5ZJLLmHTpk0sXLiQa6+9lpUr099yd91119GnTx/efvtt\nZs2axezZs3Ust0ilKNXpENNRbbV3O+DgRby0vy+Z6uvr/emnn25z++jRo/3OO+/0Qw891NesWdN6\n+7PPPutHH320u7svXrzYe/bs6bt27Wpdf9FFF/kdd9zh7u5NTU1+6623thl7ypQpfvbZZ7u7e2Nj\no8+cObN13ccff+x9+vTxdevWubu7mR2Qn2ny5Mn+ne98x93dP/zwQ6+rq/Pm5mZ3d7/pppv8uuuu\nc3f3hQsX+pgxYw6476RJk/z222/3vXv3+kE
HHeSrV69uXXfLLbf4mWee2W6me8ceV5Ek63rtyv1/\nJ1rXpqbqnXcnbdy4kb1797Jr1y6+8IUvUFdXR11dHeeccw6bN29u3a6uro7evXu3Lg8fPpy33nqr\n4NgDBw4E4I033mDy5Mmt4w8aNKh1m2xLly7lrLPO4vDDD6e2tpYZM2awZcsWAA455BAuuugi5s2b\nh7uzcOFCJk6c2JqxdOnS1oy6ujoWLFjAO++8w+bNm9m7dy9Dh+7/Lo5hw4Z18VETkbipeHfCsmXL\n2LhxI+PHj6d3796sWLGCrVu3snXrVt5//322b9/euu3WrVsP+PDxjTfe4JOf/GTrcnb7YceOHTz1\n1FOth+INGzaMBx98sHX8rVu3snPnTkaPHt1mvy6++GLGjx/Phg0beP/997n66qvZt29f6/rLLruM\nn/zkJzz11FP06dOHUaNGtWZ88YtfPCDjgw8+4L777mPw4MH07NnzgD57rp57LsnpYSpPebEkxjqa\ninceHn2ouH37dh577DEmTJjAxIkTOfHEE7nyyiu58cYb2bRpE5B+R/zkk08ecP9p06axZ88elixZ\nwqJFi/j617/eOm7L2B999BHPP/8848ePZ9CgQVx++eUAXH311fzwhz9kxYoVAGzbtu2ADxMz7dix\ng7q6Onr16sVzzz3HggULDnhxOO200zAzbr75Zi699NLW28eNG8crr7zC/Pnz2bNnD3v27GHZsmWs\nWrWKmpoaLrzwQm677TY+/PBDVqxYwdy5c9XzFqkU7fVSinEhwJ537969vV+/fj5gwAA//fTT/f77\n7/d9+/a5u/vu3bv9lltu8WOOOcb79+/vxx9/vN9zzz3unu55DxkyxO+8804fPHiwDx8+3OfPn986\ndlNTk/fq1cv79evnffv29c997nM+depU37Zt2wH7MG/ePD/hhBO8f//+PnToUL/iiita1/Xo0aO1\n5/2LX/zChw8f7v369fNx48b5DTfc4BMnTjxgrDvuuMPNzF9//fUDbl+9erWfe+65fthhh/mgQYP8\nS1/6ki9fvtzd3Tdt2uTjxo3z/v37+6hRo/z73/9+mx559vMmUs26Xrs63/PWH+kUQSqVYuLEiaxf\nv75oGZ01b948fvzjH/PMM88ULUN/pCPVLl27uvJ/IPf/naD+SKe9V5m4L9Vk165d3HfffUyaNKmk\nuUnvYSpPeZ1MjHW0iizeSVApveEnnniCww8/nCOPPJKLL7643LsjIjGpyLaJhEnPpVS7qm+biIhI\nfireUjRJ72EqT3mdTIx1NBVvEZEAVUTPW5JDPW+pZqXseZf92+PL8Z+9Oy8YKk4iUgkqvm0S+jl3\nC0lyny/Jc1Oe8rqQGOtoFV+8RUSkrbL3vMuhGH0pEREd5y0iInlVfPEOvS9VMC3Bfb4kz015yutC\nYqyjVXzxFhGRttTz7tw91fMWkZzU8xYRkbwqvniH3pcqmJbgPl+S56Y85XUhMdbRKr54i4hIWwV7\n3mbWCPwIqAEecve7c2x3CvAH4CJ3/8921qvnLSKJVjE9bzOrAe4FGoERwAQzOz7HdncDvwF0pikR\nkSIr1DY5FWh297XuvgdYCFzQznY3AL8ANsW8f8H3pQqmJbjPl+S5KU95XUiMdbRCxfsoIPMr0DdE\nt7Uys6NIF/QHopvUVxARKbJCp4TtSCH+ETDV3d3SDZ+cbZOmpibq6+sBqK2tZeTIkTQ0NAD7XwVL\ntbz/VbCzy5Rlf0NdbqE85VVDXsaI0b8NWcu51qfHaGhoIJVKMWfOHIDWetmevB9Ymtlo4DZ3b4yW\nvwfsy/zQ0sxeY3/BHgzsAq5090ezxtIHliKSaBXzgSXwJ+DTZlZvZr2AbwAHFGV3P8bdj3b3o0n3\nva/JLtzd0fYVrdhKm1fq+ZUyL8lzU57yupAY62h52ybuvtfMrgeeIH2o4Ex3X2lmV0XrZ8S6NyIi\n0iE6t0n
n7qm2iYjkVEltExERqUAVX7xD70sVTEtwny/Jc1Oe8rqQGOtoFV+8RUSkLfW8O3dP9bxF\nJCf1vEVEJK+KL96h96UKpiW4z5fkuSlPeV1IjHW0ii/eIiLSlnrenbunet4ikpN63iIiklfFF+/Q\n+1IF0xLc50vy3JSnvC4kxjpaxRdvERFpSz3vzt1TPW8RyUk9bxERyavii3fofamCaQnu8yV5bspT\nXhcSYx2t4ou3iIi0pZ535+6pnreI5KSet4iI5FXxxTv0vlTBtAT3+ZI8N+UprwuJsY5W8cVbRETa\nUs+7c/dUz1tEclLPW0RE8qr44h16X6pgWoL7fEmem/KU14XEWEfrGeto0q70r1JdozaNiLRHPe/O\n3bNLxVQ9dpHqoJ63iIjkVfHFO/S+VKXlqeetPOWVJy+RPW/1hEXS9H9BOqoiet5J70Gr5y0dpZ+V\nsKnnLSIieQVQvFPKizNNPe9g85L8s1INeTq3iYiIqOedxDwJl35Wwqaet4iI5BVA8U4pL860wHve\nZtblS9xC75kWTEt4Dzr05y+A4i2SzXNcFudZJ5IsBXveZtYI/AioAR5y97uz1l8A/ADYF12+6+7/\nv51x1PMuUV6SJf2xTPr8kq6UPe+8xdvMaoDVwJeBjcAyYIK7r8zY5lB33xldPwF42N0/1c5YKt4l\nykuypD+WSZ9f0lXSB5anAs3uvtbd9wALgQsyN2gp3JG+wOZO7XNBqXiHq/K80HveBRJLm6b5Ka9z\nibGOVqh4HwWsz1jeEN12ADMbb2YrgceBv4tv90REpD2F2iZfBRrd/cpo+RJglLvfkGP7MaT74p9t\nZ53aJiXKS7KkP5ZJn1/SlbJtUuisghuBoRnLQ0m/+26Xuy8xs55mNsjdt2Svb2pqor6+HoDa2lpG\njhxJQ0NDtDYV/dvZ5Wgp+hWoZbxCy0nOi+PMdJ2dX6mW92tZbujgcnqMcu9/tc8v6cv7tSw3dHB5\n//OXSqWYM2cOQGu9bJe757yQLu5rgHqgF/AicHzWNsey/x3854E1OcbyXAAHz3FZnGdd7jHzUV68\nebksXrw41vHcK2du7pqf8toqxvMXrWtTU/O+83b3vWZ2PfAE6UMFZ7r7SjO7Klo/A/gqcKmZ7QF2\nAN/MN6aIiHSfzm2ivG7nlVKS5wbJn1/SVdKhgiIiUoECKN4p5QWaF/pxtAXTND/ldS4x1tECKN4i\nIpJNPW/ldTuvlJI8N0j+/JJOPW8REckrgOKdUl6geaH3FKGyzh+unnfYeep5i5Sc57gszrNOpLjU\n81Zet/NKKemPZZKfu2qgnreIiOQVQPFOKS/QvNB7itWel/QedOg/nwEUbxERyaaet/K6nVdKSX8s\nk/zcVQP1vEVEJK8AindKeYHmhd5TrPa8pPegQ//5DKB4i4hINvW8ldftvFJK+mOZ5OeuGqjnLSIi\neQVQvFPKCzQv9J5iteclvQcd+s9nAMVbRESyqeetvG7nlVLSH8skP3fVQD1vERHJK4DinVJeoHmh\n9xSrPS/pPejQfz4DKN4iIpJNPW/ldTuvlJL+WCb5uasGpex59+xCiohIELrzdXSV/mIYQNskpbxA\n80LvKVZ7XnJ60JXyNXapWEcLoHiLiEg29byV1+28Ukr6Y5nk564ckvD86ThvEZEECaB4p5QXaJ56\n3mHnFeP5M7MuX+KXKsKYpcsLoHiLSLJUygeIYVPPW3ndziulpD+WSX7uIPmPp47zlqB09VfaEIqN\nSKUKoG2SUl4QeZXwa3CqSONWZ54+s6jsvACKt4iIZFPPW3llykvy3MLJK7WkP54Vd5y3mTWa2Soz\ne9XMprSz/ltmttzM/sfMfm9mJ3Z630VEpMMKFm8zqwHuBRqBEcAEMzs+a7PXgL9y9xOBO4AH49vF\nVHxDKa/EeaXMUl7saep5V3ReR442ORVodve1AGa2ELgAWNmygbv/IWP7p
cCQGPdRRIokyWfdS7qC\nPW8z+xrw1+5+ZbR8CTDK3W/Isf3NwGfcfVLW7ep5K69MWcpTXrh53TnOu8N7YmZnAd8GzujofURE\npPM6Urw3AkMzlocCG7I3ij6k/DHQ6O5b2xuoqamJ+vp6AGpraxk5ciQNDQ3R2lT0b/Zyy2351u/v\nz7WMV2hZeaXIexG4Mcf69BgdHb9t/1V5yktmXiqVYs6cOQCt9bJd7p73QrrArwHqgV7RHhyftc0w\noBkYnWcczwVw8ByXxXnW5R4zH+WVKi/Jc1Oe8kqTF60j+9Kh47zN7BzgR0ANMNPdp5vZVVFFnmFm\nDwF/A6yL7rLH3U/NGsNzZSWhL6W8zt4vyXNTnvLiy8vV89Yf6SivTHlJnpvylBdfXsBfxpBSXrB5\npcxSnvKqKy+A4i0iItnUNlFemfKSPDflKS++vIDbJiIiki2A4p1SXrB5pcxSnvKqKy+A4i0iItnU\n81ZemfKSPDflKS++PPW8RUQSJIDinVJesHmlzFKe8qorL4DiLSIi2dTzVl6Z8pI8N+UpL7489bxF\nRBIkgOKdUl6weaXMUp7yqisvgOItIiLZ1PNWXpnykjw35Skvvjz1vEVEEiSA4p1SXrB5pcxSnvKq\nKy+A4i0iItnU81ZemfKSPDflKS++PPW8RUQSJIDinVJesHmlzFKe8qorL4DiLSIi2dTzVl6Z8pI8\nN+UpL7489bxFRBIkgOKdUl6weaXMUp7yqisvgOItIiLZ1PNWXpnykjw35Skvvjz1vEVEEiSA4p1S\nXrB5pcxSnvKqKy+A4i0iItnU81ZemfKSPDflKS++PPW8RUQSJIDinVJesHmlzFKe8qorL4DiLSIi\n2dTzVl6Z8pI8N+UpL7489bxFRBKkYPE2s0YzW2Vmr5rZlHbWH2dmfzCz3WZ2U/y7mIp/SOUlMEt5\nyquuvJ75VppZDXAv8GVgI7DMzB5195UZm20BbgDGx7pnIiKSU96et5mdBkxz98ZoeSqAu9/VzrbT\ngB3u/s85xlLPW3llylKe8sLN62rP+yhgfcbyhug2EREpo7xtE7r2EpJTU1MT9fX1ANTW1jJy5Ega\nGhqitano3+zlltvyrYdUKr3cMl6hZeWVIu9F4MYc69NjdHT8luWMPVSe8hKZl0qlmDNnDkBrvWyX\nu+e8AKOB32Qsfw+YkmPbacBNecbyXAAHz3FZnGdd7jHzUV6p8pI8N+UprzR50TqyL4V63j2B1cCX\ngDeB54AJfuAHli3b3gZ84Op5K6/ispSnvHDzcvW887ZN3H2vmV0PPAHUADPdfaWZXRWtn2FmRwDL\ngP7APjObDIxw9x1dmIGIiHRAweO83f1xd/+su3/K3adHt81w9xnR9bfdfai7D3D3OncfFm/hTsU3\nlPJKnFfKLOUpr7ry9BeWIiIB0rlNlFemvCTPTXnKiy9P5zYREUmQAIp3SnnB5pUyS3nKq668AIq3\niIhkU89beWXKS/LclKe8+PLU8xYRSZAAindKecHmlTJLecqrrrwAireIiGRTz1t5ZcpL8tyUp7z4\n8tTzFhFJkACKd0p5weaVMkt5yquuvACKt4iIZFPPW3llykvy3JSnvPjy1PMWEUmQAIp3SnnB5pUy\nS3nKq668AIq3iIhkU89beWXKS/LclKe8+PLU8xYRSZAAindKecHmlTJLecqrrrwAireIiGRTz1t5\nZcpL8tyUp7z48tTzFhFJkACKd0p5weaVMkt5yquuvACKt4iIZFPPW3llykvy3JSnvPjy1PMWEUmQ\nAIp3SnnB5pUyS3nKq668AIq3iIhkU89beWXKS/LclKe8+PLU8xYRSZAAindKecHmlTJLecqrrrwA\nireIiGRTz1t5ZcpL8tyUp7z48tTzFhFJkILF28wazWyVmb1qZlNybPP/ovXLzeykeHcxFe9wykto\nlvKUV115eYu3mdUA9wKNwAhggpkdn
7XNV4BPufungUnAA7HuIS/GO5zyEpqlPOVVV16hd96nAs3u\nvtbd9wALgQuytjkfmAvg7kuBWjP7RHy7+H58QymvxHlJnpvylFfevELF+yhgfcbyhui2QtsM6f6u\niYhILoWKd0c/Ns3+JDTGQ1jWxjeU8kqcV8os5SmvuvLyHipoZqOB29y9MVr+HrDP3e/O2OZfgZS7\nL4yWVwFfdPd3ssYqzTGJIiIJ096hgj0L3OdPwKfNrB54E/gGMCFrm0eB64GFUbF/P7tw5woXEZGu\nyVu83X2vmV0PPAHUADPdfaWZXRWtn+Huvzazr5hZM7ATuLzoey0iUuVK9heWIiISn0Jtk0SLjln/\nJLDU3Xdk3N7o7r8pQt6ZwHvuvsLMGoCTgRfc/em4s8rJzMaQPsz0JXd/sgjjjwZWuvs2M+sDTAU+\nD7wM/NDdt8Wc93fAw+6+vuDG8eQdDHwT2OjuT5nZt4DTgRXAg9Fhu3FnHgtcSPpIsX3AamCBu2+P\nO0viEcSfx5tZ7K2Y6D/kL4EbgJfNbHzG6ulFyJsO/BMw18z+AbgL6A1MM7Pvxp2XYx/+rUjjPpdx\n/UrgHqAv6bl9rwiRs0i36AD+BehP+vH8EJhdhLw7gOfM7L/M7FozO6wIGZlmA18BJpvZPOBrwB9J\nvyA+FHeYmU0G/hU4OMo4GBgGLDWzs+LOk5i4e8VfgPVFGPPPQN/oej3pD2dvjJZfKELeCtK/6fQB\nPgAGRLf3Bv6nCHm/Iv1h8q8yLjtbbo8564WM638CDouuHwr8uQhzW5lx/b+z1i0vQt4LpN/onE36\nhWMT8BvgMqBfEfJeiv7tCbwL9IyWrWVdzHl/Bmqi632A30XXhwEvFiGvlvSL7SpgK/BedP0uoDbu\nvAL78ngRxhwQzWU+cHHWuvvjyqmYtomZvZRn9eHFiPSoVeLua6M2xn+Y2XDaHrceh7+4+15gr5mt\n8ehXe3f/0Mz2FSFvCOkXjIdI/xpspNs0/1SErBozGxhl1Lj7JgB332lme4uQ97KZfdvdZwHLzewU\nd19mZp8B/lKEPNx9H/Ak8KSZ9QLOIX3k1T8Dg2OO6xG1TvqQfnEfAGwBDqE4vy07cBDwcZRxKIC7\nrzOzg4qQ9+/A00AD8I67u5kdSfrF8N9Jv0jGxsw+n2sVEPO5mID0b06vAP8BfNvMvgp8y913A6fF\nFVIxxZt0gW4k/Uqc7dki5L1rZiPd/UUAd99hZuOAmcCJRcj7yMz6uPsu0v1ZAMyslnRxjdvJwGTg\n74HvuvsLZrbb3X9XhKz+wPPRdTezI939LTPrV4QsgL8F/sXMbiX9LvhZM9tA+i99/7ZIma3c/S/A\nI8AjZnZoESLmAyuBPcBNwBIzexYYTXQqipg9BCwzs6XAGOBuADM7nPSLRtzqPeNvRQDc/S3gLjP7\ndhHylgHP5Fg3oAh5x7r7hdH1h83s74GnzSz71CLdUjFHm5jZLGC2uy9pZ91P3T37+PLu5g0F9rj7\n21m3G3CGu/9XzHmHRK+82bcPBo5093y/eXQndwjwf0n/+n2+uw8tRk6O7D7AJ9z99SKNPwA4mvSb\nkA3Zz2WMOZ9199XFGDtPZj2w3d3fiz5MPBlY5e7Li5T3f4DjSLe5VhUjIyPrt8Bvgbke/U2ImR1B\n+p33WHf/csx5LwN/4+6vtLNufdz/J8xsJfC56Le1ltuagO+SbtUOjyWnUoq3FFf0W8Xp7n5LufdF\nqlvUYptK+qR2LSexe4f0ZzR3uft7Med9nfRnBW1elMxsvLv/Mua8fwSedPffZt3eCNzj6TOwdj9H\nxVtEKoWZXe7uxThiKFdey2cnweWpeItIxShGGyOpeZX0gaWIVIECR5bF+F0Ayc5T8RaRUiv1kWWJ\nzFPxFpFSW0T6qIsXsleYWTEOZU1knnreIiIBCuLcJiIiciAVbxGRAKl4i4gESMVbJAczG2Bm12Qs\nN
5jZr8q5TyItVLxFcqsDri33Toi0R8VbEsHM6s1slZnNNrPVZvYTMzvbzH5vZq+Y2SlmNtDMfmlm\ny83sD2Z2QnTf28xslpktNrM1ZnZDNOxdwLFm9kL0BRoO9DWzn5vZSjObX675iug4b0mSY4Gvkj6P\n+TLgG+5+hpmdD9xC+pSxz7v7+OgbYv6N/edz/gxwFunT2642s/uBKaTPDncSpNsm0fYjgLeA35vZ\nGe7++1JNUKSF3nlLkrzu7i97+o8XXgaeim5/ifSpY88E5gG4+2JgUHTOcQcWufsed99C+vS5n6D9\nL+V4zt3fjDJeJP0tTCIlp+ItSfJRxvV97P9WHQdqon9zfUtS5jfwfEzu30o/6uB2IkWl4i3VZAnw\nLWhtgWxy9w/IXdA/AIr1bUAi3aJ3DZIk2ed68KzrtwOzzGw56S9jvixjXZvzRLj7lugDz5eAX0eX\nfBkiJaNzm4iIBEhtExGRAKl4i4gESMVbRCRAKt4iIgFS8RYRCZCKt4hIgFS8RUQCpOItIhKg/wUp\nO5ENXfPrHgAAAABJRU5ErkJggg==\n",
       "text": [
        "<matplotlib.figure.Figure at 0x7f0a88948610>"
       ]
      }
     ],
     "prompt_number": 5
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "We see that the fraction of delayed flights is highest in December and February, which is what we would expect for winter months.\n",
      "\n",
      "Now let's look at the hour-of-day:"
     ]
    },
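    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "The scheduled departure time *CRSDepTime* is stored as a number in hhmm form, so 5 means 00:05 and 1730 means 17:30. One way to recover the hour, sketched here with a hypothetical helper *hhmm_to_hour()* on made-up values, is to left-pad the digits to four characters and keep the first two. For valid hhmm values, integer division by 100 should give the same result."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# Illustrative sketch only: hhmm_to_hour is a hypothetical helper, the inputs are made up\n",
      "def hhmm_to_hour(x):\n",
      "    # 5 -> '0005' -> 0, 1730 -> '1730' -> 17\n",
      "    return int(str(int(x)).zfill(4)[:2])\n",
      "\n",
      "toy_hours = [hhmm_to_hour(t) for t in [5, 45, 930, 1730, 2359]]\n",
      "toy_hours"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },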
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# Extract the scheduled departure hour from the hhmm value\n",
      "df['hour'] = df['CRSDepTime'].map(lambda x: int(str(int(x)).zfill(4)[:2]))\n",
      "# Compute the fraction of delayed flights by hour (mean of the boolean flag)\n",
      "grouped = df[['DepDelayed', 'hour']].groupby('hour').mean()\n",
      "\n",
      "# plot the fraction of delayed flights by hour of day\n",
      "grouped.plot(kind='bar')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 6,
       "text": [
        "<matplotlib.axes._subplots.AxesSubplot at 0x7f0a2a26c4d0>"
       ]
      },
      {
       "metadata": {},
       "output_type": "display_data",
       "png": "iVBORw0KGgoAAAANSUhEUgAAAXYAAAEQCAYAAACk818iAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAIABJREFUeJztnX2YFOWZr++HIURRcBBxTRQYNW7UJGaIChjD2sSPjF+R\nzYcGV8wYV0yiRrPmBMwmi8bLoNnstezGaDCLjAc0ZJNo5MQYlBxmdTdRwAia8BFQ8QONAYIQRD2D\nPOePqhl6mp7uqu6nZ97qee7rqmv6rar37t/b3fN299PV1aKqOI7jOPXDgL4O4DiO49jiE7vjOE6d\n4RO74zhOneETu+M4Tp3hE7vjOE6d4RO74zhOnVF2YheRFhFZIyLrRGRaif1OFJFdIvLJvHUbROQp\nEXlSRJZahXYcx3F6ZmCpjSLSANwKnAZsBJaJyEJVXV1kv1uAXxYoFMip6p/tIjuO4zilKPeKfSyw\nXlU3qGoHsAA4r8h+VwE/ATYV2SbVRXQcx3HSUG5iPxR4Ma/9UryuCxE5lGiyvz1elf9VVgUWi8hy\nEbmsyqyO4zhOAkqWYug+SffELGC6qqqICN1foZ+sqq+IyAjgYRFZo6qP5ncWET+ngeM4TgWoatGK\nSLlX7BuBkXntkUSv2vM5HlggIs8BnwRuE5GPx1f6Svx3E3AfUWmnWLiyy4wZMxLt11uees9U7+Pz\nTD6+UDMldZWi3MS+HDhKRJpEZBBwAbCwYFI+QlUPV9XDiersX1DVhSIyWESGAIjIfsAZwNNlrs9x\nHMepkpKlGFXdJSJXAouABmCOqq4Wkcvj7bNLdD8EuDeqzjAQuFtVH6o06IYNGyrtWhOPpSvETJYu\nz9S7nlBdIWSK56Nu3HDDDXutK/eKuBghjK+TcjV2VPVB4MGCdUUndFW9JO/ys0BzVenyaG62UVl5\nLF0hZrJ0eabe9YTqCidT/qQ9C7imYHtlB/KFMz6QSp6ZLBER7esMjuOES7FX2T1Rbi6JXOXmG6no\nFXtvIyJoDx+eln3F3lekuTOd8MnCP4oTMkkePz5ndBL0uWKsPmH2pW8XS9rb24PyWLpCzGTpsswE\nVi4rT1i3edATu+M4jpOeYGvscf2oDxI51vh96VRDsro4JKmNe43dcRynCpJ+TpaFSTRreCnGyRQh\n1nvrOVP1Ls1blhS0q5nQq8lUC09It7lP7HXJgAEDePbZZ3v1Otva2pgwYUKvXqfjOMXJVI29Nw6B\nTHJ7NDU18ac//YmBAwfS0NDAsccey8UXX8zUqVOrztja2soPf/hD3vnOdwIwevRozj33XKZPn87Q\noUMTOQYMGMD69es54ogjqsqShra2NubMmcOjjz661zavsfdPrOrZXmMvTqkaewZfsRe+lbNckiEi\n/PznP2f79u288MILTJ8+nVtuuYVLL7206tGJCNOmTWP79u1s3ryZuXPn8thjj3HyySezc+fOqv2O\n49Q/GZzYw2LIkCGce+65/OhHP+Kuu+5i1apVvPXWW3zlK19h9OjRHHLIIXzhC1/gzTffBKLa2WGH\nHcbMmTMZMWIEhx9+OPfcc083Z+erhUGDBnHCCSewcOFCtmzZwty5c7v2ufPOOzn22GM58MADaWlp\n4YUXXiia74EHHmDMmDEccMABjBo1qtt5Mc4++2xuvfXWbvsfd9xx3H///QCsWbOG008/neHDh3P0\n0Ufz4x//uGu/LVu28PGPf5wDDjiAcePG8cwzz1RxKyannuvZIWaydVl5LF1WnrBuc5/YjTjxxBM5\n7LDDeOSRR5g+fTrr169n5cqVrF+/no0bN/LNb36za99XX32VLVu28PLLL3PXXXcxdepU1q1b16N7\n//335/TTT+8qc9x///3MnDmT++67j82bNzNhwgQmT57cY9/58+ezbds2HnjgAW6//fauibu1tZX5\n8+d37bty5Upefvllzj77bF5//XVOP/10Lrr
oIjZt2sSCBQv44he/yOrV0a8iXnHFFQwePJg//vGP\n3HnnncydO9e/Lew4oRDAtxK1GMXWAwpaw6V4lkKampr0V7/61V7rx48frzfddJPut99++swzz3St\n//Wvf62HH364qqouWbJEBw4cqDt37uzafv755+uNN96oqqqtra369a9/fS/3tGnT9IwzzlBV1ZaW\nFp0zZ07XtrffflsHDx6sL7zwgqqqiki368/n6quv1i9/+cuqqvrGG2/osGHDdP369aqqeu211+oV\nV1yhqqoLFizQCRMmdOs7depUveGGG3TXrl36jne8Q9euXdu17Wtf+5p+5CMfKXqdSW9Xp75I9v9a\n/rGR/P/eypWNx2ucs+i86q/YDdm4cSO7du1i586dHH/88QwbNoxhw4Zx5plnsnnz5q79hg0bxr77\n7tvVHj16NK+88kpZ94EHHgjA888/z9VXX93lHz58eNc+hTz++ONMnDiRgw8+mMbGRmbPns2WLVsA\n2GeffTj//POZN28eqsqCBQuYMmVK13U8/vjjXdcxbNgw7rnnHl599VU2b97Mrl27GDlyz2+wjBo1\nqsJbzXEca3xiN2LZsmVs3LiRSZMmse+++7Jq1Sq2bt3K1q1bee2119i+fXvXvlu3bu32Qejzzz/P\nu9/97q52YUljx44dLF68uOtwwlGjRnHHHXd0+bdu3crrr7/O+PHj98p14YUXMmnSJF566SVee+01\nPv/5z7N79+6u7Z/97Ge5++67Wbx4MYMHD2bcuHFd13HKKad0u46//OUvfO973+Oggw5i4MCB3er6\nPdX4rannenaImWxdVh5LV2UeEUm8VJTKa+x9g8YfcG7fvp2f//znTJ48mSlTpnDcccdx2WWXcc01\n17Bp0yYgeiX90EPdf2NkxowZdHR08Oijj/LAAw/w6U9/usvb6X7rrbd44oknmDRpEsOHD+eSS6LT\n3X/+85/nW9/6FqtWrQJg27Zt3T7YzGfHjh0MGzaMQYMGsXTpUu65555uD7aTTjoJEeErX/kKF198\ncdf6c845hz/84Q/Mnz+fjo4OOjo6WLZsGWvWrKGhoYFPfOITXH/99bzxxhusWrWKu+66y2vsTj+j\n8Kg6yy9gVRutfA28BVgDrAOmldjvRGAX8Mk0fUldY6/tkoSmpibdd999dciQIXrAAQfohz/8Yb3t\nttt09+7dqqr65ptv6te+9jU94ogjdOjQoXrMMcfod7/7XVWNauyHHXaY3nTTTXrQQQfp6NGjdf78\n+V3u1tZWHTRokA4ZMkT3339/fd/73qfTp0/Xbdu2dcswb948/cAHPqBDhw7VkSNH6qWXXtq1bcCA\nAV019p/85Cc6evRoHTJkiJ5zzjl61VVX6ZQpU7q5brzxRhURfe6557qtX7t2rZ599tk6YsQIHT58\nuJ566qm6cuVKVVXdtGmTnnPOOTp06FAdN26cfuMb39irJp9/vzn9D+q4xm6ZqVIoUWMv+QUlEWkA\n1gKnEf2w9TJgsqquLrLfw8BOYK6q/jRFXy2WoV6/1NLe3s6UKVN48cUX+zpKF/PmzeMHP/gBjzzy\nSE389XpfOqWp5y8oWWaqlGq+oDQWWK+qG1S1A1gAnFdkv6uIfsh6UwV9nT5k586dfO9732Pq1Kl9\nHSUR9VzPDjGTrcvKY+my8ti6al1jPxTIf2n5UryuCxE5lGjCvj1e1fn0VLZvfyWUWvSiRYs4+OCD\nede73sWFF17Y13GcACj24d/EiRNNPhB0eo9yp+1N8h5iFjBdVVWie7zzXk/8/qO1tZWmpiYAGhsb\nTX8UNjRyuVyvHUFSjo997GPs2LGjV6+z85VILperqN25rtL+tWpbjC+Xy/X5eCKWAJ3tzvHltycm\nHm/3/rkivmT35x6K5dnTTja+9m7X373d/frS5ckVyWf3eG1vb6etrQ2ga77siXI19vHA9araErev\nA3ar6i1
5+zzLnsn8IKI6+2XAn8r1jdf3qxp7f8Tvy+xgeZKsEOvZIWaqlGpq7MuBo0SkSUQGARcA\nC/N3UNUjVPVwVT2cqM7+BVVdmKSv46SlnuvZIWaKbYF5LF1WHltXtfdfyVKMqu4SkSuBRUADMEdV\nV4vI5fH22Wn7VpXWcRzHKUvQ52N36oe+fpw5yfBSTO9nqpRM/uapTwSO4ziVkZlTCtRzHTPETJYu\nz9S7HmtXfdezrTy2rlofx+44juNkjGBr7I7j9D5eY+/9TJVSZ7956jiO45QiMxN7PdcxQ8xk6fJM\nveuxdtV3PdvKY+vyGrvjOI7TDa+xO07GSfOdj96qQVu6vMbec4bMHcfuOE4akk0yTv8gM6WYeq5j\nhpjJ0uWZetcT2wJ0WXksXVYeW5fX2B3HcZxueI3dcTJOiDVoS1eI4wu9xu6v2B3HceqMzEzsIdYx\n6zmTpcsz9a4ntgXosvJYuqw8ti6vsTuO4zjd8Bq742ScEGvQlq4Qx5f5GruItIjIGhFZJyLTimw/\nT0RWisiTIvKEiHw0b9sGEXkq3ra0umE4juM4SSg5sYtIA3Ar0AIcC0wWkWMKdlusqh9U1TFAK3BH\n3jYFcqo6RlXHVhM0xDpmPWeydHmm4ohIoqXCVBX2q6XLymPpsvLYumpdYx8LrFfVDaraASwAzsvf\nQVVfz2vuD2wucPjX3RynRzRvWVLQ9hKlUxkla+wi8ingY6p6Wdy+CBinqlcV7DcJmAm8CzhDVZfG\n658FtgFvA7NV9QdFrsNr7E6/JMR6r9fY+ypTeQo91ZwrJtGMq6o/A34mIhOAecB7400nq+orIjIC\neFhE1qjqo4X9W1tbaWpqAqCxsZHm5mZyuRyw5y2Jt71dj+09b997akd9yvn2UNpXPk9nn3L59mQr\n7Sudp7NP1saXNE9nn/Lj65xqe/JNpL29nba2NoCu+bJHVLXHBRgP/DKvfR0wrUyfZ4DhRdbPAK4t\nsl6TsGTJkkT79ZbH0hViJkuXZypO9N+secuSgna0T3pPrV3J/mezMb4QMyW7zeN1RefhcjX25cBR\nItIkIoOAC4CF+TuIyJESv5cQkQ/FM/UWERksIkPi9fsBZwBPl7k+x3Ecp0rKHscuImcCs4AGYI6q\nzhSRywFUdbaIfBW4GOgAdgD/oKrLROQI4N5YMxC4W1VnFvFruQyOU4+EW+/1GnsWMpWqsfsXlByn\nj6inSaaWrhDHF0KmujgJ2N4fWvStx9IVYiZLl2dKbDLyhOqy8li6rDxhuTIzsTuO4zjJ8FKM4/QR\n9VQWqKUrxPGFkMl/89RxjLD84WjHqRWZKcXUc201xEyWrvrLpAXLkiLrKkpVRaYsuKw8li4rT1iu\nzEzsjuM4TjK8xu44KQihtpqFTJauEMcXQqa6ONzRcRzHSUZmJvYwaqu1cYWYydJV75m83tvbHkuX\nlScsV2YmdsdxHCcZXmN3nBSEUFvNQiZLV4jjCyGT19gdx3H6EZmZ2EOsrdZzJktXvWfyem9veyxd\nVp6wXJmZ2B3HcZxkeI3dcVIQQm01C5ksXSGOL4RMXmN3HMfpR5Sd2EWkRUTWiMg6EZlWZPt5IrJS\nRJ4UkSdE5KNJ+6YhxNpqPWeydNV7Jq/39rbH0mXlCctV8uyOItIA3AqcBmwElonIQlVdnbfbYlW9\nP97/A8B9wHsS9nUcx3GMKVljF5GTgBmq2hK3pwOo6s0l9v9XVR2ftK/X2J0sEUJtNQuZLF0hji+E\nTNXU2A8FXsxrvxSvK7yCSSKyGngQ+FKavo7jOI4t5X5oI9FLaVX9GfAzEZkAzBORo9OEaG1tpamp\nCYDGxkaam5vJ5XJA91poLpfrahduT9qeNWtWUX8l7cJslfpWrFjBNddcU3UeH1/yduEYk/aPaAdy\neZdXANfktfP2LOvr3D9X0HfP9bW3tycaz97Xn+8s3b//jm8W0Ez38ebtm
ThPfpY9eTr7VD++aN+2\ntjaArvmyR1S1xwUYD/wyr30dMK1Mn2eA4Un7RhHKs2TJkkT79ZbH0hViJktXPWUCFLRgWVJkXfnH\n9d4uK0+tXcn+Z7MxvhAzJbvN43VF5+FyNfaBwFrgVOBlYCkwWfM+ABWRI4FnVVVF5EPAj1X1yCR9\n4/5aKoPjhEQItdUsZLJ0hTi+EDJV/JunqrpLRK4EFgENwBxVXS0il8fbZwOfBC4WkQ5gB/CZUn3L\npHccx3GqpOxx7Kr6oKq+V1Xfo6oz43Wz40kdVf22qr5fVceo6gRVXVaqb6WEePxyPWeydNV7Jj+m\nurc9li4rT1gu/+ap4zhOneHninGcFIRQW81CJktXiOMLIZOfK8ZxHKcfkZmJPcTaaj1nsnSFkElE\nEi0VpqqwX608obqsPJYuK09YrsxM7I5TPZq3LCloeznQqR+8xu70C+qptpqFTJauEMcXQiavsTuO\n4/QjMjOx11O9t1aeUF0hZgqpHmrvCdVl5bF0WXnCcmVmYnccx3GS4TV2p19QT7XVLGSydIU4vhAy\neY3dcRynH5GZib2e670hZrJ01fex55YuK0+oLiuPpcvKE5YrMxO7018pPNa88Phzx3EK8Rq7Eyxe\n781uJktXiOMLIVPF52N3nLSkKY34E7rj1IaypRgRaRGRNSKyTkSmFdn+dyKyUkSeEpH/EZHj8rZt\niNc/KSJLqwna1/XeWrpCzFSdq1z5pJoJvdJMtfJYuqw8obqsPJYuK09YrpKv2EWkAbgVOA3YCCwT\nkYUFv4T0LPA3qrpNRFqAO4h+7xSi/+Ccqv65qpSO4zhOYsr95ulJwAxVbYnb0wFU9eYe9h8GPK2q\nh8Xt54ATVHVLievwGnsdEULtsZauEMcXYiZLV4jjCyFTNcexHwq8mNd+KV7XE5cCv8hrK7BYRJaL\nyGVlrstxHMcxoNzEnviltIhMBD4H5NfhT1bVMcCZwBUiMiF9xIh6rmeHmMnWZeWxdFl5LF1WnlBd\nVh5Ll5UnLFe5o2I2AiPz2iOJXrV3I/7A9AdAi6pu7Vyvqq/EfzeJyH3AWODRwv6tra00NTUB0NjY\nSHNzM7lcDth7culsF25P2l6xYkVV/WvRXrFihZmvr8cX0Q7k8i6vKGjn7VnW17l/Z3tFQTvqUy5f\n92zF2qX79974Cts+vu701C7dv+fxFT6euvuT5ynsv+f6bMYX7dvW1gbQNV/2RLka+0BgLXAq8DKw\nFJic/+GpiIwC/i9wkao+lrd+MNCgqn8Rkf2Ah4AbVPWhguvwGnsdEULtsZauEMcXYiZLV4jjCyFT\nxcexq+ouEbkSWAQ0AHNUdbWIXB5vnw38EzAMuD0+hrlDVccChwD3xusGAncXTuqO4ziOPWWPY1fV\nB1X1var6HlWdGa+bHU/qqOrfq+pwVR0TL2Pj9c+qanO8vL+zb6XUcz07hEx+XpYQXFaeUF1WHkuX\nlScsl58rxsnDz8viOPWAnyvGAeqr9lhLV4jjCzGTpSvE8YWQyc/H7jiO04/IzMReT/XsWnmsXfVd\nx7TyWLqsPKG6rDyWLitPWK7MTOyO4zhOMrzG7gD1VXuspSvE8YWYydIV4vhCyOQ1dsdxnH5EZib2\neq5nh5gptgXmsXRZeSxdVp5QXVYeS5eVJyxXZiZ2x3EcJxleY3eA+qo91tIV4vhCzGTpCnF8IWTy\n3zytU/z3RR3HKUZmSjH1XM+uzlPuNADVTOjV5KqFx9Jl5bF0WXlCdVl5LF1WnrBcmZnYHcdxnGR4\njT3DhFDny0ImS1eI4wsxk6UrxPGFkMmPY3ccx+lHZGZiD6OeXRtXmMeeW7qsPJYuK4+ly8oTqsvK\nY+my8oTlKjuxi0iLiKwRkXUiMq3I9r8TkZUi8pSI/E/8+6eJ+jqO4zj2lPvN0wai3zw9jeiHrZex\n92+engSsUtVtItICXK+q45P0jft7j
b1CQqjzZSGTpSvE8YWYydIV4vhCyFRNjX0ssF5VN6hqB7AA\nOC9/B1X9japui5uPA4cl7es4juPYU25iPxR4Ma/9UryuJy4FflFh35KEWM8OMVNIdT57j6XLymPp\nsvKE6rLyWLqsPGG5yn3zNHGNREQmAp8DTk7bt7W1laamJgAaGxtpbm4ml8sBe096ne3C7UnbK1as\nqKp/LdorVqyouP+eB0Bne0VBu3M7JX15e/TQP2mezj75/VekzpN8fFGfcrdX92zF2qX79974Cts+\nvu701C7dv+fx1fb/pbNP9eOL9m1rawPomi97olyNfTxRzbwlbl8H7FbVWwr2Ow64F2hR1fUp+3qN\nvUJCqPNlIZOlK8TxhZjJ0hXi+ELIVE2NfTlwlIg0icgg4AJgYYF8FNGkflHnpJ60r+M4jmNPyYld\nVXcBVwKLgFXAj1R1tYhcLiKXx7v9EzAMuF1EnhSRpaX6Vho0xHp2iJlCqvPZeyxdVh5Ll5UnVJeV\nx9Jl5QnLVfbsjqr6IPBgwbrZeZf/Hvj7pH0dx3Gc2uLniskwIdT5spDJ0hXi+ELMZOkKcXwhZPJz\nxTiO4/QjMjOxh1jPDjFTSHU+e4+ly8pj6bLyhOqy8li6rDxhuTIzsTuO4zjJ8Bp7hgmhzpeFTJau\nEMcXYiZLV4jjCyGT19gdx3H6EZmZ2EOsZ4eYKaQ6n73H0mXlsXRZeUJ1WXksXVaesFyZmdgdx3Gc\nZHiNPcOEUOfLQiZLV4jjCzGTpSvE8YWQyWvsjuM4/YjMTOwh1rMrdYlIoqXCVBX2q6XLymPpsvJY\nuqw8obqsPJYuK09YrsxM7PWH5i1LCtpemnIcp3K8xt4H1FOdLwuZLF0hji/ETJauEMcXQiavsTuO\n4/QjMjOx11ONvYjJyBOqy8pj6bLyWLqsPKG6rDyWLitPWK7MTOyO4zhOMsrW2EWkBZgFNAD/UeQ3\nS48G5gJjgH9U1X/J27YB2A68DXSo6tgifq+xF98rE3W+LGSydIU4vhAzWbpCHF8ImUrV2Ev+gpKI\nNAC3AqcBG4FlIrKw4CfutgBXAZOKKBTIqeqfy6R2HMdxjChXihkLrFfVDaraASwAzsvfQVU3qepy\noKMHR6UHZHfDa+xZdll5LF1WHkuXlSdUl5XH0mXlCctV7jdPDwVezGu/BIxL4VdgsYi8DcxW1R+k\nzBcMSb8w1N/KSo7jhEe5ib3aWepkVX1FREYAD4vIGlV9tHCn1tZWmpqaAGhsbKS5uZlcLgfseVVs\n1e5cl7Z/hLLnmbTTl9+WlL78/sXb5cdTmKdzXeH20r69rz8XL939SW7fUtdvP75k9+fe11/c37fj\nyxXJ5+Mrfv3F/JWMr7Dd/fqq/X/p7FP9+KJ929raALrmy54o+eGpiIwHrlfVlrh9HbC78APUeNsM\nYEf+h6dJtmflw1P/UCm7mSxdIY4vxEyWrhDHF0Kmar6gtBw4SkSaRGQQcAGwsMdr7n6lg0VkSHx5\nP+AM4Oky19cjIdbYvc7X2x5Ll5XH0mXlCdVl5bF0WXnCcpUsxajqLhG5ElhEdLjjHFVdLSKXx9tn\ni8ghwDJgKLBbRK4GjgUOBu6Na9MDgbtV9aGq0jqO4zhl8XPFJMTfomY3k6UrxPGFmMnSFeL4Qsjk\n54pxHMfpR2RmYvcae5ZdVh5Ll5XH0mXlCdVl5bF0WXnCcmVmYnccx3GS4TX2hHjtMbuZLF0hji/E\nTJauEMcXQiavsTuO4/QjMjOxe409yy4rj6XLymPpsvKE6rLyWLqsPGG5MjOxO47jOMnwGntCvPaY\n3UyWrhDHF2ImS1eI4wshk9fYHcdx+hGZmdi9xp5ll5XH0mXlsXRZeUJ1WXksXVaesFyZmdgdx3Gc\nZHiNPSFee8xuJktXiOMLMZOlK8TxhZCp4t88zTpJf/UI/JePHMepHzJTiqm8Nq4Fy5Ii6ypOVUXf\nW
nhCdVl5LF1WHkuXlSdUl5XH0mXlCcuVmYndcRzHSUZd19hDqIPV0hXi+ELMZOkKcXwhZrJ0hTi+\nEDJVdRy7iLSIyBoRWSci04psP1pEfiMib4rItWn6Oo7jOPaUnNhFpAG4FWgh+rm7ySJyTMFuW4Cr\ngO9U0DcxdsefW3ksXVaeUF1WHkuXlcfSZeUJ1WXlsXRZecJylXvFPhZYr6obVLUDWACcl7+Dqm5S\n1eVAR9q+juM4jj0la+wi8ingY6p6Wdy+CBinqlcV2XcGsENV/yVNX6+x1zqTpSubmSxdIY4vxEyW\nrhDHF0Kmao5jr2bGTdy3tbWVpqYmABobG2lubiaXywF7SjCVtve8pSnXpqQvb4+SvqT5yvtK5wl1\nfHv6VJcn+fiiPtXf3qX7+/gK23uylfaVztPZJ2vjS5qns4/FfNDe3k5bWxtA13zZI6ra4wKMB36Z\n174OmNbDvjOAa9P2jSJ0h70PNC+6lCPaTwuWJUXWWbnKe4q7apmpd8cXYqbirhBv8xAz+W3e+5mS\n3ebxOoot5Wrsy4GjRKRJRAYBFwALe9i38C1Bmr5F0IJlSUHbcRzHKUbZ49hF5ExgFtAAzFHVmSJy\nOYCqzhaRQ4BlwFBgN/AX4FhV3VGsbxG/FmaopzpYLV0hji/ETJauEMcXYiZLV4jjCyFTqRp7kF9Q\nqqcbv5auEMcXYiZLV4jjCzGTpSvE8YWQqU5+aKM9MI+ly8oTqsvKY+my8li6rDyhuqw8li4rT1iu\nDE3sjuM4ThK8FGPq8reoIWaydIU4vhAzWbpCHF8ImeqkFOM4juMkIUMTe3tgHkuXlSdUl5XH0mXl\nsXRZeUJ1WXksXVaesFwZmtgdx3GcJHiN3dTltccQM1m6QhxfiJksXSGOL4RMXmN3HMfpR2RoYm8P\nzGPpsvKE6rLyWLqsPJYuK0+oLiuPpcvKE5YrQxO74ziOkwSvsZu6vPYYYiZLV4jjCzGTpSvE8YWQ\nyWvsjuM4/YgMTeztgXksXVaeUF1WHkuXlcfSZeUJ1WXlsXRZecJyZWhidxzHcZLgNXZTl9ceQ8xk\n6QpxfCFmsnSFOL4QMnmN3XEcpx9RdmIXkRYRWSMi60RkWg/7/Hu8faWIjMlbv0FEnhKRJ0VkaXVR\n26vrbu6xdFl5QnVZeSxdVh5Ll5UnVJeVx9Jl5QnLNbDURhFpAG4FTgM2AstEZKGqrs7b5yzgPap6\nlIiMA24n+iFriN5f5FT1z1WldBzHcRJTssYuIicBM1S1JW5PB1DVm/P2+T6wRFV/FLfXAKeo6qsi\n8hxwgqqBr5L1AAALkklEQVRuKXEdXmOvaSZLVzYzWbpCHF+ImSxdIY4vhEzV1NgPBV7Ma78Ur0u6\njwKLRWS5iFxW5rocx3EcA0qWYkj2lARQ9FkD+IiqviwiI4CHRWSNqj5auFNraytNTU0ANDY2Fmxt\nz7uco7D21N4etXO5XNH2nv0727OA5rx2Mt/eeQqzFL/+ZL4VwDWp8oQ6vj2O/Ouv5fiiPulu705y\ne/n7dnyF2fY4fHz52fId+f5Kxlfb/5fOPtWPL9q3ra0NoGu+7BFV7XEhqpX/Mq99HTCtYJ/vA5/J\na68B/qqIawZwbZH1WgigoAXLkoL23v0q81i6ynuKu2qZqXfHF2Km4q4Qb/MQM/lt3vuZkt3m8TqK\nLeVq7AOBtcCpwMvAUmCy7v3h6ZWqepaIjAdmqep4ERkMNKjqX0RkP+Ah4AZVfajgOrQwQz3VwWrp\nCnF8IWaydIU4vhAzWbpCHF8ImUrV2EuWYlR1l4hcCSwCGoA5qrpaRC6Pt89W1V+IyFkish54Hbgk\n7n4IcG8UmoHA3YWTuuM4jlMDenop31sLPbzFsHg7mMxT27dLyVzZfTuYhUzFXSHe5iFm8tu89zMl\nu83jdRRb/JunjuM4dYafK8bU5bXHEDNZukIcX4iZLF0hji+ETH6
uGMdxnH5Ehib29sA8li4rT6gu\nK4+ly8pj6bLyhOqy8li6rDxhuTI0sTuO4zhJ8Bq7qctrjyFmsnSFOL4QM1m6QhxfCJm8xu44jtOP\nyNDE3h6Yx9Jl5QnVZeWxdFl5LF1WnlBdVh5Ll5UnLFeGJnbHcRwnCV5jN3V57THETJauEMcXYiZL\nV4jjCyGT19gdx3H6ERma2NsD81i6rDyhuqw8li4rj6XLyhOqy8pj6bLyhOXK0MTuOI7jJMFr7KYu\nrz2GmMnSFeL4Qsxk6QpxfCFk8hq74zhOP6LsxC4iLSKyRkTWici0Hvb593j7ShEZk6Zvctqr627u\nsXRZeUJ1WXksXVYeS5eVJ1SXlcfSZeUJy1VyYheRBuBWoAU4FpgsIscU7HMW8B5VPQqYCtyetG86\nVlTetSYeS1eImSxdnql3PaG6PFNvucq9Yh8LrFfVDaraASwAzivY5+PAXQCq+jjQKCKHJOybgtcq\n71oTj6UrxEyWLs/Uu55QXZ6pt1zlJvZDgRfz2i/F65Ls8+4EfR3HcRxjyk3sSQ+ZKfrJrC0bAvNY\nuqw8obqsPJYuK4+ly8oTqsvKY+my8oTlKnm4o4iMB65X1Za4fR2wW1Vvydvn+0C7qi6I22uAU4DD\ny/WN1/ft8ZaO4zgZpafDHQeW6bccOEpEmoCXgQuAyQX7LASuBBbETwSvqeqrIrIlQd8egzmO4ziV\nUXJiV9VdInIlsAhoAOao6moRuTzePltVfyEiZ4nIeuB14JJSfWs5GMdxHCeAb546juM4tpQrxfQp\nIjKB6LDJp1X1oZR9xwOrVXWbiAwGpgMfAn4PfEtVt6VwfQm4T1VfLLtzac87gc8AG1V1sYj8HfBh\nYBVwR3xYaBrfkcAngMOA3cBa4B5V3V5NTsdxsk1QpxQQkaV5ly8DvgvsD8yIP3xNw51EpSGAfwOG\nAjcDbwBzU7puBJaKyH+LyBdFZETK/p3MBc4CrhaRecCngMeInrz+I41IRK4Gvg+8M+7/TmAU8LiI\nTKwwn5MSETm4rzMUIiLD+zpDf0VEHky5/wEicrOIzBeRCwu23VZxEFUNZgGezLu8HBgRX94P+F1K\n1+q8y78t2LYybS6iJ8EziJ4wNgG/BD4LDEnheTr+OxD4EzAwbkvnthSu3wEN8eXBwH/Fl0cBK1K6\nGome9NYAW4E/x5dvBhqN7tsHU+5/QHz984ELC7bdlsIzkuhJ8+Z4nHPj224ecHDKTAcWLMOJjks7\nEDgwhael4LafAzwN3AP8VcpMt+T9n5wAPAusB14AchU8zr8OHFnlfX0isCS+70YCDwPbgGXAmJSu\nIcA3id5pbwc2A48DrX31GCd6519sOR74Y0rXvXGGvwX+D/BTYJ/O+6PS+yCoV+xAg4gcGL/iaFDV\nTQCq+jqwK6Xr9yLyufjyShE5EUBE/hr4f2mDqepuVX1IVT9H9EWr24EzgedSaAbE5ZghwL5EkxfA\nPqR/96TAO/L67xfnfCFvfVL+k+jBniOaoA4EJhJ9/e0/k0pE5EM9LMcDY8oKutP5ruqnRKej+KmI\n7BOvOymFpw1YSTSxPEZUrjoLWEp8+osUbAaeyFuWEz0WOi8nZWbe5X8BXgHOJZr4ZqfMdHbn/wnw\nHeACVX0PcFrsTkNjvCwRkWUi8mUReXdKB8BtwLeBB4DfAHfE3unxtjTcTfQ/1gJcD/w7MAX4qIh8\nK4XH5DEes4zoti1cvsOe/+mkHKmq01X1PlU9F/gt8CsROSilpzvVPDNbL0Svfp6Ll2eBd+U9a1fy\nKvSu2PM40BF7HwE+mNLV4zMnsF8Kz3VxnrVE59VZRfRq8nfAV1NmuproVd5/xL7PxesPBh5J6fpD\nJduK7Ps20Su1YssbKTOtLGj/I/A/wEGl7o8inhV5l1/oaVtC17VE79SOy1v3XBpH4eOJ6ElHehp3\nAtdq4B3x5ccKtqV9F/hk/Fe
AvyF64vtjfP9NrXB81d7mTxW0l8d/BwBrU3hMHuPx/r8H/rqHbS9W\ncP8NKFjXGl/H82kfW12OSjv25kJUaji8wr4HAM1Eb1MPqdDxXsOxNBG/bQeOJDq+P9UTTZ7r/UR1\n+qOrzPQw8FXyygDAIcA0YHEKT3AP+PyJEripYFuqiS/uMxL4MfCvRJ/bPFeB4yXgH+Inig0FE/tT\nKV1XxfffR4le0f4b0RcEbwDmpXTt9YRJVDZsAeam8CwFPgacT3Rakb+N158CPJ4y02+ACfHl84BF\nedvSTOwmj/G436d7+p8DJqV0/TNwepH1LcC6tI+trv6VdvSlfhaiGvG32VN/3Bpf/jbpasfBPeCJ\nPvje63MQ4CjgJ1XcZucRvRN8tYK+1wMz8paD4/XvAv53Bb6JROWEJ4nexT0IXE78Sj6FZ4HR42ks\n0XlnfwiMBhYT1cd/C5yQ0vVBotLHa0Tv2N4brx8BfCmFx+Qxnuc7BjgV2L9g/Zl96erqa3FH+lK/\nC3CJkedzhplMXNV6iN5JfiDg28kkk/H4LDOluq3iCfS0wid68j7MTuj5ElH582fA8/kvWkj5gael\nq5vX6kb2pT4XUpZQau3xTNl29VUm48n4d52vrolKq8uBa/ralb8E/QUlp3cQkadLbP6r3vZYunox\nU+Lj2UPMZOkK8XFAdLDC8aq6Iz5/1U9FpElVZ6XJ0xlLVXcAqOoGEcnFvtGkP9OtpasLn9gdiP5p\nW4jqjoX8ug88ninbrhAzFU6gp1D5BPonEWlW1RWxb4eInEP0fYTj+tDVhU/sDkTHG++vqk8WbhCR\n/+oDj2fKtivETJYT6MVEh093oaodIvJZomP2+8rVhZ8EzHGcukdERgIdqvrHgvUCnKyq/903yWqD\nT+yO4zh1RminFHAcx3GqxCd2x3GcOsMndsdxnDrDJ3an3yEiTWWOj3acTOMTu+MYICJ+6LATDD6x\nO/2VBhG5Q0R+JyKLRGQfEWkWkcdEZKWI3CsijQAi0h6fUx4ROUhEnosvt4rIQhH5FdHZAx0nCHxi\nd/orRwG3qur7ic4c+Emi8/f/L1X9INFZEmfE+2q8FGMM8ElV9Z8jdILBJ3anv/Kcqj4VX36C6Nz4\njar6aLzuLqIfmyjHQ6r6Wi0COk6l+MTu9Ffeyrv8NtEvbuWTf/6QXez5X9mnYL+dxrkcp2p8Ynec\niG3An0XkI3F7CtGPRUD0K0cnxJc/1buxHCc9/km+018prJkr0U/vfV9EBgPPAJfE274D/KeITCU6\nKZXm9fFzcjjB4eeKcRzHqTO8FOM4jlNn+MTuOI5TZ/jE7jiOU2f4xO44jlNn+MTuOI5TZ/jE7jiO\nU2f4xO44jlNn/H+3jsdwzVyUcgAAAABJRU5ErkJggg==\n",
       "text": [
        "<matplotlib.figure.Figure at 0x7f0a2a65b990>"
       ]
      }
     ],
     "prompt_number": 6
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "A clear pattern emerges: flights later in the day are more likely to be delayed. Perhaps this is because delays pile up as the day progresses, compounding the problem for later departures.\n",
      "\n",
      "Now let's look at delays by carrier:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# Compute the fraction of delayed flights per carrier (only carriers with more than 10 flights)\n",
      "grouped1 = df[['DepDelayed', 'Carrier']].groupby('Carrier').filter(lambda x: len(x)>10)\n",
      "grouped2 = grouped1.groupby('Carrier').mean()\n",
      "carrier = grouped2.sort(['DepDelayed'], ascending=False)\n",
      "\n",
      "# display the top 15 carriers by fraction of delayed flights (from ORD)\n",
      "carrier[:15].plot(kind='bar')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 7,
       "text": [
        "<matplotlib.axes._subplots.AxesSubplot at 0x7f0a88948690>"
       ]
      },
      {
       "metadata": {},
       "output_type": "display_data",
       "png": "iVBORw0KGgoAAAANSUhEUgAAAW8AAAEVCAYAAAAvhWSzAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAIABJREFUeJztnXu8FXW5/98PG1FQ7mqmXDakpZgczBLUOEFKYXihixdM\ncJeJlnrQ9Lwgjx1QX8ax7Hc85SUqFIKUjqZpkpefxkrLRPQYduKSqAhiKhCCiBrIc/6Y2Zu1F2v2\nZjMze8/M/rxfr3ntNZf9nmevtdfznXnmO98xd0cIIUS+6NDWAQghhGg5St5CCJFDlLyFECKHKHkL\nIUQOUfIWQogcouQthBA5pNnkbWajzWyZmT1vZpOrrB9hZhvN7NlwujKdUIUQQtTTsamVZlYD3Aic\nAKwBFpnZfe6+tGLT37n7KSnFKIQQooLmjryPBla4+0p33wrMA06tsp0lHpkQQohImkveBwGry+Zf\nCZeV48CxZrbYzH5jZoOSDFAIIcTONFk2IUjMzfE/QF9332JmJwK/Aj4cOzIhhBCRNJe81wB9y+b7\nEhx9N+Dub5W9fsDMbjazXu7+9/LtzEyDqAghxG7g7juVppsrmzwNHGJmtWbWCTgDuK98AzP7gJlZ\n+PpowCoTd1kAuzxNnTq1Rdu3dJK/7fx5jl1++VvbH0WTR97uvs3MLgIeAmqAme6+1MzOD9fPAL4E\nfN3MtgFbgDObaRB2iZUrVyahkT+D/jzHLr/8WfE3VzbB3R8AHqhYNqPs9U3ATYlEI4QQYpfI7B2W\ndXV18hfUn+fY5Zc/K35rqqaSJGbmrbUvIYQoCmaG78YFyzajVCrJX1B/nmPPmt/MNBVoagnN1ryF\nENlGZ7TFoKXJW2UTIXJMeErd1mGIBIj6LHNXNhFCCBFNmyfvtOtCUWSpbtne/HmOvQh+UQzaPHkH\neJVpQcRyIYSIR4cOHXjxxRdbdZ+zZs1i+PDhifnavOYdHEm3JAbV+ISop1qdNKmz06Zo7jtYW1vL\nG2+8QceOHampqWHQoEFMmDCBiRMnxo6vrq6OO+64gz333BOA/v37c/LJJzNlyhS6deu2S44OHTqw\nYsUKBg4cGCuWljBr1ixmzpzJ448/XnW9at5CCKqftSY1NY+Zcf/997Np0yZWrVrFlClTuO666zj3\n3HNj/2VmxuTJk9m0aRPr1q3jtttu48knn+S4445jy5Ytsf15IcPJu5SuPed1yzz78xx7EfytTdeu\nXTn55JP5xS9+wezZs1myZAnvvfcel19+Of379+eAAw7g61//Ou+++y4Q/P19+vRh+vTp7LfffgwY\nMIDbb7+9kbP+CLVTp058/OMf57777mP9+vXcdtttDdvceuutDBo0iF69ejF69GhWrVpVNb758+dz\n5JFH0r17d/r168dVV13VsG7MmDHceOONjbYfPHgw9957LwDLli1j1KhR9O7dm0MPPZQ777yzYbv1\n69dzyimn0L17d4YOHcoLL7wQ413cmQwnbyFEkfjEJz5Bnz59eOyxx5gyZQorVqxg8eLFrFixgjVr\n1nD11Vc3bPv666+zfv16Xn31VWbPns3EiRN5/vnnI9377LMPo0aNaihJ3HvvvUyfPp177rmHdevW\nMXz4cMaNGxf5u3PnzmXjxo3Mnz+fW265pSE519XVMXfu3IZtFy9ezKuvvsqYMWN4++23GTVqFGef\nfTZr165l3rx5fOMb32Dp0uApkRdeeCFdunThtdde49Zbb+W2225LtqSV5tCHFcMaejUAB2/BVN0j\nRHuk2veh5d+plk7Nfwdra2v90Ucf3Wn5sGHD/Nprr/W9997bX3jhhYblTzzxhA8YMMDd3RcsWOAd\nO3b0LVu2NKw//fTT/ZprrnF397q6Or/yyit3ck+ePNk/85nPuLv76NGjfebMmQ3r3n//fe/SpYuv\nWrXK3d3NrNH+y5k0aZJfeuml7u7+zjvveM+
ePX3FihXu7n7ZZZf5hRde6O7u8+bN8+HDhzf63YkT\nJ/pVV13l27Zt8z322MOXL1/esO6KK67wT37yk1X36R79vobLd8qpOvIWQrQaa9asYdu2bWzZsoWj\njjqKnj170rNnT0488UTWrVvXsF3Pnj3p3Llzw3z//v3529/+1qy7V69eALz88stMmjSpwd+7d++G\nbSpZuHAhI0eOZP/996dHjx7MmDGD9evXA7DXXntx+umnM2fOHNydefPmMX78+IZ9LFy4sGEfPXv2\n5Pbbb+f1119n3bp1bNu2jb59dzzLpl+/frv5rlUnw8m7lK4953XLPPvzHHsR/G3FokWLWLNmDWPH\njqVz584sWbKEDRs2sGHDBt588002bdrUsO2GDRsaXXx8+eWXOfDAAxvmK8sPmzdv5pFHHmnoitev\nXz9+/OMfN/g3bNjA22+/zbBhw3aK66yzzmLs2LG88sorvPnmm1xwwQVs3769Yf0555zDz3/+cx55\n5BG6dOnC0KFDG/bxqU99qtE+3nrrLW666Sb23XdfOnbs2KjOHlVz310ynLyFEHnGw4uKmzZt4v77\n72fcuHGMHz+ewYMHc95553HJJZewdu1aIDgifvjhhxv9/tSpU9m6dSuPP/448+fP57TTTmvw1rvf\ne+89nnnmGcaOHUvv3r35yle+AsAFF1zAd77zHZYsWQLAxo0bG11MLGfz5s307NmTTp068dRTT3H7\n7bc3ahyOOeYYzIzLL7+cCRMmNCw/6aST+Otf/8rcuXPZunUrW7duZdGiRSxbtoyamhq+8IUvMG3a\nNN555x2WLFnC7NmzVfMWQgRU+z6Qbj/BXa55d+7c2bt27erdu3f3Y4891m+++Wbfvn27u7u/++67\nfsUVV/jAgQO9W7dufthhh/kPf/hDdw9q3n369PFrr73W9913X+/fv7/PnTu3wV1XV+edOnXyrl27\n+j777OOHH364T5kyxTdu3Ngohjlz5vgRRxzh3bp18759+/q5557bsK5Dhw4NNe+77rrL+/fv7127\ndvWTTjrJL774Yh8/fnwj1zXXXONm5i+99FKj5cuXL/cxY8b4fvvt57179/bjjz/eFy9e7O7ua9eu\n9ZNOOsm7devmQ4cO9W9/+9s71cgrP7cmlu+UU3WTjhA5pogDU5VKJcaPH8/q1avbOpQG5syZw09+\n8hMee+yx1PZRoJt0Sunac163zLM/z7EXwS9axpYtW7jpppuYOHFiW4fSiAwnbyFEe6U1bvHfFR56\n6CH2339/PvjBD3LWWWe1dTiNUNlEiBxTxLJJe6VAZRMhhBBRZDh5l9K157xumWd/nmMvgl8Ugwwn\nbyGEEFGo5i1EjsnKhT2RDC2peevp8ULkGB3ItF8yXDYppWvPed0yz/48xy6//FnxZzh5CyGEiEI1\nbyGEyDDq5y2EEAUiw8m7lK49J3WtIvrzHLv88mfFn+HkLYQQIgrVvIUQIsOo5i2EEAUiw8m7lK49\nJ3WtIvrzHLv88mfFn+HkLYQQIopma95mNhq4AagBfuru10Vs9wngj8Dp7n53lfWqeQshRAvZrZq3\nmdUANwKjgUHAODM7LGK764AHAY2UI4QQKdNc2eRoYIW7r3T3rcA84NQq210M3AWsTS60UnKqavac\n1LWK6M9z7PLLnxV/c8n7IKD8Ec6vhMsaMLODCBL6LeEi1TSEECJlmqx5m9kXgdHufl44fzYw1N0v\nLtvmTuB6d19oZrOAX7v7L6u4/JxzzqG2thaAHj16MGTIEEaOHEmQ70vhliPCn1HzI3H3htZrxIhg\nveY1r3nNF2G+VCoxa9YsAGpra7nqqquq1rybS97DgGnuPjqc/xawvfyipZm9yI46977AFuA8d7+v\nwqULlkII0UJ29yadp4FDzKzWzDoBZwCNkrK7D3T3Ae4+gKDu/fXKxL17lOIrmrLnpK5VRH+eY5df\n/qz4m3y
SjrtvM7OLgIcIugrOdPelZnZ+uH5GIlEIIYRoERrbRAghMozGNhFCiAKR4eRdSteek7pW\nEf15jl1++bPiz3DyFkIIEYVq3kIIkWFU8xZCiAKR4eRdSteek7pWEf15jl1++bPiz3DyFkIIEYVq\n3kIIkWFU8xZCiAKR4eRdSteek7pWEf15jl1++bPiz3DyFkIIEYVq3kIIkWFU8xZCiAKR4eRdStee\nk7pWEf15jl1++bPiz3DyFkIIEUXha96Bv2Wopi6EyApRNe8mn6RTHFrWOLQENQ5CiLYgw2WTUo78\nXmVaELE8GfJSl2ttt/zytxd/hpO3EEKIKNpJzTu/fiFE+0b9vIUQokBkOHmX5G/KnpO6XGu75Ze/\nvfgznLyFEEJEoZp3xv1CiPaNat5CCFEgMpy8S/ITtLotnZJANW/55c+2P8PJW+yg9W8CEkJkG9W8\n27lfCJFt2vnYJiIKjc0iRD7JcNmkJH+r+dMpyxSxXi+//FnxZzh5i2Kger0QaaCat/yp+VWvFyI+\n6ucthBAFIsPJuyR/Yf1puvNTs5Rf/jhkOHkLIYSIotmat5mNBm4AaoCfuvt1FetPBa4GtofTv7r7\nb6t4VPNuZ349n1SI+ETVvJtM3mZWAywHTgDWAIuAce6+tGybvd397fD1EcA97n5wFZeSdzvz5zl2\nIbLC7l6wPBpY4e4r3X0rMA84tXyD+sQdsg+wLm6wAaVkNPJn0J+mO31/Xmqi8hfb31zyPghYXTb/\nSrisEWY21syWAg8A/5JIZEIIISJprmzyRWC0u58Xzp8NDHX3iyO2H05QF/9IlXUqm7Qzf55j3+Fv\nGSrLiKTZ3bFN1gB9y+b7Ehx9V8XdHzezjmbW293XV66vq6ujtrYWgB49ejBkyJCytaXw54hm5sO5\n8NRjxIgRTc7LL388f30y3hX/yBb7Na/5yvlSqcSsWbMAGvJlVdw9ciJI7i8AtUAn4E/AYRXbfIgd\nR/AfA16IcHk1AAevMi2IWF7dE4X8befPc+yt4Y9iwYIFiXjkL4Y//L/aKac2eeTt7tvM7CLgIYKu\ngjPdfamZnR+unwF8EZhgZluBzcCZTTmFEELER2ObyJ+aP8+xt4ZfiF1B43kLkTF0QVTEIcO3x5fk\nL6w/TXfe/F5lWhCxPBny0o9Z/qbJcPIWQggRhWre8qfmz3PsRfCLYqDxvIUQokBkOHmX5C+sP023\n/PXoGaLF9mc4eQsh4tP6F0RF66Cat/yp+fMcu/wiK6jmLYQQBSLDybskf2H9abrlby2/aupt689w\n8hZCZB/V1NsK1bzlT82f59jlb3u/CFDNWwghCkSGk3dJ/sL603TLX3R/XmrSafsznLyFEEJEoZq3\n/Kn58xy7/G3vFwGqeQshRIHIcPIuyV9Yf5pu+Yvuz0tNOm1/hpO3EEKIKFTzlj81f55jl7/t/SJA\nNW8hhCgQGU7eJfkL60/TLX9R/Bo7pWkynLyFEEJjp0Shmrf8qfnzHLv8xffnBdW8hRCiQGQ4eZfk\nL6w/Tbf88se0q+YthBAiLVTzlj81f55jl7/4/rygmrcQQhSIDCfvkvyF9afpll/+mHbVvIUQQqSF\nat7yp+bPc+zyF9+fF1TzFkKIApHh5F2Sv7D+NN3yyx/Trpq3EEKItFDNW/7U/HmOXf7i+/NCrJq3\nmY02s2Vm9ryZTa6y/stmttjMnjOzP5jZ4CSCFkIIUZ1mk7eZ1QA3AqOBQcA4MzusYrMXgX9298HA\nNcCP44dWiq+QP6P+NN3yyx/TXqCa99HACndf6e5bgXnAqeUbuPsf3X1jOLsQ6JNIdEIIIarSbM3b\nzL4EfNbdzwvnzwaGuvvFEdtfDnzY3SdWLFfNu5358xy7/MX354WomnfHXfjdXX43zGwk8FXguGrr\n6+rqqK2tBaBHjx4MGTKkbG0p/DmimflwLjz1GDFiRJPz8ssvv/zV5keOH
ElLqW8cdjX+3ZkvlUrM\nmjULoCFfRgbT1AQMAx4sm/8WMLnKdoOBFcDBER6vBuDgVaYFEcure6KQv+38eY5dfvnj+qNYsGBB\ni+N03zmn7krN+2ngEDOrNbNOwBnAfeUbmFk/4G7gbHdfsQtOIYQQMdilft5mdiJwA1ADzHT36WZ2\nPoC7zzCznwKfB1aFv7LV3Y+ucHi1feW9biZ/27jllz/r/qSIqnnrJh35U/PnOXb55Y/rT4ocDkxV\nkr+w/jTd8sufbb/GNhFCiHaMyibyp+bPc+zyyx/XnxRx+nkLIYRoIUHj0DJa0jhkuGxSkr+w/jTd\n8sufJb9XmRZELG8ZGU7eQggholDNW/7U/HmOXX75s+LPYVdBIYQQUWQ4eZfkL6w/Tbf88rcPf4aT\ntxBCiChU85Y/NX+eY5df/qz4VfMWQogCkeHkXZK/sP403fLL3z78GU7eQggholDNW/7U/HmOXX75\ns+JXzVsIIQpEhpN3Sf7C+tN0yy9/+/BnOHkLIYSIQjVv+VPz5zl2+eXPil81byGEKBAZTt4l+Qvr\nT9Mtv/ztw5/h5C2EECIK1bzlT82f59jllz8rftW8hRCiQGQ4eZfkL6w/Tbf88rcPf4aTtxBCiChU\n85Y/NX+eY5df/qz4VfMWQogCkeHkXZK/sP403fLL3z78GU7eQggholDNW/7U/HmOXX75s+JXzVsI\nIQpEhpN3Sf7C+tN0yy9/+/BnOHkLIYSIQjVv+VPz5zl2+eXPil81byGEKBDNJm8zG21my8zseTOb\nXGX9oWb2RzN718wuSy60UnIq+TPmT9Mtv/ztw9+xqZVmVgPcCJwArAEWmdl97r60bLP1wMXA2EQi\nEkII0SxN1rzN7BhgqruPDuenALj7f1TZdiqw2d2/H+FSzbud+fMcu/zyZ8W/uzXvg4DVZfOvhMuE\nEEK0IU2WTWhZs9EsdXV11NbWAtCjRw+GDBlStrYU/hwR/rwBGFI2X6KcUimYHzFiRJPz8mfRX75O\nfvnlL19WKpWYNWsWQEO+rIq7R07AMODBsvlvAZMjtp0KXNaEy6sBOHiVaUHE8uqeKORvO3+eY5df\n/qz4w+VUTs3VvDsCy4HjgVeBp4Bx3viCZf2204C3XDVv+VvBLb/87cUfVfNusmzi7tvM7CLgIaAG\nmOnuS83s/HD9DDM7AFgEdAO2m9kkYJC7b25B1EIIIVpAs/283f0Bd/+Iux/s7tPDZTPcfUb4+jV3\n7+vu3d29p7v3SyZxl+Ir5M+oP023/PK3D7/usBRCiByisU3kT82f59jllz8rfo1tIoQQBSLDybsk\nf2H9abrll799+DOcvIUQQkShmrf8qfnzHLv88mfFr5q3EEIUiAwn75L8hfWn6ZZf/vbhz3DyFkII\nEYVq3vKn5s9z7PLLnxW/at5CCFEgMpy8S/IX1p+mW37524c/w8lbCCFEFKp5y5+aP8+xyy9/Vvyq\neQshRIHIcPIuyV9Yf5pu+eVvH/4MJ28hhBBRqOYtf2r+PMcuv/xZ8avmLYQQBSLDybskf2H9abrl\nl799+DOcvIUQQkShmrf8qfnzHLv88mfFr5q3EEIUiAwn75L8hfWn6ZZf/vbhz3DyFkIIEYVq3vKn\n5s9z7PLLnxW/at5CCFEgMpy8S/IX1p+mW37524c/w8lbCCFEFKp5y5+aP8+xyy9/VvyqeQshRIHI\ncPIuyV9Yf5pu+eVvH/4MJ28hhBBRqOYtf2r+PMcuv/xZ8avmLYQQBaLZ5G1mo81smZk9b2aTI7b5\nQbh+sZkdmUxopWQ08mfQn6Zbfvnbh7/J5G1mNcCNwGhgEDDOzA6r2OZzwMHufggwEbglkcj4UzIa\n+TPoz3Ps8sufDX9zR95HAyvcfaW7bwXmAadWbHMKMBvA3RcCPczsA/FDezO+Qv6M+vMcu/zyZ8Pf\nXPI+CFhdNv9KuKy5bfrED00IIUQUz
SXvXb1UWnklNIEuLCvjK+TPqD9Nt/zytw9/k10FzWwYMM3d\nR4fz3wK2u/t1Zdv8CCi5+7xwfhnwKXd/vcLVOn0ShRCiYFTrKtixmd95GjjEzGqBV4EzgHEV29wH\nXATMC5P9m5WJO2rnQgghdo8mk7e7bzOzi4CHgBpgprsvNbPzw/Uz3P03ZvY5M1sBvA18JfWohRCi\nndNqd1gKIYRIDt1hKYQQOaS5mnfqmNkS4HbgDnd/oa3jiYOZ7QPg7psTdN4M3O7uv0/KuQv77Elw\n7SK3p2Vmtkd4b0JczxHA5cDhwB7Ac8AP3H2RmXV0920x3F8k6JllZT/rcXe/e/cjBzPrD7zh7u+Y\nWQegDvgY8BfgJ3FiLwJmth5YCPwBeAJY6O5bWmnfQ8P7YnabLBx5nwXsAzxsZovM7FIzOzApuZnV\nmlmPsvlPh7fzf9PMOiW0j2+Y2SpgFbDKzFaZ2YVJuIG/At8zs5fN7LvJDT8QYGZT6++aNbM9zWwB\n8ALwupmNSnJf4T72MbPxZjY/BbeZ2QlmNpPgfoO4vlOBu4HfAecCE4AFwFwz+zzBQUccTg6nk4Cf\nhD/rp5NjugF+w44G4T+AzwFPEtx89+ME/ITXux4zs/Xh9DszG5OQ++yy18dVrLsogV0MBP4L6ARc\nAaw2s2fM7L/M7IwE/E1xV2yDu2dmAoYBNxAkwQXAxAScTwEHhq+HAOuBy4CfAT9NwH8lwZdkYNmy\ngcD9wLcTfG9qgSnAs8ByYCrw4QS8S9hx7WMiwcALNcBhwKKEYt8T+AJwJ7AJmAWcnOB7cwzwg/D/\nZjPBEWavBLzPAbURn8V7wPQE/4Znk3KVf7Zlr/8HqCn/2xLwn0fQI+3TQPdw+nT4nTs/yfek8v1J\n6f3aG7iY4OBle9L+in2tju1IM8Dd/KMMGEkwAMA/EvA9V/b6euC74esOwJ8T8P8V6FxleWfg+ZTe\noyPD9+f9BFzlX5C7gQuqrdtN92fDRL0qbCxPBlYm+D5MD9//B4GvAr2AlxL0L2li3fKEP9M0ktHD\nwPHh61/WN0TAvsDiBPxLgd5VlvcGliX5nqSRvIEDgdOA/wQeB35PcCR+ZrVGO+HPJnbybvOadz1m\ndjTBm/Yl4CXgRyRxatG4jng88C0Ad98ejLcbm+3u/k7lQg/qjO8nsQMAM+tIcNp7JsHfsYDg6Dsu\n/wjruq8BIwjqu1jw5nSJ6X6A4AxkmLu/Gnp/ENNZzteAZwgGQ3vA3f+R0Gdaz1Yz6+/uL5cvDGvJ\n7yW5o5T4GvAzM5tGMKDGn8zsT0APgrPP2Lj7+mrLcnJT3isEZyQ3AFPcPdHP1Mx+3cTq3nH9bZ68\nzew7BDf/bADuAI5199j1yjIWmNmdwN8I/ml/G+73QJL5Ar5qZie4+yPlC83s+HCfsTCzzxAk7DEE\np6N3EJSTkrooOomgnLE/8J/u/mK4/HME/9hx+BjBTV2/M7MXwv3UxHSW80FgVLiPG82sBHRO6mIl\nQeP4iJldS9BIAHyc4ACg6vDILaHiyz2gYt7d/ZQ4fndfBYwws0HAhwkGkFsNPO3uSRxYbDKzIe7e\naJg8M/sn4K0E/Iea2Z/D1x8qew3woQT8xwHHAmOBb5rZSoILl38keI/i5ofvN7Hu+pjutu/nbWb/\nTtDT5PmU/B0IGocDgP929zXh8uHAbHcfGNN/OHAvwSnXMwRH+kcBnwROdff/jen/LUHCvsvdN8Rx\nRfgrj8AcWAv83t1fSmgfRvAlGQd8kaDkc4+7J3LRLNzHXgQX+sYRvPePuvtZCXj/ieBsZFC4aAlw\nvbsvTsA9guD97gIcHC5eAWwBcPffxd1HlX3uC6z3BL74ZvZJ4OfAbTT+368Dznb3x2P6+9e/rFjl\nQH93fyyOv8r+aglKe5OAPu6+V4Lu/QDcfW1izgwk78kejpViZqe5+51l677j7lckuK/6I8HTCUoz\nv
3T3H8Z0HkLQMHyYxl/w5cDfPGb3RzM7DbiW4KjpuwkdUZb7p7HzQGK9CerV09z9joT31wH4N2CA\nu381pqu+4SmPfx1B4zDE3X8Wx582ZrYHwWf7VYLrAgD9CJLhFXE/azM7huC6wN+Ba4A5BPXuGmCC\nuz8Qxx/u4wDgm+w4En6JoHF7LQH3i8CM0Pd+2f6uBw5z96MS2MdhBAcW9VMPgh45f3D3WEfH4UHL\nVILhQ+rPON8HfujuV8VxQzaS97PufmTl62rzu+n/CEHCPoPgiPJO4F/dvV8cb5l/PkG97M8VywcD\n17p77C5fFvQf/3eChDqHHcnK3f3/xfVH7LMXwdFrIl0TyxrO0wiGVUui4ZzGzg1PL4KHh8RueMIy\nRmX/63pilzXM7AaCbrKXuvtb4bJuBKfbW9x9Ukz/MwQlnu4EXRFHu/uTZnYoMM/dh8T0VzY+BvQl\nucanJ0EXx2OBS4AjgEuB7wE3u/v2mP71BGM2PUHQ1/uPSVYAzOybwIkEZc6XwmUDCa7nPRj3u9vm\nNe9WYCnBRbPPhjXA+jc1KT5QmbgB3P05MxuQ0D62EnSB2wvoCsT6p90V3P3vcS/+RTScHdx9ROwA\nAXefFrHfXsCjBOWmOAwjuKh1B8HNHLAjkSdx1HMSQXfPhs/T3TeZ2QUEZ26xkjdB18CHAczsand/\nMtzHsoQuKH6PoPEZUKXxuZ6Y8YdlwvPN7BLg/xMk2mPcfXXTv7nLDHT3jQm5qjEBGFVeKnH3F83s\nywR/j5J3M3yBIIE8ZmYPEiSQJLsk9GhiXeyamZmNJviQfw0c6a13B9hIgovIcUi74axKEg1PSPkF\n0XHAfILrM39JQk7QU2mnhtjd3zezJBro8gT9bgK+SlJtfMqOvIcRHMGeCDxgZpPc/dE47pBLwkYs\n6szq6pj+jtVq3O6+Nuw9Fk8eV5AAg82s/sp057LXEPSVjoW7/wr4VVh6OJXgtGs/M7uF4KLZwzF3\n8bSZTay8+GZm57Gjh0Ic/g04LcGE0YiKK/j19CToKTMhpj7thrMqCTU8eHD7+AMECWNPdvScmebu\nN8b1A0vN7Bx3n12+0MzGA8sS8Kf63SL9xqe+G+iF4WfxkJkNAW4xs6+5e+Xw1C3lbXY+g9qb4G7a\nfYG4ybupslH8oRvauubdFoSn1V8CznT3T8d0HQDcA/yDHcn6KIK7Cj/v7rG6C5qZJdEzoAl/bcUi\nJ+iNkOT4LPUN5ziCG7B+RgINZ3MNj7svjeMP97EXQTfNMwnurLwPuLW+11JMdx+CG6PeofH/TheC\n/50ku8yRLjE5AAACsUlEQVQmjpndC9wd0ficlsA1gb7VSiThhcDzEu6t1A34F4LE/d/A9939jZjO\n9wl7DlWhs7vHOnhul8k7acJ/ppHARwmS31/c/bdtG1U2SbjhrK1YlGjDY2ZzCAak+g3wi2rXNhLY\nhxHcUn44QfxLEioJpE7eGx8AM+tNcDb+ZYKDihvS6JKbBkreQkQQnvq/HbHa3b1ba8aTRXLe+FwP\nfJ5gkK6b6y+65gUlbyFEuyRsnP9B9fpz5htnJW8hhMghWRjPWwghRAtR8hZCiByi5C2EEDlEyVvk\nHjM7wMzmmdkKM3vazOaHA4btrm9+2O9XiMyiC5Yi14Rd1Z4Abqu/aSMcFKybN/PQ5vB3qb8JqnK+\nBTF0iDtIkhAtRUfeIu+MJHhcXsPddu7+HPCsmT1iwQNlnzOzU4D6B1IvN7PZwJ+B4RXzfc1sZXgz\nEWZ2tpktNLNnzexHFgxpi5ltNrPrLXgyzbBW/puFUPIWueejVB9D5l2Cu/yOIriJpPypJgcDN7n7\nRwmGMm2YDwfQqj8SP4xg7Pdjw6FxtxPciQfBXYRPuvsQd38ihb9LiCbJwsBUQsQhqsTRAZhuwROT\ntgMHmtn+4bqX3f2psm0r5yEYQOt4gtu9nw4rKp0JnvUJwaD6v0w
gfiF2CyVvkXf+QjBWSiVfJhgZ\n7mPhKHcvsWOI3spb3qNugYfgUXnVnub0bpoDhgnRHCqbiFwTDgC2ZzgEL9BwwbIf8EaYuEcC/aMc\nUWqCBzp8ycLnD5pZLzNL5AlMQsRFyVsUgc8DJ4RdBf+X4NFcvwE+bmbPAeMJHgxRT+URc9X5cEjZ\nK4GHzWwx8DDB80qr/Y4QrYq6CgohRA7RkbcQQuQQJW8hhMghSt5CCJFDlLyFECKHKHkLIUQOUfIW\nQogcouQthBA5RMlbCCFyyP8BueSmoV9oSgYAAAAASUVORK5CYII=\n",
       "text": [
        "<matplotlib.figure.Figure at 0x7f0a2a0def90>"
       ]
      }
     ],
     "prompt_number": 7
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "As expected, some airlines are better than others. "
     ]
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "Pre-processing: using Hadoop to build a feature matrix"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "After exploring the data for a bit, we now move to building the feature matrix for our predictive model. \n",
      "\n",
      "Let's look at possible predictive variables for our model:\n",
      "- **month**: winter months should have more delays than summer months\n",
      "- **day of month**: this is likely not a very predictive variable, but let's keep it in anyway\n",
      "- **day of week**: weekend vs. weekday\n",
      "- **hour of the day**: later hours tend to have more delays\n",
      "- **Carrier**: we might expect some carriers to be more prone to delays than others\n",
      "- **Destination airport**: we expect some airports to be more prone to delays than others\n",
      "- **Distance**: interesting to see if this variable is a good predictor of delay\n",
      "\n",
      "We will also generate another feature: the number of days from the closest national holiday, on the assumption that holidays tend to be associated with more delays.\n",
      "\n",
      "We implement this \"feature generation\" process using PIG and some simple Python user-defined functions (UDFs). \n",
      "First, let's implement some Python UDFs:"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "```\n",
      "#\n",
      "# Python UDFs for our PIG script\n",
      "#\n",
      "from datetime import date\n",
      "\n",
      "# get hour-of-day from HHMM field\n",
      "@outputSchema(\"value: int\")\n",
      "def get_hour(val):\n",
      "  return int(val.zfill(4)[:2])\n",
      "\n",
      "# this array defines the dates of holiday in 2007 and 2008\n",
      "holidays = [\n",
      "        date(2007, 1, 1), date(2007, 1, 15), date(2007, 2, 19), date(2007, 5, 28), date(2007, 6, 7), date(2007, 7, 4), \\\n",
      "        date(2007, 9, 3), date(2007, 10, 8), date(2007, 11, 11), date(2007, 11, 22), date(2007, 12, 25), \\\n",
      "        date(2008, 1, 1), date(2008, 1, 21), date(2008, 2, 18), date(2008, 5, 22), date(2008, 5, 26), date(2008, 7, 4), \\\n",
      "        date(2008, 9, 1), date(2008, 10, 13), date(2008, 11, 11), date(2008, 11, 27), date(2008, 12, 25) \\\n",
      "     ]\n",
      "# get number of days from nearest holiday\n",
      "@outputSchema(\"days: int\")\n",
      "def days_from_nearest_holiday(year, month, day):\n",
      "  d = date(year, month, day)\n",
      "  x = [(abs(d-h)).days for h in holidays]\n",
      "  return min(x)\n",
      "\n",
      "```"
     ]
    },
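    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Before running the UDFs inside PIG, it can help to sanity-check their logic in plain Python. The sketch below is only an illustration under stated assumptions: it omits the `@outputSchema` decorator (which PIG's Jython runtime supplies and standard Python does not), and `sample_holidays` is a shortened, hypothetical stand-in for the full holiday list above.\n",
      "\n",
      "```python\n",
      "from datetime import date\n",
      "\n",
      "# same logic as the UDF above, minus the PIG-specific decorator\n",
      "def get_hour(val):\n",
      "    # left-pad to HHMM so that '930' becomes '0930', then keep HH\n",
      "    return int(val.zfill(4)[:2])\n",
      "\n",
      "# hypothetical, shortened holiday list used only for this check\n",
      "sample_holidays = [date(2007, 1, 1), date(2007, 7, 4), date(2007, 12, 25)]\n",
      "\n",
      "def days_from_nearest_holiday(year, month, day, holidays=sample_holidays):\n",
      "    d = date(year, month, day)\n",
      "    # absolute distance in days to every holiday, then the smallest one\n",
      "    return min(abs(d - h).days for h in holidays)\n",
      "\n",
      "print(get_hour('930'))                          # 9\n",
      "print(get_hour('1745'))                         # 17\n",
      "print(days_from_nearest_holiday(2007, 7, 1))    # 3 (July 4)\n",
      "print(days_from_nearest_holiday(2007, 12, 30))  # 5 (December 25)\n",
      "```\n",
      "\n",
      "The last call shows a limitation worth keeping in mind: a date late in December is measured only against holidays present in the list, so the list has to cover every year being processed."
     ]
    },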
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Our PIG script is relatively simple:\n",
      "1. Load the dataset (2007 or 2008) \n",
      "2. Filter out flights that were cancelled or that are NOT originating in ORD\n",
      "3. Project only variables that we want to use in the analysis \n",
      "4. Generate the output feature matrix, using the Python UDFs\n",
      "\n",
      "We can execute this script directly from IPython (the Python UDFs are separately stored in \"util.py\"):"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "%%writefile preprocess1.pig\n",
      "\n",
      "Register 'util.py' USING jython as util;\n",
      "DEFINE preprocess(year_str, airport_code) returns data\n",
      "{\n",
      "        -- load airline data from specified year (need to specify fields since it's not in HCat)\n",
      "        airline = load 'airline/delay/$year_str.csv' using PigStorage(',') \n",
      "            as (Year: int, Month: int, DayOfMonth: int, DayOfWeek: int, DepTime: chararray, \n",
      "                CRSDepTime: chararray, ArrTime, CRSArrTime, Carrier: chararray, FlightNum, TailNum, ActualElapsedTime, \n",
      "                CRSElapsedTime, AirTime, ArrDelay, DepDelay: int, Origin: chararray, Dest: chararray, Distance: int, \n",
      "                TaxiIn, TaxiOut, Cancelled: int, CancellationCode, Diverted, CarrierDelay, WeatherDelay, \n",
      "                NASDelay, SecurityDelay, LateAircraftDelay);\n",
      "\n",
      "        -- keep only flights that were not cancelled and that originate at the given airport\n",
      "        airline_flt = filter airline by Cancelled == 0 and Origin == '$airport_code';\n",
      "\n",
      "        -- Keep only fields I need\n",
      "        $data = foreach airline_flt generate DepDelay as delay, Month, DayOfMonth, DayOfWeek, \n",
      "                                             util.get_hour(CRSDepTime) as hour, Distance, Carrier, Dest,\n",
      "                                             util.days_from_nearest_holiday(Year, Month, DayOfMonth) as hdays;\n",
      "};\n",
      "\n",
      "ORD_2007 = preprocess('2007', 'ORD');\n",
      "rmf airline/fm/ord_2007_1\n",
      "store ORD_2007 into 'airline/fm/ord_2007_1' using PigStorage(',');\n",
      "\n",
      "ORD_2008 = preprocess('2008', 'ORD');\n",
      "rmf airline/fm/ord_2008_1\n",
      "store ORD_2008 into 'airline/fm/ord_2008_1' using PigStorage(',');"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Overwriting preprocess1.pig\n"
       ]
      }
     ],
     "prompt_number": 12
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Let's look at the output as the script continues to process..."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "%%bash --err pig_out --bg \n",
      "pig -f preprocess1.pig"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Starting job # 0 in a separate thread.\n"
       ]
      }
     ],
     "prompt_number": 1
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "while True:\n",
      "    line = pig_out.readline()\n",
      "    if not line: \n",
      "        break\n",
      "    sys.stdout.write(\"%s\" % line)\n",
      "    sys.stdout.flush()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Now that PIG has finished processing, we have two new files:\n",
      "1. airline/fm/ord_2007_1\n",
      "2. airline/fm/ord_2008_1\n",
      "\n",
      "(the \"1\" indicates this is the first iteration; we will work on a second iteration later).\n",
      "\n",
      "PIG is great for pre-processing raw data into a feature matrix, but it's not the only choice. We can use other tools such as HIVE, Cascading, Scalding or Spark for this type of pre-processing. We will show how to do the same type of pre-processing using Spark in the second part of this blog post."
     ]
    },
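    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "For intuition, the same filter / project / derive steps can be sketched in Pandas on a small in-memory sample. This is only an illustration, not the pipeline used in this post: the three rows in `raw` are made up, only a subset of the output columns is shown, and for the full raw dataset the PIG job remains the scalable route.\n",
      "\n",
      "```python\n",
      "import pandas as pd\n",
      "\n",
      "# made-up sample rows that reuse the raw column names from the PIG schema\n",
      "raw = pd.DataFrame({\n",
      "    'Month': [1, 1, 2],\n",
      "    'CRSDepTime': ['630', '1745', '905'],\n",
      "    'DepDelay': [20, -2, 5],\n",
      "    'Origin': ['ORD', 'ORD', 'SFO'],\n",
      "    'Dest': ['LGA', 'DEN', 'ORD'],\n",
      "    'Distance': [733, 888, 1846],\n",
      "    'Carrier': ['AA', 'UA', 'UA'],\n",
      "    'Cancelled': [0, 1, 0],\n",
      "})\n",
      "\n",
      "# filter: keep flights that were not cancelled and that depart from ORD\n",
      "flt = raw[(raw['Cancelled'] == 0) & (raw['Origin'] == 'ORD')]\n",
      "\n",
      "# project and derive: rename the kept fields and extract the hour from HHMM\n",
      "fm = pd.DataFrame({\n",
      "    'delay': flt['DepDelay'],\n",
      "    'month': flt['Month'],\n",
      "    'hour': flt['CRSDepTime'].str.zfill(4).str[:2].astype(int),\n",
      "    'distance': flt['Distance'],\n",
      "    'carrier': flt['Carrier'],\n",
      "    'dest': flt['Dest'],\n",
      "})\n",
      "print(fm)\n",
      "```\n",
      "\n",
      "Only the first sample row survives the filter: the second was cancelled and the third does not originate at ORD."
     ]
    },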
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "Iteration #1: building Logistic Regression and Random Forest models"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "We now have the files ord_2007_1 and ord_2008_1 under the 'airline/fm' folder in HDFS. \n",
      "Let's read those files into Python and prepare the training and testing (validation) datasets as Pandas DataFrame objects. \n",
      "\n",
      "Initially, we use only the numerical variables:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": true,
     "input": [
      "# read files\n",
      "cols = ['delay', 'month', 'day', 'dow', 'hour', 'distance', 'carrier', 'dest', 'days_from_holiday']\n",
      "col_types = {'delay': int, 'month': int, 'day': int, 'dow': int, 'hour': int, 'distance': int, \n",
      "             'carrier': str, 'dest': str, 'days_from_holiday': int}\n",
      "data_2007 = read_csv_from_hdfs('airline/fm/ord_2007_1', cols, col_types)\n",
      "data_2008 = read_csv_from_hdfs('airline/fm/ord_2008_1', cols, col_types)\n",
      "\n",
      "# Create training set and test set\n",
      "cols = ['month', 'day', 'dow', 'hour', 'distance', 'days_from_holiday']\n",
      "train_y = data_2007['delay'] >= 15\n",
      "train_x = data_2007[cols]\n",
      "\n",
      "test_y = data_2008['delay'] >= 15\n",
      "test_x = data_2008[cols]\n",
      "\n",
      "print train_x.shape"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "(359169, 6)\n"
       ]
      }
     ],
     "prompt_number": 3
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "So we have ~359K rows and 6 features in our model.\n",
      "\n",
      "Now we use Python's excellent Scikit-learn machine learning package to build two predictive models (Logistic Regression and Random Forest) and compare their performance. First we print the confusion matrix, which counts the true positives, true negatives, false positives and false negatives. Then, from the [confusion matrix](http://en.wikipedia.org/wiki/Confusion_matrix), we compute [precision, recall](http://en.wikipedia.org/wiki/Precision_and_recall), the F1 score and accuracy.\n",
      "Let's start with a logistic regression model and evaluate its performance on the testing dataset."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# Create logistic regression model with L2 regularization\n",
      "clf_lr = linear_model.LogisticRegression(penalty='l2', class_weight='auto')\n",
      "clf_lr.fit(train_x, train_y)\n",
      "\n",
      "# Predict output labels on test set\n",
      "pr = clf_lr.predict(test_x)\n",
      "\n",
      "# display evaluation metrics\n",
      "cm = confusion_matrix(test_y, pr)\n",
      "print(\"Confusion matrix\")\n",
      "print(pd.DataFrame(cm))\n",
      "report_lr = precision_recall_fscore_support(list(test_y), list(pr), average='micro')\n",
      "print \"\\nprecision = %0.2f, recall = %0.2f, F1 = %0.2f, accuracy = %0.2f\\n\" % \\\n",
      "        (report_lr[0], report_lr[1], report_lr[2], accuracy_score(list(test_y), list(pr)))\n"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Confusion matrix\n",
        "        0      1\n",
        "0  143858  96036\n",
        "1   36987  58449\n",
        "\n",
        "precision = 0.38, recall = 0.61, F1 = 0.47, accuracy = 0.60\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\n"
       ]
      }
     ],
     "prompt_number": 4
    },
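    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "As a sanity check, the four metrics printed above can be recomputed by hand from the confusion matrix. The sketch below copies the counts from that output and assumes scikit-learn's layout (rows are actual classes, columns are predicted classes), with class 1 (delayed) treated as the positive class; the variable names are ours.\n",
      "\n",
      "```python\n",
      "# Counts copied from the logistic regression confusion matrix above.\n",
      "# Assumed layout: rows = actual class, columns = predicted class, 1 = delayed.\n",
      "tn, fp = 143858, 96036\n",
      "fn, tp = 36987, 58449\n",
      "\n",
      "# Of all flights predicted as delayed, the fraction that really were delayed\n",
      "precision = tp / float(tp + fp)\n",
      "# Of all flights that really were delayed, the fraction we caught\n",
      "recall = tp / float(tp + fn)\n",
      "# Harmonic mean of precision and recall\n",
      "f1 = 2 * precision * recall / (precision + recall)\n",
      "# Fraction of all flights classified correctly\n",
      "accuracy = (tp + tn) / float(tp + tn + fp + fn)\n",
      "\n",
      "print('precision = %0.2f, recall = %0.2f, F1 = %0.2f, accuracy = %0.2f'\n",
      "      % (precision, recall, f1, accuracy))\n",
      "```\n",
      "\n",
      "The rounded values (0.38, 0.61, 0.47 and 0.60) agree with the line printed by the cell above."
     ]
    },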
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Our logistic regression model got overall accuracy of 60%. \n",
      "Now let's try Random Forest:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# Create Random Forest classifier with 50 trees\n",
      "clf_rf = RandomForestClassifier(n_estimators=50, n_jobs=-1)\n",
      "clf_rf.fit(train_x, train_y)\n",
      "\n",
      "# Evaluate on test set\n",
      "pr = clf_rf.predict(test_x)\n",
      "\n",
      "# print results\n",
      "cm = confusion_matrix(test_y, pr)\n",
      "print(\"Confusion matrix\")\n",
      "print(pd.DataFrame(cm))\n",
      "report_svm = precision_recall_fscore_support(list(test_y), list(pr), average='micro')\n",
      "print \"\\nprecision = %0.2f, recall = %0.2f, F1 = %0.2f, accuracy = %0.2f\\n\" % \\\n",
      "        (report_svm[0], report_svm[1], report_svm[2], accuracy_score(list(test_y), list(pr)))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Confusion matrix\n",
        "        0      1\n",
        "0  197303  42591\n",
        "1   65437  29999\n",
        "\n",
        "precision = 0.41, recall = 0.31, F1 = 0.36, accuracy = 0.68\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\n"
       ]
      }
     ],
     "prompt_number": 5
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "As we can see, Random Forest has overall better accuracy, but lower F1 score. For our problem -- we are trying to predict delays, so the higher level of true positives (197K vs. 143K) is better.\n",
      "\n",
      "With any supervised learnign algorithm, one typically needs to choose values for the parameters of the model. For example, we chose \"L1\" regularization for the logistic regression model, and 50 trees for the Random Forest. Such choices are based on some experimentation and hyperparameter tuning (http://en.wikipedia.org/wiki/Hyperparameter_optimization). We are not addressing this topic in this demo, although such choices are important to achieve the overall best model."
     ]
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "Improving our predictive model with \"One Hot Encoding\" - Iteration #2"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "It is very common in data science to work iteratively, and improve the model with each iteration. Let's see how this works.\n",
      "\n",
      "In this iteration, we improve our feature by converting existing variables that are categorical in nature (such as \"hour\", or \"month\") as well as categorical variables that are strings (like \"carrier\" and \"dest\"), into what is known as \"dummy variables\". Each \"dummy variable\" is a binary (0 or 1) that indicates whether a certain category value is \"on\" or \"off.\n",
      "\n",
      "Fortunately, scikit-learn has the [OneHotEncoder](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html) functionality to make this easy:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from sklearn.preprocessing import OneHotEncoder\n",
      "\n",
      "# read files\n",
      "cols = ['delay', 'month', 'day', 'dow', 'hour', 'distance', 'carrier', 'dest', 'days_from_holiday']\n",
      "col_types = {'delay': int, 'month': int, 'day': int, 'dow': int, 'hour': int, 'distance': int, \n",
      "             'carrier': str, 'dest': str, 'days_from_holiday': int}\n",
      "data_2007 = read_csv_from_hdfs('airline/fm/ord_2007_1', cols, col_types)\n",
      "data_2008 = read_csv_from_hdfs('airline/fm/ord_2008_1', cols, col_types)\n",
      "\n",
      "# Create training set and test set\n",
      "train_y = data_2007['delay'] >= 15\n",
      "categ = [cols.index(x) for x in 'hour', 'month', 'day', 'dow', 'carrier', 'dest']\n",
      "enc = OneHotEncoder(categorical_features = categ)\n",
      "df = data_2007.drop('delay', axis=1)\n",
      "df['carrier'] = pd.factorize(df['carrier'])[0]\n",
      "df['dest'] = pd.factorize(df['dest'])[0]\n",
      "train_x = enc.fit_transform(df)\n",
      "\n",
      "test_y = data_2008['delay'] >= 15\n",
      "df = data_2008.drop('delay', axis=1)\n",
      "df['carrier'] = pd.factorize(df['carrier'])[0]\n",
      "df['dest'] = pd.factorize(df['dest'])[0]\n",
      "test_x = enc.transform(df)\n",
      "\n",
      "print train_x.shape"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "(359169, 409)\n"
       ]
      }
     ],
     "prompt_number": 10
    },
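    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "To make the idea of dummy variables concrete, here is a minimal pure-Python sketch of one-hot encoding a single categorical column. The `one_hot` helper and the toy carrier codes are hypothetical and only illustrate the transformation; they are not how scikit-learn's `OneHotEncoder` is implemented.\n",
      "\n",
      "```python\n",
      "# Hypothetical helper for illustration; this is not scikit-learn's implementation.\n",
      "def one_hot(values):\n",
      "    # One output column per distinct category, in sorted order\n",
      "    categories = sorted(set(values))\n",
      "    # Each row has a 1 in the column of its own category and 0 elsewhere\n",
      "    rows = [[1 if v == c else 0 for c in categories] for v in values]\n",
      "    return categories, rows\n",
      "\n",
      "# Toy carrier codes, made up for this example\n",
      "categories, rows = one_hot(['AA', 'UA', 'AA', 'MQ'])\n",
      "print(categories)\n",
      "for row in rows:\n",
      "    print(row)\n",
      "```\n",
      "\n",
      "A column with k distinct values becomes k binary columns, which is why the feature matrix becomes so much wider after encoding."
     ]
    },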
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "So we can see the first 5 lines of the feature matrix. Overall, we have ~359K rows and 409 features in our model.\n",
      "Let's re-run the Random Forest model and see if this improved our model:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# Create Random Forest classifier with 50 trees\n",
      "clf_rf = RandomForestClassifier(n_estimators=50, n_jobs=-1)\n",
      "clf_rf.fit(train_x.toarray(), train_y)\n",
      "\n",
      "# Evaluate on test set\n",
      "pr = clf_rf.predict(test_x.toarray())\n",
      "\n",
      "# print results\n",
      "cm = confusion_matrix(test_y, pr)\n",
      "print(\"Confusion matrix\")\n",
      "print(pd.DataFrame(cm))\n",
      "report_svm = precision_recall_fscore_support(list(test_y), list(pr), average='micro')\n",
      "print \"\\nprecision = %0.2f, recall = %0.2f, F1 = %0.2f, accuracy = %0.2f\\n\" % \\\n",
      "        (report_svm[0], report_svm[1], report_svm[2], accuracy_score(list(test_y), list(pr)))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Confusion matrix\n",
        "        0      1\n",
        "0  216883  23011\n",
        "1   75451  19985\n",
        "\n",
        "precision = 0.46, recall = 0.21, F1 = 0.29, accuracy = 0.71\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\n"
       ]
      }
     ],
     "prompt_number": 11
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "This clearly helped -- accuracy is higher at ~70%, and true positive are also better at 216K (vs 197K previously)."
     ]
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "Enriching the model -- how more data gets a better modeling - Iteration #3"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Another common path to improve accuracy is by bringing in new types of data - enriching our dataset - and generating more features. Our idea is to layer-in weather data. We can get this data from a publicly available dataset here:  http://www.ncdc.noaa.gov/cdo-web/datasets/\n",
      "\n",
      "We will look at daily temperatures (min/max), wind speed, snow conditions and precipitation in the flight origin airport (ORD). Clearly, weather conditions in the destination airport also affect delays, but for simplicity of this demo we just include weather at the origin (ORD).\n",
      "\n",
      "First, let's re-write our PIG script to add these new features to our feature matrix:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "%%writefile preprocess2.pig\n",
      "\n",
      "register 'util.py' USING jython as util;\n",
      "\n",
      "-- Helper macro to load data and join into a feature vector per instance\n",
      "DEFINE preprocess(year_str, airport_code) returns data\n",
      "{\n",
      "    -- load airline data from specified year (need to specify fields since it's not in HCat)\n",
      "    airline = load 'airline/delay/$year_str.csv' using PigStorage(',') \n",
      "                    as (Year: int, Month: int, DayOfMonth: int, DayOfWeek: int, DepTime: chararray, CRSDepTime:chararray, \n",
      "                        ArrTime, CRSArrTime, Carrier: chararray, FlightNum, TailNum, ActualElapsedTime, CRSElapsedTime, AirTime, \n",
      "                        ArrDelay, DepDelay: int, Origin: chararray, Dest: chararray, Distance: int, TaxiIn, TaxiOut, \n",
      "                        Cancelled: int, CancellationCode, Diverted, CarrierDelay, WeatherDelay, NASDelay, \n",
      "                        SecurityDelay, LateAircraftDelay);\n",
      "\n",
      "    -- keep only instances where flight was not cancelled and originate at ORD\n",
      "    airline_flt = filter airline by Cancelled == 0 and Origin == '$airport_code';\n",
      "\n",
      "    -- Keep only fields I need\n",
      "    airline2 = foreach airline_flt generate Year as year, Month as month, DayOfMonth as day, DayOfWeek as dow,\n",
      "                        Carrier as carrier, Origin as origin, Dest as dest, Distance as distance,\n",
      "                        CRSDepTime as time, DepDelay as delay, util.to_date(Year, Month, DayOfMonth) as date;\n",
      "\n",
      "    -- load weather data\n",
      "    weather = load 'airline/weather/$year_str.csv' using PigStorage(',') \n",
      "                    as (station: chararray, date: chararray, metric, value, t1, t2, t3, time);\n",
      "\n",
      "    -- keep only TMIN and TMAX weather observations from ORD\n",
      "    weather_tmin = filter weather by station == 'USW00094846' and metric == 'TMIN';\n",
      "    weather_tmax = filter weather by station == 'USW00094846' and metric == 'TMAX';\n",
      "    weather_prcp = filter weather by station == 'USW00094846' and metric == 'PRCP';\n",
      "    weather_snow = filter weather by station == 'USW00094846' and metric == 'SNOW';\n",
      "    weather_awnd = filter weather by station == 'USW00094846' and metric == 'AWND';\n",
      "\n",
      "    joined = join airline2 by date, weather_tmin by date, weather_tmax by date, weather_prcp by date, \n",
      "                                    weather_snow by date, weather_awnd by date;\n",
      "    $data = foreach joined generate delay, month, day, dow, util.get_hour(airline2::time) as tod, distance, carrier, dest,\n",
      "                                    util.days_from_nearest_holiday(year, month, day) as hdays,\n",
      "                                    weather_tmin::value as temp_min, weather_tmax::value as temp_max,\n",
      "                                    weather_prcp::value as prcp, weather_snow::value as snow, weather_awnd::value as wind;\n",
      "};\n",
      "\n",
      "ORD_2007 = preprocess('2007', 'ORD');\n",
      "rmf airline/fm/ord_2007_2;\n",
      "store ORD_2007 into 'airline/fm/ord_2007_2' using PigStorage(',');\n",
      "\n",
      "ORD_2008 = preprocess('2008', 'ORD');\n",
      "rmf airline/fm/ord_2008_2;\n",
      "store ORD_2008 into 'airline/fm/ord_2008_2' using PigStorage(',');"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Overwriting preprocess2.pig\n"
       ]
      }
     ],
     "prompt_number": 19
    },
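    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Conceptually, the Pig script above filters the weather file down to one relation per metric and then joins each of them to the flights on `date`. The pure-Python sketch below mimics that on toy records, using two metrics only. All values and the date format are made up for illustration, and `by_date` is a hypothetical helper. Because Pig's `join` is an inner join by default, a flight is kept only if every metric has an observation for its date.\n",
      "\n",
      "```python\n",
      "# Toy records; all values and the date format are made up for illustration.\n",
      "flights = [\n",
      "    {'date': '20070101', 'carrier': 'AA', 'delay': 12},\n",
      "    {'date': '20070102', 'carrier': 'UA', 'delay': 45},\n",
      "    {'date': '20070103', 'carrier': 'MQ', 'delay': 0},\n",
      "]\n",
      "# (station, date, metric, value), same field order as the weather load statement\n",
      "weather = [\n",
      "    ('USW00094846', '20070101', 'TMIN', -20),\n",
      "    ('USW00094846', '20070101', 'TMAX', 30),\n",
      "    ('USW00094846', '20070102', 'TMIN', -50),\n",
      "    ('USW00094846', '20070102', 'TMAX', 10),\n",
      "    ('USW00094846', '20070103', 'TMIN', -10),\n",
      "]\n",
      "\n",
      "def by_date(records, station, metric):\n",
      "    # Hypothetical helper: index one metric of one station by date\n",
      "    return dict((d, v) for (s, d, m, v) in records if s == station and m == metric)\n",
      "\n",
      "tmin = by_date(weather, 'USW00094846', 'TMIN')\n",
      "tmax = by_date(weather, 'USW00094846', 'TMAX')\n",
      "\n",
      "# Inner join: keep a flight only if both metrics exist for its date\n",
      "joined = [dict(f, temp_min=tmin[f['date']], temp_max=tmax[f['date']])\n",
      "          for f in flights if f['date'] in tmin and f['date'] in tmax]\n",
      "\n",
      "for row in joined:\n",
      "    print(row)\n",
      "```\n",
      "\n",
      "The third toy flight is dropped because no TMAX observation exists for its date, so days with a missing metric silently reduce the number of rows in the feature matrix."
     ]
    },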
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "%%bash --bg --err pig_out2 \n",
      "pig -f preprocess2.pig"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Starting job # 6 in a separate thread.\n"
       ]
      }
     ],
     "prompt_number": 20
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "while True:\n",
      "    line = pig_out2.readline()\n",
      "    if not line:\n",
      "        break\n",
      "    sys.stdout.write(\"%s\" % line)\n",
      "    sys.stdout.flush()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "15/01/28 12:34:57 INFO pig.ExecTypeProvider: Trying ExecType : LOCAL\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "15/01/28 12:34:57 INFO pig.ExecTypeProvider: Trying ExecType : MAPREDUCE\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "15/01/28 12:34:57 INFO pig.ExecTypeProvider: Picked MAPREDUCE as the ExecType\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:34:57,267 [main] INFO  org.apache.pig.Main - Apache Pig version 0.14.0.2.2.0.0-2041 (rexported) compiled Nov 19 2014, 15:24:46\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:34:57,268 [main] INFO  org.apache.pig.Main - Logging error messages to: /home/demo/airline-demo/pig_1422477297265.log\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:34:58,425 [main] INFO  org.apache.pig.impl.util.Utils - Default bootup file /home/demo/.pigbootup not found\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:34:58,700 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://ds-master.cloud.hortonworks.com:8020\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:34:59,817 [main] INFO  org.apache.pig.scripting.jython.JythonScriptEngine - created tmp python.cachedir=/tmp/pig_jython_7325793351920775801\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:35:01,620 [main] WARN  org.apache.pig.scripting.jython.JythonScriptEngine - pig.cmd.args.remainders is empty. This is not expected unless on testing.\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:35:03,872 [main] INFO  org.apache.pig.scripting.jython.JythonScriptEngine - Register scripting UDF: util.get_hour\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:35:03,875 [main] INFO  org.apache.pig.scripting.jython.JythonScriptEngine - Register scripting UDF: util.to_date\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:35:03,877 [main] INFO  org.apache.pig.scripting.jython.JythonScriptEngine - Register scripting UDF: util.days_from_nearest_holiday\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:35:04,588 [main] INFO  org.apache.pig.scripting.jython.JythonFunction - Schema 'date: chararray' defined for func to_date\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:35:04,916 [main] INFO  org.apache.pig.tools.grunt.GruntParser - Waited 0ms to delete file\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:35:05,702 [main] WARN  org.apache.pig.newplan.BaseOperatorPlan - Encountered Warning IMPLICIT_CAST_TO_CHARARRAY 5 time(s).\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:35:05,929 [main] INFO  org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: HASH_JOIN,FILTER\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:35:05,974 [main] INFO  org.apache.pig.data.SchemaTupleBackend - Key [pig.schematuple] was not set... will not generate code.\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:35:06,027 [main] INFO  org.apache.pig.newplan.logical.optimizer.LogicalPlanOptimizer - {RULES_ENABLED=[AddForEach, ColumnMapKeyPrune, ConstantCalculator, GroupByConstParallelSetter, LimitOptimizer, LoadTypeCastInserter, MergeFilter, MergeForEach, PartitionFilterOptimizer, PredicatePushdownOptimizer, PushDownForEachFlatten, PushUpFilter, SplitFilter, StreamTypeCastInserter]}\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:35:06,101 [main] INFO  org.apache.pig.newplan.logical.rules.ColumnPruneVisitor - Columns pruned for macro_preprocess_weather_0: $4, $5, $6, $7\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:35:06,106 [main] INFO  org.apache.pig.newplan.logical.rules.ColumnPruneVisitor - Columns pruned for macro_preprocess_airline_0: $4, $6, $7, $9, $10, $11, $12, $13, $14, $19, $20, $22, $23, $24, $25, $26, $27, $28\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:35:06,818 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - File concatenation threshold: 100 optimistic? false\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:35:06,870 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler$LastInputStreamingOptimizer - Rewrite: POPackage->POForEach to POPackage(JoinPackager)\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:35:06,883 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 2\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:35:06,888 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - Merged 1 diamond splitter.\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:35:06,888 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - Merged 1 out of total 2 MR operators.\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:35:06,889 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 1\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:35:07,900 [main] INFO  org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl - Timeline service address: http://ds-master.cloud.hortonworks.com:8188/ws/v1/timeline/\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:35:08,097 [main] INFO  org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at ds-master.cloud.hortonworks.com/172.24.70.17:8050\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:35:08,424 [main] INFO  org.apache.pig.tools.pigstats.mapreduce.MRScriptState - Pig script settings are added to the job\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:35:08,435 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:35:08,442 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Reduce phase detected, estimating # of required reducers.\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:35:08,445 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Using reducer estimator: org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.InputSizeReducerEstimator\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:35:08,472 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.InputSizeReducerEstimator - BytesPerReducer=1000000000 maxReducers=999 totalInputFileSize=6190005583\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:35:08,474 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting Parallelism to 7\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:35:08,474 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - This job cannot be converted run in-process\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:35:08,994 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/usr/hdp/2.2.0.0-2041/pig/pig-0.14.0.2.2.0.0-2041-core-h2.jar to DistributedCache through /tmp/temp235338508/tmp-1688014395/pig-0.14.0.2.2.0.0-2041-core-h2.jar\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:35:10,077 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/usr/hdp/2.2.0.0-2041/pig/lib/jython-standalone-2.5.3.jar to DistributedCache through /tmp/temp235338508/tmp643938485/jython-standalone-2.5.3.jar\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:35:10,155 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/usr/hdp/2.2.0.0-2041/pig/lib/automaton-1.11-8.jar to DistributedCache through /tmp/temp235338508/tmp-1413505763/automaton-1.11-8.jar\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:35:10,517 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/usr/hdp/2.2.0.0-2041/pig/lib/antlr-runtime-3.4.jar to DistributedCache through /tmp/temp235338508/tmp489706678/antlr-runtime-3.4.jar\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:35:10,616 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/usr/hdp/2.2.0.0-2041/hadoop/lib/guava-11.0.2.jar to DistributedCache through /tmp/temp235338508/tmp1130991621/guava-11.0.2.jar\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:35:10,702 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/usr/hdp/2.2.0.0-2041/hadoop-mapreduce/joda-time-2.5.jar to DistributedCache through /tmp/temp235338508/tmp-1468876033/joda-time-2.5.jar\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:35:10,811 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/tmp/PigScriptUDF-3f91fbfefba602bf28492c3cd7f8b54c.jar to DistributedCache through /tmp/temp235338508/tmp-117856993/PigScriptUDF-3f91fbfefba602bf28492c3cd7f8b54c.jar\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:35:10,873 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up single store job\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:35:10,882 [main] INFO  org.apache.pig.data.SchemaTupleFrontend - Key [pig.schematuple] is false, will not generate code.\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:35:10,883 [main] INFO  org.apache.pig.data.SchemaTupleFrontend - Starting process to move generated code to distributed cacche\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:35:10,884 [main] INFO  org.apache.pig.data.SchemaTupleFrontend - Setting key [pig.schematuple.classes] with classes to deserialize []\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:35:11,235 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map-reduce job(s) waiting for submission.\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:35:11,477 [JobControl] INFO  org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl - Timeline service address: http://ds-master.cloud.hortonworks.com:8188/ws/v1/timeline/\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:35:11,480 [JobControl] INFO  org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at ds-master.cloud.hortonworks.com/172.24.70.17:8050\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:35:11,584 [JobControl] WARN  org.apache.hadoop.mapreduce.JobSubmitter - No job jar file set.  User classes may not be found. See Job or Job#setJar(String).\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:35:11,684 [JobControl] INFO  org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:35:11,686 [JobControl] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:35:11,721 [JobControl] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 9\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:35:11,727 [JobControl] INFO  org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:35:11,728 [JobControl] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:35:11,736 [JobControl] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 9\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:35:11,742 [JobControl] INFO  org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:35:11,743 [JobControl] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:35:11,751 [JobControl] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 9\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:35:11,756 [JobControl] INFO  org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:35:11,757 [JobControl] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:35:11,764 [JobControl] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 9\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:35:11,770 [JobControl] INFO  org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:35:11,770 [JobControl] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:35:11,776 [JobControl] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 6\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:35:11,782 [JobControl] INFO  org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:35:11,782 [JobControl] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:35:11,790 [JobControl] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 9\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:35:12,017 [JobControl] INFO  org.apache.hadoop.mapreduce.JobSubmitter - number of splits:51\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:35:12,286 [JobControl] INFO  org.apache.hadoop.mapreduce.JobSubmitter - Submitting tokens for job: job_1422408320939_0057\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:35:12,513 [JobControl] INFO  org.apache.hadoop.mapred.YARNRunner - Job jar is not present. Not adding any jar to the list of resources.\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:35:12,639 [JobControl] INFO  org.apache.hadoop.yarn.client.api.impl.YarnClientImpl - Submitted application application_1422408320939_0057\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:35:12,706 [JobControl] INFO  org.apache.hadoop.mapreduce.Job - The url to track the job: http://ds-master.cloud.hortonworks.com:8088/proxy/application_1422408320939_0057/\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:35:12,709 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_1422408320939_0057\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:35:12,709 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Processing aliases ORD_2007,macro_preprocess_airline2_0,macro_preprocess_airline_0,macro_preprocess_airline_flt_0,macro_preprocess_joined_0,macro_preprocess_weather_0,macro_preprocess_weather_awnd_0,macro_preprocess_weather_prcp_0,macro_preprocess_weather_snow_0,macro_preprocess_weather_tmax_0,macro_preprocess_weather_tmin_0\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:35:12,709 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - detailed locations: M: macro_preprocess_weather_0[24,14],macro_preprocess_weather_0[-1,-1],macro_preprocess_weather_0[-1,-1],macro_preprocess_weather_tmax_0[29,19],macro_preprocess_weather_tmax_0[-1,-1],macro_preprocess_joined_0[34,13],macro_preprocess_airline_0[8,14],macro_preprocess_airline_0[-1,-1],macro_preprocess_airline_flt_0[16,18],macro_preprocess_airline2_0[19,15],macro_preprocess_joined_0[34,13],macro_preprocess_weather_0[24,14],macro_preprocess_weather_0[-1,-1],macro_preprocess_weather_0[-1,-1],macro_preprocess_weather_snow_0[31,19],macro_preprocess_weather_snow_0[-1,-1],macro_preprocess_joined_0[34,13],macro_preprocess_weather_0[24,14],macro_preprocess_weather_0[-1,-1],macro_preprocess_weather_0[-1,-1],macro_preprocess_weather_tmin_0[28,19],macro_preprocess_weather_tmin_0[-1,-1],macro_preprocess_joined_0[34,13],macro_preprocess_weather_0[24,14],macro_preprocess_weather_0[-1,-1],macro_preprocess_weather_0[-1,-1],macro_preprocess_weather_prcp_0[30,19],macro_preprocess_weather_prcp_0[-1,-1],macro_preprocess_joined_0[34,13],macro_preprocess_weather_0[24,14],macro_preprocess_weather_0[-1,-1],macro_preprocess_weather_0[-1,-1],macro_preprocess_weather_awnd_0[32,19],macro_preprocess_weather_awnd_0[-1,-1],macro_preprocess_joined_0[34,13] C:  R: ORD_2007[36,15]\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:35:12,726 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:35:12,727 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1422408320939_0057]\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:37:27,207 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 5% complete\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:37:27,208 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1422408320939_0057]\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:37:48,270 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 10% complete\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:37:48,270 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1422408320939_0057]\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:38:00,305 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 14% complete\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:38:00,307 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1422408320939_0057]\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:38:12,341 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 19% complete\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:38:12,342 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1422408320939_0057]\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:38:22,367 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 24% complete\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:38:22,367 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1422408320939_0057]\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:38:30,387 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 29% complete\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:38:30,387 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1422408320939_0057]\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:38:36,907 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 34% complete\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:38:36,907 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1422408320939_0057]\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:38:44,926 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 38% complete\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:38:44,926 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1422408320939_0057]\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:38:52,947 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 42% complete\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:38:52,950 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1422408320939_0057]\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:38:59,970 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 47% complete\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:38:59,971 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1422408320939_0057]\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:39:07,994 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 51% complete\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:39:07,996 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1422408320939_0057]\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:39:15,016 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 55% complete\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:39:15,017 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1422408320939_0057]\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:39:22,038 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 60% complete\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:39:22,039 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1422408320939_0057]\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:39:30,063 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 64% complete\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:39:30,064 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1422408320939_0057]\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:39:38,086 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 69% complete\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:39:38,087 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1422408320939_0057]\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:39:40,092 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 79% complete\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:39:40,094 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1422408320939_0057]\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:39:42,102 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 83% complete\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:39:42,104 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1422408320939_0057]\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:39:57,145 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 87% complete\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:39:57,145 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1422408320939_0057]\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:40:05,165 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 93% complete\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:40:05,168 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1422408320939_0057]\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:40:12,187 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 97% complete\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:40:12,188 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1422408320939_0057]\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:40:23,411 [main] INFO  org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl - Timeline service address: http://ds-master.cloud.hortonworks.com:8188/ws/v1/timeline/\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:40:23,414 [main] INFO  org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at ds-master.cloud.hortonworks.com/172.24.70.17:8050\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:40:23,485 [main] INFO  org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:40:24,137 [main] INFO  org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl - Timeline service address: http://ds-master.cloud.hortonworks.com:8188/ws/v1/timeline/\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:40:24,138 [main] INFO  org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at ds-master.cloud.hortonworks.com/172.24.70.17:8050\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:40:24,147 [main] INFO  org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:40:24,557 [main] INFO  org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl - Timeline service address: http://ds-master.cloud.hortonworks.com:8188/ws/v1/timeline/\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:40:24,559 [main] INFO  org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at ds-master.cloud.hortonworks.com/172.24.70.17:8050\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:40:24,567 [main] INFO  org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:40:24,639 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:40:24,644 [main] INFO  org.apache.pig.tools.pigstats.mapreduce.SimplePigStats - Script Statistics: \n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "HadoopVersion\tPigVersion\tUserId\tStartedAt\tFinishedAt\tFeatures\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2.6.0.2.2.0.0-2041\t0.14.0.2.2.0.0-2041\tdemo\t2015-01-28 12:35:08\t2015-01-28 12:40:24\tHASH_JOIN,FILTER\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Success!\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Job Stats (time in seconds):\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "JobId\tMaps\tReduces\tMaxMapTime\tMinMapTime\tAvgMapTime\tMedianMapTime\tMaxReduceTime\tMinReduceTime\tAvgReduceTime\tMedianReducetime\tAlias\tFeature\tOutputs\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "job_1422408320939_0057\t51\t7\t75\t14\t46\t45\t154\t117\t134\t136\tORD_2007,macro_preprocess_airline2_0,macro_preprocess_airline_0,macro_preprocess_airline_flt_0,macro_preprocess_joined_0,macro_preprocess_weather_0,macro_preprocess_weather_awnd_0,macro_preprocess_weather_prcp_0,macro_preprocess_weather_snow_0,macro_preprocess_weather_tmax_0,macro_preprocess_weather_tmin_0\tHASH_JOIN\thdfs://ds-master.cloud.hortonworks.com:8020/user/demo/airline/fm/ord_2007_2,\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Input(s):\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Successfully read 31065125 records from: \"hdfs://ds-master.cloud.hortonworks.com:8020/user/demo/airline/weather/2007.csv\"\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Successfully read 31065125 records from: \"hdfs://ds-master.cloud.hortonworks.com:8020/user/demo/airline/weather/2007.csv\"\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Successfully read 31065125 records from: \"hdfs://ds-master.cloud.hortonworks.com:8020/user/demo/airline/weather/2007.csv\"\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Successfully read 31065125 records from: \"hdfs://ds-master.cloud.hortonworks.com:8020/user/demo/airline/weather/2007.csv\"\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Successfully read 7453216 records from: \"hdfs://ds-master.cloud.hortonworks.com:8020/user/demo/airline/delay/2007.csv\"\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Successfully read 31065125 records from: \"hdfs://ds-master.cloud.hortonworks.com:8020/user/demo/airline/weather/2007.csv\"\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Output(s):\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Successfully stored 359169 records (14789642 bytes) in: \"hdfs://ds-master.cloud.hortonworks.com:8020/user/demo/airline/fm/ord_2007_2\"\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Counters:\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Total records written : 359169\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Total bytes written : 14789642\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Spillable Memory Manager spill count : 0\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Total bags proactively spilled: 0\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Total records proactively spilled: 0\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Job DAG:\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "job_1422408320939_0057\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:40:24,832 [main] INFO  org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl - Timeline service address: http://ds-master.cloud.hortonworks.com:8188/ws/v1/timeline/\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:40:24,833 [main] INFO  org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at ds-master.cloud.hortonworks.com/172.24.70.17:8050\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:40:24,843 [main] INFO  org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:40:25,273 [main] INFO  org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl - Timeline service address: http://ds-master.cloud.hortonworks.com:8188/ws/v1/timeline/\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:40:25,275 [main] INFO  org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at ds-master.cloud.hortonworks.com/172.24.70.17:8050\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:40:25,284 [main] INFO  org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:40:25,507 [main] INFO  org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl - Timeline service address: http://ds-master.cloud.hortonworks.com:8188/ws/v1/timeline/\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:40:25,509 [main] INFO  org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at ds-master.cloud.hortonworks.com/172.24.70.17:8050\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:40:25,520 [main] INFO  org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:40:25,582 [main] WARN  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Encountered Warning FIELD_DISCARDED_TYPE_CONVERSION_FAILED 160755 time(s).\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:40:25,583 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:40:25,598 [main] INFO  org.apache.pig.tools.grunt.GruntParser - Waited 0ms to delete file\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:40:26,024 [main] WARN  org.apache.pig.newplan.BaseOperatorPlan - Encountered Warning IMPLICIT_CAST_TO_CHARARRAY 5 time(s).\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:40:26,039 [main] INFO  org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: HASH_JOIN,FILTER\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:40:26,118 [main] INFO  org.apache.pig.data.SchemaTupleBackend - Key [pig.schematuple] was not set... will not generate code.\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:40:26,123 [main] INFO  org.apache.pig.newplan.logical.optimizer.LogicalPlanOptimizer - {RULES_ENABLED=[AddForEach, ColumnMapKeyPrune, ConstantCalculator, GroupByConstParallelSetter, LimitOptimizer, LoadTypeCastInserter, MergeFilter, MergeForEach, PartitionFilterOptimizer, PredicatePushdownOptimizer, PushDownForEachFlatten, PushUpFilter, SplitFilter, StreamTypeCastInserter]}\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:40:26,151 [main] INFO  org.apache.pig.newplan.logical.rules.ColumnPruneVisitor - Columns pruned for macro_preprocess_weather_1: $4, $5, $6, $7\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:40:26,153 [main] INFO  org.apache.pig.newplan.logical.rules.ColumnPruneVisitor - Columns pruned for macro_preprocess_airline_1: $4, $6, $7, $9, $10, $11, $12, $13, $14, $19, $20, $22, $23, $24, $25, $26, $27, $28\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:40:26,386 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - File concatenation threshold: 100 optimistic? false\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:40:26,392 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler$LastInputStreamingOptimizer - Rewrite: POPackage->POForEach to POPackage(JoinPackager)\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:40:26,393 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 2\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:40:26,397 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - Merged 1 diamond splitter.\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:40:26,397 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - Merged 1 out of total 2 MR operators.\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:40:26,398 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 1\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:40:26,567 [main] INFO  org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl - Timeline service address: http://ds-master.cloud.hortonworks.com:8188/ws/v1/timeline/\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:40:26,569 [main] INFO  org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at ds-master.cloud.hortonworks.com/172.24.70.17:8050\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:40:26,575 [main] INFO  org.apache.pig.tools.pigstats.mapreduce.MRScriptState - Pig script settings are added to the job\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:40:26,579 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:40:26,581 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Reduce phase detected, estimating # of required reducers.\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:40:26,581 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Using reducer estimator: org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.InputSizeReducerEstimator\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:40:26,597 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.InputSizeReducerEstimator - BytesPerReducer=1000000000 maxReducers=999 totalInputFileSize=6417173424\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:40:26,597 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting Parallelism to 7\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:40:26,598 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - This job cannot be converted run in-process\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:40:26,945 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/usr/hdp/2.2.0.0-2041/pig/pig-0.14.0.2.2.0.0-2041-core-h2.jar to DistributedCache through /tmp/temp235338508/tmp-1781883650/pig-0.14.0.2.2.0.0-2041-core-h2.jar\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:40:27,111 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/usr/hdp/2.2.0.0-2041/pig/lib/jython-standalone-2.5.3.jar to DistributedCache through /tmp/temp235338508/tmp1892309556/jython-standalone-2.5.3.jar\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:40:27,148 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/usr/hdp/2.2.0.0-2041/pig/lib/automaton-1.11-8.jar to DistributedCache through /tmp/temp235338508/tmp1729188769/automaton-1.11-8.jar\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:40:27,194 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/usr/hdp/2.2.0.0-2041/pig/lib/antlr-runtime-3.4.jar to DistributedCache through /tmp/temp235338508/tmp306038830/antlr-runtime-3.4.jar\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:40:27,250 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/usr/hdp/2.2.0.0-2041/hadoop/lib/guava-11.0.2.jar to DistributedCache through /tmp/temp235338508/tmp-1720863932/guava-11.0.2.jar\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:40:27,307 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/usr/hdp/2.2.0.0-2041/hadoop-mapreduce/joda-time-2.5.jar to DistributedCache through /tmp/temp235338508/tmp-204211475/joda-time-2.5.jar\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:40:27,435 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/tmp/PigScriptUDF-3f91fbfefba602bf28492c3cd7f8b54c.jar to DistributedCache through /tmp/temp235338508/tmp-1203352129/PigScriptUDF-3f91fbfefba602bf28492c3cd7f8b54c.jar\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:40:27,452 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up single store job\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:40:27,454 [main] INFO  org.apache.pig.data.SchemaTupleFrontend - Key [pig.schematuple] is false, will not generate code.\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:40:27,455 [main] INFO  org.apache.pig.data.SchemaTupleFrontend - Starting process to move generated code to distributed cacche\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:40:27,456 [main] INFO  org.apache.pig.data.SchemaTupleFrontend - Setting key [pig.schematuple.classes] with classes to deserialize []\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:40:27,601 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map-reduce job(s) waiting for submission.\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:40:27,752 [JobControl] INFO  org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl - Timeline service address: http://ds-master.cloud.hortonworks.com:8188/ws/v1/timeline/\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:40:27,755 [JobControl] INFO  org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at ds-master.cloud.hortonworks.com/172.24.70.17:8050\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:40:27,912 [JobControl] WARN  org.apache.hadoop.mapreduce.JobSubmitter - No job jar file set.  User classes may not be found. See Job or Job#setJar(String).\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:40:28,007 [JobControl] INFO  org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:40:28,008 [JobControl] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:40:28,015 [JobControl] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 9\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:40:28,022 [JobControl] INFO  org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:40:28,022 [JobControl] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:40:28,031 [JobControl] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 9\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:40:28,038 [JobControl] INFO  org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:40:28,038 [JobControl] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:40:28,044 [JobControl] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 9\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:40:28,050 [JobControl] INFO  org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:40:28,051 [JobControl] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:40:28,059 [JobControl] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 9\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:40:28,066 [JobControl] INFO  org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:40:28,067 [JobControl] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:40:28,075 [JobControl] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 9\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:40:28,080 [JobControl] INFO  org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:40:28,081 [JobControl] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:40:28,088 [JobControl] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 6\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:40:28,516 [JobControl] INFO  org.apache.hadoop.mapreduce.JobSubmitter - number of splits:51\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:40:28,600 [JobControl] INFO  org.apache.hadoop.mapreduce.JobSubmitter - Submitting tokens for job: job_1422408320939_0058\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:40:28,610 [JobControl] INFO  org.apache.hadoop.mapred.YARNRunner - Job jar is not present. Not adding any jar to the list of resources.\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:40:28,686 [JobControl] INFO  org.apache.hadoop.yarn.client.api.impl.YarnClientImpl - Submitted application application_1422408320939_0058\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:40:28,694 [JobControl] INFO  org.apache.hadoop.mapreduce.Job - The url to track the job: http://ds-master.cloud.hortonworks.com:8088/proxy/application_1422408320939_0058/\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:40:28,696 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_1422408320939_0058\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:40:28,697 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Processing aliases ORD_2008,macro_preprocess_airline2_1,macro_preprocess_airline_1,macro_preprocess_airline_flt_1,macro_preprocess_joined_1,macro_preprocess_weather_1,macro_preprocess_weather_awnd_1,macro_preprocess_weather_prcp_1,macro_preprocess_weather_snow_1,macro_preprocess_weather_tmax_1,macro_preprocess_weather_tmin_1\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:40:28,697 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - detailed locations: M: macro_preprocess_weather_1[24,14],macro_preprocess_weather_1[-1,-1],macro_preprocess_weather_1[-1,-1],macro_preprocess_weather_tmin_1[28,19],macro_preprocess_weather_tmin_1[-1,-1],macro_preprocess_joined_1[34,13],macro_preprocess_airline_1[8,14],macro_preprocess_airline_1[-1,-1],macro_preprocess_airline_flt_1[16,18],macro_preprocess_airline2_1[19,15],macro_preprocess_joined_1[34,13],macro_preprocess_weather_1[24,14],macro_preprocess_weather_1[-1,-1],macro_preprocess_weather_1[-1,-1],macro_preprocess_weather_awnd_1[32,19],macro_preprocess_weather_awnd_1[-1,-1],macro_preprocess_joined_1[34,13],macro_preprocess_weather_1[24,14],macro_preprocess_weather_1[-1,-1],macro_preprocess_weather_1[-1,-1],macro_preprocess_weather_tmax_1[29,19],macro_preprocess_weather_tmax_1[-1,-1],macro_preprocess_joined_1[34,13],macro_preprocess_weather_1[24,14],macro_preprocess_weather_1[-1,-1],macro_preprocess_weather_1[-1,-1],macro_preprocess_weather_snow_1[31,19],macro_preprocess_weather_snow_1[-1,-1],macro_preprocess_joined_1[34,13],macro_preprocess_weather_1[24,14],macro_preprocess_weather_1[-1,-1],macro_preprocess_weather_1[-1,-1],macro_preprocess_weather_prcp_1[30,19],macro_preprocess_weather_prcp_1[-1,-1],macro_preprocess_joined_1[34,13] C:  R: ORD_2008[36,15]\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:40:28,711 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:40:28,712 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1422408320939_0058]\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:41:18,924 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 4% complete\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:41:18,925 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1422408320939_0058]\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:41:27,951 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 9% complete\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:41:27,954 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1422408320939_0058]\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:41:38,987 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 13% complete\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:41:38,989 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1422408320939_0058]\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:41:46,008 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 17% complete\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:41:46,008 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1422408320939_0058]\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:41:51,021 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 22% complete\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:41:51,021 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1422408320939_0058]\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:42:06,059 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 31% complete\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:42:06,060 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1422408320939_0058]\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:42:16,087 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 35% complete\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:42:16,089 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1422408320939_0058]\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:42:33,143 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 40% complete\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:42:33,143 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1422408320939_0058]\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:42:39,159 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 45% complete\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:42:39,161 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1422408320939_0058]\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:42:46,184 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 50% complete\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:42:46,186 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1422408320939_0058]\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:42:51,200 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 54% complete\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:42:51,201 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1422408320939_0058]\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:42:59,222 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 58% complete\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:42:59,224 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1422408320939_0058]\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:43:08,249 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 63% complete\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:43:08,251 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1422408320939_0058]\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:43:16,271 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 71% complete\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:43:16,271 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1422408320939_0058]\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:43:18,275 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 83% complete\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:43:18,276 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1422408320939_0058]\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:43:33,316 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 88% complete\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:43:33,318 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1422408320939_0058]\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:43:39,334 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 93% complete\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:43:39,336 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1422408320939_0058]\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:43:46,356 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 98% complete\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:43:46,357 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1422408320939_0058]\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:43:54,546 [main] INFO  org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl - Timeline service address: http://ds-master.cloud.hortonworks.com:8188/ws/v1/timeline/\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:43:54,552 [main] INFO  org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at ds-master.cloud.hortonworks.com/172.24.70.17:8050\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:43:54,563 [main] INFO  org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:43:54,968 [main] INFO  org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl - Timeline service address: http://ds-master.cloud.hortonworks.com:8188/ws/v1/timeline/\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:43:54,969 [main] INFO  org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at ds-master.cloud.hortonworks.com/172.24.70.17:8050\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:43:54,978 [main] INFO  org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:43:55,243 [main] INFO  org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl - Timeline service address: http://ds-master.cloud.hortonworks.com:8188/ws/v1/timeline/\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:43:55,244 [main] INFO  org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at ds-master.cloud.hortonworks.com/172.24.70.17:8050\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:43:55,256 [main] INFO  org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:43:55,302 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:43:55,304 [main] INFO  org.apache.pig.tools.pigstats.mapreduce.SimplePigStats - Script Statistics: \n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "HadoopVersion\tPigVersion\tUserId\tStartedAt\tFinishedAt\tFeatures\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2.6.0.2.2.0.0-2041\t0.14.0.2.2.0.0-2041\tdemo\t2015-01-28 12:40:26\t2015-01-28 12:43:55\tHASH_JOIN,FILTER\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Success!\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Job Stats (time in seconds):\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "JobId\tMaps\tReduces\tMaxMapTime\tMinMapTime\tAvgMapTime\tMedianMapTime\tMaxReduceTime\tMinReduceTime\tAvgReduceTime\tMedianReducetime\tAlias\tFeature\tOutputs\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "job_1422408320939_0058\t51\t7\t93\t17\t51\t51\t127\t112\t120\t120\tORD_2008,macro_preprocess_airline2_1,macro_preprocess_airline_1,macro_preprocess_airline_flt_1,macro_preprocess_joined_1,macro_preprocess_weather_1,macro_preprocess_weather_awnd_1,macro_preprocess_weather_prcp_1,macro_preprocess_weather_snow_1,macro_preprocess_weather_tmax_1,macro_preprocess_weather_tmin_1\tHASH_JOIN\thdfs://ds-master.cloud.hortonworks.com:8020/user/demo/airline/fm/ord_2008_2,\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Input(s):\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Successfully read 32534244 records from: \"hdfs://ds-master.cloud.hortonworks.com:8020/user/demo/airline/weather/2008.csv\"\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Successfully read 32534244 records from: \"hdfs://ds-master.cloud.hortonworks.com:8020/user/demo/airline/weather/2008.csv\"\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Successfully read 32534244 records from: \"hdfs://ds-master.cloud.hortonworks.com:8020/user/demo/airline/weather/2008.csv\"\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Successfully read 32534244 records from: \"hdfs://ds-master.cloud.hortonworks.com:8020/user/demo/airline/weather/2008.csv\"\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Successfully read 32534244 records from: \"hdfs://ds-master.cloud.hortonworks.com:8020/user/demo/airline/weather/2008.csv\"\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Successfully read 7009729 records from: \"hdfs://ds-master.cloud.hortonworks.com:8020/user/demo/airline/delay/2008.csv\"\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Output(s):\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Successfully stored 335330 records (13817679 bytes) in: \"hdfs://ds-master.cloud.hortonworks.com:8020/user/demo/airline/fm/ord_2008_2\"\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Counters:\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Total records written : 335330\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Total bytes written : 13817679\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Spillable Memory Manager spill count : 0\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Total bags proactively spilled: 0\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Total records proactively spilled: 0\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Job DAG:\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "job_1422408320939_0058\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:43:55,497 [main] INFO  org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl - Timeline service address: http://ds-master.cloud.hortonworks.com:8188/ws/v1/timeline/\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:43:55,497 [main] INFO  org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at ds-master.cloud.hortonworks.com/172.24.70.17:8050\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:43:55,507 [main] INFO  org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:43:55,809 [main] INFO  org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl - Timeline service address: http://ds-master.cloud.hortonworks.com:8188/ws/v1/timeline/\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:43:55,810 [main] INFO  org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at ds-master.cloud.hortonworks.com/172.24.70.17:8050\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:43:55,819 [main] INFO  org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:43:56,051 [main] INFO  org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl - Timeline service address: http://ds-master.cloud.hortonworks.com:8188/ws/v1/timeline/\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:43:56,053 [main] INFO  org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at ds-master.cloud.hortonworks.com/172.24.70.17:8050\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:43:56,062 [main] INFO  org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:43:56,114 [main] WARN  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Encountered Warning FIELD_DISCARDED_TYPE_CONVERSION_FAILED 136253 time(s).\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:43:56,115 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2015-01-28 12:43:56,148 [main] INFO  org.apache.pig.Main - Pig script completed in 8 minutes, 59 seconds and 73 milliseconds (539073 ms)\n"
       ]
      }
     ],
     "prompt_number": 21
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "We now read this data in, convert temperatures to Fahrenheit (note that the original temperatures are in tenths of a degree Celsius), and prepare the training and test datasets for modeling."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from sklearn.preprocessing import OneHotEncoder\n",
      "\n",
      "# Convert Celsius to Fahrenheit\n",
      "def fahrenheit(x): return(x*1.8 + 32.0)\n",
      "\n",
      "# read files\n",
      "cols = ['delay', 'month', 'day', 'dow', 'hour', 'distance', 'carrier', 'dest', 'days_from_holiday',\n",
      "        'origin_tmin', 'origin_tmax', 'origin_prcp', 'origin_snow', 'origin_wind']\n",
      "col_types = {'delay': int, 'month': int, 'day': int, 'dow': int, 'hour': int, 'distance': int, \n",
      "             'carrier': str, 'dest': str, 'days_from_holiday': int,\n",
      "             'origin_tmin': float, 'origin_tmax': float, 'origin_prcp': float, 'origin_snow': float, 'origin_wind': float}\n",
      "\n",
      "data_2007 = read_csv_from_hdfs('airline/fm/ord_2007_2', cols, col_types)\n",
      "data_2008 = read_csv_from_hdfs('airline/fm/ord_2008_2', cols, col_types)\n",
      "\n",
      "data_2007['origin_tmin'] = data_2007['origin_tmin'].apply(lambda x: fahrenheit(x/10.0))\n",
      "data_2007['origin_tmax'] = data_2007['origin_tmax'].apply(lambda x: fahrenheit(x/10.0))\n",
      "data_2008['origin_tmin'] = data_2008['origin_tmin'].apply(lambda x: fahrenheit(x/10.0))\n",
      "data_2008['origin_tmax'] = data_2008['origin_tmax'].apply(lambda x: fahrenheit(x/10.0))\n",
      "\n",
      "# Create training set and test set\n",
      "train_y = data_2007['delay'] >= 15\n",
      "categ = [cols.index(x) for x in 'hour', 'month', 'day', 'dow', 'carrier', 'dest']\n",
      "enc = OneHotEncoder(categorical_features = categ)\n",
      "df = data_2007.drop('delay', axis=1)\n",
      "df['carrier'] = pd.factorize(df['carrier'])[0]\n",
      "df['dest'] = pd.factorize(df['dest'])[0]\n",
      "train_x = enc.fit_transform(df)\n",
      "\n",
      "test_y = data_2008['delay'] >= 15\n",
      "df = data_2008.drop('delay', axis=1)\n",
      "df['carrier'] = pd.factorize(df['carrier'])[0]\n",
      "df['dest'] = pd.factorize(df['dest'])[0]\n",
      "test_x = enc.transform(df)\n",
      "\n",
      "print train_x.shape"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "(359169, 414)\n"
       ]
      }
     ],
     "prompt_number": 22
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Good. Now that we have the training and test (validation) sets ready, let's try Random Forest with the new features:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# Create Random Forest classifier with 100 trees, using all available cores\n",
      "clf_rf = RandomForestClassifier(n_estimators=100, n_jobs=-1)\n",
      "clf_rf.fit(train_x.toarray(), train_y)\n",
      "\n",
      "# Evaluate on test set\n",
      "pr = clf_rf.predict(test_x.toarray())\n",
      "\n",
      "# Print the confusion matrix (rows: actual class, columns: predicted class)\n",
      "cm = confusion_matrix(test_y, pr)\n",
      "print(\"Confusion matrix\")\n",
      "print(pd.DataFrame(cm))\n",
      "# Per-class scores (average=None); index 1 is the positive class (delayed flights)\n",
      "report_rf = precision_recall_fscore_support(list(test_y), list(pr), average=None)\n",
      "print \"precision = %0.2f, recall = %0.2f, F1 = %0.2f, accuracy = %0.2f\\n\" % \\\n",
      "        (report_rf[0][1], report_rf[1][1], report_rf[2][1], accuracy_score(list(test_y), list(pr)))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Confusion matrix\n",
        "        0      1\n",
        "0  226452  13442\n",
        "1   72537  22899\n",
        "precision = 0.63, recall = 0.24, F1 = 0.35, accuracy = 0.74\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\n"
       ]
      }
     ],
     "prompt_number": 23
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "With the new weather features, accuracy went up again, from 0.70 to 0.74.\n",
      "\n",
      "With more iterations, we are likely to improve accuracy even further. For example, we can add weather information at the destination airport, or explore the number of seats on the plane as a predictive feature (we can get that from the tail number), and so on."
     ]
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "Summary"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "In this blog post we have demonstrated how to build a predictive model with Hadoop and Python. We used Hadoop to perform various data pre-processing and feature engineering tasks. We then applied Scikit-learn machine learning algorithms to the resulting datasets and showed how, through successive iterations, adding new and improved features results in better model performance.\n",
      "\n",
      "In the next part of this multi-part blog post we will show how to perform the same learning task with Spark and MLlib."
     ]
    }
   ],
   "metadata": {}
  }
 ]
}