{ "metadata": { "name": "", "signature": "sha256:ce8f54ded6e9f229ca2ea9615956dd4930c213fdf6de4c15e9372a2d01ad5bfd" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "Data Science with Hadoop - Predicting airline delays - part 3: Scalding and R" ] }, { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "Introduction" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this 3rd part of the blog post on data science, we continue to demonstrate how to build a predictive model with Hadoop, highlighting various tools that can be effectively used in this process. This time we'll use Scalding for pre-processing and R for modeling.\n", "\n", "[R](http://www.r-project.org/) is a language and environment for statistical computing and graphics. It is a GNU project similar to the S language, which was developed at Bell Laboratories by John Chambers and his colleagues. R is an open-source project with more than 6000 packages available covering various topics in data science including classification, regression, clustering, anomaly detection, market basket analysis, text processing and many others. R is an extremely powerful and mature environment for statistical analysis and data science.\n", "\n", "[Scalding](https://github.com/twitter/scalding) is a Scala library that makes it easy to specify Hadoop MapReduce jobs using higher-level abstractions of a data pipeline. Scalding is built on top of [Cascading](http://www.cascading.org/), a Java library that abstracts away low-level Hadoop details. Scalding is comparable to Pig, but offers tight integration with Scala, bringing the advantages of Scala to your MapReduce jobs.\n", "\n", "Recall from the first blog post that we are constructing a predictive model for flight delays. Our source dataset resides here: http://stat-computing.org/dataexpo/2009/the-data.html, and includes details about flights in the US from the years 1987-2008. 
We will also enrich the data with weather information from: http://www.ncdc.noaa.gov/cdo-web/datasets/, where we find daily temperatures (min/max), wind speed, snow conditions and precipitation. \n", "\n", "We will build a supervised learning model to predict flight delays for flights leaving O'Hare International Airport (ORD), using the year 2007 data to build the model, and 2008 data to test its validity." ] }, { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "Data Exploration" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "R is a fantastic environment for data exploration and is often used for this purpose. With a ton of statistical and data manipulation functionality being part of core R, as well as powerful graphics packages such as ggplot2, performing data exploration in R is easy and fun.\n", "\n", "Let's first enable R cells in IPython:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "%%capture\n", "%load_ext rpy2.ipython" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 1 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Before we start our data exploration example, we load the packages we will need later. 
We will use the [RHDFS](https://github.com/RevolutionAnalytics/RHadoop/wiki) package from RHadoop to read files from HDFS; however we need the ability to read a multi-part file from HDFS into R as a single data frame, so we define the *read_csv_from_hdfs* function in R:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "%%capture\n", "%%R\n", "\n", "# load required R packages\n", "require(rhdfs)\n", "require(randomForest)\n", "require(gbm)\n", "require(plyr)\n", "require(data.table)\n", "\n", "# Initialize RHDFS package\n", "hdfs.init(hadoop='/usr/bin/hadoop')\n", "\n", "# Utility function to read a multi-part file from HDFS into an R data frame\n", "read_csv_from_hdfs <- function(filename, cols=NULL) {\n", " dir.list = hdfs.ls(filename)\n", " list.condition <- sapply(dir.list$size, function(x) x > 0)\n", " file.list <- dir.list[list.condition,]\n", " tables <- lapply(file.list$file, function(f) {\n", " content <- paste(hdfs.read.text.file(f, n = 100000L, buffer=100000000L), collapse='\\n')\n", " if (length(cols)==0) {\n", " dt = fread(content, sep=\",\", colClasses=\"character\", stringsAsFactors=F, header=T) \n", " } else {\n", " dt = fread(content, sep=\",\", colClasses=\"character\", stringsAsFactors=F, header=F) \n", " setnames(dt, names(dt), cols) \n", " }\n", " dt\n", " })\n", " rbind.fill(tables)\n", "}" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 2 }, { "cell_type": "markdown", "metadata": {}, "source": [ "We now explore the 2007 delay dataset to determine which variables are reasonable to use for this prediction task. 
First let's load the data into an R dataframe:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "%%R\n", "cols = c('year', 'month', 'day', 'dow', 'DepTime', 'CRSDepTime', 'ArrTime', 'CRSArrTime','Carrier', 'FlightNum', 'TailNum', \n", " 'ActualElapsedTime', 'CRSElapsedTime', 'AirTime', 'ArrDelay', 'DepDelay', 'Origin', 'Dest', 'Distance', 'TaxiIn', \n", " 'TaxiOut', 'Cancelled', 'CancellationCode', 'Diverted', 'CarrierDelay', 'WeatherDelay', 'NASDelay', 'SecurityDelay', \n", " 'LateAircraftDelay');\n", "flt_2007 = read_csv_from_hdfs('/user/demo/airline/delay/2007.csv', cols)\n", "\n", "print(dim(flt_2007))" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "display_data", "text": [ "\r", "Read 0.0% of 7453216 rows\r", "Read 5.0% of 7453216 rows\r", "Read 9.9% of 7453216 rows\r", "Read 14.9% of 7453216 rows\r", "Read 19.9% of 7453216 rows\r", "Read 24.8% of 7453216 rows\r", "Read 29.8% of 7453216 rows\r", "Read 34.8% of 7453216 rows\r", "Read 39.6% of 7453216 rows\r", "Read 44.4% of 7453216 rows\r", "Read 49.2% of 7453216 rows\r", "Read 54.1% of 7453216 rows\r", "Read 58.9% of 7453216 rows\r", "Read 63.9% of 7453216 rows\r", "Read 68.8% of 7453216 rows\r", "Read 73.8% of 7453216 rows\r", "Read 78.6% of 7453216 rows\r", "Read 83.6% of 7453216 rows\r", "Read 88.4% of 7453216 rows\r", "Read 93.4% of 7453216 rows\r", "Read 98.3% of 7453216 rows\r", "Read 7453216 rows and 29 (of 29) columns from 0.655 GB file in 00:00:32\n", "[1] 7453216 29\n" ] } ], "prompt_number": 3 }, { "cell_type": "markdown", "metadata": {}, "source": [ "We have 7.4M+ flights in 2007 and 29 variables.\n", "\n", "Our \"target\" variable will be *DepDelay* (departure delay in minutes). To build a classifier, we define a derived target variable by defining a \"delay\" as having 15 mins or more of delay, and \"non-delay\" otherwise. 
We thus create a new binary variable that we name *'DepDelayed'* (in the exploration code below, this 0/1 indicator is stored back into the *DepDelay* column).\n", "\n", "Let's look at some basic statistics of flights and delays (per our new definition), after limiting ourselves to flights originating from ORD:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "%%R\n", "# Keep flights departing from ORD that have a recorded departure delay\n", "df1 = flt_2007[which(flt_2007$Origin == 'ORD' & !is.na(flt_2007$DepDelay)),]\n", "# Replace the delay in minutes with a 0/1 indicator: 1 if delayed by 15 minutes or more\n", "df1$DepDelay = sapply(df1$DepDelay, function(x) (if (as.numeric(x)>=15) 1 else 0))\n", "\n", "print(paste0(\"total flights: \", as.character(dim(df1)[1])))\n", "print(paste0(\"total delays: \", as.character(sum(df1$DepDelay))))" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "display_data", "text": [ "[1] \"total flights: 359169\"\n", "[1] \"total delays: 109346\"\n" ] } ], "prompt_number": 4 }, { "cell_type": "markdown", "metadata": {}, "source": [ "The \"month\" variable is likely a good feature for modeling -- let's look at the distribution of delays (as a fraction of total flights) by month:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "%%R\n", "df2 = df1[, c('DepDelay', 'month'), with=F]\n", "df2$month = as.numeric(df2$month)\n", "# The mean of the 0/1 indicator is the fraction of delayed flights in each month\n", "df2 <- ddply(df2, .(month), summarise, mean_delay=mean(DepDelay))\n", "barplot(df2$mean_delay, names.arg=df2$month, xlab=\"month\", ylab=\"% of delays\", col=\"blue\")" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "display_data", "png": 
"iVBORw0KGgoAAAANSUhEUgAAAeAAAAHgCAIAAADytinCAAAdlElEQVR4nO3dfViUZb7A8XsYoGIQ\nhQFXsCjNQMkIM63EFzApo0LWdvdqrUZLt81stURUxLbNN+LUFmWdvbayK8Pcbc2TSG2tmdEuuttW\nRzLRECuyAgdIAkd5GeA5f8wuR3HkdOLm4Tfw/fw186i/+5aYb+Mz8zAWwzAUAEAev97eAADAOwIN\nAEIRaAAQikADgFAEGgCEItAAIBSBBgChCDQACEWgAUAoAg0AQhFoABCKQAOAUAQaAIQi0AAgFIEG\nAKEINAAIRaABQCgCDQBCEWgAEIpAA4BQBBoAhCLQACAUgQYAoQg0AAhFoAFAKAINAEIRaAAQikAD\ngFAEGgCEItAAIBSBBgChCDQACEWgAUAoAg0AQhFoABCKQAOAUAQaAIQi0AAgFIEGAKEINAAIRaAB\nQCgCDQBCEWgAEIpAA4BQBBoAhCLQACAUgQYAoQg0AAhFoAFAKAINAEIRaAAQikADgFAEGgCEItAA\nIBSBBgChCDQACEWgAUAo/97eAABoUF5eXlJSomtabGxsfHy8rmk/mMUwjN7eAwB01+233/7yy+FK\nxekY1pCc/Oddu3bpGNUtPIMG0BeEhIQoNVepy3QMq4+M3KtjTndxDhoAhCLQACAUgQYAoQg0AAhF\noAFAKAINAEIRaAAQikADgFAEGgCEItAAIBSBBgChCDQACEWgAUAoAg0AQhFoABCKQAOAUAQaAIQi\n0AAgFIEGAKEINAAIRaABQCgCDQBCEWgAEMq/tzfgk9ra2r788ktd084777zIyEhd0wD0GQT6h9i5\nc+f06bcpdYumeRsNo0nTKAB9B4H+IVpbW5VaotRyTfMOaJoDoE/hHDQACEWgAUAoAg0AQhFoABCK\nQAOAUAQaAIQi0AAgFIEGAKEINAAIRaABQCgCDQBCEWgAEIpAA4BQBBoAhCLQACBU7wS6uLi4V9YF\nAB/SO4GeNGlSr6wLAD7EpEDbbDbLKZRSHTcAAF6ZFOj3339/7NixW7ZsMQzDMAylVMcNAIBXJgV6\n9OjR77zzzssvv7xkyRK3223OogDg08w7Bz1w4MCtW7fa7faUlBTTFgUA32Xqp3r7+fllZWVdffXV\n77zzjpnrAoAvMjXQHsnJycnJyeavCwC+pRcC7VFUVJScnNzF64Tbtm1bs2ZNp4P19fV33HHHr3/9\n6x7eHQD0vl4LdFJSUtfv4khPT09PT+90cPPmzfX19T25LwCQgku9AUAokwJdV1eXlZUVGxsbEhJi\ns9liY2MzMzMbGhrMWR0AfJFJgXY4HC6Xq6CgwOl01tTUFBYWWq1Wh8NhzuoA4ItMOgddXFy8devW\nwMBAz92YmJicnJzo6GhzVgcAX2TSM+jExMSMjIyysrLGxsampqby8vKsrKyEhARzVgcAX2RSoPPz\n84OCgtLS0iIiIux2e2pqqtvt3rRpkzmrA4AvMukUR2hoaG5ubm5urjnLAUAfwNvsAEAoAg0AQhFo\nABCKQAOAUAQaAIQi0AAgFIEGAKEINAAIRaABQCgCDQBCEWgAEIpAA4BQBBoAhCLQACAUgQYAoQg0\nAAhFoAFAKAINAEIRaAAQikADgFAEGgCEItAAIBSBBgChCDQACEWgAUAoAg0AQhFoABCKQAOAUAQa\nAIQi0AAgFIEGAKEINAAIRaABQCgCDQBCEWgAEMq/tzeAzq699tpdu1qUOlfHsIrVq2evXLlSxygA\nZiPQ4px33nlK/ZdSA3UMKz5x4g0dcwD0Ak5xAIBQBBoAhCLQACAU56CBnrJq1aqTJ09qGXXs2LHF\nixePHDlSyzT4ir4c6MOHD9fX1+uaNnLkSJvNpmsa+oP8/PzDh/+oadjmpKT/JtD9TV8OdEJCwokT\n92ka9vLmzf/x85//XNM09AtDhgw5fHispmFlmubAl/TlQI8ZM
6a4+BFNwy40DEPTKAD4XniREACE\nItAAIBSBBgChCDQACEWgAUAoAg0AQhFoABCKQAOAUH35QhX0GQcPHtT1Qy0sFsvll19utVq1TAN6\nFIGGD4iLi1NqmaZh+Xv2vHrNNddomgb0IAINnzBRKV1X7Vva2to0jQJ6FuegAUAoAg0AQpkU6L/8\n5S9Dhw699NJL9+zZM27cuODg4EmTJn366afmrA4AvsikQN9///3r169/5JFHJk6cOHPmzM8+++wn\nP/nJvHnzzFkdAHyRSYGurKxMT0+fMmWKYRiLFi360Y9+tHDhwgMHDpizOgD4IpMCHRUV9e677w4Y\nMOD48eNBQUFKqffee2/48OHmrA4AvsikQD/++OM/+9nP3nzzzeDgYKVUZmbmzJkzH3/8cXNWBwBf\nZNL7oG+44YaamprW1lbP3QULFqxduzYwMNCc1QHAF5l3oYqfn19HkS+66CLT1gUAH9VrVxIWFRUl\nJyd38UmsBQUFTz/9dKeDR48enT59eg9vrS9raGjQdR2dv7//gAEDtIwC4FWvBTopKanrz8meMWPG\njBkzOh3cvHlzfX19T+6rL6uqqoqKilJqmqZ5O10ul81m0zQNQGf8LI5+xO12KzVfqf/UNO+2jhcV\nAPQEk97FUVdXl5WVFRsbGxISYrPZYmNjMzMzGxoazFkdAHyRSYF2OBwul6ugoMDpdNbU1BQWFlqt\nVofDYc7qAOCLTDrFUVxcvHXr1o53ccTExOTk5ERHR5uzOgD4IpOeQScmJmZkZJSVlTU2NjY1NZWX\nl2dlZSUkJJizOgD4IpMCnZ+fHxQUlJaWFhERYbfbU1NT3W73pk2bzFkdAHyRSac4QkNDc3Nzc3Nz\nzVkOAPoAfmA/AAhFoAFAKAINAEIRaAAQikADgFD8LA70a1u2bJk9e3ZcXJyWaVar9f3339cyClAE\nGv1cbW1tY+PzH300S8u0iRMnaZkDeHCKAwCEItAAIBSBBgChvAf6ySefTElJcbvdKSkpYWFhL7zw\ngsnbAgB4f5Fw1apVf//737dt2xYeHr579+6pU6feddddJu8MAPo578+gAwICmpubN27ceOedd1qt\nVrfbbfK2AADeA7127drJkydbrdaUlJTrrrtu9erVJm8LAOD9FMfcuXPnzp3ruV1RUWHedgAA/+b9\nGfTjjz9eWVlp8lYAAKfyHujS0tL4+PiUlJSNGzfy2dsA0Cu8B3rDhg3ffPPNwoULd+zYMWLEiFtv\nvfX1119vaWkxeXMA0J+d9UKVc845Z/z48UlJSXFxcX/+859XrVp14YUXvvbaa2ZuDgD6M++BfuKJ\nJyZPnnzppZcWFxdnZGRUV1f/85//fP311++55x6T9wcA/Zb3d3EcOHBgxYoVU6dODQwM7DgYHx//\nu9/9zqyNAUB/5/0Z9HPPPTd9+nRPnVtbW2fPnq2UCggImDlzpqm7A4B+zHugH3300XPOOcdisVgs\nloCAAKfTafK2AADeA/3YY4/94x//cDgclZWVGzZsmDZtmsnbAgB4D3Rzc3N8fPzkyZM//PDDO++8\nc+PGjSZvCwDgPdDR0dFPPPHE6NGj//CHP5SWlnKKAwDM5z3Qq1evzs/PHzduXEtLS2Ji4sMPP2zy\ntgAA3t9mN2PGjBkzZiilXn31VXP3AwD4Fz7yCgCEOu0ZtMViOdvvMwyj5zcDAPhfpwWaCgOAHJzi\nAAChvAe6srLypptuGjx48JEjR+bPn3/y5EmTtwUA8B7oZcuWjR8/vqamxm63Hzp0aNGiRSZvCwDg\nPdC7d+/2RNlms23evLmgoMDcXQEAzhJol8t1zjnneG4PGDDAarWauCUAgFJnu1BlypQp27dvV0p9\n/vnn69atu+mmm8zdFYA+qLm5+bHHHvP3956d/y/DMJYsWaJrmkze/25PPfXU3XffbbPZpkyZcsst\nt6xZs8bkbQHoe5xO58qVB
Uqt0zRv0axZs6KjozVNk8h7oCMjIwsLC03eCoB+4EqldP344gRNc+Ti\nSkIAEOq0FwmNf8vLy5szZ87Ro0ePHj06Z86cF198sZe2BwD911nPQe/bt89msymlnn766YSEBM/H\nEgIATOP9bXbfffdde3u753ZbW1tdXZ2JWwIAKHW2QF933XXz58/3nOKYP3/+9OnTTd4WAMB7oJ9+\n+mml1KWXXnrZZZcFBAQ89dRT5u4KAHCWc9B2u33Tpk0mbwUAcKq+fBEOgB9gzJgxJSVBSgXpGPbp\n0qWzcnNzdYzqjwg0gNMMHTq0pORlpQbqGFbs5/eGjjn91GnnoIcMGVJbW6uUSkjo+5foAIBwpz2D\nnjdv3siRI7/99lt1xlWFXEmIrrlcrurqal3TwsPDQ0JCdE0DfNRpgV6zZo3n5yKlp6dv27atl7YE\nn3Tfffdt3Fip1DAdw6pTUk7s2LFDxyjAh3k/B02d8f8VFBSk1G+VukzHsCMREVk65gC+zfv7oI8d\nOzZnzpzBgweHh4fPnj2bKwkBwHzeA71o0aLAwMBPPvnkwIEDAQEBDzzwgMnbAgB4P8Xx9ttvV1RU\nnHvuuUqp9evXDx8+3NxdAQDO8gwaANDrvAc6JSXlV7/6VXV1dXV19cKFC1NSUkzeFgDAe6CffPLJ\n5ubmuLi4uLi4pqamvLy8bi5z6NChxMTE8PDwBQsWtLa2KqVcLlcXH+ACAPAe6LCwsJdeeqm2tra2\ntjY/Pz8sLKyby8ydO3fmzJkHDx5sa2t78MEHuzkNAPoDk34WxyeffLJjx47zzjvvmWeeGTdu3J13\n3hkVFWXO0kDfU1paetNNN40YMULLtNra2r1792oZBb1MCvTQoUM//vjjq6++2mq15ubmzps3j08N\nB36wioqKiopfVlQs1zRvkqY50Mykd3GsXbs2JSXlF7/4hVIqJSVlwoQJV111lTlLA4CP6irQb731\n1rBhwyIjI994o7s/MDA9Pb20tPS2227z3M3JydmwYYPn534AALzq6hTHvHnz1q9ff9FFF/30pz+9\n8cYbu7lSdHR0dHS057bFYklMTExMTOzmTADowzoH+qGHHlq6dKnNZlNK+fn96/l1xyd8a1RUVJSc\nnNzFTzHduXPnli1bOh08fPgw50YA9BOdA33HHXc88MADt9xyy/XXX//888/ffffdTU1Nzz77rPaF\nk5KSuv4Z09dcc82Zl5hv377darVq3wwACNQ50CNGjPj973//0ksvLVq0KDs7u6Kiojd2pZRSNpvt\nzEAPHjy4vr6+V/YDACbz8iKhxWKZPXt2dnb22rVrX3zxRS2fpVJXV5eVlRUbGxsSEmKz2WJjYzMz\nMxsaGro/GQD6qs6B3rZt25AhQ4YNG/bxxx8/+eSTkZGR8+bNKy8v7+YyDofD5XIVFBQ4nc6amprC\nwkKr1epwOLo5FgD6sM6nOB544IE333yzoqJi7ty5R44cuf766ydOnPjII4+sXr26O8sUFxdv3bo1\nMDDQczcmJiYnJ6fjTR0AgDN1fgZ95g8wstls3ayzUioxMTEjI6OsrKyxsbGpqam8vDwrK4vPDgeA\nLnQOdF5e3g033LB48eINGzZoXCY/Pz8oKCgtLS0iIsJut6emprrd7k2bNmlcAgD6mM6nONLS0tLS\n0rQvExoampubm5ubq30yAPRVfKIKAAhFoAFAKAINAEIRaAAQikADgFAEGgCEItAAIBSBBgChCDQA\nCEWgAUAoAg0AQhFoABCKQAOAUAQaAIQi0AAgFIEGAKEINAAIRaABQCgCDQBCEWgAEIpAA4BQBBoA\nhCLQACAUgQYAoQg0AAhFoAFAKAINAEIRaAAQikADgFAEGgCEItAAIBSBBgChCDQACEWgAUAoAg0A\nQhFoABCKQAOAUAQaAIQi0AAgFIEGAKEINAAIRaABQCgCDQBCEWgAEIpAA4BQBBoAhCLQACA
UgQYA\noQg0AAhFoAFAKAINAEIRaAAQikADgFAEGgCEItAAIBSBBgChCDQACEWgAUAoAg0AQpkU6JEjR1q8\nMWd1APBFJgW6tLR03LhxhYWFxunMWR0AfJFJgbZarbNmzbLZbOYsBwB9gL9pK91///2mrQUAfQAv\nEgKAUOY9g+6kqKgoOTm5i9PQe/fu3bFjR6eDJSUlo0aN6uGtAYAIvRbopKSkrl8kHDRo0PDhwzsd\nrKysPPfcc3tyXwAgRa8F+v80bNiwYcOGdTrodrvr6+t7ZT8AYDKTzkHX1dVlZWXFxsaGhITYbLbY\n2NjMzMyGhgZzVgcAX2RSoB0Oh8vlKigocDqdNTU1hYWFVqvV4XCYszoA+CKTTnEUFxdv3bo1MDDQ\nczcmJiYnJyc6Otqc1QHAF5n0DDoxMTEjI6OsrKyxsbGpqam8vDwrKyshIcGc1QHAF5kU6Pz8/KCg\noLS0tIiICLvdnpqa6na7N23aZM7qAOCLTDrFERoampubm5uba85yANAHcCUhAAhFoAFAKAINAEIR\naAAQikADgFAEGgCEItAAIBSBBgChCDQACEWgAUAoAg0AQhFoABCKQAOAUAQaAIQi0AAgFIEGAKEI\nNAAIRaABQCgCDQBCEWgAEIpAA4BQBBoAhCLQACAUgQYAoQg0AAhFoAFAKAINAEIRaAAQikADgFAE\nGgCEItAAIBSBBgChCDQACEWgAUAoAg0AQhFoABCKQAOAUAQaAIQi0AAgFIEGAKEINAAIRaABQCgC\nDQBCEWgAEIpAA4BQBBoAhCLQACAUgQYAoQg0AAhFoAFAKAINAEIRaAAQikADgFAEGgCEItAAIBSB\nBgChCDQACEWgAUAokwJdV1eXlZUVGxsbEhJis9liY2MzMzMbGhrMWR0AfJFJgXY4HC6Xq6CgwOl0\n1tTUFBYWWq1Wh8NhzuoA4Iv8zVmmuLh469atgYGBnrsxMTE5OTnR0dHmrA4AvsikZ9CJiYkZGRll\nZWWNjY1NTU3l5eVZWVkJCQnmrA4AvsikQOfn5wcFBaWlpUVERNjt9tTUVLfbvWnTJnNWBwBfZNIp\njtDQ0Nzc3NzcXHOWA4A+wKRA/wBVVVWlpaWdDu7fv99ut3/PCRUVFUrt1LSdnUrdcvqRIqWu1DT8\n8Kl3jh49qtROpQbqmFzc1NR0+pEP9X1NSk6943K5lNqplFPH5CO1tbWnH9mvb9v/UOrGjjutra1K\n7VRqsJbRX3/99al3dX8HTjv9SLG+4Z+eeofvQKXqz/gO7B29FuiioqLk5GTDMM72G7766quPPvqo\n08H29vaRI0d+zyWWL1/ucnWe8MM0NIy8+uqrO+4mJCSsWDE2JETP8ICAzFPvLlmy5IsvDvn5aTj7\n1NRkue66n3XcHTx48IMPTrfZ9Gy7sfGnwcHBHXfnzJlzySW7AwM1DG9tbR09+t5Tj6xatVjLZKWU\nyzUxLi6u4+7UqVNXrqwODtYz3G7PPvXuihUrGhp0bXvoVVdd1XH3iiuuWLlyjK5tWyy++h0YGRnZ\ncfeuu+665JK/afk+aW9vHzVqfvfndJ+li0QCAHoRVxICgFBcSQgAQnElIQAIZdI56NDQUKfT2XEl\noVLKMIzo6OivvvrKhNUBwBdxJSEACMWVhAAgFG+zAwCheJsdAAhFoAFAKAINAEIRaAAQikADgFAE\nGgCEItAAIBSBBgChCDQACEWgAUAoAg0AQhFoABCqvwe6ra3t+38K7fdXUFAwevToQYMGTZ48+dCh\nQ3qHv/XWW3FxcYMGDYqLi9uxY4fe4Uqp/fv322w27WMTExMt/3bPPffoHd7a2nrvvfdGREQkJiZ+\n8803GidbzqBx+HvvvZeQkDBgwICEhIS//vWvusZWV1fffvvtkZGR559//t133338+HEtY898sNTV\n1d18881hYWFpaWl1dXV6h5/toJbhPfoI1cnox/Ly8sa
PH6/9i/Dll18GBwfv2bPn5MmTjz766IQJ\nEzQOb2trCwsL27lzZ1tb25YtW6KiojQONwzju+++Gzt2rPavSXt7e1hY2Ndff338+PHjx483Njbq\nnf/oo4/edtttJ06cWLJkydy5czVOPn6KBx98cNmyZRqHn3/++X/6059aWlpeeeWVCy64QNfYG2+8\nceXKlc3NzY2NjUuXLl28eHH3Z3p9sCxbtmzBggVNTU0LFixYvny53uG6Hp5nzunRR6he/TrQu3bt\nKiws1B6jd999d968eZ7b1dXVdrtd4/Dm5uY33nijvb29oaFh+/btcXFxGoe3t7enp6dv2bJF+9ek\nqqoqODh47NixwcHBM2bMcDqdeuePGTOmpKTEMIyGhoYPP/xQ73CPffv2XXvttW63W+PMuLi45557\n7tixY88///yoUaN0jQ0ODv7uu+88t48dO3bhhRd2f6bXB0tMTMzBgwcNwzh48GBMTIze4boenmfO\n6dFHqF79OtAePffPiNbW1nvuuefee+/VPtnzj1aLxbJ7926NY3NycjIyMowe+Jrs3bs3OTl57969\n3377rcPhuPXWW/XODwsLW7ZsWWho6NixY/ft26d3uGEYzc3N48ePLy0t1Tv2gw8+6Pi37AcffKBr\nbFJS0vLly+vq6pxO58KFCwMDA3VN7vSNYbPZTp48aRjGyZMnBwwYoHd4Fwd1De+5R6guBLqnAv32\n22+PGTNm2bJlep9zdXC5XGvXrr3yyit1Ddy1a9eUKVNaWlqMnvyflmEYlZWVoaGhemf6+/svXbq0\nsrIyOzv7qquu0jvcMIx169bdd9992sdOnTrVs+3MzMxrr71W19iKiorU1NTg4ODhw4fn5eUNGTJE\n1+RO3xhBQUGes1UnTpwICgrSO7yLg1qG9/QjVAsCrT9G7e3ty5cvnzRpUllZmd7JhmF88cUXS5Ys\n8dw+evSozWbTNTk7O7vT6xN/+9vfdA3/6KOPOp7s19bWakyGR2RkZGVlpWEYVVVVGr8mHq2trdHR\n0eXl5XrHGoZhs9mqqqoMw6itrQ0ODtY1tqamprm52XO7qKhoypQpuiZ3erCMGDHi0KFDhmEcOnTo\nkksu0Tu8i4PdHN6jj1C9+vu7OHrCnj17Xnvtte3bt0dFRblcLpfLpXF4VFTUhg0b3nvvPcMwXnnl\nlTFjxuiavGbNmo5vC6WUYRgTJ07UNfzEiRM//vGPDx482NLSsnr16vT0dF2TPa6//voXX3yxubn5\n2WefvfLKK/UO37Vr1wUXXDBixAi9Y5VS8fHxGzZscLlcL7300uWXX65r7NKlS3/5y182NDRUVVUt\nX7584cKFuiZ3cvPNN7/wwguGYbzwwgszZszooVW069FHqGa99X8GObR/EdasWdOjX+SioqIrrrgi\nNDT0mmuu8bxEo532Pbe3tz/zzDMXX3xxeHi4w+Gor6/XO7+qqmratGkDBw6cPHmy9qe6s2bNevjh\nh/XO9Dh48OCECROCg4MnTJig8T9lbW1tWlpaSEjIqFGjnn32WV1jjTO+Merq6lJTU4cOHXrzzTd3\nvCypa3gXB7s5vKcfoRrxobEAIBSnOABAKAINAEIRaAAQikADgFAEGgCEItAAIBSBBgChCDQACEWg\nAUAoAg0AQhFoABCKQAOAUAQaAIQi0AAgFIEGAKEINAAIRaABQCgCDQBCEWgAEIpAA95ZLJbe3gL6\nOwINnGbatGm9vQXgX/hUb+A0Fsu/HhQdN4DewjNo9BEWi2XFihWRkZGrVq36zW9+ExMTM3DgwHXr\n1imljh075nA4IiMjo6KiZs+efezYsY4/8sc//vHyyy+32+15eXlKqfT0dKVUQkKC5zesX78+Pj7e\nbrf/9re/7aW/Fvo1Ao2+47LLLtuxY8dDDz00aNCg/fv3v/rqq6tXr1ZK3X///YGBgZ9//vlnn30W\nGBiYkZHR8UeOHDl
SUlKyZcuWFStWKKW2bdumlCopKfH86smTJ/ft2/f222+vXLmyN/5C6O/4Rxz6\nCIvF0tzcHBAQ4Ofn53a7/f39DcPw8/MzDCM8PPzAgQODBw9WSjmdzvj4eKfT6fkjDQ0NAwYMUN7O\nbHj9VcBM/r29AUCbwMBAzw1/f391+tswOm5bLJa2traO457+nk3Xvwr0NE5xoO+74YYbsrOzm5qa\nGhsbs7OzU1NTu/79brfbnI0BXSPQ6Pvy8vIaGxsvuuii4cOHt7S0eF4PPJvU1NSLL77YtL0BXeDM\nGgAIxTNoABCKQAOAUAQaAIQi0AAgFIEGAKEINAAIRaABQCgCDQBCEWgAEIpAA4BQBBoAhCLQACAU\ngQYAoQg0AAhFoAFAKAINAEIRaAAQikADgFD/AzroUMh9WKTgAAAAAElFTkSuQmCC\n" } ], "prompt_number": 5 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Similarly, one would expect that \"hour of day\" would be a good feature to predict flight delays, as later flights in the day may present more delays due to density effects. Let's look at that:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "%%R\n", "\n", "# Extract hour of day from 3 or 4 digit time-of-day string\n", "get_hour <- function(x) { \n", " s = sprintf(\"%04d\", as.numeric(x))\n", " return(substr(s, 1, 2))\n", "}\n", "\n", "df2 = df1[, c('DepDelay', 'CRSDepTime'), with=F]\n", "df2$hour = as.numeric(sapply(df2$CRSDepTime, get_hour))\n", "df2$CRSDepTime <- NULL\n", "df2 <- ddply(df2, .(hour), summarise, mean_delay=mean(DepDelay))\n", "barplot(df2$mean_delay, names.arg=df2$hour, xlab=\"hour of day\", ylab=\"% of delays\", col=\"green\")" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "display_data", "png": 
"iVBORw0KGgoAAAANSUhEUgAAAeAAAAHgCAIAAADytinCAAAgAElEQVR4nO3df1iUZb748XskxpZB\ncAQM0MjII0pGWKklpkCiSS6aVsdTLtZirVqphaak7G4aIWspUp2uLPcqMXf74SpSWqGF+8W2Ukst\nVCTTzRIQA4VRfjPfP6ZYo5HDMzjPfAber+tc1xnGuef+4Dy9d3yYGQxWq1UBAOTp5uoBAAD2EWgA\nEIpAA4BQBBoAhCLQACAUgQYAoQg0AAhFoAFAKAINAEIRaAAQikADgFAEGgCEItAAIBSBBgChCDQA\nCEWgAUAoAg0AQhFoABCKQAOAUAQaAIQi0AAgFIEGAKEINAAIRaABQCgCDQBCEWgAEIpAA4BQBBoA\nhCLQACAUgQYAoQg0AAhFoAFAKAINAEIRaAAQikADgFAEGgCEItAAIBSBBgChCDQACEWgAUAoAg0A\nQhFoABCKQAOAUAQaAIQi0AAgFIEGAKEINAAIRaABQCgCDQBCEWgAEIpAA4BQBBoAhCLQACAUgQYA\noQg0AAhFoAFAKAINAEJd5uoBAHQJxcXF+/bt07TE09Nz4sSJBoPBSSPJZ7Bara6eAUDnd999920I\n2KDCtaz5k/ru8++uvPJKZ80kHs+gAejB19dXJSl1nZY1O1UXfwZJoAG01+uvv15aWqppSZ8+faZN\nm+akeTo9Ag2gvVasWFGYWahpSb8H+xFohxFoAO3Vr1+/wjHaAt23b18nDdMV8DI7ABCKQAOAUAQa\nAIQi0AAgFD8kBLqWM2fOaH1xsclkMhqNTpoHbSDQQBdSUlISHBys7tay5muVHJ/87LPPOmsmXByB\nBrqQhoYGNUup/9WypkB5vufprIHQJs5BA4BQBBoAhCLQACAUgQYAoQg0AAhFoAFAKAINAEIRaAAQ\nikADgFAEGgCEItAAIBSBBgCh+LAkANIdPXr08ccfDwwMbP+S0tLSpUuXXn/99c6bSgcEGoB0hw8f\n3tJni5qvZc0G9d+F/02gHVFQUDBy5EiXbA3ALYUoFarl9ppuLJVrzkHfeuutLtkXANyIToE2mUyG\nCyilWi4AAOzSKdCfffbZjTfe+Pbbb1utVtvvQ2u5AACwS6dADx48eMeOHW+88cb8+fMbGhr02RQA\n3Jp+PyT09fXduHFjRkZGXFycbpsCnU9jY2N1dbXWVWaz2RnDwKl0fRVHt27dUlJSbr755h07dui5\nL9CZrFmz5uE/Pqxitax5W33++edDhw511kxwDhe8zC4mJiYmJub/vFl9ff25c+d+fX2PHj0uu4yX\nb6PrMhgMKkupe7WsSVF1dXXOGghO47LS5efnx8TEtPFzwu3bt7/22mutrvzhhx9uu+22pUuXOnc4\nABDAZYGOjo5u+1Uc8fHx8fHxra7csGHD2bNnnTkXAEjBhyUBgFA6BbqysjIlJSUsLMzHx8dkMoWF\nhS1YsKCqqkqf3QHAHekU6MTERIvFkpOTU1ZWVl5enpub6+HhkZiYqM/uAOCOdDoHXVBQsHHjRqPR\naPtywIAB6enpISEh+uwOAO5Ip2fQUVFRycnJRUVFNTU1tbW1xcXFKSkpkZGR+uwOAO5Ip0BnZ2d7\neXklJCQEBAT4+fnFx8c3NDSsX79en90BwB3pdIrDbDZnZGRkZGTosx0AdAK8zA4AhCLQACAUgQYA\noQg0AAhFoAFAKAINAEIRaAAQikADgFAEGgCE4ndHAS4wcuTIXWqX+o2WNf9PWWvb+h0X6HwINOAC\nPXv2VG8o5atlza3OGgZicYoDAIQi0AAgFIEGAKEINAAIRaABQCgCDQBCEWgAEIpAA4BQBBoAhCLQ\nACAUgQYAoQg0AAhFoAFAKAINAELxcaOAgz799FOLxaJpSWhoaGhoqJPmQedDoAEH3XLLLWq5lgUn\nVezXsTt27HDWQOh0CDTgqJFKLdRy++9UYEqgs4ZBZ0SgAXRyB
oNBjdGy4Lwa4zUmLy/PWQO1G4EG\n0NmNVEpTbM+q3rN7O2sYLXgVBwAIRaABQCgCDQBCEWgAEIpAA4BQBBoAhCLQACAUgQYAoQg0AAhF\noAFAKAINAELxWRzouk6dOrVmzRpPT8/2L6mvr58+fXpISIjzpgJaEGh0Xbt37079IFXN0bJmg7rm\nmmvuvfdeZ80EXIBAo2u7Q6m7tdy+wVmDAL/GOWgAEIpAA4BQBBoAhCLQACAUgQYAoQg0AAhFoAFA\nKAINAEIRaAAQikADgFAEGgCEItAAIBSBBgChCDQACEWgAUAoAg0AQhFoABBKp0B/8MEHffr0ufba\naz/55JOhQ4d6e3vfeuuthw8f1md3AHBHOgV63rx5zz///PLly0eOHDl58uSjR4/eddddM2bM0Gd3\nAHBHOgX65MmTkyZNGj16tNVqnTt37hVXXDFnzpyDBw/qszsAuCOdAh0cHPzxxx/36NGjurray8tL\nKbVz587Q0FB9dgcAd6RToFeuXHnPPfds27bN29tbKbVgwYLJkyevXLlSn90BwB1dps8248ePLy8v\nb2xstH358MMPp6WlGY1GfXYHAHekU6CVUt26dWspcr9+/XTbFwDclH6BbiU/Pz8mJsZqtV7sBnl5\nee+8806rK7/55pthw4Y5eTQAEMFlgY6Ojm6jzkqpqKioa665ptWVW7Zs8fDwcOZcACCFywL9f/Ly\n8vr1yzx69+599uxZl8wDADrT6VUclZWVKSkpYWFhPj4+JpMpLCxswYIFVVVV+uwOAO5Ip0AnJiZa\nLJacnJyysrLy8vLc3FwPD4/ExER9dgcAd6TTKY6CgoKNGze2vIpjwIAB6enpISEh+uwOAO5Ip2fQ\nUVFRycnJRUVFNTU1tbW1xcXFKSkpkZGR+uwOAO5Ip0BnZ2d7eXklJCQEBAT4+fnFx8c3NDSsX79e\nn90BwB3pdIrDbDZnZGRkZGTosx0AdAJ8YD8ACEWgAUAoAg0AQhFoABBK7lu9gfZ47bXX3nrrLdvn\njLdTRUXFtm3bPD09nTcVcEkQaLi3goKCbX/Ypq7TsuZBdf78eV9fX2fNBFwiBBruzWg0qlClNP32\ntEBnDQNcWpyDBgChCDQACEWgAUAoAg0AQhFoABCKQAOAUAQaAIQi0AAgFIEGAKEINAAIxVu94XoL\nFy48c+aMpiVBQUF//vOfnTMOIAWBhutt3br165yvNS2JuDOCQKPTI9BwvZ49e2r7tCOlfHx8nDML\nIAiBxqWxb9++pqYmTUtCQ0PNZrOT5gE6AQKNS+DEiRNDxgxRM7Ss2acev/bx5557zlkzAe6PQOMS\nsFqt6h6llmtZU6CM7xmdNRDQKdh/md3q1avj4uIaGhri4uJ69er117/+VeexAAD2n0EvXbr0X//6\n1+bNm/39/Xft2hUbG/v73/9e58kAoIuz/wza09Ozrq7u9ddff+CBBzw8PBoaGnQeCwBgP9BpaWmj\nRo3y8PCIi4sbO3bssmXLdB4LAGD/FEdSUlJSUpLt8vHjx/UbBwDwM/vPoFeuXHny5EmdRwEAXMh+\noAsLCyMiIuLi4l5//fWqqiqdZwIAqIsFeu3atT/88MOcOXM+/PDD/v37T5069d13362vr9d5OADo\nyi76caPdu3cfNmxYdHR0eHj41q1bly5detVVV23atEnP4QCgK7Mf6FWrVo0aNeraa68tKChITk4+\nderU559//u67786cOVPn+QCgy7L/Ko6DBw8++eSTsbGxRuN/3owbERHx0ksv6TUYAHR19p9Bv/LK\nK7fffrutzo2NjdOnT1dKeXp6Tp48WdfpAKALs/8MesWKFUuWLGn5qeC4ceN0HAmu0dzcfPbsWU1L\nDAZDz549nTQPAPuBfvbZZz/99NPMzMzly5dv27atoqJC57Ggv1deeWXmipnqBi1r3lO7d+6+6aab\nnDUT0LXZD3RdXV1ERMSoU
aP27NnzwAMPREREzJ8/X+fJoLPm5ma1Wqk7tKxJUbW1tc4aCOjy7J+D\nDgkJWbVq1eDBg//2t78VFhaWlZXpPBYAwH6gly1blp2dPXTo0Pr6+qioqKeeekrnsQAA9k9xTJw4\nceLEiUqpd955R995AAA/ueg7CQEArvWLZ9AGg+Fit7Narc4fBgDwH78INBUGADk4xQEAQtkP9MmT\nJydMmNC7d+/vvvtu1qxZ58+f13ksAID9QC9cuHDYsGHl5eV+fn5HjhyZO3euzmMBAOwHeteuXbYo\nm0ymDRs25OTk6DsVAOAigbZYLN27d7dd7tGjh4eHh44jAQCUuligR48evWXLFqXUt99+O2fOnAkT\nJug7FQDgIoHOysrKzs42mUyjR4/29vZetWqVzmMBAOy/1TsoKCg3N1fnUQAAF+KdhJ3Kpk2bioqK\n2ngcf81kMj3yyCPOGwmAw+y/k3D16tX79u1bvny5UmrRokXR0dH6TwYHpKWl7U3Zq3y1rPkfRaAB\nmeyf4sjKyjpw4IDJZFJKvfDCC5GRkbZfSwjhAgMD1RilLdADnTUMgA6y/0PCM2fONDc32y43NTVV\nVlbqOBIAQKmLBXrs2LGzZs0qLS0tLS2dNWvW7bffrvNYAAD7gX7hhReUUtdee+11113n6emZlZWl\n71QAgIucg/bz81u/fr3OowAALsTHjQKAUAQaAIT6RaADAwNPnz6tlIqMjHTRPACAn/wi0DNmzBg4\ncKDBYNi/f7/hlzq4zZEjR6Kiovz9/R9++OHGxkallMVi6fjdAkAn9otAP/3006dPn7ZarRMnTrT+\nUge3SUpKmjx58qFDh5qamlJTUzt4bwDQFdg/B7158+ZLu81XX301e/bsgICAF1988YMPPjhy5Mil\nvX8A6HzsB7qiouL+++/v3bu3v7//9OnTO/5Owj59+uzfv18p5eHhkZGRMWPGjKampg7eJwB0bvYD\nPXfuXKPR+NVXXx08eNDT0/Oxxx7r4DZpaWlxcXEPPvigUiouLm7EiBHDhw/v4H0CQOdm/40qeXl5\nx48fv/zyy5VSzz//fGhoaAe3mTRpUmFh4bfffmv7Mj09/be//W1+fn4H7xYAOjH7gXaGkJCQkJAQ\n22WDwRAVFRUVFdXG7c+cOXP06NFWVx47dszb29tZIwKAJPYDHRcX9+ijj6alpSmlFi9eHBcXd8k3\nzs/Pj4mJaeP1IV988cWbb77Z6spvvvnm5ptvvuTDAIBA9gO9evXqefPmhYeHK6XGjx+/evXqS75x\ndHR026/ei42NjY2NbXXlhg0bzp49e8mHAQCB7Ae6V69e69at03kUAMCFdPosjsrKypSUlLCwMB8f\nH5PJFBYWtmDBgqqqKn12BwB3pFOgExMTLRZLTk5OWVlZeXl5bm6uh4dHYmKiPrsDgDvS6VUcBQUF\nGzduNBqNti8HDBiQnp7e8qIOAMCvtfUM+v3337/66quDgoLee++9Dm4TFRWVnJxcVFRUU1NTW1tb\nXFyckpLCZ+YBQBvaCvSMGTNWrly5devWuXPndnCb7OxsLy+vhISEgIAAPz+/+Pj4hoYGfmkLALSh\ndaD/9Kc/nTt37qc/6/bTn7b8hm+Hmc3mjIyMoqIii8Vy7ty54uLi5557ztfXt4N3CwCdWOtz0L/7\n3e8ee+yxKVOmjBs37tVXX33ooYdqa2vXrFnjkuEAoCtr/Qy6f//+L7/8cmlp6dy5cyMjI48fP15a\nWpqQkOCS4QCgK7PzKg6DwTB9+vTx48enpaUNGTJk+vTp/OoTANBf62fQmzdvDgwMvPrqq/fv3796\n9eqgoKAZM2YUFxe7ZDgA6MpaB/qxxx7btm3bypUrk5KSlFLjxo3Lysribd8AoL/Wgf712QyTybRs\n2TK95gEA/KT1OejMzMzx48f/5je/Wbt2rUsGAgDYtA50QkICr9kAAAl0+rAkAIBWBBoAhCL
QACAU\ngQYAoQg0AAil0wf2o/3i4uLMZnP7b19bWxsVFbVw4ULnjQTAJQi0ONt/3K7e0rLgO3V52uXOmgaA\n6xBoeUxKaXgCrVS18vT0dNYwAFyHc9AAIBSBBgChCDQACEWgAUAoAg0AQhFoABCKQAOAUAQaAIQi\n0AAgFIEGAKEINAAIxWdxXHpWq/XMmTNaV/n4+Hh4eDhjHgBuikBfejt27Ii7PU5N1rImV7217q27\n777bWTMBcEME+tKrq6tTTyu1SMual1RDQ4OzBgLgnjgHDQBCEWgAEIpAA4BQBBoAhCLQACAUgQYA\noQg0AAhFoAFAKAINAEIRaAAQikADgFAEGgCEItAAIBSBBgChCDQACEWgAUAoAg0AQhFoABCKQAOA\nUAQaAIQi0AAgFIEGAKEINAAIRaABQCgCDQBCXebqAYTKzMzcsGGDr69v+5eUlZXt2bPHaDQ6byoA\nXQqBtu/IkSO71+5W12lZc5+qqakh0AAuFU5xAIBQBBoAhCLQACAUgQYAoQg0AAhFoAFAKJ0CPXDg\nQIM9+uwOAO5Ip0AXFhYOHTo0NzfX+kv67A4A7kinQHt4eNx7770mk0mf7QCgE9DvnYTz5s3TbS8A\n6AT4ISEACOWyz+LIz8+PiYlp4zR0bm5uVlZWqytLS0vj4+OdPBoAiOCyQEdHR7f9Q8L4+PiRI0e2\nuvKdd95paGhw5lwAIIXcT7Pz8PAwm82trjSZTGfPnm3nPUydOrW5uVnTpmaz+eWXX9a0BACcRKdA\nV1ZW/uUvf/nHP/5RUlLS1NTUt2/fhISE1NRUHx8f52169OjRPR/u0bRkzD1jnDQMAGil0w8JExMT\nLRZLTk5OWVlZeXl5bm6uh4dHYmKiUze9/PLLlVlp+r/u3bs7dSQAaD+dnkEXFBRs3Lix5cPsBwwY\nkJ6eHhISos/uAOCOdHoGHRUVlZycXFRUVFNTU1tbW1xcnJKSEhkZqc/uAOCOdAp0dna2l5dXQkJC\nQECAn59ffHx8Q0PD+vXr9dkdANyRTqc4zGZzRkZGRkaGPtsBQCfAOwkBQCgCDQBCEWgAEIpAA4BQ\nBBoAhCLQACAUgQYAoQg0AAhFoAFAKAINAEIRaAAQikADgFAEGgCEItAAIBSBBgChCDQACEWgAUAo\nAg0AQhFoABCKQAOAUAQaAIQi0AAgFIEGAKEINAAIRaABQCgCDQBCEWgAEIpAA4BQBBoAhCLQACAU\ngQYAoQg0AAhFoAFAKAINAEIRaAAQikADgFAEGgCEItAAIBSBBgChCDQACEWgAUAoAg0AQhFoABCK\nQAOAUAQaAIQi0AAgFIEGAKEINAAIRaABQCgCDQBCEWgAEIpAA4BQBBoAhCLQACAUgQYAoQg0AAhF\noAFAKAINAEIRaAAQikADgFAEGgCEItAAIJROga6srExJSQkLC/Px8TGZTGFhYQsWLKiqqtJndwBw\nRzoFOjEx0WKx5OTklJWVlZeX5+bmenh4JCYm6rM7ALijy/TZpqCgYOPGjUaj0fblgAED0tPTQ0JC\n9NkdANyRTs+go6KikpOTi4qKampqamtri4uLU1JSIiMj9dkdANyRToHOzs728vJKSEgICAjw8/OL\nj49vaGhYv369PrsDgDvS6RSH2WzOyMjIyMjQZzsA6AR0CrQDSkpKCgsLW1359ddf+/n5tfMejh8/\nrrZr2/T777+3XbBYLGq7UmVaFu+74HK+UjdpWbtdqSk/X/5aaRv7O3X69GnbxdLSUrVdKV8tyw//\n9P8bGxvVdqW6a1n7qVJ3/Hx5j8axC1Rtba3toisfqQKNY29XaszPl137SPXWsragaz9ShzWuPfuf\nR8q1XBbo/Pz8mJgYq9V6sRucOHFi7969ra5sbm4eOHBgO7dYtGiRZa9F01RXzLvCduH+++//r13/\nZdxrbP/amrtrvL29lVI33HDDkzc+6bPXp/1rqwZW3Xz
zzbbLSx9fqmnfxsbGwbMH2y7Pnz//2JFj\n3bppOHPV7YmfbhwTE/PH03/02uvV/rWWkZbw8HClVO/evVNvTzXtNbV/ba2hduw9Y22XHXikAlID\nbBc6+EgtGbLEe693+9da+liGDx9uu7wseZnnXs/2r+3gI2VYYLBdiI2NXXJK29hVo6oGDRqkOvxI\nPfroo2f2nmn/WqWU3+KfnlE59kgFBQWpDj9Szzz+TLe9Gv6qm5ubB80a1P7bO4+hjUQCAFyIdxIC\ngFC8kxAAhOKdhAAglE7noM1mc1lZWcs7CZVSVqs1JCTkxIkTOuwOAO6IdxICgFC8kxAAhOJldgAg\nFC+zAwChCDQACEWgAUAoAg0AQhFoABCKQAOAUAQaAIQi0AAgFIEGAKEINAAIRaABQCgCDQBCda1A\nR0VFGX42c+ZMrcsbGxtnz54dEBAQFRX1ww8/aFpr+JX2r925c2dkZGSPHj0iIyP/+c9/atr31KlT\n06ZNCwoK6tu370MPPVRdXa1puWOampp+/bt97V7p7H3ff//98PDwnj17hoeHf/jhh3pu3cGDzeF9\nO3KkdWRr3Y60nJycwYMH9+zZc9SoUUeOHLnYPJ2Etctobm7u1avX999/X11dXV1dXVNTo/UeVqxY\ncd999507d27+/PlJSUma1lZfIDU1deHChe1f27dv37feequ+vv7NN9+88sorNe17xx13LFmypK6u\nrqam5oknnnj88cc1LXdAZmbmsGHDWh1adq909r5NTU29evXavn17U1PT22+/HRwcrNvWHT/YHNvX\n2rEjrSNb63Ok/fvf//b29v7kk0/Onz+/YsWKESNGXGyezqGzfT9tKCkp8fb2vvHGG729vSdOnFhW\nVqb1HoYMGbJv3z6r1VpVVbVnzx7Hxjhw4MBtt93W0NDQ/iXh4eGvvPJKRUXFq6++OmjQIE3beXt7\nnzlzxna5oqLiqquu0rTcAR999FFubm6r/1TsXunsfevq6t57773m5uaqqqotW7aEh4frtnXHDzbH\n9r2QA0daR7bW50j7+OOPZ8yYYbt86tQpPz+/i83TOXS276cNX375ZUxMzJdffvnjjz8mJiZOnTpV\n6z306tVr4cKFZrP5xhtvPHDggAMz1NXVDRs2rLCwUNOq3bt3t/yLZ/fu3ZrWRkdHL1q0qLKysqys\nbM6cOUajUdNyh9n9T0WH/35+vYXt39oGg2HXrl26bd3xg82xfVs4dqR1ZGudj7TGxsaZM2fOnj37\nYvN0Dp3t+2mnkydPms1mrasuu+yyJ5544uTJk4sXLx4+fLgD+z7zzDOPPPKI1lWxsbG2fRcsWHDb\nbbdpWnv8+PH4+Hhvb+/Q0NDMzMzAwECtuztGTqCtVqvFYklLS7vpppv039rq6MHWwX0dO9I6srWe\nR1peXt6QIUMWLlzY6t8HBNqN7d27t+U51OnTpx04gIKCgk6ePGm1WktKSkwmk9bljY2NISEhxcXF\nWheaTKaSkhKr1Xr69Glvb29Na8vLy+vq6myX8/PzR48erXV3x0gI9LFjx+bPn2+7XFpa6sBD5vDW\nHT/YHNvXxuEjrSNb63OkNTc3L1q06NZbby0qKmp7ns6hC72K49y5c3feeeehQ4fq6+uXLVs2adIk\nrfcwbty41157ra6ubs2aNTfddJPW5R999NGVV17Zv39/rQsjIiLWrl1rsVjWrVt3/fXXa1r7xBNP\n/OEPf6iqqiopKVm0aNGcOXO07u6+goOD165du3PnTqvV+uabbw4ZMkS3rTt+sHWEw0daR+hzpH3y\nySebNm3asmVLcHCwxWKxWCzO2EUQV/8vhH6am5tffPHFa665xt/fPzEx8ezZs1rvoaSkZMyYMb6+\nvqNGjXLg6cm999771FNPaV1ltVoPHTo0YsQIb2/vESNGHDp0SNPa06dPJyQk+Pj4DBo0aM2aNQ7s\n7hi7h5YOx1urLfL
z82+44Qaz2XzLLbdo/avryNYdP9gc29fG4SOtI1vrc6Q9/fTTbRSs8wWNXxoL\nAEJ1oVMcAOBeCDQACEWgAUAoAg0AQhFoABCKQAOAUAQaAIQi0AAgFIEGAKEINAAIRaABQCgCDQBC\nEWgAEIpAA4BQBBoAhCLQACAUgQYAoQg0AAhFoAFAKAIN6QwGgz4bpaamBgYGunwMoAW/NBbSGQw6\nHaX+/v6HDx/29/d37RhAC445SKdbGdveiEBDf5zigBt4/vnnIyIi/Pz8nnvuOaVURUVFYmJiUFBQ\ncHDw9OnTKyoqbDe78CxEy2WDwZCdnX3huQu7yydNmqSUioyMvPBm9913n5+fX//+/bOyslqu37Jl\nS2RkZM+ePYOCgp599lml1IMPPrhq1SrbnyYlJa1cudI5fw3ocgg03MD58+cPHDiQl5e3ZMkSpdS8\nefOMRuO333579OhRo9GYnJzc9vLPPvtsx44dLV/aXb5582al1L59+1puNnfuXKXUsWPH9u/f/8UX\nX7Rcn5qaOm3atB9//HHr1q2LFy9WSk2ZMmXTpk1Kqbq6upycnKlTp17C7x1dGf9qg3QGg6GqqqpH\njx7q5/MM/v7+Bw8e7N27t1KqrKwsIiKirKxM/fIsRMtlg8Fw6tSpgICAljtsz3KllJ+fX2Fhoe2p\nd2lpaVBQkO1Pm5ubd+/eXVhYuHPnznXr1lmt1vr6+uDg4MLCws8//zwrKysvL0+vvxt0cjyDhhuw\n1flCF57BaGpqavWn1dXVF355YZ3bs9ymW7duv769Uuqee+5ZvXp1QEBAenq67Rqj0XjHHXds2bLl\n73//+7Rp09rzHQHtQaDhfsaPH7948eLa2tqamprFixfHx8fbru/evfvHH39stVpfeuklB5a3Eh8f\nP3/+/Orq6nPnzqWkpLRcn5eXt3jx4gkTJrz//vtKqcbGRqXUXXfd9cYbb+Tl5d15552X7PtEl0eg\n4X4yMzNramr69esXGhpaX1+fmZlpu/7pp5+eMmVKRETEFVdc4cDyVlatWtXc3NyvX7+IiIjRo0e3\nXP/MM89ER0dfd911P/7447hx45KSkpRScXFxe/fujY2N9fHxuXTfKLo6zkEDl8bw4cNTU1MnTJjg\n6kHQeVzm6gEAt9fQ0PDVV1+dOHFi7Nixrp4FnQqnOICOys3NHT9+/Isvvmg0Gl09CzoVTnEAgFA8\ngwYAoQg0AAhFoAFAKAINAEIRaAAQikADgNYmxUYAAAAuSURBVFAEGgCEItAAIBSBBgChCDQACEWg\nAUAoAg0AQhFoABCKQAOAUAQaAIT6/9UDl9sgUkleAAAAAElFTkSuQmCC\n" } ], "prompt_number": 6 }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this demo we have not explored all the variables of course, just a couple to demonstrate R's capabilities in this area." ] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Pre-processing - iteration 1" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "After playing around with the data, and exploring some potential features -- we will now demonstrate how to use Scalding to preform some simple pre-processing on Hadoop, and create a feature matrix from the raw dataset. 
This is similar to the pre-processing shown in part 1 of the blog post (where we used Pig for this purpose) and part 2 (where we used Spark for this purpose).\n", "\n", "In our first iteration, we create the following features:\n", "\n", "* **month**: winter months should have more delays than summer months\n", "* **day of month**: this is likely not a very predictive variable, but let's keep it in anyway\n", "* **day of week**: weekend vs. weekday\n", "* **hour of the day**: later hours tend to have more delays\n", "* **Distance**: interesting to see if this variable is a good predictor of delay\n", "* **days_from_closest_holiday**: number of days from date of flight to closest US holiday\n", "\n", "Let's look at the Scalding code.\n", "Note that we write the code in the next cell using IPython's \"writefile\" magic command and then execute it later from that local file." ] }, { "cell_type": "code", "collapsed": false, "input": [ "%%writefile preprocess1.scala\n", "\n", "package com.hortonworks.datascience.demo1\n", "\n", "import com.twitter.scalding._\n", "import org.joda.time.format._\n", "import org.joda.time.{Days, DateTime}\n", "import com.hortonworks.datascience.demo1.ScaldingFlightDelays._\n", "\n", "/**\n", " * Pre-process flight delay data into feature matrix - iteration #1\n", " */\n", "class ScaldingFlightDelays(args: Args) extends Job(args) {\n", "\n", " val prepData = Csv(args(\"input\"), \",\", fields = inputSchema, skipHeader = true)\n", " .read\n", " .project(delaySchmea)\n", " // Keep only non-cancelled flights departing from ORD\n", " .filter(('Origin,'Cancelled)) { x:(String,String) => x._1 == \"ORD\" && x._2 == \"0\"}\n", " // Map the selected raw fields to the output feature fields\n", " .mapTo(featureSchmea -> outputSchema)(gen_features)\n", " .write(Csv(args(\"output\")))\n", "}\n", "\n", "object ScaldingFlightDelays {\n", " val inputSchema = List('Year, 'Month, 'DayofMonth, 'DayOfWeek, \n", " 'DepTime, 'CRSDepTime, 'ArrTime, 'CRSArrTime, \n", " 'UniqueCarrier, 'FlightNum, 'TailNum, \n", " 'ActualElapsedTime, 'CRSElapsedTime, 'AirTime, 'ArrDelay, \n", " 
'DepDelay, 'Origin, 'Dest, 'Distance, \n", " 'TaxiIn, 'TaxiOut, 'Cancelled, 'CancellationCode, \n", " 'Diverted, 'CarrierDelay, 'WeatherDelay, \n", " 'NASDelay, 'SecurityDelay, 'LateAircraftDelay)\n", " val delaySchmea = List('Year, 'Month, 'DayofMonth, 'DayOfWeek, \n", " 'CRSDepTime, 'DepDelay, 'Origin, 'Distance, 'Cancelled)\n", " val featureSchmea = List('Year, 'Month, 'DayofMonth, 'DayOfWeek, \n", " 'CRSDepTime, 'DepDelay, 'Distance)\n", " val outputSchema = List('flightDate,'y,'m,'dm,'dw,'crs,'dep,'dist)\n", "\n", " val holidays = List(\"01/01/2007\", \"01/15/2007\", \"02/19/2007\", \"05/28/2007\", \"06/07/2007\", \"07/04/2007\",\n", " \"09/03/2007\", \"10/08/2007\" ,\"11/11/2007\", \"11/22/2007\", \"12/25/2007\",\n", " \"01/01/2008\", \"01/21/2008\", \"02/18/2008\", \"05/22/2008\", \"05/26/2008\", \"07/04/2008\",\n", " \"09/01/2008\", \"10/13/2008\" ,\"11/11/2008\", \"11/27/2008\", \"12/25/2008\")\n", "\n", " def gen_features(tuple: (String,String,String,String,String,String,String)) = {\n", " val (year, month, dayOfMonth, dayOfWeek, crsDepTime, depDelay, distance) = tuple\n", " val date = to_date(year.toInt,month.toInt,dayOfMonth.toInt)\n", " val hour = get_hour(crsDepTime)\n", " val holidayDist = days_from_nearest_holiday(year.toInt,month.toInt,dayOfMonth.toInt)\n", "\n", " (date,depDelay,month,dayOfMonth,dayOfWeek,hour,distance,holidayDist.toString)\n", " }\n", "\n", " def get_hour(depTime: String) = \"%04d\".format(depTime.toInt).take(2)\n", "\n", " def to_date(year: Int, month: Int, day: Int) = \"%04d%02d%02d\".format(year, month, day)\n", "\n", " def days_from_nearest_holiday(year:Int, month:Int, day:Int) = {\n", " val sampleDate = new DateTime(year, month, day, 0, 0)\n", "\n", " holidays.foldLeft(3000) { (r, c) =>\n", " val holiday = DateTimeFormat.forPattern(\"MM/dd/yyyy\").parseDateTime(c)\n", " val distance = Math.abs(Days.daysBetween(holiday, sampleDate).getDays)\n", " math.min(r, distance)\n", " }\n", " }\n", "}\n" ], "language": "python", 
"metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "Overwriting preprocess1.scala\n" ] } ], "prompt_number": 8 }, { "cell_type": "markdown", "metadata": {}, "source": [ "We execute this Scalding code using the standard \"scald.rb\" script, generating the feature matrix for both the 2007 and 2008 datasets. Note that we use IPython's \"%%capture\" magic command to capture the output of the Scalding script and display only its error stream (stderr), if any." ] }, { "cell_type": "code", "collapsed": false, "input": [ "%%capture capt2007_1\n", "!/home/demo/scalding/scripts/scald.rb --hdfs preprocess1.scala --input \"airline/delay/2007.csv\" --output \"airline/fm/ord_2007_sc_1\"" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 1 }, { "cell_type": "code", "collapsed": false, "input": [ "capt2007_1.stderr" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 6, "text": [ "''" ] } ], "prompt_number": 6 }, { "cell_type": "code", "collapsed": false, "input": [ "%%capture capt2008_1\n", "!/home/demo/scalding/scripts/scald.rb --hdfs preprocess1.scala --input \"airline/delay/2008.csv\" --output \"airline/fm/ord_2008_sc_1\"" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 3 }, { "cell_type": "code", "collapsed": false, "input": [ "capt2008_1.stderr" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 4, "text": [ "''" ] } ], "prompt_number": 4 }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Modeling - iteration 1" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we've generated our feature matrix using Scalding and Hadoop, let's use R to build a model that predicts airline delays. 
First we prepare our training set (using the 2007 data) and test set (using the 2008 data):" ] }, { "cell_type": "code", "collapsed": false, "input": [ "%%R\n", "\n", "# Function to compute precision, recall, F1-measure and accuracy\n", "get_metrics <- function(predicted, actual) {\n", " tp = length(which(predicted == TRUE & actual == TRUE))\n", " tn = length(which(predicted == FALSE & actual == FALSE))\n", " fp = length(which(predicted == TRUE & actual == FALSE))\n", " fn = length(which(predicted == FALSE & actual == TRUE))\n", "\n", " precision = tp / (tp+fp)\n", " recall = tp / (tp+fn)\n", " F1 = 2*precision*recall / (precision+recall)\n", " accuracy = (tp+tn) / (tp+tn+fp+fn)\n", " \n", " v = c(precision, recall, F1, accuracy)\n", " v\n", "}\n", "\n", "# Read a feature matrix from HDFS and convert each column to its modeling type\n", "process_dataset <- function(filename) {\n", " cols = c('date', 'delay', 'month', 'day', 'dow', 'hour', 'distance', 'days_from_holiday')\n", " \n", " data = read_csv_from_hdfs(filename, cols)\n", " data$delay = as.factor(as.numeric(data$delay) >= 15)\n", " data$month = as.factor(data$month)\n", " data$day = as.factor(data$day)\n", " data$dow = as.factor(data$dow)\n", " data$hour = as.numeric(data$hour)\n", " data$distance = as.numeric(data$distance)\n", " data$days_from_holiday = as.numeric(data$days_from_holiday)\n", " data\n", "}\n", "\n", "# Prepare training set and test/validation set\n", "\n", "data_2007 = process_dataset('/user/demo/airline/fm/ord_2007_sc_1')\n", "data_2008 = process_dataset('/user/demo/airline/fm/ord_2008_sc_1')\n", "\n", "fcols = setdiff(names(data_2007), c('date', 'delay'))\n", "train_x = data_2007[,fcols, with=FALSE]\n", "train_y = data_2007$delay\n", "\n", "test_x = data_2008[,fcols, with=FALSE]\n", "test_y = data_2008$delay" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 11 }, { "cell_type": "markdown", "metadata": {}, "source": [ "The *process_dataset* function reads the data from HDFS into an R data frame and converts each column to a factor or numeric type, labeling a flight as delayed when its departure delay is 15 minutes or more. 
We use it first to read the 2007 feature matrix into the *data_2007* data frame (used as the training set), and then similarly to build *data_2008* as the test set. \n", "\n", "In this cell, we also define a helper function *get_metrics* that we will use later to measure precision, recall, F1 and accuracy.\n", "\n", "Now let's run R's random forest algorithm and evaluate the results:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "%%R\n", "\n", "rf.model = randomForest(train_x, train_y, ntree=40)\n", "rf.pr <- predict(rf.model, newdata=test_x)\n", "m.rf = get_metrics(as.logical(rf.pr), as.logical(test_y))\n", "print(sprintf(\"Random Forest: precision=%0.2f, recall=%0.2f, F1=%0.2f, accuracy=%0.2f\", m.rf[1], m.rf[2], m.rf[3], m.rf[4]))" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "display_data", "text": [ "[1] \"Random Forest: precision=0.43, recall=0.30, F1=0.35, accuracy=0.69\"\n" ] } ], "prompt_number": 12 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's also try R's Gradient Boosted Machines (GBM) model. \n", "GBM is an ensemble method that, like random forest, is typically robust to over-fitting." 
] }, { "cell_type": "code", "collapsed": false, "input": [ "%%R\n", "\n", "gbm.model <- gbm.fit(train_x, as.numeric(train_y)-1, n.trees=500, verbose=FALSE, shrinkage=0.01, distribution=\"bernoulli\", \n", " interaction.depth=3, n.minobsinnode=30)\n", "gbm.pr <- predict(gbm.model, newdata=test_x, n.trees=500, type=\"response\")\n", "m.gbm = get_metrics(gbm.pr >= 0.5, as.logical(test_y))\n", "print(sprintf(\"Gradient Boosted Machines: precision=%0.2f, recall=%0.2f, F1=%0.2f, accuracy=%0.2f\", m.gbm[1], m.gbm[2], m.gbm[3], m.gbm[4]))" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "display_data", "text": [ "[1] \"Gradient Boosted Machines: precision=0.53, recall=0.10, F1=0.17, accuracy=0.72\"\n" ] } ], "prompt_number": 13 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Using Random Forest and Gradient Boosted Machines with a simple set of features, we get a reasonable baseline: accuracy of 0.69 and 0.72 respectively, although recall is low (0.30 and 0.10). Following the same iterative approach as in the previous parts of this blog post, we now use additional data sources to enrich our core dataset and create new features that should help us achieve better predictive performance. " ] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Pre-processing - iteration 2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As we demonstrated in parts 1 and 2 of this blog post, one way to improve accuracy for this model is to layer in weather data and use it to add more informative features. We can get this data from a publicly available dataset here: http://www.ncdc.noaa.gov/cdo-web/datasets/\n", "\n", "We now add these additional weather-related features to our model: daily temperatures (min/max), wind speed, snow conditions and precipitation at the flight's origin airport (ORD). 
\n", "\n", "So let's re-write our Scalding program to add these new features to our feature matrix:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "%%writefile preprocess2.scala\n", "\n", "package com.hortonworks.datascience.demo2\n", "\n", "import com.twitter.scalding._\n", "import org.joda.time.format._\n", "import org.joda.time.{Days, DateTime}\n", "import com.hortonworks.datascience.demo2.ScaldingFlightDelays._\n", "\n", "/**\n", " * Pre-process flight and weather data into feature matrix - iteration #2\n", " */\n", "class ScaldingFlightDelays(args: Args) extends Job(args) {\n", "\n", " // Flight features for non-cancelled departures from ORD\n", " val delayData = Csv(args(\"delay\"), \",\", fields = delayInSchema, skipHeader = true)\n", " .read\n", " .project(delaySchema)\n", " .filter(('Origin,'Cancelled)) { x:(String,String) => x._1 == \"ORD\" && x._2 == \"0\"}\n", " .mapTo(filterSchema -> featureSchmea)(gen_features)\n", "\n", " // Daily weather measures for a single station, folded into one Map per date\n", " val weatherData = Csv(args(\"weather\"),\",\", fields = weatherInSchema)\n", " .read\n", " .project(weatherSchema)\n", " .filter('Station){x:String => x == \"USW00094846\"}\n", " .filter('Metric){m:String => m == \"TMIN\" || m == \"TMAX\" || m == \"PRCP\" || m == \"SNOW\" || m == \"AWND\"}\n", " .mapTo(weatherSchema -> ('Date,'MM)){tuple:(String,String,String,String) => (tuple._2,tuple._3+\":\"+tuple._4)}\n", " .groupBy('Date){_.foldLeft('MM -> 'Measures)(Map[String,Double]()){\n", " (m,s:String) => {val kv = s.split(\":\"); m + (kv(0) -> kv(1).toDouble)}\n", " }\n", " }\n", "\n", " // Join flights with weather on the flight date and flatten the weather Map into columns\n", " delayData.joinWithSmaller(('flightDate,'Date),weatherData)\n", " .project('delay,'m,'dm,'dw,'h,'dist,'holiday,'Measures)\n", " .mapTo(joinSchema -> outputSchema){x:(Double,Double,Double,Double,Double,Double,Double,Map[String,Double]) => {\n", " (x._1, x._2, x._3, x._4, x._5, x._6, x._7, x._8(\"TMIN\"), x._8(\"TMAX\"), x._8(\"PRCP\"), x._8(\"SNOW\"), x._8(\"AWND\"))\n", " }\n", " }\n", " .write(Csv(args(\"output\"),\",\"))\n", "}\n", "\n", "object ScaldingFlightDelays {\n", " val delayInSchema = List('Year, 
'Month, 'DayofMonth, 'DayOfWeek,\n", " 'DepTime, 'CRSDepTime, 'ArrTime, 'CRSArrTime,\n", " 'UniqueCarrier, 'FlightNum, 'TailNum,\n", " 'ActualElapsedTime, 'CRSElapsedTime, 'AirTime,\n", " 'ArrDelay, 'DepDelay, 'Origin, 'Dest,\n", " 'Distance, 'TaxiIn, 'TaxiOut,\n", " 'Cancelled, 'CancellationCode, 'Diverted,\n", " 'CarrierDelay, 'WeatherDelay, 'NASDelay,\n", " 'SecurityDelay, 'LateAircraftDelay)\n", " val weatherInSchema = List('Station, 'Date, 'Metric, 'Measure, 'v1, 'v2, 'v3, 'v4)\n", " val delaySchema = List('Year, 'Month, 'DayofMonth, 'DayOfWeek,\n", " 'CRSDepTime, 'DepDelay, 'Origin, 'Distance, 'Cancelled);\n", " val weatherSchema = List('Station, 'Date, 'Metric, 'Measure)\n", " val filterSchema = List('Year, 'Month, 'DayofMonth, 'DayOfWeek, 'CRSDepTime, 'DepDelay, 'Distance)\n", " val featureSchmea = List('flightDate,'delay,'m,'dm,'dw,'h,'dist,'holiday);\n", " val joinSchema = List('delay,'m,'dm,'dw,'h,'dist,'holiday,'Measures)\n", " val outputSchema = List('delay,'m,'dm,'dw,'h,'dist,'holiday,'tmin,'tmax,'prcp,'snow,'awnd)\n", "\n", " val holidays = List(\"01/01/2007\", \"01/15/2007\", \"02/19/2007\", \"05/28/2007\", \"06/07/2007\", \"07/04/2007\",\n", " \"09/03/2007\", \"10/08/2007\" ,\"11/11/2007\", \"11/22/2007\", \"12/25/2007\",\n", " \"01/01/2008\", \"01/21/2008\", \"02/18/2008\", \"05/22/2008\", \"05/26/2008\", \"07/04/2008\",\n", " \"09/01/2008\", \"10/13/2008\" ,\"11/11/2008\", \"11/27/2008\", \"12/25/2008\")\n", "\n", " def gen_features(tuple: (String,String,String,String,String,String,String)) = {\n", " val (year, month,dayOfMonth,dayOfWeek:String, crsDepTime:String,depDelay:String,distance:String) = tuple\n", " val date = to_date(year.toInt,month.toInt,dayOfMonth.toInt)\n", " val hour = get_hour(crsDepTime)\n", " val holidayDist = days_from_nearest_holiday(year.toInt,month.toInt,dayOfMonth.toInt)\n", "\n", " (date,depDelay,month,dayOfMonth,dayOfWeek,hour,distance,holidayDist.toString)\n", " }\n", "\n", " def get_hour(depTime: String) = 
\"%04d\".format(depTime.toInt).take(2)\n", "\n", " def to_date(year: Int, month: Int, day: Int) = \"%04d%02d%02d\".format(year, month, day)\n", "\n", " def days_from_nearest_holiday(year:Int, month:Int, day:Int) = {\n", " val sampleDate = new DateTime(year, month, day, 0, 0)\n", "\n", " holidays.foldLeft(3000) { (r, c) =>\n", " val holiday = DateTimeFormat.forPattern(\"MM/dd/yyyy\").parseDateTime(c)\n", " val distance = Math.abs(Days.daysBetween(holiday, sampleDate).getDays)\n", " math.min(r, distance)\n", " }\n", " }\n", "}\n" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "Overwriting preprocess2.scala\n" ] } ], "prompt_number": 14 }, { "cell_type": "code", "collapsed": false, "input": [ "%%capture capt2007_2\n", "!/home/demo/scalding/scripts/scald.rb --hdfs preprocess2.scala --delay \"airline/delay/2007.csv\" --weather \"airline/weather/2007.csv\" --output \"airline/fm/ord_2007_sc_2\"" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 9 }, { "cell_type": "code", "collapsed": false, "input": [ "capt2007_2.stderr" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 10, "text": [ "''" ] } ], "prompt_number": 10 }, { "cell_type": "code", "collapsed": false, "input": [ "%%capture capt2008_2\n", "!/home/demo/scalding/scripts/scald.rb --hdfs preprocess2.scala --delay \"airline/delay/2008.csv\" --weather \"airline/weather/2008.csv\" --output \"airline/fm/ord_2008_sc_2\"" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 11 }, { "cell_type": "code", "collapsed": false, "input": [ "capt2008_2.stderr" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 12, "text": [ "''" ] } ], "prompt_number": 12 }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Modeling - iteration 2" ] }, { "cell_type": "markdown", "metadata": {}, 
"source": [ "Now let's re-build the Random Forest and Gradient Boosted Machines models with the enhanced feature matrices.\n", "As before, we first prepare our training set and test set:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "%%R\n", "\n", "# Read input files\n", "process_dataset <- function(filename) {\n", " cols = c('delay', 'month', 'day', 'dow', 'hour', 'distance', 'days_from_holiday', \n", " 'tmin', 'tmax', 'prcp', 'snow', 'awnd')\n", " \n", " data = read_csv_from_hdfs(filename, cols)\n", " data$delay = as.factor(as.numeric(data$delay) >= 15)\n", " data$month = as.factor(data$month)\n", " data$day = as.factor(data$day)\n", " data$dow = as.factor(data$dow)\n", " data$hour = as.numeric(data$hour)\n", " data$distance = as.numeric(data$distance)\n", " data$days_from_holiday = as.numeric(data$days_from_holiday)\n", " data$tmin = as.numeric(data$tmin)\n", " data$tmax = as.numeric(data$tmax)\n", " data$prcp = as.numeric(data$prcp)\n", " data$snow = as.numeric(data$snow)\n", " data$awnd = as.numeric(data$awnd)\n", " data\n", "}\n", "\n", "# Prepare training set and test/validation set\n", "\n", "data_2007 = process_dataset('/user/demo/airline/fm/ord_2007_sc_2')\n", "data_2008 = process_dataset('/user/demo/airline/fm/ord_2008_sc_2')\n", "\n", "fcols = setdiff(names(data_2007), c('delay'))\n", "train_x = data_2007[,fcols]\n", "train_y = data_2007$delay\n", "\n", "test_x = data_2008[,fcols]\n", "test_y = data_2008$delay" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 7 }, { "cell_type": "markdown", "metadata": {}, "source": [ "And now to build a Random Forest model:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "%%R\n", "\n", "rf.model = randomForest(train_x, train_y, ntree=40)\n", "rf.pr <- predict(rf.model, newdata=test_x)\n", "m.rf = get_metrics(as.logical(rf.pr), as.logical(test_y))\n", "print(sprintf(\"Random Forest: precision=%0.2f, recall=%0.2f, F1=%0.2f, accuracy=%0.2f\", \n", " m.rf[1], m.rf[2], m.rf[3], 
m.rf[4]))" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "display_data", "text": [ "[1] \"Random Forest: precision=0.58, recall=0.38, F1=0.45, accuracy=0.74\"\n" ] } ], "prompt_number": 8 }, { "cell_type": "markdown", "metadata": {}, "source": [ "And the Gradient Boosted Machines model:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "%%R\n", "\n", "gbm.model <- gbm.fit(train_x, as.numeric(train_y)-1, n.trees=500, verbose=FALSE, shrinkage=0.01, distribution=\"bernoulli\", \n", " interaction.depth=3, n.minobsinnode=30)\n", "gbm.pr <- predict(gbm.model, newdata=test_x, n.trees=500, type=\"response\")\n", "m.gbm = get_metrics(gbm.pr >= 0.5, as.logical(test_y))\n", "print(sprintf(\"Gradient Boosted Machines: precision=%0.2f, recall=%0.2f, F1=%0.2f, accuracy=%0.2f\", \n", " m.gbm[1], m.gbm[2], m.gbm[3], m.gbm[4]))" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "display_data", "text": [ "[1] \"Gradient Boosted Machines: precision=0.63, recall=0.27, F1=0.38, accuracy=0.75\"\n" ] } ], "prompt_number": 9 }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Summary" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this 3rd part of our blog post we used an IPython notebook to demonstrate how to build a predictive model for airline delays. We used Scalding on our HDP cluster to perform data pre-processing and feature engineering. We then applied two machine learning algorithms, random forest and gradient boosted machines, to the resulting datasets and showed how adding weather features in a second iteration improved model performance: F1 rose from 0.35 to 0.45 for random forest and from 0.17 to 0.38 for gradient boosted machines." ] } ], "metadata": {} } ] }