{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Calculating MAE"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This notebook explains how to calculate MAE from `scikit-learn` on a regression model from `catboost`.\n",
    "\n",
    "This notebook will build and evaluate a model to predict arrival delay for flights in and out of NYC in 2013.  "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Packages"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This tutorial uses:\n",
    "* [pandas](https://pandas.pydata.org/docs/)\n",
    "* [statsmodels](https://www.statsmodels.org/stable/index.html)\n",
    "    * [statsmodels.api](https://www.statsmodels.org/stable/api.html#statsmodels-api)\n",
    "* [numpy](https://numpy.org/doc/stable/)\n",
    "* [scikit-learn](https://scikit-learn.org/stable/)\n",
    "    * [sklearn.metrics](https://scikit-learn.org/stable/modules/model_evaluation.html)\n",
    "    * [sklearn.model_selection](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)\n",
    "* [catboost](https://catboost.ai/docs)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import statsmodels.api as sm\n",
    "import pandas as pd\n",
    "import numpy as np\n",
    "from sklearn.metrics import mean_absolute_error\n",
    "from sklearn.model_selection import train_test_split\n",
    "\n",
    "from catboost import CatBoostRegressor, Pool"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Reading the data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The data is from `rdatasets` imported using the Python package `statsmodels`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<class 'pandas.core.frame.DataFrame'>\n",
      "RangeIndex: 336776 entries, 0 to 336775\n",
      "Data columns (total 19 columns):\n",
      " #   Column          Non-Null Count   Dtype  \n",
      "---  ------          --------------   -----  \n",
      " 0   year            336776 non-null  int64  \n",
      " 1   month           336776 non-null  int64  \n",
      " 2   day             336776 non-null  int64  \n",
      " 3   dep_time        328521 non-null  float64\n",
      " 4   sched_dep_time  336776 non-null  int64  \n",
      " 5   dep_delay       328521 non-null  float64\n",
      " 6   arr_time        328063 non-null  float64\n",
      " 7   sched_arr_time  336776 non-null  int64  \n",
      " 8   arr_delay       327346 non-null  float64\n",
      " 9   carrier         336776 non-null  object \n",
      " 10  flight          336776 non-null  int64  \n",
      " 11  tailnum         334264 non-null  object \n",
      " 12  origin          336776 non-null  object \n",
      " 13  dest            336776 non-null  object \n",
      " 14  air_time        327346 non-null  float64\n",
      " 15  distance        336776 non-null  int64  \n",
      " 16  hour            336776 non-null  int64  \n",
      " 17  minute          336776 non-null  int64  \n",
      " 18  time_hour       336776 non-null  object \n",
      "dtypes: float64(5), int64(9), object(5)\n",
      "memory usage: 48.8+ MB\n"
     ]
    }
   ],
   "source": [
    "df = sm.datasets.get_rdataset('flights', 'nycflights13').data\n",
    "df.info()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Feature Engineering"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Handle null values"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "year                 0\n",
       "month                0\n",
       "day                  0\n",
       "dep_time          8255\n",
       "sched_dep_time       0\n",
       "dep_delay         8255\n",
       "arr_time          8713\n",
       "sched_arr_time       0\n",
       "arr_delay         9430\n",
       "carrier              0\n",
       "flight               0\n",
       "tailnum           2512\n",
       "origin               0\n",
       "dest                 0\n",
       "air_time          9430\n",
       "distance             0\n",
       "hour                 0\n",
       "minute               0\n",
       "time_hour            0\n",
       "dtype: int64"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.isnull().sum()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As this model will predict arrival delay, the `Null` values are caused by flights did were cancelled or diverted. These can be excluded from this analysis."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.dropna(inplace=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Convert the times from floats or ints to hour and minutes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "df['arr_hour'] = df.arr_time.apply(lambda x: int(np.floor(x/100)))\n",
    "df['arr_minute'] = df.arr_time.apply(lambda x: int(x - np.floor(x/100)*100))\n",
    "df['sched_arr_hour'] = df.sched_arr_time.apply(lambda x: int(np.floor(x/100)))\n",
    "df['sched_arr_minute'] = df.sched_arr_time.apply(lambda x: int(x - np.floor(x/100)*100))\n",
    "df['sched_dep_hour'] = df.sched_dep_time.apply(lambda x: int(np.floor(x/100)))\n",
    "df['sched_dep_minute'] = df.sched_dep_time.apply(lambda x: int(x - np.floor(x/100)*100))\n",
    "df.rename(columns={'hour': 'dep_hour',\n",
    "                   'minute': 'dep_minute'}, inplace=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Prepare data for modeling"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Set up train-test split"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "target = 'arr_delay'\n",
    "y = df[target]\n",
    "X = df.drop(columns=[target, 'flight', 'tailnum', 'time_hour', 'year', 'dep_time', 'sched_dep_time', 'arr_time', 'sched_arr_time', 'dep_delay'])\n",
    "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=1066)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Fit the model"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Build the regression model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<catboost.core.CatBoostRegressor at 0x13ccf5a50>"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "categorical_features = X_train.select_dtypes(exclude=[np.number])\n",
    "\n",
    "train_pool = Pool(X_train, y_train, categorical_features)\n",
    "test_pool = Pool(X_test, y_test, categorical_features)\n",
    "\n",
    "model = CatBoostRegressor(iterations=500, max_depth=5, learning_rate=0.05, random_seed=1066, logging_level='Silent')\n",
    "model.fit(X_train, y_train, eval_set=test_pool, cat_features=categorical_features, use_best_model=True, early_stopping_rounds=10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Using `mean_absolute_error` from `scikit-learn`, calculate the MAE."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "6.5178775173583325"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "mae = mean_absolute_error(y_test, model.predict(X_test))\n",
    "mae"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}