{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Calculating MAE" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This notebook shows how to calculate the mean absolute error (MAE) with `scikit-learn` for a regression model built with `catboost`.\n", "\n", "It builds and evaluates a model that predicts arrival delay for flights departing from NYC airports in 2013." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Packages" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This tutorial uses:\n", "* [pandas](https://pandas.pydata.org/docs/)\n", "* [statsmodels](https://www.statsmodels.org/stable/index.html)\n", " * [statsmodels.api](https://www.statsmodels.org/stable/api.html#statsmodels-api)\n", "* [numpy](https://numpy.org/doc/stable/)\n", "* [scikit-learn](https://scikit-learn.org/stable/)\n", " * [sklearn.metrics](https://scikit-learn.org/stable/modules/model_evaluation.html)\n", " * [sklearn.model_selection](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)\n", "* [catboost](https://catboost.ai/docs)" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import statsmodels.api as sm\n", "import pandas as pd\n", "import numpy as np\n", "from sklearn.metrics import mean_absolute_error\n", "from sklearn.model_selection import train_test_split\n", "\n", "from catboost import CatBoostRegressor, Pool" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Reading the data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The data comes from `rdatasets` and is loaded with the Python package `statsmodels`." 
] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "<class 'pandas.core.frame.DataFrame'>\n", "RangeIndex: 336776 entries, 0 to 336775\n", "Data columns (total 19 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 year 336776 non-null int64 \n", " 1 month 336776 non-null int64 \n", " 2 day 336776 non-null int64 \n", " 3 dep_time 328521 non-null float64\n", " 4 sched_dep_time 336776 non-null int64 \n", " 5 dep_delay 328521 non-null float64\n", " 6 arr_time 328063 non-null float64\n", " 7 sched_arr_time 336776 non-null int64 \n", " 8 arr_delay 327346 non-null float64\n", " 9 carrier 336776 non-null object \n", " 10 flight 336776 non-null int64 \n", " 11 tailnum 334264 non-null object \n", " 12 origin 336776 non-null object \n", " 13 dest 336776 non-null object \n", " 14 air_time 327346 non-null float64\n", " 15 distance 336776 non-null int64 \n", " 16 hour 336776 non-null int64 \n", " 17 minute 336776 non-null int64 \n", " 18 time_hour 336776 non-null object \n", "dtypes: float64(5), int64(9), object(5)\n", "memory usage: 48.8+ MB\n" ] } ], "source": [ "df = sm.datasets.get_rdataset('flights', 'nycflights13').data\n", "df.info()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Feature engineering" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Handle null values" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "year 0\n", "month 0\n", "day 0\n", "dep_time 8255\n", "sched_dep_time 0\n", "dep_delay 8255\n", "arr_time 8713\n", "sched_arr_time 0\n", "arr_delay 9430\n", "carrier 0\n", "flight 0\n", "tailnum 2512\n", "origin 0\n", "dest 0\n", "air_time 9430\n", "distance 0\n", "hour 0\n", "minute 0\n", "time_hour 0\n", "dtype: int64" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.isnull().sum()" ] }, { "cell_type": "markdown", "metadata": {}, 
"source": [ "The null values come from flights that were cancelled or diverted. Because this model predicts arrival delay, these rows can be excluded from the analysis." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "df.dropna(inplace=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Convert the times from floats or ints to hours and minutes" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "# Times are encoded as HHMM numbers: integer division by 100 gives the hour, the remainder gives the minute.\n", "df['arr_hour'] = (df.arr_time // 100).astype(int)\n", "df['arr_minute'] = (df.arr_time % 100).astype(int)\n", "df['sched_arr_hour'] = (df.sched_arr_time // 100).astype(int)\n", "df['sched_arr_minute'] = (df.sched_arr_time % 100).astype(int)\n", "df['sched_dep_hour'] = (df.sched_dep_time // 100).astype(int)\n", "df['sched_dep_minute'] = (df.sched_dep_time % 100).astype(int)\n", "df.rename(columns={'hour': 'dep_hour', 'minute': 'dep_minute'}, inplace=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Prepare data for modeling" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Set up train-test split" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "target = 'arr_delay'\n", "y = df[target]\n", "X = df.drop(columns=[target, 'flight', 'tailnum', 'time_hour', 'year', 'dep_time', 'sched_dep_time', 'arr_time', 'sched_arr_time', 'dep_delay'])\n", "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=1066)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Fit the model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Build the regression model." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 7, "metadata": {}, 
"output_type": "execute_result" } ], "source": [ "# CatBoost expects the names (or indices) of the categorical columns, not a DataFrame\n", "categorical_features = X_train.select_dtypes(exclude=[np.number]).columns.tolist()\n", "\n", "train_pool = Pool(X_train, y_train, cat_features=categorical_features)\n", "test_pool = Pool(X_test, y_test, cat_features=categorical_features)\n", "\n", "model = CatBoostRegressor(iterations=500, max_depth=5, learning_rate=0.05, random_seed=1066, logging_level='Silent')\n", "# The pools already carry the labels and the categorical feature names\n", "model.fit(train_pool, eval_set=test_pool, use_best_model=True, early_stopping_rounds=10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Using `mean_absolute_error` from `scikit-learn`, calculate the MAE on the test set. The result is in minutes, the unit of `arr_delay`." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "6.5178775173583325" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mae = mean_absolute_error(y_test, model.predict(X_test))\n", "mae" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.10" } }, "nbformat": 4, "nbformat_minor": 4 }